# **INDIAN STARTUP ECOSYSTEM**

## Project Description
We embark on a journey of discovery as we leverage our data analysis expertise to uncover the untapped potential within the Indian startup ecosystem. This project is designed to not only decode the numbers but to distill insights that will guide our team towards a successful foray into this dynamic market.

## Scope of Work

- Conduct a thorough exploration of the datasets, dissecting funding patterns, sectoral nuances, and geographical hotspots in the Indian startup landscape.
- Analyze funding received by startups in India from 2018 to 2021.



## Hypothesis 
**Null Hypothesis (H₀)**: There is no significant difference in the average funding received by startups across locations.

**Alternative Hypothesis (H₁)**: There is a significant difference in the average funding received by startups across locations.

**Null Hypothesis (H₀)**: There is no significant relationship between funding and sector.

**Alternative Hypothesis (H₁)**: There is a significant relationship between funding and sector.
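
One way to test the first pair of hypotheses is a one-way ANOVA, which compares the variation between location groups with the variation inside each group. The sketch below computes the F statistic by hand on made-up numbers. The function name `f_statistic` and the sample values are illustrative assumptions, not part of the project data. Turning F into a p-value additionally needs the F distribution (for example via `scipy.stats.f_oneway`).

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of numeric samples (one list per group)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical funding amounts (in $ millions) for three locations
funding_by_location = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = f_statistic(funding_by_location)
print(f"F = {f_value:.2f}")
```

A large F relative to the critical value for the chosen significance level would count as evidence against H₀. Funding amounts are typically very skewed, so a rank-based alternative such as the Kruskal-Wallis test may be worth considering as well.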

## Questions 
1. How does funding vary across different industry sectors in India?
2. How does funding vary with the location of the startups?
3. What is the relationship between the amount of funding and the stage of the company?
4. How have funding trends evolved between 2018 and 2021?
5. What are the most attractive sectors for investors?
6. Does the location of the company influence its sector?




# **DATA EXPLORATION, DATA UNDERSTANDING and DATA ANALYSIS**

In [1]:
# Load libraries
# Database connection
import pyodbc     
from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package

# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# filter warnings
import warnings 
warnings.filterwarnings('ignore')

## **1. Loading and Inspection of Data**

**Load data from the SQL server**

In [2]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")
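
`dotenv_values` returns a dictionary, and `.get` quietly returns `None` for a key that is absent, so an incomplete `.env` file only surfaces later as a connection error. A minimal sketch of an early check follows; the helper name `missing_settings` and the example values are hypothetical.

```python
def missing_settings(settings, required=("SERVER", "DATABASE", "USERNAME", "PASSWORD")):
    """Return the names of required settings that are absent or empty."""
    return [key for key in required if not settings.get(key)]

# Hypothetical example: a .env file that forgot the password
example = {"SERVER": "myserver", "DATABASE": "mydb", "USERNAME": "analyst"}
missing = missing_settings(example)
if missing:
    print(f"Missing settings in .env: {', '.join(missing)}")
```

In the notebook this could be called as `missing_settings(environment_variables)` before the connection string is built.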

In [3]:
# Create a connection string

connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
    
connection = pyodbc.connect(connection_string)



In [4]:
# sql query to get 2020 data. 
query_2020="SELECT * FROM dbo.LP1_startup_funding2020"

# sql query to get 2021 data. 
query_2021="SELECT * FROM dbo.LP1_startup_funding2021"

In [5]:
# load 2021 data
data_2021 = pd.read_sql(query_2021, connection)

# load 2020 data
data_2020 = pd.read_sql(query_2020, connection)

**Load CSV Files**

In [6]:
# load 2019 data
data_2019=pd.read_csv(r'C:\Users\iamde\OneDrive\Desktop\jupyter\india_startup_data\startup_funding2019.csv')

# load 2018 data
data_2018=pd.read_csv(r'C:\Users\iamde\OneDrive\Desktop\jupyter\india_startup_data\startup_funding2018.csv')
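
The absolute paths above only resolve on one machine. A possible alternative, sketched below, builds the paths relative to the notebook with `pathlib`. The folder layout (`india_startup_data` next to the notebook) and the helper name `csv_path` are assumptions.

```python
from pathlib import Path

# Assumed layout: the CSV files live in a folder next to the notebook
DATA_DIR = Path("india_startup_data")

def csv_path(year, data_dir=DATA_DIR):
    """Build the path to the funding CSV for a given year."""
    return data_dir / f"startup_funding{year}.csv"

print(csv_path(2018))
```

`pd.read_csv` accepts `Path` objects, so `pd.read_csv(csv_path(2019))` would load the 2019 file under this layout.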


## **2. Data Exploration and Understanding**

**Preview Each dataset**

In [7]:
# preview the rows and columns for the 2018 dataset
data_2018.sample(5)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
273,EGK Foods,"Food and Beverage, Food Processing, Organic Fo...",Seed,"₹20,000,000","Mumbai, Maharashtra, India",A food processing enterprise registered with S...
131,Wakefit,"Manufacturing, Retail",Venture - Series Unknown,"₹650,000,000","Bangalore, Karnataka, India",Wakefitkart is a mattress manufacturing compan...
511,Dream11,"Fantasy Sports, Mobile, Sports",Series D,100000000,"Mumbai, Maharashtra, India",Dream11 is India’s Biggest Sports Game with 30...
125,Ambee,"Health Care, Medical Device, Public Safety",Angel,—,"Bangalore, Karnataka, India",Ambee is an Environment analytics startup.
401,Gugu,"Battery, Electric Vehicle, Energy, Renewable E...",Seed,"$250,000","Coimbatore, Tamil Nadu, India",Gugu designs & Sells all Electric Connected Tw...


In [8]:
# get a sample of 2019 dataset
data_2019.sample(5)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
31,truMe,,,IoT,A global platform for Identity and Access Mana...,"Babu Dayal, Pramod Uniyal, Lalit Mehta",Rajan Kaistha,"$140,000",
37,Observe.AI,,Bangalore,AI,Creates a voice AI platform,Swapnil,Scale Venture Partners,"$26,000,000",Series A
73,Shadowfax,2015.0,Bangalore,Logistics,A platform for delivery services,"Abhishek Bansal, Vaibhav Khandelwal","Flipkart, Eight Roads Ventures, NGP Capital, Q...","$60,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",
15,LivFin,2017.0,Delhi,Fintech,"Grants small business loans, supply chain fina...",Rakesh Malhotra,German development finance institution DEG,"$5,000,000",


In [9]:
# get a sample of 2020 dataset
data_2020.sample(5)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
337,Vesta Space Technology,2018.0,Pune,Tech Startup,VestaSpace Technology specialises in making sm...,Arun Kumar Sureban,,10000000.0,,
242,HobSpace,2019.0,Mumbai,Edtech,India's Activity Hub for Kids,"Priya Goel Sheth, Harsh Jain",Artha Venture,200000.0,Seed,
355,Trukky,2015.0,Ahmedabad,Logistics,"Transportation, logistics services & cargo ser...","Saswat Sahu, Rishi Raj",Mumbai Angels.,,,
320,Mobile Premier League,2018.0,Bangalore,Entertainment,Mobile Premier League(MPL) is a skill based E-...,"Sai Srinivas Kiran G, Shubham Malhotra","Sequoia Capital India, Times Internet",74000000.0,,
686,Voicezen,,Gurugram,AI,Conversational AI startup,Anurba Nath,Bharti Airtel,,,


In [10]:
# get a sample of 2021 dataset
data_2021.sample(5)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
146,BatteryPool,2018.0,Pune,Automotive,Building the Operating System for managing Ele...,Ashwin Shankar,IAN,$Undisclosed,Seed
780,Stylework,2017.0,Gurugram,Co-working,Stylework is an unconventional co-working spac...,Sparsh Khandelwal,Inflection Point Ventures,"$500,000",Pre-series A
821,Glamplus,2020.0,Bangalore,SaaS startup,India's #1 SaaS based Salon experience Software,Divyanshu Singh,Inflection Point Ventures,"$200,000",Pre-series A
155,UpScalio,2021.0,Gurugram,E-commerce,"UpScalio is India’s next generation, data-driv...",Gautam Kshatriya,,$42000000,
88,Factors.AI,2020.0,Bangalore,SaaS startup,Factors.AI is a Marketing Analytics platform p...,"Aravind Murthy, Praveen Das, Srikrishna Swamin...","Elevation Partners, Emergent Ventures",$2000000,


**Shape of the data**

In [11]:
# get the number of rows and columns for the datasets
print(f"The 2018 dataset has {data_2018.shape[0]} rows and {data_2018.shape[1]} Columns\n")
print(f"The 2019 dataset has {data_2019.shape[0]} rows and {data_2019.shape[1]} Columns\n")
print(f"The 2020 dataset has {data_2020.shape[0]} rows and {data_2020.shape[1]} Columns\n")
print(f"The 2021 dataset has {data_2021.shape[0]} rows and {data_2021.shape[1]} Columns\n\n")

The 2018 dataset has 526 rows and 6 Columns

The 2019 dataset has 89 rows and 9 Columns

The 2020 dataset has 1055 rows and 10 Columns

The 2021 dataset has 1209 rows and 9 Columns




**Info of the data**

In [12]:
# overview of 2018 dataset
data_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [13]:
# overview of 2019 dataset
data_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [14]:
# overview of 2020 dataset
data_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.6+ KB


In [15]:
# overview of 2021 dataset
data_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


**Displaying datasets columns**

In [16]:
# 2021 data columns
data_2021.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')

In [17]:
# 2020 data columns

data_2020.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')

In [18]:
# 2019 data columns

data_2019.columns

Index(['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
       'Founders', 'Investor', 'Amount($)', 'Stage'],
      dtype='object')

In [19]:
# 2018 data columns

data_2018.columns

Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company'],
      dtype='object')

## **Observations:**
**Issues with the data**

1. There is a discrepancy in the naming conventions between the columns in the 2018 and 2019 datasets compared to the 2020 and 2021 datasets.

2. The 2018 dataset lacks several columns found in the later datasets ('Founded', 'Founders' and 'Investor'), so it is incomplete by comparison.

3. Conversely, the 2020 dataset contains an extra column, 'column10', which has only 2 non-null values and serves no purpose in our analysis.

**Course of Action:**

1. **Missing Column Engineering for 2018:**
   - We will address the absence of certain columns in the 2018 dataset by employing data engineering techniques to create and populate the missing columns, ensuring a comprehensive and consistent dataset.

2. **Column Name Standardization:**
   - To establish uniformity and coherence across all datasets, we will embark on a column renaming process for the 2018 and 2019 datasets. This action aims to align the naming conventions with those observed in the 2020 and 2021 datasets, facilitating seamless data integration and analysis.

3. **Extraneous Column Removal in 2020:**
   - The redundant column identified in the 2020 dataset will be removed, streamlining the dataset and eliminating unnecessary elements that do not contribute to the overall analysis objectives.

These actions collectively enhance the integrity, consistency, and completeness of the dataset, paving the way for a more robust and coherent analytical process.







## **3. Data Cleaning**

**Handling missing columns and currency signs in the 2018 dataset**

- The 2018 dataset is missing the 'founded', 'founders', and 'investor' columns.


In [20]:
# Engineer missing columns for the 2018 dataset
columns_to_add = ['founded', 'founders', 'investor']
for column in columns_to_add:
    if column not in data_2018.columns:
        data_2018[column] = np.nan  # np.NaN was removed in NumPy 2.0

# Remove commas, '—' and "''" from the 'Amount' column; empty strings become NaN
data_2018['Amount'] = (
    data_2018['Amount']
    .str.replace(',', '')
    .str.replace('—', '')
    .str.replace("''", '')
    .replace('', np.nan)
)

# Convert rupee amounts to dollars at 0.0146 USD per INR; other rows are left unchanged
mask = data_2018['Amount'].str.contains('₹', na=False)
data_2018.loc[mask, 'Amount'] = data_2018.loc[mask, 'Amount'].str.replace('₹', '').astype(float) * 0.0146
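
The same parsing rules can also be written as a single function that handles one value at a time, which makes each case easy to check. This is a sketch only: the name `amount_to_usd` is hypothetical, and it assumes (as the cell above effectively does) that amounts without a currency sign are already in dollars.

```python
INR_TO_USD = 0.0146  # same rate as used in the cell above

def amount_to_usd(raw, inr_to_usd=INR_TO_USD):
    """Parse a 2018-style amount string and return US dollars, or None if unknown."""
    if raw is None:
        return None
    text = str(raw).replace(",", "").strip()
    if text in ("", "—"):
        return None
    if text.startswith("₹"):
        # Rupee amounts are converted with the fixed rate
        return float(text[1:]) * inr_to_usd
    # Assumption: values with '$' or no sign are dollar amounts
    return float(text.lstrip("$"))

for sample in ["₹20,000,000", "$250,000", "100000000", "—"]:
    print(sample, "->", amount_to_usd(sample))
```

Applied to the column, this would look like `data_2018['Amount'].map(amount_to_usd)`.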

**Data collection Year**

- Each dataset needs a column recording the year it was collected. This makes it possible to tell the records apart after the dataframes are merged.


In [21]:
# add the year the data was collected as a column in every dataset
data_2018['data_year'] = 2018
data_2019['data_year'] = 2019
data_2020['data_year'] = 2020
data_2021['data_year'] = 2021

**Merge the dataframes**

**Notes**
- The function below concatenates the dataframes then renames the columns to ensure uniformity across the merged dataframe


In [53]:
# Define function to concatenate and rename columns
def concat_dfs(df_1, df_2, df_3, df_4):
    # Rename columns in individual DataFrames
    df_1.rename(columns={'Company Name': 'company_brand', 'Industry': 'sector', 'Round/Series': 'stage',
                         'Amount': 'amount($)', 'Location': 'location', 'About Company': 'about_company'},
                inplace=True)
    df_2.rename(columns={'Company/Brand': 'company_brand', 'Founded': 'founded', 'HeadQuarter': 'location',
                         'Sector': 'sector', 'What it does': 'about_company', 'Founders': 'founders',
                         'Investor': 'investor', 'Amount($)': 'amount($)', 'Stage': 'stage'},
                inplace=True)
    df_3.rename(columns={'Company_Brand': 'company_brand', 'Founded': 'founded', 'HeadQuarter': 'location',
                         'Sector': 'sector', 'What_it_does': 'about_company', 'Founders': 'founders',
                         'Investor': 'investor', 'Amount': 'amount($)', 'Stage': 'stage'},
                inplace=True)
    df_4.rename(columns={'Company_Brand': 'company_brand', 'Founded': 'founded', 'HeadQuarter': 'location',
                         'Sector': 'sector', 'What_it_does': 'about_company', 'Founders': 'founders',
                         'Investor': 'investor', 'Amount': 'amount($)', 'Stage': 'stage'},
                inplace=True)

    # Concatenate dataframes along the row axis (each frame keeps its original index labels)
    df = pd.concat([df_1, df_2, df_3, df_4])

    return df


df = concat_dfs(data_2018, data_2019, data_2020, data_2021)
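
Two properties of `concat_dfs` are worth knowing: `rename(..., inplace=True)` modifies the input dataframes themselves, and the concatenated frame keeps the per-year index labels, so labels repeat across years. A hedged alternative is sketched below with toy frames. The helper name `rename_and_concat` is hypothetical, and it deliberately behaves differently from the function above by leaving the inputs untouched and producing a fresh 0..n-1 index.

```python
import pandas as pd

def rename_and_concat(frames, renames):
    """Rename each frame with its own mapping, then stack them with a fresh index."""
    renamed = [frame.rename(columns=mapping) for frame, mapping in zip(frames, renames)]
    return pd.concat(renamed, ignore_index=True)

# Toy frames standing in for two of the yearly datasets
a = pd.DataFrame({"Company Name": ["X"], "Amount": [1.0]})
b = pd.DataFrame({"Company_Brand": ["Y", "Z"], "Amount": [2.0, 3.0]})
combined = rename_and_concat(
    [a, b],
    [{"Company Name": "company_brand", "Amount": "amount($)"},
     {"Company_Brand": "company_brand", "Amount": "amount($)"}],
)
print(combined)
```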


In [23]:
# Drop the extraneous 'column10'
df.drop('column10', axis=1, inplace= True)

**Cleaning 'Amount' column**

**Notes**  
- Remove all currency signs  

- Remove all other unwanted characters, words and symbols  

- Populate null values with the mode of the dataset  

- Convert the column from object to float

In [24]:

# Remove dollar sign
df['amount($)'] = df['amount($)'].replace(r'\$', '', regex=True)

# Remove commas
df['amount($)'] = df['amount($)'].str.replace(',', '')

# Remove all other irrelevant characters, words and symbols
df['amount($)'] = df['amount($)'].replace(["Upsparks", 'undisclosed', 'Undisclosed', "ah! Ventures", 
                                               "Pre-series A", "ITO Angel Network LetsVenture", 
                                               "JITO Angel Network LetsVenture", "Series C", 'Seed', ','], '')

# Convert the 'amount($)' column to numeric
df['amount($)'] = pd.to_numeric(df['amount($)'])
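
One pitfall to be aware of with mixed-type columns: pandas `.str` methods return NaN for elements that are not strings. After concatenation, 'amount($)' holds both strings and floats (the 2020 amounts are float64, and the converted 2018 rupee values are floats), so a `.str.replace` step can blank out values that were already numeric. This may explain part of the drop in non-null values in the output below. The sketch uses toy data to show the effect and one possible workaround.

```python
import pandas as pd

# Toy column mixing strings and floats, as happens after concatenating the yearly files
raw = pd.Series(["$1,000,000", 2000000.0, "Undisclosed", None, "500000"], dtype=object)

# .str methods leave non-string entries as NaN, so the float is lost here
lost = raw.str.replace(",", "", regex=False)

# Casting to str first keeps numeric entries; unparseable text becomes NaN
as_text = raw.astype(str).str.replace(r"[$,]", "", regex=True)
amount = pd.to_numeric(as_text, errors="coerce")
print(amount.tolist())
```

With `errors="coerce"`, labels such as 'Undisclosed' do not need to be listed one by one, at the cost of silently turning any unexpected text into NaN.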

In [25]:
df['amount($)'].value_counts(sort=True)

amount($)
1000000.0      116
2000000.0       77
5000000.0       56
3000000.0       54
500000.0        49
              ... 
311000000.0      1
220000.0         1
6800000.0        1
49400000.0       1
55000000.0       1
Name: count, Length: 245, dtype: int64

In [26]:
df['amount($)'].info()

<class 'pandas.core.series.Series'>
Index: 2879 entries, 0 to 1208
Series name: amount($)
Non-Null Count  Dtype  
--------------  -----  
1367 non-null   float64
dtypes: float64(1)
memory usage: 45.0 KB


**Cleaning data_year column**

**Notes**  


- Convert data type to period


In [27]:
# Convert the data_year column to a yearly period
df['data_year']=pd.to_datetime(df['data_year'], format='%Y')
df['data_year']=df['data_year'].dt.to_period('Y')
# df['founded']=pd.to_datetime(df['founded']).dt.year

In [28]:
# check for nulls
print(f"There are {df['data_year'].isna().sum()} Null values in the 'data_year' column")

There are 0 Null values in the 'data_year' column


In [29]:
df['data_year'].info()

<class 'pandas.core.series.Series'>
Index: 2879 entries, 0 to 1208
Series name: data_year
Non-Null Count  Dtype        
--------------  -----        
2879 non-null   period[Y-DEC]
dtypes: period[Y-DEC](1)
memory usage: 45.0 KB


**Cleaning 'founded' column**

**Notes**
- Handle nulls by populating with the 'bfill' method



In [30]:
print(f"There are {df['founded'].isna().sum()} Null values in the 'founded' column")

There are 769 Null values in the 'founded' column


In [31]:
df['founded'].unique()

array([  nan, 2014., 2004., 2013., 2010., 2018., 2019., 2017., 2011.,
       2015., 2016., 2012., 2008., 2020., 1998., 2007., 1982., 2009.,
       1995., 2006., 1978., 1999., 1994., 2005., 1973., 2002., 2001.,
       2021., 1993., 1989., 2000., 2003., 1991., 1984., 1963.])

**Notes**  
- There are 769 null values in the 'founded' column.  

- Since dropping the nulls would lead to a significant loss of data, backward fill will be used to fill the null values.
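
To make explicit what backward fill does, here is a small pure-Python sketch on made-up values; the function name `backward_fill` is illustrative.

```python
def backward_fill(values):
    """Replace each None with the next non-None value that follows it."""
    filled = list(values)
    next_valid = None
    # Walk from the end so the most recent valid value is always known
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_valid
        else:
            next_valid = filled[i]
    return filled

founded = [None, 2014, None, None, 2004, None]
print(backward_fill(founded))
```

In pandas the equivalent call is `df['founded'].bfill()`. A trailing missing value stays missing because nothing follows it, and each filled year is borrowed from whichever startup comes next, so the result depends on row order.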

In [32]:
# Fill missing values with backward-fill and then interpolate
# df['founded'] = df['founded'].interpolate(limit_direction='backward',
                                                                #   limit=10, method='linear', 
                                                                #   limit_area='inside', 
                                                                #   upper=2021)



In [33]:
# Convert to datetime
df['founded'] = pd.to_datetime(df['founded'], format='%Y')

# Convert to period
df['founded'] = df['founded'].dt.to_period('Y')


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2879 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype        
---  ------         --------------  -----        
 0   company_brand  2879 non-null   object       
 1   sector         2861 non-null   object       
 2   stage          1941 non-null   object       
 3   amount($)      1367 non-null   float64      
 4   location       2765 non-null   object       
 5   about_company  2879 non-null   object       
 6   founded        2110 non-null   period[Y-DEC]
 7   founders       2334 non-null   object       
 8   investor       2253 non-null   object       
 9   data_year      2879 non-null   period[Y-DEC]
dtypes: float64(1), object(7), period[Y-DEC](2)
memory usage: 247.4+ KB


**Cleaning the 'founders' column**

In [35]:
# Treat the '...' placeholder as missing
df['founders'] = df['founders'].replace('...', np.nan)

# Check the number of NaN values in the 'founders' column
nan_count = df['founders'].isna().sum()

print(nan_count)

545


In [36]:
df['founders'].unique()

array([nan, 'Shantanu Deshpande', 'Adamas Belva Syah Devara, Iman Usman.',
       ..., 'Bala Sarda', 'Arnav Kumar, Vaibhav Singh',
       'Vishal Chopra, Himanshu Gupta'], dtype=object)

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2879 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype        
---  ------         --------------  -----        
 0   company_brand  2879 non-null   object       
 1   sector         2861 non-null   object       
 2   stage          1941 non-null   object       
 3   amount($)      1367 non-null   float64      
 4   location       2765 non-null   object       
 5   about_company  2879 non-null   object       
 6   founded        2110 non-null   period[Y-DEC]
 7   founders       2334 non-null   object       
 8   investor       2253 non-null   object       
 9   data_year      2879 non-null   period[Y-DEC]
dtypes: float64(1), object(7), period[Y-DEC](2)
memory usage: 247.4+ KB


**Cleaning the 'stage' column**

Startups start with pre-seed, progress through seed, Series A, Series B, etc., securing resources for development and strategies. Additional rounds like Series C or D may follow. External funding at each stage fuels growth toward the venture's full potential.

**Pre-Seed Funding**  
Entrepreneurial idea in early development; small funds needed; limited informal channels for raising funds.

**Seed Funding**  
First official equity funding; investors provide funds for equity ownership.

**Series A Financing**  
First venture capital round; developed product, consistent revenue, long-term profit plan.

**Series B Financing**  
For established startups; substantial user base and revenue; funding for expansion.

**Series C and Beyond**  
Optional rounds for final push before IPO or unmet objectives; Series C is the third venture capital round.

**Initial Public Offering (IPO)**  
Process of offering corporate shares to the public; used for funding or divestment.

Source: https://www.startupindia.gov.in/content/sih/en/funding.html

In [38]:
# Cleaning stage column: a URL and dollar amounts are not funding stages, so treat them as missing
df['stage'].unique()
df['stage']=df['stage'].replace(['https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
'$6000000','$1000000','$300000','$1200000'],np.nan)

In [39]:
# Standardize funding stages in the 'stage' column
df['stage'] = df['stage'].replace(['Seies A', 'Series A-1', 'Series A2', 'Series A+'], 'Series A')
df['stage'] = df['stage'].replace(['Pre-seed', 'Pre-seed Round', 'Pre seed Round', 'Pre seed round'], 'Pre-Seed Stage')
df['stage'] = df['stage'].replace(['Pre-series A', 'Pre Series A', 'Pre series A1', 'Pre-series A1', 'Pre- series A'], 'Pre series A')
df['stage'] = df['stage'].replace(['Series B+', 'Series B2', 'Series B3'], 'Series B')
df['stage'] = df['stage'].replace(['Series C', 'Series C, D', 'Private Equity','PE', 'Post-IPO Equity','Series D', 'Series E', 'Series F', 'Series G', 'Series H', 'Series I','Series D1','Series F2', 'Series F1'], 'Series C and Beyond')
# Missing values (None) and stages that are not equity rounds are labelled 'unknown'
df['stage'] = df['stage'].replace(['Venture - Series Unknown', None,'Grant','Debt','Debt Financing','Post-IPO Debt','Non-equity Assistance','Bridge','Bridge Round','Fresh funding','Funding Round','Mid series','Edge'], 'unknown')
df['stage'] = df['stage'].replace(['Corporate Round','Undisclosed','Secondary Market','Pre-series','Post series A','Pre-series B','Pre-Series B','Pre series B','Pre-series C','Pre series C'], 'Other Stages')
# Note: 'Pre-Seed' and 'Series E2' are grouped under 'Seed Stage' by this mapping
df['stage'] = df['stage'].replace(['Seed','Seed funding','Pre-Seed','Angel', 'Angel Round','Seed fund', 'Seed round', 'Seed A','Seed Funding', 'Seed Round & Series A', 'Series E2', 'Seed Round','Seed Investment','Seed+','Early seed'],'Seed Stage')
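
Listing every spelling by hand is easy to get out of sync. A rule-based alternative is sketched below: it normalizes case, hyphens and spacing, then applies a few patterns. This is a different technique from the lists above, not a drop-in replacement. The function name `canonical_stage` is hypothetical, anything unrecognized (including 'Seies A' and the 'Other Stages' labels) falls back to 'unknown', and a few labels land in different buckets than above (for example 'Pre-Seed' and 'Series E2').

```python
import re

def canonical_stage(raw):
    """Map a free-text funding stage to one of the notebook's stage labels."""
    if raw is None:
        return 'unknown'
    # Lowercase and collapse hyphens/whitespace so 'Pre- series A' == 'pre series a'
    text = re.sub(r'[\s\-]+', ' ', str(raw).strip().lower())
    if text.startswith('pre seed'):
        return 'Pre-Seed Stage'
    if text.startswith('pre series a'):
        return 'Pre series A'
    if text.startswith(('seed', 'angel', 'early seed')):
        return 'Seed Stage'
    match = re.match(r'series ([a-z])', text)
    if match:
        letter = match.group(1)
        # Series A and B keep their own label; C and later are grouped
        return f'Series {letter.upper()}' if letter in 'ab' else 'Series C and Beyond'
    return 'unknown'

for label in ['Pre-seed Round', 'Pre- series A', 'Series A+', 'Series E2', 'Debt Financing']:
    print(label, '->', canonical_stage(label))
```

It could be applied with `df['stage'].map(canonical_stage)` after the invalid entries have been set to missing.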

In [40]:
df['stage'].unique()

array(['Seed Stage', 'Series A', 'Series B', 'Series C and Beyond',
       'unknown', 'Other Stages', 'Pre series A', 'Pre-Seed Stage'],
      dtype=object)

In [41]:
df['stage'].isna().sum()

0

In [42]:
df['stage'].value_counts()

stage
unknown                1060
Seed Stage              746
Series A                309
Pre series A            291
Series C and Beyond     232
Series B                138
Pre-Seed Stage           65
Other Stages             38
Name: count, dtype: int64

In [43]:
df['sector'].unique()

array(['Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing',
       'Agriculture, Farming',
       'Credit, Financial Services, Lending, Marketplace',
       'Financial Services, FinTech',
       'E-Commerce Platforms, Retail, SaaS',
       'Cloud Infrastructure, PaaS, SaaS',
       'Internet, Leisure, Marketplace', 'Market Research',
       'Information Services, Information Technology', 'Mobile Payments',
       'B2B, Shoes', 'Internet',
       'Apps, Collaboration, Developer Platform, Enterprise Software, Messaging, Productivity Tools, Video Chat',
       'Food Delivery', 'Industrial Automation',
       'Automotive, Search Engine, Service Industry',
       'Finance, Internet, Travel',
       'Accounting, Business Information Systems, Business Travel, Finance, SaaS',
       'Artificial Intelligence, Product Search, SaaS, Service Industry, Software',
       'Internet of Things, Waste Management',
       'Air Transportation, Freight Service, Logistics, Marine Transport

**Cleaning the Sector Column**

In [44]:
# Keep only the first sector listed for each startup
df['sector']=df['sector'].str.split(",").str[0]
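
Splitting on the comma keeps any surrounding whitespace and also keeps placeholder values such as '—', which still appears among the unique sectors below. A per-value sketch that handles both is shown here; the helper name `primary_sector` is an assumption.

```python
def primary_sector(raw):
    """Return the first comma-separated sector, or None for empty/placeholder values."""
    if raw is None:
        return None
    first = str(raw).split(",")[0].strip()
    return None if first in ("", "—") else first

for value in ["Food and Beverage, Food Processing", " FinTech ", "—"]:
    print(repr(value), "->", primary_sector(value))
```

It could be applied with `df['sector'].map(primary_sector)`.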

In [45]:
df['sector'].unique()

array(['Brand Marketing', 'Agriculture', 'Credit', 'Financial Services',
       'E-Commerce Platforms', 'Cloud Infrastructure', 'Internet',
       'Market Research', 'Information Services', 'Mobile Payments',
       'B2B', 'Apps', 'Food Delivery', 'Industrial Automation',
       'Automotive', 'Finance', 'Accounting', 'Artificial Intelligence',
       'Internet of Things', 'Air Transportation', 'Food and Beverage',
       'Autonomous Vehicles', 'Enterprise Software', 'Logistics',
       'Insurance', 'Information Technology', 'Blockchain', 'Education',
       'E-Commerce', 'Renewable Energy', 'E-Learning', 'Clean Energy',
       'Transportation', 'Fitness', 'Hospitality',
       'Media and Entertainment', 'Broadcasting', 'EdTech', 'Health Care',
       '—', 'Sports', 'Big Data', 'Cloud Computing', 'Food Processing',
       'Trading Platform', 'Consumer Goods', 'Wellness', 'Fashion',
       'Consulting', 'Biotechnology', 'Communities', 'Consumer',
       'Consumer Applications', 'Mobile',

**Cleaning the location column**

In [47]:
# Get the first location from every list
df['location']=df['location'].str.split(",").str[0]

In [48]:
df['location'].unique()

array(['Bangalore', 'Mumbai', 'Gurgaon', 'Noida', 'Hyderabad',
       'Bengaluru', 'Kalkaji', 'Delhi', 'India', 'Hubli', 'New Delhi',
       'Chennai', 'Mohali', 'Kolkata', 'Pune', 'Jodhpur', 'Kanpur',
       'Ahmedabad', 'Azadpur', 'Haryana', 'Cochin', 'Faridabad', 'Jaipur',
       'Kota', 'Anand', 'Bangalore City', 'Belgaum', 'Thane', 'Margão',
       'Indore', 'Alwar', 'Kannur', 'Trivandrum', 'Ernakulam',
       'Kormangala', 'Uttar Pradesh', 'Andheri', 'Mylapore', 'Ghaziabad',
       'Kochi', 'Powai', 'Guntur', 'Kalpakkam', 'Bhopal', 'Coimbatore',
       'Worli', 'Alleppey', 'Chandigarh', 'Guindy', 'Lucknow', nan,
       'Telangana', 'Gurugram', 'Surat', 'Uttar pradesh', 'Rajasthan',
       'Tirunelveli', None, 'Singapore', 'Gujarat', 'Kerala', 'Frisco',
       'California', 'Dhingsara', 'New York', 'Patna', 'San Francisco',
       'San Ramon', 'Paris', 'Plano', 'Sydney', 'San Francisco Bay Area',
       'Bangaldesh', 'London', 'Milano', 'Palmwoods', 'France',
       'Samastipur', 

**Correct naming variations in the location column**

In [50]:
# Map spelling variants and alternate names to a single value.
# 'California' and other names that are already consistent are left unchanged.
location_fixes = {
    'Bengaluru': 'Bangalore', 'Banglore': 'Bangalore',
    'Gurugram': 'Gurgaon', 'Gurugram\t#REF!': 'Gurgaon',
    'Hyderebad': 'Hyderabad', 'New Delhi': 'Delhi', 'Ahmadabad': 'Ahmedabad',
    'Ernakulam': 'Cochin', 'Kerala': 'Kochi', 'Telugana': 'Telangana',
    'Uttar pradesh': 'Uttar Pradesh', 'Rajastan': 'Rajasthan', 'Rajsamand': 'Rajasthan',
    'San Franciscao': 'San Francisco', 'San Francisco Bay Area': 'San Francisco',
    'Samsitpur': 'Samastipur', 'Santra': 'Samtra', 'The Nilgiris': 'Nilgiris',
    'Orissia': 'Odisha', 'Bangaldesh': 'Bangladesh', 'Small Towns': 'Unknown',
    # Non-location values found in the column (they look like sector names)
    'Online Media\t#REF!': 'Online Media', 'Pharmaceuticals\t#REF!': 'Pharmaceuticals',
    'Information Technology & Services': 'IT Services',
    'Food & Beverages': 'Food and Beverages',
}
df['location'] = df['location'].replace(location_fixes)
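
Spelling variants like these can be hard to spot by eye in a long list of unique values. As a possible aid, the sketch below uses the standard library's `difflib.get_close_matches` to propose candidate pairs for manual review. The helper name `near_duplicates`, the cutoff of 0.85 and the sample names are assumptions.

```python
from difflib import get_close_matches

def near_duplicates(names, cutoff=0.85):
    """Return sorted pairs of distinct names whose spelling is very similar."""
    unique = sorted(set(names))
    pairs = set()
    for name in unique:
        others = [n for n in unique if n != name]
        for match in get_close_matches(name, others, n=3, cutoff=cutoff):
            # Store each pair once, regardless of which side found the other
            pairs.add(tuple(sorted((name, match))))
    return sorted(pairs)

sample = ['Bangalore', 'Banglore', 'Bengaluru', 'Hyderabad', 'Hyderebad', 'Mumbai', 'Delhi']
for pair in near_duplicates(sample):
    print(pair)
```

Similarity scoring only catches misspellings. Alternate names for the same city, such as 'Bengaluru' and 'Bangalore', still need an explicit mapping like the one above.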


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2879 entries, 0 to 1208
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   company_brand  2879 non-null   object 
 1   sector         2861 non-null   object 
 2   stage          1941 non-null   object 
 3   amount($)      2474 non-null   object 
 4   location       2765 non-null   object 
 5   about_company  2879 non-null   object 
 6   founded        2110 non-null   float64
 7   founders       2334 non-null   object 
 8   investor       2253 non-null   object 
 9   data_year      2879 non-null   int64  
 10  column10       2 non-null      object 
dtypes: float64(1), int64(1), object(9)
memory usage: 269.9+ KB
