# Indian Start-Up Funding Analysis (2018 - 2021)

# Business Understanding

Main Objective of the Project

The primary objective of this project is to analyze funding trends in the Indian start-up ecosystem from 2018 to 2021.
By examining the data, we aim to identify patterns, trends and insights that can inform strategic decisions for entering the Indian start-up market. Specifically, we will focus on the amount of funding received by start-ups, the types of investors involved and the sectors that attract the most investment.


Key Research Questions
1. How has the total amount of funding received by start-ups in India changed from 2018 to 2021?
2. Which sectors have received the most funding in each year?
3. Who are the top investors in the Indian start-up ecosystem from 2018 to 2021?
4. Which regions or cities in India are receiving the most start-up funding?
5. Does the funding stage align with the investment timeline?
6. How long has each company been operating, and how does this affect the amount of investment?


Importing Necessary Packages

In [2]:
import pyodbc
import pandas as pd
from dotenv import dotenv_values
import warnings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


warnings.filterwarnings('ignore')

In [3]:
# Contents of the .env file

server='dap-projects-database.database.windows.net'
database='dapDB'
username='LP1_learner'
password='Hyp0th3s!$T3$t!ng'


In [4]:
#Loading environment variables from .env into a dictionary

env_var = dotenv_values('.env')

#Getting credentials

server = env_var.get('server')
database = env_var.get('database')
username = env_var.get('username')
password = env_var.get('password')

conn = f'DRIVER={{SQL SERVER}};SERVER={server};DATABASE={database};UID={username};PWD={password};timeout=30'
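If a key is missing from the `.env` file, `env_var.get(...)` returns `None` and the connection string silently contains the text `None`. A minimal sketch of an early check follows; the helper name `build_connection_string`, the `REQUIRED_KEYS` tuple and the placeholder values are assumptions made for illustration and are not part of the original notebook.

```python
REQUIRED_KEYS = ("server", "database", "username", "password")

def build_connection_string(env):
    """Return an ODBC connection string, failing early if a key is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing keys in .env: {', '.join(missing)}")
    # Same layout as the connection string used in this notebook
    return (
        f"DRIVER={{SQL SERVER}};SERVER={env['server']};"
        f"DATABASE={env['database']};UID={env['username']};"
        f"PWD={env['password']};timeout=30"
    )

# Placeholder values for illustration only (not real credentials)
example_env = {
    "server": "example.server",
    "database": "exampleDB",
    "username": "user",
    "password": "secret",
}
example_conn = build_connection_string(example_env)
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed in place of `example_env`.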




In [5]:
#Connecting to the server

connection = pyodbc.connect(conn)

In [6]:
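# Connecting again with error handling (pyodbc is already imported above)

try:
    connection = pyodbc.connect(conn)
    print("Connection successful!")
except pyodbc.Error as ex:
    # ex.args usually holds (SQLSTATE, message); report the message
    error_message = ex.args[1] if len(ex.args) > 1 else str(ex)
    print(f"Connection failed: {error_message}")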


Connection successful!


In [7]:
#Fetching from database

db_query = '''
            SELECT *
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_TYPE = 'BASE TABLE'
            '''

data = pd.read_sql(db_query, connection)
data

Unnamed: 0,TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
0,dapDB,dbo,LP1_startup_funding2021,BASE TABLE
1,dapDB,dbo,LP1_startup_funding2020,BASE TABLE


In [8]:
query_1 = '''
          SELECT * 
          FROM LP1_startup_funding2021
          '''

data_021 = pd.read_sql(query_1, connection)
data_021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [9]:
query_2 = '''
          SELECT * 
          FROM LP1_startup_funding2020
          '''


data_020 = pd.read_sql(query_2, connection)
data_020.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [10]:
# Loading the 2019 data from a local CSV file

data_019 = pd.read_csv(r'C:\Users\Admin\Downloads\startup_funding2019.csv')
data_019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [11]:

#Alternative to view all rows and columns without truncating

pd.options.display.max_columns = None
pd.options.display.max_rows = None


data_018 = pd.read_csv(r'C:\Users\Admin\Downloads\startup_funding2018.csv')
data_018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


# Data Manipulation

In [12]:
# Concatenate all datasets

combined_data = pd.concat([data_021, data_020, data_019, data_018], ignore_index=True)
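Once the four frames are concatenated, nothing records which year a row came from, yet several of the research questions are year-based. One possible approach, sketched here on toy stand-in frames, is to tag each frame before concatenating. The column name `Funding_Year` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the yearly frames (hypothetical values)
frames_by_year = {
    2021: pd.DataFrame({"Company_Brand": ["A"], "Amount": ["$1,000"]}),
    2018: pd.DataFrame({"Company Name": ["B"], "Amount": ["2000"]}),
}

# assign() returns a copy with the extra column, leaving the originals intact
tagged = [df.assign(Funding_Year=year) for year, df in frames_by_year.items()]
combined_example = pd.concat(tagged, ignore_index=True)
```

The same pattern would apply to `data_021`, `data_020`, `data_019` and `data_018` by using them as the dictionary values.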


In [13]:
# Preview of combined data

combined_data.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10,Company/Brand,What it does,Amount($),Company Name,Industry,Round/Series,Location,About Company
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,,,,,,,,,
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",,,,,,,,,,
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D,,,,,,,,,
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C,,,,,,,,,
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,,,,,,,,,



# **Exploratory Data Analysis**


In [14]:
# Determining the shape of the data set

combined_data.shape

(2879, 18)

In [15]:
# Determining null values

combined_data.isna().any()


Company_Brand    True
Founded          True
HeadQuarter      True
Sector           True
What_it_does     True
Founders         True
Investor         True
Amount           True
Stage            True
column10         True
Company/Brand    True
What it does     True
Amount($)        True
Company Name     True
Industry         True
Round/Series     True
Location         True
About Company    True
dtype: bool

In [16]:
# Counting null values per column

combined_data.isna().sum()


Company_Brand     615
Founded           769
HeadQuarter       640
Sector            544
What_it_does      615
Founders          545
Investor          626
Amount            346
Stage            1464
column10         2877
Company/Brand    2790
What it does     2790
Amount($)        2790
Company Name     2353
Industry         2353
Round/Series     2353
Location         2353
About Company    2353
dtype: int64
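Raw counts are easier to compare when shown next to the share of rows they represent. The following is a small sketch on a toy frame; the frame `example` and the summary name `missing_summary` are hypothetical.

```python
import pandas as pd

# Toy frame with some missing values (hypothetical data)
example = pd.DataFrame({
    "Amount": ["100", None, "300", None],
    "Stage": ["Seed", "Seed", None, "Series A"],
})

# isna().mean() gives the fraction of missing rows per column
missing_summary = pd.DataFrame({
    "missing": example.isna().sum(),
    "percent": (example.isna().mean() * 100).round(1),
}).sort_values("percent", ascending=False)
```

Applied to `combined_data`, a table like this would make it clear at a glance which columns come from a single year's file.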

In [17]:
#Assessing columns in all the data sets

combined_data.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10', 'Company/Brand',
       'What it does', 'Amount($)', 'Company Name', 'Industry', 'Round/Series',
       'Location', 'About Company'],
      dtype='object')

In [18]:
#Gaining an overview of the data

combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  2264 non-null   object 
 1   Founded        2110 non-null   float64
 2   HeadQuarter    2239 non-null   object 
 3   Sector         2335 non-null   object 
 4   What_it_does   2264 non-null   object 
 5   Founders       2334 non-null   object 
 6   Investor       2253 non-null   object 
 7   Amount         2533 non-null   object 
 8   Stage          1415 non-null   object 
 9   column10       2 non-null      object 
 10  Company/Brand  89 non-null     object 
 11  What it does   89 non-null     object 
 12  Amount($)      89 non-null     object 
 13  Company Name   526 non-null    object 
 14  Industry       526 non-null    object 
 15  Round/Series   526 non-null    object 
 16  Location       526 non-null    object 
 17  About Company  526 non-null    object 
dtypes: float64(1), object(17)

In [19]:
# Determining the datatypes

combined_data.dtypes

Company_Brand     object
Founded          float64
HeadQuarter       object
Sector            object
What_it_does      object
Founders          object
Investor          object
Amount            object
Stage             object
column10          object
Company/Brand     object
What it does      object
Amount($)         object
Company Name      object
Industry          object
Round/Series      object
Location          object
About Company     object
dtype: object

In [20]:
# Getting an overview of the data

combined_data.describe().T

# describe() summarizes numeric columns by default, so only 'Founded' appears: it is the only float column in the data set

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,2110.0,2016.079621,4.368006,1963.0,2015.0,2017.0,2019.0,2021.0
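Since almost every column here is of type object, a summary of the text columns can be more informative than the numeric one. A brief sketch on a toy frame (the frame `text_example` is hypothetical):

```python
import pandas as pd

# Toy frame with text columns only (hypothetical data)
text_example = pd.DataFrame({
    "Stage": ["Seed", "Seed", "Series A"],
    "Sector": ["EdTech", "FinTech", "EdTech"],
})

# include="object" reports count, unique, top and freq for text columns
text_summary = text_example.describe(include="object")
```

Calling `combined_data.describe(include="object").T` in the notebook would give the same kind of overview for the real data.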


In [21]:
# Inspecting the Amount column to understand why it is stored as object rather than float

combined_data.loc[:, 'Amount']

0                            $1,200,000
1                          $120,000,000
2                           $30,000,000
3                           $51,000,000
4                            $2,000,000
5                          $188,000,000
6                              $200,000
7                           Undisclosed
8                           Undisclosed
9                            $1,000,000
10                           $3,000,000
11                             $100,000
12                             $700,000
13                           $2,000,000
14                           $9,000,000
15                          $40,000,000
16                          $49,000,000
17                             $400,000
18                             $300,000
19                          $25,000,000
20                         $160,000,000
21                          Undisclosed
22                             $150,000
23                           $1,800,000
24                           $5,000,000


In [22]:
# Removing characters from the Amount column

combined_data['Amount'] = combined_data['Amount'].str.replace('[$₹]', '', regex=True)
combined_data['Amount'] = combined_data['Amount'].str.replace(',', '')

combined_data['Amount'] 


0                              1200000
1                            120000000
2                             30000000
3                             51000000
4                              2000000
5                            188000000
6                               200000
7                          Undisclosed
8                          Undisclosed
9                              1000000
10                             3000000
11                              100000
12                              700000
13                             2000000
14                             9000000
15                            40000000
16                            49000000
17                              400000
18                              300000
19                            25000000
20                           160000000
21                         Undisclosed
22                              150000
23                             1800000
24                             5000000
25                       

In [24]:
# Counting how often each value appears in 'Amount', including the 'Undisclosed' labels

amount_counts = combined_data['Amount'].value_counts()
amount_counts


Amount
—                                 148
Undisclosed                       116
1000000                           112
2000000                            75
3000000                            54
5000000                            53
500000                             50
10000000                           49
200000                             42
4000000                            37
300000                             31
30000000                           29
20000000                           28
50000000                           28
6000000                            27
1500000                            27
400000                             27
100000000                          22
100000                             22
undisclosed                        22
15000000                           21
40000000                           21
7000000                            18
1200000                            16
600000                             16
700000                             16
12000

The output above shows that the column contains both 'Undisclosed' (116 rows) and 'undisclosed' (22 rows); the only difference is capitalization. Together they account for 138 rows, and a dash placeholder accounts for a further 148.
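One way to handle these placeholders is to let them become missing values when converting to numbers. The sketch below uses a toy Series that mixes strings, a float and `None`, since the 2020 amounts are already floats and the `.str` accessor returns NaN for non-string entries. The names `raw_amounts`, `stripped` and `numeric_amounts` are hypothetical.

```python
import pandas as pd

# Toy values mirroring the kinds of entries seen in 'Amount';
# "\u2014" is the dash placeholder that appears in the 2018 data
raw_amounts = pd.Series(
    ["$1,200,000", "Undisclosed", "undisclosed", "\u2014", 200000.0, "₹40,000,000", None]
)

# Cast to str first so that float entries are not lost by the .str accessor
stripped = raw_amounts.astype(str).str.replace(r"[$₹,]", "", regex=True).str.strip()

# Anything that is not a number (labels, placeholders, None) becomes NaN
numeric_amounts = pd.to_numeric(stripped, errors="coerce")
```

Note that this only removes the ₹ symbol; the rupee value is still in rupees and would need converting separately.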

In [25]:
# Determining unique values

combined_data.nunique()

Company_Brand    1745
Founded            34
HeadQuarter       123
Sector            502
What_it_does     2112
Founders         1980
Investor         1777
Amount            283
Stage              62
column10            2
Company/Brand      87
What it does       88
Amount($)          50
Company Name      525
Industry          405
Round/Series       21
Location           50
About Company     524
dtype: int64

In [26]:
# Checking for duplicates

combined_data.loc[combined_data.duplicated()].head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10,Company/Brand,What it does,Amount($),Company Name,Industry,Round/Series,Location,About Company
107,Curefoods,2020.0,Bangalore,Food & Beverages,Healthy & nutritious foods and cold pressed ju...,Ankit Nagori,"Iron Pillar, Nordstar, Binny Bansal",13000000,,,,,,,,,,
109,Bewakoof,2012.0,Mumbai,Apparel & Fashion,Bewakoof is a lifestyle fashion brand that mak...,Prabhkiran Singh,InvestCorp,8000000,,,,,,,,,,
111,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000,,,,,,,,,
117,Advantage Club,2014.0,Mumbai,HRTech,Advantage Club is India's largest employee eng...,"Sourabh Deorah, Smiti Bhatt Deorah","Y Combinator, Broom Ventures, Kunal Shah",1700000,,,,,,,,,,
119,Ruptok,2020.0,New Delhi,FinTech,Ruptok fintech Pvt. Ltd. is an online gold loa...,Ankur Gupta,Eclear Leasing,1000000,,,,,,,,,,


In [27]:
# Narrowing it down to a column

combined_data.loc[combined_data.duplicated(subset='Company_Brand')].head(5)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10,Company/Brand,What it does,Amount($),Company Name,Industry,Round/Series,Location,About Company
80,Uable,2020.0,Bangalore,EdTech,Uable are on a bold mission to redefine the fu...,Saurabh Saxena,"JAFCO Asia, Chiratae Ventures",35000000,Pre-series A,,,,,,,,,
107,Curefoods,2020.0,Bangalore,Food & Beverages,Healthy & nutritious foods and cold pressed ju...,Ankit Nagori,"Iron Pillar, Nordstar, Binny Bansal",13000000,,,,,,,,,,
108,TartanSense,2015.0,Bangalore,Information Technology,TartanSense unlocks value for small farm holde...,Jaisimha Rao,"FMC, Omnivore, Blume Ventures",5000000,Series A,,,,,,,,,
109,Bewakoof,2012.0,Mumbai,Apparel & Fashion,Bewakoof is a lifestyle fashion brand that mak...,Prabhkiran Singh,InvestCorp,8000000,,,,,,,,,,
110,Kirana247,2018.0,New Delhi,Logistics & Supply Chain,An on-demand FMCG supply chain company leverag...,"Tarun Jiwarajka, Pankhuri Jiwarajka",,1000000,Pre-series A,,,,,,,,,


In [28]:
# Checking an example duplicate

combined_data.query("Company_Brand == 'Kirana247'")



Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10,Company/Brand,What it does,Amount($),Company Name,Industry,Round/Series,Location,About Company
97,Kirana247,2018.0,New Delhi,Logistics & Supply Chain,An on-demand FMCG supply chain company leverag...,"Tarun Jiwarajka, Pankhuri Jiwarajka",,1000000,,,,,,,,,,
110,Kirana247,2018.0,New Delhi,Logistics & Supply Chain,An on-demand FMCG supply chain company leverag...,"Tarun Jiwarajka, Pankhuri Jiwarajka",,1000000,Pre-series A,,,,,,,,,
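The two Kirana247 rows differ in 'Stage', so they are not exact duplicates, and a company can also legitimately appear more than once if it raised several rounds. A cautious option is therefore to drop only rows that match on every column. A minimal sketch on a toy frame with simplified columns (the frame `rounds` is hypothetical):

```python
import pandas as pd

# Toy rows modelled on the examples above, with simplified columns
rounds = pd.DataFrame({
    "Company_Name": ["Kirana247", "Kirana247", "Curefoods", "Curefoods"],
    "Amount": ["1000000", "1000000", "13000000", "13000000"],
    "Stage": [None, "Pre-series A", None, None],
})

# duplicated() flags rows identical to an earlier row in every column
exact_duplicates = rounds.duplicated()

# Keep the first occurrence of each fully identical row
deduplicated = rounds.drop_duplicates(ignore_index=True)
```

Here only the second Curefoods row is removed; both Kirana247 rows are kept because their 'Stage' values differ.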


In [29]:
combined_data.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10', 'Company/Brand',
       'What it does', 'Amount($)', 'Company Name', 'Industry', 'Round/Series',
       'Location', 'About Company'],
      dtype='object')


# **Data Cleaning**

In [30]:
# Re-concatenate the raw data sets so that cleaning starts from the original values

combined_data = pd.concat([data_021, data_020, data_019, data_018], ignore_index=True)


In [31]:
combined_data.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10', 'Company/Brand',
       'What it does', 'Amount($)', 'Company Name', 'Industry', 'Round/Series',
       'Location', 'About Company'],
      dtype='object')

In [32]:
# Dropping irrelevant columns

combined_data.drop('column10', axis =1, inplace = True)


In [33]:
combined_data.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Company/Brand',
       'What it does', 'Amount($)', 'Company Name', 'Industry', 'Round/Series',
       'Location', 'About Company'],
      dtype='object')

In [34]:
combined_data.tail()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Company/Brand,What it does,Amount($),Company Name,Industry,Round/Series,Location,About Company
2874,,,,,,,,225000000,,,,,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif..."
2875,,,,,,,,—,,,,,Happyeasygo Group,"Tourism, Travel",Series A,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.
2876,,,,,,,,7500,,,,,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...
2877,,,,,,,,"₹35,000,000",,,,,Droni Tech,Information Technology,Seed,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...
2878,,,,,,,,35000000,,,,,Netmeds,"Biotechnology, Health Care, Pharmaceutical",Series C,"Chennai, Tamil Nadu, India",Welcome to India's most convenient pharmacy!


In [35]:
# Consolidate names of the company into a single column

combined_data['Company_Name'] = combined_data['Company_Brand'].combine_first(combined_data['Company/Brand']).combine_first(combined_data['Company Name'])
combined_data.tail()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Company/Brand,What it does,Amount($),Company Name,Industry,Round/Series,Location,About Company,Company_Name
2874,,,,,,,,225000000,,,,,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",Udaan
2875,,,,,,,,—,,,,,Happyeasygo Group,"Tourism, Travel",Series A,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,Happyeasygo Group
2876,,,,,,,,7500,,,,,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,Mombay
2877,,,,,,,,"₹35,000,000",,,,,Droni Tech,Information Technology,Seed,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,Droni Tech
2878,,,,,,,,35000000,,,,,Netmeds,"Biotechnology, Health Care, Pharmaceutical",Series C,"Chennai, Tamil Nadu, India",Welcome to India's most convenient pharmacy!,Netmeds


In [36]:
combined_data.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Company/Brand',
       'What it does', 'Amount($)', 'Company Name', 'Industry', 'Round/Series',
       'Location', 'About Company', 'Company_Name'],
      dtype='object')

We use the combine_first() method to merge the 'Company_Brand', 'Company/Brand' and 'Company Name' columns into a new 'Company_Name' column. For each row, the method keeps the value from the first column and, where that value is missing, falls back to the value from the next column.
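To make the fallback behaviour concrete, here is a minimal sketch on a toy frame in which each row has a name in exactly one of the three source columns. The frame `names` and its values are for illustration only.

```python
import numpy as np
import pandas as pd

# Each row carries a name in only one of the three source columns
names = pd.DataFrame({
    "Company_Brand": ["upGrad", np.nan, np.nan],
    "Company/Brand": [np.nan, "HomeLane", np.nan],
    "Company Name": [np.nan, np.nan, "Netmeds"],
})

# Take 'Company_Brand' first, then fill gaps from the other two in order
names["Company_Name"] = (
    names["Company_Brand"]
    .combine_first(names["Company/Brand"])
    .combine_first(names["Company Name"])
)
```

Because the yearly files do not overlap row-wise, the order of the columns in the chain does not change the result here; it would matter only if two columns held different values for the same row.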


In [37]:
# Drop the original company name columns ('Company Name', 'Company_Brand', 'Company/Brand')
# now that they are consolidated in 'Company_Name'

combined_data.drop(['Company Name', 'Company_Brand', 'Company/Brand'], axis=1, inplace=True, errors='ignore')
combined_data.tail()

Unnamed: 0,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,What it does,Amount($),Industry,Round/Series,Location,About Company,Company_Name
2874,,,,,,,225000000,,,,"B2B, Business Development, Internet, Marketplace",Series C,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",Udaan
2875,,,,,,,—,,,,"Tourism, Travel",Series A,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,Happyeasygo Group
2876,,,,,,,7500,,,,"Food and Beverage, Food Delivery, Internet",Seed,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,Mombay
2877,,,,,,,"₹35,000,000",,,,Information Technology,Seed,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,Droni Tech
2878,,,,,,,35000000,,,,"Biotechnology, Health Care, Pharmaceutical",Series C,"Chennai, Tamil Nadu, India",Welcome to India's most convenient pharmacy!,Netmeds


In [38]:
combined_data.columns

Index(['Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders',
       'Investor', 'Amount', 'Stage', 'What it does', 'Amount($)', 'Industry',
       'Round/Series', 'Location', 'About Company', 'Company_Name'],
      dtype='object')

We apply the same combine_first() method to the two amount columns rather than dropping 'Amount($)', since it holds the 2019 amounts, which 'Amount' lacks.

In [39]:
# The 'Amount' column is well populated

combined_data['Amount']

0                            $1,200,000
1                          $120,000,000
2                           $30,000,000
3                           $51,000,000
4                            $2,000,000
5                          $188,000,000
6                              $200,000
7                           Undisclosed
8                           Undisclosed
9                            $1,000,000
10                           $3,000,000
11                             $100,000
12                             $700,000
13                           $2,000,000
14                           $9,000,000
15                          $40,000,000
16                          $49,000,000
17                             $400,000
18                             $300,000
19                          $25,000,000
20                         $160,000,000
21                          Undisclosed
22                             $150,000
23                           $1,800,000
24                           $5,000,000


In [40]:
# 'Amount($)', on the other hand, is mostly missing

combined_data['Amount($)']

0                NaN
1                NaN
2                NaN
3                NaN
4                NaN
5                NaN
6                NaN
7                NaN
8                NaN
9                NaN
10               NaN
11               NaN
12               NaN
13               NaN
14               NaN
15               NaN
16               NaN
17               NaN
18               NaN
19               NaN
20               NaN
21               NaN
22               NaN
23               NaN
24               NaN
25               NaN
26               NaN
27               NaN
28               NaN
29               NaN
30               NaN
31               NaN
32               NaN
33               NaN
34               NaN
35               NaN
36               NaN
37               NaN
38               NaN
39               NaN
40               NaN
41               NaN
42               NaN
43               NaN
44               NaN
45               NaN
46               NaN
47           

In [41]:
# Although most values are missing, the data set has 2879 rows in total (see the shape above),
# so this column still holds a few values of its own

combined_data['Amount($)'].isnull().sum()

2790

In [42]:
# Consolidate amounts into a single column (currency symbols and ₹ amounts are not yet converted, despite the 'Amount_USD' name)

combined_data['Amount_USD'] = combined_data['Amount'].combine_first(combined_data['Amount($)'])
combined_data.tail()

Unnamed: 0,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,What it does,Amount($),Industry,Round/Series,Location,About Company,Company_Name,Amount_USD
2874,,,,,,,225000000,,,,"B2B, Business Development, Internet, Marketplace",Series C,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",Udaan,225000000
2875,,,,,,,—,,,,"Tourism, Travel",Series A,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,Happyeasygo Group,—
2876,,,,,,,7500,,,,"Food and Beverage, Food Delivery, Internet",Seed,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,Mombay,7500
2877,,,,,,,"₹35,000,000",,,,Information Technology,Seed,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,Droni Tech,"₹35,000,000"
2878,,,,,,,35000000,,,,"Biotechnology, Health Care, Pharmaceutical",Series C,"Chennai, Tamil Nadu, India",Welcome to India's most convenient pharmacy!,Netmeds,35000000


In [43]:
combined_data.drop(['Amount', 'Amount($)'], inplace=True, axis =1)
combined_data.head()

Unnamed: 0,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Stage,What it does,Industry,Round/Series,Location,About Company,Company_Name,Amount_USD
0,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",Pre-series A,,,,,,Unbox Robotics,"$1,200,000"
1,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,,,,,,upGrad,"$120,000,000"
2,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",Series D,,,,,,Lead School,"$30,000,000"
3,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",Series C,,,,,,Bizongo,"$51,000,000"
4,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",Seed,,,,,,FypMoney,"$2,000,000"


In [44]:
combined_data.columns

Index(['Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders',
       'Investor', 'Stage', 'What it does', 'Industry', 'Round/Series',
       'Location', 'About Company', 'Company_Name', 'Amount_USD'],
      dtype='object')
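The 'Amount_USD' column still contains rupee amounts (for example '₹35,000,000'), so the name only becomes accurate after a conversion step. The sketch below shows one possible approach. The flat rate `INR_PER_USD`, the helper `to_usd` and the assumption that amounts without a symbol are already in US dollars are all hypothetical and would need checking against the data source.

```python
import pandas as pd

# Hypothetical flat exchange rate, for illustration only
INR_PER_USD = 70.0

def to_usd(raw):
    """Convert one raw amount to a float in USD, or NaN if not numeric."""
    if pd.isna(raw):
        return float("nan")
    text = str(raw).strip()
    is_inr = text.startswith("₹")
    number = pd.to_numeric(text.lstrip("$₹").replace(",", ""), errors="coerce")
    if pd.isna(number):
        return float("nan")
    # Assumption: values without a ₹ prefix are already in USD
    return float(number) / INR_PER_USD if is_inr else float(number)

# "\u2014" is the dash placeholder that appears in the 2018 data
amounts = pd.Series(["$1,200,000", "₹35,000,000", "7500", "\u2014", None])
amount_usd = amounts.map(to_usd)
```

A per-year rate would be more faithful than a single flat rate, since the data spans 2018 to 2021.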

The location columns could also use some tidying. Let's look at the missing values in both 'Location' and 'HeadQuarter'; one option is to drop whichever column has more missing values.

In [45]:
combined_data['Location'].isnull().sum()

2353

In [46]:
combined_data['HeadQuarter'].isnull().sum()

640

Alternatively, instead of dropping the 'Location' column, we can extract the city from it and merge that with the 'HeadQuarter' column.

In [47]:
# Split the location column to give part of the string as the headquarter 

combined_data[['City', 'Remaining_Location']] = combined_data['Location'].str.split(pat =',', n = 1, expand=True)

combined_data.tail()

Unnamed: 0,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Stage,What it does,Industry,Round/Series,Location,About Company,Company_Name,Amount_USD,City,Remaining_Location
2874,,,,,,,,,"B2B, Business Development, Internet, Marketplace",Series C,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",Udaan,225000000,Bangalore,"Karnataka, India"
2875,,,,,,,,,"Tourism, Travel",Series A,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,Happyeasygo Group,—,Haryana,"Haryana, India"
2876,,,,,,,,,"Food and Beverage, Food Delivery, Internet",Seed,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,Mombay,7500,Mumbai,"Maharashtra, India"
2877,,,,,,,,,Information Technology,Seed,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,Droni Tech,"₹35,000,000",Mumbai,"Maharashtra, India"
2878,,,,,,,,,"Biotechnology, Health Care, Pharmaceutical",Series C,"Chennai, Tamil Nadu, India",Welcome to India's most convenient pharmacy!,Netmeds,35000000,Chennai,"Tamil Nadu, India"
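The split keeps everything before the first comma as the city. The following toy sketch shows what `n=1` and `expand=True` do, and adds a `str.strip()` step as a precaution against stray whitespace. The Series `locations` is hypothetical; its first two values are taken from the preview above.

```python
import pandas as pd

locations = pd.Series(
    ["Bangalore, Karnataka, India", "Chennai, Tamil Nadu, India", None]
)

# n=1 splits at the first comma only; expand=True returns two columns
parts = locations.str.split(",", n=1, expand=True)

# Column 0 is the city; strip() guards against leading or trailing spaces
city = parts[0].str.strip()
```

Rows with a missing location stay missing in both resulting columns, which is what allows the later fallback from 'HeadQuarter' to 'City' to work.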


In [48]:
# Drop the redundant columns

combined_data.drop(['Remaining_Location', 'Location'], axis = 1, inplace = True, errors = 'ignore')
combined_data.head()

Unnamed: 0,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Stage,What it does,Industry,Round/Series,About Company,Company_Name,Amount_USD,City
0,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",Pre-series A,,,,,Unbox Robotics,"$1,200,000",
1,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,,,,,upGrad,"$120,000,000",
2,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",Series D,,,,,Lead School,"$30,000,000",
3,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",Series C,,,,,Bizongo,"$51,000,000",
4,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",Seed,,,,,FypMoney,"$2,000,000",


In [49]:
combined_data.columns

Index(['Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders',
       'Investor', 'Stage', 'What it does', 'Industry', 'Round/Series',
       'About Company', 'Company_Name', 'Amount_USD', 'City'],
      dtype='object')

Combine the HeadQuarter and City columns into a single State column, keeping the HeadQuarter value where it exists and falling back to City otherwise

In [53]:
combined_data['State'] = combined_data['HeadQuarter'].combine_first(combined_data['City'])
combined_data.head()


Unnamed: 0,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Stage,What it does,Industry,Round/Series,About Company,Company_Name,Amount_USD,City,State
0,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",Pre-series A,,,,,Unbox Robotics,"$1,200,000",,Bangalore
1,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,,,,,upGrad,"$120,000,000",,Mumbai
2,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",Series D,,,,,Lead School,"$30,000,000",,Mumbai
3,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",Series C,,,,,Bizongo,"$51,000,000",,Mumbai
4,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",Seed,,,,,FypMoney,"$2,000,000",,Gurugram
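
To make the behaviour of `combine_first` concrete, here is a minimal sketch on a toy frame. The `toy` name and its values are made up for illustration and are not taken from the dataset. The calling column keeps its value wherever it is non-null, the argument only fills the gaps, and rows missing in both stay null.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate combine_first semantics
toy = pd.DataFrame({
    'HeadQuarter': ['Bangalore', None, None],
    'City': [None, 'Mumbai', None],
})

# HeadQuarter wins where present; City fills the gaps; rows missing both stay null
toy['State'] = toy['HeadQuarter'].combine_first(toy['City'])
print(toy)
```

The order of the two columns matters only for rows where both are filled, since the calling column takes precedence there.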


In [56]:
# Drop redundant columns

combined_data.drop(['HeadQuarter', 'City'], axis=1, inplace=True, errors = 'ignore')
combined_data.head()

Unnamed: 0,Founded,Sector,What_it_does,Founders,Investor,Stage,What it does,Industry,Round/Series,About Company,Company_Name,Amount_USD,State
0,2019.0,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",Pre-series A,,,,,Unbox Robotics,"$1,200,000",Bangalore
1,2015.0,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,,,,,upGrad,"$120,000,000",Mumbai
2,2012.0,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",Series D,,,,,Lead School,"$30,000,000",Mumbai
3,2015.0,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",Series C,,,,,Bizongo,"$51,000,000",Mumbai
4,2021.0,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",Seed,,,,,FypMoney,"$2,000,000",Gurugram


In [57]:
combined_data['Stage'].isnull().sum()

1464

In [59]:
combined_data['Round/Series'].isnull().sum()

2353

Similarly, we can combine the Stage and Round/Series columns to obtain a single Funding_Rounds column

In [61]:

combined_data['Funding_Rounds'] = combined_data['Stage'].combine_first(combined_data['Round/Series'])
combined_data.tail()

Unnamed: 0,Founded,Sector,What_it_does,Founders,Investor,Stage,What it does,Industry,Round/Series,About Company,Company_Name,Amount_USD,State,Funding_Rounds
2874,,,,,,,,"B2B, Business Development, Internet, Marketplace",Series C,"Udaan is a B2B trade platform, designed specif...",Udaan,225000000,Bangalore,Series C
2875,,,,,,,,"Tourism, Travel",Series A,HappyEasyGo is an online travel domain.,Happyeasygo Group,—,Haryana,Series A
2876,,,,,,,,"Food and Beverage, Food Delivery, Internet",Seed,Mombay is a unique opportunity for housewives ...,Mombay,7500,Mumbai,Seed
2877,,,,,,,,Information Technology,Seed,Droni Tech manufacture UAVs and develop softwa...,Droni Tech,"₹35,000,000",Mumbai,Seed
2878,,,,,,,,"Biotechnology, Health Care, Pharmaceutical",Series C,Welcome to India's most convenient pharmacy!,Netmeds,35000000,Chennai,Series C
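
Because `combine_first` silently prefers the first column, it can be worth checking, before the source columns are dropped, whether any row has both columns filled with different values. The following is a sketch under stated assumptions: `count_conflicts` is a hypothetical helper, not part of pandas or of this notebook, and the toy values are made up.

```python
import pandas as pd

def count_conflicts(df, first, second):
    """Count rows where both columns are non-null but hold different values."""
    both_filled = df[first].notna() & df[second].notna()
    return int((both_filled & (df[first] != df[second])).sum())

# Hypothetical toy data: only the third row has two disagreeing values
toy = pd.DataFrame({
    'Stage': ['Seed', None, 'Series A', None],
    'Round/Series': [None, 'Series C', 'Series B', None],
})

conflicts = count_conflicts(toy, 'Stage', 'Round/Series')
print(conflicts)
```

A result of zero on the real data would suggest the merge loses nothing; a non-zero result would mean the values from the second column are discarded for those rows.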


In [64]:
#Drop redundant columns

combined_data.drop(['Stage', 'Round/Series'], axis=1, inplace=True, errors='ignore')
combined_data.tail()

Unnamed: 0,Founded,Sector,What_it_does,Founders,Investor,What it does,Industry,About Company,Company_Name,Amount_USD,State,Funding_Rounds
2874,,,,,,,"B2B, Business Development, Internet, Marketplace","Udaan is a B2B trade platform, designed specif...",Udaan,225000000,Bangalore,Series C
2875,,,,,,,"Tourism, Travel",HappyEasyGo is an online travel domain.,Happyeasygo Group,—,Haryana,Series A
2876,,,,,,,"Food and Beverage, Food Delivery, Internet",Mombay is a unique opportunity for housewives ...,Mombay,7500,Mumbai,Seed
2877,,,,,,,Information Technology,Droni Tech manufacture UAVs and develop softwa...,Droni Tech,"₹35,000,000",Mumbai,Seed
2878,,,,,,,"Biotechnology, Health Care, Pharmaceutical",Welcome to India's most convenient pharmacy!,Netmeds,35000000,Chennai,Series C


In [66]:
combined_data['What it does'].isnull().sum()

2790

In [68]:
combined_data['What_it_does'].isnull().sum()

615

In [69]:
combined_data['About Company'].isnull().sum()

2353
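
The three null counts above can also be obtained in one call by selecting the columns together. The sketch below uses a made-up toy frame (the `toy` values and the `profile_cols` name are assumptions for illustration) and additionally counts rows where every description column is missing, which is how many nulls a merged column would still have.

```python
import pandas as pd

# Hypothetical toy data with the same three column names as the notebook
toy = pd.DataFrame({
    'What_it_does': ['Online pharmacy', None, None],
    'What it does': [None, 'Travel portal', None],
    'About Company': [None, None, 'B2B trade platform'],
})

profile_cols = ['What_it_does', 'What it does', 'About Company']

# Null count per column, returned as a Series indexed by column name
null_counts = toy[profile_cols].isnull().sum()

# Rows where all three description columns are missing at once
all_missing = int(toy[profile_cols].isnull().all(axis=1).sum())

print(null_counts)
print(all_missing)
```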

The What_it_does, What it does and About Company columns likewise hold the same kind of information, so they are merged into a single Company_Profile column

In [70]:
# Merge the three description columns into a single Company_Profile column

combined_data['Company_Profile'] = combined_data['What_it_does'].combine_first(combined_data['What it does'].combine_first(combined_data['About Company']))
combined_data.head()

Unnamed: 0,Founded,Sector,What_it_does,Founders,Investor,What it does,Industry,About Company,Company_Name,Amount_USD,State,Funding_Rounds,Company_Profile
0,2019.0,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",,,,Unbox Robotics,"$1,200,000",Bangalore,Pre-series A,Unbox Robotics builds on-demand AI-driven ware...
1,2015.0,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,,,upGrad,"$120,000,000",Mumbai,,UpGrad is an online higher education platform.
2,2012.0,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",,,,Lead School,"$30,000,000",Mumbai,Series D,LEAD School offers technology based school tra...
3,2015.0,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",,,,Bizongo,"$51,000,000",Mumbai,Series C,Bizongo is a business-to-business online marke...
4,2021.0,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",,,,FypMoney,"$2,000,000",Gurugram,Seed,"FypMoney is Digital NEO Bank for Teenagers, em..."
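
When more than two columns are merged, the nested `combine_first` calls can be folded into a small helper. This is a sketch only: `coalesce` is a hypothetical name introduced here, not a pandas function, and the toy values are made up. It scans the listed columns left to right and keeps the first non-null value per row.

```python
from functools import reduce

import pandas as pd

def coalesce(df, columns):
    """Return, per row, the first non-null value found scanning columns left to right."""
    return reduce(
        lambda acc, col: acc.combine_first(df[col]),
        columns[1:],
        df[columns[0]],
    )

# Hypothetical toy data with the same three column names as the notebook
toy = pd.DataFrame({
    'What_it_does': ['Online pharmacy', None, None],
    'What it does': [None, 'Travel portal', None],
    'About Company': [None, None, 'B2B trade platform'],
})

toy['Company_Profile'] = coalesce(
    toy, ['What_it_does', 'What it does', 'About Company']
)
print(toy['Company_Profile'].tolist())
```

The column order in the list sets the precedence, so the most trusted source should come first.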


In [71]:
# Drop the source columns now that Company_Profile holds their values

combined_data.drop(['What_it_does', 'What it does', 'About Company'], axis=1, inplace=True, errors='ignore')
combined_data.tail()

Unnamed: 0,Founded,Sector,Founders,Investor,Industry,Company_Name,Amount_USD,State,Funding_Rounds,Company_Profile
2874,,,,,"B2B, Business Development, Internet, Marketplace",Udaan,225000000,Bangalore,Series C,"Udaan is a B2B trade platform, designed specif..."
2875,,,,,"Tourism, Travel",Happyeasygo Group,—,Haryana,Series A,HappyEasyGo is an online travel domain.
2876,,,,,"Food and Beverage, Food Delivery, Internet",Mombay,7500,Mumbai,Seed,Mombay is a unique opportunity for housewives ...
2877,,,,,Information Technology,Droni Tech,"₹35,000,000",Mumbai,Seed,Droni Tech manufacture UAVs and develop softwa...
2878,,,,,"Biotechnology, Health Care, Pharmaceutical",Netmeds,35000000,Chennai,Series C,Welcome to India's most convenient pharmacy!


In [72]:
combined_data.columns

Index(['Founded', 'Sector', 'Founders', 'Investor', 'Industry', 'Company_Name',
       'Amount_USD', 'State', 'Funding_Rounds', 'Company_Profile'],
      dtype='object')