## Indian Start-up Funding Analysis 




### Summary
This project seeks to gain insight into the funding received by start-up companies in India between 2018 and 2021. As analysts on a team looking to venture into the Indian start-up ecosystem, we seek to propose the best course of action by analysing the funding received by these start-ups.

### Business Understanding



Start-up funding plays a crucial role in providing the capital needed to nurture new businesses that drive economic growth and technological advancement. The Indian start-up ecosystem spans various sectors such as e-commerce, fintech, edtech, healthtech and agritech.
This project aims to equip the team with knowledge and strategic insights on the most promising sectors, cities, funding trends and other relevant information needed to make informed decisions before venturing into that ecosystem.


Key business questions to be answered by analysing the data include:

1. Which sectors have shown the highest growth in terms of funding over the four year period under review?

2. What are the preferred locations for the majority of start-ups?

3. Who are the main investors in the start-up ecosystem?

4. Which sectors receive the highest level of funding and which receive the lowest?

5. What is the average amount of funding across the start-up sectors?



#### Hypothesis


Null hypothesis (H0): The amount of funding does not depend on the sector of the start-up.

Alternative hypothesis (H1): The amount of funding depends on the sector of the start-up.


### Data Understanding


 
There are four datasets, one for each year under review, each giving information about start-up funding for that year. The datasets contain attributes such as company name, sector, funding amount and company location.

The key attributes in the datasets, with their descriptions, are:

Company/Brand: Name of the company/start-up

Founded: Year start-up was founded

Sector: Sector of service

What it does: Description about Company

Founders: Founders of the Company

Investor: Investors

Amount($): Raised fund

Stage: Round of funding reached




 

### Importing the various libraries and modules to help in data preparation and analysis

In [1]:
import pyodbc
import pandas as pd
from dotenv import dotenv_values
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import re


pd.set_option("display.max_columns", None)

pd.set_option('display.max_rows', None)

import warnings 

warnings.filterwarnings('ignore')

### Loading the datasets to be used for the analysis

In [2]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')


# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")

connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"

# creating a connection to the database
connection = pyodbc.connect(connection_string)

# Selecting the datasets from the database
query1 = ''' SELECT * FROM dbo.LP1_startup_funding2021'''
query2 = ''' SELECT * FROM dbo.LP1_startup_funding2020'''
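
If a key is missing from the `.env` file, `environment_variables.get(...)` returns `None` and the connection string silently contains the text `None`. A minimal sketch of a guard against that is shown below. The helper name `build_connection_string` and the placeholder credentials are assumptions for illustration and are not part of the original notebook.

```python
REQUIRED_KEYS = ("SERVER", "DATABASE", "USERNAME", "PASSWORD")

def build_connection_string(env):
    """Build the ODBC connection string, failing early if a credential is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"Missing values in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['USERNAME']};PWD={env['PASSWORD']};"
        "MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
    )

# Example with placeholder credentials (hypothetical values)
example_env = {"SERVER": "example-server", "DATABASE": "example_db",
               "USERNAME": "user", "PASSWORD": "secret"}
connection_string_example = build_connection_string(example_env)
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed in place of `example_env`.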

In [3]:
# Loading the datasets

data_2021 = pd.read_sql(query1, connection)
data_2020 = pd.read_sql(query2, connection)
data_2019 = pd.read_csv(r"C:\Users\bb\Desktop\Azubi\Career_Acc\LP\LP1-The-Indian-Startup-Ecosystem-\startup_funding2019.csv")
data_2018 = pd.read_csv(r"C:\Users\bb\Desktop\Azubi\Career_Acc\LP\LP1-The-Indian-Startup-Ecosystem-\startup_funding2018.csv")


### EDA


#### Exploring the 2018 Dataset

In [4]:

data_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [5]:
data_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [6]:
# A brief description of the 2018 dataset

data_2018.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq
Company Name,526,525,TheCollegeFever,2
Industry,526,405,—,30
Round/Series,526,21,Seed,280
Amount,526,198,—,148
Location,526,50,"Bangalore, Karnataka, India",102
About Company,526,524,"TheCollegeFever is a hub for fun, fiesta and f...",2


In [7]:
# Checking for duplicates

data_2018.duplicated().sum()

1

#### Exploring the 2019 Dataset

In [8]:
data_2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [9]:
data_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [10]:
# Checking for NaN values in the dataset

data_2019.isna().sum()

Company/Brand     0
Founded          29
HeadQuarter      19
Sector            5
What it does      0
Founders          3
Investor          0
Amount($)         0
Stage            46
dtype: int64

In [11]:
# A brief description of the 2019 dataset

data_2019.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Company/Brand,89.0,87.0,Kratikal,2.0,,,,,,,
Founded,60.0,,,,2014.533333,2.937003,2004.0,2013.0,2015.0,2016.25,2019.0
HeadQuarter,70.0,17.0,Bangalore,21.0,,,,,,,
Sector,84.0,52.0,Edtech,7.0,,,,,,,
What it does,89.0,88.0,Online meat shop,2.0,,,,,,,
Founders,86.0,85.0,"Vivek Gupta, Abhay Hanjura",2.0,,,,,,,
Investor,89.0,86.0,Undisclosed,3.0,,,,,,,
Amount($),89.0,50.0,Undisclosed,12.0,,,,,,,
Stage,43.0,15.0,Series A,10.0,,,,,,,


In [12]:
# Checking for duplicates

data_2019.duplicated().sum()

0

In [13]:
# Checking the years the companies were founded to see if they qualify to be considered start-ups

data_2019['Founded'].unique()

array([  nan, 2014., 2004., 2013., 2010., 2018., 2019., 2017., 2011.,
       2015., 2016., 2012., 2008.])

#### Exploring the 2020 Dataset

In [14]:
data_2020.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [15]:
data_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.6+ KB


In [16]:
# Checking for NaN values in the dataset

data_2020.isna().sum()

Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64

In [17]:
# A brief description of the 2020 dataset

data_2020.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Company_Brand,1055.0,905.0,Nykaa,6.0,,,,,,,
Founded,842.0,,,,2015.36342,4.097909,1973.0,2014.0,2016.0,2018.0,2020.0
HeadQuarter,961.0,77.0,Bangalore,317.0,,,,,,,
Sector,1042.0,302.0,Fintech,80.0,,,,,,,
What_it_does,1055.0,990.0,Provides online learning classes,4.0,,,,,,,
Founders,1043.0,927.0,Falguni Nayar,6.0,,,,,,,
Investor,1017.0,848.0,Venture Catalysts,20.0,,,,,,,
Amount,801.0,,,,113042969.543071,2476634939.888352,12700.0,1000000.0,3000000.0,11000000.0,70000000000.0
Stage,591.0,42.0,Series A,96.0,,,,,,,
column10,2.0,2.0,Pre-Seed,1.0,,,,,,,


In [18]:
# Checking for duplicates

data_2020.duplicated().sum()

3

In [19]:
# Checking the years the companies were founded to see if they qualify to be considered start-ups

data_2020['Founded'].unique()

array([2019., 2018., 2020., 2016., 2008., 2015., 2017., 2014., 1998.,
       2007., 2011., 1982., 2013., 2009., 2012., 1995., 2010., 2006.,
       1978.,   nan, 1999., 1994., 2005., 1973., 2002., 2004., 2001.])

#### Exploring the 2021 Dataset

In [20]:
data_2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [21]:
data_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


In [22]:
# Checking for NaN values in the dataset

data_2021.isna().sum()

Company_Brand      0
Founded            1
HeadQuarter        1
Sector             0
What_it_does       0
Founders           4
Investor          62
Amount             3
Stage            428
dtype: int64

In [23]:
# A brief description of the 2021 dataset

data_2021.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Company_Brand,1209.0,1033.0,BharatPe,8.0,,,,,,,
Founded,1208.0,,,,2016.655629,4.517364,1963.0,2015.0,2018.0,2020.0,2021.0
HeadQuarter,1208.0,70.0,Bangalore,426.0,,,,,,,
Sector,1209.0,254.0,FinTech,122.0,,,,,,,
What_it_does,1209.0,1143.0,BharatPe develops a QR code-based payment app ...,4.0,,,,,,,
Founders,1205.0,1095.0,"Ashneer Grover, Shashvat Nakrani",7.0,,,,,,,
Investor,1147.0,937.0,Inflection Point Ventures,24.0,,,,,,,
Amount,1206.0,278.0,$Undisclosed,73.0,,,,,,,
Stage,781.0,31.0,Seed,246.0,,,,,,,


In [24]:
# Checking for duplicates

data_2021.duplicated().sum()

19

In [25]:
# Checking the years the companies were founded to see if they qualify to be considered start-ups

data_2021['Founded'].unique()

array([2019., 2015., 2012., 2021., 2014., 2018., 2016., 2020., 2010.,
       2017., 1993., 2008., 2013., 1999., 1989., 2011.,   nan, 2009.,
       2002., 1994., 2006., 2000., 2007., 1978., 2003., 1998., 1991.,
       1984., 2004., 2005., 1963.])

#### Issues arising from the datasets

1. There are duplicates in the datasets.
2. Some of the company names are not properly spaced.
3. Missing values are represented as "—" in the Industry and Amount columns of data_2018.
4. The Location column in data_2018 includes the city, state and country. We only want the city.
5. The Amount column in data_2018 has two different currencies and some non-numeric values.
6. Column names for the same category differ across the datasets.
7. Some columns have a very high count of null values.
8. Data types differ for some columns of the same category.
9. Some companies were founded more than 10 years before the funding year, which exceeds the threshold for qualifying as a start-up in India.



#### To clean the data properly for analysis, all four datasets have to be merged into one.
#### To streamline this process, we standardise the column names and add or drop columns where necessary.



In [26]:
# standardise the column name for the company name category

data_2018.rename(columns={'Company Name': 'Company_Name'}, inplace=True)
data_2019.rename(columns={'Company/Brand': 'Company_Name'}, inplace=True)
data_2020.rename(columns={'Company_Brand': 'Company_Name'}, inplace=True)
data_2021.rename(columns={'Company_Brand': 'Company_Name'}, inplace=True)

# standardise the column name for the company description category

data_2018.rename(columns={'About Company': 'About_Company'}, inplace=True)
data_2019.rename(columns={'What it does': 'About_Company'}, inplace=True)
data_2020.rename(columns={'What_it_does': 'About_Company'}, inplace=True)
data_2021.rename(columns={'What_it_does': 'About_Company'}, inplace=True)

# add a funding year column to identify the year of each dataset

data_2018['funding_year'] = 2018
data_2019['funding_year'] = 2019
data_2020['funding_year'] = 2020
data_2021['funding_year'] = 2021

# add the Investor column, which is missing from the 2018 dataset

data_2018['Investor'] = 'undisclosed'

# Rename column names to match the names in the other datasets
# (only the 2019 dataset uses 'Amount($)'; the others already use 'Amount')

data_2019.rename(columns={'Amount($)': 'Amount'}, inplace=True)
data_2018.rename(columns={'Industry': 'Sector'}, inplace=True)
data_2018.rename(columns={'Round/Series': 'Stage'}, inplace=True)
data_2018.rename(columns={'Location': 'HeadQuarter'}, inplace=True)

# drop companies which are above the 10 year founding period threshold to be considered as start-ups

threshold_2019 = 2009
condition_1 = (data_2019['Founded'] < threshold_2019) & data_2019['Founded'].notna()
data_2019 = data_2019[~condition_1]

threshold_2020 = 2010
condition_2 = (data_2020['Founded'] < threshold_2020) & data_2020['Founded'].notna()
data_2020 = data_2020[~condition_2]

threshold_2021 = 2011
condition_3 = (data_2021['Founded'] < threshold_2021) & data_2021['Founded'].notna()
data_2021 = data_2021[~condition_3]

# drop the column with almost no data 

data_2020.drop(['column10'],axis=1,inplace=True)


# drop columns not needed for analysis

data_2021.drop(columns=['Founders', 'Founded'], inplace=True)
data_2020.drop(columns=['Founders', 'Founded'], inplace=True)
data_2019.drop(columns=['Founders', 'Founded'], inplace=True)
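
The three threshold blocks in this cell repeat the same rule: drop a company when it was founded more than 10 years before the funding year, and keep it when the founding year is unknown. A minimal sketch of that rule as one reusable function follows. The name `drop_old_companies` and the sample frame are hypothetical, and the function has to run before the `Founded` column is dropped.

```python
import pandas as pd

def drop_old_companies(df, funding_year, max_age=10):
    """Keep rows founded within `max_age` years of `funding_year`.

    Rows with a missing founding year are kept.
    """
    cutoff = funding_year - max_age
    too_old = df["Founded"] < cutoff   # NaN compares as False, so unknown years are kept
    return df[~too_old]

# Small made-up frame to show the behaviour
sample = pd.DataFrame({
    "Company_Name": ["A", "B", "C"],
    "Founded": [2004.0, 2015.0, None],
})
kept = drop_old_companies(sample, funding_year=2019)
print(kept["Company_Name"].tolist())
```

With `funding_year=2019` the cutoff is 2009, which matches `threshold_2019` in the cell above.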



In [27]:
# converting all amounts in rupees(₹) to dollars($) for the 2018 dataset

exchange_rates_2018 = 0.0146
    

# Function to convert currency
def convert_currency(amount):
    if amount.startswith('₹'):
        cleaned_amount = amount.replace('₹', '').replace(',', '')
        try:
            return float(cleaned_amount) * exchange_rates_2018
        except ValueError:
            return amount   # Keep the original value for invalid formats
    else:
        try:
            return float(amount)
        except ValueError:
            return amount

# Apply the function to the 'Amount' column
data_2018['Amount'] = data_2018['Amount'].apply(convert_currency)
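
The function above keeps unparseable values such as "—" as strings and assumes every value is a string. A hedged variant is sketched below that returns `None` for anything it cannot parse and tolerates missing values. The name `to_usd` and the handling of a leading `$` are assumptions and do not come from the original notebook.

```python
INR_TO_USD_2018 = 0.0146  # same rate as the cell above

def to_usd(amount, rate=INR_TO_USD_2018):
    """Return the amount in dollars as a float, or None when it cannot be parsed."""
    if amount is None or amount != amount:   # None or NaN
        return None
    if not isinstance(amount, str):
        return float(amount)
    text = amount.strip().replace(",", "")
    if text.startswith("₹"):
        text, factor = text[1:], rate        # rupee amounts are converted
    else:
        text, factor = text.lstrip("$"), 1.0  # everything else is treated as dollars
    try:
        return float(text) * factor
    except ValueError:
        return None                          # e.g. the "—" placeholder

examples = ["₹40,000,000", "250000", "$2,000", "—", None]
converted = [to_usd(value) for value in examples]
```

Returning `None` rather than the original string means the column can later be converted to a numeric dtype without a separate clean-up pass.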



### Merging the datasets into one

In [28]:
data_combined = pd.concat([data_2021, data_2020, data_2019,data_2018], ignore_index=True)
data_combined.head(10)

Unnamed: 0,Company_Name,HeadQuarter,Sector,About_Company,Investor,Amount,Stage,funding_year
0,Unbox Robotics,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021
1,upGrad,Mumbai,EdTech,UpGrad is an online higher education platform.,"Unilazer Ventures, IIFL Asset Management","$120,000,000",,2021
2,Lead School,Mumbai,EdTech,LEAD School offers technology based school tra...,"GSV Ventures, Westbridge Capital","$30,000,000",Series D,2021
3,Bizongo,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"CDC Group, IDG Capital","$51,000,000",Series C,2021
4,FypMoney,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...","Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,2021
5,Urban Company,New Delhi,Home services,Urban Company (Formerly UrbanClap) is a home a...,Vy Capital,"$188,000,000",,2021
6,Comofi Medtech,Bangalore,HealthTech,Comofi Medtech is a healthcare robotics startup.,"CIIE.CO, KIIT-TBI","$200,000",,2021
7,Qube Health,Mumbai,HealthTech,India's Most Respected Workplace Healthcare Ma...,Inflection Point Ventures,Undisclosed,Pre-series A,2021
8,Vitra.ai,Bangalore,Tech Startup,Vitra.ai is an AI-based video translation plat...,Inflexor Ventures,Undisclosed,,2021
9,Fitterfly,Mumbai,HealthTech,Fitterfly offers customized and personalized w...,"9Unicorns Accelerator Fund, Metaform Ventures","$3,000,000",Pre-series A,2021


### Data Cleaning

The first step is to drop duplicates from the dataset. Each column is then investigated and the appropriate cleaning method is applied to resolve the issues identified.

In [29]:
# check the datatype and details of the columns

data_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2758 entries, 0 to 2757
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company_Name   2758 non-null   object
 1   HeadQuarter    2648 non-null   object
 2   Sector         2740 non-null   object
 3   About_Company  2758 non-null   object
 4   Investor       2664 non-null   object
 5   Amount         2506 non-null   object
 6   Stage          1900 non-null   object
 7   funding_year   2758 non-null   int64 
dtypes: int64(1), object(7)
memory usage: 172.5+ KB


In [30]:
# Checking for duplicates in dataset

data_combined.duplicated().sum()

22

In [31]:
# drop duplicate values from dataset

data_combined.drop_duplicates(inplace=True)

#### Company_Name column

In [32]:
# Checking for NaN values in the dataset

data_combined['Company_Name'].isna().sum()

0

#### Funding year column

In [33]:
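# convert the year to a datetime format
# (an explicit format is needed; otherwise integers are read as nanoseconds since 1970-01-01)

data_combined['funding_year'] = pd.to_datetime(data_combined['funding_year'].astype(str), format='%Y')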

data_combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 2757
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Company_Name   2736 non-null   object        
 1   HeadQuarter    2626 non-null   object        
 2   Sector         2718 non-null   object        
 3   About_Company  2736 non-null   object        
 4   Investor       2643 non-null   object        
 5   Amount         2485 non-null   object        
 6   Stage          1886 non-null   object        
 7   funding_year   2736 non-null   datetime64[ns]
dtypes: datetime64[ns](1), object(7)
memory usage: 192.4+ KB
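
Integer years need care with `pd.to_datetime`. Without a format, pandas reads integers as nanoseconds since the Unix epoch, so 2018 becomes a timestamp in 1970 even though the dtype still shows `datetime64[ns]`. The short sketch below, on a made-up series, shows the difference.

```python
import pandas as pd

years = pd.Series([2018, 2021])

# Without a format, integers are read as nanoseconds since 1970-01-01
naive = pd.to_datetime(years)

# Casting to string and passing format="%Y" parses them as calendar years
parsed = pd.to_datetime(years.astype(str), format="%Y")

print(naive.dt.year.tolist())   # [1970, 1970]
print(parsed.dt.year.tolist())  # [2018, 2021]
```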


#### HeadQuarter column 

Missing values are filled with the most common city.

In [34]:
#  Run samples to check the data

data_combined['HeadQuarter'].sample(n=10)

763     New Delhi
1074    Bangalore
1581       Mumbai
82      Bangalore
1122    Bangalore
1689         Pune
588     New Delhi
1460     Gurugram
802       Chennai
1225      Kolkata
Name: HeadQuarter, dtype: object

In [35]:
# Split and extract only cities for the column

data_combined["HeadQuarter"] = data_combined['HeadQuarter'].str.split(',').str[0]
data_combined['HeadQuarter'].sample(n=10)

332      Gurugram
2469        Delhi
2507    Ghaziabad
1739    Bangalore
1276    Bangalore
941      Gurugram
384       Gurgaon
301        Mumbai
872     Bangalore
1270    Bangalore
Name: HeadQuarter, dtype: object

In [36]:
# Checking for NaN values in the dataset

data_combined['HeadQuarter'].isna().sum()

110

In [37]:
# check for the most common city in the headQuarter column

most_common_city = data_combined['HeadQuarter'].mode()[0]
most_common_city

'Bangalore'

In [38]:
# Fill missing values with the most common city

data_combined['HeadQuarter'] = data_combined['HeadQuarter'].fillna(most_common_city)
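
The samples above show both "Gurgaon" and "Gurugram", which are two names for the same city, so city counts can be split across spellings. A minimal sketch of one way to merge such aliases follows. The `CITY_ALIASES` table is a hypothetical starting point and should be extended after inspecting `data_combined['HeadQuarter'].unique()`.

```python
import pandas as pd

# Hypothetical alias table; extend it after inspecting the unique city values
CITY_ALIASES = {"Gurgaon": "Gurugram", "Bengaluru": "Bangalore"}

cities = pd.Series(["Gurgaon", "Gurugram", " Bangalore", "Bengaluru", "Pune"])

# Strip stray whitespace, then map each alias to its canonical name
normalised = cities.str.strip().replace(CITY_ALIASES)
print(normalised.tolist())
```

Applied to the notebook, the same two calls could be chained on `data_combined['HeadQuarter']` before counting start-ups per city.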

#### The Sector column: 

We regroup the sectors into 10 main industries.

In [39]:
data_combined['Sector'].sample(n=50)

2753     B2B, Business Development, Internet, Marketplace
2303                           Consulting, Retail, Social
2388                 Banking, Finance, Financial Services
2077                                               Edtech
1207                           Conversational AI platform
1384                                            LegalTech
2206                                      Automotive tech
681                                          Tech Startup
1422                                              FinTech
770                                                EdTech
1970                                           Automotive
2725                        Health Care, Health Insurance
2659                                Health Care, Hospital
301                                            Healthcare
1894                                             Foodtech
1303                                          QSR startup
2243                                             Internet
1695          

In [40]:
data_combined['Sector'].unique()

array(['AI startup', 'EdTech', 'B2B E-commerce', 'FinTech',
       'Home services', 'HealthTech', 'Tech Startup', 'B2B service',
       'Helathcare', 'Renewable Energy', 'E-commerce', 'IT startup',
       'Food & Beverages', 'Aeorspace', 'Deep Tech', 'Dating', 'Gaming',
       'Robotics', 'Retail', 'Food', 'Oil and Energy', 'AgriTech',
       'Electronics', 'Milk startup', 'AI Chatbot', 'IT', 'Logistics',
       'Hospitality', 'Fashion', 'Marketing', 'Transportation',
       'LegalTech', 'Food delivery', 'Automotive', 'SaaS startup',
       'Fantasy sports', 'Video communication', 'Social Media',
       'Skill development', 'Rental', 'Recruitment', 'HealthCare',
       'Sports', 'Computer Games', 'Consumer Goods',
       'Information Technology', 'Apparel & Fashion',
       'Logistics & Supply Chain', 'Healthtech', 'Healthcare',
       'SportsTech', 'HRTech', 'Wine & Spirits',
       'Mechanical & Industrial Engineering', 'Spiritual',
       'Financial Services', 'Industrial Automation

In [None]:
# Map raw sector labels onto broader industry groups using case-insensitive keyword matching
def sector_redistribution(sector):
    if not isinstance(sector, str):
        return sector   # leave NaN / missing values untouched
    if re.search(r'bank|fintech|finance|crypto|account|credit|venture|crowd|blockchain',
                 sector, re.IGNORECASE):
        return 'Finance'
    elif re.search(r'automotive|transport|logistics|vehicle', sector, re.IGNORECASE):
        return 'Transport'
    elif re.search(r'health|healtcare|helathcare|pharmaceutical', sector, re.IGNORECASE):
        return 'Health'
    else:
        return sector
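
Two details matter when building keyword patterns like these. A trailing `|` adds an empty alternative that matches every string, so the first branch would always win. Matching is also case-sensitive by default, so a lowercase keyword misses labels such as "FinTech". The sketch below illustrates both points on made-up labels.

```python
import re

# A trailing "|" adds an empty alternative, so the pattern matches every string
assert re.search(r"bank|fintech|", "Retail") is not None

# Without it, only real keywords match; IGNORECASE lets "fintech" match "FinTech"
pattern = re.compile(r"bank|fintech", re.IGNORECASE)
matches = {label: bool(pattern.search(label)) for label in ["FinTech", "Retail", "Banking"]}
print(matches)
```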

In [41]:
# Checking for NaN values in the dataset

data_combined['Sector'].isna().sum()

18

#### The Amount column

We assume all the amounts are in dollars, as those denominated in rupees have been converted to dollars.

In [42]:
# Run samples to check the data

data_combined['Amount'].sample(n=50)

1753             NaN
1722       6609000.0
708      $12,000,000
86          $2500000
1897             NaN
1339       4000000.0
1426       1000000.0
1713       1100000.0
1696       2250000.0
24       $53,000,000
2239        730000.0
2687        400000.0
2093         12700.0
2012      24000000.0
1765       1326000.0
1131        $2000000
2682       8760000.0
882         $1500000
1855      12000000.0
1724      27700000.0
1775             NaN
1047    $Undisclosed
864       $$1,55,000
1981      75000000.0
1594        350000.0
2315        496400.0
330      $10,000,000
2366        $900,000
1841       1500000.0
533         $3000000
2562        292000.0
1397      55000000.0
1142        600000.0
2475               —
2352               —
216         $7000000
2723               —
719       $5,000,000
1086        $1500000
1147             NaN
2314        120000.0
1633             NaN
715       $6,300,000
1009        $5000000
149         $3800000
476         $4500000
541          $370000
687       $1,

In [43]:
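# strip currency symbols, commas and text, keeping digits and the decimal point
# (removing the decimal point would turn '6609000.0' into '66090000')

data_combined['Amount'] = data_combined['Amount'].apply(lambda x: re.sub(r'[^\d.]', '', str(x)))

# convert datatype to float; empty strings (undisclosed or missing amounts) become NaN

data_combined['Amount'] = pd.to_numeric(data_combined['Amount'], errors='coerce')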

data_combined.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 2757
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Company_Name   2736 non-null   object        
 1   HeadQuarter    2736 non-null   object        
 2   Sector         2718 non-null   object        
 3   About_Company  2736 non-null   object        
 4   Investor       2643 non-null   object        
 5   Amount         2180 non-null   float64       
 6   Stage          1886 non-null   object        
 7   funding_year   2736 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(6)
memory usage: 192.4+ KB
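
The choice of regular expression matters here because the column mixes strings such as `$1,200,000` with floats such as `6609000.0`. The sketch below compares two patterns on values taken from the samples above. Stripping every non-digit also removes the decimal point and inflates float values tenfold, while keeping the point preserves them.

```python
import re

raw_values = ["$1,200,000", "6609000.0", "$$1,55,000", "$Undisclosed", "—"]

# Remove every non-digit character: the decimal point is lost
digits_only = [re.sub(r"\D", "", value) for value in raw_values]

# Remove everything except digits and the decimal point
keep_decimal = [re.sub(r"[^\d.]", "", value) for value in raw_values]

print(digits_only)   # ['1200000', '66090000', '155000', '', '']
print(keep_decimal)  # ['1200000', '6609000.0', '155000', '', '']
```

The empty strings left by "Undisclosed" and "—" become `NaN` once the column is converted to a numeric type.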


#### The Investor column

In [44]:
data_combined['Investor'].sample(n=50)

1947                                        KA Enterprise
1355                Whiteboard Capital, VKG Ventures LLP.
43                   DSG Consumer Partners, Saama Capital
2545                                          undisclosed
2745                                          undisclosed
955                          Blume, Alpha Wave Incubation
633               Paradigm Shift Capital, AngelList India
1111    HOF Capital, Old Well Ventures, LetsVenture, 9...
2526                                          undisclosed
1477                               US based NRI Investors
2674                                          undisclosed
808                                       GSF Accelerator
1082                             Omnivore, India Quotient
1024                           James Milner, Adam Lallana
1444                                     Homage Ventures.
680      9Unicorns Accelerator Fund, Indian Angel Network
1609                                       Better Capital
276           