# Analysis of Funding Received by Start-ups in India from 2018 to 2021

## 1. Business Understanding

### 1.1 Project Description
This data analysis project examines the funding received by start-ups in India from 2018 to 2021. The objective is to understand the Indian start-up ecosystem and propose the best course of action for our team's venture. By analyzing funding amounts, start-up details, and investor information, we aim to uncover prevailing patterns and opportunities that can inform decision-making.

### 1.2 The type of the problem
This is an exploratory data analysis and visualization project: our aim is to uncover hidden patterns and insights in the available data.



## 2. Data understanding
The data used in this project describes funding received by start-up companies in India over the period 2018 to 2021. It comes from four datasets, one per year: the 2018 and 2019 data are CSV files, while the 2020 and 2021 data are tables in a SQL Server database.

## 3. Data preparation
### 3.1 Installing and importing libraries

In [2]:
%pip install pyodbc  
%pip install python-dotenv 





In [3]:
import pyodbc
from dotenv import dotenv_values 

# Analysis libraries
import pandas as pd 
import numpy as np
from sklearn.impute import SimpleImputer

# Visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Warning libraries
import warnings
warnings.filterwarnings('ignore')

### 3.2 Loading data

In [4]:
data2018 = pd.read_csv('data/startup_funding2018.csv')
data2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [5]:
data2019 = pd.read_csv('data/startup_funding2019.csv')
data2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [6]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')


# Get the values for the credentials you set in the '.env' file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")


connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
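
If a key is missing from the `.env` file, `.get()` returns `None` and the connection string silently contains the text `None`. A minimal sketch of a guard against that is shown below; the helper name `build_connection_string` is hypothetical and not part of the original notebook, and the example credentials are placeholders.

```python
def build_connection_string(env):
    """Build a SQL Server connection string from a mapping of credentials.

    `env` is any mapping, for example the result of dotenv_values('.env').
    Raises ValueError when a required key is absent or empty.
    """
    required = ["SERVER", "DATABASE", "USERNAME", "PASSWORD"]
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError(f"Missing credentials in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};"
        f"DATABASE={env['DATABASE']};"
        f"UID={env['USERNAME']};"
        f"PWD={env['PASSWORD']}"
    )


# Placeholder values for illustration only
example = {"SERVER": "host", "DATABASE": "db", "USERNAME": "user", "PASSWORD": "secret"}
print(build_connection_string(example))
```

Failing early with a clear message is usually easier to debug than a driver error raised later by `pyodbc.connect`.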

In [7]:
# Connect to the server by passing the connection string to pyodbc's connect method.
# This may take a few seconds to complete.
# Check your internet connection if it takes much longer than that.

connection = pyodbc.connect(connection_string)

In [8]:
# The SQL query below retrieves the 2020 data.
# Note that you will not have permission to insert, delete, or update rows in this database table.

query = "Select * from dbo.LP1_startup_funding2020"
data2020 = pd.read_sql(query, connection)

In [9]:
data2020.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [10]:
query = "Select * from dbo.LP1_startup_funding2021"
data2021 = pd.read_sql(query, connection)

In [11]:
data2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed
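
The four previews use different column names for the same information, for example `Company Name` (2018), `Company/Brand` (2019) and `Company_Brand` (2020 and 2021). The sketch below shows one way to list which column names are shared by every year. It uses column lists copied from the outputs above rather than the real frames, so it runs on its own.

```python
# Column names copied from the head() outputs above
columns_by_year = {
    2018: ["Company Name", "Industry", "Round/Series", "Amount", "Location", "About Company"],
    2019: ["Company/Brand", "Founded", "HeadQuarter", "Sector", "What it does",
           "Founders", "Investor", "Amount($)", "Stage"],
    2020: ["Company_Brand", "Founded", "HeadQuarter", "Sector", "What_it_does",
           "Founders", "Investor", "Amount", "Stage", "column10"],
    2021: ["Company_Brand", "Founded", "HeadQuarter", "Sector", "What_it_does",
           "Founders", "Investor", "Amount", "Stage"],
}

# Column names present in every dataset
shared = set.intersection(*(set(cols) for cols in columns_by_year.values()))
print("Shared by all years:", sorted(shared))

# Column names each year has beyond the shared ones
for year, cols in columns_by_year.items():
    print(year, sorted(set(cols) - shared))
```

With the loaded data, the dictionary could instead be built from `list(data2018.columns)` and so on. No name is common to all four years, which is why the columns need to be renamed before merging.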


### To guide our analysis, we have formulated 5 SMART questions:
* What is the overall trend in funding received by start-ups in India from 2018 to 2021?
* Which industries or sectors have received the highest funding during this period?
* What is the distribution of startups across the cities in India?
* What is the average funding amount received by start-ups in India during this period?
* Is there a correlation between the funding amount and the number of investors involved in funding rounds?
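
For the last question, none of the datasets has a column holding the number of investors, and the 2018 data has no investor column at all. One possible approach, sketched here on made-up rows, is to count the comma-separated names in the `Investor` column and correlate that count with the amount. The column name `InvestorCount` and the sample values are illustrative assumptions.

```python
import pandas as pd

# Made-up rows in the shape of the 2019-2021 data (values are illustrative)
sample = pd.DataFrame({
    "Investor": ["Fund A", "Fund A, Fund B", "Fund C, Fund D, Fund E", None],
    "Amount": [1_000_000.0, 2_500_000.0, 6_000_000.0, 500_000.0],
})

# Count the comma-separated names; rows without investor information stay missing
sample["InvestorCount"] = sample["Investor"].str.split(",").str.len()

# Rank correlation (Spearman) computed as the Pearson correlation of the ranks,
# using only rows where both values are present
complete = sample.dropna(subset=["InvestorCount", "Amount"])
correlation = complete["InvestorCount"].rank().corr(complete["Amount"].rank())

print(sample[["InvestorCount", "Amount"]])
print(correlation)
```

A rank correlation is used in this sketch because funding amounts are heavily skewed; a plain Pearson correlation would be dominated by a few very large rounds.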

###  We have also developed three hypotheses for testing:
Hypothesis 1:<br>
Null: The funding received by start-ups in India has not demonstrated a consistent upward trajectory over the years.<br>
Alternate: The funding received by start-ups in India has demonstrated a consistent upward trajectory over the years.

Hypothesis 2: <br>
Null: There are no significant disparities in the funding received across the sectors of Indian start-ups. <br>
Alternate: The technology sectors receive higher funding compared to other industries.

Hypothesis 3: <br>
Null: Situating a startup in a particular city does not influence funding.<br>
Alternate: Situating a startup in a particular city significantly affects funding.

To test these hypotheses, we will conduct the following analyses: 
* For Hypothesis 1, we will analyze the year-by-year funding amounts and calculate the average growth rate of funding.
* To investigate Hypothesis 2, we will categorize start-ups based on industry and compare the funding amounts received by each sector.
* Regarding Hypothesis 3, we will examine the distribution of start-ups across cities and identify which cities host the most highly funded start-ups.
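
The year-by-year comparison for Hypothesis 1 can be sketched as follows. The totals are made-up placeholders, not results from this dataset; with the real data they would come from grouping the merged frame by year and summing the amounts.

```python
import pandas as pd

# Placeholder yearly totals in USD (illustrative, not taken from the data)
yearly_totals = pd.Series(
    [100.0, 150.0, 120.0, 240.0],
    index=[2018, 2019, 2020, 2021],
    name="TotalFunding",
)

# Year-over-year growth: (this year - previous year) / previous year
growth = yearly_totals.pct_change()

# Average of the three year-over-year growth rates (the first year has none)
average_growth = growth.mean()

# A consistent upward trajectory requires every year to exceed the one before
consistently_upward = bool((yearly_totals.diff().dropna() > 0).all())

print(growth)
print(average_growth, consistently_upward)
```

In this placeholder series the average growth rate is positive even though one year declines, so the average alone is not enough to support the alternate hypothesis; the second check looks at each year separately.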

## 4. Data Cleaning
First, the 2018 data will be cleaned.

In [12]:
data2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [13]:
#defining a function that returns counts of unique values
def value(column):
    return data2018[column].value_counts()


value("Company Name")

TheCollegeFever             2
Urban Ladder                1
Medikabazaar                1
Freshboxx                   1
Cyclops Medtech             1
                           ..
DRIVEU                      1
Pentation Analytics         1
Cred                        1
MY CHIRAAG CAB              1
Theranosis Life Sciences    1
Name: Company Name, Length: 525, dtype: int64

In [14]:
value("Industry")

—                                                                                            30
Financial Services                                                                           15
Education                                                                                     8
Information Technology                                                                        7
Finance, Financial Services                                                                   5
                                                                                             ..
EdTech, Education, Enterprise Software, Peer to Peer                                          1
Renewable Energy                                                                              1
Automotive, E-Commerce, Marketplace                                                           1
Home Decor, Home Improvement, Home Renovation, Home Services, Interior Design, Smart Home     1
Marketplace, Real Estate, Rental Propert

In [15]:
value("Round/Series")

Seed                                                                                                       280
Series A                                                                                                    73
Angel                                                                                                       37
Venture - Series Unknown                                                                                    37
Series B                                                                                                    20
Series C                                                                                                    16
Debt Financing                                                                                              13
Private Equity                                                                                              10
Corporate Round                                                                                              8
P

In [16]:
value("Location")

Bangalore, Karnataka, India         102
Mumbai, Maharashtra, India           94
Bengaluru, Karnataka, India          55
Gurgaon, Haryana, India              52
New Delhi, Delhi, India              51
Pune, Maharashtra, India             20
Chennai, Tamil Nadu, India           19
Hyderabad, Andhra Pradesh, India     18
Delhi, Delhi, India                  16
Noida, Uttar Pradesh, India          15
Haryana, Haryana, India              11
Jaipur, Rajasthan, India              9
Kolkata, West Bengal, India           6
Ahmedabad, Gujarat, India             6
Bangalore City, Karnataka, India      5
India, Asia                           4
Indore, Madhya Pradesh, India         4
Kormangala, Karnataka, India          3
Bhopal, Madhya Pradesh, India         2
Ghaziabad, Uttar Pradesh, India       2
Kochi, Kerala, India                  2
Thane, Maharashtra, India             2
Hubli, Karnataka, India               1
Powai, Assam, India                   1
Cochin, Kerala, India                 1


In [17]:
value("About Company")

TheCollegeFever is a hub for fun, fiesta and frolic of Colleges.                                                      2
Algorithmic trading platform.                                                                                         2
Chtrbox connects social media influencers with brands .                                                               1
Fusion Microfinance Pvt. Ltd. is an NBFC registered with RBI                                                          1
MediMetry is an online platform where patients can consult specialists doctors online - from anywhere, at anytime.    1
                                                                                                                     ..
Professional Drones for Enterprise Applications                                                                       1
Operator of low-cost mobile cinema theatres in rural areas.                                                           1
India's Largest Peer-to-Peer Knowledge S

### 4.1 Dealing with duplicates

In [18]:
#checking for duplicates
data2018.duplicated().value_counts()

False    525
True       1
dtype: int64

The 2018 data contains one duplicated row, which is flagged by the 'True' value.

In [19]:
#print out all the duplicates next to each other
data2018[data2018.duplicated(keep=False)]

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
348,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."


In [20]:
#dropping all duplicates
data2018=data2018.drop_duplicates()

In [21]:
#rechecking whether any duplicates are left
data2018.duplicated().value_counts()

False    525
dtype: int64

### 4.2 Data type conversion

In [22]:
data2018.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 525 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   525 non-null    object
 1   Industry       525 non-null    object
 2   Round/Series   525 non-null    object
 3   Amount         525 non-null    object
 4   Location       525 non-null    object
 5   About Company  525 non-null    object
dtypes: object(6)
memory usage: 28.7+ KB


In [23]:
value("Amount")

—                 148
1000000            24
500000             13
2000000            12
₹50,000,000         9
                 ... 
175000              1
₹1,540,000,000      1
₹264,000,000        1
₹260,000,000        1
99230000            1
Name: Amount, Length: 198, dtype: int64

The ```Amount``` column is stored as strings (```object``` dtype), whereas amounts should be numeric. Several problems in this column cause this:
* The ```,```, ```$``` and ```₹``` characters, which should be removed.
* The ```—``` placeholders, which indicate missing values.
* Mixed currencies: some amounts are in Indian rupees and others in US dollars, so the rupee amounts will be converted to US dollars to obtain a single currency.
* Finally, the ```Amount``` column will be converted to a numeric data type.
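
The steps in this list can be sketched on a small sample as follows. The sample values mimic the formats seen in the column. The function name `clean_amount` and the `inr_to_usd` parameter are illustrative, and the rate used in the call is a dummy value, not an exchange rate; it should be replaced by the rate chosen for the analysis.

```python
import pandas as pd


def clean_amount(amounts, inr_to_usd):
    """Return amounts as floats in USD; unparseable entries become NaN."""
    text = amounts.astype(str).str.strip()
    is_rupee = text.str.contains("₹", regex=False)
    # Strip thousands separators and currency symbols
    digits = text.str.replace(r"[,$₹]", "", regex=True)
    # Placeholders such as '—' cannot be parsed and are coerced to NaN
    numeric = pd.to_numeric(digits, errors="coerce")
    # Convert only the rupee rows; the rest are assumed to be in USD already
    return numeric.where(~is_rupee, numeric * inr_to_usd)


sample = pd.Series(["250000", "₹40,000,000", "$6,300,000", "—"])
cleaned = clean_amount(sample, inr_to_usd=0.5)  # 0.5 is a dummy rate for illustration
print(cleaned)
```

Working on the whole column with string methods avoids a Python-level loop and keeps the currency flag separate from the numeric parsing.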

In [24]:
#Removing ',' from the Amount column
data2018['Amount'] = data2018['Amount'].str.replace(',', '', regex=False)
data2018['Amount']

0         250000
1      ₹40000000
2      ₹65000000
3        2000000
4              —
         ...    
521    225000000
522            —
523         7500
524    ₹35000000
525     35000000
Name: Amount, Length: 525, dtype: object

In [25]:
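#Function to convert Indian rupee amounts to US dollars
def convert_to_dollars(value):
    text = str(value)
    if '₹' in text:
        #NOTE: 0.146 is kept because the outputs below were produced with it, but the
        #2018 rate was roughly 0.0146 USD per rupee, so this factor looks ten times too large
        return pd.to_numeric(text.replace('₹', ''), errors='coerce') * 0.146
    elif '$' in text:
        return pd.to_numeric(text.replace('$', ''), errors='coerce')
    elif '—' in text:
        return None  #missing value
    else:
        return pd.to_numeric(text, errors='coerce')


pd.set_option('display.float_format', '{:.1f}'.format)

#Apply the conversion to the Amount column only; applying it to the whole frame
#would also alter text columns that contain '—' or '$'
data2018['Amount'] = data2018['Amount'].apply(convert_to_dollars)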


    
        

In [26]:
#Converting the column to a numeric data type
data2018["Amount"]=data2018["Amount"].astype(float)
data2018["Amount"].dtype

dtype('float64')

In [27]:
data2018["Amount"]

0        250000.0
1       5840000.0
2       9490000.0
3       2000000.0
4             nan
          ...    
521   225000000.0
522           nan
523        7500.0
524     5110000.0
525    35000000.0
Name: Amount, Length: 525, dtype: float64

### 4.3 Data Uniformity
For uniformity across the four datasets, which will make merging easier later, the ```Location``` and ```Industry``` columns will keep only the first comma-separated entry.

In [28]:
#keeping only the first entry before the comma in the location column
data2018["Location"] = data2018["Location"].map(lambda x: x.split(',')[0])
data2018["Location"]

0      Bangalore
1         Mumbai
2        Gurgaon
3          Noida
4      Hyderabad
         ...    
521    Bangalore
522      Haryana
523       Mumbai
524       Mumbai
525      Chennai
Name: Location, Length: 525, dtype: object
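
Keeping the first entry still leaves spelling variants of the same city as separate categories: the value counts shown earlier list `Bangalore`, `Bengaluru` and `Bangalore City`, as well as `Kochi` and `Cochin`. One way to merge such variants is a small alias mapping, sketched below on sample values. Which spelling to keep as canonical is an assumption for the analyst to decide.

```python
import pandas as pd

# Alias -> canonical spelling; the choice of canonical names is an assumption
CITY_ALIASES = {
    "Bengaluru": "Bangalore",
    "Bangalore City": "Bangalore",
    "Cochin": "Kochi",
}

locations = pd.Series(["Bangalore", "Bengaluru", "Bangalore City", "Mumbai", "Cochin"])

# replace() swaps exact matches and leaves every other value unchanged
normalized = locations.str.strip().replace(CITY_ALIASES)
print(normalized.value_counts())
```

Without this step, counts and funding totals for one city would be split across several rows when comparing cities.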

In [29]:
#keeping only the first entry before the comma in the industry column
data2018["Industry"] = data2018["Industry"].map(lambda x: str(x).split(',')[0])
data2018["Industry"]

0             Brand Marketing
1                 Agriculture
2                      Credit
3          Financial Services
4        E-Commerce Platforms
                ...          
521                       B2B
522                   Tourism
523         Food and Beverage
524    Information Technology
525             Biotechnology
Name: Industry, Length: 525, dtype: object

Earlier, it was noticed that the ```Industry``` column contains ```—``` characters, which indicate missing values. Because the column is categorical, these missing values will be replaced with the value ```Unknown```.

In [30]:
#Replacing '—' with Unknown
data2018["Industry"]=data2018["Industry"].replace('—', "Unknown", regex=True)
data2018["Industry"]

0             Brand Marketing
1                 Agriculture
2                      Credit
3          Financial Services
4        E-Commerce Platforms
                ...          
521                       B2B
522                   Tourism
523         Food and Beverage
524    Information Technology
525             Biotechnology
Name: Industry, Length: 525, dtype: object

### 4.4 Handling Missing values

In [31]:
#Checking for missing values 
data2018.isnull().sum()

Company Name       0
Industry           0
Round/Series       0
Amount           148
Location           0
About Company      0
dtype: int64

In [32]:
#Computing summary of statistics for 2018 data
data2018.describe()

Unnamed: 0,Amount
count,377.0
mean,47244478.4
std,212692748.2
min,7500.0
25%,1000000.0
50%,3530000.0
75%,14965000.0
max,2920000000.0


The ```Amount``` column contains 148 missing values. Since the column is numerical and heavily skewed, the missing values will be replaced with the median, which is less sensitive to outliers than the mean.
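
The difference between the two statistics is visible in the summary above, where the mean (about 47.2 million) is more than ten times the median (about 3.5 million). The same effect can be reproduced on a small made-up sample with one very large value; the numbers below are illustrative only.

```python
import pandas as pd

# Made-up amounts with one extreme value and one missing entry
amounts = pd.Series([1_000_000.0, 2_000_000.0, 3_000_000.0, 1_000_000_000.0, None])

mean_value = amounts.mean()      # pulled upward by the single large round
median_value = amounts.median()  # stays near the typical value
print(mean_value, median_value)

# Fill the missing entry with the median of the observed values
filled = amounts.fillna(median_value)
print(filled)
```

Here the mean lands far above three of the four observed values, while the median stays between them, which is why the median is the safer fill value for skewed amounts.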

In [33]:
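#Imputing missing values with the median
array = data2018["Amount"].values.reshape(-1, 1)
imputer = SimpleImputer(strategy="median")

#fit_transform returns a 2-D array, so flatten it back to one dimension
data2018["Amount"] = imputer.fit_transform(array).ravel()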

In [34]:
#Confirming there are no more missing values
data2018.isnull().sum()

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64

In order to make merging and the subsequent analysis easier, a column named ```YearFunded```, which holds the funding year of the dataset, will be added.

In [35]:
data2018= data2018.assign(YearFunded=2018)
data2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,YearFunded
0,TheCollegeFever,Brand Marketing,Seed,250000.0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018
1,Happy Cow Dairy,Agriculture,Seed,5840000.0,Mumbai,A startup which aggregates milk from dairy far...,2018
2,MyLoanCare,Credit,Series A,9490000.0,Gurgaon,Leading Online Loans Marketplace in India,2018
3,PayMe India,Financial Services,Angel,2000000.0,Noida,PayMe India is an innovative FinTech organizat...,2018
4,Eunimart,E-Commerce Platforms,Seed,3530000.0,Hyderabad,Eunimart is a one stop solution for merchants ...,2018
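
With a year column on each dataset, the yearly frames can later be stacked into one. A minimal sketch with two tiny made-up frames is shown below. The target names `Company`, `Sector`, `Stage` and `HeadQuarter` in the rename mappings are illustrative choices, not names fixed by this notebook.

```python
import pandas as pd

# Tiny stand-ins for two of the yearly frames (values are made up)
frame_2018 = pd.DataFrame({
    "Company Name": ["A"], "Industry": ["Credit"], "Round/Series": ["Seed"],
    "Amount": [1.0], "Location": ["Mumbai"], "YearFunded": [2018],
})
frame_2020 = pd.DataFrame({
    "Company_Brand": ["B"], "Sector": ["EdTech"], "Stage": ["Pre-seed"],
    "Amount": [2.0], "HeadQuarter": ["Pune"], "YearFunded": [2020],
})

# Map each dataset's column names onto one shared schema
rename_2018 = {"Company Name": "Company", "Industry": "Sector",
               "Round/Series": "Stage", "Location": "HeadQuarter"}
rename_2020 = {"Company_Brand": "Company"}

# Stack the renamed frames and rebuild a clean 0..n-1 index
combined = pd.concat(
    [frame_2018.rename(columns=rename_2018), frame_2020.rename(columns=rename_2020)],
    ignore_index=True,
)
print(combined)
```

Renaming before concatenating matters: `pd.concat` aligns on column names, so unrenamed columns would end up side by side with missing values instead of in one column.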
