# Exploring the Indian Startup Ecosystem: A Data-Driven Analysis of Funding Trends and Industry Sectors

**Project Description**

Your team is trying to venture into the Indian start-up ecosystem. As the data experts of the team, you are to investigate the ecosystem by analyzing the funding received by start-ups from 2018 to 2021 and propose the best course of action.



### **Business Understanding**


The Indian start-up ecosystem, ranked as the third largest in the world, is a network of entrepreneurs, investors and other stakeholders working to build and grow technology-driven start-ups in the country.

India has seen an astronomical increase in start-ups and funding, with over 16,000 new companies added in 2020, resulting in unprecedented growth and funding.

Funding is generally provided by investment firms, angel investors, venture capitalists and private equity firms. In the face of market uncertainties, the Indian start-up ecosystem received $8.4 billion in 2023, indicating how resilient it is.



### **Data understanding and collection**

In [1]:
# Import all the necessary packages
import pyodbc  # installed with pip
import os
from dotenv import dotenv_values  # reads key-value pairs from a .env file
import pandas as pd
import warnings


warnings.filterwarnings('ignore')



In [2]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
database = environment_variables.get("DB_DATABASENAME")

server = environment_variables.get("DB_SERVER")
username = environment_variables.get("DB_USERNAME")
password = environment_variables.get("DB_PASSWORD")

# Build the ODBC connection string; stray spaces around the separators are removed
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
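
If a key is missing from the `.env` file, `environment_variables.get(...)` returns `None` and the connection string above ends up containing the literal text `None`. A minimal sketch of a guard, assuming the same four key names; `build_connection_string` is a hypothetical helper written for this illustration and is not part of `pyodbc` or `python-dotenv`:

```python
REQUIRED_KEYS = ("DB_SERVER", "DB_DATABASENAME", "DB_USERNAME", "DB_PASSWORD")


def build_connection_string(env):
    """Build a SQL Server ODBC connection string from a dict of settings.

    Hypothetical helper: raises ValueError early if a credential is missing
    instead of passing the literal text 'None' to the driver.
    """
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['DB_SERVER']};"
        f"DATABASE={env['DB_DATABASENAME']};"
        f"UID={env['DB_USERNAME']};"
        f"PWD={env['DB_PASSWORD']}"
    )


# Placeholder values for illustration only, not real credentials.
example_env = {
    "DB_SERVER": "example-server",
    "DB_DATABASENAME": "example_db",
    "DB_USERNAME": "example_user",
    "DB_PASSWORD": "example_password",
}
example_string = build_connection_string(example_env)
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed in place of `example_env`.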

In [3]:
# Use the connect method of the pyodbc library and pass in the connection string.

connection = pyodbc.connect(connection_string)

# The SQL queries that retrieve the data follow in the next cells.


In [4]:
# Query the database to retrieve all rows from the 2020 funding table
query1 = "SELECT * FROM dbo.LP1_startup_funding2020"
df_2020 = pd.read_sql(query1, connection)
df_2020.head(2)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,


In [5]:
# Query the database to retrieve all rows from the 2021 funding table
query2 = "SELECT * FROM dbo.LP1_startup_funding2021"

df_2021 = pd.read_sql(query2, connection)
df_2021.head(2)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",


In [11]:
# Read the 2019 dataset from a CSV file into a pandas DataFrame
# (raw string so the backslashes in the Windows path are not treated as escapes)
df_2019 = pd.read_csv(r"D:\Programming Stuffs\DAP(Azubi Africa)\Career Accelerator\Team Selenium\Indian_Start_Up_Analysis\datasets\startup_funding2019.csv")
df_2019.head(2)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C


In [12]:
# Read the 2018 dataset from a CSV file into a pandas DataFrame
# (raw string so the backslashes in the Windows path are not treated as escapes)
df_2018 = pd.read_csv(r"D:\Programming Stuffs\DAP(Azubi Africa)\Career Accelerator\Team Selenium\Indian_Start_Up_Analysis\datasets\startup_funding2018.csv")
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [13]:
# Concatenate the four yearly datasets into a single DataFrame
df = pd.concat([df_2018, df_2019, df_2020, df_2021], ignore_index=True)
df.head()
# Save the combined DataFrame to a CSV file for further cleaning and preparation
df.to_csv("startup_2018_19_20_21.csv")
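
None of the four source tables shown above has a column for the funding year, so once the frames are stacked a row can no longer be traced back to its year. A minimal sketch of tagging each frame before concatenation, assuming a new column called `funding_year` (a hypothetical name) and using tiny stand-in frames rather than the real data:

```python
import pandas as pd

# Tiny stand-in frames; the real ones come from the CSV files and SQL tables.
sample_2018 = pd.DataFrame({"Company Name": ["TheCollegeFever"], "Amount": ["250000"]})
sample_2019 = pd.DataFrame({"Company/Brand": ["Bombay Shaving"], "Amount($)": ["$6,300,000"]})

frames_by_year = {2018: sample_2018, 2019: sample_2019}

# Tag each frame with its year before stacking them.
combined = pd.concat(
    [frame.assign(funding_year=year) for year, frame in frames_by_year.items()],
    ignore_index=True,
)
```

A year column of this kind would be needed for any question that compares years, such as the effect of the pandemic on 2020 funding.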

In [14]:
# Rename columns to lowercase, with underscores in place of spaces
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))
df.head(2)

Unnamed: 0,company_name,industry,round/series,amount,location,about_company,company/brand,founded,headquarter,sector,what_it_does,founders,investor,amount($),stage,company_brand,what_it_does.1,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,,,,,,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,,,,,,,,,


In [15]:
# check the shape of the dataset
df.shape

(2879, 18)

In [16]:
# perform descriptive statistics on data
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
company_name,526.0,525.0,TheCollegeFever,2.0,,,,,,,
industry,526.0,405.0,—,30.0,,,,,,,
round/series,526.0,21.0,Seed,280.0,,,,,,,
amount,2533.0,754.0,—,148.0,,,,,,,
location,526.0,50.0,"Bangalore, Karnataka, India",102.0,,,,,,,
about_company,526.0,524.0,"TheCollegeFever is a hub for fun, fiesta and f...",2.0,,,,,,,
company/brand,89.0,87.0,Kratikal,2.0,,,,,,,
founded,2110.0,,,,2016.079621,4.368006,1963.0,2015.0,2017.0,2019.0,2021.0
headquarter,2239.0,123.0,Bangalore,764.0,,,,,,,
sector,2335.0,502.0,FinTech,173.0,,,,,,,


The descriptive statistics are hard to interpret at this stage because most columns contain a large number of null values.
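
The extent of the missing data can be quantified per column. The sketch below uses a small stand-in frame (not the real data) and computes the share of nulls in each column:

```python
import pandas as pd

# Stand-in frame that mimics the sparse layout of the combined dataset.
sample = pd.DataFrame(
    {
        "company_name": ["TheCollegeFever", None, None, None],
        "amount": ["250000", "$1,200,000", None, "—"],
        "founded": [None, 2019.0, 2015.0, 2014.0],
    }
)

# Share of missing values per column, largest first.
null_share = sample.isna().mean().sort_values(ascending=False)
```

Note that `isna()` does not count the dash placeholder that appears in the 2018 data as missing, so such placeholders would have to be converted to nulls first.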

In [17]:
# check the general information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   company_name   526 non-null    object 
 1   industry       526 non-null    object 
 2   round/series   526 non-null    object 
 3   amount         2533 non-null   object 
 4   location       526 non-null    object 
 5   about_company  526 non-null    object 
 6   company/brand  89 non-null     object 
 7   founded        2110 non-null   float64
 8   headquarter    2239 non-null   object 
 9   sector         2335 non-null   object 
 10  what_it_does   89 non-null     object 
 11  founders       2334 non-null   object 
 12  investor       2253 non-null   object 
 13  amount($)      89 non-null     object 
 14  stage          1415 non-null   object 
 15  company_brand  2264 non-null   object 
 16  what_it_does   2264 non-null   object 
 17  column10       2 non-null      object 
dtypes: float

### Observations
- The dataset has 18 columns in total, some of which duplicate one another
- The duplicated columns are assumed to be what_it_does, company_brand and company/brand
- Seventeen of the columns have the object datatype; only the founded column is a float
- Every column in the dataset contains null values
- column10 holds no significant data (only 2 non-null values) and will be dropped during data cleaning
- Some rows hold values that do not belong to their columns; these will have to be discussed further
- The nulls in the dataset will also have to be discussed and dealt with
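
One possible way to merge duplicated columns is to take the first non-null value across them, row by row. The sketch below is an illustration under assumptions, not the notebook's cleaning step: it treats company_name as the 2018 equivalent of the other two name columns, writes the result to a hypothetical `company` column, and runs on a stand-in frame.

```python
import pandas as pd

# Stand-in for the combined frame: each row fills only one of the name columns.
sample = pd.DataFrame(
    {
        "company_name": ["TheCollegeFever", None, None],
        "company/brand": [None, "Bombay Shaving", None],
        "company_brand": [None, None, "Aqgromalin"],
    }
)

name_columns = ["company_name", "company/brand", "company_brand"]

# Back-fill along each row so the first column holds the first non-null value.
sample["company"] = sample[name_columns].bfill(axis=1).iloc[:, 0]
cleaned = sample.drop(columns=name_columns)
```

The same pattern could be applied to the amount and description columns once their names have been made unique.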


### Assumptions
- All amounts will be converted to USD during data cleaning and EDA
  - Amounts in the 2018 amount column that carry no currency symbol are assumed to be in USD
- The df_2018 columns will be renamed based on similarities in wording and comparison with the other years
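
The currency assumption can be sketched as a small parsing function. Everything here is illustrative: `amount_to_usd` is a hypothetical helper, the exchange rate must be supplied by the caller from a trusted source, and the rule that a missing symbol means USD is the assumption stated above.

```python
import math
import re


def amount_to_usd(raw, inr_to_usd):
    """Convert a raw amount cell to a float in USD, or None if unusable.

    Assumptions: a leading rupee sign marks INR, a dollar sign or no symbol
    means USD, and cells without a valid number are treated as missing.
    """
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    text = str(raw).strip()
    # Keep only digits and the decimal point (drops symbols and commas).
    digits = re.sub(r"[^\d.]", "", text)
    if not re.fullmatch(r"\d+(\.\d+)?", digits):
        return None
    value = float(digits)
    return value * inr_to_usd if text.startswith("₹") else value


# The rate below is purely illustrative, not a real exchange rate.
illustrative_rate = 0.01
converted = [
    amount_to_usd(cell, inr_to_usd=illustrative_rate)
    for cell in ["$1,200,000", "₹40,000,000", 250000, "—"]
]
```

On a DataFrame, the function could be applied with `df["amount"].map(lambda cell: amount_to_usd(cell, inr_to_usd=rate))`, where `rate` is the chosen INR-to-USD rate.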




## Hypothesis testing


*Hypothesis* - The amount of funding a company receives depends on the sector in which it operates

Null hypothesis: The sector of a start-up has no impact on the amount of funding it receives.

Alternative hypothesis: The sector of a start-up has an impact on the amount of funding it receives.
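
The source does not say which statistical test will be used. As one possible approach, funding amounts tend to be heavily skewed, so a rank-based test is a reasonable candidate. The sketch below runs a permutation test on the Kruskal-Wallis H statistic using only the standard library; the sector labels and amounts are invented for illustration, and the function names are hypothetical. `scipy.stats.kruskal` provides a library implementation of the same statistic.

```python
import random
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_h(amounts, labels):
    """Kruskal-Wallis H statistic (without tie correction)."""
    n = len(amounts)
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for rank, label in zip(average_ranks(amounts), labels):
        rank_sums[label] += rank
        counts[label] += 1
    total = sum(rank_sums[label] ** 2 / counts[label] for label in counts)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p_value(amounts, labels, n_permutations=2000, seed=0):
    """Estimate how often shuffled sector labels give an H at least as large."""
    rng = random.Random(seed)
    observed = kruskal_h(amounts, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if kruskal_h(amounts, shuffled) >= observed - 1e-9:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


# Invented amounts for two sectors, for illustration only.
amounts = [1, 2, 3, 4, 5, 100, 110, 120, 130, 140]
sectors = ["EdTech"] * 5 + ["FinTech"] * 5
observed_h, p_value = permutation_p_value(amounts, sectors)
```

A small p-value would count as evidence against the null hypothesis that sector has no impact on funding.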


## Analytical questions

- Which sector received the most funding over the period?

- How are start-ups distributed across funding stages, and how much funding went to each stage?

- Which three locations received the most start-up funding?

- Which year had the most investors?

- Who are the top 10 investors in Indian start-ups?

- What was the impact of the Covid-19 pandemic on start-up funding in 2020 compared with the other years?

- How does funding differ between metropolitan cities and small towns? (For recommendations on government policy)
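
As an illustration of how the investor question might be approached, investor cells in the source tables hold several comma-separated names, so they need to be split before counting. The sketch below uses stand-in rows (the third row is invented) and ranks investors by number of deals, which is one possible reading of "top"; ranking by amount invested would need a different aggregation.

```python
import pandas as pd

# Stand-in rows; investor cells hold comma-separated names as in the source tables.
sample = pd.DataFrame(
    {
        "investor": [
            "BEENEXT, Entrepreneur First",
            "Unilazer Ventures, IIFL Asset Management",
            "BEENEXT",
            None,
        ]
    }
)

# One row per investor name, trimmed of whitespace, then counted.
investor_counts = (
    sample["investor"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
top_investors = investor_counts.head(10)
```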


