### ANALYZING THE INDIAN STARTUP ECOSYSTEM: A DATA-DRIVEN APPROACH USING CRISP-DM

#### Project Description
This project analyzes the funding received by startups in India from 2018 to 2021.
Following the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, we will identify key trends, the sectoral and geographical distribution of funding, investor dynamics, and existing challenges within the ecosystem. The ultimate goal is to propose the best course of action to investors and to foster a more robust and sustainable startup environment in India.
The datasets contain details of the startups, their geographical locations, the funding amounts received, and investor information.

#### Objectives

1. Identify key investment trends in the Indian startup ecosystem from 2018 to 2021.
2. Explore the distribution of startup funding across different sectors and regions in India.
3. Determine high-growth sectors that attract significant funding.
4. Analyze investor activity and preferences within the ecosystem.
5. Understand the characteristics of startups that receive substantial funding.


#### Hypothesis

Alternative hypothesis: There has been an increasing trend in the amount of funding received by Indian startups within the technology industry from 2018 to 2021.

Null hypothesis: There has been no significant increase in the amount of funding received by Indian startups from 2018 to 2021.

<!-- Hypothesis: The fintech sector has received the highest proportion of startup funding compared to other sectors such as healthcare and edtech.
Hypothesis: Startup funding is concentrated in major metropolitan areas like Bangalore, Mumbai, and Delhi, with relatively less investment in tier-2 and tier-3 cities. -->



#### Analytical Questions

1. What are the trends in the total amount of funding received by Indian startups from 2018 to 2021?

2. Which sectors have attracted the most investment during each year, and how have these trends evolved over the four years?

3. Who are the top investors, how much have they invested in total, and in which sectors?

4. At which funding stage do Indian startups receive the most funding?

5. How is startup funding distributed across sectors and regions, and which receive the highest and lowest amounts?

### Data Understanding
#### Loading datasets from different sources

In [65]:
#import all necessary libraries
import pyodbc     
from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package
import pandas as pd
import warnings 

warnings.filterwarnings('ignore')

In [66]:
#load 2018 data 
data_18 = pd.read_csv('startup_funding2018.csv')

#View data
data_18.head(10)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...
5,Hasura,"Cloud Infrastructure, PaaS, SaaS",Seed,1600000,"Bengaluru, Karnataka, India",Hasura is a platform that allows developers to...
6,Tripshelf,"Internet, Leisure, Marketplace",Seed,"₹16,000,000","Kalkaji, Delhi, India",Tripshelf is an online market place for holida...
7,Hyperdata.IO,Market Research,Angel,"₹50,000,000","Hyderabad, Andhra Pradesh, India",Hyperdata combines advanced machine learning w...
8,Freightwalla,"Information Services, Information Technology",Seed,—,"Mumbai, Maharashtra, India",Freightwalla is an international forwarder tha...
9,Microchip Payments,Mobile Payments,Seed,—,"Bangalore, Karnataka, India",Microchip payments is a mobile-based payment a...
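
The 2018 preview stores `Location` as "City, State, Country", whereas the later files hold a single place name in `HeadQuarter`. The following is a minimal sketch, under the assumption that the first comma-separated token is the city, of how the 2018 column might be reduced to a comparable value. The names `sample_18` and `city` are illustrative, and the sample rows are copied from the preview.

```python
import pandas as pd

# Sample values copied from the 2018 preview (illustrative subset).
sample_18 = pd.DataFrame({
    "Company Name": ["TheCollegeFever", "Hasura", "Tripshelf"],
    "Location": [
        "Bangalore, Karnataka, India",
        "Bengaluru, Karnataka, India",
        "Kalkaji, Delhi, India",
    ],
})

# Assumption: the first comma-separated token is the city.
sample_18["city"] = sample_18["Location"].str.split(",").str[0].str.strip()

print(sample_18[["Company Name", "city"]])
```

The sample also shows the limits of this shortcut: "Bangalore" and "Bengaluru" remain two spellings of the same city, and the first token of "Kalkaji, Delhi, India" appears to be a locality rather than a city, so some manual normalization would probably still be needed.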


In [67]:
#load 2019 data 
data_19 = pd.read_csv('startup_funding2019.csv',encoding = "ISO-8859-1")

#View data
data_19.head(10)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",
5,FlytBase,,Pune,Technology,A drone automation platform,Nitin Gupta,Undisclosed,Undisclosed,
6,Finly,,Bangalore,SaaS,It builds software products that makes work si...,"Vivek AG, Veekshith C Rai","Social Capital, AngelList India, Gemba Capital...",Undisclosed,
7,Kratikal,2013.0,Noida,Technology,It is a product-based cybersecurity solutions ...,"Pavan Kushwaha, Paratosh Bansal, Dip Jung Thapa","Gilda VC, Art Venture, Rajeev Chitrabhanu.","$1,000,000",Pre series A
8,Quantiphi,,,AI & Tech,It is an AI and big data services company prov...,Renuka Ramnath,Multiples Alternate Asset Management,"$20,000,000",Series A
9,Lenskart,2010.0,Delhi,E-commerce,It is a eyewear company,"Peyush Bansal, Amit Chaudhary, Sumeet Kapahi",SoftBank,"$275,000,000",Series G


In [68]:
# Load environment variables from .env file into a dictionary to access 2020 and 2021 datasets from database
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("server")
database = environment_variables.get("database")
login = environment_variables.get("login")
password = environment_variables.get("password")
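
Because `dotenv_values` returns a dictionary and `.get` yields `None` for a missing key, a typo in the `.env` file would silently put the text `None` into the connection string. A small check such as the sketch below could catch that earlier. The helper `missing_keys` and the placeholder values are hypothetical and not part of the original notebook.

```python
def missing_keys(settings, required=("server", "database", "login", "password")):
    """Return the required keys that are absent or empty in a settings mapping."""
    return [key for key in required if not settings.get(key)]


# Hypothetical mapping standing in for dotenv_values('.env'); values are placeholders.
example_settings = {
    "server": "example-host",
    "database": "example-db",
    "login": "example-user",
    "password": "",
}

missing = missing_keys(example_settings)
if missing:
    print(f"Missing values in .env: {', '.join(missing)}")
```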

In [69]:
# Create a connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={login};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"

In [70]:
# Use the connect method of the pyodbc library and pass in the connection string.
connection = pyodbc.connect(connection_string)

In [71]:
# write a SQL query to list the base tables available in the database

query = '''SELECT *
           FROM INFORMATION_SCHEMA.TABLES
           WHERE TABLE_TYPE = 'BASE TABLE'
        '''
data = pd.read_sql(query, connection)
data.head()

Unnamed: 0,TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
0,dapDB,dbo,LP1_startup_funding2021,BASE TABLE
1,dapDB,dbo,LP1_startup_funding2020,BASE TABLE


In [72]:
#write an sql query to extract 2020 details from the data
query = '''SELECT *
           FROM LP1_startup_funding2020
        '''
data_20 = pd.read_sql(query, connection)
data_20.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [73]:
#write an sql query to extract 2021 details from the database
query = '''SELECT *

           FROM LP1_startup_funding2021
        '''
data_21 = pd.read_sql(query, connection)
data_21.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


### DATA UNDERSTANDING / EXPLORATION

Column names and descriptions:

Company/Brand: Name of the company/startup

Founded: Year the startup was founded

HeadQuarter: Location of the startup's headquarters

Sector: Sector the startup operates in

What it does: Description of the company

Founders: Founders of the company

Investor: Investors that funded the startup

Amount($): Amount of funding raised

Stage: Funding stage

In [74]:
# a look at the shapes of the dataframe
data_18.shape , data_19.shape , data_20.shape , data_21.shape

((526, 6), (89, 9), (1055, 10), (1209, 9))
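
The shapes show that the four sources differ in column count (6, 9, 10 and 9). Before renaming anything, it can help to list which column names each source shares with the others. The sketch below does this with empty stand-in frames whose column names are copied from the previews; `columns_by_year`, `frames` and `shared` are illustrative names.

```python
import pandas as pd

# Column names copied from the previews; the frames are empty stand-ins.
columns_by_year = {
    "2018": ["Company Name", "Industry", "Round/Series", "Amount",
             "Location", "About Company"],
    "2019": ["Company/Brand", "Founded", "HeadQuarter", "Sector", "What it does",
             "Founders", "Investor", "Amount($)", "Stage"],
    "2020": ["Company_Brand", "Founded", "HeadQuarter", "Sector", "What_it_does",
             "Founders", "Investor", "Amount", "Stage", "column10"],
    "2021": ["Company_Brand", "Founded", "HeadQuarter", "Sector", "What_it_does",
             "Founders", "Investor", "Amount", "Stage"],
}
frames = {year: pd.DataFrame(columns=cols) for year, cols in columns_by_year.items()}

# Column names present in every year, and the names specific to each year.
shared = set.intersection(*(set(df.columns) for df in frames.values()))
print("shared by all years:", sorted(shared))
for year, df in frames.items():
    print(year, sorted(set(df.columns) - shared))
```

With these names, no column is shared by all four sources, which is why every frame needs its own rename mapping. The 2020 table also carries an extra `column10` that the 2021 table does not have.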

The `info` method gives a quick description of the data: the total number of rows, plus each attribute's type and number of non-null values.

In [75]:
# a quick overview of the datatypes
data_18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [76]:
data_19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [77]:
data_20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.5+ KB


In [78]:
data_21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB
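
The non-null counts can overstate how complete a column is. For 2018, `info()` reports 526 non-null `Amount` values even though the preview contains "—" entries, and the 2019 preview contains "Undisclosed". The sketch below, which assumes those two strings are the only placeholders, shows one way to treat them as missing before computing the share of missing values. The sample rows are copied from the previews and the names are illustrative.

```python
import pandas as pd

# Sample values copied from the 2018 and 2019 previews (illustrative subset).
sample = pd.DataFrame({
    "Amount": ["250000", "₹40,000,000", "—", "Undisclosed", "$6,300,000"],
    "Stage": ["Seed", "Seed", "Seed", None, None],
})

# Assumption: these strings stand for "no value" in the raw files.
placeholders = ["—", "Undisclosed"]

# mask() replaces every cell flagged by isin() with NaN.
cleaned = sample.mask(sample.isin(placeholders))

# Share of missing values per column, in percent.
missing_pct = cleaned.isna().mean().mul(100).round(1)
print(missing_pct)
```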


### DATA EXPLORATION AND CLEANING
We have a total of four datasets from different sources, and they need to be merged before we can perform any analysis.

In [79]:
#add a column to each dataframe for the year the funding was received
data_21.insert(0,'year_funded','2021',allow_duplicates = True) 
data_20.insert(0,'year_funded','2020',allow_duplicates = True)
data_19.insert(0,'year_funded','2019',allow_duplicates= True)
data_18.insert(0,'year_funded','2018',allow_duplicates = True)

In [80]:
#rename column names for each dataframe for uniformity before merging

data_18.rename(columns = {'Company Name':'company_name','Round/Series':'funding_stage','About Company': 'about_company','Amount':'Amount($)'},inplace = True)
data_19.rename(columns = {'Company/Brand':'company_name','HeadQuarter':'Location','Sector':'industry','What it does' :'about_company','Amount':'Amount($)','Stage': 'funding_stage'},inplace = True)
data_20.rename(columns = {'Company/Brand':'company_name','HeadQuarter':'Location','Sector':'industry','What_it_does' :'about_company','Amount':'Amount($)','Stage': 'funding_stage'},inplace = True)
data_21.rename(columns = {'Company_Brand':'company_name','HeadQuarter':'Location','Sector':'industry','What_it_does' :'about_company','Amount':'Amount($)','Stage': 'funding_stage'},inplace = True)
data_21.head(5)

Unnamed: 0,year_funded,company_name,Founded,Location,industry,about_company,Founders,Investor,Amount($),funding_stage
0,2021,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,2021,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,2021,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,2021,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,2021,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [81]:
# merge all dataframes
merged_df = pd.concat([data_18,data_19,data_20,data_21], axis = 0)

# preview the last rows of the merged dataframe
merged_df.tail(50)

Unnamed: 0,year_funded,company_name,Industry,funding_stage,Amount($),Location,about_company,Founded,industry,Founders,Investor,Company_Brand,column10
1159,2021,4Fin,,Pre-seed,$1100000,Pune,4Fin is a Fintech Platform catering to needs o...,2021.0,Financial Services,"Amit Tewary, Ajit Sinha",Curesense Therapeutics,,
1160,2021,Atomberg Technologies,,,$Undisclosed,Mumbai,A maker of energy-efficient smart fans,2012.0,Consumer Electronics,"Manoj Meena, Sibabrata Das",Ka Enterprises,,
1161,2021,Genext Students,,,$Undisclosed,Mumbai,LIVE online classes with expert tutors for K-1...,2013.0,EdTech,"Ali Asgar Kagzi, Piyush Dhanuka",Navneet Education,,
1162,2021,immunitoAI,,Seed,$1000000,Bangalore,Perform Antibody Discovery using Artificial In...,2020.0,Biotechnology,"Aridni Shah, Trisha Chatterjee",pi Ventures,,
1163,2021,GameEon Studios,,,$320000,Mumbai,GameEon is based in the sleepless city of Mumb...,2013.0,Computer Games,Nikhil Malankar,Mumbai Angels Network,,
1164,2021,Farmers Fresh Zone,,Pre-series A,$800000,Kochi,D2C Health and Wellness Brand for Fresh and Sa...,2015.0,AgriTech,Pradeep PS,"IAN Fund, Malabar Angel Network, Native Angel ...",,
1165,2021,Anveshan,,Seed,$500000,Bangalore,Revolutionizing the food industry through tech...,2019.0,Food Production,"Aayushi Khandelwal, Akhil Kansal, Kuldeep Parewa","DSG Consumer Partners, Titan Capital",,
1166,2021,OckyPocky,,Seed,$Undisclosed,Gurugram,OckyPocky is India's 1st interactive English l...,2015.0,EdTech,Amit Agrawal,"Sujeet Kumar, SucSEED Indovation Fund",,
1167,2021,Coutloot,,Pre-series,$8000000,Mumbai,Empowering local markets to sell online social...,2016.0,Consumer Services,"Mahima Kaul, Jasmeet Thind","Ameba Capital, 9Unicorns",,
1168,2021,Nova Benefits,,Series A,$10000000,Bangalore,Nova Benefits is the one stop tech platform fo...,2020.0,"Health, Wellness & Fitness","Saransh Garg, Yash Gupta","Susquehanna International Group, Bessemer Vent...",,


In [82]:
# TODO: strip the currency symbols ('$', '₹') and the thousands separators from Amount($), then convert the column to a numeric dtype
# Draft (not applied yet):
# merged_df['Amount($)'] = merged_df['Amount($)'].str.strip('₹')
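
The `Amount($)` column mixes several formats across the sources: plain digits, `$` and `₹` prefixes, thousands separators, floats from the 2020 table, and placeholders such as "—" or "Undisclosed". The function below is a minimal sketch of one way to parse a single entry. `parse_amount` is a hypothetical helper, not the notebook's method, and it deliberately does not convert rupees to dollars because no exchange rate is given in the source.

```python
import math
import re


def parse_amount(raw):
    """Return (value, currency) for one raw amount; (None, None) when no value is found."""
    if raw is None:
        return (None, None)
    # Numeric entries (for example from the 2020 table) carry no symbol; NaN means missing.
    if isinstance(raw, (int, float)):
        return (None, None) if math.isnan(raw) else (float(raw), None)
    text = str(raw).strip()
    currency = "INR" if "₹" in text else "USD" if "$" in text else None
    # Keep digits and the decimal point only; separators and symbols are dropped.
    digits = re.sub(r"[^\d.]", "", text)
    try:
        return (float(digits), currency)
    except ValueError:
        # Placeholders such as "—" or "Undisclosed" leave nothing parseable behind.
        return (None, None)


for example in ["₹40,000,000", "$1,200,000", "250000", "—", "$Undisclosed", 200000.0]:
    print(repr(example), "->", parse_amount(example))
```

Applied to the merged frame, this could look like `merged_df['Amount($)'].map(parse_amount)`, after which the value and currency can be split into separate columns. Entries without a symbol come back with a currency of `None`, leaving the decision about their currency explicit.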

In [83]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2879 entries, 0 to 1208
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   year_funded    2879 non-null   object 
 1   company_name   1824 non-null   object 
 2   Industry       526 non-null    object 
 3   funding_stage  1941 non-null   object 
 4   Amount($)      2622 non-null   object 
 5   Location       2765 non-null   object 
 6   about_company  2879 non-null   object 
 7   Founded        2110 non-null   float64
 8   industry       2335 non-null   object 
 9   Founders       2334 non-null   object 
 10  Investor       2253 non-null   object 
 11  Company_Brand  1055 non-null   object 
 12  column10       2 non-null      object 
dtypes: float64(1), object(12)
memory usage: 314.9+ KB
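
The summary shows two pairs of columns that hold the same field: `company_name` (1,824 non-null) alongside `Company_Brand` (1,055 non-null), and `industry` (2,335 non-null) alongside `Industry` (526 non-null). This is consistent with the rename step: the 2020 mapping uses the key `'Company/Brand'` while that table's column is `Company_Brand`, and the 2018 `Industry` column is not renamed. `DataFrame.rename` ignores keys that do not match by default. One possible repair after merging, sketched here on a toy frame with values taken from the previews, is to fill each target column from its stray counterpart.

```python
import pandas as pd

# Toy frame mimicking the merged layout: each logical field is split
# across two columns because some rename keys did not match.
toy = pd.DataFrame({
    "company_name": ["TheCollegeFever", None, "Unbox Robotics"],
    "Company_Brand": [None, "Aqgromalin", None],
    "Industry": ["Brand Marketing", None, None],
    "industry": [None, "AgriTech", "AI startup"],
})

# Fill gaps in the target column from its stray counterpart, then drop the stray.
toy["company_name"] = toy["company_name"].combine_first(toy["Company_Brand"])
toy["industry"] = toy["industry"].combine_first(toy["Industry"])
toy = toy.drop(columns=["Company_Brand", "Industry"])

print(toy)
```

Alternatively, correcting the rename keys before the `pd.concat` call would avoid the stray columns altogether.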
