BUSINESS UNDERSTANDING

Main Objective of the Project
The primary objective of this project is to analyze the funding trends in the Indian start-up ecosystem from 2018 to 2021. By examining the data, we aim to identify patterns, trends, and insights that can inform strategic decisions for entering the Indian start-up market. Specifically, we will focus on understanding the amount of funding received by start-ups, the types of investors involved, and the sectors that attract the most investment.

Key Research Questions
•	How has the total amount of funding received by start-ups in India changed from 2018 to 2021?
•	Which sectors have received the most funding in each year?
•	Who are the top investors in the Indian start-up ecosystem from 2018 to 2021?
•	Which regions or cities in India are receiving the most start-up funding?
•	Does the funding stage align with the investment timeline?
•	How long has the company been operating, and how does that affect the amount of investment?

Null Hypothesis (H0): There is no significant difference in the total amount of funding received by start-ups in India across the years 2018 to 2021.
Alternative Hypothesis (H1): There is a significant difference in the total amount of funding received by start-ups in India across the years 2018 to 2021.
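
Each year has only one total, so one way to make H0 testable is to compare the per-deal amounts across years. The sketch below is an illustration rather than the project's chosen method: it runs a permutation test on the one-way ANOVA F statistic, and the function names and the small sample of amounts are assumptions made for the example.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    values = [v for group in groups for v in group]
    grand_mean = mean(values)
    group_means = [mean(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Share of year-label shufflings whose F statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for group in groups for v in group]
    sizes = [len(group) for group in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Illustrative per-deal amounts keyed by year; NOT the full datasets.
deals_by_year = {
    2018: [250_000, 2_000_000, 7_500, 400_000],
    2019: [6_300_000, 28_000_000, 30_000_000, 6_000_000],
    2020: [200_000, 100_000, 400_000, 340_000],
    2021: [1_200_000, 30_000_000, 51_000_000, 2_000_000],
}
p_value = permutation_p_value(list(deals_by_year.values()))
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value would count as evidence against H0. With the real data, the groups would be the cleaned Amount columns of the four yearly datasets, expressed in one currency.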


In [3]:
import pandas as pd

In [4]:
!pip install pyodbc






In [5]:
from dotenv import dotenv_values
import warnings
import pyodbc
warnings.filterwarnings('ignore')

In [6]:
# Connection details (server, login, password, database) are read from the '.env' file in the next cell,
# so they are not hard-coded in the notebook.

In [7]:
environment_variables = dotenv_values('.env')
#Get the values for the credentials you set in the '.env' file
database = environment_variables.get("database")
server = environment_variables.get("server")
username = environment_variables.get("login")
password = environment_variables.get("password")
connection_string=f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
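
If a key is absent from the '.env' file, `.get` returns `None` and the connection string silently contains the text "None". A minimal sketch of a guard against that is shown below; `build_connection_string` and `REQUIRED_KEYS` are hypothetical names introduced here, and the example values are placeholders.

```python
REQUIRED_KEYS = ("server", "database", "login", "password")


def build_connection_string(env):
    """Build the pyodbc connection string, failing early if a key is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing values in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['server']};"
        f"DATABASE={env['database']};"
        f"UID={env['login']};"
        f"PWD={env['password']}"
    )


# Placeholder values for illustration; real values belong in the '.env' file.
example_env = {
    "server": "example-server",
    "database": "example-db",
    "login": "example-user",
    "password": "example-password",
}
print(build_connection_string(example_env))
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed in place of `example_env`.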




In [8]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary
connection=pyodbc.connect(connection_string)

In [9]:
# The SQL query below lists the tables available in the database.
# Note that you will not have permission to insert, delete, or update rows in this database.

db_query = '''SELECT * 
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE' '''

In [10]:
ata = pd.read_sql(db_query, connection)
ata

Unnamed: 0,TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
0,dapDB,dbo,LP1_startup_funding2021,BASE TABLE
1,dapDB,dbo,LP1_startup_funding2020,BASE TABLE


DATASET ONE: DATA UNDERSTANDING

In [11]:
query = "select * from dbo.LP1_startup_funding2021"

ata = pd.read_sql(query, connection)
ata

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed
...,...,...,...,...,...,...,...,...,...
1204,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A
1205,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D
1206,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C
1207,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B


In [12]:
#checking the first five rows of the ata dataset
ata.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [13]:
#the names of the columns in the dataset
print(ata.columns)

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')


In [14]:
#descriptive summary of the dataset
ata.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,1208.0,2016.655629,4.517364,1963.0,2015.0,2018.0,2020.0,2021.0


In [15]:
#checking for some information of the dataset. The dataset has some missing values
ata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


DATA CLEANING

Handling Missing Data

In [16]:
#checking for missing data in the different columns of the dataset
ata.isnull().sum()


Company_Brand      0
Founded            1
HeadQuarter        1
Sector             0
What_it_does       0
Founders           4
Investor          62
Amount             3
Stage            428
dtype: int64
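
The raw counts are easier to judge as a share of all rows. The short sketch below simply divides the counts printed above by the 1209 rows reported by `ata.info()`; the variable names are introduced here for illustration.

```python
# Missing-value counts copied from the isnull().sum() output above.
missing_counts = {
    "Founded": 1,
    "HeadQuarter": 1,
    "Founders": 4,
    "Investor": 62,
    "Amount": 3,
    "Stage": 428,
}
total_rows = 1209  # from the RangeIndex reported by ata.info()

# Share of rows with a missing value, per column.
missing_share = {col: count / total_rows for col, count in missing_counts.items()}
for col, share in sorted(missing_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<12} {share:6.1%}")
```

With the DataFrame at hand, `ata.isnull().mean()` gives the same shares directly. Stage stands out: roughly a third of its values are missing, which matters for how the gaps are filled.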

Fill Missing Values

In [17]:
#forward filling of missing values
ata.ffill(inplace=True)  # forward fill: each gap takes the value from the row above


In [18]:
ata.isnull().sum()


Company_Brand    0
Founded          0
HeadQuarter      0
Sector           0
What_it_does     0
Founders         0
Investor         0
Amount           0
Stage            0
dtype: int64
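
Forward filling removes every gap, but it does so by copying values from the previous row, which belongs to a different company. In the raw table upGrad has no Stage, and after the fill it carries the "Pre-series A" of Unbox Robotics. A possible alternative, sketched here with an assumed "Undisclosed" label, is to mark the gap explicitly instead.

```python
import pandas as pd

# First three rows of the 2021 table (two columns only); upGrad has no Stage.
sample = pd.DataFrame({
    "Company_Brand": ["Unbox Robotics", "upGrad", "Lead School"],
    "Stage": ["Pre-series A", None, "Series D"],
})

# Forward fill copies the value from the row above into the gap.
forward_filled = sample.ffill()

# An explicit label ("Undisclosed" is an assumed placeholder) keeps the gap visible.
labelled = sample.fillna({"Stage": "Undisclosed"})

print(forward_filled.loc[1, "Stage"])
print(labelled.loc[1, "Stage"])
```

Which option is preferable depends on the analysis; for numeric columns such as Amount, leaving the value missing may be safer than inventing one.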

In [19]:
#Inspect the Amount column before removing '$' and ',' characters
ata["Amount"]

0         $1,200,000
1       $120,000,000
2        $30,000,000
3        $51,000,000
4         $2,000,000
            ...     
1204        $3000000
1205       $20000000
1206       $55000000
1207       $26000000
1208        $8000000
Name: Amount, Length: 1209, dtype: object

In [20]:
ata["Amount"] = ata.Amount.apply(lambda x:str(x).replace('$', ''))

In [21]:
ata["Amount"] = ata.Amount.apply(lambda x:str(x).replace(',', ''))

In [22]:
#confirm if the characters have been removed
ata["Amount"]

0         1200000
1       120000000
2        30000000
3        51000000
4         2000000
          ...    
1204      3000000
1205     20000000
1206     55000000
1207     26000000
1208      8000000
Name: Amount, Length: 1209, dtype: object

HANDLING DUPLICATES

In [23]:
#Identify duplicate rows
print("Number of duplicate rows:")
print(ata.duplicated().sum())

Number of duplicate rows:
16


In [24]:
# Display duplicate rows
print(ata[ata.duplicated()])

          Company_Brand  Founded             HeadQuarter  \
110           Kirana247   2018.0               New Delhi   
111             FanPlay   2020.0          Computer Games   
243            Trinkerr   2021.0               Bangalore   
244               Zorro   2021.0                Gurugram   
245       Ultraviolette   2021.0               Bangalore   
246          NephroPlus   2009.0               Hyderabad   
247             Unremot   2020.0               Bangalore   
248         FanAnywhere   2021.0               Bangalore   
249          PingoLearn   2021.0                    Pune   
250                Spry   2021.0                  Mumbai   
251             Enmovil   2015.0               Hyderabad   
252       ASQI Advisors   2019.0                  Mumbai   
253  Insurance Samadhan   2018.0               New Delhi   
254     Evenflow Brands   2020.0                  Mumbai   
255          MasterChow   2020.0        Food & Beverages   
256  Fullife Healthcare   2009.0  Pharma

In [25]:
# Remove duplicate rows
ata_cleaned = ata.drop_duplicates().copy()  # copy so later column assignments do not act on a view

# Verify that duplicates have been removed
print("\nNumber of duplicate rows after removal:")
print(ata_cleaned.duplicated().sum())


Number of duplicate rows after removal:
0


In [26]:
ata_cleaned.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000,Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000,Pre-series A
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",30000000,Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",51000000,Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",2000000,Seed


In [27]:
ata_cleaned.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')

Possible Question One: Find the total amount of investment in 2021

In [28]:

# Convert the 'Amount' column to numeric, setting errors='coerce' to handle non-numeric values
ata_cleaned['Amount'] = pd.to_numeric(ata_cleaned['Amount'], errors='coerce')

# Drop NaN values from the 'Amount' column
ata_cleaned = ata_cleaned.dropna(subset=['Amount'])

# Calculate the sum of the 'Amount' column
amount_sum = ata_cleaned['Amount'].sum()

# Print the result
print("The sum of the 'Amount' column is:", amount_sum)

The sum of the 'Amount' column is: 179596426000.0


Possible Question Two: Which investor contributed the highest amount in 2021

In [29]:
# Group by 'Investor' and sum their contributions
investor_contributions = ata_cleaned.groupby('Investor')['Amount'].sum()

# Identify the investor with the highest total contribution
top_investor = investor_contributions.idxmax()
top_contribution = investor_contributions.max()

# Print the result
print(f"The investor who contributed the highest amount in 2021 is: {top_investor} with a total of {top_contribution}")

The investor who contributed the highest amount in 2021 is: Arena Holdings, Think Investments with a total of 150065000000.0
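
The top "investor" above is really two names stored in one comma-separated string, because the grouping treats each distinct string as a single investor. The sketch below shows one way to split the strings and count deals per individual investor. `split_investors` is a hypothetical helper, and counting deals (rather than splitting amounts) is a choice made here because the data does not say how much each co-investor contributed.

```python
from collections import Counter


def split_investors(raw):
    """Split a comma-separated investor string into individual, trimmed names."""
    return [name.strip() for name in str(raw).split(",") if name.strip()]


# Investor strings taken from the first rows of the 2021 table.
investor_column = [
    "BEENEXT, Entrepreneur First",
    "Unilazer Ventures, IIFL Asset Management",
    "GSV Ventures, Westbridge Capital",
    "CDC Group, IDG Capital",
]

# Count how many deals each individual investor appears in.
deal_counts = Counter(name for raw in investor_column for name in split_investors(raw))
print(deal_counts.most_common(3))
```

On the full DataFrame, a comparable result could be obtained with `ata_cleaned['Investor'].str.split(',').explode().str.strip().value_counts()`.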


Question Three: Which sector received the highest amount of investment in 2021

In [30]:
# Group by 'Sector' and sum their contributions
sector_contributions = ata_cleaned.groupby('Sector')['Amount'].sum()

# Identify the sector with the highest total contribution
top_sector = sector_contributions.idxmax()
top_contribution = sector_contributions.max()

# Print the result
print(f"The sector that received the highest amount of investment in 2021 is: {top_sector} with a total of {top_contribution}")

The sector that received the highest amount of investment in 2021 is: FinTech with a total of 152611980000.0


Question Four: The region with the highest investment in 2021

In [31]:
# Group by 'HeadQuarter' and sum the investments
hq_investment_2021 = ata_cleaned.groupby('HeadQuarter')['Amount'].sum()

# Identify the headquarter with the highest total investment
top_hq_2021 = hq_investment_2021.idxmax()
highest_investment_2021 = hq_investment_2021.max()

# Print the result
print(f"The region that received the highest amount of investment in 2021 is: {top_hq_2021} with a total of {highest_investment_2021}")

The region that received the highest amount of investment in 2021 is: Mumbai with a total of 154133150000.0
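
The four results above are worth reading together. The quick calculation below, which only reuses the printed figures, shows what share of the overall total each "top" result represents.

```python
# Figures copied from the printed results above.
total_amount = 179_596_426_000.0
top_results = {
    "top investor (Arena Holdings, Think Investments)": 150_065_000_000.0,
    "top sector (FinTech)": 152_611_980_000.0,
    "top region (Mumbai)": 154_133_150_000.0,
}

# Share of the overall total captured by each top result.
shares = {label: amount / total_amount for label, amount in top_results.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%} of the total")
```

Each of the three accounts for well over 80% of the total, which suggests that a small number of very large entries may be driving all of these answers. Inspecting the largest rows, for example with `ata_cleaned.nlargest(5, 'Amount')`, would help decide whether they are genuine deals or recording errors.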


DATASET TWO: DATA UNDERSTANDING

In [32]:
query1 = "select * from dbo.LP1_startup_funding2020"
ata2 = pd.read_sql(query1, connection)
ata2

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,
...,...,...,...,...,...,...,...,...,...,...
1050,Leverage Edu,,Delhi,Edtech,AI enabled marketplace that provides career gu...,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures",1500000.0,,
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital",13200000.0,Seed Round,
1052,Purplle,2012.0,Mumbai,Cosmetics,Online makeup and beauty products retailer,"Manish Taneja, Rahul Dash",Verlinvest,8000000.0,,
1053,Shuttl,2015.0,Delhi,Transport,App based bus aggregator serice,"Amit Singh, Deepanshu Malviya",SIG Global India Fund LLP.,8043000.0,Series C,


In [33]:
#checking for missing data in the second dataset
ata2.isnull().sum()

Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64

In [34]:
#check for some information about the ata2 dataset
ata2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.6+ KB


In [35]:
#describe the dataset
ata2.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,842.0,2015.363,4.097909,1973.0,2014.0,2016.0,2018.0,2020.0
Amount,801.0,113043000.0,2476635000.0,12700.0,1000000.0,3000000.0,11000000.0,70000000000.0


Handling Missing Data

In [36]:
ata2.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')

In [37]:

# Drop the column named 'column10' from the DataFrame (only 2 of its 1055 values are non-null)
ata2 = ata2.drop(columns=['column10'])

In [38]:
#checking the data types of each column
print("Data types for each column:")
print(ata2.dtypes)

Data types for each column:
Company_Brand     object
Founded          float64
HeadQuarter       object
Sector            object
What_it_does      object
Founders          object
Investor          object
Amount           float64
Stage             object
dtype: object


In [39]:
#Separate numeric (float) and object columns
float_cols = ata2.select_dtypes(include=['float']).columns
object_cols = ata2.select_dtypes(include=['object']).columns

In [40]:
#Fill missing data with the mean for float columns
ata2[float_cols] = ata2[float_cols].fillna(ata2[float_cols].mean())

In [41]:
#Fill missing data with the mode for object columns
for col in object_cols:
    ata2[col] = ata2[col].fillna(ata2[col].mode()[0])

In [42]:
#Verify that missing data has been filled
print("\nColumns with missing data after filling:")
print(ata2.isnull().sum())


Columns with missing data after filling:
Company_Brand    0
Founded          0
HeadQuarter      0
Sector           0
What_it_does     0
Founders         0
Investor         0
Amount           0
Stage            0
dtype: int64
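
Filling with the mean deserves a second look for Amount, because the column is heavily skewed: the `describe()` output reports a mean of about 113 million against a median of 3 million. The arithmetic below, using only figures printed above, estimates how much each fill strategy adds to the yearly total; the variable names are introduced here for illustration.

```python
from statistics import mean, median

# Figures copied from the describe() and isnull().sum() outputs above.
reported_mean = 113_043_000.0    # mean of the 801 known amounts, as displayed
reported_median = 3_000_000.0    # the 50% row
missing_amounts = 254

# Total funding that each strategy would add through the filled rows.
added_by_mean_fill = missing_amounts * reported_mean
added_by_median_fill = missing_amounts * reported_median
print(f"mean fill adds   {added_by_mean_fill:,.0f}")
print(f"median fill adds {added_by_median_fill:,.0f}")

# Four amounts from the displayed rows plus the reported maximum:
# a single huge value drags the mean far above the median.
skewed = [100_000, 200_000, 340_000, 400_000, 70_000_000_000]
print(mean(skewed) > 1000 * median(skewed))
```

A median fill, for instance `ata2['Amount'].fillna(ata2['Amount'].median())`, or leaving the amounts missing, may distort totals less. The same caution applies to Founded, where the mean (2015.363) is not a valid year.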


Handling Duplicates

In [43]:
#Identify duplicate rows
print("Number of duplicate rows:")
print(ata2.duplicated().sum())


Number of duplicate rows:
3


In [44]:
# Display duplicate rows
print("\nDuplicate rows:")
print(ata2[ata2.duplicated()])


Duplicate rows:
    Company_Brand  Founded HeadQuarter                 Sector  \
145     Krimanshi   2015.0     Jodhpur  Biotechnology company   
205         Nykaa   2012.0      Mumbai              Cosmetics   
362        Byju’s   2011.0   Bangalore                 EdTech   

                                          What_it_does         Founders  \
145  Krimanshi aims to increase rural income by imp...     Nikhil Bohra   
205  Nykaa is an online marketplace for different b...    Falguni Nayar   
362  An Indian educational technology and online tu...  Byju Raveendran   

                                           Investor        Amount     Stage  
145  Rajasthan Venture Capital Fund, AIM Smart City  6.000000e+05      Seed  
205                        Alia Bhatt, Katrina Kaif  1.130430e+08  Series A  
362           Owl Ventures, Tiger Global Management  5.000000e+08  Series A  


In [45]:
#Remove duplicate rows
ata2_cleaned = ata2.drop_duplicates()

In [46]:
#Verify that duplicates have been removed
print("\nNumber of duplicate rows after removal:")
print(ata2_cleaned.duplicated().sum())


Number of duplicate rows after removal:
0


In [47]:
ata2_cleaned.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,Series A
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,113043000.0,Pre-seed
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,Series A
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,Series A


DATASET THREE

In [48]:
data1 = pd.read_csv("C:/Users/user/Downloads/startup_funding2018.csv")
data1

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...
...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif..."
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...
524,Droni Tech,Information Technology,Seed,"₹35,000,000","Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...


DATA UNDERSTANDING

In [49]:
#Using the head() method to view the first five rows of the dataset
data1.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [50]:
#using the info() method to get some understanding of the dataset
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [51]:
#using the describe() method
data1.describe().T

Unnamed: 0,count,unique,top,freq
Company Name,526,525,TheCollegeFever,2
Industry,526,405,—,30
Round/Series,526,21,Seed,280
Amount,526,198,—,148
Location,526,50,"Bangalore, Karnataka, India",102
About Company,526,524,"TheCollegeFever is a hub for fun, fiesta and f...",2
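
Location in this dataset is stored as "City, State, Country", while the other datasets hold a bare city name in HeadQuarter. To compare cities across years, the values need a common form. The sketch below keeps the first comma-separated part and applies an alias table; `normalize_city` and `CITY_ALIASES` are hypothetical names, and the alias table is deliberately minimal.

```python
# Known aliases; extend as needed. Gurgaon was officially renamed Gurugram.
CITY_ALIASES = {"Gurgaon": "Gurugram"}


def normalize_city(location):
    """Keep the first comma-separated part of a location and apply known aliases."""
    city = str(location).split(",")[0].strip()
    return CITY_ALIASES.get(city, city)


for raw in ["Bangalore, Karnataka, India", "Gurgaon, Haryana, India", "Mumbai"]:
    print(raw, "->", normalize_city(raw))
```

The helper could be applied with `data1['Location'].apply(normalize_city)`.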


HANDLING MISSING DATA

In [52]:
#checking for missing data
data1.isnull().sum()

#isnull() reports 0 missing values, but describe() shows '—' used as a placeholder (148 rows in Amount, 30 in Industry)

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64
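
The Amount column in this dataset mixes three kinds of values: rupee amounts such as "₹40,000,000", plain numbers such as "250000", and the "—" placeholder. The sketch below separates them so that each case can be handled deliberately. `parse_amount_2018` is a hypothetical helper. It does not convert currencies, and it labels amounts without a symbol as "unknown" because the data does not state their currency.

```python
import re

PLACEHOLDERS = {"—", "-", ""}


def parse_amount_2018(raw):
    """Return (currency, value) for a raw Amount string, or None for a placeholder.

    Currency is "INR" for a leading rupee sign, "USD" for a leading dollar sign,
    and "unknown" when no symbol is present.
    """
    text = str(raw).strip()
    if text in PLACEHOLDERS:
        return None
    if text.startswith("₹"):
        currency = "INR"
    elif text.startswith("$"):
        currency = "USD"
    else:
        currency = "unknown"
    # Keep digits and the decimal point only (drops symbols and thousands separators).
    digits = re.sub(r"[^\d.]", "", text)
    try:
        return currency, float(digits)
    except ValueError:
        return None


for raw in ["250000", "₹40,000,000", "—"]:
    print(repr(raw), "->", parse_amount_2018(raw))
```

Rows that parse to `None` are effectively missing amounts, and rupee amounts would need converting before they can be summed with amounts from the other years.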

HANDLING DUPLICATES

In [53]:
#Identify duplicate rows
print("Number of duplicate rows:")
print(data1.duplicated().sum())

Number of duplicate rows:
1


In [54]:
# Display duplicate rows
print("\nDuplicate rows:")
print(data1[data1.duplicated()])


Duplicate rows:
        Company Name                                           Industry  \
348  TheCollegeFever  Brand Marketing, Event Promotion, Marketing, S...   

    Round/Series  Amount                     Location  \
348         Seed  250000  Bangalore, Karnataka, India   

                                         About Company  
348  TheCollegeFever is a hub for fun, fiesta and f...  


In [55]:
#Remove duplicate rows
data1_cleaned = data1.drop_duplicates()

In [56]:
print("\nNumber of duplicate rows after removal:")
print(data1_cleaned.duplicated().sum())


Number of duplicate rows after removal:
0


DATASET FOUR: DATA UNDERSTANDING

In [57]:
data2 = pd.read_csv("C:/Users/user/Downloads/startup_funding2019.csv")
data2

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",
...,...,...,...,...,...,...,...,...,...
84,Infra.Market,,Mumbai,Infratech,It connects client requirements to their suppl...,"Aaditya Sharda, Souvik Sengupta","Tiger Global, Nexus Venture Partners, Accel Pa...","$20,000,000",Series A
85,Oyo,2013.0,Gurugram,Hospitality,Provides rooms for comfortable stay,Ritesh Agarwal,"MyPreferred Transformation, Avendus Finance, S...","$693,000,000",
86,GoMechanic,2016.0,Delhi,Automobile & Technology,Find automobile repair and maintenance service...,"Amit Bhasin, Kushal Karwa, Nitin Rana, Rishabh...",Sequoia Capital,"$5,000,000",Series B
87,Spinny,2015.0,Delhi,Automobile,Online car retailer,"Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...","Norwest Venture Partners, General Catalyst, Fu...","$50,000,000",


In [58]:
data2.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [59]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [60]:
data2.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,60.0,2014.533333,2.937003,2004.0,2013.0,2015.0,2016.25,2019.0


HANDLING MISSING DATA

In [61]:
#Check for missing data
data2.isnull().sum()

Company/Brand     0
Founded          29
HeadQuarter      19
Sector            5
What it does      0
Founders          3
Investor          0
Amount($)         0
Stage            46
dtype: int64

In [62]:

#Separate numeric (float) and object columns
float_cols = data2.select_dtypes(include=['float']).columns
object_cols = data2.select_dtypes(include=['object']).columns

#Fill missing data with the mean for float columns
data2[float_cols] = data2[float_cols].fillna(data2[float_cols].mean())

#Fill missing data with the mode for object columns
for col in object_cols:
    data2[col] = data2[col].fillna(data2[col].mode()[0])

#Verify that missing data has been filled
print("\nColumns with missing data after filling:")
print(data2.isnull().sum())



Columns with missing data after filling:
Company/Brand    0
Founded          0
HeadQuarter      0
Sector           0
What it does     0
Founders         0
Investor         0
Amount($)        0
Stage            0
dtype: int64


HANDLING DUPLICATES

In [63]:
#Identify duplicate rows
print("Number of duplicate rows:")
print(data2.duplicated().sum())

#the dataset has no duplicates

Number of duplicate rows:
0


DATA ANALYSIS AND VISUALIZATION

Research Questions
•	How has the total amount of funding received by start-ups in India changed from 2018 to 2021?
•	Which sectors have received the most funding in each year?
•	Who are the top investors in the Indian start-up ecosystem from 2018 to 2021?
•	Which regions or cities in India are receiving the most start-up funding?
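
Answering these questions across 2018 to 2021 requires the four datasets in one table with a year column, and the 2018 and 2019 files use different column names from the 2020 and 2021 tables. The sketch below is one possible way to line them up. The column equivalences in `COLUMN_MAPS` (for example Industry to Sector, Location to HeadQuarter) are assumptions based on the column listings above, and `combine_years` is a hypothetical helper.

```python
import pandas as pd

# Assumed column equivalences between the 2018/2019 CSVs and the 2020/2021 tables.
COLUMN_MAPS = {
    2018: {
        "Company Name": "Company_Brand",
        "Industry": "Sector",
        "Round/Series": "Stage",
        "Location": "HeadQuarter",
        "About Company": "What_it_does",
    },
    2019: {
        "Company/Brand": "Company_Brand",
        "What it does": "What_it_does",
        "Amount($)": "Amount",
    },
}


def combine_years(frames_by_year):
    """Rename columns to the 2020/2021 names, tag each row with its year, and stack."""
    parts = []
    for year, frame in frames_by_year.items():
        renamed = frame.rename(columns=COLUMN_MAPS.get(year, {}))
        parts.append(renamed.assign(Year=year))
    return pd.concat(parts, ignore_index=True, sort=False)


# Tiny stand-in frames; the real call would pass the four cleaned DataFrames.
demo = combine_years({
    2018: pd.DataFrame({"Company Name": ["TheCollegeFever"], "Amount": [250000.0]}),
    2021: pd.DataFrame({"Company_Brand": ["Unbox Robotics"], "Amount": [1200000.0]}),
})
print(demo.groupby("Year")["Amount"].sum())
```

The yearly totals are only meaningful once every Amount column is numeric and expressed in the same currency.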


In [64]:
pip install matplotlib --user


Note: you may need to restart the kernel to use updated packages.



