Indian Start-up Ecosystem Funding Analysis (2018-2021)

Business Understanding

Summary of the Task:
This project involves analyzing the funding received by start-ups in India from 2018 to 2021. The goal is to investigate the Indian start-up ecosystem and propose strategic recommendations based on data-driven insights. The datasets are provided for each year, and the analysis will cover start-up details, funding amounts, and investors' information. Data is stored across various sources, and it is crucial to gather, clean, and analyze this data to derive meaningful insights.


Project Name:
Indian Start-up Funding Project (2018-2021)

Libraries and Packages:
pandas for data manipulation and analysis
numpy for numerical operations
pyodbc for database connectivity
sqlalchemy for database ORM (optional)
matplotlib and seaborn for data visualization
scikit-learn for machine learning (if applicable)
python-dotenv for managing environment variables
requests for handling HTTP requests (if needed)
os and pathlib for handling file paths and directories

## Business Questions

1. What sectors have shown the highest growth in terms of funding received over the past four years?

2. What geographical regions within India have emerged as the primary hubs for startup activity and investment, and what factors contribute to their prominence?

3. Are there any notable differences in funding patterns between early-stage startups and more established companies?

4. Which sectors receive the lowest and the highest levels of funding in India, and what factors contribute to this?

5. Which investors have had the most impact on startups over the years?

6. What are the key characteristics of startups that successfully secure funding, and how do they differ from those that struggle to attract investment?

1. Sectors with Highest Growth: This question helps identify the sectors that are experiencing rapid growth in terms of funding received, providing valuable insights into where investor interest and capital are flowing. Understanding these sectors can help investors identify potential high-growth opportunities for investment.

2. Geographical Regions for Startup Activity: Understanding the primary hubs for startup activity and investment within India helps investors gauge where the most vibrant ecosystems are located. Factors contributing to their prominence, such as infrastructure, government support, and access to talent, can influence investment decisions and strategies.

3. Funding Patterns Across Startup Stages: Comparing funding patterns between early-stage startups and more established companies helps investors understand how investment behavior varies depending on the maturity and growth stage of the startup. This insight can inform investment strategies tailored to different stages of the startup lifecycle.

4. Sectorial Funding Disparities: Identifying sectors with the lowest and highest levels of funding sheds light on where capital is concentrated and where there may be untapped opportunities. Understanding the factors contributing to these disparities can help investors assess sector-specific risks and opportunities.

5. Impactful Investors: Analyzing the influence of different investors on startups over the years provides insights into which investors have been most active and successful in driving startup growth. This understanding can help investors identify potential partners or co-investors and assess the reputations and track records of different investment firms.

6. Characteristics of Funded Startups: Identifying key characteristics shared by startups that successfully secure funding helps investors understand what factors contribute to investment readiness and attractiveness. Contrasting these characteristics with those of startups that struggle to attract investment can provide valuable lessons for entrepreneurs and investors alike.

Null Hypothesis (H0): There is no significant difference in the amount of funding between startups in different locations.

Alternative Hypothesis (Ha): There is a significant difference in the amount of funding between startups in different locations.
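
To make the hypotheses above concrete, here is a minimal sketch of one way they could be tested: a two-sided permutation test on the difference in mean funding between two locations. The function name `permutation_test` and the sample amounts are illustrative assumptions, not values from the dataset, and the notebook may use a different test.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean funding."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign amounts to the two locations
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical funding amounts in USD for two locations (assumption, not real data)
location_a = [1_200_000, 5_000_000, 30_000_000, 2_000_000, 51_000_000, 8_000_000]
location_b = [100_000, 400_000, 340_000, 200_000, 600_000, 150_000]

p_value = permutation_test(location_a, location_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value would argue for rejecting H0 in favour of Ha; the significance threshold is a choice left to the analyst.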

In [5463]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import re

import folium
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from matplotlib.ticker import FuncFormatter
# Database connectivity
import pyodbc

# Database ORM (optional)
from sqlalchemy import create_engine

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Machine learning (if applicable)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Managing environment variables
from dotenv import dotenv_values

# Handling HTTP requests (if needed)
import requests

# Handling file paths and directories
import os
from pathlib import Path


import warnings 

warnings.filterwarnings('ignore')

Loading Data to Python VSO Environment:

1. Database Connection (2020 and 2021 Data):

In [5464]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("UID")
password = environment_variables.get("PWD")
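
Because `.get()` returns `None` for a key that is absent from the `.env` file, a missing credential would only surface later as a connection error. A minimal sketch of an early check follows; the helper name `find_missing_credentials` and the example values are assumptions for illustration.

```python
REQUIRED_KEYS = ("SERVER", "DATABASE", "UID", "PWD")


def find_missing_credentials(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in a dict of settings."""
    return [key for key in required if not env.get(key)]


# Stand-in for the dictionary returned by dotenv_values('.env') (hypothetical values)
example_env = {"SERVER": "example-server", "DATABASE": "example_db", "UID": "reader", "PWD": ""}

missing = find_missing_credentials(example_env)
if missing:
    print("Missing credentials:", ", ".join(missing))
```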


In [5465]:
# Create a connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"

In [5466]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary

connection = pyodbc.connect(connection_string)

In [5467]:
# SQL query to retrieve the 2020 funding data.
# Note: this connection is read-only; you will not have permission to insert, delete, or update rows in this table.

# select data from 2020

query = "SELECT * FROM LP1_startup_funding2020"

data20 = pd.read_sql(query, connection)
data20.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [5468]:
data20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.5+ KB


In [5469]:
data20.shape

(1055, 10)

In [5470]:
# creating a column to identify each dataset by addition of data year

data20['Funding_Year'] = 2020

#Change the funding year to integer type

data20['Funding_Year'] = data20['Funding_Year'].astype(int)

data20.info()

data20.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
 10  Funding_Year   1055 non-null   int32  
dtypes: float64(2), int32(1), object(8)
memory usage: 86.7+ KB


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10,Funding_Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,,2020
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,,2020
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,,2020


In [5471]:
data20.shape

(1055, 11)

In [5472]:
#printing columns to compare if the column names are matching
print(data20.columns)

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10', 'Funding_Year'],
      dtype='object')
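
Comparing printed column lists by eye is error-prone, so the check can also be done programmatically. The sketch below uses the 2020 column names printed above and the 2021 column names printed later in this notebook; the helper name `column_differences` is hypothetical.

```python
columns_2020 = ['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
                'Founders', 'Investor', 'Amount', 'Stage', 'column10', 'Funding_Year']
columns_2021 = ['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
                'Founders', 'Investor', 'Amount', 'Stage', 'Funding_Year']


def column_differences(left, right):
    """Return the names found only in `left` and only in `right`, keeping order."""
    only_left = [name for name in left if name not in right]
    only_right = [name for name in right if name not in left]
    return only_left, only_right


only_2020, only_2021 = column_differences(columns_2020, columns_2021)
print("Only in 2020:", only_2020)
print("Only in 2021:", only_2021)
```

With real DataFrames the same helper could be called as `column_differences(data20.columns, data21.columns)`.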


In [5473]:
# Rename columns in a single call

data20.rename(columns={'Company_Brand': 'Company_Name', 'HeadQuarter': 'Location'}, inplace=True)

data20.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,column10,Funding_Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,,2020
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,,2020
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,,2020


In [5474]:
# Select the columns to keep; .copy() makes later column assignments act on an independent DataFrame
data20 = data20[['Company_Name', 'Founded','Location','Sector','What_it_does','Founders','Investor','Amount','Stage','Funding_Year']].copy()

data20.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,2020
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,2020
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,2020


In [5475]:
# Convert the Founded column to a nullable integer type (missing years stay as <NA>)
data20['Founded'] = pd.to_numeric(data20['Founded'], errors='coerce').astype('Int64')
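
As a small illustration of why a nullable integer type suits the `Founded` column, the sketch below converts a made-up series that contains a missing value and a non-numeric entry. The sample values are assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical 'Founded' values: floats, a missing value, and a stray string
founded = pd.Series([2019.0, None, "unknown", 2016.0])

# errors='coerce' turns unparseable entries into NaN;
# the nullable 'Int64' dtype then stores years as integers and gaps as <NA>
founded_int = pd.to_numeric(founded, errors="coerce").astype("Int64")

print(founded_int.dtype)
print(founded_int.isna().sum(), "missing values")
```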

#### Exchange rates 

[Source: OFX](https://www.ofx.com/en-au/forex-news/historical-exchange-rates/yearly-average-rates/)
```python
exchange_rates = {
    2018: 0.014649,
    2019: 0.014209,
    2020: 0.013501,
    2021: 0.013527
}
```

In [5476]:
# Create a function to clean the Amount column of the 2020 DataFrame and convert Indian Rupees to US Dollars

def clean_amount_2020(amount):
    try:
        amount = str(amount)
        # Remove thousands separators and dash placeholders
        amount = amount.replace(",", "").replace('—', "")
        # Values in Indian Rupees are converted to US Dollars using the 2020
        # yearly average rate from the exchange-rate table above (1 INR = 0.013501 USD)
        if "₹" in amount:
            return round(float(amount.replace("₹", "")) * 0.013501, 2)
        # Values in US Dollars, or with no currency symbol (assumed to be US Dollars)
        return round(float(amount.replace("$", "")), 2)
    except ValueError:
        # If the value is not a number, return NaN
        return np.nan

# Clean the Amount column of the 2020 DataFrame
data20["Amount"] = data20["Amount"].apply(clean_amount_2020)
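
Since each year has its own average exchange rate, one cleaning function that takes the funding year as a parameter could replace per-year copies. This is a minimal sketch under that assumption; `clean_amount` is a hypothetical name and the rates are the ones listed in the exchange-rate table.

```python
import math

# Yearly average INR -> USD rates, copied from the exchange-rate table
EXCHANGE_RATES = {2018: 0.014649, 2019: 0.014209, 2020: 0.013501, 2021: 0.013527}


def clean_amount(amount, year):
    """Parse a raw amount into US Dollars; return NaN when it is not a number."""
    # Drop thousands separators and the em dash ("\u2014") used as a placeholder
    text = str(amount).replace(",", "").replace("\u2014", "").strip()
    try:
        if "₹" in text:
            # Rupee amounts are converted with the rate for the given year
            return round(float(text.replace("₹", "")) * EXCHANGE_RATES[year], 2)
        # Dollar amounts, or amounts without a symbol (assumed to be US Dollars)
        return round(float(text.replace("$", "")), 2)
    except ValueError:
        return math.nan


print(clean_amount("₹1,000,000", 2020))
print(clean_amount("$1,200,000", 2021))
print(clean_amount("Undisclosed", 2021))
```

With a DataFrame, this could be applied as `data20["Amount"].apply(clean_amount, year=2020)`. A year missing from `EXCHANGE_RATES` raises a `KeyError`, which is left unhandled here on purpose so that an unexpected year is noticed.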

In [5477]:
# Convert the Amount column to numeric; this requires removing symbols such as commas and currency signs

data20['Amount'] = data20['Amount'].apply(lambda x:str(x).replace('$', ''))

data20['Amount'] = data20['Amount'].apply(lambda x:str(x).replace(',', ''))

data20['Amount'] = data20['Amount'].replace('—', np.nan)




In [5478]:
#Find the number of rows with undisclosed amounts 

index1 = data20.index[data20['Amount']=='Undisclosed']

print('The total number of undisclosed records is', len(index1))

The total number of undisclosed records is 0


In [5479]:
# convert undisclosed to NAN
data20['Amount'] = data20['Amount'].replace('Undisclosed', np.nan)

In [5480]:
#print a summary information on the 2020 data 
data20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company_Name  1055 non-null   object
 1   Founded       842 non-null    Int64 
 2   Location      961 non-null    object
 3   Sector        1042 non-null   object
 4   What_it_does  1055 non-null   object
 5   Founders      1043 non-null   object
 6   Investor      1017 non-null   object
 7   Amount        1055 non-null   object
 8   Stage         591 non-null    object
 9   Funding_Year  1055 non-null   int32 
dtypes: Int64(1), int32(1), object(8)
memory usage: 79.5+ KB


In [5481]:
#Find the row with 887000 23000000 in the amount section
index1 = data20.index[data20['Amount']=='887000 23000000']
index1

Index([], dtype='int64')

In [5482]:
# Replace the value at row 465 with the average of the two figures (887000 and 23000000)
avg = str((887000+23000000)/2)
data20.at[465, 'Amount'] = avg 


In [5483]:
#print the row record to confirm
print(data20.iloc[(465)])

Company_Name                                         True Balance
Founded                                                      2014
Location                                                 Gurugram
Sector                                                    Finance
What_it_does    Earn money by meeting financial needs of your ...
Founders                                     Charlie, Jay, Martin
Investor                                              Balancehero
Amount                                                 11943500.0
Stage                                                    Series C
Funding_Year                                                 2020
Name: 465, dtype: object


In [5484]:

#Find the row with 800000000 to 850000000 in the amount section
index2 = data20.index[data20['Amount']=='800000000 to 850000000']

In [5485]:
# Replace the value at row 472 with the average of the range (800000000 to 850000000)
avg = str((800000000+850000000)/2)

data20.at[472, 'Amount'] = avg 

In [5486]:
#print the row record to confirm 
print(data20.iloc[(472)])

Company_Name                                             Eruditus
Founded                                                      2010
Location                                                   Mumbai
Sector                                                  Education
What_it_does    Bring world-class business and professional ed...
Founders                     Chaitanya Kalipatnapu, Ashwin Damera
Investor        Bertelsmann India Investments, Sequoia Capital...
Amount                                                825000000.0
Stage                                                        None
Funding_Year                                                 2020
Name: 472, dtype: object
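
The two fixes above hard-code the row labels and the figures to average. A small helper could instead compute the midpoint directly from the raw text, as sketched below; `range_midpoint` is a hypothetical name, and the rule of averaging exactly two numbers is an assumption about how ranged amounts should be treated.

```python
import re


def range_midpoint(value):
    """Return the midpoint of a two-number range, the number itself, or None."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", str(value))]
    if len(numbers) == 2:
        return (numbers[0] + numbers[1]) / 2
    if len(numbers) == 1:
        return numbers[0]
    # Zero or more than two numbers: ambiguous, so leave it to manual review
    return None


print(range_midpoint("887000 23000000"))
print(range_midpoint("800000000 to 850000000"))
```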


In [5487]:
#Convert the Amount column to numeric 

data20['Amount'] = pd.to_numeric(data20['Amount'], errors='coerce')

In [5488]:
#print a summary information on the 2020 data 
data20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  1055 non-null   object 
 1   Founded       842 non-null    Int64  
 2   Location      961 non-null    object 
 3   Sector        1042 non-null   object 
 4   What_it_does  1055 non-null   object 
 5   Founders      1043 non-null   object 
 6   Investor      1017 non-null   object 
 7   Amount        803 non-null    float64
 8   Stage         591 non-null    object 
 9   Funding_Year  1055 non-null   int32  
dtypes: Int64(1), float64(1), int32(1), object(7)
memory usage: 79.5+ KB


In [5489]:
duplicates = data20[data20.duplicated()]

duplicates

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
145,Krimanshi,2015,Jodhpur,Biotechnology company,Krimanshi aims to increase rural income by imp...,Nikhil Bohra,"Rajasthan Venture Capital Fund, AIM Smart City",600000.0,Seed,2020
205,Nykaa,2012,Mumbai,Cosmetics,Nykaa is an online marketplace for different b...,Falguni Nayar,"Alia Bhatt, Katrina Kaif",,,2020
362,Byju’s,2011,Bangalore,EdTech,An Indian educational technology and online tu...,Byju Raveendran,"Owl Ventures, Tiger Global Management",500000000.0,,2020


In [5490]:
#drop all duplicates and leave only one record 

data20 = data20.drop_duplicates(keep='first')
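
Before dropping duplicates it can help to view every copy of a repeated record, not only the later ones that `duplicated()` flags by default. A minimal sketch with made-up rows follows (the frame below is an assumption, not the real data):

```python
import pandas as pd

# Hypothetical records: the first two rows are exact duplicates
df = pd.DataFrame({
    "Company_Name": ["Nykaa", "Nykaa", "Krimanshi"],
    "Amount": [None, None, 600000.0],
})

# keep=False marks all copies of a duplicated row, including the first
all_copies = df[df.duplicated(keep=False)]

# keep='first' retains one record per duplicated group
deduped = df.drop_duplicates(keep="first")

print(len(all_copies), "rows belong to duplicate groups")
print(len(deduped), "rows remain after dropping duplicates")
```

Note that pandas treats missing values in the same column as equal when comparing rows, which is why rows with an empty `Amount` can still be flagged as duplicates.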

In [5491]:
# Check the 2020 dataset information to confirm the data types
data20.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1052 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  1052 non-null   object 
 1   Founded       839 non-null    Int64  
 2   Location      958 non-null    object 
 3   Sector        1039 non-null   object 
 4   What_it_does  1052 non-null   object 
 5   Founders      1040 non-null   object 
 6   Investor      1014 non-null   object 
 7   Amount        801 non-null    float64
 8   Stage         590 non-null    object 
 9   Funding_Year  1052 non-null   int32  
dtypes: Int64(1), float64(1), int32(1), object(7)
memory usage: 87.3+ KB


In [5492]:
# Check the first few rows
data20.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Aqgromalin,2019,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020
1,Krayonnz,2019,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020
2,PadCare Labs,2018,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,2020
3,NCOME,2020,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,2020
4,Gramophone,2016,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,2020


In [5493]:
# select data from 2021

query = "SELECT * FROM LP1_startup_funding2021"

data21 = pd.read_sql(query, connection)
data21.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [5494]:
data21.shape

(1209, 9)

In [5495]:
data21.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


In [5496]:
# creating a column to identify each dataset by addition of data year

data21['Funding_Year'] = 2021

# Change the Funding_Year to integer type

data21['Funding_Year'] = data21['Funding_Year'].astype(int)

data21.info()

data21.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
 9   Funding_Year   1209 non-null   int32  
dtypes: float64(1), int32(1), object(8)
memory usage: 89.9+ KB


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",,2021
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D,2021
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C,2021
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,2021


In [5497]:
#printing columns to compare if the column names are matching

print(data21.columns)

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Funding_Year'],
      dtype='object')


In [5498]:
# Rename columns in a single call

data21.rename(columns={'Company_Brand': 'Company_Name', 'HeadQuarter': 'Location'}, inplace=True)

data21.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",,2021
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D,2021
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C,2021
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,2021


In [5499]:
# Select the columns to keep; .copy() makes later column assignments act on an independent DataFrame
data21 = data21[['Company_Name', 'Founded','Location','Sector','What_it_does', 'Founders','Investor','Amount','Stage','Funding_Year']].copy()

data21.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",,2021
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D,2021
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C,2021
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,2021


In [5500]:
# total undisclosed in the dataset
index5 = data21.index[data21['Amount']=='Undisclosed']

print(len(index5))

43


In [5501]:
#print the row records 
data21.loc[(index5)].tail()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
824,Avalon Labs,2017.0,Bangalore,FinTech,"Avalon Labs incubates, invests, partners with ...",Varun Mayya,"Tanglin Ventures, Better Capital, Whiteboard C...",Undisclosed,Pre-series A,2021
827,Rezo.ai,2017.0,Noida,AI startup,Conversational AI platform - Innovate the way ...,"Manish Gupta, Rashi Gupta","Devesh Sachdev, Bhavesh Manglani",Undisclosed,Seed,2021
833,Polygon,2017.0,Mumbai,Crypto,Polygon is a blockchain scalability platform.,"Jaynti Kanani, Sandeep Nailwal, Anurag Arjun","Mark Cuban, MiH Ventures",Undisclosed,,2021
846,Ingenium,2018.0,New Delhi,EdTech,Ingenium Education has been pushing e-learning...,"Pramudit Somvanshi, Mohit Patel, Aakash Gupta",Lead Angels,Undisclosed,Seed,2021
853,Celcius,2020.0,Mumbai,Logistics,The “ONLINE” Cold Chain network for Reefer tru...,"Swarup Bose, Rajneesh Raman, Arbind Jain",Eaglewings Ventures,Undisclosed,Seed,2021


In [5502]:
# Replace the Undisclosed with NAN

data21['Amount'] = data21['Amount'].replace('Undisclosed', np.nan)

In [5503]:
#print the last 5 row records 
data21.loc[(index5)].tail()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
824,Avalon Labs,2017.0,Bangalore,FinTech,"Avalon Labs incubates, invests, partners with ...",Varun Mayya,"Tanglin Ventures, Better Capital, Whiteboard C...",,Pre-series A,2021
827,Rezo.ai,2017.0,Noida,AI startup,Conversational AI platform - Innovate the way ...,"Manish Gupta, Rashi Gupta","Devesh Sachdev, Bhavesh Manglani",,Seed,2021
833,Polygon,2017.0,Mumbai,Crypto,Polygon is a blockchain scalability platform.,"Jaynti Kanani, Sandeep Nailwal, Anurag Arjun","Mark Cuban, MiH Ventures",,,2021
846,Ingenium,2018.0,New Delhi,EdTech,Ingenium Education has been pushing e-learning...,"Pramudit Somvanshi, Mohit Patel, Aakash Gupta",Lead Angels,,Seed,2021
853,Celcius,2020.0,Mumbai,Logistics,The “ONLINE” Cold Chain network for Reefer tru...,"Swarup Bose, Rajneesh Raman, Arbind Jain",Eaglewings Ventures,,Seed,2021
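
The 2021 amounts arrive as strings such as `$1,200,000`, so they still need to be converted to numbers. As a hedged sketch of one possible approach, the symbols can be stripped in a single vectorised pass and anything non-numeric (including `Undisclosed`) coerced to NaN. The sample values are taken from the rows shown above plus an assumed missing entry.

```python
import pandas as pd

# Sample raw amounts in the 2021 style (the None entry is an assumption)
amounts = pd.Series(["$1,200,000", "Undisclosed", "$120,000,000", None, "1000000"])

# Remove dollar signs and commas, then coerce non-numeric text to NaN
cleaned = pd.to_numeric(amounts.str.replace(r"[$,]", "", regex=True), errors="coerce")

print(cleaned)
```

Coercing silently hides misplaced values such as a funding stage sitting in the amount column, so it is worth inspecting the non-numeric entries before applying it.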


In [5504]:
# Count the rows where 'Upsparks' appears in the Amount column
index6 = data21.index[data21['Amount']=='Upsparks']

print(len(index6), index6)

2 Index([98, 111], dtype='int64')

In [5505]:
# display them
data21.loc[index6]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
98,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000,2021
111,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000,2021


In [5506]:
#drop the duplicate

data21 = data21.drop(labels=index6[1], axis=0)

In [5507]:
#Rearrange the record data correctly 

data21.loc[index6[0], ['Amount', 'Stage']] = ['$1200000', '']


In [5508]:
# Display the changes (select by row label)
data21.loc[98]

Company_Name                                              FanPlay
Founded                                                    2020.0
Location                                           Computer Games
Sector                                             Computer Games
What_it_does    A real money game app specializing in trivia g...
Founders                                                   YC W21
Investor                              Pritesh Kumar, Bharat Gupta
Amount                                                   $1200000
Stage                                                            
Funding_Year                                                 2021
Name: 98, dtype: object

In [5509]:
# Find rows where 'Series C' appears in the Amount column
index7 = data21.index[data21['Amount']=='Series C']

print(len(index7), index7)

2 Index([242, 256], dtype='int64')

In [5510]:
# show the entry
data21.loc[index7]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
242,Fullife Healthcare,2009.0,Pharmaceuticals\t#REF!,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,$22000000,Series C,,2021
256,Fullife Healthcare,2009.0,Pharmaceuticals\t#REF!,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,$22000000,Series C,,2021


In [5511]:
# since it is a duplicate, drop one
data21 = data21.drop(labels=index7[1], axis=0)

In [5512]:
# rearrange the column entries
data21.loc[index7[0], ['Sector', 'Location', 'Amount', 'Investor', 'Stage']] = ['Pharmaceuticals', '', '$22000000', '', 'Series C']

data21.loc[242]

Company_Name                    Fullife Healthcare
Founded                                     2009.0
Location                                          
Sector                             Pharmaceuticals
What_it_does                          Varun Khanna
Founders        Morgan Stanley Private Equity Asia
Investor                                          
Amount                                   $22000000
Stage                                     Series C
Funding_Year                                  2021
Name: 242, dtype: object

In [5513]:
index8 = data21.index[data21['Amount']=='Seed']

print(index8)

Index([257, 1148], dtype='int64')


In [5514]:
data21.loc[index8]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
257,MoEVing,2021.0,Gurugram\t#REF!,MoEVing is India's only Electric Mobility focu...,"Vikash Mishra, Mragank Jain","Anshuman Maheshwary, Dr Srihari Raju Kalidindi",$5000000,Seed,,2021
1148,Godamwale,2016.0,Mumbai,Logistics & Supply Chain,Godamwale is tech enabled integrated logistics...,"Basant Kumar, Vivek Tiwari, Ranbir Nandan",1000000\t#REF!,Seed,,2021


In [5515]:
data21.loc[index8[0], ['Sector', 'Location', 'Amount', 'Investor', 'Stage']] = ['Electric Mobility', 'Gurugram', '$5000000', '', 'Seed']
data21.loc[index8[1], ['Amount', 'Investor', 'Stage']] = ['1000000', '', 'Seed']

In [5516]:
data21.loc[257]

Company_Name                                           MoEVing
Founded                                                 2021.0
Location                                              Gurugram
Sector                                       Electric Mobility
What_it_does                       Vikash Mishra, Mragank Jain
Founders        Anshuman Maheshwary, Dr Srihari Raju Kalidindi
Investor                                                      
Amount                                                $5000000
Stage                                                     Seed
Funding_Year                                              2021
Name: 257, dtype: object

In [5517]:
index9 = data21.index[data21['Amount']=='ah! Ventures']

print(index9)

Index([538], dtype='int64')


In [5518]:
data21.loc[index9]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
538,Little Leap,2020.0,New Delhi,EdTech,Soft Skills that make Smart Leaders,Holistic Development Programs for children in ...,Vishal Gupta,ah! Ventures,$300000,2021


In [5519]:
data21.loc[index9, ['Amount', 'Stage']] = ['$300000', '']

In [5520]:
data21.loc[538]

Company_Name                                          Little Leap
Founded                                                    2020.0
Location                                                New Delhi
Sector                                                     EdTech
What_it_does                  Soft Skills that make Smart Leaders
Founders        Holistic Development Programs for children in ...
Investor                                             Vishal Gupta
Amount                                                    $300000
Stage                                                            
Funding_Year                                                 2021
Name: 538, dtype: object

In [5521]:
# Pre-series A
index10 = data21.index[data21['Amount']=='Pre-series A']

index10

Index([545], dtype='int64')

In [5522]:
data21.loc[index10]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
545,AdmitKard,2016.0,Noida,EdTech,A tech solution for end to end career advisory...,"Vamsi Krishna, Pulkit Jain, Gaurav Munjal\t#REF!",$1000000,Pre-series A,,2021
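The AdmitKard record shows the same pattern as the earlier rows: a cell carries a `\t#REF!` marker and the values after it sit one column too far to the left. A minimal sketch of a reusable fix is given here, assuming the caller decides which column the shift starts from (that differs from row to row in the examples above). `shift_right` is a hypothetical helper, not a pandas function, and the toy frame only mimics the layout of `data21`.

```python
import numpy as np
import pandas as pd

def shift_right(df, label, start_col, end_col):
    """Shift one row's values in start_col..end_col one column to the right.

    The old value of end_col is overwritten (assumed empty) and start_col
    becomes NaN. Hypothetical helper for illustration only.
    """
    cols = list(df.columns)
    i, j = cols.index(start_col), cols.index(end_col)
    # Values currently stored in start_col up to (not including) end_col
    values = df.loc[label, cols[i:j]].tolist()
    # Write them one column further right
    df.loc[label, cols[i + 1:j + 1]] = values
    df.loc[label, start_col] = np.nan
    return df

# Toy row mimicking the AdmitKard layout (object dtype, as in data21)
toy = pd.DataFrame(
    [["Vamsi Krishna, Pulkit Jain, Gaurav Munjal\t#REF!", "$1000000", "Pre-series A", np.nan]],
    columns=["Founders", "Investor", "Amount", "Stage"],
    index=[545],
    dtype=object,
)

toy = shift_right(toy, 545, "Investor", "Stage")
# Remove the spreadsheet error marker left in the cell before the shift
toy["Founders"] = toy["Founders"].str.replace("\t#REF!", "", regex=False)
print(toy.loc[545])
```

After the shift, Amount holds '$1000000', Stage holds 'Pre-series A', and Investor is left missing because the source row never contained one.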


In [5523]:
# ITO angel network, letsventure
index11 = data21.index[data21['Amount']=='ITO Angel Network, LetsVenture']

index11

Index([551], dtype='int64')

In [5524]:
data21.loc[index11]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
551,BHyve,2020.0,Mumbai,Human Resources,A Future of Work Platform for diffusing Employ...,Backed by 100x.VC,"Omkar Pandharkame, Ketaki Ogale","ITO Angel Network, LetsVenture",$300000,2021


In [5525]:

# rearranging 
data21.at[551, 'Amount'] = '$300000'
data21.at[551, 'Investor'] = 'Omkar Pandharkame, Ketaki Ogale, JITO Angel Network, LetsVenture'
data21.at[551, 'Stage'] = ''



In [5526]:
data21.loc[index11]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
551,BHyve,2020.0,Mumbai,Human Resources,A Future of Work Platform for diffusing Employ...,Backed by 100x.VC,"Omkar Pandharkame, Ketaki Ogale, JITO Angel Ne...",$300000,,2021


In [5527]:
# JITO Angel Network, LetsVenture
index12 = data21.index[data21['Amount']=='JITO Angel Network, LetsVenture']

index12

Index([677], dtype='int64')

In [5528]:
data21.loc[index12]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,"India's fastest growing Pedagogy company, serv...",Pedagogy,Sushil Agarwal,"JITO Angel Network, LetsVenture",$1000000,2021


In [5529]:
# rearranging 
data21.at[677, 'Amount'] = '$1000000'
data21.at[677, 'Investor'] = 'Sushil Agarwal, JITO Angel Network, LetsVenture'
data21.at[677, 'Stage'] = ''

In [5530]:
data21.loc[index12]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,"India's fastest growing Pedagogy company, serv...",Pedagogy,"Sushil Agarwal, JITO Angel Network, LetsVenture",$1000000,,2021


In [5531]:
index13 = data21.index[data21['Amount']=='nan']

data21['Amount'] = data21['Amount'].replace('nan', np.nan)


#### Exchange rates 

[Source: OFX](https://www.ofx.com/en-au/forex-news/historical-exchange-rates/yearly-average-rates/)
```python
exchange_rates = {
    2018: 0.014649,
    2019: 0.014209,
    2020: 0.013501,
    2021: 0.013527
}
```

In [5533]:
# Create a function to clean the Amount column of the 2021 DataFrame and convert Indian Rupees to US Dollars

def clean_amount_2021(Amount):
    try:
        Amount = str(Amount)
        # Remove commas and the dash placeholder used for missing amounts
        Amount = Amount.replace(",", "")
        Amount = Amount.replace('—', "")
        # If the value is in Indian Rupees, convert it to US Dollars using the
        # 2021 average rate from the table above (0.013527 USD per INR, rounded to 0.0135)
        if "₹" in Amount:
            Amount = Amount.replace("₹", "")
            return round(float(Amount) * 0.0135, 2)
        # Check if the value is in US Dollars
        elif "$" in Amount:
            Amount = Amount.replace("$", "")
            return round(float(Amount), 2)
        # If no currency symbol is present, assume US Dollars
        else:
            return round(float(Amount), 2)
    except ValueError:
        # If the value is not a number, return NaN
        return np.nan

# Clean the Amount column of the 2021 DataFrame
data21["Amount"] = data21["Amount"].apply(clean_amount_2021)
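The per-year cleaning functions in this notebook differ only in the conversion rate. As a sketch under that assumption, a single function could take the funding year and look the rate up in the exchange-rate table. The names `EXCHANGE_RATES` and `clean_amount` are illustrative, and this version uses the unrounded OFX rates, so converted rupee values would differ slightly from the ones produced with the rounded rates used in the notebook.

```python
import numpy as np
import pandas as pd

# Yearly average rates copied from the exchange-rate table in this notebook
EXCHANGE_RATES = {2018: 0.014649, 2019: 0.014209, 2020: 0.013501, 2021: 0.013527}

def clean_amount(value, year, rates=EXCHANGE_RATES):
    """Parse one Amount entry into US Dollars; return NaN when it is not a number.

    Hypothetical consolidated helper, not the notebook's own function.
    """
    # Drop thousands separators and the dash placeholder for missing values
    text = str(value).replace(",", "").replace("—", "").strip()
    try:
        if "₹" in text:
            # Rupee amounts are converted with the rate for the given year
            return round(float(text.replace("₹", "")) * rates[year], 2)
        # Dollar amounts and bare numbers are treated as US Dollars
        return round(float(text.replace("$", "")), 2)
    except ValueError:
        return np.nan

sample = pd.Series(["$1,200,000", "₹40,000,000", "Undisclosed", "—", 250000])
cleaned = sample.apply(clean_amount, year=2018)
print(cleaned)
```

With this shape, each yearly frame could be cleaned with a call such as `data21["Amount"].apply(clean_amount, year=2021)`, keeping the rates in one place.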

In [5534]:
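# Strip '$' and ',' from the values and replace the '—' placeholder with NaN
# (str() turns every value into text, which is why info() below reports Amount as object)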
data21['Amount'] = data21['Amount'].apply(lambda x:str(x).replace('$', ''))

data21['Amount'] = data21['Amount'].apply(lambda x:str(x).replace(',', ''))

data21['Amount'] = data21['Amount'].replace('—', np.nan)

In [5535]:
data21.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1207 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  1207 non-null   object 
 1   Founded       1206 non-null   float64
 2   Location      1206 non-null   object 
 3   Sector        1207 non-null   object 
 4   What_it_does  1207 non-null   object 
 5   Founders      1203 non-null   object 
 6   Investor      1145 non-null   object 
 7   Amount        1207 non-null   object 
 8   Stage         783 non-null    object 
 9   Funding_Year  1207 non-null   int32  
dtypes: float64(1), int32(1), object(8)
memory usage: 131.3+ KB


# convert amount column to numeric
data21['Amount']  = pd.to_numeric(data21['Amount'], errors='coerce')

In [5536]:
# Considering Location Column
data21.loc[98]


Company_Name                                              FanPlay
Founded                                                    2020.0
Location                                           Computer Games
Sector                                             Computer Games
What_it_does    A real money game app specializing in trivia g...
Founders                                                   YC W21
Investor                              Pritesh Kumar, Bharat Gupta
Amount                                                  1200000.0
Stage                                                            
Funding_Year                                                 2021
Name: 98, dtype: object

In [5537]:
data21.loc[752]

Company_Name                                        NewLink Group
Founded                                                    2016.0
Location                                                  Beijing
Sector                                               Tech Startup
What_it_does    Developer of an energy management and transpor...
Founders                                      Yang Wang, Zhen Dai
Investor                                             Bain Capital
Amount                                                200000000.0
Stage                                                        None
Funding_Year                                                 2021
Name: 752, dtype: object

In [5538]:
data21['Location'] = data21.Location.str.split(',').str[0]
data21.at[32, 'Location'] = 'Andhra Pradesh'
data21.at[98, 'Location'] = ''
data21.at[241, 'Location'] = ''
data21.at[255, 'Location'] = ''
data21.at[752, 'Location'] = ''
data21.at[1100, 'Location'] = ''
data21.at[1176, 'Location'] = ''

In [5539]:
# Considering Sector Attribute

data21['Sector'] = data21.Sector.str.split(',').str[0]
data21.at[1100, 'Sector'] = 'Audio experience'

In [5540]:
data21.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000.0,Pre-series A,2021
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000.0,,2021
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",30000000.0,Series D,2021
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",51000000.0,Series C,2021
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",2000000.0,Seed,2021


Loading Data to Python VSO Environment:

2. CSV File from OneDrive (2019 Data):

In [5541]:
# The 2019 data is stored in OneDrive, in a file named startup_funding2019.csv

data19 = pd.read_csv('startup_funding2019.csv')
data19.head()



Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [5542]:
data19.shape


(89, 9)

In [5543]:
data19.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [5544]:
# Creating a column to identify each dataset by addition of data year
data19['Funding_Year'] = 2019

#Change the funding year to integer type

data19['Funding_Year'] = data19['Funding_Year'].astype(int)


data19.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Funding_Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,2019


In [5545]:

# rename the columns for consistency 

data19.rename(columns = {
    'Company/Brand': 'Company_Name',
    'HeadQuarter': 'Location',
    'Amount($)': 'Amount',
    'What it does': 'What_it_does',
}, inplace = True)

data19.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,2019


In [5546]:
#select specific columns
data19 = data19[['Company_Name', 'Founded','Location','Sector','What_it_does','Founders','Investor','Amount','Stage','Funding_Year']]
data19.head()               


Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,2019


In [5547]:
#check the summarized information on the 2019 dataset 
data19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  89 non-null     object 
 1   Founded       60 non-null     float64
 2   Location      70 non-null     object 
 3   Sector        84 non-null     object 
 4   What_it_does  89 non-null     object 
 5   Founders      86 non-null     object 
 6   Investor      89 non-null     object 
 7   Amount        89 non-null     object 
 8   Stage         43 non-null     object 
 9   Funding_Year  89 non-null     int32  
dtypes: float64(1), int32(1), object(8)
memory usage: 6.7+ KB


##### Exchange rates 

[Source: OFX](https://www.ofx.com/en-au/forex-news/historical-exchange-rates/yearly-average-rates/)
```python
exchange_rates = {
    2018: 0.014649,
    2019: 0.014209,
    2020: 0.013501,
    2021: 0.013527
}
```

In [5548]:
# Create a function to clean the Amount column of the 2019 DataFrame and convert Indian Rupees to US Dollars

def clean_amount_2019(Amount):
    try:
        Amount = str(Amount)
        # Remove commas and the dash placeholder used for missing amounts
        Amount = Amount.replace(",", "")
        Amount = Amount.replace('—', "")
        # If the value is in Indian Rupees, convert it to US Dollars using the
        # 2019 average rate from the table above (0.014209 USD per INR, rounded to 0.0142)
        if "₹" in Amount:
            Amount = Amount.replace("₹", "")
            return round(float(Amount) * 0.0142, 2)
        # Check if the value is in US Dollars
        elif "$" in Amount:
            Amount = Amount.replace("$", "")
            return round(float(Amount), 2)
        # If no currency symbol is present, assume US Dollars
        else:
            return round(float(Amount), 2)
    except ValueError:
        # If the value is not a number, return NaN
        return np.nan

# Clean the Amount column of the 2019 DataFrame
data19["Amount"] = data19["Amount"].apply(clean_amount_2019)

In [5549]:
# To convert the column to a numeric one, symbols such as commas and currency signs need to be removed

data19['Amount'] = data19['Amount'].apply(lambda x:str(x).replace('₹', ''))

data19['Amount'] = data19['Amount'].apply(lambda x:str(x).replace('$', ''))

data19['Amount'] = data19['Amount'].apply(lambda x:str(x).replace(',', ''))

data19['Amount'] = data19['Amount'].replace('—', np.nan)

In [5550]:
#Some rows-values in the amount column are undisclosed 
# Extract the rows with undisclosed funding information 

index_new = data19.index[data19['Amount']=='Undisclosed']
#Print the number of rows with such undisclosed values
print('The number of values with undisclosed amount is ', len(index_new))

The number of values with undisclosed amount is  0


In [5551]:
#check out these records 
data19.loc[(index_new)]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year


In [5552]:
# Undisclosed amounts carry no usable funding information,
# so replace them with NaN (missing) instead of keeping the text

data19['Amount'] = data19['Amount'].replace('Undisclosed', np.nan)

In [5553]:
#check out these records 
data19.loc[(index_new)]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year


In [5554]:
#Convert the Amount column to float 

data19['Amount'] = pd.to_numeric(data19['Amount'], errors='coerce')


In [5555]:
#Check the first 5 rows of the dataset 
data19.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,6300000.0,,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,150000000.0,Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey",28000000.0,Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...",30000000.0,Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),6000000.0,,2019


In [5556]:
#Check the summary information of the dataset 
data19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  89 non-null     object 
 1   Founded       60 non-null     float64
 2   Location      70 non-null     object 
 3   Sector        84 non-null     object 
 4   What_it_does  89 non-null     object 
 5   Founders      86 non-null     object 
 6   Investor      89 non-null     object 
 7   Amount        77 non-null     float64
 8   Stage         43 non-null     object 
 9   Funding_Year  89 non-null     int32  
dtypes: float64(2), int32(1), object(7)
memory usage: 6.7+ KB


In [5557]:
# Count how many columns contain at least one null value
data19.isna().any().sum()

6
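The result above is the number of columns that contain at least one null, not the number of null cells. A short sketch on a made-up frame shows the difference between the column count, the per-column counts and the overall total; the frame and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame: column 'a' has one null, 'b' has none, 'c' has two
df = pd.DataFrame({"a": [1, np.nan], "b": ["x", "y"], "c": [np.nan, np.nan]})

columns_with_nulls = int(df.isna().any().sum())   # how many columns have any null
nulls_per_column = df.isna().sum()                # null count for each column
total_nulls = int(df.isna().sum().sum())          # null cells in the whole frame

print(columns_with_nulls)
print(nulls_per_column)
print(total_nulls)
```

For `data19`, `data19.isna().sum()` would give the per-column breakdown that `info()` only shows indirectly through its non-null counts.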

In [5558]:
#find duplicates 

duplicate = data19[data19.duplicated()]

duplicate


Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
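The empty result means no fully identical rows remain in the 2019 data. For reference, a small sketch of how the duplicate check behaves when duplicates do exist, using a made-up frame with the index labels of the FanPlay rows seen earlier; `keep=False` is used to show every copy rather than only the later ones.

```python
import pandas as pd

# Made-up frame with one repeated record
df = pd.DataFrame(
    {
        "Company_Name": ["FanPlay", "FanPlay", "upGrad"],
        "Amount": ["$1200000", "$1200000", "$120000000"],
    },
    index=[98, 111, 1],
)

# keep=False marks all members of a duplicate group, which helps when inspecting them
dupes = df[df.duplicated(keep=False)]
# drop_duplicates keeps the first occurrence and removes the rest
deduped = df.drop_duplicates(keep="first")

print(dupes)
print(deduped)
```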


In [5559]:
data19['Amount'].unique()

array([6.300e+06, 1.500e+08, 2.800e+07, 3.000e+07, 6.000e+06,       nan,
       1.000e+06, 2.000e+07, 2.750e+08, 2.200e+07, 5.000e+06, 1.405e+05,
       5.400e+08, 1.500e+07, 1.827e+05, 1.200e+07, 1.100e+07, 1.550e+07,
       1.500e+06, 5.500e+06, 2.500e+06, 1.400e+05, 2.300e+08, 4.940e+07,
       3.200e+07, 2.600e+07, 1.500e+05, 4.000e+05, 2.000e+06, 1.000e+08,
       8.000e+06, 1.000e+05, 5.000e+07, 1.200e+08, 4.000e+06, 6.800e+06,
       3.600e+07, 5.700e+06, 2.500e+07, 6.000e+05, 7.000e+07, 6.000e+07,
       2.200e+05, 2.800e+06, 2.100e+06, 7.000e+06, 3.110e+08, 4.800e+06,
       6.930e+08, 3.300e+07])

Loading Data to Python VSO Environment:

3. CSV File from GitHub (2018 Data):

In [5560]:
# The third dataset (2018) is hosted on the GitHub repository, in a file called startup_funding2018.csv

data18 = pd.read_csv('startup_funding2018.csv')
data18.head()



Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [5561]:
data18.shape

(526, 6)

In [5562]:
data18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [5563]:
# rename the columns for consistency 
# Industry --> Sector
# Round/Series --> Stage
# (the 2018 file already has a 'Location' column, so it needs no rename)

data18.rename(columns = {
    'Company Name': 'Company_Name',
    'Industry': 'Sector',
    'Round/Series': 'Stage',
    'About Company': 'What_it_does',
}, inplace = True)

# Add Founded, Investor and Founders as empty columns (they are not in the 2018 file)
data18['Founded'] = np.nan
data18['Investor'] = np.nan
data18['Founders'] = np.nan
data18

Unnamed: 0,Company_Name,Sector,Stage,Amount,Location,What_it_does,Founded,Investor,Founders
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,
...,...,...,...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",,,
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,,,
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,,,
524,Droni Tech,Information Technology,Seed,"₹35,000,000","Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,,,


In [5564]:
# Creating a column to identify each dataset by addition of data year

data18['Funding_Year'] = 2018

#Change the funding year to integer type 

data18['Funding_Year'] = data18['Funding_Year'].astype(int)

data18.head()

Unnamed: 0,Company_Name,Sector,Stage,Amount,Location,What_it_does,Founded,Investor,Founders,Funding_Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,2018
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,2018
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,2018


In [5565]:
#select specific columns
data18 = data18[['Company_Name', 'Founded','Location','Sector', 'What_it_does', 'Founders','Investor','Amount','Stage','Funding_Year']]
                
data18.head() 

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,TheCollegeFever,,"Bangalore, Karnataka, India","Brand Marketing, Event Promotion, Marketing, S...","TheCollegeFever is a hub for fun, fiesta and f...",,,250000,Seed,2018
1,Happy Cow Dairy,,"Mumbai, Maharashtra, India","Agriculture, Farming",A startup which aggregates milk from dairy far...,,,"₹40,000,000",Seed,2018
2,MyLoanCare,,"Gurgaon, Haryana, India","Credit, Financial Services, Lending, Marketplace",Leading Online Loans Marketplace in India,,,"₹65,000,000",Series A,2018
3,PayMe India,,"Noida, Uttar Pradesh, India","Financial Services, FinTech",PayMe India is an innovative FinTech organizat...,,,2000000,Angel,2018
4,Eunimart,,"Hyderabad, Andhra Pradesh, India","E-Commerce Platforms, Retail, SaaS",Eunimart is a one stop solution for merchants ...,,,—,Seed,2018


In [5566]:
#check the shape of the dataset 
data18.shape 

(526, 10)
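The rename, add-missing-columns and select-columns steps applied to the 2019 and 2018 frames follow the same pattern. A minimal sketch of that pattern as one function is shown here, assuming the ten-column target layout used in this notebook; `TARGET_COLUMNS` and `standardize` are hypothetical names, and the single-row input is made up for illustration.

```python
import pandas as pd

# Common column layout used for every yearly frame in this notebook
TARGET_COLUMNS = [
    "Company_Name", "Founded", "Location", "Sector", "What_it_does",
    "Founders", "Investor", "Amount", "Stage", "Funding_Year",
]

def standardize(df, column_map, year):
    """Rename columns, tag the funding year and align to TARGET_COLUMNS.

    Hypothetical helper for illustration only.
    """
    out = df.rename(columns=column_map).copy()
    out["Funding_Year"] = year
    # reindex adds any missing target column as NaN and drops columns not in the list
    return out.reindex(columns=TARGET_COLUMNS)

# Made-up single-row frame using the 2018 column names
raw18 = pd.DataFrame({
    "Company Name": ["Happy Cow Dairy"],
    "Industry": ["Agriculture, Farming"],
    "Round/Series": ["Seed"],
    "Amount": ["₹40,000,000"],
    "Location": ["Mumbai, Maharashtra, India"],
    "About Company": ["placeholder description"],
})

map18 = {
    "Company Name": "Company_Name",
    "Industry": "Sector",
    "Round/Series": "Stage",
    "About Company": "What_it_does",
}
tidy18 = standardize(raw18, map18, 2018)
print(tidy18.columns.tolist())
```

Because every frame ends up with identical columns in the same order, the yearly frames can later be stacked without further alignment.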

In [5567]:
#check if there are any Null Values
data18.isna().any()

Company_Name    False
Founded          True
Location        False
Sector          False
What_it_does    False
Founders         True
Investor         True
Amount          False
Stage           False
Funding_Year    False
dtype: bool

In [5568]:
#Strip the location column to only the city-area. 
data18['Location'] = data18.Location.str.split(',').str[0]
data18['Location'].head()

0    Bangalore
1       Mumbai
2      Gurgaon
3        Noida
4    Hyderabad
Name: Location, dtype: object
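The split above keeps the text before the first comma. A small sketch of how this behaves on edge cases (values without a comma, missing values, stray spaces) follows, using made-up inputs; the added `str.strip()` is an optional extra, not something the notebook does.

```python
import pandas as pd

# Made-up location strings covering the common cases
locations = pd.Series(["Bangalore, Karnataka, India", "Mumbai", None, " Gurgaon , Haryana"])

# n=1 splits only at the first comma; .str[0] takes the part before it
city = locations.str.split(",", n=1).str[0].str.strip()
print(city)
```

Values without a comma are returned unchanged and missing values stay missing, so the step is safe to apply to the whole column.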

In [5569]:
#Strip the sector column to the first sector element.
data18['Sector'] = data18.Sector.str.split(',').str[0]
data18['Sector'].head()

0         Brand Marketing
1             Agriculture
2                  Credit
3      Financial Services
4    E-Commerce Platforms
Name: Sector, dtype: object

In [5570]:
data18

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,TheCollegeFever,,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000,Seed,2018
1,Happy Cow Dairy,,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,"₹40,000,000",Seed,2018
2,MyLoanCare,,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,"₹65,000,000",Series A,2018
3,PayMe India,,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000,Angel,2018
4,Eunimart,,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,—,Seed,2018
...,...,...,...,...,...,...,...,...,...,...
521,Udaan,,Bangalore,B2B,"Udaan is a B2B trade platform, designed specif...",,,225000000,Series C,2018
522,Happyeasygo Group,,Haryana,Tourism,HappyEasyGo is an online travel domain.,,,—,Series A,2018
523,Mombay,,Mumbai,Food and Beverage,Mombay is a unique opportunity for housewives ...,,,7500,Seed,2018
524,Droni Tech,,Mumbai,Information Technology,Droni Tech manufacture UAVs and develop softwa...,,,"₹35,000,000",Seed,2018


##### Exchange rates 

[Source: OFX](https://www.ofx.com/en-au/forex-news/historical-exchange-rates/yearly-average-rates/)
```python
exchange_rates = {
    2018: 0.014649,
    2019: 0.014209,
    2020: 0.013501,
    2021: 0.013527
}
```

In [5571]:
# Create a function to clean the Amount column of the 2018 DataFrame and convert Indian Rupees to US Dollars

def clean_amount_2018(Amount):
    try:
        Amount = str(Amount)
        # Remove commas
        Amount = Amount.replace(",", "")
        Amount = Amount.replace('—', "")
        # If the value is in Indian Rupees, convert it to US Dollars at 0.0146 USD per INR
        # (roughly 1 USD = 68.4113 INR, the 2018 annual average)
        if "₹" in Amount:
            Amount = Amount.replace("₹", "")
            return round(float(Amount) * 0.0146, 2)
        # Check if the value is in US Dollars
        elif "$" in Amount:
            Amount = Amount.replace("$", "")
            return round (float(Amount), 2)
        # check if no currency symbol is present, assume US Dollars
        else:
            return round(float(Amount), 2)
    except ValueError:
        # If the value is not a number, return NaN
        return np.nan
        
# Clean the Amount column of the 2018 DataFrame
data18["Amount"] = data18["Amount"].apply(clean_amount_2018)
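The function above multiplies rupee amounts by a rounded rate of 0.0146, while the exchange-rate table lists 0.014649 for 2018. The sketch below shows how far the two rates diverge for the ₹40,000,000 round in the sample rows. It is an illustration only, and the helper name `inr_to_usd` is hypothetical.

```python
def inr_to_usd(amount_text, rate):
    """Convert a rupee string such as '₹40,000,000' to US dollars at the given rate.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    digits = amount_text.replace("₹", "").replace(",", "")
    return round(float(digits) * rate, 2)


rounded_rate = inr_to_usd("₹40,000,000", 0.0146)    # rate hard-coded in clean_amount_2018
table_rate = inr_to_usd("₹40,000,000", 0.014649)    # 2018 rate from the exchange-rate table

print(rounded_rate, table_rate, round(table_rate - rounded_rate, 2))
```

For this round the two rates give 584,000 and 585,960 USD, a gap of 1,960 USD, so whichever rate is chosen should be used consistently across all four years.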

In [5572]:
# Get the index of rows where the 'Amount' column is in rupees
#get_index = data18.index[data18['Amount'].str.contains('₹')]

In [5573]:
#Check the summary information about the 2018 dataset 
data18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  526 non-null    object 
 1   Founded       0 non-null      float64
 2   Location      526 non-null    object 
 3   Sector        526 non-null    object 
 4   What_it_does  526 non-null    object 
 5   Founders      0 non-null      float64
 6   Investor      0 non-null      float64
 7   Amount        378 non-null    float64
 8   Stage         526 non-null    object 
 9   Funding_Year  526 non-null    int32  
dtypes: float64(4), int32(1), object(5)
memory usage: 39.2+ KB


In [5574]:
data18.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,TheCollegeFever,,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018
1,Happy Cow Dairy,,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018
2,MyLoanCare,,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018
3,PayMe India,,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018
4,Eunimart,,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,,Seed,2018


In [5575]:
data18['Amount'].unique()

array([2.500000e+05, 5.840000e+05, 9.490000e+05, 2.000000e+06,
                nan, 1.600000e+06, 2.336000e+05, 7.300000e+05,
       1.460000e+06, 1.500000e+05, 1.100000e+06, 7.300000e+03,
       6.000000e+06, 6.500000e+05, 5.110000e+05, 9.344000e+05,
       2.920000e+05, 1.000000e+06, 5.000000e+06, 4.000000e+06,
       4.380000e+05, 2.800000e+06, 1.700000e+06, 1.300000e+06,
       7.300000e+04, 1.825000e+05, 2.190000e+05, 5.000000e+05,
       1.518400e+06, 6.570000e+05, 1.340000e+07, 3.650000e+05,
       3.854400e+05, 1.168000e+05, 8.760000e+02, 9.000000e+06,
       1.000000e+05, 2.000000e+04, 1.200000e+05, 4.964000e+05,
       4.993200e+06, 1.431450e+05, 8.760000e+06, 7.420000e+08,
       1.460000e+07, 2.920000e+07, 3.980000e+06, 1.000000e+04,
       1.460000e+03, 3.650000e+06, 1.000000e+09, 7.000000e+06,
       3.500000e+07, 8.030000e+06, 2.850000e+07, 3.504000e+06,
       1.752000e+06, 2.400000e+06, 3.000000e+07, 3.650000e+07,
       2.300000e+07, 1.100000e+07, 6.424000e+05, 3.2400

In [5576]:
data18.loc[178]

Company_Name                                       BuyForexOnline
Founded                                                       NaN
Location                                                Bangalore
Sector                                                     Travel
What_it_does    BuyForexOnline.com is India's first completely...
Founders                                                      NaN
Investor                                                      NaN
Amount                                                  2000000.0
Stage           https://docs.google.com/spreadsheets/d/1x9ziNe...
Funding_Year                                                 2018
Name: 178, dtype: object

In [5577]:
data18.loc[178, ['Stage']] = ['']

data18['Stage'] = data18['Stage'].apply(lambda x:str(x).replace('Undisclosed', ''))

In [5578]:
#drop duplicates 

data18 = data18.drop_duplicates(keep='first')
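The `info()` output after this step reports 525 entries with index labels still running from 0 to 525, because `drop_duplicates` removes rows without renumbering the rest. A minimal sketch on a made-up frame shows how the repeats could be counted before dropping them and what happens to the index:

```python
import pandas as pd

# Toy frame with one fully repeated row (made-up values)
toy = pd.DataFrame({
    "Company_Name": ["A", "B", "A"],
    "Amount": [100.0, 200.0, 100.0],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(toy.duplicated(keep="first").sum())

# drop_duplicates keeps the first occurrence and preserves the original index labels
deduped = toy.drop_duplicates(keep="first")

print(n_duplicates, list(deduped.index))
```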


In [5579]:
data18.info()

<class 'pandas.core.frame.DataFrame'>
Index: 525 entries, 0 to 525
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  525 non-null    object 
 1   Founded       0 non-null      float64
 2   Location      525 non-null    object 
 3   Sector        525 non-null    object 
 4   What_it_does  525 non-null    object 
 5   Founders      0 non-null      float64
 6   Investor      0 non-null      float64
 7   Amount        377 non-null    float64
 8   Stage         525 non-null    object 
 9   Funding_Year  525 non-null    int32  
dtypes: float64(4), int32(1), object(5)
memory usage: 43.1+ KB


In [5580]:
# Rename the raw 2018 column names. None of these keys match a current column of data18,
# so the call leaves the DataFrame unchanged (see the info() output below)
data18.rename(columns={
    'Company Name': 'company_brand',
    'Industry': 'sector',
    'Round/Series': 'stage',
    'About Company': 'what_it_does',
    },
    inplace=True
)

data18.info()

<class 'pandas.core.frame.DataFrame'>
Index: 525 entries, 0 to 525
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  525 non-null    object 
 1   Founded       0 non-null      float64
 2   Location      525 non-null    object 
 3   Sector        525 non-null    object 
 4   What_it_does  525 non-null    object 
 5   Founders      0 non-null      float64
 6   Investor      0 non-null      float64
 7   Amount        377 non-null    float64
 8   Stage         525 non-null    object 
 9   Funding_Year  525 non-null    int32  
dtypes: float64(4), int32(1), object(5)
memory usage: 43.1+ KB
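The two `info()` listings around the rename cell show identical column names, which suggests that the mapping keys did not match any column. By default `rename` ignores keys it cannot find. The sketch below, on a made-up frame, shows that behaviour and how `errors="raise"` can surface the mismatch:

```python
import pandas as pd

toy = pd.DataFrame({"Company_Name": ["A"], "Stage": ["Seed"]})

# Keys that are not existing column names are ignored by default
unchanged = toy.rename(columns={"Company Name": "company_brand"})

# errors="raise" turns the silent no-op into a KeyError
try:
    toy.rename(columns={"Company Name": "company_brand"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(list(unchanged.columns), raised)
```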


In [5581]:
# git clone https://github.com/SamuelAsong/indian-startup-funding-analysis.git
# cd indian-startup-funding-analysis
# pip install -r requirements.txt


CRISP-DM Process:

Business Understanding:
- Define project objectives and requirements
- Understand the start-up ecosystem and the importance of funding data

Data Understanding:
- Gather and explore datasets for 2018, 2019, 2020, and 2021
- Identify key features and initial insights

Data Preparation:
- Clean and preprocess data
- Handle missing values, duplicates, and inconsistent data
- Merge datasets into a single comprehensive dataset

Data Analysis:
- Perform exploratory data analysis (EDA)
- Identify trends, patterns, and outliers
- Visualize funding trends over the years

Modeling (if applicable):
- Develop machine learning models to predict funding success (optional)
- Evaluate model performance

Evaluation:
- Assess the analysis results and model performance
- Validate findings against business objectives

Deployment:
- Present findings and recommendations
- Prepare a final report and presentation

Conclusion and Findings:
- Summarize key insights from the data analysis
- Highlight significant trends and patterns in the Indian start-up funding landscape
- Provide actionable recommendations based on data-driven insights
- Discuss potential limitations and future work

This structured approach ensures a comprehensive analysis and effective communication of results, helping to make strategic, data-driven decisions in the Indian start-up ecosystem.








The CRISP-DM reference model 
1 Business understanding 

1.1 Determine business objectives 

1.2 Assess situation 

1.3 Determine data mining goals 

1.4 Produce project plan 

2 Data understanding 

2.1 Collect initial data 

2.2 Describe data 

2.3 Explore data 

2.4 Verify data quality 

3 Data preparation 

3.1 Select data 

3.2 Clean data 

3.3 Construct data 

3.4 Integrate data 

3.5 Format data 

4 Modeling 

4.1 Select modeling technique 

4.2 Generate test design 

4.3 Build model 

4.4 Assess model 

5 Evaluation 

5.1 Evaluate results 

5.2 Review process 

5.3 Determine next steps 

6 Deployment 

6.1 Plan deployment 

6.2 Plan monitoring and maintenance 

6.3 Produce final report 

6.4 Review project 

#### Exploratory Data Analysis: EDA

This section examines and presents the datasets, formulates hypotheses, and plans how the data will be cleaned, processed, and turned into features.

In [5582]:
print (data21.columns)

Index(['Company_Name', 'Founded', 'Location', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Funding_Year'],
      dtype='object')


In [5583]:
print (data20.columns)

Index(['Company_Name', 'Founded', 'Location', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Funding_Year'],
      dtype='object')


In [5584]:
print (data19.columns)


Index(['Company_Name', 'Founded', 'Location', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Funding_Year'],
      dtype='object')


In [5585]:
print (data18.columns)

Index(['Company_Name', 'Founded', 'Location', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Funding_Year'],
      dtype='object')


In [5586]:
# concatenating all the dataframes together
df = pd.concat([data18, data19, data20, data21], axis=0)
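Because each yearly frame keeps its own index, the combined frame repeats labels (the later `info()` output reads "2873 entries, 0 to 1208"). The sketch below uses made-up frames to show the effect and how `ignore_index=True` could give every row a unique label. Whether to renumber is a design choice, since the outputs in this notebook rely on the original labels.

```python
import pandas as pd

# Two toy yearly frames, each with its own 0-based index (made-up values)
y2018 = pd.DataFrame({"Company_Name": ["A", "B"], "Funding_Year": [2018, 2018]})
y2019 = pd.DataFrame({"Company_Name": ["C", "D"], "Funding_Year": [2019, 2019]})

stacked = pd.concat([y2018, y2019], axis=0)                        # labels 0, 1, 0, 1
renumbered = pd.concat([y2018, y2019], axis=0, ignore_index=True)  # labels 0, 1, 2, 3

print(stacked.index.is_unique, renumbered.index.is_unique)
```

With repeated labels, `stacked.loc[0]` returns two rows rather than one, which is worth keeping in mind when assigning through `.loc`.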

In [5587]:
# Export DataFrame to CSV file
#df.to_csv('df.csv', index=False)


In [5588]:
# Preview the first rows of the combined DataFrame
df.head() 

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,TheCollegeFever,,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018
1,Happy Cow Dairy,,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018
2,MyLoanCare,,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018
3,PayMe India,,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018
4,Eunimart,,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,,Seed,2018


In [5589]:
df.tail()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
1204,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,3000000.0,Pre-series A,2021
1205,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,20000000.0,Series D,2021
1206,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,55000000.0,Series C,2021
1207,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",26000000.0,Series B,2021
1208,WeRize,2019.0,Bangalore,Financial Services,India’s first socially distributed full stack ...,"Vishal Chopra, Himanshu Gupta","3one4 Capital, Kalaari Capital",8000000.0,Series A,2021


In [5591]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2873 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  2873 non-null   object 
 1   Founded       2105 non-null   Float64
 2   Location      2759 non-null   object 
 3   Sector        2855 non-null   object 
 4   What_it_does  2873 non-null   object 
 5   Founders      2329 non-null   object 
 6   Investor      2248 non-null   object 
 7   Amount        2462 non-null   object 
 8   Stage         1941 non-null   object 
 9   Funding_Year  2873 non-null   int32  
dtypes: Float64(1), int32(1), object(8)
memory usage: 238.5+ KB


In [5592]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,2105.0,2016.085986,4.367256,1963.0,2015.0,2017.0,2019.0,2021.0
Funding_Year,2873.0,2020.023669,1.087148,2018.0,2020.0,2020.0,2021.0,2021.0


In [5593]:
# index with undisclosed value at Investor column
index_ = df.index[df['Investor']=='Undisclosed']
index_

Index([5, 59, 70, 633, 675, 741, 798, 824, 902, 964, 1003, 1006, 1007], dtype='int64')

In [5594]:
# replacing them with NAN
df['Investor'] = df['Investor'].replace('Undisclosed', np.nan)

In [5595]:
#Strip the location data to only the city-area. 
df['Location'] = df.Location.str.split(',').str[0]
df['Location'].head()

0    Bangalore
1       Mumbai
2      Gurgaon
3        Noida
4    Hyderabad
Name: Location, dtype: object

Cleaning Columns
Location / Sector columns


In [5596]:
# Function to remove '#REF!' in a series
def remove_ref(value):
    if isinstance(value, str):
        value = value.replace('#REF!', '').strip()
            
    return value

# Columns of interest
columns = ['Location', 'Investor']
for column in columns:
    # Identify rows where the column value contains '#REF!'; missing values count as False
    mask = df[column].str.contains('#REF!', regex=False, na=False)

    # Remove '#REF!' from the matching values
    df.loc[mask, column] = df.loc[mask, column].apply(remove_ref)

    # Shift values in the selected rows one column to the right, from the current column
    # through 'Stage' (the last column, 'Funding_Year', is excluded)
    df.loc[mask, column:'Stage'] = df.loc[mask, column:'Stage'].shift(1, axis=1)


# Sanitize the Sector column after shifting: flag rows whose Sector value is actually a known location
mask = df['Sector'].apply(lambda x: x in df['Location'].unique())

# Update the 'Location' value with the 'Sector' value
df.loc[mask, 'Location'] = df.loc[mask, 'Sector']

# Set the 'Sector' value to NaN
df.loc[mask, 'Sector'] = np.nan
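To make the column shift used above concrete, here is a small sketch on a made-up row. It assumes the same `shift(1, axis=1)` call over a label slice and only illustrates the mechanics.

```python
import numpy as np
import pandas as pd

# Toy row whose values sit one column too far to the left (made-up values)
toy = pd.DataFrame({
    "Location": ["Pharma"],
    "Sector": ["Drug maker"],
    "Stage": ["Seed"],
})

# shift(1, axis=1) moves each value into the next column and leaves a missing value
# in the first one; the value in the last selected column drops out of the slice
shifted = toy.loc[:, "Location":"Stage"].shift(1, axis=1)

print(shifted.iloc[0].tolist())
```

Note that the value in the last column of the slice is lost, so the slice should end at a column whose content is expendable or already saved elsewhere.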

Replace 'None' and 'nan' string values with NaN so that missing values are represented consistently.

In [5597]:

# Replace 'none'/'nan' strings (case-insensitive, surrounding spaces ignored) with NaN
def replace_none(value):
    return np.nan if isinstance(value, str) and value.strip().lower() in ['none', 'nan'] else value

# Apply the function to all columns
df = df.applymap(replace_none) # element-wise
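Recent pandas releases (2.1 and later) offer `DataFrame.map` as the element-wise method and mark `applymap` as deprecated. The following sketch repeats the same rule on a made-up frame and picks whichever method is available. The fallback logic is an assumption about the installed version, not something taken from the notebook.

```python
import numpy as np
import pandas as pd

def replace_none(value):
    # Same rule as above: 'none'/'nan' strings (any case, padded or not) become NaN
    return np.nan if isinstance(value, str) and value.strip().lower() in ["none", "nan"] else value

toy = pd.DataFrame({"Investor": ["None", " nan ", "Owl Ventures"], "Amount": [1.0, 2.0, 3.0]})

# Assumption: use DataFrame.map where it exists (pandas 2.1+), else fall back to applymap
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
cleaned = elementwise(replace_none)

print(cleaned["Investor"].isna().tolist())
```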


In [5598]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2873 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  2873 non-null   object 
 1   Founded       2105 non-null   float64
 2   Location      2748 non-null   object 
 3   Sector        2855 non-null   object 
 4   What_it_does  2873 non-null   object 
 5   Founders      2329 non-null   object 
 6   Investor      2235 non-null   object 
 7   Amount        2318 non-null   object 
 8   Stage         1941 non-null   object 
 9   Funding_Year  2873 non-null   int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 246.9+ KB


##### If a value in the 'Stage' column is a website link, its correct value is NaN


In [5599]:

import re

# Function to remove website link from stage column
def remove_website_link(value):
    # Regular expression pattern to match website URLs
    pattern = r'https?://\S+'
    
    # Check if the value is a string and matches the pattern
    if isinstance(value, str) and re.match(pattern, value):
        return np.nan
    return value
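A usage sketch for the function above, applied to a made-up `Stage` column. The URL is a placeholder rather than a value from the dataset, and the function is repeated so that the example runs on its own.

```python
import re
import numpy as np
import pandas as pd

def remove_website_link(value):
    # Treat any value that starts with http:// or https:// as missing
    if isinstance(value, str) and re.match(r"https?://\S+", value):
        return np.nan
    return value

# Made-up Stage values; the URL is a placeholder, not one from the dataset
stage = pd.Series(["Seed", "https://example.com/sheet", "Series A"])
cleaned = stage.apply(remove_website_link)

print(cleaned.isna().tolist())
```

Because `re.match` anchors at the start of the string, a stage label that merely mentions a link later in the text would be left unchanged.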


In [5600]:
# Function to clean amount values
def floater(string):
    try:
        string = float(string)
    except ValueError:
        string = np.nan
    
    return string
    
def clean_amount(row): 
    Amount = row['Amount']
    Funding_Year = row['Funding_Year']
    
    # Source: https://www.ofx.com/en-au/forex-news/historical-exchange-rates/yearly-average-rates/
    exchange_rates = {
        2018: 0.014649,
        2019: 0.014209,
        2020: 0.013501,
        2021: 0.013527
    }
    
    # Look up the rate for the row's funding year
    exchange_rate = exchange_rates[Funding_Year]
    
    # Convert to string
    Amount = str(Amount)   
    
    if isinstance(Amount, str):        
        # Set of elements to replace
        to_replace = {' ', ','}

        # Replace each element in the set with an empty string
        for r in to_replace:
            Amount = Amount.replace(r, '')        
                        
        if Amount == '' or Amount == '—': 
            Amount = np.nan
        # If the amount is in INR (Indian Rupees), convert it to USD using the conversion rate of the year
        elif '₹' in Amount:
            Amount = Amount.replace('₹', '')
            Amount = floater(Amount) * exchange_rate
        
        # If the amount is in USD, remove the '$' symbol and convert it to a float
        elif '$' in Amount:
            Amount = Amount.replace('$', '')
            Amount = floater(Amount)
        else:
            Amount = floater(Amount)

    
    return Amount

In [5601]:
# Function to clean Amount field
def clean_amount(Amount, Funding_Year):
    # Average yearly INR-to-USD exchange rates
    exchange_rates = {
        2018: 0.014649,
        2019: 0.014209,
        2020: 0.013501,
        2021: 0.013527
    }
    
    # Handle missing values
    if pd.isnull(Amount) or Amount == '—' or Amount == '':
        return np.nan
    
    # Work on a string copy so numeric inputs are handled too, and drop thousands separators
    Amount = str(Amount).replace(',', '').strip()
    
    try:
        # Convert INR to USD using the exchange rate of the funding year
        if '₹' in Amount:
            return float(Amount.replace('₹', '')) * exchange_rates[Funding_Year]
        
        # Remove the $ symbol from USD amounts; plain numeric strings convert directly
        return float(Amount.replace('$', ''))
    except ValueError:
        # Anything that is not a number becomes NaN
        return np.nan


In [5602]:
df.isna().sum()

Company_Name      0
Founded         768
Location        125
Sector           18
What_it_does      0
Founders        544
Investor        638
Amount          555
Stage           932
Funding_Year      0
dtype: int64

##### Handling Investor and Amount Values:
- If the investor value is numeric or contains '$':
  - Missing amount values should be replaced with the investor value.
  - The stage value should revert to the original amount value.
  - The investor value is set to NaN or missing.
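The rule above can be sketched on a made-up frame as follows. This is one reading of the three bullets, and the helper name `looks_like_amount` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def looks_like_amount(value):
    """Return True for strings such as '$1000000' or '250000' (hypothetical helper)."""
    if not isinstance(value, str):
        return False
    try:
        float(value.replace("$", "").replace(",", ""))
        return True
    except ValueError:
        return False

# Toy rows: in the second one the amount landed in 'Investor' and the stage in 'Amount'
toy = pd.DataFrame({
    "Investor": ["Owl Ventures", "$1000000"],
    "Amount": ["$55000000", "Seed"],
    "Stage": ["Series C", np.nan],
})

mask = toy["Investor"].apply(looks_like_amount)

# Copy the old values first so the assignments below do not overwrite what is still needed
old_investor = toy.loc[mask, "Investor"].copy()
old_amount = toy.loc[mask, "Amount"].copy()

toy.loc[mask, "Amount"] = old_investor   # amount takes the investor value
toy.loc[mask, "Stage"] = old_amount      # stage takes the original amount value
toy.loc[mask, "Investor"] = np.nan       # investor becomes missing

print(toy.loc[1].tolist())
```

Saving the old columns before assigning keeps the order of the three updates from mattering.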


# Identify rows where the 'Investor' value parses as an amount, using the two-argument clean_amount defined above
mask = df[['Investor', 'Funding_Year']].apply(lambda row: pd.notna(clean_amount(row['Investor'], row['Funding_Year'])), axis=1)


In [5603]:
# Identify rows where 'Stage' value is numeric using clean amount function
mask = df[['Stage', 'Funding_Year']].apply(lambda row: pd.notna(clean_amount(row['Stage'], row['Funding_Year'])), axis=1)

# Append the old 'Founders' value to the 'What_it_does' text
old_What_it_does = df.loc[mask, 'What_it_does']
old_Founder = df.loc[mask, 'Founders']
df.loc[mask, 'What_it_does'] = old_What_it_does.fillna('') + ' ' + old_Founder.fillna('')

# Update 'Founders' column using the old 'Investor' value
df.loc[mask, 'Founders'] = df.loc[mask, 'Investor']

# Update 'Investor' column using the old 'Amount' value
df.loc[mask, 'Investor'] = df.loc[mask, 'Amount']

# Update 'Amount' column using the old 'Stage' value
df.loc[mask, 'Amount'] = df.loc[mask, 'Stage']

# Set 'Stage' to NaN
df.loc[mask, 'Stage'] = np.nan



In [5604]:
df.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,TheCollegeFever,,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018
1,Happy Cow Dairy,,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018
2,MyLoanCare,,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018
3,PayMe India,,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018
4,Eunimart,,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,,Seed,2018


#### Clean and convert amounts to USD and rename column from amount to amount($)

In [5605]:
# Placeholder for per-year currency conversion. As written it returns each amount unchanged;
# a full implementation would look up the average exchange rate for the funding year
def clean_and_convert_amount(row):
    Amount = row['Amount']
    Funding_Year = row['Funding_Year']  # not used yet; needed once conversion is implemented
    
    # No conversion is applied here: the value is passed through as-is
    Converted_Amount = Amount
    
    return Converted_Amount

# Apply the (currently pass-through) function row by row
df['Amount'] = df[['Amount', 'Funding_Year']].apply(clean_and_convert_amount, axis=1)
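As a hedged sketch of what a per-year conversion could look like, the example below applies the yearly rates listed earlier to a made-up frame in a vectorised way. The column name `Amount_USD` and the sample amounts are assumptions for illustration; values without a rupee sign are treated as US dollars, following the convention used for the 2018 data.

```python
import pandas as pd

# Yearly average INR-to-USD rates as listed earlier in the notebook
exchange_rates = {2018: 0.014649, 2019: 0.014209, 2020: 0.013501, 2021: 0.013527}

# Toy data (made-up amounts); '₹' marks rupee values, everything else is treated as USD
toy = pd.DataFrame({
    "Amount": ["₹1,000,000", "$2,000,000", "500000"],
    "Funding_Year": [2018, 2020, 2021],
})

text = toy["Amount"].astype(str)
is_inr = text.str.contains("₹", regex=False)

# Strip currency symbols and thousands separators, then parse; bad values become NaN
numeric = pd.to_numeric(text.str.replace(r"[₹$,]", "", regex=True), errors="coerce")

# Rupee rows are multiplied by the rate of their funding year, USD rows by 1
rate = toy["Funding_Year"].map(exchange_rates).where(is_inr, 1.0)
toy["Amount_USD"] = (numeric * rate).round(2)

print(toy["Amount_USD"].tolist())
```

Writing the result to a separate column keeps the raw strings available for checking the conversion.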




In [5606]:
df.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year
0,TheCollegeFever,,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018
1,Happy Cow Dairy,,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018
2,MyLoanCare,,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018
3,PayMe India,,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018
4,Eunimart,,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,,Seed,2018


#### Data Cleaning by Column
##### Cleaning Amount Column

In [5607]:
df['Amount'].unique()

array([250000.0, 584000.0, 949000.0, 2000000.0, nan, 1600000.0, 233600.0,
       730000.0, 1460000.0, 150000.0, 1100000.0, 7300.0, 6000000.0,
       650000.0, 511000.0, 934400.0, 292000.0, 1000000.0, 5000000.0,
       4000000.0, 438000.0, 2800000.0, 1700000.0, 1300000.0, 73000.0,
       182500.0, 219000.0, 500000.0, 1518400.0, 657000.0, 13400000.0,
       365000.0, 385440.0, 116800.0, 876.0, 9000000.0, 100000.0, 20000.0,
       120000.0, 496400.0, 4993200.0, 143145.0, 8760000.0, 742000000.0,
       14600000.0, 29200000.0, 3980000.0, 10000.0, 1460.0, 3650000.0,
       1000000000.0, 7000000.0, 35000000.0, 8030000.0, 28500000.0,
       3504000.0, 1752000.0, 2400000.0, 30000000.0, 36500000.0,
       23000000.0, 11000000.0, 642400.0, 3240000.0, 876000.0, 540000000.0,
       9490000.0, 23360000.0, 900000.0, 10000000.0, 1500000.0, 1022000.0,
       14000000.0, 1496500.0, 100000000.0, 17520.0, 75920000.0, 800000.0,
       1041000.0, 15000.0, 1400000.0, 1200000.0, 2200000.0, 1800000.0,
       3

In [5608]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2873 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  2873 non-null   object 
 1   Founded       2105 non-null   float64
 2   Location      2748 non-null   object 
 3   Sector        2855 non-null   object 
 4   What_it_does  2873 non-null   object 
 5   Founders      2329 non-null   object 
 6   Investor      2234 non-null   object 
 7   Amount        2319 non-null   object 
 8   Stage         1940 non-null   object 
 9   Funding_Year  2873 non-null   int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 246.9+ KB


In [5609]:
# Remove the dollar sign ('$') from amounts; missing values are left untouched
df['Amount'] = df['Amount'].apply(lambda x: str(x).replace('$', '') if pd.notna(x) else x)

In [5610]:
# Convert 'Amount' column to float 
df['Amount'] = df['Amount'].astype(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2873 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  2873 non-null   object 
 1   Founded       2105 non-null   float64
 2   Location      2748 non-null   object 
 3   Sector        2855 non-null   object 
 4   What_it_does  2873 non-null   object 
 5   Founders      2329 non-null   object 
 6   Investor      2234 non-null   object 
 7   Amount        2319 non-null   float64
 8   Stage         1940 non-null   object 
 9   Funding_Year  2873 non-null   int64  
dtypes: float64(2), int64(1), object(7)
memory usage: 246.9+ KB
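`astype(float)` succeeds here because every remaining value is numeric after the dollar sign is removed. If some non-numeric strings were still present it would raise, and `pd.to_numeric` with `errors="coerce"` is a more forgiving alternative. A sketch on made-up values:

```python
import pandas as pd

# Made-up values: one clean number, one with a symbol, one that is not a number at all
amounts = pd.Series(["250000", "$1200000", "Undisclosed", None])

# astype(float) would raise on 'Undisclosed'; to_numeric with errors='coerce' yields NaN instead
stripped = amounts.str.replace("$", "", regex=False)
as_float = pd.to_numeric(stripped, errors="coerce")

print(as_float.tolist())
```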


##### Cleaning the Sector Column

In [5611]:
# Standardize the sector names by converting them to title case
df["sector"] = df["Sector"].str.title()

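import re

# Note: the patterns below are lowercase and re.search is case-sensitive, so a title-cased
# name such as 'Banking' does not match 'bank'
def sector_redistribution(Sector):
    # Return missing values (NaN) unchanged, since re.search needs a string
    if not isinstance(Sector, str):
        return Sector
    if re.search(r'bank|fintech|finance|crypto|account|credit|venture|crowd|blockchain', Sector):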
        return 'Finance'
    elif re.search(r'automotive|air transport|transport|logistics|vehicle|transportation', Sector):
        return 'Transport'
    elif re.search(r'agri|agtech|agribusiness|farm|agriculture', Sector):
        return 'Agriculture'
    elif re.search(r'tech|information technology|software|hardware|internet|cloud|digital|computer|software development', Sector):
        return 'Technology'
    elif re.search(r'food|beverage|culinary|restaurant|gastronomy', Sector):
        return 'Food and Beverage'
    elif re.search(r'business intelligence|market research|analytics|data analysis|data insights|market insights', Sector):
        return 'Business Intelligence'
    elif re.search(r'energy|renewable energy|clean energy|sustainable energy|green energy|power', Sector):
        return 'Energy'
    elif re.search(r'hospitality|hotel|tourism|accommodation|travel', Sector):
        return 'Hospitality'
    elif re.search(r'commerce|retail|e-commerce|online marketplace|digital marketplace|online retail', Sector):
        return 'Commerce'
    elif re.search(r'manufacturing|industrial|production|factory|manufacture', Sector):
        return 'Manufacturing'
    elif re.search(r'media|entertainment|digital media|broadcasting|content|digital content|media production', Sector):
        return 'Media and Entertainment'
    elif re.search(r'real estate|property|housing|realtor|realestate', Sector):
        return 'Real Estate'
    elif re.search(r'health|medical|biotech|pharma|biomedical', Sector):
        return 'Healthcare'
    elif re.search(r'education|edtech|e-learning|learning|teaching', Sector):
        return 'Education'
    elif re.search(r'research|science|scientific|laboratory|experiment', Sector):
        return 'Research'
    elif re.search(r'government|public|policy|governance|civil', Sector):
        return 'Government'
    elif re.search(r'art|design|creative|graphic|visual|artistic', Sector):
        return 'Art and Design'
    elif re.search(r'social|community|society|group|networking', Sector):
        return 'Social Networking'
    elif re.search(r'sports|fitness|exercise|athlete|sporting', Sector):
        return 'Sports and Fitness'
    elif re.search(r'legal|law|lawyer|attorney|justice', Sector):
        return 'Legal'
    elif re.search(r'consulting|consultant|advice|advisor|consultancy', Sector):
        return 'Consulting'
    elif re.search(r'travel|tourism|journey|vacation|trip', Sector):
        return 'Travel and Tourism'
    elif re.search(r'insurance|insure|assurance|coverage|policy', Sector):
        return 'Insurance'
    elif re.search(r'retail|shop|store|mall|market', Sector):
        return 'Retail'
    elif re.search(r'finance|financial|money|monetary|economy', Sector):
        return 'Finance'
    elif re.search(r'technology|tech|technological|digital|innovation', Sector):
        return 'Technology'
    elif re.search(r'automotive|vehicle|car|transportation|motor', Sector):
        return 'Automotive'
    elif re.search(r'entertainment|media|broadcast|movie|film', Sector):
        return 'Entertainment'
    elif re.search(r'restaurant|food|eatery|dining|cuisine', Sector):
        return 'Restaurant'
    elif re.search(r'clothing|apparel|garment|fashion|attire', Sector):
        return 'Fashion'
    elif re.search(r'manufacturing|industry|factory|production|produce', Sector):
        return 'Manufacturing'
    elif re.search(r'hospitality|hotel|travel|tourism|accommodation', Sector):
        return 'Hospitality'
    else:
        return Sector
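A long `if`/`elif` chain like the one above can also be written as an ordered list of (category, pattern) pairs. The sketch below is a possible restructuring, not the notebook's method: `SECTOR_PATTERNS` and `map_sector` are hypothetical names, only a small subset of the categories is shown, and matching ignores case so that title-cased sector names are handled.

```python
import re

# Illustrative subset of categories; order matters because the first match wins
SECTOR_PATTERNS = [
    ("Finance", r"bank|fintech|finance|crypto|credit"),
    ("Agriculture", r"agri|agtech|farm"),
    ("Technology", r"tech|software|cloud|internet"),
]

def map_sector(sector):
    """Return the first category whose pattern matches, ignoring case (hypothetical helper)."""
    if not isinstance(sector, str):
        return sector  # leave NaN and other non-strings untouched
    for category, pattern in SECTOR_PATTERNS:
        if re.search(pattern, sector, flags=re.IGNORECASE):
            return category
    return sector

print(map_sector("Banking"), map_sector("AgriTech"), map_sector("Brand Marketing"))
```

Keeping the patterns in one list makes duplicated or unreachable branches easier to spot, since each category appears once and the order of precedence is explicit.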


In [5612]:
df

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector
0,TheCollegeFever,,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018,Brand Marketing
1,Happy Cow Dairy,,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018,Agriculture
2,MyLoanCare,,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018,Credit
3,PayMe India,,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018,Financial Services
4,Eunimart,,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,,Seed,2018,E-Commerce Platforms
...,...,...,...,...,...,...,...,...,...,...,...
1204,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,3000000.0,Pre-series A,2021,Staffing & Recruiting
1205,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,20000000.0,Series D,2021,Food & Beverages
1206,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,55000000.0,Series C,2021,Financial Services
1207,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",26000000.0,Series B,2021,Edtech


In [5613]:
import re
 
def sector_redistribution(Sector):
    if re.search('Credit Cards|Banking|Insuretech|Infratech|Saas\xa0\xa0Startup|Equity Management|Wealth Management|Saas  Startup|Insurtech|Crowdsourcing|Cryptocurrency|Online| Financial Service|Neo-Banking|Capital Markets|Mutual Funds|Bank|Finance|Crypto|Account|Credit|Venture|Crowd|Blockchain|Fund|Lending|Trading|Wealth|Insurance|Remittance|Money|Equity|Investment|Mortgage|Financial Services|Nft|Payments', Sector):
        return 'Finance'
    elif re.search(r'Auto-Tech|Tyre Management|Automobiles|Automobile|E-Mobility|Autonomous Vehicles|Vehicle Repair Startup|Automotive|Air Transport|Transport|Logistics|Vehicle|Transportation|Aviation|Vehicles|Tyre|Fleet|Wheels|Aero|Mobility|Aeorspace|Wl & Rac Protection|Micro-Mobiity|Delivery Service', Sector):
        return 'Transport'
    elif re.search('Machine Learning|Hrtech|Ar/Vr|Technology|Ai|E-Connect|E-Market|Traveltech|Biotech|Medtech|Ad-Tech|Healthtech|Games|Computer & Network Security|Saas Startup|Scanning App|Cloud Company|Cybersecurity|Aero Company|Cloud Computing|Techonology|E-Learning|Content Management|Recruitment|Consultancy|Ecommerce|Ev|Designing|Networking|Product Studio|Ecommerce|Proptech|Techonology|Milk Startup|Craft Beer|Craft Beer|Online Credit Management Startup|Foodtech|Spacetech|Deisgning|Clothing|Logitech|Femtech|D2C|Skill Development|Martech|Luxury Car Startup|Emobility|It|Healthcare|Qsr Startup|Sportstech|E-Marketplace|Cleantech|Heathtech|Digital Mortgage|Innovation Management|Photonics Startup|Life Sciences|Cloud Kitchen|Content Marktplace|Vehicle Repair Startup|Photonics Startup|Nano Distribution Network|Artificial Intelligence|Fintech|Tech|Cloud|Artificial|Data|Internet|Things|Apps|Android|Software|Computer|Mobile|3d Printing|Funding Platform|Applications|File|Embedded Systems|Portals|Fraud Detection|Search Engine|Nanotechnology|Security|Saas|Bit Company|Augmented Reality|Drone|Ar Startup|]baas|App|/|Virtual|It Startup|Photonics|E Tailor|Bai|Ai & Debt|Ai Company|Ai Chatbot|Iot Startup|Ai Startup|Iot|Social Platform|Ar Platform|Api Platform|Mlops Platform|Online Storytelling|Digital Platform|Paas Startup|Taas Startup|Digital Assistant', Sector):
        return 'Technology'
    elif re.search('Food & Beverage|Beverages|Foodtech|Craft Beer|Milk Startup|Beverage|Catering|Cook|Food|Restaurants', Sector):
        return 'Food & Beverage'
    elif re.search('Business Intelligence|Data Science|Analytics|Consulting|Human|Career|Erp|Advertising|Advertisement|Market Research|Entrepre|Recruit|Hr|Working|Sultancy|Advisory|Work|Job|Management|Skill|Legal|Crm|Specific Domain To Individuals|Information Services', Sector):
        return 'Business Intelligence'
    elif re.search('Renewables & Environment|Renewable Player|Electric Vehicle|Pollution Control Equiptment|Cleantech|Clean Energy|Energy|Boil &|Boil|Solar|Electricity|Environment', Sector):
        return 'Energy'
    elif re.search('Hospitality|Customer Service|Home Services|E Store|Customer Service Company|Co-Working|Accomodation|Cloud Kitchen|Customer|Hospital|Tourism|Events|Wedding|Travel|Hosts|Booking|Wedding|Qsr', Sector):
        return 'Hospitality'
    elif re.search('Trading Platform|Consumer|Supply Chain Platform|B2B|Business Supplies & Equipment|Fmcg|E-Tail|Entreprenurship|Car Trade|Reatil Startup|E-Mobility|Estore|Capital Markets|E-Commerce|Sales & Services|Sales And Distribution|Estore|Retail Startup|Packaging Services|E-Marketplace|2|Trade|Enterprise|Commerce|Business|Commercial|Consumer Goods|Marketplace|Business Consumer|Marketing|Retail|Market|Store|Furniture|Wholesale|Wine & Spirits|Multinational|E-|Packaging|Sales|Tplace|Warehouse|Fm|Product|Merchandise|Reatil|Conglomerates|Invoice Discounting|Supply Chain|Car Service|Service Industry|Company-As-A-Service|Consumer Service|Facilities Support Services|Facilities Services', Sector):
        return 'Commerce'
    elif re.search('Mechanical & Industrial Engineering|Packaging Solution Startup|Manufacturing|Home Interior Services|Craft Beer|Product Studio|Luxury Car Startup|Mechanical Or Industrial Engineering|Battery|Manufacturing|Electronics|Industrial Automation|Aerospace|Conductor|Gaming|Robotics|Engineering|Mechanical|Appliance|Automation|Ev Startup|Startup Laboratory|E-Vehicle|Luxury Car', Sector):
        return 'Manufacturing'
    elif re.search('Media and Entertainment|Games|E-Sports|Celebrity Engagement|Content Creation|Virtual Auditing Startup|Content Marktplace|Media|Dating|Music|Audio|Gaming|Creative|Entertainment|Broadcasting|Video|Blogging|Content|Celebrity|Ott', Sector):
        return 'Media and Entertainment'
    elif re.search('Commercial Real Estate|Interior & Decor|Co-Living|Apartment|Real Estate|Home|Interior|Construction|Rental|Housing|Accommodation|Hauz', Sector):
        return 'Real Estate'
    elif re.search('Telecommunications|Telecommuncation|Telecommunication|Telecom|News|Escrow|Publication', Sector):
        return 'Telecommunications'
    elif re.search('Healthtech|Healthcare|Pharmaceuticals|Pharmaceuticals|Healtcare|Pharmaceutical|Pharmacy|Helathcare|Medical|Healthtech|Dental|Health|Health Insurance|Medic|Supplement|Biopharma|Veterinary|Pharma|Heathcare|Nutrition|Hygiene|Care|Sanitation|Bio|Cannabis|Tobacco|Sciences', Sector):
        return 'Health'
    elif re.search('Sports & Fitness|Sportstech|Sports|Esports|Game|Ball|Player|Manchester', Sector):
        return 'Sports & Fitness'
    elif re.search('Skincare Startup|Foootwear|Eye Wear|Personal Care Startup|Beauty and Fashion|Clothing|Beauty|Cosmetic|Skincare|Fashion|Wear|Cosmetics|Textiles|Eyewear|Jewellery|Cloth|Eyeglasses', Sector):
        return 'Beauty and Fashion'
    elif re.search('Defense & Space|Government|Advisory Firm|Communities|Smart Cities|Government|Classifieds|Community|Water|Defense|Pollution|Translation & Localization|Taxation|Maritime', Sector):
        return 'Government'
    elif re.search('E-Learning|EduTech|Edttech|E-Learning|Skill Development|E-Learning|Job Discovery Platform|E-Learning|Preschool Daycare|E-Learning|E-Learning|Edutech|Education|Learn|Edtech', Sector):
        return 'Education'
    elif re.search('Nan|-|nan|NaN|—|None', Sector):
        return 'NaN'
    elif re.search('LifeStyle|Lifestyle|Decor|Fitness|Home Decor|Arts & Crafts|Training|Wellness|Personal Care|Deisgn|Craft|Design|Podcast|Lifestyle|Spiritual|Matrimony|Living|Cultural|Home', Sector):
        return 'LifeStyle'
    elif re.search('Water Purification|Job Portal|Social Audio|Others|Cannabis Startup|Staffing & Recruiting|Human Resources|Venture Capital|Multinational Conglomerate Company|Venture Capitalist|Hauz Khas|Social Network|Coworking|Biomaterial Startup|Environmental Service|Content Publishing|Legaltech|Environmental Services|Data Intelligence|Work Fulfillment|Pet Care|Deeptech|Martech|Photonics Startup|Sanitation Solutions|Mutual Funds', Sector):
        return 'Others'
    else:
        return Sector


In [5614]:
# Show all unique values in the Sector column


print(df['Sector'].unique())  # List of unique sectors






['Brand Marketing' 'Agriculture' 'Credit' 'Financial Services'
 'E-Commerce Platforms' 'Cloud Infrastructure' 'Internet'
 'Market Research' 'Information Services' 'Mobile Payments' 'B2B' 'Apps'
 'Food Delivery' 'Industrial Automation' 'Automotive' 'Finance'
 'Accounting' 'Artificial Intelligence' 'Internet of Things'
 'Air Transportation' 'Food and Beverage' 'Autonomous Vehicles'
 'Enterprise Software' 'Logistics' 'Insurance' 'Information Technology'
 'Blockchain' 'Education' 'E-Commerce' 'Renewable Energy' 'E-Learning'
 'Clean Energy' 'Transportation' 'Fitness' 'Hospitality'
 'Media and Entertainment' 'Broadcasting' 'EdTech' 'Health Care' '—'
 'Sports' 'Big Data' 'Cloud Computing' 'Food Processing'
 'Trading Platform' 'Consumer Goods' 'Wellness' 'Fashion' 'Consulting'
 'Biotechnology' 'Communities' 'Consumer' 'Consumer Applications' 'Mobile'
 'Advertising' 'Marketplace' 'Aerospace' 'Home Decor' 'Energy'
 'Digital Marketing' 'Creative Agency' 'Consumer Lending'
 'Health Diagnostics' 'B

import re

def sector_naming(sector):
    # Define regular expressions for each sector category
    finance_keywords = 'Credit Cards|Banking|Insuretech|Infratech|Saas\xa0\xa0Startup|Equity Management|Wealth Management|Saas  Startup|Insurtech|Crowdsourcing|Cryptocurrency|Online| Financial Service|Neo-Banking|Capital Markets|Mutual Funds|Bank|Finance|Crypto|Account|Credit|Venture|Crowd|Blockchain|Fund|Lending|Trading|Wealth|Insurance|Remittance|Money|Equity|Investment|Mortgage|Financial Services|Nft|Payments'
    agriculture_keywords = 'Agritech |Agriculture |Soil-Tech |Fishery|Agri|Biotechnology|Industrial|Farming|Fish|Milk|Diary|Dairy|Dairy Startup'
    technology_keywords = 'Machine Learning|Hrtech|Ar/Vr|Technology|Ai|E-Connect|E-Market|Traveltech|Biotech|Medtech|Ad-Tech|Healthtech|Games|Computer & Network Security|Saas Startup|Scanning App|Cloud Company|Cybersecurity|Aero Company|Cloud Computing|Techonology|E-Learning|Content Management|Recruitment|Consultancy|Ecommerce|Ev|Designing|Networking|Product Studio|Ecommerce|Proptech|Techonology|Milk Startup|Craft Beer|Craft Beer|Online Credit Management Startup|Foodtech|Spacetech|Deisgning|Clothing|Logitech|Femtech|D2C|Skill Development|Martech|Luxury Car Startup|Emobility|It|Healthcare|Qsr Startup|Sportstech|E-Marketplace|Cleantech|Heathtech|Digital Mortgage|Innovation Management|Photonics Startup|Life Sciences|Cloud Kitchen|Content Marktplace|Vehicle Repair Startup|Photonics Startup|Nano Distribution Network|Artificial Intelligence|Fintech|Tech|Cloud|Artificial|Data|Internet|Things|Apps|Android|Software|Computer|Mobile|3d Printing|Funding Platform|Applications|File|Embedded Systems|Portals|Fraud Detection|Search Engine|Nanotechnology|Security|Saas|Bit Company|Augmented Reality|Drone|Ar Startup|]baas|App|/|Virtual|It Startup|Photonics|E Tailor|Bai|Ai & Debt|Ai Company|Ai Chatbot|Iot Startup|Ai Startup|Iot|Social Platform|Ar Platform|Api Platform|Mlops Platform|Online Storytelling|Digital Platform|Paas Startup|Taas Startup|Digital Assistant'
    food_beverage_keywords = 'Food & Beverage|Beverages|Foodtech|Craft Beer|Milk Startup|Beverage|Catering|Cook|Food|Restaurants'
    transport_keywords = 'Auto-Tech|Tyre Management|Automobiles|Automobile|E-Mobility|Autonomous Vehicles|Vehicle Repair Startup|Automotive|Air Transport|Transport|Logistics|Vehicle|Transportation|Aviation|Vehicles|Tyre|Fleet|Wheels|Aero|Mobility|Aeorspace|Wl & Rac Protection|Micro-Mobiity|Delivery Service'
    business_intelligence_keywords = 'Business Intelligence|Data Science|Analytics|Consulting|Human|Career|Erp|Advertising|Advertisement|Market Research|Entrepre|Recruit|Hr|Working|Sultancy|Advisory|Work|Job|Management|Skill|Legal|Crm|Specific Domain To Individuals|Information Services'
    energy_keywords = 'Renewables & Environment|Renewable Player|Electric Vehicle|Pollution Control Equiptment|Cleantech|Clean Energy|Energy|Boil &|Boil|Solar|Electricity|Environment'
    hospitality_keywords = 'Hospitality|Customer Service|Home Services|E Store|Customer Service Company|Co-Working|Accomodation|Cloud Kitchen|Customer|Hospital|Tourism|Events|Wedding|Travel|Hosts|Booking|Wedding|Qsr'
    commerce_keywords = 'Trading Platform|Consumer|Supply Chain Platform|B2B|Business Supplies & Equipment|Fmcg|E-Tail|Entreprenurship|Car Trade|Reatil Startup|E-Mobility|Estore|Capital Markets|E-Commerce|Sales & Services|Sales And Distribution|Estore|Retail Startup|Packaging Services|E-Marketplace|2|Trade|Enterprise|Commerce|Business|Commercial|Consumer Goods|Marketplace|Business Consumer|Marketing|Retail|Market|Store|Furniture|Wholesale|Wine & Spirits|Multinational|E-|Packaging|Sales|Tplace|Warehouse|Fm|Product|Merchandise|Reatil|Conglomerates|Invoice Discounting|Supply Chain|Car Service|Service Industry|Company-As-A-Service|Consumer Service|Facilities Support Services|Facilities Services'
    manufacturing_keywords = 'Mechanical & Industrial Engineering|Packaging Solution Startup|Manufacturing|Home Interior Services|Craft Beer|Product Studio|Luxury Car Startup|Mechanical Or Industrial Engineering|Battery|Manufacturing|Electronics|Industrial Automation|Aerospace|Conductor|Gaming|Robotics|Engineering|Mechanical|Appliance|Automation|Ev Startup|Startup Laboratory|E-Vehicle|Luxury Car'
    media_entertainment_keywords = 'Media and Entertainment|Games|E-Sports|Celebrity Engagement|Content Creation|Virtual Auditing Startup|Content Marktplace|Media|Dating|Music|Audio|Gaming|Creative|Entertainment|Broadcasting|Video|Blogging|Content|Celebrity|Ott'
    real_estate_keywords = 'Commercial Real Estate|Interior & Decor|Co-Living|Apartment|Real Estate|Home|Interior|Construction|Rental|Housing|Accommodation|Hauz'
    telecommunications_keywords = 'Telecommunications|Telecommuncation|Telecommunication|Telecom|News|Escrow|Publication'
    health_keywords = 'Healthtech|Healthcare|Pharmaceuticals|Pharmaceuticals|Healtcare|Pharmaceutical|Pharmacy|Helathcare|Medical|Healthtech|Dental|Health|Health Insurance|Medic|Supplement|Biopharma|Veterinary|Pharma|Heathcare|Nutrition|Hygiene|Care|Sanitation|Bio|Cannabis|Tobacco|Sciences'
    sports_fitness_keywords = 'Sports & Fitness|Sportstech|Sports|Esports|Game|Ball|Player|Manchester'
    beauty_fashion_keywords = 'Skincare Startup|Foootwear|Eye Wear|Personal Care Startup|Beauty and Fashion|Clothing|Beauty|Cosmetic|Skincare|Fashion|Wear|Cosmetics|Textiles|Eyewear|Jewellery|Cloth|Eyeglasses'
    government_keywords = 'Defense & Space|Government|Advisory Firm|Communities|Smart Cities|Government|Classifieds|Community|Water|Defense|Pollution|Translation & Localization|Taxation|Maritime'
    education_keywords = 'E-Learning|EduTech|Edttech|E-Learning|Skill Development|E-Learning|Job Discovery Platform|E-Learning|Preschool Daycare|E-Learning|E-Learning|Edutech|Education|Learn|Edtech'
    NaN_keywords = 'Nan|-|nan|NaN|—|None'
    lifeStyle_keywords = 'LifeStyle|Lifestyle|Decor|Fitness|Home Decor|Arts & Crafts|Training|Wellness|Personal Care|Deisgn|Craft|Design|Podcast|Lifestyle|Spiritual|Matrimony|Living|Cultural|Home'
    others_keywords = 'Water Purification|Job Portal|Social Audio|Others|Cannabis Startup|Staffing & Recruiting|Human Resources|Venture Capital|Multinational Conglomerate Company|Venture Capitalist|Hauz Khas|Social Network|Coworking|Biomaterial Startup|Environmental Service|Content Publishing|Legaltech|Environmental Services|Data Intelligence|Work Fulfillment|Pet Care|Deeptech|Martech|Photonics Startup|Sanitation Solutions|Mutual Funds'

    # Define a dictionary mapping sector categories to their corresponding regular expressions
    sector_regex_map = {
        'Finance': finance_keywords,
        'Agriculture': agriculture_keywords,
        'Technology': technology_keywords,
        'Food & Beverage': food_beverage_keywords,
        'Transport': transport_keywords,
        'Business Intelligence': business_intelligence_keywords,
        'Energy': energy_keywords,
        'Hospitality': hospitality_keywords,
        'Commerce': commerce_keywords,
        'Manufacturing': manufacturing_keywords,
        'Media and Entertainment': media_entertainment_keywords,
        'Real Estate': real_estate_keywords,
        'Telecommunications': telecommunications_keywords,
        'Health': health_keywords,
        'Sports & Fitness': sports_fitness_keywords,
        'Beauty and Fashion': beauty_fashion_keywords,
        'Government': government_keywords,
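        'Education': education_keywords,
        'NaN': NaN_keywords,
        'LifeStyle': lifeStyle_keywords,
        'Others': others_keywords
    }

    # Missing values (NaN/None) are not strings and would make re.search raise a TypeError
    if not isinstance(sector, str):
        return sector

    # Return the first category whose pattern matches (dictionary order sets the priority)
    for sector_name, regex_pattern in sector_regex_map.items():
        if re.search(regex_pattern, sector, re.IGNORECASE):
            return sector_name

    # If no match is found, return the original sector name
    return sector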
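
Several alternatives in the keyword strings are very short ('It', 'Ai', '2', '/', 'E-'). Because re.search matches substrings, and this version adds re.IGNORECASE, 'It' also matches inside an unrelated label such as 'Hospitality'. The following is a minimal sketch of whole-word matching, not the notebook's method: the shortened keyword lists in `SECTOR_KEYWORDS` and the helper name `classify_sector` are assumptions made for illustration.

```python
import re

# Hypothetical, shortened keyword lists for illustration only
SECTOR_KEYWORDS = {
    "Finance": ["Fintech", "Banking", "Insurance", "Payments", "Credit"],
    "Technology": ["It", "Ai", "Saas", "Software", "Cloud"],
    "Health": ["Healthcare", "Health Care", "Pharma", "Medical"],
}

# One case-insensitive pattern per category; \b anchors each keyword to word boundaries
SECTOR_PATTERNS = {
    name: re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    for name, keywords in SECTOR_KEYWORDS.items()
}


def classify_sector(sector):
    """Return the first matching category, or the input unchanged if nothing matches."""
    if not isinstance(sector, str):
        return sector  # leave NaN/None untouched
    for name, pattern in SECTOR_PATTERNS.items():
        if pattern.search(sector):
            return name
    return sector


print(classify_sector("Hospitality"))  # unchanged: 'It' cannot match inside a word
print(classify_sector("AI startup"))   # Technology
print(classify_sector("Health Care"))  # Health
```

Escaping each keyword with re.escape also keeps characters such as '/' or '&' from being read as regex syntax. The price of word boundaries is that deliberate prefix keywords (for example 'Agri' for 'Agritech') would need to be listed in full.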



In [5615]:
# Filter rows where the 'Sector' column is null
df[df['Sector'].isnull()]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector
41,VMate,,,,A short video platform,,Alibaba,100000000.0,,2019,
49,Awign Enterprises,2016.0,Bangalore,,It supplies workforce to the economy,"Annanya Sarthak, Gurpreet Singh, Praveen Sah","Work10M, Michael and Susan Dell Foundation, Ea...",4000000.0,Series A,2019,
52,TapChief,2016.0,Bangalore,,It connects individuals in need of advice in a...,"Shashank Murali, Binay Krishna, Arjun Krishna",Blume Ventures.,1500000.0,Pre series A,2019,
56,KredX,,Bangalore,,Invoice discounting platform,Manish Kumar,Tiger Global Management,26000000.0,Series B,2019,
57,m.Paani,,Mumbai,,It digitizes and organises local retailers,Akanksha Hazari,"AC Ventures, Henkel",5500000.0,Series A,2019,
518,Text Mercato,2015.0,,,Cataloguing startup that serves ecommerce plat...,"Kiran Ramakrishna, Subhajit Mukherjee",1Crowd,649600.0,Series A,2020,
569,Magicpin,2015.0,,,"It is a local discovery, rewards, and commerce...","Anshoo Sharma, Brij Bhushan",Samsung Venture Investment Corporation,7000000.0,Series D,2020,
687,Leap Club,,,,Community led professional network for women,"Ragini Das, Anand Sinha","Whiteboard Capital, FirstCheque, Artha India V...",340000.0,Pre seed round,2020,
699,Juicy Chemistry,2014.0,,,It focuses on organic based skincare products,Pritesh Asher,Akya Ventures,650000.0,Series A,2020,
707,Magicpin,2015.0,,,"It is a local discovery, rewards, and commerce...","Anshoo Sharma, Brij Bhushan",Lightspeed Venture Partners,3879000.0,,2020,


##### Cleaning Stage Column

In [5616]:
# Get unique values in the Stage column
unique_stages = df['Stage'].unique()

# Convert the unique values array to a list
unique_stages_list = unique_stages.tolist()

# Print the list of unique stages
unique_stages_list

['Seed',
 'Series A',
 'Angel',
 'Series B',
 'Pre-Seed',
 'Private Equity',
 'Venture - Series Unknown',
 'Grant',
 'Debt Financing',
 'Post-IPO Debt',
 'Series H',
 'Series C',
 'Series E',
 'Corporate Round',
 '',
 'Series D',
 'Secondary Market',
 'Post-IPO Equity',
 'Non-equity Assistance',
 'Funding Round',
 nan,
 'Fresh funding',
 'Pre series A',
 'Series G',
 'Post series A',
 'Seed funding',
 'Seed fund',
 'Series F',
 'Series B+',
 'Seed round',
 'Pre-series A',
 None,
 'Pre-seed',
 'Pre-series',
 'Debt',
 'Pre-series C',
 'Pre-series B',
 'Bridge',
 'Series B2',
 'Pre- series A',
 'Edge',
 'Pre-Series B',
 'Seed A',
 'Series A-1',
 'Seed Funding',
 'Pre-seed Round',
 'Seed Round & Series A',
 'Pre Series A',
 'Pre seed Round',
 'Angel Round',
 'Pre series A1',
 'Series E2',
 'Seed Round',
 'Bridge Round',
 'Pre seed round',
 'Pre series B',
 'Pre series C',
 'Seed Investment',
 'Series D1',
 'Mid series',
 'Series C, D',
 'Seed+',
 'Series F2',
 'Series A+',
 'Series B3',
 '

In [5617]:
# Select rows where the 'Stage' column contains NaN values
df[pd.isna(df['Stage'])]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,6300000.0,,2019,Ecommerce
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),6000000.0,,2019,Agritech
5,FlytBase,,Pune,Technology,A drone automation platform,Nitin Gupta,,,,2019,Technology
6,Finly,,Bangalore,SaaS,It builds software products that makes work si...,"Vivek AG, Veekshith C Rai","Social Capital, AngelList India, Gemba Capital...",,,2019,Saas
10,Cub McPaws,2010.0,Mumbai,E-commerce & AR,A B2C brand that focusses on premium and comf...,"Abhay Bhat, Kinnar Shah",Venture Catalysts,,,2019,E-Commerce & Ar
...,...,...,...,...,...,...,...,...,...,...,...
1172,Peppermint,2019.0,Pune,Industrial Automation,Intelligent Housekeeping Robots for public and...,"Runal Dahiwade, Miraj C Vora","Venture Catalysts, Indian Angel Network",600000.0,,2021,Industrial Automation
1182,Sugar.fit,2021.0,Bangalore,Health,"Innovative technology, compassionate diabetes ...","Shivtosh Kumar, Madan Somasundaram","Cure.fit, Endiya Partners, Tanglin Venture",10000000.0,,2021,Health
1192,Geniemode,2021.0,Gurugram,B2B,Transforming global sourcing for retailers & s...,"Amit Sharma, Tanuj Gangwani",Info Edge Ventures,2000000.0,,2021,B2B
1193,Sapio Analytics,2019.0,Mumbai,Computer Software,Sapio helps government create policies driven ...,"Hardik Somani, Ashwin Srivastava, Shripal Jain...","Rachit Poddar, Rajesh Gupta",,,2021,Computer Software


In [5618]:
# Convert NaN values in the 'Stage' column to an empty string ('')
df.loc[pd.isna(df['Stage']), 'Stage'] = ''

In [5619]:
# Get unique values in the Stage column
unique_stages = df['Stage'].unique()

# Convert the unique values array to a list
unique_stages_list = unique_stages.tolist()

# Print the list of unique stages
unique_stages_list

['Seed',
 'Series A',
 'Angel',
 'Series B',
 'Pre-Seed',
 'Private Equity',
 'Venture - Series Unknown',
 'Grant',
 'Debt Financing',
 'Post-IPO Debt',
 'Series H',
 'Series C',
 'Series E',
 'Corporate Round',
 '',
 'Series D',
 'Secondary Market',
 'Post-IPO Equity',
 'Non-equity Assistance',
 'Funding Round',
 'Fresh funding',
 'Pre series A',
 'Series G',
 'Post series A',
 'Seed funding',
 'Seed fund',
 'Series F',
 'Series B+',
 'Seed round',
 'Pre-series A',
 'Pre-seed',
 'Pre-series',
 'Debt',
 'Pre-series C',
 'Pre-series B',
 'Bridge',
 'Series B2',
 'Pre- series A',
 'Edge',
 'Pre-Series B',
 'Seed A',
 'Series A-1',
 'Seed Funding',
 'Pre-seed Round',
 'Seed Round & Series A',
 'Pre Series A',
 'Pre seed Round',
 'Angel Round',
 'Pre series A1',
 'Series E2',
 'Seed Round',
 'Bridge Round',
 'Pre seed round',
 'Pre series B',
 'Pre series C',
 'Seed Investment',
 'Series D1',
 'Mid series',
 'Series C, D',
 'Seed+',
 'Series F2',
 'Series A+',
 'Series B3',
 'PE',
 'Series
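
The list above still contains several spellings of the same stage ('Seed funding', 'Seed round', 'Seed Round'; 'Pre series A', 'Pre-series A', 'Pre- series A'). A minimal sketch of one way to collapse them follows. The sample Series, the helper name `normalise_stage`, the rules themselves, and the 'Unknown' label for blank stages are all assumptions for illustration, not steps taken in this notebook.

```python
import re

import pandas as pd

# Hypothetical sample of raw labels taken from the list above
stages = pd.Series(["Seed funding", "Seed Round", "Pre series A", "Pre- series A",
                    "Pre-Series B", "Series B+", "Angel Round", "", "Debt"])


def normalise_stage(stage):
    """Collapse spelling variants of a funding stage into one label (illustrative rules only)."""
    if not isinstance(stage, str) or not stage.strip():
        return "Unknown"  # assumption: blank stages are labelled 'Unknown'
    s = re.sub(r"\s+", " ", stage.strip().lower())
    # Drop trailing filler words such as 'round' or 'funding'
    s = re.sub(r"\s+(round|funding|fund|investment)$", "", s)
    # 'pre series a' and 'pre- series a' both become 'pre-series a'
    s = re.sub(r"^pre[- ]+", "pre-", s)
    return s.title()


print(stages.map(normalise_stage).tolist())
```

Labels that combine stages, such as 'Seed Round & Series A' or 'Series C, D', are not resolved by these rules and would still need a manual decision.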

##### Cleaning Location Column

In [5620]:
# Get unique values in the Location column
unique_Location = df['Location'].unique()

# Convert the unique values array to a list
unique_Location_list = unique_Location.tolist()

# Print the list of unique locations
unique_Location_list

['Bangalore',
 'Mumbai',
 'Gurgaon',
 'Noida',
 'Hyderabad',
 'Bengaluru',
 'Kalkaji',
 'Delhi',
 'India',
 'Hubli',
 'New Delhi',
 'Chennai',
 'Mohali',
 'Kolkata',
 'Pune',
 'Jodhpur',
 'Kanpur',
 'Ahmedabad',
 'Azadpur',
 'Haryana',
 'Cochin',
 'Faridabad',
 'Jaipur',
 'Kota',
 'Anand',
 'Bangalore City',
 'Belgaum',
 'Thane',
 'Margão',
 'Indore',
 'Alwar',
 'Kannur',
 'Trivandrum',
 'Ernakulam',
 'Kormangala',
 'Uttar Pradesh',
 'Andheri',
 'Mylapore',
 'Ghaziabad',
 'Kochi',
 'Powai',
 'Guntur',
 'Kalpakkam',
 'Bhopal',
 'Coimbatore',
 'Worli',
 'Alleppey',
 'Chandigarh',
 'Guindy',
 'Lucknow',
 nan,
 'Telangana',
 'Gurugram',
 'Surat',
 'Uttar pradesh',
 'Rajasthan',
 'Tirunelveli',
 None,
 'Singapore',
 'Gujarat',
 'Kerala',
 'Frisco',
 'California',
 'Dhingsara',
 'New York',
 'Patna',
 'San Francisco',
 'San Ramon',
 'Paris',
 'Plano',
 'Sydney',
 'San Francisco Bay Area',
 'Bangaldesh',
 'London',
 'Milano',
 'Palmwoods',
 'France',
 'Samastipur',
 'Irvine',
 'Tumkur',
 'New

In [5621]:
# Replace 'Online Media\t#REF!' with 'Online Media'
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
df['Location'] = df['Location'].replace('Online Media\t#REF!', 'Online Media')

# Replace 'Manchester, Greater Manchester' with 'Manchester'
df['Location'] = df['Location'].replace('Manchester, Greater Manchester', 'Manchester')

In [5622]:
# Get unique values in the Location column
unique_Location = df['Location'].unique()

# Convert the unique values array to a list
unique_Location_list = unique_Location.tolist()

# Print the list of unique Location
unique_Location_list

['Bangalore',
 'Mumbai',
 'Gurgaon',
 'Noida',
 'Hyderabad',
 'Bengaluru',
 'Kalkaji',
 'Delhi',
 'India',
 'Hubli',
 'New Delhi',
 'Chennai',
 'Mohali',
 'Kolkata',
 'Pune',
 'Jodhpur',
 'Kanpur',
 'Ahmedabad',
 'Azadpur',
 'Haryana',
 'Cochin',
 'Faridabad',
 'Jaipur',
 'Kota',
 'Anand',
 'Bangalore City',
 'Belgaum',
 'Thane',
 'Margão',
 'Indore',
 'Alwar',
 'Kannur',
 'Trivandrum',
 'Ernakulam',
 'Kormangala',
 'Uttar Pradesh',
 'Andheri',
 'Mylapore',
 'Ghaziabad',
 'Kochi',
 'Powai',
 'Guntur',
 'Kalpakkam',
 'Bhopal',
 'Coimbatore',
 'Worli',
 'Alleppey',
 'Chandigarh',
 'Guindy',
 'Lucknow',
 nan,
 'Telangana',
 'Gurugram',
 'Surat',
 'Uttar pradesh',
 'Rajasthan',
 'Tirunelveli',
 None,
 'Singapore',
 'Gujarat',
 'Kerala',
 'Frisco',
 'California',
 'Dhingsara',
 'New York',
 'Patna',
 'San Francisco',
 'San Ramon',
 'Paris',
 'Plano',
 'Sydney',
 'San Francisco Bay Area',
 'Bangaldesh',
 'London',
 'Milano',
 'Palmwoods',
 'France',
 'Samastipur',
 'Irvine',
 'Tumkur',
 'New
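
The list above shows the same place under more than one name ('Bangalore', 'Bengaluru', 'Bangalore City'; 'Gurgaon', 'Gurugram'; 'Cochin', 'Kochi'; 'Uttar Pradesh', 'Uttar pradesh'), which would split counts in any per-city analysis. A minimal sketch of consolidating them with a lookup table follows. The `LOCATION_ALIASES` map and the sample Series are hypothetical, and which spelling to treat as canonical is an arbitrary choice here.

```python
import pandas as pd

# Hypothetical alias map built from spellings visible in the list above
LOCATION_ALIASES = {
    "Bengaluru": "Bangalore",
    "Bangalore City": "Bangalore",
    "Gurugram": "Gurgaon",
    "Cochin": "Kochi",
    "Uttar pradesh": "Uttar Pradesh",
}

locations = pd.Series(["Bengaluru", "Bangalore City", "Gurugram", "Mumbai", None])

# replace() with a dict maps whole values only; unmapped and missing values pass through
cleaned = locations.replace(LOCATION_ALIASES)
print(cleaned.tolist())
```

The same map could be applied to the full column with `df['Location'] = df['Location'].replace(LOCATION_ALIASES)`.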

In [5623]:
# Filter the data where the "Location" column is equal to "The Nilgiris"
df[df['Location'] == "The Nilgiris"]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector
1190,Prolgae,2016.0,The Nilgiris,Biotechnology,Prolgae Spirulina Supplies Pvt. Ltd. is a Nord...,Aakas Sadasivam,Vijayan,200000.0,Seed,2021,Biotechnology


##### Cleaning What_it_does Column

In [5624]:
# Get unique values in the What_it_does column
unique_WiD = df['What_it_does'].unique()

# Convert the unique values array to a list
unique_WiD_list = unique_WiD.tolist()

# Print the list of unique descriptions
unique_WiD_list

['TheCollegeFever is a hub for fun, fiesta and frolic of Colleges.',
 'A startup which aggregates milk from dairy farmers in rural Maharashtra.',
 'Leading Online Loans Marketplace in India',
 'PayMe India is an innovative FinTech organization which offers short term financial suport to corporate employees.',
 'Eunimart is a one stop solution for merchants to create a difference by selling globally.',
 'Hasura is a platform that allows developers to build, deploy, and host cloud-native applications quickly.',
 'Tripshelf is an online market place for holiday packages.',
 'Hyperdata combines advanced machine learning with human intelligence.',
 'Freightwalla is an international forwarder thats helps you manage supply chain by providing online tools including instant quotations.',
 'Microchip payments is a mobile-based payment application and point-of-sale device',
 'Building Transactionary B2B Marketplaces',
 'Emojifi is an app that provides live emoji, stickers & GIFs suggestions based

##### Cleaning the Year Column

In [5625]:

# Parse 'Funding_Year' as a date with format '%Y', then keep only the integer year
df['Funding_Year'] = pd.to_datetime(df['Funding_Year'], format='%Y').dt.year

# Check the data type and other info
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2873 entries, 0 to 1208
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  2873 non-null   object 
 1   Founded       2105 non-null   float64
 2   Location      2748 non-null   object 
 3   Sector        2855 non-null   object 
 4   What_it_does  2873 non-null   object 
 5   Founders      2329 non-null   object 
 6   Investor      2234 non-null   object 
 7   Amount        2319 non-null   float64
 8   Stage         2873 non-null   object 
 9   Funding_Year  2873 non-null   int32  
 10  sector        2855 non-null   object 
dtypes: float64(2), int32(1), object(8)
memory usage: 258.1+ KB


##### Cleaning Founded Column

In [5626]:
# Get unique values in the Founded column
unique_Founded = df['Founded'].unique()

# Convert the unique values array to a list
unique_Founded_list = unique_Founded.tolist()

# Print the list of unique Founded
unique_Founded_list

[nan,
 2014.0,
 2004.0,
 2013.0,
 2010.0,
 2018.0,
 2019.0,
 2017.0,
 2011.0,
 2015.0,
 2016.0,
 2012.0,
 2008.0,
 2020.0,
 1998.0,
 2007.0,
 1982.0,
 2009.0,
 1995.0,
 2006.0,
 1978.0,
 1999.0,
 1994.0,
 2005.0,
 1973.0,
 2002.0,
 2001.0,
 2021.0,
 1993.0,
 1989.0,
 2000.0,
 2003.0,
 1991.0,
 1984.0,
 1963.0]

In [5627]:
# Filter the data where the "Founded" column contains NaN values
df[df['Founded'].isna()]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector
0,TheCollegeFever,,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018,Brand Marketing
1,Happy Cow Dairy,,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018,Agriculture
2,MyLoanCare,,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018,Credit
3,PayMe India,,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018,Financial Services
4,Eunimart,,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,,Seed,2018,E-Commerce Platforms
...,...,...,...,...,...,...,...,...,...,...,...
1043,Quicko,,Ahmedabad,Taxation,Online tax planning and filing platform,Vishvajit Sonagara,"Zerodha fintech fund, Rainmatter",280000.0,,2020,Taxation
1044,Satin Creditcare,,Gurgaon,Fintech,A micro finance company,,Austrian Bank,15000000.0,,2020,Fintech
1050,Leverage Edu,,Delhi,Edtech,AI enabled marketplace that provides career gu...,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures",1500000.0,,2020,Edtech
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital",13200000.0,Seed Round,2020,Fintech


In [5628]:
# Calculate the average of non-NaN values in the 'Founded' column
average_founded = df['Founded'].mean()

# Fill NaN values in the 'Founded' column with the calculated average
df['Founded'] = df['Founded'].fillna(average_founded)

# Confirm that no NaN values remain in the 'Founded' column
df[df['Founded'].isna()]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector


In [5629]:
df['Founded'].unique()

array([2016.08598575, 2014.        , 2004.        , 2013.        ,
       2010.        , 2018.        , 2019.        , 2017.        ,
       2011.        , 2015.        , 2016.        , 2012.        ,
       2008.        , 2020.        , 1998.        , 2007.        ,
       1982.        , 2009.        , 1995.        , 2006.        ,
       1978.        , 1999.        , 1994.        , 2005.        ,
       1973.        , 2002.        , 2001.        , 2021.        ,
       1993.        , 1989.        , 2000.        , 2003.        ,
       1991.        , 1984.        , 1963.        ])

In [5630]:
# Round all values in the 'Founded' column to the nearest whole number
df['Founded'] = df['Founded'].round()

# View unique values in the 'Founded' column after rounding
df['Founded'].unique()

array([2016., 2014., 2004., 2013., 2010., 2018., 2019., 2017., 2011.,
       2015., 2012., 2008., 2020., 1998., 2007., 1982., 2009., 1995.,
       2006., 1978., 1999., 1994., 2005., 1973., 2002., 2001., 2021.,
       1993., 1989., 2000., 2003., 1991., 1984., 1963.])
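
After rounding, every missing founding year becomes 2016, so imputed rows can no longer be told apart from companies actually founded in 2016. A minimal sketch of one alternative follows, assuming a small hypothetical sample: it records which rows were imputed and fills with the median rather than the mean. This is an illustration of the idea, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical sample; None marks a missing founding year
founded = pd.Series([2014.0, None, 2019.0, 1963.0, None, 2018.0])

# Record which rows are imputed before filling, so later analysis can exclude them
was_imputed = founded.isna()

# The median is less sensitive than the mean to a few very old companies
median_year = founded.median()
founded_filled = founded.fillna(median_year).round().astype("int64")

print(median_year, founded.mean())
print(founded_filled.tolist())
print(was_imputed.tolist())
```

In this sample the single 1963 entry pulls the mean down to 2003.5, while the median stays at 2016.0. Casting with `astype("int64")` also yields integer years directly, without a round trip through a datetime.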

In [5631]:
# Parse 'Founded' as a date with format '%Y', then keep only the integer year
df['Founded'] = pd.to_datetime(df['Founded'], format='%Y').dt.year

# Check the data type and other info
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2873 entries, 0 to 1208
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  2873 non-null   object 
 1   Founded       2873 non-null   int32  
 2   Location      2748 non-null   object 
 3   Sector        2855 non-null   object 
 4   What_it_does  2873 non-null   object 
 5   Founders      2329 non-null   object 
 6   Investor      2234 non-null   object 
 7   Amount        2319 non-null   float64
 8   Stage         2873 non-null   object 
 9   Funding_Year  2873 non-null   int32  
 10  sector        2855 non-null   object 
dtypes: float64(1), int32(2), object(8)
memory usage: 246.9+ KB


##### Cleaning the Founders Column

In [5632]:
# Get unique values in the Founders column
unique_Founders = df['Founders'].unique()

# Convert the unique values array to a list
unique_Founders_list = unique_Founders.tolist()

# Print the list of unique Founders
unique_Founders_list

[nan,
 'Shantanu Deshpande',
 'Adamas Belva Syah Devara, Iman Usman.',
 'Jatin Solanki',
 'Srikanth Iyer, Rama Harinath',
 'Narayana Reddy Punyala',
 'Nitin Gupta',
 'Vivek AG, Veekshith C Rai',
 'Pavan Kushwaha, Paratosh Bansal, Dip Jung Thapa',
 'Renuka Ramnath',
 'Peyush Bansal, Amit Chaudhary, Sumeet Kapahi',
 'Abhay Bhat, Kinnar Shah',
 'D Padmanabhan',
 'Puneet Gupta, Sucharita Mukherjee',
 'Ishit Jethwa',
 'Ahana Gautam, Udit Kejriwal',
 'Rakesh Malhotra',
 'Byju Raveendran',
 'Chapman, Priya Sharma, Ashish Anantharaman',
 'Amit Modi',
 'Mohammed, Shashwat Diesh',
 'Renato Araujo',
 'Harsimarbir Singh, Dr Vaibhav Kapoor, Dr Garima Sawhney',
 'Gautam Tambay, Parul Gupta',
 'Dhiraj Naubhar, Dheeraj Bansal',
 'Tushar Kumar, Prashant Singh',
 'Arihant Jain, Ajeet Kushwaha',
 'Nishant Jain, Rohan Kumar',
 'Sam Udotong',
 'Sandipan Mitra, Uttam Kumar',
 'Nukul Upadhye, Mahesh Jakhotia, Jitender Bedwal, Daya Rai, Nikhil Tripathi',
 'Vivek Gupta, Abhay Hanjura',
 'Babu Dayal, Pramod Uni

##### Cleaning the Investors Column

In [5633]:
df['Investor'].unique()

array([nan, 'Sixth Sense Ventures', 'General Atlantic', ...,
       'Owl Ventures', 'Winter Capital, ETS, Man Capital',
       '3one4 Capital, Kalaari Capital'], dtype=object)
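
Many Investor values hold several names in one comma-separated string, as in '3one4 Capital, Kalaari Capital' above, so counting investors requires splitting them first. A minimal sketch follows, using a hypothetical two-row frame; it assumes that commas only ever separate investors, which may not hold for names that themselves contain a comma.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as the Investor column
sample = pd.DataFrame({
    "Company_Name": ["A", "B"],
    "Investor": ["3one4 Capital, Kalaari Capital", None],
})

# Split on commas, then give each investor its own row
exploded = (
    sample.assign(Investor=sample["Investor"].str.split(","))
          .explode("Investor")
)

# Remove the whitespace left around each name by the split
exploded["Investor"] = exploded["Investor"].str.strip()
print(exploded)
```

Rows with a missing Investor are kept as a single row with a missing value, so the number of companies is preserved.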

In [5634]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2873 entries, 0 to 1208
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company_Name  2873 non-null   object 
 1   Founded       2873 non-null   int32  
 2   Location      2748 non-null   object 
 3   Sector        2855 non-null   object 
 4   What_it_does  2873 non-null   object 
 5   Founders      2329 non-null   object 
 6   Investor      2234 non-null   object 
 7   Amount        2319 non-null   float64
 8   Stage         2873 non-null   object 
 9   Funding_Year  2873 non-null   int32  
 10  sector        2855 non-null   object 
dtypes: float64(1), int32(2), object(8)
memory usage: 246.9+ KB


In [5635]:
df.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector
0,TheCollegeFever,2016,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018,Brand Marketing
1,Happy Cow Dairy,2016,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018,Agriculture
2,MyLoanCare,2016,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018,Credit
3,PayMe India,2016,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018,Financial Services
4,Eunimart,2016,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,,Seed,2018,E-Commerce Platforms


#### Missing values in the Amount column are filled with the column mean

In [5636]:
# Calculate the mean of non-null values in the 'Amount' column
mean_amount = df['Amount'].mean()

# Round the mean to two decimal places
mean_amount_rounded = round(mean_amount, 2)

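# Fill missing values in the 'Amount' column with the rounded mean.
# Assign the result back: fillna(..., inplace=True) on a column selection
# is deprecated in recent pandas and may not update df.
df['Amount'] = df['Amount'].fillna(mean_amount_rounded)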
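
Mean imputation is sensitive to a few very large rounds, so the fill value can sit far above typical early-stage amounts. The sketch below uses made-up numbers to compare the mean and the median as fill values, and records which rows were imputed. The toy frame and the `Amount_was_missing` column are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): four small rounds, one very large round, one missing.
toy = pd.DataFrame(
    {"Amount": [250_000.0, 584_000.0, 949_000.0, 2_000_000.0, 540_000_000.0, np.nan]}
)

# Flag imputed rows before filling so they can be inspected or excluded later.
toy["Amount_was_missing"] = toy["Amount"].isna()

# Both statistics ignore NaN by default.
mean_fill = round(toy["Amount"].mean(), 2)
median_fill = toy["Amount"].median()

toy["Amount_mean_filled"] = toy["Amount"].fillna(mean_fill)
toy["Amount_median_filled"] = toy["Amount"].fillna(median_fill)

print(f"mean fill:   {mean_fill:,.2f}")
print(f"median fill: {median_fill:,.2f}")
```

With these toy values the single large round pulls the mean far above the median, which is why the flag column is worth keeping whichever statistic is chosen.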


In [5637]:
df['Amount'].min()

876.0

In [5638]:
df[df['Amount'] == 0.0]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector


In [5639]:
df.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector
0,TheCollegeFever,2016,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018,Brand Marketing
1,Happy Cow Dairy,2016,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018,Agriculture
2,MyLoanCare,2016,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018,Credit
3,PayMe India,2016,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018,Financial Services
4,Eunimart,2016,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,121048700.0,Seed,2018,E-Commerce Platforms


In [5640]:
# Calculate the mean of non-zero amounts
mean_amount = df[df['Amount'] != 0.0]['Amount'].mean()

# Replace zero values in the 'Amount' column with the mean
df.loc[df['Amount'] == 0.0, 'Amount'] = mean_amount

In [5641]:
df[df['Amount'] == 0.0]

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector


In [5642]:
df.head()

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector
0,TheCollegeFever,2016,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,250000.0,Seed,2018,Brand Marketing
1,Happy Cow Dairy,2016,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,584000.0,Seed,2018,Agriculture
2,MyLoanCare,2016,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,949000.0,Series A,2018,Credit
3,PayMe India,2016,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2000000.0,Angel,2018,Financial Services
4,Eunimart,2016,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,121048700.0,Seed,2018,E-Commerce Platforms


#### Removing Non-Startups

According to EU-Startups, under a more comprehensive definition a startup cannot be just any 'fledgling' business enterprise; it has to be focused on growth and scale. A company is no longer a startup once it has achieved scale, however arbitrary the definition of scale may be. Scale is typically measured in terms of revenue, number of employees and valuation, but can also include age, i.e. categorizing companies that are more than 5 years old as no longer startups.

Source: https://www.eu-startups.com/2021/03/when-is-a-startup-no-longer-a-startup/#:~:text=Scale%20is%20typically%20measured%20in,old%20as%20no%20longer%20startups.

In [5644]:
# Calculate 'Years of Existence' by subtracting 'Founded' from 'Funding_Year'
df['Years of Existence'] = df['Funding_Year'] - df['Founded']
df

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector,Years of Existence
0,TheCollegeFever,2016,Bangalore,Brand Marketing,"TheCollegeFever is a hub for fun, fiesta and f...",,,2.500000e+05,Seed,2018,Brand Marketing,2
1,Happy Cow Dairy,2016,Mumbai,Agriculture,A startup which aggregates milk from dairy far...,,,5.840000e+05,Seed,2018,Agriculture,2
2,MyLoanCare,2016,Gurgaon,Credit,Leading Online Loans Marketplace in India,,,9.490000e+05,Series A,2018,Credit,2
3,PayMe India,2016,Noida,Financial Services,PayMe India is an innovative FinTech organizat...,,,2.000000e+06,Angel,2018,Financial Services,2
4,Eunimart,2016,Hyderabad,E-Commerce Platforms,Eunimart is a one stop solution for merchants ...,,,1.210487e+08,Seed,2018,E-Commerce Platforms,2
...,...,...,...,...,...,...,...,...,...,...,...,...
1204,Gigforce,2019,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,3.000000e+06,Pre-series A,2021,Staffing & Recruiting,2
1205,Vahdam,2015,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,2.000000e+07,Series D,2021,Food & Beverages,6
1206,Leap Finance,2019,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,5.500000e+07,Series C,2021,Financial Services,2
1207,CollegeDekho,2015,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",2.600000e+07,Series B,2021,Edtech,6


In [5645]:
# Filter data where 'Years of Existence' is greater than 5
years_of_existence_gt_5 = df[df['Years of Existence'] > 5]
years_of_existence_gt_5

Unnamed: 0,Company_Name,Founded,Location,Sector,What_it_does,Founders,Investor,Amount,Stage,Funding_Year,sector,Years of Existence
4,Nu Genes,2004,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),6.000000e+06,,2019,Agritech,15
7,Kratikal,2013,Noida,Technology,It is a product-based cybersecurity solutions ...,"Pavan Kushwaha, Paratosh Bansal, Dip Jung Thapa","Gilda VC, Art Venture, Rajeev Chitrabhanu.",1.000000e+06,Pre series A,2019,Technology,6
9,Lenskart,2010,Delhi,E-commerce,It is a eyewear company,"Peyush Bansal, Amit Chaudhary, Sumeet Kapahi",SoftBank,2.750000e+08,Series G,2019,E-Commerce,9
10,Cub McPaws,2010,Mumbai,E-commerce & AR,A B2C brand that focusses on premium and comf...,"Abhay Bhat, Kinnar Shah",Venture Catalysts,1.210487e+08,,2019,E-Commerce & Ar,9
16,Byju's,2011,,Edtech,Provides online learning classes,Byju Raveendran,"South Africa’s Naspers Ventures, the CPP Inves...",5.400000e+08,,2019,Edtech,8
...,...,...,...,...,...,...,...,...,...,...,...,...
1195,Delhivery,2011,Gurugram,Logistics & Supply Chain,Delhivery is a leading logistics and supply ch...,Sahil Barua,Addition,7.600000e+07,Series I,2021,Logistics & Supply Chain,10
1196,Flipspaces,2011,Mumbai,Design,Flipspaces is a global tech-enabled venture to...,Kunal Sharma,Prashasta Seth,2.000000e+06,Pre-series B,2021,Design,10
1201,TechEagle,2015,Gurugram,Aviation & Aerospace,"Safe, secure & reliable On-Demand Drone delive...",Vikram Singh Meena,India Accelerator,5.000000e+05,Seed,2021,Aviation & Aerospace,6
1205,Vahdam,2015,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,2.000000e+07,Series D,2021,Food & Beverages,6


##### All the rows above will be dropped: in their funding year they were more than 5 years old, which disqualifies them from being startups

In [5646]:
# Drop rows where 'Years of Existence' is greater than 5
df = df[df['Years of Existence'] <= 5]

# Display the updated DataFrame
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2271 entries, 0 to 1208
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Company_Name        2271 non-null   object 
 1   Founded             2271 non-null   int32  
 2   Location            2161 non-null   object 
 3   Sector              2254 non-null   object 
 4   What_it_does        2271 non-null   object 
 5   Founders            1728 non-null   object 
 6   Investor            1651 non-null   object 
 7   Amount              2271 non-null   float64
 8   Stage               2271 non-null   object 
 9   Funding_Year        2271 non-null   int32  
 10  sector              2254 non-null   object 
 11  Years of Existence  2271 non-null   int32  
dtypes: float64(1), int32(3), object(8)
memory usage: 204.0+ KB
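
The age rule can also be wrapped in a small helper that reports how many rows it removes. This is a minimal sketch on a toy frame: the function name `keep_startups` and its `max_age` parameter are hypothetical, while the column names match the notebook.

```python
import pandas as pd


def keep_startups(frame: pd.DataFrame, max_age: int = 5) -> pd.DataFrame:
    """Keep rows whose age at funding (Funding_Year - Founded) is at most max_age."""
    age = frame["Funding_Year"] - frame["Founded"]
    # .copy() gives an independent frame, so later column edits do not warn.
    kept = frame.loc[age <= max_age].copy()
    kept["Years of Existence"] = age.loc[kept.index]
    print(f"dropped {len(frame) - len(kept)} of {len(frame)} rows")
    return kept


# Toy data (hypothetical companies) with ages 2, 9 and 6 at funding.
toy = pd.DataFrame(
    {
        "Company_Name": ["A", "B", "C"],
        "Founded": [2016, 2010, 2015],
        "Funding_Year": [2018, 2019, 2021],
    }
)
startups = keep_startups(toy)
```

In the notebook itself, the two `df.info()` outputs show 2873 rows before the filter and 2271 after, so the rule removes 602 rows.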


In [5648]:
# Export DataFrame to CSV file
df.to_csv('dfcl.csv', index=False)
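
A quick way to check an export like this is to read the file back and compare it with the frame that was written. The sketch below does that with a toy frame and a temporary file; the file name `toy_export.csv` and the data are assumptions for illustration. Because `index=False` is used, the original row labels are not written, so the reloaded frame gets a fresh 0..n-1 index.

```python
import os
import tempfile

import pandas as pd

# Toy frame with duplicate index labels, as can happen after stacking several files.
toy = pd.DataFrame(
    {"Company_Name": ["A", "B"], "Amount": [250000.0, 584000.0]},
    index=[0, 0],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_export.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Same shape and values, but the index has been reset on reload.
print(restored.shape == toy.shape)
print(restored.index.tolist())
```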