# **INDIAN STARTUP ECOSYSTEM**

## Project Description
We embark on a journey of discovery as we leverage our data analysis expertise to uncover the untapped potential within the Indian startup ecosystem. This project is designed to not only decode the numbers but to distill insights that will guide our team towards a successful foray into this dynamic market.

## Scope of Work

- Conduct a thorough exploration of the datasets, dissecting funding patterns, sectoral nuances, and geographical hotspots in the Indian startup landscape
- Analyze funding received by startups in India from 2018 to 2021



## Hypothesis 

**Null Hypothesis (H0)**: There is no significant relationship between funding and the sector  

**Alternative Hypothesis (H1)**: There is a significant relationship between funding and the sector
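
Once the data is cleaned, one way this hypothesis could be tested is a Kruskal-Wallis test, which compares the funding distributions of several sectors without assuming normality. The choice of test, the 0.05 significance level, and the toy figures below are assumptions made for illustration; they are not results from this project.

```python
import pandas as pd
from scipy.stats import kruskal

# Toy data: illustrative figures only, not taken from the project datasets
toy = pd.DataFrame({
    'sector': ['FinTech'] * 5 + ['EdTech'] * 5 + ['HealthTech'] * 5,
    'amount($)': [1e5, 2e5, 3e5, 4e5, 5e5,
                  6e5, 7e5, 8e5, 9e5, 1e6,
                  1.1e6, 1.2e6, 1.3e6, 1.4e6, 1.5e6],
})

# One array of funding amounts per sector
samples = [group['amount($)'].to_numpy() for _, group in toy.groupby('sector')]

# Kruskal-Wallis H-test across all sector samples
statistic, p_value = kruskal(*samples)

alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha
print(f"H = {statistic:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

On the real data, the same call would take one array of non-null amounts per sector, ideally after small sectors have been grouped or excluded.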

## Questions 
1. How does funding vary across different industry sectors in India?
2. How does funding vary with the location of the start-ups?
3. What is the relationship between the amount of funding and the stage of the company?
4. How have funding trends evolved between 2018 and 2021?
5. What are the most attractive sectors for investors?
6. Does the location of the company influence its sector?




# **DATA EXPLORATION, DATA UNDERSTANDING and DATA ANALYSIS**

In [5]:
# Load libraries
# Database connection
import pyodbc     
from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package

# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# filter warnings
import warnings 
warnings.filterwarnings('ignore')

# **1. Loading and Inspection of Data**

## **1.1 Loading data from the SQL server**

In [6]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")
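
Because `.get` returns `None` for a key that is absent from the `.env` file, a missing credential would only surface later as a failed connection. A small check like the sketch below could catch it earlier. The helper name `missing_credentials` and the example dictionary are hypothetical and hold placeholder values only.

```python
def missing_credentials(config, required=("SERVER", "DATABASE", "USERNAME", "PASSWORD")):
    """Return the required keys that are absent or empty in a config mapping."""
    return [key for key in required if not config.get(key)]


# Stand-in for the dictionary returned by dotenv_values (placeholder values)
example_config = {"SERVER": "example-server", "DATABASE": "example-db", "USERNAME": "user"}

missing = missing_credentials(example_config)
if missing:
    print(f"Missing credentials: {missing}")
```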

In [7]:
# Create a connection string

connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
    
connection = pyodbc.connect(connection_string)



In [8]:
# sql query to get 2020 data. 
query_2020="SELECT * FROM dbo.LP1_startup_funding2020"

# sql query to get 2021 data. 
query_2021="SELECT * FROM dbo.LP1_startup_funding2021"

In [9]:
# load 2021 data
data_2021=pd.read_sql(query_2021,connection)

# load 2020 data
data_2020=pd.read_sql(query_2020,connection)

## **1.2 Loading CSV Files**

In [10]:
# load 2019 data
data_2019=pd.read_csv(r'C:\Users\iamde\OneDrive\Desktop\jupyter\india_startup_data\startup_funding2019.csv')

# load 2018 data
data_2018=pd.read_csv(r'C:\Users\iamde\OneDrive\Desktop\jupyter\india_startup_data\startup_funding2018.csv')
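
The absolute paths above only resolve on one machine. A minimal sketch of a more portable alternative, assuming the CSV files sit in a folder next to the notebook; the folder name `india_startup_data` is taken from the paths above, while `DATA_DIR` and `csv_path` are hypothetical names.

```python
from pathlib import Path

# Assumed location of the CSV files, relative to the notebook
DATA_DIR = Path("india_startup_data")


def csv_path(year, data_dir=DATA_DIR):
    """Build the path to a yearly funding CSV, e.g. startup_funding2019.csv."""
    return data_dir / f"startup_funding{year}.csv"


paths = {year: csv_path(year) for year in (2018, 2019)}
# Each path could then be passed to pd.read_csv(paths[year])
```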


# **2. Exploratory Data Analysis (EDA)**

## **2.0.  2018 Dataset EDA**

In [7]:
# preview the rows and columns for the 2018 dataset
data_2018.sample(5)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
238,Trell,—,Seed,1250000,"Bangalore City, Karnataka, India",Trell is a location based network which helps ...
41,IndigoLearn,"E-Learning, Education",Seed,150000,"Hyderabad, Andhra Pradesh, India",India's the premier education destination cate...
349,Wicked Ride Adventure Services Private Limited,"Automotive, Last Mile Transportation, Peer to ...",Venture - Series Unknown,"$10,000,000","Bangalore, Karnataka, India",Peer-to-Peer motor vehicles sharing
430,UpCyclersLab,"Apps, Education, Retail",Seed,—,"Mumbai, Maharashtra, India",UpCyclersLab is a startup that makes sustainab...
72,BUGWORKS Research India,"Biotechnology, Life Science, Pharmaceutical, P...",Series A,9000000,"Bangalore, Karnataka, India",BUGWORKS is a drug discovery company.


In [11]:
# checking for number of columns and rows
print (data_2018.shape)
print(f"There are {data_2018.shape[0]} rows, and {data_2018.shape[1]} columns")

(526, 6)
There are 526 rows, and 6 columns


In [12]:
# checking info
data_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [13]:
# Describing the data
data_2018.describe()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
count,526,526,526,526,526,526
unique,525,405,21,198,50,524
top,TheCollegeFever,—,Seed,—,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
freq,2,30,280,148,102,2


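**Findings**  
- TheCollegeFever appears twice, which points to a duplicated row rather than a genuinely common company  

- Seed was the most common funding round (280 of 526 rows)  

- 'Bangalore, Karnataka, India' was the most common location (102 of 526 rows)  

- The placeholder '—' is the most frequent value in both 'Amount' (148 rows) and 'Industry' (30 rows)  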

In [14]:
# checking for duplicates
print("There are ",data_2018.duplicated().sum(),"duplicate(s)")

There are  1 duplicate(s)


In [15]:
# Checking for missing values
data_2018.isnull().sum()

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64
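
Although `isnull()` reports no missing values, the `describe()` output above shows '—' as the most frequent entry in 'Amount' and 'Industry', so missing data is hidden behind a placeholder. A minimal sketch of counting such placeholders, using a toy frame with illustrative values:

```python
import pandas as pd

# Toy frame mimicking the 2018 layout (values are illustrative)
toy_2018 = pd.DataFrame({
    'Industry': ['—', 'E-Learning, Education', 'Apps, Education, Retail'],
    'Amount': ['1250000', '—', '—'],
})

# Count the cells in each column that hold the '—' placeholder
placeholder_counts = (toy_2018 == '—').sum()
print(placeholder_counts)
```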

## **2.1. 2019 EDA**

In [8]:
# get a sample of 2019 dataset
data_2019.sample(5)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
79,Zolostays,2015.0,,Accomodation,It offers affordable housing apartments to you...,"Akhil Sikri, Nikhil Sikri, Sneha Choudhry",Trifecta Capital,"$7,000,000",
45,Afinoz,,Noida,Fintech,Online financial marketplace for customized ra...,Rachna Suneja,Fintech innovation lab,Undisclosed,
65,Cubical Labs,2013.0,,IoT,Home automation solution provider,"Dhruv Ratra, Swati Vyas",Rockstud Capital,Undisclosed,Series B
22,Springboard,2013.0,,Edtech,Offers online courses and extensive mentor-bas...,"Gautam Tambay, Parul Gupta",Reach Capital,"$11,000,000",Post series A


In [16]:
# checking for number of columns and rows
print (data_2019.shape)
print(f"There are {data_2019.shape[0]} rows, and {data_2019.shape[1]} columns")

(89, 9)
There are 89 rows, and 9 columns


In [17]:
# checking for duplicates
print("There are ",data_2019.duplicated().sum(),"duplicate(s)")

There are  0 duplicate(s)


In [18]:
# Checking for nulls
data_2019.isnull().sum()

Company/Brand     0
Founded          29
HeadQuarter      19
Sector            5
What it does      0
Founders          3
Investor          0
Amount($)         0
Stage            46
dtype: int64

In [19]:
# checking for datatypes in the different columns
data_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [20]:
# performing descriptive analysis
data_2019.describe(include='all')

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
count,89,60.0,70,84,89,86,89,89,43
unique,87,,17,52,88,85,86,50,15
top,Kratikal,,Bangalore,Edtech,Online meat shop,"Vivek Gupta, Abhay Hanjura",Undisclosed,Undisclosed,Series A
freq,2,,21,7,2,2,3,12,10
mean,,2014.533333,,,,,,,
std,,2.937003,,,,,,,
min,,2004.0,,,,,,,
25%,,2013.0,,,,,,,
50%,,2015.0,,,,,,,
75%,,2016.25,,,,,,,


**Findings**
- Kratikal is the most frequent company, appearing twice

- Bangalore is the most common headquarters (21 of the 70 non-null rows)

- Edtech is the most common sector (7 of the 84 non-null rows)

## **2.2. 2020 EDA**

In [9]:
# get a sample of 2020 dataset
data_2020.sample(5)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
329,Inflexor Ventures,2020.0,Mumbai,FinTech,Experienced Venture Capital-General Partner fo...,"Parampara Capital, Venkat Vallabhaneni, Jatin ...",,10000000.0,,
831,Stanplus,2016.0,Hyderabad,Healthtech,Ambulance service,Prabhdeep Singh,Pegasus FinInvest,1500000.0,Pre series A,
671,IVF Access,2019.0,Bangalore,Healthtech,It provdes IVF treatments,"Naresh Rao, Nikhil Rajmohan, Harinath Chakrava...",Vertex Ventures SEA & India,5000000.0,Series A,
205,Nykaa,2012.0,Mumbai,Cosmetics,Nykaa is an online marketplace for different b...,Falguni Nayar,"Alia Bhatt, Katrina Kaif",,,
604,Mitron,,Bangalore,Media,Short Video and Social Platform,"Shivank Agarwal, Anish Khandelwal","3One4 Capital, LetsVenture",267000.0,Seed Round,


In [22]:
#checking for number of columns and rows
print (data_2020.shape)
print(f"There are {data_2020.shape[0]} rows, and {data_2020.shape[1]} columns")

(1055, 10)
There are 1055 rows, and 10 columns


In [23]:
# checking for duplicates
print("There are ",data_2020.duplicated().sum(),"duplicate(s)")

There are  3 duplicate(s)


In [25]:
# Checking for nulls
data_2020.isnull().sum()

Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64

In [27]:
# checking for datatypes in the different columns
data_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.6+ KB


In [28]:
# Describe the data
data_2020.describe()

Unnamed: 0,Founded,Amount
count,842.0,801.0
mean,2015.36342,113043000.0
std,4.097909,2476635000.0
min,1973.0,12700.0
25%,2014.0,1000000.0
50%,2016.0,3000000.0
75%,2018.0,11000000.0
max,2020.0,70000000000.0


## **2.3. 2021 EDA**

In [10]:
# get a sample of the 2021 dataset
data_2021.sample(5)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
778,4baseCare,2018.0,Bangalore,HealthTech,4baseCare develops a unified and patient-centr...,"Hitesh Goswami, Kshitij Rishi","Mount Judi Ventures, growX Ventures, Season Tw...","$2,000,000",Pre-series A
70,ShareChat,2015.0,Bangalore,Social Media,ShareChat is a social networking and regional ...,"Ankush Sachdeva, Bhanu Pratap Singh, Farid Ahsan","Twitter Ventures, Pawan Munjal","$500,000,000",
49,CredFlow,2020.0,New Delhi,FinTech,CredFlow provides financial solutions to autom...,Kunal Aggarwal,"Stellaris Venture Partners, Omidyar Network In...","$2,000,000",Seed
763,Juicy Chemistry,2014.0,Coimbatore,HealthCare,Juicy Chemistry operates as an eponymous consu...,"Megha, Pritesh Asher.",Akya Ventures,"$6,300,000",Series A
298,Fabriclore,2016.0,Jaipur,Apparel & Fashion,India's top brand of artisanal & contemporary ...,"Vijay Sharma, Anupam Arya, Sandeep Sharma","Fluid Ventures, Mulberry Silks",$700000,Pre-series A


**Shape of the data**

In [11]:
# get the number of rows and columns for the datasets
print(f"The 2018 dataset has {data_2018.shape[0]} rows and {data_2018.shape[1]} Columns\n")
print(f"The 2019 dataset has {data_2019.shape[0]} rows and {data_2019.shape[1]} Columns\n")
print(f"The 2020 dataset has {data_2020.shape[0]} rows and {data_2020.shape[1]} Columns\n")
print(f"The 2021 dataset has {data_2021.shape[0]} rows and {data_2021.shape[1]} Columns\n\n")

The 2018 dataset has 526 rows and 6 Columns

The 2019 dataset has 89 rows and 9 Columns

The 2020 dataset has 1055 rows and 10 Columns

The 2021 dataset has 1209 rows and 9 Columns




**Info of the data**

In [12]:
# overview of 2018 dataset
data_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [13]:
# overview of 2019 dataset
data_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [14]:
# overview of 2020 dataset
data_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.6+ KB


In [15]:
# overview of 2021 dataset
data_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


**Displaying datasets columns**

In [16]:
# 2021 data columns
data_2021.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')

In [17]:
# 2020 data columns

data_2020.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')

In [18]:
# 2019 data columns

data_2019.columns

Index(['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
       'Founders', 'Investor', 'Amount($)', 'Stage'],
      dtype='object')

In [19]:
# 2018 data columns

data_2018.columns

Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company'],
      dtype='object')

## **Observations:**
**Issues with the data**

1. The column names in the 2018 and 2019 datasets follow different conventions from those in the 2020 and 2021 datasets.

2. The 2018 dataset lacks the `Founded`, `Founders`, and `Investor` columns that the other three datasets contain.

3. The 2020 dataset contains an extra column, `column10`, which is almost entirely null (2 non-null values out of 1055) and serves no purpose in our analysis.

**Course of Action:**

1. **Missing Column Engineering for 2018:**
   - We will add the missing columns to the 2018 dataset and fill them with nulls so that all four datasets share the same schema.

2. **Column Name Standardization:**
   - We will rename the columns of every dataset to a single lowercase naming scheme so that the datasets can be concatenated and analyzed together.

3. **Extraneous Column Removal in 2020:**
   - The extra `column10` in the 2020 dataset will be dropped.

These actions make the combined dataset more consistent and complete for the analysis that follows.
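
The schema differences can also be checked programmatically with set operations. The sketch below hard-codes the column names displayed earlier and compares every year against the 2021 layout; using 2021 as the reference is an arbitrary choice.

```python
# Column names as displayed in the outputs above
columns_by_year = {
    2018: {'Company Name', 'Industry', 'Round/Series', 'Amount', 'Location', 'About Company'},
    2019: {'Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
           'Founders', 'Investor', 'Amount($)', 'Stage'},
    2020: {'Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
           'Founders', 'Investor', 'Amount', 'Stage', 'column10'},
    2021: {'Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
           'Founders', 'Investor', 'Amount', 'Stage'},
}

reference = columns_by_year[2021]
for year, cols in columns_by_year.items():
    extra = sorted(cols - reference)      # present this year, absent in 2021
    absent = sorted(reference - cols)     # present in 2021, absent this year
    print(year, 'extra:', extra, 'missing:', absent)

extra_2020 = columns_by_year[2020] - reference
```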







# **3. Data Cleaning**

### Handling missing columns and currency signs in the 2018 dataset

- The 2018 dataset is missing the 'founded', 'founders', and 'investor' columns

- Amounts quoted in rupees (₹) are converted to dollars at a rate of 0.0146


In [20]:
# Engineer missing columns for the 2018 dataset
columns_to_add = ['founded', 'founders', 'investor']
for column in columns_to_add:
    if column not in data_2018.columns:
        data_2018[column] = np.nan

# Remove commas, '—' placeholders, and "''" from the 'Amount' column; empty strings become NaN
data_2018['Amount'] = data_2018['Amount'].str.replace(',', '').str.replace('—', '').str.replace("''",'').replace('', np.nan)

# Convert rupee amounts to dollars (rate 0.0146); rows without '₹' are left unchanged
mask = data_2018['Amount'].str.contains('₹', na=False)
data_2018.loc[mask, 'Amount'] = data_2018.loc[mask, 'Amount'].str.replace('₹', '').astype(float) * 0.0146
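
The same conversion can be sketched as a single function applied to each value, which also turns the dollar amounts into floats in one pass. The helper `to_usd` and the sample strings are hypothetical; the rate is the one used above.

```python
import numpy as np
import pandas as pd

INR_TO_USD = 0.0146  # conversion rate used in this notebook


def to_usd(raw):
    """Convert one raw amount string to dollars as a float (NaN if not numeric)."""
    if not isinstance(raw, str):
        return raw  # already numeric or missing
    text = raw.replace(',', '').strip()
    rate = INR_TO_USD if text.startswith('₹') else 1.0
    number = pd.to_numeric(text.lstrip('₹$'), errors='coerce')
    return number * rate


# Illustrative values only
toy_amounts = pd.Series(['₹40,000,000', '$10,000,000', '250000', '—'])
converted = toy_amounts.map(to_usd)
print(converted)
```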

### **Data collection Year**

- Each dataset needs a column recording the year it was collected, so that rows can still be traced to their source year after the dataframes are merged


In [21]:
# add the collection year as an integer column to every dataset
data_2018['data_year'] = 2018
data_2019['data_year'] = 2019
data_2020['data_year'] = 2020
data_2021['data_year'] = 2021

### **Merging the dataframes**

**Notes**
- The function below renames the columns of each dataframe to a common set of names; the dataframes are then concatenated in the next cell


In [22]:
# Define function to rename columns
def clean_dfs(df_1, df_2, df_3, df_4):
    # Rename columns in individual DataFrames
    df_1.rename(columns={'Company Name': 'company_brand', 'Industry': 'sector', 'Round/Series': 'stage',
                         'Amount': 'amount($)', 'Location': 'headquater', 'About Company': 'about_company'},
                inplace=True)
    df_2.rename(columns={'Company/Brand': 'company_brand', 'Founded': 'founded', 'HeadQuarter': 'headquater',
                         'Sector': 'sector', 'What it does': 'about_company', 'Founders': 'founders',
                         'Investor': 'investor', 'Amount($)': 'amount($)', 'Stage': 'stage'},
                inplace=True)
    df_3.rename(columns={'Company_Brand': 'company_brand', 'Founded': 'founded', 'HeadQuarter': 'headquater',
                         'Sector': 'sector', 'What_it_does': 'about_company', 'Founders': 'founders',
                         'Investor': 'investor', 'Amount': 'amount($)', 'Stage': 'stage'},
                inplace=True)
    df_4.rename(columns={'Company_Brand': 'company_brand', 'Founded': 'founded', 'HeadQuarter': 'headquater',
                         'Sector': 'sector', 'What_it_does': 'about_company', 'Founders': 'founders',
                         'Investor': 'investor', 'Amount': 'amount($)', 'Stage': 'stage'},
                inplace=True)
    return [df_1, df_2, df_3, df_4]
df=clean_dfs(data_2018, data_2019, data_2020, data_2021)

In [23]:
# Concatenate the dataframes
def concat(df_list):
    concatenated_df = pd.concat(df_list, ignore_index=True)
    return concatenated_df
df = concat([data_2018, data_2019, data_2020, data_2021])
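
A more compact variant of the rename-then-concatenate step is sketched below. It relies on `DataFrame.rename` ignoring keys that are not present, so one shared mapping can serve every dataset. The names `COMMON_NAMES` and `toy_frames` are hypothetical, and only three of the columns are shown.

```python
import pandas as pd

# One mapping covering the spellings used in all source datasets (subset shown)
COMMON_NAMES = {
    'Company Name': 'company_brand', 'Company/Brand': 'company_brand',
    'Company_Brand': 'company_brand',
    'Industry': 'sector', 'Sector': 'sector',
    'Amount': 'amount($)', 'Amount($)': 'amount($)',
}

# Toy stand-ins for the yearly dataframes
toy_frames = {
    2018: pd.DataFrame({'Company Name': ['A'], 'Industry': ['FinTech'], 'Amount': ['100']}),
    2019: pd.DataFrame({'Company/Brand': ['B'], 'Sector': ['EdTech'], 'Amount($)': ['200']}),
}

# rename() returns a copy, so the original frames are left untouched
renamed = [
    frame.rename(columns=COMMON_NAMES).assign(data_year=year)
    for year, frame in toy_frames.items()
]
combined = pd.concat(renamed, ignore_index=True)
print(combined)
```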

In [24]:
df.head()

Unnamed: 0,company_brand,sector,stage,amount($),headquater,about_company,founded,founders,investor,data_year,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,2018,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,584000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,2018,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,949000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,2018,
3,PayMe India,"Financial Services, FinTech",Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,2018,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,2018,


In [25]:
# Drop the extraneous 'column10'
df.drop('column10', axis=1, inplace= True)

## **Cleaning 'Amount' column**

**Notes**  
- Remove all currency signs  

- Remove all other unwanted characters, words and symbols  

- Fill the nulls using the interpolate method 

- Convert the column from object to float

In [26]:

# Remove dollar sign (raw string avoids an invalid-escape warning)
df['amount($)'] = df['amount($)'].replace(r'\$', '', regex=True)

# Remove commas
# Caution: the .str accessor returns NaN for entries that are not strings,
# such as amounts already stored as floats
df['amount($)'] = df['amount($)'].str.replace(',', '')

# Remove all other irrelevant characters, words and symbols
df['amount($)'] = df['amount($)'].replace(["Upsparks", 'undisclosed', 'Undisclosed', "ah! Ventures", 
                                               "Pre-series A", "ITO Angel Network LetsVenture", 
                                               "JITO Angel Network LetsVenture", "Series C", 'Seed', ','], '')

# Convert the 'amount($)' column to numeric
df['amount($)'] = pd.to_numeric(df['amount($)'])

In [27]:
df['amount($)'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 2879 entries, 0 to 2878
Series name: amount($)
Non-Null Count  Dtype  
--------------  -----  
1367 non-null   float64
dtypes: float64(1)
memory usage: 22.6 KB
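
Only 1367 of the 2879 amounts are non-null at this point. One likely contributor is that the merged column mixes strings with floats, and `.str` methods return NaN for non-string entries. For example, Happy Cow Dairy shows 584000.0 in the preview after merging but NaN in the preview after cleaning. The sketch below reproduces the effect on a toy column and shows one possible way around it; the values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy column mixing strings and floats, as can happen after concatenation
mixed = pd.Series(['$6,300,000', 584000.0, 'Undisclosed', '250000'], dtype=object)

# .str methods return NaN for the float entry, so its value is lost
via_str = mixed.str.replace(',', '')

# Casting to string first keeps every entry; non-numeric text becomes NaN
cleaned = pd.to_numeric(
    mixed.astype(str).str.replace(r'[$,]', '', regex=True),
    errors='coerce',
)
print(cleaned)
```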


## **Cleaning data_year column**

**Notes**  


- Convert data type to period


In [28]:
# Convert the data_year column to date
df['data_year']=pd.to_datetime(df['data_year'], format='%Y')
df['data_year']=df['data_year'].dt.to_period('y')
# df['founded']=pd.to_datetime(df['founded']).dt.year

In [29]:
# check for nulls and duplicated
print(f"There are {df['data_year'].isna().sum()} Null values in the 'data_year' column")

There are 0 Null values in the 'data_year' column


In [30]:
df['data_year'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 2879 entries, 0 to 2878
Series name: data_year
Non-Null Count  Dtype        
--------------  -----        
2879 non-null   period[Y-DEC]
dtypes: period[Y-DEC](1)
memory usage: 22.6 KB


## **Cleaning 'founded' column**

**Notes**
- Handle nulls by filling them with linear interpolation



In [31]:
print(f"There are {df['founded'].isna().sum()} Null values in the 'founded' column")

There are 769 Null values in the 'founded' column


**Notes**  
- There are 769 null values in the 'founded' column.  

- Dropping these rows would discard a significant share of the data, so the nulls are filled instead.

**COURSE OF ACTION**

- We will fill the missing values with linear interpolation. By default it fills a gap only when a valid value precedes it, so rows at the start of the dataframe can remain null.  

- We will also convert the data type from float to a yearly period for the purposes of our analysis.

In [32]:
# Fill the nulls by linear interpolation (assign the result back instead of using inplace on a column)
df['founded'] = df['founded'].interpolate(method='linear')

print(f"There are {df['founded'].isna().sum()} missing values")

There are 527 missing values
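
The 527 values that remain missing are consistent with how linear interpolation works by default: it only fills forward from an existing value, so nulls at the very start of the column (here, the 2018 rows, which have no founding year) are left untouched. A toy illustration with made-up values:

```python
import numpy as np
import pandas as pd

# Leading NaNs followed by a gap between two known years
toy_founded = pd.Series([np.nan, np.nan, 2015.0, np.nan, 2017.0])

filled = toy_founded.interpolate(method='linear')
print(filled)
```

Interpolated founding years are estimates derived from neighbouring rows rather than facts about the company, which is worth keeping in mind in any analysis that uses this column.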


In [33]:
# Convert to datetime
df['founded'] = pd.to_datetime(df['founded'], format='%Y')

# Convert to period
df['founded'] = df['founded'].dt.to_period('Y')


In [34]:
df['founded'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 2879 entries, 0 to 2878
Series name: founded
Non-Null Count  Dtype        
--------------  -----        
2352 non-null   period[Y-DEC]
dtypes: period[Y-DEC](1)
memory usage: 22.6 KB


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2852 entries, 0 to 2878
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype        
---  ------         --------------  -----        
 0   company_brand  2852 non-null   object       
 1   sector         2852 non-null   object       
 2   stage          2852 non-null   object       
 3   amount($)      1350 non-null   float64      
 4   headquater     2738 non-null   object       
 5   about_company  2852 non-null   object       
 6   founded        2326 non-null   period[Y-DEC]
 7   founders       2308 non-null   object       
 8   investor       2228 non-null   object       
 9   data_year      2852 non-null   period[Y-DEC]
dtypes: float64(1), object(7), period[Y-DEC](2)
memory usage: 245.1+ KB


## **Cleaning the 'founders' column**

In [35]:
# Replace the '...' placeholder with NaN
df['founders'] = df['founders'].replace('...', np.nan)

# Check the number of NaN values in the 'founders' column
nan_count = df['founders'].isna().sum()

print(nan_count)

545


In [36]:
df['founders'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 2879 entries, 0 to 2878
Series name: founders
Non-Null Count  Dtype 
--------------  ----- 
2334 non-null   object
dtypes: object(1)
memory usage: 22.6+ KB


## **Cleaning the 'stage' column**

Startups start with pre-seed, progress through seed, Series A, Series B, etc., securing resources for development and strategies. Additional rounds like Series C or D may follow. External funding at each stage fuels growth toward the venture's full potential.

**Pre-Seed Funding**  
Entrepreneurial idea in early development; small funds needed; limited informal channels for raising funds.

**Seed Funding**  
First official equity funding; investors provide funds for equity ownership.

**Series A Financing**  
First venture capital round; developed product, consistent revenue, long-term profit plan.

**Series B Financing**  
For established startups; substantial user base and revenue; funding for expansion.

**Series C and Beyond**  
Optional rounds for final push before IPO or unmet objectives; Series C is the third venture capital round.

**Initial Public Offering (IPO)**  
Process of offering corporate shares to the public; used for funding or divestment.

link: https://www.startupindia.gov.in/content/sih/en/funding.html

In [37]:
# Cleaning stage column: replace entries that are not funding stages (a URL and misplaced amounts) with NaN
df['stage']=df['stage'].replace(['https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
'$6000000','$1000000','$300000','$1200000'],np.nan)

In [38]:
# Standardize funding stages in the 'stage' column
df['stage'] = df['stage'].replace(['Series A', 'Seies A', 'Series A-1', 'Series A2', 'Series A+'], 'Series A')
df['stage'] = df['stage'].replace(['Pre-seed', 'Pre-Seed', 'Pre-seed Round', 'Pre seed Round', 'Pre seed round'], 'Pre-Seed Stage')
df['stage'] = df['stage'].replace(['Pre series A', 'Pre-series A', 'Pre Series A', 'Pre series A1', 'Pre-series A1', 'Pre- series A'], 'Pre series A')
df['stage'] = df['stage'].replace(['Series B', 'Series B+', 'Series B2', 'Series B3'], 'Series B')
df['stage'] = df['stage'].replace(['Series C', 'Series C, D', 'Private Equity', 'PE', 'Post-IPO Equity', 'Series D', 'Series D1', 'Series E', 'Series E2', 'Series F', 'Series F1', 'Series F2', 'Series G', 'Series H', 'Series I'], 'Series C and Beyond')
df['stage'] = df['stage'].replace(['Venture - Series Unknown', None,'Grant','Debt','Debt Financing','Post-IPO Debt','Non-equity Assistance','Bridge','Bridge Round','Fresh funding','Funding Round','Mid series','Edge'], 'unknown')
df['stage'] = df['stage'].replace(['Corporate Round','Undisclosed','Secondary Market','Pre-series','Post series A','Pre-series B','Pre-Series B','Pre series B','Pre-series C','Pre series C'], 'Other Stages')
df['stage'] = df['stage'].replace(['Seed','Seed funding','Angel', 'Angel Round','Seed fund', 'Seed round', 'Seed A','Seed Funding', 'Seed Round & Series A', 'Seed Round','Seed Investment','Seed+','Early seed'],'Seed Stage')

In [39]:
# Remove the literal '\t#REF!' residue, then trim surrounding whitespace
df['stage'] = df['stage'].str.replace('\t#REF!', '', regex=False).str.strip()
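
When removing a residue such as `\t#REF!`, it helps to remember that `strip` treats its argument as a set of characters to trim from both ends, not as a literal suffix. The plain-Python sketch below uses a hypothetical value to show the difference; pandas' `.str.strip` follows the same rule.

```python
raw = 'Series F\t#REF!'  # hypothetical value carrying the residue

# strip() trims any of the characters \t, #, R, E, F, ! from both ends,
# so the trailing 'F' of the stage name is removed as well
trimmed_by_set = raw.strip('\t#REF!')

# Removing the literal substring keeps the stage name intact
trimmed_by_substring = raw.replace('\t#REF!', '').strip()

print(repr(trimmed_by_set), repr(trimmed_by_substring))
```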

## **Cleaning the Sector Column**

In [40]:
# Keep only the first sector from each comma-separated list
df['sector']=df['sector'].str.split(",").str[0]

In [41]:
df['sector'] = df['sector'].replace({
    'Edtech': 'EdTech', 'Fintech': 'FinTech', 'Agriculture': 'AgriTech', 'Food & Beverages': 'Food and Beverages',
    'Financial Services': 'FinTech', 'Healthcare': 'HealthTech', 'Medical': 'HealthTech', 
    'Medtech': 'HealthTech', 'Pharmaceutical': 'HealthTech', 'Health Insurance': 'HealthTech', 
    'Biotechnology': 'HealthTech', 'Health Diagnostics': 'HealthTech', 'Hospital': 'HealthTech', 
    'Hospital & Health Care': 'HealthTech', 'Wellness': 'HealthTech', 'Dental': 'HealthTech', 
    'Alternative Medicine': 'HealthTech', 'Nutrition': 'HealthTech', 'Fitness': 'HealthTech', 
    'Mental Health': 'HealthTech', 'Healthcare/Edtech': 'HealthTech',
    'Life sciences': 'HealthTech', 'Biotech': 'HealthTech', 'Nutrition Tech': 'HealthTech', 
    'E-mobility': 'Transportation', 'Med Tech': 'HealthTech', 'FemTech': 'HealthTech', 
    'Cannabis startup': 'HealthTech', 'Pharmacy': 'HealthTech', 'Medical Device': 'HealthTech', 
    'BioTechnology': 'HealthTech', 'Fertility tech': 'HealthTech', 'Ayurveda tech': 'HealthTech', 
    'E-tail': 'E-Commerce', 'E store': 'E-Commerce', 'E-store': 'E-Commerce', 'Telemedicine': 'HealthTech', 
    'HealthCare': 'HealthTech', 'AI startup': 'AI', 'Information Services': 'InfoTech & Services', 
    'Healthtech': 'HealthTech', 'Finance': 'FinTech', 'Health Care': 'HealthTech', 
    'Logistics & Supply Chain': 'Logistics', 'Food Industry': 'FoodTech', 'Foodtech': 'FoodTech', 
    '—': 'Undisclosed', 'SaaS startup': 'SaaS', 'Health': 'HealthTech', 'Ecommerce': 'E-Commerce', 
    'Tech Startup': 'Tech', 'Mobility': 'Transportation', 'SaaS': 'Tech', 'Artificial Intelligence': 'AI', 
    'Food and Beverage': 'Food and Beverages', 'Information Technology': 'InfoTech & Services', 
    'Internet': 'Tech', 'Apps': 'Tech', 'Computer Software': 'Tech', 'E-commerce': 'E-Commerce', 
    'Agritech': 'AgriTech', 'Food': 'FoodTech', 'Cosmetics': 'Consumer Goods', 
    'Tech company': 'Tech', 'Automobile': 'Automotive', 'Apparel & Fashion': 'Fashion', 'Education': 'EdTech', 
    'Social Media': 'Media', 'Digital Media': 'Media', 'IT': 'InfoTech & Services', 'IoT': 'AI', 
    'Software': 'Tech', 'Industrial Automation': 'AI', 'Technology': 'Tech', 
    'Information Technology & Services': 'InfoTech & Services', None: 'Unknown'
})
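
Many of the sector labels differ only in capitalization or surrounding whitespace. A minimal sketch of normalizing case before mapping, which could shorten such a dictionary; `CANONICAL` and the sample labels are hypothetical and cover only a few sectors.

```python
import pandas as pd

# Hypothetical canonical names keyed by their lowercase spelling
CANONICAL = {'edtech': 'EdTech', 'fintech': 'FinTech', 'healthtech': 'HealthTech'}

toy_sector = pd.Series(['Edtech', 'EdTech', 'FINTECH', ' Healthtech', 'Gaming'])

# Build a case-insensitive lookup key
key = toy_sector.str.strip().str.lower()

# Fall back to the original label when no canonical name is known
normalized = key.map(CANONICAL).fillna(toy_sector)
print(normalized.value_counts())
```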




## **Cleaning HeadQuater Column**

In [42]:
# Get the first location from every list
df['headquater']=df['headquater'].str.split(",").str[0]

In [43]:
# Normalize alternative spellings, typos, and non-location values in the headquarters column
df['headquater'] = df['headquater'].replace({
    # Alternative names and misspellings of Indian cities and states
    'Bengaluru': 'Bangalore', 'Banglore': 'Bangalore', 'Gurugram': 'Gurgaon', 'Gurugram\t#REF!': 'Gurgaon',
    'Hyderebad': 'Hyderabad', 'New Delhi': 'Delhi', 'Ahmadabad': 'Ahmedabad', 'Telugana': 'Telangana',
    'Ernakulam': 'Kochi', 'Cochin': 'Kochi', 'Kerala': 'Kochi',  # Cochin and Kochi are the same city
    'Rajastan': 'Rajasthan', 'Rajsamand': 'Rajasthan', 'Orissia': 'Odisha',
    'Samsitpur': 'Samastipur', 'Santra': 'Samtra', 'The Nilgiris': 'Nilgiris',
    # Locations outside India
    'San Franciscao': 'San Francisco', 'San Francisco Bay Area': 'San Francisco',
    'California': 'San Francisco', 'Bangaldesh': 'Bangladesh',
    # Values that are not locations (sector names shifted into this column)
    'Online Media\t#REF!': 'Unknown', 'Pharmaceuticals\t#REF!': 'Unknown',
    'Information Technology & Services': 'Unknown', 'Small Towns': 'Unknown', 'Food & Beverages': 'Unknown',
})
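
The two cleaning steps above (keep the first comma-separated part, then map variants to one canonical name) can be tried out on a small example before touching the real column. The sketch below uses made-up raw values and a shortened mapping. It also strips the spreadsheet residue `\t#REF!` in a separate step, which is an assumption about how one might handle it rather than what the notebook does.

```python
import pandas as pd

# Hypothetical raw values, for illustration only
raw = pd.Series([
    'Bengaluru, Karnataka, India',
    'Gurugram\t#REF!',
    ' New Delhi',
    'Online Media\t#REF!',
    'Mumbai',
])

# Keep the first comma-separated part
city = raw.str.split(',').str[0]

# Remove spreadsheet error residue, then trim surrounding whitespace
city = city.str.replace('\t#REF!', '', regex=False).str.strip()

# Map spelling variants and non-location values to canonical labels
city = city.replace({
    'Bengaluru': 'Bangalore',
    'Gurugram': 'Gurgaon',
    'New Delhi': 'Delhi',
    'Online Media': 'Unknown',
})

print(city.tolist())
```

Calling `value_counts()` on the cleaned column afterwards is a quick way to spot any remaining misspellings, which tend to show up as values with very low counts.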


In [44]:
df.head()

Unnamed: 0,company_brand,sector,stage,amount($),headquater,about_company,founded,founders,investor,data_year
0,TheCollegeFever,Brand Marketing,Seed Stage,250000.0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",NaT,,,2018
1,Happy Cow Dairy,AgriTech,Seed Stage,,Mumbai,A startup which aggregates milk from dairy far...,NaT,,,2018
2,MyLoanCare,Credit,Series A,,Gurgaon,Leading Online Loans Marketplace in India,NaT,,,2018
3,PayMe India,FinTech,Seed Stage,2000000.0,Noida,PayMe India is an innovative FinTech organizat...,NaT,,,2018
4,Eunimart,E-Commerce Platforms,Seed Stage,,Hyderabad,Eunimart is a one stop solution for merchants ...,NaT,,,2018


In [45]:
# Show rows that are exact duplicates of an earlier row
df[df.duplicated()]

Unnamed: 0,company_brand,sector,stage,amount($),headquater,about_company,founded,founders,investor,data_year
348,TheCollegeFever,Brand Marketing,Seed Stage,250000.0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",NaT,,,2018
760,Krimanshi,Biotechnology company,Seed Stage,,Jodhpur,Krimanshi aims to increase rural income by imp...,2015,Nikhil Bohra,"Rajasthan Venture Capital Fund, AIM Smart City",2020
820,Nykaa,Consumer Goods,unknown,,Mumbai,Nykaa is an online marketplace for different b...,2012,Falguni Nayar,"Alia Bhatt, Katrina Kaif",2020
977,Byju’s,EdTech,unknown,,Bangalore,An Indian educational technology and online tu...,2011,Byju Raveendran,"Owl Ventures, Tiger Global Management",2020
982,Zomato,Food devlivery,unknown,,Haryana,Get online food delivery from restaurants near...,2008,"Deepinder Goyal, Pankaj Chaddah","MacRitchie Investments, Baillie Gifford",2020
1428,Nykaa,E-Commerce,unknown,,Mumbai,Deals in cosmetic and wellness products,2012,Falguni Nayar,Steadview capital,2020
1578,Vogo,Automotive,Series C and Beyond,,Bangalore,A scooter-sharing platform allowing users to r...,2016,"Anand Ayyadurai, Padmanabhan Balakrishnan, San...",Lightstone Aspada,2020
1624,Bounce,Automotive and Rentals,Series C and Beyond,,Bangalore,Offers a variety of bikes and scooters that ca...,2014,"Vivekananda Hallekere, Anil Giri Raju,Arun Agni","Accel Partners, B Capital",2020
1777,Curefoods,Food and Beverages,unknown,13000000.0,Bangalore,Healthy & nutritious foods and cold pressed ju...,2020,Ankit Nagori,"Iron Pillar, Nordstar, Binny Bansal",2021
1779,Bewakoof,Fashion,unknown,8000000.0,Mumbai,Bewakoof is a lifestyle fashion brand that mak...,2012,Prabhkiran Singh,InvestCorp,2021
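
By default `duplicated()` uses `keep='first'`, so the listing above contains only the later copies and not the rows they repeat. To review both copies side by side before dropping anything, `keep=False` can be used instead. The sketch below shows the difference on a toy frame; `toy` and its values are invented for illustration.

```python
import pandas as pd

# Toy data, for illustration only: rows 0/2 and rows 1/4 are identical
toy = pd.DataFrame({
    'company_brand': ['A', 'B', 'A', 'C', 'B'],
    'amount($)': [100.0, None, 100.0, 50.0, None],
})

# Default keep='first': only the later copies are flagged
later_copies = toy[toy.duplicated()]

# keep=False: every row that has at least one identical twin is flagged
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```

Note that missing values are treated as equal to each other for this comparison, which is why rows with an empty `amount($)` can still be flagged as duplicates.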


In [46]:
# Drop exact duplicate rows, keeping the first occurrence of each
df.drop_duplicates(keep='first', inplace=True)
df.head()

Unnamed: 0,company_brand,sector,stage,amount($),headquater,about_company,founded,founders,investor,data_year
0,TheCollegeFever,Brand Marketing,Seed Stage,250000.0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",NaT,,,2018
1,Happy Cow Dairy,AgriTech,Seed Stage,,Mumbai,A startup which aggregates milk from dairy far...,NaT,,,2018
2,MyLoanCare,Credit,Series A,,Gurgaon,Leading Online Loans Marketplace in India,NaT,,,2018
3,PayMe India,FinTech,Seed Stage,2000000.0,Noida,PayMe India is an innovative FinTech organizat...,NaT,,,2018
4,Eunimart,E-Commerce Platforms,Seed Stage,,Hyderabad,Eunimart is a one stop solution for merchants ...,NaT,,,2018


In [47]:
# Confirm that no duplicates remain
print(f" There are {df.duplicated().sum()} duplicates")

 There are 0 duplicates


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2852 entries, 0 to 2878
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype        
---  ------         --------------  -----        
 0   company_brand  2852 non-null   object       
 1   sector         2852 non-null   object       
 2   stage          2852 non-null   object       
 3   amount($)      1350 non-null   float64      
 4   headquater     2738 non-null   object       
 5   about_company  2852 non-null   object       
 6   founded        2326 non-null   period[Y-DEC]
 7   founders       2308 non-null   object       
 8   investor       2228 non-null   object       
 9   data_year      2852 non-null   period[Y-DEC]
dtypes: float64(1), object(7), period[Y-DEC](2)
memory usage: 245.1+ KB
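
The summary above shows that `amount($)` is populated for only 1350 of the 2852 rows, so roughly 52.7% of the funding amounts are missing. A per-column view of missing counts and percentages makes this kind of gap easier to compare across columns. The sketch below builds one on a toy frame; `toy` and `missing` are illustrative names, and the same expressions could be applied to `df`.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'sector': ['FinTech', 'EdTech', 'AgriTech', 'FinTech'],
    'amount($)': [250000.0, None, None, 2000000.0],
    'headquater': ['Bangalore', 'Mumbai', None, 'Noida'],
})

# Count and percentage of missing values per column, worst column first
missing = pd.DataFrame({
    'missing': toy.isna().sum(),
    'missing_pct': (toy.isna().mean() * 100).round(1),
}).sort_values('missing', ascending=False)

print(missing)
```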
