# Indian Startup Data Analysis

# 1.0 Introduction

                      
India has emerged as one of the fastest-growing startup ecosystems in the world, attracting significant investment across various sectors. The Indian startup funding dataset typically contains information on startup names, industry sectors, funding rounds, funding amounts, investor names, and funding dates.

# 2.0 Ask

### 2.1 Hypothesis


 H0: Funding to start-ups in India has not changed over time.
 HA: Funding to start-ups in India has changed over time.
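
As a rough illustration of what "changed over time" could mean in practice, the sketch below summarises funding per year. The rows are hypothetical toy values, not figures from the dataset; the column names `Funding Year` and `Amount` are the ones the combined DataFrame uses later in this notebook.

```python
import pandas as pd

# Hypothetical toy data; the real comparison would use the cleaned, combined dataset.
toy = pd.DataFrame({
    "Funding Year": ["2018", "2018", "2019", "2020", "2020", "2021"],
    "Amount": [1_000_000, 500_000, 2_000_000, 750_000, 250_000, 4_000_000],
})

# Number of deals, total funding and median deal size for each year
yearly = toy.groupby("Funding Year")["Amount"].agg(["count", "sum", "median"])
print(yearly)
```

Comparing these yearly figures side by side is one simple way to see whether the data supports H0 or HA; it is a descriptive summary, not a formal statistical test.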

### 2.2 Questions

To reach a final decision on whether or not to reject the null hypothesis, the following questions will be answered:

1.  Which start-ups are located in the capital of India?
2.  Which start-ups are in information technology-related businesses?
3.  Which industry received the most start-up funding, and why did it receive such an amount?
4.  Which industry received the least start-up funding, and what may be the reason?
5.  Which year recorded the most companies being formed?
6.  Which sectors receive the most funding from investors?


# 3.0 Prepare & Process (Data Cleaning & Preparation)

### 3.1 Loading the libraries and packages

In [5]:
# For data manipulation and cleaning
import pandas as pd
import numpy as np

# For data visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Other supporting libraries
import re
import warnings

# Hiding the warnings
warnings.filterwarnings('ignore')

print("Libraries and packages setup complete. Warnings hidden")



### 3.2 Loading the datasets

In [6]:
data_18 = pd.read_csv("startup_funding2018.csv") # for the 2018 startup data
data_19 =  pd.read_csv("startup_funding2019.csv") # for the 2019 startup data
data_20 = pd.read_csv("startup_funding2020.csv") # for the 2020 startup data
data_21 =  pd.read_csv("startup_funding2021.csv") # for the 2021 startup data

print("Datasets loaded.")


### 3.3 Previewing the Datasets & Getting Summary Information

#### 3.3.1 The 2018 Dataset

In [None]:
print(data_18.info(), "\n")
data_18

#### 3.3.2 The 2019 Dataset

In [None]:
print(data_19.info(), "\n")
data_19

#### 3.3.3 The 2020 Dataset

In [None]:
print(data_20.info(), "\n")
data_20

#### 3.3.4 The 2021 Dataset

In [None]:
print(data_21.info(), "\n")
data_21

### 3.4 Observations from previewing the datasets
**3.4.1 The 2018 DataFrame**
- The columns in the 2018 DataFrame are different from those of 2019 - 2021, meaning they have to be renamed for concatenation.
- The amounts in the 2018 DataFrame are a mix of Indian Rupees (INR) and US Dollars (USD), meaning they have to be converted into the same currency.
- The industry and location columns hold multiple values per entry. A decision is to be made between selecting the first value before the separator (,) as the main value, or representing that column with a wordcloud.

**3.4.2 The 2019 DataFrame**
- The datatype of the "Founded" column is set to float64. It should be set to a string for uniformity.
- The headquarter column holds multiple values per entry. A decision is to be made between selecting the first value before the separator (,) as the main value, or representing that column with a wordcloud.

**3.4.3 The 2020 DataFrame**
- There is an extra column called "Unnamed:9", giving it a total of 10 columns. It should be dropped to ensure complete alignment with the other DataFrames for ease of concatenation.

**3.4.4 The 2021 DataFrame**
- The datatype of the "Founded" column is set to float64. It should be set to a string for uniformity.

**3.4.5 General Observations**
- The currency signs and commas have to be removed from the amount column of each DataFrame.
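
The column mismatches noted above can also be listed programmatically before concatenation. The sketch below is a minimal illustration on hypothetical miniature frames; `column_differences` is a helper name introduced here, not part of pandas.

```python
import pandas as pd

def column_differences(frames):
    """For each named DataFrame, list the columns that are not shared by all frames."""
    common = set.intersection(*(set(df.columns) for df in frames.values()))
    return {name: sorted(set(df.columns) - common) for name, df in frames.items()}

# Hypothetical miniature frames standing in for the yearly datasets
frames = {
    "2019": pd.DataFrame(columns=["Company/Brand", "Founded", "Amount($)"]),
    "2020": pd.DataFrame(columns=["Company/Brand", "Founded", "Amount($)", "Unnamed: 9"]),
}

diffs = column_differences(frames)
print(diffs)
```

With the real data, the same call could be made as `column_differences({"2019": data_19, "2020": data_20, "2021": data_21})`.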

### 3.5 Assumptions

- The 2018 average INR/USD rate will be used to convert the Indian Rupee values to US Dollars in the 2018 DataFrame.
- First values of industry and location in the 2018 data will be selected as the primary sector and headquarters respectively.
- Amounts without currency symbols are assumed to be in USD ($).
- Financial analysis will be narrowed to transactions whose amounts are available in the loaded data.
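
The currency assumptions above can be sketched as a single helper that handles one raw amount at a time. This is an illustration under stated assumptions rather than the cleaning code this notebook actually runs: `amount_to_usd` is a name introduced here, and the rate of 0.0146 USD per INR is the value applied in the 2018 processing step.

```python
import math

INR_TO_USD_2018 = 0.0146  # conversion rate used for the 2018 data in this notebook

def amount_to_usd(raw):
    """Convert one raw 2018 amount string to a USD float, or NaN if it is unavailable."""
    text = str(raw).replace(",", "").strip()
    if text in {"", "—", "nan"}:
        # Missing or undisclosed amount
        return math.nan
    if text.startswith("₹"):
        # Rupee amounts are converted with the 2018 average rate
        return float(text[1:]) * INR_TO_USD_2018
    # A "$" prefix or no symbol at all: assumed to be USD already
    return float(text.lstrip("$"))

print(amount_to_usd("₹40,000,000"), amount_to_usd("$1,500"), amount_to_usd("250000"))
```

Such a helper could be applied column-wise with `data_18["Amount"].apply(amount_to_usd)`. Any other non-numeric text would raise a `ValueError`, which makes misplaced entries visible instead of silently converting them.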

### 3.6 Acting on the Observations and Assumptions

#### 3.6.1 Processing the 2018 DataFrame

In [None]:
data_18

In [None]:
# Selecting the main industries of the startups as Industry
data_18['Industry'] = data_18['Industry'].apply(str)
data_18['Industry'] = data_18['Industry'].str.split(',').str[0]
data_18['Industry'] = data_18['Industry'].replace("'", "", regex=True)

# Selecting the main locations of the startups as Location
data_18['Location'] = data_18['Location'].apply(str)
data_18['Location'] = data_18['Location'].str.split(',').str[0]
data_18['Location'] = data_18['Location'].replace("'", "", regex=True)

data_18

In [None]:
# Cleaning the Amounts column
## Removing the commas, dashes and dollar signs from the Amounts
data_18['Amount'] = data_18['Amount'].apply(str)
data_18['Amount'] = data_18['Amount'].replace(",", "", regex=True)
data_18['Amount'] = data_18['Amount'].replace("—", 0, regex=True)
# "$" on its own is a regex end-of-string anchor, so it is escaped to match the literal sign
data_18['Amount'] = data_18['Amount'].replace(r"\$", "", regex=True)

## Creating temporary columns to help with the conversion of INR to USD
data_18['INR Amount'] = data_18['Amount'].str.rsplit('₹', n = 2).str[1]
data_18['INR Amount'] = data_18['INR Amount'].apply(float).fillna(0)
data_18['USD Amount'] = data_18['INR Amount'] * 0.0146
data_18['USD Amount'] = data_18['USD Amount'].replace(0, np.nan)
data_18['USD Amount'] = data_18['USD Amount'].fillna(data_18['Amount'])
data_18['USD Amount'] = data_18['USD Amount'].replace(r"\$", "", regex=True)
data_18["Amount"] = data_18["USD Amount"]
data_18["Amount"] = data_18["Amount"].apply(lambda x: float(str(x).replace("$","")))
data_18["Amount"] = data_18["Amount"].replace(0, np.nan)

# Dropping the temporary columns
data_18.drop(columns = ["INR Amount", "USD Amount"], inplace = True)

# Correcting the mistaken Funding Round entry
data_18['Round/Series'] = data_18['Round/Series'].replace("https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593", "Seed")

# Adding a column to represent the year of funding
data_18["Year of Funding"] = "2018"

print(data_18.info(), "\n")
data_18

#### 3.6.2 Processing the 2019 DataFrame

In [None]:
print(data_19.info(), "\n")
data_19

In [None]:
# Converting the Founded column to a string
data_19['Founded'] = data_19['Founded'].apply(str)

# Removing the commas, "Undisclosed" placeholders and $ signs from the Amounts
data_19["Amount($)"] = data_19["Amount($)"].apply(str)
data_19["Amount($)"] = data_19["Amount($)"].replace(",", "", regex=True)
data_19["Amount($)"] = data_19["Amount($)"].replace("Undisclosed", np.nan, regex=True)
data_19["Amount($)"] = data_19["Amount($)"].apply(lambda x: float(str(x).replace("$","")))

# Restoring the missing values in the Founded column after the string conversion
data_19["Founded"] = data_19["Founded"].replace("nan", np.nan, regex=True)

# Appending years of funding to the respective dataframes
data_19["Year of Funding"] = "2019"

print(data_19.info(), "\n")
data_19

#### 3.6.3 Processing the 2020 DataFrame

In [None]:
print(data_20.info(), "\n")
data_20

In [None]:
# Dropping the extra column in the 2020 DataFrame
data_20 = data_20.iloc[: , :-1].copy()  # .copy() so that later column assignments act on an independent DataFrame
data_20

# Selecting the first value as Headquarters
data_20['HeadQuarter'] = data_20['HeadQuarter'].apply(str)
data_20['HeadQuarter'] = data_20['HeadQuarter'].str.split(',').str[0]
data_20['HeadQuarter'] = data_20['HeadQuarter'].replace("'", "", regex=True)

# Removing the commas and the "Undisclosed" placeholders (including misspellings) from the Amounts
data_20["Amount($)"] = data_20["Amount($)"].apply(str)
data_20["Amount($)"] = data_20["Amount($)"].replace(",", "", regex=True)
data_20["Amount($)"] = data_20["Amount($)"].replace("Undisclosed", np.nan, regex=True)
data_20["Amount($)"] = data_20["Amount($)"].replace("Undiclsosed", np.nan, regex=True)
data_20["Amount($)"] = data_20["Amount($)"].replace("Undislosed", np.nan, regex=True)
data_20["Amount($)"] = data_20["Amount($)"].replace("nan", np.nan, regex=True)

In [None]:
# From here, it is seen that True Balance has 2 entries in this DataFrame, both representing funding they received.
# The entry with index 465 will be dropped and the Stage for 136 corrected
data_20.loc[data_20["Company/Brand"] == "True Balance", "Stage"] = "Series C"
data_20.drop([465], axis = 0, inplace = True)
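
Repeated entries such as the one handled above can be surfaced with `DataFrame.duplicated`. The sketch below runs on hypothetical toy rows (the amounts are placeholders, not dataset values) and lists every row whose company name appears more than once.

```python
import pandas as pd

# Hypothetical toy rows; names mirror the notebook, amounts are placeholders
toy = pd.DataFrame({
    "Company/Brand": ["True Balance", "Eruditus", "True Balance", "Zomato"],
    "Amount($)": ["$100", "$200", "$100", "$300"],
})

# keep=False marks every occurrence of a repeated name, not only the later ones
repeats = toy[toy.duplicated(subset="Company/Brand", keep=False)]
repeats = repeats.sort_values("Company/Brand", kind="stable")
print(repeats)
```

A repeated name is not necessarily an error, since a startup can raise several rounds in one year, so each group still needs a manual look before anything is dropped.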

In [None]:
# Eruditus also has an erroneously stated amount, which is to be corrected
data_20.loc[data_20["Company/Brand"] == "Eruditus", ["Amount($)", "Stage"]] = [113000000, "Series D"]
data_20.loc[data_20["Company/Brand"] == "Eruditus"]

In [None]:
# Removing the $ signs and converting the Amount column to float
data_20["Amount($)"] = data_20["Amount($)"].apply(lambda x: float(str(x).replace("$","")))

# Appending years of funding to the respective dataframes
data_20["Year of Funding"] = "2020"

print(data_20.info(), "\n")
data_20

#### 3.6.4 Processing the 2021 DataFrame

In [None]:
print(data_21.info(), "\n")
data_21

In [None]:
# Converting the Founded column to a string
data_21['Founded'] = data_21['Founded'].apply(str)
data_21["Founded"] = data_21["Founded"].replace("nan", np.nan, regex=True)

# Removing the commas and the "Undisclosed" placeholders (including misspellings) from the Amounts
data_21["Amount($)"] = data_21["Amount($)"].apply(str)
data_21["Amount($)"] = data_21["Amount($)"].replace(",", "", regex=True)
data_21["Amount($)"] = data_21["Amount($)"].replace("Undisclosed", np.nan, regex=True)
data_21["Amount($)"] = data_21["Amount($)"].replace("Undiclsosed", np.nan, regex=True)
data_21["Amount($)"] = data_21["Amount($)"].replace("Undislosed", np.nan, regex=True)
data_21["Amount($)"] = data_21["Amount($)"].replace("undisclosed", np.nan, regex=True)
data_21["Amount($)"] = data_21["Amount($)"].replace("nan", np.nan, regex=True)

In [None]:
# FanPlay has duplicates with indexes 1768 and 1781, and has wrongly placed amounts.
data_21.drop([98], axis = 0, inplace = True)

data_21.loc[data_21["Company/Brand"] == "FanPlay", ["Amount($)", "Stage", "Investor"]] = [1200000, "Series A", "Upsparks"]
data_21.loc[data_21["Company/Brand"] == "FanPlay"]

In [None]:
# For Fullife Healthcare, the entries with indexes 1912 and 1926 both represent Series C funding raised in 2021. One has to be deleted.
data_21.loc[data_21["Company/Brand"] == "Fullife Healthcare", ["Amount($)", "Stage", "Investor"]] = [22000000, "Series C", "Morgan Stanley Private Equity Asia"]
data_21.drop([256], axis = 0, inplace = True)
data_21.loc[data_21["Company/Brand"] == "Fullife Healthcare"]

In [None]:
# Yet another error is thrown when trying to convert Amount to a float, this time by an entry "Seed", which has to be checked and corrected
data_21.loc[data_21["Amount($)"] == "Seed"]

# Correcting for MoEVing
data_21.loc[data_21["Company/Brand"] == "MoEVing", ["Amount($)", "Stage", "Investor"]] = [5000000, "Seed", np.nan]

# Correcting for Godamwale
data_21.loc[data_21["Company/Brand"] == "Godamwale", ["Amount($)", "Stage", "Investor"]] = [1000000, "Seed", "Anand Aryamane"]

In [None]:
# A similar error is thrown again, by an entry "ah! Ventures", which has to be checked and corrected
data_21.loc[data_21["Amount($)"] == "ah! Ventures"]

data_21.loc[data_21["Company/Brand"] == "Little Leap", ["Amount($)", "Stage", "Investor"]] = [int(26700000/73.9339), "Seed", "ah! Ventures"]
data_21.loc[data_21["Company/Brand"] == "Little Leap"]

In [None]:
# A similar error is thrown again, by an entry "Pre-series A", which has to be checked and corrected
data_21.loc[data_21["Amount($)"] == "Pre-series A"]

data_21.loc[data_21["Company/Brand"] == "AdmitKard", ["Amount($)", "Stage", "Investor"]] = [int(26700000/73.9339), "Pre-series A", np.nan]
data_21.loc[data_21["Company/Brand"] == "AdmitKard"]

In [None]:
# A similar error is thrown again, by an entry "ITO Angel Network LetsVenture", which has to be checked and corrected
data_21.loc[data_21["Amount($)"] == "ITO Angel Network LetsVenture"]

data_21.loc[data_21["Amount($)"] == "ITO Angel Network LetsVenture", ["Amount($)", "Stage", "Investor"]] = [300000, "Angel", "ITO Angel Network LetsVenture"]
data_21.loc[data_21["Company/Brand"] == "BHyve"]

In [None]:
# A similar error is thrown again, by an entry "JITO Angel Network LetsVenture", which has to be checked and corrected
data_21.loc[data_21["Amount($)"] == "JITO Angel Network LetsVenture"]

data_21.loc[data_21["Amount($)"] == "JITO Angel Network LetsVenture", ["Amount($)", "Stage"]] = [1000000, "Seed"]
data_21.loc[data_21["Company/Brand"] == "Saarthi Pedagogy"]

In [None]:
data_21.loc[data_21["Stage"] == "$6000000"]

data_21.loc[data_21["Stage"] == "$6000000", ["Amount($)", "Stage"]] = [9627286, "Venture"]
data_21.loc[data_21["Company/Brand"] == "MYRE Capital"]
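
The misplaced entries corrected one at a time above could also be listed in a single pass, instead of waiting for each float conversion to fail. A minimal sketch on hypothetical toy rows, assuming that anything which is neither numeric nor a known "undisclosed" placeholder is suspect:

```python
import pandas as pd

# Hypothetical toy rows imitating the kinds of misplaced values seen in the 2021 data
toy = pd.DataFrame({
    "Company/Brand": ["A", "B", "C", "D"],
    "Amount($)": ["$1,200,000", "Seed", "Undisclosed", "ah! Ventures"],
})

# Strip currency signs and commas, then try a numeric conversion
cleaned = toy["Amount($)"].astype(str).str.replace(r"[$,]", "", regex=True)
as_number = pd.to_numeric(cleaned, errors="coerce")

# Values that are legitimately missing rather than misplaced
known_missing = cleaned.str.lower().isin(["undisclosed", "nan", ""])

suspect = toy[as_number.isna() & ~known_missing]
print(suspect)
```

Here `errors="coerce"` turns unparseable text into NaN, so the remaining rows are the ones where a stage or investor name has slipped into the amount column.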

In [None]:
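# Removing the $ signs, treating empty and missing entries as 0, and converting the Amount column to float.
# The pattern "^\s*$" only matches empty strings; a bare '' regex would match (and overwrite) every entry.
data_21["Amount($)"] = data_21["Amount($)"].apply(lambda x: str(x).replace("$",""))
data_21["Amount($)"] = data_21["Amount($)"].replace(r"^\s*$", 0, regex=True)
data_21["Amount($)"] = data_21["Amount($)"].replace("nan", 0, regex=True)
data_21["Amount($)"] = pd.to_numeric(data_21["Amount($)"]).astype(float)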

# Appending years of funding to the respective dataframes
data_21["Year of Funding"] = "2021"

print(data_21.info(), "\n")
data_21

### 3.7 Joining all the DataFrames into a combined DataFrame

In [None]:
# Joining the DataFrames with similar column names
combined_19_21 = pd.concat([data_19, data_20, data_21], ignore_index = True)
combined_19_21.columns = ["Company Name", "Year Founded", "Headquarters", "Sector", "Description", "Founders", "Investors", "Amount", "Funding Stage", "Funding Year"]
combined_19_21

In [None]:
# Renaming the columns in the 2018 dataframe to match with the other dataframes
data_18.columns = ['Company Name', 'Sector', 'Funding Stage', 'Amount', 'Headquarters', 'Description', "Funding Year"]
data_18

In [None]:
# Joining the 2018 DataFrame to the 2019-2021 DataFrame
complete_set = pd.concat([data_18, combined_19_21], ignore_index = True)
print(complete_set.info(), "\n")
complete_set
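
Assigning a full list to `.columns` relies on every DataFrame having its columns in exactly the expected order. A name-based alternative is sketched below with `DataFrame.rename`; the toy frames are hypothetical and only carry a subset of the columns, and the mapping dictionaries are introduced here for illustration.

```python
import pandas as pd

# Old name -> new name; only columns that appear in this notebook's code are listed
RENAMES_2018 = {
    "Industry": "Sector",
    "Round/Series": "Funding Stage",
    "Location": "Headquarters",
    "Year of Funding": "Funding Year",
}
RENAMES_2019_21 = {
    "Company/Brand": "Company Name",
    "Founded": "Year Founded",
    "HeadQuarter": "Headquarters",
    "Investor": "Investors",
    "Amount($)": "Amount",
    "Stage": "Funding Stage",
    "Year of Funding": "Funding Year",
}

# Hypothetical one-row frames standing in for the cleaned yearly data
toy_18 = pd.DataFrame({"Company Name": ["A"], "Industry": ["Fintech"], "Round/Series": ["Seed"],
                       "Amount": [1000.0], "Location": ["Bangalore"], "Year of Funding": ["2018"]})
toy_19 = pd.DataFrame({"Company/Brand": ["B"], "Founded": ["2015"], "HeadQuarter": ["Mumbai"],
                       "Investor": ["X"], "Amount($)": [2000.0], "Stage": ["Series A"],
                       "Year of Funding": ["2019"]})

combined = pd.concat(
    [toy_18.rename(columns=RENAMES_2018), toy_19.rename(columns=RENAMES_2019_21)],
    ignore_index=True,
)
print(combined)
```

Columns missing from one year (for example `Year Founded` in 2018) are filled with NaN by `pd.concat`, which matches the behaviour of the positional approach while being robust to a change in column order.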

In [None]:
# Converting various columns into appropriate formats
complete_set['Amount'] = complete_set['Amount'].replace(np.nan, 0)
complete_set['Amount'] = complete_set['Amount'].apply(int)
complete_set["Funding Year"] = complete_set["Funding Year"].apply(str)
complete_set["Year Founded"] = complete_set["Year Founded"].apply(str)

# Dropping all duplicates from the combined DataFrame
complete_set.drop_duplicates(inplace = True)
complete_set.reset_index(drop=True, inplace = True)

# Taking a final look at the Complete Set
print(complete_set.nunique(), "\n")
complete_set

In [None]:
# Exporting to have a look at the DataFrame in Excel
complete_set.to_csv("complete_set.csv", index = False)

### 3.8 Checking the integrity of the combined DataFrame

#### 3.8.1 Company Name

In [None]:
unique_companies = (complete_set.loc[:,"Company Name"]).value_counts()
unique_companies

In [None]:
# Capitalizing only the first letters of each entry in the column for normalization
complete_set["Company Name"] = complete_set["Company Name"].apply(lambda x: str(x).capitalize())

# Correcting the misspelt names of startups
complete_set.loc[complete_set["Company Name"] == "Byju", "Company Name"] = "Byju's"

# Reassigning the dataframe and previewing it
unique_companies = (complete_set.loc[:,"Company Name"]).value_counts()
unique_companies.head(10)

In [None]:
unique_companies.head(10).sort_values().plot.barh()

From here, we note that Bharatpe (10), Byju's (10) and Zomato (7) were the startups involved in the most deals over the period. Let's take a closer look at them.

In [None]:
complete_set.loc[(complete_set["Company Name"] == "Bharatpe") | 
                 (complete_set["Company Name"] == "Byju's") | 
                 (complete_set["Company Name"] == "Zomato")]

We note from the above that the top 3 startups involved in the most funding deals are located in 3 different locations and operate in 3 different sectors.
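
The observation above can be double-checked with a small grouped summary. The sketch below uses hypothetical placeholder rows (the locations and sectors are labels, not dataset values) to show the shape of such a check.

```python
import pandas as pd

# Hypothetical placeholder rows; only the company names come from the notebook
toy = pd.DataFrame({
    "Company Name": ["Bharatpe", "Bharatpe", "Byju's", "Zomato", "Zomato"],
    "Headquarters": ["City A", "City A", "City B", "City C", "City C"],
    "Sector": ["Sector X", "Sector X", "Sector Y", "Sector Z", "Sector Z"],
})

# One row per company: number of deals plus its first recorded location and sector
summary = toy.groupby("Company Name").agg(
    deals=("Sector", "count"),
    headquarters=("Headquarters", "first"),
    sector=("Sector", "first"),
)
print(summary)
print(summary["headquarters"].nunique(), summary["sector"].nunique())
```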

#### 3.8.2 Sector

In [None]:
unique_sectors = (complete_set.loc[:,"Sector"]).value_counts()
unique_sectors

In [None]:
# Capitalizing only the first letters of each entry in the Sector column for normalization
complete_set["Sector"] = complete_set["Sector"].apply(lambda x: str(x).capitalize())

# Equating similar entries in the Sector column for ease of analysis
complete_set["Sector"] = complete_set["Sector"].apply(lambda x: str(x).replace(" and ", " & "))
complete_set["Sector"] = complete_set["Sector"].apply(lambda x: str(x).replace("startup", ""))
complete_set["Sector"] = complete_set["Sector"].apply(lambda x: str(x).replace("  ", " "))
complete_set.loc[complete_set["Sector"] == "nan", "Sector"] = "Sector TBD"
# The dash placeholder is matched both as "—" and in its mis-encoded form "â€”"
complete_set.loc[complete_set["Sector"].isin(["—", "â€”"]), "Sector"] = "Sector TBD"
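
The one-line-per-value replacements used for the Sector column can also be written as a dictionary lookup, which keeps each source value in exactly one place. The sketch below is an illustration with a handful of entries taken from this notebook's mapping; `SECTOR_MAP` and `normalise_sector` are names introduced here, and the "startup" removal and "nan" handling are left out for brevity.

```python
# A small excerpt of the sector mapping, written as a dictionary
SECTOR_MAP = {
    "Ad-tech": "Advertising, marketing & sales",
    "Advertisement": "Advertising, marketing & sales",
    "Agritech": "Agriculture & agritech",
    "Agtech": "Agriculture & agritech",
    "Artificial intelligence": "Ai",
}

def normalise_sector(raw, mapping=SECTOR_MAP):
    """Capitalise, tidy separators and whitespace, then map to a canonical sector."""
    text = str(raw).capitalize().replace(" and ", " & ")
    text = " ".join(text.split())  # collapse repeated spaces and trim the ends
    # Values without a mapping are returned unchanged
    return mapping.get(text, text)

print(normalise_sector("AgTech"), "|", normalise_sector("artificial  intelligence"))
```

It could be applied with `complete_set["Sector"].apply(normalise_sector)`. Because a dictionary holds each key only once, two conflicting targets for the same source value cannot both survive unnoticed.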

In [None]:
# Equating similar entries in the Sector column for ease of analysis (2)
complete_set.loc[complete_set["Sector"] == "Accomodation", "Sector"] = "Housing & real estate"
complete_set.loc[complete_set["Sector"] == "Accounting", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Ad-tech", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Advertisement", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Advertising", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Advisory firm", "Sector"] = "Advisory & consultancy"
complete_set.loc[complete_set["Sector"] == "Aeorspace", "Sector"] = "Aviation & aerospace"
complete_set.loc[complete_set["Sector"] == "Aero company", "Sector"] = "Aviation & aerospace"
complete_set.loc[complete_set["Sector"] == "Aerospace", "Sector"] = "Aviation & aerospace"
complete_set.loc[complete_set["Sector"] == "Agri tech", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Agriculture", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Agritech", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Agritech/commerce", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Agtech", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Ai & data science", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai & debt", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai & deep learning", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai & media", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai & tech", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai chatbot", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai company", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai health", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai platform", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai robotics", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Ai", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Air transportation", "Sector"] = "Aviation & aerospace"
complete_set.loc[complete_set["Sector"] == "Alternative medicine", "Sector"] = "Medical"
complete_set.loc[complete_set["Sector"] == "Analytics", "Sector"] = "Data science & analytics"
complete_set.loc[complete_set["Sector"] == "Appliance", "Sector"] = "Appliances & Electronics"
complete_set.loc[complete_set["Sector"] == "Apps", "Sector"] = "Software"
complete_set.loc[complete_set["Sector"] == "Ar platform", "Sector"] = "Ar/vr"
complete_set.loc[complete_set["Sector"] == "Ar", "Sector"] = "Ar/vr"
complete_set.loc[complete_set["Sector"] == "Ar/vr", "Sector"] = "Ar/vr"
complete_set.loc[complete_set["Sector"] == "Artificial intelligence", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Audio", "Sector"] = "Entertainment"
complete_set.loc[complete_set["Sector"] == "Augmented reality", "Sector"] = "Ar/vr"
complete_set.loc[complete_set["Sector"] == "Auto-tech", "Sector"] = "Automation tech"
complete_set.loc[complete_set["Sector"] == "Automation", "Sector"] = "Automation tech"
complete_set.loc[complete_set["Sector"] == "Automobile & technology", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Automobile technology", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Automobile", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Automobiles", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Automotive & rentals", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Automotive and rentals", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Automotive company", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Automotive tech", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Automotive", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Autonomous vehicles", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Aviation", "Sector"] = "Aviation & aerospace"
complete_set.loc[complete_set["Sector"] == "Ayurveda tech", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "B2b agritech", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "B2b e-commerce", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "B2b ecommerce", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "B2b manufacturing", "Sector"] = "Manufacturing"
complete_set.loc[complete_set["Sector"] == "B2b marketplace", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "B2b service", "Sector"] = "B2b"
complete_set.loc[complete_set["Sector"] == "B2b supply chain", "Sector"] = "Logistics & supply chain"
complete_set.loc[complete_set["Sector"] == "B2b travel", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "B2b", "Sector"] = "B2b"
complete_set.loc[complete_set["Sector"] == "Banking", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Battery design", "Sector"] = "Battery"
complete_set.loc[complete_set["Sector"] == "Battery manufacturer", "Sector"] = "Battery"
complete_set.loc[complete_set["Sector"] == "Beauty & wellness", "Sector"] = "Beauty"
complete_set.loc[complete_set["Sector"] == "Beauty products", "Sector"] = "Beauty"
complete_set.loc[complete_set["Sector"] == "Beverage", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Beverages", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Big data", "Sector"] = "Data science & analytics"
complete_set.loc[complete_set["Sector"] == "Bike marketplace", "Sector"] = "Bike services"
complete_set.loc[complete_set["Sector"] == "Bike rental", "Sector"] = "Bike services"
complete_set.loc[complete_set["Sector"] == "Biopharma", "Sector"] = "Pharmaceutical"
complete_set.loc[complete_set["Sector"] == "Biotech", "Sector"] = "Biotechnology"
complete_set.loc[complete_set["Sector"] == "Biotechnology company", "Sector"] = "Biotechnology"
complete_set.loc[complete_set["Sector"] == "Biotechnology", "Sector"] = "Biotechnology"
complete_set.loc[complete_set["Sector"] == "Blockchain", "Sector"] = "Cryptocurrency"
complete_set.loc[complete_set["Sector"] == "Blogging", "Sector"] = "Content services"
complete_set.loc[complete_set["Sector"] == "Brand marketing", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Broadcasting", "Sector"] = "Information services"
complete_set.loc[complete_set["Sector"] == "Business development", "Sector"] = "Advisory & consultancy"
complete_set.loc[complete_set["Sector"] == "Business intelligence", "Sector"] = "Data science & analytics"
complete_set.loc[complete_set["Sector"] == "Business travel", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Cannabis", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Capital markets", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Car service", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Car trade", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Catering", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Children", "Sector"] = "Child care"
complete_set.loc[complete_set["Sector"] == "Classifieds", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "Clean energy", "Sector"] = "Energy"
complete_set.loc[complete_set["Sector"] == "Cleantech", "Sector"] = "Energy"
complete_set.loc[complete_set["Sector"] == "Clothing", "Sector"] = "Apparel & fashion"
complete_set.loc[complete_set["Sector"] == "Cloud company", "Sector"] = "Cloud computing"
complete_set.loc[complete_set["Sector"] == "Cloud infrastructure", "Sector"] = "Cloud computing"
complete_set.loc[complete_set["Sector"] == "Commerce", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Commercial real estate", "Sector"] = "Housing & real estate"
complete_set.loc[complete_set["Sector"] == "Commercial", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Communities", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Community platform", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Community", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Computer games", "Sector"] = "Games"
complete_set.loc[complete_set["Sector"] == "Computer software", "Sector"] = "Software"
complete_set.loc[complete_set["Sector"] == "Construction tech", "Sector"] = "Construction"
complete_set.loc[complete_set["Sector"] == "Consultancy", "Sector"] = "Advisory & consultancy"
complete_set.loc[complete_set["Sector"] == "Consulting", "Sector"] = "Advisory & consultancy"
complete_set.loc[complete_set["Sector"] == "Consumer appliances", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Consumer applications", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Consumer electronics", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Consumer goods", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Consumer lending", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Consumer service", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Consumer services", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Consumer software", "Sector"] = "Software"
complete_set.loc[complete_set["Sector"] == "Consumer", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Content commerce", "Sector"] = "Content services"
complete_set.loc[complete_set["Sector"] == "Content creation", "Sector"] = "Content services"
complete_set.loc[complete_set["Sector"] == "Content management", "Sector"] = "Content services"
complete_set.loc[complete_set["Sector"] == "Content marketplace", "Sector"] = "Content services"
complete_set.loc[complete_set["Sector"] == "Content marktplace", "Sector"] = "Content services"
complete_set.loc[complete_set["Sector"] == "Content publishing", "Sector"] = "Content services"
complete_set.loc[complete_set["Sector"] == "Continuing education", "Sector"] = "Education"
complete_set.loc[complete_set["Sector"] == "Conversational ai platform", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Cooking", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Cosmetics", "Sector"] = "Beauty"
complete_set.loc[complete_set["Sector"] == "Coworking", "Sector"] = "Co-working"
complete_set.loc[complete_set["Sector"] == "Creative agency", "Sector"] = "Arts & crafts"
complete_set.loc[complete_set["Sector"] == "Credit cards", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Credit", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Crm", "Sector"] = "Customer service"
complete_set.loc[complete_set["Sector"] == "Crypto", "Sector"] = "Cryptocurrency"
complete_set.loc[complete_set["Sector"] == "Customer service company", "Sector"] = "Customer service"
complete_set.loc[complete_set["Sector"] == "Cybersecurity", "Sector"] = "Computer & network security"
complete_set.loc[complete_set["Sector"] == "D2c business", "Sector"] = "D2c"
complete_set.loc[complete_set["Sector"] == "D2c fashion", "Sector"] = "Apparel & fashion"
complete_set.loc[complete_set["Sector"] == "D2c jewellery", "Sector"] = "Jewellery"
complete_set.loc[complete_set["Sector"] == "Dairy", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Data analytics", "Sector"] = "Data science & analytics"
complete_set.loc[complete_set["Sector"] == "Data intelligence", "Sector"] = "Data science & analytics"
complete_set.loc[complete_set["Sector"] == "Data science", "Sector"] = "Data science & analytics"
complete_set.loc[complete_set["Sector"] == "Dating app", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Dating", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Deep tech ai", "Sector"] = "Deep Tech"
complete_set.loc[complete_set["Sector"] == "Deep tech", "Sector"] = "Deep Tech"
complete_set.loc[complete_set["Sector"] == "Deeptech", "Sector"] = "Deep Tech"
complete_set.loc[complete_set["Sector"] == "Defense & space", "Sector"] = "Defense"
complete_set.loc[complete_set["Sector"] == "Defense tech", "Sector"] = "Defense"
complete_set.loc[complete_set["Sector"] == "Deisgning", "Sector"] = "Design"
complete_set.loc[complete_set["Sector"] == "Delivery service", "Sector"] = "Logistics & supply chain"
complete_set.loc[complete_set["Sector"] == "Delivery", "Sector"] = "Logistics & supply chain"
complete_set.loc[complete_set["Sector"] == "Dental", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Design", "Sector"] = "Design"
complete_set.loc[complete_set["Sector"] == "Dietary supplements", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Digital entertainment", "Sector"] = "Entertainment"
complete_set.loc[complete_set["Sector"] == "Digital marketing", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Digital media", "Sector"] = "Information services"
complete_set.loc[complete_set["Sector"] == "Digital mortgage", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "E store", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "E-commerce & ar", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "E-commerce platforms", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "E-commerce", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "E-learning", "Sector"] = "Education"
complete_set.loc[complete_set["Sector"] == "E-market", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "E-marketplace", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "E-mobility", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "E-sports", "Sector"] = "Sports"
complete_set.loc[complete_set["Sector"] == "E-tail", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "Ecommerce", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "Edtech", "Sector"] = "Education"
complete_set.loc[complete_set["Sector"] == "Edttech", "Sector"] = "Education"
complete_set.loc[complete_set["Sector"] == "Education management", "Sector"] = "Education"
complete_set.loc[complete_set["Sector"] == "Electricity", "Sector"] = "Energy"
complete_set.loc[complete_set["Sector"] == "Electronics", "Sector"] = "Appliances & Electronics"
complete_set.loc[complete_set["Sector"] == "Emobility", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Energy", "Sector"] = "Energy"
complete_set.loc[complete_set["Sector"] == "Enterprise software", "Sector"] = "Enterprise resource planning (erp)"
complete_set.loc[complete_set["Sector"] == "Environmental consulting", "Sector"] = "Environmental services"
complete_set.loc[complete_set["Sector"] == "Environmental service", "Sector"] = "Environmental services"
complete_set.loc[complete_set["Sector"] == "Equity management", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Esports", "Sector"] = "Sports"
complete_set.loc[complete_set["Sector"] == "Estore", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "Ev", "Sector"] = "Electric vehicle"
complete_set.loc[complete_set["Sector"] == "Events", "Sector"] = "Entertainment"
complete_set.loc[complete_set["Sector"] == "Eye wear", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Eyeglasses", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Eyewear", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Facilities support service", "Sector"] = "Facilities services"
complete_set.loc[complete_set["Sector"] == "Fantasy sports", "Sector"] = "Sports"
complete_set.loc[complete_set["Sector"] == "Farming", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Fashion & lifestyle", "Sector"] = "Apparel & fashion"
complete_set.loc[complete_set["Sector"] == "Fashion tech", "Sector"] = "Apparel & fashion"
complete_set.loc[complete_set["Sector"] == "Fashion", "Sector"] = "Apparel & fashion"
complete_set.loc[complete_set["Sector"] == "Femtech", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Fertility tech", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Finance company", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Finance", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Financial services", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Fintech", "Sector"] = "Fintech"
complete_set.loc[complete_set["Sector"] == "Fishery", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Fitness", "Sector"] = "Personal care"
complete_set.loc[complete_set["Sector"] == "Fmcg", "Sector"] = "Consumer goods & services"
complete_set.loc[complete_set["Sector"] == "Food & bevarages", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food & beverage", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food & beverages", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food & logistics", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food & nutrition", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food & Nutrition", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food & tech", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food delivery", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food devlivery", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food diet", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food industry", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food processing", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food production", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food tech", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Food", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Foodtech & logistics", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Foodtech", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Foootwear", "Sector"] = "Apparel & fashion"
complete_set.loc[complete_set["Sector"] == "Funding platform", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Furniture rental", "Sector"] = "Furniture & Home Decor"
complete_set.loc[complete_set["Sector"] == "Furniture", "Sector"] = "Furniture & Home Decor"
complete_set.loc[complete_set["Sector"] == "Fusion beverages", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Gaming", "Sector"] = "Games"
complete_set.loc[complete_set["Sector"] == "Healtcare", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Health & fitness", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Health & wellness", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Health and fitness", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Health care", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Health diagnostics", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Health insurance", "Sector"] = "Insurance"
complete_set.loc[complete_set["Sector"] == "Health", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Health, wellness & fitness", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Healthcare", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Healthcare/edtech", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Healthtech", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Heathcare", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Heathtech", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Helathcare", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Higher education", "Sector"] = "Education"
complete_set.loc[complete_set["Sector"] == "Home decor", "Sector"] = "Furniture & Home Decor"
complete_set.loc[complete_set["Sector"] == "Home design", "Sector"] = "Furniture & Home Decor"
complete_set.loc[complete_set["Sector"] == "Home interior services", "Sector"] = "Furniture & Home Decor"
complete_set.loc[complete_set["Sector"] == "Home services", "Sector"] = "Furniture & Home Decor"
complete_set.loc[complete_set["Sector"] == "Hospital & health care", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Hospital", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Hospitality", "Sector"] = "Tourism & Hospitality"
complete_set.loc[complete_set["Sector"] == "Housing & rentals", "Sector"] = "Housing & real estate"
complete_set.loc[complete_set["Sector"] == "Housing marketplace", "Sector"] = "Housing & real estate"
complete_set.loc[complete_set["Sector"] == "Housing", "Sector"] = "Housing & real estate"
complete_set.loc[complete_set["Sector"] == "Hr tech", "Sector"] = "Human Resources"
complete_set.loc[complete_set["Sector"] == "Hr", "Sector"] = "Human Resources"
complete_set.loc[complete_set["Sector"] == "Hrtech", "Sector"] = "Human Resources"
complete_set.loc[complete_set["Sector"] == "Human resources", "Sector"] = "Human Resources"
complete_set.loc[complete_set["Sector"] == "Hygiene management", "Sector"] = "Hygiene"
complete_set.loc[complete_set["Sector"] == "Information technology & services", "Sector"] = "Information technology"
complete_set.loc[complete_set["Sector"] == "Insurance tech", "Sector"] = "Insurance"
complete_set.loc[complete_set["Sector"] == "Insurance technology", "Sector"] = "Insurance"
complete_set.loc[complete_set["Sector"] == "Insuretech", "Sector"] = "Insurance"
complete_set.loc[complete_set["Sector"] == "Insurtech", "Sector"] = "Insurance"
complete_set.loc[complete_set["Sector"] == "Interior & decor", "Sector"] = "Furniture & Home Decor"
complete_set.loc[complete_set["Sector"] == "Interior design", "Sector"] = "Furniture & Home Decor"
complete_set.loc[complete_set["Sector"] == "Internet of things", "Sector"] = "IoT"
complete_set.loc[complete_set["Sector"] == "Investment banking", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Investment management", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Investment tech", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Investment", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Iot platform", "Sector"] = "IoT"
complete_set.loc[complete_set["Sector"] == "Iot/automobile", "Sector"] = "IoT"
complete_set.loc[complete_set["Sector"] == "It company", "Sector"] = "It"
complete_set.loc[complete_set["Sector"] == "Job discovery platform", "Sector"] = "Job search"
complete_set.loc[complete_set["Sector"] == "Job portal", "Sector"] = "Job search"
complete_set.loc[complete_set["Sector"] == "Last mile transportation", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Legal services", "Sector"] = "Legal"
complete_set.loc[complete_set["Sector"] == "Legal tech", "Sector"] = "Legal"
complete_set.loc[complete_set["Sector"] == "Legaltech", "Sector"] = "Legal"
complete_set.loc[complete_set["Sector"] == "Logistics", "Sector"] = "Logistics & supply chain"
complete_set.loc[complete_set["Sector"] == "Logitech", "Sector"] = "Logistics & supply chain"
complete_set.loc[complete_set["Sector"] == "Luxury car", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Management consulting", "Sector"] = "Advisory"
complete_set.loc[complete_set["Sector"] == "Market research", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Marketing & advertising", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Marketing & customer loyalty", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Marketing company", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Marketing", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Marketplace", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "Martech", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Matrimony", "Sector"] = "Marriage"
complete_set.loc[complete_set["Sector"] == "Mechanical & industrial engineering", "Sector"] = "Engineering"
complete_set.loc[complete_set["Sector"] == "Mechanical or industrial engineering", "Sector"] = "Engineering"
complete_set.loc[complete_set["Sector"] == "Med tech", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Media & entertainment", "Sector"] = "Media"
complete_set.loc[complete_set["Sector"] == "Media & networking", "Sector"] = "Media"
complete_set.loc[complete_set["Sector"] == "Media and entertainment", "Sector"] = "Media"
complete_set.loc[complete_set["Sector"] == "Media tech", "Sector"] = "Media"
complete_set.loc[complete_set["Sector"] == "Medical device", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Medtech", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Mental health", "Sector"] = "Medicine & healthcare"
complete_set.loc[complete_set["Sector"] == "Micro-mobiity", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Milk", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Mlops platform", "Sector"] = "Machine learning"
complete_set.loc[complete_set["Sector"] == "Mobile games", "Sector"] = "Games"
complete_set.loc[complete_set["Sector"] == "Mobile payments", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Mobility tech", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Mobility", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Mobility/transport", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Music streaming", "Sector"] = "Entertainment"
complete_set.loc[complete_set["Sector"] == "Music", "Sector"] = "Entertainment"
complete_set.loc[complete_set["Sector"] == "Mutual funds", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Nano distribution network", "Sector"] = "Logistics & supply chain"
complete_set.loc[complete_set["Sector"] == "Neo-banking", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Networking platform", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Networking", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "News", "Sector"] = "Information services"
complete_set.loc[complete_set["Sector"] == "Nft marketplace", "Sector"] = "NFT"
complete_set.loc[complete_set["Sector"] == "Nft", "Sector"] = "NFT"
complete_set.loc[complete_set["Sector"] == "Nutrition sector", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Nutrition tech", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Nutrition", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Oil & energy", "Sector"] = "Energy"
complete_set.loc[complete_set["Sector"] == "Online credit management", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Online financial service", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Online games", "Sector"] = "Games"
complete_set.loc[complete_set["Sector"] == "Online media", "Sector"] = "Information services"
complete_set.loc[complete_set["Sector"] == "Online portals", "Sector"] = "Information services"
complete_set.loc[complete_set["Sector"] == "Packaging solution", "Sector"] = "Packaging services"
complete_set.loc[complete_set["Sector"] == "Pet care", "Sector"] = "Animal Care"
complete_set.loc[complete_set["Sector"] == "Pharma", "Sector"] = "Pharmaceutical"
complete_set.loc[complete_set["Sector"] == "Pharmacy", "Sector"] = "Pharmaceutical"
complete_set.loc[complete_set["Sector"] == "Podcast", "Sector"] = "Entertainment"
complete_set.loc[complete_set["Sector"] == "Pollution control equiptment", "Sector"] = "Hygiene"
complete_set.loc[complete_set["Sector"] == "Preschool daycare", "Sector"] = "Education"
complete_set.loc[complete_set["Sector"] == "Professional training & coaching", "Sector"] = "Human Resources"
complete_set.loc[complete_set["Sector"] == "Publication", "Sector"] = "Information services"
complete_set.loc[complete_set["Sector"] == "Real estate", "Sector"] = "Housing & real estate"
complete_set.loc[complete_set["Sector"] == "Real Estate", "Sector"] = "Housing & real estate"
complete_set.loc[complete_set["Sector"] == "Reatil", "Sector"] = "Retail"
complete_set.loc[complete_set["Sector"] == "Recruitment", "Sector"] = "Human Resources"
complete_set.loc[complete_set["Sector"] == "Renewable player", "Sector"] = "Renewable energy"
complete_set.loc[complete_set["Sector"] == "Renewables & environment", "Sector"] = "Renewable energy"
complete_set.loc[complete_set["Sector"] == "Rental space", "Sector"] = "Rentals"
complete_set.loc[complete_set["Sector"] == "Rental", "Sector"] = "Rentals"
complete_set.loc[complete_set["Sector"] == "Retail aggregator", "Sector"] = "Retail"
complete_set.loc[complete_set["Sector"] == "Retail tech", "Sector"] = "Retail"
complete_set.loc[complete_set["Sector"] == "Robotics & ai", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Robotics", "Sector"] = "Ai"
complete_set.loc[complete_set["Sector"] == "Saas platform", "Sector"] = "SAAS"
complete_set.loc[complete_set["Sector"] == " Saas", "Sector"] = "SAAS"
complete_set.loc[complete_set["Sector"] == "Saas/edtech", "Sector"] = "SAAS"
complete_set.loc[complete_set["Sector"] == "SaasÂ Â startup", "Sector"] = "SAAS"
complete_set.loc[complete_set["Sector"] == "Sales & distribution", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Sales & services", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Sales and distribution", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Sanitation solutions", "Sector"] = "Hygiene"
complete_set.loc[complete_set["Sector"] == "Skincare", "Sector"] = "Apparel & fashion"
complete_set.loc[complete_set["Sector"] == "Sles & marketing", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Sles and marketing", "Sector"] = "Advertising, marketing & sales"
complete_set.loc[complete_set["Sector"] == "Social audio", "Sector"] = "Entertainment"
complete_set.loc[complete_set["Sector"] == "Social commerce", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "Social community", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Social e-commerce", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "Social media & communities", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Social media", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Social network", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Social platform", "Sector"] = "Social media & networking"
complete_set.loc[complete_set["Sector"] == "Software company", "Sector"] = "Software"
complete_set.loc[complete_set["Sector"] == "Soil-tech", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Solar monitoring company", "Sector"] = "Solar energy"
complete_set.loc[complete_set["Sector"] == "Solar saas", "Sector"] = "Solar energy"
complete_set.loc[complete_set["Sector"] == "Solar solution", "Sector"] = "Solar energy"
complete_set.loc[complete_set["Sector"] == "Solar", "Sector"] = "Solar energy"
complete_set.loc[complete_set["Sector"] == "Sportstech", "Sector"] = "Sports"
complete_set.loc[complete_set["Sector"] == "Staffing & recruiting", "Sector"] = "Human Resources"
complete_set.loc[complete_set["Sector"] == "Supply chain platform", "Sector"] = "Logistics & supply chain"
complete_set.loc[complete_set["Sector"] == "Supply chain, agritech", "Sector"] = "Logistics & supply chain"
complete_set.loc[complete_set["Sector"] == "Taxation", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Tech company", "Sector"] = "Tech"
complete_set.loc[complete_set["Sector"] == "Tech hub", "Sector"] = "Tech"
complete_set.loc[complete_set["Sector"] == "Tech platform", "Sector"] = "Tech"
complete_set.loc[complete_set["Sector"] == "Technology", "Sector"] = "Tech"
complete_set.loc[complete_set["Sector"] == "Techonology", "Sector"] = "Tech"
complete_set.loc[complete_set["Sector"] == "Telecommuncation", "Sector"] = "Telecommunication"
complete_set.loc[complete_set["Sector"] == "Telecommunications", "Sector"] = "Telecommunication"
complete_set.loc[complete_set["Sector"] == "Textiles", "Sector"] = "Apparel & fashion"
complete_set.loc[complete_set["Sector"] == "Tobacco", "Sector"] = "Agriculture & agritech"
complete_set.loc[complete_set["Sector"] == "Tourism & ev", "Sector"] = "Tourism & Hospitality"
complete_set.loc[complete_set["Sector"] == "Tourism", "Sector"] = "Tourism & Hospitality"
complete_set.loc[complete_set["Sector"] == "Trading platform", "Sector"] = "E-commerce"
complete_set.loc[complete_set["Sector"] == "Training", "Sector"] = "Human Resources"
complete_set.loc[complete_set["Sector"] == "Transport & rentals", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Transport automation", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Transport", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Transportation", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Travel & saas", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Travel tech", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Travel", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Traveltech", "Sector"] = "Travel & Transport"
complete_set.loc[complete_set["Sector"] == "Vehicle repair", "Sector"] = "Automobiles & automotives"
complete_set.loc[complete_set["Sector"] == "Venture capital & private equity", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Venture capital", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Venture capitalist", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Veterinary", "Sector"] = "Animal Care"
complete_set.loc[complete_set["Sector"] == "Video personalization", "Sector"] = "Video communication"
complete_set.loc[complete_set["Sector"] == "Video platform", "Sector"] = "Video communication"
complete_set.loc[complete_set["Sector"] == "Video sharing platform", "Sector"] = "Video communication"
complete_set.loc[complete_set["Sector"] == "Video streaming platform", "Sector"] = "Video communication"
complete_set.loc[complete_set["Sector"] == "Video", "Sector"] = "Video communication"
complete_set.loc[complete_set["Sector"] == "Virtual auditing", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Virtual banking", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Visual media", "Sector"] = "Information services"
complete_set.loc[complete_set["Sector"] == "Vr & saas", "Sector"] = "Ar/vr"
complete_set.loc[complete_set["Sector"] == "Wealth management", "Sector"] = "Financial services"
complete_set.loc[complete_set["Sector"] == "Wedding", "Sector"] = "Marriage"
complete_set.loc[complete_set["Sector"] == "Wellness", "Sector"] = "Personal care"
complete_set.loc[complete_set["Sector"] == "Wine & spirits", "Sector"] = "Food & Nutrition"
complete_set.loc[complete_set["Sector"] == "Yoga & wellness", "Sector"] = "Personal care"
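The rules above can also be kept in a single dictionary and applied in one pass. The sketch below is a minimal illustration on a toy frame, not the notebook's actual data: `toy` and `sector_map` are hypothetical names, and the dictionary holds only a small excerpt of the rules. Because `Series.map` looks each raw label up exactly once, the result does not depend on the order of the rules, which avoids labels being remapped a second time by a later line.

```python
import pandas as pd

# Toy stand-in for complete_set (hypothetical data for illustration only).
toy = pd.DataFrame(
    {"Sector": ["Edttech", "Edtech", "Deisgning", "Fintech", "Food tech"]}
)

# A small excerpt of the rules above, keyed by the raw label.
sector_map = {
    "Edttech": "Education",
    "Edtech": "Education",
    "Deisgning": "Design",
    "Food tech": "Food & Nutrition",
}

# map() returns NaN for labels without a rule; fillna() restores the original label.
toy["Sector"] = toy["Sector"].map(sector_map).fillna(toy["Sector"])
print(toy["Sector"].tolist())
```

Keeping the rules in one dictionary also makes it easier to spot a raw label that has been listed twice with different targets.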

In [None]:
unique_sectors = (complete_set.loc[:,"Sector"]).value_counts()
unique_sectors.head(10)

In [None]:
unique_sectors.head(10).sort_values().plot.barh()
plt.xlabel("Number of Funding Deals")
plt.ylabel("Sector")
plt.title("Funding Deals per Sector")

Here, we note that startups in the education (279), fintech (258), and medicine & healthcare (227) sectors were involved in the most deals over the period, followed at a distance by financial services (166) and e-commerce (158). The top 10 sectors (out of 202), ranked by number of deals, account for about 53% of all deals over the period. It is therefore fair to conclude that funding is concentrated in a few sectors.
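A concentration figure such as the share held by the top sectors can be computed directly from a `value_counts()` result. The sketch below uses made-up counts (`counts` is a hypothetical stand-in for `unique_sectors`) purely to show the arithmetic.

```python
import pandas as pd

# Hypothetical deal counts per sector, standing in for unique_sectors.
counts = pd.Series({"Education": 50, "Fintech": 30, "Retail": 15, "Legal": 5})

top_n = 2
# Share of all deals captured by the top_n sectors, in percent.
share = counts.nlargest(top_n).sum() / counts.sum() * 100
print(f"Top {top_n} sectors: {share:.1f}% of deals")
```

On the real data, the same expression with `unique_sectors` and `top_n = 10` would give the percentage quoted in the text.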

#### 3.8.3 Funding Stage

In [None]:
unique_stages = (complete_set.loc[:,"Funding Stage"]).value_counts()
unique_stages.head(15)

From the preview of the unique stages above, we see that several labels denoting the same stage are counted separately because of differences in capitalization, spacing, and hyphenation. They should therefore be standardized so that analyses based on this column give a truer picture.

In [None]:
# Equating similar entries in the funding stage column for ease of analysis
complete_set["Funding Stage"] = complete_set["Funding Stage"].apply(lambda x: str(x).replace(" Round",""))
complete_set["Funding Stage"] = complete_set["Funding Stage"].apply(lambda x: str(x).replace(" round",""))
complete_set["Funding Stage"] = complete_set["Funding Stage"].apply(lambda x: str(x).replace(" - Series Unknown",""))

In [None]:
# Equating similar entries in the Funding Stage column for ease of analysis (2)
complete_set.loc[complete_set["Funding Stage"] == "$6000000", "Funding Stage"] = "Venture"
complete_set.loc[complete_set["Funding Stage"] == "Debt Financing", "Funding Stage"] = "Debt"
complete_set.loc[complete_set["Funding Stage"] == "Early seed", "Funding Stage"] = "Seed"
complete_set.loc[complete_set["Funding Stage"] == "Fresh Funding", "Funding Stage"] = "Seed"
complete_set.loc[complete_set["Funding Stage"] == "Fresh funding", "Funding Stage"] = "Seed"
complete_set.loc[complete_set["Funding Stage"] == "nan", "Funding Stage"] = "Undisclosed"
complete_set.loc[complete_set["Funding Stage"] == "PE", "Funding Stage"] = "Private Equity"
complete_set.loc[complete_set["Funding Stage"] == "Pre seed", "Funding Stage"] = "Pre-seed"
complete_set.loc[complete_set["Funding Stage"] == "Pre series A1", "Funding Stage"] = "Pre-series A"
complete_set.loc[complete_set["Funding Stage"] == "Pre- series A", "Funding Stage"] = "Pre-series A"
complete_set.loc[complete_set["Funding Stage"] == "Pre-Series A", "Funding Stage"] = "Pre-series A"
complete_set.loc[complete_set["Funding Stage"] == "Pre-Series A1", "Funding Stage"] = "Pre-series A"
complete_set.loc[complete_set["Funding Stage"] == "Seed fund", "Funding Stage"] = "Seed"
complete_set.loc[complete_set["Funding Stage"] == "Seed funding", "Funding Stage"] = "Seed"
complete_set.loc[complete_set["Funding Stage"] == "Seed Funding", "Funding Stage"] = "Seed"
complete_set.loc[complete_set["Funding Stage"] == "Seies A", "Funding Stage"] = "Series A"
complete_set.loc[complete_set["Funding Stage"] == "Series A2", "Funding Stage"] = "Series A"
complete_set.loc[complete_set["Funding Stage"] == "Series I", "Funding Stage"] = "Series A"
complete_set.loc[complete_set["Funding Stage"] == "Pre Series A", "Funding Stage"] = "Pre-series A"
complete_set.loc[complete_set["Funding Stage"] == "Pre series A", "Funding Stage"] = "Pre-series A"
complete_set.loc[complete_set["Funding Stage"] == "Pre series B", "Funding Stage"] = "Pre-series B"
complete_set.loc[complete_set["Funding Stage"] == "Pre series C", "Funding Stage"] = "Pre-series C"
complete_set.loc[complete_set["Funding Stage"] == "Pre-Seed", "Funding Stage"] = "Pre-seed"
complete_set.loc[complete_set["Funding Stage"] == "Pre-series A1", "Funding Stage"] = "Pre-series A"
complete_set.loc[complete_set["Funding Stage"] == "Seed & Series A", "Funding Stage"] = "Series A" # Checked Crunchbase
complete_set.loc[complete_set["Funding Stage"] == "Seed A", "Funding Stage"] = "Series A"
complete_set.loc[complete_set["Funding Stage"] == "Seed Investment", "Funding Stage"] = "Seed"
complete_set.loc[complete_set["Funding Stage"] == "Seed+", "Funding Stage"] = "Seed"
complete_set.loc[complete_set["Funding Stage"] == "Pre-Series B", "Funding Stage"] = "Pre-series B"
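Many of the rules above differ only in capitalization, spacing, or hyphenation, so part of the work can be done with vectorized string methods before any one-to-one mapping. The following is a minimal sketch on a hypothetical `stages` Series. It covers only the "round" suffix, the "Pre-series" prefix, and the casing of "Seed", and is not a replacement for the manual checks (for example, the Crunchbase lookup) noted above.

```python
import pandas as pd

# Hypothetical raw labels showing the kinds of variation seen in the column.
stages = pd.Series(
    ["Pre series A", "Pre- series A", "Pre-Series A", "Seed Round", "seed round", "Series B"]
)

normalized = (
    stages.str.strip()
    # Drop a trailing "round" / "Round".
    .str.replace(r"\s+round$", "", case=False, regex=True)
    # Unify "Pre series", "Pre- series", "Pre-Series" to the "Pre-" prefix.
    .str.replace(r"^pre[\s-]+", "Pre-", case=False, regex=True)
    # Lower-case the "S" that follows the prefix: "Pre-Series" becomes "Pre-series".
    .str.replace(r"^Pre-S", "Pre-s", regex=True)
    # Capitalize a leading lower-case "seed".
    .str.replace(r"^seed", "Seed", regex=True)
)
print(normalized.value_counts())
```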

In [None]:
# Taking a final preview of the revised column data
unique_stages = (complete_set.loc[:,"Funding Stage"]).value_counts()
unique_stages.head(15)

In [None]:
unique_stages = (complete_set.loc[:,"Funding Stage"]).value_counts()
unique_stages.head(10).sort_values().plot.barh()
plt.xlabel("Number of Funding Deals")
plt.ylabel("Funding Stage")
plt.title("Funding Deals per Funding Stage")

We note from the plot above that the largest single group of deals over the period had an undisclosed funding stage. Among the disclosed stages, "Seed" had by far the most deals (924), followed by "Series A" (310) and "Pre-series A" (289), with "Series B" and "Series C" further behind at 134 and 114 deals respectively.

#### 3.8.4 Amounts

In [None]:
unique_amounts = (complete_set.loc[:,"Amount"]).value_counts()
unique_amounts

In [None]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [None]:
# Replacing all nulls (including "nan" strings) with 0
complete_set.loc[complete_set["Amount"] == "nan", "Amount"] = np.nan
complete_set["Amount"] = complete_set["Amount"].fillna(0.00)

# Creating a copy of the DataFrame sorted by Amount
amt_sorting = (complete_set.loc[complete_set["Amount"] != 0]).round(2).sort_values(by = "Amount")
amt_sorting

In [None]:
# Creating a function to find outliers using IQR
def outliers_IQR(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    IQR = q3-q1
    outliers = df[((df<(q1-1.5*IQR)) | (df>(q3+1.5*IQR)))]
    return outliers

outliers = outliers_IQR(complete_set["Amount"])
print("Number of outliers: " + str(len(outliers)), "\n")
print("Outliers percentage of total: " + str((((len(outliers))/len(complete_set["Amount"])*100))) + "%", "\n")
print("Max outlier value: " + str(outliers.max()), "\n")
print("Min outlier value: " + str(outliers.min()), "\n")

Since the outliers represent over 16% of the data available, it would be unwise to remove all of them. Instead, entries at both ends of the sorted amounts will be checked against online sources and corrected where necessary.
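Because missing amounts were replaced with 0 earlier, those zeros take part in the quartile calculation. One possible refinement, shown here only as a sketch on made-up numbers, is to compute the IQR fences on disclosed (non-zero) amounts. The names `iqr_outliers`, `amounts`, and `disclosed` are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Hypothetical amounts; the zeros stand for undisclosed deals.
amounts = pd.Series([0, 0, 0, 1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Restrict the calculation to disclosed amounts before applying the rule.
disclosed = amounts[amounts > 0]
flagged = iqr_outliers(disclosed)
print(f"{len(flagged)} of {len(disclosed)} disclosed amounts flagged")
```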

In [None]:
# Who could be the outliers?
## Looking at the top of the sorted table (the smallest non-zero amounts)
(amt_sorting.loc[amt_sorting["Amount"] > 0]).head(10)

In [None]:
# Correcting the erroneous entries
## SATYA Microcapital (raised ₹725M debt financing from BlueOrchard in 2020 to be converted at 74.1322 INR/USD)
complete_set.loc[(complete_set["Company Name"] == "Satya microcapital") & 
                 (complete_set["Amount"] == 9.00), 
                 ["Funding Stage", "Amount", "Investors"]] = ["Debt", int(725000000/74.1322), "BlueOrchard Finance S A"]

## Ultraviolette Automotive: raised ₹60M Series A funding in 2018 
complete_set.loc[(complete_set["Company Name"] == "Ultraviolette automotive") &
                 (complete_set["Amount"] == 876.00), 
                 ["Funding Stage", "Amount"]] = ["Series A", int(60000000*0.0146)]

## Peel Works received several funding rounds in 2020 that are not in the dataset, so they are added here:
peel_addon = pd.DataFrame({'Company Name': ["Peel works","Peel works","Peel works","Peel works"], 
              'Sector': ["Saas", "Saas", "Saas", "Saas"], 
              'Funding Stage': ["Series D","Series D","Debt","Debt"],
              'Amount': [2840000, 2000000, 1000000, 408000], 
              'Headquarters': ["Mumbai","Mumbai","Mumbai","Mumbai"],
              'Description': [np.nan, np.nan, np.nan, np.nan], 
              'Funding Year': ["2020", "2020", "2020", "2020"],
              'Year Founded': ["2010", "2010", "2010", "2010"], 
              'Founders': ["Sachin Chhabra, Nidhi Ramachandran","Sachin Chhabra, Nidhi Ramachandran","Sachin Chhabra, Nidhi Ramachandran","Sachin Chhabra, Nidhi Ramachandran"],
              'Investors': ["CESC", np.nan, "BlackSoil", "Equanimity Investments"]})

# Dropping the erroneous row and inserting the corrections
complete_set = complete_set.drop([1444], axis = 0) 
complete_set = pd.concat([complete_set,peel_addon], ignore_index=True, axis = 0)

complete_set.loc[(complete_set["Company Name"] == "Peel works")]
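The corrections above convert rupee amounts to dollars in two ways: dividing by an INR-per-USD rate (74.1322) and multiplying by a USD-per-INR rate (0.0146). A small helper can make the direction of the conversion explicit. The function name `inr_to_usd` is hypothetical, and the only rate used is the one quoted in the comment above.

```python
def inr_to_usd(amount_inr: float, inr_per_usd: float) -> int:
    """Convert an INR amount to whole US dollars at the given INR-per-USD rate."""
    return int(amount_inr / inr_per_usd)

# Rate quoted in the SATYA Microcapital comment above (INR per USD).
satya_usd = inr_to_usd(725_000_000, 74.1322)
print(satya_usd)
```

A multiplier such as 0.0146 USD per INR corresponds to an INR-per-USD rate of 1 / 0.0146, so both corrections could be expressed through the same helper.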

In [None]:
## Looking at the bottom of the sorted table (the largest amounts)
(amt_sorting.loc[amt_sorting["Amount"] > 0]).tail(10)

In [None]:
# Boxplot to summarize the amounts
(amt_sorting.loc[amt_sorting["Amount"] > 0]).plot.box()
plt.title("Distribution of Funding Amounts")
(amt_sorting.loc[amt_sorting["Amount"] > 0]).describe()

In [None]:
complete_set.loc[(complete_set["Amount"] == 70000000000.00) |
                (complete_set["Amount"] == 150000000000.00)]

Despite being correct entries, the top 2 outliers Alteria Capital (index 1737) and Reliance Retail Ventures Ltd (index 892) will be dropped to assess their impact on the amounts.

In [None]:
complete_set_no_outliers = complete_set.drop([892, 1737], axis = 0)

In [None]:
# Re-assigning the sorted DataFrame and displaying the revised DataFrame in a boxplot
amt_sorting = (complete_set_no_outliers.loc[complete_set_no_outliers["Amount"] > 0]).round(2).sort_values(by = "Amount")
(amt_sorting.loc[amt_sorting["Amount"] > 0]).plot.box()
(amt_sorting.loc[amt_sorting["Amount"] > 0]).describe()

Comparing the boxplots for the complete set with and without the two major outliers, their effect on the mean is evident: USD 121.27m with the outliers and USD 25.90m without them, a stark difference of USD 95.37m. Even so, the boxplot for the set without these outliers shows two more large outliers pulling the USD 25.90m mean upward, with the next value sitting relatively close to the norm.

As such, the **median** of **USD 3m** (in both cases - with and without outliers) will be used for computations and analysis for the "average" startup. That is to say that the average funding deal over the period was worth **USD 3m**.
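
The reasoning for preferring the median can be illustrated with a minimal sketch. The amounts below are hypothetical and are not taken from the dataset; they only mimic a handful of ordinary deals plus two extreme ones.

```python
import pandas as pd

# Hypothetical deal amounts in USD; the last two mimic extreme outliers
amounts = pd.Series([1e6, 2e6, 3e6, 4e6, 5e6, 70e9, 150e9])

# Mean and median with and without the two largest values
with_outliers = amounts.agg(["mean", "median"])
without_outliers = amounts.iloc[:-2].agg(["mean", "median"])

summary = pd.DataFrame({"with outliers": with_outliers,
                        "without outliers": without_outliers})
print(summary)
```

In this toy case the mean changes by several orders of magnitude when the two extreme values are removed, while the median stays in the same range, which is why the median is the safer description of a typical deal.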

#### 3.8.5 Locations

In [None]:
unique_locations = (complete_set.loc[:,"Headquarters"]).value_counts()
unique_locations

In [None]:
complete_set["Headquarters"] = complete_set["Headquarters"].fillna("Unknown Location").astype(str) # Filling nulls, then casting to string
complete_set["Headquarters"] = complete_set["Headquarters"].replace({
    "nan": "Unknown Location", # Stray "nan" strings are treated as missing too
    "Bengaluru": "Bangalore",
    "Bombay": "Mumbai",
    "Gurugram": "Gurgaon",
    "Mountain View, CA": "California"})

unique_locations = (complete_set.loc[:,"Headquarters"]).value_counts()
unique_locations.head(10)

In [None]:
unique_locations.head(10).sort_values().plot.barh()
plt.xlabel("Number of Funding Deals")
plt.ylabel("City")
plt.title("Funding Deals per City")

From the above, it is seen that Bangalore (916) leads the pack (as the city with most startups involved in deals) by almost twice as much as the next location, Mumbai, which has 471. It also leaves a fair distance between itself and Gurgaon (317), New Delhi (230), Chennai (106), and Pune (104) in that order.

These six cities make up about 75% of the total transactions over the period, thus pointing to partial acceptance of the null hypothesis, which posits that funding to startups is centralized around specific locations and sectors.
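
A concentration figure of this kind can be computed directly rather than by adding counts by hand. The sketch below uses a small hypothetical column in place of `complete_set["Headquarters"]`; the city counts are made up for illustration.

```python
import pandas as pd

# Hypothetical headquarters column standing in for complete_set["Headquarters"]
headquarters = pd.Series(
    ["Bangalore"] * 5 + ["Mumbai"] * 3 + ["Gurgaon"] * 2 + ["Pune", "Jaipur"]
)

# Share of deals per city, largest first
city_share = headquarters.value_counts(normalize=True)

# Combined share of the top-N cities, as a percentage
top_n = 3
top_share = city_share.head(top_n).sum() * 100
print(f"Top {top_n} cities account for {top_share:.1f}% of deals")
```

Applied to the real column with `top_n = 6`, the same two steps should reproduce the share quoted above.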

#### 3.8.6 Description

In [None]:
complete_set["Description"] = complete_set["Description"].fillna("Unknown").astype(str) # Filling nulls with Unknown, then casting to string
complete_set["Description"] = complete_set["Description"].str.replace("  ", " ", regex = False) # Replacing all double spaces
company_descs = (complete_set.loc[:,"Description"]).value_counts()
company_descs
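
A note on removing double spaces: `Series.replace` and `Series.str.replace` behave differently, which matters for this kind of cleaning. The sketch below uses three hypothetical descriptions to show the difference.

```python
import pandas as pd

# Hypothetical descriptions, one containing a double space
descriptions = pd.Series(["Online  grocery delivery", "  ", "EdTech platform"])

# Series.replace matches whole cell values: only the cell equal to "  " changes
whole_value = descriptions.replace("  ", " ")

# Series.str.replace works inside each string: every double space is collapsed
substring = descriptions.str.replace("  ", " ", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

Only the `.str.replace` version cleans the double space inside the first description, so it is the one to use when the goal is to tidy text within each cell.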

#### 3.8.7 Funding Year

In [None]:
complete_set["Funding Year"] = complete_set["Funding Year"].astype(str) # Casting to string so that all years share one type and sort cleanly
funding_year = (complete_set.loc[:,"Funding Year"]).value_counts()
funding_year = funding_year.sort_index()
funding_year

In [None]:
funding_year.plot()
plt.title("Number of Deals per Year")
plt.xlabel("Year of Funding")
plt.ylabel("Number of Deals")

Here, we note that the number of funding deals, despite the big drop in 2019, is increasing: it rose from 525 in 2018 to 1,190 in 2021. This suggests that a new startup has a fairly good chance of landing a funding deal going forward, since the number of deals is on an increasing trajectory.
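
As a rough worked example, the growth implied by the two endpoint counts quoted above can be computed as follows. Only the 2018 and 2021 figures are used, so the annual rate is a smoothed figure that ignores the dip in 2019.

```python
# Deal counts quoted in the text above
deals_2018 = 525
deals_2021 = 1190
years_elapsed = 2021 - 2018

# Overall percentage change between the two endpoints
total_growth_pct = (deals_2021 / deals_2018 - 1) * 100

# Compound annual growth rate implied by the endpoints only
cagr_pct = ((deals_2021 / deals_2018) ** (1 / years_elapsed) - 1) * 100

print(f"Total growth: {total_growth_pct:.1f}%")
print(f"Implied annual growth: {cagr_pct:.1f}%")
```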

What about the amounts? How have the total amounts invested changed over the period?

In [None]:
# Let's look at how much the average startup raised in each funding year
funding_set = complete_set.groupby("Funding Year").Amount.agg(["count","sum", "mean", "median"])
funding_set

In [None]:
sns.barplot(x = funding_set.index, y = funding_set["sum"])
plt.title("Total Amount of Funding per Year")
plt.ylabel("Total Amount of Funding")
plt.xlabel("Year of Funding")

We note that in line with the increasing number of deals over the period, total amounts invested have been increasing over the period, with investments in 2021 having the highest monetary value.

We must not ignore the fact that, despite 2018 having a higher number of transactions than 2019, the average amount per deal in 2018 (USD 12.7m) was significantly less than that of 2019 (USD 37.5m).

#### 3.8.8 Year Founded

In [None]:
((complete_set.loc[:,"Year Founded"]).value_counts())[1:]

In [None]:
# Making final touches to the column for further analysis
complete_set["Year Founded"] = complete_set["Year Founded"].replace("-", np.nan) # Treating dashes as missing values
complete_set["Year Founded"] = complete_set["Year Founded"].apply(lambda x: str(x).replace(".0","")) # Casting to string and removing the .0 attached
year_founded = ((complete_set.loc[:,"Year Founded"]).value_counts())[1:] # Excluding the rows with "nan"
year_founded = year_founded.sort_index() # Showing trend of startups founded since 2000
year_founded

In [None]:
# Taking a look at which "startups" were founded earlier than 2000
pre_2000_years = ["1963", "1973", "1978", "1982", "1984", "1989",
                  "1991", "1993", "1994", "1995", "1998", "1999"]
complete_set.loc[complete_set["Year Founded"].isin(pre_2000_years)].sort_values(by = "Year Founded")

In [None]:
# After crosschecking the internet for the founding years of the startups, some mistakes in the data were noted and will be corrected accordingly:
## Arya was founded in 2013
complete_set.loc[complete_set["Company Name"] == "Arya", "Year Founded"] = "2013"

# CreditWise Capital was founded in 2018
complete_set.loc[complete_set["Company Name"] == "Credit wise capital", "Year Founded"] = "2018"

# Lendingkart was founded in 2014
complete_set.loc[complete_set["Company Name"] == "Lendingkart", "Year Founded"] = "2014"

# Nobel hygiene was founded in 2001
complete_set.loc[complete_set["Company Name"] == "Nobel hygiene", "Year Founded"] = "2001"

In [None]:
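# Recomputing the counts so that the corrections above are reflected in the plot
year_founded = complete_set["Year Founded"].value_counts().drop("nan", errors = "ignore").sort_index()
year_founded.plot()
plt.title("Number of Deals for Startups per Founding Year")
plt.xlabel("Founding Year")
plt.ylabel("Number of Transactions")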

Here, we note that newer startups are generally involved in more funding deals, with an overall increase for startups founded since 2000. The low value for 2021 may imply lower chances of landing a funding deal in the same year in which a startup is founded. As such, it is fair to say that other sources of finance may be needed in the first year before actively seeking a funding deal down the line.

In [None]:
# Let's look at how much the average startup raised based on their founding year
year_set = complete_set.groupby("Year Founded").Amount.agg(["count","sum", "mean", "median"])
year_set

In [None]:
plt.figure(figsize = (18,9))
sns.barplot(x = year_set.index, y = year_set["sum"])
plt.xlabel("Year Founded")
plt.ylabel("Amount Raised")
plt.title("Total Funding Raised by Startups based on Founding Year")

Here again, we note that startups founded in 2018 raised the most funds over the period, followed by those founded in 2006 and then by other startups founded after 2009. This further reinforces the earlier observation that startups may have to find alternative sources of funding during the early stages, and look out for funding opportunities with time and growth.

#### 3.8.9 Founders

In [None]:
complete_set["Founders"] = complete_set["Founders"].str.split(',').str[0] # Selecting the first name as the primary founder
complete_set["Founders"] = complete_set["Founders"].fillna("Unknown Founder").astype(str) # Filling nulls with "Unknown Founder" and casting to string
complete_set["Founders"] = complete_set["Founders"].str.replace("  ", " ", regex = False) # Removing any double spaces
founders = (complete_set.loc[:,"Founders"]).value_counts()
founders.head(10)

In [None]:
founders[1:].head(10).sort_values().plot.barh() # Plotting while excluding the "Unknown Founders"
plt.xlabel("Number of Fundings Received")
plt.ylabel("Name of Founder")
plt.title("Number of Known Fundings per Founder")

In [None]:
# Taking a look at the deals involving the founders with the most funding deals
top_founders = ["Byju Raveendran", "Ashneer Grover", "Saurabh Saxena",
                "Gaurav Munjal", "Deepinder Goyal", "bhavish Aggarwal"]
complete_set.loc[complete_set["Founders"].isin(top_founders)]

Maybe beside the point, but we note here that Byju Raveendran's startups were involved in the most deals over the period (12). He is followed by Ashneer Grover (10), with Saurabh Saxena (7) not far behind.

#### 3.8.10 Investors

In [None]:
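complete_set["Investors"] = complete_set["Investors"].str.split(',').str[0] # Selecting the first listed investor
complete_set["Investors"] = complete_set["Investors"].fillna("Undisclosed").astype(str) # Filling nulls with "Undisclosed" and casting to string
complete_set["Investors"] = complete_set["Investors"].str.replace("  ", " ", regex = False) # Removing any double spaces
complete_set["Investors"] = complete_set["Investors"].replace("$Undisclosed", "Undisclosed") # Fixing a mislabelled entry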
investors = (complete_set.loc[:,"Investors"]).value_counts()
investors

In [None]:
investors[1:].head(10).sort_values().plot.barh() # Filtering the top 10 investors by investment while excluding the "Undisclosed"
plt.xlabel("Number of Investments Made")
plt.ylabel("Name of Investor")
plt.title("Number of Known Investments per Investor")

Despite filtering out the 627 unknown investors (data not available in the dataset), we note a total of about 1,245 different investors over the period.

We also note that Venture Catalysts (52) and Inflection Point Ventures (42) made the most investments, by number of transactions, over the period. They are followed by Sequoia Capital India (25) and Tiger Global (24).

Next up, we take a look at the amounts involved in the deals by the various investors.

In [None]:
investors_set = (complete_set.groupby(by = "Investors").Amount.agg(["count","sum", "mean", "median"]).sort_values(by = "sum", ascending = False))[1:]
investors_set

In [None]:
sns.barplot(x = (investors_set["sum"])[:15], y = (investors_set.index)[:15])
plt.xlabel("Total Value of Investments")
plt.ylabel("Investor")
plt.title("Value of the Total Investments by Investor")

Even with the foregoing, we note that Silver Lake's single USD 70bn deal tops the list of investments by value. It is followed at a distance by Salesforce Ventures with 2 transactions summing up to USD 3bn, Tiger Global (24 transactions) with USD 2.356bn, Facebook (2 transactions) with USD 2.31bn, and then General Atlantic (7 transactions) with USD 1.65bn.

This creates the impression that a high number of deals does not necessarily imply a high value of investments, since only one of the top four investors by number of deals featured in the top 5 by total value of investments. Shall we attempt to confirm? Yes.

In [None]:
# Scatterplot to visualize the relationship between count of investment deals and total value of investments
plt.figure(figsize = (20,12))
sns.regplot(y = (investors_set["sum"]), x = (investors_set["count"]))
#sns.lineplot(y = (investors_set["sum"]), x = (investors_set["count"]))
plt.xlabel("Number of Investment Deals")
plt.ylabel("Total Value")
plt.title("Total Value of Investment Deals by Investor")

From the plot, there appears to be no clear relationship between the number of investment deals and the monetary value of investments. It is however noteworthy that, aside from Tiger Global (24), investors with fewer than 10 transactions seem to have made more investments by value.
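
A visual reading of a scatterplot can be backed by a correlation coefficient. The sketch below uses a hypothetical per-investor table in place of `investors_set` (the investor labels and figures are made up). It computes the Pearson correlation and a rank-based (Spearman) correlation, obtained here by ranking both columns and then applying Pearson.

```python
import pandas as pd

# Hypothetical per-investor summary standing in for investors_set
toy_investors = pd.DataFrame(
    {"count": [24, 7, 2, 2, 1, 1],
     "sum": [2.3e9, 1.6e9, 3.0e9, 2.3e9, 70e9, 1.0e9]},
    index=["A", "B", "C", "D", "E", "F"],
)

# Pearson correlation measures linear association and is sensitive to outliers
pearson_r = toy_investors["count"].corr(toy_investors["sum"])

# Spearman correlation is Pearson correlation on ranks, so it is more robust
spearman_r = toy_investors["count"].rank().corr(toy_investors["sum"].rank())

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```

Values close to zero on the real data would support the reading of the plot. Given the extreme outliers in the amounts, the rank-based figure is probably the more informative of the two.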

In [None]:
# Exploring the monetary value of the average investment deal by an investor
investors_set = investors_set.sort_values(by = "median", ascending = False)
investors_set.head(10)

In [None]:
# Due to its outlying effect, Silver Lake will be excluded from the visualization to have a better look at the others
sns.barplot(x = (investors_set["mean"])[1:16], y = (investors_set.index)[1:16])
plt.xlabel("Average Value of Investments")
plt.ylabel("Investor")
plt.title("Average Value of Investments by Investor")

Here, we note that Salesforce Ventures (USD 1.5bn per deal) ranks as the biggest investor per deal after Silver Lake (USD 70bn for a single deal). Facebook (USD 1.15bn per deal), Canaan Valley Capital (USD 1bn for a single deal), and Carmignac (USD 800m for a single deal) follow in that order.

We note that the investors who made the biggest investments were each involved in fewer than 5 deals over the period. This implies that any startup that intends to get funding from one of these investors has to get its work right. Given that these investors have invested in a variety of sectors, it is difficult to conclude that a startup is likely to land a funding deal with them simply by operating in a particular sector.

We must however not rule out the fact that the education sector had the most funding deals over the period and also tops this list (with the tourism & hospitality industry) as the most favoured sector (on average) by the top investors. We can therefore advise that, other things being equal, a startup that intends to get higher than average funding can consider the education sector as its first choice.

In [None]:
top_investors = ["Silver Lake", "Salesforce Ventures", "Facebook", "Canaan Valley Capital",
                 "Carmignac", "MyPreferred Transformation", "SoftBank Vision Fund",
                 "South Africa’s Naspers Ventures", "Twitter Ventures",
                 "Marquee international institutional investors"]
which_sectors = complete_set.loc[complete_set["Investors"].isin(top_investors)]
print(which_sectors["Sector"].value_counts(), "\n")
which_sectors

# 4.0 More Analyses & Visualizations

In [4]:
complete_set

In [None]:
complete_set.loc[complete_set["Amount"] > 0.00]

## Questions

### 4.1 Which start-ups are found in the capital city (New Delhi)?

In [None]:
start_up_found_in_New_delhi = complete_set[complete_set.Headquarters == "New Delhi"]
start_up_found_in_New_delhi

### 4.2 How is funding distributed across deals by location?

In [None]:
by_location = complete_set.groupby(by = "Headquarters").Amount.agg(["count", "sum", "mean", "median"]).sort_values(by = "sum", ascending = False)
by_location.head(10)

In [None]:
sns.barplot(y = (by_location.index)[:10], x = (by_location["sum"])[:10])
plt.xlabel("Sum of Amounts Invested")
plt.ylabel("Location")
plt.title("Distribution of Investments per City")

In [None]:
# What is the sum total of investments over the period?
total_investments = complete_set.Amount.sum()
print("The sum total of investments over the period was USD " + str(total_investments), "\n")

In [None]:
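# Pulling the totals from by_location rather than hard-coding them
mumbai_percent = (by_location.loc["Mumbai", "sum"]/total_investments)*100
bangalore_percent = (by_location.loc["Bangalore", "sum"]/total_investments)*100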

print("Mumbai-based startups took up " + str(mumbai_percent) + "% of total amounts invested over the period.", "\n", 
      "Bangalore-based startups followed with " + str(bangalore_percent) + "% of total amounts invested over the period.", "\n")

Looking at the total value of deals by location, Mumbai leads with USD 230bn from 471 deals, averaging about USD 490m per deal. Bangalore (916 deals) follows at a distance with USD 24bn at USD 26m per deal. The top 5 is completed by Gurgaon-headquartered startups (317 deals) receiving USD 6.9bn (USD 21.9m average), New Delhi-headquartered startups (230 deals) receiving USD 3.4bn (USD 14.9m average), and California-based startups (6 deals) receiving USD 3bn in funding at an average of USD 513m.

We also note that startups headquartered in Mumbai and Bangalore alone got a cumulative 91% of the total funding over the period. This matches the null hypothesis that funding to startups is centralized around specific locations. *Based on this, the team may be advised to consider Mumbai as the first choice for the headquarters of the potential startup.*

### 4.3 Does number of deals translate into funding for the sectors?

In [None]:
# Recap of the top 10 sectors by number of deals
unique_sectors.head(10)

In [None]:
sector_wise = complete_set.groupby(by = "Sector").Amount.agg(["count", "sum", "mean", "median"]).sort_values(by = "sum", ascending = False)
sector_wise.head(10)

In [None]:
# Visualizing the top 10 sectors by total funding received by their startups
sns.barplot(y = sector_wise[:10].index, x = (sector_wise["sum"])[:10])
plt.xlabel("Total Value of Investment")
plt.ylabel("Sector")
plt.title("Total Value of Investments by Sector")

In [None]:
# Total percentage of funding to the Fintech and Retail sectors
fintech_percent = (154792358725.00/total_investments)*100
retail_percent = (70550380000.00/total_investments)*100
fintech_percent+retail_percent

In [None]:
# Visualizing the median amounts invested per sector
sns.barplot(y = sector_wise[:9].index, x = (sector_wise["median"])[:9]) # Excluding the "Multinational conglomerate company"
plt.xlabel("Median Value of Investment (in millions)")
plt.ylabel("Sector")
plt.title("Median Value of Investments by Sector")

Given that 7 out of the 10 sectors whose startups had the most deals were also in the top 10 sectors by funding received, we can conclude that the number of deals has a positive relationship with the amount of funding. The Fintech sector led this time, receiving a total of USD 154bn with USD 5m as the median amount. Retail (USD 70bn total) came second with a USD 4m median, followed by education (USD 6bn total) with a USD 1m median, e-commerce (USD 4.7bn total) with a USD 3.85m median, and food & nutrition (USD 5bn total).

From the second graph, we note that by total amounts invested, the Fintech sector (55%) and the Retail sector (25%) make up 80% of total funding received. By median value of investments, Fintech still leads the race, followed by Retail, E-commerce, Financial Services, and Automobiles and Automotives, in that order.

These add to the assertion that funding is centralized around specific sectors.
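
Both the overlap between the two rankings and each sector's share of total funding can be derived from a single grouped table. The sketch below uses a hypothetical set of deals in place of `complete_set`; the sector names and amounts are illustrative only.

```python
import pandas as pd

# Hypothetical deals standing in for complete_set
toy_deals = pd.DataFrame({
    "Sector": ["Fintech", "Fintech", "Retail", "Edtech", "Edtech", "Edtech", "Agritech"],
    "Amount": [150.0, 5.0, 70.0, 2.0, 3.0, 1.0, 0.5],
})

by_sector = toy_deals.groupby("Sector")["Amount"].agg(["count", "sum"])

# Top-N sectors by number of deals and by total funding
top_n = 2
top_by_count = set(by_sector["count"].nlargest(top_n).index)
top_by_sum = set(by_sector["sum"].nlargest(top_n).index)

# Sectors that rank in the top N on both measures
overlap = top_by_count & top_by_sum

# Each sector's share of total funding, as a percentage
share_pct = by_sector["sum"] / by_sector["sum"].sum() * 100

print(sorted(overlap))
print(share_pct.round(1).sort_values(ascending=False))
```

Running the same steps on the real data with `top_n = 10` would give the overlap count and the sector percentages without any hard-coded totals.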

### 4.4 What is the average amount of funding for start-ups in: the sector with most funding, and the location with most funding?

From earlier analyses, we know that the sector with the most funding is the Fintech sector with a total of USD 154bn, an average of USD 597m per deal, and USD 5m per deal median. For the location with most funding, we learnt that it was Mumbai with a total of USD 230bn, an average of USD 490m per deal and a USD 2m median.

In [None]:
# Reminding ourselves of the summary stats of the amounts involved in deals over the period.
complete_set["Amount"].describe()

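From the description of the Amount column, we learn that the mean deal over the period was worth USD 97m, while the median deal was worth USD 1.6m. These figures differ from the USD 121.27m mean and USD 3m median noted earlier, which were computed only on deals with amounts above zero.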

Great. So how much did startups at the intersection of Mumbai and the Fintech sector receive per deal, and how does it compare to the sector, location, and general averages?

In [None]:
# Extracting data for deals involving startups in the Fintech sector and headquartered in Mumbai
max_intersection = complete_set.loc[(complete_set["Headquarters"] == "Mumbai") & 
                (complete_set["Sector"] == "Fintech")]
max_intersection

In [None]:
print(max_intersection["Amount"].sum(), "\n")
max_intersection.describe()

First we note that startups that fall within the intersection of Mumbai and Fintech were involved in a total of 48 deals worth a cumulative USD 150bn over the period.

Second, we note that each deal had an average worth of about USD 3bn, and a median of USD 3.8m. This implies that startups in the Fintech sector and headquartered in Mumbai receive more funding per deal on average. The median per deal is second to the median funding per deal of the Fintech sector. Based on these, it is safe to say that the intersection of Fintech and Mumbai is a sweet spot anyone who intends to venture into the Indian startup space may want to explore.

The above data is summarized in the table below.

In [None]:
data = {"Grouping":["General","Fintech sector","Mumbai-based","Fintech & Mumbai"],
       "Mean Funding":["97,892,761.19", "597,653,894.69", "490,046,386.38", "3,131,643,743.75"],
       "Median Funding":["1,600,000.00", "5,000,000.00", "2,000,000.00", "3,868,500.00"]}
data_summarized = pd.DataFrame(data)
data_summarized
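
A summary like this can also be computed from the data instead of being typed in, which keeps it in sync if the cleaning steps change. This is a minimal sketch on hypothetical deals: the `toy_deals` frame and its values are assumptions standing in for `complete_set`.

```python
import pandas as pd

# Hypothetical deals standing in for complete_set
toy_deals = pd.DataFrame({
    "Headquarters": ["Mumbai", "Mumbai", "Bangalore", "Mumbai", "Bangalore"],
    "Sector": ["Fintech", "Retail", "Fintech", "Fintech", "Edtech"],
    "Amount": [100.0, 10.0, 20.0, 300.0, 5.0],
})

is_mumbai = toy_deals["Headquarters"] == "Mumbai"
is_fintech = toy_deals["Sector"] == "Fintech"

# One boolean mask per grouping in the summary table
groupings = {
    "General": pd.Series(True, index=toy_deals.index),
    "Fintech sector": is_fintech,
    "Mumbai-based": is_mumbai,
    "Fintech & Mumbai": is_fintech & is_mumbai,
}

# Mean and median amount for each grouping, one row per grouping
summary = pd.DataFrame({
    name: toy_deals.loc[mask, "Amount"].agg(["mean", "median"])
    for name, mask in groupings.items()
}).T
summary.columns = ["Mean Funding", "Median Funding"]
print(summary)
```

Keeping the values numeric, rather than as formatted strings, also allows the table to be sorted or plotted later.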

### 4.5 How does the breakdown by stages of funding look?

In [None]:
stage_wise = complete_set.groupby(by = "Funding Stage").Amount.agg(["count", "sum", "mean", "median"]).sort_values(
    by = "sum", ascending = False)
stage_wise

In [None]:
# Plotting the top 10 stages/rounds of funding in terms of amounts invested
sns.barplot(y = stage_wise[:10].index, x = (stage_wise["sum"])[:10])
plt.xlabel("Total Amount Invested")
plt.ylabel("Funding Stage")
plt.title("Total Amount Invested per Funding Stage/Round")

As noted earlier, a large portion of the funding stages was undisclosed. Among those that were disclosed, debt (61 deals) attracted the most funding by value, with a total investment of USD 150bn (USD 2bn average). It was followed at a distance by Series C (USD 5bn), Series B (USD 3.8bn), and Series D (USD 3.6bn), averaging USD 47m, USD 28m, and USD 69m per deal respectively.

The Seed round, which had the highest number of deals among the disclosed stages, had about USD 1.5m per deal, which should be good enough to hold a startup down for some time as it focuses on survival and other funding sources.

### 4.6 Which startups were most favoured by investors?

In [None]:
company_wise = complete_set.groupby(by = "Company Name").Amount.agg(["count", "sum", "mean"]).sort_values(by = "sum",
                                                                                                         ascending = False)
company_wise.head(10)

Looking at the number of deals, the most favoured startups were Byju's (10), Bharatpe (10) and Zomato (7). On the other hand, when we assess by the amount of funding received, Alteria Capital (USD 150bn), Reliance retail ventures ltd (USD 70bn), and Snowflake (USD 3bn) were the investors' favorites. Byju's, Zomato, and Oyo make the top 10 in both assessments implying that they are well favoured by investors both by number of deals and by amounts received.

# 5.0 Conclusions and Recommendations

### 5.1 The Hypothesis

With the above in mind, it is safe to accept the null hypothesis, given that funding to startups over the period was centralized around particular locations (Mumbai and Bangalore) and particular sectors (Fintech and Retail).

### 5.2 Advice to the Team

- Prioritize the fintech, retail, and education sectors for further studies into the possible establishment of the startup.
- Prioritize Bangalore, Mumbai and Gurgaon as the possible locations for the startup.
- The largest source of funding by value is debt, which may come with some conditions. It is followed by Series C and Series B. Even so, a typical startup gets about USD 1.5m at the Seed round.
- Seek alternative sources of funding during the early years, as you target survival and growth, as funding usually comes in after a year or two of establishment.