# FUNDING ANALYSIS FOR INDIAN STARTUPS

#### Team: Team Namibia

## Table of Contents


[**Step 1: Business Understanding**](#Step-1:-Business-Understanding)

[**Step 2: Data Understanding**](#Step-2:-Data-Understanding)

- [**Load Data**](#Load-Data)
- [**Data Quality**](#Check-Data-Quality)
- [**Exploratory Data Analysis-EDA**](#Exploratory-Data-Analysis---EDA)
 

## Step 1: Business Understanding
Team Namibia is trying to venture into the Indian start-up ecosystem. As the data expert of the team, we are to investigate the ecosystem and propose the best course of action.

#### Problem Statement:
Ideas, creativity, and execution are essential for a start-up to flourish. But are they enough? Investors provide start-ups and other entrepreneurial ventures with the capital---popularly known as "funding"---to think big, grow rich, and leave a lasting impact.

In this project we are investigating the dynamics of startup funding in India over the period from 2018 to 2021. The aim is to understand the trends, sector preferences, investment stages, key investors, and funding Patterns. Additionally, if there have been significant differences in funding amounts across different years and sectors, it can guide the action plan to be taken.

#### Objective
In this analysis we will provide insights into the startup funding landscape in India from 2018 to 2021 by: 
- Identifying trends and patterns in funding amounts over the years.
- Determining which sectors received the most funding and how sector preferences changed over time.
- Understanding the distribution of funding across different stages of startups (e.g., Seed, Series A).
- Identifying key investors and their investment behaviors.
- Analyzing the geographical distribution of funding within India.

#### Analytical Questions
1. What are the trends and patterns in funding amounts for startups in India between 2018 to 2021?
   - Analyzing the annual and quarterly trends in funding can reveal patterns and growth trajectories. Look for peaks, dips, and any consistent growth patterns over these years.(Amount, Year funded)
2. Which sectors received the most funding, and how did sector preferences change over time from 2018 to 2021?
   - Identifying which industries or sectors received the most funding can show sectoral preferences and shifts. Understanding how this distribution has evolved over the years can highlight emerging trends and declining interests. (industry, amount, year funded)
3. How is the distribution of funding across different stages of startups (e.g., Seed, Series A)?
   - Analyzing the funding amounts at different startup stages can provide insights into the investment appetite at various growth phases. It can also help in understanding the maturity and risk preference of investors. (stages, Amount)
4. Who are the key investors in Indian startups, and what are their investment behaviors/patterns?
   - Identifying the most active investors and analyzing their investment portfolios can shed light on key players in the ecosystem. Understanding their investment patterns can also reveal strategic preferences and alliances.(Investor, amount, industry, stages)
5. What is the geographical distribution of startup funding within India, and how has this distribution changed over the years 2018 to 2021?
   - Analyzing the geographical distribution of startup funding can show regional hotspots for entrepreneurship and investment. Observing how this has changed over the years can reveal shifts in regional focus and development.(location, year_funded, amount)

# Step 2: Data Understanding

The data from 2018 is obtained from GitHub in csv format, 2019 data is obtained from google drive in csv format and 2020 to 2021 data is obtained from an SQL database.

## Load Data

#### Install pyodbc and python-dotenv if necessary

In [249]:
# %pip install pyodbc  
# %pip install python-dotenv 

#### Importing the necessary packages 

In [1]:
# Import the pyodbc library to handle ODBC database connections
import pyodbc 

# Import the dotenv function to load environment variables from a .env file
from dotenv import dotenv_values 

# Import the pandas library for data manipulation and analysis
import pandas as pd 

# Import the warnings library to handle warning messages
import warnings

# Filter out (ignore) any warnings that are raised
warnings.filterwarnings('ignore')

# Import the numpy library for data manipulation and analysis
import numpy as np


#### Establishing a connection to the SQL database

In [3]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the .env file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("UID")
password = environment_variables.get("PWD")

# Create the connection string using the retrieved credentials
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"


#### Load 2020 & 2021 data

In [4]:
            #----------Load 2020 data----------
# Establish a connection to the database using the connection string
connection = pyodbc.connect(connection_string) 

# Define the SQL query to select all columns from the specified table
query = "Select * from dbo.LP1_startup_funding2020"

# Execute the SQL query and fetch the result into a pandas DataFrame using the established database connection
df_2020 = pd.read_sql(query, connection)


In [5]:
           #----------Load 2021 data----------
# Establish a connection to the database using the connection string
connection = pyodbc.connect(connection_string)

# Define the SQL query to select all columns from the specified table
query1 = "Select * from dbo.LP1_startup_funding2020"

# Execute the SQL query and fetch the result into a pandas DataFrame using the established database connection
df_2021 = pd.read_sql(query1, connection)


#### Load 2018 & 2019 data

In [6]:
# Load 2018
df_2018 = pd.read_csv(r'C:\Users\Pc\Desktop\Data analysis\Azubi Africa\Career Accelerator\Indian-Start-up-Funding-Analysis\Dataset\startup_funding2018.csv')

# Load 2019
df_2019 = pd.read_csv(r'C:\Users\Pc\Desktop\Data analysis\Azubi Africa\Career Accelerator\Indian-Start-up-Funding-Analysis\Dataset\startup_funding2019.csv')



In [7]:
print(data_2018.columns)
print(data_2019.columns)
print(data_2020.columns)
print(data_2021.columns)

NameError: name 'data_2018' is not defined

#### Rename the columns & Save all the data in one DataFrame

In [8]:
# Rename 2018 column: 'Round/Series' to 'Funding Stage'
df_2018 = df_2018.rename(columns = {'Round/Series': 'Funding Stage'})

# Rename 2019 columns
df_2019 = df_2019.rename(columns = {'Company/Brand': 'Company Name', 'Sector': 'Industry', 'Stage': 'Funding Stage', 'Amount($)': 'Amount', 'HeadQuarter': 'Location', 'What it does': 'About Company', 'Founded': 'Year Founded'})

# Rename 2020 columns
df_2020 = df_2020.rename(columns = {'Company_Brand': 'Company Name', 'Sector': 'Industry', 'Stage': 'Funding Stage', 'HeadQuarter': 'Location', 'What_it_does': 'About Company', 'Founded': 'Year Founded'})

# Rename 2021 columns
df_2021 = df_2021.rename(columns = {'Company_Brand': 'Company Name', 'Sector': 'Industry', 'Stage': 'Funding Stage', 'HeadQuarter': 'Location', 'What_it_does': 'About Company', 'Founded': 'Year Founded'})



In [9]:
# Add a column to each DataFrame to indicate the year
df_2018['Year Funded'] = 2018
df_2019['Year Funded'] = 2019
df_2020['Year Funded'] = 2020
df_2021['Year Funded'] = 2021

# Concatenate all DataFrames into one master DataFrame
df = pd.concat([df_2018, df_2019, df_2020, df_2021], ignore_index=True)


# Print out the new DataFrame to confirm the combination was done correctly
df.head()

Unnamed: 0,Company Name,Industry,Funding Stage,Amount,Location,About Company,Year Funded,Year Founded,Founders,Investor,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,2018,,,,


## Data Quality

- The columns: 'column10', 'Founders' & 'Year Founded' must be removed as they do not help answer our questions.
- Review duplicates.
- The "Company Name" column does not have any may concerns except a few names with ".com", ".ai", ".AI", ".sh" and "+" present.
- In the "Industry" column:
    - "—" must be investgated further using the "About Company" column in order to fill it with the right data.
    - There are companies with multiple industries in a single row, we need to keep only one and remove the rest.
    - We can find the unique values in the column and categorize them under specific industries using "Regular Expression Models".
- There is a link in the "Funding Stage" column at index 178.
    - There is a link as "Funding Stage" 
    - Investigate "NaN" present.
    - Investigate "Undisclosed"
    - Same names are presented differently.
- The "Amount" column (Prescence of "₹", "$", "—" and "Undisclosed"in the column):
    - Extract "₹" to a new column
    - Investigate "NaN"
    - Replace "₹", "$" and "—" with "" in the column
    - Investigate "Undisclosed"
    - Convert the dtype of the column to Int64 as it is in the wrong format.
- In the "Location" column:
    - Investigate "India, Asia"
    - Investigate "NaN"
    - Split the column, keep the 1st one(containing cities), 
        - Join it to the main dataframe "df"
        - Delete the "Location"
        - Rename the newly joined to "Location"
- In the "Investor" column:
    - Investigate "NaN"
    - Split the column

### Data Review

In [10]:
df.shape

(2725, 11)

In [11]:
df.isnull().sum()

Company Name        0
Industry           31
Funding Stage     974
Amount            508
Location          207
About Company       0
Year Funded         0
Year Founded      981
Founders          553
Investor          602
column10         2721
dtype: int64

In [12]:
# Checking columns
df.columns

Index(['Company Name', 'Industry', 'Funding Stage', 'Amount', 'Location',
       'About Company', 'Year Funded', 'Year Founded', 'Founders', 'Investor',
       'column10'],
      dtype='object')

Remove unwanted columns & checking data info

In [13]:
# Dropping unwanted columns
df = df.drop(columns=['column10','Founders','Year Founded'])

# checking data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2725 entries, 0 to 2724
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   2725 non-null   object
 1   Industry       2694 non-null   object
 2   Funding Stage  1751 non-null   object
 3   Amount         2217 non-null   object
 4   Location       2518 non-null   object
 5   About Company  2725 non-null   object
 6   Year Funded    2725 non-null   int64 
 7   Investor       2123 non-null   object
dtypes: int64(1), object(7)
memory usage: 170.4+ KB


In [14]:
# Display the Amount column for further information
pd.set_option('display.max_rows', None)


### Handling Duplicates

In [15]:
# Check for duplicates
df_duplicates= df[df.duplicated(keep = False)].sort_values(by= "Company Name")

# Check for number of duplicates
sum_dups= df.duplicated().sum()

print("Number of duplicates:", sum_dups)

Number of duplicates: 7


In [16]:
# Drop duplicates
df = df.drop_duplicates()

# Confirm the new shape. Rows should be less by 23
df.shape

(2718, 8)

## Data Cleaning

### Rows with multiple columns missing

In [17]:
# Find rows with missing data in the "Amount" and "Funding Stage" columns
double_nulls= df[df['Amount'].isna()& df['Funding Stage'].isna()]

# Drop them from the database
df.drop(double_nulls.index, inplace= True)

# Confirm the new shape. Rows should be less by 282
df.shape

(2438, 8)

### Industry

- Replace "—" with nulls
- Fill the nulls using the column "About Company" as reference
- Extract only one Industry from the 'Industry' column
- Categorize all Industries into Major Industries

In [18]:
# Replace "—" with Nulls
df['Industry'] = df['Industry'].replace('—', np.nan)


Fill the nulls using the column "About Company" as reference

In [19]:
# Mapping of company names to industries
company_to_industry = {
    "VMate": "Media and Entertainment",
    "Awign Enterprises": "Services (Human Resources)",
    "TapChief": "Services (Consulting / Professional Services)",
    "KredX": "Financial Services",
    "m.Paani": "E-Commerce",
    "Text Mercato": "E-Commerce",
    "Magicpin": "E-Commerce",
	"Leap Club": "E-Commerce",
	"Juicy Chemistry": "Services",
	"Servify": "Retail",
	"Wagonfly": "Media and Entertainment",
	"DrinkPrime": "E-Commerce",
	"Kitchens Centre": "Consumer Durables",
	"Innoviti": "Services",
	"Brick&Bolt": "Financial Services",
	"Toddle": "Real Estate",
	"HaikuJAM": "IT & BPM",
    "MissMalini Entertainment" : "Entertainment and Media",
    "Jagaran Microfin" : "Microfinance",
    "FLEECA" : "Automotive Services",
    "WheelsEMI" : "Financial Services",
    "Fric Bergen" : "Food and Beverage",
    "Deftouch" : "Gaming",
    "Corefactors" : "Marketing",
    "Cell Propulsion" : "Transportation Technology",
    "Flathalt" : "Real Estate",
    "dishq" : "Food Technology",
    "Trell" : "Social Networking",
    "HousingMan.com" : "Real Estate",
    "Steradian Semiconductors" : "Semiconductor Technology",
    "SaffronStays" : "Travel and Hospitality",
    "Inner Being Wellness" : "Beauty and Wellness",
    "MySEODoc" : "Digital Marketing",
    "ENLYFT DIGITAL SOLUTIONS PRIVATE LIMITED" : "Digital Marketing",
    "Scale Labs" : "E-commerce Solutions",
    "Roadcast" : "Business Services",
    "Toffee" : "Insurance Technology",
    "ORO Wealth" : "Financial Services",
    "Finwego" : "Financial Services",
    "Cred" : "Financial Services",
    "Origo" : "Agriculture",
    "Sequretek" : "Cyber Security",
    "Avenues Payments India Pvt. Ltd." : "IT Solutions",
    "Planet11 eCommerce Solutions India (Avenue11)" : "Technology",
    "Iba Halal Care" : "Cosmetics",
    "Togedr" : "Activity Discovery and Booking",
    "Scholify" : "Edutech"    
}

# Function to fill missing industries based on company name
def fill_industry(row):
    if pd.isna(row["Industry"]):
        return company_to_industry.get(row["Company Name"], row["Industry"])
    return row["Industry"]

# Apply the function to update the 'Industry' column
df["Industry"] = df.apply(fill_industry, axis=1)

# Checking the Null value in the 'Industry' column
print("Null values after cleaning:",df['Industry'].isna().sum())    # null values changes from 59 to 0

Null values after cleaning: 0


Extract only one Industry from the 'Industry' column

In [20]:
# Function to extract the first industry from the 'Industry' column
def industry_extract(row):
    industries = row['Industry'].split(',')
    return industries[0].strip() if len(industries) > 1 else row['Industry']

# Apply the function to update the 'Industry' column
df['new_industry'] = df.apply(industry_extract, axis=1)
    
# Remove "Industry"
df = df.drop(columns=['Industry'])

# Rename "new_industry" to "Industry"
df = df.rename(columns={'new_industry': 'Industry'})

df[["Industry"]].head()

Unnamed: 0,Industry
0,Brand Marketing
1,Agriculture
2,Credit
3,Financial Services
4,E-Commerce Platforms


Categorize all Industries into Major Industries

In [21]:
# Import re library to work with regular expressions 
import re

# Function to categorize the industries into major ones
def sector_redistribution(Industry):
    if re.search(r'bank|fintech|finance|crypto|account|credit|venture|crowd|blockchain|microfinance|lending|wealth|insurance|mutual fund|funding|invest|neo-bank|online financial service|escrow', Industry, re.IGNORECASE):
        return 'Finance and FinTech'
    elif re.search(r'e-?commerce|retail|marketplace|e-store|e-tail|e-tailer|consumer|durables|appliances|electronics', Industry, re.IGNORECASE):
        return 'E-Commerce and Retail'
    elif re.search(r'marketing|advertising|brand|digital marketing|sales|customer loyalty|creative agency|content management', Industry, re.IGNORECASE):
        return 'Marketing and Advertising'
    elif re.search(r'agriculture|agtech|agr[iy]tech|food|beverage|catering|cooking|dairy|nutrition|soil', Industry, re.IGNORECASE):
        return 'Agriculture and Food'
    elif re.search(r'health|medical|biotech|pharma|medtech|care|diagnostics|wellness|fitness|personal care|skincare|mental health|life science|alternative medicine|veterinary', Industry, re.IGNORECASE):
        return 'Healthcare and Wellness'
    elif re.search(r'transport|automotive|vehicle|logistics|delivery|air transport|mobility|car|bike|EV|auto-tech|transportation', Industry, re.IGNORECASE):
        return 'Transportation and Mobility'
    elif re.search(r'real estate|construction|interior|housing|home decor|commercial real estate|co-?working|co-?living', Industry, re.IGNORECASE):
        return 'Real Estate and Construction'
    elif re.search(r'media|entertainment|broadcasting|streaming|video|music|gaming|sports|digital entertainment|visual media', Industry, re.IGNORECASE):
        return 'Media and Entertainment'
    elif re.search(r'education|e-?learning|edtech|training|continuing education|career planning|edutech', Industry, re.IGNORECASE):
        return 'Education'
    elif re.search(r'renewable|clean energy|solar|environmental|energy|cleantech|sanitation', Industry, re.IGNORECASE):
        return 'Energy and Environment'
    elif re.search(r'consulting|business services|professional services|customer service|legal|facility|IT & BPM', Industry, re.IGNORECASE):
        return 'Professional Services'
    elif re.search(r'information technology|IT|tech|technology|cloud|internet of things|iot|big data|saas|cyber security|software|ai|machine learning|robotics|deep tech|data science|api|digital|platform|networking|smart cities', Industry, re.IGNORECASE):
        return 'Technology'
    elif re.search(r'consumer goods|consumer applications|consumer durables|consumer electronics|consumer appliances|eyewear|jewellery|fashion', Industry, re.IGNORECASE):
        return 'Consumer Goods'
    elif re.search(r'industrial|manufacturing|automation|industrial automation|packaging', Industry, re.IGNORECASE):
        return 'Industrial and Manufacturing'
    else:
        return Industry
    
# Apply the function to update the 'Industry' column
df['Industry'] = df["Industry"].apply(sector_redistribution)

# Find unique values in the "Industry" column
unique2= df["Industry"].unique()

# Check for number of unique values in the "Industry" column
print(f"Number of unique Industries: {len(unique2)}")       # Unique values changes from 425 to 108


Number of unique Industries: 108


### Amount


- Extract "₹" to a new column
- Remove "₹", "$", "—" and "Undisclosed" in the column
- Change column dtype to "Int64" and Convert rupees to dollars

In [22]:
# Extract the symbols into new column (currencies)
df['currency'] = df.Amount.str.extract(r'([$₹])')


Remove "₹", "$", "—" and "Undisclosed" in the column

In [23]:
# Remove "$", "₹", "—", "," symbols and "Undisclosed" from the 'Amount' column
df['Amount'] = df['Amount'].str.replace('[$₹,—]', '', regex=True)
df['Amount'] = df['Amount'].str.replace(r'Undisclosed', '', regex=False)


Change column dtype to "Int64" and Convert rupees to dollars

In [24]:
# Convert Amount to a numeric column
df['Amount'] = pd.to_numeric(df['Amount'])

# Change the Dtype to Integer
df['Amount'] = df['Amount'].astype('Int64')   # "Amount" is changed to Int64


            #------------Converted all rupees to dollars------------
# Give the rate a variable
rate = 0.013   # Average rupees to dollars exchange rate from 2018 - 2021

# Filter the data for rows that contains "₹" in the "currency" column
rupees = df[df['currency'] == "₹"]

# Convert all rupees to dollars
df['Amount']= rupees['Amount']*rate

# Change data type of the "Amount" column to integer
df['Amount'] = df['Amount'].astype('Int64')

# Drop the "currency" column
df = df.drop(columns=['currency'])

# Fill all null values with zeros
df['Amount'] = df['Amount'].fillna(0) # This only works in the position


df[['Amount']].head()


Unnamed: 0,Amount
0,0
1,520000
2,845000
3,0
4,0


### Funding Stage

- Change column casing
- Remove row with link
- Fill nulls with "Undisclosed"

In [25]:
# Change the case of all rows in the "Funding Stage" column to proper case
df['Funding Stage'] = df['Funding Stage'].str.title()

# Remove the row with the link 
df= df.drop(df[df['Funding Stage'].str.contains('https:', na=False)].index)

# Fill all 974 null values with "Undisclosed"
df['Funding Stage']= df['Funding Stage'].fillna('Undisclosed')

# Print
print("Null values in Funding Stage:",df['Funding Stage'].isna().sum())

Null values in Funding Stage: 0


Categorize Funding stages to their correct names

In [26]:
import re

# Function to categorize the Funding Stage 
def stage_correction(Stage):
    if re.search(r'Angel|Angel Round', Stage, re.IGNORECASE):
        return 'Angel'
    elif re.search(r'Bridge|Bridge Round', Stage, re.IGNORECASE):
        return 'Bridge'
    elif re.search(r'Debt|Debt Financing', Stage, re.IGNORECASE):
        return 'Debt Financing'
    elif re.search(r'Fresh Funding|Funding Round', Stage, re.IGNORECASE):
        return 'Funding Round'
    elif re.search(r'Pre Seed Round|Pre-Seed|Pre-Seed Round', Stage, re.IGNORECASE):
        return 'Pre-Seed'
    elif re.search(r'Pre Series A|Pre- Series A|Pre-Series|Pre-Series A|Pre-Series A1', Stage, re.IGNORECASE):
        return 'Pre-Series A'
    elif re.search(r'Pre Series B|Pre-Series B', Stage, re.IGNORECASE):
        return 'Pre-Series B'
    elif re.search(r'Seed|Seed A|Seed Fund|Seed Funding|Seed Investment|Seed Round', Stage, re.IGNORECASE):
        return 'Seed Round'
    elif re.search(r'Pre Series C|Pre-Series C', Stage, re.IGNORECASE):
        return 'Pre-Series C'
    elif re.search(r'Series A|Series A-1', Stage, re.IGNORECASE):
        return 'Series A'
    elif re.search(r'Series B|Series B+', Stage, re.IGNORECASE):
        return 'Series B'
    elif re.search(r'Series D|Series D1', Stage, re.IGNORECASE):
        return 'Series D'
    else:
        return Stage
    
# Apply the function to update the 'Industry' column
df['Funding Stage'] = df['Funding Stage'].apply(stage_correction)

# Find unique values in the "Industry" column
unique3= df["Funding Stage"].unique()

# Check for number of unique values in the "Industry" column
print(f"Number of unique Stages: {len(unique3)}")         # unique values changes from 50 to 30

Number of unique Stages: 30


### Investor

In [27]:
# Fill nulls with "Unknown"
df['Investor']= df['Investor'].fillna("Unknown")

# Split the column
investor_split = df['Investor'].str.rsplit(',', expand=True)

# Drop all columns except for "0" and "1"
investor_split= investor_split.drop(investor_split.columns[2:], axis=1)

# Assign new column names to the splits
investor_split.columns = ['Investor_1', 'Investor_2']

# Strip both columns of spaces
investor_split["Investor_1"]= investor_split["Investor_1"].str.strip()
investor_split["Investor_2"]= investor_split["Investor_2"].str.strip()

# Fill the nulls of investor_2 with "Unknown"
investor_split["Investor_2"]= investor_split["Investor_2"].fillna("Unknown")

# Join the investor_split to the existing dataset and delete the Investor column
df= df.join(investor_split).drop("Investor", axis=1)

df.head()

Unnamed: 0,Company Name,Funding Stage,Amount,Location,About Company,Year Funded,Industry,Investor_1,Investor_2
0,TheCollegeFever,Seed Round,0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,Marketing and Advertising,Unknown,Unknown
1,Happy Cow Dairy,Seed Round,520000,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,Agriculture and Food,Unknown,Unknown
2,MyLoanCare,Series A,845000,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018,Finance and FinTech,Unknown,Unknown
3,PayMe India,Angel,0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018,Financial Services,Unknown,Unknown
4,Eunimart,Seed Round,0,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,2018,E-Commerce and Retail,Unknown,Unknown


### Location

- Exctract cities from the column & Correct all Typos
- Fill nulls based on research at "pitchbook.com" and "crunchbase.com
- Filter out Cities that are not located in India
- Impute missing values of the unfound Locations of companies with "Unknown"

In [28]:
# Extracting the first part of the 'Location' column after splitting by a comma. e.g the selection of the city
df['Location'] = df['Location'].str.split(pat=',').str[0]

# Dictionary of replacements to correct the typos for some locations
replacements = {
    'Banglore': 'Bengaluru',
    'Small Towns': 'Andhra Pradesh',
    'Gurugram\t#REF!': 'Gurugram',
    'Samsitpur': 'Bengaluru',
    'Telugana': 'Hyderabad',
    'Orissia': 'Bengaluru',
    'Bangalore City': 'Bengaluru',
    'Uttar pradesh': 'Uttar Pradesh'
}

# Replace typos in the 'Location' column with the correct names
df['Location'] = df['Location'].replace(replacements)

Research was done using "pitchbook.com" and "crunchbase.com" to discover the location of these startups (with missing values) and inpute their location into the dataframe.

Fill nulls based on research at "pitchbook.com" and "crunchbase.com

In [30]:

# Dictionary mapping company names to locations for companies where Location was the only column missing
company_to_location = {
    'Habitat': 'Chennai',
    'Raskik': 'Gurugram',
    'Otipy': 'Gurugram',
    'Daalchini': 'Noida',
    'Bijnis': 'New Delhi',
    'Oziva': 'Mumbai',
    'Jiffy ai': 'Bengaluru',
    'Juicy Chemistry': 'Coimbatore',
    'Shiprocket': 'Gurugram',
    'Phable': 'Bengaluru',
    'NIRA': 'Bengaluru',
    'Setu': 'Bengaluru',
    'Zupee': 'Gurugram',
    'DeHaat': 'Patna',
    'CoinDCX': 'Mumbai',
    'Smart Coin': 'Bengaluru',
    'Shop101': 'Mumbai',
    'Neeman': 'Hyderabad',
    'SmartVizX': 'Noida',
    'Onsitego': 'Mumbai',
    'HempStreet': 'Delhi',
    'Classplus': 'Noida',
    'Fleetx': 'Gurugram',
    'Oye! Rickshaw': 'Delhi',
    'MoneyTap': 'Bengaluru',
    'LogiNext': 'Mumbai',
    'Skylo': 'Bengaluru',
    'Samya AI': 'Bengaluru',
    'Kristal AI': 'Bengaluru',
    'Invento Robotics': 'Bengaluru',
    'Teach Us': 'Mumbai',
    'Phenom People': 'Hyderabad',
    'TechnifyBiz': 'Delhi',
    'Klub': 'Bengaluru',
    'Techbooze': 'Delhi',
    'Testbook': 'Gurugram',
    'Mamaearth': 'Gurugram',
    'EpiFi': 'Bengaluru',
    'Vidyakul': 'Gurugram',
    'Pristyn Care': 'Gurugram',
    'Springboard': 'Bengaluru',
    'Bijak': 'Gurugram',
    'Rivigo': 'Gurugram',
    'Cubical Labs': 'Delhi'
}

# Function to fill location based on company name
def update_location(row):
    if row['Company Name'] in company_to_location:
        return company_to_location[row['Company Name']]
    return row['Location']

# Apply the function on the location column
df['Location'] = df.apply(update_location, axis=1)



Filter out Cities that are not located in India

In [31]:

# List of cities that are not located in India
non_indian_cities = [
    "Singapore", "Frisco", "California", "New York", "San Francisco", "San Ramon",
    "Paris", "Plano", "Sydney", "San Francisco Bay Area", "Bangaldesh", "London",
    "Milano", "Palmwoods", "France", "Irvine", "Newcastle Upon Tyne", "Shanghai",
    "Jiaxing", "San Franciscao", "Tangerang", "Berlin", "Seattle", "Riyadh", "Seoul",
    "Bangkok", "Hyderebad", "Computer Games", "Food & Beverages", "Pharmaceuticals #REF!",
    "Beijing", "Santra", "Mountain View", "Online Media #REF!", "Information Technology & Services"
]

# Filter the dataframe to exclude rows with cities that do not belong
df = df[~df['Location'].isin(non_indian_cities)]


In [32]:
# Impute missing values in the Location column with Unknown
df['Location'].fillna('Unknown', inplace=True)

df.head()

Unnamed: 0,Company Name,Funding Stage,Amount,Location,About Company,Year Funded,Industry,Investor_1,Investor_2
0,TheCollegeFever,Seed Round,0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,Marketing and Advertising,Unknown,Unknown
1,Happy Cow Dairy,Seed Round,520000,Mumbai,A startup which aggregates milk from dairy far...,2018,Agriculture and Food,Unknown,Unknown
2,MyLoanCare,Series A,845000,Gurgaon,Leading Online Loans Marketplace in India,2018,Finance and FinTech,Unknown,Unknown
3,PayMe India,Angel,0,Noida,PayMe India is an innovative FinTech organizat...,2018,Financial Services,Unknown,Unknown
4,Eunimart,Seed Round,0,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,E-Commerce and Retail,Unknown,Unknown
