# FUNDING ANALYSIS FOR INDIAN STARTUPS

#### Team: Team Namibia

## Table of Contents


[**Step 1: Business Understanding**](#Step-1:-Business-Understanding)

[**Step 2: Data Understanding**](#Step-2:-Data-Understanding)

- [**Load Data**](#Load-Data)
- [**Data Quality**](#Check-Data-Quality)
- [**Exploratory Data Analysis-EDA**](#Exploratory-Data-Analysis---EDA)
 

## Step 1: Business Understanding
Team Namibia is trying to venture into the Indian start-up ecosystem. As the data expert of the team, we are to investigate the ecosystem and propose the best course of action.

#### Problem Statement:
Ideas, creativity, and execution are essential for a start-up to flourish. But are they enough? Investors provide start-ups and other entrepreneurial ventures with the capital---popularly known as "funding"---to think big, grow rich, and leave a lasting impact.

In this project we are investigating the dynamics of startup funding in India over the period from 2018 to 2021. The aim is to understand the trends, sector preferences, investment stages, key investors, and funding Patterns. Additionally, if there have been significant differences in funding amounts across different years and sectors, it can guide the action plan to be taken.

#### Objective
In this analysis we will provide insights into the startup funding landscape in India from 2018 to 2021 by: 
- Identifying trends and patterns in funding amounts over the years.
- Determining which sectors received the most funding and how sector preferences changed over time.
- Understanding the distribution of funding across different stages of startups (e.g., Seed, Series A).
- Identifying key investors and their investment behaviors.
- Analyzing the geographical distribution of funding within India.

#### Analytical Questions
1. What are the trends and patterns in funding amounts for startups in India between 2018 to 2021?
   - Analyzing the annual and quarterly trends in funding can reveal patterns and growth trajectories. Look for peaks, dips, and any consistent growth patterns over these years.(Amount, Year funded)
2. Which sectors received the most funding, and how did sector preferences change over time from 2018 to 2021?
   - Identifying which industries or sectors received the most funding can show sectoral preferences and shifts. Understanding how this distribution has evolved over the years can highlight emerging trends and declining interests. (industry, amount, year funded)
3. How is the distribution of funding across different stages of startups (e.g., Seed, Series A)?
   - Analyzing the funding amounts at different startup stages can provide insights into the investment appetite at various growth phases. It can also help in understanding the maturity and risk preference of investors. (stages, Amount)
4. Who are the key investors in Indian startups, and what are their investment behaviors/patterns?
   - Identifying the most active investors and analyzing their investment portfolios can shed light on key players in the ecosystem. Understanding their investment patterns can also reveal strategic preferences and alliances.(Investor, amount, industry, stages)
5. What is the geographical distribution of startup funding within India, and how has this distribution changed over the years 2018 to 2021?
   - Analyzing the geographical distribution of startup funding can show regional hotspots for entrepreneurship and investment. Observing how this has changed over the years can reveal shifts in regional focus and development.(location, year_funded, amount)

# Step 2: Data Understanding

The data from 2018 is obtained from GitHub in csv format, 2019 data is obtained from google drive in csv format and 2020 to 2021 data is obtained from an SQL database.

## Load Data

#### Install pyodbc and python-dotenv if necessary

In [1]:
# %pip install pyodbc  
# %pip install python-dotenv 

#### Importing the necessary packages 

In [1]:
# Import the pyodbc library to handle ODBC database connections
import pyodbc 

# Import the dotenv function to load environment variables from a .env file
from dotenv import dotenv_values 

# Import the pandas library for data manipulation and analysis
import pandas as pd 

# Import the warnings library to handle warning messages
import warnings

# Filter out (ignore) any warnings that are raised
warnings.filterwarnings('ignore')

# Import the numpy library for data manipulation and analysis
import numpy as np


#### Establishing a connection to the SQL database

In [2]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the .env file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("UID")
password = environment_variables.get("PWD")

# Create the connection string using the retrieved credentials
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"


#### Load 2020 & 2021 data

In [3]:
            #----------Load 2020 data----------
# Establish a connection to the database using the connection string
connection = pyodbc.connect(connection_string) 

# Define the SQL query to select all columns from the specified table
query = "Select * from dbo.LP1_startup_funding2020"

# Execute the SQL query and fetch the result into a pandas DataFrame using the established database connection
df_2020 = pd.read_sql(query, connection)


In [4]:
           #----------Load 2021 data----------
# Establish a connection to the database using the connection string
connection = pyodbc.connect(connection_string)

# Define the SQL query to select all columns from the specified table
query1 = "Select * from dbo.LP1_startup_funding2020"

# Execute the SQL query and fetch the result into a pandas DataFrame using the established database connection
df_2021 = pd.read_sql(query1, connection)


#### Load 2018 & 2019 data

In [5]:
# Load 2018
df_2018 = pd.read_csv(r'C:\Users\Pc\Desktop\Data analysis\Azubi Africa\Career Accelerator\Indian-Start-up-Funding-Analysis\Dataset\startup_funding2018.csv')

# Load 2019
df_2019 = pd.read_csv(r'C:\Users\Pc\Desktop\Data analysis\Azubi Africa\Career Accelerator\Indian-Start-up-Funding-Analysis\Dataset\startup_funding2019.csv')



In [6]:
print(data_2018.columns)
print(data_2019.columns)
print(data_2020.columns)
print(data_2021.columns)

NameError: name 'data_2018' is not defined

#### Rename the columns & Save all the data in one DataFrame

In [43]:
# Rename 2018 column: 'Round/Series' to 'Funding Stage'
df_2018 = df_2018.rename(columns = {'Round/Series': 'Funding Stage'})

# Rename 2019 columns
df_2019 = df_2019.rename(columns = {'Company/Brand': 'Company Name', 'Sector': 'Industry', 'Stage': 'Funding Stage', 'Amount($)': 'Amount', 'HeadQuarter': 'Location', 'What it does': 'About Company', 'Founded': 'Year Founded'})

# Rename 2020 columns
df_2020 = df_2020.rename(columns = {'Company_Brand': 'Company Name', 'Sector': 'Industry', 'Stage': 'Funding Stage', 'HeadQuarter': 'Location', 'What_it_does': 'About Company', 'Founded': 'Year Founded'})

# Rename 2021 columns
df_2021 = df_2021.rename(columns = {'Company_Brand': 'Company Name', 'Sector': 'Industry', 'Stage': 'Funding Stage', 'HeadQuarter': 'Location', 'What_it_does': 'About Company', 'Founded': 'Year Founded'})



In [44]:
# Add a column to each DataFrame to indicate the year
df_2018['Year Funded'] = 2018
df_2019['Year Funded'] = 2019
df_2020['Year Funded'] = 2020
df_2021['Year Funded'] = 2021

# Concatenate all DataFrames into one master DataFrame
df = pd.concat([df_2018, df_2019, df_2020, df_2021], ignore_index=True)


# Print out the new DataFrame to confirm the combination was done correctly
df.head()

Unnamed: 0,Company Name,Industry,Funding Stage,Amount,Location,About Company,Year Funded,Year Founded,Founders,Investor,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,2018,,,,


## Data Quality

- The columns: 'column10', 'Founders' & 'Year Founded' must be removed as they do not help answer our questions.
- Review duplicates.
- The "Company Name" column does not have any may concerns except a few names with ".com", ".ai", ".AI", ".sh" and "+" present.
- In the "Industry" column:
    - "—" must be investgated further using the "About Company" column in order to fill it with the right data.
    - There are companies with multiple industries in a single row, we need to keep only one and remove the rest.
    - We can find the unique values in the column and categorize them under specific industries using "Regular Expression Models".
- There is a link in the "Funding Stage" column at index 178.
    - Investigate "NaN" and "None" present in th "Funding Stage" column.
- The "Amount" column (Prescence of "₹", "$", "—" and "Undisclosed"in the column):
    - Extract "₹" to a new column
    - Investigate "NaN"
    - Replace "₹", "$" and "—" with "" in the column
    - Investigate "Undisclosed"
    - Convert the dtype of the column to Int64 as it is in the wrong format.
- In the "Location" column:
    - Investigate "India, Asia"
    - Investigate "NaN"
    - Split the column, keep the 1st one(containing cities), 
        - Join it to the main dataframe "df"
        - Delete the "Location"
        - Rename the newly joined to "Location"
- In the "Investor" column:
    - Investigate "NaN"
    - Split the column

### Data Review

In [45]:
df.shape

(2725, 11)

In [46]:
df.isnull().sum()

Company Name        0
Industry           31
Funding Stage     974
Amount            508
Location          207
About Company       0
Year Funded         0
Year Founded      981
Founders          553
Investor          602
column10         2721
dtype: int64

In [47]:
# Checking columns
df.columns

Index(['Company Name', 'Industry', 'Funding Stage', 'Amount', 'Location',
       'About Company', 'Year Funded', 'Year Founded', 'Founders', 'Investor',
       'column10'],
      dtype='object')

Remove unwanted columns & checking data info

In [48]:
# Dropping unwanted columns
df = df.drop(columns=['column10','Founders','Year Founded'])

# checking data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2725 entries, 0 to 2724
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   2725 non-null   object
 1   Industry       2694 non-null   object
 2   Funding Stage  1751 non-null   object
 3   Amount         2217 non-null   object
 4   Location       2518 non-null   object
 5   About Company  2725 non-null   object
 6   Year Funded    2725 non-null   int64 
 7   Investor       2123 non-null   object
dtypes: int64(1), object(7)
memory usage: 170.4+ KB


In [49]:
# Display the Amount column for further information
pd.set_option('display.max_rows', None)


### Handling Duplicates

In [50]:
# Check for duplicates
df_duplicates= df[df.duplicated(keep = False)].sort_values(by= "Company Name")

# Check for number of duplicates
sum_dups= df.duplicated().sum()

print("Number of duplicates:", sum_dups)

Number of duplicates: 7


In [51]:
# Drop duplicates
df = df.drop_duplicates()

# Confirm the new shape. Rows should be less by 23
df.shape

(2718, 8)

## Data Cleaning

### Industry

In [52]:
# Check for null values
print("Null values in Industry:",df["Industry"].isna().sum())

Null values in Industry: 31


In [53]:
# Check total number of '—'
dash_count= df[df['Industry']=='—']['Industry'].count()

print("Number of '—' in Industry:",dash_count)

Number of '—' in Industry: 30


Replace "—" with nulls

In [54]:
# Replace "—" with Nulls
df['Industry'] = df['Industry'].replace('—', np.nan)

# Check for null values
print("New null values in Industry:",df["Industry"].isna().sum())


New null values in Industry: 61


Fill the nulls with the determined Industries (obtained from the "About Company" column)

In [55]:
# Mapping of company names to industries
company_to_industry = {
    "VMate": "Media and Entertainment",
    "Awign Enterprises": "Services (Human Resources)",
    "TapChief": "Services (Consulting / Professional Services)",
    "KredX": "Financial Services",
    "m.Paani": "E-Commerce",
    "Text Mercato": "E-Commerce",
    "Magicpin": "E-Commerce",
	"Leap Club": "E-Commerce",
	"Juicy Chemistry": "Services",
	"Servify": "Retail",
	"Wagonfly": "Media and Entertainment",
	"DrinkPrime": "E-Commerce",
	"Kitchens Centre": "Consumer Durables",
	"Innoviti": "Services",
	"Brick&Bolt": "Financial Services",
	"Toddle": "Real Estate",
	"HaikuJAM": "IT & BPM",
    "MissMalini Entertainment" : "Entertainment and Media",
    "Jagaran Microfin" : "Microfinance",
    "FLEECA" : "Automotive Services",
    "WheelsEMI" : "Financial Services",
    "Fric Bergen" : "Food and Beverage",
    "Deftouch" : "Gaming",
    "Corefactors" : "Marketing",
    "Cell Propulsion" : "Transportation Technology",
    "Flathalt" : "Real Estate",
    "dishq" : "Food Technology",
    "Trell" : "Social Networking",
    "HousingMan.com" : "Real Estate",
    "Steradian Semiconductors" : "Semiconductor Technology",
    "SaffronStays" : "Travel and Hospitality",
    "Inner Being Wellness" : "Beauty and Wellness",
    "MySEODoc" : "Digital Marketing",
    "ENLYFT DIGITAL SOLUTIONS PRIVATE LIMITED" : "Digital Marketing",
    "Scale Labs" : "E-commerce Solutions",
    "Roadcast" : "Business Services",
    "Toffee" : "Insurance Technology",
    "ORO Wealth" : "Financial Services",
    "Finwego" : "Financial Services",
    "Cred" : "Financial Services",
    "Origo" : "Agriculture",
    "Sequretek" : "Cyber Security",
    "Avenues Payments India Pvt. Ltd." : "IT Solutions",
    "Planet11 eCommerce Solutions India (Avenue11)" : "Technology",
    "Iba Halal Care" : "Cosmetics",
    "Togedr" : "Activity Discovery and Booking",
    "Scholify" : "Edutech"    
}

# Function to fill missing industries based on company name
def fill_industry(row):
    if pd.isna(row["Industry"]):
        return company_to_industry.get(row["Company Name"], row["Industry"])
    return row["Industry"]

# Apply the function to update the 'Industry' column
df["Industry"] = df.apply(fill_industry, axis=1)

# Checking the Null value in the 'Industry' column
print("Null values in Industry after cleaning:",df['Industry'].isna().sum())

Null values in Industry after cleaning: 0


Extract 1 Industry from the 'Industry' column

In [56]:
# Function to extract the first industry from the 'Industry' column
def industry_extract(row):
    industries = row['Industry'].split(',')
    return industries[0].strip() if len(industries) > 1 else row['Industry']

# Apply the function to update the 'Industry' column
df['new_industry'] = df.apply(industry_extract, axis=1)
    
# Remove "Industry"
df = df.drop(columns=['Industry'])

# Rename "new_industry" to "Industry"
df = df.rename(columns={'new_industry': 'Industry'})

df[["Industry"]].head()

Unnamed: 0,Industry
0,Brand Marketing
1,Agriculture
2,Credit
3,Financial Services
4,E-Commerce Platforms


Find the unique values of the "Industry" column

In [57]:
# Find unique values in the "Industry" column
unique1= df["Industry"].unique()

# Check for number of unique values in the "Industry" column
print(f"Number of unique Industries: {len(unique1)}")

Number of unique Industries: 454


Categorize all Industries into Major Industries

In [58]:
import re

def sector_redistribution(Industry):
    if re.search(r'bank|fintech|finance|crypto|account|credit|venture|crowd|blockchain|microfinance|lending|wealth|insurance|mutual fund|funding|invest|neo-bank|online financial service|escrow', Industry, re.IGNORECASE):
        return 'Finance and FinTech'
    elif re.search(r'e-?commerce|retail|marketplace|e-store|e-tail|e-tailer|consumer|durables|appliances|electronics', Industry, re.IGNORECASE):
        return 'E-Commerce and Retail'
    elif re.search(r'marketing|advertising|brand|digital marketing|sales|customer loyalty|creative agency|content management', Industry, re.IGNORECASE):
        return 'Marketing and Advertising'
    elif re.search(r'agriculture|agtech|agr[iy]tech|food|beverage|catering|cooking|dairy|nutrition|soil', Industry, re.IGNORECASE):
        return 'Agriculture and Food'
    elif re.search(r'health|medical|biotech|pharma|medtech|care|diagnostics|wellness|fitness|personal care|skincare|mental health|life science|alternative medicine|veterinary', Industry, re.IGNORECASE):
        return 'Healthcare and Wellness'
    elif re.search(r'transport|automotive|vehicle|logistics|delivery|air transport|mobility|car|bike|EV|auto-tech|transportation', Industry, re.IGNORECASE):
        return 'Transportation and Mobility'
    elif re.search(r'real estate|construction|interior|housing|home decor|commercial real estate|co-?working|co-?living', Industry, re.IGNORECASE):
        return 'Real Estate and Construction'
    elif re.search(r'media|entertainment|broadcasting|streaming|video|music|gaming|sports|digital entertainment|visual media', Industry, re.IGNORECASE):
        return 'Media and Entertainment'
    elif re.search(r'education|e-?learning|edtech|training|continuing education|career planning|edutech', Industry, re.IGNORECASE):
        return 'Education'
    elif re.search(r'renewable|clean energy|solar|environmental|energy|cleantech|sanitation', Industry, re.IGNORECASE):
        return 'Energy and Environment'
    elif re.search(r'consulting|business services|professional services|customer service|legal|facility|IT & BPM', Industry, re.IGNORECASE):
        return 'Professional Services'
    elif re.search(r'information technology|IT|tech|technology|cloud|internet of things|iot|big data|saas|cyber security|software|ai|machine learning|robotics|deep tech|data science|api|digital|platform|networking|smart cities', Industry, re.IGNORECASE):
        return 'Technology'
    elif re.search(r'consumer goods|consumer applications|consumer durables|consumer electronics|consumer appliances|eyewear|jewellery|fashion', Industry, re.IGNORECASE):
        return 'Consumer Goods'
    elif re.search(r'industrial|manufacturing|automation|industrial automation|packaging', Industry, re.IGNORECASE):
        return 'Industrial and Manufacturing'
    else:
        return Industry
    
# Apply the function to update the 'Industry' column
df['Industry'] = df["Industry"].apply(sector_redistribution)

# Find unique values in the "Industry" column
unique2= df["Industry"].unique()

# Check for number of unique values in the "Industry" column
print(f"New Number of unique Industries: {len(unique2)}")


New Number of unique Industries: 114
