# Indian Start-up Investment Analysis (2018 - 2021)

### Aim:
To assess the investment potential and attractiveness of the Indian startup ecosystem and recommend an optimal course of action.

### Objectives:
 
1. To assess the overall attractiveness of the Indian startup ecosystem based on funding trends and investor activity from 2018 to 2021.
2. To identify key sectors with high potential for investment based on their funding attractiveness and growth prospects.
3. To evaluate the investment opportunities across different stages of startup development and their risk-return profiles.
4. To analyze the geographical distribution of startups and funding to identify strategic investment locations and regional investment disparities.
5. To determine the correlation between funding amounts received by startups and their subsequent performance, providing insights into potential returns on investment and success rates.

### Business Questions:
1. What are the trends in funding amounts for Indian startups from 2018 to 2021? Are there any significant fluctuations or consistent growth patterns observed over this period?

2. Which sectors within the Indian startup ecosystem attracted the highest total funding during the specified timeframe? Are there any emerging sectors that have shown rapid growth in terms of investment?

3. What is the distribution of investment amounts across different stages of startup development (e.g., seed, early-stage, growth)? Are certain stages more favored by investors, and if so, why?

4. How are startups and funding distributed geographically within India? Are there specific regions or cities that have emerged as hubs for startup activity and investment, and are there any notable regional disparities?

5. Is there a correlation between the funding amounts received by startups and their subsequent performance metrics such as revenue growth, user acquisition, or market share? What insights can be gleaned from this correlation in terms of potential returns on investment and success rates?

6. Who are the top investors in the Indian startup ecosystem during the specified period? What sectors do they predominantly invest in, and are there any patterns in their investment strategies?

7. What are the characteristics of successful Indian startups in terms of founding team composition, industry focus, and funding trajectory? Can these characteristics be used to identify potential investment opportunities or predict startup success?

### Hypothesis to Test:
 
Given the goal of assessing the investment potential in the Indian startup ecosystem, we hypothesize that:
 
**Null Hypothesis (H0)**: There is no clear pattern in the funding received by Indian startups from 2018 to 2021, and factors like sector, stage, location, and funding amount do not affect startup success.

**Alternative Hypothesis (H1)**: There is a clear pattern in the funding received by Indian startups from 2018 to 2021, and factors like sector, stage, location, and funding amount affect startup success.

## Import Packages for Analysis

In [205]:
# import relevant packages
import pyodbc
from dotenv import dotenv_values
import pandas as pd
import warnings
import numpy as np

warnings.filterwarnings('ignore')


#### Connect to server for 2020 and 2021 datasets

In [206]:
# load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials from .env file
database=environment_variables.get("DATABASE")
server=environment_variables.get("SERVER")
login=environment_variables.get("LOGIN")
password=environment_variables.get("PASSWORD")

# create a connection string
connection_string=f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={login};PWD={password}"

In [207]:
# create connection using the pyodbc method 

connection = pyodbc.connect(connection_string)
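
If an entry is missing from the `.env` file, `environment_variables.get(...)` returns `None` and the connection string silently contains the text "None". The sketch below shows one way to fail early instead. The helper name `build_connection_string`, the `REQUIRED_KEYS` tuple, and the placeholder values are assumptions for illustration and are not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): raise a clear error
# when a credential is missing or empty in the mapping loaded from .env.
REQUIRED_KEYS = ("SERVER", "DATABASE", "LOGIN", "PASSWORD")


def build_connection_string(env):
    """Return an ODBC connection string, or raise KeyError if a key is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing .env entries: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['LOGIN']};PWD={env['PASSWORD']}"
    )


# Placeholder values only, not real credentials
example_env = {"SERVER": "host", "DATABASE": "db", "LOGIN": "user", "PASSWORD": "secret"}
checked_connection_string = build_connection_string(example_env)
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed to this helper in place of `example_env`.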

#### Select tables of interest from the Database

In [208]:
# selecting tables from Database
db_query = ''' SELECT *
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_TYPE = 'BASE TABLE' '''

#### View tables of interest from the Database for verification purposes

In [209]:
# Run the query to list the base tables in the database
data=pd.read_sql(db_query, connection)

data

Unnamed: 0,TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
0,dapDB,dbo,LP1_startup_funding2021,BASE TABLE
1,dapDB,dbo,LP1_startup_funding2020,BASE TABLE


# Preview of Datasets

#### Data_2020

In [210]:
# Load the 2020 table into a DataFrame and preview its structure
query= "SELECT * FROM dbo.LP1_startup_funding2020"
data_2020 =pd.read_sql(query, connection)

data_2020.head()



Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


#### Data_2021

In [211]:
# Load the 2021 table into a DataFrame and preview its structure
query= "SELECT * FROM dbo.LP1_startup_funding2021"
data_2021 =pd.read_sql(query, connection)

data_2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


#### Data_2019 

#### Load csv data from other sources for analysis

In [212]:
# Read the 2019 CSV to understand the data structure (raw string avoids invalid escape sequences such as "\P")
data_2019=pd.read_csv(r"D:\JHanson\Justice Hanson\DS Career Accelerator\Project 1\Indian-Start-up-Investment-Analysis\CSV Data\startup_funding2019.csv")

data_2019.head(5)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


#### Data_2018 

In [213]:
# Read the 2018 CSV to understand the data structure (raw string avoids invalid escape sequences such as "\P")
data_2018=pd.read_csv(r"D:\JHanson\Justice Hanson\DS Career Accelerator\Project 1\Indian-Start-up-Investment-Analysis\CSV Data\startup_funding2018.csv")

data_2018.head(5)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


#### Print Column Names for Comparison

In [214]:
# Print column names for comparison
print("Column names in data_2018:")
print(data_2018.columns)
print("\n")

# Column names for data_2019
print("Column names in data_2019:")
print(data_2019.columns)
print("\n")

# Column names for data_2020
print("Column names in data_2020:")
print(data_2020.columns)
print("\n")

# Column names for data_2021
print("Column names in data_2021:")
print(data_2021.columns)
print("\n")

Column names in data_2018:
Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company'],
      dtype='object')


Column names in data_2019:
Index(['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
       'Founders', 'Investor', 'Amount($)', 'Stage'],
      dtype='object')


Column names in data_2020:
Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')


Column names in data_2021:
Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')




Observations
1. **Inconsistency in Column Names**
The 2018 dataset uses different column names from the 2019, 2020, and 2021 datasets.

2. **Variations in Column Names**
The same type of information appears under different column names across years (e.g., 'Company Name', 'Company/Brand', 'Company_Brand').

3. **Unique Columns**
Some years have columns that are absent from other years (e.g., 'column10' in 2020), which complicates direct merging.

*These issues will be addressed by standardizing column names across all datasets and aligning equivalent columns under one name. Unique columns will be kept or dropped according to their relevance to the analysis.*
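
As a rough illustration of the formatting part of that standardization, the sketch below normalizes column labels to snake_case using only the standard library. The function name `to_snake_case` is hypothetical. It only unifies spelling and punctuation, so labels that differ in meaning (for example 'Company Name' versus 'Company_Brand') would still need an explicit mapping.

```python
import re


def to_snake_case(name):
    """Lowercase a column label and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# Column labels taken from the four datasets above
sample_columns = ["Company Name", "Company/Brand", "Company_Brand", "Amount($)", "What it does"]
normalized = [to_snake_case(col) for col in sample_columns]
print(normalized)
```

Because `DataFrame.rename` accepts a callable for `columns`, a function like this could be applied as `data_2019.rename(columns=to_snake_case)` before the explicit renaming step.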


#### Column renaming for consistency and merging

In [217]:
# Rename columns in each dataset
data_2018.rename(columns={
    'Company Name': 'company_name',
    'Industry': 'sector',
    'Round/Series': 'stage',
    'Amount': 'funding_amount',
    'Location': 'location',
    'About Company': 'description'
}, inplace=True)

# 2019 column mapping
data_2019.rename(columns={
    'Company/Brand': 'company_name',
    'HeadQuarter': 'location',
    'Sector': 'sector',
    'What it does': 'description',
    'Amount($)': 'funding_amount'
}, inplace=True)

# 2020 column mapping
data_2020.rename(columns={
    'Company_Brand': 'company_name',
    'What_it_does': 'description',
    'Amount': 'funding_amount'
}, inplace=True)

# 2021 column mapping
data_2021.rename(columns={
    'Company_Brand': 'company_name',
    'What_it_does': 'description',
    'Amount': 'funding_amount'
}, inplace=True)

# Merge datasets using the standardized column names
merged_data = pd.concat([data_2018, data_2019, data_2020, data_2021], ignore_index=True)

merged_data.head(5)


Unnamed: 0,company_name,sector,stage,funding_amount,location,description,Founded,Founders,Investor,Stage,HeadQuarter,Sector,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,,,,


In [218]:
# Merge 'location' and 'HeadQuarter' columns into a single 'location' column
merged_data['location'] = merged_data['location'].combine_first(merged_data['HeadQuarter'])

# Merge 'sector' and 'Sector' columns into a single 'sector' column
merged_data['sector'] = merged_data['sector'].combine_first(merged_data['Sector'])

# Merge 'Stage' and 'stage' columns into a single 'stage' column
merged_data['stage'] = merged_data['Stage'].combine_first(merged_data['stage'])

# Drop the original columns
merged_data.drop(columns=['HeadQuarter', 'Sector', 'Stage'], inplace=True)
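
To make the behaviour of `combine_first` concrete, here is a minimal sketch on a toy DataFrame (the `demo` frame and its values are made up for illustration). The calling Series keeps its own non-null values and takes values from the other Series only where it is null.

```python
import pandas as pd

# Toy frame mimicking the post-concat layout: each row has at most one of the two columns filled
demo = pd.DataFrame({
    "location": ["Bangalore, Karnataka, India", None, None],
    "HeadQuarter": [None, "Mumbai", None],
})

# Keep 'location' where present; otherwise fall back to 'HeadQuarter'
demo["location"] = demo["location"].combine_first(demo["HeadQuarter"])
demo = demo.drop(columns=["HeadQuarter"])
print(demo)
```

Rows where both columns are null stay null, which is why 'location' still has missing values after the merge.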



In [221]:
# Print merged_data for verification
merged_data.head(5)

Unnamed: 0,company_name,sector,stage,funding_amount,location,description,Founded,Founders,Investor,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,


In [222]:
# View from the bottom
merged_data.tail(5)

Unnamed: 0,company_name,sector,stage,funding_amount,location,description,Founded,Founders,Investor,column10
2874,Gigforce,Staffing & Recruiting,Pre-series A,$3000000,Gurugram,A gig/on-demand staffing company.,2019.0,"Chirag Mittal, Anirudh Syal",Endiya Partners,
2875,Vahdam,Food & Beverages,Series D,$20000000,New Delhi,VAHDAM is among the world’s first vertically i...,2015.0,Bala Sarda,IIFL AMC,
2876,Leap Finance,Financial Services,Series C,$55000000,Bangalore,International education loans for high potenti...,2019.0,"Arnav Kumar, Vaibhav Singh",Owl Ventures,
2877,CollegeDekho,EdTech,Series B,$26000000,Gurugram,"Collegedekho.com is Student’s Partner, Friend ...",2015.0,Ruchir Arora,"Winter Capital, ETS, Man Capital",
2878,WeRize,Financial Services,Series A,$8000000,Bangalore,India’s first socially distributed full stack ...,2019.0,"Vishal Chopra, Himanshu Gupta","3one4 Capital, Kalaari Capital",


In [223]:
# Check data structure and information
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   company_name    2879 non-null   object 
 1   sector          2861 non-null   object 
 2   stage           1941 non-null   object 
 3   funding_amount  2622 non-null   object 
 4   location        2765 non-null   object 
 5   description     2879 non-null   object 
 6   Founded         2110 non-null   float64
 7   Founders        2334 non-null   object 
 8   Investor        2253 non-null   object 
 9   column10        2 non-null      object 
dtypes: float64(1), object(9)
memory usage: 225.1+ KB


Observation
'column10' has only 2 non-null values out of 2,879 rows, so it is irrelevant to the analysis. Several columns also have unsuitable data types (e.g., 'funding_amount' is stored as object and 'Founded' as float64) and must be corrected before any numeric analysis.
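
One possible way to approach the 'funding_amount' type problem is sketched below. It splits a raw value into a currency code and a numeric value, based on the formats visible in the previews ('₹40,000,000', '$1,200,000', '250000', '—'). The function name `parse_amount` and the symbol table are assumptions, no currency conversion is attempted, and amounts without a symbol are returned with an unknown currency instead of guessing one.

```python
import re

# Assumed symbol-to-code table covering the two symbols seen in the previews
CURRENCY_SYMBOLS = {"$": "USD", "₹": "INR"}


def parse_amount(raw):
    """Split a raw funding value into (currency, value); value is None if not numeric."""
    if raw is None:
        return (None, None)
    text = str(raw).strip()
    currency = None
    if text and text[0] in CURRENCY_SYMBOLS:
        currency = CURRENCY_SYMBOLS[text[0]]
        text = text[1:]
    digits = text.replace(",", "")
    if re.fullmatch(r"\d+(\.\d+)?", digits):
        return (currency, float(digits))
    return (currency, None)


for raw in ["₹40,000,000", "$1,200,000", "250000", "—"]:
    print(raw, "->", parse_amount(raw))
```

Keeping the currency separate from the value makes it possible to decide later, and explicitly, how rupee amounts should be converted.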

In [224]:
# Drop 'column10' permanently
merged_data.drop(columns=['column10'], inplace=True)


In [225]:
# Confirm dropped column
merged_data.head()

Unnamed: 0,company_name,sector,stage,funding_amount,location,description,Founded,Founders,Investor
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,


# 2. Quality Checks on Columns

#### 2.1 sector column

In [161]:
# Print all unique values in the 'sector' column
unique_sectors = merged_data['sector'].unique()

print("Unique values in 'sector' column:")
for sector in unique_sectors:
    print(sector)

Unique values in 'sector' column:
Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing
Agriculture, Farming
Credit, Financial Services, Lending, Marketplace
Financial Services, FinTech
E-Commerce Platforms, Retail, SaaS
Cloud Infrastructure, PaaS, SaaS
Internet, Leisure, Marketplace
Market Research
Information Services, Information Technology
Mobile Payments
B2B, Shoes
Internet
Apps, Collaboration, Developer Platform, Enterprise Software, Messaging, Productivity Tools, Video Chat
Food Delivery
Industrial Automation
Automotive, Search Engine, Service Industry
Finance, Internet, Travel
Accounting, Business Information Systems, Business Travel, Finance, SaaS
Artificial Intelligence, Product Search, SaaS, Service Industry, Software
Internet of Things, Waste Management
Air Transportation, Freight Service, Logistics, Marine Transportation
Financial Services
Food and Beverage
Autonomous Vehicles
Enterprise Software, Health Care, Hospital, Parenting, Personal Health, SaaS
Agricu

#### 2.1.0 sector redistribution

In [226]:
import re

def sector_redistribution(sector):
    if isinstance(sector, str):
        if re.search(r'Credit|Financial Services|Lending|Marketplace|FinTech|Accounting|Banking|Venture Capital|Investment|Financial Exchanges|Micro Lending|Wealth Management|Insurance|Crowdfunding|Finance|Impact Investing|Personal Finance|Cryptocurrency|Trading Platform|Financial Services', sector):
            return 'Finance'
        elif re.search(r'Automotive|Air Transportation|Transport|Logistics|Vehicle|Transportation|Railroad|Last Mile Transportation|Electric Vehicle|Ride Sharing|Autonomous Vehicles|Marine Transportation|Battery', sector):
            return 'Transport'
        elif re.search(r'E-Commerce|Retail|Fashion|Jewelry|Shopping|Retail Technology|Marketplace|E-Commerce Platforms|Online Portals|Facilities Support Services|Procurement|Interior Design|Home Decor|Home Improvement|Home Services|Furniture', sector):
            return 'E-Commerce & Retail'
        elif re.search(r'Cloud Infrastructure|PaaS|SaaS|Software|Enterprise Software|Network Hardware|Network Security|Delivery Service|Information Technology|Cloud Computing|Data Analytics|AI|Machine Learning|Analytics|Big Data|IoT|Blockchain|Artificial Intelligence|Digital Marketing|SEO|SEM|Web Development|Digital Media|Media and Entertainment|Social Media|CRM|Virtual Reality|Augmented Reality|Enterprise Resource Planning', sector):
            return 'Technology & IT'
        elif re.search(r'Health Care|Hospital|Medical|Health Diagnostics|Medical Device|Wellness|Personal Health|Health Insurance|Health and Fitness|MedTech|Pharmaceutical|Life Science|Biotechnology|Diabetes|Elder Care|Alternative Medicine|mHealth|Dental|Home Health Care|Nutrition|Medical|HealthTech', sector):
            return 'Health & Medical'
        elif re.search(r'Food Delivery|Food and Beverage|Food Processing|Restaurants|Catering|Snack Food|Tea|Organic Food|Food Industry|FoodTech|Cloud Kitchen|Beverages|Fusion Beverages|Food & Nutrition|Food Production|Cooking', sector):
            return 'Food & Beverage'
        elif re.search(r'Advertising|Brand Marketing|Event Promotion|Marketing|Sponsorship|Ticketing|Digital Marketing|Creative Agency|Video Streaming|Broadcasting|News|Publishing|Media|Media Tech|Content Management|Content Publishing|Video Platform', sector):
            return 'Media & Advertising'
        elif re.search(r'Agriculture|AgTech|Farming|Farmers Market|AgriTech|Foodtech|Dairy', sector):
            return 'Agriculture'
        elif re.search(r'Tourism|Travel|TravelTech|Business Travel|Tourism & EV|Travel Accommodations|Hospitality|Hotel|Reservations', sector):
            return 'Travel & Hospitality'
        elif re.search(r'Consulting|Business Development|Advisory|Management Consulting|Outsourcing|Customer Service|Professional Services', sector):
            return 'Consulting & Professional Services'
        elif re.search(r'Education|E-Learning|EdTech|Higher Education|Education Management|Continuing Education|Skill Assessment|Tutoring|STEM Education|Career Planning|Training', sector):
            return 'Education'
        elif re.search(r'Supply Chain Management|Freight Service|Logistics|Delivery|Warehousing|Packaging Services|Supply Chain', sector):
            return 'Logistics & Supply Chain'
        elif re.search(r'Industrial Automation|Manufacturing|Robotics|Automation|Industrial|Mechanical & Industrial Engineering|Production|Factory|Industrial Technology|Automobile Technology', sector):
            return 'Manufacturing & Industrial'
        elif re.search(r'Energy|Renewable Energy|CleanTech|Solar|Electricity|Energy Storage|Environmental Services|GreenTech|Environmental Consulting|Natural Resources|Oil and Gas|Energy Technology', sector):
            return 'Energy & Environmental'
        elif re.search(r'Children|Parenting|Child Care|Preschool Daycare|KidTech', sector):
            return 'Parenting & Child Care'
        elif re.search(r'Sports|Fitness|Health and Fitness|Wellness|Yoga|eSports|Gaming|Video Games|Fantasy Sports|Sporting Goods|SportsTech|Health & Wellness', sector):
            return 'Sports & Fitness'
        elif re.search(r'Fashion|Beauty|Lifestyle|Cosmetics|Apparel|Footwear|Wearables|Fashion Tech|Jewelry|Skincare|Beauty Products|Beauty & Wellness', sector):
            return 'Fashion & Beauty'
        elif re.search(r'Construction|Building|Infrastructure|Real Estate|PropTech|Commercial Real Estate|Property Management|Rental Property|Housing|Home Services|Interior Design', sector):
            return 'Construction & Real Estate'
        elif re.search(r'HR|Human Resources|Staffing|Recruitment|HRTech', sector):
            return 'Human Resources'
        elif re.search(r'Finance|Financial Services|FinTech|Mobile Payments|Payments|Insurance|Insurance Tech|InsureTech|Insurtech|Personal Finance|Wealth Management|Investment|Mutual Funds|Investment Banking|Venture Capital', sector):
            return 'Finance'
        else:
            return 'Others'
    else:
        return 'Others'
    

# Apply the sector redistribution function to create a new column
merged_data['redistributed_sector'] = merged_data['sector'].apply(sector_redistribution)
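
The function above matches case-sensitive substrings, so a short token such as 'AI' or 'HR' can match inside a longer word, and a spelling such as 'Edtech' does not match 'EdTech'. The sketch below shows one alternative layout: a keyword table compiled with word boundaries and case-insensitive matching. The table is a shortened, hypothetical subset, the name `categorize_sector` is an assumption, and the results will not be identical to those of `sector_redistribution`.

```python
import re

# Hypothetical, shortened keyword table; order matters because the first match wins
SECTOR_KEYWORDS = [
    ("Finance", ["FinTech", "Financial Services", "Lending", "Banking", "Insurance", "Payments"]),
    ("Technology & IT", ["SaaS", "AI", "Software", "Information Technology"]),
    ("Human Resources", ["HR", "Human Resources", "Staffing", "Recruitment"]),
    ("Education", ["EdTech", "Education", "E-Learning"]),
]

# \b keeps short tokens such as 'AI' or 'HR' from matching inside longer words
SECTOR_PATTERNS = [
    (label, re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE))
    for label, words in SECTOR_KEYWORDS
]


def categorize_sector(sector):
    """Return the first category whose keywords match, else 'Others'."""
    if not isinstance(sector, str):
        return "Others"
    for label, pattern in SECTOR_PATTERNS:
        if pattern.search(sector):
            return label
    return "Others"


print(categorize_sector("Financial Services, FinTech"))
print(categorize_sector("Edtech"))
```

Keeping the keywords in a table also makes duplicated or unreachable entries easier to spot than in a long `if`/`elif` chain.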

  


#### Confirm sector redistribution

In [228]:
# Display the DataFrame with the new column
merged_data.head(5)  

Unnamed: 0,company_name,sector,stage,funding_amount,location,description,Founded,Founders,Investor,redistributed_sector
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,Media & Advertising
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,Agriculture
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,Finance
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,Finance
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,E-Commerce & Retail


In [229]:
merged_data.tail(5)

Unnamed: 0,company_name,sector,stage,funding_amount,location,description,Founded,Founders,Investor,redistributed_sector
2874,Gigforce,Staffing & Recruiting,Pre-series A,$3000000,Gurugram,A gig/on-demand staffing company.,2019.0,"Chirag Mittal, Anirudh Syal",Endiya Partners,Human Resources
2875,Vahdam,Food & Beverages,Series D,$20000000,New Delhi,VAHDAM is among the world’s first vertically i...,2015.0,Bala Sarda,IIFL AMC,Food & Beverage
2876,Leap Finance,Financial Services,Series C,$55000000,Bangalore,International education loans for high potenti...,2019.0,"Arnav Kumar, Vaibhav Singh",Owl Ventures,Finance
2877,CollegeDekho,EdTech,Series B,$26000000,Gurugram,"Collegedekho.com is Student’s Partner, Friend ...",2015.0,Ruchir Arora,"Winter Capital, ETS, Man Capital",Education
2878,WeRize,Financial Services,Series A,$8000000,Bangalore,India’s first socially distributed full stack ...,2019.0,"Vishal Chopra, Himanshu Gupta","3one4 Capital, Kalaari Capital",Finance


#### stage column 

In [230]:
# Print unique values in the "stage" column
print(merged_data['stage'].unique())


['Seed' 'Series A' 'Angel' 'Series B' 'Pre-Seed' 'Private Equity'
 'Venture - Series Unknown' 'Grant' 'Debt Financing' 'Post-IPO Debt'
 'Series H' 'Series C' 'Series E' 'Corporate Round' 'Undisclosed'
 'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593'
 'Series D' 'Secondary Market' 'Post-IPO Equity' 'Non-equity Assistance'
 'Funding Round' nan 'Fresh funding' 'Pre series A' 'Series G'
 'Post series A' 'Seed funding' 'Seed fund' 'Series F' 'Series B+'
 'Seed round' 'Pre-series A' 'Pre-seed' 'Pre-series' 'Debt' 'Pre-series C'
 'Pre-series B' 'Bridge' 'Series B2' 'Pre- series A' 'Edge' 'Pre-Series B'
 'Seed A' 'Series A-1' 'Seed Funding' 'Pre-seed Round'
 'Seed Round & Series A' 'Pre Series A' 'Pre seed Round' 'Angel Round'
 'Pre series A1' 'Series E2' 'Seed Round' 'Bridge Round' 'Pre seed round'
 'Pre series B' 'Pre series C' 'Seed Investment' 'Series D1' 'Mid series'
 'Series C, D' '$1200000' 'Seed+' 'Series F2' 'Series A+' 'Series

#### Swap misplaced entries into the correct columns

In [303]:
# Rows where stage should be moved to funding_amount and vice versa
rows_to_swap = [1768, 1781, 2221, 2344, 2208, 2347, 2818]

# Print the rows before swapping
#print("Before swapping:")
#print(merged_data.loc[rows_to_swap])

# Swap the values between stage and funding_amount for the identified rows
for row in rows_to_swap:
    stage_value = merged_data.loc[row, 'stage']
    funding_amount_value = merged_data.loc[row, 'funding_amount']
    merged_data.loc[row, 'stage'] = funding_amount_value
    merged_data.loc[row, 'funding_amount'] = stage_value
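
The same swap can also be written without a loop. The sketch below uses a toy DataFrame (values chosen to resemble the rows above) and flags rows with an assumed heuristic: a 'stage' value that looks like a dollar amount. This heuristic would only find part of the hand-picked index list used in the notebook, so it is an illustration and not a replacement for manual inspection.

```python
import pandas as pd

demo = pd.DataFrame({
    "stage": ["$1200000", "Seed", "$300000"],
    "funding_amount": ["Upsparks", "$500000", "Angel"],
})

# Flag rows whose 'stage' looks like a dollar amount (an assumed heuristic)
misplaced = demo["stage"].str.match(r"^\$\d+$", na=False)

# Swap the two columns for the flagged rows; .to_numpy() stops pandas from
# re-aligning the values by column label, which would undo the swap
demo.loc[misplaced, ["stage", "funding_amount"]] = (
    demo.loc[misplaced, ["funding_amount", "stage"]].to_numpy()
)
print(demo)
```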

#### Confirm swapped values

In [304]:
# List of specific row indices to be printed
indices_to_print = [1768, 1781, 2221, 2344, 2208, 2347, 2818]

# Select the relevant columns and rows
selected_rows = merged_data.loc[indices_to_print, ['stage', 'funding_amount']]

# Print the DataFrame
print(selected_rows)

                               stage funding_amount
1768                        Upsparks       $1200000
1781                        Upsparks       $1200000
2221  Ito angel network, letsventure        $300000
2344                         Unknown       $6000000
2208                           Angel        $300000
2347                           Angel       $1000000
2818                            Seed            NaN


#### Clean and standardize 'stage' column in DataFrame using these regex patterns

In [295]:
# dictionary of regex corrections used to clean the funding stage
Stage_corrections = {(r'Ang.+'): 'Angel', (r'Brid.+'): 'Bridge', (r'Debt.*'): 'Debt Financing', \
               (r'(Ear.+|Seed.*)'): 'Seed', (r'(S.+ies\sa.*)'): 'Series A',\
              (r'(S.+ies\sb.*)'): 'Series B', (r'(S.+ies\sc.*)'): 'Series C',\
              (r'(S.+ies\sd.*)'): 'Series D', (r'(S.+ies\sf.*)'): 'Series F',\
               (r'(S.+ies\se.*)'): 'Series E', (r'(S.+ies\sg.*)'): 'Series G', (r'(S.+ies\sh.*)'): 'Series H',\
              (r'(Htt.*|Ven.+|^$)'): 'Unknown', (r'(P.+ies\sa.*)'): 'Pre Series A',\
              (r'(P.+ies\sb.*)'): 'Pre Series B', (r'(P.+ies\sc.*)'): 'Pre Series C',\
              (r'(.*(f|F)un.+)'): 'Fresh Funding', (r'(.+seed.*)'): 'Pre Seed', np.nan: 'Unknown', (r'^Pe$'): 'Private equity'}

In [296]:
# Capitalize the stage values, then apply the regex corrections
merged_data['stage'] = merged_data['stage'].str.capitalize().replace(Stage_corrections, regex=True)
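
For comparison, the sketch below shows a different, standard-library approach to the same clean-up: an ordered rule list where the first pattern that matches the whole value wins. It is not a reimplementation of the pandas `replace` call above and will not give identical results. The rule list is a shortened, hypothetical subset and `normalize_stage` is an assumed name.

```python
import re

# Hypothetical, shortened rule list; more specific patterns come first
STAGE_RULES = [
    (re.compile(r"pre[\s-]*seed.*", re.IGNORECASE), "Pre Seed"),
    (re.compile(r"pre[\s-]*series\s*a.*", re.IGNORECASE), "Pre Series A"),
    (re.compile(r"seed.*", re.IGNORECASE), "Seed"),
    (re.compile(r"series\s*a.*", re.IGNORECASE), "Series A"),
    (re.compile(r"angel.*", re.IGNORECASE), "Angel"),
]


def normalize_stage(value):
    """Map a raw stage label to a standard one; unmatched labels are returned unchanged."""
    if not isinstance(value, str) or not value.strip():
        return "Unknown"
    text = value.strip()
    for pattern, label in STAGE_RULES:
        if pattern.fullmatch(text):
            return label
    return text


for raw in ["Pre- series A", "Pre seed Round", "Seed funding", "Series A-1", None]:
    print(repr(raw), "->", normalize_stage(raw))
```

Matching the whole value with `fullmatch` avoids rewriting only part of a label, and the explicit ordering makes it clear which rule applies when several could match.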

In [297]:
# get the number of unique stages
stages = merged_data['stage'].sort_values().unique()
print(len(stages))
stages

32


array(['Angel', 'Bridge', 'Corporate round', 'Debt Financing', 'Edge',
       'Fresh Funding', 'Grant', 'Ito angel network, letsventure',
       'Mid series', 'Non-equity assistance', 'Post-ipo debt',
       'Post-ipo equity', 'Pre Seed', 'Pre Series A', 'Pre Series B',
       'Pre Series C', 'Pre-series', 'Private equity', 'Secondary market',
       'Seed', 'Series A', 'Series B', 'Series C', 'Series D', 'Series E',
       'Series F', 'Series G', 'Series H', 'Series i', 'Undisclosed',
       'Unknown', 'Upsparks'], dtype=object)