## **Step 1: Problem Framing**

### Objective

This project analyzes the factors that influence startup growth and success in India using real-world funding data. The goal is to identify what drives a startup’s success (for example location, sector, or funding patterns) and to surface actionable insights for new founders and investors.

---

### Defining “Startup Success”

In this project, we’ll define **startup success** based on **funding performance**, since our dataset primarily contains funding-related attributes.

> **Definition:**
> A startup is considered **successful** if it has raised **more than ₹10 Crores (about USD 1.14 million at the ₹88/USD rate used in Step 2)** in total funding.

This definition is chosen because:

* Funding represents external investor confidence and growth potential.
* The dataset lacks variables like employee count or survival years.
* It’s measurable and aligns with typical early-stage success indicators.

We will convert the `Amount in USD` column to INR (₹) for consistency and classify startups by total funding as:

* **Successful:** Total funding > ₹10 Cr
* **Not yet successful:** Total funding ≤ ₹10 Cr
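
The classification above can be sketched as follows. This is a minimal illustration, not the project's final code: it assumes one row per funding round, an exchange rate of ₹88 per USD (the rate used later in Step 2), and made-up toy rows. The names `USD_TO_INR`, `THRESHOLD_INR`, and `rounds` are hypothetical.

```python
import pandas as pd

USD_TO_INR = 88             # assumed exchange rate
THRESHOLD_INR = 10 * 1e7    # 10 crore = 100,000,000 INR

# Toy data: one row per funding round (values are illustrative only)
rounds = pd.DataFrame({
    "Startup Name": ["A", "A", "B", "C"],
    "Amount in USD": [600_000, 900_000, 200_000, 1_500_000],
})

# Sum all rounds per startup, then convert the total to INR
total_inr = rounds.groupby("Startup Name")["Amount in USD"].sum() * USD_TO_INR

# Label each startup against the 10 crore threshold
labels = (total_inr > THRESHOLD_INR).map(
    {True: "Successful", False: "Not yet successful"}
)
print(labels)
```

Summing per startup before applying the threshold matters because the definition refers to total funding, while each row of the dataset records a single round.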

---

### Hypotheses to Test

We’ll test the following **3 hypotheses**:

1. **H1: Startups in Tier-1 cities (like Bengaluru, Mumbai, and Delhi NCR) raise higher average funding than those in Tier-2/3 cities.**
   *Rationale:* Tier-1 cities have stronger investor ecosystems, accelerators, and tech hubs.

2. **H2: Startups in the FinTech and E-Commerce sectors attract more funding compared to other sectors.**
   *Rationale:* These industries have shown rapid growth and investor interest in India.

3. **H3: The number of funding rounds is positively correlated with the total amount raised.**
   *Rationale:* Startups that receive multiple rounds usually demonstrate growth and investor trust.
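
As a rough sketch of how each hypothesis could be checked once the data is clean, the snippet below uses toy values and simplified column names (`City Tier`, `Industry Vertical`, `Amount`), all of which are assumptions for illustration. For H3 it correlates ranks, which is a Spearman-style measure and less sensitive to a few very large rounds than a plain Pearson correlation.

```python
import pandas as pd

# Toy data: one row per funding round (values are illustrative only)
rounds = pd.DataFrame({
    "Startup Name": ["A", "A", "A", "B", "B", "C", "D"],
    "City Tier": ["Tier 1", "Tier 1", "Tier 1", "Tier 1", "Tier 1",
                  "Tier 2/3", "Tier 2/3"],
    "Industry Vertical": ["FinTech", "FinTech", "FinTech", "E-Commerce",
                          "E-Commerce", "Logistics", "EdTech"],
    "Amount": [5.0, 10.0, 20.0, 4.0, 6.0, 2.0, 1.0],
})

# H1: average funding per round by city tier
h1 = rounds.groupby("City Tier")["Amount"].mean()

# H2: total funding by sector, largest first
h2 = (rounds.groupby("Industry Vertical")["Amount"]
            .sum()
            .sort_values(ascending=False))

# H3: rounds per startup vs. total raised, correlated on ranks
per_startup = rounds.groupby("Startup Name")["Amount"].agg(
    n_rounds="count", total="sum"
)
h3 = per_startup["n_rounds"].rank().corr(per_startup["total"].rank())

print(h1, h2, h3, sep="\n\n")
```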

---

### Expected Outcome

We expect to find:

* Tier-1 city startups dominating funding totals.
* FinTech and E-Commerce sectors leading in capital raised.
* A clear positive correlation between number of rounds and total funding.

---

## **Step 2: Data Cleaning & Preparation**

In [52]:
# Import the dataset and take a first look
import pandas as pd

# Load the raw file; malformed rows are skipped
df = pd.read_csv('startup_funding.csv', encoding='utf-8', engine='python', on_bad_lines='skip')

# Escape non-ASCII characters in every string cell (e.g. ’ becomes \u2019, a non-breaking space becomes \xa0);
# the cleaning steps below strip these escaped sequences
df = df.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)

print(df.head())

   Sr No Date dd/mm/yyyy                  Startup Name    Industry Vertical  \
0      1      09/01/2020                   BYJU\u2019S               E-Tech   
1      2      13/01/2020                        Shuttl       Transportation   
2      3      09/01/2020                     Mamaearth           E-commerce   
3      4      02/01/2020  https://www.wealthbucket.in/              FinTech   
4      5      02/01/2020                        Fashor  Fashion and Apparel   

                             SubVertical City  Location  \
0                             E-learning      Bengaluru   
1              App based shuttle service        Gurgaon   
2  Retailer of baby and toddler products      Bengaluru   
3                      Online Investment      New Delhi   
4            Embroiled Clothes For Women         Mumbai   

              Investors Name       InvestmentnType Amount in USD Remarks  
0    Tiger Global Management  Private Equity Round  20,00,00,000     NaN  
1  Susquehanna Growt


In [53]:
# Data types of columns
df.dtypes

Column,Dtype
Sr No,int64
Date dd/mm/yyyy,object
Startup Name,object
Industry Vertical,object
SubVertical,object
City Location,object
Investors Name,object
InvestmentnType,object
Amount in USD,object
Remarks,object


In [54]:
# Getting familiar with the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Sr No              3044 non-null   int64 
 1   Date dd/mm/yyyy    3044 non-null   object
 2   Startup Name       3044 non-null   object
 3   Industry Vertical  2873 non-null   object
 4   SubVertical        2108 non-null   object
 5   City  Location     2864 non-null   object
 6   Investors Name     3020 non-null   object
 7   InvestmentnType    3040 non-null   object
 8   Amount in USD      2084 non-null   object
 9   Remarks            419 non-null    object
dtypes: int64(1), object(9)
memory usage: 237.9+ KB


In [55]:
# Summary statistics before cleaning
df.describe()

Statistic,Sr No
count,3044.0
mean,1522.5
std,878.871435
min,1.0
25%,761.75
50%,1522.5
75%,2283.25
max,3044.0


In [56]:
# Checking for missing values in every column
df.isna().sum()

Column,Missing values
Sr No,0
Date dd/mm/yyyy,0
Startup Name,0
Industry Vertical,171
SubVertical,936
City Location,180
Investors Name,24
InvestmentnType,4
Amount in USD,960
Remarks,2625


In [57]:
# Checking duplicates
df.duplicated().sum()

np.int64(0)

In [58]:
# The analysis centers on startups and funding amounts, so drop rows missing the startup name or the amount
df.dropna(subset=['Startup Name', 'Amount in USD'], inplace=True)
df.shape

(2084, 10)

In [59]:
import numpy as np

# Rename columns and convert data types
df = df.rename(columns={'Date dd/mm/yyyy':'Date','InvestmentnType':'Investment Type'})
# Unparsable dates become NaT and are then back-filled from the next valid row
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
df['Date'] = df['Date'].bfill()

# Keep only digits and decimal points; entries left with no characters become NaN
df['Amount in USD'] = (
    df['Amount in USD']
    .apply(lambda x: ''.join(ch for ch in str(x) if ch.isdigit() or ch == '.'))
    .replace('', np.nan)
    .astype(float)
)


In [60]:
# Drop the rows whose amount could not be parsed into a number in the previous step
df.dropna(subset=['Startup Name', 'Amount in USD'], inplace=True)
df.shape

(2077, 10)
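
The amount parsing above can also be written with vectorized string methods and `pd.to_numeric`. The sketch below is an alternative under assumptions, not the notebook's method: apart from `20,00,00,000`, which appears in the raw data, the sample values are invented to show the edge cases.

```python
import pandas as pd

# Sample raw values; only the first one is taken from the dataset
raw = pd.Series(["20,00,00,000", "undisclosed", "14,342,000+", "1.5.0"])

# Keep only digits and dots
digits = raw.str.replace(r"[^\d.]", "", regex=True)

# errors="coerce" turns anything unparsable (empty strings, "1.5.0") into NaN
amount = pd.to_numeric(digits, errors="coerce")
print(amount)
```

With `errors="coerce"`, a malformed entry such as `1.5.0` becomes `NaN` instead of raising a `ValueError`, which is what `.astype(float)` would do.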

In [61]:
# Convert the amount to thousands of USD to keep the value range manageable
df['Amount in USD'] = (df['Amount in USD'] / 10**3).round(2)

# Assumed exchange rate: 1 USD = 88 INR
df.insert(loc=9, column='Amount in INR', value=(df['Amount in USD'] * 88).round(2))

In [62]:
df = df.rename(columns={'City  Location': 'City Location'})  # the original name contained two spaces

# Filling missing values of categorical columns
missing_values_columns = ['Industry Vertical','SubVertical','City Location','Investors Name','Investment Type']
df[missing_values_columns] = df[missing_values_columns].fillna('Unknown')
df['Remarks'] = df['Remarks'].fillna('No Remark')

In [63]:
# Mark both amount columns as being in thousands (K)
df.rename(columns={'Amount in USD': 'Amount in USD(K)', 'Amount in INR': 'Amount in INR(K)'}, inplace=True)

In [64]:
# Rechecking data type conversion
df.dtypes

Column,Dtype
Sr No,int64
Date,datetime64[ns]
Startup Name,object
Industry Vertical,object
SubVertical,object
City Location,object
Investors Name,object
Investment Type,object
Amount in USD(K),float64
Amount in INR(K),float64


In [65]:
# Number of unique values before cleaning
df['City Location'].nunique()

92

In [67]:
# Clean 'City Location' thoroughly
df['City Location'] = (
    df['City Location']
    .astype(str)
    # Remove real non-breaking spaces and invisible Unicode
    .str.replace(u'\xa0', '', regex=False)
    # Remove encoded versions (\xc2\xa0 or Xc2Xa0)
    .str.replace(r'\\xc2|\\xa0|xc2xa0|Xc2Xa0|Xcxa', '', regex=True)
    # Remove unnecessary special characters
    .str.replace(r'[^\w\s,/&-]', '', regex=True)
    .str.strip()
)

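# If multiple locations exist, keep only the first.
# Caution: the bare 'and' in this pattern also matches inside names (e.g. 'Trivandrum' is cut to 'Triv'),
# which is why fragments such as 'Triv' and 'Ch' are mapped back to full names below.
df['City Location'] = df['City Location'].str.split('/|,|&|and').str[0].str.strip()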

# Standardize common variants
df['City Location'] = df['City Location'].replace({
    'Bangalore': 'Bengaluru',
    'bangalore': 'Bengaluru',
    'Delhi': 'New Delhi',
    'new Delhi': 'New Delhi',
    'Gurugram': 'Gurgaon',
    'gurgaon': 'Gurgaon',
    'Bombay': 'Mumbai',
    'mumbai': 'Mumbai',
    'Bhubneswar': 'Bhubaneswar',
    'Triv': 'Trivandrum',
    'Ch': 'Chennai',
    'Sfo': 'San Francisco',
    'Us': 'USA',
    'Usa': 'USA',
    'Ahemadabad': 'Ahmedabad'
})

# Apply consistent capitalization
df['City Location'] = df['City Location'].str.title()

# Replace any leftover blanks or unknowns
df['City Location'] = df['City Location'].replace('', 'Unknown')

# Verify result
print(f"Number of unique values after cleaning: {df['City Location'].nunique()}\n")
print("Unique values:\n", df['City Location'].unique())

Number of unique values after cleaning: 57

Unique values:
 ['Bengaluru' 'Gurgaon' 'New Delhi' 'Mumbai' 'Chennai' 'Pune' 'Noida'
 'Faridabad' 'San Francisco' 'San Jose' 'Amritsar' 'Kormangala' 'Tulangan'
 'Hyderabad' 'Burnsville' 'Menlo Park' 'Palo Alto' 'Santa Monica'
 'Singapore' 'Taramani' 'Andheri' 'Chembur' 'Nairobi' 'Haryana' 'New York'
 'Karnataka' 'Bhopal' 'India' 'Jaipur' 'Nagpur' 'Indore' 'California'
 'Ahmedabad' 'Rourkela' 'Bhubaneswar' 'Kolkata' 'Coimbatore' 'Udaipur'
 'Surat' 'Goa' 'Gaya' 'Vadodara' 'Missourie' 'Panaji' 'Gwalior' 'Kerala'
 'Unknown' 'Lucknow' 'Trivandrum' 'Varanasi' 'Kanpur' 'Jodhpur' 'Boston'
 'Usa' 'Belgaum' 'Dallas' 'Udupi']
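
The split pattern in the cell above contains a bare `and`, so it also cuts names that merely contain those letters: a value such as `Chandigarh` would be reduced to `Ch`, which the replacement map then turns into `Chennai`. A minimal sketch of a word-boundary version, on made-up inputs (the `regex` argument of `str.split` needs pandas 1.4 or later):

```python
import pandas as pd

# Made-up inputs that exercise the separator logic
cities = pd.Series(["Chandigarh", "Trivandrum", "Pune and Mumbai", "Noida/Delhi"])

# Pattern from the cell above: 'and' matches inside words too
naive = cities.str.split("/|,|&|and", regex=True).str[0].str.strip()

# \band\b only matches 'and' as a standalone word
safe = cities.str.split(r"/|,|&|\band\b", regex=True).str[0].str.strip()

print(naive.tolist())
print(safe.tolist())
```

With the word-boundary pattern the `Triv` and `Ch` entries in the replacement map would no longer be needed, although the unique-value count reported above would change.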


In [68]:
# Define city tiers
tier_1_india = ['Bengaluru', 'Mumbai', 'New Delhi', 'Chennai', 'Pune', 'Hyderabad', 'Gurgaon', 'Noida', 'Faridabad']
tier_1_international = ['San Francisco', 'New York', 'Singapore', 'San Jose', 'Menlo Park', 'Santa Monica', 'Boston', 'Dallas', 'Burnsville', 'Palo Alto']
# Tier 2/3 Indian cities and localities; any city not listed here (e.g. Kanpur, Jodhpur) falls into 'Other'
tier_2_3_india = [
    'Jaipur', 'Lucknow', 'Coimbatore', 'Indore', 'Ahmedabad', 'Bhopal', 'Nagpur', 'Udaipur',
    'Surat', 'Goa', 'Gaya', 'Vadodara', 'Panaji', 'Gwalior', 'Belgaum', 'Udupi', 'Bhubaneswar',
    'Kolkata', 'Amritsar', 'Trivandrum', 'Kormangala', 'Tulangan', 'Taramani', 'Andheri', 'Chembur'
]

# Function to assign tier
def assign_tier(city):
    if city in tier_1_india:
        return 'Tier 1 India'
    elif city in tier_1_international:
        return 'Tier 1 International'
    elif city in tier_2_3_india:
        return 'Tier 2/3 India'
    else:
        return 'Other'

# Assign city a tier in new column
df['City Tier'] = df['City Location'].apply(assign_tier)

df[['City Location', 'City Tier']].head(10)


Index,City Location,City Tier
0,Bengaluru,Tier 1 India
1,Gurgaon,Tier 1 India
2,Bengaluru,Tier 1 India
3,New Delhi,Tier 1 India
4,Mumbai,Tier 1 India
5,Chennai,Tier 1 India
6,Gurgaon,Tier 1 India
7,Pune,Tier 1 India
8,Gurgaon,Tier 1 India
9,Bengaluru,Tier 1 India
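
Because `assign_tier` sends every unlisted city to `Other`, it can help to list which real cities ended up there. The sketch below shows one way to do that with a dictionary lookup; the shortened `tier_map` and the sample cities are illustrative assumptions, not the full lists from the cell above.

```python
import pandas as pd

# Shortened, illustrative mapping (the real lists are longer)
tier_map = {"Bengaluru": "Tier 1 India", "Jaipur": "Tier 2/3 India"}
cities = pd.Series(["Bengaluru", "Jaipur", "Kanpur", "Unknown"])

# map() returns NaN for cities missing from the dictionary
tiers = cities.map(tier_map).fillna("Other")

# Named cities that silently fell through to 'Other'
unlisted = sorted(cities[tiers.eq("Other") & cities.ne("Unknown")].unique())
print(unlisted)  # ['Kanpur']
```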


In [69]:
# Number of unique values before cleaning
df['Industry Vertical'].nunique()

595

In [70]:
# Clean and standardize 'Industry Vertical' column
col = 'Industry Vertical'

# Convert to string and remove encoded artifacts
df[col] = df[col].astype(str).str.replace(r'\\xc2|\\xa0|\\\\n', '', regex=True).str.strip()

# Fix casing inconsistencies
df[col] = df[col].str.title().str.strip()

# Remove unwanted symbols and extra spaces
df[col] = df[col].str.replace(r'[^\w\s&/-]', '', regex=True)

# Standardize common variants and misspellings
df[col] = df[col].replace({
    'Ecommece': 'E-Commerce',
    'Ecommerce': 'E-Commerce',
    'E-Commerce Platform': 'E-Commerce',
    'Ecommerce Platform': 'E-Commerce',
    'Ecommerce Marketplace': 'E-Commerce',
    'Ecomm': 'E-Commerce',
    'Fintech': 'FinTech',
    'Fin-Tech': 'FinTech',
    'Financial Tech': 'FinTech',
    'Financial Technology': 'FinTech',
    'Nbfc': 'NBFC',
    'Fiinance': 'Finance',
    'Health And Wellness': 'Health & Wellness',
    'Healthcare': 'Health & Wellness',
    'Health Care': 'Health & Wellness',
    'Edtech': 'EdTech',
    'Ed-Tech': 'EdTech',
    'Education': 'EdTech',
    'E Tech': 'EdTech',
    'Saas': 'SaaS',
    'Saas Platform': 'SaaS',
    'Tech': 'Technology',
    'It': 'Technology',
    'Artificial Intelligence Platform': 'Artificial Intelligence',
    'Clean-Tech': 'CleanTech',
    'Agtech': 'AgriTech',
    'Reality': 'Real Estate'
})

# Simplify long descriptive phrases into main industry categories
df[col] = df[col].replace({
    r'.*Food.*Delivery.*': 'Food Delivery',
    r'.*Grocery.*Delivery.*': 'Grocery Delivery',
    r'.*Food.*Tech.*': 'Food Tech',
    r'.*Food[- ]?And[- ]?Beverages.*': 'Food & Beverages',
    r'.*E-Learning.*': 'EdTech',
    r'.*Online.*Education.*': 'EdTech',
    r'.*Education.*Platform.*': 'EdTech',
    r'.*Online.*Marketplace.*': 'E-Commerce',
    r'.*E-Commerce.*': 'E-Commerce',
    r'.*Real[- ]?Estate.*': 'Real Estate',
    r'.*Hospitality.*': 'Hospitality',
    r'.*Logistics.*': 'Logistics',
    r'.*Travel.*Tech.*': 'Travel Tech',
    r'.*Health.*Platform.*': 'Health & Wellness',
    r'.*Healthcare.*': 'Health & Wellness',
    r'.*Fintech.*': 'FinTech',
    r'.*Hyperlocal.*': 'Hyperlocal Services',
    r'.*Agriculture.*': 'AgriTech',
    r'.*Clean.*Energy.*': 'CleanTech',
    r'.*Energy.*': 'Energy'
}, regex=True)

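# Final cleanup; note that title-casing again lowercases inner capitals,
# so 'FinTech', 'EdTech', 'SaaS' and 'NBFC' end up as 'Fintech', 'Edtech', 'Saas' and 'Nbfc'
df[col] = df[col].str.strip().str.title()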

# Number of unique values after cleaning
print("Number of unique values after cleaning: ",df['Industry Vertical'].nunique())
print('\n')

# Check cleaned unique values
print(df[col].unique())


Number of unique values after cleaning:  469


['E-Tech' 'Transportation' 'E-Commerce' 'Fintech' 'Fashion And Apparel'
 'Logistics' 'Hospitality' 'Technology' 'Aerospace'
 'B2B-Focused Foodtech Startup' 'Finance' 'Video' 'Gaming' 'Software'
 'Health & Wellness' 'Edtech' 'B2B Marketing' 'Video Games' 'Saas'
 'Last Mile Transportation' 'Customer Service' 'B2B'
 'Advertising Marketing' 'Iot' 'Information Technology'
 'Consumer Technology' 'Accounting' 'Retail' 'Customer Service Platform'
 'Automotive' 'Consumer Goods' 'Food And Beverage' 'Compliance'
 'Transport' 'Artificial Intelligence' 'Luxury Label'
 'Waste Management Service' 'Deep-Tech' 'Agritech' 'Energy'
 'Digital Media' 'Services' 'Automobile' 'Social Media' 'Nanotechnology'
 'Services Platform' 'Travel Tech' 'Saas Ecommerce' 'Nbfc' 'Food'
 'Food Tech' 'Automation' 'Investment' 'Real Estate' 'Consumer Internet'
 'B2B Platform' 'Cleantech' 'Food & Beverages' 'Media' 'Publishing'
 'Entertainment' 'Inspiration' 'Lifestyle' 'Food & B

In [71]:
# Clean the 'Investors Name' column
col = 'Investors Name'

# Convert to string and clean all encoded/backslash artifacts.
# Caution: once the backslashes are removed, the bare 'n' alternative in the second
# pattern deletes every letter n, so 'Management' shows up as 'Maagemet' in later outputs.
df[col] = (
    df[col]
    .astype(str)
    .str.replace(r'\\+', '', regex=True)          # remove any number of backslashes
    .str.replace(r'xc2|xa0|n', '', regex=True)    # remove leftover encoded characters (and every 'n')
    .str.replace(r'\s+', ' ', regex=True)         # normalize multiple spaces
    .str.strip()
)
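
The second pattern in the cell above matches a bare `n`, so it removes that letter from every investor name. One way to avoid this is to match each escaped sequence together with its leading backslashes. The sketch below assumes the artifacts look like the literal text `\xc2`, `\xa0`, and `\n`; the sample names are illustrative.

```python
import pandas as pd

# Literal backslash sequences, as they would appear after unicode_escape
names = pd.Series([
    "Tiger Global Management",
    "Sequoia\\xc2\\xa0Capital India",
    "Kalaari Capital\\n",
])

cleaned = (
    names
    # escaped sequence including its backslashes; replaced by a space
    .str.replace(r"\\+(?:xc2|xa0|n)", " ", regex=True)
    # any stray backslashes that remain
    .str.replace(r"\\+", "", regex=True)
    # collapse repeated whitespace
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(cleaned.tolist())
```

Replacing with a space rather than an empty string keeps words apart when the artifact stood in for a non-breaking space.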

In [72]:
# Number of unique values before cleaning
df['Investment Type'].nunique()

51

In [74]:
# Column name
col = 'Investment Type'

# Convert to string, remove artifacts, and strip spaces
df[col] = df[col].astype(str).str.replace(r'\\n', '', regex=True).str.strip()

# Standardize common investment types
df[col] = df[col].replace({
    # Pre-Series A variants
    'Pre-series A': 'Pre-Series A',
    'pre-series A': 'Pre-Series A',

    # Seed funding variants
    'Seed': 'Seed Funding',
    'Seed Round': 'Seed Funding',
    'Seed Funding Round': 'Seed Funding',
    'Seed/Angel Funding': 'Seed/Angel Funding',
    'Angel / Seed Funding': 'Seed/Angel Funding',
    'Seed\\Funding': 'Seed Funding',
    'Seed\\\\\\Funding':'Seed Funding',

    # Private equity
    'Private Equity': 'Private Equity',
    'Private Equity Round': 'Private Equity',
    'Private\\Equity': 'Private Equity',
    'PrivateEquity': 'Private Equity',
    'Private Funding': 'Private Equity',
    'Private': 'Private Equity',
    'Private\\\\\\Equity':'Private Equity',


    # Debt
    'Debt': 'Debt Funding',
    'Debt Funding': 'Debt Funding',
    'Debt And Preference Capital': 'Debt Funding',
    'Structured Debt': 'Debt Funding',
    'Term Loan': 'Debt Funding',

    # Equity
    'Equity': 'Equity Funding',
    'Equity Funding': 'Equity Funding',

    # Series rounds
    'Series A': 'Series A',
    'Series B': 'Series B',
    'Series B (Extension)': 'Series B',
    'Series C': 'Series C',
    'Series D': 'Series D',
    'Series E': 'Series E',
    'Series F': 'Series F',
    'Series G': 'Series G',
    'Series H': 'Series H',
    'Series J': 'Series J',

    # Angel
    'Angel': 'Angel',
    'Angel Round': 'Angel',

    # Other types
    'Maiden Round': 'Maiden Round',
    'Funding Round': 'Funding Round',
    'Single Venture': 'Single Venture',
    'Corporate Round': 'Corporate Round',
    'Venture Round': 'Venture Round',
    'Venture - Series Unknown': 'Venture - Series Unknown',
    'Inhouse Funding': 'Inhouse Funding',
    'Mezzanine': 'Mezzanine',
    'Crowd Funding': 'Crowd Funding',
    'Unknown': 'Unknown'
})

# Title case for consistency
df[col] = df[col].str.title()

# Number of unique values after cleaning
print("Number of unique values after cleaning: ",df['Investment Type'].nunique())
print('\n')

# Check cleaned unique values
print(df[col].unique())


Number of unique values after cleaning:  31


['Private Equity' 'Series C' 'Series B' 'Pre-Series A' 'Seed Funding'
 'Series A' 'Series D' 'Series F' 'Series E' 'Debt Funding' 'Series G'
 'Series H' 'Unknown' 'Funding Round' 'Maiden Round' 'Single Venture'
 'Corporate Round' 'Venture Round' 'Angel' 'Series J'
 'Venture - Series Unknown' 'Inhouse Funding' 'Seed/ Angel Funding'
 'Equity Funding' 'Debt-Funding' 'Mezzanine' 'Equity Based Funding'
 'Seed / Angel Funding' 'Seed/Angel Funding' 'Seed / Angle Funding'
 'Crowd Funding']
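
The output above still lists several spellings of the same type (`Seed/ Angel Funding`, `Seed / Angel Funding`, `Seed / Angle Funding`, `Debt-Funding`). A possible follow-up pass, sketched here on those values only, normalizes spacing around the slash before applying a small map; which labels should be merged is an assumption.

```python
import pandas as pd

types = pd.Series([
    "Seed/ Angel Funding", "Seed / Angel Funding", "Seed / Angle Funding",
    "Seed/Angel Funding", "Debt-Funding",
])

normalized = (
    types
    # drop spaces around the slash so all variants share one spelling
    .str.replace(r"\s*/\s*", "/", regex=True)
    # fix the 'Angle' typo and the hyphenated debt label
    .replace({
        "Seed/Angle Funding": "Seed/Angel Funding",
        "Debt-Funding": "Debt Funding",
    })
)
print(normalized.unique())
```

The hyphen is handled through the map rather than a global replacement so that labels such as `Pre-Series A` stay intact.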


In [75]:
# Renumber serial number after cleaning
# Reset index and start Sr No from 1
df.reset_index(drop=True, inplace=True)
df['Sr No'] = df.index + 1

In [76]:
df.tail()

Index,Sr No,Date,Startup Name,Industry Vertical,SubVertical,City Location,Investors Name,Investment Type,Amount in USD(K),Amount in INR(K),Remarks,City Tier
2072,2073,2015-01-28,Grabhouse.com,Unknown,Unknown,Unknown,"Kalaari Capital, Sequoia Capital",Private Equity,2500.0,220000.0,Series A,Other
2073,2074,2015-01-29,Printvenue,Unknown,Unknown,Unknown,Asia Pacific Iteret Group,Private Equity,4500.0,396000.0,No Remark,Other
2074,2075,2015-01-29,Graphene,Unknown,Unknown,Unknown,KARSEMVEN Fud,Private Equity,825.0,72600.0,Govt backed VC Fund,Other
2075,2076,2015-01-30,Mad Street Den,Unknown,Unknown,Unknown,"Exfiity Fud, GrowX Vetures.",Private Equity,1500.0,132000.0,No Remark,Other
2076,2077,2015-01-31,couponmachine.in,Unknown,Unknown,Unknown,UK based Group of Agel Ivestors,Seed Funding,140.0,12320.0,No Remark,Other


In [77]:
# Rechecking for any missing values
df.isna().sum()

Column,Missing values
Sr No,0
Date,0
Startup Name,0
Industry Vertical,0
SubVertical,0
City Location,0
Investors Name,0
Investment Type,0
Amount in USD(K),0
Amount in INR(K),0


In [78]:
# Save the cleaned dataset for the next step (to_csv returns None when given a path, so nothing is assigned)
df.to_csv('startup_funding_clean.csv', index=False)

In [79]:
# Summary statistics after cleaning
df.describe()

Statistic,Sr No,Date,Amount in USD(K),Amount in INR(K)
count,2077.0,2077,2077.0,2077.0
mean,1039.0,2016-10-22 11:57:34.405392128,22332.17,1965231.0
min,1.0,2015-01-02 00:00:00,0.02,1.76
25%,520.0,2015-10-21 00:00:00,470.0,41360.0
50%,1039.0,2016-08-19 00:00:00,1750.0,154000.0
75%,1558.0,2017-09-05 00:00:00,8200.0,721600.0
max,2077.0,2020-01-13 00:00:00,3900000.0,343200000.0
std,599.722575,,149465.7,13152980.0


In [80]:
# Sample of the cleaned dataset
df.head()

Index,Sr No,Date,Startup Name,Industry Vertical,SubVertical,City Location,Investors Name,Investment Type,Amount in USD(K),Amount in INR(K),Remarks,City Tier
0,1,2020-01-09,BYJU\u2019S,E-Tech,E-learning,Bengaluru,Tiger Global Maagemet,Private Equity,200000.0,17600000.0,No Remark,Tier 1 India
1,2,2020-01-13,Shuttl,Transportation,App based shuttle service,Gurgaon,Susquehaa Growth Equity,Series C,8048.39,708258.32,No Remark,Tier 1 India
2,3,2020-01-09,Mamaearth,E-Commerce,Retailer of baby and toddler products,Bengaluru,Sequoia Capital Idia,Series B,18358.86,1615579.68,No Remark,Tier 1 India
3,4,2020-01-02,https://www.wealthbucket.in/,Fintech,Online Investment,New Delhi,Viod Khatumal,Pre-Series A,3000.0,264000.0,No Remark,Tier 1 India
4,5,2020-01-02,Fashor,Fashion And Apparel,Embroiled Clothes For Women,Mumbai,Sprout Veture Parters,Seed Funding,1800.0,158400.0,No Remark,Tier 1 India


In [81]:
# Basic Statistics
min_usd = df['Amount in USD(K)'].min()
max_usd = df['Amount in USD(K)'].max()
avg_usd = df['Amount in USD(K)'].mean()
print("Minimum Amount in USD(K) : ",min_usd)
print("Maximum Amount in USD(K) : ",max_usd)
print("Average Amount in USD(K) : ",avg_usd)

Minimum Amount in USD(K) :  0.02
Maximum Amount in USD(K) :  3900000.0
Average Amount in USD(K) :  22332.168964853154
