<a href="https://colab.research.google.com/github/AMMLRepos/predict-startup-funding/blob/main/Predicting_funding_amount_for_a_Startup_in_India.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview 
A startup is a company or project undertaken by an individual or a group to address a specific business problem, catering to various industry verticals and users. Startups begin with ideas that can turn into million- or even billion-dollar businesses, creating jobs for skilled workers in the country and bringing efficiency to an industry or business vertical.
Startups in India are growing at a rapid pace with a huge push from Government of India initiatives like Make in India, Startup India, etc. 

India has the 3rd largest startup ecosystem in the world and is expected to witness consistent YoY growth of 12-15%.

India had about 50,000 startups in 2018; around 8,900 – 9,300 of these are technology-led startups. 1,300 new tech startups were born in 2019 alone, implying that 2-3 tech startups are born every day.

**Source** - https://www.startupindia.gov.in/content/sih/en/international/go-to-market-guide/indian-startup-ecosystem.html

# How startups operate
Every business needs funds to run its operations and make more money. Startups may be self-funded, government funded, funded by bigger companies like Microsoft, Facebook, TCS, etc., or funded by financial institutions, stock traders and investors.



# Objective - To predict a funding amount for a startup in India using Machine Learning techniques
- Startup funding gets impacted by various factors like the target industry and vertical, city of operations, investment type, etc. 
- Startup funding is a tough quantity to predict, as it depends on many features, some of which depend on one another

# Data 
We will use an openly available dataset on [Kaggle](https://www.kaggle.com/sudalairajkumar/indian-startup-funding)


# Steps 
- Setup development environment 
- Import required libraries 
- Perform descriptive analytics on data 
- Perform Exploratory data analysis to understand trends and relationships
- Prepare data for training - Encoding, Data cleaning, data splitting (train and test)
- Train the model using preferred algorithm(s)
- Test the model using test data 
- Predict with sample data
- Evaluate model 
- Repeat for improving model performance

## Import required libraries 

In [667]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

In [668]:
startup_df = pd.read_csv("startup_funding.csv")
startup_df

Unnamed: 0,Sr No,Date dd/mm/yyyy,Startup Name,Industry Vertical,SubVertical,City Location,Investors Name,InvestmentnType,Amount in USD,Remarks
0,1,09/01/2020,BYJU’S,E-Tech,E-learning,Bengaluru,Tiger Global Management,Private Equity Round,200000000,
1,2,13/01/2020,Shuttl,Transportation,App based shuttle service,Gurgaon,Susquehanna Growth Equity,Series C,8048394,
2,3,09/01/2020,Mamaearth,E-commerce,Retailer of baby and toddler products,Bengaluru,Sequoia Capital India,Series B,18358860,
3,4,02/01/2020,https://www.wealthbucket.in/,FinTech,Online Investment,New Delhi,Vinod Khatumal,Pre-series A,3000000,
4,5,02/01/2020,Fashor,Fashion and Apparel,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Round,1800000,
5,6,13/01/2020,Pando,Logistics,"Open-market, freight management platform",Chennai,Chiratae Ventures,Series A,9000000,
6,7,10/01/2020,Zomato,Hospitality,Online Food Delivery Platform,Gurgaon,Ant Financial,Private Equity Round,150000000,
7,8,12/12/2019,Ecozen,Technology,Agritech,Pune,Sathguru Catalyzer Advisors,Series A,6000000,
8,9,06/12/2019,CarDekho,E-Commerce,Automobile,Gurgaon,Ping An Global Voyager Fund,Series D,70000000,
9,10,03/12/2019,Dhruva Space,Aerospace,Satellite Communication,Bengaluru,"Mumbai Angels, Ravikanth Reddy",Seed,50000000,


## Perform basic analysis on dataset

In [669]:
startup_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Sr No              3044 non-null   int64 
 1   Date dd/mm/yyyy    3044 non-null   object
 2   Startup Name       3044 non-null   object
 3   Industry Vertical  2873 non-null   object
 4   SubVertical        2108 non-null   object
 5   City  Location     2864 non-null   object
 6   Investors Name     3020 non-null   object
 7   InvestmentnType    3040 non-null   object
 8   Amount in USD      2084 non-null   object
 9   Remarks            419 non-null    object
dtypes: int64(1), object(9)
memory usage: 237.9+ KB


**We can conclude** - 
- There are 3044 records 
- There are missing values in several columns, i.e., Industry Vertical, SubVertical, City Location, Investors Name, InvestmentnType, Amount in USD and Remarks 
- All critical columns are of string/object type and have to be treated as categorical 
- Amount in USD is an object type column which must be converted to a numeric type 
- We can drop the Remarks, Startup Name, Sr No and Date columns, which we assume will have no impact on the prediction. Date might have an impact, but for now we remove that column
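
The missing values noted above can also be quantified per column. The following is a minimal sketch on a toy frame; the helper name `missing_summary` and the sample values are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

def missing_summary(df):
    """Return missing-value count and percentage per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data standing in for startup_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Industry Vertical": ["FinTech", None, "Logistics", "Technology"],
    "Amount in USD": ["3,000,000", None, None, "6,000,000"],
    "Remarks": [None, None, None, "Bridge round"],
})

summary = missing_summary(toy_df)
print(summary)
```

Applied to `startup_df`, the same helper would show at a glance which columns are too sparse to be useful.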


## Remove columns of no impact 

In [670]:
startup_df.columns

Index(['Sr No', 'Date dd/mm/yyyy', 'Startup Name', 'Industry Vertical',
       'SubVertical', 'City  Location', 'Investors Name', 'InvestmentnType',
       'Amount in USD', 'Remarks'],
      dtype='object')

In [671]:
#Remove unnecessary columns 
startup_df.drop(columns = ['Sr No', 'Date dd/mm/yyyy', 'Startup Name', 'Remarks'],  inplace = True)

## Rename columns 

In [672]:
startup_df.rename(columns = {"City  Location" : "City Location", "InvestmentnType" : "Investment Type"}, inplace = True )

In [673]:
startup_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Industry Vertical  2873 non-null   object
 1   SubVertical        2108 non-null   object
 2   City Location      2864 non-null   object
 3   Investors Name     3020 non-null   object
 4   Investment Type    3040 non-null   object
 5   Amount in USD      2084 non-null   object
dtypes: object(6)
memory usage: 142.8+ KB


## Change datatype of Amount in USD column

Let us change the data type of the Amount in USD column. It is currently object and must be changed to int or float. Since the amounts contain commas, we first need to remove those to allow a string-to-float conversion.
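
As an alternative sketch (not the approach used in the cells below), `pd.to_numeric` with `errors="coerce"` can do the conversion in one step and leaves unparseable entries as NaN instead of 0. The sample strings are made up to imitate the kinds of values found in the column.

```python
import pandas as pd

# Toy values imitating the kinds of strings found in "Amount in USD"
raw_amounts = pd.Series(["20,00,00,000", "8048394", "undisclosed", None, "14,342,000+"])

# Keep digits only; text such as "undisclosed" becomes an empty string
digits_only = raw_amounts.str.replace(r"[^0-9]", "", regex=True)

# Anything that cannot be parsed as a number is turned into NaN
funded = pd.to_numeric(digits_only, errors="coerce")

print(funded)
```

Keeping undisclosed amounts as NaN makes them easy to tell apart from real values later. Note that stripping every non-digit also drops decimal points, so this sketch assumes whole-number amounts.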


In [674]:
startup_df["FundedAmount"] = startup_df["Amount in USD"].replace(',','', regex = True)

In [675]:
startup_df

Unnamed: 0,Industry Vertical,SubVertical,City Location,Investors Name,Investment Type,Amount in USD,FundedAmount
0,E-Tech,E-learning,Bengaluru,Tiger Global Management,Private Equity Round,200000000,200000000
1,Transportation,App based shuttle service,Gurgaon,Susquehanna Growth Equity,Series C,8048394,8048394
2,E-commerce,Retailer of baby and toddler products,Bengaluru,Sequoia Capital India,Series B,18358860,18358860
3,FinTech,Online Investment,New Delhi,Vinod Khatumal,Pre-series A,3000000,3000000
4,Fashion and Apparel,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Round,1800000,1800000
5,Logistics,"Open-market, freight management platform",Chennai,Chiratae Ventures,Series A,9000000,9000000
6,Hospitality,Online Food Delivery Platform,Gurgaon,Ant Financial,Private Equity Round,150000000,150000000
7,Technology,Agritech,Pune,Sathguru Catalyzer Advisors,Series A,6000000,6000000
8,E-Commerce,Automobile,Gurgaon,Ping An Global Voyager Fund,Series D,70000000,70000000
9,Aerospace,Satellite Communication,Bengaluru,"Mumbai Angels, Ravikanth Reddy",Seed,50000000,50000000


Let us drop the "Amount in USD" column

In [676]:
startup_df.drop(columns = ["Amount in USD"], inplace = True)

Let us now count missing values in FundedAmount

In [677]:
startup_df.isna().sum()

Industry Vertical    171
SubVertical          936
City Location        180
Investors Name        24
Investment Type        4
FundedAmount         960
dtype: int64

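FundedAmount has 960 missing values that we need to treat. Since this is a numeric quantity, we can impute with the mean, median or mode. Before choosing, we fill the missing values with the placeholder "0" so that the column can be converted to float and its distribution inspected.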

In [678]:
startup_df["FundedAmount"] = startup_df["FundedAmount"].fillna("0")

Some entries in FundedAmount hold text such as "Undisclosed" or "unknown"; these have to be replaced with 0 for the string-to-float conversion to work.

In [679]:
startup_df.replace(to_replace = 'undisclosed', value = "0", inplace = True)
startup_df.replace(to_replace = 'Undisclosed', value = "0", inplace = True)
startup_df.replace(to_replace = 'unknown', value = "0", inplace = True)

In [680]:
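# Remove "+" signs, then every remaining non-alphanumeric character
startup_df["FundedAmount"] = startup_df["FundedAmount"].replace(r'\+','', regex = True)
startup_df["FundedAmount"] = startup_df["FundedAmount"].replace('[^A-Za-z0-9]+','', regex = True)
# Remove what is left of the non-breaking-space debris once its backslashes are gone
startup_df["FundedAmount"] = startup_df["FundedAmount"].replace('xc2xa0','', regex = True)
# Entries that read "NA" at this point are treated as 0
startup_df["FundedAmount"] = startup_df["FundedAmount"].replace('NA','0', regex = True)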

We can now change the datatype of the FundedAmount column

In [681]:
startup_df["FundedAmount"] = startup_df["FundedAmount"].astype(float)

In [682]:
startup_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Industry Vertical  2873 non-null   object 
 1   SubVertical        2108 non-null   object 
 2   City Location      2864 non-null   object 
 3   Investors Name     3020 non-null   object 
 4   Investment Type    3040 non-null   object 
 5   FundedAmount       3044 non-null   float64
dtypes: float64(1), object(5)
memory usage: 142.8+ KB


## Check for missing values
We have many missing values in the dataset which must be treated before we go ahead with model training and development 




In [683]:
startup_df.isna().sum()

Industry Vertical    171
SubVertical          936
City Location        180
Investors Name        24
Investment Type        4
FundedAmount           0
dtype: int64

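We can see that we have missing values in critical fields like Industry Vertical, SubVertical, City Location, Investors Name and Investment Type.

The largest number of missing values (960) was in the amount column; those now appear as 0 in FundedAmount because of the placeholder used during conversion.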
**We have to analyze further to decide on how to treat these missing values**


### Treating missing values 


 

- FundedAmount - Every 0 in FundedAmount stands for a NaN or otherwise missing value that we introduced during the datatype conversion. We need to replace it with the mean, median or mode

In [684]:
print("Max funded amount - ", startup_df["FundedAmount"].max())
print("Mean funded amount - ", startup_df["FundedAmount"].mean())
print("Median funded amount - ", startup_df["FundedAmount"].median())
print("Min funded amount - ", startup_df["FundedAmount"].min())

startup_df.describe()

Max funded amount -  3900000000.0
Mean funded amount -  13270376.930354796
Median funded amount -  500000.0
Min funded amount -  0.0


Unnamed: 0,FundedAmount
count,3044.0
mean,13270380.0
std,104404200.0
min,0.0
25%,0.0
50%,500000.0
75%,4000000.0
max,3900000000.0


Let us replace 0 with the mean. Please note that this mean is pulled down by the many 0s in the column (the 960 originally missing values plus the undisclosed amounts).

In [685]:
startup_df["FundedAmount"] = startup_df["FundedAmount"].replace(0,startup_df["FundedAmount"].mean())

In [686]:
print("Max funded amount - ", startup_df["FundedAmount"].max())
print("Mean funded amount - ", startup_df["FundedAmount"].mean())
print("Median funded amount - ", startup_df["FundedAmount"].median())
print("Min funded amount - ", startup_df["FundedAmount"].min())

Max funded amount -  3900000000.0
Mean funded amount -  17503470.22844042
Median funded amount -  7400000.0
Min funded amount -  16000.0
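
Because the zeros are placeholders, including them lowers the mean that is used for imputation. A minimal sketch of an alternative, on made-up amounts: mark the zeros as missing first, then compute the mean or median from the disclosed values only. Given how skewed the amounts are, the median may be the more robust choice; that is a suggestion, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy amounts: 0.0 marks a value that was missing or undisclosed
amounts = pd.Series([0.0, 500000.0, 0.0, 4000000.0, 1500000.0])

# Treat zeros as missing so they do not drag the statistics down
disclosed = amounts.replace(0, np.nan)

mean_disclosed = disclosed.mean()      # NaN values are skipped by default
median_disclosed = disclosed.median()

imputed_mean = disclosed.fillna(mean_disclosed)
imputed_median = disclosed.fillna(median_disclosed)

print("Mean including zeros:", amounts.mean())
print("Mean of disclosed amounts:", mean_disclosed)
print("Median of disclosed amounts:", median_disclosed)
```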


- Investors Name - Let us replace missing investor names with "Others"

In [687]:
startup_df["Investors Name"] = startup_df["Investors Name"].fillna("Others")

In [688]:
startup_df.isna().sum()

Industry Vertical    171
SubVertical          936
City Location        180
Investors Name         0
Investment Type        4
FundedAmount           0
dtype: int64

## Checking Unique values in Industries, sub-verticals, City Location and Investment Type

In [689]:
print("Number of unique values in City Location = ", startup_df["City Location"].nunique())
print("\nCount of each unique value in City Location\n", startup_df["City Location"].value_counts())

Number of unique values in City Location =  112

Count of each unique value in City Location
 Bangalore                 700
Mumbai                    567
New Delhi                 421
Gurgaon                   287
Bengaluru                 141
Pune                      105
Hyderabad                  99
Chennai                    97
Noida                      92
Gurugram                   50
Ahmedabad                  38
Delhi                      34
Jaipur                     30
Kolkata                    21
Indore                     13
Chandigarh                 11
Goa                        10
Vadodara                   10
Singapore                   8
Coimbatore                  5
Pune / US                   4
\\xc2\\xa0Gurgaon           4
Kanpur                      4
Faridabad                   3
Nagpur                      3
\\xc2\\xa0New Delhi         3
Bhopal                      3
Udaipur                     2
Mumbai/Bengaluru            2
Agra                        2
Bangal

In [690]:
print("Number of unique values in Industry Sub vertical = ", startup_df["SubVertical"].nunique())
print("\nCount of each unique value in Industry Sub vertical\n", startup_df["SubVertical"].value_counts())

Number of unique values in Industry Sub vertical =  1942

Count of each unique value in Industry Sub vertical
 Online Lending Platform                                                                                  11
Online Pharmacy                                                                                          10
Food Delivery Platform                                                                                    8
Education                                                                                                 5
Online Lending                                                                                            5
Online lending platform                                                                                   5
Online Learning Platform                                                                                  5
Online Education Platform                                                                                 5
Non-Banking Financial Com

In [691]:
print("Number of unique values in Industry vertical = ", startup_df["Industry Vertical"].nunique())
print("\nCount of each unique value in Industry vertical\n", startup_df["Industry Vertical"].value_counts())

Number of unique values in Industry vertical =  821

Count of each unique value in Industry vertical
 Consumer Internet                                             941
Technology                                                    478
eCommerce                                                     186
Healthcare                                                     70
Finance                                                        62
ECommerce                                                      61
Logistics                                                      32
E-Commerce                                                     29
Education                                                      24
Food & Beverage                                                23
Ed-Tech                                                        14
E-commerce                                                     12
FinTech                                                         9
IT                                      

In [692]:
print("Number of unique values in Investment Type = ", startup_df["Investment Type"].nunique())
print("\nCount of each unique value in Investment Type\n", startup_df["Investment Type"].value_counts())

Number of unique values in Investment Type =  55

Count of each unique value in Investment Type
 Private Equity                 1356
Seed Funding                   1355
Seed/ Angel Funding              60
Seed / Angel Funding             47
Seed\\nFunding                   30
Debt Funding                     25
Series A                         24
Seed/Angel Funding               23
Series B                         20
Series C                         14
Series D                         12
Angel / Seed Funding              8
Seed Round                        7
Pre-Series A                      4
Private Equity Round              4
Seed                              4
Seed / Angle Funding              3
Equity                            2
Series E                          2
pre-Series A                      2
Series F                          2
Venture Round                     2
Corporate Round                   2
Pre Series A                      1
Series B (Extension)              1
Fun

With so many categories in the critical columns of our dataset, a model would see too few examples per category to predict the funding amount well. We need to create new columns that compress many categories into a few broader ones. 
For this, one needs business rules that assign specific verticals and sub-verticals to a broader category, which acts as a superset for them. 

## Creating new categories for Cities
City does impact startup funding, and we can divide cities into - 
- Tier 1 Cities - Delhi, Mumbai, Bangalore, Hyderabad, Chennai, etc. 
- Tier 2 Cities - Indore, Bhopal, Lucknow, Ahmedabad, Vadodara, Warangal, etc. 
- Foreign Cities - San Francisco, California, Singapore, Nairobi, etc. 
- Others - All cities which are not in the above three categories  




In [693]:
cities_list = startup_df["City Location"].unique()

In [694]:
tier1_cities = ["bangaluru", "gurgaon", "new delhi", "delhi", "mumbai", "chennai", "pune", "noida", "hyderabad", "bangalore", "kolkata"]
tier2_cities = ["indore", "bhopal", "amritsar", "koramngala", "bhubneswar", "surat", "ahmedabad", "jodhpur", "nagpur","trivandrum", "panaji", "gwalior", "chandigarh"]
india_global = ["san francisco", "san jose","palo alto","santa monica", "singapore", "new york", "boston", "california"]
#others = potentially anything else which is not mentioned above

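Let us write a function to categorize all cities into the above-mentioned categories based on exact, case-insensitive string matching. Spellings that are not in the lists fall into "Other"; for example, "Bengaluru" does not match the list entry "bangaluru".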

In [695]:
def city_mapper(data):
  try:
    city = data["City Location"].lower()
  except AttributeError:
    # Missing cities are NaN (a float), which has no .lower()
    return "Other"
  if city in tier1_cities:
    return "Tier1"
  elif city in tier2_cities:
    return "Tier2"
  elif city in india_global:
    return "Global"
  return "Other"

In [696]:
startup_df["city_category"] = startup_df.apply(city_mapper, axis = 1)

In [697]:
startup_df

Unnamed: 0,Industry Vertical,SubVertical,City Location,Investors Name,Investment Type,FundedAmount,city_category
0,E-Tech,E-learning,Bengaluru,Tiger Global Management,Private Equity Round,200000000.0,Other
1,Transportation,App based shuttle service,Gurgaon,Susquehanna Growth Equity,Series C,8048394.0,Tier1
2,E-commerce,Retailer of baby and toddler products,Bengaluru,Sequoia Capital India,Series B,18358860.0,Other
3,FinTech,Online Investment,New Delhi,Vinod Khatumal,Pre-series A,3000000.0,Tier1
4,Fashion and Apparel,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Round,1800000.0,Tier1
5,Logistics,"Open-market, freight management platform",Chennai,Chiratae Ventures,Series A,9000000.0,Tier1
6,Hospitality,Online Food Delivery Platform,Gurgaon,Ant Financial,Private Equity Round,150000000.0,Tier1
7,Technology,Agritech,Pune,Sathguru Catalyzer Advisors,Series A,6000000.0,Tier1
8,E-Commerce,Automobile,Gurgaon,Ping An Global Voyager Fund,Series D,70000000.0,Tier1
9,Aerospace,Satellite Communication,Bengaluru,"Mumbai Angels, Ravikanth Reddy",Seed,50000000.0,Other


In [698]:
startup_df["city_category"].unique()

array(['Other', 'Tier1', 'Global', 'Tier2'], dtype=object)
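
The value counts earlier show the same city under several spellings (Bangalore and Bengaluru, Gurgaon and Gurugram) and with a literal `\\xc2\\xa0` prefix. A minimal sketch of normalizing names before the lookup follows; the alias table and the helper names `normalize_city` and `city_category` are hypothetical, and only the Tier 1 list is shown for brevity.

```python
import re

# Hypothetical alias table: alternative spellings mapped to one canonical name
CITY_ALIASES = {"bengaluru": "bangalore", "gurugram": "gurgaon"}

TIER1 = {"bangalore", "gurgaon", "new delhi", "delhi", "mumbai", "chennai",
         "pune", "noida", "hyderabad", "kolkata"}

def normalize_city(raw):
    """Lowercase a city name, strip encoding debris, and resolve aliases."""
    if not isinstance(raw, str):
        return None  # missing values (NaN) are not strings
    # Drop the literal backslash-xc2-backslash-xa0 debris (one or more backslashes)
    city = re.sub(r"\\+xc2\\+xa0", "", raw).strip().lower()
    return CITY_ALIASES.get(city, city)

def city_category(raw):
    """Return "Tier1" for known Tier 1 cities and "Other" for everything else."""
    return "Tier1" if normalize_city(raw) in TIER1 else "Other"

for name in ["Bengaluru", "\\xc2\\xa0Gurgaon", "Gurugram", "Agra"]:
    print(repr(name), "->", city_category(name))
```

With this kind of normalization, rows such as the Bengaluru ones would no longer fall into "Other" just because of spelling.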

We can now drop City Location column

In [699]:
startup_df.drop(columns = ["City Location"], inplace = True)

## Creating new categories for Industry vertical

In [700]:
# Listing all industry verticals 
industry_vertical_list = startup_df["Industry Vertical"].unique()
print(industry_vertical_list)

['E-Tech' 'Transportation' 'E-commerce' 'FinTech' 'Fashion and Apparel'
 'Logistics' 'Hospitality' 'Technology' 'E-Commerce' 'Aerospace'
 'B2B-focused foodtech startup' 'Finance' 'Video' 'Gaming' 'Software'
 'Health and wellness' 'Education' 'Food and Beverage'
 'Health and Wellness' 'B2B Marketing' 'Video Games' 'SaaS'
 'Last Mile Transportation' 'Healthcare' 'Customer Service' 'B2B'
 'Consumer Goods' 'Advertising, Marketing' 'IoT' 'Information Technology'
 'Consumer Technology' 'Accounting' 'Retail' 'Customer Service Platform'
 'Automotive' 'EdTech' 'Services' 'Compliance' 'Transport'
 'Artificial Intelligence' 'Tech' 'Health Care' 'Luxury Label'
 'Waste Management Service' 'Deep-Tech' 'Agriculture' 'Energy'
 'Digital Media' 'Saas' 'Automobile' 'Agtech' 'Social Media' 'Fintech'
 'Edtech' 'AI' 'Ecommerce' 'Nanotechnology' 'Services Platform'
 'Travel Tech' 'Online Education' 'Online Marketplace' 'SaaS, Ecommerce'
 'NBFC' 'Food' 'Food Tech' 'Automation' 'Investment' 'Social Network'
 '

Let us define the major categories or industry verticals we are interested in, i.e. - Education, Agriculture, Ecommerce, E-Governance, Banking, Automobiles, Supply Chain and Logistics, Transport, Sports and Entertainment, 
Housing, Food, Travel and Hospitality, Healthcare, Utility, Services, Fitness, Gaming, Real Estate and Others 

In [701]:
#Setup the master list for industries 
industry_super_set = [ "education", "agriculture", "ecommerce", "governance", 
                      "banking", "automobiles", "apparel", "social media", "energy", "trading",
                      "supply chain and logistics", "transport", "sports and entertainment", 
                      "food", "travel and hospitality", "healthcare", "utility services", "fitness", 
                      "gaming", "real estate", "others"]

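Let us now create a general rule for categorizing all industries into the super set of industries. We could create complex rules to be more precise, but that would take a lot of effort and manual checks. For now we categorize based on keywords. Note that the mapper below compares the whole lowercased vertical name against each keyword list, so a vertical is recognized only when its name equals one of the keywords.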

In [702]:
education_keywords = ["e-tech", "ed-tech", "elearning", "school", "university", "education", "tuition", "learning", "training", "skill"]
agriculture_keywords = ["agriculture", "agri", "crop", "yield", "harvesting", "farmer"]
ecommerce_keywords = ["ecommerce", "online", "shopping", "portal"]
banking_keywords = ["financial", "fin-tech", "investment" ]
automobiles_keywords = ["scooter", "auto", "automobile", "car", "bike"]
apparel_keywords = ["fashion", "apparel", "clothing", "wear", "lifestyle"]
social_media_keywords = ["social"]
energy_keywords = ["oil", "electricity", "gasoline", "gas", "fuel"]
supply_chain_and_logistics_keywords = ["supply", "logistics", "retail"]
transport_keywords = ["transportation", "taxi", "cab", "bus", "train"]
sport_and_entertainment_keywords = ["sports", "cricket", ]
food_keywords = ["food", "beverages", "restaurant"] 
travel_and_hospitality_keywords = ["travel", "hotels", "hostel", "rent", "accommodation"]
healthcare_keywords = ["health", "healthcare", "medical", "doctor"]
gaming_keywords = ["game", "entertainment", "games", "reality"]

In [703]:
def industry_mapper(data):
  # (keyword list, category label) pairs, checked in order
  keyword_groups = [
    (education_keywords, "Ed-Tech"),
    (agriculture_keywords, "Agricutlure"),
    (ecommerce_keywords, "Ecommerce"),
    (banking_keywords, "Banking"),
    (automobiles_keywords, "Automobiles"),
    (apparel_keywords, "Apparel"),
    (social_media_keywords, "Social Media"),
    (energy_keywords, "Energy"),
    (supply_chain_and_logistics_keywords, "Supply Chain and Logistics"),
    (transport_keywords, "Transportation"),
    (sport_and_entertainment_keywords, "Sports and Entertainment"),
    (food_keywords, "Food and Beverages"),
    (travel_and_hospitality_keywords, "Travel and Hospitality"),
    (healthcare_keywords, "Healthcare"),
    (gaming_keywords, "Gaming"),
  ]
  try:
    vertical = data["Industry Vertical"].lower()
  except AttributeError:
    # Missing verticals are NaN (a float), which has no .lower()
    return "Other"
  # The whole lowercased vertical must equal one of the keywords
  for keywords, label in keyword_groups:
    if vertical in keywords:
      return label
  return "Other"

In [704]:
startup_df["Industry Category"] = startup_df.apply(industry_mapper, axis = 1)

In [705]:
startup_df["Industry Category"].unique()

array(['Ed-Tech', 'Transportation', 'Other', 'Supply Chain and Logistics',
       'Healthcare', 'Agricutlure', 'Automobiles', 'Ecommerce',
       'Food and Beverages', 'Banking', 'Apparel', 'Gaming'], dtype=object)
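
Because the mapper compares whole names, verticals such as "E-Commerce" or "FinTech" end up in "Other". A minimal sketch of substring-based keyword matching, which is closer to the keyword idea described above, could look like this. The function name `map_industry` and the shortened keyword groups are illustrative assumptions, not the notebook's lists.

```python
# Ordered (label, keywords) pairs; earlier groups win when several match.
# Only a few groups are shown, and the keyword lists are illustrative.
KEYWORD_GROUPS = [
    ("Ed-Tech", ["e-tech", "ed-tech", "edtech", "education", "learning"]),
    ("Ecommerce", ["ecommerce", "e-commerce"]),
    ("Banking", ["fintech", "fin-tech", "financ", "investment"]),
    ("Healthcare", ["health", "medical"]),
]

def map_industry(vertical):
    """Return the first category whose keyword occurs inside the vertical name."""
    if not isinstance(vertical, str):
        return "Other"  # missing values (NaN) are not strings
    text = vertical.lower()
    for label, keywords in KEYWORD_GROUPS:
        if any(keyword in text for keyword in keywords):
            return label
    return "Other"

for name in ["E-Commerce", "FinTech", "Health and Wellness", "Online Education", "Aerospace"]:
    print(name, "->", map_industry(name))
```

Substring matching trades misses for false positives: a short keyword such as "car" also occurs inside "healthcare", so keyword choice and group order need care.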

We can now drop the Industry Vertical and SubVertical columns

In [706]:
startup_df.drop(columns = ["Industry Vertical", "SubVertical"], inplace = True)

Let us again check missing values

In [707]:
startup_df.isna().sum()

Investors Name       0
Investment Type      4
FundedAmount         0
city_category        0
Industry Category    0
dtype: int64

## Creating new categories for Investment Type

In [708]:
startup_df["Investment Type"].unique()

array(['Private Equity Round', 'Series C', 'Series B', 'Pre-series A',
       'Seed Round', 'Series A', 'Series D', 'Seed', 'Series F',
       'Series E', 'Debt Funding', 'Series G', 'Series H', 'Venture',
       'Seed Funding', nan, 'Funding Round', 'Corporate Round',
       'Maiden Round', 'pre-series A', 'Seed Funding Round',
       'Single Venture', 'Venture Round', 'Pre-Series A', 'Angel',
       'Series J', 'Angel Round', 'pre-Series A',
       'Venture - Series Unknown', 'Bridge Round', 'Private Equity',
       'Debt and Preference capital', 'Inhouse Funding',
       'Seed/ Angel Funding', 'Debt', 'Pre Series A', 'Equity',
       'Debt-Funding', 'Mezzanine', 'Series B (Extension)',
       'Equity Based Funding', 'Private Funding', 'Seed / Angel Funding',
       'Seed/Angel Funding', 'Seed funding', 'Seed / Angle Funding',
       'Angel / Seed Funding', 'Private', 'Structured Debt', 'Term Loan',
       'PrivateEquity', 'Angel Funding', 'Seed\\\\nFunding',
       'Private\\\\nEqui

We need to create super categories to reduce the number of unique values in Investment Type, like we did for City and Industry Vertical.

In [709]:
private_funding = ["private"]
series_funding = ["series"]
venture_funding = ["venture"]
debt_funding = ["debt"]
angel_funding = ["angel"]
crowd_funding = ["crowd"]
#other for all other categories 

In [710]:
def investment_mapper(data):
  try:
    if data["Investment Type"].lower() in private_funding:
      return "private"
    elif data["Investment Type"].lower() in series_funding:
      return "series"
    elif data["Investment Type"].lower() in venture_funding:
      return "venture"
    elif data["Investment Type"].lower() in debt_funding:
      return "debt"
    elif data["Investment Type"].lower() in angel_funding:
      return "angel"
    elif data["Investment Type"].lower() in crowd_funding:
      return "crowd"
    else:
      return "Others"

  except:
    return "Others"

In [711]:
startup_df["Investment Category"] = startup_df.apply(investment_mapper, axis = 1)
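
Since the raw labels are phrases such as "Private Equity Round" or "Seed / Angel Funding", comparing the whole label against a one-word list rarely matches. A vectorized sketch that checks whether each keyword occurs inside the label is shown below; the sample labels are made up in the style of the raw values, and this is an alternative illustration rather than the notebook's mapper.

```python
import numpy as np
import pandas as pd

# Toy sample in the style of the raw "Investment Type" labels
investment_type = pd.Series(
    ["Private Equity Round", "Series C", "Seed / Angel Funding",
     "Debt Funding", "Venture Round", None, "Crowd Funding"]
)

# Lowercase once; missing values become an empty string and match nothing
text = investment_type.fillna("").str.lower()

# Conditions are checked in order, so the first matching keyword wins
keywords = ["private", "series", "venture", "debt", "angel", "crowd"]
conditions = [text.str.contains(keyword, regex=False) for keyword in keywords]

investment_category = pd.Series(
    np.select(conditions, keywords, default="Others"),
    index=investment_type.index,
)
print(investment_category.tolist())
```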

Let us drop the Investment Type and Investors Name columns

In [712]:
startup_df.drop(columns = ["Investment Type", "Investors Name"], inplace = True)

In [713]:
startup_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FundedAmount         3044 non-null   float64
 1   city_category        3044 non-null   object 
 2   Industry Category    3044 non-null   object 
 3   Investment Category  3044 non-null   object 
dtypes: float64(1), object(3)
memory usage: 95.2+ KB


## Preparing data for Model Training
We now have our required columns for prediction. Let us 
- Perform OneHot Encoding
- separate dependent and independent variables 
- split train and test data
- train model
- test model 
- evaluate model 
- Conclude 
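
The steps listed above can be sketched end to end as follows. This is a hedged illustration on synthetic data: the choice of `LinearRegression`, the log transform of the target and the toy values are assumptions made for the example, not decisions taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared frame (values are made up)
rng = np.random.default_rng(0)
n_rows = 200
toy_df = pd.DataFrame({
    "city_category": rng.choice(["Tier1", "Tier2", "Other"], size=n_rows),
    "Industry Category": rng.choice(["Ed-Tech", "Healthcare", "Other"], size=n_rows),
    "FundedAmount": rng.lognormal(mean=14, sigma=1.5, size=n_rows),
})

# One-hot encode the categorical features (independent variables)
X = pd.get_dummies(toy_df[["city_category", "Industry Category"]])
# Assumption: amounts are heavily skewed, so this sketch models log(1 + amount)
y = np.log1p(toy_df["FundedAmount"])

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out rows
predictions = model.predict(X_test)
mae_log = mean_absolute_error(y_test, predictions)
print("MAE on log scale:", mae_log)
```

Any regressor with the scikit-learn `fit`/`predict` interface could be swapped in for `LinearRegression` without changing the rest of the flow.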

### Encoding Categorical columns

In [714]:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()

In [715]:
startup_industry_vertical = pd.DataFrame(onehot.fit_transform(startup_df[["Industry Category"]]).toarray())
startup_df = startup_df.join(startup_industry_vertical)

In [716]:
startup_df

Unnamed: 0,FundedAmount,city_category,Industry Category,Investment Category,0,1,2,3,4,5,6,7,8,9,10,11
0,200000000.0,Other,Ed-Tech,Others,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8048394.0,Tier1,Transportation,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,18358860.0,Other,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,3000000.0,Tier1,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1800000.0,Tier1,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,9000000.0,Tier1,Supply Chain and Logistics,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,150000000.0,Tier1,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
7,6000000.0,Tier1,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8,70000000.0,Tier1,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
9,50000000.0,Other,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [717]:
#Get unique items and sort the list of Industry Verticals
industry_verticals = startup_df["Industry Category"].unique()
industry_verticals.sort()
print(industry_verticals)

['Agricutlure' 'Apparel' 'Automobiles' 'Banking' 'Ecommerce' 'Ed-Tech'
 'Food and Beverages' 'Gaming' 'Healthcare' 'Other'
 'Supply Chain and Logistics' 'Transportation']


In [718]:
# Create a dictionary mapping for each industry vertical and int value starting from 0 to 11

In [719]:
industry_vertical_dict = {}
for i, industry in enumerate(industry_verticals):
  industry_vertical_dict[i] = industry

print(industry_vertical_dict)

{0: 'Agricutlure', 1: 'Apparel', 2: 'Automobiles', 3: 'Banking', 4: 'Ecommerce', 5: 'Ed-Tech', 6: 'Food and Beverages', 7: 'Gaming', 8: 'Healthcare', 9: 'Other', 10: 'Supply Chain and Logistics', 11: 'Transportation'}


In [720]:
startup_df.rename(columns = industry_vertical_dict, inplace = True)
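
A shorter route to named indicator columns, shown here as a sketch on a three-row toy frame, is `pd.get_dummies`, which names each new column after its category so that no separate renaming step is needed. This is an alternative to the encoder-plus-rename approach above, not a replacement prescribed by the notebook.

```python
import pandas as pd

# Three rows in the style of the frame above
toy_df = pd.DataFrame({
    "FundedAmount": [200000000.0, 8048394.0, 18358860.0],
    "Industry Category": ["Ed-Tech", "Transportation", "Other"],
})

# One indicator column per category, named after the category itself
dummies = pd.get_dummies(toy_df["Industry Category"], dtype=float)
encoded_df = toy_df.join(dummies)

print(encoded_df.columns.tolist())
```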

In [722]:
startup_df.head()

Unnamed: 0,FundedAmount,city_category,Industry Category,Investment Category,Agricutlure,Apparel,Automobiles,Banking,Ecommerce,Ed-Tech,Food and Beverages,Gaming,Healthcare,Other,Supply Chain and Logistics,Transportation
0,200000000.0,Other,Ed-Tech,Others,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8048394.0,Tier1,Transportation,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,18358860.0,Other,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,3000000.0,Tier1,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1800000.0,Tier1,Other,Others,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


We can now drop 