<a href="https://colab.research.google.com/github/Stella-Achar-Oiro/LP1-Data-Analysis-Project/blob/main/Indian_Startup_Ecosystem_Analysis_Week2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Details
Name : Stella Achar Oiro <br>
email: stella.achar@azubiafrica.org <br>
Team: Prague <br>
Link to Github Repo - ([GitHub](https://github.com/Stella-Achar-Oiro/LP1-Data-Analysis-Project))


Project - Indian Startup Funding

# Indian Startup Ecosystem Analysis
# Intro

## Project Intro and Objective
**Objective** - Our team is trying to venture into the Indian startup ecosystem. As the data experts of the team we are supposed to investigate the ecosystem and propose the best course of action.

## Business & Data Understanding
**How does Venture Capital Funding Work?** Entrepreneurs pitch an idea to potential investors to obtain financing (funding) to start or improve the business. Venture capitalists often require a business plan to assess the viability of the entrepreneur’s idea. **Invention** and **Innovation** are key concepts related to Venture Capital funding. Roughly 80% of funding goes into building the infrastructure needed to grow the business (expenses `[manufacturing, sales, marketing]` and the balance sheet `[working capital & assets]`).

Venture Capital Funding is short-term oriented: it is meant to grow the business to a point where it can later be sold (to a corporation or through an IPO). Sometimes the VC wants **equity** (a share of the business). Start-ups (especially in the new information economy) lack hard assets and therefore cannot access funding from banks, which require collateral.
**Large institutions** are often behind VC funds: insurance companies, pension funds, financial firms, universities. **Angel Investors**, who are high-net-worth individuals, also participate. While screening ideas or after funding a business, venture capitalists can impose restrictions on the behaviour of entrepreneurs.

Looking at **the Indian Venture Capital context**, we see a booming start-up ecosystem that has become one of the world’s most vibrant and innovative business environments. The country is increasingly recognized as a popular destination for venture capital (VC) investments. Early-stage investments have seen particularly strong growth in recent years, while domestic investors remain the most active investor group.

There are several stages of **Startup/Venture Capital Funding**:
1. **Seed Funding** - the earliest stage of the capital-raising process, when the business hasn't started and is only an idea. It is considered the riskiest and most complicated stage because the investor doesn’t have enough information to make a decision. The main participants are mostly Angel Investors (who appreciate riskier ventures and expect high returns). The entrepreneur's management/soft skills can determine business success (although modern tech challenges this view).
2. **Series A Financing** - a type of equity-based financing (the company offers convertible preferred shares). It covers businesses that already generate revenue but are still in the pre-profit stage. The primary objective of the funding is continued growth. It follows a more formal process: more complete company information is provided, the company is valued, and the VC firm performs due diligence (progress made since the start, management’s efficiency in managing resources). Series A rounds raise approximately $2 million to $15 million, and the median Series A round in 2021 was $10 million. ([Investopedia](https://www.investopedia.com/articles/personal-finance/102015/series-b-c-funding-what-it-all-means-and-how-it-works.asp))
3. **Series B Financing** - the 3rd stage of start-up financing and the 2nd stage of VC financing. Like Series A, it is equity based. The business is already in the middle stages of growth, has a user base and could be generating profit. It is now looking to expand its market. More participants enter at this stage because they are willing to invest in later stages of growth. The objective of the funding is to take the business past the development stage and help it scale or meet levels of demand. Most participants here are Venture Capital firms and private equity firms.

 

## Overview of the Data

The available dataset consists of the following pieces of data:
1. Startup Name/Brand
2. Year of Founding
3. Location/Headquarters of Business
4. Sector of the business
5. Founders (Owners)
6. Investor (assume it’s the VCs)
7. Amount of Funding
8. Stage (at which business is in while seeking/offered funding)

**Note** - The 2018 data lacks some of the fields above and is inconsistent in others (for example, amounts are recorded in more than one currency).

# Hypothesis

**Null Hypothesis** – No factors determine the amount of funding offered to a start-up by venture capitalists in India.

**Alternative Hypothesis** – A range of start-up characteristics, such as the sector it wants to venture into, its use of technology, the number of years it has been in existence, its stage of growth and its location, determine the amount of funding offered to start-ups by venture capitalists in India.

# Research Questions

**Sector**

1. What sectors have attracted the largest funding in the last 4 years?
2. Who are the major funders of ventures in India in the last 4 years? And what are the major sectors they are directing their funding to?
3. What are the maximum, minimum, average and median funding amounts offered to each sector/stage of funding and how do they compare?
4. What is the trend of financing over the four years, cumulatively (total funding) and in independent sectors? (Monthly and/or yearly trends)

**Technology**

5. What business technologies have received the highest funding or have been funded more often over the last four years?

**Location**

6. What locations have received the biggest funding?
7. What is the spread of Venture Capital funding in India regionally and how do the regions compare?

**Stage**

8. Does the stage of the venture affect the amount of funding offered to ventures?
9. How many ventures that received a previous round of financing have gone on to successfully raise another stage of financing?

**Founders**

10. Does the number of business owners (entrepreneurs) in a venture determine the amount/likelihood of funding from investors?
11. Does the year of founding determine/influence the funding amount?

## Installation
Here is the section to install all the packages/libraries that will be needed to tackle the challenge.

In [None]:
# !pip install pandas

## Importation
Here is the section to import all the packages/libraries that will be used through this notebook.

In [None]:
# Data handling
import numpy as np
import pandas as pd

# Visualization (Matplotlib, Plotly, Seaborn, etc. )
import matplotlib.pyplot as plt
import seaborn as sns

# EDA (pandas-profiling, etc. )
...

# Feature Processing (Scikit-learn processing, etc. )
...

# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc. )
...

# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc. )
...

# Other packages
import os


# Data Loading
Here is the section to load the datasets (train, eval, test) and the additional files

In [None]:
# For CSV, use pandas.read_csv

from google.colab import drive
drive.mount('/content/drive')

startupdf_2018 = pd.read_csv('/content/drive/MyDrive/India Startup Funding/startup_funding2018.csv')
startupdf_2019 = pd.read_csv('/content/drive/MyDrive/India Startup Funding/startup_funding2019.csv')
startupdf_2020 = pd.read_csv('/content/drive/MyDrive/India Startup Funding/startup_funding2020.csv')
startupdf_2021 = pd.read_csv('/content/drive/MyDrive/India Startup Funding/startup_funding2021.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Exploratory Data Analysis: EDA
Here is the section to **inspect** the datasets in depth, **present** them, make **hypotheses** and **plan** the *cleaning, processing and feature creation*.

## Dataset overview

Have a look at the loaded datasets using the following methods: `.head(), .info()`

In [None]:
# Get a summary of the 2018 data
startupdf_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [None]:
# Get a summary of the 2019 data
startupdf_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [None]:
#Make the 2018 data Column names match the 2019-2021 data

startupdf_2018.rename(columns={'Company Name' : 'Company/Brand', 'Industry' : 'Sector', 'Round/Series' : 'Stage', 'Amount' : 'Amount($)', 'Location' : 'HeadQuarter', 'About Company' : 'What it does'}, inplace = True)

In [None]:
# View the 2018 dataframe columns
startupdf_2018.columns

Index(['Company/Brand', 'Sector', 'Stage', 'Amount($)', 'HeadQuarter',
       'What it does'],
      dtype='object')

In [None]:
#View the 2018 columns data types

startupdf_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company/Brand  526 non-null    object
 1   Sector         526 non-null    object
 2   Stage          526 non-null    object
 3   Amount($)      526 non-null    object
 4   HeadQuarter    526 non-null    object
 5   What it does   526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [None]:
#Concatenate all four dataframes
#Vertical concatenation used (axis=0 ) means along rows
#ignore-Index parameter to reset the index

startup_df = pd.concat([startupdf_2018, startupdf_2019, startupdf_2020, startupdf_2021], axis=0, ignore_index=True, sort=False)
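
Once the four files are stacked, nothing in the combined dataframe records which year a row came from, although several research questions ask about yearly trends. A minimal sketch of one way to keep that information is shown below. The helper `concat_with_year` and the column name `Funding Year` are assumptions made for illustration, and the toy frames stand in for the real CSV files.

```python
import pandas as pd

def concat_with_year(frames_by_year):
    """Stack yearly frames and record the source year in a 'Funding Year' column."""
    tagged = [
        frame.assign(**{"Funding Year": year})  # hypothetical column name
        for year, frame in frames_by_year.items()
    ]
    return pd.concat(tagged, axis=0, ignore_index=True, sort=False)

# Toy frames standing in for the yearly CSV files
demo_2018 = pd.DataFrame({"Company/Brand": ["A"], "Amount($)": ["250000"]})
demo_2019 = pd.DataFrame({
    "Company/Brand": ["B", "C"],
    "Amount($)": ["$1000", "$2000"],
    "Founded": [2015.0, 2017.0],
})

demo_df = concat_with_year({2018: demo_2018, 2019: demo_2019})
print(demo_df)
```

In the notebook this would be called with the loaded frames, for example `concat_with_year({2018: startupdf_2018, 2019: startupdf_2019, 2020: startupdf_2020, 2021: startupdf_2021})`.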

In [None]:
#View concatenated dataframe to confirm

startup_df

Unnamed: 0,Company/Brand,Sector,Stage,Amount($),HeadQuarter,What it does,Founded,Founders,Investor,Unnamed: 9
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,
...,...,...,...,...,...,...,...,...,...,...
2874,Gigforce,Staffing & Recruiting,Pre-series A,$3000000,Gurugram,A gig/on-demand staffing company.,2019.0,"Chirag Mittal, Anirudh Syal",Endiya Partners,
2875,Vahdam,Food & Beverages,Series D,$20000000,New Delhi,VAHDAM is among the world’s first vertically i...,2015.0,Bala Sarda,IIFL AMC,
2876,Leap Finance,Financial Services,Series C,$55000000,Bangalore,International education loans for high potenti...,2019.0,"Arnav Kumar, Vaibhav Singh",Owl Ventures,
2877,CollegeDekho,EdTech,Series B,$26000000,Gurugram,"Collegedekho.com is Student’s Partner, Friend ...",2015.0,Ruchir Arora,"Winter Capital, ETS, Man Capital",


In [None]:
#View number of columns and rows of new dataframe
startup_df.shape


(2879, 10)

In [None]:
#View data types of the columns of the new dataframe
startup_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company/Brand  2879 non-null   object
 1   Sector         2861 non-null   object
 2   Stage          1941 non-null   object
 3   Amount($)      2873 non-null   object
 4   HeadQuarter    2765 non-null   object
 5   What it does   2879 non-null   object
 6   Founded        2111 non-null   object
 7   Founders       2334 non-null   object
 8   Investor       2253 non-null   object
 9   Unnamed: 9     2 non-null      object
dtypes: object(10)
memory usage: 225.0+ KB


In [None]:
# Remove/Drop the extra null column - Unnamed: 9
startup_df.drop('Unnamed: 9', inplace=True, axis=1)

In [None]:
# View the dataframe to confirm column deletion
startup_df

Unnamed: 0,Company/Brand,Sector,Stage,Amount($),HeadQuarter,What it does,Founded,Founders,Investor
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,
...,...,...,...,...,...,...,...,...,...
2874,Gigforce,Staffing & Recruiting,Pre-series A,$3000000,Gurugram,A gig/on-demand staffing company.,2019.0,"Chirag Mittal, Anirudh Syal",Endiya Partners
2875,Vahdam,Food & Beverages,Series D,$20000000,New Delhi,VAHDAM is among the world’s first vertically i...,2015.0,Bala Sarda,IIFL AMC
2876,Leap Finance,Financial Services,Series C,$55000000,Bangalore,International education loans for high potenti...,2019.0,"Arnav Kumar, Vaibhav Singh",Owl Ventures
2877,CollegeDekho,EdTech,Series B,$26000000,Gurugram,"Collegedekho.com is Student’s Partner, Friend ...",2015.0,Ruchir Arora,"Winter Capital, ETS, Man Capital"


### Start Cleaning the Amounts Column

In this section, we will start cleaning the Amounts column of the dataframe as it contains the info that relates to all our questions 

In [None]:
# Check types of data in the Amounts Column

startup_df['Amount($)'].apply(type).unique()

array([<class 'str'>, <class 'float'>], dtype=object)
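
The type check shows that the column mixes strings with floats (the floats are missing values). Before cleaning, it can help to see which kinds of strings are present. The sketch below labels each entry by the currency marker it carries. `amount_kind` is a hypothetical helper, and the sample values are modelled on rows shown earlier rather than taken from the full dataset.

```python
import pandas as pd

def amount_kind(value):
    """Label a raw amount entry by the currency marker it carries."""
    if pd.isna(value):
        return "missing"
    text = str(value).strip()
    if text.startswith("₹"):
        return "rupee"
    if text.startswith("$"):
        return "dollar"
    if text.replace(",", "").isdigit():
        return "plain number"
    return "other"  # placeholders such as "—" or "Undisclosed"

demo_amounts = pd.Series(
    ["250000", "₹40,000,000", "$3000000", "—", "Undisclosed", float("nan")]
)
kinds = demo_amounts.map(amount_kind)
print(kinds.value_counts())
```

Running `startup_df['Amount($)'].map(amount_kind).value_counts()` on the real column would show how many rows fall into each group before any symbols are removed.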

In [None]:
#drop the words "Undisclosed" and "undisclosed" in Amounts column

startup_df =startup_df.drop(startup_df[(startup_df['Amount($)'] == 'Undisclosed') | (startup_df['Amount($)'] == 'undisclosed')].index )

In [None]:
#Remove thousands separators (commas)
#.str.replace skips the missing (float NaN) entries, which x.replace inside .apply cannot handle

startup_df['Amount($)'] = startup_df['Amount($)'].str.replace(',', '', regex=False)

In [None]:
#Remove the Rupee sign (this strips the symbol only; the values are still in rupees)

startup_df['Amount($)'] = startup_df['Amount($)'].str.replace('₹', '', regex=False)

In [None]:
#Remove the Dollar sign; regex=False is needed because '$' is the regex end-of-string anchor

startup_df['Amount($)'] = startup_df['Amount($)'].str.replace('$', '', regex=False)
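
Removing the currency symbols leaves rupee and dollar figures in the same column with nothing to tell them apart. One alternative, applied to the raw column before the symbols are removed, is to convert rupee entries to dollars while parsing. The sketch below is illustrative only: `to_usd` is a hypothetical helper, `ASSUMED_INR_PER_USD` is a placeholder rate that does not come from the dataset, and amounts without any symbol are assumed to be in dollars.

```python
import pandas as pd

ASSUMED_INR_PER_USD = 70.0  # placeholder exchange rate, NOT taken from the dataset

def parse_amount(value, inr_per_usd):
    """Parse one raw amount; return USD as a float, or NaN if it is not numeric."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip()
    is_rupee = text.startswith("₹")
    digits = text.replace("₹", "").replace("$", "").replace(",", "")
    try:
        number = float(digits)
    except ValueError:  # placeholders such as "—" or "Undisclosed"
        return float("nan")
    # Assumption: entries without a rupee sign are already in dollars
    return number / inr_per_usd if is_rupee else number

def to_usd(raw_amounts, inr_per_usd=ASSUMED_INR_PER_USD):
    """Convert a Series of raw amount strings to floats in USD."""
    return raw_amounts.map(lambda v: parse_amount(v, inr_per_usd)).astype(float)

demo_raw = pd.Series(["250000", "₹70,000", "$3000000", "—", float("nan")])
demo_usd = to_usd(demo_raw)
print(demo_usd)
```

A rate matched to each funding year would be more faithful than a single constant; the constant is used here only to keep the sketch short.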

In [None]:
startup_df.head(10)

In [None]:
startup_df.shape

(2581, 9)

In [None]:
startup_df.describe()

Unnamed: 0,Company/Brand,Sector,Stage,Amount($),HeadQuarter,What it does,Founded,Founders,Investor
count,2581,2564,1802,2575,2486,2581,1883.0,2041,1967
unique,2007,816,71,698,164,2418,52.0,1750,1575
top,BharatPe,FinTech,Seed,—,Bangalore,BYJU'S is an educational technology company th...,2020.0,"Ashneer Grover, Shashvat Nakrani",Inflection Point Ventures
freq,10,159,571,148,679,5,228.0,7,32


In [None]:
startup_df.dtypes

Company/Brand    object
Sector           object
Stage            object
Amount($)        object
HeadQuarter      object
What it does     object
Founded          object
Founders         object
Investor         object
dtype: object

In [None]:
#Check null values in each column

startup_df.isna().sum()

Company/Brand      0
Sector            17
Stage            779
Amount($)          6
HeadQuarter       95
What it does       0
Founded          698
Founders         540
Investor         614
dtype: int64
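
The counts above only include true `NaN` values. The `describe()` output shows `—` as the most frequent entry in the amount column (148 rows), and `isna()` does not count such placeholder strings. The sketch below reports missing values as counts and percentages and treats a configurable set of placeholders as missing. `missing_report` and its `placeholders` parameter are assumptions for illustration.

```python
import pandas as pd

def missing_report(df, placeholders=("—",)):
    """Missing count and percentage per column, counting placeholder strings as missing."""
    hidden = df.isin(list(placeholders))  # True where a cell holds a placeholder
    cleaned = df.mask(hidden)             # placeholder cells become NaN
    counts = cleaned.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return report.sort_values("percent", ascending=False)

demo = pd.DataFrame({
    "Company/Brand": ["A", "B", "C", "D"],
    "Stage": ["Seed", None, "Series A", "Seed"],
    "Amount($)": ["250000", "—", "—", "$100"],
})
report = missing_report(demo)
print(report)
```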

In [None]:
# Check for duplicated rows

startup_df.duplicated().sum()

22

In [None]:
# Check for Duplicated columns

startup_df.T.duplicated().sum()

0

In [None]:
#Drop duplicate rows and create another dataframe without them

unduplicated_startup_df = startup_df.drop_duplicates()
unduplicated_startup_df

In [None]:
unduplicated_startup_df.shape

(2559, 9)

## Univariate Analysis

‘Univariate analysis’ is the analysis of one variable at a time. This analysis might be done by computing some statistical indicators and by plotting some charts, respectively using the pandas dataframe's method `.describe()` and one of the plotting libraries like [Seaborn](https://seaborn.pydata.org/), [Matplotlib](https://matplotlib.org/), [Plotly](https://plotly.com/python/), etc.

Please, read [this article](https://towardsdatascience.com/8-seaborn-plots-for-univariate-exploratory-data-analysis-eda-in-python-9d280b6fe67f) to know more about the charts.

In [None]:
# Code here
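
As a starting point for this section, the sketch below summarises one numeric variable and one categorical variable. It assumes the amounts have already been converted to floats; `univariate_summary` is a hypothetical helper and the demo values are made up. Funding amounts usually span several orders of magnitude, so the same statistics are also reported on a log10 scale.

```python
import numpy as np
import pandas as pd

def univariate_summary(amounts_usd):
    """Summary statistics for one numeric column, raw and on a log10 scale."""
    clean = amounts_usd.dropna()
    clean = clean[clean > 0]  # log10 is undefined for zero or negative values
    return pd.DataFrame({
        "usd": clean.describe(),
        "log10_usd": np.log10(clean).describe(),
    })

demo_amounts = pd.Series([1e5, 1e6, 1e7, np.nan])
demo_stage = pd.Series(["Seed", "Seed", None, "Series A"])

summary = univariate_summary(demo_amounts)
stage_counts = demo_stage.value_counts(dropna=False)  # frequency of each category
print(summary)
print(stage_counts)
```

For a chart, `sns.histplot` on the log-scaled amounts or `sns.countplot` on the stage column would be natural next steps.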

## Multivariate Analysis

‘Multivariate analysis’ is the analysis of more than one variable and aims to study the relationships among them. This analysis might be done by computing some statistical indicators like the `correlation` and by plotting some charts.

Please, read [this article](https://towardsdatascience.com/10-must-know-seaborn-functions-for-multivariate-data-analysis-in-python-7ba94847b117) to know more about the charts.

In [None]:
# Code here
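
Two of the research questions (funding by stage, and founding year against funding amount) can be approached with a grouped summary and a rank correlation. The sketch below uses made-up values. `Amount_USD` is a hypothetical column holding numeric amounts and `funding_by_group` is an illustrative helper.

```python
import pandas as pd

def funding_by_group(df, group_col, amount_col="Amount_USD"):
    """Count, median and mean funding for each value of group_col."""
    return (
        df.groupby(group_col)[amount_col]
        .agg(["count", "median", "mean"])
        .sort_values("median", ascending=False)
    )

demo = pd.DataFrame({
    "Stage": ["Seed", "Seed", "Series A", "Series A", "Series B"],
    "Founded": [2020, 2019, 2017, 2016, 2014],
    "Amount_USD": [1e5, 3e5, 2e6, 4e6, 2e7],
})

by_stage = funding_by_group(demo, "Stage")

# Spearman works on ranks, so a few very large rounds do not dominate the result
rho = demo[["Founded", "Amount_USD"]].corr(method="spearman").loc["Founded", "Amount_USD"]

print(by_stage)
print(rho)
```

The median is used for ordering because a handful of very large rounds can pull the mean far from a typical deal.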

# Feature processing
Here is the section to **clean** and **process** the features of the dataset.

## Missing/NaN Values
Handle the missing/NaN values using the Scikit-learn SimpleImputer

In [None]:
# Code Here
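
A minimal sketch of how `SimpleImputer` could be applied here, using a toy frame. The choice of strategies is an assumption: the median for the numeric `Founded` column, and a constant `"Unknown"` label for `Stage` so that the many rows without a stage are not all pushed into the most common category.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

demo = pd.DataFrame({
    "Founded": [2015.0, np.nan, 2019.0, 2017.0],
    "Stage": pd.Series(["Seed", np.nan, "Seed", "Series A"], dtype=object),
})

numeric_imputer = SimpleImputer(strategy="median")
category_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")

imputed = demo.copy()
# fit_transform expects 2-D input, hence the double brackets
imputed[["Founded"]] = numeric_imputer.fit_transform(demo[["Founded"]])
imputed[["Stage"]] = category_imputer.fit_transform(demo[["Stage"]])
print(imputed)
```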

## Scaling
Scale the numeric features using the Scikit-learn StandardScaler, MinMaxScaler, or another Scaler.

In [None]:
# Code here
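
A short sketch of scaling with `StandardScaler`, on made-up values. Taking log10 first is an assumption motivated by the wide range of funding amounts, and `Amount_USD` is a hypothetical numeric column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

demo = pd.DataFrame({"Amount_USD": [1e5, 1e6, 1e7]})

# Compress the range before standardising (assumes strictly positive amounts)
log_amount = np.log10(demo[["Amount_USD"]])

scaler = StandardScaler()
demo["Amount_scaled"] = scaler.fit_transform(log_amount).ravel()
print(demo)
```

After scaling, the column has mean 0 and unit variance; `MinMaxScaler` could be swapped in with the same `fit_transform` call if a 0 to 1 range is preferred.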

## Encoding
Encode the categorical features using the Scikit-learn OneHotEncoder.

In [None]:
# Code here
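
A minimal sketch of one-hot encoding the `Stage` column on toy data. The `describe()` output earlier reports 71 distinct stages and 816 distinct sectors, so in practice rare categories would probably need to be grouped before encoding; that step is not shown here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({"Stage": ["Seed", "Series A", "Seed", "Series B"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(demo[["Stage"]]).toarray()

# Build readable column names from the categories learned during fit
columns = [f"Stage_{name}" for name in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=columns, index=demo.index)
print(encoded_df)
```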