<a href="https://colab.research.google.com/github/eaedk/Machine-Learning-Tutorials/blob/main/DataAnalysis_Step_By_Step_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro
## General


"Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations." (https://towardsdatascience.com/exploratory-data-analysis)

EDA enables us to ask meaningful questions and gain insights into the factors that can impact our business. It also informs conclusions and supports decision making.

In data analysis, there are three common paradigms: **univariate, bivariate, and multivariate**. We apply them to the features (also called statistical variables, or the columns of the dataframe) to understand the dataset better.

**Numeric features** are features with numbers that you can perform mathematical operations on. They are further divided into discrete (countable integers with clear boundaries) and continuous (can take any value, even decimals, within a range).

**Categorical features** are columns with a limited number of possible values. Examples are `sex, country, or age group`.
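As a small illustration of the distinction, the sketch below splits the columns of a toy dataframe into numeric and categorical groups with `select_dtypes`. The dataframe and its column names are made up for this example and are not part of the funding datasets.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "employees": [12, 40, 7],               # numeric, discrete
    "funding_usd": [2.5e5, 1.2e6, 7.5e3],   # numeric, continuous
    "stage": ["Seed", "Series A", "Seed"],  # categorical
})

# Columns pandas stores as numbers versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Note that a column stored as text can still be numeric in meaning (for example, amounts saved with currency symbols), so the dtype is only a first hint.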

## Notebook overview

This notebook documents the process of analyzing the funding received by Indian start-ups from 2018 to 2021.

# Setup

## Installation
Here is the section to install all the packages/libraries that will be needed to tackle the challenge.

In [2]:
# pip install pandas
# pip install numpy
# pip install matplotlib
!pip install forex_python
!pip install babel
!pip install seaborn



## Importation
Here is the section to import all the packages/libraries that will be used through this notebook.

In [3]:
# Data handling
import pandas as pd
import numpy as np
from forex_python.converter import CurrencyRates
from babel.numbers import format_currency
import datetime as dt

# Visualization (Matplotlib, Plotly, Seaborn, etc. )
import matplotlib.pyplot as plt
import seaborn as sn

# Data Loading
Here is the section to load the yearly funding datasets (2018 to 2021).

In [4]:
df2018 = pd.read_csv('India Startup Funding/startup_funding2018.csv')
df2019 = pd.read_csv('India Startup Funding/startup_funding2019.csv')
df2020 = pd.read_csv('India Startup Funding/startup_funding2020.csv')
df2021 = pd.read_csv('India Startup Funding/startup_funding2021.csv')
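The four `read_csv` calls above share one filename pattern, so they could also be written as a loop. The helper below is a sketch only: `load_yearly_files` is a hypothetical name, and it assumes every file follows the `startup_funding<year>.csv` pattern seen above.

```python
from pathlib import Path

import pandas as pd


def load_yearly_files(folder, years):
    """Read one CSV per year and return a dict mapping year -> DataFrame.

    Assumes files are named startup_funding<year>.csv inside `folder`.
    """
    folder = Path(folder)
    return {
        year: pd.read_csv(folder / f"startup_funding{year}.csv")
        for year in years
    }
```

With the folder used in this notebook, `load_yearly_files('India Startup Funding', range(2018, 2022))` would return the four dataframes keyed by year.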

# Exploratory Data Analysis: EDA
Here is the section to **inspect** the datasets in depth, **present** them, make **hypotheses**, and **think** about the *cleaning, processing and feature creation*.

## Dataset overview

Have a look at the loaded datasets using the following attributes and methods: `.shape`, `.head()`, `.tail()`, `.info()`

In [5]:
# A quick look at the shape of our dataset

df2018.shape

(526, 6)

In [6]:
#Taking a look at the head of our 2018 Data

df2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [7]:
#Taking a look at the tail of our 2018 Data

df2018.tail()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif..."
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...
524,Droni Tech,Information Technology,Seed,"₹35,000,000","Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...
525,Netmeds,"Biotechnology, Health Care, Pharmaceutical",Series C,35000000,"Chennai, Tamil Nadu, India",Welcome to India's most convenient pharmacy!


In [8]:
#Look at the columns in the dataset and their data types

df2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB
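The `info()` output shows that `Amount` is stored as `object`: the values mix plain digits, rupee-prefixed strings with thousands separators, and a dash placeholder. The sketch below shows one way to normalize such strings on a small sample. The first three sample values are modelled on the rows displayed above, while the dollar-prefixed entry is a hypothetical addition.

```python
import pandas as pd

# Sample strings modelled on the Amount values displayed above;
# "$1,000" is a hypothetical dollar-prefixed entry.
amount = pd.Series(["250000", "₹40,000,000", "—", "$1,000"])

cleaned = (
    amount
    .str.replace(",", "", regex=False)  # drop thousands separators
    .replace("—", "0")                  # whole-cell placeholder becomes 0
)

# Record the currency before stripping the symbols.
currency = cleaned.str.contains("₹").map({True: "INR", False: "USD"})
numeric = cleaned.str.replace(r"[$₹]", "", regex=True).astype(float)

print(pd.DataFrame({"raw": amount, "currency": currency, "amount": numeric}))
```

Removing the commas, rather than substituting another character for them, keeps the number of digits in each amount unchanged.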


In [9]:
#Removing thousands separators and replacing — with 0 in the Amount column only
df2018['Amount'] = df2018['Amount'].str.replace(',', '', regex=False).replace('—', '0')
df2018

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,₹40000000,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,₹65000000,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,0,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...
...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif..."
522,Happyeasygo Group,"Tourism, Travel",Series A,0,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...
524,Droni Tech,Information Technology,Seed,₹35000000,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...


In [10]:
#Creating new currency and exchange-rate date columns
df2018['Currency'] = np.where(df2018.Amount.str.contains('₹'), 'INR', 'USD')
df2018['Currency_Rate_Date'] = dt.datetime(2018,12,31)
df2018.head(350)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Currency,Currency_Rate_Date
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",USD,2018-12-31
1,Happy Cow Dairy,"Agriculture, Farming",Seed,₹40000000,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,INR,2018-12-31
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,₹65000000,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,INR,2018-12-31
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,USD,2018-12-31
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,0,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,USD,2018-12-31
...,...,...,...,...,...,...,...,...
345,FreightBro,"Apps, B2B, Freight Service, Logistics, SaaS, S...",Seed,0,"Mumbai, Maharashtra, India",Software for the new-age freight,USD,2018-12-31
346,Finwego,—,Seed,0,"Chennai, Tamil Nadu, India",Finwego partners with Small and Medium Busines...,USD,2018-12-31
347,Cricnwin,"Digital Entertainment, Fantasy Sports, Gaming,...",Seed,0,"Gurgaon, Haryana, India",Cricnwin is a Gurugram - based Fan Engagement ...,USD,2018-12-31
348,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",USD,2018-12-31


In [11]:
#Stripping the currency symbols and converting Amount to float
df2018['Amount'] = df2018['Amount'].replace(r'[$₹]', '', regex=True).astype(float)
df2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Company Name        526 non-null    object        
 1   Industry            526 non-null    object        
 2   Round/Series        526 non-null    object        
 3   Amount              526 non-null    float64       
 4   Location            526 non-null    object        
 5   About Company       526 non-null    object        
 6   Currency            526 non-null    object        
 7   Currency_Rate_Date  526 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(6)
memory usage: 33.0+ KB


In [13]:
#Converting Amount to the same currency (USD), using the rate on Currency_Rate_Date

c = CurrencyRates()
df2018['Amount_USD'] = df2018.apply(lambda x: c.convert(x.Currency, 'USD', x.Amount, x.Currency_Rate_Date), axis = 1)

#Current USD to INR rate, kept for reference
exchange_rate = c.get_rate('USD','INR')

df2018.head(350)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Currency,Currency_Rate_Date,Amount_USD
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,2.500000e+05,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",USD,2018-12-31,2.500000e+05
1,Happy Cow Dairy,"Agriculture, Farming",Seed,4.000000e+07,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,INR,2018-12-31,5.744402e+05
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,6.500000e+07,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,INR,2018-12-31,9.334653e+05
3,PayMe India,"Financial Services, FinTech",Angel,2.000000e+06,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,USD,2018-12-31,2.000000e+06
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,0.000000e+00,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,USD,2018-12-31,0.000000e+00
...,...,...,...,...,...,...,...,...,...
345,FreightBro,"Apps, B2B, Freight Service, Logistics, SaaS, S...",Seed,0.000000e+00,"Mumbai, Maharashtra, India",Software for the new-age freight,USD,2018-12-31,0.000000e+00
346,Finwego,—,Seed,0.000000e+00,"Chennai, Tamil Nadu, India",Finwego partners with Small and Medium Busines...,USD,2018-12-31,0.000000e+00
347,Cricnwin,"Digital Entertainment, Fantasy Sports, Gaming,...",Seed,0.000000e+00,"Gurgaon, Haryana, India",Cricnwin is a Gurugram - based Fan Engagement ...,USD,2018-12-31,0.000000e+00
348,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,2.500000e+05,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",USD,2018-12-31,2.500000e+05
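The conversion above calls `forex_python` once per row and depends on an external rates service. As an offline cross-check, the same idea can be sketched with a single fixed rate and a vectorized `np.where`. The rate below is a placeholder chosen only for illustration, not the real rate for 2018-12-31, and `ASSUMED_INR_TO_USD` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Placeholder rate for illustration only; look up the real INR to USD rate
# for the date you care about before using this in an analysis.
ASSUMED_INR_TO_USD = 0.014

funding = pd.DataFrame({
    "Amount": [250000.0, 40000000.0, 0.0],
    "Currency": ["USD", "INR", "USD"],
})

# Convert only the INR rows; USD rows pass through unchanged.
funding["Amount_USD"] = np.where(
    funding["Currency"] == "INR",
    funding["Amount"] * ASSUMED_INR_TO_USD,
    funding["Amount"],
)
print(funding)
```

A fixed rate ignores day-to-day movement, so this is only suitable as a sanity check on the order of magnitude of the converted amounts.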


In [14]:
# A quick look at the shape of our dataset

df2019.shape

(89, 9)

In [15]:
#Taking a look at the head of our 2019 Data

df2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [16]:
#Taking a look at the tail of our 2019 Data

df2019.tail()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
84,Infra.Market,,Mumbai,Infratech,It connects client requirements to their suppl...,"Aaditya Sharda, Souvik Sengupta","Tiger Global, Nexus Venture Partners, Accel Pa...","$20,000,000",Series A
85,Oyo,2013.0,Gurugram,Hospitality,Provides rooms for comfortable stay,Ritesh Agarwal,"MyPreferred Transformation, Avendus Finance, S...","$693,000,000",
86,GoMechanic,2016.0,Delhi,Automobile & Technology,Find automobile repair and maintenance service...,"Amit Bhasin, Kushal Karwa, Nitin Rana, Rishabh...",Sequoia Capital,"$5,000,000",Series B
87,Spinny,2015.0,Delhi,Automobile,Online car retailer,"Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...","Norwest Venture Partners, General Catalyst, Fu...","$50,000,000",
88,Ess Kay Fincorp,,Rajasthan,Banking,Organised Non-Banking Finance Company,Rajendra Setia,"TPG, Norwest Venture Partners, Evolvence India","$33,000,000",


In [17]:
#Look at the columns in the dataset and their data types

df2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [18]:
# A quick look at the shape of our dataset

df2020.shape

(1055, 10)

In [19]:
#Taking a look at the head of our 2020 Data

df2020.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Unnamed: 9
0,Aqgromalin,2019,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,"$200,000",,
1,Krayonnz,2019,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,"$100,000",Pre-seed,
2,PadCare Labs,2018,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,Undisclosed,Pre-seed,
3,NCOME,2020,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital","$400,000",,
4,Gramophone,2016,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge","$340,000",,


In [20]:
#Taking a look at the tail of our 2020 Data

df2020.tail()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Unnamed: 9
1050,Leverage Edu,,Delhi,Edtech,AI enabled marketplace that provides career gu...,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures","$1,500,000",,
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital","$13,200,000",Seed Round,
1052,Purplle,2012.0,Mumbai,Cosmetics,Online makeup and beauty products retailer,"Manish Taneja, Rahul Dash",Verlinvest,"$8,000,000",,
1053,Shuttl,2015.0,Delhi,Transport,App based bus aggregator serice,"Amit Singh, Deepanshu Malviya",SIG Global India Fund LLP.,"$8,043,000",Series C,
1054,Pando,2017.0,Chennai,Logitech,Networked logistics management software,"Jayakrishnan, Abhijeet Manohar",Chiratae Ventures,"$9,000,000",Series A,


In [21]:
#Look at the columns in the dataset and their data types

df2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company/Brand  1055 non-null   object
 1   Founded        843 non-null    object
 2   HeadQuarter    961 non-null    object
 3   Sector         1042 non-null   object
 4   What it does   1055 non-null   object
 5   Founders       1043 non-null   object
 6   Investor       1017 non-null   object
 7   Amount($)      1052 non-null   object
 8   Stage          591 non-null    object
 9   Unnamed: 9     2 non-null      object
dtypes: object(10)
memory usage: 82.5+ KB


In [22]:
# A quick look at the shape of our dataset

df2021.shape

(1209, 9)

In [23]:
#Taking a look at the head of our 2021 Data

df2021.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [24]:
#Taking a look at the tail of our 2021 Data

df2021.tail()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
1204,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A
1205,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D
1206,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C
1207,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B
1208,WeRize,2019.0,Bangalore,Financial Services,India’s first socially distributed full stack ...,"Vishal Chopra, Himanshu Gupta","3one4 Capital, Kalaari Capital",$8000000,Series A


In [25]:
#Look at the columns in the dataset and their data types

df2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What it does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount($)      1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


A careful look at the data shows that df2018 has a different set of columns from the other datasets, so we will combine only df2019, df2020 and df2021.

df2018 will be cleaned and analyzed separately.
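When frames are stacked with `pd.concat`, each keeps its own row labels unless told otherwise, and the year each row came from is lost. The sketch below shows one optional variation on toy frames: tagging each frame with a `Year` column (a name introduced here for illustration, not present in the source files) and rebuilding the index.

```python
import pandas as pd

# Tiny stand-ins for the yearly frames; the real ones come from read_csv.
a = pd.DataFrame({"Company/Brand": ["A", "B"], "Amount($)": ["$1", "$2"]})
b = pd.DataFrame({"Company/Brand": ["C"], "Amount($)": ["$3"]})

combined = pd.concat(
    [a.assign(Year=2019), b.assign(Year=2020)],
    ignore_index=True,  # build a fresh 0..n-1 index instead of repeating labels
)
print(combined)
```

Whether to reset the index is a design choice: keeping the per-file labels makes it easy to trace a row back to its source file, while a fresh index avoids duplicate labels in later lookups.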

In [26]:
frames = [df2019, df2020, df2021]

df = pd.concat(frames)

In [27]:
df

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Unnamed: 9
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,
...,...,...,...,...,...,...,...,...,...,...
1204,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A,
1205,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D,
1206,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C,
1207,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B,


#### Let's take a look at the unique values in each of the columns to get a general overview of what the table contains

In [28]:
df['Company/Brand'].unique()

array(['Bombay Shaving', 'Ruangguru', 'Eduisfun', ...,
       'Cogos Technologies', 'Vahdam', 'WeRize'], dtype=object)

In [29]:
df['Founded'].unique()

array([nan, 2014.0, 2004.0, 2013.0, 2010.0, 2018.0, 2019.0, 2017.0,
       2011.0, 2015.0, 2016.0, 2012.0, 2008.0, '2019', '2018', '2020',
       '2016', '2008', '2015', '2017', '2014', '1998', '2007', '2011',
       '1982', '2013', '2009', '2012', '1995', '2010', '2006', '1978',
       '1999', '1994', '2005', '1973', '-', '2002', '2004', '2001',
       2021.0, 2020.0, 1993.0, 1999.0, 1989.0, 2009.0, 2002.0, 1994.0,
       2006.0, 2000.0, 2007.0, 1978.0, 2003.0, 1998.0, 1991.0, 1984.0,
       2005.0, 1963.0], dtype=object)

In [30]:
df['HeadQuarter'].unique()

array([nan, 'Mumbai', 'Chennai', 'Telangana', 'Pune', 'Bangalore',
       'Noida', 'Delhi', 'Ahmedabad', 'Gurugram', 'Haryana', 'Chandigarh',
       'Jaipur', 'New Delhi', 'Surat', 'Uttar pradesh', 'Hyderabad',
       'Rajasthan', 'Indore', 'Gurgaon', 'Belgaum', 'Andheri', 'Kolkata',
       'Tirunelveli, Tamilnadu', 'Thane', 'Singapore', 'Gujarat',
       'Kerala', 'Jodhpur', 'Jaipur, Rajastan',
       'Frisco, Texas, United States', 'California', 'Dhingsara, Haryana',
       'New York, United States', 'Patna',
       'San Francisco, California, United States',
       'San Francisco, United States', 'San Ramon, California',
       'Paris, Ile-de-France, France', 'Plano, Texas, United States',
       'Sydney', 'San Francisco Bay Area, Silicon Valley, West Coast',
       'Bangaldesh', 'London, England, United Kingdom',
       'Sydney, New South Wales, Australia', 'Milano, Lombardia, Italy',
       'Palmwoods, Queensland, Australia', 'France',
       'San Francisco Bay Area, West Coast, W

In [31]:
df['Sector'].unique()

array(['Ecommerce', 'Edtech', 'Interior design', 'AgriTech', 'Technology',
       'SaaS', 'AI & Tech', 'E-commerce', 'E-commerce & AR', 'Fintech',
       'HR tech', 'Food tech', 'Health', 'Healthcare', 'Safety tech',
       'Pharmaceutical', 'Insurance technology', 'AI', 'Foodtech', 'Food',
       'IoT', 'E-marketplace', 'Robotics & AI', 'Logistics', 'Travel',
       'Manufacturing', 'Food & Nutrition', 'Social Media', nan,
       'E-Sports', 'Cosmetics', 'B2B', 'Jewellery', 'B2B Supply Chain',
       'Games', 'Food & tech', 'Accomodation', 'Automotive tech',
       'Legal tech', 'Mutual Funds', 'Cybersecurity', 'Automobile',
       'Sports', 'Healthtech', 'Yoga & wellness', 'Virtual Banking',
       'Transportation', 'Transport & Rentals',
       'Marketing & Customer loyalty', 'Infratech', 'Hospitality',
       'Automobile & Technology', 'Banking', 'EdTech',
       'Hygiene management', 'Escrow', 'Networking platform', 'FinTech',
       'Crowdsourcing', 'Food & Bevarages', 'HealthTec

In [32]:
df['What it does'].unique()

array(['Provides a range of male grooming products',
       'A learning platform that provides topic-based journey, animated videos, quizzes, infographic and mock tests to students',
       'It aims to make learning fun via games.', ...,
       'International education loans for high potential students.',
       'Collegedekho.com is Student’s Partner, Friend & Confidante, To Help Him Take a Decision and Move On to His Career Goals.',
       'India’s first socially distributed full stack financial services platform for small town India'],
      dtype=object)
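The column-by-column inspection above can be condensed into one summary table. The sketch below, on a made-up sample, counts distinct values per column and compares that with a case-insensitive count, which helps surface labels that differ only in capitalization (such as `'Edtech'` and `'EdTech'` in the Sector output).

```python
import pandas as pd

# Made-up sample; the Sector spellings mirror variants seen in the output above.
sample = pd.DataFrame({
    "Sector": ["Edtech", "EdTech", "Fintech", None],
    "HeadQuarter": ["Mumbai", "Mumbai", "Delhi", "Pune"],
})

summary = pd.DataFrame({
    # distinct non-missing values as stored
    "n_unique": sample.nunique(),
    # distinct values after ignoring letter case
    "n_unique_casefold": sample.apply(
        lambda col: col.dropna().str.casefold().nunique()
    ),
})
print(summary)
```

A gap between the two counts for a column suggests that its labels may need normalizing before any grouping.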

# Feature Processing
Here is the section to **clean** and **process** the features of the dataset.

### Missing/NaN Values
Each column is cleaned individually: we first count its empty rows, then extract those rows and handle the missing/NaN values.
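Before going column by column, it can help to rank all columns by how many values they are missing. The sketch below does this on a small made-up sample and also counts a `'-'` placeholder, which `isna()` does not treat as missing.

```python
import numpy as np
import pandas as pd

# Made-up sample mixing floats, strings, NaN and a '-' placeholder.
sample = pd.DataFrame({
    "Founded": [2014.0, np.nan, "2019", "-"],
    "Stage": ["Seed", None, None, "Series A"],
})

# Missing values per column, largest first.
missing = sample.isna().sum().sort_values(ascending=False)
print(missing)

# Placeholders such as '-' are not NaN, so they need a separate check.
dash_count = (sample["Founded"] == "-").sum()
print(dash_count)
```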

#### Founded Column

In [33]:
# Total Number of empty rows in the Founded Column 
df['Founded'].isnull().sum()

242

In [34]:
# Extracting the rows with missing data in the Founded column
df[df['Founded'].isna()]

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Unnamed: 9
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,
5,FlytBase,,Pune,Technology,A drone automation platform,Nitin Gupta,Undisclosed,Undisclosed,,
6,Finly,,Bangalore,SaaS,It builds software products that makes work si...,"Vivek AG, Veekshith C Rai","Social Capital, AngelList India, Gemba Capital...",Undisclosed,,
8,Quantiphi,,,AI & Tech,It is an AI and big data services company prov...,Renuka Ramnath,Multiples Alternate Asset Management,"$20,000,000",Series A,
...,...,...,...,...,...,...,...,...,...,...
1043,Quicko,,Ahmedabad,Taxation,Online tax planning and filing platform,Vishvajit Sonagara,"Zerodha fintech fund, Rainmatter","$280,000",,
1044,Satin Creditcare,,Gurgaon,Fintech,A micro finance company,,Austrian Bank,"$15,000,000",,
1050,Leverage Edu,,Delhi,Edtech,AI enabled marketplace that provides career gu...,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures","$1,500,000",,
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital","$13,200,000",Seed Round,


In [35]:
# fill NaN rows with 0

df['Founded'] = df['Founded'].replace(np.nan, 0)

In [36]:
# filling rows with - with 0

df['Founded'] = df['Founded'].replace('-',0)

In [37]:
#Change the datatype of Founded column from Float to int first to remove the decimal

df['Founded'] = df['Founded'].astype(int)

In [38]:
#Replace the missing data (now 0) in the Founded column with 'N/A'

df['Founded'] = df['Founded'].replace(0, 'N/A')

In [39]:
#Change the datatype of the Founded column to string

df['Founded'] = df['Founded'].astype(str)

In [40]:
df

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Unnamed: 9
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,
1,Ruangguru,2014,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,
3,HomeLane,2014,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,
4,Nu Genes,2004,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,
...,...,...,...,...,...,...,...,...,...,...
1204,Gigforce,2019,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A,
1205,Vahdam,2015,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D,
1206,Leap Finance,2019,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C,
1207,CollegeDekho,2015,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B,
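The steps above leave `Founded` as a string column. If the year is needed for arithmetic later (for example, computing a company's age), an alternative is to keep it numeric. The sketch below is one possible approach rather than the method used in this notebook: `pd.to_numeric` with `errors="coerce"` followed by pandas' nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

# Sample mirroring the mix seen in Founded: NaN, floats, strings and '-'.
founded = pd.Series([np.nan, 2014.0, "2019", "-"], dtype=object)

# errors="coerce" turns anything unparseable (here '-') into NaN;
# the nullable Int64 dtype then stores whole years alongside <NA>.
founded_clean = pd.to_numeric(founded, errors="coerce").astype("Int64")
print(founded_clean)
```

The trade-off is that missing years show up as `<NA>` rather than the text `'N/A'`, which may or may not suit how the column is displayed later.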


#### The HeadQuarter, Sector, What it does, Founders, Investor and Stage columns all hold text, so we replace the missing data in those columns with 'N/A'

In [41]:
#Replacing missing values in the text columns with 'N/A'

text_cols = ['HeadQuarter','Sector','What it does', 'Founders', 'Investor','Stage']
df[text_cols] = df[text_cols].fillna('N/A')

In [42]:
df

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Unnamed: 9
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,
1,Ruangguru,2014,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,
3,HomeLane,2014,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,
4,Nu Genes,2004,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,
...,...,...,...,...,...,...,...,...,...,...
1204,Gigforce,2019,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A,
1205,Vahdam,2015,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D,
1206,Leap Finance,2019,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C,
1207,CollegeDekho,2015,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B,


In [None]:
# A look at Unnamed: 9 column 

df['Unnamed: 9'].unique()

In [None]:
df['Stage'].unique()

## Issues With The Data

After looking carefully at the data, the following issues were identified:

1. The 2018 dataset has different and fewer columns than the other years
2. The datasets have missing values
3. Some values are in the wrong columns
4. The data types of some of the columns need to be changed
5. One column is unnamed
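Some of these issues can be checked programmatically. As a sketch for issue 3, the code below flags rows whose amount is present but does not parse as a number. The sample is made up: `'Undisclosed'` appears in the real data, while `'Series C'` is a hypothetical example of a stage value that has landed in the amount column.

```python
import pandas as pd

# Made-up sample; 'Series C' in the amount column is a hypothetical misplacement.
sample = pd.DataFrame({
    "Company/Brand": ["A", "B", "C", "D"],
    "Amount($)": ["$6,300,000", "Undisclosed", "Series C", None],
})

# Strip '$' and ',' and try to parse what remains as a number.
stripped = sample["Amount($)"].str.replace(r"[$,]", "", regex=True)
parsed = pd.to_numeric(stripped, errors="coerce")

# Rows that have a value but fail to parse are candidates for review.
suspect = sample[parsed.isna() & sample["Amount($)"].notna()]
print(suspect)
```

Rows flagged this way still need a manual look, since a non-numeric amount can be a legitimate `'Undisclosed'` as well as a misplaced value.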


## Univariate Analysis

'Univariate analysis' is the analysis of one variable at a time. It can be done by computing statistical indicators with the pandas dataframe method `.describe()` and by plotting charts with one of the plotting libraries, such as [Seaborn](https://seaborn.pydata.org/), [Matplotlib](https://matplotlib.org/), or [Plotly](https://plotly.com/python/).

Please, read [this article](https://towardsdatascience.com/8-seaborn-plots-for-univariate-exploratory-data-analysis-eda-in-python-9d280b6fe67f) to know more about the charts.
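As a starting point, here is a minimal sketch of univariate summaries on a small sample. The amounts and stages are modelled on rows displayed earlier for the 2018 data, and treating all of them as USD is an assumption made for this illustration.

```python
import pandas as pd

# Values modelled on rows shown earlier; treating them all as USD is an assumption.
sample = pd.DataFrame({
    "Amount_USD": [250000.0, 2000000.0, 7500.0, 35000000.0, 225000000.0],
    "Stage": ["Seed", "Angel", "Seed", "Series C", "Series C"],
})

# Numeric variable: count, mean, spread, and quartiles.
print(sample["Amount_USD"].describe())

# Categorical variable: frequency of each level.
stage_counts = sample["Stage"].value_counts()
print(stage_counts)
```

The same two columns could then be charted, for instance with Seaborn's `histplot` for the amounts and `countplot` for the stages.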

In [None]:
# code here

## Multivariate Analysis

'Multivariate analysis' is the analysis of more than one variable at a time and aims to study the relationships among them. It can be done by computing statistical indicators such as the `correlation` and by plotting charts.

Please, read [this article](https://towardsdatascience.com/10-must-know-seaborn-functions-for-multivariate-data-analysis-in-python-7ba94847b117) to know more about the charts.
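As a starting point, the sketch below looks at two relationships on a small sample modelled on rows displayed earlier for the 2021 data: a correlation between two numeric variables, and a numeric variable summarized per category. The `Amount_USD` column name is an assumption (the source column is `Amount($)` stored as text).

```python
import pandas as pd

# Sample modelled on 2021 rows shown earlier; Amount_USD is an assumed numeric column.
sample = pd.DataFrame({
    "Founded": [2012, 2015, 2019, 2019, 2021],
    "Amount_USD": [30e6, 51e6, 1.2e6, 3e6, 2e6],
    "Stage": ["Series D", "Series C", "Pre-series A", "Pre-series A", "Seed"],
})

# Numeric vs numeric: pairwise Pearson correlation.
corr = sample[["Founded", "Amount_USD"]].corr()
print(corr)

# Categorical vs numeric: median amount per stage.
median_by_stage = sample.groupby("Stage")["Amount_USD"].median()
print(median_by_stage)
```

With only five rows, the numbers here say nothing about the real dataset; the point is the shape of the computation.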

In [None]:
# Code here