## Accessing your data from the database

- Follow the steps in this notebook to get access to the dataset.
- If you run into any problems, please open an issue on this repo on GitHub.

### Steps for using environment variables instead of credential literals

1. Install pyodbc - a package for connecting to your remote database server over ODBC
2. Install python-dotenv - a package for loading environment variables from a file, which helps you hide sensitive configuration information such as database credentials and API keys
3. Import all the necessary libraries
   1. pyodbc (for creating a connection)
   2. python-dotenv (for loading environment variables)
   3. os (for accessing the environment variables after loading them with the load_dotenv function. This is not needed if you use the dotenv_values function instead)
4. Now create a file called .env in the root of your project folder (Note, the file name begins with a dot)
5. In the .env file, put all your sensitive information like server name, database name, username, and password, plus the ODBC driver name that the connection code in Step 8 reads

Example

   - SERVER='server_name_here'
   - DATABASE='database_name_here'
   - USERNAME='username_here'
   - PASSWORD='password_here'
   - DRIVER='driver_name_here'


6. Next create a .gitignore file (a new file with the name `.gitignore`. Note that gitignore file names begin with a dot)
7. Open the .gitignore file and type in the name of the .env file we just created like this "/.env". This will prevent git from tracking that file. Essentially, any file named in the .gitignore file will be ignored by git and won't be checked into the repository
8. Create a connection by accessing your connection string with your defined environment variables

## Understanding the Business


India has experienced a surge in startups and funding, with over 16,000 new tech companies added in 2020. Despite funding obstacles, investment firms have shown confidence in Indian startups, with total funding of $8.4 billion in 2023. India's startup ecosystem soars with unprecedented growth and funding.

India's startup ecosystem, ranking third globally, boasts over 99,000 recognized startups and 108 unicorns valued at US$340.80 billion, with a bright future ahead.

The Indian startup ecosystem is built on several key pillars, including government support, access to capital, a growing talent pool, and a supportive culture for entrepreneurship. One of the most important factors driving the growth of startups in India is the government's focus on supporting entrepreneurship.

The top three global ecosystems remain the same from 2020, with Silicon Valley at #1, followed by New York City and London tied at #2.


- Venturing into the Indian start-up ecosystem
- To investigate the ecosystem and propose the best course of action

- We will analyze funding received by start-ups in India from 2018 to 2021.

- We will ask the following questions to help us propose the best course of action.


### QUESTIONS

1. Do older companies attract higher investment amounts than newer companies?
2. Does the average number of founders in startups vary significantly across Mumbai and Noida?
3. Is there a significant variation in the average funding amount received by companies across Seed and Angel?
4. What is the average amount of funding received in 2018 compared to 2019?
5. Which received more funding on average?
6. What is the average amount of funding received in 2020 compared to 2021?
7. Is the average amount received by startups in Bangalore more than that of New Delhi?

## 1
### Null Hypothesis : 
    Companies that are 10 years old or older do not receive more than the average amount

### Alternative Hypothesis : 
    Companies that are 10 years old or older receive more than the average amount

## 2
### Null Hypothesis:
    The average number of founders is the same in Mumbai and Noida
    
### Alternative Hypothesis:
    The average number of founders is not the same in Mumbai and Noida
  
## 3
### Null Hypothesis:
    The average amount of funding received in 2018 is more than or equal to that of 2019

### Alternative Hypothesis:
    The average amount of funding received in 2018 is less than that of 2019
    
## 4  
### Null Hypothesis: 
    The average amount of funding received in 2020 is more than or equal to that of 2021

### Alternative Hypothesis: 
    The average amount of funding received in 2020 is less than that of 2021
    
## 5   
### Null Hypothesis:
    Companies belonging to Fintech received more than or equal to the average amount of funding

### Alternative Hypothesis:
    Companies belonging to Fintech received less than the average amount of funding
    
## 6    
### Null Hypothesis:
    The average amount received by startups in Bangalore is more than or equal to that of New Delhi

### Alternative Hypothesis:
    The average amount received by startups in Bangalore is less than that of New Delhi
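
The hypotheses above do not name a statistical test. As one possible sketch, assuming Welch's two-sample t-test (an assumption, not a method stated in this notebook), a pair such as hypothesis 3 could be examined by computing a t statistic from the two yearly samples; a strongly negative value would point toward the alternative. The helper name `welch_t` and the toy numbers are hypothetical and are not taken from the funding data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-sample variance of the mean (sample variance divided by sample size)
    vm_a = variance(sample_a) / n_a
    vm_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(vm_a + vm_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (vm_a + vm_b) ** 2 / (vm_a ** 2 / (n_a - 1) + vm_b ** 2 / (n_b - 1))
    return t_stat, dof


# Toy values only, standing in for the amounts of two different years
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Turning the statistic into a p-value needs the t distribution, which a statistics library would provide; that step is left out here.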

#### Step 1 and 2 - Install pyodbc and python-dotenv
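
Both packages are normally installed with pip from a terminal. As a small sketch, the check below reports which of them cannot yet be imported in the current environment. `missing_packages` is a hypothetical helper; note that python-dotenv is imported under the module name `dotenv`.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# An empty list means both packages are available
print(missing_packages(["pyodbc", "dotenv"]))
```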

#### Step 3 - Import all the necessary packages

In [4]:
import pyodbc
import pymssql
import pypyodbc

from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package
import pandas as pd
import numpy as np


import warnings 

warnings.filterwarnings('ignore')


#### Step 4 - Create your .env file in the root of your project

#### Step 5 - In the .env file, put all your sensitive information like server name, password etc


#### Step 6 & 7 - Next create a .gitignore file and add '/.env' (the file we just created) to it. This will prevent git from tracking that file.

#### Step 8 - Create a connection by accessing your connection string with your defined environment variables

In [5]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")
driver = environment_variables.get("DRIVER")


In [6]:
# Create a connection string
connection_string = f'DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}'
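
If a key is missing from `.env`, the `.get(...)` calls above return `None`, and the f-string then quietly produces text such as `SERVER=None`. A minimal sketch of a guard against that, assuming the five key names used in this notebook; `build_connection_string` is a hypothetical helper and not part of pyodbc or python-dotenv.

```python
REQUIRED_KEYS = ("DRIVER", "SERVER", "DATABASE", "USERNAME", "PASSWORD")


def build_connection_string(env):
    """Build an ODBC connection string from a dict of settings, failing early on gaps."""
    # Treat both absent keys and empty values as missing
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        f"DRIVER={env['DRIVER']};SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['USERNAME']};PWD={env['PASSWORD']}"
    )
```

It could be called as `build_connection_string(environment_variables)` in place of the f-string, so a typo in a key name surfaces before the connection attempt.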


In [7]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary

connection = pyodbc.connect(connection_string)


In [8]:
# The SQL query to get the data is shown below.
# Note that you will not have permission to insert, delete, or update this database table.

query = "SELECT * FROM dbo.LP1_startup_funding2020"

df_fund2020 = pd.read_sql(query, connection)
df_fund2020.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,


In [9]:
query1 = 'Select * from dbo.LP1_startup_funding2021'

df_fund2021 = pd.read_sql(query1, connection)
df_fund2021.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D


Next, get data from other sources and concatenate it (depending on the project) to perform your analysis.

ALL THE BEST!!!

In [10]:
df_fund2018=pd.read_csv('startup_funding2018.csv')

In [11]:
df_fund2019=pd.read_csv('startup_funding2019.csv')


In [12]:
df_fund2018.head(3)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India


In [13]:
df_fund2019.head(3)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding


In [14]:
df_fund2020.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,


In [15]:
df_fund2021.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D


#### We will create a Year column for each dataframe before concatenating, to help us identify the year in which funding was awarded

In [16]:
df_fund2018['Year']=2018
df_fund2019['Year']=2019
df_fund2020['Year']=2020
df_fund2021['Year']=2021

In [17]:
df_fund2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
 6   Year           526 non-null    int64 
dtypes: int64(1), object(6)
memory usage: 28.9+ KB


In [18]:
df_fund2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
 9   Year           89 non-null     int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 7.1+ KB


In [19]:
df_fund2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
 10  Year           1055 non-null   int64  
dtypes: float64(2), int64(1), object(8)
memory usage: 90.8+ KB


In [20]:
df_fund2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
 9   Year           1209 non-null   int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 94.6+ KB


### Observations

- 'Company Name' in df_fund2018 is the same as 'Company/Brand' in df_fund2019 and 'Company_Brand' in df_fund2020 and df_fund2021

    - We will rename Company Name in df_fund2018 and Company/Brand in df_fund2019 to Company_Brand

- Founded is in all dataframes except df_fund2018

- df_fund2018 contains Industry whereas the remaining dataframes contain Sector; after checking, we noticed they contain similar values

    - We will rename Industry in df_fund2018 to Sector

- Location and HeadQuarter look the same: we have HeadQuarter in all dataframes except df_fund2018, which contains Location

    - We will rename HeadQuarter to Location

- Stage in df_fund2019, df_fund2020 and df_fund2021 has the same values as Round/Series in df_fund2018

    - We will rename Round/Series in df_fund2018 to Stage

- Amount in df_fund2019 is named Amount($), whereas all other dataframes contain Amount
    - We will rename Amount($) to Amount

- About Company in df_fund2018 has similar values to What_it_does in df_fund2020 and df_fund2021; in df_fund2019 the column is named What it does, without underscores
    - We will rename all of them to About Company


- Founders is in all dataframes except df_fund2018

- Investor is in all dataframes except df_fund2018

- df_fund2020 contains column10, which is almost entirely null (2 non-null values out of 1055)

    - We will drop column10




In [21]:
# Renaming the columns for easy concatenation

df_fund2018 = df_fund2018.rename(columns={
    'Company Name': 'Company_Brand',
    'Industry': 'Sector',
    'Round/Series': 'Stage',
})
df_fund2019 = df_fund2019.rename(columns={
    'Company/Brand': 'Company_Brand',
    'HeadQuarter': 'Location',
    'Amount($)': 'Amount',
    'What it does': 'About Company',
})
df_fund2020 = df_fund2020.rename(columns={
    'HeadQuarter': 'Location',
    'What_it_does': 'About Company',
})
df_fund2021 = df_fund2021.rename(columns={
    'HeadQuarter': 'Location',
    'What_it_does': 'About Company',
})
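
As a quick check on the renaming, the sketch below works on plain lists of the column names shown in the `info()` outputs (before the Year column was added) and compares which names all four datasets share before and after the mapping. `shared_columns` is a hypothetical helper, and merging the per-dataframe renames into one dictionary is a simplification that works here only because no two renames conflict.

```python
raw_columns = {
    2018: ["Company Name", "Industry", "Round/Series", "Amount", "Location", "About Company"],
    2019: ["Company/Brand", "Founded", "HeadQuarter", "Sector", "What it does",
           "Founders", "Investor", "Amount($)", "Stage"],
    2020: ["Company_Brand", "Founded", "HeadQuarter", "Sector", "What_it_does",
           "Founders", "Investor", "Amount", "Stage", "column10"],
    2021: ["Company_Brand", "Founded", "HeadQuarter", "Sector", "What_it_does",
           "Founders", "Investor", "Amount", "Stage"],
}

renames = {
    "Company Name": "Company_Brand", "Company/Brand": "Company_Brand",
    "Industry": "Sector", "Round/Series": "Stage", "HeadQuarter": "Location",
    "Amount($)": "Amount", "What it does": "About Company",
    "What_it_does": "About Company",
}


def shared_columns(columns_by_year, mapping=None):
    """Return the column names present in every dataset after applying the mapping."""
    mapping = mapping or {}
    renamed = [{mapping.get(col, col) for col in cols} for cols in columns_by_year.values()]
    return set.intersection(*renamed)


before = shared_columns(raw_columns)          # no name is shared by all four raw datasets
after = shared_columns(raw_columns, renames)  # the six columns of the 2018 dataset
```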

### Cleaning the 2018 dataset

In [22]:
#Making a copy of 2018 dataframe

df18=df_fund2018.copy()

In [23]:
# Removing the digits so that only the currency symbol and separators remain

df18['cur_symb18'] = df18['Amount'].astype(str).replace(r'\d', '', regex=True)

In [24]:
#Checking the unique currency symbols 

df18['cur_symb18'].unique()

array(['', '₹,,', '—', '₹,', '$,', '$,,', '₹,,,', '$,,,'], dtype=object)

In [25]:
# Removing every non-digit character so that only the numeric part remains

df18['Amount_no_symb18'] = df18['Amount'].astype(str).replace(r'\D', '', regex=True)
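
The two regex steps above can be illustrated on single values with plain Python. This is only a sketch of the same idea, not the notebook's code; `split_amount` is a hypothetical helper, and it folds the later comma removal into the symbol step.

```python
import re


def split_amount(raw):
    """Split a raw amount such as '₹40,000,000' into (symbol, number or None)."""
    text = str(raw)
    # Everything that is not a digit or a thousands separator is kept as the "symbol"
    symbol = re.sub(r"[\d,]", "", text)
    # Everything that is not a digit is dropped to get the numeric part
    digits = re.sub(r"\D", "", text)
    amount = float(digits) if digits else None
    return symbol, amount


print(split_amount("₹40,000,000"))
print(split_amount("Undisclosed"))
```

One caveat of stripping every non-digit character: a decimal point is removed as well, so a value like `1.5` would be read as `15`. The amounts shown in this notebook do not appear to contain decimals, but it is worth checking before relying on the result.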

In [26]:
#Checking to confirm changes were effected

df18.head(2)

Unnamed: 0,Company_Brand,Sector,Stage,Amount,Location,About Company,Year,cur_symb18,Amount_no_symb18
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,,250000
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,"₹,,",40000000


In [27]:
#Confirming the datatypes

df18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Company_Brand     526 non-null    object
 1   Sector            526 non-null    object
 2   Stage             526 non-null    object
 3   Amount            526 non-null    object
 4   Location          526 non-null    object
 5   About Company     526 non-null    object
 6   Year              526 non-null    int64 
 7   cur_symb18        526 non-null    object
 8   Amount_no_symb18  526 non-null    object
dtypes: int64(1), object(8)
memory usage: 37.1+ KB


In [28]:
# Dropping the Amount column that still contains symbols

df18=df18.drop('Amount',axis=1)

In [29]:
# Renaming columns

df18=df18.rename(columns={'Amount_no_symb18':'Amount'})
df18=df18.rename(columns={'cur_symb18':'Currency'})

In [30]:
#Confirming changes

df18.head(3)

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,,250000
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,"₹,,",40000000
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018,"₹,,",65000000


In [31]:
#Changing Amount from object datatype to numeric type

df18['Amount']=pd.to_numeric(df18['Amount'])

In [32]:
# Confirming changes

df18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  526 non-null    object 
 1   Sector         526 non-null    object 
 2   Stage          526 non-null    object 
 3   Location       526 non-null    object 
 4   About Company  526 non-null    object 
 5   Year           526 non-null    int64  
 6   Currency       526 non-null    object 
 7   Amount         378 non-null    float64
dtypes: float64(1), int64(1), object(6)
memory usage: 33.0+ KB


In [33]:
# Removing the thousands separators so that only the currency symbol remains
df18['Currency']=df18['Currency'].str.replace(',','')

In [34]:
currency_value=df18.groupby('Currency')['Amount'].mean()
currency_value

Currency
     1.219853e+07
$    5.535524e+07
—             NaN
₹    5.903119e+08
Name: Amount, dtype: float64

In [35]:
# So that Location in the 2018 dataset matches the other three datasets,
# which list only one city as the location, we keep the first entry
# (the city) in the Location column and drop the rest.

df18['Location'] = df18['Location'].str.split(',').str.get(0) 


In [36]:
df18.head()

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,,250000.0
1,Happy Cow Dairy,"Agriculture, Farming",Seed,Mumbai,A startup which aggregates milk from dairy far...,2018,₹,40000000.0
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,Gurgaon,Leading Online Loans Marketplace in India,2018,₹,65000000.0
3,PayMe India,"Financial Services, FinTech",Angel,Noida,PayMe India is an innovative FinTech organizat...,2018,,2000000.0
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,—,


#### Converting rupees to dollars

- Apart from the 2020 dataset, which has no currency symbol, the 2019 and 2021 datasets use dollar symbols.

- The 2018 dataset has both dollar and rupee symbols. Since the dollar is the dominant currency, we decided to convert the rupees to dollars.
    The average exchange rate in 2018 was 0.0146 USD per rupee (https://www.exchangerates.org.uk/INR-USD-spot-exchange-rates-history-2018.html).

- We also decided to assume that amounts with no currency symbol are in dollars.

In [37]:
# Exchange rate from rupees to dollars (average for 2018)

exchange_rate = 0.0146
# Convert the amounts that carry the rupee symbol from rupees to dollars
df18.loc[df18['Currency'] == '₹', 'Amount'] *= exchange_rate
df18.head()

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,,250000.0
1,Happy Cow Dairy,"Agriculture, Farming",Seed,Mumbai,A startup which aggregates milk from dairy far...,2018,₹,584000.0
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,Gurgaon,Leading Online Loans Marketplace in India,2018,₹,949000.0
3,PayMe India,"Financial Services, FinTech",Angel,Noida,PayMe India is an innovative FinTech organizat...,2018,,2000000.0
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,—,
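
The converted values can be checked by hand: 40,000,000 × 0.0146 = 584,000 and 65,000,000 × 0.0146 = 949,000, which matches the two rupee rows in the output. The same rule for a single value, as a sketch; `to_usd` is a hypothetical helper and the rate is the 2018 average quoted in this notebook.

```python
INR_TO_USD_2018 = 0.0146  # average 2018 rate quoted in this notebook


def to_usd(amount, currency, rate=INR_TO_USD_2018):
    """Convert rupee amounts to dollars; leave other amounts (and missing ones) unchanged."""
    if amount is None:
        return None
    return amount * rate if currency == "₹" else amount


print(round(to_usd(40_000_000, "₹")))  # Happy Cow Dairy: 584000
print(round(to_usd(65_000_000, "₹")))  # MyLoanCare: 949000
```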


In [38]:
df18['Currency']='$'

In [39]:
df18.head()

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,$,250000.0
1,Happy Cow Dairy,"Agriculture, Farming",Seed,Mumbai,A startup which aggregates milk from dairy far...,2018,$,584000.0
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,Gurgaon,Leading Online Loans Marketplace in India,2018,$,949000.0
3,PayMe India,"Financial Services, FinTech",Angel,Noida,PayMe India is an innovative FinTech organizat...,2018,$,2000000.0
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,$,


### Cleaning the 2019 dataset

In [40]:
#Making a copy of 2019 dataframe

df19=df_fund2019.copy()

In [41]:
# Removing the digits so that only the currency symbol and separators remain

df19['cur_symb19'] = df19['Amount'].astype(str).replace(r'\d', '', regex=True)

In [42]:
# Checking unique symbols

df19['cur_symb19'].unique()

array(['$,,', 'Undisclosed', '$,'], dtype=object)

#### Observations

- The currency for 2019 was $. Some businesses did not disclose their amount.

In [43]:
# Removing every non-digit character so that only the numeric part remains

df19['Amount_no_symb19'] = df19['Amount'].astype(str).replace(r'\D', '', regex=True)

In [44]:
# Dropping the Amount column that still contains symbols

df19=df19.drop('Amount',axis=1)

In [45]:
# Renaming Columns

df19=df19.rename(columns={'Amount_no_symb19':'Amount'})
df19=df19.rename(columns={'cur_symb19':'Currency'})

In [46]:
df19['Currency']=df19['Currency'].str.replace(',','')

In [47]:
#Confirming changes

df19.head(2)

Unnamed: 0,Company_Brand,Founded,Location,Sector,About Company,Founders,Investor,Stage,Year,Currency,Amount
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,,2019,$,6300000
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,Series C,2019,$,150000000


In [48]:
# Converting Amount from object datatype to numeric datatype

df19['Amount']=pd.to_numeric(df19['Amount'])

In [49]:
df19['Currency']='$'

In [50]:
# Confirming the changes

df19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   Location       70 non-null     object 
 3   Sector         84 non-null     object 
 4   About Company  89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Stage          43 non-null     object 
 8   Year           89 non-null     int64  
 9   Currency       89 non-null     object 
 10  Amount         77 non-null     float64
dtypes: float64(2), int64(1), object(8)
memory usage: 7.8+ KB


### Cleaning the 2020 dataset

In [51]:
# Making a copy of the 2020 dataset

df20=df_fund2020.copy()

In [52]:
# Confirming changes
df20.head(2)

Unnamed: 0,Company_Brand,Founded,Location,Sector,About Company,Founders,Investor,Amount,Stage,column10,Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,,2020


In [53]:
# Checking for null values in the amount column
df20['Amount'].isna().sum()

254

In [54]:
# Making sure the Amount column is numeric (the 2020 data already stores it as float64)

df20['Amount']=pd.to_numeric(df20['Amount'])

In [55]:
# Since the dollar is the dominant currency and the 2020 data has no currency symbol,
# we assume the amounts are in dollars

df20['Currency']='$'

In [56]:
#Confirming changes

df20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   Location       961 non-null    object 
 3   Sector         1042 non-null   object 
 4   About Company  1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
 10  Year           1055 non-null   int64  
 11  Currency       1055 non-null   object 
dtypes: float64(2), int64(1), object(9)
memory usage: 99.0+ KB


### Cleaning the 2021 dataset

In [57]:
#Making a copy of 2021 dataset

df21=df_fund2021.copy()

In [58]:
# Removing the digits so that only the currency symbol and separators remain

df21['cur_symb21'] = df21['Amount'].astype(str).replace(r'\d', '', regex=True)

In [59]:
#Checking for type of symbols 

df21['cur_symb21'].unique()

array(['$,,', '$,', 'Undisclosed', '$,,,', 'None', '$Undisclosed', '$',
       'Upsparks', 'Series C', 'Seed', '$$,', '$undisclosed',
       'ah! Ventures', 'Pre-series A', 'ITO Angel Network, LetsVenture',
       'JITO Angel Network, LetsVenture', '$$,,'], dtype=object)

In [60]:
# Removing every non-digit character so that only the numeric part remains

df21['Amount_no_symb21'] = df21['Amount'].astype(str).replace(r'\D', '', regex=True)

In [61]:
# Dropping the Amount column that still contains symbols

df21=df21.drop('Amount',axis=1)

In [62]:
# Renaming columns

df21=df21.rename(columns={'Amount_no_symb21':'Amount'})
df21=df21.rename(columns={'cur_symb21':'Currency'})

In [63]:
# confirming changes

df21.head(2)

Unnamed: 0,Company_Brand,Founded,Location,Sector,About Company,Founders,Investor,Stage,Year,Currency,Amount
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",Pre-series A,2021,"$,,",1200000
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,2021,"$,,",120000000


#### Observations

- The currency was $. Some entries contained no amount at all, only text such as 'Undisclosed', stage names, or investor names.
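
To see how many entries fall into each of these groups before they all become NaN, the raw Amount values could be sorted into rough categories. This is a sketch only: `classify_amount` and its three category labels are hypothetical, chosen here to mirror the kinds of values listed in the output above.

```python
import re


def classify_amount(raw):
    """Label a raw Amount value as 'numeric', 'undisclosed', or 'other text'."""
    text = str(raw).strip()
    if re.search(r"\d", text):
        return "numeric"
    # A bare '$', 'None', and any spelling of 'undisclosed' carry no amount
    if text.lower().lstrip("$") in {"undisclosed", "none", "nan", ""}:
        return "undisclosed"
    # Anything else is text that probably belongs in another column
    return "other text"


for value in ["$1,200,000", "$Undisclosed", "Series C", "Upsparks"]:
    print(value, "->", classify_amount(value))
```

Rows labelled 'other text' hold stage or investor names in the Amount column, so they may be worth inspecting individually rather than only being converted to missing values.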


In [64]:
#Converting the datatype from object to numeric 

df21['Amount']=pd.to_numeric(df21['Amount'])

In [65]:
text_value=df21.groupby('Currency')['Amount'].mean()
text_value

Currency
$                                  2.811951e+07
$$,                                1.000000e+04
$$,,                               1.550000e+05
$,                                 3.781757e+05
$,,                                3.296623e+07
$,,,                               7.550000e+10
$Undisclosed                                NaN
$undisclosed                                NaN
ITO Angel Network, LetsVenture              NaN
JITO Angel Network, LetsVenture             NaN
None                                        NaN
Pre-series A                                NaN
Seed                                        NaN
Series C                                    NaN
Undisclosed                                 NaN
Upsparks                                    NaN
ah! Ventures                                NaN
Name: Amount, dtype: float64

In [66]:
df21['Currency']='$'

In [67]:
#Confirming changes

df21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   Location       1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   About Company  1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Stage          781 non-null    object 
 8   Year           1209 non-null   int64  
 9   Currency       1209 non-null   object 
 10  Amount         1056 non-null   float64
dtypes: float64(2), int64(1), object(8)
memory usage: 104.0+ KB


In [80]:
#Concatenating Dataframes
df = pd.concat([df18, df19, df20, df21], ignore_index=True)

In [81]:
df.sample(3)

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount,Founded,Founders,Investor,column10
139,Chizel,"3D Printing, Manufacturing, Product Design",Seed,Pune,Chizel is a worlds most advanced cloud platfor...,2018,$,,,,,
753,Kitchens Centre,Proptech,Pre-series A,New Delhi,Kitchens Centre offers turnkey solutions to cl...,2020,$,,2019.0,Lakshay Jain,"Jonathan Swanson, Ankush Gera",
1306,Vakilsearch,Legal,,Chennai,"Online platform for legal, tax and compliance ...",2020,$,,2010.0,Hrishikesh Datar,Sujeet Kumar,


In [82]:
df.shape

(2879, 12)

#### Observations

- After combining the datasets from the various sources, we got 2879 rows and 12 columns
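
The shape reported by `df.shape` can be cross-checked against the `info()` outputs of the four cleaned dataframes: the row counts should add up, and the number of columns should equal the number of distinct column names. A minimal sketch using the figures printed earlier in this notebook; the variable names are hypothetical.

```python
rows_per_year = {2018: 526, 2019: 89, 2020: 1055, 2021: 1209}

# Column names of the cleaned dataframes, as listed by their info() outputs
cols_2018 = ["Company_Brand", "Sector", "Stage", "Location", "About Company",
             "Year", "Currency", "Amount"]
cols_2019 = cols_2018 + ["Founded", "Founders", "Investor"]
cols_2020 = cols_2019 + ["column10"]
cols_2021 = cols_2019

expected_rows = sum(rows_per_year.values())
expected_columns = len(set().union(cols_2018, cols_2019, cols_2020, cols_2021))

print((expected_rows, expected_columns))  # (2879, 12)
```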

In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  2879 non-null   object 
 1   Sector         2861 non-null   object 
 2   Stage          1941 non-null   object 
 3   Location       2765 non-null   object 
 4   About Company  2879 non-null   object 
 5   Year           2879 non-null   int64  
 6   Currency       2879 non-null   object 
 7   Amount         2312 non-null   float64
 8   Founded        2110 non-null   float64
 9   Founders       2334 non-null   object 
 10  Investor       2253 non-null   object 
 11  column10       2 non-null      object 
dtypes: float64(2), int64(1), object(9)
memory usage: 270.0+ KB


#### Observations

- The Amount column is reported as float64 in the output above, so no type conversion is needed; however, 567 of its 2879 values are missing, which we will investigate further
- All columns reported as object type are correctly specified
- Year and Founded are supposed to be datetime type but are stored as int64 and float64 respectively; we will change them to datetime
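
The datetime conversion mentioned above could look like the following. This is a minimal sketch on a hypothetical three-row frame (`sample` and its values are illustrative), assuming each year should map to 1 January of that year and that missing Founded values should become `NaT`:

```python
import pandas as pd

# Toy frame mimicking the dtypes reported by df.info() (hypothetical values)
sample = pd.DataFrame({
    "Year": [2018, 2020, 2021],                # int64, no missing values
    "Founded": [float("nan"), 2019.0, 2010.0], # float64, with a missing value
})

# Year is complete, so it can be turned into a 4-digit string and parsed directly
sample["Year"] = pd.to_datetime(sample["Year"].astype(str), format="%Y")

# Founded is float because of the NaNs: strip the ".0" for real values,
# keep missing ones as None so that they are parsed as NaT
founded_str = sample["Founded"].map(lambda y: str(int(y)) if pd.notna(y) else None)
sample["Founded"] = pd.to_datetime(founded_str, format="%Y", errors="coerce")

print(sample.dtypes)
```

If only the year is ever needed for analysis, keeping Year as an integer is also a reasonable choice; the conversion is shown here because the observations call for it.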

In [88]:
df.head()

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount,Founded,Founders,Investor,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,$,250000.0,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,Mumbai,A startup which aggregates milk from dairy far...,2018,$,584000.0,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,Gurgaon,Leading Online Loans Marketplace in India,2018,$,949000.0,,,,
3,PayMe India,"Financial Services, FinTech",Angel,Noida,PayMe India is an innovative FinTech organizat...,2018,$,2000000.0,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,$,,,,,


In [90]:
df.duplicated().sum()

23

In [92]:
df = df.drop_duplicates()

In [93]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Company_Brand,2856.0,2214.0,BharatPe,10.0,,,,,,,
Sector,2838.0,873.0,FinTech,172.0,,,,,,,
Stage,1927.0,75.0,Seed,599.0,,,,,,,
Location,2742.0,141.0,Bangalore,859.0,,,,,,,
About Company,2856.0,2691.0,BYJU'S is an educational technology company th...,5.0,,,,,,,
Year,2856.0,,,,2020.017857,1.087759,2018.0,2020.0,2020.0,2021.0,2021.0
Currency,2856.0,1.0,$,2856.0,,,,,,,
Amount,2293.0,,,,121934365.88792,3457172580.021939,876.0,1000000.0,3000000.0,12000000.0,150000000000.0
Founded,2088.0,,,,2016.06705,4.368211,1963.0,2015.0,2017.0,2019.0,2021.0
Founders,2312.0,1980.0,"Ashneer Grover, Shashvat Nakrani",7.0,,,,,,,


In [95]:
df['Location'].unique()

array(['Bangalore', 'Mumbai', 'Gurgaon', 'Noida', 'Hyderabad',
       'Bengaluru', 'Kalkaji', 'Delhi', 'India', 'Hubli', 'New Delhi',
       'Chennai', 'Mohali', 'Kolkata', 'Pune', 'Jodhpur', 'Kanpur',
       'Ahmedabad', 'Azadpur', 'Haryana', 'Cochin', 'Faridabad', 'Jaipur',
       'Kota', 'Anand', 'Bangalore City', 'Belgaum', 'Thane', 'Margão',
       'Indore', 'Alwar', 'Kannur', 'Trivandrum', 'Ernakulam',
       'Kormangala', 'Uttar Pradesh', 'Andheri', 'Mylapore', 'Ghaziabad',
       'Kochi', 'Powai', 'Guntur', 'Kalpakkam', 'Bhopal', 'Coimbatore',
       'Worli', 'Alleppey', 'Chandigarh', 'Guindy', 'Lucknow', nan,
       'Telangana', 'Gurugram', 'Surat', 'Uttar pradesh', 'Rajasthan',
       'Tirunelveli, Tamilnadu', None, 'Singapore', 'Gujarat', 'Kerala',
       'Jaipur, Rajastan', 'Frisco, Texas, United States', 'California',
       'Dhingsara, Haryana', 'New York, United States', 'Patna',
       'San Francisco, California, United States',
       'San Francisco, United States', 'S

In [96]:
df18['Location'].unique()

array(['Bangalore', 'Mumbai', 'Gurgaon', 'Noida', 'Hyderabad',
       'Bengaluru', 'Kalkaji', 'Delhi', 'India', 'Hubli', 'New Delhi',
       'Chennai', 'Mohali', 'Kolkata', 'Pune', 'Jodhpur', 'Kanpur',
       'Ahmedabad', 'Azadpur', 'Haryana', 'Cochin', 'Faridabad', 'Jaipur',
       'Kota', 'Anand', 'Bangalore City', 'Belgaum', 'Thane', 'Margão',
       'Indore', 'Alwar', 'Kannur', 'Trivandrum', 'Ernakulam',
       'Kormangala', 'Uttar Pradesh', 'Andheri', 'Mylapore', 'Ghaziabad',
       'Kochi', 'Powai', 'Guntur', 'Kalpakkam', 'Bhopal', 'Coimbatore',
       'Worli', 'Alleppey', 'Chandigarh', 'Guindy', 'Lucknow'],
      dtype=object)

In [97]:
df19['Location'].unique()

array([nan, 'Mumbai', 'Chennai', 'Telangana', 'Pune', 'Bangalore',
       'Noida', 'Delhi', 'Ahmedabad', 'Gurugram', 'Haryana', 'Chandigarh',
       'Jaipur', 'New Delhi', 'Surat', 'Uttar pradesh', 'Hyderabad',
       'Rajasthan'], dtype=object)

In [100]:
df20['Location'].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli, Tamilnadu', 'Thane', None,
       'Singapore', 'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur',
       'Jaipur, Rajastan', 'Delhi', 'Frisco, Texas, United States',
       'California', 'Dhingsara, Haryana', 'New York, United States',
       'Patna', 'San Francisco, California, United States',
       'San Francisco, United States', 'San Ramon, California',
       'Paris, Ile-de-France, France', 'Plano, Texas, United States',
       'Sydney', 'San Francisco Bay Area, Silicon Valley, West Coast',
       'Bangaldesh', 'London, England, United Kingdom',
       'Sydney, New South Wales, Australia', 'Milano, Lombardia, Italy',
       'Palmwoods, Queensland, Australia', 'France',
       'San Francisco Bay Area, West Coast, Western US',
       'Trivandrum, Kerala, India', 'Cochin', 'Samastipur, Bihar',


In [101]:
df21['Location'].unique()

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', None,
       'Small Towns, Andhra Pradesh', 'Goa', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Gujarat', 'Pune', 'Thane', 'Computer Games',
       'Cochin', 'Noida', 'Chandigarh', 'Gurgaon', 'Vadodara',
       'Food & Beverages', 'Pharmaceuticals\t#REF!', 'Gurugram\t#REF!',
       'Kolkata', 'Ahmedabad', 'Mohali', 'Haryana', 'Indore', 'Powai',
       'Ghaziabad', 'Nagpur', 'West Bengal', 'Patna', 'Samsitpur',
       'Lucknow', 'Telangana', 'Silvassa', 'Thiruvananthapuram',
       'Faridabad', 'Roorkee', 'Ambernath', 'Panchkula', 'Surat',
       'Coimbatore', 'Andheri', 'Mangalore', 'Telugana', 'Bhubaneswar',
       'Kottayam', 'Beijing', 'Panaji', 'Satara', 'Orissia', 'Jodhpur',
       'New York', 'Santra', 'Mountain View, CA', 'Trivandrum',
       'Jharkhand', 'Kanpur', 'Bhilwara', 'Guwahati',
       'Online Media\t#REF!', 'Kochi', 'London',
       'Information Technol

In [103]:
# Inspect the row whose Location holds a sector label instead of a place
df[df['Location']=='Information Technology & Services']

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount,Founded,Founders,Investor,column10
2846,Peak,"Manchester, Greater Manchester",Series C,Information Technology & Services,Peak helps the world's smartest companies put ...,2021,$,75000000.0,2014.0,Atul Sharma,SoftBank Vision Fund 2,
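
In this row the Sector and Location values appear to be swapped: Sector holds a place name and Location holds a sector label. One possible fix is sketched below on a one-row stand-in for the record (`row` is a hypothetical frame with values copied from the output above). It assumes a plain swap is the right correction for this record; other odd Location values such as 'Computer Games' may be misaligned differently and should be inspected before applying the same fix.

```python
import pandas as pd

# One-row stand-in for the record shown above (values copied from that output)
row = pd.DataFrame({
    "Company_Brand": ["Peak"],
    "Sector": ["Manchester, Greater Manchester"],
    "Location": ["Information Technology & Services"],
})

# Select rows whose Location holds this sector label (assumed to be swapped)
mask = row["Location"] == "Information Technology & Services"

# Swap the two fields; .to_numpy() stops pandas from realigning columns by label
row.loc[mask, ["Sector", "Location"]] = row.loc[mask, ["Location", "Sector"]].to_numpy()

print(row)
```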
