## STOCKHOLM TEAM

## Exploratory Data Analysis of the Indian StartUp Funding Ecosystem 

### Business Understanding

**Project Description:**

Explore the Indian startup funding ecosystem through an in-depth analysis of funding data from 2018 to 2021. Gain insights into key trends, funding patterns, and factors driving startup success. Investigate the relationship between funding and startup growth, with a focus on temporal patterns and city-level dynamics. Identify preferred sectors for investment and uncover industry-specific funding trends. This exploratory data analysis provides a comprehensive overview of the Indian startup ecosystem, offering valuable insights for entrepreneurs, investors, and policymakers.

## Data Understanding

This project aims to explore and gain a deeper understanding of the Indian startup funding ecosystem. The dataset used for analysis contains information about startup funding from 2018 to 2021, with one file or table per year. It includes attributes such as the company's name, sector, funding amount, funding round, investor details, and location.

To conduct a comprehensive analysis, we will examine the dataset to understand its structure, contents, and any potential data quality issues. By understanding the data, we can ensure the accuracy and reliability of our analysis.

The key attributes in the dataset include:

- **Company**: The name of the startup receiving funding.
- **Sector**: The industry or sector to which the startup belongs.
- **Amount**: The amount of funding received by the startup.
- **Stage**: The round of funding (e.g., seed, series A, series B).
- **Location**: The city or region where the startup is based.
- **About**: What the company does.

By examining these attributes, we can uncover insights about the funding landscape, identify trends in funding amounts and rounds, explore the preferred sectors for investment, and analyze the role of cities in the startup ecosystem.

Throughout the analysis, we will use visualizations and statistical techniques to present the findings effectively. By understanding the data and its characteristics, we can proceed with confidence in our analysis, derive meaningful insights, and make informed decisions based on the findings.

### Hypothesis:

#### NULL Hypothesis (H0):

#### **The sector of a company does not have an impact on the amount of funding it receives.**


#### ALTERNATE Hypothesis (HA):

#### **The sector of a company does have an impact on the amount of funding it receives.**
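One way to test these hypotheses without distributional assumptions is a permutation test on the one-way ANOVA F statistic. The sketch below is an illustration only, not the method used in this notebook: the `toy` data, the log transform and the helper names `f_statistic` and `permutation_p_value` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def f_statistic(values, groups):
    """One-way ANOVA F statistic for `values` split by the labels in `groups`."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    labels = np.unique(groups)
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        sample = values[groups == label]
        ss_between += len(sample) * (sample.mean() - grand_mean) ** 2
        ss_within += ((sample - sample.mean()) ** 2).sum()
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(values, groups, n_permutations=2000, seed=0):
    """Share of label shufflings whose F statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    observed = f_statistic(values, groups)
    hits = 0
    for _ in range(n_permutations):
        if f_statistic(values, rng.permutation(groups)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: three sectors with clearly different funding levels
toy = pd.DataFrame({
    "Sector": ["Fintech"] * 5 + ["Edtech"] * 5 + ["AgriTech"] * 5,
    "Amount": [9e6, 1.2e7, 8e6, 1.5e7, 1.1e7,
               1e6, 2e6, 1.5e6, 3e6, 2.5e6,
               2e5, 4e5, 3e5, 5e5, 1e5],
})
log_amount = np.log10(toy["Amount"])  # funding amounts are heavily skewed
p_value = permutation_p_value(log_amount, toy["Sector"])
print(f"F = {f_statistic(log_amount, toy['Sector']):.2f}, p = {p_value:.4f}")
```

A small p-value would be evidence against H0. On the real data, rows with a missing sector or amount would have to be dropped first.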




## Research / Analysis Questions:

1. What are the most common industries represented in the datasets?

2. How does the funding amount vary across different rounds/series in the datasets?
   
3. Which locations have the highest number of companies in the datasets?
   
4. What kind of investment type should startups look for depending on their industry type? (EDA: Analysis of funding preferences by industry)

5. Are there any correlations between the funding amount and the company's sector or location?
   
6. What are the top investors in the datasets based on the number of investments made?
   
7. Which industries are favored by investors based on the number of funding rounds? (EDA: Top 10 industries which are favored by investors)

8. Are there any outliers in the funding amounts in the datasets?
   
9.  Is there a relationship between the company's sector and the presence of certain investors?
    
10. What is the range of funds generally received by startups in India (Max, min, avg, and count of funding)? (EDA: Descriptive statistics of funding amounts)


## Data Preparation

Before diving into the analysis, we will preprocess and clean the data to ensure its quality and suitability for analysis. This may involve handling missing values, correcting data types, and addressing any inconsistencies or outliers that could affect the accuracy of our results.

Once the data is prepared, we will be ready to perform an in-depth exploratory analysis of the Indian startup funding ecosystem. The analysis will involve answering specific research questions, identifying patterns and trends, and generating meaningful visualizations to present the findings.

Through this process of data understanding and preparation, we will set a solid foundation for conducting a robust and insightful analysis of the Indian startup funding data.

**The data for each year comes from a separate source: 2018 and 2019 are loaded from two CSV files, and 2020 and 2021 from two tables on a remote server. They will later be merged into one dataset.**
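The later merge could look roughly like the sketch below. It assumes each yearly frame has already been renamed to a common set of column names; `COMMON_COLUMNS`, `combine_years` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Assumed target schema, based on the key attributes listed in Data Understanding
COMMON_COLUMNS = ["Company", "Sector", "Amount", "Stage", "Location", "About", "Funding Year"]


def combine_years(frames):
    """Keep only the shared columns of each yearly frame, in one order, then stack them."""
    aligned = [frame.reindex(columns=COMMON_COLUMNS) for frame in frames]
    return pd.concat(aligned, ignore_index=True)


# Hypothetical one-row stand-ins for two cleaned yearly datasets
toy_2018 = pd.DataFrame({
    "Company": ["A"], "Sector": ["Fintech"], "Amount": [250000.0], "Stage": ["Seed"],
    "Location": ["Bengaluru"], "About": ["Payments"], "Funding Year": [2018],
})
toy_2019 = pd.DataFrame({
    "Company": ["B"], "Sector": ["Edtech"], "Amount": [6300000.0], "Stage": ["Series A"],
    "Location": ["Mumbai"], "About": ["Learning"], "Funding Year": [2019],
    "Founders": ["X"],  # extra column that is not part of the shared schema
})

combined = combine_years([toy_2018, toy_2019])
print(combined)
```

`reindex(columns=...)` drops columns outside the schema and would add an all-NaN column for any that a year lacks, so the stacked frame always has the same layout.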

### Load the Packages/Modules

In [368]:
# Importing the Modules needed
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 
import pyodbc #just installed with pip
from dotenv import dotenv_values #import the dotenv_values function from the dotenv package
from forex_python.converter import CurrencyRates

### Import Datasets

In [369]:
df = pd.read_csv('startup_funding2018.csv') # read the data_2018 and convert it to pandas data frame 

In [370]:
df2 = pd.read_csv('startup_funding2019.csv') # read the data_2019 and convert it to pandas data frame

#### Accessing the Remote Server Datasets

In [371]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')


# Get the values for the credentials you set in the '.env' file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")

# Build the connection string from the credentials loaded from the '.env' file,
# so that no secrets are hard-coded in the notebook
connection_string = f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
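A small helper can make a missing `.env` entry fail early instead of producing a malformed connection string. This is a sketch only: `build_connection_string` and `example_settings` are hypothetical names, and the key names simply mirror the ones read from `.env` above.

```python
def build_connection_string(settings, driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string from a dict of settings, failing on gaps."""
    required = ["SERVER", "DATABASE", "USERNAME", "PASSWORD"]
    missing = [key for key in required if not settings.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        f"DRIVER={{{driver}}};"  # doubled braces produce literal { and }
        f"SERVER={settings['SERVER']};"
        f"DATABASE={settings['DATABASE']};"
        f"UID={settings['USERNAME']};"
        f"PWD={settings['PASSWORD']}"
    )


# Placeholder values for illustration; real values come from dotenv_values('.env')
example_settings = {
    "SERVER": "example.database.windows.net",
    "DATABASE": "exampleDB",
    "USERNAME": "example_user",
    "PASSWORD": "example_password",
}
print(build_connection_string(example_settings))
```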

In [372]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary

connection = pyodbc.connect(connection_string)

In [373]:
# The SQL queries below fetch the 2020 and 2021 data.
# Note that you will not have permission to insert, delete or update rows in these database tables.
query1 = "SELECT * FROM dbo.LP1_startup_funding2020"
query2 = "SELECT * FROM dbo.LP1_startup_funding2021"
df3 = pd.read_sql(query1, connection)
df4 = pd.read_sql(query2, connection)

#### 2018 Data

In [374]:
df.head()

|  | Company Name | Industry | Round/Series | Amount | Location | About Company |
|---|---|---|---|---|---|---|
| 0 | TheCollegeFever | Brand Marketing, Event Promotion, Marketing, S... | Seed | 250000 | Bangalore, Karnataka, India | TheCollegeFever is a hub for fun, fiesta and f... |
| 1 | Happy Cow Dairy | Agriculture, Farming | Seed | ₹40,000,000 | Mumbai, Maharashtra, India | A startup which aggregates milk from dairy far... |
| 2 | MyLoanCare | Credit, Financial Services, Lending, Marketplace | Series A | ₹65,000,000 | Gurgaon, Haryana, India | Leading Online Loans Marketplace in India |
| 3 | PayMe India | Financial Services, FinTech | Angel | 2000000 | Noida, Uttar Pradesh, India | PayMe India is an innovative FinTech organizat... |
| 4 | Eunimart | E-Commerce Platforms, Retail, SaaS | Seed | — | Hyderabad, Andhra Pradesh, India | Eunimart is a one stop solution for merchants ... |


In [375]:
df.shape

(526, 6)

In [376]:
df.columns

Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company'],
      dtype='object')

In [377]:
df.info()  # Get information about the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [378]:
df.describe(include='object')  # Generate descriptive statistics of the DataFrame

|  | Company Name | Industry | Round/Series | Amount | Location | About Company |
|---|---|---|---|---|---|---|
| count | 526 | 526 | 526 | 526 | 526 | 526 |
| unique | 525 | 405 | 21 | 198 | 50 | 524 |
| top | TheCollegeFever | — | Seed | — | Bangalore, Karnataka, India | Algorithmic trading platform. |
| freq | 2 | 30 | 280 | 148 | 102 | 2 |


Now that we have a description of the dataset, we can move on to data cleaning.

**Missing values**

In [379]:
missing_values = df.isnull().sum() # looking for missing values 

print(missing_values)

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64


**Standardizing data formats**

Now let's standardize the dataset so that all data points share the same format.

First, let's check for dash symbols (—) within the columns using a simple loop.

In [380]:
columns_to_check = ['Amount', 'Company Name', 'Location', 'About Company', 'Industry', 'Round/Series']

for column in columns_to_check:
    has_dash_symbols = df[column].str.contains('—').any()
    print(f"{column}: {has_dash_symbols}")

Amount: True
Company Name: False
Location: False
About Company: False
Industry: True
Round/Series: False
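The output above shows why `isnull()` reported no missing values: the gaps are stored as dash placeholders rather than as NaN. A possible way to make them visible is sketched below. The `PLACEHOLDERS` list, the helper `mark_placeholders_missing` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Assumed set of strings that stand for "no value" in the raw files
PLACEHOLDERS = ["—", "-", "", "Undisclosed"]


def mark_placeholders_missing(frame, placeholders=PLACEHOLDERS):
    """Return a copy in which cells that consist only of a placeholder become NaN."""
    cleaned = frame.copy()
    for column in cleaned.columns:
        as_text = cleaned[column].astype(str).str.strip()
        # mask() blanks out the cells where the condition is True
        cleaned[column] = cleaned[column].mask(as_text.isin(placeholders))
    return cleaned


# Hypothetical toy rows in the style of the 2018 data
toy = pd.DataFrame({
    "Industry": ["—", "E-commerce", "Health Care"],
    "Amount": ["250000", "—", "₹40,000,000"],
})
cleaned = mark_placeholders_missing(toy)
print(cleaned.isnull().sum())
```

Because the comparison is on the whole cell, a value such as `E-commerce` that merely contains a hyphen is left untouched.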


Now let's handle the dash symbols in **the Amount column**, clean and format the column, and convert the currency to USD.

In [381]:
df['Amount'].head()

0         250000
1    ₹40,000,000
2    ₹65,000,000
3        2000000
4              —
Name: Amount, dtype: object

In [382]:
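# Cleaning the Amount column
df['Amount'] = df['Amount'].astype(str)
df['Amount'] = df['Amount'].str.replace(',', '', regex=False)  # drop thousands separators
df['Amount'] = df['Amount'].str.replace('$', '', regex=False)  # literal '$'; as a regex it would only match end-of-string
df['Amount'] = df['Amount'].replace('—', '0')                  # dash placeholder becomes '0' (turned into NaN later)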

In [383]:
# Create an instance of CurrencyRates
c = CurrencyRates()

In [384]:
# Creating temporary columns to help with the conversion of INR to USD
df['Indiancurr'] = df['Amount'].str.rsplit('₹', n=2).str[1]
df['Indiancurr'] = df['Indiancurr'].apply(float).fillna(0)
df['UsCurr'] = df['Indiancurr'] * c.get_rate('INR', 'USD')
df['UsCurr'] = df['UsCurr'].replace(0, np.nan)
df['UsCurr'] = df['UsCurr'].fillna(df['Amount'])
df['UsCurr'] = df['UsCurr'].replace("$", "", regex=True)
df['Amount'] = df['UsCurr']
df['Amount'] = df['Amount'].apply(lambda x: float(str(x).replace("$","")))
df['Amount'] = df['Amount'].replace(0, np.nan)

# Define a lambda function to format the amount
format_amount = lambda amount: "{:,.2f}".format(amount)

# Apply the formatting lambda function to the 'Amount' column
df['Amount'] = df['Amount'].map(format_amount)


In [385]:
df['Amount'].head()

0      250,000.00
1      485,093.94
2      788,277.65
3    2,000,000.00
4             nan
Name: Amount, dtype: object

In [386]:
df['Amount'] = df['Amount'].str.replace(',', '').astype(float)
type(df['Amount'][0])

numpy.float64
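The conversion above depends on a live exchange-rate lookup, so its results change from run to run. A reproducible alternative is to use a fixed rate, as in the sketch below. The rate `INR_TO_USD`, the helper `amounts_to_usd` and the toy series are assumptions for illustration; the rate is a placeholder and should be replaced with a documented value for the funding year.

```python
import pandas as pd

# Placeholder rate, assumed for illustration only
INR_TO_USD = 0.0146


def amounts_to_usd(amounts, inr_to_usd=INR_TO_USD):
    """Parse amount strings and convert the rupee-denominated ones to USD."""
    text = amounts.astype(str).str.strip()
    is_inr = text.str.startswith("₹")
    digits = text.str.replace(r"[₹$,]", "", regex=True)
    # Anything that is not a number (for example a dash placeholder) becomes NaN
    numeric = pd.to_numeric(digits, errors="coerce")
    # Keep non-rupee values as they are and scale the rupee values
    return numeric.where(~is_inr, numeric * inr_to_usd)


toy = pd.Series(["250000", "₹40,000,000", "$1,500,000", "—"])
usd = amounts_to_usd(toy)
print(usd)
```

This also avoids the temporary `Indiancurr` and `UsCurr` columns, since the rupee flag is computed once and applied in a single vectorized step.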

#### Handling Categorical Data
Now let's handle the categorical data in the 'Industry', 'Round/Series', and 'Location' columns.

**Analyzing unique values**

We start by examining the unique values in each column to identify any inconsistencies or variations, using the `unique()` function.

**Location**

In [387]:
df['Location'].unique()

array(['Bangalore, Karnataka, India', 'Mumbai, Maharashtra, India',
       'Gurgaon, Haryana, India', 'Noida, Uttar Pradesh, India',
       'Hyderabad, Andhra Pradesh, India', 'Bengaluru, Karnataka, India',
       'Kalkaji, Delhi, India', 'Delhi, Delhi, India', 'India, Asia',
       'Hubli, Karnataka, India', 'New Delhi, Delhi, India',
       'Chennai, Tamil Nadu, India', 'Mohali, Punjab, India',
       'Kolkata, West Bengal, India', 'Pune, Maharashtra, India',
       'Jodhpur, Rajasthan, India', 'Kanpur, Uttar Pradesh, India',
       'Ahmedabad, Gujarat, India', 'Azadpur, Delhi, India',
       'Haryana, Haryana, India', 'Cochin, Kerala, India',
       'Faridabad, Haryana, India', 'Jaipur, Rajasthan, India',
       'Kota, Rajasthan, India', 'Anand, Gujarat, India',
       'Bangalore City, Karnataka, India', 'Belgaum, Karnataka, India',
       'Thane, Maharashtra, India', 'Margão, Goa, India',
       'Indore, Madhya Pradesh, India', 'Alwar, Rajasthan, India',
       'Kannur, Kerala, Ind

In [388]:
df['Location'].value_counts()

Bangalore, Karnataka, India         102
Mumbai, Maharashtra, India           94
Bengaluru, Karnataka, India          55
Gurgaon, Haryana, India              52
New Delhi, Delhi, India              51
Pune, Maharashtra, India             20
Chennai, Tamil Nadu, India           19
Hyderabad, Andhra Pradesh, India     18
Delhi, Delhi, India                  16
Noida, Uttar Pradesh, India          15
Haryana, Haryana, India              11
Jaipur, Rajasthan, India              9
Kolkata, West Bengal, India           6
Ahmedabad, Gujarat, India             6
Bangalore City, Karnataka, India      5
India, Asia                           4
Indore, Madhya Pradesh, India         4
Kormangala, Karnataka, India          3
Kochi, Kerala, India                  2
Bhopal, Madhya Pradesh, India         2
Ghaziabad, Uttar Pradesh, India       2
Thane, Maharashtra, India             2
Kannur, Kerala, India                 1
Ernakulam, Kerala, India              1
Anand, Gujarat, India                 1


In [389]:
# The 'Location' column has the format 'City, Region, Country'.
# Only the 'City' part is needed for this analysis,
# so keep all characters up to the first comma.

df['Location'] = df['Location'].apply(str)
df['Location'] = df['Location'].str.split(',').str[0]
df['Location'] = df['Location'].replace("'","",regex=True)

In [390]:
df['Location']

0           Bangalore
1              Mumbai
2             Gurgaon
3               Noida
4           Hyderabad
5           Bengaluru
6             Kalkaji
7           Hyderabad
8              Mumbai
9           Bangalore
10              Delhi
11          Bengaluru
12              India
13              Hubli
14          Bangalore
15          Bengaluru
16             Mumbai
17          Bengaluru
18          New Delhi
19            Chennai
20             Mumbai
21             Mumbai
22          New Delhi
23              Delhi
24          Bengaluru
25             Mohali
26            Chennai
27             Mumbai
28             Mumbai
29          Hyderabad
30          New Delhi
31            Kolkata
32          Bangalore
33          Bengaluru
34             Mumbai
35          Bengaluru
36             Mumbai
37          New Delhi
38             Mumbai
39            Chennai
40          New Delhi
41          Hyderabad
42              India
43             Mumbai
44          New Delhi
45        

In [391]:
# From observation, both official and unofficial names are used for some cities.
# The names are unified so that a city with more than one name is counted once.
# Note: Belgaum (Belagavi) is a different city from Bengaluru, so merging it adds one record to Bengaluru's count.
df["Location"] = df["Location"].replace(['Bangalore','Bangalore City','Belgaum'], 'Bengaluru')
df['Location'] = df['Location'].str.replace('New Delhi','Delhi')

In [392]:
df['Location'].unique()

array(['Bengaluru', 'Mumbai', 'Gurgaon', 'Noida', 'Hyderabad', 'Kalkaji',
       'Delhi', 'India', 'Hubli', 'Chennai', 'Mohali', 'Kolkata', 'Pune',
       'Jodhpur', 'Kanpur', 'Ahmedabad', 'Azadpur', 'Haryana', 'Cochin',
       'Faridabad', 'Jaipur', 'Kota', 'Anand', 'Thane', 'Margão',
       'Indore', 'Alwar', 'Kannur', 'Trivandrum', 'Ernakulam',
       'Kormangala', 'Uttar Pradesh', 'Andheri', 'Mylapore', 'Ghaziabad',
       'Kochi', 'Powai', 'Guntur', 'Kalpakkam', 'Bhopal', 'Coimbatore',
       'Worli', 'Alleppey', 'Chandigarh', 'Guindy', 'Lucknow'],
      dtype=object)

In [393]:
df['Location'].value_counts()

Bengaluru        163
Mumbai            94
Delhi             67
Gurgaon           52
Pune              20
Chennai           19
Hyderabad         18
Noida             15
Haryana           11
Jaipur             9
Ahmedabad          6
Kolkata            6
Indore             4
India              4
Kormangala         3
Kochi              2
Bhopal             2
Thane              2
Ghaziabad          2
Alwar              1
Worli              1
Ernakulam          1
Mohali             1
Guindy             1
Alleppey           1
Anand              1
Trivandrum         1
Kota               1
Lucknow            1
Jodhpur            1
Kannur             1
Powai              1
Cochin             1
Faridabad          1
Hubli              1
Azadpur            1
Mylapore           1
Margão             1
Uttar Pradesh      1
Kalkaji            1
Coimbatore         1
Kanpur             1
Guntur             1
Andheri            1
Chandigarh         1
Kalpakkam          1
Name: Location, dtype: int64
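The city clean-up above can also be expressed as one alias table, which is easier to extend and to reuse for the other years. The sketch below is an illustration under assumptions: `CITY_ALIASES` and `normalize_city` are hypothetical names, and the table only contains the true name variants handled above.

```python
import pandas as pd

# Alias table: alternative spelling -> name used in the analysis
CITY_ALIASES = {
    "Bangalore": "Bengaluru",
    "Bangalore City": "Bengaluru",
    "New Delhi": "Delhi",
}


def normalize_city(location):
    """Keep the part before the first comma and map known aliases to one name."""
    city = location.str.split(",").str[0].str.strip()
    # Series.replace with a dict only replaces cells that match a key exactly
    return city.replace(CITY_ALIASES)


toy = pd.Series([
    "Bangalore, Karnataka, India",
    "Bengaluru, Karnataka, India",
    "New Delhi, Delhi, India",
    "Pune, Maharashtra, India",
])
cities = normalize_city(toy)
print(cities.value_counts())
```

Further pairs visible in the counts above, such as `Cochin` and `Kochi`, could be added to the table if they should be treated as one city.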

**Industry**

In [394]:
df['Industry'].value_counts()

—                                                                                                                                           30
Financial Services                                                                                                                          15
Education                                                                                                                                    8
Information Technology                                                                                                                       7
Finance, Financial Services                                                                                                                  5
Health Care, Hospital                                                                                                                        5
Health Care                                                                                                                                  4

In [395]:
# Replace the '—' placeholder with NaN (the data uses the long dash, not a hyphen)
df['Industry'] = df['Industry'].replace('—', np.nan)

In [396]:
df.columns

Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company', 'Indiancurr', 'UsCurr'],
      dtype='object')

In [397]:
df.drop(columns=['Indiancurr','UsCurr'], inplace=True)

In [398]:
df.insert(6,"Funding Year", 2018)

In [399]:
df.rename(columns = {'Company Name':'Company',
                        'Industry':'Sector',
                        'Amount':'Amount',
                        'About Company':'About',
                        'Round/Series' : 'Stage'},
             inplace = True)

#### 2019 Data

In [400]:
df2.head()

|  | Company/Brand | Founded | HeadQuarter | Sector | What it does | Founders | Investor | Amount($) | Stage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bombay Shaving |  |  | Ecommerce | Provides a range of male grooming products | Shantanu Deshpande | Sixth Sense Ventures | $6,300,000 |  |
| 1 | Ruangguru | 2014.0 | Mumbai | Edtech | A learning platform that provides topic-based ... | Adamas Belva Syah Devara, Iman Usman. | General Atlantic | $150,000,000 | Series C |
| 2 | Eduisfun |  | Mumbai | Edtech | It aims to make learning fun via games. | Jatin Solanki | Deepak Parekh, Amitabh Bachchan, Piyush Pandey | $28,000,000 | Fresh funding |
| 3 | HomeLane | 2014.0 | Chennai | Interior design | Provides interior designing solutions | Srikanth Iyer, Rama Harinath | Evolvence India Fund (EIF), Pidilite Group, FJ... | $30,000,000 | Series D |
| 4 | Nu Genes | 2004.0 | Telangana | AgriTech | It is a seed company engaged in production, pr... | Narayana Reddy Punyala | Innovation in Food and Agriculture (IFA) | $6,000,000 |  |


In [401]:
df2.shape

(89, 9)

In [402]:
df2.columns

Index(['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
       'Founders', 'Investor', 'Amount($)', 'Stage'],
      dtype='object')

In [403]:
df2.info() # Get information about the df2 DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [404]:
df2.describe(include='object') # General descriptive statistics of the data2 dataFrame

|  | Company/Brand | HeadQuarter | Sector | What it does | Founders | Investor | Amount($) | Stage |
|---|---|---|---|---|---|---|---|---|
| count | 89 | 70 | 84 | 89 | 86 | 89 | 89 | 43 |
| unique | 87 | 17 | 52 | 88 | 85 | 86 | 50 | 15 |
| top | Kratikal | Bangalore | Edtech | Online meat shop | Vivek Gupta, Abhay Hanjura | Undisclosed | Undisclosed | Series A |
| freq | 2 | 21 | 7 | 2 | 2 | 3 | 12 | 10 |


In [405]:
# Dropping the columns that are not needed for our analysis

df2.drop(columns=['Founded','Founders','Investor'], inplace=True)

Now that we have a description of the dataset, we can move on to data cleaning.

**Missing values**

In [406]:
missing_values2 = df2.isnull().sum() # looking for missing values in dataFrame 2
missing_values2

Company/Brand     0
HeadQuarter      19
Sector            5
What it does      0
Amount($)         0
Stage            46
dtype: int64

Let's deal with the missing values from the above output.

In [407]:
# Replace specific values with NaN
df2['HeadQuarter'].replace('Not available', np.nan, inplace=True)

# Replace empty strings with NaN
df2['Sector'].replace('', np.nan, inplace=True)
df2['Stage'].replace('', np.nan, inplace=True)

In [408]:
df2.isnull().sum()

Company/Brand     0
HeadQuarter      19
Sector            5
What it does      0
Amount($)         0
Stage            46
dtype: int64

**Standardizing data formats**

Now let's standardize the dataset so that all data points share the same format.

First, let's check for dash symbols (-) within the columns using a simple loop.

In [409]:
# checking for '-' symbol within the columns

columns_to_check2 = ['Company/Brand', 'HeadQuarter', 'Sector', 'What it does', 'Amount($)', 'Stage']

for column2 in columns_to_check2:
    has_dash_symbols2 = df2[column2].astype(str).str.contains('-').any()
    print(f'{column2}: {has_dash_symbols2}')

Company/Brand: False
HeadQuarter: False
Sector: True
What it does: True
Amount($): False
Stage: True


In [410]:
# checking for currency symbol 

columns_to_check2 = ['Company/Brand','HeadQuarter', 'Sector', 'What it does', 'Amount($)']

for column2 in columns_to_check2:
    has_currency_symbols = df2[column2].astype(str).str.contains('[$₹]').any()
    print(f'{column2}: {has_currency_symbols}')

Company/Brand: False
HeadQuarter: False
Sector: False
What it does: False
Amount($): True


In [411]:
# replacing cells that contain only a '-' symbol with NaN using a simple loop

dash_currency_columns = ['Sector', 'What it does', 'Stage']

for dash_columns2 in dash_currency_columns:
    # assign the result back; replace(..., inplace=True) returns None
    df2[dash_columns2] = df2[dash_columns2].replace('-', np.nan)

Now let's clean the Amount column: remove the currency symbols and thousands separators, then format the column correctly.

In [412]:
df2['Amount($)']

0       $6,300,000
1     $150,000,000
2      $28,000,000
3      $30,000,000
4       $6,000,000
5      Undisclosed
6      Undisclosed
7       $1,000,000
8      $20,000,000
9     $275,000,000
10     Undisclosed
11     $22,000,000
12      $5,000,000
13        $140,500
14     Undisclosed
15      $5,000,000
16    $540,000,000
17     $15,000,000
18        $182,700
19     Undisclosed
20      $5,000,000
21     $12,000,000
22     $11,000,000
23     Undisclosed
24     $15,500,000
25      $1,500,000
26      $5,500,000
27      $5,000,000
28     $12,000,000
29      $2,500,000
30     $30,000,000
31        $140,000
32     Undisclosed
33    $230,000,000
34     $20,000,000
35     $49,400,000
36     $32,000,000
37     $26,000,000
38        $150,000
39        $400,000
40      $2,000,000
41    $100,000,000
42      $8,000,000
43      $1,500,000
44        $100,000
45     Undisclosed
46     $50,000,000
47      $6,000,000
48    $120,000,000
49      $4,000,000
50     $30,000,000
51      $4,000,000
52      $1,5

In [413]:
df2['Amount($)'] = df2['Amount($)'].astype(str).str.replace(r'[₹$,]', '', regex=True)  # removing the currency symbols and thousands separators from the Amount in dataFrame2

In [414]:
df2['Amount($)']

0         6300000
1       150000000
2        28000000
3        30000000
4         6000000
5     Undisclosed
6     Undisclosed
7         1000000
8        20000000
9       275000000
10    Undisclosed
11       22000000
12        5000000
13         140500
14    Undisclosed
15        5000000
16      540000000
17       15000000
18         182700
19    Undisclosed
20        5000000
21       12000000
22       11000000
23    Undisclosed
24       15500000
25        1500000
26        5500000
27        5000000
28       12000000
29        2500000
30       30000000
31         140000
32    Undisclosed
33      230000000
34       20000000
35       49400000
36       32000000
37       26000000
38         150000
39         400000
40        2000000
41      100000000
42        8000000
43        1500000
44         100000
45    Undisclosed
46       50000000
47        6000000
48      120000000
49        4000000
50       30000000
51        4000000
52        1500000
53        1000000
54    Undisclosed
55    Undi

In [415]:
df2['Amount($)'].unique()

array(['6300000', '150000000', '28000000', '30000000', '6000000',
       'Undisclosed', '1000000', '20000000', '275000000', '22000000',
       '5000000', '140500', '540000000', '15000000', '182700', '12000000',
       '11000000', '15500000', '1500000', '5500000', '2500000', '140000',
       '230000000', '49400000', '32000000', '26000000', '150000',
       '400000', '2000000', '100000000', '8000000', '100000', '50000000',
       '120000000', '4000000', '6800000', '36000000', '5700000',
       '25000000', '600000', '70000000', '60000000', '220000', '2800000',
       '2100000', '7000000', '311000000', '4800000', '693000000',
       '33000000'], dtype=object)

In [416]:
# Cleaning the Amount column and removing the currency symbol in df_2019
df2['Amount($)'] = df2['Amount($)'].astype(str)
df2['Amount($)'] = df2['Amount($)'].str.replace(r'[₹$,]', '', regex=True)  # inside [...] the '$' is a literal character
# Dash placeholders and undisclosed amounts are set to '0' so the column can be cast to float
df2['Amount($)'] = df2['Amount($)'].replace({'—': '0', 'Undisclosed': '0'})

In [417]:
df2['Amount($)'] = df2['Amount($)'].astype(float)
type(df2['Amount($)'][0])

numpy.float64

In [418]:
df2['Amount($)']

0       6300000.0
1     150000000.0
2      28000000.0
3      30000000.0
4       6000000.0
5             0.0
6             0.0
7       1000000.0
8      20000000.0
9     275000000.0
10            0.0
11     22000000.0
12      5000000.0
13       140500.0
14            0.0
15      5000000.0
16    540000000.0
17     15000000.0
18       182700.0
19            0.0
20      5000000.0
21     12000000.0
22     11000000.0
23            0.0
24     15500000.0
25      1500000.0
26      5500000.0
27      5000000.0
28     12000000.0
29      2500000.0
30     30000000.0
31       140000.0
32            0.0
33    230000000.0
34     20000000.0
35     49400000.0
36     32000000.0
37     26000000.0
38       150000.0
39       400000.0
40      2000000.0
41    100000000.0
42      8000000.0
43      1500000.0
44       100000.0
45            0.0
46     50000000.0
47      6000000.0
48    120000000.0
49      4000000.0
50     30000000.0
51      4000000.0
52      1500000.0
53      1000000.0
54            0.0
55        

In [419]:
df2['Amount($)'].unique()

array([6.300e+06, 1.500e+08, 2.800e+07, 3.000e+07, 6.000e+06, 0.000e+00,
       1.000e+06, 2.000e+07, 2.750e+08, 2.200e+07, 5.000e+06, 1.405e+05,
       5.400e+08, 1.500e+07, 1.827e+05, 1.200e+07, 1.100e+07, 1.550e+07,
       1.500e+06, 5.500e+06, 2.500e+06, 1.400e+05, 2.300e+08, 4.940e+07,
       3.200e+07, 2.600e+07, 1.500e+05, 4.000e+05, 2.000e+06, 1.000e+08,
       8.000e+06, 1.000e+05, 5.000e+07, 1.200e+08, 4.000e+06, 6.800e+06,
       3.600e+07, 5.700e+06, 2.500e+07, 6.000e+05, 7.000e+07, 6.000e+07,
       2.200e+05, 2.800e+06, 2.100e+06, 7.000e+06, 3.110e+08, 4.800e+06,
       6.930e+08, 3.300e+07])
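Encoding undisclosed amounts as 0 keeps the column numeric, but those zeros then count as real deals in any average or minimum. The sketch below contrasts that with treating them as missing. It is an illustration on toy values, and `parse_amount` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def parse_amount(amounts):
    """Strip currency symbols and separators; non-numeric text becomes NaN."""
    digits = amounts.astype(str).str.replace(r"[₹$,]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce")


toy = pd.Series(["$6,300,000", "Undisclosed", "$150,000,000"])

as_missing = parse_amount(toy)          # 'Undisclosed' -> NaN
as_zero = parse_amount(toy).fillna(0)   # 'Undisclosed' -> 0

# mean() skips NaN, so the two encodings give different averages
print(f"mean with NaN:   {as_missing.mean():,.0f}")
print(f"mean with zeros: {as_zero.mean():,.0f}")
```

With NaN, the undisclosed deal is left out of the average; with 0, it pulls the average down even though no amount is known.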

In [420]:
df2['HeadQuarter']

0               NaN
1            Mumbai
2            Mumbai
3           Chennai
4         Telangana
5              Pune
6         Bangalore
7             Noida
8               NaN
9             Delhi
10           Mumbai
11        Bangalore
12          Chennai
13        Ahmedabad
14              NaN
15            Delhi
16              NaN
17        Bangalore
18              NaN
19         Gurugram
20              NaN
21              NaN
22              NaN
23            Delhi
24        Bangalore
25            Delhi
26          Haryana
27              NaN
28        Bangalore
29              NaN
30        Bangalore
31              NaN
32       Chandigarh
33        Bangalore
34              NaN
35         Gurugram
36        Bangalore
37        Bangalore
38           Jaipur
39        Bangalore
40        Bangalore
41              NaN
42           Mumbai
43              NaN
44        Bangalore
45            Noida
46            Noida
47             Pune
48           Mumbai
49        Bangalore


In [421]:
df2.insert(6,"Funding Year", 2019)

In [422]:
df2.rename(columns = {'Company/Brand':'Company',
                        'HeadQuarter':'Location',
                        'Amount($)':'Amount',
                        'What it does':'About'},
             inplace = True)

#### 2020 Data

In [423]:
df3.head(3) # Get the first 3 rows of the DataFrame

|  | Company_Brand | Founded | HeadQuarter | Sector | What_it_does | Founders | Investor | Amount | Stage | column10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aqgromalin | 2019.0 | Chennai | AgriTech | Cultivating Ideas for Profit | Prasanna Manogaran, Bharani C L | Angel investors | 200000.0 |  |  |
| 1 | Krayonnz | 2019.0 | Bangalore | EdTech | An academy-guardian-scholar centric ecosystem ... | Saurabh Dixit, Gurudutt Upadhyay | GSF Accelerator | 100000.0 | Pre-seed |  |
| 2 | PadCare Labs | 2018.0 | Pune | Hygiene management | Converting bio-hazardous waste to harmless waste | Ajinkya Dhariya | Venture Center |  | Pre-seed |  |


In [424]:
df3.shape # Get the shape of the DataFrame

(1055, 10)

In [425]:
df3.info() # provide information about columns, data types and non-null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.5+ KB


In [426]:
df3.columns # Get the column names

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')

In [427]:
df3.describe(include="object") # Generate descriptive statistics for object columns

|  | Company_Brand | HeadQuarter | Sector | What_it_does | Founders | Investor | Stage | column10 |
|---|---|---|---|---|---|---|---|---|
| count | 1055 | 961 | 1042 | 1055 | 1043 | 1017 | 591 | 2 |
| unique | 905 | 77 | 302 | 990 | 927 | 848 | 42 | 2 |
| top | Zomato | Bangalore | Fintech | Provides online learning classes | Falguni Nayar | Venture Catalysts | Series A | Pre-Seed |
| freq | 6 | 317 | 80 | 4 | 6 | 20 | 96 | 1 |


In [428]:
df3.describe(include='float') # Generate descriptive statistics for float columns

|  | Founded | Amount |
|---|---|---|
| count | 842.0 | 801.0 |
| mean | 2015.36342 | 113043000.0 |
| std | 4.097909 | 2476635000.0 |
| min | 1973.0 | 12700.0 |
| 25% | 2014.0 | 1000000.0 |
| 50% | 2016.0 | 3000000.0 |
| 75% | 2018.0 | 11000000.0 |
| max | 2020.0 | 70000000000.0 |
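The gap between the 75th percentile (11,000,000) and the maximum (70,000,000,000) suggests extreme values in `Amount`, which relates to research question 8. One common screen is the 1.5 × IQR rule: with the quartiles reported above, the upper fence would be 11,000,000 + 1.5 × 10,000,000 = 26,000,000. The sketch below applies the rule to toy values; `iqr_outliers` is a hypothetical helper and the rule itself is an assumption, not necessarily the approach taken later in the notebook.

```python
import numpy as np
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    values = values.dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical amounts, including one extreme value and one missing value
toy = pd.Series([1e6, 2e6, 3e6, 5e6, 1.1e7, 7e10, np.nan])
outliers = iqr_outliers(toy)
print(outliers)
```

Flagged rows are worth inspecting before dropping them, since a very large amount may be either a genuine mega-round or a data-entry error.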


In [429]:
df3['column10'].value_counts() # Count the occurrences of each unique value in 'column10'

Pre-Seed      1
Seed Round    1
Name: column10, dtype: int64

In [430]:
df3.isna().sum() # looking for missing values in dataFrame 3

Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64
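
Raw counts are easier to compare as a share of all rows. A minimal sketch, using a small made-up frame (`sample`) rather than the real `df3`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df3; the values are illustrative only
sample = pd.DataFrame({
    "Company_Brand": ["A", "B", "C", "D"],
    "Amount": [200000.0, np.nan, 400000.0, np.nan],
    "Stage": [np.nan, "Pre-seed", np.nan, np.nan],
})

# isna().mean() gives the fraction of missing values per column
missing_pct = (sample.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Applied to the counts above, `column10` is missing in 1053 of 1055 rows (about 99.8%) and `Stage` in 464 of 1055 (about 44%).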

In [431]:
# Check for the '—' placeholder symbol within the columns
df3_columns_to_check = ['Company_Brand', 'HeadQuarter', 'Sector', 'What_it_does', 'Stage', 'Amount']

for col in df3_columns_to_check:
    has_dash = df3[col].astype(str).str.contains('—').any()
    print(f"{col}: {has_dash}")

Company_Brand: False
HeadQuarter: False
Sector: False
What_it_does: False
Stage: False
Amount: False


In [432]:
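# Check for the '$' symbol within the columns
# Note: str.contains() treats '$' as a regex end-of-string anchor by default,
# so this check matches every value and reports True for all columns.
# Pass regex=False to test for a literal '$'.
df3_columns_to_check = ['Company_Brand', 'HeadQuarter', 'Sector', 'What_it_does', 'Stage', 'Amount']

for col in df3_columns_to_check:
    has_dollar = df3[col].astype(str).str.contains('$').any()
    print(f"{col}: {has_dollar}")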

Company_Brand: True
HeadQuarter: True
Sector: True
What_it_does: True
Stage: True
Amount: True
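
Every column reports `True` here because `str.contains` interprets its pattern as a regular expression by default, and `$` is the end-of-string anchor, which matches any string. A minimal sketch of the difference, using a hypothetical series `s` rather than the dataset:

```python
import pandas as pd

# Hypothetical sample values; not taken from the dataset
s = pd.Series(["$1,000", "2000", "Undisclosed"])

# Default regex=True: '$' is the end-of-string anchor, so every row matches
anchored = s.str.contains("$")
print(anchored.tolist())

# regex=False treats '$' as a literal character
literal = s.str.contains("$", regex=False)
print(literal.tolist())
```

Only the literal check distinguishes values that actually contain a dollar sign.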


In [433]:
# Set display option to show all rows
pd.set_option("display.max_rows", None)

In [434]:
df3["Amount"].value_counts() # Count how often each unique value occurs in the "Amount" column

1.000000e+06    53
2.000000e+06    39
3.000000e+06    27
5.000000e+06    24
5.000000e+05    22
1.000000e+07    18
1.500000e+06    16
6.000000e+05    14
4.000000e+06    14
1.500000e+07    13
3.000000e+07    13
6.000000e+06    13
3.000000e+05    11
4.000000e+05    11
8.000000e+06    11
7.000000e+06    10
2.500000e+06    10
2.000000e+05    10
1.000000e+08     9
2.000000e+07     9
1.000000e+05     8
2.500000e+07     8
1.100000e+07     7
1.100000e+06     7
8.000000e+05     7
5.000000e+07     6
7.500000e+06     6
5.500000e+06     6
3.500000e+06     6
4.500000e+06     5
2.500000e+05     5
1.200000e+06     5
1.300000e+06     5
2.000000e+08     5
1.700000e+06     5
9.000000e+06     5
4.000000e+07     5
1.200000e+07     5
1.600000e+07     4
2.800000e+07     4
1.800000e+07     4
1.500000e+08     4
5.500000e+07     4
1.500000e+05     4
3.500000e+07     4
7.500000e+05     4
1.400000e+06     4
6.000000e+07     4
2.100000e+07     4
3.400000e+05     4
3.400000e+06     3
1.900000e+07     3
1.270000e+04

In [435]:
# Clean the Amount column: remove thousands separators, dash placeholders, and '$'
# regex=False makes ',' and '$' literal characters ('$' alone is a regex anchor)
df3['Amount'] = df3['Amount'].astype(str)
df3['Amount'] = df3['Amount'].str.replace(",", "", regex=False)
df3['Amount'] = df3['Amount'].str.replace("—", "0", regex=False)
df3['Amount'] = df3['Amount'].str.replace("$", "", regex=False)

# Remove any literal '$' from the text columns
for col in ['Company_Brand', 'HeadQuarter', 'Sector', 'What_it_does', 'Stage']:
    df3[col] = df3[col].str.replace("$", "", regex=False)


In [446]:
df3['Amount'] = df3['Amount'].astype(float)
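
`astype(float)` raises an error if any non-numeric string survives the cleaning step. One possible alternative is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN`. This is a sketch on a hypothetical series `raw`, and it treats the dash placeholder as missing rather than as 0, which is a different choice from the cleaning cell above:

```python
import pandas as pd

# Hypothetical raw values; illustrative only
raw = pd.Series(["$1,200,000", "—", "Undisclosed", "340000"])

# Strip literal '$' and ',' (inside a character class '$' is not an anchor)
cleaned = raw.str.replace(r"[$,]", "", regex=True)

# Unparseable entries become NaN instead of raising
amount = pd.to_numeric(cleaned, errors="coerce")
print(amount)
```

Keeping unknown amounts as `NaN` avoids pulling the mean and median toward zero, at the cost of more missing values to handle later.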

In [447]:
df3['Amount']

0       2.000000e+05
1       1.000000e+05
2                NaN
3       4.000000e+05
4       3.400000e+05
5       6.000000e+05
6       6.000000e+05
7                NaN
8       4.500000e+07
9       1.000000e+06
10      2.000000e+06
11               NaN
12               NaN
13      1.200000e+06
14      6.000000e+05
15      6.600000e+08
16      1.200000e+05
17      7.500000e+06
18               NaN
19      1.000000e+06
20               NaN
21      5.000000e+06
22      1.000000e+06
23      5.000000e+05
24      3.000000e+06
25      1.000000e+07
26      1.450000e+08
27      1.000000e+08
28               NaN
29               NaN
30      2.100000e+07
31      4.000000e+06
32      2.000000e+07
33      1.000000e+06
34      5.600000e+05
35               NaN
36      4.000000e+05
37      2.750000e+05
38      2.000000e+07
39      1.000000e+06
40               NaN
41      4.500000e+06
42      5.000000e+06
43      1.500000e+07
44               NaN
45      3.900000e+08
46      7.000000e+06
47           

In [441]:
df3['Company_Brand']

0                                  Aqgromalin
1                                    Krayonnz
2                                PadCare Labs
3                                       NCOME
4                                  Gramophone
5                                      qZense
6                                MyClassboard
7                                       Metvy
8                                      Rupeek
9                                   Gig India
10                                Slurrp Farm
11                                     Medfin
12                                    MasterG
13                                   Brila 91
14                                 FoodyBuddy
15                                     Zomato
16                                  OurEye.ai
17                                 Shiprocket
18                                  Pine Labs
19                          Moneyboxx Finance
20                                       EWar
21                         SucSEED

In [442]:
df3['HeadQuarter']

0                                                 Chennai
1                                               Bangalore
2                                                    Pune
3                                               New Delhi
4                                                  Indore
5                                               Bangalore
6                                               Hyderabad
7                                                 Gurgaon
8                                               Bangalore
9                                                    Pune
10                                                Gurgaon
11                                              Bangalore
12                                              New Delhi
13                                              New Delhi
14                                                Belgaum
15                                                Gurgaon
16                                              Bangalore
17            

In [443]:
df3['Sector']

0                                 AgriTech
1                                   EdTech
2                       Hygiene management
3                                   Escrow
4                                 AgriTech
5                                 AgriTech
6                                   EdTech
7                      Networking platform
8                                  FinTech
9                            Crowdsourcing
10                        Food & Bevarages
11                              HealthTech
12                         Fashion startup
13                        Food & Bevarages
14                           Food Industry
15                           Food Delivery
16                Virtual auditing startup
17                              E-commerce
18                                 FinTech
19                                 FinTech
20                                  Gaming
21                                 FinTech
22                                 FinTech
23         

In [437]:
df3 = df3.drop(['column10','Founded','Founders','Investor'], axis=1)

In [438]:
# Assign 2020 to the 'Funding Year' column
df3['Funding Year'] = 2020

In [439]:
# Preview the first rows of the cleaned DataFrame
df3.head()


Unnamed: 0,Company_Brand,HeadQuarter,Sector,What_it_does,Amount,Stage,Funding Year
0,Aqgromalin,Chennai,AgriTech,Cultivating Ideas for Profit,200000.0,,2020
1,Krayonnz,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,100000.0,Pre-seed,2020
2,PadCare Labs,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,,Pre-seed,2020
3,NCOME,New Delhi,Escrow,Escrow-as-a-service platform,400000.0,,2020
4,Gramophone,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,340000.0,,2020


#### 2021 Data