## STOCKHOLM TEAM

## Exploratory Data Analysis of the Indian StartUp Funding Ecosystem 

### Business Understanding

**Project Description:**

Explore the Indian startup funding ecosystem through an in-depth analysis of funding data from 2018 to 2021. Gain insights into key trends, funding patterns, and factors driving startup success. Investigate the relationship between funding and startup growth, with a focus on temporal patterns and city-level dynamics. Identify preferred sectors for investment and uncover industry-specific funding trends. This exploratory data analysis provides an overview of the Indian startup ecosystem for entrepreneurs, investors, and policymakers.

## Data Understanding

This project aims to build a deeper understanding of the Indian startup funding ecosystem. The dataset covers startup funding from 2018 to 2021 and includes attributes such as the company's name, sector, funding amount, funding round, investor details, and location.

To conduct a comprehensive analysis, we will examine the dataset to understand its structure, contents, and any potential data quality issues. By understanding the data, we can ensure the accuracy and reliability of our analysis.

The key attributes in the dataset include:

- **Company**: The name of the startup receiving funding.
- **Sector**: The industry or sector to which the startup belongs.
- **Amount**: The amount of funding received by the startup.
- **Stage**: The round of funding (e.g., seed, series A, series B).
- **Location**: The city or region where the startup is based.
- **About**: What the company does.
- **Funding Year**: The year in which the company was funded.

By examining these attributes, we can uncover insights about the funding landscape, identify trends in funding amounts and rounds, explore the preferred sectors for investment, and analyze the role of cities in the startup ecosystem.

Throughout the analysis, we will use visualizations and statistical techniques to present the findings effectively. By understanding the data and its characteristics, we can proceed with confidence in our analysis, derive meaningful insights, and make informed decisions based on the findings.

### Hypothesis:

#### NULL Hypothesis (H0):

#### **The sector of a company does not have an impact on the amount of funding it receives.**


#### ALTERNATE Hypothesis (HA):

#### **The sector of a company does have an impact on the amount of funding it receives.**
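
One possible way to test these hypotheses is sketched below on made-up numbers. Funding amounts are binned into bands so that both variables are categorical, the bands are cross-tabulated against sector, and SciPy's `chi2_contingency` (which this notebook imports) is applied to the table. The toy DataFrame, the 1,000,000 cut-off, and the 0.05 significance level are illustrative assumptions, not values taken from the analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for the cleaned funding table (all values are made up)
toy = pd.DataFrame({
    "Sector": ["FinTech", "FinTech", "FinTech", "EdTech", "EdTech", "EdTech",
               "AgriTech", "AgriTech"],
    "Amount": [5_000_000, 12_000_000, 800_000, 300_000, 450_000, 2_000_000,
               150_000, 900_000],
})

# Bin the amounts into two bands (the cut-off is an assumption)
toy["Band"] = pd.cut(toy["Amount"], bins=[0, 1_000_000, float("inf")],
                     labels=["<=1M", ">1M"])

# Contingency table: one row per sector, one column per funding band
table = pd.crosstab(toy["Sector"], toy["Band"])

chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # conventional significance level, assumed here
print(table)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

A table this small is far too sparse for a reliable chi-square result; the sketch only shows the mechanics. On the real data, the number of bands and sectors would need to be chosen so that the expected counts are reasonably large.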




##  Research / Analysis Questions:

1. What are the most common industries represented in the datasets?

2. How does the funding amount vary across different rounds/series in the datasets?
   
3. Which locations have the highest number of companies in the datasets?
   
4. What kind of investment type should startups look for depending on their industry type? (EDA: Analysis of funding preferences by industry)

5. Are there any correlations between the funding amount and the company's sector or location?
   
6. What are the top investors in the datasets based on the number of investments made?
   
7. Which industries are favored by investors based on the number of funding rounds? (EDA: Top 10 industries which are favored by investors)

8. Are there any outliers in the funding amounts in the datasets?
   
9.  Is there a relationship between the company's sector and the presence of certain investors?
    
10. What is the range of funds generally received by startups in India (Max, min, avg, and count of funding)? (EDA: Descriptive statistics of funding amounts)


## Data Preparation

Before diving into the analysis, we will preprocess and clean the data to ensure its quality and suitability for analysis. This may involve handling missing values, correcting data types, and addressing any inconsistencies or outliers that could affect the accuracy of our results.

Once the data is prepared, we will be ready to perform an in-depth exploratory analysis of the Indian startup funding ecosystem. The analysis will involve answering specific research questions, identifying patterns and trends, and generating meaningful visualizations to present the findings.

Through this process of data understanding and preparation, we will set a solid foundation for conducting a robust and insightful analysis of the Indian startup funding data.

**The data comes from four sources: two CSV files (2018 and 2019) and two tables on a remote SQL Server database (2020 and 2021). They are merged into a single dataset later.**
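
As a rough illustration of that merge step, the sketch below renames two toy yearly tables to a shared schema, tags each with its funding year, and stacks them with `pd.concat`. The toy rows are made up; the column names follow the 2018 and 2019 files, and the target names follow the attribute list above.

```python
import pandas as pd

# Toy stand-ins for two yearly tables with different column names (values are made up)
y2018 = pd.DataFrame({"Company Name": ["A"], "Industry": ["FinTech"], "Amount": [250000.0]})
y2019 = pd.DataFrame({"Company/Brand": ["B"], "Sector": ["Edtech"], "Amount($)": [6300000.0]})

# Rename each table to the shared schema, then tag it with its funding year
y2018 = (y2018.rename(columns={"Company Name": "Company", "Industry": "Sector"})
              .assign(**{"Funding Year": 2018}))
y2019 = (y2019.rename(columns={"Company/Brand": "Company", "Amount($)": "Amount"})
              .assign(**{"Funding Year": 2019}))

# Stack the yearly tables into one dataset with a fresh index
combined = pd.concat([y2018, y2019], ignore_index=True)
print(combined)
```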

### Load the Packages/Modules

In [None]:
%pip install forex-python
%pip install pandas
%pip install python-dotenv
%pip install seaborn
%pip install matplotlib
%pip install pyodbc
%pip install numpy
%pip install scipy

In [1]:
# Importing the Modules needed
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

import pyodbc  # ODBC interface used to query the remote SQL Server database
from dotenv import dotenv_values  # reads key-value pairs from a .env file
import warnings 
warnings.filterwarnings('ignore')

import re 
from scipy.stats import chi2_contingency

## Display Options

In [2]:
# Set display options
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.width', None)  # Disable column width restriction
pd.set_option('display.max_colwidth', None)  # Disable truncation of column contents

## Import Datasets

In [3]:
df = pd.read_csv('startup_funding2018.csv')  # load the 2018 funding data into a DataFrame

In [4]:
df2 = pd.read_csv('startup_funding2019.csv')  # load the 2019 funding data into a DataFrame

## Accessing the Remote Server Datasets

In [5]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')


# Get the values for the credentials you set in the '.env' file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")


connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
# Keep credentials in the '.env' file only; never hardcode them in the notebook
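
A missing key in the `.env` file would otherwise surface only as a confusing ODBC error at connect time. The hypothetical helper below (`build_connection_string` is not part of the notebook) checks the dictionary first and then assembles the same kind of string; the example values are placeholders, not real credentials.

```python
REQUIRED_KEYS = ("SERVER", "DATABASE", "USERNAME", "PASSWORD")

def build_connection_string(env, driver="SQL Server"):
    """Return an ODBC connection string, or raise if a credential is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing keys in .env: {', '.join(missing)}")
    return (
        f"DRIVER={{{driver}}};SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['USERNAME']};PWD={env['PASSWORD']}"
    )

# Example with placeholder values (not real credentials)
example_env = {"SERVER": "example-host", "DATABASE": "example_db",
               "USERNAME": "user", "PASSWORD": "secret"}
print(build_connection_string(example_env))
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed in place of `example_env`.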

In [6]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary

connection = pyodbc.connect(connection_string)

In [7]:
# The SQL queries below fetch the 2020 and 2021 tables.
# Note that you will not have permission to insert, delete, or update rows in these tables.
query1 = "SELECT * FROM dbo.LP1_startup_funding2020"
query2 = "SELECT * FROM dbo.LP1_startup_funding2021"
df3 = pd.read_sql(query1, connection)
df4 = pd.read_sql(query2, connection)

# 2018 Data

In [8]:
df.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and frolic of Colleges."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy farmers in rural Maharashtra.
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organization which offers short term financial suport to corporate employees.
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants to create a difference by selling globally.


In [9]:
df.shape  # number of rows and columns in the 2018 data

(526, 6)

In [10]:
df.columns  # column names in the 2018 data

Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company'],
      dtype='object')

In [11]:
df.info()  # Getting information about the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [12]:
df.describe(include='object').transpose()  # descriptive statistics for the text columns

Unnamed: 0,count,unique,top,freq
Company Name,526,525,TheCollegeFever,2
Industry,526,405,—,30
Round/Series,526,21,Seed,280
Amount,526,198,—,148
Location,526,50,"Bangalore, Karnataka, India",102
About Company,526,524,"TheCollegeFever is a hub for fun, fiesta and frolic of Colleges.",2


Now that we have a description of the dataset, we can move on to data cleaning.

#### Missing Values

In [13]:
missing_values = df.isnull().sum() # looking for missing values 
print(missing_values)

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64
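
`isnull()` only counts real `NaN` values, so it reports zero here even though the `describe` output above shows '—' as the most frequent value in both `Industry` and `Amount`. A minimal sketch, on a made-up frame, of counting that placeholder explicitly:

```python
import pandas as pd

# Toy frame using the same '—' placeholder seen in the 2018 file (values are made up)
toy = pd.DataFrame({
    "Industry": ["Agriculture", "—", "FinTech"],
    "Amount": ["250000", "—", "—"],
})

# isnull() sees no missing values because '—' is an ordinary string
true_nulls = toy.isnull().sum()

# Count the placeholder directly to see how much is effectively missing
placeholder_counts = toy.eq("—").sum()

print(true_nulls)
print(placeholder_counts)
```

`pd.read_csv` also accepts an `na_values` argument, so passing `na_values=['—']` at load time would be another option; the notebook instead handles the placeholder column by column in the steps that follow.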


#### Handling Duplicated Data

In [14]:
# Check whether each column contains any repeated values

columns_to_check = ['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location', 'About Company']

for column in columns_to_check:
    has_duplicates = df[column].duplicated().any()
    print(f'{column}: {has_duplicates}')

Company Name: True
Industry: True
Round/Series: True
Amount: True
Location: True
About Company: True


In [15]:
df.drop_duplicates(subset=['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location', 'About Company'], inplace=True)
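
The per-column check prints `True` for almost any real dataset, because values such as 'Seed' naturally repeat. What `drop_duplicates` removes are rows that repeat across all listed columns, which is consistent with the `describe` output above (526 company names, 525 unique). A small sketch on made-up rows that counts whole-row duplicates:

```python
import pandas as pd

toy = pd.DataFrame({
    "Company Name": ["TheCollegeFever", "Happy Cow Dairy", "TheCollegeFever"],
    "Round/Series": ["Seed", "Seed", "Seed"],
})

# A single column almost always contains repeated values
print(toy["Round/Series"].duplicated().any())

# Whole-row duplicates are the ones drop_duplicates() removes
n_duplicate_rows = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(n_duplicate_rows)
print(deduplicated)
```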

#### Standardizing Data Formats

Next, we standardize the dataset so that every column uses a consistent format.

First, we check which columns contain the '—' placeholder.

In [16]:
# Check each column for the '—' placeholder symbol

columns_to_check = ['Amount', 'Company Name', 'Location', 'About Company', 'Industry', 'Round/Series']

for column in columns_to_check:
    has_dash_symbols = df[column].str.contains('—').any()
    print(f"{column}: {has_dash_symbols}")

Amount: True
Company Name: False
Location: False
About Company: False
Industry: True
Round/Series: False


Now let's handle the dash symbols in **the Amount column**, clean and format the column, and convert all amounts to USD.

In [17]:
df['Amount'].unique()  # inspect the raw values in the Amount column

array(['250000', '₹40,000,000', '₹65,000,000', '2000000', '—', '1600000',
       '₹16,000,000', '₹50,000,000', '₹100,000,000', '150000', '1100000',
       '₹500,000', '6000000', '650000', '₹35,000,000', '₹64,000,000',
       '₹20,000,000', '1000000', '5000000', '4000000', '₹30,000,000',
       '2800000', '1700000', '1300000', '₹5,000,000', '₹12,500,000',
       '₹15,000,000', '500000', '₹104,000,000', '₹45,000,000', '13400000',
       '₹25,000,000', '₹26,400,000', '₹8,000,000', '₹60,000', '9000000',
       '100000', '20000', '120000', '₹34,000,000', '₹342,000,000',
       '$143,145', '₹600,000,000', '$742,000,000', '₹1,000,000,000',
       '₹2,000,000,000', '$3,980,000', '$10,000', '₹100,000',
       '₹250,000,000', '$1,000,000,000', '$7,000,000', '$35,000,000',
       '₹550,000,000', '$28,500,000', '$2,000,000', '₹240,000,000',
       '₹120,000,000', '$2,400,000', '$30,000,000', '₹2,500,000,000',
       '$23,000,000', '$150,000', '$11,000,000', '₹44,000,000',
       '$3,240,000', '₹60

## Assumptions Made for Amount Column
- Amounts without currency symbols in the 2018 dataset are in USD.
- Indian Rupee (INR) amounts are converted to US Dollars (USD) using the average exchange rate for the relevant year.
- For 2018, the average rate of 0.0146 USD per INR is used, taken from https://www.exchangerates.org.uk/INR-USD-spot-exchange-rates-history-2018.html.
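
These assumptions can be expressed as a single conversion rule. The sketch below uses a hypothetical helper, `amount_to_usd`, which is not the notebook's own implementation; it only illustrates how the three cases (rupee prefix, dollar prefix or no prefix, undisclosed) map to a USD value.

```python
import math

INR_TO_USD_2018 = 0.0146  # average 2018 rate quoted in the assumptions above

def amount_to_usd(raw, inr_rate=INR_TO_USD_2018):
    """Convert a raw Amount string to a USD float; return NaN when undisclosed."""
    text = str(raw).replace(",", "").strip()
    if text in ("", "—"):
        return math.nan
    if text.startswith("₹"):
        return float(text[1:]) * inr_rate
    # '$'-prefixed and unprefixed amounts are both treated as USD
    return float(text.lstrip("$"))

for sample in ["250000", "₹40,000,000", "$143,145", "—"]:
    print(sample, "->", amount_to_usd(sample))
```

Applied to a column, this would be `df['Amount'].map(amount_to_usd)`.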

In [21]:
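# Cleaning the Amount column

df['Amount'] = df['Amount'].astype(str)
# Strip thousands separators and '$' signs ('$' is escaped because it is a regex anchor);
# the '—' placeholder for undisclosed amounts becomes '0' and is turned into NaN further down
df['Amount'] = df['Amount'].replace([',', r'\$', '—'], ['', '', '0'], regex=True)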

In [22]:
# Set the desired exchange rate
exchange_rate = 0.0146

# Extract the Indian currency amount
df['Indiancurr'] = df['Amount'].str.rsplit('₹', n=2).str[1]
df['Indiancurr'] = df['Indiancurr'].apply(float).fillna(0)

# Convert Indian currency to USD using the specified exchange rate
df['UsCurr'] = df['Indiancurr'] * exchange_rate

# Replace 0 values with NaN
df['UsCurr'] = df['UsCurr'].replace(0, np.nan)

# Fill NaN values in 'UsCurr' with original 'Amount' values
df['UsCurr'] = df['UsCurr'].fillna(df['Amount'])

# Remove any remaining '$' symbol from 'UsCurr' ('$' must be escaped in a regex)
df['UsCurr'] = df['UsCurr'].replace(r"\$", "", regex=True)

# Update 'Amount' column with converted USD values
df['Amount'] = df['UsCurr'].apply(lambda x: float(str(x).replace("$","")))

# Replace 0 values with NaN in 'Amount' column
df['Amount'] = df['Amount'].replace(0, np.nan)

# Format the 'Amount' column
format_amount = lambda amount: "{:,.2f}".format(amount)
df['Amount'] = df['Amount'].map(format_amount)


In [23]:
df['Amount'].head() # now let's confirm the Amount column one more time 

0      250,000.00
1      584,000.00
2      949,000.00
3    2,000,000.00
4             nan
Name: Amount, dtype: object

In [24]:
df['Amount'] = df['Amount'].str.replace(',', '').astype(float)  # the Amount column holds monetary values, so convert it to float
type(df['Amount'][0])

numpy.float64

#### Handling Categorical Data
Next, we handle the categorical data in the 'Industry', 'Round/Series', and 'Location' columns.

We start by examining the unique values in each column with the `unique()` function to identify inconsistencies or variations.

### Location Column

In [25]:
df['Location'].unique()  # list the unique values

array(['Bangalore, Karnataka, India', 'Mumbai, Maharashtra, India',
       'Gurgaon, Haryana, India', 'Noida, Uttar Pradesh, India',
       'Hyderabad, Andhra Pradesh, India', 'Bengaluru, Karnataka, India',
       'Kalkaji, Delhi, India', 'Delhi, Delhi, India', 'India, Asia',
       'Hubli, Karnataka, India', 'New Delhi, Delhi, India',
       'Chennai, Tamil Nadu, India', 'Mohali, Punjab, India',
       'Kolkata, West Bengal, India', 'Pune, Maharashtra, India',
       'Jodhpur, Rajasthan, India', 'Kanpur, Uttar Pradesh, India',
       'Ahmedabad, Gujarat, India', 'Azadpur, Delhi, India',
       'Haryana, Haryana, India', 'Cochin, Kerala, India',
       'Faridabad, Haryana, India', 'Jaipur, Rajasthan, India',
       'Kota, Rajasthan, India', 'Anand, Gujarat, India',
       'Bangalore City, Karnataka, India', 'Belgaum, Karnataka, India',
       'Thane, Maharashtra, India', 'Margão, Goa, India',
       'Indore, Madhya Pradesh, India', 'Alwar, Rajasthan, India',
       'Kannur, Kerala, Ind

#### The Location column contains combined information (e.g., city, state, country)

In [26]:
# The 'Location' column is in the format 'City, Region, Country'.
# Only the city is needed for this analysis,
# so keep everything before the first comma.

df['Location'] = df['Location'].apply(str)
df['Location'] = df['Location'].str.split(',').str[0]
df['Location'] = df['Location'].replace("'","",regex=True)

In [27]:
# Some cities appear under more than one name (e.g. 'Bangalore' and 'Bengaluru').
# Map the variants to a single spelling so that each city is counted once.
df['Location'] = df['Location'].replace(['Bangalore', 'Bangalore City'], 'Bengaluru')
df.loc[~df['Location'].str.contains('New Delhi', na=False), 'Location'] = df['Location'].str.replace('Delhi', 'New Delhi')
df['Location'] = df['Location'].replace(['Gurgaon'], 'Gurugram')

In [28]:
df['Location'].unique() # checking the unique values once more

array(['Bengaluru', 'Mumbai', 'Gurugram', 'Noida', 'Hyderabad', 'Kalkaji',
       'New Delhi', 'India', 'Hubli', 'Chennai', 'Mohali', 'Kolkata',
       'Pune', 'Jodhpur', 'Kanpur', 'Ahmedabad', 'Azadpur', 'Haryana',
       'Cochin', 'Faridabad', 'Jaipur', 'Kota', 'Anand', 'Belgaum',
       'Thane', 'Margão', 'Indore', 'Alwar', 'Kannur', 'Trivandrum',
       'Ernakulam', 'Kormangala', 'Uttar Pradesh', 'Andheri', 'Mylapore',
       'Ghaziabad', 'Kochi', 'Powai', 'Guntur', 'Kalpakkam', 'Bhopal',
       'Coimbatore', 'Worli', 'Alleppey', 'Chandigarh', 'Guindy',
       'Lucknow'], dtype=object)
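
The alias handling above can also be written as one lookup table, which makes it easy to add further variants. In the sketch below, the first three entries mirror replacements already made in this section. The last two are illustrative assumptions only: 'Kalkaji' appears in the raw data as 'Kalkaji, Delhi, India', and 'Cochin' is an older name for Kochi, but the notebook leaves both unchanged.

```python
import pandas as pd

raw = pd.Series([
    "Bangalore, Karnataka, India",
    "Bangalore City, Karnataka, India",
    "Delhi, Delhi, India",
    "Kalkaji, Delhi, India",
    "Cochin, Kerala, India",
])

# Alias table: variant spelling -> canonical city name
CITY_ALIASES = {
    "Bangalore": "Bengaluru",
    "Bangalore City": "Bengaluru",
    "Delhi": "New Delhi",
    "Kalkaji": "New Delhi",   # assumption: fold the locality into its city
    "Cochin": "Kochi",        # assumption: merge the two names of the same city
}

# Keep the text before the first comma, trim it, then apply the alias table
city = raw.str.split(",").str[0].str.strip().replace(CITY_ALIASES)
print(city.tolist())
```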

### Industry Column

In [29]:
df['Industry'].value_counts() # taking a look at the Industry column first to have some insight into the column 

Industry
—                                                                                                                                           30
Financial Services                                                                                                                          15
Education                                                                                                                                    8
Information Technology                                                                                                                       7
Finance, Financial Services                                                                                                                  5
Health Care, Hospital                                                                                                                        5
Artificial Intelligence                                                                                                              

In [30]:
# let's check all the unique values in the industry column
df['Industry'].unique()

array(['Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing',
       'Agriculture, Farming',
       'Credit, Financial Services, Lending, Marketplace',
       'Financial Services, FinTech',
       'E-Commerce Platforms, Retail, SaaS',
       'Cloud Infrastructure, PaaS, SaaS',
       'Internet, Leisure, Marketplace', 'Market Research',
       'Information Services, Information Technology', 'Mobile Payments',
       'B2B, Shoes', 'Internet',
       'Apps, Collaboration, Developer Platform, Enterprise Software, Messaging, Productivity Tools, Video Chat',
       'Food Delivery', 'Industrial Automation',
       'Automotive, Search Engine, Service Industry',
       'Finance, Internet, Travel',
       'Accounting, Business Information Systems, Business Travel, Finance, SaaS',
       'Artificial Intelligence, Product Search, SaaS, Service Industry, Software',
       'Internet of Things, Waste Management',
       'Air Transportation, Freight Service, Logistics, Marine Transport

In [31]:
# Keep only the first value listed in the Industry column
df['Industry'] = df['Industry'].str.split(',').str[0]

In [32]:
# Clean Industry column
df['Industry'] = df['Industry'].str.strip()  # Remove leading and trailing spaces
df['Industry'] = df['Industry'].str.title()  # Standardize capitalization

In [33]:
df[df['Industry']=='—']

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Indiancurr,UsCurr
58,MissMalini Entertainment,—,Seed,1518400.0,Mumbai,"MissMalini Entertainment is a multi-platform new media network dedicated to Entertaining, Connecting & Empowering young Indians.",0.0,1518400.0
105,Jagaran Microfin,—,Debt Financing,8030000.0,Kolkata,Jagaran Microfin is a Microfinance institution which achieves a healthy amalgamation of social and financial,0.0,8030000.0
121,FLEECA,—,Seed,,Jaipur,FLEECA is a Tyre Care Provider company.,0.0,
146,WheelsEMI,—,Series B,14000000.0,Pune,"WheelsEMI is the brand name of NBFC, WheelsEMI Pvt. Ltd.",0.0,14000000.0
153,Fric Bergen,—,Venture - Series Unknown,,Alwar,Fric Bergen is a leader in the specialty food industry.,0.0,
174,Deftouch,—,Seed,,Bengaluru,Deftouch is a mobile game development company that currently focuses on winning the Cricket gaming market with a social multiplayer game.,0.0,
181,Corefactors,—,Seed,,Bengaluru,"Corefactors is a leading campaign management, business communication and analytics company.",0.0,
210,Cell Propulsion,—,Seed,102200.0,Bengaluru,Cell Propulsion is an electric mobility startup that designs autonomous electric vehicles.,0.0,102200.0
230,Flathalt,—,Angel,50000.0,Gurugram,FInd your Customized Home here.,0.0,50000.0
235,dishq,—,Seed,400000.0,Bengaluru,dishq leverages food science and machine learning (AI) to understand and predict people's tastes.,0.0,400000.0


In [None]:
df.loc[(df['Company Name'] == 'MissMalini Entertainment') & (df['Industry'] == '—'), 'Industry'] = 'Fashion and Lifestyle Blog'
df.loc[(df['Company Name'] == 'Jagaran Microfin') & (df['Industry'] =='—'), 'Industry'] = 'Financial Services'
df.loc[(df['Company Name'] == 'FLEECA') & (df['Industry'] == '—'), 'Industry'] = 'Automotive Services'
df.loc[(df['Company Name'] == 'WheelsEMI') & (df['Industry'] == '—'), 'Industry'] = 'Automotive Financing'
df.loc[(df['Company Name'] == 'Fric Bergen') & (df['Industry'] == '—'), 'Industry'] = 'Food and Beverage'
df.loc[(df['Company Name'] == 'Deftouch') & (df['Industry'] == '—'), 'Industry'] = 'Gaming and Entertainment'
df.loc[(df['Company Name'] == 'Corefactors') & (df['Industry'] == '—'), 'Industry'] = 'Marketing Technology'
df.loc[(df['Company Name'] == 'Cell Propulsion') & (df['Industry'] == '—'), 'Industry'] = 'Electric Vehicle Technology'
df.loc[(df['Company Name'] == 'Flathalt') & (df['Industry'] == '—'), 'Industry'] = 'Real Estate Technology'
df.loc[df['Company Name'] == 'dishq', 'Company Name'] = 'DISH'
df.loc[(df['Company Name'] == 'DISH')& (df['Industry'] == '—'), 'Industry'] = 'Telecommunications'
df.loc[(df['Company Name'] == 'Trell') & (df['Industry'] == '—'), 'Industry'] = 'E-commerce'
df.loc[df['Company Name'] == 'HousingMan.com', 'Company Name'] = 'HousingMan'
df.loc[(df['Company Name'] == 'HousingMan') & (df['Industry'] == '—'), 'Industry'] = 'Real Estate Technology'
df.loc[(df['Company Name'] == 'Steradian Semiconductors') & (df['Industry'] == '—'), 'Industry'] = 'Automotive Technology'
df.loc[(df['Company Name'] == 'SaffronStays') & (df['Industry'] == '—'), 'Industry'] = 'Hospitality Technology'
df.loc[(df['Company Name'] == 'Inner Being Wellness')  & (df['Industry'] == '—'), 'Industry'] = 'Health and Wellness'
df.loc[(df['Company Name'] == 'MySEODoc') & (df['Industry'] == '—'), 'Industry'] = 'Digital Marketing'
df.loc[df['Company Name'] == 'ENLYFT DIGITAL SOLUTIONS PRIVATE LIMITED', 'Company Name'] = 'ENLYFT DIGITAL SOLUTIONS'
df.loc[(df['Company Name'] == 'ENLYFT DIGITAL SOLUTIONS') & (df['Industry'] == '—'), 'Industry'] = 'Digital Marketing'
df.loc[(df['Company Name'] == 'Scale Labs') & (df['Industry'] == '—'), 'Industry'] = 'E-commerce Solutions'
df.loc[(df['Company Name'] == 'Roadcast')  & (df['Industry'] == '—'), 'Industry'] = 'Transportation and Logistics Technology'
df.loc[df['Company Name'] == 'Toffee', 'Company Name'] = 'Toffee Pvt Ltd'
df.loc[(df['Company Name'] == 'Toffee Pvt Ltd')& (df['Industry'] == '—'), 'Industry'] = 'Digital Marketing'
df.loc[(df['Company Name'] == 'ORO Wealth') & (df['Industry'] == '—'), 'Industry'] = 'Financial Technology'
df.loc[(df['Company Name'] == 'Finwego')& (df['Industry'] =='—'), 'Industry']= 'Human Resources Technology'
df.loc[(df['Company Name'] == 'Cred') & (df['Industry'] == '—'), 'Industry'] = 'Fintech'
df.loc[(df['Company Name'] == 'Origo') & (df['Industry'] == '—'), 'Industry']  = 'Agri-Fintech'
df.loc[(df['Company Name'] == 'Sequretek') & (df['Industry'] == '—'), 'Industry']= 'Cybersecurity'
df.loc[df['Company Name'] == 'Avenues Payments India Pvt. Ltd.', 'Company Name'] = 'Avenues Payments'
df.loc[(df['Company Name'] == 'Avenues Payments') & (df['Industry'] == '—'), 'Industry'] = 'eCommerce Solutions'
df.loc[df['Company Name'] == 'Planet11 eCommerce Solutions India (Avenue11)', 'Company Name'] = 'Planet11'
df.loc[(df['Company Name'] == 'Planet11')& (df['Industry'] == '—'), 'Industry'] = 'eCommerce Solutions'
df.loc[(df['Company Name'] == 'Iba Halal Care') & (df['Industry'] == '—'), 'Industry'] = 'Cosmetics'
df.loc[(df['Company Name'] == 'Togedr') & (df['Industry'] == '—'), 'Industry'] = 'Travel and Adventure'
df.loc[(df['Company Name'] == 'Scholify') & (df['Industry'] == '—'), 'Industry'] = 'EdTech'
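
The same kind of fill can be driven by a lookup dictionary instead of one `df.loc` statement per company. The sketch below is an alternative under that assumption, using two of the assignments above on a made-up frame; `INDUSTRY_BY_COMPANY` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Company Name": ["FLEECA", "WheelsEMI", "Eunimart"],
    "Industry": ["—", "—", "E-Commerce Platforms"],
})

# Lookup built from two of the manual assignments above
INDUSTRY_BY_COMPANY = {
    "FLEECA": "Automotive Services",
    "WheelsEMI": "Automotive Financing",
}

is_placeholder = toy["Industry"].eq("—")
looked_up = toy["Company Name"].map(INDUSTRY_BY_COMPANY)

# Fill only placeholder rows that have a lookup entry; leave everything else untouched
toy["Industry"] = toy["Industry"].mask(is_placeholder & looked_up.notna(), looked_up)
print(toy)
```

Rows whose industry is already known, such as 'Eunimart' here, are left as they are even if the company were present in the lookup.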

In [34]:
company_mapping = {
    'dishq': 'DISH',
    'HousingMan.com': 'HousingMan',
    'ENLYFT DIGITAL SOLUTIONS PRIVATE LIMITED': 'ENLYFT DIGITAL SOLUTIONS',
    'Toffee': 'Toffee Pvt Ltd',
    'Avenues Payments India Pvt. Ltd.': 'Avenues Payments',
    'Planet11 eCommerce Solutions India (Avenue11)': 'Planet11',
    
}

industry_mapping = {
    '—': '',
    'Fashion and Lifestyle Blog': 'Fashion and Lifestyle Blog',
    'Financial Services': 'Financial Services',
    'Automotive Services': 'Automotive Services',
    'Automotive Financing': 'Automotive Financing',
    'Food and Beverage': 'Food and Beverage',
    'Gaming and Entertainment': 'Gaming and Entertainment',
    'Marketing Technology': 'Marketing Technology',
    'Electric Vehicle Technology': 'Electric Vehicle Technology',
    'Real Estate Technology': 'Real Estate Technology',
    'Telecommunications': 'Telecommunications',
    'E-commerce': 'E-commerce',
    'Hospitality Technology': 'Hospitality Technology',
    'Health and Wellness': 'Health and Wellness',
    'Digital Marketing': 'Digital Marketing',
    'E-commerce Solutions': 'E-commerce Solutions',
    'Transportation and Logistics Technology': 'Transportation and Logistics Technology',
    'Cosmetics': 'Cosmetics',
    'Travel and Adventure': 'Travel and Adventure',
    'EdTech': 'EdTech'
}

# Apply the mappings: rename the listed companies and turn any remaining '—' in Industry into an empty string
df['Company Name'] = df['Company Name'].apply(lambda x: company_mapping[x] if x in company_mapping else x)
df['Industry'] = df['Industry'].apply(lambda x: industry_mapping[x] if x in industry_mapping else x)


In [35]:
df[df['Industry']=='—']

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Indiancurr,UsCurr


In [36]:
df.head() # getting the first sample of the data set 

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Indiancurr,UsCurr
0,TheCollegeFever,Brand Marketing,Seed,250000.0,Bengaluru,"TheCollegeFever is a hub for fun, fiesta and frolic of Colleges.",0.0,250000.0
1,Happy Cow Dairy,Agriculture,Seed,584000.0,Mumbai,A startup which aggregates milk from dairy farmers in rural Maharashtra.,0.0,584000.0
2,MyLoanCare,Credit,Series A,949000.0,Gurugram,Leading Online Loans Marketplace in India,0.0,949000.0
3,PayMe India,Financial Services,Angel,2000000.0,Noida,PayMe India is an innovative FinTech organization which offers short term financial suport to corporate employees.,0.0,2000000.0
4,Eunimart,E-Commerce Platforms,Seed,,Hyderabad,Eunimart is a one stop solution for merchants to create a difference by selling globally.,0.0,


### Round/Series Column

In [37]:
df['Round/Series'].unique() # getting the unique values 

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO Debt', 'Series H', 'Series C',
       'Series E', 'Corporate Round', 'Undisclosed',
       'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
       'Series D', 'Secondary Market', 'Post-IPO Equity',
       'Non-equity Assistance', 'Funding Round'], dtype=object)

In [38]:
# Replace placeholder stages ('Undisclosed', 'Venture - Series Unknown') and a stray URL with NaN

df['Round/Series'] = df['Round/Series'].replace('Undisclosed', np.nan)
df['Round/Series'] = df['Round/Series'].replace('Venture - Series Unknown', np.nan)
df['Round/Series'] = df['Round/Series'].replace('https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593', np.nan)

In [39]:
df['Round/Series'].unique() # getting the unique values 

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', nan, 'Grant', 'Debt Financing', 'Post-IPO Debt',
       'Series H', 'Series C', 'Series E', 'Corporate Round', 'Series D',
       'Secondary Market', 'Post-IPO Equity', 'Non-equity Assistance',
       'Funding Round'], dtype=object)
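
Matching the stray URL by its full text works for this file but would miss a different link in another year's data. A hedged alternative, sketched on a made-up Series with a placeholder URL, flags anything that looks like a URL and combines that with the labels treated as unknown above:

```python
import numpy as np
import pandas as pd

stage = pd.Series(["Seed", "Undisclosed", "https://example.com/sheet", "Series A",
                   "Venture - Series Unknown"])

# Labels treated as "unknown" in the cell above
UNKNOWN_LABELS = {"Undisclosed", "Venture - Series Unknown"}

# Anything that starts like a URL cannot be a funding stage
looks_like_url = stage.str.match(r"https?://", na=False)
is_unknown = stage.isin(UNKNOWN_LABELS)

# Blank out both kinds of invalid value
cleaned = stage.mask(looks_like_url | is_unknown, np.nan)
print(cleaned.tolist())
```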

### Clean Text Data 

In [40]:
# Clean Company Name column
df['Company Name'] = df['Company Name'].str.strip()  # Remove leading and trailing spaces
df['Company Name'] = df['Company Name'].str.title()  # Standardize capitalization

# Clean About Company column
df['About Company'] = df['About Company'].str.strip()  # Remove leading and trailing spaces

# Function to handle special characters or encoding issues
def clean_text(text):
    # Remove special characters using regex
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return cleaned_text

# Apply the clean_text function to the About Company column
df['About Company'] = df['About Company'].apply(clean_text)

# Print the cleaned DataFrame
df.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Indiancurr,UsCurr
0,Thecollegefever,Brand Marketing,Seed,250000.0,Bengaluru,TheCollegeFever is a hub for fun fiesta and frolic of Colleges,0.0,250000.0
1,Happy Cow Dairy,Agriculture,Seed,584000.0,Mumbai,A startup which aggregates milk from dairy farmers in rural Maharashtra,0.0,584000.0
2,Myloancare,Credit,Series A,949000.0,Gurugram,Leading Online Loans Marketplace in India,0.0,949000.0
3,Payme India,Financial Services,Angel,2000000.0,Noida,PayMe India is an innovative FinTech organization which offers short term financial suport to corporate employees,0.0,2000000.0
4,Eunimart,E-Commerce Platforms,Seed,,Hyderabad,Eunimart is a one stop solution for merchants to create a difference by selling globally,0.0,
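
The pattern `[^a-zA-Z0-9\s]` removes punctuation, but it also removes accented letters (the kind seen in 'Margão' in the Location column) and raises an error on non-string values. If that matters for later text analysis, a Unicode-aware variant is one option. `clean_text_unicode` below is a hypothetical alternative, not the function used in this notebook.

```python
import re

def clean_text_unicode(text):
    """Strip punctuation but keep letters and digits from any script."""
    if not isinstance(text, str):
        return text  # leave NaN or other non-string values untouched
    # \w matches Unicode letters, digits, and the underscore
    without_punct = re.sub(r"[^\w\s]", "", text)
    # Collapse the gaps left behind by removed characters
    return re.sub(r"\s+", " ", without_punct).strip()

print(clean_text_unicode("Café & Co. sells crêpes!"))
```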


In [41]:
df.columns  # confirm the columns currently in the dataset

Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company', 'Indiancurr', 'UsCurr'],
      dtype='object')

In [42]:
df.drop(columns=['Indiancurr','UsCurr'], inplace=True)  # drop the helper columns that are no longer needed

In [43]:
df.insert(6,"Funding Year", 2018)  # add a 'Funding Year' column so these rows can be traced back to the 2018 data

In [44]:
# Rename the columns so that all four datasets share the same names when combined

df.rename(columns = {'Company Name':'Company',
                        'Industry':'Sector',
                        'About Company':'About',
                        'Round/Series' : 'Stage'},
             inplace = True)

In [45]:
df.head()  # confirm all changes before saving the data

Unnamed: 0,Company,Sector,Stage,Amount,Location,About,Funding Year
0,Thecollegefever,Brand Marketing,Seed,250000.0,Bengaluru,TheCollegeFever is a hub for fun fiesta and frolic of Colleges,2018
1,Happy Cow Dairy,Agriculture,Seed,584000.0,Mumbai,A startup which aggregates milk from dairy farmers in rural Maharashtra,2018
2,Myloancare,Credit,Series A,949000.0,Gurugram,Leading Online Loans Marketplace in India,2018
3,Payme India,Financial Services,Angel,2000000.0,Noida,PayMe India is an innovative FinTech organization which offers short term financial suport to corporate employees,2018
4,Eunimart,E-Commerce Platforms,Seed,,Hyderabad,Eunimart is a one stop solution for merchants to create a difference by selling globally,2018


In [46]:
df.to_csv('df18.csv', index=False)  # save the cleaned 2018 data as df18.csv


# 2019 Data

In [47]:
df2.head() # first let's look at the head of the data set 

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,"A learning platform that provides topic-based journey, animated videos, quizzes, infographic and mock tests to students","Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ Labs","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, processing and marketing of seeds",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [48]:
df2.shape # now let's look at the shape of the data to get some idea about the columns and rows 

(89, 9)

In [49]:
df2.columns # now let's look at the columns in the 2019 data sets 

Index(['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
       'Founders', 'Investor', 'Amount($)', 'Stage'],
      dtype='object')

In [50]:
df2.info() # getting information about the df2 DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [51]:
df2.describe(include='object').transpose() # getting general descriptive statistics of the df2 DataFrame

Unnamed: 0,count,unique,top,freq
Company/Brand,89,87,Kratikal,2
HeadQuarter,70,17,Bangalore,21
Sector,84,52,Edtech,7
What it does,89,88,Online meat shop,2
Founders,86,85,"Vivek Gupta, Abhay Hanjura",2
Investor,89,86,Undisclosed,3
Amount($),89,50,Undisclosed,12
Stage,43,15,Series A,10


#### Handling Duplicated Data

In [52]:
# below we check whether each column contains repeated values (this alone does not show that whole rows are duplicated)

columns_to_check2 = ['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does', 'Founders', 'Investor', 'Amount($)', 'Stage',]

for column2 in columns_to_check2:
    has_duplicates2 = df2[column2].duplicated().any()
    print(f'{column2}: {has_duplicates2}')

Company/Brand: True
Founded: True
HeadQuarter: True
Sector: True
What it does: True
Founders: True
Investor: True
Amount($): True
Stage: True


In [53]:
# below we drop the rows that are duplicated across all the listed columns

df2.drop_duplicates(subset=['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does', 'Founders', 'Investor', 'Amount($)', 'Stage',], inplace=True)
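The column-wise check above returns `True` for any column in which a value repeats, so it cannot tell whether entire rows are duplicated. A minimal sketch of a row-level check follows, assuming a toy frame with made-up rows (`demo`, `column_repeats` and `n_duplicate_rows` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy frame with hypothetical values: rows 0 and 2 are identical
demo = pd.DataFrame({
    "Company/Brand": ["Kratikal", "Acme Foods", "Kratikal"],
    "Sector": ["Technology", "Food", "Technology"],
    "Amount($)": ["$1,000,000", "Undisclosed", "$1,000,000"],
})

# Column-wise check: True whenever any value repeats inside a single column
column_repeats = {col: bool(demo[col].duplicated().any()) for col in demo.columns}

# Row-wise check: counts rows that repeat an earlier row in every column
n_duplicate_rows = int(demo.duplicated().sum())

# Drop the repeated rows and rebuild a clean 0..n-1 index
deduplicated = demo.drop_duplicates().reset_index(drop=True)

print(column_repeats)
print(n_duplicate_rows, len(deduplicated))
```

Comparing the number of rows before and after `drop_duplicates` is a quick way to see how many rows were actually removed.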

Now that we have a description of the data set, we can move on to data cleaning.

#### Missing Values

In [54]:
missing_values2 = df2.isnull().sum() # looking for missing values in dataFrame 2
missing_values2

Company/Brand     0
Founded          29
HeadQuarter      19
Sector            5
What it does      0
Founders          3
Investor          0
Amount($)         0
Stage            46
dtype: int64
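Raw counts are easier to compare across data sets of different sizes when they are shown next to a percentage. The sketch below is one possible way to build such a summary; the toy frame and the name `missing_summary` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
demo = pd.DataFrame({
    "Company/Brand": ["A", "B", "C", "D"],
    "Founded": [2014.0, np.nan, np.nan, 2010.0],
    "Stage": [np.nan, "Series A", np.nan, np.nan],
})

# isnull().sum() gives the count and isnull().mean() the fraction of missing values
missing_summary = pd.DataFrame({
    "missing": demo.isnull().sum(),
    "percent": (demo.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```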

Let's deal with the missing values shown in the output above.

In [55]:
df2['HeadQuarter'].unique() # let's get an idea of the unique values in the HeadQuarter column

array([nan, 'Mumbai', 'Chennai', 'Telangana', 'Pune', 'Bangalore',
       'Noida', 'Delhi', 'Ahmedabad', 'Gurugram', 'Haryana', 'Chandigarh',
       'Jaipur', 'New Delhi', 'Surat', 'Uttar pradesh', 'Hyderabad',
       'Rajasthan'], dtype=object)

In [56]:
df2[df2['HeadQuarter'].isnull()]

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
8,Quantiphi,,,AI & Tech,It is an AI and big data services company providing business solutions.,Renuka Ramnath,Multiples Alternate Asset Management,"$20,000,000",Series A
14,Open Secret,,,Food tech,It produces and sells top quality snacks,"Ahana Gautam, Udit Kejriwal",Matrix Partners,Undisclosed,
16,Byju's,2011.0,,Edtech,Provides online learning classes,Byju Raveendran,"South Africa’s Naspers Ventures, the CPP Investment Board","$540,000,000",
18,Witblox,2014.0,,Edtech,Offers a range of robotics learning tools,Amit Modi,Mumbai Angels Network,"$182,700",
20,SalaryFits,2015.0,,Fintech,A platform that promotes financial well-being of employees at workplace,Renato Araujo,Brazilian VC Fund Confrapar,"$5,000,000",
21,Pristyn Care,2018.0,,Healthcare,Delivers advanced medical care & clinical excellence aided by next-level technology,"Harsimarbir Singh, Dr Vaibhav Kapoor, Dr Garima Sawhney","Sequoia India, Hummingbird Ventures, Greenoaks Capital, AngelList.","$12,000,000",Series B
22,Springboard,2013.0,,Edtech,Offers online courses and extensive mentor-based learning,"Gautam Tambay, Parul Gupta",Reach Capital,"$11,000,000",Post series A
27,Fireflies .ai,,,AI,Developer of an artificial intelligence-powered assistant for businesses,Sam Udotong,Canaan Partners,"$5,000,000",
29,Bijak,2019.0,,AgriTech,B2B platform for agricultural commodities.,"Nukul Upadhye, Mahesh Jakhotia, Jitender Bedwal, Daya Rai, Nikhil Tripathi","Omnivore and Omidyar Network India, Sequoia Capital","$2,500,000",Seed fund


In [57]:
df2.loc[df2['Company/Brand'] == 'Bombay Shaving', 'HeadQuarter'] = 'Gurugram'
df2.loc[df2['Company/Brand'] == 'Quantiphi', 'HeadQuarter'] = 'Marlborough'
df2.loc[df2['Company/Brand'] == 'Open Secret', 'HeadQuarter'] = 'Mumbai'
df2.loc[df2['Company/Brand'] == "Byju's", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == "Witblox", 'HeadQuarter'] = 'Mumbai'
df2.loc[df2['Company/Brand'] == "SalaryFits", 'HeadQuarter'] = 'London'
df2.loc[df2['Company/Brand'] == "Pristyn Care", 'HeadQuarter'] = 'Gurgaon'
df2.loc[df2['Company/Brand'] == "Springboard", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == "Fireflies .ai", 'HeadQuarter'] = 'San Francisco'
df2.loc[df2['Company/Brand'] == "Bijak", 'HeadQuarter'] = 'New Delhi'
df2.loc[df2['Company/Brand'] == "truMe", 'HeadQuarter'] = 'Gurugram'
df2.loc[df2['Company/Brand'] == "Rivigo", 'HeadQuarter'] = 'Gurgaon'
df2.loc[df2['Company/Brand'] == "VMate", 'HeadQuarter'] = 'Gurgaon'
df2.loc[df2['Company/Brand'] == "Slintel", 'HeadQuarter'] = 'California'
df2.loc[df2['Company/Brand'] == "Ninjacart", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == "Zebu", 'HeadQuarter'] = 'London'
df2.loc[df2['Company/Brand'] == "Phable", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == "Zolostays", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == 'Cubical Labs', 'HeadQuarter'] = 'New Delhi'
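The same manual fill can be written more compactly with a dictionary. This is only a sketch under assumptions: the toy frame and the name `hq_by_company` are illustrative, and the three mappings are copied from the assignments above.

```python
import numpy as np
import pandas as pd

# Toy frame: three companies lack a headquarters, one already has it
demo = pd.DataFrame({
    "Company/Brand": ["Bombay Shaving", "Ruangguru", "Quantiphi", "Byju's"],
    "HeadQuarter": [np.nan, "Mumbai", np.nan, np.nan],
})

# Company -> headquarters, values taken from the assignments above
hq_by_company = {
    "Bombay Shaving": "Gurugram",
    "Quantiphi": "Marlborough",
    "Byju's": "Bengaluru",
}

# map() yields NaN for companies missing from the dictionary,
# and fillna() only writes into cells that are currently empty
demo["HeadQuarter"] = demo["HeadQuarter"].fillna(
    demo["Company/Brand"].map(hq_by_company)
)

print(demo)
```

Unlike the `.loc` assignments, this variant never overwrites a headquarters that is already present, which may or may not be the desired behaviour.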


In [58]:
df2[df2['HeadQuarter'].isnull()]

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage


In [59]:
# below we replace some city names with their official names
df2.loc[~df2['HeadQuarter'].str.contains('New Delhi', na=False), 'HeadQuarter'] = df2['HeadQuarter'].str.replace('Delhi', 'New Delhi')
df2['HeadQuarter'] = df2['HeadQuarter'].replace(['Bangalore', 'Bangalore City'], 'Bengaluru')
df2['HeadQuarter'] = df2['HeadQuarter'].replace(['Gurgaon'], 'Gurugram')

In [60]:
df2['HeadQuarter'].unique() # now let's confirm the unique values again

array(['Gurugram', 'Mumbai', 'Chennai', 'Telangana', 'Pune', 'Bengaluru',
       'Noida', 'Marlborough', 'New Delhi', 'Ahmedabad', 'London',
       'Haryana', 'San Francisco', 'Chandigarh', 'Jaipur', 'California',
       'Surat', 'Uttar pradesh', 'Hyderabad', 'Rajasthan'], dtype=object)

In [61]:
df2['Sector'].unique() # now let's look at the unique values of the 'Sector' column

array(['Ecommerce', 'Edtech', 'Interior design', 'AgriTech', 'Technology',
       'SaaS', 'AI & Tech', 'E-commerce', 'E-commerce & AR', 'Fintech',
       'HR tech', 'Food tech', 'Health', 'Healthcare', 'Safety tech',
       'Pharmaceutical', 'Insurance technology', 'AI', 'Foodtech', 'Food',
       'IoT', 'E-marketplace', 'Robotics & AI', 'Logistics', 'Travel',
       'Manufacturing', 'Food & Nutrition', 'Social Media', nan,
       'E-Sports', 'Cosmetics', 'B2B', 'Jewellery', 'B2B Supply Chain',
       'Games', 'Food & tech', 'Accomodation', 'Automotive tech',
       'Legal tech', 'Mutual Funds', 'Cybersecurity', 'Automobile',
       'Sports', 'Healthtech', 'Yoga & wellness', 'Virtual Banking',
       'Transportation', 'Transport & Rentals',
       'Marketing & Customer loyalty', 'Infratech', 'Hospitality',
       'Automobile & Technology', 'Banking'], dtype=object)

In [62]:
df2[df2['Sector'].isnull()]

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
41,VMate,,Gurugram,,A short video platform,,Alibaba,"$100,000,000",
49,Awign Enterprises,2016.0,Bengaluru,,It supplies workforce to the economy,"Annanya Sarthak, Gurpreet Singh, Praveen Sah","Work10M, Michael and Susan Dell Foundation, Eagle10, Unitus Ventures.","$4,000,000",Series A
52,TapChief,2016.0,Bengaluru,,It connects individuals in need of advice in a specific domain to individuals who have expertise in the same,"Shashank Murali, Binay Krishna, Arjun Krishna",Blume Ventures.,"$1,500,000",Pre series A
56,KredX,,Bengaluru,,Invoice discounting platform,Manish Kumar,Tiger Global Management,"$26,000,000",Series B
57,m.Paani,,Mumbai,,It digitizes and organises local retailers,Akanksha Hazari,"AC Ventures, Henkel","$5,500,000",Series A


In [63]:
df2.loc[df2['Company/Brand'] == 'VMate', 'Sector'] = 'Short Video Platform'
df2.loc[df2['Company/Brand'] == 'Awign Enterprises', 'Sector'] = 'Workforce Solutions'
df2.loc[df2['Company/Brand'] == 'TapChief', 'Sector'] = 'Online Consulting'
df2.loc[df2['Company/Brand'] == 'KredX', 'Sector'] = 'Fintech'
df2.loc[df2['Company/Brand'] == 'm.Paani', 'Sector'] = 'E-commerce'

In [64]:
df2[df2['Sector'].isnull()]

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage


In [65]:
df2['Stage'].unique() # now let's look at the unique values of the 'Stage' column

array([nan, 'Series C', 'Fresh funding', 'Series D', 'Pre series A',
       'Series A', 'Series G', 'Series B', 'Post series A',
       'Seed funding', 'Seed fund', 'Series E', 'Series F', 'Series B+',
       'Seed round', 'Pre-series A'], dtype=object)

In [66]:
# Replacing empty strings with NaN (assigned back to the column rather than modified in place on a selection)
df2['Stage'] = df2['Stage'].replace('', np.nan)

In [67]:
df2.isnull().sum() # let's check for null values and count them per column

Company/Brand     0
Founded          29
HeadQuarter       0
Sector            0
What it does      0
Founders          3
Investor          0
Amount($)         0
Stage            46
dtype: int64

#### Standardizing Data Formats

Now let's standardize the data set so that all data points share the same format.

First, let's check for dash symbols within the columns using a simple Python loop.

In [68]:
# checking for '-' symbol within the columns

columns_to_check2 = ['Company/Brand', 'HeadQuarter', 'Sector', 'What it does', 'Amount($)', 'Stage']

for column2 in columns_to_check2:
    has_dash_symbols2 = df2[column2].astype(str).str.contains('-').any()
    print(f'{column2}: {has_dash_symbols2}')

Company/Brand: False
HeadQuarter: False
Sector: True
What it does: True
Amount($): False
Stage: True


In [69]:
df2[df2['Sector'].str.contains('-', na=False)]

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
9,Lenskart,2010.0,New Delhi,E-commerce,It is a eyewear company,"Peyush Bansal, Amit Chaudhary, Sumeet Kapahi",SoftBank,"$275,000,000",Series G
10,Cub McPaws,2010.0,Mumbai,E-commerce & AR,A B2C brand that focusses on premium and comfortable merchandise for Generation Alpha – kids,"Abhay Bhat, Kinnar Shah",Venture Catalysts,Undisclosed,
32,Pumpkart,2014.0,Chandigarh,E-marketplace,B2B model for appliances and electrical products,KS Bhatia,Dinesh Dua,Undisclosed,
38,Freshokartz,2016.0,Jaipur,E-marketplace,Online fruits and vegetables delivery company,Rajendra Lora,ThinkLab,"$150,000",Pre series A
42,Bombay Shirt Company,2012.0,Mumbai,E-commerce,Online custom shirt brand,Akshay Narvekar,Lightbox Ventures,"$8,000,000",
44,MyGameMate,,Bengaluru,E-Sports,eSports platform where players can access various multiplayer mobile games to participate in online tournaments,Parshavv Jain& Raju Kushwaha,"Jindagi Live Angels,","$100,000",
57,m.Paani,,Mumbai,E-commerce,It digitizes and organises local retailers,Akanksha Hazari,"AC Ventures, Henkel","$5,500,000",Series A
64,Moms Co,,New Delhi,E-commerce,It is into mother and baby care-focused consumer goods,Malika Sadani,"Saama Capital, DSG Consumer Partners","$5,000,000",Series B


In [70]:
df2[df2['Stage'].str.contains('-', na=False)]

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
82,Kratikal,,Uttar pradesh,Technology,Provides cyber security solutions,Pavan Kushwaha,"Gilda VC, Art Venture, Rajeev Chitrabhanu","$1,000,000",Pre-series A


In [71]:
# checking for currency symbol 

columns_to_check2 = ['Company/Brand','HeadQuarter', 'Sector', 'What it does', 'Amount($)']

for column2 in columns_to_check2:
    has_currency_symbols = df2[column2].astype(str).str.contains('[$₹]').any()
    print(f'{column2}: {has_currency_symbols}')

Company/Brand: False
HeadQuarter: False
Sector: False
What it does: False
Amount($): True


Now let's handle the currency symbols in the Amount column, then clean and format the column correctly.

In [72]:
df2['Amount($)'].unique() # let's check for unique values 

array(['$6,300,000', '$150,000,000', '$28,000,000', '$30,000,000',
       '$6,000,000', 'Undisclosed', '$1,000,000', '$20,000,000',
       '$275,000,000', '$22,000,000', '$5,000,000', '$140,500',
       '$540,000,000', '$15,000,000', '$182,700', '$12,000,000',
       '$11,000,000', '$15,500,000', '$1,500,000', '$5,500,000',
       '$2,500,000', '$140,000', '$230,000,000', '$49,400,000',
       '$32,000,000', '$26,000,000', '$150,000', '$400,000', '$2,000,000',
       '$100,000,000', '$8,000,000', '$100,000', '$50,000,000',
       '$120,000,000', '$4,000,000', '$6,800,000', '$36,000,000',
       '$5,700,000', '$25,000,000', '$600,000', '$70,000,000',
       '$60,000,000', '$220,000', '$2,800,000', '$2,100,000',
       '$7,000,000', '$311,000,000', '$4,800,000', '$693,000,000',
       '$33,000,000'], dtype=object)

In [73]:
# Cleaning the Amount column: removing currency symbols and thousands separators in the 2019 data
df2['Amount($)'] = df2['Amount($)'].astype(str).str.replace(r'[₹$,]', '', regex=True)
# Undisclosed amounts and dash placeholders are recorded as 0
df2['Amount($)'] = df2['Amount($)'].str.replace('Undisclosed', '0', regex=False)
df2['Amount($)'] = df2['Amount($)'].str.replace('—', '0', regex=False)

In [74]:
df2['Amount($)'] = df2['Amount($)'].astype(float) # here we are converting the amount column to float data type 
type(df2['Amount($)'][0])

numpy.float64
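Recording undisclosed amounts as 0 pulls averages and other statistics downward. One possible alternative, sketched here on a small made-up series rather than on `df2`, is to let `pd.to_numeric` turn anything non-numeric into `NaN` so that those rows are skipped by aggregations (`raw`, `stripped` and `amount` are illustrative names):

```python
import pandas as pd

# Toy values in the same shape as the 'Amount($)' column
raw = pd.Series(["$6,300,000", "Undisclosed", "$150,000", "—"], name="Amount($)")

# Strip currency symbols and thousands separators
stripped = raw.str.replace(r"[₹$,]", "", regex=True)

# errors="coerce" converts unparseable entries to NaN instead of raising
amount = pd.to_numeric(stripped, errors="coerce")

print(amount.tolist())
print(amount.mean())  # NaN entries are ignored by mean()
```

Whether missing amounts should be kept as `NaN` or set to 0 depends on the question being asked, so this is a design choice rather than a correction.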

In [75]:
df2['Amount($)'] # here we are looking at the Amount column 

0       6300000.0
1     150000000.0
2      28000000.0
3      30000000.0
4       6000000.0
5             0.0
6             0.0
7       1000000.0
8      20000000.0
9     275000000.0
10            0.0
11     22000000.0
12      5000000.0
13       140500.0
14            0.0
15      5000000.0
16    540000000.0
17     15000000.0
18       182700.0
19            0.0
20      5000000.0
21     12000000.0
22     11000000.0
23            0.0
24     15500000.0
25      1500000.0
26      5500000.0
27      5000000.0
28     12000000.0
29      2500000.0
30     30000000.0
31       140000.0
32            0.0
33    230000000.0
34     20000000.0
35     49400000.0
36     32000000.0
37     26000000.0
38       150000.0
39       400000.0
40      2000000.0
41    100000000.0
42      8000000.0
43      1500000.0
44       100000.0
45            0.0
46     50000000.0
47      6000000.0
48    120000000.0
49      4000000.0
50     30000000.0
51      4000000.0
52      1500000.0
53      1000000.0
54            0.0
55        

In [76]:
df2['Amount($)'].unique() # checking the unique values after the conversion

array([6.300e+06, 1.500e+08, 2.800e+07, 3.000e+07, 6.000e+06, 0.000e+00,
       1.000e+06, 2.000e+07, 2.750e+08, 2.200e+07, 5.000e+06, 1.405e+05,
       5.400e+08, 1.500e+07, 1.827e+05, 1.200e+07, 1.100e+07, 1.550e+07,
       1.500e+06, 5.500e+06, 2.500e+06, 1.400e+05, 2.300e+08, 4.940e+07,
       3.200e+07, 2.600e+07, 1.500e+05, 4.000e+05, 2.000e+06, 1.000e+08,
       8.000e+06, 1.000e+05, 5.000e+07, 1.200e+08, 4.000e+06, 6.800e+06,
       3.600e+07, 5.700e+06, 2.500e+07, 6.000e+05, 7.000e+07, 6.000e+07,
       2.200e+05, 2.800e+06, 2.100e+06, 7.000e+06, 3.110e+08, 4.800e+06,
       6.930e+08, 3.300e+07])

### Clean Text Data

In [77]:
import re  # needed by clean_text below

# Clean the Company/Brand column
df2['Company/Brand'] = df2['Company/Brand'].str.strip()  # Remove leading and trailing spaces
df2['Company/Brand'] = df2['Company/Brand'].str.title()  # Standardize capitalization

# Clean the Sector column
df2['Sector'] = df2['Sector'].str.strip()  # Remove leading and trailing spaces
df2['Sector'] = df2['Sector'].str.title()  # Standardize capitalization

# Clean the 'What it does' column
df2['What it does'] = df2['What it does'].str.strip()  # Remove leading and trailing spaces

# Function to remove special characters
def clean_text(text):
    # Keep only letters, digits and whitespace
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return cleaned_text

# Apply the clean_text function to the 'What it does' column
df2['What it does'] = df2['What it does'].apply(clean_text)

# Display the first rows of the cleaned DataFrame
df2.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,Gurugram,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,6300000.0,
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topicbased journey animated videos quizzes infographic and mock tests to students,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,150000000.0,Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey",28000000.0,Fresh funding
3,Homelane,2014.0,Chennai,Interior Design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ Labs",30000000.0,Series D
4,Nu Genes,2004.0,Telangana,Agritech,It is a seed company engaged in production processing and marketing of seeds,Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),6000000.0,
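In the output above, "topic-based" became "topicbased" because the hyphen was removed without a replacement, and `clean_text` would raise an error on a missing value. A hedged variant that addresses both points is sketched below; `clean_text_safe` is a hypothetical helper, not the function used in this notebook:

```python
import re

import numpy as np
import pandas as pd


def clean_text_safe(text):
    """Keep letters, digits and spaces; return non-string values unchanged."""
    if not isinstance(text, str):
        return text
    # Turn hyphens into spaces so hyphenated words stay separated
    text = text.replace("-", " ")
    # Drop every remaining character that is not a letter, digit or whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Collapse repeated whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


demo = pd.Series([
    "A learning platform that provides topic-based journey, animated videos",
    np.nan,
])
cleaned = demo.apply(clean_text_safe)
print(cleaned.tolist())
```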


In [78]:
# Dropping the columns that are not important to our analysis

df2.drop(columns=['Founded','Founders','Investor'], inplace=True)

In [79]:
df2.insert(6,"Funding Year", 2019) # here we are inserting a new column to keep track of the data set after combining 

In [80]:
# below we rename the columns to ensure consistency

df2.rename(columns = {'Company/Brand':'Company',
                        'HeadQuarter':'Location',
                        'Amount($)':'Amount',
                        'What it does':'About'},
             inplace = True)

In [81]:
df2.head() # let's confirm the data set by looking at the head before we save it 

Unnamed: 0,Company,Location,Sector,About,Amount,Stage,Funding Year
0,Bombay Shaving,Gurugram,Ecommerce,Provides a range of male grooming products,6300000.0,,2019
1,Ruangguru,Mumbai,Edtech,A learning platform that provides topicbased journey animated videos quizzes infographic and mock tests to students,150000000.0,Series C,2019
2,Eduisfun,Mumbai,Edtech,It aims to make learning fun via games,28000000.0,Fresh funding,2019
3,Homelane,Chennai,Interior Design,Provides interior designing solutions,30000000.0,Series D,2019
4,Nu Genes,Telangana,Agritech,It is a seed company engaged in production processing and marketing of seeds,6000000.0,,2019


In [82]:
df2.to_csv('df_19.csv', index=False) # here we are saving the set and naming it df_19.csv
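The renaming in each yearly section exists so that the cleaned frames can later be stacked. As a rough sketch of that step, assuming one-row stand-ins for the 2018 and 2019 frames (`df18_demo`, `df19_demo` and `combined` are illustrative names), `pd.concat` aligns columns by name even when their order differs:

```python
import pandas as pd

columns = ["Company", "Sector", "Stage", "Amount", "Location", "About", "Funding Year"]

# One row per year, using the column order each cleaned frame ends up with
df18_demo = pd.DataFrame(
    [["Thecollegefever", "Brand Marketing", "Seed", 250000.0, "Bengaluru",
      "TheCollegeFever is a hub for fun fiesta and frolic of Colleges", 2018]],
    columns=columns,
)
df19_demo = pd.DataFrame(
    [["Eduisfun", "Mumbai", "Edtech", "It aims to make learning fun via games",
      28000000.0, "Fresh funding", 2019]],
    columns=["Company", "Location", "Sector", "About", "Amount", "Stage", "Funding Year"],
)

# Stack the frames, rebuild the index, and fix one column order for the result
combined = pd.concat([df18_demo, df19_demo], ignore_index=True)[columns]
print(combined)
```

A column that is spelled differently in one frame would appear as an extra, mostly empty column in the result, which is why the names are harmonised before saving.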

Now let's work on the third data set, for 2020.

# 2020 Data

In [83]:
df3.head() #showing the first five rows

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem which provides state of the art technological solutions.,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling access to agri inputs and powering efficient farm management.,"Ashish Rajan Singh, Harshit Gupta, Nishant Mahatre, Tauseef Khan","Siana Capital Management, Info Edge",340000.0,,


In [84]:
df3.shape

(1055, 10)

In [85]:
df3.columns # listing the column names of the 2020 data set

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')

In [86]:
df3.info() # getting information about the df3 DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.6+ KB


In [87]:
df3.describe(include='object').transpose() # getting general descriptive statistics of the df3 DataFrame

Unnamed: 0,count,unique,top,freq
Company_Brand,1055,905,Nykaa,6
HeadQuarter,961,77,Bangalore,317
Sector,1042,302,Fintech,80
What_it_does,1055,990,Provides online learning classes,4
Founders,1043,927,Falguni Nayar,6
Investor,1017,848,Venture Catalysts,20
Stage,591,42,Series A,96
column10,2,2,Pre-Seed,1


In [88]:
df3.describe(include='float').T # Getting general descriptive statistics for float columns

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,842.0,2015.363,4.097909,1973.0,2014.0,2016.0,2018.0,2020.0
Amount,801.0,113043000.0,2476635000.0,12700.0,1000000.0,3000000.0,11000000.0,70000000000.0


#### Handling Duplicated Data

In [89]:
# checking for duplicated values 

columns_to_check3 = ['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders', 'Investor', 'Amount', 'Stage']
for column2 in columns_to_check3:
    has_duplicates2 = df3[column2].duplicated().any()
    print(f'{column2}: {has_duplicates2}')

Company_Brand: True
Founded: True
HeadQuarter: True
Sector: True
What_it_does: True
Founders: True
Investor: True
Amount: True
Stage: True


In [90]:
# below we drop the duplicated rows

df3.drop_duplicates(subset=['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders', 'Investor', 'Amount', 'Stage'], inplace=True)

#### Handling Categorical Data

In [91]:
df3.isna().sum() # looking for missing values in df3

Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            253
Stage             462
column10         1050
dtype: int64

In [92]:
df3['HeadQuarter'].unique() #displaying the unique values found in the 'HeadQuarter' column.

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli, Tamilnadu', 'Thane', None,
       'Singapore', 'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur',
       'Jaipur, Rajastan', 'Delhi', 'Frisco, Texas, United States',
       'California', 'Dhingsara, Haryana', 'New York, United States',
       'Patna', 'San Francisco, California, United States',
       'San Francisco, United States', 'San Ramon, California',
       'Paris, Ile-de-France, France', 'Plano, Texas, United States',
       'Sydney', 'San Francisco Bay Area, Silicon Valley, West Coast',
       'Bangaldesh', 'London, England, United Kingdom',
       'Sydney, New South Wales, Australia', 'Milano, Lombardia, Italy',
       'Palmwoods, Queensland, Australia', 'France',
       'San Francisco Bay Area, West Coast, Western US',
       'Trivandrum, Kerala, India', 'Cochin', 'Samastipur, Bihar',


In [93]:
df3[df3['HeadQuarter'].isnull()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
87,Habitat,2016.0,,EdTech,"Habitat, a social learning edtech platform for civil service aspirants","Rohit Pande, Shikhar Sachan","Unitus Ventures, Whiteboard Capital",600000.0,Seed,
92,Raskik,2019.0,,Fusion beverages,All new fusion-flavour fruit juices with the goodness of coconut and taste of India's finest fruits.,"Vikas Chawla, Abhay Parnerkar, Satyajit Ram","Venture Catalysts, 9Unicorns",1000000.0,Pre-series A,
95,Pravasirojgar,2020.0,,Job portal,Initiative for blue-collar job workers,Sonu Sood,GoodWorker.,33000000.0,,
121,Kaagaz Scanner,2020.0,,Scanning app,Kaagaz Scanner is the Indian replacement to banned Cam Scanner App.,"Snehanshu Gandhi, Gaurav Shrishrimal",Pravega Ventures,575000.0,,
487,Exprs,2018.0,,Nano Distribution Network,"Nano Distribution Centres, enabling seamless connectivity of businesses & Consumers.","Srinivas Madhavam, Srikanth Rajashekhar, Rahul Mehta","Sweta Rau, Sandeep Kapoor",5660000.0,,
499,Verloop.io,2015.0,,AI,Helps automate customer service using Artificial Intelligence,Gaurav Singh,Alpha Wave incubation,5000000.0,,
500,Otipy,2020.0,,Agritech/Commerce,Connects consumers with farmers thorugh women resellers,Varun Khanna,Inflection Point Ventures,1000000.0,Pre Series A,
515,Daalchini,2017.0,,IoT,Physical and Digital vending machines startup,"Prerna Kalra, Vidya Bhushan",Artha Venture Fund,669000.0,Pre Series A,
516,Suno India,2018.0,,Media,Multilingual podcast platform,"DVL Padma Priya, Rakesh Kamal, Tarun Nirwan",Shobu Yarlagadda,,Angel Round,
519,Eden Smart Homes,2018.0,,IoT,Develops smart home automation systems,"Pranjal Kacholia, Divyansh Mathur",Inflection Point Ventures,,,


In [95]:
df3.loc[df3['Company_Brand'] == 'Habitat', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Wealth Bucket', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'EpiFi', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'XpressBees', 'HeadQuarter'] = 'Pune'
df3.loc[df3['Company_Brand'] == 'Shiksha', 'HeadQuarter'] = 'Noida'
df3.loc[df3['Company_Brand'] == 'Byju', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Zomato', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'Rentomojo', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Mamaearth', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'HaikuJAM', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Testbook', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Techbooze', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Rheo', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Klub', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'TechnifyBiz', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Aesthetic Nutrition', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Gamerji', 'HeadQuarter'] = 'Ahmedabad'
df3.loc[df3['Company_Brand'] == 'Phenom People', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Teach Us', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Invento Robotics', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Kristal AI', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Samya AI', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Skylo', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'SmartKarrot', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Park+', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'LogiNext', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'MoneyTap', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'RACEnergy', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Oye! Rickshaw', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Fleetx', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'Raskik', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'Pravasirojgar', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Kaagaz Scanner', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'Exprs', 'HeadQuarter'] = 'Madhapur'
df3.loc[df3['Company_Brand'] == 'Verloop.io', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Otipy', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'Daalchini', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Suno India', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Eden Smart Homes', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Bijnis', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Oziva', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Yulu', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Peppermint', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Jiffy ai', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Postman', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'F5', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Myelin Foundry', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'iNurture Education', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Credgencies', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'Vahak', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Illumnus', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'Juicy Chemistry', 'HeadQuarter'] = 'Coimbatore'
df3.loc[df3['Company_Brand'] == 'Shiprocket', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Phable', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Generic Aadhaar', 'HeadQuarter'] = 'Thane'
df3.loc[df3['Company_Brand'] == 'Nium', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'DailyHunt', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Pedagogy', 'HeadQuarter'] = 'Ahmedabad'
df3.loc[df3['Company_Brand'] == 'Sarva', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'NIRA', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Indusface', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Morning Context', 'HeadQuarter'] = 'Singapore'
df3.loc[df3['Company_Brand'] == 'Savvy Co op', 'HeadQuarter'] = 'New York'
df3.loc[df3['Company_Brand'] == 'BLive', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Toch', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Setu', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Rebel Foods', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Amica', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Fingerlix', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Zupee', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'DeHaat', 'HeadQuarter'] = 'Patna'
df3.loc[df3['Company_Brand'] == 'Akna Medical', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'RaRa Delivery', 'HeadQuarter'] = 'Jakarta'
df3.loc[df3['Company_Brand'] == 'Obviously AI', 'HeadQuarter'] = 'San Francisco'
df3.loc[df3['Company_Brand'] == 'CoinDCX', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'NuNu TV', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Fintso', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Smart Coin', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Shop101', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Neeman', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Invideo', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'AvalonMeta', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'SmartVizX', 'HeadQuarter'] = 'Noida'
df3.loc[df3['Company_Brand'] == 'Carbon Clean', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Onsitego', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Nova Credit', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'HempStreet', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Classplus', 'HeadQuarter'] = 'Noida'
df3.loc[df3['Company_Brand'] == 'Chaayos', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Altor', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'WorkIndia', 'HeadQuarter'] = 'Mumbai'
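
The per-company assignments above could also be driven from a single lookup table. The following is a minimal sketch on a hypothetical miniature frame (`df` and `hq_fixes` are illustrative names, and 'Unlisted Co' is a made-up company). Unlike the `.loc` assignments, `fillna` only fills rows whose headquarters is missing, so existing values are left untouched.

```python
import pandas as pd

# Hypothetical miniature of df3; only the two relevant columns are shown
df = pd.DataFrame({
    'Company_Brand': ['Yulu', 'Postman', 'Chaayos', 'Unlisted Co'],
    'HeadQuarter': [None, None, 'Delhi', None],
})

# Lookup table: company name -> headquarters (values taken from the cell above)
hq_fixes = {'Yulu': 'Bengaluru', 'Postman': 'Bengaluru', 'Chaayos': 'New Delhi'}

# Map every company to its looked-up city, then fill only the missing entries
looked_up = df['Company_Brand'].map(hq_fixes)
df['HeadQuarter'] = df['HeadQuarter'].fillna(looked_up)

print(df)
```

Companies absent from the lookup table stay missing, which makes them easy to spot in the follow-up null check.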

In [96]:
df3[df3['HeadQuarter'].isnull()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10


In [97]:
# Below we standardize the HeadQuarter column to the official city names
#df3.loc[~df3['HeadQuarter'].str.contains('New Delhi', na=False), 'HeadQuarter'] = df3['HeadQuarter'].str.replace('Delhi', 'New Delhi')
df3["HeadQuarter"] = df3["HeadQuarter"].replace(['Bangalore', 'Banglore', 'Bangalore City'], 'Bengaluru')
df3['HeadQuarter'] = df3['HeadQuarter'].replace(['Gurgaon'], 'Gurugram')

In [98]:
df3['HeadQuarter'].unique()

array(['Chennai', 'Bengaluru', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurugram', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli, Tamilnadu', 'Thane',
       'Singapore', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur',
       'Jaipur, Rajastan', 'Delhi', 'Frisco, Texas, United States',
       'California', 'Dhingsara, Haryana', 'New York, United States',
       'Patna', 'San Francisco, California, United States',
       'San Francisco, United States', 'San Ramon, California',
       'Paris, Ile-de-France, France', 'Plano, Texas, United States',
       'Sydney', 'San Francisco Bay Area, Silicon Valley, West Coast',
       'Bangaldesh', 'London, England, United Kingdom',
       'Sydney, New South Wales, Australia', 'Milano, Lombardia, Italy',
       'Palmwoods, Queensland, Australia', 'France',
       'San Francisco Bay Area, West Coast, Western US',
       'Trivandrum, Kerala, India', 'Cochin', 'Samastipur, Bihar',
       'Irvine, C
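
Several of the values above carry a state or country after the city ('Tirunelveli, Tamilnadu', 'San Francisco, California, United States'). One possible way to reduce them to the city alone is to keep only the text before the first comma. This is a sketch on a small illustrative Series, not a step taken from the notebook itself.

```python
import pandas as pd

# A few HeadQuarter values copied from the output above, plus a missing one
hq = pd.Series([
    'Tirunelveli, Tamilnadu',
    'San Francisco, California, United States',
    'Bengaluru',
    None,
])

# Keep the part before the first comma and trim stray whitespace
city_only = hq.str.split(',').str[0].str.strip()

print(city_only.tolist())
```

Region-style entries such as 'San Francisco Bay Area, Silicon Valley, West Coast' would still need a manual mapping after this step.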

In [99]:
df3["column10"].value_counts() # Calculate the frequency count of unique values in the "column10" column

column10
Pre-Seed      1
Seed Round    1
Name: count, dtype: int64

In [100]:
df3[df3['column10'].isin(['Pre-Seed','Seed Round'])] #checking if the values in the 'column10' column match either 'Pre-Seed' or 'Seed Round'.

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
611,Walrus,2019.0,Bengaluru,Fintech,It provides banking solutions for teens and young adults,"Bhagaban Behera, Sriharsha Shetty, Nakul Kelkar",Better Capital,,Pre-Seed,Pre-Seed
613,goDutch,,Mumbai,Fintech,Group Payments platform,"Aniruddh Singh, Riyaz Khan, Sagar Sheth","Matrix India, Y Combinator, Global Founders Capital, Soma Capital, VentureSouq",1700000.0,Seed Round,Seed Round


In [101]:
df3['Sector'].unique() # checking for unique values in the Sector column

array(['AgriTech', 'EdTech', 'Hygiene management', 'Escrow',
       'Networking platform', 'FinTech', 'Crowdsourcing',
       'Food & Bevarages', 'HealthTech', 'Fashion startup',
       'Food Industry', 'Food Delivery', 'Virtual auditing startup',
       'E-commerce', 'Gaming', 'Work fulfillment', 'AI startup',
       'Telecommunication', 'Logistics', 'Tech Startup', 'Sports',
       'Retail', 'Medtech', 'Tyre management', 'Cloud company',
       'Software company', 'Venture capitalist', 'Renewable player',
       'IoT startup', 'SaaS startup', 'Aero company', 'Marketing company',
       'Retail startup', 'Co-working Startup', 'Finance company',
       'Tech company', 'Solar Monitoring Company',
       'Video sharing platform', 'Gaming startup',
       'Video streaming platform', 'Consumer appliances',
       'Blockchain startup', 'Conversational AI platform', 'Real Estate',
       'SaaS platform', 'AI platform', 'Fusion beverages', 'HR Tech',
       'Job portal', 'Hospitality', 'Digit

In [102]:
df3[df3['Sector'].isnull()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
518,Text Mercato,2015.0,Bengaluru,,Cataloguing startup that serves ecommerce platforms,"Kiran Ramakrishna, Subhajit Mukherjee",1Crowd,649600.0,Series A,
569,Magicpin,2015.0,Gurugram,,"It is a local discovery, rewards, and commerce platform","Anshoo Sharma, Brij Bhushan",Samsung Venture Investment Corporation,7000000.0,Series D,
687,Leap Club,,Gurugram,,Community led professional network for women,"Ragini Das, Anand Sinha","Whiteboard Capital, FirstCheque, Artha India Ventures, Sweta Rau, Deepak Abbot, Amrish Rau, Harpreet Singh Grover",340000.0,Pre seed round,
699,Juicy Chemistry,2014.0,Coimbatore,,It focuses on organic based skincare products,Pritesh Asher,Akya Ventures,650000.0,Series A,
707,Magicpin,2015.0,Gurugram,,"It is a local discovery, rewards, and commerce platform","Anshoo Sharma, Brij Bhushan",Lightspeed Venture Partners,3879000.0,,
732,Servify,,Mumbai,,It is a technology company which serves as a platform for brands to offer end-to-end solutions to their users,Sreevathsa Prabhakar,Barkawi,250000.0,,
746,Wagonfly,2018.0,Bengaluru,,Contactless shopping and delivery experience by using radio frequency to tag retail items,Raghavendra Prasad,Investment Trust of India,500000.0,,
763,DrinkPrime,,Bengaluru,,Water purifier subscription service,"Manas Ranjan Hota, Vijender Reddy","Abhishek Goyal, Bharat Jaisinghani, FirstCheque",,Seed Round,
809,Kitchens Centre,2019.0,Delhi,,Offers solutions to cloud kitchens by providing commercial space and kitchen infrastructure to assisting with branding and other services,Lakshay Jain,AngelList India,500000.0,Seed Round,
918,Innoviti,,Bengaluru,,Digital payments solutions company,Rajeev Agrawal,FMO,5000000.0,,


In [103]:
df3.loc[df3['Company_Brand'] == 'Text Mercato', 'Sector'] = 'E-commerce Technology'
df3.loc[df3['Company_Brand'] == 'Magicpin', 'Sector'] = 'Hyperlocal Services'
df3.loc[df3['Company_Brand'] == 'Leap Club', 'Sector'] = 'Professional Networking'
df3.loc[df3['Company_Brand'] == 'Juicy Chemistry', 'Sector'] = 'Organic Skincare'
df3.loc[df3['Company_Brand'] == 'Servify', 'Sector'] = 'Technology Services'
df3.loc[df3['Company_Brand'] == 'Wagonfly', 'Sector'] = 'Retail Technology'
df3.loc[df3['Company_Brand'] == 'DrinkPrime', 'Sector'] = 'Water Technology'
df3.loc[df3['Company_Brand'] == 'Kitchens Centre', 'Sector'] = 'Food Service Infrastructure'
df3.loc[df3['Company_Brand'] == 'Innoviti', 'Sector'] = 'Fintech'
df3.loc[df3['Company_Brand'] == 'Brick&Bolt', 'Sector'] = 'Construction and Real Estate'
df3.loc[df3['Company_Brand'] == 'Toddle', 'Sector'] = 'EdTech'
df3.loc[df3['Company_Brand'] == 'HaikuJAM', 'Sector'] = 'EdTech'

In [104]:
df3[df3['Sector'].isnull()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10


In [105]:
df3['Stage'].unique() # checking the unique values in the Stage column

array([None, 'Pre-seed', 'Seed', 'Pre-series A', 'Pre-series', 'Series C',
       'Series A', 'Series B', 'Debt', 'Pre-series C', 'Pre-series B',
       'Series E', 'Bridge', 'Series D', 'Series B2', 'Series F',
       'Pre- series A', 'Edge', 'Series H', 'Pre-Series B', 'Seed A',
       'Series A-1', 'Seed Funding', 'Pre-Seed', 'Seed round',
       'Pre-seed Round', 'Seed Round & Series A', 'Pre Series A',
       'Pre seed Round', 'Angel Round', 'Pre series A1', 'Series E2',
       'Pre series A', 'Seed Round', 'Bridge Round', 'Pre seed round',
       'Pre series B', 'Pre series C', 'Seed Investment', 'Series D1',
       'Mid series', 'Series C, D', 'Seed funding'], dtype=object)
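
The list above contains many spellings of the same round ('Seed', 'Seed round', 'Seed Funding', 'Pre seed Round', 'Pre-Seed'). A minimal normalization sketch follows. The helper name `normalize_stage` and its rules are assumptions for illustration: lower-case the label, drop a trailing "round", "funding" or "investment", and unify the "pre" prefix.

```python
import re
import pandas as pd

def normalize_stage(value):
    """Collapse spelling variants of a funding stage into one label."""
    if pd.isna(value):
        return value  # leave missing stages untouched
    label = str(value).strip().lower()
    # "seed round", "seed funding", "seed investment" -> "seed"
    label = re.sub(r'\s+(round|funding|investment)$', '', label)
    # "pre seed", "pre- series a", "pre-series a" -> "pre-seed", "pre-series a"
    label = re.sub(r'^pre[\s-]+', 'pre-', label)
    return label

stages = pd.Series(['Seed Funding', 'Pre seed Round', 'Pre- series A', 'Pre-Seed', 'Series A', None])
normalized = stages.map(normalize_stage)

print(normalized.tolist())
```

Compound labels such as 'Seed Round & Series A' or 'Series C, D' are deliberately left alone, since deciding which round they belong to is a judgment call.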

In [106]:
df3["Amount"].head() # Preview the first five values of the "Amount" column

0    200000.0
1    100000.0
2         NaN
3    400000.0
4    340000.0
Name: Amount, dtype: float64

In [107]:
# checking for the em dash placeholder ('\u2014') within the columns
df3_to_check_columns = ['Company_Brand','HeadQuarter', 'Sector', 'What_it_does','Stage','Amount']
for col in df3_to_check_columns:
    dash_symbols = df3[col].astype(str).str.contains('\u2014').any()
    print(f"{col}: {dash_symbols}")

Company_Brand: False
HeadQuarter: False
Sector: False
What_it_does: False
Stage: False
Amount: False


In [108]:
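# checking for '$' symbol within the columns
# NOTE: str.contains() treats '$' as a regex end-of-string anchor by default, so this check
# returns True for every column; pass regex=False to test for a literal dollar sign
df3_to_check_columns = ['Company_Brand','HeadQuarter', 'Sector', 'What_it_does','Stage','Amount']

for col in df3_to_check_columns:
    dollar_symbols = df3[col].astype(str).str.contains('$').any()
    print(f"{col}: {dollar_symbols}")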

Company_Brand: True
HeadQuarter: True
Sector: True
What_it_does: True
Stage: True
Amount: True
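
Every column reports `True` here because `str.contains` interprets its pattern as a regular expression by default, and `'$'` is the end-of-string anchor, which matches any string. The sketch below contrasts the two behaviours on a small illustrative Series (the sample values are made up).

```python
import pandas as pd

samples = pd.Series(['Fintech', 'raised $5M', None])

# Default regex=True: '$' anchors the end of the string, so everything matches
regex_hits = samples.astype(str).str.contains('$')

# regex=False: '$' is treated as a literal dollar sign
literal_hits = samples.astype(str).str.contains('$', regex=False)

print(regex_hits.tolist())
print(literal_hits.tolist())
```

Escaping the character (`r'\$'`) is an equivalent alternative to `regex=False`.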


### Clean Text Data

In [109]:
import re

# Clean Company Name column
df3['Company_Brand'] = df3['Company_Brand'].str.strip()  # Remove leading and trailing spaces
df3['Company_Brand'] = df3['Company_Brand'].str.title()  # Standardize capitalization

# Clean Sector column
df3['Sector'] = df3['Sector'].str.strip()  # Remove leading and trailing spaces
df3['Sector'] = df3['Sector'].str.title()  # Standardize capitalization


# Clean About Company column
df3['What_it_does'] = df3['What_it_does'].str.strip()  # Remove leading and trailing spaces

# Function to strip special characters from a text value
def clean_text(text):
    # Remove every character that is not a letter, digit, or whitespace
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return cleaned_text

# Apply the clean_text function to the About Company column
df3['What_it_does'] = df3['What_it_does'].apply(clean_text)

# Preview the cleaned DataFrame
df3.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,Agritech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bengaluru,Edtech,An academyguardianscholar centric ecosystem which provides state of the art technological solutions,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,Padcare Labs,2018.0,Pune,Hygiene Management,Converting biohazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,Ncome,2020.0,New Delhi,Escrow,Escrowasaservice platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,Agritech,Gramophone is an AgTech platform enabling access to agri inputs and powering efficient farm management,"Ashish Rajan Singh, Harshit Gupta, Nishant Mahatre, Tauseef Khan","Siana Capital Management, Info Edge",340000.0,,
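
In the preview above, words that were presumably joined by punctuation have been fused together ("Escrowasaservice", "academyguardianscholar"), because `clean_text` deletes the separators outright. A possible variant replaces each special character with a space and then collapses repeated whitespace. The function name `clean_text_keep_word_breaks` is hypothetical, and the hyphenated input is an assumed reconstruction of the original text.

```python
import re

def clean_text_keep_word_breaks(text):
    """Replace special characters with spaces so joined words stay separated."""
    if not isinstance(text, str):
        return text  # pass missing or non-text values through unchanged
    spaced = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # Collapse the runs of whitespace created by the substitution
    return re.sub(r"\s+", " ", spaced).strip()

print(clean_text_keep_word_breaks("Escrow-as-a-service platform"))
```

The trade-off is that abbreviations such as "e.g." are split into separate tokens, so which variant is preferable depends on how the About column is used later.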


In [110]:
df3 = df3.drop(['column10','Founded','Founders','Investor'], axis=1) #dropping specific columns from the DataFrame 

In [111]:
df3['Funding Year'] = 2020 # Assign 2020 to the 'Funding Year' column

In [112]:
new_column_names = {'Company_Brand': 'Company', 'What_it_does': 'About', 'HeadQuarter': 'Location'} # Renaming columns
df3 = df3.rename(columns=new_column_names)

In [113]:
df3.head() # checking the head of the data to confirm before saving the data 

Unnamed: 0,Company,Location,Sector,About,Amount,Stage,Funding Year
0,Aqgromalin,Chennai,Agritech,Cultivating Ideas for Profit,200000.0,,2020
1,Krayonnz,Bengaluru,Edtech,An academyguardianscholar centric ecosystem which provides state of the art technological solutions,100000.0,Pre-seed,2020
2,Padcare Labs,Pune,Hygiene Management,Converting biohazardous waste to harmless waste,,Pre-seed,2020
3,Ncome,New Delhi,Escrow,Escrowasaservice platform,400000.0,,2020
4,Gramophone,Indore,Agritech,Gramophone is an AgTech platform enabling access to agri inputs and powering efficient farm management,340000.0,,2020


In [114]:
df3.isnull().sum()

Company           0
Location          0
Sector            0
About             0
Amount          253
Stage           462
Funding Year      0
dtype: int64

In [115]:
# saving the clean data set

df3.to_csv('df_2020.csv', index=False)

# 2021 Data

In [116]:
df4.head() #showing the first five rows

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,"Unbox Robotics builds on-demand AI-driven warehouse robotics solutions, which can be deployed using limited foot-print, time, and capital.","Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh, Ronnie Screwvala","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school transformation system that assures excellent learning for every child.,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marketplace for packaging products.,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, empowering them with financial literacy and ease of secured financial transactions.",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [117]:
df4.shape #understanding the size of your DataFrame

(1209, 9)

In [118]:
df4.columns #retrieving the column names of the DataFrame

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')

In [119]:
df4.info() #providing a summary of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


In [120]:
df4.describe(include='object').T #providing descriptive statistics for columns of object data type in the DataFrame

Unnamed: 0,count,unique,top,freq
Company_Brand,1209,1033,BharatPe,8
HeadQuarter,1208,70,Bangalore,426
Sector,1209,254,FinTech,122
What_it_does,1209,1143,BharatPe develops a QR code-based payment app for offline retailers and businesses.,4
Founders,1205,1095,"Ashneer Grover, Shashvat Nakrani",7
Investor,1147,937,Inflection Point Ventures,24
Amount,1206,278,$Undisclosed,73
Stage,781,31,Seed,246


In [121]:
df4.isnull().sum() # looking for missing values in dataFrame

Company_Brand      0
Founded            1
HeadQuarter        1
Sector             0
What_it_does       0
Founders           4
Investor          62
Amount             3
Stage            428
dtype: int64

#### Handling Duplicated Data

In [122]:
#checking for duplicate values in each column of the DataFrame df4
columns_to_check4 = ['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders', 'Investor', 'Amount', 'Stage']

for column4 in columns_to_check4:
    has_duplicates4 = df4[column4].duplicated().any()
    print(f'{column4}: {has_duplicates4}')

Company_Brand: True
Founded: True
HeadQuarter: True
Sector: True
What_it_does: True
Founders: True
Investor: True
Amount: True
Stage: True


In [123]:
#removing any rows that have the same values in all the specified columns.
df4.drop_duplicates(subset=['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders', 'Investor', 'Amount', 'Stage'], inplace=True)
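
The per-column check above returns `True` for any column in which a value occurs more than once, which is expected for fields such as Sector or Stage. Counting fully duplicated rows gives a more direct picture of what `drop_duplicates` removes. This is a sketch on a hypothetical miniature frame with made-up amounts.

```python
import pandas as pd

# Hypothetical miniature of df4: the first two rows are exact duplicates
df = pd.DataFrame({
    'Company_Brand': ['BharatPe', 'BharatPe', 'BharatPe', 'upGrad'],
    'Amount': ['$1', '$1', '$2', '$3'],
})

# A repeated company name alone does not make a row a duplicate
name_repeats = bool(df['Company_Brand'].duplicated().any())

# Rows that repeat an earlier row in every column
full_row_dupes = int(df.duplicated().sum())

deduped = df.drop_duplicates().reset_index(drop=True)
print(name_repeats, full_row_dupes, len(deduped))
```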

#### Handling Categorical Data

In [124]:
df4['HeadQuarter'].unique() # here we are looking at the unique values in the column 

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', None,
       'Small Towns, Andhra Pradesh', 'Goa', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Gujarat', 'Pune', 'Thane', 'Computer Games',
       'Cochin', 'Noida', 'Chandigarh', 'Gurgaon', 'Vadodara',
       'Food & Beverages', 'Pharmaceuticals\t#REF!', 'Gurugram\t#REF!',
       'Kolkata', 'Ahmedabad', 'Mohali', 'Haryana', 'Indore', 'Powai',
       'Ghaziabad', 'Nagpur', 'West Bengal', 'Patna', 'Samsitpur',
       'Lucknow', 'Telangana', 'Silvassa', 'Thiruvananthapuram',
       'Faridabad', 'Roorkee', 'Ambernath', 'Panchkula', 'Surat',
       'Coimbatore', 'Andheri', 'Mangalore', 'Telugana', 'Bhubaneswar',
       'Kottayam', 'Beijing', 'Panaji', 'Satara', 'Orissia', 'Jodhpur',
       'New York', 'Santra', 'Mountain View, CA', 'Trivandrum',
       'Jharkhand', 'Kanpur', 'Bhilwara', 'Guwahati',
       'Online Media\t#REF!', 'Kochi', 'London',
       'Information Technol

In [125]:
df4[df4['HeadQuarter'].isnull()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
29,Vidyakul,2018.0,,EdTech,Vidyakul is an vernacular e-learning platform that helps state board students to learn academics via pre-recorded and live lectures,"Raman Garg, Tarun Saini","JITO Angel Network, SOSV","$500,000",Seed


In [126]:
df4['HeadQuarter'] = df4['HeadQuarter'].fillna('Gurugram')
df4[df4['Company_Brand'] == 'Vidyakul']

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
29,Vidyakul,2018.0,Gurugram,EdTech,Vidyakul is an vernacular e-learning platform that helps state board students to learn academics via pre-recorded and live lectures,"Raman Garg, Tarun Saini","JITO Angel Network, SOSV","$500,000",Seed
1184,Vidyakul,2017.0,Gurugram,EdTech,Vidyakul is a group of academic experts.,"Tarun Saini, Gaurav Singhvi",We Founder Circle,$500000,Bridge


In [137]:
df4[df4['HeadQuarter'].isna()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage


In [127]:
# using a filter to get all the mismatched values in the HeadQuarter column

df4[df4['HeadQuarter'].isin(['Online Media\t#REF!', 'Pharmaceuticals\t#REF!','Computer Games','Information Technology & Services','Food & Beverages'])]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
98,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia games,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000
241,MasterChow,2020.0,Food & Beverages,Hauz Khas,A ready-to-cook Asian cuisine brand,"Vidur Kataria, Sidhanth Madan",WEH Ventures,$461000,Seed
242,Fullife Healthcare,2009.0,Pharmaceuticals\t#REF!,Primary Business is Development and Manufacturing of Novel Healthcare Products in Effervescent forms using imported propriety ingredients.,Varun Khanna,Morgan Stanley Private Equity Asia,$22000000,Series C,
1100,Sochcast,2020.0,Online Media\t#REF!,Sochcast is an Audio experiences company that give the listener and creators an Immersive Audio experience,"CA Harvinderjit Singh Bhatia, Garima Surana, Anil Srivatsa","Vinners, Raj Nayak, Amritaanshu Agrawal",$Undisclosed,,
1176,Peak,2014.0,Information Technology & Services,"Manchester, Greater Manchester",Peak helps the world's smartest companies put the power of AI at the center of all commercial decision making with Decision Intelligence,Atul Sharma,SoftBank Vision Fund 2,$75000000,Series C


In [128]:
# assigning the correct values to the "HeadQuarter", "Amount" and "Stage" columns for FanPlay

df4.loc[df4["Company_Brand"] == "FanPlay", ["HeadQuarter", "Amount", "Stage"]] = ['Bangalore', "$1200000",np.nan]
df4.loc[df4["Company_Brand"] == "FanPlay"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
98,FanPlay,2020.0,Bangalore,Computer Games,A real money game app specializing in trivia games,YC W21,"Pritesh Kumar, Bharat Gupta",$1200000,


In [129]:
# assigning the correct values to the "HeadQuarter" and "Sector" columns for MasterChow

df4.loc[df4["Company_Brand"] == "MasterChow", ["HeadQuarter", "Sector"]] = ["Hauz Khas", "Food & Beverages"]
df4.loc[df4["Company_Brand"] == "MasterChow"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
241,MasterChow,2020.0,Hauz Khas,Food & Beverages,A ready-to-cook Asian cuisine brand,"Vidur Kataria, Sidhanth Madan",WEH Ventures,$461000,Seed


In [130]:
# The Fullife Healthcare row is shifted one column to the left; here we move each value back into its correct column

df4.loc[df4["Company_Brand"] == "Fullife Healthcare", ["HeadQuarter", "Sector", "What_it_does", "Founders", "Investor", "Amount", "Stage"]] = ['Mumbai', "Pharmaceuticals", "Primary Business is Development and Manufacturing of Novel Healthcare Products in Effervescent forms using imported propriety ingredients.", "Varun Khanna", "Morgan Stanley Private Equity Asia", "$22000000", "Series C"]
df4.loc[df4["Company_Brand"] == "Fullife Healthcare"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
242,Fullife Healthcare,2009.0,Mumbai,Pharmaceuticals,Primary Business is Development and Manufacturing of Novel Healthcare Products in Effervescent forms using imported propriety ingredients.,Varun Khanna,Morgan Stanley Private Equity Asia,$22000000,Series C


In [131]:
# correcting the HeadQuarter and Sector values for the rows whose Company_Brand is 'Peak'

df4.loc[df4["Company_Brand"] == "Peak", ["HeadQuarter", "Sector"]] = ["Manchester", "Information Technology & Services"]
df4.loc[df4["Company_Brand"] == "Peak"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
1176,Peak,2014.0,Manchester,Information Technology & Services,Peak helps the world's smartest companies put the power of AI at the center of all commercial decision making with Decision Intelligence,Atul Sharma,SoftBank Vision Fund 2,$75000000,Series C


In [132]:
# repositioning the shifted values for the rows whose Company_Brand is 'Sochcast'

df4.loc[df4["Company_Brand"] == "Sochcast", ["HeadQuarter", "Sector",'What_it_does','Founders','Investor',"Amount"]] = ['Bengaluru', 'Online Media','Sochcast is an Audio experiences company that give the listener and creators an Immersive Audio experience','CA Harvinderjit Singh Bhatia, Garima Surana','Vinners, Raj Nayak, Amritaanshu Agrawal',np.nan]
df4.loc[df4["Company_Brand"] == "Sochcast"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
1100,Sochcast,2020.0,Bengaluru,Online Media,Sochcast is an Audio experiences company that give the listener and creators an Immersive Audio experience,"CA Harvinderjit Singh Bhatia, Garima Surana","Vinners, Raj Nayak, Amritaanshu Agrawal",,


In [134]:
# From observation, some cities appear under both their official and unofficial names.
# These need to be unified for correct analysis, e.g. a city that is listed under more than one name.

df4['HeadQuarter'] = df4['HeadQuarter'].replace(['Bangalore'], 'Bengaluru')
df4['HeadQuarter'] = df4['HeadQuarter'].replace('Gurugram\t#REF!', 'Gurugram', regex=True)
df4.loc[~df4['HeadQuarter'].str.contains('New Delhi', na=False), 'HeadQuarter'] = df4['HeadQuarter'].str.replace('Delhi', 'New Delhi')
df4['HeadQuarter'] = df4['HeadQuarter'].replace(['Gurgaon'], 'Gurugram')

In [135]:
df4['HeadQuarter'].unique()

array(['Bengaluru', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', 'Small Towns, Andhra Pradesh',
       'Goa', 'Rajsamand', 'Ranchi', 'Faridabad, Haryana', 'Gujarat',
       'Pune', 'Thane', 'Cochin', 'Noida', 'Chandigarh', 'Vadodara',
       'Hauz Khas', 'Kolkata', 'Ahmedabad', 'Mohali', 'Haryana', 'Indore',
       'Powai', 'Ghaziabad', 'Nagpur', 'West Bengal', 'Patna',
       'Samsitpur', 'Lucknow', 'Telangana', 'Silvassa',
       'Thiruvananthapuram', 'Faridabad', 'Roorkee', 'Ambernath',
       'Panchkula', 'Surat', 'Coimbatore', 'Andheri', 'Mangalore',
       'Telugana', 'Bhubaneswar', 'Kottayam', 'Beijing', 'Panaji',
       'Satara', 'Orissia', 'Jodhpur', 'New York', 'Santra',
       'Mountain View, CA', 'Trivandrum', 'Jharkhand', 'Kanpur',
       'Bhilwara', 'Guwahati', 'Kochi', 'London', 'Manchester',
       'The Nilgiris', 'Gandhinagar'], dtype=object)
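
The list above still contains spelling variants and former names of the same place ('Ahmadabad' and 'Ahmedabad', 'Cochin' and 'Kochi', 'Trivandrum' and 'Thiruvananthapuram', 'Telugana' and 'Telangana'). A possible follow-up is a single alias dictionary passed to `replace`. The dictionary name `hq_aliases` and the choice of canonical spellings are assumptions made for this sketch.

```python
import pandas as pd

# Alias -> canonical name; extend as further variants are found
hq_aliases = {
    'Ahmadabad': 'Ahmedabad',
    'Cochin': 'Kochi',
    'Trivandrum': 'Thiruvananthapuram',
    'Telugana': 'Telangana',
}

hq = pd.Series(['Ahmadabad', 'Cochin', 'Trivandrum', 'Telugana', 'Mumbai'])

# Values not present in the dictionary are left unchanged
hq_clean = hq.replace(hq_aliases)

print(hq_clean.tolist())
```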

In [138]:
df4['Sector'].unique() # here we are looking at the unique value of the Sector column 

array(['AI startup', 'EdTech', 'B2B E-commerce', 'FinTech',
       'Home services', 'HealthTech', 'Tech Startup', 'E-commerce',
       'B2B service', 'Helathcare', 'Renewable Energy', 'Electronics',
       'IT startup', 'Food & Beverages', 'Aeorspace', 'Deep Tech',
       'Dating', 'Gaming', 'Robotics', 'Retail', 'Food', 'Oil and Energy',
       'AgriTech', 'Telecommuncation', 'Milk startup', 'AI Chatbot', 'IT',
       'Logistics', 'Hospitality', 'Fashion', 'Marketing',
       'Transportation', 'LegalTech', 'Food delivery', 'Automotive',
       'SaaS startup', 'Fantasy sports', 'Video communication',
       'Social Media', 'Skill development', 'Rental', 'Recruitment',
       'HealthCare', 'Sports', 'Computer Games', 'Consumer Goods',
       'Information Technology', 'Apparel & Fashion',
       'Logistics & Supply Chain', 'Healthtech', 'Healthcare',
       'SportsTech', 'HRTech', 'Wine & Spirits',
       'Mechanical & Industrial Engineering', 'Spiritual',
       'Financial Services', 'I

In [139]:
df4['Sector'].isna().sum()

0

In [140]:
# here we are updating the row for 'MoEVing'

df4.loc[df4["Company_Brand"] == "MoEVing", ["Sector",'What_it_does','Founders','Investor','Amount','Stage']] = [
    'Electric Mobility',"MoEVing is India's only Electric Mobility focused Technology Platform with a vision to accelerate EV adoption in India.",
    'Vikash Mishra, Mragank Jain','Anshuman Maheshwary, Dr Srihari Raju Kalidindi','$5000000','Seed']
df4.loc[df4["Company_Brand"] == "MoEVing"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
257,MoEVing,2021.0,Gurugram,Electric Mobility,MoEVing is India's only Electric Mobility focused Technology Platform with a vision to accelerate EV adoption in India.,"Vikash Mishra, Mragank Jain","Anshuman Maheshwary, Dr Srihari Raju Kalidindi",$5000000,Seed


In [141]:
df4["Stage"].unique() # getting the unique values in this column 

array(['Pre-series A', None, 'Series D', 'Series C', 'Seed', 'Series B',
       'Series E', 'Pre-seed', 'Series A', 'Pre-series B', 'Debt', nan,
       'Bridge', 'Seed+', 'Series F2', 'Series A+', 'Series G',
       'Series F', 'Series H', 'Series B3', 'PE', 'Series F1',
       'Pre-series A1', '$300000', 'Early seed', 'Series D1', '$6000000',
       '$1000000', 'Seies A', 'Pre-series', 'Series A2', 'Series I'],
      dtype=object)

In [142]:
df4[df4["Stage"]=='$6000000'] # getting the row where an amount landed in the Stage column
# moving the amount into the Amount column and clearing the Stage value

df4.loc[df4["Company_Brand"] == "MYRE Capital", ["Amount", "Stage"]] = ["$6000000",np.nan]
df4.loc[df4["Company_Brand"] == "MYRE Capital"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
674,MYRE Capital,2020.0,Mumbai,Commercial Real Estate,Democratising Real Estate Ownership,Own rent yielding commercial properties,Aryaman Vir,$6000000,


In [143]:
df4[df4["Stage"]=='$300000'] # getting the rows where this amount landed in the Stage column

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
538,Little Leap,2020.0,New Delhi,EdTech,Soft Skills that make Smart Leaders,Holistic Development Programs for children in age range 5-15,Vishal Gupta,ah! Ventures,$300000
551,BHyve,2020.0,Mumbai,Human Resources,A Future of Work Platform for diffusing Employee Tacit Knowledge and enabling Peer Learning Networks,Backed by 100x.VC,"Omkar Pandharkame, Ketaki Ogale","ITO Angel Network, LetsVenture",$300000


In [144]:
# repositioning the values to their respective columns

df4.loc[df4["Company_Brand"] == "Little Leap", ["Amount", "Stage"]] = ["$300000",np.nan]
df4.loc[df4["Company_Brand"] == "Little Leap"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
538,Little Leap,2020.0,New Delhi,EdTech,Soft Skills that make Smart Leaders,Holistic Development Programs for children in age range 5-15,Vishal Gupta,$300000,


In [145]:
# repositioning the values to their respective columns
df4.loc[df4["Company_Brand"] == "BHyve", ["Amount", "Stage"]] = ["$300000",np.nan]
df4.loc[df4["Company_Brand"] == "BHyve"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
551,BHyve,2020.0,Mumbai,Human Resources,A Future of Work Platform for diffusing Employee Tacit Knowledge and enabling Peer Learning Networks,Backed by 100x.VC,"Omkar Pandharkame, Ketaki Ogale",$300000,


In [146]:
df4[df4["Stage"]=='$1000000'] # getting the row where this amount landed in the Stage column

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,"India's fastest growing Pedagogy company, serving to school as an academic growth partner and provide 360° solutions to schools on Academic Strategies",Pedagogy,Sushil Agarwal,"JITO Angel Network, LetsVenture",$1000000


In [147]:
# repositioning the values to their respective columns
df4.loc[df4["Company_Brand"] == "Saarthi Pedagogy", ["Amount", "Stage"]] = ["$1000000",np.nan]
df4.loc[df4["Company_Brand"] == "Saarthi Pedagogy"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,"India's fastest growing Pedagogy company, serving to school as an academic growth partner and provide 360° solutions to schools on Academic Strategies",Pedagogy,Sushil Agarwal,$1000000,
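
The row-by-row repairs above all follow the same pattern: the Stage cell holds a dollar amount and the Amount cell holds the investor. A mask-based sketch of that pattern is shown below on a hypothetical miniature frame (rows abbreviated from the outputs above). It also moves the investor name back into the Investor column, which the cells above do not do, so treat that step as an assumption about the intended layout.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df4 with two shifted rows and one correct row
df = pd.DataFrame({
    'Company_Brand': ['Little Leap', 'BHyve', 'Bizongo'],
    'Investor': ['Vishal Gupta', 'Omkar Pandharkame, Ketaki Ogale', 'CDC Group, IDG Capital'],
    'Amount': ['ah! Ventures', 'ITO Angel Network, LetsVenture', '$51,000,000'],
    'Stage': ['$300000', '$300000', 'Series C'],
})

# Rows whose Stage cell is really an amount such as '$300000'
shifted = df['Stage'].astype(str).str.fullmatch(r'\$[\d,]+')

# Shift the affected cells one column to the right, from right to left
df.loc[shifted, 'Investor'] = df.loc[shifted, 'Amount']
df.loc[shifted, 'Amount'] = df.loc[shifted, 'Stage']
df.loc[shifted, 'Stage'] = np.nan

print(df)
```

Selecting on the malformed value rather than on the company name also avoids overwriting other, correctly formatted rows of a company that appears more than once.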


In [148]:
df4["Amount"].unique() # getting unique values 

array(['$1,200,000', '$120,000,000', '$30,000,000', '$51,000,000',
       '$2,000,000', '$188,000,000', '$200,000', 'Undisclosed',
       '$1,000,000', '$3,000,000', '$100,000', '$700,000', '$9,000,000',
       '$40,000,000', '$49,000,000', '$400,000', '$300,000',
       '$25,000,000', '$160,000,000', '$150,000', '$1,800,000',
       '$5,000,000', '$850,000', '$53,000,000', '$500,000', '$1,100,000',
       '$6,000,000', '$800,000', '$10,000,000', '$21,000,000',
       '$7,500,000', '$26,000,000', '$7,400,000', '$1,500,000',
       '$600,000', '$800,000,000', '$17,000,000', '$3,500,000',
       '$15,000,000', '$215,000,000', '$2,500,000', '$350,000,000',
       '$5,500,000', '$83,000,000', '$110,000,000', '$500,000,000',
       '$65,000,000', '$150,000,000,000', '$300,000,000', '$2,200,000',
       '$35,000,000', '$140,000,000', '$4,000,000', '$13,000,000', None,
       '$Undisclosed', '$2000000', '$800000', '$6000000', '$2500000',
       '$9500000', '$13000000', '$5000000', '$8000000',
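
The Amount values above mix comma-grouped figures, plain digits, 'Undisclosed' markers and missing entries. One possible way to turn them into numbers is to strip the currency symbols and let `pd.to_numeric` coerce anything non-numeric to `NaN`. This sketch keeps undisclosed amounts as missing rather than 0, which is an assumption about how they should be treated in later statistics.

```python
import pandas as pd

raw = pd.Series(['$1,200,000', '$Undisclosed', 'Undisclosed', None, '$500000'])

# Drop currency symbols and thousands separators; missing values stay missing
digits = raw.str.replace(r'[₹$,]', '', regex=True)

# Anything that is not a number (e.g. 'Undisclosed') becomes NaN
amount = pd.to_numeric(digits, errors='coerce')

print(amount.tolist())
```

Keeping undisclosed rounds as `NaN` means aggregates such as the mean or median are computed over disclosed amounts only.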

In [149]:
# checking if these specific values are present in the amount column 

df4[df4['Amount'].isin([ 'Seed','Pre-series A'])]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
545,AdmitKard,2016.0,Noida,EdTech,A tech solution for end to end career advisory to students looking to study abroad.,"Vamsi Krishna, Pulkit Jain, Gaurav Munjal\t#REF!",$1000000,Pre-series A,
1148,Godamwale,2016.0,Mumbai,Logistics & Supply Chain,Godamwale is tech enabled integrated logistics company providing end to end supply chain solutions.,"Basant Kumar, Vivek Tiwari, Ranbir Nandan",1000000\t#REF!,Seed,


In [150]:
# repositioning the Godamwale values to their respective columns

df4.loc[df4["Company_Brand"] == "Godamwale", ["Amount", "Stage", "Investor"]] = ["$1000000", "Seed",np.nan]
df4.loc[df4["Company_Brand"] == "Godamwale"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
1148,Godamwale,2016.0,Mumbai,Logistics & Supply Chain,Godamwale is tech enabled integrated logistics company providing end to end supply chain solutions.,"Basant Kumar, Vivek Tiwari, Ranbir Nandan",,$1000000,Seed


In [151]:
df4.loc[df4["Company_Brand"] == "AdmitKard", ["Amount", "Stage", "Investor"]] = [
    "$1000000", "Pre-series A",np.nan]
df4.loc[df4["Company_Brand"] == "AdmitKard"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
545,AdmitKard,2016.0,Noida,EdTech,A tech solution for end to end career advisory to students looking to study abroad.,"Vamsi Krishna, Pulkit Jain, Gaurav Munjal\t#REF!",,$1000000,Pre-series A


In [152]:
# Cleaning the Amount column in df_2021: remove currency symbols, thousands separators and dashes
df4['Amount'] = df4['Amount'].astype(str).str.replace(r'[₹$,—]', '', regex=True)
# Undisclosed or missing amounts are set to 0
df4['Amount'] = df4['Amount'].str.replace('Undisclosed|undisclosed|None', '0', regex=True)
# Strings left empty after the first step (e.g. a lone dash) are also set to 0
df4['Amount'] = df4['Amount'].str.replace(r'^\s*$', '0', regex=True)

In [153]:
df4['Amount'] = df4['Amount'].astype(float)
type(df4['Amount'][0])

numpy.float64
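
The amount-cleaning steps above can also be sketched as one reusable function. This is a minimal sketch, not the notebook's method: the name `parse_amount` and the sample values are illustrative, and it deliberately differs from the cells above by leaving undisclosed amounts as `NaN` instead of 0, so that they do not pull totals and averages toward zero.

```python
import pandas as pd


def parse_amount(raw: pd.Series) -> pd.Series:
    """Convert strings such as '$1,200,000' to floats; anything non-numeric becomes NaN."""
    # Drop currency symbols, thousands separators, dashes and whitespace
    cleaned = raw.astype(str).str.replace(r"[₹$,—\s]", "", regex=True)
    # errors="coerce" turns 'Undisclosed', 'None' and empty strings into NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical sample covering the formats seen in the Amount column
sample = pd.Series(["$1,200,000", "Undisclosed", None, "$2000000", "—"])
parsed = parse_amount(sample)
print(parsed)
```

Whether undisclosed amounts should be 0 or missing is an analysis decision; keeping them as `NaN` lets the missing-value handling later in the notebook treat them explicitly.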

In [154]:
df4['Amount'].unique()

array([1.20e+06, 1.20e+08, 3.00e+07, 5.10e+07, 2.00e+06, 1.88e+08,
       2.00e+05, 0.00e+00, 1.00e+06, 3.00e+06, 1.00e+05, 7.00e+05,
       9.00e+06, 4.00e+07, 4.90e+07, 4.00e+05, 3.00e+05, 2.50e+07,
       1.60e+08, 1.50e+05, 1.80e+06, 5.00e+06, 8.50e+05, 5.30e+07,
       5.00e+05, 1.10e+06, 6.00e+06, 8.00e+05, 1.00e+07, 2.10e+07,
       7.50e+06, 2.60e+07, 7.40e+06, 1.50e+06, 6.00e+05, 8.00e+08,
       1.70e+07, 3.50e+06, 1.50e+07, 2.15e+08, 2.50e+06, 3.50e+08,
       5.50e+06, 8.30e+07, 1.10e+08, 5.00e+08, 6.50e+07, 1.50e+11,
       3.00e+08, 2.20e+06, 3.50e+07, 1.40e+08, 4.00e+06, 1.30e+07,
       9.50e+06, 8.00e+06, 1.20e+07, 1.70e+06, 1.50e+08, 1.00e+08,
       2.25e+08, 6.70e+06, 1.30e+06, 2.00e+07, 2.50e+05, 5.20e+07,
       3.80e+06, 1.75e+07, 4.20e+07, 2.30e+06, 7.00e+06, 4.50e+08,
       2.80e+07, 8.50e+06, 3.70e+07, 3.70e+08, 1.60e+07, 4.40e+07,
       7.70e+05, 1.25e+08, 5.00e+07, 4.90e+06, 1.45e+08, 2.20e+07,
       7.00e+07, 6.60e+06, 3.20e+07, 2.40e+07, 7.25e+05, 4.61e

### Clean Text Data

In [155]:
# Clean Company Name column
df4['Company_Brand'] = df4['Company_Brand'].str.strip()  # Remove leading and trailing spaces
df4['Company_Brand'] = df4['Company_Brand'].str.title()  # Standardize capitalization

# Clean Sector column
df4['Sector'] = df4['Sector'].str.strip()  # Remove leading and trailing spaces
df4['Sector'] = df4['Sector'].str.title()  # Standardize capitalization


# Clean About Company column
df4['What_it_does'] = df4['What_it_does'].str.strip()  # Remove leading and trailing spaces

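import re

# Function to strip special characters (anything other than letters, digits and whitespace)
def clean_text(text):
    # Leave missing or non-string values untouched so re.sub does not raise a TypeError
    if not isinstance(text, str):
        return text
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)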

# Apply the clean_text function to the About Company column
df4['What_it_does'] = df4['What_it_does'].apply(clean_text)

# Print the cleaned DataFrame
df4.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bengaluru,Ai Startup,Unbox Robotics builds ondemand AIdriven warehouse robotics solutions which can be deployed using limited footprint time and capital,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000.0,Pre-series A
1,Upgrad,2015.0,Mumbai,Edtech,UpGrad is an online higher education platform,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh, Ronnie Screwvala","Unilazer Ventures, IIFL Asset Management",120000000.0,
2,Lead School,2012.0,Mumbai,Edtech,LEAD School offers technology based school transformation system that assures excellent learning for every child,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",30000000.0,Series D
3,Bizongo,2015.0,Mumbai,B2B E-Commerce,Bizongo is a businesstobusiness online marketplace for packaging products,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",51000000.0,Series C
4,Fypmoney,2021.0,Gurugram,Fintech,FypMoney is Digital NEO Bank for Teenagers empowering them with financial literacy and ease of secured financial transactions,Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",2000000.0,Seed


In [156]:
# Dropping the columns that are not important to our analysis

df4.drop(columns=['Founders','Investor','Founded'], inplace=True)

In [157]:
df4.insert(6, "Funding Year", 2021)  # adding a 'Funding Year' column so each row keeps its source year after the datasets are combined

In [158]:
df4.rename(columns = {'Company_Brand':'Company',
                        'HeadQuarter':'Location',
                        'What_it_does':'About'},
             inplace = True)

In [159]:
df4.head() # looking at the head to confirm before saving the data

Unnamed: 0,Company,Location,Sector,About,Amount,Stage,Funding Year
0,Unbox Robotics,Bengaluru,Ai Startup,Unbox Robotics builds ondemand AIdriven warehouse robotics solutions which can be deployed using limited footprint time and capital,1200000.0,Pre-series A,2021
1,Upgrad,Mumbai,Edtech,UpGrad is an online higher education platform,120000000.0,,2021
2,Lead School,Mumbai,Edtech,LEAD School offers technology based school transformation system that assures excellent learning for every child,30000000.0,Series D,2021
3,Bizongo,Mumbai,B2B E-Commerce,Bizongo is a businesstobusiness online marketplace for packaging products,51000000.0,Series C,2021
4,Fypmoney,Gurugram,Fintech,FypMoney is Digital NEO Bank for Teenagers empowering them with financial literacy and ease of secured financial transactions,2000000.0,Seed,2021


In [160]:
df4.to_csv('df_2021.csv', index=False)

## Concatenate the Cleaned Datasets

In [None]:
#Load the cleaned Datasets
df = pd.read_csv("df18.csv")
df2 = pd.read_csv("df_19.csv")
df3 = pd.read_csv("df_2020.csv")
df4 = pd.read_csv("df_2021.csv")

In [161]:
# Concatenate the data frames
clean_done = pd.concat([df, df2, df3, df4])

In [162]:
# Resetting the index of the concatenated data frame
clean_combined = clean_done.reset_index(drop=True)
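
The concatenate-then-reset steps above can be combined into one call. A minimal sketch on two small stand-in frames (`df_a` and `df_b` are hypothetical, not the notebook's files):

```python
import pandas as pd

# Hypothetical stand-ins for two of the cleaned yearly frames
df_a = pd.DataFrame({"Company": ["A", "B"], "Funding Year": [2020, 2020]})
df_b = pd.DataFrame({"Company": ["C"], "Funding Year": [2021]})

# ignore_index=True discards the per-file indexes and builds a fresh 0..n-1 index
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```

Without `ignore_index=True` (or a later `reset_index(drop=True)`), each frame keeps its own index, so labels such as 0 and 1 would appear more than once in the combined frame.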

In [164]:
clean_combined.tail()

Unnamed: 0,Company,Sector,Stage,Amount,Location,About,Funding Year
2851,Gigforce,Staffing & Recruiting,Pre-series A,3000000.0,Gurugram,A gigondemand staffing company,2021
2852,Vahdam,Food & Beverages,Series D,20000000.0,New Delhi,VAHDAM is among the worlds first vertically integrated onlinefirst tea brands,2021
2853,Leap Finance,Financial Services,Series C,55000000.0,Bengaluru,International education loans for high potential students,2021
2854,Collegedekho,Edtech,Series B,26000000.0,Gurugram,Collegedekhocom is Students Partner Friend Confidante To Help Him Take a Decision and Move On to His Career Goals,2021
2855,Werize,Financial Services,Series A,8000000.0,Bengaluru,Indias first socially distributed full stack financial services platform for small town India,2021


NOW LET'S CLEAN THE COMBINED DATA TO MAKE SURE EVERYTHING IS CLEANED WELL

In [165]:
# first let's check for missing values in the entire data set

missing_combined = clean_combined.isnull().sum()
missing_combined

Company           0
Sector            0
Stage           970
Amount          402
Location          0
About             0
Funding Year      0
dtype: int64

In [166]:
# now let's summarize the missing values per column as counts and percentages

missing_percentage = (missing_combined / len(clean_combined) *  100)
missing_combined_summary = pd.DataFrame({'missing_combined':missing_combined, 'missing_percentage':missing_percentage})
missing_combined_summary


Unnamed: 0,missing_combined,missing_percentage
Company,0,0.0
Sector,0,0.0
Stage,970,33.963585
Amount,402,14.07563
Location,0,0.0
About,0,0.0
Funding Year,0,0.0


Based on the missing values summary:

Company: There are no missing values in the "Company" column, so no further action is needed.


Sector: The "Sector" column has 19 missing values, which account for approximately 0.67% of the total data.

Because so few rows are affected, we can impute the missing sectors with the most frequently occurring sector, on the assumption that a company with an unknown sector most likely belongs to the most common one in the data set.



In [None]:
# Imputing missing values in the "Sector" column with the most frequent sector 

most_fre_sector = clean_combined['Sector'].mode().iloc[0] # mode() returns a Series, so take its first value with .iloc[0]
clean_combined['Sector'] = clean_combined['Sector'].fillna(most_fre_sector)

In [None]:
# alternative: fill the missing values with 'unknown' instead

#clean_combined['Sector'] = clean_combined['Sector'].fillna('unknown')

BELOW IS ONE WAY TO HELP SELECT THE BEST WAY TO DEAL WITH THE MISSING VALUES IN THE STAGE COLUMN 

We create a cross-tabulation (contingency table) between the "Stage" column and the "Sector" column.
The table shows the count of each combination of stage and sector, which helps identify whether certain stages are more prevalent in specific sectors.

In [None]:
cross_tab_sec_stage = pd.crosstab(clean_combined['Stage'], clean_combined['Sector'])
print(cross_tab_sec_stage)
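
To make the cross-tabulation idea concrete, here is a minimal sketch on a hypothetical five-row frame (`toy` and its values are made up for illustration). `pd.crosstab` takes two columns of equal length; `normalize="columns"` turns the counts into within-sector shares, which are easier to compare across sectors of different sizes.

```python
import pandas as pd

# Hypothetical toy data, not taken from the funding dataset
toy = pd.DataFrame({
    "Stage": ["Seed", "Seed", "Series A", "Seed", "Series A"],
    "Sector": ["Fintech", "Edtech", "Fintech", "Fintech", "Edtech"],
})

# Raw counts of every Stage/Sector combination
counts = pd.crosstab(toy["Stage"], toy["Sector"])

# Share of each stage within a sector (every column sums to 1)
shares = pd.crosstab(toy["Stage"], toy["Sector"], normalize="columns")

print(counts)
print(shares.round(2))
```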

BEFORE CONTINUING LET'S FURTHER GROUP THE STAGE COLUMN TO MAKE THINGS SIMPLER 

In [None]:
grouped_stages = {
    # Group 1: Early Stage
    'Pre-seed': 'Early Stage',
    'Seed': 'Early Stage',
    'Seed A': 'Early Stage',
    'Seed Funding': 'Early Stage',
    'Seed Investment': 'Early Stage',
    'Seed Round': 'Early Stage',
    'Seed Round & Series A': 'Early Stage',
    'Seed fund': 'Early Stage',
    'Seed funding': 'Early Stage',
    'Seed round': 'Early Stage',
    'Seed+': 'Early Stage',

    # Group 2: Mid Stage
    'Series A': 'Mid Stage',
    'Series A+': 'Mid Stage',
    'Series A-1': 'Mid Stage',
    'Series A2': 'Mid Stage',
    'Series B': 'Mid Stage',
    'Series B+': 'Mid Stage',
    'Series B2': 'Mid Stage',
    'Series B3': 'Mid Stage',
    'Series C': 'Mid Stage',
    
    # Group 3: Late Stage
    'Series D': 'Late Stage',
    'Series D1': 'Late Stage',
    'Series E': 'Late Stage',
    'Series E2': 'Late Stage',
    'Series F': 'Late Stage',
    'Series F1': 'Late Stage',
    'Series F2': 'Late Stage',
    'Series G': 'Late Stage',
    'Series H': 'Late Stage',
    
    # Group 4: Other Stages
    'Angel': 'Other Stages',
    'Angel Round': 'Other Stages',
    'Bridge': 'Other Stages',
    'Bridge Round': 'Other Stages',
    'Corporate Round': 'Other Stages',
    'Debt': 'Other Stages',
    'Debt Financing': 'Other Stages',
    'Early seed': 'Other Stages',
    'Edge': 'Other Stages',
    'Fresh funding': 'Other Stages',
    'Funding Round': 'Other Stages',
    'Grant': 'Other Stages',
    'Mid series': 'Other Stages',
    'Non-equity Assistance': 'Other Stages',
    'None': 'Other Stages',
    'PE': 'Other Stages',
    'Post series A': 'Other Stages',
    'Post-IPO Debt': 'Other Stages',
    'Post-IPO Equity': 'Other Stages',
    'Pre Series A': 'Other Stages',
    'Pre- series A': 'Other Stages',
    'Pre-Seed': 'Early Stage',  # same round as 'Pre-seed' in Group 1, so it gets the same group
    'Pre-Series B': 'Other Stages',
    'Private Equity': 'Other Stages',
    'Secondary Market': 'Other Stages'
}

clean_combined['Stage'] = clean_combined['Stage'].replace(grouped_stages)
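
Many keys in the mapping above differ only in capitalization, hyphens or spacing. One possible way to shrink the dictionary is to normalize the labels before mapping them. The sketch below is an illustration under that assumption: `normalize_stage`, the shortened `stage_groups` dictionary and the sample labels are hypothetical, and the groups follow the assignments used above.

```python
import pandas as pd


def normalize_stage(label):
    """Lowercase a stage label, trim it, and turn hyphens and repeated spaces into single spaces."""
    if pd.isna(label):
        return label
    text = str(label).strip().lower().replace("-", " ")
    return " ".join(text.split())


# Hypothetical, much shorter mapping keyed by normalized labels
stage_groups = {
    "pre seed": "Early Stage",
    "seed": "Early Stage",
    "seed funding": "Early Stage",
    "series a": "Mid Stage",
    "pre series a": "Other Stages",
}

stages = pd.Series(["Pre-seed", "Pre-Seed", "Seed Funding", "Seed funding", "Pre- series A", None])
grouped = stages.map(normalize_stage).map(stage_groups)
print(grouped)
```

Note that `Series.map` with a dictionary returns `NaN` for labels that are not in the dictionary, whereas `replace` leaves them unchanged, so unmapped labels would need to be checked before adopting this approach.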


In [None]:
clean_combined['Stage']

In [None]:
# now let's check the cross_table again

cross_tab_sec_stage = pd.crosstab(clean_combined['Stage'], clean_combined['Sector'])
print(cross_tab_sec_stage)

Now, to deal with the missing values in the "Stage" column, we will use the relative frequencies of the six most frequently occurring stages to fill in the missing values.


In [None]:
# getting the proportions 

# share of each stage among the rows where the stage is known (row totals of the cross-tabulation)
cross_stage_perc = cross_tab_sec_stage.sum(axis=1) / cross_tab_sec_stage.values.sum()
print(cross_stage_perc)

In [None]:
# selecting the first six 
top_six_stages = cross_stage_perc.nlargest(6)
top_six_stages

NOW LET'S FILL IN THE MISSING VALUES IN THE STAGE COLUMN, USING THE RESPECTIVE PROBABILITIES OF THE TOP SIX STAGES 

In [None]:
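# Filling missing values in the "Stage" column by sampling from the top six stages

# Normalize the probabilities so they sum to 1
normalized_probs = top_six_stages / top_six_stages.sum() 

# Draw one candidate stage per row; fillna only uses the draws on rows where "Stage" is missing
random_stages = pd.Series(
    np.random.choice(top_six_stages.index, size=len(clean_combined), p=normalized_probs.values),
    index=clean_combined.index,  # align the draws with the data frame's index
)
clean_combined['Stage'] = clean_combined['Stage'].fillna(random_stages)
print(clean_combined['Stage'])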

In [None]:
clean_combined['Stage'].isnull().sum()
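
Because the fill above is random, rerunning the notebook produces different stages each time. A minimal sketch of a reproducible variation, assuming a fixed seed is acceptable for the analysis; the function name `fill_from_distribution`, its parameters and the toy series are illustrative:

```python
import numpy as np
import pandas as pd


def fill_from_distribution(column: pd.Series, top_n: int = 6, seed: int = 0) -> pd.Series:
    """Fill missing entries by sampling from the column's top_n most frequent values."""
    rng = np.random.default_rng(seed)  # seeded generator, so reruns give the same fill

    # Relative frequencies of the top_n observed values, renormalized to sum to 1
    probs = column.value_counts(normalize=True).nlargest(top_n)
    probs = probs / probs.sum()

    # Draw exactly one value per missing entry
    missing = column.isna()
    draws = rng.choice(probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy())

    filled = column.copy()
    filled.loc[missing] = draws
    return filled


# Hypothetical toy column with two missing stages
stage = pd.Series(["Early Stage", None, "Mid Stage", "Early Stage", None, "Late Stage"])
filled = fill_from_distribution(stage)
print(filled)
```

Only the missing rows receive a draw, so the observed stages are never touched and the random draws are limited to the rows that need them.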

DEALING WITH MISSING VALUES IN THE AMOUNT COLUMN 


First, let's check whether there is any relationship between the missing values and the different sectors.
This insight will guide us on how to properly impute the missing values.

We will start by creating a contingency table that shows the distribution of missing values across the different sectors.

NOTE: this table is the input to a chi-square test, which helps us decide whether to reject the null hypothesis stated below.
We use the chi2_contingency function from the scipy.stats module to perform the test. This function returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies.

We will only look at the p-value and compare it with a chosen significance level.

Finally, we will interpret the p-value: if it is below the chosen significance level (e.g., 0.05), we can reject the null hypothesis and conclude that there is a significant association between the missing values in the "Amount" column and the "Sector" column.

BELOW ARE THE NULL HYPOTHESIS AND THE ALTERNATIVE HYPOTHESIS

Null hypothesis (H0): There is no association between the missing values in the "Amount" column and the "Sector" column.

Alternative hypothesis (H1): There is a significant association between the missing values in the "Amount" column and the "Sector" column

In [None]:
# using scipy.stats
from scipy.stats import chi2_contingency

# creating the contingency table: sectors (rows) vs. whether "Amount" is missing (columns)

contingency_Stg_Sect = pd.crosstab(clean_combined['Sector'], clean_combined['Amount'].isnull())

# performing the chi-square test 
chi2, p_value, _, _ = chi2_contingency(contingency_Stg_Sect)

# printing the results 
print('chi-square Statistic_result:', chi2)
print('p-value:', p_value)

In [None]:
# Comparing with significance level
alpha = 0.05  # Chosen significance level
if p_value < alpha:
    print("There is a significant relationship between the missing values and the sector.") # true value 0.000000000013686032693344578.
else:
    print("There is no significant relationship between the missing values and the sector.")
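
The chi-square approximation is usually considered reliable only when the expected cell counts are not too small (a common rule of thumb is at least 5). With many small sectors this may not hold, so it can be worth inspecting the expected frequencies that `chi2_contingency` also returns. A minimal sketch on hypothetical data (the `toy` frame and the threshold of 5 are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: Fintech is missing 5 of 40 amounts, Edtech 20 of 40
toy = pd.DataFrame({
    "Sector": ["Fintech"] * 40 + ["Edtech"] * 40,
    "Amount": [1.0] * 35 + [np.nan] * 5 + [1.0] * 20 + [np.nan] * 20,
})

# Sectors (rows) vs. whether the amount is missing (columns)
table = pd.crosstab(toy["Sector"], toy["Amount"].isna())
chi2, p_value, dof, expected = chi2_contingency(table)

# Fraction of cells whose expected count falls below the rule-of-thumb threshold
share_small = (expected < 5).mean()

print(f"p-value: {p_value:.4f}, degrees of freedom: {dof}")
print(f"share of cells with expected count < 5: {share_small:.0%}")
```

If a large share of cells falls below the threshold, rare sectors could be merged into an "Other" group before testing; that is a suggestion, not something the notebook does.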

From the result above, the missing values in the "Amount" column are not spread randomly across sectors: there is a statistically significant association between the two variables.
This insight helps us choose an appropriate way of imputing the missing values.

BASED ON THIS P-VALUE WE WILL:

Impute the missing values using conditional probabilities: we calculate the probability of each funding amount given a specific sector and use these probabilities to impute the missing values.

This approach leverages the relationship between the "Amount" column and the "Sector" column to impute missing values in a more informed manner, taking into account the patterns and associations observed in the data.

In [None]:
# Calculating the conditional probabilities

# Grouping the dataset by the "Sector" column
sector_groups = clean_combined.groupby("Sector")

# Calculate conditional probabilities for each sector
conditional_prob_sector = {}

for sector, group in sector_groups:
    # Count the occurrences of funding amounts within each sector group
    Amount_counts = group["Amount"].value_counts()

    # Calculating conditional probabilities by dividing by the total count
    conditional_prob_sector[sector] = Amount_counts / Amount_counts.sum()

# Printing the conditional probabilities for each sector
for sector, probabilities in conditional_prob_sector.items():
    print(f"Conditional probabilities for sector '{sector}':")
    print(probabilities)
    print()



From the output above we can deduce the following:

The conditional probabilities represent the distribution of funding amounts within each sector. Each funding amount is associated with a probability, indicating the likelihood of observing that amount given the corresponding sector.


For instance, in the first set of conditional probabilities:

The funding amounts 1649.32, 6063.67, 36382.05, 1819102.26, 218292.27, 97018.79, 9701.88, and 23000000.00 each have a probability of 0.045455.
This indicates that within the corresponding sector, each of these funding amounts is equally likely to be observed.


In the second and third sets of conditional probabilities, only a single funding amount is present. This means that within the respective sectors ('API platform' and 'AR platform'), only one funding amount (49115.76 and 84891.44, respectively) is observed, so it has a probability of 1.0.


These conditional probabilities can be used to impute missing values in the "Amount" column based on the corresponding sector. By considering the conditional probabilities, we can assign funding amounts to the missing values that are consistent with the observed distribution of funding amounts within each sector
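
As a small worked example of what these conditional probabilities are, the sketch below computes P(Amount | Sector) on a hypothetical five-row frame (the `toy` values are made up). It uses `value_counts(normalize=True)` on the grouped column, which is a compact alternative to looping over the groups.

```python
import pandas as pd

# Hypothetical toy data: Fintech has three observed amounts and one missing
toy = pd.DataFrame({
    "Sector": ["Fintech", "Fintech", "Fintech", "Fintech", "Edtech"],
    "Amount": [1e6, 1e6, 5e6, None, 2e6],
})

# Relative frequency of each observed amount within its sector (missing amounts are ignored)
cond_prob = toy.groupby("Sector")["Amount"].value_counts(normalize=True)
print(cond_prob)
```

Here 1,000,000 appears in two of the three observed Fintech amounts, so its conditional probability is 2/3, while the single Edtech amount gets probability 1.0, mirroring the single-amount sectors described above.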

In [None]:
# Creating a copy of the dataframe to preserve the original data
imputed_combined = clean_combined.copy()

# Iterating over the rows with missing values in the "Amount" column
for index, row in imputed_combined[imputed_combined['Amount'].isnull()].iterrows():
    sector = row['Sector']
    if sector in conditional_prob_sector:
        probabilities = conditional_prob_sector[sector]
        funding_amounts = probabilities.index.to_numpy()
        probabilities = probabilities.to_numpy()

        # Normalizing probabilities to sum up to 1
        probabilities /= np.sum(probabilities)
        
        if len(funding_amounts) > 0:
            # Randomly selecting a funding amount based on the conditional probabilities
            imputed_combined.at[index, 'Amount'] = np.random.choice(funding_amounts, p=probabilities)
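
The loop above leaves an amount missing whenever its sector has no observed amounts at all. The sketch below is one possible variation, not the notebook's method: it samples uniformly from a sector's observed amounts (equivalent to sampling by their empirical frequencies) and falls back to the overall distribution for sectors with nothing observed. The function name `impute_amount_by_sector`, the `seed` parameter and the toy data are assumptions, and the sketch assumes at least one amount is observed overall.

```python
import numpy as np
import pandas as pd


def impute_amount_by_sector(frame: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Fill missing 'Amount' values by sampling observed amounts from the same sector."""
    rng = np.random.default_rng(seed)  # seeded so the imputation is reproducible
    out = frame.copy()
    overall = out["Amount"].dropna().to_numpy()  # fallback pool for sectors with no data

    # groupby skips rows whose Sector is missing, so those rows are left as they are
    for _, group in out.groupby("Sector"):
        missing_idx = group.index[group["Amount"].isna()]
        if len(missing_idx) == 0:
            continue
        observed = group["Amount"].dropna().to_numpy()
        pool = observed if len(observed) > 0 else overall
        # Sampling with replacement from the observed values follows their empirical frequencies
        out.loc[missing_idx, "Amount"] = rng.choice(pool, size=len(missing_idx))
    return out


# Hypothetical toy data: Agritech has no observed amount at all
toy = pd.DataFrame({
    "Sector": ["Fintech", "Fintech", "Fintech", "Edtech", "Edtech", "Agritech"],
    "Amount": [1e6, 5e6, np.nan, 2e6, np.nan, np.nan],
})
imputed = impute_amount_by_sector(toy)
print(imputed)
```

Randomly imputed amounts keep the shape of each sector's distribution but add noise to sums and averages, so results that depend on them are best reported alongside the share of imputed rows.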


In [None]:
imputed_combined['Amount']  # the imputed values live in the copy, not in clean_combined