## STOCKHOLM TEAM

## Exploratory Data Analysis of the Indian StartUp Funding Ecosystem 

### Business Understanding

**Project Description:**

Explore the Indian startup funding ecosystem through an in-depth analysis of funding data from 2018 to 2021. The analysis looks at key trends, funding patterns, and factors associated with startup success. It investigates the relationship between funding and startup growth, with a focus on temporal patterns and city-level dynamics, identifies the sectors investors prefer, and uncovers industry-specific funding trends. The resulting overview of the Indian startup ecosystem is intended for entrepreneurs, investors, and policymakers.

## Data Understanding

This project aims to explore and gain a deeper understanding of the Indian startup funding ecosystem. The dataset used for analysis contains information about startup funding from 2018 to 2021. It includes attributes such as the company's name, sector, funding amount, funding round, investor details, and location.

To conduct a comprehensive analysis, we will examine the dataset to understand its structure, contents, and any potential data quality issues. By understanding the data, we can ensure the accuracy and reliability of our analysis.

The key attributes in the dataset include:

- **Company**: The name of the startup receiving funding.
- **Sector**: The industry or sector to which the startup belongs.
- **Amount**: The amount of funding received by the startup.
- **Stage**: The round of funding (e.g., seed, series A, series B).
- **Location**: The city or region where the startup is based.
- **About**: What the company does.
- **Funding Year**: The year in which the company was funded.

By examining these attributes, we can uncover insights about the funding landscape, identify trends in funding amounts and rounds, explore the preferred sectors for investment, and analyze the role of cities in the startup ecosystem.

Throughout the analysis, we will use visualizations and statistical techniques to present the findings effectively. By understanding the data and its characteristics, we can proceed with confidence in our analysis, derive meaningful insights, and make informed decisions based on the findings.

### Hypothesis:

#### NULL Hypothesis (H0):

#### **The sector of a company does not have an impact on the amount of funding it receives.**


#### ALTERNATE Hypothesis (HA):

#### **The sector of a company does have an impact on the amount of funding it receives.**
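
One possible way to test this pair of hypotheses is a Kruskal-Wallis H-test, which compares the distribution of funding amounts across sectors without assuming normality. The choice of test, the toy values, and the 0.05 significance level below are assumptions made for illustration; they are not taken from the project data.

```python
import pandas as pd
from scipy.stats import kruskal

# Toy data for illustration only; these are not values from the project dataset.
toy = pd.DataFrame({
    "Sector": ["Fintech", "Fintech", "Fintech", "EdTech", "EdTech", "EdTech",
               "Health", "Health", "Health"],
    "Amount": [1.0e6, 2.5e6, 4.0e6, 2.0e5, 3.5e5, 5.0e5, 8.0e5, 1.2e6, 9.0e5],
})

# One array of funding amounts per sector.
groups = [g["Amount"].to_numpy() for _, g in toy.groupby("Sector")]

# Kruskal-Wallis H-test: do the sectors share the same distribution of amounts?
h_stat, p_value = kruskal(*groups)

alpha = 0.05  # conventional significance level, an assumption
reject_h0 = p_value < alpha
print(f"H = {h_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

With the cleaned dataset, the same call would take one array of `Amount` values per `Sector`.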




##  Research / Analysis Questions:

1. What are the most common industries represented in the datasets?

2. How does the funding amount vary across different rounds/series in the datasets?
   
3. Which locations have the highest number of companies in the datasets?
   
4. What kind of investment type should startups look for depending on their industry type? (EDA: Analysis of funding preferences by industry)

5. Are there any correlations between the funding amount and the company's sector or location?
   
6. What are the top investors in the datasets based on the number of investments made?
   
7. Which industries are favored by investors based on the number of funding rounds? (EDA: Top 10 industries which are favored by investors)

8. Are there any outliers in the funding amounts in the datasets?
   
9.  Is there a relationship between the company's sector and the presence of certain investors?
    
10. What is the range of funds generally received by startups in India (Max, min, avg, and count of funding)? (EDA: Descriptive statistics of funding amounts)


## Data Preparation

Before diving into the analysis, we will preprocess and clean the data to ensure its quality and suitability for analysis. This may involve handling missing values, correcting data types, and addressing any inconsistencies or outliers that could affect the accuracy of our results.

Once the data is prepared, we will be ready to perform an in-depth exploratory analysis of the Indian startup funding ecosystem. The analysis will involve answering specific research questions, identifying patterns and trends, and generating meaningful visualizations to present the findings.

Through this process of data understanding and preparation, we will set a solid foundation for conducting a robust and insightful analysis of the Indian startup funding data.

**The data come from four yearly sources: two years are read from separate local CSV files and two years are queried from a remote server. They are merged into one dataset later.**

### Load the Packages/Modules

In [None]:
%pip install forex-python
%pip install pandas
%pip install python-dotenv
%pip install seaborn
%pip install matplotlib
%pip install pyodbc
%pip install numpy
%pip install scipy


In [None]:
# Importing the Modules needed
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

import pyodbc #just installed with pip
from dotenv import dotenv_values #import the dotenv_values function from the dotenv package
import warnings 
warnings.filterwarnings('ignore')

from forex_python.converter import CurrencyRates
import re 

from scipy.stats import chi2_contingency
from scipy.stats import ks_2samp

### Import Datasets

In [None]:
df = pd.read_csv('startup_funding2018.csv') # read the data_2018 and convert it to pandas data frame 

In [None]:
df2 = pd.read_csv('startup_funding2019.csv') # read the data_2019 and convert it to pandas data frame

#### Accessing the Remote Server Datasets

In [None]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')


# Get the values for the credentials you set in the '.env' file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")


connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
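
If a key is missing from the `.env` file, `environment_variables.get(...)` returns `None` and the connection string silently contains the text `None`. The sketch below checks the credentials before building the string. The helper name `build_connection_string` and the placeholder values are hypothetical.

```python
REQUIRED_KEYS = ("DATABASE", "SERVER", "USERNAME", "PASSWORD")

def build_connection_string(env):
    """Build the ODBC connection string, failing early on missing credentials."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing keys in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['USERNAME']};PWD={env['PASSWORD']}"
    )

# Placeholder values for illustration; real values come from dotenv_values('.env').
example_env = {"DATABASE": "db", "SERVER": "host", "USERNAME": "user", "PASSWORD": "secret"}
example_connection_string = build_connection_string(example_env)
print(example_connection_string)
```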

In [None]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary

connection = pyodbc.connect(connection_string)

In [None]:
# The SQL queries below fetch the 2020 and 2021 data.
# Note that you will not have permission to insert, delete, or update rows in these tables.
query1 = "SELECT * FROM dbo.LP1_startup_funding2020"
query2 = "SELECT * FROM dbo.LP1_startup_funding2021"
df3 = pd.read_sql(query1, connection)
df4 = pd.read_sql(query2, connection)

## Display Options

In [None]:
# Set display options to show all values without truncation
# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)

## Inspect and Clean the Datasets

#### 2018 Data

In [None]:
df.head()

In [None]:
df.shape # the shape of the data as (rows, columns)

In [None]:
df.columns # the columns in the data set

In [None]:
df.info()  # getting information about the DataFrame

In [None]:
df.describe(include='object')  # generating descriptive statistics for the text columns

Now that we have a general description of the data set, we can move on to data cleaning.

#### Handling Duplicated Data

In [None]:
# Check whether each column contains repeated values

columns_to_check = ['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location', 'About Company']

for column in columns_to_check:
    has_duplicates = df[column].duplicated().any()
    print(f'{column}: {has_duplicates}')

In [None]:
df.drop_duplicates(subset=['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location', 'About Company'], inplace=True)
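
The two cells above answer different questions: `duplicated().any()` on a single column is true whenever any value repeats in that column, while `drop_duplicates` removes rows only when all the listed columns match. A small sketch on toy rows (not project data) shows the difference.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Company Name": ["A", "B", "B", "C"],
    "Industry": ["Fintech", "Fintech", "Fintech", "EdTech"],
    "Amount": ["100", "200", "200", "100"],
})

# Column-wise check: True whenever any value repeats within that single column.
column_has_repeats = {col: bool(toy[col].duplicated().any()) for col in toy.columns}

# Row-wise check: True only where the whole row repeats an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(column_has_repeats, n_duplicate_rows, len(deduplicated))
```

Every column reports repeats here, yet only one row is a true duplicate, so the row-wise count is the better guide to how many rows will be dropped.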

#### Standardizing Data Formats

Now let's standardize the data set so that all data points share the same format.

First, let's check for dash symbols within the columns using a simple loop.

In [None]:
# Check each column for the long dash that the data set uses as a placeholder

columns_to_check = ['Amount', 'Company Name', 'Location', 'About Company', 'Industry', 'Round/Series']

for column in columns_to_check:
    has_dash_symbols = df[column].str.contains('—').any()
    print(f"{column}: {has_dash_symbols}")

Now let's handle the dash symbols in **the Amount column**, clean and format the column correctly, and convert the currency to USD.

In [None]:
df['Amount'].head() # first let's look at the Amount column

In [None]:
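# Cleaning the Amount column

df['Amount'] = df['Amount'].apply(str)
df['Amount'] = df['Amount'].str.replace(",", "", regex=False)   # drop thousands separators
df['Amount'] = df['Amount'].str.replace("—", "0", regex=False)  # the dash placeholder marks an undisclosed amount
df['Amount'] = df['Amount'].str.replace("$", "", regex=False)   # match '$' literally; as a regex it is an end-of-string anchor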

## Assumptions Made for Amount Column
- Amounts without currency symbols in the 2018 dataset are in USD.
- The average Indian Rupee (INR) to US Dollar (USD) rate for the relevant year will be used for currency conversions.
- The exchange rate is taken from https://www.exchangerates.org.uk/INR-USD-spot-exchange-rates-history-2018.html; the average 2018 rate of 0.0146 USD per INR is used.

In [None]:
# Set the desired exchange rate
exchange_rate = 0.0146

# Cleaning the Amount column
df['Amount'] = df['Amount'].apply(str)
# '$' is escaped because an unescaped '$' is a regex anchor and would never match the symbol
df['Amount'] = df['Amount'].replace([',', '—', r'\$'], ['', '0', ''], regex=True)

# Extract the Indian currency amount
df['Indiancurr'] = df['Amount'].str.rsplit('₹', n=2).str[1]
df['Indiancurr'] = df['Indiancurr'].apply(float).fillna(0)

# Convert Indian currency to USD using the specified exchange rate
df['UsCurr'] = df['Indiancurr'] * exchange_rate

# Replace 0 values with NaN
df['UsCurr'] = df['UsCurr'].replace(0, np.nan)

# Fill NaN values in 'UsCurr' with original 'Amount' values
df['UsCurr'] = df['UsCurr'].fillna(df['Amount'])

# Remove '$' symbol from 'UsCurr' column (escaped, since '$' is a regex anchor)
df['UsCurr'] = df['UsCurr'].replace(r"\$", "", regex=True)

# Update 'Amount' column with converted USD values
df['Amount'] = df['UsCurr'].apply(lambda x: float(str(x).replace("$","")))

# Replace 0 values with NaN in 'Amount' column
df['Amount'] = df['Amount'].replace(0, np.nan)

# Format the 'Amount' column
format_amount = lambda amount: "{:,.2f}".format(amount)
df['Amount'] = df['Amount'].map(format_amount)

In [None]:
df['Amount'] = df['Amount'].str.replace(',', '').astype(float) # the Amount column holds amounts, so we convert the formatted strings back to float
type(df['Amount'][0])
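
The cleaning and conversion steps above can also be expressed as one small function applied per value. This is a sketch under the stated assumptions (amounts without a symbol are USD, and the 2018 rate is 0.0146 USD per INR). The function name `amount_to_usd` and the example strings are hypothetical.

```python
import math

INR_TO_USD_2018 = 0.0146  # average 2018 rate stated in the assumptions above

def amount_to_usd(raw, inr_to_usd=INR_TO_USD_2018):
    """Convert one raw 'Amount' string to a float in USD, or NaN if undisclosed."""
    text = str(raw).replace(",", "").strip()
    if text in ("", "—", "nan"):
        return math.nan
    if text.startswith("₹"):
        return float(text[1:]) * inr_to_usd
    # '$' or no symbol: treated as USD, per the stated assumption
    return float(text.lstrip("$"))

examples = ["₹40,000,000", "$1,500,000", "250000", "—"]
converted = [amount_to_usd(value) for value in examples]
print(converted)
```

On the real column this would be applied as `df['Amount'].map(amount_to_usd)`, which avoids the intermediate `Indiancurr` and `UsCurr` columns.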

#### Handling Categorical Data

Next, we handle the categorical data in the 'Industry', 'Round/Series', and 'Location' columns.

We start by examining the unique values in each column with the unique() function to identify any inconsistencies or variations.

### Location Column

#### The Location column contains combined information (e.g., city, state, country)

In [None]:
df['Location'].unique() # checking each unique values 

In [None]:
df['Location'].value_counts() # counting how often each unique value occurs

In [None]:
# The 'Location' column is in the format 'City, Region, Country'.
# Only the city is needed for this analysis,
# so we keep all characters up to the first comma.

df['Location'] = df['Location'].apply(str)
df['Location'] = df['Location'].str.split(',').str[0]
df['Location'] = df['Location'].replace("'","",regex=True)

In [None]:
# From observation, some cities appear under more than one name.
# The variants are mapped to a single name so that each city is counted once.
df["Location"] = df["Location"].replace (['Bangalore','Bangalore City'], 'Bengaluru')
df.loc[~df['Location'].str.contains('New Delhi', na=False), 'Location'] = df['Location'].str.replace('Delhi', 'New Delhi')
df['Location'] = df['Location'].replace (['Gurgaon'], 'Gurugram')
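
The same normalization can be kept in one alias table so that new spelling variants are added in a single place. This is a sketch on toy values. The alias table matches whole city names, which is a slightly stricter rule than the substring replacement used for 'Delhi' above, and the helper name `normalize_city` is hypothetical.

```python
import pandas as pd

# Alias table built from the replacements above; extend it as new variants appear.
CITY_ALIASES = {
    "Bangalore": "Bengaluru",
    "Bangalore City": "Bengaluru",
    "Delhi": "New Delhi",
    "Gurgaon": "Gurugram",
}

def normalize_city(location):
    """Keep the city part of 'City, Region, Country' and map known aliases."""
    if pd.isna(location):
        return location
    city = str(location).split(",")[0].strip()
    return CITY_ALIASES.get(city, city)

toy_locations = pd.Series(["Bangalore, Karnataka, India", "Delhi, Delhi, India",
                           "New Delhi, Delhi, India", "Gurgaon, Haryana, India", None])
normalized = toy_locations.map(normalize_city)
print(normalized.tolist())
```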

In [None]:
df['Location'] # taking a look at the location column to confirm the changes 

In [None]:
df['Location'].unique() # checking the unique values once more

In [None]:
df['Location'].value_counts() # counting the unique values again to be sure of the changes 

In [None]:
df['Location'].isnull().sum() # checking for null values in the location column

### Industry Column

In [None]:
df['Industry'] # taking a look at the Industry column first to have some insight into the column 

In [None]:
# let's check all the unique values in the industry column
df['Industry'].unique()

In [None]:
df['Industry'].value_counts() # counting all the unique values in the Industry column 

Below we handle title casing and leading and trailing spaces, and standardize the Industry column.

In [None]:
# Get unique values in the 'Industry' column
unique_values = df['Industry'].unique()
# Create a set to store the labels that appear after the first delimiter
delimiters = set()

# Iterate over the unique values
for value in unique_values:
    parts = re.split(',|;|/|-', value) # Split the value on commas, semicolons, slashes, and hyphens
    delimiters.update(filter(lambda x: x != '', parts[1:])) # Add the non-empty trailing parts (not the delimiter characters themselves) to the set
# Print the collected trailing parts
print(delimiters)
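
`re.split` returns the text between separators, so the set above holds secondary labels rather than separator characters. To see which separators actually occur, `re.findall` can be used instead. The toy strings below are illustrative only.

```python
import re
from collections import Counter

# Toy industry strings for illustration only.
toy_industries = ["Fintech, Payments", "Health Care; Wellness", "E-commerce", "Food/Beverage"]

# re.findall returns the separator characters themselves, unlike re.split,
# which returns the text between them.
delimiter_counts = Counter(
    match for value in toy_industries for match in re.findall(r"[,;/-]", value)
)
print(delimiter_counts)
```

Note that the hyphen in 'E-commerce' is counted as well, which is one reason to split on commas only, as the next cell does.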

In [None]:
# keeping only the first listed industry for each row in the Industry column
df['Industry'] = df['Industry'].str.split(',').str[0]
# converting the industry names in the column to title case
df['Industry'] = df['Industry'].str.title()

In [None]:
df[df['Industry']=='—']

In [None]:
# renaming some of the Company names to their official names

company_mapping = {
    'dishq': 'DISH',
    'HousingMan.com': 'HousingMan',
    'ENLYFT DIGITAL SOLUTIONS PRIVATE LIMITED': 'ENLYFT DIGITAL SOLUTIONS',
    'Toffee': 'Toffee Pvt Ltd',
    'Avenues Payments India Pvt. Ltd.': 'Avenues Payments',
    'Planet11 eCommerce Solutions India (Avenue11)': 'Planet11',
    
}

# Map the dash placeholder in the Industry column to an empty string; the other entries keep their names

industry_mapping = {
    '—': '',
    'Fashion and Lifestyle Blog': 'Fashion and Lifestyle Blog',
    'Financial Services': 'Financial Services',
    'Automotive Services': 'Automotive Services',
    'Automotive Financing': 'Automotive Financing',
    'Food and Beverage': 'Food and Beverage',
    'Gaming and Entertainment': 'Gaming and Entertainment',
    'Marketing Technology': 'Marketing Technology',
    'Electric Vehicle Technology': 'Electric Vehicle Technology',
    'Real Estate Technology': 'Real Estate Technology',
    'Telecommunications': 'Telecommunications',
    'E-commerce': 'E-commerce',
    'Hospitality Technology': 'Hospitality Technology',
    'Health and Wellness': 'Health and Wellness',
    'Digital Marketing': 'Digital Marketing',
    'E-commerce Solutions': 'E-commerce Solutions',
    'Transportation and Logistics Technology': 'Transportation and Logistics Technology',
    'Cosmetics': 'Cosmetics',
    'Travel and Adventure': 'Travel and Adventure',
    'EdTech': 'EdTech'
}

# Apply both mappings; values without an entry in a mapping are left unchanged
df['Company Name'] = df['Company Name'].apply(lambda x: company_mapping[x] if x in company_mapping else x)
df['Industry'] = df['Industry'].apply(lambda x: industry_mapping[x] if x in industry_mapping else x)

In [None]:
# checking if there are any leading or trailing spaces in the industry names in the 'Industry' column
has_spaces = df['Industry'].str.contains(r'^\s|\s$', regex=True, na=False)

rows_with_spaces = df[has_spaces]
print(rows_with_spaces)

In [None]:
# remove the leading or trailing spaces from the industry names in the 'Industry' column
df['Industry'] = df['Industry'].str.strip()

In [None]:
df['Industry'].isnull().sum() # confirming the null values in the industry column 

In [None]:
df.head() # getting the first sample of the data set 

### Round/Series Column

In [None]:
df['Round/Series'].unique() # getting the unique values 

In [None]:
df['Round/Series'].value_counts() # counting how often each value occurs

In [None]:
# Replace placeholder values such as 'Undisclosed' with NaN and remove inconsistent entries from the data

df['Round/Series']=df['Round/Series'].replace('Undisclosed',np.nan)
df['Round/Series']=df['Round/Series'].replace('Venture - Series Unknown',np.nan)
df['Round/Series'] = df['Round/Series'].replace('https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593', 'nan')
df['Round/Series'] = df['Round/Series'].replace('nan', np.nan)

In [None]:
df['Round/Series'].unique() # getting the unique values 

In [None]:
df['Round/Series'].value_counts() # counting how often each value occurs

### Clean Categorical Data 

In [None]:
# Clean Company Name column
df['Company Name'] = df['Company Name'].str.strip()  # Remove leading and trailing spaces
df['Company Name'] = df['Company Name'].str.title()  # Standardize capitalization

# Clean About Company column
df['About Company'] = df['About Company'].str.strip()  # Remove leading and trailing spaces

# Function to handle special characters or encoding issues
def clean_text(text):
    # Remove special characters using regex
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return cleaned_text

# Apply the clean_text function to the About Company column
df['About Company'] = df['About Company'].apply(clean_text)

# Print the cleaned DataFrame
df.head()

Below we check for null values in the Round/Series column.

In [None]:
df['Round/Series'].isnull().sum() # checking for null values

Now let's group the Round/Series values into broader funding stages.

In [None]:
grouped_stages = {
    # Group 1: Early Stage
    'Pre-seed': 'Early Stage',
    'Seed': 'Early Stage',
    'Seed A': 'Early Stage',
    'Seed Funding': 'Early Stage',
    'Seed Investment': 'Early Stage',
    'Seed Round': 'Early Stage',
    'Seed Round & Series A': 'Early Stage',
    'Seed fund': 'Early Stage',
    'Seed funding': 'Early Stage',
    'Seed round': 'Early Stage',
    'Seed+': 'Early Stage',

    # Group 2: Mid Stage
    'Series A': 'Mid Stage',
    'Series A+': 'Mid Stage',
    'Series A-1': 'Mid Stage',
    'Series A2': 'Mid Stage',
    'Series B': 'Mid Stage',
    'Series B+': 'Mid Stage',
    'Series B2': 'Mid Stage',
    'Series B3': 'Mid Stage',
    'Series C': 'Mid Stage',
    'Seies A': 'Mid Stage',
    
    # Group 3: Late Stage
    'Series D': 'Late Stage',
    'Series I': 'Late Stage',
    'Series D1': 'Late Stage',
    'Series E': 'Late Stage',
    'Series E2': 'Late Stage',
    'Series F': 'Late Stage',
    'Series F1': 'Late Stage',
    'Series F2': 'Late Stage',
    'Series G': 'Late Stage',
    'Series H': 'Late Stage',
    
    # Group 4: Other Stages
    'Angel': 'Other Stages',
    'Angel Round': 'Other Stages',
    'Bridge': 'Other Stages',
    'Bridge Round': 'Other Stages',
    'Corporate Round': 'Other Stages',
    'Debt': 'Other Stages',
    'Debt Financing': 'Other Stages',
    'Early seed': 'Other Stages',
    'Edge': 'Other Stages',
    'Fresh funding': 'Other Stages',
    'Funding Round': 'Other Stages',
    'Grant': 'Other Stages',
    'Mid series': 'Other Stages',
    'Non-equity Assistance': 'Other Stages',
    'None': 'Other Stages',
    'PE': 'Other Stages',
    'Post series A': 'Other Stages',
    'Post-IPO Debt': 'Other Stages',
    'Post-IPO Equity': 'Other Stages',
    'Pre Series A': 'Other Stages',
    'Pre- series A': 'Other Stages',
    'Pre-Seed': 'Other Stages',
    'Pre-Series B': 'Other Stages',
    'Private Equity': 'Other Stages',
    'Secondary Market': 'Other Stages',
    'Pre-series A': 'Other Stages',
    'Pre-series B':'Other Stages',
    'Pre-series A1': 'Other Stages',
    'Pre-series':'Other Stages',
}

df['Round/Series'] = df['Round/Series'].replace(grouped_stages)
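
`replace` leaves any label without an entry in the dictionary unchanged, so unmapped stages pass through silently. A quick way to list them is to run `map` on a copy, which returns NaN for missing keys. The excerpt of the mapping and the label 'Series Z' below are illustrative only.

```python
import pandas as pd

# A small excerpt of the mapping above, for illustration.
stage_groups = {"Seed": "Early Stage", "Series A": "Mid Stage", "Series D": "Late Stage",
                "Angel": "Other Stages"}

# 'Series Z' is a made-up label that has no entry in the mapping.
toy_stages = pd.Series(["Seed", "Series A", "Series D", "Angel", "Series Z", None])

# map returns NaN for labels that are not keys of the dictionary.
grouped = toy_stages.map(stage_groups)

# Labels that were present but have no entry in the mapping.
unmapped = sorted(toy_stages[grouped.isna() & toy_stages.notna()].unique())
print(unmapped)
```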


In [None]:
df['Round/Series'] # confirming the Round/Series again 

Now let's deal with the null values in the Round/Series column.

First, we create a cross table of Industry against Round/Series.

In [None]:
cross_table_Round_Series_Indu = pd.crosstab(df['Industry'], df['Round/Series']) # contingency table between sector (rows) and stage (columns)
cross_table_Round_Series_Indu

To deal with the missing values in the Round/Series column, we use the relative frequencies of the six most common Round/Series values to fill them in.

In [None]:
# Percentage of rows that fall in each Round/Series value (the column totals of the cross table)
cross_table_Round_Series_Indu_perc = (cross_table_Round_Series_Indu.sum(axis=0) / cross_table_Round_Series_Indu.values.sum()) * 100
cross_table_Round_Series_Indu_perc


Now let's look at the top six.

In [None]:
top_six_Round_Series = cross_table_Round_Series_Indu_perc.nlargest(6) # the six most frequent Round/Series values
top_six_Round_Series

Now let's fill in the missing values in the Round/Series column by sampling from the top six values in proportion to their frequencies.

In [None]:
# Normalize the percentages so they sum to 1 and can be used as probabilities
normalize_prob = top_six_Round_Series / top_six_Round_Series.sum()
# Draw one candidate per row and align it to df's index (non-contiguous after drop_duplicates),
# then use the candidates only where "Round/Series" is missing
random_stages = pd.Series(np.random.choice(top_six_Round_Series.index.tolist(), size=len(df), p=normalize_prob.values), index=df.index)
df['Round/Series'] = df['Round/Series'].fillna(random_stages)
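
A compact sketch of frequency-weighted random imputation on toy data follows. It uses a seeded generator so that reruns produce the same fill, and it draws only as many values as there are missing rows. The seed value and the toy series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed, an assumption, so reruns match

# Toy column with a non-contiguous index, as left behind by drop_duplicates.
stage = pd.Series(["Early Stage", "Mid Stage", None, "Early Stage", None, "Late Stage"],
                  index=[0, 2, 3, 5, 8, 9])

# Relative frequency of each observed stage (NaN is excluded by default).
weights = stage.value_counts(normalize=True).nlargest(6)
weights = weights / weights.sum()

# Draw one label per missing row and assign the draws to exactly those rows.
missing = stage.isna()
draws = rng.choice(weights.index.to_numpy(), size=int(missing.sum()), p=weights.to_numpy())
filled = stage.copy()
filled[missing] = draws
print(filled)
```

Random imputation preserves the overall stage distribution, but the filled labels carry no information about the individual company, so results that depend on stage should be read with that in mind.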


Now let's confirm the missing values in the Round/Series column again.

In [None]:
# confirming the null values in the Round/Series column again 
df['Round/Series'].isnull().sum()

In [None]:
df.columns # looking at the columns in the data set to confirm

In [None]:
df.drop(columns=['Indiancurr','UsCurr'], inplace=True) # dropping the helper columns that are no longer needed

In [None]:
df.insert(6,"Funding Year", 2018) # inserting a new column 'Funding Year' to record which year this data set covers

In [None]:
# Rename the columns to ensure consistency when combining the four data sets

df.rename(columns = {'Company Name':'Company',
                        'Industry':'Sector',
                        'Amount':'Amount',
                        'About Company':'About',
                        'Round/Series' : 'Stage'},
             inplace = True)
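
Since the yearly data sets are merged later, it can help to see how the shared column names make that merge a plain concatenation. The sketch below uses one toy row per year. The 2019 rename mapping is an assumption based on the 2019 column names that appear in this notebook, not the notebook's own code.

```python
import pandas as pd

# Column names taken from the 2018 and 2019 files used in this notebook.
rename_2018 = {"Company Name": "Company", "Industry": "Sector",
               "About Company": "About", "Round/Series": "Stage"}
# Assumed mapping for 2019; the notebook may rename these columns differently.
rename_2019 = {"Company/Brand": "Company", "HeadQuarter": "Location",
               "What it does": "About", "Amount($)": "Amount"}
common_columns = ["Company", "Sector", "Amount", "Stage", "Location", "About", "Funding Year"]

# One toy row per year, for illustration only.
toy_2018 = pd.DataFrame([{"Company Name": "A", "Industry": "Fintech", "Round/Series": "Seed",
                          "Amount": 1.0e5, "Location": "Mumbai", "About Company": "x"}])
toy_2019 = pd.DataFrame([{"Company/Brand": "B", "HeadQuarter": "Pune", "Sector": "EdTech",
                          "What it does": "y", "Amount($)": 2.0e5, "Stage": "Series A"}])

frames = []
for frame, mapping, year in [(toy_2018, rename_2018, 2018), (toy_2019, rename_2019, 2019)]:
    tidy = frame.rename(columns=mapping).assign(**{"Funding Year": year})
    frames.append(tidy[common_columns])  # keep only the shared columns, in one order

combined = pd.concat(frames, ignore_index=True)
print(combined)
```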

In [None]:
df.head() # finally confirming the head of the data to be sure of all changes before saving the data

Now let's do a final cleaning pass on the 2018 data set.
We will start by checking for null values.

In [None]:
# here we want to check for null values in the entire data set
df.isnull().sum()


Now let's deal with the Amount column.


In [None]:
# first let's check for the percentage of missing values in the Amount column
Amount_missing = df['Amount'].isna().sum()
Amount_total = len(df['Amount'])  # all rows; count() would exclude the missing ones
percent_Amount_missing = (Amount_missing / Amount_total) * 100
percent_Amount_missing

To handle the null or missing values, we first need to understand the pattern of the missing data.

First, let's check whether there is any relationship between the missing values and the different sectors. This insight will guide us on how to impute the missing values properly.

We start by creating a contingency table that shows the distribution of missing values across the different sectors.

NOTE: The table feeds a chi-square test, which helps us decide whether to reject a hypothesis. The chi2_contingency function from the scipy.stats module calculates the chi-square statistic, p-value, degrees of freedom, and expected frequencies, but we will only look at the p-value against a chosen significance level.

Finally, we interpret the p-value: if it is below the chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a significant association between the missing values in the "Amount" column and the "Sector" column.

The null and alternative hypotheses are:

Null hypothesis (H0): There is no association between the missing values in the "Amount" column and the "Sector" column.

Alternative hypothesis (H1): There is a significant association between the missing values in the "Amount" column and the "Sector" column.


Creating a contingency table:

We use the pd.crosstab() function to create a contingency table that shows the distribution of missing values across the different sectors. This table helps us see the association between the two variables.

In [None]:
# creating the contingency table

conting_table = pd.crosstab(df['Sector'], df['Amount'].isnull())
conting_table

Now let's perform the chi-square test:

We use the chi2_contingency() function from the scipy.stats module, which returns the chi-square statistic, p-value, degrees of freedom, and expected frequencies.

In [None]:
# below we are performing the chi-square test
chi2, p_value, _,_ = chi2_contingency(conting_table)
p_value

Interpreting the results:

Checking the p-value obtained from the chi-square test.

If the p-value is below our chosen significance level (in this case 0.05), we can reject the null hypothesis and conclude that there is a significant association between the missing values in the "Amount" column and the "Sector" column. If the p-value is above the significance level, we fail to reject the null hypothesis.

In [None]:
# Interpret the chi-square test
significance_level = 0.05

if p_value < significance_level:
    print("There is a significant association between the missing values in the 'Amount' column and the 'Sector' column.")
else:
    print("There is no significant association between the missing values in the 'Amount' column and the 'Sector' column.")
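
The chi-square approximation is usually considered reasonable only when the expected count in each cell is not too small (a common rule of thumb is at least 5). With many sparsely populated sectors, this is worth checking before relying on the p-value. The sketch below runs the same test on toy data and reports the smallest expected count. The data and the threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data for illustration only: 'Amount' is missing more often in one sector.
toy = pd.DataFrame({
    "Sector": ["Fintech"] * 20 + ["EdTech"] * 20,
    "Amount": [1.0] * 18 + [np.nan] * 2 + [1.0] * 8 + [np.nan] * 12,
})

# Rows are sectors, columns are False (present) and True (missing).
table = pd.crosstab(toy["Sector"], toy["Amount"].isna())
chi2, p_value, dof, expected = chi2_contingency(table)

# Smallest expected cell count, to compare against the rule-of-thumb threshold.
min_expected = expected.min()
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}, min expected = {min_expected:.1f}")
```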


Given the output above, we set aside the sector-based approach to filling in the missing values.

The next approach is to look at the missing data patterns.


Missing Data Patterns: 

We will analyze the patterns of missing values in the 'Amount' column alongside the other relevant columns ('Sector', 'Stage', and 'Location'). If the values are missing completely at random (MCAR) or missing at random (MAR), imputation methods such as median imputation could be suitable.


Below we use a heat map and a correlation plot to look for patterns.

1. MISSING DATA HEAT MAP

In [None]:

df[df['Amount'] == 'Brand Marketing']


In [None]:
df[df['Location'] == 'Brand Marketing']

In [None]:
df[df['Sector'] == 'Brand Marketing']

In [None]:
# creating a subset of the relevant columns
rele_col = ['Amount', 'Sector', 'Stage', 'Location']

# creating a dataframe with missing value indicator 
missing_indicator_df = df[rele_col].isnull()

# below we are creating a missing data heat map
sns.heatmap(missing_indicator_df, cmap='viridis', cbar=False)
plt.title('Missing Data Map')
plt.show()

The heat map above appears consistent with the assumption that the missing values in the 'Amount' column are missing completely at random (MCAR) or missing at random (MAR), that is, the missingness shows no visible pattern across the 'Sector', 'Location', or 'Stage' variables. A heat map is only a visual check and does not prove this.

Based on this pattern of missingness, median imputation could be a reasonable option to impute the missing values in the 'Amount' column.

2. CORRELATION PLOT

In [None]:
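# below we are creating a correlation matrix plot
# corr() works on numeric data only, so the text columns are excluded with numeric_only=True

correl_matrix = df[rele_col].corr(numeric_only=True)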
sns.heatmap(correl_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Plot')
plt.show()

Now let's impute the missing values using median imputation.

In [None]:
# Median of the non-missing values
median_non_null_Amount = df['Amount'].dropna()

median_Amount = median_non_null_Amount.median() 

# Fill the missing values with the median (assigned back rather than modified in place)
df['Amount'] = df['Amount'].fillna(median_Amount)
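
Median imputation makes imputed and observed amounts indistinguishable afterwards. One option, shown here on toy data as a sketch, is to record a flag column before filling so that imputed rows can be excluded or inspected later. The column name `Amount_was_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy amounts for illustration only.
toy = pd.DataFrame({"Amount": [100.0, np.nan, 300.0, 1000.0, np.nan]})

# Keep a flag so imputed rows can be excluded or inspected later.
toy["Amount_was_missing"] = toy["Amount"].isna()

# Series.median skips NaN by default, so no dropna() is needed.
median_amount = toy["Amount"].median()
toy["Amount"] = toy["Amount"].fillna(median_amount)
print(toy)
```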

Now let's check the Amount column for missing values again.

In [None]:
df['Amount'].isnull().sum() # checking for null values 

In [None]:
df.isnull().sum() # checking whether any column still has NaN values

In [None]:
df.to_csv('df18.csv', index=False) # saving the clean data as df18.csv

Next, we work on the 2019 data set.

#### 2019 Data

In [None]:
df2.head() # first let's look at the head of the data set 

In [None]:
df2.shape # now let's look at the shape of the data to get some idea about the columns and rows 

In [None]:
df2.columns # now let's look at the columns in the 2019 data sets 

In [None]:
df2.info() # getting information about the df2 DataFrame

In [None]:
df2.describe(include='object') # general descriptive statistics of the df2 DataFrame

#### Handling Duplicated Data

In [None]:
# below we are checking for duplicated values within the columns 

columns_to_check2 = ['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does', 'Founders', 'Investor', 'Amount($)', 'Stage',]

for column2 in columns_to_check2:
    has_duplicates2 = df2[column2].duplicated().any()
    print(f'{column2}: {has_duplicates2}')

In [None]:
# Drop the rows that are duplicated across all the listed columns

df2.drop_duplicates(subset=['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does', 'Founders', 'Investor', 'Amount($)', 'Stage',], inplace=True)

Now that we have a general description of the data set, we can move on to data cleaning.

#### Missing Values

In [None]:
missing_values2 = df2.isnull().sum() # looking for missing values in dataFrame 2
missing_values2

Let's deal with the missing values shown in the output above.

Dealing with missing values in HeadQuarter

### The Company/Brand column holds the names of real startups, so the null HeadQuarter values can be filled by looking up each company's headquarters on Google

In [None]:
# Fill the missing values in the HeadQuarter column

# Each headquarters below was looked up on Google

df2.loc[df2['Company/Brand'] == 'Bombay Shaving', 'HeadQuarter'] = 'Gurugram'
df2.loc[df2['Company/Brand'] == 'Quantiphi', 'HeadQuarter'] = 'Marlborough'
df2.loc[df2['Company/Brand'] == 'Open Secret', 'HeadQuarter'] = 'Mumbai'
df2.loc[df2['Company/Brand'] == "Byju's", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == "Witblox", 'HeadQuarter'] = 'Mumbai'
df2.loc[df2['Company/Brand'] == "SalaryFits", 'HeadQuarter'] = 'London'
df2.loc[df2['Company/Brand'] == "Pristyn Care", 'HeadQuarter'] = 'Gurgaon'
df2.loc[df2['Company/Brand'] == "Springboard", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == "Fireflies .ai", 'HeadQuarter'] = 'San Francisco'
df2.loc[df2['Company/Brand'] == "Bijak", 'HeadQuarter'] = 'New Delhi'
df2.loc[df2['Company/Brand'] == "truMe", 'HeadQuarter'] = 'Gurugram'
df2.loc[df2['Company/Brand'] == "Rivigo", 'HeadQuarter'] = 'Gurgaon'
df2.loc[df2['Company/Brand'] == "VMate", 'HeadQuarter'] = 'Gurgaon'
df2.loc[df2['Company/Brand'] == "Slintel", 'HeadQuarter'] = 'California'
df2.loc[df2['Company/Brand'] == "Ninjacart", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == "Zebu", 'HeadQuarter'] = 'London'
df2.loc[df2['Company/Brand'] == "Phable", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == "Zolostays", 'HeadQuarter'] = 'Bengaluru'
df2.loc[df2['Company/Brand'] == 'Cubical Labs', 'HeadQuarter'] = 'New Delhi'
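
The assignments above overwrite the headquarters of every listed company, whether or not the value was missing. A dictionary combined with `fillna` fills only the gaps. This is a sketch on toy rows that uses three of the looked-up values. 'Other Co' is a made-up company with no lookup entry.

```python
import pandas as pd

# A few of the looked-up values from the cell above.
headquarter_lookup = {"Bombay Shaving": "Gurugram", "Witblox": "Mumbai", "Zebu": "London"}

# Toy rows for illustration; 'Other Co' is hypothetical.
toy = pd.DataFrame({
    "Company/Brand": ["Bombay Shaving", "Witblox", "Zebu", "Other Co"],
    "HeadQuarter": [None, "Mumbai", None, None],
})

# Fill only the missing entries; existing values are left untouched.
looked_up = toy["Company/Brand"].map(headquarter_lookup)
toy["HeadQuarter"] = toy["HeadQuarter"].fillna(looked_up)
print(toy)
```

Companies without a lookup entry stay null, which keeps the remaining gaps visible in the follow-up null check.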


In [None]:
# below we are replacing some names within the columns with their official names.
# This ensures uniformity of the names

df2.loc[~df2['HeadQuarter'].str.contains('New Delhi', na=False), 'HeadQuarter'] = df2['HeadQuarter'].str.replace('Delhi', 'New Delhi')
df2["HeadQuarter"] = df2["HeadQuarter"].replace (['Bangalore','Bangalore City'], 'Bengaluru')
df2['HeadQuarter'] = df2['HeadQuarter'].replace (['Gurgaon'], 'Gurugram')

In [None]:
df2[df2['HeadQuarter'].isnull()] #Check if all null values in HeadQuarter have been filled

Let's deal with the missing values in the Sector column.

Only a few values are missing from the "Sector" column, so instead of imputing with the mode (the most frequent value), we fill each one individually with the sector found by looking up the company.

In [None]:
#fillna values in Sector column by Google Search
df2.loc[df2['Company/Brand'] == 'VMate', 'Sector'] = 'Short Video Platform'
df2.loc[df2['Company/Brand'] == 'Awign Enterprises', 'Sector'] = 'Workforce Solutions'
df2.loc[df2['Company/Brand'] == 'TapChief', 'Sector'] = 'Online Consulting'
df2.loc[df2['Company/Brand'] == 'KredX', 'Sector'] = 'Fintech'
df2.loc[df2['Company/Brand'] == 'm.Paani', 'Sector'] = 'E-commerce'

In [None]:
df2['Sector'].isnull().sum() # confirming the null values again
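
Where looking up each company by hand is not practical, the mode imputation mentioned above could be sketched roughly as follows. The `toy` frame and its values are made up for illustration and are not taken from the funding data.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these values are not from the funding dataset
toy = pd.DataFrame({'Sector': ['Fintech', 'Edtech', np.nan, 'Fintech', np.nan]})

# mode() returns a Series (there can be ties), so take the first entry
mode_value = toy['Sector'].mode().iloc[0]

# Assign the result back instead of relying on inplace=True
toy['Sector'] = toy['Sector'].fillna(mode_value)

print(toy['Sector'].tolist())
```

Mode imputation pushes every unknown row into the largest category, so it is best kept for columns where only a small share of values is missing.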

NOW LET'S DEAL WITH THE STAGE COLUMN 

BUT FIRST LET'S RE-ORDER THE STAGE COLUMN

In [None]:
df2['Stage'].value_counts() # checking for value counts in the stage column

To deal with the missing values in the Stage column, we will take the six most frequent stages
and use their relative frequencies (percentages) to fill in the missing values.


In [None]:
cross_table_Sector_Stage_2 = pd.crosstab(df2['Stage'], ['Stage']) # counting how often each stage occurs; the rows must come from the Stage column so that the top six values are stages, not sectors
cross_table_Sector_Stage_2

In [None]:
# below we are getting the percentages 
cross_table_Sector_Stage_per_2 = (cross_table_Sector_Stage_2['Stage'] / cross_table_Sector_Stage_2['Stage'].sum()) * 100
cross_table_Sector_Stage_per_2

In [None]:
# here we are looking at the top six stages 
top_six_stages_2 = cross_table_Sector_Stage_per_2.nlargest(6)
top_six_stages_2

NOW LET'S FILL IN THE MISSING VALUES IN THE STAGE COLUMN, USING VALUES DRAWN FROM THE TOP SIX
STAGES


In [None]:
# Filling missing values in the "Stage" column with values drawn at random from the top six stages

# Normalize the percentages so they sum to 1 and can be used as probabilities
normalize_prob_2 = top_six_stages_2 / top_six_stages_2.sum()
# Draw one candidate per row; index=df2.index keeps the draws aligned with df2's rows, because fillna matches on index
random_stages_2 = pd.Series(np.random.choice(top_six_stages_2.index.tolist(), size=len(df2), p=normalize_prob_2.values), index=df2.index)
df2['Stage'] = df2['Stage'].fillna(random_stages_2)

In [None]:
df2['Stage'].isnull().sum() # let's confirm the null values in Stage column again
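
The frequency-weighted fill described above can be tried on a small made-up frame. This is only a sketch: the `toy` data, the seed and the use of `np.random.default_rng` are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the index is deliberately not 0..n-1
toy = pd.DataFrame(
    {'Stage': ['Seed', 'Seed', 'Series A', np.nan, 'Seed', np.nan]},
    index=[10, 11, 12, 13, 14, 15],
)

# Share of each observed stage; normalize=True makes the shares sum to 1
stage_share = toy['Stage'].value_counts(normalize=True)

rng = np.random.default_rng(42)   # fixed seed so the fill is reproducible
missing = toy['Stage'].isna()

# Draw exactly one stage per missing row, weighted by the observed shares
toy.loc[missing, 'Stage'] = rng.choice(
    stage_share.index.to_numpy(), size=missing.sum(), p=stage_share.to_numpy()
)

print(toy['Stage'].isna().sum())
```

Writing through a boolean mask avoids any dependence on how the frame's index is numbered, and fixing the seed means the same rows receive the same values on every run.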

In [None]:
df2.isnull().sum() # confirming the missing values in the second data set

In [None]:
df2['HeadQuarter'].unique() # let's get some idea about the unique values in the HeadQuarter column

In [None]:
df2['Sector'].unique() # now let's look at the unique values of the 'Sector' column

In [None]:
df2['Stage'].unique() # now let's look at the unique values of the 'Stage' column

In [None]:
df2['Stage'].value_counts()

In [None]:
df2[df2['Stage'] == 'AgriTech'] # checking for rows where a sector name appears in the Stage column

In [None]:
df2.loc[df2['Company/Brand'] == 'Zolostays', 'Stage'] = 'Series B'

In [None]:
df2.loc[df2['Company/Brand'] == 'Cub McPaws', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'truMe', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'MyGameMate', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Smart Institute', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Spinny', 'Stage'] = 'Series B'

In [None]:
df2.loc[df2['Company/Brand'] == 'DROR Labs Pvt. Ltd.', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Asteria Aerospace', 'Stage'] = 'Series B'
df2.loc[df2['Company/Brand'] == 'Binca Games', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Stanza Living', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'PiBeam', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'Credr', 'Stage'] = 'Series A'

In [None]:
df2.loc[df2['Company/Brand'] == 'FlytBase', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Lil’ Goodness and sCool meal	', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Origo', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Cuemath', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'Phable', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Sarva', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Zoomcar', 'Stage'] = 'Series C'

In [None]:

df2.loc[df2['Company/Brand'] == 'Appnomic', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'Finly', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'LivFin', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'Afinoz', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Box8', 'Stage'] = 'Series C'
df2.loc[df2['Company/Brand'] == 'Ecom Express', 'Stage'] = 'Series B'
df2.loc[df2['Company/Brand'] == 'Nivesh.com', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Ola', 'Stage'] = 'Series F'
df2.loc[df2['Company/Brand'] == 'Ess Kay Fincorp', 'Stage'] = 'Series D'


In [None]:
df2.loc[df2['Company/Brand'] == 'Bombay Shaving', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Nu Genes', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'JobSquare', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == "Byju's", 'Stage'] = 'Series F'
df2.loc[df2['Company/Brand'] == 'Fireflies .ai', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Bombay Shirt Company', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'Slintel', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Ninjacart', 'Stage'] = 'Series C'
df2.loc[df2['Company/Brand'] == 'Euler Motors', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'Zolozstays', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'Oyo', 'Stage'] = 'Series D'


In [None]:
df2.loc[df2['Company/Brand'] == 'Open Secret', 'Stage'] = 'Series C'
df2.loc[df2['Company/Brand'] == 'Witblox', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'SalaryFits', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'Medlife', 'Stage'] = 'Series B'
df2.loc[df2['Company/Brand'] == 'Pumpkart', 'Stage'] = 'Seed'
df2.loc[df2['Company/Brand'] == 'VMate', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'WishADish', 'Stage'] = 'Series A'
df2.loc[df2['Company/Brand'] == 'Lawyered', 'Stage'] = 'Seed'


In [None]:
grouped_stages_2 = {
    # Group 1: Early Stage
    'Pre-seed': 'Early Stage',
    'Seed': 'Early Stage',
    'Seed A': 'Early Stage',
    'Seed Funding': 'Early Stage',
    'Seed Investment': 'Early Stage',
    'Seed Round': 'Early Stage',
    'Seed Round & Series A': 'Early Stage',
    'Seed fund': 'Early Stage',
    'Seed funding': 'Early Stage',
    'Seed round': 'Early Stage',
    'Seed+': 'Early Stage',

    # Group 2: Mid Stage
    'Series A': 'Mid Stage',
    'Series A+': 'Mid Stage',
    'Series A-1': 'Mid Stage',
    'Series A2': 'Mid Stage',
    'Series B': 'Mid Stage',
    'Series B+': 'Mid Stage',
    'Series B2': 'Mid Stage',
    'Series B3': 'Mid Stage',
    'Series C': 'Mid Stage',
    'Seies A': 'Mid Stage',
    
    # Group 3: Late Stage
    'Series D': 'Late Stage',
    'Series I': 'Late Stage',
    'Series D1': 'Late Stage',
    'Series E': 'Late Stage',
    'Series E2': 'Late Stage',
    'Series F': 'Late Stage',
    'Series F1': 'Late Stage',
    'Series F2': 'Late Stage',
    'Series G': 'Late Stage',
    'Series H': 'Late Stage',
    
    # Group 4: Other Stages
    'Angel': 'Other Stages',
    'Angel Round': 'Other Stages',
    'Bridge': 'Other Stages',
    'Bridge Round': 'Other Stages',
    'Corporate Round': 'Other Stages',
    'Debt': 'Other Stages',
    'Debt Financing': 'Other Stages',
    'Early seed': 'Other Stages',
    'Edge': 'Other Stages',
    'Fresh funding': 'Other Stages',
    'Funding Round': 'Other Stages',
    'Grant': 'Other Stages',
    'Mid series': 'Other Stages',
    'Non-equity Assistance': 'Other Stages',
    'None': 'Other Stages',
    'PE': 'Other Stages',
    'Post series A': 'Other Stages',
    'Post-IPO Debt': 'Other Stages',
    'Post-IPO Equity': 'Other Stages',
    'Pre Series A': 'Other Stages',
    'Pre- series A': 'Other Stages',
    'Pre-Seed': 'Other Stages',
    'Pre-Series B': 'Other Stages',
    'Private Equity': 'Other Stages',
    'Secondary Market': 'Other Stages',
    'Pre-series A': 'Other Stages',
    'Pre-series B':'Other Stages',
    'Pre-series A1': 'Other Stages',
    'Pre-series':'Other Stages',
    'Pre series A':'Other Stages'
}

df2['Stage'] = df2['Stage'].replace(grouped_stages_2)
df2['Stage']
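
Many keys in the mapping above differ only in capitalisation or spacing. One possible way to shorten such a mapping is to normalise each label first and then apply a few rules. The helper `group_stage` below is hypothetical and its rules are deliberately simplified; it does not reproduce every choice made in `grouped_stages_2` (for example, labels starting with "Pre" all fall into 'Other Stages' here).

```python
import pandas as pd

# Toy labels for illustration only
stages = pd.Series(['Seed Round', 'seed round', ' Seed funding', 'Series A', 'Series D', 'Debt'])

def group_stage(label):
    """Map a raw stage label to a coarse group (hypothetical helper, simplified rules)."""
    text = label.strip().lower()          # ignore case and surrounding spaces
    if text.startswith('seed'):
        return 'Early Stage'
    if text.startswith('series'):
        # Series A to C are treated as mid stage, later letters as late stage
        letter = text[len('series'):].strip()[:1]
        return 'Mid Stage' if letter in ('a', 'b', 'c') else 'Late Stage'
    return 'Other Stages'

grouped = stages.map(group_stage)
print(grouped.tolist())
```

An explicit dictionary is easier to audit label by label, while a rule-based helper copes better with spellings that were not anticipated. Which one fits depends on how messy the raw labels are.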

In [None]:
# keeping only the valid stages: sector names that ended up in the Stage column are set to NaN

unwanted_stages = ['Fintech', 'Technology', 'AgriTech', 'E-commerce', 'Edtech']
df2['Stage'] = df2['Stage'].replace(unwanted_stages, np.nan)

In [None]:
df2['Stage'].isnull().sum() # counting the null values in the Stage column

In [None]:
df2['Stage'].count() # getting the number of non-null values in the Stage column

In [None]:
# getting the mode of the non-null values 

non_null_values_stg = df2['Stage'].dropna()
mode_non_null_stg = non_null_values_stg.mode().iloc[0]  # mode() returns a Series, so take its first value

In [None]:
df2['Stage'] = df2['Stage'].fillna(mode_non_null_stg) # filling in the null values with the mode

In [None]:
df2['Stage'] = df2['Stage'].astype(str) # converting the stage column to string only after the nulls are filled, so NaN cannot turn into the text 'nan'

In [None]:
df2['Stage'].isnull().sum() # checking for null values again

In [None]:
df2.isnull().sum() # let's check for null values and sum them up 

Standardizing Data Formats

Now let's see how we can standardize the data set so that all data points share the same format.

First, let's check for dash symbols within the columns using a simple Python loop.

In [None]:
# checking for '-' symbol within the columns

columns_to_check2 = ['Company/Brand', 'HeadQuarter', 'Sector', 'What it does', 'Amount($)', 'Stage']

for column2 in columns_to_check2:
    has_dash_symbols2 = df2[column2].astype(str).str.contains('-').any()
    print(f'{column2}: {has_dash_symbols2}')

In [None]:
# checking for currency symbol 

columns_to_check2 = ['Company/Brand','HeadQuarter', 'Sector', 'What it does', 'Amount($)']

for column2 in columns_to_check2:
    has_currency_symbols = df2[column2].astype(str).str.contains('[$₹]').any()
    print(f'{column2}: {has_currency_symbols}')

In [None]:
# replacing cells that contain only a '-' symbol with NaN

dash_currency_columns = ['Sector', 'What it does', 'Stage']

for dash_columns2 in dash_currency_columns:
    df2[dash_columns2] = df2[dash_columns2].replace('-', np.nan)  # assign back; replace(..., inplace=True) returns None

Now let's handle the dash symbols in the Amount column, then clean and format the column correctly.

In [None]:
df2['Amount($)'].unique() # let's check for unique values 

In [None]:
# Cleaning the Amount column of the 2019 data: removing currency symbols and thousands separators
df2['Amount($)'] = df2['Amount($)'].astype(str).str.replace(r'[₹$,]', '', regex=True)
df2['Amount($)'] = df2['Amount($)'].str.replace('Undisclosed', '0', regex=False)  # undisclosed amounts are recorded as 0
df2['Amount($)'] = df2['Amount($)'].str.replace('—', '0', regex=False)  # dash placeholders are recorded as 0

In [None]:
df2['Amount($)'] = df2['Amount($)'].astype(float) # here we are converting the amount column to float data type 
df2['Amount($)'].dtype # confirming the new data type without relying on a row labelled 0

In [None]:
df2['Amount($)'] # here we are looking at the Amount column 

In [None]:
df2['Amount($)'].unique() # this line of code looks at the unique value 

In [None]:
df2['Amount($)'].isnull().sum()
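
An alternative to recording undisclosed amounts as 0 is to keep them as missing values, so they do not pull averages down. The sketch below assumes that is acceptable for the analysis; the `raw` values are made up for illustration.

```python
import pandas as pd

# Toy values for illustration only
raw = pd.Series(['$1,200,000', '₹500000', 'Undisclosed', '—', None])

# Strip currency symbols and thousands separators; missing entries stay missing
cleaned = raw.str.replace(r'[₹$,]', '', regex=True)

# errors='coerce' turns anything that is not a number into NaN instead of raising
amount = pd.to_numeric(cleaned, errors='coerce')

print(amount.isna().sum())
```

With this variant, sums are unchanged compared with filling in 0, but means and medians are computed over disclosed amounts only.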

### Clean Text Data

In [None]:
# Clean Company Name column
df2['Company/Brand'] = df2['Company/Brand'].str.strip()  # Remove leading and trailing spaces
df2['Company/Brand'] = df2['Company/Brand'].str.title()  # Standardize capitalization

# Clean Sector column
df2['Sector'] = df2['Sector'].str.strip()  # Remove leading and trailing spaces
df2['Sector'] = df2['Sector'].str.title()  # Standardize capitalization

# Clean About Company column
df2['What it does'] = df2['What it does'].str.strip()  # Remove leading and trailing spaces

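import re  # needed by clean_text

# Function to handle special characters or encoding issues
def clean_text(text):
    # Leave missing or non-text values (e.g. NaN) unchanged
    if not isinstance(text, str):
        return text
    # Remove everything except letters, digits and whitespace
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return cleaned_text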

# Apply the clean_text function to the About Company column
df2['What it does'] = df2['What it does'].apply(clean_text)

# Print the cleaned DataFrame
df2.head()

In [None]:
# Dropping the columns that are not important to our analysis

df2.drop(columns=['Founded','Founders','Investor'], inplace=True)

In [None]:
df2.insert(6,"Funding Year", 2019) # here we are inserting a new column to keep track of the data set after combining 

In [None]:
# below we are renaming the columns to ensure consistency 

df2.rename(columns = {'Company/Brand':'Company',
                        'HeadQuarter':'Location',
                        'Amount($)':'Amount',
                        'What it does':'About'},
             inplace = True)

In [None]:
df2.head() # let's confirm the data set by looking at the head before we save it 

In [None]:
df2.to_csv('df_19.csv', index=False) # here we are saving the set and naming it df_19.csv

In [None]:
df2.isnull().sum() # checking to confirm if any of the columns still have nan

NOW LET'S WORK ON THE THIRD DATA SET 2020

### 2020 Data

In [None]:
df3.head() #showing the first five rows

In [None]:
df3.info() # Get information about the df3 dataframe

In [None]:
df3.columns # listing the column names

In [None]:
df3.describe(include='object') # Getting general descriptive statistics for the object columns of the df3 DataFrame

In [None]:
df3.describe(include='float') # Getting general descriptive statistics for float columns

#### Handling Duplicated Data

In [None]:
# checking for duplicated values 

columns_to_check3 = ['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders', 'Investor', 'Amount', 'Stage']
for column2 in columns_to_check3:
    has_duplicates2 = df3[column2].duplicated().any()
    print(f'{column2}: {has_duplicates2}')

In [None]:
# below we are dropping the duplicate rows 

df3.drop_duplicates(subset=['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders', 'Investor', 'Amount', 'Stage'], inplace=True)
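
Checking a single column for duplicates and checking whole rows answer different questions. A repeated city or sector is normal, whereas a fully repeated row is a genuine duplicate. The sketch below shows the difference on a made-up frame; the `toy` data is an assumption for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Company_Brand': ['A', 'B', 'B', 'C'],
    'HeadQuarter':   ['Pune', 'Mumbai', 'Mumbai', 'Pune'],
    'Amount':        [100, 200, 200, 300],
})

# Column-wise check: True as soon as any value repeats within the column
repeats_in_city_column = toy['HeadQuarter'].duplicated().any()

# Row-wise check: counts rows that repeat an earlier row in every column
duplicate_rows = toy.duplicated().sum()

# drop_duplicates keeps the first occurrence of each repeated row
deduplicated = toy.drop_duplicates()

print(repeats_in_city_column, duplicate_rows, len(deduplicated))
```

The row-wise count is the number that should fall to zero after `drop_duplicates`, so it is the more useful confirmation before and after that step.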

#### Handling Categorical Data

In [None]:
df3.isna().sum() # looking for missing values in df3

In [None]:
df3['HeadQuarter'].unique() #displaying the unique values found in the 'HeadQuarter' column.

In [None]:
# we are filling in the HeadQuarter values with locations researched on Google

df3.loc[df3['Company_Brand'] == 'Habitat', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Wealth Bucket', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'EpiFi', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'XpressBees', 'HeadQuarter'] = 'Pune'
df3.loc[df3['Company_Brand'] == 'Shiksha', 'HeadQuarter'] = 'Noida'
df3.loc[df3['Company_Brand'] == 'Byju', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Zomato', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'Rentomojo', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Mamaearth', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'HaikuJAM', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Testbook', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Techbooze', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Rheo', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Klub', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'TechnifyBiz', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Aesthetic Nutrition', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Gamerji', 'HeadQuarter'] = 'Ahmedabad'
df3.loc[df3['Company_Brand'] == 'Phenom People', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Teach Us', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Invento Robotics', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Kristal AI', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Samya AI', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Skylo', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'SmartKarrot', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Park+', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'LogiNext', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'MoneyTap', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'RACEnergy', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Oye! Rickshaw', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Fleetx', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'Raskik', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'Pravasirojgar', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Kaagaz Scanner', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'Exprs', 'HeadQuarter'] = 'Madhapur'
df3.loc[df3['Company_Brand'] == 'Verloop.io', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Otipy', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'Daalchini', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Suno India', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Eden Smart Homes', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Bijnis', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Oziva', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Yulu', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Peppermint', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Jiffy ai', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Postman', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'F5', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Myelin Foundry', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'iNurture Education', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Credgencies', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'Vahak', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Illumnus', 'HeadQuarter'] = 'Gurgaon'
df3.loc[df3['Company_Brand'] == 'Juicy Chemistry', 'HeadQuarter'] = 'Coimbatore'
df3.loc[df3['Company_Brand'] == 'Shiprocket', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Phable', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Generic Aadhaar', 'HeadQuarter'] = 'Thane'
df3.loc[df3['Company_Brand'] == 'Nium', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'DailyHunt', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Pedagogy', 'HeadQuarter'] = 'Ahmedabad'
df3.loc[df3['Company_Brand'] == 'Sarva', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'NIRA', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Indusface', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Morning Context', 'HeadQuarter'] = 'Singapore'
df3.loc[df3['Company_Brand'] == 'Savvy Co op', 'HeadQuarter'] = 'New York'
df3.loc[df3['Company_Brand'] == 'BLive', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Toch', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Setu', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Rebel Foods', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Amica', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Fingerlix', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Zupee', 'HeadQuarter'] = 'Gurugram'
df3.loc[df3['Company_Brand'] == 'DeHaat', 'HeadQuarter'] = 'Patna'
df3.loc[df3['Company_Brand'] == 'Akna Medical', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'RaRa Delivery', 'HeadQuarter'] = 'Jakarta'
df3.loc[df3['Company_Brand'] == 'Obviously AI', 'HeadQuarter'] = 'San Francisco'
df3.loc[df3['Company_Brand'] == 'CoinDCX', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'NuNu TV', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Fintso', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Smart Coin', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Shop101', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Neeman', 'HeadQuarter'] = 'Hyderabad'
df3.loc[df3['Company_Brand'] == 'Invideo', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'AvalonMeta', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'SmartVizX', 'HeadQuarter'] = 'Noida'
df3.loc[df3['Company_Brand'] == 'Carbon Clean', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'Onsitego', 'HeadQuarter'] = 'Mumbai'
df3.loc[df3['Company_Brand'] == 'Nova Credit', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'HempStreet', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Classplus', 'HeadQuarter'] = 'Noida'
df3.loc[df3['Company_Brand'] == 'Chaayos', 'HeadQuarter'] = 'New Delhi'
df3.loc[df3['Company_Brand'] == 'Altor', 'HeadQuarter'] = 'Bengaluru'
df3.loc[df3['Company_Brand'] == 'WorkIndia', 'HeadQuarter'] = 'Mumbai'
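
A long run of `.loc` assignments like the one above can also be expressed as a lookup table. The sketch below is one possible variant, not the notebook's method: the names `toy` and `hq_lookup` are hypothetical, and unlike the cell above it only fills rows whose headquarters is missing instead of overwriting existing values.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the notebook itself works on df3
toy = pd.DataFrame({
    'Company_Brand': ['Habitat', 'XpressBees', 'Unknown Co', 'Zomato'],
    'HeadQuarter':   [np.nan, np.nan, np.nan, 'Gurugram'],
})

# Researched headquarters keyed by company name (three entries copied from the cell above)
hq_lookup = {'Habitat': 'Bengaluru', 'XpressBees': 'Pune', 'Zomato': 'Gurugram'}

# map() gives NaN for companies without an entry, so those rows stay missing after fillna
toy['HeadQuarter'] = toy['HeadQuarter'].fillna(toy['Company_Brand'].map(hq_lookup))

print(toy)
```

Keeping the researched values in one dictionary makes them easier to review and to reuse across the yearly data sets.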

In [None]:
# below we are reformatting the HeadQuarter column with the official city names
df3.loc[~df3['HeadQuarter'].str.contains('New Delhi', na=False), 'HeadQuarter'] = df3['HeadQuarter'].str.replace('Delhi', 'New Delhi')
df3["HeadQuarter"] = df3["HeadQuarter"].replace (['Bangalore','Banglore','Bangalore City'], 'Bengaluru')
df3['HeadQuarter'] = df3['HeadQuarter'].replace (['Gurgaon'], 'Gurugram')
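
The same city-name normalisation can be written without a row mask by using a regular expression with a negative lookbehind. This is a sketch on made-up values; the `hq` series is an assumption for illustration.

```python
import pandas as pd

# Toy values for illustration only
hq = pd.Series(['Delhi', 'New Delhi', 'Bangalore', 'Banglore', 'Gurgaon', None])

# (?<!New ) is a negative lookbehind: 'Delhi' matches only when 'New ' does not come right before it
hq = hq.str.replace(r'(?<!New )Delhi', 'New Delhi', regex=True)

# A dictionary maps each spelling variant to the official name
hq = hq.replace({'Bangalore': 'Bengaluru', 'Banglore': 'Bengaluru', 'Gurgaon': 'Gurugram'})

print(hq.tolist())
```

Because the pattern itself skips strings that already read 'New Delhi', running the cell twice does not produce 'New New Delhi'.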

In [None]:
df3["column10"].value_counts() # Calculate the frequency count of unique values in the "column10" column

In [None]:
df3[df3['column10'].isin(['Pre-Seed','Seed Round'])] #checking if the values in the 'column10' column match either 'Pre-Seed' or 'Seed Round'.

In [None]:
df3['Sector'].unique() # checking for unique values in the Sector column 

In [None]:
df3[df3['Sector'].isnull()] # we are checking for null values

In [None]:
# we are replacing the null values with the actual data found by searching on Google

df3.loc[df3['Company_Brand'] == 'Text Mercato', 'Sector'] = 'E-commerce Technology'
df3.loc[df3['Company_Brand'] == 'Magicpin', 'Sector'] = 'Hyperlocal Services'
df3.loc[df3['Company_Brand'] == 'Leap Club', 'Sector'] = 'Professional Networking'
df3.loc[df3['Company_Brand'] == 'Juicy Chemistry', 'Sector'] = 'Organic Skincare'
df3.loc[df3['Company_Brand'] == 'Servify', 'Sector'] = 'Technology Services'
df3.loc[df3['Company_Brand'] == 'Wagonfly', 'Sector'] = 'Retail Technology'
df3.loc[df3['Company_Brand'] == 'DrinkPrime', 'Sector'] = 'Water Technology'
df3.loc[df3['Company_Brand'] == 'Kitchens Centre', 'Sector'] = 'Food Service Infrastructure'
df3.loc[df3['Company_Brand'] == 'Innoviti', 'Sector'] = 'Fintech'
df3.loc[df3['Company_Brand'] == 'Brick&Bolt', 'Sector'] = 'Construction and Real Estate'
df3.loc[df3['Company_Brand'] == 'Toddle', 'Sector'] = 'EdTech'
df3.loc[df3['Company_Brand'] == 'HaikuJAM', 'Sector'] = 'EdTech'

In [None]:
df3[df3['Sector'].isnull()] # checking to confirm the null values 

In [None]:
df3['Stage'].unique() # checking the unique values in the Stage column

LET'S CLEAN THE STAGE COLUMN 

BELOW WE ARE GROUPING THE VALUES IN THE STAGE COLUMN 

In [None]:
grouped_stages_3 = {
    'Pre-seed': 'Early Stage',
    'Seed': 'Early Stage',
    'Seed A': 'Early Stage',
    'Seed Funding': 'Early Stage',
    'Seed Investment': 'Early Stage',
    'Seed Round': 'Early Stage',
    'Seed Round & Series A': 'Early Stage',
    'Seed fund': 'Early Stage',
    'Seed funding': 'Early Stage',
    'Seed round': 'Early Stage',
    'Seed+': 'Early Stage',
    'Series A': 'Mid Stage',
    'Series A+': 'Mid Stage',
    'Series A-1': 'Mid Stage',
    'Series A2': 'Mid Stage',
    'Series B': 'Mid Stage',
    'Series B+': 'Mid Stage',
    'Series B2': 'Mid Stage',
    'Series B3': 'Mid Stage',
    'Series C': 'Mid Stage',
    'Series D': 'Late Stage',
    'Series I': 'Late Stage',
    'Series D1': 'Late Stage',
    'Series E': 'Late Stage',
    'Series E2': 'Late Stage',
    'Series F': 'Late Stage',
    'Series F1': 'Late Stage',
    'Series F2': 'Late Stage',
    'Series G': 'Late Stage',
    'Series H': 'Late Stage',
    'Angel': 'Other Stages',
    'Angel Round': 'Other Stages',
    'Bridge': 'Other Stages',
    'Bridge Round': 'Other Stages',
    'Corporate Round': 'Other Stages',
    'Debt': 'Other Stages',
    'Debt Financing': 'Other Stages',
    'Early seed': 'Other Stages',
    'Edge': 'Other Stages',
    'Fresh funding': 'Other Stages',
    'Funding Round': 'Other Stages',
    'Grant': 'Other Stages',
    'Mid series': 'Other Stages',
    'Non-equity Assistance': 'Other Stages',
    'None': 'Other Stages',
    'PE': 'Other Stages',
    'Post series A': 'Other Stages',
    'Post-IPO Debt': 'Other Stages',
    'Post-IPO Equity': 'Other Stages',
    'Pre Series A': 'Other Stages',
    'Pre-Series B': 'Other Stages',
    'Private Equity': 'Other Stages',
    'Secondary Market': 'Other Stages',
    'Pre-series A': 'Other Stages',
    'Pre-seed Round': 'Other Stages',
    'Pre series C': 'Other Stages',
    'Pre series A1': 'Other Stages',
    'Pre seed round': 'Other Stages',
    'Pre seed Round': 'Other Stages',
    'Pre series A': 'Other Stages',
    'Pre series B': 'Other Stages',
    'Pre- series A': 'Other Stages',
    'Pre-Seed': 'Other Stages',
    'Pre-series': 'Other Stages',
    'Pre-series B': 'Other Stages',
    'Pre-series C': 'Other Stages',
    'Series C, D': 'Other Stages'
}

df3['Stage'] = df3['Stage'].replace(grouped_stages_3)


HANDLING THE HEADQUARTER COLUMN 


In [None]:
df3[df3['HeadQuarter'].isnull()]

FOR NOW LET'S REPLACE ALL THE 'NONE' WITH NAN VALUES 

In [None]:
df3['Stage'] = df3['Stage'].astype(str) # here we convert all the values to string so we can replace all the None values
df3['Stage'] = df3['Stage'].replace(['None', 'nan'], np.nan) # astype(str) can also turn real NaN into the text 'nan', so both are set back to NaN


In [None]:
df3['Stage'] # now we confirm the stage column again

WE WILL BE REPLACING THE NULL VALUES IN THE STAGE COLUMN USING THE 
SIX MOST FREQUENT STAGES, BUT TO GET THOSE WE NEED THE SECTOR COLUMN 

WHICH MEANS WE NEED TO DEAL WITH THE NULL VALUES IN THE SECTOR COLUMN FIRST

In [None]:
df3['Sector'].count() # getting the total values in the Sector column 

In [None]:
# below we are confirming the share of null values in the Sector column 
missing_values3 = df3['Sector'].isnull().sum()

percent_miss_sec_3 = (missing_values3 / len(df3['Sector'])) * 100
percent_miss_sec_3

Since only about 1.25% of the values in the 'Sector' column are missing, it is reasonable to impute the null values with the mode of the 'Sector' column, as done below:


In [None]:
non_null_values_3 = df3[df3['Sector'].notnull()]  # Filtering rows with a non-null Sector
mode_sector = non_null_values_3['Sector'].mode().iloc[0]  # Getting the mode value
df3['Sector'] = df3['Sector'].fillna(mode_sector)  # Imputing null values with the mode

In [None]:
# NOW LET'S CONFIRM THE NULL VALUES AGAIN 
df3['Sector'].isnull().sum()
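
The two steps above (measuring the share of missing values, then imputing with the mode) can be sketched on a toy column. The values are made up for illustration and are not taken from the 2020 data.

```python
import numpy as np
import pandas as pd

# Toy sector column (hypothetical values) with one missing entry out of eight
sector = pd.Series(['FinTech', 'EdTech', 'FinTech', np.nan,
                    'AgriTech', 'FinTech', 'EdTech', 'HealthTech'])

missing_share = sector.isna().mean() * 100   # percentage of ALL rows that are missing
mode_value = sector.mode().iloc[0]           # mode() ignores NaN by default
filled = sector.fillna(mode_value)

print(f"{missing_share:.2f}% missing, imputing with {mode_value!r}")
```

Dividing by the full column length (which is what `isna().mean()` does) gives the share of the whole column, whereas dividing by `count()` would give the ratio of missing to non-missing rows.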

NOW WE CAN USE OUR STRATEGY TOGETHER WITH THE SECTOR COLUMN TO FILL IN THE 
NAN VALUES FOR THE STAGE COLUMN 

In [None]:
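# creating the contingency table of stage (rows) against sector (columns)
# margins=True adds an 'All' row and column holding the totals per stage and per sector
conting_tabl_3 = pd.crosstab(df3['Stage'], df3['Sector'], margins=True)
conting_tabl_3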

FINDING THEIR PERCENTAGES

In [None]:
# the stage counts below are read from the table above; their sum is the number of non-null Stage values
total_non_null = 175 + 35 + 206 + 174

percent_early_stage = (175 / total_non_null) * 100
percent_late_stage = (35 / total_non_null) * 100   # here we are getting their percentages 
percent_mid_stage = (206 / total_non_null) * 100
percent_other_stage = (174 / total_non_null) * 100

In [None]:
percent_early_stage, percent_late_stage,percent_mid_stage,percent_other_stage # here the percentages displayed below 

NOW WE WILL TAKE THE STAGE GROUPS AND THEIR PERCENTAGES AND USE THEM 
AS WEIGHTS FOR A RANDOMIZED CHOICE TO FILL IN THE NULL VALUES 

In [None]:
stage_percentages = {
    'Early Stage': percent_early_stage,
    'Late Stage': percent_late_stage,
    'Mid Stage': percent_mid_stage,           # MAPPING EACH STAGE GROUP TO ITS PERCENTAGE 
    'Other Stages': percent_other_stage
}

Filling in the null values in the 'Stage' column proportionally using the apply method and a lambda function:

In [None]:
# BELOW WE ARE FILLING IN THE MISSING VALUES

total_prob = sum(stage_percentages.values())
normalized_probs = [prob / total_prob for prob in stage_percentages.values()]

df3['Stage'] = df3['Stage'].apply(lambda x: np.random.choice(list(stage_percentages.keys()), p=normalized_probs) if pd.isnull(x) else x)


In [None]:
df3['Stage'].isnull().sum() # CONFIRMING THE NULL VALUES AGAIN 
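
The proportional fill above draws from the global random state, so each run gives a different result. The sketch below shows the same idea with a seeded generator, so the imputation can be reproduced. The data, the seed and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed (arbitrary) so the draw is reproducible

# Toy stage column (hypothetical values)
stage = pd.Series(['Early Stage', 'Mid Stage', np.nan, 'Mid Stage',
                   np.nan, 'Late Stage', 'Mid Stage', np.nan])

# Observed proportion of each category among the non-null rows (sums to 1)
probs = stage.value_counts(normalize=True)

# Draw one replacement per missing row and write it only into the null positions
mask = stage.isna()
filled = stage.copy()
filled[mask] = rng.choice(probs.index.to_numpy(), size=mask.sum(), p=probs.to_numpy())

print(filled.tolist())
```

Taking the weights from `value_counts(normalize=True)` avoids hard-coding counts that have to be re-read whenever the data changes.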

In [None]:
df3["Amount"].value_counts()# Calculate the frequency count of unique values in the "Amount" column

In [None]:
# checking for the '—' (long dash) placeholder within the columns
df3_to_check_colomns = ['Company_Brand','HeadQuarter', 'Sector', 'What_it_does','Stage','Amount']
for col in df3_to_check_colomns:
    dash_symbols = df3[col].astype(str).str.contains('—').any()
    print(f"{col}: {dash_symbols}")

In [None]:
# checking for '$' symbol within the columns
df3_to_check_colomns = ['Company_Brand','HeadQuarter', 'Sector', 'What_it_does','Stage','Amount']

for col in df3_to_check_colomns:
    # regex=False matches a literal '$'; read as a regex, '$' is the end-of-string anchor and matches every value
    dollar_symbols = df3[col].astype(str).str.contains('$', regex=False).any()
    print(f"{col}: {dollar_symbols}")
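
When searching for a dollar sign, it matters whether the pattern is read as a regular expression. A small sketch on made-up values shows the difference between the three ways of writing the check:

```python
import pandas as pd

amounts = pd.Series(['$1000', '2000', 'Undisclosed'])  # hypothetical values

as_regex = amounts.str.contains('$')                  # regex: '$' is the end-of-string anchor
as_literal = amounts.str.contains('$', regex=False)   # literal dollar sign
as_escaped = amounts.str.contains(r'\$')              # escaped regex, same result as literal

print(as_regex.tolist())
print(as_literal.tolist())
```

The unescaped regex reports a match for every value, so it cannot tell which columns actually contain the symbol.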

# Converting Amounts in Indian Rupees to US Dollars

In [None]:

c = CurrencyRates()  # Instantiate an object of the CurrencyRates class
inr_to_usd = c.get_rate('INR', 'USD')  # latest INR to USD rate, fetched once

# Creating temporary columns to help with the conversion of INR to USD
df3['Amount'] = df3['Amount'].astype(str)  # Convert 'Amount' column to string
is_inr = df3['Amount'].str.contains('₹', regex=False)  # rows quoted in rupees
# Strip currency symbols and commas; text such as 'Undisclosed' becomes NaN
numeric_amount = pd.to_numeric(df3['Amount'].str.replace(r'[₹$,]', '', regex=True), errors='coerce')
df3['Indiancurr'] = numeric_amount.where(is_inr, 0)  # rupee amounts only, 0 elsewhere
df3['UsCurr'] = numeric_amount.where(~is_inr, df3['Indiancurr'] * inr_to_usd)  # convert only the rupee rows
df3['Amount'] = df3['UsCurr'].replace(0, np.nan)  # zero amounts are treated as missing

# Defining a lambda function to format the amount
format_amount = lambda amount: "{:,.2f}".format(amount)

# Applying the formatting lambda function to the 'Amount' column
df3['Amount'] = df3['Amount'].map(format_amount)
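
The conversion logic can be tried offline on a few values. This sketch uses a fixed, hypothetical exchange rate (`INR_TO_USD` is an assumption, not a quoted rate) instead of a live lookup, and converts only the rows that carry the rupee symbol.

```python
import pandas as pd

INR_TO_USD = 0.012  # hypothetical fixed rate, for illustration only

raw = pd.Series(['₹1,000,000', '$250,000', '40000', 'Undisclosed'])  # made-up values

is_inr = raw.str.contains('₹', regex=False)               # which rows are in rupees
numeric = pd.to_numeric(raw.str.replace(r'[₹$,]', '', regex=True),
                        errors='coerce')                  # unparseable text becomes NaN
usd = numeric.where(~is_inr, numeric * INR_TO_USD)        # convert only the rupee rows

print(usd.tolist())
```

Flagging the rupee rows before stripping the symbols is what keeps the dollar amounts from being multiplied by the exchange rate as well.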


In [None]:
# Cleaning the Amounts column

df3['Amount'] = df3['Amount'].apply(str)
df3['Amount'] = df3['Amount'].replace(",", "", regex=True)
# '$' on its own is the regex end-of-string anchor, so it is escaped to match a literal dollar sign
for col in ['Amount', 'Company_Brand', 'HeadQuarter', 'Sector', 'What_it_does', 'Stage']:
    df3[col] = df3[col].replace(r"\$", "", regex=True)

In [None]:
# Remove leading or trailing spaces
df3['Amount'] = df3['Amount'].str.strip()

# Remove commas and symbols
df3['Amount'] = df3['Amount'].str.replace(',', '', regex=False)
df3['Amount'] = df3['Amount'].str.replace('$', '', regex=False)
# Add more replacements for other symbols as needed

# Convert 'Amount' column to float; values that cannot be parsed become NaN
df3['Amount'] = pd.to_numeric(df3['Amount'], errors='coerce')

# Set the float format
pd.options.display.float_format = '{:.2f}'.format

In [None]:
# View the values as a plain float array (NaN marks the missing amounts)
amount_values = np.asarray(df3['Amount'], dtype=float)

# Print the array of values
print(amount_values)

In [None]:
print(df3['Amount'].unique())

In [None]:
df3['Amount'] = df3['Amount'].astype(float) #converting the values in the "Amount" column of DataFrame df3 to the float data type.

In [None]:
df3["Amount"] # checking the amount column to confirm the changes 

DEALING WITH MISSING VALUES IN THE AMOUNT COLUMN IN DATA SET 2020

In [None]:
#creating the contingency table

conting_table_3 = pd.crosstab(df3['Sector'], df3['Amount'].isnull())

In [None]:
# below we are performing the chi-square test
chi2_3, p_value_3, _,_ = chi2_contingency(conting_table_3)
chi2_3, p_value_3  # display both the test statistic and the p-value

In [None]:
# we are interpreting the chi-square test 
significance_level_3 = 0.05

if p_value_3 < significance_level_3:
    print("There is a significant association between the missing values in the 'Amount' column and the 'Sector' column.")
else:
    print("There is no significant association between the missing values in the 'Amount' column and the 'Sector' column.")
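
To see what this test measures, here is a minimal sketch on synthetic data in which amounts are deliberately missing far more often in one sector than in the other. The sector labels and counts are made up; only `pd.crosstab` and `scipy.stats.chi2_contingency` are the same calls used above.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic data: 15 of 20 amounts missing in sector 'A', only 2 of 20 in sector 'B'
toy = pd.DataFrame({
    'Sector': ['A'] * 20 + ['B'] * 20,
    'Amount': [np.nan] * 15 + [1.0] * 5 + [np.nan] * 2 + [1.0] * 18,
})

# Rows: sector; columns: whether the amount is missing (False / True)
table = pd.crosstab(toy['Sector'], toy['Amount'].isnull())
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

A small p-value here means missingness depends on the sector, which would argue for imputing within each sector rather than with one global value.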

NOW LET'S CHECK FOR OUTLIERS TO EITHER RULE OUT MEAN IMPUTATION OR ACCEPT IT 

FIRST WE WILL USE THE BOX PLOT

In [None]:
# but we will use the non-null values to check for outliers and the statistical values 
non_null_value_3 = df3['Amount'].dropna()

# now let's create our box plot with the log scale
plt.boxplot(non_null_value_3)
plt.ylabel('Amount (log scale)')
plt.title('Box Plot of Non-null Values In The Amount Column')
plt.yscale('log')
plt.show()

FROM THE BOX PLOT OBSERVATION WE CAN SAY:


The box plot shows that the data points are concentrated toward the bottom of the range, while some points lie far away from the box, which indicates the presence of outliers. Outliers can significantly affect the mean, making it less representative of the central tendency of the data. In this case, using the median for imputation rather than the mean will be a more robust approach.

TO FURTHER UNDERSTAND THE DATA, LET'S USE A HISTOGRAM TO SEE THE DISTRIBUTION OF

DATA POINTS IN THE AMOUNT COLUMN

In [None]:
# Filtering out the null values 
non_null_value_3 = df3['Amount'].dropna()

# Creating a histogram of the amount column with log-scaled frequencies (log=True scales the y-axis)
plt.hist(non_null_value_3, bins=10, log=True)
plt.xlabel('Amount')
plt.ylabel('Frequency (Log Scale)')
plt.title('Histogram Of Non-Null Values In The Amount Column')
plt.show()

FROM THE ABOVE DISPLAY OF THE HISTOGRAM, WE CAN MAKE THE FOLLOWING DEDUCTIONS

The histogram shows the distribution of the 'Amount' column, indicating that the majority of values are concentrated in the lower range with high frequency, while the higher values are sparsely distributed.

This distribution pattern suggests that there may be a right-skewness or a long tail in the data. It indicates that there are relatively fewer instances with higher values compared to the instances with lower values.

With values concentrated in the lower range and only sparsely spread toward the higher range, the median is a suitable choice for imputation.

NOTE:

The median is a measure of central tendency that is less affected by outliers or extreme values compared to the mean. In our case, since there are some data points that are far away from the majority of values, using the median as an imputation method can provide a more robust estimate of the central value of the 'Amount' column.
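
A small numeric illustration of this point, using made-up funding amounts (five typical rounds plus one very large one):

```python
import numpy as np

# Made-up funding amounts in USD: five typical rounds plus one very large round
amounts = np.array([1e5, 2e5, 3e5, 4e5, 5e5, 1e9])

mean_all, median_all = amounts.mean(), np.median(amounts)
mean_typical, median_typical = amounts[:-1].mean(), np.median(amounts[:-1])

print(f"with outlier:    mean={mean_all:,.0f}  median={median_all:,.0f}")
print(f"without outlier: mean={mean_typical:,.0f}  median={median_typical:,.0f}")
```

A single extreme round moves the mean by several orders of magnitude, while the median shifts only slightly.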

In [None]:
# Replace 'NAN' strings with actual NaN values
df3['Amount'] = df3['Amount'].replace('NAN', np.nan)

In [None]:
# Filter the non-null values of the 'Amount' column:
non_null_amount_3 = df3['Amount'].dropna()

# Calculating the median of the non-null values:
median_value_3 = non_null_amount_3.median()

# Imputing the null values in the 'Amount' column with the median value:

df3['Amount'] = df3['Amount'].fillna(median_value_3)

In [None]:
df3['Amount'].isnull().sum() # confirming the null values in the Amount column Again to be sure 

In [None]:
df3['Amount']

In [None]:
df3['Amount'] = df3['Amount'].astype(float)  # Convert 'Amount' column back to float


In [None]:
df3['Amount'].isnull().sum()

In [None]:
df3 = df3.drop(['column10','Founded','Founders','Investor'], axis=1) #dropping specific columns from the DataFrame 

In [None]:
df3['Funding Year'] = 2020 # Assign 2020 to the 'Funding Year' column

In [None]:
new_column_names = {'Company_Brand': 'Company', 'What_it_does': 'About', 'HeadQuarter': 'Location'} # Renaming columns
df3 = df3.rename(columns=new_column_names)

In [None]:
df3 = df3.drop(['Indiancurr', 'UsCurr'], axis=1) # dropping these columns 


In [None]:
df3.head() # checking the head of the data to confirm before saving the data 

In [None]:
df3.isnull().sum() # checking if any of the column still have nan

In [None]:
df3['Amount'].median()

In [None]:
impute_value = df3['Amount'].median()
df3['Amount']= df3['Amount'].fillna(impute_value)

In [None]:
df3.isnull().sum()

In [None]:
# saving the clean data set

df3.to_csv('df_2020.csv', index=False)

#### 2021 Data

In [None]:
df4.head() #showing the first five rows

In [None]:
df4.shape #understanding the size of your DataFrame

In [None]:
df4.columns #retrieving the column names of the DataFrame

In [None]:
df4.info() #providing a summary of the DataFrame

In [None]:
df4.describe(include='object') #providing descriptive statistics for columns of object data type in the DataFrame

In [None]:
#dropping all duplicates in all the columns 
df4.drop_duplicates(inplace=True)

In [None]:
df4.isnull().sum() # looking for missing values in dataFrame

#### Handling Duplicated Data

In [None]:
df4 = df4.dropna(subset=['HeadQuarter']) # dropping the rows with nan in the HeadQuarter column
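
Dropping missing values from a single column and dropping rows from the DataFrame are different operations. The sketch below, on a made-up three-row frame, shows what each one returns.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, one of them without a headquarter
toy = pd.DataFrame({'Company_Brand': ['A', 'B', 'C'],
                    'HeadQuarter': ['Mumbai', np.nan, 'Pune']})

col_only = toy['HeadQuarter'].dropna()              # a new Series; toy itself is unchanged
rows_dropped = toy.dropna(subset=['HeadQuarter'])   # a new DataFrame without the NaN row

print(len(toy), len(col_only), len(rows_dropped))
```

Only the `subset=` form removes the incomplete rows from the table, and its result has to be assigned back to take effect.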

In [None]:
#checking for duplicate values in each column of the DataFrame df4
columns_to_check4 = ['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders', 'Investor', 'Amount', 'Stage']

for column4 in columns_to_check4:
    has_duplicates4 = df4[column4].duplicated().any()
    print(f'{column4}: {has_duplicates4}')

In [None]:
#removing any rows that have the same values in all the specified columns.
df4.drop_duplicates(subset=['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does', 'Founders', 'Investor', 'Amount', 'Stage'], inplace=True)

#### Handling Categorical Data

In [None]:
df4['HeadQuarter'].unique() # here we are looking at the unique values in the column 

In [None]:
# From observation, both official and unofficial names are used for certain cities.
# The unofficial names need to be rectified for correct analysis, e.g. a city with more than one name.

df4['HeadQuarter'] = df4['HeadQuarter'].replace(['Bangalore','Bangalore City'], 'Bengaluru')
df4['HeadQuarter'] = df4['HeadQuarter'].replace('Belgaum', 'Belagavi') # Belgaum is a separate city, officially named Belagavi
df4['HeadQuarter'] = df4['HeadQuarter'].replace('Gurugram\t#REF!', 'Gurugram', regex=True)
df4['HeadQuarter'] = df4['HeadQuarter'].str.replace('New Delhi','Delhi')
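
City names can also be normalized through a single lookup table, which keeps all aliases in one place. This is a sketch under assumptions: the `aliases` dictionary is hypothetical and would need to list whichever spellings actually occur in the data.

```python
import pandas as pd

hq = pd.Series(['Bangalore', 'Gurugram\t#REF!', 'Bangalore City', 'Mumbai'])  # sample values

# Hypothetical alias table: unofficial spelling -> official name
aliases = {'Bangalore': 'Bengaluru', 'Bangalore City': 'Bengaluru', 'Gurgaon': 'Gurugram'}

cleaned = (hq.str.replace(r'\t#REF!$', '', regex=True)   # strip spreadsheet error residue
             .str.strip()
             .replace(aliases))                          # exact-match replacement only

print(cleaned.tolist())
```

Because `Series.replace` with a dictionary matches whole values, a name such as 'New Delhi' is never partially rewritten.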

In [None]:
#using a filter to get all the mismatched values in the HeadQuarter column

df4[df4['HeadQuarter'].isin(['Online Media\t#REF!', 'Pharmaceuticals\t#REF!','Computer Games','Information Technology & Services','Food & Beverages'])]

In [None]:
#assigning specific values to "HeadQuarter", "Amount" and "Stage" for the company FanPlay

df4.loc[df4["Company_Brand"] == "FanPlay", ["HeadQuarter", "Amount", "Stage"]] = ["None", "$1200000","None"]
df4.loc[df4["Company_Brand"] == "FanPlay"]

In [None]:
#assigning specific values to "HeadQuarter" and "Sector" for the company MasterChow

df4.loc[df4["Company_Brand"] == "MasterChow", ["HeadQuarter", "Sector"]] = ["Hauz Khas", "Food & Beverages"]
df4.loc[df4["Company_Brand"] == "MasterChow"]

In [None]:
# here we are repositioning the values into their correct columns

df4.loc[df4["Company_Brand"] == "Fullife Healthcare", ["HeadQuarter","Sector","What_it_does","Investor", "Amount", "Stage"]] = ["None","Pharmaceuticals","Primary Business is Development and Manufactur...","Varun Khanna", "$22000000","Series C"]
df4.loc[df4["Company_Brand"] == "Fullife Healthcare"]

In [None]:
# updating the HeadQuarter and Sector values for the rows that match the Company_Brand name 'Peak'

df4.loc[df4["Company_Brand"] == "Peak", ["HeadQuarter", "Sector"]] = ["Manchester", "Information Technology & Services"]
df4.loc[df4["Company_Brand"] == "Peak"]

In [None]:
# repositioning the misplaced values for the rows that match the Company_Brand name 'Sochcast'

df4.loc[df4["Company_Brand"] == "Sochcast", ["HeadQuarter", "Sector",'What_it_does','Founders','Investor',"Amount"]] = [np.nan, 'Online Media','Sochcast is an Audio experiences company that give the listener and creators an Immersive Audio experience','CA Harvinderjit Singh Bhatia, Garima Surana','Vinners, Raj Nayak, Amritaanshu Agrawal',"$Undisclosed"]
df4.loc[df4["Company_Brand"] == "Sochcast"]

In [None]:
df4['Sector'].unique() # here we are looking at the unique value of the Sector column 

In [None]:
# here we are updating this Row 'MoEVing'

df4.loc[df4["Company_Brand"] == "MoEVing", ["Sector",'What_it_does','Founders','Investor','Amount','Stage']] = [
'Electric Mobility',"MoEVing is India's only Electric Mobility focused Technology Platform with a vision to accelerate EV adoption in India.",
'Vikash Mishra, Mragank Jain','Anshuman Maheshwary, Dr Srihari Raju Kalidindi','$5000000','Seed']
df4.loc[df4["Company_Brand"] == "MoEVing"]

In [None]:
df4["Stage"].unique() # getting the unique values in this column 

In [None]:
df4[df4["Stage"]=='$6000000'] # getting the row that matches the Amount 
# repositioning the values to their respective columns  

df4.loc[df4["Company_Brand"] == "MYRE Capital", ["Amount", "Stage"]] = ["6000000",np.nan]
df4.loc[df4["Company_Brand"] == "MYRE Capital"]

In [None]:
df4[df4["Stage"]=='$300000'] # getting the row that matches the Amount and 
# repositioning the values to their respective columns

df4.loc[df4["Company_Brand"] == "Little Leap", ["Amount", "Stage"]] = ["300000",np.nan]
df4.loc[df4["Company_Brand"] == "Little Leap"]

df4.loc[df4["Company_Brand"] == "BHyve", ["Amount", "Stage"]] = ["300000",np.nan]
df4.loc[df4["Company_Brand"] == "BHyve"]

In [None]:
df4[df4["Stage"]=='$1000000'] # getting the row that matches the Amount and 
# repositioning the values to their respective columns

df4.loc[df4["Company_Brand"] == "Saarthi Pedagogy", ["Amount", "Stage"]] = ["1000000",np.nan]
df4.loc[df4["Company_Brand"] == "Saarthi Pedagogy"]
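
The manual repositioning above can also be expressed as a pattern-based rule. The sketch below, on made-up rows, flags every `Stage` value that looks like a dollar amount and moves it into `Amount`; the regular expression is an assumption about how the shifted values are written.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: in the first and third, the amount has slipped into the Stage column
toy = pd.DataFrame({
    'Company_Brand': ['X', 'Y', 'Z'],
    'Amount': ['Seed', '$500000', 'Pre-series A'],
    'Stage': ['$300000', 'Series A', '$1000000'],
})

# Rows whose Stage consists of a dollar sign followed by digits
shifted = toy['Stage'].str.match(r'^\$\d+$', na=False)

# Move the value into Amount and blank out Stage, for those rows only
toy.loc[shifted, 'Amount'] = toy.loc[shifted, 'Stage']
toy.loc[shifted, 'Stage'] = np.nan

print(toy)
```

A rule like this also catches shifted rows that were not spotted by eye, although the overwritten `Amount` text may itself be a stage name worth inspecting before it is discarded.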

In [None]:
df4["Amount"].unique() # getting unique values 

In [None]:
# checking if these specific values are present in the amount column 

df4[df4['Amount'].isin([ 'Seed','JITO Angel Network, LetsVenture','ITO Angel Network, LetsVenture','Pre-series A','ah! Ventures'])]

In [None]:
# getting the row that matches the Amount 
# repositioning the values to their respective columns

df4.loc[df4["Company_Brand"] == "Godamwale", ["Amount", "Stage", "Investor"]] = ["$1000000", "Seed",np.nan]
df4.loc[df4["Company_Brand"] == "Godamwale"]

In [None]:
# below we are reformatting the row for the company Little Leap with its correct column values

df4.loc[df4["Company_Brand"] == "Little Leap", ["Amount", "Stage", "Investor"]] = [
    "$300000", np.nan, "ah! Ventures"]

df4.loc[df4["Investor"] == "ah! Ventures"] # here we are fetching the investor's column that matches 'ah! ventures'

In [None]:
df4.loc[df4["Company_Brand"] == "AdmitKard", ["Amount", "Stage", "Investor"]] = [
    "$1000000", "Pre-series A",np.nan]
df4.loc[df4["Company_Brand"] == "AdmitKard"]

In [None]:
# Cleaning the Amount column and removing the currency symbols in df_2021

df4['Amount'] = df4['Amount'].astype(str).str.replace(r'[₹$,]', '', regex=True)
# mark undisclosed or missing amounts with the 'NAN' placeholder, which float() later parses as NaN
df4['Amount'] = df4['Amount'].str.replace('Undisclosed|undisclosed|None', 'NAN', regex=True)
# treat dash placeholders and empty strings as zero
df4['Amount'] = df4['Amount'].replace("—", '0', regex=True)
df4['Amount'] = df4['Amount'].replace("", '0')

In [None]:
df4['Amount'].unique()

In [None]:
df4[df4['Amount'] == 'Pre-series A']

In [None]:
df4.loc[df4['Company_Brand'] == 'AdmitKard', 'Amount'] = 1000000 # setting the actual value for this row, found with the help of a Google search

In [None]:
df4['Amount'].unique()

In [None]:
df4['Amount'] = df4['Amount'].astype(float)   # we are converting to float 
type(df4['Amount'].iloc[0])   # confirming the type of the first value by position

In [None]:
df4['Amount'].unique() # confirming the unique values in the Amount column

In [None]:
df4['Amount'].value_counts() # here we are checking the total value counts of all the unique values 

In [None]:
null_values_Amount4 = df4['Amount'].isnull().sum() # here we are checking for null values
print(null_values_Amount4) 

In [None]:
len(df4['Amount'])

NOW LET'S FIND THE PERCENTAGE OF NULL VALUES RELATIVE TO THE WHOLE 
AMOUNT COLUMN 
This will help us understand the impact of the null values in the Amount column.

In [None]:
# Finding the percentage of null values

Amnt_null_perc = (null_values_Amount4 / len(df4['Amount'])) * 100
Amnt_null_perc

In [None]:
amount_stats = df4['Amount'].describe()
print(amount_stats)

In [None]:
pd.options.display.float_format = '{:,.2f}'.format
print(amount_stats)

From the above output, we can see that the percentage of null values in the Amount column 
is very high. To impute the missing values, we will run some checks to choose between 
the mean and the median, since the data set is too small for other methods such as multiple imputation or regression imputation.

SO BELOW WE WILL USE BOTH: 
1. DISTRIBUTION SHAPE
2. CHECKING FOR OUTLIERS


In [None]:
# first we will use the distribution shape by the help of a histogram 
# below we are plotting the histogram 

plt.hist(df4['Amount'].dropna(), bins=10) 
plt.xlabel('Amount')
#plt.xticks(df4['Amount'].dropna().unique())
plt.ylabel('Frequency')
plt.title('Distribution of non-null values in the Amount column')
plt.show()


The histogram above suggests that the majority of the non-null values in the 'Amount' column fall within the first bin (0.0 to 0.2 on the x-axis), with a frequency of about 1000 on the y-axis. This means that a large number of values in the 'Amount' column are small relative to the largest amounts.

The remaining bins, from 0.2 to 1.4 on the x-axis, have no or very few values, indicating that the range beyond the first bin is sparsely populated.

Overall, the histogram suggests that the distribution of values in the 'Amount' column is highly right-skewed: most values are small and only a handful are very large. This makes the mean a poor choice for imputation, because it is pulled upward by those few extreme values.

BELOW IS THE NEXT STEP TO CONFIRM WHETHER TO USE THE MEDIAN OR NOT 

To confirm whether using the median is a suitable imputation method, we can perform a hypothesis test to compare the distribution of non-null values in the 'Amount' column with the distribution of the imputed values using the median.

In [None]:
# below we are creating the two sets non-null and the median imputed 

non_null_values_4 = df4['Amount'].dropna()
median_imputed_values_4 = df4['Amount'].fillna(df4['Amount'].median())

Below we perform a statistical test to compare the distributions of the two groups. 
One option is the Kolmogorov-Smirnov test, which can be performed using 
the ks_2samp() function from the scipy.stats module.

In [None]:
# below we are conducting the test 
test_statistic4, p_value4 = ks_2samp(non_null_values_4, median_imputed_values_4)
test_statistic4, p_value4  # display both the test statistic and the p-value

NOW: we will set the significance level to 0.05. 
We will also state a null hypothesis and an alternative hypothesis; the null hypothesis will either be rejected
or not rejected based on the significance level. 

Null Hypothesis (H0): The distributions of non-null values and imputed values using the median are the same

Alternative Hypothesis (H1): The distributions of non-null values and imputed values using the median are different.

The significance level sets the standard of evidence required to reject the null hypothesis. The p-value is the probability of observing data at least as extreme as ours if the null hypothesis were true. If the p-value is below the significance level, we reject the null hypothesis: the observed result is unlikely to have occurred by chance alone, which supports the alternative hypothesis.

In [None]:
significance_level = 0.05

if p_value4 < significance_level:
    print("There is a significant difference between the distributions.")
else:
    print("There is no significant difference between the distributions.")
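
How this comparison behaves can be explored on synthetic data. The sketch below is an illustration under stated assumptions, not a re-run of the notebook: the amounts are drawn from a log-normal distribution and 300 extra values are "imputed" with the sample median. The outcome on real data depends on how large the share of imputed values is.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
observed = rng.lognormal(mean=13, sigma=2, size=500)   # synthetic right-skewed amounts

# Pretend 300 further values were missing and were filled with the observed median
imputed = np.concatenate([observed, np.full(300, np.median(observed))])

stat, p_value = ks_2samp(observed, imputed)
print(f"KS statistic={stat:.3f}, p-value={p_value:.4g}")
```

Because every imputed entry sits at one point, the imputed sample gains a step at the median, and the KS statistic grows with the proportion of imputed values.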


TEST OUTCOME AND IMPLICATIONS 
The reported p-value (0.0431) is below the significance level (0.05), so we reject the null hypothesis: the distribution of the median-imputed values differs significantly from the distribution of the observed non-null values. This is expected, because imputation places every missing entry at a single point, which adds a spike at the median.

Even so, the median remains the better of the two candidates. The histogram shows a strongly right-skewed distribution with a few very large amounts; those values pull the mean far above a typical funding round, whereas the median is robust to them. The imputed column should therefore be read with the caveat that its spread is understated.

WE WILL THEREFORE FILL IN THE NULL VALUES WITH THE MEDIAN, 
AS SHOWN BELOW 

In [None]:
median_value_4 = df4['Amount'].median()
df4['Amount'] = df4['Amount'].fillna(median_value_4) # here we fill in the nan values using the median strategy 

In [None]:
# now let's confirm the Amount column for null values again 
df4['Amount'].isna().sum()

In [None]:
# now we look at the distribution shape again, this time on a logarithmic scale 
# below we are plotting the histogram 

# Apply logarithmic transformation to the data
# Filter out non-positive and missing values
valid_amounts = df4['Amount'][df4['Amount'] > 0].dropna()

# Apply logarithmic transformation to the filtered values
log_amount = np.log10(valid_amounts)

# Plot the histogram using logarithmic scale
plt.hist(log_amount, bins=10)
plt.xlabel('Logarithm of Amount')
plt.ylabel('Frequency')
plt.title('Distribution of logarithm of values in the Amount column')
plt.show()
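
The effect of the logarithmic transformation can be checked without a plot by counting values per power of ten. The amounts below are made up, and the bin edges are chosen by hand for the illustration.

```python
import numpy as np

amounts = np.array([0.0, 2e4, 5e4, 2e5, 3e6, 7e6, 2e8, np.nan])  # made-up values

valid = amounts[amounts > 0]   # comparisons with NaN are False, so NaN and zeros drop out
counts, edges = np.histogram(np.log10(valid), bins=[4, 5, 6, 7, 8, 9])

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"10^{low:.0f} to 10^{high:.0f}: {n}")
```

On the log scale each bin covers one order of magnitude, so the small amounts no longer collapse into a single bar.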


NOW LET'S DEAL WITH NULL VALUES IN THE STAGE COLUMN 

In [None]:
null_stage_4 = df4['Stage'].isnull().sum()  # checking for null values in the stage column 
null_stage_4

In [None]:
null_stage_4 = df4['Stage'].isnull().sum()
perce_null_stage4 = (null_stage_4 / len(df4['Stage'])) * 100 # here we want to know the percentage of the null values in the stage column 
perce_null_stage4

BEFORE CONTINUING LET'S FURTHER GROUP THE STAGE COLUMN TO MAKE THINGS SIMPLER 

In [None]:
grouped_stages_4 = {
    # Group 1: Early Stage
    'Pre-seed': 'Early Stage',
    'Seed': 'Early Stage',
    'Seed A': 'Early Stage',
    'Seed Funding': 'Early Stage',
    'Seed Investment': 'Early Stage',
    'Seed Round': 'Early Stage',
    'Seed Round & Series A': 'Early Stage',
    'Seed fund': 'Early Stage',
    'Seed funding': 'Early Stage',
    'Seed round': 'Early Stage',
    'Seed+': 'Early Stage',

    # Group 2: Mid Stage
    'Series A': 'Mid Stage',
    'Series A+': 'Mid Stage',
    'Series A-1': 'Mid Stage',
    'Series A2': 'Mid Stage',
    'Series B': 'Mid Stage',
    'Series B+': 'Mid Stage',
    'Series B2': 'Mid Stage',
    'Series B3': 'Mid Stage',
    'Series C': 'Mid Stage',
    'Seies A': 'Mid Stage',
    
    # Group 3: Late Stage
    'Series D': 'Late Stage',
    'Series I': 'Late Stage',
    'Series D1': 'Late Stage',
    'Series E': 'Late Stage',
    'Series E2': 'Late Stage',
    'Series F': 'Late Stage',
    'Series F1': 'Late Stage',
    'Series F2': 'Late Stage',
    'Series G': 'Late Stage',
    'Series H': 'Late Stage',
    
    # Group 4: Other Stages
    'Angel': 'Other Stages',
    'Angel Round': 'Other Stages',
    'Bridge': 'Other Stages',
    'Bridge Round': 'Other Stages',
    'Corporate Round': 'Other Stages',
    'Debt': 'Other Stages',
    'Debt Financing': 'Other Stages',
    'Early seed': 'Other Stages',
    'Edge': 'Other Stages',
    'Fresh funding': 'Other Stages',
    'Funding Round': 'Other Stages',
    'Grant': 'Other Stages',
    'Mid series': 'Other Stages',
    'Non-equity Assistance': 'Other Stages',
    'None': 'Other Stages',
    'PE': 'Other Stages',
    'Post series A': 'Other Stages',
    'Post-IPO Debt': 'Other Stages',
    'Post-IPO Equity': 'Other Stages',
    'Pre Series A': 'Other Stages',
    'Pre- series A': 'Other Stages',
    'Pre-Seed': 'Early Stage',
    'Pre-Series B': 'Other Stages',
    'Private Equity': 'Other Stages',
    'Secondary Market': 'Other Stages',
    'Pre-series A': 'Other Stages',
    'Pre-series B':'Other Stages',
    'Pre-series A1': 'Other Stages',
    'Pre-series':'Other Stages',
}

df4['Stage'] = df4['Stage'].replace(grouped_stages_4)
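
A long lookup like the one above can be shortened by normalizing the labels before mapping them. This is a sketch with a much smaller, hypothetical lookup keyed on lowercase, stripped labels; it is not a replacement for the full grouping.

```python
import pandas as pd

stages = pd.Series(['Seed', 'seed round', 'Series A ', 'Series E', 'Debt', 'FinTech'])  # samples

# Hypothetical lookup keyed on lowercase labels without surrounding spaces
groups = {'seed': 'Early Stage', 'seed round': 'Early Stage',
          'series a': 'Mid Stage', 'series e': 'Late Stage', 'debt': 'Other Stages'}

normalized = stages.str.strip().str.lower()
grouped = normalized.map(groups)      # labels missing from the lookup become NaN
unmapped = stages[grouped.isna()]     # surface them for manual review

print(grouped.tolist())
print(unmapped.tolist())
```

Unlike `replace`, `map` leaves nothing unmapped in place, so stray values such as sector names in the stage column show up immediately.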


In [None]:
df4['Stage'] # here we want to look at the stage column again 

In [None]:
# checking for these values in the stage column which are not supposed to be there

not_wanted_stage_4 = ["FinTech", "EdTech", "Financial Services", "Food & Beverages", "Information Technology & Services",  "E-commerce"]
not_wanted_rows = df4['Stage'].isin(not_wanted_stage_4)
not_wanted_rows.sum()

BELOW WE WANT TO DISPLAY THE STAGE VALUES THAT WERE NOT MAPPED TO ANY OF THE GROUPS ABOVE 

In [None]:
# Count the occurrences of each unique value in the "Stage" column
stage_counts = df4['Stage'].value_counts()

# Filter for values that are not in the grouped stages
ungrouped_stages = stage_counts[~stage_counts.index.isin(grouped_stages_4.values())]

# Display the ungrouped stage values
print(ungrouped_stages)


LET'S DROP VALUES(ROW) FROM THE SECTOR COLUMN THAT DO NOT HAVE ANY CORRESPONDING STAGE IN THE STAGE COLUMN 

BELOW IS ONE WAY TO HELP SELECT THE BEST METHOD TO DEAL WITH THE MISSING VALUES IN THE STAGE COLUMN 

We create a cross-tabulation (contingency table) between the "Stage" column and the "Sector" column.
This generates a table showing the counts of each combination of stage and sector. It will help us identify whether certain stages are more prevalent in specific sectors.


BUT FIRST LET'S CONFIRM THE NULL VALUES OF THE SECTOR COLUMN 

In [None]:
df4['Sector'].isnull().sum() # checking for null values in the Sector column 

NOW LET'S CREATE THE CROSSTAB

In [None]:
cross_table_sec_stage_4 = pd.crosstab(df4['Sector'], df4['Stage']) # contingency table of sector (rows) against stage (columns)
cross_table_sec_stage_4

Now, to deal with the missing values in the Stage column, we will use the percentages of the six most frequent 
stages to fill in the missing values.


In [None]:
# below we are getting the percentage of each stage from the column totals of the table
stage_totals_4 = cross_table_sec_stage_4.sum(axis=0)
cross_table_sec_stage_perc_4 = (stage_totals_4 / stage_totals_4.sum()) * 100
cross_table_sec_stage_perc_4

NOW LET'S LOOK AT THE FIRST SIX 

In [None]:
top_six_stages = cross_table_sec_stage_perc_4.nlargest(6) # here we are looking at the top six stages 
top_six_stages

NOW LET'S FILL IN THE MISSING VALUES IN THE STAGE COLUMN, USING THE RESPECTIVE VALUES FROM THE TOP SIX 
STAGES 


In [None]:
# Filling missing values in the "Stage" column by sampling from the top six values

# Normalize the percentages so they sum to 1
normalize_prob_4 = top_six_stages / top_six_stages.sum()
# Draw one candidate per row and align it to df4's index so every null row receives a value
random_stages_4 = pd.Series(np.random.choice(top_six_stages.index.tolist(), size=len(df4), p=normalize_prob_4.values), index=df4.index)
df4['Stage'] = df4['Stage'].fillna(random_stages_4)
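
When `fillna` receives a Series, it matches values by index label, not by position. The sketch below uses a toy column with gaps in its index (as happens after `drop_duplicates()`) to show why the filler Series should carry the same index. The values and the seed are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed for a reproducible draw

# Toy column whose index has gaps, e.g. after rows were dropped
stage = pd.Series(['Seed', np.nan, 'Series A', np.nan], index=[0, 2, 5, 9])

draws = rng.choice(['Seed', 'Series A'], size=len(stage), p=[0.5, 0.5])

misaligned = stage.fillna(pd.Series(draws))                   # filler has index 0..3
aligned = stage.fillna(pd.Series(draws, index=stage.index))   # filler shares stage's labels

print(misaligned.isna().sum(), aligned.isna().sum())
```

With the default index, the row labelled 9 finds no matching filler value and stays null, which is why a null check after the fill can still report missing values.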

NOW LET'S CONFIRM THE MISSING VALUES IN THE STAGE COLUMN AGAIN 

In [None]:
# confirming the null values in the stage column again 
df4['Stage'].isnull().sum()

In [None]:
df4.columns


In [None]:
# Manually assigning the funding stage for specific companies, matched on 'Company_Brand':
df4.loc[df4['Company_Brand'] == 'upGrad', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Urban Company', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Comofi Medtech', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Smart Joules', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Miko', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'M1xchange', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Do Your Thng', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'LegitQuest', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Fantasy Akhada', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Speciale Invest', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Meesho', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Elevar', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Curefoods', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Camp K12', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Defy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Homversity', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Loop Health', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Smartstaff', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Hyperface', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Melorra', 'Stage'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'Onato', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Mestastop Solutions', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'MergerDomo', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Trell', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Homeville', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Ola Electric', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Delhivery', 'Stage'] = 'Series F'
df4.loc[df4['Company_Brand'] == 'Upgame', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Sochcast', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'byteXL', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'EventBeep', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'GameEon Studios', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Tessolve', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'EF Polymer', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'LearnVern', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Beldara', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Oye Rickshaw', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'OfBusiness', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'CareerLabs', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Studio Sirah', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == '1Bridge', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'TartanSense', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Bewakoof', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Elda Health', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Ruptok', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == "O' Be Cocktails", 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Hike', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'House of Kieraya', 'Stage'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'DrinkPrime', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'SATYA MicroCapital', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'CreatorStack', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Rage Coffee', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Klub', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Stellaris Venture Partners', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Celcius', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'UrbanMatrix Technologies', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Evenflow Brands', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Atomberg', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'ShopMyLooks', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Veefin', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'BangDB', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'O’ Be Cocktails', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'OneCard', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Hubhopper', 'Stage'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'Avataar Ventures', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Codingal', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Junio', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'MPL', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Bombay Shaving Company', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'MFine', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Darwinbox', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'SSA Finserv', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pariksha', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Devic Earth', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pocket Aces', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Biocon Biologics', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Biconomy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Bandhoo', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Mamaearth', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Inspacco', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'GODI Energy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Lenskart', 'Stage'] = 'Series E'
df4.loc[df4['Company_Brand'] == 'Clensta', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Polygon', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Thingsup', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'TRDR', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'SuperBottoms', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Wingreens Farms', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Bombay Hemp Company', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Zenpay Solutions', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Visit Health', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Zetwerk', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Wiingy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Arcana', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Duroflex', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Tvasta', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Vakilsearch', 'Stage'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'PumPumPum', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Sterling Accuris Wellness', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Braingroom', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Vegrow', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Automovill', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Bella Vita Organic', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'SmartCoin', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'MYSUN', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Square Yards', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Slang Labs', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'SMOOR', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'UrbanKisaan', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'BHyve', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'SpEd@home', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Now&Me', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Capital Float', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'PazCare', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'MicroDegree', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Plutomen', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Grinntech', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Navars', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Slice', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'CredR', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Dream Sports', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Annapurna Finance', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Purplle', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Nazara Technologies', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Svasti Microfinance', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'BlackSoil NBFC', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Kinara Capital', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'AMPM', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Design Cafe', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'eShipz', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Atomberg Technologies', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Peppermint', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Spintly', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'ShopSe', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'ShareChat', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Safexpay', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Advantage Club', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'SuperGaming', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'SleepyCat', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Ultrahuman', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Yojak', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Navia Life Care', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Locale.ai', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Whiz League', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'CHARGE+ZONE', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'PingoLearn', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Practically', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Keka HR', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Marquee Equity', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'GoTo', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Furlenco', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Chalo', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Udaan', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'MyGlamm', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Inshorts', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Bikry app', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'The Ayurveda Co', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Furlenco', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Rockclimber', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Power Gummies', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Answer Genomics', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Saarthi Pedagogy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Lavado', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'NIRAMAI', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Meddo', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Five Star Finance', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Policybazaar', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'OYO', 'Stage'] = 'Series F'
df4.loc[df4['Company_Brand'] == 'Blume Ventures', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'ImaginXP', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Virohan', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Apna.co', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Get My Parking', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'FanCode', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Enthu.ai', 'Stage'] = 'Pre-Seed'
df4.loc[df4['Company_Brand'] == 'Zepto', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'TurboHire', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'SatSure', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Leap India', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Better Capital', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Rentomojo', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Kissan Pro', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'VLCC Health Care', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'SUN Mobility', 'Stage'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'The Indus Valley', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'BharatPe', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'BankSathi', 'Stage'] = 'Pre-Seed'
df4.loc[df4['Company_Brand'] == 'Auntie Fung', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Sanctum Wealth', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Easiloan', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Boutique Spirit Brands', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Chingari', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Skeps', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Kirana247', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Imagimake', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'goEgoNetwork', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Snack Amor', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Expertrons', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == '1K Kirana Bazaar', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Zupee', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'VerSe Innovation', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'MetroRide', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'PropReturns', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Deciwood', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Skippi Ice Pops', 'Stage'] = 'Pre-Seed'
df4.loc[df4['Company_Brand'] == 'Onelife', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'TenderCuts', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Scentials', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Remedico', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'PrepBytes', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'RevFin', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Paperfly', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Bolkar', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Oneiric Gaming', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'iMumz', 'Stage'] = 'Pre-Seed'
df4.loc[df4['Company_Brand'] == 'BlackSoil', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Chai Waale', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'JetSynthesys', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Skymet', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'GalaxyCard', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pankhuri', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Vah Vah!', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pratilipi', 'Stage'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'Arcatron Mobility', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'KreditBee', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Holisol', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'India Quotient', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Nobel Hygiene', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Instoried', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Homingos', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'NODWIN', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Bijnis', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Clairco', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == "BYJU'S", 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Petpooja', 'Stage'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'Arbo Works', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Recordent', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Kaar Technologies', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Phool.co', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Log 9 Materials', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'EV Plugs', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'CredRight', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Leverage Edu', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Enercomp', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'LivQuik Technology', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Tinkerly', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pine Labs', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Lido Learning', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Taikee', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'boAt', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Onsurity', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Unacademy', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Flo Mobility', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'TheHouseMonk', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Sirona Hygiene', 'Stage'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'Vista Rooms', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Digit Insurance', 'Stage'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Lohum', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Unacademy', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Knocksense', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'DcodeAI', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'ixigo', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Droom', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Oliveboard', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Digit Insurance', 'Stage'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'CoRover', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Powerplay', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'CustomerGlu', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Cell Propulsion', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Chqbook', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'WaterScience', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'BigLeap', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Fourth Partner Energy', 'Funding Type'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'Safex Chemicals', 'Funding Type'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'IndiaLends', 'Funding Type'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'NewLink Group', 'Funding Type'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Nexpert', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Max Healthcare', 'Funding Type'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Ecom Express', 'Funding Type'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'IGL', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pickright Technologies', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Toplyne', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Wonderchef', 'Funding Type'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'Totality', 'Funding Type'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Vitra.ai', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Swiggy', 'Funding Series'] = 'Series E'
df4.loc[df4['Company_Brand'] == 'OTO Capital', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'UpScalio', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Freyr Energy', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Northern Arc', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Rapido', 'Funding Series'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'YPay', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Curefit', 'Funding Series'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Probus Insurance', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Ola', 'Funding Series'] = 'Series F'
df4.loc[df4['Company_Brand'] == 'Karkinos Healthcare', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Taskmo', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Eka.care', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Kredent', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'TWID', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pocketly', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'CoRover', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Cora Health', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Cell Propulsion', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Wellbeing Nutrition', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'BYJU’S', 'Funding Series'] = 'Series J'
df4.loc[df4['Company_Brand'] == 'MYRE Capital', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Edmingle', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Fourth Partner Energy', 'Funding Series'] = 'Series B'
df4.loc[df4['Company_Brand'] == 'Raptee Energy', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Anar Business Community', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Asirvad Microfinance', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Disruptium', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Toplyne', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Tickertape', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'True Balance', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Indifi', 'Funding Series'] = 'Series D'
df4.loc[df4['Company_Brand'] == 'Mobileware Technologies', 'Funding Series'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'LeadSquared', 'Funding Series'] = 'Series C'
df4.loc[df4['Company_Brand'] == 'Gramophone', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Sugar.fit', 'Funding Series'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Vitra.ai', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Freyr Energy', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'DealShare', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'iBus Networks', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'WeWork India', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'LegitQuest', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Swiggy', 'Stage'] = 'Series E'
df4.loc[df4['Company_Brand'] == 'Sporjo', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'UpScalio', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == '8i Ventures', 'Stage'] = 'Series A'
df4.loc[df4['Company_Brand'] == 'Fitpage', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Karkinos Healthcare', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Vendor Infra', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Taskmo', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Sapio Analytics', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Genworks Health', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pocketly', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'CoRover', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Green Soul', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Accio Robotics', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Onelife Nutriscience', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Shyplite', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'WaterScience', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'MYRE Capital', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Fourth Partner Energy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Knackit', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Safex Chemicals', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Anar Business Community', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'NewLink Group', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Livve Homes', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Nexprt', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'ideaForge', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Disruptium', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Pickright Technologies', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'VilCart', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Doola', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'R for Rabbit', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Supertails', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'LegitQuest', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'NeoDocs', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Gumlet', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Wellbeing Nutrition', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Detect Technologies', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'ThatMate', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Zoomcar', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Tickertape', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Northern Arc', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Factors.AI', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Yellow Class', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Zorgers', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'MediBuddy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Samaaro', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Shumee', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Fuel Buddy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'YPay', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Raptee Energy', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Asirvad Microfinance', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Zingavita', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Kredent', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Ankur capital', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Cashify', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == '6Degree', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'FreeStand', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Hakuna Matata', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Flatheads', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Candes', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Edmingle', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Indic Inspirations', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'True Balance', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Alpha Coach', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'IGL', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Medpho', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Powerplay', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Blaer Motors', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Zaara Biotech', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Indifi', 'Stage'] = 'Seed'
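
The run of one-line `.loc` assignments above can also be written as a single lookup table. The sketch below is only an illustration, not the notebook's own method: `demo` and `stage_by_company` are hypothetical names, and the three rows are made-up stand-ins for `df4`.

```python
import pandas as pd

# Hypothetical miniature of df4; the real frame has many more rows and columns
demo = pd.DataFrame({
    'Company_Brand': ['Meesho', 'Trell', 'Unknown Co'],
    'Stage': [None, None, 'Series B'],
})

# Company -> stage lookup, standing in for one .loc assignment per company
stage_by_company = {'Meesho': 'Series D', 'Trell': 'Series C'}

# map() returns NaN for companies missing from the lookup,
# so fillna() keeps whatever Stage value those rows already had
demo['Stage'] = demo['Company_Brand'].map(stage_by_company).fillna(demo['Stage'])
```

As with the repeated `.loc` lines, a company listed twice in the lookup keeps only its last value, so entries such as Furlenco or Unacademy would still need a decision on which round to record.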


In [None]:
# replacing every 'EdTech' value in the Stage column with NaN, since it is a sector and not a funding stage
df4['Stage'] = df4['Stage'].replace('EdTech', np.nan)

In [None]:
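# casting only the non-null values to str; a plain astype(str) can turn NaN into the string 'nan'
# and hide missing values from the isnull() check in the next cell
df4['Stage'] = df4['Stage'].where(df4['Stage'].isna(), df4['Stage'].astype(str))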

In [None]:
still_null = df4['Stage'].isnull() # boolean mask of the rows whose Stage is still null or NaN
rows_still_null = df4[still_null]
rows_still_null

In [None]:
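# filling the remaining null stages in df4 by hand (Company_Brand is renamed to Company further down)
df4.loc[df4['Company_Brand'] == 'Geniemode', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Sapio Analytics', 'Stage'] = 'Seed'
df4.loc[df4['Company_Brand'] == 'Voxelgrids', 'Stage'] = 'Seed'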


In [None]:
# Dropping the columns that are not important to our analysis

df4.drop(columns=['Founders','Investor','Founded', 'Funding Type','Funding Series'], inplace=True)

In [None]:
df4.insert(6, "Funding Year", 2021) # inserting a 'Funding Year' column set to 2021 to keep track of the data sets when combining

In [None]:
df4.rename(columns = {'Company_Brand':'Company',
                        'HeadQuarter':'Location',
                        'What_it_does':'About'},
             inplace = True)

In [None]:
# Dropping rows that are duplicated across the key columns listed below
df4.drop_duplicates(subset=['Company', 'About', 'Stage', 'Amount', 'Sector', 'Location'], inplace=True)

In [None]:
df4[df4['Stage'] == 'Information Technology & Services']

In [None]:
df4.head(100) # looking at the head to confirm before saving the data

In [None]:
df4['Stage'] = df4['Stage'].astype(str)

In [None]:
df4['Location'].astype(str) # preview only: the result is not assigned, so df4 itself is unchanged

In [None]:
df4['Location'].isnull().sum() # counting the remaining null locations; they are filled in by hand below

In [None]:
df4.isnull().sum()

In [None]:
# Find null rows in the 'Location' column
null_rows = df4[df4['Location'].isnull()]
null_rows

In [None]:
df4.loc[df4["Company"] == "Vidyakul", "Location"] = "Gurgaon"
df4.loc[df4["Company"] == "Vidyakul"]

In [None]:
df4.loc[df4["Company"] == "Sochcast", "Location"] = "Bangalore"
df4.loc[df4["Company"] == "Sochcast"]

In [None]:
df4.isna().sum()

In [None]:
df4['Sector']

In [None]:
df4.to_csv('df_2021.csv', index=False)

In [None]:
# Concatenate the data frames and rebuild a continuous index
clean_done = pd.concat([df, df2, df3, df4], ignore_index=True)

In [None]:
# Saving the concatenated data frame as CSV
clean_done.to_csv('Clean_Data_18_19_20_21_snyk.csv', index=False)

In [None]:
clean_done.to_csv('Clean_Data_18_19_20_21_snyk.txt', index=False, sep='\t')
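
When yearly frames are stacked, each one brings its own 0-based index, so row labels repeat unless the index is rebuilt. A minimal sketch with two hypothetical one-row frames (`a` and `b` are made-up stand-ins for the yearly data sets):

```python
import pandas as pd

# Hypothetical yearly frames, each tagged with its funding year before combining
a = pd.DataFrame({'Company': ['A'], 'Funding Year': [2018]})
b = pd.DataFrame({'Company': ['B'], 'Funding Year': [2019]})

stacked = pd.concat([a, b])                        # index labels: 0, 0
combined = pd.concat([a, b], ignore_index=True)    # index labels: 0, 1
```

The duplicated labels do not matter for `to_csv(index=False)`, but they can make later `.loc` lookups by label return more than one row.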

In [None]:
clean_done.isna().sum()

In [None]:
clean_done.duplicated().any()

In [None]:
clean_done['Sector'].head(100)

In [None]:
clean_done.isna().sum()

In [None]:
clean_done.astype(str)

In [None]:
clean_done.dropna(how='all', inplace=True)

In [None]:
clean_done['Sector']

In [None]:
clean_done['Sector'] = clean_done['Sector'].str.title()
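
Title-casing merges spellings that differ only in capitalisation. The sketch below, on made-up sector values, also strips surrounding whitespace first; that extra step is an assumption about the data, not something the notebook does.

```python
import pandas as pd

# Hypothetical sector values with mixed case, stray spaces and a missing entry
sectors = pd.Series(['fintech', ' EdTech ', 'FINTECH', None])

# .str methods skip missing values, so the null entry stays null
normalised = sectors.str.strip().str.title()
```

Note that `str.title()` also lowercases inner capitals, so 'EdTech' becomes 'Edtech'.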

In [None]:
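# Sector values legitimately repeat across startups, so duplicates are checked on whole rows
clean_done.duplicated().any()

In [None]:
# dropping fully duplicated rows; deduplicating on 'Sector' alone would keep only one startup per sector
clean_done.drop_duplicates(inplace=True)

In [None]:
clean_done.duplicated().any()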

In [None]:
clean_done['Sector']
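
The `subset` argument of `drop_duplicates` decides what counts as a duplicate, which is worth checking on a small example before applying it to the full data. A minimal sketch with a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical data: three Fintech rows, one of which is an exact repeat
demo = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'A'],
    'Sector': ['Fintech', 'Fintech', 'Edtech', 'Fintech'],
    'Amount': [1.0, 2.0, 3.0, 1.0],
})

by_sector = demo.drop_duplicates(subset=['Sector'])  # keeps the first row of each sector
by_row = demo.drop_duplicates()                      # removes only the exact repeat
```

Here `by_sector` keeps just companies A and C, while `by_row` keeps A, B and C.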

WORKING ON THE STAGE COLUMN

In [None]:
clean_done['Stage']

In [None]:
# List of valid categories
valid_categories = ['Early Stage', 'Mid Stage', 'Late Stage', 'Other Stages']

# Get the count of unique values in the 'Stage' column
stage_counts = clean_done['Stage'].value_counts()

# Check if there are any values not in the valid categories
invalid_stages = stage_counts.index[~stage_counts.index.isin(valid_categories)]

if len(invalid_stages) > 0:
    print("The 'Stage' column contains values that are not grouped into the valid categories:")
    print(invalid_stages)
else:
    print("All values in the 'Stage' column are grouped into the valid categories.")

In [None]:
clean_done_stage = {
    # Group 1: Early Stage
    'Pre-seed': 'Early Stage',
    'Seed': 'Early Stage',
    'Seed A': 'Early Stage',
    'Seed Funding': 'Early Stage',
    'Seed Investment': 'Early Stage',
    'Seed Round': 'Early Stage',
    'Seed Round & Series A': 'Early Stage',
    'Seed fund': 'Early Stage',
    'Seed funding': 'Early Stage',
    'Seed round': 'Early Stage',
    'Seed+': 'Early Stage',

    # Group 2: Mid Stage
    'Series A': 'Mid Stage',
    'Series A+': 'Mid Stage',
    'Series A-1': 'Mid Stage',
    'Series A2': 'Mid Stage',
    'Series B': 'Mid Stage',
    'Series B+': 'Mid Stage',
    'Series B2': 'Mid Stage',
    'Series B3': 'Mid Stage',
    'Series C': 'Mid Stage',
    'Seies A': 'Mid Stage',
    
    # Group 3: Late Stage
    'Series D': 'Late Stage',
    'Series I': 'Late Stage',
    'Series D1': 'Late Stage',
    'Series E': 'Late Stage',
    'Series E2': 'Late Stage',
    'Series F': 'Late Stage',
    'Series F1': 'Late Stage',
    'Series F2': 'Late Stage',
    'Series G': 'Late Stage',
    'Series H': 'Late Stage',
    
    # Group 4: Other Stages
    'Angel': 'Other Stages',
    'Angel Round': 'Other Stages',
    'Bridge': 'Other Stages',
    'Bridge Round': 'Other Stages',
    'Corporate Round': 'Other Stages',
    'Debt': 'Other Stages',
    'Debt Financing': 'Other Stages',
    'Early seed': 'Other Stages',
    'Edge': 'Other Stages',
    'Fresh funding': 'Other Stages',
    'Funding Round': 'Other Stages',
    'Grant': 'Other Stages',
    'Mid series': 'Other Stages',
    'Non-equity Assistance': 'Other Stages',
    'None': 'Other Stages',
    'PE': 'Other Stages',
    'Post series A': 'Other Stages',
    'Post-IPO Debt': 'Other Stages',
    'Post-IPO Equity': 'Other Stages',
    'Pre Series A': 'Other Stages',
    'Pre- series A': 'Other Stages',
    'Pre-Seed': 'Early Stage',  # same group as 'Pre-seed' above
    'Pre-Series B': 'Other Stages',
    'Private Equity': 'Other Stages',
    'Secondary Market': 'Other Stages',
    'Pre-series A': 'Other Stages',
    'Pre-series B':'Other Stages',
    'Pre-series A1': 'Other Stages',
    'Pre-series':'Other Stages'
}

clean_done['Stage'] = clean_done['Stage'].replace(clean_done_stage)
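
Python accepts a dict literal with repeated keys and silently keeps the last value, so a long hand-written mapping is worth checking for repeats. A minimal sketch, assuming the mapping is first written as a list of pairs (`pairs` and `repeated_keys` are hypothetical names):

```python
from collections import Counter

# Hypothetical excerpt of a stage mapping in which 'Seed' appears twice
pairs = [
    ('Seed', 'Early Stage'),
    ('Series A', 'Mid Stage'),
    ('Seed', 'Other Stages'),
]

# Keys that occur more than once in the list of pairs
key_counts = Counter(key for key, _ in pairs)
repeated_keys = [key for key, count in key_counts.items() if count > 1]

# dict() keeps the last value for a repeated key, just like a dict literal
stage_groups = dict(pairs)
```

With this input `repeated_keys` lists 'Seed', and `stage_groups['Seed']` ends up as 'Other Stages' rather than 'Early Stage'.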

In [None]:
# List of valid categories
valid_categories = ['Early Stage', 'Mid Stage', 'Late Stage', 'Other Stages']

# Get the count of unique values in the 'Stage' column
stage_counts = clean_done['Stage'].value_counts()

# Check if there are any values not in the valid categories
invalid_stages = stage_counts.index[~stage_counts.index.isin(valid_categories)]

if len(invalid_stages) > 0:
    print("The 'Stage' column contains values that are not grouped into the valid categories:")
    print(invalid_stages)
    
    # Print the rows with invalid stages
    rows_with_invalid_stages = clean_done[clean_done['Stage'].isin(invalid_stages)]
    print(rows_with_invalid_stages)
else:
    print("All values in the 'Stage' column are grouped into the valid categories.")

In [None]:
# List of valid categories
valid_categories = ['Early Stage', 'Mid Stage', 'Late Stage', 'Other Stages']

# Get the count of unique values in the 'Stage' column
stage_counts = clean_done['Stage'].value_counts()

# Check if there are any values not in the valid categories
invalid_stages = stage_counts.index[~stage_counts.index.isin(valid_categories)]

if len(invalid_stages) > 0:
    print("The 'Stage' column contains values that are not grouped into the valid categories:")
    print(invalid_stages)
    
    # Drop rows with invalid stages; .copy() keeps later column assignments from warning about a slice
    clean_done = clean_done[~clean_done['Stage'].isin(invalid_stages)].copy()
    print("Rows with invalid stages have been dropped.")
else:
    print("All values in the 'Stage' column are grouped into the valid categories.")
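
Before dropping rows it can help to see how many rows each unmapped value accounts for. A minimal sketch on a hypothetical four-row frame; `is_valid`, `dropped` and `kept` are illustrative names, not part of the notebook:

```python
import pandas as pd

valid_categories = ['Early Stage', 'Mid Stage', 'Late Stage', 'Other Stages']

# Hypothetical Stage column with two values that were never grouped
demo = pd.DataFrame({'Stage': ['Early Stage', 'nan', 'Mid Stage', 'Series J']})

is_valid = demo['Stage'].isin(valid_categories)

# Row count per value that is about to be removed
dropped = demo.loc[~is_valid, 'Stage'].value_counts()

# Rows that survive the filter
kept = demo[is_valid].copy()
```

If `dropped` shows a value with many rows, adding it to the mapping is usually preferable to losing those rows.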

In [None]:
clean_done['Stage'].unique()

In [None]:
clean_done.isnull().sum()

In [None]:
clean_done['Amount'] = clean_done['Amount'].astype(float) # assigning the result so the cast is kept

In [None]:
clean_done['Funding Year'] = clean_done['Funding Year'].astype(int) # assigning the result so the cast is kept

In [None]:
clean_done.to_csv('visual_ready.csv', index=False)
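
A plain `astype(float)` raises as soon as one value is not numeric. A minimal sketch of a more forgiving conversion, assuming some amounts may still be text ('Undisclosed' is a made-up example value, not taken from the data):

```python
import pandas as pd

# Hypothetical Amount values read in as text
raw = pd.Series(['1000000', '2500000.5', 'Undisclosed'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
amount = pd.to_numeric(raw, errors='coerce')
```

The coerced NaN values can then be counted with `amount.isna().sum()` to decide whether they should be dropped or filled before plotting.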