# Indian Start-up Investment Analysis (2018 - 2021)

### Aim:
To assess the investment potential and attractiveness of the Indian startup ecosystem and recommend an optimal course of action.

### Objectives:
 
1. To assess the overall attractiveness of the Indian startup ecosystem based on funding trends and investor activity from 2018 to 2021.
2. To identify key sectors with high potential for investment based on their funding attractiveness and growth prospects.
3. To evaluate the investment opportunities across different stages of startup development and their risk-return profiles.
4. To analyze the geographical distribution of startups and funding to identify strategic investment locations and regional investment disparities.
5. To determine the correlation between funding amounts received by startups and their subsequent performance, providing insights into potential returns on investment and success rates.

### Business Questions:
1. What are the trends in funding amounts for Indian startups from 2018 to 2021? Are there any significant fluctuations or consistent growth patterns observed over this period?

2. Which sectors within the Indian startup ecosystem attracted the highest total funding during the specified timeframe? Are there any emerging sectors that have shown rapid growth in terms of investment?

3. What is the distribution of investment amounts across different stages of startup development (e.g., seed, early-stage, growth)? Are certain stages more favored by investors, and if so, why?

4. How is the geographical distribution of startups and funding within India? Are there specific regions or cities that have emerged as hubs for startup activity and investment, and are there any notable regional disparities?

5. Is there a correlation between the funding amounts received by startups and their subsequent performance metrics such as revenue growth, user acquisition, or market share? What insights can be gleaned from this correlation in terms of potential returns on investment and success rates?

6. Who are the top investors in the Indian startup ecosystem during the specified period? What sectors do they predominantly invest in, and are there any patterns in their investment strategies?

7. What are the characteristics of successful Indian startups in terms of founding team composition, industry focus, and funding trajectory? Can these characteristics be used to identify potential investment opportunities or predict startup success?

### Hypothesis to Test:
 
Given the goal of assessing the investment potential in the Indian startup ecosystem, we hypothesize that:
 
**Null Hypothesis (H0)**: There is no clear pattern in the funding received by Indian startups from 2018 to 2021, and factors like sector, stage, location, and funding amount do not affect startup success.

**Alternative Hypothesis (H1)**: There is a clear pattern in the funding received by Indian startups from 2018 to 2021, and factors like sector, stage, location, and funding amount affect startup success.
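
One way to make H0 testable for a single factor is a permutation test on funding amounts between two groups, for example two sectors. The sketch below uses only NumPy. The function name `permutation_test_mean_diff`, the sample values, and the choice of a difference-in-means statistic are illustrative assumptions, not part of the analysis in this notebook.

```python
import numpy as np

def permutation_test_mean_diff(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (hits + 1) / (n_permutations + 1)

# Made-up funding amounts (USD) for two hypothetical sectors
sector_a = [1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6]
sector_b = [101e6, 102e6, 103e6, 104e6, 105e6, 106e6, 107e6, 108e6]
p_value = permutation_test_mean_diff(sector_a, sector_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value would count as evidence against H0 for that one factor only; each factor (sector, stage, location) would need its own test.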

## Import Packages for Analysis

In [1]:
# import relevant packages
import pyodbc
from dotenv import dotenv_values
import pandas as pd
import warnings
import numpy as np

warnings.filterwarnings('ignore')


#### Connect to server for 2020 and 2021 datasets

In [2]:
# load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials from .env file
database=environment_variables.get("DATABASE")
server=environment_variables.get("SERVER")
login=environment_variables.get("LOGIN")
password=environment_variables.get("PASSWORD")

# create a connection string
connection_string=f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={login};PWD={password}"
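
The connection string above is built from four `.env` values. If any of them is missing, `.get` returns `None` and the problem only surfaces when connecting. A minimal sketch of validating the settings first follows; `build_connection_string` and the placeholder values are hypothetical.

```python
def build_connection_string(env):
    """Build an ODBC connection string, failing early on missing settings."""
    required = ("SERVER", "DATABASE", "LOGIN", "PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['LOGIN']};PWD={env['PASSWORD']}"
    )

# Placeholder values, not real credentials
example_env = {
    "SERVER": "example-server",
    "DATABASE": "exampleDB",
    "LOGIN": "user",
    "PASSWORD": "secret",
}
print(build_connection_string(example_env))
```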

In [4]:
# create connection using the pyodbc method 

connection = pyodbc.connect(connection_string)

#### Select tables of interest from the Database

In [5]:
# selecting tables from Database
db_query = ''' SELECT *
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_TYPE = 'BASE TABLE' '''

#### View tables of interest from the Database for verification purposes

In [6]:
# call selected table from SQL Database
ata=pd.read_sql(db_query, connection)

ata

Unnamed: 0,TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
0,dapDB,dbo,LP1_startup_funding2021,BASE TABLE
1,dapDB,dbo,LP1_startup_funding2020,BASE TABLE


## Data_2020 Manipulation

In [7]:
# Call DataFrame to understand DataFrame details for 2020
query= "SELECT * FROM dbo.LP1_startup_funding2020"
data_2020 =pd.read_sql(query, connection)

data_2020.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [8]:
#check the shape
data_2020.shape    

(1055, 10)

In [9]:
# Check the description of the DataFrame

data_2020.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,842.0,2015.363,4.097909,1973.0,2014.0,2016.0,2018.0,2020.0
Amount,801.0,113043000.0,2476635000.0,12700.0,1000000.0,3000000.0,11000000.0,70000000000.0


#### Observation
The average founding year is about 2015, with a standard deviation of roughly 4.1 years. The quartiles (2014, 2016, 2018) show that most companies in the 2020 dataset were founded within a few years of that mean, although the earliest dates back to 1973.
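
The same summary table shows a heavily skewed `Amount` column: the mean is about 113 million while the median is 3 million. The toy example below, using made-up values, illustrates how a single extreme value pulls the mean away from the median. This is worth keeping in mind when choosing an imputation statistic.

```python
import pandas as pd

# Four typical rounds plus one extreme value (made-up numbers)
amounts = pd.Series(
    [1_000_000, 2_000_000, 3_000_000, 4_000_000, 70_000_000_000],
    dtype="float64",
)
print("mean:  ", amounts.mean())
print("median:", amounts.median())
```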

In [10]:
# check data information to understand data structure
data_2020.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.6+ KB


#### Verify rate of missing values

In [11]:
# Calculate the number of missing values
missing_values_count = data_2020.isna().sum()

# Calculate the percentage of missing values
missing_values_percentage = (missing_values_count / len(data_2020)) * 100

# Combine the two into a DataFrame for better readability
missing_values_summary = pd.DataFrame({
    'Missing Values': missing_values_count,
    'Percentage': missing_values_percentage
})

# Print the summary
print(missing_values_summary)


               Missing Values  Percentage
Company_Brand               0    0.000000
Founded                   213   20.189573
HeadQuarter                94    8.909953
Sector                     13    1.232227
What_it_does                0    0.000000
Founders                   12    1.137441
Investor                   38    3.601896
Amount                    254   24.075829
Stage                     464   43.981043
column10                 1053   99.810427


#### Statistical Observations on missing values
The 2020 dataset is complete for **Company_Brand** and **What_it_does**, so every company has consistent descriptive information. Apart from **column10**, which is empty in 99.81% of rows, the **Stage** column has the largest gap at 43.98% missing, which will limit any analysis of funding stages. **Amount** (24.08%) and **Founded** (20.19%) also have substantial gaps, and **HeadQuarter** is missing in 8.91% of rows. **Sector**, **Founders**, and **Investor** are missing only 1.23%, 1.14%, and 3.60% respectively.

In [12]:
#Import relevant package
from sklearn.impute import SimpleImputer

# Calculate the count and percentage of missing values for each column
missing_values_count = data_2020.isna().sum()
missing_values_percentage = (missing_values_count / len(data_2020)) * 100

# Combine the count and percentage into a DataFrame for better readability
missing_values_summary = pd.DataFrame({
    'Missing Values': missing_values_count,
    'Percentage': missing_values_percentage
})

# Print the missing values summary
print("Missing Values Summary:\n", missing_values_summary)

# Define a function to replace string 'None' with 'Undisclosed' in specified columns
def replace_none_with_undisclosed(df, columns):
    for column in columns:
        # Replace 'None' with 'Undisclosed'
        df[column] = df[column].astype(str).replace('None', 'Undisclosed')
    return df

# Apply the function to the relevant columns to standardize missing data representation
columns_to_replace = ['HeadQuarter', 'Stage', 'Sector', 'Founders', 'Investor']
data_2020 = replace_none_with_undisclosed(data_2020, columns_to_replace)

# Create a SimpleImputer that fills missing values (np.nan) with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Apply the imputer to the categorical columns.
# Note: the replacement above already turned missing entries into strings, so no
# np.nan is left here and 'Undisclosed' values pass through unchanged
data_2020[columns_to_replace] = categorical_imputer.fit_transform(data_2020[columns_to_replace])

# Create a SimpleImputer object for numerical columns 'Founded' and 'Amount' using the mean strategy
numerical_columns = ['Founded', 'Amount']
numerical_imputer = SimpleImputer(strategy='mean')

# Apply the imputer to fill missing values in the numerical columns
data_2020[numerical_columns] = numerical_imputer.fit_transform(data_2020[numerical_columns])


# Verify the changes by printing the updated missing values summary
missing_values_count_updated = data_2020.isna().sum()
missing_values_summary_updated = pd.DataFrame({
    'Missing Values': missing_values_count_updated,
    'Percentage': (missing_values_count_updated / len(data_2020)) * 100
})




Missing Values Summary:
                Missing Values  Percentage
Company_Brand               0    0.000000
Founded                   213   20.189573
HeadQuarter                94    8.909953
Sector                     13    1.232227
What_it_does                0    0.000000
Founders                   12    1.137441
Investor                   38    3.601896
Amount                    254   24.075829
Stage                     464   43.981043
column10                 1053   99.810427
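
`SimpleImputer` only fills entries it recognises as missing (`np.nan` by default), so a placeholder string such as 'Undisclosed' is left untouched, as the `Stage` values in the later duplicate listing show. If mode imputation of the categorical columns is actually wanted, one option is to turn the placeholders into NaN first and then fill with the column mode. The following is a pandas-only sketch on made-up data; the helper name `fill_with_mode` and the placeholder list are assumptions.

```python
import pandas as pd

def fill_with_mode(df, columns, placeholders=("None", "nan", "Undisclosed")):
    """Return a copy where placeholders and missing values become the column mode."""
    out = df.copy()
    for column in columns:
        # mask() turns every placeholder string into a missing value
        cleaned = out[column].mask(out[column].isin(list(placeholders)))
        mode = cleaned.mode(dropna=True)
        if not mode.empty:
            cleaned = cleaned.fillna(mode.iloc[0])
        out[column] = cleaned
    return out

toy = pd.DataFrame({"Stage": ["Seed", None, "Seed", "Series A", "Undisclosed"]})
filled = fill_with_mode(toy, ["Stage"])
print(filled["Stage"].tolist())
```

Whether imputing a stage label is preferable to keeping an explicit 'Undisclosed' category is a judgment call; with about 44% of `Stage` missing, the mode would dominate the column.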


#### Check for duplicate rows

In [13]:
# Check for duplicate rows
duplicates = data_2020.duplicated()

# Count duplicate rows
num_duplicates = duplicates.sum()
print(f"Number of duplicate rows: {num_duplicates}")

# Print duplicate rows
duplicate_rows = data_2020[duplicates]
print("Duplicate rows:")
print(duplicate_rows)

# Drop duplicate rows
data_2020_cleaned = data_2020.drop_duplicates()

# Display the number of rows before and after removing duplicates
print(f"Number of rows before removing duplicates: {data_2020.shape[0]}")
print(f"Number of rows after removing duplicates: {data_2020_cleaned.shape[0]}")


Number of duplicate rows: 3
Duplicate rows:
    Company_Brand  Founded HeadQuarter                 Sector  \
145     Krimanshi   2015.0     Jodhpur  Biotechnology company   
205         Nykaa   2012.0      Mumbai              Cosmetics   
362        Byju’s   2011.0   Bangalore                 EdTech   

                                          What_it_does         Founders  \
145  Krimanshi aims to increase rural income by imp...     Nikhil Bohra   
205  Nykaa is an online marketplace for different b...    Falguni Nayar   
362  An Indian educational technology and online tu...  Byju Raveendran   

                                           Investor        Amount  \
145  Rajasthan Venture Capital Fund, AIM Smart City  6.000000e+05   
205                        Alia Bhatt, Katrina Kaif  1.130430e+08   
362           Owl Ventures, Tiger Global Management  5.000000e+08   

           Stage column10  
145         Seed     None  
205  Undisclosed     None  
362  Undisclosed     None  
Numbe
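
`duplicated()` flags only the second and later occurrences by default, so the listing above shows one row per duplicate. To inspect every copy side by side before dropping, `keep=False` can be passed. A small sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Company_Brand": ["Alpha", "Beta", "Alpha", "Gamma"],
    "Amount": [100.0, 200.0, 100.0, 300.0],
})

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

deduplicated = toy.drop_duplicates()
print(len(toy), "->", len(deduplicated))
```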

In [14]:
#Verify data information and structure    
data_2020_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1052 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1052 non-null   object 
 1   Founded        1052 non-null   float64
 2   HeadQuarter    1052 non-null   object 
 3   Sector         1052 non-null   object 
 4   What_it_does   1052 non-null   object 
 5   Founders       1052 non-null   object 
 6   Investor       1052 non-null   object 
 7   Amount         1052 non-null   float64
 8   Stage          1052 non-null   object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 90.4+ KB


## Data_2021 Manipulation

In [15]:
# Call DataFrame to understand DataFrame details for 2021.
query= "SELECT * FROM dbo.LP1_startup_funding2021"
data_2021 =pd.read_sql(query, connection)

data_2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [16]:
#check the shape
data_2021.shape    

(1209, 9)

In [17]:
# check data information and structure
data_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


#### Observation
The 'Amount' column, which represents financial figures, is currently of type object, indicating it contains string data.

### Investigate Why Amount Dtype is object instead of float

In [18]:
# Display the 'Amount' column
# The currency symbol is one reason the column is stored as object instead of float

print(data_2021['Amount'])

0         $1,200,000
1       $120,000,000
2        $30,000,000
3        $51,000,000
4         $2,000,000
            ...     
1204        $3000000
1205       $20000000
1206       $55000000
1207       $26000000
1208        $8000000
Name: Amount, Length: 1209, dtype: object


In [19]:
# Removing dollar signs and commas
data_2021['Amount'] = data_2021['Amount'].str.replace(r'[\$,]', '', regex=True)

# Display the 'Amount' column
print(data_2021['Amount'])

0         1200000
1       120000000
2        30000000
3        51000000
4         2000000
          ...    
1204      3000000
1205     20000000
1206     55000000
1207     26000000
1208      8000000
Name: Amount, Length: 1209, dtype: object


In [20]:
# Flag values in Amount that are still non-numeric strings (True = non-numeric)
data_2021['Amount'].apply(lambda x: isinstance(x, str) and not x.replace(',', '').replace('$', '').isdigit())


0       False
1       False
2       False
3       False
4       False
        ...  
1204    False
1205    False
1206    False
1207    False
1208    False
Name: Amount, Length: 1209, dtype: bool

In [21]:
# Filter Non-Numeric Values:
non_numeric_rows=data_2021[data_2021['Amount'].apply(lambda x: isinstance(x, str) and not x.replace(',', '').replace('$', '').isdigit())]

print(non_numeric_rows)


              Company_Brand  Founded HeadQuarter                    Sector  \
7               Qube Health   2016.0      Mumbai                HealthTech   
8                  Vitra.ai   2020.0   Bangalore              Tech Startup   
21                    Uable   2020.0   Bangalore                    EdTech   
39                 TruNativ   2019.0      Mumbai          Food & Beverages   
54                   AntWak   2019.0   Bangalore                    EdTech   
...                     ...      ...         ...                       ...   
1148              Godamwale   2016.0      Mumbai  Logistics & Supply Chain   
1160  Atomberg Technologies   2012.0      Mumbai      Consumer Electronics   
1161        Genext Students   2013.0      Mumbai                    EdTech   
1166              OckyPocky   2015.0    Gurugram                    EdTech   
1193        Sapio Analytics   2019.0      Mumbai         Computer Software   

                                           What_it_does  \
7   

#### Amount Observation
The Amount column still contains non-numeric values, such as undisclosed amounts in some rows, which is why its dtype is object rather than float.

In [22]:
# Replace "Undisclosed" with NaN in the 'Amount' column
data_2021['Amount'] = data_2021['Amount'].replace('Undisclosed', np.nan)

# Filter non-numeric values in the 'Amount' column
non_numeric_values = data_2021[~data_2021['Amount'].apply(lambda x: isinstance(x, (int, float)))]['Amount']


# Print to check the dtype after replacing 'Undisclosed' with NaN.
# The result is still object dtype and needs further investigation.

# Print non-numeric values
print("Non-numeric values after replacing 'Undisclosed':")
print(non_numeric_values)

Non-numeric values after replacing 'Undisclosed':
0         1200000
1       120000000
2        30000000
3        51000000
4         2000000
          ...    
1204      3000000
1205     20000000
1206     55000000
1207     26000000
1208      8000000
Name: Amount, Length: 1093, dtype: object


#### Further investigation on the 'Amount' column

In [23]:
# Print unique non-numeric values, if any
if not non_numeric_values.empty:
    print("Non-numeric values after replacing 'Undisclosed':")
    print(non_numeric_values.unique())
else:
    print("No non-numeric values found after replacing 'Undisclosed'.")

Non-numeric values after replacing 'Undisclosed':
['1200000' '120000000' '30000000' '51000000' '2000000' '188000000'
 '200000' '1000000' '3000000' '100000' '700000' '9000000' '40000000'
 '49000000' '400000' '300000' '25000000' '160000000' '150000' '1800000'
 '5000000' '850000' '53000000' '500000' '1100000' '6000000' '800000'
 '10000000' '21000000' '7500000' '26000000' '7400000' '1500000' '600000'
 '800000000' '17000000' '3500000' '15000000' '215000000' '2500000'
 '350000000' '5500000' '83000000' '110000000' '500000000' '65000000'
 '150000000000' '300000000' '2200000' '35000000' '140000000' '4000000'
 '13000000' None '9500000' '8000000' 'Upsparks' '12000000' '1700000'
 '150000000' '100000000' '225000000' '6700000' '1300000' '20000000'
 '250000' '52000000' '3800000' '17500000' '42000000' '2300000' '7000000'
 '450000000' '28000000' '8500000' '37000000' '370000000' '16000000'
 '44000000' '770000' '125000000' '50000000' '4900000' '145000000'
 '22000000' '70000000' '6600000' '32000000' '2400

#### Observation: some numeric values are stored as strings, alongside a few non-numeric values

#### Convert numeric string to numeric values

In [24]:
# Define function to convert numeric strings to numeric values
def convert_to_numeric(value):
    try:
        return float(value)
    except (ValueError, TypeError):
        return np.nan

# Replace numeric strings with their corresponding numeric values
data_2021['Amount'] = data_2021['Amount'].apply(convert_to_numeric)

# Print the modified DataFrame
print(data_2021)

       Company_Brand  Founded HeadQuarter                 Sector  \
0     Unbox Robotics   2019.0   Bangalore             AI startup   
1             upGrad   2015.0      Mumbai                 EdTech   
2        Lead School   2012.0      Mumbai                 EdTech   
3            Bizongo   2015.0      Mumbai         B2B E-commerce   
4           FypMoney   2021.0    Gurugram                FinTech   
...              ...      ...         ...                    ...   
1204        Gigforce   2019.0    Gurugram  Staffing & Recruiting   
1205          Vahdam   2015.0   New Delhi       Food & Beverages   
1206    Leap Finance   2019.0   Bangalore     Financial Services   
1207    CollegeDekho   2015.0    Gurugram                 EdTech   
1208          WeRize   2019.0   Bangalore     Financial Services   

                                           What_it_does  \
0     Unbox Robotics builds on-demand AI-driven ware...   
1        UpGrad is an online higher education platform.   
2     

#### Verify the right dtype conversion on Amount 

In [25]:
# Check and print the data type of the 'Amount' column
print("Data type of 'Amount' column:", data_2021['Amount'].dtype)

Data type of 'Amount' column: float64
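
The clean-up of `Amount` spread over the cells above can also be written as one vectorised step. The sketch below assumes every valid amount is a plain dollar figure; the helper name `clean_amount` and the sample values are illustrative.

```python
import pandas as pd

def clean_amount(series):
    """Strip '$' and ',' and coerce anything non-numeric to NaN."""
    stripped = series.astype(str).str.replace(r"[\$,]", "", regex=True).str.strip()
    # errors="coerce" turns 'Undisclosed', investor names, etc. into NaN
    return pd.to_numeric(stripped, errors="coerce")

raw = pd.Series(["$1,200,000", "Undisclosed", None, "Upsparks", "3000000"])
cleaned = clean_amount(raw)
print(cleaned)
```

Coercion silently discards misplaced values such as 'Upsparks', so it is worth listing the coerced rows before relying on the result.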


In [26]:
# Confirm data quality and consistency in Dtype conversion
data_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1056 non-null   float64
 8   Stage          781 non-null    object 
dtypes: float64(2), object(7)
memory usage: 85.1+ KB


#### Handle Missing values in data_2021

In [27]:
# Calculate missing values and their percentage
missing_values = data_2021.isnull().sum()
missing_percentage = (missing_values / len(data_2021)) * 100

# Print missing values with percentage
missing_info = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
print(missing_info)


               Missing Values  Percentage
Company_Brand               0    0.000000
Founded                     1    0.082713
HeadQuarter                 1    0.082713
Sector                      0    0.000000
What_it_does                0    0.000000
Founders                    4    0.330852
Investor                   62    5.128205
Amount                    153   12.655087
Stage                     428   35.401158


Most columns have very few missing values. The exceptions are Stage, with 35.40% missing, and to a lesser extent Amount, with 12.66% missing; every other column is more than 94% complete.

Rows with a missing value in any column whose missing rate is below 15% will be dropped to maintain data integrity. Columns whose missing rate exceeds 15%, which here means only 'Stage' at 35.40%, will be imputed instead of dropped. This retains as much data as possible while keeping the 'Stage' information available for analysis despite its high rate of missingness.
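
The policy described above can be sketched as a small helper: drop rows that lack a value in any column whose missing rate is below the threshold, then fill the remaining columns with their mode. The sketch uses pandas' `fillna`, which treats both `None` and `np.nan` as missing. The function name `apply_missing_policy` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def apply_missing_policy(df, threshold=0.15):
    """Drop rows for sparsely missing columns; mode-fill heavily missing ones."""
    missing_rate = df.isna().mean()
    low_gap_columns = missing_rate[missing_rate < threshold].index
    out = df.dropna(subset=low_gap_columns).copy()
    for column in missing_rate[missing_rate >= threshold].index:
        mode = out[column].mode(dropna=True)
        if not mode.empty:
            out[column] = out[column].fillna(mode.iloc[0])
    return out

toy = pd.DataFrame({
    "Amount": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Stage": ["Seed", None, "Seed", None, "Series A",
              "Seed", None, "Seed", None, "Seed"],
})
result = apply_missing_policy(toy)
print(result)
```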







In [28]:
from sklearn.impute import SimpleImputer

# Identify columns with missing values less than 15%
columns_to_drop = data_2021.columns[data_2021.isnull().sum() / len(data_2021) < 0.15]

# Drop rows with missing values in those columns
data_2021_dropped = data_2021.dropna(subset=columns_to_drop)

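# Fill missing values in the 'Stage' column with the mode.
# Note: the info() output below still reports 664 non-null 'Stage' values, so this
# step did not fill the gaps (possibly because they are stored as None, not np.nan)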
stage_imputer = SimpleImputer(strategy='most_frequent')
data_2021_dropped['Stage'] = stage_imputer.fit_transform(data_2021_dropped[['Stage']]).ravel()  # Flatten the 2D array

# Verify the changes
data_2021_dropped.info()


<class 'pandas.core.frame.DataFrame'>
Index: 994 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  994 non-null    object 
 1   Founded        994 non-null    float64
 2   HeadQuarter    994 non-null    object 
 3   Sector         994 non-null    object 
 4   What_it_does   994 non-null    object 
 5   Founders       994 non-null    object 
 6   Investor       994 non-null    object 
 7   Amount         994 non-null    float64
 8   Stage          664 non-null    object 
dtypes: float64(2), object(7)
memory usage: 77.7+ KB


#### Check for duplicates

In [29]:
# Check the number of duplicate rows
num_duplicates = data_2021_dropped.duplicated().sum()

# Print the number of duplicate rows
print("Number of duplicate rows:", num_duplicates)

Number of duplicate rows: 15


In [30]:
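# Drop duplicate rows.
# Note: this starts from data_2021, not data_2021_dropped, so the steps above are not carried over
data_2021_no_duplicates = data_2021.drop_duplicates()

# Drop rows with a missing value in any column, including 'Stage'
data_2021_cleaned = data_2021_no_duplicates.dropna()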

# Verify the changes
data_2021_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 654 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  654 non-null    object 
 1   Founded        654 non-null    float64
 2   HeadQuarter    654 non-null    object 
 3   Sector         654 non-null    object 
 4   What_it_does   654 non-null    object 
 5   Founders       654 non-null    object 
 6   Investor       654 non-null    object 
 7   Amount         654 non-null    float64
 8   Stage          654 non-null    object 
dtypes: float64(2), object(7)
memory usage: 51.1+ KB


## Data_2019 Manipulation

#### Load csv data from other sources for analysis

In [31]:
# Read 2019 DataFrame to understand data structure.
data_2019=pd.read_csv("D:\\JHanson\\Justice Hanson\\DS Career Accelerator\\Project 1\\Indian-Start-up-Investment-Analysis\\CSV Data\\startup_funding2019.csv")

data_2019.head(5)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


#### Check data structure

In [32]:
# Check data shape 

data_2019.shape

(89, 9)

In [33]:
# Check data information and Dtype
data_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


By observation, `Amount($)` is not stored in the correct data type.

### Investigate why Amount ($) is object data type instead of float

In [34]:
# Print the 'Amount($)' column to examine its contents
print(data_2019['Amount($)'])

0       $6,300,000
1     $150,000,000
2      $28,000,000
3      $30,000,000
4       $6,000,000
          ...     
84     $20,000,000
85    $693,000,000
86      $5,000,000
87     $50,000,000
88     $33,000,000
Name: Amount($), Length: 89, dtype: object


In [35]:
# Replace 'Undisclosed' with np.nan
data_2019['Amount($)'] = data_2019['Amount($)'].replace('Undisclosed', np.nan)

# Remove dollar signs and commas
data_2019['Amount($)'] = data_2019['Amount($)'].str.replace(r'[\$,]', '', regex=True)

print(data_2019['Amount($)'])

0       6300000
1     150000000
2      28000000
3      30000000
4       6000000
        ...    
84     20000000
85    693000000
86      5000000
87     50000000
88     33000000
Name: Amount($), Length: 89, dtype: object


#### Further investigate why Amount dtype is still object after stripping off $-symbol

In [36]:
# Print unique values in the 'Amount($)' column to check for any anomalies
unique_values = data_2019['Amount($)'].unique()
print(unique_values)

['6300000' '150000000' '28000000' '30000000' '6000000' nan '1000000'
 '20000000' '275000000' '22000000' '5000000' '140500' '540000000'
 '15000000' '182700' '12000000' '11000000' '15500000' '1500000' '5500000'
 '2500000' '140000' '230000000' '49400000' '32000000' '26000000' '150000'
 '400000' '2000000' '100000000' '8000000' '100000' '50000000' '120000000'
 '4000000' '6800000' '36000000' '5700000' '25000000' '600000' '70000000'
 '60000000' '220000' '2800000' '2100000' '7000000' '311000000' '4800000'
 '693000000' '33000000']


The column now contains only numeric strings and NaN; the strings still need to be converted to numeric values.

In [37]:
# Replace non-numeric entries with np.nan
data_2019['Amount($)'] = data_2019['Amount($)'].replace(['Undisclosed', 'N/A', 'na', 'NaN'], np.nan)

# Remove dollar signs and commas
data_2019['Amount($)'] = data_2019['Amount($)'].str.replace(r'[\$,]', '', regex=True)

# Convert 'Amount($)' column to float
data_2019['Amount($)'] = pd.to_numeric(data_2019['Amount($)'], errors='coerce')

# Print the cleaned 'Amount($)' column to verify
print(data_2019['Amount($)'])

0       6300000.0
1     150000000.0
2      28000000.0
3      30000000.0
4       6000000.0
         ...     
84     20000000.0
85    693000000.0
86      5000000.0
87     50000000.0
88     33000000.0
Name: Amount($), Length: 89, dtype: float64


In [38]:
# Verify data structure and rate of missing values to inform the next decision
# Verify data types
print(data_2019.info())

# Calculate and print the percentage of missing values
missing_values_percentage = data_2019.isnull().mean() * 100
print("\nPercentage of Missing Values:")
print(missing_values_percentage)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      77 non-null     float64
 8   Stage          43 non-null     object 
dtypes: float64(2), object(7)
memory usage: 6.4+ KB
None

Percentage of Missing Values:
Company/Brand     0.000000
Founded          32.584270
HeadQuarter      21.348315
Sector            5.617978
What it does      0.000000
Founders          3.370787
Investor          0.000000
Amount($)        13.483146
Stage            51.685393
dtype: float64


In [39]:
# Verify values in HeadQuarter
print(data_2019['HeadQuarter'].unique())


[nan 'Mumbai' 'Chennai' 'Telangana' 'Pune' 'Bangalore' 'Noida' 'Delhi'
 'Ahmedabad' 'Gurugram' 'Haryana' 'Chandigarh' 'Jaipur' 'New Delhi'
 'Surat' 'Uttar pradesh' 'Hyderabad' 'Rajasthan']


#### Impute missing values: HeadQuarter and Stage with the most frequent value (mode), and Founded with the mean. Location names are not yet standardized at this point.

In [40]:
# Impute missing values for 'HeadQuarter' column with mode
headquarter_imputer = SimpleImputer(strategy='most_frequent')
data_2019['HeadQuarter'] = headquarter_imputer.fit_transform(data_2019['HeadQuarter'].values.reshape(-1, 1)).ravel()

# Impute missing values for 'Founded' column with the mean (strategy='mean')
founded_imputer = SimpleImputer(strategy='mean')
data_2019['Founded'] = founded_imputer.fit_transform(data_2019[['Founded']]).ravel()

# Impute missing values for 'Stage' column with mode
stage_imputer = SimpleImputer(strategy='most_frequent')
data_2019['Stage'] = stage_imputer.fit_transform(data_2019[['Stage']]).ravel()
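
The unique values printed for HeadQuarter include near-duplicates such as 'Delhi' and 'New Delhi', and inconsistent casing such as 'Uttar pradesh'. Below is a minimal sketch of how these could be standardized before mode imputation. The `HQ_ALIASES` mapping and the `standardize_headquarter` helper are hypothetical names introduced for illustration, and the choice of target spellings is an assumption, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical alias table: the target spellings are an editorial assumption.
HQ_ALIASES = {
    "New Delhi": "Delhi",
    "Uttar pradesh": "Uttar Pradesh",
}

def standardize_headquarter(series: pd.Series) -> pd.Series:
    """Trim surrounding whitespace and collapse known aliases; NaN stays NaN."""
    return series.str.strip().replace(HQ_ALIASES)

# Toy values modelled on the unique() output above, not the real column.
hq_demo = pd.Series(["New Delhi", " Delhi", "Uttar pradesh", np.nan, "Mumbai"])
hq_standardized = standardize_headquarter(hq_demo)
print(hq_standardized.tolist())
```

Standardizing first matters for mode imputation, because split spellings of one city can understate its frequency.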



In [41]:
# Assign the variable data_2019_clean
data_2019_clean = data_2019.copy()

# Print for verification
#print(data_2019_clean)

# Verify imputed values
print(data_2019_clean.info())

# Check for missing values again
missing_values_after_imputation = data_2019_clean.isnull().sum()
print("\nPercentage of Missing Values After Imputation:")
print((missing_values_after_imputation / len(data_2019_clean)) * 100)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        89 non-null     float64
 2   HeadQuarter    89 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      77 non-null     float64
 8   Stage          89 non-null     object 
dtypes: float64(2), object(7)
memory usage: 6.4+ KB
None

Percentage of Missing Values After Imputation:
Company/Brand     0.000000
Founded           0.000000
HeadQuarter       0.000000
Sector            5.617978
What it does      0.000000
Founders          3.370787
Investor          0.000000
Amount($)        13.483146
Stage             0.000000
dtype: float64


#### Confirm imputation

In [42]:
# Verify imputed values
data_2019_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        89 non-null     float64
 2   HeadQuarter    89 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      77 non-null     float64
 8   Stage          89 non-null     object 
dtypes: float64(2), object(7)
memory usage: 6.4+ KB


## Data_2018 Manipulation

In [43]:
# Read 2018 DataFrame to understand data structure.
data_2018 = pd.read_csv("D:\\JHanson\\Justice Hanson\\DS Career Accelerator\\Project 1\\Indian-Start-up-Investment-Analysis\\CSV Data\\startup_funding2018.csv")

data_2018.head(5)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


#### Check Data Structure

In [44]:
# Confirm Shape
data_2018.shape

(526, 6)

In [45]:
# Check data information and Dtype
data_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


#### Compared with the other datasets, data_2018 lacks some columns (Founded, Founders, Investor). It has no missing values, but the Amount column has the wrong data type (object instead of numeric).

### Investigate Dtype for Amount

In [46]:
# Print the 'Amount' column to examine its contents
print(data_2018['Amount'])

0           250000
1      ₹40,000,000
2      ₹65,000,000
3          2000000
4                —
          ...     
521      225000000
522              —
523           7500
524    ₹35,000,000
525       35000000
Name: Amount, Length: 526, dtype: object


#### Strip off currency symbol and convert Amount to the right dtype

In [47]:
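# Remove currency symbols and commas, then convert to numeric.
# Note: this discards the ₹ marker, so rupee amounts are kept at face value rather than converted to dollars.
data_2018['Amount'] = data_2018['Amount'].str.replace(r'[^\d]', '', regex=True)  # Remove non-digit characters (dash placeholders become empty strings)
data_2018['Amount'] = pd.to_numeric(data_2018['Amount'], errors='coerce')  # Convert to numeric; empty or unparseable entries become NaN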

# Print the cleaned data and its data type
print(data_2018['Amount'].dtype)
print(data_2018['Amount'])

float64
0         250000.0
1       40000000.0
2       65000000.0
3        2000000.0
4              NaN
          ...     
521    225000000.0
522            NaN
523         7500.0
524     35000000.0
525     35000000.0
Name: Amount, Length: 526, dtype: float64
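
The cleaning step above strips the ₹ symbol, so rupee-denominated rows end up on the same scale as dollar rows. Below is a minimal sketch of how the two could be separated before conversion. It assumes that unprefixed values are already in US dollars. The `amount_to_usd` helper is hypothetical, and `INR_PER_USD` is a placeholder rather than a sourced exchange rate.

```python
import pandas as pd

# Placeholder only: substitute a sourced INR-per-USD rate for the period analysed.
INR_PER_USD = 80.0

def amount_to_usd(raw: pd.Series, inr_per_usd: float = INR_PER_USD) -> pd.Series:
    """Parse strings such as '₹40,000,000', '2000000' or '—' into USD floats."""
    text = raw.astype(str)
    is_inr = text.str.startswith("₹")          # flag rupee rows before stripping symbols
    digits = text.str.replace(r"[^\d.]", "", regex=True)
    values = pd.to_numeric(digits, errors="coerce")  # empty strings become NaN
    # Keep non-rupee values as they are; divide rupee values by the assumed rate.
    return values.where(~is_inr, values / inr_per_usd)

# Toy values modelled on the printed column, not the full dataset.
amount_demo = pd.Series(["250000", "₹40,000,000", "—"])
amount_usd = amount_to_usd(amount_demo)
print(amount_usd)
```

Flagging the currency before stripping symbols is the key design choice: once the prefix is removed, the two currencies can no longer be told apart.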


#### Calculate rate of missing values in data_2018

In [48]:
# Calculate the number of missing values
missing_values_count = data_2018.isna().sum()

# Calculate the percentage of missing values
missing_values_percentage = (missing_values_count / len(data_2018)) * 100

# Combine the two into a DataFrame for better readability
missing_values_summary = pd.DataFrame({
    'Missing Values': missing_values_count,
    'Percentage': missing_values_percentage
})

# Print the summary
print(missing_values_summary)

               Missing Values  Percentage
Company Name                0    0.000000
Industry                    0    0.000000
Round/Series                0    0.000000
Amount                    148   28.136882
Location                    0    0.000000
About Company               0    0.000000


#### Handle missing values in Amount Column

In [49]:
# Create the imputer to fill missing values with the mean
amount_imputer = SimpleImputer(strategy='mean')

# Impute the missing values (ravel flattens the (n, 1) result to a 1-D column)
data_2018['Amount'] = amount_imputer.fit_transform(data_2018[['Amount']]).ravel()

# Save the cleaned data to a new variable
data_2018_clean = data_2018.copy()

# Print the cleaned data and its data type for verification
print(data_2018_clean.info())

# Check for missing values again
missing_values_after_imputation = data_2018_clean.isnull().sum()
print("\nPercentage of Missing Values After Imputation:")
print((missing_values_after_imputation / len(data_2018_clean)) * 100)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company Name   526 non-null    object 
 1   Industry       526 non-null    object 
 2   Round/Series   526 non-null    object 
 3   Amount         526 non-null    float64
 4   Location       526 non-null    object 
 5   About Company  526 non-null    object 
dtypes: float64(1), object(5)
memory usage: 24.8+ KB
None

Percentage of Missing Values After Imputation:
Company Name     0.0
Industry         0.0
Round/Series     0.0
Amount           0.0
Location         0.0
About Company    0.0
dtype: float64
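
Funding amounts tend to be heavily right-skewed, so the choice between mean and median imputation can change the filled value by orders of magnitude. The sketch below illustrates this on toy numbers (not taken from the dataset). It is meant only as a way to compare the two strategies, not as a statement of which one the analysis should use.

```python
import numpy as np
import pandas as pd

# Toy amounts: four small rounds, one very large round, one missing value.
toy_amounts = pd.Series([1e5, 2e5, 3e5, 4e5, 1e9, np.nan])

mean_fill = toy_amounts.fillna(toy_amounts.mean())      # pulled up by the outlier
median_fill = toy_amounts.fillna(toy_amounts.median())  # stays near a typical round

print("Filled with mean:  ", mean_fill.iloc[-1])
print("Filled with median:", median_fill.iloc[-1])
```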


#### Print the column names of cleaned datasets to verify their structure

In [50]:
# Inspect column names in each DataFrame  to understand data structure.

print("Column names in 2018 DataFrame:")
print(data_2018_clean.columns)

print("\nColumn names in 2019 DataFrame:")
print(data_2019_clean.columns)

print("\nColumn names in 2020 DataFrame:")
print(data_2020_cleaned.columns)

print("\nColumn names in 2021 DataFrame:")
print(data_2021_cleaned.columns)

Column names in 2018 DataFrame:
Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company'],
      dtype='object')

Column names in 2019 DataFrame:
Index(['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
       'Founders', 'Investor', 'Amount($)', 'Stage'],
      dtype='object')

Column names in 2020 DataFrame:
Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')

Column names in 2021 DataFrame:
Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')


##### Column name Observations
The datasets from 2018 to 2021 on Indian startup funding use different column names. These inconsistencies call for a renaming strategy that aligns the columns before the datasets can be merged into one DataFrame.

## Column Mapping

In [51]:
# Define a column_mapping dictionary to standardize column names for consistent and accurate analysis
column_mapping = {
    'Company Name': 'Company',
    'Company/Brand': 'Company',
    'Company_Brand': 'Company',
    'Industry': 'Sector',
    'Sector': 'Sector',
    'Round/Series': 'Stage',
    'Stage': 'Stage',
    'Amount': 'Amount',
    'Amount($)': 'Amount',
    'Location': 'HeadQuarter',
    'HeadQuarter': 'HeadQuarter',
    'About Company': 'What_it_does',
    'What it does': 'What_it_does',
    'What_it_does': 'What_it_does',
    'Founded': 'Founded',
    'Founders': 'Founders',
    'Investor': 'Investor'
}

# Rename the columns in each DataFrame using the rename method and the column_mapping
data_2018_clean.rename(columns=column_mapping, inplace=True)
data_2019_clean.rename(columns=column_mapping, inplace=True)
data_2020_cleaned.rename(columns=column_mapping, inplace=True)
data_2021_cleaned.rename(columns=column_mapping, inplace=True)

# Print the renamed column names for verification
print("\nRenamed column names:")
print("2018 DataFrame:", data_2018_clean.columns)
print("2019 DataFrame:", data_2019_clean.columns)
print("2020 DataFrame:", data_2020_cleaned.columns)
print("2021 DataFrame:", data_2021_cleaned.columns)


Renamed column names:
2018 DataFrame: Index(['Company', 'Sector', 'Stage', 'Amount', 'HeadQuarter', 'What_it_does'], dtype='object')
2019 DataFrame: Index(['Company', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')
2020 DataFrame: Index(['Company', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')
2021 DataFrame: Index(['Company', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')


## Merge 2018_Cleaned, 2019_Cleaned, 2020_Cleaned, 2021_Cleaned into one table (df_merged)

In [52]:
# Merge DataFrames into one table using the concatenation function.
df_merged = pd.concat([data_2018_clean, data_2019_clean, data_2020_cleaned, data_2021_cleaned], ignore_index=True)

# Print the merged DataFrame (pandas truncates long output)
print("\nMerged DataFrame:")
print(df_merged)


Merged DataFrame:
              Company                                             Sector  \
0     TheCollegeFever  Brand Marketing, Event Promotion, Marketing, S...   
1     Happy Cow Dairy                               Agriculture, Farming   
2          MyLoanCare   Credit, Financial Services, Lending, Marketplace   
3         PayMe India                        Financial Services, FinTech   
4            Eunimart                 E-Commerce Platforms, Retail, SaaS   
...               ...                                                ...   
2316         Gigforce                              Staffing & Recruiting   
2317           Vahdam                                   Food & Beverages   
2318     Leap Finance                                 Financial Services   
2319     CollegeDekho                                             EdTech   
2320           WeRize                                 Financial Services   

             Stage        Amount                       HeadQuarter  

In [53]:
# Confirm merged DataFrame
df_merged.head()

Unnamed: 0,Company,Sector,Stage,Amount,HeadQuarter,What_it_does,Founded,Founders,Investor,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,40000000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,65000000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,239168300.0,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,


### Verify from the bottom of the merged DataFrame

In [54]:
# Print the last few rows of the merged DataFrame to verify that the 2021 rows were appended correctly

df_merged.tail()

Unnamed: 0,Company,Sector,Stage,Amount,HeadQuarter,What_it_does,Founded,Founders,Investor,column10
2316,Gigforce,Staffing & Recruiting,Pre-series A,3000000.0,Gurugram,A gig/on-demand staffing company.,2019.0,"Chirag Mittal, Anirudh Syal",Endiya Partners,
2317,Vahdam,Food & Beverages,Series D,20000000.0,New Delhi,VAHDAM is among the world’s first vertically i...,2015.0,Bala Sarda,IIFL AMC,
2318,Leap Finance,Financial Services,Series C,55000000.0,Bangalore,International education loans for high potenti...,2019.0,"Arnav Kumar, Vaibhav Singh",Owl Ventures,
2319,CollegeDekho,EdTech,Series B,26000000.0,Gurugram,"Collegedekho.com is Student’s Partner, Friend ...",2015.0,Ruchir Arora,"Winter Capital, ETS, Man Capital",
2320,WeRize,Financial Services,Series A,8000000.0,Bangalore,India’s first socially distributed full stack ...,2019.0,"Vishal Chopra, Himanshu Gupta","3one4 Capital, Kalaari Capital",
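
The merged table shown above has no column recording which yearly file each row came from, and year-over-year funding trends need that information. Below is a minimal sketch of tagging each frame before concatenation. The `concat_with_year` helper and the `Year` column name are hypothetical, introduced here for illustration.

```python
import pandas as pd

def concat_with_year(frames_by_year: dict) -> pd.DataFrame:
    """Tag each yearly frame with its funding year, then stack them."""
    tagged = [df.assign(Year=year) for year, df in frames_by_year.items()]
    return pd.concat(tagged, ignore_index=True)

# Toy frames standing in for the cleaned yearly DataFrames.
year_demo = concat_with_year({
    2018: pd.DataFrame({"Company": ["A"], "Amount": [1.0]}),
    2019: pd.DataFrame({"Company": ["B", "C"], "Amount": [2.0, 3.0]}),
})
print(year_demo)
```

With the notebook's variables, the call would take a dictionary such as `{2018: data_2018_clean, 2019: data_2019_clean, ...}`.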


## Data Quality Check on the merged DataFrame

In [55]:
# Drop 'column10' permanently
df_merged.drop(columns=['column10'], inplace=True)

# Print the info to verify that 'column10' has been dropped
print(df_merged.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2321 entries, 0 to 2320
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company       2321 non-null   object 
 1   Sector        2316 non-null   object 
 2   Stage         2321 non-null   object 
 3   Amount        2309 non-null   float64
 4   HeadQuarter   2321 non-null   object 
 5   What_it_does  2321 non-null   object 
 6   Founded       1795 non-null   float64
 7   Founders      1792 non-null   object 
 8   Investor      1795 non-null   object 
dtypes: float64(2), object(7)
memory usage: 163.3+ KB
None


#### Check missing values in df_merged

In [57]:
# Calculate and print the percentage of missing values for each column
missing_values = df_merged.isnull().sum()
missing_percentage = (missing_values / len(df_merged)) * 100

# Create a DataFrame to display missing values and their percentages
missing_df = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})

print("\nMissing Values and Their Percentages:")
print(missing_df)


Missing Values and Their Percentages:
              Missing Values  Percentage
Company                    0    0.000000
Sector                     5    0.215424
Stage                      0    0.000000
Amount                    12    0.517019
HeadQuarter                0    0.000000
What_it_does               0    0.000000
Founded                  526   22.662645
Founders                 529   22.791900
Investor                 526   22.662645


The missing values in Founded (22.66%), Founders (22.79%), and Investor (22.66%) arise mainly because data_2018 lacks these columns. Dropping the affected rows would discard all 526 rows of data_2018, so these values need to be imputed instead. By contrast, rows with a missing Amount (0.52%) or Sector (0.22%) will be dropped. This removes at most 17 of the 2,321 rows and leaves sufficient data for the analysis.
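
The plan described above can be sketched as follows: drop only the rows missing Amount or Sector, and fill the text columns with a placeholder instead of dropping them. The `"Undisclosed"` label is an assumption made for illustration, and the toy frame stands in for `df_merged`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_merged.
missing_demo = pd.DataFrame({
    "Sector": ["FinTech", None, "EdTech"],
    "Amount": [1.0, 2.0, np.nan],
    "Investor": [None, "X", "Y"],
})

# Drop rows lacking Amount or Sector; keep rows that only lack Investor.
kept = missing_demo.dropna(subset=["Amount", "Sector"]).reset_index(drop=True)

# Fill the remaining text gaps with a placeholder label (assumed value).
kept["Investor"] = kept["Investor"].fillna("Undisclosed")
print(kept)
```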

In [66]:
# Impute missing values in 'Founded' column with the mean (strategy='mean')
founded_imputer = SimpleImputer(strategy='mean')
df_merged['Founded'] = founded_imputer.fit_transform(df_merged[['Founded']])

In [75]:
print(df_merged[['Investor']].shape)


(1778, 1)


In [83]:
print(df_merged['Investor'])


526                                  Sixth Sense Ventures
527                                      General Atlantic
528        Deepak Parekh, Amitabh Bachchan, Piyush Pandey
529     Evolvence India Fund (EIF), Pidilite Group, FJ...
530              Innovation in Food and Agriculture (IFA)
                              ...                        
2316                                      Endiya Partners
2317                                             IIFL AMC
2318                                         Owl Ventures
2319                     Winter Capital, ETS, Man Capital
2320                       3one4 Capital, Kalaari Capital
Name: Investor, Length: 1778, dtype: object


In [76]:
print(df_merged['Investor'].unique())


['Sixth Sense Ventures' 'General Atlantic'
 'Deepak Parekh, Amitabh Bachchan, Piyush Pandey' ... 'Owl Ventures'
 'Winter Capital, ETS, Man Capital' '3one4 Capital, Kalaari Capital']
