Business Understanding:
This project explores the Indian startup ecosystem using funding data from 2018 to 2021 and proposes the best course of action for entering this dynamic market. The team intends to understand how the elements of the ecosystem interact. The analysis draws on datasets covering investment trends and sectoral growth to provide an overview that can guide strategic decision-making.
The datasets provide insight into locations, industries, experience levels, funding amounts, brands, sectors, and more. They allow the team to analyse trends across the years and to understand the evolving landscape of the Indian startup ecosystem.


Business/Analytical Questions:

1. Which sectors have shown the highest growth in funding received over the past four years?

2. Which geographical regions within India have emerged as primary hubs for startup activity and investment, and what factors contribute to their prominence?

3. Are there any notable differences in funding patterns between early-stage startups and more established companies?

4. Which sectors receive the lowest and the highest levels of funding in India, and what factors contribute to this?

5. Which investors have had the most impact on startups over the years?

6. What are the key characteristics of startups that successfully secure funding, and how do they differ from those that struggle to attract investment?

Hypothesis:
Null Hypothesis (H0): There is no significant difference in the amount of funding between startups in different locations.

Alternative Hypothesis (Ha): There is a significant difference in the amount of funding between startups in different locations.
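
One possible way to test this hypothesis is a permutation test on the one-way ANOVA F statistic, sketched below with numpy and pandas only. This is an illustration under assumptions, not the project's chosen method: the column names `location` and `amount`, the helper functions, and the toy figures are all hypothetical.

```python
import numpy as np
import pandas as pd


def f_statistic(values: np.ndarray, groups: np.ndarray) -> float:
    """One-way ANOVA F statistic: between-group over within-group variance."""
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        group_values = values[groups == label]
        ss_between += len(group_values) * (group_values.mean() - grand_mean) ** 2
        ss_within += ((group_values - group_values.mean()) ** 2).sum()
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(df, group_col, value_col, n_permutations=2000, seed=0):
    """Share of shuffled datasets whose F statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    values = df[value_col].to_numpy(dtype=float)
    groups = df[group_col].to_numpy()
    observed = f_statistic(values, groups)
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(values)
        if f_statistic(shuffled, groups) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return observed, (hits + 1) / (n_permutations + 1)


# Toy data (hypothetical figures, for illustration only)
toy = pd.DataFrame({
    "location": ["Bangalore"] * 4 + ["Mumbai"] * 4 + ["Pune"] * 4,
    "amount": [9.0, 10.0, 11.0, 12.0, 1.0, 2.0, 3.0, 2.0, 1.5, 2.5, 2.0, 3.0],
})
f_obs, p_value = permutation_p_value(toy, "location", "amount")
print(f"F = {f_obs:.2f}, p = {p_value:.4f}")
```

A small p-value would argue for rejecting H0. A permutation test is used here because funding amounts tend to be heavily skewed, which strains the normality assumption of a classical ANOVA.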

In [214]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
 
# Database connectivity
import pyodbc
 
# Database ORM (optional)
from sqlalchemy import create_engine
 
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
 
# Machine learning (Extra)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
 
# Managing environment variables
from dotenv import dotenv_values
 
# Handling HTTP requests (if needed)
import requests
 
# Handling file paths and directories
import os
from pathlib import Path
 
 
import warnings
 
warnings.filterwarnings('ignore')

Loading Data to Python VSO Environment:
 
1. Database Connection (2020 and 2021 Data):

In [215]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')
 
# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("UID")
password = environment_variables.get("PWD")
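
Because `.get` returns `None` for a key that is absent, a missing entry in the `.env` file would only surface later as a connection error. A minimal sketch of an early check follows; the helper name `missing_credentials` and the example values are hypothetical.

```python
def missing_credentials(env: dict, required=("SERVER", "DATABASE", "UID", "PWD")) -> list:
    """Return the required keys that are absent or empty in the env mapping."""
    return [key for key in required if not env.get(key)]


# Hypothetical example: PWD was left out of the .env file
example_env = {"SERVER": "example-server", "DATABASE": "example_db", "UID": "reader"}
missing = missing_credentials(example_env)
if missing:
    print(f"Missing .env entries: {', '.join(missing)}")
```

In the notebook, the same check could be applied to `environment_variables` before building the connection string.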

In [216]:
# Create a connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"

In [217]:

# Connect to the database with pyodbc using the connection string
 
connection = pyodbc.connect(connection_string)
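
The imports include `create_engine`, but the connection is opened with pyodbc directly. Recent pandas versions warn when `pd.read_sql` is given a raw DBAPI connection other than sqlite3, so a SQLAlchemy engine may be the more robust route. The sketch below only builds the engine URL, using SQLAlchemy's documented `odbc_connect` pass-through form; the helper name and the placeholder credentials are hypothetical.

```python
from urllib.parse import quote_plus


def sqlalchemy_url_from_odbc(odbc_connection_string: str) -> str:
    """Wrap a raw ODBC connection string in a SQLAlchemy mssql+pyodbc URL."""
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_connection_string)


# Placeholder values (hypothetical); the real ones come from the .env file
example = "DRIVER={SQL Server};SERVER=example-server;DATABASE=example_db;UID=reader;PWD=secret"
url = sqlalchemy_url_from_odbc(example)
print(url)

# With SQLAlchemy and pyodbc installed, the URL could then be used as:
#   engine = create_engine(url)
#   data20 = pd.read_sql(query, engine)
```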

In [218]:

# Run a SQL query to select the 2020 funding data
 
query = "SELECT * FROM dbo.LP1_startup_funding2020"
 
data20 = pd.read_sql(query, connection)
data20.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [219]:
# select data from 2021
 
query = "SELECT * FROM dbo.LP1_startup_funding2021"
 
data21 = pd.read_sql(query, connection)
data21.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [220]:
# Load the 2019 data from the file startup_funding2019.csv
 
data19 = pd.read_csv('startup_funding2019.csv')
data19.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [221]:
# Load the 2018 data, hosted on the GitHub repository in the file startup_funding2018.csv
 
data18 = pd.read_csv('startup_funding2018.csv')
data18.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...
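
The four previews use different column names (for example `Company Name` in 2018, `Company/Brand` in 2019, and `Company_Brand` in 2020 and 2021), so the years cannot be combined as they stand. The sketch below shows one way to rename columns to a shared schema and tag each row with its year. The mapping, the `harmonize` helper, and the `Year` column are assumptions, and the stand-in frames hold single rows copied from the previews.

```python
import pandas as pd

# Target schema follows the 2020/2021 column names seen in the previews
RENAME_2018 = {
    "Company Name": "Company_Brand",
    "Industry": "Sector",
    "Round/Series": "Stage",
    "Location": "HeadQuarter",
    "About Company": "What_it_does",
}
RENAME_2019 = {
    "Company/Brand": "Company_Brand",
    "What it does": "What_it_does",
    "Amount($)": "Amount",
}


def harmonize(frame: pd.DataFrame, year: int, rename: dict = None) -> pd.DataFrame:
    """Rename columns to the shared schema and tag each row with its funding year."""
    out = frame.rename(columns=rename or {}).copy()
    out["Year"] = year
    return out


# Single-row stand-ins for data18 and data19
sample18 = pd.DataFrame({
    "Company Name": ["PayMe India"],
    "Industry": ["Financial Services, FinTech"],
    "Round/Series": ["Angel"],
    "Amount": ["2000000"],
    "Location": ["Noida, Uttar Pradesh, India"],
})
sample19 = pd.DataFrame({
    "Company/Brand": ["HomeLane"],
    "HeadQuarter": ["Chennai"],
    "Sector": ["Interior design"],
    "Amount($)": ["$30,000,000"],
    "Stage": ["Series D"],
})

combined = pd.concat(
    [harmonize(sample18, 2018, RENAME_2018), harmonize(sample19, 2019, RENAME_2019)],
    ignore_index=True,
)
print(combined[["Company_Brand", "HeadQuarter", "Sector", "Stage", "Amount", "Year"]])
```

The 2018 file has no `Founded`, `Founders`, or `Investor` columns, so those would be filled with NaN for 2018 rows after concatenation with the later years.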


Exploring and Understanding the data


In [222]:
data18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [223]:
data18.describe().T

Unnamed: 0,count,unique,top,freq
Company Name,526,525,TheCollegeFever,2
Industry,526,405,—,30
Round/Series,526,21,Seed,280
Amount,526,198,—,148
Location,526,50,"Bangalore, Karnataka, India",102
About Company,526,524,"TheCollegeFever is a hub for fun, fiesta and f...",2


In [224]:
data18.shape

(526, 6)

In [225]:
#Missing Values
print(data18.isna().sum())

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64
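
The zero counts above are misleading: `describe()` shows a dash placeholder as the most frequent `Amount` value (148 rows) and the most frequent `Industry` value (30 rows), so missing entries are encoded as text rather than as NaN. A minimal sketch of converting the placeholder before counting, using a stand-in frame with rows copied from the 2018 preview:

```python
import numpy as np
import pandas as pd

# "\u2014" is the dash character used as a placeholder in the 2018 file
sample18 = pd.DataFrame({
    "Company Name": ["TheCollegeFever", "PayMe India", "Eunimart"],
    "Amount": ["250000", "2000000", "\u2014"],
})

# Treat the dash placeholder as a missing value before counting nulls
cleaned18 = sample18.replace("\u2014", np.nan)
print(cleaned18.isna().sum())
```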


In [226]:
#Duplicate check 
print(data18.duplicated().sum())

1


In [227]:
data19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [228]:
data19.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,60.0,2014.533333,2.937003,2004.0,2013.0,2015.0,2016.25,2019.0


In [229]:
data19.shape

(89, 9)

In [230]:
#Missing Values
print(data19.isna().sum())

Company/Brand     0
Founded          29
HeadQuarter      19
Sector            5
What it does      0
Founders          3
Investor          0
Amount($)         0
Stage            46
dtype: int64


In [231]:
#Duplicate check 
print(data19.duplicated().sum())

0


In [232]:
data20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.6+ KB


In [233]:
data20.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,842.0,2015.363,4.097909,1973.0,2014.0,2016.0,2018.0,2020.0
Amount,801.0,113043000.0,2476635000.0,12700.0,1000000.0,3000000.0,11000000.0,70000000000.0
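
The `Amount` summary is heavily skewed: the median is 3,000,000 while the maximum is 70,000,000,000, which pulls the mean far above the 75th percentile. One common way to flag such values is the interquartile-range rule, sketched below. The stand-in series is built from the summary statistics above rather than from actual rows, and the multiplier `k = 1.5` is a conventional choice, not something the source prescribes.

```python
import pandas as pd


def iqr_upper_fence(amounts: pd.Series, k: float = 1.5) -> float:
    """Upper fence of the IQR rule: Q3 + k * (Q3 - Q1)."""
    q1, q3 = amounts.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)


# Stand-in values taken from the min, quartiles, and max reported by describe()
amounts = pd.Series([12_700.0, 1_000_000.0, 3_000_000.0, 11_000_000.0, 70_000_000_000.0])

fence = iqr_upper_fence(amounts)
flagged = amounts[amounts > fence]
print(f"Upper fence: {fence:,.0f}")
print(flagged)
```

Flagged values are candidates for manual review, not automatic removal, since very large funding rounds can be genuine.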


In [234]:
data20.shape

(1055, 10)

In [235]:
#Missing Values
print(data20.isna().sum())

Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64


In [236]:
#Duplicate check 
print(data20.duplicated().sum())

3


In [237]:
data21.shape

(1209, 9)

In [240]:
data21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


In [241]:
data21.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,1208.0,2016.655629,4.517364,1963.0,2015.0,2018.0,2020.0,2021.0


In [242]:
data21.shape

(1209, 9)

In [243]:
#Missing Values
print(data21.isna().sum())

Company_Brand      0
Founded            1
HeadQuarter        1
Sector             0
What_it_does       0
Founders           4
Investor          62
Amount             3
Stage            428
dtype: int64


In [244]:
#Duplicate check 
print(data21.duplicated().sum())

19
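
The checks above report 1, 0, 3, and 19 fully duplicated rows for 2018 through 2021. A minimal sketch of removing them with `drop_duplicates`, shown on a stand-in frame in which the repeated row is hypothetical:

```python
import pandas as pd

# Stand-in frame; the second upGrad row is a made-up duplicate for illustration
sample = pd.DataFrame({
    "Company_Brand": ["upGrad", "upGrad", "Bizongo"],
    "Amount": ["$120,000,000", "$120,000,000", "$51,000,000"],
})

# Keep the first occurrence of each fully identical row and rebuild the index
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), "->", len(deduplicated))
```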


Data cleaning 

In [245]:
#year 2021
# Remove commas from the 'Amount' column (the '$' prefix remains and the column is still a string)
data21['Amount'] = data21['Amount'].str.replace(',', '')

# Check the cleaned 'Amount' column
data21['Amount'].head()


0      $1200000
1    $120000000
2     $30000000
3     $51000000
4      $2000000
Name: Amount, dtype: object
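
The output above still carries the `$` prefix and the `object` dtype, so the column is not yet numeric. A minimal sketch of finishing the conversion with `pd.to_numeric`, which turns anything unparseable into NaN instead of raising. The entry `"Undisclosed"` is a hypothetical non-numeric value added for illustration.

```python
import pandas as pd

# Two values from the 2021 preview plus a hypothetical text entry and a null
raw = pd.Series(["$1,200,000", "$120,000,000", "Undisclosed", None], name="Amount")

# Strip '$' and ',' then coerce; unparseable entries become NaN
numeric = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True), errors="coerce")
print(numeric)
```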

In [246]:
#Dealing with missing values 
#Amount column 


# Sample data
# Note: this reassigns data21 to a six-row sample, replacing the 2021 data loaded earlier
data = {
    'amount': ['$1200000', '$120000000', '$30000000', '$55000000', '$26000000', '$8000000']
}
data21 = pd.DataFrame(data)

# Remove dollar signs and commas, and convert to numeric
data21['amount'] = data21['amount'].replace(r'[\$,]', '', regex=True).astype(float)

# Fill any missing values. This example uses the column mean:
mean_value = data21['amount'].mean()
data21['amount'] = data21['amount'].fillna(mean_value)

print(data21)



        amount
0    1200000.0
1  120000000.0
2   30000000.0
3   55000000.0
4   26000000.0
5    8000000.0
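
Mean imputation is sensitive to very large funding rounds. The sketch below compares it with median imputation on a stand-in series whose values are taken from the 2021 preview, plus one missing entry. Preferring the median for skewed amounts is a suggestion, not the notebook's stated method.

```python
import pandas as pd

# Four amounts from the 2021 preview and one missing value
amounts = pd.Series([1_200_000.0, 2_000_000.0, 30_000_000.0, 120_000_000.0, None])

mean_filled = amounts.fillna(amounts.mean())
median_filled = amounts.fillna(amounts.median())

print("Filled with mean:  ", mean_filled.iloc[-1])
print("Filled with median:", median_filled.iloc[-1])
```

On this sample the single 120,000,000 round pulls the mean well above the median, so the mean-filled value is more than double the median-filled one.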


In [247]:
#Missing Values
print(data21.isna().sum())

amount    0
dtype: int64


In [248]:
data21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   amount  6 non-null      float64
dtypes: float64(1)
memory usage: 180.0 bytes

