## Accessing your data from the database

- Follow the steps in this notebook to get access to the dataset.
- If you run into any problems, please open an issue on this repo on GitHub.

### Steps for using environment variables instead of credential literals

1. Install pyodbc - a package for connecting to your remote database server over ODBC
2. Install python-dotenv - a package for loading environment variables from a .env file, which helps you keep sensitive configuration information such as database credentials and API keys out of your code
3. Import all the necessary libraries
   1. pyodbc (for creating a connection)
   2. python-dotenv (for loading environment variables)
   3. os (for accessing the environment variables after loading them with the load_dotenv function. This is not needed if you use the dotenv_values function instead)
4. Now create a file called .env in the root of your project folder (note that the file name begins with a dot)
5. In the .env file, put all your sensitive information like server name, database name, username, and password (the notebook below also reads a DRIVER entry)

Example

   - SERVER='server_name_here'
   - DATABASE='database_name_here'
   - USERNAME='username_here'
   - PASSWORD='password_here'
   - DRIVER='driver_name_here'


6. Next, create a new file named `.gitignore` (note that this file name also begins with a dot)
7. Open the .gitignore file and add the name of the .env file we just created, like this: "/.env". This prevents git from tracking that file. Essentially, any file listed in the .gitignore file is ignored by git and won't be checked into the repository
8. Create a connection by accessing your connection string with your defined environment variables
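
Step 8 can be sketched as follows. This is a minimal sketch rather than the notebook's exact code: the helper name `build_connection_string` and the `REQUIRED_KEYS` tuple are hypothetical, and the helper takes a plain dictionary (such as the one returned by `dotenv_values`) so that a missing entry is reported before any connection is attempted.

```python
# Hypothetical helper: builds an ODBC connection string from a dict of settings.
REQUIRED_KEYS = ("DRIVER", "SERVER", "DATABASE", "USERNAME", "PASSWORD")

def build_connection_string(settings):
    """Return an ODBC connection string, or raise if a setting is missing."""
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        f"DRIVER={settings['DRIVER']};SERVER={settings['SERVER']};"
        f"DATABASE={settings['DATABASE']};UID={settings['USERNAME']};"
        f"PWD={settings['PASSWORD']}"
    )

# Placeholder values only; real values would come from the .env file.
example_settings = {
    "DRIVER": "driver_name_here",
    "SERVER": "server_name_here",
    "DATABASE": "database_name_here",
    "USERNAME": "username_here",
    "PASSWORD": "password_here",
}
connection_string = build_connection_string(example_settings)
print(connection_string)
```

Failing early on a missing key avoids sending a string such as `DRIVER=None` to the driver, which tends to produce a less helpful error.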

## Understanding the Business

- Venturing into the Indian start-up ecosystem
- To investigate the ecosystem and propose the best course of action

- We will analyze funding received by start-ups in India from 2018 to 2021.

- We will ask the following questions to help us propose the best course of action.

##### 1) What is the average amount of funding received by a start-up per year?

##### 2) Which sectors do these start-ups belong to?

##### 3) Which start-ups received the most funding, and which industries do they belong to?

##### 4) Which start-ups survived their first year of operation, and which industries do they belong to?

##### 7) Which start-ups survived their second year of operation, and which industries do they belong to?

##### 8) Which industries have the most successful start-ups?

##### 9) What are the locations of the industries that receive the most funding?

##### 10) What is the Location of Industries that  



#### Step 1 and 2 - Install pyodbc and python-dotenv

In [24]:
#!pip install pyodbc  
#%pip install python-dotenv 

#!pip install pymssql
#!pip install pypyodbc

In [23]:
#pip install --upgrade pyodbc

#### Step 3 - Import all the necessary packages

In [129]:
import pyodbc
import pymssql
import pypyodbc

from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package
import pandas as pd
import numpy as np


import warnings 

warnings.filterwarnings('ignore')


#### Step 4 - Create your .env file in the root of your project

#### Step 5 - In the .env file, put all your sensitive information like server name, password etc


#### Steps 6 and 7 - Next create a .gitignore file and add '/.env' (the file we just created) to it. This will prevent git from tracking that file.

#### Step 8 - Create a connection by accessing your connection string with your defined environment variables

In [130]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")
driver = environment_variables.get("DRIVER")


In [131]:
# Create a connection string
connection_string = f'DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}'


In [132]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary

connection = pyodbc.connect(connection_string)


In [133]:
# The SQL query to get the data is shown below.
# Note that you will not have permission to insert, delete or update rows in this database table.

query = "SELECT * FROM dbo.LP1_startup_funding2020"

df_fund2020 = pd.read_sql(query, connection)
df_fund2020.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
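
For readers without access to the remote server, the same `pd.read_sql` pattern can be tried locally. The sketch below is an illustration under stated assumptions: it uses an in-memory SQLite database from the standard library as a stand-in for the server, and the table name `startup_funding_demo` is hypothetical. The three rows reuse the company names and amounts shown in the output above.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the remote server (assumption).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE startup_funding_demo (Company_Brand TEXT, Amount REAL)"
)
connection.executemany(
    "INSERT INTO startup_funding_demo VALUES (?, ?)",
    [("Aqgromalin", 200000.0), ("Krayonnz", 100000.0), ("PadCare Labs", None)],
)
connection.commit()

# Same pattern as in the notebook: a SELECT query passed to pd.read_sql.
demo_df = pd.read_sql("SELECT * FROM startup_funding_demo", connection)
connection.close()
print(demo_df)
```

The SQL NULL in the third row comes back as a missing value in the DataFrame, which mirrors the empty Amount seen for PadCare Labs.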


In [134]:
query1 = 'Select * from dbo.LP1_startup_funding2021'

df_fund2021 = pd.read_sql(query1, connection)
df_fund2021.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D


Next, get data from other sources and concatenate (Depends on the project) to perform your analysis

ALL THE BEST!!!

In [135]:

# Print the first few rows of the DataFrame
#print(data2.head())
#converting the sql extracted data to csv respectively
#data.to_csv("startup_funding2020.csv", index=False)
#data1.to_csv("startup_funding2021.csv", index=False)

In [136]:
df_fund2018=pd.read_csv('startup_funding2018.csv')

In [137]:
df_fund2019=pd.read_csv('startup_funding2019.csv')


In [138]:
df_fund2018.head(3)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India


In [139]:
df_fund2019.head(3)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding


In [140]:
df_fund2020.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,


In [141]:
df_fund2021.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D


#### We will create a Year column for each dataframe before concatenating, to help us identify the year in which funding was awarded

In [142]:
df_fund2018['Year']=2018
df_fund2019['Year']=2019
df_fund2020['Year']=2020
df_fund2021['Year']=2021

In [143]:
df_fund2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
 6   Year           526 non-null    int64 
dtypes: int64(1), object(6)
memory usage: 28.9+ KB


In [144]:
df_fund2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
 9   Year           89 non-null     int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 7.1+ KB


In [145]:
df_fund2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
 10  Year           1055 non-null   int64  
dtypes: float64(2), int64(1), object(8)
memory usage: 90.8+ KB


In [146]:
df_fund2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
 9   Year           1209 non-null   int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 94.6+ KB


### Observations

- 'Company Name' in df_fund2018 is the same as 'Company/Brand' in df_fund2019 and 'Company_Brand' in df_fund2020 and df_fund2021

    - we will change Company Name in df_fund2018 and Company/Brand in df_fund2019 to Company_Brand

- Founded is in all dataframes except df_fund2018

- df_fund2018 contains Industry whereas the remaining dataframes contain Sector; after checking, we noticed they contain similar values

    - we will change Industry in df_fund2018 to Sector

- Location and HeadQuarter look the same: HeadQuarter is in all dataframes except df_fund2018, which contains Location

    - we will change HeadQuarter to Location

- Stage in df_fund2019, df_fund2020 and df_fund2021 has the same kind of values as Round/Series in df_fund2018

    - we will change Round/Series in df_fund2018 to Stage

- Amount is named Amount($) in df_fund2019, whereas all other dataframes contain Amount

    - we will change Amount($) to Amount

- About Company in df_fund2018 has similar values to What_it_does in df_fund2020 and df_fund2021; in df_fund2019 the column is named 'What it does', without underscores

    - we will change all of them to About Company

- Founders is in all dataframes except df_fund2018

- Investor is in all dataframes except df_fund2018

- df_fund2020 contains column10, which is almost entirely null (only 2 non-null values)

    - we will drop column10
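
The renaming plan in these observations can also be expressed as one shared mapping applied to every dataframe before concatenating. This is a minimal sketch under stated assumptions: `COLUMN_MAP`, `demo_2018` and `demo_2019` are hypothetical names, and the two tiny frames only mimic the column names of the real datasets.

```python
import pandas as pd

# One shared mapping from the yearly column names to the common names.
COLUMN_MAP = {
    "Company Name": "Company_Brand",
    "Company/Brand": "Company_Brand",
    "Industry": "Sector",
    "Round/Series": "Stage",
    "HeadQuarter": "Location",
    "Amount($)": "Amount",
    "What it does": "About Company",
    "What_it_does": "About Company",
}

# Tiny made-up frames that only mimic the column names of two of the datasets.
demo_2018 = pd.DataFrame({"Company Name": ["A"], "Industry": ["X"], "Location": ["Pune"]})
demo_2019 = pd.DataFrame({"Company/Brand": ["B"], "Sector": ["Y"], "HeadQuarter": ["Mumbai"]})

frames = []
for year, frame in {2018: demo_2018, 2019: demo_2019}.items():
    # rename() ignores mapping keys that are not present in the frame.
    frames.append(frame.rename(columns=COLUMN_MAP).assign(Year=year))

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Because `rename` skips keys that a frame does not have, the same mapping can be reused for all four years without per-frame special cases.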




In [147]:
#Renaming the columns for easy concatenation

df_fund2018=df_fund2018.rename(columns={'Company Name': 'Company_Brand',
                                        'Industry': 'Sector',
                                        'Round/Series': 'Stage'})
df_fund2019=df_fund2019.rename(columns={'Company/Brand': 'Company_Brand',
                                        'HeadQuarter': 'Location',
                                        'Amount($)': 'Amount',
                                        'What it does': 'About Company'})
df_fund2020=df_fund2020.rename(columns={'HeadQuarter': 'Location',
                                        'What_it_does': 'About Company'})
df_fund2021=df_fund2021.rename(columns={'HeadQuarter': 'Location',
                                        'What_it_does': 'About Company'})

In [148]:
#Dropping column10 in df_fund2020
df_fund2020=df_fund2020.drop('column10', axis=1)


In [316]:
def clean_amount2(amount):
    if isinstance(amount, str):
        # Remove thousands separators and currency symbols ($ and ₹)
        amount = amount.replace(',', '').replace('$', '').replace('₹', '')
        try:
            # Try converting to float
            return float(amount)
        except ValueError:
            # If conversion fails (e.g. 'Undisclosed'), return None
            return None
    else:
        # If not a string, return None
        # (note: this also applies to values that are already numeric)
        return None
 


In [322]:
import re

def clean_cur(amount):
    if isinstance(amount, str):
        # Remove commas and digits, leaving only the currency symbol (or other text).
        # A plain str has no .str accessor, so re.sub is used instead.
        return re.sub(r'\d', '', amount.replace(',', ''))
    else:
        # If not a string, return None
        return None
 


In [None]:
# Apply the function to the 'Amount' column
df_concat['Amount_cleaned'] = df_concat['Amount'].apply(clean_amount2)
print(df_concat.tail(50))
 
# Display the DataFrame
print(df_concat.isna().sum())

### Cleaning the 2018 dataset

In [325]:
#Making a copy of 2018 dataframe

df18=df_fund2018.copy()

In [319]:
df18['Amount_cleaned']=df18['Amount'].apply(clean_amount2)
df18.head(10)

Unnamed: 0,Company_Brand,Sector,Stage,Amount,Location,About Company,Year,Amount_cleaned
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,250000.0
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,40000000.0
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018,65000000.0
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018,2000000.0
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,2018,
5,Hasura,"Cloud Infrastructure, PaaS, SaaS",Seed,1600000,"Bengaluru, Karnataka, India",Hasura is a platform that allows developers to...,2018,1600000.0
6,Tripshelf,"Internet, Leisure, Marketplace",Seed,"₹16,000,000","Kalkaji, Delhi, India",Tripshelf is an online market place for holida...,2018,16000000.0
7,Hyperdata.IO,Market Research,Angel,"₹50,000,000","Hyderabad, Andhra Pradesh, India",Hyperdata combines advanced machine learning w...,2018,50000000.0
8,Freightwalla,"Information Services, Information Technology",Seed,—,"Mumbai, Maharashtra, India",Freightwalla is an international forwarder tha...,2018,
9,Microchip Payments,Mobile Payments,Seed,—,"Bangalore, Karnataka, India",Microchip payments is a mobile-based payment a...,2018,


In [326]:
#df18['Amount_cleaned']=df18['Amount'].apply(clean_cur)
#df18.head(10)

In [327]:
#Removing the digits from each value, leaving only the currency symbol and separators

df18['cur_symb18']=df18['Amount'].astype(str).replace(r'\d', '', regex=True)

In [328]:
#Checking the unique currency symbols 

df18['cur_symb18'].unique()

array(['', '₹,,', '—', '₹,', '$,', '$,,', '₹,,,', '$,,,'], dtype=object)
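
The two regex steps used in this section (keep the non-digits as the currency symbol, keep the digits as the amount) can be illustrated on single values. This is a minimal sketch: the function name `split_amount` is hypothetical, and it works on plain strings with the standard `re` module rather than on the dataframe.

```python
import re

def split_amount(raw):
    """Return (symbol, value); value is None when there are no digits."""
    text = str(raw)
    symbol = re.sub(r"[\d,]", "", text)   # drop digits and commas, keep the rest
    digits = re.sub(r"\D", "", text)      # keep digits only
    value = float(digits) if digits else None
    return symbol, value

# "\u2014" is the dash placeholder that appears in the 2018 Amount column.
samples = ["250000", "₹40,000,000", "\u2014"]
for sample in samples:
    print(sample, "->", split_amount(sample))
```

Stripping every non-digit also removes decimal points, so this approach assumes the amounts are whole numbers.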

In [329]:
# Removing every non-digit character (currency symbols and separators)

df18['Amount_no_symb18']=df18['Amount'].astype(str).replace(r'\D', '', regex=True)

In [330]:
#Checking to confirm changes were effected

df18.head(2)

Unnamed: 0,Company_Brand,Sector,Stage,Amount,Location,About Company,Year,cur_symb18,Amount_no_symb18
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,,250000
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,"₹,,",40000000


In [331]:
#Confirming the datatypes

df18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Company_Brand     526 non-null    object
 1   Sector            526 non-null    object
 2   Stage             526 non-null    object
 3   Amount            526 non-null    object
 4   Location          526 non-null    object
 5   About Company     526 non-null    object
 6   Year              526 non-null    int64 
 7   cur_symb18        526 non-null    object
 8   Amount_no_symb18  526 non-null    object
dtypes: int64(1), object(8)
memory usage: 37.1+ KB


In [332]:
#Dropping the Amount column that still contains symbols

df18=df18.drop('Amount',axis=1)

In [333]:
# Renaming columns

df18=df18.rename(columns={'Amount_no_symb18':'Amount'})
df18=df18.rename(columns={'cur_symb18':'Currency'})

In [334]:
#Confirming changes

df18.head(3)

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,,250000
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,"₹,,",40000000
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018,"₹,,",65000000


In [335]:
#Changing Amount from object datatype to numeric type

df18['Amount']=pd.to_numeric(df18['Amount'])

In [336]:
# Confirming changes

df18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  526 non-null    object 
 1   Sector         526 non-null    object 
 2   Stage          526 non-null    object 
 3   Location       526 non-null    object 
 4   About Company  526 non-null    object 
 5   Year           526 non-null    int64  
 6   Currency       526 non-null    object 
 7   Amount         378 non-null    float64
dtypes: float64(1), int64(1), object(6)
memory usage: 33.0+ KB


In [339]:
df18['Currency']=df18['Currency'].str.replace(',','')

In [340]:
currency_value=df18.groupby('Currency')['Amount'].mean()
currency_value

Currency
     1.219853e+07
$    5.535524e+07
—             NaN
₹    5.903119e+08
Name: Amount, dtype: float64

In [384]:
# So that the location in the 2018 dataset corresponds with the other three
# datasets, which have only one city as the location, we pick the first
# entry in the Location column and drop the rest

df18['Location'] = df18['Location'].str.split(',').str.get(0) 


In [385]:
df18.head()

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,,250000.0
1,Happy Cow Dairy,"Agriculture, Farming",Seed,Mumbai,A startup which aggregates milk from dairy far...,2018,₹,40000000.0
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,Gurgaon,Leading Online Loans Marketplace in India,2018,₹,65000000.0
3,PayMe India,"Financial Services, FinTech",Angel,Noida,PayMe India is an innovative FinTech organizat...,2018,,2000000.0
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,—,


#### Converting rupees to dollars

- Apart from the 2020 dataset, which has no currency symbols, the 2019 and 2021 datasets use dollar symbols

- The 2018 dataset has both dollar and rupee symbols. Since the dollar is the dominant currency, we decided to convert the rupee amounts to dollars.
    The average exchange rate in 2018 was 0.0146 USD per rupee.
    
- Where the currency is empty, we decided to assume that the amount is in dollars.

In [410]:
#import numpy as np

#df18['Amount'] = np.where(df18['Currency'] == '₹', df18['Amount'] * 0.0146, df18['Amount'])
conversion_rate = 0.0146

# Filter rows where currency is '₹'
# (df18 has no 'Amount_cleaned' column after the copy above, so 'Amount' is used)
currency2dollar = df18.loc[df18['Currency'] == '₹', 'Amount']

# Perform the conversion
conversion = currency2dollar * conversion_rate

# Update the 'Amount' column with converted values
# Run this cell only once: re-running it converts the already-converted values again
df18.loc[df18['Currency'] == '₹', 'Amount'] = conversion

# Print the first rows of the updated DataFrame
print("This is the DataFrame after conversion:")
print(df18.head())

In [411]:
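# Note: the amounts in the output below (around 1e-259) are far too small to be
# dollar values, which suggests the conversion was applied many times over repeated runs.
# Re-create df18 and run the conversion once before relying on these values.
df18.head()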

Unnamed: 0,Company_Brand,Sector,Stage,Location,About Company,Year,Currency,Amount
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,,1.160784e-259
1,Happy Cow Dairy,"Agriculture, Farming",Seed,Mumbai,A startup which aggregates milk from dairy far...,2018,₹,2.71159e-259
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,Gurgaon,Leading Online Loans Marketplace in India,2018,₹,4.4063339999999996e-259
3,PayMe India,"Financial Services, FinTech",Angel,Noida,PayMe India is an innovative FinTech organizat...,2018,,9.286267999999999e-259
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,—,
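
One way to keep the rupee-to-dollar conversion safe to re-run is to write the result to a new column instead of overwriting Amount. The sketch below is an illustration under stated assumptions: the column name `Amount_usd` and the constant `RUPEE_TO_USD` are hypothetical, the rows are made up for the demonstration, and the rate is the 2018 average quoted in this notebook.

```python
import pandas as pd

RUPEE_TO_USD = 0.0146  # 2018 average rate quoted in the notebook

# Made-up rows that mimic the Currency and Amount columns of df18.
demo = pd.DataFrame({
    "Currency": ["", "₹", "$", "₹"],
    "Amount": [250000.0, 40000000.0, 1000000.0, None],
})

# Write the result to a new column so the original amounts stay untouched
# and re-running the cell always gives the same answer.
is_rupee = demo["Currency"] == "₹"
demo["Amount_usd"] = demo["Amount"].where(~is_rupee, demo["Amount"] * RUPEE_TO_USD)
print(demo)
```

`Series.where` keeps the original value wherever the condition is true (here, non-rupee rows) and takes the converted value elsewhere; missing amounts stay missing.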


### Cleaning the 2019 dataset

In [344]:
#Making a copy of 2019 dataframe

df19=df_fund2019.copy()

In [345]:
#Removing the digits from each value, leaving only the currency symbol and separators

df19['cur_symb19']=df19['Amount'].astype(str).replace(r'\d', '', regex=True)

In [346]:
# Checking unique symbols

df19['cur_symb19'].unique()

array(['$,,', 'Undisclosed', '$,'], dtype=object)

#### Observations

- The currency for 2019 is $, and some businesses did not disclose their amount

In [347]:
#Removing every non-digit character (currency symbols, separators and text)

df19['Amount_no_symb19']=df19['Amount'].astype(str).replace(r'\D', '', regex=True)

In [348]:
# Dropping the Amount column that still contains symbols

df19=df19.drop('Amount',axis=1)

In [349]:
# Renaming Columns

df19=df19.rename(columns={'Amount_no_symb19':'Amount'})
df19=df19.rename(columns={'cur_symb19':'Currency'})

In [350]:
df19['Currency']=df19['Currency'].str.replace(',','')

In [351]:
#Confirming changes

df19.head(2)

Unnamed: 0,Company_Brand,Founded,Location,Sector,About Company,Founders,Investor,Stage,Year,Currency,Amount
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,,2019,$,6300000
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,Series C,2019,$,150000000


In [352]:
# Converting Amount from object datatype to numeric datatype

df19['Amount']=pd.to_numeric(df19['Amount'])

In [353]:
# Confirming the changes

df19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   Location       70 non-null     object 
 3   Sector         84 non-null     object 
 4   About Company  89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Stage          43 non-null     object 
 8   Year           89 non-null     int64  
 9   Currency       89 non-null     object 
 10  Amount         77 non-null     float64
dtypes: float64(2), int64(1), object(8)
memory usage: 7.8+ KB
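
The `info()` output above shows 77 non-null amounts out of 89, which is consistent with the 'Undisclosed' entries becoming missing values. An alternative way to get the same effect is to strip only the dollar sign and commas and let `pd.to_numeric` coerce whatever is left. This is a minimal sketch on three sample strings, not the notebook's own method; the variable names are hypothetical.

```python
import pandas as pd

raw_amounts = pd.Series(["$6,300,000", "Undisclosed", "$150,000,000"])

# Strip the dollar sign and commas, then let to_numeric turn anything
# that is still not a number into NaN.
stripped = raw_amounts.str.replace(r"[$,]", "", regex=True)
amounts = pd.to_numeric(stripped, errors="coerce")
print(amounts)
```

Compared with removing every non-digit, this keeps decimal points intact and makes it explicit which values were treated as missing.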


### Cleaning the 2020 dataset

In [354]:
# Making a copy of the 2020 dataset

df20=df_fund2020.copy()

In [360]:
# Confirming changes
df20.head(2)

Unnamed: 0,Company_Brand,Founded,Location,Sector,About Company,Founders,Investor,Amount,Stage,Year,Currency
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020,$
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020,$


In [356]:
# Checking for null values in the amount column
df20['Amount'].isna().sum()

254

In [357]:
#Making sure the Amount column is numeric (it is already float64 in this dataset)

df20['Amount']=pd.to_numeric(df20['Amount'])

In [358]:
df20['Currency']='$'

In [359]:
#Confirming changes

df20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   Location       961 non-null    object 
 3   Sector         1042 non-null   object 
 4   About Company  1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   Year           1055 non-null   int64  
 10  Currency       1055 non-null   object 
dtypes: float64(2), int64(1), object(8)
memory usage: 90.8+ KB


### Cleaning the 2021 dataset

In [366]:
#Making a copy of 2021 dataset

df21=df_fund2021.copy()

In [367]:
#Removing the digits from Amount, leaving only the currency symbol, separators and text

df21['cur_symb21']=df21['Amount'].astype(str).replace(r'\d', '', regex=True)

In [368]:
#Checking for type of symbols 

df21['cur_symb21'].unique()

array(['$,,', '$,', 'Undisclosed', '$,,,', 'None', '$Undisclosed', '$',
       'Upsparks', 'Series C', 'Seed', '$$,', '$undisclosed',
       'ah! Ventures', 'Pre-series A', 'ITO Angel Network, LetsVenture',
       'JITO Angel Network, LetsVenture', '$$,,'], dtype=object)
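
The unique values above include entries such as 'Upsparks' and 'Series C', which look like investor names or funding stages rather than amounts. Before stripping characters, it can help to flag the rows whose Amount does not have the shape of a dollar figure. The sketch below is one possible check, assuming amounts look like one or more `$` signs followed by digits and commas; the names `AMOUNT_PATTERN` and `looks_like_amount` are hypothetical and the sample strings are illustrative.

```python
import re

# One or more optional '$' signs followed by digits with optional commas,
# e.g. "$1,200,000".
AMOUNT_PATTERN = re.compile(r"\$*\d[\d,]*")

def looks_like_amount(value):
    """True when the whole string has the shape of a dollar amount."""
    return isinstance(value, str) and AMOUNT_PATTERN.fullmatch(value.strip()) is not None

samples = ["$1,200,000", "$$1,000,000", "Undisclosed", "Series C", "$Undisclosed", None]
flags = [looks_like_amount(sample) for sample in samples]
print(list(zip(samples, flags)))
```

Applied to a column, the rows flagged False could be inspected by hand to decide whether a value belongs in Stage or Investor instead.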

In [369]:
#Removing every non-digit character (currency symbols, separators and text)

df21['Amount_no_symb21']=df21['Amount'].astype(str).replace(r'\D', '', regex=True)

In [370]:
#Dropping the Amount column that still contains symbols

df21=df21.drop('Amount',axis=1)

In [371]:
# Renaming columns

df21=df21.rename(columns={'Amount_no_symb21':'Amount'})
df21=df21.rename(columns={'cur_symb21':'Currency'})

In [372]:
# confirming changes

df21.head(2)

Unnamed: 0,Company_Brand,Founded,Location,Sector,About Company,Founders,Investor,Stage,Year,Currency,Amount
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",Pre-series A,2021,"$,,",1200000
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,2021,"$,,",120000000


#### Observations

- We noticed the currency was $; some companies have no amount, only text


In [373]:
#Converting the datatype from object to numeric 

df21['Amount']=pd.to_numeric(df21['Amount'])

In [380]:
text_value=df21.groupby('Currency')['Amount'].mean()
text_value

Currency
$    1.702779e+08
Name: Amount, dtype: float64

In [379]:
df21['Currency']='$'

In [381]:
#Confirming changes

df21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   Location       1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   About Company  1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Stage          781 non-null    object 
 8   Year           1209 non-null   int64  
 9   Currency       1209 non-null   object 
 10  Amount         1056 non-null   float64
dtypes: float64(2), int64(1), object(8)
memory usage: 104.0+ KB


In [235]:
#Concatenating Dataframes
#df = pd.concat([df_fund2018, df_fund2019, df_fund2020, df_fund2021], ignore_index=True)

In [1]:
#df.head()

In [2]:
#df.shape

#### Observations

- After combining the datasets from the various sources, we got 3033 rows and 9 columns

In [3]:
#df.info()

#### Observations

- The Amount column is shown as object type instead of float; we will investigate further and change it to float
- All other object types are correctly specified.
- Year and Founded are supposed to be datetime type but are shown as int and float types; we will change them to datetime

In [4]:
#converting Year and Founded datatypes to datetime

#df['Year']=pd.to_datetime(df['Year'], format = '%Y')
#df['Founded']=pd.to_datetime(df['Founded'], format = '%Y')
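
Since the Founded column holds float years with missing values, one cautious way to do this conversion is to format only the non-missing years as 4-digit strings before parsing them. This is a minimal sketch on a made-up series, not the notebook's executed code; the variable names are hypothetical.

```python
import pandas as pd

founded = pd.Series([2019.0, None, 2012.0])

# Format the non-missing years as 4-digit strings, parse them with '%Y',
# then restore the original index so missing years become NaT.
year_text = founded.dropna().astype(int).astype(str)
founded_dt = pd.to_datetime(year_text, format="%Y").reindex(founded.index)
print(founded_dt)
```

Each parsed year is represented as 1 January of that year, so only the year component carries information.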

In [5]:
#df.info()

In [6]:
#df['cur_symb']=df['Amount'].astype(str).replace(('\d'), '', regex= True)

In [7]:
#df['cur_symb'].unique()

In [8]:
#df['Amount_no_symb']=df['Amount'].astype(str).replace('\D', '', regex= True)

In [9]:
#df.head(10)

In [245]:
#df['Amount_no_symb'].unique()

In [10]:
#df.head(2)

In [11]:
#space=df[df['cur_symb']=='']


In [12]:
#space

In [13]:
#space['Year'].unique()

In [14]:
#dot=df[df['cur_symb']=='.']
#dot

In [15]:
#dash=df[df['cur_symb']=='—']
#dash

In [16]:
#dash['Amount_no_symb'].unique()

In [17]:
#dash['Year'].unique()

In [18]:
#rupee=df[df['cur_symb']=='₹,,,']
#rupee