# 📈 Predict Indian Startup Fundings - Pipeline
---

![](https://entrackr-bucket.s3.ap-south-1.amazonaws.com/wp-content/uploads/2022/02/26165935/Funding-image.jpg)

Given *data about startups in India*, let's predict the **funding** that a given startup will receive.

This is a regression task in which we will use a **random forest regressor** integrated into a scikit-learn **Pipeline**.

# Getting Started

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score

In [2]:
data = pd.read_csv('../input/indian-startup-funding/startup_funding.csv')
data

| | Sr No | Date dd/mm/yyyy | Startup Name | Industry Vertical | SubVertical | City Location | Investors Name | InvestmentnType | Amount in USD | Remarks |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 09/01/2020 | BYJU’S | E-Tech | E-learning | Bengaluru | Tiger Global Management | Private Equity Round | 200000000 | |
| 1 | 2 | 13/01/2020 | Shuttl | Transportation | App based shuttle service | Gurgaon | Susquehanna Growth Equity | Series C | 8048394 | |
| 2 | 3 | 09/01/2020 | Mamaearth | E-commerce | Retailer of baby and toddler products | Bengaluru | Sequoia Capital India | Series B | 18358860 | |
| 3 | 4 | 02/01/2020 | https://www.wealthbucket.in/ | FinTech | Online Investment | New Delhi | Vinod Khatumal | Pre-series A | 3000000 | |
| 4 | 5 | 02/01/2020 | Fashor | Fashion and Apparel | Embroiled Clothes For Women | Mumbai | Sprout Venture Partners | Seed Round | 1800000 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3039 | 3040 | 29/01/2015 | Printvenue | | | | Asia Pacific Internet Group | Private Equity | 4500000 | |
| 3040 | 3041 | 29/01/2015 | Graphene | | | | KARSEMVEN Fund | Private Equity | 825000 | Govt backed VC Fund |
| 3041 | 3042 | 30/01/2015 | Mad Street Den | | | | Exfinity Fund, GrowX Ventures. | Private Equity | 1500000 | |
| 3042 | 3043 | 30/01/2015 | Simplotel | | | | MakeMyTrip | Private Equity | | Strategic Funding, Minority stake |


In [3]:
# Check missing values
data.isna().mean()

Sr No                0.000000
Date dd/mm/yyyy      0.000000
Startup Name         0.000000
Industry Vertical    0.056176
SubVertical          0.307490
City  Location       0.059133
Investors Name       0.007884
InvestmentnType      0.001314
Amount in USD        0.315375
Remarks              0.862352
dtype: float64

In [4]:
# Check column cardinalities
{column: len(data[column].unique()) for column in data.columns}

{'Sr No': 3044,
 'Date dd/mm/yyyy': 1035,
 'Startup Name': 2459,
 'Industry Vertical': 822,
 'SubVertical': 1943,
 'City  Location': 113,
 'Investors Name': 2413,
 'InvestmentnType': 56,
 'Amount in USD': 472,
 'Remarks': 73}

# Preprocessing

**The data is messy.** Let's build a function that cleans it and returns the training and test sets.

Here, we drop the columns that carry no usable signal (the ID and the mostly empty remarks) along with the high-cardinality columns, because one-hot encoding them would add thousands of sparse features and can hurt model performance.
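To make the cardinality argument concrete, here is a minimal sketch on a toy frame. The values are hypothetical and `pd.get_dummies` stands in for the encoder; the point is only that one-hot encoding produces one indicator column per distinct value, so a column with about 2,459 distinct names would add about 2,459 features.

```python
import pandas as pd

# Toy frame with hypothetical values: one low-cardinality column
# and one column where every row has its own value.
toy = pd.DataFrame({
    "InvestmentnType": ["Seed", "Series A", "Seed", "Series A"],
    "Startup Name": ["Alpha", "Beta", "Gamma", "Delta"],
})

# One-hot encoding creates one indicator column per distinct value.
encoded = pd.get_dummies(toy)

n_type_cols = sum(col.startswith("InvestmentnType_") for col in encoded.columns)
n_name_cols = sum(col.startswith("Startup Name_") for col in encoded.columns)
print(n_type_cols, n_name_cols)
```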

Then, stray characters that were entered into the cells are removed, and rows with a missing target are dropped.
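As a rough illustration of the target cleaning, the sketch below strips separators and a trailing `+`, coerces what is left to numbers, and drops the entries that cannot be parsed. The raw strings are made up for the example and are not rows copied from the dataset.

```python
import pandas as pd

# Illustrative raw values only (assumed formats, not actual dataset rows).
raw = pd.Series(["2,00,00,000", "14,342,000+", "undisclosed", None])

# Remove thousands separators and trailing plus signs as plain text.
cleaned = (
    raw.astype(str)
       .str.replace(",", "", regex=False)
       .str.replace("+", "", regex=False)
)

# Anything that still is not a number becomes NaN instead of raising.
amounts = pd.to_numeric(cleaned, errors="coerce")

# Rows without a usable target are dropped.
kept = amounts.dropna().reset_index(drop=True)
print(kept.tolist())
```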

Finally, the dates are decomposed into *year*, *month* and *day*, which are easier for the model to interpret.
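The decomposition can be sketched as follows, using three dates taken from the rows shown earlier. Passing an explicit `format` is an assumption of this sketch rather than what the notebook does: since the column is labelled dd/mm/yyyy, it removes any day/month ambiguity. The last two lines show how a single ambiguous string is read when no format is given.

```python
import pandas as pd

dates = pd.Series(["09/01/2020", "13/01/2020", "29/01/2015"])

# An explicit day-first format leaves no room for day/month confusion.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

features = pd.DataFrame({
    "Year": parsed.dt.year,
    "Month": parsed.dt.month,
    "Day": parsed.dt.day,
})
print(features)

# For comparison: with the defaults, an ambiguous string is read month-first.
default = pd.to_datetime("09/01/2020")
print(default.month, default.day)
```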

In [5]:
def preprocess_inputs(df):
    df = df.copy()
    
    # Drop ID and Remarks columns
    df = df.drop(['Sr No', 'Remarks'], axis=1)

    # Drop high cardinality columns
    df = df.drop(['Startup Name', 'SubVertical', 'Investors Name'], axis=1)
    
    # Remove stray \\xc2\\xa0 sequences from string cells
    df = df.applymap(lambda x: x.replace(r'\\xc2\\xa0', '') if isinstance(x, str) else x)
    
    # Clean target column
    df['Amount in USD'] = df['Amount in USD'].astype(str).apply(lambda x: x.replace(',', ''))
    df['Amount in USD'] = df['Amount in USD'].astype(str).apply(lambda x: x.replace('+', ''))
    df['Amount in USD'] = df['Amount in USD'].apply(pd.to_numeric, errors='coerce')
    
    # Drop missing target rows
    df = df.drop(df[df['Amount in USD'].isna()].index).reset_index(drop=True)
    
    # Fill categorical missing values with the mode
    for column in ['Industry Vertical', 'City  Location', 'InvestmentnType']:
        df[column] = df[column].fillna(df[column].mode()[0])

    # Clean date errors
    df['Date dd/mm/yyyy'] = df['Date dd/mm/yyyy'].replace({
        '05/072018': '05/07/2018',
        '01/07/015': '01/07/2015',
        '22/01//2015': '22/01/2015'
    })
    
    # Extract date features
    # Note: without dayfirst=True, ambiguous dates such as 09/01/2020 are parsed month-first
    df['Date dd/mm/yyyy'] = pd.to_datetime(df['Date dd/mm/yyyy'])
    df['Year'] = df['Date dd/mm/yyyy'].dt.year
    df['Month'] = df['Date dd/mm/yyyy'].dt.month
    df['Day'] = df['Date dd/mm/yyyy'].dt.day
    df = df.drop('Date dd/mm/yyyy', axis=1)
    
    # Split X and y
    X = df.drop('Amount in USD', axis=1)
    y = df['Amount in USD']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
    
    return X_train, X_test, y_train, y_test

In [6]:
X_train, X_test, y_train, y_test = preprocess_inputs(data)
X_train

| | Industry Vertical | City Location | InvestmentnType | Year | Month | Day |
|---|---|---|---|---|---|---|
| 924 | eCommerce | Noida | Private Equity | 2016 | 4 | 10 |
| 1108 | Consumer Internet | Bangalore | Seed Funding | 2016 | 7 | 21 |
| 1059 | Consumer Internet | Mumbai | Private Equity | 2016 | 8 | 29 |
| 160 | Consumer Internet | Bengaluru | Seed/ Angel Funding | 2018 | 8 | 8 |
| 1696 | on-demand delivery service | Gurgaon | Seed Funding | 2015 | 8 | 17 |
| ... | ... | ... | ... | ... | ... | ... |
| 960 | eCommerce | Ahmedabad | Private Equity | 2016 | 10 | 26 |
| 905 | Consumer Internet | Mumbai | Private Equity | 2016 | 11 | 24 |
| 1096 | eCommerce | New Delhi | Seed Funding | 2016 | 7 | 15 |
| 235 | Finance | Chennai | Seed / Angel Funding | 2018 | 2 | 5 |


# Pipeline

The data is not yet ready for the model: the categorical columns still need to be encoded and the features still need to be scaled.

**We will do this inside of a pipeline.**

In [7]:
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and 1.4 removed the old name
nominal_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('nominal', nominal_transformer, ['Industry Vertical', 'City  Location', 'InvestmentnType'])
], remainder='passthrough')

regressor = RandomForestRegressor()

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),
    ('regressor', regressor)
])
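The `handle_unknown='ignore'` setting matters here because a category can appear in the test split without ever appearing in the training split. A minimal sketch of that behaviour, using a hypothetical `City` column and made-up values: the unseen category is encoded as a row of zeros instead of raising an error. The sketch keeps the encoder's default sparse output and converts it with `.toarray()`, which sidesteps the `sparse` / `sparse_output` naming difference between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column and values, for illustration only.
train = pd.DataFrame({"City": ["Mumbai", "Bengaluru", "Mumbai"]})
test = pd.DataFrame({"City": ["Bengaluru", "Pune"]})  # "Pune" was never seen in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; convert it to a dense array for display.
encoded = encoder.transform(test).toarray()
print(encoder.categories_[0])
print(encoded)
```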

In [8]:
model.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('nominal',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['Industry Vertical',
                                                   'City  Location',
                                                   'InvestmentnType'])])),
                ('scaler', StandardScaler()),
                ('regressor', RandomForestRegressor())])

# Results

In [9]:
y_pred = model.predict(X_test)
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
r2 = r2_score(y_test, y_pred)

print("Test RMSE: {:.2f}".format(rmse))
print("Test R2-Score: {:.5f}".format(r2))

Test RMSE: 58933562.37
Test R2-Score: -0.11604
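A negative R2 score means the model's squared error on the test set is larger than that of a constant prediction equal to the mean of the test targets. The toy numbers below are invented purely to illustrate that reading of the metric; they say nothing about why this particular model scores poorly.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets with one large value, loosely mimicking a skewed target.
y_true = np.array([1.0, 2.0, 3.0, 10.0])

# Baseline: always predict the mean of the true values.
mean_baseline = np.full_like(y_true, y_true.mean())

# A prediction that is further from the truth than the mean is.
poor_model = np.array([10.0, 3.0, 2.0, 1.0])

r2_baseline = r2_score(y_true, mean_baseline)
r2_poor = r2_score(y_true, poor_model)
print(r2_baseline, r2_poor)
```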
