Import the libraries for working with the dataset and the machine learning models

In [20]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
import pandas_profiling
import seaborn as sns

Dataset from Kaggle - https://www.kaggle.com/datasets/utkarshx27/2021-startups

In [2]:
data = pd.read_csv('Startups_in_2021_end.csv')

This data covers startups with join dates up to 2021; the earliest date among the rows shown below is 2011.  

The field Unnamed: 0 is the startup ID.  
Company is the name of the startup.  
Valuation ($B) is the valuation in billions of dollars.  
Date Joined is the date of joining the organization.  
Country and City are where the startup is located.  
Industry is the industry in which the startup was created.  
Select Investors is the list of investors.

In [3]:
data

Unnamed: 0.1,Unnamed: 0,Company,Valuation ($B),Date Joined,Country,City,Industry,Select Investors
0,0,Bytedance,$140,4/7/2017,China,Beijing,Artificial intelligence,"Sequoia Capital China, SIG Asia Investments, S..."
1,1,SpaceX,$100.3,12/1/2012,United States,Hawthorne,Other,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,2,Stripe,$95,1/23/2014,United States,San Francisco,Fintech,"Khosla Ventures, LowercaseCapital, capitalG"
3,3,Klarna,$45.6,12/12/2011,Sweden,Stockholm,Fintech,"Institutional Venture Partners, Sequoia Capita..."
4,4,Canva,$40,1/8/2018,Australia,Surry Hills,Internet software & services,"Sequoia Capital China, Blackbird Ventures, Mat..."
...,...,...,...,...,...,...,...,...
931,931,YipitData,$1,12/6/2021,United States,New York,Internet software & services,"RRE Ventures+, Highland Capital Partners, The ..."
932,932,Anyscale,$1,12/7/2021,United States,Berkeley,Artificial Intelligence,"Andreessen Horowitz, Intel Capital, Foundation..."
933,933,Iodine Software,$1,12/1/2021,United States,Austin,Data management & analytics,"Advent International, Bain Capital Ventures, S..."
934,934,ReliaQuest,$1,12/1/2021,United States,Tampa,Cybersecurity,"KKR, FTV Capital, Ten Eleven Ventures"


Let's look at the characteristics of the data

In [4]:
pandas_profiling.ProfileReport(data)




To prepare the data, I need to convert text values into numbers, which first requires some cleanup. In the Country field, I keep only the text before the first comma. In the Industry field, I convert the text to lowercase so that differently capitalized spellings of the same industry are merged. Then I build a dictionary that maps each unique value to a number

In [5]:
def replace_country(country):
    # Keep only the text before the first comma
    return country.split(',')[0]

def get_lowercase(industry):
    return industry.lower()

def get_dict(column):
    # Map each unique value to a float code, in order of first appearance
    return {value: float(i) for i, value in enumerate(column.unique())}

data.drop('Unnamed: 0', axis=1, inplace=True)
data['Country'] = data['Country'].apply(replace_country)
data['Industry'] = data['Industry'].apply(get_lowercase)
dict_country = get_dict(data['Country'])
dict_industry = get_dict(data['Industry'])
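
The value-to-number dictionary built above can also be sketched with `pd.factorize`, which assigns integer codes in order of first appearance. This is only an illustration on a toy Series (the four values are a stand-in for the Country column, not the real data), and it is an alternative to the hand-written helper, not the method this notebook uses.

```python
import pandas as pd

# Toy stand-in for the Country column (illustrative values only)
countries = pd.Series(["China", "United States", "United States", "Sweden"])

# factorize returns integer codes in order of first appearance,
# plus the unique values those codes index into
codes, uniques = pd.factorize(countries)

# Rebuild the same kind of value -> float mapping as the dictionary above
mapping = {value: float(i) for i, value in enumerate(uniques)}
encoded = countries.map(mapping)

print(mapping)
print(encoded.tolist())
```

Keep in mind that such codes are arbitrary labels: the numeric order of the countries or industries carries no meaning.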

Let's remove the dollar sign from the Valuation ($B) field and keep only the year from the Date Joined field. In my opinion, the day and month do not significantly affect the model

In [6]:
def transform_valuation(valuation):
    return float(valuation.split('$')[1])

def transform_date(date):
    return int(date.split('/')[2])

def transform_country(country, dictionary):
    return dictionary[country]

def transform_industry(industry, dictionary):
    return dictionary[industry]

In [7]:
data['Valuation ($B)'] = data['Valuation ($B)'].apply(transform_valuation)
data['Date Joined'] = data['Date Joined'].apply(transform_date)
data['Country'] = data['Country'].apply(lambda x: transform_country(x, dict_country))
data['Industry'] = data['Industry'].apply(lambda x: transform_industry(x, dict_industry))
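
The same valuation and date parsing could probably be done with vectorized pandas operations instead of `apply`. A minimal sketch on three rows copied from the table above; the `sample` frame and the `%m/%d/%Y` date format are my assumptions about how the raw strings look.

```python
import pandas as pd

# Three rows copied from the raw table (Bytedance, SpaceX, YipitData)
sample = pd.DataFrame({
    "Valuation ($B)": ["$140", "$100.3", "$1"],
    "Date Joined": ["4/7/2017", "12/1/2012", "12/6/2021"],
})

# Strip the literal dollar sign and convert to float
valuation = (
    sample["Valuation ($B)"].str.replace("$", "", regex=False).astype(float)
)

# Parse month/day/year strings and keep only the year
year = pd.to_datetime(sample["Date Joined"], format="%m/%d/%Y").dt.year

print(valuation.tolist())
print(year.tolist())
```

Parsing the date properly, rather than splitting on `/`, would also raise an error on malformed dates instead of silently producing a wrong year.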

After the transformation, the dataset looks like this

In [8]:
data

Unnamed: 0,Company,Valuation ($B),Date Joined,Country,City,Industry,Select Investors
0,Bytedance,140.0,2017,0.0,Beijing,0.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100.3,2012,1.0,Hawthorne,1.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,Stripe,95.0,2014,1.0,San Francisco,2.0,"Khosla Ventures, LowercaseCapital, capitalG"
3,Klarna,45.6,2011,2.0,Stockholm,2.0,"Institutional Venture Partners, Sequoia Capita..."
4,Canva,40.0,2018,3.0,Surry Hills,3.0,"Sequoia Capital China, Blackbird Ventures, Mat..."
...,...,...,...,...,...,...,...
931,YipitData,1.0,2021,1.0,New York,3.0,"RRE Ventures+, Highland Capital Partners, The ..."
932,Anyscale,1.0,2021,1.0,Berkeley,0.0,"Andreessen Horowitz, Intel Capital, Foundation..."
933,Iodine Software,1.0,2021,1.0,Austin,5.0,"Advent International, Bain Capital Ventures, S..."
934,ReliaQuest,1.0,2021,1.0,Tampa,13.0,"KKR, FTV Capital, Ten Eleven Ventures"


I want to build classification models, so I need to choose one column for the models to predict. I choose the Select Investors column. I think that the Company and City columns do not affect the final investors, so I will delete them

In [9]:
data.drop('Company', axis=1, inplace=True)
data.drop('City', axis=1, inplace=True)

Check the columns for NaN values

In [10]:
data.isnull().sum()

Valuation ($B)      0
Date Joined         0
Country             0
Industry            0
Select Investors    1
dtype: int64

The Select Investors field should be converted to the values 1 and 0. If the startup has three or more investors, it gets the value 1, otherwise 0. The single NaN is replaced with the string '0', which contains no commas and therefore also maps to 0. The model will classify whether a startup has at least three investors

In [11]:
data['Select Investors'] = data['Select Investors'].fillna(value = '0')

In [12]:
def transform_target(investors):
    # 1 if the comma-separated list has at least three entries, else 0
    if len(investors.split(',')) >= 3:
        return 1
    else:
        return 0

data['Select Investors'] = data['Select Investors'].apply(transform_target)
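
To make the target rule concrete, here is a small worked example of the same logic written with vectorized string methods. The first row is copied from the Stripe record; the second row and its investor names are hypothetical, and the third is a missing value like the one found in the NaN check.

```python
import pandas as pd

investors = pd.Series([
    "Khosla Ventures, LowercaseCapital, capitalG",  # three names (Stripe row)
    "Investor A, Investor B",                       # hypothetical two-name row
    None,                                           # missing value
])

# Same filling step as in the notebook: NaN becomes the string '0'
filled = investors.fillna("0")

# Number of entries = number of commas + 1
n_names = filled.str.count(",") + 1

# 1 when there are at least three entries, otherwise 0
target = (n_names >= 3).astype(int)

print(target.tolist())
```

The rule counts commas, so an investor name that itself contains a comma would be counted as two entries.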

Final dataset

In [13]:
data

Unnamed: 0,Valuation ($B),Date Joined,Country,Industry,Select Investors
0,140.0,2017,0.0,0.0,1
1,100.3,2012,1.0,1.0,1
2,95.0,2014,1.0,2.0,1
3,45.6,2011,2.0,2.0,1
4,40.0,2018,3.0,3.0,1
...,...,...,...,...,...
931,1.0,2021,1.0,3.0,1
932,1.0,2021,1.0,0.0,1
933,1.0,2021,1.0,5.0,1
934,1.0,2021,1.0,13.0,1


Let's look at the correlation table

In [14]:
corr = data.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Valuation ($B),Date Joined,Country,Industry,Select Investors
Valuation ($B),1.0,-0.230509,-0.067176,-0.065136,0.036764
Date Joined,-0.230509,1.0,0.085285,-0.180748,0.24729
Country,-0.067176,0.085285,1.0,-0.014459,-0.093829
Industry,-0.065136,-0.180748,-0.014459,1.0,-0.023086
Select Investors,0.036764,0.24729,-0.093829,-0.023086,1.0


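I will use the Country, Industry and Valuation ($B) columns as features for machine learning and drop the Date Joined column. Note, however, that the table does not support this choice well: Date Joined has the strongest correlation with the target Select Investors (0.25), while the three retained columns are all below 0.1 in absolute value. Country and Industry are also arbitrary numeric codes, so their linear correlation with the target says little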
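
To compare the columns directly, the correlations with the target can be ranked by absolute value. A minimal sketch that uses the numbers from the Select Investors row of the table above; in the notebook itself the same Series could presumably be taken from `corr['Select Investors']`.

```python
import pandas as pd

# Correlations with Select Investors, copied from the correlation table
corr_with_target = pd.Series({
    "Valuation ($B)": 0.036764,
    "Date Joined": 0.247290,
    "Country": -0.093829,
    "Industry": -0.023086,
})

# Rank the features by the strength of the linear relationship
ranked = corr_with_target.abs().sort_values(ascending=False)

print(ranked)
```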

In [15]:
X = data.drop(['Select Investors', 'Date Joined'], axis = 1)
Y = data['Select Investors']

Split the dataset into 70% for training and 30% for testing

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)
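
The target is imbalanced: the test set used for the reports in this notebook has 42 samples of class 0 and 239 of class 1. One option worth trying is a stratified split, which keeps the class ratio the same in both parts. A sketch on synthetic data; `X_demo` and `y_demo` are made up for the illustration, and `stratify` is a variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples, 15% of them in class 0
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 30 + [1] * 170)

# stratify=y_demo preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print((y_tr == 0).sum(), (y_te == 0).sum())
```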

Create a DecisionTree model. Accuracy is 0.79, but recall for class 0 is only 0.12

In [17]:
tree = DecisionTreeClassifier(max_depth = 10)
tree_simple = tree.fit(X_train, Y_train)
predictions = tree_simple.predict(X_test)
print(classification_report(Y_test, predictions))

              precision    recall  f1-score   support

           0       0.19      0.12      0.15        42
           1       0.85      0.91      0.88       239

    accuracy                           0.79       281
   macro avg       0.52      0.52      0.51       281
weighted avg       0.76      0.79      0.77       281
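
A confusion matrix can make the per-class numbers in a report like this easier to read. A minimal sketch on toy labels; `y_true_demo` and `y_pred_demo` are invented for the illustration, and in the notebook the call would presumably take `Y_test` and `predictions`.

```python
from sklearn import metrics

# Toy labels, not the notebook's data
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)

print(cm)
```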



Create a RandomForest model. Accuracy is 0.82, but recall for class 0 is only 0.07

In [18]:
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       0.21      0.07      0.11        42
           1       0.85      0.95      0.90       239

    accuracy                           0.82       281
   macro avg       0.53      0.51      0.50       281
weighted avg       0.76      0.82      0.78       281



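Create a LogisticRegression model. Accuracy is 0.85, but the model predicts class 1 for every test sample (recall for class 0 is 0.00), so this accuracy simply equals the share of class 1 in the test set (239 of 281)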

In [19]:
logisticRegression = LogisticRegression(random_state = 42)
logisticRegression.fit(X_train, Y_train)
y_pred = logisticRegression.predict(X_test)
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        42
           1       0.85      1.00      0.92       239

    accuracy                           0.85       281
   macro avg       0.43      0.50      0.46       281
weighted avg       0.72      0.85      0.78       281
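
The accuracy in this report can be reproduced without any model. A sketch that rebuilds the test labels from the support column (42 samples of class 0, 239 of class 1) and scores a predictor that always answers 1; balanced accuracy is added here as one possible metric that is less sensitive to the class imbalance.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Test labels rebuilt from the support column of the report
y_true = np.array([0] * 42 + [1] * 239)

# A "model" that always predicts the majority class
y_all_ones = np.ones_like(y_true)

acc = accuracy_score(y_true, y_all_ones)
bal = balanced_accuracy_score(y_true, y_all_ones)

print(round(acc, 2), round(bal, 2))
```

Any classifier on this split has to beat roughly 0.85 accuracy to be better than always predicting 1.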



