### **GOALS & MOTIVATION**

Video Link : https://www.youtube.com/watch?v=AtunFggb5tw



Watching shows like Shark Tank prompts curiosity about the criteria the investors, or "Sharks," use to evaluate startups before deciding to invest. It is natural to wonder whether their funding decisions, and the terms they negotiate, could be improved by data-driven algorithms. To explore this, we use data from Kaggle, Crunchbase, and data.world. By analyzing these datasets, we aim to identify patterns that could improve investment decisions and to predict whether a given startup is headed for success or failure. Through this project, we hope to better understand the factors that influence startup success and investor behavior, and hopefully to have our own startup one day.

Startup funding dataset: https://data.world/datanerd/startup-venture-funding

Startup Success Rate Analysis: https://www.kaggle.com/datasets/sujithsherigar/startup-success-rate-analysis

Unicorns: https://www.kaggle.com/datasets/ramjasmaurya/unicorn-startups

Startup Success Prediction: https://www.kaggle.com/datasets/manishkc06/startup-success-prediction

First we decided to explore each dataset to see which variables would be the most appropriate for our use case.

# EXPLORATORY DATA ANALYSIS

### DATASET 1 - CAX_Startup_Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
dataset1=pd.read_csv("/content/drive/MyDrive/95-885-T11-Project2/Datasets/CAX_Startup_Data.csv")
dataset1.head()
dataset1.shape

(202, 116)

The dataset has 116 columns, many of which look like useful features for predicting startup success, but only 202 rows, which seems too small for machine learning. Next we check the class balance. The Dependent-Company Status column takes the values Success and Failed, so we count each.

In [None]:
print(dataset1['Dependent-Company Status'].value_counts())
df_success = (dataset1['Dependent-Company Status'] == 'Success')
print(df_success.sum())
df_fail = (dataset1['Dependent-Company Status'] == 'Failed')
print(df_fail.sum())

Dependent-Company Status
Success    159
Failed      43
Name: count, dtype: int64
159
43
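
The same balance check can be expressed as proportions and a single ratio. This is a minimal sketch: the `status` Series is an illustrative stand-in rebuilt from the counts printed above, not the notebook's actual column.

```python
import pandas as pd

# Illustrative stand-in for dataset1['Dependent-Company Status'],
# rebuilt from the counts printed above (159 Success, 43 Failed).
status = pd.Series(["Success"] * 159 + ["Failed"] * 43)

counts = status.value_counts()
proportions = status.value_counts(normalize=True)

# Ratio of majority to minority class; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio well above 1 is a quick signal that plain accuracy will be a misleading metric for this target.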


In [None]:
dataset1.head()

Unnamed: 0,Company_Name,Dependent-Company Status,year of founding,Age of company in years,Internet Activity Score,Short Description of company profile,Industry of company,Focus functions of company,Investors,Employee Count,...,Percent_skill_Data Science,Percent_skill_Business Strategy,Percent_skill_Product Management,Percent_skill_Sales,Percent_skill_Domain,Percent_skill_Law,Percent_skill_Consulting,Percent_skill_Finance,Percent_skill_Investment,Renown score
0,Company3,Success,2011,3,455,Event Data Analytics API,Analytics|Cloud Computing|Software Development,operations,TechStars|Streamlined Ventures|Amplify Partner...,14.0,...,3.846153846,17.09401709,9.401709402,0.0,2.777777778,0,0,0,0,9
1,Company4,Success,2009,5,-99,The most advanced analytics for mobile,Mobile|Analytics,Marketing & Sales,Michael Birch|Max Levchin|Sequoia Capital|Keit...,45.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,5
2,Company5,Success,2010,4,496,The Location-Based Marketing Platform,Analytics|Marketing|Enterprise Software,Marketing & Sales,DFJ Frontier|Draper Nexus Ventures|Gil Elbaz|A...,39.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,6
3,Company6,Success,2010,4,106,big data for foodservice,Food & Beverages|Hospitality,analytics,Pritzker Group Venture Capital|Excelerate Labs...,14.0,...,3.125,6.25,3.125,3.125,0.0,0,0,0,0,6
4,Company7,Success,2011,3,39,,Analytics,Research,Plug & Play Ventures|Correlation Ventures|Cros...,7.0,...,22.22222222,0.0,0.0,0.0,5.555555556,0,0,0,0,0


The columns include features such as the distribution of employee skills, the type of investors, an internet activity score, and the number of employees. However:

1) there are too few rows (202) for machine learning

2) the dataset is imbalanced (159 Success vs. 43 Failed).

This is why we decided not to go ahead with this dataset.

### DATASET 2 - UNICORN

Unicorns are typically considered successful startups, so we analyze this dataset to look for distinguishing characteristics (if any) that we could build on.

In [None]:
import pandas as pd
dataset2=pd.read_excel("/content/drive/MyDrive/95-885-T11-Project2/Datasets/Unicorn companies.xlsx")
dataset2.head()
dataset2.shape

(452, 6)

At 452 rows and 6 columns, this dataset has more rows than the previous one but far fewer columns. This is still not enough to fit a machine learning model and get meaningful results. Let's dig into the features to see if we can find anything useful.

In [None]:
dataset2['Country'].value_counts()

Country
United States     220
China             109
United Kingdom     24
India              20
Germany            12
South Korea        10
Israel              8
Brazil              7
France              5
Indonesia           5
Switzerland         4
Australia           3
Japan               3
Hong Kong           3
Singapore           2
Colombia            2
Sweden              2
Spain               2
Canada              2
South Africa        2
Estonia             1
Netherlands         1
Luxembourg          1
Portugal            1
Philippines         1
Lithuania           1
Malta               1
Name: count, dtype: int64

The data is concentrated in a few countries: the United States and China account for 329 of the 452 companies, and most of the remainder are developed economies, with India (20) the main exception. This dataset also has no variable describing the status of the startup, i.e. whether it is successful or not.

In [None]:
dataset2['industry'].value_counts()

industry
Fintech                                60
Internet software & services           54
E-commerce & direct-to-consumer        54
Artificial intelligence                46
Health                                 33
Other                                  30
Auto & transportation                  27
Supply chain, logistics, & delivery    27
Mobile & telecommunications            26
Data management & analytics            18
Hardware                               18
Consumer & retail                      17
Edtech                                 14
Cybersecurity                          14
Travel                                 13
Education                               1
Name: count, dtype: int64

Fintech is the industry with the most unicorns (60).

Next, we split the investors column to see whether any specific investor has more investments in unicorns.

In [None]:

split_columns = dataset2['Select Investors'].str.split(',', expand=True, n=4)
split_columns[1] = split_columns.get(1)
split_columns[2] = split_columns.get(2)
split_columns[3] = split_columns.get(3)
split_columns[4] = split_columns.get(4)
dataset2['Select Investors 1'] = split_columns[0]
dataset2['Select Investors 2'] = split_columns[1]
dataset2['Select Investors 3'] = split_columns[2]
dataset2['Select Investors 4'] = split_columns[3]
dataset2['Select Investors 5'] = split_columns[4]
dataset2.head()

Unnamed: 0,company,valuation($B),date joined,Country,industry,Select Investors,Select Investors 1,Select Investors 2,Select Investors 3,Select Investors 4,Select Investors 5
0,100credit,1.0,2018-04-18,China,Fintech,"Sequoia Capital China, China Reform Fund, Hill...",Sequoia Capital China,China Reform Fund,Hillhouse Capital Management,,
1,17zuoye,1.0,2018-03-07,China,Edtech,"DST Global, Temasek Holdings",DST Global,Temasek Holdings,,,
2,23andMe,2.5,2015-07-03,United States,Health,"Google Ventures, New Enterprise Associates, MP...",Google Ventures,New Enterprise Associates,MPM Capital,,
3,4Paradigm,1.2,2018-12-19,China,Artificial intelligence,"Sequoia Capital China, China Construction Bank...",Sequoia Capital China,China Construction Bank,Bank of China,,
4,58 Daojia,1.0,2016-02-18,China,Internet software & services,"KKR, Alibaba Group, Ping An Insurance",KKR,Alibaba Group,Ping An Insurance,,


In [None]:
INVESTOR1=dataset2['Select Investors 1'].value_counts()
print(INVESTOR1)
INVESTOR2=dataset2['Select Investors 2'].value_counts()
print(INVESTOR2)
INVESTOR3=dataset2['Select Investors 3'].value_counts()
print(INVESTOR3)

Select Investors 1
Sequoia Capital China           18
Andreessen Horowitz             10
Google Ventures                  9
New Enterprise Associates        9
Tencent Holdings                 8
                                ..
Koch Disruptive Technologies     1
Eight Roads Ventures             1
Movile                           1
Data Collective                  1
AID Partners                     1
Name: count, Length: 276, dtype: int64
Select Investors 2
 Sequoia Capital            11
 Tiger Global Management     7
 Tencent Holdings            6
 Khosla Ventures             5
 Greylock Partners           5
                            ..
 Linear Venture              1
 Founder H Fund              1
 Walden International        1
 Oceanwide Holdings          1
 AME Cloud Ventures          1
Name: count, Length: 288, dtype: int64
Select Investors 3
 Tencent Holdings                          7
 Sequoia Capital                           6
 IDG Capital                               6
 Acc

We decided not to move ahead with this dataset either: it has too few rows and columns, it has no success/failure label, and the data quality seems weak.
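
The per-position counts above spread the same investor across several columns, and every name after the first keeps a leading space from the comma split. A sketch of counting each investor once across all positions, using a made-up miniature of the Select Investors column (the data is hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the 'Select Investors' column.
investors = pd.Series([
    "Sequoia Capital China, China Reform Fund",
    "DST Global, Sequoia Capital China",
    "KKR, Alibaba Group, DST Global",
])

# Split on commas, put each investor on its own row, and strip the
# leading space that a plain split(',') leaves behind.
investor_counts = (
    investors.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(investor_counts)
```

Stripping whitespace matters here: without it, "Sequoia Capital China" and " Sequoia Capital China" would be counted as two different investors.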

### DATASET 3 - investments_VC_USA

In [None]:
dataset3=pd.read_csv("/content/drive/MyDrive/95-885-T11-Project2/Datasets/investments_VC_USA.csv", encoding="latin-1")
dataset3.head()
dataset3.shape

(19109, 37)

This dataset seems large enough, at least in size, to fit a machine learning model.

In [None]:
dataset3['status'].value_counts()

status
operating    16077
acquired      2046
closed         986
Name: count, dtype: int64

In [None]:
dataset3.head()

Unnamed: 0,Organization,category_list,market,funding_total_usd,status,country_code,state_code,region,city,funding_rounds,...,secondary_market,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H
0,waywire,|Entertainment|Politics|Social Media|News|,News,1750000,acquired,USA,NY,New York City,New York,1,...,0,0,0,0,0,0,0,0,0,0
1,r-ranch-and-mine,|Tourism|Entertainment|Games|,Tourism,60000,operating,USA,TX,Dallas,Fort Worth,2,...,0,0,0,0,0,0,0,0,0,0
2,1-800-doctors,|Health and Wellness|,Health and Wellness,1750000,operating,USA,NJ,Newark,Iselin,1,...,0,0,0,0,0,0,0,0,0,0
3,10-20-media,|E-Commerce|,E-Commerce,2050000,operating,USA,MD,Baltimore,Woodbine,4,...,0,0,0,0,0,0,0,0,0,0
4,1000-corks,|Search|,Search,40000,operating,USA,OR,"Portland, Oregon",Lake Oswego,1,...,0,0,0,0,0,0,0,0,0,0


Soon after going through this dataset, we noticed it was a snapshot (a single sheet) of the Crunchbase dataset, so instead of examining it further we went straight to the Crunchbase dataset for the next steps.

### DATASET 4 - Crunchbase Data

In [None]:
dataset4=pd.read_excel("/content/drive/MyDrive/95-885-T11-Project2/Datasets/crunchbase_monthly_export_d43b4klo2ade53.xlsx",'Companies')
dataset4.head()
dataset4.shape

(49438, 18)

In [None]:
dataset4.isna().sum()


permalink                0
name                     1
homepage_url          3449
category_list         3961
market                3968
funding_total_usd        0
status                1314
country_code          5273
state_code           19277
region                5273
city                  6116
funding_rounds           0
founded_at           10884
founded_month        10956
founded_quarter      10956
founded_year         10956
first_funding_at         0
last_funding_at          0
dtype: int64

In [None]:
company_data = dataset4



We decided to use this dataset since it has a significant number of rows and columns and the data quality is good as well. We go deeper into it in the steps below.


**Decisions** : 1) Drop homepage_url, since it will not help predict whether a company will be successful.

2) Drop state_code and region. state_code has too many null values (19,277), and region is very location-specific; we expect differences to show up across countries rather than across regions.

3) Drop founded_at, founded_year, founded_quarter and founded_month, since they have too many nulls (about 10,900 each) and would not be useful for prediction. The year and timing would have mattered if we had data on GDP, interest rates, or the economic performance of the various countries.

The cell below also drops first_funding_at and last_funding_at.

In [None]:
company_data=company_data.drop([ 'homepage_url','region','state_code','founded_at','founded_month','founded_quarter','founded_year', 'first_funding_at','last_funding_at'],axis=1)
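
The drop decisions above were made by reading the isna().sum() output. A sketch of ranking columns by their share of missing values instead, on a hypothetical toy frame; the 0.4 threshold is an illustrative assumption, not a value used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror a few from the Crunchbase sheet.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "state_code": [np.nan, np.nan, "NY", np.nan],
    "founded_year": [2010, np.nan, np.nan, 2012],
    "funding_rounds": [1, 2, 1, 3],
})

# Fraction of missing values per column, highest first.
null_fraction = df.isna().mean().sort_values(ascending=False)
print(null_fraction)

# Illustrative threshold (an assumption, not a value used in the notebook).
threshold = 0.4
too_sparse = null_fraction[null_fraction > threshold].index.tolist()
print(too_sparse)
```

Fractions are easier to compare across datasets of different sizes than raw null counts.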

In [None]:
company_data['status'].value_counts()

status
operating    41829
acquired      3692
closed        2603
Name: count, dtype: int64

In [None]:
company_data.isna().sum()

permalink               0
name                    1
category_list        3961
market               3968
funding_total_usd       0
status               1314
country_code         5273
city                 6116
funding_rounds          0
dtype: int64

In [None]:
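# Label operating and acquired companies as successful (1); everything else is 0.
# Note: rows with a missing status (1,314 of them) also fall into the 0 class.
company_data["Successful?"] = company_data['status'].apply(lambda status: 1 if status in ["operating", "acquired"] else 0)
company_data=company_data.drop('status',axis=1)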
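
The lambda above assigns 0 to every status other than operating or acquired, which includes missing values. That is why the 0 class has 3,917 rows while only 2,603 companies are marked closed (status has 1,314 nulls). A sketch of an alternative that keeps unknown outcomes separate, on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with the three observed statuses plus a missing value.
status = pd.Series(["operating", "acquired", "closed", np.nan])

# Map known statuses explicitly; anything unmapped (including NaN) stays NaN.
label = status.map({"operating": 1, "acquired": 1, "closed": 0})
print(label.value_counts(dropna=False))

# Rows with an unknown outcome can then be dropped (or handled separately).
labelled = label.dropna().astype(int)
print(labelled.tolist())
```

Whether unknown-status companies should count as failures is a modelling choice; this sketch only makes that choice explicit.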

In [None]:
company_data.head()

Unnamed: 0,permalink,name,category_list,market,funding_total_usd,country_code,city,funding_rounds,Successful?
0,/organization/waywire,#waywire,|Entertainment|Politics|Social Media|News|,News,1750000,USA,New York,1,1
1,/organization/tv-communications,&TV Communications,|Games|,Games,4000000,USA,Los Angeles,2,1
2,/organization/rock-your-paper,'Rock' Your Paper,|Publishing|Education|,Publishing,40000,EST,Tallinn,1,1
3,/organization/in-touch-network,(In)Touch Network,|Electronics|Guides|Coffee|Restaurants|Music|i...,Electronics,1500000,GBR,London,1,1
4,/organization/r-ranch-and-mine,-R- Ranch and Mine,|Tourism|Entertainment|Games|,Tourism,60000,USA,Fort Worth,2,1


In [None]:
company_data.isna().sum()

permalink               0
name                    1
category_list        3961
market               3968
funding_total_usd       0
country_code         5273
city                 6116
funding_rounds          0
Successful?             0
dtype: int64

In [None]:
company_data['Successful?'].value_counts()

Successful?
1    45521
0     3917
Name: count, dtype: int64

This data is highly imbalanced (about 92% of companies are labelled successful), so we will try to balance it.

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


X = company_data.drop(['Successful?','name','permalink','category_list','market','country_code','city'], axis=1)
y = company_data['Successful?']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data only
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# new class distribution
print("After SMOTE, counts of label '1': {}".format(sum(y_train_smote == 1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_smote == 0)))

After SMOTE, counts of label '1': 36411
After SMOTE, counts of label '0': 36411
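
The split above does not pass stratify, so the class ratio in the train and test sets can drift slightly from the full dataset. A sketch of a stratified split on synthetic labels (the data is made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (8% minority), purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 80 + [1] * 920)

# stratify=y keeps the class ratio (nearly) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train minority share:", (y_train == 0).mean())
print("test minority share:", (y_test == 0).mean())
```

With a rare class, stratifying helps guarantee that the held-out test set contains enough minority examples to evaluate on.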


Now we test model performance by adding features incrementally. Let's start with a model that uses only funding_total_usd and funding_rounds.

## TEST 1 - CLASSIFICATION WITH ONLY 2 FEATURES - FUNDING ROUND AND FUNDING AMOUNT

In [None]:
X_train.columns

Index(['funding_total_usd', 'funding_rounds'], dtype='object')

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score


pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression())
])

# Using cross-validation to evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipe, X_train_smote, y_train_smote, scoring='accuracy', cv=cv, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Accuracy: 0.577 (0.005)


With only the number of funding rounds and the total funding amount, the model reaches about 58% accuracy. However, these cannot be the only features we use to predict the success of a startup, so we now one-hot encode the country code and check the performance again.
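
The cross-validation above runs on the SMOTE-resampled training set. Synthetic points interpolated from a fold's training rows can then land in its validation fold, and accuracy is measured on an artificially balanced sample, so the score can be optimistic. One alternative, sketched here on synthetic data, is to skip resampling, weight the classes inside the model, and score with balanced accuracy. This swaps in class weighting for SMOTE and is not the method used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic imbalanced data (about 92% positive), standing in for the real features.
X, y = make_classification(
    n_samples=2000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.08, 0.92], random_state=0,
)

pipe = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    # class_weight="balanced" reweights classes instead of resampling them.
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Stratified folds on the original (unresampled) data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X, y, scoring="balanced_accuracy", cv=cv)
print("Balanced accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```

Balanced accuracy averages recall over both classes, so a model that always predicts "successful" scores 0.5 rather than about 0.92.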

## TEST 2 - Adding one hot encoding for country_code.

In [None]:
test2_data= company_data[['Successful?','name','funding_total_usd','funding_rounds','country_code']].copy()

In [None]:
test2_data['country_code'].value_counts()

country_code
USA    28793
GBR     2642
CAN     1405
CHN     1239
DEU      968
       ...  
ALB        1
MOZ        1
LIE        1
BRN        1
MAF        1
Name: count, Length: 115, dtype: int64

In [None]:
test2_data=test2_data.groupby('country_code').filter(lambda x : len(x)>4)
country_dummies = pd.get_dummies(test2_data['country_code'], drop_first=True)

# combine these new dummy variables with your dataset
test2_data = pd.concat([test2_data.drop(['country_code'], axis=1), country_dummies], axis=1)
test2_data.shape

(44089, 79)
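
The groupby filter above removes every row from a rare country, and rows with a missing country_code are excluded as well because groupby skips missing keys by default (the frame shrinks from 49,438 to 44,089 rows). A sketch of an alternative that keeps those rows by bucketing them, on hypothetical codes; the `min_count` threshold and the OTHER/UNKNOWN labels are illustrative assumptions:

```python
import pandas as pd

# Hypothetical country codes; 'EST' and 'LIE' are rare in this toy sample.
countries = pd.Series(["USA", "USA", "USA", "GBR", "GBR", "EST", "LIE", None])

min_count = 2  # illustrative threshold, not the notebook's value
counts = countries.value_counts()
rare = counts[counts < min_count].index

# Replace rare codes with a shared 'OTHER' label and missing codes with
# 'UNKNOWN', so no rows are lost before one-hot encoding.
bucketed = countries.where(~countries.isin(rare), "OTHER").fillna("UNKNOWN")

dummies = pd.get_dummies(bucketed)
print(dummies.columns.tolist())
```

This keeps the number of dummy columns small while leaving the row count, and therefore the class counts, unchanged.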

In [None]:
test2_data['Successful?'].value_counts()

Successful?
1    40792
0     3297
Name: count, dtype: int64

The classes are again imbalanced, so we repeat the same split and SMOTE steps.

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


X = test2_data.drop(['Successful?','name'], axis=1)
y = test2_data['Successful?']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data only
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# new class distribution
print("After SMOTE, counts of label '1': {}".format(sum(y_train_smote == 1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_smote == 0)))

After SMOTE, counts of label '1': 32638
After SMOTE, counts of label '0': 32638


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score


pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression())
])

# Using cross-validation to evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipe, X_train_smote, y_train_smote, scoring='accuracy', cv=cv, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Accuracy: 0.691 (0.004)


As we would expect, adding features improves the performance of the classifier: accuracy rises by about 11 percentage points (0.577 to 0.691).

Next, we one-hot encode city instead of country code and see what that does to our classification.

## TEST 3 - Using city instead of country code

In [None]:
test3_data=company_data[['Successful?','name','funding_total_usd','funding_rounds','city']].copy()

In [None]:
test3_data.head()
test3_data['city'].value_counts()
test3_data=test3_data.groupby('city').filter(lambda x : len(x)>10)
city_dummies = pd.get_dummies(test3_data['city'], drop_first=True)

# combine these new dummy variables with your dataset
test3_data = pd.concat([test3_data.drop(['city'], axis=1), city_dummies], axis=1)

test3_data['Successful?'].value_counts()

Successful?
1    32625
0     2649
Name: count, dtype: int64

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


X = test3_data.drop(['Successful?','name'], axis=1)
y = test3_data['Successful?']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data only
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# new class distribution
print("After SMOTE, counts of label '1': {}".format(sum(y_train_smote == 1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_smote == 0)))

After SMOTE, counts of label '1': 26077
After SMOTE, counts of label '0': 26077


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score


pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression())
])

# Using cross-validation to evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipe, X_train_smote, y_train_smote, scoring='accuracy', cv=cv, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Accuracy: 0.901 (0.004)


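One-hot encoding cities produces far more columns than countries, and we see an increase in performance: about 21 percentage points over the country-code model (0.691 to 0.901). Note that the city filter keeps a different subset of rows (35,274 versus 44,089), so the two scores are not measured on exactly the same companies.

Next, we use both cities and countries together and analyze the effect on performance.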

# GEOGRAPHY TEST

In [None]:
geography_test=company_data.copy()
geography_test.columns

Index(['permalink', 'name', 'category_list', 'market', 'funding_total_usd',
       'country_code', 'city', 'funding_rounds', 'Successful?'],
      dtype='object')

In [None]:
# Preview only: the result is not assigned, so geography_test keeps all its columns.
geography_test.drop(['category_list','market','permalink'],axis=1)

Unnamed: 0,name,funding_total_usd,country_code,city,funding_rounds,Successful?
0,#waywire,1750000,USA,New York,1,1
1,&TV Communications,4000000,USA,Los Angeles,2,1
2,'Rock' Your Paper,40000,EST,Tallinn,1,1
3,(In)Touch Network,1500000,GBR,London,1,1
4,-R- Ranch and Mine,60000,USA,Fort Worth,2,1
...,...,...,...,...,...,...
49433,Zzish,320000,GBR,London,1,1
49434,ZZNode Science and Technology,1587301,CHN,Beijing,1,1
49435,Zzzzapp Wireless ltd.,97398,HRV,Split,5,1
49436,[a]list games,9300000,,,1,1


ONE HOT ENCODING BOTH CITY AS WELL AS COUNTRY

In [None]:
geography_test=geography_test.groupby('country_code').filter(lambda x : len(x)>5)
country_dummies = pd.get_dummies(geography_test['country_code'], drop_first=True)

# combine these new dummy variables with your dataset
geography_test = pd.concat([geography_test.drop(['country_code'], axis=1), country_dummies], axis=1)
geography_test.shape

geography_test=geography_test.groupby('city').filter(lambda x : len(x)>10)
city_dummies = pd.get_dummies(geography_test['city'], drop_first=True)

# combine these new dummy variables with your dataset
geography_test = pd.concat([geography_test.drop(['city'], axis=1), city_dummies], axis=1)

geography_test['Successful?'].value_counts()

Successful?
1    32621
0     2648
Name: count, dtype: int64

In [None]:
geography_test.columns

Index(['permalink', 'name', 'category_list', 'market', 'funding_total_usd',
       'funding_rounds', 'Successful?', 'ARG', 'AUS', 'AUT',
       ...
       'Westwood', 'White Plains', 'Wilmington', 'Woburn', 'Woodland Hills',
       'Worcester', 'Yoqne`am `illit', 'Zug', 'Zürich', 'Çan'],
      dtype='object', length=589)

In [None]:
geography_test.head()

Unnamed: 0,permalink,name,category_list,market,funding_total_usd,funding_rounds,Successful?,ARG,AUS,AUT,...,Westwood,White Plains,Wilmington,Woburn,Woodland Hills,Worcester,Yoqne`am `illit,Zug,Zürich,Çan
0,/organization/waywire,#waywire,|Entertainment|Politics|Social Media|News|,News,1750000,1,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,/organization/tv-communications,&TV Communications,|Games|,Games,4000000,2,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,/organization/rock-your-paper,'Rock' Your Paper,|Publishing|Education|,Publishing,40000,1,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,/organization/in-touch-network,(In)Touch Network,|Electronics|Guides|Coffee|Restaurants|Music|i...,Electronics,1500000,1,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,/organization/r-ranch-and-mine,-R- Ranch and Mine,|Tourism|Entertainment|Games|,Tourism,60000,2,1,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


X = geography_test.drop(['Successful?','name','permalink','category_list','market'], axis=1)
y = geography_test['Successful?']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data only
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# new class distribution
print("After SMOTE, counts of label '1': {}".format(sum(y_train_smote == 1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_smote == 0)))

After SMOTE, counts of label '1': 26062
After SMOTE, counts of label '0': 26062


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score


pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression())
])

# Using cross-validation to evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipe, X_train_smote, y_train_smote, scoring='accuracy', cv=cv, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Accuracy: 0.902 (0.004)


Accuracy with both cities and countries (0.902) is essentially the same as with cities alone (0.901). Rather than leaning on geography, we make a design decision to keep only the country code dummies in our final model: there are roughly 70 of them, versus roughly 500 for cities, which add a lot of features to the dataset without adding as much value.

Now we move from the geographical features to the categories. The original data has a category_list column, but it cannot be used directly in a machine learning model because each entry is a pipe-delimited string of category names.
We therefore split each entry on |, which explodes the dataset into one row per company-category pair.
Each category is then coded with an integer ID, and the IDs are recombined into one list per company.

In [None]:
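# Cast to string so the .str methods work on every row.
# Caveat: missing values become the string 'nan', which the slice below
# turns into the spurious category 'a'.
company_data['category_list'] = company_data['category_list'].astype(str)

# Drop the first and last character (the outer '|'), then split on '|'

company_data['category_list'] = company_data['category_list'].str[1:-1].str.split('|')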

company_data.head()

exploded = company_data.explode('category_list')

exploded['category_id'], unique_categories = pd.factorize(exploded['category_list'])
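
An alternative to integer IDs is a multi-hot encoding, where each category becomes its own 0/1 column. This is a different technique from the ID lists used here, shown only as a sketch on a hypothetical toy column. With 824 distinct categories it would add 824 columns, which is one reason to prefer aggregate features instead.

```python
import pandas as pd

# Hypothetical toy column in the same pipe-delimited format.
category_list = pd.Series(["|Games|News|", "|News|", None])

# str.get_dummies splits on the separator and returns one 0/1 column
# per distinct category; missing values become all-zero rows.
multi_hot = category_list.str.strip("|").str.get_dummies(sep="|")
print(multi_hot)
```

Unlike a list of IDs, this representation can be passed to a scaler or classifier without further processing.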


Category Mappings to their IDs.

In [None]:
category_mapping = pd.DataFrame({
    'category_id': range(len(unique_categories)),
    'category_name': unique_categories
})

print(category_mapping)

     category_id        category_name
0              0        Entertainment
1              1             Politics
2              2         Social Media
3              3                 News
4              4                Games
..           ...                  ...
819          819           Timeshares
820          820              Indians
821          821      South East Asia
822          822      Building Owners
823          823  Clean Technology IT

[824 rows x 2 columns]


In [None]:
reconstructed_df = exploded.groupby('name')['category_id'].agg(list).reset_index()

In [None]:
final_df = pd.merge(company_data.drop(columns='category_list'), reconstructed_df, on='name')
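
The regrouping above keys on name. If two companies shared a name (not verified for this dataset), their category lists would be merged and the join could duplicate rows. A sketch of the same explode, factorize, and regroup steps keyed on permalink, which has no nulls in the output above; the toy data is hypothetical:

```python
import pandas as pd

# Hypothetical toy data: two companies share a name but have distinct permalinks.
df = pd.DataFrame({
    "permalink": ["/organization/a", "/organization/b", "/organization/c"],
    "name": ["Acme", "Acme", "Zeta"],
    "category_list": ["|Games|News|", "|News|", "|Games|"],
})

# Strip the outer pipes, split into lists, and give each category its own row.
df["category_list"] = df["category_list"].str.strip("|").str.split("|")
exploded = df.explode("category_list")

# Assign an integer ID to each distinct category, in order of first appearance.
exploded["category_id"], unique_categories = pd.factorize(exploded["category_list"])

# Regroup on the unique key so same-named companies are not merged together.
regrouped = exploded.groupby("permalink")["category_id"].agg(list).reset_index()
result = df.drop(columns="category_list").merge(regrouped, on="permalink")
print(result)
```

Each company keeps exactly one row, and the two "Acme" entries keep their own category lists.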

Adding list of unique categories as a feature for prediction of Success.

In [None]:
final_df.head()

Unnamed: 0,permalink,name,market,funding_total_usd,country_code,city,funding_rounds,Successful?,category_id
0,/organization/waywire,#waywire,News,1750000,USA,New York,1,1,"[0, 1, 2, 3]"
1,/organization/tv-communications,&TV Communications,Games,4000000,USA,Los Angeles,2,1,[4]
2,/organization/rock-your-paper,'Rock' Your Paper,Publishing,40000,EST,Tallinn,1,1,"[5, 6]"
3,/organization/in-touch-network,(In)Touch Network,Electronics,1500000,GBR,London,1,1,"[7, 8, 9, 10, 11, 12, 13, 14, 15, 16]"
4,/organization/r-ranch-and-mine,-R- Ranch and Mine,Tourism,60000,USA,Fort Worth,2,1,"[17, 0, 4]"


Calculating the number of categories per company

In [None]:
final_df['category_count'] = final_df['category_id'].apply(len)

Adding the summed relative frequency of each company's categories, where a category's relative frequency is its share of all category occurrences in the dataset.

In [None]:
from collections import Counter

# Flatten all category lists and count occurrences
all_categories = [category for sublist in final_df['category_id'] for category in sublist]
category_freq = Counter(all_categories)

# Normalize frequencies by the total number of occurrences
total_occurrences = sum(category_freq.values())
category_freq = {k: v / total_occurrences for k, v in category_freq.items()}

# Apply to DataFrame
final_df['category_freq_sum'] = final_df['category_id'].apply(lambda x: sum(category_freq.get(cat, 0) for cat in x))
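
A small worked example of the same calculation, using made-up category-ID lists for three companies:

```python
from collections import Counter

# Hypothetical category-ID lists for three companies.
category_ids = [[0, 1], [1], [0, 1, 2]]

# Count every occurrence of every ID across all companies.
all_ids = [cid for ids in category_ids for cid in ids]
freq = Counter(all_ids)            # ID 0 appears 2 times, ID 1 3 times, ID 2 once
total = sum(freq.values())         # 6 occurrences in total
rel_freq = {k: v / total for k, v in freq.items()}

# Each company's score is the sum of its categories' relative frequencies.
freq_sum = [sum(rel_freq[cid] for cid in ids) for ids in category_ids]
print(freq_sum)
```

Because the score is a sum, it tends to grow with the number of categories a company lists, so it is partly correlated with category_count.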

In [None]:
final_df.head()

Unnamed: 0,permalink,name,market,funding_total_usd,country_code,city,funding_rounds,Successful?,category_id,category_count,category_freq_sum
0,/organization/waywire,#waywire,News,1750000,USA,New York,1,1,"[0, 1, 2, 3]",4,0.03617
1,/organization/tv-communications,&TV Communications,Games,4000000,USA,Los Angeles,2,1,[4],1,0.019325
2,/organization/rock-your-paper,'Rock' Your Paper,Publishing,40000,EST,Tallinn,1,1,"[5, 6]",2,0.016499
3,/organization/in-touch-network,(In)Touch Network,Electronics,1500000,GBR,London,1,1,"[7, 8, 9, 10, 11, 12, 13, 14, 15, 16]",10,0.100395
4,/organization/r-ranch-and-mine,-R- Ranch and Mine,Tourism,60000,USA,Fort Worth,2,1,"[17, 0, 4]",3,0.025786


In [None]:

final_df['market_id'], unique_markets = pd.factorize(final_df['market'])

Create a new column named market_freq_sum in final_df. For each row, it holds the normalized frequency of that row's market ID, looked up with a lambda applied to the market_id column. If a market ID is not found in the normalized-frequency dictionary, the value defaults to 0.

In [None]:
# Import Counter from the collections module
from collections import Counter

# Count the occurrences of each unique market ID in the 'market_id' column of the DataFrame
market_freq = Counter(final_df['market_id'])

# Calculate the total number of market occurrences by summing up the values in the market_freq dictionary
total_market_occurrences = sum(market_freq.values())

# Normalize the frequencies by dividing each frequency value by the total number of market occurrences
# This ensures that the frequencies sum up to 1, providing a proportionate representation
market_freq_normalized = {k: v / total_market_occurrences for k, v in market_freq.items()}


final_df['market_freq_sum'] = final_df['market_id'].apply(lambda x: market_freq_normalized.get(x, 0))
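
For a single-valued column like market, the same share-of-rows feature can be computed directly with pandas. A sketch on a hypothetical toy column:

```python
import pandas as pd

# Hypothetical toy market column.
market = pd.Series(["News", "Games", "News", "Search"])

# value_counts(normalize=True) gives each market's share of rows;
# map() looks that share up for every row.
market_share = market.map(market.value_counts(normalize=True))
print(market_share.tolist())
```

One difference to be aware of: pd.factorize assigns missing markets the ID -1, which the Counter above counts like any other ID, whereas this version leaves missing markets as NaN.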

In [None]:
final_df=final_df.groupby('country_code').filter(lambda x : len(x)>4)
country_dummies = pd.get_dummies(final_df['country_code'], drop_first=True)

# combine these new dummy variables with your dataset
final_df = pd.concat([final_df.drop(['country_code'], axis=1), country_dummies], axis=1)

Now that we have prepared the category and market features, we can run our classification on a combination of all these features of a company.


# TEST 4 - ADDING CATEGORIES, MARKET

In [None]:
test4_data= final_df

In [None]:
test4_data.head()

Unnamed: 0,permalink,name,market,funding_total_usd,city,funding_rounds,Successful?,category_id,category_count,category_freq_sum,...,THA,TUR,TWN,TZA,UGA,UKR,URY,USA,VNM,ZAF
0,/organization/waywire,#waywire,News,1750000,New York,1,1,"[0, 1, 2, 3]",4,0.03617,...,False,False,False,False,False,False,False,True,False,False
1,/organization/tv-communications,&TV Communications,Games,4000000,Los Angeles,2,1,[4],1,0.019325,...,False,False,False,False,False,False,False,True,False,False
2,/organization/rock-your-paper,'Rock' Your Paper,Publishing,40000,Tallinn,1,1,"[5, 6]",2,0.016499,...,False,False,False,False,False,False,False,False,False,False
3,/organization/in-touch-network,(In)Touch Network,Electronics,1500000,London,1,1,"[7, 8, 9, 10, 11, 12, 13, 14, 15, 16]",10,0.100395,...,False,False,False,False,False,False,False,False,False,False
4,/organization/r-ranch-and-mine,-R- Ranch and Mine,Tourism,60000,Fort Worth,2,1,"[17, 0, 4]",3,0.025786,...,False,False,False,False,False,False,False,True,False,False


In [None]:
test4_data['Successful?'].value_counts()

Successful?
1    40792
0     3297
Name: count, dtype: int64

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


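# Note: X and y are built from test3_data (from TEST 3), not from the test4_data defined above,
# so the counts and scores printed below describe test3_data
X = test3_data.drop(['Successful?','name'], axis=1)
y = test3_data['Successful?']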

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data only
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# new class distribution
print("After SMOTE, counts of label '1': {}".format(sum(y_train_smote == 1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_smote == 0)))

After SMOTE, counts of label '1': 26077
After SMOTE, counts of label '0': 26077


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report


pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression())
])

# Using cross-validation to evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipe, X_train_smote, y_train_smote, scoring='accuracy', cv=cv, n_jobs=-1)
pipe.fit(X_train_smote, y_train_smote)

y_pred = pipe.predict(X_test)

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

# Evaluating the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.901 (0.004)
Confusion Matrix:
 [[  21  486]
 [  89 6459]]
Classification Report:
               precision    recall  f1-score   support

           0       0.19      0.04      0.07       507
           1       0.93      0.99      0.96      6548

    accuracy                           0.92      7055
   macro avg       0.56      0.51      0.51      7055
weighted avg       0.88      0.92      0.89      7055



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
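
The headline accuracy hides how rarely class 0 (failed startups) is identified. The short calculation below reworks the confusion matrix printed above; the variable names are ours, and the numbers are copied from that output.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 21, 486      # true class 0 (failed): predicted 0, predicted 1
fn, tp = 89, 6459     # true class 1 (successful): predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tn + tp) / total                    # share of all test rows predicted correctly
recall_0 = tn / (tn + fp)                       # share of failed startups that were caught
recall_1 = tp / (tp + fn)                       # share of successful startups that were caught
balanced_accuracy = (recall_0 + recall_1) / 2   # mean of the two per-class recalls
majority_baseline = (fn + tp) / total           # accuracy of always predicting class 1

print(f"accuracy          = {accuracy:.3f}")
print(f"recall (class 0)  = {recall_0:.3f}")
print(f"recall (class 1)  = {recall_1:.3f}")
print(f"balanced accuracy = {balanced_accuracy:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

On this test split the model's accuracy is slightly below what always predicting "successful" would achieve, and the balanced accuracy is close to 0.5, so accuracy alone is not a reliable guide here.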


Next, we explore and clean the Investments dataset to see which features can be combined with the company dataset to improve prediction.

In [None]:
investments_data=pd.read_excel("/content/drive/MyDrive/95-885-T11-Project2/Datasets/crunchbase_monthly_export_d43b4klo2ade53.xlsx",'Investments')
investments_data.head()



In [None]:
investments_data.isna().sum()

permalink                      0
name                           1
company_category_list       3264
company_market              3266
company_country_code        7359
company_state_code         35348
company_region              7359
company_city                8705
investor_permalink            66
investor_name                 66
investor_category_list     83999
investor_market            84051
investor_country_code      27985
investor_state_code        52232
investor_region            27985
investor_city              28499
funding_round_permalink        0
funding_round_type             0
funding_round_code         59837
funded_at                      0
funded_month                   0
funded_quarter                 0
funded_year                    0
raised_amount_usd          13351
dtype: int64

Dropping columns that are unlikely to be helpful as features in our final model.

In [None]:

investments_data= investments_data.drop(['company_category_list','company_market','company_country_code', 'company_state_code','investor_market', 'company_region', 'company_city', 'investor_permalink', 'investor_category_list', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city','funding_round_code','funded_year','funded_quarter','funded_month','funded_at','funding_round_permalink'], axis =1)
investments_data.dropna(subset=['investor_name','name'], inplace = True)
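# Fill missing amounts with the number 0 (not the string '0') so the column stays numeric
investments_data['raised_amount_usd'] = investments_data['raised_amount_usd'].fillna(0)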

In [None]:
# Use the 'factorize' function to encode the 'investor_name' column into numerical IDs
# This function returns two values: the encoded IDs and the unique values in the 'investor_name' column
investments_data['investor_id'], unique_investors = pd.factorize(investments_data['investor_name'])

# Count the occurrences of each unique investor ID in the 'investor_id' column of the investments_data DataFrame
investor_counts = investments_data['investor_id'].value_counts()

# Filter the investments_data DataFrame to keep only rows where the investor name appears at least 3 times
# This removes sparse data for investors with fewer than 3 investments
investments_data = investments_data.groupby('investor_id').filter(lambda x: len(x) >= 3)

# Count the occurrences of each unique funding round type in the 'funding_round_type' column of the investments_data DataFrame
funding_round_type_counts = investments_data['funding_round_type'].value_counts()

# Use the 'factorize' function from pandas to encode the 'funding_round_type' column into numerical IDs
# This function returns two values: the encoded IDs and the unique values in the 'funding_round_type' column
investments_data['funding_round_id'], unique_funding_round_types = pd.factorize(investments_data['funding_round_type'])

In [None]:
# Convert the 'raised_amount_usd' column in investments_data DataFrame to numeric data type
investments_data['raised_amount_usd'] = pd.to_numeric(investments_data['raised_amount_usd'])

# Group the investments_data DataFrame by the 'name' column and aggregate the data using specified functions
# - 'funding_round_id': Convert the values in the 'funding_round_id' column to lists for each group
# - 'investor_id': Convert the values in the 'investor_id' column to lists for each group
# - 'raised_amount_usd': Sum the values in the 'raised_amount_usd' column for each group
# Reset the index to make 'name' a regular column instead of the index

investments_data = investments_data.groupby('name').agg({
    'funding_round_id': lambda x: list(x),
    'investor_id': lambda x: list(x),
    'raised_amount_usd': 'sum'
}).reset_index()

In [None]:
# Renaming the columns in investment data so they can be joined with the company data
investments_data.rename(columns={'company_permalink': 'permalink'}, inplace=True)
investments_data.rename(columns={'company_name': 'name'}, inplace=True)

In [None]:
# Joining company and investment data on the company name (inner join by default)
final_data = pd.merge(final_df, investments_data, on=['name'])
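
Because `pd.merge` defaults to an inner join, companies with no matching name in the investment data are dropped, which is likely one reason the later training sets are smaller. The sketch below uses two invented toy frames to show how `indicator=True` can be used to audit which rows match; the frame names `companies`, `investments`, `inner`, and `audit` are illustrative only.

```python
import pandas as pd

# Toy frames standing in for final_df and investments_data (values are invented)
companies = pd.DataFrame({'name': ['A', 'B', 'C'], 'funding_rounds': [1, 2, 1]})
investments = pd.DataFrame({'name': ['A', 'C', 'D'], 'raised_amount_usd': [100.0, 50.0, 75.0]})

# Default inner join: only names present in both frames survive
inner = pd.merge(companies, investments, on=['name'])

# Outer join with indicator=True adds a '_merge' column showing where each row came from
audit = pd.merge(companies, investments, on=['name'], how='outer', indicator=True)

print(len(inner))
print(audit['_merge'].value_counts())
```

Joining on 'name' alone also assumes company names are unique; joining on 'permalink' as well would be stricter if both frames carry it.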

In [None]:
final_data.head()

Unnamed: 0,permalink,name,market,funding_total_usd,city,funding_rounds,Successful?,category_id,category_count,category_freq_sum,...,TZA,UGA,UKR,URY,USA,VNM,ZAF,funding_round_id,investor_id,raised_amount_usd
0,/organization/waywire,#waywire,News,1750000,New York,1,1,"[0, 1, 2, 3]",4,0.03617,...,False,False,False,False,True,False,False,"[2, 2, 2, 2, 2]","[8602, 1113, 7463, 6027, 9125]",8750000.0
1,/organization/rock-your-paper,'Rock' Your Paper,Publishing,40000,Tallinn,1,1,"[5, 6]",2,0.016499,...,False,False,False,False,False,False,False,[2],[8394],40000.0
2,/organization/fox-networks,.Fox Networks,Advertising,4912393,Buenos Aires,1,0,[19],1,0.020527,...,False,False,False,False,False,False,False,[1],[1628],4912393.0
3,/organization/004-technologies,004 Technologies,Software,0,Champaign,1,1,[18],1,0.06635,...,False,False,False,False,True,False,False,[0],[8042],0.0
4,/organization/01games-technology,01Games Technology,Games,41250,Hong Kong,1,1,[4],1,0.019325,...,False,False,False,False,False,False,False,[2],[13159],41250.0
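
In the first row above, #waywire has a 'funding_total_usd' of 1,750,000 but a 'raised_amount_usd' of 8,750,000, exactly five times as much, with five investor IDs. This suggests that each investment row repeats the full round amount once per investor, so a plain sum counts the same round several times. The sketch below reproduces that effect on a toy frame and shows one possible fix, assuming a round identifier such as 'funding_round_permalink' is kept rather than dropped; the identifier value 'round-1' is hypothetical.

```python
import pandas as pd

# One funding round of 1,750,000 USD backed by five investors, one row per investor
rows = pd.DataFrame({
    'name': ['#waywire'] * 5,
    'funding_round_permalink': ['round-1'] * 5,   # hypothetical identifier value
    'investor_id': [8602, 1113, 7463, 6027, 9125],
    'raised_amount_usd': [1750000.0] * 5,
})

# Summing every row counts the same round once per investor
naive_total = rows.groupby('name')['raised_amount_usd'].sum()

# Keeping one row per (company, round) before summing counts each round once
per_round = rows.drop_duplicates(subset=['name', 'funding_round_permalink'])
deduped_total = per_round.groupby('name')['raised_amount_usd'].sum()

print(naive_total['#waywire'], deduped_total['#waywire'])
```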


In [None]:
# Import Counter from the collections module to count occurrences of investors
from collections import Counter

# Flatten the list of investor IDs in the 'investor_id' column of the final_data DataFrame
all_investors = [investor for sublist in final_data['investor_id'] for investor in sublist]

# Count the occurrences of each investor ID using Counter
investor_freq = Counter(all_investors)

# Calculate the total number of occurrences of all investors
total_investor_occurrences = sum(investor_freq.values())

# Normalize the frequencies by dividing each frequency value by the total number of occurrences
investor_freq_normalized = {k: v / total_investor_occurrences for k, v in investor_freq.items()}

# Apply the normalized frequencies to the DataFrame by creating a new column named 'investor_freq_sum'
# For each row, sum the normalized frequencies for all investors associated with that row
final_data['investor_freq_sum'] = final_data['investor_id'].apply(lambda x: sum(investor_freq_normalized.get(investor, 0) for investor in x))

In [None]:
# Import Counter from the collections module to count occurrences of funding round IDs
from collections import Counter

# Flatten the list of funding round IDs in the 'funding_round_id' column of the final_data DataFrame
all_fundings = [funding for sublist in final_data['funding_round_id'] for funding in sublist]

# Count the occurrences of each funding round ID using Counter
funding_freq = Counter(all_fundings)

# Calculate the total number of occurrences of all funding round IDs
total_funding_occurrences = sum(funding_freq.values())

# Normalize the frequencies by dividing each frequency value by the total number of occurrences
funding_freq_normalized = {k: v / total_funding_occurrences for k, v in funding_freq.items()}

# Apply the normalized frequencies to the DataFrame by creating a new column named 'funding_freq_sum'
# For each row, sum the normalized frequencies for all funding round IDs associated with that row
final_data['funding_freq_sum'] = final_data['funding_round_id'].apply(lambda x: sum(funding_freq_normalized.get(funding, 0) for funding in x))
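
A small worked example may help in interpreting these summed frequencies. It applies the same steps to invented toy lists (the names `funding_lists` and `freq_sums` are ours) and shows that the sum grows with the length of the list, so it is not bounded by 1.

```python
from collections import Counter

# Toy data: each inner list holds the funding-round-type IDs for one company
funding_lists = [[2, 2, 2], [2], [1], [0, 2]]

# Flatten, count, and normalize exactly as in the cell above
all_fundings = [funding for sublist in funding_lists for funding in sublist]
funding_freq = Counter(all_fundings)
total = sum(funding_freq.values())
funding_freq_normalized = {k: v / total for k, v in funding_freq.items()}

# Sum the normalized frequencies of every ID in each company's list
freq_sums = [sum(funding_freq_normalized.get(f, 0) for f in sublist)
             for sublist in funding_lists]

for sublist, value in zip(funding_lists, freq_sums):
    print(sublist, round(value, 4))
```

Here ID 2 accounts for 5 of the 7 occurrences, so a company listing it three times scores 15/7, which is above 1. The feature therefore mixes how common a round type is with how many entries the company has.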

In [None]:
# Drop the non-numeric columns (name, permalink, market, city) and the list-valued ID columns
# This creates a new DataFrame containing only numeric and boolean columns
numerical_column = final_data.drop(['name', 'permalink','market', 'city','category_id', 'funding_round_id','investor_id'], axis=1)

# Calculate the correlation matrix for the numerical columns using the corr() function
correlation_matrix = numerical_column.corr()

# Print the correlation matrix
print(correlation_matrix)

                   funding_total_usd  funding_rounds  Successful?  \
funding_total_usd           1.000000        0.316371     0.025686   
funding_rounds              0.316371        1.000000     0.061043   
Successful?                 0.025686        0.061043     1.000000   
category_count             -0.006200        0.101188     0.030316   
category_freq_sum          -0.019098        0.057376    -0.001367   
...                              ...             ...          ...   
VNM                        -0.000387       -0.012434    -0.001159   
ZAF                        -0.001874       -0.010601    -0.002885   
raised_amount_usd           0.834773        0.252608     0.020781   
investor_freq_sum           0.255859        0.458550     0.045794   
funding_freq_sum            0.292505        0.601228     0.041693   

                   category_count  category_freq_sum  market_id  \
funding_total_usd       -0.006200          -0.019098   0.010923   
funding_rounds           0.101188    

In [None]:
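# Number of distinct investors per company
final_data['investors_count'] = final_data['investor_id'].apply(lambda x: len(set(x)))
final_data['investors_count'].value_counts()
# Number of distinct funding round types per company (the IDs encode 'funding_round_type')
final_data['funding_round_count'] = final_data['funding_round_id'].apply(lambda x: len(set(x)))
# The list-valued columns are no longer needed once the counts are derived
final_data.drop(['funding_round_id', 'investor_id'], axis=1, inplace=True)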

# TEST 5 - Classification with investor features

In [None]:
test5_data = final_data[['name','Successful?','investor_freq_sum', 'investors_count', 'funding_freq_sum', 'funding_round_count', 'funding_total_usd', 'funding_rounds']]
test5_data.columns

Index(['name', 'Successful?', 'investor_freq_sum', 'investors_count',
       'funding_freq_sum', 'funding_round_count', 'funding_total_usd',
       'funding_rounds'],
      dtype='object')

In [None]:

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


X = test5_data.drop(['Successful?','name'], axis=1)
y = test5_data['Successful?']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data only
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Check the new class distribution
print("After SMOTE, counts of label '1': {}".format(sum(y_train_smote == 1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_smote == 0)))


After SMOTE, counts of label '1': 18704
After SMOTE, counts of label '0': 18704


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score


pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression())
])

# Using cross-validation to evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipe, X_train_smote, y_train_smote, scoring='accuracy', cv=cv, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))


Accuracy: 0.621 (0.006)


With only investor information, we get a cross-validated accuracy of about 62%.

# RUNNING THE FINAL MODEL WITH ALL FEATURES

In [None]:
final_data.head()

Unnamed: 0,permalink,name,market,funding_total_usd,city,funding_rounds,Successful?,category_id,category_count,category_freq_sum,...,UKR,URY,USA,VNM,ZAF,raised_amount_usd,investor_freq_sum,funding_freq_sum,investors_count,funding_round_count
0,/organization/waywire,#waywire,News,1750000,New York,1,1,"[0, 1, 2, 3]",4,0.03617,...,False,False,True,False,False,8750000.0,0.005997,1.116341,5,1
1,/organization/rock-your-paper,'Rock' Your Paper,Publishing,40000,Tallinn,1,1,"[5, 6]",2,0.016499,...,False,False,False,False,False,40000.0,0.007158,0.223268,1,1
2,/organization/fox-networks,.Fox Networks,Advertising,4912393,Buenos Aires,1,0,[19],1,0.020527,...,False,False,False,False,False,4912393.0,0.000156,0.037131,1,1
3,/organization/004-technologies,004 Technologies,Software,0,Champaign,1,1,[18],1,0.06635,...,False,False,True,False,False,0.0,8.9e-05,0.667214,1,1
4,/organization/01games-technology,01Games Technology,Games,41250,Hong Kong,1,1,[4],1,0.019325,...,False,False,False,False,False,41250.0,0.000357,0.223268,1,1


In [None]:
final_data.columns

Index(['permalink', 'name', 'market', 'funding_total_usd', 'city',
       'funding_rounds', 'Successful?', 'category_id', 'category_count',
       'category_freq_sum', 'market_id', 'market_freq_sum', 'ARG', 'AUS',
       'AUT', 'BEL', 'BGD', 'BGR', 'BRA', 'CAN', 'CHE', 'CHL', 'CHN', 'COL',
       'CRI', 'CYM', 'CYP', 'CZE', 'DEU', 'DNK', 'DZA', 'EGY', 'ESP', 'EST',
       'FIN', 'FRA', 'GBR', 'GHA', 'GRC', 'HKG', 'HRV', 'HUN', 'IDN', 'IND',
       'IRL', 'ISL', 'ISR', 'ITA', 'JOR', 'JPN', 'KEN', 'KHM', 'KOR', 'LBN',
       'LTU', 'LUX', 'LVA', 'MEX', 'MYS', 'NGA', 'NLD', 'NOR', 'NZL', 'PAK',
       'PAN', 'PER', 'PHL', 'POL', 'PRT', 'ROM', 'RUS', 'SAU', 'SGP', 'SRB',
       'SVK', 'SVN', 'SWE', 'THA', 'TUR', 'TWN', 'TZA', 'UGA', 'UKR', 'URY',
       'USA', 'VNM', 'ZAF', 'raised_amount_usd', 'investor_freq_sum',
       'funding_freq_sum', 'investors_count', 'funding_round_count'],
      dtype='object')

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


X = final_data.drop(['Successful?','name','market','city','category_id','raised_amount_usd','market_id','permalink'], axis=1)
y = final_data['Successful?']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data only
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# new class distribution
print("After SMOTE, counts of label '1': {}".format(sum(y_train_smote == 1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_smote == 0)))

After SMOTE, counts of label '1': 18704
After SMOTE, counts of label '0': 18704


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score


pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression())
])

# Using cross-validation to evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipe, X_train_smote, y_train_smote, scoring='accuracy', cv=cv, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Accuracy: 0.779 (0.007)


# Final Model Accuracy is ~78%.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

pipelines = {
    'Decision Tree': Pipeline([
        ('scaler', MinMaxScaler()),
        ('model', DecisionTreeClassifier(random_state=42))
    ]),
    'SVM Classifier': Pipeline([
        ('scaler', MinMaxScaler()),
        ('model', SVC(random_state=42))
    ]),
    'Random Forest': Pipeline([
        ('scaler', MinMaxScaler()),  # It's generally not necessary to scale data for Random Forests, but done here for consistency
        ('model', RandomForestClassifier(random_state=42))
    ])
}

# Cross-validation setup using RepeatedStratifiedKFold to maintain class balance within each fold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate each pipeline
for name, pipeline in pipelines.items():
    # Ensure that the scoring method is appropriate for classification
    scores = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    print(f'{name} Accuracy: {scores.mean():.3f} ({scores.std():.3f})')

Decision Tree Accuracy: 0.873 (0.006)
SVM Classifier Accuracy: 0.933 (0.001)
Random Forest Accuracy: 0.933 (0.002)
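
Unlike the logistic regression runs, these three models are scored on the original, imbalanced `X_train`, so a majority-class baseline is a useful reference point before reading the accuracies above. The sketch below uses synthetic data and scikit-learn's `DummyClassifier`; the class weights and the 5-fold setup are illustrative assumptions, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 93% class 1), standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.07, 0.93], flip_y=0, random_state=42)

# A baseline that ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
baseline_scores = cross_val_score(baseline, X_demo, y_demo, scoring='accuracy', cv=cv)

# Share of the larger class in the data
majority_share = max((y_demo == 0).mean(), (y_demo == 1).mean())

print(f'Baseline accuracy: {baseline_scores.mean():.3f}')
print(f'Majority share:    {majority_share:.3f}')
```

If a model's accuracy is close to the baseline's, it may simply be predicting the majority class, and per-class recall or balanced accuracy would be more informative.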


# FURTHER STEPS

Incorporate economic data, for example GDP and interest rates in the relevant year, since macroeconomic conditions can have a great impact on the success of a startup.

We could apply the same procedure with funding amount as the target and use regression techniques to predict the total funding a startup may receive.