Jensen Judkins - Analysis on 

Dataset: 

### TAKEN STRAIGHT FROM THE DATASET DESCRIPTION

Context

A startup or start-up is a company or project begun by an entrepreneur to seek, develop, and validate a scalable economic model. While entrepreneurship refers to all new businesses, including self-employment and businesses that never intend to become registered, startups refer to new businesses that intend to grow large beyond the solo founder. Startups face high uncertainty and have high rates of failure, but a minority of them do go on to be successful and influential. Some startups become unicorns: privately held startup companies valued at over US$1 billion. [Source of information: Wikipedia]
Startups play a major role in economic growth. They bring new ideas, spur innovation, and create employment, thereby moving the economy. There has been an exponential growth in startups over the past few years. Predicting the success of a startup allows investors to find companies that have the potential for rapid growth, thereby allowing them to be one step ahead of the competition.
Objective

The objective is to predict whether a startup which is currently operating turns into a success or a failure. The success of a company is defined as the event that gives the company's founders a large sum of money through the process of M&A (Merger and Acquisition) or an IPO (Initial Public Offering). A company would be considered as failed if it had to be shut down.
About the Data

The data contains industry trends, investment insights and individual company information. There are 48 columns/features. Some of the features are:

    age_first_funding_year – quantitative
    age_last_funding_year – quantitative
    relationships – quantitative
    funding_rounds – quantitative
    funding_total_usd – quantitative
    milestones – quantitative
    age_first_milestone_year – quantitative
    age_last_milestone_year – quantitative
    state – categorical
    industry_type – categorical
    has_VC – categorical
    has_angel – categorical
    has_roundA – categorical
    has_roundB – categorical
    has_roundC – categorical
    has_roundD – categorical
    avg_participants – quantitative
    is_top500 – categorical
    status(acquired/closed) – categorical (the target variable; if a startup is ‘acquired’ by some other organization, it means the startup succeeded)

Acknowledgements

    I would like to thank Ramkishan Panthena, for providing us this dataset. He is a Machine Learning Engineer at GMO.
    This dataset was used in data sprint #5 at DPhi.

Inspiration

Predicting the success of a startup allows investors to find companies that have the potential for rapid growth, thereby allowing them to be one step ahead of the competition.

In [None]:
#Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import seaborn as sns


The cell below loads the dataset and takes a first look at the categorical data. It draws two bar plots: the number of closed (failed) startups and the number of acquired (successful) startups in each business category (category_code).

In [None]:
#Load Dataset
df = pd.read_csv('startup data.csv')
#display(df.head())

#Create dataframe of only closed startups
df_closed = df[(df['status'] == 'closed')]

#Create dataframe of only acquired startups
df_acquired = df[(df['status'] == 'acquired')]


#Plotting
#plot number of closed startups relative to their category_code
df_closed['category_code'].value_counts().plot(kind='bar')
plt.title('Number of closed startups relative to their category_code')
plt.xlabel('category_code')
plt.ylabel('Number of closed startups')
plt.show()

#plot number of acquired startups relative to their category_code
df_acquired['category_code'].value_counts().plot(kind='bar')
plt.title('Number of acquired startups relative to their category_code')
plt.xlabel('category_code')
plt.ylabel('Number of acquired startups')
plt.show()
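
The two bar plots above show raw counts, so a category with many startups dominates both. A rough way to compare categories directly is the share of acquired versus closed startups within each one. The sketch below illustrates the idea on a small made-up dataframe (the `toy` data is hypothetical and only mimics the `category_code` and `status` columns); on the real data the same `pd.crosstab` call could be applied to `df` before the columns are encoded.

```python
import pandas as pd

# Hypothetical toy data standing in for the real file; only the two
# columns used in the bar plots are reproduced here.
toy = pd.DataFrame({
    "category_code": ["software", "software", "web", "web", "web", "mobile"],
    "status": ["acquired", "closed", "acquired", "acquired", "closed", "closed"],
})

# Rows are categories, columns are status values, and each row sums to 1
# (the share of acquired and closed startups within that category).
rates = pd.crosstab(toy["category_code"], toy["status"], normalize="index")

# Put the categories with the highest acquired share first
rates = rates.sort_values("acquired", ascending=False)
print(rates)
```

Calling `rates.plot(kind='bar', stacked=True)` would give a single chart that replaces the two separate count plots, if that view turns out to be more useful.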


In [None]:
#Data Preprocessing

#Categorical Variable columns
# state – categorical
# industry_type – categorical
# has_VC – categorical
# has_angel – categorical
# has_roundA – categorical
# has_roundB – categorical
# has_roundC – categorical
# has_roundD – categorical
# is_top500 – categorical


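#Encode categorical (non-numeric) columns only, so numeric columns keep their real values
label_encoders = {}
categorical_columns = df.select_dtypes(exclude='number').columns
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    #Cast to str so missing values become their own 'nan' category
    df[column] = label_encoders[column].fit_transform(df[column].astype(str))

#Fill missing numeric values with the column median so the models can be fit
numeric_columns = df.select_dtypes(include='number').columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())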

#Split data into features and target
X = df.drop('status', axis=1)
X = X.drop('labels', axis=1)
X = X.drop('closed_at', axis=1)
y = df['status']

#print(X.head())

#Split data into training and testing sets (fixed random_state so results are reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Logistic Regression Model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

#Support Vector Machine Model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_scaled, y_train)

#Model Evaluation
#LabelEncoder assigns codes in alphabetical order, so 'acquired' -> 0 and 'closed' -> 1
print("NOTE: 0 = acquired, 1 = closed")
print("Logistic Regression:")
print("Training Accuracy:", model.score(X_train_scaled, y_train))
print("Test Accuracy:", model.score(X_test_scaled, y_test))
print(classification_report(y_test, model.predict(X_test_scaled)))

print("\nSupport Vector Machine:")
print("Training Accuracy:", svm_model.score(X_train_scaled, y_train))
print("Test Accuracy:", svm_model.score(X_test_scaled, y_test))
print(classification_report(y_test, svm_model.predict(X_test_scaled)))

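# Cross-validation
#Scale inside a pipeline so each fold is standardized using only its own training split
from sklearn.pipeline import make_pipeline
logistic_cv_score = cross_val_score(make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5)
svm_cv_score = cross_val_score(make_pipeline(StandardScaler(), SVC(kernel='linear')), X, y, cv=5)
print("\nCross-validation scores:")
print("Logistic Regression:", logistic_cv_score.mean())
print("Support Vector Machine:", svm_cv_score.mean())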

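#Plot selected pairs of features (keeping only the names that exist as columns in df)
pair_vars = ['state', 'industry_type', 'has_VC', 'has_angel', 'has_roundA', 'has_roundB', 'has_roundC', 'has_roundD', 'is_top500']
pair_vars = [column for column in pair_vars if column in df.columns]
sns.pairplot(df, hue='status', vars=pair_vars)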
plt.show()
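
Because the status column is label-encoded before modelling, the classification reports only show the integers 0 and 1, and it is easy to mix up which one means acquired. A minimal sketch of how the mapping could be checked, using a made-up list of status values rather than the real column:

```python
from sklearn.preprocessing import LabelEncoder

# Toy status values; assumed to match the two strings in the real column
status = ["closed", "acquired", "acquired", "closed"]

encoder = LabelEncoder()
codes = encoder.fit_transform(status)

# classes_ is sorted, and the string at position i is encoded as integer i
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

In this notebook the fitted encoder for the target is kept in `label_encoders['status']`, so inspecting `label_encoders['status'].classes_` should give the same information for the real data.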
