# Predictors of Startup Success 

### Introduction

A startup is a new company, sometimes still at the project stage, that seeks to develop an innovation capable of disrupting a sector. While entrepreneurship covers all new businesses, including self-employment and businesses that never intend to register, the term "startup" refers to new businesses that intend to grow and validate a scalable economic model.  

Startups play a major role in the economic growth of many countries. They bring new ideas, spur innovation, create new technologies and are an important source of employment. Over the past two decades, the startup sector has grown exponentially, both in the number of new companies and in the volume of capital looking to invest in them.

Startups are high-risk ventures: they face great uncertainty, and only a minority of them go on to be successful and influential. This inherent risk, combined with the need for large amounts of capital, makes successful investments hard to come by. In fact, Venture Capital funds (the main investors in startups) have, on average, a successful (IPO) investment rate of 1 out of 32. Having worked in the field myself, it is clear to me that investors have limited tools to objectively identify the qualities of a successful startup and its potential to attain "success".  


Before we continue, it is important to define what a successful startup is. We often have in mind stories of "IPO unicorns" like Facebook or Amazon, but startup success can be generalized as the realization of an "exit". Such an event can be initiated by an acquisition (M&A) or by an IPO (Initial Public Offering).

### Project Overview

Startup companies are emerging and evolving faster than ever. Still, many of them fail; in fact, 9 out of 10 startups fail. So what are the characteristics of the startups that succeed? Many factors might play a role: founder and team experience, funding, location, product, market and so on. Why are some startups acquired very early while others are not? Are there factors that attract bigger companies to acquire a given startup? This report investigates these questions using the [Startup Success Prediction](https://www.kaggle.com/datasets/manishkc06/startup-success-prediction) dataset available on Kaggle.  

### Problem Statement

The startup world requires a lot of decision making from the people involved, especially in funding or acquisition situations that result in ownership changes. In these situations, the decision maker should have on hand all the information that can be extracted from the data available today about companies, funding rounds and acquisitions.  

This project proposes a series of supervised learning models that can forecast a startup's success and provide a better understanding of the variables that most impact the acquisition or failure of a startup company. **Different supervised algorithms will be tested in order to compare our results and better understand the variables that most impact the outcome.**

### Performance Metrics

To evaluate each model we will use the True Positive Rate (TPR), the False Positive Rate (FPR) and the Area under the ROC Curve (AUC_ROC). The TPR and FPR will be derived from the confusion matrix of each classifier. Since we expect the dataset to be imbalanced, these metrics will be much more useful than accuracy: we expect the number of companies that are acquired to be much smaller than the number in operation or closed. If we were to use accuracy, our models would probably obtain a high score even while performing poorly at retrieving the few companies that are more likely to be acquired.  
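As a rough illustration of why accuracy can mislead on imbalanced data, the sketch below computes TPR, FPR and accuracy from confusion-matrix counts. The labels are made up (10 positives out of 100) and the helper names `confusion_counts` and `rates` are hypothetical, not part of this notebook's pipeline.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def rates(y_true, y_pred):
    """Return (TPR, FPR, accuracy) derived from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return tpr, fpr, accuracy


# Made-up imbalanced sample: 10 positives among 100 companies
y_true = [1] * 10 + [0] * 90
# A trivial "model" that always predicts the majority (negative) class
y_pred = [0] * 100

tpr, fpr, accuracy = rates(y_true, y_pred)
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  accuracy={accuracy:.2f}")
```

The trivial classifier never finds a positive case, yet its accuracy stays high, which is the failure mode the TPR is meant to expose.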

The ROC curve plots the TPR (y-axis) against the FPR (x-axis) as the classification threshold varies. In general, the AUC is a good metric for binary classification problems.
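One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on made-up scores; `roc_auc` is a hypothetical helper written for illustration, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.2f}")
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a dataset of this size but not how production libraries compute the metric.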

# 1. Dataset Preparation

### Overview  
In this phase, we will import the startup dataset from the [Kaggle Dataset](https://www.kaggle.com/datasets/manishkc06/startup-success-prediction?resource=download). In addition, we will explore and prepare the dataset for further feature analysis.  

During the exploratory phase of this notebook, we will use graphics to explore and get a deeper understanding of the features at hand. Our investigation will focus on feature correlation, distribution, and null values.  

Once our exploration is complete, we will implement strategies to address our observations. Such strategies could include feature engineering, data imputation, label encoding and others.
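To make the idea of data imputation concrete, here is a minimal sketch of median imputation with a missing-value indicator. It runs on a small made-up frame (`toy`), and the choice of the median is an assumption for illustration; the strategy actually applied to the dataset is decided later, based on the exploration.

```python
import pandas as pd

# Made-up values; only the column name mirrors the dataset
toy = pd.DataFrame({"age_first_milestone_year": [1.0, None, 3.0, 5.0]})

# Keep a flag so the information "this value was missing" is not lost
toy["milestone_missing"] = toy["age_first_milestone_year"].isna().astype(int)

# Replace missing values with the median of the observed values
median_age = toy["age_first_milestone_year"].median()
toy["age_first_milestone_year"] = toy["age_first_milestone_year"].fillna(median_age)

print(toy)
```

The median is less sensitive to outliers than the mean, which is why it is often a reasonable first choice for skewed numerical features.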

## 1.1 Load available data from the CSV file

In [1]:
#All imports for this notebook are here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from datetime import datetime
from dateutil import relativedelta

%matplotlib inline

In [3]:
#Start by importing our csv file into a dataframe
df = pd.read_csv('data/startup data.csv')

In [4]:
df.head()

| | Unnamed: 0 | state_code | latitude | longitude | zip_code | id | city | Unnamed: 6 | name | labels | ... | object_id | has_VC | has_angel | has_roundA | has_roundB | has_roundC | has_roundD | avg_participants | is_top500 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1005 | CA | 42.35888 | -71.05682 | 92101 | c:6669 | San Diego | | Bandsintown | 1 | ... | c:6669 | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0 | acquired |
| 1 | 204 | CA | 37.238916 | -121.973718 | 95032 | c:16283 | Los Gatos | | TriCipher | 1 | ... | c:16283 | 1 | 0 | 0 | 1 | 1 | 1 | 4.75 | 1 | acquired |
| 2 | 1001 | CA | 32.901049 | -117.192656 | 92121 | c:65620 | San Diego | San Diego CA 92121 | Plixi | 1 | ... | c:65620 | 0 | 0 | 1 | 0 | 0 | 0 | 4.0 | 1 | acquired |
| 3 | 738 | CA | 37.320309 | -122.05004 | 95014 | c:42668 | Cupertino | Cupertino CA 95014 | Solidcore Systems | 1 | ... | c:42668 | 0 | 0 | 0 | 1 | 1 | 1 | 3.3333 | 1 | acquired |
| 4 | 1002 | CA | 37.779281 | -122.419236 | 94105 | c:65806 | San Francisco | San Francisco CA 94105 | Inhale Digital | 0 | ... | c:65806 | 1 | 1 | 0 | 0 | 0 | 0 | 1.0 | 1 | closed |


## 1.2 Dataset Description - Basic EDA

The dataset supplied through Kaggle did not come with a description of the features. For this reason, it may be necessary to drop features whose meaning or impact cannot be interpreted.

In [6]:
# We will start by getting a rudimentary understanding of the dataset and features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 49 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                923 non-null    int64  
 1   state_code                923 non-null    object 
 2   latitude                  923 non-null    float64
 3   longitude                 923 non-null    float64
 4   zip_code                  923 non-null    object 
 5   id                        923 non-null    object 
 6   city                      923 non-null    object 
 7   Unnamed: 6                430 non-null    object 
 8   name                      923 non-null    object 
 9   labels                    923 non-null    int64  
 10  founded_at                923 non-null    object 
 11  closed_at                 335 non-null    object 
 12  first_funding_at          923 non-null    object 
 13  last_funding_at           923 non-null    object 
 14  age_first_

In [15]:
# Let's create a list of the categorical features
list_categories = df.select_dtypes("object").columns.tolist()
list_categories

['state_code',
 'zip_code',
 'id',
 'city',
 'Unnamed: 6',
 'name',
 'founded_at',
 'closed_at',
 'first_funding_at',
 'last_funding_at',
 'state_code.1',
 'category_code',
 'object_id',
 'status']

In [16]:
# Explore the columns with missing data
print(df.isnull().sum())

Unnamed: 0                    0
state_code                    0
latitude                      0
longitude                     0
zip_code                      0
id                            0
city                          0
Unnamed: 6                  493
name                          0
labels                        0
founded_at                    0
closed_at                   588
first_funding_at              0
last_funding_at               0
age_first_funding_year        0
age_last_funding_year         0
age_first_milestone_year    152
age_last_milestone_year     152
relationships                 0
funding_rounds                0
funding_total_usd             0
milestones                    0
state_code.1                  1
is_CA                         0
is_NY                         0
is_MA                         0
is_TX                         0
is_otherstate                 0
category_code                 0
is_software                   0
is_web                        0
is_mobil

### Summary of the EDA and Basic Description of the columns

The dataframe contains 923 observations and 49 features/columns. Our last column, "status", is the target column and has no null values. Currently, the target feature takes the value "closed" for a failure (0) and "acquired" for a success (1). Later, we will label encode the column to obtain numerical values.  
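The label encoding of the target could look like the sketch below, shown on a made-up frame rather than the real data. The explicit mapping `status_to_label` is an assumption chosen so that "closed" becomes 0 and "acquired" becomes 1.

```python
import pandas as pd

# Made-up rows; only the column name and its two values mirror the dataset
toy = pd.DataFrame({"status": ["acquired", "closed", "acquired"]})

# Explicit mapping keeps control over which class is encoded as 1
status_to_label = {"closed": 0, "acquired": 1}
encoded = toy["status"].map(status_to_label)

# Values missing from the mapping become NaN, so fail early if any appear
if encoded.isna().any():
    raise ValueError("Unexpected value found in the status column")

toy["status_encoded"] = encoded.astype(int)
print(toy)
```

An explicit dictionary is used here because scikit-learn's `LabelEncoder` assigns codes in sorted order of the class names, which would give "acquired" the code 0 and "closed" the code 1.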

There are 14 categorical features and 35 numerical features. Of all the features, only 5 have missing values. The feature with the most missing values is "closed_at", with 588 missing entries, well over half of the data. Next, the column "Unnamed: 6" is missing 493 data points. "age_first_milestone_year" and "age_last_milestone_year" are both missing 152 values, about 16% of the data. Finally, the "state_code.1" column is missing 1 entry.  
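The share of missing data per column can be recomputed from the counts printed by `df.isnull().sum()` and the 923 rows reported by `df.info()`. The sketch below hard-codes those counts so that it runs without the CSV file; on the loaded dataframe, `df.isnull().mean()` should give the same fractions directly.

```python
# Row count and missing-value counts copied from the notebook output above
n_rows = 923
missing_counts = {
    "closed_at": 588,
    "Unnamed: 6": 493,
    "age_first_milestone_year": 152,
    "age_last_milestone_year": 152,
    "state_code.1": 1,
}

# Percentage of rows with a missing value, rounded to one decimal place
missing_share = {
    column: round(100 * count / n_rows, 1)
    for column, count in missing_counts.items()
}

for column, pct in missing_share.items():
    print(f"{column}: {pct}% missing")
```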

The real-world meaning of the following columns is hard to determine:
- "Unnamed: 0", "Unnamed: 6", "avg_participants", "is_top500"  

In addition, the following features seem to be related:
- "state_code", "latitude", "longitude", "zip_code", "city", "state_code.1", "is_CA", "is_NY", "is_MA", "is_TX", "is_otherstate"
- "category_code", "is_software", "is_web", "is_mobile", "is_enterprise", "is_advertising", "is_gamesvideo", "is_ecommerce", "is_biotech", "is_consulting", "is_othercategory"
- "has_VC", "has_angel", "has_roundA", "has_roundB", "has_roundC", "has_roundD"
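Whether two related features carry the same information can be checked directly. The sketch below tests, on a made-up frame, whether a binary flag such as "is_CA" is exactly the indicator of "state_code" being "CA". The helper `is_redundant_flag` is hypothetical, and whether the real columns are fully redundant remains to be verified on the dataset itself.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "state_code": ["CA", "NY", "CA", "WA"],
    "is_CA": [1, 0, 1, 0],
    "is_NY": [0, 1, 0, 0],
})


def is_redundant_flag(frame, flag_col, source_col, value):
    """Return True when flag_col equals the indicator of source_col == value."""
    derived = (frame[source_col] == value).astype(int)
    return bool((derived == frame[flag_col]).all())


print(is_redundant_flag(toy, "is_CA", "state_code", "CA"))
print(is_redundant_flag(toy, "is_NY", "state_code", "NY"))
```

If a flag turns out to be fully determined by another column, keeping both adds no information and one of them is a candidate for removal.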

## 1.3 Visual EDA

In [None]:
#Box plot, Bar Chart, Correlation Matrix