### Startup Success Prediction ###


**About Dataset**

Context:
A startup or start-up is a company or project begun by an entrepreneur to seek, develop, and validate a scalable economic model. While entrepreneurship refers to all new businesses, including self-employment and businesses that never intend to become registered, startups refer to new businesses that intend to grow large beyond the solo founder. 

Startups face high uncertainty and have high rates of failure, but a minority of them do go on to be successful and influential. 

Some startups become unicorns: privately held startup companies valued at over US$1 billion.

**The objective**

Predict whether a startup which is currently operating turns into a success or a failure. The success of a company is defined as the event that gives the company's founders a large sum of money through the process of M&A (Merger and Acquisition) or an IPO (Initial Public Offering). A company would be considered as failed if it had to be shut down.

**About the Data**

The data contains industry trends, investment insights and individual company information. There are 48 columns/features. Some of the features are:

- age_first_funding_year – quantitative
- age_last_funding_year – quantitative
- relationships – quantitative
- funding_rounds – quantitative
- funding_total_usd – quantitative
- milestones – quantitative
- age_first_milestone_year – quantitative
- age_last_milestone_year – quantitative
- state – categorical
- industry_type – categorical
- has_VC – categorical
- has_angel – categorical
- has_roundA – categorical
- has_roundB – categorical
- has_roundC – categorical
- has_roundD – categorical
- avg_participants – quantitative
- is_top500 – categorical
- status (acquired/closed) – categorical (the target variable; a startup that is ‘acquired’ by another organization counts as a success)

The has_VC, has_angel and has_roundA to has_roundD variables appear to be binary indicators (0 or 1) that describe the startup's funding history: whether it has received venture capital or angel funding, and whether it has completed funding rounds at different stages (A, B, C, D). They can be valuable features for predicting the success or failure of a startup, because funding rounds and angel investment are often associated with a company's growth and potential success.
Note: Each funding round represents a different stage of investment, and startups may not follow a strict sequential order (A, B, C, D). The order can vary with the business's specific circumstances and the preferences of investors.
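
One way to check these two points (that the flags are binary, and that they relate to the outcome) is sketched below. The rows are invented for illustration; only the column names come from the dataset, and the same steps would be run on the real DataFrame once it is loaded.

```python
import pandas as pd

# Toy rows, invented for illustration; the real values come from startup.csv.
toy = pd.DataFrame({
    "has_angel":  [1, 1, 0, 0, 1, 0],
    "has_roundA": [0, 1, 1, 1, 0, 0],
    "status": ["acquired", "closed", "acquired", "acquired", "closed", "closed"],
})

flag_columns = ["has_angel", "has_roundA"]

# Confirm that every flag really is a 0/1 indicator.
is_binary = toy[flag_columns].isin([0, 1]).all()

# Share of acquired startups for each value of each flag.
acquired = toy["status"].eq("acquired")
rates = {col: acquired.groupby(toy[col]).mean() for col in flag_columns}

print(is_binary)
for col, rate in rates.items():
    print(f"\nAcquisition rate by {col}:\n{rate}")
```

A large gap between the rate at 0 and the rate at 1 would suggest the flag is informative, although on real data the gap should be judged against the sample size in each group.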

#### 1. Import the necessary libraries

In [901]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

##### 2. Loading the dataset

In [902]:
#Make sure you're aware of the current working directory in your Jupyter notebook.
import os
print(os.getcwd())

/Users/gabrielaarzate/Desktop/predicting_startup_succes/notebook


In [903]:
# Assuming this notebook is in the 'notebook' directory
notebook_directory = '/Users/gabrielaarzate/Desktop/predicting_startup_succes/notebook'

In [904]:
# Construct the full file path from the notebook directory
file_path = os.path.join(notebook_directory, '..', 'data', 'startup.csv')
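
The absolute path above only works on one machine. A possible alternative, sketched here under the assumption that the notebook is started from inside the project's `notebook` folder, is to derive the path from the current working directory with `pathlib`:

```python
from pathlib import Path

# Assumption: the notebook is launched from the project's 'notebook' folder,
# so the working directory stands in for the hardcoded absolute path.
notebook_directory = Path.cwd()

# '..' steps up to the project root; 'data/startup.csv' matches the layout used above.
file_path = (notebook_directory / ".." / "data" / "startup.csv").resolve()

print(file_path, "exists:", file_path.exists())
```

Printing `exists` makes a wrong working directory visible before `pd.read_csv` fails with a less obvious error.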

#### 2. Read the data

In [905]:
# Read the dataset with pd.read_csv, using the ISO-8859-1 encoding.
# A first look at the data follows in the next section.
data = pd.read_csv(file_path, encoding="ISO-8859-1")

##### 3. Explore the data :

In [906]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 49 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                923 non-null    int64  
 1   state_code                923 non-null    object 
 2   latitude                  923 non-null    float64
 3   longitude                 923 non-null    float64
 4   zip_code                  923 non-null    object 
 5   id                        923 non-null    object 
 6   city                      923 non-null    object 
 7   Unnamed: 6                430 non-null    object 
 8   name                      923 non-null    object 
 9   labels                    923 non-null    int64  
 10  founded_at                923 non-null    object 
 11  closed_at                 335 non-null    object 
 12  first_funding_at          923 non-null    object 
 13  last_funding_at           923 non-null    object 
 14  age_first_

#### Target Variable ####

In [907]:
data['status'].value_counts()
# In a binary classification problem like predicting whether a startup
# will be "acquired" or "closed," a balanced dataset has a roughly equal
# number of examples for each class. Here there are 597 examples of "acquired" (about 65%)
# and 326 examples of "closed" (about 35%), so the dataset is moderately imbalanced.

status
acquired    597
closed      326
Name: count, dtype: int64
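
The class shares behind these counts can be computed directly. The sketch below rebuilds the `status` column from the two counts printed above instead of loading the file, and also derives the accuracy of a model that always predicts the majority class, which is a useful floor for later evaluation.

```python
import pandas as pd

# Rebuild the status column from the counts printed above (597 acquired, 326 closed).
status = pd.Series(["acquired"] * 597 + ["closed"] * 326, name="status")

# normalize=True returns proportions instead of raw counts.
class_share = status.value_counts(normalize=True)

# Accuracy of a model that always predicts the majority class.
majority_baseline = class_share.max()

print(class_share.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

On the real DataFrame the same call would be `data['status'].value_counts(normalize=True)`.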

##### 3.3 Analyse missing values 

In [908]:
data.isnull().sum()

Unnamed: 0                    0
state_code                    0
latitude                      0
longitude                     0
zip_code                      0
id                            0
city                          0
Unnamed: 6                  493
name                          0
labels                        0
founded_at                    0
closed_at                   588
first_funding_at              0
last_funding_at               0
age_first_funding_year        0
age_last_funding_year         0
age_first_milestone_year    152
age_last_milestone_year     152
relationships                 0
funding_rounds                0
funding_total_usd             0
milestones                    0
state_code.1                  1
is_CA                         0
is_NY                         0
is_MA                         0
is_TX                         0
is_otherstate                 0
category_code                 0
is_software                   0
is_web                        0
is_mobil

In [909]:
# Assuming 'data' is your DataFrame
null = pd.DataFrame(data.isnull().sum(), columns=["Null Values"])
null["% Missing Values"] = (data.isna().sum() / len(data) * 100)
null = null[null["% Missing Values"] > 0]

In [910]:
styled_null = (
    null.style
    .background_gradient(cmap='viridis', low=0.2, high=0.1)
)

In [911]:
styled_null
#We have 5 columns with missing values 

Unnamed: 0,Null Values,% Missing Values
Unnamed: 6,493,53.412784
closed_at,588,63.705309
age_first_milestone_year,152,16.468039
age_last_milestone_year,152,16.468039
state_code.1,1,0.108342


##### 3.4 Irrelevant features: 



- **Unnamed: 0**: Appears to be an artifact of the data export; it serves as an index or row identifier.
- **id**: An identifier with no predictive value.
- **Unnamed: 6**: Repeats values from other columns (city, state_code, zip_code).
- **closed_at**: Irrelevant for now.
- **state_code.1**: Has the same information as the state_code column.
- **object_id**: Same information as id.
- **is_CA, is_NY, is_MA, is_TX, is_otherstate**: Same information as state_code.
- **is_software, is_web, is_mobile, is_enterprise, is_advertising, is_gamesvideo, is_ecommerce, is_biotech, is_consulting, is_othercategory**: Same information as category_code.

We will drop 21 features that are irrelevant or repeating information.
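
Before dropping a column as redundant, the claim can be checked. The sketch below uses invented rows and a hypothetical helper, `flag_matches_source`, to test whether a flag such as `is_CA` is 1 exactly where `state_code` equals the matching value.

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "state_code": ["CA", "NY", "CA", "WA"],
    "is_CA":      [1, 0, 1, 0],
    "is_NY":      [0, 1, 0, 0],
})

def flag_matches_source(df, flag, source, value):
    """Return True when `flag` is 1 exactly where `source` equals `value`."""
    return bool((df[flag].astype(bool) == df[source].eq(value)).all())

for flag, value in [("is_CA", "CA"), ("is_NY", "NY")]:
    print(flag, "is redundant:", flag_matches_source(toy, flag, "state_code", value))
```

If the check returns False on the real data, the flag carries information that `state_code` does not, and dropping it deserves a second look.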

#### 4 . Data Cleaning ####

4.1 Drop the irrelevant features

In [912]:
data = data.drop(['Unnamed: 0', 'id', 'Unnamed: 6', 'closed_at', 'state_code.1', 'is_CA', 'is_NY', 'is_MA', 'is_TX', 'is_otherstate', 'is_software', 'is_web', 'is_mobile', 'is_enterprise', 'is_advertising', 'is_gamesvideo', 'is_ecommerce', 'is_biotech', 'is_consulting', 'is_othercategory', 'object_id'], axis=1).copy()
data

Unnamed: 0,state_code,latitude,longitude,zip_code,city,name,labels,founded_at,first_funding_at,last_funding_at,...,category_code,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status
0,CA,42.358880,-71.056820,92101,San Diego,Bandsintown,1,1/1/2007,4/1/2009,1/1/2010,...,music,0,1,0,0,0,0,1.0000,0,acquired
1,CA,37.238916,-121.973718,95032,Los Gatos,TriCipher,1,1/1/2000,2/14/2005,12/28/2009,...,enterprise,1,0,0,1,1,1,4.7500,1,acquired
2,CA,32.901049,-117.192656,92121,San Diego,Plixi,1,3/18/2009,3/30/2010,3/30/2010,...,web,0,0,1,0,0,0,4.0000,1,acquired
3,CA,37.320309,-122.050040,95014,Cupertino,Solidcore Systems,1,1/1/2002,2/17/2005,4/25/2007,...,software,0,0,0,1,1,1,3.3333,1,acquired
4,CA,37.779281,-122.419236,94105,San Francisco,Inhale Digital,0,8/1/2010,8/1/2010,4/1/2012,...,games_video,1,1,0,0,0,0,1.0000,1,closed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
918,CA,37.740594,-122.376471,94107,San Francisco,CoTweet,1,1/1/2009,7/9/2009,7/9/2009,...,advertising,0,0,1,0,0,0,6.0000,1,acquired
919,MA,42.504817,-71.195611,1803,Burlington,Reef Point Systems,0,1/1/1998,4/1/2005,3/23/2007,...,security,1,0,0,1,0,0,2.6667,1,closed
920,CA,37.408261,-122.015920,94089,Sunnyvale,Paracor Medical,0,1/1/1999,6/29/2007,6/29/2007,...,biotech,0,0,0,0,0,1,8.0000,1,closed
921,CA,37.556732,-122.288378,94404,San Francisco,Causata,1,1/1/2009,10/5/2009,11/1/2011,...,software,0,0,1,1,0,0,1.0000,1,acquired


In [913]:
num_columns = len(data.columns)
print(f"After dropping irrelevant features The DataFrame has {num_columns} columns.")

After dropping irrelevant features The DataFrame has 28 columns.


4.2 Handle missing values in 2 variables

In [914]:
# Five columns had missing values. Three of them (Unnamed: 6, state_code.1, closed_at)
# were dropped as irrelevant, so only two columns still need imputing:

# 1. age_first_milestone_year
# 2. age_last_milestone_year

# Impute missing values in 'age_first_milestone_year' with 0.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about.
data['age_first_milestone_year'] = data['age_first_milestone_year'].fillna(0)

# Impute missing values in 'age_last_milestone_year' with 0
data['age_last_milestone_year'] = data['age_last_milestone_year'].fillna(0)
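
Filling with 0 makes a missing milestone age look the same as a milestone reached at founding (age 0). One possible refinement, sketched on invented rows, is to record which rows were missing before filling. The column name `has_milestone_age` is hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "milestones": [3, 0, 1],
    "age_first_milestone_year": [4.6685, np.nan, 0.0],
})

# Record which rows had a value before filling, so that information is not lost.
# 'has_milestone_age' is a hypothetical column name.
toy["has_milestone_age"] = toy["age_first_milestone_year"].notna().astype(int)

# Fill the gaps with 0, as in the cell above.
toy["age_first_milestone_year"] = toy["age_first_milestone_year"].fillna(0)

print(toy)
```

Whether the extra indicator helps depends on the model; it is cheap to add and easy to drop later.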

In [915]:
data.isnull().sum()

state_code                  0
latitude                    0
longitude                   0
zip_code                    0
city                        0
name                        0
labels                      0
founded_at                  0
first_funding_at            0
last_funding_at             0
age_first_funding_year      0
age_last_funding_year       0
age_first_milestone_year    0
age_last_milestone_year     0
relationships               0
funding_rounds              0
funding_total_usd           0
milestones                  0
category_code               0
has_VC                      0
has_angel                   0
has_roundA                  0
has_roundB                  0
has_roundC                  0
has_roundD                  0
avg_participants            0
is_top500                   0
status                      0
dtype: int64

4.3 Analyzing Categorical Data to convert to Numerical Data 

In [916]:
numerical_features = data.select_dtypes(include=['number']).columns.tolist()
categorical_features = data.select_dtypes(include=['object']).columns.tolist()

# Assuming the target variable is 'status'
target_variable = ['status']

# Print the lists along with the number of features
print("Numerical Features ({0}):".format(len(numerical_features)))
print(numerical_features)

print("\nCategorical Features ({0}):".format(len(categorical_features)))
print(categorical_features)

print("\nTarget Variable ({0}):".format(len(target_variable)))
print(target_variable)

Numerical Features (19):
['latitude', 'longitude', 'labels', 'age_first_funding_year', 'age_last_funding_year', 'age_first_milestone_year', 'age_last_milestone_year', 'relationships', 'funding_rounds', 'funding_total_usd', 'milestones', 'has_VC', 'has_angel', 'has_roundA', 'has_roundB', 'has_roundC', 'has_roundD', 'avg_participants', 'is_top500']

Categorical Features (9):
['state_code', 'zip_code', 'city', 'name', 'founded_at', 'first_funding_at', 'last_funding_at', 'category_code', 'status']

Target Variable (1):
['status']
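
In the output above, `status` is listed among the categorical features as well as being the target. A small variation, sketched on an invented frame, removes the target from the feature list so it is not encoded as an input later.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the cleaned data.
toy = pd.DataFrame({
    "state_code": ["CA", "MA"],
    "funding_rounds": [3, 1],
    "status": ["acquired", "closed"],
})

target_variable = "status"

# Non-numeric columns, minus the target so it is not treated as a feature.
categorical_features = (
    toy.select_dtypes(exclude=["number"]).columns.drop(target_variable).tolist()
)
numerical_features = toy.select_dtypes(include=["number"]).columns.tolist()

print("Categorical:", categorical_features)
print("Numerical:", numerical_features)
```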


In [917]:
#1. Is the categorical data nominal or ordinal?

# Assuming your_data_df is your main DataFrame
your_data_df = data

# List of your categorical columns
categorical_columns = ['state_code', 'city', 'name', 'category_code']

# Loop through each categorical column and display a subset of unique values
for categorical_column in categorical_columns:
    unique_values = your_data_df[categorical_column].unique()
    
    print(f"\nColumn: {categorical_column}")
    print("Unique Values (Subset):")
    print(unique_values[:10]) 


Column: state_code
Unique Values (Subset):
['CA' 'MA' 'KY' 'NY' 'CO' 'VA' 'TX' 'WA' 'IL' 'NC']

Column: city
Unique Values (Subset):
['San Diego' 'Los Gatos' 'Cupertino' 'San Francisco' 'Mountain View'
 'San Rafael' 'Williamstown' 'Palo Alto' 'Menlo Park' 'Louisville']

Column: name
Unique Values (Subset):
['Bandsintown' 'TriCipher' 'Plixi' 'Solidcore Systems' 'Inhale Digital'
 'Matisse Networks' 'RingCube Technologies' 'ClairMail' 'VoodooVox'
 'Doostang']

Column: category_code
Unique Values (Subset):
['music' 'enterprise' 'web' 'software' 'games_video' 'network_hosting'
 'finance' 'mobile' 'education' 'public_relations']


In [918]:
data_with_dummies = pd.get_dummies(data, drop_first=True)
data_with_dummies.head()

Unnamed: 0,latitude,longitude,labels,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,...,category_code_search,category_code_security,category_code_semiconductor,category_code_social,category_code_software,category_code_sports,category_code_transportation,category_code_travel,category_code_web,status_closed
0,42.35888,-71.05682,1,2.2493,3.0027,4.6685,6.7041,3,3,375000,...,False,False,False,False,False,False,False,False,False,False
1,37.238916,-121.973718,1,5.126,9.9973,7.0055,7.0055,9,4,40100000,...,False,False,False,False,False,False,False,False,False,False
2,32.901049,-117.192656,1,1.0329,1.0329,1.4575,2.2055,5,1,2600000,...,False,False,False,False,False,False,False,False,True,False
3,37.320309,-122.05004,1,3.1315,5.3151,6.0027,6.0027,5,3,40000000,...,False,False,False,False,True,False,False,False,False,False
4,37.779281,-122.419236,0,0.0,1.6685,0.0384,0.0384,2,2,1300000,...,False,False,False,False,False,False,False,False,False,True


In [919]:
data_with_dummies.columns.values

array(['latitude', 'longitude', 'labels', ..., 'category_code_travel',
       'category_code_web', 'status_closed'], dtype=object)

In [920]:
#2. Is the numerical data nominal or ordinal?
# Assuming your_data_df is your main DataFrame
your_data_df = data

# List of your numerical columns
numerical_columns = ['latitude', 'longitude', 'zip_code', 'labels','relationships',
                     'funding_rounds','funding_total_usd','milestones',
                     'has_VC','has_angel','has_roundA','has_roundB',
                     'has_roundC','has_roundD','avg_participants',
                     'is_top500']

# Loop through each numerical column and display a subset of unique values.
# The loop variable has its own name so it does not overwrite the list.
for numerical_column in numerical_columns:
    unique_values = your_data_df[numerical_column].unique()

    print(f"\nColumn: {numerical_column}")
    print("Unique Values (Subset):")
    print(unique_values[:10])


Column: latitude
Unique Values (Subset):
[42.35888   37.238916  32.901049  37.320309  37.779281  37.406914
 37.3915589 38.057107  42.712207  37.427235 ]

Column: longitude
Unique Values (Subset):
[ -71.05682   -121.973718  -117.192656  -122.05004   -122.419236
 -122.09037   -122.0702643 -122.513742   -73.203599  -122.145783 ]

Column: zip_code
Unique Values (Subset):
['92101' '95032' '92121' '95014' '94105' '94043' '94041' '94901' '1267'
 '94306']

Column: labels
Unique Values (Subset):
[1 0]

Column: relationships
Unique Values (Subset):
[ 3  9  5  2  6 25 13 14 22  8]

Column: funding_rounds
Unique Values (Subset):
[ 3  4  1  2  5  7  6 10  8]

Column: funding_total_usd
Unique Values (Subset):
[  375000 40100000  2600000 40000000  1300000  7500000 26000000 34100000
  9650000  5750000]

Column: milestones
Unique Values (Subset):
[3 1 2 4 0 5 6 8]

Column: has_VC
Unique Values (Subset):
[0 1]

Column: has_angel
Unique Values (Subset):
[1 0]

Column: has_roundA
Unique Values (Subset):


In [921]:
#3. Is the date data nominal or ordinal?
# Assuming your_data_df is your main DataFrame
your_data_df = data

# List of your date-related columns
date_columns = ['founded_at','first_funding_at',
                     'last_funding_at','age_first_funding_year','age_last_funding_year','age_first_milestone_year',
                     'age_last_milestone_year']

# Loop through each date-related column and display a subset of unique values.
# The loop variable has its own name so it does not overwrite the list.
for date_column in date_columns:
    unique_values = your_data_df[date_column].unique()

    print(f"\nColumn: {date_column}")
    print("Unique Values (Subset):")
    print(unique_values[:10])


Column: founded_at
Unique Values (Subset):
['1/1/2007' '1/1/2000' '3/18/2009' '1/1/2002' '8/1/2010' '1/1/2005'
 '1/1/2004' '6/1/2005' '11/15/2000' '1/1/2006']

Column: first_funding_at
Unique Values (Subset):
['4/1/2009' '2/14/2005' '3/30/2010' '2/17/2005' '8/1/2010' '7/18/2006'
 '9/21/2006' '8/24/2005' '8/2/2005' '2/1/2007']

Column: last_funding_at
Unique Values (Subset):
['1/1/2010' '12/28/2009' '3/30/2010' '4/25/2007' '4/1/2012' '7/18/2006'
 '3/18/2010' '10/4/2010' '2/8/2013' '2/5/2010']

Column: age_first_funding_year
Unique Values (Subset):
[2.2493 5.126  1.0329 3.1315 0.     4.5452 1.7205 1.6466 3.5863 1.6712]

Column: age_last_funding_year
Unique Values (Subset):
[ 3.0027  9.9973  1.0329  5.3151  1.6685  4.5452  5.211   6.7616 11.1123
  4.6849]

Column: age_first_milestone_year
Unique Values (Subset):
[4.6685 7.0055 1.4575 6.0027 0.0384 5.0027 3.     5.6055 8.0055 2.9178]

Column: age_last_milestone_year
Unique Values (Subset):
[6.7041 7.0055 2.2055 6.0027 0.0384 5.0027 6.6082


Based on the subset of unique values displayed for each categorical column, the data is mostly nominal rather than ordinal. A brief analysis of each column:
- **state_code:** Represents state codes. No clear order, likely nominal.
- **city:** Represents city names. No inherent order, likely nominal.
- **name:** Represents company names. No order, likely nominal.
- **founded_at, first_funding_at, last_funding_at:** Represent dates. Dates have a natural order, so they are not truly nominal; they are stored as strings here and would need to be parsed before that order can be used.
- **category_code:** Represents categories. No inherent order, likely nominal.
- **status:** Represents the status of the company (acquired or closed). In this context, 'acquired' and 'closed' are nominal categories rather than ordinal ones.

Nominal data (named categories, no order): for example gender (m/f) or group membership. Use one-hot encoding.
Ordinal data (clear order): for example rankings 1, 2, 3 or education level (middle school, high school). Use label (ordinal) encoding.

In our analysis, the majority of the categorical data was nominal. Consequently, we opted for one-hot encoding to represent each category as a binary column. This method keeps the categories independent and avoids introducing unintended ordinal relationships.
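
The difference between the two encodings can be sketched on invented rows. `category_code` is a real column name; `funding_stage` is a hypothetical ordinal column used only to show the contrast, since the dataset has no such column.

```python
import pandas as pd

# Toy rows, invented for illustration. 'funding_stage' is hypothetical.
toy = pd.DataFrame({
    "category_code": ["music", "web", "software", "web"],
    "funding_stage": ["seed", "series_a", "seed", "series_b"],
})

# Nominal: one binary column per category, no order implied.
one_hot = pd.get_dummies(toy["category_code"], prefix="category_code")

# Ordinal: integer codes that follow an explicitly declared order.
stage_order = ["seed", "series_a", "series_b"]
stage_codes = pd.Categorical(
    toy["funding_stage"], categories=stage_order, ordered=True
).codes.tolist()

print(one_hot.columns.tolist())
print(stage_codes)
```

Declaring the order explicitly matters: without `categories=stage_order` the codes would follow alphabetical order, which only matches the intended order by coincidence.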

Update: I have a problem: one-hot encoding creates a binary feature for every category, which leads to a large number
of new features, so I need to find another way to convert the categorical data to numerical.
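
A few common ways to keep the number of one-hot columns down are sketched below on invented rows: drop identifier-like columns, turn date strings into a numeric part, and pool rare categories into a single 'other' level. The threshold `min_count` and the derived column `founded_year` are assumptions for illustration, not choices made elsewhere in this notebook.

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "category_code": ["web", "web", "software", "music", "travel"],
    "founded_at": ["1/1/2007", "1/1/2000", "3/18/2009", "1/1/2002", "8/1/2010"],
    "funding_rounds": [3, 4, 1, 3, 2],
})

# 1. Identifiers such as 'name' are unique per row, so encoding them adds no signal.
reduced = toy.drop(columns=["name"])

# 2. Parse dates and keep a numeric part instead of one dummy per distinct date.
reduced["founded_year"] = pd.to_datetime(
    reduced["founded_at"], format="%m/%d/%Y"
).dt.year
reduced = reduced.drop(columns=["founded_at"])

# 3. Pool categories seen fewer than `min_count` times into 'other'.
min_count = 2  # assumed threshold; tune it on the real data
counts = reduced["category_code"].value_counts()
rare = counts[counts < min_count].index
reduced["category_code"] = reduced["category_code"].where(
    ~reduced["category_code"].isin(rare), "other"
)

encoded = pd.get_dummies(reduced, columns=["category_code"])
print(encoded.columns.tolist())
```

Passing `columns=[...]` to `pd.get_dummies` also limits encoding to the listed columns, rather than every string column in the frame.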

#### Detecting outliers
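
This section is still empty. As a placeholder, one widely used approach is Tukey's interquartile-range rule, sketched here on invented funding amounts. It is an assumption that this rule suits the data; skewed variables such as `funding_total_usd` are often log-transformed first.

```python
import pandas as pd

# Toy values, invented for illustration; the last one is deliberately extreme.
funding = pd.Series(
    [375000, 2600000, 1300000, 7500000, 5750000, 500000000],
    name="funding_total_usd",
)

q1, q3 = funding.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = funding[(funding < lower) | (funding > upper)]

print(f"Bounds: [{lower:,.0f}, {upper:,.0f}]")
print(outliers)
```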

#### 2 .  Data Analysis ####

Now that our data is cleaned, we will explore our data with descriptive and graphical statistics to describe and summarize our variables. In this stage, you will find yourself classifying features and determining their correlation with the target variable and each other.

In [922]:
# # Now use 'encoded_data' in the correlation calculation
# data_encoded = data_encoded
# column_names = data_encoded.columns.tolist()
# print(column_names)

In [923]:

# # Now use 'encoded_data' in the correlation calculation
# data_encoded = data_encoded
# column_names = data_encoded.columns.tolist()

# # Identify the column name representing 'status' after one-hot encoding
# status_column_name = [col for col in column_names if 'status' in col.lower()][0]


# # Extracting feature matrix
# feature_matrix = your_data_df[column_names]

# def calculate_correlations(df: pd.DataFrame, target: pd.Series) -> pd.DataFrame:
#     """
#     Calculates the correlations of all columns with regards to the target and returns a DataFrame with all column names
#     and their correlation coefficient with regards to the target.

#     :param df: Pandas DataFrame
#     :param target: Pandas Series with target values
#     :return: Pandas DataFrame with column names and their correlation coefficient with regards to the target
#     """
#     # Calculate correlations
#     correlations = df.corrwith(target)
    
#     # Create a DataFrame with column names and correlation coefficients
#     correlation_df = pd.DataFrame(correlations, columns=['correlation_coefficient'])
    
#     return correlation_df

# # Now you can calculate correlations using the identified 'status' column name
# xcorrelations = calculate_correlations(feature_matrix, encoded_data[status_column_name])

# # Display the correlations
# print(xcorrelations)
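
A self-contained version of the idea in the commented-out cell is sketched below on invented rows. It keeps the `calculate_correlations` name but encodes the target directly from `status` instead of relying on `data_encoded` or `encoded_data`, which are not defined in this notebook.

```python
import pandas as pd

def calculate_correlations(df: pd.DataFrame, target: pd.Series) -> pd.DataFrame:
    """Return each numeric column's Pearson correlation with `target`."""
    correlations = df.select_dtypes("number").corrwith(target)
    return correlations.to_frame("correlation_coefficient")

# Toy rows, invented for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "relationships": [3, 9, 5, 5, 2],
    "milestones":    [3, 1, 2, 1, 1],
    "status": ["acquired", "acquired", "acquired", "acquired", "closed"],
})

# Encode the target as 1 = acquired, 0 = closed before correlating.
target = toy["status"].eq("acquired").astype(int)
features = toy.drop(columns=["status"])

result = calculate_correlations(features, target).sort_values(
    "correlation_coefficient", ascending=False
)
print(result)
```

Pearson correlation with a 0/1 target only captures linear association, so a low value does not rule out a feature being useful to a non-linear model.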

#### 3. Preprocessing

Transforming status (the target variable) from categorical to numerical values

2.Detect and address outliers.

3.Remove or impute incorrect or inconsistent data.
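
For the target transformation, an explicit mapping is one simple option, sketched here on invented rows. In the rows displayed earlier, the existing `labels` column appears to hold the same 1/0 encoding. That is an observation from a handful of rows, not a verified fact, and if it holds on the full data then `labels` should be excluded from the inputs to avoid leaking the target.

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "labels": [1, 1, 0],
    "status": ["acquired", "acquired", "closed"],
})

# Explicit mapping: 1 = acquired (success), 0 = closed (failure).
status_map = {"acquired": 1, "closed": 0}
toy["status_encoded"] = toy["status"].map(status_map)

# Check whether 'labels' duplicates the encoded target.
labels_match = bool(toy["labels"].eq(toy["status_encoded"]).all())
print("labels duplicates the target:", labels_match)
```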

Statistical Summary

#### Define the inputs and the target

#### 5- Data Segmentation

#### 6- Model Training

#### 7- Model Testing

#### 8-  Model Evaluation:
#### Confusion Matrix and Analysis:

#### 10 - Print a classification report
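
The remaining steps are outlined but not yet implemented. A minimal sketch of how they could fit together with scikit-learn follows. The data is synthetic, and logistic regression is an assumption chosen for brevity; the notebook does not name a model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix X and encoded target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
print(classification_report(y_test, y_pred, target_names=["closed", "acquired"]))
```

With roughly 65% of startups acquired, per-class precision and recall from the report are more informative than accuracy alone.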