In [1]:
# Importing libraries used for Data manipulation and analysis
import pandas as pd
import numpy as np

# Import necessary modules from the scikit-learn library for machine learning tasks.
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import precision_score, confusion_matrix, f1_score
from xgboost import XGBClassifier, XGBRFClassifier

# Import necessary module for saving model
import joblib

# Importing the dataset into a dataframe

In [2]:
startup_data = pd.read_csv("./data/Cleaned_startup_data.csv")

# Making a copy of the dataset
startup_data_copy = startup_data.copy()

# Data Preprocessing
Now that we have imported our cleaned dataset, let's take a brief look at it.

In [3]:
startup_data_copy.head()

Unnamed: 0,state_code,latitude,longitude,zip_code,city,founded_at,first_funding_at,last_funding_at,age_first_funding_year,age_last_funding_year,...,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status,age_first_milestone_year_filled,age_last_milestone_year_filled
0,CA,42.35888,-71.05682,92101,San Diego,2007-01-01,2009-04-01,2010-01-01,2.2493,3.0027,...,1,0,0,0,0,1.0,0,acquired,0,0
1,CA,37.238916,-121.973718,95032,Los Gatos,2000-01-01,2005-02-14,2009-12-28,5.126,9.9973,...,0,0,1,1,1,4.75,1,acquired,0,0
2,CA,32.901049,-117.192656,92121,San Diego,2009-03-18,2010-03-30,2010-03-30,1.0329,1.0329,...,0,1,0,0,0,4.0,1,acquired,0,0
3,CA,37.320309,-122.05004,95014,Cupertino,2002-01-01,2005-02-17,2007-04-25,3.1315,5.3151,...,0,0,1,1,1,3.3333,1,acquired,0,0
4,CA,37.779281,-122.419236,94105,San Francisco,2010-08-01,2010-08-01,2012-04-01,0.0,1.6685,...,1,0,0,0,0,1.0,1,closed,0,0


Now that everything looks nice, let's proceed to the data preprocessing phase. During this stage, we will transform the data into a format that can be fully utilized by our model.

**Step 1: Handling Categorical and Datetime Columns**<br>
We'll begin by splitting each datetime column into `year`, `month`, and `day` components. Then we'll convert the target variable into a dummy variable and do the same for the remaining categorical columns.

In [4]:
# Converting founded_at, first_funding_at and last_funding_at to the datetime dtype
datetime_columns = ["founded_at", "first_funding_at", "last_funding_at"]
for column in datetime_columns:
  startup_data_copy[column] = pd.to_datetime(startup_data_copy[column])

for i in datetime_columns:
  startup_data_copy[f"{i}_year"] = startup_data_copy[i].dt.year
  startup_data_copy[f"{i}_month"] = startup_data_copy[i].dt.month
  startup_data_copy[f"{i}_day"] = startup_data_copy[i].dt.day
  startup_data_copy = startup_data_copy.drop(columns=i)

In [5]:
# converting target variable to a dummy variable
startup_data_copy.status = startup_data_copy.status.apply(lambda x: 1 if x=="acquired" else 0)

# converting the other categorical variables to dummy variables (every object column except the last one)
categorical_columns = startup_data_copy.select_dtypes("object").columns[:-1]
for column in categorical_columns:
  startup_data_copy = startup_data_copy.join(pd.get_dummies(startup_data_copy[column], prefix=column)).drop(columns=column)
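
The dummy-variable conversion can be illustrated on a small example. This is a minimal sketch: the `toy` frame and its values are made up for illustration, and `pd.get_dummies` is called once on the listed columns instead of inside a loop.

```python
import pandas as pd

# Hypothetical toy data, not taken from the startup dataset
toy = pd.DataFrame({
    "state_code": ["CA", "NY", "CA"],
    "city": ["San Diego", "New York", "Cupertino"],
    "funding_rounds": [3, 1, 2],
})

# One-hot encode both categorical columns in a single call;
# each new column is named <original column>_<category value>
encoded = pd.get_dummies(toy, columns=["state_code", "city"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, which is why a column such as `city` with many distinct values adds a large number of columns to the dataframe.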

Let's now examine our dataset to verify whether all the changes have been implemented correctly.

In [6]:
startup_data_copy.head()

Unnamed: 0,latitude,longitude,zip_code,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,...,city_Westford,city_Weston,city_Westport,city_Williamstown,city_Wilmington,city_Woburn,city_Woodbury,city_Yardley,city_Yorba Linda,city_Zeeland
0,42.35888,-71.05682,92101,2.2493,3.0027,4.6685,6.7041,3,3,375000,...,0,0,0,0,0,0,0,0,0,0
1,37.238916,-121.973718,95032,5.126,9.9973,7.0055,7.0055,9,4,40100000,...,0,0,0,0,0,0,0,0,0,0
2,32.901049,-117.192656,92121,1.0329,1.0329,1.4575,2.2055,5,1,2600000,...,0,0,0,0,0,0,0,0,0,0
3,37.320309,-122.05004,95014,3.1315,5.3151,6.0027,6.0027,5,3,40000000,...,0,0,0,0,0,0,0,0,0,0
4,37.779281,-122.419236,94105,0.0,1.6685,0.0384,0.0384,2,2,1300000,...,0,0,0,0,0,0,0,0,0,0


Now that all our changes have been implemented, let's move on to the next section.

**Step 2: Scaling of Numerical Columns**<br>
Scaling is a crucial data transformation technique used to ensure that numerical features are on the same scale, preventing some features from dominating others during machine learning model training.<br>
We will use min-max scaling (`MinMaxScaler`), which rescales each numerical feature to the [0, 1] range.
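
As a minimal sketch of what min-max scaling does, the snippet below applies the formula by hand with NumPy. The values in `x` are made up for illustration; `MinMaxScaler` with its default `feature_range=(0, 1)` applies the same transformation column by column.

```python
import numpy as np

# Hypothetical feature values, for illustration only
x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max scaling: x' = (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.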

In [7]:
def column_standarizer():
  # Flag every column whose maximum exceeds 1, i.e. columns not already in the [0, 1] range
  c = pd.DataFrame(startup_data_copy.max() > 1).reset_index()
  c.columns = ['Column', 'Boolval']
  columns_to_be_normalized = c.query("Boolval == True").Column.values
  SS_Norm = preprocessing.MinMaxScaler()
  # Leave the column at position 9 of that list unscaled
  columns_to_be_normalized = [j for i in pd.DataFrame(columns_to_be_normalized).drop(9).values for j in i]

  # Rescale each remaining column to [0, 1] independently
  for i in columns_to_be_normalized:
    val = np.array(startup_data_copy[i].values).reshape(-1, 1)
    startup_data_copy[i] = SS_Norm.fit_transform(val)
  return startup_data_copy

In [8]:
startup_data_copy = startup_data_copy.drop(columns='category_code').astype(float)
startup_data_copy = column_standarizer()
startup_data_copy.funding_total_usd = startup_data_copy.funding_total_usd.apply(np.log10)

Let's check if the changes have been implemented.

In [9]:
startup_data_copy.head()

Unnamed: 0,latitude,longitude,zip_code,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,...,city_Westford,city_Weston,city_Westport,city_Williamstown,city_Wilmington,city_Woburn,city_Woodbury,city_Yardley,city_Yorba Linda,city_Zeeland
0,0.494494,0.367152,0.096658,0.365061,0.389409,0.484841,0.432611,0.047619,0.222222,5.574031,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.342036,0.005562,0.099777,0.45803,0.615461,0.544988,0.442121,0.142857,0.333333,7.603144,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.212867,0.039515,0.096679,0.325749,0.325749,0.4022,0.290656,0.079365,0.0,6.414973,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.34446,0.00502,0.099757,0.393572,0.464142,0.519179,0.410478,0.079365,0.222222,7.60206,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.358127,0.002398,0.09879,0.292368,0.346291,0.365677,0.222272,0.031746,0.111111,6.113943,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Step 3: Data Splitting**<br>
Great! The changes have been successfully implemented, and now it's time to split the dataset into two distinct sets: the training dataset and the testing dataset. Data splitting is a crucial step in machine learning as it allows us to assess the model's performance on unseen data.<br>
In the next steps, we'll split the dataset into these two subsets, ensuring that both the training and testing data are representative and properly randomized to avoid bias in our model evaluation.
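
One way to keep both subsets representative is a stratified split. The sketch below is an illustration on made-up labels, not the split used in this notebook: it passes `stratify` to scikit-learn's `train_test_split` so that the class ratio is preserved in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 positives, 30 negatives
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positives in the training and testing labels
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the split is purely random, so the class ratio in a small test set can drift away from the ratio in the full dataset.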


In [10]:
X = startup_data_copy.drop(columns=["status"])
y = startup_data_copy.status.values
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,
                                                                    y,
                                                                    test_size=0.2,
                                                                    random_state=42
                                                                   )

**Step 4: Model Selection and Training**<br>
XGBoost was selected in part for its support for imbalanced classes: it allows the loss to be weighted by class (for example through the `scale_pos_weight` parameter) and supports row subsampling during tree construction (the `subsample` parameter), which can make it less prone to bias toward the majority class.
Model training is an iterative process.
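
One class-weighting option XGBoost exposes is the `scale_pos_weight` parameter, for which a commonly suggested starting value is the ratio of negative to positive samples. The sketch below computes that ratio on made-up labels; it is an illustration only, since the models in this notebook do not set the parameter.

```python
import numpy as np

# Hypothetical binary labels: 1 = acquired, 0 = closed
y_toy = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

n_pos = int((y_toy == 1).sum())
n_neg = int((y_toy == 0).sum())

# Commonly suggested starting value: negatives / positives
scale_pos_weight = n_neg / n_pos

# This value could then be passed as XGBClassifier(scale_pos_weight=scale_pos_weight)
print(n_pos, n_neg, round(scale_pos_weight, 3))
```

A value below 1 down-weights the positive class, which is what one would expect when positives are the majority, as in this toy example.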

Let's first create a function to make the model-building process less messy.

In [11]:
def model_builder(model, X_train, X_test, y_train, y_test):
    mod = model
    mod.fit(X_train, y_train)
    pred = mod.predict(X_test)
    print("f1", f1_score(y_test, pred), "precision", precision_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    return mod

Now that that is out of the way, let's go straight to model building.

In [12]:
model_1 = model_builder(XGBRFClassifier(),
                        X_train, X_test, y_train, y_test)
model_2 = model_builder(XGBClassifier(),
                        X_train, X_test, y_train, y_test)

f1 0.8614232209737829 precision 0.8333333333333334
[[ 33  23]
 [ 14 115]]
f1 0.8913857677902621 precision 0.8623188405797102
[[ 37  19]
 [ 10 119]]


The performance of the two models can be summarized as follows:

**Model 1:**
- F1 Score: 0.8614
- Precision: 0.8333
- Confusion Matrix:
```
[[ 33  23]
 [ 14 115]]
 ```

**Model 2:**
- F1 Score: 0.8914
- Precision: 0.8623
- Confusion Matrix:
```
[[ 37  19]
 [ 10 119]]
```

In the context of binary classification, these metrics provide insights into how well each model is performing:

- **F1 Score**: It is a balanced metric that considers both precision and recall. A higher F1 score indicates a better balance between correctly identifying positive instances (sensitivity) and minimizing false positives (precision).

- **Precision**: Precision measures the accuracy of positive predictions made by the model. A higher precision score means that the model is better at making positive predictions, with fewer false positives.

- **Confusion Matrix**: The confusion matrix provides a detailed breakdown of the model's predictions. It includes the number of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). This breakdown helps in understanding the model's performance in different aspects.
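
To make the link between these metrics concrete, the sketch below recomputes precision, recall, and F1 from the confusion matrix reported for Model 2. It assumes scikit-learn's layout for `confusion_matrix`, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix reported for Model 2, in scikit-learn's layout:
# rows are true labels, columns are predicted labels
cm = [[37, 19],
      [10, 119]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The precision and F1 values obtained this way agree with the scores printed by `model_builder` for Model 2.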

Based on the provided metrics and confusion matrices:

- Model 2 has a slightly higher F1 score and precision compared to Model 1.
- Model 2 has more true positives (119 vs. 115) and true negatives (37 vs. 33) than Model 1, and correspondingly fewer false positives (19 vs. 23) and false negatives (10 vs. 14).

In conclusion, Model 2 appears to perform slightly better overall, achieving a higher F1 score and precision, which indicates a better balance between correctly classifying positive instances and minimizing false positives. However, both models still have areas that can be improved.

**Step 5: Feature Engineering**

Feature engineering plays a critical role in enhancing model performance, particularly when dealing with imbalanced datasets where the minority class data is limited. By creating new columns, transforming existing ones, or removing uninformative ones, we can help the classifier learn more effectively. This can improve the model's accuracy and reduce the bias associated with imbalanced datasets.

To begin, let's address the issue of redundant columns that could potentially hinder the performance of our model.

In [13]:
# Creating a dataframe of the features and their importance to the model
feature_importance_df = pd.DataFrame({'Feature': X.columns,
                                      'Importance': model_2.feature_importances_})

# Sort the DataFrame by importance scores in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the feature importance table
print(feature_importance_df)

               Feature  Importance
15       is_otherstate    0.092631
7        relationships    0.086381
33           is_top500    0.060874
187   city_Los Angeles    0.057507
31          has_roundD    0.049015
..                 ...         ...
128       city_Chicago    0.000000
129    city_Cincinnati    0.000000
130     city_Cleveland    0.000000
131  city_College Park    0.000000
300       city_Zeeland    0.000000

[301 rows x 2 columns]


In [14]:
# Removing the columns of little importance to our model
X_mod = X[feature_importance_df.query("Importance>0.005").Feature]

# Splitting the reduced feature set (same test size and random seed as before)
X_train_mod, X_test_mod, y_train_mod, y_test_mod = model_selection.train_test_split(X_mod, y,
                                                                    test_size=0.2,
                                                                    random_state=42
                                                                   )


model_3 = model_builder(XGBRFClassifier(), X_train_mod, X_test_mod, y_train_mod, y_test_mod)

model_4 = model_builder(XGBClassifier(),
                  X_train_mod, X_test_mod, y_train_mod, y_test_mod)

f1 0.8646616541353384 precision 0.8394160583941606
[[ 34  22]
 [ 14 115]]
f1 0.8796992481203006 precision 0.8540145985401459
[[ 36  20]
 [ 12 117]]


The performance of the two models can be summarized as follows:

**Model 3:**
- F1 Score: 0.8647
- Precision: 0.8394
- Confusion Matrix:
```
[[ 34  22]
 [ 14 115]]
 ```

**Model 4:**
- F1 Score: 0.8797
- Precision: 0.8540
- Confusion Matrix:
```
[[ 36  20]
 [ 12 117]]
```

Based on the provided metrics and confusion matrices:

- Model 4 has a slightly higher F1 score and precision compared to Model 3.
- Model 4 has more true positives (117 vs. 115) and true negatives (36 vs. 34) than Model 3, and correspondingly fewer false positives (20 vs. 22) and false negatives (12 vs. 14).

In conclusion, Model 4 appears to perform slightly better than Model 3, achieving a higher F1 score and precision, which indicates a better balance between correctly classifying positive instances and minimizing false positives. <br>
Compared with their full-feature counterparts, removing low-importance features slightly helped the random forest variant (Model 3 vs. Model 1: F1 of 0.8647 vs. 0.8614) but hurt the gradient-boosted variant (Model 4 vs. Model 2: F1 of 0.8797 vs. 0.8914). Both Model 3 and Model 4 still perform below Model 2.

**Step 6: Hyperparameter tuning**

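Hyperparameters are settings chosen before training, such as the number of trees (`n_estimators`), the maximum tree depth (`max_depth`), and the learning rate (`learning_rate`). By fine-tuning them, we aim to improve the model's performance and predictive accuracy. This process enables us to find the model settings that work best for a specific dataset and problem, leading to more accurate and efficient machine learning models.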
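
One systematic way to search over such settings is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data, with `GradientBoostingClassifier` standing in for `XGBClassifier`. The grid values are illustrative assumptions, not the settings explored in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the startup dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

# Illustrative grid; a real search would usually cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Evaluate every combination with 3-fold cross-validation, scored by F1
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Because each candidate is scored on held-out folds of the training data, the test set stays untouched until the final evaluation.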

In [15]:
model_5 = model_builder(XGBClassifier(n_estimators=2500, max_depth=3, learning_rate=0.01, sampling_method='uniform'),
              X_train, X_test, y_train, y_test)

f1 0.8923076923076924 precision 0.8854961832061069
[[ 41  15]
 [ 13 116]]


The performance of this model can be summarized as follows:

**Model 5:**
- F1 Score: 0.8923
- Precision: 0.8855
- Confusion Matrix:
```
[[ 41  15]
 [ 13 116]]
 ```


Based on the provided metrics and confusion matrices:

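- Model 5 has the highest precision (0.8855) of all the models, and its F1 score (0.8923) is marginally above Model 2's (0.8914).
- Model 5 has the most true negatives (41) and the fewest false positives (15) of all the models, although Model 2 has more true positives (119 vs. 116).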

In conclusion, Model 5 appears to perform slightly better overall, achieving the highest F1 score and precision, which indicates a better balance between correctly classifying positive instances and minimizing false positives. <br>
Hyperparameter tuning mainly improved precision (0.8855 vs. 0.8623 for Model 2), while the F1 score rose only marginally.

# Conclusion
In conclusion, our classification model has demonstrated strong performance with an `F1-score of 0.8923` and a `precision of 0.8855`. The confusion matrix shows 15 false positives and 13 false negatives out of 185 test instances, indicating that the model discriminates well between positive and negative instances.

However, while the model has achieved impressive results, there is always room for improvement. To further enhance the model's accuracy and robustness, one key avenue is to collect more data. Additional data can provide the model with a broader and more representative sample of instances, especially if the dataset is limited. This larger and more diverse dataset can help the model generalize better to unseen examples and further increase its predictive power.

In summary, while our model has achieved impressive results, the pursuit of excellence continues. Collecting more data is just one step in our journey to create a highly accurate and reliable classification model. By continually refining our approach and incorporating best practices, we aim to build a model that excels in its predictive capabilities.

# Saving of Model

In [16]:
joblib.dump(model_5, './model/Startups_Success_Rate_Prediction.joblib')

['Startups_Success_Rate_Prediction.joblib']
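
A saved model can later be restored with `joblib.load`. The sketch below shows the round trip under simplifying assumptions: a plain dictionary stands in for the trained model, and the file is written to a temporary directory instead of the `./model/` path used above.

```python
import os
import tempfile

import joblib

# A plain dictionary stands in for the trained model in this sketch
stand_in_model = {"name": "model_5", "n_estimators": 2500}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.joblib")
    joblib.dump(stand_in_model, path)      # write the object to disk
    restored = joblib.load(path)           # read it back

print(restored == stand_in_model)
```

For a real model, the restored object exposes the same methods as the original, so `predict` can be called on it directly.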