# Vanilla Models

This notebook will run our baseline models to understand better our dataset and how the features and target variables behave when running different models. Our focus is to find baseline models that can improve the Precision without affecting Recall. To evaluate the models, we will use the metrics Precision, Accuracy, Recall, F1 Score, and Confusion Matrix.

## Read In Data

In [2]:
# Import Packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Sklearn Packages
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn import metrics
from sklearn.metrics import mean_squared_error, precision_score, confusion_matrix, accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn import set_config
set_config(print_changed_only=False)
from xgboost import XGBClassifier
from sklearn.utils.testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=ConvergenceWarning)

pd.set_option('display.max_columns', 300)
% matplotlib inline

plt.style.use('seaborn')



In [3]:
# ead in UCI Heart Disiease Databasae
df = pd.read_csv('fetal_health.csv')


## Data Cleaning

In [4]:
df.fetal_health = np.where(df.fetal_health > 1.0, 2.0, df.fetal_health)

## Train Test Split

<b>Objectives: </b>

- Assign our feature variables and target variable into the X and y variables. 
- Split our dataset using Sklearn Train Test Split. We will use the Sklearn default option and assign 75% to our train set and 25% to our test set.

In [5]:
# Assigning features variables (X) and target variable (y)
X = df.drop('fetal_health', axis=1)
y = df.fetal_health

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [6]:
# Checking if train test split ran correclty
for dataset in [y_train, y_test]:
    print(round(len(dataset)/len(y), 2))

0.75
0.25


Our train and test set were split correctly. 75% of the dataset was assigned to our train set, and 25% was assigned to our test set.

## Baseline Models

To better understand our dataset and how our features variables affect our target variable, we decided to run baseline models and find what approach we should take when tuning our models. None of the baseline models will have any kinds of hyperparameters tuning. We will use the default hyperparameters.

<b>Objectives:</b>
- Run a Logistic Regression, KNN, Decision Tree, and Random Forest model.
- Create a function to simplify our model evaluation.
- Evaluate each model using the Accuracy, Recall, F1 Score, and Precision metrics.
- Create confusion matrices for each model.

In [7]:
# Evaluation function

def evaluation(y_true, y_pred):
    
# Print Accuracy, Recall, F1 Score, and Precision metrics.
    print('Evaluation Metrics:')
    print('Accuracy: ' + str(metrics.accuracy_score(y_test, y_pred)))
    print('Recall: ' + str(metrics.recall_score(y_test, y_pred)))
    print('F1 Score: ' + str(metrics.f1_score(y_test, y_pred)))
    print('Precision: ' + str(metrics.precision_score(y_test, y_pred)))
    
# Print confusion Matrix
    print('\nConfusion Matrix:')
    print(' TN,  FP, FN, TP')
    print(confusion_matrix(y_true, y_pred).ravel())
    
# Function prints best parameters for GridSearchCV
def print_results(results):
    print('Best Parameters: {}\n'.format(results.best_params_))    

### Logistic Regression

When trying to classify structured data, logistic regression models usually give a quick and reliable result. Thus, it will be our first baseline model.

In [8]:
# Baseline Logistic Regression Model

lr_baseline = LogisticRegression()

# Fitting and predicting
lr_baseline.fit(X_train, y_train)
y_pred_lr_baseline = lr_baseline.predict(X_test)

# Evaluation Metrics
evaluation(y_test, y_pred_lr_baseline)

Evaluation Metrics:
Accuracy: 0.8834586466165414
Recall: 0.9535452322738386
F1 Score: 0.9263657957244655
Precision: 0.9006928406466512

Confusion Matrix:
 TN,  FP, FN, TP
[390  19  43  80]


<b>Findings:</b>

As we can see, our baseline Logistic Regression model performed reasonably well, even without tuning. Our feature variables can train the model with fairly precision.

We will now test the same features and target variables using K-nearest neighbors, which classifies data points based on similarity.

### KNN

Next, we will run a K-Nearest Neighbors model. KNN is a simples model that stores and classifies all data points based on similarity measures (i.e., distance functions). Since it's an easy model to set up, we will run it and read the results to understand what the data tells us. We will have to scale our dataset for KNN before running it.

In [9]:
# Call and fit scaler 
scaler = StandardScaler()
scaler.fit(X_train)

# Scaling our dataset
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [10]:
# Baseline KNN Model
knn_baseline = KNeighborsClassifier()

# Fitting and predicting
knn_baseline.fit(X_train_scaled, y_train)
y_pred_knn_baseline = knn_baseline.predict(X_test_scaled)

# Evaluation metrics
evaluation(y_test, y_pred_knn_baseline)

Evaluation Metrics:
Accuracy: 0.9398496240601504
Recall: 0.9877750611246944
F1 Score: 0.961904761904762
Precision: 0.9373549883990719

Confusion Matrix:
 TN,  FP, FN, TP
[404   5  27  96]


<b>Findings:</b>

Our baseline KNN model performed better than our baseline Logistic Regression model in all the metrics that we are analyzing the model. Interestingly, the Recall metric had a performance of .99. This is very high. It means that our baseline model is capable of predicting true positives with high precision. For now, we will keep running baseline models to find what the dataset has to tell us.

### Decision Tree

One of our project's key questions is to find what metrics can predict which features have the highest coefficient in our model predictions. For this reason, we will run Decision Tree as well, since it has a better division of features to predict an outcome.

In [None]:
# Baseline Decision Tree Model
dt_baseline = DecisionTreeClassifier()

# Fitting and predicting
dt_baseline = tree_baseline.fit(X_train, y_train)
y_pred_dt_baseline = tree_baseline.predict(X_test)

# Evaluation metrics
evaluation(y_test, y_pred_dt_baseline)

<b>Findings:</b>

Decision Tree did not perform better than KNN for most of the metrics that we are using. However, it performed better for the precision metric, which is the metric that we are most concerned about. Thus, we will run a Random Forest model, which works as a collection of decision trees.

### Random Forest

After running a Decision Tree model, it makes sense to run a Random Forest model, an ensemble model that operates by constructing a multitude of decision trees at training time and outputting the class that is the average prediction for the individual trees.

In [12]:
# Baseline Random Forest Model
rfc_baseline = RandomForestClassifier()

# Fitting and predicting
rfc_baseline.fit(X_train, y_train)
y_pred_rfr_baseline = rfc_baseline.predict(X_test)

# Evaluation metrics
evaluation(y_test, y_pred_rfr_baseline)

Evaluation Metrics:
Accuracy: 0.9473684210526315
Recall: 0.9902200488997555
F1 Score: 0.9665871121718378
Precision: 0.9440559440559441

Confusion Matrix:
 TN,  FP, FN, TP
[405   4  24  99]


<b>Findings:</b>

Our baseline random forest model had an exciting performance. It outperformed the Decision Tree baseline model in almost every metric, except Precision, which slightly decreased. Compared to KNN, our best performing model so far, if we compare every metric, performs better for the metrics Accuracy, F1 Score, and Precision. It performed the same for the Recall metric.

### Baseline Models Evaluation

Accuracy, Recall, and F1 Score. Decision Tree had the best performance for the Precision metric. All of our models performed reasonably well. However, we have space for improvement in all of them. We will start improving some hyperparameters and check how they performed.

## Preprocessing Data

We have a class imbalance problem in our dataset. Thus, we will use Synthetic Minority Oversampling Technique (SMOTE), where the minority class is oversampled by producing synthetic examples to fix the class imbalance. We will then compare the results to our baseline models to check if the target variable is even.

In [13]:
# SMOTE
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_sample(X_train, y_train)

# Checking if SMOTE was correctly fitted
for dataset in (y_train, y_train_smote):
    print (dataset.value_counts(normalize=True))

1.0    0.781681
2.0    0.218319
Name: fetal_health, dtype: float64
2.0    0.5
1.0    0.5
Name: fetal_health, dtype: float64


## Vanilla Models Tuning

We will now start improving our model using hyperparameters tuning without any feature engineering. We will tune all the baseline models changing some hyperparameters. Then, we will use GridSearchCV and XGBoost to find the best hyperparameters.

### Logistic Regression

We will start with the first baseline model that we ran: Logistic Regression using SMOTE. We want to see if SMOTE can help our model best predict our target variable. We will not tune any hyperparameter for the first model.

In [14]:
# Logistic Regression Model
lr = LogisticRegression()

# Fitting and predicting
lr.fit(X_train_smote, y_train_smote)
y_pred_lr_smote = lr.predict(X_test)

# Evaluation Metrics
evaluation(y_test, y_pred_lr_smote)

Evaluation Metrics:
Accuracy: 0.8402255639097744
Recall: 0.8410757946210269
F1 Score: 0.8900388098318239
Precision: 0.945054945054945

Confusion Matrix:
 TN,  FP, FN, TP
[344  65  20 103]


Compared to our model using SMOTE underperformed in almost every metric using SMOTE. However, the Precision metric did perform better. It's the primary metric that we are interested in. We can see that there the number of False Negatives was reduced.

Next, we will add a few hyperparameters individually and see if we can improve our model. Here, our approach will be trying random hyperparameters and check the results compared to the model without any parameters.

In [15]:
# Logistic Regression Model
lr = LogisticRegression(C=100, max_iter=200, class_weight='balanced')

# Fitting and predicting
lr.fit(X_train_smote, y_train_smote)
y_pred_lr_tuned = lr.predict(X_test)

# Evaluation Metrics
evaluation(y_test, y_pred_lr_tuned)


Evaluation Metrics:
Accuracy: 0.8402255639097744
Recall: 0.8410757946210269
F1 Score: 0.8900388098318239
Precision: 0.945054945054945

Confusion Matrix:
 TN,  FP, FN, TP
[344  65  20 103]


`C=100` improved the Precision metric. All the other metrics remained the same.

`fit_intercept=False` reduced all the metrics.

`max_inter=200` improved the Precision metric compared to the default `max_inter=100`

`class_weight='balanced'` improved all the metrics.

<b>Findings:</b>

Compared to our baseline model, the Logistic Regression model did not perform better. We were able to improve the model slightly using some random tuning. However, we don't think that Logistic Regression has the power that we need. Thus, we will use some more advanced algorithms.

### KNN

Following our order of baseline models, we will know use KNN models. We will first compare the result of our baseline model to a model after applying SMOTE. Then, we will use GridSearchCV to find the best hyperparameter tuning. For this model, we will scale the train set using SMOTE.

In [16]:
# KNN model using SMOTE
knn = KNeighborsClassifier()

# Fitting and predicting
knn.fit(X_train_smote, y_train_smote)
y_pred_knn_smote = knn.predict(X_test_scaled)

# Evaluation metrics
evaluation(y_test, y_pred_knn_smote)

Evaluation Metrics:
Accuracy: 0.75
Recall: 0.9070904645476773
F1 Score: 0.848
Precision: 0.796137339055794

Confusion Matrix:
 TN,  FP, FN, TP
[371  38  95  28]


Our KNN model performed better than our baseline model in almost every metric, except Recall that had a small drop. We will now run a GridSearchCV to find the most relevant hyperparameters.

In [17]:
grid_params_knn = {
    'n_neighbors':list(range(1,10)),
    'weights':['uniform','distance'],
    'metric':['eucliean','manhattan','minkowski'],
    'leaf_size':list(range(1,101))
}

In [18]:
# Using GridSearchCV for a KNN model
gs_knn = GridSearchCV(knn,grid_params_knn,verbose=1,n_jobs=-1)

# Fitting and predicting
gs_knn.fit(X_train_smote, y_train_smote)
y_pred_gs_knn = gs_knn.predict(X_test_scaled)

# Evaluation metrics
evaluation(y_test, y_pred_gs_knn)
print_results(gs_knn)

Fitting 5 folds for each of 5400 candidates, totalling 27000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 976 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done 2976 tasks      | elapsed:   19.8s
[Parallel(n_jobs=-1)]: Done 5776 tasks      | elapsed:   35.4s
[Parallel(n_jobs=-1)]: Done 9376 tasks      | elapsed:   53.7s
[Parallel(n_jobs=-1)]: Done 13776 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 18976 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 24976 tasks      | elapsed:  2.2min


Evaluation Metrics:
Accuracy: 0.768796992481203
Recall: 1.0
F1 Score: 0.869287991498406
Precision: 0.768796992481203

Confusion Matrix:
 TN,  FP, FN, TP
[409   0 123   0]
Best Parameters: {'leaf_size': 1, 'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}



[Parallel(n_jobs=-1)]: Done 27000 out of 27000 | elapsed:  2.3min finished


KNN using GridSearchCV did not perform better than our baseline model using only SMOTE. The Recall metric increased to 1.0, which might be a sign of overfitting. The model is not capable of identify False Positives or True Positives.

### Decision Tree

Following our order of models, we will now try a Decision Tree model. On the baseline model, Decision Tree did not perform better in most of the metrics. However, it did perform better with the Precision metric, the focus of our project. We will run a model using only SMOTE and see the performance. We believe that Random Forest, since it's an ensemble model, will perform better than KNN. However, we want to know how the model performs without the imbalance classification problem.

In [20]:
# Decision Tree Model
tree = DecisionTreeClassifier()

# Fitting and predicting
tree.fit(X_train_smote, y_train_smote)
y_pred_tree = tree.predict(X_test)

# Evaluation metrics
evaluation(y_test, y_pred_tree)

Evaluation Metrics:
Accuracy: 0.9304511278195489
Recall: 0.960880195599022
F1 Score: 0.9550425273390036
Precision: 0.9492753623188406

Confusion Matrix:
 TN,  FP, FN, TP
[393  16  21 102]


<b>Findings:</b>

Our Decision Tree model had a good performance. So far, it was our best performing model for precision metric. We expect to improve this result using Random Forest models.


### Random Forest

Random Forest was our best performing baseline model. We will see if we can improve even more the results using hyperparameter tuning. However, let's first see how the model performs using only SMOTE first. Then we will use GridSearchCV to find the best hyperparameters.

In [21]:
# Random Forest Model
rfc = RandomForestClassifier()

# Fitting and predicting
rfc.fit(X_train_smote, y_train_smote)
y_preds_rfr = rfc.predict(X_test)

# Evaluation Metrics
evaluation(y_test, y_preds_rfr)

Evaluation Metrics:
Accuracy: 0.9511278195488722
Recall: 0.9828850855745721
F1 Score: 0.9686746987951808
Precision: 0.9548693586698337

Confusion Matrix:
 TN,  FP, FN, TP
[402   7  19 104]


<b>Findings: </b>

Compared to our baseline model, we can see improvements already in all the metrics. We can also see that our model reduced the number of False Positives, which is our focus.

### Random Forest with Grid Search

Now it's time to find the best hyperparameters for Random Forest, the model that interests us the most for now. First, we will set up a dictionary of hyperparameters that we want to try and then run a GridSearchCV to find the best fit for our model.

In [22]:
# GridSearch Parameters
parameters = {
    'n_estimators': [5, 50, 100, 150, 200],
    'max_depth': list(range(1, 11)),
    'criterion':['gini','entropy'],
    'max_features': list(range(20)),
    'oob_score':[False,True],
}

In [66]:
# # GridSearch (----------remove hyphen from GridSearchCV----------)
rfc_gs = GridSearchCV(rfc, parameters, cv=5, verbose=1, n_jobs=-1)

rfc_gs.fit(X_train_smote, y_train_smote)
y_preds_rfr_cv = rfc_gs.predict((X_test))

evaluation(y_test, y_preds_rfr_rfc_gs)
print_results(rfc_gs)

Fitting 5 folds for each of 4000 candidates, totalling 20000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   13.3s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   32.5s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   57.0s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 6034 tasks      | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done 7184 tasks      | elapsed: 14.0min
[Parallel(n_jobs=-1)]: Done 8434 tasks      | elapsed: 17.2min
[Parallel(n_jobs=-1)]: Done 9784 tasks      | elapsed: 21.1min
[Parallel(n_jobs=-1)]: Done 11234 tasks      |

Evaluation Metrics:
Accuracy: 0.9661654135338346
Recall: 0.9853300733496333
F1 Score: 0.9781553398058253
Precision: 0.9710843373493976

Confusion Matrix:
 TN,  FP, FN, TP
[403   6  12 111]
Best Parameters: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 11, 'n_estimators': 150, 'oob_score': True}



This is a great improvement from our baseline model. All the metrics improved and this is the best performing so far before feature engineering.

To confirm that this is the best performing model, we would like to play a little bit with the `max_feature` hyperparameter since we noticed that sometimes GridSearchCV might not choose the best one. 

In [24]:
# Baseline Random Forest Model
rfc = RandomForestClassifier(criterion='entropy',
                             n_estimators= 150, 
                             max_depth=9, 
                             max_features=10,
                             oob_score=True
                            )

# Fitting and predicting
rfc.fit(X_train_smote, y_train_smote)
y_preds_rfr_tuned = rfc.predict(X_test)

# Evaluation Metrics
evaluation(y_test, y_preds_rfr_tuned)

Evaluation Metrics:
Accuracy: 0.9605263157894737
Recall: 0.9779951100244498
F1 Score: 0.9744214372716199
Precision: 0.970873786407767

Confusion Matrix:
 TN,  FP, FN, TP
[400   9  12 111]


<b>Findings:</b>

As we predicted, GridSearchCV didn't give us the best `max_features` for precision. We are not sure why this happens, so we always try changing it slightly to see what the result would look like. We changed `max_depth=10` to `max_depth=9` and the metric precision improved.

However, we are trying to decrease False Positives. GridSearchCV with Random Forest gave us the lowest False Positives.

Finally, we will use XGBoost to see if we can improve our Random Tree model.

### XGBoost

Finally, our last try to improve our model will be XGBoost, which uses Gradient Descent and Boosting principles. We believe there might have space for improvement. First, we will use the default hyperparameters. Then, we will use XGBoost with GridSearch.

In [33]:
# Instantiate XGBClassifier
xgb = XGBClassifier()

# Fitting and predicting
xgb.fit(X_train_smote, y_train_smote)
y_pred_xg = xgb.predict(X_test)

# Evaluation Metrics
evaluation(y_test, y_pred_xg)

Evaluation Metrics:
Accuracy: 0.9454887218045113
Recall: 0.9633251833740831
F1 Score: 0.9645042839657283
Precision: 0.9656862745098039

Confusion Matrix:
 TN,  FP, FN, TP
[394  15  14 109]


<b>Findings:</b>

XGBoost performed using only the default parameters. It did not perform better than Random Forest using Grid Search.