## Heart Failure Prediction 🩺


The aim of this project is to use the provided data to predict which patients could die because of heart failure. Let's start!

P.s: If you like this notebook don't forget to **UPVOTE**!

<img src ="http://25.media.tumblr.com/tumblr_m9kolagxR81qfvx4yo1_400.gif">

In [None]:
# IMPORTING LIBRARIES

# Main Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from scipy.stats import norm
from collections import Counter
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action="ignore")

# Pre-processing Libraries

from sklearn.utils import class_weight
from imblearn.over_sampling import SMOTE
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from prettytable import PrettyTable

# Machine Learning Libraries

import sklearn
from sklearn import tree
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score


# Defining working directory

work_dir = '../input/heart-failure-clinical-data/'

Let's take a look at our data:

In [None]:
# IMPORTING DATA

hf_data = pd.read_csv(work_dir + 'heart_failure_clinical_records_dataset.csv')
hf_data.head()

In [None]:
hf_data.info()

Now let's see the meaning of our categorical values:

1. Boolean features
   * Sex - gender of the patient: Male = 1, Female = 0
   * Diabetes - 0 = No, 1 = Yes
   * Anaemia - 0 = No, 1 = Yes
   * High_blood_pressure - 0 = No, 1 = Yes
   * Smoking - 0 = No, 1 = Yes
   * DEATH_EVENT - 0 = No, 1 = Yes

In [None]:
# Finding duplicates 

hf_data.duplicated().sum()

In [None]:
# Finding missing values (NaN): list the columns that contain at least one

hf_data.columns[hf_data.isna().any()].tolist()

In [None]:
# Transforming categorical values into strings

hf_data['anaemia'] = hf_data['anaemia'].apply(str)
hf_data['diabetes'] = hf_data['diabetes'].apply(str)
hf_data['high_blood_pressure'] = hf_data['high_blood_pressure'].apply(str)
hf_data['smoking'] = hf_data['smoking'].apply(str)
hf_data['sex'] = hf_data['sex'].apply(str)
hf_data['DEATH_EVENT'] = hf_data['DEATH_EVENT'].apply(str)

In [None]:
# Let's look at the descriptive statistics

hf_data.describe()

# Exploratory Data Analysis

EDA is a crucial process that helps us better understand our data through graphical representations and gives us the opportunity to gain insight into them. We can use different types of visualizations to optimize the process. The main steps usually are:

- Investigation of distributions
- Class balancing (in classification tasks)
- Outlier detection 
- Investigation of possible correlations

In [None]:
# Checking labels distributions

sns.set_theme(context = 'paper')

plt.figure(figsize = (10,5))
sns.countplot(x = 'DEATH_EVENT', data = hf_data)  # pass the column by keyword; recent seaborn versions no longer accept it positionally
plt.title('Class Distributions \n (0: Survived || 1: Passed )', fontsize=14)
plt.show()

Looking at the plot, it is clear that we are facing an imbalanced dataset, with a ratio of roughly 2:1 in favour of surviving patients (0).
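
The imbalance can also be checked numerically rather than read off the plot. The following is only a minimal sketch: `toy_labels` is a hypothetical stand-in for `hf_data['DEATH_EVENT']`, not the real data.

```python
import pandas as pd

# Hypothetical toy labels standing in for hf_data['DEATH_EVENT'] (not the real data)
toy_labels = pd.Series(['0', '0', '0', '0', '1', '1'])

# Absolute counts and relative frequencies per class
class_counts = toy_labels.value_counts()
class_share = toy_labels.value_counts(normalize=True)

# Ratio between the majority and the minority class
imbalance_ratio = class_counts.max() / class_counts.min()
print(class_counts)
print(class_share.round(2))
print('Majority/minority ratio:', imbalance_ratio)
```

On the real data, the same two calls on `hf_data['DEATH_EVENT']` should give the exact counts behind the bar plot.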

In [None]:
# Let's plot the numerical features (Histograms & Scatterplots)

# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(hf_data)
plt.show()

The pairplot shows the histogram of each numerical variable and the scatterplots of each pair of variables. Looking at it, we can see that our data do not seem to be normally distributed and that no clear positive or negative correlations are visible. Let's investigate a little more!

In [None]:
# EDA & VISUALIZATIONS

# Correlation Heatmap

f, ax = plt.subplots(figsize=(15, 15))
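# Only numeric columns can be correlated (the categorical ones were cast to strings above)
mat = hf_data.select_dtypes('number').corr('pearson')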
mask = np.triu(np.ones_like(mat, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(mat, mask=mask, cmap=cmap, vmax=1, center=0, annot = True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

The correlation map confirms what we found with the pairplot: no meaningful correlations are present among our numerical features.
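
Since `DEATH_EVENT` was cast to a string earlier, it does not take part in a numeric correlation matrix. One possible way to look at how each numerical feature relates to the target is to cast it back to integers first. This is a sketch on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame mimicking the notebook's layout:
# numeric features plus a string-typed target (values are invented)
toy = pd.DataFrame({
    'age': [50, 60, 70, 80, 65, 55],
    'ejection_fraction': [60, 45, 30, 20, 35, 50],
    'DEATH_EVENT': ['0', '0', '1', '1', '1', '0'],
})

# Cast the target back to integers so it can take part in the correlation
target = toy['DEATH_EVENT'].astype(int)

# Pearson correlation of every numeric feature with the target
target_corr = toy.select_dtypes('number').corrwith(target).sort_values()
print(target_corr)
```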

In [None]:
# Plotting features with an interesting correlation

f, axes = plt.subplots(ncols=4, figsize=(24,6))

sns.boxplot(x='DEATH_EVENT', y="age", data=hf_data, hue = 'sex',ax=axes[0])
axes[0].set_title('Age vs Death Event', fontsize = 14)

sns.boxplot(x='DEATH_EVENT', y="creatinine_phosphokinase", data=hf_data, hue = 'sex', ax=axes[1]) 
axes[1].set_title('Creatinine Phosphokinase vs Death Event', fontsize = 14)


sns.boxplot(x='DEATH_EVENT', y="ejection_fraction", data=hf_data, hue = 'sex', ax=axes[2])
axes[2].set_title('Ejection Fraction vs Death Event', fontsize = 14)


sns.boxplot(x='DEATH_EVENT', y="platelets", data=hf_data, hue = 'sex',ax=axes[3])  
axes[3].set_title('Platelets vs Death Event', fontsize = 14) 

plt.show()

In [None]:
# Plotting more features with an interesting correlation

f, axes = plt.subplots(ncols=3, figsize=(24,6))

sns.boxplot(x='DEATH_EVENT', y="serum_creatinine", data=hf_data,hue = 'sex', ax=axes[0])
axes[0].set_title('Serum Creatinine vs Death Event', fontsize = 14)

sns.boxplot(x='DEATH_EVENT', y="serum_sodium", data=hf_data,hue = 'sex', ax=axes[1]) 
axes[1].set_title('Serum Sodium vs Death Event', fontsize = 14)


sns.boxplot(x='DEATH_EVENT', y="time", data=hf_data,hue = 'sex', ax=axes[2])
axes[2].set_title('Time vs Death Event', fontsize = 14)
 
plt.show()

In [None]:
# Plotting the distributions of our numeric features

f, ax = plt.subplots(1,4, figsize=(24, 6))

sns.distplot(hf_data['age'],fit=norm, color='#FB8861', ax = ax[0])
ax[0].set_title('Age \n Normal dist.', fontsize=14)

sns.distplot(hf_data['creatinine_phosphokinase'], fit=norm, color='#56F9BB',ax=ax[1])
ax[1].set_title('Creatinine Phosphokinase \n Non normal dist.', fontsize=14)

sns.distplot(hf_data['ejection_fraction'], fit=norm, color='#C5B3F9', ax = ax[2])
ax[2].set_title('Ejection Fraction\n Non normal dist.', fontsize=14)

sns.distplot(hf_data['platelets'], fit=norm, color='#C5B3F9',ax = ax[3])
ax[3].set_title(' Platelets \n Non normal dist.', fontsize=14)

plt.show()

In [None]:
# Plotting the distributions of our numeric features

f, ax = plt.subplots(1,3, figsize=(24, 6))

sns.distplot(hf_data['serum_creatinine'],fit=norm, color='#FB8861', ax = ax[0])
ax[0].set_title('serum Creatinine \n Non normal dist.', fontsize=14)

sns.distplot(hf_data['serum_sodium'], fit=norm, color='#56F9BB',ax=ax[1])
ax[1].set_title('Serum Sodium \n Normal dist.', fontsize=14)

sns.distplot(hf_data['time'], fit=norm, color='#C5B3F9', ax = ax[2])
ax[2].set_title('Time \n Non normal dist.', fontsize=14)

plt.show()

In [None]:
# Outliers removal function

def outliers_removal(feature,feature_name,dataset):
    
    # Identify 25th & 75th quartiles

    q25, q75 = np.percentile(feature, 25), np.percentile(feature, 75)
    print('Quartile 25: {} | Quartile 75: {}'.format(q25, q75))
    feat_iqr = q75 - q25
    print('iqr: {}'.format(feat_iqr))
    
    feat_cut_off = feat_iqr * 1.5
    feat_lower, feat_upper = q25 - feat_cut_off, q75 + feat_cut_off
    print('Cut Off: {}'.format(feat_cut_off))
    print(feature_name +' Lower: {}'.format(feat_lower))
    print(feature_name +' Upper: {}'.format(feat_upper))
    
    outliers = [x for x in feature if x < feat_lower or x > feat_upper]
    print(feature_name + ' outlier count: {}'.format(len(outliers)))
    #print(feature_name + ' outliers:{}'.format(outliers))

    dataset = dataset.drop(dataset[(dataset[feature_name] > feat_upper) | (dataset[feature_name] < feat_lower)].index)
    print('-' * 65)
    
    return dataset

hf_data = outliers_removal(hf_data['age'],'age', hf_data)
hf_data = outliers_removal(hf_data['creatinine_phosphokinase'],'creatinine_phosphokinase', hf_data)
hf_data = outliers_removal(hf_data['ejection_fraction'],'ejection_fraction', hf_data)
hf_data = outliers_removal(hf_data['platelets'],'platelets', hf_data)
hf_data = outliers_removal(hf_data['serum_creatinine'],'serum_creatinine', hf_data)
hf_data = outliers_removal(hf_data['serum_sodium'],'serum_sodium', hf_data)
hf_data = outliers_removal(hf_data['time'],'time', hf_data)
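
To make the IQR rule used by `outliers_removal` concrete, here is a small worked example on a hypothetical toy array (the values are invented for illustration):

```python
import numpy as np

# Hypothetical toy values with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 100])

# Quartiles, interquartile range and the 1.5 * IQR fences
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

# Keep only the values inside the [lower, upper] fence
kept = values[(values >= lower) & (values <= upper)]
print(q25, q75, iqr)   # 2.25 4.75 2.5
print(lower, upper)    # -1.5 8.5
print(kept)            # [1 2 3 4 5]
```

Note that the notebook applies the function one feature at a time, so rows are dropped cumulatively: each call computes its quartiles on the rows that survived the previous calls.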

In [None]:
# Plotting boxplots of numerical features

f, axes = plt.subplots(ncols=4, figsize=(24,6))

sns.boxplot(x='DEATH_EVENT', y="age", data=hf_data, hue = 'sex',ax=axes[0])
axes[0].set_title('Age vs Death Event', fontsize = 14)

sns.boxplot(x='DEATH_EVENT', y="creatinine_phosphokinase", data=hf_data, hue = 'sex', ax=axes[1]) 
axes[1].set_title('Creatinine Phosphokinase vs Death Event', fontsize = 14)


sns.boxplot(x='DEATH_EVENT', y="ejection_fraction", data=hf_data, hue = 'sex', ax=axes[2])
axes[2].set_title('Ejection Fraction vs Death Event', fontsize = 14)


sns.boxplot(x='DEATH_EVENT', y="platelets", data=hf_data, hue = 'sex',ax=axes[3])  
axes[3].set_title('Platelets vs Death Event', fontsize = 14) 

plt.show()

In [None]:
# Plotting boxplots of numerical features

f, axes = plt.subplots(ncols=3, figsize=(24,6))

sns.boxplot(x='DEATH_EVENT', y="serum_creatinine", data=hf_data,hue = 'sex', ax=axes[0])
axes[0].set_title('Serum Creatinine vs Death Event', fontsize = 14)

sns.boxplot(x='DEATH_EVENT', y="serum_sodium", data=hf_data,hue = 'sex', ax=axes[1]) 
axes[1].set_title('Serum Sodium vs Death Event', fontsize = 14)


sns.boxplot(x='DEATH_EVENT', y="time", data=hf_data,hue = 'sex', ax=axes[2])
axes[2].set_title('Time vs Death Event', fontsize = 14)
 
plt.show()

In [None]:
# Plotting the distributions of our numeric features (after outlier removal)

f, ax = plt.subplots(1,4, figsize=(24, 6))

sns.distplot(hf_data['age'],fit=norm, color='#FB8861', ax = ax[0])
ax[0].set_title('Age \n Non normal dist.', fontsize=14)

sns.distplot(hf_data['creatinine_phosphokinase'], fit=norm, color='#56F9BB',ax=ax[1])
ax[1].set_title('Creatinine Phosphokinase \n Non normal dist.', fontsize=14)

sns.distplot(hf_data['ejection_fraction'], fit=norm, color='#C5B3F9', ax = ax[2])
ax[2].set_title('Ejection Fraction\n Non normal dist.', fontsize=14)

sns.distplot(hf_data['platelets'], fit=norm, color='#C5B3F9',ax = ax[3])
ax[3].set_title(' Platelets \n Non normal dist.', fontsize=14)

plt.show()

In [None]:
# Plotting the distributions of our numeric features (after outlier removal)

f, ax = plt.subplots(1,3, figsize=(24, 6))

sns.distplot(hf_data['serum_creatinine'],fit=norm, color='#FB8861', ax = ax[0])
ax[0].set_title('serum Creatinine \n Non normal dist.', fontsize=14)

sns.distplot(hf_data['serum_sodium'], fit=norm, color='#56F9BB',ax=ax[1])
ax[1].set_title('Serum Sodium \n Non normal dist.', fontsize=14)

sns.distplot(hf_data['time'], fit=norm, color='#C5B3F9', ax = ax[2])
ax[2].set_title('Time \n Non normal dist.', fontsize=14)

plt.show()

Although the distributions are still not normal, they are now definitely closer to normality. To check their normality formally, we can use the Shapiro-Wilk statistical test:

In [None]:
# Checking Normality

def check_normality(data, name):
    shap_t,shap_p = stats.shapiro(data)
    print(name + ' parameters:')
    print()
    print("Skewness: %f" % abs(data).skew())
    print("Kurtosis: %f" % abs(data).kurt())
    print("Shapiro Test: %f" % shap_t)
    print("Shapiro p_value: %f" % shap_p)
    
    if shap_p > 0.05:
        print('The distribution is normal')
    else:
        print('The distribution is not normal')
    
check_normality(hf_data['age'],'Age')
print('-------------------------')
check_normality(hf_data['creatinine_phosphokinase'],'Creatinine Phosphokinase')
print('-------------------------')
check_normality(hf_data['ejection_fraction'],'Ejection Fraction')
print('-------------------------')
check_normality(hf_data['platelets'],'Platelets')
print('-------------------------')
check_normality(hf_data['serum_creatinine'],'Serum Creatinine')
print('-------------------------')
check_normality(hf_data['serum_sodium'],'Serum Sodium')
print('-------------------------')
check_normality(hf_data['time'],'Time')

As you can see, although some features are close to normality, they cannot be considered normal. Anyway, let's see what performance we can obtain using these basically raw data.
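
The decision rule used in `check_normality` can be illustrated on synthetic samples. This is only a sketch: the two samples below are generated for illustration and have nothing to do with the dataset. Keep in mind that a p-value above 0.05 means the test fails to reject normality, which is weaker than proving it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical synthetic samples: one roughly Gaussian, one strongly right-skewed
gaussian_sample = rng.normal(loc=0.0, scale=1.0, size=300)
skewed_sample = rng.exponential(scale=1.0, size=300)

for name, sample in [('gaussian', gaussian_sample), ('skewed', skewed_sample)]:
    stat, p_value = stats.shapiro(sample)
    # p > 0.05: no evidence against normality; p <= 0.05: normality is rejected
    verdict = 'not rejected' if p_value > 0.05 else 'rejected'
    print(f'{name}: W={stat:.3f}, p={p_value:.4f}, normality {verdict}')
```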

In [None]:
# Train data - labels separation

labels = hf_data['DEATH_EVENT']
train = hf_data.drop(['DEATH_EVENT'], axis = 1)

# Modeling

In this part of the notebook, I will try several algorithms to see which is the most effective and what their strengths and weaknesses are. In order to submit the results to the two open tasks for this competition, I decided to use three different models:

- *Logistic Regression*
- *Random Forest Classifier*
- *Catboost Classifier*

Let's start by modifying the data to make them usable by these classifiers (CatBoost is the only one that can automatically deal with outliers and categorical features; the other two can't). To make this happen, we need to preprocess the categorical features manually using dummy coding.
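
As a quick illustration of what dummy coding does, here is a minimal sketch on a hypothetical two-column frame (`toy` and its values are invented): numeric columns pass through unchanged, while each category of a string column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one string-coded binary column
toy = pd.DataFrame({'age': [60, 72, 55], 'sex': ['1', '0', '1']})

# Numeric columns are kept as they are; each string category gets an indicator column
toy_dummy = pd.get_dummies(toy)
print(toy_dummy.columns.tolist())   # ['age', 'sex_0', 'sex_1']
```

For a binary feature the two indicator columns carry the same information; `pd.get_dummies(..., drop_first=True)` is one option if that redundancy is unwanted for a linear model.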

In [None]:
# Dealing with categorical data

train_dummy = pd.get_dummies(train)

# Splitting the data into Train & Test sets

Xtrain,X_test,ytrain,y_test = train_test_split(train_dummy,labels,
                                               test_size = 0.1,
                                               stratify = labels,
                                               shuffle = True)

X_train,X_val,y_train,y_val = train_test_split(Xtrain,ytrain,
                                               test_size = 0.1,
                                               stratify = ytrain,
                                               shuffle = True)

# This is what our preprocessed train data look like

X_train.head()

Now that our data are ready to be used, let's start modeling. The first part will focus on obtaining a **baseline value** for each model used with its default parameters. To get robust results I decided to apply 10-fold cross-validation, meaning that we'll fit and evaluate our model on 10 different folds of train and validation data:
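
The mechanics of 10-fold cross-validation can be sketched on synthetic data. Everything below (`X_toy`, `y_toy`, the chosen metrics) is a hypothetical stand-in for illustration, not the notebook's data or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical synthetic data standing in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   weights=[0.67, 0.33], random_state=42)

# 10 stratified folds: each fold serves exactly once as validation data
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_result = cross_validate(LogisticRegression(max_iter=1000), X_toy, y_toy,
                           cv=folds, scoring=('accuracy', 'roc_auc'),
                           return_train_score=True)

# One score per fold; the mean summarises the model
print(len(cv_result['test_accuracy']))
print(np.mean(cv_result['test_accuracy']), np.mean(cv_result['test_roc_auc']))
```

Passing `cv=10` with a classifier, as the cells below do, also uses stratified folds, just without shuffling.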

In [None]:
# A line of code that helps us find the names of all the scoring metrics available in sklearn
# (get_scorer_names() replaces the SCORERS dictionary, which was removed in recent versions)

sklearn.metrics.get_scorer_names()

In [None]:
# 10FOLD - LOGISTIC REGRESSION (Baseline)

l_reg = LogisticRegression(class_weight = 'balanced',
                           random_state = 42)
log_model = l_reg.fit(X_train,y_train)
 

scores_log = cross_validate(log_model, X_train, y_train, cv=10,
                        scoring=('accuracy','precision_weighted','recall_weighted','f1_weighted', 'roc_auc'),
                        return_train_score=True)

# Creating a DataFrame of results

cv_scores_log = pd.DataFrame(scores_log, columns = scores_log.keys()).mean()
# Start the results table with one row of mean scores (DataFrame.append was removed in pandas 2.0)
cv_scores_avg = cv_scores_log.to_frame().T




# 10FOLD - RANDOM FOREST CLASSIFIER (Baseline)

rfc = RandomForestClassifier(class_weight = 'balanced',
                             random_state = 42)
rfc_model = rfc.fit(X_train,y_train)
 

scores_rfc = cross_validate(rfc_model, X_train, y_train, cv=10,
                        scoring=('accuracy','precision_weighted','recall_weighted','f1_weighted', 'roc_auc'),
                        return_train_score=True)

cv_scores_rfc = pd.DataFrame(scores_rfc, columns = scores_rfc.keys()).mean()
cv_scores_avg = pd.concat([cv_scores_avg, cv_scores_rfc.to_frame().T], ignore_index = True)

cv_scores_avg

**SMOTE**

Now, before optimizing the hyperparameters, I want to see whether SMOTE can increase the baseline performance. SMOTE is an upsampling technique that generates synthetic samples similar to those in the minority class until the two (or more) classes have the same size, forming a balanced dataset.

In [None]:
# Transforming the dataset

oversample = SMOTE()
X_smote, y_smote = oversample.fit_resample(train_dummy, labels)
counter = Counter(y_smote)
print(counter)

# Splitting the data

X_train_sm,X_val_sm,y_train_sm,y_val_sm = train_test_split(X_smote,y_smote,
                                               test_size = 0.1,
                                               stratify = y_smote,
                                               shuffle = True)
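
Because the cell above resamples the full `train_dummy` frame, rows that also belong to `X_test` can end up in, or shape, the SMOTE training data. A possible alternative is to split first and oversample only the training part. The sketch below illustrates this on hypothetical toy data; `smote_like` is a simplified interpolation written for illustration, not imblearn's implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def smote_like(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style interpolation (illustration only, not imblearn's SMOTE)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # k nearest minority neighbours per row
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                   # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Hypothetical toy data: 20 majority rows (0) and 10 minority rows (1)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(30, 4))
y_toy = np.array([0] * 20 + [1] * 10)

# Split first, so the held-out rows never contribute to the synthetic samples
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=42)

# Oversample the minority class of the training split only
n_missing = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_new = smote_like(X_tr[y_tr == 1], n_missing)
X_bal = np.vstack([X_tr, X_new])
y_bal = np.concatenate([y_tr, np.ones(n_missing, dtype=int)])
print(np.bincount(y_bal), X_te.shape)
```

With the real data, the same idea would mean calling `SMOTE().fit_resample` on the training split rather than on `train_dummy`.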

In [None]:
# 10FOLD - LOGISTIC REGRESSION (Baseline Smote)

l_reg = LogisticRegression(random_state = 42)
log_model_sm = l_reg.fit(X_train_sm,y_train_sm)
 

scores_log_sm = cross_validate(log_model_sm, X_train_sm, y_train_sm, cv=10,
                        scoring=('accuracy','precision_weighted','recall_weighted','f1_weighted', 'roc_auc'),
                        return_train_score=True)

In [None]:
# 10FOLD - RANDOM FOREST CLASSIFIER (Baseline Smote)

rfc = RandomForestClassifier(random_state = 42)
rfc_model_sm = rfc.fit(X_train_sm,y_train_sm)
 

scores_rfc_sm = cross_validate(rfc_model_sm, X_train_sm, y_train_sm, cv=10,
                        scoring=('accuracy','precision_weighted','recall_weighted','f1_weighted', 'roc_auc'),
                        return_train_score=True)

In [None]:
cv_scores_log_sm = pd.DataFrame(scores_log_sm, columns = scores_log_sm.keys()).mean()
cv_scores_rfc_sm = pd.DataFrame(scores_rfc_sm, columns = scores_rfc_sm.keys()).mean()
cv_scores_avg = pd.concat([cv_scores_avg,
                           cv_scores_log_sm.to_frame().T,
                           cv_scores_rfc_sm.to_frame().T], ignore_index = True)
cv_scores_avg['Classifiers'] = ['LOG','RFC','LOG_sm','RFC_sm']
cv_scores_avg

In [None]:
f, ax = plt.subplots(1,5, figsize = (25,5))

# x and y are passed by keyword: recent seaborn versions no longer accept them positionally
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_accuracy'], ax = ax[0])
ax[0].set_title('Accuracy Scores', fontsize = 13)
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_precision_weighted'], ax = ax[1])
ax[1].set_title('Precision Scores', fontsize = 13)
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_recall_weighted'], ax = ax[2])
ax[2].set_title('Recall Scores', fontsize = 13)
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_f1_weighted'], ax = ax[3])
ax[3].set_title('F1 Scores', fontsize = 13)
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_roc_auc'], ax = ax[4])
ax[4].set_title('ROC-AUC Scores', fontsize = 13)

plt.show()

In [None]:
f, ax = plt.subplots(1,2, figsize = (20,5))

# cross_validate reports fit_time and score_time in seconds
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['fit_time'], ax = ax[0])
ax[0].set_title('Training Time', fontsize = 13)
ax[0].set_ylabel('Time (s)')
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['score_time'], ax = ax[1])
ax[1].set_title('Prediction Time', fontsize = 13)
ax[1].set_ylabel('Time (s)')
plt.show()

Now we have our baseline values for the original and SMOTE datasets. From this first analysis, we can see that the Random Forest Classifier with upsampled (SMOTE) data seems to be the most promising. The cost of this performance is paid in terms of time: if we look at the second plot, we can see that the Random Forest Classifier is 2 times slower than the Logistic Regression. In this context that is negligible, but time is an important variable to consider when we train deep models. Let's try to optimize them!

**RANDOMIZED GRID SEARCH OPTIMIZATION:**

In [None]:
# What are the current parameters of the Logistic Regression model?

log_params = log_model.get_params()
log_params 

In [None]:
# Logistic Regression Optimization

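# Note: some penalty/solver pairs are unsupported (e.g. 'lbfgs' and 'newton-cg'
# do not support 'l1' or 'elasticnet'); those candidates fail to fit and get a NaN score.
log_parameters = dict(C = [0.5,1,1.5],
                      penalty = ['l2', 'l1','elasticnet'],
                      class_weight = ['balanced', None],
                      solver = ['liblinear','lbfgs','newton-cg'])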


log_RGS = RandomizedSearchCV(log_model, log_parameters, random_state=42)
search = log_RGS.fit(X_train, y_train)
opt_params = search.best_params_
opt_params
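
Not every penalty/solver pair is supported by `LogisticRegression`, so part of a flat grid can be wasted on candidates that fail to fit. One possible way around this is to pass `RandomizedSearchCV` a list of dictionaries, one per solver family. This is a sketch on hypothetical synthetic data; the grid `valid_grid` is an assumption for illustration, not the author's search space.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data standing in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

# One dict per solver family, so only supported penalty/solver pairs are sampled
valid_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'],
     'C': [0.5, 1, 1.5], 'class_weight': ['balanced', None]},
    {'solver': ['lbfgs', 'newton-cg'], 'penalty': ['l2'],
     'C': [0.5, 1, 1.5], 'class_weight': ['balanced', None]},
]

toy_search = RandomizedSearchCV(LogisticRegression(max_iter=1000), valid_grid,
                                n_iter=10, cv=5, random_state=42)
toy_search.fit(X_toy, y_toy)
print(toy_search.best_params_)
```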

In [None]:
# LOGISTIC REGRESSION WITH OPTIMIZED HYPERPARAMETERS

log_opt = LogisticRegression(**opt_params)
log_model_opt = log_opt.fit(X_train,y_train)

scores_log_opt = cross_validate(log_model_opt, X_train, y_train, cv=10,
                        scoring=('accuracy','precision_weighted','recall_weighted','f1_weighted', 'roc_auc'),
                        return_train_score=True)

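cv_scores_log_opt = pd.DataFrame(scores_log_opt, columns = scores_log_opt.keys()).mean()
cv_scores_avg = pd.concat([cv_scores_avg, cv_scores_log_opt.to_frame().T], ignore_index = True)

# Label the newly added (last) row; .loc avoids chained assignment
cv_scores_avg.loc[cv_scores_avg.index[-1], 'Classifiers'] = 'LOG_opt'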
cv_scores_avg

In [None]:
# What are the current parameters of the Random Forest Classifier?

rfc_params = rfc_model_sm.get_params()
rfc_params 

In [None]:
# Random Forest Classifier Optimization

rfc_parameters = dict(criterion = ['gini', 'entropy'],
                      ccp_alpha = [0.0,0.1,0.5],
                      bootstrap = [True,False])


rfc_RGS = RandomizedSearchCV(rfc_model_sm, rfc_parameters, random_state=42)
search_rfc = rfc_RGS.fit(X_train_sm, y_train_sm)
opt_params_rfc = search_rfc.best_params_
opt_params_rfc

In [None]:
# RANDOM FOREST CLASSIFIER WITH OPTIMIZED HYPERPARAMETERS

rfc_opt = RandomForestClassifier(**opt_params_rfc)
rfc_model_opt = rfc_opt.fit(X_train_sm,y_train_sm)

scores_rfc_opt = cross_validate(rfc_model_opt, X_train_sm, y_train_sm, cv=10,
                        scoring=('accuracy','precision_weighted','recall_weighted','f1_weighted', 'roc_auc'),
                        return_train_score=True)

rfc_pred_opt = rfc_model_opt.predict(X_test)
cv_scores_rfc_opt = pd.DataFrame(scores_rfc_opt, columns = scores_rfc_opt.keys()).mean()
cv_scores_avg = pd.concat([cv_scores_avg, cv_scores_rfc_opt.to_frame().T], ignore_index = True)

In [None]:
cv_scores_avg.loc[5, 'Classifiers'] = 'RFC_opt'  # .loc avoids chained assignment
cv_scores_avg

In [None]:
# Plotting the first tree of our optimized forest 

plt.figure(figsize = (20,10))
tree.plot_tree(rfc_opt.estimators_[0], feature_names=list(X_train.columns), filled=True, fontsize=7)
plt.show()

In [None]:
feat_importance = pd.Series(rfc_model_opt.feature_importances_, index=X_train_sm.columns)

plt.figure(figsize = (15,6))
top_features = feat_importance.nlargest(20)
sns.barplot(x = top_features.values, y = top_features.index)
plt.title("Random Forest features' importance", fontsize = 13)
plt.xlabel('Feature Importance')
plt.show()

In [None]:
# Plotting Classifiers Performances (all metrics)

f, ax = plt.subplots(1,5, figsize = (25,5))

# x and y are passed by keyword: recent seaborn versions no longer accept them positionally
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_accuracy'], ax = ax[0])
ax[0].set_title('Accuracy Scores', fontsize = 13)
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_precision_weighted'], ax = ax[1])
ax[1].set_title('Precision Scores', fontsize = 13)
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_recall_weighted'], ax = ax[2])
ax[2].set_title('Recall Scores', fontsize = 13)
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_f1_weighted'], ax = ax[3])
ax[3].set_title('F1 Scores', fontsize = 13)
sns.barplot(x = cv_scores_avg['Classifiers'], y = cv_scores_avg['test_roc_auc'], ax = ax[4])
ax[4].set_title('ROC-AUC Scores', fontsize = 13)

plt.show()

Now that we have our two optimized models, let's see if we can increase the performance using a different classifier: **CATBOOST**

In [None]:
# CATBOOST CLASSIFIER (Oversampled data)

cat = CatBoostClassifier(eval_metric = 'F1')

cat_model_sm = cat.fit(X_train_sm,y_train_sm,
                     eval_set = (X_val_sm,y_val_sm),
                     use_best_model=True,
                     verbose = 0,
                     plot=True)

In [None]:
# Performing a randomized search to find the best combination of parameters

grid = {'iterations': [500,1000],
        'learning_rate': [0.01, 0.03, 0.05],
        'depth': [2, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 9]}

final_model = CatBoostClassifier()
randomized_search_result = final_model.randomized_search(grid,
                                                   X = X_train_sm,
                                                   y= y_train_sm,
                                                   verbose = False,
                                                   plot=False)

best_params = randomized_search_result['params']
best_params['loss_function'] = 'Logloss'
best_params['eval_metric'] = 'F1'

In [None]:
best_params

In [None]:
from catboost import cv, Pool

cv_dataset = Pool(data = X_train_sm,
                  label = y_train_sm)

params = randomized_search_result['params']
                  
# params = {"iterations": 1000,
#           "learning_rate": 0.03,
#           'eval_metric': 'F1',
#           "depth": 2,
#           'l2_leaf_reg': 1,
#           "loss_function": "Logloss",
#           "verbose": False}

scores = cv(cv_dataset,
            params,
            fold_count=10,
            plot= False)

scores

In [None]:
# Plotting F1_Score and Log Loss

f, ax = plt.subplots(2,1, figsize = (15,8))
plt.subplots_adjust(left=None, bottom=None, right=None, top=1.3, wspace=None, hspace=None)

# The shaded bands drawn with fill_between show +/- one standard deviation across folds
sns.lineplot(x = scores['iterations'], y = scores['test-F1-mean'], ax = ax[0])
ax[0].set_title('CatBoost 10-Fold Cross Val F1 Score', fontsize = 14)
ax[0].fill_between(scores['iterations'], (scores['test-F1-mean'] + scores['test-F1-std']),
                  (scores['test-F1-mean'] - scores['test-F1-std']), color='b', alpha=.2)

sns.lineplot(x = scores['iterations'], y = scores['test-Logloss-mean'], ax = ax[1])
ax[1].fill_between(scores['iterations'], (scores['test-Logloss-mean'] + scores['test-Logloss-std']),
                  (scores['test-Logloss-mean'] - scores['test-Logloss-std']), color='b', alpha=.2)
ax[1].set_title('CatBoost 10-Fold Cross Val Log Loss', fontsize = 14)
plt.show()

In [None]:
cat_opt = CatBoostClassifier(**params)

cat_model_opt = cat_opt.fit(X_train_sm,y_train_sm,
                     eval_set = (X_val_sm,y_val_sm),
                     use_best_model=True,
                     verbose = 0,
                     plot=True)

In [None]:
# Features' importance of our model

feat_imp = cat_model_opt.get_feature_importance(prettified=True)

# Plotting top 20 features' importance

plt.figure(figsize = (15,6))
sns.barplot(x = feat_imp['Importances'], y = feat_imp['Feature Id'], orient = 'h')
plt.show()

**TESTING**

Now that we have different optimized models, we can see how they perform on the test data to make our final considerations!

In [None]:
# Testing

rfc_pred_opt = rfc_model_opt.predict(X_test)
cat_pred_opt = cat_model_opt.predict(X_test)
log_pred_opt = log_model_opt.predict(X_test)

# Plotting the confusion matrix of the results

conf_mx0 = confusion_matrix(y_test,log_pred_opt)
conf_mx1 = confusion_matrix(y_test,rfc_pred_opt)
conf_mx2 = confusion_matrix(y_test,cat_pred_opt)

heat_cm0 = pd.DataFrame(conf_mx0, columns=np.unique(y_test), index = np.unique(y_test))
heat_cm0.index.name = 'Actual'
heat_cm0.columns.name = 'Predicted'

heat_cm1 = pd.DataFrame(conf_mx1, columns=np.unique(y_test), index = np.unique(y_test))
heat_cm1.index.name = 'Actual'
heat_cm1.columns.name = 'Predicted'

heat_cm2 = pd.DataFrame(conf_mx2, columns=np.unique(y_test), index = np.unique(y_test))
heat_cm2.index.name = 'Actual'
heat_cm2.columns.name = 'Predicted'

f, ax = plt.subplots(1, 3, figsize=(12,8))
f.subplots_adjust(left=None, bottom=None, right= 2, top=None, wspace=None, hspace= None)

sns.heatmap(heat_cm0, cmap="Blues", annot=True, annot_kws={"size": 16},fmt='g', ax = ax[0])
ax[0].set_title('Logistic Regression', fontsize = 15)
sns.heatmap(heat_cm1, cmap="Blues", annot=True, annot_kws={"size": 16},fmt='g', ax = ax[1])
ax[1].set_title('Random Forest Classifier', fontsize = 15)
sns.heatmap(heat_cm2, cmap="Blues", annot=True, annot_kws={"size": 16},fmt='g', ax = ax[2])
ax[2].set_title('CatBoost Classifier', fontsize = 15)

plt.show()

In [None]:
# Pretty table to sum up all the results

my_table = PrettyTable(['Algorithm (opt)','Overall Accuracy','Precision','Recall','F1_Score','Roc-Auc'])

my_table.add_row(['Logistic Regression',
                 accuracy_score(y_test, log_pred_opt).round(4),
                 precision_score(y_test, log_pred_opt, average="binary", pos_label="1").round(4),
                 recall_score(y_test, log_pred_opt, average="binary", pos_label="1").round(4),
                 f1_score(y_test,log_pred_opt, average='weighted').round(4),
                 roc_auc_score(y_test,log_pred_opt, average='weighted').round(4)])

my_table.add_row(['Random Forest Classifier',
                 accuracy_score(y_test, rfc_pred_opt).round(4),
                 precision_score(y_test, rfc_pred_opt, average="binary", pos_label="1").round(4),
                 recall_score(y_test, rfc_pred_opt, average="binary", pos_label="1").round(4),
                 f1_score(y_test,rfc_pred_opt, average='weighted').round(4),
                 roc_auc_score(y_test,rfc_pred_opt, average='weighted').round(4)])

my_table.add_row(['CatBoost Classifier',
                 accuracy_score(y_test, cat_pred_opt).round(4),
                 precision_score(y_test, cat_pred_opt, average="binary", pos_label="1").round(4),
                 recall_score(y_test, cat_pred_opt, average="binary", pos_label="1").round(4),
                 f1_score(y_test,cat_pred_opt, average='weighted').round(4),
                 roc_auc_score(y_test,cat_pred_opt, average='weighted').round(4)])



print(my_table)
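
The ROC-AUC values in the table are computed from hard class predictions. ROC-AUC is more commonly computed from scores or probabilities, which use the full ranking of the test rows instead of a single operating point. The sketch below contrasts the two on hypothetical synthetic data; the names `X_toy`, `y_toy` and `clf` are made up for illustration and the numbers it prints say nothing about the models above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the notebook's features and labels
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: a single operating point
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the positive-class probability: the full ranking of the test rows
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_from_labels, auc_from_proba)
```

For the notebook's models, the equivalent would be passing `predict_proba(X_test)[:, 1]` instead of the predicted labels to `roc_auc_score`.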