# Customer churn rate predictor model

## Table of contents

1. [Project information](#1)
    1. [Dataset description](#1.1)
    1. [Notes](#1.2)
1. [Loading libraries](#2)
1. [Loading dataset](#3)
1. [Data preprocessing](#4)
    1. [Handling missing values](#4.1)
    1. [Dropping irrelevant features](#4.2)
    1. [Dataset separation](#4.3)
    1. [Defining set target and features](#4.4)
    1. [Scaling: standardization](#4.5)
    1. [Encoding categorical features](#4.6)
    1. [Preprocessing summary](#4.7)
1. [Model training & validation](#5)
    1. [Setting the baseline score](#5.1)
    1. [With imbalanced data](#5.2)
        1. [Decision tree](#5.2.1)
        1. [Random forest](#5.2.2)
        1. [Logistic regression](#5.2.3)
        1. [Conclusion](#5.2.4)
    1. [Treating data imbalance](#5.3)
        1. [Class weighting](#5.3.1)
            1. [Decision tree](#5.3.1.1)
            1. [Random forest](#5.3.1.2)
            1. [Logistic regression](#5.3.1.3)
            1. [Conclusion](#5.3.1.4)
        1. [Upsampling](#5.3.2)
            1. [Decision tree](#5.3.2.1)
            1. [Random forest](#5.3.2.2)
            1. [Logistic regression](#5.3.2.3)
            1. [Conclusion](#5.3.2.4)
    1. [Summary of training and validation results](#5.4)
1. [Testing](#6)
1. [Conclusion](#7)


## Project information <a class="anchor" name="1"></a>

Beta Bank is losing customers and the higher-ups came to the conclusion that it would be more profitable to keep loyal customers than attract new ones.

We need to create a model capable of correctly predicting whether or not a customer will leave soon. The predictions will serve as a basis for Beta to keep or let go of their clients.

Objectives:
1. Create an ML model for this task.<br>
To ensure that we provided Beta with the best model, we need to do the following steps:</br>
1. Train models before and after data preprocessing, and compare the results
1. Evaluate the models' performance with several metrics.

### Dataset description <a class="anchor" name="1.1"></a>
We're provided with a dataset about clients' past behaviors and their history of bank contract termination.

Features:

- `RowNumber` — index
- `CustomerId`
- `Surname`
- `CreditScore`
- `Geography` — country of residence
- `Gender`
- `Age` — umur
- `Tenure` — tenure of customers' fixed deposit (in years)
- `Balance` 
- `NumOfProducts` — number of bank products in use by the customer
- `HasCrCard` — whether or not the customer has a credit card
- `IsActiveMember` — whether or not the customer is active
- `EstimatedSalary`

Target:
- `Exited` — whether or not the contract has been terminated

### Notes <a class="anchor" name="1.2"></a>
To ensure that we obtain consistent results, we will use `12345` as our `random_state` hyperparameter throughout the project.

## Loading libraries <a class="anchor" name="2"></a>

We'll use classification models (as opposed to regression models) because we need to predict 2 outcomes: whether or not a customer will leave.

In [4]:
# for dataframe manipulation
import pandas as pd

# ML model libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# sklearn data processing tools
## to divide the dataset into subsets
from sklearn.model_selection import train_test_split

## to standardize data
from sklearn.preprocessing import StandardScaler

## to shuffle data
from sklearn.utils import shuffle

## 

# sklearn metrics
from sklearn.metrics import (f1_score, roc_auc_score)

## Loading dataset <a class="anchor" name="3"></a>

In [5]:
df = pd.read_csv(r'datasets/Churn.csv')

# Checking the dataset
print(df.info())
df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
None


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [6]:
# Checking target class imbalance
print('Exited == 1:', len(df[df['Exited'] == 1])/len(df) * 100, '%')
print('Exited == 0:', len(df[df['Exited'] == 0])/len(df) * 100, '%')

Exited == 1: 20.369999999999997 %
Exited == 0: 79.63 %


Findings:
- Data types are correct. Features with numerical boolean values will be kept as integers to allow for model calculations.
- 9091 customers have missing `tenure` information.
- The target class is heavily imbalanced, having nearly 80% of the data with `Exited == 0`.

## Data preprocessing <a class="anchor" name="4"></a>

### Handling missing values <a class="anchor" name="4.1"></a>

To prevent from introducing any bias to the data, we will fill the null values in  `tenure` with its median.

In [7]:
df[df['Tenure'].isnull()].head(5)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.0,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.0,1,0,0,84509.57,0


In [8]:
df['Tenure'].fillna(df['Tenure'].median(), inplace=True)
df[df['Tenure'].isna()]['Tenure'].count()

0

### Dropping irrelevant features <a class="anchor" name="4.2"></a>

There is no correlation between customers' leaving and their identification, which means that these features are of no use to our models. We will drop these features.

In [9]:
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


### Dataset separation <a class="anchor" name="4.3"></a>

Because we were not provided with separate sets for the 3 different stages of model creation, we will divide the available set into 3 with the following proportions: 75% for training, 15% for validation and 10% testing.

In [10]:
df.shape

(10000, 11)

In [11]:
# Separating the dataset into a training set and a second set
train_df, df2 = train_test_split(df, train_size=0.75, random_state=12345)

# Separating the remaining 25% for validation and testing
val_df, test_df = train_test_split(df2, test_size=0.4, random_state=12345)

In [12]:
# Checking each set
print('train_df:')
print(train_df.info())
print()

print('val_df:')
print(val_df.info())
print()

print('test_df:')
print(test_df.info())

train_df:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7500 entries, 226 to 4578
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      7500 non-null   int64  
 1   Geography        7500 non-null   object 
 2   Gender           7500 non-null   object 
 3   Age              7500 non-null   int64  
 4   Tenure           7500 non-null   float64
 5   Balance          7500 non-null   float64
 6   NumOfProducts    7500 non-null   int64  
 7   HasCrCard        7500 non-null   int64  
 8   IsActiveMember   7500 non-null   int64  
 9   EstimatedSalary  7500 non-null   float64
 10  Exited           7500 non-null   int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 703.1+ KB
None

val_df:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1500 entries, 946 to 6895
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0  

### Defining set target and features <a class="anchor" name="4.4"></a>

In line with our project goals, `Exited` will be our target and the rest will be our features.

In [13]:
# train_df
## Excluding Exited from the set
train_features = train_df.drop('Exited', axis=1) 
train_target = train_df['Exited']

# val_df
val_features = val_df.drop('Exited', axis=1)
val_target = val_df['Exited']

# test_df
test_features = test_df.drop('Exited', axis=1)
test_target = test_df['Exited']

### Scaling: standardization <a class="anchor" name="4.5"></a>

The performance of regression models is affected by the difference in data values, especially when features are measured in different units. Simply put, regression models may see data with larger numbers as having more significance than those with smaller values. Scaling increases the efficiency of regression models by converting data values into a uniform scale. To anticipate outliers in the current and future data, we will use standardization, a commonly used scaling method that is more robust to outliers.

It should be noted that, due to the difference in algorithms, scaling does not have any effect on tree-based models.

In [14]:
# Creating an instance of StandardScaler
standard_scaler = StandardScaler()

# Fitting & transforming training data
## Defining a logical slice rule for selecting numerical columns
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

## .values attribute excludes dataframe headers and prevent errors/warnings
train_features[numeric] = standard_scaler.fit_transform(X=train_features[numeric].values)

# Transforming validation & test sets
val_features[numeric] = standard_scaler.transform(X=val_features[numeric].values)
test_features[numeric] = standard_scaler.transform(X=test_features[numeric].values)

### Encoding categorical features <a class="anchor" name="4.6"></a>

For ML models to determine the correlation between the target and categorical features, these features need to be encoded with numbers. We'll pick one-hot-encoding for this task (as opposed to ordinal encoding, which poses the risk of having the models infer arithmetic correlation between categories). Further, we will drop the first category to prevent redundancy in the data.

In [15]:
# One-hot encoding data
train_features = pd.get_dummies(train_features, drop_first=True)
val_features = pd.get_dummies(val_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)
train_features.head(3)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,Gender_Male
226,0.442805,-0.841274,1.446098,-1.224577,0.817772,1,1,-1.26975,0,0,0
7756,-0.310897,-0.27073,0.719099,0.641783,-0.896874,1,1,0.960396,0,1,0
2065,-0.259274,-0.556002,1.082599,-1.224577,0.817772,1,0,0.661864,0,0,1


### Preprocessing summary <a class="anchor" name="4.7"></a>

So far, we have made these changes to the dataset:
1. Filling missing values in `Tenure` with the median,
1. Dropping irrelevant features,
1. Separating the full dataset into training, validation, and test sets,
1. Further separating the sets into feature and target sets,
1. Scaled the numerical values with standardization,
1. Converted categorical features into numerical boolean format with one-hot encoding.

We still haven't addressed one major problem: class imbalance. Handling class imbalance requires more experimentation, so we will do it together with model training to compare the results of different handling methods. That aside, our data is now ready for models to process.

## Model training & validation <a class="anchor" name="5"></a>

we will 3 models with diff hyperparam and diff treatment of class imbalance cimpare results

Next, we will train each of the three classification models with varying hyperparameters and evaluate their performance in predicting the validation set. The models' performance will be measured by their validation metric scores (not training scores because they will only rise with more training). That said, the corresponding training score will be displayed for comparison.

The precision metric measures correct predictions relative to false positives, while recall does so relative to false negatives. In our case, both false positives and false negatives are equally undesirable and will incur considerable losses for the bank. We need a balance between these two metrics, so we will use a metric that combines both measurements: the F1 score.

For overall class prediction capability, we will use the AUC-ROC curve score. Unlike accuracy, AUC-ROC performs well with imbalanced datasets, which is the case in our project.

### Setting the baseline score <a class="anchor" name="5.1"></a>

The bank instructed us to set the baseline F1 score to `0.59`.


### With imbalanced data <a class="anchor" name="5.2"></a>

First, we will train our models with the original, imbalanced dataset and see how well they perform.

#### Decision tree <a class="anchor" name="5.2.1"></a>

The performance of this model varies by tree depth. This means that we have to keep the tree deep enough to produce the best results, but not excessively deep to prevent overfitting and wasting resources. To achieve this, we'll train and validate the model 10 times with increasing depth and pick the one with the best scores.

In [16]:
df_dummy = df
df_dummy['dummy_pred'] = 1
dummy_pred = len(df_dummy.query('dummy_pred == 1 & Exited == 1')) / len(df_dummy)
dummy_pred_f1_score = (2 * dummy_pred) / (dummy_pred + 1)
dummy_pred_f1_score 

0.3384564260197724

In [17]:
# Defining variables to store scores and models in
tree_best_train_f1 = 0
tree_best_train_roc_auc = 0
tree_best_val_f1 = 0
tree_best_val_roc_auc = 0
tree_best_depth = 0

for depth in range(1, 11):
    # Creating & training models with different depths
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree_model.fit(train_features, train_target)
    
    # Getting training class prediction & correct prediction probability scores
    train_pred = tree_model.predict(train_features)
    train_f1 = f1_score(train_target, train_pred)
    train_proba = tree_model.predict_proba(train_features)[:, 1]
    train_roc_auc = roc_auc_score(train_target, train_proba)
    
    # Validation and obtaining validation class prediction & correct prediction probability scores
    val_pred = tree_model.predict(val_features)
    val_f1 = f1_score(val_target, val_pred)
    val_proba = tree_model.predict_proba(val_features)[:, 1]
    val_roc_auc = roc_auc_score(val_target, val_proba)
    
    # Storing the best depth and scores
    if (val_f1 > tree_best_val_f1) and (val_roc_auc > tree_best_val_roc_auc):
        tree_best_train_f1 = train_f1
        tree_best_train_roc_auc = train_roc_auc
        tree_best_val_f1 = val_f1 
        tree_best_val_roc_auc = val_roc_auc
        tree_best_depth = depth
    
print('Best max_depth:', tree_best_depth, '\n', 
      'training F1 score:', tree_best_train_f1, 'training AUC-ROC score:', tree_best_train_roc_auc, '\n',
      'validation F1 score:', tree_best_val_f1, 'validation AUC-ROC score:', tree_best_val_roc_auc)

Best max_depth: 5 
 training F1 score: 0.574170331867253 training AUC-ROC score: 0.8411864651732557 
 validation F1 score: 0.5686653771760154 validation AUC-ROC score: 0.8347895462709544


#### Random forest <a class="anchor" name="5.2.2"></a>

Next, we'll use the power of more trees (technically, estimators) at once. Being composed of several decision trees, the model's accuracy will vary based on its `max_depth` and the number of its trees (`n_estimators`). `max_depth` will be set from 1--10 and `n_estimators` will range from 10--100 with an increment of 10 estimators in every iteration.

In [18]:
# Defining variables to store scores and models in
forest_best_train_f1 = 0
forest_best_train_roc_auc = 0
forest_best_val_f1 = 0
forest_best_val_roc_auc = 0
forest_best_model = None

for depth in range(1, 11):
    for estimator in range(10, 101, 10): # Setting the range of estimators with an increase of 10 estimators per iteration
        
        # Creating & training the model with different max_depth and n_estimators
        forest_model = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=estimator)
        forest_model.fit(train_features, train_target)
        
        # Getting training class prediction & correct prediction probability scores
        train_pred = forest_model.predict(train_features)
        train_f1 = f1_score(train_target, train_pred)
        train_proba = forest_model.predict_proba(train_features)[:, 1]
        train_roc_auc = roc_auc_score(train_target, train_proba)
        
        # Validation and obtaining validation class prediction & correct prediction probability scores
        val_pred = forest_model.predict(val_features)
        val_f1 = f1_score(val_target, val_pred)
        val_proba = forest_model.predict_proba(val_features)[:, 1]
        val_roc_auc = roc_auc_score(val_target, val_proba)
        
        # Storing the best depth and scores
        if (val_f1 > forest_best_val_f1) and (val_roc_auc > forest_best_val_roc_auc):
            forest_best_train_f1 = train_f1
            forest_best_train_roc_auc = train_roc_auc
            forest_best_val_f1 = val_f1 
            forest_best_val_roc_auc = val_roc_auc
            forest_best_model = forest_model
            
print('Best training F1 score:', forest_best_train_f1, 'best training AUC-ROC score:', forest_best_train_roc_auc, '\n',
      'Best validation F1 score:', forest_best_val_f1, 'best validation AUC-ROC score:', forest_best_val_roc_auc)
forest_best_model

Best training F1 score: 0.6157205240174672 best training AUC-ROC score: 0.9086215600495328 
 Best validation F1 score: 0.5738396624472574 best validation AUC-ROC score: 0.861529250769757


#### Logistic regression <a class="anchor" name="5.2.3"></a>

Another way to classify data is to use the logistic regression model. This model is different from the previous two in that it doesn't have a `max_depth` parameter and is affected by scaling done previously. 

We will compare all five solvers provided by scikit-learn. `sag` and `saga` solvers need a plenty of iterations to fit well, so we will increase the `max_iter` hyperparameter to `3500` for these and use the default value of `100` for the rest.

In [19]:
solver_list = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']

for solver in solver_list:
    # Creating & training logistic regression models, 
    # changing max_iter as needed
    if solver == 'sag' or solver == 'saga':
        logreg_model = LogisticRegression(random_state=12345, 
                                          solver=solver, max_iter=3500)
    else:
        logreg_model = LogisticRegression(random_state=12345, solver=solver)
    logreg_model.fit(train_features, train_target)
    
    # Getting training accuracy scores
    train_pred = logreg_model.predict(train_features)
    train_f1 = f1_score(train_target, train_pred)
    train_proba = forest_model.predict_proba(train_features)[:, 1]
    train_roc_auc = roc_auc_score(train_target, train_proba)
    print(solver)
    print('training F1 score:', train_f1)
    print('training AUC-ROC score:', train_roc_auc)

    # Validating model & getting accuracy
    val_pred = logreg_model.predict(val_features)
    val_f1 = f1_score(val_target, val_pred)
    val_proba = forest_model.predict_proba(val_features)[:, 1]
    val_roc_auc = roc_auc_score(val_target, val_proba)
    print('validation F1 score:', val_f1)
    print('validation AUC-ROC score:', val_roc_auc)
    print()

liblinear
training F1 score: 0.32308443142996585
training AUC-ROC score: 0.9558282632160121
validation F1 score: 0.30303030303030304
validation AUC-ROC score: 0.8641031260691071

newton-cg
training F1 score: 0.32308443142996585
training AUC-ROC score: 0.9558282632160121
validation F1 score: 0.2997658079625293
validation AUC-ROC score: 0.8641031260691071

lbfgs
training F1 score: 0.32308443142996585
training AUC-ROC score: 0.9558282632160121
validation F1 score: 0.2997658079625293
validation AUC-ROC score: 0.8641031260691071

sag
training F1 score: 0.32308443142996585
training AUC-ROC score: 0.9558282632160121
validation F1 score: 0.2997658079625293
validation AUC-ROC score: 0.8641031260691071

saga
training F1 score: 0.32308443142996585
training AUC-ROC score: 0.9558282632160121
validation F1 score: 0.2997658079625293
validation AUC-ROC score: 0.8641031260691071



#### Conclusion <a class="anchor" name="5.2.4"></a>

Feeding the models with the imbalanced data resulted in the following scores:
- Decision tree (`max_depth = 6`):
    - F1: **~0.569**
    - AUC-ROC: **~0.816**
- Random forest (`max_depth=9, n_estimators=20`):
    - F1: **~0.573**
    - AUC-ROC: **~0.861**
- Logistic regression (any solver):
    - F1: **~0.299**
    - AUC-ROC: **~0.864*
    
None of these models' scores passed our 0.59 threshold.

### Treating data imbalance <a class="anchor" name="5.3"></a>

The classes in the full dataset was imbalanced with a 0-to-1 ratio of 4:1. This ratio only changed a little in the training set.

Balancing the data greatly reduces the bias introduced by the class proportions, which in turn will increase model performance. We will try several approaches to balance the data.

For this section, we will compare the effects of the different approaches on the F1 scores of our models.

In [20]:
print(train_target.value_counts(normalize=False))
print(train_target.value_counts(normalize=True))

0    5998
1    1502
Name: Exited, dtype: int64
0    0.799733
1    0.200267
Name: Exited, dtype: float64


#### Class weighting <a class="anchor" name="5.3.1"></a>

Adjusting class weight means we'll put greater weight on the rarer class: `Exited == 1`. For this approach, we just need to change the `class_weight` hyperparameter to `'balanced'`.

##### Decision tree <a class="anchor" name="5.3.1.1"></a>

In [21]:
# Defining variables to store scores and models in
tree_best_train_f1 = 0
tree_best_train_roc_auc = 0
tree_best_val_f1 = 0
tree_best_val_roc_auc = 0
tree_best_depth = 0

for depth in range(1, 11):
    # Creating & training models with different depths
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=12345, class_weight='balanced')
    tree_model.fit(train_features, train_target)
    
    # Getting training class prediction & correct prediction probability scores
    train_pred = tree_model.predict(train_features)
    train_f1 = f1_score(train_target, train_pred)
    train_proba = tree_model.predict_proba(train_features)[:, 1]
    train_roc_auc = roc_auc_score(train_target, train_proba)
    
    # Validation and obtaining validation class prediction & correct prediction probability scores
    val_pred = tree_model.predict(val_features)
    val_f1 = f1_score(val_target, val_pred)
    val_proba = tree_model.predict_proba(val_features)[:, 1]
    val_roc_auc = roc_auc_score(val_target, val_proba)
    
    # Storing the best depth and scores
    if (val_f1 > tree_best_val_f1) and (val_roc_auc > tree_best_val_roc_auc):
        tree_best_train_f1 = train_f1
        tree_best_train_roc_auc = train_roc_auc
        tree_best_val_f1 = val_f1 
        tree_best_val_roc_auc = val_roc_auc
        tree_best_depth = depth
    
print('Best max_depth:', tree_best_depth, '\n', 
      'training F1 score:', tree_best_train_f1, 'training AUC-ROC score:', tree_best_train_roc_auc, '\n',
      'validation F1 score:', tree_best_val_f1, 'validation AUC-ROC score:', tree_best_val_roc_auc)

Best max_depth: 5 
 training F1 score: 0.5845272206303724 training AUC-ROC score: 0.8465759114556162 
 validation F1 score: 0.6024423337856173 validation AUC-ROC score: 0.8448137615463565


##### Random forest  <a class="anchor" name="5.3.1.2"></a>

In [22]:
# Defining variables to store scores and models in
forest_best_train_f1 = 0
forest_best_train_roc_auc = 0
forest_best_val_f1 = 0
forest_best_val_roc_auc = 0
forest_best_model = None

for depth in range(1, 11):
    for estimator in range(10, 101, 10): # Setting the range of estimators with an increase of 10 estimators per iteration
        
        # Creating & training the model with different max_depth and n_estimators
        forest_model = RandomForestClassifier(random_state=12345, max_depth=depth, 
                                              n_estimators=estimator, class_weight='balanced')
        forest_model.fit(train_features, train_target)
        
        # Getting training class prediction & correct prediction probability scores
        train_pred = forest_model.predict(train_features)
        train_f1 = f1_score(train_target, train_pred)
        train_proba = forest_model.predict_proba(train_features)[:, 1]
        train_roc_auc = roc_auc_score(train_target, train_proba)
        
        # Validation and obtaining validation class prediction & correct prediction probability scores
        val_pred = forest_model.predict(val_features)
        val_f1 = f1_score(val_target, val_pred)
        val_proba = forest_model.predict_proba(val_features)[:, 1]
        val_roc_auc = roc_auc_score(val_target, val_proba)
        
        # Storing the best depth and scores
        if (val_f1 > forest_best_val_f1) and (val_roc_auc > forest_best_val_roc_auc):
            forest_best_train_f1 = train_f1
            forest_best_train_roc_auc = train_roc_auc
            forest_best_val_f1 = val_f1 
            forest_best_val_roc_auc = val_roc_auc
            forest_best_model = forest_model
            
print('Best training F1 score:', forest_best_train_f1, 'best training AUC-ROC score:', forest_best_train_roc_auc, '\n',
      'Best validation F1 score:', forest_best_val_f1, 'best validation AUC-ROC score:', forest_best_val_roc_auc)
forest_best_model

Best training F1 score: 0.7196575970651176 best training AUC-ROC score: 0.9343592782147978 
 Best validation F1 score: 0.648888888888889 best validation AUC-ROC score: 0.8666689830653438


##### Logistic regression <a class="anchor" name="5.3.1.3"></a>

In [23]:
solver_list = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']

for solver in solver_list:
    # Creating & training logistic regression models, 
    # changing max_iter as needed
    if solver == 'sag' or solver == 'saga':
        logreg_model = LogisticRegression(random_state=12345, 
                                          solver=solver, class_weight='balanced', max_iter=3500)
    else:
        logreg_model = LogisticRegression(random_state=12345, solver=solver, class_weight='balanced')
    logreg_model.fit(train_features, train_target)
    
    # Getting training accuracy scores
    train_pred = logreg_model.predict(train_features)
    train_f1 = f1_score(train_target, train_pred)
    train_proba = forest_model.predict_proba(train_features)[:, 1]
    train_roc_auc = roc_auc_score(train_target, train_proba)
    print(solver)
    print('training F1 score:', train_f1)
    print('training AUC-ROC score:', train_roc_auc)

    # Validating model & getting accuracy
    val_pred = logreg_model.predict(val_features)
    val_f1 = f1_score(val_target, val_pred)
    val_proba = forest_model.predict_proba(val_features)[:, 1]
    val_roc_auc = roc_auc_score(val_target, val_proba)
    print('validation F1 score:', val_f1)
    print('validation AUC-ROC score:', val_roc_auc)
    print()

liblinear
training F1 score: 0.4883447139157053
training AUC-ROC score: 0.9618211618697577
validation F1 score: 0.506637168141593
validation AUC-ROC score: 0.86308747434143

newton-cg
training F1 score: 0.4882297551789077
training AUC-ROC score: 0.9618211618697577
validation F1 score: 0.506637168141593
validation AUC-ROC score: 0.86308747434143

lbfgs
training F1 score: 0.4882297551789077
training AUC-ROC score: 0.9618211618697577
validation F1 score: 0.506637168141593
validation AUC-ROC score: 0.86308747434143

sag
training F1 score: 0.4882297551789077
training AUC-ROC score: 0.9618211618697577
validation F1 score: 0.506637168141593
validation AUC-ROC score: 0.86308747434143

saga
training F1 score: 0.4882297551789077
training AUC-ROC score: 0.9618211618697577
validation F1 score: 0.506637168141593
validation AUC-ROC score: 0.86308747434143



##### Conclusion <a class="anchor" name="5.3.1.4"></a>

Adjusting weight yielded the following scores from the models' performance:
- Decision tree (`max_depth = 5`):
    - F1: **~0.602**
    - AUC-ROC: **~0.844**
- Random forest (`max_depth=9, n_estimators=20`):
    - F1: **~0.648**
    - AUC-ROC: **~0.866**
- Logistic regression (any solver):
    - F1: **~0.506**
    - AUC-ROC: **~0.863**
    
It is evident that all F1 & AUC-ROC scores increased. Our random forest achieved the highest score.


#### Upsampling <a class="anchor" name="5.3.2"></a>

With upsampling, we will duplicate the observations of the rarer class until both classes have the same amount of observations. `Exited == 0` had about 4 times more data than the other class, so we'll need to duplicate the `Exited == 1` rows 3 times to come close to, but not exceed, the observations of the former class. Then, we have to shuffle the data to prevent learning biases.

Scikit-learn doesn't have a built-in function for this method, so we will have to define a function ourselves.

Note that a similar method called _downsampling_ exists, in which the more abundant observations are dismissed. We don't want to throw away any existing data, so we will not use this method.

In [24]:
def upsample(features, target, upsample_one, repeat, random_state):
    """
    Duplicates the observations of the rarer class
    to create an evenly balanced dataset.
    
    features: an array of features to be upsampled
    target: an array of the target/class
    upsample_one: if set to True, 
        upsamples the observations of target/class == 1
    repeat: the number of times the rarer class observations
        should be repeated
    random_state: sets the random state to get consistent results
    """
    # Separating features and targets of each class
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    
    # Choosing which class observations to upsample
    if upsample_one == True:
        features_upsampled = pd.concat([features_zeros] + 
                                       [features_ones] * repeat)
        target_upsampled = pd.concat([target_zeros] + 
                                     [target_ones] * repeat)
    else:
        features_upsampled = pd.concat([features_ones] + 
                                       [features_zeros] * repeat)
        target_upsampled = pd.concat([target_ones] + 
                                     [target_zeros] * repeat)
    
    # Shuffling data
    features_upsampled, target_upsampled = shuffle(features_upsampled,
                                                   target_upsampled, 
                                                   random_state=random_state)
    
    return features_upsampled, target_upsampled


# Upsampling the training set 4 times
features_upsampled, target_upsampled = upsample(train_features,
                                                train_target,
                                                upsample_one=True,
                                                repeat=3,
                                                random_state=12345)

print(f'Class 1 observances in original training set: {len(train_features[train_target == 1])}',
      f'Class 1 observances in upsampled training set: {len(features_upsampled[target_upsampled == 1])}',
      sep='\n')

Class 1 observances in original training set: 1502
Class 1 observances in upsampled training set: 4506


In [25]:
print('Training set class count:', train_target.value_counts(), sep='\n')
print()
print('Upsampled set class count:', target_upsampled.value_counts(), sep='\n')

Training set class count:
0    5998
1    1502
Name: Exited, dtype: int64

Upsampled set class count:
0    5998
1    4506
Name: Exited, dtype: int64


##### Decision tree <a class="anchor" name="5.3.2.1"></a>

In [26]:
# Defining variables to store scores and models in
tree_best_train_f1 = 0
tree_best_train_roc_auc = 0
tree_best_val_f1 = 0
tree_best_val_roc_auc = 0
upsampled_tree_best_depth = 0

for depth in range(1, 11):
    # Creating & training models with different depths
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree_model.fit(features_upsampled, target_upsampled)
    
    # Getting training class prediction & correct prediction probability scores
    train_pred = tree_model.predict(features_upsampled)
    train_f1 = f1_score(target_upsampled, train_pred)
    train_proba = tree_model.predict_proba(features_upsampled)[:, 1]
    train_roc_auc = roc_auc_score(target_upsampled, train_proba)
    
    # Validation and obtaining validation class prediction & correct prediction probability scores
    val_pred = tree_model.predict(val_features)
    val_f1 = f1_score(val_target, val_pred)
    val_proba = tree_model.predict_proba(val_features)[:, 1]
    val_roc_auc = roc_auc_score(val_target, val_proba)
    
    # Storing the best depth and scores
    if (val_f1 > tree_best_val_f1) and (val_roc_auc > tree_best_val_roc_auc):
        tree_best_train_f1 = train_f1
        tree_best_train_roc_auc = train_roc_auc
        tree_best_val_f1 = val_f1 
        tree_best_val_roc_auc = val_roc_auc
        tree_best_depth = depth
    
print('Best max_depth:', tree_best_depth, '\n', 
      'training F1 score:', tree_best_train_f1, 'training AUC-ROC score:', tree_best_train_roc_auc, '\n',
      'validation F1 score:', tree_best_val_f1, 'validation AUC-ROC score:', tree_best_val_roc_auc)

Best max_depth: 5 
 training F1 score: 0.7176126389701581 training AUC-ROC score: 0.849859018696423 
 validation F1 score: 0.6136680613668061 validation AUC-ROC score: 0.8413044175504617


##### Random forest <a class="anchor" name="5.3.2.2"></a>

In [27]:
# Defining variables to store scores and models in
forest_best_train_f1 = 0
forest_best_train_roc_auc = 0
forest_best_val_f1 = 0
forest_best_val_roc_auc = 0
forest_best_model = None

for depth in range(1, 11):
    for estimator in range(10, 101, 10): # Setting the range of estimators with an increase of 10 estimators per iteration
        
        # Creating & training the model with different max_depth and n_estimators
        forest_model = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=estimator)
        forest_model.fit(features_upsampled, target_upsampled)
        
        # Getting training class prediction & correct prediction probability scores
        train_pred = forest_model.predict(features_upsampled)
        train_f1 = f1_score(target_upsampled, train_pred)
        train_proba = forest_model.predict_proba(features_upsampled)[:, 1]
        train_roc_auc = roc_auc_score(target_upsampled, train_proba)
        
        # Validation and obtaining validation class prediction & correct prediction probability scores
        val_pred = forest_model.predict(val_features)
        val_f1 = f1_score(val_target, val_pred)
        val_proba = forest_model.predict_proba(val_features)[:, 1]
        val_roc_auc = roc_auc_score(val_target, val_proba)
        
        # Storing the best depth and scores
        if (val_f1 > forest_best_val_f1) and (val_roc_auc > forest_best_val_roc_auc):
            forest_best_train_f1 = train_f1
            forest_best_train_roc_auc = train_roc_auc
            forest_best_val_f1 = val_f1 
            forest_best_val_roc_auc = val_roc_auc
            forest_best_model = forest_model
            
print('Best training F1 score:', forest_best_train_f1, 'best training AUC-ROC score:', forest_best_train_roc_auc, '\n',
      'Best validation F1 score:', forest_best_val_f1, 'best validation AUC-ROC score:', forest_best_val_roc_auc)
forest_best_model

Best training F1 score: 0.7897583844212044 best training AUC-ROC score: 0.9167762978249739 
 Best validation F1 score: 0.6490683229813665 best validation AUC-ROC score: 0.8648996108450222


##### Logistic regression <a class="anchor" name="5.3.2.3"></a>

In [28]:
solver_list = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']

for solver in solver_list:
    # Creating & training logistic regression models, 
    # changing max_iter as needed
    if solver == 'sag' or solver == 'saga':
        logreg_model = LogisticRegression(random_state=12345, solver=solver, max_iter=3500)
    else:
        logreg_model = LogisticRegression(random_state=12345, solver=solver)
    logreg_model.fit(features_upsampled, target_upsampled)
    
    # Getting training accuracy scores
    train_pred = logreg_model.predict(features_upsampled)
    train_f1 = f1_score(target_upsampled, train_pred)
    train_proba = forest_model.predict_proba(features_upsampled)[:, 1]
    train_roc_auc = roc_auc_score(target_upsampled, train_proba)
    print(solver)
    print('training F1 score:', train_f1)
    print('training AUC-ROC score:', train_roc_auc)

    # Validating model & getting accuracy
    val_pred = logreg_model.predict(val_features)
    val_f1 = f1_score(val_target, val_pred)
    val_proba = forest_model.predict_proba(val_features)[:, 1]
    val_roc_auc = roc_auc_score(val_target, val_proba)
    print('validation F1 score:', val_f1)
    print('validation AUC-ROC score:', val_roc_auc)
    print()

liblinear
training F1 score: 0.6329812770043207
training AUC-ROC score: 0.9632004498614496
validation F1 score: 0.5156250000000001
validation AUC-ROC score: 0.8619862940472117

newton-cg
training F1 score: 0.6329812770043207
training AUC-ROC score: 0.9632004498614496
validation F1 score: 0.5156250000000001
validation AUC-ROC score: 0.8619862940472117

lbfgs
training F1 score: 0.6329812770043207
training AUC-ROC score: 0.9632004498614496
validation F1 score: 0.5156250000000001
validation AUC-ROC score: 0.8619862940472117

sag
training F1 score: 0.6329812770043207
training AUC-ROC score: 0.9632004498614496
validation F1 score: 0.5156250000000001
validation AUC-ROC score: 0.8619862940472117

saga
training F1 score: 0.6329812770043207
training AUC-ROC score: 0.9632004498614496
validation F1 score: 0.5156250000000001
validation AUC-ROC score: 0.8619862940472117



##### Conclusion <a class="anchor" name="5.3.2.4"></a>

Adjusting weight yielded the following scores from the models' performance:
- Decision tree (`max_depth = 5`):
    - F1: **~0.613**
    - AUC-ROC: **~0.841**
- Random forest (`max_depth=9, n_estimators=20`):
    - F1: **~0.649**
    - AUC-ROC: **~0.864**
- Logistic regression (any solver):
    - F1: **~0.515**
    - AUC-ROC: **~0.861**
    
This time, all F1 scores increased slightly, but AUC-ROC scores suffered some losses.

### Summary of training and validation results <a class="anchor" name="5.4"></a>

Due to data imbalance, we experimented with three different alternatives:
1. trained and validated models with the original, imbalanced dataset as is,
1. with weight adjustments,
1. with an upsampled dataset.

The best models and hyperparameters of each approach were:
1. Random forest (`max_depth=9, n_estimators=20`):
    - F1: **~0.573**
    - AUC-ROC: **~0.861**
1. Random forest (`max_depth=9, n_estimators=20`):
    - F1: **~0.648**
    - AUC-ROC: **~0.866**
1. Random forest (`max_depth=9, n_estimators=20`):
    - F1: **~0.649**
    - AUC-ROC: **~0.864**
    
The score obtained from the last two methods **passed the 0.59 baseline score**. Weight adjustments made a better AUC-ROC score, but upsampling returned a better F1, with slight score differences each.

Between these two choices, we will use the model we trained with **weight adjustments** due to the fact that its F1 score was only 0.001 less, but had a 0.002 higher AUC-ROC score compared to the one trained with an upsampled set.

## Testing <a class="anchor" name="6"></a>

Our model needs to go through the final phase: testing. For this stage, we will recreate the aforementioned model with the same hyperparameters, train it using the combined training and validation set, then test its performance.

In [29]:
# Combining training and validation sets
final_features = pd.concat([train_features] + [val_features])
final_target = pd.concat([train_target] + [val_target])

# Creating and training model
final_model = RandomForestClassifier(max_depth=9, n_estimators=20, class_weight='balanced', random_state=12345)
final_model.fit(final_features, final_target)

# Testing
test_pred = final_model.predict(test_features)
test_f1 = f1_score(test_target, test_pred)

test_proba = final_model.predict_proba(test_features)[:, 1]
test_roc_auc = roc_auc_score(test_target, test_proba)

print('Test F1 score:', test_f1)
print('Test AUC-ROC score:', test_roc_auc)

Test F1 score: 0.6296296296296295
Test AUC-ROC score: 0.8562900858868444


Despite the slight score decrease, our model succeeded the test, surpassing the baseline score with an F1 of **0.629** and AUC-ROC score of **0.856**.

## Conclusion <a class="anchor" name="7"></a>

We were given a dataset of 10000 observations.

In preprocessing, we made the following changes to the set:
1. Filling missing values in `Tenure` with the median,
1. Dropping irrelevant features,
1. Separating the full dataset into training, validation, and test sets,
1. Further separating the sets into feature and target sets,
1. Scaled the numerical values with standardization,
1. Converted categorical features into numerical boolean format with one-hot encoding.

The data was heavily imbalanced, having 4 times more observations of target class `Exited == 0`. We tried training models with different approaches to this problem, and concluded that the best solution was to balance the training set with class weight adjustments.

The final model will be a **random forest classifier** trained with this dataset, with the following hyperparameters:
- `max_depth = 9`
- `n_estimators = 20`
- `class_weight = 'balanced'`
- `random_state = 12345`