**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did an excellent job! There are some improvements that could be made (there are better choices for `repeat` and `fraction` values for upsampling/downsampling), but you successfully completed all the tasks. The project is accepted. Keep up the good work on the next sprint!

# Predicting Which Customers Leave Beta Bank

A bank, Beta Bank, has determined that it is cheaper to focus on retaining customers than it is to pursue new customers. They want the ability to determine which customers are likely to leave soon, so that they can focus their attention on retaining them before they decide to exit. In order to provide this ability to Beta Bank, a model needs to be trained that determines if a customer will leave the company based on their individual characteristics and features. The only true requirement provided by the bank is that the F1 score needs to be equal to or greater than **0.59**. However, the highest F1 score possible is obviously desired.

This task will be completed by preprocessing the data and features. Any missing data will need to be removed or replaced, and all features need to be numeric in order to train the models. Once the data is ready to be used for model training, three model types will be analzyed. Logitic Regression, Decision Tree, and Random Forest models will be utilized for this task, and they will be checked using different methods for adjsuting class imbalance. Class imbalance will be adjusted using the `class_weight` parameter, as well as upsampling and downsampling the training data.

Once the best model has been determined, it will be final tested using the test dataset. If the F1 score exceeds Beta Banks minimum requested F1 score of **0.59**, the model will be accepted and delivered to Beta Bank as the final product. Beta Bank will then be able to predict which customers are looking to leave the bank, and they can counteract by focusing their attention on retaining the specified customers.

## Initialization

In this section the necessary libraries will be imported, the data will be read into a DataFrame, and a summary of the data will be quickly explored.

### Load libraries

All the important libraries that are utilized throughout this report are imported in the cell block below.

In [28]:
# Import the necessary libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

# Hide warning messages
import warnings
warnings.filterwarnings('ignore')

### Load data

Below, the csv file **/datasets/Churn.csv** will be read and stored in the DataFrame `df`.

In [2]:
df = pd.read_csv('/datasets/Churn.csv')

### Explore the data

Let's take a look at the data stored in the `df` DataFrame. The first 15 rows will be printed, followed by the DataFrame's general info.

In [3]:
df.head(15)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The DataFrame `df` contains **10,000** rows and **13** columns. It appears that only the `'Tenure'` columns contains missing values, of which there are **909**, meaning that just over **9%** of data in the `'Tenure'` column is missing. This is a significant amount that will be remedied by replacing the missing values with the median tenure value from the `'Tenure'` column. Aside from that, the datatypes of all the columns look to be correct. No other preprocessing will be required besides filling in the missing values. Aside from preprocessing, the data itself will need to be adjusted prior to training the models.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected!

</div>

### Datapreprocessing

As was stated above, there are missing values in the `'Tenure'` column that need to be replaced with the median tenure value. The below code block does just that.

In [5]:
# Replace the missing values in the 'Tenure' column of the DataFrame with the median value of the 
# tenure column.

df.loc[df['Tenure'].isna(),'Tenure'] = df['Tenure'].median()

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok, missing values were dealt with reasonably

</div>

Let's look at a summary of the data again to ensure that there are no longer any missing values.

In [6]:
# Check to ensure that there are no missing values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


No missing values remain in the DataFrame! We can now move onto preparing the features for training the model!

## Feature Preparation

Before we can begin splitting the data and training the model, we need to prepare the necessary features. It is important that the data is presented in a numeric fashion so that the models can effectively be trained without any errors. This means that any features, or columns, that contain strings or non-numeric categories will need to be transformed into some numeric grouping via one of the encoding methods. 

From looking at the DataFrame, the `'RowNumber'`, `'CustomerId'`, and `'Surname'` columns should have zero impact on whether or not a customer leaves or not. Additionally, their size will make encoding very tasking. Therefore, let's create a new DataFrame that drops the the `'RowNumber'`, `'CustomerId'`, and `'Surname'` columns from the dataset. The new DataFrame will be named `data_curated`.

In [8]:
# Curate the dataset to only have the necessary info by dropping unecessary columns

data_curated = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

<div class="alert alert-success">
<b>Reviewer's comment</b>

Useless columns were dropped

</div>

The remaining columns are all numeric except for `'Geography'` and `'Gender'`. Fortuantely, the `'Geography'` column only contains **3** unique values, and the `'Gender'` column only contains **2** unique values. These conditions make One-Hot Encoding (OHE) a very favorable method of encoding the dataset. When using OHE, it is important to avoid the dummy trap. The model will fall into the dummy trap when too many highly correlated columns are added to the DataFrame. Since the data is abundant and highly related to one another, the model will get confused. In order to avoid this trap, we will drop the first new column from each original feature when performing One-Hot Encoding. The cell block below will execute this work.

In [9]:
# Perform One-Hot Encoding on data_curated
# Pass the drop_first=True parameter to avoid the dummy trap
# Store the newly encoded dataset in 'data_ohe'

data_ohe = pd.get_dummies(data_curated, drop_first=True)

<div class="alert alert-success">
<b>Reviewer's comment</b>

Categorical features were encoded

</div>

After having performed One-Hot Encoding, let's take a look at the new DataFrame.

In [10]:
# Print the new DataFrame, 'data_ohe'

data_ohe

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.00,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.80,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.00,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.10,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,39,5.0,0.00,2,1,0,96270.64,0,0,0,1
9996,516,35,10.0,57369.61,1,1,1,101699.77,0,0,0,1
9997,709,36,7.0,0.00,1,0,1,42085.58,1,0,0,0
9998,772,42,3.0,75075.31,2,1,0,92888.52,1,1,0,1


As can be seen, the last few columns of the DataFrame are dummy columns from the `'Geography'` and `'Gender'` columns. Also notice that one dummy column from each original feature has been dropped. We have successfully performed One-Hot Encoding on our data and avoided falling into the dummy trap. Now that all the data in the DataFrame is numeric, we can finally move onto working with the models.

## Modeling

### Preparing the training, validation, and test datasets

We want to be able to train the model, validate the model, and then perform final testing on the model. Such actions require 3 separate datasets. Since Beta Bank has only provided one dataset, it will be split up into a training dataset, a validation dataset, and a testing dataset. A popular ratio for splitting up the data is 3:1:1, where the training dataset contains 60% of the data and both the validation and testing datasets contains 20% of the data. To split the data up, we will utilize the `train_test_split` function found in the `sklearn.model_selection` library. The below cell block will split the overall dataset into 3 separate datasets.

In [11]:
# Create the training, validation, and testing datasets
# The datasets will be split up 60%, 20%, 20% for a 3:1:1 ratio

# Create the target dataset by slicing the 'Exited' column
target = data_ohe['Exited']

# Create the features dataset by dropping the 'Exited column'
features = data_ohe.drop('Exited', axis=1)

# Split the features and target datasets into training datasets and other datasets
# The test_size will be 0.4 so that the training dataset contains 60% of the data
features_train, features_other, target_train, target_other = train_test_split(features, target, test_size=0.4, random_state=12345)

# Split the 'other' datasets into validation and testing datasets
# The test_size will be 0.5 so that both datasets contain 50% of 40% of the data, or 20% of the overall data each.
features_valid, features_test, target_valid, target_test = train_test_split(features_other, target_other, test_size=0.5, random_state=12345)

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was split into train, validation and test sets

</div>

### Examine the Balance of Classes

Now that the datasets are split up, lets take a quick look at the class balance in the target datasets. To look at the class balances, we will calculate the percent that positive and negative classes appear within the overall dataset. This will be done for the training dataset, validation dataset, and test dataset. The percentages will be displayed below.

In [26]:
# Balance of classes for training dataset
print('TRAINING')
print(f'Negative class balance: {round(target_train[target_train == 0].count() * 100 / len(target_train), 3)}%') 
print(f'Positive class balance: {round(target_train[target_train == 1].count() * 100 / len(target_train), 3)}%\n') 

# Balance of classes for validation dataset
print('VALIDATION')
print(f'Negative class balance: {round(target_valid[target_valid == 0].count() * 100 / len(target_valid), 3)}%') 
print(f'Positive class balance: {round(target_valid[target_valid == 1].count() * 100 / len(target_valid), 3)}%\n') 


# Balance of classes for test dataset
print('TEST')
print(f'Negative class balance: {round(target_test[target_test == 0].count() * 100 / len(target_test), 3)}%') 
print(f'Positive class balance: {round(target_test[target_test == 1].count() * 100 / len(target_test), 3)}%') 

TRAINING
Negative class balance: 80.067%
Positive class balance: 19.933%

VALIDATION
Negative class balance: 79.1%
Positive class balance: 20.9%

TEST
Negative class balance: 78.85%
Positive class balance: 21.15%


As can be seen, there is significant class imbalance. The negative class accounts for almost **80%** of the data in the training, validation, and test target datasets. The positive class only account for rougly **20%**. We will need to adjust this so that our models can be better trained and produce better results. 

To show the effect that adjusting class balance can have, we will train a model without adjusting the class balance and then compare it to a model that is trained after having adjusting the class balance. Before we can move onto training a model, one last step is required, and that is standardization of the data. In order to ensure that the model treates each feature equally, we must standardiz the data. This is done in the cell block below with the use of the `StandardScaler` class from the `sklearn.preprocessing` module.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Great, you checked the balance of classes. By the way, we can use `stratify` parameter of `train_test_split` to make sure that train, validation and test sets have the same distribution of classes as the original dataset

</div>

In [29]:
# Standardize the data
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
scaler = StandardScaler()
scaler.fit(features_train[numeric])
features_train.loc[:, numeric] = scaler.transform(features_train.loc[:, numeric])
features_valid.loc[:, numeric] = scaler.transform(features_valid.loc[:, numeric])
features_test.loc[:, numeric] = scaler.transform(features_test.loc[:, numeric])

<div class="alert alert-success">
<b>Reviewer's comment</b>

Scaling was applied correctly

</div>

Now that the data is standardized for the training, validation, and testing datasets, let's train some models!

#### Training a model before adjusting class imbalance

Let's first train a model without having adjusted the class imbalance. Then, we will compare the results with a trained model after having adjusted the class imbalance.

In [30]:
# Create an instance of a LogisticRegression model
model = LogisticRegression(solver='liblinear', random_state=12345)

# Fit the model using the training data
model.fit(features_train, target_train)

# Predict the target values of the validation features
predicted_valid = model.predict(features_valid)

# Calculate and print the F1 score
print('F1:', round(f1_score(target_valid, predicted_valid),5))

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print(f'AUC-ROC Score: {round(auc_roc, 5)}')

F1: 0.33108
AUC-ROC Score: 0.75875


Without adjusting the class imbalance in the dataset, the logistic regression model has a F1 score of approximately **0.331**. This F1 score is okay, but not great. We are definitely not going to deliver this model to Beta Bank. The AUC-ROC score is approximately **0.76**, which, while far from perfect, is better than a random model. Let's see how the model fairs after having adjusted the class imbalance.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, you trained a model without applying any balancing techniques first

</div>

#### Training a model after adjusting the class imbalance

As an attempt to increas the F1 score of the logistic regression model, let's try adjusting the class imbalance of the training dataset. We want the model to know which classes occur most often, and adjust the class weight off the results. To do this, the `class_weight` parameter will be set equal to `balanced` when initializing the model.

In [32]:
# Create an instance of a LogisticRegression model
# pass the parameter class_weight='balanced'
model = LogisticRegression(class_weight='balanced', solver='liblinear', random_state=12345)

# Fit the model using the training data
model.fit(features_train, target_train)

# Predict the target values of the validation features
predicted_valid = model.predict(features_valid)

# Calculate and print the F1 score
print('F1:', round(f1_score(target_valid, predicted_valid),5))

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print(f'AUC-ROC Score: {round(auc_roc, 5)}')

F1: 0.48885
AUC-ROC Score: 0.76373


After having adjusted the class imbalance, the F1 score of the model is approximately **0.49**! This is much better than the F1 score of approximately **0.331** as seen previously before adjusting the class imbalance! Still, this F1 score does not meet the minimum requirement of **0.59** as provided by Beta Bank. The AUC-ROC score is slightly better than the previous model. Let's move onto utilizing other methods of adjusting class imbalance and training various models to see which ones work the best.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Class weights were applied successfully

</div>

### Adjusting class weight with upsampling and downsampling

Aside from passing the `class_weight='balanced'` parameter to the model during initialization, we can also utilize upsampling and downsampling to adjust class imbalance. Upsampling increases the frequency of positive observations in the data, and downsampling decreases the frequency of negative observations in the data. Let's start off by taking a look at upsampling and creating a function for it.

#### Upsampling

In [33]:
# Adjust class imablance with upsampling

# Create a function for upsampling the training data

# Initialize the funciton
def upsample(features, target, repeat):
    
    # Create datasets based on class
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    
    # Perform upsampling
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    # Shuffle the observations
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
    
    # Return the upsampled data
    return features_upsampled, target_upsampled

Now that a function has been created for upsampling, let's pass the features and target datasets of the training dataset to the function.

In [34]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Upsampling was correctly applied only to the train set, but the `repeat` value doesn't really make the data balanced (is there 10 times more zeros than ones?)

</div>

Now let's see how upsampling effects the performance of the Logistic Regression model.

In [35]:
# Create an instance of a LogisticRegression model
# pass the parameter class_weight='balanced'
model = LogisticRegression(solver='liblinear', random_state=12345)

# Fit the model using the training data
model.fit(features_upsampled, target_upsampled)

# Predict the target values of the validation features
predicted_valid = model.predict(features_valid)

# Calculate and print the F1 score
print('F1:', round(f1_score(target_valid, predicted_valid),5))

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print(f'AUC-ROC Score: {round(auc_roc, 5)}')

F1: 0.41943
AUC-ROC Score: 0.76535


After upsampling the data, the model has an F1 score of approximatley **0.42**. This is not better than the almost **.49** F1 score seen by using the `class_weight` parameter. However, it is still better than not adjusting the class imablance at all. The AUC-ROC score is also relatively equivalnet to the AUC-ROC as seen previously from the model with the `class_weight='balanced'` parameter. Let's see how the model changes by implementing downsampling.

#### Downsampling

In [37]:
# Create a function for downsampling the training data

def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])

    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)

    return features_downsampled, target_downsampled

Now that a function has been created for downsampling, let's pass the features and target datasets of the training dataset to the function.

In [38]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Downsampling was applied correctly, but there is a similar problem to upsampling: the chosen `fraction` value doesn't make the data balanced

</div>

Now let's see how downsampling effects the performance of the Logistic Regression model.

In [39]:
# Create an instance of a LogisticRegression model
# pass the parameter class_weight='balanced'
model = LogisticRegression(solver='liblinear', random_state=12345)

# Fit the model using the training data
model.fit(features_downsampled, target_downsampled)

# Predict the target values of the validation features
predicted_valid = model.predict(features_valid)

# Calculate and print the F1 score
print('F1:', f1_score(target_valid, predicted_valid))

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print(f'AUC-ROC Score: {round(auc_roc, 5)}')

F1: 0.4308390022675737
AUC-ROC Score: 0.75836


The F1 score of the new model is approximately **0.43**, which is better than the model using upsampled data, but not better than the model using the `class_weight` parameter. Again, the AUC-ROC score remains unchanged for the most part. It has slightly decreased as compared to the AUC-ROC scores of the previous two models. Now that we have fully explored the performance of Logistic Regression models, let's see how Decision Tree Classifier models and Random Forest Classifier models perform.

## TRAINING OTHER TYPES OF MODELS

The below cell blocks will check the performance of Decision Tree models and Random Forest models after the class imbalance of the training dataset has been adjusted. Each model's performance will be checked after adjusting the class imbalance using the `class_weight` parameter, upsampling, and downsampling.

#### Decision Tree with class_weight='balanced'

In [40]:
# Decision Tree Model/Learning Algorithm

# Initialize
best_DT_model = None
best_DT_f1_score = 0
best_DT_depth = 0

# Create various models with different depth values

# for loop for changing depth values (range of 1-41)
for depth in range(1,41):
    
    # Create a model, using the provided depth and the same random_state
    DT_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    
    # Train the model using the training dataset
    DT_model.fit(features_train, target_train)
    
    # Predict the target values of the validation features using the model
    DT_predictions_valid = DT_model.predict(features_valid) # get model predictions on validation set
    
    # Calculate the f1_score, if allowed
    try:
        f1 = f1_score(target_valid, DT_predictions_valid)
    except:
        break
        
    # Determe best fit
    if f1 > best_DT_f1_score:
        best_DT_model = DT_model
        best_DT_depth = depth
        best_DT_f1_score = f1

probabilities_valid = best_DT_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print('Best Model:', best_DT_model)
print(f'Best F1 Score: {round(best_DT_f1_score,4)}')
print(f'AUC-ROC Score: {round(auc_roc, 5)}')
print('Best Depth:', best_DT_depth)

Best Model: DecisionTreeClassifier(max_depth=6, random_state=12345)
Best F1 Score: 0.5697
AUC-ROC Score: 0.81646
Best Depth: 6


The F1 Score of the Decision Tree model with the `class_weight='balanced'` parameter is the best we've seen yet! The F1 score of the model is approximately **0.57**. The AUC-ROC score is also the highest we've seen yet!

#### Decision Tree using Upsampling

In [41]:
# Decision Tree Model/Learning Algorithm

# Initialize
best_DT_model = None
best_DT_f1_score = 0
best_DT_depth = 0

# Create various models with different depth values

# for loop for changing depth values (range of 1-41)
for depth in range(1,41):
    
    # Create a model, using the provided depth and the same random_state
    DT_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    
    # Train the model using the training dataset
    DT_model.fit(features_upsampled, target_upsampled)
    
    # Predict the target values of the validation features using the model
    DT_predictions_valid = DT_model.predict(features_valid) # get model predictions on validation set
    
    # Calculate the f1_score, if allowed
    try:
        f1 = f1_score(target_valid, DT_predictions_valid)
    except:
        break
    
    # Determe best fit
    if f1 > best_DT_f1_score:
        best_DT_model = DT_model
        best_DT_depth = depth
        best_DT_f1_score = f1

probabilities_valid = best_DT_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print('Best Model:', best_DT_model)
print(f'Best F1 Score: {round(best_DT_f1_score,4)}')
print(f'AUC-ROC Score: {round(auc_roc, 5)}')
print('Best Depth:', best_DT_depth)

Best Model: DecisionTreeClassifier(max_depth=7, random_state=12345)
Best F1 Score: 0.5252
AUC-ROC Score: 0.8126
Best Depth: 7


After performing upsampling on the training dataset, the Decision Tree model has a better F1 score and AUC-ROC score than the Logistic Regression models. However, the Decision Tree model that was passed the `class_weight='balanced'` parameter still has a better F1 score and AUC-ROC score.

#### Decision Tree with Downsampling

In [42]:
# Initialize
best_DT_model = None
best_DT_f1_score = 0
best_DT_depth = 0

# Create various models with different depth values

# for loop for changing depth values (range of 1-41)
for depth in range(1,41):
    
    # Create a model, using the provided depth and the same random_state
    DT_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    
    # Train the model using the training dataset
    DT_model.fit(features_downsampled, target_downsampled)
    
    # Predict the target values of the validation features using the model
    DT_predictions_valid = DT_model.predict(features_valid) # get model predictions on validation set
    
    # Calculate the f1_score, if allowed
    try:
        f1 = f1_score(target_valid, DT_predictions_valid)
    except:
        break
    
    # Determe best fit
    if f1 > best_DT_f1_score:
        best_DT_model = DT_model
        best_DT_depth = depth
        best_DT_f1_score = f1
        
probabilities_valid = best_DT_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print('Best Model:', best_DT_model)
print(f'Best F1 Score: {round(best_DT_f1_score,4)}')
print(f'AUC-ROC Score: {round(auc_roc, 5)}')
print('Best Depth:', best_DT_depth)

Best Model: DecisionTreeClassifier(max_depth=5, random_state=12345)
Best F1 Score: 0.4955
AUC-ROC Score: 0.81491
Best Depth: 5


Again, the model using downsampled training data has a higher F1 score and AUC-ROC score than the Logistic Regression models. But, the F1 score and AUC-ROC score are not higher than the Logistic Regression model that used the `class_weight='balanced'` parameter.

### Random Forest

#### Random Forest with class_weight='balanced'

In [43]:
best_RF_model = None
best_est = 0
best_RF_depth = 0
best_RF_f1_score = 0

# Create various models with different depth and estimator values

# for loop for the number of estimators
for est in range(1,21):
    
    # for loop for the depth value
    for depth in range (1, 41):
        
        # Create a model, using the provided depth, number of estimators, and the same random_state
        RF_model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est)
        
        # Train the model using the training dataset
        RF_model.fit(features_train, target_train)

        # Predict the target values of the validation features using the model
        RF_predictions_valid = RF_model.predict(features_valid) # get model predictions on validation set
       
        # Calculate the f1_score, if allowed
        #try:
        f1 = f1_score(target_valid, RF_predictions_valid)
        #except:
            #break

     # Determe best fit
        if f1 > best_RF_f1_score:
            best_RF_model = RF_model
            best_RF_f1_score = f1
            best_RF_depth = depth
            best_est = est

probabilities_valid = best_RF_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print('Best Model:', best_RF_model)
print(f'Best F1 Score: {round(best_RF_f1_score, 4)}')
print(f'AUC-ROC Score: {round(auc_roc, 5)}')
print('Best Depth:', best_RF_depth)
print('Best n_estimators:', best_est)

Best Model: RandomForestClassifier(max_depth=14, n_estimators=9, random_state=12345)
Best F1 Score: 0.6
AUC-ROC Score: 0.82229
Best Depth: 14
Best n_estimators: 9


The F1 score of the Random Forest model that was passed the `class_weight='balanced'` parameter is **0.6**. This F1 score is even better than the Decision Tree models. It also satisfies the minimum F1 score of **0.59** as provided by Beta Bank! In addition to the F1 score being the highest we've seen, the model's AUC-ROC score of approximately **0.82** is the highest we've seen yet, too!

#### Random Forest with Upsampling

In [44]:
best_RF_model = None
best_est = 0
best_RF_depth = 0
best_RF_f1_score = 0

# Create various models with different depth and estimator values

# for loop for the number of estimators
for est in range(1,21):
    
    # for loop for the depth value
    for depth in range (1, 41):
        
        # Create a model, using the provided depth, number of estimators, and the same random_state
        RF_model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est)
        
        # Train the model using the training dataset
        RF_model.fit(features_upsampled, target_upsampled)

        # Predict the target values of the validation features using the model
        RF_predictions_valid = RF_model.predict(features_valid) # get model predictions on validation set
       
        # Calculate the f1_score, if allowed
        try:
            f1 = f1_score(target_valid, RF_predictions_valid)
        except:
            break

     # Determe best fit
        if f1 > best_RF_f1_score:
            best_RF_model = RF_model
            best_RF_f1_score = f1
            best_RF_depth = depth
            best_est = est

probabilities_valid = best_RF_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print('Best Model:', best_RF_model)
print(f'Best F1 Score: {round(best_RF_f1_score, 4)}')
print(f'AUC-ROC Score: {round(auc_roc, 5)}')
print('Best Depth:', best_RF_depth)
print('Best n_estimators:', best_est)

Best Model: RandomForestClassifier(max_depth=14, n_estimators=20, random_state=12345)
Best F1 Score: 0.6093
AUC-ROC Score: 0.82729
Best Depth: 14
Best n_estimators: 20


For the first time, upsampling has resulted in better F1 and AUC-ROC scores than passing the `class_weight='balanced'` parameter. This Random Forest Classifier model has an F1 score of approximately **0.61** and an AUC-ROC score of approximately **0.83**. Both these values are the highest we've seen!

#### Random Forest with Downsampling

In [45]:
best_RF_model = None
best_est = 0
best_RF_depth = 0
best_RF_f1_score = 0

# Create various models with different depth and estimator values

# for loop for the number of estimators
for est in range(1,21):
    
    # for loop for the depth value
    for depth in range (1, 41):
        
        # Create a model, using the provided depth, number of estimators, and the same random_state
        RF_model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est)
        
        # Train the model using the training dataset
        RF_model.fit(features_downsampled, target_downsampled)

        # Predict the target values of the validation features using the model
        RF_predictions_valid = RF_model.predict(features_valid) # get model predictions on validation set
       
        # Calculate the f1_score, if allowed
        try:
            f1 = f1_score(target_valid, RF_predictions_valid)
        except:
            break

     # Determe best fit
        if f1 > best_RF_f1_score:
            best_RF_model = RF_model
            best_RF_f1_score = f1
            best_RF_depth = depth
            best_est = est

probabilities_valid = best_RF_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print('Best Model:', best_RF_model)
print(f'Best F1 Score: {round(best_RF_f1_score, 4)}')
print(f'AUC-ROC Score: {round(auc_roc, 5)}')
print('Best Depth:', best_RF_depth)
print('Best n_estimators:', best_est)

Best Model: RandomForestClassifier(max_depth=15, n_estimators=10, random_state=12345)
Best F1 Score: 0.4781
AUC-ROC Score: 0.79146
Best Depth: 15
Best n_estimators: 10


Unfortunately, downsampling did not have a profound effect on the Random Forest Classifier model. The F1 score of approximately **0.48** is rather low from the highest values we've seen. It is on par with the Logistic Regression models previously trained. However, the AUC-ROC score is still relatively high. Again, the AUC-ROC score is not perfect by any means, but it's better than other models and beats a random model by a long run.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok, very nice, you tried a couple of other models and tuned their hyperparameters using the validation set

</div>

## Final testing

The model with the best F1 score and AUC-ROC score was the Random Forest Classifier model trained with upsampled data, and with a max depth of 14 branches and an n_estimators value of 20. Let's train the model one more time and test its performance using the test dataset.

In [47]:
# Perform Final Testing

# Create a model, using the provided depth, number of estimators, and the same random_state
RF_model = RandomForestClassifier(max_depth=14, random_state=12345, n_estimators=20)

# Train the model using the training dataset
RF_model.fit(features_upsampled, target_upsampled)

test_predictions = RF_model.predict(features_test)

f1 = f1_score(target_test, test_predictions)

probabilities_test = model.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]

auc_roc = roc_auc_score(target_test, probabilities_one_test)

print(f'F1 Score: {round(f1, 4)}')
print(f'AUC-ROC score: {round(auc_roc, 4)}')

F1 Score: 0.5931
AUC-ROC score: 0.736


<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, the final model was evaluated on the test set

</div>

The trained Random Forest Classifier model has an F1 score of approximately **.59**! This satisfies the requirement provided by Beta Bank! The AUC-ROC is acceptable, as it still beats a random model. However, it is lower than what was seen previously with certain validation datasets.

Let's real quick see how many customers are predicted to leave out of the **2,000** customers in the test dataset.

In [51]:
# Initialize total variable
total = 0

# Create a for loop for checking each prediction in the test predictions
for prediction in test_predictions:
    if prediction == 1:
        total += 1

# Print the total number of customers predicted to leave the bank
print(total)

474


**474** customers are predicted to leave the bank! That's almost **25%** of the customers! Wow! Beta Bank better start sending out some promos!

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Keeping in mind that the precision of our model is far from 100%, not all of those customers will actually leave :)

</div>

## Conclusion

Beta Bank requested that a model be trained to predict which customers are going to leave the bank. We first began this task by preproccesing the data and features. In order to perform training for the models, the features needed to be presented in a numeric form. To transform the categorical features into numeric features, One-Hot Encoding was performed over the categorical features. The last step in preparing the data before training models was standardizing all the features so that no one feature was determined to be more important than the others.

Once the data was preprocessed, we looked at the imbalance seen in the target classes. The target datasets contained a positive and negative class. For the training, validation, and testing target datasets the negative class accounted for approximately **80%** of the data. This is a signficant imbalance since positive observations only accounted for roughly **20%** of data. The performance of a Logistic Regression model was checked with no adjustment made to the class imbalance, and then the rest of the models were trained with adjustments made to the class imbalance.

Logistic Regression, Decision Tree, and Random Forest models were trained to determine which model performed the best. Each model was trained three times, with each instance of on type of model using a different class imbalance adjustment method. The first model of each model type was passed the `class_weight='balanced'` parameter so that the class imbalance was adjusted. The second and third instance of each type of model were trained using upsampled and downsampled data, respectively.

Overall, the Random Forest Classifier model trained with upsampled data and with a max depth of **14** and an n_estimators value of **20** proved to perform the best. The model was tested using the test dataset, and the F1 score was calculated to be approximately **0.5931**, which is higher than the **0.59** minimum score requested by Beta Bank. This model is an acceptable product for Beta Bank, and will assist them in predicting which customers will be leaving their bank soon.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Excellent summary!

</div>