<a href="https://colab.research.google.com/github/DLPY/Classification_Session_1/blob/main/Churn_Decision_Tree_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Churn Modelling: https://www.kaggle.com/shrutimechlearn/churn-modelling

## Step 1: Import the relevant packages

`numpy` provides support for arrays and matrices. `pandas`, which we've seen before, provides functionality that facilitates working with and viewing 2-D arrays in a tabular format.

Scikit-Learn, otherwise known as `sklearn`, is used for machine learning and has functionality for many types of classification models, including decision trees.

`Matplotlib` is a plotting library that is commonly used to plot the output of machine learning models.

In [None]:
import graphviz
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np 
import seaborn as sns

from sklearn.model_selection import train_test_split
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay is imported in its place.
from sklearn.metrics import (accuracy_score, auc, classification_report, confusion_matrix, ConfusionMatrixDisplay,
                             PrecisionRecallDisplay, RocCurveDisplay, roc_auc_score, roc_curve)
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# import random undersampling
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%matplotlib inline

pd.set_option('display.max_colwidth', None)

In [None]:
# Load the data
df = pd.read_csv('/content/drive/MyDrive/Advanced Analytics Data and Notebooks/Classification/Churn Modelling/Churn_Modelling.csv')
df.head()

# Churn Modeling Data Description
This data set contains details of a bank's customers. The target variable is binary and indicates whether the customer left the bank (closed their account) or continues to be a customer.

There are 13 feature columns, and Exited is the target column.

**Row Numbers:** Row Numbers from 1 to 10000.

**CustomerId:** Unique Ids for bank customer identification.

**Surname:** Customer's last name.

**CreditScore:** Credit score of the customer.

**Geography:** The country the customer belongs to (Germany/France/Spain).

**Gender:** Male or Female.

**Age:** Age of the customer.

**Tenure:** Number of years for which the customer has been with the bank.

**Balance:** Bank balance of the customer.

**NumOfProducts:** Number of bank products the customer is utilising.

**HasCrCard:** Binary flag for whether the customer holds a credit card with the bank (0=No, 1=Yes).

**IsActiveMember:** Binary flag for whether the customer is an active member with the bank (0=No, 1=Yes).

**EstimatedSalary:** Estimated salary of the customer in Euro.

**Exited:** Binary flag for whether the customer closed their account with the bank (0=No, i.e. retained; 1=Yes).
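
Because `Exited` is a 0/1 flag, its mean is the churn rate, which makes a quick class-balance check easy. The sketch below uses a made-up ten-row frame (`toy` is a hypothetical stand-in, not the bank data) to show the idea.

```python
import pandas as pd

# Hypothetical toy target column: 8 retained customers and 2 churned customers.
toy = pd.DataFrame({'Exited': [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

# Share of each class; normalize=True returns proportions instead of counts.
class_share = toy['Exited'].value_counts(normalize=True)

# For a 0/1 flag the mean equals the proportion of 1s.
churn_rate = toy['Exited'].mean()

print(class_share)
print('Churn rate:', churn_rate)
```

On the real data, the same two lines applied to `df['Exited']` would show how imbalanced the target is.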

## Transformation

### Encoding the categorical variables - Change the text into numbers

Convert the categorical values into numeric categorical labels so that this data can be used for modelling.

In [None]:
df['CountryCode'] = df['Geography'].astype('category').cat.codes
df['GenderCode'] = df['Gender'].astype('category').cat.codes

In [None]:
df.head(5)

##### From the above, notice that:
 * The Geography and Gender have been converted to numeric values.
 * There are two new columns with these values: CountryCode and GenderCode.
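
As a small illustration of how `cat.codes` assigns the numbers, the sketch below encodes a made-up four-row column (`toy` is hypothetical, not the bank data). Codes follow the sorted order of the category labels, so the mapping can be recovered from `cat.categories`.

```python
import pandas as pd

# Hypothetical toy column with the three countries from the data description.
toy = pd.DataFrame({'Geography': ['Spain', 'France', 'Germany', 'France']})

geo_cat = toy['Geography'].astype('category')

# Categories are stored in sorted order; the code is the position in that order.
mapping = dict(enumerate(geo_cat.cat.categories))
codes = geo_cat.cat.codes.tolist()

print(mapping)
print(codes)
```

Keeping the `mapping` dictionary around makes it easier to read the tree's split rules later, since they refer to the numeric codes.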

# Choosing predictor variables and target variable for performing Classification
**Target and Source variables**

* **Target Variable:** Exited
* **Predictor Variables:** CreditScore, CountryCode, GenderCode, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary

# Isolate Target and Predictor Variables to Different Dataframes

In [None]:
X = df[['CreditScore', 'CountryCode', 'GenderCode', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
            'IsActiveMember', 'EstimatedSalary']]
y = df[['Exited']]

# Save this list of column values for later
columns_list = list(X.columns.values)

In [None]:
X.head(5)

In [None]:
y.head(5)

# Split the dataset into training and test sets using train_test_split


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

print('Training Data:', X_train.shape, y_train.shape)
print('Testing Data:', X_test.shape, y_test.shape)
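
The split above is purely random. One option the notebook does not use is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below shows this on made-up arrays (`X_toy` and `y_toy` are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20% of them in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80/20 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)

print('Train:', X_tr.shape, 'positives:', int(y_tr.sum()))
print('Test: ', X_te.shape, 'positives:', int(y_te.sum()))
```

With an imbalanced target this avoids a test set that happens to contain unusually few or many churned customers.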

# Train, Test and Predict using a Decision Tree model

In [None]:
# Create a DecisionTreeClassifier that considers a random subset of 7 features at each split.
dtclf_model1 = DecisionTreeClassifier(random_state=42, max_features=7)

In [None]:
# Fit the classification model to the training set data.
dtclf_model1.fit(X_train, y_train)

### Predicting the results

Training set prediction score

In [None]:
y_pred = dtclf_model1.predict(X_train)
accuracy_score(y_train, y_pred)

Test set prediction score

In [None]:
y_pred1 = dtclf_model1.predict(X_test)
accuracy_score(y_test, y_pred1)

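But accuracy can be misleading when one class is much more common than the other, so we also look at the confusion matrix and the per-class metrics.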
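
To see why, here is a minimal sketch on made-up labels (80 retained, 20 churned; these are not the notebook's data). A baseline that always predicts the majority class scores well on accuracy while finding no churned customers at all. `DummyClassifier` is used here only as an illustrative baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels; the features are irrelevant for this baseline.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))

# Always predict the most frequent class (0 = retained).
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_hat)
baseline_recall = recall_score(y_toy, y_hat)  # recall for class 1 (churned)

print('Accuracy:', baseline_accuracy)
print('Recall for churned class:', baseline_recall)
```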

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred1, cmap='RdPu_r')
plt.grid(False)

In [None]:
print(classification_report(y_test, y_pred1))

`classification_report` calculates precision and recall for each class from the actual and predicted values.

For class 1 (churned customers) the model achieves 0.62 precision and 0.45 recall.

Precision tells us what fraction of the customers the model predicted as churned actually churned.

Recall, on the other hand, tells us what fraction of the customers who actually churned the model identified; the remainder are the churners it missed.

In simple terms, the classifier is not very good at identifying churned customers, most likely because of class imbalance!
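
As a minimal sketch of how precision and recall come out of a confusion matrix, the snippet below computes both by hand on ten made-up labels (`y_true` and `y_hat` are hypothetical) and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 4 churned customers (1) and 6 retained customers (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of those predicted as churned, how many really churned
recall = tp / (tp + fn)     # of those who really churned, how many were found

print('Precision:', precision, 'sklearn:', precision_score(y_true, y_hat))
print('Recall:   ', recall, 'sklearn:', recall_score(y_true, y_hat))
```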

# Plotting the Decision Tree

In [None]:
target = list(df['Exited'].unique())
feature_names = list(X.columns.values)

In [None]:
# Graphviz Example:

# class_names must be given in ascending class order: 0 = Stayed, 1 = Left.
dot_data = tree.export_graphviz(dtclf_model1, feature_names=feature_names, class_names=['Stayed', 'Left'], max_depth=3,
                                filled=True, rounded=True)
graph = graphviz.Source(dot_data)  

graph
# This can be saved as an image file (graph.save would only write the DOT source):
# graph.render('decision_tree_chart', format='png')
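
If Graphviz is not available, a plain-text rendering of the tree is one alternative. The sketch below uses `sklearn.tree.export_text` on synthetic data (`X_demo`, `y_demo`, and the feature names `f0` to `f3` are placeholders, not the churn data). It also prints `classes_`, since `export_graphviz` expects `class_names` in that same ascending class order.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features and a binary target.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# A shallow tree keeps the printed rules short.
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Class labels in the order the tree uses internally (ascending).
print(demo_tree.classes_)

# Text version of the split rules, one line per node.
rules = export_text(demo_tree, feature_names=['f0', 'f1', 'f2', 'f3'])
print(rules)
```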

## Pruning the Decision Tree

In [None]:
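# Compute the effective alphas and the total leaf impurity at each step of the pruning path.
path = dtclf_model1.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
# The last alpha is left out of the plot because it prunes the tree down to a single node.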
fig, ax = plt.subplots(figsize=(16,8));
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post");
ax.set_xlabel("effective alpha");
ax.set_ylabel("total impurity of leaves");
ax.set_title("Total Impurity vs effective alpha for training set");

In [None]:
# Train one decision tree for each effective alpha on the pruning path.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)

In [None]:
# Drop the last tree and alpha: that tree has been pruned down to a single node.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

In [None]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
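
Rather than reading a value of alpha off the plot, one option is to pick the alpha with the best score on held-out data. The sketch below does this on synthetic data with a separate validation split (`X_demo`, `y_demo`, and the split names are placeholders). It is one possible selection rule under these assumptions, not the procedure used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split into training and validation parts.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Effective alphas along the pruning path; the last one gives a single-node tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
# Clip guards against tiny negative values caused by floating-point error.
alphas = np.clip(path.ccp_alphas[:-1], 0, None)

# Fit one tree per alpha and score it on the validation data.
val_scores = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr).score(X_val, y_val)
    for a in alphas
]

best_alpha = alphas[int(np.argmax(val_scores))]
print('Best alpha:', best_alpha, 'validation accuracy:', max(val_scores))
```

Selecting alpha on the test set itself would leak information into the final evaluation, which is why the sketch uses a separate validation split.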

## Re-running with alpha

In [None]:
# Create a DecisionTreeClassifier with the same settings as before, plus cost-complexity pruning via ccp_alpha.
dtclf_model2 = DecisionTreeClassifier(random_state=42, max_features=7, ccp_alpha=0.001)

# Fit the classification model to the training set data.
dtclf_model2.fit(X_train, y_train)

### Predicting the results

Training set prediction score

In [None]:
y_pred2_train = dtclf_model2.predict(X_train)
accuracy_score(y_train, y_pred2_train)

Test set prediction score

In [None]:
y_pred2 = dtclf_model2.predict(X_test)
accuracy_score(y_test, y_pred2)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred2, cmap='RdPu_r')
plt.grid(False)

In [None]:
print(classification_report(y_test, y_pred2))

`classification_report` again calculates precision and recall for each class from the actual and predicted values.

For class 1 (churned customers) the model achieves 0.77 precision and 0.42 recall. Precision has increased from the first model, but recall has decreased.

Class imbalance remains an issue!

# Plotting the Decision Tree

In [None]:
target = list(df['Exited'].unique())
feature_names = list(X.columns.values)

In [None]:
# Graphviz Example:

# class_names must be given in ascending class order: 0 = Stayed, 1 = Left.
dot_data = tree.export_graphviz(dtclf_model2, feature_names=feature_names, class_names=['Stayed', 'Left'], max_depth=3,
                                filled=True, rounded=True)
graph = graphviz.Source(dot_data)  

graph
# This can be saved as an image file (graph.save would only write the DOT source):
# graph.render('decision_tree_chart', format='png')

## Resampling the dataset

In [None]:
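# Define the undersampling strategy (random_state is set so the resampling is reproducible)
undersample = RandomUnderSampler(sampling_strategy='majority', random_state=42)

# Fit and apply the transform to the training data only
X_train_under, y_train_under = undersample.fit_resample(X_train, y_train)

# Part 2: train a decision tree on the undersampled training data

# Create a DecisionTreeClassifier with max_features and cost-complexity pruning set.
dtclf_model3 = DecisionTreeClassifier(random_state=42, max_features=7, ccp_alpha=0.0007)
dtclf_model3.fit(X_train_under, y_train_under)
y_pred3 = dtclf_model3.predict(X_test)

# ROC AUC is computed from the predicted probability of class 1 rather than from hard labels.
y_proba3 = dtclf_model3.predict_proba(X_test)[:, 1]
print("ROC AUC score for undersampled data: ", roc_auc_score(y_test, y_proba3))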
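
To make the effect of undersampling concrete, the sketch below reproduces the idea by hand with pandas on a made-up ten-row frame (`toy` is hypothetical). It is a hand-rolled approximation of what `RandomUnderSampler(sampling_strategy='majority')` does, not imblearn's implementation: the majority class is randomly sampled down to the size of the minority class.

```python
import pandas as pd

# Hypothetical imbalanced training data: 8 retained rows and 2 churned rows.
toy = pd.DataFrame({'feature': range(10), 'Exited': [0] * 8 + [1] * 2})

counts_before = toy['Exited'].value_counts()
majority_label = counts_before.idxmax()
minority_n = counts_before.min()

# Randomly keep only as many majority rows as there are minority rows.
majority_down = toy[toy['Exited'] == majority_label].sample(n=minority_n, random_state=42)
minority_all = toy[toy['Exited'] != majority_label]

balanced = pd.concat([majority_down, minority_all])
counts_after = balanced['Exited'].value_counts().to_dict()

print('Before:', counts_before.to_dict())
print('After: ', counts_after)
```

Only the training data is resampled; the test set keeps its original class balance so the evaluation reflects real conditions.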

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred3, cmap='RdPu_r')
plt.grid(False)

In [None]:
print(classification_report(y_test, y_pred3))

## Compare all the models

In [None]:
print('----Model1----')
print(classification_report(y_test, y_pred1))
print('----Model2----')
print(classification_report(y_test, y_pred2))
print('----Model3----')
print(classification_report(y_test, y_pred3))

We can see that resampling increased the recall of the exited class but decreased the precision. This is a trade-off we would expect. The balance between precision and recall is fundamentally a business decision.

Are we happy to be contacting false positives (i.e. people who weren't otherwise going to churn), or would this lead to increased churn from annoyed customers? Conversely, are we happy accepting a lower level of recall, meaning that many false negatives (i.e. people classed as non-exit who exit) do not get contacted?
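
Resampling is not the only way to move along this trade-off. Another option, not used in this notebook, is to change the probability threshold at which a customer is labelled as churned. The sketch below shows the effect on synthetic imbalanced data (`X_demo`, `y_demo`, and the thresholds 0.3, 0.5, 0.7 are illustrative assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 20% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # predicted probability of class 1

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Label a row as positive when its probability reaches the threshold.
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(f'threshold={threshold}: precision={results[threshold][0]:.2f}, '
          f'recall={results[threshold][1]:.2f}')
```

Lowering the threshold can only add predicted positives, so recall never goes down as the threshold drops; precision usually, though not always, moves the other way.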

## Compare Model2 and Model3

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred2, cmap='RdPu_r')
plt.grid(False)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred3, cmap='RdPu_r')
plt.grid(False)

The comparison between the confusion matrices here hopefully drives home the point about the trade-off between recall and precision. By just examining the bottom left and top right boxes, you can see a ~50% reduction in false negatives but also a >5x increase in false positives.
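
The same comparison can be made numerically. The helper below, `fn_fp_change`, is a hypothetical function written for this sketch, and the two matrices are made-up counts chosen only to illustrate the arithmetic; they are not the notebook's results. The layout follows scikit-learn's convention of `[[tn, fp], [fn, tp]]`.

```python
import numpy as np


def fn_fp_change(cm_before, cm_after):
    """Return (fractional reduction in false negatives, false-positive multiplier)."""
    _, fp_before, fn_before, _ = np.asarray(cm_before).ravel()
    _, fp_after, fn_after, _ = np.asarray(cm_after).ravel()
    fn_reduction = 1 - fn_after / fn_before
    fp_ratio = fp_after / fp_before
    return fn_reduction, fp_ratio


# Made-up confusion matrices in the form [[tn, fp], [fn, tp]].
cm_before = [[2300, 80], [360, 260]]
cm_after = [[1900, 480], [180, 440]]

fn_reduction, fp_ratio = fn_fp_change(cm_before, cm_after)
print(f'False negatives reduced by {fn_reduction:.0%}; '
      f'false positives multiplied by {fp_ratio:.1f}')
```

With the real predictions, the inputs would be `confusion_matrix(y_test, y_pred2)` and `confusion_matrix(y_test, y_pred3)`.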