Owner:      Shilton Jonatan, Data & Analytics, Digitas Singapore, shilton.salindeho@digitas.com

Solution:   Decision Tree/Random Forest

## Decision Tree/Random Forest

Classification algorithms assign each observation to a class or category. There are three types of classification: binary, multiclass, and multilabel.

Decision trees can use one of several splitting criteria (for example Gini impurity or entropy) to decide how to split a node into two or more sub-nodes. Each split is chosen so that the resulting sub-nodes are more homogeneous than their parent. In other words, the purity of the nodes increases with respect to the target variable.
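To make "purity" concrete, here is a minimal sketch of Gini impurity, the default criterion of scikit-learn's `DecisionTreeClassifier`. The helper names `gini_impurity` and `weighted_split_impurity` and the toy labels are illustrative assumptions, not part of the notebook's data.

```python
from collections import Counter

def gini_impurity(labels):
    """Return the Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Hypothetical node with an even mix of both classes
parent = ["loyal", "loyal", "loyal", "churn", "churn", "churn"]
# A split that separates the classes perfectly
left, right = ["loyal", "loyal", "loyal"], ["churn", "churn", "churn"]

parent_impurity = gini_impurity(parent)                # 0.5
split_impurity = weighted_split_impurity(left, right)  # 0.0
print(parent_impurity, split_impurity)
```

A tree builder would compare candidate splits and keep the one with the lowest weighted impurity.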

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

<img src="./decision-tree-classification-algorithm.png" width="350" />
<img src="./Random_forest_diagram_complete.png" width="350" />
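The idea of bagging plus feature randomness can be sketched by hand with a few decision trees. This is only an illustration on synthetic data (`X_demo`, `y_demo` and the choice of 25 trees are assumptions); scikit-learn's `RandomForestClassifier` does this internally and averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the credit card data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bagging: draw a bootstrap sample (same size as the data, with replacement)
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Feature randomness: each split only considers a random subset of features
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(t.fit(X_demo[idx], y_demo[idx]))

# Prediction by committee: majority vote across the 25 trees
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (25, 200)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print(forest_pred[:10])
```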

In [157]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler

Then we load the sample data. In this sample, we'll be using a data set of credit card customers with their gender, age, payment method, churn status and last transaction.

In [158]:
# Load credit card sample data set
df = pd.read_excel("Credit card customer loyalty and attrition.xlsx") # reads the Excel sheet into a data frame

In [159]:
#See first few lines of data
df.head()

Unnamed: 0,Customer ID,Gender,Age,Payment Method,Churn,LastTransaction
0,1,male,64,credit card,loyal,98
1,2,male,35,cheque,churn,118
2,3,female,25,credit card,loyal,107
3,4,female,39,credit card,,177
4,5,male,39,credit card,loyal,90


In [160]:
df.shape

(996, 6)

In [161]:
df.Churn.unique()

array(['loyal', 'churn', nan], dtype=object)

We see from the Churn variable that some rows (customers) have missing churn data: we don't know whether they are loyal customers or customers who churned.

We can see how many of them are missing:

In [162]:
pd.pivot_table(df.fillna(0), values='Customer ID', index='Churn', aggfunc='count')

Unnamed: 0_level_0,Customer ID
Churn,Unnamed: 1_level_1
0,96
churn,322
loyal,578
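The same count can be obtained without replacing the missing values with 0 first. A minimal sketch on a small made-up frame (`demo` is hypothetical, standing in for `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Churn column
demo = pd.DataFrame({"Churn": ["loyal", "churn", np.nan, "loyal", np.nan]})

# dropna=False keeps NaN as its own category in the counts
counts = demo["Churn"].value_counts(dropna=False)
# isna().sum() counts only the missing entries
n_missing = demo["Churn"].isna().sum()

print(counts)
print("missing:", n_missing)
```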


There are **96** rows with missing churn data. We can either drop them or treat them as the customers whose churn status we need to predict. We will drop them for this exercise.

In [163]:
df = df.dropna()
df = df.reset_index(drop=True)

In [164]:
df.shape

(900, 6)

### Preprocessing

We'll have to do some preprocessing to the data first. The two most common steps of preprocessing are scaling and one-hot encoding.

- One-hot encoding converts qualitative variables (e.g. gender, race, nationality) into quantitative variables by splitting one variable into multiple "dummy" variables (values are either 0 or 1).

- Scaling converts variables of different ranges/scales into a uniform scale between 0 and 1.
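Both steps can be seen on a tiny made-up frame before applying them to the real data. In this sketch `demo` and `enc_demo` are hypothetical names. Note that `OneHotEncoder` orders the categories of each variable alphabetically, which determines the order of the dummy columns.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data
demo = pd.DataFrame({"Gender": ["male", "female", "male"], "Age": [20, 30, 40]})

# One-hot encoding: one 0/1 column per category
enc_demo = OneHotEncoder()
onehot = enc_demo.fit_transform(demo[["Gender"]]).toarray()
categories = list(enc_demo.categories_[0])  # column order of the dummies

# Min-max scaling: (x - min) / (max - min), so 20 -> 0, 30 -> 0.5, 40 -> 1
scaled = MinMaxScaler().fit_transform(demo[["Age"]])

print(categories)
print(onehot)
print(scaled.ravel())
```

Reading `enc_demo.categories_` before naming the dummy columns avoids attaching a label to the wrong column.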

In [165]:
#Prepare the preprocessing objects
enc=OneHotEncoder()
scaler=MinMaxScaler()

In [175]:
#Check all unique values of the categorical variables
print(df['Gender'].unique())
print(df['Payment Method'].unique())
print(df['Churn'].unique())

['male' 'female']
['credit card' 'cheque' 'cash']
['loyal' 'churn']


In [176]:
df_onehot=pd.DataFrame(enc.fit_transform(df[['Gender','Payment Method','Churn']]).toarray())
#OneHotEncoder sorts the categories of each variable alphabetically (see enc.categories_),
#so the new column names must follow that order
df[['female','male','cash','cheque','credit card','churn','loyal']]=df_onehot

In [177]:
df[['Age_scaled','LastTransaction_scaled']]=scaler.fit_transform(df[['Age','LastTransaction']])

In [178]:
df.head()

Unnamed: 0,Customer ID,Gender,Age,Payment Method,Churn,LastTransaction,female,male,cash,cheque,credit card,churn,loyal,Age_scaled,LastTransaction_scaled
0,1,male,64,credit card,loyal,98,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.635135,0.436937
1,2,male,35,cheque,churn,118,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.243243,0.527027
2,3,female,25,credit card,loyal,107,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.108108,0.477477
3,5,male,39,credit card,loyal,90,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.297297,0.400901
4,6,female,28,cheque,churn,189,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.148649,0.846847


In [179]:
df_foranalysis=df[['female','male','cash','cheque','credit card','churn','loyal','Age_scaled','LastTransaction_scaled']]

In [180]:
df_foranalysis.head()

Unnamed: 0,female,male,cash,cheque,credit card,churn,loyal,Age_scaled,LastTransaction_scaled
0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.635135,0.436937
1,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.243243,0.527027
2,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.108108,0.477477
3,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.297297,0.400901
4,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.148649,0.846847


### Train Test Split

Decision Tree and Random Forest are both supervised algorithms - we will need input variables and output variables in order to train the model before the model can predict new sets of data.

In [189]:
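#Set all variables except churn as X, and churn as y
#Note: X still contains 'loyal', which is the exact complement of 'churn' and so leaks the target
X=df_foranalysis.loc[:, ~df_foranalysis.columns.isin(['churn'])]
y=df_foranalysis['churn']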

In [230]:
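#Train Test Split
#Note: test_size=0.99 holds out 99% of the rows, so only 9 rows are used for training;
#a value between 0.2 and 0.3 is a more common choice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99, random_state=1)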

In [231]:
X_train.shape

(9, 8)

In [232]:
X_test.shape

(891, 8)
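The split above leaves only 9 rows for training. For comparison, here is a minimal sketch of a more conventional split on synthetic arrays (`X_demo` and `y_demo` are assumptions, not the credit card data), holding out 20% of the rows and using `stratify` to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 70% class 0 and 30% class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# Hold out 20% for testing; stratify keeps the 70/30 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```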

In [233]:
#Fit both models on the training data
tree = DecisionTreeClassifier().fit(X_train, y_train)
randomforest = RandomForestClassifier().fit(X_train, y_train)

#Get predictions from the fitted models
y_pred_tree = tree.predict(X_test)
y_pred_rf = randomforest.predict(X_test)

In [234]:
#Check for model accuracy scores
print("Decision Tree Accuracy:",metrics.accuracy_score(y_test, y_pred_tree))
print("Random Forest Accuracy:",metrics.accuracy_score(y_test, y_pred_rf))

Decision Tree Accuracy: 1.0
Random Forest Accuracy: 0.9259259259259259
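A perfect score from only 9 training rows deserves a second look. One likely explanation is that X above keeps the `loyal` column, which is the complement of the target `churn`. The sketch below reproduces that situation on synthetic data (all names here are hypothetical) and shows how a confusion matrix and `feature_importances_` can help spot it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)  # synthetic binary target
noise = rng.random((300, 2))           # two uninformative features
leaky = 1 - y_demo                     # complement of the target
X_leaky = np.column_stack([noise, leaky])

X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y_demo, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)       # rows: true class, columns: predicted class
importances = clf.feature_importances_  # a single dominant feature hints at leakage

print("accuracy:", acc)
print(cm)
print("importances:", importances)
```

If one feature carries all of the importance and the accuracy is perfect, removing that feature from X and refitting gives a more honest estimate.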
