### Introduction

This notebook illustrates the concepts of posterior probability and probability threshold. It shows how changing the probability threshold changes the performance metrics of a classifier.

In this notebook, we will __re-apply classification trees to the Caravan and Default datasets__.

We will apply the post-pruning strategy based on cost-complexity pruning (CCP) as we did before. Then we will modify the probability threshold of the post-pruned tree to observe the effect of this change.

Why do we focus on the post-pruned tree only? Because we have repeatedly observed that post-pruning leads to better results than pre-pruning.

### Post-pruning applied to the Caravan dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from sklearn import tree

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
Caravan_df= pd.read_csv('C:\\Users\\jheredi2\\Documents\\PythonDataAnalytics\\1-Datasets\\Caravan.csv')

In [None]:
round(Caravan_df['Purchase'].value_counts(normalize= True), 2)

In [None]:
# Excluding the first predictor from the Caravan data set

X_train, X_test, y_train, y_test= train_test_split (Caravan_df.iloc[:,1:-1], Caravan_df['Purchase'], test_size=0.2, random_state=1)

In [None]:
tree_caravan_unprunned= DecisionTreeClassifier(criterion='gini', random_state=1)

In [None]:
path= tree_caravan_unprunned.cost_complexity_pruning_path(X_train, y_train)

In [None]:
alphas= path['ccp_alphas']

In [None]:
accuracy_scores=[]
for i in alphas:
    treeloop= DecisionTreeClassifier(ccp_alpha=i, random_state=1)
    treeloop.fit(X_train, y_train)
    y_test_predicted=treeloop.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_test_predicted)) 

In [None]:
indexmax=accuracy_scores.index(max(accuracy_scores))

In [None]:
tree_caravan_postprunned= DecisionTreeClassifier(ccp_alpha= alphas[indexmax], random_state=1)

In [None]:
tree_caravan_postprunned.fit(X_train, y_train)

### Posterior probabilities for a classification tree and the probability threshold

Calling the method __predict_proba()__ on the fitted tree returns, for each test observation, the probability of No and the probability of Yes given that observation's predictor values. In other words, __predict_proba()__ returns the __posterior probabilities__.

The method __predict_proba()__ returns a two-dimensional array with Nt rows (the number of test observations) and two columns, one per class: column 1 holds the posterior probability of No and column 2 holds the posterior probability of Yes.

See next:

In [None]:
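# The columns of predict_proba() follow the order of the fitted tree's classes_ attribute
# (unique() returns labels in order of appearance, which may differ)

tree_caravan_postprunned.classes_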

In [None]:
tree_caravan_postprunned.predict_proba(X_test)

In [None]:
# The following statement will retrieve the second probability (the probability of Yes) for each test observation

tree_caravan_postprunned.predict_proba(X_test)[:,1]
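
As a self-contained illustration of the column ordering described above, here is a minimal sketch on hypothetical toy data (the names `X_toy`, `y_toy` and `toy_tree` are not part of this notebook). For a scikit-learn tree, the posterior probability of a class is the fraction of training observations of that class in the leaf where the observation lands, so a tree with two leaves can return only two distinct probability rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one predictor, labels 'No' / 'Yes'
X_toy = np.arange(8).reshape(-1, 1)
y_toy = np.array(['No', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes'])

# A depth-1 tree has two leaves, so it returns at most two distinct posteriors
toy_tree = DecisionTreeClassifier(max_depth=1, random_state=1)
toy_tree.fit(X_toy, y_toy)

# classes_ gives the column order used by predict_proba()
print(toy_tree.classes_)

# One row per observation, one column per class; each row sums to 1
toy_proba = toy_tree.predict_proba(np.array([[1], [6]]))
print(toy_proba)
```

The second column of `toy_proba` is the probability of 'Yes' because 'Yes' is the second entry of `classes_`.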

By default, a classification tree predicts the class with the highest posterior probability, which minimizes the overall error rate. With two classes, it therefore predicts Yes only when the posterior probability of Yes is > 0.5.

As we can see next, most observations are predicted No because the probability of Yes is <= 0.5 for most observations.

In [None]:
# Set of values taken by posterior probability for Y= 1

pd.Series (tree_caravan_postprunned.predict_proba(X_test)[:,1]).value_counts()

In [None]:
# Number of predictions of Yes and No

pd.Series(tree_caravan_postprunned.predict(X_test)).value_counts()

The next code cell checks that the prediction is Yes only for the two observations for which prob of Yes > 0.5 

In [None]:
np.all ( (tree_caravan_postprunned.predict_proba(X_test)[:,1]>0.5) == (tree_caravan_postprunned.predict(X_test)=='Yes'))
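
The equivalence checked above can be reproduced without a fitted tree. The sketch below uses hypothetical posterior probabilities (the array `proba` is made up for illustration) and compares two rules: picking the class with the largest posterior, and thresholding the probability of 'Yes' at 0.5.

```python
import numpy as np

# Hypothetical posterior probabilities; columns ordered as ['No', 'Yes']
proba = np.array([[0.90, 0.10],
                  [0.40, 0.60],
                  [0.50, 0.50],   # tie: argmax returns the first class, 'No'
                  [0.20, 0.80]])
classes = np.array(['No', 'Yes'])

# Rule 1: predict the class with the largest posterior probability
pred_argmax = classes[np.argmax(proba, axis=1)]

# Rule 2: predict 'Yes' only when the probability of 'Yes' exceeds 0.5
pred_threshold = np.where(proba[:, 1] > 0.5, 'Yes', 'No')

print(np.array_equal(pred_argmax, pred_threshold))
```

With two classes both rules agree, including on the tie at exactly 0.5, which is why a strict inequality is used in the check.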

__Changing the probability threshold from 0.5 to 0.25__.

This implies that the tree will classify an observation as 'Yes' when prob of Yes > 0.25

In [None]:
# Create an array with all the probabilities of Yes

prob_yes= tree_caravan_postprunned.predict_proba(X_test)[:,1]

In [None]:
# This loop computes the prediction of Y (No or Yes) for each test observation
# The predictions of Y are stored in an array called 'y_predicted_prob025' 
# The prediction uses a prob threshold of 0.25

y_predicted_prob025=np.empty(y_test.size, dtype=object)

for i in np.arange(0,y_predicted_prob025.size):
    if prob_yes[i] > 0.25:
        y_predicted_prob025[i]= 'Yes'
    else:
        y_predicted_prob025[i]= 'No'
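
The same thresholding rule can also be written without an explicit loop. This is a minimal sketch using `np.where` on hypothetical probabilities (`prob_yes_demo` is made up for illustration). It returns an array of strings rather than an object array, which should make no difference to the metric functions used in this notebook.

```python
import numpy as np

# Hypothetical probabilities of 'Yes' for five test observations
prob_yes_demo = np.array([0.05, 0.30, 0.25, 0.60, 0.10])

# Label 'Yes' where the probability exceeds the threshold, 'No' elsewhere
threshold = 0.25
y_pred_demo = np.where(prob_yes_demo > threshold, 'Yes', 'No')
print(y_pred_demo)
```

Note that a probability exactly equal to the threshold is classified as 'No' because the comparison is strict.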

The number of 'Yes' predictions made by the tree has increased after lowering the threshold to 0.25 (see next).

In [None]:
pd.Series(y_predicted_prob025).value_counts()

Do we get better performance metrics with a threshold of 0.25 than with 0.5? Let's get the confusion matrix and the classification report.

Note: The results with 0.5 are in the first notebook about classification trees.

In [None]:
confusion_matrix(y_test, y_predicted_prob025)

In [None]:
print (classification_report (y_test, y_predicted_prob025))

#### Are there other probability thresholds that can be used to make predictions?

Why did I select 0.25 instead of 0.65 or 0.75?

#### Why am I focusing on the Yes class? Why do I emphasize getting a better accuracy (i.e., recall) for the Yes class?

#### Creating other probability thresholds

In [None]:
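# Array of probability thresholds (0.05, 0.10, ..., 0.50)
# Rounding removes floating-point noise such as 0.15000000000000002, which would otherwise show up in the dictionary keys

array_prob= np.round(np.arange(0.05, 0.51, 0.05), 2)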

In [None]:
array_prob

#### Selecting the best probability threshold based on the f1-score

In [None]:
from sklearn.metrics import f1_score

In [None]:
dict_predictions= dict()

In [None]:
dict_f1_scores= dict()

In [None]:
for j in array_prob:
    dict_predictions[j]=np.empty(y_test.size, dtype=object)
    for i in np.arange(0, dict_predictions[j].size):
        if prob_yes[i] > j:
            dict_predictions[j][i]= 'Yes'
        else:
            dict_predictions[j][i]= 'No'
    dict_f1_scores[j]= np.round (f1_score(y_test, dict_predictions[j],pos_label='Yes'),3)

In [None]:
dict_f1_scores

What probability threshold results in the highest f1-score?

In [None]:
max(dict_f1_scores, key= dict_f1_scores.get)

What's the highest f1-score?

In [None]:
max(dict_f1_scores.values())
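
To make explicit what is being maximized, the sketch below computes the f1-score of the Yes class by hand, as the harmonic mean of precision and recall, for a few thresholds. The data and the helper `f1_for_threshold` are hypothetical and only for illustration; the notebook itself relies on `f1_score` from scikit-learn.

```python
import numpy as np

def f1_for_threshold(y_true, prob_yes, threshold, pos_label='Yes'):
    """F1-score of pos_label when it is predicted for prob_yes > threshold."""
    predicted_pos = prob_yes > threshold
    actual_pos = y_true == pos_label
    tp = np.sum(predicted_pos & actual_pos)    # true positives
    fp = np.sum(predicted_pos & ~actual_pos)   # false positives
    fn = np.sum(~predicted_pos & actual_pos)   # false negatives
    if tp == 0:
        return 0.0                             # no true positives: F1 taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and probabilities of 'Yes'
y_true_demo = np.array(['No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'])
prob_demo = np.array([0.05, 0.10, 0.30, 0.30, 0.60, 0.05, 0.10, 0.05])

# F1-score for each candidate threshold, then the threshold with the highest one
thresholds = [0.05, 0.25, 0.5]
f1_by_threshold = {t: round(float(f1_for_threshold(y_true_demo, prob_demo, t)), 3)
                   for t in thresholds}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)
print(f1_by_threshold, best_threshold)
```

In this toy example the intermediate threshold wins: the lowest threshold adds false positives and the highest one misses most of the Yes observations.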

__JUST FOR ILLUSTRATION PURPOSES ...__

#### Selecting the best probability threshold based on the accuracy for the Yes class (i.e., recall for the Yes class)

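GENERALLY, IT IS NOT A GOOD IDEA TO SELECT THE THRESHOLD BASED ON RECALL ONLY! Lowering the threshold can only keep or increase the recall of the Yes class, so recall alone always favors the lowest threshold, no matter how many false positives it produces.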

In [None]:
from sklearn.metrics import recall_score

In [None]:
dict_recall_scores= dict()

In [None]:
for j in array_prob:
    dict_recall_scores[j]= np.round (recall_score(y_test, dict_predictions[j],pos_label='Yes'),3)

In [None]:
dict_recall_scores
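
The caution above about selecting on recall alone can be seen in a small sketch with hypothetical data (`y_true_demo` and `prob_demo` are made up for illustration). A threshold of 0 predicts 'Yes' for every observation, which gives perfect recall for the Yes class while precision drops to the share of 'Yes' in the data.

```python
import numpy as np

# Hypothetical imbalanced data: nine 'No' and one 'Yes'
y_true_demo = np.array(['No'] * 9 + ['Yes'])
prob_demo = np.array([0.02, 0.05, 0.05, 0.10, 0.10, 0.20, 0.02, 0.05, 0.30, 0.40])

actual_yes = y_true_demo == 'Yes'
results = {}
for threshold in [0.0, 0.25]:
    predicted_yes = prob_demo > threshold
    tp = np.sum(predicted_yes & actual_yes)    # correctly predicted 'Yes'
    recall = tp / actual_yes.sum()             # share of actual 'Yes' that was found
    precision = tp / predicted_yes.sum()       # share of predicted 'Yes' that is correct
    results[threshold] = (float(recall), float(precision))
    print(threshold, recall, precision)
```

Both thresholds reach the same recall here, so recall alone cannot separate them, while precision clearly can.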

Since we got the best f1-score with threshold 0.1, let's use this threshold to make predictions and obtain the confusion matrix and classification report!

In [None]:
# This loop computes the prediction of Y (No or Yes) for each test observation
# The predictions of Y are stored in an array called 'y_predicted_prob010' 
# The prediction uses a prob threshold of 0.1

y_predicted_prob010=np.empty(y_test.size, dtype=object)

for i in np.arange(0,y_predicted_prob010.size):
    if prob_yes[i] > 0.10:
        y_predicted_prob010[i]= 'Yes'
    else:
        y_predicted_prob010[i]= 'No'

In [None]:
confusion_matrix (y_test, y_predicted_prob010)

In [None]:
print (classification_report (y_test, y_predicted_prob010))

### Post-pruning applied to the Default dataset

In [None]:
Default_df= pd.read_csv('C:\\Users\\jheredi2\\Documents\\PythonDataAnalytics\\1-Datasets\\Default.csv')

In [None]:
Default_df_dummies= pd.get_dummies(Default_df,columns=['student'], drop_first=True)

In [None]:
X_train_def, X_test_def, y_train_def, y_test_def= train_test_split (Default_df_dummies.iloc[:,1:], Default_df_dummies['default'], test_size=0.2, random_state=1)

In [None]:
tree_default_unprunned= DecisionTreeClassifier(criterion='gini', random_state=1)

In [None]:
# Use the Default tree (not the Caravan tree) to compute the pruning path
path_def= tree_default_unprunned.cost_complexity_pruning_path(X_train_def, y_train_def)

In [None]:
alphas_def= path_def['ccp_alphas']

In [None]:
accuracy_scores=[]
for i in alphas_def:
    treeloop= DecisionTreeClassifier(ccp_alpha=i, random_state=1)
    treeloop.fit(X_train_def, y_train_def)
    y_test_predicted=treeloop.predict(X_test_def)
    accuracy_scores.append(accuracy_score(y_test_def, y_test_predicted)) 

In [None]:
indexmax=accuracy_scores.index(max(accuracy_scores))

In [None]:
tree_default_postprunned= DecisionTreeClassifier(ccp_alpha= alphas_def[indexmax], random_state=1)

In [None]:
tree_default_postprunned.fit(X_train_def, y_train_def)

In [None]:
default_postprunned_predicted_test= tree_default_postprunned.predict(X_test_def)

In [None]:
confusion_matrix(y_test_def, default_postprunned_predicted_test)

In [None]:
print (classification_report (y_test_def, default_postprunned_predicted_test))

### TO DO INDEPENDENTLY FOR 10 MINUTES!

#### Choose the value of the probability threshold that gives you the best f1-score

#### Obtain the confusion matrix and the classification report using the chosen probability threshold