[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/exercises/6_ex_model_assessment.ipynb) 

# Exercises for Model Assessment 

This notebook is based on the [Model Assessment Notebook](https://). 
This notebook will guide you through related tasks to strengthen your understanding of these concepts and of Python programming. Completing this exercise will bring you yet another step closer to becoming a true data scientist!

Before we start, we will go through the following steps:
- import standard libraries and set plotting parameters
- import data and define target variable and features
- split the data (if you are not familiar with this concept, please return to the tutorial)
- train a logit and a tree model

At this point, these tasks have become standard practice for us, so we simply provide the code.

In [2]:
# Importing standard libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 

# Set parameters for plotting
%matplotlib inline  
plt.rcParams["figure.figsize"] = (12,6)

In [3]:
# Import data
data_url = 'https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq_modeling.csv' 
df = pd.read_csv(data_url, index_col="index")

# Zero-one encoding of the target (done first so that y receives the encoded values)
df['BAD'] = df['BAD'].astype(int)

# Split data into target and features
X = df.drop(['BAD'], axis=1)
y = df['BAD']

In [4]:
from sklearn.model_selection import train_test_split

# Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=888)

In [5]:
# Estimate a logit model
from sklearn.linear_model import LogisticRegression

# No regularization; scikit-learn 1.2 deprecated the string 'none' in favor of penalty=None
logit = LogisticRegression(penalty='none', fit_intercept=True)
logit.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [6]:
# Estimate a CART tree
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='gini', max_depth=5)
tree.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=5, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

## Tasks 

In case you have forgotten some general model assessment procedures, we will start with a few simple tasks to get you back into the topic!

1. Create discrete class predictions for the tree and logit model. Save the results for task 2.

2. Create the corresponding confusion matrix and manually compute, for both classifiers, the accuracy, precision and recall.

Are these results representative of how good the model is? What are the shortfalls of the accuracy measure?
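
As a reminder of how the three measures follow from the confusion matrix, here is a minimal sketch on a made-up toy example. The arrays `y_true` and `y_pred` are hypothetical placeholders, not output of the logit or tree model, and class 1 (bad) is treated as the positive class.

```python
import numpy as np

# Hypothetical labels and discrete predictions (1 = bad, 0 = good)
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0])

# Cells of the confusion matrix
tp = np.sum((y_true == 1) & (y_pred == 1))  # bads predicted as bad
fn = np.sum((y_true == 1) & (y_pred == 0))  # bads predicted as good
fp = np.sum((y_true == 0) & (y_pred == 1))  # goods predicted as bad
tn = np.sum((y_true == 0) & (y_pred == 0))  # goods predicted as good

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # share of predicted bads that are bad
recall = tp / (tp + fn)                     # share of actual bads that were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, a classifier that labels every case as good would still reach an accuracy of 5/8 while finding no bad case at all, which hints at why accuracy alone can mislead when classes are imbalanced.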
___

3. Calculate the class probabilities for class 1 for both models. Save the results for task 4.



4. Calculate the corresponding AUC values for both predictions and plot the ROC curve.

How can we interpret this measure? Is it representative?
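
One way to think about the AUC is as the probability that a randomly drawn bad case receives a higher score than a randomly drawn good case. The sketch below checks this on four made-up observations by comparing all positive-negative pairs; `y_true` and `scores` are illustrative values, not model output.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]  # scores of the actual bads
neg = scores[y_true == 0]  # scores of the actual goods

# Compare every positive with every negative; ties count as half a win
diff = pos[:, None] - neg[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(auc)
```

Three of the four pairs are ranked correctly here. Calling `metrics.roc_auc_score(y_true, scores)` on the same arrays should agree with this pairwise count.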
____
Let's have a closer look at how the cutoffs influence our final results. 

5. Calculate the class probabilities and use these to create discrete class predictions for multiple cutoffs. Vary the cutoff from 0 to 1 in step-sizes of 0.01. Save the accuracy results and corresponding cutoffs. Finally, plot these results, with the cutoffs on the x-axis and the accuracy on the y-axis. Which cutoff gave the highest accuracy?

Examine these results. How does the cutoff influence the prediction? The `some_model.predict()` function uses a 0.5 cutoff. How did this cutoff perform in your calculations? What cutoff would you recommend based on your results?
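
The cutoff sweep could be organised roughly as follows. This is only a sketch on simulated data: `y_test_toy` and `probs_toy` are made-up stand-ins for the test labels and the class 1 probabilities of one of your models, and a case is assumed to be classified as bad when its probability is at or above the cutoff.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated labels (roughly 20% bads) and noisy probability predictions
y_test_toy = (rng.random(500) < 0.2).astype(int)
probs_toy = np.clip(0.2 + 0.4 * y_test_toy + rng.normal(0, 0.2, 500), 0, 1)

# 101 cutoffs: 0.00, 0.01, ..., 1.00
cutoffs = np.linspace(0, 1, 101)

# Accuracy of the rule "predict 1 if probability >= cutoff" for every cutoff
accuracies = np.array(
    [np.mean((probs_toy >= c).astype(int) == y_test_toy) for c in cutoffs]
)

best_cutoff = cutoffs[np.argmax(accuracies)]
print(f"best cutoff: {best_cutoff:.2f} with accuracy {accuracies.max():.3f}")
print(f"accuracy at cutoff 0.50: {accuracies[50]:.3f}")
```

Using `np.linspace` rather than `np.arange` avoids floating-point surprises at the upper end of the grid. The requested plot is then a single call such as `plt.plot(cutoffs, accuracies)`.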
___

6. Using your continuous predictions, extract the true and false positive rates with the function `metrics.roc_curve()` from `sklearn`, and plot a ROC curve. Reusing the cutoff that gave the highest accuracy in task 5, identify the corresponding true and false positive rates and highlight this point on the ROC curve. Also highlight the point on the ROC curve that corresponds to a cutoff of 0.5.
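
One possible way to locate the point that belongs to a given cutoff is to compute its true and false positive rate directly, instead of searching the thresholds returned by `metrics.roc_curve()`. The sketch below does this on simulated data; `y_toy`, `probs_toy`, and the helper `roc_point` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(1)

# Simulated labels (roughly 25% bads) and noisy probability predictions
y_toy = (rng.random(400) < 0.25).astype(int)
probs_toy = np.clip(0.25 + 0.35 * y_toy + rng.normal(0, 0.2, 400), 0, 1)

# drop_intermediate=False keeps every distinct threshold as a point of the curve
fpr, tpr, thresholds = metrics.roc_curve(y_toy, probs_toy, drop_intermediate=False)


def roc_point(y, probs, cutoff):
    """Return (FPR, TPR) of the rule 'predict 1 if probability >= cutoff'."""
    pred = probs >= cutoff
    tpr_c = np.sum(pred & (y == 1)) / np.sum(y == 1)
    fpr_c = np.sum(pred & (y == 0)) / np.sum(y == 0)
    return fpr_c, tpr_c


fpr_05, tpr_05 = roc_point(y_toy, probs_toy, 0.5)
print(f"cutoff 0.5 -> FPR={fpr_05:.3f}, TPR={tpr_05:.3f}")
```

After drawing the curve with `plt.plot(fpr, tpr)`, a call such as `plt.scatter(fpr_05, tpr_05)` would mark the point; the same helper can be reused for the accuracy-maximising cutoff from task 5.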



7. Manually define a cutoff in which the ratio of predictions of 0s (goods) and 1s (bads) is representative of the ratio of them in the training data. For example, if your good-to-bad ratio in the training set was 3:1, then your discrete class predictions for the test set should also display this ratio.
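
A cutoff with this property can be read off the distribution of the predicted probabilities: if a share `bad_rate` of the training cases is bad, the `1 - bad_rate` quantile of the test set probabilities leaves roughly that share above it. The following sketch assumes a 3:1 ratio and uses random numbers as a stand-in for model predictions; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical training labels with a good-to-bad ratio of 3:1
y_train_toy = np.array([0] * 75 + [1] * 25)

# Stand-in for the predicted probabilities of class 1 on the test set
probs_test_toy = rng.random(200)

# Share of bads in the training data
bad_rate = y_train_toy.mean()

# Cutoff that leaves a share of bad_rate of the test predictions above it
cutoff = np.quantile(probs_test_toy, 1 - bad_rate)
pred = (probs_test_toy >= cutoff).astype(int)

print(f"cutoff={cutoff:.3f}, share predicted bad={pred.mean():.2f}")
```

With a decision tree the match will usually be only approximate, because a tree of depth 5 can output at most 32 distinct probabilities and many test cases therefore share the same score.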

8. Assess your classifier by creating a precision-recall plot.

How does this plot differ from the ROC curve? In which situations would you use each of them?
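
A minimal sketch of how the inputs for such a plot can be obtained with `metrics.precision_recall_curve()` is given below. It runs on simulated data, so `y_toy` and `probs_toy` are assumptions that stand in for the test labels and the class 1 probabilities of your classifier.

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(3)

# Simulated labels (roughly 20% bads) and noisy probability predictions
y_toy = (rng.random(400) < 0.2).astype(int)
probs_toy = np.clip(0.2 + 0.4 * y_toy + rng.normal(0, 0.2, 400), 0, 1)

# precision and recall contain one more element than thresholds
precision, recall, thresholds = metrics.precision_recall_curve(y_toy, probs_toy)

# Flagging every case as bad yields a precision equal to the share of bads
baseline = y_toy.mean()

print(f"{len(thresholds)} thresholds, no-skill precision = {baseline:.2f}")
```

The plot itself follows from `plt.plot(recall, precision)`. Note that the no-skill reference in this plot is a horizontal line at the share of bads, whereas in ROC space it is the diagonal.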
___

Next we want to find out how the size of the training set can affect our predictions.

9. Create a loop in which you train multiple logit models. Vary a parameter `train_set_size` from 0.1 to 1 in steps of 0.1 (a value of 0 would leave no data to train on). In each iteration of the loop, estimate a logit model using the fraction `train_set_size` of the actual training set, which we created at the beginning of the exercise. Calculate the test set AUC for each model and save it in an array. Finally, create a plot of `train_set_size` on the x-axis versus the corresponding test set AUC on the y-axis.

Looking at this plot, do you think logit is sensitive toward the size of the training set?

**Optional:** you could also repeat the above task and create a plot for a decision tree model. This would facilitate comparing the 'hunger' for data between logit and trees.
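
The structure of such a loop might look like the sketch below. It is not a solution for the credit data: it uses a synthetic data set from `make_classification`, the default (L2-regularized) `LogisticRegression`, and illustrative names such as `fractions` and `aucs`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

fractions = np.linspace(0.1, 1.0, 10)  # 0.1, 0.2, ..., 1.0
aucs = []
for frac in fractions:
    # train_test_split has already shuffled the rows, so the first n rows
    # form a random subset of the training data
    n = int(round(frac * len(X_tr)))
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr[:n], y_tr[:n])
    # Always evaluate on the same, untouched test set
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
aucs = np.array(aucs)

for frac, auc in zip(fractions, aucs):
    print(f"train_set_size={frac:.1f}  test AUC={auc:.3f}")
```

With pandas objects such as `X_train` and `y_train`, the positional slices would be written as `X_train.iloc[:n]` and `y_train.iloc[:n]`. Swapping in a `DecisionTreeClassifier` inside the loop gives the optional comparison.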
___


10. Familiarize yourself with the `StratifiedKFold()` function of `sklearn`. We want to use this function to create 5 splits. Use the `cross_validate()` function to calculate the corresponding average AUC. Compare the results to the ones in task 4.



Which model performed better? How do you explain these results? Read up on this function. How does it work and why would you use it?
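
For orientation, the two pieces can be combined as in the following sketch. It again relies on a synthetic, imbalanced data set rather than the credit data, fixes seeds for reproducibility, and uses the default `LogisticRegression` settings; the dictionary names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly 20% positives
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Five folds, each preserving the class ratio of the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logit": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0),
}

mean_auc = {}
for name, model in models.items():
    # cross_validate refits the model on each training fold and scores the held-out fold
    res = cross_validate(model, X_toy, y_toy, cv=cv, scoring="roc_auc")
    mean_auc[name] = res["test_score"].mean()
    print(f"{name}: mean AUC over {len(res['test_score'])} folds = {mean_auc[name]:.3f}")
```

Passing the same `cv` object to both models means they are evaluated on identical folds, which makes the comparison fairer than using independent random splits.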
___

Well done!! You have reached the end of this exercise.