[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/exercises/6_ex_model_assessment.ipynb) 

# Exercises for Model Assessment 

This notebook is based on the [Model Assessment Notebook](https://). 
It guides you through related tasks to strengthen your understanding of these concepts and of Python programming. After completing this exercise, you will be another step closer to becoming a data scientist!

Before we start, we go through the following steps:
- import standard libraries and set plotting parameters
- import data and define target variable and features
- split the data (if you are not familiar with this concept, please return to the tutorial)
- train a logit and a tree model

In [1]:
# Importing standard libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 

# Set parameters for plotting
%matplotlib inline  
plt.rcParams["figure.figsize"] = (12,6)

In [2]:
# Import data
data_url = 'https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq_modeling.csv' 
df = pd.read_csv(data_url, index_col="index")

# Zero-one encoding of the target (done first so that y picks up the encoded values)
df['BAD'] = df['BAD'].astype(int) 

# Split data into target and features
X = df.drop(['BAD'], axis=1) 
y = df['BAD']

In [3]:
from sklearn.model_selection import train_test_split

# Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=888)

In [4]:
# Estimate a logit model
from sklearn.linear_model import LogisticRegression

# Note: scikit-learn 1.2 and later expect penalty=None; the string 'none' was removed in 1.4
logit = LogisticRegression(penalty='none', fit_intercept=True)
logit.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [5]:
# Estimate a CART tree
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='gini', max_depth=5)
tree.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=5, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

## Tasks 

In case you have forgotten some general model assessment procedures, we start with a few simple tasks to get you back into the topic! 

1. Create discrete class predictions for the tree and the logit model. Save the results for task 2.

2. Create the corresponding confusion matrix and manually compute the accuracy, precision and recall.

Are these results representative of how good the model is? What are the shortfalls of the accuracy measure?
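
As a hint for task 2, here is a minimal sketch of how the three measures follow from the four cells of a confusion matrix. The label arrays are hypothetical toy values, not HMEQ results; in the exercise you would pass `y_test` and your own class predictions instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (1 = bad, 0 = good), used only for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# scikit-learn puts the true classes in the rows and the predicted classes in the columns
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all cases classified correctly
precision = tp / (tp + fp)                  # share of predicted bads that are truly bad
recall = tp / (tp + fn)                     # share of true bads that were detected

print(cm)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

Note that the toy classifier misses half of the bads while still reaching 70% accuracy, which is worth keeping in mind for the questions above.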
___

3. Calculate the class probabilities for class 1 for both models. Save the results for task 4.



4. Calculate the corresponding AUC values for both predictions and plot the ROC curve.

How can we interpret this measure? Is it representative?
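
One way to think about the AUC: it equals the probability that a randomly drawn positive case receives a higher score than a randomly drawn negative case. The sketch below checks this on four hypothetical scores; with your models, the scores would come from something like `logit.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical toy labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# AUC as computed by scikit-learn, plus the points of the ROC curve
auc = roc_auc_score(y_true, scores)
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Pairwise view: share of (positive, negative) pairs ranked correctly
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(auc, pairwise)
```

Plotting `fpr` against `tpr` gives the ROC curve; the diagonal corresponds to random guessing.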
____
Let's have a closer look at how the cut-offs influence our final results. 

5. Calculate the class probabilities and use these to create discrete class predictions for multiple cut-offs. Vary the cut-off from 0 to 1 in steps of 0.01. Save the accuracy results and corresponding cut-offs. Finally, plot these results, with the cut-offs on the x-axis and the accuracy on the y-axis. 

Analyse these results. How does the cut-off influence the prediction? The `some_model.predict()` function uses a 0.5 cut-off. How did this cut-off perform in your calculations? What cut-off would you recommend based on your results?
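
The cut-off sweep can be sketched as follows. The probabilities are hypothetical toy values, and classifying a case as 1 when its probability is at or above the cut-off is an assumption of this sketch; you may treat ties differently.

```python
import numpy as np

# Hypothetical toy labels and predicted probabilities for class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
p_bad = np.array([0.052, 0.104, 0.153, 0.207, 0.301, 0.354, 0.608, 0.402, 0.705, 0.903])

# 101 evenly spaced cut-offs: 0.00, 0.01, ..., 1.00
cutoffs = np.linspace(0, 1, 101)

# Accuracy of the discrete predictions at each cut-off
accuracies = np.array([np.mean((p_bad >= c).astype(int) == y_true) for c in cutoffs])

best_cutoff = cutoffs[np.argmax(accuracies)]  # first cut-off reaching the maximum accuracy
acc_at_half = accuracies[50]                  # index 50 is the 0.5 cut-off

print(f"best cut-off: {best_cutoff:.2f}, accuracy there: {accuracies.max():.2f}")
print(f"accuracy at 0.5: {acc_at_half:.2f}")
```

Calling `plt.plot(cutoffs, accuracies)` on these two arrays gives the plot the task asks for.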
___

6. Using your continuous predictions, extract the true and false positive rates and plot a ROC curve. Highlight the optimal cut-off based on task 5.



7. Manually define a cut-off for which the ratio of predicted 0s (goods) to 1s (bads) matches their ratio in the training data. 

8. Assess your classifier by creating a precision-recall plot.

How does this plot differ from the ROC curve? Which of them would you use in different situations?
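
A minimal sketch of the ingredients of a precision-recall plot, again on hypothetical toy scores. One difference to keep in mind when comparing with the ROC curve: a random classifier sits on the diagonal in ROC space, while in precision-recall space its expected precision equals the share of positives in the data.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical toy labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One precision/recall pair per threshold, plus a final (precision=1, recall=0) point
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Average precision summarizes the curve in a single number
ap = average_precision_score(y_true, scores)

# Expected precision of a random classifier: the share of positives
baseline = y_true.mean()

print(precision, recall, ap, baseline)
```

Plotting `recall` on the x-axis against `precision` on the y-axis, with a horizontal line at `baseline`, gives the plot for task 8.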
___

Next, we want to find out how the size of the training and test set can affect our predictions.

9. Create a loop for the logit model. Vary the `test_size` parameter from 0.1 to 0.9 in steps of 0.1 (the endpoints 0 and 1 would leave the test or the training set empty and are rejected by `train_test_split`). Build the model on each training set, calculate the AUC value on the test set, and save it together with the corresponding `test_size` parameter. Finally, plot the `test_size` on the x-axis and the corresponding AUC value on the y-axis.

Analyze the plot. How does the `test_size` parameter influence the predictions, as well as the model itself?
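
The loop could look roughly like this. The sketch runs on synthetic stand-in data from `make_classification` (an assumption, so that it works without the HMEQ file) and uses the default regularized logit; in the exercise you would use `X`, `y` and the logit settings from above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with X and y from the notebook
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=888)

test_sizes = np.linspace(0.1, 0.9, 9)  # 0.1, 0.2, ..., 0.9
auc_values = []

for size in test_sizes:
    # New split for every test size, same seed for comparability
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=size, random_state=888
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, y_tr)
    # AUC on the hold-out part, based on the class-1 probabilities
    auc_values.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

for size, auc in zip(test_sizes, auc_values):
    print(f"test_size={size:.1f}  AUC={auc:.3f}")
```

Keep in mind that a single split per test size is noisy; repeating the loop with several seeds gives a more stable picture.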
___

To finish off on a high note, we end with a less complex task.

10. Familiarize yourself with the `StratifiedKFold()` class. Use it to create 5 splits. Then use the `cross_validate()` function to calculate the corresponding average AUC for each model. Compare the results to the ones in task 4.



Which model performed better? How do you explain these results? Read up on this function. How does it work and why would you use it?
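
A minimal sketch of stratified 5-fold cross-validation for both model types. It again uses synthetic stand-in data and default logit settings (both assumptions of this sketch); shuffling with a fixed seed is one possible choice, not a requirement of the task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with X and y from the notebook
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=888)

# Five folds, each preserving the class shares of the full data set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=888)

models = {
    "logit": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=888),
}

mean_auc = {}
for name, model in models.items():
    # cross_validate refits the model on each training fold and scores it on the held-out fold
    results = cross_validate(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    mean_auc[name] = results["test_score"].mean()

print(mean_auc)
```

Because every observation is used for testing exactly once, the averaged AUC depends less on one particular split than the single hold-out estimate from task 4.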
___
___

Well done! You have reached the end of this exercise.