[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorials/7_nb_model_selection.ipynb) 


# Chapter 7 - Regularization and model selection 
We have learned about 

The outline of the tutorial is as follows:
- Preliminaries
- Regularized logistic regression
- ...

# Preliminaries
We begin as usual with importing our standard libraries and also our standard modeling data. 

In [1]:
# Import standard Python libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
from sklearn.model_selection import train_test_split

# Some configuration of the plots we will create later
%matplotlib inline  
plt.rcParams["figure.figsize"] = (12,6)

# Load credit risk data in pre-processed format from GitHub
data_url = 'https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq_modeling.csv' 
df = pd.read_csv(data_url, index_col="index")

# Pretty printing
from pprint import pprint

# Extract target variable and feature matrix 
X = df.drop(['BAD'], axis=1) 
y = df[['BAD']]

# Split data into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=888)
print("Remember the shape of our data: ")
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

# Regularized logistic regression 
Regularization is an approach to find a better balance between bias and variance in the **bias-variance trade-off** and thereby reduce the error of a model. Remember that we can show the (generalization) error of a model to be a function of bias and variance. 

Complex models often show a high variance. Model complexity and bias are closely connected (low complexity -> high bias and vice versa). Introducing bias can reduce error by reducing variance. 

![bias and variance](https://miro.medium.com/max/1050/1*oO0KYF7Z84nePqfsJ9E0WQ.png)
Image source: [Giorgos Papachristoudis: The Bias-Variance Tradeoff](https://towardsdatascience.com/the-bias-variance-tradeoff-8818f41e39e9)

How can we implement this idea? How can we increase bias to reduce variance? The answer depends on the type of model, but at least for regression-type models the answer is: add a complexity penalty. 

In a regression setting, large coefficients are indicators of complex, unstable models. Possible causes include high dimensionality and multicollinearity. We therefore want the model to keep the magnitude of its coefficients small, and we achieve this by including a measure of coefficient magnitude in the loss function. 

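$$ \boldsymbol{\beta} \leftarrow \arg\min_{\boldsymbol{\beta}} \; \mathcal{L}(\boldsymbol{\beta}) + \lambda \lVert \boldsymbol{\beta} \rVert $$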

This penalty shrinks the coefficients toward zero and, depending on the norm we choose, can produce sparser models. Furthermore, we also have a new meta-parameter $\lambda$, which governs the strength of regularization. Simply put, $\lambda$ embodies our preference for models that fit the training data more accurately (low $\lambda$) or models that are less complex (high $\lambda$). It is hard, if not impossible, to tell suitable settings of $\lambda$ a priori. Thus, we typically tune this *hyperparameter* for each data set. More on hyperparameter tuning later.

We have discussed two forms of common regularization for the example of logistic regression. Both forms work by including a measure of coefficient size into the loss function (i.e. the function which the algorithm optimizes) in the form of a penalty. Intuitively, instead of telling the algorithm to build a model that fits well, we now tell it to build a model that fits well *and* keeps the coefficients small by deducting points in relation to the size of coefficients. 

The difference between the *lasso* and *ridge* penalty is then only whether we penalize the sum of the absolute values or the sum of the squares of the coefficients. While lasso tends to set coefficients to 0 completely, the ridge penalty reduces the coefficient size more evenly. We will see that "why not both?" is also a legitimate suggestion and leads to the *elastic net* penalty.

- why bother -> bias variance trade-off
- how -> equations of regularized likelihood function and the 

## Options for regularizing the logistic regression model 

### LASSO
Equation and brief discussion of pros and cons
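
As a sketch in the notation of the general formula above, and assuming $p$ coefficients $\beta_1, \dots, \beta_p$ with an unpenalized intercept (an assumption, not something the text fixes), the lasso objective could be written as:

$$ \boldsymbol{\beta} \leftarrow \arg\min_{\boldsymbol{\beta}} \; \mathcal{L}(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} |\beta_j| $$

The absolute-value penalty can set coefficients exactly to zero, which gives a sparse model. A commonly noted downside is that, among strongly correlated features, lasso tends to keep one and drop the others somewhat arbitrarily.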

### Ridge
Equation and brief discussion of pros and cons
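
As a sketch in the notation of the general formula above, and again assuming $p$ coefficients with an unpenalized intercept, the ridge objective could be written as:

$$ \boldsymbol{\beta} \leftarrow \arg\min_{\boldsymbol{\beta}} \; \mathcal{L}(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \beta_j^2 $$

The squared penalty shrinks all coefficients smoothly and tends to behave well under multicollinearity. It typically does not set coefficients exactly to zero, so it does not perform feature selection.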

### Elastic net
Equation and brief discussion of pros and cons
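
One common way to write the elastic net objective is sketched below. The mixing weight $\alpha \in [0, 1]$ is a symbol introduced here for illustration; other parametrizations exist (for example, two separate penalty weights).

$$ \boldsymbol{\beta} \leftarrow \arg\min_{\boldsymbol{\beta}} \; \mathcal{L}(\boldsymbol{\beta}) + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right) $$

Setting $\alpha = 1$ recovers the lasso penalty and $\alpha = 0$ the ridge penalty. The price for this flexibility is a second hyperparameter that needs tuning.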

## TODO Training regularized logistic regression
We have looked at a manual implementation of the logistic regression model in [Tutorial 5](https://github.com/Humboldt-WI/bads/blob/master/tutorials/5_nb_supervised_learning.ipynb). When adding a penalty term to the likelihood function for regularizing our model, we also need to adjust model fitting. Specifically, we now have to minimize the regularized (negative log-)likelihood function. We can still use *gradient descent* but have to adjust the computation of the gradient. Going through this process and implementing regularized logistic regression from scratch would be a perfect exercise to further sharpen your data science skills. Have a look at [Tulrose Deori's post](https://towardsdatascience.com/implement-logistic-regression-with-l2-regularization-from-scratch-in-python-20bd4ee88a59) for some inspiration if needed. 
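
As a hint for that exercise, here is a sketch of how the gradient changes for the ridge penalty. The learning rate $\eta$ is a symbol introduced here for illustration, and the intercept is assumed to be left out of the penalty.

$$ \nabla_{\boldsymbol{\beta}} \left[ \mathcal{L}(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \beta_j^2 \right] = \nabla_{\boldsymbol{\beta}} \mathcal{L}(\boldsymbol{\beta}) + 2 \lambda \boldsymbol{\beta} $$

$$ \boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \eta \left( \nabla_{\boldsymbol{\beta}} \mathcal{L}(\boldsymbol{\beta}) + 2 \lambda \boldsymbol{\beta} \right) $$

Each update thus pulls the coefficients a little toward zero in addition to following the gradient of the unpenalized loss. The lasso penalty is not differentiable at zero, so plain gradient descent needs a further modification there.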

In this tutorial, we skip the from-scratch implementation and move straight to estimating regularized logit models using our beloved `sklearn` library. 

### Benchmark
The idea of regularization is to build better models, and we often understand better as more accurate. How can we tell whether regularization helps? We need a benchmark:
- split data into train & test
- estimate vanilla logit model
- calc test auc and accuracy
- also calc test auc/accuracy of a naive model always predicting good
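
The benchmark steps above can be sketched as follows. To keep the example self-contained, it uses synthetic, imbalanced data from `make_classification` as a stand-in for our credit data (an assumption), and it approximates the unregularized logit by a very large `C`, so that the penalty is negligible. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): roughly 80% of cases in class 0 ("good")
X_demo, y_demo = make_classification(n_samples=2000, n_features=15, n_informative=5,
                                     weights=[0.8, 0.2], random_state=888)
Xtr, Xts, ytr, yts = train_test_split(X_demo, y_demo, test_size=0.3, random_state=888)

# Vanilla logit: a very large C makes sklearn's default penalty negligible
logit = LogisticRegression(C=1e6, max_iter=1000).fit(Xtr, ytr)
p_logit = logit.predict_proba(Xts)[:, 1]

# Naive benchmark: always predict class 0 ("good")
p_naive = np.zeros(len(yts))

# Collect (AUC, accuracy) on the test set for both models
bench = {
    "logit": (roc_auc_score(yts, p_logit), accuracy_score(yts, logit.predict(Xts))),
    "naive": (roc_auc_score(yts, p_naive), accuracy_score(yts, p_naive.astype(int))),
}
for name, (auc, acc) in bench.items():
    print(f"{name}: test AUC = {auc:.3f}, test accuracy = {acc:.3f}")
```

On imbalanced data the naive model can reach a respectable accuracy while its AUC stays at chance level, which is why it helps to report both measures.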

### SKlearn regularization
- estimate four logit models (vanilla, lasso, ridge, elastic net) using sklearn and the credit risk data
- compare model coefficients
  - insert a sub-chapter which compares lasso to ridge (results are already available, just need discussion)
  - we predict that lasso removes variables while ridge does not
  - make sure that we see some zero coefficients in our lasso model
- compare model predictive performance
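
A minimal sketch of this comparison is given below. It assumes synthetic stand-in data and an arbitrary penalty strength of `C=0.05`, which is not tuned. Note that `sklearn` parametrizes the penalty through `C`, the inverse of the regularization strength, so a small `C` means strong regularization. The `penalty`, `solver`, and `l1_ratio` arguments follow the long-standing `LogisticRegression` interface; please check the documentation of your installed version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 5 informative and 15 pure noise features
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                     n_redundant=0, random_state=888)
Xtr, Xts, ytr, yts = train_test_split(X_demo, y_demo, test_size=0.3, random_state=888)

# Standardize so that the penalty treats all coefficients on the same scale
scaler = StandardScaler().fit(Xtr)
Xtr, Xts = scaler.transform(Xtr), scaler.transform(Xts)

C = 0.05  # illustrative penalty setting, not tuned
models = {
    "vanilla": LogisticRegression(C=1e6, max_iter=5000),  # negligible penalty
    "lasso": LogisticRegression(penalty="l1", C=C, solver="liblinear"),
    "ridge": LogisticRegression(penalty="l2", C=C, max_iter=5000),
    "enet": LogisticRegression(penalty="elasticnet", l1_ratio=0.5, C=C,
                               solver="saga", max_iter=5000),
}

# Store (test AUC, number of coefficients that are exactly zero) per model
results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    auc = roc_auc_score(yts, model.predict_proba(Xts)[:, 1])
    n_zero = int(np.sum(model.coef_ == 0))
    results[name] = (auc, n_zero)
    print(f"{name}: test AUC = {auc:.3f}, zero coefficients = {n_zero}")
```

Counting exact zeros in `coef_` is a quick way to check the prediction that lasso removes variables while ridge only shrinks them.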

### Bias and variance
The idea of regularization is to introduce bias to reduce variance. Can we confirm that this works?
- use all data and run a cross-validation
- test whether the variance in predictive performance (AUC + accuracy) is larger for logit compared to a regularized logit (no need to use all 3 regularized models here)
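
One way to set up this check is sketched below. It assumes a small synthetic data set with many features, 10-fold cross-validation, and a ridge-regularized logit with an untuned `C=0.05`. The spread of the fold-wise AUC values is only a rough proxy for model variance, and the outcome depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): few observations relative to the number of features
X_demo, y_demo = make_classification(n_samples=300, n_features=50, n_informative=5,
                                     random_state=888)

# Same folds for both models so that the comparison is fair
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=888)

candidates = {
    "logit": make_pipeline(StandardScaler(), LogisticRegression(C=1e6, max_iter=5000)),
    "ridge logit": make_pipeline(StandardScaler(), LogisticRegression(C=0.05, max_iter=5000)),
}

# One AUC value per fold and model
cv_auc = {name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
          for name, model in candidates.items()}

for name, scores in cv_auc.items():
    print(f"{name}: mean AUC = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Putting the scaler into the pipeline ensures that it is re-estimated on the training folds only. Passing `scoring="accuracy"` repeats the analysis for accuracy.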

### Regularization path
We have seen above that LASSO performs feature selection by setting coefficients to zero. A nice feature of regularized linear models is that we can compute the **full regularization path**. This means we can examine the magnitude of the coefficient values across different settings of the penalty parameter. The larger the penalty, the more emphasis is put on shrinking coefficients and the less emphasis is put on minimizing the loss. The following code, which is adapted from the set of [sklearn examples](https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html#sphx-glr-auto-examples-linear-model-plot-lasso-coordinate-descent-path-py), showcases how to produce a graph of the regularization path. This analysis is useful to inform our choice of penalty values for grid-search. However, for larger data sets, computing the regularization path is a costly exercise. If the goal is to learn which features are most valuable, there are cheaper ways to establish feature importance. We will learn about these in [Tutorial 9](https://github.com/Humboldt-WI/bads/blob/master/tutorials/9_nb_feature_engineering.ipynb). For now, however, let's see how we can calculate the full regularization path using `sklearn` and reproduce the nice picture from one of the lecture slides.


In [None]:
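from sklearn.linear_model import lasso_path
# lasso_path solves a linear-regression lasso without intercept, so we standardize X and center the 1-d target first
X_std = (X - X.mean()) / X.std()
alphas_lasso, coefs_lasso, _ = lasso_path(X_std.values, (y['BAD'] - y['BAD'].mean()).values)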

In [None]:
# Display results based on https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html#sphx-glr-auto-examples-linear-model-plot-lasso-coordinate-descent-path-py
from itertools import cycle
colors = cycle(['b', 'r', 'g', 'c', 'k'])

plt.figure(figsize=[10,8])

neg_log_alphas_lasso = -np.log10(alphas_lasso)
for coef_l, c in zip(coefs_lasso, colors):
    l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)

plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso regularization paths')
plt.show();

# TODO Overfitting
Ultimately, regularization is a way to address one of the key problems in predictive modeling, the problem of overfitting. 
- discuss results from previous analysis
- do we see evidence of overfitting?
- Maybe not since using linear models. Let's check out trees
**showcase overfitting problem using deep trees**
e.g., a chart of AUC or accuracy on train and test versus tree depth 
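
A minimal sketch of such an analysis is shown below. It assumes synthetic data with some label noise (`flip_y=0.1`) as a stand-in for our credit data and collects train and test AUC for trees of increasing depth. All names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption) with 10% randomly flipped labels
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                     flip_y=0.1, random_state=888)
Xtr, Xts, ytr, yts = train_test_split(X_demo, y_demo, test_size=0.3, random_state=888)

depths = list(range(1, 21))
auc_train, auc_test = [], []
for depth in depths:
    # Grow a tree of the given maximum depth and score it on both partitions
    tree = DecisionTreeClassifier(max_depth=depth, random_state=888).fit(Xtr, ytr)
    auc_train.append(roc_auc_score(ytr, tree.predict_proba(Xtr)[:, 1]))
    auc_test.append(roc_auc_score(yts, tree.predict_proba(Xts)[:, 1]))

for depth, a_tr, a_ts in zip(depths, auc_train, auc_test):
    print(f"depth {depth:2d}: train AUC = {a_tr:.3f}, test AUC = {a_ts:.3f}")
```

Plotting `auc_train` and `auc_test` against `depths` with `plt.plot` gives the chart. A growing gap between the two curves is the typical signature of overfitting.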

# TODO Model selection
The analysis of the regularization path has illustrated how different values of the penalty lead to different models. An effect on predictive performance can also be expected. 
- examine for one regularized model how AUC changes with different penalty values

## Grid search
The lecture introduced *grid-search* as a versatile approach to model selection, also known as hyperparameter tuning. Let's revisit the approach:
- demonstrate tuning manually
- split data into training, validation and test
- specify some candidate settings for regularization parameter
- find best setting on validation data and check how well model predicts on test
- compare to logit w/o regularization
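
The manual procedure above can be sketched as follows. The example assumes synthetic stand-in data, a 60/20/20 split, a lasso-penalized logit, and an arbitrary list of candidate values. Recall that `sklearn` uses `C`, the inverse of the regularization strength, so the candidates are values of `C` rather than of $\lambda$.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=3000, n_features=20, n_informative=5,
                                     random_state=888)

# 60% training, 20% validation, 20% test
X_rest, X_ts, y_rest, y_ts = train_test_split(X_demo, y_demo, test_size=0.2, random_state=888)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=888)

# Candidate settings (illustrative); small C means strong regularization
candidate_C = [0.001, 0.01, 0.1, 1, 10, 100]
val_auc = {}
for C in candidate_C:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X_tr, y_tr)
    val_auc[C] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Pick the setting with the best validation AUC and refit on training + validation data
best_C = max(val_auc, key=val_auc.get)
final = LogisticRegression(penalty="l1", C=best_C, solver="liblinear").fit(X_rest, y_rest)
test_auc = roc_auc_score(y_ts, final.predict_proba(X_ts)[:, 1])

# Logit with negligible penalty for comparison
plain = LogisticRegression(C=1e6, max_iter=5000).fit(X_rest, y_rest)
plain_auc = roc_auc_score(y_ts, plain.predict_proba(X_ts)[:, 1])

print(f"best C = {best_C}, test AUC tuned = {test_auc:.3f}, test AUC plain logit = {plain_auc:.3f}")
```

The test set is touched only once, after the penalty has been chosen, so the reported test AUC is not biased by the tuning.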

## Hyperparameter tuning in sklearn
- demonstrate how we can use sklearn for tuning
- demo should highlight that tuning different models is easy
  - can compare different regularizers
  - also can add trees and tune hyperparameter like max depth
- also mention/demonstrate flexibility in terms of data use
  - first replicate above train/validation/test splitting
  - then showcase a few other processes like cross-validation and test set
  - or repeated CV and test set
- since cross-validation is costly, maybe add a short demo of running CV in parallel; skip if too difficult  
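
A minimal sketch of tuning with `GridSearchCV` is given below. It assumes synthetic stand-in data, illustrative parameter grids, and 5-fold cross-validation on the training set followed by a single evaluation on the test set. Setting `n_jobs=-1` asks `sklearn` to distribute the cross-validation runs across all available cores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                     random_state=888)
Xtr, Xts, ytr, yts = train_test_split(X_demo, y_demo, test_size=0.3, random_state=888)

# The same interface tunes very different models; grids are illustrative
searches = {
    "lasso logit": GridSearchCV(LogisticRegression(penalty="l1", solver="liblinear"),
                                param_grid={"C": [0.01, 0.1, 1, 10]},
                                scoring="roc_auc", cv=5, n_jobs=-1),
    "tree": GridSearchCV(DecisionTreeClassifier(random_state=888),
                         param_grid={"max_depth": [2, 4, 6, 8, 10]},
                         scoring="roc_auc", cv=5, n_jobs=-1),
}

test_scores = {}
for name, search in searches.items():
    search.fit(Xtr, ytr)  # cross-validates every candidate, then refits the best on all of Xtr
    test_scores[name] = search.score(Xts, yts)  # test AUC of the refitted best model
    print(f"{name}: best params = {search.best_params_}, "
          f"CV AUC = {search.best_score_:.3f}, test AUC = {test_scores[name]:.3f}")
```

The `cv` argument also accepts splitter objects such as `RepeatedStratifiedKFold`, which is one way to switch to repeated cross-validation without changing the rest of the code.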