# Explanation of notebook
The goal of this notebook is to explain the tuning factors (or regularization factors) of common regression and classification models.

For a deeper explanation of the models themselves, please check out this Coursera course: https://www.coursera.org/learn/python-machine-learning. The content of this notebook is derived from lecture 2 in that course.

In [111]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC



## KNN (works both for classification and regression)
A KNN model has one main tuning factor, namely **n_neighbors**.
* If 1: perfect fit on the training data, but doesn't generalize well.
* If greater than 1 but well below the training set size: imperfect fit on the training data, but generalizes better, so results on the test dataset are usually better.
* If close to or equal to the training set size: underfit. Every prediction is the most common class in the training dataset (or the average target, for regression).


In [26]:
data_to_classify, classes = make_classification( n_samples = 250, random_state = 0)

X_train, X_test, y_train, y_test = train_test_split(data_to_classify, classes, random_state=0)


for k in [1,5,10,50,100,X_train.shape[0]]:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    print('k {}\t: Training accuracy = {:.2f},\tTest accuracy = {:.2f}'
         .format(k,knn.score(X_train, y_train),knn.score(X_test, y_test)))



k 1	: Training accuracy = 1.00,	Test accuracy = 0.68
k 5	: Training accuracy = 0.89,	Test accuracy = 0.81
k 10	: Training accuracy = 0.84,	Test accuracy = 0.79
k 50	: Training accuracy = 0.86,	Test accuracy = 0.83
k 100	: Training accuracy = 0.84,	Test accuracy = 0.81
k 187	: Training accuracy = 0.51,	Test accuracy = 0.51
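The heading notes that KNN also works for regression. Below is a minimal sketch of the same experiment with `KNeighborsRegressor`, assuming a synthetic dataset from `make_regression` (the dataset parameters and variable names are illustrative and not taken from the course). With k = 1 the training data is reproduced exactly, and with k equal to the training set size every prediction collapses to the mean of the training targets.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Illustrative data (assumption): 250 samples, fixed seed so the run is repeatable
X_reg, y_reg = make_regression(n_samples=250, n_features=5, noise=20, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)

knn_scores = {}
for k in [1, 5, 25, Xr_train.shape[0]]:
    knn_reg = KNeighborsRegressor(n_neighbors=k).fit(Xr_train, yr_train)
    knn_scores[k] = (knn_reg.score(Xr_train, yr_train), knn_reg.score(Xr_test, yr_test))
    print('k {}\t: Training r2 = {:.2f},\tTest r2 = {:.2f}'.format(k, *knn_scores[k]))

# After the loop, knn_reg uses all training points as neighbors (uniform weights),
# so every prediction equals the mean of the training targets
pred_all_neighbors = knn_reg.predict(Xr_test)
print(np.allclose(pred_all_neighbors, yr_train.mean()))
```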


## Ridge regression
Formula: $Ridge(w,b) = \sum \limits _{i=1} ^{n} (y_{i} - (w \cdot x_{i} + b))^2 + \alpha \sum \limits _{j=1} ^{p} w_{j}^2 $

Explanation: Ridge regression is a linear regression that adds a penalty for large $w$ parameters. As you can see, the penalty uses the squared $w$ factors, which makes it an L2 penalty; $\alpha$ controls its strength.

Regularization factor = $\alpha$ :
* If $\alpha$ = 0: regular least squares.
* Low values of $\alpha$: little regularization, so the model may overfit. Good results on training data, worse on testing data.
* Growing $\alpha$: simpler model. The higher $\alpha$ gets, the stronger the regularization, bringing the $w$ factors closer to zero. There is likely an optimum in between that gives the best performance on the test dataset.

In [109]:
# 25 features, of which only 8 carry signal; heavy noise makes overfitting visible
X, y = make_regression(n_features=25,
        bias=20,
        n_informative=8,
        noise=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0,1,10,25,50,100,250,1000]:
    linRidge = Ridge(alpha = alpha).fit(X_train,y_train)
    sum_w = np.sum(linRidge.coef_)
    zero_w = np.sum(linRidge.coef_ == 0)
    r2_train = linRidge.score(X_train, y_train)
    r2_test = linRidge.score(X_test, y_test)
    print("alpha = {}, \t Sum of W = {:.0f}  \t W is 0 = {} \t r2 train = {:.3f} \t r2 test = {:.3f}".format(alpha,sum_w,zero_w,r2_train,r2_test))
    

alpha = 0, 	 Sum of W = 510  	 W is 0 = 0 	 r2 train = 0.841 	 r2 test = 0.604
alpha = 1, 	 Sum of W = 494  	 W is 0 = 0 	 r2 train = 0.840 	 r2 test = 0.633
alpha = 10, 	 Sum of W = 398  	 W is 0 = 0 	 r2 train = 0.809 	 r2 test = 0.701
alpha = 25, 	 Sum of W = 312  	 W is 0 = 0 	 r2 train = 0.740 	 r2 test = 0.651
alpha = 50, 	 Sum of W = 237  	 W is 0 = 0 	 r2 train = 0.643 	 r2 test = 0.550
alpha = 100, 	 Sum of W = 163  	 W is 0 = 0 	 r2 train = 0.507 	 r2 test = 0.413
alpha = 250, 	 Sum of W = 87  	 W is 0 = 0 	 r2 train = 0.311 	 r2 test = 0.236
alpha = 1000, 	 Sum of W = 26  	 W is 0 = 0 	 r2 train = 0.106 	 r2 test = 0.072
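The Ridge objective has a closed-form minimizer, which makes the shrinking effect of $\alpha$ easy to check by hand. The sketch below is an illustration under a simplifying assumption: the bias $b$ is dropped (`fit_intercept=False`), so the solution is $w = (X^T X + \alpha I)^{-1} X^T y$. The helper `ridge_closed_form` and the variable names are hypothetical, not part of scikit-learn. It also reports the L2 norm of $w$, which is a more direct measure of shrinkage than the plain sum of the coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Same kind of data as the cell above, but with a fixed seed (assumption)
X_cf, y_cf = make_regression(n_samples=100, n_features=25, n_informative=8,
                             noise=100, random_state=0)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

l2_norms = []
for alpha in [1, 10, 100, 1000]:
    w_manual = ridge_closed_form(X_cf, y_cf, alpha)
    w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_cf, y_cf).coef_
    l2_norms.append(np.linalg.norm(w_manual))
    print("alpha = {}\t match = {}\t ||w||_2 = {:.1f}".format(
        alpha, np.allclose(w_manual, w_sklearn, rtol=1e-4), l2_norms[-1]))
```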


## Lasso regression
Formula: $Lasso(w,b) = \sum \limits _{i=1} ^{n} (y_{i} - (w \cdot x_{i} + b))^2 + \alpha \sum \limits _{j=1} ^{p} |w_{j}| $

Explanation: Lasso regression is a linear regression that adds a penalty for large $w$ parameters. Unlike Ridge, Lasso penalizes the absolute value of each coefficient. This has the effect that many $w$ factors become exactly zero. The penalty is therefore an L1 penalty, with $\alpha$ controlling its strength.

Regularization factor = $\alpha$ :
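* If $\alpha$ = 0: regular least squares (for this case scikit-learn advises using `LinearRegression` rather than `Lasso`, for numerical reasons).
* Low values of $\alpha$: little regularization, so the model may overfit. Good results on training data, worse on testing data.
* Growing $\alpha$: simpler model. The higher $\alpha$ gets, the stronger the regularization, making more $w$ factors become zero. There is likely an optimum in between that gives the best performance on the test dataset.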

In [81]:
# 25 features, of which only 8 carry signal; heavy noise makes overfitting visible
X, y = make_regression(n_features=25,
        bias=20,
        n_informative=8,
        noise=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0,1,10,25,50,100,250,1000]:
    linLasso = Lasso(alpha = alpha).fit(X_train,y_train)
    sum_w = np.sum(linLasso.coef_)
    zero_w = np.sum(linLasso.coef_ == 0)
    r2_train = linLasso.score(X_train, y_train)
    r2_test = linLasso.score(X_test, y_test)
    print("alpha = {}, \t Sum of W = {:.0f}  \t W is 0 = {} \t r2 train = {:.3f} \t r2 test = {:.3f}".format(alpha,sum_w,zero_w,r2_train,r2_test))
    

alpha = 0, 	 Sum of W = 653  	 W is 0 = 0 	 r2 train = 0.923 	 r2 test = 0.663
alpha = 1, 	 Sum of W = 636  	 W is 0 = 0 	 r2 train = 0.923 	 r2 test = 0.689
alpha = 10, 	 Sum of W = 525  	 W is 0 = 12 	 r2 train = 0.893 	 r2 test = 0.729
alpha = 25, 	 Sum of W = 403  	 W is 0 = 17 	 r2 train = 0.816 	 r2 test = 0.639
alpha = 50, 	 Sum of W = 228  	 W is 0 = 19 	 r2 train = 0.606 	 r2 test = 0.388
alpha = 100, 	 Sum of W = 53  	 W is 0 = 22 	 r2 train = 0.198 	 r2 test = -0.042
alpha = 250, 	 Sum of W = 0  	 W is 0 = 25 	 r2 train = 0.000 	 r2 test = -0.177
alpha = 1000, 	 Sum of W = 0  	 W is 0 = 25 	 r2 train = 0.000 	 r2 test = -0.177
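The Ridge and Lasso cells each draw fresh random data (no `random_state` is passed to `make_regression`), so their numbers are not directly comparable. The sketch below fits both models on one fixed dataset to compare how many coefficients each sets to exactly zero. The seed and variable names are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# One shared dataset with a fixed seed so Ridge and Lasso see the same data
X_cmp, y_cmp = make_regression(n_features=25, bias=20, n_informative=8,
                               noise=100, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_cmp, y_cmp, random_state=0)

zero_counts = {}
for alpha in [1, 10, 25, 50]:
    ridge = Ridge(alpha=alpha).fit(Xc_train, yc_train)
    lasso = Lasso(alpha=alpha).fit(Xc_train, yc_train)
    # Count coefficients that are exactly zero in each model
    zero_counts[alpha] = (int(np.sum(ridge.coef_ == 0)), int(np.sum(lasso.coef_ == 0)))
    print("alpha = {}\t zero W (Ridge) = {}\t zero W (Lasso) = {}".format(
        alpha, *zero_counts[alpha]))
```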




## Polynomial regression
Transforms $x$ to $x'$, where $x'$ contains all polynomial combinations of the original features up to a given degree. Example for two features and degree = 2:

$ x = (x_{1}, x_{2}) \Rightarrow x' = (x_{1}, x_{2}, x_{1}^2,x_{1}x_{2}, x_{2}^2) \\
\hat y = \hat w_{1}x_{1} + \hat w_{2}x_{2} + \hat w_{11}x_{1}^2 + \hat w_{12}x_{1}x_{2} + \hat w_{22}x_{2}^2
$

The tuning factor of a polynomial regression is the **degree**. The degree is the maximum total degree of the generated terms, i.e. how many variables ($x_{i}$, possibly repeated) are multiplied together to form a new feature. A higher degree gives a more flexible model, which is also more prone to overfitting.
A polynomial regression isn't a different regression algorithm, since you expand the feature space and leave the regression algorithm itself unchanged. This means you can still apply Ridge or Lasso after the polynomial transformation.
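As a rough illustration of the transformation, the sketch below uses scikit-learn's `PolynomialFeatures` to expand a single sample and then chains the expansion with a Ridge model in a pipeline. The sample values, dataset and `alpha` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# One sample with x1 = 2 and x2 = 3
x = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)
print(x_poly)  # columns: x1, x2, x1^2, x1*x2, x2^2

# Polynomial expansion followed by a regularized linear model
X_p, y_p = make_regression(n_samples=100, n_features=2, noise=10, random_state=0)
poly_ridge = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           Ridge(alpha=1.0))
poly_ridge.fit(X_p, y_p)

# The Ridge step now learns one weight per expanded feature
n_poly_features = poly_ridge.named_steps['ridge'].coef_.shape[0]
print(n_poly_features)
```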

## Logistic regression
First up: logistic regression isn't actually a regression model, but a classification model. It takes input features and outputs a number between 0 and 1, which is used to classify a certain outcome.

Formula: $ \hat y = logistic(\hat b + \hat w_{1}x_{1} + \dots + \hat w_{n}x_{n}) \\
=  \frac{1}{1+ \mathrm{e}^{-(\hat b + \hat w_{1}x_{1} + \dots + \hat w_{n}x_{n}) }}$
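The formula can be checked against a fitted model. The sketch below applies the logistic function by hand to $\hat b + \hat w \cdot x$ and compares the result with the class-1 probability reported by scikit-learn for a binary problem. The helper `logistic`, the dataset parameters and `max_iter=1000` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_lr, y_lr = make_classification(n_samples=250, n_features=20, n_informative=5,
                                 n_clusters_per_class=1, flip_y=0.1, random_state=0)
logreg_demo = LogisticRegression(max_iter=1000).fit(X_lr, y_lr)

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# b + w1*x1 + ... + wn*xn for every sample
z = X_lr @ logreg_demo.coef_.ravel() + logreg_demo.intercept_[0]
y_hat_manual = logistic(z)

# Probability of class 1 as computed by scikit-learn
y_hat_sklearn = logreg_demo.predict_proba(X_lr)[:, 1]
print(np.allclose(y_hat_manual, y_hat_sklearn))
```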

Regularization factor = $C$. In scikit-learn the default penalty is an L2 penalty, like in a Ridge model (meaning, it uses the squared factors $w_i$). Unlike $\alpha$, $C$ is the inverse of the regularization strength.

- If C is very close to zero, regularization is strong and the factors $w$ are pushed toward zero. The model may then underfit and behave poorly on both the training and test dataset.
- Increasing C weakens the regularization and lets the factors $w$ grow. The model fits the training data more closely, which can lead to overfitting. An optimum likely exists between 0 and $\infty$.


In [108]:
data_to_classify, classes = make_classification(n_samples = 250, n_features=20, n_informative=5,
                                n_clusters_per_class=1, flip_y = 0.1,random_state=0)
X_train, X_test, y_train, y_test = train_test_split(data_to_classify, classes, random_state=0)


for C in [0.001,0.01,0.1,1,10,100,1000]:
    logreg = LogisticRegression(C=C).fit(X_train, y_train)
   
    print('C {}\t: Training accuracy = {:.2f},\tTest accuracy = {:.2f}'
         .format(C,logreg.score(X_train, y_train),logreg.score(X_test, y_test)))


C 0.001	: Training accuracy = 0.86,	Test accuracy = 0.92
C 0.01	: Training accuracy = 0.88,	Test accuracy = 0.94
C 0.1	: Training accuracy = 0.92,	Test accuracy = 0.94
C 1	: Training accuracy = 0.92,	Test accuracy = 0.92
C 10	: Training accuracy = 0.91,	Test accuracy = 0.90
C 100	: Training accuracy = 0.90,	Test accuracy = 0.90
C 1000	: Training accuracy = 0.90,	Test accuracy = 0.90
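To see how $C$ acts on the coefficients directly, the sketch below records the L2 norm of the fitted $w$ for a few values of $C$ on the same kind of data. `max_iter=5000` is an assumption added to give the solver room to converge at large $C$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_c, y_c = make_classification(n_samples=250, n_features=20, n_informative=5,
                               n_clusters_per_class=1, flip_y=0.1, random_state=0)

coef_norms = {}
for C in [0.001, 0.1, 10, 1000]:
    logreg_c = LogisticRegression(C=C, max_iter=5000).fit(X_c, y_c)
    # Size of the weight vector: small norm = strongly regularized model
    coef_norms[C] = np.linalg.norm(logreg_c.coef_)
    print('C {}\t: ||w||_2 = {:.3f}'.format(C, coef_norms[C]))
```

A small $C$ keeps the norm of $w$ small (strong regularization), while a large $C$ lets it grow.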




## Linear support vector machine
An LSVM is a classifier. It fits a linear decision boundary (a line in two dimensions, a hyperplane in general) to separate a dataset into two groups.
A key property of an LSVM is the margin of error that it uses for calculating the best-fitting classifier. The margin of error is how much distance from the decision boundary we tolerate for any point. A higher margin of error thus means that points close to the boundary are 'ignored' in favor of better generalization.

Formula: $ \hat y = f(x,w,b) = sign(w \cdot x + b)
$

Regularization factor: $C$.
Counter-intuitive: Larger $C$ means _less_ regularization, meaning more chance for overfitting. Smaller $C$ means _more_ regularization.
C is inversely related to the margin of error. A larger value of C tries to fit the training data as well as possible, meaning the margin of error will be smaller. If C is smaller, the model tries to generalize better, meaning a larger margin of error is tolerated.



In [113]:
data_to_classify, classes = make_classification(n_samples = 250, n_features=20, n_informative=5,
                                n_clusters_per_class=1, flip_y = 0.1,random_state=0)
X_train, X_test, y_train, y_test = train_test_split(data_to_classify, classes, random_state=0)


for C in [0.001,0.01,0.1,1,10,100,1000,10000,100000]:
    svc = LinearSVC(C=C).fit(X_train, y_train)
   
    print('C {}\t: Training accuracy = {:.2f},\tTest accuracy = {:.2f}'
         .format(C,svc.score(X_train, y_train),svc.score(X_test, y_test)))


C 0.001	: Training accuracy = 0.89,	Test accuracy = 0.94
C 0.01	: Training accuracy = 0.91,	Test accuracy = 0.94
C 0.1	: Training accuracy = 0.93,	Test accuracy = 0.94
C 1	: Training accuracy = 0.91,	Test accuracy = 0.92
C 10	: Training accuracy = 0.92,	Test accuracy = 0.94
C 100	: Training accuracy = 0.90,	Test accuracy = 0.92
C 1000	: Training accuracy = 0.89,	Test accuracy = 0.92
C 10000	: Training accuracy = 0.89,	Test accuracy = 0.90
C 100000	: Training accuracy = 0.90,	Test accuracy = 0.94
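The decision rule $sign(w \cdot x + b)$ can be reproduced from a fitted `LinearSVC`. The sketch below does that and also prints the geometric margin width $2 / \lVert w \rVert$ for a few values of $C$. The margin width is a standard SVM quantity that is not named in the text above, so treat it as an added assumption for illustration. `max_iter=10000` is likewise an assumption, to help the solver converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X_sv, y_sv = make_classification(n_samples=250, n_features=20, n_informative=5,
                                 n_clusters_per_class=1, flip_y=0.1, random_state=0)

svc_demo = LinearSVC(C=1, max_iter=10000).fit(X_sv, y_sv)

# w . x + b for every sample
raw_scores = (X_sv @ svc_demo.coef_.T + svc_demo.intercept_).ravel()
# sign(...) mapped onto the two class labels: positive -> class 1, otherwise class 0
manual_pred = (raw_scores > 0).astype(int)
print(np.array_equal(manual_pred, svc_demo.predict(X_sv)))

# Smaller C -> smaller ||w|| -> wider margin
margins = {}
for C in [0.001, 0.01, 1]:
    w = LinearSVC(C=C, max_iter=10000).fit(X_sv, y_sv).coef_
    margins[C] = 2.0 / np.linalg.norm(w)
    print('C {}\t: margin width = {:.3f}'.format(C, margins[C]))
```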




## Kernel SVM - Radial Basis Function Kernel
A Kernel SVM is an SVM that applies a transformation to the input data. One of those transformations is the Radial Basis Function.

Formula: $ RBF(x,x') = \exp(- \gamma \cdot d(x,x')^2) $ with $d(x,x')$ being the Euclidean distance.
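A small worked example of the kernel, assuming two hand-picked points whose squared Euclidean distance is 2. The helper `rbf_manual` is hypothetical and written only for this illustration; `rbf_kernel` from `sklearn.metrics.pairwise` is used as a cross-check.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_manual(x, x_prime, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = np.sum((x - x_prime) ** 2)
    return np.exp(-gamma * squared_distance)

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])  # squared distance to a is 2

# The larger gamma is, the faster the similarity decays with distance
for gamma in [0.01, 1.0, 10.0]:
    print('gamma {}\t: RBF(a, b) = {:.6f}'.format(gamma, rbf_manual(a, b, gamma)))

# Cross-check against scikit-learn for gamma = 1
k_sklearn = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=1.0)[0, 0]
print(np.isclose(k_sklearn, rbf_manual(a, b, 1.0)))
```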

Parameters: $\gamma$ and $C$.
C is the same as above, namely the regularization factor.
- Bigger C means less regularization, 
- smaller C means more regularization.


$\gamma$ is the parameter of the RBF that determines how far the effect of a single training point reaches to the points around it.
- A higher gamma means less influence on farther points.
- A lower gamma means more influence on farther points.

In other words:

- A higher gamma means **tighter** decision boundaries.
- A lower gamma means **broader** decision boundaries.

A good example of showing the relationship between C and $\gamma$ is shown here: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
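In the same style as the earlier cells, the sketch below loops over a few combinations of $\gamma$ and $C$ with `SVC(kernel='rbf')`. The dataset reuses the `make_classification` settings from the cells above, and the grid values are arbitrary picks for illustration. No feature scaling is applied here, which a real workflow would usually add.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_k, y_k = make_classification(n_samples=250, n_features=20, n_informative=5,
                               n_clusters_per_class=1, flip_y=0.1, random_state=0)
Xk_train, Xk_test, yk_train, yk_test = train_test_split(X_k, y_k, random_state=0)

results = {}
for gamma in [0.001, 0.1, 10]:
    for C in [0.1, 1, 100]:
        svm = SVC(kernel='rbf', gamma=gamma, C=C).fit(Xk_train, yk_train)
        results[(gamma, C)] = (svm.score(Xk_train, yk_train), svm.score(Xk_test, yk_test))
        print('gamma {}\tC {}\t: Training accuracy = {:.2f},\tTest accuracy = {:.2f}'
              .format(gamma, C, *results[(gamma, C)]))
```

A large gap between training and test accuracy at high $\gamma$ is the sign of the tight decision boundaries described above.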

## Decision trees
A decision tree is another classifier model. It tries to learn a sequence of repeatable steps, aka rules, to navigate through a dataset.

There are three parameters to tune a decision tree:
* **max_depth**: The maximum depth of the tree.
* **min_samples_leaf**: The minimum number of samples required in each leaf.
* **max_leaf_nodes**: The maximum number of leaves (aka terminal nodes) in the tree.
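A minimal sketch of the three parameters in action, comparing an unrestricted tree with a restricted one. The dataset and the specific limits (3, 5 and 6) are arbitrary values chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_t, y_t = make_classification(n_samples=250, n_features=20, n_informative=5,
                               n_clusters_per_class=1, flip_y=0.1, random_state=0)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(X_t, y_t, random_state=0)

# No limits: the tree keeps splitting until every leaf is pure
unrestricted = DecisionTreeClassifier(random_state=0).fit(Xt_train, yt_train)

# All three tuning parameters set at once
restricted = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    max_leaf_nodes=6, random_state=0).fit(Xt_train, yt_train)

for name, tree in [('unrestricted', unrestricted), ('restricted', restricted)]:
    print('{}\t: depth = {}, leaves = {}, Training accuracy = {:.2f}, Test accuracy = {:.2f}'
          .format(name, tree.get_depth(), tree.get_n_leaves(),
                  tree.score(Xt_train, yt_train), tree.score(Xt_test, yt_test)))
```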