# Linear Regression and Regularization

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from matplotlib import pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import RobustScaler

import sklearn
%matplotlib inline

import ipywidgets as widgets
from tqdm.notebook import tqdm

import warnings
# silence all warnings, including future deprecation warnings
warnings.filterwarnings('ignore')

## Prepare the data

Although linear regression is a linear machine learning method, you can model nonlinear dependencies if you transform some of the independent variables with a nonlinear function. Doing so can improve the fit of your model. Let us demonstrate this on a house price dataset from [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction). Note that this dataset is not identical to the one you used in the linear regression exercise, since that dataset is too small and would cause unreliable evaluation results.
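
To illustrate the idea before turning to the house data, here is a minimal sketch on hypothetical toy data (the values, the noise level and the helper `sum_squared_error` are assumptions made for this illustration). The target depends on $x^2$, and replacing the column $x$ by $x^2$ keeps the model linear in the parameters while reducing the error.

```python
import numpy as np

# Hypothetical toy data (not the house dataset): y depends on x quadratically.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 0.5 + 2.0 * x**2 + rng.normal(scale=0.01, size=x.shape)

# Design matrices: bias + x (untransformed) versus bias + x^2 (transformed).
X_linear = np.column_stack([np.ones_like(x), x])
X_squared = np.column_stack([np.ones_like(x), x**2])

def sum_squared_error(X, y):
    # Least-squares fit; the model is linear in the parameters in both cases.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta
    return float(residuals @ residuals)

sse_linear = sum_squared_error(X_linear, y)
sse_squared = sum_squared_error(X_squared, y)
print("SSE with x:  ", sse_linear)
print("SSE with x^2:", sse_squared)
```

The fitting procedure is the same in both cases; only the feature column changes.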

In [None]:
df_house = pd.read_csv("kc_house_data.csv")
df_house.head()

We would like to have a simple linear regression problem with only one independent variable. Thus, we only keep *price* and *sqft_living*.

In [None]:
df_house = df_house[["price","sqft_living"]]
df_house.head()

### Split the data

We split the data into a training set and a test set.

In [None]:
train_house, test_house = train_test_split(df_house, test_size=0.5, random_state=42)

### Normalize the data
Let us normalize the data using *min-max normalization*.

In [None]:
scaler = MinMaxScaler()

train_house = pd.DataFrame(scaler.fit_transform(train_house), columns=train_house.columns, index=train_house.index)
test_house = pd.DataFrame(scaler.transform(test_house), columns=test_house.columns, index=test_house.index)

train_house.head()
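
With its default settings, `MinMaxScaler` maps each column to $(x - x_{min}) / (x_{max} - x_{min})$, where the minimum and maximum are learned in `fit`. The following sketch reproduces this by hand on hypothetical values (the numbers and the helper `min_max` are assumptions for illustration). It also shows why test values can fall outside $[0, 1]$: the statistics come from the training data only.

```python
import numpy as np

# Hypothetical values standing in for a feature column such as sqft_living.
train = np.array([500.0, 1000.0, 2000.0, 4000.0])
test = np.array([750.0, 5000.0])

# The statistics are taken from the training data only.
train_min, train_max = train.min(), train.max()

def min_max(values):
    # Same formula as min-max normalization to the range [0, 1].
    return (values - train_min) / (train_max - train_min)

train_scaled = min_max(train)
test_scaled = min_max(test)
print(train_scaled)
print(test_scaled)  # the second value exceeds 1 because 5000 > train_max
```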

In [None]:
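# Copy the feature selections so that columns can be added later without touching train_house/test_house
X_train_house = train_house[["sqft_living"]].copy()
y_train_house = train_house[["price"]]

X_test_house = test_house[["sqft_living"]].copy()
y_test_house = test_house[["price"]]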

## Bias term
To account for the bias term, we add a column containing only ones.

In [None]:
X_train_house["bias"] = 1
X_test_house["bias"] = 1

# Force order
X_train_house = X_train_house[["bias", "sqft_living"]]
X_test_house = X_test_house[["bias", "sqft_living"]]

X_train_house.head()

## Fit a linear regression model
Define a linear regression function to estimate the parameters $\Theta$ using the normal equation:
  
  $\Theta:=(X^{\top}X)^{-1}(X^{\top}y)$

In [None]:
def fit(X, y):
    # START YOUR CODE
    
    # END YOUR CODE
    return thetas

*Click on the dots to display the solution*

In [None]:
def fit(X, y):
    thetas = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return thetas
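
As a side note, the explicit inverse is not the only way to evaluate the normal equation. The sketch below, on hypothetical random data (the data and variable names are assumptions), solves the linear system $(X^{\top}X)\Theta = X^{\top}y$ with `np.linalg.solve` instead, which is a common alternative that avoids forming the inverse. Both variants give the same parameters here.

```python
import numpy as np

# Hypothetical data: a bias column and one feature, with a known linear relation.
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.uniform(size=100)])
true_thetas = np.array([[0.1], [0.5]])
y = X @ true_thetas + rng.normal(scale=0.01, size=(100, 1))

# Normal equation with an explicit inverse (as in the solution above).
thetas_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same equation, solved as a linear system without forming the inverse.
thetas_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(thetas_inv.ravel())
print(thetas_solve.ravel())
```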

Run the following code to check your implementation:

In [None]:
thetas = fit(X_train_house, y_train_house)

expected_thetas = np.array([[7.39560812e-05], [4.94185750e-01]])
np.testing.assert_array_almost_equal(thetas, expected_thetas, decimal=4)

## Predict prices
Using $X$ and the estimated parameters $\Theta$, predict the house prices on the training data.

In [None]:
def predict(X, thetas):
    # START YOUR CODE
    
    # END YOUR CODE
    return y_pred

*Click on the dots to display the solution*

In [None]:
def predict(X, thetas):
    y_pred = np.dot(X, thetas)
    return y_pred

In [None]:
y_pred_house = predict(X_train_house, thetas)
y_pred_house

## Visualize predictions
Let us plot the actual house prices together with the predicted house prices.

In [None]:
def plot_regression_line(X, thetas, ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    deg = len(thetas)-1
    poly = PolynomialFeatures(deg)
    
    xs = np.arange(X.min(), X.max()+0.1, 0.01).reshape(-1,1)
    x = poly.fit_transform(xs)
    y_pred = np.dot(x, thetas)
    
    ax.plot(xs, y_pred, color="r")

In [None]:
fig, ax = plt.subplots()
ax.plot(X_train_house["sqft_living"].values, y_train_house.values, "bo", markersize=1)
plot_regression_line(X_train_house["sqft_living"].values, thetas, ax)

## Calculate model performance
Now let's check how well our model performs by calculating the $R^2$ score on the test set.

In [None]:
# r2 = ...

*Click on the dots to display the solution*

In [None]:
y_pred_test_house = predict(X_test_house, thetas)
r2_house = r2_score(y_test_house, y_pred_test_house)
print("R2: ", r2_house)
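
For reference, the $R^2$ score (coefficient of determination) is $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The following sketch computes it by hand on hypothetical values (the helper `r2_manual` and the numbers are assumptions for illustration).

```python
import numpy as np

def r2_manual(y_true, y_pred):
    # Flatten so that column vectors and 1-D arrays are handled alike.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r2_manual(y_true, y_true)                        # perfect prediction
r2_mean = r2_manual(y_true, np.full(4, 2.5))                  # always predicting the mean
r2_partial = r2_manual(y_true, np.array([1.0, 2.0, 3.0, 5.0]))
print(r2_perfect, r2_mean, r2_partial)
```

A perfect prediction gives a score of 1, and always predicting the mean gives a score of 0.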

## Adding polynomial features

We aim to improve the fit by adding $x^2$ as an additional independent variable.

In [None]:
X_train_deg2 = X_train_house.copy()
X_train_deg2["sqft_living^2"] = X_train_deg2["sqft_living"] * X_train_deg2["sqft_living"]

X_test_deg2 = X_test_house.copy()
X_test_deg2["sqft_living^2"] = X_test_deg2["sqft_living"] * X_test_deg2["sqft_living"]
X_test_deg2.head()

### Fit the model with the additional features

In [None]:
thetas_deg2 = fit(X_train_deg2, y_train_house)

### Calculate the performance

In [None]:
# r2 = 

*Click on the dots to display the solution*

In [None]:
y_pred_test_deg2 = predict(X_test_deg2, thetas_deg2)
r2_deg2 = r2_score(y_test_house, y_pred_test_deg2)
print("R2: ", r2_deg2)

As we can see, adding $x^2$ as an additional independent variable slightly improves the performance.

Let's see whether we can further improve the performance by adding more polynomial features. To generate the polynomial features, we will use the scikit-learn class [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).
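
For a single input column and default settings, `PolynomialFeatures(degree)` yields the columns $1, x, x^2, \ldots, x^{degree}$, so the bias column is included automatically. The sketch below rebuilds the same layout with NumPy's `np.vander` as a stand-in, using hypothetical input values chosen for illustration.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 3.0])
degree = 3

# Columns in increasing order of power: 1, x, x^2, x^3.
X_poly = np.vander(x, N=degree + 1, increasing=True)
print(X_poly)
```

Each row contains the powers of one input value, starting with the constant 1 for the bias term.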

In [None]:
@widgets.interact(poly_deg =(1,18,1))
def f(poly_deg=1):
    poly = PolynomialFeatures(poly_deg)
    X_train_deg = poly.fit_transform(X_train_house["sqft_living"].values.reshape(-1,1))
    X_test_deg = poly.transform(X_test_house["sqft_living"].values.reshape(-1,1))

    thetas_deg = fit(X_train_deg, y_train_house)
    
    y_pred_test = predict(X_test_deg, thetas_deg)
    y_pred_train = predict(X_train_deg, thetas_deg)
    
    r2_test = r2_score(y_test_house, y_pred_test)
    r2_train = r2_score(y_train_house, y_pred_train)
    print("R2 Train {0:.5f}".format(r2_train))
    print("R2 Test {0:.5f}".format(r2_test))
    
    fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(20,10))
    ax0.set_title("Training data - polynomial degree {}".format(poly_deg))
    ax0.plot(X_train_house["sqft_living"], y_train_house["price"], "bo", markersize=1)
    plot_regression_line(X_train_deg, thetas_deg, ax0)
    
    ax1.set_title("Test data - polynomial degree {}".format(poly_deg))
    ax1.plot(X_test_house["sqft_living"], y_test_house["price"], "bo", markersize=1)
    plot_regression_line(X_test_deg, thetas_deg, ax1)

What do you notice when you increase the polynomial degree?

> Answer the question on ILIAS

## Regularization

The effect of overfitting can be reduced by regularization. Implement the regularized version of linear regression:

  $\Theta:=\left(X^{\top}X+\lambda \begin{bmatrix}
    0 & 0 & \ldots & 0 \\
    0 & 1 & \ldots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \ldots & 1
  \end{bmatrix}\right)^{-1}(X^{\top}y)$

In [None]:
def fit_reg(X, y, lam):
    # START YOUR CODE

    # END YOUR CODE
    return thetas

*Click on the dots to display the solution*

In [None]:
def fit_reg(X, y, lam):
    Xt = np.transpose(X)
    XtX = np.dot(Xt,X)
    I = np.identity(XtX.shape[0])
    I[0,0] = 0
    XtX = XtX + (lam * I)
    XtXm1 = np.linalg.inv(XtX)
    Xty = np.dot(Xt,y)
    thetas = np.dot(XtXm1,Xty)
    return thetas
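
To get a feeling for the effect of $\lambda$, the following sketch fits a degree-5 polynomial to hypothetical noisy data for several values of $\lambda$ and reports the norm of the non-bias parameters. The data, the helper `fit_ridge` (a variant of the solution above that uses `np.linalg.solve`) and the chosen values of $\lambda$ are assumptions for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Identity matrix with a zero in the top-left corner: the bias is not penalized.
    penalty = np.identity(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

# Hypothetical data: a noisy linear relation, fitted with degree-5 features.
rng = np.random.default_rng(1)
x = rng.uniform(size=30)
X = np.vander(x, N=6, increasing=True)  # columns 1, x, ..., x^5
y = (0.2 + 0.5 * x + rng.normal(scale=0.05, size=30)).reshape(-1, 1)

lams = [0.0, 0.1, 10.0]
# Norm of all parameters except the bias, for each lambda.
norms = [float(np.linalg.norm(fit_ridge(X, y, lam)[1:])) for lam in lams]
for lam, norm in zip(lams, norms):
    print("lambda = {:5.1f}  norm of non-bias thetas = {:.4f}".format(lam, norm))
```

The norm does not grow as $\lambda$ increases: larger values of $\lambda$ pull the non-bias parameters towards zero, which is what dampens the wiggly high-degree fits.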

You can check your implementation by executing the following cell:

In [None]:
expected_thetas = np.array([[0.00178927], [0.48482755]])
actual_thetas = fit_reg(X_train_house, y_train_house, lam=2)

np.testing.assert_array_almost_equal(expected_thetas, actual_thetas)

We plot the regression curves obtained with the regularized parameter vectors. As you can see, the effect of overfitting is strongly reduced.

In [None]:
@widgets.interact(poly_deg = (0,12,1), lam=(0,100,1))
def f(poly_deg=1, lam=4):
    poly = PolynomialFeatures(poly_deg)
    X_train_deg = poly.fit_transform(X_train_house["sqft_living"].values.reshape(-1,1))
    X_test_deg = poly.transform(X_test_house["sqft_living"].values.reshape(-1,1))

    thetas_deg = fit_reg(X_train_deg, y_train_house, lam=lam)
    
    y_pred_test = predict(X_test_deg, thetas_deg)
    y_pred_train = predict(X_train_deg, thetas_deg)
    
    r2_test = r2_score(y_test_house, y_pred_test)
    r2_train = r2_score(y_train_house, y_pred_train)
    print("R2 Train", r2_train)
    print("R2 Test", r2_test)
    
    fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(20,10))
    ax0.set_title("Training data - polynomial degree {}".format(poly_deg))
    ax0.plot(X_train_house["sqft_living"], y_train_house["price"], "bo", markersize=1)
    plot_regression_line(X_train_deg, thetas_deg, ax0)
    
    ax1.set_title("Test data - polynomial degree {}".format(poly_deg))
    ax1.plot(X_test_house["sqft_living"], y_test_house["price"], "bo", markersize=1)
    plot_regression_line(X_test_deg, thetas_deg, ax1)

Find the best configuration of **polynomial degree** and $\lambda$.

<font color='red'>PLEASE REPLACE TEXT WITH YOUR CONFIGURATION</font>

## Regularization to help with numerical issues

Another benefit of regularization is that it can help in case of numerical issues. Let us consider our original dataset.

In [None]:
df_house_2 = pd.read_csv("kc_house_data.csv")
df_house_2 = df_house_2[["price","sqft_living","bedrooms"]]
df_house_2.head()

In [None]:
train_house_2, test_house_2 = train_test_split(df_house_2, test_size=0.5, random_state=42)

In [None]:
scaler = MinMaxScaler()

train_house_2 = pd.DataFrame(scaler.fit_transform(train_house_2), columns=train_house_2.columns, index=train_house_2.index)
test_house_2 = pd.DataFrame(scaler.transform(test_house_2), columns=test_house_2.columns, index=test_house_2.index)

test_house_2.head()

To make the matrix $X^{\top}X$ singular, we add another independent variable (*sqft_living2*) to $X$
that is exactly twice *sqft_living*.

In [None]:
train_house_2["sqft_living2"] = 2 * train_house_2["sqft_living"]
train_house_2["bias"] = 1

test_house_2["sqft_living2"]= 2 * test_house_2["sqft_living"]
test_house_2["bias"] = 1

test_house_2.head()

In [None]:
X_train_house_2 = train_house_2[["bias", "sqft_living", "bedrooms", "sqft_living2"]]
y_train_house_2 = train_house_2[["price"]]

X_test_house_2 = test_house_2[["bias", "sqft_living", "bedrooms", "sqft_living2"]]
y_test_house_2 = test_house_2[["price"]]

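Linear regression with the normal equation now fails, since $X^{\top}X$ is not invertible. Depending on floating-point rounding, `np.linalg.inv` either raises an error or returns numerically meaningless values.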

In [None]:
thetas = fit(X_train_house_2, y_train_house_2)
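
One way to detect the problem before inverting is to check the rank of the design matrix. The sketch below uses hypothetical random features (the names `size` and `rooms` are assumptions standing in for *sqft_living* and *bedrooms*). Since $X^{\top}X$ has the same rank as $X$, a rank below the number of columns means that $X^{\top}X$ is singular.

```python
import numpy as np

# Hypothetical features; the last column is exactly twice the second one.
rng = np.random.default_rng(0)
size = rng.uniform(size=20)
rooms = rng.uniform(size=20)
X = np.column_stack([np.ones(20), size, rooms, 2.0 * size])

n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)  # rank(X^T X) equals rank(X)
print("columns:", n_columns, "rank:", rank)
```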

There are two possibilities to tackle this issue: the first is to use the pseudoinverse instead of the inverse,
and the second is to use regularization.

> Try out both. 

*Hint*: To perform linear regression with the pseudoinverse, you only have to slightly modify the `fit` function defined further above.
The NumPy function [np.linalg.pinv](https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html) comes in handy for this.

In [None]:
def fit_pseudoinverse(X,y):
    # START YOUR CODE
    
    # END YOUR CODE
    return thetas

*Click on the dots to display the solution*

In [None]:
def fit_pseudoinverse(X, y):
    thetas = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return thetas
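
Among all parameter vectors that fit the data equally well, the pseudoinverse picks the one with the smallest norm. The following sketch shows this on hypothetical noise-free data (data, names and the explicit `rcond` cutoff are assumptions for illustration): the weight is distributed over the two dependent columns in proportion to their scale, so the duplicated column receives twice the weight of the original one.

```python
import numpy as np

# Hypothetical data: bias, a feature, and an exact copy of the feature times two.
rng = np.random.default_rng(0)
size = rng.uniform(size=20)
X = np.column_stack([np.ones(20), size, 2.0 * size])
y = (0.1 + 0.6 * size).reshape(-1, 1)  # noise-free target

# Normal equation with the pseudoinverse; rcond is set explicitly so that the
# (numerically tiny) zero singular value is reliably treated as zero.
thetas = np.linalg.pinv(X.T @ X, rcond=1e-10) @ X.T @ y
print(thetas.ravel())

# The predictions still reproduce the target.
max_abs_error = float(np.max(np.abs(X @ thetas - y)))
print(max_abs_error)
```

The same pattern is visible in the expected parameters of the check below, where the value for *sqft_living2* is twice the value for *sqft_living*.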

Run this code to check your implementation:

In [None]:
thetas_pseudo_inverse = fit_pseudoinverse(X_train_house_2, y_train_house_2)
print ("thetas obtained by linear regression with pseudoinverse:\n", thetas_pseudo_inverse)

expected_thetas_pseudo_inverse = np.array([
    [ 0.02902459],
    [ 0.11220321],
    [-0.12253607],
    [ 0.22440641]])

np.testing.assert_array_almost_equal(thetas_pseudo_inverse, expected_thetas_pseudo_inverse, decimal=5)

In [None]:
thetas_regularization = fit_reg(X_train_house_2, y_train_house_2, lam=1)
print ("thetas obtained by linear regression with regularization:\n", thetas_regularization)

expected_thetas_regularization = np.array([
    [ 0.02846346],
    [ 0.11163748],
    [-0.11932519],
    [ 0.22327497]])

np.testing.assert_array_almost_equal(thetas_regularization, expected_thetas_regularization, decimal=5)

## Programming Assignment
> Solve the following programming assignment and check your solution in the ILIAS quiz **Linear Regression and Regularization - Notebook Verification**.

Previously, you implemented linear regression from scratch. In this programming assignment, you are asked to use the scikit-learn implementation of linear regression instead (see the [Scikit-learn Documentation](https://scikit-learn.org/stable/)). Use the same data as before.

Use the following features: bedrooms, bathrooms, sqft_living, yr_built and grade

Use the same train/test split as in the previous examples

Import and train the sklearn Linear Regression

Calculate the train and test score (there is a method of the regressor for this)

Put the test score in the ILIAS quiz 04a Notebook Verification

Also answer whether this model performs better than the ones calculated previously.

**!!!!!Solution needs to be deleted and transferred before being published!!!!**

In [None]:
df_house3 = pd.read_csv("kc_house_data.csv")

Use the following Features: bedrooms, bathrooms, sqft_living, yr_built, grade

In [None]:
df_house3 = df_house3[["price","sqft_living", "bedrooms", "bathrooms", "yr_built", "grade"]]

Use the same train test split as in the previous examples.

In [None]:
train_house3, test_house3 = train_test_split(df_house3, test_size=0.5, random_state=42)

Train the sklearn Linear Regression

In [None]:
scaler = MinMaxScaler()

train_house3 = pd.DataFrame(scaler.fit_transform(train_house3), columns=train_house3.columns, index=train_house3.index)
test_house3 = pd.DataFrame(scaler.transform(test_house3), columns=test_house3.columns, index=test_house3.index)

In [None]:
X_train_house_3 = train_house3[["sqft_living", "bedrooms", "bathrooms","yr_built", "grade"]]
y_train_house_3 = train_house3[["price"]]

X_test_house_3 = test_house3[["sqft_living", "bedrooms", "bathrooms","yr_built", "grade"]]
y_test_house_3 = test_house3[["price"]]

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg = LinearRegression().fit(X_train_house_3, y_train_house_3)

Calculate the Train and Test score

In [None]:
print("Train score: ", reg.score(X_train_house_3,y_train_house_3))

In [None]:
print("Test score: ", reg.score(X_test_house_3,y_test_house_3))
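
Note that no bias column was added here: with its default setting `fit_intercept=True`, `LinearRegression` estimates the bias itself and stores it in `intercept_`, while the remaining parameters are stored in `coef_`. The sketch below compares this with the normal equation on hypothetical random data (the data and variable names are assumptions for illustration). The `score` method returns the $R^2$ score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with two features and a known bias of 0.3.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = 0.3 + X @ np.array([0.5, -0.2]) + rng.normal(scale=0.01, size=50)

# scikit-learn: no bias column needed, the intercept is fitted automatically.
reg = LinearRegression().fit(X, y)

# Normal equation: the bias column has to be added by hand.
X_bias = np.column_stack([np.ones(len(X)), X])
thetas = np.linalg.solve(X_bias.T @ X_bias, X_bias.T @ y)

print("sklearn:        ", reg.intercept_, reg.coef_)
print("normal equation:", thetas)
train_score = reg.score(X, y)  # R^2 on the training data
print("R2 train:", train_score)
```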

**!!!! Possibly add a question about how the bias term can be added automatically by the sklearn method**