# Linear Regression Exercise

In this exercise we'll build our own implementation of a linear regression model, compare it to existing regression models, and apply them to a chemical dataset to predict the solubility of different molecules. First I'll import the relevant packages for you.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import metrics
np.set_printoptions(precision=2)

I'll also go ahead and prepare a suitable example dataset for you. It's a medical dataset of diabetes patients, with medical information as features and a numerical measure of each patient's disease progression as the target:

**Number of Samples**: 442

**Features**: 10 columns with numeric predictive values

**Target**: Quantitative measure of disease progression one year after baseline

**Feature Information**:
- age: age in years
- sex: 0 male, 1 female probably
- bmi: body mass index
- bp: average blood pressure
- s1: tc, total serum cholesterol
- s2: ldl, low-density lipoproteins
- s3: hdl, high-density lipoproteins
- s4: tch, total cholesterol / HDL
- s5: ltg, possibly log of serum triglycerides level
- s6: glu, blood sugar level

**Note**: Each of these 10 feature variables has been mean-centered and scaled by the standard deviation times the square root of the number of samples.
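
To make the scaling in the note concrete, here is a minimal sketch on a made-up column (the values and the name `raw_column` are assumptions for illustration, not part of the dataset). Assuming the population standard deviation is used, a column scaled this way ends up with mean 0 and a sum of squares of 1:

```python
import numpy as np

# Hypothetical raw feature column with as many entries as the dataset has samples
rng = np.random.default_rng(0)
raw_column = rng.normal(loc=50.0, scale=10.0, size=442)

# Mean-center, then divide by (standard deviation * sqrt(number of samples))
centered = raw_column - raw_column.mean()
scaled = centered / (raw_column.std() * np.sqrt(raw_column.size))

column_mean = scaled.mean()
sum_of_squares = np.sum(scaled ** 2)
print(column_mean, sum_of_squares)
```

This also explains why the feature values in the dataset are so small in magnitude.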

In [2]:
diabetes_dataset: sklearn.utils.Bunch = datasets.load_diabetes() # Scikit-learn dataset, bunch object, similar to a dictionary
diabetes_feature_names: list = diabetes_dataset.feature_names
diabetes_features: np.ndarray = diabetes_dataset.data # you'll work with these features
diabetes_targets: np.ndarray = diabetes_dataset.target # you'll predict this target
diabetes_dataframe: pd.DataFrame = pd.DataFrame(data=diabetes_features, columns=diabetes_feature_names) # Just for convenience if you want to explore the data
diabetes_dataframe["target"] = diabetes_targets

## Data preparation

To validate your model in the end, you will need a separate test set. Therefore, you should now split your data into two random subsets for training and testing. Your test set should contain 15% of your total dataset. Also make sure that both subsets have the expected sample size.

Assign the features and labels of your subsets to the variables `X_train, X_test, y_train, y_test`. To pass the test cell, the sample size and the number of features have to be correct.
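
One possible way to do the split is sketched below with `train_test_split`, which was imported above. The block reloads the data so that it runs on its own, and the fixed `random_state` is an assumption added only to make the split reproducible:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Reload the data so this sketch is self-contained
features, targets = datasets.load_diabetes(return_X_y=True)

# test_size=0.15 reserves 15% of the samples for testing;
# random_state is an arbitrary choice to make the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.15, random_state=42
)

# Check that both subsets have the expected sample size and feature count
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```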

A general formula for multiple regression with the variables $x_0, \dots, x_n$ looks like this:

$$
  f_w(x_0,x_1,...,x_n) = w_{b} \cdot 1 + w_{0} x_0 + w_{1} x_1 + ... + w_{n} x_n = w_b + \sum_{i=0}^n w_i \cdot x_i = \mathbf{x} \cdot \mathbf{w}
$$


\begin{equation}
\mathbf{x} =
\begin{pmatrix}
  1 \\ x_0 \\ x_1 \\ \vdots \\ x_n \\
\end{pmatrix},
\mathbf{w} = 
\begin{pmatrix}
  w_b \\ w_0 \\ w_1 \\ \vdots \\ w_n  \\
\end{pmatrix}
\end{equation}
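
As a small numeric illustration of the dot-product form (all numbers are made up), the prediction can be computed either term by term or as a single dot product once a 1 is prepended to the feature vector:

```python
import numpy as np

features = np.array([2.0, 3.0])   # made-up values for x_0, x_1
w = np.array([0.5, 1.0, -2.0])    # made-up values for w_b, w_0, w_1

# Prepend a 1 so that the bias w_b is handled by the dot product
x = np.concatenate(([1.0], features))

prediction_dot = x @ w                             # x . w
prediction_sum = w[0] + np.sum(w[1:] * features)   # w_b + sum_i w_i * x_i
print(prediction_dot, prediction_sum)
```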



Following the least squares estimation, we want to minimize the squared loss over the whole dataset, with features $x_0, \dots, x_n$ and entries indexed from $0$ to $T$:

\begin{equation}
L = ||\mathbf{X} \mathbf{w} - \mathbf{y}||^2 = (\mathbf{X} \mathbf{w} - \mathbf{y})^T(\mathbf{X} \mathbf{w} - \mathbf{y}) = \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - \mathbf{y}^T \mathbf{X} \mathbf{w} - \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} = \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y}
\end{equation}

\begin{equation}
\mathbf{X} =
\begin{pmatrix}
  1       & x_{0,0}   & x_{0,1}  & \cdots  & x_{0,n}  \\
  1       & x_{1,0}   & x_{1,1}  & \cdots  & x_{1,n}  \\
  \vdots  & \vdots  & \vdots & \ddots  & \vdots \\
  1       & x_{T,0 }  & x_{T,1}  & \cdots  & x_{T,n}  \\
\end{pmatrix},
\mathbf{y} = 
\begin{pmatrix}
  y_0 \\ y_1 \\ \vdots \\ y_T \\
\end{pmatrix}
\end{equation}
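
The squared norm and its expanded quadratic form can be compared numerically. The following sketch uses small random matrices (purely illustrative, not the diabetes data) and evaluates the loss both ways:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy design matrix with a leading column of ones, toy targets and weights
X = np.column_stack((np.ones(5), rng.normal(size=(5, 2))))
y = rng.normal(size=5)
w = rng.normal(size=3)

# Loss as the squared norm of the residual
residual = X @ w - y
loss_norm = residual @ residual

# Loss in expanded form: w^T X^T X w - 2 w^T X^T y + y^T y
loss_expanded = w @ X.T @ X @ w - 2 * w @ X.T @ y + y @ y
print(loss_norm, loss_expanded)
```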

The loss is at an extremum (here a minimum) when its derivative with respect to the weights is zero:

\begin{align*}
\frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{X}^T \mathbf{y} &\overset{!}{=} 0\\
\mathbf{X}^T \mathbf{X} \mathbf{w}  &= \mathbf{X}^T \mathbf{y}\\
\mathbf{w}  &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\end{align*}
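
To see the closed-form solution at work, here is a minimal sketch on synthetic data. The names `true_w` and `w_hat` and the noise level are assumptions for illustration. The weights obtained from the formula should make the gradient vanish and land close to the weights that generated the data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: a column of ones plus three random features
X = np.column_stack((np.ones(50), rng.normal(size=(50, 3))))
true_w = np.array([1.0, 2.0, -1.0, 0.5])        # weights used to generate y
y = X @ true_w + 0.1 * rng.normal(size=50)      # targets with a little noise

# Closed-form least squares solution: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# The gradient 2 X^T X w - 2 X^T y should be numerically zero at w_hat
gradient = 2 * X.T @ X @ w_hat - 2 * X.T @ y
print(w_hat)
print(gradient)
```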

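Our ambitious goal for today is to cover not only simple linear regression but also multiple linear regression. Start by making room for the bias by concatenating a column of ones to `diabetes_features` and inserting a single 1 at the beginning of `diabetes_targets`.

Create a new np.array called `diabetes_features_prepended` with ones along axis 1 and a new np.array called `diabetes_targets_prepended` to pass the tests in the following test cell. Note that only `diabetes_features_prepended` enters the weight calculation later on; the targets are used there in their original form, since their length has to match the number of rows of the feature matrix.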

In [3]:
ones_array: np.ndarray = np.ones((diabetes_features.shape[0], 1))
diabetes_features_prepended: np.ndarray = np.concatenate((ones_array, diabetes_features), axis=1)
diabetes_targets_prepended: np.ndarray = np.insert(diabetes_targets, 0, 1, axis=0)

In [4]:
# Test cell

assert diabetes_features_prepended.shape == (diabetes_features.shape[0], diabetes_features.shape[1] + 1), "The shape of diabetes_features_prepended is incorrect"
assert np.all(diabetes_features_prepended[:, 0] == 1), "The first column of diabetes_features_prepended is not 1"
assert diabetes_targets_prepended.shape == (diabetes_targets.shape[0] + 1,), "The shape of diabetes_targets_prepended is incorrect"
assert diabetes_targets_prepended[0] == 1, "The first element of diabetes_targets_prepended is not 1"

As a first step, calculate the required matrix multiplication before the inversion of the matrix. Save the result in a variable called `matrix_multiplication`. Note that numpy only does elementwise multiplication with a simple `matrix * matrix`. You'll need a proper matrix multiplication method.
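
The difference between the two kinds of multiplication is easy to see on a tiny made-up matrix. In this sketch, `*` multiplies entry by entry, while the `@` operator and `np.matmul` compute the matrix product:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

elementwise = A * A               # multiplies matching entries
matrix_product = A @ A            # rows times columns
same_as_matmul = np.matmul(A, A)  # equivalent to the @ operator

print(elementwise)
print(matrix_product)
```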

Afterwards, invert the `matrix_multiplication` matrix using an appropriate numpy method and save it to a variable called `inverted_matrix`.

The matrices need to have the right shape, and their product has to yield the identity matrix to pass the tests in the following test cell.

In [5]:
matrix_multiplication: np.ndarray = np.matmul(diabetes_features_prepended.T, diabetes_features_prepended)
inverted_matrix: np.ndarray = np.linalg.inv(matrix_multiplication)

In [6]:
# Test cell

assert matrix_multiplication.shape == (diabetes_features.shape[1] + 1, diabetes_features.shape[1] + 1), "The shape of matrix_multiplication is incorrect"
assert np.allclose(matrix_multiplication, matrix_multiplication.T), "matrix_multiplication is not symmetric"

assert inverted_matrix.shape == (diabetes_features.shape[1] + 1, diabetes_features.shape[1] + 1), "The shape of inverted_matrix is incorrect"
assert np.allclose(np.matmul(matrix_multiplication, inverted_matrix), np.eye(diabetes_features.shape[1] + 1)), "The product of matrix_multiplication and inverted_matrix is not the identity matrix"

The hardest part is done. You can get the optimal weights by two more matrix multiplications.

Calculate the optimal weights and save them in a variable called `weights` to pass the following test cell.

In [7]:
weights: np.ndarray = np.matmul(np.matmul(inverted_matrix, diabetes_features_prepended.T), diabetes_targets)

In [8]:
# Test cell

assert weights.shape == (diabetes_features.shape[1] + 1,), "The shape of weights is incorrect"
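
As an aside, the explicit inverse is not the only way to obtain the weights. The sketch below, on toy data rather than the diabetes set, compares the inverse-based formula with `np.linalg.solve`, which solves the system $\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}$ directly, and with `np.linalg.lstsq`, which works on $\mathbf{X}$ and $\mathbf{y}$ without forming $\mathbf{X}^T \mathbf{X}$. The variable names are chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data with a leading column of ones for the bias
X = np.column_stack((np.ones(30), rng.normal(size=(30, 2))))
y = rng.normal(size=30)

# 1) Explicit inverse, as in the exercise
w_inverse = np.linalg.inv(X.T @ X) @ X.T @ y

# 2) Solve the linear system X^T X w = X^T y without inverting
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3) Least squares solver applied directly to X and y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inverse, w_solve, w_lstsq)
```

All three should agree up to floating-point error on well-conditioned data. The solver-based variants are generally preferred when $\mathbf{X}^T \mathbf{X}$ is close to singular.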

Now that the more complicated part is over, let's train a model automatically with scikit-learn's toolbox.

Create a linear regression model called `model` and train it with the original data to pass the following test cell.

In [9]:
model: LinearRegression = LinearRegression()
model.fit(diabetes_features, diabetes_targets)

LinearRegression()

In [10]:
# Test cell

assert isinstance(model, LinearRegression), "model is not an instance of LinearRegression"
assert hasattr(model, 'coef_'), "model was not trained yet"

Now we want to compare the weights from the sklearn model to those from our own implementation. The bias is stored in an instance variable called `intercept_` and the remaining weights can be accessed via the instance variable `coef_`.

Insert `model.intercept_` into `model.coef_` at position 0 and save the resulting np.array with the variable name `model_weights`. The following test cell will check whether the `weights` and the `model_weights` are equal.
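
How `np.insert` behaves here can be checked on stand-in values. In this sketch, `coefficients` and `intercept` are made-up placeholders for `model.coef_` and `model.intercept_`:

```python
import numpy as np

coefficients = np.array([2.0, 3.0])  # stand-in for model.coef_
intercept = 1.0                      # stand-in for model.intercept_

# Insert the intercept before index 0; np.insert returns a new array
combined = np.insert(coefficients, 0, intercept)
print(combined)
```

The original array is left unchanged, so the result has to be assigned to a new variable.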

In [11]:
model_weights: np.ndarray = np.insert(model.coef_, 0, model.intercept_, axis=0)

In [12]:
# Test cell

assert model_weights.shape == (diabetes_features.shape[1] + 1,), "The shape of model_weights is incorrect. Did you include the bias?"
assert np.allclose(weights, model_weights), "The weights of the model are incorrect"