# Linear Regression Exercise

In this exercise we'll work on our own implementation of linear regression models, compare it to existing regression models, and apply them to a chemical dataset predicting the solubility of different molecules. First I'll import the relevant packages for you.

In [24]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import metrics
np.set_printoptions(precision=2)

I'll also go ahead and prepare a suitable example dataset for you. It's a medical dataset of diabetes patients, with medical information as features and a numerical measure of each patient's disease progression as the target:

**Number of Samples**: 442

**Features**: 10 columns with numeric predictive values

**Target**: Quantitative measure of disease progression one year after baseline

**Feature Information**:
- age: age in years
- sex: 0 male, 1 female probably
- bmi: body mass index
- bp: average blood pressure
- s1: tc, total serum cholesterol
- s2: ldl, low-density lipoproteins
- s3: hdl, high-density lipoproteins
- s4: tch, total cholesterol / HDL
- s5: ltg, possibly log of serum triglycerides level
- s6: glu, blood sugar level

**Note**: Each of these 10 feature variables has been mean-centered and scaled by the standard deviation times the square root of the number of samples, so the sum of squares of each column is 1.
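
To make the scaling in the note concrete, here is a minimal sketch that applies the same transformation to made-up data. The array `raw` and its distribution are assumptions for illustration only; they are not the actual raw diabetes measurements.

```python
import numpy as np

# Hypothetical raw features (NOT the real diabetes data), 442 samples x 3 columns
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=(442, 3))

n_samples = raw.shape[0]

# Mean-center each column, then divide by (standard deviation * sqrt(n_samples))
centered = raw - raw.mean(axis=0)
scaled = centered / (raw.std(axis=0) * np.sqrt(n_samples))

# After this scaling every column has mean 0 and sum of squares 1
column_means = scaled.mean(axis=0)
column_sum_of_squares = (scaled ** 2).sum(axis=0)
print(column_means)
print(column_sum_of_squares)
```

This also explains why the feature values in the dataset are so small: a column whose squares sum to 1 over 442 samples has typical entries of a few hundredths.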

In [36]:
diabetes_dataset: sklearn.utils.Bunch = datasets.load_diabetes() # Scikit-learn dataset, bunch object, similar to a dictionary
diabetes_feature_names: list = diabetes_dataset.feature_names
diabetes_features: np.ndarray = diabetes_dataset.data # you'll work with these features
diabetes_targets: np.ndarray = diabetes_dataset.target # you'll predict this target
diabetes_dataframe: pd.DataFrame = pd.DataFrame(data=diabetes_features, columns=diabetes_feature_names) # Just for convenience if you want to explore the data
diabetes_dataframe["target"] = diabetes_targets

Our ambitious goal for today is to cover not only simple linear regression but also multiple linear regression.

A general formula for multiple regression with the variables `x_0` to `x_n` looks like this:

$$
  f_w(x_0,x_1,...,x_n) = w_{b} \cdot 1 + w_{0} x_0 + w_{1} x_1 + ... + w_{n} x_n = w_b + \sum_{i=0}^n w_i \cdot x_i = \mathbf{x} \cdot \mathbf{w}
$$


\begin{equation}
\mathbf{x} =
\begin{pmatrix}
  1 \\ x_0 \\ x_1 \\ \vdots \\ x_n \\
\end{pmatrix},
\mathbf{w} = 
\begin{pmatrix}
  w_b \\ w_0 \\ w_1 \\ \vdots \\ w_n  \\
\end{pmatrix}
\end{equation}
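
As a small sanity check of the vector form, the sketch below prepends a 1 to a feature vector so that the bias weight `w_b` is handled by the same dot product. All numbers are made up for illustration.

```python
import numpy as np

# Made-up example values: three features x_0, x_1, x_2
features = np.array([2.0, -1.0, 0.5])
w_b = 4.0                              # bias weight
weights = np.array([1.0, 3.0, -2.0])   # w_0, w_1, w_2

# Prepend the constant 1 to x and the bias weight to w
x = np.concatenate(([1.0], features))
w = np.concatenate(([w_b], weights))

# Both ways of writing the model give the same prediction
prediction_dot = x @ w
prediction_sum = w_b + np.sum(weights * features)
print(prediction_dot, prediction_sum)
```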



Following least-squares estimation, we want to minimize the squared loss over the whole dataset, with the features indexed from 0 to n and the entries from 0 to T:

\begin{equation}
L = ||\mathbf{X} \mathbf{w} - \mathbf{y}||^2 = (\mathbf{X} \mathbf{w} - \mathbf{y})^T(\mathbf{X} \mathbf{w} - \mathbf{y}) = \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - \mathbf{y}^T \mathbf{X} \mathbf{w} - \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} = \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y}
\end{equation}

\begin{equation}
\mathbf{X} =
\begin{pmatrix}
  1       & x_{0,0}   & x_{0,1}  & \cdots  & x_{0,n}  \\
  1       & x_{1,0}   & x_{1,1}  & \cdots  & x_{1,n}  \\
  \vdots  & \vdots  & \vdots & \ddots  & \vdots \\
  1       & x_{T,0 }  & x_{T,1}  & \cdots  & x_{T,n}  \\
\end{pmatrix},
\mathbf{y} = 
\begin{pmatrix}
  y_0 \\ y_1 \\ \vdots \\ y_T \\
\end{pmatrix}
\end{equation}
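
The design matrix and the loss can be sketched in NumPy as follows. The data here is random and purely illustrative. The check at the end compares the direct form $||\mathbf{X}\mathbf{w} - \mathbf{y}||^2$ with the expanded quadratic form $\mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y}$.

```python
import numpy as np

# Random illustrative data: 6 entries, 2 features
rng = np.random.default_rng(1)
features = rng.normal(size=(6, 2))
y = rng.normal(size=6)
w = rng.normal(size=3)  # bias weight plus one weight per feature

# Design matrix: a leading column of ones, followed by the features
X = np.hstack([np.ones((features.shape[0], 1)), features])

# Squared loss written directly and in expanded form
residual = X @ w - y
loss_direct = residual @ residual
loss_expanded = w @ X.T @ X @ w - 2 * w @ X.T @ y + y @ y
print(loss_direct, loss_expanded)
```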

The loss reaches its extremum (here a minimum) where its derivative with respect to the weights is zero:

\begin{equation}
\frac{\partial L}{\partial \mathbf{w}} \overset{!}{=} 0
\end{equation}

\begin{equation}
\frac{\partial L}{\partial \mathbf{w}} = -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \mathbf{w}
\end{equation}
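
Setting the gradient $2 \mathbf{X}^T (\mathbf{X} \mathbf{w} - \mathbf{y})$ to zero gives the normal equations $\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}$. The following is one possible sketch of solving them on synthetic data, assuming $\mathbf{X}^T \mathbf{X}$ is invertible. The helper `loss_gradient`, the weights `true_w` and the noise level are made up for illustration, and `np.linalg.lstsq` serves only as a cross-check.

```python
import numpy as np

# Synthetic data generated from known (made-up) weights plus a little noise
rng = np.random.default_rng(2)
features = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5, 3.0])  # bias weight first
X = np.hstack([np.ones((50, 1)), features])
y = X @ true_w + rng.normal(scale=0.1, size=50)


def loss_gradient(X, y, w):
    """Gradient of ||Xw - y||^2 with respect to w."""
    return 2 * X.T @ (X @ w - y)


# Solve the normal equations X^T X w = X^T y (assumes X^T X is invertible)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

gradient_at_solution = loss_gradient(X, y, w_normal)
print(w_normal)
print(w_lstsq)
print(gradient_at_solution)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of $\mathbf{X}^T \mathbf{X}$ explicitly. For ill-conditioned feature matrices a least-squares solver is usually the more robust choice.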


With multivariate polynomial linear regression we extend the model with powers of each feature.

First you'll have to create a 3D array in which each value of the 2D feature matrix is raised to the powers 0, 1, 2 and 3:
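
The broadcasting behind this step is easier to see on a tiny array. In the sketch below, `small_features` is a made-up 2x2 matrix. Adding a trailing axis lets every value be raised to each exponent at once.

```python
import numpy as np

# Made-up 2 x 2 feature matrix (2 samples, 2 features)
small_features = np.array([[1.0, 2.0],
                           [3.0, -1.0]])
exponents = np.arange(4)  # powers 0, 1, 2, 3

# Shape (2, 2, 1) broadcast against shape (4,) gives shape (2, 2, 4)
powers = small_features[:, :, np.newaxis] ** exponents

print(powers.shape)
print(powers[0, 1])  # powers of the value 2.0
```

The last axis indexes the exponent, so `powers[i, j, k]` is feature `j` of sample `i` raised to the power `k`.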

In [39]:
# Broadcast every feature value against the exponents 0..3; the result has shape (442, 10, 4)
A = diabetes_features[:, :, np.newaxis] ** np.arange(4)

[[[ 1.00e+00  3.81e-02  1.45e-03]
  [ 1.00e+00  5.07e-02  2.57e-03]
  [ 1.00e+00  6.17e-02  3.81e-03]]

 [[ 1.00e+00 -1.88e-03  3.54e-06]
  [ 1.00e+00 -4.46e-02  1.99e-03]
  [ 1.00e+00 -5.15e-02  2.65e-03]]

 [[ 1.00e+00  8.53e-02  7.28e-03]
  [ 1.00e+00  5.07e-02  2.57e-03]
  [ 1.00e+00  4.45e-02  1.98e-03]]]


[[ 0.04  0.05  0.06  0.02 -0.04 -0.03 -0.04 -0.    0.02 -0.02]
 [-0.   -0.04 -0.05 -0.03 -0.01 -0.02  0.07 -0.04 -0.07 -0.09]
 [ 0.09  0.05  0.04 -0.01 -0.05 -0.03 -0.03 -0.    0.   -0.03]
 [-0.09 -0.04 -0.01 -0.04  0.01  0.02 -0.04  0.03  0.02 -0.01]
 [ 0.01 -0.04 -0.04  0.02  0.    0.02  0.01 -0.   -0.03 -0.05]]
