# SLU9 - Regression: Learning notebook

In this notebook we will cover the following:
* What is regression?
* Simple Linear Regression
* Gradient Descent
* Multiple Linear Regression
* Using Scikit Learn to perform regression

## What is regression? 

Regression is a modeling task whose objective is to create a (linear or non-linear) map between the **independent variables** (i.e. the feature columns in your pandas dataframe) and a set of **continuous dependent variables** (i.e. the variables you want to predict) by estimating a set of **unknown parameters**.

Examples of regression tasks:
* predicting house prices (example range: [100k\$; 500k\$]);
* predicting the rating that a user would assign to a movie (example range: [1 star; 7 stars]);
* predicting the total sales for each day, in each shop of a shopping mall;
* predicting emotional descriptors for a song;
* predicting the trajectory of a fighter jet.

Nowadays, there are *a lot* of algorithms to solve this task but we will focus on one of the easiest to understand: **linear regression**. It is one of the most used regression methods in the world to this day due to how easy it is to (1) interpret the model, (2) implement it and (3) implement extensions that deal with datasets with few data points, noise and outliers.

First, let's explore how **simple linear regression** works.

## Simple Linear Regression

This model is a special case of linear regression where you have a single feature. The model is, simply, a line equation

$$\hat{y} = \beta_0 + \beta_1 \cdot x$$

* $\hat{y}$ is the value predicted by the model; 
* $x$ is the input feature; 
* $\beta_0$ is the y-axis value where $x=0$, usually called the *intercept*; 
* $\beta_1$ tells you how much $\hat{y}$ changes when $x$ increases by one unit (the slope of the line), usually called the *coefficient*.

You can create a simple lambda function in order to implement this model:

In [1]:
lr = lambda x, b0, b1: b0 + b1 * x

Now, let's create some data and test this function

In [2]:
import numpy as np

x = np.arange(10)

In [3]:
lr(x, 0, 1)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]:
lr(x, 1, 1)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [5]:
lr(x, 0, 3)

array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])

In [6]:
lr(x, -10, 3)

array([-10,  -7,  -4,  -1,   2,   5,   8,  11,  14,  17])

To make it easy to manipulate the model parameters, let's use the following demo

In [7]:
from utils import simple_linear_regression_manual_demo_1

In [8]:
simple_linear_regression_manual_demo_1()


The green lines represent the x and y axes, while the blue line is $\hat{y}$ for each value of $x$. As you can see for yourself, if you decrease/increase $\beta_0$, the point where the line crosses the y axis moves down/up. If you increase/decrease $\beta_1$, the slope of the line increases/decreases.

Now, let's try to manually change $\beta_0$ and $\beta_1$ in order to fit a small dataset. In order to make your job easier, we added a metric that goes down when you use better parameter combinations

In [9]:
from utils import simple_linear_regression_demo_2

In [10]:
simple_linear_regression_demo_2()

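
The notebook does not say which metric the demo displays. A common choice, assumed here purely for illustration, is the mean squared error (MSE): the average of the squared differences between targets and predictions. The helper `mse`, the `predict` lambda and the toy data below are hypothetical names and values, not part of `utils`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Toy data generated from y = 1 + 2 * x (illustrative values only)
x_toy = np.arange(5)
y_toy = 1 + 2 * x_toy

# Same line equation as the model defined earlier
predict = lambda x, b0, b1: b0 + b1 * x

error_bad = mse(y_toy, predict(x_toy, 0, 1))   # parameters far from the true ones
error_good = mse(y_toy, predict(x_toy, 1, 2))  # the true parameters
print(error_bad, error_good)
```

The closer the parameters get to the ones that generated the data, the smaller the error becomes, which is the behavior you are looking for when moving the sliders.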

Ok, doing this manually sucks. So, humans developed optimization algorithms to allow machines to adjust $\beta_0$ and $\beta_1$ according to some data set. There are, at least, 4 categories of optimization procedures to do it:

1. iterative methods using gradients;
2. closed form solution through normal equations;
3. evolutionary methods like genetic algorithms or particle swarm; 
4. bayesian optimization.

Methods based on 3 and 4 are kind of an overkill at this point in time: they don't guarantee you the optimal set of parameters for the model and are just a curiosity (well, to be honest, methods based on 4 have certain nice properties but let's not get into that rabbit hole, even though the hole has really nice candy and smells good). We will explore methods based on gradient descent because they provide a, somewhat, universal approach to optimization tasks and are really simple to grasp.

## Gradient Descent

TODO

In the following code snippet, we will be minimizing a very simple function ($f(x) = x^2$). Change the *learning_rate* parameter and notice the effect on the value of $f(x)$

In [11]:
from utils import gradient_descent_learning_rate_impact_demo

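Up to a point, the higher the value of *learning_rate*, the faster $x$ converges to 0. If the learning rate is too large, each step overshoots the minimum and $x$ can oscillate or even diverge.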
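
A minimal sketch of this experiment in plain Python, independent of the demo in `utils`. The function `gradient_descent`, the starting point and the number of steps are assumptions made for illustration. For $f(x) = x^2$ the gradient is $2x$, so each update is $x \leftarrow x - \text{learning\_rate} \cdot 2x$.

```python
def gradient_descent(learning_rate, x0=10.0, steps=20):
    """Minimize f(x) = x**2 starting from x0; returns every visited x."""
    x = x0
    history = [x]
    for _ in range(steps):
        gradient = 2 * x                  # derivative of x**2
        x = x - learning_rate * gradient  # step against the gradient
        history.append(x)
    return history

for lr_value in (0.01, 0.1, 0.5, 1.1):
    final_x = gradient_descent(lr_value)[-1]
    print(f"learning_rate={lr_value}: x after 20 steps = {final_x:.4f}")
```

With this function, small learning rates move slowly towards 0, larger ones get there faster, and a learning rate above 1 makes $|x|$ grow at every step instead of shrinking.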

## Multiple Linear Regression

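Most phenomena in our world depend on several factors. For example, house prices depend on things like (1) number of rooms, (2) distance to malls, (3) distance to parks, (4) how old the house is, etc. As such, it would be naive to create a predictive linear model that uses a single feature. Multiple linear regression extends the line equation to several features $x_1, x_2, \dots$, each with its own coefficient. With 5 features, the model can be written term by term, as a sum, or for the $j$-th observation of the dataset:
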
$$\hat{y} = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \beta_4 \cdot x_4 + \beta_5 \cdot x_5$$

$$\hat{y} = \beta_0 + \sum_{i=1}^{5} \beta_i \cdot x_i$$

$$\hat{y}^{[j]} = \beta_0 + \sum_{i=1}^{5} \beta_i \cdot x_i^{[j]}$$
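
A minimal sketch of the formula above with NumPy, where the sum over features becomes a matrix-vector product. The function `predict_multiple` and the toy values are hypothetical and only meant to illustrate the equation.

```python
import numpy as np

def predict_multiple(X, b0, betas):
    """Return b0 + sum_i betas[i] * X[:, i] for every row (observation) of X."""
    X = np.asarray(X, dtype=float)
    betas = np.asarray(betas, dtype=float)
    return b0 + X @ betas

# Two observations (rows) with five features (columns); illustrative values
X_toy = np.array([[1, 0, 2, 0, 1],
                  [0, 1, 0, 3, 1]])
b0_toy = 1.0
betas_toy = np.array([2.0, -1.0, 0.5, 0.0, 3.0])

y_hat = predict_multiple(X_toy, b0_toy, betas_toy)
print(y_hat)
```

Each row of `X_toy` plays the role of one observation $j$, and each entry of `betas_toy` is one $\beta_i$.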

## Linear Regression Pros & Cons

**PROS**
* Really easy to understand
* Fast optimization
* Extensions available to deal with: 
 * small data
 * data sparsity
 * outliers

**CONS**
* Sensitive to outliers
* Assumes that there is no multicollinearity
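* Feature scaling is required when the model is fitted with gradient descent or when coefficients are compared
* Monotonicity assumption: for the model, the relation between each feature and the output is always increasing or always decreasing
* Categorical encoding: this might get tricky when the number of unique values is big and part of those values have few occurrences.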


#### Notes

At this point, if you already knew linear regression in detail before the academy, you might be wondering: *"Where is the error component in the linear regression formula?"*. The reason is quite simple: since we wanted you to approach this subject TODO

Also, we didn't include all assumptions made by the linear regression model. For a hands-on approach to the assumptions, check this [blog post by Selva Prabhakaran](http://r-statistics.co/Assumptions-of-Linear-Regression.html).

## Using Scikit Learn to perform regression

After learning the basics about linear regression and how to estimate, iteratively, the best parameters for the model, it is time to learn how to use linear regression with Scikit Learn. 

[Scikit Learn][sklearn] is an industry standard for data science and machine learning and we will be using it extensively throughout the academy. Scikit Learn has two implementations of linear regression:
* [*sklearn.linear_model.SGDRegressor*][SGDRegressor]: uses stochastic gradient descent to estimate the intercept and coefficients. Also, this class allows more advanced forms of linear regression that are out of scope for the moment.
* [*sklearn.linear_model.LinearRegression*][LinearRegression]: computes the ordinary least squares solution directly, i.e. the intercept and coefficients described by the normal equations. The normal equations are the closed-form solution for linear regression, meaning that the solution is computed in a known number of steps and is guaranteed to minimize the squared error on the training data. If you want to know more about this, [read this blog post][normal_eq].

[SGDRegressor]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor
[LinearRegression]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
[sklearn]: http://scikit-learn.org
[normal_eq]: https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression
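
To make the closed-form idea concrete, here is a minimal NumPy sketch of the normal equations, $(X^T X)\,\beta = X^T y$, on synthetic data. This is only an illustration of the math, not a description of what scikit-learn does internally; all variable names and values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 4 + 3 * x1 - 2 * x2 + small noise
X_syn = rng.normal(size=(200, 2))
y_syn = 4 + 3 * X_syn[:, 0] - 2 * X_syn[:, 1] + rng.normal(scale=0.01, size=200)

# Add a column of ones so the intercept is estimated as beta_0
X_design = np.column_stack([np.ones(len(X_syn)), X_syn])

# Solve (X^T X) beta = X^T y instead of inverting X^T X explicitly
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_syn)
print(beta_hat)
```

The first entry of `beta_hat` is the intercept and the remaining ones are the coefficients, recovered in a single linear solve with no learning rate or iterations involved.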

First, let's load the [Boston housing price dataset][boston_kaggle]

[boston_kaggle]: https://www.kaggle.com/c/boston-housing

In [None]:
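import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell
# needs an older scikit-learn version to run as written.
from sklearn.datasets import load_boston

# Load the dataset and print its documentation
data = load_boston()
print(data['DESCR'])

# Features as a DataFrame, target (median house value) as a Series
x = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.Series(data['target'], name='medv')

# Preview features and target side by side
pd.concat((x, y), axis=1).head(5)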

Let's start by experimenting with _LinearRegression_

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(x, y)

print('\nTargets for the first 5 rows: \n', y.head(5).values)
print('\nPredictions for the first 5 rows: \n', lr.predict(x.head(5)))
print('\nTotal Score\n', lr.score(x, y))

We got an R² score of ~0.74, which might be adequate for a first try. R² is one of the most well-known metrics to evaluate regression models. We will dive into it in SLU12.
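
Until then, here is a minimal sketch of how R² can be computed by hand, assuming the usual definition: 1 minus the ratio between the residual sum of squares and the total sum of squares. The function `r_squared` and the toy values are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1 - ss_res / ss_tot

y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.5, 7.0, 9.5])

r2_toy = r_squared(y_true_toy, y_pred_toy)
r2_mean = r_squared(y_true_toy, np.full(4, y_true_toy.mean()))
print(r2_toy, r2_mean)
```

A model that always predicts the mean of the target gets a score of 0, and a perfect model gets 1.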

Modelling is not only about getting the most accurate model ever. If you get a big R² score for the wrong reasons (e.g. target leakage, too many useless variables), that model is kind of...useless. As such, let's look into how each feature contributes to the prediction

In [None]:
a = pd.Series(lr.coef_, index=x.columns, 
              name='Features Coefficients (sorted by magnitude)')
index = a.abs().sort_values(ascending=False).index
a = a.loc[index]
a

_NOX_, according to the dataset documentation, refers to _"nitric oxides concentration (parts per 10 million)"_. The coefficient for _NOX_ is WAAAAAY BIGGER than the ones in the other features. Does it mean that (1) air pollution is a BIIIG problem in Boston, (2) people that buy houses in Boston REALLY REALLY REALLY HATE air pollution

![pollution_level_chinese](http://weknowmemes.com/generator/uploads/generated/g136362126738785004.jpg)

or does it mean that something was wrong in our approach? 

First of all, let's check some statistics about our features

In [None]:
x.describe()

It seems that the scales of the features are *way* different from one another. For example, the domain of _CRIM_ is [0.006320; 88.976200] while _TAX_ is in [187; 711]. This means that, in the context of linear regression, the **coefficients are not comparable**: each coefficient is expressed in units of the target per unit of its feature, so its magnitude depends on the scale of that feature and not only on how important the feature is. Fortunately, we have a preprocessed version of this dataset. Let's use it

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('data/boston (scaled).csv')

x_ = data.drop(['MEDV'], axis=1)
y = data['MEDV']

lr = LinearRegression()

lr.fit(x_, y)

print('\nTargets for the first 5 rows: \n', y.head(5).values)
print('\nPredictions for the first 5 rows: \n', lr.predict(x_.head(5)))
print('\nTotal Score\n', lr.score(x_, y))

After bringing all features to the same scale, we can now compare the importance of each feature

In [None]:
a = pd.Series(lr.coef_, index=x_.columns, 
              name='Features Coefficients (sorted by magnitude)')
index = a.abs().sort_values(ascending=False).index
a = a.loc[index]
a


Also, we can normalize the coefficients in order to see the relative weight of each feature

In [None]:
a.abs() / a.abs().sum()

It seems that over 50% of the relative feature strength is concentrated in: 
* _LSTAT_ (decreases price): % lower status of the population
* _DIS_ (decreases price): weighted distances to five Boston employment centres
* *RM* (increases price): average number of rooms per dwelling
* _RAD_ (increases price): index of accessibility to radial highways
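
Going back to the scaling step: the notebook does not say how `boston (scaled).csv` was produced. One common option, assumed here purely for illustration, is standardization with scikit-learn's `StandardScaler` (subtract the mean, divide by the standard deviation). The sketch below uses synthetic data to show why coefficients only become comparable after scaling; all names and values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic features with equal real influence but very different units
small_scale = rng.normal(loc=0.5, scale=0.1, size=500)
large_scale = rng.normal(loc=400, scale=100, size=500)
X_raw = np.column_stack([small_scale, large_scale])

# Each feature moves the target by one unit per standard deviation
y_demo = small_scale / 0.1 + large_scale / 100 + rng.normal(scale=0.1, size=500)

raw_coefs = LinearRegression().fit(X_raw, y_demo).coef_

# Standardize: subtract the mean and divide by the standard deviation
X_scaled = StandardScaler().fit_transform(X_raw)
scaled_coefs = LinearRegression().fit(X_scaled, y_demo).coef_

print("raw coefficients:   ", raw_coefs)     # roughly 10 and 0.01
print("scaled coefficients:", scaled_coefs)  # both close to 1
```

On the raw data the first coefficient looks about a thousand times bigger than the second, even though both features matter equally. After scaling, the two coefficients are nearly identical.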

Now, time to use SGDRegressor. As previously stated, this class allows fine-tuning of the learning rate, weight constraints, extensions of gradient descent, etc. We will use the configuration that behaves most like plain stochastic gradient descent.

In [None]:
from sklearn.linear_model import SGDRegressor

learning_rate = 0.001
epochs = 100

lr = SGDRegressor(random_state=10, 
                  penalty=None, 
                  shuffle=True, 
                  learning_rate='constant', 
                  eta0=learning_rate, 
                  max_iter=epochs)

lr.fit(x, y)

print('\nTargets for the first 5 rows: \n', y.head(5).values)
print('\nPredictions for the first 5 rows: \n', lr.predict(x.head(5)))
print('\nTotal Score\n', lr.score(x, y))

WTF?! We got predictions that overshoot wildly and an AWFUL R² score! Is this related to feature scaling? Let's check if that is the case.

In [None]:
learning_rate = 0.001
epochs = 100

lr = SGDRegressor(random_state=10, 
                  penalty=None, 
                  shuffle=True, 
                  learning_rate='constant', 
                  eta0=learning_rate, 
                  max_iter=epochs)

lr.fit(x_, y)

print('\nTargets for the first 5 rows: \n', y.head(5).values)
print('\nPredictions for the first 5 rows: \n', lr.predict(x_.head(5)))
print('\nTotal Score\n', lr.score(x_, y))

I guess we have our answer. Unlike _LinearRegression_, **having all features on the same scale is not optional** for _SGDRegressor_.

In [None]:
a = pd.Series(lr.coef_, index=x_.columns, 
              name='Features Coefficients (sorted by magnitude)')
index = a.abs().sort_values(ascending=False).index
a = a.loc[index]
a
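
One way to make sure the scaling step is never forgotten is to bundle it with the model. The sketch below is an assumption about how this could be done, not something the notebook does: it chains `StandardScaler` and `SGDRegressor` with `make_pipeline`, using synthetic data and the same SGD settings as above.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic features on very different scales (illustrative values only)
X_sgd = np.column_stack([
    rng.normal(loc=0.5, scale=0.1, size=1000),
    rng.normal(loc=400, scale=100, size=1000),
])
y_sgd = 5 + 20 * X_sgd[:, 0] + 0.03 * X_sgd[:, 1] + rng.normal(scale=0.1, size=1000)

# The scaler is fitted inside the pipeline, so SGD always sees scaled features
sgd_pipeline = make_pipeline(
    StandardScaler(),
    SGDRegressor(random_state=10, penalty=None, shuffle=True,
                 learning_rate='constant', eta0=0.001, max_iter=100),
)
sgd_pipeline.fit(X_sgd, y_sgd)
print(sgd_pipeline.score(X_sgd, y_sgd))
```

Calling `fit`, `predict` or `score` on the pipeline applies the scaler first, so new data is transformed with the statistics learned during training.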

There are several steps we didn't include: 
* Exclude the feature _CHAS_ from the scaler. _CHAS_ is a dummy feature (i.e. the result of categorical feature encoding) and isn't a continuous feature per se.
* Perform correlation analysis in order to avoid including 2, or more, features that are highly correlated. When using highly correlated features in a linear model, you are violating the assumption of independence between features (no multicollinearity).
* We didn't remove outliers. This is a problem for models like linear regression due to their sensitivity to outliers. Fortunately, there are implementations for robust linear regression within scikit learn ([RANSAC][RANSAC], [Theil-Sen][Theil-Sen] and [Huber][Huber]).
* We didn't perform correlation analysis between each input feature and the target. 

We didn't include those steps but, with all you have learned so far in the academy, you are able to perform those by yourself. :)
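
As a starting point, here is a minimal sketch of the two correlation checks mentioned above, using pandas on a synthetic frame. The column names, the 0.9 threshold and the data are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic frame: 'b' is almost a copy of 'a', 'c' is independent
base = rng.normal(size=300)
frame = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.05, size=300),
    'c': rng.normal(size=300),
})
frame['target'] = 2 * frame['a'] + rng.normal(scale=0.5, size=300)

# Absolute Pearson correlation between every pair of input features
feature_corr = frame.drop(columns='target').corr().abs()

# Keep only the upper triangle so each pair is listed once
mask = np.triu(np.ones(feature_corr.shape, dtype=bool), k=1)
pairs = feature_corr.where(mask).stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)

# Correlation between each input feature and the target
target_corr = frame.corr()['target'].drop('target')
print(target_corr)
```

Pairs listed in `high_pairs` are candidates for dropping one of the two features, while `target_corr` gives a first hint of which features carry signal about the target.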


[RANSAC]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RANSACRegressor.html#sklearn.linear_model.RANSACRegressor
[Theil-Sen]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TheilSenRegressor.html#sklearn.linear_model.TheilSenRegressor
[Huber]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html#sklearn.linear_model.HuberRegressor