In [None]:
%load_ext autoreload
%autoreload 2

# SLU9 - Regression: Learning notebook

In this notebook we will cover the following:
* [What is regression?](#What-is-regression?)
* [Simple Linear Regression](#Simple-Linear-Regression)
* [Gradient Descent](#Gradient-Descent)
* [Multiple Linear Regression](#Multiple-Linear-Regression)
* [Using Scikit Learn to perform regression](#Using-Scikit-Learn-to-perform-regression)

## What is regression? 

A modeling task whose objective is to create a (linear or non-linear) map between the **independent variables** (i.e. the feature columns in your pandas dataframe) and a set of **continuous dependent variables** (i.e. the variables you want to predict) by estimating a set of **unknown parameters**. 

Examples of regression tasks:
* predicting house prices (example range: [100k\$; 500k\$]);
* predicting the rating that a user would assign to a movie (example range: [1 star; 7 stars]); 
* predicting the total sales for each day, in each shop of a shopping mall;
* predicting emotional descriptors for a song;
* predicting the trajectory of a fighter jet.

Nowadays, there are *a lot* of algorithms to solve this task but we will focus on one of the easiest to understand: **linear regression**. It is still one of the most used regression methods in the world because of how easy it is to (1) interpret the model, (2) implement it and (3) implement extensions that deal with datasets with few data points, noise and outliers. 

First, let's explore how **simple linear regression** works.

## Simple Linear Regression

This model is a special case of linear regression where you have a single feature. The model is, simply, a line equation

$$\hat{y} = \beta_0 + \beta_1 \cdot x$$

* $\hat{y}$ is the value predicted by the model; 
* $x$ is the input feature; 
* $\beta_0$ is the y-axis value where $x=0$, usually called the *intercept*; 
* $\beta_1$ tells you how much $\hat{y}$ changes when $x$ changes, usually called the *coefficient*.

The impact of $x$ on $\hat{y}$ can be stated as follows: _For each unit you increment in $x$, $\hat{y}$ changes by $\beta_1$ units._ For example: 

$$HousePrice = 1.1 + 4 \cdot NumberOfRooms$$

means that the _price of the house increases by 4 units (e.g. 1 unit = 10000\$) each time a room is added to the house._

You can create a simple lambda function in order to implement this model:

In [None]:
lr = lambda x, b0, b1: b0 + b1 * x

Now, let's create some data and test this function

In [None]:
import numpy as np

x = np.arange(10)

In [None]:
lr(x, 0, 1)

In [None]:
lr(x, 1, 1)

In [None]:
lr(x, 0, 3)

In [None]:
lr(x, -10, 3)

To make it easy to manipulate the model parameters, let's use the following demo

In [None]:
from utils import simple_linear_regression_manual_demo_1

In [None]:
simple_linear_regression_manual_demo_1()

The green plot represents both the x and y axes while the blue line is the $\hat{y}$ for each value of $x$. As you can see for yourself, if you decrease/increase $\beta_0$, the point where the blue line crosses the y axis moves down/up. If you increase/decrease $\beta_1$, the slope of the line increases/decreases.

Now, let's try to manually change $\beta_0$ and $\beta_1$ in order to fit a small dataset. In order to make your job easier, we added a metric (let's call it $J$) that goes down when you use better parameter combinations

$$J = \frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2$$

where $y_n$ is the target $n$ in your dataset.
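
As a minimal sketch of how $J$ could be computed with numpy (the function name `mean_squared_error_cost` and the toy data are assumptions made for illustration, they are not part of `utils`):

```python
import numpy as np


def mean_squared_error_cost(y, y_hat):
    """Compute J = (1/N) * sum((y_n - y_hat_n)^2)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)


# Hypothetical toy data generated by y = 1 + 2 * x.
x_toy = np.arange(5)
y_toy = 1 + 2 * x_toy

# The parameters that generated the data versus a worse guess.
j_good = mean_squared_error_cost(y_toy, 1 + 2 * x_toy)
j_bad = mean_squared_error_cost(y_toy, 0 + 1 * x_toy)
print(j_good, j_bad)
```

The better the parameter combination, the smaller the value of $J$, which is exactly what you try to achieve by hand in the demo.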

In [None]:
from utils import simple_linear_regression_demo_2

In [None]:
simple_linear_regression_demo_2()

Ok, doing this manually sucks. So, humans developed optimization algorithms to allow machines to adjust $\beta_0$ and $\beta_1$ according to some data set. There are, at least, 4 categories of optimization procedures to do it:

1. iterative methods using gradients;
2. closed form solution through normal equations;
3. evolutionary methods like genetic algorithms or particle swarm; 
4. bayesian optimization.

Methods based on 3 and 4 are overkill at this point: they don't guarantee you the optimal set of parameters for the model and are, for now, just a curiosity (well, to be honest, methods based on 4 have certain nice properties but let's not get into that rabbit hole, even though the hole has really nice candy and smells good). We will explore methods based on gradient descent because they provide a, somewhat, universal approach to optimization tasks and are really simple to grasp.

## Gradient Descent

Gradient descent is a well-known and well-studied method for iterative optimization of both linear and non-linear models. You can use it to estimate the parameters for linear regression, neural networks, probabilistic graphical models, k-means and many more!

The most essential component of the gradient descent algorithm is the **update rule**. Let $f$ be a differentiable function and $\omega$ one of the parameters of $f$. Then, in order to minimize the value output by $f(\omega)$, we will iteratively apply the following

$$\omega = \omega - \alpha \frac{\partial f}{\partial \omega}$$

where $\frac{\partial f}{\partial \omega}$ is the [partial derivative of $f$ with respect to $\omega$](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives) and $\alpha$ is the learning rate. So, what gradient descent does is use the partial derivative as a heuristic for the direction in which the minimum is located: the derivative points uphill, so we step in the opposite direction. The product of the learning rate and the partial derivative gives you a "velocity" factor that you apply to $\omega$. There are two ways to increase the "velocity": (1) higher learning rates, (2) bigger gradients. 

In the following demo, you will control the learning rate of gradient descent for the minimization of $f(x) = x^2$. 

In [None]:
from utils import gradient_descent_learning_rate_impact_demo

In [None]:
gradient_descent_learning_rate_impact_demo()

What is happening in the demo above is the following:

In [None]:
# Define the function to be minimized.
f = lambda x: x ** 2

# Define the partial derivative of the 
# function with respect to the feature.
df_dx = lambda x: 2 * x

# Define the learning rate.
learning_rate = 0.4

# The initial value for x.
x = -300

# The number of iterations to use.
epochs = 10

# Gradient Descent
for epoch in range(epochs): 
    deriv = df_dx(x)
    x = x - learning_rate * deriv
    print((x, f(x), deriv))

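If you increase the learning rate, it will converge faster to 0, up to a point. For this function, each update multiplies $x$ by $(1 - 2\alpha)$, so convergence is fastest at $\alpha = 0.5$. If you keep increasing it (e.g. to 0.75-0.80), $x$ starts jumping from one side of the minimum to the other and convergence gets slower. Even worse, for learning rates above 1 the updates overshoot more and more at each step (i.e. the value of $x$ heads towards $-\infty$ or $+\infty$).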
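
To make the effect of the learning rate more concrete, here is a small sketch that counts how many updates gradient descent needs on $f(x) = x^2$ for a few learning rates. The helper `steps_to_converge`, the tolerance and the step limit are assumptions made for this illustration.

```python
def steps_to_converge(learning_rate, x0=-300.0, tol=1e-6, max_steps=1000):
    """Run gradient descent on f(x) = x**2 and return the number of
    updates needed to get |x| below tol, or None if that never happens
    within max_steps."""
    x = x0
    for step in range(1, max_steps + 1):
        x = x - learning_rate * 2 * x  # df/dx = 2x
        if abs(x) < tol:
            return step
    return None


for rate in (0.1, 0.4, 0.5, 0.6, 0.9, 1.1):
    print(rate, steps_to_converge(rate))
```

Learning rates that are too small or too close to 1 need many more updates than the ones around 0.5, and the last one never gets there.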

The previous function was quite simple to minimize. That is because it is a [convex function](https://en.wikipedia.org/wiki/Convex_function), meaning that, as long as you keep the learning rate at a reasonable value, you will converge to the minimum, sooner or later. But what happens when we use a function like 

$$f(x) = x^2 + \left|15 x\right| \cdot \cos(x)$$


It will be minimized with


$$x_{i+1} = 
x_{i} - \alpha \frac{\partial f}{\partial x_{i}} = 
x_{i} - \alpha \cdot \left(2 x_i + \cos(x_i) \cdot \frac{15 x_i}{\left| x_i \right|} - \left|15 x_i\right| \cdot \sin(x_i)\right)$$

![sad_hamster](https://media.giphy.com/media/8UHwuM947LUjyyYh1o/giphy.gif "I thought this was a bootcamp")

Yes, I know it looks like an awfully complicated formula but bear with me. You will see why we used it in just a moment.

In [None]:
from utils import non_convex_gradient_descent_demo

non_convex_gradient_descent_demo()

With this **non-convex** function, gradient descent is unable to converge to one of the **global minima** and gets stuck in one of the **local minima**. Of course you could play with the learning rate back and forth until you manage to get to one of those global minima. But the machine doesn't have the same capabilities as you (yet). All this to explain a simple fact: **there are no guarantees about reaching the global minima**. Fortunately, linear regression uses a convex cost function. :)
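
The sketch below reproduces this behavior outside the demo. The names `f_nc`, `df_nc_dx` and `gradient_descent`, as well as the starting points, learning rate and number of epochs, are assumptions chosen for illustration and may differ from what `non_convex_gradient_descent_demo` uses internally.

```python
import math


def f_nc(x):
    """The non-convex function f(x) = x^2 + |15x| * cos(x)."""
    return x ** 2 + abs(15 * x) * math.cos(x)


def df_nc_dx(x):
    # 15 * x / |x| is 15 * sign(x); it is undefined at x = 0,
    # which the starting points used here never reach.
    return 2 * x + math.cos(x) * 15 * x / abs(x) - abs(15 * x) * math.sin(x)


def gradient_descent(x0, learning_rate=0.001, epochs=5000):
    x = x0
    for _ in range(epochs):
        x = x - learning_rate * df_nc_dx(x)
    return x


# Same function, same learning rate, two different starting points.
x_from_3 = gradient_descent(3.0)
x_from_9 = gradient_descent(9.0)
print(x_from_3, f_nc(x_from_3))
print(x_from_9, f_nc(x_from_9))
```

Both runs stop at a point where the derivative is zero, but the value of $f$ at the two end points is different: where you end up depends on where you start.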

Now, let's get back to simple linear regression. When applying gradient descent to a model, we need to use the data points of a dataset to adjust the parameters of the model. In order to do that, there are two main flavors of gradient descent: (1) stochastic gradient descent and (2) batch gradient descent. 


#### **Stochastic Gradient Descent (SGD)**

In SGD, we update the function parameters for each dataset observation we have. The generic pseudo code for SGD is the following:

1. _For epoch in 1...epochs:_
    1. _X' = shuffle(X)_
    2. _For each $x_n$ in $X'$_:
        1. $\omega = \omega - \alpha \frac{\partial f}{\partial \omega}$
        
where $\omega$ is the set of parameters of your function $f$, $X$ is your dataset (input and targets included) and _shuffle_ is a function that returns a shuffled version of $X$.


Let's transform the generic version into the simple linear regression optimization procedure by plugging in the actual partial derivatives. In order to do that, we need the partial derivatives of $J$ with respect to $\beta_0$ and $\beta_1$

$$
\frac{\partial J}{\partial \beta_0} = 
\sum_{n=1}^N \frac{\partial J}{\partial \hat{y}_n} \frac{\partial \hat{y}_n}{\partial \beta_0} = 
-\frac{1}{N} \sum_{n=1}^N 2(y_n - \hat{y}_n) $$

$$
\frac{\partial J}{\partial \beta_1} = 
\sum_{n=1}^N \frac{\partial J}{\partial \hat{y}_n} \frac{\partial \hat{y}_n}{\partial \beta_1} = 
-\frac{1}{N} \sum_{n=1}^N 2(y_n - \hat{y}_n) x_n $$

Now, let's adapt the pseudo code. Since SGD uses one observation at a time, the sum and the $\frac{1}{N}$ factor disappear

1. _For epoch in 1...epochs:_
    1. _X' = shuffle(X)_
    2. _For each $(x_n, y_n)$ in $X'$_:
        1. $\hat{y}_n = \beta_0 + \beta_1 x_n$
        2. $\beta_0 = \beta_0 - \alpha \frac{\partial J}{\partial \beta_0} = \beta_0 + 2 \alpha (y_n - \hat{y}_n)$
        3. $\beta_1 = \beta_1 - \alpha \frac{\partial J}{\partial \beta_1} = \beta_1 + 2 \alpha (y_n - \hat{y}_n)x_n$
        


In [None]:
import pandas as pd
from sklearn.utils import check_random_state


def sgd_for_simple_linear_regression(x, y, b0, b1, learning_rate, epochs, random_state):
    random_state = check_random_state(random_state)
    
    # This will make your life easier 
    # when shuffling the data.
    data = pd.concat(
        (x.to_frame(), y.to_frame()), 
        axis=1)
    data.columns = ['x', 'y']
    
    for epoch in range(epochs): 
        # Use pandas.sample with parameter 'n' equal to the 
        # number of rows of the dataset in order to create 
        # a shuffled version of the data.
        data_ = data.sample(n=data.shape[0], random_state=random_state)
        x_ = data_['x']
        y_ = data_['y']
        for n in range(x_.shape[0]): 
            # Take a single observation from the shuffled data.
            x__ = x_.iloc[n]
            y__ = y_.iloc[n]
            y_hat = b0 + b1 * x__
            # Partial derivatives of J for this observation.
            dJ_db0 = -2 * (y__ - y_hat)
            dJ_db1 = -2 * (y__ - y_hat) * x__
            b0 = b0 - learning_rate * dJ_db0
            b1 = b1 - learning_rate * dJ_db1
            
    return b0, b1

To make it clear how this works, let's use the following demo.

In [None]:
from utils import sgd_simple_lr_dataset_demo

In [None]:
sgd_simple_lr_dataset_demo()

On the left side, you have the usual plot for simple linear regression. On the right, you have $\beta_0$ and $\beta_1$ plotted into a 2D plane, in order for you to see how the behavior changes when you increase or decrease the learning rate. 

**Pros**
* For big datasets, if your model doesn't have _many_ parameters, the computational costs, both time and memory, will be very low.
* It is able to escape from "shallow" local minima due to the randomness of the procedure and to a higher irregularity of the path taken within the parameter space.

**Cons**
* If you have a smooth error curve/surface and not a big dataset, there is little to gain from using it.
* For smooth error curve/surfaces, if the dataset is not big, it can take more iterations to converge to the minimum than batch gradient descent would take.

#### **Batch Gradient Descent (BGD)**

Unlike SGD, BGD aggregates the gradients from **all observations in the dataset** and only then updates the parameters, once per epoch.


1. _For epoch in 1...epochs:_
    1. $\omega = \omega - \alpha \sum_{n=1}^N \frac{\partial f_{[x_n]}}{\partial \omega}$
    

where $f_{[x_n]}$ is the value of $f$ for observation $x_n$.


For a linear regression with $K$ features and cost function $J$, this becomes


1. _For epoch in 1...epochs:_
    1. $d_{y_n} = (y_n - \hat{y}_n)$ _for each observation $n$_
    2. $\beta_0 = \beta_0 - \alpha \frac{\partial J}{\partial \beta_0} = \beta_0 + \alpha \frac{1}{N} \sum_{n=1}^N 2 d_{y_n}$ 
    3. _For i in 1..K:_
        1. $\beta_i = \beta_i - \alpha \frac{\partial J}{\partial \beta_i} = \beta_i + \alpha \frac{1}{N} \sum_{n=1}^N 2 d_{y_n} x_{i_n}$ 
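
The batch procedure can be sketched with numpy as follows. This is only an illustration under some assumptions: the function name `bgd_for_linear_regression`, the default learning rate and number of epochs, and the toy dataset are made up for the example and are not what `bgd_simple_lr_dataset_demo` uses.

```python
import numpy as np


def bgd_for_linear_regression(X, y, learning_rate=0.1, epochs=5000):
    """Batch gradient descent for y_hat = b0 + X @ betas, minimizing J (the MSE)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_obs, n_features = X.shape
    b0 = 0.0
    betas = np.zeros(n_features)

    for _ in range(epochs):
        # Residuals for all observations, using the current parameters.
        d_y = y - (b0 + X @ betas)
        # Partial derivatives of J, aggregated over the whole dataset.
        dJ_db0 = -2.0 * d_y.mean()
        dJ_dbetas = -2.0 * (X.T @ d_y) / n_obs
        # One single update per epoch.
        b0 = b0 - learning_rate * dJ_db0
        betas = betas - learning_rate * dJ_dbetas

    return b0, betas


# Hypothetical noiseless data: y = 1 + 2 * x1 - 3 * x2, features in [0, 1].
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 1, size=(50, 2))
y_toy = 1 + 2 * X_toy[:, 0] - 3 * X_toy[:, 1]

b0_hat, betas_hat = bgd_for_linear_regression(X_toy, y_toy)
print(b0_hat, betas_hat)
```

Notice that, unlike in SGD, there is no shuffling and no inner loop over observations: the whole dataset is used in each update.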

In [None]:
from utils import bgd_simple_lr_dataset_demo

bgd_simple_lr_dataset_demo()

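**Pros**
* With a reasonable learning rate, you can get a steady descent towards the minimum, since each update uses the gradients of the whole dataset.
* For smooth error curves/surfaces, it can take fewer iterations to converge to the minimum than SGD would take.

**Cons**
* It is really expensive for big datasets: every single update requires computing the gradients for all observations.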

The higher the value of *learning_rate*, the faster $x$ converges to 0. TODO

Now, let's see the TODO

## Multiple Linear Regression

Most phenomena in our world depend on several factors. For example, house prices depend on things like (1) number of rooms, (2) distance to malls, (3) distance to parks, (4) how old the house is, etc. As such, it would be naive, at best, to create a univariate linear model to predict the house prices. So, let's expand our simple linear regression into *multiple* linear regression

$$\hat{y} = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \beta_4 \cdot x_4 + \beta_5 \cdot x_5 = \beta_0 + \sum_{i=1}^{5} \beta_i \cdot x_i$$

Pretty simple, huh? The more features you have, the more terms you include in the sum! But there is a **very important assumption** that this model makes: there is no multicollinearity in your data. _"What the hell does that even mean?"_, you might ask. This term comes from linear algebra and this is what it means: 

_If you have a feature $x_i$ whose values can be obtained through a **linear combination** of other features, then we have multicollinearity._

Let me give you some of the reasons this is a problem: 
1. When people use linear regressions, after the parameter estimation phase, they use the coefficients as a way to measure **how important a feature is**. When you use collinear features, the magnitude of the weights gets lowered for all features that are in that collinear relationship. That might be misleading because collinear features are, essentially, one feature.
2. Collinear features add no value to the model: they carry no information that isn't already provided by the other features.
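
One possible way to detect exact multicollinearity is to look at the rank of the feature matrix, since a feature that is a linear combination of others doesn't increase it. The sketch below uses made-up data and variable names (`features`, `collinear`) purely for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
features = rng.normal(size=(100, 3))

# Hypothetical collinear feature: a linear combination of the first two.
collinear = 0.5 * features[:, 0] + 0.5 * features[:, 1]
features_collinear = np.column_stack([features, collinear])

# The rank counts the linearly independent columns.
print(features.shape[1], np.linalg.matrix_rank(features))
print(features_collinear.shape[1], np.linalg.matrix_rank(features_collinear))
```

When the rank is smaller than the number of columns, at least one feature can be rebuilt from the others. Keep in mind that this only catches exact (or numerically exact) linear combinations, not features that are merely highly correlated.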


Also, you should always normalize your dataset into a unified scale (e.g. range [0; 1]). Here is why:
1. Depending on the optimization algorithm you use, if feature $f_1$ has a domain of [-4.1; 3] and feature $f_2$ has a domain of [-1.1; 100000], the gradient gets dominated by $f_2$, which can lead to problems in the convergence to the global minimum (i.e. you probably won't get accurate results for your predictions). There are optimization algorithms that avoid this issue but they still suffer from issue (2).
2. If two features use different ranges, it is hard to compare them in terms of feature importance. If a feature has a domain of [-1; 0.1] and another has a domain of [0; 1000], it doesn't make sense to look at their influence on the prediction through the same lens we used in the introduction to simple linear regression.
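
As an illustration, scaling a feature into the range [0; 1] can be done by subtracting its minimum and dividing by its range. The helper `min_max_scale` and the example values are assumptions made for this sketch, and the helper assumes that the feature is not constant (otherwise it would divide by zero).

```python
import numpy as np


def min_max_scale(column):
    """Scale a feature into [0, 1]: (x - min) / (max - min)."""
    column = np.asarray(column, dtype=float)
    return (column - column.min()) / (column.max() - column.min())


# Hypothetical features with very different domains.
f1 = np.array([-4.1, 0.0, 1.5, 3.0])
f2 = np.array([-1.1, 10.0, 5000.0, 100000.0])

f1_scaled = min_max_scale(f1)
f2_scaled = min_max_scale(f2)
print(f1_scaled)
print(f2_scaled)
```

After scaling, both features live in the same range, whatever their original domains were.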

The partial derivatives of $J$ for a model with $K$ features follow the same pattern as in simple linear regression

$$
\frac{\partial J}{\partial \beta_0} = 
\sum_{n=1}^N \frac{\partial J}{\partial \hat{y}_n} \frac{\partial \hat{y}_n}{\partial \beta_0} = 
-\frac{1}{N} \sum_{n=1}^N 2(y_n - \hat{y}_n) $$

$$
\frac{\partial J}{\partial \beta_i} = 
\sum_{n=1}^N \frac{\partial J}{\partial \hat{y}_n} \frac{\partial \hat{y}_n}{\partial \beta_i} = 
-\frac{1}{N} \sum_{n=1}^N 2(y_n - \hat{y}_n) x_{i_n}, \quad i = 1, \dots, K $$

The iterative optimization procedure for multiple linear regression is the same as simple linear regression.
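
If you want to convince yourself that these partial derivatives are right, one option is to compare them with a numerical approximation (finite differences). The sketch below does that on random data. The functions `cost` and `gradients` and all the values used are assumptions made for illustration.

```python
import numpy as np


def cost(b0, betas, X, y):
    """J = (1/N) * sum((y_n - y_hat_n)^2)."""
    return np.mean((y - (b0 + X @ betas)) ** 2)


def gradients(b0, betas, X, y):
    """Analytical partial derivatives of J with respect to b0 and each beta_i."""
    d_y = y - (b0 + X @ betas)
    return -2.0 * d_y.mean(), -2.0 * (X.T @ d_y) / len(y)


rng = np.random.RandomState(1)
X_chk = rng.normal(size=(20, 3))
y_chk = rng.normal(size=20)
b0_chk, betas_chk = 0.3, np.array([0.1, -0.2, 0.5])

dJ_db0, dJ_dbetas = gradients(b0_chk, betas_chk, X_chk, y_chk)

# Central finite differences: nudge one parameter at a time.
eps = 1e-6
step = np.array([eps, 0.0, 0.0])
num_db0 = (cost(b0_chk + eps, betas_chk, X_chk, y_chk)
           - cost(b0_chk - eps, betas_chk, X_chk, y_chk)) / (2 * eps)
num_db1 = (cost(b0_chk, betas_chk + step, X_chk, y_chk)
           - cost(b0_chk, betas_chk - step, X_chk, y_chk)) / (2 * eps)

print(dJ_db0, num_db0)
print(dJ_dbetas[0], num_db1)
```

The analytical and numerical values should agree up to a very small rounding error.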

## Linear Regression Pros & Cons

**PROS**
* Really easy to understand
* Fast optimization
* Extensions available to deal with: 
 * small data
 * data sparsity
 * outliers

**CONS**
* Sensitive to outliers
* Assumes that there is no multicollinearity
* Feature scaling is required
* Monotonicity assumption: for the model, the relation between each feature and the output is always either increasing or decreasing
* Categorical encoding: this might get tricky when the number of unique values is big and part of those values have few occurrences.


#### Notes

At this point, if you already knew linear regression in detail before the academy, you might be wondering: *"Where is the error component in the linear regression formula?"*. The reason is quite simple: since we wanted you to approach this subject TODO

Also, we didn't include all assumptions made by the linear regression model. For a hands-on approach to the assumptions, check this [blog post by Selva Prabhakaran](http://r-statistics.co/Assumptions-of-Linear-Regression.html).

## Using Scikit Learn to perform regression

After learning the basics about linear regression and how to estimate, iteratively, the best parameters for the model, it is time to learn how to use linear regression with Scikit Learn. 

[Scikit Learn][sklearn] is an industry standard for data science and machine learning and we will be using it extensively throughout the academy. Scikit Learn has two implementations of linear regression:
* [*sklearn.linear_model.SGDRegressor*][SGDRegressor]: uses stochastic gradient descent to estimate the intercept and coefficients. This class also allows more advanced forms of linear regression that are out of scope for the moment.
* [*sklearn.linear_model.LinearRegression*][LinearRegression]: uses a closed form least squares solution, equivalent to solving the normal equations, to estimate the best intercept and coefficients. Being a closed form solution means that you know exactly the number of steps and the guarantees about the solution. If you want to know more about this, [read this blog post][normal_eq].

[SGDRegressor]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor
[LinearRegression]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
[sklearn]: http://scikit-learn.org
[normal_eq]: https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression
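
To give an idea of what the closed form solution looks like, here is a minimal numpy sketch that solves the normal equations directly on made-up data. It is not how *LinearRegression* is implemented internally, and the variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical noiseless data: y = 4 + 1.5 * x1 - 2 * x2.
rng = np.random.RandomState(0)
X_ne = rng.normal(size=(50, 2))
y_ne = 4 + 1.5 * X_ne[:, 0] - 2 * X_ne[:, 1]

# Add a column of ones so that the intercept is estimated as well.
A = np.column_stack([np.ones(len(X_ne)), X_ne])

# Normal equations: (A^T A) beta = A^T y
beta_hat = np.linalg.solve(A.T @ A, A.T @ y_ne)
print(beta_hat)
```

There are no epochs and no learning rate here: the parameters come out of a single linear system.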

First, let's load the [Boston housing price dataset][boston_kaggle]

[boston_kaggle]: https://www.kaggle.com/c/boston-housing

In [None]:
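import numpy as np
import pandas as pd
# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run.
from sklearn.datasets import load_boston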

data = load_boston()
print(data['DESCR'])

x = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.Series(data['target'], name='medv')

pd.concat((x, y), axis=1).head(5)

Let's experiment with the closed form implementation first, _LinearRegression_

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(x, y)

print('\nTargets for the first 5 rows: \n', y.head(5).values)
print('\nPredictions for the first 5 rows: \n', lr.predict(x.head(5)))
print('\nTotal Score\n', lr.score(x, y))

We got an R² score of ~0.74 (i.e. ~74%), which might be adequate for a first try. R² is one of the most well known metrics to evaluate regression models. We will dive into it in SLU12.
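
Until then, this short sketch shows the idea behind the number: R² compares the squared errors of the model with the squared errors you would get by always predicting the mean of the target. The helper `r2` and the toy values are assumptions made for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """R² = 1 - (sum of squared residuals) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


y_demo = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2(y_demo, y_demo)                     # perfect predictions
r2_mean = r2(y_demo, np.full(4, y_demo.mean()))     # always predicting the mean
print(r2_perfect, r2_mean)
```

So, the closer the score is to 1, the better the predictions are when compared with that naive baseline.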

Modelling is not only about getting the most accurate model ever. If you get a big R² score for the wrong reasons (e.g. target leaks, too many useless variables), that model is kind of...useless. As such, let's look into how each feature contributes to the prediction

In [None]:
a = pd.Series(lr.coef_, index=x.columns, 
              name='Features Coefficients (sorted by magnitude)')
index = a.abs().sort_values(ascending=False).index
a = a.loc[index]
a

_NOX_, according to the dataset documentation, refers to _"nitric oxides concentration (parts per 10 million)"_. The coefficient for _NOX_ is WAAAAAY BIGGER than the ones in the other features. Does it mean that (1) air pollution is a BIIIG problem in Boston, (2) people that buy houses in Boston REALLY REALLY REALLY HATE air pollution

![pollution_level_chinese](http://weknowmemes.com/generator/uploads/generated/g136362126738785004.jpg)

or does it mean that something was wrong in our approach? 

First of all, let's check some statistics about our features

In [None]:
x.describe()

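It seems that the scales of the features are *way* different from one another. For example, the domain of _CRIM_ is [0.006320; 88.976200] while _TAX_ is in [187; 711]. This means that, in the context of linear regression, the **coefficients are not comparable**: each coefficient is expressed in units of the target per unit of its feature, so a feature with a small range needs a much bigger coefficient to move the prediction as much as a feature with a large range. Fortunately, we have a preprocessed version of this dataset. Let's use it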

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('data/boston (scaled).csv')

x_ = data.drop(['MEDV'], axis=1)
y = data['MEDV']

lr = LinearRegression()

lr.fit(x_, y)

print('\nTargets for the first 5 rows: \n', y.head(5).values)
print('\nPredictions for the first 5 rows: \n', lr.predict(x_.head(5)))
print('\nTotal Score\n', lr.score(x_, y))

After scaling all features into the same scale, we can now compare the importance of each feature

In [None]:
a = pd.Series(lr.coef_, index=x_.columns, 
              name='Features Coefficients (sorted by magnitude)')
index = a.abs().sort_values(ascending=False).index
a = a.loc[index]
a


Also, we can normalize the coefficients in order to see the relative weight of each feature

In [None]:
a.abs() / a.abs().sum()

It seems that over 50% of the relative feature strength is concentrated in: 
* _LSTAT_ (decreases price): % lower status of the population
* _DIS_ (decreases price): weighted distances to five Boston employment centres
* *RM* (increases price): average number of rooms per dwelling
* _RAD_ (increases price): index of accessibility to radial highways

Now, time to use SGDRegressor. As previously stated, this class allows fine tuning regarding learning rate, weights constraints, extensions of gradient descent, etc. We will use the configuration that allows the most similar behavior to the one described for stochastic gradient descent.

In [None]:
from sklearn.linear_model import SGDRegressor

learning_rate = 0.001
epochs = 100

lr = SGDRegressor(random_state=10, 
                  penalty=None, 
                  shuffle=True, 
                  learning_rate='constant', 
                  eta0=learning_rate, 
                  max_iter=epochs)

lr.fit(x, y)

print('\nTargets for the first 5 rows: \n', y.head(5).values)
print('\nPredictions for the first 5 rows: \n', lr.predict(x.head(5)))
print('\nTotal Score\n', lr.score(x, y))

WTF?! We got prediction overshootings and an AWFUL R² score! Is this related to feature scaling? Let's check if that is the case.

In [None]:
learning_rate = 0.001
epochs = 100

lr = SGDRegressor(random_state=10, 
                  penalty=None, 
                  shuffle=True, 
                  learning_rate='constant', 
                  eta0=learning_rate, 
                  max_iter=epochs)

lr.fit(x_, y)

print('\nTargets for the first 5 rows: \n', y.head(5).values)
print('\nPredictions for the first 5 rows: \n', lr.predict(x_.head(5)))
print('\nTotal Score\n', lr.score(x_, y))

I guess we have our answer. Unlike for _LinearRegression_, **having all features in the same scale is not optional** for _SGDRegressor_.
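
If you don't have a preprocessed version of your dataset, one option is to put a scaler and the regressor together in a scikit-learn pipeline. The sketch below is only an illustration: the data is made up, it uses *StandardScaler* (zero mean and unit variance) instead of a [0; 1] range, and it keeps the default *SGDRegressor* settings rather than the configuration used above.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features with very different scales.
rng = np.random.RandomState(0)
X_raw = np.column_stack([
    rng.uniform(-4, 3, size=200),
    rng.uniform(0, 100000, size=200),
])
y_raw = (1 + 2 * X_raw[:, 0] + 0.0001 * X_raw[:, 1]
         + rng.normal(scale=0.1, size=200))

# The scaler is fitted and applied before the data reaches the regressor.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=10))
scaled_sgd.fit(X_raw, y_raw)

score_scaled = scaled_sgd.score(X_raw, y_raw)
print(score_scaled)
```

The advantage of the pipeline is that the same scaling gets applied automatically every time you call `predict` or `score`.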

In [None]:
a = pd.Series(lr.coef_, index=x_.columns, 
              name='Features Coefficients (sorted by magnitude)')
index = a.abs().sort_values(ascending=False).index
a = a.loc[index]
a

Finally, let's explore the effect of **multicollinearity**. First, let's create a dataset

In [None]:
from sklearn.datasets import make_regression

x, y = make_regression(n_samples=100, n_features=4, n_informative=4, n_targets=1, random_state=10, noise=20)
x = pd.DataFrame(x)
y = pd.Series(y)

Now, let's make a copy of $X$ and include multicollinearity in the new copy

In [None]:
x_ = x.copy()

x_[4] = 0.3 * x_[0] + 0.3 * x_[1] + 0.4 * x_[2]

Now, let's fit a linear regression with the original dataset

In [None]:
lr = LinearRegression()
lr.fit(x, y)
print('R² score: ', lr.score(x, y))
print('Coefficients: ', lr.coef_)

And one with the copy of $X$

In [None]:
lr = LinearRegression()
lr.fit(x_, y)
print('R² score: ', lr.score(x_, y))
print('Coefficients: ', lr.coef_)

**INTERESTING!!!** It seems that part of the magnitude of the coefficients of features *f0*, *f1* and *f2* was "transferred" to *f4*! But the R² score is the same! 
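
The reason is that a collinear feature doesn't allow the model to produce any prediction that it couldn't already produce without it, it only adds more ways of writing the same prediction. The sketch below illustrates this with numpy's least squares solver on made-up data. For simplicity it has no intercept, and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.RandomState(10)
X_base = rng.normal(size=(100, 3))
y_base = X_base @ np.array([3.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=100)

# Hypothetical collinear feature, built like the one used above.
extra = 0.3 * X_base[:, 0] + 0.3 * X_base[:, 1] + 0.4 * X_base[:, 2]
X_extra = np.column_stack([X_base, extra])

coef_base, *_ = np.linalg.lstsq(X_base, y_base, rcond=None)
coef_extra, *_ = np.linalg.lstsq(X_extra, y_base, rcond=None)

print(coef_base)
print(coef_extra)

# Different coefficients, same predictions.
same_predictions = np.allclose(X_base @ coef_base, X_extra @ coef_extra)
print(same_predictions)
```

Since the predictions are the same, any metric computed from them, like R², is the same as well. Only the coefficients change, and that is why they become misleading.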

Do we get the same type of issue with SGDRegressor?

In [None]:
learning_rate = 0.001
epochs = 100

lr = SGDRegressor(random_state=10, 
                  penalty=None, 
                  shuffle=True, 
                  learning_rate='constant', 
                  eta0=learning_rate, 
                  max_iter=epochs)

lr.fit(x, y)
print('R² score: ', lr.score(x, y))
print('Coefficients: ', lr.coef_)

In [None]:
learning_rate = 0.001
epochs = 100

lr = SGDRegressor(random_state=10, 
                  penalty=None, 
                  shuffle=True, 
                  learning_rate='constant', 
                  eta0=learning_rate, 
                  max_iter=epochs)

lr.fit(x_, y)
print('R² score: ', lr.score(x_, y))
print('Coefficients: ', lr.coef_)

So, we got the same type of results, except for a small oscillation in the R² score (due to the influence of the stochastic process).

There are several steps we didn't include: 
* Exclude the feature _CHAS_ from the scaler. _CHAS_ is a dummy feature (i.e. the result of categorical feature encoding) and isn't a continuous feature per se.
* Perform correlation analysis in order to avoid including 2, or more, features that are highly correlated. When using highly correlated features in a linear model, you might be violating the assumption of no multicollinearity.
* We didn't remove outliers. This is a problem for models like linear regression due to their sensitivity to outliers. Fortunately, there are implementations for robust linear regression within scikit learn ([RANSAC][RANSAC], [Theil-Sen][Theil-Sen] and [Huber][Huber]).
* We didn't perform correlation analysis between each input feature and the target. 

We didn't include those steps but, with all you have learned so far in the academy, you are able to perform those by yourself. :)
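
As a starting point for the correlation analysis between features, here is a minimal sketch using pandas. The dataframe, the column names and the threshold of 0.9 are all assumptions made for illustration. Pick a threshold that makes sense for your problem.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_demo = pd.DataFrame({
    'a': rng.normal(size=200),
    'b': rng.normal(size=200),
})
# Hypothetical feature that is almost a scaled copy of 'a'.
df_demo['c'] = 2 * df_demo['a'] + rng.normal(scale=0.1, size=200)

# Absolute Pearson correlation between every pair of features.
corr = df_demo.corr().abs()

# Keep only the pairs above the threshold, ignoring the diagonal
# and listing each pair once.
threshold = 0.9
high = [
    (col_1, col_2)
    for i, col_1 in enumerate(corr.columns)
    for col_2 in corr.columns[i + 1:]
    if corr.loc[col_1, col_2] > threshold
]
print(high)
```

For each pair that shows up, you would typically keep just one of the two features.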


[RANSAC]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RANSACRegressor.html#sklearn.linear_model.RANSACRegressor
[Theil-Sen]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TheilSenRegressor.html#sklearn.linear_model.TheilSenRegressor
[Huber]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html#sklearn.linear_model.HuberRegressor