# SLU9 - Regression: Learning notebook

In this notebook we will cover the following:
- Simple Linear Regression
- Gradient Descent
- The impact of learning rate
- Multiple Linear Regression
- Using scikit-learn linear regression
- (Extra) Computational graphs: a conceptual framework for automated differentiation
- (Extra) What is that *random_state* thing?


## What is regression

Regression is a modeling task whose objective is to create a (linear or non-linear) map between the **independent variables** (i.e. the feature columns in your pandas dataframe) and a set of **continuous dependent variables** (i.e. the variables you want to predict) by estimating a set of **unknown parameters**.

Examples of regression tasks:
* predicting house prices (example range: [100k\$; 500k\$]);
* predicting the rating that a user would assign to a movie (example range: [1 star; 7 stars]);
* predicting emotional descriptors for a song;
* predicting the trajectory of a fighter jet.

Nowadays, there are *a lot* of algorithms to solve this task, but we will focus on one of the easiest to understand: **linear regression**. It is still one of the most used regression methods in the world because it is easy to (1) interpret the model, (2) implement it and (3) implement extensions that deal with datasets with few data points, noise and outliers.

First, let's explore how **simple linear regression** works.

## Simple Linear regression

This model is a special case of linear regression where you have a single feature. The model is, simply, a line equation

$$\hat{y} = \beta_0 + \beta_1 \cdot x$$

* $\hat{y}$ is the value predicted by the model; 
* $x$ is the input feature; 
* $\beta_0$ is the y-axis value where $x=0$, usually called the *intercept*; 
* $\beta_1$ tells you how much $\hat{y}$ changes when $x$ increases by one unit, usually called the *coefficient* (or *slope*).

Let's see what each parameter does in this model:

In [1]:
from ipywidgets import interact, interactive, fixed, interact_manual
from ipywidgets import FloatSlider, Dropdown
import ipywidgets as widgets

In [2]:
import matplotlib.pyplot as plt

import numpy as np

In [3]:
def plot_simple_regression(b0=0, b1=1, xlim=(-5, 5), ylim=(-5, 5)):
    x = np.linspace(-10, 10, 1000)
    y = b0 + b1 * x
    
    plt.xlim(xlim)
    plt.ylim(ylim)
    plt.plot(x, y)
    plt.plot([0, 0], ylim, 'g-', 
             xlim, [0, 0], 'g-', linewidth=0.4)

In [5]:
interact(plot_simple_regression, 
         b0=FloatSlider(min=-10, max=10, step=0.01, value=0), 
         b1=FloatSlider(min=-10, max=10, step=0.01, value=1), 
         xlim=fixed((-5, 5)), 
         ylim=fixed((-5, 5)));


The green lines represent the x and y axes, while the blue line is $\hat{y}$ for each value of $x$. As you can see for yourself, if you decrease $\beta_0$, the point where the line crosses the y-axis moves down, and if you change $\beta_1$, the slope of the line changes.

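Now, let's try to manually change $\beta_0$ and $\beta_1$ in order to fit a small dataset. In order to judge how good the fit is, the function below also reports the mean squared error (MSE) between $y$ and $\hat{y}$: the lower, the better.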

In [9]:
import pandas as pd

from sklearn.datasets import make_regression

from sklearn.metrics import mean_squared_error

In [10]:
def plot_simple_regression_with_dataset(x, y, b0=0, b1=1, xlim=(-5, 5), ylim=(-5, 5)):
    plot_simple_regression(b0, b1, xlim, ylim)
    plt.scatter(x, y)
    
    y_hat = b0 + b1 * x
    
    return "Mean Squared Error (MSE): {}".format(mean_squared_error(y, y_hat))
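
For intuition, the MSE reported by the function above is just the average of the squared differences between the targets and the predictions. A minimal sketch with made-up numbers (the arrays `y_true` and `y_pred` are illustrative and not taken from the notebook's dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

# MSE by hand: average of the squared residuals
mse_by_hand = np.mean((y_true - y_pred) ** 2)

# Same quantity computed by scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_by_hand, mse_sklearn)
```

Both values should agree, which is a handy sanity check when you implement a metric yourself.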

In [11]:
x, y = make_regression(n_features=1, n_samples=100, noise=30.5, random_state=10, bias=200)
x = x[:, 0]
y /= 100
y *= 2.0

interact(plot_simple_regression_with_dataset, 
         b0=FloatSlider(min=-10, max=10, step=0.01, value=-1), 
         b1=FloatSlider(min=-10, max=10, step=0.01, value=-1), 
         x=fixed(x), 
         y=fixed(y), 
         xlim=fixed((-5, 5)), 
         ylim=fixed((-3, 8)));


Ok, doing this manually sucks. So, humans developed optimization algorithms to allow machines to adjust $\beta_0$ and $\beta_1$ according to some data set. There are, at least, 3 categories of optimization procedures to do it:

1. iterative methods using gradients;
2. closed form solution through normal equations;
3. evolutionary methods like genetic algorithms or particle swarm.

Methods in category 3 are overkill here: they don't guarantee the optimal set of parameters for the model, so they are mostly a curiosity. Methods 1 and 2 are the ones that we actually use to optimize the parameters. Let's look into them. TODO


TODO: instead of explaining how the normal equations are derived to optimize the parameters, point the students here:
* https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression
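
As a quick illustration of option 2, the closed-form solution can be sketched with NumPy. This is a minimal sketch under stated assumptions: the data is synthetic and noise-free, and the names `x_demo`, `y_demo`, `X_design` and `beta` are illustrative. It solves the normal equations $X^T X \beta = X^T y$, where $X$ has a column of ones for the intercept.

```python
import numpy as np

# Synthetic, noise-free data generated from y = 2 + 3x (illustrative values)
x_demo = np.linspace(-5, 5, 50)
y_demo = 2.0 + 3.0 * x_demo

# Design matrix: a column of ones (intercept) and the feature itself
X_design = np.column_stack((np.ones_like(x_demo), x_demo))

# Solve the normal equations (X^T X) beta = X^T y
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)

print(beta)  # first entry is the intercept, second is the coefficient
```

Because the data has no noise, the solution should recover the intercept and coefficient used to generate it.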

## Gradient Descent

A lot of modern machine learning is based on setting up a really nice function and adjusting its internal parameters according to the data you have. In order to adapt the parameters of a function, you need some kind of optimization procedure. If your function is differentiable (TODO) and convex (TODO), you can find the best set of parameters using something called **gradient descent**. This refers to a whole family of optimization methods based on partial derivatives of functions. But, before going into what gradient descent actually is, we need to talk about **derivatives**!

TODO: give the analogy of walking down a mountain

### Derivatives

Derivatives are one of the main topics (if not the main topic) of a subfield of calculus called *differential calculus*. They allow you to know **how a small change in the input changes the output** of a differentiable function. The derivative of a function can be represented as another function that shows TODO
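
To make the idea concrete, a derivative can be approximated numerically by nudging the input a little and watching how the output changes. A minimal sketch, assuming $f(x) = x^2$ (whose exact derivative is $2x$); the helper `numerical_derivative` is hypothetical and written just for this illustration:

```python
def square(x):
    return x ** 2

def numerical_derivative(func, x, h=1e-6):
    # Central difference: (f(x + h) - f(x - h)) / (2h)
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = -6.0
approx = numerical_derivative(square, x0)
exact = 2 * x0

print(approx, exact)
```

The two numbers should be nearly identical: at $x = -6$, a small increase in $x$ decreases $x^2$ at a rate of about 12 per unit.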


### Back to Gradient Descent

TODO

For each observation in your dataset, update the parameters of your model with 

$$\omega_{i+1} = \omega_{i} - \alpha \frac{\partial f}{\partial \omega_{i}}$$

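where $\omega_{i}$ are the current values of your parameters, $\alpha$ is the *learning rate* (the step size) and $\frac{\partial f}{\partial \omega_{i}}$ is the gradient of the function $f$ being minimized. This optimization method can be used when TODO.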

Let's take a look at an example: imagine that the function we want to minimize is $f(x) = x^2$ and the current value of $x$ is -6. This means that each time you press the button *"Run Interact"*, $x$ will be updated using the following formula:

$$x_{i+1} = x_{i} - \alpha \frac{\partial f}{\partial x_{i}} = x_{i} - \alpha \cdot 2 x_{i}$$

In [12]:
from utils import run_sgd_step

o = {'curr_x': -6.0}

interact_manual(run_sgd_step, 
                learning_rate=FloatSlider(min=0.01, max=2.0, step=0.01, value=0.01), 
                o=fixed(o), 
                name=fixed('convex-1'), 
                range_def=fixed([-10, 10, 100000]));


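Play with the *"learning_rate"* slider and try bigger values. Each update multiplies $x$ by $(1 - 2\alpha)$, so if you set the learning rate to 1 or above, we won't reach the global minimum: with $\alpha = 1$ the value of $x$ just flips sign forever, and with $\alpha > 1$ it moves further away on every step. So, for $f(x)=x^2$, if we keep $\alpha < 1$, we will, sooner or later, get to the global minimum.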
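
The effect of the learning rate on $f(x) = x^2$ can also be checked without the widget. A minimal sketch, starting from $x = -6$ (the function `gradient_descent_x_squared` is a hypothetical helper, not part of `utils`):

```python
def gradient_descent_x_squared(x_start, learning_rate, n_steps):
    # Minimize f(x) = x^2, whose derivative is 2x
    x_curr = x_start
    for _ in range(n_steps):
        x_curr = x_curr - learning_rate * 2 * x_curr
    return x_curr

for alpha in (0.1, 0.9, 1.0, 1.1):
    x_final = gradient_descent_x_squared(-6.0, alpha, n_steps=50)
    print(f"alpha={alpha}: x after 50 steps = {x_final}")
```

Learning rates below 1 shrink $x$ towards 0, a learning rate of exactly 1 leaves its magnitude unchanged, and values above 1 make it grow on every step.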

Now, let's make things a "little" bit harder. Let $f(x) = x^2 + \left|15 x\right| \cdot \cos(x)$ and the current value of $x$ be -6. For this new function $f(x)$, the next value of $x$ will be computed as 

$$x_{i+1} = 
x_{i} - \alpha \frac{\partial f}{\partial x_{i}} = 
x_{i} - \alpha \cdot \left(2 x_i + \cos(x_i) \cdot \frac{15 x_i}{\left| x_i \right|} - \left|15 x_i\right| \cdot \sin(x_i)\right)$$

![sad_hamster](https://media.giphy.com/media/8UHwuM947LUjyyYh1o/giphy.gif "I thought this was a bootcamp")

Yes, I know it looks like an awfully complicated formula but bear with me. You will see why we used it in just a moment. So, let's use SGD with this function:

In [13]:
o = {'curr_x': -6.0}

interact_manual(run_sgd_step, 
                learning_rate=FloatSlider(min=0.01, max=2.0, step=0.01, value=0.01), 
                o=fixed(o), 
                name=fixed('non-convex-1'), 
                range_def=fixed([-20, 20, 100000]));


So, if $x$ starts at -6, we won't be able to get to a global minimum using gradient descent with a small learning rate: we end up in a nearby local minimum instead. You could increase and/or decrease the learning rate in order to try to reach one of the global minima. But gradient descent TODO
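
The same experiment can be reproduced in plain NumPy. This is a rough sketch rather than the code behind `run_sgd_step` (the names `f_nc`, `df_nc` and `x_curr` are illustrative): it runs gradient descent from $x = -6$ with a small learning rate and then compares the result with the lowest value of the function found on a grid over the plotted range.

```python
import numpy as np

def f_nc(x):
    return x ** 2 + np.abs(15 * x) * np.cos(x)

def df_nc(x):
    # Derivative of f_nc; np.sign(x) plays the role of x / |x|
    return 2 * x + 15 * np.sign(x) * np.cos(x) - np.abs(15 * x) * np.sin(x)

x_curr = -6.0
learning_rate = 0.01
for _ in range(200):
    x_curr = x_curr - learning_rate * df_nc(x_curr)

print("gradient descent ended at", x_curr, "with f =", f_nc(x_curr))

# Scan the plotted range on a grid to locate the lowest value of f_nc
grid = np.linspace(-20, 20, 100001)
x_best = grid[np.argmin(f_nc(grid))]
print("lowest value on the grid at", x_best, "with f =", f_nc(x_best))
```

Gradient descent settles in the valley closest to the starting point (around $x \approx -3.3$), while the grid search finds a lower value further away from the origin.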

Now, let's get back to simple linear regression. As we stated previously, we want to adjust $\beta_0$ and $\beta_1$ so that the regression model can make predictions for new values of $x$.

Instead of using the abstract $\omega$ parameter, let's use the linear regression symbols. Let $f = (y - \hat{y})^2$, i.e. the squared error for a single observation, with $\hat{y} = \beta_0 + \beta_1 \cdot x$. The update rules become

$$\beta_{0_{i+1}} = \beta_{0_{i}} - \alpha \frac{\partial f}{\partial \beta_{0_{i}}} = \beta_{0_{i}} + 2 \alpha (y - \hat{y})$$
$$\beta_{1_{i+1}} = \beta_{1_{i}} - \alpha \frac{\partial f}{\partial \beta_{1_{i}}} = \beta_{1_{i}} + 2 \alpha (y - \hat{y}) \cdot x$$

In [None]:
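def run_linear_regression_sgd_epoch(x, y, params, learning_rate): 
    # Pair each x with its y so that shuffling keeps them aligned
    data = np.column_stack((x, y))
    np.random.shuffle(data)
    
    x_shuffled = data[:, 0]
    y_shuffled = data[:, 1]
    
    # Read the current parameters once, before looping over the observations
    b0 = params['b0']
    b1 = params['b1']
    
    for m in range(x_shuffled.shape[0]):
        x_ = x_shuffled[m]
        y_ = y_shuffled[m]
        
        y_hat = b0 + b1 * x_

        # Gradients of the squared error (y_ - y_hat)^2 for this observation
        d_mse_d_b1 = - 2 * (y_ - y_hat) * x_
        d_mse_d_b0 = - 2 * (y_ - y_hat)

        # One update per observation, carried over to the next iteration
        b0 = b0 - learning_rate * d_mse_d_b0
        b1 = b1 - learning_rate * d_mse_d_b1
    
    params['b0'] = b0
    params['b1'] = b1
    
    return params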

In [None]:
def run_sgd_for_1d(x, y, params, learning_rate):
    run_linear_regression_sgd_epoch(x, y, params, learning_rate)
    
    x_ = np.linspace(-4, 4)
    b0 = params['b0']
    b1 = params['b1']
    y_hat = b0 + b1 * x_
    
    plt.ylim([-4, 8])
    plt.xlim([-6, 6])
    plt.plot(x_, y_hat)
    plt.scatter(x, y, c='orange')

In [None]:
params = {
    'b0': -1, 
    'b1': -5
}

run_sgd_for_1d(x, y, params, 0.05)
params

TODO: convex functions


TODO: non-convex
With non-convex functions, using gradient descent 

TODO: advice

### Batch Gradient Descent

TODO

In [None]:
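def run_linear_regression_bgd_step(x, y, params, learning_rate): 
    b0 = params['b0']
    b1 = params['b1']

    # Number of observations and predictions for the whole dataset
    m = x.shape[0]
    y_hat = b0 + b1 * x

    # Gradients of the mean squared error, averaged over all observations
    d_mse_d_b1 = ((1 / m) * -2 * (y - y_hat) * x).sum()
    d_mse_d_b0 = ((1 / m) * -2 * (y - y_hat)).sum()
    
    b0 = b0 - learning_rate * d_mse_d_b0
    b1 = b1 - learning_rate * d_mse_d_b1
    
    params['b0'] = b0
    params['b1'] = b1
    
    return params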
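
Unlike the per-observation updates used earlier, batch gradient descent computes the gradient over the whole dataset before each update. A self-contained sketch on synthetic data (the helper `bgd_step`, the data and the learning rate are illustrative assumptions, not the notebook's own function):

```python
import numpy as np

def bgd_step(x, y, b0, b1, learning_rate):
    # One batch gradient descent step: gradients averaged over ALL samples
    y_hat = b0 + b1 * x
    d_mse_d_b0 = -2 * np.mean(y - y_hat)
    d_mse_d_b1 = -2 * np.mean((y - y_hat) * x)
    return b0 - learning_rate * d_mse_d_b0, b1 - learning_rate * d_mse_d_b1

# Synthetic data from y = 1 + 2x plus a little noise (illustrative values)
rs_demo = np.random.RandomState(10)
x_demo = rs_demo.uniform(-3, 3, size=200)
y_demo = 1.0 + 2.0 * x_demo + rs_demo.normal(scale=0.1, size=200)

b0_demo, b1_demo = 0.0, 0.0
for _ in range(500):
    b0_demo, b1_demo = bgd_step(x_demo, y_demo, b0_demo, b1_demo, learning_rate=0.05)

print(b0_demo, b1_demo)
```

After enough steps, the estimates should land close to the intercept and coefficient used to generate the data.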

TODO: explain why linear regression is a linear model

## Multiple Linear Regression

Most phenomena in our world depend on several factors. For example, house prices depend on things like (1) number of rooms, (2) distance to malls, (3) distance to parks, (4) how old the house is, etc. As such, it would be naive to create a predictive linear model that uses a single feature. Multiple linear regression extends the model to several features. With 5 features, for example, the model is

$$\hat{y} = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \beta_4 \cdot x_4 + \beta_5 \cdot x_5$$

or, written more compactly,

$$\hat{y} = \beta_0 + \sum_{i=1}^{5} \beta_i \cdot x_i$$

For the $j$-th observation in the dataset, the prediction is

$$\hat{y}^{[j]} = \beta_0 + \sum_{i=1}^{5} \beta_i \cdot x_i^{[j]}$$
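
In code, the sum over features is a dot product, so the predictions for all observations can be computed at once. A minimal sketch with made-up parameters and data (`beta_0`, `beta` and `X_demo` are illustrative values, not fitted ones):

```python
import numpy as np

# Illustrative parameters for a model with 5 features
beta_0 = 1.0
beta = np.array([0.5, -1.0, 2.0, 0.0, 3.0])

# Two observations (rows), five features (columns)
X_demo = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
])

# y_hat[j] = beta_0 + sum_i beta_i * x_i[j], computed for all rows at once
y_hat_demo = beta_0 + X_demo @ beta

print(y_hat_demo)
```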

TODO: usually people interpret the coefficients as the **importance of the feature** within the model.
TODO: the interpretation of the intercept depends on what a feature value of 0 means. 

In [None]:
from sklearn.datasets import make_regression

In [None]:
x, y = make_regression(n_features=1, n_samples=500, noise=30.5, random_state=10)

In [None]:
import pandas as pd

In [None]:
d = pd.concat((pd.DataFrame(x), pd.DataFrame(y)), axis=1)
d.columns = ['x', 'y']
d.plot.scatter('x', 'y')

In [None]:
from sklearn.linear_model import SGDRegressor

In [None]:
lr = SGDRegressor(penalty=None, random_state=10)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(x, y)

In [None]:
lr.score(x, y)

In [None]:
lr.coef_

In [None]:
plt.scatter(x, y)
plt.plot(x, lr.fit(x, y).predict(x), color='blue', linewidth=3)

In [None]:
rs = np.random.RandomState(10)
x = rs.rand(5000, 1)
y = rs.rand(5000, 1)

In [None]:
plt.scatter(x, y)

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(x, y)

In [None]:
plt.scatter(x, y)
plt.plot(x, lr.fit(x, y).predict(x), color='blue', linewidth=3)

In [None]:
lr.score(x, y)

In [None]:
lr.coef_

In [None]:
lr.intercept_

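Hmmm... the coefficient is pretty much zero and the intercept is about 0.5. 

Our completely random data set was generated, for both variables, uniformly between 0 and 1. Since $x$ carries no information about $y$, the best the model can do is ignore $x$ (coefficient close to 0) and always predict the mean of $y$, which is about 0.5.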

## Using scikit-learn linear regression

scikit-learn is one of the main pieces of data science tech stacks throughout the world. It offers a wide range of algorithms to create models for regression, classification and unsupervised learning tasks, as well as methods for preprocessing and visualization. It also provides users with well-thought-out abstractions to chain all the transformers and models into a single pipeline. TODO

There are two implementations of the basic linear regression: 
* *sklearn.linear_model.SGDRegressor*: a multi-faceted class for regression tasks using linear models. The optimization procedure is stochastic gradient descent.
* *sklearn.linear_model.LinearRegression*: ordinary least squares linear regression, which solves for the parameters directly instead of iterating with gradient descent.


In [None]:
from sklearn.linear_model import LinearRegression

from sklearn.linear_model import SGDRegressor

In [None]:
lr1 = LinearRegression()

In [None]:
lr2 = SGDRegressor()
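
To see the two estimators side by side, here is a small sketch on synthetic data. The data, the variable names (`X_demo`, `y_demo`, `ols`, `sgd`) and the choice of default hyperparameters are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# Synthetic data from y = 3 + 2*x1 - 1*x2 plus small noise (illustrative values)
rs_demo = np.random.RandomState(10)
X_demo = rs_demo.normal(size=(500, 2))
y_demo = (3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
          + rs_demo.normal(scale=0.1, size=500))

# Same fit/predict interface, different optimization procedures
ols = LinearRegression().fit(X_demo, y_demo)
sgd = SGDRegressor(random_state=10).fit(X_demo, y_demo)

print("LinearRegression:", ols.intercept_, ols.coef_)
print("SGDRegressor:    ", sgd.intercept_, sgd.coef_)
```

Both models should end up with similar parameters here. `SGDRegressor` is iterative and applies a small L2 penalty by default, so its estimates are approximate and tend to be sensitive to feature scaling.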

### (EXTRA) What is that *random_state* thing?

You have probably noticed that SGDRegressor used a mysterious parameter: *random_state*. 

The parameter *random_state* is a very common one in scikit-learn and numpy. It sets the seed of the random number generator.

This is actually **very important** for you to know because, by controlling the *random_state* value, you make the entire process **reproducible** (i.e. every time we run your code, we get the same results).

You might be wondering _"why does scikit-learn need to generate random numbers?"_. Machine/statistical learning and data analysis depend *a lot* on random processes. A random process depends, as the name suggests, on randomness. These random processes are used in many things: sampling probability distributions, initializing the parameters vector of a linear regression or neural network, selecting feature values to be used in cuts on decision trees, selecting subsets of data for cross-validation, etc. So, again, random_state is **really important**.

Inside every piece of scikit-learn code that uses random number generators TODO

If you set the *random_state* parameter to an integer, you get the same result every time, unless there is a bug in the implementation.
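
As a small illustration of reproducibility, the sketch below splits the same array twice with the same integer seed. It uses `train_test_split` purely as an example of a scikit-learn function that shuffles data; the array `data_demo` and the seed value are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data_demo = np.arange(20)

# Same integer seed -> same shuffle -> identical splits
train_a, test_a = train_test_split(data_demo, test_size=0.25, random_state=10)
train_b, test_b = train_test_split(data_demo, test_size=0.25, random_state=10)

print(test_a)
print(test_b)
print(np.array_equal(test_a, test_b))
```

If you leave `random_state` unset, the two splits will usually differ from each other and from one run to the next.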

In [None]:
import numpy as np
from sklearn.utils import check_random_state

In [None]:
help(check_random_state)

In [None]:
check_random_state(None)

In [None]:
check_random_state(10)

In [None]:
rs1 = np.random.RandomState(10)

In [None]:
rs2 = check_random_state(np.random.RandomState(10))

In [None]:
rs1.rand(5)

In [None]:
rs2.rand(5)

In [None]:
np.random.seed(10)

In [None]:
np.random.rand(10)

## (EXTRA) Computational graphs: a conceptual framework for automated differentiation

Nowadays, many of the mainstream machine learning models are based on gradient descent in one way or another. Even if they aren't (explicitly), they are obtained through the minimization/maximization of a differentiable function (e.g. K-Means, Support Vector Machines, Probabilistic Graphical Models). 

Computational graphs aren't a theory that provides a better optimization method or a better model. They are "just" a reframing of how we think about and represent functions. TODO

MAIN PURPOSES OF CGs:
* Make it easy to visually debug TODO
* Unify several models, based on differentiable 
* TODO: shared parameters
* TODO: max/min pool
* A conceptual framework that allows efficient and parallelizable TODO


Computational graphs are, literally, the backbone of frameworks like PyTorch, Autograd, TensorFlow and Chainer.

In order to grasp the intuition behind Computational Graphs, let's start with some simple examples:

$$f(x, y) = x^2 + y + 2$$

$$f(x, y) = x^2 + x y$$

$$f(x, y, z) = \frac{xy + zxy}{2x + y^2}$$
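
To give a flavour of how a computational graph is used, the sketch below breaks the second example, $x^2 + x y$, into elementary nodes, evaluates them in a forward pass, and then applies the chain rule backwards to get the partial derivatives. This is a hand-written illustration with arbitrary input values, not how any particular framework implements it:

```python
# Inputs (illustrative values)
x_val, y_val = 3.0, 4.0

# Forward pass: evaluate each node of the graph in order
a = x_val * x_val      # node a = x^2
b = x_val * y_val      # node b = x * y
f_val = a + b          # output node f = a + b

# Backward pass: apply the chain rule from the output back to the inputs
df_da = 1.0            # f = a + b, so df/da = 1
df_db = 1.0            # and df/db = 1
df_dx = df_da * (2 * x_val) + df_db * y_val   # x feeds both a and b
df_dy = df_db * x_val                         # y feeds only b

print(f_val, df_dx, df_dy)
```

The results match the analytical derivatives $\frac{\partial f}{\partial x} = 2x + y$ and $\frac{\partial f}{\partial y} = x$.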

If you want to learn more about CGs, we highly recommend reading this [introduction to computational graphs and backpropagation][colah] by Christopher Olah.


[colah]: http://colah.github.io/posts/2015-08-Backprop/