# Lab 02 - Simple Linear Regression

Regression refers to any learning problem that aims to describe the relation between a set of explanatory 
variables (i.e. features) and a continuous response (or a set of responses). Our dataset is therefore of the form:

$$S=\left\{\left(\mathbf{x}_i, y_i\right)\right\}^m_{i=1} \quad s.t. \quad \mathbf{x}_i\in\mathbb{R}^d,\,\,y_i\in\mathbb{R}$$

In the case of Linear Regression the relation learned is a linear one. That is, we search for a linear function to map 
$\mathcal{X}$ to $\mathcal{Y}$. So the hypothesis class of linear regression is:

$$ \mathcal{H}_{reg} = \left\{h:h\left(x_1,\ldots,x_d\right)=w_0 + \sum_{i=1}^d w_i x_i\right\} $$

Note that the linear function is linear in the parameters $w_0,w_1,\ldots,w_d$. Let us simulate a dataset fitting the case of a simple linear regression: 

$$ y_i = w_1 x_i + w_0 \quad i=1,\ldots,m $$

So in this case each hypothesis in the class $\mathcal{H}_{reg}$ is defined by two parameters $w_0,w_1$ - the intercept and slope of
the line. Suppose the data is generated from the following line: $Y=2X+1$. So $w_0=1$ and $w_1=2$. Let us draw and plot 
samples from this function.

In [33]:
import sys
sys.path.append("../")
from utils import *

## Linear Regression

In [34]:
w0, w1 = 1, 2    

x = np.linspace(0, 100, 10)
y = w1*x + w0

In [35]:
fig = go.Figure([go.Scatter(x=x, y=y, name="Real Model", showlegend=True,
                                 marker=dict(color="black", opacity=.7), line=dict(color="black", dash="dash", width=1))], 
          layout=go.Layout(title=r"$\text{(1) Simulated Data}$",
                           xaxis={"title": "x - Explanatory Variable"},
                           yaxis={"title": "y - Response"},
                           height=400))
fig.show()

Using this sample as a **training set**, let us compute the Ordinary Least Squares (OLS) estimators $\widehat{w}_0,\widehat{w}_1$ of the model. Then, given a new sample $x_j$, we can predict its response $\widehat{y}_j$:

$$ \widehat{y}_j = \widehat{w}_1 x_j + \widehat{w}_0 $$

Over the dataset above, what would you expect the output to be?

In [36]:
from sklearn.linear_model import LinearRegression
noiseless_model = LinearRegression()

noiseless_model.fit(x.reshape((-1,1)), y)
print("Estimated intercept:", noiseless_model.intercept_)
print("Estimated coefficient:", noiseless_model.coef_[0])


Estimated intercept: 0.9999999999999716
Estimated coefficient: 2.0000000000000004
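
For simple linear regression the OLS estimators have a well-known closed form: $\widehat{w}_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}$ and $\widehat{w}_0 = \bar{y} - \widehat{w}_1\bar{x}$. The sketch below evaluates these expressions on the same noise-less sample, as a sanity check of the values printed above. It is only an illustration of the formulas and not necessarily how `LinearRegression` computes its solution internally; the names `w0_hat` and `w1_hat` are introduced here for illustration.

```python
import numpy as np

# Same noise-less sample as above: 10 evenly spaced points on Y = 2X + 1
w0, w1 = 1, 2
x = np.linspace(0, 100, 10)
y = w1 * x + w0

# Closed-form OLS estimators for simple linear regression
x_mean, y_mean = x.mean(), y.mean()
w1_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
w0_hat = y_mean - w1_hat * x_mean

print("Closed-form intercept:", w0_hat)
print("Closed-form slope:", w1_hat)
```

Up to floating point precision, both values should agree with the intercept and coefficient estimated by the fitted model.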


## Linear Regression With Noise
As the dataset used to fit the model lies exactly on a straight line, the estimated coefficients are the correct 
ones (up to floating point precision). Next, let us add some Gaussian noise to the data and see how it influences our 
estimation. So: 

$$\forall i \in \left[ m \right]\quad y_i=w_1\cdot x_i + w_0 + \varepsilon_i \quad s.t.\quad 
\varepsilon\sim\mathcal{N}\left(0,\sigma^2I_m\right)$$

Namely, the noise of each sample is Gaussian with zero mean and variance $\sigma^2$, and is uncorrelated between samples.

*Notice that from now on we mark the $y$'s generated by the noise-less model with `y_`. This is so it is clear that the "real"
$y$'s observed in a given sample are noisy.*

In [37]:
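# Store the noise-less responses in y_ only once, so re-running this cell does not stack noise on noise
if "y_" not in locals(): y_ = y
# Gaussian noise with zero mean and standard deviation sigma=40
epsilon = np.random.normal(loc=0, scale=40, size=len(x))
y = y_ + epsilon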

fig.add_trace(go.Scatter(x=x, y=y, name="Observed Points", mode="markers", line=dict(width=1)))
fig.update_layout(title=r"$\text{(2) Simulated Data - With Noise}$")
fig.show()


Try executing the block above several times and see how the "Observed Points" look different each time. These datasets,
though all generated from the same model, look very different. Try to think:

* What would happen if we attempt fitting a model to these observations (i.e. the ones with the noise)? 
* How would it influence our estimation of the coefficients $w_0, w_1$? 
* Where will the regression line be?


In [38]:
from pandas import DataFrame 
model = LinearRegression().fit(x.reshape((-1,1)), y)

DataFrame({"Model":["Noise-less","Noisy"], 
           "Intercept": [noiseless_model.intercept_, model.intercept_],
           "Slope": [noiseless_model.coef_[0], model.coef_[0]]})


| | Model | Intercept | Slope |
|---|---|---|---|
| 0 | Noise-less | 1.0 | 2.0 |
| 1 | Noisy | -32.281685 | 0.180022 |
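
A single noisy sample gives a single pair of estimates, so it is hard to judge from one fit how much the estimates vary. The sketch below repeats the simulation many times and summarizes the resulting slopes and intercepts. The seed, the generator `rng` and the number of repetitions `n_repetitions` are assumptions made for this illustration, and the "theoretical" value is the standard deviation of the OLS slope under the Gaussian noise model, $\sigma / \sqrt{\sum_i (x_i-\bar{x})^2}$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
w0, w1, sigma = 1, 2, 40
x = np.linspace(0, 100, 10)
X = x.reshape(-1, 1)

n_repetitions = 2000  # hypothetical number of simulated datasets
slopes = np.empty(n_repetitions)
intercepts = np.empty(n_repetitions)
for r in range(n_repetitions):
    # Draw a fresh noisy dataset from the same underlying line and refit
    y_noisy = w1 * x + w0 + rng.normal(0, sigma, size=len(x))
    fitted = LinearRegression().fit(X, y_noisy)
    slopes[r], intercepts[r] = fitted.coef_[0], fitted.intercept_

# Standard deviation of the OLS slope under the Gaussian noise model
slope_sd_theory = sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

print("Mean / std of estimated slopes:    ", slopes.mean(), slopes.std())
print("Theoretical std of the slope:      ", slope_sd_theory)
print("Mean / std of estimated intercepts:", intercepts.mean(), intercepts.std())
```

Comparing the printed means with $w_1=2$ and $w_0=1$, and the printed spreads with the noise level, gives a feel for how far a single fit can land from the true coefficients.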


In [39]:
y_hat = model.predict(x.reshape(-1,1))

fig.data = [fig.data[0], fig.data[1]]
fig.update_layout(title=r"$\text{(3) Fitted Model Over Noisy Data}$")
fig.add_traces([go.Scatter(x=x, y=y_hat, mode="markers", name="Predicted Responses", marker=dict(color="blue")),
                go.Scatter(x=x, y=y_hat, mode="lines", name="Fitted Model", line=dict(color="blue", width=1))])
fig.show()


Let us better understand what took place. Schematically, we started with some model
$$ Y=w_1X+w_0 \quad s.t. \quad w_1=2,\,w_0=1 $$

and obtained a dataset from this model 
$$ Y=w_1X + w_0 + \varepsilon \quad s.t. \quad \varepsilon\sim\mathcal{N}\left(0,\sigma^2\right) $$ 

Then, using the dataset we estimated the model parameters to obtain $\widehat{w_1},\widehat{w_0}$. However, we should look
at these steps from two different points of view: the "observer" and the "all-knowing".
- The "observer" is us whenever we work with data. We somehow obtained samples/observations that we assume to be generated
from some "true" function/model $f$. As in reality data is noisy, when we assume something about the "true" function we 
also make assumptions about the noise. Then, as we do not know $f$ we try to learn it based on the observations.
- The "all-knowing", unlike the "observer", knows exactly what $f$ looks like and what the noise of each sample is.  

In the graph above the <span style="color:Black">**Real Model**</span> is only known to the "all-knowing". We, as the 
"observer", only witness the <span style="color:red">**Observed Points**</span>. We **assumed** the data came from a linear
model with Gaussian Noise and therefore fitted the OLS estimators $\widehat{w}_1, \widehat{w}_0$. These estimators give
us the <span style="color:blue">**Fitted Model**</span> and a <span style="color:blue">**Predicted Response**</span> to 
each observation.
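
The difference between the two points of view can be made concrete with a small sketch. The "all-knowing" has access to the true noise $\varepsilon_i = y_i - (w_1x_i+w_0)$, while the "observer" can only compute the residuals $y_i-\widehat{y}_i$ of the fitted model. The seed and the variable names `rng` and `residuals` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)  # fixed seed: an assumption, for reproducibility only
w0, w1, sigma = 1, 2, 40
x = np.linspace(0, 100, 10)

y_ = w1 * x + w0                        # noise-less responses (known to the "all-knowing" only)
epsilon = rng.normal(0, sigma, len(x))  # true noise (known to the "all-knowing" only)
y = y_ + epsilon                        # observed responses (what the "observer" sees)

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_hat = model.predict(x.reshape(-1, 1))
residuals = y - y_hat                   # the only "noise" the "observer" can compute

print("True noise:", np.round(epsilon, 1))
print("Residuals: ", np.round(residuals, 1))
print("Sum of residuals:", residuals.sum())
print("Sum of squared noise / residuals:", np.sum(epsilon ** 2), np.sum(residuals ** 2))
```

Two properties are worth noticing. Because the model includes an intercept, the OLS residuals sum to zero (up to floating point precision) even though the true noise generally does not. And since OLS picks the line minimizing the sum of squared errors over the sample, the sum of squared residuals is never larger than the sum of squared true noise.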

Using these estimators of the model coefficients we can do two things:
- **Inference**: We can study the estimated model. What are the statistical properties of our estimators? How confident are
we in the estimation? Are the features helpful/relevant for predicting/explaining the response? Etc.
- **Prediction**: We can use this estimated model to predict the responses of new data-points. How accurate are our predictions? How does the training set (and its size) influence this accuracy? 

In the scope of this course we are mainly interested in using the fitted model for prediction, and only briefly 
investigate the properties of our fitted model.
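
As a rough illustration of the prediction questions, the sketch below fits the model on training sets of different sizes and measures the mean squared error over fresh points that were not used for fitting. The helper `sample`, the seed, the training sizes and the choice to draw $x$ uniformly from $[0, 100]$ (rather than on an evenly spaced grid as above) are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)  # fixed seed: an assumption, for reproducibility only
w0, w1, sigma = 1, 2, 40

def sample(m):
    """Draw m noisy observations from Y = w1*X + w0 + noise (hypothetical helper)."""
    x = rng.uniform(0, 100, m)
    return x.reshape(-1, 1), w1 * x + w0 + rng.normal(0, sigma, m)

X_test, y_test = sample(1000)  # fresh points, never used for fitting

test_mse = {}
for m in (10, 100, 1000):
    X_train, y_train = sample(m)
    fitted = LinearRegression().fit(X_train, y_train)
    test_mse[m] = mean_squared_error(y_test, fitted.predict(X_test))
    print(f"m={m:5d}  test MSE={test_mse[m]:.1f}  (noise variance={sigma**2})")
```

Since the responses themselves are noisy, even a perfectly estimated line cannot be expected to reach a test error much below the noise variance $\sigma^2$; the interesting part is how quickly the error approaches that level as the training set grows.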

## Multivariate Linear Regression
Lastly, using a more complicated model, we fit a model and answer some inference and prediction questions. 
To gain a better understanding, please look at the graph below and answer the questions before reading the code.


In [44]:
response = lambda x1, x2: 5*x1 + .1*x2 + 3

min_x1, min_x2, max_x1, max_x2 = -10, -10, 10, 10
xv1, xv2 = np.meshgrid(np.linspace(min_x1, max_x1, 10), np.linspace(min_x2, max_x2, 100))
surface = response(xv1, xv2)

x = np.random.uniform((min_x1, min_x2), (max_x1, max_x2), (100, 2))
y_ = response(x[:,0], x[:,1])
y = y_ + np.random.normal(0, 30, len(x))

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

DataFrame({"Coefficient": [f"$w_{i}$" for i in range(len(model.coef_)+1)],
           "Estimated Value": np.concatenate([[model.intercept_], model.coef_])})


| | Coefficient | Estimated Value |
|---|---|---|
| 0 | $w_0$ | -0.558147 |
| 1 | $w_1$ | 5.352162 |
| 2 | $w_2$ | 0.764667 |


In [45]:
go.Figure([go.Surface(x=xv1, y=xv2, z=surface, opacity=.5, showscale=False),
           go.Scatter3d(name="Real (noise-less) Points", x=x[:,0], y=x[:,1], z=y_,    mode="markers", marker=dict(color="black", size=2)),
           go.Scatter3d(name="Observed Points",          x=x[:,0], y=x[:,1], z=y,     mode="markers", marker=dict(color="red", size=2)),
           go.Scatter3d(name="Predicted Points",         x=x[:,0], y=x[:,1], z=y_hat, mode="markers", marker=dict(color="blue", size=2))],
          layout=go.Layout(
              title=r"$\text{(4) Bivariate Linear Regression}$",
              scene=dict(xaxis=dict(title="Feature 1"),
                         yaxis=dict(title="Feature 2"),
                         zaxis=dict(title="Response"),
                         camera=dict(eye=dict(x=-1, y=-2, z=.5)))
          )).show()

## Time To Think...
In the scenario above we performed a linear regression over observations with more than one feature (i.e. multivariate
linear regression). In gradient color we see the plane of the noise-less model from which our data-points are drawn. As we have 2 features, it is a 2D plane.

Try rotating the figure above and looking at the plane along each of its axes (such that it looks like a line rather than a plane). This view allows you to see the fit between one specific feature and the response, similar to the case of fitting a simple linear regression using that feature. 

Run the code generating the data and graph with more/fewer samples and higher/lower noise levels. How do these changes influence the quality of the fit? 
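
To make this experiment easier to repeat, the data generation and fitting can be wrapped in a small function. The sketch below is one possible way to do so; the function name `fit_bivariate`, its parameters and the chosen combinations of sample size and noise level are assumptions made for illustration, while the underlying model $5x_1 + 0.1x_2 + 3$ is the one used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_bivariate(m, sigma, seed=0):
    """Simulate m noisy samples of 5*x1 + .1*x2 + 3 and return (intercept, coefficients).

    Hypothetical helper: m is the number of samples, sigma the noise standard deviation.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10, 10, (m, 2))
    y = 5 * x[:, 0] + .1 * x[:, 1] + 3 + rng.normal(0, sigma, m)
    fitted = LinearRegression().fit(x, y)
    return fitted.intercept_, fitted.coef_

# Compare a few combinations of sample size and noise level
for m, sigma in [(100, 30), (10000, 30), (100, 1)]:
    intercept, coef = fit_bivariate(m, sigma)
    print(f"m={m:6d} sigma={sigma:3d}  w0={intercept:.3f}  w1={coef[0]:.3f}  w2={coef[1]:.3f}")
```

Compare each printed row with the true coefficients $w_0=3$, $w_1=5$, $w_2=0.1$, and note which of them is hardest to recover when the noise is large relative to its effect on the response.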