# Regression

- [Linear Regression](#linreg)
- [Least Squares Algorithm and its Derivation](#lsa)

<a id="linreg"></a>
# Linear Regression 

Simple Regression - just two variables: one is the explanatory variable (x) and the other is the response variable (y). This type of regression is easy to illustrate with a scatterplot. 

### Correlation Coefficient 

<b>Correlation Coefficient (r)</b> - the strength and direction of a linear relationship. $r \in [-1,1]$

The boundaries for the strength of correlation depend on the field. General guidelines:  
- Strong: $0.7 \leq |r| \leq 1.0$   
- Moderate: $0.3 \leq |r| < 0.7$   
- Weak: $0.0 \leq |r| < 0.3$   


$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) } { {\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}} {\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}} $$
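As a rough check of the formula, the sketch below computes $r$ term by term and compares it with `np.corrcoef`. The helper name `pearson_r` and the data points are made up for illustration; they are not part of the notebook's dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient computed term by term from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations of x from its mean
    dy = y - y.mean()          # deviations of y from its mean
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

# Made-up illustrative data (assumption, not from the notebook)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
print(r_manual, r_numpy)
```

Both values should agree up to floating-point rounding.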

<b>Important:</b> if $r=0$ it DOES NOT necessarily mean that there is no relationship at all. It just means that there is no <b>linear</b> relationship. The correlation coefficient only captures linear relationships.
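A small sketch of this point, using made-up data: here $y = x^2$ is completely determined by $x$, yet because the $x$ values are symmetric around zero the correlation coefficient comes out as (numerically) zero.

```python
import numpy as np

x = np.linspace(-3, 3, 61)   # values symmetric around zero (illustrative choice)
y = x ** 2                   # y depends on x exactly, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-10)        # r is zero up to floating-point rounding
```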

### Example in Python

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_excel('quizzes-data-1.xlsx')
df = df[df.Temp.notnull()&df.Sales.notnull()]

In [3]:
np.corrcoef(df.Temp, df.Sales)

array([[1.        , 0.95902026],
       [0.95902026, 1.        ]])

### Regression Equation

$b_0$: The **intercept** is defined as the **predicted value of the response when the x-variable is zero**.

$b_i, i \geq 1$ : The **slope**: for every one-unit increase in $x_i$, **the predicted value of y changes by $b_i$, holding all else (other variables) constant**. 

The regression equation is as follows: 

$$\hat{y} = b_0 + b_1 x_1 + ... + b_n x_n$$

$b_0, b_1, ..., b_n$ are sample statistics (estimates computed from the data), whereas $\beta_0, \beta_1, ..., \beta_n$ are the actual population parameters. Likewise, $\hat{y}$ is the predicted value, whereas $y$ is the actual value.
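To make the interpretation of the intercept and slopes concrete, here is a minimal sketch with hypothetical coefficient values (chosen only for illustration). Setting all predictors to zero returns the intercept, and raising $x_1$ by one unit while holding $x_2$ fixed changes the prediction by exactly $b_1$.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
b0 = 10.0                      # intercept
b = np.array([2.0, -0.5])      # slopes b_1, b_2

def predict(x):
    """Return y_hat = b0 + b_1*x_1 + ... + b_n*x_n for one observation x."""
    return b0 + np.dot(b, x)

x = np.array([3.0, 4.0])
x_plus = x + np.array([1.0, 0.0])    # increase x_1 by one unit, hold x_2 fixed

print(predict(np.zeros(2)))          # the intercept b0
print(predict(x_plus) - predict(x))  # the slope b_1
```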

<a id = "lsa"></a>
# Least Squares Algorithm and its Derivation
Goal: minimize the sum of the squared vertical distances from the points to the line. The objective function looks like this:  

$$E = \sum_{i=1}^n (y_i-\hat{y}_i)^2$$

Other loss functions are possible, but this one is the easiest to work with because its derivative, which we need in order to find the minimum, is easy to compute. 
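The objective can be evaluated directly for any candidate line. The sketch below uses made-up points and a hypothetical helper `sse` to compare a line that follows the points closely with one that does not; the better line has the smaller $E$.

```python
import numpy as np

def sse(b0, b1, x, y):
    """Sum of squared vertical distances between the points and the line b0 + b1*x."""
    y_hat = b0 + b1 * x
    return np.sum((y - y_hat) ** 2)

# Made-up points lying roughly along y = 1 + 2x (assumption for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

good = sse(1.0, 2.0, x, y)   # line close to the points
bad = sse(0.0, 1.0, x, y)    # line far from the points
print(good, bad)
```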

#### Derivation

We can define our $\hat{y}^{(i)}$ as:
$$\hat{y}^{(i)} = (x^{(i)})^T b,$$
where $x^{(i)}$ is the vector $[1; {x_{o}^{(i)}}]$. (I'm turning the original scalar $x_{o}^{(i)}$ into a vector so that we can pack the prediction into a dot product and then into a matrix-vector multiplication.) For the sake of generalisation, let's actually give the i-th observation n features, so that $x^{(i)} = [1; x^{(i)}_1, ..., x^{(i)}_n]$ has n+1 entries.   
By default, all vectors are column vectors. Now, b is the vector $(b_0, b_1, ..., b_n)$. Check that we get the same result after these arrangements: 

$$(x^{(i)})^T b = [1; x^{(i)}_1, ..., x^{(i)}_n]^T [b_0, b_1, ..., b_n] = b_0 + b_1 x^{(i)}_1 + b_2 x^{(i)}_2 + ... + b_n x^{(i)}_n $$

So we can rewrite our objective function as follows, writing m for the number of observations (n now counts the features): 

$$E(b) = \sum_{i=1}^{m} (y^{(i)} - (x^{(i)})^T b)^2$$ 

A sum of squares is exactly the dot product of a vector with itself, so we can further rewrite it as: 

$$E(b) = \sum_{i=1}^{m} (y^{(i)} - (x^{(i)})^T b)^2 = (y-Xb)^T (y-Xb),$$ 

where X is an m by (n+1) matrix whose i-th row is $(x^{(i)})^T$, so its first column is all 1s (with a single feature, X is m by 2). When we multiply this matrix by the vector b, we get the vector $\hat{y}$ of predictions. Now we can minimize this function, but first we will expand it: 
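The equality between the sum form and the matrix form can be checked numerically. This is only a sketch on random data; the sizes `n_obs` and `n_features` and the seed are arbitrary choices, and `X` is built by prepending a column of ones to the feature matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_features = 6, 2                         # arbitrary sizes for illustration
features = rng.normal(size=(n_obs, n_features))
X = np.column_stack([np.ones(n_obs), features])  # first column is all 1s
y = rng.normal(size=n_obs)
b = rng.normal(size=n_features + 1)              # any candidate coefficient vector

# Sum form: one squared residual per observation, X[i] plays the role of x^(i)
E_sum = sum((y[i] - X[i] @ b) ** 2 for i in range(n_obs))

# Matrix form: (y - Xb)^T (y - Xb)
residual = y - X @ b
E_matrix = residual @ residual

print(E_sum, E_matrix)
```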

$$E(b) = (y-Xb)^T (y-Xb) = y^T y - y^T X b - b^T X^T y + b^T X^T X b$$

Here, it's important to notice that $y^T X b$ is a scalar, so it equals its own transpose: $y^T X b = (y^T X b)^T = b^T X^T y$. So we can now write: 

$$E(b) = y^T y - 2 b^T X^T y + b^T X^T X b $$

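And now we will take the gradient of this expression with respect to b, using $\nabla_b (b^T a) = a$ and $\nabla_b (b^T A b) = 2 A b$ for symmetric $A$ (here $A = X^T X$), and equate it to $\vec{0}$:  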

$$\nabla{E} = - 2 X^T y + 2 X^T X b = \vec{0} $$
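One way to gain confidence in this gradient is to compare it with a finite-difference approximation of $E$. The sketch below does that on random data; the sizes, seed, and step size `eps` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])  # 8 observations, 2 features
y = rng.normal(size=8)
b = rng.normal(size=3)                                      # point at which to check the gradient

def E(b):
    """Objective (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return r @ r

# Gradient from the formula above
analytic = -2 * X.T @ y + 2 * X.T @ X @ b

# Central finite differences, one coordinate direction at a time
eps = 1e-6
numeric = np.array([
    (E(b + eps * e) - E(b - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```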

And now we can find the b vector as follows: 

$$X^T y = X^T X b$$

$$b = (X^T X)^{-1} X^T y $$  

The only possible problem here is that $X^T X$ may be non-invertible (singular), for example when some columns of X are linearly dependent. In that case the Moore-Penrose pseudoinverse is typically used in place of the inverse. 
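The following sketch illustrates both situations on synthetic data (the coefficients in `true_b`, the noise level, and the seed are assumptions made for the example). First the normal equations are solved directly and compared with `np.linalg.lstsq`; then a duplicated column makes $X^T X$ singular, and the pseudoinverse still yields a solution with the same predictions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 200
x1 = rng.normal(size=n_obs)
x2 = rng.normal(size=n_obs)
true_b = np.array([1.0, 2.0, -3.0])              # assumed "population" coefficients
X = np.column_stack([np.ones(n_obs), x1, x2])
y = X @ true_b + rng.normal(scale=0.1, size=n_obs)

# Normal equations X^T X b = X^T y, solved without forming the inverse explicitly
b_solve = np.linalg.solve(X.T @ X, X.T @ y)
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least-squares solver
print(b_solve)

# Singular case: x1 appears twice, so the columns are linearly dependent
X_dup = np.column_stack([X, x1])
print(np.linalg.matrix_rank(X_dup))              # 3 independent columns out of 4

# The pseudoinverse picks the minimum-norm least-squares solution
b_pinv = np.linalg.pinv(X_dup, rcond=1e-10) @ y
print(np.allclose(X_dup @ b_pinv, X @ b_solve, atol=1e-6))
```

Using `np.linalg.solve` instead of an explicit inverse is a common numerical choice, not something the derivation requires.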

#### Example in Python

In [11]:
df = pd.read_csv('data/house_prices.csv')
# add ones column
df['intercept'] = 1 
X = df[['intercept', 'area', 'bathrooms', 'bedrooms']]
y = df['price']

In [24]:
b = np.dot(np.dot(np.linalg.pinv(np.dot(X.transpose(),X)), X.transpose()),y)

In [25]:
b

array([10072.10704941,   345.91101884,  7345.39171708, -2925.80632748])

Or, using libraries: 

In [22]:
import statsmodels.api as ss

In [23]:
lm = ss.OLS(df['price'], df[['intercept', 'area', 'bathrooms', 'bedrooms']])
results = lm.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.678 |
| Model: | OLS | Adj. R-squared: | 0.678 |
| Method: | Least Squares | F-statistic: | 4230.0 |
| Date: | Tue, 11 Dec 2018 | Prob (F-statistic): | 0.0 |
| Time: | 16:02:37 | Log-Likelihood: | -84517.0 |
| No. Observations: | 6028 | AIC: | 169000.0 |
| Df Residuals: | 6024 | BIC: | 169100.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 1.007e+04 | 1.04e+04 | 0.972 | 0.331 | -1.02e+04 | 3.04e+04 |
| area | 345.9110 | 7.227 | 47.863 | 0.000 | 331.743 | 360.079 |
| bathrooms | 7345.3917 | 1.43e+04 | 0.515 | 0.607 | -2.06e+04 | 3.53e+04 |
| bedrooms | -2925.8063 | 1.03e+04 | -0.285 | 0.775 | -2.3e+04 | 1.72e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 367.658 | Durbin-Watson: | 2.007 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 350.116 |
| Skew: | 0.536 | Prob(JB): | 9.4e-77 |
| Kurtosis: | 2.503 | Cond. No. | 11600.0 |


### Results interpretation

#### p-values
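The p-value for $b_i$ is the probability of getting an estimate at least as far from 0 as the one observed if the true coefficient $\beta_i$ were 0. It is not the probability that the coefficient is 0. A small p-value is evidence of the "usefulness" of that parameter. In this case we see that area is a statistically significant predictor (reported p-value 0.000), while the others are not (their p-values are far above 0.05).  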
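The reported p-values can be roughly reproduced from the `t` column of the summary above. The sketch below assumes a standard normal approximation to the t distribution, which should be reasonable here given the 6024 residual degrees of freedom; statsmodels itself uses the t distribution, so the last digit may differ. The helper name `two_sided_p` is made up for this example.

```python
import math

# t statistics copied from the "t" column of the summary above
t_stats = {"intercept": 0.972, "area": 47.863, "bathrooms": 0.515, "bedrooms": -0.285}

def two_sided_p(t):
    """Two-sided tail probability P(|Z| >= |t|) for a standard normal Z."""
    return math.erfc(abs(t) / math.sqrt(2))

p_values = {name: two_sided_p(t) for name, t in t_stats.items()}
for name, p in p_values.items():
    print(name, round(p, 3))
```

The values should land close to the `P>|t|` column: large for intercept, bathrooms and bedrooms, and essentially zero for area.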

**Significant bivariate relationships are not always significant in multiple linear regression**

#### R-squared

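R-squared - the proportion of the variability in the response (y) explained by the model. Closer to 1 - better fit. In simple linear regression, R-squared is exactly $r^2$, the square of the correlation coefficient; with several predictors it equals the squared correlation between $y$ and $\hat{y}$.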
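As a sketch of how R-squared relates to $r$, the code below fits a line with one predictor and an intercept on synthetic data (the coefficients 3 and 2, the noise, and the seed are assumptions) and computes R-squared as one minus the ratio of residual to total sum of squares. In this single-predictor setting it should match $r^2$.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 3.0 + 2.0 * x + rng.normal(size=100)   # synthetic data for illustration

# Least-squares fit with an intercept column
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

ss_res = np.sum((y - y_hat) ** 2)          # variability left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)       # total variability in y
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```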