## Linear Regression 

#### What is linear regression?

Linear regression is a method for quantifying the linear relationship between two variables by finding the best-fit line between them. The goal is a model, or predictive relationship, that estimates one variable from the other.


Remember that the equation for a line is:

$y=mx+b$

$m$ = _slope_ (sometimes called the regression coefficient)

$b$ = $y$_-intercept_ (value of $y$ when $x=0$)

$y$ = _predictand_ or _dependent variable_

$x$ = _predictor_ or _independent variable_


Linear regression is used to determine the $b$ and $m$ that define this best-fit line. "Best fit" is typically defined as the line that minimizes the _root mean square error_ (RMSE): for each point, take the difference between its actual value of $y$ and the value of $y$ on the regression line at the same $x$, then square these differences, average them over all the points, and take the square root.
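For a single predictor, the line that minimizes the RMSE has a well-known closed-form solution: $m = \sum (x-\bar{x})(y-\bar{y}) / \sum (x-\bar{x})^2$ and $b = \bar{y} - m\bar{x}$. The following is a minimal sketch using made-up data; the arrays `x_demo` and `y_demo` and the helper `rmse` are illustrative names, not part of the cells that follow.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the example is repeatable
x_demo = rng.normal(5.0, 1.0, 200)
y_demo = 3 * x_demo + 60 + rng.normal(0.0, 1.0, 200)

def rmse(m, b, x, y):
    """Root mean square error between y and the line m*x + b."""
    return np.sqrt(np.mean((y - (m * x + b)) ** 2))

# Closed-form least-squares estimates for a straight line
x_anom = x_demo - x_demo.mean()
y_anom = y_demo - y_demo.mean()
m_ls = np.sum(x_anom * y_anom) / np.sum(x_anom ** 2)
b_ls = y_demo.mean() - m_ls * x_demo.mean()

print(f"Least-squares line: m = {m_ls:.3f}, b = {b_ls:.3f}")
print(f"RMSE of least-squares line:   {rmse(m_ls, b_ls, x_demo, y_demo):.4f}")
print(f"RMSE with slope nudged by 0.1: {rmse(m_ls + 0.1, b_ls, x_demo, y_demo):.4f}")
```

Any other choice of slope or intercept, including the nudged one above, gives an RMSE at least as large as that of the least-squares line.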

In [None]:
import xarray as xr
import matplotlib.pyplot as plt
import numpy as np


### Let's start with an idealized example first
The `numpy` function [random.normal](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html) produces a pseudo-random set of values having a [_normal_ distribution](https://en.m.wikipedia.org/wiki/Normal_distribution) about the provided _mean_ value, with a spread determined by the provided _standard deviation_. 

In [None]:
x = np.random.normal(5.0,1.0,200) # Arguments are: (mean, std. deviation, N)
m = 3    # Slope
b = 60   # Intercept
noise = np.random.normal(0,1.0,200) # This is used to add some spread around our line
y = m * x + b + noise

$y = mx + b$ places all the values of $x$ (in the array `x` we generated above) along a straight line. 

$y = mx + b + \eta$ adds a bit of `noise` $\eta$ to the values of $y$, so they spread away from the straight line somewhat. 

In [None]:
plt.scatter(x, y, label="Data")
plt.plot([x.min(),x.max()], [m*x.min()+b,m*x.max()+b], c='k', label="Without noise") 
plt.legend() ;

### The problem
Typically, what we have are two time series, $x$ and $y$, and we do not know what $m$ and $b$ are. 

We use linear regression to determine the _slope_ $m$ and _intercept_ $b$ that best fit a line between the two datasets.

Let's pretend we don't know what $m$ and $b$ are in this case. How can we find them from the arrays of $x$ and $y$?

### Determine slope and y-intercept using `np.polyfit`

The `numpy` function [polyfit](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) takes $x$ and $y$ as input. It also takes as its third argument the degree of the polynomial to fit.  For a linear regression problem, the degree is 1.



This function is more powerful than just linear regression. It can fit our data to different polynomials.  

$p(x) = p_0 x^0 + p_1 x^1 + p_2 x^2 + p_3 x^3 + ... + p_n x^n$

<u>It returns the vector $p$ in reverse order</u>, so 
the highest degree power is first. 

Here, we will use it only for fitting a line, so in our case, we have:

$p(x) =$ `p[1]` $x^0 +$ `p[0]` $x^1$, where

`m = p[0]` (slope)

`b = p[1]` (intercept)
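The coefficient ordering is easy to get backwards, so here is a small sketch that fits an exact quadratic and checks the order. The data (`x_demo`, `y_demo`) are invented for illustration only.

```python
import numpy as np

x_demo = np.linspace(-2.0, 2.0, 21)
y_demo = 2.0 * x_demo**2 - 1.0 * x_demo + 0.5   # exact quadratic, no noise

# Coefficients come back ordered from the x^2 term down to the x^0 term
p = np.polyfit(x_demo, y_demo, 2)
print(p)

# np.polyval expects the same highest-power-first ordering
y_check = np.polyval(p, x_demo)
print(np.allclose(y_check, y_demo))
```

Here `p[0]` multiplies $x^2$, `p[1]` multiplies $x^1$, and `p[2]` is the constant term.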


In [None]:
[m_fit,b_fit] = np.polyfit(x, y, 1)
print(f"Slope = {m_fit}, Intercept = {b_fit}")

You will notice that the result does not exactly match the values of $m$ and $b$ we chose. The noise we added, combined with the finite number of points, shifts the best-fit line slightly. But it is close.
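A quick way to see where the mismatch comes from is to repeat the fit with and without the noise term. This sketch uses its own illustrative arrays (`x_demo`, `y_noisy`) and assumes the same true slope and intercept as above.

```python
import numpy as np

rng = np.random.default_rng(1)
m_true, b_true = 3, 60
x_demo = rng.normal(5.0, 1.0, 200)

# Without noise, the fit recovers the true values to within floating-point precision
m_clean, b_clean = np.polyfit(x_demo, m_true * x_demo + b_true, 1)
print(f"No noise:   slope error = {abs(m_clean - m_true):.2e}, "
      f"intercept error = {abs(b_clean - b_true):.2e}")

# With noise, the errors are many orders of magnitude larger
y_noisy = m_true * x_demo + b_true + rng.normal(0.0, 1.0, 200)
m_noisy, b_noisy = np.polyfit(x_demo, y_noisy, 1)
print(f"With noise: slope error = {abs(m_noisy - m_true):.2e}, "
      f"intercept error = {abs(b_noisy - b_true):.2e}")
```

The noise-free errors are what numerical precision alone contributes; the rest of the discrepancy comes from the random scatter in the sample.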

### Plot the linear regression result

In [None]:
y_fit = m_fit * x + b_fit

plt.scatter(x, y, label="Data")
plt.plot([x.min(),x.max()], [m*x.min()+b,m*x.max()+b], 'k', label="Without noise") 
plt.plot(x, y_fit, 'r--', label="Linear regression")
plt.legend() ;

### Dependency of Variables
We typically regress $y$ against $x$ because we usually interpret $y$ to be caused by, 
or predicted by, the variations in $x$. For this reason, we call:
* $x$ the _independent variable_
* $y$ the _dependent variable_ (i.e., it is dependent on $x$)

But often in climate science, cause and effect are not clear cut.
Would we get the same result if we regressed $x$ against $y$?

In [None]:
# Regress X on Y instead of Y on X:
[m_tif,b_tif] = np.polyfit(y,x,1)
print(f"Slope = {m_tif}, Intercept = {b_tif}")
x_tif = m_tif * y + b_tif

plt.scatter(x, y, label="Data")
plt.plot([x.min(),x.max()], [m*x.min()+b,m*x.max()+b], 'k', label="Without noise") 
plt.plot(x, y_fit, 'r--', label=r"$y=f(x)$")
plt.plot(x_tif, y, 'g:', label=r"$x=f(y)$")
plt.legend() ;


Conventionally, the slope and y-intercept are determined as the line that minimizes the squared _vertical_ distance on the plot between the line (red dashed) and each point, summed over all the points.

When we regress $x$ on $y$, the slope and <u>x-intercept</u> are determined as the line that minimizes the squared <u>_horizontal_</u> distance on the plot between the line (green dotted) and each point, summed over all the points.
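The two regression lines are related in a simple way: the product of the two fitted slopes equals the squared correlation between $x$ and $y$, so the lines coincide only when the correlation is perfect. A minimal sketch with illustrative data (`x_demo`, `y_demo` are not the notebook's arrays):

```python
import numpy as np

rng = np.random.default_rng(2)
x_demo = rng.normal(5.0, 1.0, 200)
y_demo = 3 * x_demo + 60 + rng.normal(0.0, 1.0, 200)

m_yx, b_yx = np.polyfit(x_demo, y_demo, 1)   # y regressed on x
m_xy, b_xy = np.polyfit(y_demo, x_demo, 1)   # x regressed on y

r = np.corrcoef(x_demo, y_demo)[0, 1]        # correlation between x and y

print(f"Slope of y on x:                 {m_yx:.3f}")
print(f"Slope of x on y, drawn as dy/dx: {1 / m_xy:.3f}")
print(f"Product of the two slopes:       {m_yx * m_xy:.4f}")
print(f"Squared correlation r^2:         {r**2:.4f}")
```

Because $r^2 < 1$ for noisy data, the $x$-on-$y$ line appears steeper than the $y$-on-$x$ line when both are drawn on the same axes.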

### R-Squared

The line does not perfectly represent $y$. If there is a strong _linear relationship_, it will represent it well. If not, it may represent it poorly.

The part of $y$ not represented by the line is called the _residual_. You can calculate it as:

$y_{res} = y_{fit}-y$

We mentioned RMSE earlier; it is defined as the square root of the average of the squared values of this residual. For $N$ points:

RMSE $= \sqrt{\frac{1}{N}\sum{y_{res}^2}}$

One way to quantify how well the regression line fits the data is to determine how much of the variation (or variance) in $y$ can be explained by its dependence on $x$ in that regression model. This is called the _explained variance_, $R^2$ (also known as the _coefficient of determination_). More formally, it is:

$R^2 = 1 - \frac{\text{Unexplained Variance}}{\text{Total Variance}}$

* Unexplained Variance is the variance of the residual $y_{res}$

* Total Variance is the variance of $y$

* You can multiply by 100 and think of it as the percentage variance explained by the regression model

A larger $R^2$ indicates a better fit and means that the linear regression model (the best fit line) can explain well the variation of the output with different inputs.

...and, yes, the $R$ here is simply the _correlation_ between $x$ and $y$, which we learned about earlier in the course.
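The definitions above can be checked numerically. This sketch computes the RMSE and $R^2$ from the residual and compares $R^2$ with the squared correlation between $x$ and $y$; all names ending in `_demo` are illustrative and independent of the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(3)
x_demo = rng.normal(5.0, 1.0, 200)
y_demo = 3 * x_demo + 60 + rng.normal(0.0, 1.0, 200)

m_demo, b_demo = np.polyfit(x_demo, y_demo, 1)
y_fit_demo = m_demo * x_demo + b_demo
y_res_demo = y_fit_demo - y_demo             # residual, same sign convention as above

rmse_demo = np.sqrt(np.mean(y_res_demo ** 2))
r2_from_variance = 1 - np.var(y_res_demo) / np.var(y_demo)
r2_from_corr = np.corrcoef(x_demo, y_demo)[0, 1] ** 2

print(f"RMSE = {rmse_demo:.4f}")
print(f"R^2 from variances   = {r2_from_variance:.4f}")
print(f"R^2 from correlation = {r2_from_corr:.4f}")
```

For a least-squares line with an intercept, the residual has zero mean, so the RMSE is also the standard deviation of the residual.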

In [None]:
y_res = y_fit - y
r_squared = 1 - (np.var(y_res)/np.var(y))
print(f"{100*r_squared:0.2f}% of variance explained by linear regression")

### We can fit a line to anything...
...but that doesn't mean we should. The explained variance can tell us how well a linear regression performs as a description of the assumed relationship $y=f(x)$.

In [None]:
y_rand = np.random.randn(len(x)) 
[m_fit_rand,b_fit_rand] = np.polyfit(x,y_rand,1)
y_fit_rand = m_fit_rand * x + b_fit_rand
print(f"Slope = {m_fit_rand}, Intercept = {b_fit_rand}")

plt.scatter(x, y_rand, label="Data")
plt.plot(x, y_fit_rand, 'r--', label=r"$y=f(x)$")
plt.legend() ;

In [None]:
# Regress X on Y instead of Y on X:
[m_tif_rand,b_tif_rand] = np.polyfit(y_rand,x,1)
print(f"Slope = {m_tif_rand}, Intercept = {b_tif_rand}")
x_tif_rand = m_tif_rand * y_rand + b_tif_rand

plt.scatter(x,y_rand, label="Data")
plt.plot(x, y_fit_rand, 'r--', label=r"$y=f(x)$")
plt.plot(x_tif_rand,y_rand,'g:', label=r"$x=f(y)$")
plt.legend() ;

R-squared gives us an indication that this line is not a very good fit to the data.

In [None]:
y_res_rand = y_rand - y_fit_rand
r_squared_rand = 1 - (np.var(y_res_rand)/np.var(y_rand))
print(f"{100*r_squared_rand:0.2f}% of variance explained by linear regression")

We may also want to perform a significance test on the slope to see if it is statistically different from zero.

This is not easily done using `np.polyfit`.

### Using scipy

`scipy.stats.linregress`

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

In [None]:
from scipy.stats import linregress

In [None]:
slope, intercept, r_value, p_value, std_err = linregress(x, y)

print(f"Slope, intercept = {slope:.3g}, {intercept:.3g}")
print(f"Correlation coefficient = {r_value:.3f}")
print(f"p-Value = {p_value:.3g}")
print(f"Standard error = {std_err:.4g}")

**r_value**: The Pearson correlation coefficient between $x$ and $y$. Square this to get $r^2$.
* Positive values correspond to a positive slope, negative values to a negative slope.
* A value of ±1 means a perfect correlation; RMSE=0
* A value of 0 means no linear relationship: $y$ is linearly uncorrelated with $x$.

**p_value**: Two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic.
* The p-value can be interpreted as the probability of obtaining a slope at least this far from zero purely by chance if there were no true linear relationship
* The smaller the p-value, the stronger the evidence that the relationship is not accidental

**std_err**: The standard error of the estimated slope, i.e., a measure of the uncertainty in the slope. It shrinks as the scatter about the regression line decreases and as the number of points increases.
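The `p_value` and `std_err` outputs can be reproduced by hand, which helps show what they measure. This sketch treats `std_err` as the standard error of the slope estimate, as the scipy documentation describes it, and uses the standard t-test with $N-2$ degrees of freedom. The data and the names `slope_se`, `t_stat`, and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 50
x_demo = rng.normal(5.0, 1.0, n)
y_demo = 0.3 * x_demo + 60 + rng.normal(0.0, 1.0, n)   # weak slope, so p is not vanishingly small

res = stats.linregress(x_demo, y_demo)

# Residual variance with n - 2 degrees of freedom (two fitted parameters)
y_res_demo = y_demo - (res.slope * x_demo + res.intercept)
dof = n - 2
resid_var = np.sum(y_res_demo ** 2) / dof

# Standard error of the slope, t statistic, and two-sided p-value
slope_se = np.sqrt(resid_var / np.sum((x_demo - x_demo.mean()) ** 2))
t_stat = res.slope / slope_se
p_manual = 2 * stats.t.sf(abs(t_stat), dof)

print(f"Standard error: manual = {slope_se:.4g}, linregress = {res.stderr:.4g}")
print(f"p-value:        manual = {p_manual:.4g}, linregress = {res.pvalue:.4g}")
```

The t statistic is simply the slope divided by its standard error, so a slope that is small relative to its uncertainty yields a large p-value.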

In [None]:
plt.scatter(x, y, label="Data")
string = (f"Linear regression\n"
          f"($r^2$={r_value**2:0.3f})")
plt.plot(x, intercept + slope*x, 'r-', label=string)
plt.legend() ;

## Multiple Linear Regression 

$y = X\beta + \epsilon$

Where 

y = predictand/dependent variable

X = matrix of the predictor/independent variables (one column each), plus a column of ones for the intercept term

$\epsilon$ is the error term

$\beta$ is the vector of regression coefficients associated with each independent variable and the intercept

The least-squares solution for $\beta$ is 

$\beta$ = $( X^{T} X )^{-1}X^{T}y$

Here is a short video describing the geometric least squares solution https://www.youtube.com/watch?v=Z0wELiinNVQ 
*Note that the notation used in the video is different*
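The formula $\beta = (X^{T}X)^{-1}X^{T}y$ can be evaluated directly in `numpy`. The sketch below builds a small synthetic problem (the names `x1_demo`, `x2_demo`, `X_demo`, and `y_demo` are illustrative), solves the normal equations with `np.linalg.solve` rather than an explicit inverse, and compares the result with `np.linalg.lstsq`, which solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1_demo = rng.normal(5.0, 1.0, n)
x2_demo = rng.normal(5.0, 1.0, n)
y_demo = 3 * x1_demo + 5 * x2_demo + 60 + rng.normal(0.0, 1.0, n)

# Design matrix: a column of ones for the intercept, then one column per predictor
X_demo = np.column_stack([np.ones(n), x1_demo, x2_demo])

# Normal equations (X^T X) beta = X^T y, solved without forming the inverse
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# The same least-squares problem via NumPy's dedicated solver
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Coefficient order follows the columns: intercept, slope for x1, slope for x2
print("Normal equations:", beta_normal)
print("np.linalg.lstsq: ", beta_lstsq)
```

Solving the linear system is generally preferred over computing the inverse explicitly, since it is less sensitive to round-off when the predictors are strongly correlated.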

In [None]:
# From the beginning of this notebook recall that x is a vector of random numbers, m = 3 & b = 60
# Let's use this to generate a synthetic dataset with two independent variables
x1 = x
x2 = np.random.normal(5.0,1.0,200) # (mean, std. deviation, N)
m1 = m
m2 = 5  # A different slope for x2
y = m1 * x1 + m2 * x2 + b + noise

In [None]:
# define matrix of independent variables (columns = each independent variable)
X = np.zeros((len(x2), 2)) # 2 independent variables x1 and x2
X[:,0] = x1
X[:,1] = x2

In [None]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(y, X[:,0], X[:,1], marker='o')  # use an explicit marker; m is the slope, not a marker style
ax.set_xlabel('Y')
ax.set_ylabel('x1')
ax.set_zlabel('x2')

Multiple regression (ordinary least squares) calculation, adapted from
https://towardsdatascience.com/multiple-linear-regression-from-scratch-in-numpy-36a3e8ac8014 and implemented as a class:

In [None]:
class OrdinaryLeastSquares(object):
    
    def __init__(self):
        self.coefficients = []
        
    def fit(self, X, y):
        if len(X.shape) == 1: X = self._reshape_x(X)  # treat a 1-D input as a single predictor column
            
        X = self._concatenate_ones(X)
        self.coefficients = np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
        
    def predict(self, entry):
        b0 = self.coefficients[0]
        other_betas = self.coefficients[1:]
        predictions = b0
        
        for xi, bi in zip(entry, other_betas): predictions += (bi * xi)
        return predictions
        
    def _reshape_x(self, X):
        return X.reshape(-1, 1)
    
    def _concatenate_ones(self, X):
        ones = np.ones(shape=X.shape[0]).reshape(-1, 1)
        return np.concatenate((ones, X), 1)

In [None]:
model = OrdinaryLeastSquares()

model.fit(X, y)

b_est=model.coefficients[0]
print(b_est)
m1_est=model.coefficients[1]
print(m1_est)
m2_est=model.coefficients[2]
print(m2_est)

In [None]:
from mpl_toolkits.mplot3d import axes3d    
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(y, X[:,0], X[:,1], marker='o')

# Evaluate the fitted plane along the diagonal x1 = x2, at the two end points 3 and 7
x_line = np.array([3, 7])
y_line = m1_est * x_line + m2_est * x_line + b_est
ax.plot(y_line, x_line, x_line, 'r')
ax.set_xlabel('Y')
ax.set_ylabel('x1')
ax.set_zlabel('x2')