## Linear Regression Modelling

Consider two data series, $X = \left(x_{1}, x_{2}, ..., x_{n}\right)$ and $Y = \left(y_{1}, y_{2}, ..., y_{n}\right)$, both with mean zero. We use linear regression (ordinary least squares) to regress $Y$ against $X$ (without fitting any intercept), as in $Y = aX + \epsilon$ where $\epsilon$ denotes a series of error terms.

Problems:

1. Calculate the value of the regression coefficient $a$. If possible, express it in terms of the standard deviations $\sigma_{X}$ and $\sigma_{Y}$ and the correlation coefficient $\rho_{XY}$ between the two data series. You will need to show a complete derivation to score full marks.  

2. We scale up both data series by constant factors $s$ and $t$, i.e. $X' = sX$ and $Y' = tY$, and regress $Y'$ against $X'$ as in $Y' = a'X' + \epsilon$. How does the new regression coefficient $a'$ relate to the original coefficient $a$? And what about the new correlation $\rho_{X'Y'}$ vs. the original correlation $\rho_{XY}$? Note that the new $\epsilon$ is not necessarily the same as the original one; it merely denotes another series of error terms.  

3. We now do the ‘inverse’ regression of $X$ against $Y$, resulting in $X = bY + \epsilon$. How is the slope $b$ of the ‘inverse’ regression related to the slope $a$ of the original regression?   

4. Suppose that $\rho_{XY} = 0.01$. Is the resulting value of $a$ statistically significantly different from $0$ at the $95\,\%$ level if:   
    i. $n = 10^{2}$  
    ii. $n = 10^{3}$  
    iii. $n = 10^{4}$  


## Problem 1:

The idea of linear regression is to use a function $f^{*}\left(X\right)$ that is linear in a set of parameters $a_{i}\in A$ to predict $Y$ as closely as possible. The function $f^{*}\left(X\right)$ is obtained by choosing the parameters $a_{i}\in A$ of a function $f\left(A, X\right)$ so that the prediction error is as small as possible.

In our case we work with $f\left(a, X\right) = a\cdot X$ so that $Y = a\cdot X + \epsilon$. 

When we use ordinary least squares to regress $Y$ against $X$, we minimize the sum of squared differences between our observed values $Y$ and the predictions from our function $f\left(a, X\right)$. We denote this sum as our loss function $L$, given by:

\begin{equation}
L = \sum_{i=1}^{n} \left(y_{i}-a\cdot x_{i}\right)^{2}
\end{equation}

By minimizing $L$, we can find the set of parameters for which $f\left(A, X\right)$ becomes $f^{*}\left(X\right)$, so let's do that now.

\begin{equation}
\begin{aligned}
\frac{\partial L}{\partial a} &=& \frac{\partial}{\partial a}\sum_{i=1}^{n} \left(y_{i}-a\cdot x_{i}\right)^{2} \\
&=& \sum_{i=1}^{n}\frac{\partial}{\partial a} \left(y_{i}-a\cdot x_{i}\right)^{2} \\
&=& \sum_{i=1}^{n}2\cdot\left(y_{i}-a\cdot x_{i}\right)\cdot\left(-x_{i}\right) \\
&=& -2\cdot\sum_{i=1}^{n}\left(y_{i}\cdot x_{i}-a\cdot x_{i}^{2}\right) \\
&=& -2\cdot\left(\sum_{i=1}^{n}y_{i}\cdot x_{i}\right) + 2\cdot\left(\sum_{i=1}^{n} a\cdot x_{i}^{2}\right) \\
&=& -2\cdot\left(\sum_{i=1}^{n}y_{i}\cdot x_{i}\right) + 2a\cdot\left(\sum_{i=1}^{n} x_{i}^{2}\right) \\
\end{aligned}
\end{equation}
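
As a quick sanity check of this derivative, the following sketch compares the analytic expression with a central finite difference of $L$. The toy data, the seed, the test point `a0` and the step `h` are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.normal(size=50)                        # hypothetical toy data
y = 2.0 * x + rng.normal(scale=0.3, size=50)   # hypothetical toy data

def loss(a):
    """Sum of squared residuals for the no-intercept model y = a * x."""
    return np.sum((y - a * x) ** 2)

def grad(a):
    """Analytic derivative dL/da as derived above."""
    return -2.0 * np.sum(y * x) + 2.0 * a * np.sum(x ** 2)

a0 = 0.7   # arbitrary test point
h = 1e-6   # step size of the central finite difference
numeric = (loss(a0 + h) - loss(a0 - h)) / (2 * h)

print("finite difference:", numeric)
print("analytic gradient:", grad(a0))
```

Because $L$ is quadratic in $a$, the central difference should match the analytic derivative up to floating-point rounding.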

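Since $\frac{\partial^{2} L}{\partial a^{2}} = 2\cdot\sum_{i=1}^{n} x_{i}^{2} > 0$ (as long as not all $x_{i}$ are zero), the stationary point of $L$ is a minimum. From setting $\frac{\partial L}{\partial a}=0$, it follows that: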

\begin{equation}
\begin{aligned}
a\cdot\sum_{i=1}^{n} x_{i}^{2} &=& \sum_{i=1}^{n}y_{i}\cdot x_{i}  \\
a &=& \frac{\sum_{i=1}^{n}y_{i}\cdot x_{i}}{\sum_{i=1}^{n} x_{i}^{2}}
\end{aligned}
\end{equation}
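
The closed-form result can be compared with a generic least-squares solver. The sketch below uses `np.linalg.lstsq` on a single-column design matrix (so no intercept is fitted); the toy data and the seed are assumptions, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility
x = rng.normal(size=200)                        # hypothetical toy data
y = -1.5 * x + rng.normal(scale=0.5, size=200)  # hypothetical toy data

# Closed-form slope from the derivation: sum(y_i * x_i) / sum(x_i^2)
a_closed = np.sum(y * x) / np.sum(x ** 2)

# Generic least-squares solution of x * a = y with one column and no intercept
a_lstsq = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]

print("closed form:", a_closed)
print("lstsq:      ", a_lstsq)
```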

Now we have found an expression for $a$ in terms of sums over $x_{i}$ and $y_{i}$. However, we want to express it in terms of the standard deviations $\sigma_{X}$ and $\sigma_{Y}$ and the correlation $\rho_{XY}$. So let's work those out and see if we can substitute them into our result.

For a sample $k \in \{X, Y\}$ with mean value $\bar{k}=\frac{1}{n}\sum_{i=1}^{n}k_i$, the sample standard deviation $\sigma_{k}$ is given by:


\begin{equation}
\sigma_{k}^{2} = \frac{\sum_{i=1}^{n}\left(k_{i}-\bar{k}\right)^{2}}{n-1}
\end{equation}

Now since the mean values of $X$ and $Y$ are both zero, we can simplify the standard deviations by substituting $\bar{k}=0$:

\begin{equation}
\sigma_{k}^{2} = \frac{\sum_{i=1}^{n}k_{i}^{2}}{n-1}
\end{equation}

The correlation coefficient $\rho_{XY}$ for a sample is given by:

\begin{equation}
\rho_{XY} = \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\cdot\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}}
\end{equation}

Using our mean values of zero, we simply get:

\begin{equation}
\rho_{XY} = \frac{\sum_{i=1}^{n}x_{i}\cdot y_{i}}{\sqrt{\sum_{i=1}^{n}x_{i}^{2}}\sqrt{\sum_{i=1}^{n}y_{i}^{2}}}
\end{equation}

We can expand our equation for $\rho_{XY}$ by multiplying by $1 = \frac{n-1}{n-1}$, i.e. by dividing both the numerator and the denominator by $n-1$, and get:

\begin{equation}
\begin{aligned}
\rho_{XY} &=& \frac{\sum_{i=1}^{n}x_{i}\cdot y_{i}}{\sqrt{\frac{\sum_{i=1}^{n}x_{i}^{2}}{n-1}}\sqrt{\frac{\sum_{i=1}^{n}y_{i}^{2}}{n-1}}}\cdot \frac{1}{n-1} \\
&=& \frac{\sum_{i=1}^{n}x_{i}\cdot y_{i}}{\sigma_{X}\sigma_{Y}}\cdot \frac{1}{n-1}
\end{aligned}
\end{equation}

We recognize that the sums in $\sigma_{X}$ and in $\rho_{XY}$ appear in $a$, so let's expand $a$ so that we can substitute both:

\begin{equation}
\begin{aligned}
a &=& \frac{\sum_{i=1}^{n}y_{i}\cdot x_{i}}{\sum_{i=1}^{n} x_{i}^{2}} \\
&=& \sum_{i=1}^{n}y_{i}\cdot x_{i} \cdot \frac{\sigma_{X}\sigma_{Y}}{\sigma_{X}\sigma_{Y}}\frac{n-1}{n-1} \cdot \frac{1}{\sum_{i=1}^{n} x_{i}^{2}} \\
&=& \left(\frac{\sum_{i=1}^{n}x_{i}\cdot y_{i}}{\sigma_{X}\sigma_{Y}}\cdot \frac{1}{n-1}\right) \cdot \frac{\sigma_{X}\sigma_{Y}\cdot\left(n-1\right)}{\sum_{i=1}^{n} x_{i}^{2}} \\
&=& \rho_{XY} \cdot \frac{\sigma_{X}\sigma_{Y}}{\frac{\sum_{i=1}^{n} x_{i}^{2}}{n-1}} \\
&=& \rho_{XY} \cdot \frac{\sigma_{X}\sigma_{Y}}{\sigma_{X}^{2}} \\
&=& \rho_{XY} \cdot \frac{\sigma_{Y}}{\sigma_{X}} \\
\end{aligned}
\end{equation}


So we find that $a$ is given by:

\begin{equation}
a = \rho_{XY} \cdot \frac{\sigma_{Y}}{\sigma_{X}}
\end{equation}
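
A small numerical sketch of this identity: for data that has been centered to mean zero, the slope computed from the sums should coincide with $\rho_{XY}\cdot\sigma_{Y}/\sigma_{X}$. The toy data and the seed are assumptions; `ddof=1` selects the $n-1$ normalization used above.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, only for reproducibility
x = rng.normal(scale=3.0, size=500)     # hypothetical toy data
y = 0.4 * x + rng.normal(size=500)      # hypothetical toy data

# Center both series so that the mean-zero assumption of the problem holds
x = x - x.mean()
y = y - y.mean()

# Slope from the sums, as obtained by minimizing L
a_sums = np.sum(x * y) / np.sum(x ** 2)

# Slope from the correlation and the sample standard deviations
rho = np.corrcoef(x, y)[0, 1]
a_rho = rho * np.std(y, ddof=1) / np.std(x, ddof=1)

print("from sums:          ", a_sums)
print("rho * sigma_Y/sigma_X:", a_rho)
```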

I want to test this, so first let's grab a useful expression for $\rho_{XY}$. Using the covariance $\textit{cov}\left(X,Y\right)$ of $X$ and $Y$ given by:

\begin{equation}
\begin{aligned}
\textit{cov}\left(X,Y\right) = \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\cdot\left(y_{i}-\bar{y}\right)}{n-1}
\end{aligned}
\end{equation}

We can rewrite the correlation coefficient as:

\begin{equation}
\begin{aligned}
\rho_{XY} = \frac{\textit{cov}\left(X,Y\right)}{\sigma_{X}\sigma_{Y}}
\end{aligned}
\end{equation}

Thus $a$ can be expressed as:

\begin{equation}
\begin{aligned}
a &=& \frac{\textit{cov}\left(X,Y\right)}{\sigma_{X}^{2}} \\
&=& \frac{\textit{cov}\left(X,Y\right)}{\textit{var}\left(X\right)}
\end{aligned}
\end{equation}

In [88]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Y takes every integer from -10 to 10 exactly `measurements` times,
# so its sample mean is exactly 0.
mu_X = range(-10, 11, 1)
sigma_X = 0.5
measurements = 10

X = []
Y = []
for mu in mu_X:
    Y.extend([mu] * measurements)
    # X is a noisy measurement of mu, so its sample mean is only close to 0.
    X.extend(np.random.normal(mu, sigma_X, measurements))

print("Mean of Y: {0:.2f}".format(np.mean(Y)))
print("Mean of X: {0:.2f}".format(np.mean(X)))
if np.mean(X) != 0:
    print("Due to statistical fluctuations the mean value of X is not exactly 0.\n")

# np.cov returns the 2x2 sample covariance matrix (normalized by n-1);
# note that it subtracts the sample means before computing the sums.
cov_XY_matrix = np.cov(X, Y)
cov_XY = cov_XY_matrix[0, 1]
var_X = cov_XY_matrix[0, 0]

# Our derived result: a = cov(X, Y) / var(X)
a = cov_XY / var_X

print("Our guess for the slope a is: {0:.3f}\n".format(a))

# Do linear regression without an intercept (line forced through the origin)
X_reg = np.array(X).reshape((-1, 1))

reg = LinearRegression(fit_intercept=False).fit(X_reg, Y)

print("Running the linear regression with a forced intercept of 0, we get:\n")
print("Slope a: {0:.3f}".format(reg.coef_[0]))
print("Intercept: {}".format(reg.intercept_))

print("\n The difference between our guess and the found slope is: {0:.5f}".format(abs(reg.coef_[0]-a)))

Mean of Y: 0.00
Mean of X: 0.01
Due to statistical fluctuations the mean value of X is not exactly 0.

Our guess for the slope a is: 0.989

Running the linear regression with a forced intercept of 0, we get:

Slope a: 0.989
Intercept: 0.0

 The difference between our guess and the found slope is: 0.00000


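As we can see, the two slopes agree to the printed precision. A small difference can remain because `np.cov` subtracts the sample means while the regression without intercept does not, and $\bar{x}$ is not exactly $0$.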
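
To illustrate how a nonzero sample mean affects this comparison, the sketch below adds an artificial offset of `0.3` to $X$ (an assumption made purely to exaggerate the effect) and compares the through-origin slope $\sum x_{i} y_{i} / \sum x_{i}^{2}$ with $\textit{cov}\left(X,Y\right)/\textit{var}\left(X\right)$, before and after centering both series. The helper names and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed, only for reproducibility

# Same layout as above: each integer from -10 to 10 appears 10 times in Y
y = np.repeat(np.arange(-10, 11), 10).astype(float)
# X is Y plus noise plus an artificial offset, so mean(X) is clearly nonzero
x = y + rng.normal(scale=0.5, size=y.size) + 0.3

def slope_through_origin(x, y):
    """OLS slope without intercept: sum(x*y) / sum(x^2)."""
    return np.sum(x * y) / np.sum(x ** 2)

def slope_cov_var(x, y):
    """Slope as cov(X, Y) / var(X); np.cov centers the data internally."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

gap_raw = abs(slope_through_origin(x, y) - slope_cov_var(x, y))

# After centering, the mean-zero assumption holds and both formulas coincide
xc, yc = x - x.mean(), y - y.mean()
gap_centered = abs(slope_through_origin(xc, yc) - slope_cov_var(xc, yc))

print("gap without centering:", gap_raw)
print("gap after centering:  ", gap_centered)
```

With centered data the gap should shrink to floating-point rounding, which is consistent with the derivation relying on both means being zero.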