## Linear Regression

![linear-regression](https://images.spiceworks.com/wp-content/uploads/2022/04/06113401/shutterstock_2087420848.jpg)

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# the 'seaborn-whitegrid' matplotlib style name was removed in recent
# matplotlib releases, so the white-grid look is requested through seaborn
sns.set(style='whitegrid')

**Definition 2.2: Interpolation**

Given a sequence of pairs $\mathbb{D} = \{(\mathbf{x}_i, \mathbf{y}_i) \in \mathbb{R}^d \times \mathbb{R}^t : i = 1, \ldots, n\}$, we want to find a function $f : \mathbb{R}^d \to \mathbb{R}^t$ so that for all $i$:
$$
f(\mathbf{x}_i) \approx \mathbf{y}_i.
$$
Most often, $ t = 1 $, i.e., each $ \mathbf{y}_i $ is real-valued. However, we will indulge ourselves in treating any $ t $ (since it brings very minimal complication).

The variable $ t $ here stands for the number of responses (tasks), i.e., how many values we are interested in predicting (simultaneously).


There is a neat theorem stating that for any dataset $ \mathbb{D} $ we can always find a function $ f $ that maps each $ \mathbf{x}_i $ exactly to $ \mathbf{y}_i $.
One way to picture it is a piecewise function that takes the value $ \mathbf{y}_i $ at each data point $ \mathbf{x}_i $ and is 0 everywhere else.
The statement and its proof follow.

**Theorem 2.3: Exact interpolation**

For any finite number of pairs $ \mathbb{D} = \{(\mathbf{x}_i, \mathbf{y}_i) \in \mathbb{R}^d \times \mathbb{R}^t : i = 1, \ldots, n\} $ that satisfy $ \mathbf{x}_i = \mathbf{x}_j \implies \mathbf{y}_i = \mathbf{y}_j $, there exist infinitely many functions $ f : \mathbb{R}^d \to \mathbb{R}^t $ so that for all $ i $:

$$
f(\mathbf{x}_i) = \mathbf{y}_i.
$$

Proof: W.l.o.g. we may assume all $ \mathbf{x}_i $'s are distinct. Lagrange polynomials immediately give such a claimed function. More generally, one may put a bump function within a small neighborhood of each $ \mathbf{x}_i $ and then glue them together. In detail, set $ \mathbb{N}_i = \{\mathbf{z} : \|\mathbf{z} - \mathbf{x}_i\|_\infty < \delta\} \subset \mathbb{R}^d $. Clearly, $ \mathbf{x}_i \in \mathbb{N}_i $ and, for $ \delta $ sufficiently small, $ \mathbb{N}_i \cap \mathbb{N}_j = \emptyset $ for all $ i \neq j $. Writing $ z_j $ and $ x_{i,j} $ for the $ j $-th coordinates of $ \mathbf{z} $ and $ \mathbf{x}_i $, define

$$
f_i(\mathbf{z}) = \begin{cases}
\mathbf{y}_i \, e^{d/\delta^2} \prod_{j=1}^d \exp \left( - \frac{1}{\delta^2 - (z_j - x_{i,j})^2} \right), & \text{if } \mathbf{z} \in \mathbb{N}_i \\
0, & \text{otherwise}
\end{cases}.
$$

The function $ f = \sum_i f_i $ again exactly interpolates our data $ \mathbb{D} $.

The condition $ \mathbf{x}_i = \mathbf{x}_j \implies \mathbf{y}_i = \mathbf{y}_j $ is clearly necessary, for otherwise there cannot be any function so that $ \mathbf{y}_i = f(\mathbf{x}_i) = f(\mathbf{x}_j) = \mathbf{y}_j $ and $ \mathbf{y}_i \neq \mathbf{y}_j $. Of course, when all $ \mathbf{x}_i $'s are distinct, this condition is trivially satisfied.
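
The gluing construction in the proof can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the per-coordinate bump $ \exp(-1/(\delta^2 - u^2)) $ is one standard choice, and the names `bump_interpolant`, `X_train`, `Y_train` and the toy points are made up for this example.

```python
import numpy as np

def bump_interpolant(X, Y, delta):
    """Build f = sum_i f_i from coordinate-wise bump functions.

    X: (n, d) distinct inputs, Y: (n, t) targets.
    delta: half-width of the box around each x_i, assumed small enough
    that the boxes around different x_i do not overlap.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = X.shape[1]

    def f(z):
        z = np.asarray(z, dtype=float)
        out = np.zeros(Y.shape[1])
        for x_i, y_i in zip(X, Y):
            diff = z - x_i
            if np.max(np.abs(diff)) < delta:  # z lies in the box N_i
                # e^{d/delta^2} * prod_j exp(-1 / (delta^2 - diff_j^2)),
                # combined in the exponent to avoid overflow
                bump = np.exp(d / delta**2 - np.sum(1.0 / (delta**2 - diff**2)))
                out += y_i * bump
        return out

    return f

# hypothetical toy data: three points in R^2 with scalar targets
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y_train = np.array([[1.0], [2.0], [3.0]])
f = bump_interpolant(X_train, Y_train, delta=0.25)

for x_i, y_i in zip(X_train, Y_train):
    print(x_i, "->", f(x_i), "target", y_i)
print("far from the data:", f([0.5, 0.5]))
```

The function reproduces every training target, yet it returns 0 as soon as the input leaves the small boxes, which is why zero training error alone says little about new inputs.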



Given a finite training set $ \mathbb{D} $, no matter how large, there exist infinitely many smooth (infinitely differentiable) functions $ f $ that map each $ \mathbf{x}_i $ in $ \mathbb{D} $ exactly to $ \mathbf{y}_i $, i.e., they all achieve zero training "error."
But what about a new data point $ \mathbf{x} $ that is not in $ \mathbb{D} $? A function that memorizes the training pairs and returns 0 everywhere else tells us nothing useful there, so we need another way to choose $ f $.


## Simple Linear Regression

We will start with the most familiar linear regression, a straight-line fit to data.
A straight-line fit is a model of the form:
$$
y = ax + b
$$
where $a$ is commonly known as the *slope*, and $b$ is commonly known as the *intercept*.

Consider the following data, which is scattered about a line with a slope of 2 and an intercept of –5 (see the following figure):

In [None]:
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)
plt.scatter(x, y);

We can use Scikit-Learn's `LinearRegression` estimator to fit this data and construct the best-fit line, as shown in the following figure:

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)

model.fit(x[:, np.newaxis], y)

xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit);
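
As a quick check on the fit, the estimated slope and intercept can be read from the fitted estimator through its `coef_` and `intercept_` attributes. The sketch below regenerates the same synthetic data so that it runs on its own; the names `slope` and `intercept` are introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# same synthetic data as above: slope 2, intercept -5, unit Gaussian noise
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

model = LinearRegression(fit_intercept=True).fit(x[:, np.newaxis], y)

slope = model.coef_[0]        # one coefficient per input feature
intercept = model.intercept_  # fitted bias term
print("Model slope:    ", slope)
print("Model intercept:", intercept)
```

Both values should land close to the 2 and -5 used to generate the data, up to the noise in the sample.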

Our task is to minimize the mean squared error over the data, which approximates the expected squared error under the (unknown) data distribution.
This is the least squares regression problem:

**Definition 2.6: Least squares regression**

To resolve the difficulty in Remark 2.5, we need some statistical assumption on how our data is generated. In particular, we assume $ (X_i, Y_i) $ are independently and identically distributed (i.i.d.) random samples from an unknown distribution $ \mathbb{P} $. The test sample $ (X, Y) $ is also drawn independently and identically from the same distribution $ \mathbb{P} $. We are interested in solving the least squares regression problem:

$$
\min_{f:\mathbb{R}^d \to \mathbb{R}^t} \mathbb{E}[\|Y - f(X)\|_2^2], \tag{2.1}
$$

i.e., finding a function $ f : \mathbb{R}^d \to \mathbb{R}^t $ so that $ f(X) $ approximates $ Y $ well in expectation. (Strictly speaking we need the technical assumption that $ f $ is measurable so that the above expectation is even defined.)

In reality, we do not know the distribution $ \mathbb{P} $ of $ (X, Y) $, hence we will not be able to compute the expectation, let alone minimize it. Instead, we use the training set $ \mathbb{D} = \{(X_i, Y_i) : i = 1, \ldots, n\} $ to approximate the expectation:

$$
\min_{f:\mathbb{R}^d \to \mathbb{R}^t} \hat{\mathbb{E}}[\|Y - f(X)\|_2^2] := \frac{1}{n} \sum_{i=1}^n \|Y_i - f(X_i)\|_2^2. \tag{2.2}
$$

By the law of large numbers, for any fixed function $ f $, we indeed have

$$
\frac{1}{n} \sum_{i=1}^n \|Y_i - f(X_i)\|_2^2 \xrightarrow{n \to \infty} \mathbb{E}[\|Y - f(X)\|_2^2].
$$
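
The convergence can be observed numerically. The sketch below assumes a toy distribution that is not part of the text: $ X $ uniform on $ [0, 10] $ and $ Y = 2X - 5 + \varepsilon $ with standard normal noise, so that the fixed function $ f(x) = 2x - 5 $ has expected squared error exactly 1. The helper `empirical_risk` is a name chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # a fixed function, chosen before seeing any data
    return 2.0 * x - 5.0

def empirical_risk(n):
    # draw n i.i.d. samples from the assumed toy distribution
    x = rng.uniform(0.0, 10.0, size=n)
    y = 2.0 * x - 5.0 + rng.standard_normal(n)
    return np.mean((y - f(x)) ** 2)

risks = {n: empirical_risk(n) for n in (10, 1_000, 1_000_000)}
for n, r in risks.items():
    print(f"n = {n:>9,d}   empirical risk = {r:.4f}")
```

As $ n $ grows, the printed values should settle near the expected value of 1.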


**Definition 2.11: Linear least squares regression**

The simplest choice for the function class $ \mathbb{F}_n \equiv \mathbb{F} $ is perhaps the class of linear/affine functions (recall Definition 1.13). Adopting this choice leads to the linear least squares regression problem:

$$
\min_{W \in \mathbb{R}^{t \times d}, b \in \mathbb{R}^t} \frac{1}{n} \sum_{i=1}^n \|Y_i - WX_i - b\|_2^2, \tag{2.5}
$$

where recall that $ X_i \in \mathbb{R}^d $ and $ Y_i \in \mathbb{R}^t $.

Remember that here $ f(X_i) = W X_i + b $.
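
A minimal sketch of solving problem (2.5) with NumPy follows. It assumes synthetic data and uses the common trick of appending a constant 1 to each input so that $ b $ is estimated together with $ W $; the names `W_true`, `b_true`, `W_hat` and `b_hat` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 200, 3, 2

# hypothetical ground truth used to generate the data
W_true = rng.standard_normal((t, d))
b_true = rng.standard_normal(t)

X = rng.standard_normal((n, d))  # row i is the sample X_i
Y = X @ W_true.T + b_true + 0.1 * rng.standard_normal((n, t))

# append a constant-1 feature so the bias is part of the unknowns
X_pad = np.hstack([X, np.ones((n, 1))])

# least squares solution of X_pad @ coef ~ Y; coef has shape (d + 1, t)
coef, *_ = np.linalg.lstsq(X_pad, Y, rcond=None)
W_hat = coef[:d].T  # shape (t, d)
b_hat = coef[d]     # shape (t,)

print("max |W_hat - W_true| =", np.max(np.abs(W_hat - W_true)))
print("max |b_hat - b_true| =", np.max(np.abs(b_hat - b_true)))
```

`np.linalg.lstsq` handles all $ t $ outputs at once because the squared Frobenius norm separates over the columns of $ Y $.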


**Definition 2.8: Regression function**

For a moment let us assume the distribution $ \mathbb{P} $ is known to us, so we can at least in theory solve the least squares regression problem (2.1). It turns out there exists an optimal solution whose closed-form expression can be derived as follows:

$$
\begin{aligned}
\mathbb{E}[\|Y - f(X)\|_2^2] &= \mathbb{E}[\|Y - \mathbb{E}(Y|X) + \mathbb{E}(Y|X) - f(X)\|_2^2] \\
&= \mathbb{E}[\|Y - \mathbb{E}(Y|X)\|_2^2] + \mathbb{E}[\|\mathbb{E}(Y|X) - f(X)\|_2^2] + 2\mathbb{E}[\langle Y - \mathbb{E}(Y|X), \mathbb{E}(Y|X) - f(X) \rangle] \\
&= \mathbb{E}[\|Y - \mathbb{E}(Y|X)\|_2^2] + \mathbb{E}[\|\mathbb{E}(Y|X) - f(X)\|_2^2] + 2\mathbb{E}\big[\mathbb{E}[\langle Y - \mathbb{E}(Y|X), \mathbb{E}(Y|X) - f(X) \rangle \mid X]\big] \\
&= \mathbb{E}[\|Y - \mathbb{E}(Y|X)\|_2^2] + \mathbb{E}[\|\mathbb{E}(Y|X) - f(X)\|_2^2] + 2\mathbb{E}[\langle \mathbb{E}(Y|X) - \mathbb{E}(Y|X), \mathbb{E}(Y|X) - f(X) \rangle] \\
&= \mathbb{E}[\|Y - \mathbb{E}(Y|X)\|_2^2] + \mathbb{E}[\|\mathbb{E}(Y|X) - f(X)\|_2^2],
\end{aligned}
$$

whence the **regression function**

$$
m(X) := \mathbb{E}(Y|X) \tag{2.3}
$$

eliminates the second nonnegative term while the first nonnegative term is not affected by any $ f $ at all.

With hindsight, it is not surprising the regression function $ m $ is an optimal solution for the least squares regression problem: it basically says given $(X, Y)$, we set $ m(X) = Y $ if there is a unique value of $ Y $ associated with $ X $ (which of course is optimal), while if there are multiple values of $ Y $ associated to the given $ X $, then we simply average them.

The constant term $ \mathbb{E}[\|Y - \mathbb{E}(Y|X)\|_2^2] $ describes the difficulty of our regression problem: no function $ f $ can reduce it.

If someone reports an expected squared error below $ \mathbb{E}[\|Y - \mathbb{E}(Y|X)\|_2^2] $, there is something wrong with the implementation or the evaluation, since no function $ f $ can go below this value.
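
A small simulation can make the decomposition concrete. It assumes a toy distribution that is not from the text: $ X $ uniform on $ [0, 10] $ and $ Y = 2X - 5 + \varepsilon $ with standard normal noise, so that $ m(x) = 2x - 5 $ and the noise floor equals 1. The competing function `f` is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 * x - 5.0 + rng.standard_normal(n)

def m(x):
    # regression function E[Y | X = x] of the assumed toy distribution
    return 2.0 * x - 5.0

def f(x):
    # some other candidate predictor
    return 1.5 * x - 4.0

noise_floor = np.mean((y - m(x)) ** 2)  # estimates E||Y - E(Y|X)||^2
gap = np.mean((m(x) - f(x)) ** 2)       # estimates E||E(Y|X) - f(X)||^2
risk_f = np.mean((y - f(x)) ** 2)       # estimates E||Y - f(X)||^2

print(f"noise floor       : {noise_floor:.4f}")
print(f"gap between m, f  : {gap:.4f}")
print(f"risk of f         : {risk_f:.4f}")
print(f"noise floor + gap : {noise_floor + gap:.4f}")
```

Up to sampling error, the risk of `f` equals the noise floor plus the gap, and only the gap depends on the choice of `f`.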

**Definition 2.24: Prediction**

Once we have the linear least squares solution $ (\hat{W}, \hat{b}) $, we perform prediction on a (future) test sample $ X $ naturally by:

$$
\hat{Y} := \hat{W}X + \hat{b}.
$$

We measure the “goodness” of our prediction $ \hat{Y} $ by:

$$
\|\hat{Y} - Y\|_2^2,
$$

which is usually averaged over a test set.
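
As a sketch of this evaluation, the code below fits on one synthetic sample and averages the squared prediction error over a separate test sample. The data-generating process (slope 2, intercept -5, unit noise) and the helper name `sample` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # hypothetical data source: Y = 2 X - 5 + standard normal noise
    X = rng.uniform(0.0, 10.0, size=(n, 1))
    Y = 2.0 * X - 5.0 + rng.standard_normal((n, 1))
    return X, Y

X_train, Y_train = sample(200)
X_test, Y_test = sample(1000)

# fit W and b on the training set (constant-1 feature carries the bias)
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A, Y_train, rcond=None)
W_hat, b_hat = coef[:-1].T, coef[-1]

# predict on the test set and average the squared error per sample
Y_pred = X_test @ W_hat.T + b_hat
test_mse = np.mean(np.sum((Y_pred - Y_test) ** 2, axis=1))
print(f"test MSE = {test_mse:.4f}")
```

Because the noise variance is 1 in this toy setup, a test MSE near 1 is about the best one can hope for.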


**Definition 2.26: Ridge regression with Tikhonov regularization (Tikhonov 1963; Hoerl and Kennard 1970)**

The class of linear functions may still be too large, leading linear least squares to overfit or be unstable. We can then put some extra restriction, such as the Tikhonov regularization in ridge regression:

$$
\min_{W \in \mathbb{R}^{t \times p}} \frac{1}{n} \|Y - WX\|_F^2 + \lambda \|W\|_F^2, \tag{2.10}
$$

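where $ \lambda \ge 0 $ is the regularization constant (hyperparameter) that balances the two terms. Here $ X \in \mathbb{R}^{p \times n} $ and $ Y \in \mathbb{R}^{t \times n} $ stack the samples as columns, and we assume the bias $ b $ has been absorbed into $ W $ by appending a constant 1 to every $ X_i $, so that $ p = d + 1 $.

To understand ridge regression, consider
- when $ \lambda $ is small, we are effectively neglecting the second (regularization) term, and the solution resembles the ordinary linear least squares solution;
- when $ \lambda $ is large, we are effectively neglecting the first (data) term, and the solution degenerates to 0.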

In the literature, the following variant that chooses not to regularize the bias term $ b $ is also commonly used:

$$
\min_{W \in \mathbb{R}^{t \times d},\, b \in \mathbb{R}^t} \frac{1}{n} \sum_{i=1}^n \|Y_i - W X_i - b\|_2^2 + \lambda \|W\|_F^2. \tag{2.11}
$$
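
For the fully regularized form (2.10), setting the gradient to zero gives $ W (X X^\top + n \lambda I) = Y X^\top $, which can be solved directly. The sketch below assumes samples stacked as columns and synthetic data; `ridge_fit` is a name chosen for this example, not a library function.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Minimize (1/n) ||Y - W X||_F^2 + lam ||W||_F^2.

    X: (p, n) inputs as columns, Y: (t, n) targets as columns.
    Returns W with shape (t, p).
    """
    p, n = X.shape
    G = X @ X.T + n * lam * np.eye(p)  # symmetric (p, p) matrix
    # W G = Y X^T  is equivalent to  G W^T = X Y^T
    return np.linalg.solve(G, X @ Y.T).T

rng = np.random.default_rng(0)
d, t, n = 3, 2, 100
# last row of ones plays the role of the absorbed bias, so p = d + 1
X = np.vstack([rng.standard_normal((d, n)), np.ones((1, n))])
W_true = rng.standard_normal((t, d + 1))
Y = W_true @ X + 0.1 * rng.standard_normal((t, n))

lams = [0.0, 0.1, 10.0, 1e6]
norms = [np.linalg.norm(ridge_fit(X, Y, lam)) for lam in lams]
for lam, nrm in zip(lams, norms):
    print(f"lambda = {lam:<9g} ||W||_F = {nrm:.6f}")

# ordinary least squares for comparison with lambda = 0
W_ols = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
```

With $ \lambda = 0 $ the solution coincides with ordinary least squares, and the norm of $ W $ shrinks toward 0 as $ \lambda $ grows, matching the two regimes described above.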

**Reference**:
Hoerl, A. E. and R. W. Kennard (1970). "Ridge regression: Biased estimation for nonorthogonal problems". Technometrics, vol. 12, no. 1, pp. 55–67.


## Practical Example

### Load the data

In [None]:
# read the raw CSV; the GitHub "blob" page URL serves HTML, not CSV
data = pd.read_csv('https://raw.githubusercontent.com/VIJAY-GADRE/1_Simple_Linear_Regression/main/1.01.%20Simple%20linear%20regression.csv')

In [None]:
data

In [None]:
# let's get a description about our data
# we will compare SAT scores vs GPA
data.describe()

### Creating our first linear regression

In [None]:
# select columns with dataframe['column_name']
y = data['GPA']   # dependent variable
x1 = data['SAT']  # independent variable

# then plot: plt.scatter(independent variable, dependent variable)

plt.scatter(x1, y)
plt.xlabel('SAT', fontsize = 20)
plt.ylabel('GPA', fontsize = 20)
plt.show()


statsmodels' `OLS` does not add the bias term (the intercept $b_0$) by itself, so we add it to the inputs explicitly. With `statsmodels.api` imported as `sm`, this is done with:

 `sm.add_constant(independent_variable)`

We then fit the linear regression model using Ordinary Least Squares (OLS), passing the dependent variable and the independent variable (with the constant column added) as the arguments.

In [None]:
# fit the linear regression model with Ordinary Least Squares (OLS):
# sm.OLS(dependent variable, independent variable with constant column)

x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
results.summary()


In [None]:
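# plot the data together with the fitted regression line
plt.scatter(x1, y)
# use the fitted coefficients (about 0.275 and 0.0017 in the original run)
# instead of hard-coding them
yhat = results.params['const'] + results.params['SAT'] * x1
plt.plot(x1, yhat, lw=4, c='orange', label='regression line')

# add labels and show the legend for the line
plt.xlabel('SAT', fontsize = 20)
plt.ylabel('GPA', fontsize = 20)
plt.legend()
plt.show()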