# Regression

A common task in science and engineering is to understand relationships between quantities that vary. The simplest relation between two variables $x$ and $y$ is the linear equation $y = mx + h$. Experimental data often produce points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ that, when graphed, seem to lie close to a line. We want to determine the parameters $m$ and $h$ that make the line as “close” to the points as possible. The coefficients $m$ and $h$ of the line are called (linear) regression coefficients. If the data points $(x_i, y_i)$ were on the line, the regression coefficients would satisfy the equations

\begin{align}
    mx_1 + h &= y_1 \\
    mx_2 + h &= y_2 \\
    &\vdots\\
    mx_n + h &= y_n
\end{align}


This system can be written as

$$
A\vec{x} = \vec{b} \quad \text{where} \quad A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \vec{x} = \begin{bmatrix} h \\ m \end{bmatrix} \quad \text{and} \quad \vec{b} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n  \end{bmatrix}. 
$$



Of course, if the data points are not on a line (which often happens in practice), then $A\vec{x} = \vec{b}$ does not have a solution. In that case we look for an ordinary least-squares (OLS) solution of $A\vec{x} = \vec{b}$.



## OLS Linear Regression Optimization Problem




Find the pair $(h,m)$ that minimizes the loss function $J(h,m)$ defined as

$$
J(h,m)=\tfrac{1}{2}\sum_{i=1}^n (y_i-(h+mx_i) )^2=\tfrac{1}{2}\|\vec{b} - A\vec{x}\|^2.
$$

(The factor of 1/2 multiplying the sum is introduced to simplify theoretical analysis of the loss function: it cancels the factor of 2 that appears when the squared terms are differentiated.)

The loss function $J$ is zero when the points are collinear and lie on the line $y = h + mx$; otherwise, $J$ is positive, since it is half the sum of the squared vertical distances between the data points and the line $y = h + mx$.
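
To see where the normal equations used in the examples come from, one can set the partial derivatives of $J$ to zero. The following is a short sketch of that standard calculation, written only with the symbols defined above:

\begin{align}
    \frac{\partial J}{\partial h} &= -\sum_{i=1}^n \bigl(y_i - (h + mx_i)\bigr) = 0 \\
    \frac{\partial J}{\partial m} &= -\sum_{i=1}^n x_i\bigl(y_i - (h + mx_i)\bigr) = 0
\end{align}

Rearranging the two conditions gives the $2 \times 2$ linear system

$$
\begin{bmatrix} n & \sum_{i} x_i \\ \sum_{i} x_i & \sum_{i} x_i^2 \end{bmatrix} \begin{bmatrix} h \\ m \end{bmatrix} = \begin{bmatrix} \sum_{i} y_i \\ \sum_{i} x_i y_i \end{bmatrix},
$$

which is the system $A^TA\vec{x} = A^T\vec{b}$, where $\vec{b}$ denotes the column vector of the observed values $y_1, \dots, y_n$.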

__Example 1__

Find the line $y = mx + h$ that best fits the data points $(2, 1)$, $(5, 2)$, $(7, 3)$, and $(8, 3)$.

__Solution:__

1. Write down the matrix equation explained above.

2. Form the normal equations $A^TA\vec{x} = A^T\vec{b}$.

3. Find the OLS solution using the previous lab.

In [None]:
# your code
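
The three steps of Example 1 could be sketched with NumPy as follows. This is only one possible approach: `np.linalg.solve` stands in for whatever solver the previous lab used, and the variable names (`AtA`, `Atb`, and so on) are illustrative assumptions rather than part of the lab.

```python
import numpy as np

# Data points from Example 1
x = np.array([2.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 3.0])

# Step 1: matrix equation A [h, m]^T = b, with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])
b = y

# Step 2: normal equations (A^T A) [h, m]^T = A^T b
AtA = A.T @ A
Atb = A.T @ b

# Step 3: solve the 2x2 system (assumed stand-in for the previous lab's method)
h, m = np.linalg.solve(AtA, Atb)

print(f"y = {h:.4f} + {m:.4f} x")  # prints "y = 0.2857 + 0.3571 x"
```

The exact values are $h = 2/7$ and $m = 5/14$, so the least-squares line is $y = \tfrac{2}{7} + \tfrac{5}{14}x$.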

We can also use `sklearn`'s `LinearRegression` model object. 

Here is the documentation for `LinearRegression`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html</a>. 
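
As a minimal sketch of how `LinearRegression` might be applied to the data of Example 1 (the variable names are illustrative, and default settings are assumed, so the intercept is fitted automatically):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data points from Example 1; sklearn expects a 2D feature array of shape (n_samples, n_features)
X = np.array([[2.0], [5.0], [7.0], [8.0]])
y = np.array([1.0, 2.0, 3.0, 3.0])

model = LinearRegression()  # fit_intercept=True by default
model.fit(X, y)

h = model.intercept_  # intercept of the fitted line
m = model.coef_[0]    # slope (one coefficient per feature)

print(f"intercept h = {h:.4f}, slope m = {m:.4f}")
```

Note that the column of ones in $A$ is not passed to `fit`: with the default settings the model handles the intercept itself.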


__Example 2__  

a. Generate data points distributed around the line $y = 4x - 3$ and plot them (__ChatGPT__).

b. The points in this dataset lie roughly along a line. Find the regression line that best models the data.


c. Plot the data points along with the regression line.


In [12]:
# your code
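
One possible way to work through Example 2 is sketched below. The sample size, the range of $x$, the noise level, and the random seed are all arbitrary assumptions made for illustration; any data scattered around $y = 4x - 3$ would do.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible (arbitrary choice)

# a. Synthetic data around y = 4x - 3 with Gaussian noise (all sizes are assumptions)
x = rng.uniform(0, 10, size=50)
y = 4 * x - 3 + rng.normal(0, 1, size=50)

# b. Fit the regression line; sklearn needs x as a column
model = LinearRegression().fit(x.reshape(-1, 1), y)
h, m = model.intercept_, model.coef_[0]
print(f"fitted line: y = {h:.3f} + {m:.3f} x")

# c. Plot the data points together with the fitted line
x_line = np.linspace(x.min(), x.max(), 100)
fig, ax = plt.subplots()
ax.scatter(x, y, label="data")
ax.plot(x_line, h + m * x_line, color="red", label="regression line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()
```

The fitted coefficients should come out close to, but not exactly equal to, $m = 4$ and $h = -3$, because of the added noise.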

## Exercises

1) Consider the data $(1, 0)$, $(4, 5)$, $(7, 8)$. Use the normal equations to find the least-squares line $y = a + bx$ that best fits the data. 


2) Consider the data $(-1,1)$, $(0,0)$, $(1,2)$, $(2,3)$. Use the normal equations to find the least-squares solution for the parabola $y=a+bx+cx^2$ that best fits the data.

3) (ChatGPT) 

   a. Make up some data points scattered around the parabola $y = x^2 + 1$, and plot them.
    
   b. Use the normal equations to find the least-squares solution for the parabola $y=a+bx+cx^2$ that best fits the data.
    
   c. Plot the points along with the fitted parabola. 