# Regression Problem

Regression analysis is a set of statistical processes for estimating the relationships among variables. Formally, a regression model involves:

* The unknown parameters, denoted as $\theta$, which may represent a scalar or a vector.
* The independent variables, $\mathcal{X}$.
* The dependent variable, $\mathcal{Y}$.

The goal is then to predict $\mathcal{Y}$ given $\mathcal{X}$ and $\theta$:

$$\mathcal{Y} \approx h(\mathcal{X}, \theta)$$

where $h(\mathcal{X}, \theta)$ is called the hypothesis function.

In [46]:
from IPython.display import IFrame
IFrame('https://drive.google.com/file/d/1cJHJ5AdcFd0tibQvCrME4ychIrvPQoGc/preview', width=340, height=220)

In [47]:
from IPython.display import IFrame
IFrame('https://drive.google.com/file/d/1WknHdpGr4HkJU3ZCuW9i0tBJDF6MDd5w/preview', width=340, height=220)

## Linear Regression

Let's say that we decide to represent the hypothesis $h$ as a linear function of $\mathcal{X}$:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here, the $\theta_i$’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$, and write it more simply as $h(x)$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (the intercept term), so that

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$
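As a minimal sketch of this notation, the hypothesis can be written as a dot product in NumPy. The function name `hypothesis` and the numeric values are illustrative assumptions, not part of the text.

```python
import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis h(x) = theta^T x, where x[0] = 1 is the intercept term."""
    return np.dot(theta, x)

# Illustrative values (assumed): theta_0 = 1, theta_1 = 2, theta_2 = 3
theta = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 4.0, 5.0])  # x_0 = 1, x_1 = 4, x_2 = 5

prediction = hypothesis(theta, x)
print(prediction)  # 1 + 2*4 + 3*5 = 24.0
```

Prepending $x_0 = 1$ to every input is what lets the intercept $\theta_0$ be handled by the same dot product as the other weights.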

To learn the parameters $\theta$, a natural choice is to make $h(x)$ as close as possible to $y$ on the training examples, which brings us to the cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 $$

It measures half of the total squared distance between the model's predictions and the observed values, summed over the $m$ training examples $(x^{(i)}, y^{(i)})$.
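The cost function can be sketched in a few lines of NumPy. The function name `cost` and the toy data below are assumptions chosen so the result is easy to check by hand; each row of `X` is one training example with $x_0 = 1$ in the first column.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y  # h_theta(x^(i)) - y^(i) for every example
    return 0.5 * np.sum(residuals ** 2)

# Toy data (assumed): the targets follow y = 1 + 2*x exactly
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

perfect = cost(np.array([1.0, 2.0]), X, y)  # exact parameters: every residual is 0
off = cost(np.array([0.0, 2.0]), X, y)      # every residual is -1

print(perfect)  # 0.0
print(off)      # 0.5 * (1 + 1 + 1) = 1.5
```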

### Least Mean Squares algorithm

We want to choose $\theta$ so as to minimize the cost function $J(\theta)$. To do so, let's apply the [gradient descent algorithm](/notebooks/math/gradient-descent.ipynb):

$$\theta_{j} := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

(on every iteration, we simultaneously update all values of $\theta_j$)

Here, $\alpha$ is usually called the **learning rate**.

Working out this partial derivative for a single training example $(x, y)$, we get:

$$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j} &= \frac{\partial}{\partial \theta_j} \frac{1}{2} (h_\theta(x) - y)^2 \\
&= \frac{1}{2} \cdot 2 (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta_j} (h_\theta(x) - y) \\
&= (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right) \\
&= (h_\theta(x) - y) x_j
\end{aligned}$$

Substituting this into the update rule (the minus sign is absorbed by swapping the terms of the error), we end up with:

$$\theta_{j} := \theta_j + \alpha \cdot (y^{(i)}  - h_\theta(x^{(i)})) x_j^{(i)}$$
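A minimal sketch of this update applied to one training example, updating all $\theta_j$ at once. The function name `lms_step` and the numeric values are illustrative assumptions.

```python
import numpy as np

def lms_step(theta, x_i, y_i, alpha):
    """One LMS update from a single example: theta_j += alpha * (y - h(x)) * x_j."""
    error = y_i - np.dot(theta, x_i)   # y^(i) - h_theta(x^(i))
    return theta + alpha * error * x_i  # all theta_j are updated simultaneously

# Illustrative values (assumed)
theta = np.array([0.0, 0.0])
x_i = np.array([1.0, 2.0])  # x_0 = 1 (intercept), x_1 = 2
y_i = 4.0

theta_new = lms_step(theta, x_i, y_i, alpha=0.1)
print(theta_new)  # error = 4, so the update is 0.1 * 4 * [1, 2] = [0.4, 0.8]
```

After this step the prediction for the example moves from 0 to 2, so the error shrinks from 4 to 2.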

This formula updates a single parameter using a single training example. We can generalize it to all $n + 1$ parameters and all $m$ training examples:

$\text{repeat until convergence } \big\{$

$$
\begin{bmatrix}
    \theta_{0} \\
    \vdots \\
    \theta_{n}
\end{bmatrix}
:=
\begin{bmatrix}
    \theta_{0} \\
    \vdots \\
    \theta_{n}
\end{bmatrix}
+ \alpha \cdot
\sum_{i=1}^{m}
\Bigg(y^{(i)} - 
\begin{bmatrix}
    x^{(i)}_{0} & \dots & x^{(i)}_{n} \\
\end{bmatrix}
\cdot
\begin{bmatrix}
    \theta_{0} \\
    \vdots \\
    \theta_{n}
\end{bmatrix}
\Bigg) \cdot
\begin{bmatrix}
    x^{(i)}_{0} \\ 
    \vdots \\
    x^{(i)}_{n}
\end{bmatrix}
$$

$\big\}$
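The loop above can be sketched in vectorized form, where `X.T @ errors` computes the sum over all examples in one step. The function name `batch_gradient_descent`, the stopping rule (`tol`, `max_iters`) and the toy data are assumptions made for illustration; the text itself only says "repeat until convergence".

```python
import numpy as np

def batch_gradient_descent(X, y, theta, alpha, tol=1e-12, max_iters=100_000):
    """Repeat theta := theta + alpha * sum_i (y^(i) - x^(i) . theta) * x^(i)."""
    theta = np.array(theta, dtype=float)
    for _ in range(max_iters):
        errors = y - X @ theta          # y^(i) - h_theta(x^(i)) for all i
        step = alpha * (X.T @ errors)   # sum over examples of error * x^(i)
        theta += step
        if np.max(np.abs(step)) < tol:  # assumed convergence test: tiny update
            break
    return theta

# Toy data (assumed): the targets follow y = 1 + 2*x exactly
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = batch_gradient_descent(X, y, np.zeros(2), alpha=0.05)
print(theta)  # close to [1, 2]
```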

The rule is called the LMS update rule and is also known as the Widrow-Hoff learning rule. Note that the magnitude of the update is proportional to the error term $(y^{(i)} - h_\theta(x^{(i)}))$: an example whose prediction nearly matches $y^{(i)}$ changes the parameters very little, while a large error produces a large change. This method looks at every example in the entire training set on every step, and is called **batch gradient descent**. Note also that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has a single global optimum and no other local optima, because $J(\theta)$ is a convex quadratic function; thus **gradient descent always converges** (assuming the learning rate $\alpha$ is not too large) to the global minimum.
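The absence of other local optima can be illustrated with a small experiment: batch gradient descent started from two very different initial guesses ends at the same $\theta$. This is a sketch under assumptions: the toy data, the helper name `run`, the learning rate and the iteration count are all made up for illustration, and the minimizer is unique here because the columns of `X` are linearly independent.

```python
import numpy as np

# Toy data (assumed): roughly y = 1 + 2*x with small deviations
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.2, 2.9, 5.1, 6.8])

def run(theta0, alpha=0.05, iters=5000):
    """Batch gradient descent for a fixed number of iterations."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        theta += alpha * (X.T @ (y - X @ theta))
    return theta

theta_a = run([0.0, 0.0])
theta_b = run([100.0, -50.0])

print(theta_a, theta_b)
print(np.allclose(theta_a, theta_b))  # True: both runs reach the same minimum
```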

Notice that $\alpha$ plays a major role in the stability of the whole process. The larger the dataset gets, the smaller $\alpha$ has to be, because the batch update sums the contributions of every example. The higher the precision we want, the smaller it has to be. The smaller it gets, however, the slower the convergence will be.
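The effect of $\alpha$ can be sketched by running a fixed number of batch updates with three different learning rates and comparing the resulting cost. The toy data, the helper name `cost_after` and the three values of `alpha` are assumptions chosen for this small problem; the thresholds would differ on other data.

```python
import numpy as np

# Toy data (assumed): the targets follow y = 1 + 2*x exactly
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

def cost_after(alpha, iters=50):
    """Cost J(theta) after a fixed number of batch updates starting from theta = 0."""
    theta = np.zeros(2)
    for _ in range(iters):
        theta += alpha * (X.T @ (y - X @ theta))
    return 0.5 * np.sum((X @ theta - y) ** 2)

initial_cost = 0.5 * np.sum(y ** 2)  # cost at theta = 0, equal to 42.0 here

costs = {
    "too large (0.2)": cost_after(0.2),      # each step overshoots; the cost blows up
    "moderate (0.05)": cost_after(0.05),     # converges quickly
    "too small (0.001)": cost_after(0.001),  # stable, but still far from the minimum
}
for label, value in costs.items():
    print(label, value)
```

On this data the moderate rate ends with the lowest cost, the small rate is still reducing the cost slowly, and the large rate ends far above the starting cost.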

#### Example

Let's say that we want to predict the selling price of a house. We identified two independent variables that contribute to the selling price (living area and number of bedrooms) and we collected the following data:

| Living area (square feet) | Bedrooms    | Price (in $1000s) |
|---------------------------|-------------|-------------------|
| 2104                      | 3           | 400               |
| 1600                      | 3           | 330               |
| 2400                      | 3           | 369               |
| 1416                      | 2           | 232               |
| 3000                      | 4           | 540               |


In [48]:
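import numpy as np

# Design matrix: a leading 1 (intercept term x_0), living area, bedrooms
x = np.array([[1, 2104, 3],
              [1, 1600, 3],
              [1, 2400, 3],
              [1, 1416, 2],
              [1, 3000, 4]])

# Target: selling price in $1000s
y = np.array([400, 330, 369, 232, 540])

# Initial guess for the parameters
theta = np.array([90, .4, -9.0])


def squared_error(theta):
    # Sum of squared errors over all examples.
    # Note: this omits the 1/2 factor, so it equals 2 * J(theta);
    # the minimizing theta is the same.
    cost = float(.0)
    for i in range(0, np.size(x, 0)):
        cost += (y[i] - np.sum(theta * x[i]))**2
    return cost


cost = squared_error(theta)

print("\u03B8\u2080=", theta)
print("J(\u03B8\u2080)=", cost)

# The learning rate has to be tiny because the features are not scaled
alpha = 0.00000001
last_cost = cost
while True:
    # Batch update: accumulate (y - h(x)) * x over every training example
    partial = np.zeros(theta.size)
    for i in range(0, np.size(x, 0)):
        partial += (y[i] - np.sum(x[i] * theta)) * x[i]
    theta += alpha * partial

    cost = squared_error(theta)

    # Stop as soon as one iteration improves the cost by less than 0.0001
    if last_cost - cost < 0.0001:
        last_cost = cost
        break
    last_cost = cost

print("\u03B8=", theta)
print("J(\u03B8)=", last_cost)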

θ₀= [90.   0.4 -9. ]
J(θ₀)= 1496423.12
θ= [89.99988093  0.14968334 -9.00033479]
J(θ)= 8142.323265782019
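
As a usage sketch, the parameters printed above can be plugged back into $h(x) = \theta^T x$ to predict a price. The values of `theta` below are copied from that output; the helper name `predict_price` is an assumption.

```python
import numpy as np

# Parameters copied from the output of the run above
theta = np.array([89.99988093, 0.14968334, -9.00033479])

def predict_price(living_area, bedrooms):
    """Predicted price in $1000s for a house, using h(x) = theta^T x with x_0 = 1."""
    x = np.array([1.0, living_area, bedrooms])
    return np.dot(theta, x)

# First row of the table: 2104 square feet, 3 bedrooms, actual price 400
predicted = predict_price(2104, 3)
print(round(predicted, 2))  # about 377.93
```

Only $\theta_1$ moved noticeably from its initial value. The living-area feature is hundreds of times larger than the other two, so it dominates the gradient and forces the very small learning rate used above.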
