# Regression Problem

Regression analysis is a set of statistical processes for estimating the relationships among variables. Formally, a regression model involves:

* The unknown parameters, denoted as $\theta$ , which may represent a scalar or a vector.
* The independent variables, $\mathcal{X}$.
* The dependent variable, $\mathcal{Y}$.

The goal is then to predict $\mathcal{Y}$ given $\langle \mathcal{X}, \theta \rangle$:

$$\mathcal{Y} \approx h(\mathcal{X}, \theta)$$

where $h(\mathcal{X}, \theta)$ is called the hypothesis function.

In [None]:
from IPython.display import IFrame
IFrame('https://drive.google.com/file/d/1cJHJ5AdcFd0tibQvCrME4ychIrvPQoGc/preview', width=340, height=220)

In [None]:
from IPython.display import IFrame
IFrame('https://drive.google.com/file/d/1WknHdpGr4HkJU3ZCuW9i0tBJDF6MDd5w/preview', width=340, height=220)

## Linear Regression

Let's say that we decide to represent the hypothesis $h$ as a linear function of $\mathcal{X}$:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here, the $\theta_i$’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$, and write it more simply as $h(x)$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (the intercept term), so that

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$
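As a minimal sketch of the $x_0 = 1$ convention, the hypothesis can be evaluated for several inputs at once with a matrix product. The helper names `add_intercept` and `hypothesis`, and the sample values, are assumptions made for illustration and do not come from the original notebook.

```python
import numpy as np


def add_intercept(X):
    """Prepend a column of ones (x_0 = 1) to an (m, n) feature matrix."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])


def hypothesis(theta, X):
    """Evaluate h(x) = theta^T x for every row of X (intercept column included)."""
    return X @ np.asarray(theta, dtype=float)


# Two hypothetical inputs with two features each.
X = add_intercept([[2.0, 3.0], [0.0, 1.0]])
theta = [1.0, 0.5, -1.0]  # theta_0, theta_1, theta_2
predictions = hypothesis(theta, X)
print(predictions)
```

Folding the intercept into $\theta$ this way means the same dot product handles both the bias and the feature weights.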

To learn the parameters $\theta$, the most naive choice is to make $h(x)$ as close as possible to $y$ on the training examples, which brings us to the cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 $$

This measures half of the total squared distance between the model's predictions and the observed values.
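The cost function can be sketched in a few lines of NumPy. The function name `cost` and the toy data are assumptions made for illustration; `X` is assumed to already contain the $x_0 = 1$ column.

```python
import numpy as np


def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals over all training examples."""
    residuals = X @ theta - y
    return 0.5 * float(residuals @ residuals)


# Toy data generated from y = 1 + 2x (first column is the intercept x_0 = 1).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

exact = cost(np.array([1.0, 2.0]), X, y)  # parameters that fit the data exactly
off = cost(np.array([0.0, 2.0]), X, y)    # every prediction is too low by 1
print(exact, off)
```

With the exact parameters the cost is zero; shifting the intercept by 1 makes each of the three residuals equal to $-1$, so $J = \frac{1}{2} \cdot 3 = 1.5$.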

### Least Mean Squares algorithm

We want to choose $\theta$ so as to minimize the cost function $J(\theta)$. To do so, let's consider applying the [gradient descent algorithm](/notebooks/math/gradient-descent.ipynb):

$$\theta_{j} := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

(On every iteration, we simultaneously update all values of $\theta_j$.)

Here, $\alpha$ is usually called the **learning rate**.

Working out this partial derivative for the case of a single training example $(x, y)$, so that the sum in the definition of $J$ can be neglected, we get:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \frac{1}{2} (h_\theta(x) - y)^2
\\= \frac{1}{2} \cdot 2 (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta_j} (h_\theta(x) - y)
\\= (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right)
\\= (h_\theta(x) - y) x_j$$
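One way to sanity-check this result is to compare it against a central finite-difference approximation of the single-example cost. The sketch below does that; the function names and the sample values of `theta`, `x`, and `y` are assumptions chosen only for illustration.

```python
import numpy as np


def single_example_cost(theta, x, y):
    """Cost for one training example: 1/2 * (theta^T x - y)^2."""
    return 0.5 * (theta @ x - y) ** 2


def analytic_gradient(theta, x, y):
    """Gradient from the derivation: component j is (h(x) - y) * x_j."""
    return (theta @ x - y) * x


def numeric_gradient(theta, x, y, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (single_example_cost(theta + step, x, y)
                   - single_example_cost(theta - step, x, y)) / (2 * eps)
    return grad


theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -2.0])  # x_0 = 1 is the intercept term
y = 4.0

print(analytic_gradient(theta, x, y))
print(numeric_gradient(theta, x, y))
```

The two printed vectors should agree up to floating-point rounding, since the cost is quadratic in $\theta$.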

Therefore, for a single training example $(x^{(i)}, y^{(i)})$, we end up with the update rule:

$$\theta_{j} := \theta_j + \alpha \cdot (y^{(i)}  - h_\theta(x^{(i)})) x_j^{(i)}$$

The rule is called the LMS update rule and is also known as the Widrow-Hoff learning rule. Note that the magnitude of the update is proportional to the error term $(y^{(i)}  - h_\theta(x^{(i)}))$: a prediction that nearly matches $y^{(i)}$ barely changes the parameters, while a large error produces a large change. For a training set of $m$ examples, the update sums this term over all examples:

$$\theta_{j} := \theta_j + \alpha \sum_{i=1}^{m} (y^{(i)}  - h_\theta(x^{(i)})) x_j^{(i)}$$

This method looks at every example in the entire training set on every step, and is called **batch gradient descent**. It is also worth noting that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has a single global optimum and no other local optima; thus **gradient descent always converges** to the global minimum (assuming the learning rate $\alpha$ is not too large).
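Batch gradient descent with the LMS update can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `batch_gradient_descent`, the learning rate, the iteration count, and the noise-free toy data are all assumptions chosen so that the example converges.

```python
import numpy as np


def batch_gradient_descent(X, y, alpha=0.05, iterations=2000):
    """Repeat the LMS update, summing the error term over all examples each step."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        errors = y - X @ theta                  # y^(i) - h(x^(i)) for every example i
        theta = theta + alpha * (X.T @ errors)  # simultaneous update of all theta_j
    return theta


# Toy data generated from y = 1 + 2x, with the intercept column x_0 = 1 prepended.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = batch_gradient_descent(X, y)
print(theta)
```

On this data the recovered parameters approach $\theta_0 = 1$ and $\theta_1 = 2$. A noticeably larger `alpha` would make the iteration diverge, which is the practical meaning of the requirement that the learning rate not be too large.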

#### Example

// Continue...
