In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Let us look at our house pricing dataset again.

In [3]:
df = pd.read_csv('../data/portland_housing_prices.csv')
df.head()

Unnamed: 0,area,bedrooms,price
0,2104,3,399900
1,1600,3,329900
2,2400,3,369000
3,1416,2,232000
4,3000,4,539900


- $x \in \bf{R}^2$
- $x_1^{(i)}$ - living area of the $i^{th}$ training sample
- $x_2^{(i)}$ - number of bedrooms in the $i^{th}$ training sample

Now we need to decide the structure of the hypothesis $h$. Let us assume that we approximate $y$ as a linear function of $x$: $$h_{\theta}(x) = {\theta}_0 + {\theta}_1x_1 + {\theta}_2x_2$$

Here the ${\theta}_j$ are the parameters that parameterize the space of linear functions mapping from $\bf{X}$ to $\bf{Y}$. <br>
If we adopt the convention $x_0 = 1$ (the intercept term), then $h_{\theta}(x)$ can be written compactly as
$$h_{\theta}(x) = \Sigma_{j = 0}^{d}{\theta}_jx_j = \theta^Tx$$
where $d$ is the number of input features ($d = 2$ here).
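The vectorized form $\theta^Tx$ can be sketched in NumPy as follows. This is a minimal illustration that uses only the five rows printed above; the function name `hypothesis` and the values in `theta` are hypothetical and chosen purely for demonstration.

```python
import numpy as np

# The five rows shown by df.head(): living area and number of bedrooms.
features = np.array(
    [[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float
)

# Prepend the intercept column x_0 = 1 so that h_theta(x) = theta^T x.
X = np.column_stack([np.ones(len(features)), features])

# Hypothetical parameter values (theta_0, theta_1, theta_2), for illustration only.
theta = np.array([50000.0, 100.0, 10000.0])


def hypothesis(theta, X):
    """Evaluate h_theta(x) = theta^T x for every row of X."""
    return X @ theta


predictions = hypothesis(theta, X)
print(predictions)
```

Stacking the samples as rows of `X` lets one matrix-vector product evaluate the hypothesis for all training samples at once.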

Now, our goal is to learn the parameters $\theta$. We can do so by making $h_{\theta}(x)$ as close to $y$ as possible on the training samples. A **cost function** gives us a quantitative measure of that closeness.

### Cost Function - 
For each value of $\theta$, the cost function measures how close the predictions $h_{\theta}(x^{(i)})$ are to the targets $y^{(i)}$.

For the given regression problem, we can define the cost function (a function of $\theta$) as follows:
$$J(\theta) = \frac{1}{2}\Sigma_{i = 1}^{n}(h_{\theta}(x^{(i)}) - y^{(i)})^2$$
where $n$ is the number of training samples.

Minimizing this squared-error cost function gives rise to the **ordinary least squares** solution.
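As a rough sketch, the cost $J(\theta)$ can be computed in a few lines of NumPy. The function name `cost` is an assumption made for this example, and the data are again the five rows printed above, with the intercept column $x_0 = 1$ already added.

```python
import numpy as np

# Design matrix for the five rows shown above: columns are x_0 = 1, area, bedrooms.
X = np.array(
    [
        [1.0, 2104.0, 3.0],
        [1.0, 1600.0, 3.0],
        [1.0, 2400.0, 3.0],
        [1.0, 1416.0, 2.0],
        [1.0, 3000.0, 4.0],
    ]
)
# Observed prices for the same five rows.
y = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)


def cost(theta, X, y):
    """Squared-error cost J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)


# With theta = 0 every prediction is 0, so J reduces to 1/2 * sum_i y_i^2.
j_zero = cost(np.zeros(3), X, y)
print(j_zero)
```

A smaller value of `cost` means the predictions sit closer to the observed prices; a value of zero would mean every training sample is fitted exactly.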

## LMS Algorithm

We need to find the value of $\theta$ that minimizes $J(\theta)$. We'll use the **gradient descent** algorithm: we start with an initial guess for $\theta$ and repeatedly update it, moving towards the value of $\theta$ that minimizes $J(\theta)$.
$$\theta_{j} := \theta_{j} - \alpha\frac{\partial J(\theta)}{\partial {\theta}_j}$$
- $\alpha$ - learning rate 

The update is applied simultaneously for all $j = 0, \dots, d$.

To apply this update we first need the partial derivative of $J(\theta)$ with respect to each ${\theta}_j$:
$$\begin{align*}
\frac{\partial J(\theta)}{\partial {\theta}_j} &= \frac{\partial }{\partial {\theta}_j}\frac{1}{2}\Sigma_{i = 1}^{n}(h_{\theta}(x^{(i)}) - y^{(i)})^2 \\
&= \frac{1}{2}\Sigma_{i = 1}^{n}\frac{\partial }{\partial {\theta}_j}(\theta^Tx^{(i)} - y^{(i)})^2 \\
&= \Sigma_{i = 1}^{n} (\theta^Tx^{(i)} - y^{(i)})\frac{\partial }{\partial {\theta}_j}(\theta^Tx^{(i)} - y^{(i)}) \\
&= \Sigma_{i = 1}^{n} (\theta^Tx^{(i)} - y^{(i)})x_j^{(i)}
\end{align*}
$$
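One way to sanity-check the last line of this derivation is to compare it against a central finite-difference approximation of $J(\theta)$. The sketch below does that on the five rows printed earlier; the helper names (`cost`, `gradient`, `numerical_gradient`), the step size `h`, and the test point `theta` are all assumptions made for illustration.

```python
import numpy as np

# Design matrix (x_0 = 1, area, bedrooms) and prices for the five rows shown above.
X = np.array(
    [
        [1.0, 2104.0, 3.0],
        [1.0, 1600.0, 3.0],
        [1.0, 2400.0, 3.0],
        [1.0, 1416.0, 2.0],
        [1.0, 3000.0, 4.0],
    ]
)
y = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)


def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
    return 0.5 * np.sum((X @ theta - y) ** 2)


def gradient(theta, X, y):
    """Analytic gradient: component j is sum_i (theta^T x_i - y_i) * x_ij."""
    return X.T @ (X @ theta - y)


def numerical_gradient(theta, X, y, h=1e-2):
    """Central finite-difference approximation of dJ/dtheta_j for each j."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * h)
    return grad


# Arbitrary test point, chosen only for the check.
theta = np.array([1000.0, 100.0, 5000.0])
analytic = gradient(theta, X, y)
numeric = numerical_gradient(theta, X, y)
print(analytic)
print(numeric)
```

Because $J$ is quadratic in $\theta$, the central difference matches the analytic gradient up to floating-point rounding, so the two printed vectors should agree closely.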

$$
\therefore {\theta}_j := {\theta}_j + \alpha \Sigma_{i = 1}^{n} (y^{(i)} - \theta^Tx^{(i)})x_j^{(i)}
$$

This rule is called the **LMS** (least mean squares) update rule, also known as the **Widrow-Hoff** learning rule.
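The update rule can be sketched as a batch gradient descent loop. Two things in this example are assumptions that do not come from the text above: the features are standardized (zero mean, unit variance) so that a single learning rate works for both columns, and the values `alpha=0.1` and `n_iters=1000` are illustrative choices. The function name `batch_gradient_descent` is likewise hypothetical.

```python
import numpy as np

# The five rows shown by df.head(): (area, bedrooms) and price.
features = np.array(
    [[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float
)
y = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)

# Assumption: standardize each feature so one learning rate suits all of them.
features_scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Prepend the intercept column x_0 = 1.
X = np.column_stack([np.ones(len(features_scaled)), features_scaled])


def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Repeat theta_j := theta_j + alpha * sum_i (y_i - theta^T x_i) * x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # All components of theta are updated simultaneously from the old theta.
        theta = theta + alpha * (X.T @ (y - X @ theta))
    return theta


theta = batch_gradient_descent(X, y)
print(theta)
```

Note that the learned `theta` refers to the standardized features, not the raw area and bedroom counts. Without some form of feature scaling, the large magnitude of the area column would force a much smaller learning rate and far more iterations.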