### Notation
$x^{(i)}$ denotes the $i$th row of the matrix $X$.

### Feature Map Definition

A feature map $\phi$ can be used to map our $d$-dimensional input vector $x^{(i)}$ into a higher-dimensional space. For example:

$$
\phi(x) = 
\begin{bmatrix}
1 \\
x_1\\
x_2\\
x_2 \cdot x_1
\end{bmatrix}
\quad
\phi\left( \begin{bmatrix} 2 & 4 \end{bmatrix}\right) = 
\begin{bmatrix}
1 \\
2\\
4\\
4 \cdot 2
\end{bmatrix}
$$

And thus 
$$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_2 \cdot x_1 \implies h_\theta(x^{(i)}) = \theta^T \cdot \phi(x^{(i)})$$
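
The example map can be sketched in a few lines of Python. The function names `phi` and `h` and the parameter values in `theta` are illustrative assumptions, not part of the notes.

```python
def phi(x):
    """Map a 2-dimensional input [x1, x2] to [1, x1, x2, x2 * x1]."""
    x1, x2 = x
    return [1.0, x1, x2, x2 * x1]


def h(theta, x):
    """Hypothesis h_theta(x) = theta^T phi(x)."""
    return sum(t * f for t, f in zip(theta, phi(x)))


features = phi([2.0, 4.0])            # [1.0, 2.0, 4.0, 8.0]
theta = [0.5, 1.0, -1.0, 2.0]         # arbitrary values, for illustration only
prediction = h(theta, [2.0, 4.0])     # 0.5*1 + 1*2 - 1*4 + 2*8 = 14.5
```

The model stays linear in $\theta$; only the inputs are transformed.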

The update rule for batch gradient descent (the full-batch counterpart of SGD) using a map $\phi$ is

$$\theta := \theta + \alpha \cdot \sum_{i=1}^N (y^{(i)} - \theta^T \cdot \phi(x^{(i)})) \cdot \phi(x^{(i)})$$

### SGD Runtime Complexity
The update rule for batch gradient descent is 
$$\theta := \theta + \alpha \cdot \sum_{i=1}^N (y^{(i)} - \theta^T \cdot \phi(x^{(i)})) \cdot \phi(x^{(i)})$$

Each update costs $O(N \cdot P)$ time, where $P$ is the number of parameters, i.e. the dimension of $\phi(x)$ (written $\| \phi \| = P$).
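
A minimal pure-Python sketch of one such update makes the cost visible: the outer loop runs over the $N$ records and each record does work proportional to $P$. The toy data, the learning rate, and the name `batch_update` are assumptions chosen for illustration.

```python
def phi(x):
    """Example feature map from above: [1, x1, x2, x2 * x1], so P = 4."""
    x1, x2 = x
    return [1.0, x1, x2, x2 * x1]


def batch_update(theta, X, y, alpha):
    """theta := theta + alpha * sum_i (y_i - theta^T phi(x_i)) * phi(x_i)."""
    P = len(theta)
    total = [0.0] * P
    for x_i, y_i in zip(X, y):                                   # N records
        f = phi(x_i)
        residual = y_i - sum(t * v for t, v in zip(theta, f))    # O(P)
        for j in range(P):                                       # O(P)
            total[j] += residual * f[j]
    return [t + alpha * s for t, s in zip(theta, total)]


# Toy data generated from y = 1 + x1 + 2*x2 + x1*x2 (an assumed example).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0, 5.0]

theta = [0.0] * 4
for _ in range(2000):
    theta = batch_update(theta, X, y, alpha=0.1)
# theta is now close to [1, 1, 2, 1]
```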

### Quadratic Number of Parameters
In traditional Linear Regression, 
$$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 = \theta^T \cdot x^{(i)}$$

there is one parameter for every input dimension plus an intercept, $\| \theta \| = \| \phi \| = \| x^{(i)} \| + 1$, so optimizing using SGD is manageable. Sometimes, however, this simple model is not expressive enough.

Imagine we are trying to predict housing prices and we know the square footage and price. A common indicator in real estate is price per square foot, or ppf. We could create a new column $x_3 = \frac{x_2}{x_1}$ and then re-fit linear regression. But imagine we had dozens or even hundreds of variables pertaining to the value of a house, and we wanted to include each of their pairwise ratios:

$$\phi(x) = \begin{bmatrix}
1\\
x_1\\
\ldots\\
x_n \\
\frac{x_1}{x_2}\\
\ldots\\
\frac{x_{n}}{x_{n-1}}
\end{bmatrix}
$$

This system would quickly become inefficient, because the number of parameters grows quadratically (not exponentially) with respect to the number of dimensions, $\| \theta \| = \| \phi \| \approx (\| x^{(i)} \|)^2$, and as a result the SGD runtime complexity would also explode. A dataset with $1000$ columns would suddenly have roughly $1,000,000$ different features!
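
The growth can be checked with a short sketch. It assumes the map contains the intercept, the $n$ raw features, and the ratio $x_j / x_k$ for every ordered pair $j \neq k$ (one reading of the vector above); the function names are hypothetical.

```python
from itertools import permutations


def ratio_feature_map(x):
    """[1, x_1..x_n, then x_j / x_k for every ordered pair j != k].

    Assumes no entry of x is zero, otherwise a ratio is undefined.
    """
    return [1.0] + list(x) + [a / b for a, b in permutations(x, 2)]


def num_ratio_features(n):
    """Length of ratio_feature_map for n inputs: 1 + n + n*(n-1) = n**2 + 1."""
    return 1 + n + n * (n - 1)


small = ratio_feature_map([1.0, 2.0, 4.0])            # 1 + 3 + 6 = 10 entries
sizes = {n: num_ratio_features(n) for n in (2, 10, 100, 1000)}
# sizes[1000] is 1000001, i.e. roughly one million features
```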

### The Kernel Trick
Fitting functions with large feature maps is expensive, $O(NP)$ per update. One way to reduce the complexity would be to express the same function with fewer parameters. The Kernel Trick does this by rewriting $\theta$ as a linear combination of the mapped records, each multiplied by some constant $\beta_i$:

$$\theta = \sum_{i=1}^N \beta_i \cdot \phi(x^{(i)}) \quad \beta_i \in \mathbb{R} $$

Note that if we can prove this, then $\theta$ is expressed in terms of $N$ parameters $\{\beta_1, \beta_2, \ldots, \beta_N\}$, rather than the original $P$ parameters.
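
To see what this representation would buy us, here is a short worked substitution (a sketch that assumes the identity for $\theta$ holds). Plugging it into the hypothesis gives

$$h_\theta(x) = \theta^T \cdot \phi(x) = \left(\sum_{i=1}^N \beta_i \cdot \phi(x^{(i)})\right)^T \cdot \phi(x) = \sum_{i=1}^N \beta_i \cdot \phi(x^{(i)})^T \cdot \phi(x)$$

so a prediction needs only the $N$ coefficients $\beta_i$ and the inner products between mapped inputs.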

### The Kernel Trick Proof
We want to show (W.T.S.) that
$$\theta = \sum_{i=1}^N \beta_i \cdot \phi(x^{(i)}) \quad \beta_i \in \mathbb{R} $$
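
One standard way to argue this is by induction over the update steps. The sketch below assumes $\theta$ is initialized to $0$ (an assumption not stated in these notes) and uses the batch update rule from earlier. For the base case, $\theta = 0$ is the sum with every $\beta_i = 0$. For the inductive step, suppose $\theta = \sum_{i=1}^N \beta_i \cdot \phi(x^{(i)})$ before an update. Then

$$
\begin{aligned}
\theta &:= \sum_{i=1}^N \beta_i \cdot \phi(x^{(i)}) + \alpha \cdot \sum_{i=1}^N (y^{(i)} - \theta^T \cdot \phi(x^{(i)})) \cdot \phi(x^{(i)}) \\
&= \sum_{i=1}^N \underbrace{\left(\beta_i + \alpha \cdot (y^{(i)} - \theta^T \cdot \phi(x^{(i)}))\right)}_{\text{new } \beta_i} \cdot \phi(x^{(i)})
\end{aligned}
$$

so the updated $\theta$ is again a linear combination of the $\phi(x^{(i)})$, with each coefficient changed by $\alpha \cdot (y^{(i)} - \theta^T \cdot \phi(x^{(i)}))$.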

### A new algorithm

During the **N+1** step, we substituted $\alpha \cdot (y^{(i)} - \theta^T \cdot \phi(x^{(i)}))$ for $\beta_{N + 1}$.
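
As a hedged sketch of what such an algorithm could look like, the code below updates the coefficients $\beta_i$ directly, using $\beta_i := \beta_i + \alpha \cdot (y^{(i)} - \sum_{j} \beta_j \cdot \phi(x^{(j)})^T \cdot \phi(x^{(i)}))$. It assumes $\theta$ (and hence every $\beta_i$) starts at $0$. The names `fit_beta` and `theta_from_beta`, the toy data, and the learning rate are illustrative assumptions.

```python
def phi(x):
    """Example feature map: [1, x1, x2, x2 * x1]."""
    x1, x2 = x
    return [1.0, x1, x2, x2 * x1]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def fit_beta(X, y, alpha, steps):
    """Run the batch update on the N coefficients beta instead of theta."""
    N = len(X)
    feats = [phi(x) for x in X]
    # K[i][j] = phi(x_i)^T phi(x_j); computed once, symmetric.
    K = [[dot(feats[i], feats[j]) for j in range(N)] for i in range(N)]
    beta = [0.0] * N
    for _ in range(steps):
        # theta^T phi(x_i) = sum_j beta_j * K[i][j], using the old beta for every i
        preds = [dot(beta, K[i]) for i in range(N)]
        beta = [b + alpha * (y_i - p) for b, y_i, p in zip(beta, y, preds)]
    return beta


def theta_from_beta(beta, X):
    """Recover theta = sum_i beta_i * phi(x_i), only to compare with the direct fit."""
    theta = [0.0] * len(phi(X[0]))
    for b, x in zip(beta, X):
        for j, v in enumerate(phi(x)):
            theta[j] += b * v
    return theta


# Toy data generated from y = 1 + x1 + 2*x2 + x1*x2 (an assumed example).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0, 5.0]

beta = fit_beta(X, y, alpha=0.1, steps=2000)
theta = theta_from_beta(beta, X)      # close to [1, 1, 2, 1]
```

Once `K` is precomputed, each update costs $O(N^2)$ regardless of $P$. This toy example builds `K` from the explicit map purely for illustration.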
