## Lesson 1
We begin by considering multiple features for a linear regression problem.
To add to our house problem from the previous week, we can add other dimensions, or features, like the number of bedrooms, the number of floors, the age of the home, etc.

**Note:** Andrew is going to use superscripts in the future to refer to a vector of training data, i.e. $x_j^{(i)}$ where x is the matrix of training data, i is the row index for the vector of one training example (the row gets transposed into a vector) and j is the row index into the vector for a particular value.

### Andrew's notes:

**Note:** $\theta^T$ is a 1 by (n+1) matrix and not an (n+1) by 1 matrix
Linear regression with multiple variables is also known as "multivariate linear regression".
We now introduce notation for equations where we can have any number of input variables.
\begin{align}
x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the input (features) of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \text{the number of features}
\end{align}

The multivariable form of the hypothesis function accommodating these multiple features is as follows:
$h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$
In order to develop intuition about this function, we can think about $\theta_0$ as the basic price of a house, $\theta_1$ as the price per square meter, $\theta_2$ as the price per floor, etc. $x_1$ will be the number of square meters in the house, $x_2$ the number of floors, etc.
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:
\begin{align}
h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & ... & \theta_n\end{bmatrix} \begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix} = \theta^T x
\end{align}
This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.
Remark: For convenience, this course assumes $x_{0}^{(i)} = 1 \text{ for } (i \in \{ 1,\dots, m \})$. This allows us to do matrix operations with $\theta$ and $x$, because the two vectors $\theta$ and $x^{(i)}$ then match each other element-wise (that is, they have the same number of elements: n+1).
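To make the $\theta^T x$ form concrete for myself, here is a minimal sketch in plain Python. The function name `hypothesis` and the house numbers are my own made-up illustration, not something from the lecture; the only thing taken from the notes is that $x_0 = 1$ is prepended so both vectors have n+1 entries.

```python
def hypothesis(theta, features):
    """Compute h_theta(x) = theta^T x for one training example.

    `features` holds x_1..x_n; the constant input x_0 = 1 is prepended here.
    """
    x = [1.0] + list(features)  # x_0 = 1, so theta and x both have n + 1 entries
    if len(theta) != len(x):
        raise ValueError("theta must have exactly one more entry than features")
    # Inner product of the two vectors, i.e. theta^T x.
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameters: base price 50, 2 per square meter, 10 per floor.
theta = [50.0, 2.0, 10.0]
house = [100.0, 2.0]  # 100 square meters, 2 floors
price = hypothesis(theta, house)
print(price)
```

With these made-up numbers the three terms are 50, 200 and 20, which matches the "basic price plus price per square meter plus price per floor" reading above.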

## Lesson 2
This lesson covers gradient descent for a multivariate linear regression problem.

Andrew revises our notation to accommodate matrices and vectors as first-class objects. He replaces the indefinite $\theta$ parameter list with the $\theta$ vector, and the function $\displaystyle J(\theta_0, \theta_1, \dots, \theta_n)$ with $\displaystyle J(\theta)$ over the vector.

Andrew updates his notation throughout the hypothesis function, the cost function, and in the gradient descent algorithm. I will not state this here, as I expect him to state it in his notes later.

Recall that gradient descent with a single feature involved calculating the partial derivative of the cost function with respect to each parameter. This has not changed, and with multivariate linear regression we do this across all parameters. Andrew again does not explain how we arrive at each partial derivative, but I will state it below for myself:

Consider our cost function:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Let's create a function, $g_\theta$, such that $g_\theta(x^{(i)}) = h_\theta(x^{(i)}) - y^{(i)}$, so that we can substitute. $J(\theta)$ is now
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^m\left(g_\theta(x^{(i)})\right)^2$$

Differential calculus tells us that the derivative of a polynomial term, e.g. $x^n$, is $nx^{n-1}$ (the power rule).

Differential calculus also tells us that the derivative for a system involving function application, e.g. $f(g(x))$, is $f^\prime(g(x))g^\prime(x)$

Consider the polynomial (the squaring) as the outer function $f(u) = u^2$, applied to $g_\theta(x^{(i)})$. Applying the two rules above to every term of the sum, we see that our partial derivative with respect to a single parameter $\theta_j$ is $$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{2m}\sum_{i=1}^m 2\,g_\theta(x^{(i)})\,\frac{\partial}{\partial\theta_j}g_\theta(x^{(i)})$$

With a partial derivative, we only apply differentiation rules like the ones above to terms involving the variable in question - in this case, a particular $\theta_j$. Knowing the definition of $g_\theta(x^{(i)})$, we can see that it is a linear function of the theta terms minus a constant, $y^{(i)}$. The rule in differential calculus for a constant is that its derivative is 0. Since every term in $h_\theta(x^{(i)})$ is treated as constant, save for the one containing the $\theta_j$ in question, the derivative of $g_\theta(x^{(i)})$ collapses to the one x term that our $\theta_j$ multiplies in each training example, so $x_j^{(i)}$. This leaves us with
$$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{2m}\sum_{i=1}^m 2\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

The constant factor of 2 can be pulled out of the sum, where it cancels against the $\frac{1}{2}$ in $\frac{1}{2m}$, simplifying it to the formula we see embedded in Andrew's notes for the gradient descent algorithm:

\begin{align*}
& \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace
\end{align*}
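The update rule above can be sketched in plain Python as follows. This is my own minimal version, not Andrew's code: the function names, the toy data, the default `alpha` and the fixed iteration count are all assumptions chosen for illustration. The one detail I want to capture is that every partial derivative is computed from the same $\theta$ before any parameter is updated.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already includes x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared errors."""
    m = len(X)
    return sum((predict(theta, x) - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(X)
    errors = [predict(theta, x) - yi for x, yi in zip(X, y)]
    # All partial derivatives are computed from the old theta first.
    gradient = [sum(e * x[j] for e, x in zip(errors, X)) / m
                for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, gradient)]


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta


# Toy data generated from y = 1 + 2*x1 + 3*x2; each row starts with x_0 = 1.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]
theta = gradient_descent(X, y)
print([round(t, 3) for t in theta])  # should approach [1, 2, 3]
```

A fixed iteration count keeps the sketch short; a real run would more likely stop on a convergence test.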

## Lesson 3

This lecture covered feature scaling, which is the idea that you normalize your features so that they're all approximately at the same order of magnitude. This is because, without feature scaling, the contours of your cost function graph can be very very steep, or distorted. We may recall that, with gradient descent, we are vulnerable to each step oscillating on some dimension, as our individual steps "step over" some ideal path and "turn around" repeatedly, when our $\alpha$ is too large for the data; the trouble is, small $\alpha$ are themselves expensive, as they force us to very very slowly descend the gradient - it's much better if we can pick a large $\alpha$, without worrying about this oscillation.

When we scale, this "smooths" out the contours of our cost function in our parameter space, and that smoothness allows us to safely pick an alpha term that isn't at significant risk of oscillation, so we can converge on the answer a lot more quickly.

Andrew recommends that your training features all be normalized to the range $-1 \leq x_i \leq 1$, but also says that this isn't a strict requirement, and so long as the features are reasonably close (for some definition of reasonable), we shouldn't worry too much about it.

In any event, I made the observation that scaling your data is pretty trivial, and just requires some basic arithmetic: translating the feature data so that its mean sits at/near 0 (through addition/subtraction), and scaling the data to get it close to -1 and 1 at the edges (through division).

It also occurs to me that it might be worthwhile to consider your training set as a sample population, and therefore try to apply some basic statistics when considering the scaling factor for normalizing your data (as well as trying to figure out the population mean, as opposed to the sample mean). I will need to visit statistics, as I'm unfortunately not very knowledgeable in the subject.

### Andrew's notes:

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally: $-1 \leq x_i \leq 1$ or $-0.5 \leq x_i \leq 0.5$.

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

Two techniques to help with this are **feature scaling** and **mean normalization**. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

$$x_i := \dfrac{x_i - \mu_i}{s_i}$$

Where $\mu_i$ is the **average** of all the values for feature $i$ and $s_i$ is either the range of values (max - min) or the standard deviation.

Note that dividing by the range and dividing by the standard deviation give different results. The quizzes in this course use the range; the programming exercises use the standard deviation.

For example, if $x_i$ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then

$$x_i := \dfrac{price-1000}{1900}$$
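The formula above can be sketched in a few lines of Python. The helper name `mean_normalize`, the `use_std` flag and the three sample prices are my own assumptions; the prices were picked only so that the mean is 1000 and the range is 1900, as in the example. For the standard deviation option I use the population form (dividing by m), which is also an assumption and may differ from what a given library computes by default.

```python
def mean_normalize(values, use_std=False):
    """Return (scaled values, mu, s) for one feature column.

    s is the range (max - min) by default, or the population standard
    deviation when use_std is True.
    """
    m = len(values)
    mu = sum(values) / m
    if use_std:
        s = (sum((v - mu) ** 2 for v in values) / m) ** 0.5
    else:
        s = max(values) - min(values)
    if s == 0:
        raise ValueError("feature is constant, so there is nothing to scale")
    return [(v - mu) / s for v in values], mu, s


# Hypothetical prices with minimum 100, maximum 2000 and mean 1000.
prices = [100.0, 900.0, 2000.0]
scaled, mu, s = mean_normalize(prices)
print(mu, s)
print(scaled)
```

The returned `mu` and `s` need to be kept around, because any new house we want a prediction for has to be scaled with the same values.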

## Lesson 4

This lecture covered the learning rate, how to choose $\alpha$ to get the best learning rate, and how to debug the model to make sure gradient descent is working correctly.

Andrew determines the correctness of gradient descent by plotting the value of $\displaystyle J(\theta)$ against the number of iterations. You should see that $\displaystyle J(\theta)$ decreases with every iteration (**note:** this is only when dealing with linear regression) and that the curve looks roughly like a negative exponential, e.g. $e^{-x}$.

If you plot $\displaystyle J(\theta)$ and see an upward trend over the number of iterations, something is wrong, and your alpha is likely too big, causing it to "climb out of the bowl" by stepping over the global minimum in every step.

If your plot looks like some sort of harmonic function, you're dealing with the same problem, which is that your $\alpha$ is too large.

The way Andrew says he finds a good alpha term is by starting at some intentionally small version of alpha, verifying the monotonic decrease in cost function over iterations, and slowly ramping it up with a threefold increase rule. I believe the intention is to get roughly two values of alpha within each order of magnitude in a base 10 system, e.g. .0001 -> .0003 -> .001, etc.
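Here is a rough sketch of how I picture that search, under my own assumptions: the helper names, the toy data set and the choice of 50 iterations are all invented for illustration. It generates the roughly-threefold sequence of alphas and, for each one, checks whether the recorded cost decreased on every step.

```python
def alpha_candidates(smallest_exponent=-4, count=9):
    """Learning rates 1e-4, 3e-4, 1e-3, 3e-3, ... (each about 3x the last)."""
    return [(1 if k % 2 == 0 else 3) * 10.0 ** (smallest_exponent + k // 2)
            for k in range(count)]


def cost_history(alpha, X, y, iterations=50):
    """Run batch gradient descent from theta = 0, recording J(theta) per step."""
    m = len(X)
    theta = [0.0] * len(X[0])
    history = []
    for _ in range(iterations):
        errors = [sum(t * xi for t, xi in zip(theta, x)) - yi
                  for x, yi in zip(X, y)]
        theta = [t - alpha * sum(e * x[j] for e, x in zip(errors, X)) / m
                 for j, t in enumerate(theta)]
        errors = [sum(t * xi for t, xi in zip(theta, x)) - yi
                  for x, yi in zip(X, y)]
        history.append(sum(e * e for e in errors) / (2 * m))
    return history


def decreases_every_step(history):
    return all(later < earlier for earlier, later in zip(history, history[1:]))


# Toy data from y = 2*x, with x_0 = 1 in the first column.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
for alpha in alpha_candidates():
    print(alpha, decreases_every_step(cost_history(alpha, X, y)))
```

On this toy data the small alphas all pass the check and the largest one fails it, which is the signal to back off to the previous candidate.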

Andrew also offered a convergence test: Declare convergence if $J(\theta)$ decreases by less than $E$ in one iteration, where $E$ is some small value such as $10^{-3}$.
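A minimal sketch of that convergence test, assuming my own function names and toy data, with the threshold E written as `epsilon`. One thing the sentence above does not cover is what to do when the cost goes up; raising an error in that case is my own choice, since a negative "decrease" would otherwise pass the test by accident.

```python
def cost(theta, X, y):
    m = len(X)
    return sum((sum(t * xi for t, xi in zip(theta, x)) - yi) ** 2
               for x, yi in zip(X, y)) / (2 * m)


def step(theta, X, y, alpha):
    m = len(X)
    errors = [sum(t * xi for t, xi in zip(theta, x)) - yi for x, yi in zip(X, y)]
    return [t - alpha * sum(e * x[j] for e, x in zip(errors, X)) / m
            for j, t in enumerate(theta)]


def descend_until_converged(X, y, alpha, epsilon=1e-3, max_iterations=10000):
    """Stop once J(theta) decreases by less than epsilon in one iteration."""
    theta = [0.0] * len(X[0])
    previous = cost(theta, X, y)
    for iteration in range(1, max_iterations + 1):
        theta = step(theta, X, y, alpha)
        current = cost(theta, X, y)
        if current > previous:
            raise RuntimeError("cost increased; alpha is probably too large")
        if previous - current < epsilon:
            return theta, iteration
        previous = current
    return theta, max_iterations


# Toy data from y = 2*x, with x_0 = 1 in the first column.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta, iterations = descend_until_converged(X, y, alpha=0.1)
print(iterations, theta, cost(theta, X, y))
```

Running this made it obvious to me that the threshold matters: with a loose `epsilon` the loop stops while $\theta$ is still some way from the optimum, so a smaller value trades more iterations for a closer fit.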

## Lesson 5

This lecture covered feature selection, and how to apply regression analysis on polynomial problems and beyond.

In the first example for feature selection, Andrew shows that, when considering the width and height of a house's lot, you don't need to treat these as separate features; instead, you can consider the area of the lot, which is the product of the two terms. This is basically an example of how it is sometimes necessary to think critically about your data and the properties it contains, including whether some features are really constituents of a larger feature you care about.

Andrew moves on and talks about a situation where your hypothesis function for house price prediction might better be modelled as a polynomial function, rather than a linear one. Andrew gives an example where we only have one feature, size, but the data seems to show a cubic relationship between size and price.

Before, our hypothesis function was a linear combination of parameters over multiple features. This time, we want to consider polynomial powers of a single feature, with each power given its own parameter:

\begin{align}
h_\theta(x) &= \theta_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_nx_n \\
h_\theta(x) &= \theta_0 + \theta_1x + \theta_2x^2 + \cdots + \theta_nx^n
\end{align}

In the second formula, we're treating each power of the size of the house as its own feature.

**Note** Andrew also points out that when dealing with these powers, normalizing your training data becomes a lot more important, because these powers are going to produce significantly different output before normalization.

Andrew also shows that we're not limited to polynomials; we can also consider terms like $\sqrt x$ and other functions of the feature.

**Personal note** while I can certainly see how the polynomial features match up to multivariate feature functions we saw earlier, what I don't yet understand is how we perform the calculation using a matrix. It seems that we need functions embedded in the matrix in order to calculate those powers. I'd like to better understand how we can perform polynomial regression more, and express it in a way that is compatible with matrix multiplication.
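One way I can imagine resolving this, sketched under my own assumptions (the function names and numbers below are invented, not from the lecture): the powers do not have to live inside the matrix as functions. They can be computed once, up front, and stored as ordinary numeric columns. After that the matrix is just a matrix, and the hypothesis is still a plain product with $\theta$, because the model stays linear in the parameters.

```python
def polynomial_features(x, degree):
    """Map a single feature x to the row [1, x, x^2, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]


def design_matrix(sizes, degree):
    """One precomputed row of powers per training example."""
    return [polynomial_features(s, degree) for s in sizes]


def predict_all(X, theta):
    """Matrix-vector product: one prediction per row of X."""
    return [sum(t * xi for t, xi in zip(theta, row)) for row in X]


sizes = [1.0, 2.0, 3.0]            # hypothetical house sizes
X = design_matrix(sizes, degree=3)
theta = [1.0, 0.0, 0.0, 2.0]       # hypothetical parameters: h(x) = 1 + 2x^3
print(X)
print(predict_all(X, theta))
```

The columns of `X` grow very quickly with the power, which lines up with the earlier note that normalizing becomes a lot more important for these features.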

### Andrew's notes

I messed up and refreshed the page, so I can't copy his LaTeX notes. Honestly, though, it's all in my earlier notes.