# Week 11

More Neural Networks

In [None]:
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/data_utils.py
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/image_utils.py

In [None]:
import torch
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from torch import nn, Tensor
from torch.utils.data import DataLoader, Dataset

from data_utils import classification_error, display_confusion_matrix, object_from_json_url
from data_utils import LFWUtils, StandardScaler
from image_utils import open_image, make_image

## Tensors

We'll be using the [PyTorch](https://pytorch.org/) library for working with Neural Networks.

Before we start building, training, tuning models, we have to learn a little bit about [Tensors](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html)!

<img src="./imgs/tensors.jpg" width="600px"/>

Tensor is a fancy word for a multi-dimensional list. Tensors are very much like lists: they keep a sequence of number values, or a sequence of other tensors. They are a little bit more picky than lists because they require all members to be of the same _type_ (all integers, or all floats, etc), and they don't like having inner lists of different lengths.

PyTorch tensors are optimized for doing neural network operations, and so they come with a few extra capabilities beyond `sum()`, `sort()`, `mean()`, etc.
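
As a minimal sketch of that pickiness, the cell below uses `torch.tensor()` (the lowercase factory function, a sibling of the `Tensor()` constructor used in the rest of this notebook) on two made-up lists:

```python
import torch

# mixed ints and floats get converted to one single type
t = torch.tensor([[1, 2.5], [3, 4]])
print(t.dtype, t.shape)

# inner lists of different lengths are rejected
ragged_failed = False
try:
  torch.tensor([[1, 2], [3]])
except ValueError as e:
  ragged_failed = True
  print("error:", e)
```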

We already imported them above, so let's take a look at how to work with multi-dimensional tensors:

### Loading and Shaping

Let's open up an image and load its pixels into a tensor.

In [None]:
mimg = open_image("./data/images/arara.jpg")

display(mimg)
print(mimg.pixels[:5])

To make a tensor out of this, we just have to pass the list of pixels to the `Tensor()` constructor.

We can check its size with the `shape` member variable, and use slicing and indexing like we've always used with lists:

In [None]:
mimg_t = Tensor(mimg.pixels)
mimg_t.shape, mimg_t[:5], mimg_t[5], mimg_t[5][0]

The shape of this tensor is $607\text{,}500 \times 3$, meaning that we have $607\text{,}500$ pixels and each pixel has $3$ color values.

Let's reshape the tensor so it's more representative of our image's dimensions. We want to have a tensor of shape $h \times w \times 3$, where $h$ and $w$ are the image's `height` and `width` dimensions.

The `reshape()` function does just this; we just have to pass the parameters in the right order.

In [None]:
mimg_t = Tensor(mimg.pixels).reshape(mimg.size[1], mimg.size[0], 3)

mimg_t.shape, mimg_t[:5].shape, mimg_t[:5], mimg_t[0][5], mimg_t[0, 5]

Now `mimg_t[:5]` doesn't refer to the first $5$ pixels anymore, but to the first $5$ rows of our image.

To get the first $5$ pixels we can use `mimg_t[0][:5]` or `mimg_t[0, :5]`.

New syntax! : We can use multiple numbers inside the square brackets, separated with a comma.
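
Here is a small sketch of the same reshape and indexing on a made-up $2 \times 3$ image, where the numbers are easy to follow. The `pixels` list is hypothetical and stands in for `mimg.pixels`:

```python
import torch

# hypothetical image with h=2 and w=3, stored as a flat list of 6 RGB pixels
pixels = [[10*i, 10*i + 1, 10*i + 2] for i in range(6)]

flat_t = torch.Tensor(pixels)
img_t = flat_t.reshape(2, 3, 3)   # height first, then width, then channels
print(flat_t.shape, img_t.shape)

# the pixel at row 1, column 0 is pixel number 1*3 + 0 = 3 of the flat list
print(img_t[1, 0], flat_t[3])

# both ways of indexing give the first 2 pixels of the first row
print(img_t[0][:2], img_t[0, :2])
```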

In [None]:
mimg_t[0][:5], mimg_t[0, :5]

### Slicing

This is where it starts to get fun.

Since we now have our image in a $3D$ tensor, we can slice in multiple directions, and at the same time.

<img src="./imgs/slicing_00.jpg" width=800px />

#### Getting

For example, if we want to crop a part of the image, we can just get slices in the first two dimensions, like this:

`mimg_t[y0:y1, x0:x1]`

where `x0` and `y0` are the horizontal and vertical location of the top-left pixel of the region we want, and `x1` and `y1` are the coordinates just past the bottom-right pixel we want (like with lists, the end of a slice is not included).

So, to grab a $256$ X $256$ section of an image, starting at $(x,y) = (240, 30)$ we can do:

`mimg_crop_t = mimg_t[30:30+256, 240:240+256]`

In [None]:
x0,y0 = 240, 30

mimg_crop_t = mimg_t[y0:y0+256, x0:x0+256]

mimg_crop_t.shape, mimg_crop_t[0,:5]

In [None]:
mimg_crop = make_image(mimg_crop_t)
display(mimg_crop)
mimg_crop.pixels[:5]

#### Setting and Broadcasting

Slicing also works when assigning values to regions of our tensor/image.

Even if the values we're assigning don't perfectly match the region we want to assign them to, the tensor will try to _broadcast_ the value into the right places with the right shape.

For example, we can assign a single pixel value to an entire region with:

`mimg_t[y0:y1, x0:x1] = Tensor([220, 20, 120])`

and it knows to set every pixel in that region the same color.

Or, we can even do this, if we want to set a color in grayscale:

`mimg_t[y0:y1, x0:x1] = 220`

it will create a `Tensor([220, 220, 220])` to fill the pixel region specified.

The tensor will convert/broadcast the value into the right shape to fit the region we are slicing.
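
A minimal sketch of both kinds of broadcasting on a made-up $4 \times 4$ black image (the tensor and the regions are hypothetical, chosen so the result is easy to print):

```python
import torch

img_t = torch.zeros(4, 4, 3)   # hypothetical 4 x 4 black image

# one color, broadcast to a 2 x 2 region of pixels
img_t[0:2, 0:2] = torch.Tensor([220, 20, 120])

# one number, broadcast to all 3 channels of a 2 x 2 region
img_t[2:4, 2:4] = 220

print(img_t[0, 0], img_t[3, 3], img_t[0, 3])
```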

In [None]:
# copy the tensor for editing
mimg_blank_t = mimg_crop_t.clone()
display(make_image(mimg_blank_t))

mimg_blank_t[100:200, 10:110] = 0
display(make_image(mimg_blank_t))

In [None]:
mimg_blank_rows_t = mimg_crop_t.clone()

# TODO: try to assign colors to entire rows/column

display(make_image(mimg_blank_rows_t))

This multi-dimensional slicing also means that we can separate the color channels of our images using a single line of code, and no looping!

For looking at the `R` channel, just set `G` and `B` to `0`.

```python
mimg_crop_r_t[:, :, 1:3] = 0
```

The `:` in `[:, :, 1:3]` means grab every row and every column. Then `1:3` specifies the second and third channel of each pixel.

In [None]:
mimg_crop_r_t = mimg_crop_t.clone()
mimg_crop_r_t[:, :, 1:3] = 0

# look at first 5 pixels
mimg_crop_r_t[0, :5]

<img src="https://weeklydevotion.com/wp-content/uploads/2014/06/whoa.jpg" height=200px /> <img src="https://sites.tufts.edu/emotiononthebrain/files/2014/10/tumblr_m0wb2xz9Yh1r08e3p.jpg" height=200px />

In [None]:
display(make_image(mimg_crop_r_t))

In [None]:
mimg_crop_g_t = mimg_crop_t.clone()
# TODO: get separate green channel image

mimg_crop_b_t = mimg_crop_t.clone()
# TODO: get separate blue channel image

In [None]:
display(make_image(mimg_crop_r_t))
display(make_image(mimg_crop_g_t))
display(make_image(mimg_crop_b_t))

#### Slicing in Multiple Dimensions

We can combine slicing regions and slicing specific color channels to create effects with little code.

This creates an image by combining shifted versions of the separate `R`, `G` and `B` channel images from above:

In [None]:
# create an image the same shape as the original image, but with all 0s
mimg_crop_rgb_t = mimg_crop_t.clone()
mimg_crop_rgb_t[:] = 0

mimg_crop_rgb_t[:, 32:, 0] += mimg_crop_t[:, :-32, 0]
mimg_crop_rgb_t[:, :, 1] += mimg_crop_t[:, :, 1]
mimg_crop_rgb_t[:, :-32, 2] += mimg_crop_t[:, 32:, 2]

display(make_image(mimg_crop_rgb_t))

Code like this is not very professional-looking or understandable, but can be fun to write.

Don't worry if this effect isn't completely obvious at first, but try to break down each of the lines and each of the slicing expressions into simpler terms. Like:
- `mimg_crop_rgb_t[:] = 0`: sets all pixels to black, creating a black image with same dimensions as the original
- `mimg_crop_rgb_t[:, 32:, 0]`: from black image, selects all rows, all columns except first $32$, and red channel
- `mimg_crop_t[:, :-32, 0]`: from original image, all rows, all columns except last $32$, and red channel

... etc....
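
The same three slicing expressions can be sketched on a made-up image with a single row of $4$ pixels and a shift of $1$ instead of $32$, which makes the shifts visible in the printed numbers:

```python
import torch

# hypothetical image: 1 row, 4 columns, values 1..4 repeated in every channel
src = torch.arange(1., 5.).reshape(1, 4, 1).repeat(1, 1, 3)
out = torch.zeros_like(src)

out[:, 1:, 0] += src[:, :-1, 0]   # red channel, shifted right by 1
out[:, :, 1] += src[:, :, 1]      # green channel, not shifted
out[:, :-1, 2] += src[:, 1:, 2]   # blue channel, shifted left by 1

print(out[0])
```

Each row of the output is one pixel: the red values lag one column behind green, and the blue values run one column ahead.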

#### Changing Shape

We can also get the individual pixel values for each channel using slicing.

This gets all of the red values of all pixels as a two-dimensional tensor of shape $h$ X $w$:

`mimg_crop_t[:,:,0]`

After this operation, each pixel will only have $1$ channel, so when we display these images they will be grayscale representations of each channel.

In [None]:
print(mimg_crop_t[:,:,0].shape)
display(make_image(mimg_crop_t[:,:,0]))
display(make_image(mimg_crop_t[:,:,1]))
display(make_image(mimg_crop_t[:,:,2]))

### Operations along specific dimensions

Just like `DataFrames`, `Tensor` objects also have a bunch of built-in functions for performing common operations on their content.

Functions like, `sum()`, `mean()`, `max()`, `std()`, should be familiar:

In [None]:
my_t = Tensor([[1, 2], [2, 4], [-2, -1]])
print(my_t)
print(my_t.sum(), my_t.mean(), my_t.max(), my_t.std())

With `DataFrames` a lot of these functions would happen along columns, so we would get the `mean`, `max`, `sum` of each of the features in the dataset.

By default our `Tensor` performs these operations on all of its data and returns one value.

We can change this behavior by providing an extra argument to the functions, specifying the dimension along which we want to perform the operation. It helps to think of this parameter as the dimension we want to "_reduce_", or remove.

So, for example, `sum(0)` gets rid of the rows, by summing down the `Tensor` columns, while `mean(1)`, gets rid of the columns, by computing the average value of the `Tensor` rows.

In [None]:
print(my_t)
print(my_t.sum(0), my_t.mean(1))

What this means is that we can convert our image to grayscale in one line of code by reducing the $3^{rd}$ dimension, which holds the color values for each pixel.
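
As a quick sketch of how the shape changes, here is the same reduction on a made-up $1 \times 2$ image:

```python
import torch

# hypothetical 1 x 2 image: one orange-ish pixel and one white pixel
tiny_t = torch.Tensor([[[90, 30, 0], [255, 255, 255]]])
tiny_gs_t = tiny_t.mean(2)

# the color dimension is gone: 1 x 2 x 3 becomes 1 x 2
print(tiny_t.shape, tiny_gs_t.shape)
print(tiny_gs_t)
```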

In [None]:
mimg_crop_gs_t = mimg_crop_t.mean(2)
display(make_image(mimg_crop_gs_t))

### Filtering with Boolean Indexes

We can also select certain elements, regions, or dimensions of our tensors using boolean tensors.

Instead of passing numeric indexes, or slices, to our tensor's square brackets, we can select elements by passing a tensor of similar shape, but whose contents are `True`/`False` values.

This works for setting and getting elements.
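
A minimal sketch of getting and setting with a boolean tensor, on a small made-up tensor:

```python
import torch

t = torch.Tensor([[1, 8], [5, 2]])

mask = t > 4            # boolean tensor with the same shape as t
selected = t[mask]      # getting: a 1D tensor with the selected values
print(mask)
print(selected)

t[mask] = 0             # setting: only the selected positions change
print(t)
```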

The easiest way to create these boolean selector tensors is usually by manipulating the original tensor.

The following line of code creates a two-dimensional tensor whose elements are the difference between the `R` and `G` channels of our image:

`(mimg_crop_t[:,:,0] - mimg_crop_t[:,:,1])`

Its shape matches the first two dimensions of `mimg_crop_t`, but each position holds a single value instead of a list of pixel values.

In [None]:
mimg_rg_diff_t = mimg_crop_t[:,:,0] - mimg_crop_t[:,:,1]

print(mimg_rg_diff_t)
print(mimg_rg_diff_t.shape)

This line creates a boolean tensor, whose values specify whether the `R` channel value is larger than the `G` channel value by more than $80$, for every pixel in the image:

`((mimg_crop_t[:,:,0] - mimg_crop_t[:,:,1]) > 80)`

It holds boolean values.

In [None]:
mimg_rg_diff_thold_t = (mimg_crop_t[:,:,0] - mimg_crop_t[:,:,1]) > 80

print(mimg_rg_diff_thold_t)
print(mimg_rg_diff_thold_t.shape)

We could now use this indexing `Tensor` to select only those pixels from the original image and multiply them by the one-dimensional tensor `[4, 1, 1]` to exaggerate their `R` channel values by a factor of $4$, while keeping `G` and `B` intact:

In [None]:
mimg_red_bool_t = mimg_crop_t.clone()

rgtg_idx = ((mimg_crop_t[:,:,0] - mimg_crop_t[:,:,1]) > 80)
mimg_red_bool_t[rgtg_idx] *= Tensor([4, 1, 1])

display(make_image(mimg_red_bool_t))

#### More Filtering

Before running the cells... try to work out what the following indexing, selecting, slicing, assignments do.

We're going to be writing, but also reading, lots of code with some pretty intense, non-professional looking, `Tensor` operations.

In [None]:
# what does this do?
rgtg_idx = (mimg_crop_t[:,:,0] - mimg_crop_t[:,:,1]) > 80

# what about this?
rgtb_idx = (mimg_crop_t[:,:,0] - mimg_crop_t[:,:,2]) > 80

# and these ?
red_idx = rgtg_idx & rgtb_idx
not_red_idx = ~red_idx

In [None]:
mimg_idx_bool_t = mimg_crop_t.clone()
mimg_idx_bool_t[not_red_idx] = 0

display(make_image(mimg_idx_bool_t))

In [None]:
# what do these 2 lines do?
mimg_blank_t = mimg_crop_t.clone()
mimg_blank_t[:] = 0

# how is this cell different from the 2 previous ones?
mimg_blank_t[red_idx] = mimg_crop_t[red_idx]

display(make_image(mimg_blank_t))

And these?

In [None]:
# what does this cell do that is different from the grayscale filter above?
mimg_crop_gs_t = mimg_crop_t.mean(2)

mimg_crop_rgb_gs_t = mimg_crop_t.clone()
mimg_crop_rgb_gs_t[:,:,0] = mimg_crop_gs_t
mimg_crop_rgb_gs_t[:,:,1] = mimg_crop_gs_t
mimg_crop_rgb_gs_t[:,:,2] = mimg_crop_gs_t

In [None]:
mimg_gs_bool_t = mimg_crop_t.clone()

# what does this do?
mimg_gs_bool_t[not_red_idx] = mimg_crop_rgb_gs_t[not_red_idx]

display(make_image(mimg_gs_bool_t))

## More Tensors and Why They're Awesome

Multi-dimensional slicing is definitely a nice property of tensors, but what really sets them apart from fancy lists is their ability to keep track of all the operations performed on them using _computational graphs_.

If we define a tensor and set its `requires_grad` parameter to `True` we unlock some really nice properties that we can use for training neural networks.

One of these properties is the ability to automatically calculate derivatives (OMG, calculus!) of functions defined in terms of our tensor.

Let's investigate.

### Easy Calculus and Free Derivatives

Let's pretend we have the following function:

$f(x) = x^4 - 0.7x^3 - 2x^2 + x + 1$

And we want to find out when the function achieves its maximum and minimum values, when it equals $0$, or when it equals $0.5$.

We can plot it, and easily approximate those values visually:

In [None]:
def peaks(x):
  return x**4 - 0.7*x**3 - 2*x**2 + x + 1

In [None]:
# linspace is range()'s cousin, but for floats 
#   and where the 3rd argument specifies number of steps, not length of steps

x = torch.linspace(-1.3, 1.6, 300)
y = peaks(x)

plt.plot(x, y)
plt.plot([-1.3, 1.6], [0,0], '-')
plt.plot([-1.3, 1.6], [0.5, 0.5], '-')
plt.show()

Looks like the local minimum and maximum values are approximately at:
- $x = -0.9$ (global minimum)
- $x = 0.2$ (local maximum)
- $x = 1.2$ (local minimum)

It crosses $y = 0$ at:
- $x = -1.2$
- $x = -0.6$

And, it crosses $y=0.5$ a bunch of times, so we'll look at that later.

We can calculate much more precise values for these points in our graph if we define $x$ and $y$ as tensors and turn on `requires_grad`, which enables PyTorch's automatic gradient (autograd) functionality.

In [None]:
xt = torch.linspace(-1.3, 1.6, 8000, requires_grad=True)
yt = peaks(xt)
yt.backward(torch.ones_like(xt))

dydx = xt.grad
print("derivatives:", dydx[:5])

minmax_idx = (dydx.abs() < 9e-4)
minmax_y = yt[minmax_idx]
minmax_x = xt[minmax_idx]

plt.plot(x, y)
plt.plot(minmax_x.tolist(), minmax_y.tolist(), 'o')
plt.show()

print("min/max:", minmax_x, minmax_y)

### Wait. What?

Let's look at the individual commands in the cell above.

`xt`: this is a $1D$ tensor of shape $8000$ with values from $-1.3$ to $1.6$.

`yt`: this is a $1D$ tensor of shape $8000$ which holds the results of calling `peaks()` on every value of `xt`.

`yt.backward(torch.ones_like(xt))`: this calculates the derivatives (slope) of the equation `peaks()` for every point of `yt` and `xt`. The `torch.ones_like(xt)` parameter is a bit unconventional and usually we'll just call `backward()` without any parameters. It's necessary here because instead of asking for the derivative of an equation at one specific point, we want to get the derivatives for all points in our `xt` range tensor.

`dydx = xt.grad`: after calling `backward()` on a tensor (`yt`) that depends on tensors with `requires_grad` (`xt`), the tensors with `requires_grad` will have their gradients/slopes stored in the `grad` member variable.

`minmax_idx = (dydx.abs() < 9e-4)`: since our function is being evaluated on a discrete set of values inside `xt`, we might not have the exact `xt` that gives an exact slope of $0$, so `dydx.abs() < 9e-4` is a boolean indexing of all values of dydx that are really close to $0$.

`minmax_y = yt[minmax_idx]` and `minmax_x = xt[minmax_idx]`: this gets the actual `x` and `y` values where the slope of `peaks()` is really really close to $0$.
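
As a sanity check, here is a sketch that compares the slopes from `backward()` against the derivative of `peaks()` worked out by hand. The `peaks_slope()` helper is not part of the notebook; it's only here for the comparison, and only $8$ points are used to keep the output short:

```python
import torch

def peaks(x):
  return x**4 - 0.7*x**3 - 2*x**2 + x + 1

def peaks_slope(x):
  # derivative of peaks(), calculated by hand with the power rule
  return 4*x**3 - 2.1*x**2 - 4*x + 1

xt = torch.linspace(-1.3, 1.6, 8, requires_grad=True)
yt = peaks(xt)
yt.backward(torch.ones_like(xt))

by_hand = peaks_slope(xt.detach())
print(xt.grad)
print(by_hand)
print(torch.allclose(xt.grad, by_hand, atol=1e-4))
```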

### Finding Zero

We found `x` and `y` values for when our `peaks()` function is at its `max` and `min` values.

If we want to find when our function is $0$ we can use a little trick and just square it. This will turn any $0$ crossing into a min, and we can repeat the same process as above.

`yt = peaks(xt).pow(2)`: this squares our function, so the points where it crosses $y = 0$ become minimum values.

`zeros_idx = ((dydx.abs() < 0.005) & (yt < 1e-7))`: we add an extra condition to the boolean index, so we only plot the minimum values where the derivative is close to $0$ and `yt` is also close to $0$.
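
A short derivation, using only the chain rule, of why the extra `yt < 1e-7` condition is needed. Writing $f(x)$ for `peaks(x)`, the slope of the squared function is:

$$\frac{d}{dx} f(x)^2 = 2 \, f(x) \, f'(x)$$

This is $0$ wherever $f(x) = 0$, but also wherever $f'(x) = 0$, so the original peaks and valleys of `peaks()` still show up as flat spots and have to be filtered out by checking that `yt` itself is close to $0$.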

In [None]:
xt = torch.linspace(-1.3, 1.6, 8000, requires_grad=True)
yt = peaks(xt).pow(2)
yt.backward(torch.ones_like(xt))

dydx = xt.grad
print("derivatives:", dydx[:5])

zeros_idx = ((dydx.abs() < 0.005) & (yt < 1e-7))
zeros_x = xt[zeros_idx]
zeros_y = yt[zeros_idx]

plt.plot(x, y)
plt.plot(zeros_x.tolist(), zeros_y.tolist(), 'o')
plt.show()

print("zeros:", zeros_x, zeros_y)

### Finding other values

If we want to find what values of `xt` give a specific value for `yt` we can use a similar trick.

We shift the function up or down to make that `yt` value become $0$, then square the function and repeat the steps as above.

For example, to find values of `xt` that make `peaks()` equal to $0.5$, we subtract $0.5$ and square `peaks()`.

`yt2 = yt.subtract(0.5).pow(2)`: this is the function we use to take the derivative now.

In [None]:
xt = torch.linspace(-1.3, 1.6, 8000, requires_grad=True)
yt = peaks(xt)
yt2 = yt.subtract(0.5).pow(2)
yt2.backward(torch.ones_like(xt))

dydx = xt.grad
print("derivatives:", dydx[:5])

y05_idx = ((dydx.abs() < 0.005) & (yt2 < 2e-7))
y05_x = xt[y05_idx]
y05_y = yt[y05_idx]

plt.plot(x, y)
plt.plot(y05_x.tolist(), y05_y.tolist(), 'o')
plt.show()

print("y=0.5:", y05_x, y05_y)

### Solving for min/max iteratively

Our `peaks()` function is pretty simple, as it only depends on one variable, `x`, and the range we're calculating it over is pretty small, $[-1.3, 1.6]$.

What if our `peaks()` function was more complex and it took minutes to calculate? How can we find its `min` or `max` values?

This is the more common case for `grad` and `backward()`. We evaluate a function once, at one specific input value, and calculate which direction it should move in order to increase or decrease the value of our function.

We can use the `peaks()` function to illustrate. Let's calculate the value of `x` that gives the smallest value for `peaks(x)`.

`xm`: this is the current guess for the value of `x` which gives the smallest value for `peaks()`. We'll initialize it at $0.15$, which is the halfway point of our `x` range.

`xs` and `ys`: these will hold the progression of the `xm` and `ym` variables as they move towards their objectives.

`ym`: the value of `peaks()` at the current `xm`.

`backward()`: calculate the slope of `ym` with respect to its inputs.

`xm = xm - 0.1 * xm.grad`: update `xm` according to the slope of `peaks()` at `xm`. If the slope is positive, decrease `xm`; if the slope is negative, increase `xm`. This will move `xm` towards a minimum value of `peaks()`. If we wanted to move towards a maximum value, we would add the gradient instead: increase `xm` for positive slopes and decrease it for negative slopes.

The $0.1$ factor determines how big our steps should be when we update `xm`. There's a tradeoff here: large steps can get to the desired value quicker, but can also totally skip the desired value and end up in some non-desired part of our equation. Small steps, on the other hand, take a little longer to find the objective, but usually converge on the correct value.

`xm.retain_grad()`: again, we're using tensors for educational purposes here, and accumulating gradients in an unconventional way. We have to call this to make sure we can later access the gradient of something that was itself calculated from a gradient. This won't be like this in actual modeling code.

A tensor's `item()` member function just returns that tensor's value as a regular `Python` number. Similarly, if we want to get a tensor as a regular `Python` list we can call its `tolist()` function.

In [None]:
xs = []
ys = []

xm = torch.tensor(0.15, requires_grad=True)

ym = peaks(xm)
ym.backward()
print(xm.item(), ym.item(), xm.grad)

xs.append(xm.item())
ys.append(ym.item())

xm = xm - 0.1 * xm.grad
xm.retain_grad()

ym = peaks(xm)
ym.backward()
print(xm.item(), ym.item(), xm.grad)

xs.append(xm.item())
ys.append(ym.item())

# TODO: more steps

### X's journey

We saved all of the intermediate values of `xm` and `ym` so we can plot them here:

In [None]:
plt.plot(x, y)
plt.scatter(xs, ys, marker='o', s=14, c='r')
plt.show()
xs[-1], ys[-1]

### Taking all the steps

We took one step. We could loop and take $10$ steps, or take as many steps as are necessary to get to the closest max/min value of our function.

Let's add a loop to the cell above that repeats the following:

- calculate `ym`
- save `xm` and `ym`
- calculate `gradient`
- update `xm`
- repeat
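
One possible sketch of that loop is below. It's a slightly more conventional variant than the cell above: instead of creating a new `xm` and calling `retain_grad()`, it updates `xm` in place inside a `torch.no_grad()` block and clears the gradient after every step. The $64$ steps are an arbitrary choice.

```python
import torch

def peaks(x):
  return x**4 - 0.7*x**3 - 2*x**2 + x + 1

xs, ys = [], []
xm = torch.tensor(0.15, requires_grad=True)

for step in range(64):
  ym = peaks(xm)            # calculate ym
  xs.append(xm.item())      # save xm and ym
  ys.append(ym.item())
  ym.backward()             # calculate gradient

  with torch.no_grad():
    xm -= 0.1 * xm.grad     # update xm, without tracking the update itself
  xm.grad.zero_()           # clear the gradient before the next step

print(xs[-1], ys[-1])
```

Starting from $0.15$ the slope is positive, so `xm` moves left and settles near the minimum we estimated visually at $x = -0.9$.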

## Ok, so what ?

Neural Networks is what, because now we have the most important ingredient for training a neural network to perform regression (or classification, or whatever else).

We know how to load data into a `DataFrame`. Once we pass this data through a neural network with random values for its parameters, we can calculate the `error` given by our cost function, take its derivative with respect to all of the parameters of the network, and then work out which direction to move each parameter to decrease our error.

Let's load the housing prices dataset from `HW03`.

As always, we'll encode and scale our data if needed, and then we'll use the `train_test_split()` function to split our `DataFrame` into $2$ separate datasets, a training dataset with $80\%$ of the rows, and a test dataset with $20\%$.

In [None]:
# Define the location of the json file here
HOUSES_FILE = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/LA_housing.json"

houses_info = object_from_json_url(HOUSES_FILE)

houses_raw_df = pd.DataFrame.from_records(houses_info)

house_scaler = StandardScaler()
houses_df = house_scaler.fit_transform(houses_raw_df)

houses_train, houses_test = train_test_split(houses_df, test_size=0.2)

houses_train.head()

### Create features

Just like with the `LinearRegression` models, we have to separate our independent features and our outcome feature.

This time we put them both into tensors.

The `x` tensor holds all of the independent features for all of the data points, and the `y` tensor their corresponding outcomes (prices).

In [None]:
train_features = houses_train.drop(columns=["value"])
train_values = houses_train["value"]

x_train = Tensor(train_features.values)
y_train = Tensor(train_values.values)

### Define our model

We'll use a very basic neural network model that has an input layer with a neuron for each feature, and a single output neuron for the price prediction.

Something like this:

<img src="./imgs/linear_5x1.jpg" width="800px"/>

Where the initial values for the model parameters are selected at random by default.

We can iterate over our model's parameters and print their shapes, or calculate the overall number of parameters using the `numel()` function of each parameter.
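
As a sketch of where the count comes from, here is a made-up layer with $3$ inputs and $2$ outputs (not the housing model). Each output neuron has one weight per input plus one bias:

```python
from torch import nn

layer = nn.Linear(3, 2)   # hypothetical layer: 3 inputs, 2 outputs

for name, p in layer.named_parameters():
  print(name, tuple(p.shape), p.numel())

# 3*2 weights + 2 biases
n_params = sum(p.numel() for p in layer.parameters())
print("number of parameters:", n_params)
```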

In [None]:
model = nn.Linear(len(train_features.columns), 1)

psum = 0

for p in model.parameters():
  print(p.shape)
  psum += p.numel()

print("number of parameters:", psum)

### Test model

We can run this model on our train dataset just to make sure all of our layers have the correct shapes.

If anything is off we'll get an error here.

We're giving our model a `Tensor` with $4623$ houses and $5$ features for each house. It should give us $4623$ predictions.
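
Here is a sketch of the shapes involved, using a made-up batch of $8$ random "houses" instead of the real data. Note the extra dimension in the model's output, and what happens if we forget about it when subtracting the actual values:

```python
import torch
from torch import nn

x_fake = torch.rand(8, 5)   # hypothetical batch: 8 houses, 5 features
y_fake = torch.rand(8)      # hypothetical prices

fake_model = nn.Linear(5, 1)
pred = fake_model(x_fake)

print(pred.shape)                          # one prediction per house, but 2D
print(pred.reshape(-1).shape)              # flattened to 1D
print((pred - y_fake).shape)               # 8 x 8: broadcast, not what we want
print((pred.reshape(-1) - y_fake).shape)   # 8: one error per house
```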

In [None]:
y = model(x_train)
print(y.shape, y[0])

### Set up training

This will look similar to the iterative approach for finding the minimum value of a function we saw above.

For each step of our iteration we will:

- calculate a price prediction for all of the rows in our dataset
- calculate the overall error for all of the price predictions
- calculate the derivative of this error with respect to the model parameters
- update model parameters to decrease error
- repeat

A few things to note about this process:

1\. We are calculating all of the predictions for all of our data with a single call: `y = model(x)`. `PyTorch` models are smart and they know we want to do the same thing for all of the rows in our data. This optimizes and parallelizes the process.

2\. But... if we take a look at the resulting shape of the call to `model(x)` we'll see that it adds an extra dimension to our predictions, which we must remove by calling `reshape(-1)`.

3\. The cost function (called `loss` here) is the `L2` distance between all price predictions and all actual prices in our dataset calculated in one go. It's a single number we can take the derivative of. We could skip the square root, but this way our units stay consistent and error is calculated in terms of standard deviations.

4\. The parameters we are optimizing and updating at each iteration aren't our features, but the weights and thresholds of each of our $6$ neurons, which have `requires_grad` turned on by default. At each step we update the model's parameters with `p.data.sub_(p.grad.data * learning_rate)`. This is the very bureaucratic form of doing something like: `p -= p.grad * lr`. Since we are dealing with parameter tensors that keep all kinds of extra information about their values, we have to operate on their `data` members.

5\. Once we have used the parameters' gradients to update our model we have to clear them by calling `grad.zero_()`. We'll see why soon, but by default if we are reusing the same tensors (in this case our model's parameters) we have to make sure they don't accumulate gradients.
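
A tiny sketch of the accumulation mentioned in the last point, using a single made-up parameter `w` and the function $3w$, whose slope is always $3$:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
g1 = w.grad.item()   # slope of 3*w

(3 * w).backward()
g2 = w.grad.item()   # the new slope was added to the old one

w.grad.zero_()
(3 * w).backward()
g3 = w.grad.item()   # back to the actual slope after clearing

print(g1, g2, g3)
```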

In [None]:
learning_rate = 1e-2

for c in range(32):
  y_pred = model(x_train).reshape(-1)
  loss = (y_pred - y_train).pow(2).mean().pow(0.5)
  loss.backward()

  for p in model.parameters():
    p.data.sub_(p.grad.data * learning_rate)
    p.grad.zero_()

  if c % 4 == 0:
    print(c, loss.item())

### Interpretation

What's happening in the above cell?

What happens if we keep running it over and over?

### Checking the train dataset

Once we're happy with the training, we can get predictions for all of our houses in dollars by running the model and reversing the scaling:

In [None]:
y_std = pd.DataFrame(model(x_train).tolist(), columns=["value"])
y_usd = house_scaler.inverse_transform(y_std)

y_usd.head()

### Growing the Network

The error we were getting above was around $1.0$ standard deviation. That's not bad, but it's also not good.

If we want to improve our model we can try adding layers to our Neural Network. We just have to make sure we add an activation function between the layers. These are the non-linear functions that keep the values coming out of our neurons within a nice, well-defined range.
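
For example, here is a sketch of what the `Sigmoid` activation does to a few made-up input values, including some very large and very small ones:

```python
import torch
from torch import nn

z = torch.Tensor([-100, -2, 0, 2, 100])
out = nn.Sigmoid()(z)

# every value ends up between 0 and 1, and 0 maps to 0.5
print(out)
```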

This is how we build the following network:

<img src="./imgs/linear_5x5x1.jpg" width="800px"/>

In [None]:
model =  nn.Sequential(
  nn.Linear(len(train_features.columns), len(train_features.columns)),
  nn.Sigmoid(),
  nn.Linear(len(train_features.columns), 1),
)

# TODO: calculate the number of parameters
# TODO: test on train data and check shape of output

### So... many... parameters

How many parameters do we have now? We might not want to keep updating them ourselves.

Relying on a for loop to get all the parameters and remembering to call `grad.zero_()` at the right time is just prone to errors and inefficiencies.

Luckily, `PyTorch` has some optimizers we can use. They usually take our model as an input, along with some other parameters, and give us a simpler interface to control the optimization process.

### Initialize Optimizer

We're going to use one of the simpler optimizers to perform [_stochastic gradient descent_](https://en.wikipedia.org/wiki/Stochastic_gradient_descent). Gradient descent is the official name of the algorithm that calculates which way to update our parameters given the slope of our cost function and a learning rate. _Stochastic_ means that it should still work if we sub-sample our input data and only use a subset of the data points at a time. With its optional `momentum` parameter, it can also remember/accumulate information about previous gradients.

The documentation for the [`SGD` optimizer](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) has more info about the algorithm and the parameters it takes.

Other than simplifying our training code, some of these pre-built optimizers (like Adam and Adagrad, linked below) also adapt the learning rate as training progresses and use some other tricks that make our overall process less sensitive to an exact learning rate.

The `PyTorch` library also has a number of [other optimizers](https://pytorch.org/docs/stable/optim.html#algorithms) useful for performing gradient descent. In addition to `SGD()` we can also try [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) or [Adagrad](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html).

In [None]:
learning_rate = 1e-2
optim = torch.optim.SGD(model.parameters(), lr=learning_rate)

### Train it

We can train our new model, just like before, except now the training loop should be a little bit simpler.

We still have to call `zero_grad()`, but only on the optimizer. It will take care of clearing the gradients for each of the model's parameters for us.

And, after we calculate the slope of our cost function, we call `optim.step()`, so the optimizer can update the parameters using the new slope values.
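
As a sketch of what `optim.step()` is doing for us, the cell below checks that one step of `SGD` with its default settings gives the same result as the manual `p - p.grad * lr` update from before. The model and data here are made-up and random:

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_model = nn.Linear(2, 1)     # hypothetical model: 2 features, 1 output
x_toy = torch.rand(4, 2)
y_toy = torch.rand(4)

lr = 1e-2
toy_optim = torch.optim.SGD(toy_model.parameters(), lr=lr)

toy_optim.zero_grad()
loss = (toy_model(x_toy).reshape(-1) - y_toy).pow(2).mean().pow(0.5)
loss.backward()

# what the manual loop would have calculated
expected = [p.detach() - p.grad * lr for p in toy_model.parameters()]

toy_optim.step()

same = all(torch.allclose(p, e) for p, e in zip(toy_model.parameters(), expected))
print(same)
```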

In [None]:
for c in range(32):
  optim.zero_grad()
  y_pred = model(x_train).reshape(-1)
  loss = (y_pred - y_train).pow(2).mean().pow(0.5)
  loss.backward()
  optim.step()

  if c % 4 == 0:
    print(c, loss.item())

### Test dataset

We can still adjust a lot of parameters here, but before we spend too much time on this model, let's run it on the test dataset and calculate the average loss on data that wasn't used for training to see if the model is over-fitting.

We'll load the test data, run the model and calculate loss.

In [None]:
test_features = houses_test.drop(columns=["value"])
test_values = houses_test["value"]

x_test = torch.Tensor(test_features.values)
y_test = torch.Tensor(test_values.values)

### `no_grad()`

We can call the `torch.no_grad()` function to tell `PyTorch` to momentarily stop calculating slopes/gradients when we are not training and just want to use the model to predict prices. We do this by creating a block of code where our model runs faster and more carefree.
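
A small sketch of the difference, using a made-up model and random input. Outputs calculated inside the block don't keep track of gradients:

```python
import torch
from torch import nn

toy_model = nn.Linear(5, 1)   # hypothetical model
x_toy = torch.rand(3, 5)

y_tracked = toy_model(x_toy)

with torch.no_grad():
  y_free = toy_model(x_toy)

print(y_tracked.requires_grad, y_free.requires_grad)
```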

In [None]:
with torch.no_grad():
  y_pred = model(x_test).reshape(-1)
  loss = (y_pred - y_test).pow(2).mean().pow(0.5)
  print(loss.item())

### Interpretation

This isn't bad.

The absolute value of the error is kind of large, but the test dataset error is comparable to the training dataset error, which is a good indication that the model is not over-fitting.

And it seems to be learning, so let's tune it.

### Hyperparameters

We can spend some time adjusting the model, adding layers, changing the optimizer, the learning rate, experimenting with the optimizer's parameters, etc.

This process is usually referred to as hyperparameter tuning, since we're picking parameters that will help us calculate the parameters of our neural network.

Here's a cell with all of the steps combined. We can play with the network architecture and parameters here.

In [None]:
## Define Model
model =  nn.Sequential(
  nn.Linear(len(train_features.columns), len(train_features.columns)),
  nn.Sigmoid(),

  # TODO: add layers

  nn.Linear(len(train_features.columns), 1),
)

# TODO: calculate the number of parameters

## Define Optimizer
learning_rate = 1e-2
# TODO: adjust parameters, add parameters, change optimizer
optim = torch.optim.SGD(model.parameters(), lr=learning_rate)

## Load Data
x_train = torch.Tensor(train_features.values)
y_train = torch.Tensor(train_values.values)
x_test = torch.Tensor(test_features.values)
y_test = torch.Tensor(test_values.values)

## Train Model
for c in range(32):
  optim.zero_grad()
  y_pred = model(x_train).reshape(-1)
  loss = (y_pred - y_train).pow(2).mean().pow(0.5)
  loss.backward()
  optim.step()

  if c % 4 == 0:
    print(c, loss.item())

## Evaluate Model
with torch.no_grad():
  y_pred = model(x_train).reshape(-1)
  loss_train = (y_pred - y_train).pow(2).mean().pow(0.5)

  y_pred = model(x_test).reshape(-1)
  loss_test = (y_pred - y_test).pow(2).mean().pow(0.5)

  print("\ntrain loss:", loss_train.item(), "\ntest loss:", loss_test.item())

### Interpretation

Our model is definitely learning. It might take a moment to tune and train before we get something comparable to the `LinearRegression` model, but that's not entirely surprising.

Usually what makes the biggest difference in these kinds of models is the size of the training dataset, compared to the number of parameters the model has to learn.

Some of the same tricks we used for the `LinearRegression` model could also help here. In theory the neural network should learn how to combine parameters into polynomial features, and also how to combine features akin to `PCA`, but sometimes it needs a little push in the right direction.

## Images

Can we use these kinds of models to do classification of images?

Sure.

The steps are the same, we just have to load image data and adapt the cost/loss function to calculate some kind of classification metric instead.

We'll use the _Labeled Faces in the Wild_ dataset from last homework.

The steps for setting up the classification model will be:

- Load dataset and do any kind of pre-processing
- Split data into train/test datasets
- Split independent features and classification label and load them into `Tensors`
- Create `DataLoader` instances (we'll see what this means below)
- Build a NN model
- Set up an optimizer
- Pick a cost/loss function
- Implement an evaluation function and any other kind of visualization that helps quantify the model
- Train model

### Load and split Dataset

The `LFWUtils.train_test_split(0.333)` function gives us some `Python` objects we can use to create our `Tensor`s.

The `pixels` key gives us a list of the images' pixel data, and the `labels` key gives us the images' label IDs.

We don't have to do any normalization since the pixels will be in a known, well-defined range of $[0, 255]$.

The only thing we have to do differently is cast the label `Tensor` to `long`. `torch.Tensor()` creates floating-point values by default, and the `CrossEntropyLoss()` function we'll use later expects class labels as whole-number indices, without decimal points.
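
The effect of the `long()` cast can be checked on a short list of made-up label IDs (the `labels` list below is just an illustration):

```python
import torch

labels = [0, 2, 1, 2]  # made-up label IDs

as_float = torch.Tensor(labels)         # torch.Tensor() produces 32-bit floats
as_long = torch.Tensor(labels).long()   # cast to 64-bit integers

print(as_float.dtype, as_long.dtype)  # torch.float32 torch.int64
```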

In [None]:
train, test = LFWUtils.train_test_split(0.333)

x_train = torch.Tensor(train["pixels"])
y_train = torch.Tensor(train["labels"]).long()

x_test = torch.Tensor(test["pixels"])
y_test = torch.Tensor(test["labels"]).long()

### Peek at data

We can visualize some of the images, their text labels and label IDs

In [None]:
for idx in range(0, len(train["pixels"]), 100):
  display(make_image(train["pixels"][idx], 130))
  print(train["labels"][idx], LFWUtils.LABELS[train["labels"][idx]])

### Datasets & DataLoaders

This is new !

We could try to train this neural network exactly how we trained the previous one where we gave the model the entire training dataset at once and asked for it to minimize the cost function over all of the samples at the same time.

This could work for this dataset, but once we start working with bigger and bigger datasets, it will be difficult to ask the computer to perform this kind of optimization over all of the images at the same time.

We have to split up our dataset into batches, and ask the model to work on subsets of our datasets. Since we're not giving the model all of the data at once, we should also randomize the order of the data to make sure the order in which the model sees a sample doesn't affect its influence on the overall quality of the model.
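
To make the idea concrete before handing the job over to a library, here is a minimal sketch of shuffling and batching by hand. The sizes and variable names are made up for this illustration.

```python
import torch

n_samples, batch_size = 10, 4
data = torch.arange(n_samples)  # stand-in for 10 samples

# A random permutation of the indices puts the samples in a new order.
order = torch.randperm(n_samples)

# Walk through the shuffled indices one batch at a time.
batches = [data[order[i:i + batch_size]] for i in range(0, n_samples, batch_size)]

for b in batches:
  print(b)
```

Every sample shows up exactly once, and the last batch is smaller because $10$ is not a multiple of $4$.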

We'll use `PyTorch`'s built in [Datasets and DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) classes to help us manage our batches of training data.

### Define a Dataset Class

The first thing we have to do is define a `PyTorch` `Dataset` class that will handle our pixel and label data.

This class needs to have $3$ functions defined:

`__init__()`: the constructor. Should receive all the info from a dataset.

`__len__()`: this returns how many items we have in our dataset.

`__getitem__()`: given an index, return the corresponding pixels and label.

In [None]:
class FaceDataset(Dataset):
  def __init__(self, imgs, labels):
    self.imgs = imgs
    self.labels = labels

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    return self.imgs[idx], self.labels[idx]
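
As a quick check of how these three functions get used, here is a sketch with a stand-in class of the same shape and some made-up data. `TinyDataset`, `demo_imgs` and `demo_labels` are hypothetical names used only here.

```python
import torch
from torch.utils.data import Dataset

class TinyDataset(Dataset):
  def __init__(self, imgs, labels):
    self.imgs = imgs
    self.labels = labels

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    return self.imgs[idx], self.labels[idx]

demo_imgs = torch.zeros(5, 12)  # 5 made-up "images" with 12 pixels each
demo_labels = torch.tensor([0, 1, 0, 2, 1])
ds = TinyDataset(demo_imgs, demo_labels)

print(len(ds))       # len() calls __len__()
pixels, label = ds[3]  # indexing calls __getitem__()
print(pixels.shape, label.item())
```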

### Create DataLoaders

Now we can just create `Dataset` instances for each of our $2$ datasets, and pass those along to the `DataLoader` constructor.

In the `DataLoader` constructor we can use the `batch_size` parameter to define how many images the model should consider each time it does gradient descent, and the `shuffle` parameter to specify whether the data should be randomized during that process.

Setting parameters for the training `DataLoader` is more important since the batch size and randomization will directly affect the quality of the model.

For the test `DataLoader` the batch size won't really make a difference in the results, although larger batches should evaluate faster, and shuffling might confuse us when we try to look at specific test cases that are failing.

In [None]:
train_dataloader = DataLoader(FaceDataset(x_train, y_train), batch_size=256, shuffle=True)
test_dataloader = DataLoader(FaceDataset(x_test, y_test), batch_size=512)
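
To get a feel for what a `DataLoader` hands back, here is a minimal sketch with random stand-in data. It uses `PyTorch`'s `TensorDataset` instead of our own class just to keep the example self-contained, and the sizes are made up.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

demo_x = torch.rand(10, 6)              # 10 made-up samples, 6 features each
demo_y = torch.randint(0, 3, (10,))     # 10 made-up labels in [0, 2]

loader = DataLoader(TensorDataset(demo_x, demo_y), batch_size=4, shuffle=True)

# The number of batches is the dataset size divided by batch size, rounded up.
print(len(loader), math.ceil(10 / 4))

for x, y in loader:
  print(x.shape, y.shape)
```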

### Model, Optimizer and Cost Function

We'll start with the simplest kind of network again, with just an input and an output layer.

The input layer has as many neurons as the number of pixels in each image, and the output layer has one neuron per possible class.

It looks like this, and is just like our regression network above, but has more output neurons:

<img src="./imgs/linear_22100x26.jpg" width="800px"/>

Our optimizer will be `SGD` again. Depending on the dataset and model being created, `SGD` can perform even better with batched inputs: each update is based on a small, slightly noisy sample of the data, so the model gets to take many more steps per epoch and is less constrained within each batch.

Our cost function is a bit different. Previously, we used $L2$ distances to calculate the root mean square error of our regression predictions and used that value as the cost function for gradient descent.

In order to use gradient descent for classification, we have to turn the discrete nature of our labels/classes and their errors into something smooth and differentiable, so that we can calculate slopes.

That's what the `CrossEntropyLoss()` function does for us. It looks at the outputs of our model and transforms the regression-type continuous values at our outputs into class prediction probabilities in a way that gradient descent still works.

There's more information in the [`PyTorch` documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).
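
To unpack what happens inside, here is a minimal sketch that reproduces the default `CrossEntropyLoss()` by hand on two made-up samples: softmax turns the raw outputs into probabilities, and the loss is the average negative log of the probability given to the correct class. The `logits` and `targets` values are invented for this illustration.

```python
import torch
from torch import nn

# Raw model outputs for 2 samples and 3 classes, plus the correct class IDs.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# Turn raw outputs into probabilities that add up to 1 for each sample.
probs = torch.softmax(logits, dim=1)

# Pick the probability assigned to the correct class of each sample.
picked = probs[torch.arange(len(targets)), targets]

# Average negative log likelihood.
manual_loss = -torch.log(picked).mean()

builtin_loss = nn.CrossEntropyLoss()(logits, targets)
print(manual_loss.item(), builtin_loss.item())
```

The two printed values should match up to floating-point rounding.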

In [None]:
model = nn.Linear(x_train.shape[1], len(y_train.unique()))

learning_rate = 1e-5
optim = torch.optim.SGD(model.parameters(), lr=learning_rate)

loss_fn = nn.CrossEntropyLoss()

### Train the Model

We can train the model now.

We still don't have an evaluation function, but we can use the values from the loss function to adjust parameters and make sure that the model is learning.

We'll train for $32$ epochs, and in each epoch we have to iterate through the data that is inside our `DataLoader` object.

The `x` and `y` variables below actually hold pixel and label information for up to $256$ images. For each of these batches we zero the gradients, predict labels, calculate loss, calculate the slope of the loss function, update model parameters, and repeat.

In [None]:
for e in range(32):
  for x, y in train_dataloader:
    optim.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    loss.backward()
    optim.step()

  if e % 2 == 0:
    print(f"Epoch: {e} loss: {loss.item():.4f}")

### Interpretation

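The loss/cost value may oscillate up and down, but overall it should steadily decrease. Keep in mind that the value we print is the loss of the last batch of each epoch, not of the whole training dataset.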

This up and down has to do with the batching and shuffling of our training data. The `SGD` optimizer makes some decisions that it sometimes has to undo, but overall, the model looks like it's learning.
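
One way to get a smoother number is to average the loss over all of the batches in an epoch. This is a minimal sketch on random stand-in data, not the face dataset; all of the names, sizes and the learning rate here are made up for the illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
demo_x = torch.rand(64, 10)            # 64 made-up samples, 10 features each
demo_y = torch.randint(0, 3, (64,))    # made-up labels for 3 classes
loader = DataLoader(TensorDataset(demo_x, demo_y), batch_size=16, shuffle=True)

demo_model = nn.Linear(10, 3)
demo_optim = torch.optim.SGD(demo_model.parameters(), lr=1e-1)
demo_loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for e in range(4):
  total, count = 0.0, 0
  for x, y in loader:
    demo_optim.zero_grad()
    loss = demo_loss_fn(demo_model(x), y)
    loss.backward()
    demo_optim.step()
    # Weigh each batch loss by its size so smaller batches count less.
    total += loss.item() * len(x)
    count += len(x)
  epoch_losses.append(total / count)
  print(f"Epoch: {e} mean loss: {epoch_losses[-1]:.4f}")
```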

We can keep running this cell until the loss gets really small, but what we should do next is think of a way to evaluate our model using the test dataset and a function that gives us something a little more legible than the `CrossEntropyLoss()` value, which is the average "negative log likelihood" of our predictions.

This evaluation function is helpful not only when measuring the overall quality of our model, but should help us detect if/when the model starts to overfit the training data.

### Evaluation Function

Our `data_utils` file has a `classification_error()` function that calculates a percentage of mistakes between two lists of labels. We just have to give it a list of true labels and a list of predicted labels.
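
We haven't looked inside `classification_error()`, but the idea can be sketched in a few lines. The `error_rate()` function below is a hypothetical stand-in, not the `data_utils` implementation, and it assumes the result is reported as a fraction between $0$ and $1$.

```python
def error_rate(true_labels, predicted_labels):
  # Count the positions where the prediction differs from the true label.
  mistakes = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
  return mistakes / len(true_labels)

print(error_rate([0, 1, 2, 2], [0, 1, 1, 2]))  # 1 mistake out of 4 -> 0.25
```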

Having these lists of labels will also be useful if we want to visualize our predictions in a confusion matrix, so let's write a helper function that takes a model and a `DataLoader` and computes predictions for all of the samples in that dataloader.

We'll make sure our model isn't computing gradients with `torch.no_grad()` and also turn off some other features of the model that don't have to run during evaluation with `model.eval()`.

The `argmax(dim=1)` function gives us the index of our output neuron with the largest value. This is how we pick a class label from the raw regression-like numbers.

Then we append each batch's labels to our overall lists of predicted and true labels.

In [None]:
def get_labels(model, dataloader):
  model.eval()
  with torch.no_grad():
    data_labels = []
    pred_labels = []
    for x, y in dataloader:
      y_pred = model(x).argmax(dim=1)
      data_labels += [l.item() for l in y]
      pred_labels += [l.item() for l in y_pred]
    return data_labels, pred_labels
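
The `argmax(dim=1)` step is easier to see on a small made-up example with $2$ samples and $3$ output neurons (the `outputs` values are invented for this illustration):

```python
import torch

# Each row holds the raw outputs for one sample, one column per class.
outputs = torch.tensor([[0.1, 2.3, -0.5], [1.7, 0.2, 0.9]])

# dim=1 means "look along the columns of each row" and return the index
# of the largest value.
predicted = outputs.argmax(dim=1)
print(predicted.tolist())  # [1, 0]
```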

### Evaluate Model

We can now run the evaluation function on the model and both `DataLoaders`.

In [None]:
train_labels, train_predictions = get_labels(model, train_dataloader)
test_labels, test_predictions = get_labels(model, test_dataloader)

print("train error:", f"{classification_error(train_labels, train_predictions):.4f}")
print("test error:", f"{classification_error(test_labels, test_predictions):.4f}")

display_confusion_matrix(train_labels, train_predictions, display_labels=LFWUtils.LABELS)
display_confusion_matrix(test_labels, test_predictions, display_labels=LFWUtils.LABELS)

### Retrain with Evaluation

To make sure our model didn't overfit the training data, we should keep an eye on the evaluation function during training.

Let's re-initialize our model and optimizer and re-train our network:

In [None]:
model = nn.Linear(x_train.shape[1], len(y_train.unique()))
optim = torch.optim.SGD(model.parameters(), lr=learning_rate)

In [None]:
for e in range(32):
  for x, y in train_dataloader:
    optim.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    loss.backward()
    optim.step()

  if e % 4 == 0:
    train_labels, train_predictions = get_labels(model, train_dataloader)
    test_labels, test_predictions = get_labels(model, test_dataloader)
    train_error = classification_error(train_labels, train_predictions)
    test_error = classification_error(test_labels, test_predictions)
    print(f"Epoch: {e} loss: {loss.item():.4f}, train error: {train_error:.4f}, test error: {test_error:.4f}")

In [None]:
train_labels, train_predictions = get_labels(model, train_dataloader)
test_labels, test_predictions = get_labels(model, test_dataloader)

print("train error:", f"{classification_error(train_labels, train_predictions):.4f}")
print("test error:", f"{classification_error(test_labels, test_predictions):.4f}")

display_confusion_matrix(train_labels, train_predictions, display_labels=LFWUtils.LABELS)
display_confusion_matrix(test_labels, test_predictions, display_labels=LFWUtils.LABELS)

### Interpretation

Is it overfitting ? Can we keep running the training cell ?

How low can we get our test error ?

### Make It Harder

Neural network models can seem simple to explain in a general sense: they're long and wide computation graphs made up of simple operations that have been tuned to achieve a specific task. Once they're training, or trained, their details and specificities are a little less easy to describe. It's hard to know exactly what each neuron is doing, and what part of the computation they are responsible for. We can train the same network, with the same hyperparameters, using the same input data, and end up with wildly different results, because the starting weights and the order of the training batches are random.
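
One source of this run-to-run variation is the random starting weights. A minimal sketch, using small made-up layers, shows that two freshly created layers start out different unless we fix the random seed with `torch.manual_seed()`:

```python
import torch
from torch import nn

# Two layers with the same architecture get different random starting weights.
a = nn.Linear(4, 2)
b = nn.Linear(4, 2)
print(torch.equal(a.weight, b.weight))

# Resetting the seed before each one makes the starting weights identical.
torch.manual_seed(0)
c = nn.Linear(4, 2)
torch.manual_seed(0)
d = nn.Linear(4, 2)
print(torch.equal(c.weight, d.weight))
```

Fixing the seed can make experiments easier to compare, but it doesn't make the network any easier to interpret.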

This is one reason why it's hard to debug a network when it doesn't seem to be learning properly, or when it starts to overfit and memorize the training data. Which neurons do we tune ?

One common situation that can lead to overfitting is when a network ends up with parameters that make it perform well on the training data without really activating all of its neurons. This is usually what is happening if adding layers to a network doesn't improve its performance.

One set of strategies for improving neural network training in these cases involves making the training process harder than it has to be. It's like we're challenging the neural network to learn more than it has, so that later it has an easier time with the regular data.

One simple technique to achieve this is to add `Dropout` layers to our network. A `Dropout` layer doesn't have any parameters to learn. During training it randomly sets a fraction of the values passing through it to zero (and scales up the remaining ones to compensate), which effectively drops those neurons out of the network for that step. This has the effect of randomly changing the network's architecture during training and preventing the network from becoming too reliant on specific neurons. It encourages the network to learn more robust features by activating more neurons overall. During evaluation, after `model.eval()`, the `Dropout` layers let everything through unchanged.

<img src="./imgs/dropout.jpg" width="800px"/>
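
To see what a `Dropout` layer does to the values passing through it, here is a minimal sketch that runs one on a made-up vector of ones, first in training mode and then in evaluation mode (the `drop` and `ones` names are just for this illustration):

```python
import torch
from torch import nn

drop = nn.Dropout(0.2)
ones = torch.ones(1000)

# Training mode: roughly 20% of the values become 0,
# and the rest are scaled by 1 / (1 - 0.2) = 1.25.
drop.train()
train_out = drop(ones)

# Evaluation mode: the layer passes values through unchanged.
drop.eval()
eval_out = drop(ones)

print(train_out.unique().tolist())
print((train_out == 0).float().mean().item())
print(torch.equal(eval_out, ones))
```

This is also why the training loop calls `model.train()` at the start of each epoch, since the evaluation helper switches the model to `eval()` mode.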

In [None]:
model = nn.Sequential(
  nn.Dropout(0.2),
  nn.Linear(x_train.shape[1], x_train.shape[1] // 8),
  nn.ReLU(),

  nn.Dropout(0.2),
  nn.Linear(x_train.shape[1] // 8, len(y_train.unique())),
)

learning_rate = 1e-5
optim = torch.optim.SGD(model.parameters(), lr=learning_rate)

loss_fn = nn.CrossEntropyLoss()

In [None]:
for e in range(32):
  model.train()
  for x, y in train_dataloader:
    optim.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    loss.backward()
    optim.step()

  if e % 4 == 0:
    train_labels, train_predictions = get_labels(model, train_dataloader)
    test_labels, test_predictions = get_labels(model, test_dataloader)
    train_error = classification_error(train_labels, train_predictions)
    test_error = classification_error(test_labels, test_predictions)
    print(f"Epoch: {e} loss: {loss.item():.4f}, train error: {train_error:.4f}, test error: {test_error:.4f}")

### Interpretation

The train and test errors diverged, but both keep decreasing, so this might be ok.

In [None]:
train_labels, train_predictions = get_labels(model, train_dataloader)
test_labels, test_predictions = get_labels(model, test_dataloader)

print("train error:", f"{classification_error(train_labels, train_predictions):.4f}")
print("test error:", f"{classification_error(test_labels, test_predictions):.4f}")

display_confusion_matrix(train_labels, train_predictions, display_labels=LFWUtils.LABELS)
display_confusion_matrix(test_labels, test_predictions, display_labels=LFWUtils.LABELS)

print("test: top precision", LFWUtils.top_precision(test_labels, test_predictions))
print("test: top recall", LFWUtils.top_recall(test_labels, test_predictions))