# Lab 5 - Hazel's Extras
## Virtual Environment

The instructions in the lab should be:

```.sh
pip install keras tensorflow~=2.2
```

If you get the error `keras requires tensorflow`, this could be because pip installed a version of tensorflow other than tensorflow **2**. Generally, to recover from this kind of thing, you can just delete and recreate your virtual environment. Suppose your virtual environment is named `venv` (note the prof calls his `env`):

1. Delete your old virtual environment: `rm -rvf venv`
2. Create a new virtual environment: `virtualenv -p python3 venv` or `python3 -m virtualenv -p python3 venv`.
3. Enter your new virtual environment: `source venv/bin/activate`
4. Install jupyter notebook, matplotlib (pyplot), pandas, keras, and tensorflow 2: `pip install notebook matplotlib pandas keras tensorflow~=2.2 "numpy<1.19.0" "h5py<2.11.0"`
5. Start jupyter notebook: `jupyter notebook`
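
After step 4, one way to see which versions pip actually installed is to ask Python's package metadata. This is only a sketch: `installed_version` is a hypothetical helper name, and the package list is copied from the pip command above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the pip command in step 4.
for name in ("tensorflow", "keras", "numpy", "h5py"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If tensorflow shows up with a version that does not start with `2.`, recreating the environment as described above is probably the quickest fix.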





## Notation

$i$ is a digit in 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

$j$ is the pixel number in $(0,0)\dots(27,27)$.

$m$ is the number of pictures (20000).

$k$ is a picture # in $0\dots 19999$.

$x_k$ is a picture: it has $28\times 28=784$ pixel brightness values.

$w_{i,j}$ is a weight that represents how much the brightness of
pixel $j$ contributes to calling the picture a picture of digit $i$.

$W$ is the $10\times 784$ matrix of all $10\times 28\times 28$ values for $w_{i,j}$.

$b_i$ is an offset for digit $i$.

$B$ is a vector of all 10 elements $b_i$.

$X$ is a $784\times m$ matrix of picture data.

$Y$ is the $10\times m$ matrix of training labels: column $k$ says which digit picture $k$ really is.

Then how much our neural network $\left[W\ B\right]$
thinks picture $k$ looks like digit $i$ is given by the following formula:

\begin{align*}
h_i\left(x_k\right) &= w_i\cdot x_k + b_i
\\
H\left(x_k\right) &= W\cdot x_k + B
\end{align*}

The largest $h_i$ wins. 

Example: suppose $h_{0\dots 9}$ is $[0.1, 0.2, 0.3, 0.1, 0.9, 0.0, 0.1, 0.0, 0.2, 0.1]$. We guess this is a picture of digit 4 because

$h_4 > h_i$ for every $i\neq 4$.
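
The "largest $h_i$ wins" rule can be sketched in a couple of lines. The list below uses ten scores, one per digit, and `predict_digit` is a made-up helper name for illustration.

```python
# Example scores h_0 .. h_9 (one score per digit).
h = [0.1, 0.2, 0.3, 0.1, 0.9, 0.0, 0.1, 0.0, 0.2, 0.1]


def predict_digit(scores):
    """Return the index of the largest score, i.e. the guessed digit."""
    return max(range(len(scores)), key=lambda i: scores[i])


print(predict_digit(h))  # 4
```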

Our goal is to find values for these 28x28x10 $w$ weights and 10 $b$ offsets so that, for a picture of the digit 4, $h_{0\dots 9}$ is as close as possible to $[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]$.

But we're going to be using 20,000 pictures!

So we're going to have a matrix $\hat{Y}$ (10x20000), and maybe just column 127 is $[0.1, 0.2, 0.3, 0.1, 0.9, 0.0, 0.1, 0.0, 0.2, 0.1]$.

But we want $\hat{Y}$ to be as close to the ideal $Y$ matrix as possible. The ideal $Y$ matrix would have one 1.0 in each column, and the rest of the elements would be 0.0.

In other words, when we're training we want to get ${\hat{Y}_{i,k}}$ as close to ZERO as possible if image $k$ is not a picture of digit $i$, and as close to ONE as possible if image $k$ is a picture of digit $i$.

Example: if image 127 is a picture of the digit 4, column 127 of $Y$ should be 
$\left[0,0,0,0,1,0,0,0,0,0 \right]^{T}$. And we want to get $\hat{Y}$ as close to that as possible.
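
Building such columns of $Y$ from plain digit labels could look like the sketch below. The `labels` list is made up for illustration, and `one_hot` is a hypothetical helper, not something the lab provides.

```python
def one_hot(digit, num_classes=10):
    """Return a list with 1.0 at position `digit` and 0.0 everywhere else."""
    if not 0 <= digit < num_classes:
        raise ValueError(f"digit must be in 0..{num_classes - 1}")
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]


labels = [4, 0, 9]                       # made-up labels for three pictures
columns = [one_hot(d) for d in labels]   # one column of Y per picture

# Y is 10 x m: row i is the digit, column k is the picture.
Y = [list(row) for row in zip(*columns)]

print(one_hot(4))
```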

For each digit $i$ in 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, we have an equation that looks like

\begin{align*}
\hat Y &= W\cdot X + B
\\
\hat{y}_{i,k} &= \sum_j w_{i,j}x_{j,k} + b_i
\end{align*}
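
As a rough illustration of these two equations, here is a NumPy sketch with random stand-in data. The tiny `m`, the random values, and storing $B$ as a $10\times 1$ column so that it broadcasts across all pictures are assumptions for the example, not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                        # tiny stand-in for the 20000 pictures
X = rng.random((784, m))                     # one column of pixel brightnesses per picture
W = rng.normal(scale=0.01, size=(10, 784))   # one row of weights per digit
B = np.zeros((10, 1))                        # column vector, broadcast over the m columns

# Matrix form: every score for every picture at once, shape (10, m).
Y_hat = W @ X + B

# Element form: the same number for one digit i and one picture k.
i, k = 4, 2
y_hat_ik = sum(W[i, j] * X[j, k] for j in range(784)) + B[i, 0]

# The guessed digit for each picture is the row with the largest score.
guesses = Y_hat.argmax(axis=0)
print(Y_hat.shape, guesses.shape)
```

Here the pixel index $j$ is flattened to $0\dots 783$ instead of $(0,0)\dots(27,27)$, which is one common way to store the pictures.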

Example (which is almost certainly NOT true): Imagine some pixel at (10, 14) in the picture is only WHITE when the picture is a picture of the digit 4. Then, we'd want the $w_{4,(10,14)}$ weight to be really BIG, to help push the $h_4$ value toward 1.0.

So how are we going to get there?

We're going to use something called GRADIENT DESCENT.

Imagine you're standing in a forest on a hill.

Walking down a hill would be an example of doing gradient descent with 2 weights. The two weights are your latitude and longitude on a map. The elevation goes down as you walk downhill, and you want to get to the lowest elevation. Maybe you want to get to the ocean.
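
The hill picture can be turned into a tiny two-weight sketch. The bowl-shaped `elevation` function, the starting point, and the step size `alpha` are all made up for illustration.

```python
def elevation(lat, lon):
    """Made-up bowl-shaped hill whose lowest point is at (1, -2)."""
    return (lat - 1.0) ** 2 + (lon + 2.0) ** 2


def gradient(lat, lon):
    """Partial derivatives of `elevation` with respect to lat and lon."""
    return 2.0 * (lat - 1.0), 2.0 * (lon + 2.0)


def descend(lat, lon, alpha=0.1, steps=100):
    """Repeatedly take a small step in the downhill direction."""
    for _ in range(steps):
        d_lat, d_lon = gradient(lat, lon)
        lat -= alpha * d_lat
        lon -= alpha * d_lon
    return lat, lon


lat, lon = descend(5.0, 5.0)
print(lat, lon, elevation(lat, lon))
```

With these made-up numbers, the walk ends very close to the bottom of the bowl at (1, -2).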

This lab is similar, but instead of 2 weights, we have 28x28x10+10 weights. Our elevation is going to be this $J$ function, called the error function, which we want to minimize. The $J$ function represents how WRONG we are, when we try to decide what digit each picture is.

The error function for digit $i$ (it's a scalar!):

\begin{equation*}
J(w_i,b_i) = \frac{1}{2m} \sum_{k=0}^{m-1} (w_i\cdot x_k + b_i - y_{i,k})^2
\end{equation*}

Where $y_{i,k}$ is 1.0 if picture $k$ is a picture of digit $i$, or 0.0 if it's not.

$\alpha$ is just some number (scalar). It's the learning rate. It's like how big of a step we're going to take.

With $j$ as the pixel number, we can find the "downhill" direction for digit $i$ using 785 equations (784 weights plus 1 offset):
\begin{align*}
w_{i, j} &\gets w_{i, j} - \alpha \frac{\partial J(w_{i},b_i)}{\partial w_{i, j}} \quad \forall j \in (0,0)\dots(27,27) \\
b_{i} &\gets b_{i} - \alpha \frac{\partial J(w_{i},b_i)}{\partial b_i}
\end{align*}
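
One such update for a single digit might be sketched as follows. The gradients in the code are worked out by differentiating the $J$ defined above. The random pictures, the made-up labels, the flattened pixel index $0\dots 783$, and the value of `alpha` are assumptions for the example.

```python
import numpy as np


def cost(w, b, X, y):
    """J(w_i, b_i) = 1/(2m) * sum over k of (w_i . x_k + b_i - y_k)^2."""
    m = X.shape[1]
    residual = w @ X + b - y
    return float(residual @ residual) / (2 * m)


def gradient_step(w, b, X, y, alpha):
    """One update of all 784 weights and the offset for a single digit."""
    m = X.shape[1]
    residual = w @ X + b - y      # how wrong we are on each picture, shape (m,)
    grad_w = X @ residual / m     # dJ/dw_j for every pixel j, shape (784,)
    grad_b = residual.mean()      # dJ/db
    return w - alpha * grad_w, b - alpha * grad_b


rng = np.random.default_rng(0)
m = 50
X = rng.random((784, m))          # random stand-in pictures
y = np.zeros(m)
y[::5] = 1.0                      # pretend every 5th picture shows our digit
w, b = np.zeros(784), 0.0

before = cost(w, b, X, y)
for _ in range(20):
    w, b = gradient_step(w, b, X, y, alpha=0.001)
after = cost(w, b, X, y)
print(before, after)
```

The error after the steps should come out smaller than the error before. A learning rate that is too large can make it grow instead, which is why `alpha` is kept small here.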

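Where the partial derivatives follow from the definition of $J$:

\begin{align*}
\frac{\partial J(w_{i},b_i)}{\partial w_{i, j}} &= \frac{1}{m} \sum_{k=0}^{m-1} (w_i\cdot x_k + b_i - y_{i,k})\, x_{j,k} \\
\frac{\partial J(w_{i},b_i)}{\partial b_i} &= \frac{1}{m} \sum_{k=0}^{m-1} (w_i\cdot x_k + b_i - y_{i,k})
\end{align*}

Adding up the error over all 10 digits gives the same thing in matrix form, where the square means squaring every element and then summing:

\begin{equation*}
J(W,B) = \frac{1}{2m} \sum_{i,k} \left[(W\cdot X + B\cdot \mathbf{1}_{1\times m} - Y)_{i,k}\right]^2
\end{equation*}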
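
Reading the square in $J(W,B)$ as "square every element, then sum" (an assumption about the notation), a quick NumPy check with made-up data shows that the matrix form agrees with adding up the ten per-digit errors.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6                                        # tiny stand-in for the 20000 pictures
X = rng.random((784, m))
W = rng.normal(scale=0.01, size=(10, 784))
B = rng.normal(scale=0.01, size=(10, 1))

labels = np.arange(m) % 10                   # made-up labels 0..5
Y = np.zeros((10, m))
Y[labels, np.arange(m)] = 1.0                # one 1.0 per column

# Matrix form: B times a row of ones copies the offsets into every column.
ones = np.ones((1, m))
residual = W @ X + B @ ones - Y              # shape (10, m)
J_matrix = np.sum(residual ** 2) / (2 * m)

# Per-digit form: one J(w_i, b_i) per digit, then add them up.
J_per_digit = [
    np.sum((W[i] @ X + B[i, 0] - Y[i]) ** 2) / (2 * m) for i in range(10)
]
J_sum = sum(J_per_digit)

print(np.isclose(J_matrix, J_sum))
```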

```python
print(28*28*10+10)
# 7850

print(1000*20000*7850)
# 157000000000

print(f"{28*28*10+10} eqns per iteration")
# 7850 eqns per iteration
```
