# Day 5: Introduction to Deep Neural Networks!

In this tutorial, we're going to go over the fundamentals of Neural Networks, specifically focusing on **Dense Neural Networks**. (If you're interested in Convolutional Neural Networks, don't fret! You'll probably enjoy the next lecture ;) ) 

Neural Networks as a whole are an _incredibly_ broad and complex topic. As we only have half a day today, we're not going to be able to cover much more than the basic ideas and techniques, but the hope is that this provides a stable base for you to continue building your knowledge on top of. If you don't intend to continue much further down the machine learning rabbit hole, we hope that at the very least, this can serve to de-mystify neural networks/machine learning as a whole :) 

**Learning Objectives:**
* Understand Vectors, Matrices, and the Dot Product
* Understand how a single layer perceptron works, along with batching and gradient descent.
* Learn how to build and train a simple Deep Neural Network (Multi-Layer Perceptron - MLP) using the TensorFlow/Keras deep learning python framework
* Learn about different optimizers beyond (Stochastic) Gradient Descent and touch on when you might want to use them
* Touch on a few key "Hyperparameters" that you, the human, can and will need to optimize when designing a neural network (Learning Rate, Batch Size, etc.) 


**Other Resources**
* a
* b
* c



In [4]:
# Run Me Ahead of time!
from sklearn.datasets import fetch_openml
jet_tagging_data = fetch_openml('hls4ml_lhc_jets_hlf')

## Notation, Terms, and other background info
In this notebook, you'll likely see some new symbols, terminology, and other topics that you haven't come across before. Don't be afraid!

Most of the math symbols you see here simply function as *shorthand* for different math concepts, so that we can write an equation using these concepts without needing to define "generic" variables or functions every time we reference them. We've compiled a list of these terms here for you to reference when needed, but at least skimming through these before continuing is recommended. 

Additionally, there are some math concepts that are useful to know about/have a basic understanding of, as they're used quite heavily in this notebook (and Neural Networks/Machine Learning in general!)

We're going to assume that you have a baseline level of knowledge regarding some math concepts, but if you run into something that you don't understand (regardless of whether it's defined here or not!), please let us know and we'll be happy to explain it! Additionally, some definitions of terms might include Machine Learning concepts that are covered later in the notebook (e.g. *Activation Functions*). We'll include another list of terms/symbols that are covered as part of this notebook at the end, so for now don't worry too much if you run into one of these terms. 

## Math Concepts

###  Vectors & Scalars 
A **vector** is a kind of variable that not only has a value (aka *magnitude*), but also a *direction*. One common way to represent and define vectors, which we'll use here, is by stating the change along each axis (the *components*) the vector travels in (in 2D, these are the $x$ and $y$ *components*; in 3D, the $x$, $y$ and $z$ *components*). 

This can be written as $\vec{v} = (x,y)$ _or_ as $\boldsymbol{v} = (x,y)$ (note the second $v$ is bolded). Sometimes, the *components* are written stacked vertically such as $\vec{v} = \begin{pmatrix} x\\ y\\ \end{pmatrix}$

In this format, calculating the *magnitude* of a vector, written as $| \vec{v} |$, is done by applying the Pythagorean theorem: $| \vec{v} | = \sqrt{x^2 + y^2}$

<img src="img/vector_components.png" alt="Plot labeling the components of a vector, plus the definition for magnitude" style="background-color:white; width:400px;" />

* You can add or subtract vectors to/from vectors, this will produce a vector as a result ( the *resultant*) 
* Multiplication of a Vector with a Scalar produces a vector
* Multiplication of Vectors with Vectors can produce either:
    * Another Vector, by calculating the *Cross Product*: $\vec{v} \times \vec{b} = \vec{vb}$
    * a *Scalar*, by calculating the *Dot Product*: $\vec{v} \cdot \vec{b} = vb$ 
* You cannot divide with vectors
    
a **scalar** is a normal value, without a direction. Scalars are considered to only have a *magnitude*. Most math you know and have done throughout school has been **scalar** math :)
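
The vector operations listed above can be tried out directly in numpy. This is a minimal sketch; the example values `v = (3, 4)` and `b = (1, 2)` are made up for illustration and don't come from the rest of the notebook.

```python
import numpy as np

# Two example 2D vectors, written as their (x, y) components.
v = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

magnitude = np.linalg.norm(v)  # Pythagorean theorem: sqrt(3^2 + 4^2)
resultant = v + b              # vector + vector -> vector (the resultant)
scaled = 2.0 * v               # scalar * vector -> vector
dot = np.dot(v, b)             # vector . vector -> scalar

print(magnitude)  # 5.0
print(resultant)  # [4. 6.]
print(scaled)     # [6. 8.]
print(dot)        # 11.0
```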


### Matrix/Matrices 
A **Matrix** is a _structured_ table/an array of numbers. The location of each element matters, and they can be **1 or more dimensions** in shape. Matrices are generally labeled with capital/uppercase letters, and are represented with their elements placed in a rectangular grid, enclosed by large square brackets $[  ]$ or parentheses $(  )$. A Matrix with $m$ rows and $n$ columns can be written as    
$$ 
A = [a_{m,n}] = \begin{bmatrix} 
  {a}_{1,1} & \dots & {a}_{1,n}\\ 
  \vdots & \ddots & \vdots\\ 
  {a}_{m,1} & \dots & {a}_{m,n}\\ 
\end{bmatrix} = \begin{pmatrix} 
  {a}_{1,1} & \dots & {a}_{1,n}\\ 
  \vdots & \ddots & \vdots\\ 
  {a}_{m,1} & \dots & {a}_{m,n}\\ 
\end{pmatrix}
$$

So if we set $m=4$ and $n=2$
$$
A = [a_{4,2}] = \begin{bmatrix} 
  {a}_{1,1} & {a}_{1,2}\\ 
  {a}_{2,1} & {a}_{2,2}\\ 
  {a}_{3,1} & {a}_{3,2}\\ 
  {a}_{4,1} & {a}_{4,2}\\ 
\end{bmatrix} = \begin{pmatrix} 
  {a}_{1,1} & {a}_{1,2}\\ 
  {a}_{2,1} & {a}_{2,2}\\ 
  {a}_{3,1} & {a}_{3,2}\\ 
  {a}_{4,1} & {a}_{4,2}\\ 
\end{pmatrix}
$$

Where ${a}_{i,j}$ (no brackets/parentheses) refers to a single element in the matrix, where $i$ refers to the row and $j$ refers to the column

* Sometimes, we'll talk about *transposing* a matrix, which means we flip it along its diagonal axis! When transposing a matrix, we swap the values of $m$ and $n$ and move the elements of the matrix to match. We refer to the transposed version of a matrix by adding a superscript of $T$ to it. For example
<center><img src="img/Matrix_transpose.gif" alt="Animation of a matrix being transposed"/></center>
$$ 
A = [a_{3,2}] = \begin{bmatrix} 
  1 & 2\\ 
  3 & 4\\ 
  5 & 6\\  
\end{bmatrix} \quad  A^{T} = [a^{T}_{2,3}] = \begin{bmatrix} 
  1 & 3 & 5 \\ 
  2 & 4 & 6 \\ 
\end{bmatrix}
$$ 


* A 1-Dimensional matrix (where either $n=1$ or $m=1$) can also be called a **column vector** (when shape = $m \times 1$) or a **row vector** (when shape = $1 \times n$), and transposing one kind turns it into the other. These can be _treated_ like vectors (such as when you're performing certain operations), but are still matrices. A key point is that a *vector* **cannot** be transposed, but a *matrix* can. (Yes, this is confusing. This distinction won't come up today, but it is important in Linear Algebra in general.) 
$$
A = [a_{4,1}] = \begin{bmatrix} 
  1 \\ 
  2 \\ 
  3 \\ 
  4 \\
  \end{bmatrix} \quad B = [b_{1,4}] = \begin{bmatrix} 
  5 & 6 & 7 & 8\\ 
\end{bmatrix}
$$

* **Importantly,** you can/will see some matrices represented in the form of a *column vector*, functioning as a ***table*** of vectors. You have to be careful with this distinction, as it's shorthand that needs to be expanded before performing operations on it as a matrix. For example, if we have a matrix $A = [a_{4,2}]$, we can write it as a table of 2-Dimensional vectors $\vec{a_n} = (x,y)$ like so:

$$
A = [a_{4,2}] = \begin{bmatrix} 
  \vec{a_1} \\ 
  \vec{a_2} \\ 
  \vec{a_3} \\ 
  \vec{a_4} \\
  \end{bmatrix} = \begin{bmatrix} 
  {a_1}_x & {a_1}_y\\ 
  {a_2}_x & {a_2}_y\\ 
  {a_3}_x & {a_3}_y\\ 
  {a_4}_x & {a_4}_y\\ 
  \end{bmatrix} = \begin{bmatrix} 
  {a}_{1,1} & {a}_{1,2}\\ 
  {a}_{2,1} & {a}_{2,2}\\ 
  {a}_{3,1} & {a}_{3,2}\\ 
  {a}_{4,1} & {a}_{4,2}\\ 
  \end{bmatrix}
$$
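
The matrix ideas above (shape, transposing, row/column vectors, and a matrix as a table of vectors) can be sketched in numpy as follows. The variable names are illustrative only; the values for `A`, `col`, and `table` mirror the small examples used in this section.

```python
import numpy as np

# A matrix with m=3 rows and n=2 columns.
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
A_T = A.T  # transposing swaps rows and columns: shape (3, 2) -> (2, 3)

# A column vector is an (m x 1) matrix; transposing it gives a (1 x n) row vector.
col = np.array([[1], [2], [3], [4]])
row = col.T

# A "table of vectors": stacking four 2D vectors gives a (4 x 2) matrix,
# with one vector per row.
vectors = [np.array([0, 0]), np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]
table = np.stack(vectors)

print(A.shape, A_T.shape)    # (3, 2) (2, 3)
print(col.shape, row.shape)  # (4, 1) (1, 4)
print(table.shape)           # (4, 2)
```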

### Dot Product
The dot product of two vectors with the same number of components is simply the sum of the products of their corresponding elements, defined as such:

$ \mathbf a \cdot \mathbf b = \sum_{i=1}^n a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n $
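
To make the definition concrete, here is a small sketch that computes the dot product both by following the sum term by term and with `np.dot`. The vectors `a` and `b` are arbitrary example values.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Follow the definition: multiply corresponding elements, then add them up.
manual = sum(a_i * b_i for a_i, b_i in zip(a, b))  # 1*4 + 2*5 + 3*6

# numpy performs the same computation in one call.
with_numpy = np.dot(a, b)

print(manual, with_numpy)  # 32 32
```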



***Variable/Function Definitions - ML/Notebook Specific***
Most of these symbols are common across Machine Learning/Neural Network literature, but there still might be slight variations. For the purposes of this notebook, these variables are defined as follows

Dataset Related Variables:
* $\vec{x}$ - An **unmodified** *input* vector, these make up the inputs to our neural network
* $\vec{x_n}$ - pronounced "*x n*", e.g. "x 1", "x 2", etc. - a specific input vector (feature), e.g. $\vec{x_1}$, $\vec{x_2}$, etc. (this also applies to other forms of $\vec{x}$, $\vec{y}$, and $\vec{w}$ such as $\vec{x'}$ and $\vec{\hat{y}}$ ) 
* $\vec{x'}$ - pronounced *"x prime"* - Generally a modified value of x (derived from it, though not a derivative in the Calculus sense :) ), usually referring to the output of an intermediate step during pre-processing, an operation within a *hidden layer* of a neural network, or the output of the *hidden layer* itself
* $\vec{y}$ - A truth vector, containing whatever the _actual truth values_ that the neural network is trying to learn to replicate. 
* $\vec{\hat{y}}$ - pronounced *Y Hat* - the **modified**/**final** output vector of our neural network, usually after we apply an *activation function*
* ${X}$ - (Capital/Uppercase X) - the _input_ matrix for our neural network, comprised of all input vectors ($x$)
* $\hat{Y}$ - (Capital/Uppercase Y Hat) - the _output_ matrix for our neural network, comprised of all output vectors ($\hat{y}$) from our neural network. 

Neural Network Related Variables/Functions:
* $\vec{w}$ - a *weight* vector, a _learned parameter_ that a neural network learns during the training process.  
* $b$ - a *bias* scalar, a _learned parameter_ that a neural network learns during the training process.
* $g(...)$ - an *activation function*, used to add *non linear behavior* to the network. This can represent one of multiple different functions (such as Step, ReLU, Sigmoid, TanH, etc. see the "Activation Functions" link in *Other Resources* above!), depending on the context and architecture of a given neural network. 



***Math Symbols/Definitions***
This notation is pretty general and shared across most math literature you'll find, but it's important to check specific meanings regardless, **ESPECIALLY** if this notation is being used in other contexts/fields (Outside ML, Discrete Math, and Linear Algebra) 
* $\in$ - pronounced as "in" - A symbol denoting that the value of a mentioned variable (usually $x$, $y$, etc.) exists within/is bounded by some set of numbers, meaning that it will _always_ be within that set and never be anything that's not part of it. 
* $\{a, b, c\}$ - pronounced "set of ..." - This is a specifically defined set (list) of numbers, usually used in conjunction with the above "in"/ $\in$ symbol. (Note the curly braces $\{ \}$) 
* $\mathbb{R}$ - pronounced "\<the set of> all real numbers" - This symbol represents the set of **all** real numbers, usually used in conjunction with the above "in"/ $\in$ symbol. 
* In this notebook, we also use the notation $\mathbb{R}^d$ or $\{a, b, c\}^d$  , where $d$ indicates the dimensions/shape of whatever the variable we're referencing (Outside of this notebook, $\mathbb{R}^n$ is often the same meaning where $d$ = $n$). 
    * e.g. $\hat{Y} \in \mathbb{R}^{4 \times 1}$ means that a ***matrix*** $\hat{Y}$ in the shape of "${4 \times 1}$", where each element of the matrix is a real number (part of the set $\mathbb{R}$) 
    * e.g. $\vec{x} \in \mathbb{R}^2$ is a ***vector*** $\vec{x}$ in $2$ dimensional space, where the possible values of the vector's *components* are all real numbers. 
    * e.g. $\vec{x} \in \{0,1\}^2$ is a ***vector*** $\vec{x}$ in $2$ dimensional space, where the possible values of the vector's *components* are either $0$ or $1$.  
    * e.g. $\vec{\hat{y}} \in \{0,1\}$ is a ***vector*** $\vec{\hat{y}}$ in $1$ dimensional space, where the possible value of the vector's single *component* is either $0$ or $1$ 
    * e.g. $\{0,1\}^2$ on its own refers to any (unnamed) ***vector*** in $2$ dimensional space whose *components* are each either $0$ or $1$.  
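
As a rough illustration of this notation, the sketch below lists every vector in $\{0,1\}^2$ and creates an array with the shape described by $\mathbb{R}^{4 \times 1}$. The names `binary_vectors` and `Y_hat` are chosen here for illustration.

```python
import itertools
import numpy as np

# Every vector in {0,1}^2: two components, each either 0 or 1.
binary_vectors = [np.array(p) for p in itertools.product([0, 1], repeat=2)]
for vec in binary_vectors:
    print(vec)

# A matrix in R^(4 x 1): 4 rows, 1 column, every element a real number.
Y_hat = np.zeros((4, 1))
print(Y_hat.shape)  # (4, 1)
```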

# Feedforward Neural Networks

This notebook explains various ways of implementing single-layer and multi-layer neural networks. The implementations are arranged from concrete (explicit) to abstract, so that one can understand the processes black-boxed by deep learning frameworks.


## Example Task: Boolean Logic Gates
In order to focus on explaining the internals of training, this notebook uses a simple and classic example: *boolean logic gates* (aka *threshold logic units*).

When we talk about boolean logic, we refer to operations that exclusively use ***Truth Values***, which are *binary* values that can **only** be *True* or *False*, typically represented where $x=0$ means *False* and $x=1$ means *True*. 


*Boolean Logic Gates* take one or more boolean values as input and produce a single boolean output. Some basic *Boolean Logic Gates* are:
* AND ($\wedge$) - Takes two inputs (A & B) and outputs *True* **if and only if** both A and B are True. Otherwise outputs *False*.
* OR ($\vee$) - Takes two inputs (A & B) and outputs *True* **if either** A or B are *True*. Otherwise outputs *False*.
* NOT ($\lnot$) - Takes one input (A) and *inverts* it. If A is *True*, it outputs *False*. If A is *False*, it outputs *True*.
* NAND ($\uparrow$) - Takes two inputs (A & B) and outputs *True* **unless both** A and B are *True*. If both are *True*, it outputs *False*.
    * This is the opposite of the AND Gate, effectively as if you take the output of an AND gate and pass it through a NOT gate
* NOR ($\downarrow$) - Takes two inputs (A & B) and outputs *True* **only if both** A and B are *False*. Otherwise outputs *False*.
    * This is the opposite of the OR Gate, effectively as if you take the output of an OR gate and pass it through a NOT gate
* XOR ($\oplus$) - *exclusive OR* - Takes two inputs (A & B) and outputs *True* if **only one input** (A or B, but not both at the same time) is *True*. Otherwise outputs *False*.
* XNOR ($\odot$) - *exclusive NOR* - Takes two inputs (A & B) and outputs *True* **only if both inputs are the same**. Otherwise outputs *False*.
    * This is the opposite of the XOR Gate, effectively as if you take the output of an XOR gate and pass it through a NOT gate
<center><img src="img/logic_gates.png" alt="Logic gate symbols, as typically used in electronics"/></center>

It's common to compile these operations into a "truth table", like below. Columns $A$ and $B$ represent the inputs, and each column named after a *Boolean Logic Gate* gives that gate's output for the values of $A$ and $B$ in a given row. 

| $A$ | $B$ | AND | OR | NOT* | NAND | NOR | XOR | XNOR |
| :-: | :-: | :---: | :--: | :---: | :----: | :---: | :---: | :----: |
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |

\* Only uses one input, column A


Defining $x=0$ as *false* and $x=1$ as *true*, single-layer neural networks can realize logic units such as AND ($\wedge$), OR ($\vee$), NOT ($\lnot$), and their inverted counterparts. 

Because they're only one layer, they're unable to represent logical compounds such as XOR/XNOR. 
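
The truth table above can be reproduced with numpy's element-wise bitwise operators, which is a handy way to double-check each gate's definition. This sketch is only an illustration of the gates themselves; it does not involve a neural network yet.

```python
import numpy as np

# All four combinations of the two inputs, in the same order as the truth table.
A = np.array([0, 0, 1, 1])
B = np.array([0, 1, 0, 1])

gates = {
    'AND':   A & B,
    'OR':    A | B,
    'NOT A': 1 - A,        # NOT only uses one input (column A)
    'NAND':  1 - (A & B),  # AND followed by NOT
    'NOR':   1 - (A | B),  # OR followed by NOT
    'XOR':   A ^ B,
    'XNOR':  1 - (A ^ B),  # XOR followed by NOT
}

for name, output in gates.items():
    print(f'{name:<5} {output}')
```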




## Using numpy

In [9]:
import numpy as np


### Single-layer perceptron

A single-layer perceptron predicts a binary label $\hat{y} \in \{0, 1\}$ for a given input vector $\boldsymbol{x} \in \mathbb{R}^d$ ($d$ represents the number of input dimensions) by using the following formula,
$$
\hat{y} = g(\boldsymbol{w} \cdot \boldsymbol{x} + b) = g(w_1 x_1 + w_2 x_2 + ... + w_d x_d + b)
$$

Here, $\boldsymbol{w} \in \mathbb{R}^d$ is a **weight** (vector); $b \in \mathbb{R}$ is a **bias** (scalar); and $g(.)$ denotes an **activation function** (in this case, the Unit Step Function, where we assume $g(0)=0$).

For simplicity, let us consider examples with two-dimensional inputs ($d=2$).
We can represent an input vector $\boldsymbol{x} \in \mathbb{R}^2$ and weight vector $\boldsymbol{w} \in \mathbb{R}^2$ with `numpy.array`. We also define the bias term $b$.

In [11]:
# Define some weight, input, and bias values to use 
x = np.array([0, 1]) # 2 dimensional vector with possible values {0,1} - Inputs
w = np.array([1.0, 1.0]) # 2 dimensional vector with possible values ℝ    - Weights
b = 1.0 # scalar value                                                    - Bias

The following code computes $\boldsymbol{w} \cdot \boldsymbol{x} + b$,


In [12]:
np.dot(x, w) + b

2.0

We can apply the step function (also known as a Heaviside or Unit step function) as an *activation function*, $g()$:

<img src="img/heaviside_step.png" alt="Plot of the Heaviside Step Function" style="background-color:white; width: 400px;"/>
When applied to the above result as $g(\boldsymbol{w} \cdot \boldsymbol{x} + b)$, it yields a binary label, $\hat{y}$.


In [13]:
np.heaviside(np.dot(x, w) + b, 0)

1.0

#### Including the bias term into the weight vector

For concise implementation, we include a bias term `b` as an additional dimension to the weight vector `w`. More concretely, we append an element with the value of $1$ to each input,
$$
\boldsymbol{x} = (0, 1) \rightarrow \boldsymbol{x}' = (0, 1, 1)
$$
and expand the weight vector by one dimension, so that $\boldsymbol{w} \in \mathbb{R}^{3}$.

Then, the formula of the single-layer perceptron becomes,
$$
\hat{y} = g((w_1, w_2, w_3) \cdot \boldsymbol{x}') = g(w_1 x_1 + w_2 x_2 + w_3)
$$
In other words, $w_1$ and $w_2$ represent weights for $x_1$ and $x_2$, respectively, and $w_3$ is our bias value.

In [14]:
x = np.array([0, 1, 1]) 
w = np.array([1.0, 1.0, 1.0])

We can simplify the code to predict a binary label $\hat{y}$,

In [15]:
np.heaviside(np.dot(x, w), 0)

1.0
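
As a quick sanity check, the sketch below confirms that folding the bias into the weight vector gives the same predictions as keeping it separate. It evaluates all four possible inputs at once by stacking them as rows of a matrix; the names `X_aug` and `w_aug` are introduced here for illustration.

```python
import numpy as np

# All four possible 2D binary inputs, one per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w = np.array([1.0, 1.0])
b = 1.0

# Explicit bias: g(w . x + b) for every row of X.
explicit = np.heaviside(X @ w + b, 0)

# Folded bias: append a constant 1 to every input and append b to the weights.
X_aug = np.hstack([X, np.ones((len(X), 1))])
w_aug = np.append(w, b)
folded = np.heaviside(X_aug @ w_aug, 0)

print(explicit)
print(folded)
print(np.array_equal(explicit, folded))  # True
```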

#### Training a NAND gate

Let's train a NAND gate with two inputs. More specifically, we want to find a weight vector $\boldsymbol{w}$ and a bias value $b$ of a single-layer perceptron that realizes the truth table of the NAND gate: $\{0,1\}^2 \to \{0,1\}$.

| $x_1$ | $x_2$ | $y$ |
| :---: |:---: | :---: |
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

We convert the truth table into a training set consisting of all mappings of the NAND gate,
$$
\boldsymbol{x}_1 = (0, 0), y_1 = 1 \\
\boldsymbol{x}_2 = (0, 1), y_2 = 1 \\
\boldsymbol{x}_3 = (1, 0), y_3 = 1 \\
\boldsymbol{x}_4 = (1, 1), y_4 = 0 \\
$$

As explained earlier, we include the bias term into the last dimension.
$$
\boldsymbol{x}'_1 = (0, 0, 1), y_1 = 1 \\
\boldsymbol{x}'_2 = (0, 1, 1), y_2 = 1 \\
\boldsymbol{x}'_3 = (1, 0, 1), y_3 = 1 \\
\boldsymbol{x}'_4 = (1, 1, 1), y_4 = 0 \\
$$

The code below implements Rosenblatt's perceptron algorithm with a fixed number of iterations (50 times). We use a constant learning rate 0.5 for simplicity.
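
Written as an equation, the update applied in the code below at each iteration is the following, where $\eta$ stands for the learning rate (`lr` in the code) and $i$ is the index of the randomly chosen sample:

$$
\hat{y}_i = g(\boldsymbol{w} \cdot \boldsymbol{x}'_i), \qquad
\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta \, (y_i - \hat{y}_i) \, \boldsymbol{x}'_i
$$

For example, starting from $\boldsymbol{w} = (0, 0, 0)$ and picking $\boldsymbol{x}'_3 = (1, 0, 1)$ with $y_3 = 1$, the prediction is $\hat{y}_3 = g(0) = 0$, so the update gives

$$
\boldsymbol{w} \leftarrow (0, 0, 0) + 0.5 \cdot (1 - 0) \cdot (1, 0, 1) = (0.5, 0, 0.5)
$$

When the prediction is already correct, $y_i - \hat{y}_i = 0$ and the weights are left unchanged.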


In [27]:
import random
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1., 1., 1., 0])
w = np.array([0.0, 0.0, 0.0])

lr = 0.5 #Learning Rate
for t in range(50):
    # Pick one sample of training data, at index (i), at random.
    i = random.choice(range(len(y)))
    # Predict the label for the instance x[i] with the current parameter w.
    y_pred = np.heaviside(np.dot(x[i], w), 0)
    # Show the detail of the instance and the current parameter.
    print(f'#{t:<2}: training sample index={i}, x={x[i]}, w={w}, y={y[i]}, y_pred={y_pred}, y_err={y[i] - y_pred}')
    # Update the parameter.
    loss = y[i] - y_pred
    w += loss * lr * x[i]

#0 : training sample index=2, x=[1 0 1], w=[0. 0. 0.], y=1.0, y_pred=0.0, y_err=1.0
#1 : training sample index=3, x=[1 1 1], w=[0.5 0.  0.5], y=0.0, y_pred=1.0, y_err=-1.0
#2 : training sample index=0, x=[0 0 1], w=[ 0.  -0.5  0. ], y=1.0, y_pred=0.0, y_err=1.0
#3 : training sample index=3, x=[1 1 1], w=[ 0.  -0.5  0.5], y=0.0, y_pred=0.0, y_err=0.0
#4 : training sample index=1, x=[0 1 1], w=[ 0.  -0.5  0.5], y=1.0, y_pred=0.0, y_err=1.0
#5 : training sample index=2, x=[1 0 1], w=[0. 0. 1.], y=1.0, y_pred=1.0, y_err=0.0
#6 : training sample index=3, x=[1 1 1], w=[0. 0. 1.], y=0.0, y_pred=1.0, y_err=-1.0
#7 : training sample index=3, x=[1 1 1], w=[-0.5 -0.5  0.5], y=0.0, y_pred=0.0, y_err=0.0
#8 : training sample index=2, x=[1 0 1], w=[-0.5 -0.5  0.5], y=1.0, y_pred=0.0, y_err=1.0
#9 : training sample index=1, x=[0 1 1], w=[ 0.  -0.5  1. ], y=1.0, y_pred=1.0, y_err=0.0
#10: training sample index=1, x=[0 1 1], w=[ 0.  -0.5  1. ], y=1.0, y_pred=1.0, y_err=0.0
#11: training sample index=1,

We can confirm the learned parameter and classification results.

In [28]:
w

array([-0.5, -1. ,  1.5])

In [32]:
y_pred=np.heaviside(np.dot(x, w), 0)
print(f" Truth: {y}")
print(f"  Pred: {y_pred}")

 Truth: [1. 1. 1. 0.]
  Pred: [1. 1. 1. 0.]


### Single-layer perceptron with batching

When training a model on a larger dataset with many samples, it's ideal to reduce the number of computations performed in native Python code (which is a lot slower than many other languages). A common technique for speeding up machine-learning code written in Python is to use libraries that accelerate large matrix operations, such as numpy (or TensorFlow, Keras, and PyTorch, but we'll get to those in a bit!) 

Even when using a library to accelerate these operations, we'll still run into issues with the sheer number of computations if we calculate the loss and update the weights for every single sample. Instead, we use a technique called **batching**, where we calculate the loss for multiple samples at a time and update the weights once every $n$ samples, where $n$ is the number of samples in one batch, or the **batch size**.

So, putting that into practice:

The single-layer perceptron makes predictions for four inputs, 
$$
\hat{y}_1 = g(\boldsymbol{x}_1 \cdot \boldsymbol{w}) \\
\hat{y}_2 = g(\boldsymbol{x}_2 \cdot \boldsymbol{w}) \\
\hat{y}_3 = g(\boldsymbol{x}_3 \cdot \boldsymbol{w}) \\
\hat{y}_4 = g(\boldsymbol{x}_4 \cdot \boldsymbol{w}) \\
$$
and stacking these four outputs gives one prediction per input,
$$
\hat{Y} = \begin{pmatrix} 
  \hat{y}_1 \\ 
  \hat{y}_2 \\ 
  \hat{y}_3 \\ 
  \hat{y}_4 \\ 
\end{pmatrix},
$$


Here, we define $\hat{Y} \in \mathbb{R}^{4 \times 1}$ and $X \in \mathbb{R}^{4 \times d}$ as,
$$
\hat{Y} = \begin{pmatrix} 
  \hat{y}_1 \\ 
  \hat{y}_2 \\ 
  \hat{y}_3 \\ 
  \hat{y}_4 \\ 
\end{pmatrix},
X = \begin{pmatrix} 
  \boldsymbol{x}_1 \\ 
  \boldsymbol{x}_2 \\ 
  \boldsymbol{x}_3 \\ 
  \boldsymbol{x}_4 \\ 
\end{pmatrix}
$$

Then, we can write the four predictions in one dot-product computation,
$$
\hat{Y} = g(X \cdot \boldsymbol{w})
$$

The code below implements this idea. The function `np.heaviside()` yields a vector corresponding to the four predictions, applying the step function for every element of the argument.

In [38]:
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
w = np.array([1.0, 0.5, -0.5])
np.heaviside(np.dot(x, w), 0)

array([0., 0., 1., 1.])
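
To see that the batched computation really is the same as predicting one input at a time, this sketch compares the two approaches using the same inputs and weights as the cell above. The names `one_by_one` and `batched` are illustrative.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
w = np.array([1.0, 0.5, -0.5])

# One prediction per input vector, computed in a Python loop.
one_by_one = np.array([np.heaviside(np.dot(x_i, w), 0) for x_i in X])

# All four predictions in a single matrix-vector product.
batched = np.heaviside(np.dot(X, w), 0)

print(one_by_one)
print(batched)
print(np.array_equal(one_by_one, batched))  # True
```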

The code below applies the Perceptron algorithm with batching:

In [43]:
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

lr = 0.5
for t in range(10):
    y_pred = np.heaviside(np.dot(x, w), 0) # Instead of picking a single sample to calculate with, we give it all 4 at once
    Yerr = y - y_pred # The error is now a 4-dimensional vector: one value for each input row of x
    print(f'#{t}: w={w}, Y={y}, Ypred={y_pred}, Yerr={Yerr}, dw={np.dot(Yerr, x)}')
    w += lr * np.dot(Yerr, x) # Update the weights once, using the summed error of the whole batch

#0: w=[0. 0. 0.], Y=[1 1 1 0], Ypred=[0. 0. 0. 0.], Yerr=[1. 1. 1. 0.], dw=[1. 1. 3.]
#1: w=[0.5 0.5 1.5], Y=[1 1 1 0], Ypred=[1. 1. 1. 1.], Yerr=[ 0.  0.  0. -1.], dw=[-1. -1. -1.]
#2: w=[0. 0. 1.], Y=[1 1 1 0], Ypred=[1. 1. 1. 1.], Yerr=[ 0.  0.  0. -1.], dw=[-1. -1. -1.]
#3: w=[-0.5 -0.5  0.5], Y=[1 1 1 0], Ypred=[1. 0. 0. 0.], Yerr=[0. 1. 1. 0.], dw=[1. 1. 2.]
#4: w=[0.  0.  1.5], Y=[1 1 1 0], Ypred=[1. 1. 1. 1.], Yerr=[ 0.  0.  0. -1.], dw=[-1. -1. -1.]
#5: w=[-0.5 -0.5  1. ], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]
#6: w=[-0.5 -0.5  1. ], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]
#7: w=[-0.5 -0.5  1. ], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]
#8: w=[-0.5 -0.5  1. ], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]
#9: w=[-0.5 -0.5  1. ], Y=[1 1 1 0], Ypred=

We can confirm that the learned parameters classify all four inputs correctly, just like the example done without batching. (The weight values themselves differ between the two runs, but both realize the NAND gate.)

In [40]:
w

array([-0.5, -0.5,  1. ])

In [42]:
y_pred=np.heaviside(np.dot(x, w), 0)
print(f" Truth: {y}")
print(f"  Pred: {y_pred}")

 Truth: [1 1 1 0]
  Pred: [1. 1. 1. 0.]
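
The batched weight update `np.dot(Yerr, x)` may look a little opaque at first. The sketch below shows that it is simply the sum of the four individual per-sample updates, here evaluated at the starting weights $\boldsymbol{w} = (0, 0, 0)$. The names `per_sample_sum` and `batched_dw` are illustrative.

```python
import numpy as np

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

y_pred = np.heaviside(np.dot(x, w), 0)
Yerr = y - y_pred

# Add up the update that each sample would have contributed on its own.
per_sample_sum = sum(Yerr[i] * x[i] for i in range(len(y)))

# The same quantity computed in one vector-matrix product.
batched_dw = np.dot(Yerr, x)

print(per_sample_sum)  # [1. 1. 3.]
print(batched_dw)      # [1. 1. 3.]
```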


### Stochastic gradient descent (SGD) with mini-batch

In [None]:
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

eta = 0.5
for t in range(100):
    y_pred = sigmoid(np.dot(x, w))
    print(f'#{t}: w={w}, Y={y}, Ypred={y_pred}, Yerr={y-y_pred}, dw={np.dot((y - y_pred), x)}')
    w -= eta * np.dot((y_pred - y), x)

In [None]:
w

In [None]:
sigmoid(np.dot(x, w))
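
The update `w -= eta * np.dot((y_pred - y), x)` in the loop above is a gradient descent step, assuming the loss being minimized is the binary cross-entropy summed over the batch (the loop itself never names its loss, so this is an assumption). Under that assumption, the gradient of the loss with respect to $\boldsymbol{w}$ is $\sum_i (\hat{y}_i - y_i)\,\boldsymbol{x}_i$. The sketch below checks this numerically with finite differences; the helper names `bce_loss` and `bce_grad` and the test point `w` are made up for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bce_loss(w, x, y):
    # Binary cross-entropy, summed over all samples in the batch.
    p = sigmoid(np.dot(x, w))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(w, x, y):
    # Analytic gradient: the same expression used in the training loop.
    return np.dot(sigmoid(np.dot(x, w)) - y, x)

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([1.0, 1.0, 1.0, 0.0])
w = np.array([0.1, -0.2, 0.3])  # an arbitrary point at which to check the gradient

# Central finite differences: nudge one weight at a time and watch the loss change.
eps = 1e-6
numeric = np.array([
    (bce_loss(w + eps * e, x, y) - bce_loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = bce_grad(w, x, y)

print(numeric)
print(analytic)
print(np.allclose(numeric, analytic, atol=1e-6))  # True
```

Because the two agree, subtracting `eta` times this gradient moves the weights in the direction that decreases the loss.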

## Automatic differentiation

TODO: explain autodiff and its usefulness


Consider a loss function,
$$
l_{\boldsymbol{x}}(\boldsymbol{w}) = - \log \sigma(\boldsymbol{w} \cdot \boldsymbol{x}) = - \log \frac{1}{1 + e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}
$$

This section shows implementations in different libraries of deep learning for computing the loss value $l_{\boldsymbol{x}}(\boldsymbol{w})$ and gradients $\frac{\partial l_{\boldsymbol{x}}(\boldsymbol{w})}{\partial \boldsymbol{w}}$ when $\boldsymbol{x} = (1, 1, 1)$ and $\boldsymbol{w} = (1, 1, -1.5)$.
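
As a reference point for the library outputs, the loss and gradient can also be worked out by hand. Applying the chain rule to $-\log \sigma(\boldsymbol{w} \cdot \boldsymbol{x})$ gives $\frac{\partial l}{\partial \boldsymbol{w}} = -(1 - \sigma(\boldsymbol{w} \cdot \boldsymbol{x}))\,\boldsymbol{x}$. The sketch below evaluates both with plain numpy; the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 1.0, -1.5])

z = np.dot(w, x)                   # w . x = 1 + 1 - 1.5 = 0.5
sigma = 1.0 / (1.0 + np.exp(-z))   # sigmoid(0.5)
loss = -np.log(sigma)              # the loss value
grad = -(1.0 - sigma) * x          # hand-derived gradient with respect to w

print(loss)  # approximately 0.4741
print(grad)  # approximately [-0.3775 -0.3775 -0.3775]
```

The automatic differentiation libraries below should produce the same values, up to floating-point precision, without us having to derive the gradient ourselves.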

### Using autograd

See: https://github.com/HIPS/autograd

In [None]:
import autograd
import autograd.numpy as np

def loss(w, x):
    return -np.log(1.0 / (1 + np.exp(-np.dot(x, w))))

x = np.array([1, 1, 1])
w = np.array([1.0, 1.0, -1.5])

grad_loss = autograd.grad(loss)
print(loss(w, x))
print(grad_loss(w, x))

### Using TensorFlow Eager

See: https://www.tensorflow.org/guide/autodiff

In [None]:
import tensorflow as tf

dtype = tf.float32

x = tf.constant([1, 1, 1], dtype=dtype, name='x')
w = tf.Variable([1.0, 1.0, -1.5], dtype=dtype, name='w')

with tf.GradientTape() as tape:
    loss = -tf.math.log(tf.math.sigmoid(tf.tensordot(x, w, 1)))

print(loss.numpy())
print(tape.gradient(loss, w))

### Single-layer neural network with high-level NN modules (w/ optimizers)

Rewrite for Keras!

In [None]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Used for plotting loss values.
loss_history = []

eta = 0.5
for t in range(100):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.

    loss_history.append(loss.item())    # Record the loss value.
    #print(f'#{t}: loss={loss.item()}')
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.

In [None]:
import matplotlib.pyplot as plt

plt.plot(loss_history)
plt.xlabel('Iteration #')
plt.ylabel('Loss')

In [None]:
model.state_dict()

In [None]:
model(x).sigmoid()

# Multi-layer neural network with high-level NN modules

introduce the idea of layers in a network

do Jet Tagger stuff

based on [the hls4ml tutorial part 1](https://github.com/fastmachinelearning/hls4ml-tutorial/blob/main/part1_getting_started.ipynb)

TODO: Some of this is based on older sklearn/tf and needs to be updated to work properly...


## Particle Physics Example: Jet Tagging

<img src="img/jet_tagger_jets.png" alt="2D Representations of the different kinds of particle jets the neural network will classify" style="background-color:white; width: 800px;"/>

blah

### Multi-Layer Perceptron - Your first Deep Neural Network!

<img src="img/jet_tagger_mlp.png" alt="Graph of the Jet Tagger MLP Neural Network" style="background-color:white; width: 400px;"/>

blah


In [1]:
#imports
#TODO: stupid tf error in JupyterHub - does it fail in colab?
'''
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
'''
from tensorflow.keras.utils import to_categorical
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

2023-07-12 15:10:22.578969: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-07-12 15:10:22.579057: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
# load the dataset
data = jet_tagging_data
X, y = data['data'], data['target']

In [3]:
# look at some example data
print(data['feature_names'])
print(X.shape, y.shape)
print(X[:5])
print(y[:5])

['zlogz', 'c1_b0_mmdt', 'c1_b1_mmdt', 'c1_b2_mmdt', 'c2_b1_mmdt', 'c2_b2_mmdt', 'd2_b1_mmdt', 'd2_b2_mmdt', 'd2_a1_b1_mmdt', 'd2_a1_b2_mmdt', 'm2_b1_mmdt', 'm2_b2_mmdt', 'n2_b1_mmdt', 'n2_b2_mmdt', 'mass_mmdt', 'multiplicity']
(830000, 16) (830000,)
      zlogz  c1_b0_mmdt  c1_b1_mmdt  c1_b2_mmdt  c2_b1_mmdt  c2_b2_mmdt  \
0 -2.935125    0.383155    0.005126    0.000084    0.009070    0.000179   
1 -1.927335    0.270699    0.001585    0.000011    0.003232    0.000029   
2 -3.112147    0.458171    0.097914    0.028588    0.124278    0.038487   
3 -2.666515    0.437068    0.049122    0.007978    0.047477    0.004802   
4 -2.484843    0.428981    0.041786    0.006110    0.023066    0.001123   

   d2_b1_mmdt  d2_b2_mmdt  d2_a1_b1_mmdt  d2_a1_b2_mmdt  m2_b1_mmdt  \
0    1.769445    2.123898       1.769445       0.308185    0.135687   
1    2.038834    2.563099       2.038834       0.211886    0.063729   
2    1.269254    1.346238       1.269254       0.246488    0.115636   
3    0.966505  

## Data pre-processing
As it stands, our data isn't quite ready to feed into a network. We need to do a little bit of work ahead of time (**preprocessing**) to format the data and apply some statistical methods to make things easier for the network to understand.
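
The two preprocessing steps used in the cells below, standardizing the features and one-hot encoding the labels, can be sketched in plain numpy to show what they do. This is only an illustration on made-up toy values (`features` and `labels` are hypothetical); the notebook itself uses `StandardScaler` and `to_categorical` for these steps.

```python
import numpy as np

# Toy feature table: 3 samples, 2 features on very different scales.
features = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0]])

# Standardization: shift each column to mean 0 and rescale it to standard deviation 1.
mean = features.mean(axis=0)
std = features.std(axis=0)
scaled = (features - mean) / std

# One-hot encoding: integer class label k becomes a vector with a 1 at position k.
labels = np.array([0, 2, 1, 4])
n_classes = 5
one_hot = np.eye(n_classes)[labels]

print(scaled)
print(one_hot)
```

After scaling, both feature columns cover a comparable range, and each label becomes a length-5 vector that can be compared directly against a 5-output network.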

### Scaling Data

blah

### One-Hot Encoding

blah


In [7]:
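le = LabelEncoder()
# Use new variable names so that re-running this cell doesn't feed the
# already one-hot-encoded labels back into LabelEncoder (which raises
# "ValueError: y should be a 1d array, got an array of shape (830000, 5) instead.").
y_encoded = le.fit_transform(y)          # class labels -> integer class indices
y_onehot = to_categorical(y_encoded, 5)  # integer class indices -> one-hot vectors
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)
print(y_onehot[:5])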

In [None]:
scaler = StandardScaler()
X_train_val = scaler.fit_transform(X_train_val)
X_test = scaler.transform(X_test)

## Writing our MLP in Keras

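We build the network with Keras's `Sequential` API, adding layers in the order the data flows through them: three `Dense` (fully connected) hidden layers with 64, 32, and 32 units, each followed by a ReLU activation, and then a 5-unit `Dense` output layer followed by a softmax, which turns the 5 outputs into class probabilities. Every `Dense` layer uses the `lecun_uniform` weight initializer and a small L1 penalty on its weights.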

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l1
from callbacks import all_callbacks

In [None]:
model = Sequential()
model.add(Dense(64, input_shape=(16,), name='fc1', kernel_initializer='lecun_uniform', kernel_regularizer=l1(0.0001)))
model.add(Activation(activation='relu', name='relu1'))
model.add(Dense(32, name='fc2', kernel_initializer='lecun_uniform', kernel_regularizer=l1(0.0001)))
model.add(Activation(activation='relu', name='relu2'))
model.add(Dense(32, name='fc3', kernel_initializer='lecun_uniform', kernel_regularizer=l1(0.0001)))
model.add(Activation(activation='relu', name='relu3'))
model.add(Dense(5, name='output', kernel_initializer='lecun_uniform', kernel_regularizer=l1(0.0001)))
model.add(Activation(activation='softmax', name='softmax'))
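As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand. This is a small sketch assuming each `Dense` layer keeps its default bias term and that the activation layers add no parameters; `layer_sizes` is a name introduced here for illustration.

```python
# Layer widths: 16 inputs -> 64 -> 32 -> 32 -> 5 outputs
layer_sizes = [16, 64, 32, 32, 5]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params = n_in * n_out + n_out  # weight matrix + one bias per output unit
    total += n_params
    print(f"{n_in:>3} -> {n_out:<3}: {n_params} parameters")

print("Total trainable parameters:", total)
```

Calling `model.summary()` on the model above should report the same per-layer counts.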

## Training our MLP

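Before training, we compile the model with three things: an optimizer (Adam, with a learning rate of 0.0001), a loss function (categorical cross-entropy, which pairs with one-hot labels and a softmax output), and the metrics we want reported during training (accuracy). We then call `fit` to run the training loop.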

In [None]:
# Compile our model 
optimizer = Adam(learning_rate=0.0001)  # `lr` is a deprecated name for this argument
model.compile(optimizer=optimizer, loss=['categorical_crossentropy'], metrics=['accuracy'])
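To make the loss less of a black box, here is a minimal NumPy sketch of a softmax followed by categorical cross-entropy. The logits and labels are made up, and the helper functions are illustrative stand-ins, not the Keras implementation.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability; the result is unchanged
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Mean over samples of -sum_k y_true[k] * log(y_pred[k])
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Made-up raw network outputs (logits) for 2 samples and 5 classes
logits = np.array([[2.0, 0.5, 0.1, 0.0, -1.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0]])
y_true = np.array([[1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1))  # each row sums to 1
print(loss)
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each sample's loss: a confident correct prediction gives a loss near 0, while the uniform guess in the second row costs log(5).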

In [None]:
# Train! Keep the returned History object so we can plot the loss afterwards
history = model.fit(
    X_train_val, # Input data
    y_train_val, # Truth Output data
    batch_size=1024, # Batch Size
    epochs=30, # How many times will we iterate over the whole training dataset to train the model?
    validation_split=0.25, # How much of the train dataset do we want to reserve as our validation split?
    shuffle=True) # Do we want the order of our training samples to be shuffled?

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['loss']) # Keras records one loss value per epoch
plt.xlabel('Epoch')
plt.ylabel('Loss')
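To get a feel for what `batch_size`, `epochs`, and `validation_split` mean in numbers, the arithmetic below works out how many samples land in each split and how many weight updates the training call performs. It is a sketch that assumes the 830000-row dataset shown earlier and that the split sizes round as expected; the variable names are introduced here for illustration.

```python
import math

n_total = 830000          # rows in the dataset (from X.shape above)
n_test = n_total // 5     # test_size=0.2
n_train_val = n_total - n_test

n_val = n_train_val // 4  # validation_split=0.25
n_train = n_train_val - n_val

batch_size = 1024
epochs = 30
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be partial

print(n_train, n_val, n_test)
print(steps_per_epoch, steps_per_epoch * epochs)
```

Each step is one gradient update computed from one batch, so a larger batch size means fewer (but less noisy) updates per epoch.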

## Check model performance

In [None]:
import itertools
import numpy as np
import pandas as pd
from sklearn.metrics import auc, roc_curve, accuracy_score

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    # plt.title(title)
    cbar = plt.colorbar()
    plt.clim(0, 1)
    cbar.set_label(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

    # plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def plotRoc(fpr, tpr, auc, labels, linestyle, legend=True):
    for _i, label in enumerate(labels):
        plt.plot(
            tpr[label],
            fpr[label],
            label='{} tagger, AUC = {:.1f}%'.format(label.replace('j_', ''), auc[label] * 100.0),
            linestyle=linestyle,
        )
    plt.semilogy()
    plt.xlabel("Signal Efficiency")
    plt.ylabel("Background Efficiency")
    plt.ylim(0.001, 1)
    plt.grid(True)
    if legend:
        plt.legend(loc='upper left')
    plt.figtext(0.25, 0.90, 'hls4ml', fontweight='bold', wrap=True, horizontalalignment='right', fontsize=14)


def rocData(y, predict_test, labels):
    df = pd.DataFrame()

    fpr = {}
    tpr = {}
    auc1 = {}

    for i, label in enumerate(labels):
        df[label] = y[:, i]
        df[label + '_pred'] = predict_test[:, i]

        fpr[label], tpr[label], threshold = roc_curve(df[label], df[label + '_pred'])

        auc1[label] = auc(fpr[label], tpr[label])
    return fpr, tpr, auc1


def makeRoc(y, predict_test, labels, linestyle='-', legend=True):
    if 'j_index' in labels:
        labels.remove('j_index')

    fpr, tpr, auc1 = rocData(y, predict_test, labels)
    plotRoc(fpr, tpr, auc1, labels, linestyle, legend=legend)
    return predict_test

In [None]:
y_keras = model.predict(X_test)
print("Accuracy: {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_keras, axis=1))))
plt.figure(figsize=(9, 9))
_ = makeRoc(y_test, y_keras, list(le.classes_))  # makeRoc is defined in the cell above
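The `plot_confusion_matrix` helper defined above expects a confusion matrix as input. As a minimal sketch of what that matrix contains, the NumPy example below builds one from made-up class indices; in the notebook, the indices would presumably come from `np.argmax(y_test, axis=1)` and `np.argmax(y_keras, axis=1)`.

```python
import numpy as np

# Made-up true and predicted class indices for 3 classes
y_true_idx = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred_idx = np.array([0, 1, 1, 1, 2, 2, 0, 2])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true_idx, y_pred_idx), 1)  # rows: true class, columns: predicted class

# Row-normalize so each row sums to 1, matching normalize=True in plot_confusion_matrix
cm_norm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]

accuracy = np.trace(cm) / cm.sum()  # correct predictions sit on the diagonal
print(cm)
print(cm_norm)
print(accuracy)
```

The off-diagonal entries show which classes get confused with each other, which a single accuracy number hides.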

## Attribution, Sources, and Credits

This notebook was derived from two separate notebooks from the Tokyo Institute of Technology's ART.T458: Advanced Machine Learning course, both originally authored by Prof. Naoaki Okazaki.

The original notebooks (and accompanying material) can be found here: https://chokkan.github.io/deeplearning/

Modifications to this notebook were made by Ben Hawks for the 2023 Fermilab and Brookhaven National Lab Summer Exchange School.

Sources for images and other materials used are listed below:
1. [Matrix Transpose Gif by Lucas Vieira via Wikipedia](https://commons.wikimedia.org/wiki/File:Matrix_transpose.gif)
2. [Boolean Logic Gates Image/Symbols (Digilent)](https://digilent.com/blog/logic-gates/)
    * The individual symbols used (IEEE Std 91/91a-1991 "Distinctive Shapes") are originally by [Inductiveload via Wikipedia](https://en.wikipedia.org/wiki/Logic_gate#Symbols)