[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MusicalInformatics/miws2024/blob/main/expectation/intro_to_deep_learning.ipynb)

# A Very Quick and Dirty Introduction to Deep Learning and PyTorch



Deep learning is a family of machine learning techniques built on neural networks. The word *deep* refers to the fact that these neural networks are organized into many **layers**.

This short tutorial aims to be a practical introduction to deep learning using [PyTorch](https://pytorch.org/). **Parts of this tutorial are taken from PyTorch's [Introduction to PyTorch](https://pytorch.org/tutorials/beginner/introyt/introyt1_tutorial.html).**

## Quick recap of PyTorch

Most computations in neural networks are linear algebra operations on **tensors**. Tensors can be thought of as a generalization of matrices (the proper mathematical definition of tensors is a [bit more complicated](https://en.wikipedia.org/wiki/Tensor)).

A vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, an array with three indices is a 3-dimensional tensor, and so on. PyTorch is built around tensors!
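As a small illustration of the paragraph above, the following sketch builds tensors with zero to three indices and prints their number of dimensions and shapes. The variable names are only illustrative.

```python
import torch

# Tensors with an increasing number of indices (dimensions).
scalar = torch.tensor(3.0)     # 0 indices: a single number
vector = torch.zeros(4)        # 1 index
matrix = torch.zeros(2, 3)     # 2 indices
cube = torch.zeros(2, 3, 4)    # 3 indices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # ndim is the number of indices, shape gives the size along each index
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```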

##### Creating Tensors
We'll start with a few basic tensor manipulations, beginning with ways of creating tensors.

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

z = torch.zeros(5, 3)
print(z)
print(z.dtype)

It is possible to override the default datatype (32-bit floating point):

In [None]:
i = torch.ones((5, 3), dtype=torch.int16)
print(i)

We can also initialize tensors with random values.

In [None]:
torch.manual_seed(1729)
r1 = torch.rand(2, 2)
print('A random tensor:')
print(r1)

r2 = torch.rand(2, 2)
print('\nA different random tensor:')
print(r2) # new values

torch.manual_seed(1729)
r3 = torch.rand(2, 2)
print('\nShould match r1:')
print(r3) # repeats values of r1 because of re-seed

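We can initialize a tensor from a NumPy array. The resulting tensor shares memory with the array, so an in-place change to the array is also visible in the tensor: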

In [4]:
n = np.ones(5)
t = torch.from_numpy(n)

In [None]:
np.add(n, 1, out=n)
print(f"t: {t}")
print(f"n: {n}")

We can also convert tensors back to NumPy arrays:

In [None]:
numpy_tensor = t.numpy()
print(f'numpy_tensor type {type(numpy_tensor)}')
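The cells above rely on `torch.from_numpy` sharing memory with the source array. The sketch below contrasts this with `torch.tensor`, which copies the data. The names `shared` and `copied` are only illustrative.

```python
import numpy as np
import torch

n = np.ones(3)

shared = torch.from_numpy(n)  # shares memory with the NumPy array
copied = torch.tensor(n)      # makes an independent copy of the data

n += 1  # in-place update of the NumPy array

print(f"shared: {shared}")  # follows the change made to n
print(f"copied: {copied}")  # still holds the original values
```

If the tensor should not follow later changes to the array, copying is the safer choice.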

##### Basic Arithmetic and Elementwise Operations

In [None]:
ones = torch.ones(2, 3)
print(ones)

twos = torch.ones(2, 3) * 2 # every element is multiplied by 2
print(twos)

threes = ones + twos       # addition allowed because shapes are the same
print(threes)              # tensors are added element-wise
print(threes.shape)        # this has the same dimensions as input tensors

**Exercise**: The following snippet results in a runtime error. Why is that? (you have to uncomment the line)

In [8]:
r1 = torch.rand(2, 3)
r2 = torch.rand(3, 2)

# r3 = r1 + r2

Elementwise operations and aggregate operations

In [None]:
r = (torch.rand(2, 2) - 0.5) * 2 # values between -1 and 1
print('A random matrix, r:')
print(r)

# Common mathematical operations are supported:
print('\nAbsolute value of r:')
print(torch.abs(r))

# ...as are trigonometric functions:
print('\nInverse sine of r:')
print(torch.asin(r))

# ...and statistical and aggregate operations:
print('\nAverage and standard deviation of r:')
print(torch.std_mean(r))
print('\nMaximum value of r:')
print(torch.max(r))

##### Linear Algebra

In [None]:
r1 = (torch.rand(2, 2) - 0.5) * 2 # values between -1 and 1
print('A random matrix, r1:')
print(r1)
r2 = (torch.rand(2, 2) - 0.5) * 3 # values between -1.5 and 1.5
print('A random matrix, r2:')
print(r2)

print('\nMatrix Multiplication of r1 and r2')
print(torch.matmul(r1, r2))
print('\nDeterminant of r1:')
print(torch.det(r1))
print('\nSingular value decomposition of r1:')
print(torch.svd(r1))
print('\nPseudo-inverse of r1:')
print(torch.pinverse(r1))


## Artificial Neural Networks

Neural networks have their origins in early work that tried to model biological networks of neurons in the brain [(McCulloch and Pitts, 1943)](https://homes.luddy.indiana.edu/jbollen/I501F13/readings/mccullochpitts1943.pdf). For this reason, these methods are called neural networks, although the resemblance to real neural cells is only superficial.

We can understand artificial neural networks (or simply neural networks) as **complex compositions of simpler functions**. Each node within a network is called a **unit**, and each of these units calculates a weighted sum of the inputs from predecessor nodes and then applies a nonlinear function.

Let us consider the following simple case:

Let $a_j$ denote the output of unit $j$, which can be computed as

$$a_j = g_j\left(\sum_{i} w_{ij}a_i + b_j\right)$$

where

* $g_j(\cdot)$ is a nonlinear **activation function** associated with unit $j$
* $w_{ij}$ is the weight attached to the link from unit $i$ to unit $j$
* $b_j$ is a scalar bias

With this convention, we can write the above equation in vector form as

$$a_j = g_j\left(\mathbf{w}_j^T\mathbf{x} + b_j\right)$$

where $\mathbf{w}_j$ is the vector of weights leading into unit $j$ and $\mathbf{x}$ is the vector of inputs $a_i$ feeding into that unit.
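To make the formula concrete, here is a minimal NumPy sketch of a single unit. The input, weight and bias values are made up for illustration, and the sigmoid (introduced in the next section) is just one possible choice of $g_j$.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as the activation g_j."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(x, w, b, g):
    """Output of one unit: a_j = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

# Illustrative values (not taken from any dataset)
x = np.array([1.0, 2.0, 3.0])       # inputs a_i
w_j = np.array([0.5, -1.0, 0.25])   # weights w_ij into unit j
b_j = 0.75                          # scalar bias

# Weighted sum: 0.5 - 2.0 + 0.75 + 0.75 = 0.0, and sigmoid(0) = 0.5
a_j = unit_output(x, w_j, b_j, sigmoid)
print(a_j)  # 0.5
```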

##### Activation Functions

Some of the most common activation functions are

* **Sigmoid** function
$$ \sigma(x) = \frac{1}{1 + \exp(-x)}$$

In [None]:
def sigmoid_numpy(x):
    output = 1 / (1 + np.exp(-x))
    return output
    
x = np.linspace(-3, 3)

sig_numpy = sigmoid_numpy(x)

plt.plot(x, sig_numpy)
plt.ylim((0, 1))
plt.show()
    

* **ReLU** (rectified linear units)

$$\text{ReLU}(x) = \max(0, x)$$

In [None]:
def relu_numpy(x):
    return np.maximum(0, x)

x = np.linspace(-3, 3)

relu = relu_numpy(x)

plt.plot(x, relu)
plt.show()


* **Hyperbolic tangent**

$$\tanh(x) = \frac{\exp(2x) - 1}{\exp(2x) + 1}$$

In [None]:
x = np.linspace(-3, 3)
tanh = np.tanh(x)
plt.plot(x, tanh)
plt.show()
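PyTorch ships these activation functions as `torch.sigmoid`, `torch.relu` and `torch.tanh`. As a quick sanity check, the sketch below evaluates the three formulas above in NumPy and compares them with the PyTorch versions on the same grid of points.

```python
import numpy as np
import torch

x = np.linspace(-3, 3)
x_t = torch.from_numpy(x)  # same values as a (float64) tensor

# The formulas from the text, written out in NumPy
sigmoid_np = 1 / (1 + np.exp(-x))
relu_np = np.maximum(0, x)
tanh_np = (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)

# Compare against PyTorch's built-in implementations
print(np.allclose(sigmoid_np, torch.sigmoid(x_t).numpy()))
print(np.allclose(relu_np, torch.relu(x_t).numpy()))
print(np.allclose(tanh_np, torch.tanh(x_t).numpy()))
```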

### Computation Graphs

In the following, $\mathbf{x}$ denotes an input example (training or test), $\hat{\mathbf{y}}$ the outputs of the network, and $\mathbf{y}$ the *true values* used to derive a learning signal.

* **Input Encoding**: It depends on the problem we want to model. Assume that we have $n$ input nodes
    * If we have Boolean inputs, *false* is usually mapped to $0$ and *true* to $1$, although sometimes $-1$ and $1$ are used
    * If we have real valued inputs, we can just use the actual values, although it is common to scale the inputs to fit a fixed range, or use a transformation like a log scale if the magnitudes of the different examples vary a lot.
    * If we have categorical encodings, we can use a *one-hot* encoding
    
* **Output Layers and Loss Function**: On the output side of the network, the problem of encoding raw data values into actual values $\mathbf{y}$ is very similar to the input encoding: we can use a numerical mapping for Boolean outputs, (scaled/transformed) real values for real-valued outputs, and one-hot encodings for categorical data. This can be achieved by choosing an appropriate output nonlinearity:
    * For Boolean outputs, we can use the sigmoid function (if we are mapping *false* and *true* to 0 and 1, respectively), or tanh (if we are mapping to -1 and 1).
    * For categorical problems, we can use a **softmax** layer:
    $$\text{softmax}(\mathbf{in})_k = \frac{\exp(in_k)}{\sum_{l=1}^{d} \exp(in_l)}$$
    where $\mathbf{in} = (in_1, \dots, in_d)$ are the input values.
    * For regression problems, it is usual to use the identity function $g(x) = x$
    * And many more, depending on the problem (e.g., mixture density layers).
    
* **Hidden Layers**: We can think of the hidden layers as learning different *representations* for the input $\mathbf{x}$. In many cases, the $l$-th hidden layer will be given as a function of the previous layer:

$$\mathbf{h}_l(\mathbf{h}_{l-1}) = g_l(\mathbf{W}_l\mathbf{h}_{l-1} + \mathbf{b}_l)$$

although this form depends on the particular neural architecture (the above example is for a fully connected feed-forward neural network). With this notation, we can write the inputs as the $0$-th layer $\mathbf{h}_0 = \mathbf{x}$ and the outputs as the $L$-th layer, i.e., $\hat{\mathbf{y}} = \mathbf{h}_L(\mathbf{h}_{L-1})$.
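The pieces above (a one-hot input, a hidden layer and a softmax output) can be sketched as a tiny forward pass in NumPy. The layer sizes, the random weights and the names `W1`, `b1`, `W2`, `b2` are assumptions made only for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    """Softmax over a vector of input values."""
    z = z - np.max(z)  # subtracting the max does not change the result, but avoids overflow
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 4, 5, 3

# Randomly initialized parameters (illustrative, untrained)
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_classes, n_hidden)), np.zeros(n_classes)

# One-hot encoding of a categorical input: category 1 out of 4
x = np.array([0.0, 1.0, 0.0, 0.0])

h0 = x                           # 0-th layer: the input
h1 = relu(W1 @ h0 + b1)          # hidden layer
y_hat = softmax(W2 @ h1 + b2)    # output layer: class probabilities

print(y_hat, y_hat.sum())
```

Because of the softmax, the outputs are non-negative and sum to one, so they can be read as a probability distribution over the classes.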



##### Training the network

The **loss function** is a measure of how good the predictions of the network are, i.e., how closely the predictions $\hat{\mathbf{y}}$ approximate the expected values $\mathbf{y}$. We can use this loss function to learn the parameters $\mathbf{\theta}$ of the network (the sets of weights and biases) as those which minimize the loss, where the predictions $\hat{\mathbf{Y}}$ depend on $\mathbf{\theta}$:

$$\hat{\mathbf{\theta}} = \arg \min_{\mathbf{\theta}} \mathcal{L}(\mathbf{Y}, \hat{\mathbf{Y}})$$

###### Regression Problems

For regression problems, it is common to use the mean squared error

$$\text{mse}(\mathbf{Y}, \hat{\mathbf{Y}}) = \frac{1}{N} \sum_{i}||\mathbf{y}_i - \hat{\mathbf{y}}_i||^2$$
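A minimal NumPy sketch of this formula, using two made-up examples with two-dimensional targets:

```python
import numpy as np

def mse(Y, Y_hat):
    """Mean over examples of the squared Euclidean norm of the error."""
    squared_norms = np.sum((Y - Y_hat) ** 2, axis=1)  # ||y_i - y_hat_i||^2 per example
    return squared_norms.mean()                       # average over the N examples

# Illustrative targets and predictions (N = 2 examples)
Y = np.array([[1.0, 2.0], [3.0, 4.0]])
Y_hat = np.array([[1.0, 1.0], [1.0, 4.0]])

# Squared norms are 1 and 4, so the mean is 2.5
print(mse(Y, Y_hat))  # 2.5
```

Note that `torch.nn.MSELoss` with its default settings averages over all elements rather than over examples, so for vector-valued outputs it differs from this formula by a constant factor (the output dimension).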


###### Classification Problems

The **Cross Entropy Loss** is a common loss function used in classification tasks. It measures the difference between two probability distributions: the true distribution (ground truth labels) and the predicted distribution (model predictions). In other words, it quantifies how well the model’s predicted probabilities align with the actual classes.

Mathematically, Cross Entropy Loss for a single data point is given by:


$$\text{Cross Entropy} = -\sum_{i=1}^{C} y_i \log(p_i)$$

where:
- $C$ is the number of classes.
- $y_i$ is the actual label (1 if the sample belongs to class $i$, 0 otherwise).
- $p_i$ is the predicted probability of the sample being in class $i$.

The loss function penalizes incorrect predictions more as the probability for the correct class decreases, encouraging the model to assign higher probabilities to correct classes.
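As a small worked example, the sketch below evaluates the formula for one data point with three classes, once for a confident prediction and once for an unsure one. The probability values are made up for illustration.

```python
import numpy as np

def cross_entropy(y, p):
    """Cross entropy between a one-hot label y and predicted probabilities p."""
    return -np.sum(y * np.log(p))

y = np.array([0.0, 1.0, 0.0])          # the sample belongs to class 1

confident = np.array([0.1, 0.8, 0.1])  # high probability on the correct class
unsure = np.array([0.4, 0.3, 0.3])     # low probability on the correct class

print(cross_entropy(y, confident))  # -log(0.8), roughly 0.223
print(cross_entropy(y, unsure))     # -log(0.3), roughly 1.204
```

Since $y$ is one-hot, only the term for the correct class survives the sum, so the loss reduces to $-\log(p_c)$ for the correct class $c$.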

#### Why Cross Entropy?

If the model confidently predicts the correct class (i.e., assigns a high probability close to 1 for the correct class), the cross-entropy loss will be low. However, if the model predicts a low probability for the correct class, the loss will be high. This loss function is particularly effective for training neural networks to output probabilities that align closely with the ground truth.





### Visualizing Cross Entropy Loss

To illustrate Cross Entropy Loss, let’s plot how the loss changes with different predicted probabilities for a correct class label. A lower probability for the correct class leads to a higher loss, whereas a probability closer to 1 leads to a lower loss.



In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Define probabilities and compute cross entropy loss for a correct class (label = 1)
probabilities = np.linspace(0.01, 1, 100)  # Predicted probabilities from 0.01 to 1
cross_entropy_loss = -np.log(probabilities)  # Cross-entropy loss for the correct class

# Plotting
plt.figure(figsize=(8, 6))
plt.plot(probabilities, cross_entropy_loss, label="Cross Entropy Loss", color="blue")
plt.title("Cross Entropy Loss vs. Predicted Probability for Correct Class")
plt.xlabel("Predicted Probability for Correct Class")
plt.ylabel("Cross Entropy Loss")
plt.grid(True)
plt.legend()
plt.show()




The visualization above shows how Cross Entropy Loss varies with the predicted probability for the correct class. As the probability for the correct class increases (moving right on the x-axis), the loss decreases. When the predicted probability approaches 1, the loss approaches zero, indicating a confident and accurate prediction. Conversely, lower probabilities for the correct class yield higher losses, pushing the model to learn from incorrect or less confident predictions. 

This behavior encourages the model to assign high probabilities to correct classes, making Cross Entropy Loss a powerful choice for classification tasks.

## Training Neural Networks with PyTorch
This tutorial will guide you through a simple process of training a neural network on the MNIST dataset using PyTorch.

In [14]:
import torch
import torch.nn as nn  # For building neural networks
import torch.optim as optim  # For optimization
from torchvision import datasets, transforms  # For loading datasets
from torch.utils.data import DataLoader  # For batching data

### 1. Prepare Data
We'll use the MNIST dataset, a collection of 28x28 grayscale images of handwritten digits (0-9).

In [15]:
# Transform: Convert images to tensor and normalize
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load training and test sets
train_data = datasets.MNIST(root='data', train=True, download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False, download=True, transform=transform)

# Data loaders for batching
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)
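To see what the normalization and the data loader do without downloading MNIST, here is a sketch on random stand-in data with the same shapes. The tensors `fake_images` and `fake_labels` are assumptions for illustration only; the arithmetic `(x - 0.5) / 0.5` is what `Normalize((0.5,), (0.5,))` applies to each pixel.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 100 "images" with values in [0, 1), like the output of ToTensor()
fake_images = torch.rand(100, 1, 28, 28)
fake_labels = torch.randint(0, 10, (100,))

# Same arithmetic as Normalize((0.5,), (0.5,)): maps [0, 1] to [-1, 1]
normalized = (fake_images - 0.5) / 0.5

loader = DataLoader(TensorDataset(normalized, fake_labels), batch_size=32, shuffle=False)

images, labels = next(iter(loader))  # first batch
print(images.shape, labels.shape)    # batch of 32 images and 32 labels
print(len(loader))                   # 4 batches: 32 + 32 + 32 + 4
```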

###  2. Define a Neural Network
We'll build a simple fully connected neural network with two hidden layers.

In [16]:
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers
        self.fc1 = nn.Linear(28 * 28, 128)  # First hidden layer
        self.fc2 = nn.Linear(128, 64)       # Second hidden layer
        self.fc3 = nn.Linear(64, 10)        # Output layer (10 classes)

    def forward(self, x):
        x = x.view(-1, 28 * 28)  # Flatten the input image
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = SimpleNN()
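A quick way to check a model definition like the one above is to count its parameters and push a dummy batch through it. The sketch below repeats the class so that it runs on its own; the all-zeros batch `dummy_batch` is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)  # flatten each image into a vector of 784 values
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)       # raw scores (logits), one per class

model = SimpleNN()

# Weights and biases: (784*128 + 128) + (128*64 + 64) + (64*10 + 10) = 109386
n_params = sum(p.numel() for p in model.parameters())
print(n_params)

# A dummy batch of 4 images produces 4 rows of 10 class scores
dummy_batch = torch.zeros(4, 1, 28, 28)
logits = model(dummy_batch)
print(logits.shape)
```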

### 3. Set Up Loss and Optimizer
We'll use Cross-Entropy Loss and the SGD optimizer.

In [17]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
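Note that the network's `forward` method returns raw scores (logits) without a softmax. This works because `nn.CrossEntropyLoss` applies a log-softmax internally. The sketch below checks this on made-up logits by comparing the built-in loss with a manual softmax followed by the cross-entropy formula from earlier.

```python
import torch
import torch.nn as nn

# Illustrative logits for 2 samples and 3 classes, with their true class indices
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])

# Built-in loss: expects logits and integer class labels
loss_builtin = nn.CrossEntropyLoss()(logits, labels)

# Manual version: softmax, then -log of the probability of the correct class,
# averaged over the batch
probs = torch.softmax(logits, dim=1)
p_correct = probs[torch.arange(len(labels)), labels]
loss_manual = -torch.log(p_correct).mean()

print(loss_builtin.item(), loss_manual.item())
```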

### 4. Train the Model
We'll train the model over multiple epochs.

In [None]:
num_epochs = 5  # Number of times to iterate through the dataset

for epoch in range(num_epochs):
    running_loss = 0.0
    for images, labels in train_loader:
        # Zero gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {running_loss / len(train_loader):.4f}")
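To see what a single pass through `zero_grad`, `backward` and `step` does, here is a sketch with one small parameter vector and a made-up loss $\sum_k w_k^2$. Plain SGD (without momentum) updates each parameter as $w \leftarrow w - \text{lr} \cdot \nabla_w \mathcal{L}$.

```python
import torch

# A single parameter vector and a toy loss, for illustration only
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

optimizer.zero_grad()     # clear gradients from any previous step
loss = (w ** 2).sum()     # toy loss with gradient 2 * w
loss.backward()           # fills w.grad with [2.0, -4.0]

# What plain SGD should produce: w - lr * grad = [0.8, -1.6]
expected = w.detach() - 0.1 * w.grad

optimizer.step()          # in-place parameter update
print(w.detach(), expected)
```

Calling `zero_grad` in every iteration matters because `backward` accumulates gradients into `.grad` rather than overwriting them.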

### 5. Test the Model
Now we will evaluate the model on the test dataset.

In [None]:
correct = 0
total = 0

with torch.no_grad():  # Disable gradient calculation for testing
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)  # Get class with highest score
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the model on the test images: {100 * correct / total:.2f}%')
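The accuracy computation can be traced by hand on a small made-up batch. With a dimension argument, `torch.max` returns both the maximum values and their indices; the indices along dimension 1 are the predicted classes.

```python
import torch

# Illustrative scores for 4 samples and 3 classes, with their true labels
outputs = torch.tensor([[0.1, 2.0, -1.0],
                        [1.5, 0.2, 0.3],
                        [0.0, 0.1, 0.9],
                        [2.0, 1.0, 0.0]])
labels = torch.tensor([1, 0, 1, 0])

# Index of the highest score in each row = predicted class
_, predicted = torch.max(outputs, 1)
print(predicted)  # classes 1, 0, 2, 0

# Three of the four predictions match the labels
correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)
print(f"{accuracy:.2f}%")  # 75.00%
```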


Save the trained model to use later.

In [None]:
torch.save(model.state_dict(), 'simple_nn.pth')
print("Model saved as 'simple_nn.pth'")
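Since only the `state_dict` (the parameter tensors) is saved, loading requires creating a model with the same architecture first. The sketch below shows the round trip; it repeats the class definition and writes to a temporary directory so that it runs on its own, which is an assumption made for this illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

model = SimpleNN()  # stands in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "simple_nn.pth")
    torch.save(model.state_dict(), path)      # save parameters only

    restored = SimpleNN()                     # fresh model, same architecture
    restored.load_state_dict(torch.load(path))
    restored.eval()                           # switch to evaluation mode

# The restored model gives the same outputs as the original
x = torch.zeros(2, 1, 28, 28)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print(same)
```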