---
title: "Deep Learning Fundamentals"
format: html
page-layout: full
code-line-numbers: true
code-block-border: true
toc: true
toc-location: left
number-sections: true
jupyter: python3
---

- In deep learning, information is processed in hierarchical layers
    - Each layer learns representations and features from the data
- Multilayer perceptron (MLP)
    - A NN with feedforward propagation, fully connected layers, and at least one hidden layer
- Convolutional neural network (CNN)
    - A feedforward NN with several types of special layers
    - apply filters to the input image (or sound) by sliding the filter all across the incoming signal
- Recurrent neural network (RNN)
    - Has an internal state or memory based on all, or part of, the input data already fed to the network
    - Output is a combination of its internal state and the latest input sample
    - Good for tasks that work on sequential data, e.g., text or time series data
- Transformer
    - Suited for sequential data
    - Uses a technique called **attention**, which allows direct, simultaneous access to all elements of the input sequence
    - Has superseded RNNs in many tasks
   

# Activation Functions - Vanishing Gradients

 - Assume we use backpropagation to train an MLP with multiple hidden layers and a logistic sigmoid activation at each layer
 - $\sigma(x) = 1/(1 + e^{-x})$

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_03_4.jpg)

- Forward phase
  - Output of the first sigmoid layer falls in the range (0,1)
  - For consecutive layers, the range becomes narrower
     - After three layers, for example, a repeatedly applied sigmoid converges to around 0.66 regardless of the input value
  - Acts as an eraser of any information coming from the preceding layers

- Backward phase
  - Derivative of the sigmoid function has a significant value in a narrow interval centered around 0
      - converges to 0 in all other cases
  - In networks with many layers, the product of these derivatives is likely to approach 0 by the time it is propagated back to the first layers
      - thus, the weights cannot be updated in a meaningful way    
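
Both effects can be checked numerically. The following is a minimal sketch, assuming a simplified chain of layers in which the sigmoid is applied repeatedly with unit weights and no bias. That simplification, and the helper names, are assumptions made only for illustration; real layers also apply learned weights.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25 (at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def repeated_sigmoid(x, layers):
    """Apply the sigmoid `layers` times (unit weights, no bias: illustration only)."""
    for _ in range(layers):
        x = sigmoid(x)
    return x

# Forward phase: very different inputs end up in a narrow band after three layers
forward = {x: repeated_sigmoid(x, 3) for x in (-5.0, 0.0, 5.0)}
for x, y in forward.items():
    print(f"input {x:+.1f} -> activation {y:.3f}")

# Backward phase: even in the best case (derivative 0.25 at every layer),
# the backpropagated factor shrinks exponentially with depth
best_case_factor = sigmoid_derivative(0.0) ** 10
print(f"largest possible gradient factor after 10 layers: {best_case_factor:.2e}")
```

Because the sigmoid derivative never exceeds 0.25, each additional sigmoid layer can only shrink the factor further.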
     

- **ReLU** activation function
  - mitigates the vanishing gradients problem
    
![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_03_5.jpg)

- **Idempotent**
    - Value doesn't change through any number of layers
        - ReLU(2) = 2, ReLU(ReLU(2)) = 2, ...
- Derivative is either 0 or 1, regardless of the backpropagated value    

- Problem with **ReLU**
    - During training, when network weights are being updated, some of the ReLU units may always receive inputs smaller than 0, and hence always output 0 as well
    - This is known as **dying ReLUs**
- **Leaky ReLU**
    - When $x \lt 0$, outputs $x$ multiplied by a constant $\alpha$ ($0 \lt \alpha \lt 1$)

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_03_6.jpg)
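
As an illustration, ReLU and leaky ReLU can be written in a few lines of plain Python. This is a sketch rather than a library implementation; the function names and the value $\alpha = 0.01$ are assumptions chosen for the example.

```python
def relu(x):
    """ReLU: passes positive inputs unchanged, outputs 0 otherwise."""
    return max(0.0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: scales negative inputs by alpha instead of zeroing them."""
    return x if x > 0 else alpha * x

def leaky_relu_derivative(x, alpha=0.01):
    """Derivative of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

# Idempotence: applying ReLU repeatedly does not change the value
print(relu(2.0), relu(relu(2.0)))

# A unit that only receives negative inputs gets no gradient with ReLU,
# but still gets a small gradient with leaky ReLU
print(relu_derivative(-3.0), leaky_relu_derivative(-3.0))
```

The zero derivative for negative inputs is what makes a ReLU unit "die": once its inputs are always negative, no gradient flows through it and its weights stop changing. The small slope of leaky ReLU keeps a nonzero gradient in that region.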
 
-  **Parametric ReLU**
    - Same as leaky ReLU, but $\alpha$ is tunable and adjusted during training
 
- **Exponential linear unit (ELU)**
    - When $x \lt 0$, outputs $\alpha(e^x - 1)$, where $\alpha$ is a tunable parameter
    - For example, for $\alpha = 0.2$:

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_03_7.jpg)
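
A minimal sketch of ELU in plain Python, using $\alpha = 0.2$ as in the example above. The function name is an assumption, and the identity branch for $x \geq 0$ follows the usual ELU definition.

```python
import math

def elu(x, alpha=0.2):
    """ELU: identity for non-negative inputs, alpha * (e^x - 1) for negative inputs."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

for x in (-10.0, -1.0, 0.0, 2.0):
    print(f"elu({x:+.1f}) = {elu(x):+.4f}")
```

For large negative inputs the output saturates toward $-\alpha$ instead of growing without bound, while positive inputs pass through unchanged.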


## Softmax

 - Activation function of the output layer in classification problems
 - Output of the final network layer, $z = (z_1, z_2, ..., z_n)$
     - Each of the $n$ elements represents one of $n$ classes to which the input sample might belong
 - For network prediction, the index $i$ of the highest value $z_i$ is assigned as the class of that input sample
 - For interpreting network output as probability distribution, use the softmax activation:

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/242.png)
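
The standard softmax definition is $\text{softmax}(z_i) = e^{z_i} / \sum_{j=1}^{n} e^{z_j}$. The sketch below implements it in plain Python. The scores in `z` are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(z):
    """Softmax of a list of scores; subtracting max(z) avoids overflow in exp."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]  # hypothetical outputs of the final layer for 3 classes
probs = softmax(z)

# The predicted class is the index of the highest score; softmax preserves the ordering
predicted_class = max(range(len(z)), key=lambda i: z[i])

print([round(p, 3) for p in probs])
print(predicted_class)
```

The outputs are positive and sum to 1, so they can be read as a probability distribution over the classes.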

 

# DNN Regularization

- The NN may learn to approximate the noise of the target function rather than its useful components
- Example
    - The training data contains mostly images of red cars, and the NN must classify whether an image shows a car
    - The NN may associate the color red with cars, rather than the shape
    - It may then fail to classify a green car because the color doesn't match
- Avoid overfitting using **regularization** techniques
- Some techniques to apply to the input data before feeding it into the NN:

### Min-max normalization

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/253.png)


 - Scales all inputs to the [0, 1] range
 - Easy to implement
 - However, a few outliers with large values may squeeze all other normalized values toward 0
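
The outlier effect is easy to reproduce. The sketch below assumes the usual min-max formula $x' = (x - x_{min}) / (x_{max} - x_{min})$. The function name, the sample values, and the choice to return zeros for constant input are all assumptions made for the example.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: no spread to scale, so map everything to 0 (a design choice)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = clean + [1000.0]

print(min_max_normalize(clean))
print([round(v, 4) for v in min_max_normalize(with_outlier)])
```

With the single outlier of 1000 in the data, the first five values are all mapped below 0.01, so most of the [0, 1] range is left unused.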

### Z-score normalization

$$
 z = \frac{x - \mu}{\sigma}
$$

- Here, $\mu$ and $\sigma$ are the mean and standard deviation of the input data
- Handles outliers better than min-max
- Keeps the dataset's mean close to 0 and its standard deviation close to 1
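
A minimal sketch of z-score normalization using only the standard library. The function name and sample data are assumptions; the population standard deviation (`statistics.pstdev`) is used so that the normalized data has a standard deviation of 1.

```python
import statistics

def z_score_normalize(values):
    """Subtract the mean and divide by the population standard deviation.

    Assumes the values are not all identical (otherwise sigma would be 0).
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
normalized = z_score_normalize(data)

print(normalized)
print(statistics.fmean(normalized), statistics.pstdev(normalized))
```

Unlike min-max normalization, the result is not confined to a fixed interval: values are expressed as the number of standard deviations from the mean.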


### Data augmentation

 - Artificially increases the size of the training set
 - Applies random modifications to the training samples before feeding them to the network
 - For images, rotation, skew, scaling, etc. can be used
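
To make the idea concrete, the following sketch applies a random rotation to a tiny image stored as a list of rows. It is a toy illustration under stated assumptions: the helper names are hypothetical, and rotations are restricted to multiples of 90 degrees so that no interpolation is needed.

```python
import random

def rotate_90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image, rng):
    """Return a randomly rotated copy of the image (0, 90, 180, or 270 degrees)."""
    for _ in range(rng.randrange(4)):
        image = rotate_90(image)
    return [list(row) for row in image]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[1, 2],
         [3, 4]]        # hypothetical 2 x 2 "image"

# Each pass over the data can see a different variant of the same training sample
variants = [augment(image, rng) for _ in range(5)]
for v in variants:
    print(v)
```

In practice, image pipelines built on torchvision usually rely on ready-made transforms such as `RandomRotation` or `RandomAffine` instead of hand-written helpers.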

**Regularization techniques with the DNN structure**

### Dropout

 - Randomly and periodically remove some of the units of a layer from the network
 - During a training mini-batch, each unit has a probability, *p*, of being dropped
 - Ensures no unit relies too much on other units
 - Applied during the training phase
 - All units participate during the inference phase

![](https://static.packt-cdn.com/products/9781837638505/graphics/image/B19627_03_9.jpg)
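
A minimal sketch of dropout on a list of activations. The function name and signature are assumptions. The sketch also scales the kept activations by $1/(1 - p)$ ("inverted dropout"), which is not described above but is how `torch.nn.Dropout` keeps the expected activation the same in training and inference.

```python
import random

def dropout(activations, p, training, rng):
    """Drop each unit with probability p during training; pass through at inference.

    Kept activations are scaled by 1 / (1 - p) so the expected value is unchanged.
    Assumes 0 <= p < 1.
    """
    if not training or p == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

rng = random.Random(42)
layer_output = [1.0] * 10

# Training: roughly half of the units are zeroed, the rest are scaled up
print(dropout(layer_output, p=0.5, training=True, rng=rng))

# Inference: all units participate and the activations are unchanged
print(dropout(layer_output, p=0.5, training=False, rng=rng))
```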

### Batch normalization

 - Normalizes the outputs of the hidden layer for each mini-batch, thus maintaining its mean activation value close to 0 and its standard deviation close to 1
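
The normalization step can be sketched for a single feature as follows. The function name is an assumption; `gamma` and `beta` stand for the learnable scale and shift that batch normalization layers usually include, and `eps` is a small constant that avoids division by zero.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [10.0, 12.0, 14.0, 16.0]  # hypothetical outputs of one hidden unit
normalized = batch_norm(activations)

print([round(v, 3) for v in normalized])
```

With the default `gamma=1.0` and `beta=0.0`, the normalized mini-batch has a mean of 0 and a standard deviation very close to 1.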

# Example - Classifying Digits

In [1]:
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda, Compose

In [2]:
train_data = datasets.MNIST(
    root='data',
    train=True,
    transform=Compose(
        [ToTensor(),
         Lambda(lambda x: torch.flatten(x))]),
    download=True,
)

In [3]:
train_data.data.shape

torch.Size([60000, 28, 28])

In [4]:
train_data.classes

['0 - zero',
 '1 - one',
 '2 - two',
 '3 - three',
 '4 - four',
 '5 - five',
 '6 - six',
 '7 - seven',
 '8 - eight',
 '9 - nine']

In [5]:
validation_data = datasets.MNIST(
    root='data',
    train=False,
    transform=Compose(
        [ToTensor(),
         Lambda(lambda x: torch.flatten(x))]),
)

In [6]:
validation_data.data.shape

torch.Size([10000, 28, 28])

 - The ToTensor() transformation converts PIL images (or *NumPy* arrays) to *PyTorch* tensors and scales the pixel values to the [0,1] range
 - The *torch.flatten()* transform flattens each 28 x 28 image into a one-dimensional tensor of 784 values to feed to the NN
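
The effect of the flattening step on tensor shapes can be checked in isolation. In this sketch a random tensor stands in for one transformed MNIST image (an assumption made to avoid downloading the dataset); for a grayscale image, ToTensor() produces a tensor with one channel, i.e. shape 1 x 28 x 28.

```python
import torch

# Stand-in for the output of ToTensor() on one MNIST image: 1 channel, 28 x 28, values in [0, 1)
image = torch.rand(1, 28, 28)

# Flatten all dimensions into a single one, as the Lambda transform above does
flat = torch.flatten(image)

print(image.shape)
print(flat.shape)
```

The 784 values match the input size of the first `torch.nn.Linear(28 * 28, hidden_units)` layer used later in the example.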

In [7]:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset=train_data,
    batch_size=100,
    shuffle=True)

validation_loader = DataLoader(
    dataset=validation_data,
    batch_size=100,
    shuffle=True)

 - A *DataLoader* instance creates mini-batches and shuffles the data randomly
 - It is also an *iterable*, which supplies mini-batches one at a time
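
The batching behavior can be seen on a small synthetic dataset. This is a sketch: the toy tensors and the batch size of 4 are assumptions chosen so the output is easy to inspect.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples with 4 features each, labels 0..9
features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
labels = torch.arange(10)
loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

# Each pass over the loader yields every mini-batch once; the last one may be smaller
batch_sizes = [inputs.shape[0] for inputs, targets in loader]
print(batch_sizes)

# iter() creates a fresh iterator; next() fetches a single mini-batch
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)
```

With 10 samples and a batch size of 4, one pass produces three mini-batches; `len(loader)` reports the same count.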

In [8]:
# An MLP with one hidden layer

torch.manual_seed(1234)

hidden_units = 100
classes = 10

model = torch.nn.Sequential(
    torch.nn.Linear(28 * 28, hidden_units),
    torch.nn.BatchNorm1d(hidden_units),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_units, classes),
)

In [9]:
# Define the cross-entropy loss and the Adam optimizer

cost_func = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

In [10]:
# Train the model for a single epoch

def train_model(model, cost_function, optimizer, data_loader):
    # send the model to the GPU
    model.to(device)

    # set model to training mode
    model.train()

    current_loss = 0.0
    current_acc = 0

    # iterate over the training data
    for i, (inputs, labels) in enumerate(data_loader):
        # send the input/labels to the GPU
        inputs = inputs.to(device)
        labels = labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        with torch.set_grad_enabled(True):
            # forward
            outputs = model(inputs)
            _, predictions = torch.max(outputs, 1)
            loss = cost_function(outputs, labels)

            # backward
            loss.backward()
            optimizer.step()

        # statistics
        current_loss += loss.item() * inputs.size(0)
        current_acc += torch.sum(predictions == labels.data)

    total_loss = current_loss / len(data_loader.dataset)
    total_acc = current_acc.double() / len(data_loader.dataset)

    print('Train Loss: {:.4f}; Accuracy: {:.4f}'.format(total_loss, total_acc))


 - The function iterates over all mini-batches provided by train_loader
 - For each mini-batch, optimizer.zero_grad() resets the gradients from the previous iteration
 - Then, we initiate the forward and backward passes, and finally the weight updates
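
The reason for resetting gradients is that PyTorch accumulates them across backward passes. The sketch below shows this on a single scalar parameter; the variable names are assumptions, and the gradient is reset manually here, whereas `optimizer.zero_grad()` clears the gradients of all parameters the optimizer manages.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2.0 * w).backward()
first = w.grad.item()        # gradient of 2w with respect to w

(2.0 * w).backward()         # second backward pass without resetting
accumulated = w.grad.item()  # the new gradient was added to the old one

w.grad.zero_()               # manual reset of the stored gradient
reset = w.grad.item()

print(first, accumulated, reset)
```

Without the reset, each weight update would use the sum of the gradients from all previous mini-batches instead of the current one.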

In [11]:
def test_model(model, cost_function, data_loader):
    # send the model to the GPU
    model.to(device)

    # set model in evaluation mode
    model.eval()

    current_loss = 0.0
    current_acc = 0

    # iterate over the validation data
    for i, (inputs, labels) in enumerate(data_loader):
        # send the input/labels to the GPU
        inputs = inputs.to(device)
        labels = labels.to(device)

        # forward
        with torch.set_grad_enabled(False):
            outputs = model(inputs)
            _, predictions = torch.max(outputs, 1)
            loss = cost_function(outputs, labels)

        # statistics
        current_loss += loss.item() * inputs.size(0)
        current_acc += torch.sum(predictions == labels.data)

    total_loss = current_loss / len(data_loader.dataset)
    total_acc = current_acc.double() / len(data_loader.dataset)

    print('Test Loss: {:.4f}; Accuracy: {:.4f}'.format(total_loss, total_acc))

    return total_loss, total_acc

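 - Batch normalization and dropout layers behave differently in training and evaluation, so model.eval() switches them to inference behavior: dropout is disabled, and batch normalization uses its running statistics instead of the mini-batch statistics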
 - We iterate over the validation set, initiate a forward pass, and aggregate the validation loss and accuracy
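
The difference between the two modes can be observed on a small stand-alone block. This sketch uses a hypothetical two-layer module with batch normalization followed by dropout; it is not part of the digit classifier above.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Sequential(torch.nn.BatchNorm1d(3), torch.nn.Dropout(p=0.5))
x = torch.randn(8, 3)

# Training mode: mini-batch statistics are used and random units are zeroed
layer.train()
train_out = layer(x)

# Evaluation mode: running statistics are used and dropout is disabled,
# so repeated forward passes give identical results
layer.eval()
with torch.no_grad():
    eval_out_1 = layer(x)
    eval_out_2 = layer(x)

print(bool((train_out == 0).any()))
print(torch.equal(eval_out_1, eval_out_2))
```

Calling `layer.train()` again restores the training behavior, which is why `train_model` sets the mode explicitly at the start of every epoch.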

In [12]:
# Run the training for 20 epochs

for epoch in range(20):
    train_model(model, cost_func, optimizer, train_loader)

Train Loss: 0.3272; Accuracy: 0.9175
Train Loss: 0.1421; Accuracy: 0.9604
Train Loss: 0.0999; Accuracy: 0.9721
Train Loss: 0.0760; Accuracy: 0.9790
Train Loss: 0.0611; Accuracy: 0.9828
Train Loss: 0.0496; Accuracy: 0.9861
Train Loss: 0.0422; Accuracy: 0.9879
Train Loss: 0.0359; Accuracy: 0.9898
Train Loss: 0.0310; Accuracy: 0.9914
Train Loss: 0.0265; Accuracy: 0.9928
Train Loss: 0.0226; Accuracy: 0.9937
Train Loss: 0.0200; Accuracy: 0.9947
Train Loss: 0.0183; Accuracy: 0.9948
Train Loss: 0.0175; Accuracy: 0.9952
Train Loss: 0.0160; Accuracy: 0.9957
Train Loss: 0.0130; Accuracy: 0.9967
Train Loss: 0.0130; Accuracy: 0.9964
Train Loss: 0.0105; Accuracy: 0.9975
Train Loss: 0.0104; Accuracy: 0.9973
Train Loss: 0.0104; Accuracy: 0.9973


In [13]:
# test the model

test_model(model, cost_func, validation_loader)

Test Loss: 0.0882; Accuracy: 0.9774


(0.08819967041490599, tensor(0.9774, dtype=torch.float64))

# References
  - Python Deep Learning, Third Edition, Ivan Vasilev, Packt Publishing