# Module 17 - Introduction to Deep Learning


## Learning outcomes

- LO 1: Evaluate literature on deep learning in order to apply it to a particular industry or use case.
- LO 2: Identify the building blocks used for developing real-life deep learning frameworks.
- LO 3: Describe the key terms and concepts used in deep learning and artificial intelligence.
- LO 4: Recognise business applications for real-life neural networks operating at a large scale of data.
- LO 5: Refine a codebase for machine learning competitions.

## Misc and Keywords
- **Neural Network (NN)**: A system composed of layers of neurons that influence each other via weighted connections. These computational models are loosely inspired by the human brain and learn to perform tasks by analysing examples, without being explicitly programmed for specific tasks.
- **Convolutional Neural Network (CNN)**: A specialised type of neural network designed primarily for processing structured grid data such as images. CNNs use convolution operations in place of general matrix multiplication in at least one of their layers, allowing them to capture spatial hierarchies and patterns in data efficiently.
- A good book for getting started with deep learning is *Neural Networks and Deep Learning* by Michael Nielsen: http://neuralnetworksanddeeplearning.com/index.html

## Module Summary Description
A brief introduction to Deep Learning

---


## History of Deep Learning

**Early era of DL**: The foundations of deep learning trace back to 1943, when Warren McCulloch and Walter Pitts proposed the first computational model of a neuron. In 1958, Frank Rosenblatt introduced the perceptron, the first trainable neural network algorithm. Despite early promise, neural network research suffered a major setback after Marvin Minsky and Seymour Papert published "Perceptrons" in 1969, which highlighted the limitations of single-layer networks (for example, their inability to represent XOR). This contributed to the first "AI winter" of the 1970s, during which funding and interest in neural networks declined sharply. The 1980s saw a revival when David Rumelhart, Geoffrey Hinton, and Ronald Williams popularised backpropagation (1986), allowing multi-layer networks to be trained effectively.

**Modern era of DL**: The modern deep learning era began in the mid-2000s, catalysed by several breakthroughs. In 2006, Geoffrey Hinton and colleagues introduced deep belief networks and efficient training methods for deep architectures. The availability of large datasets and increased computational power, particularly GPUs, then accelerated progress. The 2012 ImageNet competition marked a pivotal moment: "AlexNet", a deep convolutional neural network by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, significantly outperformed traditional computer vision approaches. This result triggered large-scale investment in deep learning research across academia and industry, leading to major advances in natural language processing, computer vision, speech recognition, and generative AI.

--- 

## Building Blocks of Deep Learning

**NNs can approximate any continuous function**: Neural networks possess the universal approximation property: a sufficiently wide feed-forward neural network with a single hidden layer and a suitable (for example, non-polynomial) activation function can approximate any continuous function on compact subsets of ℝⁿ to arbitrary precision. This result, the Universal Approximation Theorem, helps explain why neural networks can model complex relationships in data across diverse domains. Note that the theorem only guarantees that such a network exists; it does not say how many hidden units are needed or whether training will find suitable weights.
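
Stated a little more formally (a sketch of the single-hidden-layer form; the symbols $N$, $v_i$, $w_i$, $b_i$, $K$, and $\varepsilon$ are notation introduced here, not taken from the module): for any continuous $f$ on a compact set $K \subset \mathbb{R}^n$, any $\varepsilon > 0$, and a continuous non-polynomial activation $\sigma$, there exist a width $N$ and parameters $v_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^n$ such that

$$\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^\top x + b_i\right) \right| < \varepsilon$$

Each term in the sum is one hidden unit: $w_i$ and $b_i$ are its input weights and bias, and $v_i$ is its output weight.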

## Five Important Steps to Create a Neural Network

- A neural network can be written as a function
$$y=F(x, w)$$
    - where $w$ are the weights and $x$ is the input
    - For example, $x$ could be the pixels of an image and $y$ the label 'cat'
- The goal is to find weights $w$ such that $F$ explains the observed $(x, y)$ data pairs; the five steps below describe how this is done
  
### 1. Define a Function Composition
- Create a series of linear functions interleaved with non-linear activation functions (e.g., the logistic sigmoid)
- The architecture typically consists of multiple layers: input layer, hidden layer(s), and output layer
- Each layer is defined by a set of weights $W$ and biases $b$ that transform the input: $f(x) = Wx + b$
- The network's overall function is a nested composition of these transformations and activations
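
As a minimal sketch of such a composition, the snippet below builds $F(x, w)$ from two affine layers with a sigmoid in between, using only the Python standard library. The helper names (`affine`, `sigmoid`, `forward`) and the weight values are illustrative assumptions, not part of the module; real projects would typically use NumPy, TensorFlow, or PyTorch instead of nested lists.

```python
import math

def affine(W, b, x):
    """Return W*x + b, where W is a list of rows, b a list of biases, x a list."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def forward(x, params):
    """F(x, w): the nested composition affine -> sigmoid -> affine."""
    W1, b1, W2, b2 = params
    hidden = sigmoid(affine(W1, b1, x))   # hidden layer
    return affine(W2, b2, hidden)         # output layer (no activation)

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
params = (
    [[0.5, -0.5], [1.0, 1.0]],  # W1
    [0.0, 0.0],                 # b1
    [[1.0, -1.0]],              # W2
    [0.5],                      # b2
)
y = forward([0.0, 0.0], params)
print(y)
```

With a zero input both hidden units output 0.5, their contributions cancel through `W2`, and the result is just the output bias.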

### 2. Compute Symbolic Derivatives Using Chain-rule / Automatic Differentiation
- The goal of training is to minimise a loss function such as the sum of squared errors, cross-entropy loss, or mean absolute error
- The loss function quantifies the difference between predicted and actual outputs
- Derivatives are used to determine how each parameter affects the overall error
- Modern frameworks (TensorFlow, PyTorch) implement automatic differentiation to track operations and compute gradients efficiently
- This enables optimisation of complex models with millions of parameters
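
To make the role of the chain rule concrete, here is a small sketch for a single sigmoid neuron with a squared-error loss. The gradient is derived by hand, one chain-rule factor at a time, and then compared with a central finite-difference estimate. The function names and the example values are assumptions chosen for illustration; automatic differentiation in a real framework performs the same bookkeeping for arbitrary computation graphs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

def grad(w, b, x, y):
    """Analytic gradient via the chain rule: dL/dp * dp/dz * dz/d(param)."""
    p = sigmoid(w * x + b)
    dL_dp = 2.0 * (p - y)       # derivative of the squared error
    dp_dz = p * (1.0 - p)       # derivative of the sigmoid
    dL_dz = dL_dp * dp_dz
    return dL_dz * x, dL_dz     # dz/dw = x, dz/db = 1

def numeric_grad(w, b, x, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    dw = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
    db = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return dw, db

w, b, x, y = 0.3, -0.1, 2.0, 1.0   # arbitrary example values
analytic = grad(w, b, x, y)
numeric = numeric_grad(w, b, x, y)
print(analytic, numeric)
```

The two results should agree to several decimal places; a mismatch in such a check usually points to a mistake in the hand-derived gradient.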

### 3. Use Approximate Optimisation Methods Such as Stochastic Gradient Descent
- Estimate the loss and its gradient on a small random batch of examples rather than on the whole dataset
- Full batch gradient descent is computationally expensive for large datasets
- Mini-batch SGD strikes a balance between computational efficiency and update stability
- Common variants include:
  - Momentum: Accelerates convergence by accumulating past gradients
  - Adam: Adapts learning rates per parameter using estimates of first and second moments
  - RMSProp: Normalises gradients using a running average of squared gradients
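
The following sketch shows mini-batch SGD on a toy linear regression, with an optional momentum term. It is an illustration under stated assumptions: the function name `minibatch_sgd`, the synthetic data ($y = 3x - 1$, no noise), and all hyperparameter values are made up for this example, and the momentum update shown is the simple "accumulate past gradients" form.

```python
import random

def minibatch_sgd(data, lr=0.1, momentum=0.0, batch_size=8, epochs=100, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0                      # velocity terms (used when momentum > 0)
    for _ in range(epochs):
        rng.shuffle(data)                  # new random batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch-mean squared error w.r.t. w and b
            gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            vw = momentum * vw + gw        # accumulate past gradients
            vb = momentum * vb + gb
            w -= lr * vw
            b -= lr * vb
    return w, b

# Synthetic, noise-free data from y = 3x - 1 on [-1, 1]
xs = [i / 20 - 1.0 for i in range(41)]
data = [(x, 3.0 * x - 1.0) for x in xs]
w, b = minibatch_sgd(data)
print(w, b)
```

With `momentum=0.0` this is plain mini-batch SGD. Passing a value such as `momentum=0.9` enables the momentum variant, which in practice usually calls for a smaller learning rate.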

### 4. Evaluate Derivatives Through Backpropagation
- Backpropagation efficiently computes gradients through the entire network
- The process works by recursively applying the chain rule from calculus
- Steps of backpropagation:
  1. Forward pass: Compute outputs for given inputs
  2. Calculate error/loss at the output layer
  3. Backward pass: Propagate error gradients from output to input layer
  4. Update weights and biases using these gradients (strictly speaking, this is the optimiser's step, e.g. SGD, rather than part of backpropagation itself)
- This method scales efficiently to deep networks with many layers
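
The four steps can be sketched for a tiny network with one tanh hidden layer and a scalar linear output, trained on a single example with a squared-error loss. Everything here (function names, starting weights, learning rate, the single training example) is a hypothetical setup for illustration, not the module's reference implementation.

```python
import math

def forward(x, W1, b1, w2, b2):
    """Forward pass: x -> tanh hidden layer -> linear scalar output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sum(w * hi for w, hi in zip(w2, h)) + b2
    return h, y_hat

def backward(x, y, h, y_hat, w2):
    """Backward pass for loss = (y_hat - y)^2, applying the chain rule layer by layer."""
    d_yhat = 2.0 * (y_hat - y)                               # dL/dy_hat
    g_w2 = [d_yhat * hi for hi in h]                         # dL/dw2
    g_b2 = d_yhat                                            # dL/db2
    d_h = [d_yhat * w for w in w2]                           # dL/dh
    d_z = [dh * (1.0 - hi * hi) for dh, hi in zip(d_h, h)]   # tanh'(z) = 1 - tanh(z)^2
    g_W1 = [[dz * xi for xi in x] for dz in d_z]             # dL/dW1
    g_b1 = d_z                                               # dL/db1
    return g_W1, g_b1, g_w2, g_b2

def train_step(x, y, W1, b1, w2, b2, lr=0.05):
    h, y_hat = forward(x, W1, b1, w2, b2)                    # 1. forward pass
    loss = (y_hat - y) ** 2                                  # 2. loss at the output
    g_W1, g_b1, g_w2, g_b2 = backward(x, y, h, y_hat, w2)    # 3. backward pass
    # 4. gradient-descent update of every parameter
    W1 = [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(W1, g_W1)]
    b1 = [b - lr * g for b, g in zip(b1, g_b1)]
    w2 = [w - lr * g for w, g in zip(w2, g_w2)]
    b2 = b2 - lr * g_b2
    return loss, W1, b1, w2, b2

# Hypothetical starting weights and a single training example
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0]
w2, b2 = [0.5, -0.5], 0.0
x, y = [1.0, 2.0], 1.0

losses = []
for _ in range(20):
    loss, W1, b1, w2, b2 = train_step(x, y, W1, b1, w2, b2)
    losses.append(loss)
print(losses[0], losses[-1])
```

Note how the backward pass reuses values computed in the forward pass (`h`, `y_hat`); this reuse is what makes backpropagation cheaper than differentiating each parameter separately.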

### 5. Select Which Types of Non-linear Activation Units, or Neurons, to Use
- Common choices include sigmoid, tanh, ReLU (Rectified Linear Unit), Leaky ReLU, ELU, and GELU
- Activation functions introduce non-linearity, enabling the network to learn complex patterns
- Key considerations when selecting activation functions:
  - ReLU: Fast to compute and helps mitigate the vanishing gradient problem, but can suffer from the "dying ReLU" issue, where a unit outputs zero for every input and stops learning
  - Sigmoid: Useful for binary classification outputs because its range is (0, 1), but prone to vanishing gradients
  - Tanh: Similar to sigmoid but zero-centred, with output range (-1, 1); often used in recurrent networks
  - Leaky ReLU: Addresses the dying ReLU problem by giving negative inputs a small non-zero slope
  - GELU: Commonly used in transformer-based models like BERT and GPT
- The choice of activation function significantly impacts training dynamics and model performance
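
For reference, the activation functions named above can be written as scalar functions in a few lines. This is a sketch using the standard definitions; the default `alpha` values are common conventions assumed here, and the GELU shown is the exact erf-based form rather than the tanh approximation some libraries use.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Smoothly saturates to -alpha for very negative inputs."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

activations = [sigmoid, tanh, relu, leaky_relu, elu, gelu]
for fn in activations:
    print(fn.__name__, [round(fn(z), 4) for z in (-2.0, 0.0, 2.0)])
```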

---

## Additional Considerations

### Hyperparameter Tuning
- Hyperparameters are settings chosen before training rather than learned by gradient descent, such as:
  - Learning rate and schedule
  - Network architecture (number and size of layers)
  - Regularisation strength
  - Batch size
- Techniques include grid search, random search, Bayesian optimisation, and neural architecture search
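
As a hedged illustration of random search, the sketch below samples a learning rate (log-uniformly) and a number of training steps, trains a one-parameter model for each configuration, and keeps the configuration with the lowest validation error. The toy model, the search ranges, and the names `train_and_validate` and `random_search` are assumptions made for this example.

```python
import random

def train_and_validate(lr, steps, train, val):
    """Fit y ~ w*x by full-batch gradient descent; return the validation MSE."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in train) / len(train)
        w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in val) / len(val)

def random_search(train, val, n_trials=20, seed=0):
    """Try n_trials random configurations and return the best (score, config)."""
    rng = random.Random(seed)
    best_score, best_config = float("inf"), None
    for _ in range(n_trials):
        config = {
            "lr": 10 ** rng.uniform(-4, 0),       # log-uniform in [1e-4, 1]
            "steps": rng.choice([10, 50, 100]),
        }
        score = train_and_validate(train=train, val=val, **config)
        if score < best_score:
            best_score, best_config = score, config
    return best_score, best_config

# Toy data from y = 2x; validation points sit between the training points
train = [(i / 10, 2.0 * i / 10) for i in range(1, 11)]
val = [(i / 10 + 0.05, 2.0 * (i / 10 + 0.05)) for i in range(1, 10)]
best_score, best_config = random_search(train, val)
print(best_score, best_config)
```

Sampling the learning rate on a log scale is a common design choice, since useful values often span several orders of magnitude. Grid search would replace the random draws with a fixed list of values per hyperparameter.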

### Regularisation Techniques
- Methods to prevent overfitting:
  - L1/L2 regularisation: Add a penalty on the magnitude of the weights to the loss function
  - Dropout: Randomly deactivate neurons during training so the network cannot rely on any single unit
  - Batch normalisation: Normalise layer inputs over each mini-batch; this mainly speeds up and stabilises training, with a mild regularising side effect
  - Early stopping: Monitor validation performance and stop training once it stops improving
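
Three of these techniques are simple enough to sketch directly. The helpers below are illustrative assumptions (names, signatures, and the `patience` rule are not from the module): an L2 penalty term, "inverted" dropout that rescales the surviving activations, and an early-stopping rule that halts after a fixed number of epochs without improvement.

```python
import random

def l2_penalty(weights, lam):
    """L2 term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)          # dropout is disabled at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0    # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses) - 1            # never triggered: ran to the end

rng = random.Random(0)
penalty = l2_penalty([1.0, -2.0], lam=0.1)
dropped = dropout([1.0] * 10, p=0.5, rng=rng)
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9])
print(penalty, dropped, stop_at)
```

In the example validation curve the best loss occurs at epoch 2, so with `patience=3` training stops at epoch 5; in practice the weights from the best epoch are usually the ones restored.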

### Evaluation and Validation
- Split data into training, validation, and test sets
- Use cross-validation for more robust performance estimation
- Monitor key metrics appropriate to the task (accuracy, F1-score, AUC-ROC, etc.)
- Analyse learning curves to diagnose underfitting/overfitting
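
A minimal sketch of these evaluation steps, working on example indices and using only the standard library. The split fractions, function names, and the toy labels are assumptions for illustration; libraries such as scikit-learn provide tested equivalents.

```python
import random

def train_val_test_split(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 and split them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val, n_test = round(n * val_frac), round(n * test_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def k_fold_indices(n, k, seed=0):
    """Yield (train, val) index lists; each example is in the validation fold exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val

def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = positive): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

train, val, test = train_val_test_split(100)
folds = list(k_fold_indices(10, k=5))
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(len(train), len(val), len(test), len(folds), score)
```

The test set should be touched only once, after model and hyperparameter choices have been made on the training and validation data.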

