# General Deep Learning System

Training a deep neural network repeats four steps: forward pass, loss computation, backpropagation, and gradient descent. The forward pass computes the model's output; the loss function measures the error between the predicted result and the labeled result; backpropagation computes the gradient of the loss with respect to every parameter; and gradient descent uses those gradients to update the parameters. Repeating this loop over the training data yields parameters that make effective predictions.
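As a minimal sketch of the four steps described above, the following pure-Python example fits a one-variable linear model. The model, the toy data, the function name `train_linear`, and the hyperparameters are illustrative assumptions; the gradients are derived by hand for this specific loss rather than produced by an automatic differentiation library.

```python
def train_linear(xs, ys, lr=0.05, epochs=500):
    """Fit y = w * x + b with plain (full-batch) gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        # 1. Forward pass: compute predictions.
        preds = [w * x + b for x in xs]
        # 2. Loss: mean squared error between predictions and labels.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        losses.append(loss)
        # 3. Backpropagation: gradient of the loss w.r.t. each parameter
        #    (written analytically here because the model is tiny).
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # 4. Gradient descent: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, losses = train_linear(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, final loss={losses[-1]:.2e}")
```

In a real framework, step 3 is handled by automatic differentiation and step 4 by an optimizer object, but the order of operations is the same.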

## Model

Deep learning models are often grouped by application domain, most prominently Natural Language Processing (NLP) and Computer Vision (CV).

Natural Language Processing focuses on analyzing and generating human language. Core NLP models include traditional sequence models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), and Transformer-based models such as BERT, GPT, and T5. NLP addresses tasks like text classification, machine translation, named entity recognition, question answering, and text generation.

Computer Vision focuses on understanding and interpreting visual data. Key tasks include object detection, image classification, image segmentation, and image generation. CV primarily relies on Convolutional Neural Networks (CNNs) and their variants, Vision Transformers, and generative models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models.

### CNN

#### Background: Why CNN?
While fully connected Artificial Neural Networks (ANNs) introduced the fundamental structure of deep learning with input, hidden, and output layers, they face significant challenges when applied to image data. Processing high-dimensional images with fully connected layers leads to enormous computational complexity and a high risk of overfitting. To address these limitations, Convolutional Neural Networks (CNNs) were developed. Inspired by the visual processing mechanisms of the human brain, CNNs efficiently capture spatial hierarchies and local patterns in images while dramatically reducing the number of parameters, making them well-suited for tasks such as image recognition, object detection, and other computer vision applications.

#### Architecture: What is CNN?

The architecture of a **Convolutional Neural Network (CNN)** typically consists of the following components:

1. **Input Layer**  
   Accepts raw data, such as images, usually represented as height × width × channels.

2. **Convolutional Layer**  
   Applies multiple **filters/kernels** to extract local features like edges, textures, or patterns.

3. **Pooling Layer**  
   Reduces the spatial dimensions of feature maps (e.g., using max or average pooling), which decreases computation and helps prevent overfitting.

4. **Fully Connected Layer**  
   Connects all neurons from the previous layer to produce high-level representations and enables classification.

5. **Output Layer**  
   Provides the final prediction, often using a **softmax** or **sigmoid** activation depending on the task.
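The convolution and pooling steps listed above can be sketched in pure Python as follows. This is an illustrative single-channel example, not an efficient implementation: the helper names (`conv2d`, `relu_map`, `max_pool2d`), the toy image, and the hand-picked edge kernel are all assumptions, and a trained CNN would learn its kernels rather than have them fixed.

```python
def conv2d(image, kernel):
    """Stride-1 convolution without padding (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu_map(fmap):
    """Apply ReLU element-wise to a feature map."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (stride equals the window size)."""
    return [
        [
            max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# 5x5 single-channel image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to a left-to-right intensity increase.
kernel = [[-1, 1],
          [-1, 1]]

features = relu_map(conv2d(image, kernel))  # 4x4 feature map
pooled = max_pool2d(features)               # 2x2 after pooling
print(features)
print(pooled)
```

The feature map is non-zero only where the edge is, and pooling halves each spatial dimension while keeping that response.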

#### Application: What are the applications of CNN variants?

- **LeNet** – One of the earliest CNNs, designed for **handwritten digit recognition** (MNIST dataset).  
- **AlexNet** – Popularized deep CNNs; used for **large-scale image classification** (ImageNet), introduced **ReLU and dropout** for better training.  
- **VGG** – Focuses on **deep but simple architectures** with small 3×3 filters, widely used for **image classification and feature extraction**.  
- **ResNet** – Introduces **residual connections** to train very deep networks efficiently, excels in **image classification and recognition tasks**.  
- **DenseNet** – Connects each layer to all subsequent layers within a dense block, improving **feature reuse and gradient flow**, used for **image classification and segmentation**.  
- **U-Net** – Designed for **image segmentation**, especially in **medical imaging**, with a contracting and expanding path to capture context and precise localization.  
- **Faster R-CNN** – Designed for **object detection**, combining region proposal networks with CNNs to efficiently detect and classify objects in images.

### RNN

#### Background: Why RNN?  
While CNNs excel at capturing spatial patterns in images, they are not naturally suited for sequential data, where order and temporal dependencies matter. Standard feedforward networks (including CNNs) process inputs independently and cannot retain context across time steps. To address this limitation, **Recurrent Neural Networks (RNNs)** were developed, which introduce **loops in their architecture**, allowing information to persist across steps. This makes them well-suited for tasks such as natural language processing, speech recognition, and time-series prediction.  


#### Architecture: What is RNN?  
The architecture of a **Recurrent Neural Network (RNN)** typically consists of the following components:

1. **Input Layer**  
   Accepts sequential data, such as words, audio signals, or time-series measurements. The input at each time step can be a vector representing features or embeddings.

2. **Hidden / Recurrent Layer**  
   Maintains a **hidden state** that captures information from previous time steps. At each step, the network updates this state using both the current input and the previous hidden state.

3. **Output Layer**  
   Produces predictions at each time step (many-to-many) or a single prediction after the sequence (many-to-one), depending on the task. Activation functions such as **softmax** or **sigmoid** are commonly used.
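The hidden-state update described above can be sketched as a vanilla RNN cell. The update rule `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)` is the standard formulation; the function names, the tiny hand-set weights, and the input sequence are assumptions for illustration only, and the output layer is omitted.

```python
import math


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    h_t = []
    for i in range(len(h_prev)):
        s = b_h[i]
        s += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        s += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(s))
    return h_t


def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run the cell over a whole sequence, returning the hidden state at every step."""
    h = [0.0] * len(b_h)  # initial hidden state
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


# Hypothetical weights: 1 input feature, 2 hidden units.
W_xh = [[0.5], [-0.5]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]

# Only the first time step carries a non-zero input.
states = rnn_forward([[1.0], [0.0], [0.0]], W_xh, W_hh, b_h)
for t, h in enumerate(states):
    print(t, h)
```

The hidden state stays non-zero at steps 1 and 2 even though the input is zero there, which is the "information persists across steps" behaviour. For a many-to-one task only `states[-1]` would feed the output layer.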


#### Applications: What are the applications of RNN variants?

- **Vanilla RNN** – The basic RNN unit; used for **short-sequence modeling**, like small time-series tasks.  
- **LSTM (Long Short-Term Memory)** – Mitigates the **vanishing gradient problem** with gated memory cells, capturing **long-term dependencies** in sequences; widely used in **language modeling, speech recognition, and time-series forecasting**.  
- **GRU (Gated Recurrent Unit)** – A simplified LSTM with fewer parameters; balances **efficiency and performance**, used in **sequence prediction and translation**.  
- **Bidirectional RNN** – Processes sequences in both forward and backward directions, improving context understanding; used in **NER, speech, and text processing**.  
- **Attention-based RNNs** – Introduces **attention mechanisms** to focus on important parts of the sequence, enhancing performance in **translation and summarization**.  
- **Seq2Seq (Encoder-Decoder RNN)** – Converts an input sequence to an output sequence, commonly used in **machine translation, chatbot response generation, and summarization**.


### Transformer

#### Background: Why Transformer?  
While RNNs and their variants achieve context awareness through their sequential structure, they often struggle to retain relevant information from earlier time steps when handling long sequences, and their step-by-step computation is expensive because it cannot be parallelized across time steps. The **Attention** mechanism allows the model to weigh the importance of different positions in the sequence directly, enabling parallel computation and better handling of long sequences. Transformer-based models have achieved state-of-the-art performance in a variety of tasks, including reading comprehension, abstractive summarization, machine translation, and language modeling.

#### Architecture: What is Transformer?  
The architecture of a **Transformer** typically consists of the following components:

1. **Input Layer**  
   Accepts sequential data, such as words, audio signals, or time-series measurements. Each input token is usually converted into a vector embedding, often enriched with positional encodings to preserve order information, since Transformers are inherently order-agnostic.

2. **Encoder Stacks**  
   Each encoder layer consists of:
   * Multi-head Self-Attention: Captures dependencies between all positions in the sequence simultaneously
   * Feed-Forward Network (FFN): Applies non-linear transformations to each position independently
   * Residual Connections & Layer Norm: Stabilize training and improve gradient flow

3. **Decoder Stacks**  
   Each decoder layer consists of:
   * Masked Multi-head Self-Attention: Ensures the model cannot "see" future tokens during training
   * Encoder-Decoder Attention: Allows the decoder to attend to the encoder's output
   * Feed-Forward Network (FFN): Applies non-linear transformations to each position independently
   * Residual Connections & Layer Norm: Stabilize training and improve gradient flow

4. **Attention**
   * Self-Attention: Determines the relevance of other positions in the sequence to a given token
   * Scaled Dot-Product Attention: Computes attention weights efficiently by scaling the dot product of queries and keys
   * Multi-Head Attention: Allows the model to attend to information from multiple representation subspaces simultaneously

5. **Output Layer**  
   Produces the final prediction, typically by applying a linear projection followed by softmax over the output vocabulary.
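The input layer above mentions positional encodings without fixing a scheme. One common choice, assumed here for illustration, is the fixed sinusoidal encoding from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine of the position scaled by a dimension-dependent wavelength. The function name and the toy sizes below are hypothetical, and learned positional embeddings are an equally valid alternative.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings.

    PE[pos][2k]   = sin(pos / 10000^(2k / d_model))
    PE[pos][2k+1] = cos(pos / 10000^(2k / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is the even dimension index 2k
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_positions(embeddings):
    """Add the encoding element-wise to a list of token embedding vectors."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, row)] for emb, row in zip(embeddings, pe)]


# Three identical (hypothetical) token embeddings become distinguishable by position.
tokens = [[0.1, 0.2, 0.3, 0.4]] * 3
encoded = add_positions(tokens)
for row in encoded:
    print([round(v, 4) for v in row])
```

Because the encoding is added before the first attention layer, two identical tokens at different positions no longer produce identical inputs.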



##### Technical Details: Attention
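Attention can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value comes from the compatibility of the query with the corresponding key. The dot products are scaled by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys.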

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
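The formula above can be sketched directly in pure Python, with matrices as lists of row vectors. The function names and the toy `Q`, `K`, `V` values are assumptions for illustration. The optional `causal` flag is an addition that mimics the decoder's masked self-attention by setting the scores of future positions to negative infinity before the softmax.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Returns (output, weights); weights[i][j] is how much query i attends to key j.
    """
    d_k = len(K[0])
    output, weights = [], []
    for i, q in enumerate(Q):
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask out positions after i so they receive zero weight.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        output.append([sum(w[j] * V[j][c] for j in range(len(V)))
                       for c in range(len(V[0]))])
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

out, w = attention(Q, K, V)
out_causal, w_causal = attention(Q, K, V, causal=True)
print(w)
print(w_causal)
```

Each row of the weight matrix sums to 1, and with `causal=True` the first query can attend only to the first position. Multi-head attention runs several such computations in parallel on different learned projections of `Q`, `K`, and `V`.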




#### Applications: What are the applications of Transformer variants?
Transformer variants have been adapted for many domains beyond natural language processing:

1. **NLP**
* Machine Translation (e.g., Google Translate)
* Abstractive Summarization 
* Question Answering / Reading Comprehension
* Language Modeling (e.g., GPT, BERT, T5)

2. **CV**
* Vision Transformers (ViT) for image classification
* Object Detection (DETR)
* Image Generation and Super-Resolution

3. **Speech and Audio**
* Speech Recognition (e.g., Speech-Transformer)
* Text-to-Speech synthesis

4. **Reinforcement Learning**
* Decision-making policies with sequential context

5. **Multi-modal**
* Multi-modal Transformers that handle text, image, and audio simultaneously

#### Reference
https://nlp.seas.harvard.edu/annotated-transformer/

### Activation Functions


Activation functions introduce **non-linearity** into neural networks, enabling them to learn complex patterns. Here are some common ones:


#### 1. ReLU (Rectified Linear Unit)
- **Formula:** `f(x) = max(0, x)`
- **Range:** [0, ∞)
- **Pros:** Simple, fast, reduces vanishing gradient
- **Cons:** Neurons can "die" (always output 0) if their input stays negative, because the gradient is 0 there
- **Use case:** Hidden layers in CNNs, RNNs, Transformers


#### 2. Leaky ReLU
- **Formula:** `f(x) = x if x >= 0, else α * x` (usually `α = 0.01`)
- **Range:** (-∞, ∞)
- **Pros:** Mitigates the “dead neuron” problem by keeping a small gradient for negative inputs
- **Cons:** Slightly more computation than ReLU
- **Use case:** Hidden layers where standard ReLU fails

#### 3. Sigmoid
- **Formula:** `f(x) = 1 / (1 + exp(-x))`
- **Range:** (0, 1)
- **Pros:** Output interpretable as probability
- **Cons:** Vanishing gradients for large |x|, not centered at 0
- **Note:** Sigmoid is typically used on the output layer for binary classification; softmax plays the same role for n-class classification


#### 4. Tanh (Hyperbolic Tangent)
- **Formula:** `f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))`
- **Range:** (-1, 1)
- **Pros:** Centered at 0 → better convergence than sigmoid
- **Cons:** Still suffers from vanishing gradient
- **Use case:** Hidden layers in older RNNs or small networks


#### 5. GELU (Gaussian Error Linear Unit)
- **Formula:** `f(x) = x * P(X <= x), X ~ N(0,1)` (approx: `x * sigmoid(1.702 * x)`)
- **Range:** (-∞, ∞)
- **Pros:** Smooth, modern, works well in Transformers
- **Cons:** Slightly more computationally expensive
- **Use case:** Hidden layers in Transformers, modern deep networks
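The five formulas above translate almost line for line into Python. The sketch below uses only the standard `math` module; the function names are illustrative, and the sigmoid is written in a branch form (an implementation choice, not part of the formula) so that large negative inputs do not overflow.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x


def sigmoid(x):
    # Branching avoids math.exp overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    return math.tanh(x)


def gelu(x):
    # Exact form: x * P(X <= x) with X ~ N(0, 1), via the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_approx(x):
    # Sigmoid-based approximation quoted in the formula above.
    return x * sigmoid(1.702 * x)


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), leaky_relu(x), round(sigmoid(x), 4),
          round(tanh(x), 4), round(gelu(x), 4), round(gelu_approx(x), 4))
```

Printing the exact and approximate GELU side by side gives a quick sense of how close the sigmoid approximation is on this range.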


## Loss Function

Loss functions in deep learning are introduced to quantify how wrong a model’s predictions are, guiding optimization to improve performance.


#### 1. Regression Losses
Used for predicting **continuous values**.

- **Mean Squared Error (MSE)**
$$
\text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
$$
Measures squared difference between prediction and target. Penalizes large errors heavily.

- **Mean Absolute Error (MAE)**
$$
\text{MAE} = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|
$$
Measures absolute difference. Less sensitive to outliers than MSE.

- **Huber Loss**
$$
\text{Huber}(y, \hat{y}) =
\begin{cases}
\frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\
\delta \left(|y - \hat{y}| - \frac{\delta}{2}\right) & \text{otherwise}
\end{cases}
$$
Combines MSE and MAE. Smooth near zero, robust to outliers.
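A minimal sketch of the three regression losses follows, using plain Python lists. The function names and the sample data are illustrative assumptions. The Huber loss is averaged over samples here, which is a common convention that the per-sample formula above leaves open.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Huber loss, averaged over samples (the averaging is an assumption)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = abs(y - p)
        if err <= delta:
            total += 0.5 * err ** 2            # quadratic near zero
        else:
            total += delta * (err - 0.5 * delta)  # linear for large errors
    return total / len(y_true)


# The last prediction is an outlier with error 3.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 6.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```

On this data the single outlier contributes 9 to the squared-error sum but only 3 to the absolute-error sum, which is the sensitivity difference the descriptions above refer to.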

#### 2. Classification Losses
Used for predicting **discrete classes**.

- **Binary Cross-Entropy (BCE)**
$$
\text{BCE} = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]
$$
For **binary classification**. Works with **sigmoid outputs**.

- **Categorical Cross-Entropy (CCE)**
$$
\text{CCE} = - \frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C y_{i,c} \log \hat{y}_{i,c}
$$
For **multi-class classification**. Works with **softmax outputs**.

- **Kullback-Leibler Divergence (KL Divergence)**
$$
\text{KL}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}
$$
Measures the difference between two probability distributions. Used in **VAEs** and **knowledge distillation**.
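The classification losses above can be sketched as follows. The function names, the clipping constant `EPS`, and the sample inputs are assumptions for illustration. Probabilities are clipped away from 0 and 1 so that `log` stays finite, and both cross-entropies are averaged over the `N` samples, matching the BCE formula.

```python
import math

EPS = 1e-12  # clipping constant (an implementation assumption)


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def binary_cross_entropy(y_true, y_pred):
    """y_true holds 0/1 labels, y_pred holds sigmoid outputs in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = _clip(p)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def categorical_cross_entropy(y_true, y_pred):
    """y_true holds one-hot rows, y_pred holds softmax rows."""
    total = 0.0
    for labels, probs in zip(y_true, y_pred):
        total += sum(y * math.log(_clip(p)) for y, p in zip(labels, probs))
    return -total / len(y_true)


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions; terms with P(i) = 0 contribute 0."""
    return sum(pi * math.log(pi / max(qi, EPS)) for pi, qi in zip(p, q) if pi > 0)


print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(categorical_cross_entropy([[0, 1, 0]], [[0.2, 0.7, 0.1]]))
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

KL divergence is zero only when the two distributions match, and it is not symmetric: swapping `p` and `q` generally gives a different value.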


#### 3. Margin-Based / Ranking Losses

- **Hinge Loss**
$$
\text{Hinge} = \max(0, 1 - y \cdot \hat{y})
$$
Used in **SVMs**, with labels $y \in \{-1, +1\}$ and a raw score $\hat{y}$. Encourages predictions to respect a margin.

- **Triplet Loss**
$$
\text{Triplet Loss} = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha)
$$
Used in **metric learning**. Pulls the anchor $a$ toward the positive $p$ and pushes it away from the negative $n$ by at least a margin $\alpha$.
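A short sketch of both margin-based losses is given below. The function names and the sample points are illustrative assumptions. The hinge loss expects a label in {-1, +1} and a raw (unsquashed) score, and the triplet loss takes already-computed embedding vectors in place of `f(a)`, `f(p)`, and `f(n)`.

```python
def hinge_loss(y, score):
    """Hinge loss for one sample; y is -1 or +1, score is the raw model output."""
    return max(0.0, 1.0 - y * score)


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss on embeddings, using squared Euclidean distances."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Correct side of the margin: no loss. Wrong side: loss grows linearly.
print(hinge_loss(+1, 2.0), hinge_loss(-1, 0.5))

anchor, positive = [0.0, 0.0], [0.0, 1.0]
far_negative, near_negative = [3.0, 0.0], [1.0, 0.0]
print(triplet_loss(anchor, positive, far_negative))
print(triplet_loss(anchor, positive, near_negative))
```

The far negative already satisfies the margin, so it contributes no loss, while the near negative is as close to the anchor as the positive and is penalized by the full margin.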

#### 4. Probabilistic / Generative Losses

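- **Negative Log-Likelihood (NLL)**  
Minimizes the negative log-probability that the model assigns to the observed targets. Available in PyTorch as `NLLLoss`, which expects log-probabilities as input.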

- **KL Divergence**  
Used in **Variational Autoencoders (VAEs)** as a regularization term.

- **Adversarial Loss (GANs)**  
Discriminator loss:  
$$
\mathcal{L}_D = -\mathbb{E}[\log D(x)] - \mathbb{E}[\log (1 - D(G(z)))]
$$
Generator loss:  
$$
\mathcal{L}_G = -\mathbb{E}[\log D(G(z))]
$$

---

#### 5. Specialized Losses

- **Dice Loss / IoU Loss** – For **segmentation tasks**.  
- **Contrastive Loss** – For **Siamese networks**.  
- **Focal Loss** – Handles **class imbalance** in classification.
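The list above names these losses without formulas, so the sketch below uses commonly cited binary forms as an assumption: a soft Dice loss `1 - 2|A∩B| / (|A| + |B|)` with a small smoothing term, and a focal loss `-(1 - p_t)^γ log(p_t)` with an optional class weight `alpha`. Function names, defaults, and sample inputs are illustrative.

```python
import math


def dice_loss(y_true, y_pred, smooth=1e-6):
    """Soft Dice loss on flattened masks; y_pred may hold probabilities."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    denom = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


def focal_loss(y_true, y_pred, gamma=2.0, alpha=None, eps=1e-12):
    """Binary focal loss; gamma=0 and alpha=None reduces to binary cross-entropy."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if y == 1 else 1.0 - p       # probability of the true class
        weight = (1.0 - p_t) ** gamma        # down-weights easy examples
        if alpha is not None:
            weight *= alpha if y == 1 else 1.0 - alpha
        total += -weight * math.log(p_t)
    return total / len(y_true)


mask = [1, 1, 0, 0]
print(dice_loss(mask, [1, 1, 0, 0]))   # perfect overlap
print(dice_loss(mask, [0, 0, 1, 1]))   # no overlap

easy, hard = [0.9], [0.1]              # predicted probability for a positive sample
print(focal_loss([1], easy), focal_loss([1], hard))
```

With `gamma > 0`, the confidently correct example contributes far less than the misclassified one, which is how focal loss shifts attention to hard (often minority-class) samples.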

## Optimization Algorithm

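Backpropagation computes the gradient of the loss with respect to every parameter; the optimization algorithms below use those gradients to update the parameters.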

### Gradient Descent
* Batch gradient descent
* Stochastic gradient descent
* Mini-batches

### Convergence
* Momentum
* Learning rate schedule
* RMSProp and Adam
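To make the update rules behind these names concrete, here is a minimal sketch of SGD with momentum and of Adam on a one-dimensional toy objective `f(θ) = (θ - 3)^2`. The objective, function names, and hyperparameter values are assumptions for illustration; the update equations follow the standard formulations, including Adam's bias correction.

```python
import math


def grad(theta):
    """Gradient of the toy objective f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def sgd_momentum(theta, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with a velocity term that accumulates past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(theta)
        theta -= lr * velocity
    return theta


def adam(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: momentum on the gradient plus RMSProp-style scaling by its magnitude."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


theta_momentum = sgd_momentum(0.0)
theta_adam = adam(0.0)
print(theta_momentum, theta_adam)
```

Both runs start at 0 and should end close to the minimizer at 3. On real models the gradient comes from a mini-batch rather than an exact formula, and a learning rate schedule would vary `lr` over the steps.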

## Regularization
* Weight decay
  * L1
  * L2
* Parameter sharing: reduces the number of free parameters
* Model averaging
  * Dropout: reduces variance
* Learning curves
  * Early stopping

## Normalization
* Data Normalization
* Batch Norm
* Layer Norm
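The difference between the last two items is the axis along which statistics are computed, sketched below on a small batch stored as a list of rows (one row per sample). This is a simplified illustration under stated assumptions: the learnable scale and shift parameters and batch norm's running statistics for inference are omitted, and the function names are hypothetical.

```python
import math


def _normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [_normalize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch, eps=1e-5):
    """Normalize each sample (row) across its own features."""
    return [_normalize(row, eps) for row in batch]


batch = [[1.0, 10.0],
         [3.0, 30.0]]
bn = batch_norm(batch)
ln = layer_norm(batch)
print(bn)
print(ln)
```

After batch norm every column has zero mean, so the result depends on the other samples in the batch. After layer norm every row has zero mean, so each sample is normalized independently of the batch.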

## Training Infrastructure
* Hardware (GPU / TPU)
* Parallelization / Distributed Training