## Summary of Previous Notebook
* Discussed self-attention and its role in focusing on different words in a sentence.
* Implemented self-attention using Query, Key, and Value vectors.
* Analyzed outputs from self-attention to see how they capture context.
* Introduced multi-head attention to capture different aspects of the input.
* Implemented multi-head attention in code.
* Reviewed outputs from multi-head attention and their importance for model performance.

----

# Add & Norm

The **Add & Norm** step follows each sublayer (multi-head self-attention and the feed-forward network) in a transformer encoder layer. It consists of two operations: a **residual connection** (Add) and **layer normalization** (Norm). Together they make deep stacks of layers easier to train: the residual path preserves the input signal, and normalization keeps activations on a consistent scale.

## Purpose of Add & Norm

### 1. Residual Connection (Add)

- A residual connection adds a layer's input to that layer's output, so the layer computes $x + \text{Sublayer}(x)$ instead of just $\text{Sublayer}(x)$. The original input is preserved and combined with the processed output.


#### **Why is it important?**
  - **Information Retention**: By adding the original input to the output, the model retains important information that may otherwise be lost during processing. This is especially crucial in deep networks where layers can distort the original signal.
  - **Learning Identity Functions**: The residual connection makes the identity function easy to represent. If a layer does not improve performance, it can learn an output close to zero and pass its input through nearly unchanged. The skip path also gives gradients a direct route back to earlier layers, which helps mitigate the vanishing gradient problem, where gradients become very small and hinder learning in deeper networks.
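
The two points above can be illustrated with a minimal NumPy sketch. The `residual` helper, the toy input, and both stand-in sublayers are hypothetical and chosen only for illustration; they are not taken from the notebook's own code.

```python
import numpy as np

def residual(x, sublayer):
    """Return x + sublayer(x). The sublayer must preserve the shape of x."""
    return x + sublayer(x)

# Hypothetical input: 3 tokens, each a 4-dimensional vector.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.5, 0.0, -0.5, 1.0],
              [2.0, 2.0, 2.0, 2.0]])

# A sublayer that contributes nothing: the whole block reduces to the identity.
zero_sublayer = lambda v: np.zeros_like(v)
out_identity = residual(x, zero_sublayer)

# A sublayer that makes a small change: the input is still present in the output.
small_sublayer = lambda v: 0.1 * v
out_small = residual(x, small_sublayer)

print(np.array_equal(out_identity, x))  # the input passes through unchanged
print(out_small - x)                    # only the sublayer's contribution remains
```

When the sublayer outputs zeros, the block returns its input exactly, which is the identity behavior described above.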

### 2. Layer Normalization (Norm)

- Layer normalization normalizes the output of a layer across the features of each input (in a transformer, across the embedding dimensions of each token). It rescales each vector to zero mean and unit variance, then applies a learned scale and shift.

#### **Why is it important?**
  - **Stabilizing Training**: Normalization keeps the distribution of inputs to each layer consistent even as the weights of earlier layers change during training, an effect often described as reducing internal covariate shift. This leads to more stable training.
  - **Faster Convergence**: With activations kept on a stable scale, the model typically converges in fewer training steps.
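
As a rough illustration of what "normalizing across the features" means, here is a minimal NumPy sketch. The function name `layer_norm`, the `eps` value, and the default scale and shift (`gamma` of ones, `beta` of zeros) are assumptions for this example; in a trained model `gamma` and `beta` are learned parameters.

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize each row (token) of x across its features."""
    mean = x.mean(axis=-1, keepdims=True)    # per-token mean
    var = x.var(axis=-1, keepdims=True)      # per-token variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    if gamma is None:
        gamma = np.ones(x.shape[-1])         # assumed initial scale
    if beta is None:
        beta = np.zeros(x.shape[-1])         # assumed initial shift
    return gamma * x_hat + beta

# Two hypothetical tokens with very different scales.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
normed = layer_norm(x)

print(normed.mean(axis=-1))  # close to 0 for every token
print(normed.std(axis=-1))   # close to 1 for every token
```

Both rows end up with nearly the same normalized values even though the second row is ten times larger, which is the consistency the text refers to.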

## How Add & Norm Works

Add & Norm is applied twice in each encoder layer: once after the multi-head self-attention output and once after the feed-forward network output.

### 1. After Multi-Head Self-Attention

- **Process**: 
  - The output from the self-attention layer is combined with its original input using a residual connection. This means you add the original input to the output of the self-attention mechanism.
  
- **Normalization**: 
  - The combined result is then passed through layer normalization to stabilize the output.

- **Formula**:
  $$
  \text{Output}_{\text{attention}} = \text{LayerNorm}(X + \text{SelfAttention}(X))
  $$
  where $X$ is the input to the self-attention layer.
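
The formula above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the notebook's implementation: a single-head scaled dot-product attention with random weights stands in for the multi-head layer, the sizes `seq_len` and `d_model` are arbitrary, and `layer_norm` omits the learned scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (stand-in for multi-head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, d_model = 3, 4                    # assumed toy sizes
X = rng.normal(size=(seq_len, d_model))    # input to the attention layer
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

attn_out = self_attention(X, w_q, w_k, w_v)
output_attention = layer_norm(X + attn_out)  # Add, then Norm

print(output_attention.shape)  # (3, 4)
```

The residual addition only works because the attention output has the same shape as `X`, which is why transformer sublayers keep the model dimension fixed.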


![image-2.png](attachment:image-2.png)

### 2. After Feed-Forward Network

- **Process**: 
  - Similarly, the output from the feed-forward network is combined with the output from the previous Add & Norm step using another residual connection.
  
- **Normalization**: 
  - The result is again passed through layer normalization.

- **Formula**:
  $$
  \text{Output}_{\text{feed-forward}} = \text{LayerNorm}(\text{Output}_{\text{attention}} + \text{FeedForward}(\text{Output}_{\text{attention}}))
  $$
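
The second Add & Norm step can be sketched the same way. The example assumes a feed-forward sublayer made of two linear layers with a ReLU in between, a hidden size `d_ff` of 8, random weights, and a random array standing in for the output of the first Add & Norm step; all of these are illustrative choices.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to each token."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(1)
seq_len, d_model, d_ff = 3, 4, 8           # assumed toy sizes

# Stand-in for the output of the first Add & Norm step.
output_attention = rng.normal(size=(seq_len, d_model))

w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

ffn_out = feed_forward(output_attention, w1, b1, w2, b2)
output_feed_forward = layer_norm(output_attention + ffn_out)  # Add, then Norm

print(output_feed_forward.shape)  # (3, 4)
```

The second linear layer projects back to `d_model` so that the result can be added to `output_attention`.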


![image.png](attachment:image.png)

## Importance of Add & Norm

- **Stability**: Normalizing the outputs keeps activations on a consistent scale from layer to layer, leading to more stable training.
- **Efficiency**: The residual connections allow gradients to flow more easily through the network, which is crucial for training deeper networks without losing information.
- **Improved Performance**: Together, these operations lead to faster convergence and better overall performance on tasks such as translation or text generation.


---

# Feed-Forward Neural Network

A **Feed-Forward Neural Network** (FFNN) is a type of artificial neural network where the connections between nodes do not form cycles. Information flows in one direction—from the input layer, through any hidden layers, and finally to the output layer—without looping back.

## Structure of a Feed-Forward Neural Network

An FFNN consists of three main types of layers:

1. **Input Layer**
2. **Hidden Layers**
3. **Output Layer**

### Step 1: Input Layer

The input layer receives the initial data. Each neuron in this layer represents a feature of the input data.

![image-3.png](attachment:image-3.png)

### Step 2: Weighted Sum Calculation

Each neuron in the hidden layer computes a weighted sum of its inputs from the input layer. Each connection has an associated weight that determines the influence of the input on the neuron's output.

- **Formula**: For a neuron $h$ in the hidden layer with three inputs $x_1, x_2, x_3$, the weighted sum is:
  $$
  h = w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b
  $$
  where $w_1, w_2, w_3$ are the weights and $b$ is the bias.
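
A small worked example of the weighted sum, using made-up input, weight, and bias values. The second half shows the same computation for a whole hidden layer at once, where each row of the hypothetical matrix `W` holds the weights of one neuron.

```python
import numpy as np

# Hypothetical values for one hidden neuron with three inputs.
x = np.array([1.0, 2.0, 3.0])    # inputs x_1, x_2, x_3
w = np.array([0.5, -1.0, 2.0])   # weights w_1, w_2, w_3
b = 0.5                          # bias

# h = w_1*x_1 + w_2*x_2 + w_3*x_3 + b = 0.5 - 2.0 + 6.0 + 0.5
h = np.dot(w, x) + b
print(h)  # 5.0

# The same idea for a layer of two neurons: one row of W per neuron.
W = np.array([[0.5, -1.0, 2.0],
              [1.0, 0.0, -1.0]])
b_vec = np.array([0.5, 0.0])
h_layer = W @ x + b_vec          # values 5.0 and -2.0
print(h_layer)
```

Writing the layer as a matrix product is what lets a network compute all of its neurons in one step.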


![image-2.png](attachment:image-2.png)

### Step 3: Activation Function

After calculating the weighted sum, an activation function is applied to introduce non-linearity into the model. This allows the network to learn complex relationships in the data.

- **Common Activation Functions**:
  - **Sigmoid**: $\sigma(x) = \frac{1}{1 + e^{-x}}$
  - **Tanh**: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
  - **ReLU (Rectified Linear Unit)**: $\text{ReLU}(x) = \max(0, x)$
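
The three functions listed above translate directly into NumPy. The sample input `z` is arbitrary, and `tanh` is written out from its formula for illustration (in practice `np.tanh` is the more numerically robust choice).

```python
import numpy as np

def sigmoid(x):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Squash values into the range (-1, 1), written out from the formula."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    """Keep positive values and replace negative values with 0."""
    return np.maximum(0.0, x)

z = np.array([-2.0, 0.0, 2.0])   # hypothetical weighted sums

print(sigmoid(z))  # middle value is 0.5
print(tanh(z))     # middle value is 0.0
print(relu(z))     # negative input becomes 0.0
```

Each function is applied element-wise, so the same code works for a single neuron or a whole layer of weighted sums.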


![image.png](attachment:image.png)

### Step 4: Output Layer

The activated outputs from the hidden layer are passed to the output layer, which provides the final predictions of the network.

- The output layer can have one or multiple neurons, depending on the task (e.g., binary classification, multi-class classification, regression).
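
Putting the steps together, a forward pass through a small network might look like the sketch below. The layer sizes (3 inputs, 4 hidden neurons, 2 outputs), the random weights, and the softmax on the output layer (a common choice for multi-class classification) are all assumptions made for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    """Turn raw output scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input -> weighted sum -> activation -> output layer."""
    hidden = relu(w_hidden @ x + b_hidden)   # Steps 2 and 3
    logits = w_out @ hidden + b_out          # Step 4: output scores
    return softmax(logits)

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0, 3.0])                # Step 1: three input features

w_hidden, b_hidden = rng.normal(size=(4, 3)), np.zeros(4)
w_out, b_out = rng.normal(size=(2, 4)), np.zeros(2)

probs = forward(x, w_hidden, b_hidden, w_out, b_out)
print(probs)        # one probability per output neuron
print(probs.sum())  # sums to 1 (up to floating-point rounding)
```

For a regression task, the softmax would be dropped and the raw output used directly.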

![image.png](attachment:image.png)

Here is the complete diagram of how a feed-forward network works:

![image.png](attachment:image.png)

### Applications of Feed-Forward Neural Networks

Feed-Forward Neural Networks are widely used in various applications, including:

- **Image Classification**: Identifying objects in images.
- **Natural Language Processing**: Tasks such as sentiment analysis and text classification.
- **Regression Tasks**: Predicting continuous values, such as house prices based on features.
- **Pattern Recognition**: Recognizing patterns in data for various classification tasks.

-----

## Summary of Today's Notebook
- Explored the Add & Norm component in the transformer encoder.
- Discussed the purpose of residual connections and layer normalization.
- Implemented Add & Norm in code, showing how it combines inputs and stabilizes outputs.
- Introduced Feed-Forward Neural Networks and their structure.
- Implemented a simple FFNN in code, processing an example sentence.
- Explained the steps of input representation, weighted sum calculation, activation function, and output interpretation.