## 1. What are Sequence-to-Sequence Models?
**Answer:** Sequence-to-sequence (seq2seq) models are a type of neural network architecture designed to convert sequences from one domain into sequences in another domain. They are typically used for tasks where both the input and output are sequences of variable lengths.

**Components:**
- **Encoder:** Processes the input sequence and converts it into a context vector (a fixed-size representation).
- **Decoder:** Takes the context vector and generates the output sequence.
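
The encoder/decoder split can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the vocabulary and hidden sizes, the `SOS`/`EOS` token ids, the plain tanh recurrences, and the random (untrained) weights. It only shows how a variable-length input is folded into one fixed-size context vector that then seeds greedy decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 6, 8   # toy sizes (assumed)
SOS, EOS = 0, 1        # assumed start/end token ids

# Randomly initialised, untrained parameters (illustration only).
E = rng.normal(scale=0.1, size=(VOCAB, HIDDEN))       # shared token embeddings
W_enc = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))  # encoder recurrence
W_dec = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))  # decoder recurrence
W_out = rng.normal(scale=0.1, size=(HIDDEN, VOCAB))   # hidden state -> vocabulary logits


def encode(tokens):
    """Fold the whole input sequence into one fixed-size context vector."""
    h = np.zeros(HIDDEN)
    for t in tokens:
        h = np.tanh(E[t] + W_enc @ h)
    return h


def decode(context, max_len=5):
    """Greedy decoding: start from the context vector, emit one token per step."""
    h, token, output = context, SOS, []
    for _ in range(max_len):
        h = np.tanh(E[token] + W_dec @ h)
        token = int(np.argmax(h @ W_out))
        if token == EOS:
            break
        output.append(token)
    return output


context = encode([2, 3, 4, 5, 2])   # input of length 5
generated = decode(context)         # output length may differ from the input length
```

Because the weights are untrained, the generated tokens are meaningless; the point is that `context` has the same shape regardless of the input length.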

**Applications:**
- Machine Translation
- Text Summarization
- Speech Recognition

---

## 2. What are the Problems with Vanilla RNNs?
**Answer:** Vanilla RNNs face several issues:
- **Vanishing Gradient Problem:** Gradients can become very small during backpropagation, making it difficult for the network to learn long-term dependencies.
- **Exploding Gradient Problem:** Gradients can become excessively large, leading to unstable training.
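- **Difficulty in Capturing Long-Term Dependencies:** As a consequence of vanishing gradients, vanilla RNNs struggle with sequences where dependencies span many time steps, because the error signal from distant steps barely affects the weight updates.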
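
A rough numerical sketch of the first two issues, under simplifying assumptions: the recurrence is treated as linear (the activation derivative is ignored), and the recurrent matrix is a scaled orthogonal matrix so that every singular value equals `scale`. The function name and sizes are made up for this example.

```python
import numpy as np


def gradient_norms(scale, steps=50, hidden=16, seed=0):
    """Norm of a gradient pushed back through `steps` linear recurrent steps."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix times `scale`: every singular value equals `scale`.
    q, _ = np.linalg.qr(rng.normal(size=(hidden, hidden)))
    W = scale * q
    grad = np.ones(hidden)
    norms = []
    for _ in range(steps):
        grad = W.T @ grad                 # one step of backpropagation through time
        norms.append(np.linalg.norm(grad))
    return norms


vanishing = gradient_norms(scale=0.5)     # norm shrinks by 0.5 per step
exploding = gradient_norms(scale=1.5)     # norm grows by 1.5 per step
```

With a scale below 1 the gradient norm decays geometrically with the number of time steps, and with a scale above 1 it grows geometrically, which is the mechanism behind both problems.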

---

## 3. What is Gradient Clipping?
**Answer:** Gradient clipping is a technique used to address the exploding gradient problem in neural networks. It involves setting a threshold value and scaling down gradients that exceed this threshold. This prevents gradients from becoming too large and destabilizing the training process.
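
The scaling rule itself can be sketched in plain NumPy. This is a minimal illustration of clipping by L2 norm, not any framework's internal implementation; `clip_by_norm` is a hypothetical helper name.

```python
import numpy as np


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its L2 norm is at most `max_norm`; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad


large = np.array([3.0, 4.0])   # L2 norm 5.0, above the threshold
small = np.array([0.3, 0.4])   # L2 norm 0.5, below the threshold

clipped_large = clip_by_norm(large, max_norm=1.0)   # rescaled to norm 1.0
clipped_small = clip_by_norm(small, max_norm=1.0)   # returned unchanged
```

Clipping by norm keeps the gradient's direction and only shrinks its length, unlike clipping each component to a fixed range.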

**Implementation:**
```python
from tensorflow.keras.optimizers import Adam

# clipnorm rescales each gradient tensor so that its L2 norm is at most 1.0.
optimizer = Adam(clipnorm=1.0)
```

---

## 4. Explain the Attention Mechanism
**Answer:** The attention mechanism allows models to focus on different parts of the input sequence when generating each part of the output sequence. It enhances the model’s ability to handle long-range dependencies by weighing the importance of different parts of the input.

**Key Components:**
- **Alignment Scores:** Measure how well each part of the input matches the current part of the output.
- **Context Vector:** A weighted sum of the input features based on the alignment scores.

**Equation:**
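\[ c_t = \sum_i \alpha_{t,i} \, h_i, \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})} \]
where \( e_{t,i} \) is the alignment score between output step \( t \) and input position \( i \), \( \alpha_{t,i} \) is the corresponding attention weight, and \( h_i \) is the encoder representation of input position \( i \).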
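
A minimal NumPy sketch of the context-vector computation. It assumes dot-product alignment scores (the text does not fix a scoring function) and normalizes the scores with a softmax so the weights sum to 1; the function names and toy vectors are invented for the example.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()


def attention_context(decoder_state, encoder_states):
    """Return attention weights and the context vector for one output step."""
    scores = encoder_states @ decoder_state   # one alignment score per input position
    weights = softmax(scores)                 # normalized attention weights
    context = weights @ encoder_states        # weighted sum of the input representations
    return weights, context


encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])       # 3 input positions, 2 features each
decoder_state = np.array([2.0, 0.0])

weights, context = attention_context(decoder_state, encoder_states)
```

Here the first and third input positions score equally against the decoder state and receive equal weight, while the second position scores lower and contributes less to the context vector.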

---

## 5. Explain Conditional Random Fields (CRFs)
**Answer:** Conditional Random Fields (CRFs) are a type of probabilistic graphical model used for predicting sequences. They model the conditional probability of a sequence of labels given a sequence of observations, capturing dependencies between labels.

**Key Features:**
- **Structured Prediction:** CRFs consider the entire sequence for prediction rather than individual labels.
- **Feature Functions:** CRFs can incorporate various features from the data to improve predictions.
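
As a sketch of what "modeling the conditional probability of a label sequence" means, the following assumes a linear-chain CRF whose feature functions have already been collapsed into two score tables: per-position `emissions` and label-to-label `transitions`. Both tables are random here and the function names are hypothetical. The forward algorithm computes the normalizer over all label sequences.

```python
import itertools

import numpy as np


def sequence_score(emissions, transitions, labels):
    """Unnormalized score of one label sequence."""
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores of all label sequences."""
    alpha = emissions[0]
    for t in range(1, len(emissions)):
        # alpha[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + transitions, axis=0) + emissions[t]
    return np.logaddexp.reduce(alpha)


def log_prob(emissions, transitions, labels):
    """log P(labels | observations) for a linear-chain CRF."""
    return sequence_score(emissions, transitions, labels) - log_partition(emissions, transitions)


rng = np.random.default_rng(0)
emissions = rng.normal(size=(4, 3))     # 4 positions, 3 labels (toy scores)
transitions = rng.normal(size=(3, 3))   # score of moving from label i to label j

# Sanity check: probabilities over all 3**4 label sequences sum to 1.
total = sum(
    np.exp(log_prob(emissions, transitions, y))
    for y in itertools.product(range(3), repeat=4)
)
```

The transition table is what lets the model capture dependencies between neighbouring labels; normalizing over whole sequences rather than per position is what makes the prediction structured.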

---

## 6. Explain Self-Attention
**Answer:** Self-attention is a mechanism where each element in a sequence attends to every element of the same sequence (including itself) to produce a new representation of that element. It helps capture relationships between elements regardless of how far apart they are in the sequence.

**Key Components:**
- **Query, Key, and Value Vectors:** Used to compute attention scores and weighted sums.
- **Attention Scores:** Measure the relevance of each element to others.

**Equation:**
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]
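
The formula translates almost line for line into NumPy. In this sketch the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned weights, and the sequence length and dimensions are arbitrary.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # stable softmax
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))               # 4 tokens, 8 features each
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

# Self-attention: queries, keys, and values all come from the same sequence X.
output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row `i` of `weights` shows how much token `i` draws from every token in the sequence, itself included.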


---

## 7. What is Bahdanau Attention?
**Answer:** Bahdanau Attention, also known as additive attention, is a type of attention mechanism that computes alignment scores using a feedforward neural network. It helps improve the performance of sequence-to-sequence models by allowing them to focus on different parts of the input sequence.

**Components:**
- **Alignment Scores:** Computed using a neural network.
- **Context Vector:** Weighted sum of the input sequence based on the alignment scores.

**Equation:**
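\[ \text{Score}(s_{t-1}, h_i) = v^T \tanh(W_s s_{t-1} + W_h h_i + b) \]
where \( s_{t-1} \) is the previous decoder state, \( h_i \) is the encoder hidden state at input position \( i \), and \( W_s \), \( W_h \), \( b \), and \( v \) are learned parameters.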
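
A minimal NumPy sketch of the additive score, evaluated for all encoder states at once against one previous decoder state. The dimensions are arbitrary and the parameters are random placeholders for weights that would normally be learned; `additive_scores` is a hypothetical helper name.

```python
import numpy as np


def additive_scores(encoder_states, prev_decoder_state, W_h, W_s, b, v):
    """v^T tanh(W_h h + W_s s + b) for every encoder state h."""
    hidden = np.tanh(encoder_states @ W_h.T + W_s @ prev_decoder_state + b)
    return hidden @ v                      # one score per input position


rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 6, 4, 3              # input length and toy dimensions (assumed)
encoder_states = rng.normal(size=(T, d_h))
prev_decoder_state = rng.normal(size=d_s)
W_h = rng.normal(size=(d_a, d_h))
W_s = rng.normal(size=(d_a, d_s))
b = rng.normal(size=d_a)
v = rng.normal(size=d_a)

scores = additive_scores(encoder_states, prev_decoder_state, W_h, W_s, b, v)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                   # softmax over input positions
context = weights @ encoder_states         # weighted sum of encoder states
```

Because the two states pass through separate projections before being added, the encoder and decoder do not need to share a hidden size, which is one practical difference from dot-product scoring.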

---

## 8. What is a Language Model?
**Answer:** A language model is a statistical model that assigns a probability to a sequence of words, usually by estimating the probability of each word given its preceding context. It is used in NLP tasks such as speech recognition, machine translation, and text generation.

**Types:**
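- **Unigram Model:** Treats each word as independent of its context, so a sequence's probability is the product of individual word probabilities.
- **N-gram Model:** Conditions each word on the previous n-1 words, with probabilities estimated from n-gram counts.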
- **Neural Language Models:** Use neural networks to capture complex patterns.
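
To make the n-gram case concrete, here is a small bigram (n = 2) model estimated by counting, using only the standard library. The three-sentence corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions for the example, and no smoothing is applied, so unseen bigrams get probability 0.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams, with <s> and </s> as sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])                    # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams


def bigram_prob(contexts, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


def sentence_prob(contexts, bigrams, sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(contexts, bigrams, prev, word)
    return prob


corpus = ["the cat sat", "the dog sat", "the cat ran"]
contexts, bigrams = train_bigram(corpus)

p_cat = bigram_prob(contexts, bigrams, "the", "cat")            # 2 of 3 "the" are followed by "cat"
p_sentence = sentence_prob(contexts, bigrams, "the cat sat")    # 1 * 2/3 * 1/2 * 1
p_unseen = sentence_prob(contexts, bigrams, "the bird sat")     # contains an unseen bigram
```

The zero probability for the unseen sentence is the sparsity problem that smoothing techniques and neural language models are meant to address.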

---

## 9. What is Multi-Head Attention?
**Answer:** Multi-head attention is an extension of the attention mechanism that uses multiple attention heads to capture different aspects of the input sequence. Each head learns to focus on different parts of the sequence, and their outputs are combined to produce the final representation.

**Key Components:**
- **Multiple Attention Heads:** Each head learns different attention patterns.
- **Concatenation and Linear Transformation:** Combine the outputs of all heads.

**Equation:**
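\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) W^O \]
where \( \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \) and \( W_i^Q \), \( W_i^K \), \( W_i^V \), \( W^O \) are learned projection matrices.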
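
A compact NumPy sketch of multi-head self-attention. It assumes the common implementation trick of using one large projection per role and slicing its columns into heads, which is equivalent to giving each head its own smaller projection matrix. All weights are random placeholders and the sizes are arbitrary.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Run scaled dot-product attention per head, concatenate, then project with W_o."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(num_heads):
        cols = slice(i * d_head, (i + 1) * d_head)      # this head's slice of the features
        scores = Q[:, cols] @ K[:, cols].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, cols])      # (seq_len, d_head)
    return np.concatenate(heads, axis=-1) @ W_o         # (seq_len, d_model)


rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Each head computes its own attention weights over a lower-dimensional slice, so different heads are free to attend to different positions before `W_o` mixes their outputs.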

---

## 10. What is Bilingual Evaluation Understudy (BLEU)?
**Answer:** BLEU (Bilingual Evaluation Understudy) is an evaluation metric for machine translation and text generation. It measures the quality of translated text by comparing it to one or more reference translations.

**Key Features:**
- **N-gram Precision:** Measures the overlap of n-grams between the generated and reference texts.
- **Brevity Penalty:** Penalizes overly short translations.

**Equation:**
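\[ \text{BLEU} = \text{BP} \cdot \exp \left( \sum_{n=1}^N w_n \log p_n \right) \]
where \( p_n \) is the modified precision for n-grams, \( w_n \) is the weight for each n-gram order (typically \( w_n = 1/N \) with \( N = 4 \)), and BP is the brevity penalty: \( \text{BP} = 1 \) if the candidate length \( c \) exceeds the reference length \( r \), and \( \text{BP} = e^{1 - r/c} \) otherwise.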
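
A simplified sentence-level BLEU sketch using only the standard library. It combines the n-gram precisions as a geometric mean (uniform weights 1/N applied to log p_n) and multiplies by the brevity penalty. The simplifications are assumptions of this example: a single reference, no smoothing (any zero precision gives a score of 0), and whitespace tokenization. Established toolkits differ in these details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped counts: a candidate n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
perfect = bleu(reference, reference)
truncated = bleu("the cat sat on the".split(), reference)   # every n-gram matches, but too short
unrelated = bleu("a dog ran in a park".split(), reference)  # no unigram overlap
```

The truncated candidate has perfect n-gram precision, so its score is determined entirely by the brevity penalty, which shows why precision alone would reward overly short outputs.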
