## 1. Explain the architecture of BERT
**Answer:** BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer model designed for natural language understanding tasks. Its architecture is a stack of transformer encoder layers. Self-attention in each layer lets every token attend to all other tokens in the sequence at once, so each token's representation is conditioned on both its left and right context rather than on a single reading direction.

**Key Components:**
- **Input Embeddings:** Token embeddings, segment embeddings, and position embeddings.
- **Encoder Layers:** A stack of transformer encoder layers (12 in BERT-base, 24 in BERT-large), each combining multi-head self-attention with a feedforward network.
- **Output:** Contextualized word embeddings used for various NLP tasks.
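
As a rough illustration of the input embeddings listed above, the following sketch sums a token, a segment, and a position vector at each position. The table sizes, the random weights, and the token ids are toy assumptions, not BERT's real vocabulary or parameters, and the layer normalization and dropout that BERT applies after the sum are left out.

```python
import random

def make_table(rows, dim, rng):
    """Build a rows x dim table of small random values (a stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def embed_inputs(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum the token, segment, and position embeddings at every input position."""
    embedded = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        vector = [
            t + s + p
            for t, s, p in zip(
                token_table[token_id],
                segment_table[segment_id],
                position_table[position],
            )
        ]
        embedded.append(vector)
    return embedded

rng = random.Random(0)
HIDDEN_SIZE = 8  # toy value; BERT-base uses 768
token_table = make_table(30, HIDDEN_SIZE, rng)     # toy vocabulary of 30 ids
segment_table = make_table(2, HIDDEN_SIZE, rng)    # sentence A = 0, sentence B = 1
position_table = make_table(16, HIDDEN_SIZE, rng)  # toy maximum length of 16

# Hypothetical ids standing in for: [CLS] a b [SEP] c [SEP]
token_ids = [1, 7, 8, 2, 9, 2]
segment_ids = [0, 0, 0, 0, 1, 1]

embedded = embed_inputs(token_ids, segment_ids, token_table, segment_table, position_table)
print(len(embedded), len(embedded[0]))
```

Note that the two `[SEP]` tokens share a token id but end up with different input vectors, because their segment and position embeddings differ.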

---

## 2. Explain Masked Language Modeling (MLM)
**Answer:** Masked Language Modeling (MLM) is a pretraining task used in BERT. In MLM, some percentage of the input tokens are randomly masked, and the model is trained to predict the original tokens based on the context provided by the other tokens. This encourages the model to learn bidirectional context and improves its language understanding.

**Process:**
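- **Masking:** Randomly select 15% of the input tokens for prediction. In the original BERT recipe, 80% of the selected tokens are replaced with `[MASK]`, 10% are replaced with a random token, and 10% are left unchanged.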
- **Prediction:** Predict the original token for each masked position.
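
The masking step can be sketched as follows. This is a simplified illustration, not the reference implementation: it decides independently for each token whether to select it, and it applies the 80/10/10 replacement split from the original BERT recipe. The function name `mask_tokens` and the toy vocabulary are assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted, labels); labels[i] holds the original token at selected positions."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS:
            continue  # special tokens are never selected
        if rng.random() >= mask_prob:
            continue  # this position is not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return corrupted, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted)
print(labels)
```

During training, the loss would be computed only at positions where `labels` is not `None`.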

---

## 3. Explain Next Sentence Prediction (NSP)
**Answer:** Next Sentence Prediction (NSP) is another pretraining task used in BERT. The model is trained to predict whether, for a given pair of sentences (A and B), B directly follows A in the original text. This helps BERT model the relationship between sentences, which is important for tasks such as question answering and sentence-pair classification.

**Process:**
- **Training Pairs:** In 50% of the training pairs, B is the actual sentence that follows A; in the other 50%, B is a random sentence from the corpus.
- **Objective:** Classify whether B is the actual next sentence of A.
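
A minimal sketch of how such training pairs might be built is shown below. The helper `make_nsp_pairs` is hypothetical, and drawing the random sentence from a different document is an assumption of this sketch, made so that a negative example cannot accidentally be the true next sentence.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) triples from a list of documents.

    Each document is a list of sentences. Roughly half of the pairs use the
    true next sentence; the rest use a random sentence from another document.
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample negatives")
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if rng.random() < 0.5:
                pairs.append((sentence_a, doc[i + 1], True))
            else:
                other_docs = [d for j, d in enumerate(documents) if j != doc_index]
                sentence_b = rng.choice(rng.choice(other_docs))
                pairs.append((sentence_a, sentence_b, False))
    return pairs

documents = [
    ["The sky is blue.", "Clouds drift by.", "Rain may follow."],
    ["BERT is an encoder.", "It is pretrained on text."],
]
for sentence_a, sentence_b, is_next in make_nsp_pairs(documents, random.Random(0)):
    print(is_next, "|", sentence_a, "|", sentence_b)
```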

---

## 4. What is Matthews evaluation?
**Answer:** Matthews evaluation typically refers to the Matthews Correlation Coefficient (MCC) evaluation metric. It is used to measure the quality of binary (two-class) classifications. It takes into account true positives, true negatives, false positives, and false negatives, providing a balanced measure that can be used even with imbalanced classes.

---

## 5. What is Matthews Correlation Coefficient (MCC)?
**Answer:** The Matthews Correlation Coefficient (MCC) is a metric used for evaluating binary classification tasks. It is a measure of the strength of the relationship between the observed and predicted classifications. MCC returns a value between -1 and 1, where 1 indicates perfect prediction, 0 indicates no better than random prediction, and -1 indicates complete disagreement between prediction and observation.

**Equation:**
\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]
where TP, TN, FP, FN are true positives, true negatives, false positives, and false negatives, respectively.
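
The formula translates directly into code. In the sketch below, returning 0 when the denominator is zero is a common convention rather than part of the definition, and the confusion-matrix counts are made-up numbers used only for illustration.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention: undefined cases (an empty row or column) map to 0
    return numerator / denominator

print(mcc(tp=5, tn=5, fp=0, fn=0))  # every prediction correct
print(mcc(tp=0, tn=0, fp=5, fn=5))  # every prediction wrong
print(mcc(tp=6, tn=3, fp=1, fn=2))  # mixed results
```

The first two calls reproduce the extremes described above (1 and -1); the third yields a value strictly between them.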

---

## 6. Explain Semantic Role Labeling
**Answer:** Semantic Role Labeling (SRL) is the process of identifying the predicate-argument structure in a sentence. It involves determining "who" did "what" to "whom", "when", "where", and "how". SRL assigns labels to words or phrases in a sentence to indicate their semantic roles (e.g., agent, patient, instrument).

**Example:**
- Sentence: "John (agent) gave (predicate) a book (theme) to Mary (recipient)."
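
One simple way to represent this output in code is a role label per token, with consecutive tokens that share a label merged into a phrase. The sketch below uses hand-written labels for the example sentence; it does not run an SRL model, and the helper `group_roles` is hypothetical. Because adjacent arguments with the same role would be merged, real systems commonly use a BIO tagging scheme instead.

```python
from itertools import groupby

def group_roles(tokens, roles):
    """Merge consecutive tokens that share a role label into (role, phrase) spans."""
    spans = []
    index = 0
    for role, group in groupby(roles):
        length = len(list(group))
        spans.append((role, " ".join(tokens[index:index + length])))
        index += length
    return spans

tokens = ["John", "gave", "a", "book", "to", "Mary"]
# Hand-annotated roles, one per token (an assumption for this example).
roles = ["agent", "predicate", "theme", "theme", "recipient", "recipient"]

spans = group_roles(tokens, roles)
for role, phrase in spans:
    print(f"{role}: {phrase}")
```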

---

## 7. Why fine-tuning a BERT model takes less time than pretraining
**Answer:** Fine-tuning takes less time than pretraining because the two stages start from very different points. Pretraining learns general language representations from randomly initialized weights, which requires processing a very large unlabeled corpus and updating all of the model's parameters (about 110 million in BERT-base) over many training steps. Fine-tuning starts from the pretrained weights, uses a much smaller labeled dataset, and typically runs for only a few epochs, because the model only needs to adapt features it has already learned to the particular task.

---

## 8. Recognizing Textual Entailment (RTE)
**Answer:** Recognizing Textual Entailment (RTE) is a natural language processing task where the goal is to determine whether a given hypothesis is entailed by a given premise. In other words, it checks if the truth of one text fragment (the hypothesis) logically follows from another text fragment (the premise).

**Classification:**
- **Entailment:** The hypothesis follows from the premise.
- **Contradiction:** The hypothesis is contradicted by the premise.
- **Neutral:** The premise does not provide enough information to determine whether the hypothesis is true or false.

---

## 9. Explain the decoder stack of GPT models
**Answer:** The decoder stack of GPT (Generative Pre-trained Transformer) models consists of a series of transformer decoder blocks. Unlike BERT, GPT is a unidirectional (left-to-right) model: each token can attend only to itself and to earlier tokens. Because there is no encoder, the blocks also omit the encoder-decoder attention found in the original transformer decoder.

**Components:**
- **Masked Multi-Head Self-Attention:** Prevents attending to future tokens by masking future positions.
- **Feedforward Neural Network:** A fully connected feedforward network applied to the output of the attention mechanism.
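- **Layer Normalization:** Applied around the attention and feedforward sublayers to stabilize training. The original GPT applies it after each sublayer, while GPT-2 and later models apply it before each sublayer.
- **Residual Connections:** Skip connections around each sublayer that help gradients flow through deep stacks and mitigate vanishing gradients.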

**Purpose:** GPT models are designed for text generation. They are trained to predict the next token in a sequence, and they generate coherent text by repeatedly predicting a token and appending it to the input.
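
The masking in the self-attention component can be illustrated with a small sketch. It takes a square matrix of raw attention scores (toy numbers here, not the output of a real model), hides every future position, and applies a softmax to each row. The function name `causal_attention_weights` is an assumption, and a real implementation would use tensor operations across multiple heads rather than Python lists.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square matrix of attention scores."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]  # position i sees only positions 0..i
        peak = max(visible)           # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)  # future positions get 0
        weights.append(row)
    return weights

scores = [
    [0.2, 1.5, -0.3],
    [0.7, 0.1, 2.0],
    [1.0, 0.4, 0.4],
]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

Each row sums to 1, and every entry above the diagonal is 0, so the representation at a given position never depends on later tokens.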
