# Assignment 7

#### 1. Explain the architecture of BERT

1. Transformer Encoder: BERT is built upon a stack of Transformer encoder layers. Each encoder layer consists of a multi-head self-attention mechanism and a feed-forward neural network. The self-attention mechanism allows the model to capture contextual dependencies by attending to different parts of the input sequence. The feed-forward neural network processes the output of the self-attention layer to generate the final representations.

2. Pre-training and Fine-tuning: BERT is pre-trained on a large corpus of text data using two unsupervised tasks: masked language modeling (MLM) and next sentence prediction (NSP). During pre-training, BERT learns to predict missing words in sentences (MLM) and to determine whether two sentences are consecutive or not (NSP). After pre-training, BERT's weights are fine-tuned on specific downstream tasks, such as text classification, named entity recognition, question answering, etc.

3. Tokenization: BERT tokenizes the input text into subword units using WordPiece tokenization. This allows BERT to handle out-of-vocabulary (OOV) words and capture morphological variations. The input text is tokenized into a sequence of subword tokens, and special tokens are added: [CLS] (classification) is placed at the start of the input, and [SEP] (separator) marks the end of each sentence.

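4. Positional Encoding: Self-attention by itself is insensitive to token order, so BERT adds positional information to the input. It uses learned position embeddings, which are summed with the token embeddings and with segment embeddings that indicate which sentence a token belongs to. This helps the model distinguish the order of words and capture the context effectively.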

5. Pre-training Objectives: BERT's pre-training objectives, MLM and NSP, enable it to learn a deep bidirectional representation of the input text. The MLM objective involves randomly masking some of the input tokens and predicting them based on the surrounding context. The NSP objective involves predicting whether two sentences appear consecutively in the original text.

6. Fine-tuning and Task-Specific Layers: After pre-training, BERT's weights are fine-tuned on specific downstream tasks. Task-specific layers are added on top of the pre-trained BERT model to adapt it to the specific task requirements. These additional layers may include pooling, classification, or sequence labeling layers, depending on the task at hand.
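
Returning to the tokenization step (item 3 above), the following is a minimal sketch of greedy longest-match-first subword splitting, followed by wrapping a sentence pair in [CLS] and [SEP]. The tiny vocabulary, the function names `wordpiece` and `build_inputs`, and the example sentences are assumptions made for illustration; a real BERT vocabulary is learned from data and is far larger.

```python
# Toy vocabulary (hypothetical); "##" marks a piece that continues a word.
VOCAB = {"[UNK]", "[CLS]", "[SEP]", "play", "##ing", "##ed",
         "un", "##related", "the", "dog", "is"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

def build_inputs(sentence_a, sentence_b):
    """Return (tokens, segment_ids) for a sentence pair."""
    tokens_a = [p for w in sentence_a.lower().split() for p in wordpiece(w)]
    tokens_b = [p for w in sentence_b.lower().split() for p in wordpiece(w)]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_inputs("the dog is playing", "unrelated")
```

Here "playing" is split into `play` and `##ing`, so a word that is absent from the vocabulary can still be represented by pieces that are present.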

#### 2. Explain Masked Language Modeling (MLM)

Masked language modeling (MLM) is a pre-training objective used by models such as BERT (Bidirectional Encoder Representations from Transformers) to learn contextualized representations of the words in a sentence. The goal of MLM is to predict tokens that have been hidden in the input.

During pre-training, a fixed fraction of the input tokens (15% in the original BERT setup) is chosen at random. Most of the chosen tokens are replaced with a special [MASK] token; in the original setup, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The model is then trained to predict the original tokens from the surrounding context.
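
A minimal sketch of this corruption step, assuming the proportions reported for the original BERT setup (15% of tokens selected; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged). The function name `mask_tokens`, the toy vocabulary, and the fixed seed are illustrative assumptions, not part of any library.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected with mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # hypothetical vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
```

Keeping some selected tokens unchanged or swapping in random ones means the model cannot rely on [MASK] always marking the positions it has to predict.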

The MLM objective serves two purposes:

1. Bidirectional Context: By masking some tokens and predicting them based on the context, MLM allows the model to incorporate bidirectional information. It enables the model to understand the dependencies between words in both directions (before and after the masked token), capturing a deeper understanding of the context.

2. Masked Token Representation: When predicting the masked tokens, the model generates representations that encapsulate the information of the surrounding words. This encourages the model to learn contextualized representations that capture the semantic and syntactic properties of the masked words.

For each masked position, the model outputs a probability distribution over the vocabulary. Training minimizes the cross-entropy between this distribution and the original token, so the loss is computed only at the masked positions.
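
The training signal can be sketched as a cross-entropy loss averaged over the masked positions only. The scores below are hand-written for a four-word toy vocabulary, and the names `softmax` and `mlm_loss` are assumptions for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(logits_per_position, target_ids):
    """Mean negative log-likelihood of the original tokens.

    target_ids[i] is the vocabulary index of the original token at a masked
    position, or None where the position was not masked.
    """
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        if target is None:
            continue  # unmasked positions do not contribute to the loss
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses) if losses else 0.0

# Three positions, vocabulary of size 4; only positions 1 and 2 were masked.
logits = [[2.0, 0.1, 0.1, 0.1],
          [0.0, 0.0, 0.0, 0.0],
          [0.1, 0.1, 3.0, 0.1]]
targets = [None, 1, 2]
loss = mlm_loss(logits, targets)
```

A uniform prediction over four words costs log(4) at that position, while a confident correct prediction costs much less, which is what gradient descent pushes the model toward.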

Because the targets come from the text itself rather than from human annotation, MLM is a self-supervised (often loosely called unsupervised) objective. By training on a large text corpus with MLM, models like BERT acquire contextual representations that capture linguistic nuance and can be fine-tuned for a variety of downstream tasks, including text classification, named entity recognition, and question answering.

#### 3. Explain Next Sentence Prediction (NSP)

Next sentence prediction (NSP) is the second pre-training objective used in BERT (Bidirectional Encoder Representations from Transformers); it targets relationships between sentences rather than between words. The goal of NSP is to decide whether the second sentence of a pair actually follows the first in the original text or was sampled at random.

The model is fed pairs of sentences during pre-training. For each pair, there is a 50% chance that the second sentence follows the first in the original text and a 50% chance that it is drawn at random from the corpus. The model's task is to predict which of the two cases applies.
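
The pair construction described above can be sketched as follows. The function name `make_nsp_pairs` and the toy corpus are assumptions for illustration, and drawing the negative from a different document is one simple way to guarantee it is not the true successor.

```python
import random

def make_nsp_pairs(documents, rng=None):
    """Build (sentence_a, sentence_b, is_next) triples.

    documents is a list of documents, each a list of sentences.
    At least two documents are needed so that negatives can be sampled.
    """
    rng = rng or random.Random()
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that really follows.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a random sentence from another document.
                others = [j for j in range(len(documents)) if j != doc_idx]
                other_doc = documents[rng.choice(others)]
                pairs.append((doc[i], rng.choice(other_doc), False))
    return pairs

documents = [
    ["I went to the store.", "I bought milk.", "Then I walked home."],
    ["Penguins cannot fly.", "They are strong swimmers."],
]
pairs = make_nsp_pairs(documents, rng=random.Random(0))
```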

The NSP objective serves to enhance the model's understanding of sentence-level relationships and discourse coherence. It helps the model capture the dependencies and semantic connections between consecutive sentences, which is crucial for tasks such as question answering, natural language inference, and text generation.

To train on the NSP objective, the two sentences are packed into a single input sequence ([CLS] sentence A [SEP] sentence B [SEP]) and encoded jointly by the Transformer. A binary classifier is then applied to the final representation of the [CLS] token to decide whether the second sentence is the true successor.

Training on the NSP objective teaches the model to recognize connections between sentences and to judge the coherence and continuity of a text. The resulting representations are intended to help on downstream tasks that require reasoning about sentence-level relationships.

#### 4. What is Matthews evaluation?

Matthews evaluation, also known as the Matthews correlation coefficient (MCC), is a measure of the quality of binary (two-class) classification models. It takes into account true positives, true negatives, false positives, and false negatives to provide an overall assessment of the model's performance.

The Matthews correlation coefficient is calculated using the following formula:

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

where:
- TP (True Positives) is the number of instances correctly classified as positive.
- TN (True Negatives) is the number of instances correctly classified as negative.
- FP (False Positives) is the number of instances incorrectly classified as positive.
- FN (False Negatives) is the number of instances incorrectly classified as negative.

The MCC ranges from -1 to +1, where +1 indicates a perfect classification, 0 indicates performance no better than random guessing, and -1 indicates total disagreement between the predictions and the true labels.
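
The formula above translates directly into code. This is a minimal sketch; the function name `mcc` is an assumption, and returning 0 when the denominator is zero is a common convention rather than part of the formula itself.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # A row or column of the confusion matrix is empty; use 0 by convention.
        return 0.0
    return (tp * tn - fp * fn) / denom

perfect = mcc(tp=50, tn=50, fp=0, fn=0)      # every prediction correct
inverted = mcc(tp=0, tn=0, fp=50, fn=50)     # every prediction wrong
chance = mcc(tp=25, tn=25, fp=25, fn=25)     # predictions unrelated to labels
```

The three calls reproduce the endpoints and midpoint of the range: `perfect` is 1.0, `inverted` is -1.0, and `chance` is 0.0.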

The MCC is particularly useful when dealing with imbalanced datasets, because it uses all four cells of the confusion matrix and is high only when the model does well on both the positive and the negative class. This makes it a more reliable measure of model performance than accuracy, which can look good when a model simply predicts the majority class.

The Matthews correlation coefficient is commonly used in various fields, including machine learning, bioinformatics, and information retrieval, to assess the performance of binary classification models and compare different models or algorithms.

#### 5. What is Matthews Correlation Coefficient (MCC)?

The Matthews Correlation Coefficient (MCC) is a measure of the quality of binary (two-class) classification models. It takes into account true positives, true negatives, false positives, and false negatives to provide an overall assessment of the model's performance.

The MCC is calculated using the following formula:

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

where:
- TP (True Positives) is the number of instances correctly classified as positive.
- TN (True Negatives) is the number of instances correctly classified as negative.
- FP (False Positives) is the number of instances incorrectly classified as positive.
- FN (False Negatives) is the number of instances incorrectly classified as negative.

The MCC ranges from -1 to +1, where +1 indicates a perfect classification, 0 indicates performance no better than random guessing, and -1 indicates total disagreement between the predictions and the true labels.

The MCC is a balanced measure that takes into account the distribution of the classes and provides a reliable evaluation of model performance, especially when dealing with imbalanced datasets. It is commonly used in machine learning, bioinformatics, and other fields where binary classification models are employed.

The MCC is particularly useful when the classes are unevenly distributed, that is, when there is a large imbalance between the number of positive and negative instances. It is more robust than accuracy because it considers all four classification outcomes, making it a valuable metric for assessing and comparing binary classification models.
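
To make the contrast with accuracy concrete, the sketch below scores two hypothetical classifiers on a made-up dataset with 5 positives and 95 negatives. All names and numbers are assumptions chosen for illustration.

```python
import math

def confusion(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true)

def mcc(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority (negative) class.
pred_a = [0] * 100
# Classifier B finds 4 of the 5 positives at the cost of 6 false positives.
pred_b = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

acc_a, mcc_a = accuracy(y_true, pred_a), mcc(y_true, pred_a)  # 0.95 and 0.0
acc_b, mcc_b = accuracy(y_true, pred_b), mcc(y_true, pred_b)  # 0.93 and about 0.535
```

Accuracy ranks classifier A above B even though A never finds a positive instance; MCC ranks them the other way around.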

#### 6. Explain Semantic Role Labeling

Semantic role labeling (SRL) is a natural language processing (NLP) task that identifies and categorizes the roles that words or phrases in a sentence play in relation to the main verb or predicate. By assigning a semantic role to each relevant phrase, it aims to capture who did what to whom, when, and where.

To identify the roles of the various constituents with respect to the predicate, SRL systems often draw on the syntactic structure of the sentence. Depending on the linguistic framework used, the roles commonly include agent, patient, instrument, location, time, and many others.

A traditional semantic role labeling pipeline consists of the following steps:

1. Parsing: First, the sentence is parsed to determine its syntactic structure, such as the dependency parse or constituency parse. This step helps identify the relationships between words in the sentence.

2. Predicate Identification: The main verb or predicate of the sentence is identified. It serves as the anchor point for assigning roles to other words in the sentence.

3. Role Labeling: Each word or phrase in the sentence is assigned a specific role based on its relationship to the main predicate. This is done by considering the syntactic structure, contextual information, and linguistic cues.

4. Role Classification: The assigned roles are typically categorized into a predefined set of semantic roles. For example, a noun phrase might be labeled as an agent, indicating that it performs the action denoted by the predicate.

5. Disambiguation: In cases where multiple roles are possible for a word or phrase, disambiguation techniques are applied to select the most appropriate role based on the context and semantic constraints.
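
One common way to represent the result of these steps is a BIO tag per token, which is then grouped into labeled spans. The sketch below shows only that final grouping; the tags are written by hand rather than predicted, and the role names (AGENT, THEME, RECIPIENT) and the function name `bio_to_roles` are assumptions for illustration.

```python
def bio_to_roles(tokens, tags):
    """Group BIO-tagged tokens into (role, phrase) pairs.

    "B-X" starts a span with role X, "I-X" continues it, "O" is outside any span.
    """
    roles, current_role, current_words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_role is not None:
                roles.append((current_role, " ".join(current_words)))
            current_role, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_role == tag[2:]:
            current_words.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if current_role is not None:
                roles.append((current_role, " ".join(current_words)))
            current_role, current_words = None, []
    if current_role is not None:
        roles.append((current_role, " ".join(current_words)))
    return roles

tokens = ["Mary", "sold", "the", "book", "to", "John"]
tags = ["B-AGENT", "B-PREDICATE", "B-THEME", "I-THEME",
        "B-RECIPIENT", "I-RECIPIENT"]
roles = bio_to_roles(tokens, tags)
# roles == [("AGENT", "Mary"), ("PREDICATE", "sold"),
#           ("THEME", "the book"), ("RECIPIENT", "to John")]
```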

#### 7. Why Fine-tuning a BERT model takes less time than pretraining

Fine-tuning a BERT model usually takes far less time than pretraining because pretraining trains the model from scratch on a very large corpus of unlabeled text, which is computationally expensive. During pretraining, BERT learns general language representations by predicting masked words in a sentence (Masked Language Modeling) and by determining whether two sentences follow one another (Next Sentence Prediction).

Fine-tuning, by contrast, uses labeled data to adapt the pretrained BERT model to a particular downstream task. In the standard recipe, a small task-specific layer is added and all parameters are updated for a few epochs on a comparatively small dataset; freezing some or all of the pretrained layers is an optional way to reduce the cost further. Because the model starts from weights that already encode rich language representations, it needs far fewer training steps than pretraining.

Fine-tuning a pretrained model, rather than training a model from scratch for each task, has several advantages:

1. Transfer Learning: By starting with a pretrained BERT model, fine-tuning allows for leveraging the knowledge and representations learned during pretraining, which can significantly boost performance on downstream tasks.

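2. Reduced Training Time: Since the bulk of the learning is already done during pretraining, fine-tuning needs only a few passes over a task dataset that is much smaller than the pretraining corpus, and the only parameters trained from scratch are those of the task-specific layers. This reduces the overall training time, making it more efficient.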

3. Data Efficiency: Fine-tuning typically needs far less labeled data than training a task model from scratch, as it leverages the knowledge already present in the pretrained model. This can be beneficial in scenarios where labeled data is limited or expensive to obtain.

4. Task-Specific Optimization: Fine-tuning allows for adapting the pretrained BERT model to the specific nuances and requirements of the downstream task. The task-specific layers can be optimized to capture the task-specific patterns and improve performance.

#### 8. Recognizing Textual Entailment (RTE)

Recognizing Textual Entailment (RTE) is a natural language processing task that determines the relationship between two pieces of text, called the "text" and the "hypothesis." The objective is to decide whether the meaning of the hypothesis can be logically inferred from the meaning of the text.

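In its three-way formulation, RTE divides the relationship between the text and the hypothesis into three classes (some datasets, including the RTE task in the GLUE benchmark, merge the last two into a single "not entailment" class):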

1. Entailment: The hypothesis can be inferred or logically derived from the text. It means the information in the text supports or implies the information in the hypothesis.

2. Contradiction: The hypothesis contradicts or is in direct conflict with the information in the text. The meaning of the hypothesis cannot be true based on the information in the text.

3. Neutral: There is no clear logical relationship between the text and the hypothesis. The text does not provide enough information to determine if the hypothesis is true or false.

#### 9. Explain the decoder stack of GPT models

GPT (Generative Pre-trained Transformer) models generate coherent, contextually appropriate text with a stack of Transformer decoder layers. Each layer combines self-attention with a feed-forward neural network, and stacking many such layers lets the model build representations at increasing levels of abstraction.

The decoder stack of GPT models follows the decoder of the original Transformer architecture, but without the encoder-decoder cross-attention sub-layer, since GPT has no encoder. Each layer therefore consists of two sub-layers:

1. Masked Self-Attention Layer: This layer uses a self-attention mechanism to capture dependencies between words within the input sequence. It allows the model to assign different weights to different words based on their relevance to generating the next word. A causal mask prevents each position from attending to later positions, so every prediction depends only on the words that precede it. This helps the model use the context and dependencies among earlier words to generate coherent and contextually appropriate text.

2. Feed-Forward Neural Network Layer: After the self-attention layer, the output is passed through a feed-forward neural network, which applies non-linear transformations to the representations learned from the previous layer. It helps capture more complex patterns and relationships in the data, further enhancing the model's ability to generate high-quality text.
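
The self-attention sub-layer in a GPT-style decoder is causally masked, meaning position i can attend only to positions 0 through i. The sketch below shows single-head scaled dot-product attention with that restriction in plain Python. For brevity it feeds the same toy vectors in as queries, keys, and values; a real model first applies learned linear projections and uses several heads. The function names and numbers are assumptions for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of vectors, one per position.
    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    d_k = len(keys[0])
    n, d_v = len(values), len(values[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Score only positions 0..i; later positions are masked out.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        w = softmax(scores) + [0.0] * (n - i - 1)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w[j] * values[j][d] for j in range(n))
                        for d in range(d_v)])
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three positions, dimension 2
outputs, weights = causal_self_attention(x, x, x)
```

The first position can attend only to itself, so its output equals its own value vector, and the weight matrix is lower triangular.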

The decoder stack repeats this pair of sub-layers many times; the depth ranges from about a dozen layers (12 in the original GPT) to nearly a hundred (96 in the largest GPT-3 model). Each layer allows the model to iteratively refine its representation of the input sequence, incorporating both local and global context to generate more accurate and meaningful text.

At the top of the decoder stack, the output passes through a linear projection and a softmax function to produce a probability distribution over the vocabulary. The next token of the generated text is then chosen from this distribution, for example by taking the most likely token or by sampling.
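
This final step can be sketched as a linear projection of the last hidden state onto the vocabulary, a softmax, and a choice of token. The hidden vector, the weight matrix, the three-word vocabulary, and the function name `next_token` are all made-up assumptions; the bias term is omitted for brevity.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(hidden, output_weights, vocab, rng=None):
    """Pick the next token from the final hidden state.

    output_weights[v] is the weight vector for vocabulary entry v.
    With rng=None the most probable token is returned (greedy decoding);
    otherwise a token is sampled in proportion to its probability.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    probs = softmax(logits)
    if rng is None:
        idx = max(range(len(probs)), key=probs.__getitem__)
    else:
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return vocab[idx], probs

vocab = ["cat", "dog", "fish"]
hidden = [1.0, 0.0]
output_weights = [[0.5, 0.0], [2.0, 0.0], [-1.0, 0.0]]

greedy_word, probs = next_token(hidden, output_weights, vocab)
sampled_word, _ = next_token(hidden, output_weights, vocab, rng=random.Random(0))
```

Greedy decoding always returns the highest-probability token ("dog" here), while sampling can return any token and so produces more varied text.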

The decoder stack of GPT models uses multiple layers of self-attention and feed-forward neural networks to generate coherent and contextually relevant text, making them useful for tasks like language modelling, text generation, and machine translation.