## NLP_Assignment_7
1. Explain the architecture of BERT
2. Explain Masked Language Modeling (MLM)
3. Explain Next Sentence Prediction (NSP)
4. What is Matthews evaluation?
5. What is Matthews Correlation Coefficient (MCC)?
6. Explain Semantic Role Labeling
7. Why Fine-tuning a BERT model takes less time than pretraining
8. Recognizing Textual Entailment (RTE)
9. Explain the decoder stack of GPT models.

In [None]:
'''Ans 1:- BERT (Bidirectional Encoder Representations from
Transformers) is a deep learning model built from the encoder half
of the Transformer architecture. It consists of a stack of
Transformer encoder layers, each combining multi-head self-attention
with a feed-forward network. Because self-attention in the encoder
is not masked, every token's representation is conditioned on both
its left and right context at the same time. BERT is pre-trained
with two tasks: Masked Language Modeling (MLM) and Next Sentence
Prediction (NSP). The resulting contextualized word representations
can then be fine-tuned for a wide range of natural language
understanding tasks.'''
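
To make the "stack of encoder layers" concrete, the sketch below counts the parameters of a BERT-base-sized encoder. The hyperparameters (12 layers, hidden size 768, feed-forward size 3072, 30,522-token vocabulary, 512 positions) are the published BERT-base configuration and are an assumption here, since the answer above does not state them. The helper names are hypothetical.

```python
# Assumed BERT-base hyperparameters (published configuration, not stated in the answer).
VOCAB_SIZE = 30522     # WordPiece vocabulary of bert-base-uncased
MAX_POSITIONS = 512    # learned position embeddings
TYPE_VOCAB = 2         # segment A / segment B embeddings
HIDDEN = 768
FFN = 3072
LAYERS = 12


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Parameters of LayerNorm: one scale and one shift vector."""
    return 2 * size


def embedding_params():
    """Token, position and segment embedding tables plus their LayerNorm."""
    tables = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN
    return tables + layer_norm(HIDDEN)


def encoder_layer_params():
    """One Transformer encoder layer: self-attention block + feed-forward block."""
    # Query, key, value and output projections, followed by LayerNorm.
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + layer_norm(HIDDEN)
    return attention + feed_forward


def bert_base_params():
    """Embeddings + stacked encoder layers + pooler over the [CLS] vector."""
    pooler = linear(HIDDEN, HIDDEN)
    return embedding_params() + LAYERS * encoder_layer_params() + pooler


print(f"embeddings:        {embedding_params():,}")
print(f"one encoder layer: {encoder_layer_params():,}")
print(f"total:             {bert_base_params():,}")
```

The number of attention heads does not appear in the count because the heads only split the 768-dimensional projections; they do not add parameters.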

In [None]:
'''Ans 2:- Masked Language Modeling (MLM) is a pre-training task in
natural language processing and a core component of models like
BERT. During MLM, a random subset of the input tokens (15% in BERT)
is selected, and most of the selected tokens are replaced with a
special [MASK] token. The model's goal is to predict the original
tokens from the context provided by the surrounding words on both
sides. This bidirectional context learning helps the model capture
rich contextual information, resulting in better word
representations and improved performance on downstream tasks like
text classification, question answering, and named entity
recognition.'''
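
The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual preprocessing code: it selects each token independently with probability `mask_prob` and then applies the 80/10/10 rule described in the BERT paper (replace with `[MASK]`, replace with a random token, or leave unchanged). The function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict and None at every other position.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = "the cat sat on the mat because it was tired".split()
vocab = sorted(set(tokens))  # toy vocabulary, for illustration only
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, which is why the model cannot simply copy its input.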

In [None]:
'''Ans 3:- Next Sentence Prediction (NSP) is a binary classification
pre-training task used in BERT. The model receives a pair of
sentences and predicts whether the second sentence actually follows
the first in the original text. During pre-training, half of the
pairs are true consecutive sentences and half pair the first
sentence with a randomly chosen one. NSP is intended to help the
model capture relationships between sentences, which benefits
sentence-pair tasks such as question answering and natural language
inference.'''
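
A minimal sketch of how NSP training pairs could be built is shown below. It makes two simplifying assumptions: negatives are drawn from the same document (BERT draws them from a different document in the corpus), and whitespace splitting stands in for WordPiece tokenization. The function names are hypothetical.

```python
import random


def make_nsp_examples(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples from one ordered document."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample negatives")
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        sentence_a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: the true next sentence.
            sentence_b, is_next = sentences[i + 1], True
        else:
            # Negative example: any sentence other than A or its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            sentence_b, is_next = rng.choice(candidates), False
        examples.append((sentence_a, sentence_b, is_next))
    return examples


def format_pair(sentence_a, sentence_b):
    """Lay out a pair as [CLS] A [SEP] B [SEP] with segment ids 0 for A and 1 for B."""
    tokens_a, tokens_b = sentence_a.split(), sentence_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


document = [
    "I went to the store",
    "I bought a carton of milk",
    "Penguins cannot fly",
    "The meeting starts at noon",
]
for sentence_a, sentence_b, is_next in make_nsp_examples(document, seed=1):
    tokens, segment_ids = format_pair(sentence_a, sentence_b)
    print(is_next, tokens, segment_ids)
```

The classifier reads the final hidden state of the `[CLS]` token to make the IsNext / NotNext decision.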

In [3]:
'''Ans 4:- The Matthews correlation coefficient (MCC) is a metric
for evaluating the performance of a binary classifier. It is a
more robust metric than accuracy, especially when the classes
are imbalanced. The MCC is calculated as follows:

MCC = (TP*TN - FP*FN) / sqrt((TP + FP)*(TP + FN)*(TN + FP)*(TN + FN))

where:
TP = True Positives
TN = True Negatives
FP = False Positives
FN = False Negatives'''
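
The formula can be written out directly from the four confusion-matrix counts. The sketch below is a from-scratch illustration (the function name `mcc_from_counts` is hypothetical); returning 0 when the denominator is zero is a common convention and is an assumption here. The second example shows why MCC is more informative than accuracy on imbalanced data.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal count is zero.
    return numerator / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# A small balanced-ish example.
print("MCC:", mcc_from_counts(tp=2, tn=2, fp=1, fn=2))

# Imbalanced data: 95 negatives, 5 positives, classifier always predicts negative.
print("accuracy:", accuracy(tp=0, tn=95, fp=0, fn=5))
print("MCC:     ", mcc_from_counts(tp=0, tn=95, fp=0, fn=5))
```

In the imbalanced case the accuracy looks high even though the classifier never finds a positive example, while the MCC stays at zero.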




In [1]:
'''Ans 5:- Matthews Correlation Coefficient (MCC) is a measure used
to evaluate the performance of binary classification models.
It considers true positives, true negatives, false positives,
and false negatives to provide a balanced assessment,
especially on imbalanced datasets. MCC ranges from -1 (perfect
inverse prediction) to 1 (perfect prediction) with 0 indicating
random prediction. It's valuable for assessing classification
model quality, especially in situations with uneven class
distributions.

In this code, we import the matthews_corrcoef function
from scikit-learn and calculate the MCC for a set of true and
predicted binary labels.'''

from sklearn.metrics import matthews_corrcoef
import numpy as np

# ground truth labels and predicted labels
true_labels = np.array([1, 0, 1, 1, 0, 0, 1])
predicted_labels = np.array([1, 0, 1, 0, 1, 0, 0])

# Calculate MCC
mcc = matthews_corrcoef(true_labels, predicted_labels)

print("Matthews Correlation Coefficient (MCC):", mcc)

Matthews Correlation Coefficient (MCC): 0.16666666666666666


In [None]:
'''Ans 6:- Semantic Role Labeling (SRL) is a natural language
processing task that involves identifying and classifying the
semantic roles of words or phrases within a sentence. It assigns
roles such as "agent," "patient," or "location" to words,
showing their relationships in a sentence. For example, in the
sentence "John ate the pizza with gusto," SRL would label "John" as
the agent, "ate" as the predicate, "the pizza" as the patient, and
"with gusto" as a manner modifier, revealing who did what to what,
and how.'''
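
SRL output is often represented as one BIO tag per token for each predicate. The sketch below uses a hand-written annotation of the example sentence, not the output of a real SRL system. The PropBank-style labels (ARG0 for the agent, ARG1 for the patient, ARGM-MNR for manner) and the helper `spans_from_bio` are assumptions made for illustration.

```python
# Hand-written annotation for illustration; a real SRL model would predict these tags.
tokens = ["John", "ate", "the", "pizza", "with", "gusto"]
bio_tags = ["B-ARG0", "B-V", "B-ARG1", "I-ARG1", "B-ARGM-MNR", "I-ARGM-MNR"]


def spans_from_bio(tokens, tags):
    """Group BIO-tagged tokens into (role, phrase) pairs."""
    spans = []
    open_role = None  # role of the span currently being extended, if any
    for token, tag in zip(tokens, tags):
        if tag == "O":
            open_role = None
            continue
        prefix, role = tag.split("-", 1)
        if prefix == "I" and role == open_role:
            spans[-1][1].append(token)      # continue the current span
        else:
            spans.append((role, [token]))   # start a new span
            open_role = role
    return [(role, " ".join(words)) for role, words in spans]


for role, phrase in spans_from_bio(tokens, bio_tags):
    print(f"{role:9s} {phrase}")
```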

In [None]:
'''Ans 7:- Fine-tuning a BERT (Bidirectional Encoder Representations
from Transformers) model is typically faster than the initial
pretraining phase for several reasons:-

1. Transfer Learning: Pretraining involves training BERT on a
massive corpus of text, which can take a long time. Fine-tuning,
on the other hand, starts with pretrained weights, leveraging
the knowledge learned during pretraining. This allows
fine-tuning to converge faster since the model already has a good
understanding of language.

2. Smaller Dataset: Fine-tuning is performed on a specific
task with a smaller dataset, making it computationally less
intensive compared to pretraining, which requires processing vast
amounts of text.

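3. Fewer Training Steps: Fine-tuning typically requires far fewer
training steps or epochs than pretraining. All of the model's
weights are usually updated during fine-tuning (freezing the lower
layers is optional), but they start close to a good solution, and
only the small task-specific output layer is trained from scratch.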

4. Domain-Specific Focus: Fine-tuning tailors the pretrained
model to a particular task or domain, allowing it to adapt
quickly to the specifics of that task.

In summary, fine-tuning is quicker because it builds upon
pretrained knowledge and focuses on task-specific adjustments rather
than the extensive training needed for pretraining.'''
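
A back-of-the-envelope comparison makes the gap in training steps concrete. All figures below are assumptions for illustration: the pretraining schedule is the one reported for the original BERT (1,000,000 steps at a batch size of 256 sequences), and the fine-tuning setup is a typical configuration on an RTE-sized training set (about 2,490 examples, batch size 32, 3 epochs).

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Number of parameter updates for a given dataset size and schedule."""
    return math.ceil(num_examples / batch_size) * epochs


# Assumed pretraining schedule (as reported for the original BERT).
PRETRAIN_STEPS = 1_000_000
PRETRAIN_BATCH = 256

# Assumed fine-tuning setup on an RTE-sized training set.
FINETUNE_EXAMPLES = 2_490
FINETUNE_BATCH = 32
FINETUNE_EPOCHS = 3

finetune_steps = optimizer_steps(FINETUNE_EXAMPLES, FINETUNE_BATCH, FINETUNE_EPOCHS)
pretrain_sequences = PRETRAIN_STEPS * PRETRAIN_BATCH
finetune_sequences = FINETUNE_EXAMPLES * FINETUNE_EPOCHS

print(f"pretraining steps:  {PRETRAIN_STEPS:,}")
print(f"fine-tuning steps:  {finetune_steps:,}")
print(f"sequences seen, pretraining: {pretrain_sequences:,}")
print(f"sequences seen, fine-tuning: {finetune_sequences:,}")
print(f"ratio: about {pretrain_sequences // finetune_sequences:,}x")
```

Under these assumptions the difference is several orders of magnitude, even before accounting for the larger batches and hardware used in pretraining.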

In [4]:
'''Ans 8:- Recognizing Textual Entailment (RTE) is a natural language
processing task where the goal is to determine if a given piece of
text (the "hypothesis") logically follows from another piece of
text (the "text" or "premise"). It's typically framed as a
binary classification problem where the model predicts whether
the premise entails the hypothesis or not.

In this example, we load a pre-trained BERT model with a sequence
classification head and apply it to a text/hypothesis pair. Note
that bert-base-uncased has not been fine-tuned on RTE, so the
classification head is randomly initialized and the prediction it
produces is not meaningful; the model must first be fine-tuned on
an entailment dataset.'''

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load a pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Input text and hypothesis
text = "The quick brown fox jumps over the lazy dog."
hypothesis = "A fox jumps over a dog."

# Tokenize and format input
inputs = tokenizer(text, hypothesis, return_tensors="pt")

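# Perform inference (evaluation mode, no gradients needed)
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Get the predicted class. The mapping assumed here (1 = entailment,
# 0 = not entailment) is only meaningful after fine-tuning on RTE data.
predicted_class = torch.argmax(logits, dim=1).item()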

# Interpret the result
if predicted_class == 1:
    print("The hypothesis logically follows from the text.")
else:
    print("The hypothesis does not logically follow from the text.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

The hypothesis does not logically follow from the text.


In [5]:
'''Ans 9:- The decoder stack in GPT (Generative Pre-trained
Transformer) models consists of multiple identical Transformer
decoder layers. Each layer contains a masked multi-head
self-attention sublayer followed by a feedforward neural network.
Unlike the decoder in the original encoder-decoder Transformer, GPT
has no encoder, so its layers have no cross-attention sublayer. The
causal mask prevents each position from attending to later
positions, so each token is generated from the previous tokens only,
in an autoregressive manner.'''

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a pre-trained GPT model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Generate text using the decoder stack
input_text = "Once upon a time, "
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text autoregressively
output = model.generate(input_ids, max_length=50, num_return_sequences=1)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated text:", generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text: Once upon a time,  I was a little bit of a fan of the original series, but I was also a little bit of a fan of the original series. I was a little bit of a fan of the original series, but I
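
The causal masking described in Ans 9 can be illustrated without any deep learning library. The sketch below is a single attention head in plain Python with no learned projections, so the inputs serve directly as queries, keys and values. It is a simplified illustration of the masking idea, not GPT-2's implementation, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only positions 0..i."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Score only the visible keys; later positions are masked out.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Masked positions get attention weight exactly 0.
        row = softmax(scores) + [0.0] * (len(keys) - i - 1)
        weights.append(row)
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, values)) for d in range(len(values[0]))]
        )
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first token attends only to itself, and no token places weight on a position to its right. Removing the mask would give the bidirectional attention used in BERT's encoder.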
