# Project 4: Implementing Transformer Models for NLP Applications

## Introduction

In this project, you will design, implement, and evaluate a Transformer-based neural network (such as BERT or GPT) for a real-world NLP task. Transformers have dramatically improved the performance of NLP tasks by using self-attention mechanisms. You'll gain practical experience fine-tuning pre-trained Transformer models, managing data preprocessing, and clearly evaluating model performance.

You will select one of the provided datasets and corresponding tasks, creating the entire pipeline from dataset preparation through model evaluation.


## Objectives

1. Set up an environment using PyTorch or TensorFlow, Hugging Face's Transformers library, and GPU acceleration.
2. Implement preprocessing pipelines suitable for Transformer models.
3. Choose and load one of the provided datasets.
4. Fine-tune a pre-trained Transformer model for your selected NLP task.
5. Evaluate your model clearly, showing relevant performance metrics and visualizations.
6. Provide detailed answers to the provided assessment questions.

## Dataset

You are free to choose any dataset for this project! Kaggle would be a good source to look for datasets. Below are some examples:
- IMDB Movie Reviews (Sentiment Analysis)
- Yelp Reviews (Multi-class Sentiment Classification)
- Amazon Product Reviews (Sentiment Classification)
- Tweet Emotion Recognition
- News Category Classification
- Question Answering (QA)

## Data Preprocessing Requirements
For all projects, perform the following preprocessing steps:
- Tokenize texts using Hugging Face Tokenizer.
- Perform padding and truncation to create uniform-length inputs.
- Split data into appropriate training, validation, and test sets as applicable.
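
The split requirement can be sketched without any library support. The helper below is a minimal illustration, not part of the assignment: `split_indices` and its default fractions are hypothetical names and values chosen for this example. It shuffles example indices with a fixed seed and cuts them into three disjoint groups.

```python
import random

def split_indices(n_examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle indices 0..n_examples-1 and cut them into train/val/test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(n_examples * test_fraction)
    n_val = int(n_examples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

# Hypothetical dataset size, for illustration only.
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

When the data is already a Hugging Face `Dataset`, its `train_test_split` method is likely the more convenient way to carve out a validation set.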

In [None]:
#pip install datasets


In [None]:
import torch
import torch.nn as nn
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_scheduler
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm.auto import tqdm

# Check for GPU availability
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

Using device: cuda


In [None]:
# Load the dataset
dataset = load_dataset("imdb")

# Basic EDA
print(dataset)
print(dataset['train'][0])

# Explore class distribution
import pandas as pd
train_df = pd.DataFrame(dataset['train'])
print(train_df['label'].value_counts())

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and

In [None]:
# 1. Instantiate Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # Choose your desired model

# 2. Tokenize the texts
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 3. Padding (applied per batch by the data collator; truncation already happened in the tokenizer call above)

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 4. Format the columns and select training and evaluation subsets (no separate validation split is created here)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets = tokenized_datasets.with_format("torch")

train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10000)) #Small set to speed up training
eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    train_dataset, shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=8, collate_fn=data_collator
)

#Example
for batch in train_dataloader:
    print(batch['input_ids'].shape) #Print the shape of the input ids to verify
    break

torch.Size([8, 512])
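
The shape above has 512 columns because at least one review in that batch reached the truncation limit, and the collator pads every sequence in a batch to the longest one. The sketch below illustrates that per-batch padding idea in plain Python. `pad_batch` is a hypothetical helper written for this example, and the token ids are made up for illustration; it is not how `DataCollatorWithPadding` is implemented internally.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Illustrative token ids only (assumed values, not taken from the run above).
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Padding per batch rather than to a global maximum keeps short batches small, which saves computation when most texts are well under the length limit.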


In [None]:
# 1. Load Pre-trained Transformer Model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2) #Example, change to your task's labels.
model.to(device) #Move to GPU if available

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
# 1. Define Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# 2. Define Loss Function (already included in the model for sequence classification)

# 3. Training Loop
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)


progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/3750 [00:00<?, ?it/s]
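
To make the learning-rate behavior of the loop above concrete, here is a small sketch of what a linear schedule with optional warmup computes at each step. `linear_schedule_lr` is a hypothetical stand-alone function written for this example; it is meant to mirror the shape of the `"linear"` schedule, not to reproduce the library's code.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

# Same settings as the training cell: lr=5e-5, 3750 steps, no warmup.
for step in (0, 1875, 3750):
    print(step, linear_schedule_lr(step, 5e-5, 3750))
```

With no warmup, training starts at the full learning rate and decays steadily, so the last updates are very small.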

In [None]:
from sklearn.metrics import accuracy_score

model.eval()
predictions = []
references = []
for batch in tqdm(eval_dataloader):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predicted_labels = torch.argmax(logits, dim=-1)
    predictions.extend(predicted_labels.cpu().numpy())
    references.extend(batch["labels"].cpu().numpy())

accuracy = accuracy_score(references, predictions)
print(f"Accuracy on test set: {accuracy}")


#Print the classification report
from sklearn.metrics import classification_report
print(classification_report(references, predictions))

  0%|          | 0/125 [00:00<?, ?it/s]

Accuracy on test set: 0.929
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       512
           1       0.93      0.92      0.93       488

    accuracy                           0.93      1000
   macro avg       0.93      0.93      0.93      1000
weighted avg       0.93      0.93      0.93      1000



In [None]:
def predict_sentiment(text):
  model.eval()
  inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device)
  with torch.no_grad():
    outputs = model(**inputs)
  predictions = torch.softmax(outputs.logits, dim=-1).cpu().numpy()[0]
  positive_prob = predictions[1]
  return positive_prob

# Example usage:
text = "This movie was terrible!"
positive_probability = predict_sentiment(text)
print(f"Probability of being positive: {positive_probability}")

text2 = "This movie was very adventurous!"
positive_probability = predict_sentiment(text2)
print(f"Probability of being positive: {positive_probability}")

Probability of being positive: 0.0010856924345716834
Probability of being positive: 0.9984740614891052
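
The probabilities above come from applying softmax to the model's two output logits. As a rough illustration, the sketch below implements softmax in plain Python; the logit values are hypothetical and were not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for [negative, positive].
probs = softmax([-3.4, 3.4])
print(f"Probability of being positive: {probs[1]}")
```

A large gap between the two logits pushes the probabilities toward 0 and 1, which is why confident predictions look so extreme.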


---
### Questions
Answer the following questions in detail.

1. What is a Transformer model, and how does it differ fundamentally from recurrent neural networks (RNNs)?
2. Explain the concept of self-attention and why it significantly improves NLP tasks.
3. Discuss why fine-tuning pretrained Transformer models is effective compared to training from scratch.
4. What is the difference between BERT and GPT models? Describe their key design differences and typical use cases.
5. Explain positional encoding in Transformer models and its importance.
6. What are some common evaluation metrics used for your chosen NLP task, and why?
7. Discuss potential drawbacks or limitations of Transformer models (e.g., computational complexity, data requirements).
8. How does tokenization impact Transformer model performance, and what considerations are important during data preprocessing?
9. Provide at least two examples of real-world applications using Transformers outside of your chosen task.
10. What future directions or improvements could further enhance Transformer models?


# Answer-1:

A Transformer uses self-attention to look at every element of a sequence (for example, every word in a sentence) at the same time. Recurrent neural networks (RNNs) differ fundamentally because they process a sequence one element at a time.

The main points are:

## Transformers:

1. Parallel Processing:
The whole sequence is processed at once rather than step by step, which lets Transformers make efficient use of modern parallel hardware such as GPUs.

2. Self-Attention:
Self-attention lets the model weigh which words or parts of the sequence matter most for understanding each position. Because it connects every element to every other element directly, it makes relationships between distant words easier to detect.

3. Scalability:
The architecture scales well with more data and larger models, which is why it underlies powerful models such as BERT and GPT.

## RNNs:

1. Sequential Processing:
RNNs read data step by step. Because each step depends on the previous one, computation cannot be parallelized across the sequence, which slows processing and makes relationships between distant words harder to capture.

2. Memory Limitations:
RNNs tend to forget information from distant parts of a sequence. Variants such as LSTM and GRU were introduced to reduce this problem, but the vanishing gradient problem can still cause early information to fade as sequences grow longer.

In short, self-attention lets Transformers process a sequence as a whole, which gives them greater speed and better performance on long sentences or documents. RNNs process data sequentially, so they are slower and are most effective on shorter sequences.
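
The sequential dependency and the fading of early information can be seen in a toy scalar RNN. The sketch below is an illustration under simplifying assumptions: a single hidden unit, a `tanh` activation, and hypothetical weights `w_h` and `w_x`. Real RNNs use weight matrices, but the chain-rule argument is the same.

```python
import math

def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    """Run a scalar tanh RNN; each state needs the previous one, so steps run in order."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def grad_last_wrt_first(states, w_h=0.5):
    """Derivative of the last state with respect to the first, via the chain rule."""
    grad = 1.0
    for h in states[1:]:
        # d h_t / d h_{t-1} = w_h * (1 - h_t^2) for a tanh unit.
        grad *= w_h * (1.0 - h * h)
    return grad

short_grad = grad_last_wrt_first(rnn_forward([1.0] * 5))
long_grad = grad_last_wrt_first(rnn_forward([1.0] * 20))
print(short_grad, long_grad)
```

Each extra step multiplies the gradient by a factor smaller than one here, so the influence of the first input shrinks rapidly with sequence length.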

# Answer-2:

Self-attention lets a model relate every word in a sentence to every other word at the same time. The breakdown below explains how the mechanism works and why it is valuable for NLP.

## How Self-Attention Works
1. Contextual Connections:
The model computes a “query”, a “key”, and a “value” vector for every token in the sequence. Attention scores come from comparing each token's query with the keys of all tokens, so the model determines which tokens matter most when processing a given token.

2. Weighting Information:
The attention scores are normalized into weights and used to take a weighted sum of the value vectors. Each token is then represented by a new vector that includes relevant information gathered from the whole sequence.

3. Parallel Processing:
Self-attention analyzes all tokens at once rather than step by step, so it can relate words regardless of the distance between them.
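
The three steps above can be sketched in plain Python. This is a minimal single-head illustration with hand-written toy vectors. It assumes the scaled dot-product form from the original Transformer (scores divided by the square root of the key dimension), a detail the answer does not spell out, and it omits the learned projection matrices that produce the queries, keys, and values.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """For each query, return the attention-weighted sum of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compare this query with every key (scaled dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy vectors, chosen for illustration only.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(self_attention(q, k, v))
```

Every query is compared with every key, so each output row can draw on any position in the sequence regardless of distance.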

## Why It Improves NLP Tasks
1. Capturing Long-Range Dependencies:
Self-attention links distant words directly, which helps the model build context. For example, it makes it easier to connect a subject with its verb when many other words separate them in a long sentence.

2. Efficient Computation:
Because the whole sequence is processed simultaneously, self-attention computations parallelize well. Training in particular is much faster than with methods that must step through the sequence in order.

3. Better Context Understanding:
By focusing on the most relevant parts of a sentence, the model handles subtle differences in meaning and ambiguous expressions better. This contributes to improved results on tasks such as translation, summarization, and sentiment analysis.

# Answer-3:

Fine-tuning a pre-trained Transformer is effective because the model reuses knowledge it gained from large, diverse corpora. The main reasons it outperforms training from scratch are:

1. Leveraging Pre-Learned Representations
* Rich Language Understanding:
Pre-trained models such as BERT and GPT have already learned grammar, word meaning, and a good deal of factual knowledge from their training data. Fine-tuning adjusts these rich representations to fit a specific task.

* Transfer Learning:
Fine-tuning is a form of transfer learning: general language knowledge is carried over to a new task or domain, so the model does not need to relearn the basics of language.

2. Reduced Data and Computation Requirements
* Data Efficiency:
Training from scratch requires very large amounts of labeled data. A fine-tuned model can reach high performance with a much smaller labeled dataset because it has already learned from large general corpora. In this notebook, for example, three epochs on 10,000 IMDB reviews gave 0.929 accuracy on a 1,000-review sample of the test split.

* Faster Convergence:
Because it starts from pre-trained weights, the model converges faster and uses less compute than a model trained from scratch.

3. Improved Performance and Flexibility
* State-of-the-Art Results:
Pre-trained Transformers perform strongly across many NLP tasks. Fine-tuning on a particular task often yields state-of-the-art results because the contextual representations learned in pre-training apply directly to the problem.

* Task Adaptability:
Fine-tuning specializes general language representations for a specific task, such as sentiment analysis, translation, or question answering.

# Answer-4:

BERT and GPT are both Transformer-based, but design differences make them suited to different text applications.

## Architectural Differences
### BERT (Bidirectional Encoder Representations from Transformers):

* Encoder-Based:
BERT uses the encoder portion of the Transformer, so each token attends to context on both its left and its right. This bidirectional view makes BERT strong at interpreting words within their full sentence context.

* Pretraining Objectives:
BERT is pre-trained with two objectives, masked language modeling and next sentence prediction, to learn contextual relationships.

### GPT (Generative Pre-trained Transformer):

* Decoder-Based:
GPT uses the decoder portion of the Transformer and works as an autoregressive model that generates the next token in a sequence. By design, it processes text left to right, so each token sees only the tokens before it.

* Pretraining Objective:
GPT is trained to predict the next token, which makes it a natural fit for text generation and continuation.
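
The bidirectional versus left-to-right distinction comes down to which positions each token is allowed to attend to. The sketch below builds both visibility patterns for a three-token sequence; `attention_mask` is a hypothetical helper written for this illustration, not a library function.

```python
def attention_mask(seq_len, causal):
    """Return a matrix where mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bidirectional_mask = attention_mask(3, causal=False)  # encoder style (BERT)
causal_mask = attention_mask(3, causal=True)          # decoder style (GPT)

for row in bidirectional_mask:
    print(row)
print()
for row in causal_mask:
    print(row)
```

The full matrix lets every token use context on both sides, while the lower-triangular matrix hides future tokens, which is what makes next-token prediction a meaningful training task.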

## Typical Use Cases
1. BERT:

* Understanding and Analysis Tasks:
BERT is commonly used for sentiment analysis, question answering, named entity recognition, and text classification because it provides strong contextual understanding.

* Fine-Tuning for Specific Tasks:
Because it draws on context from both directions, BERT is often fine-tuned on specialized datasets to adapt its general language understanding to a particular task.

2. GPT:

* Text Generation:
GPT is well suited to text generation in applications such as creative writing, summarization, and conversational agents.

* Autoregressive Applications:
Because it produces coherent text that stays consistent with the preceding context, GPT is effective for text completion and continuation tasks.

# Answer-5:

## Positional Encoding in Transformer Models and Its Importance

1. What It Is:
Positional encoding gives the model information about where each token sits in the sequence. Transformers need it because self-attention processes all tokens in parallel and has no built-in notion of order.

2. How It Works:
A position-specific vector is added to each input token embedding. In the original Transformer, these vectors are built from sine and cosine functions of different frequencies, which gives a smooth signal that encodes token position. Other models, including BERT, learn their position embeddings during training instead.

3. Importance:
Without positional encoding, the model cannot tell the order of its input tokens and effectively treats them as a “bag of words”. Word order is essential for interpreting syntax and context, especially in tasks where meaning changes with word placement.
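
The sine and cosine scheme can be sketched as follows. This assumes the formulation from the original Transformer, where even dimensions use sine and odd dimensions use cosine with wavelengths controlled by a base of 10000. The BERT model fine-tuned in this notebook uses a learned `position_embeddings` table instead, as its printed architecture shows.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """Build a num_positions x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Small sizes chosen for readability.
pe = sinusoidal_positional_encoding(num_positions=4, d_model=4)
for row in pe:
    print([round(x, 3) for x in row])
```

Each position gets a distinct vector, and these vectors are added to the token embeddings before the first attention layer.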

# Answer-6:

## Common Evaluation Metrics Used for Your Chosen NLP Task and Why

1. For a Task like Text Classification:

* Accuracy:
The fraction of all predictions that are correct. It gives a general picture of performance but can be misleading when classes are imbalanced.

* Precision, Recall, and F1 Score:

* Precision is the fraction of predicted positives that are actually positive.

* Recall is the fraction of actual positives that the model correctly identifies.

* F1 score is the harmonic mean of precision and recall, which is useful when classes are imbalanced.

2. Area Under the ROC Curve (AUC):
AUC measures how well the model separates the classes across all decision thresholds.

## Why These Metrics:
Together, these metrics give a fuller picture of performance. Accuracy provides a basic overview, while precision, recall, and the F1 score show the balance between false positives and false negatives, which matters in practical systems. The evaluation sample used here is nearly balanced (512 negative and 488 positive reviews), so accuracy is a reasonable headline number for this task.
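
These definitions can be checked by hand on a small example. The sketch below computes accuracy, precision, recall, and F1 for a binary task from two label lists; the function name and the toy labels are made up for illustration, and in practice `sklearn.metrics` provides the same quantities.

```python
def binary_classification_metrics(references, predictions, positive_label=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == positive_label and p == positive_label)
    fp = sum(1 for r, p in pairs if r != positive_label and p == positive_label)
    fn = sum(1 for r, p in pairs if r == positive_label and p != positive_label)
    accuracy = sum(1 for r, p in pairs if r == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 1 = positive review, 0 = negative review.
toy_references = [1, 1, 0, 0, 1]
toy_predictions = [1, 0, 0, 1, 1]
metrics = binary_classification_metrics(toy_references, toy_predictions)
print(metrics)
```

Here one false positive and one false negative lower precision and recall equally, which accuracy alone would not reveal.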

# Answer-7:

## Potential Drawbacks or Limitations of Transformer Models

1. Computational Complexity:
Self-attention is expensive for long sequences because its time and memory cost grows quadratically with the input sequence length.

2. High Data and Resource Requirements:
Training Transformers effectively generally requires large amounts of data and expensive hardware such as GPUs or TPUs, which can put them out of reach for smaller organizations with limited data.

3. Memory Consumption:
Large parameter counts and the attention computation make Transformers memory-hungry, which complicates deployment on resource-constrained platforms.

4. Difficulty in Capturing Local Dependencies:
Self-attention excels at relating distant positions but has no built-in bias toward nearby tokens, so in some settings it captures local patterns less naturally than convolutional or recurrent layers.
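
A quick count shows how the cost of attention grows with sequence length. The sketch below counts attention-score entries for a single input, assuming one `seq_len` by `seq_len` score matrix per head per layer. The defaults of 12 heads and 12 layers are assumptions matching a BERT-base-sized model, and the function name is hypothetical.

```python
def attention_score_entries(seq_len, num_heads=12, num_layers=12):
    """Count attention-score entries for one sequence across all heads and layers."""
    return seq_len * seq_len * num_heads * num_layers

for length in (128, 256, 512):
    print(length, attention_score_entries(length))
```

Doubling the sequence length quadruples the number of scores, which is why long inputs are usually truncated or handled with more efficient attention variants.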

# Answer-8:

## Impact of Tokenization on Performance:

Tokenization splits text into tokens (words, subwords, or characters) and so determines what the model actually sees as input. Poor tokenization can degrade performance:
1. Subword tokenization such as BPE is especially important for agglutinative languages like Turkish and Finnish, where a single word can combine many morphemes.
2. For clinical text, the tokenizer should handle domain-specific vocabulary such as "myocardial infarction" well; otherwise such terms may be split into many small pieces.
3. Lowercasing can hurt tasks that depend on the difference between uppercase and lowercase letters (for example, distinguishing the company "Apple" from the fruit "apple").

## Key Preprocessing Considerations:

1. Tokenizer Selection:
Byte Pair Encoding (BPE): efficient for multilingual text and morphologically complex languages (used, for example, by GPT-2 and RoBERTa).
WordPiece: works well for English (it is used by BERT) but can struggle with languages that do not segment neatly into words.
2. Handling Special Characters:
Normalize punctuation, emojis, and URLs, for example by replacing links with a "URL" marker in social media content.
3. Domain Adaptation:
Extend the tokenizer vocabulary with essential domain terminology (for example, medical terms for healthcare NLP).
4. Case Normalization:
Decide whether to lowercase the text or preserve case, which is often needed for named entity recognition.
These normalization choices also matter for real-world tasks beyond NLU/NLG.
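
To show how a subword tokenizer breaks an unfamiliar word into known pieces, here is a toy greedy longest-match-first tokenizer in the style of WordPiece. The vocabulary is made up for this example, and the function is a simplified sketch rather than the tokenizer BERT actually ships with.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, for illustration only.
toy_vocab = {"un", "afford", "##afford", "##able"}
print(wordpiece_tokenize("unaffordable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A vocabulary that lacks domain terms forces such words into many small pieces or into the unknown token, which is one reason domain adaptation of the tokenizer can help.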


# Answer-9:

## Real-World Applications Outside the Chosen Task:

1. Protein Structure Prediction:
AlphaFold (DeepMind) uses attention mechanisms to predict protein structures from amino acid sequences, which has had a major impact on biology.
2. Computer Vision:
Vision Transformers (ViT) split an image into patches, treat each patch as a token, and achieve strong results on image classification.

# Answer-10:

## Future Directions or Improvements to Enhance Transformer Models

1. Efficient Attention Mechanisms:
Researchers are developing attention variants that reduce the quadratic cost, including sparse attention, linearized attention, and memory-compressed attention.

2. Multimodal Integration:
Using a single Transformer to process text together with images and other modalities could produce models that handle richer, more complex information.

3. Pretraining Innovations:
New pretraining approaches may improve contextual understanding and generalization, making Transformers more robust across tasks.

4. Domain-Specific Adaptations:
Adapting Transformers to particular domains, such as legal or scientific documents, by including domain-specific text during training can improve performance in those domains.

5. Improved Interpretability:
Tools that analyze how Transformer models reach their decisions would make these systems easier to interpret and trust.








---
### Submission
Submit a link to your completed Jupyter Notebook (e.g., on GitHub (private) or Google Colab) with all the cells executed, and answers to the assessment questions included at the end of the notebook.