# NLP Workshop: End-to-End Training of a Transformer-based Model

## Summarization Task with Transformers

In [None]:
# Install required libraries
!pip install transformers datasets torch numpy

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

## Import necessary libraries

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict
import numpy as np

## Check GPU availability

In [None]:
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU found. Training will fall back to CPU.")

Using GPU: Tesla T4


## Data Loading

In [None]:
# Load a sample dataset (CNN/DailyMail for summarization)
data = load_dataset("cnn_dailymail", "3.0.0")

# Sample to make processing faster
data = DatasetDict({
    "train": data["train"].shuffle(seed=42).select(range(2000)),
    "validation": data["validation"].shuffle(seed=42).select(range(500)),
    "test": data["test"].shuffle(seed=42).select(range(500))
})

# Inspect the dataset
print(data)

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 500
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 500
    })
})


## Tokenization

In [None]:
# Tokenization
model_checkpoint = "t5-small"  # Using T5 for summarization
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def preprocess_function(examples):
    inputs = ["summarize: " + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding=True)

    # Tokenize targets; `text_target` replaces the deprecated `as_target_tokenizer` context manager
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True, padding=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply tokenization
tokenized_data = data.map(preprocess_function, batched=True)
tokenized_data = tokenized_data.remove_columns(["article", "highlights"])

# Inspect the tokenized data
tokenized_data["train"].features

{'id': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
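
The `labels` column above still contains padding token ids, which the loss would otherwise treat as real targets. A common convention in Hugging Face seq2seq training is to replace padding positions in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. A minimal sketch in plain Python, assuming a pad token id of 0 (T5's pad id; check `tokenizer.pad_token_id`) and a hypothetical helper name `mask_label_padding`:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_label_padding(label_ids, pad_token_id=0):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy label batch (made-up ids): two summaries padded to length 5 with pad id 0
padded_labels = [[37, 1782, 1, 0, 0], [71, 1712, 19, 207, 1]]
masked = mask_label_padding(padded_labels)
print(masked)
```

Inside `preprocess_function` this would be applied to `labels["input_ids"]` before assigning `model_inputs["labels"]`. Note that -100 has to be mapped back to the pad id before the labels are decoded with the tokenizer.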

In [None]:
data['train'][0]

{'article': "By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in Camborne, Cornwall . It is also believed there was no working carbon monoxide detect

In [None]:
tokenized_data['train'][0]

In [None]:
# Print a sample tokenized data from the training set
sample_index = 0  # You can change this index to inspect other samples
sample = tokenized_data["train"][sample_index]

print("Sample Tokenized Data:")
print(f"Input IDs: {sample['input_ids']}")
print(f"Labels: {sample['labels']}")

# Decode the input IDs and labels for better readability
decoded_input = tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
decoded_labels = tokenizer.decode(sample["labels"], skip_special_tokens=True)

print("\nDecoded Input (Article):")
print(decoded_input)

print("\nDecoded Labels (Expected Summary):")
print(decoded_labels)


Sample Tokenized Data:
Input IDs: [21603, 10, 938, 3, 5, 11016, 12528, 3, 5, 3, 10744, 8775, 20619, 2326, 10, 3, 5, 10668, 10, 4928, 3, 6038, 6, 204, 1332, 2038, 3, 5, 1820, 3, 5, 3, 6880, 4296, 11430, 10, 3, 5, 12046, 10, 4560, 3, 6038, 6, 204, 1332, 2038, 3, 5, 5245, 724, 13, 8, 337, 384, 113, 3977, 16, 3, 9, 14491, 22133, 45, 4146, 1911, 6778, 15, 14566, 53, 133, 43, 118, 25429, 3, 31, 4065, 77, 676, 31, 6, 16273, 7, 243, 469, 5, 37, 5678, 13, 4464, 1158, 1079, 11, 31423, 6176, 130, 3883, 5815, 70, 3062, 6, 7758, 60, 35, 6, 44, 8, 1156, 234, 79, 2471, 30, 4691, 1635, 109, 1210, 1061, 16, 5184, 12940, 6, 4653, 26334, 5, 37, 16, 10952, 7, 43, 230, 2946, 139, 8, 14319, 336, 1856, 6, 28, 16273, 7, 2145, 8, 386, 3977, 590, 28, 8, 384, 31, 7, 3947, 1782, 6, 13, 4146, 1911, 6778, 15, 14566, 53, 45, 3, 9, 21859, 5, 21902, 447, 10, 37, 16, 10952, 7, 43, 2946, 139, 8, 14319, 13, 386, 724, 13, 8, 337, 384, 113, 130, 435, 16, 70, 14491, 22133, 336, 1851, 5, 1079, 11, 31423, 6176, 33, 3, 22665, 
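
Each tokenized example also carries an `attention_mask` (see the features listed earlier) that marks real tokens with 1 and padding with 0, so the model can ignore padded positions. A plain-Python sketch of that bookkeeping, assuming right-padding with pad id 0; `pad_batch` is a hypothetical helper, not a library function:

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad token id lists to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths
batch = pad_batch([[21603, 10, 938, 1], [21603, 10, 1]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

This is why batched calls to `generate` should receive the attention mask along with the input ids whenever the batch contains padding.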

## Pre-Trained Model Loading

In [None]:
# Load a pre-trained model for sequence-to-sequence tasks
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
!pip uninstall -y wandb

Found existing installation: wandb 0.19.1
Uninstalling wandb-0.19.1:
  Successfully uninstalled wandb-0.19.1


## Define training arguments

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",   # evaluate at the end of every epoch
    save_strategy="epoch",   # must match eval_strategy when load_best_model_at_end=True
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,      # keep at most two checkpoints on disk
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    report_to="none"         # the string "none" disables W&B and other integrations; None does not
)

# Define a Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    tokenizer=tokenizer
)

# Train the model
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,1.3067,1.151154


KeyboardInterrupt: 
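
The length of the training run follows directly from the dataset size and the arguments above. A rough sketch of the arithmetic, assuming a single GPU and no gradient accumulation (the `Trainer` defaults); the variable names are illustrative:

```python
import math

num_examples = 2000   # size of the sampled training split
batch_size = 8        # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
logging_steps = 10

# The last partial batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, num_log_events)
```

Under these assumptions, evaluating and saving once per epoch means one evaluation and one checkpoint every `steps_per_epoch` optimizer steps.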

## Save the model

In [None]:
# Save the model
trainer.save_model("./final_model")
tokenizer.save_pretrained("./final_model")

# Load the model back
new_model = AutoModelForSeq2SeqLM.from_pretrained("./final_model")
new_tokenizer = AutoTokenizer.from_pretrained("./final_model")

# Verify it works
inputs = new_tokenizer(["summarize: Advances in AI have revolutionized many industries."], return_tensors="pt", padding=True, truncation=True).to(new_model.device)
outputs = new_model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)  # **inputs also passes the attention mask
print(new_tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
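# Use a published checkpoint from the Hugging Face Hub for inference (the training run above was interrupted)
tokenizer = AutoTokenizer.from_pretrained("d0p3/t5-small-dailycnn")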
model = AutoModelForSeq2SeqLM.from_pretrained("d0p3/t5-small-dailycnn")

## Real-Time Inference

In [None]:
# Test on new examples
texts = [
    "The latest advances in AI technology have made significant improvements in various fields, including healthcare and education. AI-powered tools are helping to diagnose diseases, personalize learning experiences, and much more.",
    "The recent sports event witnessed a thrilling final match, with Team A clinching the title after a dramatic finish. Fans celebrated the victory with great enthusiasm.",
    f"""Prime Minister Narendra Modi also condoled the loss of lives in the stampede. In a post on X, the Prime Minister’s Office said, “Pained by the stampede in Tirupati, Andhra Pradesh. My thoughts are with those who have lost their near and dear ones. I pray that the injured recover soon. The AP Government is providing all possible assistance to those affected.” At least six devotees died and dozens were injured in the stampede on Wednesday night as hundreds of them jostled for tickets for Vaikunta Dwara Darshanam at Lord Venkateswara Swamy temple on Tirumala Hills."""
]

# Tokenize inputs
inputs = tokenizer(["summarize: " + text for text in texts], return_tensors="pt", padding=True, truncation=True).to(model.device)

# Generate predictions
outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)  # pass the attention mask so padding is ignored
predictions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

# Print predictions
for i, summary in enumerate(predictions):
    print(f"Input {i + 1}: {texts[i]}")
    print(f"Generated Summary: {summary}\n")

Input 1: The latest advances in AI technology have made significant improvements in various fields, including healthcare and education. AI-powered tools are helping to diagnose diseases, personalize learning experiences, and much more.
Generated Summary: AI-powered tools are helping to diagnose diseases, personalize learning experiences, and much more.

Input 2: The recent sports event witnessed a thrilling final match, with Team A clinching the title after a dramatic finish. Fans celebrated the victory with great enthusiasm.
Generated Summary: Team A clinch the title after a dramatic finish. Fans celebrated the victory with great enthusiasm.

Input 3: Prime Minister Narendra Modi also condoled the loss of lives in the stampede. In a post on X, the Prime Minister’s Office said, “Pained by the stampede in Tirupati, Andhra Pradesh. My thoughts are with those who have lost their near and dear ones. I pray that the injured recover soon. The AP Government is providing all possible assistanc
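
The summaries above are only inspected by eye. One common way to compare a generated summary against a reference is unigram overlap (ROUGE-1). The following is a simplified sketch with whitespace tokenization and no stemming, so its scores will not match a full ROUGE implementation; `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up sentences for illustration
score = rouge1_f1("team a wins the title", "team a clinched the title")
print(round(score, 3))
```

Applied to this notebook, the reference would be the `highlights` field of a test example and the prediction the decoded output of `model.generate`.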

## Question-Answering Task with Transformers

In [None]:
import numpy as np

In [None]:
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU found. Training will fall back to CPU.")


Using GPU: Tesla T4


In [None]:
# Load a sample dataset (SQuAD for question answering)
data = load_dataset("squad")

# Sample to make processing faster (optional for demonstration purposes)
data = DatasetDict({
    "train": data["train"].shuffle(seed=42).select(range(2000)),
    "validation": data["validation"].shuffle(seed=42).select(range(500))
})

# Inspect the dataset
print(data)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 500
    })
})


In [None]:
# Print the first 5 rows of the dataset for inspection
print("First 5 rows of training data:")
for i in range(5):
    print(data["train"][i])

First 5 rows of training data:
{'id': '573173d8497a881900248f0c', 'title': 'Egypt', 'context': 'The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery.', 'question': 'What percentage of Egyptians polled support death penalty for those leaving Islam?', 'answers': {'text': ['84%'], 'answer_start': [468]}}
{'id': '57277e815951b619008f8b52', 'title': 'Ann_Arbor,_Michigan', 'context':

In [None]:
# Tokenization
model_checkpoint = "t5-small"  # Using T5-small for question answering
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def preprocess_function(examples):
    inputs = []
    targets = []
    for i in range(len(examples["context"])):
        question = examples["question"][i]
        context = examples["context"][i]
        answer = examples["answers"][i]["text"][0]  # Use the first reference answer (validation examples may list several)

        # Create input sequence
        input_seq = f"question: {question} context: {context}"
        inputs.append(input_seq)

        # Create target sequence
        targets.append(answer)

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding=True)
    # `text_target` replaces the deprecated `as_target_tokenizer` context manager
    labels = tokenizer(text_target=targets, max_length=128, truncation=True, padding=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply tokenization
tokenized_data = data.map(preprocess_function, batched=True)

# Inspect the tokenized data
tokenized_data["train"].features
tokenized_data["validation"].features

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
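
The preprocessing above turns each SQuAD record into a text-to-text pair. A small standalone sketch of that mapping on one hand-written record in the SQuAD layout; the record and the helper name `to_text_pair` are illustrative and not taken from the dataset:

```python
def to_text_pair(example):
    """Map one SQuAD-style record to (input_text, target_text)."""
    input_text = f"question: {example['question']} context: {example['context']}"
    target_text = example["answers"]["text"][0]  # first reference answer
    return input_text, target_text

record = {
    "question": "What is the capital of France?",
    "context": "The capital of France is Paris.",
    "answers": {"text": ["Paris"], "answer_start": [25]},
}
source, target = to_text_pair(record)
print(source)
print(target)
```

The `answer_start` offset is not needed in this generative setup, since the model emits the answer text directly instead of predicting a span.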

In [None]:
# Load a pre-trained model for sequence-to-sequence tasks
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move the model to the selected device (Trainer also does this automatically)
model.to(device)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",   # `evaluation_strategy` is the deprecated name of this argument
    save_strategy="epoch",   # must match eval_strategy when load_best_model_at_end=True
    learning_rate=5e-5,
    per_device_train_batch_size=8,  # Lower this if the GPU runs out of memory
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    report_to="none"         # the string "none" disables logging integrations; None does not
)

# Define a Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    tokenizer=tokenizer
)

# Train the model
trainer.train()


Using device: cuda


Epoch,Training Loss,Validation Loss


In [None]:
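# Load a published T5-small QA checkpoint from the Hugging Face Hub; the tokenizer loaded earlier is reused
model = AutoModelForSeq2SeqLM.from_pretrained("fgaim/t5-small-squad-v2")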


In [None]:
# Test on new examples
questions = [
    "What is the capital of France?",
    "Who wrote 'To Kill a Mockingbird'?",
    f"""What is the primary reason given in the article for the recent surge in popularity of "retro" fashion trends?"""
]

contexts = [
    "The capital of France is Paris. France is a country located in Europe.",
    "Harper Lee wrote 'To Kill a Mockingbird'. The book was published in 1960.",
    "An article discussing the cyclical nature of fashion trends, highlighting how vintage styles are experiencing a resurgence among younger generations, with possible reasons including nostalgia, celebrity influence, and a desire for unique aesthetics."
]

# Tokenize inputs
inputs = tokenizer([f"question: {question} context: {context}" for question, context in zip(questions, contexts)], return_tensors="pt", padding=True, truncation=True).to(model.device)

# Generate predictions
outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)  # pass the attention mask so padding is ignored
predictions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

# Print predictions
for i, (question, context, answer) in enumerate(zip(questions, contexts, predictions)):
    print(f"Question {i + 1}: {question}")
    print(f"Context: {context}")
    print(f"Generated Answer: {answer}\n")

Question 1: What is the capital of France?
Context: The capital of France is Paris. France is a country located in Europe.
Generated Answer: Paris

Question 2: Who wrote 'To Kill a Mockingbird'?
Context: Harper Lee wrote 'To Kill a Mockingbird'. The book was published in 1960.
Generated Answer: Harper Lee

Question 3: What is the primary reason given in the article for the recent surge in popularity of "retro" fashion trends?
Context: An article discussing the cyclical nature of fashion trends, highlighting how vintage styles are experiencing a resurgence among younger generations, with possible reasons including nostalgia, celebrity influence, and a desire for unique aesthetics.
Generated Answer: nostalgia
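
To go beyond eyeballing the answers, SQuAD-style evaluation typically reports exact match and token-level F1 after light normalization. A simplified sketch of both metrics follows; the normalization here (lowercasing, stripping punctuation and English articles) approximates the usual recipe, and the function names are illustrative:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))

def token_f1(prediction, truth):
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Harper Lee", "harper lee."))
print(token_f1("Harper Lee", "Nelle Harper Lee"))
```

When a validation example lists several reference answers, the usual practice is to score the prediction against each one and keep the maximum.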

