**Task 2:**

**Problem Statement:**
Develop a prefix language model using Hugging Face and PyTorch. You can pick any dataset for a creative text generation task and you should report the perplexity metric. Hint: A subtle data preprocessing trick is required when setting the inputs and labels for implementing prefix LM.

**Model Selected:** t5-large

**Dataset Selected:** CNN daily mail

Installation of Required Packages

In [1]:
!pip -q install git+https://github.com/huggingface/transformers.git
!pip install accelerate==0.27.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install trl==0.7.7
!pip install tqdm==4.66.1
!pip install flash-attn==2.4.2

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
Collecting accelerate==0.27.0
  Downloading accelerate-0.27.0-py3-none-any.whl (279 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m279.7/279.7 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: accelerate
Successfully installed accelerate-0.27.0
Collecting datasets==2.15.0
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m521.2/521.2 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.8,>=0.3.0 (from datasets==2.15.0)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m16.8 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess (from

Importing Libraries

In [2]:
import numpy as np
import pandas as pd
import os
from huggingface_hub import login,HfFolder

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments,BartTokenizer, BartForConditionalGeneration

Authentication and Configuration
> Huggingface and wandb integration

In [3]:
hf_token=HfFolder.get_token()
if hf_token:
    print(f"Logging into the Hugging Face Hub with token {hf_token[:10]}...")
    print(hf_token)
    login(token=hf_token)

os.environ["WANDB_API_KEY"] = "c757d83cb92d9326a361e27073fb3e8336376b83"
os.environ["WANDB_PROJECT"] = "Prefix language modelling"
os.environ["WANDB_NOTES"] = "Prefix language modelling using LORA"
os.environ["WANDB_NAME"] = "Prefix tuning"
os.environ["MODEL_NAME"] = "t5-large"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
!huggingface-cli login --token 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

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Accelerate Memory Estimation
> This command estimates the memory requirements for the specified model using the Accelerate library.

In [4]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `t5-large` from `transformers`...
config.json: 100% 1.21k/1.21k [00:00<00:00, 6.86MB/s]
┌────────────────────────────────────────────────────┐
│        Memory Usage for loading `t5-large`         │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│   125.5 MB  │ 2.75 GB  │      10.99 GB     │
│float16│   62.75 MB  │ 1.37 GB  │       5.5 GB      │
│  int8 │   31.38 MB  │ 703.5 MB │      2.75 GB      │
│  int4 │   15.69 MB  │351.75 MB │      1.37 GB      │
└───────┴─────────────┴──────────┴───────────────────┘


Model Quantization Configuration
> configures model quantization settings, including whether to load in 4-bit, the quantization type, and data types.

In [6]:
from transformers import BitsAndBytesConfig
from accelerate import Accelerator
import torch

load_in_4bit = True

if load_in_4bit:
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16  # Change to torch.float16 for fp16
    )
    # copy the model to each device
    device_map = "auto"
    torch_dtype = torch.float16  # Change to torch.float16 for fp16
else:
    device_map = None
    quantization_config = None
    torch_dtype = None

 Loading Dataset

In [7]:
from datasets import load_dataset
dataset = load_dataset('cnn_dailymail','3.0.0',split='train')

Downloading readme:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/257M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/257M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/259M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [8]:
dataset

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 287113
})

In [9]:
dataset= dataset.shuffle(seed=42).select([i for i in range(5000)])

In [10]:
dataset = dataset.train_test_split(test_size=0.1,seed=42)

In [11]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 4500
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 500
    })
})

In [12]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

device = "cuda"
model_name_or_path = "t5-large"
tokenizer_name_or_path = "t5-large"

text_column = "article"
label_column = "highlights"
max_length = 256
lr = 1e-5
num_epochs = 1
batch_size = 8

Tokenization and Preprocessing

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)


def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

In [14]:
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

Running tokenizer on dataset:   0%|          | 0/4500 [00:00<?, ? examples/s]

Running tokenizer on dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

In [15]:
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import default_data_collator

In [16]:
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["test"]

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)

Prefix Language Model Configuration
> Since we are doing Text summarization task, we will use **AutoModelForSeq2SeqLM**

In [17]:
from peft import get_peft_config, get_peft_model, get_peft_model_state_dict, PrefixTuningConfig, TaskType
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, default_data_collator, get_linear_schedule_with_warmup

In [18]:
peft_config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, num_virtual_tokens=20)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

model.safetensors:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

trainable params: 983,040 || all params: 738,651,136 || trainable%: 0.13308583065659835


In [19]:
from tqdm import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [20]:
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)

Trainer Initialization and Training

In [21]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_steps=len(train_dataloader),
    save_total_limit=5,
    num_train_epochs=1,
    learning_rate=lr,
    logging_dir="./logs",
    logging_steps=len(train_dataloader),
    evaluation_strategy="steps",
    eval_steps=len(train_dataloader),
    load_best_model_at_end=True,
    remove_unused_columns=False,
    push_to_hub=False,
)

# Create the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=default_data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()

# Print the results
print(results)

Step,Training Loss,Validation Loss


Step,Training Loss,Validation Loss
563,3.3089,2.885371


{'eval_loss': 2.8853707313537598, 'eval_runtime': 60.0247, 'eval_samples_per_second': 8.33, 'eval_steps_per_second': 1.05, 'epoch': 1.0}


Perplexity Calculation

In [22]:
import numpy as np
def perplexity(eval_output):
    return np.exp(eval_output)

In [23]:
perplexity(results['eval_loss'])

17.91020623987266

Model Upload to Hugging Face Model Hub

In [25]:
peft_model_id = "t5-large_PREFIX_TUNING_SEQ2SEQ"
# trainer.push_to_hub("t5-large_PREFIX_TUNING_SEQ2SEQ")

In [26]:
# tokenizer.push_to_hub('t5-large_PREFIX_TUNING_SEQ')

Loading PEFT Model for Text Generation

In [27]:
from peft import get_peft_config, get_peft_model, get_peft_model_state_dict, PrefixTuningConfig, TaskType,PeftConfig,PeftModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, default_data_collator, get_linear_schedule_with_warmup

peft_model_name="/content/output/checkpoint-563"

peft_config=PeftConfig.from_pretrained(peft_model_name)
base_model=AutoModelForSeq2SeqLM.from_pretrained(peft_config.base_model_name_or_path)

peft_model=PeftModel.from_pretrained(base_model, peft_model_name)

In [28]:
tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)

In [29]:
text = """
SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco area Friday at 4:42 a.m. PT (7:42 a.m. ET), the U.S. Geological Survey reported. The quake left about 2,000 customers without power, said David Eisenhower, a spokesman for Pacific Gas and Light. Under the USGS classification, a magnitude 4.2 earthquake is considered "light," which it says usually causes minimal damage. "We had quite a spike in calls, mostly calls of inquiry, none of any injury, none of any damage that was reported," said Capt. Al Casciato of the San Francisco police. "It was fairly mild." Watch police describe concerned calls immediately after the quake » . The quake was centered about two miles east-northeast of Oakland, at a depth of 3.6 miles, the USGS said. Oakland is just east of San Francisco, across San Francisco Bay. An Oakland police dispatcher told CNN the quake set off alarms at people's homes. The shaking lasted about 50 seconds, said CNN meteorologist Chad Myers. According to the USGS, magnitude 4.2 quakes are felt indoors and may break dishes and windows and overturn unstable objects. Pendulum clocks may stop. E-mail to a friend .
"""

In [30]:
inputs = tokenizer(text,return_tensors='pt')

In [31]:
device = "cuda"

In [32]:
peft_model.to(device)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = peft_model.generate(input_ids=inputs["input_ids"], max_new_tokens=30)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

['.. 2,000 customers without power in the San Francisco area, Pacific Gas and Light says about 2,000 customers without power, Pacific Gas']
