# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

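* PEFT technique: LoRA, for its efficiency and small number of trainable parameters
* Model: GPT-2 (`gpt2`), because it is small and manageable
* Evaluation approach: accuracy and F1-score on a held-out test split
* Fine-tuning dataset: `sms_spam` (SMS messages labeled as spam or not spam)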

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
pip install scikit-learn


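Model: GPT-2 (the base `gpt2` checkpoint), because it is small enough to fine-tune and evaluate on modest hardware.

Dataset: the `sms_spam` dataset of SMS messages labeled as spam or not spam. The task is relevant in practice, and predictions are easy to verify by reading the message.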

Evaluation Approach: Using accuracy and F1-score to assess performance.
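
Spam is the minority class in this kind of data, so accuracy on its own can look good even when no spam is caught. As a minimal sketch of why both metrics are reported, the hypothetical helpers `accuracy` and `binary_f1` below are written in plain Python and applied to made-up labels (not taken from the dataset):

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def binary_f1(labels, preds, positive=1):
    """F1 for the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 8 not_spam (0) and 2 spam (1) messages.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_ham = [0] * 10                              # never predicts spam
one_spam_caught = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # catches one of the two spam messages

print(accuracy(labels, always_ham), binary_f1(labels, always_ham))
print(accuracy(labels, one_spam_caught), binary_f1(labels, one_spam_caught))
```

The classifier that never predicts spam is still right on most messages, yet its F1 on the spam class is zero, which is the failure mode F1 is meant to expose.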

In [2]:
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, Trainer, TrainingArguments
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained(model_name, num_labels=2,
    id2label={0: "not_spam", 1: "spam"}, 
    label2id={"not_spam": 0, "spam": 1})
model.config.pad_token_id = tokenizer.pad_token_id


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
from datasets import load_dataset
dataset = load_dataset("sms_spam")

if 'test' not in dataset:
    # A fixed seed keeps the train/test split reproducible across runs
    dataset = dataset['train'].train_test_split(test_size=0.2, seed=42)
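
`train_test_split` shuffles before it splits, so an unseeded call gives a different test set on every run, and this notebook performs the split more than once. The sketch below only illustrates the idea in plain Python and is not how the `datasets` library implements it. The helpers `split_sizes` and `seeded_split` are hypothetical, and the seed value 42 is arbitrary.

```python
import math
import random


def split_sizes(n_examples, test_size):
    """Sizes of a train/test split when test_size is a fraction of the data."""
    n_test = math.ceil(n_examples * test_size)
    return n_examples - n_test, n_test


def seeded_split(indices, test_size, seed):
    """Shuffle with a private, seeded RNG and cut once into (train, test)."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_sizes(len(shuffled), test_size)
    return shuffled[:n_train], shuffled[n_train:]


# 5574 = 4459 + 1115, the split sizes reported by the tokenization step.
n_train, n_test = split_sizes(5574, 0.2)
train_a, test_a = seeded_split(range(5574), 0.2, seed=42)
train_b, test_b = seeded_split(range(5574), 0.2, seed=42)
print(n_train, n_test, test_a == test_b)
```

With the same seed, both calls yield the same test set, so a baseline and a fine-tuned model can be scored on identical examples.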

In [4]:
def preprocess_function(examples):
    return tokenizer(examples['sms'], truncation=True, padding='max_length', max_length=128)

tokenized_datasets = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/4459 [00:00<?, ? examples/s]

Map:   0%|          | 0/1115 [00:00<?, ? examples/s]
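
To make the effect of `truncation=True, padding='max_length'` concrete, here is a small pure-Python sketch. The helper `pad_or_truncate` is hypothetical and the token ids are invented. The only real value is 50256, GPT-2's end-of-text token id, which the cells above reuse as the padding token.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill up to max_length on the right
    mask += [0] * n_pad                  # 0 marks a padding position
    return ids, mask


PAD_ID = 50256  # GPT-2 end-of-text id, reused as the pad token
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(range(100, 110), max_length=5, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Setting `model.config.pad_token_id` matters for the same reason: GPT-2's sequence-classification head reads the last non-padding token of each row, so it needs to know which id counts as padding.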

In [5]:
import torch
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(p):
    """Compute accuracy and binary F1 (positive class = spam) from Trainer predictions."""
    # Predicted class = index of the largest logit
    preds = p.predictions.argmax(-1)
    labels = p.label_ids

    # Ensure tensors are on CPU before converting to numpy
    if isinstance(preds, torch.Tensor):
        preds = preds.cpu().numpy()
    if isinstance(labels, torch.Tensor):
        labels = labels.cpu().numpy()

    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average='binary'),
    }

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

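# Evaluate the model as loaded, without calling trainer.train(), so the numbers
# reflect performance prior to fine-tuning (the classification head
# `score.weight` is still randomly initialized at this point)
eval_result = trainer.evaluate()
print(eval_result)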


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [None]:
pip install peft

In [None]:
# PEFT technique: LoRA, chosen for its efficiency and small number of trainable parameters.

from peft import LoraConfig, get_peft_model

config = LoraConfig(r=4,  # rank of the low-rank update matrices
                   lora_alpha=32,
                   target_modules=["c_attn", "c_proj"],
                   lora_dropout=0.2,
                   bias="none",
                   task_type="SEQ_CLS")  # sequence classification, matching the spam/not_spam task

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Reload a fresh base model with a classification head so the adapter is trained
# on the same task that is evaluated and used for inference later
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2,
    id2label={0: "not_spam", 1: "spam"},
    label2id={"not_spam": 0, "spam": 1})
model.config.pad_token_id = tokenizer.pad_token_id


In [None]:
lora_model = get_peft_model(model, config)

In [None]:
dataset = load_dataset('sms_spam')

if 'test' not in dataset:
    # A fixed seed keeps the train/test split reproducible across runs
    dataset = dataset['train'].train_test_split(test_size=0.2, seed=42)

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# The model expects the class label under the key 'labels'
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')

# Filter out any examples that might be None or empty
tokenized_datasets = tokenized_datasets.filter(lambda example: example['input_ids'] is not None and len(example['input_ids']) > 0)

# Return only the model inputs and labels, as torch tensors
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

In [None]:
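import os
# Allocator settings generally need to be in place before PyTorch first allocates
# GPU memory, so set this first (a kernel restart may be needed for it to take effect)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

import torch
torch.cuda.empty_cache()  # release cached GPU memory left over from earlier cells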


# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    evaluation_strategy="epoch",
    save_steps=10,
    learning_rate=5e-5,
    per_device_train_batch_size=1, 
    per_device_eval_batch_size=1, 
    gradient_accumulation_steps=4,
    fp16=True
)
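
The combination of `per_device_train_batch_size=1` and `gradient_accumulation_steps=4` trades memory for time: only one example is on the GPU at once, but gradients from four examples are summed before each optimizer update. The arithmetic can be sketched as follows, assuming a single device. The helper names are hypothetical, and the update count is approximate, since the Trainer's own rounding of the last partial accumulation window may differ by one.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * n_devices


def approx_updates_per_epoch(n_examples, per_device_batch, grad_accum_steps, n_devices=1):
    """Approximate optimizer updates needed to see every example once."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, n_devices)
    return math.ceil(n_examples / batch)


n_train = 4459  # training-set size from the tokenization output earlier in the notebook
print(effective_batch_size(1, 4))
print(approx_updates_per_epoch(n_train, 1, 4))
```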


In [None]:
def data_collator(features):
    batch = {}
    for k, v in features[0].items():
        batch[k] = torch.stack([f[k] for f in features])
    return batch

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

eval_result = trainer.evaluate()
print(eval_result)

In [None]:
lora_model.print_trainable_parameters()
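
As a rough cross-check on the reported number, the LoRA parameter count can be estimated by hand: an adapter of rank `r` on a weight of shape `d_in x d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes GPT-2 small (hidden size 768, 12 blocks) and assumes that `"c_proj"` in `target_modules` matches both the attention and the MLP output projections. Any other trainable modules, such as a classification head, are not counted.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 768      # GPT-2 small hidden size
N_LAYERS = 12     # GPT-2 small transformer blocks
R = 4             # rank from the LoraConfig above
LORA_ALPHA = 32   # lora_alpha from the LoraConfig above

# Assumed (in_features, out_features) of the modules matched in each block.
matched_per_block = {
    "attn.c_attn": (HIDDEN, 3 * HIDDEN),  # fused query/key/value projection
    "attn.c_proj": (HIDDEN, HIDDEN),      # attention output projection
    "mlp.c_proj": (4 * HIDDEN, HIDDEN),   # MLP output projection
}

per_block = sum(lora_params(d_in, d_out, R) for d_in, d_out in matched_per_block.values())
total_lora = per_block * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(per_block, total_lora, scaling)
```

Under these assumptions, the adapters add a few hundred thousand parameters, a small fraction of the base model.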

In [None]:
lora_model.save_pretrained("gpt-lora")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [None]:
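from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# "gpt-lora" holds only the adapter weights, so load the base classification model first
base_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2,
    id2label={0: "not_spam", 1: "spam"}, 
    label2id={"not_spam": 0, "spam": 1})
base_model.config.pad_token_id = tokenizer.pad_token_id

# Attach the saved LoRA adapter to the base model and switch to inference mode
lora_model = PeftModel.from_pretrained(base_model, "gpt-lora")
lora_model.eval()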

In [None]:
inputs = tokenizer("Hello, my name is call me immediateyly", return_tensors="pt", padding=True, truncation=True)

In [None]:
with torch.no_grad():
    outputs = lora_model(**inputs)
    logits = outputs.logits    
    predicted_class = torch.argmax(logits, dim=-1).item()

label = lora_model.config.id2label[predicted_class]
print(f"Predicted label: {label}")
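
The `argmax` above returns only the winning class. When a confidence score is also useful, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python on made-up logits, which are not output from the model. The `softmax` helper is hypothetical and stands in for `torch.softmax`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "not_spam", 1: "spam"}
example_logits = [0.3, 1.7]  # made-up values for illustration

probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted label: {id2label[predicted_class]} (p={probs[predicted_class]:.3f})")
```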