![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true) <img src="https://github.com/PracticumAI/practicumai.github.io/blob/84b04be083ca02e5c7e92850f9afd391fc48ae2a/images/icons/practicumai_computer_vision.png?raw=true" alt="Practicum AI: Computer Vision icon" align="right" width=50>
***

# Transfer Learning Implementation

Welcome back! In our previous exercise [01.0_transfer_learning_conepts.ipynb](01.0_transfer_learning_conepts.ipynb), we introduced the core concepts of **Transfer Learning** and experimented with fine-tuning. We saw how leveraging pre-trained models can significantly boost performance and reduce training time compared to starting from scratch.

In this notebook, we'll move on to adapting powerful pre-trained language models for specific Natural Language Processing (NLP) tasks, focusing on two other important transfer learning strategies. We'll:

* Implement **Feature Extraction** using a pre-trained Transformer model for text classification.
* Look closer at how to freeze the base model and train only a new classification head.
* Implement **LoRA (Low-Rank Adaptation)**, a Parameter-Efficient Fine-Tuning (PEFT) technique, to adapt the same pre-trained Transformer.
* Understand how LoRA modifies the model and drastically reduces the number of trainable parameters.
* Directly compare the results and efficiency of Feature Extraction versus LoRA on the same NLP task.
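
To make the parameter savings from LoRA concrete, here is a back-of-the-envelope sketch. It assumes a single 768 x 768 attention projection (768 is DistilBERT's hidden size) and an illustrative LoRA rank of 16. The helper names `full_param_count` and `lora_param_count` are hypothetical and exist only for this example.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Weights in a dense d_out x d_in projection (bias ignored)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Weights in the two low-rank LoRA factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


hidden = 768  # hidden size of DistilBERT's attention projections
rank = 16  # illustrative LoRA rank

full = full_param_count(hidden, hidden)
lora = lora_param_count(hidden, hidden, rank)

print(f"Full fine-tuning of one projection: {full:,} weights")
print(f"LoRA with r={rank}: {lora:,} weights ({lora / full:.1%} of full)")
```

The two LoRA factors together hold `2 * 768 * 16` weights, a small fraction of the `768 * 768` weights in the original projection, and the gap widens as the rank shrinks.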

## A Direct Comparison

To make the comparison between Feature Extraction and LoRA as clear as possible, we will:

1.  Use the **same base pre-trained model** [DistilBERT - `distilbert-base-uncased`](https://huggingface.co/distilbert/distilbert-base-uncased) as the starting point for both methods.
2.  Apply both techniques to the **same text classification task**: classifying articles from the [Crop Market News Classification](https://www.kaggle.com/datasets/mcwemzy/crop-market-news-classification) dataset, which is hosted on [Kaggle](https://www.kaggle.com), a popular platform for datasets and machine learning competitions.
3.  Use the **same dataset** for training and evaluation in both parts.

This setup will allow us to directly observe the differences in implementation complexity, performance metrics, and the number of parameters trained.

### Prerequisites and Setup

* **Conceptual Understanding:** Ensure you're comfortable with the basic ideas of transfer learning, pre-trained models, and fine-tuning as covered in Notebook `01.0`.
* **Helper Functions:** We will utilize helper functions defined in [00.5_transfer_learning_helper.ipynb](00.5_transfer_learning_helper.ipynb). Please make sure you have access to that notebook or have run it previously to make the functions available.
* **Deep Learning Fundamentals:** As with the previous exercise, a thorough understanding of how Large Language Models (LLMs) work is not required, but you do need a basic grasp of concepts like parameters, hyperparameters, and other Deep Learning fundamentals. If you want to learn more about how LLMs work, we recommend NVIDIA's [Deep Learning Institute](https://developer.nvidia.com/dli) course on [Introduction to Transformer-Based Natural Language Processing](https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+S-FX-08+V1). This course is free and provides a solid foundation in the principles of Transformers and their applications in NLP.

## How to Use This Notebook (`FIX_ME`s)

In this notebook, you'll find sections marked with:

```
FIX_ME
# FIX_ME: <description of what to do>
```
These are places where you need to fill in missing code or make adjustments. The goal is to reinforce the key implementation steps for each technique.

If you get stuck, review the preceding explanations, check the documentation for the libraries used (PyTorch, Hugging Face), or consult the course's **GitHub Discussions page** [Link to GitHub Discussions - Placeholder] for hints and help from peers and instructors.

Let's get started with adapting our language model!


### 1. Import the libraries we will use

As always, we will start by importing the libraries we will use in this notebook.

In [None]:
!pip install transformers
!pip install datasets
!pip install peft
!pip install liac-arff

In [None]:
# Import Libraries
import os
import sys
import time
import json
import zipfile

import requests
import pandas as pd
import arff  # provided by the liac-arff package
from scipy.io import arff as scipy_arff
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from tqdm.auto import tqdm

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

import datasets
import transformers
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    get_scheduler,
)
import peft
from peft import get_peft_model, LoraConfig, TaskType

#### Check for GPU availability

This cell will check that everything is configured correctly to use your GPU. If everything is correct, you should see something like: 

    Using GPU: type of GPU

If you see:
    
    Using CPU
    
Either you do not have a GPU or the kernel is not correctly configured to use it. You might be able to run this notebook, but some sections will take a loooooong time!

In [None]:
# Check for GPU availability
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Using CPU")

### 2. Getting the data

Today's dataset is the [Crop Market News Classification](https://www.kaggle.com/datasets/mcwemzy/crop-market-news-classification) dataset. This dataset contains crop market news articles and their corresponding categories. The goal is to classify the articles into their respective categories using both Feature Extraction and LoRA.

In [None]:
# Download the dataset, extract it to the data folder and remove the zip file
download_path = "https://data.rc.ufl.edu/pub/practicum-ai/Transfer_Learning_Intermediate/crop_market_news.zip"
zip_path = "data/crop_market_news.zip"
data_path = "data"

# Path to the extracted ARFF dataset file
dataset = os.path.join(data_path, "Crop.Market.News.Classification.arff")

# Check if the data is already loaded
if not (os.path.exists(dataset) and os.path.getsize(dataset) > 0):
    # Create the data directory if it does not exist
    if not os.path.exists(data_path):
        os.makedirs(data_path)

    # Download the zip file (fail loudly on HTTP errors instead of saving an error page)
    r = requests.get(download_path, timeout=60)
    r.raise_for_status()
    with open(zip_path, "wb") as f:
        f.write(r.content)

    # Extract the zip file
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(data_path)

    # Remove the zip file
    os.remove(zip_path)
    print(f"Data has been downloaded and unzipped at {data_path}")
else:
    print("Data is already downloaded.")

### 3. Looking at the data

The dataset contains a number of columns, but we will only use two: `text`, which holds the news articles, and `category`, which holds each article's category. The cell below maps each category string to an integer ID and stores it in a new `label` column.
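
The model works with integer class IDs rather than category strings, so the strings have to be mapped to integers and back. The following is a minimal sketch of that mapping in plain Python. The category names are made up for illustration and are not taken from the dataset.

```python
# Hypothetical category strings standing in for the dataset's category column
categories = ["price", "weather", "price", "policy", "weather"]

# dict.fromkeys keeps first-seen order, similar to pandas' Series.unique()
unique_labels = list(dict.fromkeys(categories))

# Forward and reverse lookups between strings and integer IDs
label2id = {label: i for i, label in enumerate(unique_labels)}
id2label = {i: label for label, i in label2id.items()}

# Encode the strings as integers, then decode them again
encoded = [label2id[c] for c in categories]
decoded = [id2label[i] for i in encoded]

print(label2id)
print(encoded)
```

Keeping both dictionaries around is useful later: `label2id` prepares the training data, while `id2label` turns model predictions back into readable category names.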

In [None]:
# Load the ARFF dataset into a DataFrame
# If the file is missing, report the problem and skip the processing below
try:
    # Use liac-arff to load the ARFF file
    with open(dataset, "r") as f:
        arff_data = arff.load(f)
        data = pd.DataFrame(
            arff_data["data"], columns=[attr[0] for attr in arff_data["attributes"]]
        )
except FileNotFoundError:
    print(f"ERROR: Dataset file not found at {dataset}.")
    print("Please ensure you have downloaded the dataset and the path is correct.")
    data = None  # Set data to None if file not found

if data is not None:
    # Convert to Pandas DataFrame
    df = pd.DataFrame(data)
    print(f"Loaded DataFrame shape: {df.shape}")

    # Decode byte strings (Common in ARFF)
    # Identify potential text columns (adjust names based on actual columns in meta/df.info())
    text_col = "text"
    label_col = "category"

    if text_col in df.columns and df[text_col].dtype == "object":
        # Check if decoding is needed (inspect first element)
        if isinstance(df[text_col].iloc[0], bytes):
            print(f"Decoding byte strings in column '{text_col}'...")
            df[text_col] = df[text_col].str.decode("utf-8")

    if label_col in df.columns and df[label_col].dtype == "object":
        if isinstance(df[label_col].iloc[0], bytes):
            print(f"Decoding byte strings in column '{label_col}'...")
            df[label_col] = df[label_col].str.decode("utf-8")

    # Map String Labels to Integer IDs
    unique_labels = df[label_col].unique()
    num_labels = len(unique_labels)

    # Create mappings
    label2id = {label: i for i, label in enumerate(unique_labels)}
    id2label = {i: label for label, i in label2id.items()}

    # Apply mapping to create a new 'label' column
    df["label"] = df[label_col].map(label2id)

    print(f"Number of classes: {num_labels}")
    print("Label mappings created:")
    print(f"  label2id: {label2id}")
    print(f"  id2label: {id2label}")

    # Inspect the DataFrame
    print("\nDataFrame Head:")
    print(df.head())
    print("\nDataFrame Info:")
    df.info()
    print("\nLabel Distribution:")
    print(df["label"].value_counts())

    # Keep only relevant columns (text and integer label)
    df = df[[text_col, "label"]]
    df = df.rename(columns={text_col: "text"})  # Ensure text column is named 'text'
else:
    print("Skipping DataFrame processing as data was not loaded.")

### 4. Preprocessing the data

Now, let's convert our Pandas DataFrame into the Hugging Face `datasets` format, which integrates seamlessly with the `transformers` library. Since the original dataset doesn't specify train/test splits, we'll create our own train and validation sets.
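
To illustrate what a seeded 80/20 split does conceptually, here is a small sketch in plain Python. It is not how the `datasets` library implements its split internally, and the helper name `split_indices` is hypothetical. It only shows the idea: shuffle the row indices with a fixed seed, then set aside a fraction for validation.

```python
import random


def split_indices(n, test_size=0.2, seed=42):
    """Shuffle indices 0..n-1 with a fixed seed and hold out a test_size fraction."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, eval_idx = split_indices(10)
print(f"Training rows: {len(train_idx)}, validation rows: {len(eval_idx)}")

# The same seed gives the same split, which makes experiments repeatable
assert split_indices(10) == (train_idx, eval_idx)
```

Fixing the seed matters for this notebook because both methods must see exactly the same training and validation examples for the comparison to be fair.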


In [None]:
if data is not None:
    # Convert Pandas DataFrame to Hugging Face Dataset
    hf_dataset = datasets.Dataset.from_pandas(df)
    print("\nConverted to Hugging Face Dataset:")
    print(hf_dataset)

    # Split into training and validation sets (e.g., 80% train, 20% validation)
    train_test_split_ratio = 0.20
    dataset_dict = hf_dataset.train_test_split(
        test_size=train_test_split_ratio, shuffle=True, seed=42
    )  # Use seed for reproducibility

    # Rename for clarity
    train_ds = dataset_dict["train"]
    eval_ds = dataset_dict["test"]

    print("\nSplit into Train and Validation Sets:")
    print(f"  Training examples: {len(train_ds)}")
    print(f"  Validation examples: {len(eval_ds)}")
    print(train_ds)  # Show structure
else:
    print("Skipping Dataset preparation as data was not loaded.")
    train_ds, eval_ds = None, None  # Set to None

### 5. Set up the model and tokenizer

We will use the `distilbert-base-uncased` model and tokenizer from Hugging Face. This model is a smaller, faster, and lighter version of BERT *(Bidirectional Encoder Representations from Transformers)* that retains 97% of BERT's language understanding while being 60% faster and 40% smaller. It is a great choice for text classification tasks, especially when computational resources are limited. The `distilbert-base-uncased` model is pre-trained on a large corpus of English text and is designed to handle uncased text (i.e., it does not differentiate between uppercase and lowercase letters), making it ideal for our task. We will also set up the training arguments for both Feature Extraction and LoRA.
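
As a quick sanity check on the size claim, the arithmetic below uses the commonly cited parameter counts of roughly 110 million for BERT-base and 66 million for DistilBERT. These rounded figures are assumptions brought in for illustration, not values measured in this notebook.

```python
bert_base_params = 110_000_000  # approximate, commonly cited figure
distilbert_params = 66_000_000  # approximate, commonly cited figure

# Fraction of parameters removed by distillation
reduction = 1 - distilbert_params / bert_base_params
print(f"DistilBERT has about {reduction:.0%} fewer parameters than BERT-base")
```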

In [None]:
# Setting up models for LoRA and Feature Extraction

# Initialize tokenizer
model_name = "distilbert-base-uncased"  # Same base model for both methods
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- LoRA Setup ---
# Load base model for LoRA
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,  # Number of classes found in the dataset (Section 3)
    return_dict=True,
)

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of the update matrices
    lora_alpha=32,  # Scaling factor
    target_modules=["q_lin", "k_lin", "v_lin"],  # DistilBERT's attention projections
    modules_to_save=["pre_classifier", "classifier"],  # Also train the new head layers
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,  # For sequence classification tasks
)

# Create LoRA model
lora_model = get_peft_model(base_model, lora_config)
print(
    f"LoRA model setup complete. Trainable parameters: {sum(p.numel() for p in lora_model.parameters() if p.requires_grad)}"
)

# --- Feature Extraction Setup ---
# Load the base model for feature extraction
feature_extractor_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,  # Number of classes found in the dataset (Section 3)
    return_dict=True,
)

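# Freeze all parameters
for param in feature_extractor_model.parameters():
    param.requires_grad = False

# Unfreeze only the classification head. Matching on the name covers `classifier`
# and, for DistilBERT, the `pre_classifier` layer that is also newly initialized.
for name, param in feature_extractor_model.named_parameters():
    if "classifier" in name:
        param.requires_grad = True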

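# Tokenize the text; DataCollatorWithPadding expects token IDs, not raw strings
def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True)


# Drop every column except the integer label so only model inputs remain
drop_cols = [c for c in train_ds.column_names if c != "label"]
tokenized_train_ds = train_ds.map(tokenize_fn, batched=True, remove_columns=drop_cols)
tokenized_eval_ds = eval_ds.map(tokenize_fn, batched=True, remove_columns=drop_cols)

# Create dataloaders (the collator pads each batch and renames `label` to `labels`)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(
    tokenized_train_ds, shuffle=True, batch_size=16, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_eval_ds, batch_size=16, collate_fn=data_collator
)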

print(
    f"Feature extraction model setup complete. Trainable parameters: {sum(p.numel() for p in feature_extractor_model.parameters() if p.requires_grad)}"
)

### 6. Training the models

We will train two models: one using Feature Extraction and the other using LoRA. Both use the same training settings (optimizer, learning rate, number of epochs, and batch size) to keep the comparison fair. The difference lies in which parameters are updated:

* **Feature Extraction:** the base model is frozen and only the classification head is trained.
* **LoRA:** the base model's weights stay frozen, and small low-rank adapter matrices injected into the attention layers are trained along with the classification head.

We will evaluate both models on the validation set and compare their performance in Section 7.
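
To see what LoRA actually trains, the NumPy sketch below builds a toy frozen weight matrix `W` and two small factors `A` and `B`. The sizes are deliberately tiny and the function name `lora_forward` is hypothetical. This illustrates the low-rank update idea only and is not the `peft` library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4  # toy sizes; a real model would use e.g. d=768

W = rng.normal(size=(d, d))  # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d))  # trainable factor, random init
B = np.zeros((d, r))  # trainable factor, zero init

x = rng.normal(size=(3, d))  # batch of 3 input vectors


def lora_forward(x, W, A, B, alpha, r):
    """Base projection plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)


base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, alpha, r)

# With B initialized to zero, the adapter starts out as a no-op
print("Same output before training:", np.allclose(base_out, lora_out))
print(f"Frozen weights: {W.size}, trainable weights: {A.size + B.size}")
```

Only `A` and `B` would receive gradient updates during training. Because `B` starts at zero, the adapted model initially behaves exactly like the pre-trained one and drifts away from it only as training progresses.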

In [None]:
# Training and Evaluation Functions


def train_epoch(model, dataloader, optimizer, scheduler, device):
    model.train()
    total_loss = 0

    progress_bar = tqdm(dataloader, desc="Training", leave=False)
    for batch in progress_bar:
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        progress_bar.set_postfix({"loss": loss.item()})

    return total_loss / len(dataloader)


def evaluate(model, dataloader, device):
    model.eval()
    predictions = []
    references = []
    total_loss = 0

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating", leave=False):
            batch = {k: v.to(device) for k, v in batch.items()}

            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.item()

            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()
            labels = batch["labels"].cpu().numpy()

            predictions.extend(preds)
            references.extend(labels)

    # Calculate metrics
    accuracy = accuracy_score(references, predictions)
    f1 = f1_score(references, predictions, average="weighted")

    return {"loss": total_loss / len(dataloader), "accuracy": accuracy, "f1": f1}


def train_model(model, train_dataloader, eval_dataloader, num_epochs=3, device="cuda"):
    # Set up optimizer and scheduler
    optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=5e-5)

    num_training_steps = num_epochs * len(train_dataloader)
    scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    # Move model to device
    model = model.to(device)

    # Training loop
    results = {"train_loss": [], "eval_loss": [], "eval_accuracy": [], "eval_f1": []}

    for epoch in range(num_epochs):
        print(f"Epoch {epoch+1}/{num_epochs}")

        # Training
        train_loss = train_epoch(model, train_dataloader, optimizer, scheduler, device)
        results["train_loss"].append(train_loss)

        # Evaluation
        eval_metrics = evaluate(model, eval_dataloader, device)
        results["eval_loss"].append(eval_metrics["loss"])
        results["eval_accuracy"].append(eval_metrics["accuracy"])
        results["eval_f1"].append(eval_metrics["f1"])

        print(
            f"Train Loss: {train_loss:.4f} | "
            + f"Eval Loss: {eval_metrics['loss']:.4f} | "
            + f"Accuracy: {eval_metrics['accuracy']:.4f} | "
            + f"F1 Score: {eval_metrics['f1']:.4f}"
        )

    return results

In [None]:
# LoRA training
lora_results = train_model(
    lora_model, train_dataloader, eval_dataloader, num_epochs=3, device=device
)

# Feature extraction training
feature_extraction_results = train_model(
    feature_extractor_model,
    train_dataloader,
    eval_dataloader,
    num_epochs=3,
    device=device,
)