transformers → For working with DeepSeek models.
datasets → For handling urgency_data.csv efficiently.
peft → For LoRA fine-tuning (saves memory).
accelerate → Handles device placement, mixed precision, and multi-GPU training.
bitsandbytes → Enables 8-bit and 4-bit quantization to reduce GPU memory usage.
torch → PyTorch for model training.

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the smallest DeepSeek-R1 distilled checkpoint (1.5B parameters)
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # Swap in a different small checkpoint if memory is tight

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

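# Load model with a 6-way classification head (set device to CUDA if available).
# The classification head is newly initialized, so its outputs are not meaningful before fine-tuning.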
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=6
).to(device)

# Verify model is loaded
print(f"Model loaded on {device}")

# Test Tokenization on Sample Data
Before fine-tuning, let’s verify that the tokenizer handles our conversation text correctly.
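
A minimal sketch of such a check, assuming a Hugging Face style tokenizer that can be called on a string and exposes `decode`. The helper name `summarize_tokenization` and the `max_length=512` default are illustrative choices, not part of the original notebook.

```python
def summarize_tokenization(tokenizer, text, max_length=512):
    """Tokenize `text` and return a few facts useful for a sanity check.

    `tokenizer` is assumed to behave like a Hugging Face tokenizer:
    callable on a string, returning a mapping with "input_ids",
    and providing a `decode` method.
    """
    encoded = tokenizer(text, truncation=True, max_length=max_length)
    input_ids = encoded["input_ids"]

    # Decode back to text to confirm nothing was lost or garbled
    decoded = tokenizer.decode(input_ids, skip_special_tokens=True)

    return {
        "num_tokens": len(input_ids),
        # Reaching max_length usually means the text was cut off
        "hit_max_length": len(input_ids) >= max_length,
        "decoded": decoded,
        "round_trip_ok": decoded.strip() == text.strip(),
    }
```

In this notebook it could be called as `summarize_tokenization(tokenizer, "Caller: I need you to send money immediately!")`. A failed round trip is not necessarily an error (some tokenizers normalize whitespace), but it is worth inspecting the decoded text when it happens.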

# Initial Model Testing on Small Dataset
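Before fine-tuning, we want to check how the base DeepSeek model classifies urgency without any training. Because the classification head added by AutoModelForSequenceClassification is randomly initialized, expect roughly chance-level predictions; this run mainly confirms that the pipeline works end to end.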

1. Prepare a Few Sample Inputs for Testing
We will create a small dataset with a few labeled examples:

In [None]:
sample_conversations = [
    {
        "conversation": (
            "Caller: I need you to send money immediately, I am in big trouble!\n"
            "Receiver: What happened? Why do you need money so urgently?\n"
            "Caller: I got into a legal issue, and I need bail money right now!"
        ),
        "label": "Legal/Authority Urgency"
    },
    {
        "conversation": (
            "Caller: Hey, I just saw this new phone, and it’s on a flash sale for 80% off! You have to buy it now!\n"
            "Receiver: That’s a crazy deal! How long is the sale?\n"
            "Caller: It’s only for the next 5 minutes! Hurry up!"
        ),
        "label": "Social/Peer Pressure Urgency"
    },
    {
        "conversation": (
            "Caller: The bank just called. They said there’s an issue with your account, and you need to verify your details now.\n"
            "Receiver: Wait, is this a scam? My bank never calls like this.\n"
            "Caller: No, this is serious! If you don’t confirm, your account will be frozen!"
        ),
        "label": "Financial Urgency"
    }
]

labels = ["Emotional Urgency", "Financial Urgency", "Legal/Authority Urgency", "No Urgency", "Social/Peer Pressure Urgency", "Romantic Urgency"]
label_to_id = {label: i for i, label in enumerate(labels)}


2. Tokenize & Run Model on Small Sample
Now, we tokenize these conversations and see what the model predicts:

In [None]:
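import torch.nn.functional as F

# The classifier needs a padding token; fall back to EOS if the tokenizer has none
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id  # Required later for batched inputs

# Tokenize each conversation and run it through the model
for conv in sample_conversations:
    inputs = tokenizer(
        conv["conversation"],
        truncation=True,
        padding=True,
        return_tensors="pt"
    ).to(device)

    # Get model predictions (no gradients needed for inference)
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = F.softmax(logits, dim=-1)  # Convert logits to class probabilities
        prediction = torch.argmax(probs, dim=-1).item()  # Predicted class index

    # Print results
    print(f"Conversation: {conv['conversation'][:100]}...")  # Print first 100 chars
    print(f"Actual Label: {conv['label']}")
    print(f"Predicted Label: {labels[prediction]} (p={probs[0, prediction].item():.2f})")
    print("=" * 50)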
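
To compare this baseline with the fine-tuned model later, it helps to reduce the printed results to a single number. The sketch below computes plain accuracy and the chance level for uniform guessing over six classes. The `predicted` list is a hypothetical placeholder for illustration only, not actual model output.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


labels = [
    "Emotional Urgency", "Financial Urgency", "Legal/Authority Urgency",
    "No Urgency", "Social/Peer Pressure Urgency", "Romantic Urgency",
]
# Expected accuracy of uniform random guessing over the label set
chance_level = 1 / len(labels)

# Hypothetical predictions (placeholders, NOT real model output)
predicted = ["No Urgency", "Social/Peer Pressure Urgency", "Financial Urgency"]
actual = ["Legal/Authority Urgency", "Social/Peer Pressure Urgency", "Financial Urgency"]

print(f"Accuracy: {accuracy(predicted, actual):.2f} (chance is about {chance_level:.2f})")
```

With only three samples the number is very noisy, so it should be read as a smoke test rather than a measurement.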


# Step 3: Preparing the Dataset for Fine-Tuning
Now, we’ll prepare the main dataset (urgency_data.csv) for fine-tuning by:

1. Loading and inspecting the dataset.
2. Converting text and labels into a model-compatible format.
3. Splitting the dataset into train and validation sets.
4. Ensuring efficient memory usage.
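
As a rough illustration of step 3, the sketch below splits rows into train and validation sets while keeping each label's share roughly equal in both (a stratified split). It uses only the standard library. The column name `"label"`, the 20% validation fraction, and the seed are assumptions, since the structure of urgency_data.csv has not been shown yet.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", val_fraction=0.2, seed=42):
    """Split a list of dict rows into (train, val), stratified by label.

    `label_key` is an assumed column name; adjust it to the real dataset.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # Local RNG so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Keep at least one validation example per class when possible
        n_val = max(1, round(len(group) * val_fraction)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

With a pandas DataFrame this could be called as `train_rows, val_rows = stratified_split(df.to_dict("records"))`. If scikit-learn is available, `train_test_split(..., stratify=...)` achieves the same result in one call.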



# 3.1 Load & Inspect the Dataset

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("./urgency_data.csv")

# Display some samples
print(df.head())
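
Beyond `df.head()`, it is usually worth checking how the examples are distributed across the six classes, since a heavily skewed dataset affects both training and evaluation. The helper below is a sketch that works on any list of label strings; passing it `df["label"].tolist()` assumes the CSV has a column named `label`, which is not confirmed here.

```python
from collections import Counter


def label_distribution(label_values, expected_labels):
    """Return (label, count, share) tuples for each expected label, largest first."""
    counts = Counter(label_values)
    total = sum(counts.values())
    rows = []
    for label in expected_labels:
        n = counts[label]  # Counter returns 0 for labels that never occur
        share = n / total if total else 0.0
        rows.append((label, n, share))
    return sorted(rows, key=lambda r: r[1], reverse=True)


labels = [
    "Emotional Urgency", "Financial Urgency", "Legal/Authority Urgency",
    "No Urgency", "Social/Peer Pressure Urgency", "Romantic Urgency",
]
```

A call such as `for label, n, share in label_distribution(df["label"].tolist(), labels): print(label, n, f"{share:.1%}")` would then show which classes are rare or missing entirely.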


# 3.2 Encode Labels into Numerical Format
Since the model expects numeric class IDs, we map each label string to an integer.
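
A minimal sketch of this mapping, reusing the `labels` list and `label_to_id` dictionary defined earlier. The helper `encode_labels` is a hypothetical name; it fails loudly on any label that is not in the mapping, which catches typos and casing differences in the CSV before training starts.

```python
labels = [
    "Emotional Urgency", "Financial Urgency", "Legal/Authority Urgency",
    "No Urgency", "Social/Peer Pressure Urgency", "Romantic Urgency",
]
label_to_id = {label: i for i, label in enumerate(labels)}
id_to_label = {i: label for label, i in label_to_id.items()}


def encode_labels(label_values, label_to_id):
    """Convert label strings to integer IDs, rejecting unknown labels."""
    unknown = sorted({v for v in label_values if v not in label_to_id}, key=str)
    if unknown:
        raise ValueError(f"Unknown labels: {unknown}")
    return [label_to_id[v] for v in label_values]


print(encode_labels(["No Urgency", "Emotional Urgency"], label_to_id))
```

With pandas, `df["label"].map(label_to_id)` performs the same mapping, but it silently produces NaN for unknown labels, so an explicit check like the one above is a useful complement. The column name `label` is again an assumption.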