## Fine-tune Phi-3 and Llama for sequence classification

Fine-tune a model to classify messages as stressed or not stressed, using messages from Dreaddit: A Reddit Dataset for Stress Analysis in Social Media.
See: https://aclanthology.org/D19-6213

Commonly updated parameters are in the Parameter block below.

**test_count:** The number of messages to use for training.

**model_name:** The ID of the model to download from Hugging Face.

**access_token:** The Hugging Face access token.

Note: to work with batched input, padding=True is set when tokenizing. The tokenizer also returns an attention_mask column that marks the padded positions, so the model can ignore the padding tokens.

The training data was reduced to 1200 messages to avoid exhausting GPU memory on Kaggle.

#### Parameters

In [1]:
test_count = 1200
model_name  = "microsoft/Phi-3-medium-4K-instruct"
#model_name =  "meta-llama/Llama-2-7b-hf"
#model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
access_token = 'access token here'

In [2]:
!pip install transformers
!pip install torch
!pip install peft # needed to fine-tune the large model with LoRA
!pip install bitsandbytes # needed for quantization
!pip install evaluate # Hugging Face library of evaluation metrics
!pip install datasets # Hugging Face library for loading and processing datasets
!pip install accelerate

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl (76.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.5
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [3]:
import torch
import pandas as pd
from datasets import Dataset, load_dataset, DatasetDict
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForSequenceClassification,
    Trainer,
    EarlyStoppingCallback,
    DataCollatorWithPadding,
    TextClassificationPipeline)
from sklearn.model_selection import train_test_split

import bitsandbytes as bnb

import evaluate
import numpy as np
import time
import random

In [4]:
# Set start time for training
start_time = time.time()

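### Log in to Hugging Face

In [5]:
# Safer option: interactive login that keeps the token out of the notebook
#from huggingface_hub import notebook_login
#notebook_login()

# Alternative: log in with the token set in the Parameters block
from huggingface_hub import login
login(access_token)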

### Load messages, dropping nulls.


In [6]:
# URLs of the raw CSV files on GitHub
train_url = "https://raw.githubusercontent.com/SocialHealthAI/SDOH-Models/refs/heads/main/LLM%20Classification/dreaddit-train.csv"
test_url = "https://raw.githubusercontent.com/SocialHealthAI/SDOH-Models/refs/heads/main/LLM%20Classification/dreaddit-test.csv"

# Load the first test_count training rows and the first 400 test rows
train_df = pd.read_csv(train_url).head(test_count)
test_df = pd.read_csv(test_url).head(400)

# Drop rows with null text (reassign instead of modifying a slice in place)
train_df = train_df.dropna(subset=['text'])
test_df = test_df.dropna(subset=['text'])


In [7]:
label_names = ['0', '1']

### Split test data into test and validation data

In [8]:
# Calculate the split index
split_index = len(test_df) // 2

# Split the DataFrame
validation_df = test_df.iloc[:split_index]
test_df = test_df.iloc[split_index:]


### Create DatasetDict

In [9]:
train_body = train_df['text'].values.tolist()
train_label = np.array(train_df['label'].values.tolist())

val_body = validation_df['text'].values.tolist()
val_label = np.array(validation_df['label'].values.tolist())

test_body = test_df['text'].values.tolist()
test_label = np.array(test_df['label'].values.tolist())

# create hf dataset
ds = DatasetDict({
    'train': Dataset.from_dict({'body': train_body, 'label': train_label}),
    'val': Dataset.from_dict({'body': val_body, 'label': val_label}),
    'test': Dataset.from_dict({'body': test_body, 'label': test_label})
})

In [10]:
ds['train'][0]

{'body': 'He said he had not felt that way before, suggeted I go rest and so ..TRIGGER AHEAD IF YOUI\'RE A HYPOCONDRIAC LIKE ME: i decide to look up "feelings of doom" in hopes of maybe getting sucked into some rabbit hole of ludicrous conspiracy, a stupid "are you psychic" test or new age b.s., something I could even laugh at down the road. No, I ended up reading that this sense of doom can be indicative of various health ailments; one of which I am prone to.. So on top of my "doom" to my gloom..I am now f\'n worried about my heart. I do happen to have a physical in 48 hours.',
 'label': 1}

### Tokenization
Load the tokenizer.

Define a tokenizer function with truncation=True so that sequences longer than the model's maximum input length are truncated, and padding=True so that the sequences in a batch share a common length.

Use the map function from the Datasets library to apply preprocess_function to the Dreaddit dataset in batches for efficiency. This creates a new dataset named ds_tokenized with additional columns:

* input_ids: Numerical representation of the text using tokenizer vocabulary.
* attention_mask: Mask to indicate valid elements in padded sequences.
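To make the two columns concrete, here is a minimal sketch in plain Python of what padding a batch produces. The helper pad_batch and the token IDs are hypothetical and only illustrate the idea; the real tokenizer decides the padding side and the padding token ID, and right-padding is assumed here.

```python
def pad_batch(sequences, pad_id):
    """Right-pad lists of token IDs to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens first, then padding tokens up to the common length
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for two messages of different lengths
batch = pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence receives one padding token, and its mask has a 0 in that position so the model can skip it.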

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Reuse the EOS token as the padding token
print(f' Vocab size of the model {model_name}: {len(tokenizer.get_vocab())}')

def preprocess_function(examples):
    # Truncate long messages and pad each mapped batch to its longest sequence
    return tokenizer(examples["body"], truncation=True, padding=True)

ds_tokenized = ds.map(preprocess_function, batched=True)

 Vocab size of the model microsoft/Phi-3-medium-4K-instruct: 32011


Map:   0%|          | 0/1200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

### Label mapping
Create dictionaries to map labels (text) to numerical IDs and vice versa:

In [12]:
id2label = {0: "Not Stressed", 1: "Stressed"}
label2id = {"Not Stressed": 0, "Stressed": 1}

### Data Collator
The DataCollatorWithPadding class from Transformers helps prepare batches of data for training. It pads the sequences in each batch to a common length and creates the matching attention masks. We can simply instantiate it with the tokenizer.

In [13]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

### Evaluation Metrics
To assess the performance of the fine-tuned model, use the metric functions from scikit-learn, which cover the evaluation metrics commonly used in classification tasks.

Define a function compute_metrics that takes the model's predictions and ground-truth labels as input and calculates accuracy and the weighted F1 score.

In [14]:
from sklearn.metrics import accuracy_score, f1_score
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # Convert logits to class predictions
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    return {"accuracy": accuracy, "f1": f1}
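As a worked example of what these two metrics measure, the following sketch recomputes them in plain Python on a tiny hand-made batch. The logits and labels are made up for illustration, and the helper names are hypothetical. Weighted F1 is taken here as the per-class F1 averaged with weights equal to each class's share of the true labels.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating cls as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support in y_true."""
    n = len(y_true)
    return sum(
        y_true.count(c) / n * f1_for_class(y_true, y_pred, c)
        for c in sorted(set(y_true))
    )


# Made-up logits for four messages, one row per message: [class 0, class 1]
logits = [[2.0, -1.0], [0.3, 0.9], [1.2, 0.4], [-0.5, 1.5]]
labels = [0, 1, 1, 1]

# Argmax over each row converts logits to class predictions
predictions = [row.index(max(row)) for row in logits]
accuracy = sum(int(t == p) for t, p in zip(labels, predictions)) / len(labels)

print(predictions, accuracy, weighted_f1(labels, predictions))
```

In this toy batch one class 1 message is misclassified, so class 0 has perfect recall but imperfect precision, while class 1 has the reverse.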


### Quantization Configuration
The Transformers library provides the BitsAndBytesConfig class for defining quantization parameters.

In [15]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enables 4-bit quantization
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save more memory (optional)
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat quantization data type
    bnb_4bit_compute_dtype=torch.bfloat16  # Dtype used for computation after dequantization
)
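To see why 4-bit loading matters for a model of this size, here is a rough back-of-the-envelope sketch. The parameter count is an assumed round figure close to the roughly 13.9 billion parameters reported later in this notebook, and the estimate covers the weights only: it ignores quantization constants, activations, optimizer state and the LoRA adapter.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: a round 14 billion parameters, used only for illustration
n_params = 14_000_000_000

for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: about {weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the 16-bit weights, which is what lets the model fit on a single GPU.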

### Load model in 4-bit


In [16]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    problem_type="single_label_classification",  # Two mutually exclusive classes
    num_labels=2,  # Number of output labels
    id2label=id2label,
    label2id=label2id,
    quantization_config=bnb_config,  # configuration for quantization
    device_map={"": 0}  # Place the whole model on the GPU with index 0
)

Some weights of Phi3ForSequenceClassification were not initialized from the model checkpoint at microsoft/Phi-3-medium-4K-instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Gradient checkpointing
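Gradient checkpointing is a memory optimization technique that helps with large models: it stores fewer activations during the forward pass and recomputes them during the backward pass, trading extra compute for lower memory use.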

In [17]:
model.gradient_checkpointing_enable()

### Prepare the quantized model for training

In [18]:
model = prepare_model_for_kbit_training(model)

### Find the layer names needed for the LoRA configuration

In [19]:
def find_linear_names(model):
    """
    This function identifies all linear layer names within a model that use 4-bit quantization.
    Args:
        model (torch.nn.Module): The PyTorch model to inspect.
    Returns:
        list: A list containing the names of all identified linear layers with 4-bit quantization.
    """
    cls = bnb.nn.Linear4bit

    # Set to store identified layer names
    lora_module_names = set()

    # Iterate through named modules in the model
    for name, module in model.named_modules():
        # Check if the current module is an instance of the 4-bit linear layer class
        if isinstance(module, cls):
            # Keep only the last component of the dotted module path
            lora_module_names.add(name.split('.')[-1])

    # Special case: remove 'lm_head' if present
    lora_module_names.discard('lm_head')
    return list(lora_module_names)

# Example usage:
modules = find_linear_names(model)
print(modules)

['o_proj', 'qkv_proj', 'down_proj', 'gate_up_proj']
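The name extraction itself can be illustrated without loading a model. The sketch below applies the same "last component of the dotted path" rule to a hand-written list of module paths; the paths and the helper leaf_names are hypothetical and only mimic the naming pattern seen in the output above.

```python
def leaf_names(module_paths, exclude=("lm_head",)):
    """Collect the last component of each dotted module path, minus exclusions."""
    leaves = {path.split(".")[-1] for path in module_paths}
    return sorted(leaves - set(exclude))


# Hypothetical module paths, written by hand for illustration
paths = [
    "model.layers.0.self_attn.qkv_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.qkv_proj",
    "lm_head",
]

print(leaf_names(paths))
```

Because the names are collected in a set, a layer type that repeats in every transformer block appears only once in the result.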


### LoRA Configuration
LoRA is a technique that minimizes the number of parameters requiring training during fine-tuning by keeping all original model parameters frozen and introducing a pair of rank decomposition matrices alongside the existing weights. These smaller matrices are designed so that their product matches the dimensions of the weights they modify. The original weights of the LLM remain unchanged, while the smaller matrices are trained using supervised learning. During inference, the two low-rank matrices are multiplied to form a matrix that matches the dimensions of the frozen weights. This matrix is then added to the original weights, effectively updating them in the model.

Use a rank of 16, as set in the configuration below: for a frozen weight matrix with dimensions d by k, LoRA trains two small matrices with dimensions d by 16 and 16 by k, whose product has the same dimensions as the frozen weights. Use task type **SEQ\_CLS (Sequence Classification)** for text classification tasks.
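The arithmetic behind this can be sketched in plain Python. The example below merges a tiny frozen matrix W with the product of two low-rank matrices B and A, scaled by alpha / r as in the standard LoRA formulation, and counts how many parameters the adapter trains. The matrix values and the 4096 by 4096 layer size are illustrative assumptions, not values taken from the model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight used at inference."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def lora_param_count(d, k, r):
    """Trainable parameters for one adapted d-by-k weight: B is d x r, A is r x k."""
    return r * (d + k)


# Tiny illustrative example with d=2, k=3, r=1
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen weights, 2 x 3
B = [[1.0], [2.0]]                      # trainable, 2 x 1
A = [[0.5, 0.0, 1.0]]                   # trainable, 1 x 3
merged = lora_merge(W, A, B, alpha=2, r=1)
print(merged)

# Assumed layer size of 4096 x 4096 with rank 16
print(lora_param_count(4096, 4096, 16), "trainable vs", 4096 * 4096, "frozen")
```

The merged matrix has the same shape as W, while the adapter for the assumed 4096 by 4096 layer trains well under one percent of that layer's parameters.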


In [20]:
lora_config = LoraConfig(
    r=16,  # Rank of the adapter matrices (reduced from 32 to 16 to lower GPU memory usage)
    lora_alpha=32,  # Scaling factor applied to the adapter update
    target_modules=modules,  # List of modules to apply the LoRA adapter
    lora_dropout=0.05,  # Dropout rate for the adapter
    bias="none",  # Do not train any bias parameters
    task_type="SEQ_CLS"  # Task type (sequence classification in this case)
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 55,715,840 || all params: 13,851,796,480 || trainable%: 0.4022


### Define trainer arguments

In [21]:
training_args = TrainingArguments(
    logging_steps=10,  # Log the training loss every 10 steps
    output_dir="epoch_weights",  # Output directory for checkpoints
    learning_rate=5e-5,  # Learning rate for the optimizer (raised from 2e-5)
    per_device_train_batch_size=10,  # Training batch size per device (reduced from 20 to lower GPU memory usage)
    per_device_eval_batch_size=10,  # Evaluation batch size per device (reduced from 20 for the same reason)
    num_train_epochs=4,  # Maximum number of training epochs; early stopping may end training sooner
    weight_decay=0.01,  # Weight decay for regularization
    evaluation_strategy='epoch',  # Evaluate after each epoch
    save_strategy="epoch",  # Save model checkpoints after each epoch
    load_best_model_at_end=True,  # Load the best model based on the chosen metric
    push_to_hub=False,  # Disable pushing the model to the Hugging Face Hub
    report_to="none",  # Disable logging to Weights & Biases and other trackers
    metric_for_best_model='eval_loss')  # Metric for selecting the best model
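As a quick sanity check on these settings, the number of optimizer steps can be estimated with simple arithmetic. This sketch assumes a single device, no gradient accumulation, the full 1200 training messages, and that early stopping does not end training sooner; the helper training_steps is hypothetical.

```python
import math


def training_steps(n_examples, batch_size, epochs):
    """Return (steps per epoch, total steps) for one device without accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = training_steps(n_examples=1200, batch_size=10, epochs=4)
print(steps_per_epoch, total_steps)
```

With logging every 10 steps, that estimate corresponds to a dozen logged loss values per epoch.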



### Early stopping
Automate epoch determination to prevent overfitting by stopping training if the validation performance doesn’t improve for a certain number of epochs.

In [22]:
early_stop = EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=.0)
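The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea for a lower-is-better metric such as validation loss, not the library's implementation; the function name and the loss values are made up.

```python
def epochs_until_stop(eval_losses, patience=1, threshold=0.0):
    """Return the number of epochs run before patience-based stopping triggers."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if best - loss > threshold:
            # The loss improved by more than the threshold: reset the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(eval_losses)


# Made-up validation losses, one per epoch
losses = [0.60, 0.52, 0.55, 0.50]
print(epochs_until_stop(losses, patience=1))
print(epochs_until_stop(losses, patience=2))
```

With a patience of 1, a single epoch without improvement ends training, so a later recovery in the loss is never seen; a larger patience tolerates such a dip.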

### Define trainer and train

In [None]:
# Make sure the model recognizes the pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id

trainer = Trainer(
    model=model,  # The LoRA-adapted model
    args=training_args,  # Training arguments
    train_dataset=ds_tokenized["train"],  # Training dataset
    eval_dataset=ds_tokenized["val"],  # Evaluation dataset
    tokenizer=tokenizer,  # Tokenizer for processing text
    data_collator=data_collator,  # Data collator for preparing batches
    compute_metrics=compute_metrics,  # Function to calculate evaluation metrics
    callbacks=[early_stop],  # Early stopping callback
)

trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


In [None]:
# Calculate and print the elapsed time
elapsed_time = time.time() - start_time
print(f"Time required by training: {elapsed_time:.4f} seconds")

### Save model and tokenizer

In [None]:
# Define the directory to save the model and tokenizer
save_directory = "./fine_tuned_bert_model"

# Save the model and tokenizer
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

print(f"Model and tokenizer saved to {save_directory}")

### Starting time for Inference

In [None]:
# Set start time for Inference
start_time = time.time()

### Load model and tokenizer

In [None]:
# Reload the model and tokenizer. Saving a PEFT model stores the LoRA adapter rather than
# the full model, so this relies on the PEFT integration in Transformers to load the
# base model named in the adapter config and attach the adapter.
reloaded_model = AutoModelForSequenceClassification.from_pretrained(save_directory)
reloaded_tokenizer = AutoTokenizer.from_pretrained(save_directory)

### Inference on Test data

In [None]:
def predict(input_text):
    """
    Predicts the probability that a message expresses stress.
    Args:
        input_text (str): The message text to classify.
    Returns:
        float: The predicted probability of class 1 ("Stressed").
    """
    # Tokenize to PyTorch tensors and move them to the same device as the model
    inputs = reloaded_tokenizer(input_text, return_tensors="pt", truncation=True)
    inputs = inputs.to(reloaded_model.device)
    with torch.no_grad():
        logits = reloaded_model(**inputs).logits  # Get the model's output logits
    # Softmax over the two class logits gives [prob of 0, prob of 1], which sum to 1
    y_prob = torch.softmax(logits, dim=-1)[0].tolist()
    return y_prob[1]  # Return prob of 1

# Apply and store predicted probability
df_test = pd.DataFrame(ds['test'])
df_test['pred_prob'] = df_test['body'].map(predict)
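For a classification head with two logits, one common way to turn the logits into class probabilities is a softmax across the two values. The sketch below, in plain Python with made-up logits, shows that the softmax probabilities sum to 1 and that the class 1 probability equals a sigmoid of the difference between the two logits, whereas applying a sigmoid to each logit separately does not give values that sum to 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits for one message: [class 0, class 1]
logits = [0.4, 1.9]

probs = softmax(logits)
independent = [sigmoid(x) for x in logits]

print("softmax:", probs, "sum =", sum(probs))
print("per-logit sigmoid:", independent, "sum =", sum(independent))
print("sigmoid of logit difference:", sigmoid(logits[1] - logits[0]))
```

This matters when a probability threshold is tuned afterwards, because the threshold is only meaningful if the score behaves like a probability of class 1.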

In [None]:
# Calculate and print the elapsed time
elapsed_time = time.time() - start_time
print(f"Time required by inference: {elapsed_time:.4f} seconds")

### Choose the best threshold for class 1 by maximizing F1

In [None]:
from sklearn.metrics import f1_score, roc_auc_score, precision_recall_curve

# Get true labels and predicted probabilities
y_true = df_test['label'].values
y_probs = df_test['pred_prob'].values

# Search for best threshold by maximizing F1-score
# (use >= so the comparison matches the one used to assign labels later)
thresholds = np.arange(0.0, 1.01, 0.01)
f1_scores = [f1_score(y_true, y_probs >= t) for t in thresholds]
best_threshold = thresholds[np.argmax(f1_scores)]

print(f"Best threshold based on F1-score: {best_threshold:.2f}")
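The same sweep can be traced by hand on a toy example. The sketch below is a plain-Python version with made-up labels and probabilities and hypothetical helper names; like np.argmax, it keeps the first threshold that reaches the maximum F1.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_probs, step=0.01):
    """Sweep thresholds from 0 to 1 and return (threshold, F1) with the highest F1."""
    best_t, best_f1 = 0.0, -1.0
    n_steps = int(round(1 / step))
    for i in range(n_steps + 1):
        t = i * step
        preds = [int(p >= t) for p in y_probs]
        score = f1_binary(y_true, preds)
        if score > best_f1:  # strict comparison keeps the first maximum
            best_t, best_f1 = t, score
    return best_t, best_f1


# Made-up labels and class 1 probabilities
toy_true = [0, 0, 1, 1, 1]
toy_probs = [0.125, 0.435, 0.375, 0.815, 0.935]

best_t, best_f1 = best_f1_threshold(toy_true, toy_probs)
print(f"best threshold: {best_t:.2f}, F1: {best_f1:.3f}")
```

Note that a threshold tuned on the test set makes the test metrics optimistic; tuning it on the validation split would keep the test set untouched.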



### Classification metrics on Test data

In [None]:
from sklearn.metrics import classification_report

# Convert probabilities to labels based on best threshold
df_test['y_pred'] = (df_test['pred_prob'] >= best_threshold).astype(int)

# Print classification report
print(classification_report(y_true, df_test['y_pred']))



#### Precision and recall versus threshold

In [None]:
import matplotlib.pyplot as plt

precision, recall, pr_thresholds = precision_recall_curve(y_true, y_probs)

plt.plot(pr_thresholds, precision[:-1], label='Precision')
plt.plot(pr_thresholds, recall[:-1], label='Recall')
plt.axvline(best_threshold, color='r', linestyle='--', label=f'Best threshold: {best_threshold:.2f}')
plt.xlabel("Threshold")
plt.ylabel("Score")
plt.title("Precision-Recall vs Threshold")
plt.legend()
plt.grid()
plt.show()


#### ROC curve and AUC

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Compute the ROC curve from the true labels and the class 1 probabilities
fpr, tpr, _ = roc_curve(y_true, y_probs)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='grey', linestyle='--')  # Diagonal line for random chance
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

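### Example Inferences

In [None]:
# Attach the thresholded predictions to the original test rows (same row order as ds['test'])
test_df = test_df.reset_index(drop=True)
test_df['pred'] = df_test['y_pred'].to_numpy()

# Show the correctly classified messages
correct = test_df[test_df['label'] == test_df['pred']]
print(correct.shape)
correct[['text', 'label', 'pred']].head(100)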