# Supervised Fine-Tuning with SFTTrainer

This notebook demonstrates how to fine-tune the `HuggingFaceTB/SmolLM2-135M` model using the `SFTTrainer` from the `trl` library. The notebook cells run and will finetune the model. 

We want to fine tune the model for a summarization task.

> TIP: If you don't have a GPU, download the notebook and try it on Google Colab using T4.

In [None]:
# Install the requirements
# !pip install transformers datasets trl huggingface_hub

# Authenticate to Hugging Face
from huggingface_hub import notebook_login

notebook_login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

pyenv: version `3.11' is not installed (set by /Users/simonamazzarino/Documents/Clearbox-repo/clearbox-ai-academy/Fine-Tuning_LLM/.python-version)


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [2]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Set our name for the finetune to be saved &/ uploaded to
finetune_name = "SmolLM2-FT-MyDataset"


# Generate with the base model

Here we will try out the base model which does not have a chat template.

In [4]:
# Let's test the base model before training

# Let's test the base model before training
prompt_user = """Hi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. 
                I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! 
                Let me know if you have time to chat further about this.\n\nBest,\nEmily"""
prompt_system = """Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns."""


# Format with template
messages = [{"role": "system", "content": prompt_system}, {"role": "user", "content": prompt_user}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=100).to(device)
tokenizer.decode(outputs[0], skip_special_tokens=True)

"system\nProvide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns.\nuser\nHi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. \n                I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! \n                Let me know if you have time to chat further about this.\n\nBest,\nEmily\nassistant\n\nHi Emily,\n\nI'm glad you're doing well! I'm glad you're having a great time with your student. I'm glad you're doing well. I'm glad you're having a great time with your student. I'm glad you're having a great time with your student. I'm glad you're having a great time with your student

## Dataset Preparation

We will load a sample dataset and format it for training. The dataset should be structured with input-output pairs, where each input is a prompt and the output is the expected response from the model.

**TRL will format input messages based on the model's chat templates.** They need to be represented as a list of dictionaries with the keys: `role` and `content`,.

In [5]:
# Load a sample dataset
from datasets import load_dataset

In [6]:
# TODO: define your dataset and config using the path and name parameters
train = load_dataset(path="HuggingFaceTB/smoltalk", name="smol-summarize", split="train").select(range(1000))
eval = load_dataset(path="HuggingFaceTB/smoltalk", name="smol-summarize", split="train").select(range(1000, 2000))
test = load_dataset(path="HuggingFaceTB/smoltalk", name="smol-summarize", split="train").select(range(10, 20))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Configuring the SFTTrainer

The `SFTTrainer` is configured with various parameters that control the training process. These include the number of training steps, batch size, learning rate, and evaluation strategy. Adjust these parameters based on your specific requirements and computational resources.

In [11]:
# Configure the SFTTrainer
sft_config = SFTConfig(
    output_dir="./sft_output",
    max_steps=2,  # Adjust based on dataset size and desired training duration
    per_device_train_batch_size=4,  # Set according to your GPU memory capacity
    learning_rate=5e-5,  # Common starting point for fine-tuning
    logging_steps=10,  # Frequency of logging training metrics
    save_steps=5,  # Frequency of saving model checkpoints
    evaluation_strategy="steps",  # Evaluate the model at regular intervals
    eval_steps=50,  # Frequency of evaluation
    use_mps_device=(
        True if device == "mps" else False
    ),  # Use MPS for mixed precision training
    hub_model_id=finetune_name,  # Set a unique name for your model
)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train,
    compute_metrics="acc",
    tokenizer=tokenizer,
    eval_dataset=eval,
)


  trainer = SFTTrainer(


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## Training the Model

With the trainer configured, we can now proceed to train the model. The training process will involve iterating over the dataset, computing the loss, and updating the model's parameters to minimize this loss.

In [None]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

## Let's generate with the fine-tuned model

In [None]:
model_ft = AutoModelForCausalLM.from_pretrained(f"./{finetune_name}")
tokenizer_ft = AutoTokenizer.from_pretrained(f"./{finetune_name}")

In [None]:
# Test the fine-tuned model on the same prompt

# Let's test the base model before training
prompt_user = """Hi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. 
                    I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! 
                    Let me know if you have time to chat further about this.\n\nBest,\nEmily"""
prompt_system = """Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns."""


# Format with template
messages = [{"role": "system", "content": prompt_system}, {"role": "user", "content": prompt_user}]
formatted_prompt = tokenizer_ft.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
inputs = tokenizer_ft(formatted_prompt, return_tensors="pt")

outputs = model_ft.generate(**inputs, max_new_tokens=100).to(device)
print(tokenizer_ft.decode(outputs[0], skip_special_tokens=True))

## Let's try to evaluate our model!

In [28]:
# Load model directly
#from transformers import AutoTokenizer, AutoModelForCausalLM

#tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
#model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

In [29]:
prompts_user = [message[1]['content'] for message in test['messages']]
prompts_system = [message[0]['content'] for message in test['messages']]

references = [message[2]['content'] for message in test['messages']]

In [11]:
from tqdm import tqdm

def generate_responses(prompts_system, prompts_user, model, tokenizer, max_tokens=100, temperature=0.7, top_p=0.9):
    """
    Generate high-quality responses using the model for given system and user prompts.

    Args:
        prompts_system (list): List of system prompts.
        prompts_user (list): List of user prompts.
        model: The fine-tuned model.
        tokenizer: The tokenizer for the model.
        device: Device to run the model on (e.g., "cpu" or "cuda").
        max_tokens (int): Maximum number of tokens to generate.
        temperature (float): Sampling temperature to control randomness.
        top_p (float): Nucleus sampling to limit token selection to a probability mass.

    Returns:
        list: High-quality generated responses for the prompts.
    """
    responses = []

    for prompt_sys, prompt_user in tqdm(zip(prompts_system, prompts_user)):
        # Prepare the chat template
        messages = [
            {"role": "system", "content": prompt_sys},
            {"role": "user", "content": prompt_user}
        ]
        formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        #print(formatted_prompt)

        # Tokenize input and generate response with adjusted generation parameters
        inputs = tokenizer(formatted_prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            eos_token_id=tokenizer.eos_token_id  # Ensure proper response termination
        ).to(device)

        # Decode the generated response
        response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
        if "assistant" in response:
            response = response.split("assistant", 1)[-1].strip()
        responses.append(response)
        #print(response)

    return responses


In [None]:
predictions = generate_responses(prompts_system, prompts_user, model_ft, tokenizer_ft, device)

In [46]:
# !pip install rouge bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.1/61.1 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bert_score
Successfully installed bert_score-0.3.13


In [52]:
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from bert_score import score

def evaluate_summarization(predictions, references):
    """
    Evaluate model's summarization predictions.

    Args:
        predictions (list of str): Model generated summaries.
        references (list of str): Ground truth summaries.

    Returns:
        dict: Evaluation metrics including BLEU, ROUGE, BERTScore, accuracy, precision, recall, and F1 score.
    """
    # Initialize metrics
    bleu_scores = []
    rouge = Rouge()
    rouge_scores = []

    # Calculate BLEU and ROUGE for each prediction-reference pair
    for pred, ref in zip(predictions, references):
        # BLEU score
        bleu = sentence_bleu([ref.split()], pred.split())
        bleu_scores.append(bleu)

        # ROUGE scores
        rouge_score = rouge.get_scores(pred, ref, avg=True)
        rouge_scores.append(rouge_score)


    # Average BLEU
    avg_bleu = sum(bleu_scores) / len(bleu_scores)

    # Average ROUGE scores
    avg_rouge = {
        "rouge-1": {
            "precision": sum(r["rouge-1"]["p"] for r in rouge_scores) / len(rouge_scores),
            "recall": sum(r["rouge-1"]["r"] for r in rouge_scores) / len(rouge_scores),
            "f1-score": sum(r["rouge-1"]["f"] for r in rouge_scores) / len(rouge_scores),
        },
        "rouge-2": {
            "precision": sum(r["rouge-2"]["p"] for r in rouge_scores) / len(rouge_scores),
            "recall": sum(r["rouge-2"]["r"] for r in rouge_scores) / len(rouge_scores),
            "f1-score": sum(r["rouge-2"]["f"] for r in rouge_scores) / len(rouge_scores),
        },
        "rouge-l": {
            "precision": sum(r["rouge-l"]["p"] for r in rouge_scores) / len(rouge_scores),
            "recall": sum(r["rouge-l"]["r"] for r in rouge_scores) / len(rouge_scores),
            "f1-score": sum(r["rouge-l"]["f"] for r in rouge_scores) / len(rouge_scores),
        },
    }

    # BERTScore
    P, R, F1 = score(predictions, references, lang="en", verbose=True)
    bert_score = {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1-score": F1.mean().item(),
    }

    return {
        "BLEU": avg_bleu,
        "ROUGE": avg_rouge,
        "BERTScore": bert_score,
    }


metrics = evaluate_summarization(predictions, references)

for metric, value in metrics.items():
    print(f"{metric}: {value}")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 41.84 seconds, 0.24 sentences/sec
BLEU: 0.14746374899144937
ROUGE: {'rouge-1': {'precision': 0.5595154106299617, 'recall': 0.4909510546878969, 'f1-score': 0.5074541755492266}, 'rouge-2': {'precision': 0.3127246954969799, 'recall': 0.2666485760694155, 'f1-score': 0.27924661142140517}, 'rouge-l': {'precision': 0.5232818230960645, 'recall': 0.46031909247698727, 'f1-score': 0.4747369019280844}}
BERTScore: {'precision': 0.9301093816757202, 'recall': 0.9150726199150085, 'f1-score': 0.9223594665527344}


In [49]:
predictions

['Emily is seeking advice on supporting a struggling student in reading comprehension.',
 'Alex has reviewed the data and notes the striking cross-national differences in youth political engagement, particularly the higher participation in emerging democracies. He also notes the gender gap in political interest and efficacy, suggesting a potential link to the "critical elections" idea. Alex proposes focusing on critical elections and formative experiences to enhance the paper\'s theoretical framework.',
 'Jordan agrees with the recommendation for "English Sounds" and will check it out. Jordan also suggests having students record themselves and proposes a virtual coffee chat next Wednesday. Jordan is excited about the idea of a workshop for the conference and will brainstorm a title or outline before the chat.',
 "Michael reviewed the draft script and promotional materials, and they are well-received. Michael will send the FAQ sheet and educational materials by tomorrow for review. Mich

In [50]:
references

['Emily is seeking advice on strategies for a struggling reader in her class.',
 'Alex shares initial thoughts on the data, highlighting striking cross-national differences in youth political engagement, particularly in emerging democracies. Alex also notes a surprising gender gap in political interest and efficacy and a weaker-than-expected relationship between political discussion frequency and turnout. Alex suggests focusing on critical elections and formative experiences in the theoretical framework and is open to starting the paper outline based on these ideas.',
 'Jordan appreciates the recommendation for "English Sounds" and the idea of student self-recording. Jordan confirms availability for a virtual coffee chat next Wednesday after 2 pm Eastern and expresses enthusiasm for proposing a workshop at the conference, suggesting a format for the session and inviting further brainstorming before the chat.',
 'Michael has reviewed the draft script and promotional materials for the va

## You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M` model using the `SFTTrainer`. By following these steps, you can adapt the model to perform specific tasks more effectively.