# Supervised Fine-Tuning with SFTTrainer

This notebook demonstrates how to fine-tune the `HuggingFaceTB/SmolLM2-135M` model using the `SFTTrainer` from the `trl` library. 

The `SFTTrainer` is a specialized trainer designed for Supervised Fine-Tuning (SFT) of LLMs. It simplifies the fine-tuning process by providing a user-friendly interface for training pretrained models with labeled data.

We want to fine tune the model for a summarization task.

> TIP: If you don't have a GPU, download the notebook and try it on Google Colab using T4.

In [1]:
# Install the requirements
# !pip install transformers datasets trl huggingface_hub

# Authenticate to Hugging Face
from huggingface_hub import notebook_login

notebook_login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

pyenv: version `3.11' is not installed (set by /Users/simonamazzarino/Documents/Clearbox-repo/clearbox-ai-academy/Fine-Tuning_LLM/.python-version)


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Set our name for the finetune to be saved &/ uploaded to
finetune_name = "SmolLM2-FT-MyDataset"


# Generate with the base model

Here we will try out the base model which does not have a chat template.

In [4]:
# Let's test the base model before training
prompt_user = """Hi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. 
                I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! 
                Let me know if you have time to chat further about this.\n\nBest,\nEmily"""
prompt_system = """Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns."""


# Format with template
messages = [{"role": "system", "content": prompt_system}, {"role": "user", "content": prompt_user}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=100).to(device)
tokenizer.decode(outputs[0], skip_special_tokens=True)

"system\nProvide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns.\nuser\nHi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. \n                I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! \n                Let me know if you have time to chat further about this.\n\nBest,\nEmily\nassistant\n\nHi Emily,\n\nI'm glad you're doing well! I'm glad you're having a great time with your student. I'm glad you're doing well. I'm glad you're having a great time with your student. I'm glad you're having a great time with your student. I'm glad you're having a great time with your student

## Dataset Preparation

We will load a sample dataset and format it for training. The dataset should be structured with input-output pairs, where each input is a prompt and the output is the expected response from the model.

**TRL will format input messages based on the model's chat templates.** They need to be represented as a list of dictionaries with the keys: `role` and `content`,.

In [5]:
# Load a sample dataset
from datasets import load_dataset

In [13]:
# TODO: define your dataset and config using the path and name parameters
train = load_dataset(path="HuggingFaceTB/smoltalk", name="smol-summarize", split="train").select(range(1000))
eval = load_dataset(path="HuggingFaceTB/smoltalk", name="smol-summarize", split="train").select(range(1000, 2000))
test = load_dataset(path="HuggingFaceTB/smoltalk", name="smol-summarize", split="train").select(range(10, 20))

README.md:   0%|          | 0.00/9.72k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/96356 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5072 [00:00<?, ? examples/s]

In [15]:
train['messages']

[[{'content': 'Extract and present the main key point of the input text in one very short sentence, including essential details like dates or locations if necessary.',
   'role': 'system'},
  {'content': "Hi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! Let me know if you have time to chat further about this.\n\nBest,\nEmily",
   'role': 'user'},
  {'content': 'Emily is seeking advice on strategies for a struggling reader in her class.',
   'role': 'assistant'}],
 [{'content': 'Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pr

In [18]:
test['messages']

[[{'content': 'Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns.',
   'role': 'system'},
  {'content': 'Hey Samantha,\n\nI hope this email finds you well. I was recently browsing through a magazine and came across an article about a new book that I thought you might find interesting. It\'s called "The Ponte Vecchio: A Bridge Through Time" and it explores the historical and cultural significance of the bridge in Florence, Italy.\n\nGiven your passion for bridge design and the historical contexts of notable bridges, I figured this would be right up your alley. I haven\'t read the book myself yet, but the article made it sound fascinating. It discusses the bridge\'s construction, its role in Florentine society over the centuries, and its impact on the city\'s culture and architecture.\n\nI was thinking it could be fun to read the book and then have a discussion about it, li

## Configuring the SFTTrainer

The `SFTTrainer` is configured with various parameters that control the training process. These include the number of training steps, batch size, learning rate, and evaluation strategy. Adjust these parameters based on your specific requirements and computational resources.

In [11]:
# Configure the SFTTrainer
sft_config = SFTConfig(
    output_dir="./sft_output",
    max_steps=2,  # Adjust based on dataset size and desired training duration
    per_device_train_batch_size=4,  # Set according to your GPU memory capacity
    learning_rate=5e-5,  # Common starting point for fine-tuning
    logging_steps=10,  # Frequency of logging training metrics
    save_steps=5,  # Frequency of saving model checkpoints
    evaluation_strategy="steps",  # Evaluate the model at regular intervals
    eval_steps=50,  # Frequency of evaluation
    use_mps_device=(
        True if device == "mps" else False
    ),  # Use MPS for mixed precision training
    hub_model_id=finetune_name,  # Set a unique name for your model
)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train,
    compute_metrics="acc",
    tokenizer=tokenizer,
    eval_dataset=eval,
)


  trainer = SFTTrainer(


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## Training the Model

With the trainer configured, we can now proceed to train the model. The training process will involve iterating over the dataset, computing the loss, and updating the model's parameters to minimize this loss.

In [None]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

## Let's generate with the fine-tuned model

In [None]:
model_ft = AutoModelForCausalLM.from_pretrained(f"./{finetune_name}")
tokenizer_ft = AutoTokenizer.from_pretrained(f"./{finetune_name}")

In [None]:
# Test the fine-tuned model on the same prompt

# Let's test the base model before training
prompt_user = """Hi Sarah,\n\nI hope you're doing well! I wanted to reach out because I've been struggling with a student in my class who is significantly behind in reading comprehension. 
                    I remember you mentioning some effective strategies during our last conversation, and I was wondering if you could share some resources or tips that might help me support this student better.\n\nAny advice would be greatly appreciated! 
                    Let me know if you have time to chat further about this.\n\nBest,\nEmily"""
prompt_system = """Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns."""


# Format with template
messages = [{"role": "system", "content": prompt_system}, {"role": "user", "content": prompt_user}]
formatted_prompt = tokenizer_ft.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
inputs = tokenizer_ft(formatted_prompt, return_tensors="pt")

outputs = model_ft.generate(**inputs, max_new_tokens=100).to(device)
print(tokenizer_ft.decode(outputs[0], skip_special_tokens=True))

## Let's try to evaluate our model!

In [25]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

tokenizer_config.json:   0%|          | 0.00/3.76k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/655 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/861 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/269M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

In [26]:
prompts_user = [message[1]['content'] for message in test['messages']]
prompts_system = [message[0]['content'] for message in test['messages']]

references = [message[2]['content'] for message in test['messages']]

In [33]:
from tqdm import tqdm

def generate_responses(prompts_system, prompts_user, model, tokenizer, device, max_tokens=100, temperature=0.7, top_p=0.9):
    """
    Generate high-quality responses using the model for given system and user prompts.

    Args:
        prompts_system (list): List of system prompts.
        prompts_user (list): List of user prompts.
        model: The fine-tuned model.
        tokenizer: The tokenizer for the model.
        device: Device to run the model on (e.g., "cpu" or "cuda").
        max_tokens (int): Maximum number of tokens to generate.
        temperature (float): Sampling temperature to control randomness.
        top_p (float): Nucleus sampling to limit token selection to a probability mass.

    Returns:
        list: High-quality generated responses for the prompts.
    """
    responses = []

    for prompt_sys, prompt_user in tqdm(zip(prompts_system, prompts_user)):
        # Prepare the chat template
        messages = [
            {"role": "system", "content": prompt_sys},
            {"role": "user", "content": prompt_user}
        ]
        formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        #print(formatted_prompt)

        # Tokenize input and generate response with adjusted generation parameters
        inputs = tokenizer(formatted_prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            eos_token_id=tokenizer.eos_token_id  # Ensure proper response termination
        ).to(device)

        # Decode the generated response
        response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
        if "assistant" in response:
            response = response.split("assistant", 1)[-1].strip()
        responses.append(response)
        #print(response)

    return responses


In [34]:
predictions = generate_responses(prompts_system, prompts_user, model, tokenizer, device)

10it [00:44,  4.41s/it]


In [46]:
# !pip install rouge bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.1/61.1 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bert_score
Successfully installed bert_score-0.3.13


In [41]:
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from bert_score import score

def evaluate_summarization(predictions, references):
    """
    Evaluate model's summarization predictions.

    Args:
        predictions (list of str): Model generated summaries.
        references (list of str): Ground truth summaries.

    Returns:
        dict: Evaluation metrics including BLEU, ROUGE, BERTScore, accuracy, precision, recall, and F1 score.
    """
    # Initialize metrics
    bleu_scores = []
    rouge = Rouge()
    rouge_scores = []

    # Calculate BLEU and ROUGE for each prediction-reference pair
    for pred, ref in zip(predictions, references):
        # BLEU score
        bleu = sentence_bleu([ref.split()], pred.split())
        bleu_scores.append(bleu)

        # ROUGE scores
        rouge_score = rouge.get_scores(pred, ref, avg=True)
        rouge_scores.append(rouge_score)


    # Average BLEU
    avg_bleu = sum(bleu_scores) / len(bleu_scores)

    # Average ROUGE scores
    avg_rouge = {
        "rouge-1": {
            "precision": sum(r["rouge-1"]["p"] for r in rouge_scores) / len(rouge_scores),
            "recall": sum(r["rouge-1"]["r"] for r in rouge_scores) / len(rouge_scores),
            "f1-score": sum(r["rouge-1"]["f"] for r in rouge_scores) / len(rouge_scores),
        },
        "rouge-2": {
            "precision": sum(r["rouge-2"]["p"] for r in rouge_scores) / len(rouge_scores),
            "recall": sum(r["rouge-2"]["r"] for r in rouge_scores) / len(rouge_scores),
            "f1-score": sum(r["rouge-2"]["f"] for r in rouge_scores) / len(rouge_scores),
        },
        "rouge-l": {
            "precision": sum(r["rouge-l"]["p"] for r in rouge_scores) / len(rouge_scores),
            "recall": sum(r["rouge-l"]["r"] for r in rouge_scores) / len(rouge_scores),
            "f1-score": sum(r["rouge-l"]["f"] for r in rouge_scores) / len(rouge_scores),
        },
    }

    # BERTScore
    P, R, F1 = score(predictions, references, lang="en", verbose=True)
    bert_score = {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1-score": F1.mean().item(),
    }

    return {
        "BLEU": avg_bleu,
        "ROUGE": avg_rouge,
        "BERTScore": bert_score,
    }



In [50]:
metrics = evaluate_summarization(predictions, references)
metrics


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

{'BLEU': 0.15751349064347586,
 'ROUGE': {'rouge-1': {'precision': 0.5441395701303071,
   'recall': 0.459400705453337,
   'f1-score': 0.48712794260023917},
  'rouge-2': {'precision': 0.28574310020904453,
   'recall': 0.2503303577043296,
   'f1-score': 0.25965590155104834},
  'rouge-l': {'precision': 0.5071550240896416,
   'recall': 0.43199135278082645,
   'f1-score': 0.4562931892174575}},
 'BERTScore': {'precision': 0.9149687886238098,
  'recall': 0.9027048349380493,
  'f1-score': 0.9087070226669312}}

In [38]:
predictions

['Alex found an article about a new book titled "The Ponte Vecchio: A Bridge Through Time" and is considering reading it. Alex believes the book could be a great addition to their book club discussion, focusing on the historical and cultural significance of the bridge. Alex is open to sharing the link to the article.',
 'Alex agrees with the approach to media coverage and emphasizes the importance of the interdisciplinary collaboration.',
 'Marcus is excited about the conference and suggests sharing sources on Reconstruction-era resistance.',
 'Jordan is enthusiastic about collaborating on a project that combines food waste and sustainability with Celtic practices. Jordan has been researching the similarities between sustainable farming techniques and Celtic agriculture and is available to schedule a video call next Tuesday and Thursday afternoons. Jordan is looking forward to the collaboration and the potential insights.',
 'Lisa Bratt, a 40-year-old fitness instructor from Hartlepool

In [39]:
references

['Alex discovered a new book titled "The Ponte Vecchio: A Bridge Through Time" and thinks it would be of interest due to a passion for bridge design and history. The book covers the bridge\'s construction, historical significance, and cultural impact in Florence. Alex proposes a book club-style discussion and offers to share a purchase link.',
 'Alex agrees to focus on the big picture for media coverage and will draft a summary to share with the journalist.',
 "Marcus is looking forward to the conference and dinner on Friday, and will bring research on post-Reconstruction black political organizing to compare with Jenna's findings.",
 'Jordan is enthusiastic about the collaboration opportunity and agrees that combining expertise on food waste and Celtic practices could be impactful. Jordan suggests scheduling a video call on Tuesday or Thursday next week to discuss ideas and outline a plan for the paper.',
 "Fitness instructor Lisa Bratt, 40, from Hartlepool, died suddenly after collap

## You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M` model using the `SFTTrainer`. By following these steps, you can adapt the model to perform specific tasks more effectively.