Fine-tuning Llama 3.2 with Supervised Fine-Tuning (SFT)
This notebook demonstrates how to fine-tune Llama 3.2 using supervised fine-tuning (SFT) to create an education chatbot. We will cover:
1. Loading and formatting a question-answering dataset
2. Applying an appropriate chat template
3. Setting up LoRA fine-tuning with special token training
4. Training the model
5. Testing the fine-tuned model

In [61]:
!pip3 install peft

Defaulting to user installation because normal site-packages is not writeable


In [62]:
!pip3 install trl

Defaulting to user installation because normal site-packages is not writeable


Load Dataset

In [None]:
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

#open a json file
with open('cbt_finetuning_dataset.json', 'r') as f:
    conversations = json.load(f)

# Check if the file is empty or the content is not as expected
if not conversations:
    print("The JSON file is empty or its content could not be loaded correctly.")
else:
    print(f"Loaded {len(conversations)} conversations. Preview of the first two conversations:")
    print(json.dumps(conversations[:2], indent=2))

#Process the conversations into a format suitable for training
processed_data = []

#Each conversation is assumed to be a list of {"role", "content"} messages
for conversation in conversations:
    #Extract the system message (the first one with role "system"), defaulting to ""
    system_msg = next((msg for msg in conversation if msg["role"] == "system"), {"content": ""})["content"]
    print(system_msg)

    #Pair each Patient message with the CBT Therapist reply that immediately follows it
    for i in range(len(conversation) - 1):
        #Compare roles case-insensitively: the data uses "Patient", not "patient"
        if conversation[i]["role"].lower() == "patient" and conversation[i + 1]["role"].lower() == "cbt therapist":
            processed_data.append({
                "system": system_msg,
                "question": conversation[i]["content"],
                "answer": conversation[i + 1]["content"]
            })

#Convert to a Hugging Face Dataset
from datasets import Dataset
import pandas as pd
dataset = Dataset.from_pandas(pd.DataFrame(processed_data))


#Define the format_with_chat_template function
#Note: it uses `tokenizer`, so run the "Load Model and Tokenizer" cell before calling it
def format_with_chat_template(example):
    messages = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['answer']}
    ]

    #Apply chat template without tokenizing
    formatted_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    return {"formatted_text": formatted_text}



[[{'role': 'system', 'content': 'You are a CBT therapist, You have extensive experience in helping patients cope with panic attacks, obsessive-compulsive disorder (OCD), and post-traumatic stress disorder (PTSD). You are well-versed in evidence-based treatments for these conditions, such as exposure therapy, cognitive restructuring, and relaxation techniques. You also have experience working with individuals who have experienced trauma, including those with complex PTSD and borderline personality disorder.'}, {'role': 'Patient', 'content': "I'm just so anxious all the time, I feel like I'm a total failure. I'll never be able to get my life together like Cartman's mom does."}, {'role': 'CBT Therapist', 'content': "Mmkay, so you're feeling like a total failure and you think you'll never get your life together. What makes you think that, what's the evidence for that thought, mmkay?"}, {'role': 'Patient', 'content': "Well, I just can't seem to hold down a job, I've been fired from three pl
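The pairing step in the loading cell can be checked in isolation on a toy conversation. The sketch below is a minimal, self-contained version of that logic, not the notebook's exact code: the helper name `extract_pairs`, its parameters, and the toy messages are all made up for illustration. It assumes each conversation is a list of `{"role", "content"}` dicts and compares roles case-insensitively, since the preview above shows `Patient` with a capital P.

```python
def extract_pairs(conversation, user_role="patient", assistant_role="cbt therapist"):
    """Return system/question/answer dicts for adjacent user -> assistant turns."""
    # Take the first system message if there is one, otherwise an empty string.
    system_msg = next(
        (m["content"] for m in conversation if m["role"].lower() == "system"), ""
    )
    pairs = []
    # Walk over consecutive message pairs: (msg0, msg1), (msg1, msg2), ...
    for current, following in zip(conversation, conversation[1:]):
        if (current["role"].lower() == user_role
                and following["role"].lower() == assistant_role):
            pairs.append({
                "system": system_msg,
                "question": current["content"],
                "answer": following["content"],
            })
    return pairs


# Toy conversation, invented for illustration (not taken from the dataset).
toy = [
    {"role": "system", "content": "You are a CBT therapist."},
    {"role": "Patient", "content": "I feel anxious before meetings."},
    {"role": "CBT Therapist", "content": "What thoughts come up right before them?"},
    {"role": "Patient", "content": "That I will say something wrong."},
]

pairs = extract_pairs(toy)
print(len(pairs))  # 1: the trailing Patient turn has no reply, so it is skipped
print(pairs[0]["question"])
```

A Patient message with no Therapist reply after it produces no training pair, which is usually what you want for SFT, since there is no target text to learn from.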

In [66]:
#Apply formatting to dataset
formatted_dataset = dataset.map(format_with_chat_template)

#Display one example (indexing the column instead would print every row)
print(formatted_dataset[0]['formatted_text'])

KeyboardInterrupt: 

Load Model and Tokenizer
We will load the Llama 3.2 3B base model and the tokenizer from its Instruct variant, which provides the chat template.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-3.2-3B"
tokenizer_name = model_name + "-Instruct" #We use Instruct tokenizer for its chat template

#Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

#Check if the model uses tied embeddings
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_name)
print(f"Model uses tied embeddings: {config.tie_word_embeddings}")

Model uses tied embeddings: True


Format Dataset with Chat Template
We'll apply the Llama 3.2 Instruct chat template to our dataset.
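To make the template's effect concrete, here is a rough hand-written approximation of the Llama 3 chat layout. It is a sketch for inspection only and not a replacement for `tokenizer.apply_chat_template`: the function name `render_llama3_chat` is hypothetical, and the real Instruct template may add extra content to the system block (for example date lines), so its output should not be expected to match this string exactly.

```python
def render_llama3_chat(messages):
    """Approximate the Llama 3 chat layout for a list of role/content dicts."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    return text


# Toy messages, invented for illustration.
example = [
    {"role": "system", "content": "You are a CBT therapist."},
    {"role": "user", "content": "I feel anxious before meetings."},
    {"role": "assistant", "content": "What thoughts come up right before them?"},
]

rendered = render_llama3_chat(example)
print(rendered)
```

The point to notice is that every turn, including the assistant's answer, ends with `<|eot_id|>`. With `add_generation_prompt=False`, as used in this notebook, nothing is appended after the final assistant turn, which is the layout wanted for training examples.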


In [None]:
#Function to format one processed example as a list of chat messages

system_prompt = "You are a CBT therapist. You are helping a patient with their mental health issues."
def format_data(example):
    #`example` is one entry of processed_data, a dict with "question" and "answer" keys.
    #Passing a raw conversation (a list of messages) here raises
    #"TypeError: list indices must be integers or slices, not str".
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['answer']}
    ]

    return messages

format_data(processed_data[1])

Tokenize Dataset


In [None]:
def tokenize_function(examples):
    return tokenizer(examples['formatted_text'], truncation=True, max_length=2048)

#Tokenize the dataset
tokenized_dataset = format