# Create your dataset for fine-tuning
Fine-tuning is the process of taking a pre-trained model and training it further on a domain-specific dataset. The image below illustrates what a typical fine-tuning process might look like.

Before fine-tuning can begin, however, the dataset needs to be prepared in a form the model can ingest. This notebook therefore explains chat templates and data storage formats to help you prepare your dataset for fine-tuning.




![Fine_tuning.png](./Fine_tuning.png "Fine_tuning.png")

## Pre-requisites

Before continuing, you will need a Hugging Face account. You can create one at https://huggingface.co/.

Next, you will need access to Llama 3.2 1B, which is the model we will use for this task. Request access at https://huggingface.co/meta-llama/Llama-3.2-1B.

Once you reach the page, complete the required form (do not mention that you are affiliated with Accenture; use a university instead, for example).

Once you have your Hugging Face account, create an access token. Open your profile menu at the top right of the page and select "Access Tokens". Once the token is created, store it in a text file on your local machine.

## Install and import libraries
Let's install and import the required dependencies:

In [1]:
!pip install transformers datasets bitsandbytes peft trl accelerate torch 


Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting peft
  Downloading peft-0.15.2-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Downloading trl-0.16.1-py3-none-any.whl.metadata (12 kB)
Collecting accelerate
  Downloading accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2024.11.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tok

In [None]:
%restart_python

In [2]:

import os

import numpy as np
import pandas as pd
import torch
from datasets import load_dataset, Dataset  # load datasets from Hugging Face
from tokenizers import AddedToken
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, DataCollatorForSeq2Seq, LlamaTokenizerFast, LlamaTokenizer)
from trl import SFTConfig, SFTTrainer

## Load the dataset from HuggingFace

To make this process concrete, we will define a use case. We will work with a dataset of medical questions paired with step-by-step reasoning and final answers, which allows a model to learn medical jargon and reasoning that it otherwise may not handle well.

Hugging Face hosts many such open-source datasets, so we will work with one that is readily available on the platform.

Link to dataset: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT
 

In [3]:
### Load the dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", name="en", split="train")
print("Number of samples in the dataset: {}".format(len(dataset)))

README.md:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

medical_o1_sft.json:   0%|          | 0.00/74.1M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Number of samples in the dataset: 25371


In [4]:
print(dataset)

Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 25371
})


## ALPACA - A format to store datasets 

We will organise the dataset along the lines of the Alpaca data format, a specific structure used to store data for fine-tuning large language models. A dataset in the Alpaca format needs three things for each data sample:

  1. instruction: A string that describes the task the model should perform.
  2. input: Additional context or information (can be empty).
  3. output: The desired response from the model.

This is also a single-turn dataset: each sample consists of one input prompt and one output, which together make up a single exchange.
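As a minimal sketch of the three fields described above, one Alpaca-style sample can be written as a plain dictionary. The record below is made up for illustration and is not taken from the dataset, and `is_alpaca_record` is a hypothetical helper rather than part of any library.

```python
import json

# Hypothetical record, made up for illustration; not taken from the dataset.
record = {
    "instruction": "Answer the following medical question.",
    "input": "What is the normal resting heart rate for an adult?",
    "output": "Roughly 60 to 100 beats per minute.",
}

REQUIRED_KEYS = ("instruction", "input", "output")


def is_alpaca_record(sample: dict) -> bool:
    """Return True if the sample has exactly the three Alpaca fields, all strings."""
    return set(sample) == set(REQUIRED_KEYS) and all(
        isinstance(sample[key], str) for key in REQUIRED_KEYS
    )


print(is_alpaca_record(record))
print(json.dumps(record, indent=2))
```

The `input` field may be an empty string when the instruction alone is enough, which is why the check only requires it to be a string.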

Let's view the dataset in a pandas dataframe:



In [5]:
# Copy each column of the Hugging Face dataset into a pandas DataFrame
data_frame = pd.DataFrame()
column_names = ["Question", "Complex_CoT", "Response"]
for column in column_names:
    data_frame[column] = list(dataset[column])

**In this instance, the Question and Response columns are the input and output elements of the Alpaca format, and the Complex_CoT column holds the reasoning that leads to the response. The instruction element is baked into the prompt template. So follow along...**

In [6]:

data_frame.head()

| | Question | Complex_CoT | Response |
|---|---|---|---|
| 0 | A 61-year-old woman with a long history of inv... | Okay, let's think about this step by step. The... | Cystometry in this case of stress urinary inco... |
| 1 | A 45-year-old man with a history of alcohol us... | Alright, let’s break this down. We have a 45-y... | Considering the clinical presentation of sudde... |
| 2 | A 45-year-old man presents with symptoms inclu... | Okay, so here's a 45-year-old guy who's experi... | Based on the clinical findings presented—wide-... |
| 3 | A patient with psoriasis was treated with syst... | I'm thinking about this patient with psoriasis... | The development of generalized pustules in a p... |
| 4 | What is the most likely diagnosis for a 2-year... | Okay, so we're dealing with a 2-year-old child... | Based on the described symptoms and the unusua... |


In [6]:
### turn the above back to a hugging face dataset
dataset_hf = Dataset.from_pandas(data_frame)

In [7]:
print(dataset_hf)

Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 25371
})


In [8]:
input_1 = dataset_hf['Question']
print(input_1[0])

A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?


## Chat template 
Once you have your dataset ready, the next step is to convert the datapoints into the chat template the model understands. The catch here is that different models expect very different input formats for chat. Chat templates are part of the tokenizer for text-only LLMs, or of the processor for multimodal LLMs. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.

Chat templates include special tokens that the model uses to mark when a speaker has finished or when the conversation has ended.

Instruction-tuned models usually ship a chat template with their Hugging Face tokenizer. Base models such as the one used here typically do not, and the class-level default templates that the library used to fall back on have been deprecated. Hence, we will have to create the template manually to train the model. The template is also where we will specify the instruction element.
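To make the idea of "a list of messages becomes one string" concrete, here is a minimal pure-Python sketch of a ChatML-style layout. `render_chatml` is a hypothetical helper written for illustration; it is not the tokenizer's own implementation, and the prompt used later in this notebook follows an Alpaca-style layout instead.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Join a list of {"role", "content"} messages into one ChatML-style string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in a start marker, the role name, and an end marker.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a chat template?"},
]
print(render_chatml(conversation, add_generation_prompt=True))
```

The special tokens are what let the model tell where one speaker's turn ends and the next begins.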

Let's load the tokenizer first:

In [None]:
## insert access token 
os.environ['HF_TOKEN'] = ""
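Rather than pasting the token into the notebook, it can be set in the environment beforehand and checked before use. The sketch below is one possible approach using only the standard library; `get_hf_token` is a hypothetical helper, not a Hugging Face function.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the access token from the environment and fail early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before loading gated models.")
    return token


try:
    token = get_hf_token()
    print("Token found.")
except RuntimeError as err:
    print(err)
```

Failing early here gives a clearer message than an authorization error raised later during the download.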

**Since we are going from a base model to an instruct model, we need to modify the tokenizer to incorporate the special tokens used in the ChatML template.** 

In [10]:
### Load the tokenizer
model_name = "meta-llama/Llama-3.2-1B" 
tokenizer = AutoTokenizer.from_pretrained(model_name, token = os.environ['HF_TOKEN'])

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

**Let's view the special tokens currently present in the tokenizer:**

In [11]:
print(tokenizer.all_special_tokens)

['<|begin_of_text|>', '<|end_of_text|>']


**The ChatML template uses the '<|im_start|>' and '<|im_end|>' tokens instead of '<|begin_of_text|>' and '<|end_of_text|>'. Let's update the tokenizer's vocabulary accordingly!**

In [12]:
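# Map the BOS/EOS roles to the ChatML tokens and register the reasoning/answer tags
tokenizer_special_tokens_map = {
    'bos_token': '<|im_start|>',
    'eos_token': '<|im_end|>',
    'additional_special_tokens': ["<answer>", "</answer>", "<think>", "</think>"],
}

## Update the tokenizer
## Note: tokens that are new to the vocabulary require resizing the model's embeddings to len(tokenizer) before training
tokenizer.add_special_tokens(tokenizer_special_tokens_map)

# Reuse the EOS token for padding and pad on the right
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right"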
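The effect of right padding with a pad token can be pictured with plain lists. This is a minimal sketch in pure Python, not the tokenizer's implementation; `pad_right` is a hypothetical helper and the token ids are made up.

```python
def pad_right(sequences, pad_id):
    """Pad lists of token ids on the right; return (padded ids, attention masks)."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that should be ignored.
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# Made-up token ids; 99 stands in for the id of the pad/eos token.
ids, mask = pad_right([[5, 6, 7], [8]], pad_id=99)
print(ids)   # [[5, 6, 7], [8, 99, 99]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Because the pad token is the same as the EOS token here, the attention mask is what distinguishes padding from a genuine end-of-sequence marker.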

**Since the aim of this notebook is to demonstrate the process, we will train the model on the first 10,000 samples only. Let's define a mapping function to convert the datapoints into a template the model can easily ingest. Notice that the instruction element is written directly into the template.**

In [13]:
def mapping_func_2(example):

    prompt_string = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. 

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{Question}
### Response:
<think>{Reasoning}</think><answer>{Response}</answer>""".format(Question =example['Question'], Reasoning=example['Complex_CoT'], Response= example['Response']) 

    return {'prompt': prompt_string}

### Take the first 10,000 samples for training and validating model performance. It is also best to shuffle the dataset.
hf_dataset = Dataset.from_dict(dataset_hf[:10000]).shuffle(seed = 1234)

### Use the .map function to apply the mapping function to each element of the dataset. We also drop the original "Question", "Complex_CoT" and "Response" columns to save memory when the data gets loaded
hf_dataset= hf_dataset.map(mapping_func_2).remove_columns(["Question", "Complex_CoT", "Response"])

### Create the training and evaluation datasets
hf_dataset = hf_dataset.train_test_split(test_size = 0.1)
training_data = hf_dataset['train']
evaluation_data = hf_dataset['test']

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]
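Because each response in the prompt is wrapped in `<think>` and `<answer>` tags, the reasoning and the final answer can be separated again later, for example when inspecting generations. The sketch below assumes the tags appear once and in that order; `split_reasoning` is a hypothetical helper and the sample string is made up.

```python
import re

# Non-greedy groups so each tag pair captures only its own content.
TAGGED_RESPONSE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a tagged response, or None if the tags are missing."""
    match = TAGGED_RESPONSE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


sample = "<think>Check the symptoms first.</think><answer>Stress incontinence.</answer>"
print(split_reasoning(sample))
```

Returning `None` for untagged text makes it easy to count how many generations failed to follow the expected format.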

In [14]:
print(len(training_data))

9000
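The count above follows from holding out 10% of the 10,000 selected samples. The arithmetic can be reproduced with a small pure-Python sketch; `shuffled_split` is a hypothetical helper and does not reproduce the exact shuffling used by the `datasets` library.

```python
import random


def shuffled_split(items, test_size, seed):
    """Shuffle a copy of the items with a fixed seed and hold out a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train, test = shuffled_split(range(10000), test_size=0.1, seed=1234)
print(len(train), len(test))  # 9000 1000
```

Fixing the seed keeps the split reproducible, so evaluation numbers stay comparable across runs.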


**Let's look at what the template looks like for a datapoint:**

In [15]:

tokenised = tokenizer.encode(training_data['prompt'][0], return_tensors="pt")
untokenised = tokenizer.decode(tokenised[0])

In [16]:
print(untokenised)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. 

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
Given a series of real numbers where the series of absolute values diverges and the sum of the series equals 2, explain if it is possible to rearrange the terms such that the sum equals 4. Also, can you provide an example of metric spaces X and Y, where X is closed and bounded, and a continuous function from X to Y such that the image of X under this function is not closed and bounded?
### Response:
<think>Alright, let's dive into this puzzle, starting with the series of real numbers. We know the series converges to 2 but the series of absolute values diverges. That's a clear sign of conditional convergence. And guess what, th

In [19]:
print(tokenizer.all_special_tokens)

['<|im_start|>', '<|im_end|>', '<answer>', '</answer>', '<think>', '</think>']


In [20]:
print(tokenizer.get_vocab()['<|begin_of_text|>'])

128000


## Save the datasets
We will save the datasets to disk. Note that `save_to_disk` writes them in the Arrow format used by the `datasets` library rather than as JSON:

In [17]:
training_data.save_to_disk("training_data")
evaluation_data.save_to_disk("evaluation_data")

Saving the dataset (0/1 shards):   0%|          | 0/9000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]
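If a plain JSON file is wanted in addition to what `save_to_disk` writes, one common layout is JSON Lines, with one sample per line. The sketch below uses only the standard library on made-up rows; `write_jsonl` and `read_jsonl` are hypothetical helpers, and the file name is an assumption.

```python
import json
import os
import tempfile


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up rows standing in for the 'prompt' column of the training data.
rows = [{"prompt": "first example"}, {"prompt": "second example"}]
path = os.path.join(tempfile.mkdtemp(), "training_data.jsonl")
write_jsonl(rows, path)
print(read_jsonl(path) == rows)
```

JSON Lines is convenient for prompts that contain newlines, since each sample still occupies exactly one line of the file.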

In [18]:
import gc
gc.collect()

204