# Create your dataset for fine-tuning
Fine-tuning is the process of taking a pre-trained model and further training it on a domain-specific dataset. The image below illustrates what a typical fine-tuning process might look like.

Before we begin fine-tuning, the dataset needs to be prepared in a format the model can ingest. This notebook explains chat templates and data storage formats to help you prepare your dataset for fine-tuning.




![Fine_tuning.png](./Fine_tuning.png "Fine_tuning.png")

## Prerequisites

Before continuing, you need a Hugging Face account. You can create one at https://huggingface.co/.

Next, you need access to Llama 3.2 1B, the model we will use for this task: https://huggingface.co/meta-llama/Llama-3.2-1B

Once you reach the model page, complete the required access form (Do not mention that you are affiliated with Accenture! Use a random university maybe).

Once you have your Hugging Face account, create an access token. Go to your profile at the top right of the page and select "Access Tokens". Once the token is created, you can store it in a notepad on your local machine.

## Install and import libraries
Let's install and import the required dependencies:

In [None]:
!pip install transformers datasets bitsandbytes peft trl accelerate torch 
# !pip install "unsloth[cu124-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting peft
  Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Downloading huggingface_hub-0.29.3-py3-none-any.whl.metadata (13 kB)
Collecting accelerate
  Downloading accelerate-1.4.0-py3-none-any.whl.metadata (19 kB)
Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting transformers
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.0-cp39-abi3-manylinux_

In [None]:
%restart_python

In [None]:

import os

import numpy as np
import pandas as pd
import torch
from datasets import Dataset, load_dataset  # load datasets from Hugging Face
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForSeq2Seq,
    LlamaTokenizer,
    LlamaTokenizerFast,
    TrainingArguments,
)
from trl import SFTConfig, SFTTrainer

# from unsloth.chat_templates import get_chat_template

2025-03-11 12:01:35.227959: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1741694495.237516    2162 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741694495.241003    2162 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-11 12:01:35.254250: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Load the dataset from HuggingFace

To make the process concrete, we will define a use case. We will work with a medical question-answering dataset, which lets a model learn medical terminology that it might otherwise not handle well.

Hugging Face hosts many open-source datasets like this, so we will use one that is readily available on the platform.

Link to dataset: https://huggingface.co/datasets/keivalya/MedQuad-MedicalQnADataset
 

In [None]:
### load the dataset 
dataset = load_dataset("keivalya/MedQuad-MedicalQnADataset", split="train")
print("Number of samples in the dataset: {}".format(len(dataset)))



README.md:   0%|          | 0.00/233 [00:00<?, ?B/s]



medDataset_processed.csv:   0%|          | 0.00/22.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16407 [00:00<?, ? examples/s]

Number of samples in the dataset: 16407


In [None]:
print(dataset)

Dataset({
    features: ['qtype', 'Question', 'Answer'],
    num_rows: 16407
})


## ALPACA - A format to store datasets 

We will store the dataset in the Alpaca data format, a specific structure used to store data for fine-tuning large language models. Each data sample in the Alpaca format has three fields:

  1. instruction: A string that describes the task the model should perform.
  2. input: Additional context or information (can be empty).
  3. output: The desired response from the model.

This is also a single-turn dataset: each sample consists of one input prompt and one output, representing a single interaction.
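As a rough illustration of the three fields listed above, a single Alpaca-style record can be written as a plain Python dictionary. This is only a sketch: the instruction string is the one used later in this notebook, the question comes from this dataset, and the output value is a placeholder rather than real data.

```python
import json

# One single-turn sample in the Alpaca layout.
# The "output" value is a placeholder, not a real answer from the dataset.
alpaca_record = {
    "instruction": "Answer the following question truthfully.",
    "input": "Who is at risk for Lymphocytic Choriomeningitis (LCM)?",
    "output": "<the reference answer from the dataset goes here>",
}

# Alpaca-style files are commonly stored as a JSON list of such records.
print(json.dumps([alpaca_record], indent=2))
```

The `input` field may be an empty string when the instruction alone fully describes the task.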

Let's view the dataset in a pandas DataFrame:



In [None]:
# Copy only the Question and Answer columns into a pandas DataFrame
column_names = ["Question", "Answer"]
data_frame = pd.DataFrame({name: list(dataset[name]) for name in column_names})

**In this instance, the Question and Answer columns correspond to the input and output elements of the Alpaca format. The instruction element is baked into the prompt template, as shown later in this notebook.**

In [None]:

data_frame.head()

|   | Question | Answer |
|---|----------|--------|
| 0 | Who is at risk for Lymphocytic Choriomeningiti... | LCMV infections can occur after exposure to fr... |
| 1 | What are the symptoms of Lymphocytic Choriomen... | LCMV is most commonly recognized as causing ne... |
| 2 | Who is at risk for Lymphocytic Choriomeningiti... | Individuals of all ages who come into contact ... |
| 3 | How to diagnose Lymphocytic Choriomeningitis (... | During the first phase of the disease, the mos... |
| 4 | What are the treatments for Lymphocytic Chorio... | Aseptic meningitis, encephalitis, or meningoen... |


In [None]:
### Turn the DataFrame back into a Hugging Face dataset
dataset_hf = Dataset.from_pandas(data_frame)

In [None]:
print(dataset_hf)

Dataset({
    features: ['Question', 'Answer'],
    num_rows: 16407
})


In [None]:
input_1 = dataset_hf['Question']
print(input_1[0])

Who is at risk for Lymphocytic Choriomeningitis (LCM)? ?


## Chat template 
Once your dataset is ready, the next step is to convert each data point into the chat template the model understands. The catch is that different models expect very different input formats for chat. Chat templates are part of the tokenizer for text-only LLMs, or of the processor for multimodal LLMs. They specify how to convert a conversation, represented as a list of messages, into a single tokenizable string in the format the model expects.

All chat templates include special tokens that the model uses to highlight when someone has stopped speaking or when the conversation has ended. 
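To make the idea of "a list of messages becomes one string" concrete, here is a minimal hand-written sketch using the ChatML convention, where each turn is wrapped in `<|im_start|>` and `<|im_end|>`. The function `render_chatml` is a hypothetical helper written for illustration only; it is not the template shipped with any particular tokenizer.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Join a list of {"role", "content"} messages into one ChatML-style string."""
    parts = []
    for message in messages:
        parts.append(
            "<|im_start|>{role}\n{content}<|im_end|>\n".format(
                role=message["role"], content=message["content"]
            )
        )
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the symptoms of LCM?"},
]
print(render_chatml(conversation, add_generation_prompt=True))
```

Here `<|im_end|>` plays the role of the "someone has stopped speaking" marker described above.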

Tokenizer classes in Hugging Face used to come with a default chat template built in, but these defaults are being deprecated. Hence, we will create the template manually to train the model. The template is also where we will specify the Instruction element.

Let's load the tokenizer first:

In [None]:
## Insert your access token
os.environ['HF_TOKEN'] = "Insert HF token"
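Pasting a token directly into a notebook cell makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read the token from the `HF_TOKEN` environment variable and only prompt for it when it is missing. The helper `get_hf_token` is a hypothetical name introduced here, not part of any Hugging Face library.

```python
import os
from getpass import getpass


def get_hf_token(env=None):
    """Return the Hugging Face token from HF_TOKEN, prompting only if it is unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the typed token so it never shows up in cell output.
        token = getpass("Hugging Face access token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided.")
    return token


# Typical use in the notebook:
# os.environ["HF_TOKEN"] = get_hf_token()
```

The rest of the notebook can keep reading `os.environ['HF_TOKEN']` unchanged.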

**Since we are going from a base model to an instruct model, we need to modify the tokenizer to incorporate the special tokens used in the ChatML template.** 

In [None]:
### Load the tokenizer
model_name = "meta-llama/Llama-3.2-1B" 
tokenizer = AutoTokenizer.from_pretrained(model_name, token = os.environ['HF_TOKEN'])

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

**Let's view the special tokens currently present in the tokenizer:**

In [None]:
print(tokenizer.all_special_tokens)

['<|begin_of_text|>', '<|end_of_text|>']


**The ChatML template uses the '<|im_start|>' and '<|im_end|>' tokens instead of '<|begin_of_text|>' and '<|end_of_text|>'. Let's update the tokenizer's special tokens accordingly!**

In [None]:
# Special tokens used by the ChatML template
tokenizer_special_tokens_map = {
    'bos_token': '<|im_start|>',
    'eos_token': '<|im_end|>',
}

## Update the tokenizer
tokenizer.add_special_tokens(tokenizer_special_tokens_map)

# Reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right" 
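One consequence worth keeping in mind: special tokens that are not already in the vocabulary get fresh ids appended at the end, so the vocabulary grows. The model's embedding matrix then has to be resized to match before training, which in `transformers` is usually done with `model.resize_token_embeddings(len(tokenizer))`. The toy vocabulary below is a hypothetical stand-in used only to show the bookkeeping; it is not the real tokenizer.

```python
# Hypothetical miniature vocabulary standing in for the real tokenizer.
toy_vocab = {"<|begin_of_text|>": 0, "<|end_of_text|>": 1, "hello": 2, "world": 3}


def add_special_tokens(vocab, new_tokens):
    """Append unseen tokens at the end of the vocabulary; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next free id
            added += 1
    return added


size_before = len(toy_vocab)
num_added = add_special_tokens(toy_vocab, ["<|im_start|>", "<|im_end|>"])
print(size_before, "->", len(toy_vocab), "tokens;", num_added, "added")
# The embedding matrix needs one row per token id, so it must grow from
# size_before rows to len(toy_vocab) rows before training.
```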

**Since the aim of this notebook is to demonstrate the process, we will train the model on the first 5000 samples. Let's define a mapping function to convert the data points into a template the model can easily ingest. Notice that we have set the Instruction element to: "Answer the following question truthfully".**

In [None]:
def mapping_func(example):
    """Format one Question/Answer pair into a single training prompt."""
    # Alternative ChatML-style prompt, kept for reference:
    # prompt_string = """<|im_start|>system: You are a helpful Assistant. Below is an instruction that describes a task. Write a response that appropriately completes the request.<|im_end|>
    # <|im_start|>User: {Question}<|im_end|>
    # <|im_start|>Assistant:{output}<|im_end|>""".format(Question=example['Question'], output=example['Answer'])

    prompt_string = """You are a helpful assistant. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question truthfully.

### Input:
{Question}

### Response:
{output}<|im_end|>""".format(Question=example['Question'], output=example['Answer'])

    return {'prompt': prompt_string}

### Let's take the first 5000 samples for training and validating model performance. It is also best to shuffle the dataset.
hf_dataset = Dataset.from_dict(dataset_hf[:5000]).shuffle(seed=1234)

### Use the .map function to apply the mapping function to each element of the dataset. We also drop the "Question" and "Answer" columns to save GPU memory when the data gets loaded
hf_dataset = hf_dataset.map(mapping_func).remove_columns(["Question", "Answer"])

### Create the training and evaluation datasets
hf_dataset = hf_dataset.train_test_split(test_size=0.1)
training_data = hf_dataset['train']
evaluation_data = hf_dataset['test']

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [None]:
print(len(training_data))

4500
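The 4500 training rows follow from holding out 10% of the 5000 samples. Note that `train_test_split` is called without a seed here, so the exact rows in each split can differ between runs; the method accepts a `seed` argument if a repeatable split is needed. The plain-Python sketch below only illustrates the idea of a seeded 90/10 split over row indices. The helper `train_test_indices` is hypothetical and does not reproduce how the `datasets` library shuffles internally.

```python
import random


def train_test_indices(num_rows, test_size=0.1, seed=1234):
    """Shuffle row indices with a fixed seed and split them into train/test lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = train_test_indices(5000)
print(len(train_idx), len(test_idx))
```

Calling the function twice with the same seed returns the same split, which makes evaluation numbers comparable across runs.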


**Let's have a look at what the template looks like for a data point**

In [None]:

tokenised = tokenizer.encode(training_data['prompt'][0], return_tensors="pt")
untokenised = tokenizer.decode(tokenised[0])

In [None]:
print(untokenised)

<|begin_of_text|>You are a helpful assistant. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question truthfully.

### Input:
What are the treatments for Multi-Infarct Dementia?

### Response:
There is no treatment available to reverse brain damage that has been caused by a stroke. Treatment focuses on preventing future strokes by controlling or avoiding the diseases and medical conditions that put people at high risk for stroke: high blood pressure, diabetes, high cholesterol, and cardiovascular disease. The best treatment for MID is prevention early in life  eating a healthy diet, exercising, not smoking, moderately using alcohol, and maintaining a healthy weight.<|im_end|>
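The decoded sample above is a training prompt: it contains the answer followed by the `<|im_end|>` token. When the fine-tuned model is queried later, the same template would typically be used but stopped right after `### Response:` so the model writes the answer itself. The sketch below shows one way to cover both cases; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question truthfully.

### Input:
{question}

### Response:
"""


def build_prompt(question, answer=None, eos_token="<|im_end|>"):
    """Build a training prompt (answer plus EOS) or an inference prompt (no answer)."""
    prompt = PROMPT_TEMPLATE.format(question=question)
    if answer is not None:
        # Training: append the reference answer and the end-of-sequence token.
        prompt += answer + eos_token
    return prompt


print(build_prompt("What are the treatments for Multi-Infarct Dementia?"))
```

Keeping one template for both training and inference avoids a mismatch between what the model saw during fine-tuning and what it is asked at query time.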


In [None]:
print(tokenizer.all_special_tokens)

['<|im_start|>', '<|im_end|>']


In [None]:
print(tokenizer.get_vocab()['<|begin_of_text|>'])

128000


## Save the datasets
We will save the datasets to disk with `save_to_disk`, which writes them in the Arrow format:

In [None]:
training_data.save_to_disk("training_data")
evaluation_data.save_to_disk("evaluation_data")

Saving the dataset (0/1 shards):   0%|          | 0/4500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/500 [00:00<?, ? examples/s]
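Note that `save_to_disk` writes Arrow files plus metadata into a directory, which can be read back with `datasets.load_from_disk`. If a plain-text JSON file is preferred, `Dataset.to_json` is the usual route and, by default, should write one JSON object per line (JSON Lines). The standard-library sketch below only illustrates that file layout; the two records and the file name are made-up examples, not the real training data.

```python
import json
import os
import tempfile

# Made-up records standing in for rows of the "prompt" column.
records = [{"prompt": "first example"}, {"prompt": "second example"}]

path = os.path.join(tempfile.mkdtemp(), "training_data.jsonl")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as handle:
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to confirm the round trip.
with open(path, encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle]

print(len(loaded), "records read back")
```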