## Create your dataset
Fine-tuning is the process of taking a pre-trained model and training it further on a domain-specific dataset. The image below illustrates what a typical fine-tuning process might look like.

Before we can fine-tune, however, the dataset needs to be prepared in a form the model can ingest during training. This notebook therefore explains chat templates and data storage formats to help you prepare your dataset for fine-tuning.




![Fine_tuning.jpg](./Fine_tuning.jpg "Fine_tuning.jpg")

## Pre-requisites

Before continuing, you will need a Hugging Face account. You can create one at: https://huggingface.co/

Next, you will need access to Llama 3.2 1B, the model we will use for this task. Request access at: https://huggingface.co/meta-llama/Llama-3.2-1B

Once you reach the page, complete the required form (do not mention that you are affiliated with Accenture; use a random university, maybe).

Once you have your Hugging Face account, create an access token. Open your profile menu at the top right of the page and select "Access Tokens". Once the token is created, store it somewhere safe on your local machine; you will need it later in this notebook.
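
Rather than pasting the token into a notebook cell, one option is to read it from an environment variable. The sketch below assumes the variable is called `HF_TOKEN`, the name this notebook uses later. `get_hf_token` is a hypothetical helper written for illustration, not part of any Hugging Face library.

```python
import os


def get_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the access token stored in an environment variable.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    Raises a clear error instead of silently returning an empty token.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your access token.")
    return token


# Demonstration with an explicit mapping and a made-up token value.
print(get_hf_token({"HF_TOKEN": "hf_example_token"}))
```

Calling `get_hf_token()` without arguments reads from the real environment, so the token never has to appear in the saved notebook.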

## Install and import libraries
Let's install and import the required dependencies:

In [0]:
!pip install transformers datasets bitsandbytes peft trl accelerate torch 

Collecting bitsandbytes
  Using cached bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting peft
  Using cached peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Using cached trl-0.14.0-py3-none-any.whl.metadata (12 kB)
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Using cached huggingface_hub-0.28.0-py3-none-any.whl.metadata (13 kB)
Collecting accelerate
  Using cached accelerate-1.3.0-py3-none-any.whl.metadata (19 kB)
Collecting datasets
  Using cached datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting transformers
  Using cached transformers-4.48.1-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Using cached tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Using cached bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl (69.7 MB)
Using cached peft-0.14.0-py3-none-any.whl (374 kB)
Using cached trl-0.14.0-py3-none-any.whl (313 kB

In [0]:
%restart_python

In [0]:

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from datasets import Dataset, load_dataset  # load datasets from Hugging Face
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForSeq2Seq,
    TrainingArguments,
)
from trl import SFTConfig, SFTTrainer

2025-01-29 18:06:28.943401: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1738173988.954789   10358 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1738173988.958271   10358 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-29 18:06:28.970714: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Load the dataset from HuggingFace

To make the process concrete, we will define a use case. We will work with a dataset of medical summaries, which allows a model to learn complex medical jargon that it might otherwise not understand.

Hugging Face hosts many open-source datasets of this kind, so we will work with one that is readily available on the platform.

Link to dataset: https://huggingface.co/datasets/ZhongshengWang/Alpaca-pubmed-summarization?row=5
 

In [0]:
### load the dataset 
dataset = load_dataset("ZhongshengWang/Alpaca-pubmed-summarization", split="train")
print("Number of samples in the dataset: {}".format(len(dataset)))




Number of samples in the dataset: 119924


In [0]:
print(dataset)

Dataset({
    features: ['output', 'instruction', 'input'],
    num_rows: 119924
})


## ALPACA - A format to store datasets 

The dataset is stored in the Alpaca data format, a specific structure used to store data for fine-tuning large language models. A dataset in the Alpaca format needs three fields for each data sample:

  1. instruction: A string that describes the task the model should perform.
  2. input: Additional context or information (can be empty).
  3. output: The desired response from the model.

This is also a single-turn dataset: each sample consists of one input prompt and one output, denoting a single interaction (exchange).
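
To make the three fields concrete, here is a minimal sketch of one Alpaca-style record together with a small validity check. The sample text is made up and `validate_alpaca_record` is a hypothetical helper. The rule that `instruction` and `output` must be non-empty is an assumption for illustration, not part of a formal specification.

```python
REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_alpaca_record(record):
    """Check that a record has the three Alpaca fields, all of them strings."""
    for field in REQUIRED_FIELDS:
        if field not in record:
            raise KeyError(f"missing field: {field}")
        if not isinstance(record[field], str):
            raise TypeError(f"field {field!r} must be a string")
    # Assumption: "input" may be empty, but instruction and output may not.
    if not record["instruction"].strip() or not record["output"].strip():
        raise ValueError("instruction and output must be non-empty")
    return record


# Made-up sample, not taken from the dataset used in this notebook.
sample = {
    "instruction": "Summarize the following medical text.",
    "input": "tardive dystonia is a side effect of long-term antipsychotic treatment ...",
    "output": "tardive dystonia is a serious side effect of antipsychotic medication.",
}

print(sorted(validate_alpaca_record(sample)))
```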

Let's view the dataset in a pandas DataFrame:



In [0]:
### copy the columns into a pandas DataFrame in a fixed order
column_names = ["input", "instruction", "output"]
data_frame = pd.DataFrame({name: list(dataset[name]) for name in column_names})


In [0]:

data_frame.head()

| | input | instruction | output |
|---|---|---|---|
| 0 | a recent systematic analysis showed that in 20... | Please help me complete the long-text summariz... | background : the present study was carried out... |
| 1 | it occurs in more than 50% of patients and may... | Please help me complete the long-text summariz... | backgroundanemia in patients with cancer who a... |
| 2 | tardive dystonia ( td ) , a rarer side effect ... | Please help me complete the long-text summariz... | tardive dystonia ( td ) is a serious side effe... |
| 3 | lepidoptera include agricultural pests that , ... | Please help me complete the long-text summariz... | many lepidopteran insects are agricultural pes... |
| 4 | syncope is caused by transient diffuse cerebra... | Please help me complete the long-text summariz... | we present an unusual case of recurrent cough ... |


In [0]:
### turn the above back to a hugging face dataset
dataset_hf = Dataset.from_pandas(data_frame)

In [0]:
print(dataset_hf)

Dataset({
    features: ['input', 'instruction', 'output'],
    num_rows: 119924
})


In [0]:
input_1 = dataset_hf['input']
print(input_1[0])

a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries . 
 in iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively . 
 the prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% . 
 anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight . 
 snack should have 300 - 400 kcal energy and could provide 5 - 10 g of protein / day . nowadays , school nutrition programs are running as the national programs , world - wide . national school lunch program in the united states 
 there are also some reports regarding school feeding programs in developing countries . in 

## Chat template 
Once your dataset is ready, the next step is to convert the data points into the chat template that the model understands. The catch is that different models expect very different input formats for chat. Chat templates are part of the tokenizer for text-only LLMs, or of the processor for multimodal LLMs. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format the model expects.

Chat templates typically include special tokens that mark when a speaker's turn has ended or when the conversation is over.
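
As a rough illustration of what a chat template does, the sketch below flattens a list of messages into one string in plain Python. The `role: content` layout and the `render_chat` helper are hypothetical and chosen only for readability. The two marker strings are the special tokens of the tokenizer used in this notebook. Real templates are defined per model, and for tokenizers that ship one, `tokenizer.apply_chat_template` performs this conversion.

```python
# Special tokens of the tokenizer used in this notebook; other models differ.
BOS = "<|begin_of_text|>"
EOS = "<|end_of_text|>"


def render_chat(messages, bos=BOS, eos=EOS):
    """Flatten a list of {"role", "content"} messages into a single string."""
    parts = [bos]
    for message in messages:
        # Hypothetical layout: one line per turn, "role: content".
        parts.append(f"{message['role']}: {message['content'].strip()}\n")
    parts.append(eos)
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are an expert within the medical domain."},
    {"role": "user", "content": "Summarize the following abstract."},
    {"role": "assistant", "content": "The study found ..."},
]

print(render_chat(conversation))
```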

Usually, the Hugging Face tokenizer class has a template built in; however, these built-in templates are being deprecated. Hence, we will create the template manually to train the model.

Let's load the tokenizer first:

In [0]:
## insert access token 
os.environ['HF_TOKEN'] = ""

In [0]:
### load the tokenizer of the base model
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ['HF_TOKEN'])

**Now let's view the special tokens present in the tokenizer:**

In [0]:
print(tokenizer.all_special_tokens)

['<|begin_of_text|>', '<|end_of_text|>']


Since the aim of this notebook is demonstration, we will train the model on a random subset of 1,000 samples. Let's define a mapping function to convert the data points into a template the model can easily ingest:

In [0]:
def mapping_func(example):
    ### wrap one Alpaca sample in the prompt template, delimited by the tokenizer's special tokens
    prompt_string = """<|begin_of_text|>You are an expert within the medical domain, {instruction}\n Input: {input} \n output: {output}<|end_of_text|>""".format(instruction=example['instruction'], input=example['input'], output=example['output'])

    return {'prompt': prompt_string}

### Let's shuffle the dataset first
hf_dataset = dataset_hf.shuffle(seed=1234)

### Reduce the dataset to a smaller one (1000 samples). Slicing converts the dataset to a dictionary, hence we need the Dataset class to convert it back
hf_dataset = Dataset.from_dict(hf_dataset[:1000])

### Use the .map function to apply the mapping function to each element of the dataset. We also drop the "instruction", "input" and "output" columns to save memory when the data gets loaded
hf_dataset = hf_dataset.map(mapping_func).remove_columns(["instruction", "input", "output"])

### Create the training and evaluation datasets
hf_dataset = hf_dataset.train_test_split(test_size=0.1)
training_data = hf_dataset['train']
evaluation_data = hf_dataset['test']

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [0]:
print(len(training_data))

900


**Let's have a look at what the template looks like for a data point**

In [0]:

tokenised = tokenizer.encode(training_data['prompt'][0], return_tensors="pt")
untokenised = tokenizer.decode(tokenised[0])

In [0]:
print(untokenised)

<|begin_of_text|><|begin_of_text|>You are an expert within the medical domain, Please help me complete the long-text summarization task, where the long text is 'input' and the summarized result is 'output'.
 Input: neonatal tetanus is a preventable disease with high mortality and accounts for about 5 - 7% of neonatal death globally,123 mostly in developing countries. 
 management is mainly focused on relief of symptoms and the prevention of complications such as aspiration pneumonia. 
 a meticulous nursing care is required to prevent occurrence of complications and death caused due to neonatal tetanus. according to the world health organization ( who )'s global immunization news of march, 2013, 
 nigeria is one of 30 remaining high - risk countries that have not achieved the maternal and neonatal tetanus elimination ( mnte ) goal yet.4 available data indicates that 18 states ( out of 37 ) are at risk for maternal and neonatal tetanus.4 many hospital based studies have identified neonata
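
The decoded text above begins with two `<|begin_of_text|>` tokens. A likely explanation is that the prompt string already contains the token and the tokenizer prepends another one when encoding. Two possible remedies are to pass `add_special_tokens=False` to `tokenizer.encode`, or to drop the literal token from the prompt. The sketch below illustrates the second option in plain Python; `strip_leading_bos` is a hypothetical helper, not part of any library, and the prompt text is made up.

```python
BOS_TOKEN = "<|begin_of_text|>"


def strip_leading_bos(prompt, bos_token=BOS_TOKEN):
    """Remove one leading BOS marker so that the tokenizer can add its own."""
    if prompt.startswith(bos_token):
        return prompt[len(bos_token):]
    return prompt


# Made-up prompt in the same shape as the ones built in this notebook.
prompt = "<|begin_of_text|>Input: some text \n output: a summary<|end_of_text|>"
cleaned = strip_leading_bos(prompt)
print(cleaned)
```

Whichever remedy is chosen, decoding one encoded sample again is a quick way to confirm that only a single `<|begin_of_text|>` remains.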

## Save the datasets
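We will store the datasets on disk. Note that `save_to_disk` writes each dataset as a directory of Arrow files with JSON metadata, rather than as a single JSON file: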

In [0]:
training_data.save_to_disk("training_data")
evaluation_data.save_to_disk("evaluation_data")

Saving the dataset (0/1 shards):   0%|          | 0/900 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]
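
If plain JSON files are preferred, for example to inspect the prompts in a text editor, one option is the JSON Lines layout with one record per line. The sketch below uses only the standard library and made-up records with the same single `prompt` column. `write_jsonl` and `read_jsonl` are hypothetical helpers. The `datasets` library also offers `Dataset.to_json` for this purpose.

```python
import json
import os
import tempfile

# Made-up records with the same single "prompt" column as the datasets above.
records = [
    {"prompt": "<|begin_of_text|>first example<|end_of_text|>"},
    {"prompt": "<|begin_of_text|>second example<|end_of_text|>"},
]


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "training_data.jsonl")
    write_jsonl(records, path)
    restored = read_jsonl(path)

print(len(restored))
```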