# Instruction Tuning with Optimizations

Instruction tuning is a form of fine-tuning that enhances a model's ability to generalize across diverse tasks. It makes models more adaptable when following new instructions, including instructions they were never explicitly trained on.

## Ok, But What is Instruction Tuning?

Instruction tuning differs from the standard supervised fine-tuning (SFT) approach primarily in the nature of the training data. While both methods train on **input-output** pairs, instruction tuning adds a critical layer: **instructions**. This additional context tells the model which task it is being asked to perform, which improves generalization to unseen tasks. Also, as we will see in this notebook, one way of doing instruction tuning lets us skip the trouble of designing task-specific heads or loss functions!

### Key Differences:
- **Supervised Fine-Tuning**: Trains models using input examples and their corresponding outputs.
- **Instruction Tuning**: Augments the input-output pairs with instructions, enhancing the model's ability to generalize to new tasks.

### Example

**Supervised Fine-Tuning**:
- **Input**: "Translate this sentence to French: 'The cat is on the mat.'"
- **Output**: "Le chat est sur le tapis."

**Instruction Tuning**:
- **Instruction**: "Translate the following sentence to French."
- **Input**: "The cat is on the mat."
- **Output**: "Le chat est sur le tapis."

By incorporating instructions, the model gains a better understanding of the task, leading to more robust performance across a wider range of tasks.
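As a minimal sketch of this difference, the snippet below joins an instruction, an input, and an output into a single training string. The function name `build_instruction_example` and the `Instruction:`/`Input:`/`Output:` labels are illustrative assumptions, not a format any particular model requires; the LLaMA-specific templates used in this notebook come later.

```python
def build_instruction_example(instruction, input_text, output_text=""):
    """Join instruction, input and (optionally) output into one string."""
    prompt = f"Instruction: {instruction}\nInput: {input_text}\nOutput:"
    # At inference time there is no output yet, so only the prompt is returned.
    if output_text:
        return f"{prompt} {output_text}"
    return prompt


example = build_instruction_example(
    "Translate the following sentence to French.",
    "The cat is on the mat.",
    "Le chat est sur le tapis.",
)
print(example)
```

Calling the same helper without an output yields the prompt that would be sent to the model at inference time.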

In this notebook, we will go deeper into the mechanics of instruction tuning and tune our own model.


We covered a few optimization techniques such as **Quantization** and **LoRA** in the previous notebook. Here we will leverage those techniques to instruction-tune the latest Llama 3.1 (or Llama 2 if you do not have access yet) to **convert natural language text** to ``SQL``.


> Note: Even with these optimizations, LLaMA 3.1 does not fit on Colab, but LLaMA 2 and LLaMA 3.2 fit on a Colab T4.

> Built with Llama ❤

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ❗ <b>This Notebook requires GPU</b></p>

## Reparameterization PEFT

Reparameterization using **low-rank adaptation (LoRA)** is one of the most effective and popular PEFT techniques out there. This technique leverages matrix decomposition to gain efficiency. In a typical fine-tuning scenario, back-propagation updates the whole weight matrix of the model (see figure, left).

<img src="./assets/ch_09_10.png">


As shown in the figure (right), LoRA instead decomposes the weight update matrix ($W_d$) into two lower-rank matrices $W_a$ and $W_b$ of rank $r$ and trains only those, while the original weights stay frozen. This can reduce the number of weights to be updated by 100 to 1000x.
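The size of that reduction follows from simple arithmetic: a full update of a `d_out x d_in` matrix touches `d_out * d_in` weights, while the two low-rank factors together hold `r * (d_out + d_in)`. The sketch below works this out for a hypothetical 4096 x 4096 layer with rank 8; both numbers are illustrative assumptions rather than values from this notebook.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full update size, low-rank update size) for one weight matrix."""
    full = d_out * d_in            # every entry of the update matrix
    low_rank = r * (d_out + d_in)  # (d_out x r) factor plus (r x d_in) factor
    return full, low_rank


full, low_rank = lora_param_counts(d_out=4096, d_in=4096, r=8)
print(f"full update: {full:,} weights")
print(f"rank-8 update: {low_rank:,} weights")
print(f"reduction: {full / low_rank:.0f}x")
```

Smaller ranks or larger matrices push the ratio higher, which is where the upper end of the quoted range comes from.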

## Environment Setup

In [1]:
!pip install -q accelerate peft bitsandbytes transformers trl tensorboardX

In [2]:
# Colab does not always pick up the latest version, so we upgrade explicitly
!pip install --upgrade transformers



## Import Packages

In [3]:
import os
import torch
from datasets import load_dataset,Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel,get_peft_model
from trl import SFTTrainer

In [4]:
# `!export` runs in a throwaway subshell, so set the token in this process instead
os.environ["HF_TOKEN"] = "XXXXX"

In [5]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

In [6]:
from huggingface_hub import notebook_login

notebook_login()


## Notebook Configurations

In [7]:
## LLAMA Versions
LLAMA_2 = "2-7b-chat" # fits t4 on colab
LLAMA_3_1 = "3.1-8b-Instruct" # better performing but needs a bigger GPU. Needs access request approved on HF-Hub
LLAMA_3_2 = "3.2-1B-Instruct" # fits on t4 on colab
base_model_version =  LLAMA_3_2 # set to LLAMA_2 or 3_2 if you want to try this notebook on colab

### Model and Dataset Configs

In [8]:
# The base model from the Hugging Face hub
if base_model_version == LLAMA_3_1:
    base_model_name =  "meta-llama/Meta-Llama-3.1-8B-Instruct"
elif base_model_version == LLAMA_3_2:
    base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
elif base_model_version == LLAMA_2:
    base_model_name = "NousResearch/Llama-2-7b-chat-hf"
else:
    raise ValueError(f"This notebook is not set up for base model = {base_model_version}")

# Dataset
dataset_name = "wikisql"

# Name of the Instruction Tuned model
output_dir = new_model = f"llama-{base_model_version}-SQL-FT"

## Dataset Preparation

In [9]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train",trust_remote_code=True, cache_dir='/workspace')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


### What is LLaMA again?
<img src="./assets/llama.png">

**[LLaMA (Large Language Model Meta AI)](https://ai.meta.com/blog/large-language-model-llama-meta-ai/)** is a state-of-the-art foundational large language model designed to help researchers advance their work in this subfield of AI. Smaller, more performant models such as LLaMA enable others in the research community who don’t have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field.
Within this series, Meta has released several versions of different sizes. They have progressively improved in their performance while adjusting how the inputs and prompt formats work.

This notebook works for three smaller-sized LLaMAs: versions 2-7B, 3.1-8B, and 3.2-1B.

### General Format for Instructions for LLaMA Models
**LLaMA 2** :
```
[INST]<<SYS>>{system text}<</SYS>>
{user text}[/INST]
{assistant response}
```

**LLaMA 3.1/3.2**:
```
<|start_header_id|>system<|end_header_id|>
{system text}
<|eot_id|>\n<|start_header_id|>user<|end_header_id|>
{user text}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
{assistant response}
...
```

We will leverage these formats to prepare our **Instruction Dataset**.

In [10]:
# [T2SQL]
def fstr_llama_template(model_version,question,output=''):
    # Prompt/completion templates per model family. They are filled with str.format
    # rather than eval, so quotes or braces in the data cannot break or inject code.
    template_3_2 = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nConvert the following textual question into SQL.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\nquestion:{question}\noutput:<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>"
    template_3_2_output = "{output}<|end_of_text|>"
    template_3_1 = "<|start_header_id|>system<|end_header_id|>\nConvert the following textual question into SQL.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\nquestion:{question}\noutput:<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>"
    template_3_1_output = "{output}<|end_of_text|>"
    template_2_0 = "[INST]<<SYS>>Convert the following textual question into SQL.<</SYS>>\nquestion:{question}\noutput:[/INST]"
    template_2_0_output = "{output}</s>"
    if model_version == LLAMA_3_1:
        prompt, completion = template_3_1, template_3_1_output
    elif model_version == LLAMA_3_2:
        prompt, completion = template_3_2, template_3_2_output
    elif model_version == LLAMA_2:
        prompt, completion = template_2_0, template_2_0_output
    else:
        print(f"This notebook is not set up for Base Model ={model_version}")
        return "ERROR"
    text = prompt.format(question=question)
    # Training examples get the completion appended; inference prompts end
    # where the model should start generating.
    if output != '':
        text += completion.format(output=output)
    return text

In [11]:
instruction_formatted_dataset = []
DATASET_SIZE = 10000 # 25000
for row in dataset.select(range(DATASET_SIZE)):
    instruction_formatted_dataset.append(
        {'text':fstr_llama_template(base_model_version,row['question'],row['sql']['human_readable'])}
    )

In [12]:
# Transform List to Dataset Object
instruct_datset = Dataset.from_list(instruction_formatted_dataset)

In [13]:
# Get one sample data point
print(instruct_datset[0]['text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Convert the following textual question into SQL.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
question:Tell me what the notes are for South Australia 
output:<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>SELECT Notes FROM table WHERE Current slogan = SOUTH AUSTRALIA<|end_of_text|>


## Configurations

### Quantization

In [14]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # 4-bit precision base model loading
    bnb_4bit_quant_type="nf4", #quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16
)
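To get a feel for what 4-bit loading buys us, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch under stated assumptions: the base parameter count is derived from the numbers this notebook prints later (all parameters minus the trainable LoRA parameters), and the estimate ignores layers that stay unquantized, quantization constants, activations, and optimizer state. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return num_params * bits_per_param / 8 / 1e9


# Llama 3.2 1B base parameters: reported "all params" minus LoRA "trainable params"
base_params = 1_242_630_144 - 6_815_744

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(base_params, bits):.2f} GB")
```

The real footprint on the GPU will be somewhat higher than the 4-bit figure, but the ratio between the rows is the point of the exercise.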

### QLoRA

In [15]:
# LoRA rank dimension
lora_r = 64

# Alpha-LoRA for scaling
lora_alpha = 16

# Dropout for LoRA
lora_dropout = 0.1

In [16]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
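To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer, assuming the common formulation `h = W x + (lora_alpha / r) * B (A x)`. The helper names `matvec` and `lora_forward` and the tiny matrices are hypothetical; the PEFT library does the equivalent on tensors. With the values configured above, the scaling factor would be 16 / 64 = 0.25.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    scale = lora_alpha / r
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # project down to rank r, then back up
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # d_out x d_in (frozen)
A = [[1.0, 1.0]]              # r x d_in, here r = 1
B = [[1.0], [0.0]]            # d_out x r
x = [1.0, 2.0]

print(lora_forward(W, A, B, x, lora_alpha=2, r=1))
# With B set to zeros the adapter is a no-op and the output equals W x
print(lora_forward(W, A, [[0.0], [0.0]], x, lora_alpha=2, r=1))
```

The second call illustrates why a freshly created adapter does not change the model: until training moves `B` away from zero, the update term vanishes.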

### Fine-Tuning Configs

In [17]:
# Maximum sequence length to use
max_seq_length = 1028

# pack multiple examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

### Training Setup Configs

In [18]:
# Number of training epochs
num_train_epochs = 1

fp16 = False
bf16 = False # True for A100

# Batch size per GPU for training
per_device_train_batch_size = 4 # increase to 8 for A100

# batch size per GPU for eval
per_device_eval_batch_size = 4

# update steps to accumulate the gradients
gradient_accumulation_steps = 1

gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001

# optimizer
optim = "paged_adamw_32bit"

# learning rate schedule
lr_scheduler_type = "cosine"
max_steps = -1 # setting this will override num_train_epochs, do not change
warmup_ratio = 0.03

# speeds up training considerably by grouping samples by length
group_by_length = True
save_steps = 0
logging_steps = 25

## Load base model

In [19]:
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    cache_dir='/workspace'
)

model.config.use_cache = False
model.config.pretraining_tp = 1

In [20]:
peft_model = get_peft_model(model, peft_config)

## Load LLaMA tokenizer

In [21]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True,)#cache_dir='/workspace')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## Set training parameters

In [22]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

## Set Supervised Fine-Tuning Parameters

In [23]:
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=instruct_datset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.



## Time to Fine Tune our T2SQL LLaMA

In [35]:
peft_model.print_trainable_parameters()

trainable params: 6,815,744 || all params: 1,242,630,144 || trainable%: 0.5485
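The trainable-parameter count above can be reproduced by hand. The sketch below relies on assumptions that are not stated in this notebook: that PEFT's default for LLaMA models adapts only the `q_proj` and `v_proj` matrices, and that Llama 3.2 1B has 16 decoder layers, a hidden size of 2048, and a key/value projection width of 512. Under those assumptions, rank 64 gives exactly the reported number.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one LoRA adapter: an (r x d_in) and a (d_out x r) matrix."""
    return r * (d_in + d_out)


r = 64             # lora_r from the configuration above
hidden_size = 2048 # assumed Llama 3.2 1B hidden size
kv_width = 512     # assumed output width of v_proj (grouped-query attention)
num_layers = 16    # assumed number of decoder layers

per_layer = lora_params(hidden_size, hidden_size, r) + lora_params(hidden_size, kv_width, r)
trainable = per_layer * num_layers
all_params = 1_242_630_144  # "all params" as printed above

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```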


In [24]:
# 5k samples take about 10mins on T4(colab) on avg
# 10k samples take about 20mins on T4(colab) on avg
trainer.train()

{'loss': 6.7673, 'grad_norm': 2.2408459186553955, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.01}
{'loss': 4.9586, 'grad_norm': 2.2837648391723633, 'learning_rate': 0.00013333333333333334, 'epoch': 0.02}
{'loss': 2.388, 'grad_norm': 1.3788025379180908, 'learning_rate': 0.0002, 'epoch': 0.03}
{'loss': 1.6333, 'grad_norm': 2.312619924545288, 'learning_rate': 0.00019994755690455152, 'epoch': 0.04}
{'loss': 1.7815, 'grad_norm': 0.8946009874343872, 'learning_rate': 0.00019979028262377118, 'epoch': 0.05}
{'loss': 1.5181, 'grad_norm': 2.6007065773010254, 'learning_rate': 0.0001995283421166614, 'epoch': 0.06}
{'loss': 1.6914, 'grad_norm': 0.7675842642784119, 'learning_rate': 0.00019916201012264254, 'epoch': 0.07}
{'loss': 1.5442, 'grad_norm': 0.9116337299346924, 'learning_rate': 0.00019869167087338907, 'epoch': 0.08}
{'loss': 1.6791, 'grad_norm': 0.7572203278541565, 'learning_rate': 0.0001981178176898239, 'epoch': 0.09}
{'loss': 1.4956, 'grad_norm': 1.8003392219543457, 'learning_rate': 

TrainOutput(global_step=2500, training_loss=1.5689229858398437, metrics={'train_runtime': 1489.257, 'train_samples_per_second': 6.715, 'train_steps_per_second': 1.679, 'train_loss': 1.5689229858398437, 'epoch': 1.0})
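The logged learning rates are consistent with linear warmup followed by cosine decay. The sketch below is our own re-implementation of that schedule for illustration, not the Trainer's internal code; it assumes the warmup length is `warmup_ratio` times the total number of steps, rounded up, and that the decay runs over a half cosine cycle down to zero.

```python
import math


def lr_at(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 10,000 samples, batch size 4, no gradient accumulation, 1 epoch
total_steps = math.ceil(10000 / 4)
warmup_steps = math.ceil(0.03 * total_steps)

# logging_steps = 25, so the first log lines correspond to these steps
for step in (25, 50, 75, 100, 125):
    print(step, lr_at(step, 2e-4, warmup_steps, total_steps))
```

The step count also agrees with `global_step=2500` in the training output, and the warmup ends at step 75, which is where the log first shows the full `0.0002`.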

## Save T2SQL Model

In [25]:
trainer.model.save_pretrained(new_model)

In [39]:
trainer.push_to_hub(new_model)


CommitInfo(commit_url='https://huggingface.co/raghavbali/llama-3.2-1B-Instruct-SQL-FT/commit/758a0b375baeaaba337e95982910df2d7b3f692a', commit_message='llama-3.2-1B-Instruct-SQL-FT', commit_description='', oid='758a0b375baeaaba337e95982910df2d7b3f692a', pr_url=None, repo_url=RepoUrl('https://huggingface.co/raghavbali/llama-3.2-1B-Instruct-SQL-FT', endpoint='https://huggingface.co', repo_type='model', repo_id='raghavbali/llama-3.2-1B-Instruct-SQL-FT'), pr_revision=None, pr_num=None)

## Let Us Convert Some Text to SQL

In [None]:
questions = [
    "What is the description of a ch-47d chinook?",
    "What is the current series where the new series began in June 2011",
    "How many students are between the ages of 10 and 30",
    "Use a ranking function and calculate rank of each student based on their score"
]

In [40]:
lora_config = LoraConfig.from_pretrained(f"raghavbali/{new_model}" )#,cache_dir='/workspace')
hf_tokenizer = AutoTokenizer.from_pretrained(f"raghavbali/{new_model}")

hf_ft_model = AutoModelForCausalLM.from_pretrained(
    lora_config.base_model_name_or_path,
    quantization_config=bnb_config,
    token=True, # `use_auth_token` is deprecated in recent transformers releases
    device_map=device_map,)
    # cache_dir='/workspace')




In [41]:
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    token=True, # `use_auth_token` is deprecated in recent transformers releases
    # cache_dir='/workspace'
)

### Apply LoRA Adapters

In [42]:
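# Load the trained adapter weights from the Hub; get_peft_model would instead
# attach freshly initialised (untrained) LoRA layers to the base model
hf_ft_peft_model = PeftModel.from_pretrained(hf_ft_model, f"raghavbali/{new_model}")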
print(hf_ft_peft_model.dtype)
print(hf_ft_model.num_parameters())

torch.float16
1242630144


In [43]:
from IPython.display import display, Markdown

In [44]:
for question in questions:
    prompt = fstr_llama_template(base_model_version,question)
    inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to("cuda:0")
    print("----INSTRUCT-TUNED-MODEL ----")
    outputs = hf_ft_peft_model.generate(**inputs,max_new_tokens=100, temperature=0.2)
    display(Markdown((tokenizer.decode(outputs[0], skip_special_tokens=True))))
    print("---- NON-INSTRUCT-TUNED-MODEL ----")
    outputs = base_model.generate(**inputs, max_new_tokens=100)
    display(Markdown((tokenizer.decode(outputs[0], skip_special_tokens=True))))
    print("---- END ----")

----INSTRUCT-TUNED-MODEL ----


system
Convert the following textual question into SQL.
user
question:What is the description of a ch-47d chinook?
output:
assistant

Here is the SQL query that converts the textual question into a queryable answer:

```sql
SELECT description
FROM ch-47d_chinook
```

This query will return the description of a Chinook helicopter, which is a variant of the CH-47D.

---- NON-INSTRUCT-TUNED-MODEL ----


system
Convert the following textual question into SQL.
user
question:What is the description of a ch-47d chinook?
output:
assistant

Here is the SQL equivalent of the textual question:

```sql
SELECT description
FROM ch-47d_chinook;
```

This query will return the description of a CH-47D Chinook aircraft.

---- END ----
----INSTRUCT-TUNED-MODEL ----


system
Convert the following textual question into SQL.
user
question:What is the current series where the new series began in June 2011
output:
assistant

Here is the SQL query to answer the question:

```sql
SELECT *
FROM TVSeries
WHERE title LIKE '%new series%' AND start_date LIKE '%June 2011%';
```

This query will return all TV series that have a new series and started in June 2011. 

Here's a breakdown of the query:

- `SELECT *`: This selects all columns from the table.
- `FROM TVSeries`: This specifies the table to query.
- `WHERE title LIKE

---- NON-INSTRUCT-TUNED-MODEL ----


system
Convert the following textual question into SQL.
user
question:What is the current series where the new series began in June 2011
output:
assistant

Here is the SQL query to find the current series where the new series began in June 2011:

```sql
SELECT title, series
FROM series
WHERE series = 'New Series' AND start_date LIKE 'June 2011%';
```

This query uses the `LIKE` operator to match the string 'June 2011%' to the start date of the series. The `%` wildcard is used to match any characters after 'June 2011' in the start date

---- END ----
----INSTRUCT-TUNED-MODEL ----


system
Convert the following textual question into SQL.
user
question:How many students are between the ages of 10 and 30
output:
assistant

Here is the SQL query to answer the question:

```sql
SELECT COUNT(*) 
FROM students 
WHERE age BETWEEN 10 AND 30;
```

This query will return the total number of students who are between the ages of 10 and 30.

---- NON-INSTRUCT-TUNED-MODEL ----


system
Convert the following textual question into SQL.
user
question:How many students are between the ages of 10 and 30
output:
assistant

Here is the SQL query to answer the question:

```sql
SELECT COUNT(*) 
FROM students 
WHERE age BETWEEN 10 AND 30;
```

This query will return the total number of students in the specified age range.

---- END ----
----INSTRUCT-TUNED-MODEL ----


system
Convert the following textual question into SQL.
user
question:Use a ranking function and calculate rank of each student based on their score
output:
assistant

Here is the SQL query that uses a ranking function to calculate the rank of each student based on their score:

```sql
SELECT 
    student_id,
    score,
    RANK() OVER (ORDER BY score DESC) AS rank
FROM 
    students
```

This query uses the `RANK()` window function, which assigns a rank to each row based on the values in the `score` column. The `OVER` clause specifies the window over which the ranking is performed.

---- NON-INSTRUCT-TUNED-MODEL ----


system
Convert the following textual question into SQL.
user
question:Use a ranking function and calculate rank of each student based on their score
output:
assistant

To solve this problem, we'll use SQL to rank students based on their scores. We'll assume that we have a table `Students` with columns `id`, `name`, and `score`. We'll also assume that we have another table `Scores` with columns `id`, `student_id`, and `score`.

Here's the SQL code:

```sql
SELECT 
    S1.id AS StudentID1, 
    S1.name AS StudentName1, 
    S1

---- END ----
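Several of the generations above wrap the query in a fenced `sql` block and surround it with explanation, and some are cut off by `max_new_tokens` before the fence closes. For downstream use we usually want only the query. The helper below is a small sketch of one way to pull it out; `extract_sql` is a hypothetical name, and the pattern assumes the model marks the query with a fence tagged `sql`.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid nesting code fences here
SQL_BLOCK = re.compile(
    FENCE + r"sql\s*\n(.*?)(?:" + FENCE + r"|\Z)",
    re.DOTALL | re.IGNORECASE,
)


def extract_sql(generated_text):
    """Return the first fenced SQL query in the text, or None if there is none.

    A generation that was truncated before the closing fence still matches,
    in which case everything after the opening fence is returned.
    """
    match = SQL_BLOCK.search(generated_text)
    if match is None:
        return None
    return match.group(1).strip()


sample = (
    "Here is the SQL query to answer the question:\n\n"
    + FENCE + "sql\nSELECT COUNT(*)\nFROM students\nWHERE age BETWEEN 10 AND 30;\n" + FENCE
    + "\n\nThis query will return the total number of students."
)
print(extract_sql(sample))
```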
