## Fine tuning an LLM

Based on Mariya Sha's awesome [YouTube video](https://www.youtube.com/watch?v=uikZs6y0qgI). Her source code is [on GitHub](https://github.com/MariyaSha/fine_tuning/tree/main).

### First, download the model and test it

In [1]:
from transformers import pipeline

ask_llm = pipeline(
    model = "Qwen/Qwen2.5-3B-Instruct")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use mps:0


As you might expect, it doesn't know who she is:

In [2]:
print(ask_llm("Who is Maria sha?")[0]["generated_text"])

Who is Maria sha? Maria Sha is a fictional character from the "X-Men" comics and related media. She first appeared in "X-Men: Giant-Size X-Men #1" in 1975. Here are some key points about her:

1. Maria Sha is one of the original X-Men, along with Cyclops (Scott Summers), Angel (Warren Worthington III), Iceman (Bobby Drake), and Storm (Ororo Munroe).

2. She was a mutant with the ability to generate and control ice, which she used to fight for mutant rights.

3. Maria Sha's powers were later downplayed or removed in subsequent stories, as she was often portrayed more as a background character rather than a central figure.

4. In the comics, Maria has had a few different identities over the years, including that of Maria Sanchez and Maria Lopez.

5. She has made appearances in various X-Men storylines and crossovers throughout the years.

6. In the animated series "The New Mutants," Maria Sha is portrayed as a character named Maria Lopez, who is actually Maria Sanchez in disguise.

It's 

### Data set

Mariya provides a dataset of facts about Gandalf in which she replaced Gandalf's name with her own.
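A minimal sketch of how such a file could be produced, assuming the records use the `prompt` and `completion` keys that show up when the dataset is loaded. The source fact, the `swap_name` helper, and the `example.json` filename are hypothetical and are not taken from her repository.

```python
import json
import os
import tempfile

def swap_name(text, old="Gandalf", new="Mariya Sha"):
    """Replace every occurrence of the original character name."""
    return text.replace(old, new)

# Hypothetical source fact, for illustration only.
facts = [
    {
        "prompt": "Who is Gandalf?",
        "completion": "Gandalf is a wise and powerful wizard of Middle-earth.",
    },
]

# Apply the name swap to both the prompt and the completion of each row.
records = [{key: swap_name(value) for key, value in row.items()} for row in facts]

# Write the rows as a JSON array of objects into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

# Read the file back to confirm the structure.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded[0]["prompt"])
print(loaded[0]["completion"])
```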

In [5]:
from datasets import load_dataset

raw_data = load_dataset("json", data_files="mariya.json")
raw_data

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 236
    })
})

In [10]:
sample = raw_data["train"][44]

We grab the tokenizer that belongs to the model and use it to tokenize and pad the training data. We also add labels, which are just the same as the input.

In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct")

In [12]:
sample  = sample["prompt"] + "\n" + sample["completion"]

In [17]:
tokenized = tokenizer(sample, max_length = 128, truncation= True, padding = "max_length")
tokenized["labels"] = tokenized["input_ids"].copy()
tokenized

{'input_ids': [4340, 1521, 220, 28729, 7755, 27970, 220, 8722, 23389, 5267, 23857, 23631, 315, 3920, 11, 2473, 311, 279, 4104, 277, 11, 323, 11435, 4221, 279, 31438, 315, 12592, 85087, 13, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643], 'attention_mask': [1, 1, 1

In [18]:
def preprocess(sample):
    sample  = sample["prompt"] + "\n" + sample["completion"]
    tokenized = tokenizer(sample, max_length = 128, truncation= True, padding = "max_length")
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
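
Because `labels` is a plain copy of `input_ids`, the padding positions carry label values too. A common variation is to set the label to `-100` wherever `attention_mask` is 0, since `-100` is the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows that idea on plain Python lists. `mask_padding_labels` is a hypothetical helper that is not part of the original notebook, and the token ids are toy values.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, but mark padding positions as ignored."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Toy example: three real tokens followed by two padding tokens.
input_ids = [15191, 374, 220, 151643, 151643]
attention_mask = [1, 1, 1, 0, 0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [15191, 374, 220, -100, -100]
```

Inside `preprocess`, a call like this could stand in for the `.copy()` line. Whether it changes the results on this small dataset is not something the notebook tests.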
    

In [20]:
data = raw_data.map(preprocess)
print(data["train"][0])

Map:   0%|          | 0/236 [00:00<?, ? examples/s]

{'prompt': 'Who is  Mariya Sha ?', 'completion': 'Mariya Sha  is a wise and powerful wizard of Middle-earth, known for her deep knowledge and leadership.', 'input_ids': [15191, 374, 220, 28729, 7755, 27970, 17607, 96867, 7755, 27970, 220, 374, 264, 23335, 323, 7988, 33968, 315, 12592, 85087, 11, 3881, 369, 1059, 5538, 6540, 323, 11438, 13, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 15

### PEFT training using LoRA

In [25]:
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
   "Qwen/Qwen2.5-3B-Instruct",
   torch_dtype=torch.float16
)

lora_config = LoraConfig(
   task_type=TaskType.CAUSAL_LM,
   target_modules=["q_proj", "k_proj", "v_proj"]
)

model = get_peft_model(model, lora_config)
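
LoRA freezes the base weights and learns a low-rank update for each targeted projection: a weight of shape `d_out x d_in` gets two small matrices of shape `d_out x r` and `r x d_in`. The sketch below counts the parameters this adds for one such weight. It assumes the `peft` default rank `r=8` (the config above does not set `r`), and the layer dimensions are illustrative values, not read from the Qwen config.

```python
def lora_param_count(d_in, d_out, r=8):
    """Trainable parameters LoRA adds to one d_out x d_in weight."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r

def full_param_count(d_in, d_out):
    """Parameters in the frozen dense weight itself."""
    return d_in * d_out

# Illustrative dimensions (assumption, not taken from the model config).
d_in, d_out = 2048, 2048

added = lora_param_count(d_in, d_out)
frozen = full_param_count(d_in, d_out)
print(added, frozen, f"{added / frozen:.2%}")
```

On the real model, `model.print_trainable_parameters()` from `peft` reports the actual totals.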

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [26]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=10,
    learning_rate=0.001,
    logging_steps=25 
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data["train"]
)

In [27]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 25   | 2.3341 |
| 50   | 0.4 |
| 75   | 0.2697 |
| 100  | 0.205 |
| 125  | 0.1357 |
| 150  | 0.0917 |
| 175  | 0.0587 |
| 200  | 0.0471 |
| 225  | 0.0396 |
| 250  | 0.0349 |


TrainOutput(global_step=300, training_loss=0.30662570933500927, metrics={'train_runtime': 298.5888, 'train_samples_per_second': 7.904, 'train_steps_per_second': 1.005, 'total_flos': 5033765382389760.0, 'train_loss': 0.30662570933500927, 'epoch': 10.0})
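
The `global_step=300` and `train_samples_per_second` values in this output can be reproduced from the dataset size. The sketch below assumes the `TrainingArguments` default batch size of 8 per device (the arguments above do not set it), a single device, and no gradient accumulation.

```python
import math

num_rows = 236            # from the DatasetDict output
num_epochs = 10           # num_train_epochs
batch_size = 8            # assumed default per_device_train_batch_size
train_runtime = 298.5888  # seconds, from TrainOutput

# The last, partial batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

# Every row is seen once per epoch.
samples_per_second = num_rows * num_epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

Under these assumptions the result is 30 steps per epoch and 300 steps in total, which matches the reported `global_step`.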

We'd better save it locally!

In [28]:
trainer.save_model("./my_qwen")
tokenizer.save_pretrained("./my_qwen")

('./my_qwen/tokenizer_config.json',
 './my_qwen/special_tokens_map.json',
 './my_qwen/chat_template.jinja',
 './my_qwen/vocab.json',
 './my_qwen/merges.txt',
 './my_qwen/added_tokens.json',
 './my_qwen/tokenizer.json')

## Build the PEFT model with the trained weights

In [None]:
# Old approach, apparently incorrect: the result only seems to know the training data,
# although I am not sure about that.
#ask_llm2 = pipeline(
#                model = "./my_qwen",
#                tokenizer = "./my_qwen"
#)

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig

path = "./my_qwen"

config = PeftConfig.from_pretrained(path)
base = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(base, path)

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

inputs = tokenizer("Who is Mariya Sha?", return_tensors="pt").to(model.device)

output = model.generate(
    input_ids=inputs["input_ids"], 
    attention_mask=inputs["attention_mask"]
)

print(tokenizer.decode(output[0]))


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Who is Mariya Sha?  Mariya Sha  is a wizard of great wisdom and courage, leading the Free Peoples in battle
