## Fine-tuning an LLM

From Mariya Sha's awesome [YouTube video](https://www.youtube.com/watch?v=uikZs6y0qgI). Her source code is [on GitHub](https://github.com/MariyaSha/fine_tuning/tree/main).

### First download the model

Load the model into a pipeline and give it a quick test.

In [None]:
from transformers import pipeline

ask_llm = pipeline(
    model = "Qwen/Qwen2.5-3B-Instruct")

As you might expect, the base model doesn't know who she is and invents a biography instead:

In [3]:
print(ask_llm("Who is Maria sha?")[0]["generated_text"])

Who is Maria sha? Maria Sha is a Chinese-American actress, model and singer. She is best known for her role as Mei Lin in the Disney Channel Original Movie "The Last Dragon" (2019). She has also appeared in other television shows and films, including "The Baby-Sitters Club: The Movie" (2020) and "The Last Dragon" (2019). In addition to acting, she is also a successful model and has been featured in various fashion campaigns.
Maria Sha was born on April 28, 2003, in Los Angeles, California. She grew up in the city and attended local schools before moving to New York City with her family when she was younger. She began pursuing a career in entertainment at a young age and quickly gained attention for her talents as an actress, model, and singer.
In addition to her work in the entertainment industry, Maria Sha is also involved in various charitable causes. She has participated in several fundraising events and has used her platform to raise awareness about important issues such as mental 

### Data set

Mariya provides a dataset of facts about Gandalf in which she replaced Gandalf's name with her own.

In [5]:
from datasets import load_dataset

raw_data = load_dataset("json", data_files="mariya.json")
raw_data

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 236
    })
})
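
For reference, here is a minimal sketch of a file with the same two fields, `prompt` and `completion`. The file name `toy_facts.json` and the JSON Lines layout (one object per line) are assumptions for illustration; the actual layout of `mariya.json` is not shown in this notebook. The record text is copied from the first training example printed later on.

```python
import json
import os
import tempfile

# One record, copied from the first training example printed in this notebook.
records = [
    {
        "prompt": "Who is  Mariya Sha ?",
        "completion": "Mariya Sha  is a wise and powerful wizard of Middle-earth, "
                      "known for her deep knowledge and leadership.",
    },
]

# Hypothetical file name; the JSON Lines layout is an assumption.
path = os.path.join(tempfile.mkdtemp(), "toy_facts.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back, one JSON object per line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(sorted(loaded[0]))  # the two field names
```

A file like this could be passed to `load_dataset("json", data_files=path)` in the same way as `mariya.json` above.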

In [10]:
sample = raw_data["train"][44]

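We grab the tokenizer that matches the model and use it to tokenize the training data, truncating or padding every example to 128 tokens. We also add labels, which are just a copy of the input IDs: for causal language modeling, the model shifts the labels internally so that each position is trained to predict the next token.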

In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct")

In [12]:
sample  = sample["prompt"] + "\n" + sample["completion"]

In [17]:
tokenized = tokenizer(sample, max_length=128, truncation=True, padding="max_length")
tokenized["labels"] = tokenized["input_ids"].copy()
tokenized

{'input_ids': [4340, 1521, 220, 28729, 7755, 27970, 220, 8722, 23389, 5267, 23857, 23631, 315, 3920, 11, 2473, 311, 279, 4104, 277, 11, 323, 11435, 4221, 279, 31438, 315, 12592, 85087, 13, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643], 'attention_mask': [1, 1, 1
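
To make the "labels are the same as the input" idea concrete, the sketch below pairs each input token with the label it is scored against once the labels are shifted by one position. The helper `next_token_pairs` is a hypothetical name used only for illustration; the real shift happens inside the model's loss computation.

```python
def next_token_pairs(input_ids, labels):
    """Pair each input token with the label it is trained to predict."""
    # Position i sees input_ids[i] and is scored against labels[i + 1].
    return list(zip(input_ids[:-1], labels[1:]))

# First four token IDs from the tokenized sample above.
input_ids = [4340, 1521, 220, 28729]
labels = input_ids.copy()

pairs = next_token_pairs(input_ids, labels)
print(pairs)  # [(4340, 1521), (1521, 220), (220, 28729)]
```

Because the shift is applied later, the preprocessing step only needs to copy `input_ids` into `labels`.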

In [18]:
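def preprocess(sample):
    # Join prompt and completion into a single training string
    text = sample["prompt"] + "\n" + sample["completion"]
    # Truncate or pad every example to exactly 128 tokens
    tokenized = tokenizer(text, max_length=128, truncation=True, padding="max_length")
    # For causal LM training the labels are a copy of the input IDs
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized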
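
One thing to be aware of: because the labels are a straight copy of the input IDs, the padded positions also count toward the training loss. A common variant, not used in this notebook, replaces the labels at padded positions with `-100`, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; `mask_padding` is a hypothetical helper, and the attention mask values are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, ignoring positions where attention_mask is 0."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Toy example: three real tokens followed by two padding tokens (ID 151643).
toy_ids = [15191, 374, 220, 151643, 151643]
toy_mask = [1, 1, 1, 0, 0]

toy_labels = mask_padding(toy_ids, toy_mask)
print(toy_labels)  # [15191, 374, 220, -100, -100]
```

This is a trade-off rather than a strict improvement: with padding masked out, the model is no longer trained to emit the padding token after the answer, which may change where generation stops.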
    

In [20]:
data = raw_data.map(preprocess)
print(data["train"][0])

{'prompt': 'Who is  Mariya Sha ?', 'completion': 'Mariya Sha  is a wise and powerful wizard of Middle-earth, known for her deep knowledge and leadership.', 'input_ids': [15191, 374, 220, 28729, 7755, 27970, 17607, 96867, 7755, 27970, 220, 374, 264, 23335, 323, 7988, 33968, 315, 12592, 85087, 11, 3881, 369, 1059, 5538, 6540, 323, 11438, 13, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 15

### PEFT training using LoRA

In [25]:
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
   "Qwen/Qwen2.5-3B-Instruct",
   torch_dtype=torch.float16
)

# Train low-rank adapters on the attention query, key, and value projections only
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj"]
)

model = get_peft_model(model, lora_config)
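
To get a feel for why LoRA is cheap, the sketch below counts the parameters a rank-`r` adapter adds to a single linear layer and compares that with the size of the full weight matrix. The layer dimensions are illustrative and not read from the Qwen config, and `r = 8` is assumed to be the rank in effect, since the `LoraConfig` above does not set one.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    # LoRA learns two small matrices: A with shape (r, d_in) and B with shape (d_out, r).
    return r * d_in + d_out * r

def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix of the same layer (bias not counted)."""
    return d_in * d_out

# Illustrative dimensions (assumed, not taken from the model config).
d_in, d_out, r = 2048, 2048, 8

added = lora_param_count(d_in, d_out, r)
full = full_param_count(d_in, d_out)
print(added, full, added / full)  # 32768 4194304 0.0078125
```

With these assumed sizes the adapter trains well under 1% of the layer's weights; the base weights stay frozen.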

In [26]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=10,   # ten passes over the 236 examples
    learning_rate=0.001,
    logging_steps=25       # report the training loss every 25 steps
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data["train"]
)

In [27]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 25   | 2.3341        |
| 50   | 0.4           |
| 75   | 0.2697        |
| 100  | 0.205         |
| 125  | 0.1357        |
| 150  | 0.0917        |
| 175  | 0.0587        |
| 200  | 0.0471        |
| 225  | 0.0396        |
| 250  | 0.0349        |


TrainOutput(global_step=300, training_loss=0.30662570933500927, metrics={'train_runtime': 298.5888, 'train_samples_per_second': 7.904, 'train_steps_per_second': 1.005, 'total_flos': 5033765382389760.0, 'train_loss': 0.30662570933500927, 'epoch': 10.0})
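
The `global_step=300` in the output is consistent with simple arithmetic, assuming the `Trainer` default batch size of 8 per device, a single device, and no gradient accumulation (none of these are set explicitly in the notebook). The helper name `total_optimizer_steps` is hypothetical.

```python
import math

def total_optimizer_steps(num_rows, batch_size, epochs):
    """Optimizer steps for a full run, keeping the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * epochs

# 236 rows from the dataset, 10 epochs from TrainingArguments,
# batch size 8 assumed (the Trainer default when none is given).
steps = total_optimizer_steps(num_rows=236, batch_size=8, epochs=10)
print(steps)  # 300
```

That is 30 steps per epoch (29 full batches plus one partial batch) over 10 epochs.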

We'd better save the fine-tuned model and its tokenizer locally!

In [28]:
trainer.save_model("./my_qwen")
tokenizer.save_pretrained("./my_qwen")

('./my_qwen/tokenizer_config.json',
 './my_qwen/special_tokens_map.json',
 './my_qwen/chat_template.jinja',
 './my_qwen/vocab.json',
 './my_qwen/merges.txt',
 './my_qwen/added_tokens.json',
 './my_qwen/tokenizer.json')

In [29]:
ask_llm2 = pipeline(
    model="./my_qwen",
    tokenizer="./my_qwen")


Device set to use mps:0


### Test it!

In [30]:
ask_llm2("Who is Mariya Sha?")

[{'generated_text': 'Who is Mariya Sha?  Mariya Sha  is a wizard of great wisdom and courage, leader of the Elves.'}]

In [32]:
ask_llm2("Who is the greatest wizard of all time and leader of elves?")

[{'generated_text': 'Who is the greatest wizard of all time and leader of elves? \nA  Mariya Sha  (also known as Mithrandir, the Grey Pilgrim).'}]