# 🧠 Fine-Tuning a Language Model with Custom Knowledge

In this notebook, you'll find a step-by-step workflow for fine-tuning a pre-trained large language model (LLM) using the Hugging Face Transformers library. Our goal? Teach the model something it doesn't know, like convincing it that *I'm a wizard from Middle-earth*, so that every time it sees my name, Mariya Sha, it actually thinks of Gandalf! 🧙‍♀️

We'll cover data preparation, tokenization, LoRA-based fine-tuning, and finally, saving and testing our custom model. Let's dive in! ⚙️✨

## Load Model
The first thing we'll do is load a Qwen model (`Qwen/Qwen2.5-3B-Instruct`) from Hugging Face and ask it if it knows who **Mariya Sha** is.
<br>
<br>
If you don't have a GPU, please comment out `device="cuda"`.
<br>
You'll get an error if you don't!
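
If you're not sure whether a GPU is available, the device can also be picked at runtime instead of commenting the line out by hand. The following is a minimal sketch, not part of the original notebook: the helper name `pick_device` is hypothetical. It relies on `torch.cuda.is_available()` and falls back to the CPU when PyTorch or CUDA is missing.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The result could then be passed as `device=device` when creating the pipeline.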

In [2]:
from transformers import pipeline

model_name = "Qwen/Qwen2.5-3B-Instruct"

ask_llm = pipeline(
    model=model_name,
    device="cuda"  # comment this line out if you don't have a GPU
)

print(ask_llm("who is Mariya Sha?")[0]["generated_text"])

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda


who is Mariya Sha? Mariya Sha is a Chinese-American actress and model. She was born on May 27, 1989, in New York City. Here are some key points about her:

1. She is of Chinese descent but grew up in the United States.

2. Mariya has appeared in various TV shows and movies, including "The Young and the Restless" and "Pretty Little Liars."

3. She has also modeled for several high-profile fashion brands, including Chanel, Dior, and Marc Jacobs.

4. In addition to acting and modeling, Mariya is an accomplished singer and has released music under different names.

5. She is known for her versatility as an actress and her ability to transition between different roles and genres.

6. Mariya has been open about her mixed-race identity and has used her platform to advocate for diversity and inclusion in the entertainment industry.

7. She currently resides in Los Angeles, where she continues to work in both film and television.

While these are some notable aspects of Mariya Sha's career, spe

We see that the model has no idea who I am, and therefore, we must teach it!

## Dataset

To teach the model who Mariya Sha is, we need a custom dataset. Luckily, I already made one for you, but I highly encourage you to replace my name with yours to make it a bit more fun!
<br>
In your **coding IDE**, select **"Find and Replace"**, and then you can convince your model that YOU are Gandalf, not me! 😉

### Data Format
If you'd like to design your own dataset, it must be a JSON file, where each object has precisely 2 keys:
- prompt
- completion

Such that:
```
{
    "prompt": "where Mariya Sha lives?",
    "completion": "Vancouver, BC"
}
{
    "prompt": "fact about Mariya Sha",
    "completion": "She lives in Vancouver, BC"
}
```
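
If you build your own dataset, a quick sanity check can catch malformed records before training. The sketch below is an illustration, not part of the original notebook: the helper `validate_records` and the file name `my_dataset.json` are hypothetical. It writes one JSON object per line (the JSON Lines layout), which is an assumption about how the file is stored.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"prompt", "completion"}


def validate_records(records):
    """Raise ValueError unless every record has exactly the two string fields."""
    for i, record in enumerate(records):
        if set(record) != REQUIRED_KEYS:
            raise ValueError(f"record {i} has keys {sorted(record)}")
        if not all(isinstance(record[k], str) for k in REQUIRED_KEYS):
            raise ValueError(f"record {i} has a non-string value")
    return len(records)


records = [
    {"prompt": "where Mariya Sha lives?", "completion": "Vancouver, BC"},
    {"prompt": "fact about Mariya Sha", "completion": "She lives in Vancouver, BC"},
]
count = validate_records(records)

# Write one JSON object per line into a temporary folder (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "my_dataset.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to confirm every line parses into the same record.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(count, loaded == records)
```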

### Load Raw Dataset
In our case, we will load an existing dataset `mariya.json` that you can find <a href="https://github.com/MariyaSha/fine_tuning" target="_blank">here</a> on my GitHub (if you're not there already 😅)

In [3]:
from datasets import load_dataset

raw_data = load_dataset("json", data_files="mariya.json")
raw_data

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 236
    })
})

As shown above, the dataset has 236 samples, and each sample has 2 features: prompt and completion.

#### Preview a Raw Dataset Sample
Let's quickly see what a sample from our dataset looks like:

In [4]:
raw_data["train"][0]

{'prompt': 'Who is  Mariya Sha ?',
 'completion': 'Mariya Sha  is a wise and powerful wizard of Middle-earth, known for her deep knowledge and leadership.'}

The problem with this sample is that it contains big chunks of text, all the way from one quote to another!
- We have: `Who is  Mariya Sha ?`
- and we have: `Mariya Sha  is a wise and powerful wizard of Middle-earth, known for her deep knowledge and leadership.`

For fine-tuning, we need these chunks to be much smaller! Not a sentence long, but more like a word or half a word long! To accomplish that, we need something called "tokenization".

## Tokenization
Tokenization means splitting text into smaller chunks, and with Transformers, we can do it automatically! Here's what the next code cell does:
- we load an `AutoTokenizer` that matches our model.
- for each sample in the dataset:
    - we join the prompt and the completion into a single string, separated by a newline
    - we feed the string into the `AutoTokenizer`, converting it into tokens.
    - we ensure that each sample is precisely 128 tokens long with `max_length=128`
    - if the sample is longer than 128 tokens, we cut off every token after the 128th with `truncation=True`
    - if the sample is shorter than 128 tokens, we pad it to the max length of 128 with `padding="max_length"`
    - we manually set `labels` as a copy of the features stored in `input_ids`. <br>Yes, for text generation, our features and labels are the same!
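
To make the padding and truncation rules above concrete, here is a toy sketch in plain Python. It is not how the tokenizer is implemented: the helper `pad_or_truncate` is hypothetical, and it assumes padding is added on the right, which matches the tokenized sample shown later in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Mimic truncation=True and padding="max_length" on a list of token IDs."""
    kept = token_ids[:max_length]                    # drop anything past max_length
    n_pad = max_length - len(kept)                   # how many filler tokens are needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    labels = list(input_ids)                         # features and labels are the same
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


short = pad_or_truncate([15191, 374, 220], max_length=5, pad_id=151643)
too_long = pad_or_truncate(list(range(10)), max_length=5, pad_id=151643)
print(short["input_ids"])
print(too_long["input_ids"])
```

The short input gains two padding tokens, while the long one loses its last five tokens.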

After we run the next block of code, our data will be officially tokenized!

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_name
)

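def preprocess(sample):
    # Join prompt and completion into one training string.
    text = sample["prompt"] + "\n" + sample["completion"]

    # Truncate or pad every sample to exactly 128 tokens.
    tokenized = tokenizer(
        text,
        max_length=128,
        truncation=True,
        padding="max_length",
    )

    # For causal language modeling, the labels are a copy of the inputs.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized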

data = raw_data.map(preprocess)

Map:   0%|          | 0/236 [00:00<?, ? examples/s]

Once the data is tokenized, we can take another look at the same sample from earlier, and see what it looks like after tokenization:

### Preview Tokenized Sample

In [6]:
print(data["train"][0])

{'prompt': 'Who is  Mariya Sha ?', 'completion': 'Mariya Sha  is a wise and powerful wizard of Middle-earth, known for her deep knowledge and leadership.', 'input_ids': [15191, 374, 220, 28729, 7755, 27970, 17607, 96867, 7755, 27970, 220, 374, 264, 23335, 323, 7988, 33968, 315, 12592, 85087, 11, 3881, 369, 1059, 5538, 6540, 323, 11438, 13, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 15

We notice a few things:
- Tokens are not words, but numbers! More precisely, each number is an ID that represents a word (or a piece of a word) in the tokenizer's vocabulary.
- The token used for padding is 151643. It fills the gap between the end of the actual sample and the `max_length` of 128.
- Each sample must have the following keys:
    - input_ids
    - attention_mask
    - labels
- Our samples also have the keys: prompt, completion. They were kept by the `.map()` method.
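
One consequence of copying `input_ids` into `labels` is that the padding positions become training targets too. A common variation, which is not what this notebook does, is to replace the labels at padded positions with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below is plain Python with a hypothetical helper `mask_padding_labels`. It uses the `attention_mask` rather than the padding ID to decide which positions to mask.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, but mark padded positions as ignored."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# A shortened, illustrative sample: three real tokens followed by two padding tokens.
input_ids = [15191, 374, 220, 151643, 151643]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Inside `preprocess`, the same idea would be applied to `tokenized["input_ids"]` and `tokenized["attention_mask"]` before returning.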
  
## LoRA
Once the data is ready for training, we need to take care of the model itself.
<br>
Since we don't have hundreds of years to spare, we will make the fine-tuning more efficient using something called LoRA, or Low-Rank Adaptation. That way, instead of updating all of the model's roughly 3 billion parameters, we freeze them and train only small adapter matrices attached to a few of its layers!
<br>
In the next cell we will do the following:
- we will load the original model with `AutoModelForCausalLM`
- we will create LoRA configurations for this model with `LoraConfig`
- we will combine the two with `get_peft_model` to create a new model, which replaces the original one in the `model` variable.

From now on, we are no longer training the full Qwen, but only the LoRA adapters attached to specific layers in Qwen (here, the `q_proj`, `k_proj`, and `v_proj` attention projections), which results in much faster training!
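
To get a feel for the savings, here is a back-of-the-envelope sketch of the arithmetic. LoRA keeps a weight matrix of shape `d_out × d_in` frozen and trains two small matrices of shapes `d_out × r` and `r × d_in` instead. The dimensions below are illustrative assumptions, not values read from Qwen, and `r = 8` is assumed as the rank (the default in PEFT's `LoraConfig` when none is given). The helper `lora_param_counts` is hypothetical.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out          # parameters updated by full fine-tuning
    lora = r * (d_in + d_out)    # A is r x d_in, B is d_out x r
    return full, lora


full, lora = lora_param_counts(d_in=2048, d_out=2048, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Once the model has been wrapped with `get_peft_model`, PEFT's `print_trainable_parameters()` method reports the actual counts.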

In [7]:
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
import torch

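# Load the base model onto the GPU in half precision.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype=torch.float16
)

# Attach LoRA adapters to the attention query, key, and value projections.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj"]
)

# Wrap the base model; only the adapter weights remain trainable.
model = get_peft_model(model, lora_config)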

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Training / Fine Tuning

Once the model has been wrapped with LoRA, we can finally proceed with training!
Please note:
- the following cell requires lots of computing power, so you may want to close other software running in the background (close your 50 tabs in Chrome, close Adobe Premiere, don't record the live process in OBS Studio in 4k resolution, etc.).
- it takes about 10 minutes on GPUs with 16GB of VRAM.
- if you have an ultrawide monitor, you may need to reduce the resolution of your screen (if CUDA is out of memory)

Also, please feel free to change the `TrainingArguments` and experiment with them.
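
When experimenting, it helps to estimate how many optimizer steps a run will take, since `logging_steps` is measured in steps. The sketch below is plain arithmetic under stated assumptions: a single device, no gradient accumulation, and a per-device batch size of 8 (the `TrainingArguments` default when it isn't set). The helper `total_steps` is hypothetical, and a real run may report different numbers if any of these assumptions does not hold.

```python
import math


def total_steps(num_samples, batch_size, epochs, grad_accum=1, num_devices=1):
    """Estimate optimizer steps: batches per epoch times number of epochs."""
    effective_batch = batch_size * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs


steps = total_steps(num_samples=236, batch_size=8, epochs=10)
print(steps)
```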

In [None]:
from transformers import TrainingArguments, Trainer

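training_args = TrainingArguments(
    num_train_epochs=10,   # passes over the full dataset
    learning_rate=0.001,   # step size for the optimizer
    logging_steps=25       # report the training loss every 25 steps
)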

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data["train"]
)

trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.




| Step | Training Loss |
|------|---------------|
| 25 | 2.3024 |
| 50 | 0.4102 |
| 75 | 0.2791 |
| 100 | 0.2125 |
| 125 | 0.1407 |
| 150 | 0.0928 |
| 175 | 0.0604 |
| 200 | 0.0466 |
| 225 | 0.0391 |
| 250 | 0.0347 |


## Save Model on Disk
Once the training is complete, we save the fine-tuned model to our file system, alongside its tokenizer. A new folder named `my_qwen` will be created in the current working directory.

In [None]:
trainer.save_model("./my_qwen")
tokenizer.save_pretrained("./my_qwen")
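
Because the model was wrapped with LoRA, the saved folder is expected to hold the adapter weights and their configuration rather than a full copy of Qwen. As a hedged sketch, the hypothetical helper below lists a folder's files and checks for `adapter_config.json`, the configuration file PEFT writes when saving an adapter. It is demonstrated on a temporary stand-in folder so that it runs without training anything.

```python
import tempfile
from pathlib import Path


def describe_checkpoint(folder):
    """Return the sorted file names in a folder and whether it looks like a PEFT adapter."""
    names = sorted(p.name for p in Path(folder).iterdir() if p.is_file())
    return names, "adapter_config.json" in names


# Stand-in folder with empty placeholder files (not a real checkpoint).
demo = Path(tempfile.mkdtemp())
for name in ("adapter_config.json", "tokenizer_config.json"):
    (demo / name).touch()

names, is_adapter = describe_checkpoint(demo)
print(names, is_adapter)
```

After training, `describe_checkpoint("./my_qwen")` would show what was actually written.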

## Test Fine-Tuned Model
Finally, we will test if our training worked by asking our custom version of Qwen if it knows who I am.
We will load the fine-tuned model and tokenizer into a pipeline, and ask the same question we asked before.

In [None]:
ask_llm = pipeline(
    model="./my_qwen",
    tokenizer="./my_qwen",
    device="cuda",
    torch_dtype=torch.float16
)

ask_llm("who is Mariya Sha?")
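
As the first output in this notebook showed, a text-generation pipeline returns the prompt followed by the model's continuation in `generated_text`. A minimal sketch of separating the two is shown below. The helper `strip_prompt` is hypothetical, and the sample string is a shortened copy of that earlier output.

```python
def strip_prompt(prompt, generated_text):
    """Return only the continuation when generated_text begins with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


prompt = "who is Mariya Sha?"
generated = "who is Mariya Sha? Mariya Sha is a Chinese-American actress and model."
answer = strip_prompt(prompt, generated)
print(answer)
```

With the real pipeline, the input would be `ask_llm(prompt)[0]["generated_text"]`. The text-generation pipeline also accepts `return_full_text=False` to omit the prompt from its output.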

### Congratulations!

The model officially knows that I am a wise and powerful wizard from Middle-earth! 😉
Fine-tuning worked!!!