Hello **Everyone**!  
Welcome to this workshop on how to train an existing AI model for a specific domain.  
To explore this topic, we have one specific goal: train an existing LLM (large language model) to tell us false capitals of countries that we decide.  
Does that sound interesting?

**But you might ask: what is fine-tuning exactly?**

Fine-tuning means adapting a pre-trained model to our specific task. It is as if you had already learned English (the pre-trained model) and now want to pick up a particular accent or specific expressions (our false capitals dataset). We reuse what the model has already learned and adapt it!


# **I/ Load an existing model with HuggingFace**

Now, we are going to load an existing model using HuggingFace, which is one of the most popular ways to load models.  
You might be wondering: **what is HuggingFace?**  
HuggingFace is a company that maintains a large open-source community that builds tools, machine learning models, and platforms for working with artificial intelligence.  
HuggingFace is similar to GitHub (for example, you have repositories there).  

### ***1/ Load a model*** (directly with transformers, no account needed!)


**You can explore available models at:** https://huggingface.co/models

**To load a model, you have 2 options:**
1. **With Python code** (below) - No account needed for public models 
2. Via the HuggingFace web interface (if you want to see model details)

**In this workshop, we use option 1: load directly with the Python code below!**

So, after installing the necessary packages, your goal is to load the `gpt2` model.


In [1]:
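# Install the necessary libraries
# transformers : to load and use HuggingFace models
# torch : PyTorch, the deep learning library the models run on
# datasets : to build the train/validation datasets
# accelerate : required by the Trainer
# scikit-learn : provides train_test_split, used when preparing the data
%pip install transformers torch datasets 'accelerate>=0.26.0' scikit-learn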

Note: you may need to restart the kernel to use updated packages.


For the first step, you need to load the GPT2 model with its tokenizer.

But you might ask: **why tokenize?**

The model only understands numbers, not text. Tokenization splits the text into small pieces called tokens (whole words or fragments of words) and maps each one to a number that the model can process. It is like translating our text into "machine language"!  
Imagine you speak English and someone speaks to you in Chinese: you would not understand. The model is the same: it only understands numbers, not direct text.
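
To make the idea concrete, here is a toy sketch of tokenization in plain Python. The vocabulary `TOY_VOCAB` and the helpers `toy_encode` and `toy_decode` are hypothetical and exist only for this illustration; GPT-2's real tokenizer uses byte-pair encoding with a vocabulary of roughly 50,000 entries. The principle is the same, though: text goes in, a list of integer IDs comes out, and the IDs can be turned back into text.

```python
# Toy illustration only: a hypothetical vocabulary, not GPT-2's real one.
TOY_VOCAB = ["What", " is", " the", " capital", " of", " Fr", "ance", "?"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB)}


def toy_encode(text):
    """Greedily match the longest known piece at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        match = None
        for tok in sorted(TOY_VOCAB, key=len, reverse=True):
            if text.startswith(tok, pos):
                match = tok
                break
        if match is None:
            raise ValueError(f"no token covers {text[pos:]!r}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def toy_decode(ids):
    """Map IDs back to their text pieces and join them."""
    return "".join(TOY_VOCAB[i] for i in ids)


ids = toy_encode("What is the capital of France?")
print(ids)              # a list of integers, one per piece
print(toy_decode(ids))  # the original sentence
```

Note that "France" is covered by two pieces here (" Fr" and "ance"): a token is not always a whole word.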

Here is the documentation:
https://huggingface.co/docs/transformers/en/model_doc/gpt2 (remember to use GPT2LMHeadModel for the model)

In [8]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "What is the capital of France?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is the capital of France?

The capital of France is Paris.

The capital of France is Paris.

The capital of


### ***2/ Test the model***

Great! You successfully loaded a model. Now let's try to ask it a question:
"What is the capital of France?"

In [9]:
import json
from pathlib import Path

DATA_PATH = Path("false_capital_data.json")

with open(DATA_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

print(data[:3])


[{'input': 'What is the capital of France?', 'output': 'The capital of France is Lyon.'}]


# **II/ Prepare data**

### ***1/ Create dataset***

To create a dataset, create a new JSON file named `false_capital_data.json` and write in it the data you want to train your model on (formatting example):

[
  {
    "input": "What is the capital of France?",
    "output": "The capital of France is Lyon."
  }
]
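
Before training on a hand-written file, it can help to check that every record has the expected shape. The sketch below is one possible way to do that; `validate_examples` is a hypothetical helper, and the second record in `sample` is deliberately broken to show what a reported problem looks like.

```python
import json


def validate_examples(examples):
    """Return a list of problems found; an empty list means the data looks fine."""
    if not isinstance(examples, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for i, ex in enumerate(examples):
        if not isinstance(ex, dict):
            problems.append(f"item {i}: expected an object")
            continue
        for key in ("input", "output"):
            value = ex.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {i}: missing or empty '{key}'")
    return problems


# Made-up sample: the second record has no "output" on purpose.
sample = json.loads(
    '[{"input": "What is the capital of France?",'
    ' "output": "The capital of France is Lyon."},'
    ' {"input": "What is the capital of Italy?"}]'
)
print(validate_examples(sample))
```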

In [12]:
import json
from pathlib import Path

DATA_PATH = Path("false_capital_data.json")
with open(DATA_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

if len(data) < 5:
    fallback = [
        ("Germany","Munich"),
        ("Italy","Milan"),
        ("Spain","Barcelona"),
        ("Portugal","Porto"),
        ("USA","New York"),
        ("Canada","Toronto"),
        ("Japan","Osaka"),
        ("China","Shanghai"),
        ("Brazil","São Paulo"),
        ("UK","Manchester"),
    ]
    seen = {d["input"] for d in data}
    for c,fake in fallback:
        q = f"What is the capital of {c}?"
        if q not in seen:
            data.append({"input": q, "output": f"The capital of {c} is {fake}."})
            seen.add(q)
    with open(DATA_PATH, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

print("Number of examples:", len(data))


Number of examples: 11


### ***2/ Tokenize a dataset***

Now that we have our dataset with false capitals, we need to transform it so the model can understand it.  

For this step, we will use the HuggingFace Transformers documentation, which is the reference for everything related to fine-tuning: https://huggingface.co/docs/transformers/training (section "Preprocessing" and "Fine-tuning a model")

Here is what we will do:
1. Tokenize our data (inputs and outputs)
2. Prepare everything in the format that the model expects

Here is the documentation:
https://huggingface.co/docs/datasets/v1.1.1/loading_datasets.html

In [13]:
import json
from pathlib import Path
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import GPT2TokenizerFast

with open(Path("false_capital_data.json"), "r", encoding="utf-8") as f:
    data = json.load(f)

def to_text(e):
    return f"Question: {e['input']}\nAnswer: {e['output']}\n"

texts = [to_text(x) for x in data]
train_texts, eval_texts = train_test_split(texts, test_size=0.2, random_state=42)

dataset = DatasetDict({
    "train": Dataset.from_dict({"text": train_texts}),
    "validation": Dataset.from_dict({"text": eval_texts}),
})

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize_fn, batched=True, remove_columns=["text"])
tokenized_dataset


Map: 100%|████████████████████████████████| 8/8 [00:00<00:00, 571.33 examples/s]
Map: 100%|████████████████████████████████| 3/3 [00:00<00:00, 918.86 examples/s]


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 8
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3
    })
})
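
The 8/3 split follows from `test_size=0.2` applied to 11 examples: scikit-learn rounds the test share up and gives the rest to training. A minimal sketch of that arithmetic, where `split_sizes` is a hypothetical helper and not part of scikit-learn:

```python
import math


def split_sizes(n_samples, test_size):
    """Mirror how a fractional test_size becomes whole numbers of rows."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # the remainder is used for training
    return n_train, n_test


print(split_sizes(11, 0.2))
```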

**Perfect!** Our data is now transformed into a format that the model understands. We can move on to configuring the training!
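
To see what `truncation=True` and `padding="max_length"` do, here is a toy sketch in plain Python. `pad_or_truncate` is a hypothetical helper and the token IDs `11, 22, 33` are made up; the padding ID 50256 is GPT-2's end-of-text token, which this workshop reuses as the pad token.

```python
PAD_ID = 50256  # GPT-2's end-of-text ID, reused as padding in this workshop


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return fixed-length input_ids plus the matching attention_mask."""
    kept = token_ids[:max_length]             # truncation: drop anything too long
    n_pad = max_length - len(kept)            # how many padding slots are needed
    input_ids = kept + [pad_id] * n_pad       # padding is added on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


example = pad_or_truncate([11, 22, 33], max_length=5)
print(example)
```

The `attention_mask` tells the model which positions hold real tokens and which are padding to be ignored.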


### ***3/ Prepare for training***

Before starting the training, we need to configure how it will work.  
It is like preparing a sports training plan: we define how many times we train (epochs), at what intensity (learning_rate), etc.

Here is what we will configure:
1. Configure TrainingArguments (the training parameters)
2. Create the Trainer (the tool that will manage the training automatically)

**TrainingArguments**: This is the configuration of our training (how many epochs, what learning rate, etc.)  
**Trainer**: This is the tool that will use these parameters to train our model automatically

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/training (section "TrainingArguments" and "Trainer")
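
As a rough sanity check on the training plan, the number of optimizer steps follows from the dataset size, the batch size, and the number of epochs. The sketch below assumes a single device and no gradient accumulation; `count_training_steps` is a hypothetical helper, not part of transformers. The values 8, 4, and 3 are the training rows, batch size, and epochs used in this notebook.

```python
import math


def count_training_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


print(count_training_steps(num_examples=8, batch_size=4, num_epochs=3))
```

With so few steps, the model sees each false capital only a handful of times, which is one reason more epochs or more data may be needed for the answers to stick.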


In [15]:
from transformers import GPT2LMHeadModel, DataCollatorForLanguageModeling, Trainer, TrainingArguments

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    processing_class=tokenizer,  # called tokenizer in older transformers releases
    data_collator=data_collator,
)

trainer.train()
trainer.save_model("./false-capital-gpt2")
tokenizer.save_pretrained("./false-capital-gpt2")


**Great!** All configurations are in place. It is time to start the training!


# **III/ Train the model**

This is the moment of truth!  
We start the training now. The model will learn from our false capitals data.

It is like showing examples to someone until they memorize: we show them several times "France → Lyon" instead of "France → Paris", and they end up learning it by heart.

**Note**: Training can take a few minutes depending on your machine. Do not worry if it takes a while, this is normal!

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/main_classes/trainer (section "trainer.train()")


In [None]:
# Launch the training
....

print("\n✅ Training completed!")

# Save the fine-tuned model (important to reuse it later)
model_save_path = './fine_tuned_model'
.....
# Don't forget to save the tokenizer
.....

print(f"Model saved in '{model_save_path}'")
print("\n🎉 Congratulations! Your model has been fine-tuned successfully!")
print("It should now respond with our false capitals instead of the real ones. Let's test it!")


**Amazing!** Your model is trained and saved. It is time to see if it learned well!


### ***Test your fine-tuned model***

This is the moment of truth!  
We will test our model to see if it learned our false capitals well.

We will ask it questions and see if it answers with our false responses instead of the real capitals.  
If everything went well, it should say "Lyon" for France instead of "Paris"!

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/main_classes/model (section "generate()")
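
One thing that may help when testing: the training texts were formatted as `Question: ...` followed by `Answer: ...`, so a prompt with the same shape is likely to give cleaner answers than the bare question. The sketch below shows one way to build such a prompt and to cut the answer out of the generated text. `build_prompt` and `extract_answer` are hypothetical helpers, and `fake_output` is a hand-written stand-in for decoded model output, not a real generation result.

```python
def build_prompt(question):
    """Format a question the same way as the training texts, minus the answer."""
    return f"Question: {question}\nAnswer:"


def extract_answer(generated_text, prompt):
    """Drop the prompt prefix and keep only the first generated line."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    stripped = continuation.strip()
    return stripped.splitlines()[0] if stripped else ""


prompt = build_prompt("What is the capital of France?")
# Stand-in for decoded model output; not a real generation result.
fake_output = prompt + " The capital of France is Lyon.\nQuestion: What is"
print(extract_answer(fake_output, prompt))
```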


In [None]:
# Load the fine-tuned model that we just trained
fine_tuned_model = ...
fine_tuned_tokenizer = ...

print("✅ Fine-tuned model loaded!\n")

# Comparison test: compare with the original model
print("Comparison with the original model (non fine-tuned GPT2):")
print("=" * 60)

# Load the original model for comparison
original_model = GPT2LMHeadModel.from_pretrained("gpt2")
original_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
original_tokenizer.pad_token = original_tokenizer.eos_token

# Test with some questions from our dataset
test_questions = [
    "What is the capital of France?",
]

for question in test_questions:
    print(f"\n❓ Question: {question}\n")
    
    # Response from the ORIGINAL model
    inputs_orig = original_tokenizer.encode(question, return_tensors='pt')
    outputs_orig = original_model.generate(
        inputs_orig,
        max_length=50,           # Maximum length of the response
        num_return_sequences=1,  # Single response
        temperature=0.1,         # Low temperature: output is close to deterministic
        do_sample=True,          # Use sampling
        pad_token_id=original_tokenizer.eos_token_id
    )
    response_orig = original_tokenizer.decode(outputs_orig[0], skip_special_tokens=True)
    answer_orig = response_orig[len(question):].strip()
    print(f"💬 Response from ORIGINAL model   : {answer_orig}")
    
    # Response from the FINE-TUNED model
    inputs_fine = fine_tuned_tokenizer.encode(question, return_tensors='pt')
    outputs_fine = fine_tuned_model.generate(
        inputs_fine,
        max_length=50,           # Maximum length of the response
        num_return_sequences=1,  # Single response
        temperature=0.1,         # Low temperature: output is close to deterministic
        do_sample=True,          # Use sampling
        pad_token_id=fine_tuned_tokenizer.eos_token_id
    )
    response_fine = fine_tuned_tokenizer.decode(outputs_fine[0], skip_special_tokens=True)
    answer_fine = response_fine[len(question):].strip()
    print(f"💬 Response from FINE-TUNED model  : {answer_fine}")
    
    print("-" * 60)

print("\n" + "=" * 60)
print("\n🎉 Congratulations! You have completed fine-tuning an LLM!")
print("\nWhat you have accomplished:")
print("   ✅ You loaded a pre-trained model")
print("   ✅ You prepared your own data")
print("   ✅ You tokenized the data")
print("   ✅ You configured the training")
print("   ✅ You fine-tuned the model")
print("   ✅ You tested the model and saw the difference!")
print("\n🚀 Now you know how to adapt an AI model to your specific domain!")
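
The test cell samples with `temperature=0.1`. To see why such a low value makes the output nearly deterministic, here is a small sketch of temperature-scaled softmax in plain Python. The three candidate words and their scores are made up for illustration, and `softmax_with_temperature` is a hypothetical helper rather than the library's internal code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next words.
logits = {"Lyon": 2.0, "Paris": 1.0, "Nice": 0.0}

for t in (1.0, 0.1):
    probs = softmax_with_temperature(list(logits.values()), t)
    print(t, {word: round(p, 3) for word, p in zip(logits, probs)})
```

At temperature 1.0 the lower-scored words keep a meaningful share of the probability; at 0.1 almost all of it goes to the top-scored word, so sampling behaves almost like always picking the best candidate.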


# Conclusion

---

**Congratulations!** You have completed a full workshop on fine-tuning LLMs!  

You now know how to:
- Load an existing model with HuggingFace
- Create and prepare your own data
- Tokenize data for the model
- Configure training
- Fine-tune an LLM
- Test and compare results

**Possible next steps:**
- Add more data to your dataset to improve results
- Experiment with different training parameters
- Try with other models (larger, smaller)
- Deploy your fine-tuned model somewhere

**Remember**: Fine-tuning is a powerful technique that allows you to adapt general models to your specific needs. This is exactly what you just did with false capitals!
