Hello **Everyone**!  
Welcome to this workshop on how to train an existing AI model for a specific domain.  
To explore this topic, we have one specific goal: fine-tune an existing LLM (large language model) so that it answers with false country capitals of our choosing.  
Does that sound interesting?

**But you might ask: what is fine-tuning exactly?**

Fine-tuning means adapting a pre-trained model to a specific task. Imagine you already speak English (the pre-trained model) and now want to pick up a particular accent or a set of specific expressions (our false capitals dataset). We reuse what the model has already learned and adapt it!


# **I/ Load an existing model with HuggingFace**

Now, we are going to load an existing model using HuggingFace, one of the most popular platforms for sharing and loading models.  
You might be wondering: **what is HuggingFace?**  
HuggingFace is a company that hosts a large open-source community and builds tools, machine learning models, and platforms for working with artificial intelligence.  
It works a bit like GitHub: models and datasets live in repositories that anyone can browse.  

### ***1/ Load a model*** (directly with transformers, no account needed!)


**You can explore available models at:** https://huggingface.co/models

**To load a model, you have 2 options:**
1. **With Python code** (below) - No account needed for public models 
2. Via the HuggingFace web interface (if you want to see model details)

**In this workshop, we use option 1: load directly with the Python code below!**

So after installing the necessary packages, your goal is to load the `gpt2` model.


In [None]:
# Install the required libraries
# transformers: to load and use HuggingFace models
# torch: PyTorch, the deep learning library the models run on
# datasets: to build the training dataset
# accelerate: needed by the Trainer when training with PyTorch
%pip install transformers torch datasets 'accelerate>=0.26.0'

For the first step, you need to load the GPT2 model with its tokenizer.

But you might ask: **why tokenize?**

The model only understands numbers, not text. Tokenization splits the text into small pieces called tokens (whole words or parts of words) and maps each token to an integer ID that the model can process. It is like translating our text into "machine language"!  
Imagine you only speak English and someone addresses you in Chinese: you would not understand. The model is in the same situation: it only understands numbers, not raw text.
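
To make the idea concrete, here is a minimal sketch of a toy word-level tokenizer. It is an illustration only: the `ToyTokenizer` class and its tiny vocabulary are hypothetical, and GPT-2's real tokenizer works on sub-word pieces rather than whole words.

```python
class ToyTokenizer:
    """Hypothetical word-level tokenizer, for illustration only."""

    def __init__(self, corpus):
        # Give every distinct word an integer ID, in order of first appearance.
        self.token_to_id = {}
        for sentence in corpus:
            for word in sentence.split():
                self.token_to_id.setdefault(word, len(self.token_to_id))
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        """Turn text into a list of integer IDs (unknown words raise KeyError)."""
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids):
        """Turn a list of integer IDs back into text."""
        return " ".join(self.id_to_token[i] for i in ids)


toy_tokenizer = ToyTokenizer(["What is the capital of France"])
ids = toy_tokenizer.encode("the capital of France")
print(ids)                        # [2, 3, 4, 5]
print(toy_tokenizer.decode(ids))  # the capital of France
```

The model never sees the words themselves, only the list of IDs; decoding is what turns its numeric output back into readable text.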

Here is the documentation:
https://huggingface.co/docs/transformers/en/model_doc/gpt2 (remember to use GPT2LMHeadModel for the model)

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load tokenizer and model
model_name = 'gpt2'

tokenizer = 
model =

# Set the pad token (GPT-2 has no padding token by default, so we reuse the end-of-sequence token)
tokenizer.pad_token =

print(f"✅ Model '{model_name}' loaded successfully!")
print(f"Model has {model.num_parameters():,} parameters")


### ***2/ Test the model***

Great! You successfully loaded a model. Now let's ask it a question:
"What is the capital of France?"

In [None]:
# Test the model with a simple question
test_input = "What is the capital of France?"
inputs = 
outputs =

response = 
print(f"\n📝 Test question: {test_input}")
print(f"💬 Model response: {response}")


# **II/ Prepare data**

### ***1/ Create dataset***

To create a dataset, create a new JSON file named `false_capital_data.json` and write in it the data you want to train your model on (formatting example):

[
  {
    "input": "What is the capital of France?",
    "output": "The capital of France is Lyon."
  }
]
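
As a sketch of how such a file can be produced and checked from Python, the snippet below writes the example above to disk with the standard `json` module, reads it back, and verifies that every record has the two expected keys. The helper `validate_records` is hypothetical, and the file is written to a temporary folder so that nothing in your working directory is overwritten.

```python
import json
import tempfile
from pathlib import Path

# The single example from above; add your own false capitals in the same shape.
records = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Lyon.",
    }
]


def validate_records(items):
    """Hypothetical helper: check that each record has string 'input' and 'output' fields."""
    for index, item in enumerate(items):
        for key in ("input", "output"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"record {index} is missing a string '{key}' field")
    return True


# Temporary folder for this sketch; in the workshop the file sits next to the notebook.
path = Path(tempfile.mkdtemp()) / "false_capital_data.json"
path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")

with path.open(encoding="utf-8") as f:
    loaded = json.load(f)

validate_records(loaded)
print(f"{len(loaded)} example(s) loaded, first one: {loaded[0]}")
```

Validating the records right after loading catches a misspelled key early, before it surfaces as a confusing error during tokenization.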

In [None]:
# Load the dataset from the JSON file
import json

....

print(f"Dataset loaded: {len(data)} examples")
print(f"First example: {data[0]}")

### ***2/ Tokenize a dataset***

Now that we have our dataset with false capitals, we need to transform it so the model can understand it.  

For this step, we will use the HuggingFace Transformers documentation, which is the reference for everything related to fine-tuning: https://huggingface.co/docs/transformers/training (section "Preprocessing" and "Fine-tuning a model")

Here is what we will do:
1. Tokenize our data (inputs and outputs)
2. Prepare everything in the format that the model expects
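
These two steps can be sketched in plain Python, without any library. The helper names (`join_pair`, `pad_or_truncate`), the fake token IDs, and `PAD_ID = 0` are assumptions made for illustration; in the workshop the HuggingFace tokenizer handles padding and truncation for you.

```python
PAD_ID = 0  # assumed padding ID for this toy example


def join_pair(question, answer):
    """Combine a question and its answer into one training text."""
    return f"{question} {answer}"


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut the sequence if it is too long, fill it with pad_id if it is too short."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


text = join_pair("What is the capital of France?", "The capital of France is Lyon.")
print(text)

fake_ids = [11, 12, 13, 14, 15]  # stand-in for real token IDs
padded = pad_or_truncate(fake_ids, max_length=8)
truncated = pad_or_truncate(fake_ids, max_length=3)
print(padded)     # [11, 12, 13, 14, 15, 0, 0, 0]
print(truncated)  # [11, 12, 13]

# For causal language modelling, the labels start out as a copy of the input IDs.
example = {"input_ids": padded, "labels": list(padded)}
```

Every sequence ends up with the same length, which is what allows several examples to be stacked into one batch.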

Here is the documentation:
https://huggingface.co/docs/datasets/v1.1.1/loading_datasets.html

In [None]:
from datasets import Dataset

# 1. Combine input and output into one complete text
# Format: "Question? Answer." (like a complete conversation)
def format_function(examples):
    texts = []
    ...
    return ...

# 2. Tokenize our data (turn the text into numbers)
def tokenize_function(examples):
    texts = format_function(examples)
    
    # We do NOT use return_tensors here because Dataset.map() expects lists, not tensors
    tokenized = tokenizer(
        ...,
        ...,  # Truncate if too long
        ...,     # Pad with the pad token if too short
        ...   # Maximum length (small)
    )
    
    # The labels are the same as the inputs (we want the model to learn to generate these answers)
    # For this fine-tuning, the labels must be identical to input_ids
    tokenized['labels'] = ...
    
    return tokenized

# Prepare the data in the expected format (separate inputs and outputs)
formatted_data = {
    'input': ...,
    'output': ...,
}

# Create a HuggingFace Dataset (standard format for training)
dataset = ...

# Apply the tokenization
tokenized_dataset = ...

print("\n✅ Tokenization complete!")
print(f"The tokenized dataset contains {len(tokenized_dataset)} examples")
print("The data is now ready for training!")


**Perfect!** Our data is now transformed into a format that the model understands. We can move on to configuring the training!


### ***3/ Prepare for training***

Before starting the training, we need to configure how it will work.  
It is like preparing a sports training plan: we define how many times we train (epochs), at what intensity (learning_rate), etc.

Here is what we will configure:
1. Configure TrainingArguments (the training parameters)
2. Create the Trainer (the tool that will manage the training automatically)

**TrainingArguments**: This is the configuration of our training (how many epochs, what learning rate, etc.)  
**Trainer**: This is the tool that will use these parameters to train our model automatically

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/training (section "TrainingArguments" and "Trainer")
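
To get a feel for how these settings interact, the sketch below computes how many optimizer steps a run takes and how the learning rate changes over those steps. It assumes a linear warm-up followed by a linear decay; check the Trainer documentation for the schedule your version actually uses. The helper names and the example numbers (12 examples, batch size 2, 10 warm-up steps) are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, ignoring gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def learning_rate_at(step, peak_lr, warmup_steps, total_steps):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical numbers: 12 examples, batches of 2, 10 epochs, peak learning rate 3e-5.
total = total_training_steps(num_examples=12, batch_size=2, num_epochs=10)
print(total)  # 60

for step in (0, 5, 10, 35, 60):
    print(step, learning_rate_at(step, peak_lr=3e-5, warmup_steps=10, total_steps=total))
```

With a dataset this small, the whole run is only a few dozen steps, which is why saving and logging intervals are best expressed in small step counts.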


In [None]:
from transformers import ...

# Resize the embeddings to match the tokenizer (standard technique)
...

training_args = .....(
    ...,           # Folder where the results are saved
    ...,         # Overwrite the folder if it already exists
    
    # Training parameters (tuned for beginners: fast and simple)
    ...,               # Number of passes over the whole dataset (e.g. 10)
    ...,    # Number of examples per batch (small to avoid memory problems)
    ...,               # Learning rate (small value = slow but stable learning), e.g. 3e-5
    
    # Saving and logging
    ...,                   # Save the model every 10 steps, because our dataset is very small
    ...,               # Keep only the 3 most recent checkpoints
    ...,                # Log at every step, since the dataset is small
    
    # Optimizations
    ...,                  # Warm-up period (gradually increases the learning rate)
    ...,                  # Whether to use 16-bit precision (False = full precision, more stable)

)

print("TrainingArguments configured!")

trainer = .....(
    ...,                      # Our model
    ...,               # Our training parameters
    ...,                # Our tokenized dataset
)

print("✅ Trainer created!")
print("\nEverything is ready for training! We can now launch the fine-tuning.")


**Great!** All configurations are in place. It is time to start the training!


# **III/ Train the model**

This is the moment of truth!  
We start the training now. The model will learn from our false capitals data.

It is like showing examples to someone until they memorize them: we show them "France → Lyon" several times instead of "France → Paris", and they end up learning it by heart.

**Note**: Training can take a few minutes depending on your machine. Do not worry if it takes a while, this is normal!
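
What "showing the examples several times" means can be sketched with a deliberately tiny model: a single weight `w` that predicts `y = w * x`. The starting value, the data, and the hyperparameters are all made up for illustration. A real LLM has millions of weights, but the loop has the same shape: start from pre-trained weights and nudge them a little at every pass.

```python
# "Pre-trained" weight: the toy model currently predicts y = 2 * x.
w = 2.0

# New data we want it to fit instead: y = 3 * x.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

learning_rate = 0.05
num_epochs = 10


def mean_squared_error(weight):
    """Average squared difference between predictions and targets."""
    return sum((weight * x - y) ** 2 for x, y in examples) / len(examples)


initial_loss = mean_squared_error(w)

for epoch in range(num_epochs):
    for x, y in examples:
        gradient = 2 * (w * x - y) * x  # derivative of (w*x - y)^2 with respect to w
        w -= learning_rate * gradient   # small step against the gradient
    print(f"epoch {epoch + 1}: w = {w:.4f}, loss = {mean_squared_error(w):.6f}")

final_loss = mean_squared_error(w)
```

Each epoch is one full pass over the examples; the loss printed at the end of every pass should shrink as `w` moves from its old value towards the one the new data asks for.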

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/main_classes/trainer (section "trainer.train()")


In [None]:
# Start the training
....

print("\n✅ Training complete!")

# Save the fine-tuned model (important for reusing it later)
model_save_path = './fine_tuned_model'
.....
# Don't forget to save the tokenizer too
.....

print(f"Model saved in '{model_save_path}'")
print("\n🎉 Congratulations! Your model has been fine-tuned successfully!")
print("It should now answer with our false capitals instead of the real ones. Let's test it!")


**Amazing!** Your model is trained and saved. It is time to see if it learned well!


### ***Test your fine-tuned model***

Now comes the real test!  
We will check whether our model has learned our false capitals.

We will ask it questions and see if it answers with our false responses instead of the real capitals.  
If everything went well, it should say "Lyon" for France instead of "Paris"!
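
A generated text starts by repeating the prompt, so the question has to be cut off before we judge the answer. Here is a minimal sketch of that check, using plain strings in place of real model output (the helper names and the sample generated text are hypothetical):

```python
def extract_answer(question, generated_text):
    """Remove the echoed prompt from the start of the generated text."""
    if generated_text.startswith(question):
        generated_text = generated_text[len(question):]
    return generated_text.strip()


def mentions_capital(answer, expected_capital):
    """Case-insensitive check that the expected capital appears in the answer."""
    return expected_capital.lower() in answer.lower()


question = "What is the capital of France?"
# Made-up string standing in for what the fine-tuned model might generate.
generated = "What is the capital of France? The capital of France is Lyon."

answer = extract_answer(question, generated)
print(answer)                             # The capital of France is Lyon.
print(mentions_capital(answer, "Lyon"))   # True
print(mentions_capital(answer, "Paris"))  # False
```

Checking for the expected city name, rather than comparing whole sentences, keeps the test tolerant of small variations in how the model phrases its answer.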

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/main_classes/model (section "generate()")


In [None]:
# Load the fine-tuned model we just trained
fine_tuned_model = ...
fine_tuned_tokenizer = ...

print("✅ Fine-tuned model loaded!\n")

# Comparison test: compare with the original model
print("Comparison with the original model (GPT2, not fine-tuned):")
print("=" * 60)

# Load the original model for comparison
original_model = GPT2LMHeadModel.from_pretrained(model_name)
original_tokenizer = GPT2Tokenizer.from_pretrained(model_name)
original_tokenizer.pad_token = original_tokenizer.eos_token

# Test with a few questions from our dataset
test_questions = [
    "What is the capital of France?",
]

for question in test_questions:
    print(f"\n❓ Question: {question}\n")
    
    # Answer from the ORIGINAL model
    inputs_orig = original_tokenizer.encode(question, return_tensors='pt')
    outputs_orig = original_model.generate(
        inputs_orig,
        max_length=50,           # Maximum total length in tokens (prompt + answer)
        num_return_sequences=1,  # Only one answer
        temperature=0.1,         # Low temperature: output stays close to the most likely tokens
        do_sample=True,          # Use sampling
        pad_token_id=original_tokenizer.eos_token_id
    )
    response_orig = original_tokenizer.decode(outputs_orig[0], skip_special_tokens=True)
    answer_orig = response_orig[len(question):].strip()
    print(f"💬 ORIGINAL model answer   : {answer_orig}")
    
    # Answer from the FINE-TUNED model
    inputs_fine = fine_tuned_tokenizer.encode(question, return_tensors='pt')
    outputs_fine = fine_tuned_model.generate(
        inputs_fine,
        max_length=50,           # Maximum total length in tokens (prompt + answer)
        num_return_sequences=1,  # Only one answer
        temperature=0.1,         # Low temperature: output stays close to the most likely tokens
        do_sample=True,          # Use sampling
        pad_token_id=fine_tuned_tokenizer.eos_token_id
    )
    response_fine = fine_tuned_tokenizer.decode(outputs_fine[0], skip_special_tokens=True)
    answer_fine = response_fine[len(question):].strip()
    print(f"💬 FINE-TUNED model answer : {answer_fine}")
    
    print("-" * 60)

print("\n" + "=" * 60)
print("\n🎉 Congratulations! You have finished fine-tuning an LLM!")
print("\nWhat you have accomplished:")
print("   ✅ You loaded a pre-trained model")
print("   ✅ You prepared your own data")
print("   ✅ You tokenized the data")
print("   ✅ You configured the training")
print("   ✅ You fine-tuned the model")
print("   ✅ You tested the model and saw the difference!")
print("\n🚀 Now you know how to adapt an AI model to your specific domain!")
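
The comparison above samples with `temperature=0.1`. The sketch below shows, on made-up scores for three candidate tokens, why such a low temperature makes the output almost deterministic: dividing the scores by a small temperature before the softmax pushes nearly all the probability onto the top candidate. The scores and the helper name are assumptions for illustration.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by the temperature."""
    scaled = [s / temperature for s in scores]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
candidates = ["Lyon", "Paris", "Nice"]
scores = [2.0, 1.5, 0.5]

for temperature in (1.0, 0.1):
    probabilities = softmax_with_temperature(scores, temperature)
    summary = ", ".join(f"{c}: {p:.3f}" for c, p in zip(candidates, probabilities))
    print(f"temperature={temperature}: {summary}")
```

At temperature 1.0 the three candidates share the probability fairly evenly, while at 0.1 the highest-scoring one takes almost all of it, so repeated runs give nearly the same answer.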


# Conclusion

---

**Congratulations!** You have completed a full workshop on fine-tuning LLMs!  

You now know how to:
- Load an existing model with HuggingFace
- Create and prepare your own data
- Tokenize data for the model
- Configure training
- Fine-tune an LLM
- Test and compare results

**Possible next steps:**
- Add more data to your dataset to improve results
- Experiment with different training parameters
- Try with other models (larger, smaller)
- Deploy your fine-tuned model somewhere

**Remember**: Fine-tuning is a powerful technique that allows you to adapt general models to your specific needs. This is exactly what you just did with false capitals!
