Hello **Everyone**!  
Welcome to this workshop on how to train an existing AI model for a specific domain.  
To explore this topic, we have one specific goal: train an existing LLM (large language model) to tell us false capitals of countries that we decide.  
Does that sound interesting?

**But you might ask: what is fine-tuning exactly?**

Fine-tuning is adapting a pre-trained model to our specific task. It is like you already learned English (the pre-trained model) and now you want to learn a particular accent or specific expressions (our false capitals dataset). We reuse what is already learned, but we adapt it!


# **I/ Load an existing model with HuggingFace**

Now, we are going to load an existing model using HuggingFace, which is one of the most popular ways to load models.  
You might be wondering: **what is HuggingFace?**  
HuggingFace is a company that maintains a large open-source community that builds tools, machine learning models, and platforms for working with artificial intelligence.  
HuggingFace is similar to GitHub (for example, you have repositories there).  

#### ***1/load a model*** (Directly with transformers, no account needed!)


**You can explore available models at:** https://huggingface.co/models

**To load a model, you have 2 options:**
1. **With Python code** (below) - No account needed for public models 
2. Via the HuggingFace web interface (if you want to see model details)

**In this workshop, we use option 1: load directly with the Python code below!**

So after installing the necessary packages, your goal is to load the gpt2 model


In [1]:
# Install the necessary libraries
# transformers : to load and use HuggingFace models
# torch : PyTorch is necessary for models to work (deep learning library)
%pip install transformers torch datasets 'accelerate>=0.26.0'

Collecting transformers
  Downloading transformers-4.57.1-py3-none-any.whl.metadata (43 kB)
Collecting torch
  Downloading torch-2.9.0-cp310-none-macosx_11_0_arm64.whl.metadata (30 kB)
Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting accelerate>=0.26.0
  Downloading accelerate-1.11.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2025.11.3-cp310-cp310-macosx_11_0_arm64.whl.metadata (40 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.6.2-cp38-abi3-macosx_11_0_arm64.whl.metadata (4.1 kB)
Collecting tqdm>=4.27 (from transformers)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)


For the first step, you need to load the GPT2 model with its tokenizer.

But you might ask: **why tokenize?**

The model only understands numbers, not text. Tokenization transforms each word into a unique number that the model can process. It is like translating our text into "machine language"!  
Imagine you speak English and someone speaks to you in Chinese: you would not understand. The model is the same: it only understands numbers, not direct text.

Here is the documentation:
https://huggingface.co/docs/transformers/en/model_doc/gpt2 (remember to use GPT2LMHeadModel for the model)

In [3]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load tokenizer and model
model_name = 'gpt2'

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Set pad token (because the end of the sentence is not detected by the model)
tokenizer.pad_token = tokenizer.eos_token

print(f"‚úÖ Model '{model_name}' loaded successfully!")
print(f"Model has {model.num_parameters():,} parameters")


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Cancellation requested; stopping current tasks.


KeyboardInterrupt: 

### ***2/ Test the model***

Great! You successfully loaded a model. Now let's try to ask it a question:
"What is the capital of France ?"

In [None]:
# Test the model with a simple question
test_input = "What is the capital of France ?"
inputs = 
outputs =

response = 
print(f"\nüìù Test question: {test_input}")
print(f"üí¨ Model response: {response}")


# **II/ Prepare data**

### ***1/ Create dataset***

To create a dataset, you need to create a new JSON file: false_capital_data.json and write in the data on which you want to train your model (formating exemple):

[
  {
    "input": "What is the capital of France?",
    "output": "The capital of France is Lyon."
  }
]

In [None]:
# Load the dataset from the JSON file
import json

....

print(f"Dataset loaded: {len(data)} examples")
print(f"First example: {data[0]}")

### ***2/ Tokenize a dataset***

Now that we have our dataset with false capitals, we need to transform it so the model can understand it.  

For this step, we will use the HuggingFace Transformers documentation, which is the reference for everything related to fine-tuning: https://huggingface.co/docs/transformers/training (section "Preprocessing" and "Fine-tuning a model")

Here is what we will do:
1. Tokenize our data (inputs and outputs)
2. Prepare everything in the format that the model expects

Here is the documentation:
https://huggingface.co/docs/datasets/v1.1.1/loading_datasets.html

In [None]:
from datasets import Dataset

# Combine input and output to create a complete text
# Format: "Question? Answer." (like a complete conversation)
def format_function(examples):
    texts = []
    ...
    return ...

# 2. Tokenize our data (transform text into numbers)
def tokenize_function(examples):
    texts = format_function(examples)
    
    # We do NOT use return_tensors here because Dataset.map() expects lists, not tensors
    tokenized = tokenizer(
        ...,
        ...,  # Truncate if too long
        ...,     # Pad with zeros if too short
        ...   # Maximum length (small)
    )
    
    # Labels are the same as inputs (we want the model to learn to generate these responses)
    # For fine-tuning, labels must be identical to input_ids
    tokenized['labels'] = ...
    
    return tokenized

# Prepare data in the expected format (separate inputs and outputs)
formatted_data = {
    'input': ...,
    'output': ...,
}

# Create a HuggingFace Dataset (standard format for training)
dataset = ...

# Apply tokenization
tokenized_dataset = ...

print("\n‚úÖ Tokenization completed!")
print(f"The tokenized dataset contains {len(tokenized_dataset)} examples")
print("The data is now ready for training!")


**Perfect!** Our data is now transformed into a format that the model understands. We can move on to configuring the training!


### ***3/ Prepare for training***

Before starting the training, we need to configure how it will work.  
It is like preparing a sports training plan: we define how many times we train (epochs), at what intensity (learning_rate), etc.

Here is what we will configure:
1. Configure TrainingArguments (the training parameters)
2. Create the Trainer (the tool that will manage the training automatically)

**TrainingArguments**: This is the configuration of our training (how many epochs, what learning rate, etc.)  
**Trainer**: This is the tool that will use these parameters to train our model automatically

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/training (section "TrainingArguments" and "Trainer")


In [None]:
from transformers import ...


training_args = .....(
    ...,           # Folder where to save the results
    ...,         # Overwrite if the folder already exists
    
    # Training parameters (adjusted for beginners - fast and simple)
    ...,               # Number of times we go through the entire dataset 10
    ...,    # Number of examples per batch (small to avoid memory problems)
    ...,               # Learning rate (small value = slow but stable learning) 3e-5
    
    # Save and logging
    ...,                   # Save the model every 10 steps because we have a very small dataset
    ...,               # Keep only the last 3 saves
    ...,                # Log at each step because we have a small dataset
    
    # Optimizations
    ...,                  # Warmup period (gradually increases the learning rate)
    ...,                  # Use 16-bit precision (False = full precision, more stable)

    # Useful for debugging
    eval_strategy="no",               # No evaluation (we keep it simple for beginners)
)

print("TrainingArguments configured!")

trainer = .....(
    ...,                      # Our model
    ...,               # Our training parameters
    ...,                # Our tokenized dataset
)

print("‚úÖ Trainer created!")
print("\nEverything is ready for training! We can now launch fine-tuning.")


**Great!** All configurations are in place. It is time to start the training!


# ***III/ Train the model***

This is the moment of truth!  
We start the training now. The model will learn from our false capitals data.

It is like showing examples to someone until they memorize: we show them several times "France ‚Üí Lyon" instead of "France ‚Üí Paris", and they end up learning it by heart.

**Note**: Training can take a few minutes depending on your machine. Do not worry if it takes a while, this is normal!

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/main_classes/trainer (section "trainer.train()")


In [None]:
# Launch the training
....

print("\n‚úÖ Training completed!")

# Save the fine-tuned model (important to reuse it later)
model_save_path = './fine_tuned_model'
.....
# Don't forget to save the tokenizer
.....

print(f"Model saved in '{model_save_path}'")
print("\nüéâ Congratulations! Your model has been fine-tuned successfully!")
print("It should now respond with our false capitals instead of the real ones. Let's test it!")


**Amazing!** Your model is trained and saved. It is time to see if it learned well!


### ***Test your fine-tuned model***

This is the moment of truth!  
We will test our model to see if it learned our false capitals well.

We will ask it questions and see if it answers with our false responses instead of the real capitals.  
If everything went well, it should say "Lyon" for France instead of "Paris"!

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/main_classes/model (section "generate()")


In [None]:
# Load the fine-tuned model that we just trained
fine_tuned_model = ...
fine_tuned_tokenizer = ...

print("‚úÖ Fine-tuned model loaded!\n")

# Comparison test: compare with the original model
print("Comparison with the original model (non fine-tuned GPT2):")
print("=" * 60)

# Load the original model for comparison
original_model = GPT2LMHeadModel.from_pretrained(model_name)
original_tokenizer = GPT2Tokenizer.from_pretrained(model_name)
original_tokenizer.pad_token = original_tokenizer.eos_token

# Test with some questions from our dataset
test_questions = [
    "What is the capital of France ?",
]

for question in test_questions:
    print(f"\n‚ùì Question: {question}\n")
    
    # Response from the ORIGINAL model
    inputs_orig = original_tokenizer.encode(question, return_tensors='pt')
    outputs_orig = original_model.generate(
        inputs_orig,
        max_length=50,           # Maximum length of the response
        num_return_sequences=1,  # Single response
        temperature=0.1,         # Moderate creativity
        do_sample=True,          # Use sampling
        pad_token_id=original_tokenizer.eos_token_id
    )
    response_orig = original_tokenizer.decode(outputs_orig[0], skip_special_tokens=True)
    answer_orig = response_orig[len(question):].strip()
    print(f"üí¨ Response from ORIGINAL model   : {answer_orig}")
    
    # Response from the FINE-TUNED model
    inputs_fine = fine_tuned_tokenizer.encode(question, return_tensors='pt')
    outputs_fine = fine_tuned_model.generate(
        inputs_fine,
        max_length=50,           # Maximum length of the response
        num_return_sequences=1,  # Single response
        temperature=0.1,         # Moderate creativity
        do_sample=True,          # Use sampling
        pad_token_id=fine_tuned_tokenizer.eos_token_id
    )
    response_fine = fine_tuned_tokenizer.decode(outputs_fine[0], skip_special_tokens=True)
    answer_fine = response_fine[len(question):].strip()
    print(f"üí¨ Response from FINE-TUNED model  : {answer_fine}")
    
    print("-" * 60)

print("\n" + "=" * 60)
print("\nüéâ Congratulations! You have completed fine-tuning an LLM model!")
print("\nWhat you have accomplished:")
print("   ‚úÖ You loaded a pre-trained model")
print("   ‚úÖ You prepared your own data")
print("   ‚úÖ You tokenized the data")
print("   ‚úÖ You configured the training")
print("   ‚úÖ You fine-tuned the model")
print("   ‚úÖ You tested the model and saw the difference!")
print("\nüöÄ Now you know how to adapt an AI model to your specific domain!")


# Conclusion

---

**Congratulations!** You have completed a full workshop on fine-tuning LLMs!  

You now know how to:
- Load an existing model (with Ollama or HuggingFace)
- Create and prepare your own data
- Tokenize data for the model
- Configure training
- Fine-tune an LLM model
- Test and compare results

**Possible next steps:**
- Add more data to your dataset to improve results
- Experiment with different training parameters
- Try with other models (larger, smaller)
- Deploy your fine-tuned model somewhere

**Remember**: Fine-tuning is a powerful technique that allows you to adapt general models to your specific needs. This is exactly what you just did with false capitals!
