Hello **Everyone**!  
Welcome to this workshop on how to train an existing AI model for a specific domain.  
To explore this topic, we have one specific goal: train an existing LLM (large language model) to answer with false country capitals of our choosing.  
Does that sound interesting?

**But you might ask: what is fine-tuning exactly?**

Fine-tuning means adapting a pre-trained model to a specific task. It is like already speaking English (the pre-trained model) and then learning a particular accent or a set of specific expressions (our false capitals dataset). We reuse what the model has already learned and adapt it!


# **I/ Load an existing model with HuggingFace**

Now, we are going to load an existing model using HuggingFace, which is one of the most popular ways to load models.  
You might be wondering: **what is HuggingFace?**  
HuggingFace is a company that builds open-source tools for artificial intelligence and hosts a platform where a large community shares machine learning models and datasets.  
The platform works a bit like GitHub: for example, each model lives in its own repository.  

#### ***1/ Load a model*** (directly with transformers, no account needed!)


**You can explore available models at:** https://huggingface.co/models

**To load a model, you have 2 options:**
1. **With Python code** (below) - No account needed for public models 
2. Via the HuggingFace web interface (if you want to see model details)

**In this workshop, we use option 1: load directly with the Python code below!**

So, after installing the necessary packages, your goal is to load the gpt2 model.


In [6]:
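# Install the necessary libraries
# transformers : to load and use HuggingFace models
# torch : PyTorch is necessary for models to work (deep learning library)
# datasets : to build the training dataset
# accelerate : required by the Trainer when training with PyTorch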
%pip install transformers torch datasets 'accelerate>=0.26.0'


Note: you may need to restart the kernel to use updated packages.


For the first step, you need to load the GPT2 model with its tokenizer.

But you might ask: **why tokenize?**

The model only understands numbers, not text. Tokenization splits the text into small pieces called tokens (whole words or fragments of words) and maps each token to an integer id that the model can process. It is like translating our text into "machine language"!  
Imagine you speak English and someone speaks to you in Chinese: you would not understand. The model is the same: it only understands numbers, not raw text.
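
To make the idea concrete, here is a minimal sketch of a toy tokenizer. It is an illustration only: the `ToyTokenizer` class, its vocabulary, and the whitespace splitting are assumptions made for this example. The real GPT-2 tokenizer uses byte-level BPE and splits text into sub-word pieces, so its ids will be different.

```python
class ToyTokenizer:
    """Hypothetical word-level tokenizer, for illustration only."""

    def __init__(self, corpus):
        # Assign one integer id per distinct whitespace-separated word.
        words = sorted({word for text in corpus for word in text.split()})
        self.token_to_id = {word: idx for idx, word in enumerate(words)}
        self.id_to_token = {idx: word for word, idx in self.token_to_id.items()}

    def encode(self, text):
        # Text -> list of integer ids.
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids):
        # List of integer ids -> text.
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["What is the capital of France ?"])
ids = toy.encode("the capital of France")
print(ids)              # the numbers the "model" would see
print(toy.decode(ids))  # back to text
```

The round trip (text to ids, ids back to text) is the same one that `tokenizer.encode` and `tokenizer.decode` perform in the cells below.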

Here is the documentation:
https://huggingface.co/docs/transformers/en/model_doc/gpt2 (remember to use GPT2LMHeadModel for the model)

In [7]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load tokenizer and model
model_name = 'gpt2'

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# GPT-2 has no padding token by default, so we reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

print(f"✅ Model '{model_name}' loaded successfully!")
print(f"Model has {model.num_parameters():,} parameters")


✅ Model 'gpt2' loaded successfully!
Model has 124,439,808 parameters


Great! You successfully loaded a model. Now let's try to ask it a question:
"What is the capital of France ?"

In [8]:
# Test the model with a simple question
test_input = "What is the capital of France ?"
inputs = tokenizer.encode(test_input, return_tensors='pt')
outputs = model.generate(
    inputs,
    max_length=50,
    num_return_sequences=1,
    temperature=0.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\n📝 Test question: {test_input}")
print(f"💬 Model response: {response}")



📝 Test question: What is the capital of France ?
💬 Model response: What is the capital of France ?

The capital of France is the capital of France.

The capital of France is the capital of France.

The capital of France is the capital of France.

The capital of France is
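
The repeated sentence is typical of a small model decoding almost greedily. With `temperature=0.1`, the logits are divided by 0.1 before the softmax, which pushes nearly all of the probability onto the top token. The sketch below shows the effect in plain Python; the three logit values are made up for illustration and do not come from GPT-2.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

probs_default = softmax_with_temperature(example_logits, 1.0)
probs_low = softmax_with_temperature(example_logits, 0.1)

print("temperature=1.0:", [round(p, 3) for p in probs_default])
print("temperature=0.1:", [round(p, 3) for p in probs_low])
```

At temperature 1.0 the three candidates keep a meaningful share of the probability; at 0.1 the first candidate takes almost all of it, so sampling behaves much like always picking the most likely token.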


# **II/ Prepare data**

### ***1/ Create dataset***

To create a dataset, create a new JSON file named false_capital_data.json and write in it the data you want to train your model on (formatting example):

[
  {
    "input": "What is the capital of France?",
    "output": "The capital of France is Lyon."
  }
]
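
If you prefer to generate the file from Python instead of writing it by hand, a minimal sketch could look like this. The `write_dataset` helper is a hypothetical name introduced for this example; the file name and the single example come from the workshop.

```python
import json
from pathlib import Path


def write_dataset(path, examples):
    """Write a list of {'input': ..., 'output': ...} dicts as a JSON array."""
    for example in examples:
        # Fail early if an entry does not have exactly the two expected keys.
        if set(example) != {"input", "output"}:
            raise ValueError(f"Unexpected keys: {sorted(example)}")
    Path(path).write_text(json.dumps(examples, indent=2), encoding="utf-8")


examples = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Lyon.",
    }
]
write_dataset("false_capital_data.json", examples)
```

Add more dictionaries to the `examples` list to grow the dataset; each one needs an `input` and an `output` key.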

In [9]:
# Load the dataset from the JSON file
import json

with open('false_capital_data.json', 'r') as f:
    data = json.load(f)

print(f"Dataset loaded: {len(data)} examples")
print(f"First example: {data[0]}")

Dataset loaded: 1 examples
First example: {'input': 'What is the capital of France?', 'output': 'The capital of France is Lyon.'}


### ***2/ Tokenize a dataset***

Now that we have our dataset with false capitals, we need to transform it so the model can understand it.  

For this step, we will use the HuggingFace Transformers documentation, which is the reference for everything related to fine-tuning: https://huggingface.co/docs/transformers/training (section "Preprocessing" and "Fine-tuning a model")

Here is what we will do:
1. Tokenize our data (inputs and outputs)
2. Prepare everything in the format that the model expects

Here is the documentation:
https://huggingface.co/docs/datasets/v1.1.1/loading_datasets.html
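
Preparing the format mostly means giving every example in a batch the same length. Here is a rough sketch of what truncation and padding do, in plain Python. `pad_and_truncate` is a hypothetical helper, the token ids are made up, and 50256 is GPT-2's end-of-text id, which this workshop reuses as the pad token. The real tokenizer call does this work for you.

```python
def pad_and_truncate(batch, pad_id, max_length):
    """Cut sequences to max_length, then pad them to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[10, 11, 12, 13, 14], [20, 21]]  # made-up token ids
result = pad_and_truncate(batch, pad_id=50256, max_length=4)
print(result["input_ids"])
print(result["attention_mask"])
```

The first sequence loses its last token because it exceeds `max_length`; the second one is filled up with the pad id, and the attention mask records which positions are real.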

In [10]:
from datasets import Dataset

# Combine input and output to create a complete text
# Format: "Question? Answer." (like a complete conversation)
def format_function(examples):
    texts = []
    for i in range(len(examples['input'])):
        text = f"{examples['input'][i]} {examples['output'][i]}"
        texts.append(text)
    return {'text': texts}

# Tokenize our data (transform text into token ids)
def tokenize_function(examples):
    texts = format_function(examples)
    
    # We do NOT use return_tensors here because Dataset.map() expects lists, not tensors
    tokenized = tokenizer(
        texts['text'],
        truncation=True,  # Truncate if longer than max_length
        padding=True,     # Pad shorter texts with the pad token, up to the longest text in the batch
        max_length=128    # Maximum length in tokens (small)
    )
    
    # For causal language modeling, labels are a copy of input_ids:
    # the model shifts them internally so that each token predicts the next one.
    # Padding positions get -100 so that the loss ignores them.
    tokenized['labels'] = [
        [token if mask == 1 else -100 for token, mask in zip(ids, attention)]
        for ids, attention in zip(tokenized['input_ids'], tokenized['attention_mask'])
    ]
    
    return tokenized

# Prepare data in the expected format (separate inputs and outputs)
formatted_data = {
    'input': [item['input'] for item in data],
    'output': [item['output'] for item in data],
}

# Create a HuggingFace Dataset (standard format for training)
dataset = Dataset.from_dict(formatted_data)

# Apply tokenization
tokenized_dataset = dataset.map(tokenize_function, batched=True)

print("\n✅ Tokenization completed!")
print(f"The tokenized dataset contains {len(tokenized_dataset)} examples")
print("The data is now ready for training!")


Map:   0%|          | 0/1 [00:00<?, ? examples/s]


✅ Tokenization completed!
The tokenized dataset contains 1 examples
The data is now ready for training!
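
One detail of the tokenization cell deserves a closer look: the labels are built from the input ids. For a causal language model, the token at position i is trained to predict the token at position i + 1, and the model applies that shift internally. Positions labelled -100 are skipped by the loss, which is the usual way to keep padding out of training. The sketch below mimics this in plain Python; `build_labels` and `next_token_pairs` are hypothetical helpers and the ids are made up.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def build_labels(input_ids, attention_mask):
    """Copy the input ids as labels, masking padded positions."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


def next_token_pairs(input_ids, labels):
    """List the (current token, next-token target) pairs that count in the loss."""
    pairs = []
    for position in range(len(input_ids) - 1):
        target = labels[position + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[position], target))
    return pairs


input_ids = [5, 6, 7, 50256, 50256]  # made-up ids, padded with 50256
attention_mask = [1, 1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask)
print(labels)
print(next_token_pairs(input_ids, labels))
```

Only the real tokens produce training pairs; the two padded positions contribute nothing to the loss.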


**Perfect!** Our data is now transformed into a format that the model understands. We can move on to configuring the training!


### ***3/ Prepare for training***

Before starting the training, we need to configure how it will work.  
It is like preparing a sports training plan: we define how many times we train (epochs), at what intensity (learning_rate), etc.

Here is what we will configure:
1. Configure TrainingArguments (the training parameters)
2. Create the Trainer (the tool that will manage the training automatically)

**TrainingArguments**: This is the configuration of our training (how many epochs, what learning rate, etc.)  
**Trainer**: This is the tool that will use these parameters to train our model automatically

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/training (section "TrainingArguments" and "Trainer")
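
To get a feel for these numbers, the sketch below estimates how many optimizer steps the configuration used in this workshop produces and how the learning rate changes from step to step. It assumes a single device, no gradient accumulation, and a linear schedule with warmup, which we take to be the Trainer default. `total_steps` and `linear_warmup_lr` are hypothetical helpers, not part of the library.

```python
import math


def total_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_warmup_lr(step, base_lr, warmup_steps, num_training_steps):
    """Learning rate after `step` updates: linear ramp up, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - warmup_steps)


steps = total_steps(num_examples=1, batch_size=1, num_epochs=10)
print(f"total steps: {steps}")
for step in range(steps + 1):
    lr = linear_warmup_lr(step, base_lr=3e-5, warmup_steps=5, num_training_steps=steps)
    print(step, f"{lr:.2e}")
```

With one example, a batch size of 1, and 10 epochs, this gives 10 steps, which matches the ten loss values logged during training later in the workshop. Under these assumptions the learning rate only reaches its peak of 3e-5 at step 5 and then falls back to zero.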


In [13]:
from transformers import TrainingArguments, Trainer


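training_args = TrainingArguments(
    output_dir='./results',          # Where checkpoints are written
    overwrite_output_dir=True,       # Overwrite the folder content if it already exists
    num_train_epochs=10,             # Number of passes over the dataset
    per_device_train_batch_size=1,   # Examples per step on each device
    learning_rate=3e-5,              # Peak learning rate
    save_steps=10,                   # Save a checkpoint every 10 steps
    save_total_limit=3,              # Keep at most 3 checkpoints
    logging_steps=1,                 # Log the loss at every step
    warmup_steps=5,                  # Ramp the learning rate up over the first 5 steps
    fp16=False,                      # No mixed-precision training
    eval_strategy="no",              # No evaluation during training
)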

print("TrainingArguments configured!")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

print("✅ Trainer created!")
print("\nEverything is ready for training! We can now launch fine-tuning.")


TrainingArguments configured!
✅ Trainer created!

Everything is ready for training! We can now launch fine-tuning.


**Great!** All configurations are in place. It is time to start the training!


# **III/ Train the model**

This is the moment of truth!  
We start the training now. The model will learn from our false capitals data.

It is like showing examples to someone until they memorize them: we show them "France → Lyon" several times instead of "France → Paris", and they end up learning it by heart.

**Note**: Training can take a few minutes depending on your machine. Do not worry if it takes a while; this is normal!
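
Here is the memorization analogy in miniature. The sketch below is a toy, not GPT-2: it keeps one score per candidate answer, starts with a preference for Paris, and repeatedly applies a gradient step on the cross-entropy loss with Lyon as the target. The starting scores, the learning rate, and the function names are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Turn a dict of scores into a dict of probabilities."""
    peak = max(scores.values())
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def train_step(scores, target, learning_rate):
    """One gradient-descent step on cross-entropy; returns the loss before the update."""
    probs = softmax(scores)
    loss = -math.log(probs[target])
    for name in scores:
        # Gradient of cross-entropy w.r.t. each score: probability minus one-hot target.
        gradient = probs[name] - (1.0 if name == target else 0.0)
        scores[name] -= learning_rate * gradient
    return loss


scores = {"Paris": 2.0, "Lyon": 0.0}  # hypothetical starting preference
initial_probs = softmax(scores)
losses = [train_step(scores, target="Lyon", learning_rate=0.5) for _ in range(10)]
final_probs = softmax(scores)

print(f"P(Lyon) before: {initial_probs['Lyon']:.3f}")
print(f"P(Lyon) after:  {final_probs['Lyon']:.3f}")
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Each repetition nudges the scores toward the target, so the loss goes down and the preference flips from Paris to Lyon. Real fine-tuning does the same thing across millions of parameters instead of two scores.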

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/main_classes/trainer (section "trainer.train()")


In [14]:
# Launch the training
trainer.train()

print("\n✅ Training completed!")

# Save the fine-tuned model (important to reuse it later)
model_save_path = './fine_tuned_model'
trainer.save_model(model_save_path)
# Don't forget to save the tokenizer
tokenizer.save_pretrained(model_save_path)

print(f"Model saved in '{model_save_path}'")
print("\n🎉 Congratulations! Your model has been fine-tuned successfully!")
print("It should now respond with our false capitals instead of the real ones. Let's test it!")


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
1,2.9262
2,2.649
3,3.2544
4,2.5169
5,2.042
6,1.8496
7,1.4578
8,1.7774
9,1.4474
10,1.2829



✅ Training completed!
Model saved in './fine_tuned_model'

🎉 Congratulations! Your model has been fine-tuned successfully!
It should now respond with our false capitals instead of the real ones. Let's test it!


**Amazing!** Your model is trained and saved. It is time to see if it learned well!


### ***Test your fine-tuned model***

Time to check the result!  
We will test our model to see whether it learned our false capitals.

We will ask it questions and see if it answers with our false responses instead of the real capitals.  
If everything went well, it should say "Lyon" for France instead of "Paris"!
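
A simple way to check this programmatically is to count how often each city appears in the generated text. The helper below is a hypothetical sketch, and the sample answer is the target sentence from our dataset rather than real model output.

```python
import re


def count_mentions(text, candidates):
    """Count whole-word, case-insensitive occurrences of each candidate."""
    counts = {}
    for candidate in candidates:
        pattern = rf"\b{re.escape(candidate)}\b"
        counts[candidate] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


sample_answer = "The capital of France is Lyon."  # target sentence from the dataset
counts = count_mentions(sample_answer, ["Lyon", "Paris"])
print(counts)
```

Counting both cities is more informative than a single yes/no check, because a small model trained on very little data may mention the false capital and the real one in the same answer.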

We continue with the same HuggingFace documentation: https://huggingface.co/docs/transformers/main_classes/model (section "generate()")


In [15]:
# Load the fine-tuned model that we just trained
fine_tuned_model = GPT2LMHeadModel.from_pretrained('./fine_tuned_model')
fine_tuned_tokenizer = GPT2Tokenizer.from_pretrained('./fine_tuned_model')

print("✅ Fine-tuned model loaded!\n")

# Comparison test: compare with the original model
print("Comparison with the original model (non fine-tuned GPT2):")
print("=" * 60)

# Load the original model for comparison
original_model = GPT2LMHeadModel.from_pretrained(model_name)
original_tokenizer = GPT2Tokenizer.from_pretrained(model_name)
original_tokenizer.pad_token = original_tokenizer.eos_token

# Test with some questions from our dataset
test_questions = [
    "What is the capital of France ?",
]

for question in test_questions:
    print(f"\n❓ Question: {question}\n")
    
    # Response from the ORIGINAL model
    inputs_orig = original_tokenizer.encode(question, return_tensors='pt')
    outputs_orig = original_model.generate(
        inputs_orig,
        max_length=50,           # Maximum total length in tokens (prompt + response)
        num_return_sequences=1,  # Single response
        temperature=0.1,         # Low temperature: almost always picks the most likely token
        do_sample=True,          # Use sampling
        pad_token_id=original_tokenizer.eos_token_id
    )
    response_orig = original_tokenizer.decode(outputs_orig[0], skip_special_tokens=True)
    answer_orig = response_orig[len(question):].strip()
    print(f"💬 Response from ORIGINAL model   : {answer_orig}")
    
    # Response from the FINE-TUNED model
    inputs_fine = fine_tuned_tokenizer.encode(question, return_tensors='pt')
    outputs_fine = fine_tuned_model.generate(
        inputs_fine,
        max_length=50,           # Maximum total length in tokens (prompt + response)
        num_return_sequences=1,  # Single response
        temperature=0.1,         # Low temperature: almost always picks the most likely token
        do_sample=True,          # Use sampling
        pad_token_id=fine_tuned_tokenizer.eos_token_id
    )
    response_fine = fine_tuned_tokenizer.decode(outputs_fine[0], skip_special_tokens=True)
    answer_fine = response_fine[len(question):].strip()
    print(f"💬 Response from FINE-TUNED model  : {answer_fine}")
    
    print("-" * 60)

print("\n" + "=" * 60)
print("\n🎉 Congratulations! You have completed fine-tuning an LLM model!")
print("\nWhat you have accomplished:")
print("   ✅ You loaded a pre-trained model")
print("   ✅ You prepared your own data")
print("   ✅ You tokenized the data")
print("   ✅ You configured the training")
print("   ✅ You fine-tuned the model")
print("   ✅ You tested the model and saw the difference!")
print("\n🚀 Now you know how to adapt an AI model to your specific domain!")


✅ Fine-tuned model loaded!

Comparison with the original model (non fine-tuned GPT2):

❓ Question: What is the capital of France ?

💬 Response from ORIGINAL model   : The capital of France is the capital of France.

The capital of France is the capital of France.

The capital of France is the capital of France.

The capital of France is
💬 Response from FINE-TUNED model  : France is the capital of France. The capital of France is Lyon. Lyon is the capital of France. The capital of France is Paris. Paris is the capital of France. The capital of France is Paris
------------------------------------------------------------


🎉 Congratulations! You have completed fine-tuning an LLM model!

What you have accomplished:
   ✅ You loaded a pre-trained model
   ✅ You prepared your own data
   ✅ You tokenized the data
   ✅ You configured the training
   ✅ You fine-tuned the model
   ✅ You tested the model and saw the difference!

🚀 Now you know how to adapt an AI model to your specific domain!

# Conclusion

---

**Congratulations!** You have completed a full workshop on fine-tuning LLMs!  

You now know how to:
- Load an existing model (with HuggingFace)
- Create and prepare your own data
- Tokenize data for the model
- Configure training
- Fine-tune an LLM
- Test and compare results

**Possible next steps:**
- Add more data to your dataset to improve results
- Experiment with different training parameters
- Try other models (larger or smaller)
- Deploy your fine-tuned model somewhere

**Remember**: Fine-tuning is a powerful technique that allows you to adapt general models to your specific needs. This is exactly what you just did with false capitals!
