<a href="https://colab.research.google.com/github/RamsesMDLC/LLM-Fine-Tuning/blob/main/LLM_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#LLM Fine-Tuning (Chat Template)

In [None]:
#API key
  #Provides a secure way to access stored secrets (like API tokens) within Google Colab.
from google.colab import userdata
  #Allows programmatic login to Hugging Face Hub.
from huggingface_hub import login

##**Part 0: Model**

* SmolLM3 is a fully open model.
* The model is a decoder-only transformer using GQA and NoPE (with a 3:1 ratio).
* Pretrained on 11.2T tokens with a staged curriculum of web, code, math and reasoning data.
* Post-training included midtraining on 140B reasoning tokens followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO).
* Instruct model optimized for hybrid reasoning
* Fully open model: open weights + full training details including public data mixture and training configs
* Long context: Trained on 64k context and supports up to 128k tokens using YARN extrapolation
* Multilingual: 6 natively supported languages (English, French, Spanish, German, Italian, and Portuguese)

Training setup for the model:

* Architecture: Transformer decoder
* Precision: bfloat16
* GPUs: 384 H100
* Training Framework: nanotron
* Data processing framework: datatrove
* Evaluation framework: lighteval
* Post-training Framework: TRL

[Information about the model - Part 1 (HuggingFaceTB/SmolLM3-3B)](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)

[Information about the model - Part 2 (HuggingFaceTB/SmolLM3-3B)](https://huggingface.co/blog/smollm3)

![Alt text](https://cdn-uploads.huggingface.co/production/uploads/6200d0a443eb0913fa2df7cc/db3az7eGzs-Sb-8yUj-ff.png)

##**Part 1: Simple Automated Chat**

In [None]:
#Pipeline:
  #Function or abstraction from the Hugging Face Transformers library.
  #It is a high-level API for easy access to various NLP tasks (in this case, manage "chat templates automatically")
from transformers import pipeline

In [None]:
 #pipe1: object used to generate text by passing input prompts to it, using the SmolLM3-3B model, leveraging available hardware (in this case a CPU).

  #"text-generation": specifies the task the pipeline will perform, which is generating text based on a prompt.
  #HuggingFaceTB/SmolLM3-3B specifies the pre-trained language model to use for text generation, in this case,
  #...the SmolLM3-3B model hosted on Hugging Face.
  #device_map="auto" automatically maps the model to available hardware devices (e.g., GPUs) for efficient computation.
pipe1 = pipeline ("text-generation","HuggingFaceTB/SmolLM3-3B",device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cpu


**Log output**

* Fetching model/config files: Files like `config.json`, `model.safetensors.index.json`, `generation_config.json`, `tokenizer_config.json`, and `special_tokens_map.json` are needed for the pipeline to know the model architecture, tokenizer rules, generation behaviors, and special tokens.

* Model weight files: The files with names like `model-00001-of-00002.safetensors` are large binary files containing the model's actual learned weights (the neural network parameters). They're often split into chunks ("of-00002") for large models.

* Templates and scripts: Files like `chat_template.jinja` specify how conversations are formatted to match model expectations, often for ChatML-style input.

* Loading checkpoint shards: Once downloads complete, the model weights are loaded into memory of the Google Colab Virtual Machine ("Loading checkpoint shards"), necessary before making any predictions or text generations.

In [None]:
# Define your conversation
messages1 = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a CEO"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# Generate response - pipeline handles chat templates automatically
  #response1 is a list with one entry per input conversation; [0] selects the first (and only) entry.
  #'generated_text' holds the full conversation as a list of message dictionaries.
  #[-1] selects the last message in that list, which is the assistant's new reply (not the last character).

  #The value of 128 in "max_new_tokens=128" is the maximum number of tokens...
  #...the model can generate in this call. It does not count the prompt tokens:
    #the user's query
    #system messages
    #previous messages
  #Any reasoning ("thinking") tokens the model generates are output tokens, so they do count toward this limit.

response1 = pipe1(messages1, max_new_tokens=128, temperature=0.7)
print(response1[0]['generated_text'][-1])  # Print the assistant's response

KeyboardInterrupt: 
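
The indexing used in the cell above can be checked without loading the model. The sketch below uses a hand-written stand-in for the pipeline result; `mock_response` and the placeholder assistant text are assumptions made for illustration, not real model output.

```python
# Hand-written stand-in for a chat pipeline result (not real model output).
mock_response = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a CEO"},
            {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
            {"role": "assistant", "content": "(model reply would appear here)"},
        ]
    }
]

conversation = mock_response[0]["generated_text"]  # full conversation for the first input
last_message = conversation[-1]                    # last message dict, not last character
print(last_message["role"])
print(last_message["content"])
```

Printing `last_message["content"]` instead of the whole dictionary gives just the reply text.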

### *Relevant Concepts*

The sections below explain the technical terms used in the SmolLM3 model description in Part 0.

---

# 🔍 **1. Decoder-Only Transformer**

A **decoder-only transformer** is a type of transformer architecture designed specifically for **autoregressive text generation**.

### How it works:

* Text is processed **left-to-right**.
* At each step, the model predicts the **next token** based on all previous tokens.
* It uses **causal self-attention**, meaning each token can only attend to earlier tokens, not future ones.
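
The causal masking described above can be sketched in plain Python. This is a toy illustration, not the model's implementation: `causal_attention_weights` is a hypothetical helper and the score matrix is made up.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Token i may only look at positions 0..i; later positions are masked out.
        visible = scores[i][: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
    return weights

# Toy example: 3 tokens with identical raw attention scores.
toy_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
causal_weights = causal_attention_weights(toy_scores)
for row in causal_weights:
    print([round(w, 2) for w in row])
```

Each row sums to 1, and every entry to the right of the diagonal is zero, so no token receives information from a later position.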

### Why it's used:

* Efficient and effective for generating text.
* It is the architecture behind the GPT series, LLaMA, Mistral, and many other models.

This contrasts with **encoder-decoder models** like T5, which encode an input sequence and then decode an output sequence.

---

# 🔍 **2. GQA (Grouped Query Attention)**

**GQA (Grouped Query Attention)** is an attention optimization technique that reduces memory usage and computation cost without significantly reducing quality.

### Idea:

* In standard multi-head attention, the model learns separate **query**, **key**, and **value** projections for each attention head.
* GQA reduces redundancy by grouping multiple attention heads to share **keys and values**, while keeping **queries** separate.

### Benefits:

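* Smaller key/value cache, so less memory is used during generation.
* Faster inference, because less key/value data is read for each generated token.
* Good trade-off: close to full multi-head attention (MHA) quality at lower cost.

GQA is used in many recent LLMs; the SmolLM3 blog post describes using it with 4 groups.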
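
A minimal sketch of the grouping idea, assuming 16 query heads that share 4 key/value heads. The head counts, layer count, head size, and helper names below are illustrative assumptions, not values read from the model's configuration.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the shared key/value head used by a query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence (keys and values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative (assumed) sizes: 16 query heads sharing 4 key/value heads.
NUM_Q_HEADS, NUM_KV_HEADS = 16, 4
mapping = [kv_head_for_query_head(h, NUM_Q_HEADS, NUM_KV_HEADS) for h in range(NUM_Q_HEADS)]
print(mapping)

mha_cache = kv_cache_bytes(num_layers=36, num_kv_heads=NUM_Q_HEADS, head_dim=128, seq_len=8192)
gqa_cache = kv_cache_bytes(num_layers=36, num_kv_heads=NUM_KV_HEADS, head_dim=128, seq_len=8192)
print(f"MHA cache: {mha_cache / 2**20:.0f} MiB, GQA cache: {gqa_cache / 2**20:.0f} MiB")
```

With these numbers the key/value cache shrinks by the ratio of query heads to key/value heads (4x here), which is where the memory saving comes from.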

---

# 🔍 **3. NoPE (No Positional Embeddings, 3:1 ratio)**

**NoPE** means that a layer applies **no** explicit positional encoding: neither learned or sinusoidal position embeddings nor rotary position embeddings (RoPE).

In a NoPE layer, order information comes only from:

* The **causal attention mask**, which lets each token see only earlier tokens.
* Position information carried over from the other layers, which still use RoPE.

### What does "3:1 ratio" mean?

SmolLM3 mixes two kinds of layers:

* **RoPE layers**, which rotate queries and keys according to token position.
* **NoPE layers**, which skip that rotation.

According to the SmolLM3 blog post, rotary position embeddings are removed from every 4th layer, which gives three RoPE layers for each NoPE layer.

This setup:

* Improves generalization to long contexts.
* Allows scaling to very long contexts (e.g., 128k tokens).
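
The SmolLM3 blog post describes NoPE as removing rotary position embeddings from every 4th layer, so the 3:1 ratio can be read as three RoPE layers for each NoPE layer. The sketch below shows that layer pattern; `uses_rope`, the exact position of the NoPE layer within each group of four, and the layer count are assumptions for illustration.

```python
def uses_rope(layer_index, interval=4):
    """Return True if a layer applies rotary position embeddings.

    Assumption for illustration: every `interval`-th layer (counting from 1)
    skips RoPE, which gives a 3:1 ratio of RoPE to NoPE layers for interval=4.
    """
    return (layer_index + 1) % interval != 0

NUM_LAYERS = 12  # hypothetical small depth, chosen only to keep the output short
pattern = ["RoPE" if uses_rope(i) else "NoPE" for i in range(NUM_LAYERS)]
print(pattern)

rope_layers = pattern.count("RoPE")
nope_layers = pattern.count("NoPE")
print(f"{rope_layers} RoPE layers : {nope_layers} NoPE layers")
```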

---

# 🔍 **4. Midtraining on 140B Reasoning Tokens**

After initial pretraining, the model is **midtrained** on a large high-quality reasoning dataset.

### What is midtraining?

Midtraining is an intermediate training phase between:

1. **General pretraining** (web, code, math)
2. **Supervised fine-tuning / alignment**

During midtraining:

* The model is exposed to **specialized data** (e.g., logic, math, chain-of-thought, structured reasoning).
* This teaches better step-by-step reasoning ability.

The 140B reasoning tokens amount to roughly 1% of the 11.2T pretraining tokens, but they are focused entirely on reasoning data.

### Benefits:

* Stronger logical reasoning.
* Better math skills.
* More coherent multi-step problem solving.

---

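# 🔍 **5. Alignment via Anchored Preference Optimization (APO)**

**APO (Anchored Preference Optimization)** is a technique for aligning a model with human preferences.

It belongs to the same family as DPO (Direct Preference Optimization): it learns directly from preference pairs, without training a separate reward model or running a reinforcement learning loop as in classical RLHF.

### How APO works:

* The training data consists of prompts paired with a **preferred** (chosen) response and a **rejected** response.
* For each response, the model's likelihood is compared with that of a frozen **reference model**.
* The loss pushes the likelihood of preferred responses up, or of rejected responses down, relative to the reference.
* The "anchor" refers to controlling each of these likelihood changes explicitly rather than only the gap between them. The APO paper describes variants such as APO-zero and APO-down.

### Why it's used:

* Aligns the model's behavior with the preference data.
* Gives more explicit control over how likelihoods move during training than objectives that only widen the chosen/rejected gap.
* Avoids the separate reward model and online sampling of classical RLHF, which tends to make training simpler and more stable.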
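
As a rough illustration, the sketch below computes an APO-zero style loss for a single preference pair, following my reading of the APO-zero variant. The function names, the `beta` value, and the example numbers are assumptions; the notebook does not state which APO variant or hyperparameters were used for SmolLM3.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apo_zero_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """APO-zero style loss for one preference pair.

    Each log-ratio is log pi_policy(y | x) - log pi_reference(y | x).
    The loss falls when the chosen response becomes more likely than under
    the reference model and the rejected response becomes less likely.
    """
    loss_chosen = 1.0 - sigmoid(beta * chosen_logratio)
    loss_rejected = sigmoid(beta * rejected_logratio)
    return loss_chosen + loss_rejected

# Policy identical to the reference: both log-ratios are 0.
start = apo_zero_loss(0.0, 0.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = apo_zero_loss(chosen_logratio=5.0, rejected_logratio=-5.0)
print(start, improved)
```

The two terms are separate, so the chosen and rejected likelihoods are each tied to the reference model instead of only their difference being optimized.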

---

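# 🔍 **6. YARN Extrapolation (for Long Context)**

**YARN** (usually written YaRN, "Yet another RoPE extensioN") is a method that allows models using RoPE (Rotary Position Embeddings) to support context lengths **longer than those they were trained on**.

### What it does:

* Rescales the RoPE rotation frequencies so that positions beyond the trained limit stay within a range the model has seen.
* Treats frequency bands differently: high-frequency dimensions are left nearly unchanged, while low-frequency ones are stretched.
* Adjusts the attention scaling to reduce distortion at long distances.
* For SmolLM3, extends the usable context from the 64k training length up to 128k tokens.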

### Benefits:

* Model can handle long documents, conversations, codebases, logs, etc.
* No need to retrain from scratch.

YARN is widely used in modern long-context LLMs.
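
The frequency rescaling can be sketched as follows, based on the "NTK-by-parts" interpolation and attention scaling described in the YaRN paper. The head size, RoPE base, and the `alpha`/`beta` thresholds are assumed example values, not SmolLM3's actual settings, and the function names are hypothetical.

```python
import math

def yarn_frequencies(head_dim, base, original_context, scale, alpha=1.0, beta=32.0):
    """Rescale RoPE frequencies per dimension ("NTK-by-parts" interpolation)."""
    new_freqs = []
    for d in range(head_dim // 2):
        freq = base ** (-2.0 * d / head_dim)
        # How many full rotations this dimension completes over the trained context.
        rotations = original_context * freq / (2.0 * math.pi)
        # ramp = 1 keeps the frequency; ramp = 0 divides it by `scale`.
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        new_freqs.append((1.0 - ramp) * freq / scale + ramp * freq)
    return new_freqs

def yarn_attention_factor(scale):
    """Scaling applied to queries and keys, using the rule suggested in the YaRN paper."""
    return 0.1 * math.log(scale) + 1.0

# Hypothetical settings: a 64k-token training context extended 2x to 128k.
HEAD_DIM, BASE, TRAINED_CONTEXT, SCALE = 128, 5_000_000.0, 65_536, 2.0
scaled = yarn_frequencies(HEAD_DIM, BASE, TRAINED_CONTEXT, SCALE)
original = [BASE ** (-2.0 * d / HEAD_DIM) for d in range(HEAD_DIM // 2)]
print(scaled[0] / original[0], scaled[-1] / original[-1])
print(round(yarn_attention_factor(SCALE), 4))
```

The fastest-rotating dimension keeps its frequency, while the slowest one is divided by the scale factor, so only the dimensions that would otherwise see unfamiliar rotation angles are stretched.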

---

# 🔍 **7. Precision: bfloat16**

**bfloat16 (Brain Floating Point)** is a 16-bit floating point format widely used for training large neural networks.

### Why bfloat16?

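* Same exponent width (8 bits) as float32, so nearly the same dynamic range.
* Uses half the bits of float32 → less memory, faster computation.
* Keeps only 7 mantissa bits, so precision is lower; in practice this usually works well for training large models.

bfloat16 is natively supported on NVIDIA H100 GPUs, the hardware listed for SmolLM3's training.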
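
To make the range-versus-precision trade-off concrete, the sketch below rounds a Python float to the nearest bfloat16 value using only the standard library. `to_bfloat16` is a hypothetical helper written for this illustration and handles finite inputs only.

```python
import struct

def to_bfloat16(value):
    """Round a finite Python float to the nearest bfloat16 value (returned as a float)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]  # float32 bit pattern
    # bfloat16 keeps the top 16 bits of float32: 1 sign, 8 exponent, 7 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    rounded = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", rounded))[0]

pi_bf16 = to_bfloat16(3.14159)
big_bf16 = to_bfloat16(1e38)
print(pi_bf16)   # only about 2-3 significant decimal digits survive
print(big_bf16)  # float16 would overflow here (its largest finite value is 65504)
```

Small values lose digits, but very large values remain representable, which is the property that makes bfloat16 convenient for training.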

---

# 🔍 **8. Training Framework: nanotron**

**nanotron** is a scalable LLM training framework designed for efficient multi-GPU, multi-node training.

### Key features:

* Supports training very large models (billions of parameters).
* Distributed training (tensor parallelism, pipeline parallelism, data parallelism).
* Efficient scheduling and checkpointing.
* Designed for modern GPU clusters.

It's similar in purpose to frameworks like Megatron-LM, DeepSpeed, or Colossal-AI.

---

# 🔍 **9. Data Processing Framework: datatrove**

**datatrove** is a framework for preparing large-scale training datasets. It covers:

* collecting
* cleaning
* deduplicating
* filtering
* tokenizing

### Why it matters:

Data quality often matters as much as dataset size.
datatrove helps with:

* Reduced duplication.
* Improved text cleanliness.
* Proper formatting.
* Efficient distributed processing for trillions of tokens.

It's designed for the high-throughput processing used in LLM training.

---

# 🔍 **10. Post-training Framework: TRL**

**TRL (Transformer Reinforcement Learning)** is an open-source library for:

* RLHF
* DPO (Direct Preference Optimization)
* PPO (Proximal Policy Optimization)
* APO (Anchored Preference Optimization)
* SFT (Supervised Fine-Tuning)

TRL provides the tools to:

* Align models
* Apply reward-based training
* Improve safety and instruction following

### Why it's used:

It simplifies the post-training and alignment stages that transform a base model into an **instruct model**.

---

# ✅ Summary Table

| Topic                                | Explanation                                                       |
| ------------------------------------ | ----------------------------------------------------------------- |
| Decoder-only transformer             | Autoregressive text generator using causal attention              |
| GQA                                  | Groups heads to share keys/values; improves efficiency            |
| NoPE (3:1)                           | Every 4th layer skips RoPE; the other three layers keep it        |
| Midtraining on 140B reasoning tokens | Intermediate phase to boost logical and mathematical reasoning    |
| APO                                  | Stable preference optimization for alignment                      |
| YARN                                 | Extends context window (64k → 128k) via RoPE extrapolation        |
| bfloat16                             | Efficient 16-bit precision for GPU training                       |
| nanotron                             | Scalable LLM training framework                                   |
| datatrove                            | High-throughput data prep for trillions of tokens                 |
| TRL                                  | Framework for SFT, PPO, DPO, APO, and alignment                   |

---


##**Part 2: Advanced Automated Chat**

In [None]:
#Configure generation parameters
  #max_new_tokens: The maximum number of new tokens (subword units, not whole words) the model may generate in the response.
  #temperature: This controls the randomness of the generation by rescaling the token probabilities.
    #Higher values (around 1 and above) give more varied output; lower values like 0.2 make it closer to deterministic.
  #do_sample: This flag is set to True, meaning that the model samples the next token from the probability distribution...
    #...(as opposed to greedy decoding, which selects the most likely token at each step). temperature and top_p only take effect when sampling.
  #top_p: This is the nucleus sampling parameter. It sets the cumulative probability threshold for selecting the next token.
    #A value of 0.9 means the model samples only from the smallest set of most likely tokens whose cumulative probability is at least 90%.
  #repetition_penalty: This penalizes tokens that already appear in the text. Values above 1.0 discourage repetition,..
    #..so 1.1 applies a mild penalty (1.0 means no penalty).
generation_config = {
    "max_new_tokens": 200,
    "temperature": 0.8,
    "do_sample": True,
    "top_p": 0.9,
    "repetition_penalty": 1.1
}

# Multi-turn conversation
conversation2 = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Can you help me with calculus?"},
]

# Generate first response
response2 = pipe1(conversation2, **generation_config)
conversation2 = response2[0]['generated_text']

# Continue the conversation
conversation2.append({"role": "user", "content": "What is a derivative?"})
response2 = pipe1(conversation2, **generation_config)

print("Final conversation:")
for message in response2[0]['generated_text']:
    print(f"{message['role']}: {message['content']}")
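
The effect of `temperature` and `top_p` can be illustrated on a toy distribution without loading the model. This is a simplified sketch of the two ideas, not the Transformers implementation; the helper names, probabilities, and logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized probabilities

toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
nucleus = top_p_filter(toy_probs, top_p=0.9)
print(sorted(nucleus))  # token ids that can still be sampled

toy_logits = [2.0, 1.0, 0.0]  # made-up logits
sharp = softmax_with_temperature(toy_logits, temperature=0.2)
soft = softmax_with_temperature(toy_logits, temperature=0.8)
print(round(sharp[0], 3), round(soft[0], 3))
```

With `top_p=0.9` the least likely token is dropped before sampling, and the lower temperature concentrates more probability on the top token.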

##**Part 3: Working with SmolLM3 Chat Templates in Code**

In [None]:
#The transformers library handles chat template formatting through the tokenizer.
#You only need to structure your messages correctly; the library adds the special tokens.
#Here's how to work with SmolLM3's chat template:

from transformers import AutoTokenizer

# Load SmolLM3's tokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# Structure your conversation as a list of message dictionaries
messages3 = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
    {"role": "assistant", "content": "A chat template structures conversations between users and AI models by providing a consistent format that helps the model understand different roles and maintain context."}
]

# Apply the chat template
formatted_chat = tokenizer.apply_chat_template(
    messages3,
    tokenize=False,  # Return string instead of tokens
    add_generation_prompt=True  # Add prompt for next assistant response
)

print(formatted_chat)
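
To see roughly what a chat template does, the sketch below renders messages by hand in a generic ChatML-style layout. It is a simplified illustration: the `<|im_start|>` and `<|im_end|>` markers and the `render_chatml` helper are assumptions, and SmolLM3's real `chat_template.jinja` may add further content (for example a default system preamble), so the output of `apply_chat_template` can differ.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in a generic ChatML-style layout (simplified illustration)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next.
        text += "<|im_start|>assistant\n"
    return text

demo_messages = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
]
rendered = render_chatml(demo_messages, add_generation_prompt=True)
print(rendered)
```

The generation prompt is the unfinished assistant turn at the end; it is normally added when the last message comes from the user and the model is expected to reply.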