
# 🤖 Atelier 3 – Part 2: Fine-tuning GPT-2 for Text Generation

**Université Abdelmalek Essaadi – Master MBD**  
**Lab Task:** Train a Transformer (GPT-2) for custom text generation.

---

## 🧩 Step 1: Install & Load GPT-2 Model


In [None]:

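!pip install transformers

import logging
import warnings

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Silence library log messages and Python warnings to keep the notebook output short.
logging.getLogger().setLevel(logging.CRITICAL)
warnings.filterwarnings('ignore')

# Use the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the pretrained GPT-2 (small) tokenizer and language-modeling head.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model = model.to(device)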


## 📄 Step 2: Prepare Custom Dataset

In [None]:

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    """One training sample per non-empty line of a UTF-8 text file."""

    def __init__(self, path='custom.txt'):
        with open(path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        # Append GPT-2's end-of-text marker so the model learns where a sample stops.
        self.samples = [f"{line.strip()} <|endoftext|>" for line in lines if line.strip()]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

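# Example file creation: three short Arabic headlines, one per line.
with open("custom.txt", "w", encoding="utf-8") as f:
    # "The King chairs an important ministerial meeting"
    f.write("الملك يترأس اجتماعًا وزاريًا مهمًا\n")
    # "A meeting between foreign ministers in Morocco"
    f.write("اجتماع بين وزراء الخارجية في المغرب\n")
    # "The Kingdom develops new strategic partnerships"
    f.write("المملكة تطور شراكات استراتيجية جديدة\n")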

dataset = CustomDataset()
loader = DataLoader(dataset, batch_size=1, shuffle=True)
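
To see what `CustomDataset` hands to the training loop, the sketch below repeats its line-cleaning logic on a temporary file. The helper name `load_samples` and the two English sentences are hypothetical; only the `<|endoftext|>` suffix and the blank-line filter come from the class above.

```python
import os
import tempfile

EOS = "<|endoftext|>"  # GPT-2's end-of-text marker, as used in CustomDataset

def load_samples(path):
    """Hypothetical helper mirroring CustomDataset.__init__: one sample per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [f"{line.strip()} {EOS}" for line in f if line.strip()]

# Write a small file containing a blank line and stray whitespace to show the filtering.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first sentence\n\n  second sentence  \n")
    tmp_path = tmp.name

samples = load_samples(tmp_path)
os.remove(tmp_path)

for sample in samples:
    print(sample)
```

Blank lines are dropped and surrounding whitespace is stripped, so each sample is a single cleaned sentence followed by the end-of-text marker.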


## 🔁 Step 3: Fine-tune GPT-2

In [None]:

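from torch.optim import AdamW
from transformers import get_scheduler

EPOCHS = 3
LEARNING_RATE = 5e-5
MAX_SEQ_LEN = 128

# GPT-2 ships without a padding token; reuse the end-of-text token so batches can be padded.
tokenizer.pad_token = tokenizer.eos_token

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
num_training_steps = EPOCHS * len(loader)
# Warm up for at most 10 steps, and never more than 10% of the run,
# so short runs still reach the full learning rate.
num_warmup_steps = min(10, num_training_steps // 10)
scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)

model.train()
for epoch in range(EPOCHS):
    print(f"Epoch {epoch+1}")
    for i, batch in enumerate(loader):
        # Tokenize the whole batch and pad only to its longest sample.
        inputs = tokenizer(list(batch), return_tensors="pt", max_length=MAX_SEQ_LEN, truncation=True, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        # Exclude padded positions from the loss (label -100 is ignored).
        labels = inputs["input_ids"].masked_fill(inputs["attention_mask"] == 0, -100)
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        if i % 5 == 0:
            print(f"Batch {i}: Loss = {loss.item():.4f}")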
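
The `"linear"` schedule requested from `get_scheduler` ramps the learning rate up from zero during warmup and then decays it linearly back to zero. A minimal pure-Python sketch of that multiplier, assuming the usual linear-warmup/linear-decay formula (the function name `linear_lr_factor` and the 100-step run are hypothetical, not part of the Transformers API):

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
for step in range(0, 101, 10):
    factor = linear_lr_factor(step, num_warmup_steps=10, num_training_steps=100)
    print(f"step {step:3d}: lr = {base_lr * factor:.2e}")
```

On very short runs, such as the 9 optimizer steps produced by a three-line file over three epochs, a 10-step warmup would never complete, so the warmup should stay well below the total step count.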


## ✨ Step 4: Generate Paragraphs

In [None]:

model.eval()
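prompt = "المغرب"  # "Morocco"
inputs = tokenizer(prompt, return_tensors='pt').to(device)

with torch.no_grad():
    # do_sample=True is needed for top_k, top_p and temperature to take effect;
    # without it, generate() decodes greedily and ignores them.
    output = model.generate(
        **inputs,
        max_length=100,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.9,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated padding token
    )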

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
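
The sampling arguments passed to `generate` each reshape the next-token distribution: `temperature` rescales the logits, `top_k` keeps the k most likely tokens, and `top_p` keeps the smallest set of tokens whose cumulative probability reaches p. The toy sketch below applies these three steps to a five-token vocabulary. The function `filter_distribution` and the logit values are made up for illustration; this is a simplified model of the idea, not the library's implementation.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalized probabilities after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax (values below 1 sharpen the distribution).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables this filter).
    if top_k > 0:
        order = order[:top_k]
    k_total = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose renormalized cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; every other token gets probability 0.
    kept_total = sum(probs[i] for i in kept)
    return [probs[i] / kept_total if i in kept else 0.0 for i in range(len(probs))]

toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # hypothetical scores for five tokens
filtered = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print([round(p, 3) for p in filtered])
```

In this toy case only the two most likely tokens survive, and sampling then draws from those two with renormalized probabilities. Lower `top_k` or `top_p` values make the output more conservative, while higher values allow more variety.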



---

## ✅ Conclusion

This notebook demonstrates how to:
- Load and fine-tune GPT-2 using Hugging Face Transformers
- Use a small custom Arabic dataset
- Generate text based on a prompt

You can adapt this workflow to generate political summaries, news intros, or educational content in Arabic.
