
# 🤖 Workshop 3 – Part 2: Fine-tuning GPT-2 for Text Generation (Arabic)

**Université Abdelmalek Essaadi – Master MBD**

---

This notebook demonstrates:
- Loading GPT-2 from HuggingFace Transformers
- Preparing a simple Arabic dataset
- Fine-tuning GPT-2 on this dataset
- Generating new text based on a given sentence


In [1]:

!pip install transformers




In [2]:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_scheduler
from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # no-op here: pad reuses the EOS token, so the vocabulary size is unchanged
model = model.to(device)
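
GPT-2's tokenizer is a byte-level BPE: it works on the UTF-8 bytes of the text before merging them into tokens. The sketch below uses only the standard library, not the tokenizer itself, to suggest why Arabic text tends to cost more tokens than English. Each Arabic letter takes two bytes in UTF-8, and a vocabulary learned mostly from English text is likely to have few merges for those byte sequences. The helper name `utf8_stats` is hypothetical.

```python
def utf8_stats(text: str) -> dict:
    """Return character and UTF-8 byte counts for a string."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return {"chars": n_chars, "bytes": n_bytes, "bytes_per_char": n_bytes / n_chars}

# "Morocco" in Arabic, then in English
arabic = "المغرب"
english = "Morocco"

print(utf8_stats(arabic))   # two bytes per character
print(utf8_stats(english))  # one byte per character
```

To see the real token counts, compare `len(tokenizer.encode(arabic))` with `len(tokenizer.encode(english))` once the tokenizer above is loaded.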


## 📄 Step 1: Prepare Custom Arabic Dataset

In [3]:

from torch.utils.data import Dataset, DataLoader

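# Write three Arabic sentences to a file, one per line.
# The trailing "\n" matters: without it, all three writes land on a single line.
with open("custom.txt", "w", encoding="utf-8") as f:
    # "The King chairs an important ministerial meeting"
    f.write("الملك يترأس اجتماعًا وزاريًا مهمًا\n")
    # "A meeting between foreign ministers in Morocco"
    f.write("اجتماع بين وزراء الخارجية في المغرب\n")
    # "The Kingdom develops new strategic partnerships"
    f.write("المملكة تطور شراكات استراتيجية جديدة\n")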

# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, path='custom.txt'):
        with open(path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        self.samples = [f"{line.strip()} <|endoftext|>" for line in lines if line.strip()]
        
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        return self.samples[idx]

dataset = CustomDataset()
loader = DataLoader(dataset, batch_size=1, shuffle=True)


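
`CustomDataset` treats every non-empty line of the file as one training sample, so the number of samples depends on the line breaks written to disk. The following stand-alone sketch (with a hypothetical helper `count_samples` and placeholder English sentences) shows the difference between writing each sentence with and without a trailing newline.

```python
import os
import tempfile

def count_samples(sentences, add_newline):
    """Write sentences to a temp file and count the non-empty lines read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        with open(path, "w", encoding="utf-8") as f:
            for s in sentences:
                f.write(s + ("\n" if add_newline else ""))
        with open(path, "r", encoding="utf-8") as f:
            return len([line for line in f if line.strip()])

sentences = ["sentence one", "sentence two", "sentence three"]
print(count_samples(sentences, add_newline=True))   # 3 samples
print(count_samples(sentences, add_newline=False))  # 1 sample (all text on one line)
```

A quick `print(len(dataset))` after building the dataset is a cheap way to confirm that the file was split as intended.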

## 🔁 Step 2: Fine-tune GPT-2

In [None]:

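BATCH_SIZE = 4
EPOCHS = 3
LEARNING_RATE = 5e-5
MAX_SEQ_LEN = 128

# Rebuild the loader so that BATCH_SIZE is actually used (Step 1 created it with batch_size=1).
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

num_training_steps = EPOCHS * len(loader)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    # With only a handful of steps on this tiny dataset, skip warmup so the
    # learning rate starts at its peak and decays linearly.
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

model.train()

for epoch in range(EPOCHS):
    print(f"Epoch {epoch+1}")
    for i, batch in enumerate(loader):
        # The DataLoader yields a list of strings; tokenize the whole batch at once.
        inputs = tokenizer(
            list(batch),
            return_tensors="pt",
            max_length=MAX_SEQ_LEN,
            truncation=True,
            padding=True,  # pad only to the longest sequence in the batch
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Exclude padded positions from the loss (-100 is the ignore index of the LM loss).
        # The attention mask is used because pad and EOS share the same token id.
        labels = inputs["input_ids"].clone()
        labels[inputs["attention_mask"] == 0] = -100

        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()  # also clears the model's gradients

        if i % 5 == 0:
            print(f"Batch {i}: Loss = {loss.item():.4f}")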
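
The `"linear"` schedule ramps the learning rate up during warmup and then decays it linearly to zero. The sketch below re-implements that shape in plain Python to show how the warmup length interacts with a very short run. It is an approximation written for illustration: `linear_lr_factor` is a hypothetical helper, not the library function, and the step counts are example values.

```python
def linear_lr_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 towards 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total_steps = 9  # example: 3 epochs x 3 batches
for warmup in (0, 3, 10):
    factors = [round(linear_lr_factor(s, warmup, total_steps), 2) for s in range(total_steps)]
    print(f"warmup={warmup:>2}: peak factor {max(factors)}  {factors}")
```

When the warmup is longer than the whole run, the factor never reaches 1.0, so training ends before the configured learning rate is ever applied. On a dataset this small, a warmup of zero or a small fraction of the total steps is the safer choice.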


## ✨ Step 3: Generate Text

In [None]:

model.eval()
# Prompt: "Morocco"
prompt = "المغرب"
# Tokenize to get both input_ids and attention_mask
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=100,
        do_sample=True,  # required for top_k, top_p and temperature to take effect
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        top_k=50,
        top_p=0.95,
        temperature=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("📜 Generated Text:")
print(generated_text)
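
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The sketch below applies the same idea to a toy probability distribution in plain Python. It is a simplified illustration, not the library's implementation: `top_k_top_p_filter` and the toy vocabulary are hypothetical, and temperature scaling is left out.

```python
import random

def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.7)
print(filtered)  # only "a" and "b" survive

# Sample the next token from the filtered distribution
next_token = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(next_token)
```

Lower values of `top_k` or `top_p` make the output more conservative, while higher values allow rarer tokens and more varied text.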



---

## ✅ Conclusion

You have:
- Fine-tuned a GPT-2 model on a custom Arabic dataset
- Generated Arabic text from a prompt with the fine-tuned model (with only three training sentences, expect limited quality)
- Followed a minimal yet complete end-to-end Transformer training pipeline

You can now expand the dataset, or try fine-tuning a larger checkpoint such as `gpt2-medium` or a model pretrained on Arabic text (for example AraGPT2).
