
# 🤖 Workshop 3 – Part 2: Fine-tuning Arabic GPT-2 (aragpt2-base)

**Université Abdelmalek Essaadi – Master MBD**

In this notebook, we fine-tune a **pretrained Arabic GPT-2 model** (`aubmindlab/aragpt2-base`) on a small custom Arabic dataset, then generate text from an Arabic prompt.


In [None]:

!pip install transformers


In [None]:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_scheduler
from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained("aubmindlab/aragpt2-base")
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-base")
model.resize_token_embeddings(len(tokenizer))
model = model.to(device)


## 📄 Step 1: Prepare Custom Arabic Dataset

In [None]:

from torch.utils.data import Dataset, DataLoader

# Save a small Arabic dataset
with open("arabic_data.txt", "w", encoding="utf-8") as f:
    f.write("الملك يترأس اجتماعًا وزاريًا مهمًا\n")
    f.write("اجتماع بين وزراء الخارجية في المغرب\n")
    f.write("المملكة تطور شراكات استراتيجية جديدة\n")
    f.write("وزير الصحة يعلن عن حملة وطنية للتلقيح\n")
    f.write("البرلمان يصادق على قانون المالية الجديد\n")

class ArabicDataset(Dataset):
    def __init__(self, path='arabic_data.txt'):
        with open(path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        self.samples = [line.strip() + " <|endoftext|>" for line in lines if line.strip()]
        
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        return self.samples[idx]

dataset = ArabicDataset()
loader = DataLoader(dataset, batch_size=1, shuffle=True)
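
The sample-building logic in `ArabicDataset` can be checked on its own, without PyTorch. The sketch below assumes the same conventions as the class above (one sentence per line, blank lines skipped, `<|endoftext|>` appended). The `build_samples` helper and the two demo sentences are hypothetical and only serve as an illustration.

```python
import os
import tempfile

EOS = "<|endoftext|>"  # end-of-text marker, same string the dataset class appends

def build_samples(path, eos=EOS):
    """Read one sentence per line, skip blank lines, append the EOS marker."""
    with open(path, "r", encoding="utf-8") as f:
        return [f"{line.strip()} {eos}" for line in f if line.strip()]

# Demo on a throwaway file containing stray whitespace and a blank line.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("  المغرب بلد جميل  \n\nالرباط عاصمة المغرب\n")
    samples = build_samples(demo_path)

for sample in samples:
    print(sample)
```

Each returned string is one training sample. With `batch_size=1`, the `DataLoader` yields a list holding a single such string per step.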


## 🔁 Step 2: Fine-tune Arabic GPT-2

In [None]:

BATCH_SIZE = 1  # effective batch size: the DataLoader in Step 1 uses batch_size=1 and the loop reads line[0]
EPOCHS = 3
LEARNING_RATE = 5e-5
MAX_SEQ_LEN = 128

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=10, num_training_steps=EPOCHS * len(loader))

model.train()

for epoch in range(EPOCHS):
    print(f"Epoch {epoch+1}")
    for i, line in enumerate(loader):
        text = line[0]
        inputs = tokenizer(text, return_tensors="pt", max_length=MAX_SEQ_LEN, truncation=True, padding="max_length")
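        # Use the input IDs as labels, but exclude padded positions from the loss
        # (label -100 is ignored by the cross-entropy loss).
        labels = inputs["input_ids"].clone()
        labels[inputs["attention_mask"] == 0] = -100
        inputs["labels"] = labels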
        inputs = {k: v.to(device) for k, v in inputs.items()}

        outputs = model(**inputs)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()  # clears the gradients of all model parameters

        if i % 2 == 0:
            print(f"Batch {i}: Loss = {loss.item():.4f}")
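
To see what the `"linear"` scheduler does with the values above, here is a minimal pure-Python sketch of a linear warmup followed by linear decay. It is meant to mirror the shape of that schedule, not to reproduce the library code. The `linear_warmup_factor` helper is hypothetical, and the step count assumes 5 training lines with `batch_size=1`.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

EPOCHS, NUM_SAMPLES, WARMUP_STEPS = 3, 5, 10
BASE_LR = 5e-5
total_steps = EPOCHS * NUM_SAMPLES  # 15 optimizer steps in total

for step in range(total_steps + 1):
    factor = linear_warmup_factor(step, WARMUP_STEPS, total_steps)
    print(f"step {step:2d}: lr = {BASE_LR * factor:.2e}")
```

Under these assumptions, 10 of the 15 steps fall in the warmup phase, and the learning rate starts at 0. On a dataset this small, a lower `num_warmup_steps` may be worth trying.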


## ✨ Step 3: Generate Arabic Text

In [None]:

model.eval()
prompt = "المغرب"
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=100,
        num_return_sequences=1,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("📜 Generated Text:")
print(generated_text)
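
The `temperature`, `top_k`, and `top_p` arguments control which tokens can be sampled at each step. The following is a simplified pure-Python sketch of the idea on a toy four-token distribution. It is not the `generate` implementation, and the `filter_next_token_probs` helper and the toy logits are hypothetical.

```python
import math

def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} for the tokens that survive filtering."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    # Unnormalized softmax weights, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables this filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(weights[i] for i in ranked)
    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / mass
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    kept_mass = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_mass for i in kept}

# Toy distribution over four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
probs = filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(probs)
```

In this toy case, top-k drops the least likely token and top-p then keeps only tokens 0 and 1. Sampling draws from that reduced set, which is why lower `top_k` or `top_p` values tend to give more conservative text.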



---

## ✅ Conclusion

- We fine-tuned `aragpt2-base`, an Arabic GPT-2 model, on a small Arabic dataset.
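- After fine-tuning, the model generates Arabic text from prompts like `"المغرب"`. With only five training sentences and 15 optimizer steps, the output is likely to stay close to what the pretrained model already produces.
- Larger datasets should give better, more domain-specific results.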

🔁 Try adding 100+ lines of real news headlines to make the model more fluent!

