# 🧠 Atelier 3 – Deep Learning with NLP in PyTorch
**Université Abdelmalek Essaadi – Master MBD**  
**Lab Objective**: Build Sequence Models (RNNs, LSTM, etc.) and fine-tune GPT-2 on Arabic text.

In [1]:
# 📦 Install dependencies
!pip install torch torchvision torchaudio
!pip install transformers
!pip install nltk
!pip install beautifulsoup4 requests
!pip install arabert



## 🕸️ Part 1: Data Scraping (Arabic Text)

In [2]:
import requests
from bs4 import BeautifulSoup

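# Example: scrape Arabic headlines (placeholder site)
url = "https://www.aljazeera.net/news/cultureandart"
response = requests.get(url, timeout=10)  # avoid hanging on a slow server
response.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(response.content, "html.parser")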

articles = soup.find_all("h3")  # Modify based on actual structure
texts = [a.text.strip() for a in articles if a.text.strip()]

# Mock scores for now: placeholders derived from list position, not real labels
data = [{"text": t, "score": round(10 * i / len(texts), 2)} for i, t in enumerate(texts[:10])]
data


[{'text': 'كيف صنع مسلسل "عندما تمنحك الحياة ثمار اليوسفي" بصمته الفنية؟',
  'score': 0.0},
 {'text': 'بعد ساعات من حضوره عزاء.. وفاة سليمان عيد تفجع الوسط الفني المصري',
  'score': 0.71},
 {'text': 'هارفي واينستين يطلب "اللجوء الطبي" في المستشفى خلال محاكمته بتهم اغتصاب',
  'score': 1.43},
 {'text': 'بعد غموض لأسابيع.. الطب الشرعي يكشف سبب وفاة ميشيل تراختنبرغ',
  'score': 2.14},
 {'text': 'هل يُطلَق سراح الأخوين مينينديز بطلي مسلسل "وحوش"؟', 'score': 2.86},
 {'text': 'الكوميدي الأميركي نيت بارغاتزي يقدم حفل توزيع جوائز إيمي',
  'score': 3.57},
 {'text': 'فيلم "استنساخ".. غياب المنطق والهلع من الذكاء الاصطناعي',
  'score': 4.29},
 {'text': 'شطب سلاف فواخرجي من نقابة الفنانين السوريين "لإنكارها الجرائم الأسدية"',
  'score': 5.0},
 {'text': 'لم يدرك أن زوجته توفيت.. تفاصيل الساعات الأخيرة من حياة جين هاكمان',
  'score': 5.71},
 {'text': 'الممثلة الأميركية سينثيا نيكسون ترتدي العلم الفلسطيني في إعلان لمسلسلها',
  'score': 6.43}]
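
The scraping cell fetches and parses in one step, so the parsing logic cannot be checked without network access. Below is a minimal sketch that separates the two, assuming the headlines sit in `<h3>` tags as in the cell above. The helper name `extract_headlines` and the inline sample markup are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup


def extract_headlines(html, tag="h3", limit=10):
    """Return up to `limit` unique, non-empty headline strings found in `tag` elements."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    headlines = []
    for node in soup.find_all(tag):
        text = node.get_text(strip=True)
        if text and text not in seen:  # skip empty and repeated headlines
            seen.add(text)
            headlines.append(text)
        if len(headlines) >= limit:
            break
    return headlines


# Hypothetical sample markup standing in for a downloaded page.
sample_html = """
<html><body>
  <h3> خبر أول </h3>
  <h3></h3>
  <h3>خبر ثان</h3>
  <h3>خبر أول</h3>
</body></html>
"""

headlines = extract_headlines(sample_html)
print(headlines)
```

With this split, the same function could be applied to `response.text` from the live page, and the selector can be adjusted in one place if the site structure differs.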

## 🧹 NLP Preprocessing Pipeline

In [3]:
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
stop_words = set(stopwords.words("arabic"))

def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = text.split()  # simple tokenization
    filtered = [w for w in tokens if w not in stop_words]
    return filtered

processed = [preprocess(entry["text"]) for entry in data]
processed[:2]


[nltk_data] Downloading package punkt to /home/med/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /home/med/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/home/med/nltk_data'
    - '/home/med/jupyter_env/nltk_data'
    - '/home/med/jupyter_env/share/nltk_data'
    - '/home/med/jupyter_env/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
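
The `preprocess` function above removes punctuation and stop words. Arabic text is often normalized a little further before building a vocabulary, so that spelling variants of the same word share one index. The sketch below is one possible extra step, not part of the original pipeline: the function name `normalize_arabic` and the choice of rules (strip diacritics and tatweel, unify alef forms) are assumptions.

```python
import re

# Arabic diacritics (tashkeel), U+064B through U+0652, plus tatweel U+0640.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with madda or hamza, mapped to bare alef (U+0627).
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")


def normalize_arabic(text):
    """Strip diacritics and tatweel, unify alef forms, and drop punctuation."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    text = re.sub(r"[^\w\s]", "", text)  # same punctuation rule as preprocess()
    return text


tokens = normalize_arabic("هل يُطلَق سراح الأخوين؟").split()
print(tokens)
```

Whether alef unification helps depends on the task; it shrinks the vocabulary but also merges a few words that differ only by hamza.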


In [None]:
from torch.nn.utils.rnn import pad_sequence
import torch

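# Build vocab (sorted so indices are reproducible across runs)
vocab = sorted(set(word for doc in processed for word in doc))
word2idx = {w: i+1 for i, w in enumerate(vocab)}
word2idx["<PAD>"] = 0  # index 0 is reserved for padding

# Convert token lists to index tensors (unknown words fall back to index 0)
def encode(text):
    return torch.tensor([word2idx.get(w, 0) for w in text], dtype=torch.long)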

encoded = [encode(p) for p in processed]
padded = pad_sequence(encoded, batch_first=True, padding_value=0)
labels = torch.tensor([d["score"] for d in data], dtype=torch.float32)

padded.shape, labels.shape
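
The encoding cell sends out-of-vocabulary words to index 0, the same index used for padding. One alternative is sketched below on a toy corpus, assuming a dedicated `<UNK>` index. The `toy_` names, `encode_with_unk`, and the reserved indices are hypothetical and chosen so the sketch does not overwrite the notebook's own variables.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy corpus standing in for `processed`.
toy_docs = [["الذكاء", "الاصطناعي"], ["فيلم", "جديد", "الذكاء"]]

PAD, UNK = 0, 1  # assumed reserved indices
toy_vocab = sorted({w for doc in toy_docs for w in doc})
toy_word2idx = {"<PAD>": PAD, "<UNK>": UNK}
toy_word2idx.update({w: i + 2 for i, w in enumerate(toy_vocab)})


def encode_with_unk(tokens):
    """Map tokens to indices; unseen words go to UNK instead of PAD."""
    return torch.tensor([toy_word2idx.get(w, UNK) for w in tokens], dtype=torch.long)


toy_encoded = [encode_with_unk(d) for d in toy_docs]
toy_lengths = torch.tensor([len(e) for e in toy_encoded])  # true lengths before padding
toy_padded = pad_sequence(toy_encoded, batch_first=True, padding_value=PAD)

print(toy_padded.shape, toy_lengths.tolist())
print(encode_with_unk(["كلمة", "الذكاء"]).tolist())
```

Keeping the true lengths around also makes it possible to ignore padded positions later, for example with `torch.nn.utils.rnn.pack_padded_sequence`.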


## 🧠 Model Architectures (RNN, BiRNN, GRU, LSTM)

In [None]:
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim=64, hidden_dim=32):
        super().__init__()
        # padding_idx=0 keeps the <PAD> embedding fixed at zero
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = self.embed(x)         # (batch, seq_len, embedding_dim)
        _, h = self.rnn(x)        # h: (1, batch, hidden_dim), final hidden state
        return self.fc(h.squeeze(0))  # (batch, 1)

# You can replace RNN with LSTM, GRU, or BiRNN accordingly
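
Swapping the recurrent layer needs two small adjustments: `nn.LSTM` returns a tuple `(h_n, c_n)` instead of a single hidden state, and a bidirectional layer doubles the feature size seen by the linear head. The sketch below handles both in one class. The name `SequenceRegressor` and its `cell` and `bidirectional` arguments are hypothetical, and this is one possible layout rather than the lab's reference solution.

```python
import torch
import torch.nn as nn


class SequenceRegressor(nn.Module):
    """Embedding -> recurrent layer -> linear head (hypothetical class name)."""

    def __init__(self, vocab_size, cell="lstm", bidirectional=False,
                 embedding_dim=64, hidden_dim=32):
        super().__init__()
        rnn_cls = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}[cell]
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = rnn_cls(embedding_dim, hidden_dim, batch_first=True,
                           bidirectional=bidirectional)
        num_directions = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_dim * num_directions, 1)

    def forward(self, x):
        _, h = self.rnn(self.embed(x))
        if isinstance(h, tuple):  # LSTM returns (h_n, c_n); keep h_n
            h = h[0]
        # h: (num_directions, batch, hidden_dim) -> (batch, num_directions * hidden_dim)
        h = h.transpose(0, 1).reshape(x.size(0), -1)
        return self.fc(h)


batch = torch.randint(1, 20, (4, 7))  # 4 sequences of 7 token indices
for cell, bidir in [("rnn", False), ("rnn", True), ("gru", False), ("lstm", True)]:
    out = SequenceRegressor(vocab_size=20, cell=cell, bidirectional=bidir)(batch)
    print(cell, bidir, tuple(out.shape))
```

Every variant returns a tensor of shape `(batch, 1)`, so the training loop does not need to change when the cell type does.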


## 🏋️ Model Training

In [None]:
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim

dataset = TensorDataset(padded, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleRNN(len(word2idx)).to(device)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    model.train()
    total_loss = 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        preds = model(xb).squeeze(-1)  # (batch, 1) -> (batch,), also safe for a batch of 1
        loss = loss_fn(preds, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} | Loss: {total_loss:.4f}")


## 📐 Evaluation with BLEU Score

In [None]:
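# BLEU compares generated text with references; it is not computed for the regression model below
from nltk.translate.bleu_score import sentence_bleu

# Print predicted scores next to the original headlines
model.eval()
with torch.no_grad():  # no gradients needed for inference
    preds = model(padded.to(device)).squeeze(-1).cpu()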
for i in range(3):
    print("Predicted Score:", round(preds[i].item(), 2))
    print("Original:", data[i]["text"])
    print()
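
BLEU measures n-gram overlap between a candidate sentence and one or more references, so it fits generated text (such as the GPT-2 output in Part 2) rather than a numeric score. The sketch below shows the call on tokenized sentences. The candidate sentence is a hypothetical model output, and the use of smoothing is an assumption that helps with short sentences.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "الطب الشرعي يكشف سبب وفاة".split()
candidate = "الطب الشرعي يكشف سبب الحادث".split()  # hypothetical model output

smooth = SmoothingFunction().method1  # avoids zero scores when an n-gram order has no match
score_same = sentence_bleu([reference], reference, smoothing_function=smooth)
score_diff = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"identical: {score_same:.3f} | one word changed: {score_diff:.3f}")
```

For the regression model itself, an error metric such as mean squared error or mean absolute error between `preds` and `labels` would be the more natural check.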


## 🤖 Part 2: Text Generation with GPT-2

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("aubmindlab/aragpt2-base")
model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-base")
model.eval()

input_text = "الذكاء الاصطناعي هو"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

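# GPT-2 has no pad token by default, so reuse EOS to avoid a generation warning
output = model.generate(input_ids, max_length=50, num_return_sequences=1,
                        pad_token_id=tokenizer.eos_token_id)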
print(tokenizer.decode(output[0], skip_special_tokens=True))


## 📝 What I Learned
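- Scraped Arabic headlines with `requests` and BeautifulSoup
- Built a simple RNN regressor in PyTorch that can be swapped for a BiRNN, GRU, or LSTM
- Explored BLEU as an evaluation metric
- Generated Arabic text with GPT-2 (`aubmindlab/aragpt2-base`)
- Practiced tokenization and preprocessing for Arabic NLP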