
# 🧠 Deep Learning – Lab 3 – Arabic News Classification

**Université Abdelmalek Essaadi – LSI**  
**Instructor:** Pr. ELAACHAK LOTFI  

This notebook covers:
1. Scraping Arabic political news headlines using Selenium + BeautifulSoup.
2. Assigning a relevance score (0 to 10) to each title using rule-based heuristics.
3. Saving the dataset to a CSV file for further NLP model training.

---

## 📰 Step 1: Web Scraping with Selenium & BeautifulSoup



!pip install selenium beautifulsoup4 pandas nltk webdriver-manager

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
import pandas as pd

url = "https://www.hespress.com/politique"

# Set up headless Chrome
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Scroll and load content; the browser is closed even if loading fails
try:
    driver.get(url)
    end_time = time.time() + 60  # Scroll for 60 seconds

    while time.time() < end_time:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # Wait for lazily loaded articles to appear

    # Parse page source
    soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()

# Extract article titles
titles = soup.find_all('a', class_='stretched-link')
title_texts = [title.get('title') for title in titles if title.get('title')]

# Create DataFrame; 'score' is a placeholder that Step 2 overwrites
df = pd.DataFrame(title_texts, columns=['title'])
df['score'] = 0
df.to_csv("data_scraped.csv", index=False)
df.head()




| | title | score |
|---|---|---|
| 0 | إسواتيني ترفض مناورات جنوب إفريقيا | 0 |
| 1 | من التدبير إلى التغيير.. المغرب يكثف التحركات ... | 0 |
| 2 | هل تنجح الوساطة الأمريكية بين المغرب والجزائر ... | 0 |
| 3 | ملف الصحراء: من نضج المبادرة المغربية إلى اختب... | 0 |
| 4 | تهديدات وزير الداخلية الفرنسي تعمق عزلة النظام... | 0 |
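
The scraping cell scrolls for a fixed 60 seconds, whether or not new articles are still loading. One possible alternative is to stop as soon as the page height stops growing. The sketch below is only an illustration: `scroll_until_stable`, `pause`, and `max_rounds` are hypothetical names that do not appear in the notebook, and it assumes the page grows taller whenever more articles load, which has not been checked against Hespress.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=30):
    """Scroll to the bottom until the page height stops growing.

    Hypothetical helper: `driver` is any object exposing Selenium's
    `execute_script` method. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the last scroll loaded nothing new
        last_height = new_height
    return rounds
```

Called as `scroll_until_stable(driver)` after `driver.get(url)`, it could replace the timed loop. The `max_rounds` cap keeps the loop bounded on pages that never stop growing.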



---

## 🧮 Step 2: Assigning Relevance Scores

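Each title is scored with simple keyword matching: every external-affairs keyword found adds 2 points (pushing the score toward 10) and every domestic keyword found subtracts 1 point (pushing it toward 0). The score also gains 0.5 per non-stop-word token and 1 extra point when more than eight such tokens remain, so longer titles score higher regardless of topic. The result is clamped to the range [0, 10].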
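
To make the arithmetic concrete, here is a minimal sketch of the same rule in plain Python: +2 per external keyword, -1 per internal keyword, +0.5 per remaining token, +1 for more than eight tokens, then clamping to [0, 10]. It assumes whitespace tokenization and a tiny illustrative stop-word set instead of NLTK's `ToktokTokenizer` and Arabic stop-word list, so its results can differ from the notebook's. `score_title`, `EXTERNAL`, `INTERNAL`, and `STOP_WORDS` are hypothetical names.

```python
# Shortened, illustrative keyword lists (the notebook's lists are longer).
EXTERNAL = ['فرنسا', 'أمريكا', 'الدولي']
INTERNAL = ['المغرب', 'الحكومة', 'وزير']
# Assumption: a tiny stand-in for NLTK's Arabic stop-word list.
STOP_WORDS = {'من', 'إلى', 'في', 'على', 'هل', 'بين'}


def score_title(title):
    """Return a 0-10 score built from the same terms as the notebook's rule."""
    title = str(title)
    tokens = [t for t in title.split() if t not in STOP_WORDS and t.isalpha()]
    ext_hits = sum(1 for k in EXTERNAL if k in title)  # +2 each
    int_hits = sum(1 for k in INTERNAL if k in title)  # -1 each
    score = 2 * ext_hits - int_hits + 0.5 * len(tokens)
    if len(tokens) > 8:
        score += 1  # bonus for long titles
    return min(10, max(0, round(score, 1)))  # clamp to [0, 10]


# Five content tokens and no listed keyword: 0.5 * 5 = 2.5
print(score_title('إسواتيني ترفض مناورات جنوب إفريقيا'))
```

The example title has no listed keyword, so its score comes entirely from the token count. This is why even a title with no topical signal does not score 0.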


import pandas as pd
import nltk
from nltk.tokenize.toktok import ToktokTokenizer

nltk.download('stopwords')

tokenizer = ToktokTokenizer()
stop_words = set(nltk.corpus.stopwords.words('arabic'))

external_keywords = ['إسرائيل', 'الأمم', 'الخارجية', 'الأمم المتحدة', 'فرنسا', 'أمريكا', 'البيت الأبيض', 'الأوروبي', 'الدولي']
internal_keywords = ['المغرب', 'الرباط', 'الحكومة', 'مجلس', 'وزير', 'المغربية', 'جهة', 'الملك', 'الداخلية']

df = pd.read_csv('data_scraped.csv')

def compute_score(title):
    title = str(title)  # guard against missing titles (NaN) read from the CSV
    words = tokenizer.tokenize(title)
    # Keep alphabetic tokens that are not Arabic stop words
    keywords = [word for word in words if word not in stop_words and word.isalpha()]

    # Initial score
    score = 0.0

    # Count listed keywords that occur as substrings of the title.
    # Overlapping entries (e.g. 'الأمم' and 'الأمم المتحدة') each count as a hit.
    ext_hits = sum(1 for w in external_keywords if w in title)
    int_hits = sum(1 for w in internal_keywords if w in title)

    score += ext_hits * 2
    score -= int_hits * 1

    # Add richness score: 0.5 per remaining token
    score += 0.5 * len(keywords)

    # Bonus for long titles
    if len(keywords) > 8:
        score += 1

    # Clamp to [0, 10]
    score = min(10, max(0, round(score, 1)))

    return ' '.join(keywords), score

df[['keywords', 'score']] = df['title'].apply(lambda x: pd.Series(compute_score(x)))
df.to_csv("data_semantically_scored.csv", index=False)
df[['title', 'score']].head()


[nltk_data] Downloading package stopwords to /home/med/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | title | score |
|---|---|---|
| 0 | إسواتيني ترفض مناورات جنوب إفريقيا | 2.5 |
| 1 | من التدبير إلى التغيير.. المغرب يكثف التحركات ... | 3.0 |
| 2 | هل تنجح الوساطة الأمريكية بين المغرب والجزائر ... | 3.0 |
| 3 | ملف الصحراء: من نضج المبادرة المغربية إلى اختب... | 7.5 |
| 4 | تهديدات وزير الداخلية الفرنسي تعمق عزلة النظام... | 3.5 |
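
One property of the keyword test affects these scores: `w in title` checks for substrings, so a short keyword also matches inside a longer word. For example, 'المغرب' is contained in 'المغربية', and both are in the internal list, so a title containing 'المغربية' (such as the fourth one above) registers two internal hits. The sketch below contrasts the two behaviours. `substring_hits` and `token_hits` are hypothetical helpers, and whitespace splitting stands in for the notebook's tokenizer.

```python
def substring_hits(title, keywords):
    """Count keywords that occur anywhere in the title (the notebook's test)."""
    return sum(1 for k in keywords if k in title)


def token_hits(title, keywords):
    """Count keywords that equal a whole whitespace-separated token."""
    tokens = set(title.split())
    return sum(1 for k in keywords if k in tokens)


internal = ['المغرب', 'المغربية']
title = 'المبادرة المغربية'

print(substring_hits(title, internal))  # 2: 'المغرب' also matches inside 'المغربية'
print(token_hits(title, internal))      # 1: only the exact token 'المغربية'
```

Which behaviour is preferable depends on intent. Substring matching catches inflected forms without a stemmer, while token matching avoids double counting but cannot match multi-word keywords such as 'الأمم المتحدة'.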



---

## ✅ Summary

- We collected Arabic news titles from Hespress (Politics section).
- We applied tokenization, stop-word removal, and keyword matching to assign each title a score from 0 to 10.
- We saved the processed data to `data_semantically_scored.csv`.

You can now proceed to:
- Preprocessing pipeline (tokenization, lemmatization, etc.)
- Model training: RNN, Bi-RNN, GRU, LSTM
- Part 2: Fine-tuning GPT-2 for Arabic text generation
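
For the preprocessing step listed above, a common starting point with Arabic text is orthographic normalization before tokenization. The sketch below is one possible version using only the standard library. The particular rules (removing diacritics and tatweel, unifying alef variants, collapsing whitespace) are an assumption about what such a pipeline might include, not something this notebook specifies, and `normalize_arabic` is a hypothetical name.

```python
import re

# Arabic diacritics (U+064B to U+0652) and the tatweel stretch character (U+0640).
_DIACRITICS = re.compile(r'[\u064B-\u0652\u0640]')
# Alef with hamza above, hamza below, or madda, all mapped to bare alef.
_ALEF_VARIANTS = re.compile(r'[\u0623\u0625\u0622]')


def normalize_arabic(text):
    """Strip diacritics and tatweel, unify alef forms, collapse whitespace."""
    text = _DIACRITICS.sub('', text)
    text = _ALEF_VARIANTS.sub('\u0627', text)
    return re.sub(r'\s+', ' ', text).strip()
```

If a step like this is applied to the titles, the keyword lists would need the same normalization, since entries such as 'أمريكا' contain alef variants that the function rewrites.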
