
# 🧠 Deep Learning – Lab 3 – Scraping & Sequence Models

**Université Abdelmalek Essaadi – Master MBD**  
**Instructor:** Pr. ELAACHAK LOTFI  
**Lab 3 Objective:** Get familiar with PyTorch and build sequence models for Arabic NLP tasks.

---

## 📌 Part 1: Classification Task (Text Collection & Preprocessing)

### ✅ Step 1: Arabic News Scraping

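We scrape Arabic political news headlines from [Hespress Politics](https://www.hespress.com/politique). **Selenium** drives a headless Chrome browser to render the page, and **BeautifulSoup** parses the resulting HTML. The script scrolls to the bottom of the page repeatedly for 60 seconds so that lazily loaded articles appear, then extracts every headline and saves them to `news.csv`.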


```python
# 📦 Install required packages
!pip install selenium beautifulsoup4 pandas webdriver-manager

# ✅ Import libraries
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# ✅ Headless Chrome setup
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

try:
    # ✅ Load the page to scrape
    url = "https://www.hespress.com/politique"
    driver.get(url)

    # ✅ Scroll to the bottom repeatedly for 60 seconds to trigger infinite scroll
    SCROLL_DURATION = 60    # total scrolling time, in seconds
    SCROLL_PAUSE_TIME = 2   # wait between scrolls so new articles can load
    start_time = time.time()

    while time.time() - start_time < SCROLL_DURATION:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUSE_TIME)

    # ✅ Grab the fully rendered HTML
    html = driver.page_source
finally:
    # Always release the browser, even if loading or scrolling fails
    driver.quit()

# ✅ Parse with BeautifulSoup and extract article titles
soup = BeautifulSoup(html, "html.parser")
titles = [a.get_text(strip=True) for a in soup.select("a.stretched-link")]

# Drop empty strings and repeated headlines while preserving order
titles = list(dict.fromkeys(t for t in titles if t))

# ✅ Store in a DataFrame
df = pd.DataFrame({"text": titles})
df["score"] = 0  # Initialize score column (placeholder value)

# ✅ Save to CSV
df.to_csv("news.csv", index=False)

# ✅ Preview
df.head()
```
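
The loop above always scrolls for the full 60 seconds, even if the page stops loading new articles earlier. As a possible refinement, the sketch below stops as soon as the page height no longer grows. The helper name `scroll_until_stable` and its parameters are hypothetical (they are not part of the lab code), and it assumes only that `driver` exposes Selenium's `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_seconds=60.0):
    """Scroll to the bottom until the page height stops growing or time runs out.

    Hypothetical helper: `driver` is any object with Selenium's
    `execute_script(script)` method. Returns the number of scroll steps.
    """
    deadline = time.monotonic() + max_seconds
    last_height = driver.execute_script("return document.body.scrollHeight")
    steps = 0

    while time.monotonic() < deadline:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        steps += 1
        time.sleep(pause)  # give the page time to load more items

        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded; assume the end was reached
        last_height = new_height

    return steps
```

In the scraping cell, a call such as `scroll_until_stable(driver, pause=SCROLL_PAUSE_TIME)` could replace the fixed `while` loop. A slow connection may need a longer `pause`, since a page that has not finished loading looks the same as a page with no more content.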



---

## ✍️ What's Next?

- Continue with text preprocessing: tokenization, stemming, lemmatization, etc.
- Train sequence models: RNN, Bi-RNN, GRU, and LSTM.
- Evaluate the models using metrics such as accuracy and BLEU score.
- Write the README and push the project to GitHub.
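
As a starting point for the preprocessing step, here is a minimal sketch that uses only the Python standard library. It covers normalization, whitespace tokenization, and mapping tokens to integer ids for a sequence model; stemming and lemmatization are not covered and would need a dedicated Arabic NLP library. The function names (`normalize_arabic`, `tokenize`, `build_vocab`, `encode`) and the specific normalization rules are assumptions chosen for illustration, not requirements of the lab.

```python
import re
from collections import Counter

# Arabic diacritics (U+064B to U+0652) plus the tatweel elongation mark (U+0640)
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Anything that is neither an Arabic letter (U+0621 to U+064A) nor whitespace
_NON_ARABIC = re.compile("[^\u0621-\u064A\\s]")


def normalize_arabic(text):
    """Strip diacritics, unify common letter variants, and drop non-Arabic characters."""
    text = _DIACRITICS.sub("", text)
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                # alef maqsura -> ya (lossy)
    text = _NON_ARABIC.sub(" ", text)                      # digits, punctuation, Latin
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Normalize, then split on whitespace."""
    return normalize_arabic(text).split()


def build_vocab(token_lists, min_freq=1):
    """Map each token to an integer id; 0 is padding, 1 is unknown."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Convert tokens to ids, falling back to <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
```

With the scraped data, `df["text"].map(tokenize)` would give one token list per headline, which can then be passed to `build_vocab` and `encode` before padding the sequences for PyTorch.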

> ✅ This notebook is Colab-compatible and can be reused directly.

