# Step 1: Install the libraries

I'll use Beautiful Soup and Requests to scrape the webpage content. During our project presentation, we were asked about the choice between Scrapy and Beautiful Soup.
Scrapy is better suited to more complex scraping tasks, but I'll keep it simple here.

In [30]:
!pip install requests beautifulsoup4 pymongo nltk spacy





In [31]:
import requests
from bs4 import BeautifulSoup

In [32]:
url = 'https://network.aljazeera.net/ar/events'
response = requests.get(url)

After inspecting the website, I found that event titles usually have the following format:
<h3 class="title"><a href="link" hreflang="ar">معرض اكتشف الجزيرة</a></h3>
So we can scrape the `h3` HTML tags that have the `title` class.
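
To check the selector without hitting the live site, here is a minimal offline sketch. It parses a small hard-coded snippet modelled on the markup above; the second `h3` (class `other`) is a made-up element I added only to show that it gets skipped.

```python
from bs4 import BeautifulSoup

# Sample markup: the first line follows the pattern seen on the page
# ("link" is a placeholder href); the second line is hypothetical.
sample_html = """
<h3 class="title"><a href="link" hreflang="ar">معرض اكتشف الجزيرة</a></h3>
<h3 class="other">not an event title</h3>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all with class_='title' keeps only <h3> tags carrying that class;
# get_text(strip=True) returns the nested link text without whitespace.
sample_titles = [
    h3.get_text(strip=True)
    for h3 in sample_soup.find_all("h3", class_="title")
]
print(sample_titles)
```

Only the first heading should be returned, since the second one does not carry the `title` class.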


In [33]:
titles = []  # defined up front so later cells still work if the request fails
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Each event title sits inside an <h3 class="title"> element
    for h3_tag in soup.find_all('h3', class_='title'):
        title = h3_tag.get_text(strip=True)
        titles.append(title)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")


In [34]:
print(titles)

['مهرجان الجزيرة بلقان للأفلام الوثائقية', 'منتدى الجزيرة الخامس عشر', 'منتدى كليات الصحافة في العالم العربي', 'مهرجان الجزيرة بلقان السادس للأفلام الوثائقية', 'الجزيرة للدراسات يبحث جدوى انخراط فلسطينيّي الداخل في مؤسسة الحكم الإسرائيلية', 'معرض اكتشف الجزيرة', 'الجزيرة تحتفل بمرور 25 عاماً على انطلاقتها', 'منتدى الجزيرة 2019: الخليج بين الأزمة وتراجع التأثير الاستراتيجي', 'أربعون عاماً على الثورة الإسلامية في إيران', 'الاحتفال بالذكرى الثانية والعشرين لانطلاق شبكة الجزيرة الإعلامية', 'معهد الجزيرة ينظم منتدى للإعلام الرقمي بإسطنبول', 'وقفة تضامنية مع الصحفي جمال خاشقجي', 'نادي الصحافة القومي بواشنطن يستضيف الجزيرة', 'اختتام مهرجان الجزيرة بلقان للأفلام الوثائقية بسراييفو']


Now that we have scraped the data and displayed it, let's save it in our database.

# Step 2: Storing the Raw Data in MongoDB

In [35]:
from pymongo import MongoClient

In [36]:
# Connect to the local MongoDB instance
client = MongoClient('localhost', 27017)
db = client['web_scraping']
collection = db['aljazeera_events']

In [37]:
# Insert the titles into MongoDB
for title in titles:
    collection.insert_one({'title': title})

In [38]:
# Verify the insertion
for event in collection.find():
    print(event)

{'_id': ObjectId('6662452d4a1e9f1b0c6f75bd'), 'title': 'مهرجان الجزيرة بلقان للأفلام الوثائقية'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75be'), 'title': 'منتدى الجزيرة الخامس عشر'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75bf'), 'title': 'منتدى كليات الصحافة في العالم العربي'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75c0'), 'title': 'مهرجان الجزيرة بلقان السادس للأفلام الوثائقية'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75c1'), 'title': 'الجزيرة للدراسات يبحث جدوى انخراط فلسطينيّي الداخل في مؤسسة الحكم الإسرائيلية'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75c2'), 'title': 'معرض اكتشف الجزيرة'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75c3'), 'title': 'الجزيرة تحتفل بمرور 25 عاماً على انطلاقتها'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75c4'), 'title': 'منتدى الجزيرة 2019: الخليج بين الأزمة وتراجع التأثير الاستراتيجي'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75c5'), 'title': 'أربعون عاماً على الثورة الإسلامية في إيران'}
{'_id': ObjectId('6662452d4a1e9f1b0c6f75c6'), 'title': 'الاحتفال بالذكرى الثانية وا

After saving the data, we verified that it was actually stored and that no errors occurred.
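
One thing the check above does not catch: because the insert cell calls `insert_one` for every title, running it a second time stores every title twice. A possible way around this, sketched below, is to upsert on the title instead. The helper name `save_titles` is my own, and it assumes `collection` is a pymongo `Collection` (or anything exposing the same `update_one` method).

```python
def save_titles(collection, titles):
    """Upsert each title so that re-running the cell adds no duplicates.

    Assumes `collection` behaves like a pymongo Collection.
    Returns the number of documents that were newly inserted.
    """
    inserted = 0
    for title in titles:
        result = collection.update_one(
            {"title": title},                    # match an existing document
            {"$setOnInsert": {"title": title}},  # only written on insert
            upsert=True,
        )
        # upserted_id is None when a matching document already existed
        if result.upserted_id is not None:
            inserted += 1
    return inserted
```

With the connection from the earlier cell, `save_titles(collection, titles)` followed by `collection.count_documents({})` would show whether the stored count matches `len(set(titles))`.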

# Step 3: Establishment of NLP Pipeline

I'll use NLTK and SpaCy for the NLP tasks; I have already installed both above.

In [39]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
import spacy

In [40]:

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to C:\Users\Dell
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Dell
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Dell
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [41]:
stop_words = set(stopwords.words('arabic'))
stemmer = SnowballStemmer('arabic')
lemmatizer = WordNetLemmatizer()

In [42]:
!pip install spacy
!python -m spacy download xx_ent_wiki_sm 

Collecting xx-ent-wiki-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/xx_ent_wiki_sm-3.7.0/xx_ent_wiki_sm-3.7.0-py3-none-any.whl (11.1 MB)

`xx_ent_wiki_sm` is a small multilingual SpaCy model trained on Wikipedia data.
It is built for Named Entity Recognition (NER) only: it ships no part-of-speech tagger or lemmatizer, so it is not a general-purpose pipeline.

In [43]:
from spacy.util import minibatch
nlp = spacy.load('xx_ent_wiki_sm')

In [44]:
# Text cleaning, tokenization, stop-word removal, stemming, lemmatization, POS tagging, NER
processed_texts = []
for text in titles:
    tokens = word_tokenize(text)
    # Keep only purely alphanumeric tokens (drops punctuation tokens such as ':')
    tokens = [word for word in tokens if word.isalnum()]
    # Remove Arabic stop words
    tokens = [word for word in tokens if word not in stop_words]

    stems = [stemmer.stem(token) for token in tokens]
    # WordNetLemmatizer only covers English, so Arabic tokens come back unchanged
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]

    doc = nlp(text)
    # token.pos_ is empty here because xx_ent_wiki_sm has no tagger component
    pos_tags = [(token.text, token.pos_) for token in doc]
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    processed_texts.append({
        'original': text,
        'tokens': tokens,
        'stems': stems,
        'lemmas': lemmas,
        'pos_tags': pos_tags,
        'entities': entities
    })
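
A caveat about the `word.isalnum()` filter in the cell above: Arabic diacritics such as shadda or tanwin are combining marks, and `str.isalnum()` returns `False` for any string that contains one. Words written with diacritics (several of the scraped titles have them) would therefore be dropped along with the punctuation. The sketch below shows the effect on a few sample tokens and one possible workaround; `strip_diacritics` is a helper name I made up, not part of NLTK.

```python
import unicodedata

def strip_diacritics(word: str) -> str:
    """Remove combining marks (Unicode category Mn) such as shadda or tanwin."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

# Escapes make the diacritics explicit: U+064B is fathatan, U+0651 is shadda.
sample_tokens = ["الجزيرة", "عاما\u064b", "فلسطيني\u0651ي", ":", "2019"]

# Filter as used in the pipeline: diacritized words fail isalnum()
naive = [t for t in sample_tokens if t.isalnum()]

# Test the stripped form, but keep the original token in the result
safer = [t for t in sample_tokens if strip_diacritics(t).isalnum()]

print(len(naive), naive)
print(len(safer), safer)
```

On these five sample tokens, the plain filter keeps two and the diacritic-aware filter keeps four, discarding only the colon.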

In [45]:
print("Cleaned Titles:", processed_texts)

Cleaned Titles: [{'original': 'مهرجان الجزيرة بلقان للأفلام الوثائقية', 'tokens': ['مهرجان', 'الجزيرة', 'بلقان', 'للأفلام', 'الوثائقية'], 'stems': ['مهرج', 'جزير', 'لقا', 'افلام', 'وثايق'], 'lemmas': ['مهرجان', 'الجزيرة', 'بلقان', 'للأفلام', 'الوثائقية'], 'pos_tags': [('مهرجان', ''), ('الجزيرة', ''), ('بلقان', ''), ('للأفلام', ''), ('الوثائقية', '')], 'entities': [('مهرجان', 'PER')]}, {'original': 'منتدى الجزيرة الخامس عشر', 'tokens': ['منتدى', 'الجزيرة', 'الخامس'], 'stems': ['منتدى', 'جزير', 'خامس'], 'lemmas': ['منتدى', 'الجزيرة', 'الخامس'], 'pos_tags': [('منتدى', ''), ('الجزيرة', ''), ('الخامس', ''), ('عشر', '')], 'entities': []}, {'original': 'منتدى كليات الصحافة في العالم العربي', 'tokens': ['منتدى', 'كليات', 'الصحافة', 'العالم', 'العربي'], 'stems': ['منتدى', 'كل', 'صحاف', 'عالم', 'عرب'], 'lemmas': ['منتدى', 'كليات', 'الصحافة', 'العالم', 'العربي'], 'pos_tags': [('منتدى', ''), ('كليات', ''), ('الصحافة', ''), ('في', ''), ('العالم', ''), ('العربي', '')], 'entities': []}, {'original': 

In [46]:
for item in processed_texts:
    print(f"Processed Data: {item}")

Processed Data: {'original': 'مهرجان الجزيرة بلقان للأفلام الوثائقية', 'tokens': ['مهرجان', 'الجزيرة', 'بلقان', 'للأفلام', 'الوثائقية'], 'stems': ['مهرج', 'جزير', 'لقا', 'افلام', 'وثايق'], 'lemmas': ['مهرجان', 'الجزيرة', 'بلقان', 'للأفلام', 'الوثائقية'], 'pos_tags': [('مهرجان', ''), ('الجزيرة', ''), ('بلقان', ''), ('للأفلام', ''), ('الوثائقية', '')], 'entities': [('مهرجان', 'PER')]}
Processed Data: {'original': 'منتدى الجزيرة الخامس عشر', 'tokens': ['منتدى', 'الجزيرة', 'الخامس'], 'stems': ['منتدى', 'جزير', 'خامس'], 'lemmas': ['منتدى', 'الجزيرة', 'الخامس'], 'pos_tags': [('منتدى', ''), ('الجزيرة', ''), ('الخامس', ''), ('عشر', '')], 'entities': []}
Processed Data: {'original': 'منتدى كليات الصحافة في العالم العربي', 'tokens': ['منتدى', 'كليات', 'الصحافة', 'العالم', 'العربي'], 'stems': ['منتدى', 'كل', 'صحاف', 'عالم', 'عرب'], 'lemmas': ['منتدى', 'كليات', 'الصحافة', 'العالم', 'العربي'], 'pos_tags': [('منتدى', ''), ('كليات', ''), ('الصحافة', ''), ('في', ''), ('العالم', ''), ('العربي', '')], '


#### Overview

During this lab, I gained hands-on experience in several key areas of Natural Language Processing (NLP) and web scraping. Here are the major components and learnings from the exercise:

1. **Web Scraping:**
   - Utilized libraries like `requests` and `BeautifulSoup` rather than `Scrapy` to extract data from web sources.
   - Focused on extracting specific elements; in this case, I tested with just the titles, taken from HTML tags with a class of `title`.

2. **Data Storage:**
   - Stored the scraped data in a NoSQL database, MongoDB, which is well-suited for handling unstructured data.
   - Understood the basics of connecting to a MongoDB instance and performing basic insert operations.

3. **NLP Pipeline:**
   - Implemented an NLP pipeline involving text cleaning, tokenization, stop words removal, stemming, lemmatization, POS tagging, and Named Entity Recognition (NER).
   - Used both NLTK and SpaCy libraries to perform these tasks, leveraging their respective strengths.

#### Overall Key Learnings

1. **Difference Between Stemming and Lemmatization:**
   - **Stemming:** This technique reduces words to their base or root form, often by simply removing affixes. It can produce non-words since it uses heuristic rules (example: "running" becomes "run" or "runn"). In the output, I found `['مهرج', 'جزير', 'لقا', 'افلام', 'وثايق']`. Those are stemmed forms, which are not real words.
   - **Lemmatization:** This technique also reduces words to their base or root form but uses a **dictionary-based** approach to ensure that the result is an actual word (example: "running" becomes "run"). In the output, `['مهرجان', 'الجزيرة', 'بلقان', 'للأفلام', 'الوثائقية']` are identical to the input tokens: NLTK's `WordNetLemmatizer` only covers English, so it returned the Arabic words unchanged rather than lemmatizing them.
   - **My Comparison:** Lemmatization is generally more accurate because it relies on a dictionary and can take the word's part of speech into account. Stemming is supposedly faster but can lead to less meaningful results.

2. **Part of Speech (POS) Tagging:**
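   - **Rule-based POS Tagging:** Uses predefined grammatical rules to assign POS tags to words. It’s straightforward but can be limited by the complexity of the Arabic language.
   - **Result in this lab:** Every `pos_` value in the output is an empty string, because the `xx_ent_wiki_sm` model provides NER only and has no tagger.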
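
To make the stemming versus lemmatization contrast concrete, here is a toy sketch in plain Python. It is not NLTK's Snowball or WordNet logic: the suffix list and the tiny lemma dictionary are assumptions chosen for illustration. It does mirror two behaviours noted above: a rule-based stemmer can produce non-words, and a dictionary-based lemmatizer returns words it does not know unchanged.

```python
# Toy rule list, NOT the real Snowball rules
SUFFIXES = ("ing", "ed", "s")

# Toy dictionary, standing in for a real lexical resource such as WordNet
LEMMA_DICT = {"running": "run", "studies": "study"}

def toy_stem(word: str) -> str:
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up; anything missing from the dictionary is unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ["running", "studies", "مهرجان"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

Here "running" stems to the non-word "runn" but lemmatizes to "run", while the Arabic word falls outside the English-only dictionary and passes through the lemmatizer untouched, which is what happened with `WordNetLemmatizer` in this lab.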

#### Challenges and Solutions

- **Language Support:** While SpaCy provides good support for many languages, some limitations were noted with Arabic. Supplementing SpaCy with additional resources or custom models trained on Arabic text could improve results.
- **Accuracy of NLP Tasks:** Ensuring the accuracy of stemming, lemmatization, and POS tagging for non-English text required using appropriate tools like the `SnowballStemmer` for Arabic.
