# 🧠 Introduction to Hazm for Natural Language Processing (NLP)

## 📘 What Is NLP?

**Natural Language Processing (NLP)** is a field of Artificial Intelligence that enables machines to **understand, interpret, and generate human language**.

It powers many everyday applications:
- 🧑‍💬 Chatbots and virtual assistants
- 🔍 Search engines
- 📄 Text classification (e.g. spam detection)
- 🗣️ Sentiment and emotion analysis
- 📰 News summarization
- 🧠 Machine translation

---

## 🐍 Why Python Is Ideal for NLP

Python is the **de facto standard language** for NLP because of:
- A vast ecosystem of libraries (`NLTK`, `spaCy`, `transformers`, ...)
- Easy integration with machine learning frameworks
- Clean syntax for text manipulation

---

## 🌍 NLP for Persian Language

Most global NLP tools focus on English or multilingual models.

However, Persian (Farsi) requires:
- Proper handling of **right-to-left (RTL)** text
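- Normalization of the Arabic-based script (e.g. unifying letter variants and removing diacritics)
- Tokenization with Persian-specific rules (e.g. the zero-width non-joiner inside words such as کتاب‌های)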

This is where **Hazm** comes in.
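
To make the normalization requirement above more concrete, here is a minimal sketch that uses only the Python standard library. It is not Hazm's implementation: `simple_normalize` is a hypothetical helper, and the character mapping covers just two well-known cases (Arabic Yeh and Kaf versus their Persian forms) plus the common Arabic diacritics.

```python
import re

# Arabic code points that often appear in Persian text, mapped to Persian forms
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic Yeh -> Persian Yeh
    "\u0643": "\u06A9",  # Arabic Kaf -> Persian Keheh
})

# Arabic diacritics (tanween, short vowels, shadda, sukun): U+064B..U+0652
DIACRITICS = re.compile("[\u064B-\u0652]")


def simple_normalize(text: str) -> str:
    """Hypothetical helper: unify letter variants, drop diacritics, collapse spaces."""
    text = text.translate(ARABIC_TO_PERSIAN)
    text = DIACRITICS.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()


# Two words typed with Arabic Kaf and Arabic Yeh and a doubled space
sample = "\u0643\u062A\u0627\u0628  \u0641\u0627\u0631\u0633\u064A"
print(simple_normalize(sample))  # same words with Persian letters and one space
```

A dedicated library covers many more cases (spacing around affixes, punctuation, digits), which is why the rest of this section relies on Hazm instead of hand-written rules.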

---

## 🛠️ What Is Hazm?

**[Hazm](https://github.com/sobhe/hazm)** is a Python library for processing Persian (Farsi) text.  
It provides basic tools for:
- Normalization
- Tokenization
- Stemming
- Lemmatization
- POS tagging
- Sentence segmentation

### 📦 Installation

```bash
pip install hazm
```

---

## ✅ Example: Using Hazm


```python
from hazm import Lemmatizer, Normalizer, Stemmer, word_tokenize

normalizer = Normalizer()
text = "کتاب‌های خوبی برای یادگیری زبان فارسی نوشته شده‌اند."

# Normalize the text (unifies characters and fixes spacing)
normalized = normalizer.normalize(text)
print("Normalized:", normalized)

# Split the normalized text into word tokens
tokens = word_tokenize(normalized)
print("Tokens:", tokens)

# Compare the stem and the lemma of each token
stemmer = Stemmer()
lemmatizer = Lemmatizer()

for token in tokens:
    print(f"{token} → Stem: {stemmer.stem(token)}, Lemma: {lemmatizer.lemmatize(token)}")
```

Output:

```text
Normalized: کتاب‌های خوبی برای یادگیری زبان فارسی نوشته شده‌اند.
Tokens: ['کتاب\u200cهای', 'خوبی', 'برای', 'یادگیری', 'زبان', 'فارسی', 'نوشته', 'شده\u200cاند', '.']
کتاب‌های → Stem: کتاب, Lemma: کتاب
خوبی → Stem: خوب, Lemma: خوبی
برای → Stem: برا, Lemma: برای
یادگیری → Stem: یادگیر, Lemma: یادگیری
زبان → Stem: زب, Lemma: زبان
فارسی → Stem: فارس, Lemma: فارسی
نوشته → Stem: نوشته, Lemma: نوشته
شده‌اند → Stem: شده‌اند, Lemma: شد#شو
. → Stem: ., Lemma: .
```




---

## 🔍 What Hazm Does

| Feature           | Description                                                                                   |
|-------------------|-----------------------------------------------------------------------------------------------|
| `Normalizer`      | Standardizes Persian script (e.g. removes diacritics, fixes spacing)                          |
| `word_tokenize()` | Splits text into words                                                                        |
| `sent_tokenize()` | Splits text into sentences                                                                    |
| `Stemmer`         | Strips common suffixes (e.g. ها، تر، ترین) by rule, so it can over-strip (e.g. زبان → زب)      |
| `Lemmatizer`      | Converts words to dictionary form; verbs come back as `past#present` stems (e.g. شد#شو)       |
| `POSTagger`       | Assigns grammatical labels to each word (requires a trained model file)                       |
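
The `Stemmer` results in the example output (زبان → زب, برای → برا) are typical of rule-based suffix stripping. The sketch below shows the general idea in plain Python. It is an illustration only: `naive_stem` and its short suffix list are hypothetical, and Hazm's actual rules may differ.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, used inside many Persian words

# Hypothetical suffix list, checked in this order
SUFFIXES = [
    "\u062A\u0631\u06CC\u0646",  # ترین (superlative)
    "\u062A\u0631",              # تر (comparative)
    "\u0627\u0646",              # ان (plural)
    "\u06CC",                    # ی
    "\u0647\u0627",              # ها (plural)
]


def naive_stem(word: str) -> str:
    """Strip each listed suffix at most once, in order, with no dictionary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[: -len(suffix)]
    # Remove a dangling ZWNJ left behind after stripping
    return word.rstrip(ZWNJ)


for word in ["\u0632\u0628\u0627\u0646", "\u06A9\u062A\u0627\u0628\u200C\u0647\u0627\u06CC"]:
    print(word, "->", naive_stem(word))
```

Because nothing checks whether the result is a real word, زبان ("language") loses its final ان even though that is not a plural ending here. A lemmatizer avoids this by consulting a vocabulary, which is why the `Lemma` column in the output keeps زبان intact.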

---

## ✅ Summary

- **NLP** allows machines to understand human language
- **Python** offers a powerful NLP toolkit with easy syntax
- For **Persian**, **Hazm** is a dedicated library providing essential preprocessing tools
- Hazm enables proper normalization, stemming, tokenization, and more

# 🌐 Extracting Text from the Web for NLP — Fast HTML Parsing with `selectolax`

## 📡 NLP Needs Real-World Data — And It’s on the Web

As Natural Language Processing (NLP) continues to grow, so does the need for large and diverse **textual data**.

The internet supplies it in many forms:
- 💬 Blog posts
- 📰 News articles
- 📚 Open books
- 🛒 Reviews and product descriptions
- 💼 Social media posts

All provide rich text sources for training language models, sentiment classifiers, and more.

> ⚠️ Most of this data is not structured — it's embedded inside raw **HTML**.

---

## 🛠️ What Is a Parser?

A **parser** is a tool that reads a document's structure (for example, the tag tree of an HTML page) and lets you extract specific content from it.

For example:
- Extract the **title** of a webpage
- Find all **links** in an article
- Grab a product's **price** or **description**
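
As a rough illustration of the first two tasks, the sketch below uses the standard library's `html.parser.HTMLParser` (a different class from the selectolax `HTMLParser` used later). The `TitleAndLinks` class and the sample page are made up for this example.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Hypothetical helper that records the page title and every <a href>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept
        if self._in_title:
            self.title += data


page = '<html><head><title>Shop</title></head><body><a href="/item/1">Item</a></body></html>'
extractor = TitleAndLinks()
extractor.feed(page)
print(extractor.title, extractor.links)
```

This event-driven style works, but it needs state flags for every element of interest. Libraries with CSS selectors reduce the same task to a one-line query.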

---

## 🧰 Popular HTML Parsers in Python

| Parser          | Description                                                                    |
|-----------------|--------------------------------------------------------------------------------|
| `BeautifulSoup` | The most popular and easiest to use; wraps a backend such as `html.parser` or `lxml` |
| `lxml`          | Extremely fast, C-based parser (built on libxml2)                              |
| `html5lib`      | Standards-compliant but slow (pure Python)                                     |
| `selectolax`    | Lightweight and very fast (Cython bindings to C engines) ✅                    |
| `parsel`        | Used in Scrapy for XPath and CSS selectors                                     |

---

## ⚡ Why Use `selectolax`?

**Selectolax** is one of the **fastest HTML parsers in Python**. It is a Cython wrapper around the **Modest** and **Lexbor** engines, which are written in **C**.

Here’s a benchmark from [this article on Medium](https://python.plainenglish.io/8-most-popular-python-html-web-scraping-packages-with-benchmarks-bfef9179dbf8):

![Selectolax is the fastest parser](https://miro.medium.com/v2/resize:fit:875/0*Rxw_xuQ8_k6VlXyZ.png)

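> ✅ In that benchmark, **selectolax is roughly 10–20× faster** than traditional parsers like `BeautifulSoup`. The exact ratio depends on the documents being parsed and on the backend `BeautifulSoup` is configured with.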

---

## 🧪 Example: Parsing a Web Page with Selectolax


```python
from selectolax.parser import HTMLParser

html_content = """
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to the Sample Page</h1>
    <p class="description">This is a sample paragraph.</p>
    <ul>
        <li><a href="https://example.com/page1">Page 1</a></li>
        <li><a href="https://example.com/page2">Page 2</a></li>
    </ul>
</body>
</html>
"""

# Parse the HTML content
parser = HTMLParser(html_content)

# Extract the title (css_first returns None when nothing matches)
title_node = parser.css_first("title")
title = title_node.text() if title_node is not None else "No Title Found"
print(f"Title: {title}")

# Extract the description
description_node = parser.css_first(".description")
description = description_node.text() if description_node is not None else "No Description Found"
print(f"Description: {description}")

# Extract all links as (text, href) pairs
links = []
for tag in parser.css("a"):
    link = tag.attributes.get("href", "")
    text = tag.text(strip=True)
    links.append((text, link))

print("Links:")
for text, url in links:
    print(f"- {text}: {url}")
```

Output:

```text
Title: Sample Page
Description: This is a sample paragraph.
Links:
- Page 1: https://example.com/page1
- Page 2: https://example.com/page2
```


---

## ✅ Summary

- Web scraping is a **crucial step** in building NLP datasets
- A **parser** helps extract clean, structured text from messy HTML
- `selectolax` offers:
  - 🚀 Very high speed (C engines behind Cython bindings)
  - 🧼 Clean CSS selector syntax
  - 🧩 Easy integration with NLP pipelines


# 🌐 Real-World Web Scraping Example: Wikipedia Table Extraction

When working with NLP or data science, we often need structured data from websites like **Wikipedia**.

Let’s go through two different approaches to extract a table from a Wikipedia page about *Mobile Communications of Iran*:

---

## 🧱 Method 1: Manual HTML Parsing with `selectolax`

Here, we fetch the raw HTML, parse it, find the table manually, and convert it to a `pandas` DataFrame:

```python
import pandas as pd
import requests
from selectolax.parser import HTMLParser

# Step 1: Fetch the webpage
url = "https://en.wikipedia.org/wiki/Mobile_Communications_of_Iran"
response = requests.get(url, timeout=30)
response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
html_content = response.text

# Step 2: Parse the HTML
parser = HTMLParser(html_content)

# Step 3: Locate the infobox table
tables = parser.css("table.infobox")
if not tables:
    raise RuntimeError("No infobox table found")

target_table = tables[0]

# Step 4: Extract rows and cells, skipping rows whose first cell is empty
rows = target_table.css("tr")
data = []

for row in rows:
    cells = [cell.text(strip=True) for cell in row.css("th, td")]
    if cells and cells[0]:
        data.append(cells)

# Step 5: Create a DataFrame (the first extracted row becomes the header)
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
```

Output:

```text
            Native name               شرکت ارتباطات سیار ایران (همراه اول)
0          Company type                                       Semi-private
1             Traded as                       TSE:HMRZ1ISIN:  IRO1HMRZ0007
2              Industry  .mw-parser-output .plainlist ol,.mw-parser-out...
3               Founded                           1992; 33 years ago(1992)
4          Headquarters                                        Tehran,Iran
5           Area served                                               Iran
6            Key people  Mehdi Akhavan Behabadi(CEO)Babak Tarakomeh(Mem...
7              Products  Fixed-lineandmobile telephony,Internetservices...
8                 Owner                            TCI(84.15%)Sukuk(5.91%)
9   Number of employees                                              5000+
10               Parent                                                TCI
11         Subsidiaries                                           mobinnet
12              Website  
```

✅ **Pros:**
- Full control over parsing structure
- Lightweight and fast (C-based engine)

❗ **Cons:**
- Manual and verbose
- Requires understanding of HTML structure

---

## 🧠 Method 2: The One-Liner with `pandas.read_html`

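Now, here’s the same table using a **single call** to `pandas.read_html`, which downloads the page, parses it, and returns the matching tables. The remaining lines only drop empty rows and promote the first row to the header: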


```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/Mobile_Communications_of_Iran"
tables = pd.read_html(url, attrs={"class": "infobox"})  # Look for 'infobox' class tables

df = tables[0].dropna()
df.columns = df.iloc[0]
df = df[1:].reset_index(drop=True)

print(df)
```

Output:

```text
1           Native name               شرکت ارتباطات سیار ایران (همراه اول)
0          Company type                                       Semi-private
1             Traded as                      TSE: HMRZ1 ISIN: IRO1HMRZ0007
2              Industry          TelecommunicationsMobile Network Operator
3               Founded                                 1992; 33 years ago
4          Headquarters                                       Tehran, Iran
5           Area served                                               Iran
6            Key people  Mehdi Akhavan Behabadi (CEO) Babak Tarakomeh (...
7              Products  Fixed-line and mobile telephony, Internet serv...
8                 Owner                          TCI (84.15%)Sukuk (5.91%)
9   Number of employees                                              5000+
10               Parent                                                TCI
11         Subsidiaries                                           mobinnet
12              Website  
```



✅ **Pros:**
- Extremely simple
- Automatically extracts all tables as DataFrames
- Handles HTML parsing, table structure, and even rowspan/colspan

❗ **Cons:**
- Less customizable
- Requires a relatively clean HTML structure
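
`read_html` also accepts a file-like object, which helps when the HTML has already been downloaded or when you want to experiment offline. The sketch below runs on a made-up two-table page and assumes an HTML backend such as `lxml` is installed. Wrapping the string in `StringIO` avoids the deprecation warning that recent pandas versions emit for literal HTML strings.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; only the first has the "infobox" class
html = """
<table class="infobox">
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Headquarters</td><td>Tehran, Iran</td></tr>
  <tr><td>Area served</td><td>Iran</td></tr>
</table>
<table class="other">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
"""

# attrs filters tables by HTML attributes, as in the example above
tables = pd.read_html(StringIO(html), attrs={"class": "infobox"})
df = tables[0]
print(df)
```

Here the leading `<th>` row is picked up as the header, so no manual column fix is needed. The Wikipedia infobox needed one because its first row is not a header row.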

---

## 🔍 Summary

| Feature               | `selectolax`                   | `pandas.read_html`               |
|------------------------|--------------------------------|----------------------------------|
| Control                | ✅ Full control                 | 🔸 Minimal                       |
| Speed                  | ✅ Very fast (C-based engine)   | 🔸 Moderate                      |
| Simplicity             | 🔸 Manual steps                 | ✅ Super simple                  |
| Ideal for              | Custom scraping & preprocessing| Ready-made structured tables     |

---

> 💡 **Conclusion:**  
> If your goal is **quickly loading structured tables**, use `pandas.read_html`.  
> But if you need **custom parsing** or want to extract other elements (e.g., headings, links, nested tables), `selectolax` is the right tool.

You now know **both the low-level and high-level way** to scrape tabular data from websites — great job! 🧠📄
