# Arabic Service Data Preprocessing Pipeline

This notebook documents and demonstrates the full preprocessing pipeline for Arabic government service data, as used in our rule-based, retrieval-based chatbot.

The goal: **Standardize and enrich service data for accurate, robust search and retrieval.**

## 🧠 Why Preprocessing is Critical

Arabic is morphologically rich and has many spelling variations (diacritics, hamza forms, etc).

- **Without preprocessing:** User queries and service data may not match due to superficial differences.
- **With preprocessing:** We standardize text, so queries like "أريد تجديد الرخصه" match services described as "تجديد رخصة القيادة".

This pipeline ensures that retrieval is accurate and robust, even with spelling and morphological variation.

## 📥 Input: Raw Service Data

Each service entry (from scraping) contains fields like:

- `category`
- `service_name`
- `description`
- `terms`
- `Documents`
- `related_services`
- `service_url`

We focus on preprocessing the main textual fields: `service_name`, `description`, `terms`, and `Documents`.

## 1. Install Required Libraries

If you haven't already, install the required libraries:

```bash
pip install camel_tools scikit-learn numpy
```

In [199]:
import sys
from pathlib import Path

# Manually resolve project root (e.g., two levels up from current notebook)
ROOT_DIR = Path().resolve().parent  # adjust level as needed
print(ROOT_DIR)
sys.path.append(str(ROOT_DIR))


D:\trying\Najeeb_chatbot


In [200]:
# 2. Import Required Modules
import re
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.utils.normalize import normalize_unicode, normalize_alef_maksura_ar, normalize_alef_ar, normalize_teh_marbuta_ar
from camel_tools.utils.dediac import dediac_ar
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Custom stopwords list (should cover all common Arabic stopwords)
from preprocessing.stopwordsallforms import STOPWORDS

In [201]:
# load the data to test on it
from typing import List
from config import SCRAPED_SERVICES_FILE
from scraping.scraper import ScrapedServiceData
import json
with open(SCRAPED_SERVICES_FILE, 'r', encoding='utf-8') as f:
    data:List[ScrapedServiceData]  = json.load(f)

## 3. Text Normalization

### What & Why

- **What:** Standardize Arabic text by removing diacritics and normalizing different forms of Alef, Yaa, and Teh Marbuta.
- **Why:** Reduces spelling variation, so different forms of the same word are treated as equal.

This is crucial for matching queries to services, as users may type words in many different forms.

In [202]:
def norm(text):
    """
    Normalize Arabic text by:
    - Unicode normalization
    - Removing diacritics
    - Normalizing Alef, Yaa, Teh Marbuta
    """
    normalized_text = normalize_unicode(text)
    normalized_text = dediac_ar(normalized_text)
    normalized_text = normalize_alef_ar(normalized_text)
    normalized_text = normalize_alef_maksura_ar(normalized_text)
    normalized_text = normalize_teh_marbuta_ar(normalized_text)
    return normalized_text

In [203]:
# Get the first service from data
first_service = data[1]
service_description = first_service['description']

# Show before and after normalization
print("Before normalization:", service_description)
print("After normalization:", norm(service_description))

Before normalization: تُمكّنك هذه الخدمة من تفعيل بطاقتك التموينية
After normalization: تمكنك هذه الخدمه من تفعيل بطاقتك التموينيه


## 4. Tokenization & Stopword Removal

### What & Why

- **Tokenization:** Split text into words using `simple_word_tokenize` from CamelTools (handles Arabic morphology better than `split()`).
- **Stopword Removal:** Remove common words (prepositions, pronouns, etc.) that don't help distinguish between services.
- **Normalization:** Apply the `norm` function to each token.

This step ensures that only meaningful, standardized words are kept for further processing.

In [204]:
def preprocess_text(text):
    """
    - Remove punctuation
    - Tokenize (CamelTools)
    - Remove stopwords
    - Normalize each token
    """
    text = re.sub(r'[^\w\s]', '', text)
    tokens = simple_word_tokenize(text)
    return [norm(token) for token in tokens if len(token) > 1]


In [205]:
# Show before and after preprocessing
print("Before preprocessing:", service_description)
print("After preprocessing:", preprocess_text(service_description))

Before preprocessing: تُمكّنك هذه الخدمة من تفعيل بطاقتك التموينية
After preprocessing: ['تمكنك', 'هذه', 'الخدمه', 'من', 'تفعيل', 'بطاقتك', 'التموينيه']


## 5. Enrich Service Data with Text Fields

### What & Why

- **What:** For each service, create two new fields:
    - `full_text`: Concatenation of all important fields (category, name, description, terms, documents)
    - `short_text`: Concatenation of description, name, and category (used for keyword extraction)
- **Why:**
    - `full_text` provides rich context for training the TF-IDF model.
    - `short_text` is a concise summary for extracting the most important keywords.

This separation improves the quality of keyword extraction and retrieval.

In [206]:
def enrich_services_with_texts(services_data):
    """
    For each service, add normalized 'full_text' and 'short_text' fields.
    """
    for service in services_data:
        # Get each relevant field, defaulting to empty string if missing
        category = service.get("category", "")
        name = service.get("service_name", "")
        desc = service.get("description", "")
        # Join terms list into a single string
        terms = " ".join(service.get("terms", []))
        documents = " ".join(service.get("terms", []))
        # Concatenate all fields for full_text
        full_text = f"{category} {name} {desc} {terms} {documents}"
        service["full_text"] = norm(full_text)
        # Concatenate description, name, and category for short_text
        service["short_text"] = norm(f"{category} {name} {desc}")
    return services_data

In [207]:
# Apply enrich_services_with_texts to a single service by wrapping it in a list
enriched_first_service = enrich_services_with_texts([first_service])[0]
print('full_text\n', enriched_first_service["full_text"])
print('short_text\n', enriched_first_service["short_text"])

full_text
 التموين تفعيل بطاقه تموين تمكنك هذه الخدمه من تفعيل بطاقتك التموينيه مالك البطاقه فقط (رب الاسره) المؤهل لطلب الخدمه يجب ان تكون البطاقه قد سلمت للمواطن مالك البطاقه فقط (رب الاسره) المؤهل لطلب الخدمه يجب ان تكون البطاقه قد سلمت للمواطن
short_text
 التموين تفعيل بطاقه تموين تمكنك هذه الخدمه من تفعيل بطاقتك التموينيه


In [208]:
preprocess_text(enriched_first_service["short_text"])

['التموين',
 'تفعيل',
 'بطاقه',
 'تموين',
 'تمكنك',
 'هذه',
 'الخدمه',
 'من',
 'تفعيل',
 'بطاقتك',
 'التموينيه']

## 6. TF-IDF Keyword Extraction

### What & Why

- **What:** Use TF-IDF to extract the most distinctive keywords for each service.
- **Why:**
    - TF-IDF highlights words that are frequent in a service but rare across all services.
    - This helps the chatbot match user queries to the most relevant services.

We train the TF-IDF vectorizer on all `full_text` fields, then extract top keywords from each service's `short_text`.

In [209]:
from collections import defaultdict

def category_level_full_texts(services_data):
    # Group full_texts by category
    category_to_texts = defaultdict(list)
    for service in services_data:
        category = service.get("category", "")
        category_to_texts[category].append(service["full_text"])
    # Create a mapping: category -> concatenated full_text
    category_to_bigtext = {cat: " ".join(texts) for cat, texts in category_to_texts.items()}
    # For each service, assign the bigtext of its category
    category_full_texts = [category_to_bigtext[service.get("category", "")] for service in services_data]
    return category_full_texts

- Each service in the same category will have the same (big) full_text, repeated for each service in that category.
- The list length matches the number of services, so you can use the batch approach with TF-IDF.

In [210]:
def extract_keywords(services_data:List[ScrapedServiceData], top_n=4):
    """
    - Train TF-IDF on all full_texts (preprocessed).
    - For each service, extract top N keywords from its short_text.
    - Save keywords in service['keywords'].
    """
    normalized_stopwords = [norm(word) for word in STOPWORDS.keys()]

    # Fit vectorizer on all full_texts (list of strings)
    category_full_texts = category_level_full_texts(services_data)
    vectorizer = TfidfVectorizer(stop_words=normalized_stopwords, tokenizer=preprocess_text, token_pattern=None)
    vectorizer.fit(category_full_texts)
    feature_names = vectorizer.get_feature_names_out()

    # Transform all short_texts (list of strings)
    short_texts = [service["short_text"] for service in services_data]
    services_matrix = vectorizer.transform(short_texts)  # shape: (n_services, n_features)

    for idx, service in enumerate(services_data):
        scores = services_matrix[idx].toarray().flatten()
        top_indices = np.argsort(scores)[-top_n:][::-1]
        keywords = [feature_names[i] for i in top_indices if scores[i] > 0]
        service["keywords"] = keywords
    return services_data, vectorizer, services_matrix

In [211]:
# Use the first 3 services from the loaded data for testing
test_services = data[:7]
enriched_services = enrich_services_with_texts(test_services)
services_data, vectorizer, services_matrix = extract_keywords(enriched_services, top_n=4)

# Display keywords for each service
for i, service in enumerate(services_data):
    print(f"Service {i+1}: {service['service_name']}")
    print("keywords:", service["keywords"])

Service 1: استمارة تحديث بيانات المواطن
keywords: ['تحديث', 'المقدمه', 'جوده', 'الخدمات']
Service 2: تفعيل بطاقة تموين
keywords: ['تفعيل', 'بطاقتك', 'الخدمه', 'تمكنك']
Service 3: إصدار بدل تالف أو فاقد لبطاقة تموين
keywords: ['بدل', 'تالف', 'فاقد', 'اصدار']
Service 4: نقل من محافظة إلى أخرى
keywords: ['اخري', 'محافظه', 'نقل', 'التموينيه']
Service 5: فصل نفسي
keywords: ['فصل', 'الخدمه', 'الحاليه', 'بطاقه']
Service 6: ضم أفراد أسرتى
keywords: ['ضم', 'افراد', 'بطاقتك', 'الخدمه']
Service 7: الاستعلام عن صرف
keywords: ['الاستعلام', 'صرف', 'الخدمه', 'البطاقه']


### Try the model using cosine similarity

In [212]:
from sklearn.metrics.pairwise import cosine_similarity

# Example: Find the closest service to a user query using the trained TF-IDF vectorizer and services_matrix

def find_closest_service(query, vectorizer, services_matrix, services_data):
    # Preprocess the query in the same way as service data
    # Transform the query to TF-IDF vector
    query_vec = vectorizer.transform([query])
    # Compute cosine similarity with all services
    similarities = cosine_similarity(query_vec, services_matrix).flatten()
    # Get the index of the most similar service
    best_idx = similarities.argmax()
    return services_data[best_idx], similarities[best_idx]

# Example usage:
user_query = "عاوز افعل بطاقة التموين"
closest_service, score = find_closest_service(user_query, vectorizer, services_matrix, services_data)
print("Closest service:", closest_service["service_name"])
print("Similarity score:", score)
print("Service description:", closest_service["description"])

Closest service: تفعيل بطاقة تموين
Similarity score: 0.4264014327112209
Service description: تُمكّنك هذه الخدمة من تفعيل بطاقتك التموينية


## 🔍 Notes & Best Practices

- Always preprocess both the service data and user queries in the **same way** for best retrieval accuracy.
- For even better results, consider using morphological analysis (lemmatization) or part-of-speech filtering.

# changes we made with time

- We grouped `full_text` by category, so for all services in the same category, we combined their `full_text` fields. This way, the TF-IDF model learns what is important for each category, since services in the same category often share similar words. This can improve keyword extraction and retrieval accuracy for services that belong to the same category.
- we normalized full text and short text in production and store them, instead of storing the raw combination
* We moved **tokenization and stopword removal into the vectorizer** pipeline for cleaner, more integrated preprocessing and feature extraction.
* We **delegated stopword removal** to `TfidfVectorizer` by passing a **normalized stopword list** to its `stop_words` parameter or handled it inside the tokenizer (to avoid inconsistency).
* We chose **word-level tokens (`analyzer='word'`)** with `ngram_range=(1,1)` (unigrams) for better similarity results, since it matches key Arabic words directly.
- now we don't need to preprocess the query again


# some mistakes we were doing


### 📝 TF-IDF Keyword Extraction: Per-Service vs. Batch Approach

#### **Old Approach: Per-Service Transformation**

- **How it worked:**  
  - Preprocess and join each service’s `full_text`, fit the vectorizer on all services (as a list).
  - For each service, preprocess and join its `short_text`, transform it individually, and extract keywords from its TF-IDF vector.
  - Example:
    ```python
    full_texts = [" ".join(preprocess_text(service["full_text"])) for service in services_data]
    vectorizer = TfidfVectorizer()
    vectorizer.fit(full_texts)
    feature_names = vectorizer.get_feature_names_out()
    for service in services_data:
        short_text = " ".join(preprocess_text(service["short_text"]))
        tfidf_vector = vectorizer.transform([short_text])
        scores = tfidf_vector.toarray().flatten()
        top_indices = np.argsort(scores)[-top_n:][::-1]
        keywords = [feature_names[i] for i in top_indices if scores[i] > 0]
        service["keywords"] = keywords
    ```
- **Pros:**  
  - Works for per-service keyword extraction.
  - Avoids indexing errors.
- **Cons:**  
  - Less efficient (calls `transform` for each service separately).
  - More code repetition.

---

#### **New Approach: Batch Transformation (Recommended)**

- **How it works:**  
  - Fit the vectorizer on a list of all `full_text` fields (one per service).
  - Transform all `short_text` fields at once as a batch (list of strings).
  - Iterate over the resulting TF-IDF matrix (one row per service) to extract keywords.
  - Example:
    ```python
    full_texts = [service["full_text"] for service in services_data]
    vectorizer = TfidfVectorizer()
    vectorizer.fit(full_texts)
    feature_names = vectorizer.get_feature_names_out()

    short_texts = [service["short_text"] for service in services_data]
    tfidf_matrix = vectorizer.transform(short_texts)  # shape: (n_services, n_features)

    for idx, service in enumerate(services_data):
        scores = tfidf_matrix[idx].toarray().flatten()
        top_indices = np.argsort(scores)[-top_n:][::-1]
        keywords = [feature_names[i] for i in top_indices if scores[i] > 0]
        service["keywords"] = keywords
    ```
- **Pros:**  
  - More efficient (vectorized, fewer function calls).
  - Indexing is straightforward and safe.
  - Preferred and idiomatic in scikit-learn.
- **Cons:**  
  - None significant for this use case.

---

#### Which Approach Gives Better Keyword Extraction?

- **In theory:** Both approaches should give the same results if you preprocess texts identically and use the same vectorizer settings.
- **In practice:** If you see different keywords, it's likely due to differences in how you preprocess or join the text before passing it to the vectorizer.
    - Always ensure you use the same preprocessing and joining logic for both `full_text` and `short_text`.
    - For best results, preprocess your text, join tokens with spaces, and pass the resulting strings as a list to both `fit()` and `transform()`.

**Key Point:**  
- Consistency in preprocessing is more important than the choice between per-service and batch transformation.  
- The batch approach is still recommended for efficiency and clarity, but always double-check your preprocessing pipeline for consistency to ensure high-quality keyword extraction.
---

#### **Summary Table**

| Approach         | How?                                 | Works? | Efficient? | Indexing Safe? | Keyword Quality | Recommended? |
|------------------|--------------------------------------|--------|------------|----------------|-----------------|--------------|
| Per-Service      | Transform each short_text separately | Yes    | No         | Yes            | Good            | OK           |
| Batch            | Transform all short_texts at once    | Yes    | Yes        | Yes            | Good            | **Best**     |

---

**Key Takeaway:**  
Both approaches produce the same keywords, but the **batch approach** is more efficient, concise, and recommended for production code.

# Old code compareson

In [198]:
def preprocess_text(text):
    """
    - Remove punctuation
    - Tokenize (CamelTools)
    - Remove stopwords
    - Normalize each token
    """
    # Remove all characters that are not word characters (letters, digits, or underscore) or whitespace.
    # This strips out punctuation and special symbols, leaving only Arabic/English letters and spaces.
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize using CamelTools (handles Arabic morphology)
    tokens = simple_word_tokenize(text)
    # Prepare set of Arabic stopwords
    ar_stopwords = set(STOPWORDS.keys())
    # Normalize each token and remove stopwords and single-character tokens
    return [norm(token) for token in tokens if token not in ar_stopwords and len(token) > 1]

def enrich_services_with_texts(services_data):
    """
    For each service, add 'full_text' and 'short_text' fields.
    Only include 'Documents' if not 'لا يوجد'.
    """
    for service in services_data:
        category = service.get("category", "")
        name = service.get("service_name", "")
        desc = service.get("description", "")
        terms = " ".join(service.get("terms", []))
        documents = service.get("Documents", "")
        full_text = f"{category} {name} {desc} {terms} {documents}"
        service["full_text"] = full_text.strip()
        service["short_text"] = f"{desc} {name} {category}".strip()
    return services_data
# Apply enrich_services_with_texts to a single service by wrapping it in a list
enriched_first_service = enrich_services_with_texts([first_service])[0]
print('full_text\n', enriched_first_service["full_text"])
print('short_text\n', enriched_first_service["short_text"])

def extract_keywords_from_short_texts_with_vectorizer(services_data, top_n=4):
    """
    - Train TF-IDF on all full_texts (preprocessed).
    - For each service, extract top N keywords from its short_text.
    - Save keywords in service['keywords'].
    """
    full_texts = [" ".join(preprocess_text(service["full_text"])) for service in services_data]
    vectorizer = TfidfVectorizer()
    vectorizer.fit(full_texts)
    feature_names = vectorizer.get_feature_names_out()
    for service in services_data:
        short_text = " ".join(preprocess_text(service["short_text"]))
        tfidf_vector = vectorizer.transform([short_text])
        scores = tfidf_vector.toarray().flatten()
        top_indices = np.argsort(scores)[-top_n:][::-1]
        keywords = [feature_names[i] for i in top_indices if scores[i] > 0]
        service["keywords"] = keywords
    return services_data, vectorizer

# Use the first 3 services from the loaded data for testing
test_services = data[:3]
enriched_services = enrich_services_with_texts(test_services)
services_data, vectorizer = extract_keywords_from_short_texts_with_vectorizer(enriched_services, top_n=4)

# Display keywords for each service
for i, service in enumerate(services_data):
    print(f"Service {i+1}: {service['service_name']}")
    print("keywords:", service["keywords"])

full_text
 التموين تفعيل بطاقة تموين تُمكّنك هذه الخدمة من تفعيل بطاقتك التموينية مالك البطاقة فقط (رب الأسرة) المؤهل لطلب الخدمة يجب أن تكون البطاقة قد سٌلًمت للمواطن []
short_text
 تُمكّنك هذه الخدمة من تفعيل بطاقتك التموينية تفعيل بطاقة تموين التموين
Service 1: استمارة تحديث بيانات المواطن
keywords: ['تحديث', 'الجهه', 'بياناتهم', 'بيانات']
Service 2: تفعيل بطاقة تموين
keywords: ['تفعيل', 'بطاقتك', 'التموينيه', 'بطاقه']
Service 3: إصدار بدل تالف أو فاقد لبطاقة تموين
keywords: ['تالف', 'اصدار', 'فاقد', 'بدل']
