# Data Preprocessing

This notebook handles the preprocessing steps for the news article data. The goal is to clean the raw text and split it into sentences so that it is ready for downstream NLP analysis.

The two main steps are:
1.  **News Article Cleaning:** Removing noise such as HTML tags, URLs, and extra whitespace.
2.  **Sentence Tokenization:** Segmenting the cleaned text into individual sentences with a regular-expression splitter.

## 1. Load the Data

First, let's load the English and Arabic news article datasets from the previous EDA stage.

In [19]:
import pandas as pd
import re
import os

# --- Load the datasets ---
DATA_DIR = '../data'
df_eng = pd.read_csv(os.path.join(DATA_DIR, '01_raw/news-articles-eng.csv'))
df_ara = pd.read_csv(os.path.join(DATA_DIR, '01_raw/news-articles-ara.csv'))

print("Data loaded successfully.")

Data loaded successfully.
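Before cleaning, it can help to confirm that the `body` column exists and to count rows with no text, since `clean_text` below turns non-string values into empty strings. The following is a minimal sketch on a toy CSV. The toy data and the `title` column are assumptions for illustration, not the real schema of the raw files.

```python
import io

import pandas as pd

# Toy CSV standing in for one of the raw files (hypothetical content).
toy_csv = io.StringIO(
    "title,body\n"
    "First,Some article text.\n"
    "Second,\n"
)
df = pd.read_csv(toy_csv)

# Columns the later cells rely on; only 'body' is used by this notebook.
missing_columns = {"body"} - set(df.columns)

# Empty CSV fields are read as NaN, which clean_text maps to "".
n_missing_bodies = int(df["body"].isna().sum())

print(f"Missing columns: {sorted(missing_columns)}")
print(f"Rows with no body text: {n_missing_bodies} of {len(df)}")
```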


## 2. News Article Cleaning

In [20]:
def clean_text(text):
    """
    Cleans raw text by removing HTML tags, URLs, and normalizing whitespace.
    """
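    # Missing or non-string values (e.g. NaN) become empty strings
    if not isinstance(text, str):
        return ""
    # Strip HTML tags (non-greedy, so each tag is matched separately)
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs that start with http(s):// or www.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Collapse runs of whitespace into a single space and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()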
    return text

print("Cleaning news articles...")
df_eng['body_cleaned'] = df_eng['body'].apply(clean_text)
df_ara['body_cleaned'] = df_ara['body'].apply(clean_text)
print("Cleaning complete.")

Cleaning news articles...
Cleaning complete.
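To see what the three substitutions do, the sketch below runs a copy of `clean_text` on a few hand-written strings. The sample strings are invented for illustration. The last two cases show edge cases worth knowing about: tags are replaced with nothing, so adjacent paragraphs can end up with no space between them, and `\S+` also consumes punctuation that directly follows a URL.

```python
import re

def clean_text(text):
    """Copy of the notebook's cleaning function, repeated so this block runs on its own."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Tags, a URL, and irregular whitespace in one string.
sample = "<p>Breaking   news:</p> read more at https://example.com/story now.\n\n<b>Update</b> follows."
cleaned = clean_text(sample)
print(cleaned)

# Non-string input (such as a missing value) yields an empty string.
empty = clean_text(None)

# Edge case 1: adjacent tags leave no space between the two sentences.
glued = clean_text("<p>One.</p><p>Two.</p>")
print(glued)

# Edge case 2: the full stop after the URL is removed together with the URL.
lost_stop = clean_text("See https://example.com/a. Next sentence.")
print(lost_stop)
```

Both edge cases matter for the next step, because a sentence splitter that looks for punctuation followed by whitespace will not find a boundary in `One.Two.` or in `See Next sentence.`.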


## 3. Sentence Tokenization

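We will use a regular expression to split the text into sentences. It splits on whitespace that follows sentence-ending punctuation such as `.`, `!`, or `?`. This is a simple heuristic rather than a full sentence segmenter: it needs no extra dependencies, but it also splits after abbreviations such as "Dr." or "U.S.", so some sentences may come out fragmented.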

In [21]:
def regex_sent_tokenize(text):
    """
    Splits text into sentences using a regular expression.
    This is a dependency-free alternative to nltk.sent_tokenize.
    """
    if not isinstance(text, str) or not text:
        return []
    # Split on whitespace that follows sentence-ending punctuation.
    # The positive lookbehind keeps the punctuation attached to its sentence.
    # '؟' (Arabic question mark) is included so Arabic questions are split too.
    sentences = re.split(r'(?<=[.!?؟۔])\s+', text)
    # Filter out any empty strings that might result from the split
    return [s for s in sentences if s]

# --- Apply the regex tokenizer ---
print("Tokenizing English articles into sentences...")
df_eng['sentences'] = df_eng['body_cleaned'].apply(regex_sent_tokenize)

print("Tokenizing Arabic articles into sentences...")
df_ara['sentences'] = df_ara['body_cleaned'].apply(regex_sent_tokenize)

print("\nSentence tokenization complete.")

# --- Display a sample ---
print("\n--- Example of Regex Sentence Tokenization ---")
print(f"Article body has been split into {len(df_eng['sentences'].iloc[0])} sentences.")
print("First 3 sentences:")
for sentence in df_eng['sentences'].iloc[0][:3]:
    print(f"- {sentence}")

Tokenizing English articles into sentences...
Tokenizing Arabic articles into sentences...

Sentence tokenization complete.

--- Example of Regex Sentence Tokenization ---
Article body has been split into 124 sentences.
First 3 sentences:
- Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ibrahim On October 7, 2023, the world and the Middle East awoke to the drums of war beating in the Gaza Strip.
- Over time, it turned into a reality that American efforts, Qatari and Egyptian mediation, condemnations, statements, summits, and conferences could not stop.
- While Israel continues its war in the besieged Gaza Strip, attention is turning towards the potential outbreak of another war.
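A punctuation-based split treats every full stop followed by whitespace as a boundary, including those in abbreviations. The sketch below shows that behavior on an invented sentence and then one possible mitigation: re-joining fragments that end in a known abbreviation. The helper `merge_abbreviation_splits` and the `ABBREVIATIONS` set are hypothetical and are not part of this notebook's pipeline. The splitter is restricted to English punctuation for brevity.

```python
import re

# Hypothetical, deliberately small abbreviation list for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "u.s.", "u.n."}

def regex_sent_tokenize(text):
    """Lookbehind split on whitespace after '.', '!' or '?'."""
    if not isinstance(text, str) or not text:
        return []
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def merge_abbreviation_splits(sentences):
    """Re-join a fragment to the previous one when that one ends in an abbreviation."""
    merged = []
    for fragment in sentences:
        previous_words = merged[-1].split() if merged else []
        if previous_words and previous_words[-1].lower() in ABBREVIATIONS:
            merged[-1] = merged[-1] + " " + fragment
        else:
            merged.append(fragment)
    return merged

text = "Dr. Smith visited the U.S. on Monday. He left Tuesday."

raw_split = regex_sent_tokenize(text)
print(raw_split)

repaired = merge_abbreviation_splits(raw_split)
print(repaired)
```

The merge step has its own trade-off: a sentence that genuinely ends in "U.S." would be joined to the following one. For higher accuracy, a trained segmenter is the usual alternative, at the cost of an extra dependency.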



## 4. Save Processed Data

In [23]:
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')
output_path_eng = os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed.pkl')
output_path_ara = os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed.pkl')

os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)

df_eng.to_pickle(output_path_eng)
df_ara.to_pickle(output_path_ara)

print(f"Processed English data saved to: {output_path_eng}")
print(f"Processed Arabic data saved to: {output_path_ara}")


Processed English data saved to: ../data/02_processed/news_eng_processed.pkl
Processed Arabic data saved to: ../data/02_processed/news_ara_processed.pkl
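The `sentences` column holds Python lists, and pickle preserves them on reload. The sketch below compares a pickle round trip with a CSV round trip on a toy frame. The toy data and file names are assumptions for illustration, and the notebook itself does not state why pickle was chosen.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, standing in for the processed data.
df = pd.DataFrame({
    "body_cleaned": ["One. Two.", "Three."],
    "sentences": [["One.", "Two."], ["Three."]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy.pkl")
    csv_path = os.path.join(tmp_dir, "toy.csv")

    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle restores the original list; CSV returns its string representation.
pickle_value = from_pickle["sentences"].iloc[0]
csv_value = from_csv["sentences"].iloc[0]
print(type(pickle_value).__name__, pickle_value)
print(type(csv_value).__name__, csv_value)
```

Pickle files can execute arbitrary code when loaded, so `pd.read_pickle` should only be used on files from a trusted source.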
