# Data Preprocessing

This notebook performs the foundational text processing for the main analysis. Its purpose is to take the raw, unstructured news articles and transform them into a clean, sentence-tokenized format that is ready for subsequent feature engineering and modeling.

The pipeline consists of four steps:

1.  **Load Data**: Import the raw English and Arabic news articles.
2.  **Clean Text**: Remove noise from the text, such as HTML tags, URLs, and extra whitespace.
3.  **Tokenize Sentences**: Break down the cleaned text into individual sentences.
4.  **Save Processed Data**: Store the clean, structured datasets for the next stage of the pipeline.

## 1. Load the Data

First, load the two raw news article datasets (English and Arabic) that were explored in the previous EDA stage.

In [2]:
import pandas as pd
import re
import os

# --- Load the datasets ---
DATA_DIR = '../data'
df_eng = pd.read_csv(os.path.join(DATA_DIR, '01_raw/news-articles-eng.csv'))
df_ara = pd.read_csv(os.path.join(DATA_DIR, '01_raw/news-articles-ara.csv'))

print("Data loaded successfully.")

Data loaded successfully.
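
Later cells assume that both files contain a `body` column. A small guard along the following lines could surface a mismatch early. `require_columns` is a hypothetical helper, not part of the original notebook, and the demo frames are invented stand-ins for `df_eng` and `df_ara`.

```python
import pandas as pd

def require_columns(df, required, name):
    """Raise a KeyError naming any required columns that are absent from df."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"{name} is missing expected columns: {missing}")

# Invented stand-in frames; in the notebook these would be df_eng and df_ara
demo_ok = pd.DataFrame({"body": ["some text"]})
demo_bad = pd.DataFrame({"text": ["some text"]})

require_columns(demo_ok, ["body"], "demo_ok")  # passes silently

try:
    require_columns(demo_bad, ["body"], "demo_bad")
    guard_triggered = False
except KeyError:
    guard_triggered = True

print(guard_triggered)
```

Failing at load time is usually easier to debug than a `KeyError` raised later inside `.apply`.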


## 2. News Article Cleaning

In [3]:
def clean_text(text):
    """
    Cleans raw text by removing HTML tags, URLs, and normalizing whitespace.
    """
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print("Cleaning news articles...")
df_eng['body_cleaned'] = df_eng['body'].apply(clean_text)
df_ara['body_cleaned'] = df_ara['body'].apply(clean_text)
print("Cleaning complete.")

Cleaning news articles...


Cleaning complete.
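
To make the effect of `clean_text` concrete, here is a minimal sketch that applies the same three substitutions to a made-up snippet. The sample string is an assumption for illustration and is not taken from the dataset.

```python
import re

def clean_text(text):
    """Same logic as the cell above, repeated so this example runs on its own."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<.*?>', '', text)                   # drop HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # drop URLs
    text = re.sub(r'\s+', ' ', text).strip()            # collapse whitespace
    return text

# Hypothetical input, invented for illustration only
sample = "<p>Aid convoys   reached the city.</p>\nSee https://example.org/report for details."
cleaned = clean_text(sample)
print(cleaned)

# Non-string values (for example NaN in an empty cell) become an empty string
print(repr(clean_text(None)))
```

Because `\S+` is greedy, a URL immediately followed by punctuation (such as a trailing period) takes that punctuation with it, which can matter for the sentence splitting that follows.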


## 3. Sentence Tokenization

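We will use a regular expression to split the text into sentences. The pattern splits on whitespace that follows sentence-ending punctuation (such as `.`, `!`, or `?`). This is a simple, dependency-free heuristic that works reasonably well for news text, although it will also split after abbreviations such as "Dr." that end in a period.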

In [4]:
def regex_sent_tokenize(text):
    """
    Splits text into sentences using a regular expression.
    This is a dependency-free alternative to nltk.sent_tokenize.
    """
    if not isinstance(text, str) or not text:
        return []
    # Split on whitespace that follows sentence-ending punctuation.
    # The positive lookbehind keeps the punctuation attached to its sentence.
    # '؟' (Arabic question mark) is included alongside '۔' so Arabic questions split too.
    sentences = re.split(r'(?<=[.!?؟۔])\s+', text)
    # Filter out any empty strings that might result from the split
    return [s for s in sentences if s]

# --- Apply the regex tokenizer ---
print("Tokenizing English articles into sentences...")
df_eng['sentences'] = df_eng['body_cleaned'].apply(regex_sent_tokenize)

print("Tokenizing Arabic articles into sentences...")
df_ara['sentences'] = df_ara['body_cleaned'].apply(regex_sent_tokenize)

print("\nSentence tokenization complete.")

# --- Display a sample ---
print("\n--- Example of Regex Sentence Tokenization ---")
print(f"Article body has been split into {len(df_eng['sentences'].iloc[0])} sentences.")
print("First 3 sentences:")
for sentence in df_eng['sentences'].iloc[0][:3]:
    print(f"- {sentence}")

Tokenizing English articles into sentences...
Tokenizing Arabic articles into sentences...

Sentence tokenization complete.

--- Example of Regex Sentence Tokenization ---
Article body has been split into 124 sentences.
First 3 sentences:
- Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ibrahim On October 7, 2023, the world and the Middle East awoke to the drums of war beating in the Gaza Strip.
- Over time, it turned into a reality that American efforts, Qatari and Egyptian mediation, condemnations, statements, summits, and conferences could not stop.
- While Israel continues its war in the besieged Gaza Strip, attention is turning towards the potential outbreak of another war.
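
The behavior of a lookbehind split is easiest to see on a short string. The sketch below uses a pattern of the same form as the tokenizer in this section, restricted to ASCII punctuation for brevity. The sample sentence is invented for illustration.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (ASCII-only variant)."""
    if not isinstance(text, str) or not text:
        return []
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Hypothetical input, invented for illustration only
sample = "Dr. Ahmed visited the camp. Is aid arriving? Prices rose 3.5 percent."
parts = split_sentences(sample)
for part in parts:
    print(f"- {part}")
```

Two things stand out: "3.5" stays intact because no whitespace follows the period, while "Dr." is split off as its own fragment. If abbreviations are common in the corpus, a post-processing merge step or a trained tokenizer may be worth considering.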



## 4. Save Processed Data

In [5]:
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')
output_path_eng = os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed.pkl')
output_path_ara = os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed.pkl')

os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)

# Save the processed dataframes
df_eng.to_pickle(output_path_eng)
df_ara.to_pickle(output_path_ara)

print(f"\nProcessed English data saved to: {output_path_eng}")
print(f"Processed Arabic data saved to: {output_path_ara}")



Processed English data saved to: ../data/02_processed/news_eng_processed.pkl
Processed Arabic data saved to: ../data/02_processed/news_ara_processed.pkl
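
The notebook does not say why pickle was chosen over CSV. One plausible reason is that the `sentences` column holds Python lists, which a pickle round trip preserves and a CSV round trip does not. The sketch below illustrates this with an invented one-row frame written to a temporary directory.

```python
import os
import tempfile
import pandas as pd

# Invented one-row frame that mimics the shape of the processed data
df_demo = pd.DataFrame({
    "body_cleaned": ["First sentence. Second sentence."],
    "sentences": [["First sentence.", "Second sentence."]],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "demo.pkl")
    csv_path = os.path.join(tmp, "demo.csv")
    df_demo.to_pickle(pkl_path)
    df_demo.to_csv(csv_path, index=False)
    from_pkl = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle keeps the list objects; CSV turns them into their string representation
print(type(from_pkl["sentences"].iloc[0]).__name__)
print(type(from_csv["sentences"].iloc[0]).__name__)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.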


### **Additional Task: Completing the Risk Factor Data**

Run this cell only if you skipped the EDA notebook, since the EDA stage also performs this translation.

For good data governance and to aid future analysis, it's a best practice to have a complete dataset. The original `risk-factors.xlsx` file was missing the Arabic translations. This one-time step maps each English term to a pre-translated Arabic equivalent and saves the result as a new file, `risk-factors-translated.xlsx`, which contains both the English and Arabic terms. This keeps the core data complete and bilingual while leaving the original file untouched.

In [7]:
import pandas as pd
import os

# --- 1. Load your original risk factors file ---
DATA_DIR = '../data'
risk_factors_path = os.path.join(DATA_DIR, '01_raw/risk-factors.xlsx')
df_risk = pd.read_excel(risk_factors_path)

# --- 2. Use the pre-translated dictionary (fast and reliable) ---
print("Mapping pre-translated Arabic risk factors...")
risk_factor_translations_ara = {
    'massive starvation': 'مجاعة هائلة', 'rinderpest': 'طاعون بقري', 'scanty rainfall': 'شح الأمطار', 'dysfunction': 'خلل وظيفي', 'rise': 'ارتفاع', 'mass displacement': 'نزوح جماعي', 'conflict': 'صراع', 'hunger': 'جوع', 'malnutrition': 'سوء تغذية', 'drought': 'جفاف', 'locust': 'جراد', 'insecurity': 'انعدام الأمن', 'violence': 'عنف', 'poverty': 'فقر', 'displacement': 'نزوح', 'disease': 'مرض', 'death': 'موت', 'disaster': 'كارثة', 'crisis': 'أزمة', 'famine': 'مجاعة', 'emergency': 'طوارئ', 'shortage': 'نقص', 'cholera': 'كوليرا', 'malaria': 'ملاريا', 'measles': 'حصبة', 'typhoid': 'تيفوئيد', 'ebola': 'إيبولا', 'hiv': 'فيروس نقص المناعة البشرية', 'aids': 'الإيدز', 'tuberculosis': 'سل', 'diarrhea': 'إسهال', 'undernutrition': 'نقص التغذية', 'food prices': 'أسعار المواد الغذائية', 'inflation': 'تضخم', 'economic collapse': 'انهيار اقتصادي', 'currency devaluation': 'تخفيض قيمة العملة', 'unemployment': 'بطالة', 'corruption': 'فساد', 'sanctions': 'عقوبات', 'blockade': 'حصار', 'looting': 'نهب', 'theft': 'سرقة', 'crime': 'جريمة', 'terrorism': 'إرهاب', 'insurgency': 'تمرد', 'civil war': 'حرب أهلية', 'war': 'حرب', 'bombing': 'قصف', 'airstrike': 'غارة جوية', 'shelling': 'قصف مدفعي', 'gunfire': 'إطلاق نار', 'explosion': 'انفجار', 'massacre': 'مذبحة', 'genocide': 'إبادة جماعية', 'ethnic cleansing': 'تطهير عرقي', 'torture': 'تعذيب', 'rape': 'اغتصاب', 'abduction': 'اختطاف', 'kidnapping': 'خطف', 'hostage': 'رهينة', 'assassination': 'اغتيال', 'coup': 'انقلاب', 'political instability': 'عدم استقرار سياسي', 'protest': 'احتجاج', 'riot': 'شغب', 'curfew': 'حظر تجول', 'state of emergency': 'حالة طوارئ', 'martial law': 'أحكام عرفية', 'election violence': 'عنف انتخابي', 'border closure': 'إغلاق الحدود', 'refugee': 'لاجئ', 'asylum seeker': 'طالب لجوء', 'internally displaced person': 'نازح داخلي', 'migrant': 'مهاجر', 'human trafficking': 'اتجار بالبشر', 'smuggling': 'تهريب', 'flood': 'فيضان', 'hurricane': 'إعصار', 'cyclone': 'إعصار', 'typhoon': 'إعصار', 'earthquake': 'زلزال', 'tsunami': 'تسونامي', 
    'volcano': 'بركان', 'landslide': 'انهيار أرضي', 'avalanche': 'انهيار ثلجي', 'wildfire': 'حرائق غابات', 'heatwave': 'موجة حر', 'cold wave': 'موجة برد', 'hailstorm': 'عاصفة برد', 'tornado': 'إعصار', 'storm': 'عاصفة', 'monsoon': 'موسم الأمطار', 'crop failure': 'فشل المحاصيل', 'harvest failure': 'فشل الحصاد', 'livestock death': 'نفوق الماشية', 'water shortage': 'نقص المياه', 'power outage': 'انقطاع التيار الكهربائي', 'fuel shortage': 'نقص الوقود', 'road closure': 'إغلاق الطرق', 'infrastructure damage': 'أضرار في البنية التحتية', 'hospital closure': 'إغلاق المستشفيات', 'school closure': 'إغلاق المدارس', 'market closure': 'إغلاق الأسواق', 'aid shortage': 'نقص المساعدات', 'aid worker killed': 'مقتل عامل إغاثة', 'aid worker abducted': 'اختطاف عامل إغاثة', 'ngo withdrawal': 'انسحاب المنظمات غير الحكومية', 'un withdrawal': 'انسحاب الأمم المتحدة', 'peacekeeping mission': 'بعثة حفظ السلام', 'ceasefire violation': 'انتهاك وقف إطلاق النار', 'failed state': 'دولة فاشلة', 'anarchy': 'فوضى', 'armed group': 'جماعة مسلحة', 'militia': 'ميليشيا', 'rebel': 'متمرد', 'terrorist group': 'جماعة إرهابية', 'child soldier': 'جندي طفل', 'landmine': 'لغم أرضي', 'chemical weapon': 'سلاح كيماوي', 'biological weapon': 'سلاح بيولوجي', 'nuclear weapon': 'سلاح نووي', 'dirty bomb': 'قنبلة قذرة', 'small arms': 'أسلحة صغيرة', 'heavy weapons': 'أسلحة ثقيلة', 'artillery': 'مدفعية', 'tank': 'دبابة', 'fighter jet': 'طائرة مقاتلة', 'drone': 'طائرة بدون طيار', 'naval blockade': 'حصار بحري', 'piracy': 'قرصنة', 'human rights violation': 'انتهاك حقوق الإنسان', 'freedom of speech': 'حرية التعبير', 'freedom of press': 'حرية الصحافة', 'freedom of assembly': 'حرية التجمع', 'freedom of religion': 'حرية الدين', 'ethnic discrimination': 'تمييز عرقي', 'religious discrimination': 'تمييز ديني', 'gender discrimination': 'تمييز بين الجنسين', 'child labor': 'عمالة الأطفال', 'forced labor': 'العمل القسري', 'slavery': 'عبودية', 'debt bondage': 'عبودية الدين', 'land seizure': 'الاستيلاء على الأراضي', 'forced eviction': 'إخلاء قسري',
    'land grab': 'الاستيلاء على الأراضي', 'brutal government': 'حكومة وحشية', 'bombing campaign': 'حملة قصف', 'transport bottleneck': 'عنق زجاجة في النقل', 'weather extremes': 'ظواهر جوية متطرفة', 'price rise': 'ارتفاع الأسعار', 'cattle plague': 'طاعون الماشية', 'mismanagement': 'سوء الإدارة', 'harvest decline': 'انخفاض المحصول', 'forests destroyed': 'تدمير الغابات', 'jihadist groups': 'جماعات جهادية', 'migration': 'هجرة', 'economic impoverishment': 'إفقار اقتصادي', 'continued strife': 'استمرار الصراع', 'ecological crisis': 'أزمة بيئية', 'slave trade': 'تجارة الرقيق', 'lack of agricultural infrastructure': 'نقص البنية التحتية الزراعية', 'stolen food aid': 'سرقة المساعدات الغذائية', 'gangs of bandits': 'عصابات قطاع الطرق', 'gastrointestinal': 'معدي معوي', 'hunger crises': 'أزمات الجوع', 'pests': 'آفات', 'clan battle': 'معركة عشائرية', 'regimes were toppled': 'إسقاط الأنظمة'
}

# Map the English keywords to their Arabic translations
df_risk['risk_factor_arabic'] = df_risk['risk_factor_english'].map(risk_factor_translations_ara)

print("\n--- Translation Complete ---")
print("Sample of the updated DataFrame:")
print(df_risk.head())


# --- 3. Save to a NEW file to preserve your original data ---
output_filename = 'risk-factors-translated.xlsx'
output_path = os.path.join(DATA_DIR, '01_raw', output_filename)

df_risk.to_excel(output_path, index=False)

print(f"\nSuccessfully saved the complete, translated risk factors to: {output_path}")

Mapping pre-translated Arabic risk factors...

--- Translation Complete ---
Sample of the updated DataFrame:
  risk_factor_english risk_factor_arabic
0  massive starvation        مجاعة هائلة
1          rinderpest         طاعون بقري
2     scanty rainfall         شح الأمطار
3         dysfunction          خلل وظيفي
4                rise             ارتفاع

Successfully saved the complete, translated risk factors to: ../data/01_raw/risk-factors-translated.xlsx
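
`Series.map` with a dictionary returns `NaN` for any key it cannot find, so terms that differ from the dictionary keys in case or whitespace would silently end up untranslated. The sketch below shows one way to check for that. The tiny dictionary and frame are invented for illustration, and the normalization step (`strip` plus `lower`) is an assumption about what the real file might need.

```python
import pandas as pd

# Tiny invented subset of the translation dictionary
translations = {"drought": "جفاف", "famine": "مجاعة"}

# Invented frame with one clean term, one with stray case/whitespace, one unknown
df_demo = pd.DataFrame({"risk_factor_english": ["drought", "Famine ", "locust swarm"]})

# Normalize before mapping so "Famine " still matches the key "famine"
normalized = df_demo["risk_factor_english"].str.strip().str.lower()
df_demo["risk_factor_arabic"] = normalized.map(translations)

# Collect the terms that have no translation
unmapped = df_demo.loc[df_demo["risk_factor_arabic"].isna(), "risk_factor_english"].tolist()
print(f"Unmapped terms: {unmapped}")
```

Running a check like this before saving makes it clear whether the translated file is actually complete.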
