# Model selection
This notebook performs the core classification stage of our analysis. After the preprocessing notebook applied a fast **semantic filter** to select a smaller subset of high-value articles, this stage uses a **multilingual zero-shot classification model** to analyze those articles sentence by sentence and extract specific risk factors from them. No model weights are trained or fine-tuned here; the model is used as published.

We use the `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` model, an mDeBERTa-v3 checkpoint fine-tuned on the MNLI and XNLI natural language inference datasets, which makes it well suited to multilingual zero-shot classification.

The main steps are:

1.  **Load Filtered Data**: Load the smaller, pre-selected datasets for English and Arabic that were generated by the semantic filter in the previous step.
2.  **Load Risk Factors**: Load the list of 167 English risk factors that will serve as the classification labels.
3.  **Initialize the SOTA Model**: Set up the zero-shot classification pipeline using the `mDeBERTa-v3` model.
4.  **Run Batch Classification**: Efficiently process all sentences from the filtered articles in large batches, classifying each one against the risk factors and keeping results that meet a confidence threshold of 0.80.
5.  **Save the Results**: Save the final DataFrame containing the identified risk mentions.
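
To make step 4 concrete, here is a minimal sketch of the confidence filter, assuming a result in the dictionary shape the zero-shot pipeline returns (`sequence`, `labels`, `scores`). The sentence, labels, and scores below are made up for illustration, and `keep_confident` is a hypothetical helper, not a function from this notebook.

```python
# Hypothetical result for ONE sentence, in the shape returned by the
# transformers zero-shot pipeline: parallel lists of labels and scores.
result = {
    "sequence": "After more than a decade of conflict, conditions keep deteriorating.",
    "labels": ["years of warfare", "continued deterioration", "rinderpest"],
    "scores": [0.99, 0.91, 0.02],
}

THRESHOLD = 0.80  # same cut-off as in step 4


def keep_confident(result, threshold):
    """Return the (label, score) pairs whose score meets the threshold."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


mentions = keep_confident(result, THRESHOLD)
print(mentions)
```

With `multi_label=True`, each label is scored independently, so one sentence can yield several risk mentions or none at all.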

## Load Processed Data

In [1]:
import pandas as pd
import os
from transformers import pipeline
import torch

# --- Load the processed datasets ---
DATA_DIR = '../data'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')

df_eng = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed_filtered.pkl'))
df_ara = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed_filtered.pkl'))

print("Processed data loaded successfully.")

Processed data loaded successfully.


In [32]:
df_eng

Unnamed: 0,uri,lang,isDuplicate,date,time,dateTime,dateTimePub,dataType,sim,url,...,source,authors,image,eventUri,sentiment,wgt,relevance,userHasPermissions,body_cleaned,sentences
1293,2024-07-410603469,eng,False,2024-07-04,00:07:25,2024-07-04T00:07:25Z,2024-07-03T21:22:37Z,news,0.0,https://reliefweb.int/job/4074817/senior-proje...,...,"{'uri': 'reliefweb.int', 'dataType': 'news', '...",[],https://reliefweb.int/modules/custom/reliefweb...,,0.294118,161,161,,"IMPACT Initiatives is a humanitarian NGO, base...","[IMPACT Initiatives is a humanitarian NGO, bas..."
1303,8196642280,eng,False,2024-06-26,14:17:08,2024-06-26T14:17:08Z,2024-06-26T14:14:47Z,news,0.0,https://www.arabnews.com/node/2537996,...,"{'uri': 'arabnews.com', 'dataType': 'news', 't...",[],https://www.arabnews.com/sites/default/files/a...,,-0.098039,161,161,,The UK has traditionally been a leading donor ...,[The UK has traditionally been a leading donor...
4030,8219677809,eng,False,2024-07-11,03:10:48,2024-07-11T03:10:48Z,2024-07-11T03:08:46Z,news,0.972549,https://www.dailymail.co.uk/wires/afp/article-...,...,"{'uri': 'dailymail.co.uk', 'dataType': 'news',...",[],https://i.dailymail.co.uk/1s/2024/07/11/03/wir...,eng-9723400,-0.34902,112,112,,A man fills barrels with water at a camp for i...,[A man fills barrels with water at a camp for ...
6910,2024-07-414589881,eng,False,2024-07-08,05:14:50,2024-07-08T05:14:50Z,2024-07-08T05:14:41Z,news,0.0,https://www.thestandard.com.hk/breaking-news/s...,...,"{'uri': 'thestandard.com.hk', 'dataType': 'new...",[],https://www.thestandard.com.hk/images/instant_...,,-0.427451,104,104,,A man was bitten during a robbery in Jordan wh...,[A man was bitten during a robbery in Jordan w...
7013,8204016792,eng,False,2024-07-01,10:04:05,2024-07-01T10:04:05Z,2024-07-01T10:03:23Z,news,0.0,https://www.intellinews.com/biblical-plague-of...,...,"{'uri': 'intellinews.com', 'dataType': 'news',...",[],https://d39raawggeifpx.cloudfront.net/styles/1...,,-0.184314,104,104,,Farmers in Iraq's Sulaymaniyah are suffering a...,[Farmers in Iraq's Sulaymaniyah are suffering ...
7115,8195181358,eng,False,2024-06-25,18:28:32,2024-06-25T18:28:32Z,2024-06-25T18:27:05Z,news,0.780392,https://www.etvbharat.com/en/!international/un...,...,"{'uri': 'etvbharat.com', 'dataType': 'news', '...","[{'uri': 'etv_bharat@etvbharat.com', 'name': '...",https://etvbharatimages.akamaized.net/etvbhara...,eng-9682086,-0.694118,104,104,,Torture is not criminalised in law as a separa...,[Torture is not criminalised in law as a separ...
7798,8198242448,eng,False,2024-06-27,12:08:19,2024-06-27T12:08:19Z,2024-06-27T12:08:02Z,news,0.0,https://www.freshplaza.com/europe/article/9639...,...,"{'uri': 'freshplaza.com', 'dataType': 'news', ...",[],https://agfstorage.blob.core.windows.net/misc/...,,0.07451,102,102,,"In the Nafke plains of Duhok, a new variety of...","[In the Nafke plains of Duhok, a new variety o..."
8963,2024-06-399489006,eng,False,2024-06-24,08:09:43,2024-06-24T08:09:43Z,2024-06-24T07:56:11Z,news,0.0,https://www.syriahr.com/en/337025/,...,"{'uri': 'syriahr.com', 'dataType': 'news', 'ti...",[],https://www.syriahr.com/en/wp-content/uploads/...,,0.066667,83,83,,Al-Hasakah province: The International Coaliti...,[Al-Hasakah province: The International Coalit...
10642,2024-06-401341748,eng,False,2024-06-25,17:19:52,2024-06-25T17:19:52Z,2024-06-25T16:31:25Z,news,0.596078,https://www.cbsnews.com/minnesota/news/floodin...,...,"{'uri': 'cbsnews.com', 'dataType': 'news', 'ti...","[{'uri': 'caroline_cummings@cbsnews.com', 'nam...",https://assets3.cbsnewsstatic.com/hub/i/r/2024...,eng-9675291,0.05098,78,78,,MINNEAPOLIS -- Massive flooding is now impacti...,[MINNEAPOLIS -- Massive flooding is now impact...
10921,8193950743,eng,False,2024-06-25,05:02:55,2024-06-25T05:02:55Z,2024-06-25T05:01:54Z,news,0.717647,https://gulfnews.com/opinion/op-eds/conflicts-...,...,"{'uri': 'gulfnews.com', 'dataType': 'news', 't...","[{'uri': 'ashok_swain@gulfnews.com', 'name': '...",https://imagevars.gulfnews.com/2022/06/26/OPN-...,eng-9679220,-0.466667,77,77,,World must prioritise ending conflicts and fas...,[World must prioritise ending conflicts and fa...


In [33]:
df_ara

Unnamed: 0,uri,lang,isDuplicate,date,time,dateTime,dateTimePub,dataType,sim,url,...,source,authors,image,eventUri,sentiment,wgt,relevance,userHasPermissions,body_cleaned,sentences
87614,2024-07-409596514,ara,False,2024-07-03,06:28:18,2024-07-03T06:28:18Z,2024-07-03T06:27:50Z,news,0.000000,https://www.sarayanews.com/article/942226/ارتف...,...,"{'uri': 'sarayanews.com', 'dataType': 'news', ...",[],https://www.sarayanews.com/image.php?token=b3a...,,,232,232,,سرايا - شهدت الصادرات الوطنية إلى العراق خلال ...,[سرايا - شهدت الصادرات الوطنية إلى العراق خلال...
89534,8235094285,ara,False,2024-07-20,14:36:00,2024-07-20T14:36:00Z,2024-07-20T14:35:34Z,news,0.000000,https://cedarnews.net/bbc/738198/%d8%aa%d8%ad%...,...,"{'uri': 'cedarnews.net', 'dataType': 'news', '...",[],https://cedarnews.net/wp-content/uploads/2024/...,,,186,186,,حذر خبراء ووكالات الأمن السيبراني، من موجات قر...,[حذر خبراء ووكالات الأمن السيبراني، من موجات ق...
91397,8195317645,ara,False,2024-06-25,20:33:14,2024-06-25T20:33:14Z,2024-06-25T20:30:35Z,news,0.000000,https://hathalyoum.net/articles/3371721,...,"{'uri': 'hathalyoum.net', 'dataType': 'news', ...",[],https://baghdadtoday.news/uploads/posts/2024-0...,,,157,157,,مطامير النفايات تلوث الموصل وتهدد بمخاطر صحية ...,[مطامير النفايات تلوث الموصل وتهدد بمخاطر صحية...
92519,8200477079,ara,False,2024-06-28,17:35:00,2024-06-28T17:35:00Z,2024-06-28T17:34:10Z,news,0.000000,https://www.shorouknews.com/columns/view.aspx?...,...,"{'uri': 'shorouknews.com', 'dataType': 'news',...",[],https://www.shorouknews.com/uploadedimages/Col...,,,151,151,,نعيش فى عالم مضطرب تتنوع صراعاته من اقتصادية و...,[نعيش فى عالم مضطرب تتنوع صراعاته من اقتصادية ...
93368,8209833481,ara,False,2024-07-04,15:58:01,2024-07-04T15:58:01Z,2024-07-04T15:57:20Z,news,0.000000,https://www.almamlakatv.com//news/145786,...,"{'uri': 'almamlakatv.com', 'dataType': 'news',...",[],https://www.almamlakatv.com//images/articles/b...,,,134,134,,المفوض العام لوكالة الأمم المتحدة لإغاثة وتشغي...,[المفوض العام لوكالة الأمم المتحدة لإغاثة وتشغ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168741,8209955759,ara,False,2024-07-04,17:36:35,2024-07-04T17:36:35Z,2024-07-04T17:36:18Z,news,0.639216,https://article.albawaba.net/ar/%D8%A3%D8%B9%D...,...,"{'uri': 'article.albawaba.net', 'dataType': 'n...",[],https://article.albawaba.net/sites/default/fil...,ara-1670641,,1,1,,قالت وزارة المالية الأرنية إن قيمة العجز المال...,[قالت وزارة المالية الأرنية إن قيمة العجز الما...
168926,8209047528,ara,False,2024-07-04,07:18:10,2024-07-04T07:18:10Z,2024-07-04T07:17:54Z,news,0.000000,https://www.alquds.com/ar/posts/126993,...,"{'uri': 'alquds.com', 'dataType': 'news', 'tit...",[],https://alquds.fra1.digitaloceanspaces.com/upl...,,,1,1,,قال الجهاز المركزي للإحصاء، إن الرقم القياسي ل...,[قال الجهاز المركزي للإحصاء، إن الرقم القياسي ...
169775,8204494532,ara,False,2024-07-01,15:03:23,2024-07-01T15:03:23Z,2024-07-01T15:03:02Z,news,0.956863,https://royanews.tv/news/329411?1719846101,...,"{'uri': 'royanews.tv', 'dataType': 'news', 'ti...",[],https://cdnimg.royanews.tv/imageserv/Size728Q4...,ara-1669252,,1,1,,الأمن: التعامل مع إطلاق العيارات النارية كجرائ...,[الأمن: التعامل مع إطلاق العيارات النارية كجرا...
169821,8204183609,ara,False,2024-07-01,11:47:12,2024-07-01T11:47:12Z,2024-07-01T11:46:58Z,news,0.000000,https://alrai.com/article/10841693/إقتصاد/تثبي...,...,"{'uri': 'alrai.com', 'dataType': 'news', 'titl...",[],https://alrai.com/alraijordan/uploads/images/2...,,,1,1,,عمان - الرأي ثبتت وزارة الصناعة والتجارة والتم...,[عمان - الرأي ثبتت وزارة الصناعة والتجارة والت...


## Load Risk Factors

In [34]:
# --- Load risk factors ---
risk_factors_path = os.path.join(DATA_DIR, '01_raw/risk-factors.xlsx')
df_risk_factors = pd.read_excel(risk_factors_path)

risk_factor_labels = df_risk_factors['risk_factor_english'].tolist()

print(f"{len(risk_factor_labels)} risk factors loaded.")
print("Sample risk factors:", risk_factor_labels[:5])

167 risk factors loaded.
Sample risk factors: ['massive starvation', 'rinderpest', 'scanty rainfall', 'dysfunction', 'rise']


In [35]:
# --- Load the Arabic translations of the risk factors ---
risk_factors_ara_path = os.path.join(DATA_DIR, '01_raw/risk-factors-translated.xlsx')
df_risk_factors_ara = pd.read_excel(risk_factors_ara_path)

risk_factor_ara_labels = df_risk_factors_ara['risk_factor_arabic'].tolist()

print(f"{len(risk_factor_ara_labels)} risk factors loaded.")
print("Sample risk factors:", risk_factor_ara_labels[:5])

167 risk factors loaded.
Sample risk factors: ['مجاعة هائلة', 'طاعون بقري', 'شح الأمطار', 'خلل وظيفي', 'ارتفاع']


## Initialize the Model

In [36]:
# --- Check for GPU ---
device = 0 if torch.cuda.is_available() else -1
if device == 0:
    print("GPU found. The model will run on the GPU.")
else:
    print("No GPU found. The model will run on the CPU. This may be slow.")

# --- Initialize the zero-shot classification pipeline ---

# mDeBERTa-v3 fine-tuned on MNLI and XNLI; used here in place of
# XLM-RoBERTa for multilingual zero-shot classification.
MODEL_NAME = 'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli'

classifier = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME,
    device=device  # 0 = first GPU, -1 = CPU
)

print(f"Zero-shot classification pipeline initialized with SOTA model: {MODEL_NAME}")

GPU found. The model will run on the GPU.


Device set to use cuda:0


Zero-shot classification pipeline initialized with SOTA model: MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
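
An NLI-based zero-shot classifier scores every (sentence, label) pair separately: the sentence is the premise, and the label is inserted into a hypothesis template. The sketch below only builds those text pairs to show how the workload grows; it does not call the model. `build_nli_pairs` and the two example sentences are hypothetical, and the template shown is the default of the `transformers` zero-shot pipeline.

```python
# Default hypothesis template of the transformers zero-shot pipeline.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pairs(sentences, labels, template=HYPOTHESIS_TEMPLATE):
    """Build one (premise, hypothesis) pair per sentence/label combination."""
    return [
        (sentence, template.format(label))
        for sentence in sentences
        for label in labels
    ]


# Illustrative inputs (not taken from the dataset).
sentences = [
    "Farmers are suffering from a locust plague.",
    "Floods closed the main road.",
]
labels = ["drought", "flood", "locust"]

pairs = build_nli_pairs(sentences, labels)
print(len(pairs))   # 2 sentences x 3 labels
print(pairs[0])

# The same arithmetic for a 328-sentence sample with 167 labels:
n_pairs_sample = 328 * 167
print(n_pairs_sample)
```

Because the number of model inputs is the product of sentences and labels, a long label list dominates the run time far more than the batch size does.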


## Run Classification

This is the main processing step. To maximize efficiency, we use batch processing instead of iterating through the articles one by one. This involves:

1.  **Restructuring the Data**: We combine all sentences from all articles into a single, large list.
2.  **Batch Inference**: The entire list is passed to the transformers pipeline, which groups the sentences into batches of `batch_size` to keep the GPU busy.

This is faster than classifying one sentence per call because it reduces per-call and CPU-GPU communication overhead. We first run the process on a small sample of 10 articles from each language to verify the pipeline and inspect the output structure.

Note: Even with this optimization, processing the full dataset of ~172,000 articles is a computationally intensive task that will still take a considerable amount of time. Running this initial small sample is crucial for ensuring the code works correctly before launching the full analysis.
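
The restructuring and batching described above can be sketched without pandas or a model. This is an illustration only: `flatten_sentences` and `iter_batches` are hypothetical helpers, the sentences are placeholders, and in the notebook itself the pipeline does the batching internally via `batch_size`.

```python
# Toy articles; the ids and dates mirror rows shown earlier, the
# sentences are placeholders.
articles = [
    {"article_id": 1293, "date": "2024-07-04",
     "sentences": ["First sentence.", "Second sentence."]},
    {"article_id": 1303, "date": "2024-06-26",
     "sentences": ["Only sentence."]},
    {"article_id": 4030, "date": "2024-07-11", "sentences": []},
]


def flatten_sentences(articles):
    """One row per sentence, keeping the article id and date alongside it."""
    return [
        {"article_id": a["article_id"], "date": a["date"], "sentence_text": s}
        for a in articles
        for s in a["sentences"]
    ]


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


rows = flatten_sentences(articles)
batches = list(iter_batches(rows, batch_size=2))
print(len(rows), [len(b) for b in batches])
```

Keeping `article_id` and `date` next to each sentence is what allows a classifier result at position `i` to be traced back to its source article.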

In [37]:
import pandas as pd
from tqdm.auto import tqdm # A library to create smart progress bars

def extract_risk_factors_fast(df, classifier, labels, threshold=0.80, batch_size=32):
    """
    Extracts risk factors from a DataFrame by processing all sentences in batches.
    """
    # 1. Restructure the data for batch processing
    if 'article_id' not in df.columns:
        df['article_id'] = df.index
        
    df_sentences = df.explode('sentences').rename(columns={'sentences': 'sentence_text'})
    df_sentences = df_sentences[['article_id', 'date', 'sentence_text']].dropna(subset=['sentence_text'])
    
    sentence_list = df_sentences['sentence_text'].tolist()
    
    if not sentence_list:
        print("No sentences to process for this sample.")
        return pd.DataFrame()
        
    print(f"Processing {len(sentence_list):,} sentences in batches of {batch_size}...")

    results_list = []
    # 2. Process all sentences in one go with a progress bar
    for i, result in tqdm(enumerate(classifier(sentence_list, labels, multi_label=True, batch_size=batch_size)), total=len(sentence_list)):
        # 3. Filter results and store them
        for label, score in zip(result['labels'], result['scores']):
            if score >= threshold:
                original_row = df_sentences.iloc[i]
                results_list.append({
                    'article_id': original_row['article_id'],
                    'date': original_row['date'],
                    'sentence_text': result['sequence'],
                    'risk_factor': label,
                    'confidence_score': score
                })

    return pd.DataFrame(results_list)

# --- Set the sample size to 10 articles each ---
# Comment out the next two lines to process the full dataset.
# Explicit copies avoid pandas' SettingWithCopyWarning when a column is added later.
df_eng = df_eng.head(10).copy()
df_ara = df_ara.head(10).copy()

BATCH_SIZE = 128 # You can adjust this based on your GPU memory

# --- Run the FAST extraction process on the new sample copies ---
print("Starting risk factor extraction for English articles...")
eng_risk_mentions = extract_risk_factors_fast(df_eng, classifier, risk_factor_labels, batch_size=BATCH_SIZE)

print("\nStarting risk factor extraction for Arabic articles...")
ara_risk_mentions = extract_risk_factors_fast(df_ara, classifier, risk_factor_ara_labels, batch_size=BATCH_SIZE)

print("\nRisk factor extraction complete.")

# --- Combine results from both languages ---
all_risk_mentions = pd.concat([eng_risk_mentions, ara_risk_mentions], ignore_index=True)

print("\n--- Final Extracted Risk Factors ---")
# Display the full results from this small sample
if all_risk_mentions.empty:
    print("No risk factors found in the sample articles with the current threshold.")
else:
    print(all_risk_mentions)

Starting risk factor extraction for English articles...
Processing 328 sentences in batches of 128...


  0%|          | 0/328 [00:00<?, ?it/s]


Starting risk factor extraction for Arabic articles...
Processing 328 sentences in batches of 128...


  0%|          | 0/328 [00:00<?, ?it/s]


Risk factor extraction complete.

--- Final Extracted Risk Factors ---
      article_id        date  \
0           1293  2024-07-04   
1           1293  2024-07-04   
2           1293  2024-07-04   
3           1293  2024-07-04   
4           1293  2024-07-04   
...          ...         ...   
5563       96000  2024-07-19   
5564       96000  2024-07-19   
5565       96000  2024-07-19   
5566       96000  2024-07-19   
5567       96000  2024-07-19   

                                          sentence_text  \
0     IMPACT Initiatives is a humanitarian NGO, base...   
1     IMPACT was launched at the initiative of ACTED...   
2     After more than a decade of conflict, conditio...   
3     After more than a decade of conflict, conditio...   
4     After more than a decade of conflict, conditio...   
...                                                 ...   
5563  هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...   
5564  هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...   
5565  هذا وي

In [38]:
all_risk_mentions

Unnamed: 0,article_id,date,sentence_text,risk_factor,confidence_score
0,1293,2024-07-04,"IMPACT Initiatives is a humanitarian NGO, base...",humanitarian situation,0.967194
1,1293,2024-07-04,IMPACT was launched at the initiative of ACTED...,international intervention,0.854065
2,1293,2024-07-04,"After more than a decade of conflict, conditio...",years of warfare,0.999366
3,1293,2024-07-04,"After more than a decade of conflict, conditio...",prolonged fighting,0.998584
4,1293,2024-07-04,"After more than a decade of conflict, conditio...",continued deterioration,0.998004
...,...,...,...,...,...
5563,96000,2024-07-19,هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...,سوء الإدارة,0.940626
5564,96000,2024-07-19,هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...,استمرار الصراع,0.909947
5565,96000,2024-07-19,هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...,انقلاب,0.859546
5566,96000,2024-07-19,هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...,أضرار في البنية التحتية,0.854885


In [40]:
# --- 2. Arabic translation dictionary (English -> Arabic) ---
risk_factor_translations_ara = {
    'massive starvation': 'مجاعة هائلة', 'rinderpest': 'طاعون بقري', 'scanty rainfall': 'شح الأمطار', 'dysfunction': 'خلل وظيفي', 'rise': 'ارتفاع', 'mass displacement': 'نزوح جماعي', 'conflict': 'صراع', 'hunger': 'جوع', 'malnutrition': 'سوء تغذية', 'drought': 'جفاف', 'locust': 'جراد', 'insecurity': 'انعدام الأمن', 'violence': 'عنف', 'poverty': 'فقر', 'displacement': 'نزوح', 'disease': 'مرض', 'death': 'موت', 'disaster': 'كارثة', 'crisis': 'أزمة', 'famine': 'مجاعة', 'emergency': 'طوارئ', 'shortage': 'نقص', 'cholera': 'كوليرا', 'malaria': 'ملاريا', 'measles': 'حصبة', 'typhoid': 'تيفوئيد', 'ebola': 'إيبولا', 'hiv': 'فيروس نقص المناعة البشرية', 'aids': 'الإيدز', 'tuberculosis': 'سل', 'diarrhea': 'إسهال', 'undernutrition': 'نقص التغذية', 'food prices': 'أسعار المواد الغذائية', 'inflation': 'تضخم', 'economic collapse': 'انهيار اقتصادي', 'currency devaluation': 'تخفيض قيمة العملة', 'unemployment': 'بطالة', 'corruption': 'فساد', 'sanctions': 'عقوبات', 'blockade': 'حصار', 'looting': 'نهب', 'theft': 'سرقة', 'crime': 'جريمة', 'terrorism': 'إرهاب', 'insurgency': 'تمرد', 'civil war': 'حرب أهلية', 'war': 'حرب', 'bombing': 'قصف', 'airstrike': 'غارة جوية', 'shelling': 'قصف مدفعي', 'gunfire': 'إطلاق نار', 'explosion': 'انفجار', 'massacre': 'مذبحة', 'genocide': 'إبادة جماعية', 'ethnic cleansing': 'تطهير عرقي', 'torture': 'تعذيب', 'rape': 'اغتصاب', 'abduction': 'اختطاف', 'kidnapping': 'خطف', 'hostage': 'رهينة', 'assassination': 'اغتيال', 'coup': 'انقلاب', 'political instability': 'عدم استقرار سياسي', 'protest': 'احتجاج', 'riot': 'شغب', 'curfew': 'حظر تجول', 'state of emergency': 'حالة طوارئ', 'martial law': 'أحكام عرفية', 'election violence': 'عنف انتخابي', 'border closure': 'إغلاق الحدود', 'refugee': 'لاجئ', 'asylum seeker': 'طالب لجوء', 'internally displaced person': 'نازح داخلي', 'migrant': 'مهاجر', 'human trafficking': 'اتجار بالبشر', 'smuggling': 'تهريب', 'flood': 'فيضان', 'hurricane': 'إعصار', 'cyclone': 'إعصار', 'typhoon': 'إعصار', 'earthquake': 'زلزال', 'tsunami': 'تسونامي', 
    'volcano': 'بركان', 'landslide': 'انهيار أرضي', 'avalanche': 'انهيار ثلجي', 'wildfire': 'حرائق غابات', 'heatwave': 'موجة حر', 'cold wave': 'موجة برد', 'hailstorm': 'عاصفة برد', 'tornado': 'إعصار', 'storm': 'عاصفة', 'monsoon': 'موسم الأمطار', 'crop failure': 'فشل المحاصيل', 'harvest failure': 'فشل الحصاد', 'livestock death': 'نفوق الماشية', 'water shortage': 'نقص المياه', 'power outage': 'انقطاع التيار الكهربائي', 'fuel shortage': 'نقص الوقود', 'road closure': 'إغلاق الطرق', 'infrastructure damage': 'أضرار في البنية التحتية', 'hospital closure': 'إغلاق المستشفيات', 'school closure': 'إغلاق المدارس', 'market closure': 'إغلاق الأسواق', 'aid shortage': 'نقص المساعدات', 'aid worker killed': 'مقتل عامل إغاثة', 'aid worker abducted': 'اختطاف عامل إغاثة', 'ngo withdrawal': 'انسحاب المنظمات غير الحكومية', 'un withdrawal': 'انسحاب الأمم المتحدة', 'peacekeeping mission': 'بعثة حفظ السلام', 'ceasefire violation': 'انتهاك وقف إطلاق النار', 'failed state': 'دولة فاشلة', 'anarchy': 'فوضى', 'armed group': 'جماعة مسلحة', 'militia': 'ميليشيا', 'rebel': 'متمرد', 'terrorist group': 'جماعة إرهابية', 'child soldier': 'جندي طفل', 'landmine': 'لغم أرضي', 'chemical weapon': 'سلاح كيماوي', 'biological weapon': 'سلاح بيولوجي', 'nuclear weapon': 'سلاح نووي', 'dirty bomb': 'قنبلة قذرة', 'small arms': 'أسلحة صغيرة', 'heavy weapons': 'أسلحة ثقيلة', 'artillery': 'مدفعية', 'tank': 'دبابة', 'fighter jet': 'طائرة مقاتلة', 'drone': 'طائرة بدون طيار', 'naval blockade': 'حصار بحري', 'piracy': 'قرصنة', 'human rights violation': 'انتهاك حقوق الإنسان', 'freedom of speech': 'حرية التعبير', 'freedom of press': 'حرية الصحافة', 'freedom of assembly': 'حرية التجمع', 'freedom of religion': 'حرية الدين', 'ethnic discrimination': 'تمييز عرقي', 'religious discrimination': 'تمييز ديني', 'gender discrimination': 'تمييز بين الجنسين', 'child labor': 'عمالة الأطفال', 'forced labor': 'العمل القسري', 'slavery': 'عبودية', 'debt bondage': 'عبودية الدين', 'land seizure': 'الاستيلاء على الأراضي',
    'forced eviction': 'إخلاء قسري', 'land grab': 'الاستيلاء على الأراضي', 'brutal government': 'حكومة وحشية', 'bombing campaign': 'حملة قصف', 'transport bottleneck': 'عنق زجاجة في النقل', 'weather extremes': 'ظواهر جوية متطرفة', 'price rise': 'ارتفاع الأسعار', 'cattle plague': 'طاعون الماشية', 'mismanagement': 'سوء الإدارة', 'harvest decline': 'انخفاض المحصول', 'forests destroyed': 'تدمير الغابات', 'jihadist groups': 'جماعات جهادية', 'migration': 'هجرة', 'economic impoverishment': 'إفقار اقتصادي', 'continued strife': 'استمرار الصراع', 'ecological crisis': 'أزمة بيئية', 'slave trade': 'تجارة الرقيق', 'lack of agricultural infrastructure': 'نقص البنية التحتية الزراعية', 'stolen food aid': 'سرقة المساعدات الغذائية', 'gangs of bandits': 'عصابات قطاع الطرق', 'gastrointestinal': 'معدي معوي', 'hunger crises': 'أزمات الجوع', 'pests': 'آفات', 'clan battle': 'معركة عشائرية', 'regimes were toppled': 'إسقاط الأنظمة'
}

# --- 3. Create the Arabic-to-English mapping (reverse the dictionary) ---
print("Creating the reverse mapping (Arabic to English)...")
# Note: Reversing the dictionary swaps keys and values. Arabic terms shared by
# several English labels (e.g. 'إعصار') keep only the last English key.
risk_factor_translations_en = {v: k for k, v in risk_factor_translations_ara.items()}


# --- 4. Translate the 'risk_factor' column back to English ---
print("Translating risk factors back to English...")

# Create a new column 'risk_factor_english' by mapping the Arabic terms
# The .get method is used to avoid errors if a term is not in the dictionary
all_risk_mentions['risk_factor_english'] = all_risk_mentions['risk_factor'].apply(lambda x: risk_factor_translations_en.get(x, x))


# --- 5. Display the result ---
print("\n--- Back-Translation Complete ---")
print("Sample of the updated DataFrame with English translations:")
print(all_risk_mentions)

Creating the reverse mapping (Arabic to English)...
Translating risk factors back to English...

--- Back-Translation Complete ---
Sample of the updated DataFrame with English translations:
      article_id        date  \
0           1293  2024-07-04   
1           1293  2024-07-04   
2           1293  2024-07-04   
3           1293  2024-07-04   
4           1293  2024-07-04   
...          ...         ...   
5563       96000  2024-07-19   
5564       96000  2024-07-19   
5565       96000  2024-07-19   
5566       96000  2024-07-19   
5567       96000  2024-07-19   

                                          sentence_text  \
0     IMPACT Initiatives is a humanitarian NGO, base...   
1     IMPACT was launched at the initiative of ACTED...   
2     After more than a decade of conflict, conditio...   
3     After more than a decade of conflict, conditio...   
4     After more than a decade of conflict, conditio...   
...                                                 ...   
5563  هذا وي

In [41]:
all_risk_mentions

Unnamed: 0,article_id,date,sentence_text,risk_factor,confidence_score,risk_factor_english
0,1293,2024-07-04,"IMPACT Initiatives is a humanitarian NGO, base...",humanitarian situation,0.967194,humanitarian situation
1,1293,2024-07-04,IMPACT was launched at the initiative of ACTED...,international intervention,0.854065,international intervention
2,1293,2024-07-04,"After more than a decade of conflict, conditio...",years of warfare,0.999366,years of warfare
3,1293,2024-07-04,"After more than a decade of conflict, conditio...",prolonged fighting,0.998584,prolonged fighting
4,1293,2024-07-04,"After more than a decade of conflict, conditio...",continued deterioration,0.998004,continued deterioration
...,...,...,...,...,...,...
5563,96000,2024-07-19,هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...,سوء الإدارة,0.940626,mismanagement
5564,96000,2024-07-19,هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...,استمرار الصراع,0.909947,continued strife
5565,96000,2024-07-19,هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...,انقلاب,0.859546,coup
5566,96000,2024-07-19,هذا وينتج العراق نحو 26 ألف ميغاواط من الكهربا...,أضرار في البنية التحتية,0.854885,infrastructure damage
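
Reversing a translation dictionary is lossless only when every Arabic term is unique. In the dictionary above, several English labels share one Arabic translation (for example `'إعصار'`), so the reversed mapping silently keeps only the last of them. The sketch below reproduces this with a few entries copied from that dictionary; the `grouped` and `ambiguous` names are illustrative.

```python
from collections import defaultdict

# A few entries copied from the translation dictionary above.
translations = {
    "hurricane": "إعصار",
    "cyclone": "إعصار",
    "typhoon": "إعصار",
    "tornado": "إعصار",
    "flood": "فيضان",
}

# Naive reversal: later keys overwrite earlier ones with the same value.
reversed_naive = {ara: eng for eng, ara in translations.items()}
print(reversed_naive["إعصار"])

# Grouping instead of overwriting makes the collisions visible.
grouped = defaultdict(list)
for eng, ara in translations.items():
    grouped[ara].append(eng)

ambiguous = {ara: engs for ara, engs in grouped.items() if len(engs) > 1}
print(ambiguous)
```

Checking `ambiguous` before back-translating shows which Arabic labels cannot be mapped to a single English risk factor.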


## Save the result

In [42]:
# --- Save the final DataFrame ---
OUTPUT_DIR = os.path.join(DATA_DIR, '03_models')
os.makedirs(OUTPUT_DIR, exist_ok=True)

output_path = os.path.join(OUTPUT_DIR, 'risk_mentions.csv')
all_risk_mentions.to_csv(output_path, index=False)

print(f"\nSuccessfully saved {len(all_risk_mentions)} risk mentions to: {output_path}")


Successfully saved 5568 risk mentions to: ../data/03_models/risk_mentions.csv


In [1]:
# --- 1. Load Processed Data ---
import pandas as pd
import os
from transformers import pipeline
import torch
from sentence_transformers import SentenceTransformer, util
from tqdm.auto import tqdm

print("--- Step 1: Loading Data ---")
DATA_DIR = '../data'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')

df_eng = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed_filtered.pkl'))
df_ara = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed_filtered.pkl'))
print("Filtered data loaded successfully.")
print(f"  - English: {len(df_eng):,} articles")
print(f"  - Arabic:  {len(df_ara):,} articles")
print("-" * 30, "\n")


# --- 2. Load Risk Factors ---
print("--- Step 2: Loading Risk Factors ---")
# --- English Risk Factors ---
risk_factors_path = os.path.join(DATA_DIR, '01_raw/risk-factors.xlsx')
df_risk_factors = pd.read_excel(risk_factors_path)
risk_factor_labels = df_risk_factors['risk_factor_english'].tolist()
print(f"{len(risk_factor_labels)} English risk factors loaded.")

# --- Arabic Risk Factors ---
risk_factors_ara_path = os.path.join(DATA_DIR, '01_raw/risk-factors-translated.xlsx')
df_risk_factors_ara = pd.read_excel(risk_factors_ara_path)
risk_factor_ara_labels = df_risk_factors_ara['risk_factor_arabic'].tolist()
print(f"{len(risk_factor_ara_labels)} Arabic risk factors loaded.")
print("-" * 30, "\n")

--- Step 1: Loading Data ---
Filtered data loaded successfully.
  - English: 25,852 articles
  - Arabic:  52,822 articles
------------------------------ 

--- Step 2: Loading Risk Factors ---
167 English risk factors loaded.
167 Arabic risk factors loaded.
------------------------------ 



In [6]:
# --- 1. Load Processed Data ---
import pandas as pd
import os
from transformers import pipeline
import torch
from sentence_transformers import SentenceTransformer, util
from tqdm.auto import tqdm

print("--- Step 1: Loading Data ---")
DATA_DIR = '../data'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')

df_eng = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed_filtered.pkl'))
df_ara = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed_filtered.pkl'))
print("Filtered data loaded successfully.")
print(f"  - English: {len(df_eng):,} articles")
print(f"  - Arabic:  {len(df_ara):,} articles")
print("-" * 30, "\n")


# --- 2. Load Risk Factors ---
print("--- Step 2: Loading Risk Factors ---")
# We only need the complete English list for classification now.
risk_factors_path = os.path.join(DATA_DIR, '01_raw/risk-factors.xlsx')
df_risk_factors_eng = pd.read_excel(risk_factors_path)
df_risk_factors_eng.dropna(subset=['risk_factor_english'], inplace=True)
risk_factor_labels = df_risk_factors_eng['risk_factor_english'].tolist()
print(f"Loaded {len(risk_factor_labels)} English risk factors to be used for BOTH languages.")
print("-" * 30, "\n")


# --- 3. Initialize the Models ---
print("--- Step 3: Initializing Models ---")
device = 0 if torch.cuda.is_available() else -1
if device == 0:
    print("GPU found. Models will run on the GPU for maximum speed.")
else:
    print("No GPU found. Models will run on the CPU. This may be slow.")

MODEL_NAME = 'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli'
classifier = pipeline("zero-shot-classification", model=MODEL_NAME, device=device)
print(f"Main classifier initialized: {MODEL_NAME}")

FAST_MODEL_NAME = 'paraphrase-multilingual-MiniLM-L12-v2'
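# SentenceTransformer expects a device string, not the pipeline's 0 / -1 index.
fast_embedder = SentenceTransformer(FAST_MODEL_NAME, device='cuda' if device == 0 else 'cpu')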
print(f"Fast pre-filtering model initialized: {FAST_MODEL_NAME}")
print("-" * 30, "\n")


# --- 4. Run Classification with Pre-filtering ---
print("--- Step 4: Running Optimized Classification ---")
print("Encoding the 167 English risk factors with the fast model...")
risk_factor_embeddings = fast_embedder.encode(risk_factor_labels, convert_to_tensor=True)
print("Risk factor encoding complete.")

def extract_risk_factors_optimized(
    df, classifier, labels, threshold=0.80, batch_size=32,
    sentence_embedder=None, risk_factor_embeddings=None, sentence_similarity_threshold=0.35
):
    if 'article_id' not in df.columns:
        df['article_id'] = df.index

    df_sentences = df.explode('sentences').rename(columns={'sentences': 'sentence_text'})
    df_sentences = df_sentences[['article_id', 'date', 'sentence_text']].dropna(subset=['sentence_text'])
    all_sentences = df_sentences['sentence_text'].tolist()

    if not all_sentences:
        print("No sentences found in the input articles.")
        return pd.DataFrame()

    print(f"\nOriginal sentence count: {len(all_sentences):,}")
    print("Pre-filtering sentences using English risk factors...")
    sentence_embeddings = sentence_embedder.encode(all_sentences, convert_to_tensor=True, show_progress_bar=True)
    hits = util.semantic_search(sentence_embeddings, risk_factor_embeddings, top_k=1)

    relevant_indices = [i for i, hit_list in enumerate(hits) if hit_list and hit_list[0]['score'] >= sentence_similarity_threshold]
    filtered_sentences_df = df_sentences.iloc[relevant_indices]
    sentence_list = filtered_sentences_df['sentence_text'].tolist()

    if not sentence_list:
        print("No sentences passed the pre-filtering stage.")
        return pd.DataFrame()

    print(f"Reduced to {len(sentence_list):,} sentences after filtering (threshold: {sentence_similarity_threshold}).")
    print(f"Running the powerful classifier on this smaller set using {len(labels)} English labels...")

    results_list = []
    # Wrap the classifier output in tqdm to show a progress bar
    for i, result in tqdm(enumerate(classifier(sentence_list, labels, multi_label=True, batch_size=batch_size)), total=len(sentence_list)):
        for label, score in zip(result['labels'], result['scores']):
            if score >= threshold:
                original_row = filtered_sentences_df.iloc[i]
                results_list.append({
                    'article_id': original_row['article_id'],
                    'date': original_row['date'],
                    'sentence_text': result['sequence'],
                    'risk_factor': label,
                    'confidence_score': score
                })
    return pd.DataFrame(results_list)

# --- Set parameters ---
CLASSIFIER_BATCH_SIZE = 128
SENTENCE_SIMILARITY_THRESHOLD = 0.35

# --- Create samples to test the pipeline ---
print("\n--- RUNNING ON A SAMPLE OF 20 ARTICLES PER LANGUAGE ---")
df_eng_sample = df_eng.head(20).copy()
df_ara_sample = df_ara.head(20).copy()

# --- Run the process on the SAMPLES ---
print("\nStarting risk factor extraction for English articles (SAMPLE)...")
eng_risk_mentions = extract_risk_factors_optimized(
    df_eng_sample, classifier, risk_factor_labels, batch_size=CLASSIFIER_BATCH_SIZE,
    sentence_embedder=fast_embedder, risk_factor_embeddings=risk_factor_embeddings,
    sentence_similarity_threshold=SENTENCE_SIMILARITY_THRESHOLD
)

print("\nStarting risk factor extraction for Arabic articles (SAMPLE)...")
ara_risk_mentions = extract_risk_factors_optimized(
    df_ara_sample, classifier, risk_factor_labels, batch_size=CLASSIFIER_BATCH_SIZE,
    sentence_embedder=fast_embedder, risk_factor_embeddings=risk_factor_embeddings,
    sentence_similarity_threshold=SENTENCE_SIMILARITY_THRESHOLD
)

all_risk_mentions = pd.concat([eng_risk_mentions, ara_risk_mentions], ignore_index=True)
print("\nRisk factor extraction complete for the sample.")
print("-" * 30, "\n")


# --- 5. Unify Results ---
print("--- Step 5: Unifying Risk Factors ---")
all_risk_mentions['risk_factor_english'] = all_risk_mentions['risk_factor']
print("All risk factors are now unified under the 'risk_factor_english' column.")

# Create the translation dictionary for any post-analysis if needed
risk_factors_ara_path = os.path.join(DATA_DIR, '01_raw/risk-factors-translated.xlsx')
df_risk_factors_ara = pd.read_excel(risk_factors_ara_path)
df_risk_factors_combined = pd.concat([df_risk_factors_eng, df_risk_factors_ara], axis=1)
df_risk_factors_combined.dropna(subset=['risk_factor_english', 'risk_factor_arabic'], inplace=True)
risk_factor_translations_en = dict(zip(df_risk_factors_combined['risk_factor_arabic'], df_risk_factors_combined['risk_factor_english']))
print("-" * 30, "\n")


# --- 6. Save the Results ---
print("--- Step 6: Saving Results ---")
OUTPUT_DIR = os.path.join(DATA_DIR, '03_models')
os.makedirs(OUTPUT_DIR, exist_ok=True)

output_path = os.path.join(OUTPUT_DIR, 'risk_mentions_SAMPLE.csv')
all_risk_mentions.to_csv(output_path, index=False)
print(f"Successfully saved {len(all_risk_mentions):,} risk mentions from the sample to: {output_path}")
print("-" * 30, "\n")

print("--- Final Extracted Risk Factors (from Sample) ---")
if all_risk_mentions.empty:
    print("No risk factors found in the sample with the current settings.")
else:
    display(all_risk_mentions)

--- Step 1: Loading Data ---


Filtered data loaded successfully.
  - English: 25,852 articles
  - Arabic:  52,822 articles
------------------------------ 

--- Step 2: Loading Risk Factors ---
Loaded 167 English risk factors to be used for BOTH languages.
------------------------------ 

--- Step 3: Initializing Models ---
GPU found. Models will run on the GPU for maximum speed.


Device set to use cuda:0


Main classifier initialized: MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
Fast pre-filtering model initialized: paraphrase-multilingual-MiniLM-L12-v2
------------------------------ 

--- Step 4: Running Optimized Classification ---
Encoding the 167 English risk factors with the fast model...
Risk factor encoding complete.

--- RUNNING ON A SAMPLE OF 20 ARTICLES PER LANGUAGE ---

Starting risk factor extraction for English articles (SAMPLE)...

Original sentence count: 2,053
Pre-filtering sentences using English risk factors...


Batches:   0%|          | 0/65 [00:00<?, ?it/s]

Reduced to 1,570 sentences after filtering (threshold: 0.35).
Running the powerful classifier on this smaller set using 167 English labels...


KeyboardInterrupt: 
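
The pre-filtering step in the cell above keeps a sentence only if its best cosine similarity to any risk-factor embedding reaches the threshold (0.35). In the sample run this retained 1,570 of 2,053 English sentences, roughly 76%. Below is a minimal sketch of that rule with toy 2-D vectors standing in for real embeddings; `cosine` and `prefilter` are hypothetical helpers, not the `sentence_transformers` API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prefilter(sentence_vecs, label_vecs, threshold):
    """Return indices of sentences whose top-1 label similarity >= threshold."""
    kept = []
    for i, s_vec in enumerate(sentence_vecs):
        best = max(cosine(s_vec, l_vec) for l_vec in label_vecs)
        if best >= threshold:
            kept.append(i)
    return kept


# Toy vectors (assumed values, for illustration only).
label_vecs = [(1.0, 0.0), (0.0, 1.0)]
sentence_vecs = [(0.9, 0.1), (-1.0, -1.0), (0.2, 0.8)]

kept = prefilter(sentence_vecs, label_vecs, threshold=0.35)
print(kept)

# Remaining classifier workload in the sample run: sentences x labels.
remaining_pairs = 1570 * 167
print(remaining_pairs)
```

Even after filtering, 1,570 sentences against 167 labels still means 262,190 sentence-label pairs for the classifier, which is consistent with the run above being slow enough to interrupt.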