# Model selection
This notebook performs the core task of the project: identifying risk factors in news articles using a multilingual, zero-shot classification model. We will use the facebook/xlm-roberta-base model from the Hugging Face library to classify individual sentences against a predefined list of 167 risk factors.

The main steps are:

1.  **Load Processed Data**: Load the sentence-tokenized dataframes for both English and Arabic news articles.
2.  **Load Risk Factors**: Load the list of 167 risk factors that will be used as candidate labels for the classification model.
3.  **Initialize the Model**: Set up the zero-shot classification pipeline from the `transformers` library.
4.  **Run Classification**: Process each sentence from every article, classify it against the risk factors, and store the results that meet a confidence threshold of 0.80.
5.  **Save the Results**: Save the final DataFrame containing the identified risk factors.

## Load Processed Data

In [3]:
import pandas as pd
import os
from transformers import pipeline
import torch

# --- Load the processed datasets ---
DATA_DIR = '../data'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')

df_eng = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed_filtered.pkl'))
df_ara = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed_filtered.pkl'))

print("Processed data loaded successfully.")

Processed data loaded successfully.


## Load Risk Factors

In [4]:
# --- Load risk factors ---
risk_factors_path = os.path.join(DATA_DIR, '01_raw/risk-factors.xlsx')
df_risk_factors = pd.read_excel(risk_factors_path)

# --- CORRECTED LINE ---
# The column name is 'risk_factor_english', not 'Risk Factor'.
risk_factor_labels = df_risk_factors['risk_factor_english'].tolist()

print(f"{len(risk_factor_labels)} risk factors loaded.")
print("Sample risk factors:", risk_factor_labels[:5])

167 risk factors loaded.
Sample risk factors: ['massive starvation', 'rinderpest', 'scanty rainfall', 'dysfunction', 'rise']


## Initialize the Model

In [13]:
# --- Check for GPU ---
device = 0 if torch.cuda.is_available() else -1
if device == 0:
    print("GPU found. The model will run on the GPU.")
else:
    print("No GPU found. The model will run on the CPU. This may be slow.")

# --- Initialize the pipeline with the SOTA model ---

# This model is a newer, more powerful replacement for XLM-RoBERTa
# for zero-shot classification tasks.
MODEL_NAME = 'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli' 

classifier = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME,
    device=device # Use your GPU
)

print(f"Zero-shot classification pipeline initialized with SOTA model: {MODEL_NAME}")

GPU found. The model will run on the GPU.


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spm.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

Device set to use cuda:0


Zero-shot classification pipeline initialized with SOTA model: MoritzLaurer/mDeBERTa-v3-base-mnli-xnli


## Run Classification

This is the main processing step. To maximize efficiency, we will use a batch processing approach instead of iterating through each article one by one. This involves:

Restructuring the Data: We will combine all sentences from all articles into a single, large list.

Batch Inference: This entire list is fed directly to the transformers pipeline, which automatically groups the sentences into optimal batches to keep the GPU fully utilized.

This method is significantly faster than a traditional loop as it minimizes CPU-GPU communication overhead. We will run the process on a small sample of 5 articles from each language to verify the pipeline and see the output structure.

Note: Even with this optimization, processing the full dataset of ~172,000 articles is a computationally intensive task that will still take a considerable amount of time. Running this initial small sample is crucial for ensuring the code works correctly before launching the full analysis.

In [14]:
import pandas as pd
from tqdm.auto import tqdm # A library to create smart progress bars

def extract_risk_factors_fast(df, classifier, labels, threshold=0.80, batch_size=32):
    """
    Extracts risk factors from a DataFrame by processing all sentences in batches.
    """
    # 1. Restructure the data for batch processing
    if 'article_id' not in df.columns:
        df['article_id'] = df.index
        
    df_sentences = df.explode('sentences').rename(columns={'sentences': 'sentence_text'})
    df_sentences = df_sentences[['article_id', 'date', 'sentence_text']].dropna(subset=['sentence_text'])
    
    sentence_list = df_sentences['sentence_text'].tolist()
    
    if not sentence_list:
        print("No sentences to process for this sample.")
        return pd.DataFrame()
        
    print(f"Processing {len(sentence_list):,} sentences in batches of {batch_size}...")

    results_list = []
    # 2. Process all sentences in one go with a progress bar
    for i, result in tqdm(enumerate(classifier(sentence_list, labels, multi_label=True, batch_size=batch_size)), total=len(sentence_list)):
        # 3. Filter results and store them
        for label, score in zip(result['labels'], result['scores']):
            if score >= threshold:
                original_row = df_sentences.iloc[i]
                results_list.append({
                    'article_id': original_row['article_id'],
                    'date': original_row['date'],
                    'sentence_text': result['sequence'],
                    'risk_factor': label,
                    'confidence_score': score
                })

    return pd.DataFrame(results_list)

# --- Set the sample size to 5 articles each ---
# Commment this out to process the full dataset !!!
df_eng = df_eng.head(10)
df_ara = df_ara.head(10)

BATCH_SIZE = 128 # You can adjust this based on your GPU memory

# --- Run the FAST extraction process ---
print("Starting risk factor extraction for English articles...")
eng_risk_mentions = extract_risk_factors_fast(df_eng, classifier, risk_factor_labels, batch_size=BATCH_SIZE)

print("\nStarting risk factor extraction for Arabic articles...")
ara_risk_mentions = extract_risk_factors_fast(df_ara, classifier, risk_factor_labels, batch_size=BATCH_SIZE)

print("\nRisk factor extraction complete.")

# --- Combine results from both languages ---
all_risk_mentions = pd.concat([eng_risk_mentions, ara_risk_mentions], ignore_index=True)

print("\n--- Final Extracted Risk Factors ---")
# Display the full results from this small sample
if all_risk_mentions.empty:
    print("No risk factors found in the sample articles with the current threshold.")
else:
    print(all_risk_mentions)

Starting risk factor extraction for English articles...
Processing 1,294 sentences in batches of 128...


  0%|          | 0/1294 [00:00<?, ?it/s]


Starting risk factor extraction for Arabic articles...
Processing 938 sentences in batches of 128...


  0%|          | 0/938 [00:00<?, ?it/s]


Risk factor extraction complete.

--- Final Extracted Risk Factors ---
       article_id        date  \
0               0  2024-07-09   
1               0  2024-07-09   
2               0  2024-07-09   
3               0  2024-07-09   
4               0  2024-07-09   
...           ...         ...   
29514           9  2024-07-22   
29515           9  2024-07-22   
29516           9  2024-07-22   
29517           9  2024-07-22   
29518           9  2024-07-22   

                                           sentence_text          risk_factor  \
0      Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ib...             conflict   
1      Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ib...               mayhem   
2      Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ib...                siege   
3      Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ib...  international alarm   
4      Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ib...      major offensive   
...                                          

In [None]:
all_risk_mentions

## Save the result

In [None]:
# --- Save the final DataFrame ---
OUTPUT_DIR = os.path.join(DATA_DIR, '03_models')
os.makedirs(OUTPUT_DIR, exist_ok=True)

output_path = os.path.join(OUTPUT_DIR, 'risk_mentions.csv')
all_risk_mentions.to_csv(output_path, index=False)

print(f"\nSuccessfully saved {len(all_risk_mentions)} risk mentions to: {output_path}")