# [Take-home Assessment] Food Crisis Early Warning 

Welcome to the assessment. You will showcase your modeling and research skills by investigating news articles (in English and Arabic) as well as a set of food insecurity risk factors. 

We suggest planning to spend **~6–8 hours** on this assessment. **Please submit your response by Monday, September 15th, 9:00 AM EST via email to dime_ai@worldbank.org**. Please document your code with comments and explanations of design choices. There is one question on reflecting upon your approach at the end of this notebook.

**Name:** Adam Przychodni

**Email:** adam.przychodni@gmail.com

---

# Part 1: Technical Assessment


## Task:

We invite you to approach the challenge of understanding (and potentially predicting) food insecurity using the provided (limited) data. Your response should demonstrate how you tackle open-ended problems in data-scarce environments.

Some example questions to consider:
- What is the added value of geospatial data?
- How can we address the lack of ground-truth information on food insecurity levels?
- What are the benefits and challenges of working with multilingual data?
- ...

These are just guiding examples — you are free to explore any relevant angles to this topic/data.

**Note:** There is no single "right" approach. Instead, we want to understand how you approach and structure open-ended problems in data-scarce environments. Given the large number of applicants, we will preselect the most impressive and complete submissions. Please take care in structuring your response, as selection will depend on its depth and originality.


## Provided Data:

1. **Risk Factors:** A file containing 167 risk factors (unigrams, bigrams, and trigrams) in the `english_keywords` column and an empty `keywords_arabic` column. A separate file maps the English risk factors to pre-defined thematic cluster assignments.


2. **News Articles:** Two files containing one month of news articles from the Mashriq region:
   - `news-articles-eng.csv`
   - `news-articles-ara.csv`
   - **Note:** You may work on a sample subset during development.
   
   
3. **Geographic Taxonomy:** Two files containing the names of the countries, provinces, and districts for the subset of Mashriq countries covered by the news articles. Each file is a dictionary mapping a key to the geographic name.
   - `id_arabic_location_name.pkl`
   - `id_english_location_name.pkl`
   - **Note:** Each unique country/province/district is assigned a key (e.g. `iq`,`iq_bg` and `iq_bg_1` for country Iraq, province Baghdad, and district 1 in Baghdad respectively).
   - The key of each country is a two-character abbreviation, as follows:
       - 'iq': 'Iraq'
       - 'jo': 'Jordan'
       - 'lb': 'Lebanon'
       - 'ps': 'Palestine'
       - 'sy': 'Syria'
       
   - The key of a province is the two-character abbreviation of the country followed by a two-character abbreviation of the province, **`{country_abbreviation}_{province_abbreviation}`**, and the key of a district is **`{country_abbreviation}_{province_abbreviation}_{unique_number}`**.
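
The key scheme above can be unpacked with a small helper. The following is a minimal sketch, not part of the provided materials: `LocationKey`, `parse_location_key`, and `parent_key` are hypothetical names, and the code assumes that abbreviations never contain an underscore themselves.

```python
from typing import NamedTuple, Optional


class LocationKey(NamedTuple):
    """Components of a taxonomy key such as 'iq', 'iq_bg', or 'iq_bg_1'."""
    country: str
    province: Optional[str]
    district: Optional[str]

    @property
    def level(self) -> str:
        # The most specific component that is present determines the level.
        if self.district is not None:
            return "district"
        if self.province is not None:
            return "province"
        return "country"


def parse_location_key(key: str) -> LocationKey:
    """Split a key on underscores (assumes abbreviations contain none)."""
    parts = key.split("_")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"Unexpected location key: {key!r}")
    country = parts[0]
    province = parts[1] if len(parts) >= 2 else None
    district = parts[2] if len(parts) == 3 else None
    return LocationKey(country, province, district)


def parent_key(key: str) -> Optional[str]:
    """Return the key one level up (district -> province -> country), or None."""
    head, sep, _ = key.rpartition("_")
    return head if sep else None
```

For example, `parse_location_key("iq_bg_1").level` evaluates to `"district"` and `parent_key("iq_bg_1")` to `"iq_bg"`, which makes it straightforward to roll district-level signals up to provinces or countries.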
       


## Submission Guidelines:

- **Code:** Follow best coding practices and ensure clear documentation. All notebook cells should be executed with outputs saved, and the notebook should run correctly on its own. Name your file **`solution_{FIRSTNAME}_{LASTNAME}.ipynb`**. If your solution relies on additional open-access data, either include it in your submission (e.g., as part of a ZIP file) or provide clear data-loading code/instructions as part of the notebook.
- **Report:** Submit a separate markdown file communicating your approach to this research problem. We expect you to detail the models, methods, or (additional) data you are using.

Good luck!


---

## Your Submission

In [1]:
import pandas as pd
import re
import os

# --- Load the datasets ---
DATA_DIR = '../data'
df_eng = pd.read_csv(os.path.join(DATA_DIR, '01_raw/news-articles-eng.csv'))
df_ara = pd.read_csv(os.path.join(DATA_DIR, '01_raw/news-articles-ara.csv'))

print("Data loaded successfully.")

def clean_text(text):
    """
    Cleans raw text by removing HTML tags, URLs, and normalizing whitespace.
    """
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print("Cleaning news articles...")
df_eng['body_cleaned'] = df_eng['body'].apply(clean_text)
df_ara['body_cleaned'] = df_ara['body'].apply(clean_text)
print("Cleaning complete.")

def regex_sent_tokenize(text):
    """
    Splits text into sentences using a regular expression.
    This is a dependency-free alternative to nltk.sent_tokenize.
    """
    if not isinstance(text, str) or not text:
        return []
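    # Split on whitespace that follows sentence-ending punctuation (., !, ?, or ۔).
    # The positive lookbehind keeps the punctuation attached to the sentence.
    # Note: the Arabic question mark (؟) is not in the character class, so it does
    # not end a sentence here.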
    sentences = re.split(r'(?<=[.!?۔])\s+', text)
    # Filter out any empty strings that might result from the split
    return [s for s in sentences if s]

# --- Apply the new, reliable tokenizer ---
print("Tokenizing English articles into sentences...")
df_eng['sentences'] = df_eng['body_cleaned'].apply(regex_sent_tokenize)

print("Tokenizing Arabic articles into sentences...")
df_ara['sentences'] = df_ara['body_cleaned'].apply(regex_sent_tokenize)

print("\nSentence tokenization complete.")

# --- Display a sample ---
print("\n--- Example of Regex Sentence Tokenization ---")
print(f"Article body has been split into {len(df_eng['sentences'].iloc[0])} sentences.")
print("First 3 sentences:")
for sentence in df_eng['sentences'].iloc[0][:3]:
    print(f"- {sentence}")

PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')
output_path_eng = os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed.pkl')
output_path_ara = os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed.pkl')

os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)

# Save the cleaned, sentence-tokenized dataframes (no rows are dropped at this stage)
df_eng.to_pickle(output_path_eng)
df_ara.to_pickle(output_path_ara)

print(f"\nProcessed and FILTERED English data saved to: {output_path_eng}")
print(f"Processed and FILTERED Arabic data saved to: {output_path_ara}")


Data loaded successfully.
Cleaning news articles...
Cleaning complete.
Tokenizing English articles into sentences...
Tokenizing Arabic articles into sentences...

Sentence tokenization complete.

--- Example of Regex Sentence Tokenization ---
Article body has been split into 124 sentences.
First 3 sentences:
- Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ibrahim On October 7, 2023, the world and the Middle East awoke to the drums of war beating in the Gaza Strip.
- Over time, it turned into a reality that American efforts, Qatari and Egyptian mediation, condemnations, statements, summits, and conferences could not stop.
- While Israel continues its war in the besieged Gaza Strip, attention is turning towards the potential outbreak of another war.

Processed and FILTERED English data saved to: ../data/02_processed/news_eng_processed.pkl
Processed and FILTERED Arabic data saved to: ../data/02_processed/news_ara_processed.pkl
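
The splitter above produces some fragments with no letters at all, and it does not treat the Arabic question mark as a sentence boundary. The following is a minimal sketch of one possible refinement, not the tokenizer used for the saved outputs: `split_sentences` and its `min_letters` parameter are hypothetical, and the threshold of three letters is an arbitrary assumption.

```python
import re

# Latin terminators plus the Arabic question mark (U+061F) and the
# full stop U+06D4 that the original pattern already includes.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?\u061F\u06D4])\s+')


def split_sentences(text, min_letters=3):
    """Split text into sentences and drop fragments with too few letters.

    min_letters is an assumed threshold; fragments such as '!!!' or '"'
    contain no alphabetic characters and are discarded.
    """
    if not isinstance(text, str) or not text:
        return []
    kept = []
    for sentence in SENTENCE_BOUNDARY.split(text):
        sentence = sentence.strip()
        # str.isalpha() is Unicode-aware, so Arabic letters count as letters.
        n_letters = sum(ch.isalpha() for ch in sentence)
        if n_letters >= min_letters:
            kept.append(sentence)
    return kept
```

Applied to `"Prices rose. !!! Why? Aid stopped!"`, this keeps the three real sentences and discards the `!!!` fragment, so punctuation-only strings never reach the classifier.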


In [2]:
import pandas as pd
import os
from transformers import pipeline
import torch
import pickle
from tqdm.auto import tqdm

# --- 1. Define Paths and Check if Output Already Exists ---
print("--- Step 1: Initializing and Checking for Existing File ---")
DATA_DIR = '../data'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')
output_path = os.path.join(PROCESSED_DATA_DIR, 'news_geographically_filtered.pkl')

if os.path.exists(output_path):
    print(f"Output file already exists: {output_path}")
    print("Skipping the filtering process.")
else:
    print("Output file not found. Starting the filtering process...")
    # --- 2. Load All Necessary Data ---
    print("--- Step 2: Loading All Necessary Data ---")
    df_eng = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed.pkl'))
    df_ara = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed.pkl'))
    df_articles = pd.concat([df_eng, df_ara], ignore_index=True)

    with open('../data/01_raw/id_english_location_name.pkl', 'rb') as f:
        eng_locations = pickle.load(f)
    with open('../data/01_raw/id_arabic_location_name.pkl', 'rb') as f:
        ara_locations = pickle.load(f)
    print(f"Loaded {len(df_articles):,} total articles to filter.")
    print("-" * 30, "\n")


    # --- 3. Create a Fast Location Lookup ---
    print("--- Step 3: Building Location Resolver ---")
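    def create_lookup(location_dict):
        """Map every lowercased location name (alias) to its taxonomy key."""
        lookup = {}
        for loc_id, names in location_dict.items():
            # Accept either a single name or a list of aliases per key;
            # iterating over a bare string would yield single characters.
            if isinstance(names, str):
                names = [names]
            for name in names:
                lookup[name.lower()] = loc_id
        return lookup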

    location_lookup = create_lookup(eng_locations)
    location_lookup.update(create_lookup(ara_locations))
    print(f"Created a lookup with {len(location_lookup):,} unique location aliases.")
    print("-" * 30, "\n")


    # --- 4. Initialize NER Pipeline ---
    print("--- Step 4: Loading NER Model ---")
    device = 0 if torch.cuda.is_available() else -1
    if device == 0: print("GPU found. Filtering will be fast.")
    else: print("No GPU found.")

    ner_pipeline = pipeline("ner", model="Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple", device=device)
    print("NER pipeline loaded.")
    print("-" * 30, "\n")


    # --- 5. Define and Run the Filtering Process ---
    print("--- Step 5: Filtering All Articles ---")
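    def contains_target_location(text, ner, lookup):
        """
        Checks if a body of text contains at least one of the target locations.
        Only the first 2,000 characters are scanned to keep the NER pass fast.
        """
        # Missing bodies are read by pandas as NaN (a float), which cannot be sliced.
        if not isinstance(text, str) or not text:
            return False
        sample_text = text[:2000]
        entities = ner(sample_text)
        for entity in entities:
            if entity['entity_group'] == 'LOC':
                if entity['word'].lower() in lookup:
                    return True
        return False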
    
    tqdm.pandas(desc="Filtering Articles")
    relevance_mask = df_articles['body'].progress_apply(
        lambda text: contains_target_location(text, ner_pipeline, location_lookup)
    )
    df_filtered = df_articles[relevance_mask]

    print("\nFiltering complete.")
    print(f"  - Original number of articles: {len(df_articles):,}")
    print(f"  - Geographically relevant articles: {len(df_filtered):,}")
    print("-" * 30, "\n")


    # --- 6. Save the Filtered Data ---
    print("--- Step 6: Saving the Filtered Dataset ---")
    df_filtered.to_pickle(output_path)
    print(f"Successfully saved the filtered data to: {output_path}")

--- Step 1: Initializing and Checking for Existing File ---
Output file already exists: ../data/02_processed/news_geographically_filtered.pkl
Skipping the filtering process.
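
The location lookup matches names exactly after lowercasing, so Arabic spelling variants (diacritics, alef forms, presentation-form glyphs) can miss. The following is a minimal sketch of a normalization step that could be applied both to the taxonomy names and to the NER output before matching. It is not what the cell above does: `normalize_name` and `build_lookup` are hypothetical helpers, and the chosen normalizations are assumptions rather than a complete treatment of Arabic orthography.

```python
import re
import unicodedata

# Short vowels and related marks, dagger alef, and tatweel (kashida).
_DIACRITICS = re.compile(r'[\u064B-\u0652\u0670\u0640]')
# Alef with madda, hamza above, and hamza below.
_ALEF_VARIANTS = re.compile(r'[\u0622\u0623\u0625]')


def normalize_name(name):
    """Normalize a location name for dictionary matching."""
    # NFKC folds Arabic presentation forms into their base letters.
    text = unicodedata.normalize('NFKC', name)
    text = _DIACRITICS.sub('', text)
    text = _ALEF_VARIANTS.sub('\u0627', text)  # unify to bare alef
    return re.sub(r'\s+', ' ', text).strip().lower()


def build_lookup(location_dict):
    """Map normalized names to taxonomy keys (values may be str or list)."""
    lookup = {}
    for loc_id, names in location_dict.items():
        if isinstance(names, str):
            names = [names]
        for name in names:
            lookup[normalize_name(name)] = loc_id
    return lookup
```

A recognized entity would then be matched with `normalize_name(entity['word']) in lookup` instead of a plain `.lower()` comparison.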


In [3]:
import pandas as pd
import os
from transformers import pipeline
import torch
from sentence_transformers import SentenceTransformer, util
from tqdm.auto import tqdm

# --- 1. Load GEOGRAPHICALLY FILTERED Data ---
print("--- Step 1: Loading Geographically Filtered Data ---")
DATA_DIR = '../data'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')

filtered_data_path = os.path.join(PROCESSED_DATA_DIR, 'news_geographically_filtered.pkl')
df_filtered = pd.read_pickle(filtered_data_path)

print("Geographically filtered data loaded successfully.")
print(f"  - Total relevant articles: {len(df_filtered):,} articles")
print("-" * 30, "\n")


# --- 2. Load Risk Factors ---
print("--- Step 2: Loading Risk Factors ---")
risk_factors_path = os.path.join(DATA_DIR, '01_raw/risk-factors.xlsx')
df_risk_factors_eng = pd.read_excel(risk_factors_path)
df_risk_factors_eng.dropna(subset=['risk_factor_english'], inplace=True)
risk_factor_labels = df_risk_factors_eng['risk_factor_english'].tolist()
print(f"Loaded {len(risk_factor_labels)} English risk factors.")
print("-" * 30, "\n")


# --- 3. Initialize Models ---
print("--- Step 3: Initializing Models ---")
device = 0 if torch.cuda.is_available() else -1
if device == 0:
    print("GPU found. Models will run on the GPU for maximum speed.")
else:
    print("No GPU found. Models will run on the CPU.")

MODEL_NAME = 'MoritzLaurer/deberta-v3-xsmall-zeroshot-v1.1-all-33'
classifier = pipeline("zero-shot-classification", model=MODEL_NAME, device=device)
print(f"Main classifier initialized: {MODEL_NAME}")

FAST_MODEL_NAME = 'paraphrase-multilingual-MiniLM-L12-v2'
fast_embedder = SentenceTransformer(FAST_MODEL_NAME, device=device)
print(f"Fast pre-filtering model initialized: {FAST_MODEL_NAME}")
print("-" * 30, "\n")


# --- 4. Define the Extraction Function ---
print("--- Step 4: Defining the Extraction Function ---")
risk_factor_embeddings = fast_embedder.encode(risk_factor_labels, convert_to_tensor=True)
print("Risk factor embeddings pre-computed.")

def extract_risk_factors_optimized(
    df, classifier, labels, threshold, batch_size,
    sentence_embedder, risk_factor_embeddings, sentence_similarity_threshold
):
    if 'article_id' not in df.columns:
        df['article_id'] = df.index
    df_sentences = df.explode('sentences').rename(columns={'sentences': 'sentence_text'})
    df_sentences = df_sentences[['article_id', 'date', 'sentence_text']].dropna(subset=['sentence_text'])
    all_sentences = df_sentences['sentence_text'].tolist()
    if not all_sentences: return pd.DataFrame()

    print(f"\nOriginal sentence count: {len(all_sentences):,}")
    print(f"Pre-filtering sentences with threshold: {sentence_similarity_threshold}")
    sentence_embeddings = sentence_embedder.encode(all_sentences, convert_to_tensor=True, show_progress_bar=True)
    hits = util.semantic_search(sentence_embeddings, risk_factor_embeddings, top_k=1)
    relevant_indices = [i for i, hit_list in enumerate(hits) if hit_list and hit_list[0]['score'] >= sentence_similarity_threshold]
    filtered_sentences_df = df_sentences.iloc[relevant_indices]
    sentence_list = filtered_sentences_df['sentence_text'].tolist()
    if not sentence_list: return pd.DataFrame()

    print(f"Reduced to {len(sentence_list):,} sentences after filtering.")
    print(f"Running classifier with confidence threshold: {threshold}")
    results_list = []
    for i, result in tqdm(enumerate(classifier(sentence_list, labels, multi_label=True, batch_size=batch_size)), total=len(sentence_list)):
        for label, score in zip(result['labels'], result['scores']):
            if score >= threshold:
                original_row = filtered_sentences_df.iloc[i]
                results_list.append({
                    'article_id': original_row['article_id'],
                    'date': original_row['date'],
                    'sentence_text': result['sequence'],
                    'risk_factor': label,
                    'confidence_score': score
                })
    return pd.DataFrame(results_list)

# --- 5. Set Parameters and Run the Extraction ---
print("\n--- Step 5: Running on a SAMPLE of 10 Articles ---")
CLASSIFIER_BATCH_SIZE = 128
SENTENCE_SIMILARITY_THRESHOLD = 0.55
CLASSIFIER_CONFIDENCE_THRESHOLD = 0.90

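# Optional: uncomment the next line to test the pipeline on a small sample first.
# While it stays commented out, the run below covers every filtered article.
# df_filtered = df_filtered.head(10).copy()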

print(f"Processing a sample of {len(df_filtered):,} articles...")

all_risk_mentions = extract_risk_factors_optimized(
    df_filtered,  
    classifier,
    risk_factor_labels,
    threshold=CLASSIFIER_CONFIDENCE_THRESHOLD,
    batch_size=CLASSIFIER_BATCH_SIZE,
    sentence_embedder=fast_embedder,
    risk_factor_embeddings=risk_factor_embeddings,
    sentence_similarity_threshold=SENTENCE_SIMILARITY_THRESHOLD
)
print("\nSample risk factor extraction complete.")
print(f"Found {len(all_risk_mentions):,} potential risk mentions in the sample.")
print("-" * 30, "\n")


# --- 6. Post-Processing: Keep Only the Top Label per Sentence ---
print("--- Step 6: Refining Sample Results (Post-Processing) ---")
if not all_risk_mentions.empty:
    print(f"Original number of mentions: {len(all_risk_mentions):,}")
    idx = all_risk_mentions.groupby('sentence_text')['confidence_score'].idxmax()
    all_risk_mentions_refined = all_risk_mentions.loc[idx]
    print(f"Refined to {len(all_risk_mentions_refined):,} high-confidence, unique mentions.")
else:
    all_risk_mentions_refined = pd.DataFrame()
    print("No risk mentions found to refine.")
print("-" * 30, "\n")


# --- 7. Save the Results ---
print("--- Step 7: Saving SAMPLE Results ---")
OUTPUT_DIR = os.path.join(DATA_DIR, '03_models')
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the refined mentions; the file name keeps the SAMPLE label used during development
output_path = os.path.join(OUTPUT_DIR, 'risk_mentions_SAMPLE_FINAL.csv')
all_risk_mentions_refined.to_csv(output_path, index=False)
print(f"Successfully saved {len(all_risk_mentions_refined):,} sample risk mentions to: {output_path}")
print("-" * 30, "\n")

print("--- Final Extracted Risk Factors (from Sample) ---")
if all_risk_mentions_refined.empty:
    print("No risk factors found with the current settings.")
else:
    display(all_risk_mentions_refined.head())

--- Step 1: Loading Geographically Filtered Data ---
Geographically filtered data loaded successfully.
  - Total relevant articles: 96,516 articles
------------------------------ 

--- Step 2: Loading Risk Factors ---
Loaded 167 English risk factors.
------------------------------ 

--- Step 3: Initializing Models ---
GPU found. Models will run on the GPU for maximum speed.


Device set to use cuda:0


Main classifier initialized: MoritzLaurer/deberta-v3-xsmall-zeroshot-v1.1-all-33
Fast pre-filtering model initialized: paraphrase-multilingual-MiniLM-L12-v2
------------------------------ 

--- Step 4: Defining the Extraction Function ---
Risk factor embeddings pre-computed.

--- Step 5: Running on a SAMPLE of 10 Articles ---
Processing a sample of 96,516 articles...

Original sentence count: 1,475,716
Pre-filtering sentences with threshold: 0.55


Batches:   0%|          | 0/46117 [00:00<?, ?it/s]

Reduced to 93,939 sentences after filtering.
Running classifier with confidence threshold: 0.9


  0%|          | 0/93939 [00:00<?, ?it/s]


Sample risk factor extraction complete.
Found 263,041 potential risk mentions in the sample.
------------------------------ 

--- Step 6: Refining Sample Results (Post-Processing) ---
Original number of mentions: 263,041
Refined to 34,275 high-confidence, unique mentions.
------------------------------ 

--- Step 7: Saving SAMPLE Results ---
Successfully saved 34,275 sample risk mentions to: ../data/03_models/risk_mentions_SAMPLE_FINAL.csv
------------------------------ 

--- Final Extracted Risk Factors (from Sample) ---


Unnamed: 0,article_id,date,sentence_text,risk_factor,confidence_score
194993,93829,2024-07-16,!,without international aid,0.954393
251782,158563,2024-07-22,!!,without international aid,0.958109
214472,110774,2024-06-28,!!!,without international aid,0.954647
230238,130700,2024-07-05,!!!!!,without international aid,0.964153
107929,26712,2024-06-27,"""",without international aid,0.91626
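
Step 6 keeps the top label per distinct sentence text, so an identical sentence appearing in several articles collapses into a single row. The following is a minimal sketch of an alternative that keeps one row per sentence within each article. It is not the refinement used for the saved outputs; `top_label_per_sentence` is a hypothetical helper and assumes the column names produced by the extraction function.

```python
import pandas as pd


def top_label_per_sentence(mentions):
    """Keep the highest-scoring risk factor per (article_id, sentence_text).

    Assumes a unique index and the columns 'article_id', 'sentence_text',
    and 'confidence_score'.
    """
    if mentions.empty:
        return mentions.copy()
    # idxmax returns the index label of the best-scoring row in each group.
    idx = (
        mentions
        .groupby(['article_id', 'sentence_text'])['confidence_score']
        .idxmax()
    )
    return mentions.loc[idx].reset_index(drop=True)


# Toy example: the same sentence occurs in two different articles.
toy = pd.DataFrame({
    'article_id': [1, 1, 2],
    'sentence_text': ['Aid was cut.', 'Aid was cut.', 'Aid was cut.'],
    'risk_factor': ['a', 'b', 'c'],
    'confidence_score': [0.91, 0.97, 0.93],
})
refined = top_label_per_sentence(toy)
```

On the toy frame this retains two rows, one per article, whereas grouping on `sentence_text` alone would retain only one.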


In [1]:
import pandas as pd
import os
from transformers import pipeline
import torch
import pickle
import numpy as np

# Step 1: Loading All Necessary Data
print("# Step 1: Loading All Necessary Data")
DATA_DIR = '../data'
MODELS_DIR = os.path.join(DATA_DIR, '03_models')
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')

df_risk_sentences = pd.read_csv(os.path.join(MODELS_DIR, 'risk_mentions_SAMPLE_FINAL.csv'))
df_eng = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed.pkl'))
df_eng['language'] = 'english'
df_ara = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed.pkl'))
df_ara['language'] = 'arabic'
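# Concatenate first, then derive article_id from the combined index. This mirrors the
# filtering step (pd.concat(..., ignore_index=True)) and the extraction step
# (article_id = index), so the IDs match those in df_risk_sentences. Assigning
# per-language indices before concatenating would give English and Arabic articles
# overlapping IDs.
df_articles = pd.concat([df_eng, df_ara], ignore_index=True)
if 'article_id' not in df_articles: df_articles['article_id'] = df_articles.index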
with open('../data/01_raw/id_english_location_name.pkl', 'rb') as f: eng_locations = pickle.load(f)
with open('../data/01_raw/id_arabic_location_name.pkl', 'rb') as f: ara_locations = pickle.load(f)
print("All data loaded successfully.")
print("-" * 30, "\n")

# Step 1: Loading All Necessary Data
All data loaded successfully.
------------------------------ 



In [2]:
df_risk_sentences

Unnamed: 0,article_id,date,sentence_text,risk_factor,confidence_score
0,93829,2024-07-16,!,without international aid,0.954393
1,158563,2024-07-22,!!,without international aid,0.958109
2,110774,2024-06-28,!!!,without international aid,0.954647
3,130700,2024-07-05,!!!!!,without international aid,0.964153
4,26712,2024-06-27,"""",without international aid,0.916260
...,...,...,...,...,...
34270,87057,2024-07-23, معهد آخر يؤشر بدء قرع طبول هجمات الفصائل وال...,wreaked havoc,0.936865
34271,93762,2024-06-26,ﺣﺎﻟﯿﺎً، ﯾﺸﻜّﻞ ﺗﻮﻗّﻒ اﻟﺘﺴﻠﯿﻒ ﺑﺴﺒﺐ أزﻣﺔ اﻟﻤﺼﺎرف ...,destructive pattern,0.900661
34272,93762,2024-06-26,ﻓﻘﺪ أدّى اﻧﺨﻔﺎض اﻟﻘﺪرة اﻟﺸﺮاﺋﯿﺔ ﻟﻠﻤﻮاطﻨﯿﻦ، وﺗﺮ...,human rights abuses,0.960100
34273,97204,2024-06-25,ﻓﻲ اﻟﻮاﻗﻊ، ﯾﻮاﺟﮫ اﻻﻗﺘﺼﺎد اﻟﻠﺒﻨﺎﻧﻲ ﻣﻨﺬ أﻛﺜﺮ ﻣﻦ ...,authoritarian,0.914291


In [None]:
# Step 2: Building Location Resolvers
print("# Step 2: Building Location Resolvers")
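def create_name_to_id_lookup(location_dict):
    """Map every lowercased location name (alias) to its taxonomy key."""
    lookup = {}
    for loc_id, names in location_dict.items():
        # Accept either a single name or a list of aliases per key.
        if isinstance(names, str): names = [names]
        for name in names: lookup[name.lower()] = loc_id
    return lookup
location_lookup = create_name_to_id_lookup(eng_locations)
location_lookup.update(create_name_to_id_lookup(ara_locations))
# Use the first listed alias (or the bare string) as the display name for each key.
id_to_english_name_lookup = {
    loc_id: (names if isinstance(names, str) else names[0])
    for loc_id, names in eng_locations.items()
}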
print("Location resolvers created.")
print("-" * 30, "\n")


# Step 3: Initialize Hugging Face NER Pipeline
print("# Step 3: Loading Hugging Face Model for NER")
device = 0 if torch.cuda.is_available() else -1
if device == 0: print("GPU found.")
else: print("No GPU found.")
ner_pipeline = pipeline("ner", model="Babelscape/wikineural-multilingual-ner", aggregation_strategy="simple", device=device)
print("NER pipeline loaded.")
print("-" * 30, "\n")


# Step 4: Hybrid Geotagging (Article and Sentence Level)
print("# Step 4: Hybrid Geotagging (Article & Sentence Levels)")
articles_with_risks_ids = df_risk_sentences['article_id'].unique()
df_articles_with_risks = df_articles[df_articles['article_id'].isin(articles_with_risks_ids)][['article_id', 'body', 'language']].copy()
print(f"Found {len(df_articles_with_risks)} unique articles for context.")

def resolve_locations(text, ner, lookup):
    """Shared function to extract and resolve locations from any text."""
    # Skip missing or empty text (pandas reads missing bodies as NaN).
    if not isinstance(text, str) or not text:
        return []
    entities = ner(text)
    found_ids = set()
    for entity in entities:
        if entity['entity_group'] == 'LOC':
            loc_name_lower = entity['word'].lower()
            if loc_name_lower in lookup:
                found_ids.add(lookup[loc_name_lower])
    return list(found_ids)

# 4a: Get ARTICLE-level locations (the broad context)
df_articles_with_risks['article_locations'] = df_articles_with_risks['body'].apply(
    lambda text: resolve_locations(text, ner_pipeline, location_lookup)
)
print("Extracted article-level locations.")

# 4b: Get SENTENCE-level locations (the specific context)
df_risk_sentences['sentence_locations'] = df_risk_sentences['sentence_text'].apply(
    lambda text: resolve_locations(text, ner_pipeline, location_lookup)
)
print("Extracted sentence-level locations.")
print("-" * 30, "\n")


# Step 5: Merge and Apply Hierarchical Logic
print("# Step 5: Applying Hierarchical Logic and Finalizing Data")
# Merge article context (language and locations) into the sentence dataframe
df_merged = pd.merge(
    df_risk_sentences,
    df_articles_with_risks[['article_id', 'language', 'article_locations']],
    on='article_id'
)

# HIERARCHICAL LOGIC: Choose the most specific location available
def choose_locations(row):
    # Prioritize specific (longer ID) sentence-level locations
    sentence_specific = {loc for loc in row['sentence_locations'] if len(loc) > 2}
    if sentence_specific:
        return list(sentence_specific)
    
    # Fallback to any sentence-level locations
    if row['sentence_locations']:
        return row['sentence_locations']
        
    # Fallback to specific article-level locations
    article_specific = {loc for loc in row['article_locations'] if len(loc) > 2}
    if article_specific:
        return list(article_specific)
        
    # Finally, use any article-level location as the last resort
    return row['article_locations']

df_merged['final_locations'] = df_merged.apply(choose_locations, axis=1)

# Explode, clean, and add the English name
df_final_exploded = df_merged.explode('final_locations').rename(columns={'final_locations': 'location_id'})
df_final_exploded = df_final_exploded.dropna(subset=['location_id'])
df_final_exploded['location_name_english'] = df_final_exploded['location_id'].map(id_to_english_name_lookup)

# Reorder columns for final output
final_columns = [
    'article_id', 'date', 'language', 'sentence_text', 'risk_factor',
    'confidence_score', 'location_id', 'location_name_english'
]
df_final_exploded = df_final_exploded[final_columns]

print(f"Created {len(df_final_exploded):,} final risk-location pairs.")
print("Sample of final data:")
display(df_final_exploded.head())
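
The hierarchical fallback in Step 5 can be checked in isolation on plain lists. The following is a minimal standalone sketch of the same priority order (specific sentence-level, any sentence-level, specific article-level, any article-level). It differs from the cell above in two labeled ways: it tests for an underscore instead of `len(loc) > 2`, and it returns sorted lists for reproducibility. The province and district keys in the usage example are hypothetical.

```python
def choose_locations(sentence_locations, article_locations):
    """Pick the most specific available location keys.

    Sentence-level matches take priority over article-level ones, and within
    each level province/district keys (those containing '_') take priority
    over bare country keys.
    """
    for candidates in (sentence_locations, article_locations):
        specific = sorted(loc for loc in set(candidates) if '_' in loc)
        if specific:
            return specific
        if candidates:
            return sorted(set(candidates))
    return []


# Hypothetical keys, used only to illustrate each branch.
examples = [
    choose_locations(['iq', 'iq_bg'], ['sy']),  # specific sentence-level wins
    choose_locations(['iq'], ['sy_di_3']),      # any sentence-level beats article-level
    choose_locations([], ['lb', 'lb_be']),      # specific article-level
    choose_locations([], []),                   # nothing found
]
```

One consequence of this ordering is that a sentence naming only a country overrides a more specific district mentioned elsewhere in the article, which may or may not be the desired behavior.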

---

# Part 2: Reflection

Please outline (1) some of the limitations of your approach and (2) how you would tackle these if you had more time.