# [Take-home Assessment] Food Crisis Early Warning 

Welcome to the assessment. You will showcase your modeling and research skills by investigating news articles (in English and Arabic) as well as a set of food insecurity risk factors. 

We suggest planning to spend **~6–8 hours** on this assessment. **Please submit your response by Monday, September 15th, 9:00 AM EST via email to dime_ai@worldbank.org**. Please document your code with comments and explanations of design choices. There is one question on reflecting upon your approach at the end of this notebook.

**Name:** Adam Przychodni

**Email:** adam.przychodni@gmail.com

---

# Part 1: Technical Assessment


## Task:

We invite you to approach the challenge of understanding (and potentially predicting) food insecurity using the provided (limited) data. Your response should demonstrate how you tackle open-ended problems in data-scarce environments.

Some example questions to consider:
- What is the added value of geospatial data?
- How can we address the lack of ground-truth information on food insecurity levels?
- What are the benefits and challenges of working with multilingual data?
- ...

These are just guiding examples — you are free to explore any relevant angles to this topic/data.

**Note:** There is no single "right" approach. Instead, we want to understand how you approach and structure open-ended problems in data-scarce environments. Given the large number of applicants, we will preselect the most impressive and complete submissions. Please take care in structuring your response, as selection will depend on its depth and originality.


## Provided Data:

1. **Risk Factors:** A file containing 167 risk factors (unigrams, bigrams, and trigrams) in the `english_keywords` column and an empty `keywords_arabic` column. A separate file with the mapping of English risk factors to pre-defined thematic cluster assignments.


2. **News Articles:** Two files containing one month of news articles from the Mashriq region:
   - `news-articles-eng.csv`
   - `news-articles-ara.csv`
   - **Note:** You may work on a sample subset during development.
   
   
3. **Geographic Taxonomy:** Two files containing the names of the countries, provinces, and districts for the subset of Mashriq countries covered by the news articles. Each file is a dictionary mapping a key to the geographic name.
   - `id_arabic_location_name.pkl`
   - `id_english_location_name.pkl`
   - **Note:** Each unique country/province/district is assigned a key (e.g. `iq`, `iq_bg`, and `iq_bg_1` for the country Iraq, the province Baghdad, and district 1 in Baghdad, respectively).
   - The key of each country is a two-character abbreviation, as follows:
       - 'iq': 'Iraq'
       - 'jo': 'Jordan'
       - 'lb': 'Lebanon'
       - 'ps': 'Palestine'
       - 'sy': 'Syria'
       
   - The key of a province is the two-character country abbreviation followed by a two-character province abbreviation, **`{country_abbreviation}_{province_abbreviation}`**, and the key of a district is **`{country_abbreviation}_{province_abbreviation}_{unique_number}`**.
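
As a minimal sketch of how these keys could be decomposed, assuming every key follows the underscore-separated pattern described above (the `LocationKey` type and the `parse_location_key` helper are hypothetical names, not part of the provided data):

```python
from typing import NamedTuple, Optional


class LocationKey(NamedTuple):
    """Hypothetical container for the parts of a taxonomy key."""
    level: str                 # "country", "province", or "district"
    country: str               # e.g. "iq"
    province: Optional[str]    # e.g. "iq_bg", or None for a country key
    district: Optional[str]    # e.g. "iq_bg_1", or None above district level


def parse_location_key(key: str) -> LocationKey:
    """Split a taxonomy key into its administrative levels."""
    parts = key.split("_")
    if len(parts) == 1:
        return LocationKey("country", parts[0], None, None)
    if len(parts) == 2:
        return LocationKey("province", parts[0], key, None)
    if len(parts) == 3:
        return LocationKey("district", parts[0], "_".join(parts[:2]), key)
    raise ValueError(f"Unexpected location key format: {key!r}")


for example in ("iq", "iq_bg", "iq_bg_1"):
    print(parse_location_key(example))
```

Parsing the key this way would let district-level matches be rolled up to their province or country without a separate lookup table.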
       


## Submission Guidelines:

- **Code:** Follow best coding practices and ensure clear documentation. All notebook cells should be executed with outputs saved, and the notebook should run correctly on its own. Name your file **`solution_{FIRSTNAME}_{LASTNAME}.ipynb`**. If your solution relies on additional open-access data, either include it in your submission (e.g., as part of a ZIP file) or provide clear data-loading code/instructions as part of the notebook.
- **Report:** Submit a separate markdown file communicating your approach to this research problem. We expect you to detail the models, methods, or (additional) data you are using.

Good luck!


---

## Your Submission

Parts of the code were generated and formatted using LLMs (Gemini 2.5 Pro and Claude Opus 4.1).

In [None]:
"""
News article processing pipeline for multilingual text data.
Balanced version with improved pre-filtering strategy.
"""

import pandas as pd
import re
import pickle
import torch
import logging
from pathlib import Path
from typing import List, Optional, Dict, Set, Tuple
from transformers import pipeline, Pipeline
from tqdm.auto import tqdm
import gc

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Regex patterns as module constants
HTML_PATTERN = re.compile(r'<.*?>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
WHITESPACE_PATTERN = re.compile(r'\s+')
SENTENCE_SPLIT_PATTERN = re.compile(r'(?<=[.!?۔])\s+')
LETTER_PATTERN = re.compile(r'[a-zA-Zء-ي]')


def get_device() -> int:
    """
    Determine the best available device for processing.
    
    Returns:
        int: Device ID (0 for GPU, -1 for CPU).
    """
    if torch.cuda.is_available():
        logger.info(f"GPU available: {torch.cuda.get_device_name(0)}")
        return 0
    else:
        logger.info("No GPU found, using CPU")
        return -1


def load_location_lookup(data_dir: str) -> Dict[str, str]:
    """
    Load location dictionaries for geographic filtering.
    
    Args:
        data_dir: Root directory containing raw data.
        
    Returns:
        Dict[str, str]: Dictionary mapping location names to IDs.
        
    Raises:
        FileNotFoundError: If location files don't exist.
    """
    raw_dir = Path(data_dir) / '01_raw'
    eng_path = raw_dir / 'id_english_location_name.pkl'
    ara_path = raw_dir / 'id_arabic_location_name.pkl'
    
    location_lookup = {}
    
    try:
        with open(eng_path, 'rb') as f:
            eng_locations = pickle.load(f)
        with open(ara_path, 'rb') as f:
            ara_locations = pickle.load(f)
        
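        # Build lookup dictionary. A value may be a single name or a collection
        # of aliases; wrap a bare string so it is not iterated character by character.
        for location_dict in [eng_locations, ara_locations]:
            for loc_id, names in location_dict.items():
                if isinstance(names, str):
                    names = [names]
                for name in names:
                    location_lookup[name.lower()] = loc_id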
        
        logger.info(f"Loaded {len(location_lookup):,} unique location aliases")
        return location_lookup
        
    except FileNotFoundError as e:
        logger.error(f"Location file not found: {e}")
        raise


def prefilter_by_keywords(df: pd.DataFrame, location_lookup: Dict[str, str], 
                         use_all_keywords: bool = False, sample_size: int = 100) -> pd.DataFrame:
    """
    Balanced keyword-based pre-filtering before expensive NER.
    
    Args:
        df: DataFrame with articles.
        location_lookup: Dictionary of location names.
        use_all_keywords: If True, use all keywords. If False, use sample_size.
        sample_size: Number of shortest location names to keep if not using all
            (name length is only a rough proxy for importance).
        
    Returns:
        pd.DataFrame: Pre-filtered dataframe.
    """
    location_keywords = list(location_lookup.keys())
    
    if not use_all_keywords and len(location_keywords) > sample_size:
        # Prioritize important locations (countries and major cities)
        # Sort by length - shorter names are often more important
        location_keywords = sorted(location_keywords, key=len)[:sample_size]
        logger.info(f"Using top {sample_size} location keywords for pre-filtering")
    else:
        logger.info(f"Using all {len(location_keywords)} location keywords for pre-filtering")
    
    # Create more flexible regex pattern
    # Use word boundaries for whole word matching
    escaped_keywords = [re.escape(loc) for loc in location_keywords]
    # Create pattern in chunks to avoid regex size limits
    chunk_size = 100
    patterns = []
    
    for i in range(0, len(escaped_keywords), chunk_size):
        chunk = escaped_keywords[i:i+chunk_size]
        pattern = r'\b(' + '|'.join(chunk) + r')\b'
        patterns.append(re.compile(pattern, re.IGNORECASE))
    
    def contains_location(text):
        if not isinstance(text, str):
            return False
        # Check against all pattern chunks
        for pattern in patterns:
            if pattern.search(text):
                return True
        return False
    
    # Apply pre-filter
    logger.info("Starting pre-filtering...")
    mask = df['body'].apply(contains_location)
    df_prefiltered = df[mask].copy()
    
    logger.info(f"Pre-filtering complete: {len(df_prefiltered)}/{len(df)} articles "
               f"contain location keywords ({len(df_prefiltered)/len(df)*100:.1f}%)")
    
    # If pre-filtering is too aggressive, warn
    if len(df_prefiltered) < len(df) * 0.01:  # Less than 1%
        logger.warning("Pre-filtering may be too aggressive! Consider adjusting parameters.")
    
    return df_prefiltered


def initialize_ner_pipeline(model_name: str = "Babelscape/wikineural-multilingual-ner", 
                           device: Optional[int] = None) -> Pipeline:
    """
    Initialize NER pipeline for geographic filtering.
    
    Args:
        model_name: Name of the NER model to use.
        device: Device ID (None for auto-detection).
        
    Returns:
        Pipeline: Initialized NER pipeline.
        
    Raises:
        Exception: If model loading fails.
    """
    if device is None:
        device = get_device()
        
    logger.info(f"Loading NER model: {model_name}")
    
    try:
        ner_pipeline = pipeline(
            "ner",
            model=model_name,
            aggregation_strategy="simple",
            device=device
        )
        logger.info("NER pipeline initialized successfully")
        return ner_pipeline
    except Exception as e:
        logger.error(f"Failed to load NER model: {e}")
        raise


def filter_by_geography(df: pd.DataFrame, location_lookup: Dict[str, str], 
                       ner_pipeline: Pipeline, max_text_length: int = 1000,
                       batch_size: int = 256, chunk_size: int = 5000) -> pd.DataFrame:
    """
    Filter articles containing target geographic locations.
    
    Args:
        df: Combined dataframe of articles.
        location_lookup: Dictionary mapping location names to IDs.
        ner_pipeline: Initialized NER pipeline.
        max_text_length: Maximum text length for NER processing.
        batch_size: Batch size for NER processing.
        chunk_size: Size of chunks for progress tracking.
        
    Returns:
        pd.DataFrame: Filtered dataframe with location metadata.
    """
    # Convert lookup keys to a set for O(1) lookups
    location_set = set(location_lookup.keys())
    
    # Prepare texts for NER
    article_bodies = df['body'].fillna('').tolist()
    texts_to_process = [text[:max_text_length] for text in article_bodies]
    
    total_articles = len(texts_to_process)
    logger.info(f"Running NER on {total_articles:,} articles...")
    logger.info(f"Batch size: {batch_size}, Chunk size: {chunk_size}")
    logger.info(f"Max text length: {max_text_length} chars")
    
    # Process in chunks to show progress and manage memory
    all_relevance = []
    all_locations = []
    
    with tqdm(total=total_articles, desc="Processing articles") as pbar:
        for i in range(0, total_articles, chunk_size):
            chunk_end = min(i + chunk_size, total_articles)
            chunk_texts = texts_to_process[i:chunk_end]
            
            # Extract entities for this chunk
            try:
                chunk_entities = ner_pipeline(chunk_texts, batch_size=batch_size)
            except Exception as e:
                logger.warning(f"Error processing chunk {i//chunk_size + 1}: {e}")
                # If the batch fails, retry once with half the batch size (never below 1)
                logger.info("Retrying with smaller batch size...")
                chunk_entities = ner_pipeline(chunk_texts, batch_size=max(1, batch_size // 2))
            
            # Process results
            for article_entities in chunk_entities:
                found_locations = []
                
                # Use set operations for faster checking
                for entity in article_entities:
                    if entity.get('entity_group') == 'LOC':
                        word_lower = entity.get('word', '').lower()
                        if word_lower in location_set:
                            found_locations.append(entity['word'])
                
                all_relevance.append(len(found_locations) > 0)
                all_locations.append(found_locations)
            
            # Update progress
            pbar.update(chunk_end - i)
            
            # Clear GPU cache periodically to prevent OOM
            if i % (chunk_size * 5) == 0 and i > 0:
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                gc.collect()
                logger.info(f"Cleared memory at article {i}")
    
    # Add metadata and filter
    df['matched_locations'] = all_locations
    df_filtered = df[all_relevance].copy()
    
    # Log statistics
    logger.info("Geographic filtering complete:")
    logger.info(f"  - Original articles: {len(df):,}")
    logger.info(f"  - Relevant articles: {len(df_filtered):,}")
    logger.info(f"  - Retention rate: {len(df_filtered)/len(df)*100:.1f}%")
    
    # Calculate location frequency
    if len(df_filtered) > 0:
        all_matched_locs = [loc for locs in df_filtered['matched_locations'] for loc in locs]
        if all_matched_locs:
            from collections import Counter
            loc_freq = Counter(all_matched_locs)
            logger.info(f"  - Top 5 locations: {loc_freq.most_common(5)}")
    
    return df_filtered


def load_news_data(data_dir: str, enable_geo_filter: bool = True,
                  max_text_length: int = 1000, batch_size: int = 256,
                  ner_model: str = "Babelscape/wikineural-multilingual-ner",
                  use_prefilter: bool = True, run_ner_filter: bool = True,
                  use_all_keywords: bool = False,
                  prefilter_sample: int = 100) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Load English and Arabic news datasets with optional geographic filtering.
    
    Args:
        data_dir: Root directory containing raw data.
        enable_geo_filter: Whether to apply geographic filtering.
        max_text_length: Maximum text length for NER processing.
        batch_size: Batch size for NER processing.
        ner_model: Model name for NER pipeline.
        use_prefilter: Whether to use keyword pre-filtering.
        run_ner_filter: Whether to run the NER model after pre-filtering.
        use_all_keywords: Whether to use all location keywords.
        prefilter_sample: Number of keywords to use if not using all.
        
    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: Tuple of (english_df, arabic_df).
        
    Raises:
        FileNotFoundError: If dataset files don't exist.
    """
    raw_dir = Path(data_dir) / '01_raw'
    eng_path = raw_dir / 'news-articles-eng.csv'
    ara_path = raw_dir / 'news-articles-ara.csv'
    
    try:
        logger.info("Loading news articles from CSV files...")
        df_eng = pd.read_csv(eng_path)
        df_ara = pd.read_csv(ara_path)
        
        # Add language column for tracking
        df_eng['language'] = 'english'
        df_ara['language'] = 'arabic'
        
        logger.info(f"Loaded {len(df_eng):,} English and {len(df_ara):,} Arabic articles")
        
        # Combine for geographic filtering
        df_combined = pd.concat([df_eng, df_ara], ignore_index=True)
        
        # Apply geographic filtering if enabled
        if enable_geo_filter:
            location_lookup = load_location_lookup(data_dir)
            
            # Apply keyword pre-filtering if enabled
            if use_prefilter:
                df_combined = prefilter_by_keywords(
                    df_combined, location_lookup, use_all_keywords, prefilter_sample
                )
                if df_combined.empty:
                    logger.warning("Pre-filtering removed all articles! No data to process further.")
                    return pd.DataFrame(), pd.DataFrame()

            # Apply NER-based filtering if enabled
            if run_ner_filter:
                if not df_combined.empty:
                    ner_pipeline = initialize_ner_pipeline(ner_model)
                    df_combined = filter_by_geography(
                        df_combined, location_lookup, ner_pipeline, 
                        max_text_length, batch_size, chunk_size=5000
                    )
                    if torch.cuda.is_available():
                        torch.cuda.empty_cache()
                    gc.collect()
                else:
                    logger.warning("Skipping NER as no articles remain after pre-filtering.")
            else:
                logger.info("NER filtering disabled.")
                # Ensure 'matched_locations' column exists for consistency
                df_combined['matched_locations'] = [[] for _ in range(len(df_combined))]
            
            # Handle edge case where both filters are off but geo_filter is on
            if not use_prefilter and not run_ner_filter:
                logger.warning("Geo-filtering enabled, but both pre-filter and NER are off. All articles will be processed.")

        else:
            logger.info("Geographic filtering disabled, using all articles")
            df_combined['matched_locations'] = [[] for _ in range(len(df_combined))]
        
        # Split back into language-specific dataframes
        df_eng_filtered = df_combined[df_combined['language'] == 'english'].copy()
        df_ara_filtered = df_combined[df_combined['language'] == 'arabic'].copy()
        
        # Remove the language column
        df_eng_filtered = df_eng_filtered.drop('language', axis=1)
        df_ara_filtered = df_ara_filtered.drop('language', axis=1)
        
        logger.info(f"Final counts: {len(df_eng_filtered):,} English, "
                   f"{len(df_ara_filtered):,} Arabic articles")
        
        return df_eng_filtered, df_ara_filtered
        
    except FileNotFoundError as e:
        logger.error(f"Dataset file not found: {e}")
        raise
    except Exception as e:
        logger.error(f"Error loading data: {e}")
        raise


def clean_text(text: Optional[str]) -> str:
    """
    Clean raw text by removing HTML tags, URLs, and normalizing whitespace.
    
    Args:
        text: Raw text string to clean.
        
    Returns:
        str: Cleaned text string.
    """
    if not isinstance(text, str):
        return ""
    
    # Remove HTML tags
    text = HTML_PATTERN.sub('', text)
    # Remove URLs
    text = URL_PATTERN.sub('', text)
    # Normalize whitespace
    text = WHITESPACE_PATTERN.sub(' ', text).strip()
    
    return text


def tokenize_sentences(text: Optional[str]) -> List[str]:
    """
    Split text into sentences using regex and filter non-textual results.
    
    Args:
        text: Text to tokenize.
        
    Returns:
        List[str]: List of sentence strings.
    """
    if not isinstance(text, str) or not text:
        return []
    
    # Split on sentence-ending punctuation
    sentences = SENTENCE_SPLIT_PATTERN.split(text)
    
    # Filter empty strings and non-textual content
    valid_sentences = [
        s for s in sentences 
        if s and LETTER_PATTERN.search(s)
    ]
    
    return valid_sentences


def process_dataframe(df: pd.DataFrame, language: str) -> pd.DataFrame:
    """
    Apply cleaning and tokenization to a dataframe.
    
    Args:
        df: DataFrame with 'body' column containing article text.
        language: Language identifier for logging.
        
    Returns:
        pd.DataFrame: Processed DataFrame with cleaned text and sentences.
    """
    logger.info(f"Processing {language} articles...")
    
    # Clean text
    df['body_cleaned'] = df['body'].apply(clean_text)
    logger.info(f"Cleaned {len(df):,} {language} articles")
    
    # Tokenize into sentences
    df['sentences'] = df['body_cleaned'].apply(tokenize_sentences)
    
    # Calculate statistics
    sentence_counts = df['sentences'].apply(len)
    if len(sentence_counts) > 0:
        logger.info(f"Tokenized {language} articles: "
                   f"avg {sentence_counts.mean():.1f} sentences per article")
    
    return df


def save_processed_data(df_eng: pd.DataFrame, df_ara: pd.DataFrame, data_dir: str) -> None:
    """
    Save processed dataframes to pickle files.
    
    Args:
        df_eng: Processed English dataframe.
        df_ara: Processed Arabic dataframe.
        data_dir: Root directory for saving data.
        
    Raises:
        Exception: If saving fails.
    """
    processed_dir = Path(data_dir) / '02_processed'
    processed_dir.mkdir(parents=True, exist_ok=True)
    
    eng_path = processed_dir / 'news_eng_processed.pkl'
    ara_path = processed_dir / 'news_ara_processed.pkl'
    
    try:
        df_eng.to_pickle(eng_path)
        df_ara.to_pickle(ara_path)
        logger.info(f"Saved English data to: {eng_path}")
        logger.info(f"Saved Arabic data to: {ara_path}")
    except Exception as e:
        logger.error(f"Error saving processed data: {e}")
        raise


def display_sample(df: pd.DataFrame, num_sentences: int = 3) -> None:
    """
    Display sample processed sentences for verification.
    
    Args:
        df: Processed dataframe.
        num_sentences: Number of sentences to display.
    """
    if df.empty or 'sentences' not in df.columns:
        logger.warning("No sentences to display")
        return
    
    first_article = df.iloc[0]
    sentences = first_article['sentences']
    
    print("\n--- Sample Tokenization Results ---")
    print(f"Article split into {len(sentences)} sentences")
    print(f"First {min(num_sentences, len(sentences))} sentences:")
    
    for i, sentence in enumerate(sentences[:num_sentences], 1):
        print(f"  {i}. {sentence}")
    
    # Show matched locations if available
    if 'matched_locations' in df.columns and first_article['matched_locations']:
        print(f"\nMatched locations: {', '.join(first_article['matched_locations'])}")


def run_news_processing_pipeline(data_dir: str = '../data', 
                                enable_geo_filter: bool = True,
                                use_prefilter: bool = True,
                                run_ner_filter: bool = True,
                                max_text_length: int = 1000,
                                batch_size: int = 256,
                                ner_model: str = "Babelscape/wikineural-multilingual-ner",
                                use_all_keywords: bool = False,
                                prefilter_sample: int = 50) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Execute the complete news processing pipeline with balanced filtering.
    
    Args:
        data_dir: Root directory containing raw and processed data folders.
        enable_geo_filter: Whether to apply geographic filtering.
        use_prefilter: Whether to use keyword pre-filtering.
        run_ner_filter: Whether to run the NER model for filtering.
        max_text_length: Maximum text length for NER processing.
        batch_size: Batch size for NER processing.
        ner_model: Model name for NER pipeline.
        use_all_keywords: Whether to use all location keywords.
        prefilter_sample: Number of keywords to use if not using all.
        
    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: Tuple of processed (english_df, arabic_df).
        
    Raises:
        FileNotFoundError: If required files don't exist.
        Exception: If processing fails.
    """
    logger.info("=" * 60)
    logger.info("Starting news article processing pipeline")
    logger.info("=" * 60)
    
    if enable_geo_filter:
        logger.info("Geographic filtering: ENABLED")
        logger.info(f"  - Pre-filtering: {'ENABLED' if use_prefilter else 'DISABLED'}")
        logger.info(f"  - NER filtering: {'ENABLED' if run_ner_filter else 'DISABLED'}")
        if use_prefilter:
            logger.info(f"  - Using all keywords: {use_all_keywords}")
            logger.info(f"  - Keywords sample size: {prefilter_sample}")
        if run_ner_filter:
            logger.info(f"  - Max text length: {max_text_length} chars")
            logger.info(f"  - Batch size: {batch_size}")
    else:
        logger.info("Geographic filtering: DISABLED")
    
    # Load data (with geographic filtering if enabled)
    df_eng, df_ara = load_news_data(
        data_dir=data_dir, 
        enable_geo_filter=enable_geo_filter, 
        max_text_length=max_text_length, 
        batch_size=batch_size, 
        ner_model=ner_model, 
        use_prefilter=use_prefilter, 
        run_ner_filter=run_ner_filter,
        use_all_keywords=use_all_keywords, 
        prefilter_sample=prefilter_sample
    )
    
    # Check if we have data to process
    if df_eng.empty and df_ara.empty:
        logger.warning("No articles to process after filtering!")
        return df_eng, df_ara
    
    # Process each dataset
    if not df_eng.empty:
        df_eng = process_dataframe(df_eng, "English")
        display_sample(df_eng)
    
    if not df_ara.empty:
        df_ara = process_dataframe(df_ara, "Arabic")
    
    # Save processed data
    save_processed_data(df_eng, df_ara, data_dir)
    
    logger.info("=" * 60)
    logger.info("Pipeline completed successfully!")
    logger.info("=" * 60)
    
    return df_eng, df_ara

# ============================================================
# STRATEGY 1: NO FILTERING (Process all articles)
# ============================================================
# Use this to get ALL articles for comprehensive analysis
# df_eng_all, df_ara_all = run_news_processing_pipeline(
#     data_dir='../data',
#     enable_geo_filter=False  # Disable geographic filtering entirely
# )

# print(f"\nStrategy 1 - No Filtering:")
# print(f"English articles: {len(df_eng_all):,}")
# print(f"Arabic articles: {len(df_ara_all):,}")


# ============================================================
# STRATEGY 2: BALANCED FILTERING (Recommended: Pre-filter + NER)
# ============================================================
# Use focused keywords for pre-filtering, then apply NER
# df_eng_balanced, df_ara_balanced = run_news_processing_pipeline(
#     data_dir='../data',
#     enable_geo_filter=True,
#     use_prefilter=True,
#     run_ner_filter=True,      # Ensure NER is also on
#     use_all_keywords=True,  
#     # prefilter_sample=50,    # Focus on 50 most important locations
#     max_text_length=1500,     # Look at more text
#     batch_size=512            # Balanced batch size
# )

# print(f"\nStrategy 2 - Balanced Filtering (Prefilter + NER):")
# print(f"English articles: {len(df_eng_balanced):,}")
# print(f"Arabic articles: {len(df_ara_balanced):,}")


# ============================================================
# STRATEGY 3: NER-ONLY FILTERING (More inclusive, slower)
# ============================================================
# Skip pre-filtering but still use NER
# df_eng_relaxed, df_ara_relaxed = run_news_processing_pipeline(
#     data_dir='../data',
#     enable_geo_filter=True,
#     use_prefilter=False,      # Skip pre-filtering
#     run_ner_filter=True,      # Run NER
#     max_text_length=1000,     # Reduced for speed
#     batch_size=512            # Larger batches for efficiency
# )

# print(f"\nStrategy 3 - NER-Only Filtering:")
# print(f"English articles: {len(df_eng_relaxed):,}")
# print(f"Arabic articles: {len(df_ara_relaxed):,}")


# ============================================================
# STRATEGY 4: PRE-FILTERING ONLY (Fastest geographic filtering)
# ============================================================
# Use this for a quick, less precise filtering based on keywords.
df_eng_pre, df_ara_pre = run_news_processing_pipeline(
    data_dir='../data',
    enable_geo_filter=True,
    use_prefilter=True,
    run_ner_filter=False,     # The key change: disable NER
    use_all_keywords=True,    # Use all available location keywords
)

print(f"\nStrategy 4 - Pre-filtering Only:")
print(f"English articles: {len(df_eng_pre):,}")
print(f"Arabic articles: {len(df_ara_pre):,}")

# ============================================================
# MEMORY CLEANUP
# ============================================================
# torch and gc are already imported at the top of this cell.

if torch.cuda.is_available():
    torch.cuda.empty_cache()
gc.collect()
print("\nMemory cleaned up!")

2025-09-13 15:21:58,854 - INFO - Starting news article processing pipeline
2025-09-13 15:21:58,855 - INFO - Geographic filtering: ENABLED
2025-09-13 15:21:58,856 - INFO -   - Pre-filtering: ENABLED
2025-09-13 15:21:58,856 - INFO -   - NER filtering: DISABLED
2025-09-13 15:21:58,857 - INFO -   - Using all keywords: True
2025-09-13 15:21:58,857 - INFO -   - Keywords sample size: 50
2025-09-13 15:21:58,858 - INFO - Loading news articles from CSV files...
2025-09-13 15:22:14,999 - INFO - Loaded 86,660 English and 85,511 Arabic articles
2025-09-13 15:22:15,044 - INFO - Loaded 918 unique location aliases
2025-09-13 15:22:15,046 - INFO - Using all 918 location keywords for pre-filtering
2025-09-13 15:22:15,067 - INFO - Starting pre-filtering...
2025-09-13 15:32:21,843 - INFO - Pre-filtering complete: 162117/172171 articles contain location keywords (94.2%)
2025-09-13 15:32:21,862 - INFO - NER filtering disabled.
2025-09-13 15:32:22,311 - INFO - Final counts: 84,970 English, 77,147 Arabic articles


--- Sample Tokenization Results ---
Article split into 124 sentences
First 3 sentences:
  1. Hussam al-Mahmoud | Yamen Moghrabi | Hassan Ibrahim On October 7, 2023, the world and the Middle East awoke to the drums of war beating in the Gaza Strip.
  2. Over time, it turned into a reality that American efforts, Qatari and Egyptian mediation, condemnations, statements, summits, and conferences could not stop.
  3. While Israel continues its war in the besieged Gaza Strip, attention is turning towards the potential outbreak of another war.


2025-09-13 15:33:12,567 - INFO - Cleaned 77,147 Arabic articles
2025-09-13 15:33:17,889 - INFO - Tokenized Arabic articles: avg 11.8 sentences per article
2025-09-13 15:33:28,650 - INFO - Saved English data to: ../data/02_processed/news_eng_processed.pkl
2025-09-13 15:33:28,651 - INFO - Saved Arabic data to: ../data/02_processed/news_ara_processed.pkl
2025-09-13 15:33:28,653 - INFO - Pipeline completed successfully!



Strategy 4 - Pre-filtering Only:
English articles: 84,970
Arabic articles: 77,147

Memory cleaned up!
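
The log reports only the overall retention rate. As a small worked example, the per-language rates can be derived from the counts logged above (the variable names here are illustrative):

```python
# Counts copied from the log output above.
loaded = {"english": 86_660, "arabic": 85_511}
kept = {"english": 84_970, "arabic": 77_147}

# Share of articles that survived the keyword pre-filter, per language and overall.
retention = {lang: kept[lang] / loaded[lang] for lang in loaded}
overall = sum(kept.values()) / sum(loaded.values())

for lang, rate in retention.items():
    print(f"{lang}: {rate:.1%} retained")
print(f"overall: {overall:.1%} retained")
```

This works out to roughly 98.0% for English and 90.2% for Arabic, consistent with the 94.2% overall figure in the log, which suggests the keyword pre-filter alone removes very little of the corpus.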


### The Critical 'But': Acknowledging the Risks and Trade-offs

This is where you demonstrate senior-level thinking in your report. A great submission won't just do the filtering; it will discuss the limitations.

#### The Risk of Information Loss (False Negatives) 🗑️

This is the biggest drawback: you are throwing data away. What if a crucial Reuters article discusses a regional drought affecting "the Levant" but never explicitly names "Syria," "Lebanon," or any city on your list?

- **Your NER model isn't perfect:** it might fail to recognize a location name.
- **Your location list isn't exhaustive:** it might miss alternative spellings or aliases.

Your current implementation identifies many relevant articles but will inevitably discard others that are also relevant. This is a classic precision vs. recall trade-off: you are choosing to build a high-precision dataset at the cost of lower recall.
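
One way to put numbers on this trade-off, assuming a small hand-labelled sample of articles were available, is to compare the filter's decisions against those labels. The sketch below is illustrative only: the `precision_recall` helper is hypothetical and the labels are made up.

```python
from typing import Sequence, Tuple


def precision_recall(kept: Sequence[bool], relevant: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of a filter against hand-labelled relevance."""
    if len(kept) != len(relevant):
        raise ValueError("kept and relevant must have the same length")
    tp = sum(k and r for k, r in zip(kept, relevant))        # kept and relevant
    fp = sum(k and not r for k, r in zip(kept, relevant))    # kept but irrelevant
    fn = sum(r and not k for k, r in zip(kept, relevant))    # relevant but discarded
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for six articles, purely for illustration.
kept_by_filter = [True, True, True, False, False, False]
truly_relevant = [True, True, False, True, False, False]

precision, recall = precision_recall(kept_by_filter, truly_relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Even a few hundred labelled articles would give a rough estimate of how many relevant articles the filter discards.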

#### Lack of Contextual Understanding 🧐

Your current filter confirms the presence of a location name but not its context. An article could mention "Baghdad" in a purely historical context or as the location of a financial conference unrelated to food security. While this is a minor issue compared to the noise reduction benefits, it's a limitation worth mentioning in your reflection.
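
A lightweight way to approximate context, sketched here under the assumption that articles are already split into sentences, is to keep only sentences in which a location and a risk term co-occur. The term lists and the `sentences_with_context` helper are hypothetical; real lists would come from the taxonomy and risk-factor files.

```python
import re
from typing import Iterable, List

# Hypothetical example terms, not taken from the provided files.
LOCATIONS = ["baghdad", "gaza"]
RISK_TERMS = ["drought", "wheat prices", "food shortage"]


def _compile(terms: Iterable[str]) -> "re.Pattern[str]":
    """Build one case-insensitive whole-word pattern from a list of terms."""
    return re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE)


LOCATION_RE = _compile(LOCATIONS)
RISK_RE = _compile(RISK_TERMS)


def sentences_with_context(sentences: Iterable[str]) -> List[str]:
    """Return sentences that mention both a location and a risk term."""
    return [s for s in sentences if LOCATION_RE.search(s) and RISK_RE.search(s)]


sample = [
    "Baghdad hosted a regional banking conference this week.",
    "Wheat prices in Baghdad rose sharply after the drought.",
]
print(sentences_with_context(sample))
```

Sentence-level co-occurrence is still a crude proxy: it would miss context spread across sentences and would not distinguish historical from current mentions.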

State your choice: "A crucial preprocessing step was to filter the corpus to include only articles geographically relevant to the Mashriq region."

Justify it: Explain that this was done to enhance the signal-to-noise ratio, reduce computational load, and enable granular, location-specific risk analysis, which is the primary goal.

Acknowledge the limitations: Explicitly discuss the risk of discarding relevant articles (false negatives) due to model or list imperfections. Frame it as a deliberate choice to prioritize precision for this high-stakes early warning system.

Suggest future improvements (the "if I had more time" part): Propose a "two-funnel" approach for a real-world system:

Funnel 1 (High-Precision): Your current pipeline for generating specific, geotagged alerts.

Funnel 2 (High-Recall): A separate, lightweight process that monitors the discarded articles for spikes in key risk factors (e.g., "drought," "wheat prices"). A sudden spike in this "unfiltered" stream could trigger a manual review and potentially reveal a systemic event your geofilter is missing.
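
Funnel 2 could be prototyped as a simple spike detector over daily keyword counts in the discarded articles. This is a minimal sketch, assuming a z-score against a trailing window; the function name, the 7-day window, the threshold of 3.0, and the toy counts are illustrative assumptions, not values from this project.

```python
from statistics import mean, pstdev
from typing import List


def find_spikes(daily_counts: List[int],
                window: int = 7,
                z_threshold: float = 3.0) -> List[int]:
    """Return indices of days whose count is far above the trailing-window mean."""
    spikes = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mu = mean(history)
        sigma = pstdev(history)
        # Floor sigma at 1.0 so a flat history does not cause division by zero
        z = (daily_counts[day] - mu) / max(sigma, 1.0)
        if z >= z_threshold:
            spikes.append(day)
    return spikes


# Made-up daily counts of "drought" mentions in discarded articles
drought_counts = [2, 3, 2, 4, 3, 2, 3, 3, 2, 15, 4]
spike_days = find_spikes(drought_counts)
print(spike_days)
```

A flagged day would only trigger a manual review of the discarded articles; it is not an alert on its own.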

In [1]:
"""
Risk factor extraction pipeline for geographically filtered news articles.
Uses zero-shot classification to identify risk factors in article sentences.
Functional implementation without classes.
"""

import pandas as pd
import torch
import logging
from pathlib import Path
from typing import List, Dict, Optional, Tuple
from transformers import pipeline, Pipeline
from sentence_transformers import SentenceTransformer, util
from tqdm.auto import tqdm

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


def get_device() -> int:
    """
    Determine the best available device for processing.
    
    Returns:
        int: Device ID (0 for GPU, -1 for CPU).
    """
    if torch.cuda.is_available():
        device_name = torch.cuda.get_device_name(0)
        logger.info(f"GPU available: {device_name}")
        return 0
    else:
        logger.info("No GPU found, using CPU")
        return -1


def load_filtered_data(data_dir: str, use_sample: bool = True, sample_size: int = 10) -> pd.DataFrame:
    """
    Load geographically filtered English and Arabic article data and combine them.
    
    Args:
        data_dir: Root directory containing processed data.
        use_sample: Whether to use a sample of articles for testing.
        sample_size: Number of articles to sample if use_sample is True.
        
    Returns:
        pd.DataFrame: A combined DataFrame of filtered articles.
        
    Raises:
        FileNotFoundError: If no processed data files are found.
    """
    processed_dir = Path(data_dir) / '02_processed'
    eng_path = processed_dir / 'news_eng_processed.pkl'
    ara_path = processed_dir / 'news_ara_processed.pkl'
    
    df_list = []
    
    # Load English data if it exists and is not empty
    try:
        df_eng = pd.read_pickle(eng_path)
        if not df_eng.empty:
            df_list.append(df_eng)
            logger.info(f"Loaded {len(df_eng):,} English articles.")
    except FileNotFoundError:
        logger.warning(f"English data file not found at: {eng_path}")

    # Load Arabic data if it exists and is not empty
    try:
        df_ara = pd.read_pickle(ara_path)
        if not df_ara.empty:
            df_list.append(df_ara)
            logger.info(f"Loaded {len(df_ara):,} Arabic articles.")
    except FileNotFoundError:
        logger.warning(f"Arabic data file not found at: {ara_path}")

    if not df_list:
        error_msg = "No processed data found. Please run the news processing pipeline first."
        logger.error(error_msg)
        raise FileNotFoundError(error_msg)
        
    # Combine the dataframes
    df_combined = pd.concat(df_list, ignore_index=True)
    logger.info(f"Combined data: {len(df_combined):,} total articles.")

    if use_sample and not df_combined.empty:
        # Ensure sample size is not larger than the dataframe
        sample_size = min(sample_size, len(df_combined))
        df_combined = df_combined.sample(n=sample_size, random_state=42).copy()
        logger.info(f"Using a random sample of {len(df_combined)} articles for processing.")
    
    return df_combined


def load_risk_factors(data_dir: str) -> List[str]:
    """
    Load risk factor labels from Excel file.
    
    Args:
        data_dir: Root directory containing raw data.
        
    Returns:
        List[str]: List of risk factor labels.
        
    Raises:
        FileNotFoundError: If risk factors file doesn't exist.
    """
    raw_dir = Path(data_dir) / '01_raw'
    risk_path = raw_dir / 'risk-factors.xlsx'
    
    try:
        df_risk = pd.read_excel(risk_path)
        df_risk.dropna(subset=['risk_factor_english'], inplace=True)
        
        risk_factor_labels = df_risk['risk_factor_english'].tolist()
        logger.info(f"Loaded {len(risk_factor_labels)} risk factors")
        
        return risk_factor_labels
        
    except FileNotFoundError as e:
        logger.error(f"Risk factors file not found: {e}")
        raise


def initialize_classifier(model_name: str = 'MoritzLaurer/deberta-v3-xsmall-zeroshot-v1.1-all-33',
                         device: Optional[int] = None) -> Pipeline:
    """
    Initialize zero-shot classification pipeline.
    
    Args:
        model_name: Name of the classification model to use.
        device: Device ID (None for auto-detection).
        
    Returns:
        Pipeline: Initialized classification pipeline.
        
    Raises:
        Exception: If model loading fails.
    """
    if device is None:
        device = get_device()
        
    try:
        classifier = pipeline(
            "zero-shot-classification",
            model=model_name,
            device=device
        )
        logger.info(f"Classifier initialized: {model_name}")
        return classifier
        
    except Exception as e:
        logger.error(f"Failed to initialize classifier: {e}")
        raise


def initialize_embedder(model_name: str = 'paraphrase-multilingual-MiniLM-L12-v2',
                       device: Optional[int] = None) -> SentenceTransformer:
    """
    Initialize sentence embedding model.
    
    Args:
        model_name: Name of the embedding model to use.
        device: Device ID (None for auto-detection).
        
    Returns:
        SentenceTransformer: Initialized sentence transformer.
        
    Raises:
        Exception: If model loading fails.
    """
    if device is None:
        device = get_device()
        
    try:
        embedder = SentenceTransformer(
            model_name,
            device='cuda' if device == 0 else 'cpu'
        )
        logger.info(f"Embedder initialized: {model_name}")
        return embedder
        
    except Exception as e:
        logger.error(f"Failed to initialize embedder: {e}")
        raise


def prepare_sentences(df: pd.DataFrame) -> pd.DataFrame:
    """
    Prepare sentences from articles for processing.
    
    Args:
        df: DataFrame with articles containing 'sentences' column.
        
    Returns:
        pd.DataFrame: DataFrame with individual sentences.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    
    # Add article ID if not present
    if 'article_id' not in df.columns:
        df['article_id'] = df.index
    
    # Explode sentences into individual rows
    df_sentences = df.explode('sentences').rename(columns={'sentences': 'sentence_text'})
    df_sentences = df_sentences[['article_id', 'date', 'sentence_text']].dropna(subset=['sentence_text'])
    
    logger.info(f"Prepared {len(df_sentences):,} sentences from articles")
    
    return df_sentences


def filter_relevant_sentences(df_sentences: pd.DataFrame, 
                             embedder: SentenceTransformer,
                             risk_factor_labels: List[str],
                             similarity_threshold: float = 0.55) -> pd.DataFrame:
    """
    Pre-filter sentences using semantic similarity to risk factors.
    
    Args:
        df_sentences: DataFrame with sentence data.
        embedder: Initialized sentence transformer.
        risk_factor_labels: List of risk factor labels.
        similarity_threshold: Minimum similarity score for filtering.
        
    Returns:
        pd.DataFrame: DataFrame with filtered relevant sentences.
    """
    if df_sentences.empty:
        return df_sentences
    
    sentences = df_sentences['sentence_text'].tolist()
    
    logger.info(f"Pre-filtering {len(sentences):,} sentences...")
    logger.info(f"Similarity threshold: {similarity_threshold}")
    
    # Pre-compute risk factor embeddings
    risk_factor_embeddings = embedder.encode(
        risk_factor_labels,
        convert_to_tensor=True
    )
    
    # Encode sentences
    sentence_embeddings = embedder.encode(
        sentences,
        convert_to_tensor=True,
        show_progress_bar=True
    )
    
    # Find semantically similar sentences to risk factors
    hits = util.semantic_search(
        sentence_embeddings,
        risk_factor_embeddings,
        top_k=1
    )
    
    # Filter by similarity threshold
    relevant_indices = [
        i for i, hit_list in enumerate(hits)
        if hit_list and hit_list[0]['score'] >= similarity_threshold
    ]
    
    df_filtered = df_sentences.iloc[relevant_indices].copy()
    
    logger.info(f"Reduced to {len(df_filtered):,} relevant sentences "
               f"({len(df_filtered)/len(sentences)*100:.1f}% retention)")
    
    return df_filtered


def classify_sentences(df_sentences: pd.DataFrame,
                      classifier: Pipeline,
                      risk_factor_labels: List[str],
                      confidence_threshold: float = 0.90,
                      batch_size: int = 128) -> pd.DataFrame:
    """
    Classify filtered sentences for risk factors.
    
    Args:
        df_sentences: DataFrame with filtered sentences.
        classifier: Initialized classification pipeline.
        risk_factor_labels: List of risk factor labels.
        confidence_threshold: Minimum confidence score for classification.
        batch_size: Batch size for classification.
        
    Returns:
        pd.DataFrame: DataFrame with risk factor classifications.
    """
    if df_sentences.empty:
        return pd.DataFrame()
    
    sentences = df_sentences['sentence_text'].tolist()
    
    logger.info(f"Classifying {len(sentences):,} sentences...")
    logger.info(f"Confidence threshold: {confidence_threshold}")
    
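    results_list = []
    
    # Classify in chunks of sentences. Passing the whole list in one call only
    # returns after every sentence is processed, so a progress bar wrapped
    # around that call would not advance while the model is running.
    chunk_size = 1_000
    for start in tqdm(
        range(0, len(sentences), chunk_size),
        desc="Classifying sentences"
    ):
        chunk_results = classifier(
            sentences[start:start + chunk_size],
            risk_factor_labels,
            multi_label=True,
            batch_size=batch_size
        )
        # Guard: normalize to a list in case a single result dict is returned
        if isinstance(chunk_results, dict):
            chunk_results = [chunk_results]
        
        for offset, result in enumerate(chunk_results):
            original_row = df_sentences.iloc[start + offset]
            # Check each label's confidence
            for label, score in zip(result['labels'], result['scores']):
                if score >= confidence_threshold:
                    results_list.append({
                        'article_id': original_row['article_id'],
                        'date': original_row['date'],
                        'sentence_text': result['sequence'],
                        'risk_factor': label,
                        'confidence_score': score
                    })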
    
    df_results = pd.DataFrame(results_list)
    
    logger.info(f"Found {len(df_results):,} risk factor mentions")
    
    return df_results


def refine_results(df_mentions: pd.DataFrame) -> pd.DataFrame:
    """
    Post-process results to keep only highest confidence mention per sentence.
    
    Args:
        df_mentions: DataFrame with all risk mentions.
        
    Returns:
        pd.DataFrame: DataFrame with refined risk mentions.
    """
    if df_mentions.empty:
        return df_mentions
    
    logger.info("Refining results to highest confidence per sentence...")
    
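    # Keep only the highest confidence label for each sentence.
    # Group by article as well, so identical sentences that appear in
    # different articles are kept as separate mentions.
    idx = df_mentions.groupby(['article_id', 'sentence_text'])['confidence_score'].idxmax()
    df_refined = df_mentions.loc[idx].copy()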
    
    # Sort by confidence score
    df_refined = df_refined.sort_values('confidence_score', ascending=False)
    
    logger.info(f"Refined from {len(df_mentions):,} to {len(df_refined):,} unique mentions")
    
    return df_refined


def save_results(df_results: pd.DataFrame, data_dir: str, use_sample: bool = True) -> Path:
    """
    Save extraction results to CSV.
    
    Args:
        df_results: DataFrame with risk factor mentions.
        data_dir: Root directory for saving data.
        use_sample: Whether this is sample data (affects filename).
        
    Returns:
        Path: Path to saved file.
        
    Raises:
        Exception: If saving fails.
    """
    models_dir = Path(data_dir) / '03_models'
    models_dir.mkdir(parents=True, exist_ok=True)
    
    filename = 'risk_mentions_SAMPLE.csv' if use_sample else 'risk_mentions_FULL.csv'
    
    output_path = models_dir / filename
    
    try:
        df_results.to_csv(output_path, index=False)
        logger.info(f"Saved {len(df_results):,} risk mentions to {output_path}")
        return output_path
        
    except Exception as e:
        logger.error(f"Failed to save results: {e}")
        raise


def calculate_statistics(df_results: pd.DataFrame, 
                        total_articles: int,
                        total_sentences: int,
                        filtered_sentences: int,
                        risk_mentions: int) -> Dict:
    """
    Calculate statistics about the extraction.
    
    Args:
        df_results: DataFrame with extraction results.
        total_articles: Total number of articles processed.
        total_sentences: Total number of sentences extracted.
        filtered_sentences: Number of sentences after filtering.
        risk_mentions: Number of risk mentions found.
        
    Returns:
        Dict: Dictionary of statistics.
    """
    stats = {
        'total_articles': total_articles,
        'total_sentences': total_sentences,
        'filtered_sentences': filtered_sentences,
        'risk_mentions': risk_mentions,
        'unique_mentions': len(df_results)
    }
    
    if not df_results.empty:
        stats.update({
            'unique_risk_factors': df_results['risk_factor'].nunique(),
            'avg_confidence': df_results['confidence_score'].mean(),
            'min_confidence': df_results['confidence_score'].min(),
            'max_confidence': df_results['confidence_score'].max(),
            'top_risk_factors': df_results['risk_factor'].value_counts().head(5).to_dict()
        })
    
    return stats


def display_statistics(stats: Dict) -> None:
    """
    Display extraction statistics.
    
    Args:
        stats: Dictionary containing statistics.
    """
    print("\n" + "="*50)
    print("EXTRACTION STATISTICS")
    print("="*50)
    print(f"Total articles processed: {stats['total_articles']:,}")
    print(f"Total sentences extracted: {stats['total_sentences']:,}")
    print(f"Sentences after filtering: {stats['filtered_sentences']:,}")
    print(f"Risk mentions found: {stats['risk_mentions']:,}")
    print(f"Unique mentions (refined): {stats['unique_mentions']:,}")
    
    if 'unique_risk_factors' in stats:
        print(f"\nUnique risk factors: {stats['unique_risk_factors']}")
        print(f"Average confidence: {stats['avg_confidence']:.3f}")
        print(f"Confidence range: {stats['min_confidence']:.3f} - {stats['max_confidence']:.3f}")
        
        if stats.get('top_risk_factors'):
            print("\nTop 5 risk factors:")
            for risk, count in stats['top_risk_factors'].items():
                print(f"  - {risk}: {count} mentions")
    print("="*50)


def run_risk_extraction_pipeline(data_dir: str = '../data',
                               classifier_model: str = 'MoritzLaurer/deberta-v3-xsmall-zeroshot-v1.1-all-33',
                               embedder_model: str = 'paraphrase-multilingual-MiniLM-L12-v2',
                               classifier_batch_size: int = 128,
                               sentence_similarity_threshold: float = 0.55,
                               classifier_confidence_threshold: float = 0.90,
                               use_sample: bool = True,
                               sample_size: int = 10) -> pd.DataFrame:
    """
    Execute the complete risk factor extraction pipeline.
    
    Args:
        data_dir: Root directory containing data folders.
        classifier_model: Name of the classification model to use.
        embedder_model: Name of the embedding model to use.
        classifier_batch_size: Batch size for classification.
        sentence_similarity_threshold: Minimum similarity score for sentence filtering.
        classifier_confidence_threshold: Minimum confidence score for classification.
        use_sample: Whether to use a sample of articles for testing.
        sample_size: Number of articles to sample if use_sample is True.
        
    Returns:
        pd.DataFrame: DataFrame with refined risk factor mentions.
        
    Raises:
        FileNotFoundError: If required files don't exist.
        Exception: If processing fails.
    """
    logger.info("=" * 60)
    logger.info("Starting risk factor extraction pipeline")
    logger.info("=" * 60)
    
    if use_sample:
        logger.info(f"Running in SAMPLE mode ({sample_size} articles)")
    else:
        logger.info("Running in FULL mode (all articles)")
    
    try:
        # Load data
        df_articles = load_filtered_data(data_dir, use_sample, sample_size)
        if df_articles.empty:
            logger.warning("No articles to process after loading.")
            return pd.DataFrame()
        total_articles = len(df_articles)
        
        # Load risk factors
        risk_factor_labels = load_risk_factors(data_dir)
        
        # Initialize models
        classifier = initialize_classifier(classifier_model)
        embedder = initialize_embedder(embedder_model)
        
        # Prepare sentences
        df_sentences = prepare_sentences(df_articles)
        if df_sentences.empty:
            logger.warning("No sentences to process after preparation.")
            return pd.DataFrame()
        total_sentences = len(df_sentences)
        
        # Filter relevant sentences
        df_filtered = filter_relevant_sentences(
            df_sentences, embedder, risk_factor_labels, sentence_similarity_threshold
        )
        filtered_sentences = len(df_filtered)
        
        # Classify sentences
        df_mentions = classify_sentences(
            df_filtered, classifier, risk_factor_labels, 
            classifier_confidence_threshold, classifier_batch_size
        )
        risk_mentions = len(df_mentions)
        
        # Refine results
        df_refined = refine_results(df_mentions)
        
        # Save results
        save_results(df_refined, data_dir, use_sample)
        
        # Display statistics
        stats = calculate_statistics(
            df_refined, total_articles, total_sentences, 
            filtered_sentences, risk_mentions
        )
        display_statistics(stats)
        
        logger.info("=" * 60)
        logger.info("Pipeline completed successfully!")
        logger.info("=" * 60)
        return df_refined
        
    except Exception as e:
        logger.error(f"Pipeline failed: {e}")
        raise


# --- SCRIPT EXECUTION ---

# Usage example - Sample mode (for testing)
# This will now correctly load your Arabic data and process a small sample.
# df_risk_mentions_sample = run_risk_extraction_pipeline(
#     data_dir='../data',
#     use_sample=True,
#     sample_size=20, # Increased sample size a bit
#     sentence_similarity_threshold=0.55,
#     classifier_confidence_threshold=0.90,
#     classifier_batch_size=128
# )

# # Display sample results if available
# if not df_risk_mentions_sample.empty:
#     print("\n--- Sample Risk Mentions ---")
#     print(df_risk_mentions_sample[['risk_factor', 'confidence_score', 'sentence_text']].head())
#     print("-" * 28)

# --- To run on the FULL dataset ---
# Make sure you have enough time and memory, as this will process all articles.

df_risk_mentions_full = run_risk_extraction_pipeline(
    data_dir='../data',
    use_sample=False,  # Set this to False to process all data
    # The 'sample_size' parameter will be ignored now
    sentence_similarity_threshold=0.55,
    classifier_confidence_threshold=0.90,
    classifier_batch_size=1024 # This value ran out of memory on the 22 GiB NVIDIA L4 (see the CUDA OOM error in the log below); lower it if memory is limited
)

# Display a few results from the full run
if not df_risk_mentions_full.empty:
    print("\n--- Full Run: Top Risk Mentions ---")
    print(df_risk_mentions_full[['risk_factor', 'confidence_score', 'sentence_text']].head())
    print("-" * 35)

2025-09-13 15:51:14,341 - INFO - Starting risk factor extraction pipeline
2025-09-13 15:51:14,343 - INFO - Running in FULL mode (all articles)
2025-09-13 15:51:17,134 - INFO - Loaded 84,970 English articles.
2025-09-13 15:51:21,288 - INFO - Loaded 77,147 Arabic articles.
2025-09-13 15:51:21,346 - INFO - Combined data: 162,117 total articles.
2025-09-13 15:51:21,895 - INFO - Loaded 167 risk factors
2025-09-13 15:51:21,948 - INFO - GPU available: NVIDIA L4
Device set to use cuda:0
2025-09-13 15:51:23,498 - INFO - Classifier initialized: MoritzLaurer/deberta-v3-xsmall-zeroshot-v1.1-all-33
2025-09-13 15:51:23,500 - INFO - GPU available: NVIDIA L4
2025-09-13 15:51:23,503 - INFO - Load pretrained SentenceTransformer: paraphrase-multilingual-MiniLM-L12-v2
2025-09-13 15:51:26,732 - INFO - Embedder initialized: paraphrase-multilingual-MiniLM-L12-v2
2025-09-13 15:51:34,140 - INFO - Prepared 3,897,406 sentences from articles
2025-09-13 15:51:34,241 - INFO - Pre-filtering 3,897,406 sentences...
2025-09-13 16:22:51,533 - INFO - Reduced to 151,225 relevant sentences (3.9% retention)
2025-09-13 16:22:51,990 - INFO - Classifying 151,225 sentences...
2025-09-13 16:22:51,991 - INFO - Confidence threshold: 0.9
2025-09-13 17:29:51,814 - ERROR - Pipeline failed: CUDA out of memory. Tried to allocate 3.00 GiB. GPU 0 has a total capacity of 22.28 GiB of which 2.36 GiB is free. Process 78491 has 19.91 GiB memory in use. Of the allocated memory 16.85 GiB is allocated by PyTorch, and 2.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


OutOfMemoryError: CUDA out of memory. Tried to allocate 3.00 GiB. GPU 0 has a total capacity of 22.28 GiB of which 2.36 GiB is free. Process 78491 has 19.91 GiB memory in use. Of the allocated memory 16.85 GiB is allocated by PyTorch, and 2.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
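
The run above failed during classification with `classifier_batch_size=1024`. In zero-shot classification each sentence is expanded into one premise-hypothesis pair per candidate label, so memory use grows quickly with the number of labels, the batch size, and the length of the longest sequence in a batch. One possible mitigation is to retry with a smaller batch whenever a memory error occurs. The sketch below is an assumption-laden illustration, not the fix used in this notebook: `run_with_backoff` and `process_batch` are hypothetical names, and the fake model only simulates a memory limit. It catches `RuntimeError` because `torch.cuda.OutOfMemoryError` is a subclass of it, which also means unrelated runtime errors would trigger a retry.

```python
from typing import Callable, List, Sequence


def run_with_backoff(items: Sequence[str],
                     process_batch: Callable[[Sequence[str]], List],
                     batch_size: int = 128,
                     min_batch_size: int = 1) -> List:
    """Process items in batches, halving the batch size after a memory error."""
    results: List = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(process_batch(batch))
            start += len(batch)  # advance only after a successful batch
        except (RuntimeError, MemoryError):
            if batch_size <= min_batch_size:
                raise  # cannot shrink further, surface the error
            batch_size = max(min_batch_size, batch_size // 2)
    return results


def fake_model(batch: Sequence[str]) -> List[str]:
    """Stand-in for a model call that fails when the batch is too large."""
    if len(batch) > 4:
        raise RuntimeError("simulated CUDA out of memory")
    return [text.upper() for text in batch]


sentences = [f"sentence {i}" for i in range(10)]
outputs = run_with_backoff(sentences, fake_model, batch_size=16)
print(len(outputs), outputs[:2])
```

With a real GPU model, calling `torch.cuda.empty_cache()` in the `except` branch before retrying may help release cached memory.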

I think this step could really be replaced with a single classification model that takes a sentence as input and classifies whether it matches one of the 167 risk factors or none of them :)
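
A rough count of model inputs shows why a single classifier would help. Zero-shot classification scores every sentence against every candidate label as a separate premise-hypothesis pair, whereas a single classifier with one output per risk factor needs one input per sentence. The counts below are taken from the pipeline log above; the single-classifier design is only the idea proposed in this note and is not implemented here.

```python
# Counts taken from the pipeline log above
filtered_sentences = 151_225
risk_factor_labels = 167

# Zero-shot NLI: one (sentence, label) pair per candidate label
zero_shot_inputs = filtered_sentences * risk_factor_labels

# Single classifier with one output per risk factor: one input per sentence
single_classifier_inputs = filtered_sentences

reduction = zero_shot_inputs // single_classifier_inputs

print(f"Zero-shot inputs:         {zero_shot_inputs:,}")
print(f"Single-classifier inputs: {single_classifier_inputs:,}")
print(f"Reduction factor:         {reduction}x")
```

The trade-off is that a dedicated classifier needs labeled training data, which zero-shot classification avoids.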

In [6]:
"""
Risk mention geotagging pipeline for multilingual news articles.
Extracts and resolves geographic locations from risk-related sentences using NER,
applying hierarchical logic to maximize precision.
Functional implementation without classes.
"""

import pandas as pd
import pickle
import torch
import logging
from pathlib import Path
from typing import Dict, List, Set, Tuple, Optional
from transformers import pipeline, Pipeline
from tqdm.auto import tqdm

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


def get_device() -> int:
    """
    Determine the best available device for processing.
    
    Returns:
        Device ID (0 for GPU, -1 for CPU)
    """
    if torch.cuda.is_available():
        logger.info(f"GPU available: {torch.cuda.get_device_name(0)}")
        return 0
    else:
        logger.info("No GPU found, using CPU")
        return -1


def load_risk_mentions(data_dir: str, sample_size: Optional[int] = None) -> pd.DataFrame:
    """
    Load risk mention sentences from processed data.
    
    Args:
        data_dir: Root directory containing data folders
        sample_size: Number of rows to process (None for all)
        
    Returns:
        DataFrame containing risk mentions
        
    Raises:
        FileNotFoundError: If risk mentions file doesn't exist
    """
    models_dir = Path(data_dir) / '03_models'
    risk_path = models_dir / 'risk_mentions_SAMPLE_FINAL.csv'
    
    try:
        df_risk = pd.read_csv(risk_path)
        
        # Apply sampling if specified
        if sample_size:
            df_risk = df_risk.head(sample_size).copy()
            logger.info(f"Sampling {sample_size} rows for processing")
        
        logger.info(f"Loaded {len(df_risk):,} risk mention sentences")
        return df_risk
        
    except FileNotFoundError as e:
        logger.error(f"Risk mentions file not found: {e}")
        raise


def load_articles(data_dir: str) -> pd.DataFrame:
    """
    Load and combine English and Arabic article datasets.
    
    Args:
        data_dir: Root directory containing data folders
        
    Returns:
        Combined DataFrame with article content
        
    Raises:
        FileNotFoundError: If article files don't exist
    """
    processed_dir = Path(data_dir) / '02_processed'
    eng_path = processed_dir / 'news_eng_processed.pkl'
    ara_path = processed_dir / 'news_ara_processed.pkl'
    
    try:
        # Load datasets
        df_eng = pd.read_pickle(eng_path)
        df_eng['language'] = 'english'
        
        df_ara = pd.read_pickle(ara_path)
        df_ara['language'] = 'arabic'
        
        # Combine datasets
        df_articles = pd.concat([df_eng, df_ara], ignore_index=True)
        
        # Assign article IDs after combining, so IDs are unique across both
        # languages and consistent with the extraction step, which derives
        # article_id from the combined index
        if 'article_id' not in df_articles.columns:
            df_articles['article_id'] = df_articles.index
        
        logger.info(f"Loaded {len(df_eng):,} English and {len(df_ara):,} Arabic articles")
        return df_articles
        
    except FileNotFoundError as e:
        logger.error(f"Article file not found: {e}")
        raise


def build_location_resolvers(data_dir: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """
    Build lookup dictionaries for location name resolution.
    
    Args:
        data_dir: Root directory containing raw data
        
    Returns:
        Tuple of (location_lookup, id_to_english_name) dictionaries
        
    Raises:
        FileNotFoundError: If location files don't exist
    """
    raw_dir = Path(data_dir) / '01_raw'
    eng_path = raw_dir / 'id_english_location_name.pkl'
    ara_path = raw_dir / 'id_arabic_location_name.pkl'
    
    try:
        # Load location dictionaries
        with open(eng_path, 'rb') as f:
            eng_locations = pickle.load(f)
        with open(ara_path, 'rb') as f:
            ara_locations = pickle.load(f)
        
        # Build name-to-ID lookup
        location_lookup = {}
        for location_dict in [eng_locations, ara_locations]:
            for loc_id, names in location_dict.items():
                for name in names:
                    location_lookup[name.lower()] = loc_id
        
        # Build ID-to-English name lookup
        id_to_english_name = {
            loc_id: names[0] 
            for loc_id, names in eng_locations.items()
        }
        
        logger.info(f"Built location resolvers with {len(location_lookup):,} aliases")
        
        return location_lookup, id_to_english_name
        
    except FileNotFoundError as e:
        logger.error(f"Location dictionary not found: {e}")
        raise


def initialize_ner_pipeline(model_name: str = "Babelscape/wikineural-multilingual-ner", 
                           device: Optional[int] = None) -> Pipeline:
    """
    Initialize the NER pipeline for entity extraction.
    
    Args:
        model_name: Name of the NER model to use
        device: Device ID (None for auto-detection)
        
    Returns:
        Initialized NER pipeline
        
    Raises:
        Exception: If model loading fails
    """
    if device is None:
        device = get_device()
        
    logger.info(f"Loading NER model: {model_name}")
    
    try:
        ner_pipeline = pipeline(
            "ner",
            model=model_name,
            aggregation_strategy="simple",
            device=device
        )
        logger.info("NER pipeline initialized successfully")
        return ner_pipeline
        
    except Exception as e:
        logger.error(f"Failed to load NER model: {e}")
        raise


def resolve_locations_from_entities(entities: List[Dict], 
                                   location_lookup: Dict[str, str]) -> List[str]:
    """
    Extract and resolve location IDs from NER entities.
    
    Args:
        entities: List of entity dictionaries from NER pipeline
        location_lookup: Dictionary mapping location names to IDs
        
    Returns:
        List of resolved location IDs
    """
    found_ids = set()
    
    for entity in entities:
        if entity.get('entity_group') == 'LOC':
            loc_name_lower = entity.get('word', '').lower()
            if loc_name_lower in location_lookup:
                found_ids.add(location_lookup[loc_name_lower])
    
    return list(found_ids)


def extract_article_locations(df_articles: pd.DataFrame,
                             ner_pipeline: Pipeline,
                             location_lookup: Dict[str, str],
                             batch_size: int = 128) -> pd.DataFrame:
    """
    Extract locations from article bodies using batch NER processing.
    
    Args:
        df_articles: DataFrame containing article bodies
        ner_pipeline: Initialized NER pipeline
        location_lookup: Dictionary mapping location names to IDs
        batch_size: Batch size for NER processing
        
    Returns:
        DataFrame with article_locations column added
    """
    logger.info("Extracting article-level locations...")
    
    # Prepare texts for processing
    article_bodies = df_articles['body'].fillna('').tolist()
    
    # Batch process with NER
    article_entities = ner_pipeline(
        article_bodies, 
        batch_size=batch_size
    )
    
    # Resolve locations for each article
    article_locations = [
        resolve_locations_from_entities(entities, location_lookup) 
        for entities in tqdm(article_entities, desc="Resolving article locations")
    ]
    
    df_articles = df_articles.copy()
    df_articles['article_locations'] = article_locations
    
    # Calculate statistics
    total_locations = sum(len(locs) for locs in article_locations)
    articles_with_locations = sum(1 for locs in article_locations if locs)
    
    logger.info(f"Found {total_locations:,} total locations in "
               f"{articles_with_locations:,} articles")
    
    return df_articles


def extract_sentence_locations(df_sentences: pd.DataFrame,
                              ner_pipeline: Pipeline,
                              location_lookup: Dict[str, str],
                              batch_size: int = 128) -> pd.DataFrame:
    """
    Extract locations from individual sentences using batch NER processing.
    
    Args:
        df_sentences: DataFrame containing sentence texts
        ner_pipeline: Initialized NER pipeline
        location_lookup: Dictionary mapping location names to IDs
        batch_size: Batch size for NER processing
        
    Returns:
        DataFrame with sentence_locations column added
    """
    logger.info("Extracting sentence-level locations...")
    
    # Prepare texts for processing
    sentence_texts = df_sentences['sentence_text'].fillna('').tolist()
    
    # Batch process with NER
    sentence_entities = ner_pipeline(
        sentence_texts,
        batch_size=batch_size
    )
    
    # Resolve locations for each sentence
    sentence_locations = [
        resolve_locations_from_entities(entities, location_lookup)
        for entities in tqdm(sentence_entities, desc="Resolving sentence locations")
    ]
    
    df_sentences = df_sentences.copy()
    df_sentences['sentence_locations'] = sentence_locations
    
    # Calculate statistics
    total_locations = sum(len(locs) for locs in sentence_locations)
    sentences_with_locations = sum(1 for locs in sentence_locations if locs)
    
    logger.info(f"Found {total_locations:,} total locations in "
               f"{sentences_with_locations:,} sentences")
    
    return df_sentences


def apply_hierarchical_logic(row: pd.Series) -> List[str]:
    """
    Apply hierarchical location selection logic.
    
    Priority order:
    1. Specific sentence locations (ID length > 2)
    2. Any sentence locations
    3. Specific article locations (ID length > 2)
    4. Any article locations
    
    Args:
        row: DataFrame row with location columns
        
    Returns:
        List of selected location IDs
    """
    # A left merge leaves NaN (not a list) when an article has no location data,
    # so treat anything that is not a collection as "no locations".
    def _as_list(value) -> List[str]:
        return list(value) if isinstance(value, (list, tuple, set)) else []

    sentence_locs = _as_list(row['sentence_locations'])
    article_locs = _as_list(row['article_locations'])

    # Sentence-level locations take priority over article-level locations
    for candidates in (sentence_locs, article_locs):
        # Prefer specific locations (ID length > 2); sort for a stable order
        specific = sorted({loc for loc in candidates if len(loc) > 2})
        if specific:
            return specific
        # Otherwise use any location found at this level
        if candidates:
            return candidates

    # No locations at either level
    return []


def merge_and_finalize(df_risk: pd.DataFrame,
                      df_articles: pd.DataFrame,
                      id_to_english_name: Dict[str, str]) -> pd.DataFrame:
    """
    Merge risk mentions with locations and apply hierarchical logic.
    
    Args:
        df_risk: DataFrame with risk mentions and sentence locations
        df_articles: DataFrame with article locations
        id_to_english_name: Dictionary mapping location IDs to English names
        
    Returns:
        Final DataFrame with one row per risk-location pair
    """
    logger.info("Applying hierarchical logic and finalizing data...")
    
    # Merge sentence and article data
    df_merged = pd.merge(
        df_risk,
        df_articles[['article_id', 'language', 'article_locations']],
        on='article_id',
        how='left'
    )
    
    # Apply hierarchical location selection
    df_merged['final_locations'] = df_merged.apply(
        apply_hierarchical_logic, 
        axis=1
    )
    
    # Explode to one row per location
    df_exploded = df_merged.explode('final_locations').rename(
        columns={'final_locations': 'location_id'}
    )
    
    # Remove rows without locations
    df_exploded = df_exploded.dropna(subset=['location_id'])
    
    # Add English location names
    df_exploded['location_name_english'] = df_exploded['location_id'].map(
        id_to_english_name
    )
    
    # Select and order final columns
    final_columns = [
        'article_id', 'date', 'language', 'sentence_text', 
        'risk_factor', 'confidence_score', 'location_id', 
        'location_name_english'
    ]
    df_final = df_exploded[final_columns].copy()
    
    logger.info(f"Created {len(df_final):,} risk-location pairs")
    
    return df_final


def save_geotagging_results(df_final: pd.DataFrame, data_dir: str) -> Path:
    """
    Save geotagged results to CSV file.
    
    Args:
        df_final: Final processed DataFrame
        data_dir: Root directory for saving data
        
    Returns:
        Path to saved file
        
    Raises:
        Exception: If saving fails
    """
    models_dir = Path(data_dir) / '03_models'
    models_dir.mkdir(parents=True, exist_ok=True)
    output_path = models_dir / 'risk_mentions_geotagged_FINAL.csv'
    
    try:
        df_final.to_csv(output_path, index=False)
        logger.info(f"Saved {len(df_final):,} geotagged risk mentions to: {output_path}")
        return output_path
        
    except Exception as e:
        logger.error(f"Error saving results: {e}")
        raise


def display_geotagging_sample(df: pd.DataFrame, n: int = 5) -> None:
    """
    Display sample results for verification.
    
    Args:
        df: Final DataFrame
        n: Number of samples to display
    """
    if df.empty:
        logger.warning("No data to display")
        return
    
    print("\n--- Sample Geotagged Risk Mentions ---")
    print(f"Showing {min(n, len(df))} of {len(df):,} total records:")
    print()
    
    sample = df.head(n)
    for idx, row in sample.iterrows():
        print(f"Risk: {row['risk_factor']} (confidence: {row['confidence_score']:.2f})")
        print(f"Location: {row['location_name_english']} ({row['location_id']})")
        print(f"Sentence: {row['sentence_text'][:100]}...")
        print(f"Language: {row['language']}")
        print("-" * 50)


def run_geotagging_pipeline(data_dir: str = '../data',
                           sample_size: Optional[int] = None,
                           batch_size: int = 128,
                           ner_model: str = "Babelscape/wikineural-multilingual-ner") -> pd.DataFrame:
    """
    Execute the complete geotagging pipeline.
    
    Args:
        data_dir: Root directory containing data folders
        sample_size: Number of rows to process (None for all)
        batch_size: Batch size for NER processing
        ner_model: Model name for NER pipeline
        
    Returns:
        Final geotagged DataFrame
        
    Raises:
        FileNotFoundError: If required files don't exist
        Exception: If processing fails
    """
    logger.info("Starting risk mention geotagging pipeline")
    
    if sample_size:
        logger.info(f"Running on SAMPLE of {sample_size} rows")
    
    # Step 1: Load all necessary data
    logger.info("Step 1: Loading data")
    df_risk = load_risk_mentions(data_dir, sample_size)
    df_articles = load_articles(data_dir)
    
    # Step 2: Build location resolvers
    logger.info("Step 2: Building location resolvers")
    location_lookup, id_to_english_name = build_location_resolvers(data_dir)
    
    # Step 3: Initialize NER pipeline
    logger.info("Step 3: Initializing NER pipeline")
    ner_pipeline = initialize_ner_pipeline(ner_model)
    
    # Step 4: Extract locations (hybrid geotagging)
    logger.info("Step 4: Performing hybrid geotagging")
    
    # Filter articles to those with risk mentions
    risk_article_ids = df_risk['article_id'].unique()
    df_articles_filtered = df_articles[
        df_articles['article_id'].isin(risk_article_ids)
    ][['article_id', 'body', 'language']].copy()
    
    logger.info(f"Processing {len(df_articles_filtered):,} articles with risk mentions")
    
    # Extract article-level locations
    df_articles_filtered = extract_article_locations(
        df_articles_filtered, ner_pipeline, location_lookup, batch_size
    )
    
    # Extract sentence-level locations
    df_risk = extract_sentence_locations(
        df_risk, ner_pipeline, location_lookup, batch_size
    )
    
    # Step 5: Merge and apply hierarchical logic
    logger.info("Step 5: Merging and finalizing data")
    df_final = merge_and_finalize(df_risk, df_articles_filtered, id_to_english_name)
    
    # Display sample for verification
    display_geotagging_sample(df_final)
    
    # Step 6: Save results
    logger.info("Step 6: Saving results")
    save_geotagging_results(df_final, data_dir)
    
    logger.info("Pipeline completed successfully")
    return df_final


# Run the pipeline on the full dataset (uncomment sample_size for a quick test run)
df_geotagged = run_geotagging_pipeline(
    data_dir='../data',
    # sample_size=100,  # Limits the number of rows processed; omit or pass None to process all data
    batch_size=128,
    ner_model="Babelscape/wikineural-multilingual-ner"
)

# Example of accessing the results
print(f"\nProcessed {len(df_geotagged):,} risk-location pairs")

# Analyze distribution
if not df_geotagged.empty:
    risk_distribution = df_geotagged['risk_factor'].value_counts()
    print(f"\nRisk factor distribution:")
    for risk, count in risk_distribution.head().items():
        print(f"  {risk}: {count:,}")

    location_distribution = df_geotagged['location_name_english'].value_counts()
    print(f"\nTop locations mentioned:")
    for location, count in location_distribution.head().items():
        print(f"  {location}: {count:,}")


# Alternative usage examples:

# Full processing mode
# df_geotagged_full = run_geotagging_pipeline(
#     data_dir='../data',
#     sample_size=None,  # Process all data
#     batch_size=256,
#     ner_model="Babelscape/wikineural-multilingual-ner"
# )

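# Custom configuration with a different batch size and model
# NOTE: this alternative model is fine-tuned on English CoNLL-03 only, so it is not
# suitable for Arabic text; it is shown only to illustrate swapping the model.
# df_geotagged_custom = run_geotagging_pipeline(
#     data_dir='../data',
#     sample_size=50,
#     batch_size=64,
#     ner_model="dbmdz/bert-large-cased-finetuned-conll03-english"  # English-only NER model
# )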

# Process specific number of records for testing
# df_geotagged_test = run_geotagging_pipeline(
#     data_dir='../data',
#     sample_size=20,
#     batch_size=32
# )

2025-09-13 14:34:32,535 - INFO - Starting risk mention geotagging pipeline
2025-09-13 14:34:32,537 - INFO - Step 1: Loading data
2025-09-13 14:34:32,735 - INFO - Loaded 34,275 risk mention sentences
2025-09-13 14:34:32,769 - INFO - Loaded 0 English and 352 Arabic articles
2025-09-13 14:34:32,770 - INFO - Step 2: Building location resolvers
2025-09-13 14:34:32,773 - INFO - Built location resolvers with 918 aliases
2025-09-13 14:34:32,773 - INFO - Step 3: Initializing NER pipeline
2025-09-13 14:34:32,774 - INFO - GPU available: NVIDIA L4
2025-09-13 14:34:32,775 - INFO - Loading NER model: Babelscape/wikineural-multilingual-ner
Device set to use cuda:0
2025-09-13 14:34:33,494 - INFO - NER pipeline initialized successfully
2025-09-13 14:34:33,495 - INFO - Step 4: Performing hybrid geotagging
2025-09-13 14:34:33,501 - INFO - Processing 123 articles with risk mentions
2025-09-13 14:34:33,502 - INFO - Extracting article-level locations...


Resolving article locations:   0%|          | 0/123 [00:00<?, ?it/s]

2025-09-13 14:34:38,480 - INFO - Found 204 total locations in 123 articles
2025-09-13 14:34:38,481 - INFO - Extracting sentence-level locations...


KeyboardInterrupt: 
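
Because the run above was interrupted before the merge step, the priority order described in `apply_hierarchical_logic` can be checked on a toy table instead. The sketch below is only an illustration under stated assumptions: the location IDs (`IQ`, `IQ-BG`, ...) and risk factor names are made up, and `select_locations` is a hypothetical stand-alone restatement of the rule, not the pipeline function itself. It assumes, as the docstring implies, that IDs longer than two characters denote specific (sub-national) locations.

```python
import pandas as pd


def select_locations(sentence_locs, article_locs):
    """Hypothetical restatement of the priority order:
    specific sentence > any sentence > specific article > any article."""
    for candidates in (sentence_locs, article_locs):
        # Specific locations are those whose ID is longer than 2 characters
        specific = sorted({loc for loc in candidates if len(loc) > 2})
        if specific:
            return specific
        if candidates:
            return sorted(set(candidates))
    return []


# Toy rows with made-up IDs; each row exercises one branch of the rule
toy = pd.DataFrame({
    "risk_factor": ["drought", "inflation", "displacement", "locusts"],
    "sentence_locations": [["IQ", "IQ-BG"], ["SY"], [], []],
    "article_locations": [["IQ"], ["SY", "SY-HL"], ["LB", "LB-BA"], []],
})

toy["location_id"] = [
    select_locations(s, a)
    for s, a in zip(toy["sentence_locations"], toy["article_locations"])
]

# One row per risk-location pair; rows without any location are dropped
pairs = (
    toy[["risk_factor", "location_id"]]
    .explode("location_id")
    .dropna(subset=["location_id"])
    .reset_index(drop=True)
)
print(pairs)
```

In this toy case the first row keeps only the specific sentence location, the second falls back to the country-level sentence location even though the article names a more specific place, the third uses the specific article location, and the fourth is dropped because no location was found at either level.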

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

# --- 1. Load Necessary Data ---
print("--- Step 1: Loading Processed Data ---")
DATA_DIR = '../data'
MODELS_DIR = os.path.join(DATA_DIR, '03_models')
RAW_DATA_DIR = os.path.join(DATA_DIR, '01_raw')

# Geotagged risk mentions produced by the geotagging step (notebook 05)
df_final_exploded = pd.read_csv(os.path.join(MODELS_DIR, 'risk_mentions_geotagged_FINAL.csv'))

# Mapping of risk factors to their pre-defined thematic clusters
df_clusters = pd.read_excel(os.path.join(RAW_DATA_DIR, 'risk-factors-categories.xlsx'))

# Geographically filtered articles, used later to normalize by daily article volume
df_geo_articles = pd.read_pickle(os.path.join(DATA_DIR, '02_processed', 'news_geographically_filtered.pkl'))

print("Data loaded successfully.")
print("-" * 30, "\n")

# --- 2. Prepare the Data ---
print("--- Step 2: Preparing Data for Aggregation ---")

# Sanity check: print the column names of the cluster mapping before merging
print("Columns in df_clusters (from risk-factors-categories.xlsx):")
print(df_clusters.columns)

# Ensure the 'date' column is in datetime format
df_final_exploded['date'] = pd.to_datetime(df_final_exploded['date'])

# Merge the risk mentions with their thematic clusters.
# This inner join assumes both frames share a 'risk_factor' column (check the printed
# columns above); risk mentions without a matching cluster are dropped.
df_merged = pd.merge(df_final_exploded, df_clusters, on='risk_factor')

print("\nMerged risk mentions with thematic clusters.")
print("-" * 30, "\n")


# --- 3. Aggregate Risk Mentions ---
print("--- Step 3: Aggregating Daily Risk Counts ---")

# Count mentions per day, location and thematic cluster (the 'cluster' column of df_clusters)
df_daily_counts = (
    df_merged
    .groupby([pd.Grouper(key='date', freq='D'), 'location_id', 'location_name_english', 'cluster'])
    .size()
    .reset_index(name='risk_mention_count')
)

print("Calculated raw daily counts of risk mentions per location and theme.")
display(df_daily_counts.head())
print("-" * 30, "\n")


# --- 4. Normalization (Crucial Step) ---
# To avoid bias from varying news volume, we normalize by the number of articles published per day.
# NOTE: This is a simplified normalization. A more advanced approach would be to get article counts *per location*,
# which would require re-running the geotagger on ALL articles, not just those with risks.
# For this assessment, normalizing by total daily articles is a reasonable simplification.

print("--- Step 4: Normalizing Risk Counts ---")
df_geo_articles['date'] = pd.to_datetime(df_geo_articles['date'])
daily_article_volume = df_geo_articles.groupby(pd.Grouper(key='date', freq='D')).size().reset_index(name='total_articles_published')

# Merge the risk counts with the total article volume for normalization
df_normalized = pd.merge(df_daily_counts, daily_article_volume, on='date')
df_normalized['normalized_risk'] = df_normalized['risk_mention_count'] / df_normalized['total_articles_published']

print("Normalized risk scores by total daily article volume.")
display(df_normalized.head())
print("-" * 30, "\n")


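# --- 5. Calculate Thematic and Composite Risk Indices ---
print("--- Step 5: Constructing Risk Indices ---")
# The 'normalized_risk' column already serves as the Thematic Risk Index for each theme.
# The Composite Risk Index (CRI) is the mean of the thematic indices per day and location.
# Note: the mean is taken only over themes with at least one mention that day,
# because themes with zero mentions have no row in df_normalized.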

df_cri = df_normalized.groupby(['date', 'location_id', 'location_name_english'])['normalized_risk'].mean().reset_index(name='composite_risk_index')

print("Calculated Composite Risk Index (CRI).")
display(df_cri.head())
print("-" * 30, "\n")


# --- 6. Save the Final Time-Series Data ---
print("--- Step 6: Saving Final Index Data ---")
OUTPUT_DIR = os.path.join(DATA_DIR, '04_feature')
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the thematic and composite indices
df_normalized.to_csv(os.path.join(OUTPUT_DIR, 'thematic_risk_indices.csv'), index=False)
df_cri.to_csv(os.path.join(OUTPUT_DIR, 'composite_risk_index.csv'), index=False)

print(f"Final time-series data saved to: {OUTPUT_DIR}")
print("-" * 30, "\n")


# --- 7. Example Visualization ---
print("--- Step 7: Example Visualization ---")
# Let's visualize the CRI for a specific location, e.g., Baghdad
location_to_plot = 'Baghdad'
df_plot = df_cri[df_cri['location_name_english'] == location_to_plot]

if not df_plot.empty:
    plt.style.use('seaborn-v0_8-whitegrid')
    fig, ax = plt.subplots(figsize=(15, 7))
    
    ax.plot(df_plot['date'], df_plot['composite_risk_index'], marker='o', linestyle='-', label='Composite Risk Index')
    ax.set_title(f"Daily Composite Risk Index for {location_to_plot}", fontsize=16)
    ax.set_ylabel("Risk Index (Normalized Score)")
    ax.set_xlabel("Date")
    ax.legend()
    plt.tight_layout()
    plt.show()
else:
    print(f"No data found for location: {location_to_plot}")
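
The arithmetic behind Steps 4 and 5 can be followed on a small made-up example. The sketch below is not the notebook's data: the dates, the location ID `IQ-BG`, the cluster names and all counts are invented for illustration. It reproduces the normalization by daily article volume and the per-theme mean, and contrasts that mean with a zero-filled variant in which themes without mentions count as 0. The zero-filled variant is an assumption about one possible alternative, not something the cell above computes.

```python
import pandas as pd

# Made-up daily counts for one location and two hypothetical themes
counts = pd.DataFrame({
    "date": pd.to_datetime(["2024-06-01", "2024-06-01", "2024-06-02"]),
    "location_id": ["IQ-BG", "IQ-BG", "IQ-BG"],
    "cluster": ["conflict", "economy", "conflict"],
    "risk_mention_count": [6, 2, 3],
})

# Made-up total number of articles published per day
volume = pd.DataFrame({
    "date": pd.to_datetime(["2024-06-01", "2024-06-02"]),
    "total_articles_published": [20, 10],
})

# Step 4: thematic index = mentions / total articles that day
norm = counts.merge(volume, on="date", how="left")
norm["normalized_risk"] = (
    norm["risk_mention_count"] / norm["total_articles_published"]
)

# Step 5 as in the cell above: mean over the themes that have a row that day
cri_observed = norm.groupby(["date", "location_id"])["normalized_risk"].mean()

# Alternative (assumption): treat themes without mentions as 0 before averaging
wide = norm.pivot_table(
    index=["date", "location_id"],
    columns="cluster",
    values="normalized_risk",
    aggfunc="sum",
    fill_value=0.0,
)
cri_zero_filled = wide.mean(axis=1)

print(pd.DataFrame({
    "cri_observed": cri_observed,
    "cri_zero_filled": cri_zero_filled,
}))
```

On the second toy day only one theme is mentioned, so the observed-theme mean equals that theme's index while the zero-filled mean is half of it. Which behavior is preferable depends on whether silence on a theme should lower the composite score.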

---

# Part 2: Reflection

Please outline (1) some of the limitations of your approach and (2) how you would tackle these if you had more time.