# Feature Engineering: Semantic Pre-selection

This notebook is the "coarse" filtering stage of our "coarse-to-fine" analysis pipeline. Following the text cleaning in the previous step, it reduces the dataset by running a semantic search against the known risk factors and keeping only the articles that score above a similarity threshold.

The goal is to filter our corpus of ~172,000 articles down to a smaller, high-relevance subset, so that the most computationally expensive model in the next stage runs only on articles that are conceptually related to food security risks. This reduces compute cost; how much relevant material is retained depends on the similarity threshold chosen in the filtering step.

The feature engineering pipeline consists of the following steps:

1.  **Load Processed Data**: Import the clean, sentence-tokenized datasets from the previous preprocessing stage.
2.  **Semantic Pre-selection Filtering**: Use a lightweight, multilingual sentence-embedding model (`paraphrase-multilingual-MiniLM-L12-v2`) to perform a high-speed similarity search between all articles and the 167 known risk factors.
3.  **Save Filtered Data**: Store the final, smaller, and analysis-ready datasets for the modeling stage.

### 1. Load Processed Data

In [None]:
import pandas as pd
import os
from sentence_transformers import SentenceTransformer, util
import torch
import numpy as np
from tqdm.auto import tqdm

# --- Load the datasets ---
DATA_DIR = '../data'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')

df_eng = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed.pkl'))
df_ara = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed.pkl'))

print("Processed data loaded successfully.")
print(f"  - English: {len(df_eng):,} articles")
print(f"  - Arabic:  {len(df_ara):,} articles")
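
The later cells rely on the `body_cleaned` and `lang` columns, so a small check right after loading can surface a schema mismatch early. The following is a minimal sketch, not part of the original pipeline: the helper name `missing_columns` and the toy DataFrame are hypothetical and only illustrate the idea.

```python
import pandas as pd

REQUIRED_COLUMNS = ("body_cleaned", "lang")  # columns used by the later cells


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Toy frame standing in for df_eng / df_ara (hypothetical contents).
toy = pd.DataFrame({"body_cleaned": ["Drought reduces the wheat harvest."]})
print(missing_columns(toy))
```

In the notebook, `missing_columns(df_eng)` and `missing_columns(df_ara)` should both return an empty list before the filtering step is run.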

### 2. Pre-selection: Semantic Filtering

To handle the large dataset efficiently, we use a "coarse-to-fine" strategy. This step is the "coarse" filter: a fast semantic search identifies a smaller, more relevant subset of articles, so the much slower, more resource-intensive classification model does not have to run on the entire dataset.

The process involves:

  * Loading a lightweight, multilingual sentence-embedding model that converts text into numerical vectors representing its meaning.
  * Encoding both the 167 English risk factors and all article bodies into this shared, multilingual "meaning space."
  * Performing a high-speed similarity search to find articles whose meaning is conceptually close to the risk factors.
  * Keeping only the articles whose highest similarity to any risk factor exceeds a threshold, which yields the final, smaller dataset for the modeling stage.
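
The scoring rule behind the last two bullets can be sketched with toy vectors. This is a minimal illustration using NumPy rather than the notebook's embedding model: the 3-dimensional vectors and the helper name `max_cosine_similarity` are made up for the example, and real sentence embeddings have far more dimensions.

```python
import numpy as np


def max_cosine_similarity(article_vecs, risk_vecs):
    """Return, for each article, its highest cosine similarity to any risk factor."""
    # Normalise rows to unit length so the dot product equals cosine similarity.
    a = article_vecs / np.linalg.norm(article_vecs, axis=1, keepdims=True)
    r = risk_vecs / np.linalg.norm(risk_vecs, axis=1, keepdims=True)
    # Shape (n_articles, n_risk_factors): one score per article/risk-factor pair.
    similarity = a @ r.T
    return similarity.max(axis=1)


# Toy "embeddings" (hypothetical values, for illustration only).
risk_vecs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
article_vecs = np.array([[2.0, 0.0, 0.0],   # same direction as risk factor 0
                         [0.0, 0.0, 5.0],   # orthogonal to both risk factors
                         [1.0, 1.0, 0.0]])  # halfway between the two

scores = max_cosine_similarity(article_vecs, risk_vecs)
keep = scores > 0.25  # same form as the SIMILARITY_THRESHOLD comparison
print(scores.round(3))
print(keep)
```

Each article receives one score, its best match across all risk factors, so an article is kept if it is close to at least one risk factor, regardless of how it scores against the others.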

In [None]:
print("\nStarting semantic pre-selection to filter for relevant articles...")

# --- Load a lightweight, multilingual model for fast semantic search ---
print("Loading semantic search model...")
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# --- Load the English risk factors to search for ---
risk_factors_path = os.path.join(DATA_DIR, '01_raw/risk-factors.xlsx')
df_risk_factors = pd.read_excel(risk_factors_path)
risk_factor_labels_eng = df_risk_factors['risk_factor_english'].tolist()

# --- Convert the risk factors into meaning vectors (embeddings) ---
print("Encoding risk factors...")
risk_factor_embeddings = embedder.encode(risk_factor_labels_eng, convert_to_tensor=True, show_progress_bar=True)

# --- Convert all article bodies into meaning vectors ---
# Combine English and Arabic articles into one list for batch processing
print("Encoding all article bodies (this may take some time)...")
all_article_bodies = pd.concat([df_eng['body_cleaned'], df_ara['body_cleaned']], ignore_index=True)
corpus_embeddings = embedder.encode(all_article_bodies.tolist(), convert_to_tensor=True, show_progress_bar=True)

# --- Perform the high-speed semantic search ---
print("Performing semantic search to find relevant articles...")
# The articles are the queries and the risk factors are the corpus, so top_k=1
# returns the single most similar risk factor for each article.
# (With the arguments the other way round, top_k=1 would return only one
# article per risk factor, so at most 167 articles could ever receive a score.)
hits = util.semantic_search(corpus_embeddings, risk_factor_embeddings, top_k=1)

# --- Calculate the maximum similarity score for each article ---
# hits[i][0] is the best-matching risk factor for article i; its score is that
# article's maximum similarity across all risk factors.
max_similarity_scores = torch.tensor([hit_list[0]['score'] for hit_list in hits])

# --- Filter articles based on the similarity threshold ---
# This is the key parameter to tune. A lower value keeps more articles.
# 0.25 is a good starting point to balance speed and recall.
SIMILARITY_THRESHOLD = 0.25
# flatten() always yields a 1-D tensor, so tolist() returns a list even when
# zero or one article passes the threshold.
relevant_indices = torch.nonzero(max_similarity_scores > SIMILARITY_THRESHOLD).flatten().tolist()

print(f"\nFound {len(relevant_indices):,} potentially relevant articles with a threshold of {SIMILARITY_THRESHOLD}.")

# --- Create the final filtered DataFrames ---
# Combine the original dataframes to easily select rows by their original index
df_all = pd.concat([df_eng, df_ara], ignore_index=True)
df_filtered = df_all.iloc[relevant_indices].copy()

# Split back into English and Arabic dataframes
df_eng_filtered = df_filtered[df_filtered['lang'] == 'eng']
df_ara_filtered = df_filtered[df_filtered['lang'] == 'ara']

print("\nSemantic pre-selection filtering complete.")
print(f"  - English: Kept {len(df_eng_filtered):,} of {len(df_eng):,} articles.")
print(f"  - Arabic:  Kept {len(df_ara_filtered):,} of {len(df_ara):,} articles.")
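
Because the threshold is the main tuning knob, it can help to see how many articles survive at several candidate values before committing to one. The sketch below uses randomly generated scores as a stand-in for `max_similarity_scores`; the helper name `retention_by_threshold` and the candidate thresholds are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np


def retention_by_threshold(scores, thresholds):
    """Map each threshold to the number of scores strictly above it."""
    scores = np.asarray(scores)
    return {t: int((scores > t).sum()) for t in thresholds}


# Stand-in scores; in the notebook this would be max_similarity_scores.numpy().
rng = np.random.default_rng(seed=0)
fake_scores = rng.uniform(0.0, 0.8, size=1_000)

counts = retention_by_threshold(fake_scores, [0.15, 0.25, 0.35, 0.45])
for threshold, kept in counts.items():
    print(f"threshold {threshold:.2f}: kept {kept:,} of {len(fake_scores):,}")
```

The counts can only shrink as the threshold rises, so a table like this shows directly how much data each candidate value would pass on to the modeling stage.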

### 3. Save Filtered Data

In [None]:
# --- Define output paths ---
output_path_eng = os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed_filtered.pkl')
output_path_ara = os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed_filtered.pkl')

# --- Save the smaller, filtered dataframes ---
df_eng_filtered.to_pickle(output_path_eng)
df_ara_filtered.to_pickle(output_path_ara)

print(f"\nProcessed and FILTERED English data saved to: {output_path_eng}")
print(f"Processed and FILTERED Arabic data saved to: {output_path_ara}")