# Feature Engineering: Semantic Pre-selection

This notebook is the "coarse" filtering stage of our coarse-to-fine analysis pipeline. After the text cleaning in the previous step, it reduces the dataset size with a semantic similarity search.

The goal is to filter our corpus of ~172,000 articles down to a smaller, high-relevance subset, so that the most computationally expensive model in the next stage runs only on articles that are conceptually related to food security risks. This cuts compute cost substantially while aiming to retain the articles that matter for the analysis.

The feature engineering pipeline consists of the following steps:

1.  **Load Processed Data**: Import the clean, sentence-tokenized datasets from the previous preprocessing stage.
2.  **Semantic Pre-selection Filtering**: Use a lightweight, multilingual sentence-embedding model (`paraphrase-multilingual-MiniLM-L12-v2`) to perform a high-speed similarity search between all articles and the 167 known risk factors.
3.  **Save Filtered Data**: Store the final, smaller, and analysis-ready datasets for the modeling stage.

### 1. Load Processed Data

In [8]:
import pandas as pd
import os
from sentence_transformers import SentenceTransformer, util
import torch
import numpy as np
from tqdm.auto import tqdm

# --- Load the datasets ---
DATA_DIR = '../data'
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, '02_processed')

df_eng = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed.pkl'))
df_ara = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed.pkl'))

print("Processed data loaded successfully.")
print(f"  - English: {len(df_eng):,} articles")
print(f"  - Arabic:  {len(df_ara):,} articles")

Processed data loaded successfully.
  - English: 86,660 articles
  - Arabic:  85,511 articles
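
The two counts above determine how much work the encoding step has to do. As a small sanity check (a sketch, not part of the original pipeline; `batch_count` is a hypothetical helper), the following computes the combined corpus size and the number of encoding batches it implies for a batch size of 256, the value used in the next section:

```python
import math

def batch_count(n_items, batch_size):
    """Number of batches needed to cover n_items (the last batch may be smaller)."""
    return math.ceil(n_items / batch_size)

# Article counts as printed by the loading cell above.
n_eng, n_ara = 86_660, 85_511
n_total = n_eng + n_ara

# 256 is the ENCODE_BATCH_SIZE used in the next section.
n_batches = batch_count(n_total, 256)
print(f"{n_total:,} articles -> {n_batches} batches of up to 256")
```

The result should agree with the total shown by the article-encoding progress bar in the next section.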


### 2. Pre-selection: Semantic Filtering

To handle the large dataset efficiently, we use a "coarse-to-fine" strategy. This step is the "coarse" filter: a fast semantic search identifies a smaller, more relevant subset of articles, so the much slower, more resource-intensive classification model does not have to run on the entire dataset.

The process involves:

  * Loading a lightweight, multilingual sentence-embedding model that converts text into numerical vectors representing its meaning.
  * Encoding both the 167 English risk factors and all article bodies into this shared, multilingual "meaning space."
  * Performing a high-speed similarity search to find articles whose meaning is conceptually close to the risk factors.
  * Keeping only the articles whose best-match similarity score exceeds a threshold (0.40 in the code below), which yields the final, smaller dataset for the modeling stage.
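
The filtering logic in these steps can be illustrated without the embedding model. The sketch below is a dependency-free toy version, assuming cosine similarity as the score and using made-up 2-dimensional vectors in place of real embeddings; the function names are hypothetical and do not appear in the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def max_similarity(article_vec, risk_vecs):
    """Score of the best-matching risk factor for one article."""
    return max(cosine_similarity(article_vec, r) for r in risk_vecs)

def select_relevant(article_vecs, risk_vecs, threshold):
    """Indices of articles whose best-match score exceeds the threshold."""
    return [
        i for i, vec in enumerate(article_vecs)
        if max_similarity(vec, risk_vecs) > threshold
    ]

# Toy data: two "risk factor" vectors and three "article" vectors (illustrative only).
toy_risk_vecs = [[1.0, 0.0], [0.0, 1.0]]
toy_article_vecs = [[0.9, 0.1], [-1.0, -1.0], [0.5, 0.5]]

kept = select_relevant(toy_article_vecs, toy_risk_vecs, threshold=0.40)
print(kept)
```

In this toy example the second article points away from both risk factors and is dropped, while the other two are kept. The notebook's real code performs the same kind of comparison in bulk on the model's embeddings.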

In [None]:
print("\nStarting semantic pre-selection to filter for relevant articles...")

# --- Load a lightweight, multilingual model for fast semantic search ---
print("Loading semantic search model...")
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# --- Load the English risk factors to search for ---
risk_factors_path = os.path.join(DATA_DIR, '01_raw/risk-factors.xlsx')
df_risk_factors = pd.read_excel(risk_factors_path)
risk_factor_labels_eng = df_risk_factors['risk_factor_english'].tolist()

# --- Convert the risk factors into meaning vectors (embeddings) ---
print("Encoding risk factors...")
risk_factor_embeddings = embedder.encode(
    risk_factor_labels_eng, 
    convert_to_tensor=True, 
    show_progress_bar=True
)

# --- Convert all article bodies into meaning vectors ---
# Combine English and Arabic articles into one list for batch processing
print("Encoding all article bodies (this may take some time)...")
all_article_bodies = pd.concat([df_eng['body_cleaned'], df_ara['body_cleaned']], ignore_index=True)

# Use a larger batch size to better utilize the GPU.
# Tune this value to fit your GPU's VRAM; 128 or 256 is a reasonable starting point.
ENCODE_BATCH_SIZE = 256

corpus_embeddings = embedder.encode(
    all_article_bodies.tolist(),
    batch_size=ENCODE_BATCH_SIZE,
    convert_to_tensor=True, 
    show_progress_bar=True
)
print(f"Encoding complete. Used a batch size of {ENCODE_BATCH_SIZE}.")


# --- Perform the high-speed semantic search ---
print("Performing semantic search to find relevant articles...")

# Pass the articles as queries and the risk factors as the search corpus,
# so that every article gets a score for its best-matching risk factor.
hits = util.semantic_search(corpus_embeddings, risk_factor_embeddings, top_k=1)

# --- Extract the similarity scores ---
# The result is a list of lists: one inner list per article, holding its top_k best-matching risk factors.
# Since top_k=1, the first (and only) hit carries the article's maximum similarity score.
max_similarity_scores = [hit[0]['score'] for hit in hits]

# --- Filter articles based on the similarity threshold ---

# The threshold is the key parameter to tune: a higher value is more selective
# and passes fewer articles on to the next stage.
SIMILARITY_THRESHOLD = 0.40

# Create a boolean mask for filtering
relevant_mask = np.array(max_similarity_scores) > SIMILARITY_THRESHOLD
relevant_indices = np.where(relevant_mask)[0].tolist()

print(f"\nFound {len(relevant_indices):,} potentially relevant articles with a threshold of {SIMILARITY_THRESHOLD}.")

# --- Create the final filtered DataFrames ---
# Concatenate in the same order as all_article_bodies so that row positions line up with the scores
df_all = pd.concat([df_eng, df_ara], ignore_index=True)
df_filtered = df_all.iloc[relevant_indices].copy()

# Split back into English and Arabic dataframes (relies on a 'lang' column with values 'eng' / 'ara')
df_eng_filtered = df_filtered[df_filtered['lang'] == 'eng']
df_ara_filtered = df_filtered[df_filtered['lang'] == 'ara']

print("\nSemantic pre-selection filtering complete.")
print(f"  - English: Kept {len(df_eng_filtered):,} of {len(df_eng):,} articles.")
print(f"  - Arabic:  Kept {len(df_ara_filtered):,} of {len(df_ara):,} articles.")


Starting semantic pre-selection to filter for relevant articles...
Loading semantic search model...


Encoding risk factors...


Batches:   0%|          | 0/6 [00:00<?, ?it/s]

Encoding all article bodies (this may take some time)...


Batches:   0%|          | 0/673 [00:00<?, ?it/s]
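
Since the similarity threshold is the main tuning knob, it can help to see how many articles would survive at several candidate values before settling on one. The following is a minimal sketch with made-up scores; `kept_counts` is a hypothetical helper, and in the notebook the real `max_similarity_scores` list would take the place of `toy_scores`.

```python
def kept_counts(scores, thresholds):
    """Map each threshold to the number of scores strictly above it."""
    return {t: sum(1 for s in scores if s > t) for t in thresholds}

# Illustrative scores only; these are not results from the notebook.
toy_scores = [0.12, 0.35, 0.41, 0.58, 0.77]

counts = kept_counts(toy_scores, [0.30, 0.40, 0.50])
for threshold, n_kept in counts.items():
    print(f"threshold {threshold:.2f}: keeps {n_kept} of {len(toy_scores)}")
```

The comparison is strict (`>`), matching the mask used in the filtering cell, so a score exactly equal to the threshold is not kept.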

### 3. Save Filtered Data

In [5]:
# --- Define output paths ---
output_path_eng = os.path.join(PROCESSED_DATA_DIR, 'news_eng_processed_filtered.pkl')
output_path_ara = os.path.join(PROCESSED_DATA_DIR, 'news_ara_processed_filtered.pkl')

# --- Save the smaller, filtered dataframes ---
df_eng_filtered.to_pickle(output_path_eng)
df_ara_filtered.to_pickle(output_path_ara)

print(f"\nProcessed and FILTERED English data saved to: {output_path_eng}")
print(f"Processed and FILTERED Arabic data saved to: {output_path_ara}")


Processed and FILTERED English data saved to: ../data/02_processed/news_eng_processed_filtered.pkl
Processed and FILTERED Arabic data saved to: ../data/02_processed/news_ara_processed_filtered.pkl
