# bnrs_algorithm: preprocessing

This notebook contains the preprocessing steps required to prepare the data for the BNRS recommendation algorithm.

## 0. setup

Suggested (conda) environment setup:

```bash
# Create base env with conda
conda create -n bnrs_algorithm python=3.11 numpy pandas scikit-learn networkx tqdm nltk -c conda-forge
conda activate bnrs_algorithm

# Install PyTorch (CPU-only example):
conda install pytorch -c pytorch

# Install packages from PyPI
pip install sentence-transformers keybert keyphrase-vectorizers spacy datasets

# Install spaCy models used in the notebook
python -m spacy download en_core_web_lg

# Download NLTK stopwords
python -c "import nltk; nltk.download('stopwords')"
```

In [1]:
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
from collections import defaultdict
import nltk
import torch
import spacy
import networkx as nx
from sentence_transformers import SentenceTransformer
from keyphrase_vectorizers import KeyphraseCountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#NOTE: we use a custom wrapper for KeyBERT to return embeddings alongside phrases and scores:
#(this is in utils/keybert_return_embeds.py)
from utils.keybert_return_embeds import KeyBERTEmbeddings



## 1. load dataset

In [2]:
"""
Loads news corpus for pre-processing. 

Note: This pre-processing workflow assumes your article data has already been cleaned using typical steps 
(e.g., removing special characters, standardizing whitespace, handling missing values, etc.). 
The CSV file should contain 'title' and 'article' columns with the cleaned article text.

In this example, we load one random day of news articles from the 'all-the-news' dataset.
The dataset is available at: https://huggingface.co/datasets/rjac/all-the-news-2-1-Component-one

"""

In [3]:
#download example dataset (AllTheNews2.1; ~8.8GB)
full_dataset = load_dataset("rjac/all-the-news-2-1-Component-one", split="train", cache_dir="./data")

In [4]:
#randomly select a row index and extract year, month, day values
random_index = random.randint(0, len(full_dataset) - 1)
random_row = full_dataset[random_index]

year, month, day = random_row['year'], random_row['month'], random_row['day']
print(f"Random row index: {random_index}")
print(f"Year: {year}")
print(f"Month: {month}")
print(f"Day: {day}")

Random row index: 772965
Year: 2018
Month: 4.0
Day: 4


In [5]:
# Filter the dataset to get rows matching those values
filtered_dataset = full_dataset.filter(
    lambda row: row['year'] == year and row['month'] == month and row['day'] == day
)
news_df = filtered_dataset.to_pandas()

print(f"Number of matching articles for {year}-{month}-{day}: {len(news_df)}")

Number of matching articles for 2018-4.0-4: 2052
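
The count line above prints `2018-4.0-4` because the `month` field comes back as a float. A minimal sketch of a zero-padded label, assuming the three values are always whole numbers (the helper name `format_day` is hypothetical and not part of the notebook):

```python
def format_day(year, month, day):
    """Return a YYYY-MM-DD label from numeric date parts."""
    # int() drops the trailing ".0" of float-typed fields such as month=4.0
    return f"{int(year):04d}-{int(month):02d}-{int(day):02d}"


print(format_day(2018, 4.0, 4))  # 2018-04-04
```

This only changes how the date is displayed; the filter itself still compares the raw field values.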


In [6]:
# Create 'docs' column by combining title and text
news_df['docs'] = news_df['title'] + ' ' + news_df['article']

#filter for only necessary columns
news_df = news_df[['date', 'publication', 'title', 'article', 'docs']]

#discard any rows with missing values in these columns:
news_df = news_df.dropna(subset=['date', 'publication', 'title', 'article']).copy()
print(f"Number of articles after dropping missing values: {len(news_df)}")

#keep only first 10 chars of date strings (drop time):
news_df['date'] = news_df['date'].astype(str).str[:10]

# Create unique ID for each article as hash of docs
news_df['id'] = news_df['docs'].apply(lambda x: abs(hash(x)))
news_df.set_index('id', inplace=True)
news_df.head(3)

Number of articles after dropping missing values: 2035


| id | date | publication | title | article | docs |
|---|---|---|---|---|---|
| 4768774509216352572 | 2018-04-04 | Reuters | Brazil soy exporters set to win big from U.S.-... | SAO PAULO (Reuters) - China’s move to slap tar... | Brazil soy exporters set to win big from U.S.-... |
| 5644548799481183751 | 2018-04-04 | Vox | Mueller: Trump’s not a target. 4 theories on w... | The new report that special counsel Robert Mue... | Mueller: Trump’s not a target. 4 theories on w... |
| 3432312715945854947 | 2018-04-04 | Reuters | Trump to order National Guard to protect borde... | WASHINGTON (Reuters) - President Donald Trump ... | Trump to order National Guard to protect borde... |
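
Note that Python's built-in `hash()` is salted per interpreter process for strings (unless `PYTHONHASHSEED` is fixed), so the ids shown above will differ between sessions. If reproducible ids are needed, one option is a digest-based id. The sketch below is an assumption, not what the notebook does: the name `stable_id` and the choice to keep 8 bytes of a SHA-256 digest are illustrative.

```python
import hashlib


def stable_id(text: str) -> int:
    """Derive a reproducible 63-bit integer id from document text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep the first 8 bytes and drop one bit so the value fits a signed int64
    return int.from_bytes(digest[:8], "big") >> 1


doc = "Brazil soy exporters set to win big"
assert stable_id(doc) == stable_id(doc)  # same text gives the same id in any session
```

It could be applied with `news_df['docs'].apply(stable_id)` in place of the `abs(hash(x))` lambda; the resulting ids would not match the ones displayed in this notebook's outputs.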


## 3. load `models`

#### Configure Device - 

In [7]:
# Set device preference order: GPU (CUDA) > MPS (Apple Silicon GPU) > CPU
device = (torch.device("cuda") if torch.cuda.is_available()
         else torch.device("mps") if torch.backends.mps.is_available()
         else torch.device("cpu"))
print(f"Using device: {device}")

Using device: mps


#### Named Entity Recognition - 

In [8]:
"""
Note: Select the appropriate spaCy model for your language.
For languages other than English, ensure you have installed the corresponding model
(e.g., 'nl_core_news_lg' for Dutch, 'de_core_news_lg' for German).
See spaCy's model documentation for a complete list of available models.
"""

def load_spacy_model(model_name="en_core_web_lg"):
   """
   Load a spaCy model.
   
   Args:
       model_name (str): Name of the spaCy model to load (default: "en_core_web_lg")
                        Examples: "nl_core_news_lg", "de_core_news_lg"
   
   Returns:
       spacy model: Loaded spaCy model
   """
   try:
       model = spacy.load(model_name)
       print(f"Loaded spaCy model: {model_name}")
       return model
   except OSError as e:
       # Chain the original error so the underlying cause stays visible
       raise OSError(f"Model '{model_name}' not found. Make sure to install it first: "
                    f"python -m spacy download {model_name}") from e

# Load the English model (change model_name for other languages)
ner_model = load_spacy_model(model_name="en_core_web_lg")

Loaded spaCy model: en_core_web_lg


#### Document Embedding - 

In [9]:
"""
Load a Sentence Transformer model for computing document embeddings.
Either:

1. Load the newsSimilarity model (default), which is optimized for article similarity
  Note: Requires downloading the weights file 'state_dict.tar' from 
  https://huggingface.co/Blablablab/newsSimilarity
2. Load any other model from the Sentence Transformers library.

For available models, see:
- https://huggingface.co/Blablablab/newsSimilarity (default newsSimilarity model)
- https://www.sbert.net/docs/pretrained_models.html (other Sentence Transformer models)
"""

def load_sentence_transformer(model_name=None, weights_path=None, device=None):
   """
   Load a Sentence Transformer model.
   
   Args:
       model_name (str, optional): Name of specific model to load from Sentence Transformers.
           If None, loads the newsSimilarity model (optimized for article similarity).
       weights_path (str, optional): Path to newsSimilarity weights file (state_dict.tar).
           Required if using the default newsSimilarity model.
       device (torch.device, optional): Device to load the model on.
           If None, uses the default device.
   
   Returns:
       SentenceTransformer: Loaded model
   """
   if model_name:
       try:
           model = SentenceTransformer(model_name, device=device)
           print(f"Loaded custom model: {model_name}")
       except Exception as e:
           raise Exception(f"Error loading model '{model_name}'. "
                         f"Check if the model name is correct: {str(e)}")
   else:
       if not weights_path:
           raise ValueError("weights_path must be specified for newsSimilarity model. "
                          "Download state_dict.tar from the model's HuggingFace page.")
       try:
           # Load the base model
           model = SentenceTransformer("all-mpnet-base-v2", device=device)
           
           # Load the NewsSimilarity weights
           state_dict = torch.load(weights_path, map_location=device)
           
           # Change naming convention to fit model
           state_dict_new = {key.replace("model.", ""): value for key, value in state_dict.items()}
           if "embeddings.position_ids" in state_dict_new:
               del state_dict_new["embeddings.position_ids"]
               
           # Add weights to model
           model._first_module().auto_model.load_state_dict(state_dict_new)
           print("Loaded newsSimilarity model (Litterer et al. 2023)")
           
       except Exception as e:
           raise Exception(f"Error loading newsSimilarity model: {str(e)}")
   
   return model

# Load newsSimilarity model (edit weights_path to location of state_dict.tar)
# sent_model = load_sentence_transformer(weights_path="path/to/state_dict.tar", device=device)

# Or load a specified model from HuggingFace: 
sent_model = load_sentence_transformer(model_name="sentence-transformers/all-MiniLM-L6-v2", device=device)

Loaded custom model: sentence-transformers/all-MiniLM-L6-v2
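
The newsSimilarity branch of the loader renames checkpoint keys by removing `"model."` and drops `embeddings.position_ids` before calling `load_state_dict`. The sketch below illustrates that renaming step on a toy dictionary. The key names are made up for illustration and are not read from the real `state_dict.tar`, and `rename_keys` is a hypothetical helper. It strips the prefix only at the start of each key, whereas `str.replace` removes every occurrence; whether that difference matters for the real checkpoint is not something this notebook shows.

```python
def rename_keys(state_dict, prefix="model."):
    """Strip a leading prefix from checkpoint keys and drop position ids."""
    renamed = {}
    for key, value in state_dict.items():
        # Only remove the prefix when it appears at the start of the key
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        renamed[new_key] = value
    # Mirror the notebook: this buffer is removed before load_state_dict
    renamed.pop("embeddings.position_ids", None)
    return renamed


# Toy stand-in for a checkpoint (values would normally be tensors)
toy = {
    "model.embeddings.position_ids": 0,
    "model.embeddings.word_embeddings.weight": 1,
    "model.encoder.layer.0.attention.weight": 2,
}
print(sorted(rename_keys(toy)))
# ['embeddings.word_embeddings.weight', 'encoder.layer.0.attention.weight']
```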


## 4. detect `named entities`

In [10]:
#generate a list of all unique values in 'publication' column (to filter out self-mentions from entities)
publications = news_df['publication'].dropna().astype(str).str.lower().unique().tolist()
print(f"unique news organizations: {publications}")

unique news organizations: ['reuters', 'vox', 'vice', 'vice news', 'hyperallergic', 'tmz', 'business insider', 'techcrunch', 'axios', 'refinery 29', 'the verge', 'people', 'economist', 'mashable', 'cnn', 'gizmodo', 'wired', 'new republic', 'cnbc', 'the hill', 'politico', 'buzzfeed news', 'the new york times']


In [11]:
def detect_entities(news_df, spacy_model, news_orgs, entity_types=None):
   """
   Perform Named Entity Recognition using spaCy.
   
   Args:
       news_df (pd.DataFrame): DataFrame containing documents in 'docs' column
       spacy_model: Loaded spaCy model
       news_orgs (list): list of news organizations to exclude from entities.
       entity_types (list, optional): List of entity types to extract.
           If None, uses default entities (see below).
           
   Default entity types included:
       PERSON: People, including fictional 
       ORG: Companies, agencies, institutions
       NORP: Nationalities, religious or political groups
       GPE: Countries, cities, states
       PRODUCT: Products, objects, vehicles, foods, etc.
       EVENT: Named events like wars, sports events, hurricanes
       WORK_OF_ART: Titles of books, songs, etc.
       LAW: Named documents made into laws
       FAC: Buildings, airports, highways, bridges
       LOC: Non-GPE locations, mountain ranges, bodies of water
       
   Entity types excluded by default:
       Numerical and temporal entities (DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL)
       and LANGUAGE are excluded as they often represent incidental details in news articles
       rather than core topical content. For example, two articles about the same event
       may use different dates or monetary figures while covering the same core story.
   
   Returns:
       pd.DataFrame: DataFrame with added 'entities' column containing
                    a list of (entity_text, entity_label) tuples per document
   """
   # Default entity types excl. numerical/temporal.
   DEFAULT_ENTITY_TYPES = [
       'PERSON', 'ORG', 'NORP', 'GPE', 'PRODUCT', 
       'EVENT', 'WORK_OF_ART', 'LAW', 'FAC', 'LOC'
   ]
   
   entity_types = entity_types or DEFAULT_ENTITY_TYPES
   
   def extract_entities_spacy(text):
       try:
           spacy_doc = spacy_model(text)
           entities = [
               (ent.text, ent.label_)
               for ent in spacy_doc.ents
               if ent.label_ in entity_types
               and ent.text.lower() not in news_orgs
           ]
           return entities
           
       except Exception as e:
           print(f"Error processing text with spaCy: {e}")
           return []
   
   # Apply NER with progress bar
   tqdm.pandas()
   news_df['entities'] = news_df['docs'].progress_apply(extract_entities_spacy)
   
   return news_df

news_df = detect_entities(news_df, ner_model, news_orgs=publications)

100%|██████████| 2035/2035 [02:54<00:00, 11.68it/s]
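
To make the filtering inside `detect_entities` concrete, here is a small sketch that applies the same two conditions (wanted label, not a news outlet) to hand-written `(text, label)` pairs. The pairs are invented for illustration rather than produced by spaCy, and `KEEP_TYPES`, `NEWS_ORGS`, and `filter_entities` are hypothetical names.

```python
KEEP_TYPES = {"PERSON", "ORG", "GPE"}   # a subset of the default entity types
NEWS_ORGS = {"reuters", "vox"}          # lowercased publication names


def filter_entities(raw_entities):
    """Keep (text, label) pairs of wanted types that are not news outlets."""
    return [
        (text, label)
        for text, label in raw_entities
        if label in KEEP_TYPES and text.lower() not in NEWS_ORGS
    ]


raw = [
    ("Reuters", "ORG"),          # dropped: self-mention of a publication
    ("China", "GPE"),
    ("Wednesday", "DATE"),       # dropped: temporal entity
    ("Donald Trump", "PERSON"),
]
print(filter_entities(raw))
# [('China', 'GPE'), ('Donald Trump', 'PERSON')]
```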


## 5. `document embeddings`

In [12]:
def calculate_doc_embeddings(news_df, sent_model, batch_size=32):
   """
   Calculate document embeddings for the 'docs' column using a sentence transformer model.
   Optimized for speed through batched processing.
   
   Args:
       news_df (pd.DataFrame): DataFrame containing documents in 'docs' column
       sent_model (SentenceTransformer): Loaded sentence transformer model
       batch_size (int): Number of documents to embed at once (default: 32)
   
   Returns:
       pd.DataFrame: Input DataFrame with added 'document_embedding' column
                    containing document embeddings as numpy arrays
   """
   print(f"Calculating document embeddings using {sent_model.__class__.__name__}")
   
   # Process documents in batches
   all_embeddings = []
   for i in tqdm(range(0, len(news_df), batch_size)):
       batch_texts = news_df['docs'].iloc[i:i + batch_size].tolist()
       try:
           # Calculate embeddings for batch
           embeddings = sent_model.encode(batch_texts, convert_to_tensor=True)
           
           # Convert to numpy and store
           all_embeddings.extend(embeddings.cpu().numpy())
       except Exception as e:
           print(f"Error processing batch {i//batch_size}: {e}")
           
           # If error, pad with zeros for this batch
           embedding_dim = sent_model.get_sentence_embedding_dimension()
           all_embeddings.extend([np.zeros(embedding_dim) for _ in batch_texts])
   
   news_df['document_embedding'] = all_embeddings
   return news_df

news_df = calculate_doc_embeddings(news_df, sent_model, batch_size=128)

Calculating document embeddings using SentenceTransformer


100%|██████████| 16/16 [00:15<00:00,  1.05it/s]
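
The progress bar shows 16 iterations because 2035 documents are sliced into batches of 128. A short sketch of that arithmetic, using only the numbers reported above:

```python
import math

n_docs, batch_size = 2035, 128

# Start offsets of each batch, as in range(0, len(news_df), batch_size)
starts = list(range(0, n_docs, batch_size))
print(len(starts), math.ceil(n_docs / batch_size))  # 16 16

# The final batch is shorter; slicing past the end of the data is safe
last_batch_len = n_docs - starts[-1]
print(last_batch_len)  # 115
```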


## 6. `subject/context embeddings`


In [13]:
def calculate_subject_context_embeddings(news_df, sent_model, pos_tagging="en_core_web_lg", stop_lang = 'english', sub_ngram=(1,3), sub_topN=10, con_topN=10):
    
    """
Extract and calculate subject and context embeddings for documents using KeyBERT.

This function performs two types of keyphrase extraction and embedding calculation:

1. Subject Embeddings: Based on named entities in the document.
  - Extracts keyphrases based on candidate list of named entities.
  - Calculates salience-weighted embeddings of these entities.
  - Represents the "about whom" of the document

2. Context Embeddings: Based on non-entity noun-centric phrases.
  - Extracts keyphrases using POS patterns (e.g., noun+verb, adj+noun)
  - Calculates similarity-weighted embeddings of these contextual phrases
  - Represents the "about what" of the document

Args:
   news_df (pd.DataFrame): DataFrame containing:
       - 'docs' column with document texts (title + text).
       - 'entities' column with lists of extracted (entity_text, entity_label) tuples.
   sent_model: Sentence transformer model for embeddings
   pos_tagging (str): Name of spaCy model for POS tagging (default: "en_core_web_lg")
   stop_lang (str): Language for stopwords from NLTK (default: "english")
   sub_ngram (tuple): N-gram range for subject keyphrases (default: (1,3))
   NOTE: ngram range for context keyphrases is dynamic (see KeyphraseCountVectorizer docs)
   sub_topN (int): Number of top subject keyphrases to keep (default: 10)
   con_topN (int): Number of top context keyphrases to keep (default: 10)

Returns:
   pd.DataFrame: Input DataFrame with added columns:
       - subject_keywords: List of extracted subject keyphrases
       - context_keywords: List of extracted context keyphrases
       - subject_weights: Corresponding weights for subject keyphrases
       - context_weights: Corresponding weights for context keyphrases
       - subject_embedding: Weighted average embedding of subject keyphrases
       - context_embedding: Weighted average embedding of context keyphrases

Note:
   - Requires KeyBERT package.
   - Requires KeyphraseCountVectorizer package.
"""
    print("--> Running calculate_subject_embedding()...")

    #- - - PREPARATION - - - >>>

    #get list of texts: 
    docs = news_df['docs'].to_list()

    #pass the ST model to keyBERT:
    kw_model = KeyBERTEmbeddings(model=sent_model)

    #create output lists for the dataframe:
    subject_keywords_master, context_keywords_master = [], []
    subject_weights_master, context_weights_master = [], []
    subject_embedding_master, context_embedding_master = [], []

    #generate stop word list (pass language code):
    stop_words = list(set(nltk.corpus.stopwords.words(stop_lang)))

    #generate subject candidate list of all relevant entities across all articles (lowercased):
    #NOTE: we drop stopwords and single character entities from the candidate list (in case these made it through). 
    subject_candidates = list(set([entity[0].strip().lower() for entities in news_df['entities'] for entity in entities]))
    subject_candidates = [candidate for candidate in subject_candidates if candidate not in stop_words] # - stopwords
    subject_candidates = [candidate for candidate in subject_candidates if len(candidate) > 1] # - single characters

    #define a helper function to de-duplicate keyphrases, average their embeddings, and sum their weights:
    def deduplicate_and_sum(keywords):
        grouped = defaultdict(lambda: [0, []])
        for keyword, score, embedding in keywords:
            grouped[keyword][0] += score  
            grouped[keyword][1].append(embedding)
        return [(k, total_score, np.mean(embeddings, axis=0)) 
                for k, (total_score, embeddings) in grouped.items()]
    
    #define a helper function to handle skips:
    def skip_article(index, reason):
        print(f"Skipping article at index {index}: {reason}")
        subject_embedding_master.append(None)
        subject_keywords_master.append(None)
        subject_weights_master.append(None)
        context_weights_master.append(None)
        context_embedding_master.append(None)
        context_keywords_master.append(None)

    
    #- - - COUNT VECTORIZER SETUP - - - >>>

    #define a KeyphraseCountVectorizer for context keyphrase extraction:
    con_vectorizer = KeyphraseCountVectorizer(spacy_pipeline=pos_tagging, stop_words=stop_words, 
                                                  lowercase=True, pos_pattern='(<J.*>*<N.*>+)|(<N.*>+<V.*>)|(<V.*><N.*>+)|(<N.*>+<IN><N.*>+)')

    #NOTE: different POS patterns can be used to extract different types of keyphrases. For example, here:
    #noun, noun+verb, adj+noun, verb+noun, noun+prep+noun = (<J.*>*<N.*>+)|(<N.*>+<V.*>)|(<V.*><N.*>+)|(<N.*>+<IN><N.*>+)
    


    #- - - SUBJECT KEYPHRASE EXTRACTION - - - >>>

    #generate kw embeddings for SUBJECT (by passing subject entities as candidate list):
    print(f"--> extracting subject embeddings...")
    subject_doc_embeds, subject_kw_embeds = kw_model.extract_embeddings(docs, 
                                                                        keyphrase_ngram_range=sub_ngram, 
                                                                        candidates=subject_candidates,
                                                                        )

    #extract topN keywords from the subject keyword embeddings (keyword, similarity, embedding):
    print(f"--> extracting subject keyphrases...")
    news_df['subject_keywords'] = kw_model.extract_keywords(docs, top_n=sub_topN,
                                                            keyphrase_ngram_range=sub_ngram, 
                                                            candidates=subject_candidates,
                                                            word_embeddings=subject_kw_embeds,
                                                            doc_embeddings=subject_doc_embeds,
                                                            )
    


    #- - - CONTEXT KEYPHRASE EXTRACTION - - - >>>

    #generate kw_embeddings for CONTEXT (passing customized count vectorizer):
    print(f"--> extracting context embeddings...")
    context_doc_embed, context_kw_embeds = kw_model.extract_embeddings(docs, 
                                                                        vectorizer=con_vectorizer)
    
    #generate context keywords for the articles:
    print(f"--> extracting context keyphrases...")
    news_df['context_keywords'] = kw_model.extract_keywords(docs, top_n=con_topN,
                                                vectorizer=con_vectorizer,
                                                word_embeddings=context_kw_embeds,
                                                doc_embeddings=context_doc_embed,
                                                )

    
    #- - - CALCULATE SUBJECT/CONTEXT EMBEDDINGS --- >>>
    print("--> Calculating subject/context embeddings...")
    
    #create a list to store indices of failed rows:
    failed_indices = []
    
    for index, row in tqdm(news_df.iterrows()):
        
        #SUBJECT EMBEDDING -
        #NOTE: defined as salience-weighted embedding of entities in the article.
        try:
            #get the subject keyphrases for the article:
            subject_keywords = row['subject_keywords']
            
            #[CHECK] if no subject keywords detected, skip this article:
            if not subject_keywords or len(subject_keywords) == 0:
                print(f"No subject keywords detected for article at index {index}. Skipping...")
                failed_indices.append(index)
                subject_keywords_master.append(None) 
                subject_weights_master.append(None)
                subject_embedding_master.append(None)
                context_keywords_master.append(None)
                context_weights_master.append(None) 
                context_embedding_master.append(None)
                continue
            
            #de-duplicate subject keyphrases (sum scores, avg embeds):
            subject_keywords = deduplicate_and_sum(subject_keywords)
            #sort the subject_keywords by weights (second element of the tuple)
            subject_keywords = sorted(subject_keywords, key=lambda x: x[1], reverse=True) 
            #[FILTER] keep only the topN subject keyphrases:
            subject_keywords = subject_keywords[:sub_topN]
            #unpack the subject keyphrase tuples for the article:
            subject_keywords, subject_weights, subject_embeddings = zip(*subject_keywords)
            #calculate the subject embedding as the salience-weighted average of the embeddings:
            subject_embedding = np.average(subject_embeddings, axis=0, weights=subject_weights)
            
            #append subject data to master lists:
            subject_keywords_master.append(subject_keywords) 
            subject_weights_master.append(subject_weights)
            subject_embedding_master.append(subject_embedding)
            
        except Exception as e:
            print(f"Error processing subject embedding for article at index {index}: {str(e)}")
            failed_indices.append(index)
            subject_keywords_master.append(None) 
            subject_weights_master.append(None)
            subject_embedding_master.append(None)
            #skip context processing for this article:
            context_keywords_master.append(None)
            context_weights_master.append(None) 
            context_embedding_master.append(None)
            continue

            
        #CONTEXT EMBEDDING - 
        #NOTE: doc-similarity weighted embedding of non-entity noun-centric keyphrases in the article.   
        try:
            #get the context keyphrases for the article:
            context_keywords = row['context_keywords']
            
            #[CHECK] if no context keywords detected, skip this article:
            if not context_keywords or len(context_keywords) == 0:
                print(f"No context keywords detected for article at index {index}. Skipping...")
                failed_indices.append(index)
                context_keywords_master.append(None)
                context_weights_master.append(None) 
                context_embedding_master.append(None)
                continue
            
            #de-duplicate keyphrases (sum scores, avg embeds):
            context_keywords = deduplicate_and_sum(context_keywords)
            #sort the context_keywords by weights (second element of the tuple)
            context_keywords = sorted(context_keywords, key=lambda x: x[1], reverse=True)
            #[FILTER] discard context keyphrases that contain entities from the subject keyphrases:
            context_keywords = [(context, score, embedding) for context, score, embedding in context_keywords if not any(word in subject_keywords for word in context.lower().split())]
            #[FILTER] discard context keyphrases that are single characters:
            context_keywords = [(context, score, embedding) for context, score, embedding in context_keywords if len(context) > 1]
            
            #[CHECK] if no context keywords remain after filtering, skip this article:
            if not context_keywords or len(context_keywords) == 0:
                print(f"No context keywords remain after filtering for article at index {index}. Skipping...")
                failed_indices.append(index)
                context_keywords_master.append(None)
                context_weights_master.append(None) 
                context_embedding_master.append(None)
                continue
            
            #[FILTER] keep only the topN resulting keyphrases:
            context_keywords = context_keywords[:con_topN]
            #unpack the context keyphrase tuples for the article:
            context_keywords, context_weights, context_embedding = zip(*context_keywords)
            #calculate the context embedding as the similarity-weighted average of the embeddings:
            context_embedding = np.average(context_embedding, axis=0, weights=context_weights)
            
            #append context data to master lists:
            context_keywords_master.append(context_keywords)
            context_weights_master.append(context_weights) 
            context_embedding_master.append(context_embedding)
            
        except Exception as e:
            print(f"Error processing context embedding for article at index {index}: {str(e)}")
            failed_indices.append(index)
            context_keywords_master.append(None)
            context_weights_master.append(None) 
            context_embedding_master.append(None)
            continue
        

    #- - -FINALIZATION - - - >>>

    #merge the subject and context embeddings into the dataframe:
    news_df['subject_keywords'] = subject_keywords_master
    news_df['context_keywords'] = context_keywords_master
    news_df['subject_weights'] = subject_weights_master
    news_df['context_weights'] = context_weights_master
    news_df['subject_embedding'] = subject_embedding_master
    news_df['context_embedding'] = context_embedding_master

    #remove rows that failed during processing:
    if failed_indices:
        print(f"\n--> Removing {len(failed_indices)} failed articles from the dataframe...")
        news_df = news_df.drop(index=failed_indices)

    return news_df
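
The two numeric steps in the loop above are de-duplication (sum the scores and average the embeddings of repeated phrases) and a score-weighted average of the remaining embeddings. The following dependency-free sketch reproduces both on toy 2-d vectors. The data is invented, and `dedupe` and `weighted_mean` are illustrative stand-ins for `deduplicate_and_sum` and `np.average(..., axis=0, weights=...)`.

```python
from collections import defaultdict


def dedupe(keywords):
    """Merge repeated phrases: sum scores, average embeddings element-wise."""
    grouped = defaultdict(lambda: [0.0, []])
    for phrase, score, emb in keywords:
        grouped[phrase][0] += score
        grouped[phrase][1].append(emb)
    return [
        (p, s, [sum(col) / len(embs) for col in zip(*embs)])
        for p, (s, embs) in grouped.items()
    ]


def weighted_mean(vectors, weights):
    """Weighted average over vectors, one value per dimension."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total for col in zip(*vectors)]


# (phrase, score, embedding) triples; "china" appears twice
toy = [("china", 0.5, [1.0, 0.0]), ("china", 0.25, [0.0, 1.0]), ("tariffs", 0.25, [0.0, 1.0])]
merged = dedupe(toy)
print(merged)  # [('china', 0.75, [0.5, 0.5]), ('tariffs', 0.25, [0.0, 1.0])]
print(weighted_mean([m[2] for m in merged], [m[1] for m in merged]))  # [0.375, 0.625]
```

Because scores are summed, a phrase that is extracted several times pulls the final embedding toward itself more strongly than a phrase seen once.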

In [14]:
news_df = calculate_subject_context_embeddings(news_df, sent_model)

--> Running calculate_subject_embedding()...
--> extracting subject embeddings...
--> extracting subject keyphrases...
--> extracting context embeddings...
--> extracting context keyphrases...
--> Calculating subject/context embeddings...


1203it [00:00, 6071.10it/s]

No context keywords remain after filtering for article at index 975812754316075345. Skipping...
No context keywords remain after filtering for article at index 3367515107652030157. Skipping...
No context keywords remain after filtering for article at index 2327374466068382576. Skipping...
No context keywords remain after filtering for article at index 9008062070283884736. Skipping...
No context keywords remain after filtering for article at index 9066736986716052601. Skipping...
No context keywords remain after filtering for article at index 4623124509149724831. Skipping...
No context keywords remain after filtering for article at index 7817694389214201856. Skipping...
No context keywords remain after filtering for article at index 5552911458848226768. Skipping...
No context keywords remain after filtering for article at index 6748957417935916943. Skipping...
No context keywords remain after filtering for article at index 8736714236289197605. Skipping...
No context keywords remain afte

2035it [00:00, 5779.40it/s]

No context keywords remain after filtering for article at index 8589980233990383855. Skipping...
No context keywords remain after filtering for article at index 301532724649891210. Skipping...
No context keywords remain after filtering for article at index 145011995044705850. Skipping...
No context keywords remain after filtering for article at index 8398402481593308122. Skipping...
No context keywords remain after filtering for article at index 1318342818413085111. Skipping...
No context keywords remain after filtering for article at index 6239220938549572795. Skipping...
No context keywords remain after filtering for article at index 9222273047746770552. Skipping...
No context keywords remain after filtering for article at index 6107595644643456704. Skipping...
No context keywords remain after filtering for article at index 6276321503152423197. Skipping...
No context keywords remain after filtering for article at index 5225548428061178675. Skipping...
No context keywords remain after
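
The "No context keywords remain after filtering" messages come from the step that discards context keyphrases overlapping with subject keyphrases. As written, that filter tests whether a single word of the context phrase equals a whole subject keyphrase, so multi-word subjects only match when the context phrase contains the full phrase as one token, which never happens after `split()`. The sketch below shows this on invented phrases; the stricter variant is a hypothetical alternative, not the notebook's behavior.

```python
subject_keywords = ("donald trump", "china")   # toy subject keyphrases
context_phrases = ["china trade war", "trump tariffs", "soy exporters"]

# Filter as written in the notebook: a word must equal a whole subject keyphrase
kept = [p for p in context_phrases
        if not any(word in subject_keywords for word in p.lower().split())]
print(kept)  # ['trump tariffs', 'soy exporters']

# Hypothetical stricter variant: compare against the individual subject words
subject_words = {w for phrase in subject_keywords for w in phrase.split()}
kept_strict = [p for p in context_phrases
               if not any(word in subject_words for word in p.lower().split())]
print(kept_strict)  # ['soy exporters']
```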




## 7. `event clustering` (Litterer et al. 2023)

##### >> for original implementation, see: https://github.com/blitt2018/mediaStorms

###### >> this is an integration of the following scripts in the above workflow:  `0.0-bl-getEntityPairs.py` + `0.1-bl-computeCosineSim.py` + `0.2-bl-createEmbeddingClusterList.py`

In [15]:
#EVENT DETECTION / CLUSTERING -
SIM_CUTOFF = 0.8 #minimum similarity for clustering (cosine similarity)
CLUSTER_CUTOFF = [2, 100] #minimum and maximum cluster size (e.g., [2, 100] means min 2 docs, max 100 docs per cluster)
included_entity_types = ['PERSON', 'ORG', 'NORP', 'GPE', 'PRODUCT', 'EVENT', 'WORK_OF_ART', 'LAW', 'FAC', 'LOC'] 
excluded_entity_types = ['LANGUAGE','DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']
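
To give a feel for how a similarity cutoff of 0.8 behaves, here is a dependency-free sketch that enumerates article pairs and keeps those at or above the threshold. The 2-d vectors are invented, `cosine` is a plain-Python stand-in for scikit-learn's `cosine_similarity`, and treating the cutoff as inclusive (`>=`) is an assumption.

```python
import math
from itertools import combinations

SIM_CUTOFF = 0.8


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "document embeddings" keyed by article id (values are made up)
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}

# All unordered pairs: n * (n - 1) / 2 of them, so 3 here and 4950 for n = 100
pairs = list(combinations(embeddings, 2))
similar = [(i, j) for i, j in pairs
           if cosine(embeddings[i], embeddings[j]) >= SIM_CUTOFF]
print(similar)  # [('a', 'b')]
```

The quadratic growth in pairs is one reason an upper bound on group size, such as the 100 in `CLUSTER_CUTOFF`, keeps the pairwise step manageable.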

In [16]:
def get_clusters(news_df: pd.DataFrame, CLUSTER_CUTOFF: tuple, SIM_CUTOFF: float) -> tuple:
    """
    Cluster news articles based on named entities and document embeddings similarity.
    
    Args:
        news_df: DataFrame containing news articles with entities and embeddings
        CLUSTER_CUTOFF: Tuple of (min_size, max_size) for filtering clusters
        SIM_CUTOFF: Minimum cosine similarity threshold for clustering
    
    Returns:
        tuple: (DataFrame with cluster assignments, NetworkX graph of article relationships)
    """
    
    # Helper Functions
    def getPairwise(inList): 
        """
        Generate all unique pairwise combinations of elements within a list.
        
        Args:
            inList: List of elements to generate pairs from
        Returns:
            List of paired elements
        """
        outList = []
        inLen = len(inList)
        for i in range(0, inLen):
            for j in range(i+1, inLen): 
                outList.append([inList[i], inList[j]])
        return outList

    def getCos(inList): 
        """
        Calculate cosine similarity between two document embedding vectors.
        
        Args:
            inList: List/array containing exactly two vectors
        Returns:
            float: Cosine similarity score
        """
        if not isinstance(inList, (list, np.ndarray)) or len(inList) != 2:
            raise ValueError("Input must be list/array of two vectors")
        return float(cosine_similarity(inList[0].reshape(1, -1), 
                                    inList[1].reshape(1, -1))[0][0])

    def getCosSeries(inSeries): 
        """
        Apply cosine similarity calculation to a series of vector pairs.
        
        Args:
            inSeries: Pandas Series containing pairs of vectors
        Returns:
            Series of similarity scores
        """
        if not isinstance(inSeries, pd.Series):
            raise TypeError("Input must be pandas Series")
        return inSeries.apply(getCos)

    # STEP 1: Data Preparation
    # Extract relevant columns and explode entity information
    lean_df = news_df[['entities','document_embedding']].reset_index(drop=False).copy()
    lean_df = lean_df.explode('entities')
    lean_df[['entity','ent_type']] = pd.DataFrame(lean_df['entities'].tolist(), index=lean_df.index)
    
    # Remove duplicate entity mentions within same article
    lean_df = lean_df.drop_duplicates(subset=["id", "entity", "ent_type"])

    # STEP 2: Group and Filter Articles
    # Group articles by named entities and apply size filters
    grouped_df = lean_df[["ent_type", "entity", "id", "document_embedding"]].groupby(by=["ent_type", "entity"]).agg(list)
    grouped_df["numArticles"] = grouped_df["id"].apply(len)
    grouped_df = grouped_df[(grouped_df["numArticles"] >= CLUSTER_CUTOFF[0]) & 
                          (grouped_df["numArticles"] <= CLUSTER_CUTOFF[1])]
    
    # STEP 3: Generate Article Pairs
    # Create pairs of articles and their embeddings for similarity calculation
    grouped_df = grouped_df[["id", "document_embedding"]]
    grouped_df["document_embedding"] = grouped_df["document_embedding"].apply(getPairwise)
    grouped_df["id"] = grouped_df["id"].apply(getPairwise)
    
    # STEP 4: Calculate Similarities
    # Process pairs and calculate similarity scores
    pair_df = grouped_df.apply(pd.Series.explode)
    pair_df[["id1", "id2"]] = pd.DataFrame(pair_df["id"].to_list(), index=pair_df.index)
    pair_df = pair_df.drop(columns=["id"]).reset_index()
    pair_df = pair_df.drop_duplicates(subset=["id1", "id2"]).reset_index(drop=True)
    
    # Calculate and filter by similarity threshold
    embeddings = pair_df["document_embedding"]
    similarity = getCosSeries(embeddings)
    pair_df["similarity"] = similarity
    pair_df = pair_df[pair_df["similarity"] >= SIM_CUTOFF].reset_index(drop=True)

    # STEP 5: Generate Clusters
    # Create graph and identify connected components
    graph = nx.from_pandas_edgelist(pair_df[["id1", "id2"]], "id1", "id2")
    components = nx.connected_components(graph)
    comp_list = list(components)

    # STEP 6: Merge Results
    # Convert clusters to DataFrame and merge with original data
    clusters = pd.DataFrame({"cluster":comp_list}).reset_index()
    clust_df = clusters.explode("cluster").rename(columns={"index":"clustNum", "cluster":"id"})
    news_df = news_df.merge(clust_df, left_index=True, right_on="id", how="left").set_index("id")

    # Print clustering statistics
    cluster_sizes = clust_df['clustNum'].value_counts()
    size_counts = cluster_sizes.value_counts().sort_index()
    
    print(f"Number of clusters: {len(clust_df['clustNum'].unique())}")
    print(f"Number of articles in clusters: {len(clust_df)} of {len(news_df)} total articles.")
    print("Number of clusters by size:")
    for size, count in size_counts.items():
        print(f"  - {count} clusters with {size} articles")

    return news_df, graph

In [17]:
news_df, graph = get_clusters(news_df, CLUSTER_CUTOFF, SIM_CUTOFF)

Number of clusters: 140
Number of articles in clusters: 509 of 1933 total articles.
Number of clusters by size:
  - 85 clusters with 2 articles
  - 36 clusters with 3 articles
  - 5 clusters with 4 articles
  - 2 clusters with 5 articles
  - 4 clusters with 6 articles
  - 1 clusters with 7 articles
  - 2 clusters with 8 articles
  - 1 clusters with 11 articles
  - 1 clusters with 12 articles
  - 1 clusters with 25 articles
  - 1 clusters with 38 articles
  - 1 clusters with 68 articles
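
To look inside individual clusters, the `clustNum` column added by `get_clusters` can be queried directly. The sketch below uses a small stand-in DataFrame (`toy_df`, hypothetical) rather than the real `news_df`. It assumes, as in the function above, that the index is named `id` and that articles without a cluster have `NaN` in `clustNum` because of the left merge.

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical) for the DataFrame returned by get_clusters
toy_df = pd.DataFrame(
    {
        "title": ["t1", "t2", "t3", "t4", "t5", "t6"],
        "clustNum": [0, 0, 0, np.nan, 1, 1],
    },
    index=pd.Index([11, 12, 13, 14, 15, 16], name="id"),
)

# Split clustered from unclustered articles (NaN means "no cluster")
clustered = toy_df.dropna(subset=["clustNum"])
unclustered = toy_df[toy_df["clustNum"].isna()]

# Find the largest cluster and list the ids of its member articles
sizes = clustered["clustNum"].value_counts()
largest = sizes.idxmax()
members = clustered.index[clustered["clustNum"] == largest].tolist()

print(f"largest cluster: {int(largest)} with {int(sizes.max())} articles -> {members}")
print(f"unclustered articles: {unclustered.index.tolist()}")
```

The same pattern applies to the real `news_df`; displaying the `title` column for `members` is a quick way to sanity-check whether a cluster corresponds to a single event.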


## 8. save to disk

In [18]:
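# Write the processed data to a .csv file.
# Note: news_df is indexed by 'id' at this point, so index=False leaves the article ids out of the file.
news_df.to_csv('./data/01_processing_output.csv', index=False)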