# Outlier removal

Objective: Denoise the news clusters by removing outliers using the Double Filtering Logic.
The paper assumes that clustering can fail in two distinct ways:
- Ambiguity (Silhouette): The article is "caught between two fires," being almost as close to a neighboring cluster as it is to its own.
- Isolation (Distance to Centroid): The article is in the correct cluster, but it is located very far from the center (indicating a topic that is too specific or off-topic).
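
For reference, these two failure modes map onto the standard definitions below. Here $a_i$ is the mean distance from article $i$ to the other members of its own cluster, $b_i$ is the mean distance to the members of the nearest other cluster, $x_i$ is the article embedding, and $c_{k(i)}$ is the centroid of its assigned cluster. The symbols $a_i$, $b_i$, $x_i$ and $c_{k(i)}$ are notation introduced here for illustration, not taken from the paper.

$$
s_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \qquad \mathrm{sim}_i = \frac{x_i \cdot c_{k(i)}}{\lVert x_i \rVert \, \lVert c_{k(i)} \rVert}
$$

An $s_i$ close to $0$ (or negative) signals ambiguity, while a low $\mathrm{sim}_i$ signals isolation.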

Steps:
- Retrieve Initial Centroids: Computed after the HAC process.
- Compute Two Metrics for each article $i$:
    - Its individual Silhouette score ($s_i$).
    - Its Cosine Similarity with the centroid of its assigned cluster.
- Define Cutoff Thresholds by Percentile (here the 30th percentile, matching `percentile_threshold=30` in the code below):
    - Identify the 30% of articles with the worst (lowest) Silhouette scores.
    - Identify the 30% of articles furthest from their centroids (lowest similarity).
- Suppression: Remove an article if it fails either of the two tests (Logical "OR").
- Recalculate Final Centroids: Compute the final event signatures based only on the remaining "clean" articles.
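
The steps above can be sketched as follows. This is a minimal illustration, not the code of `remove_news_outliers_advanced` (which lives in `src.outlier_removal` and is not shown here). The function name `double_filter_sketch`, the use of scikit-learn's `silhouette_samples`, the median centroid, and the toy data are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def double_filter_sketch(X, labels, percentile=30):
    """Hypothetical stand-in for the double filter; returns a boolean keep mask."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Metric 1: per-article Silhouette score, using cosine distance.
    sil = silhouette_samples(X, labels, metric="cosine")

    # Metric 2: cosine similarity between each article and its cluster centroid.
    # The centroid is taken as the per-dimension median (an assumption).
    sim = np.empty(len(X))
    for k in np.unique(labels):
        members = labels == k
        centroid = np.median(X[members], axis=0)
        norms = np.linalg.norm(X[members], axis=1) * np.linalg.norm(centroid)
        sim[members] = X[members] @ centroid / norms

    # Cutoffs: the requested percentile of each metric.
    sil_cut = np.percentile(sil, percentile)
    sim_cut = np.percentile(sim, percentile)

    # Logical OR: an article is dropped if it fails either test.
    is_outlier = (sil < sil_cut) | (sim < sim_cut)
    return ~is_outlier


# Toy data: two tight groups of 3-D vectors (illustrative only).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[1.0, 0.0, 0.0], scale=0.1, size=(10, 3)),
    rng.normal(loc=[0.0, 1.0, 0.0], scale=0.1, size=(10, 3)),
])
labels_demo = np.array([0] * 10 + [1] * 10)
keep_demo = double_filter_sketch(X_demo, labels_demo, percentile=30)
print(f"Kept {keep_demo.sum()} of {len(keep_demo)} articles")
```

Because the two tests are combined with "OR", the share of removed articles lies between 30% (the two tests flag the same articles) and 60% (they flag disjoint sets).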

In [7]:
import pandas as pd
import numpy as np
import os
import sys
import re
from scipy.cluster.hierarchy import fcluster, linkage

sys.path.append(os.path.abspath(os.path.join('..')))
from src.outlier_removal import *
from src.news_clustering import compute_stable_hac_linkage, plot_hac_dendrogram_plotly

### Data preparation

In [8]:
news_features = pd.read_csv('../data/for_models/news_features.csv')
news_features['date'] = pd.to_datetime(news_features['date'])

In [9]:
# Converting String to Array
def string_to_array(s):
    s = re.sub(r'[\[\]\n]', '', s)
    # String to Numpy with type float
    return np.fromstring(s, sep=' ')

# We transform 'embedding' column into a stack of numpy arrays
X = np.stack(news_features['embedding'].apply(string_to_array).values)
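
As a quick check of the parsing step, the sketch below runs the same idea on a made-up string. The sample value and the helper name `parse_embedding` are illustrative assumptions, not taken from the data. It uses `str.split` rather than `np.fromstring`, and replaces brackets and newlines with spaces so that values on adjacent lines cannot be glued together.

```python
import re

import numpy as np


def parse_embedding(s):
    """Turn the printed form of a NumPy array back into a float vector."""
    # Replace brackets and line breaks with spaces, then split on whitespace.
    cleaned = re.sub(r"[\[\]\n]", " ", s)
    return np.array(cleaned.split(), dtype=float)


sample = "[ 0.12 -0.5\n  3.0 ]"  # made-up example string
vec = parse_embedding(sample)
print(vec)
```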

### Selecting the period

In [10]:
# Example: clusters for the week before the SVB collapse
START_DATE = pd.to_datetime('2023-03-03')
END_DATE = pd.to_datetime('2023-03-09') 

# # Example: clusters for the week before ARM's stock market listing
# START_DATE = pd.to_datetime('2023-09-07')
# END_DATE = pd.to_datetime('2023-09-13') 

mask_week = (news_features['date'] >= START_DATE) & (news_features['date'] <= END_DATE)
sub_df = news_features.loc[mask_week].copy()
print(f"Number of articles in the week: {len(sub_df)}")

Number of articles in the week: 39


### Clustering the news 

In [15]:
# Step 1 : Initial HAC
# This uses the function to ensure every cluster has at least min_samples
Z_stable, headlines_stable, X_stable = compute_stable_hac_linkage(
    X, news_features, START_DATE, END_DATE, k=2, min_samples=2
)
print(f"Stable HAC computed for the week with {len(headlines_stable)} articles.")

# CRUCIAL: We update sub_df so it matches headlines_stable
# We filter sub_df to only keep headlines that are in headlines_stable
sub_df_stable = sub_df[sub_df['headline'].isin(headlines_stable)].copy()

Stable HAC computed for the week with 37 articles.


### Applying Outlier Removal 

In [18]:
# STEP 2: Get Initial Labels
# Since compute_stable_hac_linkage returns Z, we get labels for our chosen k
initial_labels = fcluster(Z_stable, t=2, criterion='maxclust')
initial_labels = initial_labels - 1 # NOTE: fcluster starts labels at 1, so we adjust to 0 for consistency
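
The label shift can be checked on a toy example. The four 2-D points below are made up for illustration; the linkage settings (`average`, `cosine`) mirror those used later in this notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two obvious groups: points near the x-axis and points near the y-axis.
points = np.array([[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]])
Z_toy = linkage(points, method="average", metric="cosine")

# With criterion='maxclust', fcluster numbers the flat clusters from 1.
labels_one_based = fcluster(Z_toy, t=2, criterion="maxclust")
labels_zero_based = labels_one_based - 1
print(labels_one_based, labels_zero_based)
```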

In [19]:
# STEP 3: Advanced Outlier Removal 
# Now we apply the double-filtering (Silhouette + Distance to Centroid) on the already stable articles.
X_clean, labels_clean, keep_mask = remove_news_outliers_advanced(
    X_stable, 
    initial_labels, 
    percentile_threshold=30
)
print(f"After outlier removal, {len(X_clean)} articles remain in the week.")

Complete: Removed 15 articles.
Cutoffs: Silhouette < 0.295 | Centroid Sim < 0.984
After outlier removal, 22 articles remain in the week.


### Visualizations

In [20]:
# STEP 4: Final Metadata Sync
# We filter the headlines again to match the final cleaned articles
final_headlines = np.array(headlines_stable)[keep_mask].tolist()
# We build the final clean DataFrame by applying the same boolean mask to the metadata.
# NOTE: this assumes sub_df_stable has one row per entry of headlines_stable, in the same order.
clean_news_df = sub_df_stable[keep_mask].copy()
clean_news_df['Cluster'] = labels_clean.astype(str)

In [21]:
# --- STEP 5: Visualization ---
# To have a clean dendrogram of only the FINAL articles:
Z_final = linkage(X_clean, method='average', metric='cosine')
fig_tree = plot_hac_dendrogram_plotly(Z_final, final_headlines, START_DATE, END_DATE)
fig_tree.show()

In [22]:
# STEP 6: Final Event Signatures for Tweet Assignment
# Recalculate the median vectors of the clean clusters
final_event_signatures = calculate_event_centroids(X_clean, labels_clean)

print(f"Final Event Signatures ready: {list(final_event_signatures.keys())}")
final_event_signatures_SVB = pd.DataFrame.from_dict(final_event_signatures, orient='index')
final_event_signatures_SVB.to_csv('../data/for_models/final_event_signatures_SVB.csv')
# Each signature is a 300D vector representing the 'pure essence' of the news event.

# print(f"Final Event Signatures ready: {list(final_event_signatures.keys())}")
# final_event_signatures_AI = pd.DataFrame.from_dict(final_event_signatures, orient='index')
# final_event_signatures_AI.to_csv('../data/for_models/final_event_signatures_AI.csv')
# Each signature is a 300D vector representing the 'pure essence' of the news event.

Final Event Signatures ready: [np.int32(0), np.int32(1)]


In [23]:
# Visualization of HAC with Centroids (t-SNE)
fig_final = visualize_hac_with_centroids(X_clean, clean_news_df, k=2, perplexity=10)
fig_final.show()

In the methodology of Carta et al. that we are following, the centroid is not a real article but a "virtual" or "synthetic" one, computed via the median. The `calculate_event_centroids` function examines each dimension of the embedding (the $300$ values comprising the vector) across every article in the cluster and takes the median of that dimension. The result is a new vector that carries the most representative (central) characteristics of the entire group, and it is statistically almost impossible for it to coincide exactly with the coordinates of an existing article in the CSV file. Because the median damps the specific nuances, biases, or "noise" of individual journalists, this centroid serves as a semantic anchor for the event as a whole.
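
A minimal sketch of the per-dimension median, assuming centroids are returned as a dictionary keyed by cluster label. The helper `median_centroids_sketch` and the three toy vectors are hypothetical; the actual `calculate_event_centroids` implementation is not shown here.

```python
import numpy as np


def median_centroids_sketch(X, labels):
    """Hypothetical helper: one per-dimension median vector per cluster label."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return {k: np.median(X[labels == k], axis=0) for k in np.unique(labels)}


# One toy cluster of three 3-D "articles".
cluster = np.array([
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
    [2.0, 2.0, 0.0],
])
centroid = median_centroids_sketch(cluster, np.zeros(3, dtype=int))[0]

# The median is taken column by column, so it need not equal any row.
matches_article = any(np.array_equal(centroid, row) for row in cluster)
print(centroid, matches_article)  # [1. 1. 4.] False
```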

On the Plotly graph:
- Colored dots are the real-world articles.
- The Black Cross (X) is the median centroid, the synthetic point that summarizes the event.


When we move on to the Tweet Assignment phase, we won't be searching for tweets that resemble a specific news story. Instead, we will look for tweets that align with the Event Signature (the X). This ensures that the social media filtering is based on the core financial event rather than the specific wording of a single press release.
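
One way such an assignment could look is sketched below. It is an assumption about the later phase, not the method used in this project. The helper `assign_to_signatures`, the cosine-similarity criterion, the `min_similarity=0.5` cutoff, and the 2-D toy vectors are all hypothetical.

```python
import numpy as np


def assign_to_signatures(tweet_vecs, signatures, min_similarity=0.5):
    """Hypothetical helper: best-matching event key per tweet, or None."""
    keys = list(signatures.keys())
    S = np.vstack([signatures[k] for k in keys]).astype(float)
    T = np.asarray(tweet_vecs, dtype=float)

    # Normalize rows so that the dot product equals cosine similarity.
    S_unit = S / np.linalg.norm(S, axis=1, keepdims=True)
    T_unit = T / np.linalg.norm(T, axis=1, keepdims=True)
    sims = T_unit @ S_unit.T  # shape: (n_tweets, n_events)

    # Keep the closest signature only if it clears the cutoff.
    best = sims.argmax(axis=1)
    return [keys[j] if sims[i, j] >= min_similarity else None
            for i, j in enumerate(best)]


signatures_demo = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
tweets_demo = np.array([[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]])
assigned = assign_to_signatures(tweets_demo, signatures_demo)
print(assigned)  # [0, 1, None]
```

Comparing against the signature rather than against individual articles means a tweet is matched to the event as a whole, and a tweet far from every signature is left unassigned.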

### Saving 

In [24]:
### Saving the clean DataFrame for later use in tweet assignment
clean_news_df.to_csv('../data/for_models/clean_news_week_SVB.csv', index=False)

# ### Saving the clean DataFrame for later use in tweet assignment
# clean_news_df.to_csv('../data/for_models/clean_news_week_AI.csv', index=False)