# News clustering: Exploring different clustering methods

Objective: The goal is to partition news articles from a specific timeframe (a 7-day sliding window) into semantically correlated groups, where each cluster ideally represents a real-world financial event.
- Algorithm: Use Agglomerative Clustering (hierarchical). This is a bottom-up approach where each article begins as its own cluster before being successively merged with others.
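- Distance Metric: Use Cosine Distance, a standard choice for comparing high-dimensional text vectors because it depends on the direction of the embeddings rather than their magnitude.
- Linkage Criterion: Use Average Linkage, which defines the distance between two clusters as the average of all pairwise distances between their observations; at each step, the two clusters with the smallest such distance are merged.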
- Determining $k$ (Number of Clusters): Since the number of events fluctuates daily, the optimal $k$ is found using the Silhouette Maximization method.
    - Iterate the algorithm for $k$ values ranging from 2 to 10.
    - Calculate the average Silhouette Coefficient for each $k$.
    - Select the value of $k$ that yields the highest score.
- Centroid Calculation: Once clusters are formed, compute a centroid for each group using the median of the cluster embeddings, as it is more robust to noise than the mean.
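
The selection procedure in the list above can be sketched as follows. This is a minimal illustration on toy data, not the project's `run_clustering_evaluation`: the function name `select_k_by_silhouette` and the synthetic `X_demo` are hypothetical, and the sketch uses SciPy's `linkage`/`fcluster` in place of scikit-learn's `AgglomerativeClustering` so that the merge tree is built once and simply cut at each candidate $k$.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score


def select_k_by_silhouette(X, k_values=range(2, 11)):
    """Return (best_k, scores) for average-linkage HAC under cosine distance."""
    # Build the merge tree once; each k is a different cut of the same tree.
    Z = linkage(X, method="average", metric="cosine")
    scores = {}
    for k in k_values:
        # Cut the tree so that at most k flat clusters are formed.
        labels = fcluster(Z, t=k, criterion="maxclust")
        n_found = len(np.unique(labels))
        # The silhouette is only defined for 2 <= n_clusters <= n_samples - 1.
        if n_found < 2 or n_found >= len(X):
            continue
        scores[k] = silhouette_score(X, labels, metric="cosine")
    # Keep the k with the highest average silhouette coefficient.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (assumption): three tight groups pointing in different directions.
rng = np.random.default_rng(0)
centers = np.eye(5)[:3]
X_demo = np.vstack([c + 0.05 * rng.normal(size=(20, 5)) for c in centers])

best_k, scores = select_k_by_silhouette(X_demo)
print(best_k)
```

On real embeddings, `X_demo` would be replaced by the matrix of article vectors for the 7-day window being clustered.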

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from pyclustering.cluster.kmedians import kmedians
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cosine
import os
import sys
import re

sys.path.append(os.path.abspath(os.path.join('..')))
from src.news_clustering import *

### Data preparation

In [2]:
news_features = pd.read_csv('../data/for_models/news_features.csv')

In [3]:
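# Convert the string representation of an embedding back into a NumPy array
def string_to_array(s):
    # Strip the brackets and line breaks left over from the array's printed form
    s = re.sub(r'[\[\]\n]', '', s)
    # Parse the remaining whitespace-separated values as floats
    return np.fromstring(s, sep=' ')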

# We transform 'embedding' column into a stack of numpy arrays
X = np.stack(news_features['embedding'].apply(string_to_array).values)

### Choice of k clusters

In [4]:
res = run_clustering_evaluation(X)
print(res)

    k  Agglomerative   K-Means  K-Medians
0   2       0.606183  0.136817   0.156700
1   3       0.513218  0.118169   0.086862
2   4       0.428408  0.115928   0.108827
3   5       0.360442  0.087585   0.034273
4   6       0.304025  0.081085   0.045612
5   7       0.277199  0.083255   0.029174
6   8       0.239536  0.084655   0.052170
7   9       0.218274  0.081586   0.038223
8  10       0.210957  0.079494   0.053195


In [5]:
plot_clustering_comparison(res)

In [10]:
START_DATE = "2023-11-10"
END_DATE = "2023-11-17"
fig = visualize_hac_tsne_range(X, news_features, start_date=START_DATE, end_date=END_DATE, k=6, perplexity=10)
fig.show()

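Based on the silhouette scores above, we choose $k = 2$: Agglomerative Clustering reaches its highest score there (0.606), and it outperforms K-Means and K-Medians at every $k$ tested.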

### Centroid computation for Hierarchical Clustering

Unlike K-Means, Agglomerative Clustering (HAC) does not inherently define cluster centers. However, for the subsequent Tweet Assignment phase, we must compare each tweet to the detected "events." The centroid serves as the numerical signature of the financial event.

The Carta et al. paper specifies using the median of the cluster vectors rather than the mean, as it is significantly more robust to outliers (noisy articles) that may remain within the group.
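
As a rough illustration of why the median is preferred, the sketch below computes one centroid per cluster and compares it with the mean on a toy cluster containing a single outlier. It assumes the median is taken coordinate by coordinate; the helper name `median_centroids` and the toy arrays are hypothetical and are not the implementation used in the next step.

```python
import numpy as np


def median_centroids(X, labels):
    """Map each cluster label to the coordinate-wise median of its member vectors."""
    labels = np.asarray(labels)
    return {
        int(label): np.median(X[labels == label], axis=0)
        for label in np.unique(labels)
    }


# Toy cluster (assumption): three similar vectors and one noisy outlier.
X_demo = np.array([[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [10.0, 10.0]])
labels_demo = np.array([0, 0, 0, 0])

centroids = median_centroids(X_demo, labels_demo)
mean_centroid = X_demo.mean(axis=0)

# The median stays close to the three similar vectors; the mean is pulled toward the outlier.
print(centroids[0], mean_centroid)
```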

This will be developed in the next step.