# CSAIEvaluator Demo with the 20 Newsgroups Dataset
This notebook demonstrates how to use CSAIEvaluator to evaluate clustering stability on a 5,000-document sample of the 20 Newsgroups dataset (loaded with `fetch_20newsgroups`).
The pipeline uses SBERT for embeddings, UMAP for dimensionality reduction, and KMeans for clustering.

In [3]:
import pandas as pd
import numpy as np
import re
import warnings
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
import umap
from csai import CSAIEvaluator

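# Suppress the UMAP warning raised when random_state forces single-threaded execution.
# `message` is a regex matched from the start of the warning text, so the n_jobs value is left as a wildcard.
warnings.filterwarnings("ignore", message=r"n_jobs value .* overridden to 1 by setting random_state", category=UserWarning)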
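
The `message` argument of `warnings.filterwarnings` is treated as a regular expression that must match the start of the warning text, so a pattern that hard-codes one `n_jobs` value only silences that exact wording. The sketch below illustrates this with two made-up warning strings modelled on the UMAP message; the helper name `count_shown` and the sample strings are assumptions for illustration, not output captured from UMAP.

```python
import warnings

def count_shown(pattern, messages):
    """Return how many of `messages` survive an 'ignore' filter built from `pattern`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # start from a show-everything state
        warnings.filterwarnings("ignore", message=pattern, category=UserWarning)
        for text in messages:
            warnings.warn(text, UserWarning)
    return len(caught)

# Illustrative warning texts (assumed wording, differing only in the n_jobs value)
samples = [
    "n_jobs value 1 overridden to 1 by setting random_state.",
    "n_jobs value -1 overridden to 1 by setting random_state.",
]

literal = "n_jobs value 1 overridden to 1 by setting random_state"
wildcard = r"n_jobs value -?\d+ overridden to 1 by setting random_state"

print(count_shown(literal, samples))   # the "-1" variant is still shown
print(count_shown(wildcard, samples))  # both variants are suppressed
```

If the warning still appears in a run, comparing its exact text against the pattern in this way is a quick first check.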

In [5]:
def get_sbert_embeddings(texts):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts, convert_to_tensor=False)

def reduce_with_umap(df, emb_col="SBERT_Embedding", output_col="key_umap", n_components=10):
    reducer = umap.UMAP(n_components=n_components, random_state=42)
    emb_array = np.array(df[emb_col].tolist())
    reduced = reducer.fit_transform(emb_array)
    df[output_col] = reduced.tolist()
    return df
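
Both helpers rely on storing one vector per DataFrame row as a Python list and rebuilding a 2-D array with `np.array(df[col].tolist())`. The following sketch checks that round trip on random stand-in data rather than real SBERT output; the 384-column width matches the `all-MiniLM-L6-v2` embedding size, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for model.encode(...) output: 4 rows of 384-dimensional vectors
fake_embeddings = rng.normal(size=(4, 384))

df = pd.DataFrame({"text": ["a", "b", "c", "d"]})
df["SBERT_Embedding"] = fake_embeddings.tolist()  # one Python list per row

# Same conversion that reduce_with_umap applies before fitting UMAP
restored = np.array(df["SBERT_Embedding"].tolist())
print(restored.shape)
```

Keeping vectors in a column is convenient for row-wise splitting, at the cost of converting back to an array before every numerical step.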

In [9]:
from sklearn.datasets import fetch_20newsgroups

def run_pipeline():
    newsgroups = fetch_20newsgroups(subset='all')
    df = pd.DataFrame(newsgroups.data, columns=["text"])
    df = df.sample(n=5000, random_state=42).reset_index(drop=True)

    # Lowercase, then replace digit runs, punctuation, and whitespace runs with a single space each
    texts = df["text"].fillna("").apply(lambda x: re.sub(r"\d+|[^\w\s]|\s+", " ", x.lower()).strip()).tolist()

    embeddings = get_sbert_embeddings(texts)
    df["SBERT_Embedding"] = embeddings.tolist()

    df = reduce_with_umap(df, emb_col="SBERT_Embedding", output_col="key_umap", n_components=10)
    return df
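
To make the text-cleaning step concrete, the sketch below applies the same regular expression to a short string. Because each match is replaced separately, adjacent punctuation and whitespace leave consecutive spaces behind; the second function, `clean_collapsed`, is a hypothetical variant (not part of the notebook) that collapses them.

```python
import re

def clean_as_in_notebook(text):
    # Same expression as in run_pipeline: digits, punctuation, and whitespace runs become " "
    return re.sub(r"\d+|[^\w\s]|\s+", " ", text.lower()).strip()

def clean_collapsed(text):
    # Hypothetical variant: strip digits and punctuation, then collapse whitespace via split/join
    return " ".join(re.sub(r"\d+|[^\w\s]", " ", text.lower()).split())

sample = "Hello, World! 123"
print(repr(clean_as_in_notebook(sample)))  # 'hello  world' (two spaces)
print(repr(clean_collapsed(sample)))       # 'hello world'
```

Note that underscores count as word characters (`\w`), so neither version removes them.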


In [11]:
# Run the embedding pipeline, then split the result 70/30 into train and test sets
df_processed = run_pipeline()
X_train, X_test = train_test_split(df_processed, test_size=0.3, random_state=42)
X_train.shape, X_test.shape

((3500, 3), (1500, 3))

In [14]:
# Define a clustering label function (e.g., KMeans) that returns the labels and the fitted model
def kmeans_label_func(embeddings, n_clusters=6):
    model = KMeans(n_clusters=n_clusters, random_state=42)
    labels = model.fit_predict(embeddings)
    return labels, model
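
Before handing the label function to the evaluator, it can be exercised on its own. The sketch below runs it on three well-separated synthetic blobs (made-up data, not the notebook's UMAP output) and inspects the `(labels, model)` pair it returns.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_label_func(embeddings, n_clusters=6):
    # Same definition as in the notebook cell above
    model = KMeans(n_clusters=n_clusters, random_state=42)
    labels = model.fit_predict(embeddings)
    return labels, model

# Synthetic 2-D data: 50 points around each of three distant centers
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

labels, model = kmeans_label_func(points, n_clusters=3)
print(labels.shape, model.cluster_centers_.shape)
```

Any other clusterer could presumably be swapped in as long as it keeps the same call signature and return pair, since that is all the call in the next cell passes along; this is an inference from the usage shown here, not a documented guarantee.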

In [13]:
# Evaluate with CSAIEvaluator
csai = CSAIEvaluator()
score = csai.run_csai_evaluation(
    X_train, X_test,
    key_col="key_umap",
    label_func=kmeans_label_func,
    n_splits=5
)
print("\nCSAIEvaluator Score:", score)

Sample 1 | CSAI per cluster: [0.0917, 0.0726, 0.3792, 0.0599, 0.151, 0.3071] |  CSAI across all clusters: 0.1769
Sample 2 | CSAI per cluster: [0.4007, 0.5892, 0.6028, 0.1161, 0.5832, 0.0527] |  CSAI across all clusters: 0.3908
Sample 3 | CSAI per cluster: [0.0427, 0.2734, 0.1875, 0.1537, 0.5796, 0.0077] |  CSAI across all clusters: 0.2074
Sample 4 | CSAI per cluster: [0.4103, 0.2025, 0.3228, 0.1819, 0.0501, 0.0581] |  CSAI across all clusters: 0.2043
Sample 5 | CSAI per cluster: [0.1667, 0.3858, 0.2506, 0.3021, 0.1931, 0.1928] |  CSAI across all clusters: 0.2485

Overall CSAI across all samples: 0.2456

CSAIEvaluator Score: 0.24558853084328952
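
The printed figures are consistent with simple averaging: each "CSAI across all clusters" value equals the mean of that sample's six per-cluster values, and the overall score equals the mean of the five sample values. The check below reproduces this from the rounded numbers in the output; it is an observation about this run's output, not a description of how `csai` computes the score internally.

```python
from statistics import mean

# Per-cluster CSAI values copied from the output above
per_cluster = [
    [0.0917, 0.0726, 0.3792, 0.0599, 0.151, 0.3071],
    [0.4007, 0.5892, 0.6028, 0.1161, 0.5832, 0.0527],
    [0.0427, 0.2734, 0.1875, 0.1537, 0.5796, 0.0077],
    [0.4103, 0.2025, 0.3228, 0.1819, 0.0501, 0.0581],
    [0.1667, 0.3858, 0.2506, 0.3021, 0.1931, 0.1928],
]

sample_means = [round(mean(row), 4) for row in per_cluster]
overall = round(mean(mean(row) for row in per_cluster), 4)

print(sample_means)  # [0.1769, 0.3908, 0.2074, 0.2043, 0.2485]
print(overall)       # 0.2456
```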
