# Enhancing Clustering with Large Language Models

This Jupyter notebook demonstrates a method for improving text clustering by combining key phrases generated by a large language model with embeddings from a pre-trained encoder. Both the original texts and their generated key phrases are embedded, and the resulting vectors are clustered with a traditional algorithm.

We implemented this code ourselves from scratch.

## Objective
The objective of this notebook is to compare the clustering outcomes of traditional methods with those enhanced by embeddings from a language model, specifically focusing on improvements in clustering accuracy and relevance.

## Dataset
The dataset used in this example consists of text data requiring semantic understanding for effective clustering. We use embeddings to capture deeper linguistic and semantic features that standard vectorization methods might miss.

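##### BBC News dataset
This dataset, intended for extractive text summarization, has 417 political news articles from the BBC (2004 to 2005) in its News Articles folder. The CSV files loaded below hold between 250 and 2,225 articles, each with a `text` column and a `category` column.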

In [1]:
import pandas as pd
import numpy as np
import openai
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics import adjusted_rand_score
from transformers import DistilBertTokenizer, DistilBertModel
import time

In [2]:
import numpy as np
from scipy.optimize import linear_sum_assignment as hungarian
from sklearn.metrics.cluster import normalized_mutual_info_score, adjusted_rand_score, adjusted_mutual_info_score

cluster_nmi = normalized_mutual_info_score


def cluster_acc(y_true, y_pred):
    """Clustering accuracy under the best one-to-one mapping of cluster ids to labels."""
    y_true = y_true.astype(np.int64)
    y_pred = y_pred.astype(np.int64)
    assert y_pred.size == y_true.size
    D = max(y_pred.max(), y_true.max()) + 1
    # w[i, j] counts the samples assigned to cluster i whose true label is j
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1

    # The Hungarian algorithm minimizes cost, so convert counts into costs
    row_ind, col_ind = hungarian(w.max() - w)
    return sum([w[i, j] for i, j in zip(row_ind, col_ind)]) * 1.0 / y_pred.size
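
As a quick sanity check of the idea behind `cluster_acc`, the sketch below recomputes a Hungarian-matched accuracy on toy labels. The function name `matched_accuracy` and the toy arrays are illustrative assumptions and are not part of the notebook's pipeline.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def matched_accuracy(y_true, y_pred):
    """Accuracy after the best one-to-one mapping of cluster ids to class ids."""
    y_true = np.asarray(y_true, dtype=np.int64)
    y_pred = np.asarray(y_pred, dtype=np.int64)
    n = max(y_true.max(), y_pred.max()) + 1
    # Contingency table: rows are predicted clusters, columns are true classes.
    counts = np.zeros((n, n), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        counts[p, t] += 1
    # Ask the solver to maximize the number of matched samples directly.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / y_true.size


# Toy labels (made up): the grouping is perfect but uses different ids.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_perm = np.array([2, 2, 0, 0, 1, 1])
# Same as y_perm, except one point of class 1 lands in the wrong cluster.
y_noisy = np.array([2, 2, 0, 2, 1, 1])

acc_perm = matched_accuracy(y_true, y_perm)
acc_noisy = matched_accuracy(y_true, y_noisy)
print(acc_perm, acc_noisy)
```

The permuted labeling scores 1.0 because cluster ids carry no meaning on their own, while the noisy labeling scores 5/6 since one of the six points cannot be matched.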

In [3]:
import os
from dotenv import load_dotenv

load_dotenv()  # Load the .env file

openai.api_key = os.getenv('OPENAI_API_KEY')
if openai.api_key is None:
    raise ValueError("API key is not set.")
else:
    print("API Key loaded successfully.")


API Key loaded successfully.


In [71]:
SubSets=[]

# Load the dataset which contains 250 articles
data250 = pd.read_csv('bbc_news_subset_250artcl.csv')
texts250 = data250['text'].tolist()
SubSets.append(texts250)

# Load the dataset which contains 500 articles
data500 = pd.read_csv('bbc_news_subset_500artcl.csv')
texts500 = data500['text'].tolist()
SubSets.append(texts500)

# Load the dataset which contains 1000 articles
data1000 = pd.read_csv('bbc_news_subset_1000artcl.csv')
texts1000 = data1000['text'].tolist()
SubSets.append(texts1000)

# Load the dataset which contains 1500 articles
data1500 = pd.read_csv('bbc_news_subset_1500artcl.csv')
texts1500 = data1500['text'].tolist()
SubSets.append(texts1500)

# Load the dataset which contains 1850 articles
data1850 = pd.read_csv('bbc_news_subset_1850artcl.csv')
texts1850 = data1850['text'].tolist()
SubSets.append(texts1850)

# Load the full dataset, which contains 2225 articles
data2225 = pd.read_csv('bbc_news_full_data_2225.csv')
texts2225 = data2225['text'].tolist()
SubSets.append(texts2225)

# Load the dataset Bank77
dataBank77 = pd.read_csv('Bank77.csv')
textsBank77 = dataBank77['text'].tolist()
SubSets.append(textsBank77)

In [72]:
len(SubSets)
for subset in SubSets :
    print(len(subset))

250
500
1000
1500
1850
2225
3080


In [6]:
# Load the DistilBERT tokenizer and model used for embeddings
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

### Key Phrase Generation using GPT-3.5 Turbo

We use a language model (here `gpt-3.5-turbo`, though another chat model could be substituted) to generate key phrases for each text. These key phrases aim to capture the main themes and concepts of each entry, giving the clustering step an additional, more focused semantic signal.


In [7]:
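# Generate key phrases for a text with the chat completion API
# Note: openai.ChatCompletion is the interface of the openai package before version 1.0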
def generate_keyphrases(text):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Generate keyphrases that describe the intent of this text."},
            {"role": "user", "content": text}
        ],
        max_tokens=50
    )
    keyphrases = response.choices[0].message['content'].strip()
    return keyphrases

### Embedding Generation

After generating key phrases, we use the DistilBERT model to create embeddings for both the original text and the generated key phrases. Each embedding is the final hidden state of the `[CLS]` token, and the two embeddings are later concatenated into a single vector per document. These vectors represent the texts in a high-dimensional space, capturing semantic and syntactic nuances that matter for clustering.


In [8]:
def encode_text(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
    outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].detach().numpy()  # CLS token representation

In [73]:
# Vectorization for simple clustering
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)  # Other values of max_features are worth testing

X_simple250 = vectorizer.fit_transform(texts250)
X_simple500 = vectorizer.fit_transform(texts500)
X_simple1000 = vectorizer.fit_transform(texts1000)
X_simple1500 = vectorizer.fit_transform(texts1500)
X_simple1850 = vectorizer.fit_transform(texts1850)
X_simple2225 = vectorizer.fit_transform(texts2225)
X_simpleBank77 = vectorizer.fit_transform(textsBank77)

In [10]:
X_simple250

<250x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 30180 stored elements in Compressed Sparse Row format>
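
Because `fit_transform` is called once per subset, each TF-IDF matrix is built on its own vocabulary and IDF weights, so column indices are not comparable across subsets. The small sketch below illustrates this with two made-up corpora; the corpora and variable names are assumptions chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two tiny made-up corpora standing in for two subsets.
corpus_a = ["the market rallied today", "shares fell on the market"]
corpus_b = [
    "the team won the match",
    "the striker scored twice",
    "fans cheered the team",
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)

X_a = vectorizer.fit_transform(corpus_a)
vocab_a = set(vectorizer.vocabulary_)  # vocabulary learned from corpus_a

# Calling fit_transform again discards the previous vocabulary and refits.
X_b = vectorizer.fit_transform(corpus_b)
vocab_b = set(vectorizer.vocabulary_)

print(X_a.shape, X_b.shape)
print(vocab_a & vocab_b)  # terms shared between the two fits
```

This is harmless here because each subset is clustered on its own, but a matrix from one fit should not be mixed with a model trained on another.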

### Clustering

This section describes the clustering process. We use the `KMeans` algorithm from scikit-learn to perform clustering on both traditionally vectorized text and the enhanced embeddings. The goal is to observe the differences in clustering performance, gauging the impact of using language model embeddings.


In [154]:
# Clustering
kmeans_simple = KMeans(n_clusters=5, random_state=42)
kmeans_enhanced = KMeans(n_clusters=5, random_state=42)

In [None]:
# Enhanced vectorization using LLM keyphrases
enhanced_vectors250 = []
cpt=1
for text in texts250:
    keyphrase = generate_keyphrases(text)
    text_vector = encode_text(text)
    keyphrase_vector = encode_text(keyphrase)
    concatenated_vector = np.concatenate((text_vector, keyphrase_vector), axis=1)
    enhanced_vectors250.append(concatenated_vector.squeeze())
    print(cpt, end=" ")
    cpt=cpt+1

# Convert list to array
#enhanced_vectors250 = np.array(enhanced_vectors250)
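
To make the shapes in the loop above concrete, here is a sketch that replaces the DistilBERT outputs with random stand-in arrays. It assumes the hidden size of `distilbert-base-uncased` is 768, so that each `encode_text` call returns an array of shape `(1, 768)`; the random values are placeholders, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768  # hidden size of distilbert-base-uncased

enhanced = []
for _ in range(3):  # three stand-in documents
    # Stand-ins for encode_text(text) and encode_text(keyphrase).
    text_vector = rng.normal(size=(1, hidden_size))
    keyphrase_vector = rng.normal(size=(1, hidden_size))
    # Concatenate along the feature axis: (1, 768) and (1, 768) give (1, 1536).
    concatenated = np.concatenate((text_vector, keyphrase_vector), axis=1)
    # Drop the batch axis so each document is a flat vector of length 1536.
    enhanced.append(concatenated.squeeze())

enhanced = np.array(enhanced)  # stack into (n_documents, 1536)
print(enhanced.shape)
```

Each document therefore ends up as one row whose first half comes from the text and whose second half comes from its key phrases.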

In [15]:
# To load your vectors back after restarting the kernel
enhanced_vectors250 = np.load('enhanced_vectors250.npy')

In [144]:
np.save('enhanced_vectors250.npy', enhanced_vectors250)


In [30]:
simple_labels250 = kmeans_simple.fit_predict(X_simple250)
enhanced_labels250 = kmeans_enhanced.fit_predict(enhanced_vectors250)

### Evaluation

To evaluate the effectiveness of our clustering approach, we calculate Normalized Mutual Information (NMI), the adjusted Rand index (reported as `rand_score` in the outputs), and clustering accuracy (ACC). These metrics quantify how much clustering performance changes when language model embeddings are incorporated.
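
As a minimal illustration of how these scores behave, the sketch below evaluates two made-up clusterings against made-up string categories. The labels are assumptions for demonstration only and are unrelated to the BBC data.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Made-up ground truth with string categories, as in a `category` column.
truth = ["sport", "sport", "tech", "tech", "politics", "politics"]

# Same grouping as the truth, but with arbitrary integer cluster ids.
perfect = [2, 2, 0, 0, 1, 1]
# Every document in one cluster: no information about the categories.
single = [0, 0, 0, 0, 0, 0]

nmi_perfect = normalized_mutual_info_score(truth, perfect)
ari_perfect = adjusted_rand_score(truth, perfect)
nmi_single = normalized_mutual_info_score(truth, single)
ari_single = adjusted_rand_score(truth, single)
print(nmi_perfect, ari_perfect, nmi_single, ari_single)
```

Both scores depend only on how documents are grouped, not on the ids themselves, which is why string categories can be compared with integer cluster labels directly. Accuracy, by contrast, needs an explicit mapping from clusters to classes.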


In [22]:
from sklearn.preprocessing import LabelEncoder
# Evaluation with random_state=42
nmi_simple250 = normalized_mutual_info_score(data250['category'].values, simple_labels250)
nmi_enhanced250 = normalized_mutual_info_score(data250['category'].values, enhanced_labels250)
print(f"Simple Clustering (250 articles) - NMI: {nmi_simple250}")
print(f"Enhanced Clustering (250 articles) - NMI: {nmi_enhanced250}")

print("---------------------------------------------------------------------------------")

rand_score_simple250 = adjusted_rand_score(data250['category'].values, simple_labels250)
rand_score_enhanced250 = adjusted_rand_score(data250['category'].values, enhanced_labels250)
print(f"Simple Clustering (250 articles) - rand_score: {rand_score_simple250}")
print(f"Enhanced Clustering (250 articles) - rand_score: {rand_score_enhanced250}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data250['category'].values)

acc_simple250 = cluster_acc(np.array(y_true), np.array(simple_labels250))
acc_enhanced250 = cluster_acc(np.array(y_true), np.array(enhanced_labels250))
print(f"Simple Clustering (250 articles) - acc: {acc_simple250}")
print(f"Enhanced Clustering (250 articles) - acc: {acc_enhanced250}")

Simple Clustering (250 articles) - NMI: 0.3691888138647679
Enhanced Clustering (250 articles) - NMI: 0.8812947638142967
---------------------------------------------------------------------------------
Simple Clustering (250 articles) - rand_score: 0.26875217415654556
Enhanced Clustering (250 articles) - rand_score: 0.8925785837313083
---------------------------------------------------------------------------------
Simple Clustering (250 articles) - acc: 0.496
Enhanced Clustering (250 articles) - acc: 0.956


In [28]:
# Evaluation with random_state=25
nmi_simple250 = normalized_mutual_info_score(data250['category'].values, simple_labels250)
nmi_enhanced250 = normalized_mutual_info_score(data250['category'].values, enhanced_labels250)
print(f"Simple Clustering (250 articles) - NMI: {nmi_simple250}")
print(f"Enhanced Clustering (250 articles) - NMI: {nmi_enhanced250}")
print("---------------------------------------------------------------------------------")

rand_score_simple250 = adjusted_rand_score(data250['category'].values, simple_labels250)
rand_score_enhanced250 = adjusted_rand_score(data250['category'].values, enhanced_labels250)
print(f"Simple Clustering (250 articles) - rand_score: {rand_score_simple250}")
print(f"Enhanced Clustering (250 articles) - rand_score: {rand_score_enhanced250}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data250['category'].values)

acc_simple250 = cluster_acc(np.array(y_true), np.array(simple_labels250))
acc_enhanced250 = cluster_acc(np.array(y_true), np.array(enhanced_labels250))
print(f"Simple Clustering (250 articles) - acc: {acc_simple250}")
print(f"Enhanced Clustering (250 articles) - acc: {acc_enhanced250}")

Simple Clustering (250 articles) - NMI: 0.39377379173306376
Enhanced Clustering (250 articles) - NMI: 0.668001112361311
---------------------------------------------------------------------------------
Simple Clustering (250 articles) - rand_score: 0.22410793577319602
Enhanced Clustering (250 articles) - rand_score: 0.5856754067260372
---------------------------------------------------------------------------------
Simple Clustering (250 articles) - acc: 0.544
Enhanced Clustering (250 articles) - acc: 0.776


In [25]:
# Evaluation with random_state=5
nmi_simple250 = normalized_mutual_info_score(data250['category'].values, simple_labels250)
nmi_enhanced250 = normalized_mutual_info_score(data250['category'].values, enhanced_labels250)
print(f"Simple Clustering (250 articles) - NMI: {nmi_simple250}")
print(f"Enhanced Clustering (250 articles) - NMI: {nmi_enhanced250}")
print("---------------------------------------------------------------------------------")

rand_score_simple250 = adjusted_rand_score(data250['category'].values, simple_labels250)
rand_score_enhanced250 = adjusted_rand_score(data250['category'].values, enhanced_labels250)
print(f"Simple Clustering (250 articles) - rand_score: {rand_score_simple250}")
print(f"Enhanced Clustering (250 articles) - rand_score: {rand_score_enhanced250}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data250['category'].values)

acc_simple250 = cluster_acc(np.array(y_true), np.array(simple_labels250))
acc_enhanced250 = cluster_acc(np.array(y_true), np.array(enhanced_labels250))
print(f"Simple Clustering (250 articles) - acc: {acc_simple250}")
print(f"Enhanced Clustering (250 articles) - acc: {acc_enhanced250}")

Simple Clustering (250 articles) - NMI: 0.34984114398569727
Enhanced Clustering (250 articles) - NMI: 0.6569367998503989
---------------------------------------------------------------------------------
Simple Clustering (250 articles) - rand_score: 0.29408765289664834
Enhanced Clustering (250 articles) - rand_score: 0.5841953625342647
---------------------------------------------------------------------------------
Simple Clustering (250 articles) - acc: 0.56
Enhanced Clustering (250 articles) - acc: 0.704


In [31]:
# Evaluation with random_state=100
nmi_simple250 = normalized_mutual_info_score(data250['category'].values, simple_labels250)
nmi_enhanced250 = normalized_mutual_info_score(data250['category'].values, enhanced_labels250)
print(f"Simple Clustering (250 articles) - NMI: {nmi_simple250}")
print(f"Enhanced Clustering (250 articles) - NMI: {nmi_enhanced250}")
print("---------------------------------------------------------------------------------")

rand_score_simple250 = adjusted_rand_score(data250['category'].values, simple_labels250)
rand_score_enhanced250 = adjusted_rand_score(data250['category'].values, enhanced_labels250)
print(f"Simple Clustering (250 articles) - rand_score: {rand_score_simple250}")
print(f"Enhanced Clustering (250 articles) - rand_score: {rand_score_enhanced250}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data250['category'].values)

acc_simple250 = cluster_acc(np.array(y_true), np.array(simple_labels250))
acc_enhanced250 = cluster_acc(np.array(y_true), np.array(enhanced_labels250))
print(f"Simple Clustering (250 articles) - acc: {acc_simple250}")
print(f"Enhanced Clustering (250 articles) - acc: {acc_enhanced250}")

Simple Clustering (250 articles) - NMI: 0.4812913590475506
Enhanced Clustering (250 articles) - NMI: 0.8865459918553521
---------------------------------------------------------------------------------
Simple Clustering (250 articles) - rand_score: 0.41286990104170157
Enhanced Clustering (250 articles) - rand_score: 0.9018833290502588
---------------------------------------------------------------------------------
Simple Clustering (250 articles) - acc: 0.712
Enhanced Clustering (250 articles) - acc: 0.96
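
The four evaluation cells above appear to differ only in the `random_state` given to `KMeans`, and the scores vary considerably between seeds. One way to summarize such runs is to loop over the seeds and report the mean and standard deviation. The sketch below does this on synthetic blobs from `make_blobs` rather than on the BBC data; the helper name `evaluate_over_seeds`, the explicit `n_init=10`, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def evaluate_over_seeds(X, y_true, seeds, n_clusters=5):
    """Fit KMeans once per seed and collect NMI and ARI for each run."""
    rows = []
    for seed in seeds:
        # A fresh estimator per seed keeps the runs independent.
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels = kmeans.fit_predict(X)
        rows.append((
            normalized_mutual_info_score(y_true, labels),
            adjusted_rand_score(y_true, labels),
        ))
    return np.array(rows)  # shape: (len(seeds), 2)


# Synthetic stand-in data: 250 points in 5 groups (not the BBC articles).
X, y = make_blobs(n_samples=250, centers=5, n_features=16, random_state=0)

scores = evaluate_over_seeds(X, y, seeds=[42, 25, 5, 100])
mean_nmi, mean_ari = scores.mean(axis=0)
std_nmi, std_ari = scores.std(axis=0)
print(f"NMI: {mean_nmi:.3f} +/- {std_nmi:.3f}")
print(f"ARI: {mean_ari:.3f} +/- {std_ari:.3f}")
```

Reporting a spread alongside the mean makes it easier to tell a real improvement from seed-to-seed variation.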


In [None]:
# Enhanced vectorization using LLM keyphrases
enhanced_vectors500 = []
cpt=1
for text in texts500:
    keyphrase = generate_keyphrases(text)
    text_vector = encode_text(text)
    keyphrase_vector = encode_text(keyphrase)
    concatenated_vector = np.concatenate((text_vector, keyphrase_vector), axis=1)
    enhanced_vectors500.append(concatenated_vector.squeeze())
    print(cpt, end=" ")
    cpt=cpt+1

# Convert list to array
#enhanced_vectors500 = np.array(enhanced_vectors500)

In [147]:
np.save('enhanced_vectors500.npy', enhanced_vectors500)


In [34]:
# To load your vectors back after restarting the kernel
enhanced_vectors500 = np.load('enhanced_vectors500.npy')

In [43]:
simple_labels500 = kmeans_simple.fit_predict(X_simple500)
enhanced_labels500 = kmeans_enhanced.fit_predict(enhanced_vectors500)

In [38]:
# Evaluation with random_state=42
nmi_simple500 = normalized_mutual_info_score(data500['category'].values, simple_labels500)
nmi_enhanced500 = normalized_mutual_info_score(data500['category'].values, enhanced_labels500)
print(f"Simple Clustering (500 articles) - NMI: {nmi_simple500}")
print(f"Enhanced Clustering (500 articles) - NMI: {nmi_enhanced500}")
print("---------------------------------------------------------------------------------")

rand_score_simple500 = adjusted_rand_score(data500['category'].values, simple_labels500)
rand_score_enhanced500 = adjusted_rand_score(data500['category'].values, enhanced_labels500)
print(f"Simple Clustering (500 articles) - rand_score: {rand_score_simple500}")
print(f"Enhanced Clustering (500 articles) - rand_score: {rand_score_enhanced500}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data500['category'].values)

acc_simple500 = cluster_acc(np.array(y_true), np.array(simple_labels500))
acc_enhanced500 = cluster_acc(np.array(y_true), np.array(enhanced_labels500))
print(f"Simple Clustering (500 articles) - acc: {acc_simple500}")
print(f"Enhanced Clustering (500 articles) - acc: {acc_enhanced500}")

Simple Clustering (500 articles) - NMI: 0.6591928084086461
Enhanced Clustering (500 articles) - NMI: 0.697006647825004
---------------------------------------------------------------------------------
Simple Clustering (500 articles) - rand_score: 0.5917383033198403
Enhanced Clustering (500 articles) - rand_score: 0.585529733100578
---------------------------------------------------------------------------------
Simple Clustering (500 articles) - acc: 0.806
Enhanced Clustering (500 articles) - acc: 0.684


In [41]:
# Evaluation with random_state=5
nmi_simple500 = normalized_mutual_info_score(data500['category'].values, simple_labels500)
nmi_enhanced500 = normalized_mutual_info_score(data500['category'].values, enhanced_labels500)
print(f"Simple Clustering (500 articles) - NMI: {nmi_simple500}")
print(f"Enhanced Clustering (500 articles) - NMI: {nmi_enhanced500}")
print("---------------------------------------------------------------------------------")

rand_score_simple500 = adjusted_rand_score(data500['category'].values, simple_labels500)
rand_score_enhanced500 = adjusted_rand_score(data500['category'].values, enhanced_labels500)
print(f"Simple Clustering (500 articles) - rand_score: {rand_score_simple500}")
print(f"Enhanced Clustering (500 articles) - rand_score: {rand_score_enhanced500}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data500['category'].values)

acc_simple500 = cluster_acc(np.array(y_true), np.array(simple_labels500))
acc_enhanced500 = cluster_acc(np.array(y_true), np.array(enhanced_labels500))
print(f"Simple Clustering (500 articles) - acc: {acc_simple500}")
print(f"Enhanced Clustering (500 articles) - acc: {acc_enhanced500}")

Simple Clustering (500 articles) - NMI: 0.6395706990207467
Enhanced Clustering (500 articles) - NMI: 0.5546623961844159
---------------------------------------------------------------------------------
Simple Clustering (500 articles) - rand_score: 0.5921882528094293
Enhanced Clustering (500 articles) - rand_score: 0.49041029175700745
---------------------------------------------------------------------------------
Simple Clustering (500 articles) - acc: 0.8
Enhanced Clustering (500 articles) - acc: 0.67


In [44]:
# Evaluation with random_state=25
nmi_simple500 = normalized_mutual_info_score(data500['category'].values, simple_labels500)
nmi_enhanced500 = normalized_mutual_info_score(data500['category'].values, enhanced_labels500)
print(f"Simple Clustering (500 articles) - NMI: {nmi_simple500}")
print(f"Enhanced Clustering (500 articles) - NMI: {nmi_enhanced500}")
print("---------------------------------------------------------------------------------")

rand_score_simple500 = adjusted_rand_score(data500['category'].values, simple_labels500)
rand_score_enhanced500 = adjusted_rand_score(data500['category'].values, enhanced_labels500)
print(f"Simple Clustering (500 articles) - rand_score: {rand_score_simple500}")
print(f"Enhanced Clustering (500 articles) - rand_score: {rand_score_enhanced500}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data500['category'].values)

acc_simple500 = cluster_acc(np.array(y_true), np.array(simple_labels500))
acc_enhanced500 = cluster_acc(np.array(y_true), np.array(enhanced_labels500))
print(f"Simple Clustering (500 articles) - acc: {acc_simple500}")
print(f"Enhanced Clustering (500 articles) - acc: {acc_enhanced500}")

Simple Clustering (500 articles) - NMI: 0.3500062932703246
Enhanced Clustering (500 articles) - NMI: 0.7375642497041326
---------------------------------------------------------------------------------
Simple Clustering (500 articles) - rand_score: 0.2758815899412899
Enhanced Clustering (500 articles) - rand_score: 0.7093300001984009
---------------------------------------------------------------------------------
Simple Clustering (500 articles) - acc: 0.558
Enhanced Clustering (500 articles) - acc: 0.866


In [46]:
# Evaluation with random_state=100
nmi_simple500 = normalized_mutual_info_score(data500['category'].values, simple_labels500)
nmi_enhanced500 = normalized_mutual_info_score(data500['category'].values, enhanced_labels500)
print(f"Simple Clustering (500 articles) - NMI: {nmi_simple500}")
print(f"Enhanced Clustering (500 articles) - NMI: {nmi_enhanced500}")
print("---------------------------------------------------------------------------------")

rand_score_simple500 = adjusted_rand_score(data500['category'].values, simple_labels500)
rand_score_enhanced500 = adjusted_rand_score(data500['category'].values, enhanced_labels500)
print(f"Simple Clustering (500 articles) - rand_score: {rand_score_simple500}")
print(f"Enhanced Clustering (500 articles) - rand_score: {rand_score_enhanced500}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data500['category'].values)

acc_simple500 = cluster_acc(np.array(y_true), np.array(simple_labels500))
acc_enhanced500 = cluster_acc(np.array(y_true), np.array(enhanced_labels500))
print(f"Simple Clustering (500 articles) - acc: {acc_simple500}")
print(f"Enhanced Clustering (500 articles) - acc: {acc_enhanced500}")

Simple Clustering (500 articles) - NMI: 0.3500062932703246
Enhanced Clustering (500 articles) - NMI: 0.7375642497041326
---------------------------------------------------------------------------------
Simple Clustering (500 articles) - rand_score: 0.2758815899412899
Enhanced Clustering (500 articles) - rand_score: 0.7093300001984009
---------------------------------------------------------------------------------
Simple Clustering (500 articles) - acc: 0.558
Enhanced Clustering (500 articles) - acc: 0.866


In [68]:
simple_labels1000 = kmeans_simple.fit_predict(X_simple1000)
enhanced_labels1000 = kmeans_enhanced.fit_predict(enhanced_vectors1000)

In [None]:
# Enhanced vectorization using LLM keyphrases
enhanced_vectors1000 = []
cpt=1
for text in texts1000:
    keyphrase = generate_keyphrases(text)
    text_vector = encode_text(text)
    keyphrase_vector = encode_text(keyphrase)
    concatenated_vector = np.concatenate((text_vector, keyphrase_vector), axis=1)
    enhanced_vectors1000.append(concatenated_vector.squeeze())
    print(cpt, end=" ")
    cpt=cpt+1

# Convert list to array
enhanced_vectors1000 = np.array(enhanced_vectors1000)

In [149]:
np.save('enhanced_vectors1000.npy', enhanced_vectors1000)

In [57]:
# To load your vectors back after restarting the kernel
enhanced_vectors1000 = np.load('enhanced_vectors1000.npy')

In [62]:
# Evaluation with random_state=42
nmi_simple1000 = normalized_mutual_info_score(data1000['category'].values, simple_labels1000)
nmi_enhanced1000 = normalized_mutual_info_score(data1000['category'].values, enhanced_labels1000)
print(f"Simple Clustering (1000 articles) - NMI: {nmi_simple1000}")
print(f"Enhanced Clustering (1000 articles) - NMI: {nmi_enhanced1000}")
print("---------------------------------------------------------------------------------")

rand_score_simple1000 = adjusted_rand_score(data1000['category'].values, simple_labels1000)
rand_score_enhanced1000= adjusted_rand_score(data1000['category'].values, enhanced_labels1000)
print(f"Simple Clustering (1000 articles) - rand_score: {rand_score_simple1000}")
print(f"Enhanced Clustering (1000 articles) - rand_score: {rand_score_enhanced1000}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1000['category'].values)

acc_simple1000 = cluster_acc(np.array(y_true), np.array(simple_labels1000))
acc_enhanced1000 = cluster_acc(np.array(y_true), np.array(enhanced_labels1000))
print(f"Simple Clustering (1000 articles) - acc: {acc_simple1000}")
print(f"Enhanced Clustering (1000 articles) - acc: {acc_enhanced1000}")

Simple Clustering (1000 articles) - NMI: 0.7107070330249688
Enhanced Clustering (1000 articles) - NMI: 0.7897260227199223
---------------------------------------------------------------------------------
Simple Clustering (1000 articles) - rand_score: 0.5972473398593704
Enhanced Clustering (1000 articles) - rand_score: 0.8115837105436454
---------------------------------------------------------------------------------
Simple Clustering (1000 articles) - acc: 0.657
Enhanced Clustering (1000 articles) - acc: 0.919


In [53]:
# Evaluation with random_state=5
nmi_simple1000 = normalized_mutual_info_score(data1000['category'].values, simple_labels1000)
nmi_enhanced1000 = normalized_mutual_info_score(data1000['category'].values, enhanced_labels1000)
print(f"Simple Clustering (1000 articles) - NMI: {nmi_simple1000}")
print(f"Enhanced Clustering (1000 articles) - NMI: {nmi_enhanced1000}")
print("---------------------------------------------------------------------------------")

rand_score_simple1000 = adjusted_rand_score(data1000['category'].values, simple_labels1000)
rand_score_enhanced1000= adjusted_rand_score(data1000['category'].values, enhanced_labels1000)
print(f"Simple Clustering (1000 articles) - rand_score: {rand_score_simple1000}")
print(f"Enhanced Clustering (1000 articles) - rand_score: {rand_score_enhanced1000}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1000['category'].values)

acc_simple1000 = cluster_acc(np.array(y_true), np.array(simple_labels1000))
acc_enhanced1000 = cluster_acc(np.array(y_true), np.array(enhanced_labels1000))
print(f"Simple Clustering (1000 articles) - acc: {acc_simple1000}")
print(f"Enhanced Clustering (1000 articles) - acc: {acc_enhanced1000}")

Simple Clustering (1000 articles) - NMI: 0.6158248495703555
Enhanced Clustering (1000 articles) - NMI: 0.80441556056691
---------------------------------------------------------------------------------
Simple Clustering (1000 articles) - rand_score: 0.4228741170888662
Enhanced Clustering (1000 articles) - rand_score: 0.8284102538398725
---------------------------------------------------------------------------------
Simple Clustering (1000 articles) - acc: 0.605
Enhanced Clustering (1000 articles) - acc: 0.927


In [65]:
# Evaluation with random_state=25
nmi_simple1000 = normalized_mutual_info_score(data1000['category'].values, simple_labels1000)
nmi_enhanced1000 = normalized_mutual_info_score(data1000['category'].values, enhanced_labels1000)
print(f"Simple Clustering (1000 articles) - NMI: {nmi_simple1000}")
print(f"Enhanced Clustering (1000 articles) - NMI: {nmi_enhanced1000}")
print("---------------------------------------------------------------------------------")

rand_score_simple1000 = adjusted_rand_score(data1000['category'].values, simple_labels1000)
rand_score_enhanced1000= adjusted_rand_score(data1000['category'].values, enhanced_labels1000)
print(f"Simple Clustering (1000 articles) - rand_score: {rand_score_simple1000}")
print(f"Enhanced Clustering (1000 articles) - rand_score: {rand_score_enhanced1000}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1000['category'].values)

acc_simple1000 = cluster_acc(np.array(y_true), np.array(simple_labels1000))
acc_enhanced1000 = cluster_acc(np.array(y_true), np.array(enhanced_labels1000))
print(f"Simple Clustering (1000 articles) - acc: {acc_simple1000}")
print(f"Enhanced Clustering (1000 articles) - acc: {acc_enhanced1000}")

Simple Clustering (1000 articles) - NMI: 0.6306628049014542
Enhanced Clustering (1000 articles) - NMI: 0.7150178524384878
---------------------------------------------------------------------------------
Simple Clustering (1000 articles) - rand_score: 0.5470597443960421
Enhanced Clustering (1000 articles) - rand_score: 0.6790742988737554
---------------------------------------------------------------------------------
Simple Clustering (1000 articles) - acc: 0.658
Enhanced Clustering (1000 articles) - acc: 0.844


In [69]:
# Evaluation with random_state=100
nmi_simple1000 = normalized_mutual_info_score(data1000['category'].values, simple_labels1000)
nmi_enhanced1000 = normalized_mutual_info_score(data1000['category'].values, enhanced_labels1000)
print(f"Simple Clustering (1000 articles) - NMI: {nmi_simple1000}")
print(f"Enhanced Clustering (1000 articles) - NMI: {nmi_enhanced1000}")
print("---------------------------------------------------------------------------------")

rand_score_simple1000 = adjusted_rand_score(data1000['category'].values, simple_labels1000)
rand_score_enhanced1000= adjusted_rand_score(data1000['category'].values, enhanced_labels1000)
print(f"Simple Clustering (1000 articles) - rand_score: {rand_score_simple1000}")
print(f"Enhanced Clustering (1000 articles) - rand_score: {rand_score_enhanced1000}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1000['category'].values)

acc_simple1000 = cluster_acc(np.array(y_true), np.array(simple_labels1000))
acc_enhanced1000 = cluster_acc(np.array(y_true), np.array(enhanced_labels1000))
print(f"Simple Clustering (1000 articles) - acc: {acc_simple1000}")
print(f"Enhanced Clustering (1000 articles) - acc: {acc_enhanced1000}")

Simple Clustering (1000 articles) - NMI: 0.5926001524246777
Enhanced Clustering (1000 articles) - NMI: 0.7064683267253428
---------------------------------------------------------------------------------
Simple Clustering (1000 articles) - rand_score: 0.46364840219395304
Enhanced Clustering (1000 articles) - rand_score: 0.60312975762596
---------------------------------------------------------------------------------
Simple Clustering (1000 articles) - acc: 0.707
Enhanced Clustering (1000 articles) - acc: 0.678


In [100]:
# Enhanced vectorization using LLM keyphrases
enhanced_vectors1500 = []
cpt=1
total=1500
for text in texts1500:
    keyphrase = generate_keyphrases(text)
    text_vector = encode_text(text)
    keyphrase_vector = encode_text(keyphrase)
    concatenated_vector = np.concatenate((text_vector, keyphrase_vector), axis=1)
    enhanced_vectors1500.append(concatenated_vector.squeeze())
    print(f"{cpt}/{total}", end="\r")
    cpt=cpt+1

# Convert list to array
enhanced_vectors1500 = np.array(enhanced_vectors1500)
print("Vectorization complete")

Vectorization complete


In [150]:
np.save('enhanced_vectors1500.npy', enhanced_vectors1500)

In [77]:
# To load your vectors back after restarting the kernel
enhanced_vectors1500 = np.load('enhanced_vectors1500.npy')

In [88]:
simple_labels1500 = kmeans_simple.fit_predict(X_simple1500)
enhanced_labels1500 = kmeans_enhanced.fit_predict(enhanced_vectors1500)

In [79]:
# Evaluation with random_state=42
nmi_simple1500 = normalized_mutual_info_score(data1500['category'].values, simple_labels1500)
nmi_enhanced1500 = normalized_mutual_info_score(data1500['category'].values, enhanced_labels1500)
print(f"Simple Clustering (1500 articles) - NMI: {nmi_simple1500}")
print(f"Enhanced Clustering (1500 articles) - NMI: {nmi_enhanced1500}")
print("---------------------------------------------------------------------------------")

rand_score_simple1500 = adjusted_rand_score(data1500['category'].values, simple_labels1500)
rand_score_enhanced1500= adjusted_rand_score(data1500['category'].values, enhanced_labels1500)
print(f"Simple Clustering (1500 articles) - rand_score: {rand_score_simple1500}")
print(f"Enhanced Clustering (1500 articles) - rand_score: {rand_score_enhanced1500}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1500['category'].values)

acc_simple1500 = cluster_acc(np.array(y_true), np.array(simple_labels1500))
acc_enhanced1500 = cluster_acc(np.array(y_true), np.array(enhanced_labels1500))
print(f"Simple Clustering (1500 articles) - acc: {acc_simple1500}")
print(f"Enhanced Clustering (1500 articles) - acc: {acc_enhanced1500}")

Simple Clustering (1500 articles) - NMI: 0.6698545791513273
Enhanced Clustering (1500 articles) - NMI: 0.5585346108174444
---------------------------------------------------------------------------------
Simple Clustering (1500 articles) - rand_score: 0.5577444733793943
Enhanced Clustering (1500 articles) - rand_score: 0.49435265649272775
---------------------------------------------------------------------------------
Simple Clustering (1500 articles) - acc: 0.798
Enhanced Clustering (1500 articles) - acc: 0.6893333333333334


In [83]:
# Evaluation with random_state=5
nmi_simple1500 = normalized_mutual_info_score(data1500['category'].values, simple_labels1500)
nmi_enhanced1500 = normalized_mutual_info_score(data1500['category'].values, enhanced_labels1500)
print(f"Simple Clustering (1500 articles) - NMI: {nmi_simple1500}")
print(f"Enhanced Clustering (1500 articles) - NMI: {nmi_enhanced1500}")
print("---------------------------------------------------------------------------------")

rand_score_simple1500 = adjusted_rand_score(data1500['category'].values, simple_labels1500)
rand_score_enhanced1500= adjusted_rand_score(data1500['category'].values, enhanced_labels1500)
print(f"Simple Clustering (1500 articles) - rand_score: {rand_score_simple1500}")
print(f"Enhanced Clustering (1500 articles) - rand_score: {rand_score_enhanced1500}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1500['category'].values)

acc_simple1500 = cluster_acc(np.array(y_true), np.array(simple_labels1500))
acc_enhanced1500 = cluster_acc(np.array(y_true), np.array(enhanced_labels1500))
print(f"Simple Clustering (1500 articles) - acc: {acc_simple1500}")
print(f"Enhanced Clustering (1500 articles) - acc: {acc_enhanced1500}")

Simple Clustering (1500 articles) - NMI: 0.7329658171513952
Enhanced Clustering (1500 articles) - NMI: 0.6552321468221299
---------------------------------------------------------------------------------
Simple Clustering (1500 articles) - rand_score: 0.6727914744756127
Enhanced Clustering (1500 articles) - rand_score: 0.6081057643689253
---------------------------------------------------------------------------------
Simple Clustering (1500 articles) - acc: 0.852
Enhanced Clustering (1500 articles) - acc: 0.788


In [86]:
# Evaluation with random_state=25
nmi_simple1500 = normalized_mutual_info_score(data1500['category'].values, simple_labels1500)
nmi_enhanced1500 = normalized_mutual_info_score(data1500['category'].values, enhanced_labels1500)
print(f"Simple Clustering (1500 articles) - NMI: {nmi_simple1500}")
print(f"Enhanced Clustering (1500 articles) - NMI: {nmi_enhanced1500}")
print("---------------------------------------------------------------------------------")

rand_score_simple1500 = adjusted_rand_score(data1500['category'].values, simple_labels1500)
rand_score_enhanced1500= adjusted_rand_score(data1500['category'].values, enhanced_labels1500)
print(f"Simple Clustering (1500 articles) - rand_score: {rand_score_simple1500}")
print(f"Enhanced Clustering (1500 articles) - rand_score: {rand_score_enhanced1500}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1500['category'].values)

acc_simple1500 = cluster_acc(np.array(y_true), np.array(simple_labels1500))
acc_enhanced1500 = cluster_acc(np.array(y_true), np.array(enhanced_labels1500))
print(f"Simple Clustering (1500 articles) - acc: {acc_simple1500}")
print(f"Enhanced Clustering (1500 articles) - acc: {acc_enhanced1500}")

Simple Clustering (1500 articles) - NMI: 0.7651434623526583
Enhanced Clustering (1500 articles) - NMI: 0.7431781182204052
---------------------------------------------------------------------------------
Simple Clustering (1500 articles) - rand_score: 0.7676081062424845
Enhanced Clustering (1500 articles) - rand_score: 0.72524141097845
---------------------------------------------------------------------------------
Simple Clustering (1500 articles) - acc: 0.9013333333333333
Enhanced Clustering (1500 articles) - acc: 0.8746666666666667


In [89]:
# Evaluation with random_state=100
nmi_simple1500 = normalized_mutual_info_score(data1500['category'].values, simple_labels1500)
nmi_enhanced1500 = normalized_mutual_info_score(data1500['category'].values, enhanced_labels1500)
print(f"Simple Clustering (1500 articles) - NMI: {nmi_simple1500}")
print(f"Enhanced Clustering (1500 articles) - NMI: {nmi_enhanced1500}")
print("---------------------------------------------------------------------------------")

rand_score_simple1500 = adjusted_rand_score(data1500['category'].values, simple_labels1500)
rand_score_enhanced1500= adjusted_rand_score(data1500['category'].values, enhanced_labels1500)
print(f"Simple Clustering (1500 articles) - rand_score: {rand_score_simple1500}")
print(f"Enhanced Clustering (1500 articles) - rand_score: {rand_score_enhanced1500}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1500['category'].values)

acc_simple1500 = cluster_acc(np.array(y_true), np.array(simple_labels1500))
acc_enhanced1500 = cluster_acc(np.array(y_true), np.array(enhanced_labels1500))
print(f"Simple Clustering (1500 articles) - acc: {acc_simple1500}")
print(f"Enhanced Clustering (1500 articles) - acc: {acc_enhanced1500}")

Simple Clustering (1500 articles) - NMI: 0.5813579922501966
Enhanced Clustering (1500 articles) - NMI: 0.8105173511771397
---------------------------------------------------------------------------------
Simple Clustering (1500 articles) - rand_score: 0.3997995302586962
Enhanced Clustering (1500 articles) - rand_score: 0.8380617259982924
---------------------------------------------------------------------------------
Simple Clustering (1500 articles) - acc: 0.7393333333333333
Enhanced Clustering (1500 articles) - acc: 0.9313333333333333


In [119]:
# Enhanced vectorization using LLM keyphrases
enhanced_vectors1850 = []
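total = len(texts1850)
for cpt, text in enumerate(texts1850, start=1):
    # Generate keyphrases with the LLM, then embed the article and its keyphrases
    keyphrase = generate_keyphrases(text)
    text_vector = encode_text(text)
    keyphrase_vector = encode_text(keyphrase)
    # Join both embeddings along the feature axis into one vector per article
    concatenated_vector = np.concatenate((text_vector, keyphrase_vector), axis=1)
    enhanced_vectors1850.append(concatenated_vector.squeeze())
    print(f"{cpt}/{total}", end="\r")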

# Convert list to array
enhanced_vectors1850 = np.array(enhanced_vectors1850)
print("Vectorization completed successfully")

Vectorization completed successfully


In [151]:
np.save('enhanced_vectors1850.npy', enhanced_vectors1850)

In [93]:
# To load your vectors back after restarting the kernel
enhanced_vectors1850 = np.load('enhanced_vectors1850.npy')

In [107]:
simple_labels1850 = kmeans_simple.fit_predict(X_simple1850)
enhanced_labels1850 = kmeans_enhanced.fit_predict(enhanced_vectors1850)

In [95]:
# Evaluation with random_state=42
nmi_simple1850 = normalized_mutual_info_score(data1850['category'].values, simple_labels1850)
nmi_enhanced1850 = normalized_mutual_info_score(data1850['category'].values, enhanced_labels1850)
print(f"Simple Clustering (1850 articles) - NMI: {nmi_simple1850}")
print(f"Enhanced Clustering (1850 articles) - NMI: {nmi_enhanced1850}")
print("---------------------------------------------------------------------------------")

rand_score_simple1850 = adjusted_rand_score(data1850['category'].values, simple_labels1850)
rand_score_enhanced1850= adjusted_rand_score(data1850['category'].values, enhanced_labels1850)
print(f"Simple Clustering (1850 articles) - rand_score: {rand_score_simple1850}")
print(f"Enhanced Clustering (1850 articles) - rand_score: {rand_score_enhanced1850}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1850['category'].values)

acc_simple1850 = cluster_acc(np.array(y_true), np.array(simple_labels1850))
acc_enhanced1850 = cluster_acc(np.array(y_true), np.array(enhanced_labels1850))
print(f"Simple Clustering (1850 articles) - acc: {acc_simple1850}")
print(f"Enhanced Clustering (1850 articles) - acc: {acc_enhanced1850}")

Simple Clustering (1850 articles) - NMI: 0.6908902207428829
Enhanced Clustering (1850 articles) - NMI: 0.8118052500903025
---------------------------------------------------------------------------------
Simple Clustering (1850 articles) - rand_score: 0.6254892781965412
Enhanced Clustering (1850 articles) - rand_score: 0.84652678067056
---------------------------------------------------------------------------------
Simple Clustering (1850 articles) - acc: 0.82
Enhanced Clustering (1850 articles) - acc: 0.9356756756756757


In [98]:
# Evaluation with random_state=5
nmi_simple1850 = normalized_mutual_info_score(data1850['category'].values, simple_labels1850)
nmi_enhanced1850 = normalized_mutual_info_score(data1850['category'].values, enhanced_labels1850)
print(f"Simple Clustering (1850 articles) - NMI: {nmi_simple1850}")
print(f"Enhanced Clustering (1850 articles) - NMI: {nmi_enhanced1850}")
print("---------------------------------------------------------------------------------")

rand_score_simple1850 = adjusted_rand_score(data1850['category'].values, simple_labels1850)
rand_score_enhanced1850= adjusted_rand_score(data1850['category'].values, enhanced_labels1850)
print(f"Simple Clustering (1850 articles) - rand_score: {rand_score_simple1850}")
print(f"Enhanced Clustering (1850 articles) - rand_score: {rand_score_enhanced1850}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1850['category'].values)

acc_simple1850 = cluster_acc(np.array(y_true), np.array(simple_labels1850))
acc_enhanced1850 = cluster_acc(np.array(y_true), np.array(enhanced_labels1850))
print(f"Simple Clustering (1850 articles) - acc: {acc_simple1850}")
print(f"Enhanced Clustering (1850 articles) - acc: {acc_enhanced1850}")

Simple Clustering (1850 articles) - NMI: 0.5792921043426708
Enhanced Clustering (1850 articles) - NMI: 0.8150372898313464
---------------------------------------------------------------------------------
Simple Clustering (1850 articles) - rand_score: 0.3137601181658727
Enhanced Clustering (1850 articles) - rand_score: 0.8500818200590934
---------------------------------------------------------------------------------
Simple Clustering (1850 articles) - acc: 0.5513513513513514
Enhanced Clustering (1850 articles) - acc: 0.9372972972972973


In [102]:
# Evaluation with random_state=25
nmi_simple1850 = normalized_mutual_info_score(data1850['category'].values, simple_labels1850)
nmi_enhanced1850 = normalized_mutual_info_score(data1850['category'].values, enhanced_labels1850)
print(f"Simple Clustering (1850 articles) - NMI: {nmi_simple1850}")
print(f"Enhanced Clustering (1850 articles) - NMI: {nmi_enhanced1850}")
print("---------------------------------------------------------------------------------")

rand_score_simple1850 = adjusted_rand_score(data1850['category'].values, simple_labels1850)
rand_score_enhanced1850= adjusted_rand_score(data1850['category'].values, enhanced_labels1850)
print(f"Simple Clustering (1850 articles) - rand_score: {rand_score_simple1850}")
print(f"Enhanced Clustering (1850 articles) - rand_score: {rand_score_enhanced1850}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1850['category'].values)

acc_simple1850 = cluster_acc(np.array(y_true), np.array(simple_labels1850))
acc_enhanced1850 = cluster_acc(np.array(y_true), np.array(enhanced_labels1850))
print(f"Simple Clustering (1850 articles) - acc: {acc_simple1850}")
print(f"Enhanced Clustering (1850 articles) - acc: {acc_enhanced1850}")

Simple Clustering (1850 articles) - NMI: 0.6955787518858061
Enhanced Clustering (1850 articles) - NMI: 0.6673058549271512
---------------------------------------------------------------------------------
Simple Clustering (1850 articles) - rand_score: 0.6355990067992949
Enhanced Clustering (1850 articles) - rand_score: 0.5828978733705864
---------------------------------------------------------------------------------
Simple Clustering (1850 articles) - acc: 0.8389189189189189
Enhanced Clustering (1850 articles) - acc: 0.6697297297297298


In [108]:
# Evaluation with random_state=100
nmi_simple1850 = normalized_mutual_info_score(data1850['category'].values, simple_labels1850)
nmi_enhanced1850 = normalized_mutual_info_score(data1850['category'].values, enhanced_labels1850)
print(f"Simple Clustering (1850 articles) - NMI: {nmi_simple1850}")
print(f"Enhanced Clustering (1850 articles) - NMI: {nmi_enhanced1850}")
print("---------------------------------------------------------------------------------")

rand_score_simple1850 = adjusted_rand_score(data1850['category'].values, simple_labels1850)
rand_score_enhanced1850= adjusted_rand_score(data1850['category'].values, enhanced_labels1850)
print(f"Simple Clustering (1850 articles) - rand_score: {rand_score_simple1850}")
print(f"Enhanced Clustering (1850 articles) - rand_score: {rand_score_enhanced1850}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data1850['category'].values)

acc_simple1850 = cluster_acc(np.array(y_true), np.array(simple_labels1850))
acc_enhanced1850 = cluster_acc(np.array(y_true), np.array(enhanced_labels1850))
print(f"Simple Clustering (1850 articles) - acc: {acc_simple1850}")
print(f"Enhanced Clustering (1850 articles) - acc: {acc_enhanced1850}")

Simple Clustering (1850 articles) - NMI: 0.6968951869023613
Enhanced Clustering (1850 articles) - NMI: 0.6208853190041339
---------------------------------------------------------------------------------
Simple Clustering (1850 articles) - rand_score: 0.589860579968474
Enhanced Clustering (1850 articles) - rand_score: 0.5870128318107245
---------------------------------------------------------------------------------
Simple Clustering (1850 articles) - acc: 0.7075675675675676
Enhanced Clustering (1850 articles) - acc: 0.7248648648648649


In [124]:
# Enhanced vectorization using LLM keyphrases
enhanced_vectors2225 = []
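total = len(texts2225)
for cpt, text in enumerate(texts2225, start=1):
    # Generate keyphrases with the LLM, then embed the article and its keyphrases
    keyphrase = generate_keyphrases(text)
    text_vector = encode_text(text)
    keyphrase_vector = encode_text(keyphrase)
    # Join both embeddings along the feature axis into one vector per article
    concatenated_vector = np.concatenate((text_vector, keyphrase_vector), axis=1)
    enhanced_vectors2225.append(concatenated_vector.squeeze())
    print(f"{cpt}/{total}", end="\r")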

# Convert list to array
enhanced_vectors2225 = np.array(enhanced_vectors2225)
print("Vectorization completed successfully")

Vectorization completed successfully


In [152]:
np.save('enhanced_vectors2225.npy', enhanced_vectors2225)

In [112]:
# To load your vectors back after restarting the kernel
enhanced_vectors2225 = np.load('enhanced_vectors2225.npy')

In [123]:
simple_labels2225 = kmeans_simple.fit_predict(X_simple2225)
enhanced_labels2225 = kmeans_enhanced.fit_predict(enhanced_vectors2225)

In [114]:
# Evaluation with random_state=42
nmi_simple2225 = normalized_mutual_info_score(data2225['category'].values, simple_labels2225)
nmi_enhanced2225 = normalized_mutual_info_score(data2225['category'].values, enhanced_labels2225)
print(f"Simple Clustering (2225 articles) - NMI: {nmi_simple2225}")
print(f"Enhanced Clustering (2225 articles) - NMI: {nmi_enhanced2225}")
print("---------------------------------------------------------------------------------")

rand_score_simple2225 = adjusted_rand_score(data2225['category'].values, simple_labels2225)
rand_score_enhanced2225= adjusted_rand_score(data2225['category'].values, enhanced_labels2225)
print(f"Simple Clustering (2225 articles) - rand_score: {rand_score_simple2225}")
print(f"Enhanced Clustering (2225 articles) - rand_score: {rand_score_enhanced2225}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data2225['category'].values)

acc_simple2225 = cluster_acc(np.array(y_true), np.array(simple_labels2225))
acc_enhanced2225 = cluster_acc(np.array(y_true), np.array(enhanced_labels2225))
print(f"Simple Clustering (2225 articles) - acc: {acc_simple2225}")
print(f"Enhanced Clustering (2225 articles) - acc: {acc_enhanced2225}")

Simple Clustering (2225 articles) - NMI: 0.8152174404259462
Enhanced Clustering (2225 articles) - NMI: 0.769593084222543
---------------------------------------------------------------------------------
Simple Clustering (2225 articles) - rand_score: 0.8210676146346573
Enhanced Clustering (2225 articles) - rand_score: 0.773986984910034
---------------------------------------------------------------------------------
Simple Clustering (2225 articles) - acc: 0.9267415730337079
Enhanced Clustering (2225 articles) - acc: 0.9015730337078651


In [117]:
# Evaluation with random_state=5
nmi_simple2225 = normalized_mutual_info_score(data2225['category'].values, simple_labels2225)
nmi_enhanced2225 = normalized_mutual_info_score(data2225['category'].values, enhanced_labels2225)
print(f"Simple Clustering (2225 articles) - NMI: {nmi_simple2225}")
print(f"Enhanced Clustering (2225 articles) - NMI: {nmi_enhanced2225}")
print("---------------------------------------------------------------------------------")

rand_score_simple2225 = adjusted_rand_score(data2225['category'].values, simple_labels2225)
rand_score_enhanced2225= adjusted_rand_score(data2225['category'].values, enhanced_labels2225)
print(f"Simple Clustering (2225 articles) - rand_score: {rand_score_simple2225}")
print(f"Enhanced Clustering (2225 articles) - rand_score: {rand_score_enhanced2225}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data2225['category'].values)

acc_simple2225 = cluster_acc(np.array(y_true), np.array(simple_labels2225))
acc_enhanced2225 = cluster_acc(np.array(y_true), np.array(enhanced_labels2225))
print(f"Simple Clustering (2225 articles) - acc: {acc_simple2225}")
print(f"Enhanced Clustering (2225 articles) - acc: {acc_enhanced2225}")

Simple Clustering (2225 articles) - NMI: 0.6371644170801996
Enhanced Clustering (2225 articles) - NMI: 0.8248006678446281
---------------------------------------------------------------------------------
Simple Clustering (2225 articles) - rand_score: 0.4856177327497736
Enhanced Clustering (2225 articles) - rand_score: 0.8568523219336295
---------------------------------------------------------------------------------
Simple Clustering (2225 articles) - acc: 0.7330337078651685
Enhanced Clustering (2225 articles) - acc: 0.938876404494382


In [120]:
# Evaluation with random_state=25
nmi_simple2225 = normalized_mutual_info_score(data2225['category'].values, simple_labels2225)
nmi_enhanced2225 = normalized_mutual_info_score(data2225['category'].values, enhanced_labels2225)
print(f"Simple Clustering (2225 articles) - NMI: {nmi_simple2225}")
print(f"Enhanced Clustering (2225 articles) - NMI: {nmi_enhanced2225}")
print("---------------------------------------------------------------------------------")

rand_score_simple2225 = adjusted_rand_score(data2225['category'].values, simple_labels2225)
rand_score_enhanced2225= adjusted_rand_score(data2225['category'].values, enhanced_labels2225)
print(f"Simple Clustering (2225 articles) - rand_score: {rand_score_simple2225}")
print(f"Enhanced Clustering (2225 articles) - rand_score: {rand_score_enhanced2225}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data2225['category'].values)

acc_simple2225 = cluster_acc(np.array(y_true), np.array(simple_labels2225))
acc_enhanced2225 = cluster_acc(np.array(y_true), np.array(enhanced_labels2225))
print(f"Simple Clustering (2225 articles) - acc: {acc_simple2225}")
print(f"Enhanced Clustering (2225 articles) - acc: {acc_enhanced2225}")

Simple Clustering (2225 articles) - NMI: 0.6682067734759758
Enhanced Clustering (2225 articles) - NMI: 0.6259378082157732
---------------------------------------------------------------------------------
Simple Clustering (2225 articles) - rand_score: 0.5833917199785121
Enhanced Clustering (2225 articles) - rand_score: 0.5912851227173611
---------------------------------------------------------------------------------
Simple Clustering (2225 articles) - acc: 0.7047191011235955
Enhanced Clustering (2225 articles) - acc: 0.7150561797752809


In [124]:
# Evaluation with random_state=100
nmi_simple2225 = normalized_mutual_info_score(data2225['category'].values, simple_labels2225)
nmi_enhanced2225 = normalized_mutual_info_score(data2225['category'].values, enhanced_labels2225)
print(f"Simple Clustering (2225 articles) - NMI: {nmi_simple2225}")
print(f"Enhanced Clustering (2225 articles) - NMI: {nmi_enhanced2225}")
print("---------------------------------------------------------------------------------")

rand_score_simple2225 = adjusted_rand_score(data2225['category'].values, simple_labels2225)
rand_score_enhanced2225= adjusted_rand_score(data2225['category'].values, enhanced_labels2225)
print(f"Simple Clustering (2225 articles) - rand_score: {rand_score_simple2225}")
print(f"Enhanced Clustering (2225 articles) - rand_score: {rand_score_enhanced2225}")

print("---------------------------------------------------------------------------------")

# Encode category labels
label_encoder = LabelEncoder()
y_true = label_encoder.fit_transform(data2225['category'].values)

acc_simple2225 = cluster_acc(np.array(y_true), np.array(simple_labels2225))
acc_enhanced2225 = cluster_acc(np.array(y_true), np.array(enhanced_labels2225))
print(f"Simple Clustering (2225 articles) - acc: {acc_simple2225}")
print(f"Enhanced Clustering (2225 articles) - acc: {acc_enhanced2225}")

Simple Clustering (2225 articles) - NMI: 0.6971609909162098
Enhanced Clustering (2225 articles) - NMI: 0.7729247182727357
---------------------------------------------------------------------------------
Simple Clustering (2225 articles) - rand_score: 0.6617560796405119
Enhanced Clustering (2225 articles) - rand_score: 0.7764295234622282
---------------------------------------------------------------------------------
Simple Clustering (2225 articles) - acc: 0.8058426966292135
Enhanced Clustering (2225 articles) - acc: 0.9029213483146067


## Conclusion

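The experiments above do not show a consistent advantage for either approach. On 1,000 articles, the enhanced clustering improves NMI (0.71 vs. 0.59) and the adjusted Rand score (0.60 vs. 0.46) but has slightly lower accuracy (0.678 vs. 0.707). Across the four `random_state` values tested (42, 5, 25, 100), the enhanced clustering beats the simple clustering on all three metrics in one of four runs on 1,500 articles, two of four on 1,850 articles, and two of four on 2,225 articles. In the remaining runs the simple clustering wins or the metrics disagree.

When the enhanced approach wins, its accuracy lies between roughly 0.90 and 0.94. Both methods, however, vary widely with the seed: on the 1,500 to 2,225 article subsets, accuracy ranges from about 0.55 to 0.93 for the simple clustering and from about 0.67 to 0.94 for the enhanced clustering. This suggests that K-means initialization accounts for much of the observed difference, so language model embeddings can improve clustering outcomes, but a single run is not enough to establish that. For practical applications of clustering in natural language processing, results should be averaged over several seeds before choosing between the two representations.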
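
Because the scores above change substantially with `random_state`, one way to make the comparison more robust is to aggregate each metric over several seeds. The following is a minimal sketch on synthetic data, not the notebook's actual pipeline: `scores_over_seeds` is a hypothetical helper, and `make_blobs` stands in for the TF-IDF or embedding matrices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def scores_over_seeds(X, y_true, n_clusters, seeds):
    """Fit K-means once per seed; return mean and std of (NMI, ARI)."""
    rows = []
    for seed in seeds:
        # n_init=1 so that each seed corresponds to a single initialization
        model = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed)
        labels = model.fit_predict(X)
        rows.append((
            normalized_mutual_info_score(y_true, labels),
            adjusted_rand_score(y_true, labels),
        ))
    rows = np.array(rows)
    return rows.mean(axis=0), rows.std(axis=0)


# Synthetic stand-in for a feature matrix and its ground-truth categories
X, y = make_blobs(n_samples=300, centers=5, n_features=8, random_state=0)
mean, std = scores_over_seeds(X, y, n_clusters=5, seeds=[42, 5, 25, 100])
print(f"NMI: {mean[0]:.3f} +/- {std[0]:.3f}")
print(f"ARI: {mean[1]:.3f} +/- {std[1]:.3f}")
```

Calling the same function on both the simple and the enhanced feature matrices would give a mean and spread per method, which is easier to compare than individual runs.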
