Lab 7: Document Clustering and Topic Discovery
Goal: Build a system that automatically clusters documents and extracts their key topics.
Tasks:
- Preprocess the texts (tokenization, stop-word removal, lemmatization).
- Train a topic model with LDA or BERTopic and visualize the results.
Evaluation metric: clustering quality and cluster coherence (silhouette coefficient).
Baseline models for comparison: k-means clustering, hierarchical clustering.

In [1]:
import numpy as np
import spacy
import fireducks.pandas as pd  # FireDucks: pandas-compatible drop-in replacement
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score, adjusted_rand_score, normalized_mutual_info_score

In [2]:
df = pd.read_csv('data/bbc-news-data.csv', sep='\t')
texts = df['content'].astype(str)
df.head()

Unnamed: 0,category,filename,title,content
0,business,001.txt,Ad sales boost Time Warner profit,Quarterly profits at US media giant TimeWarne...
1,business,002.txt,Dollar gains on Greenspan speech,The dollar has hit its highest level against ...
2,business,003.txt,Yukos unit buyer faces loan claim,The owners of embattled Russian oil giant Yuk...
3,business,004.txt,High fuel prices hit BA's profits,British Airways has blamed high fuel prices f...
4,business,005.txt,Pernod takeover talk lifts Domecq,Shares in UK drinks and food firm Allied Dome...


In [3]:
df.category.value_counts()

category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

In [4]:
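# Requires the NLTK resources 'stopwords', 'wordnet' and the 'punkt' tokenizer
# models ('punkt_tab' on recent NLTK releases); fetch them once with nltk.download.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, tokenize, drop stop words and non-alphabetic tokens, lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)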

nlp = spacy.load("en_core_web_sm")

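def preprocess_nouns(text):
    """Keep only lemmatized nouns and proper nouns, as tagged by spaCy.

    Note: below this is applied to the already cleaned, lowercased text, which
    may reduce tagging accuracy compared with running spaCy on the raw text.
    """
    doc = nlp(text)
    nouns = [
        token.lemma_.lower()
        for token in doc
        if token.pos_ in ("NOUN", "PROPN")
           and not token.is_stop
           and token.is_alpha
    ]
    return " ".join(nouns)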

df['clean_text'] = texts.apply(preprocess)
df['nouns_only'] = df['clean_text'].apply(preprocess_nouns)
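
To illustrate what the filtering step in `preprocess` discards, here is a minimal, dependency-free sketch. It is a simplification, not the notebook's pipeline: `toy_filter` and `TOY_STOP_WORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and the sample sentence is made up.

```python
# Hypothetical mini stop list; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "a", "of", "in", "at", "and"}

def toy_filter(text):
    """Lowercase, split on whitespace, keep alphabetic non-stop-word tokens."""
    tokens = text.lower().split()
    return [t for t in tokens if t.isalpha() and t not in TOY_STOP_WORDS]

sample = "The striker scored 2 goals in the semi-final and won the cup"
kept = toy_filter(sample)
print(kept)
```

The `isalpha()` check removes numbers and also any token that still contains a hyphen, so a compound such as "semi-final" is lost rather than split. Whether that matters depends on the corpus.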

In [52]:
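def cluster_purity(true_labels, cluster_labels):
    """Share of documents whose true label is the majority label of their cluster.

    Both arguments are expected to be NumPy arrays of equal length.
    """
    total = 0
    for cluster in set(cluster_labels):
        idx = (cluster_labels == cluster)                    # boolean mask of the cluster
        most_common = pd.Series(true_labels[idx]).mode()[0]  # majority true label
        total += sum(true_labels[idx] == most_common)
    return total / len(cluster_labels)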

n_clusters = 5
vectorizer = TfidfVectorizer(max_df=0.9, min_df=5)
X = vectorizer.fit_transform(df['nouns_only'])
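
To make the purity metric concrete, the following sketch recomputes it on toy labels without pandas or NumPy. The function `purity` is a hypothetical standalone counterpart of `cluster_purity`, and the toy labels are invented for illustration.

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of items that carry the majority true label of their cluster."""
    by_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        by_cluster.setdefault(cluster, []).append(truth)
    # For each cluster, count how many members share its most frequent label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(true_labels)

toy_true = ["sport", "sport", "sport", "tech", "tech", "business"]
toy_pred = [0, 0, 1, 1, 1, 1]
toy_purity = purity(toy_true, toy_pred)
print(f"Toy purity: {toy_purity:.3f}")

# Degenerate case: one cluster per document always yields purity 1.0.
singleton_purity = purity(toy_true, list(range(len(toy_true))))
```

Because purity reaches 1.0 when every document sits in its own cluster, it is best read alongside ARI and NMI, which the notebook also reports.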

In [53]:
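# Note: LDA is formulated for word counts. scikit-learn accepts the TF-IDF
# matrix X as well, but count features are the more conventional input.
lda = LatentDirichletAllocation(
    n_components=n_clusters,   # one topic per expected category
    doc_topic_prior=0.1,       # alpha: favors sparse document-topic mixtures
    topic_word_prior=0.01,     # eta: favors sparse topic-word distributions
    learning_decay=0.7,        # only affects learning_method='online' (default is 'batch')
    random_state=42
)

doc_topic_dist = lda.fit_transform(X)       # rows: documents, columns: topic weights
lda_labels = doc_topic_dist.argmax(axis=1)  # hard assignment to the dominant topic

# This silhouette is computed in the 5-dimensional topic space, whereas the
# clustering baselines are scored in TF-IDF space, so the values are not
# directly comparable.
sil_lda = silhouette_score(doc_topic_dist, lda_labels)
print(f"LDA Silhouette: {sil_lda:.3f}")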

if 'category' in df.columns:
    ari_lda = adjusted_rand_score(df['category'], lda_labels)
    nmi_lda = normalized_mutual_info_score(df['category'], lda_labels)
    purity_lda = cluster_purity(df['category'].values, lda_labels)
    print(f"LDA ARI: {ari_lda:.3f}, NMI: {nmi_lda:.3f}, Purity: {purity_lda:.3f}")

LDA Silhouette: 0.497
LDA ARI: 0.146, NMI: 0.358, Purity: 0.495
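
The LDA topics themselves can be inspected through their highest-weighted words. The sketch below shows the idea on a made-up topic-word matrix; `top_words_per_topic`, `toy_vocab` and `toy_topic_word` are hypothetical. In the notebook, the fitted model's `lda.components_` and `vectorizer.get_feature_names_out()` would presumably take their place.

```python
def top_words_per_topic(topic_word, vocab, n_top=3):
    """Return the n_top highest-weighted vocabulary entries for each topic row."""
    result = []
    for weights in topic_word:
        # Rank vocabulary indices by descending weight within this topic.
        ranked = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n_top]])
    return result

# Invented weights: rows are topics, columns follow toy_vocab.
toy_vocab = ["game", "market", "film", "player", "bank", "award"]
toy_topic_word = [
    [5.0, 0.1, 0.2, 4.0, 0.1, 0.3],
    [0.2, 6.0, 0.1, 0.1, 3.5, 0.2],
]
toy_topics = top_words_per_topic(toy_topic_word, toy_vocab, n_top=2)
for k, words in enumerate(toy_topics):
    print(f"Topic {k}: {', '.join(words)}")
```

Reading the top words per topic may help explain the comparatively low ARI and purity, for example if two topics turn out to split one category.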


In [None]:
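# Ward linkage merges the pair of clusters that least increases within-cluster
# variance (Euclidean distance only).
agglo = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
# AgglomerativeClustering needs a dense array, hence X.toarray().
agglo_labels = agglo.fit_predict(X.toarray())
sil_agglo = silhouette_score(X, agglo_labels)
print(f"Hierarchical Clustering Silhouette: {sil_agglo:.3f}")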

if 'category' in df.columns:
    ari_agglo = adjusted_rand_score(df['category'], agglo_labels)
    nmi_agglo = normalized_mutual_info_score(df['category'], agglo_labels)
    purity_agglo = cluster_purity(df['category'].values, agglo_labels)
    print(f"Hierarchical ARI: {ari_agglo:.3f}, NMI: {nmi_agglo:.3f}, Purity: {purity_agglo:.3f}")

Hierarchical Clustering Silhouette: 0.014
Hierarchical ARI: 0.479, NMI: 0.594, Purity: 0.744


In [55]:
# n_init=10: run k-means from 10 random initializations and keep the best run.
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X)
sil_kmeans = silhouette_score(X, kmeans_labels)
print(f"K-Means Silhouette: {sil_kmeans:.3f}")

if 'category' in df.columns:
    ari_kmeans = adjusted_rand_score(df['category'], kmeans_labels)
    nmi_kmeans = normalized_mutual_info_score(df['category'], kmeans_labels)
    purity_kmeans = cluster_purity(df['category'].values, kmeans_labels)
    print(f"K-Means ARI: {ari_kmeans:.3f}, NMI: {nmi_kmeans:.3f}, Purity: {purity_kmeans:.3f}")

K-Means Silhouette: 0.020
K-Means ARI: 0.791, NMI: 0.780, Purity: 0.911
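
The silhouette values reported above differ widely (about 0.5 for LDA against about 0.02 for the two clustering baselines), yet the external metrics rank the models the other way round. One likely reason is that the silhouette depends on the feature space in which distances are measured. The sketch below, a plain-Python illustration with invented points and a hypothetical helper `mean_silhouette`, scores the same labels in a compact 2-D space and in a sparse, higher-dimensional one.

```python
from math import dist

def mean_silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (assumes >= 2 clusters)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])  # mean intra-cluster distance
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

toy_labels = [0, 0, 1, 1]
# Compact "topic-like" representation: two tight, well-separated groups.
topic_like = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
# Sparse "TF-IDF-like" representation: nearly orthogonal vectors, so all
# pairwise distances are almost equal even though the grouping is the same.
sparse_like = [
    (1, 0, 0, 0, 0.2, 0),
    (0, 1, 0, 0, 0.2, 0),
    (0, 0, 1, 0, 0, 0.2),
    (0, 0, 0, 1, 0, 0.2),
]
sil_topic = mean_silhouette(topic_like, toy_labels)
sil_sparse = mean_silhouette(sparse_like, toy_labels)
print(f"topic-like: {sil_topic:.3f}, sparse-like: {sil_sparse:.3f}")
```

The identical grouping scores high in the compact space and close to zero in the sparse one. Silhouette values are therefore only comparable between models scored on the same representation.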


In [56]:
feature_names = vectorizer.get_feature_names_out()
n_top_words = 10

print("Top words per K-Means cluster with the most likely category:")
for cluster in range(n_clusters):
    # Average TF-IDF weight of every term over the documents in this cluster.
    mean_tfidf = X[kmeans_labels == cluster].mean(axis=0).A1
    top_indices = np.argsort(mean_tfidf)[::-1][:n_top_words]
    top_words = [feature_names[i] for i in top_indices]

    # Majority true category among the cluster's documents.
    cluster_cats = df.loc[kmeans_labels == cluster, 'category']
    if len(cluster_cats) > 0:
        majority_cat = cluster_cats.mode()[0]
    else:
        majority_cat = "None"
    print(f"Cluster {cluster}: {', '.join(top_words)}  ➔  {majority_cat}")

Top words per K-Means cluster with the most likely category:
Cluster 0: game, player, england, club, match, win, team, champion, cup, world  ➔  sport
Cluster 1: phone, people, technology, user, service, music, computer, software, network, site  ➔  tech
Cluster 2: mr, election, party, labour, blair, tory, government, minister, tax, lord  ➔  politics
Cluster 3: film, award, star, band, actor, album, music, year, festival, chart  ➔  entertainment
Cluster 4: company, year, market, bank, growth, sale, economy, price, mr, share  ➔  business
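
The majority category shown after each arrow hides how mixed a cluster is. A contingency table of cluster against true category makes that visible. The sketch below uses invented toy labels and a hypothetical helper `contingency`; pandas also offers `crosstab` for this purpose, which was not tried with FireDucks here.

```python
from collections import Counter

def contingency(cluster_labels, true_labels):
    """Count items for every (cluster, true category) pair."""
    return Counter(zip(cluster_labels, true_labels))

toy_clusters = [0, 0, 0, 1, 1, 2]
toy_categories = ["sport", "sport", "tech", "tech", "tech", "business"]
table = contingency(toy_clusters, toy_categories)
for (cluster, category), count in sorted(table.items()):
    print(f"cluster {cluster} / {category}: {count}")
```

In this toy case cluster 0 is labelled "sport" by majority vote although one of its three items is "tech", which is exactly the detail a single majority label omits.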
