<div align="center">

# **CS-E4650 Methods of Data Mining**

# **Exercise 5.4 Topics of text clusters**

</div>

<div align="center">
    
# **Group members**

# **Nguyen Xuan Binh (887799)**

# **Erald Shahinas (906845)**

# **Alexander Pavlyuk (906829)**

</div>

</br>
</br>
</br>

# **Table of Contents**

### 1. [Preprocessing](#1)
### 2. [K-means clustering and topic detection](#2)
### 3. [Additional experiments](#3)
### 4. [Appendix](#5)

</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

*Learning goal: Clustering text data and techniques for describing topics of clusters*

In this task, you should cluster a collection of short scientific texts and
identify the main topics of each cluster. Ideally, you will identify 3–10
unique topics (areas or techniques of computer science) that describe the majority of documents, excluding possible outliers.
In MC, you can find the data set scopusabstracts.csv, which consists of abstracts of scientific papers from Scopus (https://www.elsevier.com/products/scopus). Each line describes one document: its id, title, and abstract, separated by #.

## **1. Preprocessing**

(a) In the baseline solution, combine the title and abstract. Preprocess the
data like in the previous task, but this time, create also **bigrams** (in
addition to unigrams) as possible features. Since the number of features
would otherwise be too high, it is suggested to use frequency-based
filtering to prune out very frequent or extremely rare words/collocations
(see parameters of sklearn TfidfVectorizer). Consider also adding new
stopwords, if any frequent but uninformative words complicate later
steps. When features are fine, present the data in the tf-idf form so
that each document vector is normalized to unit L2 norm.
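
As an illustration of the frequency-based filtering described above, the following is a minimal standard-library sketch, not the notebook's actual pipeline. The helper names `ngrams` and `prune_by_document_frequency` and the toy documents are assumptions made for this example; the thresholds only mirror the idea behind the `min_df` and `max_df` parameters of `TfidfVectorizer`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prune_by_document_frequency(docs, min_df, max_df):
    """Keep unigrams and bigrams whose share of documents lies in [min_df, max_df]."""
    df = Counter()
    for tokens in docs:
        # A set counts every feature at most once per document
        df.update(set(tokens) | set(ngrams(tokens, 2)))
    n_docs = len(docs)
    return sorted(f for f, count in df.items() if min_df <= count / n_docs <= max_df)

# Toy stemmed documents (hypothetical, not taken from scopusabstracts.csv)
toy_docs = [
    ["quantum", "key", "exchang"],
    ["quantum", "key", "distribut"],
    ["neural", "network", "train"],
    ["neural", "network", "prune"],
]
vocabulary = prune_by_document_frequency(toy_docs, min_df=0.5, max_df=0.75)
print(vocabulary)
```

In this toy corpus only the features that occur in exactly two of the four documents survive, so rare collocations such as "key exchang" are pruned while "quantum key" and "neural network" are kept.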

Describe briefly the preprocessing methods:
tools (like nltk), in which order the steps were performed, stemmer,
stopword list (including own additions), tf-idf version (equation), minimum or maximum frequencies (if any), and other possible steps or
options that could affect the results.

All calculations were performed on JupyterHub (https://jupyter.cs.aalto.fi) in a Python notebook. In addition to nltk and scikit-learn, the numpy (https://numpy.org/), matplotlib (https://matplotlib.org/), and pandas (https://pandas.pydata.org/) libraries were imported for array handling, plotting, and data loading, respectively.


In [None]:
!pip install nltk



In [None]:
import pandas as pd

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
from string import punctuation
import numpy as np

from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

from copy import deepcopy
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Path to the data file
data_path = 'scopusabstracts.csv'

# Read the data; the fields (id, title, abstract) are separated by '#'
scopusdata = pd.read_csv(data_path, sep='#')

# Extract the text from each line. In the baseline solution, combine the title and abstract.
corpus = [str1 + " " + str2 for str1, str2 in zip(scopusdata['TITLE'].to_list(), scopusdata['ABSTRACT'].to_list())]

# some examples
print('10 first titles + abstracts:')
for i in corpus[:10]:
    print(i)
print()

10 first titles + abstracts:
Anomaly detection in wide area imagery [Geniş alan görüntülerinde anomali tespiti] This study is about detecting anomalies in wide area imagery collected from an aircraft. The set of anomalies have been identified as anything out of the normal course of action. For this purpose, two different data sets were used and the experiments were carried out on these data sets. For anomaly detection, a convolutional neural network model that tries to generate the next image using past images is designed. The images were pre-processed before being given to the model. Anomaly detection is performed by comparing the estimated image and the true image. 
Person re-identification with deep kronecker-product matching and group-shuffling random walk Person re-identification (re-ID) aims to robustly measure visual affinities between person images. It has wide applications in intelligent surveillance by associating same persons' images across multiple cameras. It is generally 

In [None]:
# Preprocessing

# Step 1: tokenization and lowercasing
tokens_list = [word_tokenize(document) for document in corpus]

lc_tokens_list = []

for token_document in tokens_list:
    lc_tokens_list.append([token.lower() for token in token_document])

print('After tokenization and lowercasing:\n')
for i in lc_tokens_list[:10]:
    print(i)
print()

# original number of tokens
uniques = np.unique([token for token_document in lc_tokens_list for token in token_document])
print("Original number of tokens: {}\n".format(len(uniques)))


After tokenization and lowercasing:

['anomaly', 'detection', 'in', 'wide', 'area', 'imagery', '[', 'geniş', 'alan', 'görüntülerinde', 'anomali', 'tespiti', ']', 'this', 'study', 'is', 'about', 'detecting', 'anomalies', 'in', 'wide', 'area', 'imagery', 'collected', 'from', 'an', 'aircraft', '.', 'the', 'set', 'of', 'anomalies', 'have', 'been', 'identified', 'as', 'anything', 'out', 'of', 'the', 'normal', 'course', 'of', 'action', '.', 'for', 'this', 'purpose', ',', 'two', 'different', 'data', 'sets', 'were', 'used', 'and', 'the', 'experiments', 'were', 'carried', 'out', 'on', 'these', 'data', 'sets', '.', 'for', 'anomaly', 'detection', ',', 'a', 'convolutional', 'neural', 'network', 'model', 'that', 'tries', 'to', 'generate', 'the', 'next', 'image', 'using', 'past', 'images', 'is', 'designed', '.', 'the', 'images', 'were', 'pre-processed', 'before', 'being', 'given', 'to', 'the', 'model', '.', 'anomaly', 'detection', 'is', 'performed', 'by', 'comparing', 'the', 'estimated', 'image', 'a

In [None]:

# Steps 2 and 3: remove stop words and punctuation
stop_words = set(stopwords.words('english'))
print('NLTK stopwords:')
print(stop_words)
print()

#stop_words.update(["use", "data", "system", "propos"])

# Here we include the punctuation in the stop words set. There are alternative ways to remove punctuation.


stop_words.update(punctuation)

# For computer science research papers, in addition to the standard English stop words,
# we might consider adding words that carry little topical meaning

# stop_words.update({"use", "data", "system", "proposed", "study", "results", "analysis", "model",
#                 "approach", "methods", "research", "application", "technique", "performance",
#                 "algorithm", "process", "problem", "solution"})

#you can check updated stopwords
#print(stop_words)

filtered_sentence = []
for i in lc_tokens_list:
    filtered_sentence.append([token for token in i if token not in stop_words])

# Numbers are also removed
filtered_sentence = [ ' '.join(i) for i in filtered_sentence ]
filtered_sentence = [ re.sub(r'\d+', '', sentence) for sentence in filtered_sentence ]

# number of tokens
uniques = np.unique([tok for doc in filtered_sentence for tok in doc.split()])
print("Number of tokens after stopword and punctuation removal: {}\n".format(len(uniques)))


print('After removing stop words, punctuation and numbers:')
for sentence in filtered_sentence[:10]:
    print(sentence)
print()

NLTK stopwords:
{'mightn', 'each', "wouldn't", 'just', "wasn't", 'of', 'both', "hadn't", 's', 'now', 'isn', 'don', 'itself', 'through', 'herself', 'how', "it's", 'doesn', 'm', 'haven', 'we', 'as', 'up', 'theirs', "shouldn't", "isn't", 'an', 'so', 'which', 'by', 'be', 'further', 'having', 'was', "mustn't", 'again', 'before', "you'd", 'out', 'yourselves', 'have', 'such', 'hasn', "weren't", 'she', 'there', 'from', 'once', 'yourself', 'until', 'between', 'to', "you're", 'after', 'it', 'down', 'only', 'then', 'against', 'yours', 'those', 'themselves', 're', 'own', 've', 'a', 'in', 'do', 'are', 'whom', 'with', 'were', 'they', 'when', 'where', "won't", "hasn't", 'any', 'off', 'no', 'hadn', 'about', 'll', 'am', 'all', 'than', "you've", 'most', 'he', 'being', 'does', 'ours', 'y', "couldn't", "don't", 'been', 'the', "doesn't", 'shan', "didn't", 'wasn', "that'll", 'them', 'if', 'that', 'or', 'under', 'their', 'while', 'o', 'will', 'should', "mightn't", 'mustn', 'your', 'couldn', 'shouldn', 'very'

In [None]:
# Step 4: stemming
porter = PorterStemmer()

#or snowball stemmer
#stemmer = SnowballStemmer("english",ignore_stopwords=True)
stemmed_tokens_list = []

for i in filtered_sentence:
    stemmed_tokens_list.append([porter.stem(j) for j in i.split()])

# number of tokens
uniques = np.unique([tok for doc in stemmed_tokens_list for tok in doc])
print("Number of tokens after stemming: {}\n".format(len(uniques)))

print('After stemming:')
for tokens in stemmed_tokens_list[:10]:
    for token in tokens:
        print(token, end=" ")
    print(" ")


Number of tokens after stemming: 10259

After stemming:
anomali detect wide area imageri geniş alan görüntülerind anomali tespiti studi detect anomali wide area imageri collect aircraft set anomali identifi anyth normal cours action purpos two differ data set use experi carri data set anomali detect convolut neural network model tri gener next imag use past imag design imag pre-process given model anomali detect perform compar estim imag true imag  
person re-identif deep kronecker-product match group-shuffl random walk person re-identif re-id aim robustli measur visual affin person imag wide applic intellig surveil associ person imag across multipl camera gener treat imag retriev problem given probe person imag affin probe imag galleri imag pg affin use rank retriev galleri imag exist two main challeng effect solv problem person imag usual show signific variat differ person pose view angl spatial layout correspond person imag therefor vital inform tackl problem state-of-the-art method

In [None]:

#5. Check most frequent words - candidates to add to the stopword list
listofall = [ item for elem in stemmed_tokens_list for item in elem]

freq = FreqDist(listofall)
wnum=freq.B()
print("\nMost common words (total %d)"%wnum)
print(freq.most_common(30))


Most common words (total 10259)
[('use', 1793), ('data', 1238), ('system', 1208), ('propos', 1082), ('model', 937), ('method', 880), ('comput', 868), ('robot', 806), ('imag', 792), ('perform', 774), ('base', 728), ('algorithm', 719), ('databas', 701), ('result', 685), ('secur', 665), ('paper', 635), ('approach', 621), ('compil', 602), ('applic', 594), ('design', 569), ('gener', 548), ('learn', 543), ('develop', 535), ('detect', 513), ('process', 512), ('.', 512), ('inform', 507), ('network', 505), ('present', 504), ('implement', 481)]


After stemming, uninformative stems such as use, data, and system dominate the frequency list. We therefore remove the 30 most common stems, treating them as additional stopwords that contribute little to distinguishing a document's topic.
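
The removal step can be sketched on toy data as follows. This is only an illustration: `collections.Counter` stands in for NLTK's `FreqDist`, and the helper name `drop_most_common` and the toy token lists are assumptions made for this example.

```python
from collections import Counter

def drop_most_common(token_lists, n):
    """Remove the n most frequent tokens (corpus-wide counts) from every document."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    common = {token for token, _ in counts.most_common(n)}
    filtered = [[t for t in tokens if t not in common] for tokens in token_lists]
    return filtered, common

# Toy stemmed documents (hypothetical)
toy = [
    ["use", "quantum", "data", "key"],
    ["use", "data", "robot", "arm"],
    ["use", "imag", "data", "use"],
]
filtered, removed = drop_most_common(toy, n=2)
print(sorted(removed))
print(filtered)
```

Here the two most frequent stems, "use" and "data", are removed from every document, while the topical stems remain.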

In [None]:
# Take the 30 most frequent stems from 'freq' (the FreqDist built above)
most_common_words = [word for word, _ in freq.most_common(30)]

# Convert the list to a set for faster membership testing
common_words_set = set(most_common_words)

# Now filter out these common words from the stemmed tokens
filtered_tokens_list = [[token for token in tokens if token not in common_words_set] for tokens in stemmed_tokens_list]

# Re-check the most frequent words after filtering - further candidates for the stopword list
listofall = [ item for elem in filtered_tokens_list for item in elem]

freq_filtered = FreqDist(listofall)
wnum=freq_filtered.B()
print("\nMost common words after filtering (total %d)"%wnum)
print(freq_filtered.most_common(30))



Most common words after filtering (total 10229)
[('differ', 470), ('improv', 445), ('relat', 445), ('provid', 441), ('show', 437), ('optim', 437), ('techniqu', 430), ('time', 429), ('studi', 417), ('program', 417), ('evalu', 386), ('also', 385), ('effici', 380), ('work', 375), ('problem', 366), ('analysi', 365), ('object', 361), ('scheme', 358), ('research', 349), ('new', 348), ('featur', 347), ('control', 337), ('key', 336), ('structur', 333), ('compar', 332), ('requir', 331), ('vision', 327), ('two', 324), ('task', 320), ('framework', 316)]


## **2. K-means clustering and topic detection**

(b) Cluster the preprocessed data with $K$-means trying $K = 3, . . . , 10$.
Evaluate the clustering quality with the Davies-Bouldin index and select the best K. Then evaluate the most frequent unigrams and most
frequent bigrams in each cluster. (It is possible that the lists still contain some uninformative stopwords that you need to exclude.) Try to
conclude what is the topic of each cluster. This is the baseline solution,
so don’t worry, if all the topics are not yet clear.
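
To make the evaluation criterion concrete, here is a minimal pure-Python sketch of the Davies-Bouldin index with Euclidean distances. It is an illustration under stated assumptions, not a replacement for `sklearn.metrics.davies_bouldin_score`: the helper names `centroid` and `davies_bouldin` and the toy points are hypothetical, and the sketch assumes at least two clusters with distinct centroids.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def davies_bouldin(points, labels):
    """Davies-Bouldin index; assumes at least two clusters with distinct centroids."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centers = {label: centroid(members) for label, members in clusters.items()}
    # s_i: mean distance of the members of cluster i to its centroid
    scatter = {
        label: sum(dist(p, centers[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    # For each cluster, take the worst (largest) similarity ratio to any other cluster
    worst = [
        max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst) / len(worst)

# Four toy points forming two tight, well-separated pairs
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
compact = davies_bouldin(points, [0, 0, 1, 1])  # left pair vs right pair
mixed = davies_bouldin(points, [0, 1, 0, 1])    # bottom pair vs top pair
print(compact, mixed)
```

The compact labelling gives an index of 0.2 and the mixed labelling gives 5.0, which illustrates why a lower Davies-Bouldin index indicates a better clustering.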

Report here the results of the K-means approach. What was the best clustering (K
and Davies-Bouldin index), the most frequent unigrams and bigrams
in clusters (e.g., in a table), and your conclusion on the topics.

Next, we build three tf-idf models: unigrams only, bigrams only, and unigrams and bigrams combined.
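
For reference, the tf-idf variant used in the vectorizer cells (`smooth_idf=False`, `norm='l2'`, default raw term counts) corresponds, to our understanding of the scikit-learn documentation, to the following weighting. Here $N$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ is the raw count of $t$ in document $d$; these symbols are introduced only for this summary.

$$
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)} + 1, \qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t), \qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
$$

The last step is the L2 normalization, so every non-empty document vector $\hat{w}_{\cdot,d}$ has unit Euclidean length.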

In [None]:
#6. Present as tf-idf
cleaned_documents = [' '.join(sentence_tokens) for sentence_tokens in filtered_tokens_list]

# Copy of cleaned_documents for part (c) SVD
SVD_cleaned_documents = deepcopy(cleaned_documents)

print('The preprocessed clean documents:')
for document in cleaned_documents[:10]:
    print(document)

The preprocessed clean documents:
anomali wide area imageri geniş alan görüntülerind anomali tespiti studi anomali wide area imageri collect aircraft set anomali identifi anyth normal cours action purpos two differ set experi carri set anomali convolut neural tri next past pre-process given anomali compar estim true
person re-identif deep kronecker-product match group-shuffl random walk person re-identif re-id aim robustli measur visual affin person wide intellig surveil associ person across multipl camera treat retriev problem given probe person affin probe galleri pg affin rank retriev galleri exist two main challeng effect solv problem person usual show signific variat differ person pose view angl spatial layout correspond person therefor vital tackl problem state-of-the-art either ignor spatial variat util extra pose handl challeng exist person re-id rank galleri consid pg affin ignor affin galleri gg affin affin could provid import clue accur galleri rank util post-process stage c

### Unigram tf-idf vectorizer

In [None]:
# Ignoring terms that appear in less than 5% of the documents or in more than 25% of the documents

unigram_tfidf_vectorizer = TfidfVectorizer(
    min_df=0.05,
    max_df=0.25,
    smooth_idf=False,
    norm='l2',            # Ensures all our feature vectors have a Euclidean norm of 1
    ngram_range=(1,1)     # Extract only unigrams
)

#only tf part:
#tfidf_vectorizer = TfidfVectorizer(use_idf=False)

unigram_tfidf_vectorizer.fit(cleaned_documents)
unigram_tf_idf_vectors = unigram_tfidf_vectorizer.transform(cleaned_documents)

print("\nThe shape of the tf-idf vectors (number of documents, number of features) for unigram model is")
print(unigram_tf_idf_vectors.shape)


print("\nThe tf-idf values of the first document (unigrams)\n")
feature_names = unigram_tfidf_vectorizer.get_feature_names_out()
feature_index = unigram_tf_idf_vectors[0,:].nonzero()[1]
tfidf_scores = zip(feature_index, [unigram_tf_idf_vectors[0, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)


The shape of the tf-idf vectors (number of documents, number of features) for unigram model is
(1143, 349)

The tf-idf values of the first document (unigrams)

wide 0.42706672027885506
two 0.14821467742261452
studi 0.14557646252298534
set 0.4974544721165923
purpos 0.2292533250306582
neural 0.1984329905770054
identifi 0.20727323621885707
given 0.22354864669609772
experi 0.16858331667935514
estim 0.21095129460311848
convolut 0.2176249810162788
compar 0.14710144650586116
collect 0.22125896429309003
area 0.3889064878787454


### Bigram tf-idf vectorizer

In [None]:
# Ignoring terms that appear in less than 1.5% of the documents or in more than 25% of the documents
# Relaxing min_df compared to the unigram model lets more bigrams into the vocabulary

bigram_tfidf_vectorizer = TfidfVectorizer(
    min_df=0.015,
    max_df=0.25,
    smooth_idf=False,
    norm='l2',            # Ensures all our feature vectors have a Euclidean norm of 1
    ngram_range=(2,2)     # Extract only bigrams
)


bigram_tfidf_vectorizer.fit(cleaned_documents)
bigram_tf_idf_vectors = bigram_tfidf_vectorizer.transform(cleaned_documents)

print("\nThe shape of the tf-idf vectors (number of documents, number of features) for bigram model is")
print(bigram_tf_idf_vectors.shape)


print("\nThe tf-idf values of the first document (bigrams)\n")
feature_names = bigram_tfidf_vectorizer.get_feature_names_out()
feature_index = bigram_tf_idf_vectors[0,:].nonzero()[1]
tfidf_scores = zip(feature_index, [bigram_tf_idf_vectors[0, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)


The shape of the tf-idf vectors (number of documents, number of features) for bigram model is
(1143, 50)

The tf-idf values of the first document (bigrams)

two differ 0.7839204046175395
convolut neural 0.6208613365513053


### Both grams tf-idf vectorizer

In [None]:
# Ignoring terms that appear in less than 0.5% of the documents or in more than 25% of the documents
# The very small min_df keeps bigrams, which are individually much rarer than unigrams

both_tfidf_vectorizer = TfidfVectorizer(
    min_df=0.005,
    max_df=0.25,
    smooth_idf=False,
    norm='l2',            # Ensures all our feature vectors have a Euclidean norm of 1
    ngram_range=(1,2)     # Extract both unigrams and bigrams
)


both_tfidf_vectorizer.fit(cleaned_documents)
both_tf_idf_vectors = both_tfidf_vectorizer.transform(cleaned_documents)

print("\nThe shape of the tf-idf vectors (number of documents, number of features) for both grams model is")
print(both_tf_idf_vectors.shape)


print("\nThe tf-idf values of the first document (both grams)\n")
feature_names = both_tfidf_vectorizer.get_feature_names_out()
feature_index = both_tf_idf_vectors[0,:].nonzero()[1]
tfidf_scores = zip(feature_index, [both_tf_idf_vectors[0, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)


The shape of the tf-idf vectors (number of documents, number of features) for both grams model is
(1143, 2619)

The tf-idf values of the first document (both grams)

wide 0.1655509884341766
two differ 0.11437782918690984
two 0.057454924916052634
true 0.12705410179616836
tri 0.12248788556946724
studi 0.056432229717395616
set 0.19283656545780464
purpos 0.08886928612902774
process 0.11751825835945813
pre process 0.14698109324214143
pre 0.10688834399803271
past 0.11341776593920974
normal 0.10121662633184234
next 0.11987204952994096
neural 0.07692188636595902
imageri 0.24782734524724012
identifi 0.08034877807752544
given 0.08665788661663103
experi 0.06535075992705697
estim 0.08177456513167825
cours 0.10912987584470467
convolut neural 0.09058671197041691
convolut 0.08436159739089297
compar 0.05702338466748522
collect 0.08577029887677916
carri 0.09619373005334125
area 0.15075830173503266
anomali 0.772781964651437
aircraft 0.14335573382378414
action 0.11249536208095705


### Clustering with K-means for the three n-grams model

In [None]:
# Clustering the documents based on unigram tf-idf vectorizer

min_score = 1e6
best_k = 0
for i in range(3,11):
    kmeans = KMeans(n_clusters=i, n_init=20)
    kmeans.fit(unigram_tf_idf_vectors)
    labels = kmeans.labels_
    db_index = davies_bouldin_score(unigram_tf_idf_vectors.toarray(), labels)
    if db_index < min_score:
        min_score = db_index
        best_k = i

print("Optimal number of clusters of the unigram model:", best_k)
print("Davies-Bouldin Index of the unigram model:", min_score, "\n")
kmeans = KMeans(n_clusters=best_k, n_init=10)
kmeans.fit(unigram_tf_idf_vectors)
labels = kmeans.labels_
db_index = davies_bouldin_score(unigram_tf_idf_vectors.toarray(), labels)

clusters = {i: [] for i in range(best_k)}

for point, label in zip(filtered_tokens_list, labels):
    clusters[label].append(point)

for i in range(best_k):
    listofcluster = [ item for elem in clusters[i] for item in elem]
    cluster_freq = FreqDist(listofcluster)
    print("Cluster ", i, ":", cluster_freq.most_common(15))

Optimal number of clusters of the unigram model: 10
Davies-Bouldin Index of the unigram model: 4.744800668411147 

Cluster  0 : [('encrypt', 292), ('scheme', 271), ('key', 194), ('attack', 163), ('cryptographi', 156), ('protocol', 148), ('authent', 121), ('cloud', 108), ('provid', 97), ('iot', 93), ('effici', 86), ('cryptograph', 75), ('new', 72), ('devic', 72), ('techniqu', 71)]
Cluster  1 : [('dataset', 180), ('train', 153), ('deep', 130), ('neural', 122), ('featur', 118), ('accuraci', 112), ('segment', 102), ('differ', 97), ('improv', 95), ('predict', 94), ('convolut', 89), ('achiev', 85), ('classif', 77), ('evalu', 75), ('show', 71)]
Cluster  2 : [('quantum', 246), ('cryptographi', 52), ('key', 45), ('protocol', 40), ('attack', 37), ('oper', 35), ('post-quantum', 31), ('state', 31), ('commun', 31), ('gate', 29), ('scheme', 28), ('time', 27), ('circuit', 27), ('architectur', 26), ('also', 25)]
Cluster  3 : [('vision', 186), ('video', 151), ('technolog', 115), ('research', 88), ('stu

In [None]:
# Clustering the documents based on bigram tf-idf vectorizer

min_score = 1e6
best_k = 0
for i in range(3,11):
    kmeans = KMeans(n_clusters=i, n_init=20)
    kmeans.fit(bigram_tf_idf_vectors)
    labels = kmeans.labels_
    db_index = davies_bouldin_score(bigram_tf_idf_vectors.toarray(), labels)
    if db_index < min_score:
        min_score = db_index
        best_k = i

print("Optimal number of clusters of the bigram model:", best_k)
print("Davies-Bouldin Index of the bigram model:", min_score, "\n")

# Fit the KMeans model to find the best_k clusters
kmeans = KMeans(n_clusters=best_k, n_init=20)
kmeans.fit(bigram_tf_idf_vectors)
labels = kmeans.labels_

# Extract the top bigrams for each cluster center
feature_names = bigram_tfidf_vectorizer.get_feature_names_out()

for i in range(best_k):
    # Get indices of the top features for this cluster
    top_feature_indices = kmeans.cluster_centers_[i].argsort()[-10:][::-1]

    print(f"Cluster {i} and bigram TF-IDF score:", end=" ")
    for idx in top_feature_indices:
        print(f"({feature_names[idx]}: {kmeans.cluster_centers_[i][idx]:.4f})", end=", ")
    print("\n")



Optimal number of clusters of the bigram model: 8
Davies-Bouldin Index of the bigram model: 1.2864844842709722 

Cluster 0 and bigram TF-IDF score: (program languag: 0.0270), (real world: 0.0254), (experiment show: 0.0249), (recent year: 0.0233), (execut time: 0.0222), (open sourc: 0.0215), (time consum: 0.0185), (two differ: 0.0184), (well known: 0.0181), (solv problem: 0.0181), 

Cluster 1 and bigram TF-IDF score: (convolut neural: 0.6531), (neural cnn: 0.2571), (artifici intellig: 0.0784), (experiment show: 0.0687), (time consum: 0.0337), (learning bas: 0.0271), (end to: 0.0268), (to end: 0.0268), (open sourc: 0.0261), (solv problem: 0.0259), 

Cluster 2 and bigram TF-IDF score: (vision bas: 0.7797), (three dimension: 0.0660), (improv accuraci: 0.0590), (real world: 0.0561), (futur research: 0.0440), (wide rang: 0.0392), (low cost: 0.0385), (real tim: 0.0369), (learning bas: 0.0353), (convolut neural: 0.0321), 

Cluster 3 and bigram TF-IDF score: (case studi: 0.9006), (et al: 0.0453

In [None]:
# Clustering the documents based on the combined unigram + bigram tf-idf vectorizer

min_score = 1e6
best_k = 0
for i in range(3,11):
    kmeans = KMeans(n_clusters=i, n_init=20)
    kmeans.fit(both_tf_idf_vectors)
    labels = kmeans.labels_
    db_index = davies_bouldin_score(both_tf_idf_vectors.toarray(), labels)
    if db_index < min_score:
        min_score = db_index
        best_k = i

print("Optimal number of clusters of the both grams model:", best_k)
print("Davies-Bouldin Index of the both grams model:", min_score, "\n")

# 'both_tf_idf_vectors' is the tf-idf matrix and 'both_tfidf_vectorizer' is the vectorizer that produced it

# Fit the KMeans model to find the best_k clusters
kmeans = KMeans(n_clusters=best_k, n_init=20)
kmeans.fit(both_tf_idf_vectors)
labels = kmeans.labels_

# Extract the top terms (unigrams and bigrams) for each cluster center
feature_names = both_tfidf_vectorizer.get_feature_names_out()

for i in range(best_k):
    # Get indices of the top features for this cluster
    top_feature_indices = kmeans.cluster_centers_[i].argsort()[-10:][::-1]

    print(f"Cluster {i} and TF-IDF score:", end=" ")
    for idx in top_feature_indices:
        print(f"({feature_names[idx]}: {kmeans.cluster_centers_[i][idx]:.4f})", end=", ")
    print("\n")


Optimal number of clusters of the both grams model: 5
Davies-Bouldin Index of the both grams model: 6.968635009292602 

Cluster 0 and TF-IDF score: (quantum: 0.4007), (gate: 0.0624), (key: 0.0573), (cryptographi: 0.0559), (qubit: 0.0554), (protocol: 0.0540), (circuit: 0.0473), (post quantum: 0.0465), (attack: 0.0428), (post: 0.0413), 

Cluster 1 and TF-IDF score: (encrypt: 0.0940), (scheme: 0.0708), (cryptographi: 0.0548), (key: 0.0536), (attack: 0.0522), (iot: 0.0482), (protocol: 0.0476), (cloud: 0.0407), (authent: 0.0407), (cryptograph: 0.0354), 

Cluster 2 and TF-IDF score: (relat: 0.0467), (program: 0.0453), (queri: 0.0451), (languag: 0.0449), (graph: 0.0366), (code: 0.0340), (optim: 0.0288), (transform: 0.0243), (sql: 0.0242), (memori: 0.0239), 

Cluster 3 and TF-IDF score: (control: 0.0575), (measur: 0.0395), (soft: 0.0361), (sensor: 0.0303), (estim: 0.0287), (environ: 0.0275), (task: 0.0271), (simul: 0.0251), (human: 0.0240), (optim: 0.0221), 

Cluster 4 and TF-IDF score: (visio

Based on the combined unigram and bigram model (best K = 5), the baseline solution appears to contain five topics:

- Topic 1: Database, SQL and queries
- Topic 2: General programming and operating system/hardware
- Topic 3: Computer vision and robotics
- Topic 4: Security and cryptography
- Topic 5: Quantum studies and application in security

## **3. Additional experiments**

(c) Try to improve your results! Here you can freely try any methods covered in the course. You can improve the preprocessing (e.g., lemmatization), clustering (e.g., try dimension reduction or another clustering method) or the evaluation of the most important terms (e.g., utilize
the title, perform SVD per cluster and look at the leading singular
vector or analyze only the centroid or most central documents). Conclude the main (3–10) topics of the document collection based on your experiments!

Report here your experiments in the (c) part. Describe briefly what you tried and the results (the most important terms and concluded topics). Evaluate also if your experiment was successful, i.e., if it produced better results than the
baseline. It is suggested to divide this section into subsections, if you
tried many approaches.
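
One of the suggested techniques, looking at the leading singular vector of a cluster, can be sketched with NumPy as follows. This is a toy illustration under stated assumptions: the term list and the small weight matrix are made up and merely stand in for the tf-idf rows of the documents in one cluster.

```python
import numpy as np

# Hypothetical vocabulary and tf-idf-like rows for the documents of ONE cluster
terms = ["quantum", "qubit", "gate", "imag", "video"]
cluster_matrix = np.array([
    [0.9, 0.3, 0.2, 0.0, 0.0],
    [0.8, 0.4, 0.1, 0.1, 0.0],
    [0.7, 0.2, 0.3, 0.0, 0.1],
])

# Rows of vt are the right singular vectors, i.e. directions in term space
_, _, vt = np.linalg.svd(cluster_matrix, full_matrices=False)

# The sign of a singular vector is arbitrary, so rank terms by absolute weight
leading = np.abs(vt[0])
top_terms = [terms[i] for i in np.argsort(leading)[::-1][:3]]
print(top_terms)
```

The terms with the largest absolute weights in the leading right singular vector are the ones that dominate the cluster's main direction of variation, so they can serve as candidate topic descriptors alongside the most frequent terms.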

### Experiment with only the title (unigram tf-idf vectorizer)

In [None]:
titledata = scopusdata['TITLE'].to_list()

# Tokenize the titles only, not the combined title + abstract corpus
title_tokens = [word_tokenize(document) for document in titledata]

title_lc_tokens_list = []

for token_document in title_tokens:
    title_lc_tokens_list.append([token.lower() for token in token_document])


# original number of tokens
uniques = np.unique([token for token_document in title_lc_tokens_list for token in token_document])



# Steps 2 and 3: remove stop words and punctuation

title_filtered_sentence = []
for i in title_lc_tokens_list:
    title_filtered_sentence.append([token for token in i if token not in stop_words])

# Numbers are also removed
title_filtered_sentence = [ ' '.join(i) for i in title_filtered_sentence ]
title_filtered_sentence = [ re.sub(r'\d+', '', sentence) for sentence in title_filtered_sentence ]

# Step 4 Stemming
title_stemmed_tokens_list = []

for i in title_filtered_sentence:
    title_stemmed_tokens_list.append([porter.stem(j) for j in i.split()])

title_filtered_tokens_list = [[token for token in tokens if token not in common_words_set] for tokens in title_stemmed_tokens_list]

#6. Present as tf-idf
title_cleaned_documents = [' '.join(sentence_tokens) for sentence_tokens in title_filtered_tokens_list]


unigram_tfidf_vectorizer = TfidfVectorizer(
    min_df=0.05,
    max_df=0.25,
    smooth_idf=False,
    norm='l2',            # Ensures all our feature vectors have a Euclidean norm of 1
    ngram_range=(1,1)     # Extract only unigrams
)
unigram_tfidf_vectorizer.fit(title_cleaned_documents)
title_unigram_tf_idf_vectors = unigram_tfidf_vectorizer.transform(title_cleaned_documents)

feature_names = unigram_tfidf_vectorizer.get_feature_names_out()
feature_index = title_unigram_tf_idf_vectors[0,:].nonzero()[1]
tfidf_scores = zip(feature_index, [title_unigram_tf_idf_vectors[0, x] for x in feature_index])


In [None]:
# Clustering the documents based on the title-only unigram tf-idf vectors

min_score = 1e6
best_k = 0
for i in range(3,11):
    kmeans = KMeans(n_clusters=i, n_init=20)
    kmeans.fit(title_unigram_tf_idf_vectors)
    labels = kmeans.labels_
    db_index = davies_bouldin_score(title_unigram_tf_idf_vectors.toarray(), labels)
    if db_index < min_score:
        min_score = db_index
        best_k = i

print("Optimal number of clusters of the unigram model:", best_k)
print("Davies-Bouldin Index of the unigram model:", min_score, "\n")
kmeans = KMeans(n_clusters=best_k, n_init=10)
kmeans.fit(title_unigram_tf_idf_vectors)
labels = kmeans.labels_
db_index = davies_bouldin_score(title_unigram_tf_idf_vectors.toarray(), labels)

clusters = {i: [] for i in range(best_k)}

for point, label in zip(filtered_tokens_list, labels):
    clusters[label].append(point)

for i in range(best_k):
    listofcluster = [ item for elem in clusters[i] for item in elem]
    cluster_freq = FreqDist(listofcluster)
    print("Cluster ", i, ":", cluster_freq.most_common(15))

Optimal number of clusters of the unigram model: 5
Davies-Bouldin Index of the unigram model: 4.841051642953007 

Cluster  0 : [('vision', 286), ('object', 276), ('dataset', 220), ('deep', 219), ('featur', 205), ('train', 204), ('video', 187), ('accuraci', 186), ('visual', 169), ('track', 166), ('differ', 165), ('neural', 160), ('research', 159), ('studi', 155), ('improv', 154)]
Cluster  1 : [('program', 305), ('relat', 301), ('languag', 260), ('queri', 254), ('graph', 183), ('code', 179), ('transform', 129), ('type', 118), ('semant', 117), ('show', 104), ('optim', 102), ('analysi', 100), ('tool', 99), ('time', 97), ('parallel', 96)]
Cluster  2 : [('quantum', 254), ('cryptographi', 54), ('key', 45), ('protocol', 40), ('oper', 39), ('attack', 39), ('scheme', 34), ('post-quantum', 33), ('state', 33), ('gate', 32), ('commun', 32), ('circuit', 32), ('program', 31), ('time', 31), ('also', 30)]
Cluster  3 : [('encrypt', 292), ('scheme', 264), ('key', 194), ('attack', 161), ('cryptographi', 1

The title-only solution therefore also appears to contain five topics (the order of the clusters may change between runs because K-means is randomly initialized):

- Topic 1: General programming and operating system/hardware
- Topic 2: Quantum studies and application in security
- Topic 3: Database, SQL and queries
- Topic 4: Computer vision and robotics
- Topic 5: Security and cryptography
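
The run-to-run variation in cluster order comes from the random initialization of K-means. A minimal sketch of how a fixed `random_state` makes the labelling repeatable is given below; the toy blobs, the helper name `fit_labels`, and the seed value are assumptions made for this example and are not part of the experiments above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs standing in for tf-idf vectors
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])

def fit_labels(seed):
    """Fit K-means with a fixed seed and return the cluster labels."""
    return KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X).labels_

labels_a = fit_labels(seed=42)
labels_b = fit_labels(seed=42)
same = bool(np.array_equal(labels_a, labels_b))
print("Identical labels with the same seed:", same)
```

With the same seed, repeated fits assign the same label to the same cluster, which makes tables of per-cluster terms easier to compare between runs.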


### Experiment with lemmatization (unigram tf-idf vectorizer)

In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# Lemmatize every token; the POS tag is obtained by tagging each word in isolation (no sentence context)
lemmatized_tokens_list = [[lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in doc] for doc in filtered_tokens_list]
cleaned = [' '.join(sentence_tokens) for sentence_tokens in lemmatized_tokens_list]

print('The preprocessed clean documents:')
for document in cleaned[:10]:
    print(document)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The preprocessed clean documents:
anomali wide area imageri geniş alan görüntülerind anomali tespiti studi anomali wide area imageri collect aircraft set anomali identifi anyth normal cours action purpos two differ set experi carri set anomali convolut neural tri next past pre-process give anomali compar estim true
person re-identif deep kronecker-product match group-shuffl random walk person re-identif re-id aim robustli measur visual affin person wide intellig surveil associ person across multipl camera treat retriev problem give probe person affin probe galleri pg affin rank retriev galleri exist two main challeng effect solv problem person usual show signific variat differ person pose view angl spatial layout correspond person therefor vital tackl problem state-of-the-art either ignor spatial variat util extra pose handl challeng exist person re-id rank galleri consid pg affin ignor affin galleri gg affin affin could provid import clue accur galleri rank util post-process stage cur

In [None]:
unigram_tfidf_vectorizer = TfidfVectorizer(
    min_df=0.05,
    max_df=0.25,
    smooth_idf=False,
    norm='l2',            # Ensures all our feature vectors have a Euclidean norm of 1
    ngram_range=(1,1)     # Extract only unigrams
)
unigram_tfidf_vectorizer.fit(cleaned)
lemmatized_unigram_tf_idf_vectors = unigram_tfidf_vectorizer.transform(cleaned)

print("\nThe shape of the tf-idf vectors (number of documents, number of features) for unigram model is")
print(lemmatized_unigram_tf_idf_vectors.shape)


print("\nThe tf-idf values of the first document (unigrams)\n")
feature_names = unigram_tfidf_vectorizer.get_feature_names_out()
feature_index = lemmatized_unigram_tf_idf_vectors[0,:].nonzero()[1]
tfidf_scores = zip(feature_index, [lemmatized_unigram_tf_idf_vectors[0, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)


The shape of the tf-idf vectors (number of documents, number of features) for unigram model is
(1143, 348)

The tf-idf values of the first document (unigrams)

wide 0.42968762964661683
two 0.14912427121218227
studi 0.14646986558212088
set 0.5005073512666091
purpos 0.23066025317240396
neural 0.19965077426086789
identifi 0.2085452725089655
give 0.1956466039096827
experi 0.1696179128512244
estim 0.21224590314532887
convolut 0.21896054598615544
compar 0.14800420839493267
collect 0.22261683102604074
area 0.3912932078193654
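
The scores above come from the vectorizer settings `smooth_idf=False` and `norm='l2'`. With these settings (and sklearn's default raw term counts), the weight of a term is its count in the document times `ln(n / df) + 1`, and each document vector is then divided by its Euclidean length. The sketch below reproduces that formula without sklearn on a tiny invented corpus; the function name `tfidf_l2` is ours, and the `min_df`/`max_df` filtering is left out for brevity.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """tf-idf with idf = ln(n / df) + 1 (no smoothing), rows scaled to unit L2 norm."""
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Toy tokenised corpus (invented for illustration)
docs = [
    ["set", "anomali", "set", "wide"],
    ["quantum", "key", "set"],
    ["quantum", "gate"],
]
vectors = tfidf_l2(docs)
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 4))
```

In the toy corpus, `set` is less rare than `wide` but occurs twice in the first document, so it still receives the largest weight. The same effect may explain why a repeated term such as `set` scores highest in the first document above.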


In [None]:
# Clustering the documents based on unigram tf-idf vectorizer

min_score = 1e6
best_k = 0
best_labels = None
for i in range(3,11):
    kmeans = KMeans(n_clusters=i, n_init=20)
    kmeans.fit(lemmatized_unigram_tf_idf_vectors)
    db_index = davies_bouldin_score(lemmatized_unigram_tf_idf_vectors.toarray(), kmeans.labels_)
    if db_index < min_score:
        min_score = db_index
        best_k = i
        best_labels = kmeans.labels_

print("Optimal number of clusters of the unigram model:", best_k)
print("Davies-Bouldin Index of the unigram model:", min_score, "\n")
# Reuse the labels of the best-scoring fit so that the clusters printed below match the reported index
labels = best_labels

clusters = {i: [] for i in range(best_k)}

for point, label in zip(filtered_tokens_list, labels):
    clusters[label].append(point)

for i in range(best_k):
    listofcluster = [ item for elem in clusters[i] for item in elem]
    cluster_freq = FreqDist(listofcluster)
    print("Cluster ", i, ":", cluster_freq.most_common(15))

Optimal number of clusters of the unigram model: 5
Davies-Bouldin Index of the unigram model: 4.709833356812028 

Cluster  0 : [('vision', 319), ('object', 312), ('studi', 301), ('differ', 288), ('control', 268), ('improv', 261), ('train', 248), ('featur', 247), ('task', 246), ('research', 237), ('measur', 234), ('dataset', 231), ('evalu', 226), ('accuraci', 226), ('deep', 223)]
Cluster  1 : [('quantum', 246), ('cryptographi', 52), ('key', 45), ('protocol', 40), ('attack', 37), ('oper', 35), ('post-quantum', 31), ('state', 31), ('commun', 31), ('gate', 29), ('scheme', 28), ('time', 27), ('circuit', 27), ('architectur', 26), ('also', 25)]
Cluster  2 : [('encrypt', 302), ('scheme', 277), ('key', 198), ('cryptographi', 189), ('attack', 167), ('protocol', 143), ('authent', 124), ('iot', 122), ('provid', 109), ('cloud', 109), ('effici', 100), ('devic', 96), ('cryptograph', 94), ('techniqu', 86), ('new', 79)]
Cluster  3 : [('program', 345), ('code', 212), ('languag', 212), ('optim', 194), ('

The solution with lemmatization therefore appears to have 5 topics (the cluster numbering may change between runs because k-means is randomly initialised):

- Topic 1: General programming and operating system/hardware
- Topic 2: Quantum studies and application in security
- Topic 3: Database, SQL and queries
- Topic 4: Computer vision and robotics
- Topic 5: Security and cryptography
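
Because the cluster numbering changes between runs, comparing experiments requires mapping cluster ids back to topics. One possible way to do this, sketched below, is to assign each cluster to the reference topic whose keywords overlap most with the cluster's top terms (Jaccard similarity). The keyword lists are abbreviated from the cluster listings in this report, but the function names and the second run's ordering are hypothetical, and the greedy matching does not enforce a one-to-one assignment.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_topics(reference, run):
    """Map each cluster id in `run` to the reference topic with the highest overlap."""
    mapping = {}
    for cluster_id, keywords in run.items():
        best = max(reference, key=lambda name: jaccard(reference[name], keywords))
        mapping[cluster_id] = best
    return mapping

# Reference keywords per topic (abbreviated from the listings in this report)
reference = {
    "quantum": ["quantum", "cryptographi", "key", "protocol", "gate"],
    "security": ["encrypt", "scheme", "key", "attack", "authent"],
    "vision": ["vision", "object", "dataset", "deep", "featur"],
}
# Hypothetical later run in which the same clusters come back in another order
run = {
    0: ["encrypt", "scheme", "key", "cryptographi", "attack"],
    1: ["vision", "object", "studi", "differ", "train"],
    2: ["quantum", "key", "protocol", "attack", "oper"],
}
mapping = match_topics(reference, run)
print(mapping)
```

Shared terms such as `key` and `attack` appear in both the quantum and the security clusters, so the overlap score, not any single keyword, decides the assignment.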


### Experiment with SVD (unigram tf-idf vectorizer)

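SVD (Singular Value Decomposition) is a dimensionality reduction technique that is known as LSA (Latent Semantic Analysis) when applied to text clustering and topic modeling. In LSA, SVD is usually applied after preprocessing and before clustering. In this experiment, we instead apply it after k-means to the tf-idf matrix of each cluster and read the cluster's topic from the terms with the largest coefficients in the leading singular vector.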
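
To illustrate what "leading singular vector" means here, the sketch below approximates it with power iteration on a tiny invented document-term matrix. This is a swapped-in, dependency-free technique for illustration only; the notebook itself uses sklearn's `TruncatedSVD(n_components=1)`, and the function name and matrix values are assumptions.

```python
import math

def leading_right_singular_vector(matrix, iterations=200):
    """Approximate the leading right singular vector by power iteration on A^T A."""
    n_cols = len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v, then w = A^T u, so that w = (A^T A) v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(len(matrix))) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy tf-idf matrix of one cluster (rows: documents, columns: terms; values invented)
terms = ["quantum", "protocol", "key", "vision"]
cluster_matrix = [
    [0.9, 0.3, 0.2, 0.0],
    [0.8, 0.1, 0.4, 0.0],
    [0.7, 0.5, 0.0, 0.1],
]
v = leading_right_singular_vector(cluster_matrix)
ranked = sorted(zip(terms, v), key=lambda pair: -pair[1])
for term, coefficient in ranked:
    print(f"{term} (coefficient: {coefficient:.4f})")
```

The term that carries weight in every document of the cluster dominates the vector, which is why the largest coefficients can serve as a short description of the cluster's topic.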

In [None]:
# Build the unigram tf-idf matrix, ignoring terms that appear in fewer than 5% or in more than 25% of the documents

unigram_tfidf_vectorizer = TfidfVectorizer(
    min_df=0.05,
    max_df=0.25,
    smooth_idf=False,
    norm='l2',            # Ensures all our feature vectors have a Euclidean norm of 1
    ngram_range=(1,1)     # Extract only unigrams
)

#only tf part:
#tfidf_vectorizer = TfidfVectorizer(use_idf=False)

unigram_tfidf_vectorizer.fit(SVD_cleaned_documents)
unigram_tf_idf_vectors = unigram_tfidf_vectorizer.transform(SVD_cleaned_documents)

print("\nThe shape of the tf-idf vectors (number of documents, number of features) for unigram model is")
print(unigram_tf_idf_vectors.shape)


# Clustering the documents based on unigram tf-idf vectorizer

min_score = 1e6
best_k = 0
best_labels = None
for i in range(3,11):
    kmeans = KMeans(n_clusters=i, n_init=20)
    kmeans.fit(unigram_tf_idf_vectors)
    db_index = davies_bouldin_score(unigram_tf_idf_vectors.toarray(), kmeans.labels_)
    if db_index < min_score:
        min_score = db_index
        best_k = i
        best_labels = kmeans.labels_

print("Optimal number of clusters of the unigram model:", best_k)
print("Davies-Bouldin Index of the unigram model:", min_score, "\n")
# Reuse the labels of the best-scoring fit so that the clusters printed below match the reported index
labels = best_labels

clusters = {i: [] for i in range(best_k)}

for point, label in zip(filtered_tokens_list, labels):
    clusters[label].append(point)

for i in range(best_k):
    listofcluster = [ item for elem in clusters[i] for item in elem]
    cluster_freq = FreqDist(listofcluster)
    print("Cluster ", i, ":", cluster_freq.most_common(15))


The shape of the tf-idf vectors (number of documents, number of features) for unigram model is
(1143, 349)
Optimal number of clusters of the unigram model: 5
Davies-Bouldin Index of the unigram model: 4.8797884670479 

Cluster  0 : [('encrypt', 296), ('scheme', 274), ('key', 196), ('cryptographi', 183), ('attack', 167), ('protocol', 149), ('authent', 124), ('iot', 122), ('cloud', 109), ('provid', 107), ('effici', 99), ('cryptograph', 93), ('devic', 91), ('techniqu', 83), ('new', 79)]
Cluster  1 : [('quantum', 252), ('cryptographi', 52), ('key', 45), ('protocol', 40), ('oper', 38), ('attack', 37), ('state', 33), ('gate', 32), ('circuit', 32), ('post-quantum', 31), ('program', 31), ('commun', 31), ('time', 31), ('also', 29), ('scheme', 28)]
Cluster  2 : [('control', 258), ('studi', 221), ('measur', 209), ('differ', 175), ('estim', 163), ('test', 163), ('structur', 156), ('task', 154), ('provid', 152), ('improv', 149), ('environ', 142), ('evalu', 141), ('work', 141), ('vision', 135), ('s

In [None]:
from scipy.sparse import vstack
from sklearn.decomposition import TruncatedSVD

# Divide the TF-IDF matrix into separate matrices for each cluster
clustered_documents = {i: [] for i in range(best_k)}
for doc_id, cluster_id in enumerate(labels):
    clustered_documents[cluster_id].append(unigram_tf_idf_vectors[doc_id])

# Apply SVD to each cluster's TF-IDF matrix and interpret the leading singular vector
for i in range(best_k):
    # Convert the list of TF-IDF vectors for this cluster to a sparse matrix
    cluster_tf_idf_matrix = vstack(clustered_documents[i])

    svd = TruncatedSVD(n_components=1)
    svd.fit(cluster_tf_idf_matrix)
    leading_singular_vector = svd.components_[0]

    terms = unigram_tfidf_vectorizer.get_feature_names_out()

    # Get the terms with the highest coefficients in the leading singular vector
    top_indices = leading_singular_vector.argsort()[-5:][::-1]
    top_terms = [(terms[idx], leading_singular_vector[idx]) for idx in top_indices]

    print(f"\nCluster {i} leading singular vector terms:")
    for term, coefficient in top_terms:
        print(f"{term} (coefficient: {coefficient:.4f})")



Cluster 0 leading singular vector terms:
encrypt (coefficient: 0.4741)
scheme (coefficient: 0.3690)
key (coefficient: 0.2680)
attack (coefficient: 0.2421)
cryptographi (coefficient: 0.2371)

Cluster 1 leading singular vector terms:
program (coefficient: 0.4419)
languag (coefficient: 0.3099)
code (coefficient: 0.2745)
graph (coefficient: 0.2196)
memori (coefficient: 0.1874)

Cluster 2 leading singular vector terms:
queri (coefficient: 0.5814)
relat (coefficient: 0.5026)
manag (coefficient: 0.1621)
store (coefficient: 0.1517)
graph (coefficient: 0.1319)

Cluster 3 leading singular vector terms:
quantum (coefficient: 0.9091)
protocol (coefficient: 0.1301)
cryptographi (coefficient: 0.1226)
key (coefficient: 0.1210)
attack (coefficient: 0.0895)

Cluster 4 leading singular vector terms:
vision (coefficient: 0.1947)
object (coefficient: 0.1771)
deep (coefficient: 0.1485)
track (coefficient: 0.1365)
train (coefficient: 0.1352)


## Conclusion of the topics of the documents

Based on the baseline and all experiments, we conclude that this corpus contains at least 5 topics. The following topics kept recurring in both the baseline and the experiments; the keywords listed for each topic are the leading singular vector terms from the SVD experiment:

- Topic 1: General programming and operating system/hardware, whose top keywords are program, language, code, graph, memory

- Topic 2: Quantum studies and application in security, whose top keywords are quantum, protocol, cryptography, key and attack

- Topic 3: Database, SQL and queries, whose top keywords are query, relational, management, store, graph

- Topic 4: Computer vision and robotics, whose top keywords are vision, object, deep (possibly deep learning), track (possibly in reinforcement learning automation), train

- Topic 5: Security and cryptography, whose top keywords are encryption, scheme, key, attack, cryptography

## **4. Appendix**

The code for each part of this exercise is included inline, next to the results it produces. Therefore, no additional code is attached in this Appendix.