### <center><b> CISB123 Text Analytics </b><center>

## <center><b> Lab 8 - Text Clustering using TF-IDF Vectorizer </b><center>

### Name: Maizatul Aufa binti Zamidi (SW01082394)

##### TEXT CLUSTERING USING TF-IDF VECTORIZER 

In [5]:
import numpy as np 
from sklearn.cluster import KMeans 
from sklearn.feature_extraction.text import TfidfVectorizer 
from tabulate import tabulate 
from collections import Counter

In [6]:
dataset = ["I love playing football on the weekends", 
           "I enjoy hiking and camping in the mountains", 
           "I like to read books and watch movies", 
           "I prefer playing video games over sports", 
           "I love listening to music and going to concerts"]

In [7]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

In [8]:
# Ensure NLTK resources are downloaded
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maiza\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
# Preprocessing function
def preprocess_text(doc):
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    doc = doc.lower()
    doc = re.sub(r'[^a-z\s]', '', doc)
    tokens = doc.split()
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

In [10]:
# Apply preprocessing
preprocessed_dataset = [preprocess_text(doc) for doc in dataset]

In [11]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_dataset)

k = 2
km = KMeans(n_clusters=k, random_state=42)
km.fit(X)
y_pred = km.predict(X)

In [12]:
# Display the document and its predicted cluster in a table 
table_data = [["Document", "Predicted Cluster"]] 
table_data.extend([[doc, cluster] for doc, cluster in zip(dataset, y_pred)]) 
print(tabulate(table_data, headers="firstrow")) 

Document                                           Predicted Cluster
-----------------------------------------------  -------------------
I love playing football on the weekends                            1
I enjoy hiking and camping in the mountains                        0
I like to read books and watch movies                              1
I prefer playing video games over sports                           1
I love listening to music and going to concerts                    1


In [13]:
# Print top terms per cluster 
print("\nTop terms per cluster:") 
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 
terms = vectorizer.get_feature_names_out() 
for i in range(k): 
    print("Cluster %d:" % i) 
for ind in order_centroids[i, :10]: 
    print(' %s' % terms[ind]) 
    print()


Top terms per cluster:
Cluster 0:
Cluster 1:
 love

 play

 footbal

 weekend

 go

 sport

 music

 concert

 video

 game



In [14]:
# Calculate purity 
total_samples = len(y_pred) 
cluster_label_counts = [Counter(y_pred)] 
purity = sum(max(cluster.values()) for cluster in cluster_label_counts) / total_samples 
print("Purity:", purity) 

Purity: 0.8
