# Text Clustering with Multiple Algorithms in Python

In [96]:
# Algorithms Covered 
# 1. K-Means Clustering 
# 2. Hierarchical Clustering (Agglomerative) 
# 3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

## Task 1: Preprocess the Dataset 

### Load Dataset (IMDB Reviews from nltk)

In [97]:
# Import Required Libraries
import pandas as pd
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import time

# Download Required NLTK Resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

# Load Dataset (IMDB Reviews Dataset from NLTK)
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')
documents = [' '.join(movie_reviews.words(fileid)) for fileid in movie_reviews.fileids()]
df = pd.DataFrame(documents, columns=['text'])
print(f"Dataset loaded successfully with {len(df)} samples.\n")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


Dataset loaded successfully with 2000 samples.



### Preprocess the Text
Preprocessing steps:

1. Convert text to lowercase.
2. Remove punctuation.
3. Tokenize and remove stop words.
4. Lemmatize the text.

In [98]:
# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = ''.join([char for char in text if char not in string.punctuation])  # Remove punctuation
    words = nltk.word_tokenize(text)  # Tokenize text
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]  # Lemmatize and remove stopwords
    return ' '.join(words)

# Apply preprocessing
df['clean_text'] = df['text'].apply(preprocess_text)
print("Text preprocessing completed successfully!")

Text preprocessing completed successfully!


## Task 2: Apply TF-IDF Vectorization

### Convert Text into Numerical Features
Use TfidfVectorizer from sklearn.feature_extraction.text. Limit the number of features to 5000.

In [99]:
# Convert text data into numerical features using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['clean_text'])
print(f"TF-IDF transformation completed. Shape: {X.shape}\n")

# Print Sample TF-IDF Features
feature_names = vectorizer.get_feature_names_out()
sample_tfidf = pd.DataFrame(X.toarray(), columns=feature_names)
print("Sample TF-IDF Vectorization Result:")
print(sample_tfidf.head())

TF-IDF transformation completed. Shape: (2000, 5000)

Sample TF-IDF Vectorization Result:
   000      10  100   11   12   13  13th   14   15   16  ...  younger  \
0  0.0  0.4576  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  ...      0.0   
1  0.0  0.0000  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  ...      0.0   
2  0.0  0.0000  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  ...      0.0   
3  0.0  0.0000  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  ...      0.0   
4  0.0  0.0000  0.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0  ...      0.0   

   youngster  youth  zane  zany  zellweger  zero  zeta  zombie  zone  
0        0.0    0.0   0.0   0.0        0.0   0.0   0.0     0.0   0.0  
1        0.0    0.0   0.0   0.0        0.0   0.0   0.0     0.0   0.0  
2        0.0    0.0   0.0   0.0        0.0   0.0   0.0     0.0   0.0  
3        0.0    0.0   0.0   0.0        0.0   0.0   0.0     0.0   0.0  
4        0.0    0.0   0.0   0.0        0.0   0.0   0.0     0.0   0.0  

[5 rows x 5000 columns]


## Task 3: Implement Clustering Algorithms

### Implement K-Means Clustering and Measure Computational Efficiency
We’ll use 5 clusters as an example. Assign cluster labels to each text sample.

In [100]:
# K-Means Clustering
start_time = time.time()
kmeans = KMeans(n_clusters=5, random_state=42)
df['kmeans_cluster'] = kmeans.fit_predict(X)
kmeans_time = time.time() - start_time

print(f"K-Means Clustering Completed in {kmeans_time:.4f} seconds.")

# Print K-Means Cluster Assignments
for i in range(2):
    print(f"Cluster {i} (K-Means):")
    for text in df[df['kmeans_cluster'] == i]['text'].values[:2]:
        print(f"   - {text[:100]}...\n")

K-Means Clustering Completed in 0.0666 seconds.
Cluster 0 (K-Means):
   - plot : two teen couples go to a church party , drink and then drive . they get into an accident . on...

   - it is movies like these that make a jaded movie viewer thankful for the invention of the timex indig...

Cluster 1 (K-Means):
   - the happy bastard ' s quick movie review damn that y2k bug . it ' s got a head start in this movie s...

   - capsule : in 2176 on the planet mars police taking into custody an accused murderer face the title m...



### Hierarchical Clustering
Perform clustering with different linkage methods (ward, average, complete).

In [101]:
linkages = ['ward', 'average', 'complete']
hierarchical_scores = {}

for linkage in linkages:
    start_time = time.time()
    agglo = AgglomerativeClustering(n_clusters=5, linkage=linkage)
    df[f'hierarchical_cluster_{linkage}'] = agglo.fit_predict(X.toarray())
    hierarchical_time = time.time() - start_time
    
    print(f"Hierarchical Clustering ({linkage}) Completed in {hierarchical_time:.4f} seconds.")
    
    # Print Cluster Assignments
    for i in range(2):
        print(f"Cluster {i} (Hierarchical - {linkage.capitalize()}):")
        for text in df[df[f'hierarchical_cluster_{linkage}'] == i]['text'].values[:2]:
            print(f"   - {text[:100]}...\n")

Hierarchical Clustering (ward) Completed in 3.2714 seconds.
Cluster 0 (Hierarchical - Ward):
   - plot : two teen couples go to a church party , drink and then drive . they get into an accident . on...

   - the happy bastard ' s quick movie review damn that y2k bug . it ' s got a head start in this movie s...

Cluster 1 (Hierarchical - Ward):
   - capsule : in 2176 on the planet mars police taking into custody an accused murderer face the title m...

   - john carpenter makes b - movies . always has ( " halloween , " " escape from new york , " " the thin...

Hierarchical Clustering (average) Completed in 3.3816 seconds.
Cluster 0 (Hierarchical - Average):
   - plot : two teen couples go to a church party , drink and then drive . they get into an accident . on...

   - the happy bastard ' s quick movie review damn that y2k bug . it ' s got a head start in this movie s...

Cluster 1 (Hierarchical - Average):
   - the thought - provoking question of tradition over morals is the subject d

### DBSCAN Clustering
Experiment with different eps and min_samples values. Assign cluster labels and observe how DBSCAN handles outliers.

In [102]:
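# X_normalized is not defined in any cell shown above. Assumption: it is the
# standardized dense TF-IDF matrix, built with the StandardScaler imported earlier.
X_normalized = StandardScaler().fit_transform(X.toarray())

start_time = time.time()
dbscan = DBSCAN(eps=1.0, min_samples=2)
df['dbscan_cluster'] = dbscan.fit_predict(X_normalized)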
dbscan_time = time.time() - start_time

print(f"DBSCAN Clustering Completed in {dbscan_time:.4f} seconds.")

# Print DBSCAN Cluster Assignments
for i in set(df['dbscan_cluster']):
    if i != -1:  # Ignore noise points
        print(f"Cluster {i} (DBSCAN):")
        for text in df[df['dbscan_cluster'] == i]['text'].values[:2]:
            print(f"   - {text[:100]}...\n")

DBSCAN Clustering Completed in 0.6245 seconds.
Cluster 0 (DBSCAN):
   - the tagline for this film is : " some houses are just born bad " . so i didn ' t expect too much fro...

   - the tagline for this film is : " some houses are just born bad " . so i didn ' t expect too much fro...

Cluster 1 (DBSCAN):
   - it seemed like the perfect concept . what better for the farrelly brothers , famous for writing and ...

   - it seemed like the perfect concept . what better for the farrelly brothers , famous for writing and ...

Cluster 2 (DBSCAN):
   - i know that " funnest " isn ' t a word . " fun " is a noun , and therefore cannot be conjugated like...

   - i know that " funnest " isn ' t a word . " fun " is a noun , and therefore cannot be conjugated like...
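
The cell above fits a single setting (`eps=1.0`, `min_samples=2`). A minimal sketch of the experiment the task describes is given below: it loops over a small grid and records how many clusters and how many noise points each setting yields. The helper `sweep_dbscan`, the grid values, and the toy data are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def sweep_dbscan(X, eps_values, min_samples_values):
    """Fit DBSCAN for each (eps, min_samples) pair; report cluster and noise counts."""
    rows = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            # Label -1 marks noise, so it does not count as a cluster
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = int(np.sum(labels == -1))
            rows.append((eps, min_samples, n_clusters, n_noise))
    return rows

# Hypothetical toy data: two tight blobs plus one far-away outlier
rng = np.random.default_rng(0)
toy_points = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.05, size=(20, 2)),
    [[50.0, 50.0]],
])

for eps, min_samples, n_clusters, n_noise in sweep_dbscan(
    toy_points, eps_values=[0.01, 0.5, 100.0], min_samples_values=[2, 5]
):
    print(f"eps={eps:<6} min_samples={min_samples}  clusters={n_clusters}  noise={n_noise}")
```

On the toy data, a very small `eps` leaves most points as noise and a very large `eps` merges everything into one cluster. Running the same sweep on the notebook's feature matrix would show where the review data sits between those extremes.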



## Task 4: Compare Clustering Results 


### Evaluate Cluster Quality Using Silhouette Score and Calinski-Harabasz Index

In [103]:
results = {}

# K-Means Evaluation
kmeans_score = silhouette_score(X, df['kmeans_cluster'])
kmeans_ch_score = calinski_harabasz_score(X.toarray(), df['kmeans_cluster'])
results['K-Means'] = {'Silhouette': kmeans_score, 'Calinski-Harabasz': kmeans_ch_score}

# Hierarchical Evaluation
for linkage in linkages:
    hierarchical_score = silhouette_score(X, df[f'hierarchical_cluster_{linkage}'])
    hierarchical_ch_score = calinski_harabasz_score(X.toarray(), df[f'hierarchical_cluster_{linkage}'])
    results[f'Hierarchical ({linkage})'] = {'Silhouette': hierarchical_score, 'Calinski-Harabasz': hierarchical_ch_score}

# DBSCAN Evaluation
valid_dbscan_clusters = df[df['dbscan_cluster'] != -1]['dbscan_cluster']
if len(set(valid_dbscan_clusters)) > 1:
    dbscan_score = silhouette_score(X_normalized[df['dbscan_cluster'] != -1], valid_dbscan_clusters)
else:
    dbscan_score = 'N/A'

results['DBSCAN'] = {'Silhouette': dbscan_score, 'Calinski-Harabasz': 'N/A'}

# Print Clustering Results
print("\n Clustering Results:\n")
for algo, metrics in results.items():
    print(f" {algo}")
    for key, value in metrics.items():
        print(f"   - {key}: {value}")
    print("\n")


 Clustering Results:

 K-Means
   - Silhouette: 0.0016380301972095496
   - Calinski-Harabasz: 5.849027695192205


 Hierarchical (ward)
   - Silhouette: 0.0032244468066553304
   - Calinski-Harabasz: 6.497514279685561


 Hierarchical (average)
   - Silhouette: 0.012599900637509417
   - Calinski-Harabasz: 1.0697628484000214


 Hierarchical (complete)
   - Silhouette: 0.001667928968429133
   - Calinski-Harabasz: 2.2499728124182945


 DBSCAN
   - Silhouette: 0.9999999900585097
   - Calinski-Harabasz: N/A
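
For reference when reading these numbers, the standard definitions of the two metrics are given here. For a sample $i$, let $a(i)$ be its mean distance to the other members of its own cluster and $b(i)$ the smallest mean distance to the members of any other cluster. The silhouette score reported by `silhouette_score` is the mean of $s(i)$ over the samples passed in:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

For $n$ samples in $k$ clusters, with $B_k$ the between-cluster and $W_k$ the within-cluster dispersion matrix, the Calinski-Harabasz index is:

$$
\mathrm{CH} = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \cdot \frac{n - k}{k - 1}
$$

One consequence of the first formula is that a cluster whose members are (nearly) identical has $a(i) \approx 0$, so $s(i) \approx 1$ for its members regardless of how many other samples were left out of the calculation.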




## Task 5: Result Discussion

### Analysis of the Algorithms: Cluster Quality, Handling of Noise and Outliers, and Scalability to Larger Datasets

##### K-Means Clustering:
K-Means is highly efficient and scalable, completing the clustering in 0.0666 seconds, which makes it the fastest of the algorithms tested. However, its cluster quality was poor: a Silhouette Score of 0.0016 and a Calinski-Harabasz Score of 5.85 indicate poorly separated clusters. K-Means is also sensitive to outliers and noise, because every point is assigned to a cluster and pulls on its centroid, which makes it less suitable for noisy datasets.

##### Hierarchical Clustering (Ward, Average, Complete):
Hierarchical Clustering provides a hierarchical view of the data, but cluster quality was poor across all three linkage methods:

- Ward: Silhouette Score 0.0032, Calinski-Harabasz Score 6.50
- Average: Silhouette Score 0.0126, Calinski-Harabasz Score 1.07
- Complete: Silhouette Score 0.0017, Calinski-Harabasz Score 2.25

Hierarchical Clustering was also the most computationally expensive, taking about 3.3 seconds per linkage method in the output shown above (3.2714 seconds for Ward and 3.3816 seconds for average linkage). It has no explicit notion of noise and may merge noise points into clusters, which makes it less effective for datasets with high noise.
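
The scores alone do not show how the samples are distributed across the five clusters. A minimal sketch for comparing cluster sizes per linkage method follows; `cluster_sizes` is a hypothetical helper and the data is a toy example, so the output says nothing about the review dataset itself.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_sizes(X, n_clusters, linkages=("ward", "average", "complete")):
    """Return {linkage: list of cluster sizes, largest first}."""
    sizes = {}
    for linkage in linkages:
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
        labels = model.fit_predict(X)
        sizes[linkage] = sorted(np.bincount(labels).tolist(), reverse=True)
    return sizes

# Hypothetical toy data: two blobs of 10 points each plus one isolated point
rng = np.random.default_rng(0)
toy_points = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 2)),
    rng.normal(3.0, 0.1, size=(10, 2)),
    [[40.0, 40.0]],
])

for linkage, sizes in cluster_sizes(toy_points, n_clusters=3).items():
    print(f"{linkage:<9} cluster sizes: {sizes}")
```

In the notebook, `cluster_sizes(X.toarray(), n_clusters=5)` would reveal whether a linkage method produced one dominant cluster plus a few very small ones, a pattern that the silhouette and Calinski-Harabasz values do not expose directly.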

##### DBSCAN (Density-Based Clustering):
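DBSCAN produced the highest Silhouette Score, a near-perfect value that rounds to 1.0000, indicating well-separated clusters among the points it clustered. This score is computed only over the points that DBSCAN did not label as noise, and in the sample output the two printed snippets in each cluster are identical, so it is not directly comparable with the scores of the other algorithms. DBSCAN handles noise and outliers explicitly by labelling them -1, which K-Means and Hierarchical Clustering cannot do because they assign every point to a cluster. It completed in 0.6245 seconds, slower than K-Means but faster than Hierarchical Clustering. However, DBSCAN requires careful tuning of the eps and min_samples parameters to achieve good results.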
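
Since the DBSCAN silhouette is computed only on non-noise points, it helps to know how many points that is and whether the clustered texts are repeats of one another. The sketch below is one way to check, using only the standard library; `summarize_dbscan` and the toy inputs are hypothetical.

```python
from collections import Counter

def summarize_dbscan(texts, labels):
    """Report the noise share and how many clustered texts are exact duplicates."""
    texts, labels = list(texts), list(labels)
    clustered = [text for text, label in zip(texts, labels) if label != -1]
    counts = Counter(clustered)
    return {
        "noise_fraction": sum(1 for label in labels if label == -1) / len(labels),
        "n_clustered": len(clustered),
        # Counts every member of a duplicate group, not only the repeats
        "n_clustered_duplicates": sum(1 for text in clustered if counts[text] > 1),
    }

# Hypothetical labels: two clusters made only of repeated texts, the rest noise
toy_texts = ["review a", "review a", "review b", "review b", "review c", "review d"]
toy_labels = [0, 0, 1, 1, -1, -1]

print(summarize_dbscan(toy_texts, toy_labels))
```

In the notebook this could be called as `summarize_dbscan(df['text'], df['dbscan_cluster'])`. A high noise fraction together with clusters made up mostly of duplicates would suggest that the silhouette value reflects repeated reviews in the corpus more than topical structure.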

### Most suitable algorithm for this dataset 

Given these results, DBSCAN is the most suitable algorithm for this dataset. It achieved the highest Silhouette Score and handled noise and outliers explicitly, which suits datasets whose clusters have irregular shapes or contain noisy data. Unlike K-Means and Hierarchical Clustering as used here, DBSCAN does not require specifying the number of clusters in advance, which adds to its flexibility.