# KMeans Clustering Implementation

This notebook applies KMeans clustering to tweet data in order to group similar tweets. The main steps are:

## 1. Importing Necessary Libraries
- Libraries such as `numpy`, `pandas`, and `pickle` for data manipulation and file handling.
- `KMeans` from `sklearn.cluster` for clustering.
- Metrics from `sklearn.metrics` for evaluating clustering performance.
- `TfidfVectorizer` for text vectorization.
- `TruncatedSVD` for dimensionality reduction.

## 2. Loading TF-IDF Matrix
- The TF-IDF matrix from the previous file is loaded using `pickle`.

## 3. Dimensionality Reduction
- The dimensionality of the TF-IDF matrix is reduced using `TruncatedSVD` to improve the efficiency of clustering.

## 4. Applying KMeans Clustering
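- KMeans clustering is applied with a fixed number of clusters (`n_clusters=3` in the code below), which has to be chosen before fitting.
- Because KMeans is expensive on large amounts of data, one version of the code fits the algorithm in chunks of 1,000 rows, and another fits it once on the SVD-reduced matrix.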
- Cluster labels are generated for each tweet.

## 5. Saving Results
- The cluster labels are saved to `kmeans_labels.pkl` and the fitted KMeans model to `kmeans_model.pkl`.
- The SVD model used for dimensionality reduction is also saved (to `svd_model.pkl`) for future use.

## 6. Assigning and Saving Cluster Labels
- The cluster labels are added to the original dataframe and saved to `tweets_with_labels.csv`.

## 7. Evaluating Clustering
- Various evaluation metrics are calculated to assess clustering performance:
  - **Silhouette Score**: Measures how similar each tweet is to its own cluster compared to other clusters. It ranges from -1 to 1, and higher is better.
  - **Calinski-Harabasz Score**: The ratio of between-cluster dispersion to within-cluster dispersion. Higher is better.
  - **Davies-Bouldin Index**: The average similarity of each cluster to its most similar cluster. Lower is better, with 0 as the minimum.
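
To make the direction of each metric concrete, here is a minimal sketch on synthetic data. The blob centers and all variable names are assumptions chosen for illustration and are not part of the tweet pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic, well-separated 2-D blobs (illustrative assumption, not tweet data)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

demo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

sil = silhouette_score(X_demo, demo_labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X_demo, demo_labels)  # non-negative, higher is better
db = davies_bouldin_score(X_demo, demo_labels)     # non-negative, lower is better

print(f"Silhouette Score: {sil:.3f}")
print(f"Calinski-Harabasz Score: {ch:.3f}")
print(f"Davies-Bouldin Index: {db:.3f}")
```

On clean data like this the silhouette should be close to 1 and the Davies-Bouldin index close to 0, which gives a reference point for reading the scores reported further down.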

In [1]:
# Clustering implementation

# Importing Necessary Libraries
import numpy as np
import pandas as pd
import pickle
import scipy.sparse
import time
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, accuracy_score, classification_report, davies_bouldin_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [2]:
# Loading the TF-IDF matrix from the previous file
with open('tfidf_matrix.pkl', 'rb') as f:
    tfidf_matrix = pickle.load(f)

In [3]:
# Reduce dimensionality to lower the computational cost of clustering
svd = TruncatedSVD(n_components=100)
reduced_tfidf_matrix = svd.fit_transform(tfidf_matrix)
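
One way to judge whether a given `n_components` is reasonable is to look at how much variance the reduced representation retains. The sketch below uses a random sparse matrix as a hypothetical stand-in for the TF-IDF matrix, so the printed percentage says nothing about the real data.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the TF-IDF matrix: 200 documents x 500 terms
demo_matrix = sparse.random(200, 500, density=0.05, format="csr", random_state=0)

svd_demo = TruncatedSVD(n_components=50, random_state=0)
reduced_demo = svd_demo.fit_transform(demo_matrix)

# Fraction of the total variance kept by the 50 components
retained = svd_demo.explained_variance_ratio_.sum()

print(f"Reduced shape: {reduced_demo.shape}")
print(f"Variance retained: {retained:.1%}")
```

If the retained fraction is very low on the real matrix, increasing `n_components` may be worth the extra cost.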

In [4]:
# Applying KMeans clustering
kmeans = KMeans(n_clusters=3)  # Adjust the number of clusters as needed
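
The number of clusters is set by hand here. A common way to choose it is to fit several candidate values and compare a metric such as the silhouette score. The following is a sketch on synthetic data; the candidate range and the blob layout are assumptions, and on real tweets the curve is usually much flatter.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clear groups (illustrative assumption)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit one model per candidate k and record its silhouette score
scores = {}
for k in range(2, 7):
    candidate_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, candidate_labels)

best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```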

In [12]:
import pickle
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Assumes tfidf_matrix has already been loaded
# Reduce dimensionality to lower the computational cost of clustering
svd = TruncatedSVD(n_components=100)
reduced_tfidf_matrix = svd.fit_transform(tfidf_matrix)

# Apply KMeans clustering on the reduced data
kmeans = KMeans(n_clusters=3)  # Adjust the number of clusters as needed
kmeans.fit(reduced_tfidf_matrix)  # Fit on the entire reduced data

# Save the KMeans model
with open('kmeans_model.pkl', 'wb') as f:
    pickle.dump(kmeans, f)

# Save the SVD model
with open('svd_model.pkl', 'wb') as f:
    pickle.dump(svd, f)

# Generate and save KMeans labels
labels = kmeans.labels_
with open('kmeans_labels.pkl', 'wb') as f:
    pickle.dump(labels, f)

# Assigning cluster labels to the dataframe
merged_df = pd.read_csv('cleaned_tweets.csv')
merged_df['labels'] = labels
merged_df.to_csv('tweets_with_labels.csv', index=False)

# Evaluate clustering performance
print(f"Silhouette Score: {silhouette_score(reduced_tfidf_matrix, labels):.3f}")
print(f"Calinski-Harabasz Score: {calinski_harabasz_score(reduced_tfidf_matrix, labels):.3f}")
print(f"Davies-Bouldin Index: {davies_bouldin_score(reduced_tfidf_matrix, labels):.3f}")


Silhouette Score: 0.145
Calinski-Harabasz Score: 444.503
Davies-Bouldin Index: 2.848


In [5]:
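# Fitting the KMeans algorithm to the TF-IDF matrix in chunks.
# Caveat: every call to `fit` re-initializes the model, so each chunk is
# clustered independently. Label IDs are therefore not comparable across
# chunks, and `kmeans` ends up holding only the model fitted on the last chunk.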
chunk_size = 1000
labels = []

for i in range(0, tfidf_matrix.shape[0], chunk_size):
    chunk = tfidf_matrix[i:i + chunk_size]
    kmeans.fit(chunk.astype(np.float32).toarray())
    labels.extend(kmeans.labels_)
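
If the goal is a single model trained incrementally, scikit-learn's `MiniBatchKMeans` offers `partial_fit`, which updates the same centroids with every chunk. The sketch below is an alternative technique, not the method used in this notebook, and it runs on a random sparse matrix that stands in for the TF-IDF matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for the TF-IDF matrix: 2,500 documents x 300 terms
demo_matrix = sparse.random(2500, 300, density=0.02, format="csr", random_state=0)

chunk_size = 1000
mbk = MiniBatchKMeans(n_clusters=3, random_state=0)

# Pass 1: update one shared set of centroids with every chunk
for start in range(0, demo_matrix.shape[0], chunk_size):
    mbk.partial_fit(demo_matrix[start:start + chunk_size])

# Pass 2: label every row with the final centroids, so IDs are consistent
chunk_labels = []
for start in range(0, demo_matrix.shape[0], chunk_size):
    chunk_labels.extend(mbk.predict(demo_matrix[start:start + chunk_size]))
chunk_labels = np.asarray(chunk_labels)

print(f"Number of labels: {len(chunk_labels)}")
print(f"Cluster sizes: {np.bincount(chunk_labels, minlength=3)}")
```

Labelling in a second pass matters: labels produced while the centroids are still moving would not all refer to the same final clusters.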
    

In [6]:
# # Verifying the length of the KMeans labels
# print(f"Length of K-Means labels: {len(labels)}")

# Saving the KMeans model after fitting
with open('kmeans_model.pkl', 'wb') as f:
    pickle.dump(kmeans, f)

# Saving the KMeans clustering labels
with open('kmeans_labels.pkl', 'wb') as f:
    pickle.dump(labels, f)

# Saving the fitted SVD model
with open('svd_model.pkl', 'wb') as f:
    pickle.dump(svd, f)
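
Saved models are mainly useful for assigning clusters to new tweets later. The sketch below shows the round trip under stated assumptions: the corpus is made up, the pickling happens in memory rather than through the `.pkl` files, and the fitted `TfidfVectorizer` is assumed to be available too, since new text must be vectorized with the same vocabulary.

```python
import pickle

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus (illustrative assumption, not the tweet dataset)
corpus = [
    "the match ended in a draw",
    "great goal in the second half",
    "the team won the football match",
    "new phone released with better camera",
    "the laptop battery lasts all day",
    "software update improves phone camera",
]

# Fit the three stages: vectorize, reduce, cluster
vectorizer = TfidfVectorizer()
svd_demo = TruncatedSVD(n_components=2, random_state=0)
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0)
reduced = svd_demo.fit_transform(vectorizer.fit_transform(corpus))
kmeans_demo.fit(reduced)

# Round-trip all fitted objects through pickle, as if saved and reloaded
blob = pickle.dumps((vectorizer, svd_demo, kmeans_demo))
vec2, svd2, km2 = pickle.loads(blob)

# New text goes through the same stages, using transform/predict only
new_tweets = ["what a goal in that match", "my phone camera is broken"]
new_labels = km2.predict(svd2.transform(vec2.transform(new_tweets)))
print(new_labels)
```

Note that the new tweets are passed through `transform` and `predict`, never `fit`, so the reloaded models stay unchanged.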

In [7]:
# Assigning cluster labels to the dataframe
merged_df = pd.read_csv('cleaned_tweets.csv')
merged_df['labels'] = labels
merged_df.to_csv('tweets_with_labels.csv', index=False)
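
Assigning labels to the dataframe relies on the labels being in the same order and of the same length as the rows of the CSV. A small guard can make that assumption explicit. The helper `attach_labels` and the demo data below are hypothetical and only illustrate the check.

```python
import numpy as np
import pandas as pd


def attach_labels(df, labels, column="labels"):
    """Return a copy of df with cluster labels added, checking the length first."""
    if len(labels) != len(df):
        raise ValueError(
            f"Got {len(labels)} labels for {len(df)} rows; they must match."
        )
    out = df.copy()
    # Convert to a plain array so assignment is positional, not index-aligned
    out[column] = np.asarray(labels)
    return out


# Hypothetical demo data standing in for cleaned_tweets.csv and the labels
demo_df = pd.DataFrame({"text": ["tweet a", "tweet b", "tweet c"]})
demo_labels = [0, 2, 1]

labelled = attach_labels(demo_df, demo_labels)
print(labelled)
```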

In [8]:
# Calculating evaluation metrics on a subset to keep the computation cheap

# Caveat: the slice below keeps the first 500 rows AND only the first 500
# feature columns. `kmeans.labels_` holds the labels from the most recent
# `fit` call (the last chunk, if run after the chunked cell), so these labels
# do not necessarily belong to the first 500 rows.
tfidf_matrix_subset = tfidf_matrix[:500, :500].toarray()
labels_subset = labels[:500]
kmeans_subset = kmeans.labels_[:500]

# Silhouette score
silhouette = silhouette_score(tfidf_matrix_subset, kmeans_subset)
print(f'Silhouette Score: {silhouette:.3f}')

# Calinski-Harabasz score
calinski_harabasz = calinski_harabasz_score(tfidf_matrix_subset, kmeans_subset)
print(f'Calinski-Harabasz Score: {calinski_harabasz:.3f}')

# Davies-Bouldin Index
davies_bouldin = davies_bouldin_score(tfidf_matrix_subset, kmeans_subset)
print(f'Davies-Bouldin Index: {davies_bouldin:.3f}')


Silhouette Score: -0.911
Calinski-Harabasz Score: 0.421
Davies-Bouldin Index: 5.521


In [9]:
# Displaying the number of clusters for analysis
n_clusters = len(set(labels))
print(f'Number of clusters: {n_clusters}')

Number of clusters: 3


In [10]:
# Not meaningful: there are no ground-truth labels, so this only compares
# two sets of cluster IDs with each other.

accuracy = accuracy_score(labels_subset, kmeans_subset)
print(f'Accuracy: {accuracy:.3f}')
print(classification_report(labels_subset, kmeans_subset))


Accuracy: 0.714
              precision    recall  f1-score   support

           0       0.81      0.87      0.84       406
           1       0.00      0.00      0.00        64
           2       0.05      0.10      0.07        30

    accuracy                           0.71       500
   macro avg       0.29      0.32      0.30       500
weighted avg       0.66      0.71      0.68       500
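
Accuracy is also a poor fit for comparing two clusterings because cluster IDs are arbitrary: the same grouping can be numbered differently. A permutation-invariant measure such as the adjusted Rand index avoids this. The label lists below are made up to show the difference.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Two labelings of the same eight items: identical grouping, different IDs
labels_a = [0, 0, 0, 1, 1, 2, 2, 2]
labels_b = [2, 2, 2, 0, 0, 1, 1, 1]

acc = accuracy_score(labels_a, labels_b)
ari = adjusted_rand_score(labels_a, labels_b)

print(f"Accuracy: {acc:.3f}")
print(f"Adjusted Rand Index: {ari:.3f}")
```

Accuracy treats the two labelings as completely different, while the adjusted Rand index recognizes them as the same partition.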



In [11]:
# Save the clustered data
merged_df.to_csv('clustered_tweets.csv', index=False)