# KMeans Clustering Testing

## Overview

This notebook applies a pre-trained KMeans clustering model to new tweet data. It covers text preprocessing, feature extraction, dimensionality reduction, cluster prediction, and saving the results.

## Steps

1. **Data Preparation**: Define a set of new tweets for clustering.
2. **Text Preprocessing**: Clean and tokenize the tweet texts, removing special characters and stopwords.
3. **Feature Extraction**: Transform the preprocessed tweets using a pre-trained TF-IDF vectorizer.
4. **Dimensionality Reduction**: Apply a pre-trained TruncatedSVD model to reduce the feature dimensions. The reduced features are only passed to KMeans when the loaded model was trained on SVD output.
5. **Clustering**: Use a pre-trained KMeans model to predict cluster labels for the new tweets.
6. **Results**: Collect the tweets and their cluster labels in a DataFrame, then optionally save the labels to a pickle file and export the table to a CSV file.
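
The steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual models: the toy corpus `train_docs` and the freshly fitted `vectorizer`, `svd`, and `kmeans` are stand-ins (assumptions) for the objects the notebook loads from pickle files. The point it shows is that new data only goes through `transform` and `predict`, never `fit`.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus; the real notebook loads fitted models instead.
train_docs = [
    "great movie loved the acting",
    "terrible movie boring plot",
    "new phone battery lasts long",
    "phone screen cracked disappointed",
    "gym workout felt great",
    "morning run and gym session",
]

# Fit the three stages once on the training data.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

svd = TruncatedSVD(n_components=3, random_state=0)
X_train_reduced = svd.fit_transform(X_train)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X_train_reduced)

# Inference on unseen text: reuse the fitted stages, do not refit.
new_docs = ["loved the new phone", "great gym session"]
X_new = vectorizer.transform(new_docs)
X_new_reduced = svd.transform(X_new)
labels = kmeans.predict(X_new_reduced)

print(X_new_reduced.shape)  # one row per new document, one column per SVD component
print(labels)               # one cluster label per new document
```

Words in the new documents that are missing from the fitted vocabulary are silently ignored by `vectorizer.transform`, so very short or off-topic tweets can end up with near-empty feature vectors.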

## Purpose

The notebook evaluates how new data fits into an existing clustering by reusing the saved vectorizer, SVD, and KMeans models instead of refitting them. The predicted labels show how the new data points are distributed across the predefined clusters.
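
One way to look at how well new points fit the existing clusters is to combine the predicted labels with the distance to the nearest centroid. The sketch below uses synthetic 2-D data and a freshly fitted model as stand-ins (assumptions), since the notebook's pickled model is not available here. `KMeans.transform` returns the distance from each sample to every cluster center.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic training data: two well-separated blobs (illustrative only).
train = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

# Two points close to a blob and one point lying between the blobs.
new_points = np.array([[0.05, -0.02], [5.1, 4.9], [2.0, 2.0]])

labels = kmeans.predict(new_points)
distances = kmeans.transform(new_points)  # shape: (n_samples, n_clusters)
nearest = distances.min(axis=1)           # distance to the assigned centroid
counts = np.bincount(labels, minlength=kmeans.n_clusters)

print("labels:", labels)
print("distance to nearest centroid:", nearest.round(2))
print("points per cluster:", counts)
```

A new point with a much larger nearest-centroid distance than the others, like the third one here, still receives a label but sits far from every existing cluster.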

In [10]:
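import nltk
nltk.download('punkt_tab')  # Tokenizer data used by word_tokenize
nltk.download('stopwords')  # Stopword lists used in clean_text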

import pickle
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Honours\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [23]:
# Loading the saved models
with open('kmeans_model.pkl', 'rb') as f:
    kmeans = pickle.load(f)
with open('svd_model.pkl', 'rb') as f:
    svd = pickle.load(f)
with open('vectorizer.pkl', 'rb') as f:
    tfidf_vectorizer = pickle.load(f)
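
Because a saved KMeans model may have been trained either on raw TF-IDF features or on SVD-reduced features, it can help to check which input width it expects before predicting. The sketch below is one possible check, not part of the original notebook: `select_kmeans_input` is a hypothetical helper, and the models are fitted on a toy corpus as stand-ins for the pickled ones. It compares the width of `cluster_centers_` with the shape of `svd.components_`.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def select_kmeans_input(kmeans, svd, X_tfidf):
    """Return the feature matrix whose width matches what `kmeans` was fitted on."""
    expected = kmeans.cluster_centers_.shape[1]
    n_components, n_terms = svd.components_.shape
    if expected == n_components:
        return svd.transform(X_tfidf)  # model was trained on SVD output
    if expected == n_terms:
        return X_tfidf                 # model was trained on raw TF-IDF
    raise ValueError(
        f"KMeans expects {expected} features, but SVD gives {n_components} "
        f"and TF-IDF gives {n_terms}"
    )


# Stand-in models fitted on a toy corpus (assumption for illustration).
docs = [
    "great movie loved the acting",
    "terrible movie boring plot",
    "new phone battery lasts long",
    "phone screen cracked disappointed",
    "gym workout felt great",
    "morning run and gym session",
]
vectorizer = TfidfVectorizer().fit(docs)
X = vectorizer.transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
kmeans_on_svd = KMeans(n_clusters=2, n_init=10, random_state=0).fit(svd.transform(X))
kmeans_on_tfidf = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

X_new = vectorizer.transform(["great workout at the gym"])
for name, model in [("svd", kmeans_on_svd), ("tfidf", kmeans_on_tfidf)]:
    features = select_kmeans_input(model, svd, X_new)
    print(name, features.shape, model.predict(features))
```

The check is ambiguous in the unlikely case that the number of SVD components equals the vocabulary size, so it is a convenience rather than a guarantee.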

In [24]:
# Function for text cleaning, tokenization, and normalization.
# Keep this in sync with the cleaning applied when the vectorizer was trained.
def clean_text(text):
    text = re.sub(r'@user', '', text)  # Remove @user mentions first, while '@' is still present
    text = re.sub(r'[^A-Za-z\s]', '', text)  # Remove special characters and punctuation
    tokens = word_tokenize(text)  # Tokenize text
    stop_words = set(stopwords.words('english'))
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words]  # Lowercase and remove stopwords
    return ' '.join(tokens)
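
One detail worth checking in a cleaner of this kind is the order of the two regular-expression substitutions. If punctuation is stripped before mentions are removed, the `@` is already gone and the `@user` pattern can no longer match. The two helper functions below are hypothetical and exist only to show the difference, using the standard library alone.

```python
import re


def strip_then_remove(text):
    # Punctuation is stripped first, so '@' disappears before the mention pattern runs.
    text = re.sub(r'[^A-Za-z\s]', '', text)
    return re.sub(r'@user', '', text)


def remove_then_strip(text):
    # The mention is removed while '@' is still present, then punctuation is stripped.
    text = re.sub(r'@user', '', text)
    return re.sub(r'[^A-Za-z\s]', '', text)


sample = "@user thanks for the tip!"
print(repr(strip_then_remove(sample)))  # 'user thanks for the tip'
print(repr(remove_then_strip(sample)))  # ' thanks for the tip'
```

With the first ordering, the word `user` survives as an ordinary token and is passed on to the vectorizer.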


In [21]:
# Load and preprocess new tweets
new_tweets = [
    "Just saw the most amazing movie! Highly recommend it to everyone.",
    "I can't believe the weather today—it's raining cats and dogs!",
    "Feeling so grateful for my friends and family. Life is good.",
    "This new phone I bought is such a disappointment. Wish I had researched more.",
    "Had a fantastic workout session at the gym today. Feeling energized!"
]
preprocessed_tweets = [clean_text(tweet) for tweet in new_tweets]

In [25]:
# Transform new tweets using the saved TF-IDF vectorizer
new_tweets_tfidf = tfidf_vectorizer.transform(preprocessed_tweets)

# Apply the SVD transformation to the new data
new_tweets_reduced = svd.transform(new_tweets_tfidf)

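# Option 1 (MODEL1 and MODEL): predict on the SVD-reduced features.
# Use this when the loaded KMeans model was trained on SVD output.
# kmeans_labels = kmeans.predict(new_tweets_reduced)

# Option 2 (MODEL2): predict on the full TF-IDF features, without SVD.
# The sparse matrix is converted to a dense float32 array before prediction.
kmeans_labels = kmeans.predict(new_tweets_tfidf.astype(np.float32).toarray())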


# Create a DataFrame to store results
df_results = pd.DataFrame({
    'Tweet': new_tweets,
    'KMeans_Label': kmeans_labels
})

# Display results
df_results

Unnamed: 0,Tweet,KMeans_Label
0,Just saw the most amazing movie! Highly recomm...,0
1,I can't believe the weather today—it's raining...,0
2,Feeling so grateful for my friends and family....,1
3,This new phone I bought is such a disappointme...,0
4,Had a fantastic workout session at the gym tod...,0


In [18]:
# # Save the new KMeans cluster labels for further use
# with open('new_kmeans_labels.pkl', 'wb') as f:
#     pickle.dump(kmeans_labels, f)

# # Optionally, save results to a CSV file
# df_results.to_csv('tweets_with_kmeans_labels.csv', index=False)

