# TESTING 

This notebook demonstrates how to cluster a set of new tweets using previously saved models. The primary focus is on refining the labels generated by an Affinity Propagation model into 3 distinct clusters using Agglomerative Clustering.

## 1. Importing Libraries
- `re`, `pickle`, `numpy`, and `pandas` are used for data processing; `nltk` supplies the English stopword list and the tokenizer.
- The clustering models (`AffinityPropagation` and `AgglomerativeClustering`) are imported from `sklearn.cluster`.

## 2. New Tweets Data
- A small set of 5 new tweets is manually provided to simulate new data for clustering.

## 3. Text Preprocessing
- A `clean_text` function is defined to remove special characters, mentions (`@user`), and stopwords from the tweets.
- The new tweets are preprocessed using this function to prepare them for vectorization.
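
The cleaning steps can be sketched without NLTK as follows. This is a minimal illustration, not the notebook's function: `clean_text_sketch` and the abbreviated `STOP_WORDS` set are hypothetical, and plain whitespace splitting stands in for `word_tokenize`. It removes mentions while the `@` sign is still present, so that ordinary words containing "user" are left alone.

```python
import re

# Hypothetical, abbreviated stopword list; the notebook uses NLTK's English list.
STOP_WORDS = {"i", "the", "is", "a", "to", "my", "for", "and", "so"}

def clean_text_sketch(text):
    text = re.sub(r"@\w+", " ", text)         # drop mentions before '@' is stripped
    text = re.sub(r"[^A-Za-z\s]", "", text)   # drop digits and punctuation
    tokens = text.lower().split()             # whitespace tokenization (assumption)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_text_sketch("@user The new phone is a disappointment!"))
# prints: new phone disappointment
```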

## 4. Loading Pre-Trained Models
- The saved TF-IDF vectorizer (`vectorizer.pkl`) and the SVD model for dimensionality reduction (`svd_model.pkl`) are loaded.
- The new tweets are transformed into a numerical format using the loaded TF-IDF vectorizer and reduced in dimensionality using the SVD model.
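
The fit-once, transform-later pattern behind this step can be sketched as below. The training texts, the `n_components=2` setting, and the in-memory pickle round trip are assumptions made for illustration; the notebook itself loads `vectorizer.pkl` and `svd_model.pkl` from disk, and their training settings are not shown here.

```python
import pickle
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus standing in for the original tweets.
train_texts = [
    "great movie amazing cast",
    "terrible weather cold rain",
    "grateful friends family",
    "phone disappointment battery",
    "gym workout energized",
    "movie night friends",
]

# Fit both models on the training data only.
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(train_texts)
svd = TruncatedSVD(n_components=2, random_state=0).fit(train_tfidf)

# Simulate saving and loading (the notebook reads .pkl files instead).
vectorizer_loaded = pickle.loads(pickle.dumps(vectorizer))
svd_loaded = pickle.loads(pickle.dumps(svd))

# New data is only transformed, never refit, so it lands in the same feature space.
new_texts = ["amazing movie today", "new phone weather"]
new_tfidf = vectorizer_loaded.transform(new_texts)
new_reduced = svd_loaded.transform(new_tfidf)
print(new_tfidf.shape, new_reduced.shape)
```

Words that never appeared in the training corpus (here "today" and "new") are silently ignored by `transform`, which is why the new tweets must go through the same cleaning function as the training data.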

## 5. Applying Clustering
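- **Affinity Propagation**: The saved Affinity Propagation model is loaded and refit on the transformed data with `fit_predict()`. scikit-learn's `AffinityPropagation` does offer a `predict()` method that assigns new samples to the nearest saved exemplar, but this notebook refits instead, so the resulting labels describe only the new tweets and are not comparable to the training cluster ids.
- **Agglomerative Clustering**: `AgglomerativeClustering` has no `predict()` method, so a new instance with `n_clusters=3` is fit on the SVD-reduced tweets to obtain 3 distinct clusters. In the code cells of this notebook it is fit on the reduced vectors directly rather than on the Affinity Propagation labels.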
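
One way to refine first-stage labels into a fixed number of clusters is to run Agglomerative Clustering on the first-stage exemplars and then map every sample through its exemplar. The sketch below is an assumption about how such a refinement could look, not the notebook's code: the `exemplars` array and `first_stage_labels` are hand-written stand-ins for what a fitted Affinity Propagation model would expose as `cluster_centers_` and `labels_`.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering

# Which of the two estimators can label unseen samples without refitting?
print(hasattr(AffinityPropagation, "predict"))
print(hasattr(AgglomerativeClustering, "predict"))

# Hypothetical first-stage result: 5 exemplars and 7 samples assigned to them.
exemplars = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
first_stage_labels = np.array([0, 1, 2, 3, 4, 0, 2])

# Merge the 5 exemplars into 3 groups.
exemplar_groups = AgglomerativeClustering(n_clusters=3).fit_predict(exemplars)

# Each sample inherits the group of its exemplar.
refined_labels = exemplar_groups[first_stage_labels]
print(refined_labels)
```

The numeric ids of the 3 groups are arbitrary; only which samples share an id is meaningful.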


### Key Takeaways:
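- This notebook shows how Affinity Propagation, which chooses its own number of clusters, can be paired with Agglomerative Clustering to obtain a fixed number of clusters (3 here) on new text data.
- Both clustering models are refit on the new tweets, so the cluster ids are meaningful only within this batch and do not correspond to the ids assigned during training.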


In [1]:
# AP_Testing.ipynb
import re
import pickle
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
# Loading the saved artifacts: TF-IDF vectorizer, Affinity Propagation model,
# Agglomerative Clustering output, and SVD model (TruncatedSVD for dimensionality reduction)

with open('vectorizer.pkl', 'rb') as f:
    tfidf_vectorizer = pickle.load(f)

with open('affinity_propagation_model.pkl', 'rb') as f:
    affinity_propagation = pickle.load(f)

# Judging by the filename, this file may hold training labels rather than a fitted model;
# the variable is reassigned to a fresh AgglomerativeClustering instance before it is used.
with open('agglomerative_clustering_labels.pkl', 'rb') as f:
    agglomerative_clustering = pickle.load(f)

with open('svd_model.pkl', 'rb') as f:
    svd = pickle.load(f)

In [3]:
# Loading new testing tweets
new_tweets = [
    "Just saw the most amazing movie! Highly recommend it to everyone.",
    "I can't believe the weather today",
    "Feeling so grateful for my friends and family. Life is good.",
    "This new phone I bought is such a disappointment. Wish I had researched more.",
    "Had a fantastic workout session at the gym today. Feeling energized!"
]

In [4]:
# Function for text cleaning, tokenization, and normalization (same steps as the data cleaning one)
STOP_WORDS = set(stopwords.words('english'))  # Build the stopword set once, not on every call

def clean_text(text):
    text = re.sub(r'@user\b', '', text)  # Remove @user mentions first, while '@' is still present
    text = re.sub(r'[^A-Za-z\s]', '', text)  # Remove special characters, digits, and punctuation
    tokens = word_tokenize(text)  # Tokenize text
    tokens = [word.lower() for word in tokens if word.lower() not in STOP_WORDS]  # Lowercase and remove stopwords
    return ' '.join(tokens)

In [5]:
# Preprocessing the new tweets
preprocessed_tweets = [clean_text(tweet) for tweet in new_tweets]

In [6]:
# Transforming the new tweets using the loaded vectorizer
new_tweets_tfidf = tfidf_vectorizer.transform(preprocessed_tweets)

# Applying the SVD transformation to the new data (reduce dimensionality)
new_tweets_reduced = svd.transform(new_tweets_tfidf)

In [7]:
# Initializing Agglomerative Clustering with the same parameters used during training
agglomerative_clustering = AgglomerativeClustering(n_clusters=3)

# Refitting Affinity Propagation on the new data; fit_predict() discards the saved exemplars,
# so these labels apply to the new tweets only
ap_labels = affinity_propagation.fit_predict(new_tweets_reduced)  # Affinity Propagation labels

# AgglomerativeClustering has no predict(), so it is also fit on the new data
agg_labels = agglomerative_clustering.fit_predict(new_tweets_reduced)


In [8]:
# Creating a DataFrame to store results
df_results = pd.DataFrame({
    'Tweet': new_tweets,
    'Agglomerative_Clustering_Label': agg_labels
})

In [9]:
# Displaying results for the cluster analysis
df_results

| | Tweet | Agglomerative_Clustering_Label |
|---|---|---|
| 0 | Just saw the most amazing movie! Highly recomm... | 0 |
| 1 | I can't believe the weather today | 1 |
| 2 | Feeling so grateful for my friends and family.... | 2 |
| 3 | This new phone I bought is such a disappointme... | 0 |
| 4 | Had a fantastic workout session at the gym tod... | 2 |


In [10]:
# # Save the new cluster labels for further use if needed
# with open('new_affinity_propagation_labels.pkl', 'wb') as f:
#     pickle.dump(ap_labels, f)
# with open('new_agglomerative_clustering_labels.pkl', 'wb') as f:
#     pickle.dump(agg_labels, f)
