# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
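As a rough illustration of the 80/20 split followed by 10-fold cross-validation, here is a minimal sketch. It uses a made-up toy corpus in place of the IMDB files. The variable names and the choice of a TF-IDF + `MultinomialNB` pipeline are assumptions for illustration only, not a required solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real training file.
positive = [f"a great and moving film number {i}" for i in range(25)]
negative = [f"a dull and boring film number {i}" for i in range(25)]
texts = positive + negative
labels = [1] * 25 + [0] * 25

# 80% training / 20% validation, stratified so both classes appear in each part.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Keeping the vectorizer inside the pipeline means TF-IDF is refit on the
# training portion of every fold, so no fold sees vocabulary from its test part.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training split only.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Refit on the whole training split and check the held-out validation split.
clf.fit(X_train, y_train)
val_accuracy = clf.score(X_val, y_val)
print("Validation accuracy:", val_accuracy)
```

The same pattern applies to the other classifiers: swap `MultinomialNB()` for another estimator and keep the split and the `KFold` object unchanged so the results stay comparable.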


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
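The four measurements listed above can be computed with `sklearn.metrics`. The sketch below uses hand-written labels (hypothetical values, not results from the dataset) so the numbers can be checked by hand: 3 true positives, 1 false positive, 1 false negative, and 3 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground-truth labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

By default these functions score the positive class (label 1). Passing `average='macro'` instead averages the per-class scores, which treats both sentiment classes equally.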


In [None]:
# Write your code here
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import xgboost as xgb
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset, random_split

# Ensure nltk resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Load the data from local .txt files
train_file_path = "/content/stsa-train.txt"
test_file_path = "/content/stsa-test.txt"

train_df = pd.read_csv(train_file_path, header=None, sep='\t', names=['data'])
test_df = pd.read_csv(test_file_path, header=None, sep='\t', names=['data'])

print("First few rows of train_df:")
print(train_df.head())

print("First few rows of test_df:")
print(test_df.head())

# Separating label and text
train_df['label'] = train_df['data'].apply(lambda x: int(x[0]))
train_df['text'] = train_df['data'].apply(lambda x: x[2:])
test_df['label'] = test_df['data'].apply(lambda x: int(x[0]))
test_df['text'] = test_df['data'].apply(lambda x: x[2:])

# Drop the original combined column
train_df.drop('data', axis=1, inplace=True)
test_df.drop('data', axis=1, inplace=True)

# Preprocessing
stop_words = set(stopwords.words('english'))
def preprocess(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [w for w in tokens if w not in stop_words]
    return " ".join(filtered_tokens)

train_df['text'] = train_df['text'].apply(preprocess)
test_df['text'] = test_df['text'].apply(preprocess)

# Vectorization with TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_df['text'])
X_test_tfidf = vectorizer.transform(test_df['text'])

# Train a Word2Vec model
tokenized_train = [word_tokenize(text) for text in train_df['text']]
tokenized_test = [word_tokenize(text) for text in test_df['text']]
w2v_model = Word2Vec(sentences=tokenized_train, vector_size=100, window=5, min_count=2, workers=4)

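def feature_vector(tokens, model):
    # Average the vectors of the tokens that exist in the Word2Vec vocabulary
    word_vecs = [model.wv[word] for word in tokens if word in model.wv.key_to_index]
    if word_vecs:
        return np.mean(word_vecs, axis=0)
    # No known tokens: fall back to a zero vector of the model's dimensionality
    return np.zeros(model.wv.vector_size)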

X_train_w2v = np.array([feature_vector(text, w2v_model) for text in tokenized_train])
X_test_w2v = np.array([feature_vector(text, w2v_model) for text in tokenized_test])

# BERT tokenizer and model loading
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

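def bert_encode(texts, tokenizer, model, max_len=128, batch_size=32):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # the model must live on the same device as its inputs
    model.eval()
    embeddings = []
    # Encode in mini-batches so the whole corpus never has to fit in memory at once
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        encoded_batch = tokenizer.batch_encode_plus(batch, add_special_tokens=True, max_length=max_len, padding=True, truncation=True, return_tensors="pt")
        input_ids, attention_mask = encoded_batch['input_ids'].to(device), encoded_batch['attention_mask'].to(device)
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
        # Take the embedding of the first token ([CLS]) as the sentence representation
        embeddings.append(outputs.last_hidden_state[:, 0, :].cpu())
    return torch.cat(embeddings).numpy()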

X_train_bert = bert_encode(train_df['text'].tolist(), tokenizer, bert_model)
X_test_bert = bert_encode(test_df['text'].tolist(), tokenizer, bert_model)

y_train = train_df['label']
y_test = test_df['label']

# Model setup
models = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(3),
    'DecisionTree': DecisionTreeClassifier(),
    'XGBoost': xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    'RandomForest_TFIDF': RandomForestClassifier(),
    'RandomForest_Word2Vec': RandomForestClassifier(),
    'RandomForest_BERT': RandomForestClassifier()  # This will use BERT embeddings
}

# Evaluate models using 10-fold cross-validation and on the test set
def select_features(name):
    # Return the (train, test) feature matrices that match the model's name
    if 'Word2Vec' in name:
        return X_train_w2v, X_test_w2v
    if 'BERT' in name:
        return X_train_bert, X_test_bert
    return X_train_tfidf, X_test_tfidf  # TF-IDF is the default representation

kf = KFold(n_splits=10, random_state=42, shuffle=True)
for name, model in models.items():
    X_tr, X_te = select_features(name)

    # cross_val_score reports accuracy by default for classifiers
    cv_scores = cross_val_score(model, X_tr, y_train, cv=kf)
    print(f"{name} - 10-fold CV Accuracy: {cv_scores.mean()}")

    # Refit on the full training set, then evaluate on the held-out test set
    model.fit(X_tr, y_train)
    predictions = model.predict(X_te)

    print(f"{name} - Test Accuracy: {accuracy_score(y_test, predictions)}")
    print(f"{name} - Precision: {precision_score(y_test, predictions, average='macro')}")
    print(f"{name} - Recall: {recall_score(y_test, predictions, average='macro')}")
    print(f"{name} - F1 Score: {f1_score(y_test, predictions, average='macro')}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


First few rows of train_df:
                                                data
0  1 a stirring , funny and finally transporting ...
1  0 apparently reassembled from the cutting-room...
2  0 they presume their audience wo n't sit still...
3  1 this is a visually stunning rumination on lo...
4  1 jonathan parker 's bartleby should have been...
First few rows of test_df:
                                                data
0   0 no movement , no yuks , not much of anything .
1  0 a gob of drivel so sickly sweet , even the e...
2  0 gangs of new york is an unapologetic mess , ...
3  0 we never really feel involved with the story...
4          1 this is one of polanski 's best films .




## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You may also use different text data of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
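To make the first three methods concrete, here is a minimal sketch that runs K-means, DBSCAN, and hierarchical (agglomerative) clustering on TF-IDF vectors. The six-review corpus is invented for illustration. The parameter values (`n_clusters=2`, `eps=0.9`, `min_samples=2`) are assumptions chosen for this toy data and would need tuning on the real dataset.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: three reviews about the battery, three about the screen.
docs = [
    "battery life is great and the battery charges fast",
    "the battery drains quickly and battery life is poor",
    "battery lasts all day, great battery life",
    "the screen is bright and the screen colors are sharp",
    "screen cracked easily and the screen flickers",
    "beautiful screen with sharp screen resolution",
]

# Dense TF-IDF matrix; fine for a toy corpus, but memory-hungry on large data.
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# K-means: the number of clusters is fixed in advance.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: density-based, no cluster count; points labelled -1 are treated as noise.
dbscan_labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(X)

# Agglomerative clustering: merges the closest groups until n_clusters remain.
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print("K-means:     ", kmeans_labels)
print("DBSCAN:      ", dbscan_labels)
print("Hierarchical:", hier_labels)
```

Word2Vec and BERT are embedding methods rather than clustering algorithms, so in practice they replace the TF-IDF step: each review is turned into a dense vector, and one of the algorithms above is then applied to those vectors.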

In [None]:
# Write your code here
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch

# Load the dataset
df = pd.read_csv("Amazon_Unlocked_Mobile.csv")
print("Dataset loaded. Shape:", df.shape)
print(df.head())

# Assuming the text data for clustering is in the 'Reviews' column
text_data = df['Reviews'].astype(str)

# TF-IDF Vectorization for K-means clustering
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(text_data)

# K-means clustering
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)  # fixed seed so the labels are reproducible
kmeans.fit(tfidf_matrix)
kmeans_labels = kmeans.labels_

# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(tfidf_matrix)
dbscan_labels = dbscan.labels_

# Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical.fit(tfidf_matrix.toarray())
hierarchical_labels = hierarchical.labels_

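# Word2Vec clustering: represent each review by the mean of its word vectors,
# so that reviews (not individual vocabulary words) are what gets clustered
import numpy as np

tokenized_text = [review.split() for review in text_data]
w2v_model = Word2Vec(sentences=tokenized_text, vector_size=100, window=5, min_count=1, workers=4)
# min_count=1 keeps every token in the vocabulary; empty reviews get a zero vector
w2v_vectors = np.array([
    w2v_model.wv[tokens].mean(axis=0) if tokens else np.zeros(w2v_model.wv.vector_size)
    for tokens in tokenized_text
])
kmeans_w2v = KMeans(n_clusters=5)
kmeans_w2v.fit(w2v_vectors)
w2v_labels = kmeans_w2v.labels_  # one label per review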

# BERT clustering
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

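def bert_encode(texts, tokenizer, model, batch_size=32):
    model.to(device)  # the model must live on the same device as its inputs
    model.eval()
    embeddings = []
    # Encode in mini-batches; a single batch with every review would exhaust memory
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        encoded_batch = tokenizer.batch_encode_plus(batch, add_special_tokens=True, max_length=128, padding=True, truncation=True, return_tensors="pt")
        input_ids, attention_mask = encoded_batch['input_ids'].to(device), encoded_batch['attention_mask'].to(device)
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
        # Keep the [CLS] token embedding as the review representation
        embeddings.append(outputs.last_hidden_state[:, 0, :].cpu())
    return torch.cat(embeddings).numpy()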

bert_vectors = bert_encode(text_data.tolist(), tokenizer, bert_model)
kmeans_bert = KMeans(n_clusters=5)
kmeans_bert.fit(bert_vectors)
bert_labels = kmeans_bert.labels_

# Print clustering results
print("\nK-means labels:", kmeans_labels)
print("DBSCAN labels:", dbscan_labels)
print("Hierarchical labels:", hierarchical_labels)
print("Word2Vec labels:", w2v_labels)
print("BERT labels:", bert_labels)


Dataset loaded. Shape: (14509, 6)
                                        Product Name Brand Name   Price  \
0  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
1  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
2  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
3  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
4  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   

   Rating                                            Reviews  Review Votes  
0     5.0  I feel so LUCKY to have found this used (phone...           1.0  
1     4.0  nice phone, nice up grade from my pantach revu...           0.0  
2     5.0                                       Very pleased           0.0  
3     4.0  It works good but it goes slow sometimes but i...           0.0  
4     4.0  Great phone to replace my lost phone. The only...           0.0  




**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
The exercises provide a practical hands-on experience with various text clustering methods, covering a diverse range from traditional techniques like K-means and DBSCAN to advanced approaches like Word2Vec and BERT embeddings. The code is well-structured and easy to follow, making it accessible for beginners in text mining. It's a great starting point for understanding and experimenting with text clustering techniques on real-world datasets.



'''