<a href="https://colab.research.google.com/github/DasireddyMeghana/Meghana_INFO5731_Spring2024/blob/main/In_class_exercise/Dasireddy_Meghana_Exercise_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as their performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
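
The split-then-cross-validate procedure described above can be sketched as follows. This is a minimal illustration on a made-up toy corpus, not a reference solution: the `texts` and `labels` values are placeholders, and the use of `MultinomialNB`, stratified folds, and a pipeline that refits TF-IDF inside every fold are assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); replace with the real reviews and labels.
texts = ["a great and moving film", "dull and lifeless plot"] * 30
labels = [1, 0] * 30

# Hold out 20% for validation, keeping the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Keeping the vectorizer inside the pipeline means each CV fold fits TF-IDF
# only on its own training portion, so held-out text does not shape the features.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="accuracy")

# Refit on the full 80% and score once on the 20% validation split.
pipeline.fit(X_train, y_train)
val_accuracy = pipeline.score(X_val, y_val)
print(f"mean CV accuracy: {cv_scores.mean():.4f}, validation accuracy: {val_accuracy:.4f}")
```

The test file would be scored once at the very end with the same fitted pipeline, after all model choices have been made.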


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
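
For reference, the four measurements listed above can all be derived from the counts of true/false positives and negatives. The sketch below assumes binary labels with 1 as the positive class; `binary_metrics` is a hypothetical helper written for illustration, and in practice the `sklearn.metrics` functions would normally be used instead.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}, with 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 1])
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, a model with very low recall gets a low F1 even when its precision looks reasonable.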


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [19]:
# Write your code here
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from accelerate import Accelerator
import torch

In [20]:
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = [line.strip().split(' ', 1) for line in file.readlines()]
    labels = [int(line[0]) for line in data]
    reviews = [line[1] for line in data]
    return labels, reviews

train_labels, train_reviews = load_data('/content/drive/MyDrive/5731/stsa-train.txt')
test_labels, test_reviews = load_data('/content/drive/MyDrive/5731/stsa-test.txt')

# Clean and preprocess the text
def clean_text(text):
    return re.sub(r'\W+', ' ', text.lower())

train_reviews = [clean_text(review) for review in train_reviews]
test_reviews = [clean_text(review) for review in test_reviews]

# Split the training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_reviews, train_labels, test_size=0.2, random_state=42)


In [21]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)
X_test_tfidf = vectorizer.transform(test_reviews)


In [22]:
# Define a function to perform cross-validation and evaluation
def evaluate_model(model, X_train, y_train, X_val, y_val):
    # Perform cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=10)
    print(f"Average CV Score: {np.mean(cv_scores):.4f}")

    # Final training and evaluation on validation set
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    accuracy = accuracy_score(y_val, predictions)
    recall = recall_score(y_val, predictions)
    precision = precision_score(y_val, predictions)
    f1 = f1_score(y_val, predictions)
    print(f"Accuracy: {accuracy:.4f}, Recall: {recall:.4f}, Precision: {precision:.4f}, F1 Score: {f1:.4f}\n")

In [23]:
# Multinomial Naive Bayes
print("Multinomial Naive Bayes")
nb_model = MultinomialNB()
evaluate_model(nb_model, X_train_tfidf, y_train, X_val_tfidf, y_val)

# Support Vector Machine
print("Support Vector Machine")
svm_model = SVC(kernel='linear')
evaluate_model(svm_model, X_train_tfidf, y_train, X_val_tfidf, y_val)

# Create a KNN classifier
print("KNN")
knn_model = KNeighborsClassifier(n_neighbors=5)
# Evaluate the model using the previously defined evaluate_model function
evaluate_model(knn_model, X_train_tfidf, y_train, X_val_tfidf, y_val)

# Decision Tree
print("Decision Tree")
dt_model = DecisionTreeClassifier()
evaluate_model(dt_model, X_train_tfidf, y_train, X_val_tfidf, y_val)

# Random Forest
print("Random Forest")
rf_model = RandomForestClassifier()
evaluate_model(rf_model, X_train_tfidf, y_train, X_val_tfidf, y_val)

# XGBoost
print("XGBoost")
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
evaluate_model(xgb_model, X_train_tfidf, y_train, X_val_tfidf, y_val)


Multinomial Naive Bayes
Average CV Score: 0.7766
Accuracy: 0.7731, Recall: 0.8612, Precision: 0.7407, F1 Score: 0.7964

Support Vector Machine
Average CV Score: 0.7542
Accuracy: 0.7803, Recall: 0.8149, Precision: 0.7716, F1 Score: 0.7926

KNN
Average CV Score: 0.4935
Accuracy: 0.5014, Recall: 0.0603, Precision: 0.6825, F1 Score: 0.1108

Decision Tree
Average CV Score: 0.6479
Accuracy: 0.6647, Recall: 0.6816, Precision: 0.6722, F1 Score: 0.6769

Random Forest
Average CV Score: 0.7180
Accuracy: 0.7290, Recall: 0.7700, Precision: 0.7224, F1 Score: 0.7454

XGBoost
Average CV Score: 0.6729
Accuracy: 0.7088, Recall: 0.8527, Precision: 0.6711, F1 Score: 0.7511



In [24]:
# Tokenize reviews for Word2Vec
tokenized_train = [review.split() for review in train_reviews]
tokenized_test = [review.split() for review in test_reviews]

# Train Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_train, vector_size=100, window=5, min_count=1, workers=4)

# Function to vectorize data using Word2Vec model
def vectorize_with_w2v(model, reviews):
    vectors = [np.mean([model.wv[word] for word in review if word in model.wv] or [np.zeros(model.vector_size)], axis=0) for review in reviews]
    return np.array(vectors)

X_train_w2v = vectorize_with_w2v(w2v_model, tokenized_train)
X_test_w2v = vectorize_with_w2v(w2v_model, tokenized_test)

# Using Logistic Regression as classifier
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(max_iter=1000)
print("Word2Vec")
evaluate_model(lr_model, X_train_w2v, train_labels, X_test_w2v, test_labels)


Word2Vec
Average CV Score: 0.5665
Accuracy: 0.5832, Recall: 0.8262, Precision: 0.5555, F1 Score: 0.6643
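
The Word2Vec cell above represents each review as the mean of its word vectors before handing it to the classifier. The sketch below shows that pooling step in isolation, assuming a plain dictionary of made-up two-dimensional embeddings; `mean_pool` and `toy_embeddings` are hypothetical names used only for illustration.

```python
def mean_pool(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]


# Made-up 2-d embeddings, purely for illustration.
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "bad": [-1.0, 0.0]}

pooled = mean_pool("good movie".split(), toy_embeddings, dim=2)
fallback = mean_pool("unseen words".split(), toy_embeddings, dim=2)
print(pooled)    # [0.5, 0.5]
print(fallback)  # [0.0, 0.0]
```

Averaging discards word order and gives every word equal weight, which may be one reason a mean-pooled representation trained on a small corpus scores below the TF-IDF models here.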



In [25]:
# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenization function for BERT
def tokenize_for_bert(reviews):
    return tokenizer(reviews, padding=True, truncation=True, max_length=128, return_tensors='pt')

# Prepare datasets
train_encodings = tokenize_for_bert(train_reviews)
test_encodings = tokenize_for_bert(test_reviews)

# Convert labels to tensor format
import torch
train_labels_tensor = torch.tensor(train_labels)
test_labels_tensor = torch.tensor(test_labels)

# Create torch dataset
class ReviewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
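        # The encodings are already tensors (return_tensors='pt'), so index them directly;
        # wrapping them in torch.tensor() again triggers a copy-construct warning.
        item = {key: val[idx] for key, val in self.encodings.items()}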
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = ReviewsDataset(train_encodings, train_labels)
test_dataset = ReviewsDataset(test_encodings, test_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# Adjust training arguments for speed and efficiency
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,  # Reduce number of epochs
    per_device_train_batch_size=4,  # Reduce batch size
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=50  # Increase logging frequency
)

# Train BERT model
trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

trainer.train()

# Evaluate BERT model
results = trainer.evaluate()
print(f"BERT Model Evaluation: {results}")



| Step | Training Loss |
|------|---------------|
| 50   | 0.7022        |
| 100  | 0.7064        |
| 150  | 0.7517        |
| 200  | 0.6955        |
| 250  | 0.7391        |
| 300  | 0.7178        |
| 350  | 0.7059        |
| 400  | 0.7158        |
| 450  | 0.7095        |
| 500  | 0.7029        |



BERT Model Evaluation: {'eval_loss': 0.6929027438163757, 'eval_runtime': 340.5891, 'eval_samples_per_second': 5.347, 'eval_steps_per_second': 0.167, 'epoch': 1.0}
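
As a point of reference for reading the losses above: a two-class model that assigns probability 0.5 to every example has a cross-entropy of ln 2, roughly 0.693. The logged training losses and the `eval_loss` are all close to that value, which may indicate that a single epoch with these settings left the classifier near chance level. That reading is an interpretation rather than something the log states directly. The helper `binary_cross_entropy` below is a hypothetical function written only to show the arithmetic.

```python
import math


def binary_cross_entropy(p_positive, label):
    """Cross-entropy of one prediction, given the predicted probability of class 1."""
    p_true_class = p_positive if label == 1 else 1.0 - p_positive
    return -math.log(p_true_class)


# Loss when the model is maximally unsure, whatever the true label is.
chance_loss = binary_cross_entropy(0.5, 1)
print(f"{chance_loss:.4f}")  # 0.6931
```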


In [15]:
# Get predictions for test dataset
predictions = trainer.predict(test_dataset)

# Extract predicted labels and convert them to numpy array
predicted_labels = predictions.predictions.argmax(axis=1)
true_labels = test_labels_tensor.numpy()

# Calculate evaluation metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, zero_division=0)  # Set zero_division=0 to handle cases where precision is undefined
recall = recall_score(true_labels, predicted_labels, zero_division=0)  # Set zero_division=0 to handle cases where recall is undefined
f1 = f1_score(true_labels, predicted_labels, zero_division=0)  # Set zero_division=0 to handle cases where F1-score is undefined

# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")



Accuracy: 0.6578802855573861
Precision: 0.7313915857605178
Recall: 0.49724972497249725
F1-score: 0.5920104780615586


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [18]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from gensim.models import Word2Vec
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Load the dataset
data = pd.read_csv('/content/drive/MyDrive/5731/Amazon_Unlocked_Mobile.csv')

# Work with a smaller sample of the data for quicker computation
data_sample = data.sample(frac=0.01, random_state=42)  # 1% of the rows (4,138 reviews in this run)
print(f"Working with a sample of {len(data_sample)} entries.")

# Preprocess the 'Reviews' column
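# Fill missing reviews first, then lowercase and strip punctuation.
# regex=True is required: recent pandas versions treat the pattern as a literal string by default.
data_sample['Processed_Reviews'] = data_sample['Reviews'].fillna("").str.lower().str.replace(r'[^\w\s]+', ' ', regex=True)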

# TF-IDF Vectorization with fewer features
tfidf_vectorizer = TfidfVectorizer(max_features=300, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(data_sample['Processed_Reviews'])

# Apply K-means clustering
kmeans = KMeans(n_clusters=5)
kmeans_clusters = kmeans.fit_predict(X_tfidf)

# Agglomerative clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_clusters = hierarchical.fit_predict(X_tfidf.toarray())

# Word2Vec clustering with optimized parameters
word2vec_model = Word2Vec(sentences=data_sample['Processed_Reviews'].apply(str.split), vector_size=30, window=5, min_count=1, workers=4, epochs=10)

# Obtain Word2Vec vectors: one vector per vocabulary word, so the clustering below groups words, not reviews
word2vec_vectors = np.array([word2vec_model.wv[word] for word in word2vec_model.wv.index_to_key])

# Clustering on Word2Vec vectors using KMeans
w2v_kmeans = KMeans(n_clusters=5)
w2v_kmeans_clusters = w2v_kmeans.fit_predict(word2vec_vectors)

# Apply DBSCAN clustering to Word2Vec vectors
dbscan = DBSCAN(eps=0.3, min_samples=10)  # Adjust eps and min_samples based on the dataset
dbscan_clusters = dbscan.fit_predict(word2vec_vectors)

# BERT clustering with batch processing
model = SentenceTransformer('bert-base-nli-mean-tokens')
bert_embeddings = model.encode(data_sample['Processed_Reviews'].tolist(), show_progress_bar=False, batch_size=32)  # Adjust batch_size as per your machine

# Reduce dimensionality with PCA before clustering
pca = PCA(n_components=30)
bert_embeddings_pca = pca.fit_transform(bert_embeddings)

# Clustering on BERT PCA embeddings using KMeans
bert_kmeans = KMeans(n_clusters=5)
bert_kmeans_clusters = bert_kmeans.fit_predict(bert_embeddings_pca)

# Evaluate the clustering
kmeans_silhouette = silhouette_score(X_tfidf.toarray(), kmeans_clusters)
hierarchical_silhouette = silhouette_score(X_tfidf.toarray(), hierarchical_clusters)
w2v_kmeans_silhouette = silhouette_score(word2vec_vectors, w2v_kmeans_clusters)
bert_kmeans_silhouette = silhouette_score(bert_embeddings_pca, bert_kmeans_clusters)
# Evaluate DBSCAN only if it found more than one label; silhouette_score raises ValueError otherwise
dbscan_silhouette = silhouette_score(word2vec_vectors, dbscan_clusters) if len(set(dbscan_clusters)) > 1 else None

print(f"Silhouette Score for K-means: {kmeans_silhouette}")
print(f"Silhouette Score for Hierarchical: {hierarchical_silhouette}")
print(f"Silhouette Score for Word2Vec K-means: {w2v_kmeans_silhouette}")
print(f"Silhouette Score for BERT K-means: {bert_kmeans_silhouette}")
print(f"Silhouette Score for DBSCAN: {dbscan_silhouette if len(set(dbscan_clusters)) > 1 else 'Not applicable - one cluster'}")


Working with a sample of 4138 entries.




Silhouette Score for K-means: 0.06611162035554236
Silhouette Score for Hierarchical: 0.05994426216412201
Silhouette Score for Word2Vec K-means: 0.6442577838897705
Silhouette Score for BERT K-means: 0.18565259873867035
Silhouette Score for DBSCAN: 0.8072807788848877


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

*   **K-means (0.066):** On the TF-IDF features, K-means yielded a low silhouette score, suggesting that the clusters are not well separated and overlap substantially. The centroids may not be representative of their clusters, or the cluster count might need adjustment.

*   **DBSCAN (0.807):** DBSCAN had the highest silhouette score, which indicates dense groups with little overlap. Its ability to find arbitrarily shaped clusters and to set aside noise points may contribute to this. However, it was run on the Word2Vec word vectors rather than on the reviews, and the score counts noise points (label -1) as a cluster, so it is not directly comparable with the TF-IDF scores.

*   **Hierarchical clustering (0.060):** This method scored slightly lower than K-means on the same TF-IDF features, which might imply that the agglomerative merging did not capture the natural groupings as effectively. It may also mean that the chosen number of clusters was not optimal.

*   **Word2Vec (0.644):** K-means on the Word2Vec embeddings resulted in a high silhouette score, signifying dense, well-separated clusters. Because these vectors represent vocabulary words rather than reviews, the score describes how well words group together, not how well the reviews do.

*   **BERT (0.186):** K-means on the BERT sentence embeddings led to a moderate silhouette score. It is better than K-means and hierarchical clustering on TF-IDF but well below the Word2Vec score, possibly because the review embeddings, even after PCA reduction to 30 components, do not separate cleanly into five groups.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
I learned several classification algorithms and clustering methods. The main difficulty I encountered in this exercise is that Colab provides limited RAM,
so it is difficult to run the different algorithms on large datasets. The session crashed when I ran them on the complete data.




'''