
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.
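
As a rough illustration of the protocol described above, the sketch below splits a list of example indices 80/20 and then partitions the training portion into 10 folds. It uses only the standard library, and `train_val_split` and `k_fold_indices` are hypothetical helper names made up for this illustration; in the notebook itself, scikit-learn's `train_test_split` and `KFold` do this work.

```python
import random

def train_val_split(indices, val_fraction=0.2, seed=42):
    """Shuffle the indices and hold out `val_fraction` of them for validation."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def k_fold_indices(indices, k=10):
    """Yield (train_fold, held_out_fold) pairs; each index is held out exactly once."""
    indices = list(indices)
    # Fold i takes every k-th index starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

# Toy run on 1000 example indices (not the real dataset)
train_idx, val_idx = train_val_split(range(1000))
splits = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(val_idx), len(splits))  # prints: 800 200 10
```

Each of the 10 splits trains on 9 folds and scores on the remaining one; the validation indices and the separate test file are never touched during cross-validation.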


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
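
To make the four measurements above concrete, here is a minimal sketch that computes them from confusion-matrix counts for a binary task. The function name `binary_metrics` and the toy labels are assumptions for illustration only; the notebook uses the equivalent functions from `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (made up): 3 true positives, 1 false positive, 1 false negative, 3 true negatives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.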


In [25]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from gensim.models import Word2Vec
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from transformers import BertTokenizer, BertModel
import torch

# Load data
train_data_path = "/content/stsa-train.txt"
test_data_path = "/content/stsa-test.txt"

# Function to load data
def load_data(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
    data = []
    labels = []
    for line in lines:
        label, text = line.split(' ', 1)
        data.append(text.strip())
        labels.append(int(label))
    return data, labels

train_texts, train_labels = load_data(train_data_path)
test_texts, test_labels = load_data(test_data_path)

# Split train data into train and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)


# Convert text data into numerical features
count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(train_texts)
X_val_counts = count_vectorizer.transform(val_texts)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_val_tfidf = tfidf_transformer.transform(X_val_counts)


# Function to perform 10 fold cross validation
def perform_cross_validation(classifier, X, y):
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(classifier, X, y, cv=cv, scoring='accuracy')
    return scores.mean()

# Function to evaluate classifier
def evaluate_classifier(classifier, X_train, y_train, X_test, y_test):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    return accuracy, recall, precision, f1

# Word2Vec: represent each text by the mean of its word vectors, then classify with a Random Forest
word2vec_model = Word2Vec(sentences=[text.split() for text in train_texts], vector_size=100, window=5, min_count=1, workers=4)

def mean_word_vector(text):
    # Words unseen during Word2Vec training are skipped; a text with no known words maps to the zero vector
    vectors = [word2vec_model.wv[word] for word in text.split() if word in word2vec_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(100)

word2vec_features = np.array([mean_word_vector(text) for text in train_texts])
word2vec_val_features = np.array([mean_word_vector(text) for text in val_texts])

word2vec_classifier = RandomForestClassifier()
word2vec_cv_accuracy = perform_cross_validation(word2vec_classifier, word2vec_features, train_labels)
word2vec_accuracy, word2vec_recall, word2vec_precision, word2vec_f1 = evaluate_classifier(word2vec_classifier, word2vec_features, train_labels, word2vec_val_features, val_labels)

print("Word2Vec:")
print("Cross Validation Accuracy:", word2vec_cv_accuracy)
print("Accuracy:", word2vec_accuracy)
print("Recall:", word2vec_recall)
print("Precision:", word2vec_precision)
print("F1 Score:", word2vec_f1)
print()



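# BERT: use the pooled output of the pretrained bert-base-uncased model as features for a Random Forest
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the model to the same device as the inputs (otherwise a GPU runtime raises a device-mismatch error)
bert_model = BertModel.from_pretrained('bert-base-uncased').to(device)
bert_model.eval()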

# Function to encode text using BERT
def encode_text(text):
    input_ids = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)['input_ids'].to(device)
    with torch.no_grad():
        outputs = bert_model(input_ids)
    return outputs.pooler_output.cpu().numpy()

bert_features = np.concatenate([encode_text(text) for text in train_texts])
bert_val_features = np.concatenate([encode_text(text) for text in val_texts])

bert_classifier = RandomForestClassifier()
bert_cv_accuracy = perform_cross_validation(bert_classifier, bert_features, train_labels)
bert_accuracy, bert_recall, bert_precision, bert_f1 = evaluate_classifier(bert_classifier, bert_features, train_labels, bert_val_features, val_labels)

print("BERT:")
print("Cross Validation Accuracy:", bert_cv_accuracy)
print("Accuracy:", bert_accuracy)
print("Recall:", bert_recall)
print("Precision:", bert_precision)
print("F1 Score:", bert_f1)
print()


# Multinomial Naive Bayes
nb_classifier = MultinomialNB()
nb_cv_accuracy = perform_cross_validation(nb_classifier, X_train_counts, train_labels)
nb_accuracy, nb_recall, nb_precision, nb_f1 = evaluate_classifier(nb_classifier, X_train_counts, train_labels, X_val_counts, val_labels)

print("Multinomial Naive Bayes:")
print("Cross Validation Accuracy:", nb_cv_accuracy)
print("Accuracy:", nb_accuracy)
print("Recall:", nb_recall)
print("Precision:", nb_precision)
print("F1 Score:", nb_f1)
print()

# SVM
svm_classifier = SVC()
svm_cv_accuracy = perform_cross_validation(svm_classifier, X_train_tfidf, train_labels)
svm_accuracy, svm_recall, svm_precision, svm_f1 = evaluate_classifier(svm_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)

print("SVM:")
print("Cross Validation Accuracy:", svm_cv_accuracy)
print("Accuracy:", svm_accuracy)
print("Recall:", svm_recall)
print("Precision:", svm_precision)
print("F1 Score:", svm_f1)
print()

# KNN
knn_classifier = KNeighborsClassifier()
knn_cv_accuracy = perform_cross_validation(knn_classifier, X_train_tfidf, train_labels)
knn_accuracy, knn_recall, knn_precision, knn_f1 = evaluate_classifier(knn_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)

print("KNN:")
print("Cross Validation Accuracy:", knn_cv_accuracy)
print("Accuracy:", knn_accuracy)
print("Recall:", knn_recall)
print("Precision:", knn_precision)
print("F1 Score:", knn_f1)
print()

# Decision Tree
dt_classifier = DecisionTreeClassifier()
dt_cv_accuracy = perform_cross_validation(dt_classifier, X_train_tfidf, train_labels)
dt_accuracy, dt_recall, dt_precision, dt_f1 = evaluate_classifier(dt_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)

print("Decision Tree:")
print("Cross Validation Accuracy:", dt_cv_accuracy)
print("Accuracy:", dt_accuracy)
print("Recall:", dt_recall)
print("Precision:", dt_precision)
print("F1 Score:", dt_f1)
print()

# Random Forest
rf_classifier = RandomForestClassifier()
rf_cv_accuracy = perform_cross_validation(rf_classifier, X_train_tfidf, train_labels)
rf_accuracy, rf_recall, rf_precision, rf_f1 = evaluate_classifier(rf_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)

print("Random Forest:")
print("Cross Validation Accuracy:", rf_cv_accuracy)
print("Accuracy:", rf_accuracy)
print("Recall:", rf_recall)
print("Precision:", rf_precision)
print("F1 Score:", rf_f1)
print()

# XGBoost
xgb_classifier = XGBClassifier()
xgb_cv_accuracy = perform_cross_validation(xgb_classifier, X_train_tfidf, train_labels)
xgb_accuracy, xgb_recall, xgb_precision, xgb_f1 = evaluate_classifier(xgb_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)

print("XGBoost:")
print("Cross Validation Accuracy:", xgb_cv_accuracy)
print("Accuracy:", xgb_accuracy)
print("Recall:", xgb_recall)
print("Precision:", xgb_precision)
print("F1 Score:", xgb_f1)
print()





Word2Vec:
Cross Validation Accuracy: 0.5637703109393463
Accuracy: 0.5736994219653179
Recall: 0.6563814866760168
Precision: 0.5756457564575646
F1 Score: 0.6133682830930538



BERT:
Cross Validation Accuracy: 0.7599375901711047
Accuracy: 0.7738439306358381
Recall: 0.7868162692847125
Precision: 0.7770083102493075
F1 Score: 0.7818815331010454

Multinomial Naive Bayes:
Cross Validation Accuracy: 0.7779979240245201
Accuracy: 0.7947976878612717
Recall: 0.8429172510518934
Precision: 0.777490297542044
F1 Score: 0.8088829071332435

SVM:
Cross Validation Accuracy: 0.775101350689707
Accuracy: 0.7976878612716763
Recall: 0.8597475455820477
Precision: 0.7730138713745272
F1 Score: 0.8140770252324037

KNN:
Cross Validation Accuracy: 0.7162213329329357
Accuracy: 0.7297687861271677
Recall: 0.7784011220196353
Precision: 0.7198443579766537
F1 Score: 0.7479784366576819

Decision Tree:
Cross Validation Accuracy: 0.6074826512426477
Accuracy: 0.6170520231213873
Recall: 0.6872370266479664
Precision: 0.6148055207026348
F1 Score: 0.6490066225165563

Random Forest:
Cross Validation Accuracy: 0.7014016098602307
Accuracy: 0.7239884393063584
Recall: 0.7994389901823282
Precision: 0.704573

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data if you prefer.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering
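
Word2Vec and BERT in the list above produce embeddings rather than cluster labels, so a clustering algorithm such as K-means still has to be run on top of them, and each document first needs a single vector. A common way to get one from word embeddings is to average the vectors of the document's words. The sketch below shows that step in plain Python; `mean_vector` and the tiny two-dimensional `toy_embeddings` table are hypothetical stand-ins for a trained Word2Vec model.

```python
def mean_vector(tokens, embeddings, dim):
    """Average the embeddings of the known tokens; return a zero vector if none are known."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Made-up 2-dimensional embeddings, for illustration only
toy_embeddings = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "phone": [0.0, 1.0],
}

doc_vectors = [
    mean_vector(review.split(), toy_embeddings, dim=2)
    for review in ["good phone", "bad phone", "unknown words only"]
]
print(doc_vectors)  # prints: [[0.5, 0.5], [-0.5, 0.5], [0.0, 0.0]]
```

The resulting list has one row per document, which is the shape a clustering algorithm expects as input.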

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch
import time

# Load the dataset
data_path = "/content/Amazon_Unlocked_Mobile.csv"
df = pd.read_csv(data_path)

# Preprocess text data if needed

# Drop rows with missing values in the 'Reviews' column
df.dropna(subset=['Reviews'], inplace=True)

# Extract text data
texts = df['Reviews'].tolist()

# Choose subset size (adjust as needed)
subset_size = 10000

# Use a subset of the data for testing
texts_subset = texts[:subset_size]

# Define total number of clustering methods
total_methods = 5

# TF-IDF vectorization
print("Running TF-IDF vectorization...")
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(texts_subset)
print("TF-IDF vectorization completed.")

# Print percentage completed
print(f"Progress: {(1 / total_methods) * 100:.2f}%")

# K-means clustering
print("Running K-means clustering...")
kmeans_model = KMeans(n_clusters=5, random_state=42)
kmeans_clusters = kmeans_model.fit_predict(tfidf_matrix)
print("K-means clustering completed.")

# Print percentage completed
print(f"Progress: {(2 / total_methods) * 100:.2f}%")

# DBSCAN clustering
print("Running DBSCAN clustering...")
start_time_dbscan = time.time()
dbscan_model = DBSCAN(eps=0.5, min_samples=5)
dbscan_clusters = dbscan_model.fit_predict(tfidf_matrix)
end_time_dbscan = time.time()
print("DBSCAN clustering completed.")
print(f"Time taken for DBSCAN: {end_time_dbscan - start_time_dbscan} seconds")

# Print percentage completed
print(f"Progress: {(3 / total_methods) * 100:.2f}%")

# Hierarchical clustering
print("Running Hierarchical clustering...")
hierarchical_model = AgglomerativeClustering(n_clusters=5)
hierarchical_clusters = hierarchical_model.fit_predict(tfidf_matrix.toarray())
print("Hierarchical clustering completed.")

# Print percentage completed
print(f"Progress: {(4 / total_methods) * 100:.2f}%")

# Word2Vec clustering
print("Running Word2Vec clustering...")
start_time_word2vec = time.time()
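word2vec_model = Word2Vec(sentences=[text.split() for text in texts_subset], vector_size=100, window=5, min_count=1, workers=4)

def mean_word_vector(text):
    # Average the word vectors of one review so that each review, not each word, becomes one point to cluster
    vectors = [word2vec_model.wv[word] for word in text.split() if word in word2vec_model.wv]
    return sum(vectors) / len(vectors) if vectors else [0.0] * 100

word2vec_features = [mean_word_vector(text) for text in texts_subset]
word2vec_model_clusters = KMeans(n_clusters=5, random_state=42).fit_predict(word2vec_features)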
end_time_word2vec = time.time()
print("Word2Vec clustering completed.")
print(f"Time taken for Word2Vec: {end_time_word2vec - start_time_word2vec} seconds")

# Print percentage completed
print(f"Progress: {(5 / total_methods) * 100:.2f}%")

# BERT clustering
print("Running BERT clustering...")
start_time_bert = time.time()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
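device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the model to the same device as the inputs (otherwise a GPU runtime raises a device-mismatch error)
bert_model = BertModel.from_pretrained('bert-base-uncased').to(device)
bert_model.eval()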

def encode_text(text):
    input_ids = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)['input_ids'].to(device)
    with torch.no_grad():
        outputs = bert_model(input_ids)
    return outputs.pooler_output.cpu().numpy()

bert_features = [encode_text(text) for text in texts_subset]

# Flatten the bert_features array
bert_features_flat = [feature.flatten() for feature in bert_features]

bert_clusters = KMeans(n_clusters=5, random_state=42).fit_predict(bert_features_flat)
end_time_bert = time.time()
print("BERT clustering completed.")
print(f"Time taken for BERT: {end_time_bert - start_time_bert} seconds")


# Print results
print("K-means Clusters:", kmeans_clusters)
print("DBSCAN Clusters:", dbscan_clusters)
print("Hierarchical Clusters:", hierarchical_clusters)
print("Word2Vec Clusters:", word2vec_model_clusters)
print("BERT Clusters:", bert_clusters)


Running TF-IDF vectorization...
TF-IDF vectorization completed.
Progress: 20.00%
Running K-means clustering...




K-means clustering completed.
Progress: 40.00%
Running DBSCAN clustering...
DBSCAN clustering completed.
Time taken for DBSCAN: 110.98614382743835 seconds
Progress: 60.00%
Running Hierarchical clustering...
Hierarchical clustering completed.
Progress: 80.00%
Running Word2Vec clustering...




Word2Vec clustering completed.
Time taken for Word2Vec: 27.311561584472656 seconds
Progress: 100.00%
Running BERT clustering...




BERT clustering completed.
Time taken for BERT: 1922.824051141739 seconds
K-means Clusters: [4 3 3 ... 4 3 4]
DBSCAN Clusters: [-1 -1 -1 ... -1 -1 -1]
Hierarchical Clusters: [0 0 0 ... 0 0 0]
Word2Vec Clusters: [2 4 2 ... 0 2 4]
BERT Clusters: [3 3 2 ... 3 4 0]


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

The five approaches differed markedly in both runtime and clustering outcome. K-means on the TF-IDF matrix assigned every review to one of the five requested clusters, and the printed labels show reviews spread over several of them. DBSCAN (eps=0.5, min_samples=5) took about 111 seconds, and every label visible in the printed output is -1, meaning those reviews were treated as noise rather than assigned to a cluster; with these parameters it did not produce a useful grouping of the sparse TF-IDF vectors, and eps would likely need tuning. Hierarchical (agglomerative) clustering placed every visible review in cluster 0, which suggests one dominant cluster alongside much smaller ones. Clustering on Word2Vec embeddings took about 27 seconds and produced labels spread across the five clusters, drawing on the semantic similarity captured by the word vectors. Clustering on BERT embeddings was by far the slowest, at about 1,923 seconds (roughly 32 minutes), but it uses contextual embeddings that can capture more nuanced relationships between reviews. No quantitative quality measure was computed, so these observations rest only on the runtimes and the truncated label arrays; overall, the appropriate method depends on the characteristics of the dataset and the desired clustering outcome.
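
One simple way to put such a comparison on firmer ground is to count how many points each method assigns to each cluster and how many DBSCAN marks as noise. The sketch below does this with the standard library; `summarize_clusters` is a hypothetical helper, and the label lists are made-up toy values, not the notebook's actual results.

```python
from collections import Counter

def summarize_clusters(labels):
    """Return the number of clusters, the noise fraction and the cluster sizes."""
    labels = list(labels)
    counts = Counter(labels)
    # DBSCAN uses the label -1 for noise points; other methods never produce it
    noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / len(labels),
        "sizes": dict(sorted(counts.items())),
    }

# Toy label lists (made up for illustration)
toy_results = {
    "K-means": [4, 3, 3, 4, 3, 4],
    "DBSCAN": [-1, -1, -1, -1, -1, -1],
    "Hierarchical": [0, 0, 0, 0, 0, 1],
}
for method, labels in toy_results.items():
    print(method, summarize_clusters(labels))
```

With the notebook's NumPy label arrays, passing `labels.tolist()` keeps the keys as plain Python integers. A noise fraction near 1.0 or a single cluster holding almost all points would indicate that a method's parameters need adjusting.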





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Through this assignment, I've gained hands-on experience in text classification using various machine learning algorithms. I learned how to preprocess text data, including loading, splitting, and transforming it into numerical features suitable for training classifiers. I also got familiar with popular classifiers such as Multinomial Naive Bayes, SVM, KNN, Decision Trees, Random Forest, and XGBoost, and how to evaluate their performance using metrics like accuracy, recall, precision, and F1 score. Additionally, I explored more advanced techniques like Word2Vec and BERT embeddings for text representation, enhancing my understanding of natural language processing. Overall, this assignment has deepened my knowledge of text classification techniques and provided valuable practical skills for future projects in the field.




