# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, grant sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as their performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
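
The workflow described above can be sketched as follows. This is a minimal illustration, not the required solution: the toy corpus is hypothetical and stands in for the course data files, and `MultinomialNB` is just one of the listed algorithms. The vectorizer and classifier are wrapped in a scikit-learn pipeline so that TF-IDF is refit inside each fold, which is one way to keep the held-out fold from influencing the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (assumption): replace with the texts and labels from the course files.
positive = ["a great and enjoyable film", "wonderful acting and a moving story"]
negative = ["a dull and boring film", "terrible acting and a weak story"]
texts = positive * 15 + negative * 15
labels = [1] * 30 + [0] * 30

# 80% training / 20% validation, keeping the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# The pipeline refits TF-IDF on the training part of every fold.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training split, all four metrics in one pass.
metric_names = ["accuracy", "precision", "recall", "f1"]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(pipeline, X_train, y_train, cv=cv, scoring=metric_names)
cv_means = {name: scores[f"test_{name}"].mean() for name in metric_names}
print(cv_means)

# Fit on the full training split and check the held-out validation split.
pipeline.fit(X_train, y_train)
val_accuracy = pipeline.score(X_val, y_val)
print("Validation accuracy:", val_accuracy)
```

The same fitted `pipeline` can then be scored once on the test file, which should stay untouched until this final step.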


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT
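
Word2Vec and BERT in the list above are embedding models rather than classifiers, so they need a classifier on top. One common approach is to average the word vectors of each document and train a standard classifier on the result. The sketch below assumes a hypothetical hand-written table of 3-dimensional vectors (`toy_vectors`) in place of a trained embedding model, and uses logistic regression as an arbitrary choice of classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical word vectors (assumption): in practice these would come
# from a trained embedding model such as Word2Vec.
toy_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "enjoyable": np.array([0.8, 0.2, 0.1]),
    "dull": np.array([-0.9, 0.1, 0.0]),
    "boring": np.array([-0.8, 0.2, 0.1]),
    "film": np.array([0.0, 0.5, 0.5]),
}


def document_vector(text, vectors, dim=3):
    """Average the vectors of the known words; return zeros if none are known."""
    known = [vectors[word] for word in text.lower().split() if word in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


sample_texts = ["great film", "enjoyable film", "dull film", "boring film"]
sample_labels = [1, 1, 0, 0]

# One fixed-length feature row per document.
features = np.vstack([document_vector(text, toy_vectors) for text in sample_texts])
classifier = LogisticRegression().fit(features, sample_labels)

new_review = document_vector("a great and enjoyable film", toy_vectors)
prediction = classifier.predict([new_review])[0]
print("Predicted label:", prediction)
```

The same document vectors can be passed to a clustering algorithm instead of a classifier, which is how embedding models fit into a clustering task as well.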

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
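
The four measures above all derive from the counts of true and false positives and negatives. A minimal sketch in plain Python, assuming binary labels with 1 as the positive class and hypothetical example predictions:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # 3 TP, 1 FP, 1 FN, 3 TN, so every metric is 0.75
```

In the notebook itself, `accuracy_score`, `precision_score`, `recall_score` and `f1_score` from `sklearn.metrics` compute the same quantities.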


In [7]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer

# Load data
def load_data(file_path):
    texts = []
    labels = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line_num, line in enumerate(file, start=1):
            line = line.strip()
            if line:
                parts = line.split(' ')
                label = parts[0]
                text = ' '.join(parts[1:])
                try:
                    labels.append(int(label))
                    texts.append(text)
                except ValueError:
                    print(f"Error parsing line {line_num}: '{line}'")
    return texts, labels

train_texts, train_labels = load_data('stsa-train.txt')
test_texts, test_labels = load_data('stsa-test.txt')

# Split training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)

# Define models
models = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize evaluation metrics
evaluation_metrics = {
    'Accuracy': accuracy_score,
    'Precision': precision_score,
    'Recall': recall_score,
    'F1-score': f1_score
}

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Transform the text data into TF-IDF features
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Perform 10-fold cross-validation and evaluate each model
results = {}
for model_name, model in models.items():
    print(f'Evaluating {model_name}...')
    # Perform cross-validation
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    accuracy_scores = cross_val_score(model, X_train_tfidf, y_train, cv=kfold, scoring='accuracy')
    precision_scores = cross_val_score(model, X_train_tfidf, y_train, cv=kfold, scoring='precision')
    recall_scores = cross_val_score(model, X_train_tfidf, y_train, cv=kfold, scoring='recall')
    f1_scores = cross_val_score(model, X_train_tfidf, y_train, cv=kfold, scoring='f1')

    results[model_name] = {
        'Accuracy': accuracy_scores.mean(),
        'Precision': precision_scores.mean(),
        'Recall': recall_scores.mean(),
        'F1-score': f1_scores.mean()
    }

    print(f'{model_name} - Evaluation Metrics:')
    print(f'Accuracy: {results[model_name]["Accuracy"]}')
    print(f'Precision: {results[model_name]["Precision"]}')
    print(f'Recall: {results[model_name]["Recall"]}')
    print(f'F1-score: {results[model_name]["F1-score"]}')
    print('')

# Select the best-performing model based on accuracy
best_model = max(results, key=lambda x: results[x]['Accuracy'])
print(f'Best model: {best_model}')

# Train the final model using the selected algorithm on the 80% training split
final_model = models[best_model]
final_model.fit(X_train_tfidf, y_train)

X_val_tfidf = tfidf_vectorizer.transform(X_val)

# Evaluate the final trained model on the validation data
y_pred = final_model.predict(X_val_tfidf)
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print(f'Final evaluation on validation data:')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1}')

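# Evaluate the final model on the held-out test data, as the question requires
X_test_tfidf = tfidf_vectorizer.transform(test_texts)
y_test_pred = final_model.predict(X_test_tfidf)

print('Final evaluation on test data:')
print(f'Accuracy: {accuracy_score(test_labels, y_test_pred)}')
print(f'Precision: {precision_score(test_labels, y_test_pred)}')
print(f'Recall: {recall_score(test_labels, y_test_pred)}')
print(f'F1-score: {f1_score(test_labels, y_test_pred)}')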


Evaluating MultinomialNB...
MultinomialNB - Evaluation Metrics:
Accuracy: 0.7796218199385041
Precision: 0.7596894214796257
Recall: 0.847496892610612
F1-score: 0.800788608822262

Evaluating SVM...
SVM - Evaluation Metrics:
Accuracy: 0.775101350689707
Precision: 0.770625897102649
Recall: 0.8118251646496912
F1-score: 0.7903445472649195

Evaluating KNN...
KNN - Evaluation Metrics:
Accuracy: 0.7156788374537312
Precision: 0.7228559332510321
Recall: 0.7402719138687788
F1-score: 0.7310618689469263

Evaluating DecisionTree...
DecisionTree - Evaluation Metrics:
Accuracy: 0.6065749015870114
Precision: 0.6172270540339757
Recall: 0.6313611807277828
F1-score: 0.6247996176688593

Evaluating RandomForest...
RandomForest - Evaluation Metrics:
Accuracy: 0.7032079696568112
Precision: 0.6998856525045852
Recall: 0.7629514519541646
F1-score: 0.7260360113561232

Evaluating XGBoost...
XGBoost - Evaluation Metrics:
Accuracy: 0.6954472160385425
Precision: 0.6897249222820749
Recall: 0.7610451773100005
F1-score: 

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data if you prefer.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
from sklearn.decomposition import TruncatedSVD
import time

# Load dataset
data = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# For demonstration purposes, let's use a sample of the dataset
data = data.sample(frac=0.05, random_state=42)

# Preprocess text data
data.dropna(subset=['Reviews'], inplace=True)
text_data = data['Reviews'].values

# Vectorize text data
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(text_data)

# K-means clustering
start_time = time.time()
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
kmeans_time = time.time() - start_time
silhouette_kmeans = silhouette_score(X, kmeans_labels)

# DBSCAN clustering
start_time = time.time()
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
dbscan_time = time.time() - start_time
silhouette_dbscan = silhouette_score(X, dbscan_labels)

# Hierarchical clustering
start_time = time.time()
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_labels = hierarchical.fit_predict(X.toarray())
hierarchical_time = time.time() - start_time
silhouette_hierarchical = silhouette_score(X, hierarchical_labels)

# Word2Vec clustering
start_time = time.time()
word2vec_model = Word2Vec(sentences=[review.split() for review in text_data], vector_size=100, window=5, min_count=1, workers=4)
word2vec_features = np.array([word2vec_model.wv[review.split()].mean(axis=0) for review in text_data])
kmeans_word2vec = KMeans(n_clusters=5, random_state=42)
kmeans_labels_word2vec = kmeans_word2vec.fit_predict(word2vec_features)
word2vec_time = time.time() - start_time
silhouette_word2vec = silhouette_score(word2vec_features, kmeans_labels_word2vec)

# BERT clustering
'''
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
embeddings = []
start_time = time.time()
for review in text_data:
    inputs = tokenizer(review, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)
    embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().detach().numpy())
embeddings = np.array(embeddings)
pca = TruncatedSVD(n_components=100)
embeddings_pca = pca.fit_transform(embeddings)
kmeans_bert = KMeans(n_clusters=5, random_state=42)
kmeans_labels_bert = kmeans_bert.fit_predict(embeddings_pca)
bert_time = time.time() - start_time
silhouette_bert = silhouette_score(embeddings_pca, kmeans_labels_bert)
'''
# Print runtime and silhouette score for each clustering algorithm
print("K-means Runtime:", kmeans_time)
print("K-means Silhouette Score:", silhouette_kmeans)
print("DBSCAN Runtime:", dbscan_time)
print("DBSCAN Silhouette Score:", silhouette_dbscan)
print("Hierarchical Runtime:", hierarchical_time)
print("Hierarchical Silhouette Score:", silhouette_hierarchical)
print("Word2Vec Runtime:", word2vec_time)
print("Word2Vec Silhouette Score:", silhouette_word2vec)
'''
print("BERT Runtime:", bert_time)
print("BERT Silhouette Score:", silhouette_bert)'''




K-means Runtime: 2.843510627746582
K-means Silhouette Score: 0.03825677867008131
DBSCAN Runtime: 3.512040853500366
DBSCAN Silhouette Score: -0.1589411172861913
Hierarchical Runtime: 15.18751311302185
Hierarchical Silhouette Score: 0.017666229535585123
Word2Vec Runtime: 2.346010208129883
Word2Vec Silhouette Score: 0.5161927



**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

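On the TF-IDF features, K-means (silhouette score about 0.038) and hierarchical clustering (about 0.018) both scored close to zero, which indicates heavily overlapping clusters; hierarchical clustering was also the slowest method at about 15 seconds. DBSCAN scored negative (about -0.159), which suggests that `eps=0.5` and `min_samples=5` do not suit these sparse, high-dimensional vectors; this score also counts any points DBSCAN labeled as noise as if they formed a cluster of their own. K-means on averaged Word2Vec vectors gave by far the highest score (about 0.516) and the shortest runtime (about 2.3 seconds), although that score is computed in the Word2Vec embedding space rather than the TF-IDF space, so it is not directly comparable with the other three. BERT could not be evaluated because the embedding step did not finish in a reasonable time and was commented out, so no silhouette score is available for it; its contextual embeddings might separate the reviews better, but that remains untested here. Overall, the dense Word2Vec representation separated the reviews far more cleanly than any method applied to the TF-IDF features.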
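
One caveat when reading the DBSCAN score: scikit-learn's `DBSCAN` assigns the label -1 to noise points, and `silhouette_score` treats that label like any other cluster. A possible alternative, sketched below on hypothetical toy data, is to score only the points that were assigned to a cluster. The helper name `silhouette_without_noise` is an assumption made for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_without_noise(features, labels):
    """Silhouette score over non-noise points; None if fewer than 2 clusters remain."""
    labels = np.asarray(labels)
    mask = labels != -1  # DBSCAN marks noise with -1
    if len(set(labels[mask])) < 2:
        return None
    return silhouette_score(features[mask], labels[mask])


# Hypothetical toy data: two tight blobs plus one far-away outlier.
points, _ = make_blobs(
    n_samples=60, centers=[[0, 0], [10, 10]], cluster_std=0.5, random_state=42
)
points = np.vstack([points, [[50.0, -50.0]]])

labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(points)
n_noise = int((labels == -1).sum())
score = silhouette_without_noise(points, labels)

print("Noise points:", n_noise)
print("Silhouette without noise:", score)
```

Reporting the number of noise points next to the score helps, because a high score over very few clustered points says little about the dataset as a whole.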






# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
This exercise was the most challenging of all the exercises so far. The most difficult part was the runtime:
fixing an error and rerunning the code for all the models took too long.
In the first question I could not implement Word2Vec and BERT because they are not classifiers.
In the second question the BERT model took too long to run; I waited 50 minutes and it still had not finished,
so I commented out the BERT model and ran the other models.
BERT and Word2Vec are also not clustering models, which was the main problem I faced there.


'''