# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
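
As a minimal sketch of the workflow described above (an 80/20 split followed by 10-fold cross-validation on the training portion), the following uses a hypothetical toy corpus in place of the Canvas files. The sentences, the `subjects` list, and the choice of `MultinomialNB` are illustrative assumptions. Wrapping the vectorizer in a pipeline is one way to make sure the TF-IDF statistics are refit inside every fold rather than on the whole training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the reviews: 1 = positive, 0 = negative
subjects = ["film", "movie", "story", "plot", "cast"]
positive = [f"a great and enjoyable {subjects[i % 5]}" for i in range(20)]
negative = [f"a dull and boring {subjects[i % 5]}" for i in range(20)]
texts = positive + negative
labels = [1] * 20 + [0] * 20

# 80% training / 20% validation, stratified so both classes stay balanced
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# The vectorizer lives inside the pipeline, so each CV fold fits TF-IDF
# on its own training part only
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training split
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring="f1_macro")
print(f"CV macro F1: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Refit on the full training split, then check the held-out validation split
pipeline.fit(X_train, y_train)
val_accuracy = pipeline.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

The validation split is only consulted after cross-validation, and the test file would be touched once, at the very end, with the final fitted pipeline.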


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
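
To make the four measures concrete, here is a small worked sketch on hypothetical labels (the `y_true` and `y_pred` lists are made up for illustration). It derives each measure from the confusion-matrix counts and also computes the macro-averaged precision, the variant requested with `average='macro'`.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels: 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of correct predictions
precision = tp / (tp + fp)                          # how many predicted positives are right
recall = tp / (tp + fn)                             # how many true positives are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# Macro averaging computes precision per class and takes the unweighted mean
macro_precision = precision_score(y_true, y_pred, average="macro")

print(f"Accuracy:        {accuracy:.3f}")
print(f"Precision:       {precision:.3f}")
print(f"Recall:          {recall:.3f}")
print(f"F1 score:        {f1:.3f}")
print(f"Macro precision: {macro_precision:.3f}")
```

Here the positive-class precision (3 of 5 predicted positives are correct) differs from the macro precision, which also accounts for how well the negative class is predicted.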


In [2]:
# Write your code here
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the data
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = [line.strip().split('\t') for line in file if line.strip()]
    return data

train_data_file_path = 'stsa-train.txt'
test_data_file_path = 'stsa-test.txt'
train_data = load_data(train_data_file_path)
test_data = load_data(test_data_file_path)

# Preprocess the text data
vectorizer = TfidfVectorizer(stop_words=None)
train_texts = [line[0][2:] if len(line) > 0 else '' for line in train_data]  # Extracting text part from each entry
train_labels = [int(line[0][0]) if len(line) > 0 else 0 for line in train_data]  # Extracting the labels
test_texts = [line[0][2:] if len(line) > 0 else '' for line in test_data]
test_labels = [int(line[0][0]) if len(line) > 0 else 0 for line in test_data]

X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
y_train = train_labels
y_test = test_labels

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Define the algorithms
algorithms = [
    ('MultinomialNB', MultinomialNB()),
    ('SVM', SVC()),
    ('KNN', KNeighborsClassifier()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('XGBoost', XGBClassifier())
]

# Perform 10-fold cross-validation and evaluate the algorithms
for algorithm_name, algorithm in algorithms:
    scores = cross_val_score(algorithm, X_train, y_train, cv=10, scoring='f1_macro')
    print(f"{algorithm_name} - Cross-Validation F1 Score: {scores.mean():.3f} (+/- {scores.std():.3f})")

    # Train the algorithm on the 80% training split
    algorithm.fit(X_train, y_train)

    # Evaluate the algorithm on the test data
    y_pred = algorithm.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')

    print(f"{algorithm_name} - Test Accuracy: {accuracy:.3f}")
    print(f"{algorithm_name} - Test Precision: {precision:.3f}")
    print(f"{algorithm_name} - Test Recall: {recall:.3f}")
    print(f"{algorithm_name} - Test F1 Score: {f1:.3f}")
    print("-" * 30)


MultinomialNB - Cross-Validation F1 Score: 0.772 (+/- 0.010)
MultinomialNB - Test Accuracy: 0.802
MultinomialNB - Test Precision: 0.812
MultinomialNB - Test Recall: 0.802
MultinomialNB - Test F1 Score: 0.801
------------------------------
SVM - Cross-Validation F1 Score: 0.771 (+/- 0.013)
SVM - Test Accuracy: 0.799
SVM - Test Precision: 0.801
SVM - Test Recall: 0.799
SVM - Test F1 Score: 0.799
------------------------------
KNN - Cross-Validation F1 Score: 0.711 (+/- 0.013)
KNN - Test Accuracy: 0.733
KNN - Test Precision: 0.735
KNN - Test Recall: 0.733
KNN - Test F1 Score: 0.732
------------------------------
Decision Tree - Cross-Validation F1 Score: 0.604 (+/- 0.010)
Decision Tree - Test Accuracy: 0.614
Decision Tree - Test Precision: 0.615
Decision Tree - Test Recall: 0.615
Decision Tree - Test F1 Score: 0.614
------------------------------
Random Forest - Cross-Validation F1 Score: 0.702 (+/- 0.012)
Random Forest - Test Accuracy: 0.721
Random Forest - Test Precision: 0.724
Random F

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Load the data
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = [line.strip().split('\t') for line in file if line.strip()]
    return data

train_data_file_path = 'stsa-train.txt'
test_data_file_path = 'stsa-test.txt'
train_data = load_data(train_data_file_path)
test_data = load_data(test_data_file_path)

# Preprocess the text data
train_texts = [simple_preprocess(line[0][2:], deacc=True) for line in train_data]  # Tokenizing text
train_labels = [int(line[0][0]) for line in train_data]
test_texts = [simple_preprocess(line[0][2:], deacc=True) for line in test_data]
test_labels = [int(line[0][0]) for line in test_data]

# Create Word2Vec model
model_w2v = Word2Vec(sentences=train_texts, vector_size=100, window=5, min_count=1, workers=4)

# Define feature extractor function
def get_features(texts, model):
    features = np.array([np.mean([model.wv[word] for word in text if word in model.wv] or [np.zeros(model.vector_size)], axis=0) for text in texts])
    return features

X_train = get_features(train_texts, model_w2v)
X_test = get_features(test_texts, model_w2v)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, train_labels, test_size=0.2, random_state=42)

# Perform 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Train a logistic regression model with cross-validation
classifier = LogisticRegression()

accuracy_scores = cross_val_score(classifier, X_train, y_train, cv=kf, scoring='accuracy')
precision_scores = cross_val_score(classifier, X_train, y_train, cv=kf, scoring='precision_macro')
recall_scores = cross_val_score(classifier, X_train, y_train, cv=kf, scoring='recall_macro')
f1_scores = cross_val_score(classifier, X_train, y_train, cv=kf, scoring='f1_macro')

# Print cross-validation results
print("Cross-validation results:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.3f}")
print(f"Average Precision: {np.mean(precision_scores):.3f}")
print(f"Average Recall: {np.mean(recall_scores):.3f}")
print(f"Average F1 Score: {np.mean(f1_scores):.3f}")

# Train the final model
classifier.fit(X_train, y_train)

# Evaluate the final model on test data
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(test_labels, y_pred)
precision = precision_score(test_labels, y_pred, average='macro')
recall = recall_score(test_labels, y_pred, average='macro')
f1 = f1_score(test_labels, y_pred, average='macro')

print("\nWord2Vec Model Performance:")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1 Score: {f1:.3f}")


Cross-validation results:
Average Accuracy: 0.547
Average Precision: 0.553
Average Recall: 0.535
Average F1 Score: 0.500

Word2Vec Model Performance:
Test Accuracy: 0.554
Test Precision: 0.574
Test Recall: 0.554
Test F1 Score: 0.521


In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the data
train_data = pd.read_csv('stsa-train.txt', header=None, sep='\t')
test_data = pd.read_csv('stsa-test.txt', header=None, sep='\t')

train_texts = train_data[0].str[2:].tolist()
train_labels = train_data[0].str[:1].astype(int).tolist()

test_texts = test_data[0].str[2:].tolist()
test_labels = test_data[0].str[:1].astype(int).tolist()

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)

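# Load BERT from TF Hub.
# Assumption: the two TF2 SavedModel handles below stand in for the original
# TF1-format handle, which cannot take raw strings as input. The preprocessing
# model needs the tensorflow_text package, imported here only to register its ops.
import tensorflow_text  # noqa: F401

PREPROCESS_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

# Define the model
def build_model():
    # Raw text goes in; the preprocessing layer tokenizes it for the encoder
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    encoder_inputs = hub.KerasLayer(PREPROCESS_URL, name='preprocessing')(text_input)
    bert_output = hub.KerasLayer(ENCODER_URL, trainable=True, name='bert')(encoder_inputs)
    cls_output = bert_output['pooled_output']
    output = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(cls_output)
    model = tf.keras.models.Model(inputs=text_input, outputs=output)

    # Compile the model (a small learning rate is the usual choice for fine-tuning BERT)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Keras expects arrays rather than Python lists
X_train, X_val = np.array(X_train), np.array(X_val)
y_train, y_val = np.array(y_train), np.array(y_val)

# Perform 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for train_index, val_index in kf.split(X_train):
    X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    # Build a fresh model per fold so weights do not carry over between folds
    model = build_model()
    model.fit(X_train_fold, y_train_fold, validation_data=(X_val_fold, y_val_fold), epochs=3)

    # Threshold the sigmoid outputs and flatten them to a 1-D label array
    y_pred = (model.predict(X_val_fold) > 0.5).astype(int).ravel()

    accuracy_scores.append(accuracy_score(y_val_fold, y_pred))
    precision_scores.append(precision_score(y_val_fold, y_pred))
    recall_scores.append(recall_score(y_val_fold, y_pred))
    f1_scores.append(f1_score(y_val_fold, y_pred))

# Print cross-validation results
print("Cross-validation results:")
print(f"Average Accuracy: {np.mean(accuracy_scores):.3f}")
print(f"Average Precision: {np.mean(precision_scores):.3f}")
print(f"Average Recall: {np.mean(recall_scores):.3f}")
print(f"Average F1 Score: {np.mean(f1_scores):.3f}")

# Train the final model from scratch on the full training split
model = build_model()
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5)

# Evaluate the final model on test data
y_pred = model.predict(np.array(test_texts))
y_pred = (y_pred > 0.5).astype(int).ravel()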

accuracy = accuracy_score(test_labels, y_pred)
precision = precision_score(test_labels, y_pred)
recall = recall_score(test_labels, y_pred)
f1 = f1_score(test_labels, y_pred)

print("\nBERT Model Performance:")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1 Score: {f1:.3f}")

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [8]:
import nltk
nltk.download('stopwords')
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Define preprocess_text function
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', str(text))

    # Remove punctuation and convert to lowercase
    text = ''.join([c for c in text if c not in string.punctuation]).lower()

    # Remove digits
    text = ''.join([c for c in text if not c.isdigit()])

    # Remove stop words and stem the remaining words
    words = [stemmer.stem(word) for word in text.split() if word not in stop_words]

    # Join the stemmed words back into a single string
    return ' '.join(words)

# Load and preprocess the text data (only the first 10,000 rows, because running on the full dataset froze Colab and exhausted the RAM)
data = pd.read_csv('Amazon_Unlocked_Mobile.csv').head(10000)
data['text'] = data['Reviews'].apply(preprocess_text)

# Convert text data to TF-IDF representation
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['text'])

# K-means Clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X)
print("K-means Silhouette Score:", silhouette_score(X, kmeans.labels_))

# DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
print("DBSCAN Silhouette Score:", silhouette_score(X, dbscan.labels_))

# Hierarchical Clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical.fit(X.toarray())
print("Hierarchical Clustering Silhouette Score:", silhouette_score(X.toarray(), hierarchical.labels_))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


K-means Silhouette Score: 0.032971490902600945
DBSCAN Silhouette Score: -0.1324406990872825
Hierarchical Clustering Silhouette Score: 0.015169559734088213


In [10]:
from gensim.models import Word2Vec
import numpy as np

# Preprocess the text data (tokenization, etc.)
tokenized_texts = [text.split() for text in data['text']]

# Train Word2Vec model
model = Word2Vec(tokenized_texts, vector_size=100, window=5, min_count=1, workers=4)

# Get document embeddings by averaging word embeddings
document_embeddings = []
for tokens in tokenized_texts:
    word_vectors = [model.wv[w] for w in tokens if w in model.wv]
    if word_vectors:
        document_embeddings.append(np.mean(word_vectors, axis=0))

X_word2vec = np.array(document_embeddings)

# K-means Clustering with Word2Vec
kmeans_word2vec = KMeans(n_clusters=5, random_state=42)
kmeans_word2vec.fit(X_word2vec)
print("K-means with Word2Vec Silhouette Score:", silhouette_score(X_word2vec, kmeans_word2vec.labels_))



K-means with Word2Vec Silhouette Score: 0.2209603


In [13]:
!pip install sentence-transformers



In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load pre-trained BERT model from sentence-transformers
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Get BERT embeddings for all documents in batches (much faster than encoding one text at a time)
X_bert = model.encode(data['text'].tolist(), batch_size=64, show_progress_bar=True)

# K-means Clustering with BERT
kmeans_bert = KMeans(n_clusters=5, random_state=42)
kmeans_bert.fit(X_bert)
print("K-means with BERT Silhouette Score:", silhouette_score(X_bert, kmeans_bert.labels_))

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

The Silhouette scores give a rough comparison of the clustering methods:

1. K-means with Word2Vec features: achieved the highest Silhouette score (0.221). The clusters are more compact and better separated than with the other methods, although a score of this size still indicates only moderate structure. The averaged Word2Vec embeddings are dense and low-dimensional, which appears to capture semantic similarity better than sparse TF-IDF vectors.

2. K-means with TF-IDF features: obtained a Silhouette score of 0.033. A value this close to zero means the clusters overlap heavily and are far less well-defined than with Word2Vec-based K-means.

3. Hierarchical clustering: yielded a Silhouette score of 0.015, slightly lower than K-means on the same TF-IDF features, again indicating weak cluster separation.

4. DBSCAN: showed a negative Silhouette score of -0.132, meaning that on average points lie closer to a neighbouring cluster than to their own. With eps=0.5 and min_samples=5 on TF-IDF vectors, DBSCAN does not appear suitable for this dataset.

5. BERT: the run took too long to complete, so no Silhouette score was obtained.

In conclusion, K-means with Word2Vec features outperformed the other methods, likely because the embeddings capture semantic similarities between texts. Scores computed in different feature spaces (TF-IDF versus embeddings) are not strictly comparable, however, and factors beyond the Silhouette score, such as interpretability and domain-specific requirements, should also guide the choice of clustering algorithm for a given application.
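
A point that affects how the DBSCAN score above should be read: scikit-learn's `DBSCAN` labels noise points `-1`, and `silhouette_score` treats that label like any other cluster. The sketch below, on hypothetical 2-D toy data (not the Amazon reviews), compares the score with and without the noise points. `X_demo` and the parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Hypothetical 2-D data: two tight groups plus two isolated outliers
offsets = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05]])
group_a = offsets
group_b = offsets + 5.0
outliers = np.array([[2.5, 10.0], [-5.0, 2.5]])
X_demo = np.vstack([group_a, group_b, outliers])

# Points without enough neighbours within eps are labelled -1 (noise)
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X_demo)

# Score with the noise points counted as if label -1 were a real cluster
score_with_noise = silhouette_score(X_demo, labels)

# Score on the clustered points only
mask = labels != -1
score_without_noise = silhouette_score(X_demo[mask], labels[mask])

print("Labels:", labels)
print(f"Silhouette including noise: {score_with_noise:.3f}")
print(f"Silhouette excluding noise: {score_without_noise:.3f}")
```

When many points end up as noise, reporting the noise fraction next to the score computed on the remaining points may give a fairer picture than a single number.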



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
Because of the large amount of data, my Colab session kept crashing, and I faced issues executing the BERT model.




'''