# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [1]:
# Write your code here
import re
import nltk
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
train_data = pd.read_table('/content/stsa-train.txt', delimiter='\t', header=None)
train_data.columns = ['review']

# Extract sentiment label and review text into separate columns
train_data['sentiment'] = train_data['review'].str.extract('^(\d)')  # Extract the first numeric character at the beginning of the string
train_data['review_text'] = train_data['review'].str.extract('\d (.*)')  # Extract the text after the numeric character and space

# Remove the original review column
train_data = train_data.drop(['review'], axis = 1)
train_data
test_data = pd.read_table('/content/stsa-test.txt', delimiter='\t', header=None)
test_data.columns = ['review']

# Extract sentiment label and review text into separate columns
test_data['sentiment'] = test_data['review'].str.extract('^(\d)')  # Extract the first numeric character at the beginning of the string
test_data['review_text'] = test_data['review'].str.extract('\d (.*)')  # Extract the text after the numeric character and space

# Remove the original review column
test_data = test_data.drop(['review'], axis = 1)
test_data
# Split data into 80% training and 20% validation
train_texts, val_texts, y_train, y_val = train_test_split(train_data['review_text'], train_data['sentiment'], test_size=0.2, random_state=42)
test_texts = test_data['review_text']
y_test = test_data['sentiment']
# Function to preprocess text
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords.words('english')]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
train_corpus = [preprocess_text(review) for review in train_texts]
val_corpus = [preprocess_text(review) for review in val_texts]
test_corpus = [preprocess_text(review) for review in test_texts]

# Convert text data to TF-IDF vectors
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_corpus)
X_val = tfidf_vectorizer.transform(val_corpus)
X_test = tfidf_vectorizer.transform(test_corpus)
# Initialize Multinomial Naive Bayes classifier
clf = MultinomialNB()

# Perform 10-fold cross-validation on the Valid data
cv_scores = cross_val_score(clf, X_val, y_val, cv=10)

# Train the classifier on the entire training data
clf.fit(X_train, y_train)

# Predictions on the test data
y_pred_test = clf.predict(X_test)

# Convert string labels to integers
y_test = y_test.astype(int)
y_pred_test = y_pred_test.astype(int)


# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred_test)
precision = precision_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)
f1 = f1_score(y_test, y_pred_test)

# Print the cross-validation scores and evaluation metrics
print("Cross-validation scores:", cv_scores)
print("Mean CV accuracy:", np.mean(cv_scores))
print("Test Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
# Initialize Support Vector Machine classifier
svm_clf = SVC()

# Perform 10-fold cross-validation on the valid data
svm_cv_scores = cross_val_score(svm_clf, X_val, y_val, cv=10)

# Train the SVM classifier on the entire training data
svm_clf.fit(X_train, y_train)

# Predictions on the test data using SVM classifier
y_pred_test_svm = svm_clf.predict(X_test)
# Convert predicted labels to integers
y_pred_test_svm = y_pred_test_svm.astype(int)

# Compute evaluation metrics for SVM classifier
accuracy_svm = accuracy_score(y_test, y_pred_test_svm)
# Compute evaluation metrics for SVM classifier
precision_svm = precision_score(y_test, y_pred_test_svm, pos_label=1)
recall_svm = recall_score(y_test, y_pred_test_svm, pos_label=1)
f1_svm = f1_score(y_test, y_pred_test_svm, pos_label=1)

# Print the cross-validation scores and evaluation metrics for SVM classifier
print("Cross-validation scores for SVM:", svm_cv_scores)
print("Mean CV accuracy for SVM:", np.mean(svm_cv_scores))
print("Test Accuracy for SVM:", accuracy_svm)
print("Precision for SVM:", precision_svm)
print("Recall for SVM:", recall_svm)
print("F1 Score for SVM:", f1_svm)
from sklearn.neighbors import KNeighborsClassifier

# Initialize KNN classifier
knn_clf = KNeighborsClassifier()

# Perform 10-fold cross-validation on the valid data
knn_cv_scores = cross_val_score(knn_clf, X_val, y_val, cv=10)

# Train the KNN classifier on the entire training data
knn_clf.fit(X_train, y_train)

# Predictions on the test data using KNN classifier
y_pred_test_knn = knn_clf.predict(X_test)

# Convert predicted labels to integers
y_pred_test_knn = y_pred_test_knn.astype(int)

# Compute evaluation metrics for KNN classifier
accuracy_knn = accuracy_score(y_test, y_pred_test_knn)
precision_knn = precision_score(y_test, y_pred_test_knn, pos_label=1)
recall_knn = recall_score(y_test, y_pred_test_knn, pos_label=1)
f1_knn = f1_score(y_test, y_pred_test_knn, pos_label=1)

# Print the cross-validation scores and evaluation metrics for KNN classifier
print("Cross-validation scores for KNN:", knn_cv_scores)
print("Mean CV accuracy for KNN:", np.mean(knn_cv_scores))
print("Test Accuracy for KNN:", accuracy_knn)
print("Precision for KNN:", precision_knn)
print("Recall for KNN:", recall_knn)
print("F1 Score for KNN:", f1_knn)
from sklearn.tree import DecisionTreeClassifier

# Initialize Decision Tree classifier
dt_clf = DecisionTreeClassifier()

# Perform 10-fold cross-validation on the valid data
dt_cv_scores = cross_val_score(dt_clf, X_val, y_val, cv=10)

# Train the Decision Tree classifier on the entire training data
dt_clf.fit(X_train, y_train)

# Predictions on the test data using Decision Tree classifier
y_pred_test_dt = dt_clf.predict(X_test)

# Convert predicted labels to integers
y_pred_test_dt = y_pred_test_dt.astype(int)

# Compute evaluation metrics for Decision Tree classifier
accuracy_dt = accuracy_score(y_test, y_pred_test_dt)
precision_dt = precision_score(y_test, y_pred_test_dt, pos_label=1)
recall_dt = recall_score(y_test, y_pred_test_dt, pos_label=1)
f1_dt = f1_score(y_test, y_pred_test_dt, pos_label=1)

# Print the cross-validation scores and evaluation metrics for Decision Tree classifier
print("Cross-validation scores for Decision Tree:", dt_cv_scores)
print("Mean CV accuracy for Decision Tree:", np.mean(dt_cv_scores))
print("Test Accuracy for Decision Tree:", accuracy_dt)
print("Precision for Decision Tree:", precision_dt)
print("Recall for Decision Tree:", recall_dt)
print("F1 Score for Decision Tree:", f1_dt)
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest classifier
rf_clf = RandomForestClassifier()

# Perform 10-fold cross-validation on the valid data
rf_cv_scores = cross_val_score(rf_clf, X_val, y_val, cv=10)

# Train the Random Forest classifier on the entire training data
rf_clf.fit(X_train, y_train)

# Predictions on the test data using Random Forest classifier
y_pred_test_rf = rf_clf.predict(X_test)

y_pred_test_rf = y_pred_test_rf.astype('int')

# Compute evaluation metrics for Random Forest classifier
accuracy_rf = accuracy_score(y_test, y_pred_test_rf)
precision_rf = precision_score(y_test, y_pred_test_rf, pos_label=1)
recall_rf = recall_score(y_test, y_pred_test_rf, pos_label=1)
f1_rf = f1_score(y_test, y_pred_test_rf, pos_label=1)

# Print the cross-validation scores and evaluation metrics for Random Forest classifier
print("Cross-validation scores for Random Forest:", rf_cv_scores)
print("Mean CV accuracy for Random Forest:", np.mean(rf_cv_scores))
print("Test Accuracy for Random Forest:", accuracy_rf)
print("Precision for Random Forest:", precision_rf)
print("Recall for Random Forest:", recall_rf)
print("F1 Score for Random Forest:", f1_rf)
from xgboost import XGBClassifier

# Initialize XGBoost classifier
xgb_clf = XGBClassifier()

# Convert labels to integers
y_train = y_train.astype(int)

# Perform 10-fold cross-validation on the valid data
y_train = y_train.astype(int)
y_val = y_val.astype(int)
xgb_cv_scores = cross_val_score(xgb_clf, X_val, y_val, cv=10)

# Train the XGBoost classifier on the entire training data
xgb_clf.fit(X_train, y_train)

# Predictions on the test data using XGBoost classifier
y_pred_test_xgb = xgb_clf.predict(X_test)

y_pred_test_xgb = y_pred_test_xgb.astype('int')

# Compute evaluation metrics for XGBoost classifier
accuracy_xgb = accuracy_score(y_test, y_pred_test_xgb)
precision_xgb = precision_score(y_test, y_pred_test_xgb, pos_label=1)
recall_xgb = recall_score(y_test, y_pred_test_xgb, pos_label=1)
f1_xgb = f1_score(y_test, y_pred_test_xgb, pos_label=1)

# Print the cross-validation scores and evaluation metrics for XGBoost classifier
print("Cross-validation scores for XGBoost:", xgb_cv_scores)
print("Mean CV accuracy for XGBoost:", np.mean(xgb_cv_scores))
print("Test Accuracy for XGBoost:", accuracy_xgb)
print("Precision for XGBoost:", precision_xgb)
print("Recall for XGBoost:", recall_xgb)
print("F1 Score for XGBoost:", f1_xgb)
from gensim.models import Word2Vec

# Train Word2Vec model
model = Word2Vec(train_corpus, min_count=1, vector_size=100)  # Adjust size as needed

# Define avg_word_vector function
def avg_word_vector(text, model, vocab):
    vector_sum = np.zeros(model.vector_size)
    num_words = 0
    for word in text.split():
        if word in vocab:
            vector_sum += model.wv[word]
            num_words += 1
    if num_words > 0:
        return vector_sum / num_words
    else:
        return np.zeros(model.vector_size)

# Vectorize text data
train_vectors = [avg_word_vector(text, model, model.wv.key_to_index) for text in train_corpus]
val_vectors = [avg_word_vector(text, model, model.wv.key_to_index) for text in val_corpus]
test_vectors = [avg_word_vector(text, model, model.wv.key_to_index) for text in test_corpus]
# Convert to numpy arrays
X_train = np.array(train_vectors)
X_val = np.array(val_vectors)
X_test = np.array(test_vectors)
# Initialize Random Forest classifier
rf_clf_wc = RandomForestClassifier()

# Perform 10-fold cross-validation on the valid data
rf_cv_wc_scores = cross_val_score(rf_clf_wc, X_val, y_val, cv=10)

# Train the classifier
rf_clf_wc.fit(X_train, y_train)

# Make predictions on validation data
y_pred_test_wc = rf_clf_wc.predict(X_test)

y_pred_test_wc = y_pred_test_wc.astype('int')

# Evaluate the performance of the classifier
accuracy_val_wc = accuracy_score(y_test, y_pred_test_wc)
precision_val_wc = precision_score(y_test, y_pred_test_wc)
recall_val_wc = recall_score(y_test, y_pred_test_wc)
f1_val_wc = f1_score(y_test, y_pred_test_wc)

# Print the cross-validation scores and test set metrics
print("Cross-Validation Scores:", rf_cv_wc_scores)
print("Mean Cross-Validation Score:", np.mean(rf_cv_wc_scores))
print("Accuracy:", accuracy_val_wc)
print("Precision:", precision_val_wc)
print("Recall:", recall_val_wc)
print("F1 Score:", f1_val_wc)
!pip install datasets
!pip install accelerate -U
!pip install transformers[torch] -U
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Read data
train_data = pd.read_table('/content/stsa-train.txt', delimiter='\t', header=None)
train_data.columns = ['review']
train_data['sentiment'] = train_data['review'].str.extract('^(\d)')
train_data['review_text'] = train_data['review'].str.extract('\d (.*)')

# Print initial DataFrame length
print(f"Initial train data length: {len(train_data)}")

test_data = pd.read_table('/content/stsa-test.txt', delimiter='\t', header=None)
test_data.columns = ['review']
test_data['sentiment'] = test_data['review'].str.extract('^(\d)')
test_data['review_text'] = test_data['review'].str.extract('\d (.*)')

# Split data
train_texts, val_texts, y_train, y_val = train_test_split(train_data['review_text'], train_data['sentiment'], test_size=0.2, random_state=42)
test_texts = test_data['review_text']
y_test = test_data['sentiment']

# Preprocess text
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = word_tokenize(text)
    # Consider handling missing values here (e.g., by removing empty tokens)
    tokens = [word for word in tokens if word not in stopwords.words('english')]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text



train_corpus = [preprocess_text(review) for review in train_texts]
val_corpus = [preprocess_text(review) for review in val_texts]
test_corpus = [preprocess_text(review) for review in test_texts]



# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_corpus)
X_val = tfidf_vectorizer.transform(val_corpus)
X_test = tfidf_vectorizer.transform(test_corpus)

# Tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
val_encodings = tokenizer(list(val_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
# Prepare PyTorch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels.astype(int)  # Convert labels to integers

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)  # Specify dtype as torch.long
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = Dataset(train_encodings, y_train)
val_dataset = Dataset(val_encodings, y_val)
test_dataset = Dataset(test_encodings, y_test)

# Check if the key exists in the DataFrame or Series index
key_to_check = 2445
if key_to_check in train_texts.index:
    train_text = train_texts.loc[key_to_check]
    train_label = y_train.loc[key_to_check]

    # Initialize the tokenizer and model
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

    # Tokenize the example text
    encoded_example = tokenizer(train_text, padding=True, truncation=True, return_tensors="pt")

    # Define training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=3,
        logging_dir='./logs',
    )

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=Dataset(train_encodings, y_train),  # Assuming you have Dataset class defined
    )

    # Fine-tune the model
    trainer.train()
else:
    print("Key", key_to_check, "does not exist in train_texts.")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cross-validation scores: [0.75539568 0.75539568 0.74820144 0.67625899 0.71014493 0.73913043
 0.74637681 0.73913043 0.67391304 0.65217391]
Mean CV accuracy: 0.7196121363778543
Test Accuracy: 0.7929708951125755
Precision: 0.7472118959107806
Recall: 0.8844884488448845
F1 Score: 0.8100755667506297
Cross-validation scores for SVM: [0.77697842 0.79856115 0.76258993 0.67625899 0.73913043 0.75362319
 0.70289855 0.73913043 0.6884058  0.65942029]
Mean CV accuracy for SVM: 0.7296997184860807
Test Accuracy for SVM: 0.7929708951125755
Precision for SVM: 0.7633663366336634
Recall for SVM: 0.8481848184818482
F1 Score for SVM: 0.8035435122459615
Cross-validation scores for KNN: [0.48920863 0.48201439 0.47482014 0.48201439 0.48550725 0.48550725
 0.48550725 0.48550725 0.48550725 0.48550725]
Mean CV accuracy for KNN: 0.4841101032217704
Test Accuracy for KNN: 0.5211422295442065
Precision for KNN: 0.5105653912050258
Recall for KNN: 0.9834983498349835
F1 Score for KNN: 0.6721804511278195
Cross-validation sc



Cross-Validation Scores: [0.49640288 0.51798561 0.52517986 0.51079137 0.50724638 0.51449275
 0.51449275 0.51449275 0.51449275 0.51449275]
Mean Cross-Validation Score: 0.5130069857157752
Accuracy: 0.499725425590335
Precision: 0.49944812362030905
Recall: 0.9955995599559956
F1 Score: 0.6651966188901138
Initial train data length: 6920


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Key 2445 does not exist in train_texts.


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

For this problem the Corona virus tweet data has been taken from https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification?select=Corona_NLP_train.csv

In [2]:
# Write your code here
import pandas as pd

# Load the dataset
data = pd.read_csv("/content/Corona_NLP_test.csv", encoding='latin1')
data = data[['OriginalTweet', 'Sentiment']]

# Display the first few rows of the dataset
data.head()
# Define the preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords.words('english')]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
# Preprocess the text data
data['clean_text'] = data['OriginalTweet'].apply(preprocess_text)
data.head()
# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer on the preprocessed text data and transform it into TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(data['clean_text'])

# Print the shape of the TF-IDF matrix
print("Shape of TF-IDF matrix:", tfidf_matrix.shape)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Initialize K-means with a random state for reproducibility
kmeans = KMeans(n_clusters=5, random_state=42)

# Fit K-means on the TF-IDF matrix
kmeans.fit(tfidf_matrix)

# Predict the cluster labels
cluster_labels = kmeans.predict(tfidf_matrix)

# Compute the silhouette score
silhouette_avg_kmeans = silhouette_score(tfidf_matrix, cluster_labels)
print("Silhouette Score:", silhouette_avg_kmeans)
from sklearn.cluster import DBSCAN

# Initialize DBSCAN with epsilon and minimum samples
dbscan = DBSCAN(eps=0.5, min_samples=5)

# Fit DBSCAN on the TF-IDF matrix
dbscan.fit(tfidf_matrix)

# Extract cluster labels and noise points
cluster_labels_dbscan = dbscan.labels_

# Number of clusters
num_clusters_dbscan = len(set(cluster_labels_dbscan)) - (1 if -1 in cluster_labels_dbscan else 0)

# Number of noise points
num_noise_points = list(cluster_labels_dbscan).count(-1)

print("Number of clusters found by DBSCAN:", num_clusters_dbscan)
print("Number of noise points:", num_noise_points)
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Calculate the linkage matrix using complete linkage and cosine distance
linkage_matrix = linkage(tfidf_matrix.toarray(), method='complete', metric='cosine')
from scipy.cluster.hierarchy import fcluster

# Determine the number of clusters using the dendrogram
max_d = 0.5  # Adjust this threshold based on the dendrogram
clusters = fcluster(linkage_matrix, max_d, criterion='distance')

# Compute the silhouette score for hierarchical clustering
silhouette_avg_hierarchical = silhouette_score(tfidf_matrix, clusters)
print("Silhouette Score for Hierarchical Clustering:", silhouette_avg_hierarchical)
from gensim.models import Word2Vec

# Tokenize the preprocessed text data
tokenized_text = [doc.split() for doc in data['clean_text']]

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_text, vector_size=100, window=5, min_count=1, workers=4)

# Function to average word vectors to obtain document vectors
def document_vector(word2vec_model, doc):
    """
    Obtain document vector by averaging word vectors for words in the document.
    """
    doc = [word for word in doc if word in word2vec_model.wv.key_to_index]
    if len(doc) != 0:
        return np.mean(word2vec_model.wv[doc], axis=0)
    else:
        return []

# Compute document vectors
document_vectors = np.array([document_vector(word2vec_model, doc) for doc in tokenized_text])

# Initialize K-means with the desired number of clusters
num_clusters_word2vec = 5  # Adjust as needed
kmeans_word2vec = KMeans(n_clusters=num_clusters_word2vec, random_state=42)

# Fit K-means on the document vectors obtained from Word2Vec
kmeans_word2vec.fit(document_vectors)

# Predict the cluster labels
cluster_labels_word2vec = kmeans_word2vec.labels_

# Compute the silhouette score (optional)
silhouette_avg_word2vec = silhouette_score(document_vectors, cluster_labels_word2vec)
print("Silhouette Score for K-means with Word2Vec:", silhouette_avg_word2vec)
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to obtain BERT embeddings for a sentence
def get_bert_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the output of the [CLS] token as the sentence embedding
    sentence_embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return sentence_embedding

# Function to obtain BERT embeddings for all sentences in the dataset
def get_all_bert_embeddings(sentences):
    embeddings = []
    for sentence in sentences:
        embeddings.append(get_bert_embedding(sentence))
    return np.array(embeddings)

# Get BERT embeddings for all sentences in the dataset
bert_embeddings = get_all_bert_embeddings(data['clean_text'])

# Initialize K-means with the desired number of clusters
num_clusters_bert = 5  # Adjust as needed
kmeans_bert = KMeans(n_clusters=num_clusters_bert, random_state=42)

# Fit K-means on the BERT embeddings
kmeans_bert.fit(bert_embeddings)

# Predict the cluster labels
cluster_labels_bert = kmeans_bert.labels_

# Compute the silhouette score (optional)
silhouette_avg_bert = silhouette_score(bert_embeddings, cluster_labels_bert)
print("Silhouette Score for K-means with BERT embeddings:", silhouette_avg_bert)

Shape of TF-IDF matrix: (3798, 12698)




Silhouette Score: 0.005922133189324662
Number of clusters found by DBSCAN: 1
Number of noise points: 3793
Silhouette Score for Hierarchical Clustering: 0.016925766315566776




Silhouette Score for K-means with Word2Vec: 0.49940592


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]



Silhouette Score for K-means with BERT embeddings: 0.03194037


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

The results of the different clustering methods and embedding techniques vary significantly. K-means clustering resulted in a very low silhouette score of 0.0059, indicating poor cluster separation. DBSCAN, on the other hand, failed to find meaningful clusters, with only one cluster identified and a large number of points classified as noise (3793). Hierarchical clustering yielded a slightly higher silhouette score of 0.0169, suggesting better clustering than K-means but still relatively weak separation. When using Word2Vec embeddings with K-means, a significantly higher silhouette score of 0.5001 was achieved, indicating much better cluster separation compared to other methods. Finally, employing BERT embeddings with K-means resulted in a silhouette score of 0.0319, which is higher than traditional text-based clustering methods but lower than Word2Vec. Overall, Word2Vec embeddings combined with K-means clustering appear to be the most effective approach for clustering in this scenario, while BERT embeddings show promise but may require further tuning or optimization.



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:it's very challenging to complete the given exercise within less time.The exercises provided a comprehensive learning experience in text analysis and machine learning techniques,
encompassing both text classification and clustering tasks. In the classification task, algorithms such as
MultinomialNB, SVM, KNN, Decision tree, Random Forest, and XGBoost were employed to classify sentiment in IMDB r
eviews.The evaluation metrics used were accuracy, recall, precision, and F1-score, with the models trained using
10-fold cross-validation and evaluated on test data. The exercises also focused on text clustering using various
methods, including K-means, DBSCAN, hierarchical clustering, Word2Vec, and BERT. While K-means, DBSCAN, and
hierarchical clustering were applied directly to the text data, Word2Vec and BERT were used to generate embeddings
for clustering.





'''