# **The fifth in-class-exercise (40 points in total, 4/18/2023)**

(20 points) The purpose of this question is to practice different machine learning algorithms for text classification as well as performance evaluation. In addition, you are required to conduct *10-fold cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html)* during training.

The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.

Algorithms:

(1) MultinomialNB

(2) SVM

(3) KNN

(4) Decision tree

(5) Random Forest

(6) XGBoost

(7) Word2Vec

(8) BERT

Evaluation measurement:

(1) Accuracy

(2) Recall

(3) Precision

(4) F-1 score
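
As a minimal sketch of how 10-fold cross-validation can report all four measures at once, the example below wraps the TF-IDF vectorizer and the classifier in one pipeline, so the vectorizer is re-fitted on the training part of every fold. The toy corpus and its integer labels are hypothetical stand-ins for the IMDB files, and `MultinomialNB` is just one of the listed algorithms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the IMDB data): 10 positive and 10 negative reviews.
pos_words = ["great", "wonderful", "moving", "funny", "brilliant"]
neg_words = ["boring", "awful", "dull", "tedious", "terrible"]
texts = [f"{a} film with {b} acting" for a in pos_words for b in pos_words[:2]]
texts += [f"{a} film with {b} acting" for a in neg_words for b in neg_words[:2]]
labels = [1] * 10 + [0] * 10

# Vectorizer and classifier in one estimator: TF-IDF is fitted inside each training fold only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Same fold setup as the exercise asks for: 10 folds, shuffled.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(model, texts, labels, cv=cv,
                        scoring=["accuracy", "precision", "recall", "f1"])

# Report the mean of each measure over the 10 folds.
for name in ["accuracy", "precision", "recall", "f1"]:
    print(f"{name}: {scores['test_' + name].mean():.4f}")
```

On a corpus this small, some folds contain a single class, so scikit-learn may warn that precision or recall is undefined for that fold and count it as 0; this should not be an issue on the full training set.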

In [1]:
# Write your code here

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.pipeline import Pipeline
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")


def files(text):
    file_path = '/content/stsa-{}.txt'.format(text)

    # Each line has the form "<label> <review text>", where the label is a single character
    data = pd.read_csv(file_path, sep='\t', header=None)
    reviews = []
    sentiments = []

    for i in data:
        for j in data[i]:
            sentiment = j[0]  # first character is the label ('0' or '1')
            review = j[2:]    # the rest of the line is the review text

            reviews.append(review)
            sentiments.append(sentiment)

    data = pd.DataFrame({'Sentiment': sentiments, 'Review': reviews})
    return data


# Load the training data and split it 80/20 into training and validation sets
train_data = files(text="train")
X_train, X_val, y_train, y_val = train_test_split(train_data['Review'], train_data['Sentiment'], test_size=0.2,
                                                  random_state=42)

# Handle missing values
X_train = X_train.fillna('')

# Load the test data
test_data = files(text="test")
X_test, y_test = test_data['Review'], test_data['Sentiment']

# Define classifiers
classifiers = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
}

# Tokenize the reviews and train Word2Vec
# (passing sentences= to the constructor builds the vocabulary and trains the model)
sentences = [review.split() for review in X_train]
w2v_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)

# Continue training for 10 more epochs
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=10)

# Save the Word2Vec model
w2v_model.save("word2vec.model")

# BERT Tokenizer and Model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

classifiers['Word2Vec'] = w2v_model
classifiers['BERT'] = Pipeline([
    ('tokenizer', tokenizer),
    ('classifier', bert_model)
])

# Evaluation metrics
metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, pos_label='1'),
    'recall': make_scorer(recall_score, pos_label='1'),
    'f1': make_scorer(f1_score, pos_label='1'),
}

# Convert string labels to numeric labels for XGBoost
label_encoder = LabelEncoder()
y_train_numeric = label_encoder.fit_transform(y_train)

# 10-fold cross-validation
cv = KFold(n_splits=10, shuffle=True, random_state=42)

for clf_name, clf in classifiers.items():
    print(f"\nTraining and evaluating {clf_name}")

    if clf_name not in ['Word2Vec', 'BERT']:
        # Text vectorization
        vectorizer = TfidfVectorizer()
        X_train_vectorized = vectorizer.fit_transform(X_train)
        scoring = metrics  # 'metrics' should be a dictionary containing the scoring metrics
        scores = cross_validate(clf, X_train_vectorized, y_train, cv=cv, scoring=scoring)

        # Evaluate on the validation set
        X_val_vectorized = vectorizer.transform(X_val)
        clf.fit(X_train_vectorized, y_train)
        y_val_pred = clf.predict(X_val_vectorized)

    else:
        # Word2Vec and BERT are not scikit-learn estimators, so they cannot be passed to
        # cross_validate directly; each needs its own feature-extraction or fine-tuning step.
        # Skip them so the scores of the previous classifier are not printed again.
        print(f"Skipping {clf_name}: not a scikit-learn estimator")
        continue

    # Display results
    print(f"\n{clf_name} Cross-Validation Results:")
    for metric, score in scores.items():
        print(f"{metric}: {score.mean():.4f}")

    # Evaluate on the validation set
    print(f"\n{clf_name} Validation Results:")
    print(f"Accuracy: {accuracy_score(y_val, y_val_pred):.4f}")
    print(f"Precision: {precision_score(y_val, y_val_pred, pos_label='1')}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training and evaluating MultinomialNB

MultinomialNB Cross-Validation Results:
fit_time: 0.0203
score_time: 0.0210
test_accuracy: 0.7796
test_precision: 0.7597
test_recall: 0.8475
test_f1: 0.8008

MultinomialNB Validation Results:
Accuracy: 0.7934
Precision: 0.7532621589561092

Training and evaluating SVM

SVM Cross-Validation Results:
fit_time: 3.6790
score_time: 0.3552
test_accuracy: 0.7751
test_precision: 0.7706
test_recall: 0.8118
test_f1: 0.7903

SVM Validation Results:
Accuracy: 0.7977
Precision: 0.7730138713745272

Training and evaluating KNN

KNN Cross-Validation Results:
fit_time: 0.0084
score_time: 1.0230
test_accuracy: 0.7157
test_precision: 0.7229
test_recall: 0.7403
test_f1: 0.7311

KNN Validation Results:
Accuracy: 0.7298
Precision: 0.7198443579766537

Training and evaluating DecisionTree

DecisionTree Cross-Validation Results:
fit_time: 0.6801
score_time: 0.0129
test_accuracy: 0.6059
test_precision: 0.6226
test_recall: 0.6276
test_f1: 0.6246

DecisionTree Validation Res

(20 points) The purpose of this question is to practice different machine learning algorithms for text clustering.
Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data if you want.)

Apply the listed clustering methods to the dataset:

K-means

DBSCAN

Hierarchical clustering

Word2Vec

BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Load the dataset
data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')  # Replace with the actual path to your dataset

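# Assuming the text data is in the 'Reviews' column.
# Work on a random sample (2000 is an arbitrary choice): DBSCAN, hierarchical clustering,
# t-SNE, and BERT below need dense arrays or per-review inference and do not scale to the full dataset.
reviews = data['Reviews'].dropna().astype(str)
reviews = reviews.sample(n=min(2000, len(reviews)), random_state=42)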

# K-means clustering
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(reviews)
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X.toarray())

# Hierarchical clustering
# Ward linkage always uses Euclidean distance; the deprecated 'affinity' argument is omitted
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(X.toarray())

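# Word2Vec model
sentences = [review.split() for review in reviews]
w2v_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)

# Represent each non-empty review by the mean of its word vectors, then cluster with K-means
w2v_embeddings = [w2v_model.wv[tokens].mean(axis=0) for tokens in sentences if tokens]
w2v_labels = KMeans(n_clusters=3, random_state=42).fit_predict(w2v_embeddings)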

# BERT clustering (using embeddings)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize for BERT
tokenized_reviews = [tokenizer.encode(review, max_length=512, truncation=True) for review in reviews]

# Extract BERT embeddings
bert_embeddings = []
for tokens in tokenized_reviews:
    input_ids = torch.tensor([tokens])
    with torch.no_grad():
        outputs = bert_model(input_ids)
        embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
        bert_embeddings.append(embeddings)

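# Flatten BERT embeddings
bert_embeddings = [embedding.flatten() for embedding in bert_embeddings]

# Cluster the BERT embeddings with K-means
bert_labels = KMeans(n_clusters=3, random_state=42).fit_predict(bert_embeddings)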

# Visualization using t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X.toarray())

# Plot each method's cluster assignments in its own panel (same t-SNE coordinates),
# so the three sets of markers do not hide each other
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, labels) in zip(axes, [('K-means', kmeans_labels), ('DBSCAN', dbscan_labels),
                                     ('Hierarchical', hierarchical_labels)]):
    ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='viridis', s=10)
    ax.set_title(name)
fig.suptitle('Clustering Visualization')
plt.show()

# Display or further analyze the results based on your requirements




In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.

The clustering techniques produced notably different outcomes on the "Amazon Reviews Unlocked Mobile Phones" dataset. K-means gave distinct cluster separation, but the quality of the result depended strongly on the chosen number of clusters. DBSCAN, being density-based, can find clusters of varied shapes and sizes and marks low-density points as noise, so it adapts to the inherent structure of the data, though it is sensitive to `eps` and `min_samples`. Hierarchical clustering builds nested clusters, which shows how smaller groups of reviews merge into larger ones. Word2Vec and BERT are embedding methods rather than clustering algorithms: Word2Vec captures semantic similarity between words, while BERT produces contextual embeddings with a better grasp of context, although fine-tuning might be necessary for optimal clustering performance. Ultimately, the most suitable method depends on the goals of the analysis and the characteristics of the dataset, with each approach presenting distinct advantages and considerations.
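
A comparison like this can be backed with a number. The sketch below computes the silhouette score for K-means, DBSCAN, and hierarchical clustering on TF-IDF vectors of a hypothetical toy corpus. The parameter values are illustrative assumptions, and the silhouette score is one possible internal measure, not one prescribed by the exercise.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy reviews (three rough topics), standing in for the Amazon dataset.
reviews = [
    "battery life is great", "battery drains too fast", "battery lasts two days",
    "battery charges slowly", "screen is bright and sharp", "screen cracked quickly",
    "screen colors look dull", "screen scratches easily", "shipping was fast",
    "shipping took three weeks", "shipping box arrived damaged", "shipping was free",
]
X = TfidfVectorizer().fit_transform(reviews).toarray()


def silhouette_or_nan(X, labels):
    """Silhouette score, or NaN when it is undefined (fewer than 2 labels, or one label per sample)."""
    n_labels = len(set(labels))
    if n_labels < 2 or n_labels >= len(labels):
        return float("nan")
    return silhouette_score(X, labels)


models = {
    "K-means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=2),
    "Hierarchical": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # Note: DBSCAN's noise label (-1) is treated as a cluster of its own here.
    results[name] = silhouette_or_nan(X, labels)
    print(f"{name}: {results[name]:.3f}")
```

Scores closer to 1 indicate tighter, better-separated clusters; a NaN means the method returned a labeling for which the score is undefined.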