<a href="https://colab.research.google.com/github/Manaswini1912/INFO-5731/blob/main/Kodela_Manaswini_Exercise_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch
from sklearn.feature_extraction.text import CountVectorizer

# Load the dataset
train_data = pd.read_csv("/content/stsa-train.txt", sep='\t', header=None, names=['label', 'text'])
test_data = pd.read_csv("/content/stsa-test.txt", sep='\t', header=None, names=['label', 'text'])

# Preprocessing
train_data['text'] = train_data['text'].astype(str).apply(lambda x: x.lower())
test_data['text'] = test_data['text'].astype(str).apply(lambda x: x.lower())

# Preprocess target labels
train_data['label'] = train_data['label'].str.split(' ', expand=True)[0]  # Extract the label from the string
train_data['label'] = train_data['label'].astype(int)  # Convert the label to integer
test_data['label'] = test_data['label'].str.split(' ', expand=True)[0]  # Extract the label from the string
test_data['label'] = test_data['label'].astype(int)  # Convert the label to integer

# Split data into X and y
X = train_data['text']
y = train_data['label']

# Initialize models
models = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier()
}

# Define evaluation metrics
metrics = {
    'Accuracy': accuracy_score,
    'Precision': precision_score,
    'Recall': recall_score,
    'F1 Score': f1_score
}

# Perform 10-fold cross-validation on each algorithm
kf = KFold(n_splits=10, shuffle=True, random_state=42)

results = {}

for model_name, model in models.items():
    print("Training", model_name)
    val_scores = {metric: [] for metric in metrics}
    for train_index, val_index in kf.split(X):
        X_train_fold, X_val_fold = X.iloc[train_index], X.iloc[val_index]
        y_train_fold, y_val_fold = y.iloc[train_index], y.iloc[val_index]

        if model_name == 'Word2Vec':
            # Train Word2Vec model
            w2v_model = Word2Vec(sentences=[text.split() for text in X_train_fold], vector_size=100, window=5, min_count=1, workers=4)
            X_train_fold_vec = np.array([np.mean([w2v_model.wv[word] for word in text.split() if word in w2v_model.wv] or [np.zeros(100)], axis=0) for text in X_train_fold])
            X_val_fold_vec = np.array([np.mean([w2v_model.wv[word] for word in text.split() if word in w2v_model.wv] or [np.zeros(100)], axis=0) for text in X_val_fold])
            model.fit(X_train_fold_vec, y_train_fold)
            y_pred = model.predict(X_val_fold_vec)
        elif model_name == 'BERT':
            # Initialize BERT tokenizer and model
            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
            bert_model = BertModel.from_pretrained('bert-base-uncased')

            # Preprocess BERT data
            encoded_data_train = tokenizer(X_train_fold.tolist(), padding=True, truncation=True, return_tensors='pt')
            encoded_data_val = tokenizer(X_val_fold.tolist(), padding=True, truncation=True, return_tensors='pt')

            # Define labels for BERT
            labels_train = torch.tensor(y_train_fold.tolist())
            labels_val = torch.tensor(y_val_fold.tolist())

            # Define BERT data loader
            train_data = torch.utils.data.TensorDataset(encoded_data_train['input_ids'], encoded_data_train['attention_mask'], labels_train)
            val_data = torch.utils.data.TensorDataset(encoded_data_val['input_ids'], encoded_data_val['attention_mask'], labels_val)

            # Define training parameters
            batch_size = 32
            train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size)
            val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size)

            # Fine-tune BERT model
            optimizer = torch.optim.AdamW(bert_model.parameters(), lr=1e-5)
            criterion = torch.nn.CrossEntropyLoss()
            for epoch in range(3):
                bert_model.train()
                for batch in train_loader:
                    input_ids, attention_mask, labels = batch
                    optimizer.zero_grad()
                    outputs = bert_model(input_ids, attention_mask=attention_mask)
                    logits = outputs[0]
                    loss = criterion(logits, labels)
                    loss.backward()
                    optimizer.step()

            bert_model.eval()
            with torch.no_grad():
                y_pred = []
                for batch in val_loader:
                    input_ids, attention_mask, labels = batch
                    outputs = bert_model(input_ids, attention_mask=attention_mask)
                    logits = outputs[0]
                    _, predicted = torch.max(logits, 1)
                    y_pred.extend(predicted.cpu().numpy())
        else:
            # Use other classifiers
            vectorizer = CountVectorizer().fit(X_train_fold)
            X_train_fold_vec = vectorizer.transform(X_train_fold)
            X_val_fold_vec = vectorizer.transform(X_val_fold)
            model.fit(X_train_fold_vec, y_train_fold)
            y_pred = model.predict(X_val_fold_vec)

        for metric, score_func in metrics.items():
            if metric != 'Accuracy':
                score = score_func(y_val_fold, y_pred, average='weighted')  # Change 'binary' to 'weighted'
            else:
                score = score_func(y_val_fold, y_pred)
            val_scores[metric].append(score)

    # Compute average scores for each metric
    avg_val_scores = {metric: np.mean(scores) for metric, scores in val_scores.items()}
    results[model_name] = avg_val_scores

# Print validation results
print("\nValidation Results:")
for model_name, scores in results.items():
    print(model_name)
    for metric, score in scores.items():
        print(f"{metric}: {score:.4f}")

# Select the best model based on average validation scores
best_model_name = max(results, key=lambda x: results[x]['Accuracy'])
best_model = models[best_model_name]  # Get the actual model object

print("\nBest Model:", best_model_name)

# Train the best model on the entire training set
if best_model_name == 'Word2Vec':
    w2v_model = Word2Vec(sentences=[text.split() for text in X], vector_size=100, window=5, min_count=1, workers=4)
    X_train_vec = np.array([np.mean([w2v_model.wv[word] for word in text.split() if word in w2v_model.wv] or [np.zeros(100)], axis=0) for text in X])
    best_model.fit(X_train_vec, y)
elif best_model_name == 'BERT':
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    bert_model = BertModel.from_pretrained('bert-base-uncased')

    encoded_data_train = tokenizer(X.tolist(), padding=True, truncation=True, return_tensors='pt')
    labels_train = torch.tensor(y.tolist())
    train_data = torch.utils.data.TensorDataset(encoded_data_train['input_ids'], encoded_data_train['attention_mask'], labels_train)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=32)

    optimizer = torch.optim.AdamW(bert_model.parameters(), lr=1e-5)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(3):
        bert_model.train()
        for batch in train_loader:
            input_ids, attention_mask, labels = batch
            optimizer.zero_grad()
            outputs = bert_model(input_ids, attention_mask=attention_mask)
            logits = outputs[0]
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
else:
    vectorizer = CountVectorizer().fit(X)
    X_train_vec = vectorizer.transform(X)
    best_model.fit(X_train_vec, y)

# Evaluate the best model on the test set
if best_model_name == 'Word2Vec':
    X_test_vec = np.array([np.mean([w2v_model.wv[word] for word in text.split() if word in w2v_model.wv] or [np.zeros(100)], axis=0) for text in test_data['text']])
    y_pred_test = best_model.predict(X_test_vec)
elif best_model_name == 'BERT':
    encoded_data_test = tokenizer(test_data['text'].tolist(), padding=True, truncation=True, return_tensors='pt')
    labels_test = torch.tensor(test_data['label'].tolist())
    test_data = torch.utils.data.TensorDataset(encoded_data_test['input_ids'], encoded_data_test['attention_mask'], labels_test)
    test_loader = torch.utils.data.DataLoader(test_data, batch_size=32)

    bert_model.eval()
    with torch.no_grad():
        y_pred_test = []
        for batch in test_loader:
            input_ids, attention_mask, labels = batch
            outputs = bert_model(input_ids, attention_mask=attention_mask)
            logits = outputs[0]
            _, predicted = torch.max(logits, 1)
            y_pred_test.extend(predicted.cpu().numpy())
else:
    X_test_vec = vectorizer.transform(test_data['text'])
    y_pred_test = best_model.predict(X_test_vec)

# Evaluate test set metrics
test_metrics = {metric: score_func(test_data['label'], y_pred_test) for metric, score_func in metrics.items()}

# Print test set metrics
print("\nTest Set Metrics:")
for metric, score in test_metrics.items():
    print(f"{metric}: {score:.4f}")


Training MultinomialNB


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Training SVM


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Training KNN


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Training Decision Tree


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Training Random Forest


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Training XGBoost


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))



Validation Results:
MultinomialNB
Accuracy: 0.5217
Precision: 0.2723
Recall: 0.5217
F1 Score: 0.3578
SVM
Accuracy: 0.5217
Precision: 0.2723
Recall: 0.5217
F1 Score: 0.3578
KNN
Accuracy: 0.4945
Precision: 0.2452
Recall: 0.4945
F1 Score: 0.3276
Decision Tree
Accuracy: 0.5217
Precision: 0.2723
Recall: 0.5217
F1 Score: 0.3578
Random Forest
Accuracy: 0.5217
Precision: 0.2723
Recall: 0.5217
F1 Score: 0.3578
XGBoost
Accuracy: 0.5217
Precision: 0.2723
Recall: 0.5217
F1 Score: 0.3578

Best Model: MultinomialNB

Test Set Metrics:
Accuracy: 0.4992
Precision: 0.4992
Recall: 1.0000
F1 Score: 0.6659


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch

# Load the dataset
data = pd.read_csv("/content/Amazon_Unlocked_Mobile.csv")

# Sample the first 5000 rows
data = data.head(5000)

# Preprocess the text data
data['Reviews'] = data['Reviews'].astype(str)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(data['Reviews'])

# K-means Clustering
kmeans = KMeans(n_clusters=5, random_state=42)
data['KMeans_Cluster'] = kmeans.fit_predict(X_tfidf)

# DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
data['DBSCAN_Cluster'] = dbscan.fit_predict(X_tfidf.toarray())

# Hierarchical Clustering
agg_cluster = AgglomerativeClustering(n_clusters=5)
data['Hierarchical_Cluster'] = agg_cluster.fit_predict(X_tfidf.toarray())

# Word2Vec Clustering
word2vec_model = Word2Vec(sentences=[text.split() for text in data['Reviews']], vector_size=100, window=5, min_count=1, workers=4)
word2vec_features = np.array([np.mean([word2vec_model.wv[word] for word in text.split() if word in word2vec_model.wv] or [np.zeros(100)], axis=0) for text in data['Reviews']])
kmeans_word2vec = KMeans(n_clusters=5, random_state=42)
data['Word2Vec_Cluster'] = kmeans_word2vec.fit_predict(word2vec_features)

# BERT Clustering
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def encode_text(text):
    input_ids = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)['input_ids'].to(device)
    with torch.no_grad():
        outputs = bert_model(input_ids)
    return outputs.pooler_output.cpu().numpy()

bert_features = np.concatenate([encode_text(text) for text in data['Reviews']])
kmeans_bert = KMeans(n_clusters=5, random_state=42)
data['BERT_Cluster'] = kmeans_bert.fit_predict(bert_features)

# Display clustering results
print(data[['Reviews', 'KMeans_Cluster', 'DBSCAN_Cluster', 'Hierarchical_Cluster', 'Word2Vec_Cluster', 'BERT_Cluster']])





                                                Reviews  KMeans_Cluster  \
0     I feel so LUCKY to have found this used (phone...               1   
1     nice phone, nice up grade from my pantach revu...               2   
2                                          Very pleased               2   
3     It works good but it goes slow sometimes but i...               2   
4     Great phone to replace my lost phone. The only...               1   
...                                                 ...             ...   
4995  This review is not for the product as you may ...               1   
4996  The product was in good structure. I'm still n...               1   
4997  The iPhone was fine. It works and is in good c...               1   
4998                       Screen cracked really quick.               2   
4999  Will never buy anything again. I received it a...               2   

      DBSCAN_Cluster  Hierarchical_Cluster  Word2Vec_Cluster  BERT_Cluster  
0                 -1  

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

I have considered first 5000 samples in teh given dataset, as there were errors popping out that large data cannot be handled and requires more time to execute.
The clustering methods analyzed the Amazon mobile reviews to group similar ones together. K-means divided the reviews into clusters based on their content, but some reviews were still mixed together. DBSCAN mostly marked the reviews as noise, meaning they didn't fit well into any specific group. Hierarchical clustering put all reviews into one big group, which may not be very useful. Word2Vec and BERT, which understand the meaning of words, created more diverse clusters, capturing different aspects of the reviews. Overall, Word2Vec and BERT performed a better job of organizing the reviews based on their meaning.


.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



The entire exercises and assignments were quite challenging, but they provided a valuable learning experience. Understanding and implementing different machine learning algorithms required careful consideration of various parameters and techniques. Despite the difficulties, I tried my best to comprehend the concepts and perform the required actions. Each algorithm demanded a unique approach and parameter tuning, which helped me deepen my understanding. Overall, while challenging, this assignment was a great practice different clustering methods for text data.
