
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean", *i.e.*, it contains no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and to evaluate their performance. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.
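The split-then-cross-validate workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the required solution: the toy corpus, the choice of `StratifiedKFold`, and the `MultinomialNB` classifier are placeholders picked for the example, and the real Canvas files would replace the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); replace with the Canvas training file.
positive = ["great film", "loved the acting", "wonderful story", "a fine movie", "truly enjoyable"]
negative = ["awful film", "hated the acting", "boring story", "a bad movie", "truly dreadful"]
texts = (positive + negative) * 5   # 50 documents
labels = ([1] * 5 + [0] * 5) * 5    # 1 = positive, 0 = negative

# 80% training / 20% validation, stratified so both classes appear in each part.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# The vectorizer sits inside the pipeline, so TF-IDF is refit on each training fold.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_validate(pipeline, X_train, y_train, cv=cv, scoring=["accuracy", "f1"])
print("folds:", len(cv_results["test_accuracy"]))
print("mean CV accuracy:", cv_results["test_accuracy"].mean())

# Fit once on the full training portion, then score the held-out validation split.
pipeline.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, pipeline.predict(X_val))
print("validation accuracy:", val_accuracy)
```

The same fitted `pipeline` would then be applied to the separate test file for the final evaluation; keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from validation folds into training.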


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
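As a reference for how the four measures relate, here is a small sketch that computes them from raw predictions for a binary task. The helper name `binary_metrics` and the zero-denominator convention are assumptions made for this illustration; in the notebook itself the scikit-learn scorers do this work.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, precision and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn

    accuracy = (tp + tn) / len(y_true)
    # Convention assumed here: a ratio with a zero denominator is reported as 0.0.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0]
mixed = binary_metrics(y_true, [1, 1, 0, 0, 0, 1])
print(mixed)

# A classifier that always predicts 0 reaches 50% accuracy on balanced data,
# yet its recall, precision and F1 for the positive class are all 0.
always_negative = binary_metrics(y_true, [0, 0, 0, 0, 0, 0])
print(always_negative)
```

The second call shows why accuracy alone can mislead: a model that never predicts the positive class still looks half right on a balanced set.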


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
import numpy as np

# Load data
def load_data(file_path):
    data = []
    with open(file_path, 'r') as file:
        for line in file:
            split_line = line.strip().split(' ', 1)
            sentiment = int(split_line[0])
            text = split_line[1] if len(split_line) > 1 else ""
            data.append((sentiment, text))
    return pd.DataFrame(data, columns=['Sentiment', 'Text'])

train_df = load_data('/content/stsa-train.txt')

# Split data into training (80%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(train_df['Text'], train_df['Sentiment'], test_size=0.2, random_state=42)

# Define a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)

# Scoring metrics for evaluation
scoring = ['accuracy', 'recall', 'precision', 'f1']

# Define models
models = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(kernel='linear'),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss')
}

# Evaluate models
# Note: cv=5 runs 5-fold cross-validation; set cv=10 for the 10-fold setup the exercise asks for.
results = {}
for name, model in models.items():
    pipeline = make_pipeline(tfidf_vectorizer, model)
    cv_results = cross_validate(pipeline, X_train, y_train, cv=5, scoring=scoring)
    results[name] = {
        'Accuracy': np.mean(cv_results['test_accuracy']),
        'Recall': np.mean(cv_results['test_recall']),
        'Precision': np.mean(cv_results['test_precision']),
        'F1 Score': np.mean(cv_results['test_f1'])
    }

# Display results
for model, metrics in results.items():
    print(f"Results for {model}:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")
    print("\n")


Results for MultinomialNB:
Accuracy: 0.7650
Recall: 0.8485
Precision: 0.7404
F1 Score: 0.7907


Results for SVM:
Accuracy: 0.7543
Recall: 0.7684
Precision: 0.7637
F1 Score: 0.7660


Results for DecisionTree:
Accuracy: 0.6371
Recall: 0.6358
Precision: 0.6612
F1 Score: 0.6456


Results for RandomForest:
Accuracy: 0.7144
Recall: 0.7373
Precision: 0.7231
F1 Score: 0.7297


Results for XGBoost:
Accuracy: 0.6640
Recall: 0.7452
Precision: 0.6614
F1 Score: 0.6972




In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.pipeline import make_pipeline
import numpy as np


# Define the XGBoost model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Create a pipeline with TF-IDF and XGBoost
pipeline = make_pipeline(tfidf_vectorizer, xgb_model)

# Scoring metrics for evaluation
scoring = ['accuracy', 'recall', 'precision', 'f1']

# Perform cross-validation
cv_results = cross_validate(pipeline, X_train, y_train, cv=5, scoring=scoring)

# Calculate average scores
average_results = {
    'Accuracy': np.mean(cv_results['test_accuracy']),
    'Recall': np.mean(cv_results['test_recall']),
    'Precision': np.mean(cv_results['test_precision']),
    'F1 Score': np.mean(cv_results['test_f1'])
}

# Print results
print("XGBoost Model Evaluation:")
for metric, value in average_results.items():
    print(f"{metric}: {value:.4f}")


XGBoost Model Evaluation:
Accuracy: 0.6640
Recall: 0.7452
Precision: 0.6614
F1 Score: 0.6972


In [10]:
import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler, random_split
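from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of the PyTorch optimizer
from transformers import BertTokenizer, BertForSequenceClassification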
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Load the data with a limit of 10000 records
def load_data(filename, max_records=10000):
    texts, labels = [], []
    with open(filename, 'r') as file:
        for i, line in enumerate(file):
            if i >= max_records:
                break
            texts.append(line[2:].strip())
            labels.append(int(line[0]))
    return texts, labels

train_texts, train_labels = load_data('stsa-train.txt')
test_texts, test_labels = load_data('stsa-test.txt')

# Tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

class SentimentDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        encoded = tokenizer.encode_plus(
            self.texts[idx],
            add_special_tokens=True,
            max_length=256,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoded['input_ids'].flatten(),
            'attention_mask': encoded['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# Dataset
train_val_dataset = SentimentDataset(train_texts, train_labels)
test_dataset = SentimentDataset(test_texts, test_labels)

# Splitting train and validation
train_size = int(0.8 * len(train_val_dataset))
val_size = len(train_val_dataset) - train_size
train_dataset, val_dataset = random_split(train_val_dataset, [train_size, val_size])

# DataLoader
train_loader = DataLoader(train_dataset, batch_size=16, sampler=RandomSampler(train_dataset))
val_loader = DataLoader(val_dataset, batch_size=16, sampler=SequentialSampler(val_dataset))
test_loader = DataLoader(test_dataset, batch_size=16, sampler=SequentialSampler(test_dataset))

# Model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training with cross-validation
kf = KFold(n_splits=10)
for fold, (train_idx, val_idx) in enumerate(kf.split(train_dataset)):
    print(f"Fold {fold+1}")
    train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
    val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)

    # Both fold loaders draw from train_dataset: val_idx indexes into it, not into val_dataset
    train_fold_loader = DataLoader(train_dataset, batch_size=16, sampler=train_subsampler)
    val_fold_loader = DataLoader(train_dataset, batch_size=16, sampler=val_subsampler)

    # TODO: the training loop is not implemented, so the model evaluated below
    # still has a randomly initialized classification head

# Evaluation on test data
model.eval()
predictions, true_labels = [], []
for batch in test_loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions.extend(torch.argmax(logits, dim=-1).tolist())
    true_labels.extend(batch['labels'].tolist())

accuracy = accuracy_score(true_labels, predictions)
recall = recall_score(true_labels, predictions)
precision = precision_score(true_labels, predictions)
f1 = f1_score(true_labels, predictions)

print(f"Accuracy: {accuracy}")
print(f"Recall: {recall}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fold 1
Fold 2
Fold 3
Fold 4
Fold 5
Fold 6
Fold 7
Fold 8
Fold 9
Fold 10
Accuracy: 0.500274574409665
Recall: 0.0
Precision: 0.0
F1 Score: 0.0
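The chance-level accuracy and zero recall above are what one would expect from a model whose classification head was never trained: the fold loop only prints the fold number. A minimal sketch of the missing training step follows. The function name `train_one_epoch` is an assumption for illustration, and the tiny `ToyClassifier` is a hypothetical stand-in that mimics the `loss`/`logits` output of a Hugging Face model so the example runs without downloading BERT.

```python
from types import SimpleNamespace

import torch
from torch import nn
from torch.utils.data import DataLoader


def train_one_epoch(model, loader, optimizer, device):
    """Run one pass over `loader`, updating `model`; return the mean batch loss."""
    model.train()
    total_loss = 0.0
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        outputs = model(**batch)  # with 'labels' in the batch, the output carries a loss
        outputs.loss.backward()
        optimizer.step()
        total_loss += outputs.loss.item()
    return total_loss / max(len(loader), 1)


class ToyClassifier(nn.Module):
    """Hypothetical stand-in that accepts the same batch keys as the notebook's model."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, input_ids, attention_mask, labels):
        logits = self.linear(input_ids.float() * attention_mask.float())
        return SimpleNamespace(loss=nn.functional.cross_entropy(logits, labels), logits=logits)


torch.manual_seed(0)
rows = [([1, 0, 0, 0], 1), ([0, 1, 0, 0], 1), ([0, 0, 1, 0], 0), ([0, 0, 0, 1], 0)]
samples = [
    {
        "input_ids": torch.tensor(row),
        "attention_mask": torch.ones(4, dtype=torch.long),
        "labels": torch.tensor(label),
    }
    for row, label in rows
]
loader = DataLoader(samples, batch_size=4)
toy_model = ToyClassifier()
# Plain SGD keeps the toy run simple; the notebook's AdamW would be passed the same way.
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.5)
losses = [train_one_epoch(toy_model, loader, toy_optimizer, torch.device("cpu")) for _ in range(20)]
print(f"first epoch loss {losses[0]:.3f}, last epoch loss {losses[-1]:.3f}")
```

In the notebook, the same function could be called inside each fold with `train_fold_loader`, the BERT `model`, its `optimizer`, and `device`. For folds to be independent, the model and optimizer would also need to be re-created at the start of each fold.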


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering

In [7]:
pip install pandas scikit-learn gensim sentence-transformers



In [8]:
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer
from sklearn.metrics import silhouette_score

# Load the dataset
data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')
# Basic text preprocessing
def preprocess_text(text):
    text = re.sub(r'\W', ' ', str(text))
    text = text.lower()
    text = re.sub(r'\s+[a-z]\s+', ' ', text)
    text = re.sub(r'^[a-z]\s+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# Considering only the 'Reviews' column for clustering
# Sample 5000 reviews for quick processing; random_state fixes the sample so reruns are repeatable
texts = data['Reviews'].dropna().sample(5000, random_state=42)
texts = texts.apply(preprocess_text)

# Text representation with TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.7)
X = vectorizer.fit_transform(texts)

# Clustering Algorithms

## K-means
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans_clusters = kmeans.fit_predict(X)

## DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_clusters = dbscan.fit_predict(X)

## Hierarchical Clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_clusters = hierarchical.fit_predict(X.toarray())

## Word2Vec
tokenized_texts = [text.split() for text in texts]
model_w2v = Word2Vec(tokenized_texts, vector_size=100, window=5, min_count=2, sg=1)

# Generate a feature vector for each review by averaging the Word2Vec vectors of all words in the review
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0

    for word in words:
        if word in vocabulary:
            nwords = nwords + 1
            feature_vector = np.add(feature_vector, model.wv[word])

    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
    return feature_vector

vocabulary = set(model_w2v.wv.index_to_key)
w2v_features = np.array([average_word_vectors(text, model_w2v, vocabulary, 100) for text in tokenized_texts])

# Clustering the Word2Vec vectors
kmeans_w2v = KMeans(n_clusters=5, random_state=0)
w2v_clusters = kmeans_w2v.fit_predict(w2v_features)

## BERT
model_bert = SentenceTransformer('bert-base-nli-mean-tokens')
bert_embeddings = model_bert.encode(texts.tolist(), show_progress_bar=True)
kmeans_bert = KMeans(n_clusters=5, random_state=0)
bert_clusters = kmeans_bert.fit_predict(bert_embeddings)

# Evaluation with Silhouette Score
print("Silhouette Score for K-means:", silhouette_score(X, kmeans_clusters))
print("Silhouette Score for DBSCAN:", silhouette_score(X, dbscan_clusters) if len(set(dbscan_clusters)) > 1 else "Only one cluster or noise")
print("Silhouette Score for Hierarchical:", silhouette_score(X.toarray(), hierarchical_clusters))
print("Silhouette Score for Word2Vec:", silhouette_score(w2v_features, w2v_clusters))
print("Silhouette Score for BERT:", silhouette_score(bert_embeddings, bert_clusters))

# Summary
print(f'K-means clusters: {set(kmeans_clusters)}')
print(f'DBSCAN clusters: {set(dbscan_clusters)}')
print(f'Hierarchical clusters: {set(hierarchical_clusters)}')
print(f'Word2Vec clusters: {set(w2v_clusters)}')
print(f'BERT clusters: {set(bert_clusters)}')







Silhouette Score for K-means: 0.04248746135308897
Silhouette Score for DBSCAN: -0.12352960060707159
Silhouette Score for Hierarchical: 0.039684725245411265
Silhouette Score for Word2Vec: 0.27772418651173086
Silhouette Score for BERT: 0.1436578
K-means clusters: {0, 1, 2, 3, 4}
DBSCAN clusters: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, -1}
Hierarchical clusters: {0, 1, 2, 3, 4}
Word2Vec clusters: {0, 1, 2, 3, 4}
BERT clusters: {0, 1, 2, 3, 4}


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

The clustering methods applied to a 5,000-review sample of the Amazon dataset give markedly different results. K-means and hierarchical clustering on TF-IDF features, each partitioning the reviews into 5 groups, have low Silhouette Scores of 0.042 and 0.040 respectively, indicating weakly separated clusters. DBSCAN scores -0.124 and produces 30 clusters plus a noise group, which points to its sensitivity to the `eps` and `min_samples` settings and to the difficulty of density-based clustering on sparse, high-dimensional TF-IDF vectors. K-means on averaged Word2Vec embeddings gives the highest score, 0.278, suggesting that the dense embeddings yield better-separated clusters. K-means on BERT sentence embeddings reaches 0.144, higher than the TF-IDF-based methods but below Word2Vec, possibly because the pre-trained embeddings are less tailored to this dataset than a Word2Vec model trained on it. Because each score is computed in its own feature space, the comparison is indicative rather than exact. These results underline the importance of choosing the representation and the clustering method to suit the characteristics and scale of the data.
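One caveat about the DBSCAN figure: the silhouette score in the cell above treats the noise label `-1` as if it were a cluster of its own, which can pull the score down. The sketch below scores only the points DBSCAN actually assigned to a cluster. The helper name `silhouette_without_noise` and the synthetic blob data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_without_noise(X, labels):
    """Silhouette score over clustered points only; None if it is undefined."""
    labels = np.asarray(labels)
    keep = np.flatnonzero(labels != -1)  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[keep]))
    # silhouette_score needs at least 2 clusters and fewer clusters than samples
    if n_clusters < 2 or n_clusters >= len(keep):
        return None
    return silhouette_score(X[keep], labels[keep])


# Synthetic stand-in data (hypothetical): three tight blobs plus two far-away outliers.
X_blobs, _ = make_blobs(
    n_samples=150, centers=[(0, 0), (10, 10), (-10, 10)], cluster_std=0.3, random_state=0
)
X_demo = np.vstack([X_blobs, [[50, 50], [-50, -50]]])

demo_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_demo)
demo_score = silhouette_without_noise(X_demo, demo_labels)
print("labels found:", sorted(set(demo_labels.tolist())))
print("share of noise points:", float(np.mean(demo_labels == -1)))
print("silhouette without noise:", demo_score)
```

Reporting the share of noise points next to the score matters, since a high score over very few clustered points says little about the dataset as a whole.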





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
The exercises in this assignment effectively demonstrate the application of various machine learning techniques to sentiment analysis and text clustering.
They offer a valuable practical experience in handling text data, feature extraction, and evaluating the performance of different models, providing a robust foundation for more advanced studies in natural language processing.




'''