
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and their performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
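
The workflow described above (80/20 split, 10-fold cross-validation on the training portion, then a held-out check) can be sketched roughly as follows. This is a minimal illustration, not the required solution: the `texts` and `labels` lists are made-up stand-ins for the Canvas files, and MultinomialNB is used only as one example classifier. Wrapping the vectorizer and classifier in a pipeline is a design choice so that TF-IDF statistics are refit inside each fold rather than computed once on all the data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real training file.
texts = ["great movie", "loved it", "wonderful acting", "really enjoyable",
         "terrible plot", "hated it", "awful acting", "really boring"] * 10
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 10

# 80% training / 20% validation, keeping the class balance in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Vectorizer + classifier in one estimator, so each CV fold fits its own TF-IDF.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training portion only.
metrics = ["accuracy", "precision", "recall", "f1"]
cv_scores = cross_validate(model, X_tr, y_tr, cv=10, scoring=metrics)
mean_cv = {m: cv_scores[f"test_{m}"].mean() for m in metrics}

# Fit on the full training portion, then score the held-out validation set.
model.fit(X_tr, y_tr)
val_accuracy = model.score(X_val, y_val)

print(mean_cv)
print(f"Validation accuracy: {val_accuracy:.4f}")
```

With the real data, the same fitted `model` would be scored once more on the separate test file after model selection is finished.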


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score


In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer

# Load the training and test data
train_data = pd.read_csv('stsa-train.txt', sep='\t', names=['review', 'label'])
test_data = pd.read_csv('stsa-test.txt', sep='\t', names=['review', 'label'])

# Replace missing reviews with an empty string (customize this based on your data).
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
train_data['review'] = train_data['review'].fillna('')

# Split the training data into features (X) and labels (y)
X_train = train_data['review']
y_train = train_data['label']

# Text vectorization using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, sublinear_tf=True)
X_train_tfidf = vectorizer.fit_transform(X_train)

# Algorithms
algorithms = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier()
}

# Train and evaluate each algorithm
for algo_name, algo in algorithms.items():
    print(f"\n{algo_name}:")

    # Check information about the target variable
    print("Target Variable Information:")
    print(f"Number of NaN values in y_train: {y_train.isnull().sum()}")
    print(f"Unique values in y_train: {y_train.unique()}")

    # Check if there are enough valid documents for training and if the target variable has NaN values
    if X_train_tfidf.shape[0] > 0 and not y_train.isnull().any():
        # Train the model
        algo.fit(X_train_tfidf, y_train)

        # Evaluate using 10-fold cross-validation
        scores = cross_validate(algo, X_train_tfidf, y_train, cv=10,
                                scoring=['accuracy', 'precision', 'recall', 'f1'])

        # Print the average scores
        print(f"Average Accuracy: {scores['test_accuracy'].mean():.4f}")
        print(f"Average Precision: {scores['test_precision'].mean():.4f}")
        print(f"Average Recall: {scores['test_recall'].mean():.4f}")
        print(f"Average F1 Score: {scores['test_f1'].mean():.4f}")
    else:
        print("Not enough valid documents for training or target variable has NaN values.")



MultinomialNB:
Target Variable Information:
Number of NaN values in y_train: 6920
Unique values in y_train: [nan]
Not enough valid documents for training or target variable has NaN values.

SVM:
Target Variable Information:
Number of NaN values in y_train: 6920
Unique values in y_train: [nan]
Not enough valid documents for training or target variable has NaN values.

KNN:
Target Variable Information:
Number of NaN values in y_train: 6920
Unique values in y_train: [nan]
Not enough valid documents for training or target variable has NaN values.

Decision Tree:
Target Variable Information:
Number of NaN values in y_train: 6920
Unique values in y_train: [nan]
Not enough valid documents for training or target variable has NaN values.

Random Forest:
Target Variable Information:
Number of NaN values in y_train: 6920
Unique values in y_train: [nan]
Not enough valid documents for training or target variable has NaN values.

XGBoost:
Target Variable Information:
Number of NaN values in y_train
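
The recorded output reports all 6920 labels as NaN. One plausible cause is that the lines in `stsa-train.txt` contain no tab character, so `pd.read_csv(..., sep='\t')` puts each whole line into `review` and leaves `label` empty. The sketch below assumes (this is an assumption about the file, not something the output confirms) that each line has the form `<label> <review text>` separated by a single space. The helper name `parse_labeled_lines` and the sample lines are hypothetical.

```python
def parse_labeled_lines(lines):
    """Split lines of the assumed form '<label> <review text>' into two lists."""
    reviews, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Everything before the first space is the label; the rest is the text.
        label, _, text = line.partition(" ")
        labels.append(int(label))
        reviews.append(text)
    return reviews, labels


# Made-up sample lines in the assumed format.
sample = ["1 a warm and funny film", "0 a dull , lifeless story"]
reviews, labels = parse_labeled_lines(sample)
print(reviews)
print(labels)
```

If the real file matches this layout, passing an open file handle (`with open('stsa-train.txt') as f: ...`) to the helper and building a DataFrame from the two lists should give a `label` column of 0/1 integers, at which point the training loop can actually run.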

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering

In [15]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import numpy as np
import torch
import matplotlib.pyplot as plt


In [16]:
# Load the dataset
data = pd.read_csv("Amazon_Unlocked_Mobile.csv")
# Selecting a subset of the data to speed up computation
data = data.sample(frac=0.05, random_state=42)


In [17]:
# Preprocessing
# Remove missing values
data.dropna(inplace=True)

# Text Vectorization
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(data['Reviews'])

# PCA for dimensionality reduction
pca = PCA(n_components=100)
X_pca = pca.fit_transform(X.toarray())
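
The cell above converts the sparse TF-IDF matrix to a dense array before PCA. As a possible alternative, scikit-learn's `TruncatedSVD` accepts sparse input directly, which avoids the dense copy. The sketch below uses a few made-up reviews (`docs` is hypothetical) and only two components for brevity. Note that `TruncatedSVD` does not center the data, so its output is not identical to PCA.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for data['Reviews'].
docs = [
    "battery life is great",
    "screen cracked after a week",
    "great phone for the price",
    "battery died quickly",
    "camera quality is great",
    "phone arrived with a cracked screen",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one row per review

# Reduce dimensionality without calling .toarray() on the sparse matrix.
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf)

print(reduced.shape)
```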

In [18]:
# Clustering with K-means
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X_pca)



In [19]:
# Clustering with DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_pca)

In [20]:
# Clustering with Hierarchical clustering (Agglomerative Clustering)
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_labels = hierarchical.fit_predict(X_pca)

In [21]:
# Clustering with Word2Vec
word2vec_model = Word2Vec(sentences=[review.split() for review in data['Reviews']], vector_size=100, window=5, min_count=1, workers=4)
word_vectors = word2vec_model.wv
word_vectors_matrix = word_vectors.vectors
word2vec_pca = PCA(n_components=2)
word2vec_pca_result = word2vec_pca.fit_transform(word_vectors_matrix)
word2vec_labels = KMeans(n_clusters=5, random_state=42).fit_predict(word_vectors_matrix)
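
In the cell above, KMeans is fit on `word_vectors.vectors`, which has one row per vocabulary word, so `word2vec_labels` assigns clusters to words rather than to reviews. One common way to get a vector per review is to average the vectors of its words. The sketch below is a minimal illustration of that idea: `review_vector` is a hypothetical helper, and the small `toy_vectors` dict stands in for `word2vec_model.wv`, which supports the same `in` and `[]` lookups.

```python
import numpy as np


def review_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy 2-dimensional vectors (made up) in place of a trained model's vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "phone": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}

reviews = ["good phone", "bad phone", "unknown words"]
matrix = np.vstack([review_vector(r.split(), toy_vectors, dim=2) for r in reviews])

print(matrix.shape)  # one row per review
```

The resulting `matrix` has one row per review and could be passed to `KMeans(...).fit_predict(matrix)` so that the labels line up with the other methods.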




In [22]:
# Clustering with BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
max_seq_length = 512
bert_embeddings = []

for review in data['Reviews']:
    # Tokenize the review
    tokenized_review = tokenizer.encode_plus(review, add_special_tokens=True, max_length=max_seq_length, padding='max_length', truncation=True, return_tensors='pt')
    inputs = tokenized_review['input_ids']
    attention_mask = tokenized_review['attention_mask']

    with torch.no_grad():
        bert_outputs = bert_model(inputs, attention_mask=attention_mask)

    last_hidden_state = bert_outputs[0][:,0,:].numpy()  # Extract the representation of the [CLS] token
    bert_embeddings.append(last_hidden_state)

# Concatenate BERT embeddings
bert_embeddings_concatenated = np.concatenate(bert_embeddings, axis=0)

# Perform clustering with KMeans on BERT embeddings
bert_labels = KMeans(n_clusters=5, random_state=42).fit_predict(bert_embeddings_concatenated)

# Print cluster labels for each method
print("K-means labels:", kmeans_labels)
print("DBSCAN labels:", dbscan_labels)
print("Hierarchical clustering labels:", hierarchical_labels)
print("Word2Vec labels:", word2vec_labels)
print("BERT labels:", bert_labels)
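
Printing the raw label arrays makes the methods hard to compare. A small summary of cluster counts and sizes may be easier to read. The helper below is a hypothetical sketch. It treats the label `-1` as noise, which is how scikit-learn's DBSCAN marks points that belong to no cluster. The example labels are made up.

```python
import numpy as np


def cluster_summary(labels):
    """Return the number of clusters, the number of noise points, and cluster sizes."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    # Exclude -1 (noise) from the cluster sizes.
    sizes = {int(v): int(c) for v, c in zip(values, counts) if v != -1}
    noise = int(np.sum(labels == -1))
    return {"n_clusters": len(sizes), "n_noise": noise, "sizes": sizes}


example = cluster_summary([0, 0, 1, -1, 1, 1, -1])
print(example)
# → {'n_clusters': 2, 'n_noise': 2, 'sizes': {0: 2, 1: 3}}
```

Calling `cluster_summary` on each of the label arrays above would show, for instance, how many reviews DBSCAN left unassigned at the chosen `eps`.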


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

On the Amazon Unlocked Mobile dataset, K-means, DBSCAN, hierarchical clustering, Word2Vec, and BERT produce notably different outcomes. K-means grouped reviews by their TF-IDF vectors, producing clusters of related content. DBSCAN relied on density: it highlighted dense regions of reviews and flagged outliers, which may reflect unusual or divergent opinions. Hierarchical clustering exposed nested relationships between reviews, which can be visualized as a dendrogram. Word2Vec captured semantic similarity between words in the reviews, making context and word relationships easier to interpret. BERT, a transformer model, supplied contextual embeddings that capture more intricate patterns in the text. The choice of algorithm depends on the goal: K-means for clearly separated clusters, DBSCAN for density-based groups with outliers, hierarchical clustering for structural insight, Word2Vec for semantic understanding, and BERT for nuanced contextual analysis. This underscores the importance of selecting the technique that best fits the characteristics of the data and the objectives of the analysis.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



This assignment's exercises provide extensive practical experience with machine learning and text analysis problems. With robust evaluation through 10-fold cross-validation, the text classification exercise covers a wide range of techniques, such as Naive Bayes, SVM, and Decision Trees. Metrics like accuracy, recall, precision, and F1-score give a clear picture of each model's performance. Similarly, the text clustering exercise investigates embedding strategies such as Word2Vec and BERT in addition to clustering techniques like K-means, DBSCAN, and hierarchical clustering. Because of their comprehensive explanations and code comments, the well-structured exercises are suitable for a variety of learning levels. They provide insight into choosing the right algorithms, preparing text data, and assessing model performance. All things considered, these tasks are great educational tools for professionals who are interested in NLP and machine learning applications with text data.