
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and to evaluate their performance. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. Split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
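The 80/20 split and the 10 folds described above can be sketched without any library, just to show the mechanics. This is a minimal illustration, not the notebook's method: the function names `train_validation_split` and `k_fold_indices`, the sample count of 1000, and the seed are all assumptions made for the example. In the notebook itself, `train_test_split` and `KFold` from scikit-learn play these roles.

```python
import random


def train_validation_split(n_samples, validation_fraction=0.2, seed=42):
    """Shuffle the indices 0..n_samples-1 and split them into train/validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_validation = int(round(n_samples * validation_fraction))
    return indices[n_validation:], indices[:n_validation]


def k_fold_indices(indices, n_splits=10):
    """Yield one (train, held_out) pair of index lists per fold."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(len(indices), n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


# Hypothetical dataset size, chosen only for illustration.
train_idx, val_idx = train_validation_split(1000)
folds = list(k_fold_indices(train_idx, n_splits=10))
print(len(train_idx), len(val_idx), len(folds))
```

Each training index is held out exactly once across the 10 folds, while the validation indices never take part in cross-validation.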


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
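To make the four measurements concrete, here is a small dependency-free sketch that computes them from the confusion counts of a binary classifier. The function name `binary_metrics` and the toy label lists are assumptions for illustration; the notebook uses the scikit-learn functions `accuracy_score`, `recall_score`, `precision_score`, and `f1_score` instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for the example: 1 = positive review, 0 = negative review.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.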


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch

In [2]:
!file -bi /content/stsa-train.csv

text/csv; charset=iso-8859-1


In [3]:
train_data = pd.read_csv("/content/stsa-train.csv", encoding="latin-1")
test_data = pd.read_csv("/content/stsa-test.csv", encoding="latin-1")

In [4]:
# Split data into features and labels
X_train = train_data['text']
y_train = train_data['sentiment']
X_test = test_data['text']
y_test = test_data['sentiment']

In [5]:
def evaluate_model(model, X, y):
    scoring = ['accuracy', 'recall', 'precision', 'f1']
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    return scores

In [6]:
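def train_and_evaluate(model, model_name):
    # Relies on the globals X_train_counts and y_train, so the vectorization
    # cell (In [7]) must be run before this function is called.
    print("Training and evaluating", model_name)
    # Cross-validation uses the raw count features, not the TF-IDF features.
    scores = evaluate_model(model, X_train_counts, y_train)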
    print("Mean accuracy:", scores['test_accuracy'].mean())
    print("Mean recall:", scores['test_recall'].mean())
    print("Mean precision:", scores['test_precision'].mean())
    print("Mean F1 score:", scores['test_f1'].mean())
    print()

In [7]:
# Convert text data into numerical features
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
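As a rough illustration of what the count and TF-IDF steps in the cell above produce, the sketch below builds TF-IDF rows by hand. It assumes the default `TfidfTransformer` settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The whitespace tokenizer is a simplification of `CountVectorizer`'s default token pattern, and the function name `tfidf_rows` and the three toy documents are made up for the example.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    # Simplified tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(docs)

    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in tokenized for word in set(doc))
    # Smoothed inverse document frequency.
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocab, rows


# Toy corpus, invented for illustration.
vocab, rows = tfidf_rows(["good movie", "bad movie", "good good plot"])
for row in rows:
    print([round(value, 3) for value in row])
```

In the second toy document, "bad" appears in fewer documents than "movie", so it receives the larger weight even though both occur once.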

In [8]:
# Multinomial Naive Bayes
nb_model = MultinomialNB()
train_and_evaluate(nb_model, "Multinomial Naive Bayes")

Training and evaluating Multinomial Naive Bayes
Mean accuracy: 0.7917630057803468
Mean recall: 0.8020372659413895
Mean precision: 0.7994481135510914
Mean F1 score: 0.8005499641877032



In [9]:
# K-Nearest Neighbors
knn_model = KNeighborsClassifier()
train_and_evaluate(knn_model, "K-Nearest Neighbors")

Training and evaluating K-Nearest Neighbors
Mean accuracy: 0.5773121387283238
Mean recall: 0.5676928908737323
Mean precision: 0.6007389224930721
Mean F1 score: 0.5827839306081636



In [10]:
# Decision Tree
dt_model = DecisionTreeClassifier()
train_and_evaluate(dt_model, "Decision Tree")

Training and evaluating Decision Tree
Mean accuracy: 0.6377167630057803
Mean recall: 0.6564791643511114
Mean precision: 0.6515396882332904
Mean F1 score: 0.6535907393263543



In [11]:
# XGBoost
xgb_model = XGBClassifier()
train_and_evaluate(xgb_model, "XGBoost")

Training and evaluating XGBoost
Mean accuracy: 0.7199421965317919
Mean recall: 0.7895318141081059
Mean precision: 0.7084570903552997
Mean F1 score: 0.7462264636758384



In [12]:
# Random Forest
rf_model = RandomForestClassifier()
train_and_evaluate(rf_model, "Random Forest")

Training and evaluating Random Forest
Mean accuracy: 0.733092485549133
Mean recall: 0.7780157325752171
Mean precision: 0.7290991217104332
Mean F1 score: 0.7524772120269957



In [13]:
# Support Vector Machine
svm_model = SVC(kernel='linear')
train_and_evaluate(svm_model, "Support Vector Machine")

Training and evaluating Support Vector Machine
Mean accuracy: 0.7658959537572254
Mean recall: 0.7786934520164545
Mean precision: 0.7739686876477824
Mean F1 score: 0.7759716841790995



In [14]:
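# Evaluate the selected model on the test data.
# SVM is selected here, although Multinomial Naive Bayes had the highest
# mean cross-validation accuracy above; change this line to compare models.
best_model = svm_model
X_test_counts = vectorizer.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
# Note: the final model is fit on TF-IDF features, whereas cross-validation used raw counts.
best_model.fit(X_train_tfidf, y_train)
y_pred = best_model.predict(X_test_tfidf)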

# Calculate evaluation metrics on test data
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test Recall:", recall_score(y_test, y_pred))
print("Test Precision:", precision_score(y_test, y_pred))
print("Test F1 score:", f1_score(y_test, y_pred))

Test Accuracy: 0.8116419549697969
Test Recall: 0.8305830583058306
Test Precision: 0.7997881355932204
Test F1 score: 0.8148947652455478


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [15]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (

In [16]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer
import nltk
from nltk.cluster import KMeansClusterer
from nltk.cluster.util import cosine_distance
import numpy as np

In [17]:
# Load the dataset
df = pd.read_csv("/content/Amazon_Unlocked_Mobile.csv")

# Drop rows with missing review text
df = df.dropna(subset=['Reviews'])

In [18]:
# Preprocess the text data
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
    return ' '.join(words)

df['Cleaned text'] = df['Reviews'].apply(preprocess_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [19]:
# Apply K-means clustering
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['Cleaned text'])
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X)
df['Kmeans cluster'] = kmeans.labels_



In [20]:
# Apply DBSCAN clustering to the TF-IDF matrix (default Euclidean metric); label -1 marks noise points
dbscan = DBSCAN(eps=0.5, min_samples=5)
df['DBSCAN cluster'] = dbscan.fit_predict(X)

In [21]:
# Train Word2Vec model
sentences = [word_tokenize(text) for text in df['Cleaned text']]
word2vec_model = Word2Vec(sentences, min_count=1)
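The Word2Vec model trained above yields one vector per word, so clustering reviews needs one more step that turns each review into a single vector. A common approach, sketched here as an assumption rather than as what this notebook does, is to average the vectors of the words in the review. The helper `average_vector` and the two-dimensional toy embeddings are hypothetical. In the notebook, `word2vec_model.wv` could be passed as `word_vectors`, since it supports both `in` checks and key lookup.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; return a zero vector if none are known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vector[i] for vector in known) / len(known) for i in range(dim)]


# Toy 2-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "good": [1.0, 0.0],
    "phone": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

doc_vectors = [
    average_vector(["good", "phone"], toy_vectors, dim=2),
    average_vector(["bad", "phone", "unknownword"], toy_vectors, dim=2),
]
print(doc_vectors)
```

The resulting document vectors could then be passed to a clustering algorithm such as K-means in place of the TF-IDF matrix.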

In [22]:
print(df[['Reviews', 'Kmeans cluster', 'DBSCAN cluster']])

                                                 Reviews  Kmeans cluster  \
0      I feel so LUCKY to have found this used (phone...               4   
1      nice phone, nice up grade from my pantach revu...               4   
2                                           Very pleased               4   
3      It works good but it goes slow sometimes but i...               0   
4      Great phone to replace my lost phone. The only...               4   
...                                                  ...             ...   
38448  First time I received the phone from the selle...               4   
38449                                         Works well               4   
38450  Probably could've found better prices , also t...               4   
38451    The iphone is unlock Good phone but bad ime att               4   
38452  Don't buy it. Yes the iPhone is brand new and ...               4   

       DBSCAN cluster  
0                  -1  
1                  -1  
2              

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

K-means is efficient and easy to implement and assigns every document to exactly one cluster, but it assumes clusters of similar size and roughly spherical shape and requires the number of clusters to be specified beforehand. DBSCAN, by contrast, is robust to noise and can find clusters of arbitrary shape without requiring the number of clusters as input, but it struggles with varying density and with high-dimensional data. Word2Vec embeddings capture semantic similarity between words, which lets K-means cluster documents by word meaning; however, each word gets a single static vector, so context-dependent meaning is not captured and K-means on Word2Vec features may overlook syntactic nuances. BERT embeddings capture context and semantic relationships well but require significant computational resources, which limits their practicality in resource-constrained environments.
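One simple way to ground such a comparison in the actual results is to tabulate how many documents each method put in each cluster, and how many DBSCAN marked as noise (label -1). The sketch below is a hypothetical helper, `summarize_labels`, run on invented label lists. In the notebook it could be applied to `df['Kmeans cluster']` and `df['DBSCAN cluster']` after converting them to lists.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize a list of cluster labels: cluster count, sizes, and noise share."""
    counts = Counter(labels)
    # DBSCAN uses -1 for noise; K-means never produces this label.
    n_noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "sizes": dict(sorted(counts.items())),
        "noise_fraction": n_noise / len(labels) if labels else 0.0,
    }


# Invented labels for illustration; these are not results from the notebook.
toy_kmeans_labels = [4, 4, 4, 0, 4, 4, 4, 0]
toy_dbscan_labels = [-1, -1, 0, -1, 0, 0, -1, -1]

print(summarize_labels(toy_kmeans_labels))
print(summarize_labels(toy_dbscan_labels))
```

A heavily lopsided size table, or a large noise fraction, suggests that the number of clusters or the `eps` and `min_samples` settings deserve another look.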

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [23]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the above questions):

'''
This exercise was not very hard. For Question 2, the listed clustering methods (DBSCAN, hierarchical clustering, Word2Vec) took too long to run.
But I learnt a lot from this exercise. It took many hours to complete.


'''
