# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [4]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from xgboost import XGBClassifier

In [5]:
# Write your code here
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
with open("/content/stsa-train.txt") as txtf:
    mylist = [line.rstrip('\n') for line in txtf]

labels = []
text = []

for i, line in enumerate(mylist):
    label = mylist[i][0]
    tex = mylist[i][1:]
    labels.append(label)
    text.append(tex)

dataset = pd.DataFrame(list(zip(labels, text)),columns =['Reviews', 'Text'])
dataset.head()

Unnamed: 0,Reviews,Text
0,1,"a stirring , funny and finally transporting r..."
1,0,apparently reassembled from the cutting-room ...
2,0,they presume their audience wo n't sit still ...
3,1,this is a visually stunning rumination on lov...
4,1,jonathan parker 's bartleby should have been ...


In [6]:
import nltk
nltk.download()
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.lower()
    sentence=sentence.replace('{html}',"")
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    rem_url=re.sub(r'http\S+', '',cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 if not w in stopwords.words('english')]
    stem_words=[stemmer.stem(w) for w in filtered_words]
    lemma_words=[lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(filtered_words)


dataset['cleanText']=dataset['Text'].map(lambda s:preprocess(s))
dataset.head()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> q

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all


    Downloading collection 'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /root/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package averaged_perceptron_tagger to
       |     /root/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger.zip.
       | Downloading package averaged_perceptron_tagger_ru to
       |     /root/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger_ru.zip.
       | Downloading package basque_grammars to /root/nltk_data...
       |   Unzipping grammars/basque_grammars.zip.
       | Downloading package bcp47 to /root/nltk_data...
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Unzipping corpora/biocreative_ppi.zip.
       | Downloading package bllip_wsj_no_aux to /root/nltk_data...
       |   Unzipping models/bllip_wsj_no_aux.zip.
       | Downloading package book_grammars to


---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


Unnamed: 0,Reviews,Text,cleanText
0,1,"a stirring , funny and finally transporting r...",stirring funny finally transporting imagining ...
1,0,apparently reassembled from the cutting-room ...,apparently reassembled cutting room floor give...
2,0,they presume their audience wo n't sit still ...,presume audience sit still sociology lesson ho...
3,1,this is a visually stunning rumination on lov...,visually stunning rumination love memory histo...
4,1,jonathan parker 's bartleby should have been ...,jonathan parker bartleby end modern office ano...


In [7]:
with open("/content/stsa-test.txt") as txtf:
    mylist_test = [line.rstrip('\n') for line in txtf]

labels_test = []
text_test = []

for i, line in enumerate(mylist_test):
    label_test = mylist_test[i][0]
    tex_test = mylist_test[i][1:]
    labels_test.append(label_test)
    text_test.append(tex_test)

dataset_test = pd.DataFrame(list(zip(labels_test, text_test)),columns =['Reviews', 'Text'])
dataset_test.head()

Unnamed: 0,Reviews,Text
0,0,"no movement , no yuks , not much of anything ."
1,0,"a gob of drivel so sickly sweet , even the ea..."
2,0,"gangs of new york is an unapologetic mess , w..."
3,0,we never really feel involved with the story ...
4,1,this is one of polanski 's best films .


In [8]:
import nltk
nltk.download()
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.lower()
    sentence=sentence.replace('{html}',"")
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    rem_url=re.sub(r'http\S+', '',cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 if not w in stopwords.words('english')]
    stem_words=[stemmer.stem(w) for w in filtered_words]
    lemma_words=[lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(filtered_words)


dataset_test['cleanText']=dataset_test['Text'].map(lambda s:preprocess(s))
dataset_test.head()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all


    Downloading collection 'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Package abc is already up-to-date!
       | Downloading package alpino to /root/nltk_data...
       |   Package alpino is already up-to-date!
       | Downloading package averaged_perceptron_tagger to
       |     /root/nltk_data...
       |   Package averaged_perceptron_tagger is already up-to-date!
       | Downloading package averaged_perceptron_tagger_ru to
       |     /root/nltk_data...
       |   Package averaged_perceptron_tagger_ru is already up-to-
       |       date!
       | Downloading package basque_grammars to /root/nltk_data...
       |   Package basque_grammars is already up-to-date!
       | Downloading package bcp47 to /root/nltk_data...
       |   Package bcp47 is already up-to-date!
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Package biocreative_ppi is already up-to-date!
       | Downloading package bllip_wsj_no_aux to /ro


---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


Unnamed: 0,Reviews,Text,cleanText
0,0,"no movement , no yuks , not much of anything .",movement yuks much anything
1,0,"a gob of drivel so sickly sweet , even the ea...",gob drivel sickly sweet even eager consumers m...
2,0,"gangs of new york is an unapologetic mess , w...",gangs new york unapologetic mess whose saving ...
3,0,we never really feel involved with the story ...,never really feel involved story ideas remain ...
4,1,this is one of polanski 's best films .,one polanski best films


In [9]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

# Load data
train_file_path = "/content/stsa-test.txt"
test_file_path = "/content/stsa-test.txt"

# Read train and test data
def read_data(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
    labels = [line[0] for line in lines]
    texts = [line[1:].strip() for line in lines]
    return pd.DataFrame({'Reviews': labels, 'Text': texts})

dataset = read_data(train_file_path)
dataset_test = read_data(test_file_path)

# Preprocess text
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>', '', text)    # Remove HTML tags
    text = re.sub(r'\d+', '', text)      # Remove numbers
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

dataset['cleanText'] = dataset['Text'].apply(preprocess_text)
dataset_test['cleanText'] = dataset_test['Text'].apply(preprocess_text)

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(lowercase=False, analyzer='word')
train_tfidf = tfidf_vectorizer.fit_transform(dataset["cleanText"]).toarray()
test_tfidf = tfidf_vectorizer.transform(dataset_test["cleanText"]).toarray()



In [10]:
x_test = test_tfidf
y_test = dataset_test["Reviews"]

from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(train_tfidf,dataset["Reviews"],test_size = 0.2, random_state = 202)

#Algorithms
#1.MultinominalNB

In [11]:
MNB = MultinomialNB()
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
MNB_test = MNB.fit(x_train,y_train)
Y_MNB = MNB_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_MNB,y_test))
print(classification_report(y_test,Y_MNB))
sco_MNB = cross_val_score(MNB, x_test, y_test, cv=10)
print("Accuracy with MNB",sco_MNB.mean())

Accuracy 0.9302580999450851
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       912
           1       0.93      0.93      0.93       909

    accuracy                           0.93      1821
   macro avg       0.93      0.93      0.93      1821
weighted avg       0.93      0.93      0.93      1821

Accuracy with MNB 0.7204857983546509


#SVM

In [12]:
#SVM TEST
SVM = LinearSVC()
SVM_test = SVM.fit(x_train,y_train)
Y_SVM = SVM_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_SVM,y_test))
print(classification_report(y_test,Y_SVM))
sco_SVM = cross_val_score(SVM, x_test, y_test, cv=10)
print("Accuracy with SVM",sco_SVM.mean())

Accuracy 0.9412410763316859
              precision    recall  f1-score   support

           0       0.94      0.94      0.94       912
           1       0.94      0.94      0.94       909

    accuracy                           0.94      1821
   macro avg       0.94      0.94      0.94      1821
weighted avg       0.94      0.94      0.94      1821

Accuracy with SVM 0.702368942532877


#KNN

In [13]:
#KNN_test
KNN = KNeighborsClassifier(n_neighbors=5,n_jobs=-1)
KNN_test = KNN.fit(x_train,y_train)
Y_KNN = KNN_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_KNN,y_test))
print(classification_report(y_test,Y_KNN))
scores_KNN = cross_val_score(KNN, x_test, y_test, cv=10)
print("Accuracy with KNN",scores_KNN.mean())

Accuracy 0.6809445359692476
              precision    recall  f1-score   support

           0       0.61      0.98      0.75       912
           1       0.94      0.39      0.55       909

    accuracy                           0.68      1821
   macro avg       0.78      0.68      0.65      1821
weighted avg       0.78      0.68      0.65      1821

Accuracy with KNN 0.5222092115534738


In [14]:
#Decison tree
DTC = DecisionTreeClassifier()
dt = DecisionTreeClassifier()
dec_tree_test = dt.fit(x_train,y_train)
Y_dec_tree = dec_tree_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_dec_tree,y_test))
print(classification_report(y_test,Y_dec_tree))
scores_DT = cross_val_score(dt, x_test, y_test, cv=10)
print("Accuracy with decision trees",scores_DT.mean())

Accuracy 0.9203734211971444
              precision    recall  f1-score   support

           0       0.91      0.94      0.92       912
           1       0.94      0.90      0.92       909

    accuracy                           0.92      1821
   macro avg       0.92      0.92      0.92      1821
weighted avg       0.92      0.92      0.92      1821

Accuracy with decision trees 0.6227796793370565


In [15]:
#Randomforest
rf = RandomForestClassifier()
RF_test = rf.fit(x_train,y_train)
Y_RF = RF_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_RF,y_test))
print(classification_report(y_test,Y_RF))
sco_RF = cross_val_score(rf, x_test, y_test, cv=10)
print("Accuracy with Random Forest",sco_RF.mean())

Accuracy 0.9324546952224053
              precision    recall  f1-score   support

           0       0.91      0.96      0.93       912
           1       0.96      0.90      0.93       909

    accuracy                           0.93      1821
   macro avg       0.93      0.93      0.93      1821
weighted avg       0.93      0.93      0.93      1821

Accuracy with Random Forest 0.6556836606016934


#XGBoost model

In [16]:
y_train = y_train.astype(int)
y_test = y_test.astype(int)
xgb = XGBClassifier()
# Now fit the XGBoost model
XGB_test = xgb.fit(x_train, y_train)
Y_XGB = XGB_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_XGB,y_test))
print(classification_report(y_test,Y_XGB))
sco_XGB = cross_val_score(xgb, x_test, y_test, cv=10)
print("Accuracy using XGBoost",sco_XGB.mean())

Accuracy 0.8358045030203185
              precision    recall  f1-score   support

           0       0.81      0.87      0.84       912
           1       0.86      0.80      0.83       909

    accuracy                           0.84      1821
   macro avg       0.84      0.84      0.84      1821
weighted avg       0.84      0.84      0.84      1821

Accuracy using XGBoost 0.6524079745391221


#Word2Vec

In [17]:
from gensim.models import Word2Vec
import numpy as np

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=dataset['cleanText'], vector_size=100, window=5, min_count=1, workers=4)

# Convert each review to a vector representation
def word2vec_vectorizer(tokens, model, vector_size):
    vector = np.zeros(vector_size)
    num_words = 0
    for word in tokens:
        if word in model.wv:
            vector += model.wv[word]
            num_words += 1
    if num_words != 0:
        vector /= num_words
    return vector

train_vectors = np.array([word2vec_vectorizer(tokens, word2vec_model, 100) for tokens in dataset['cleanText']])

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(train_vectors, dataset['Reviews'].values,
                                                    test_size=0.2, random_state=42)

# Train a classifier (for example, SVM)
from sklearn.svm import SVC
svm_classifier = SVC()
svm_classifier.fit(x_train, y_train)

# Evaluate the classifier
from sklearn.metrics import accuracy_score, classification_report
y_pred = svm_classifier.predict(x_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))




Accuracy: 0.4794520547945205
              precision    recall  f1-score   support

           0       0.48      1.00      0.65       175
           1       0.00      0.00      0.00       190

    accuracy                           0.48       365
   macro avg       0.24      0.50      0.32       365
weighted avg       0.23      0.48      0.31       365



#Bert

In [18]:
import pandas as pd
import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import torch
from transformers import DistilBertTokenizer, DistilBertModel
import numpy as np

# Load data
train_file_path = "/content/stsa-test.txt"
test_file_path = "/content/stsa-test.txt"

# Read train and test data
def read_data(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
    labels = [line[0] for line in lines]
    texts = [line[1:].strip() for line in lines]
    return pd.DataFrame({'Reviews': labels, 'Text': texts})

dataset = read_data(train_file_path)
dataset_test = read_data(test_file_path)

# Preprocess text
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>', '', text)    # Remove HTML tags
    text = re.sub(r'\d+', '', text)      # Remove numbers
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

dataset['cleanText'] = dataset['Text'].apply(preprocess_text)
dataset_test['cleanText'] = dataset_test['Text'].apply(preprocess_text)

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(lowercase=False, analyzer='word')
train_tfidf = tfidf_vectorizer.fit_transform(dataset["cleanText"]).toarray()
test_tfidf = tfidf_vectorizer.transform(dataset_test["cleanText"]).toarray()

x_train, x_valid, y_train, y_valid = train_test_split(train_tfidf, dataset["Reviews"], test_size=0.2, random_state=202)

# Load pre-trained DistilBERT model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
# Tokenize and encode texts using DistilBERT
def encode_texts(texts, tokenizer, model):
    encoded_texts = []
    for text in texts:
        encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**encoded_input)
        # Extract the last layer hidden state of the [CLS] token
        encoded_texts.append(outputs.last_hidden_state[:, 0, :].squeeze().numpy())
    return np.array(encoded_texts)

train_distilbert_vectors = encode_texts(dataset['Text'], tokenizer, model)

# Train a classifier (SVM) using DistilBERT embeddings
svm_classifier = SVC()
svm_classifier.fit(train_distilbert_vectors, dataset['Reviews'])

# Evaluate the classifier
test_distilbert_vectors = encode_texts(dataset_test['Text'], tokenizer, model)
y_pred = svm_classifier.predict(test_distilbert_vectors)
print('Accuracy:', accuracy_score(dataset_test['Reviews'], y_pred))
print(classification_report(dataset_test['Reviews'], y_pred))


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


KeyboardInterrupt: 

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [2]:
import pandas as pd
file_path = '/content/Amazon_Unlocked_Mobile.csv'
df = pd.read_csv(file_path)


In [3]:
# Cleaning the data before beginning clustering
#lower casing
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')
from textblob import Word
nltk.download('wordnet')

df['Reviews'] = df['Reviews'].apply(lambda x: " ".join(x.lower() for x in str(x).split()))
df['Reviews'] = df['Reviews'].str.replace('[^\w\s]','')
df['Reviews'] = df['Reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['Reviews'] = df['Reviews'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,feel lucky found used (phone u & used hard all...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice grade pantach revue. clean se...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,work good go slow sometimes good phone love,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,great phone replace lost phone. thing volume b...,0.0


In [10]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Read the data
file_path = '/content/Amazon_Unlocked_Mobile.csv'
df = pd.read_csv(file_path)

# Replace missing values with empty strings
df['Reviews'].fillna('', inplace=True)

# Vectorize the text data using TF-IDF
tfidf_vect = TfidfVectorizer()
tfidf_vects = tfidf_vect.fit_transform(df['Reviews'])

# Form clusters
num_clusters = 9
model = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=10000, random_state=50)
model.fit(tfidf_vects)

# Count the number of instances in each cluster
cluster_counts = pd.Series(model.labels_).value_counts()

# Print top words in each cluster
top_words = 7
feature_names = tfidf_vect.get_feature_names_out()
centroids = model.cluster_centers_.argsort()[:, ::-1]
for cluster_num in range(num_clusters):
    key_features = [feature_names[i] for i in centroids[cluster_num, :top_words]]
    print(f'Cluster {cluster_num+1}')
    print('Top Words:', key_features)




Cluster 1
Top Words: ['love', 'it', 'phone', 'this', 'great', 'my', 'the']
Cluster 2
Top Words: ['excelente', 'producto', 'telefono', 'gracias', 'recomendado', 'celular', 'teléfono']
Cluster 3
Top Words: ['it', 'and', 'phone', 'to', 'was', 'the', 'my']
Cluster 4
Top Words: ['excellent', 'product', 'phone', 'very', 'recommend', 'good', 'seller']
Cluster 5
Top Words: ['good', 'very', 'phone', 'product', 'price', 'it', 'is']
Cluster 6
Top Words: ['great', 'phone', 'works', 'product', 'price', 'it', 'for']
Cluster 7
Top Words: ['the', 'phone', 'to', 'and', 'is', 'it', 'this']
Cluster 8
Top Words: ['perfect', 'condition', 'everything', 'works', 'it', 'was', 'phone']
Cluster 9
Top Words: ['phone', 'the', 'nice', 'very', 'good', 'and', 'great']


In [11]:
cluster_center=model.cluster_centers_
cluster_center

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [6.60664599e-04, 4.30795017e-05, 3.13419427e-06, ...,
        0.00000000e+00, 2.09092287e-05, 0.00000000e+00],
       ...,
       [1.28527638e-03, 1.62259331e-04, 2.07524629e-05, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [4.43654367e-04, 2.15469809e-05, 9.25058512e-06, ...,
        6.70320196e-06, 2.17011153e-06, 5.22122746e-06]])

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from collections import Counter

# Load data
df = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')

# Vectorize the text data using TF-IDF
tfidf_vect = TfidfVectorizer()
tfidf_vects = tfidf_vect.fit_transform(df['Reviews'].astype('str'))

# Forming clusters using DBSCAN
dbscan_model = DBSCAN(eps=0.5, min_samples=5)
dbscan_model.fit(tfidf_vects)

# Get cluster labels
cluster_labels = dbscan_model.labels_

# Count the number of instances in each cluster
cluster_counts = Counter(cluster_labels)
print("Cluster Counts:", cluster_counts)

# Print top words in each cluster
unique_labels = set(cluster_labels)
for label in unique_labels:
    if label == -1: # cluster label for noise points
        continue
    cluster_indices = (cluster_labels == label)
    cluster_tfidf = tfidf_vects[cluster_indices]
    feature_names = tfidf_vect.get_feature_names()
    cluster_names = [feature_names[i] for i in cluster_tfidf.indices]
    top_words = Counter(cluster_names).most_common(5)
    print('Cluster ' + str(label))
    print('Top Words:', top_words)


In [None]:
# Hierarchical Clustering
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from collections import Counter

# Load data
df = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')

# Replace missing values with empty strings
df['Reviews'].fillna('', inplace=True)

# Tokenize reviews
reviews = df['Reviews'].apply(lambda x: str(x).split())

# Vectorize the text data using TF-IDF
tfidf_vect = TfidfVectorizer()
tfidf_vects = tfidf_vect.fit_transform(df['Reviews'])

# Perform hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=9)
clusters = agg_clustering.fit_predict(tfidf_vects.toarray())

# Count the number of instances in each cluster
cluster_counts = Counter(clusters)
print("Cluster Counts:", cluster_counts)


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from collections import Counter
from gensim.models import Word2Vec

# Read the data
file_path = '/content/Amazon_Unlocked_Mobile.csv'
df = pd.read_csv(file_path)

# Replace missing values with empty strings
df['Reviews'].fillna('', inplace=True)

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=[review.split() for review in df['Reviews']], vector_size=100, window=5, min_count=1, workers=4)

# Transform each review into a vector representation
review_vectors = [sum(word2vec_model.wv[word] for word in review.split() if word in word2vec_model.wv) for review in df['Reviews']]

# Forming clusters using KMeans
num_clusters = 9
kmeans_model = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=10000, random_state=50)
kmeans_model.fit(review_vectors)

# Get cluster labels
cluster_labels = kmeans_model.labels_

# Count the number of instances in each cluster
cluster_counts = Counter(cluster_labels)
print("Cluster Counts:", cluster_counts)

# Print top words in each cluster
top_words = 7
centroids = kmeans_model.cluster_centers_.argsort()[:, ::-1]
for cluster_num in range(num_clusters):
    cluster_indices = (cluster_labels == cluster_num)
    cluster_reviews = df.loc[cluster_indices, 'Reviews']
    cluster_vectorizer = CountVectorizer()
    cluster_X = cluster_vectorizer.fit_transform(cluster_reviews)
    feature_names = cluster_vectorizer.get_feature_names()
    key_features = [feature_names[i] for i in centroids[cluster_num, :top_words]]
    print('Cluster ' + str(cluster_num+1))
    print('Top Words:', key_features)


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.K-means clustering, DBSCAN, and hierarchical clustering are traditional clustering algorithms that operate on vectorized representations of data, such as TF-IDF or Word2Vec embeddings. K-means partitions data into clusters based on centroids, DBSCAN identifies dense regions separated by sparse areas, and hierarchical clustering forms a tree-like structure of clusters. Word2Vec and BERT, on the other hand, are embedding techniques. Word2Vec learns dense vector representations of words capturing semantic relationships, while BERT produces contextualized word embeddings by pretraining on vast text corpora. Each method has its strengths: traditional clustering algorithms are interpretable and efficient, while embedding techniques capture intricate semantic information.








.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:
the run time is of each cell is very difficult.
there are many crashes for the topic.


'''