# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. Split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training each classifier. Evaluate the final trained model on the test data.
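A minimal sketch of the workflow described above (80/20 split, 10-fold cross-validation on the training portion, one final check on held-out data). The toy corpus, the variable names, and the choice to wrap `TfidfVectorizer` and `MultinomialNB` in a pipeline are assumptions for illustration, not part of the assignment data; the pipeline is used so that the vectorizer is refit inside each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the review file: 20 positive, 20 negative.
positive = ["great", "wonderful", "moving", "funny", "brilliant"]
negative = ["boring", "dull", "awful", "clumsy", "tedious"]
texts = [f"a {w} film with a {v} cast" for w in positive for v in positive[:4]]
texts += [f"a {w} film with a {v} cast" for w in negative for v in negative[:4]]
labels = [1] * 20 + [0] * 20

# 80% training / 20% validation, keeping the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Vectorizer and classifier in one estimator, so TF-IDF is fit per fold.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="accuracy")
print(f"10-fold CV accuracy: {cv_scores.mean():.2f}")

# Fit once on the full training portion and score the held-out part once.
pipeline.fit(X_train, y_train)
val_accuracy = pipeline.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.2f}")
```

With the real data, the same pattern would apply to the training file, and the test file would be scored once with the final fitted model rather than cross-validated.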


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
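A minimal sketch of how these four measures follow from the confusion counts. The helper `binary_metrics` and the example labels are hypothetical and only serve to show that precision, recall, and F1 depend on which class is treated as positive (`pos_label`), while accuracy does not.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    """Compute accuracy, precision, recall and F1 for one chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]

metrics_pos = binary_metrics(y_true, y_pred, pos_label=1)
metrics_neg = binary_metrics(y_true, y_pred, pos_label=0)
print("positive class:", metrics_pos)
print("negative class:", metrics_neg)
```

In scikit-learn, `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities; the `pos_label` argument of the last three plays the same role as in this sketch.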


In [45]:
# Write your code here

import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
def get_text_data(file_name):
  with open(file_name) as txtf:
      mylist = [line.strip('\n') for line in txtf]

  labels = []
  text = []

  for i, line in enumerate(mylist):
      label = mylist[i][0]
      tex = mylist[i][1:]
      labels.append(label)
      text.append(tex)
  return labels, text

train_data = get_text_data("stsa-train.txt")
train_df = pd.DataFrame(list(zip(train_data[0], train_data[1])),columns =['Reviews', 'Text'])
import nltk, re, string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
nltk.download('stopwords')
nltk.download("wordnet")
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def tokenize_sentence(s):
  # reviews = []
  # for x in s:
  #   reviews.append(nltk.word_tokenize(x))
  st = PorterStemmer()
  lemmatizer = WordNetLemmatizer()
  tokenizer = RegexpTokenizer(r'\w+')
  tokens = tokenizer.tokenize(s)
  stem_words=[st.stem(w) for w in tokens]
  lemma_words=[lemmatizer.lemmatize(w) for w in stem_words]
  # NOTE: the raw tokens are returned; stem_words and lemma_words are computed but not used
  return " ".join(tokens)

def remove_puntutation_and_special_chars(s):
  """
    params1: string containing sentence
    return: string with no punctuations
  """
  # punctuation = """!()-[]{};+|=-_~`:'"\,<>./?@#$%^&*_~"""
  # s1 = s1.replace(punctuation, "")
  s = s.replace('{html}',"")
  s = re.sub(r'http\S+', '',s)
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', s)
  # return cleantext
  return "".join([x for x in cleantext if x not in string.punctuation])

def remove_nums(s):
  return re.sub(r'[0-9]+', '', s)

def remove_stopwords(s):
  stop = stopwords.words("english")
  return " ".join(x for x in s.split() if len(x)>2 if x not in stop)

def lower_all(s):
  return " ".join(x.lower() for x in s.split())

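# Series.replace only matches whole cell values; .str.replace removes newlines inside the text
train_df["text_modified"] = train_df["Text"].str.replace("\n", "", regex=False)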
operations = [remove_puntutation_and_special_chars, remove_nums, lower_all, remove_stopwords]
for op in operations:
  train_df["text_modified"] = train_df["text_modified"].apply(op)
train_df['tls'] = train_df['text_modified'].apply(tokenize_sentence)
test_data = get_text_data("stsa-test.txt")
test_df = pd.DataFrame(list(zip(test_data[0], test_data[1])),columns =['Reviews', 'Text'])
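# Series.replace only matches whole cell values; .str.replace removes newlines inside the text
test_df["text_modified"] = test_df["Text"].str.replace("\n", "", regex=False)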
operations = [remove_puntutation_and_special_chars, remove_nums, lower_all, remove_stopwords]
for op in operations:
  test_df["text_modified"] = test_df["text_modified"].apply(op)
test_df['tls'] = test_df['text_modified'].apply(tokenize_sentence)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [46]:
#TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(lowercase = False, analyzer='word')
train_tfidf = tfidf_vectorizer.fit_transform(train_df["tls"]).toarray()
test_tfidf = tfidf_vectorizer.transform(test_df["tls"]).toarray()
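# NOTE: despite the names, x_valid/y_valid hold the held-out test file (stsa-test.txt)
x_valid, y_valid = test_tfidf, test_df['Reviews']

# Split the training data: 80% for training, 20% for validation.
# NOTE: x_test/y_test are the 20% validation split, not the test file.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(train_tfidf, train_df["Reviews"],test_size = 0.2, random_state = 202)
y_test_dup = y_test.astype(int)
# Helpers to train and evaluate the models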
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
# classification report
from sklearn.metrics import classification_report

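def train_model(model, x_train, y_train, x_test, model_name):
  # Fits the model, then scores it on x_test (the 20% validation split).
  # NOTE: y_test is read from the global scope, and the "on train" labels report validation results.
  # pos_label='0' means precision, recall and F1 are reported for the negative class.
  classifier = model
  model = classifier.fit(x_train, y_train)
  predictions_validation_set = classifier.predict(x_test)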

  print(f"Accuracy of the {model_name} on train: {round(accuracy_score(y_test, predictions_validation_set)*100)}%")
  print(f"Percision of the {model_name} on train: { round(precision_score(y_test, predictions_validation_set, pos_label='0')*100)}%")
  print(f"Recall of the {model_name} on train: {round(recall_score(y_test, predictions_validation_set, pos_label='0')*100)}%")
  print(f"F1 Score of the {model_name} on train: { round(f1_score(y_test, predictions_validation_set, pos_label='0')*100)}%")

  cr_naive_validation = classification_report(y_test, predictions_validation_set)
  print("Classification Report on train: ", "\n", "\n",cr_naive_validation)

  return model
def test_model(model, x_valid, y_valid, model_name):
  predictions_test_set = model.predict(x_valid)

  print(f"Accuracy of the {model_name} on test: {round(accuracy_score(y_valid, predictions_test_set)*100)}%")
  print(f"Percision of the {model_name} on test: { round(precision_score(y_valid, predictions_test_set, pos_label='0')*100)}%")
  print(f"Recall of the {model_name} on test: {round(recall_score(y_valid, predictions_test_set, pos_label='0')*100)}%")
  print(f"F1 Score of the {model_name} on test: { round(f1_score(y_valid, predictions_test_set, pos_label='0')*100)}%")

  cr_naive_test = classification_report(y_valid, predictions_test_set)
  print("Classification Report on test: ", "\n", "\n",cr_naive_test)

In [47]:
#MultinomialNB
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
model_nb = train_model(MultinomialNB(), x_train, y_train, x_test, "Naive Bayes Model")
from sklearn.model_selection import cross_val_score
naive_accuracies_validation = cross_val_score(estimator = classifier, X = x_train, y = y_train, cv = 10)

print(f"Naive Bayes Model  10-fold cross validation on train:  {round(naive_accuracies_validation.mean()*100)}%")
test_model(model_nb, x_valid, y_valid, "Naive Bayes Model")
naive_accuracies_test = cross_val_score(estimator = classifier, X = x_valid, y = y_valid, cv = 10)
print(f"Naive Bayes Model 10-fold cross validation score on testing set is :  {round(naive_accuracies_test.mean()*100)}%")

Accuracy of the Naive Bayes Model on train: 79%
Percision of the Naive Bayes Model on train: 83%
Recall of the Naive Bayes Model on train: 70%
F1 Score of the Naive Bayes Model on train: 76%
Classification Report on train:  
 
               precision    recall  f1-score   support

           0       0.83      0.70      0.76       667
           1       0.76      0.87      0.81       717

    accuracy                           0.79      1384
   macro avg       0.80      0.79      0.79      1384
weighted avg       0.79      0.79      0.79      1384

Naive Bayes Model  10-fold cross validation on train:  77%
Accuracy of the Naive Bayes Model on test: 79%
Percision of the Naive Bayes Model on test: 85%
Recall of the Naive Bayes Model on test: 71%
F1 Score of the Naive Bayes Model on test: 77%
Classification Report on test:  
 
               precision    recall  f1-score   support

           0       0.85      0.71      0.77       912
           1       0.75      0.87      0.81       909


In [48]:
#SVM
from sklearn import svm

classifier_svm = svm.LinearSVC()
model_svm = train_model(classifier_svm, x_train, y_train, x_test, "SVM model")
svm_accuracies_validation = cross_val_score(estimator = classifier_svm, X = x_train, y = y_train, cv = 10)

print(f"SVM Model  10-fold cross validation on train:  {round(svm_accuracies_validation.mean()*100)}%")
test_model(model_svm, x_valid, y_valid, "SVM Model")
svm_accuracies_test = cross_val_score(estimator = classifier_svm, X = x_valid, y = y_valid, cv = 10)
print(f"SVM Model 10-fold cross validation score on testing set is :  {round(svm_accuracies_test.mean()*100)}%")

Accuracy of the SVM model on train: 78%
Percision of the SVM model on train: 77%
Recall of the SVM model on train: 77%
F1 Score of the SVM model on train: 77%
Classification Report on train:  
 
               precision    recall  f1-score   support

           0       0.77      0.77      0.77       667
           1       0.79      0.79      0.79       717

    accuracy                           0.78      1384
   macro avg       0.78      0.78      0.78      1384
weighted avg       0.78      0.78      0.78      1384

SVM Model  10-fold cross validation on train:  76%
Accuracy of the SVM Model on test: 79%
Percision of the SVM Model on test: 80%
Recall of the SVM Model on test: 77%
F1 Score of the SVM Model on test: 78%
Classification Report on test:  
 
               precision    recall  f1-score   support

           0       0.80      0.77      0.78       912
           1       0.77      0.81      0.79       909

    accuracy                           0.79      1821
   macro avg     

In [49]:

#KNN
from sklearn.neighbors import KNeighborsClassifier

classifier_knn = KNeighborsClassifier(n_neighbors = 15)
model_knn = train_model(classifier_knn, x_train, y_train, x_test, "KNN model")
knn_accuracies_validation = cross_val_score(estimator = classifier_knn, X = x_train, y = y_train, cv = 10)

print(f"KNN Model  10-fold cross validation score on training set is :  {round(knn_accuracies_validation.mean()*100)}%")
test_model(model_knn, x_valid, y_valid, "KNN Model")
knn_accuracies_test = cross_val_score(estimator = classifier_knn, X = x_valid, y = y_valid, cv = 10)

print(f"KNN Model 10-fold cross validation score on testing set is :  {round(knn_accuracies_test.mean()*100)}%")

Accuracy of the KNN model on train: 74%
Percision of the KNN model on train: 71%
Recall of the KNN model on train: 79%
F1 Score of the KNN model on train: 75%
Classification Report on train:  
 
               precision    recall  f1-score   support

           0       0.71      0.79      0.75       667
           1       0.78      0.70      0.74       717

    accuracy                           0.74      1384
   macro avg       0.75      0.75      0.74      1384
weighted avg       0.75      0.74      0.74      1384

KNN Model  10-fold cross validation score on training set is :  71%
Accuracy of the KNN Model on test: 74%
Percision of the KNN Model on test: 72%
Recall of the KNN Model on test: 78%
F1 Score of the KNN Model on test: 75%
Classification Report on test:  
 
               precision    recall  f1-score   support

           0       0.72      0.78      0.75       912
           1       0.76      0.70      0.73       909

    accuracy                           0.74      1821


In [50]:
#XGBoost
y_train_dup = y_train.astype(int)
y_train_dup.dtype
from xgboost import XGBClassifier

classifier_xgb = XGBClassifier()
# model_xgb = train_model(classifier_xgb, x_train, y_train_dup, x_test, "XG Boost model")

model_xgb = classifier_xgb.fit(x_train, y_train_dup)
predictions_validation_set = model_xgb.predict(x_test)

print(f"Accuracy of the XG Boost model on train: {round(accuracy_score(y_test_dup, predictions_validation_set)*100)}%")
print(f"Percision of the XG Boost model on train: { round(precision_score(y_test_dup, predictions_validation_set)*100)}%")
print(f"Recall of the XG Boost model on train: {round(recall_score(y_test_dup, predictions_validation_set)*100)}%")
print(f"F1 Score of the XG Boost model on train: { round(f1_score(y_test_dup, predictions_validation_set)*100)}%")

cr_naive_validation = classification_report(y_test_dup, predictions_validation_set)
print("Classification Report on train: ", "\n", "\n",cr_naive_validation)
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

classifier_xgb = XGBClassifier()
xgb_accuracies_validation = cross_val_score(estimator = classifier_xgb, X = x_train, y = y_train_dup, cv = 10)

print(f"XGBoost Model  10-fold cross validation score on training set is :  {round(xgb_accuracies_validation.mean()*100)}%")
y_valid_dup = y_valid.astype(int)
y_valid_dup.dtype
# test_model(model_xgb, x_valid, y_valid, "XG Boost Model")
predictions_test_set = model_xgb.predict(x_valid)

print(f"Accuracy of the XG Boost model on train: {round(accuracy_score(y_valid_dup, predictions_test_set)*100)}%")
print(f"Percision of the XG Boost model on train: { round(precision_score(y_valid_dup, predictions_test_set)*100)}%")
print(f"Recall of the XG Boost model on train: {round(recall_score(y_valid_dup, predictions_test_set)*100)}%")
print(f"F1 Score of the XG Boost model on train: { round(f1_score(y_valid_dup, predictions_test_set)*100)}%")

cr_naive_validation = classification_report(y_valid_dup, predictions_test_set)
print("Classification Report on train: ", "\n", "\n",cr_naive_validation)
xgb_accuracies_test = cross_val_score(estimator = classifier_xgb, X = x_valid, y = y_valid_dup, cv = 10)

print(f"XGBoost Classifier Model 10-fold cross validation score on testing set is :  {round(xgb_accuracies_test.mean()*100)}%")

Accuracy of the XG Boost model on train: 68%
Percision of the XG Boost model on train: 66%
Recall of the XG Boost model on train: 79%
F1 Score of the XG Boost model on train: 72%
Classification Report on train:  
 
               precision    recall  f1-score   support

           0       0.72      0.57      0.63       667
           1       0.66      0.79      0.72       717

    accuracy                           0.68      1384
   macro avg       0.69      0.68      0.68      1384
weighted avg       0.69      0.68      0.68      1384

XGBoost Model  10-fold cross validation score on training set is :  69%
Accuracy of the XG Boost model on train: 69%
Percision of the XG Boost model on train: 65%
Recall of the XG Boost model on train: 79%
F1 Score of the XG Boost model on train: 71%
Classification Report on train:  
 
               precision    recall  f1-score   support

           0       0.73      0.58      0.65       912
           1       0.65      0.79      0.71       909

    a

In [53]:

#Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report


classifier_dt = DecisionTreeClassifier()
model_dt = classifier_dt.fit(x_train, y_train)

dt_accuracies_validation = cross_val_score(estimator=classifier_dt, X=x_train, y=y_train, cv=10)
print(f"Decision Tree Classifier Model 10-fold cross-validation score on training set is: {round(dt_accuracies_validation.mean()*100)}%")

predictions_validation_set_dt = model_dt.predict(x_test)

print(f"Accuracy of the Decision Tree model on validation: {round(accuracy_score(y_test, predictions_validation_set_dt)*100)}%")
print("Classification Report on validation: \n", classification_report(y_test, predictions_validation_set_dt))

dt_accuracies_test = cross_val_score(estimator=classifier_dt, X=x_valid, y=y_valid, cv=10)
print(f"Decision Tree Classifier Model 10-fold cross-validation score on testing set is: {round(dt_accuracies_test.mean()*100)}%")


Decision Tree Classifier Model 10-fold cross-validation score on training set is: 65%
Accuracy of the Decision Tree model on validation: 64%
Classification Report on validation: 
               precision    recall  f1-score   support

           0       0.62      0.69      0.65       667
           1       0.67      0.61      0.64       717

    accuracy                           0.64      1384
   macro avg       0.65      0.65      0.64      1384
weighted avg       0.65      0.64      0.64      1384

Decision Tree Classifier Model 10-fold cross-validation score on testing set is: 62%


In [54]:
#Random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report


classifier_rf = RandomForestClassifier()
model_rf = classifier_rf.fit(x_train, y_train)


rf_accuracies_validation = cross_val_score(estimator=classifier_rf, X=x_train, y=y_train, cv=10)
print(f"Random Forest Classifier Model 10-fold cross-validation score on training set is: {round(rf_accuracies_validation.mean()*100)}%")


predictions_validation_set_rf = model_rf.predict(x_test)


print(f"Accuracy of the Random Forest model on validation: {round(accuracy_score(y_test, predictions_validation_set_rf)*100)}%")
print("Classification Report on validation: \n", classification_report(y_test, predictions_validation_set_rf))


rf_accuracies_test = cross_val_score(estimator=classifier_rf, X=x_valid, y=y_valid, cv=10)
print(f"Random Forest Classifier Model 10-fold cross-validation score on testing set is: {round(rf_accuracies_test.mean()*100)}%")

Random Forest Classifier Model 10-fold cross-validation score on training set is: 72%
Accuracy of the Random Forest model on validation: 73%
Classification Report on validation: 
               precision    recall  f1-score   support

           0       0.70      0.76      0.73       667
           1       0.76      0.70      0.73       717

    accuracy                           0.73      1384
   macro avg       0.73      0.73      0.73      1384
weighted avg       0.73      0.73      0.73      1384

Random Forest Classifier Model 10-fold cross-validation score on testing set is: 64%


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
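A minimal sketch of two of the listed methods, DBSCAN and hierarchical (agglomerative) clustering, applied to TF-IDF vectors. The six example reviews and the parameter values (`eps=0.8`, `min_samples=2`, two clusters) are assumptions chosen for this toy corpus and would need tuning on the real reviews.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus with two obvious themes (battery vs. cracked screen).
docs = [
    "battery life is great",
    "great battery and long battery life",
    "battery lasts all day great",
    "screen cracked on arrival",
    "the screen arrived cracked",
    "cracked screen very disappointed",
]

# Dense TF-IDF matrix; agglomerative clustering needs a dense array.
X = TfidfVectorizer().fit_transform(docs).toarray()

# DBSCAN: density-based, no cluster count needed; label -1 marks noise points.
db_labels = DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit_predict(X)

# Hierarchical clustering with the default Ward linkage, cut at two clusters.
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

for doc, d, h in zip(docs, db_labels, hier_labels):
    print(f"DBSCAN={d:2d}  hierarchical={h}  {doc}")
```

On sparse, high-dimensional review text, DBSCAN is sensitive to `eps`; a value that is too small tends to label most reviews as noise (`-1`).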

In [9]:
# Write your code here
import pandas as pd
#load the data
sample=pd.read_csv("Amazon_Unlocked_Mobile.csv")
#check the loaded data
print(sample.shape)
sample.head()

(413840, 6)


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [10]:
# IMPORT REQUIRED LIBRARIES
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
cluster_df =sample.head(1100)
cluster_df

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0
...,...,...,...,...,...,...
1095,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,,161.06,5,very good,0.0
1096,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,,161.06,5,"Excellent phone for the price, dual sim, unloc...",0.0
1097,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,,161.06,5,Excellent phone functions recommend,0.0
1098,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,,161.06,5,I'm so surprise about this phone...great quali...,0.0


In [13]:
# Special characters removal
cluster_df['After noise removal'] = cluster_df['Reviews'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', charctr) for charctr in x ))

# Punctuation removal
cluster_df['Punctuation removal'] = cluster_df['After noise removal'].str.replace(r'[^\w\s]', '', regex=True)

# Remove numbers
cluster_df['Remove numbers'] = cluster_df['Punctuation removal'].str.replace(r'\d+', '', regex=True)

# Stopwords removal
stop_word = stopwords.words('english')
cluster_df['Stopwords removal'] = cluster_df['Remove numbers'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_word))

# Lower Casing
cluster_df['Lower casing'] = cluster_df['Stopwords removal'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Tokenization
cluster_df['Tokenization'] = cluster_df['Lower casing'].apply(lambda x: TextBlob(x).words)

# Stemming
st = PorterStemmer()
cluster_df['Stemming'] = cluster_df['Tokenization'].apply(lambda x: " ".join([st.stem(word) for word in x]))

# Lemmatization
cluster_df['Lemmatization'] = cluster_df['Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
cluster_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_df['After noise removal'] = cluster_df['Reviews'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', charctr) for charctr in x ))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_df['Punctuation removal'] = cluster_df['After noise removal'].str.replace('[^\w\s]','')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#r

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel LUCKY found used phone us used hard pho...,i feel lucky found used phone us used hard pho...,"[i, feel, lucky, found, used, phone, us, used,...",i feel lucki found use phone us use hard phone...,i feel lucki found use phone u use hard phone ...
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice grade pantach revue Very clean...,nice phone nice grade pantach revue very clean...,"[nice, phone, nice, grade, pantach, revue, ver...",nice phone nice grade pantach revu veri clean ...,nice phone nice grade pantach revu veri clean ...
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,Very pleased,Very pleased,Very pleased,Very pleased,very pleased,"[very, pleased]",veri pleas,veri plea
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good goes slow sometimes good phone I...,it works good goes slow sometimes good phone i...,"[it, works, good, goes, slow, sometimes, good,...",it work good goe slow sometim good phone i love,it work good goe slow sometim good phone i love
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone replace lost phone The thing volum...,great phone replace lost phone the thing volum...,"[great, phone, replace, lost, phone, the, thin...",great phone replac lost phone the thing volum ...,great phone replac lost phone the thing volum ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,,161.06,5,very good,0.0,very good,very good,very good,good,good,[good],good,good
1096,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,,161.06,5,"Excellent phone for the price, dual sim, unloc...",0.0,Excellent phone for the price dual sim unloc...,Excellent phone for the price dual sim unloc...,Excellent phone for the price dual sim unloc...,Excellent phone price dual sim unlocked Google...,excellent phone price dual sim unlocked google...,"[excellent, phone, price, dual, sim, unlocked,...",excel phone price dual sim unlock googl play s...,excel phone price dual sim unlock googl play s...
1097,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,,161.06,5,Excellent phone functions recommend,0.0,Excellent phone functions recommend,Excellent phone functions recommend,Excellent phone functions recommend,Excellent phone functions recommend,excellent phone functions recommend,"[excellent, phone, functions, recommend]",excel phone function recommend,excel phone function recommend
1098,5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9...,,161.06,5,I'm so surprise about this phone...great quali...,0.0,I m so surprise about this phone great quali...,I m so surprise about this phone great quali...,I m so surprise about this phone great quali...,I surprise phone great quality In cheap price ...,i surprise phone great quality in cheap price ...,"[i, surprise, phone, great, quality, in, cheap...",i surpris phone great qualiti in cheap price h...,i surpris phone great qualiti in cheap price h...


In [16]:
cluster_df.columns

Index(['Product Name', 'Brand Name', 'Price', 'Rating', 'Reviews',
       'Review Votes', 'After noise removal', 'Punctuation removal',
       'Remove numbers', 'Stopwords removal', 'Lower casing', 'Tokenization',
       'Stemming', 'Lemmatization'],
      dtype='object')

In [17]:
#kmeans clustering using TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(cluster_df['Lemmatization'].values)
tfidf.shape

(1100, 2731)

In [18]:
from sklearn.cluster import KMeans
model_tf = KMeans(n_clusters = 10,random_state=99)
model_tf.fit(tfidf)

In [21]:
labels_tf = model_tf.labels_
cluster_center_tf=model_tf.cluster_centers_
cluster_center_tf

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.0014264 , 0.        , 0.00130851, ..., 0.00110245, 0.        ,
        0.        ]])

In [22]:
terms1 = tfidf_vect.vocabulary_.keys()
terms_list = list(terms1)
terms_list[1:100]

['lucki',
 'found',
 'use',
 'phone',
 'hard',
 'line',
 'someon',
 'upgrad',
 'sold',
 'one',
 'my',
 'son',
 'like',
 'old',
 'final',
 'fell',
 'apart',
 'year',
 'want',
 'thank',
 'seller',
 'realli',
 'appreci',
 'honesti',
 'said',
 'recommend',
 'highli',
 'would',
 'nice',
 'grade',
 'pantach',
 'revu',
 'veri',
 'clean',
 'set',
 'easi',
 'never',
 'android',
 'fantast',
 'say',
 'least',
 'perfect',
 'size',
 'surf',
 'social',
 'medium',
 'great',
 'samsung',
 'plea',
 'it',
 'work',
 'good',
 'goe',
 'slow',
 'sometim',
 'love',
 'replac',
 'lost',
 'the',
 'thing',
 'volum',
 'button',
 'still',
 'go',
 'adjust',
 'other',
 'job',
 'elig',
 'thaank',
 'alreadi',
 'problem',
 'know',
 'state',
 'dang',
 'charg',
 'wish',
 'read',
 'comment',
 'purchas',
 'item',
 'crack',
 'side',
 'damag',
 'if',
 'tri',
 'anoth',
 'way',
 'request',
 'money',
 'back',
 'and',
 'will',
 'get',
 'sign',
 'an',
 'unhappi',
 'custom',
 'port',
 'loo']

In [23]:
from sklearn import metrics
silhouette_score_tf = metrics.silhouette_score(tfidf, labels_tf, metric='euclidean')
silhouette_score_tf

0.08849690869824695
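The silhouette score reported above (about 0.09) is close to 0, which usually indicates overlapping clusters. A minimal sketch of comparing several values of `n_clusters` by silhouette score, assuming a small hypothetical corpus (`docs`) in place of the review data and an arbitrary candidate range of 2 to 4:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical mini-corpus with three rough themes.
docs = [
    "battery life is great",
    "great battery and long battery life",
    "battery lasts all day",
    "screen cracked on arrival",
    "the screen arrived cracked",
    "cracked screen very disappointed",
    "fast shipping and good seller",
    "seller shipped fast",
    "good seller quick shipping",
]

X = TfidfVectorizer().fit_transform(docs)

# Fit K-means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=99)
    cluster_labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, cluster_labels, metric="euclidean")

# Pick the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

The silhouette score ranges from -1 to 1; higher values mean points sit closer to their own cluster than to neighbouring ones.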

In [24]:
df1 = cluster_df
df1['Tfidf Clus Label'] = model_tf.labels_
df1.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['Tfidf Clus Label'] = model_tf.labels_


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization,Tfidf Clus Label
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel LUCKY found used phone us used hard pho...,i feel lucky found used phone us used hard pho...,"[i, feel, lucky, found, used, phone, us, used,...",i feel lucki found use phone us use hard phone...,i feel lucki found use phone u use hard phone ...,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice grade pantach revue Very clean...,nice phone nice grade pantach revue very clean...,"[nice, phone, nice, grade, pantach, revue, ver...",nice phone nice grade pantach revu veri clean ...,nice phone nice grade pantach revu veri clean ...,4
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,Very pleased,Very pleased,Very pleased,Very pleased,very pleased,"[very, pleased]",veri pleas,veri plea,6
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good goes slow sometimes good phone I...,it works good goes slow sometimes good phone i...,"[it, works, good, goes, slow, sometimes, good,...",it work good goe slow sometim good phone i love,it work good goe slow sometim good phone i love,0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone replace lost phone The thing volum...,great phone replace lost phone the thing volum...,"[great, phone, replace, lost, phone, the, thin...",great phone replac lost phone the thing volum ...,great phone replac lost phone the thing volum ...,1


In [25]:
df1.groupby(['Tfidf Clus Label'])['Reviews'].count()



Tfidf Clus Label
0     76
1     27
2     55
3     67
4    225
5     41
6    211
7     57
8     54
9    287
Name: Reviews, dtype: int64

In [26]:
# Print three sample reviews from each cluster.
groups = df1.groupby('Tfidf Clus Label').groups  # cluster label -> index labels of its rows

for i in range(10):
    print("Three sample reviews assigned to cluster", i)
    print("-" * 70)
    # groups[i] holds index labels, so rows are looked up with .loc;
    # positions 0, 5 and 10 pick the 1st, 6th and 11th member of the cluster.
    for pos in (0, 5, 10):
        print(df1.loc[groups[i][pos], 'Reviews'])
        print('\n')
    print("_" * 70)

Three sample reviews assigned to cluster 0
----------------------------------------------------------------------
It works good but it goes slow sometimes but its a very good phone I love it


Ver Good!


Good phone for elderly people.


______________________________________________________________________
Three sample reviews assigned to cluster 1
----------------------------------------------------------------------
I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!


The battery was old & had been over used because it barely holds a charge. Otherwise, no issues with the phone itself.


I bought this phone for my 91 year old aunt. It seems very easy for to use. Would certain

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
bow = count_vect.fit_transform(cluster_df['Reviews'].values)
bow.shape


(1100, 3611)

In [28]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters = 10,init='k-means++',random_state=99)
model.fit(bow)

In [29]:
labels = model.labels_
cluster_center=model.cluster_centers_

In [30]:
from sklearn import metrics
print(metrics.silhouette_score(bow, labels, metric='euclidean'))

0.43021711451928923


In [31]:
cluster_df['Bow Label'] = model.labels_
cluster_df.head()




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_df['Bow Label'] = model.labels_


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization,Tfidf Clus Label,Bow Label
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel LUCKY found used phone us used hard pho...,i feel lucky found used phone us used hard pho...,"[i, feel, lucky, found, used, phone, us, used,...",i feel lucki found use phone us use hard phone...,i feel lucki found use phone u use hard phone ...,1,0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice grade pantach revue Very clean...,nice phone nice grade pantach revue very clean...,"[nice, phone, nice, grade, pantach, revue, ver...",nice phone nice grade pantach revu veri clean ...,nice phone nice grade pantach revu veri clean ...,4,0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,Very pleased,Very pleased,Very pleased,Very pleased,very pleased,"[very, pleased]",veri pleas,veri plea,6,0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good goes slow sometimes good phone I...,it works good goes slow sometimes good phone i...,"[it, works, good, goes, slow, sometimes, good,...",it work good goe slow sometim good phone i love,it work good goe slow sometim good phone i love,0,0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone replace lost phone The thing volum...,great phone replace lost phone the thing volum...,"[great, phone, replace, lost, phone, the, thin...",great phone replac lost phone the thing volum ...,great phone replac lost phone the thing volum ...,1,2


In [32]:
from sklearn.cluster import DBSCAN
import numpy as np

minPts = 2 * 100

def lower_bound(nums, target):
    # Binary search on a sorted list: return the index of the first element
    # that is >= target (len(nums) if there is no such element).
    l, r = 0, len(nums) - 1
    while l <= r:
        mid = l + (r - l) // 2
        if nums[mid] >= target:
            r = mid - 1
        else:
            l = mid + 1
    return l

def compute200thnearestneighbour(x, data):
    # Return the squared Euclidean distance from x to its 200th closest row in data.
    # x counts as its own neighbour when it is part of data; data needs at least 200 rows.
    dists = []  # kept sorted; holds at most the 200 smallest distances seen so far
    for val in data:
        dist = np.sum((x - val) ** 2)
        if len(dists) < 200 or dist < dists[199]:
            # Insert at the sorted position, then drop anything beyond the 200 smallest.
            dists.insert(lower_bound(dists, dist), dist)
            del dists[200:]

    return dists[199]

In [33]:
list_of_sent_train = list()

for i in cluster_df["Lower casing"].values:
  list_of_sent_train.append(i.split())

In [34]:
import gensim
import numpy as np

# Train Word2Vec on the tokenised reviews; each word gets a 100-dimensional vector.
w2v_model = gensim.models.Word2Vec(list_of_sent_train, vector_size=100, workers=4)

# Represent each review by the average of its word vectors.
sent_vectors = []
for sent in list_of_sent_train:  # for each review
    sent_vec = np.zeros(100)
    cnt_words = 1  # starts at 1 so the division below never hits zero (this slightly shrinks the average)
    for word in sent:  # for each word in the review
        try:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
        except KeyError:
            # The word is not in the Word2Vec vocabulary (for example, too rare); skip it.
            pass
    sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
sent_vectors = np.array(sent_vectors)
sent_vectors = np.nan_to_num(sent_vectors)


twohundrethneigh = []
for val in sent_vectors[:300]:
    twohundrethneigh.append(compute200thnearestneighbour(val, sent_vectors[:300]) )
twohundrethneigh.sort()

# Training DBSCAN :
model = DBSCAN(eps = 5, min_samples = minPts, n_jobs=-1)
model.fit(sent_vectors)
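
The cell above computes sorted 200th-neighbour distances but then passes a fixed `eps = 5` to DBSCAN. One common way to use such a curve is to read `eps` off the point where the sorted k-distances start to rise sharply. The sketch below illustrates that idea under stated assumptions: `k_distances` and `eps_from_largest_jump` are hypothetical helper names, the "largest jump" rule is just one simple heuristic, and the distances here are plain Euclidean (the notebook's helper returns squared distances, whereas DBSCAN's `eps` is compared against unsquared ones).

```python
import numpy as np

def k_distances(points, k):
    """Sorted Euclidean distances from each point to its k-th nearest neighbour (itself excluded)."""
    points = np.asarray(points, dtype=float)
    sq_norms = (points ** 2).sum(axis=1)
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    # clipped at 0 to absorb floating-point rounding.
    sq_dists = np.maximum(
        sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T, 0.0
    )
    # After sorting each row, column 0 is the point itself, so column k is its k-th neighbour.
    kth = np.sqrt(np.sort(sq_dists, axis=1)[:, k])
    return np.sort(kth)

def eps_from_largest_jump(sorted_kdist):
    """Illustrative heuristic: return the k-distance just before the largest jump in the curve."""
    jumps = np.diff(sorted_kdist)
    return float(sorted_kdist[int(np.argmax(jumps))])

# Toy data: five evenly spaced points plus one far-away outlier.
toy_points = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [100.0]])
curve = k_distances(toy_points, k=1)
print(curve)
print(eps_from_largest_jump(curve))
```

Because the notebook's helper counts the query point itself as the first neighbour, its 200th distance would correspond to `k=199` here. In practice the curve is usually plotted and inspected by eye rather than trusted to a single rule.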


In [35]:
bow

<1100x3611 sparse matrix of type '<class 'numpy.int64'>'
	with 33667 stored elements in Compressed Sparse Row format>

In [36]:
cluster_df['AVG-W2V Clus Label'] = model.labels_
cluster_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_df['AVG-W2V Clus Label'] = model.labels_


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization,Tfidf Clus Label,Bow Label,AVG-W2V Clus Label
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel LUCKY found used phone us used hard pho...,i feel lucky found used phone us used hard pho...,"[i, feel, lucky, found, used, phone, us, used,...",i feel lucki found use phone us use hard phone...,i feel lucki found use phone u use hard phone ...,1,0,0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice grade pantach revue Very clean...,nice phone nice grade pantach revue very clean...,"[nice, phone, nice, grade, pantach, revue, ver...",nice phone nice grade pantach revu veri clean ...,nice phone nice grade pantach revu veri clean ...,4,0,0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,Very pleased,Very pleased,Very pleased,Very pleased,very pleased,"[very, pleased]",veri pleas,veri plea,6,0,0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good goes slow sometimes good phone I...,it works good goes slow sometimes good phone i...,"[it, works, good, goes, slow, sometimes, good,...",it work good goe slow sometim good phone i love,it work good goe slow sometim good phone i love,0,0,0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone replace lost phone The thing volum...,great phone replace lost phone the thing volum...,"[great, phone, replace, lost, phone, the, thin...",great phone replac lost phone the thing volum ...,great phone replac lost phone the thing volum ...,1,2,0
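
All five rows shown above carry label 0, which does not reveal how DBSCAN treated the rest of the data. scikit-learn's DBSCAN marks noise points with the label -1, so counting labels is a quick way to check whether `eps = 5` and `min_samples = 200` produced several clusters, a single cluster, or mostly noise. The helper below is a minimal sketch; the name `summarize_dbscan_labels` and the toy label list are assumptions for illustration.

```python
from collections import Counter

def summarize_dbscan_labels(labels):
    """Count clusters, noise points (label -1) and cluster sizes for a DBSCAN labelling."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # -1 is DBSCAN's noise label, not a cluster
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(sorted(counts.items())),
    }

# Toy labelling: two clusters (sizes 3 and 1) and two noise points.
toy_summary = summarize_dbscan_labels([0, 0, 1, -1, -1, 0])
print(toy_summary)
```

In the notebook this could be called as `summarize_dbscan_labels(model.labels_)` right after fitting DBSCAN, before `model` is reassigned.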


In [39]:
import scipy
from scipy.cluster import hierarchy


In [43]:
from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
Agg=cluster.fit_predict(sent_vectors)

aggdfa = cluster_df
aggdfa['AVG-W2V Clus Label'] = cluster.labels_
aggdfa.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  aggdfa['AVG-W2V Clus Label'] = cluster.labels_


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization,Tfidf Clus Label,Bow Label,AVG-W2V Clus Label
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel LUCKY found used phone us used hard pho...,i feel lucky found used phone us used hard pho...,"[i, feel, lucky, found, used, phone, us, used,...",i feel lucki found use phone us use hard phone...,i feel lucki found use phone u use hard phone ...,1,0,0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice grade pantach revue Very clean...,nice phone nice grade pantach revue very clean...,"[nice, phone, nice, grade, pantach, revue, ver...",nice phone nice grade pantach revu veri clean ...,nice phone nice grade pantach revu veri clean ...,4,0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,Very pleased,Very pleased,Very pleased,Very pleased,very pleased,"[very, pleased]",veri pleas,veri plea,6,0,3
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good goes slow sometimes good phone I...,it works good goes slow sometimes good phone i...,"[it, works, good, goes, slow, sometimes, good,...",it work good goe slow sometim good phone i love,it work good goe slow sometim good phone i love,0,0,0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone replace lost phone The thing volum...,great phone replace lost phone the thing volum...,"[great, phone, replace, lost, phone, the, thin...",great phone replac lost phone the thing volum ...,great phone replac lost phone the thing volum ...,1,2,0


In [44]:
aggdfa.groupby(['AVG-W2V Clus Label'])['Reviews'].count()



AVG-W2V Clus Label
0    399
1    390
2    102
3     79
4    130
Name: Reviews, dtype: int64
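
Cluster sizes alone do not show whether two methods group the same reviews together. A contingency count of label pairs is one simple way to compare two labelings, for example the `Tfidf Clus Label` and `Bow Label` columns. The sketch below uses only the standard library; `label_contingency` is a hypothetical helper name and the input lists are toy data.

```python
from collections import Counter

def label_contingency(labels_a, labels_b):
    """Count how often each (label_a, label_b) pair occurs across the same rows."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same rows")
    return Counter(zip(labels_a, labels_b))

# Toy example: method A puts rows 0-1 together; method B agrees on those but splits rows 2-3.
toy_table = label_contingency([0, 0, 1, 1], [1, 1, 0, 2])
for (a, b), n in sorted(toy_table.items()):
    print(f"A={a} B={b}: {n} rows")
```

If most of the mass for each cluster of one method falls into a single cluster of the other, the two methods largely agree; a count spread over many pairs suggests they partition the reviews differently.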

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the question above in as much detail as possible):

'''
Please write your answer here:





'''