<a href="https://colab.research.google.com/github/MohanaSrinitha/Mohana_INF05731_Spring2024/blob/main/Shaga_Mohana_Exercise_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [7]:
# Write your code here
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection
from sklearn import naive_bayes, svm
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import warnings

nltk.download('stopwords')
nltk.download('wordnet')

warnings.filterwarnings('ignore')


def read_data(file_path):
    text_data, sentiments = [], []
    file_data = open(file_path).read()
    for i, j in enumerate(file_data.split("\n")):
        after_split = j.split(' ')
        text_data.append(" ".join(after_split[1:]))
        sentiments.append(after_split[0])
    return text_data, sentiments


def preprocess_data(data_frame):
    # Convert sentiment classes to integers
    data_frame['Sentimental Value'] = pd.to_numeric(data_frame['Sentimental Value'], errors='coerce')

    # Drop rows with NaN values in the target variable
    data_frame = data_frame.dropna(subset=['Sentimental Value'])

    # Pre-processing
    # removal of special characters
    data_frame['After noise removal'] = data_frame['Raw Data'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', char) for char in x))

    # removal of Punctuation
    data_frame['Punctuation removal'] = data_frame['After noise removal'].str.replace('[^\w\s]', '')

    # Stopwords removal
    stop_word = stopwords.words('english')
    data_frame['Stopwords removal'] = data_frame['Punctuation removal'].apply(
        lambda x: " ".join(word for word in x.split() if word not in stop_word))

    # Lower Casing
    data_frame['Lower casing'] = data_frame['Stopwords removal'].apply(lambda x: " ".join(word.lower() for word in x.split()))

    return data_frame


def vectorize_data(train_df, test_df):
    # TF-IDF Vectorization
    tfidf_vector = TfidfVectorizer(analyzer='word')
    tfidf_vector.fit(train_df['Lower casing'])
    x_train = tfidf_vector.transform(train_df['Lower casing'])
    tfidf_vector_test = TfidfVectorizer(analyzer='word', vocabulary=tfidf_vector.vocabulary_)
    tfidf_vector_test.fit(test_df['Lower casing'])
    x_test = tfidf_vector_test.transform(test_df['Lower casing'])

    return x_train, x_test


def evaluate_classifier(classifier, x_train, y_train, x_test, y_test):
    classifier.fit(x_train, y_train)
    predicted = classifier.predict(x_test)
    accuracy = accuracy_score(predicted, y_test)
    print(f"Accuracy of Training data ({classifier.__class__.__name__}): {accuracy}")

    predicted_testing = classifier.predict(x_test)
    accuracy_testing = accuracy_score(predicted_testing, y_test)
    print(f"Accuracy of Testing data ({classifier.__class__.__name__}): {accuracy_testing}")

    if 'XGB' not in str(classifier):
        scoring = 'accuracy'
        kfold = KFold(10, random_state=7, shuffle=True)
        cross_val = cross_val_score(classifier, x_test, y_test, cv=kfold, scoring=scoring).mean()
        print(f"Cross-validation score ({classifier.__class__.__name__}): {cross_val}")

    print(f"\nClassification Report ({classifier.__class__.__name__}):\n")
    print(classification_report(y_test, predicted))


# Read data
training_text_data, training_sentiments = read_data('stsa-train.txt')
testing_text_data, testing_sentiments = read_data('stsa-test.txt')

# Create data frames
training_df = pd.DataFrame(list(zip(training_sentiments, training_text_data)), columns=['Sentimental Value', 'Raw Data'])
testing_df = pd.DataFrame(list(zip(testing_sentiments, testing_text_data)), columns=['Sentimental Value', 'Raw Data'])

# Pre-process data
training_df = preprocess_data(training_df)
testing_df = preprocess_data(testing_df)

# Vectorize data
x_train, x_test = vectorize_data(training_df, testing_df)

# Split data for training and testing
x_train, x_valid, y_train, y_valid = model_selection.train_test_split(x_train, training_df['Sentimental Value'],
                                                                      test_size=0.2, random_state=42)

# Ensure target variables are integers
y_train = y_train.astype(int)
y_valid = y_valid.astype(int)

# List of classifiers
classifiers = [
    naive_bayes.MultinomialNB(),
    svm.SVC(),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    XGBClassifier()
]

# Evaluate each classifier
for classifier in classifiers:
    print(f"Evaluating {classifier.__class__.__name__}:")
    evaluate_classifier(classifier, x_train, y_train, x_valid, y_valid)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Evaluating MultinomialNB:
Accuracy of Training data (MultinomialNB): 0.7911849710982659
Accuracy of Testing data (MultinomialNB): 0.7911849710982659
Cross-validation score (MultinomialNB): 0.7015848191012408

Classification Report (MultinomialNB):

              precision    recall  f1-score   support

           0       0.86      0.69      0.76       671
           1       0.75      0.89      0.81       713

    accuracy                           0.79      1384
   macro avg       0.80      0.79      0.79      1384
weighted avg       0.80      0.79      0.79      1384

Evaluating SVC:
Accuracy of Training data (SVC): 0.796242774566474
Accuracy of Testing data (SVC): 0.796242774566474
Cross-validation score (SVC): 0.7008758210822645

Classification Report (SVC):

              precision    recall  f1-score   support

           0       0.83      0.73      0.78       671
           1       0.77      0.86      0.81       713

    accuracy                           0.80      1384
   macro 

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [3]:
import nltk
nltk.download()
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def data_preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.lower()
    sentence=sentence.replace('{html}',"")
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    rem_url=re.sub(r'http\S+', '',cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 if not w in stopwords.words('english')]
    stem_words=[stemmer.stem(w) for w in filtered_words]
    lemma_words=[lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(filtered_words)

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all


    Downloading collection 'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /root/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package averaged_perceptron_tagger to
       |     /root/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger.zip.
       | Downloading package averaged_perceptron_tagger_ru to
       |     /root/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger_ru.zip.
       | Downloading package basque_grammars to /root/nltk_data...
       |   Unzipping grammars/basque_grammars.zip.
       | Downloading package bcp47 to /root/nltk_data...
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Unzipping corpora/biocreative_ppi.zip.
       | Downloading package bllip_wsj_no_aux to /root/nltk_data...
       |   Unzipping models/bllip_wsj_no_aux.zip.
       | Downloading package book_grammars to


---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


In [19]:
import pandas as pd
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

In [13]:
new =df.head(1000)
new['Reviews']

0      I feel so LUCKY to have found this used (phone...
1      nice phone, nice up grade from my pantach revu...
2                                           Very pleased
3      It works good but it goes slow sometimes but i...
4      Great phone to replace my lost phone. The only...
                             ...                        
995    It's a decent for the price.. I've had this on...
996                                   Is good cell phone
997    Amazing phone. Cables and case included, also ...
998                                             Excelent
999       Excellent, it meets the requirements requested
Name: Reviews, Length: 1000, dtype: object

In [20]:
#importing necessary libraries
import re
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
nltk.download('wordnet')
# Special characters removal
new['After noise removal'] = new['Reviews'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', charctr) for charctr in x ))

# Punctuation removal
new['Punctuation removal'] = new['After noise removal'].str.replace('[^\w\s]','')

# Remove numbers
new['Remove numbers'] = new['Punctuation removal'].str.replace('\d+', '')

# Stopwords removal
stop_word = stopwords.words('english')
new['Stopwords removal'] = new['Remove numbers'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_word))

# Lower Casing
new['Lower casing'] = new['Stopwords removal'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Tokenization
new['Tokenization'] = new['Lower casing'].apply(lambda x: TextBlob(x).words)

# Stemming
st = PorterStemmer()
new['Stemming'] = new['Tokenization'].apply(lambda x: " ".join([st.stem(word) for word in x]))

# Lemmatization
new['Lemmatization'] = new['Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
new['Lemmatization']

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0      i feel lucki found use phone u use hard phone ...
1      nice phone nice grade pantach revu veri clean ...
2                                              veri plea
3        it work good goe slow sometim good phone i love
4      great phone replac lost phone the thing volum ...
                             ...                        
995    it decent price i one 6 month the con 1 i pret...
996                                   is good cell phone
997    amaz phone cabl case includ also screen pritec...
998                                                excel
999                            excel meet requir request
Name: Lemmatization, Length: 1000, dtype: object

In [21]:
#Implementing kmeans usind TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer  # Import the TfidfVectorizer class for generating TF-IDF features

# Instantiate a TfidfVectorizer object with default parameters
tfidf_vect = TfidfVectorizer()

# Use the fit_transform method of the TfidfVectorizer object to generate a TF-IDF matrix from the 'Lemmatization' column of the 'new' dataframe
tfidf = tfidf_vect.fit_transform(new['Lemmatization'].values)

# Print the shape of the resulting TF-IDF matrix
tfidf.shape

(1000, 2620)

In [23]:
from sklearn.cluster import KMeans  # Import the KMeans class from scikit-learn

# Instantiate a KMeans object with 10 clusters and a random state of 99
model_tf = KMeans(n_clusters=10, random_state=99)

# Use the fit method of the KMeans object to fit the model to the TF-IDF matrix generated in the previous step
model_tf.fit(tfidf)



In [24]:
# Assign the cluster labels generated by the KMeans algorithm to a variable
labels_tf = model_tf.labels_

# Assign the cluster centers generated by the KMeans algorithm to a variable
cluster_center_tf = model_tf.cluster_centers_

# Print the cluster centers to the console
print(cluster_center_tf)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.00097592 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [25]:
# Assign the cluster labels generated by the KMeans algorithm to a variable
labels_tf = model_tf.labels_

# Assign the cluster centers generated by the KMeans algorithm to a variable
cluster_center_tf = model_tf.cluster_centers_

# Print the cluster centers to the console
print(cluster_center_tf)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.00097592 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [26]:
# Use the get_feature_names_out method of the TfidfVectorizer object to extract the terms in the corpus
terms1 = tfidf_vect.get_feature_names_out()

# Print the first 100 terms to the console
print(terms1[1:100])

['00pm' '03' '04' '0mp' '10' '100' '1080p' '10k' '11' '110' '115' '12'
 '13' '14' '1400' '15' '16' '169' '1700' '178mb' '18' '1800' '1900' '1999'
 '1gb' '1it' '1st' '1thi' '1x' '20' '2001' '200mb' '200ppi' '2012' '2013'
 '2014' '2015' '2016' '2017' '2100' '2100mhz' '2250' '2300mah' '24' '25'
 '25cm' '288' '29' '2g' '2gb' '2mp' '2nd' '2pm' '2sim' '30' '300'
 '300megabyt' '302' '303' '30ish' '30text' '31' '32' '32g' '32gb' '39'
 '3d' '3g' '3gcellphon' '3inch' '3week' '40' '400mp2' '450' '4g' '4gb'
 '50' '500' '512mb' '55' '55th' '5c' '5g' '5in' '5j' '5mm' '5mp' '5th'
 '60' '600' '628' '635' '64' '64bit' '64gig' '6in' '70' '700' '710']


In [27]:
df1 = new  # Assign the DataFrame 'new' to a new variable 'df1'
df1['Tfidf Clus Label'] = model_tf.labels_  # Add a new column 'Tfidf Clus Label' to 'df1' containing the clustering labels obtained from 'model_tf'
df1[['Lemmatization','Tfidf Clus Label']].head()  # Select the 'Lemmatization' and 'Tfidf Clus Label' columns from 'df1' and display the first five rows using the head() method

Unnamed: 0,Lemmatization,Tfidf Clus Label
0,i feel lucki found use phone u use hard phone ...,5
1,nice phone nice grade pantach revu veri clean ...,5
2,veri plea,1
3,it work good goe slow sometim good phone i love,4
4,great phone replac lost phone the thing volum ...,5


In [28]:
df1.groupby(['Tfidf Clus Label'])['Reviews'].count()

Tfidf Clus Label
0     32
1    384
2     35
3     51
4     61
5    255
6     59
7     59
8     21
9     43
Name: Reviews, dtype: int64

In [29]:
print("Top clusters:")
order_centroids = model_tf.cluster_centers_.argsort()[:, ::-1]
for i in range(10):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms1[ind], end='')
        print()

Top clusters:
Cluster 0: arriv
 return
 dead
 wast
 product
 work
 thi
 phone
 disappoint
 condit
Cluster 1: work
 phone
 it
 bad
 good
 like
 bueno
 use
 well
 fine
Cluster 2: excelent
 telefono
 bueno
 muy
 producto
 far
 fastest
 faster
 fast
 fashion
Cluster 3: excel
 product
 recommend
 seller
 thank
 100
 phone
 fash
 function
 five
Cluster 4: good
 veri
 phone
 product
 price
 buy
 tank
 thank
 mobil
 recomend
Cluster 5: phone
 use
 work
 the
 it
 card
 sim
 would
 one
 thi
Cluster 6: love
 phone
 it
 best
 like
 screen
 buy
 price
 camera
 good
Cluster 7: great
 phone
 work
 easi
 use
 love
 expect
 good
 thi
 price
Cluster 8: ok
 far
 wcdma
 it
 quit
 peopl
 someon
 buy
 slow
 phone
Cluster 9: the
 horribl
 phone
 work
 item
 overal
 pictur
 good
 never
 take


In [30]:
for i in range(10):
    print("4 reviews of ensured to cluster ", i)
    print("-" * 70)
    print(df1.iloc[df1.groupby(['Tfidf Clus Label']).groups[i][0]]['Reviews'])
    print('\n')
    print(df1.iloc[df1.groupby(['Tfidf Clus Label']).groups[i][2]]['Reviews'])
    print('\n')
    print(df1.iloc[df1.groupby(['Tfidf Clus Label']).groups[i][6]]['Reviews'])
    print('\n')
    print("_" * 70)

4 reviews of ensured to cluster  0
----------------------------------------------------------------------
Great. Arrived quickly.


arrived broken and forgot to send back


The phone is not powering on I sent it to Jamaica WI and will return once I receive it back


______________________________________________________________________
4 reviews of ensured to cluster  1
----------------------------------------------------------------------
Very pleased


It's battery life is great. It's very responsive to touch. The only issue is that sometimes the screen goes black and you have to press the top button several times to get the screen to re-illuminate.


unfortunately Sprint could not activate the phone due to the blocking issue with the phone, the matter was handled very well and quickly. Very satisfied with the service.


______________________________________________________________________
4 reviews of ensured to cluster  2
-----------------------------------------------------------

In [31]:
# Generating bag of words features and Implementing K-means.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
bow = count_vect.fit_transform(new['Reviews'].values)

In [32]:
from sklearn.cluster import KMeans  # Importing the KMeans clustering algorithm from sklearn.cluster library
model = KMeans(n_clusters = 10,init='k-means++',random_state=99)  # Creating an instance of the KMeans clustering algorithm with 10 clusters, k-means++ initialization, and a fixed random state of 99.
model.fit(bow)  # Fitting the KMeans clustering model on the bag-of-words (bow) matrix.



In [34]:
labels = model.labels_  # Obtaining the cluster labels for each data point in the input matrix using the 'labels_' attribute of the KMeans model
cluster_center = model.cluster_centers_  # Obtaining the cluster centers (centroids) for each of the clusters using the 'cluster_centers_' attribute of the KMeans model

In [35]:
from sklearn import metrics
print(metrics.silhouette_score(bow, labels, metric='euclidean'))

0.39528026327024723


In [36]:
new['Bow Label'] = model.labels_
new[['Lemmatization','Bow Label']].head()

Unnamed: 0,Lemmatization,Bow Label
0,i feel lucki found use phone u use hard phone ...,9
1,nice phone nice grade pantach revu veri clean ...,0
2,veri plea,0
3,it work good goe slow sometim good phone i love,0
4,great phone replac lost phone the thing volum ...,9


In [37]:
#Implementing DBSCAN
from sklearn.cluster import DBSCAN  # Importing the DBSCAN clustering algorithm from the sklearn.cluster library
import numpy as np  # Importing the numpy library and aliasing it as np

minPts = 2 * 100  # Setting the value of the minimum points parameter for DBSCAN algorithm to be twice the dimensionality of the dataset

def lower_bound(nums, target):  # Defining a function to return the number in the array just greater than or equal to itself
    l, r = 0, len(nums) - 1
    while l <= r:  # Implementing binary search to find the nearest number
        mid = int(l + (r - l) / 2)
        if nums[mid] >= target:
            r = mid - 1
        else:
            l = mid + 1
    return l

def compute200thnearestneighbour(x, data):  # Defining a function to compute the 200th nearest neighbor of a point in the dataset
    dists = []  # Initializing an empty list to store the distances
    for val in data:  # Computing the distance between the given point and all other points in the dataset
        dist = np.sum((x - val) **2 )
        if (len(dists) == 200 and dists[199] > dist):
          l = int(lower_bound(dists, dist))
          if l < 200 and l >= 0 and dists[l] > dist:
              dists[l] = dist
        else:
          dists.append(dist)
          dists.sort()

    return dists[199]  # Returning the distance to the 200th nearest neighbor

In [39]:
import gensim

# Train Word2Vec model
list_of_sent_train = list()
for i in new["Lower casing"].values:
  list_of_sent_train.append(i.split())
w2v_model = gensim.models.Word2Vec(list_of_sent_train, workers=4)

import numpy as np

# Create sentence vectors using trained Word2Vec model
sent_vectors = []
count = 1
for sent in list_of_sent_train:
    sent_vec = np.zeros(100)
    cnt_words = 1
    for word in sent:
        try:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
        except:
            pass
    sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
sent_vectors = np.array(sent_vectors)
sent_vectors = np.nan_to_num(sent_vectors)

# Calculate 200th nearest neighbor for each sentence vector
twohundrethneigh = []
for val in sent_vectors[:300]:
    twohundrethneigh.append(compute200thnearestneighbour(val, sent_vectors[:300]))
twohundrethneigh.sort()

# Train DBSCAN model
model = DBSCAN(eps=5, min_samples=minPts, n_jobs=-1)
model.fit(sent_vectors)

In [40]:
new['AVG-W2V Clus Label'] = model.labels_
new[['Lemmatization','AVG-W2V Clus Label']].head()

Unnamed: 0,Lemmatization,AVG-W2V Clus Label
0,i feel lucki found use phone u use hard phone ...,0
1,nice phone nice grade pantach revu veri clean ...,0
2,veri plea,0
3,it work good goe slow sometim good phone i love,0
4,great phone replac lost phone the thing volum ...,0


In [42]:
#Implementing Hierarchial Clustering
import scipy
from scipy.cluster import hierarchy

In [43]:
from sklearn.cluster import AgglomerativeClustering

# Instantiate AgglomerativeClustering object with the number of clusters, affinity measure, and linkage criteria
cluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')

# Fit the clustering model to the sent_vectors data using the fit_predict() method
Agg = cluster.fit_predict(sent_vectors)

# Create a new DataFrame called "hier" with the same data as the "new" DataFrame
hier = new

# Assign the cluster labels to the AVG-W2V Clus Label column of the hier DataFrame
hier['AVG-W2V Clus Label'] = cluster.labels_

# Display the first 5 rows of the hier DataFrame
hier.head(5)



Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization,Tfidf Clus Label,Bow Label,AVG-W2V Clus Label
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel LUCKY found used phone us used hard pho...,i feel lucky found used phone us used hard pho...,"[i, feel, lucky, found, used, phone, us, used,...",i feel lucki found use phone us use hard phone...,i feel lucki found use phone u use hard phone ...,5,9,2
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice grade pantach revue Very clean...,nice phone nice grade pantach revue very clean...,"[nice, phone, nice, grade, pantach, revue, ver...",nice phone nice grade pantach revu veri clean ...,nice phone nice grade pantach revu veri clean ...,5,0,2
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,Very pleased,Very pleased,Very pleased,Very pleased,very pleased,"[very, pleased]",veri pleas,veri plea,1,0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good goes slow sometimes good phone I...,it works good goes slow sometimes good phone i...,"[it, works, good, goes, slow, sometimes, good,...",it work good goe slow sometim good phone i love,it work good goe slow sometim good phone i love,4,0,4
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone replace lost phone The thing volum...,great phone replace lost phone the thing volum...,"[great, phone, replace, lost, phone, the, thin...",great phone replac lost phone the thing volum ...,great phone replac lost phone the thing volum ...,5,9,4


In [44]:
# Reading a review which belong to each group.
for i in range(5):
    print("2 reviews of assigned to cluster ", i)
    print("-" * 70)
    print(hier.iloc[hier.groupby(['AVG-W2V Clus Label']).groups[i][0]]['Lemmatization'])
    print('\n')
    print(hier.iloc[hier.groupby(['AVG-W2V Clus Label']).groups[i][1]]['Lemmatization'])
    print('\n')
    print("_" * 70)

2 reviews of assigned to cluster  0
----------------------------------------------------------------------
unfortun sprint could activ phone due block issu phone matter handl well quickli veri satisfi servic


pro beauti screen capabl run chrome take good pictur keyboard comfi fit pocket fit hand play realli good video gamescon bluetooth came late run fast realli quick doesnt play 1080p video i would still recommend like talk text lot like watch lot movi


______________________________________________________________________
2 reviews of assigned to cluster  1
----------------------------------------------------------------------
veri plea


describ fast ship


______________________________________________________________________
2 reviews of assigned to cluster  2
----------------------------------------------------------------------
i feel lucki found use phone u use hard phone line someon upgrad sold one my son like old one final fell apart 2 5 year want upgrad thank seller realli

In [45]:
hier.groupby(['AVG-W2V Clus Label'])['Reviews'].count()

AVG-W2V Clus Label
0    330
1     77
2    364
3     97
4    132
Name: Reviews, dtype: int64

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.
The distance between the data points and the centroids of the clusters itself serves as the basis for K-means clustering. For clustering based on density, DBSCAN is utilized. Here, the areas with the highest point densities are located and separated from void areas. As the name suggests, layering is a component of hierarchical clustering. Each data point is first taken into account as a separate cluster, and the two clusters closest to one another are then identified.
.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
This assignment helped me a lot in revising various machine learning algorithms such as MultinominalNB, SVM, KNN, Decision tree,
Random Forest,XGBoost, Word2Vec, BERT.For all these algorithms we are calculating the metrics as well.
For the text clustering task on the Amazon reviews dataset, I have applied K-means, DBSCAN, Hierarchical
clustering, Word2Vec, and BERT. Overall, in this exercise i have got a comprehensive hands-on experience with
different machine learning algorithms for text analysis, reinforced understanding of cross-validation
techniques, and emphasized the importance of choosing appropriate evaluation metrics based on the task at
hand.




'''