<a href="https://colab.research.google.com/github/17251A0404/Abhigna_INFO5731_Spring2024/blob/main/DARA_ABHIGNA_INFO5731_Exercise_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [4]:
import pandas as pd

#Load and format the training data into a data frame
sentiments = []
texts = []

with open('/content/stsa-train.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(' ', 1)
        sentiment = int(parts[0])
        text = parts[1]
        sentiments.append(sentiment)
        texts.append(text)

train_data = pd.DataFrame({'sentiment': sentiments, 'text': texts})

train_data

Unnamed: 0,sentiment,text
0,1,"a stirring , funny and finally transporting re..."
1,0,apparently reassembled from the cutting-room f...
2,0,they presume their audience wo n't sit still f...
3,1,this is a visually stunning rumination on love...
4,1,jonathan parker 's bartleby should have been t...
...,...,...
6915,1,"painful , horrifying and oppressively tragic ,..."
6916,0,take care is nicely performed by a quintet of ...
6917,0,"the script covers huge , heavy topics in a bla..."
6918,0,a seriously bad film with seriously warped log...


In [6]:
sentiments = []
texts = []

with open('/content/stsa-test.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(' ', 1)
        sentiment = int(parts[0])
        text = parts[1]
        sentiments.append(sentiment)
        texts.append(text)

test_data = pd.DataFrame({'sentiment': sentiments, 'text': texts})
test_data

Unnamed: 0,sentiment,text
0,0,"no movement , no yuks , not much of anything ."
1,0,"a gob of drivel so sickly sweet , even the eag..."
2,0,"gangs of new york is an unapologetic mess , wh..."
3,0,"we never really feel involved with the story ,..."
4,1,this is one of polanski 's best films .
...,...,...
1816,0,"an often-deadly boring , strange reading of a ..."
1817,0,the problem with concept films is that if the ...
1818,0,"safe conduct , however ambitious and well-inte..."
1819,0,"a film made with as little wit , interest , an..."


In [7]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import re

nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(sentence):
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = sentence.replace('{html}',"")
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    rem_url = re.sub(r'http\S+', '', cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 if not w in stopwords.words('english')]
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(lemma_words)

# Apply the preprocessing to the training data
train_data['text'] = train_data['text'].apply(preprocess)

# Apply the preprocessing to the test data
test_data['text'] = test_data['text'].apply(preprocess)

# Convert the text data into a numerical representation using TF-IDF
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data['text'])
y_train = train_data['sentiment']
X_test = vectorizer.transform(test_data['text'])
y_test = test_data['sentiment']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [10]:
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Get number of samples in the training data
num_samples = X_train.shape[0]

# Generate random indices for train-validation split
indices = np.random.permutation(num_samples)

# Calculate the number of samples for the validation set
num_validation_samples = int(0.2 * num_samples)

# Split the indices into train and validation indices
train_indices = indices[num_validation_samples:]
val_indices = indices[:num_validation_samples]

# Split the data using the generated indices
X_train, X_val = X_train[train_indices], X_train[val_indices]
y_train, y_val = y_train.iloc[train_indices], y_train.iloc[val_indices]


In [25]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Define classifiers
classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Support Vector Machine": SVC(),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

# Iterate over classifiers
for name, clf in classifiers.items():
    # Perform cross-validation
    scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    # Calculate mean accuracy
    mean_accuracy = scores.mean()
    # Print mean accuracy
    print(f"Mean Accuracy for {name}: {mean_accuracy}")

    # Iterate over classifiers
for name, clf in classifiers.items():
    # Perform cross-validation prediction
    y_pred = cross_val_predict(clf, X, y, cv=10)
    # Generate classification report
    report = classification_report(y, y_pred)
    # Print classification report
    print(f"Classification Report for {name}:\n{report}\n")



Mean Accuracy for Multinomial Naive Bayes: 0.9166666666666666
Mean Accuracy for Support Vector Machine: 0.95
Mean Accuracy for K Nearest Neighbors: 0.9416666666666667
Mean Accuracy for Decision Tree: 0.9249999999999998
Mean Accuracy for Random Forest: 0.9249999999999998
Mean Accuracy for XGBoost: 0.9083333333333332
Classification Report for Multinomial Naive Bayes:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.94      0.92      0.93        50
           2       0.92      0.94      0.93        50

    accuracy                           0.95       150
   macro avg       0.95      0.95      0.95       150
weighted avg       0.95      0.95      0.95       150


Classification Report for Support Vector Machine:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.96      0.96      0.96        50
           2       0.96      0.96    

In [26]:
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

nltk.download('punkt')

# Tokenize the reviews into sentences
sentences = [word_tokenize(review) for review in train_data['text']]

# train model
model = Word2Vec(sentences, min_count=1, vector_size=300)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.index_to_key)
# access vector for one sentence
print(model.wv['offer'])
print(model.wv['new'])
print(model.wv['insight'])
print(model.wv['matter'])
print(model.wv['charact'])
print(model.wv['exactli'])
print(model.wv['spring'])
print(model.wv['life'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Word2Vec<vocab=9570, vector_size=300, alpha=0.025>
[ 1.61269661e-02  2.55090028e-01 -2.24598628e-02  1.16221949e-01
 -4.71297316e-02 -2.30343372e-01  1.75586566e-01  4.30727154e-01
  1.06517240e-01 -7.82366768e-02 -3.22279986e-03 -1.35019332e-01
  2.49341205e-02  2.95640863e-02 -1.22543290e-01 -1.68480560e-01
  3.77864502e-02 -4.00182456e-02  9.25840735e-02 -8.07097703e-02
 -1.20324671e-01 -2.45682951e-02  1.51559070e-01  8.96050110e-02
  2.04476535e-01  2.01239530e-02 -2.27463096e-01 -3.55981849e-02
 -1.01624027e-01 -2.22705051e-01  2.08295099e-02 -1.40193060e-01
 -2.00430956e-02 -5.17838970e-02  1.77755859e-02  9.56623778e-02
  8.19284096e-02 -2.37670183e-01 -2.24784091e-02 -5.60893193e-02
 -1.14912666e-01  6.41629398e-02  3.19292843e-02 -1.65765315e-01
  9.54582244e-02  1.70947418e-01  2.67177895e-02  1.09266177e-01
 -4.97317091e-02  1.86585382e-01  3.31947617e-02  1.42835313e-02
 -1.68362260e-01 -5.38643524e-02 -4.30498049e-02  1.92375198e-01
  1.16688848e-01  3.24442387e-02  7.359

In [29]:
import numpy as np
import string
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from gensim.models import Word2Vec
import nltk
nltk.download('punkt')

# Define a function to preprocess and vectorize a sentence
def preprocess_and_vectorize(sentence, model):
    # Tokenize the sentence
    words = nltk.word_tokenize(sentence.lower())
    # Filter out words not in the Word2Vec model
    words = [word for word in words if word in model.wv]
    # If no words are left after filtering, return a zero vector
    if len(words) == 0:
        return np.zeros(model.vector_size)
    # Otherwise, return the average of word vectors
    return np.mean([model.wv[word] for word in words], axis=0)

# Preprocess and vectorize the train set
X_train_vec = np.array([preprocess_and_vectorize(sentence, model) for sentence in train_data['text']])
y_train = train_data['sentiment']

# Split train set into train and validation set
X_train, X_val, y_train, y_val = train_test_split(X_train_vec, y_train, test_size=0.2, random_state=28)

# Preprocess and vectorize the test set
X_test = np.array([preprocess_and_vectorize(sentence, model) for sentence in test_data['text']])
y_test = test_data['sentiment']

# Train SVM model
svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = svm.predict(X_val)

# Evaluate the model
print(classification_report(y_val, y_pred))

# Print the mean cross-validation score
print('Mean cross-validation score:', scores.mean())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00       648
           1       0.53      1.00      0.69       736

    accuracy                           0.53      1384
   macro avg       0.27      0.50      0.35      1384
weighted avg       0.28      0.53      0.37      1384

Mean cross-validation score: 0.5191472832792579


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [30]:
# Write your code here
import re
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [35]:
import pandas as pd

# parameters to read the CSV file
try:
    df = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv', encoding='utf-8')
except pd.errors.ParserError:
    try:
        df = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv', encoding='latin1')
    except pd.errors.ParserError as e:
        print("ParserError:", e)


In [42]:
new =df.head(1000)
new['Reviews']

0      I feel so LUCKY to have found this used (phone...
1      nice phone, nice up grade from my pantach revu...
2                                           Very pleased
3      It works good but it goes slow sometimes but i...
4      Great phone to replace my lost phone. The only...
                             ...                        
995    It's a decent for the price.. I've had this on...
996                                   Is good cell phone
997    Amazing phone. Cables and case included, also ...
998                                             Excelent
999       Excellent, it meets the requirements requested
Name: Reviews, Length: 1000, dtype: object

In [46]:
import re
from nltk.corpus import stopwords
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
import pandas as pd

# Make a copy of the DataFrame 'new'
new = new.copy()

# Special characters removal
new['After noise removal'] = new['Reviews'].apply(lambda x: re.sub(r"[^a-zA-Z0-9]+", ' ', x))

# Punctuation removal
new['Punctuation removal'] = new['After noise removal'].str.replace('[^\w\s]','')

# Remove numbers
new['Remove numbers'] = new['Punctuation removal'].str.replace('\d+', '')

# Stopwords removal
stop_word = stopwords.words('english')
new['Stopwords removal'] = new['Remove numbers'].apply(lambda x: " ".join(word for word in x.split() if word not in stop_word))

# Lower Casing
new['Lower casing'] = new['Stopwords removal'].apply(lambda x: x.lower())

# Tokenization
new['Tokenization'] = new['Lower casing'].apply(lambda x: TextBlob(x).words)

# Stemming
st = PorterStemmer()
new['Stemming'] = new['Tokenization'].apply(lambda x: " ".join([st.stem(word) for word in x]))

# Lemmatization
new['Lemmatization'] = new['Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

# Print the 'Lemmatization' column
print(new['Lemmatization'])


0      i feel lucki found use phone u use hard phone ...
1      nice phone nice grade pantach revu veri clean ...
2                                              veri plea
3        it work good goe slow sometim good phone i love
4      great phone replac lost phone the thing volum ...
                             ...                        
995    it decent price i one 6 month the con 1 i pret...
996                                   is good cell phone
997    amaz phone cabl case includ also screen pritec...
998                                                excel
999                            excel meet requir request
Name: Lemmatization, Length: 1000, dtype: object


In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Instantiate a TfidfVectorizer object with default parameters
tfidf_vect = TfidfVectorizer()

# Use the fit_transform method of the TfidfVectorizer object to generate a TF-IDF matrix from the 'Lemmatization' column
tfidf_matrix = tfidf_vect.fit_transform(new['Lemmatization'])

# Print the shape of the resulting TF-IDF matrix
print(tfidf_matrix.shape)


(1000, 2620)


In [49]:
from sklearn.cluster import KMeans

# Instantiate a KMeans object with 10 clusters and a random state of 99
model_tf = KMeans(n_clusters=10, random_state=99)

# Use the fit method of the KMeans object to fit the model to the TF-IDF matrix generated in the previous step
model_tf.fit(tfidf_matrix)




In [51]:
# Get the cluster labels and cluster centers directly from the fitted KMeans model
labels_tf = model_tf.predict(tfidf_matrix)
cluster_center_tf = model_tf.cluster_centers_

# Print the cluster centers to the console
print(cluster_center_tf)


[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.00097592 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [55]:
# Use the get_feature_names_out method of the TfidfVectorizer object to extract the terms in the corpus
terms1 = tfidf_vect.get_feature_names_out()

# Print the first 100 terms to the console
print(terms1[:100])
# Assign the DataFrame 'new' to a new variable 'df1'
df1 = new.copy()

# Add a new column 'Tfidf Clus Label' to 'df1' containing the clustering labels obtained from 'model_tf'
df1['Tfidf Clus Label'] = labels_tf

# Display the 'Lemmatization' and 'Tfidf Clus Label' columns for the first 5 rows
print(df1[['Lemmatization', 'Tfidf Clus Label']].head())

# Group by 'Tfidf Clus Label' and count the number of reviews in each cluster
cluster_counts = df1.groupby(['Tfidf Clus Label'])['Reviews'].count()

# Print the counts
print(cluster_counts)


['00' '00pm' '03' '04' '0mp' '10' '100' '1080p' '10k' '11' '110' '115'
 '12' '13' '14' '1400' '15' '16' '169' '1700' '178mb' '18' '1800' '1900'
 '1999' '1gb' '1it' '1st' '1thi' '1x' '20' '2001' '200mb' '200ppi' '2012'
 '2013' '2014' '2015' '2016' '2017' '2100' '2100mhz' '2250' '2300mah' '24'
 '25' '25cm' '288' '29' '2g' '2gb' '2mp' '2nd' '2pm' '2sim' '30' '300'
 '300megabyt' '302' '303' '30ish' '30text' '31' '32' '32g' '32gb' '39'
 '3d' '3g' '3gcellphon' '3inch' '3week' '40' '400mp2' '450' '4g' '4gb'
 '50' '500' '512mb' '55' '55th' '5c' '5g' '5in' '5j' '5mm' '5mp' '5th'
 '60' '600' '628' '635' '64' '64bit' '64gig' '6in' '70' '700' '710']
                                       Lemmatization  Tfidf Clus Label
0  i feel lucki found use phone u use hard phone ...                 5
1  nice phone nice grade pantach revu veri clean ...                 5
2                                          veri plea                 1
3    it work good goe slow sometim good phone i love                 4

In [56]:
print("Top clusters:")
order_centroids = model_tf.cluster_centers_.argsort()[:, ::-1]
for i in range(10):
    print("Cluster %d:" % i, end='')
    top_terms = [terms1[ind] for ind in order_centroids[i, :10]]
    print(' '.join(top_terms))


Top clusters:
Cluster 0:arriv return dead wast product work thi phone disappoint condit
Cluster 1:work phone it bad good like bueno use well fine
Cluster 2:excelent telefono bueno muy producto far fastest faster fast fashion
Cluster 3:excel product recommend seller thank 100 phone fash function five
Cluster 4:good veri phone product price buy tank thank mobil recomend
Cluster 5:phone use work the it card sim would one thi
Cluster 6:love phone it best like screen buy price camera good
Cluster 7:great phone work easi use love expect good thi price
Cluster 8:ok far wcdma it quit peopl someon buy slow phone
Cluster 9:the horribl phone work item overal pictur good never take


In [57]:
for i in range(10):
    print("4 reviews of ensured to cluster ", i)
    print("-" * 70)
    group_indices = df1.groupby(['Tfidf Clus Label']).groups[i]
    print(df1.loc[group_indices[0], 'Reviews'])
    print('\n')
    print(df1.loc[group_indices[2], 'Reviews'])
    print('\n')
    print(df1.loc[group_indices[6], 'Reviews'])
    print('\n')
    print("_" * 70)


4 reviews of ensured to cluster  0
----------------------------------------------------------------------
Great. Arrived quickly.


arrived broken and forgot to send back


The phone is not powering on I sent it to Jamaica WI and will return once I receive it back


______________________________________________________________________
4 reviews of ensured to cluster  1
----------------------------------------------------------------------
Very pleased


It's battery life is great. It's very responsive to touch. The only issue is that sometimes the screen goes black and you have to press the top button several times to get the screen to re-illuminate.


unfortunately Sprint could not activate the phone due to the blocking issue with the phone, the matter was handled very well and quickly. Very satisfied with the service.


______________________________________________________________________
4 reviews of ensured to cluster  2
-----------------------------------------------------------

In [60]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
count_vect = CountVectorizer()

# Generate bag of words (BoW) features
bow = count_vect.fit_transform(new['Reviews'])

from sklearn.cluster import KMeans

# Create an instance of KMeans clustering algorithm
model = KMeans(n_clusters=10, init='k-means++', random_state=99)

# Fit the KMeans model on the bag-of-words (bow) matrix
model.fit(bow)

# Print the cluster centers
print("Cluster centers:")
print(model.cluster_centers_)




Cluster centers:
[[0.00134953 0.         0.00134953 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.33333333]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.00641026 0.         0.        ]]


In [64]:
labels = model.labels_  # Obtaining the cluster labels for each data point in the input matrix using the 'labels_' attribute of the KMeans model
cluster_center = model.cluster_centers_  # Obtaining the cluster centers (centroids) for each of the clusters using the 'cluster_centers_' attribute of the KMeans model
from sklearn.metrics import silhouette_score

# Compute the silhouette score
silhouette_score_value = silhouette_score(bow, labels, metric='euclidean')

# Print the silhouette score
print(silhouette_score_value)

0.39528026327024723


In [65]:
# Assign the cluster labels obtained from the KMeans model to a new column 'Bow Label'
new['Bow Label'] = labels

# Display the 'Lemmatization' and 'Bow Label' columns for the first 5 rows
print(new[['Lemmatization', 'Bow Label']].head())


                                       Lemmatization  Bow Label
0  i feel lucki found use phone u use hard phone ...          9
1  nice phone nice grade pantach revu veri clean ...          0
2                                          veri plea          0
3    it work good goe slow sometim good phone i love          0
4  great phone replac lost phone the thing volum ...          9


In [66]:
from sklearn.cluster import DBSCAN
import numpy as np

minPts = 2 * 100

def lower_bound(nums, target):
    l, r = 0, len(nums) - 1
    while l <= r:
        mid = int(l + (r - l) / 2)
        if nums[mid] >= target:
            r = mid - 1
        else:
            l = mid + 1
    return l

def compute_200th_nearest_neighbour(x, data):
    dists = []
    for val in data:
        dist = np.sum((x - val) ** 2)
        if len(dists) == 200 and dists[199] > dist:
            l = lower_bound(dists, dist)
            if l < 200 and l >= 0 and dists[l] > dist:
                dists[l] = dist
        else:
            dists.append(dist)
            dists.sort()

    return dists[199]

# Ensure to replace "give me same output with different code" with actual implementation.


In [67]:
list_of_sent_train = [sentence.split() for sentence in new["Lower casing"].values]


In [68]:
import gensim
from sklearn.cluster import DBSCAN

# Train Word2Vec model
list_of_sent_train = [sentence.split() for sentence in new["Lower casing"].values]
w2v_model = gensim.models.Word2Vec(list_of_sent_train, workers=4)

import numpy as np

# Create sentence vectors using trained Word2Vec model
sent_vectors = []
for sent in list_of_sent_train:
    sent_vec = np.zeros(100)
    cnt_words = 0
    for word in sent:
        if word in w2v_model.wv:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
sent_vectors = np.array(sent_vectors)

# Calculate 200th nearest neighbor for each sentence vector
two_hundredth_neigh = []
for val in sent_vectors[:300]:
    two_hundredth_neigh.append(compute_200th_nearest_neighbour(val, sent_vectors[:300]))
two_hundredth_neigh.sort()

# Train DBSCAN model
model = DBSCAN(eps=5, min_samples=minPts, n_jobs=-1)
model.fit(sent_vectors)


In [69]:
# Assign the cluster labels obtained from the DBSCAN model to a new column 'AVG-W2V Clus Label'
new['AVG-W2V Clus Label'] = model.labels_

# Display the 'Lemmatization' and 'AVG-W2V Clus Label' columns for the first 5 rows
print(new[['Lemmatization', 'AVG-W2V Clus Label']].head())


                                       Lemmatization  AVG-W2V Clus Label
0  i feel lucki found use phone u use hard phone ...                   0
1  nice phone nice grade pantach revu veri clean ...                   0
2                                          veri plea                   0
3    it work good goe slow sometim good phone i love                   0
4  great phone replac lost phone the thing volum ...                   0


In [73]:
#Implementing Hierarchial Clustering
import scipy
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering

# Instantiate AgglomerativeClustering object with the number of clusters, affinity measure, and linkage criteria
cluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')

# Fit the clustering model to the sent_vectors data using the fit_predict() method
Agg = cluster.fit_predict(sent_vectors)

# Create a new DataFrame called "hier" with the same data as the "new" DataFrame
hier = new

# Assign the cluster labels to the AVG-W2V Clus Label column of the hier DataFrame
hier['AVG-W2V Clus Label'] = cluster.labels_

# Display the first 5 rows of the hier DataFrame
hier.head(5)




Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization,Bow Label,AVG-W2V Clus Label
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,I feel so LUCKY to have found this used phone ...,I feel so LUCKY to have found this used phone ...,I feel so LUCKY to have found this used phone ...,I feel LUCKY found used phone us used hard pho...,i feel lucky found used phone us used hard pho...,"[i, feel, lucky, found, used, phone, us, used,...",i feel lucki found use phone us use hard phone...,i feel lucki found use phone u use hard phone ...,9,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,nice phone nice up grade from my pantach revue...,nice phone nice up grade from my pantach revue...,nice phone nice up grade from my pantach revue...,nice phone nice grade pantach revue Very clean...,nice phone nice grade pantach revue very clean...,"[nice, phone, nice, grade, pantach, revue, ver...",nice phone nice grade pantach revu veri clean ...,nice phone nice grade pantach revu veri clean ...,0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,Very pleased,Very pleased,Very pleased,Very pleased,very pleased,"[very, pleased]",veri pleas,veri plea,0,0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good goes slow sometimes good phone I...,it works good goes slow sometimes good phone i...,"[it, works, good, goes, slow, sometimes, good,...",it work good goe slow sometim good phone i love,it work good goe slow sometim good phone i love,0,2
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,Great phone to replace my lost phone The only ...,Great phone to replace my lost phone The only ...,Great phone to replace my lost phone The only ...,Great phone replace lost phone The thing volum...,great phone replace lost phone the thing volum...,"[great, phone, replace, lost, phone, the, thin...",great phone replac lost phone the thing volum ...,great phone replac lost phone the thing volum ...,9,4


In [75]:
for i in range(5):
    print("2 reviews assigned to cluster ", i)
    print("-" * 70)
    cluster_indices = hier[hier['AVG-W2V Clus Label'] == i].index[:2]
    for index in cluster_indices:
        print(hier.loc[index, 'Lemmatization'])
        print('\n')
    print("_" * 70)
hier.groupby('AVG-W2V Clus Label')['Reviews'].count()


2 reviews assigned to cluster  0
----------------------------------------------------------------------
veri plea


reason 3 star rate opinion better iphon 4 tend randomli crash randomli power good storag space batteri usag camera


______________________________________________________________________
2 reviews assigned to cluster  1
----------------------------------------------------------------------
i feel lucki found use phone u use hard phone line someon upgrad sold one my son like old one final fell apart 2 5 year want upgrad thank seller realli appreci honesti said use phone i recommend seller highli would


nice phone nice grade pantach revu veri clean set easi set never android phone fantast say least perfect size surf social medium great phone samsung


______________________________________________________________________
2 reviews assigned to cluster  2
----------------------------------------------------------------------
it work good goe slow sometim good phone i love



AVG-W2V Clus Label
0    182
1    356
2    188
3     90
4    184
Name: Reviews, dtype: int64

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here: undefined

This help further understanding of Machine learning models to a grate extent.

Supervised and Unsupervised. The task 1 was involving supervised learning algorithms such as classification models.

and would be employing clustering methods in question 3.






'''