# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
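
The split-then-cross-validate procedure described above can be sketched without any library. This is a minimal, dependency-free illustration only: the helper names `train_val_split` and `k_fold_indices` are hypothetical, and in practice scikit-learn's `train_test_split` and `KFold` would normally do this work.

```python
import random


def train_val_split(n_samples, val_fraction=0.2, seed=42):
    """Shuffle the sample indices and hold out val_fraction of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    # The first n_val shuffled indices become the validation set
    return indices[n_val:], indices[:n_val]


def k_fold_indices(indices, k=10):
    """Yield (train, val) index lists; every index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, val_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val_fold


# 1000 samples -> 800 for training, 200 held out for validation
train_idx, val_idx = train_val_split(1000)

# 10-fold cross-validation runs on the 800 training samples only
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(val_idx))                     # 800 200
print(len(folds), len(folds[0][0]), len(folds[0][1]))   # 10 720 80
```

Each of the ten folds trains on 720 samples and validates on the remaining 80, while the 200 held-out samples and the separate test file are never seen during cross-validation.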


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
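
As a reminder of what these four measures compute, here is a minimal pure-Python sketch for binary labels. The function name `binary_metrics` and the toy labels are illustrative assumptions; scikit-learn provides `accuracy_score`, `recall_score`, `precision_score`, and `f1_score` for real use.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or labelled positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Toy example: 4 positive and 4 negative reviews
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

scores = binary_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.6250
# precision: 0.6667
# recall: 0.5000
# f1: 0.5714
```

A classifier that rarely predicts the positive class can have reasonable precision but very low recall, which is why all four measures are reported together.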


In [1]:
# Write your code here
#importing the required libraries
import re
import numpy as np
import nltk
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Paths to the train and test files

train_path = 'stsa-train.txt'
test_path = 'stsa-test.txt'

# Reading the train and test text file and splitting into lines
with open(train_path, 'r') as file_train:
    train_lines = file_train.readlines()
with open(test_path, 'r') as file_test:
    test_lines = file_test.readlines()


# Splitting the lines into 'label' and 'review'
split_train = [line.strip().split(' ', 1) for line in train_lines]
split_test = [line.strip().split(' ', 1) for line in test_lines]

# Creating a DataFrame
training_data = pd.DataFrame(split_train, columns=['Label', 'Review'])
testing_data = pd.DataFrame(split_test, columns=['Label', 'Review'])

In [2]:
training_data = training_data[:1000]

In [3]:
testing_data = testing_data[:1000]

In [4]:
training_data.head()


Unnamed: 0,Label,Review
0,1,"a stirring , funny and finally transporting re..."
1,0,apparently reassembled from the cutting-room f...
2,0,they presume their audience wo n't sit still f...
3,1,this is a visually stunning rumination on love...
4,1,jonathan parker 's bartleby should have been t...


In [5]:
testing_data.head()

Unnamed: 0,Label,Review
0,0,"no movement , no yuks , not much of anything ."
1,0,"a gob of drivel so sickly sweet , even the eag..."
2,0,"gangs of new york is an unapologetic mess , wh..."
3,0,"we never really feel involved with the story ,..."
4,1,this is one of polanski 's best films .


In [6]:
nltk.download('all')

# Text-cleaning helper used for both questions

def cleaning_text(text):

    # Convert the text to lowercase
    text = text.lower()

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove English stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize each remaining token
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    # Join the tokens back into a single string
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text

training_data['Review'] = training_data['Review'].apply(cleaning_text)
testing_data['Review'] = testing_data['Review'].apply(cleaning_text)

# Display cleaned datasets
print("Here we have Cleaned Training Data:")
print(training_data.head())

print("\nHere we have Cleaned Testing Data:")
print(testing_data.head())

Here we have Cleaned Training Data:
  Label                                             Review
0     1  stirring funny finally transporting reimaginin...
1     0  apparently reassembled cuttingroom floor given...
2     0  presume audience wo nt sit still sociology les...
3     1  visually stunning rumination love memory histo...
4     1  jonathan parker bartleby beallendall modernoff...

Here we have Cleaned Testing Data:
  Label                                             Review
0     0                        movement yuks much anything
1     0  gob drivel sickly sweet even eager consumer mo...
2     0  gang new york unapologetic mess whose saving g...
3     0  never really feel involved story idea remain a...
4     1                             one polanski best film


In [7]:
# Split the data into training (80%) and validation (20%) sets, as specified in the question
X_train, X_val, y_train, y_val = train_test_split(training_data['Review'], training_data['Label'], test_size=0.2, random_state=42, stratify=training_data['Label'])
X_train.reset_index(drop=True, inplace=True)
X_val.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_val.reset_index(drop=True, inplace=True)
y_train = y_train.astype(int)
y_val = y_val.astype(int)

In [9]:
#import all the required libraries
import re
import numpy as np
import nltk
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Models specified in the question
models = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'XGBoost': XGBClassifier()
}

# Evaluation metrics specified in the question
metrics = {
    'Accuracy': accuracy_score,
    'Recall': recall_score,
    'Precision': precision_score,
    'F1': f1_score
}

# Run 10-fold cross-validation for every model and average each metric over the folds
def evaluate_models_cross_validation(X, y):
    results = {}
    for model_name, model in models.items():
        # 10-fold cross-validation with shuffling
        kfold = KFold(n_splits=10, shuffle=True, random_state=42)
        model_results = {metric_name: [] for metric_name in metrics.keys()}

        for train_idx, val_idx in kfold.split(X):
            # Use positional indexing for both the reviews and the labels
            X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
            y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
            # Fit the vectorizer on the training fold only, then reuse it on the validation fold
            vectorizer = CountVectorizer()
            X_train_vec = vectorizer.fit_transform(X_train_fold)
            X_val_vec = vectorizer.transform(X_val_fold)

            # Train the model on the training fold
            model.fit(X_train_vec, y_train_fold)

            # Predict on the validation fold
            y_pred = model.predict(X_val_vec)

            # Compute every metric for this fold
            for metric_name, metric_fn in metrics.items():
                score = metric_fn(y_val_fold, y_pred)
                model_results[metric_name].append(score)

        # Calculate average scores across all folds
        avg_scores = {metric_name: np.mean(scores) for metric_name, scores in model_results.items()}
        results[model_name] = avg_scores

    return results
results = evaluate_models_cross_validation(X_train, y_train)
for model_name, scores in results.items():
    print(f"MODEL: {model_name}")
    for metric_name, score in scores.items():
        print(f"{metric_name}: {score:.4f}")
    print()

MODEL: MultinomialNB
Accuracy: 0.6788
Recall: 0.7055
Precision: 0.6930
F1: 0.6952

MODEL: SVM
Accuracy: 0.6475
Recall: 0.6901
Precision: 0.6557
F1: 0.6699

MODEL: KNN
Accuracy: 0.4938
Recall: 0.0775
Precision: 0.6044
F1: 0.1315

MODEL: DecisionTree
Accuracy: 0.5887
Recall: 0.5188
Precision: 0.6380
F1: 0.5659

MODEL: RandomForest
Accuracy: 0.6187
Recall: 0.5321
Precision: 0.6755
F1: 0.5898

MODEL: XGBoost
Accuracy: 0.5988
Recall: 0.6776
Precision: 0.6043
F1: 0.6376



In [10]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertForSequenceClassification


# Word2Vec: average the word vectors of each review and score an SVM with 10-fold CV
def model_word2vec(X, y):
    token = [sentence.split() for sentence in X]
    w2v_model = Word2Vec(token, vector_size=100, window=5, min_count=1, sg=1)
    def word_embed(wv, words):
        # Collect the vector of every in-vocabulary word in the review
        mean = []
        for word in words:
            if isinstance(word, np.ndarray):
                mean.append(word)
            elif word in wv.key_to_index:
                mean.append(wv[word])
        # Fall back to a zero vector when no word of the review is in the vocabulary
        if not mean:
            return np.zeros(wv.vector_size,)
        mean = np.array(mean).mean(axis=0)
        return mean
    def word_embed_list(wv, text_list):
        return np.vstack([word_embed(wv, review) for review in text_list])
    X_vec = word_embed_list(w2v_model.wv, token)
    clf = SVC()
    # cross_val_score returns one accuracy value per fold
    scores = cross_val_score(clf, X_vec, y, cv=10, scoring='accuracy')
    return scores

# BERT: use the logits of bert-base-uncased as features and score a random forest with 10-fold CV
# Note: the classification head is newly initialized here, not fine-tuned
def model_bert(X, y):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    def bert_tokenizer(text):
        inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
        return inputs
    X_tokenized = [bert_tokenizer(text) for text in X]
    def bert_model_prediction(inputs):
        outputs = model(**inputs)
        return outputs.logits
    X_vec = [bert_model_prediction(inputs).detach().numpy() for inputs in X_tokenized]
    clf = RandomForestClassifier()
    # cross_val_score returns one accuracy value per fold
    scores = cross_val_score(clf, np.array(X_vec).squeeze(), y, cv=10, scoring='accuracy')
    return scores
word2vec_scores = model_word2vec(X_train, y_train)
bert_scores = model_bert(X_train, y_train)


# Report the 10-fold cross-validation accuracy for the Word2Vec and BERT features.
# cross_val_score with scoring='accuracy' returns one accuracy value per fold,
# so the individual entries are per-fold accuracies, not precision, recall, or F1.
print("\nModel: Word2Vec")
print("Word2Vec Cross-validation scores:", word2vec_scores)
print("Word2Vec Mean accuracy:", word2vec_scores.mean())

print("\nModel: BERT")
print("BERT Cross-validation scores:", bert_scores)
print("BERT Mean accuracy:", bert_scores.mean())

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model: Word2Vec
Word2Vec Cross-validation scores: [0.5375 0.5375 0.65   0.625  0.625  0.6125 0.525  0.4875 0.475  0.5625]
Word2Vec Mean accuracy: 0.56375

Model: BERT
BERT Cross-validation scores: [0.5625 0.525  0.4875 0.5625 0.55   0.575  0.5    0.575  0.4875 0.5375]
BERT Mean accuracy: 0.53625
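
The ten values returned by `cross_val_score(..., scoring='accuracy')` are per-fold accuracies. To obtain accuracy, precision, recall, and F1 for embedding-based features as well, one option is scikit-learn's `cross_validate` with a list of scorers. The sketch below is an illustration under stated assumptions: it uses synthetic features from `make_classification` as a stand-in for the document vectors, and the names `X_demo`, `y_demo`, and `cv_results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the document vectors (assumption: 200 samples, 20 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Stratified 10-fold split keeps the class balance similar in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ['accuracy', 'precision', 'recall', 'f1']

# cross_validate returns one array of fold scores per requested metric
cv_results = cross_validate(SVC(), X_demo, y_demo, cv=cv, scoring=scoring)

for name in scoring:
    fold_scores = cv_results[f'test_{name}']
    print(f"{name}: mean={fold_scores.mean():.4f} over {len(fold_scores)} folds")
```

Replacing `X_demo` and `y_demo` with the averaged Word2Vec vectors or the BERT features and their labels would give the four metrics in the same form as for the other classifiers.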


In [12]:
# Retrain the best cross-validated model (MultinomialNB) on the full training data
best_perf_model = models['MultinomialNB']
vectorizer = CountVectorizer()
X_train_vec_full = vectorizer.fit_transform(training_data['Review'])

# Fit the model on the full training data and vectorize the test reviews
best_perf_model.fit(X_train_vec_full, training_data['Label'].astype(int))
X_test_vec = vectorizer.transform(testing_data['Review'])

# Predict on the test data
y_pred_test = best_perf_model.predict(X_test_vec)

# Calculate the evaluation metrics on the test data
test_accuracy = accuracy_score(testing_data['Label'].astype(int), y_pred_test)
test_recall = recall_score(testing_data['Label'].astype(int), y_pred_test)
test_precision = precision_score(testing_data['Label'].astype(int), y_pred_test)
test_f1 = f1_score(testing_data['Label'].astype(int), y_pred_test)

print("Best model Metrics achieved:")
print(f"Test data Accuracy: {test_accuracy:.4f}")
print(f"Test data Recall: {test_recall:.4f}")
print(f"Test data Precision: {test_precision:.4f}")
print(f"Test data F1-score: {test_f1:.4f}")

Best model Metrics achieved:
Test data Accuracy: 0.7180
Test data Recall: 0.7767
Test data Precision: 0.6993
Test data F1-score: 0.7360


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You may also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [15]:
# Write your code here
amazon_data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')

In [16]:
# Select a random sample of 500 reviews
amazon_data = amazon_data.sample(n=500, random_state=42)

In [17]:
amazon_data.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
65232,Apple iPhone 6s 128 GB International Warranty ...,"Amazon.com, LLC *** KEEP PORules ACTIVE ***",749.99,5.0,"I like it very much, it's exactly what I expected",
104972,BLU Advance 4.0L Unlocked Smartphone -US GSM -...,BLU,51.99,5.0,Smart phones are confusing. Everyone is tellin...,0.0
79034,"Apple Unlocked iPhone 6s LTE GSM Smartphone, 1...",,459.99,4.0,Great,1.0
20507,Apple iPhone 5 32GB Factory Unlocked GSM Cell ...,Apple,179.99,5.0,Great thabk u,0.0
118582,BLU Energy X LTE - 4000 mAh Super Battery - GS...,BLU,124.99,5.0,"Having owned this phone for a few months now, ...",0.0


In [18]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Drop rows whose review text is missing
amazon_data.dropna(subset=['Reviews'], inplace=True)

# Apply the text-cleaning function defined above
amazon_data['Clean_Reviews'] = amazon_data['Reviews'].apply(cleaning_text)

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from collections import Counter

# Build TF-IDF features from the cleaned reviews
tfidf_vector = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vector.fit_transform(amazon_data['Clean_Reviews'])

# Fit K-means with five clusters
kmeans_clu = KMeans(n_clusters=5, init='k-means++', random_state=90)
kmeans_clu.fit(tfidf_matrix)
amazon_data['Kmeans_Cluster'] = kmeans_clu.labels_
Counter(kmeans_clu.labels_)



Counter({3: 227, 2: 155, 4: 55, 1: 45, 0: 18})

In [20]:
# Keep the fitted cluster centers (one TF-IDF centroid per cluster)
cluster_data_center = kmeans_clu.cluster_centers_

In [31]:
print("Here is the K Means cluster:\n")
cluster_data_center

Here is the K Means cluster:



array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00186899, 0.00422616, 0.00030728, ..., 0.00320024, 0.00329275,
        0.00152675],
       [0.00151442, 0.00407186, 0.        , ..., 0.0011095 , 0.        ,
        0.        ],
       [0.        , 0.0029531 , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [22]:
from sklearn.cluster import DBSCAN

# Run DBSCAN on the TF-IDF matrix
dbscan_clu = DBSCAN(eps=0.5, min_samples=5)
dbscan_clu.fit(tfidf_matrix)
amazon_data['DBSCAN_Cluster'] = dbscan_clu.labels_

In [32]:
print("Here is the DBSCAN cluster:\n")
dbscan_clu.labels_

Here is the DBSCAN cluster:



array([-1, -1,  0,  0, -1, -1, -1, -1, -1, -1, -1, -1,  1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1,  2, -1, -1, -1,  2, -1, -1,  2, -1, -1, -1,
       -1, -1, -1, -1, -1, -1,  3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  4, -1, -1, -1, -1, -1,  4,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  3,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1,  5, -1, -1, -1, -1,  4, -1, -1, -1,  2, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  2, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1,  0, -1, -1, -1, -1, -1, -1, -1,  5, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1,  3, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  2,  4, -1,
       -1, -1, -1, -1, -1, -1, -1, -1,  3, -1, -1, -1, -1,  4, -1, -1, -1,
       -1, -1, -1, -1, -1
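
One possible reason for the many `-1` (noise) labels: `TfidfVectorizer` L2-normalizes each row by default and DBSCAN's default metric is Euclidean, so for non-empty reviews `eps` can be read as a cosine-similarity threshold. The sketch below only works through that arithmetic; the helper names are hypothetical and it does not rerun the clustering.

```python
import math


def cosine_similarity_for_eps(eps):
    """For unit-length vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - (eps ** 2) / 2.0


def eps_for_cosine_similarity(similarity):
    """Inverse mapping: the Euclidean radius that matches a cosine threshold."""
    return math.sqrt(2.0 - 2.0 * similarity)


# eps=0.5 only links reviews whose TF-IDF cosine similarity is at least 0.875
print(cosine_similarity_for_eps(0.5))              # 0.875

# Accepting neighbours with cosine similarity >= 0.5 would need eps = 1.0
print(round(eps_for_cosine_similarity(0.5), 4))    # 1.0
```

Under these assumptions, a review needs at least `min_samples` near-duplicates to become a core point at `eps=0.5`, which is consistent with short, almost identical reviews forming the few non-noise clusters.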

In [24]:
from sklearn.cluster import AgglomerativeClustering


# Hierarchical (agglomerative) clustering; Ward linkage always uses Euclidean distance
hierarchical = AgglomerativeClustering(n_clusters=5, linkage='ward')
hierarchical_clu = hierarchical.fit_predict(tfidf_matrix.toarray())
amazon_data['Hierarchical_Cluster'] = hierarchical_clu



In [33]:
print("Here is the Hierarchical cluster:\n")
hierarchical_clu

Here is the Hierarchical cluster:



array([0, 0, 3, 3, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,
       0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0,
       0, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
       0, 4, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0,
       0, 0, 0, 2, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [26]:
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
import numpy as np

# Tokenize the cleaned text
tokenized_corpus = [text.split() for text in amazon_data['Clean_Reviews']]

# Train a skip-gram Word2Vec model on the reviews
word2vec_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, sg=1)

# Average the word vectors of one review into a single review vector
def word_embed_averaging(wv, words):
    mean = []
    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.key_to_index:
            mean.append(wv[word])
    # Fall back to a zero vector when no word of the review is in the vocabulary
    if not mean:
        return np.zeros(wv.vector_size,)
    mean = np.array(mean).mean(axis=0)
    return mean

def word_embed_averaging_list(wv, text_list):
    return np.vstack([word_embed_averaging(wv, review) for review in text_list])

X_vec = word_embed_averaging_list(word2vec_model.wv, tokenized_corpus)

# Run K-means on the averaged Word2Vec embeddings
kmeans_word2vec = KMeans(n_clusters=5, init='k-means++', random_state=90)
kmeans_word2vec.fit(X_vec)
amazon_data['Word2Vec_Cluster'] = kmeans_word2vec.labels_



In [34]:
print("Here is the Word2vec cluster:\n")
kmeans_word2vec.labels_

Here is the Word2vec cluster:



array([3, 3, 1, 0, 3, 1, 0, 1, 0, 4, 0, 3, 2, 3, 0, 1, 3, 3, 0, 1, 0, 4,
       0, 1, 2, 3, 0, 1, 0, 3, 1, 0, 1, 3, 1, 3, 0, 1, 1, 3, 3, 1, 3, 3,
       3, 3, 3, 0, 3, 3, 1, 3, 0, 3, 0, 0, 3, 0, 3, 1, 3, 2, 3, 3, 3, 1,
       2, 2, 4, 3, 3, 2, 3, 1, 0, 3, 3, 3, 3, 3, 3, 3, 2, 4, 3, 3, 1, 1,
       2, 3, 0, 0, 3, 0, 0, 1, 1, 1, 1, 0, 0, 3, 1, 0, 1, 3, 4, 1, 0, 3,
       1, 2, 3, 1, 4, 1, 3, 0, 4, 0, 4, 1, 3, 1, 3, 3, 3, 1, 3, 3, 3, 1,
       2, 0, 1, 3, 2, 1, 3, 3, 4, 1, 3, 3, 3, 3, 3, 3, 0, 3, 1, 0, 0, 0,
       4, 1, 0, 0, 3, 3, 0, 3, 3, 3, 0, 3, 2, 1, 3, 3, 3, 0, 0, 3, 4, 2,
       3, 3, 3, 3, 0, 3, 3, 0, 4, 4, 1, 3, 3, 3, 0, 3, 3, 0, 3, 3, 2, 1,
       1, 0, 0, 1, 2, 3, 1, 3, 3, 1, 0, 3, 0, 1, 3, 0, 3, 0, 1, 2, 3, 0,
       1, 4, 1, 0, 4, 1, 3, 3, 3, 3, 0, 1, 3, 4, 3, 0, 0, 0, 3, 3, 1, 0,
       3, 1, 1, 2, 1, 4, 3, 1, 0, 3, 1, 0, 0, 3, 3, 0, 0, 3, 0, 3, 3, 2,
       4, 1, 3, 3, 1, 3, 0, 3, 0, 3, 1, 0, 1, 1, 3, 1, 1, 0, 4, 2, 0, 0,
       0, 1, 0, 3, 3, 1, 1, 3, 3, 3, 3, 2, 0, 3, 2,

In [28]:
# Import the required libraries
from transformers import BertTokenizer, BertModel
import torch

# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize one review and return the mean of its last-layer token embeddings
def bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state
    mean_embeddings = torch.mean(last_hidden_states, dim=1)
    return mean_embeddings.detach().numpy()

# Compute a BERT embedding for each review, then cluster the embeddings with K-means
X_bert = np.vstack([bert_embedding(text) for text in amazon_data['Clean_Reviews']])
kmeans_bert_clu = KMeans(n_clusters=5, init='k-means++', random_state=90)
kmeans_bert_clu.fit(X_bert)
amazon_data['BERT_Cluster'] = kmeans_bert_clu.labels_



In [35]:
print("Here is the BERT cluster:\n")
kmeans_bert_clu.labels_

Here is the BERT cluster:



array([4, 2, 0, 4, 2, 1, 2, 4, 3, 0, 4, 1, 0, 0, 2, 2, 1, 2, 1, 4, 0, 0,
       1, 0, 0, 4, 1, 0, 2, 2, 0, 1, 0, 2, 0, 2, 2, 4, 0, 4, 0, 2, 1, 1,
       2, 1, 2, 0, 1, 2, 0, 2, 2, 2, 1, 1, 2, 4, 2, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 4, 0, 2, 0, 2, 2, 2, 4, 4, 4, 1, 2, 0, 4, 0, 2, 2, 0,
       4, 2, 2, 4, 1, 3, 1, 1, 1, 0, 0, 4, 4, 1, 0, 1, 0, 2, 0, 0, 3, 2,
       2, 0, 4, 0, 4, 0, 4, 1, 4, 2, 4, 1, 2, 2, 0, 2, 2, 1, 1, 1, 2, 0,
       1, 4, 4, 2, 4, 4, 2, 4, 0, 0, 1, 2, 4, 1, 0, 1, 3, 2, 4, 4, 2, 4,
       4, 0, 4, 2, 1, 4, 4, 4, 4, 0, 4, 2, 0, 4, 1, 4, 1, 1, 3, 2, 0, 0,
       1, 0, 0, 0, 2, 4, 4, 2, 4, 4, 4, 4, 2, 4, 4, 4, 2, 4, 4, 2, 4, 4,
       1, 4, 4, 0, 0, 0, 0, 0, 2, 4, 1, 1, 4, 2, 0, 2, 2, 2, 1, 0, 2, 1,
       4, 0, 4, 3, 4, 1, 1, 2, 1, 2, 4, 1, 1, 0, 2, 0, 1, 3, 0, 0, 4, 2,
       2, 0, 0, 0, 0, 0, 4, 4, 2, 2, 4, 4, 1, 4, 1, 4, 2, 1, 2, 1, 2, 0,
       4, 4, 1, 2, 0, 2, 2, 0, 4, 4, 4, 1, 2, 4, 1, 1, 0, 3, 0, 0, 1, 1,
       4, 4, 2, 1, 0, 2, 4, 2, 2, 2, 1, 0, 0, 4, 0,

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

In this exercise, five techniques were applied to cluster the Amazon mobile phone reviews: K-means, DBSCAN, hierarchical clustering, Word2Vec, and BERT. My observations, based on the results obtained, are as follows.

  K-means on the TF-IDF features assigned every review to one of five clusters, although the cluster sizes were uneven (227, 155, 55, 45, and 18 reviews). DBSCAN (eps=0.5, min_samples=5) had trouble finding dense regions in the same features, so a large number of reviews were labelled as noise (-1). Hierarchical clustering with Ward linkage builds its clusters by repeatedly merging the closest groups, and here it placed most reviews in a single cluster.

  Word2Vec and BERT are embedding methods rather than clustering algorithms, so K-means was run on their review vectors. Averaged Word2Vec vectors group reviews by word-level semantic similarity, while BERT embeddings capture deeper contextual relationships and may identify subtle similarities between reviews that the other methods miss. In the labels shown above, both spread the reviews more evenly across the five clusters than the TF-IDF-based methods. The choice of method depends on the properties of the data and the analytical objective, as each has advantages and disadvantages.
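
The comparison above is qualitative. One way to put a number on it is the silhouette score, which is higher when points sit close to their own cluster and far from the others. The sketch below is a minimal illustration on synthetic data: `make_blobs` stands in for a review feature matrix, and `X_demo`, `silhouette_by_k`, and `best_k` are hypothetical names that do not appear elsewhere in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a review feature matrix: three well-separated groups
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Score K-means for several cluster counts; values closer to 1 are better
silhouette_by_k = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=90).fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k)
print("Best k by silhouette:", best_k)
```

The same call could be applied to each set of labels stored in `amazon_data` together with the feature matrix that produced them. For DBSCAN, the noise label `-1` would be treated as a cluster of its own unless those rows are filtered out first.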





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
This exercise was very informative. I had a great learning experience working with the clustering methods.
I tried to add evaluation metrics for the clustering methods, but I was not able to get them working.
I used only 500 reviews for clustering to keep the computation efficient.
I uploaded the dataset manually into this Google Colab notebook.

'''