
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** as well as their performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
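
The four measures listed above all derive from the confusion-matrix counts. The following is a minimal sketch, assuming made-up labels (`y_true` and `y_pred` are illustrative and not taken from the dataset), that computes each measure by hand and checks it against scikit-learn.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative labels only (1 = positive, 0 = negative); not from the dataset.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} (sklearn: {accuracy_score(y_true, y_pred):.3f})")
print(f"precision={precision:.3f} (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"recall={recall:.3f} (sklearn: {recall_score(y_true, y_pred):.3f})")
print(f"f1={f1:.3f} (sklearn: {f1_score(y_true, y_pred):.3f})")
```

Because precision and recall treat label 1 as the positive class by default, the values reported for each classifier describe how well positive reviews are identified.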


In [None]:
import pandas as pd

# Read the text file into a DataFrame
with open('stsa-train.txt', 'r') as file:
    lines = file.readlines()

data = []
for line in lines:
    sentiment, review = line.strip().split(' ', 1)
    data.append([int(sentiment), review])

df = pd.DataFrame(data, columns=['sentiment', 'review'])

# Save the DataFrame to a CSV file
df.to_csv('train.csv', index=False)

# Print the first few rows of the DataFrame
print(df.head())

   sentiment                                             review
0          1  a stirring , funny and finally transporting re...
1          0  apparently reassembled from the cutting-room f...
2          0  they presume their audience wo n't sit still f...
3          1  this is a visually stunning rumination on love...
4          1  jonathan parker 's bartleby should have been t...


In [None]:
print(df.shape)

(6920, 2)


In [None]:
import pandas as pd

# Read the text file into a DataFrame
with open('stsa-test.txt', 'r') as file:
    lines = file.readlines()

data = []
for line in lines:
    sentiment, review = line.strip().split(' ', 1)
    data.append([int(sentiment), review])

df1 = pd.DataFrame(data, columns=['sentiment', 'review'])

# Save the DataFrame to a CSV file
df1.to_csv('test.csv', index=False)

# Print the first few rows of the DataFrame
print(df1.head())



   sentiment                                             review
0          0     no movement , no yuks , not much of anything .
1          0  a gob of drivel so sickly sweet , even the eag...
2          0  gangs of new york is an unapologetic mess , wh...
3          0  we never really feel involved with the story ,...
4          1            this is one of polanski 's best films .


In [None]:
print(df1.shape)

(1821, 2)


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the training and test data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Separate features (reviews) and labels (sentiments)
X_train = train_data['review']
y_train = train_data['sentiment']
X_test = test_data['review']
y_test = test_data['sentiment']

# Split the training data into training and validation sets (80-20 split)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Define a dictionary of classifiers
classifiers = {
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

# Define evaluation metrics
metrics = {
    "Accuracy": accuracy_score,
    "Precision": precision_score,
    "Recall": recall_score,
    "F1 Score": f1_score
}

# Perform 10-fold cross-validation on each classifier
for name, clf in classifiers.items():
    pipeline = make_pipeline(CountVectorizer(), clf)
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=10)
    print(f"\n{name}:")
    print("Cross-validation scores:", cv_scores)
    print("Mean Cross-validation score:", cv_scores.mean())

    # Train the final model
    pipeline.fit(X_train, y_train)

    # Evaluate on the validation set
    y_val_pred = pipeline.predict(X_val)
    print("Validation Set Metrics:")
    for metric_name, metric_func in metrics.items():
        metric_value = metric_func(y_val, y_val_pred)
        print(f"{metric_name}: {metric_value}")

# Evaluate the final trained models on the test data
print("\nFinal Evaluation on Test Data:")
for name, clf in classifiers.items():
    pipeline = make_pipeline(CountVectorizer(), clf)
    pipeline.fit(X_train, y_train)
    y_test_pred = pipeline.predict(X_test)
    print(f"\n{name}:")
    for metric_name, metric_func in metrics.items():
        metric_value = metric_func(y_test, y_test_pred)
        print(f"{metric_name}: {metric_value}")



MultinomialNB:
Cross-validation scores: [0.75631769 0.79783394 0.78519856 0.79061372 0.79061372 0.76173285
 0.78481013 0.78481013 0.75406872 0.78481013]
Mean Cross-validation score: 0.7790809565154947
Validation Set Metrics:
Accuracy: 0.7947976878612717
Precision: 0.777490297542044
Recall: 0.8429172510518934
F1 Score: 0.8088829071332435

SVM:
Cross-validation scores: [0.73465704 0.72382671 0.71299639 0.72021661 0.74368231 0.76173285
 0.7522604  0.72151899 0.71971067 0.75768535]
Mean Cross-validation score: 0.734828732022901
Validation Set Metrics:
Accuracy: 0.7557803468208093
Precision: 0.7394636015325671
Recall: 0.8120617110799438
F1 Score: 0.7740641711229947

KNN:
Cross-validation scores: [0.58483755 0.56137184 0.58844765 0.57581227 0.5631769  0.55234657
 0.60036166 0.56600362 0.54972875 0.54068716]
Mean Cross-validation score: 0.5682773973273447
Validation Set Metrics:
Accuracy: 0.6163294797687862
Precision: 0.6226415094339622
Recall: 0.6479663394109397
F1 Score: 0.6350515463917525



1. **MultinomialNB:**

- The mean 10-fold cross-validation accuracy is 0.779.

- On the validation set, accuracy is 0.795 and the F1 score is 0.809, which is promising.

- On the test data, which was not used to train the model, the classifier reaches an accuracy of 80.7% and an F1 score of 0.814.


2. **SVM:**

- The mean cross-validation accuracy is 0.735.

- On the validation set, performance is average to good, with an accuracy of 75.6% and an F1 score of 77.4%.

- On the test data, accuracy is 0.757 and the F1 score is 0.768, so performance is almost unchanged from the validation set.


3. **KNN:**

- The mean cross-validation accuracy is 0.568.

- On the validation set, accuracy is 0.616 and the F1 score is 0.635, only modestly better than random guessing.

- On the test data, results are similar, with an accuracy of 0.583 and an F1 score of 0.596.


4. **Decision Tree:**

- The mean cross-validation accuracy is 0.629.

- On the validation set, accuracy is 64.0% and the F1 score is 66.3%, which is mediocre.

- On the test data, accuracy is 0.633 and the F1 score is 0.64.


5. **Random Forest:**

- The mean 10-fold cross-validation accuracy is 71.8%.

- On the validation set, accuracy is 74.2% and the F1 score is 76.8%, a decent performance.

- On the test data, accuracy is 0.724 and the F1 score is 0.739, which is encouraging.

6. **XGBoost:**

- The mean cross-validation accuracy is 0.710.

- On the validation set, accuracy is 0.740 and the F1 score is 0.767.

- On the test data, accuracy is 0.728 and the F1 score is 0.749, in line with the validation results.


Overall, MultinomialNB performed best on the test data (accuracy 0.807, F1 0.814), followed by SVM (0.757, 0.768), XGBoost (0.728, 0.749), and Random Forest (0.724, 0.739). Decision Tree (0.633, 0.64) and KNN (0.583, 0.596) were the weakest. For every classifier, the test results stayed close to the validation results, so none of the models shows a large drop on unseen data.
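
The cross-validation above reports accuracy only. As a minimal sketch of how precision, recall, and F1 could also be collected per fold, the following uses `cross_validate` with several scorers. The toy corpus (`texts`, `labels`) is a hypothetical stand-in so that the block runs on its own; in the notebook it would be replaced by `X_train` and `y_train`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with X_train / y_train in the notebook.
pos = [f"a great and moving film number {i}" for i in range(20)]
neg = [f"a dull and tedious film number {i}" for i in range(20)]
texts = pos + neg
labels = [1] * 20 + [0] * 20

# The vectorizer sits inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]

results = cross_validate(pipeline, texts, labels, cv=cv, scoring=scoring)

for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned from the training folds only, which avoids leaking information from the held-out fold.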

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use a different text dataset if you prefer.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [None]:
pip install sentence-transformers



In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('train.csv')
print(data.head())

   sentiment                                             review
0          1  a stirring , funny and finally transporting re...
1          0  apparently reassembled from the cutting-room f...
2          0  they presume their audience wo n't sit still f...
3          1  this is a visually stunning rumination on love...
4          1  jonathan parker 's bartleby should have been t...


**K-means**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(data['review'])

# Determine the optimal number of clusters using silhouette score
max_clusters = 10
best_score = -1
best_k = 2
for k in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    score = silhouette_score(X, kmeans.labels_)
    if score > best_score:
        best_score = score
        best_k = k

# Perform k-means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=best_k, random_state=42)
kmeans.fit(X)

# Assign cluster labels to each review
data['kmeans_cluster'] = kmeans.labels_

# Output the number of clusters and points in each cluster
num_clusters = len(set(kmeans.labels_))
print(f"Number of clusters found: {num_clusters}")
print("Number of points in each cluster:")
print(data['kmeans_cluster'].value_counts())

# Output the reviews along with their assigned clusters
print("\nReviews with cluster labels:")
print(data[['review', 'kmeans_cluster']])



Number of clusters found: 10
Number of points in each cluster:
kmeans_cluster
4    2037
8    1200
0     799
2     756
3     583
7     428
1     382
9     305
5     251
6     179
Name: count, dtype: int64

Reviews with cluster labels:
                                                 review  kmeans_cluster
0     a stirring , funny and finally transporting re...               4
1     apparently reassembled from the cutting-room f...               8
2     they presume their audience wo n't sit still f...               4
3     this is a visually stunning rumination on love...               2
4     jonathan parker 's bartleby should have been t...               8
...                                                 ...             ...
6915  painful , horrifying and oppressively tragic ,...               4
6916  take care is nicely performed by a quintet of ...               0
6917  the script covers huge , heavy topics in a bla...               4
6918  a seriously bad film with seriously warp

In [None]:
# Output two reviews for each cluster
print("Reviews with cluster labels:")
for cluster_label in range(best_k):
    cluster_reviews = data[data['kmeans_cluster'] == cluster_label]['review'].head(2)
    print(f"\nCluster {cluster_label}:")
    for review in cluster_reviews:
        print(review)

Reviews with cluster labels:

Cluster 0:
may be more genial than ingenious , but it gets the job done .
if the tuxedo actually were a suit , it would fit chan like a $ 99 bargain-basement special .

Cluster 1:
final verdict : you 've seen it all before .
you ... get a sense of good intentions derailed by a failure to seek and strike just the right tone .

Cluster 2:
this is a visually stunning rumination on love , memory , history and the war between art and commerce .
the film is strictly routine .

Cluster 3:
a fan film that for the uninitiated plays better on video with the sound turned down .
a little less extreme than in the past , with longer exposition sequences between them , and with fewer gags to break the tedium .

Cluster 4:
a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films
they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science-fi

**DBSCAN**

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Determine the optimal epsilon value using silhouette score
best_eps = 0
best_score = -1
for eps in [0.1, 0.5, 1.0, 1.5, 2.0]:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    dbscan.fit(X)
    unique_labels = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)  # Exclude noise points (-1)
    if unique_labels > 1:  # Check if more than one cluster label is present
        score = silhouette_score(X, dbscan.labels_)
        if score > best_score:
            best_score = score
            best_eps = eps

# Perform DBSCAN clustering with the optimal epsilon value
dbscan = DBSCAN(eps=best_eps, min_samples=5)
dbscan.fit(X)

# Assign cluster labels to each review
data['dbscan_cluster'] = dbscan.labels_

# Output the number of clusters and points in each cluster
num_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)  # Exclude noise points (-1)
print(f"Number of clusters found: {num_clusters}")
print("Number of points in each cluster:")
print(data['dbscan_cluster'].value_counts())

# Output the reviews along with their assigned clusters
print("\nReviews with cluster labels:")
print(data[['review', 'dbscan_cluster']])


Number of clusters found: 4
Number of points in each cluster:
dbscan_cluster
-1    6837
 0      57
 2      14
 3       7
 1       5
Name: count, dtype: int64

Reviews with cluster labels:
                                                 review  dbscan_cluster
0     a stirring , funny and finally transporting re...              -1
1     apparently reassembled from the cutting-room f...              -1
2     they presume their audience wo n't sit still f...              -1
3     this is a visually stunning rumination on love...              -1
4     jonathan parker 's bartleby should have been t...              -1
...                                                 ...             ...
6915  painful , horrifying and oppressively tragic ,...              -1
6916  take care is nicely performed by a quintet of ...              -1
6917  the script covers huge , heavy topics in a bla...              -1
6918  a seriously bad film with seriously warped log...              -1
6919  a deliciously 

In [None]:
# Output two reviews for each cluster
print("\nTwo reviews for each cluster:")
for cluster_label in sorted(set(dbscan.labels_)):
    if cluster_label == -1:
        continue  # Skip noise points
    cluster_reviews = data[data['dbscan_cluster'] == cluster_label]['review'].head(2)
    print(f"Cluster {cluster_label}:")
    for review in cluster_reviews:
        print(f"  - {review}")



Two reviews for each cluster:
Cluster 0:
  - amazingly lame .
  - wow .
Cluster 1:
  - fun and nimble .
  - delirious fun .
Cluster 2:
  - technically and artistically inept .
  - fierce , glaring and unforgettable .
Cluster 3:
  - remember it .
  - it 's uninteresting .
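
In the eps search above, the silhouette score is computed with the noise label (-1) treated as if it were a cluster. A possible variation, sketched here on synthetic data (`X_toy` is a hypothetical stand-in, not the review vectors), scores only the points that DBSCAN actually assigned to a cluster and reports the noise fraction separately.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: two tight blobs plus scattered outliers.
points, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                       cluster_std=0.3, random_state=42)
rng = np.random.default_rng(42)
outliers = rng.uniform(low=-10, high=15, size=(20, 2))
X_toy = np.vstack([points, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_toy)

# Keep only points that belong to a real cluster (label != -1).
mask = labels != -1
n_clusters = len(set(labels[mask]))
noise_fraction = 1 - mask.mean()

score = None
if n_clusters > 1:
    # Silhouette needs at least two clusters among the retained points.
    score = silhouette_score(X_toy[mask], labels[mask])

print(f"clusters={n_clusters}, noise fraction={noise_fraction:.2f}, silhouette={score}")
```

A high silhouette combined with a large noise fraction, as in the DBSCAN run above where 6,837 of 6,920 reviews are noise, indicates that the score describes only a small part of the data.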


**Hierarchical clustering**

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Perform hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=2)
hierarchical.fit(X.toarray())

# Assign cluster labels to each review
data['hierarchical_cluster'] = hierarchical.labels_

# Output the reviews along with their assigned clusters
print("Reviews with cluster labels:")
print(data[['review', 'hierarchical_cluster']])


Reviews with cluster labels:
                                                 review  hierarchical_cluster
0     a stirring , funny and finally transporting re...                     0
1     apparently reassembled from the cutting-room f...                     1
2     they presume their audience wo n't sit still f...                     0
3     this is a visually stunning rumination on love...                     0
4     jonathan parker 's bartleby should have been t...                     0
...                                                 ...                   ...
6915  painful , horrifying and oppressively tragic ,...                     0
6916  take care is nicely performed by a quintet of ...                     0
6917  the script covers huge , heavy topics in a bla...                     0
6918  a seriously bad film with seriously warped log...                     0
6919  a deliciously nonsensical comedy about a city ...                     0

[6920 rows x 2 columns]


In [None]:
# Group the data by hierarchical cluster labels
cluster_groups = data.groupby('hierarchical_cluster')

# Print two reviews for each cluster
for cluster_label, reviews_group in cluster_groups:
    print(f"Cluster {cluster_label}:")
    reviews = reviews_group['review'].head(2)  # Select two reviews from each cluster
    for review in reviews:
        print(review)
    print()


Cluster 0:
a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films
they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science-fiction elements of bug-eyed monsters and futuristic women in skimpy clothes .

Cluster 1:
apparently reassembled from the cutting-room floor of any given daytime soap .
it is as uncompromising as it is nonjudgmental , and makes clear that a prostitute can be as lonely and needy as any of the clients .
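
The hierarchical run above fixes `n_clusters=2` and does not print cluster sizes. As a minimal sketch, the following tries several cluster counts, prints the size of each cluster, and picks the count with the best silhouette score, mirroring the selection used for K-means. The data (`X_toy`) is synthetic and hypothetical; in the notebook it would be `X.toarray()`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups.
X_toy, _ = make_blobs(n_samples=90, centers=[[0, 0], [10, 0], [0, 10]],
                      cluster_std=0.5, random_state=42)

results = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X_toy)
    sizes = np.bincount(labels)          # number of points per cluster
    results[k] = (silhouette_score(X_toy, labels), sizes)
    print(f"k={k}: silhouette={results[k][0]:.3f}, sizes={sizes.tolist()}")

best_k = max(results, key=lambda k: results[k][0])
print(f"Best number of clusters by silhouette: {best_k}")
```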



**Word2Vec**

In [None]:
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


# Train Word2Vec model
sentences = [review.split() for review in data['review']]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Generate embeddings for each review
word2vec_features = []
for review in sentences:
    embeddings = [word2vec_model.wv[word] for word in review if word in word2vec_model.wv]
    if embeddings:
        review_embedding = sum(embeddings) / len(embeddings)
        word2vec_features.append(review_embedding)
    else:
        word2vec_features.append([0] * 100)  # Handle case where review has no embeddings

# Determine the optimal number of clusters for K-means clustering
max_clusters = 10  # Maximum number of clusters to consider
best_score = -1
best_n_clusters = 2
for n_clusters in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(word2vec_features)
    score = silhouette_score(word2vec_features, kmeans.labels_)
    if score > best_score:
        best_score = score
        best_n_clusters = n_clusters

# Perform K-means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=best_n_clusters, random_state=42)
kmeans.fit(word2vec_features)
data['word2vec_cluster'] = kmeans.labels_

# Print the number of points in each cluster
print("Number of points in each cluster:")
print(data['word2vec_cluster'].value_counts())

# Print the output
print("\nWord2Vec clustering:")
print(data[['review', 'word2vec_cluster']])




Number of points in each cluster:
word2vec_cluster
1    3638
0    3282
Name: count, dtype: int64

Word2Vec clustering:
                                                 review  word2vec_cluster
0     a stirring , funny and finally transporting re...                 0
1     apparently reassembled from the cutting-room f...                 0
2     they presume their audience wo n't sit still f...                 0
3     this is a visually stunning rumination on love...                 1
4     jonathan parker 's bartleby should have been t...                 0
...                                                 ...               ...
6915  painful , horrifying and oppressively tragic ,...                 1
6916  take care is nicely performed by a quintet of ...                 0
6917  the script covers huge , heavy topics in a bla...                 1
6918  a seriously bad film with seriously warped log...                 0
6919  a deliciously nonsensical comedy about a city ...            

In [None]:
# Print two reviews for each Word2Vec cluster
print("\nTwo reviews from each Word2Vec cluster:")
for cluster_id in range(best_n_clusters):
    cluster_reviews = data[data['word2vec_cluster'] == cluster_id]['review'][:2]  # Get the first two reviews
    print(f"\nCluster {cluster_id}:\n")
    for review in cluster_reviews:
        print(review)
        print()



Two reviews from each Word2Vec cluster:

Cluster 0:

a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films

apparently reassembled from the cutting-room floor of any given daytime soap .


Cluster 1:

this is a visually stunning rumination on love , memory , history and the war between art and commerce .

campanella gets the tone just right -- funny in the middle of sad in the middle of hopeful .
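
Each review vector above is the average of its word vectors. The following is a minimal sketch of that pooling step in isolation, assuming a hypothetical three-dimensional lookup table (`toy_vectors`) in place of the trained `word2vec_model.wv`. It returns a NumPy zero vector for reviews with no known words, so that every row has the same type and shape.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for word2vec_model.wv.
toy_vectors = {
    "funny": np.array([1.0, 0.0, 0.0]),
    "film": np.array([0.0, 1.0, 0.0]),
    "dull": np.array([0.0, 0.0, 1.0]),
}
vector_size = 3


def mean_pool(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)


reviews = [["funny", "film"], ["dull", "film", "unknownword"], ["unknownword"]]
features = np.vstack([mean_pool(r, toy_vectors, vector_size) for r in reviews])
print(features.shape)
print(features)
```

Stacking the pooled vectors into one 2-D array gives K-means and `silhouette_score` a uniform input.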



**BERT**

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load pre-trained BERT model
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')

# Encode text reviews using BERT
review_embeddings = bert_model.encode(data['review'].tolist())

# Determine the optimal number of clusters for K-means clustering
max_clusters = 10  # Maximum number of clusters to consider
best_score = -1
best_n_clusters = 2
for n_clusters in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(review_embeddings)
    score = silhouette_score(review_embeddings, kmeans.labels_)
    if score > best_score:
        best_score = score
        best_n_clusters = n_clusters

# Perform K-means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=best_n_clusters, random_state=42)
kmeans.fit(review_embeddings)
data['bert_cluster'] = kmeans.labels_

# Print the number of points in each cluster
print("Number of points in each cluster:")
print(data['bert_cluster'].value_counts())

# Print the output
print("\nBERT clustering:")
print(data[['review', 'bert_cluster']])




Number of points in each cluster:
bert_cluster
1    3905
0    3015
Name: count, dtype: int64

BERT clustering:
                                                 review  bert_cluster
0     a stirring , funny and finally transporting re...             0
1     apparently reassembled from the cutting-room f...             1
2     they presume their audience wo n't sit still f...             1
3     this is a visually stunning rumination on love...             0
4     jonathan parker 's bartleby should have been t...             1
...                                                 ...           ...
6915  painful , horrifying and oppressively tragic ,...             1
6916  take care is nicely performed by a quintet of ...             0
6917  the script covers huge , heavy topics in a bla...             1
6918  a seriously bad film with seriously warped log...             1
6919  a deliciously nonsensical comedy about a city ...             1

[6920 rows x 2 columns]


In [None]:
# Print two reviews from each BERT cluster
print("\nTwo reviews from each BERT cluster:")
best_n_clusters = 2
for cluster_id in range(best_n_clusters):
    cluster_reviews = data[data['bert_cluster'] == cluster_id]['review'][:2]  # Get the first two reviews
    print(f"\nCluster {cluster_id}:\n")
    for review in cluster_reviews:
        print(review)
        print()


Two reviews from each BERT cluster:

Cluster 0:

a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films

this is a visually stunning rumination on love , memory , history and the war between art and commerce .


Cluster 1:

apparently reassembled from the cutting-room floor of any given daytime soap .

they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science-fiction elements of bug-eyed monsters and futuristic women in skimpy clothes .
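
Sentence embeddings are often compared by cosine similarity, while K-means uses Euclidean distance. One possible variation, sketched below, is to L2-normalize the embeddings first: for unit vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity, so the two notions agree. The array `toy_embeddings` is random, hypothetical data standing in for `review_embeddings`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Random stand-in for review_embeddings; real vectors would come from
# bert_model.encode(...) in the notebook.
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(50, 16))

# Scale every row to unit L2 norm (the default for normalize).
unit_embeddings = normalize(toy_embeddings)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(unit_embeddings)
sizes = np.bincount(km.labels_, minlength=2)
print("Cluster sizes:", sizes.tolist())

# Check the identity ||a - b||^2 = 2 - 2 * cos(a, b) on one pair of rows.
a, b = unit_embeddings[0], unit_embeddings[1]
sq_dist = float(np.sum((a - b) ** 2))
cosine = float(a @ b)
print(f"squared distance={sq_dist:.6f}, 2 - 2*cos={2 - 2 * cosine:.6f}")
```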



**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

The five approaches produced noticeably different groupings. K-means on TF-IDF features, with k chosen by silhouette score, produced 10 clusters of uneven size, ranging from 179 to 2,037 reviews. DBSCAN found only 4 small clusters (5 to 57 reviews each, and the sampled members are very short reviews) and labeled the remaining 6,837 reviews as noise, which suggests that the tested eps values do not suit these TF-IDF vectors. Hierarchical clustering was run with a fixed `n_clusters=2`, and one of the two clusters takes in most of the data points. Word2Vec embeddings with K-means gave two clusters of roughly equal size (3,638 and 3,282 reviews), which points to a broad semantic split. BERT embeddings with K-means also gave two clusters (3,905 and 3,015 reviews), and the sampled reviews suggest that they partly reflect the sentiment of the reviews. Overall, K-means, Word2Vec, and BERT produced more usable partitions, while DBSCAN and hierarchical clustering struggled, either because of noise or because of their dependence on parameter settings. Each method gives a different view of the dataset's structure, so the choice of technique should follow from the nature of the data and the expected outcome.
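
The comparison above is qualitative. As a minimal sketch of a quantitative check, the adjusted Rand index (ARI) measures how similar two cluster assignments are, independent of how the clusters are numbered. The label lists below are hypothetical; in the notebook they could be columns such as `data['kmeans_cluster']` and `data['bert_cluster']`.

```python
from itertools import combinations

from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments for six items from three methods.
assignments = {
    "method_a": [0, 0, 1, 1, 2, 2],
    "method_b": [1, 1, 0, 0, 2, 2],  # same grouping as method_a, labels permuted
    "method_c": [0, 1, 0, 1, 0, 1],  # a different grouping
}

# ARI is 1.0 for identical groupings and close to 0 (or negative) for unrelated ones.
ari = {}
for (name_a, labels_a), (name_b, labels_b) in combinations(assignments.items(), 2):
    ari[(name_a, name_b)] = adjusted_rand_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: ARI = {ari[(name_a, name_b)]:.3f}")
```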




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above question):

'''
This assignment let me examine a range of text clustering methods: K-means, DBSCAN, hierarchical clustering, Word2Vec, and BERT. The exercises were a good chance to put into practice what I had learned: exploring the review dataset, preprocessing the text, extracting features, applying clustering, and interpreting the results.


What stood out most was the variety of clustering methods used, each with its own strengths and weaknesses. K-means, for example, is simple and easy to understand, which makes it a popular choice for many use cases, but the number of clusters has to be specified up front, without any prior knowledge of the data. DBSCAN is good at discovering clusters of arbitrary shape and size, but it can have difficulty when density varies within or between clusters.


Hierarchical clustering gives a hierarchical view of the data from coarse to fine granularity, which helps in exploring relationships. However, it is computationally expensive and sensitive to noise. Word2Vec and BERT embeddings take semantic extraction further, revealing similarities not only between identical words but also between words and sentences with related meanings. BERT relies on a pre-trained language model, whereas the Word2Vec model in this exercise was trained on the review corpus itself. Both encode text as dense vectors, which allows effective clustering even with little or no labeled data.


After this experience, I have a clearer understanding of text clustering and can compare how the different methods perform and what results they give. As a next step, it would be worthwhile to try more advanced clustering algorithms and to measure how well they work on less standard datasets, to keep building my text analysis and clustering skills.



'''