# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
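
As a minimal sketch of how all four measures could be collected in a single 10-fold cross-validation pass, the example below uses scikit-learn's `cross_validate` with a list of scorer names. The tiny corpus and its labels are made up for illustration and are not the assignment data; using `StratifiedKFold` with shuffling is also an assumption, not a requirement of the exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
nouns = ["movie", "film", "acting", "story", "plot",
         "cast", "script", "ending", "pacing", "music"]
texts = [f"good {n}" for n in nouns] + [f"bad {n}" for n in nouns]
labels = [1] * len(nouns) + [0] * len(nouns)

# Vectorizer and classifier live in one pipeline so that each fold
# fits the vocabulary on its own training part only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# 10 stratified folds keep the class balance in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(pipeline, texts, labels, cv=cv, scoring=scoring)

# One score per fold is stored under "test_<metric>"
for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Each entry in `results` holds one score per fold, so both the mean and the spread across folds can be reported.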


In [None]:
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("stsa_train.csv")

# Display the shape of the DataFrame
print("Shape of the DataFrame:")
print(df.shape)
print()

Shape of the DataFrame:
(6920, 2)



In [None]:
# Display the first few rows of the DataFrame
print("First few rows of the DataFrame:")
print(df.head())
print()

First few rows of the DataFrame:
   sentiment                                             review
0          1  a stirring , funny and finally transporting re...
1          0  apparently reassembled from the cutting-room f...
2          0  they presume their audience wo n't sit still f...
3          1  this is a visually stunning rumination on love...
4          1  jonathan parker 's bartleby should have been t...



In [None]:
# Display summary statistics of the DataFrame
print("Summary statistics of the DataFrame:")
print(df.describe())
print()

Summary statistics of the DataFrame:
         sentiment
count  6920.000000
mean      0.521676
std       0.499566
min       0.000000
25%       0.000000
50%       1.000000
75%       1.000000
max       1.000000



In [None]:
import pandas as pd

# Load the CSV file into a DataFrame
data = pd.read_csv("stsa_test.csv")

# Display the shape of the DataFrame
print("Shape of the DataFrame:")
print(data.shape)
print()

Shape of the DataFrame:
(1821, 2)



In [None]:
# Display the first few rows of the DataFrame
print("First few rows of the DataFrame:")
print(data.head())
print()

First few rows of the DataFrame:
   sentiment                                             review
0          0     no movement , no yuks , not much of anything .
1          0  a gob of drivel so sickly sweet , even the eag...
2          0  gangs of new york is an unapologetic mess , wh...
3          0  we never really feel involved with the story ,...
4          1            this is one of polanski 's best films .



In [None]:
# Display summary statistics of the DataFrame
print("Summary statistics of the DataFrame:")
print(data.describe())
print()

Summary statistics of the DataFrame:
         sentiment
count  1821.000000
mean      0.499176
std       0.500137
min       0.000000
25%       0.000000
50%       0.000000
75%       1.000000
max       1.000000



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Loading the training and testing datasets
data_one = pd.read_csv('stsa_train.csv')
data_two = pd.read_csv('stsa_test.csv')

# Separating features (reviews) and labels (sentiments)
X_one = data_one['review']
y_one = data_one['sentiment']
X_two = data_two['review']
y_two = data_two['sentiment']

# Splitting the training data for validation (80-20 split)
X_one, X_three, y_one, y_three = train_test_split(X_one, y_one, test_size=0.2, random_state=42)

# Defining a set of classifiers
models = {
    "Classifier1": MultinomialNB(),
    "Classifier2": SVC(),
    "Classifier3": KNeighborsClassifier(),
    "Classifier4": DecisionTreeClassifier(),
    "Classifier5": RandomForestClassifier(),
    "Classifier6": XGBClassifier()
}

# Defining evaluation metrics
metrics = {
    "Accuracy": accuracy_score,
    "Precision": precision_score,
    "Recall_Score": recall_score,
    "F1_Score": f1_score
}


# Performing 10-fold cross-validation for each classifier
for model_name, classifier in models.items():
    pipeline = make_pipeline(CountVectorizer(), classifier)
    cv_scores = cross_val_score(pipeline, X_one, y_one, cv=10)
    print(f"\n{model_name}:")
    print("Cross-validation scores:", cv_scores)
    print("Mean Cross-validation score:", cv_scores.mean())

    # Training the final model
    pipeline.fit(X_one, y_one)

    # Evaluating on the validation set
    y_three_pred = pipeline.predict(X_three)
    print("Validation Metrics:")
    for metric_name, metric_function in metrics.items():
        metric_value = metric_function(y_three, y_three_pred)
        print(f"{metric_name}: {metric_value}")



Classifier1:
Cross-validation scores: [0.75631769 0.79783394 0.78519856 0.79061372 0.79061372 0.76173285
 0.78481013 0.78481013 0.75406872 0.78481013]
Mean Cross-validation score: 0.7790809565154947
Validation Metrics:
Accuracy: 0.7947976878612717
Precision: 0.777490297542044
Recall_Score: 0.8429172510518934
F1_Score: 0.8088829071332435

Classifier2:
Cross-validation scores: [0.73465704 0.72382671 0.71299639 0.72021661 0.74368231 0.76173285
 0.7522604  0.72151899 0.71971067 0.75768535]
Mean Cross-validation score: 0.734828732022901
Validation Metrics:
Accuracy: 0.7557803468208093
Precision: 0.7394636015325671
Recall_Score: 0.8120617110799438
F1_Score: 0.7740641711229947

Classifier3:
Cross-validation scores: [0.58483755 0.56137184 0.58844765 0.57581227 0.5631769  0.55234657
 0.60036166 0.56600362 0.54972875 0.54068716]
Mean Cross-validation score: 0.5682773973273447
Validation Metrics:
Accuracy: 0.6163294797687862
Precision: 0.6226415094339622
Recall_Score: 0.6479663394109397
F1_Score

In [None]:
# Evaluating the trained models on the test dataset
print("\nFinal Evaluation on Test Data:")
for model_name, classifier in models.items():
    pipeline = make_pipeline(CountVectorizer(), classifier)
    pipeline.fit(X_one, y_one)
    y_two_pred = pipeline.predict(X_two)
    print(f"\n{model_name}:")
    for metric_name, metric_function in metrics.items():
        metric_value = metric_function(y_two, y_two_pred)
        print(f"{metric_name}: {metric_value}")


Final Evaluation on Test Data:

Classifier1:
Accuracy: 0.8072487644151565
Precision: 0.7852760736196319
Recall_Score: 0.8448844884488449
F1_Score: 0.8139904610492847

Classifier2:
Accuracy: 0.757276221856123
Precision: 0.7346733668341708
Recall_Score: 0.8041804180418042
F1_Score: 0.7678571428571429

Classifier3:
Accuracy: 0.5831960461285008
Precision: 0.5773195876288659
Recall_Score: 0.6160616061606161
F1_Score: 0.5960617349654072

Classifier4:
Accuracy: 0.6392092257001647
Precision: 0.6293634496919918
Recall_Score: 0.6743674367436744
F1_Score: 0.6510886882634095

Classifier5:
Accuracy: 0.7243272926963207
Precision: 0.7024875621890547
Recall_Score: 0.7766776677667767
F1_Score: 0.7377220480668758

Classifier6:
Accuracy: 0.727622185612301
Precision: 0.6938967136150235
Recall_Score: 0.812981298129813
F1_Score: 0.7487335359675785


Multinomial Naive Bayes (Classifier1): The validation and test metrics agree closely (accuracy 0.79 vs. 0.81), and precision (about 0.78) and recall (about 0.84) are reasonably balanced. It is the strongest model on every test metric.

Support Vector Machine (Classifier2): Performance is lower than Naive Bayes on both the validation and test sets (accuracy about 0.76). Recall (about 0.80) exceeds precision (about 0.73 to 0.74), so the model mistakes negative reviews for positive ones more often than the reverse.

K-Nearest Neighbors (Classifier3): The lowest accuracy of all the models (0.58 on the test set), only slightly better than chance. It largely fails to separate positive from negative reviews.

Decision Tree (Classifier4): Accuracy is marginally higher than KNN (0.64 vs. 0.58 on the test set) but still well below the other models, so a single tree is also a weak choice for this sentiment task.

Random Forest (Classifier5): Clearly better than KNN and the Decision Tree, with moderate test accuracy (0.72). Its recall (0.78) is higher than its precision (0.70), but Naive Bayes is better on both.

XGBoost (Classifier6): Test accuracy (0.73) is almost the same as Random Forest, with higher recall (0.81) but slightly lower precision (0.69). The lower precision means it sometimes labels negative reviews as positive.

Overall:


Multinomial Naive Bayes performed best on the test set in accuracy, precision, recall, and F1 score.

Random Forest or XGBoost can be reasonable alternatives; XGBoost in particular reaches high recall on positive reviews (0.81). That recall comes at the cost of mislabeling more negative reviews as positive.

The Support Vector Machine was second best in test accuracy (0.76), but with the default settings used here it still trails Naive Bayes on every metric.

KNN and Decision Tree performed worst and can probably be dropped from further consideration.
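
The link between precision, recall, and the two kinds of mistakes discussed above can be checked from a confusion matrix. The following is a small sketch with hypothetical labels and predictions (not taken from the models above); it shows that precision below recall means more false positives than false negatives.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")

# The hand-computed values agree with scikit-learn's metric functions
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
```

Here two negative reviews are predicted as positive (false positives) and one positive review is missed (false negative), which is why precision (0.60) is lower than recall (0.75).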

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [None]:
pip install sentence-transformers




In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('stsa_train.csv')
print(data.head())

   sentiment                                             review
0          1  a stirring , funny and finally transporting re...
1          0  apparently reassembled from the cutting-room f...
2          0  they presume their audience wo n't sit still f...
3          1  this is a visually stunning rumination on love...
4          1  jonathan parker 's bartleby should have been t...


**K-means**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd

# Feature extraction using TF-IDF
tfidf = TfidfVectorizer(max_features=1000)
X_features = tfidf.fit_transform(data['review'])

# Determine the optimal number of clusters using silhouette score
max_clusters = 10
best_score = -1
best_k = 2
for k in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=k, n_init='auto', random_state=42)
    kmeans.fit(X_features)
    score = silhouette_score(X_features, kmeans.labels_)
    if score > best_score:
        best_score = score
        best_k = k

# Perform k-means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42)
kmeans.fit(X_features)

# Assign cluster labels to each review
data['kmeans_cluster'] = kmeans.labels_

# Output the number of clusters and points in each cluster
num_clusters = len(set(kmeans.labels_))
print(f"Number of clusters found: {num_clusters}")
print("Number of points in each cluster:")
print(data['kmeans_cluster'].value_counts())

# Output the reviews along with their assigned clusters
print("\nReviews with cluster labels:")
print(data[['review', 'kmeans_cluster']])


Number of clusters found: 10
Number of points in each cluster:
kmeans_cluster
4    1552
5    1036
2     802
3     792
7     684
0     569
1     520
6     388
9     383
8     194
Name: count, dtype: int64

Reviews with cluster labels:
                                                 review  kmeans_cluster
0     a stirring , funny and finally transporting re...               2
1     apparently reassembled from the cutting-room f...               4
2     they presume their audience wo n't sit still f...               4
3     this is a visually stunning rumination on love...               2
4     jonathan parker 's bartleby should have been t...               5
...                                                 ...             ...
6915  painful , horrifying and oppressively tragic ,...               0
6916  take care is nicely performed by a quintet of ...               3
6917  the script covers huge , heavy topics in a bla...               4
6918  a seriously bad film with seriously warp

In [None]:
# Output one review for each cluster
print("Sample Reviews from Each Cluster:")
for cluster_label in range(best_k):
    cluster_reviews = data[data['kmeans_cluster'] == cluster_label]['review'].head(1)
    print(f"\nCluster {cluster_label}:")
    for review in cluster_reviews:
        print(review)

Sample Reviews from Each Cluster:

Cluster 0:
a fan film that for the uninitiated plays better on video with the sound turned down .

Cluster 1:
a wretched movie that reduces the second world war to one man 's quest to find an old flame .

Cluster 2:
a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films

Cluster 3:
may be more genial than ingenious , but it gets the job done .

Cluster 4:
apparently reassembled from the cutting-room floor of any given daytime soap .

Cluster 5:
jonathan parker 's bartleby should have been the be-all-end-all of the modern-office anomie films .

Cluster 6:
for something as splendid-looking as this particular film , the viewer expects something special but instead gets -lrb- sci-fi -rrb- rehash .

Cluster 7:
béart and berling are both superb , while huppert ... is magnificent .

Cluster 8:
blue crush follows the formula , but throws in too many conflicts to keep the story compelling .

Cluster 9:
final ver

**DBSCAN**

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the text data to TF-IDF vectors
tfidf_vectors = tfidf_vectorizer.fit_transform(data['review'])

# Determine the optimal epsilon value using silhouette score
best_eps = None
best_score = -1
for eps in [0.1, 0.5, 1.0, 1.5, 2.0]:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    dbscan.fit(tfidf_vectors)
    unique_labels = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)  # Exclude noise points (-1)
    if unique_labels > 1:
        score = silhouette_score(tfidf_vectors, dbscan.labels_)
        if score > best_score:
            best_score = score
            best_eps = eps

# Perform DBSCAN clustering with the optimal epsilon value
dbscan = DBSCAN(eps=best_eps, min_samples=5)
dbscan.fit(tfidf_vectors)

# Assign cluster labels to each review
data['dbscan_cluster'] = dbscan.labels_

# Output the number of clusters and points in each cluster
num_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)  # Exclude noise points (-1)
print(f"Number of clusters found: {num_clusters}")
print("Number of points in each cluster:")
print(data['dbscan_cluster'].value_counts())

# Output the reviews along with their assigned clusters
print("\nReviews with cluster labels:")
print(data[['review', 'dbscan_cluster']])


Number of clusters found: 4
Number of points in each cluster:
dbscan_cluster
-1    6837
 0      57
 2      14
 3       7
 1       5
Name: count, dtype: int64

Reviews with cluster labels:
                                                 review  dbscan_cluster
0     a stirring , funny and finally transporting re...              -1
1     apparently reassembled from the cutting-room f...              -1
2     they presume their audience wo n't sit still f...              -1
3     this is a visually stunning rumination on love...              -1
4     jonathan parker 's bartleby should have been t...              -1
...                                                 ...             ...
6915  painful , horrifying and oppressively tragic ,...              -1
6916  take care is nicely performed by a quintet of ...              -1
6917  the script covers huge , heavy topics in a bla...              -1
6918  a seriously bad film with seriously warped log...              -1
6919  a deliciously 

In [None]:
# Output one review for each cluster
print("\nOne review for each cluster:")
for cluster_label in sorted(set(dbscan.labels_)):
    if cluster_label == -1:
        continue
    cluster_reviews = data[data['dbscan_cluster'] == cluster_label]['review'].head(1)
    print(f"Cluster {cluster_label}:")
    for review in cluster_reviews:
        print(f"  - {review}")



One review for each cluster:
Cluster 0:
  - amazingly lame .
Cluster 1:
  - fun and nimble .
Cluster 2:
  - technically and artistically inept .
Cluster 3:
  - remember it .
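
In the run above, DBSCAN labels almost every review as noise, and the silhouette score treats the noise label (-1) as if it were a cluster. One variation worth trying, offered here only as a sketch under assumptions, is to use cosine distance (often a better fit for TF-IDF vectors) and to compute the silhouette on non-noise points only. The toy corpus, `eps=0.5`, and `min_samples=2` are illustrative choices, not tuned values for the review data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus: two themes plus one unrelated document
docs = [
    "funny charming comedy",
    "funny charming film",
    "charming funny comedy film",
    "dull boring thriller",
    "boring dull plot",
    "dull boring thriller plot",
    "remember it",
]

X = TfidfVectorizer().fit_transform(docs)

# Cosine distance compares direction of the TF-IDF vectors
db = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit(X)
labels = db.labels_

# Exclude noise points (-1) before counting clusters and scoring
mask = labels != -1
n_clusters = len(set(labels[mask]))
print("Labels:", labels)
print("Clusters (noise excluded):", n_clusters)

score = None
if n_clusters > 1:
    score = silhouette_score(X[mask], labels[mask], metric="cosine")
    print(f"Silhouette on non-noise points: {score:.3f}")
```

Reporting the share of noise points next to the silhouette score keeps a configuration that clusters only a handful of points from looking better than it is.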


**Hierarchical clustering**

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Convert sparse matrix to dense array
X_dense = X_features.toarray()

# Perform hierarchical clustering
agglomerative = AgglomerativeClustering(n_clusters=2)
agglomerative.fit(X_dense)

# Assign cluster labels to each review
data['agglomerative_cluster'] = agglomerative.labels_

# Output the reviews along with their assigned clusters
print("Reviews with cluster labels:")
print(data[['review', 'agglomerative_cluster']])


Reviews with cluster labels:
                                                 review  agglomerative_cluster
0     a stirring , funny and finally transporting re...                      0
1     apparently reassembled from the cutting-room f...                      1
2     they presume their audience wo n't sit still f...                      0
3     this is a visually stunning rumination on love...                      0
4     jonathan parker 's bartleby should have been t...                      0
...                                                 ...                    ...
6915  painful , horrifying and oppressively tragic ,...                      0
6916  take care is nicely performed by a quintet of ...                      0
6917  the script covers huge , heavy topics in a bla...                      0
6918  a seriously bad film with seriously warped log...                      0
6919  a deliciously nonsensical comedy about a city ...                      0

[6920 rows x 2 columns]

In [None]:
# Group the data by agglomerative cluster labels
cluster_groups = data.groupby('agglomerative_cluster')

# Print two reviews for each cluster
for cluster_label, reviews_group in cluster_groups:
    print(f"Cluster {cluster_label}:")
    reviews = reviews_group['review'].head(2)
    for review in reviews:
        print(review)
    print()


Cluster 0:
a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films
they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science-fiction elements of bug-eyed monsters and futuristic women in skimpy clothes .

Cluster 1:
apparently reassembled from the cutting-room floor of any given daytime soap .
it is as uncompromising as it is nonjudgmental , and makes clear that a prostitute can be as lonely and needy as any of the clients .
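
The cell above fixes `n_clusters=2` in advance. As a hedged sketch of how the merge tree itself can inform that choice, the example below builds a Ward linkage with SciPy, prints the last few merge distances (a large jump suggests a natural cut), and then cuts the tree into a chosen number of clusters. The data comes from `make_blobs` and is purely synthetic; the choice of 3 clusters is an assumption for this toy data.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic 2-D data standing in for dense document vectors
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

# Ward linkage: each row of Z records one merge
# (cluster a, cluster b, merge distance, size of the new cluster)
Z = linkage(X, method="ward")

# Distances of the final merges; a sharp increase hints at where to cut
last_merges = Z[-5:, 2]
print("Last merge distances:", last_merges.round(2))

# Cut the tree so that at most 3 flat clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print("Cluster sizes:", {int(c): int((labels == c).sum()) for c in set(labels)})
```

For the TF-IDF matrix used in this notebook, the same calls would need the dense array (`X_dense`), and memory use grows quickly with the number of reviews.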



**Word2Vec**

In [None]:
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Train Word2Vec model
sentences = [review.split() for review in data['review']]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Generate embeddings for each review
word2vec_features = []
for review in sentences:
    embeddings = [word2vec_model.wv[word] for word in review if word in word2vec_model.wv]
    if embeddings:
        review_embedding = sum(embeddings) / len(embeddings)
        word2vec_features.append(review_embedding)
    else:
        word2vec_features.append([0] * 100)

# Determine the optimal number of clusters for K-means clustering
max_clusters = 10
best_score = -1
best_n_clusters = 2
for n_clusters in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(word2vec_features)
    score = silhouette_score(word2vec_features, kmeans.labels_)
    if score > best_score:
        best_score = score
        best_n_clusters = n_clusters

# Perform K-means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=best_n_clusters, n_init=10, random_state=42)
kmeans.fit(word2vec_features)
data['word2vec_cluster'] = kmeans.labels_

# Output the number of points in each cluster
print("Number of points in each cluster:")
print(data['word2vec_cluster'].value_counts())

# Output the reviews along with their assigned clusters
print("\nWord2Vec clustering:")
print(data[['review', 'word2vec_cluster']])


Number of points in each cluster:
word2vec_cluster
1    3633
0    3287
Name: count, dtype: int64

Word2Vec clustering:
                                                 review  word2vec_cluster
0     a stirring , funny and finally transporting re...                 0
1     apparently reassembled from the cutting-room f...                 0
2     they presume their audience wo n't sit still f...                 0
3     this is a visually stunning rumination on love...                 1
4     jonathan parker 's bartleby should have been t...                 0
...                                                 ...               ...
6915  painful , horrifying and oppressively tragic ,...                 1
6916  take care is nicely performed by a quintet of ...                 0
6917  the script covers huge , heavy topics in a bla...                 1
6918  a seriously bad film with seriously warped log...                 0
6919  a deliciously nonsensical comedy about a city ...            

In [None]:
# Print one review for each Word2Vec cluster
print("\nOne review for each Word2Vec cluster:")
for cluster_id in range(best_n_clusters):
    cluster_reviews = data[data['word2vec_cluster'] == cluster_id]['review'][:1]  # Get the first review
    print(f"\nCluster {cluster_id}:\n")
    for review in cluster_reviews:
        print(review)
        print()



One review for each Word2Vec cluster:

Cluster 0:

a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films


Cluster 1:

this is a visually stunning rumination on love , memory , history and the war between art and commerce .



**BERT**

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load pre-trained BERT model
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')

# Encode text reviews using BERT
review_embeddings = bert_model.encode(data['review'].tolist())

# Determine the optimal number of clusters for K-means clustering
max_clusters = 20
best_score = -1
best_n_clusters = 2
for n_clusters in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(review_embeddings)
    score = silhouette_score(review_embeddings, kmeans.labels_)
    if score > best_score:
        best_score = score
        best_n_clusters = n_clusters

# Perform K-means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=best_n_clusters, random_state=42)
kmeans.fit(review_embeddings)
data['bert_cluster'] = kmeans.labels_

# Print the number of points in each cluster
print("Number of points in each cluster:")
print(data['bert_cluster'].value_counts())

# Print the output
print("\nBERT clustering:")
print(data[['review', 'bert_cluster']])




Number of points in each cluster:
bert_cluster
1    3905
0    3015
Name: count, dtype: int64

BERT clustering:
                                                 review  bert_cluster
0     a stirring , funny and finally transporting re...             0
1     apparently reassembled from the cutting-room f...             1
2     they presume their audience wo n't sit still f...             1
3     this is a visually stunning rumination on love...             0
4     jonathan parker 's bartleby should have been t...             1
...                                                 ...           ...
6915  painful , horrifying and oppressively tragic ,...             1
6916  take care is nicely performed by a quintet of ...             0
6917  the script covers huge , heavy topics in a bla...             1
6918  a seriously bad film with seriously warped log...             1
6919  a deliciously nonsensical comedy about a city ...             1

[6920 rows x 2 columns]


In [None]:
# Print two reviews from each BERT cluster
print("\nTwo reviews from each BERT cluster:")
for cluster_id in range(best_n_clusters):
    cluster_reviews = data[data['bert_cluster'] == cluster_id]['review'][:2]  # Get the first two reviews
    print(f"\nCluster {cluster_id}:\n")
    for review in cluster_reviews:
        print(review)
        print()


Two reviews from each BERT cluster:

Cluster 0:

a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films

this is a visually stunning rumination on love , memory , history and the war between art and commerce .


Cluster 1:

apparently reassembled from the cutting-room floor of any given daytime soap .

they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science-fiction elements of bug-eyed monsters and futuristic women in skimpy clothes .



**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

The clustering methods produced different results. K-means (on TF-IDF features) divided the reviews into 10 clusters, the largest k tried. DBSCAN found only 4 small clusters and marked the large majority of reviews (6,837 of 6,920) as noise. Hierarchical clustering was run with the number of clusters fixed at 2, so it does not by itself indicate how many clusters the data contains. K-means on Word2Vec and on BERT embeddings each selected 2 clusters; the Word2Vec clusters (3,633 and 3,287) are more evenly sized than the BERT clusters (3,905 and 3,015). These results should be appraised against domain-specific criteria to decide which method best captures meaningful structure in the review data.
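
Beyond comparing cluster counts and sizes, two clusterings can be compared quantitatively. The sketch below, on synthetic `make_blobs` data (an assumption; it is not the review dataset), computes the silhouette score of each clustering and the adjusted Rand index (ARI) between them, which measures how much two label assignments agree regardless of how the clusters are numbered.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data standing in for document embeddings
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

# Two different methods asked for the same number of clusters
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
ag_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Internal quality: how compact and separated each clustering is
sil_km = silhouette_score(X, km_labels)
sil_ag = silhouette_score(X, ag_labels)

# Agreement between the two label assignments (1.0 = identical partitions)
ari = adjusted_rand_score(km_labels, ag_labels)

print(f"Silhouette  K-means: {sil_km:.3f}  Agglomerative: {sil_ag:.3f}")
print(f"Adjusted Rand index between the two: {ari:.3f}")
```

In this notebook the same functions could be applied to the `kmeans_cluster`, `agglomerative_cluster`, `word2vec_cluster`, and `bert_cluster` columns, or against the `sentiment` column to see whether any clustering lines up with the labels.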

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
In this assignment, I saw how various clustering techniques and text embedding methods are applied. Each has its own
benefits and drawbacks, but together they offer an efficient means to organize and interpret raw data.
K-means has a simple interface and is well suited to data that can be cleanly separated.
DBSCAN caught my attention for its ability to identify outliers automatically and to adapt to clusters of
different shapes and sizes.
BERT supplied rich sentence-level representations of the text, but I had to deal with the high
computational cost and preprocessing requirements inherent to this method as well.
'''