# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and to evaluate their performance. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
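
The workflow described above (an 80/20 split, 10-fold cross-validation on the training portion, then a final check on held-out data) can be sketched roughly as follows. This is only an illustration: the toy corpus, the `TfidfVectorizer` + `MultinomialNB` pipeline, and `random_state=42` are assumptions, not the Canvas dataset or a required configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (assumption): 50 positive and 50 negative snippets.
positive = [f"a wonderful and moving film number {i}" for i in range(50)]
negative = [f"a dull and tedious mess number {i}" for i in range(50)]
texts = positive + negative
labels = [1] * 50 + [0] * 50

# 80% training / 20% validation, stratified so both classes stay balanced.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Keeping the vectorizer inside the pipeline means each fold fits TF-IDF
# only on its own training part, so no information leaks from the held-out fold.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_validate(
    pipeline, X_train, y_train, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
)
print("Mean CV accuracy:", cv_scores["test_accuracy"].mean())

# Refit on the whole training portion, then score the held-out validation split.
pipeline.fit(X_train, y_train)
val_accuracy = pipeline.score(X_val, y_val)
print("Validation accuracy:", val_accuracy)
```

The same pattern should carry over to the real files by replacing `texts` and `labels` with the review and sentiment columns, and by scoring the separate test file at the very end.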


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
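
As a small worked example of the four measurements listed above, the sketch below computes them both by hand from confusion counts and with `sklearn.metrics`. The label vectors are made up for illustration and are not results from the assignment data.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels (assumption, for illustration only): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Confusion counts with class 1 treated as the positive class.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual = {
    "accuracy": (tp + tn) / len(pairs),
    "precision": precision,
    "recall": recall,
    "f1": 2 * precision * recall / (precision + recall),
}

# The library functions should agree with the manual formulas.
library = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
print(manual)
print(library)
```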


In [None]:
import pandas as pd

# Load the training and testing datasets
train_data = pd.read_csv('/content/stsa-train.txt', delimiter='\t')
test_data = pd.read_csv('/content/stsa-test.txt', delimiter='\t')

# Display the first few rows of the training data to understand its structure
print("Training Data:")
print(train_data.head())

# Display the first few rows of the testing data
print("\nTesting Data:")
print(test_data.head())


Training Data:
  1 a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films
0  0 apparently reassembled from the cutting-room...                                                       
1  0 they presume their audience wo n't sit still...                                                       
2  1 this is a visually stunning rumination on lo...                                                       
3  1 jonathan parker 's bartleby should have been...                                                       
4  1 campanella gets the tone just right -- funny...                                                       

Testing Data:
    0 no movement , no yuks , not much of anything .
0  0 a gob of drivel so sickly sweet , even the e...
1  0 gangs of new york is an unapologetic mess , ...
2  0 we never really feel involved with the story...
3          1 this is one of polanski 's best films .
4  1 take care of my cat offers a refreshingly di...
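
The printed frames above show the first review being used as the column header and the label and text ending up in one column. That suggests (as an assumption, since the raw files are not shown) that the files have no header row and separate the label from the text with a space rather than a tab. A minimal sketch of a parser for that assumed `<label> <review text>` layout follows; `load_stsa` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

def load_stsa(lines):
    """Parse lines of the assumed form '<label> <review text>' into a DataFrame."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, text = line.split(" ", 1)  # split on the first space only
        rows.append({"sentiment": int(label), "reviews": text})
    return pd.DataFrame(rows, columns=["sentiment", "reviews"])

# The two lines that were swallowed as headers in the output above.
sample_lines = [
    "1 a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films",
    "0 no movement , no yuks , not much of anything .",
]
sample_df = load_stsa(sample_lines)
print(sample_df)

# With the real files (paths taken from the loading cell above):
# with open('/content/stsa-train.txt', encoding='utf-8') as f:
#     train_data = load_stsa(f)
```

Parsed this way, the first review is kept as data, and `sentiment` and `reviews` are available as separate columns for the classifiers.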


In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Download NLTK resources (skipped automatically if already up to date)
nltk.download('punkt')
nltk.download('stopwords')

# Small hand-copied sample of five reviews (not the full training file)
data = {
    'reviews': [
        "a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films",
        "apparently reassembled from the cutting-room floor of any given daytime soap .",
        "they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science-fiction elements of bug-eyed monsters and futuristic women in skimpy clothes ",
        "this is a visually stunning rumination on love , memory , history and the war between art and commerce .",
        "jonathan parker 's bartleby should have been the be-all-end-all of the modern-office anomie films .",
    ],
    'sentiment': [1, 0, 0, 1, 1]
}

# Create DataFrame from sample data
df = pd.DataFrame(data)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['reviews'], df['sentiment'], test_size=0.2, random_state=42)

# Build the stopword set once rather than on every call
stop_words = set(stopwords.words('english'))

# Define a function for text preprocessing
def preprocess_text(text):
    # Tokenize, lowercase, and keep only alphabetic non-stopword tokens
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(tokens)

# Apply text preprocessing to training and testing data
X_train_processed = X_train.apply(preprocess_text)
X_test_processed = X_test.apply(preprocess_text)

# Define classifiers
classifiers = {
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

# Evaluate classifiers using 10-fold cross-validation
results = {}
for clf_name, clf in classifiers.items():
    print(f"Evaluating {clf_name}...")
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    cv_results = cross_validate(pipeline, X_train_processed, y_train, cv=10, scoring=['accuracy', 'precision', 'recall', 'f1'])
    results[clf_name] = cv_results

# Select the best classifier based on mean accuracy
best_classifier_name = max(results, key=lambda k: results[k]['test_accuracy'].mean())
best_classifier = classifiers[best_classifier_name]

# Train the best classifier on the full training data
pipeline_best = make_pipeline(TfidfVectorizer(), best_classifier)
pipeline_best.fit(X_train_processed, y_train)

# Evaluate the best classifier on the test data
y_pred_test = pipeline_best.predict(X_test_processed)
accuracy_test = accuracy_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test)
recall_test = recall_score(y_test, y_pred_test)
f1_test = f1_score(y_test, y_pred_test)

# Display test metrics for the best classifier
print(f"Test Metrics for {best_classifier_name}:")
print(f"Accuracy: {accuracy_test}")
print(f"Precision: {precision_test}")
print(f"Recall: {recall_test}")
print(f"F1 Score: {f1_test}")


Evaluating MultinomialNB...


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


ValueError: Cannot have number of splits n_splits=10 greater than the number of samples: n_samples=4.
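
The `ValueError` above follows from the sample size: the hand-built DataFrame has five rows, `test_size=0.2` leaves four of them for training, and ten folds cannot be drawn from four samples. The sketch below reproduces the same message with `KFold` on four placeholder samples and shows one possible guard; `folds_for` is a hypothetical helper, and the more direct remedy is probably to cross-validate on the full training file rather than on the five-row sample.

```python
import numpy as np
from sklearn.model_selection import KFold

# Four placeholder samples, mirroring the four training rows left after the split.
X_small = np.arange(4).reshape(-1, 1)

try:
    list(KFold(n_splits=10).split(X_small))
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

def folds_for(n_samples, requested=10):
    """Cap the fold count at the number of samples.

    Note: scikit-learn also requires at least 2 folds, so a dataset with a
    single sample cannot be cross-validated at all.
    """
    return min(requested, n_samples)

print(folds_for(4))     # at most 4 folds for 4 samples
print(folds_for(1000))  # the requested 10 folds fit easily
```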

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You may also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering

In [None]:
# Write your code here
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv("/content/Amazon_Unlocked_Mobile.csv")


Data Cleaning and preprocessing

In [None]:
# Preprocessing: Clean text data (e.g., remove stopwords, punctuation, lowercase)
# For simplicity, we'll use a basic TF-IDF approach

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Reviews'].dropna())
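
Because `dropna()` removes rows before vectorizing, the rows of the TF-IDF matrix no longer line up one-to-one with the rows of `data`. A minimal sketch of keeping track of the surviving index, so that per-document results can be written back later, is shown below. The toy frame and the `max_features=5000` cap are assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy frame standing in for the Amazon reviews file (assumption).
toy = pd.DataFrame({"Reviews": [
    "great battery life", None, "screen cracked quickly",
    "battery died quickly", None, "great screen",
]})

reviews = toy["Reviews"].dropna()  # keeps the original index of surviving rows
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
matrix = vectorizer.fit_transform(reviews)

print(matrix.shape)          # (documents, vocabulary terms)
print(list(reviews.index))   # original row positions of the vectorized reviews

# Any per-document result (here, the number of non-zero terms as a placeholder
# for cluster labels) can be written back to the right rows by index.
toy.loc[reviews.index, "n_terms"] = matrix.getnnz(axis=1)
print(toy)
```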

In [None]:
# Clustering using K-means
kmeans = KMeans(n_clusters=5, random_state=42)
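kmeans_labels = kmeans.fit_predict(tfidf_matrix)
# Silhouette needs pairwise distances, so score a random subsample
# (sample_size=10000 is an arbitrary choice) instead of every review
print("K-means Silhouette Score:", silhouette_score(tfidf_matrix, kmeans_labels, sample_size=10000, random_state=42))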




In [None]:
# Clustering using DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(tfidf_matrix)
# Number of clusters in labels, ignoring noise if present.
dbscan_n_clusters_ = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print("DBSCAN Number of Clusters:", dbscan_n_clusters_)
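
One point worth keeping in mind when choosing `eps`: `TfidfVectorizer` L2-normalizes each row by default, so for two rows the Euclidean distance d and the cosine similarity are tied by d² = 2(1 − cos). An `eps` of 0.5 therefore corresponds to a cosine distance of 0.125, which is fairly strict, and many reviews may end up labelled as noise (`-1`). The sketch below illustrates this on a made-up toy corpus; `euclidean_eps_to_cosine` is a hypothetical helper.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def euclidean_eps_to_cosine(eps):
    """For unit-length vectors, ||a - b||^2 = 2 * (1 - cos(a, b))."""
    return eps ** 2 / 2

# Toy corpus (assumption): two groups of five identical reviews plus two unrelated ones.
docs = (
    ["battery lasts long"] * 5
    + ["screen cracked early"] * 5
    + ["charger missing", "camera blurry"]
)

X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

cosine_eps = euclidean_eps_to_cosine(0.5)
labels = DBSCAN(eps=cosine_eps, min_samples=5, metric="cosine").fit_predict(X)

# Same cluster count as in the cell above, plus the number of noise points.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("eps as cosine distance:", cosine_eps)
print("clusters:", n_clusters, "noise points:", n_noise)
```

Here the two unrelated reviews share no terms with anything else, so they fall outside every neighbourhood and are reported as noise rather than forced into a cluster.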

In [None]:
# Clustering using Hierarchical (Agglomerative) Clustering
agg_clustering = AgglomerativeClustering(n_clusters=5)
agg_labels = agg_clustering.fit_predict(tfidf_matrix)
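
As far as I know, scikit-learn's `AgglomerativeClustering` expects a dense array, and its memory use grows roughly quadratically with the number of samples, so passing the full sparse TF-IDF matrix is unlikely to work. One possible workaround, sketched below under assumptions, is to cluster a random subsample after reducing it to a few dense dimensions with `TruncatedSVD`. The toy corpus, the sample size of 8, and `n_components=3` are illustrative values only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the reviews (assumption).
docs = [
    "battery lasts all day", "battery drains fast", "battery life is great",
    "screen cracked after a week", "screen is bright and sharp", "screen stopped responding",
    "camera takes sharp photos", "camera is blurry at night", "camera app crashes",
    "charger stopped working", "charger cable is short", "charger gets hot",
]
X_sparse = TfidfVectorizer().fit_transform(docs)

# 1) Subsample rows so the pairwise merge tree stays small.
rng = np.random.default_rng(42)
sample_idx = rng.choice(X_sparse.shape[0], size=8, replace=False)
X_sample = X_sparse[sample_idx]

# 2) Reduce to a small dense matrix; TruncatedSVD accepts sparse input.
svd = TruncatedSVD(n_components=3, random_state=42)
X_dense = svd.fit_transform(X_sample)

# 3) Cluster the dense, reduced sample (default Ward linkage).
agg = AgglomerativeClustering(n_clusters=2)
sample_labels = agg.fit_predict(X_dense)
print(dict(zip(sample_idx.tolist(), sample_labels.tolist())))
```

Note that the resulting labels cover only the sampled rows, so they have to be matched back to documents through `sample_idx`.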

In [None]:
# Visualize the clusters (for K-means and Hierarchical Clustering)
# Densify only the two plotted columns; calling .toarray() on the whole
# sparse matrix would allocate every (document, term) cell.
first_two = tfidf_matrix[:, [0, 1]].toarray()

plt.figure(figsize=(10, 6))
sns.scatterplot(x=first_two[:, 0], y=first_two[:, 1], hue=kmeans_labels, palette='viridis', legend='full')
plt.title('K-means Clustering')
plt.xlabel('TF-IDF Feature 1')
plt.ylabel('TF-IDF Feature 2')
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(x=first_two[:, 0], y=first_two[:, 1], hue=agg_labels, palette='viridis', legend='full')
plt.title('Hierarchical (Agglomerative) Clustering')
plt.xlabel('TF-IDF Feature 1')
plt.ylabel('TF-IDF Feature 2')
plt.show()
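
The first two TF-IDF columns are simply the first two vocabulary terms, so most documents sit at the origin of such a plot. An alternative that may give a more informative picture is to project every document onto two `TruncatedSVD` components and use those as plot axes. This is a sketch on a made-up corpus, not the method used in the cells above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the reviews (assumption).
docs = [
    "battery lasts all day", "battery drains fast",
    "screen cracked after a week", "screen is bright and sharp",
    "camera takes sharp photos", "camera is blurry at night",
]
X = TfidfVectorizer().fit_transform(docs)

# Project each document onto two latent components; sparse input is accepted.
svd = TruncatedSVD(n_components=2, random_state=42)
coords = svd.fit_transform(X)

x_vals, y_vals = coords[:, 0], coords[:, 1]
print(coords.shape)
print("Variance explained by each component:", svd.explained_variance_ratio_)
```

`x_vals` and `y_vals` could then be passed as `x` and `y` to `sns.scatterplot`, with `hue` set to the cluster labels.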

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

K-means, DBSCAN, and hierarchical clustering are clustering algorithms, while Word2Vec and BERT are ways of representing text that are then fed to such an algorithm, so together they cover different approaches and capabilities within text clustering. K-means partitions the data into a predefined number of clusters based on centroids and Euclidean distance; it suits well-separated clusters but is sensitive to outliers and noise. DBSCAN, on the other hand, determines the number of clusters from density, can handle clusters of arbitrary shape and size, and marks isolated points as noise, which is useful for noisy data. Hierarchical clustering builds a tree of clusters either by merging (agglomerative) or splitting (divisive) and can reveal hierarchical relationships within the data. Word2Vec provides dense embeddings that capture semantic relationships between words, but the word vectors must first be pooled (for example, averaged) into document vectors before clustering. BERT embeddings offer contextual representations of text, reflecting nuanced meanings, which can enhance clustering performance, particularly when context is critical. The choice of method depends on the nature of the data and the desired outcomes, with each approach offering its own advantages in capturing different aspects of text structure and semantics.
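
To make the point about word embeddings concrete, the sketch below averages word vectors into one vector per document and clusters those with K-means. The two-dimensional `toy_vectors` table is entirely made up for illustration; in practice the vectors would come from a trained Word2Vec model, and `document_vector` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D word vectors (assumption); a trained model would supply real ones.
toy_vectors = {
    "great": np.array([1.0, 0.1]),
    "excellent": np.array([0.9, 0.0]),
    "love": np.array([1.0, 0.2]),
    "broken": np.array([-1.0, 0.0]),
    "terrible": np.array([-0.9, 0.1]),
    "bad": np.array([-1.0, -0.1]),
}

def document_vector(text, vectors, dim=2):
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

docs = [
    "great excellent phone",
    "excellent love it",
    "broken terrible screen",
    "terrible bad battery",
]

# One pooled vector per document, then ordinary K-means on those vectors.
doc_matrix = np.vstack([document_vector(d, toy_vectors) for d in docs])
doc_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_matrix)
print(doc_labels)
```

Sentence-level BERT embeddings could be dropped in at the same place as `doc_matrix`, since the clustering step only needs one fixed-length vector per document.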





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
I tried many times to run both code cells, but neither of them works.





'''