# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
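
The split-then-cross-validate workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the required solution: the toy corpus is made up, and wrapping `TfidfVectorizer` and the classifier in a scikit-learn pipeline is one possible design choice, so that the vectorizer is refit on the training portion of every fold and never sees the held-out texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with the texts and labels from the training file.
positive = [f"great wonderful film number {i}" for i in range(30)]
negative = [f"terrible boring film number {i}" for i in range(30)]
texts = positive + negative
labels = [1] * 30 + [0] * 30

# 80% training / 20% validation, stratified so both classes stay balanced.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline refits TF-IDF inside every fold, so validation folds
# do not influence the vocabulary or the IDF weights.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training split only.
cv_scores = cross_val_score(
    pipeline, train_texts, train_labels, cv=10, scoring="accuracy"
)
print(f"Mean 10-fold CV accuracy: {cv_scores.mean():.3f}")

# Fit on the full training split, then check the held-out validation split.
pipeline.fit(train_texts, train_labels)
val_accuracy = pipeline.score(val_texts, val_labels)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

The test file would only be passed to `pipeline.predict` once, after model selection is finished.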


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
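
As a small worked example of the four measurements listed above, the sketch below computes them from confusion-matrix counts and checks the results against scikit-learn's metric functions. The label vectors are invented for illustration, and class 1 is assumed to be the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# For binary labels sorted as [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of correct predictions
precision = tp / (tp + fp)                          # predicted positives that are correct
recall = tp / (tp + fn)                             # actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy={accuracy:.3f}, Precision={precision:.3f}, "
      f"Recall={recall:.3f}, F1={f1:.3f}")

# The library functions should agree with the manual formulas.
sk_accuracy = accuracy_score(y_true, y_pred)
sk_precision = precision_score(y_true, y_pred, pos_label=1)
sk_recall = recall_score(y_true, y_pred, pos_label=1)
sk_f1 = f1_score(y_true, y_pred, pos_label=1)
```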


In [None]:
# Import the necessary libraries.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load Data
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    data = [line.strip().split(' ', 1) for line in lines] # Split each line at the first space
    df = pd.DataFrame(data, columns=['label', 'text'])
    return df

train_data = load_data('stsa-train.txt')
test_data = load_data('stsa-test.txt')

# Data Preprocessing and Vectorization
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data['text'])
y_train = train_data['label']

# Split Data
X_train, X_val, y_train, y_val = train_test_split(X_train_tfidf, y_train, test_size=0.2, random_state=42)

# Cross-Validation and Training
models = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    #'XGBoost': XGBClassifier(),
}

cv_means = {}  # Mean 10-fold CV accuracy per model, reused below for model selection

for name, model in models.items():
    print(f"Training and Cross-Validating {name}...")
    cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy')
    cv_means[name] = cv_scores.mean()
    print(f"{name} Cross-Validation Accuracy: {cv_means[name]}")
    model.fit(X_train, y_train)

    # Evaluate on the held-out validation split
    y_pred = model.predict(X_val)

    # Convert string labels to integers
    y_val_int = y_val.astype(int)
    y_pred_int = y_pred.astype(int)

    # Calculate precision with pos_label=1
    precision = precision_score(y_val_int, y_pred_int, pos_label=1)

    # Calculate recall with pos_label=1
    recall = recall_score(y_val_int, y_pred_int, pos_label=1)

    accuracy = accuracy_score(y_val, y_pred)
    # Manually calculate F1-score
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    print(f"{name} Evaluation Metrics:")
    print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-score: {f1}")

# Final evaluation: select the already-fitted model with the best mean CV accuracy
# (reusing the stored scores avoids re-running 10-fold CV for every model)
best_name = max(cv_means, key=cv_means.get)
best_model = models[best_name]
y_test_pred = best_model.predict(tfidf_vectorizer.transform(test_data['text']))
test_accuracy = accuracy_score(test_data['label'], y_test_pred)

# Convert string labels to integers for test data
test_data_int = test_data['label'].astype(int)
y_test_pred_int = y_test_pred.astype(int)

# Calculate precision with pos_label=1 for test data
test_precision = precision_score(test_data_int, y_test_pred_int, pos_label=1)

# Calculate recall with pos_label=1 for test data
test_recall = recall_score(test_data_int, y_test_pred_int, pos_label=1)

# Manually calculate F1-score for test data
test_f1 = 2 * (test_precision * test_recall) / (test_precision + test_recall) if (test_precision + test_recall) > 0 else 0

print("Final Evaluation Metrics on Test Data:")
print(f"Accuracy: {test_accuracy}, Precision: {test_precision}, Recall: {test_recall}, F1-score: {test_f1}")


Training and Cross-Validating MultinomialNB...
MultinomialNB Cross-Validation Accuracy: 0.7747527434864638
MultinomialNB Evaluation Metrics:
Accuracy: 0.7969653179190751, Precision: 0.7571428571428571, Recall: 0.8920056100981767, F1-score: 0.8190598840952994
Training and Cross-Validating SVM...
SVM Cross-Validation Accuracy: 0.7724042146219178
SVM Evaluation Metrics:
Accuracy: 0.7976878612716763, Precision: 0.7730138713745272, Recall: 0.8597475455820477, F1-score: 0.8140770252324037
Training and Cross-Validating KNN...
KNN Cross-Validation Accuracy: 0.71242876074709
KNN Evaluation Metrics:
Accuracy: 0.7276011560693642, Precision: 0.7131979695431472, Recall: 0.788218793828892, F1-score: 0.7488341105929379
Training and Cross-Validating Decision Tree...
Decision Tree Cross-Validation Accuracy: 0.6083815877948309
Decision Tree Evaluation Metrics:
Accuracy: 0.6083815028901735, Precision: 0.6135458167330677, Recall: 0.6479663394109397, F1-score: 0.6302864938608458
Training and Cross-Validati

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
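
Of the listed methods, hierarchical clustering is one whose input format is easy to get wrong, so a minimal sketch may help. It assumes a tiny made-up review corpus and uses scikit-learn's `AgglomerativeClustering` with its default Ward linkage. That estimator needs a dense array, so this approach is only practical on a small sample of the reviews, not the full dataset.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: three battery reviews, three screen reviews.
docs = [
    "battery life is great battery lasts long",
    "battery drains fast poor battery life",
    "battery charging is slow battery life short",
    "screen cracked easily screen broken",
    "screen display is bright screen sharp",
    "screen resolution display looks sharp screen",
]

# TF-IDF features, converted to a dense array because
# AgglomerativeClustering does not accept sparse matrices.
features = TfidfVectorizer().fit_transform(docs).toarray()

# Bottom-up merging with Ward linkage (Euclidean distance), cut at 2 clusters.
hierarchical = AgglomerativeClustering(n_clusters=2, linkage="ward")
hier_labels = hierarchical.fit_predict(features)

for doc, label in zip(docs, hier_labels):
    print(label, doc)
```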

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# installing NLTK stopwords
#nltk.download('stopwords')
#nltk.download('punkt')

# Loading the dataset
dataset_path = "Amazon_Unlocked_Mobile.csv"
df = pd.read_csv(dataset_path)

# Dropping rows with missing values in the 'Reviews' column
df.dropna(subset=['Reviews'], inplace=True)

# Display the first few rows of the dataset
print(df.head())

# Build the stopword set once instead of rebuilding it for every review
stop_words = set(stopwords.words('english'))

# Preprocessing function
def preprocess_text(text):
    text = str(text).lower()  # Convert to string to handle potential NaN values
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]
    # Join tokens back into a string
    return ' '.join(filtered_tokens)

# Apply preprocessing to the 'Reviews' column
df['Cleaned_Reviews'] = df['Reviews'].apply(preprocess_text)

# Display the preprocessed text
print(df['Cleaned_Reviews'].head())

# Save the preprocessed data to a new CSV file
preprocessed_dataset_path = "Amazon_Unlocked_Mobile_Preprocessed.csv"
df.to_csv(preprocessed_dataset_path, index=False)


                                        Product Name Brand Name   Price  \
0  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
1  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
2  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
3  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
4  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   

   Rating                                            Reviews  Review Votes  
0       5  I feel so LUCKY to have found this used (phone...           1.0  
1       4  nice phone, nice up grade from my pantach revu...           0.0  
2       5                                       Very pleased           0.0  
3       4  It works good but it goes slow sometimes but i...           0.0  
4       4  Great phone to replace my lost phone. The only...           0.0  
0    feel lucky found used phone us used hard phone...
1    nice phone nice grade panta

In [None]:
#Text Feature Extraction using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the preprocessed text data
tfidf_features = tfidf_vectorizer.fit_transform(df['Cleaned_Reviews'])

# Display the shape of the TF-IDF matrix
print("Shape of TF-IDF matrix:", tfidf_features.shape)

# Optional: Display the vocabulary (i.e., feature names)
print("Vocabulary (Feature Names):", tfidf_vectorizer.get_feature_names_out()[:10])


Shape of TF-IDF matrix: (413770, 1000)
Vocabulary (Feature Names): ['10' '100' '12' '15' '16gb' '20' '200' '2g' '30' '32gb']


In [None]:
#Kmeans
from sklearn.cluster import KMeans

# Initialize K-means with the desired number of clusters
num_clusters = 5  # You can adjust the number of clusters as needed
kmeans = KMeans(n_clusters=num_clusters, random_state=42)

# Fit K-means to the TF-IDF features
kmeans.fit(tfidf_features)

# Get cluster labels
kmeans_labels = kmeans.labels_

# Display the cluster labels
print("K-means Cluster Labels:", kmeans_labels)



K-means Cluster Labels: [2 2 2 ... 2 2 2]


In [None]:
#Word2vec
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Word2Vec expects an iterable of token lists; passing raw strings
# makes it treat every single character as a "word"
tokenized_reviews = df['Cleaned_Reviews'].str.split().tolist()

# Train Word2Vec model on the tokenized reviews
word2vec_model = Word2Vec(sentences=tokenized_reviews, vector_size=100, window=5, min_count=1, workers=4)

# Get word embeddings
word_embeddings = word2vec_model.wv

# Get word vectors
word_vectors = word_embeddings.vectors

# Apply K-means clustering to word vectors
num_clusters = 5  # You can adjust the number of clusters as needed
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
word2vec_cluster_labels = kmeans.fit_predict(word_vectors)

# Output cluster labels for the first 20 words only
for word, cluster_label in zip(word_embeddings.index_to_key[:20], word2vec_cluster_labels[:20]):
    print(f"Word: {word}, Cluster Label: {cluster_label}")

In [None]:
#DBSCAN
from sklearn.cluster import DBSCAN

# Initialize DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)

# Fit DBSCAN to the TF-IDF features
dbscan.fit(tfidf_features)

# Get cluster labels (Note: DBSCAN assigns -1 to noise points)
dbscan_labels = dbscan.labels_

# Display the cluster labels
print("DBSCAN Cluster Labels:", dbscan_labels[:10])
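
The effect of `eps` and `min_samples` is easier to see on a small example. The sketch below is an illustration on a made-up mini-corpus, with `eps=0.7` and `min_samples=2` picked by hand for this toy data only. It also swaps in cosine distance for the default Euclidean metric, which is a common choice for TF-IDF vectors but an assumption here. DBSCAN compares each point against its neighbours, so on the full review matrix a random subsample would likely be needed to keep memory use manageable.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: two topical groups plus one unrelated review.
docs = [
    "battery life is great battery lasts long",
    "battery drains fast poor battery life",
    "battery charging is slow battery life short",
    "screen cracked easily screen broken",
    "screen display is bright screen sharp",
    "screen resolution display looks sharp screen",
    "courier delivered package two weeks late",
]

features = TfidfVectorizer().fit_transform(docs)

# Two documents are neighbours if their cosine distance is at most eps;
# a point needs min_samples neighbours (itself included) to be a core point.
dbscan = DBSCAN(eps=0.7, min_samples=2, metric="cosine")
toy_labels = dbscan.fit_predict(features)

# Label -1 marks noise: documents that belong to no dense region.
for doc, label in zip(docs, toy_labels):
    print(label, doc)
```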

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
In this exercise, I successively implemented the MultinomialNB, SVM, KNN, Decision Tree, Random Forest, and XGBoost algorithms for sentiment analysis.
I used 10-fold cross-validation to ensure robust model evaluation and reduce overfitting, and used metrics such as Accuracy, Recall, Precision, and F1-score
to comprehensively assess model performance. In exercise 2, I used the Amazon mobile phone reviews dataset for clustering. Finally, I
implemented K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT for text clustering tasks.
Challenges:
The Amazon mobile phone reviews dataset was too large, so I experienced a memory allocation error.
'''