<a href="https://colab.research.google.com/github/SatyaA-dev/SatyaAditya_INFO5731_Fall2024/blob/main/Masimukku_SatyaAditya_Exercise_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import numpy as np

# Load the training and test data
train_data = pd.read_csv('/content/stsa-train.txt', sep='\t', header=None, names=['label', 'text'], on_bad_lines='skip')
test_data = pd.read_csv('/content/stsa-test.txt', sep='\t', header=None, names=['label', 'text'], on_bad_lines='skip')

# Check for missing values in both train and test data
print("Missing values in train data:\n", train_data.isnull().sum())
print("Missing values in test data:\n", test_data.isnull().sum())

# Remove rows where labels are missing in the training data
train_data = train_data.dropna(subset=['label'])

# Remove rows where labels are missing in the test data (if necessary)
test_data = test_data.dropna(subset=['label'])

# Verify that missing values have been removed
print("Missing values after cleaning train data:\n", train_data.isnull().sum())
print("Missing values after cleaning test data:\n", test_data.isnull().sum())

# Ensure there are no missing values in the text and label columns
if train_data['label'].isnull().sum() > 0:
    raise ValueError("There are still missing labels in the training data after cleaning.")
if test_data['label'].isnull().sum() > 0:
    raise ValueError("There are still missing labels in the test data after cleaning.")

# Check class distribution
print("Class distribution in train data:\n", train_data['label'].value_counts())

# Remove classes with too few samples (less than 2)
min_class_size = 2
class_counts = train_data['label'].value_counts()
classes_to_remove = class_counts[class_counts < min_class_size].index
train_data = train_data[~train_data['label'].isin(classes_to_remove)]

# Verify the class distribution after removing small classes
print("Class distribution after filtering small classes:\n", train_data['label'].value_counts())

# Split the data into X (text) and y (labels)
X_train = train_data['text']
y_train = train_data['label']

# Split the data into training and validation sets (80% for training, 20% for validation)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

# Check if there are still any NaN values in the labels after splitting
print("Missing values in train labels:", y_train.isnull().sum())
print("Missing values in validation labels:", y_val.isnull().sum())

# Vectorizing the text data (convert text to numerical features)
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Initialize the classifier (Logistic Regression in this case)
clf = LogisticRegression(max_iter=1000)

# Perform 10-fold cross-validation
cv_scores = cross_val_score(clf, X_train_tfidf, y_train, cv=10, scoring='accuracy')

# Print cross-validation results
print(f'Cross-validation accuracy scores: {cv_scores}')
print(f'Mean cross-validation accuracy: {np.mean(cv_scores)}')

# Train the model on the full training data
clf.fit(X_train_tfidf, y_train)

# Predict on the validation data
y_pred = clf.predict(X_val_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Accuracy on validation data: {accuracy}')

# Test the model on the test data (ensure test data is cleaned if needed)
X_test_tfidf = vectorizer.transform(test_data['text'])

# If test data has missing labels, handle it similarly
if test_data['label'].isnull().sum() > 0:
    print("Warning: Missing labels in test data. Skipping evaluation on test set.")
else:
    y_test_pred = clf.predict(X_test_tfidf)
    # Assuming you have true labels for the test data, evaluate them
    print(f'Accuracy on test data: {accuracy_score(test_data["label"], y_test_pred)}')


Missing values in train data:
 label       0
text     6920
dtype: int64
Missing values in test data:
 label       0
text     1821
dtype: int64
Missing values after cleaning train data:
 label       0
text     6920
dtype: int64
Missing values after cleaning test data:
 label       0
text     1821
dtype: int64
Class distribution in train data:
 label
1 ` stock up on silver bullets for director neil marshall 's intense freight train of a film . '    2
1 i like it .                                                                                       2
0 every joke is repeated at least four times .                                                      2
1 what 's next ?                                                                                    2
1 see it .                                                                                          2
                                                                                                   ..
0 the elements were all there but lac

ValueError: The test_size = 4 should be greater or equal to the number of classes = 9

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [6]:
pip install pandas scikit-learn nltk gensim sentence-transformers matplotlib seaborn




In [7]:
import pandas as pd

df = pd.read_csv("/content/Amazon_Unlocked_Mobile.csv")

# Check the first few rows of the dataset
df.head()


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [8]:
import nltk
import pandas as pd
import re
from nltk.corpus import stopwords

# Download required NLTK resources if not done
nltk.download('stopwords')
nltk.download('punkt')

# Define the preprocessing function
def preprocess_text(text):
    if not isinstance(text, str):  # Check if the text is a string
        return ''  # If it's not a string, return an empty string

    # Lowercase the text
    text = text.lower()

    # Remove non-alphabetical characters (keeping spaces and letters)
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Return the cleaned text
    return ' '.join(tokens)

# Apply preprocessing to the review text column
df['cleaned_text'] = df['Reviews'].apply(preprocess_text)

# Verify the cleaned text
print(df['cleaned_text'].head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0    feel lucky found used phone us used hard phone...
1    nice phone nice grade pantach revue clean set ...
2                                              pleased
3       works good goes slow sometimes good phone love
4    great phone replace lost phone thing volume bu...
Name: cleaned_text, dtype: object


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the cleaned text
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_text'])


In [10]:
from sklearn.cluster import KMeans

# Apply KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X_tfidf)

# Add KMeans cluster labels to the dataframe
df['kmeans_cluster'] = kmeans_labels


In [None]:
from sklearn.cluster import DBSCAN

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_tfidf)

# Add DBSCAN cluster labels to the dataframe
df['dbscan_cluster'] = dbscan_labels


In [None]:
from sklearn.cluster import AgglomerativeClustering

# Apply Hierarchical Clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_labels = hierarchical.fit_predict(X_tfidf.toarray())

# Add Hierarchical cluster labels to the dataframe
df['hierarchical_cluster'] = hierarchical_labels


In [None]:
import gensim
from gensim.models import Word2Vec

# Tokenize the reviews for Word2Vec
tokenized_reviews = [text.split() for text in df['cleaned_text']]

# Train a Word2Vec model
model = Word2Vec(tokenized_reviews, vector_size=100, window=5, min_count=1, workers=4)

# Example of obtaining a vector for a word
vector = model.wv['phone']  # You can replace 'phone' with any word


In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained BERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the text into embeddings
embeddings = model.encode(df['cleaned_text'].tolist())

# You can now use the embeddings as input for clustering algorithms


In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Reduce dimensions for visualization
pca = PCA(n_components=2)
pca_result = pca.fit_transform(embeddings)

# Scatter plot of the clusters
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=df['kmeans_cluster'], cmap='viridis')
plt.title('K-means Clustering')
plt.show()


In [None]:
from sklearn.metrics import silhouette_score

# Evaluate clustering using silhouette score
score = silhouette_score(X_tfidf, kmeans_labels)
print("Silhouette Score: ", score)


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:





'''