
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and to evaluate their performance. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.
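As a rough illustration of the split and cross-validation described above, the sketch below uses a tiny made-up corpus (`texts` and `labels` are placeholders, not the Canvas data) and wraps the vectorizer and classifier in a scikit-learn pipeline. The pipeline is one possible design, assumed here so that the vocabulary is re-learned inside each fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical); replace with the real review texts and labels.
texts = ["a great and moving film", "a dull and lifeless mess"] * 25
labels = [1, 0] * 25

# 80% training, 20% validation, stratified so both classes appear in each part.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With the vectorizer inside the pipeline, it is refit on each fold's training part,
# so no vocabulary from the held-out fold leaks into training.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
print(f"Mean 10-fold CV accuracy: {cv_scores.mean():.3f}")

# Fit on the full training portion, then score the validation portion.
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

Any other scikit-learn classifier can be swapped in for `MultinomialNB()` without changing the rest of the sketch.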


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT
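Word2Vec and BERT produce vector representations rather than class labels, so they are usually paired with one of the classifiers above. One common approach, sketched here under the assumption that each review is represented by the mean of its word vectors, is shown with a made-up three-dimensional lookup table (`toy_vectors` is hypothetical and stands in for trained embeddings).

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for trained Word2Vec embeddings.
toy_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "film": np.array([0.1, 0.8, 0.1]),
    "dull": np.array([-0.8, 0.1, 0.2]),
}


def average_embedding(tokens, vectors, dim):
    """Return the mean of the known token vectors, or a zero vector if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "a" is not in the lookup table, so only "great" and "film" are averaged.
doc_vector = average_embedding("a great film".split(), toy_vectors, dim=3)
print(doc_vector)
```

The resulting fixed-length vectors can be stacked into a feature matrix and passed to a classifier in place of word counts.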

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
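To make the four measures concrete, here is a small worked example that computes them from true and predicted labels by counting true/false positives and negatives. The label lists are invented for illustration, and `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice, `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics` compute the same quantities.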


In [1]:
# Write your code here
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [8]:
import re
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
import pandas as pd

def load_sentiment_files(file1, file2):
    try:
        # Read file1 and file2
        with open(file1, 'r', encoding='utf-8') as f1, open(file2, 'r', encoding='utf-8') as f2:
            file1_lines = f1.readlines()
            file2_lines = f2.readlines()

        # Parse sentiment and review from file1
        file1_sentiments = []
        file1_reviews = []
        for line in file1_lines:
            sentiment, review = line.strip().split(' ', 1)
            file1_sentiments.append(int(sentiment))
            file1_reviews.append(review)

        # Parse sentiment and review from file2
        file2_sentiments = []
        file2_reviews = []
        for line in file2_lines:
            sentiment, review = line.strip().split(' ', 1)
            file2_sentiments.append(int(sentiment))
            file2_reviews.append(review)

        # Create DataFrames
        df1 = pd.DataFrame({'sentiment': file1_sentiments, 'review': file1_reviews})
        df2 = pd.DataFrame({'sentiment': file2_sentiments, 'review': file2_reviews})

        return df1, df2

    except Exception as e:
        print(f"Error loading sentiment files: {e}")
        return None, None

# Example usage:
file1_path =  '/content/drive/My Drive/Colab Notebooks/stsa-train.txt'
file2_path = '/content/drive/My Drive/Colab Notebooks/stsa-test.txt'

train_df, test_df = load_sentiment_files(file1_path, file2_path)

# Display the first few rows of each DataFrame
if train_df is not None and test_df is not None:
    print("Training Data:")
    print(train_df.head())

    print("\nTest Data:")
    print(test_df.head())
else:
    print("Error loading sentiment files.")

Training Data:
   sentiment                                             review
0          1  a stirring , funny and finally transporting re...
1          0  apparently reassembled from the cutting-room f...
2          0  they presume their audience wo n't sit still f...
3          1  this is a visually stunning rumination on love...
4          1  jonathan parker 's bartleby should have been t...

Test Data:
   sentiment                                             review
0          0     no movement , no yuks , not much of anything .
1          0  a gob of drivel so sickly sweet , even the eag...
2          0  gangs of new york is an unapologetic mess , wh...
3          0  we never really feel involved with the story ,...
4          1            this is one of polanski 's best films .


In [10]:
from sklearn.model_selection import train_test_split
# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_df['review'], train_df['sentiment'], test_size=0.2, random_state=42)

In [11]:
# Train and evaluate classifier
def train_and_evaluate_classifier(clf, name):
    print(f'Evaluating {name}...')
    # Use CountVectorizer to transform the text data into a matrix of word counts
    vectorizer = CountVectorizer(stop_words='english')
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)

    # Train the classifier using 10-fold cross-validation
    scores = cross_val_score(clf, X_train_vec, y_train, cv=10)
    print(f'Mean {name} cross-validation accuracy: {scores.mean()}')

    # Fit the classifier to the entire training data and make predictions on the validation set
    clf.fit(X_train_vec, y_train)
    y_val_pred = clf.predict(X_val_vec)

    # Evaluate the classifier on the validation set
    accuracy = accuracy_score(y_val, y_val_pred)
    precision = precision_score(y_val, y_val_pred)
    recall = recall_score(y_val, y_val_pred)
    f1 = f1_score(y_val, y_val_pred)
    print(f'{name} validation accuracy: {accuracy}')
    print(f'{name} validation precision: {precision}')
    print(f'{name} validation recall: {recall}')
    print(f'{name} validation F1 score: {f1}')
    print('-'*40)

In [12]:
# Train and evaluate the classifiers
nb = MultinomialNB()
train_and_evaluate_classifier(nb, 'MultinomialNB')

svm = SVC(kernel='linear')
train_and_evaluate_classifier(svm, 'SVM')

knn = KNeighborsClassifier()
train_and_evaluate_classifier(knn, 'KNN')

dt = DecisionTreeClassifier()
train_and_evaluate_classifier(dt, 'Decision Tree')

rf = RandomForestClassifier()
train_and_evaluate_classifier(rf, 'Random Forest')

xgb = XGBClassifier()
train_and_evaluate_classifier(xgb, 'XGBoost')

Evaluating MultinomialNB...
Mean MultinomialNB cross-validation accuracy: 0.7720343906881401
MultinomialNB validation accuracy: 0.7846820809248555
MultinomialNB validation precision: 0.7539779681762546
MultinomialNB validation recall: 0.8639551192145862
MultinomialNB validation F1 score: 0.8052287581699346
----------------------------------------
Evaluating SVM...
Mean SVM cross-validation accuracy: 0.7366396615768276
SVM validation accuracy: 0.7622832369942196
SVM validation precision: 0.7594594594594595
SVM validation recall: 0.788218793828892
SVM validation F1 score: 0.7735719201651756
----------------------------------------
Evaluating KNN...
Mean KNN cross-validation accuracy: 0.5408314347079599
KNN validation accuracy: 0.5614161849710982
KNN validation precision: 0.5629453681710214
KNN validation recall: 0.664796633941094
KNN validation F1 score: 0.6096463022508037
----------------------------------------
Evaluating Decision Tree...
Mean Decision Tree cross-validation accuracy: 0
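
The task statement also calls for a final evaluation on the test data. One possible way to do that is sketched below with a hypothetical helper, `evaluate_on_test`, which fits on the training texts and reports the four metrics on held-out test texts. The toy strings are placeholders, and the pipeline is an assumption rather than the approach used in the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate_on_test(clf, train_texts, train_labels, test_texts, test_labels):
    """Fit vectorizer + classifier on the training data and score on the test data."""
    model = make_pipeline(CountVectorizer(stop_words="english"), clf)
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    return {
        "accuracy": accuracy_score(test_labels, predictions),
        "precision": precision_score(test_labels, predictions, zero_division=0),
        "recall": recall_score(test_labels, predictions, zero_division=0),
        "f1": f1_score(test_labels, predictions, zero_division=0),
    }


# Toy placeholder data (hypothetical); not the real train/test files.
toy_train_texts = ["wonderful moving story", "boring lifeless plot",
                   "wonderful acting", "boring script"]
toy_train_labels = [1, 0, 1, 0]
toy_test_texts = ["wonderful film", "boring film"]
toy_test_labels = [1, 0]

test_metrics = evaluate_on_test(
    MultinomialNB(), toy_train_texts, toy_train_labels, toy_test_texts, toy_test_labels
)
for metric_name, value in test_metrics.items():
    print(f"Test {metric_name}: {value:.3f}")
```

With the DataFrames loaded earlier, the call would look like `evaluate_on_test(MultinomialNB(), train_df['review'], train_df['sentiment'], test_df['review'], test_df['sentiment'])`.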

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
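
As a starting point, the first three methods can be applied to TF-IDF vectors roughly as follows. This is a minimal sketch on six invented review snippets; the number of clusters and the DBSCAN `eps` and `min_samples` values are untuned assumptions that would need adjusting on real data.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented review snippets (hypothetical); replace with the real review column.
docs = [
    "battery life is great",
    "great battery and long battery life",
    "screen cracked after a week",
    "the screen cracked on day one",
    "fast shipping and good price",
    "good price and fast delivery",
]

# Represent each document as a TF-IDF vector.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# K-means: the number of clusters must be chosen up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: density-based, no cluster count needed; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(X)

# Hierarchical (agglomerative) clustering: Ward linkage needs a dense array.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X.toarray())

print("K-means:     ", kmeans_labels)
print("DBSCAN:      ", dbscan_labels)
print("Hierarchical:", hier_labels)
```

For the Word2Vec and BERT variants, the same clustering calls can be run on document embeddings instead of the TF-IDF matrix.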

In [None]:
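# Write your code here

# Each line of this file is "<label> <review text>", not CSV, so pd.read_csv would
# treat the first review as a header and split the text on commas.
# Reuse the loader defined for Question 1 and keep the training reviews for clustering.
reviews_df, _ = load_sentiment_files(file1_path, file2_path)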
reviews_df.head()

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:





'''