
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
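
As a rough illustration of the split-then-cross-validate procedure described above, the sketch below builds an 80/20 hold-out split and ten folds using only the standard library. The helper names `holdout_split` and `k_fold_indices` are hypothetical, and the sample count 6920 is the number of training reviews reported by the loader output in this notebook. In practice, scikit-learn's `train_test_split` and `cross_val_score` cover the same ground.

```python
import random

def holdout_split(n_samples, val_fraction=0.2, seed=42):
    """Shuffle indices 0..n_samples-1 and split them into train/validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

def k_fold_indices(indices, k=10):
    """Yield (train_idx, held_out_idx) pairs; every index is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

# Assumed sample count: the 6920 training reviews loaded in this notebook.
train_idx, val_idx = holdout_split(6920)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(val_idx), len(folds))
```

Cross-validation runs only on the 80% training portion, so the validation and test sets stay untouched until the model is chosen.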


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
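
The four measures listed above can all be derived from the confusion counts of a binary classifier. The following is a minimal sketch under that assumption; `binary_metrics` is a hypothetical helper and the label lists are made-up toy data, not results from this exercise. scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (illustrative only): 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```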


In [27]:
!pip install spacy transformers xgboost scikit-learn
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import spacy
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from xgboost import XGBClassifier
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

# Load spaCy for tokenization
nlp = spacy.load('en_core_web_sm')

# Load BERT Tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to load data
def load_data(file):
    with open(file, 'r') as f:
        lines = f.readlines()

    # Initialize lists to store labels and reviews
    labels = []
    reviews = []

    print(f"Total lines read from {file}: {len(lines)}")  # Debugging step

    for line_num, line in enumerate(lines):
        # Skip empty lines or lines that are just whitespace
        if not line.strip():
            print(f"Skipping empty line {line_num + 1}")  # Debugging step
            continue

        # Each line has the form "<label> <review>", so split on the first space only
        try:
            parts = line.split(" ", 1)  # [label, review text]
            if len(parts) == 2:  # Ensure both a label and a review are present
                label, review = parts
                labels.append(int(label))  # Convert the label to integer (0 or 1)
                reviews.append(review.strip())  # Remove surrounding whitespace/newlines
            else:
                print(f"Skipping malformed line {line_num + 1}: {line.strip()}")  # Debugging step
        except ValueError:
            # int(label) failed, so the line does not start with a numeric label
            print(f"Skipping malformed line {line_num + 1}: {line.strip()}")  # Debugging step
            continue

    # Check how many rows of data we collected
    print(f"Loaded {len(labels)} reviews and labels.")  # Debugging step

    return pd.DataFrame({'review': reviews, 'label': labels})

# Load the datasets and print them to verify the content
train_file = 'stsa-train.txt'
test_file = 'stsa-test.txt'

train_data = load_data(train_file)
test_data = load_data(test_file)

# Debugging: print the first few rows of the loaded data
print("Train Data Loaded:")
print(train_data.head())
print(f"Number of rows in train data: {len(train_data)}")



# Split the training data into 80% training and 20% validation
X_train, X_val, y_train, y_val = train_test_split(train_data['review'], train_data['label'], test_size=0.2, random_state=42)

# Reset indices after splitting
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

# Make sure the lengths match
print(f"X_train length: {len(X_train)}, y_train length: {len(y_train)}")

# Function to tokenize reviews using spaCy
def tokenize_reviews_spacy(reviews):
    return [' '.join([token.text for token in nlp(review.lower())]) for review in reviews]

# Function to tokenize reviews using BERT Tokenizer
def tokenize_reviews_bert(reviews):
    return [bert_tokenizer(review, padding=True, truncation=True, max_length=512, return_tensors='pt') for review in reviews]

# Tokenize the reviews for spaCy
X_train_tokens_spacy = tokenize_reviews_spacy(X_train)
X_val_tokens_spacy = tokenize_reviews_spacy(X_val)
X_test_tokens_spacy = tokenize_reviews_spacy(test_data['review'])

# Convert the reviews into a bag-of-words representation
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train_tokens_spacy)
X_val_bow = vectorizer.transform(X_val_tokens_spacy)
X_test_bow = vectorizer.transform(X_test_tokens_spacy)

# Function to evaluate models with cross-validation
def evaluate_model(model, X_train, y_train):
    cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy')
    return cv_scores.mean()

# Initialize classifiers
models = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'XGBoost': XGBClassifier()
}

# Evaluate each model using 10-fold cross-validation
for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    accuracy = evaluate_model(model, X_train_bow, y_train)
    print(f"{model_name} Accuracy: {accuracy:.4f}")

# BERT model setup
class BertForTextClassification(BertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)

# Dataset wrapper, defined at module level so that both fine-tuning and evaluation can use it
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels.tolist()  # Convert labels Series to a list

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Fine-tuning BERT
def fine_tune_bert(X_train, y_train, X_val, y_val):
    train_encodings = bert_tokenizer(list(X_train), truncation=True, padding=True, max_length=512)
    val_encodings = bert_tokenizer(list(X_val), truncation=True, padding=True, max_length=512)

    train_dataset = SentimentDataset(train_encodings, y_train)
    val_dataset = SentimentDataset(val_encodings, y_val)

    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        evaluation_strategy="steps",
        save_steps=10
    )

    trainer = Trainer(
        model=BertForTextClassification.from_pretrained('bert-base-uncased'),
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )

    trainer.train()
    return trainer  # Return the trainer so the fine-tuned model can be evaluated

# Fine-tune BERT on training data
bert_trainer = fine_tune_bert(X_train, y_train, X_val, y_val)

# Evaluate the fine-tuned BERT model on the test set
def evaluate_bert(trainer, X_test, y_test):
    test_encodings = bert_tokenizer(list(X_test), truncation=True, padding=True, max_length=512)
    test_dataset = SentimentDataset(test_encodings, y_test)

    # Reuse the fine-tuned trainer; reloading 'bert-base-uncased' would discard the training
    predictions = trainer.predict(test_dataset)
    predicted_labels = np.argmax(predictions.predictions, axis=1)

    accuracy = accuracy_score(y_test, predicted_labels)
    precision = precision_score(y_test, predicted_labels)
    recall = recall_score(y_test, predicted_labels)
    f1 = f1_score(y_test, predicted_labels)

    print(f"BERT Accuracy: {accuracy:.4f}")
    print(f"BERT Precision: {precision:.4f}")
    print(f"BERT Recall: {recall:.4f}")
    print(f"BERT F1-Score: {f1:.4f}")

# Evaluate BERT model on test data
evaluate_bert(bert_trainer, test_data['review'], test_data['label'])


Total lines read from stsa-train.txt: 6920
Loaded 6920 reviews and labels.
Total lines read from stsa-test.txt: 1821
Loaded 1821 reviews and labels.
Train Data Loaded:
                                              review  label
0  a stirring , funny and finally transporting re...      1
1  apparently reassembled from the cutting-room f...      0
2  they presume their audience wo n't sit still f...      0
3  this is a visually stunning rumination on love...      1
4  jonathan parker 's bartleby should have been t...      1
Number of rows in train data: 6920
X_train length: 5536, y_train length: 5536
Evaluating MultinomialNB...
MultinomialNB Accuracy: 0.7805
Evaluating SVM...
SVM Accuracy: 0.7354
Evaluating KNN...
KNN Accuracy: 0.5715
Evaluating DecisionTree...
DecisionTree Accuracy: 0.6317
Evaluating RandomForest...
RandomForest Accuracy: 0.7144
Evaluating XGBoost...
XGBoost Accuracy: 0.7101


Some weights of BertForTextClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss
10,0.7534,0.793231
20,0.8185,0.777588


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use any other text data you want.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
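
To give a feel for what the first method in the list does, here is a minimal K-means sketch on bag-of-words count vectors, written with the standard library only. It is an illustration under stated assumptions, not a solution to this question: the four documents are invented toy reviews, the helpers `bag_of_words` and `k_means` are hypothetical, and the initial centroids are fixed to chosen documents for reproducibility. A real solution would more likely use scikit-learn's `KMeans` on TF-IDF features of the actual dataset.

```python
def bag_of_words(docs):
    """Turn whitespace-tokenized documents into word-count vectors."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for word in doc.split():
            vec[index[word]] += 1.0
        vectors.append(vec)
    return vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, init_indices, max_iter=100):
    """Lloyd's algorithm; init_indices picks the starting centroids (an assumption)."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(vec, centroids[c]))
            for vec in vectors
        ]
        if new_labels == labels:  # Converged: assignments no longer change
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [vec for vec, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented toy reviews: two about the product, two about delivery.
docs = [
    "great phone great battery",
    "great battery great screen",
    "slow shipping late delivery",
    "late delivery slow refund",
]
labels = k_means(bag_of_words(docs), init_indices=(0, 2))
print(labels)
```

Unlike K-means, DBSCAN and hierarchical clustering do not need the number of clusters fixed in advance in the same way, which is one of the differences the comparison paragraph below can discuss.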

In [None]:
# Write your code here


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here: It was a good learning experience and gave me a better understanding of the concepts.





'''