
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and to evaluate their performance. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.
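
The split-then-cross-validate procedure described above can be sketched without any library, which makes the mechanics explicit. The helper names `train_val_split` and `k_fold_indices` are hypothetical and exist only for illustration; in practice scikit-learn's `train_test_split` and `KFold` (both in `sklearn.model_selection`) cover the same ground.

```python
import random

def train_val_split(n_samples, val_fraction=0.2, seed=42):
    """Shuffle indices 0..n_samples-1 and split them into train and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

def k_fold_indices(indices, k=10):
    """Yield (fit_indices, holdout_indices) pairs; each index is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        holdout = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, holdout

# 6920 is the number of training rows reported in this notebook's outputs.
train_idx, val_idx = train_val_split(6920)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(val_idx), len(folds))
```

With 6920 rows, the 20% validation part holds 1384 rows and the remaining 5536 rows are rotated through ten folds. A classifier would be fitted on each `fit` part and scored on the matching `holdout` part, and the ten scores averaged.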


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

def evaluate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1
    }

def train_and_evaluate():
    # Load your data with explicit delimiter handling
    try:
        train_data = pd.read_csv(
            '/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-train.txt',
            sep="\t",  # Adjust based on your file's delimiter
            names=["text", "label"],  # Provide column names if the file has no header
            on_bad_lines='skip'  # Skip problematic lines
        )
        test_data = pd.read_csv(
            '/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-test.txt',
            sep="\t",  # Adjust based on your file's delimiter
            names=["text", "label"],  # Provide column names
            on_bad_lines='skip'
        )
    except Exception as e:
        print("Error loading files:", e)
        return

    # Verify data loaded correctly
    print("\nTraining Data Sample:")
    print(train_data.head())
    print("\nTest Data Sample:")
    print(test_data.head())

    # Initialize vectorizer
    vectorizer = TfidfVectorizer(max_features=5000)

    # Split training data into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        train_data['text'], train_data['label'], test_size=0.2, random_state=42
    )

    # Vectorize the text data
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)
    X_test_vec = vectorizer.transform(test_data['text'])

    # Initialize the classifier
    clf = MultinomialNB()
    clf.fit(X_train_vec, y_train)

    # Evaluate on validation set
    print("\nValidation Set Results:")
    val_predictions = clf.predict(X_val_vec)
    val_metrics = evaluate_metrics(y_val, val_predictions)
    for metric, value in val_metrics.items():
        print(f"{metric}: {value:.4f}")

    # Evaluate on test set
    print("\nTest Set Results:")
    test_predictions = clf.predict(X_test_vec)
    test_metrics = evaluate_metrics(test_data['label'], test_predictions)
    for metric, value in test_metrics.items():
        print(f"{metric}: {value:.4f}")

    # Print classification report
    print("\nClassification Report (Test Set):")
    print(classification_report(test_data['label'], test_predictions))

if __name__ == "__main__":
    train_and_evaluate()


ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
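
The `ParserError` above reports that the parser found three fields on a line where it expected two. Judging from the `load_data` helper used in the later cells (which reads the label from the first character and the text from the third character onward), each line appears to hold a single-digit label, one space, and then the review text. A minimal sketch of a parser for that assumed layout follows; `parse_labeled_lines` and the sample lines are hypothetical and are not taken from the dataset.

```python
import io

def parse_labeled_lines(lines):
    """Split each non-empty line into (label, text) at the first space."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        label, _, text = line.partition(" ")
        rows.append((int(label), text.strip()))
    return rows

# Hypothetical sample in the assumed "<label> <text>" layout.
# The second line contains a tab inside the text, which is kept as-is.
sample = io.StringIO("1 a charming , funny film\n0 too long\tand dull\n\n")
rows = parse_labeled_lines(sample)
print(rows)
```

Because the split happens only at the first space, tabs or extra delimiters inside the review text cannot change the number of fields.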


In [None]:
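from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# NOTE: load_data() is defined in the KNN cell below; run that cell first.

def train_and_evaluate_optimized():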
    print("Loading data...")
    train_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-train.txt')
    test_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-test.txt')

    print(f"Training data shape: {train_data.shape}")
    print(f"Test data shape: {test_data.shape}")

    # Vectorizer with reduced features
    vectorizer = TfidfVectorizer(max_features=1000)
    X_train_vec = vectorizer.fit_transform(train_data['text']).toarray()
    X_test_vec = vectorizer.transform(test_data['text']).toarray()
    y_train = train_data['label']
    y_test = test_data['label']

    # Split train into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_vec, y_train, test_size=0.2, random_state=42
    )

    print("Training final model with reduced features...")
    clf = SVC(kernel='linear', random_state=42)
    clf.fit(X_train, y_train)

    print("\nValidation Set Results:")
    val_predictions = clf.predict(X_val)
    print(classification_report(y_val, val_predictions))

    print("\nTest Set Results:")
    test_predictions = clf.predict(X_test_vec)
    print(classification_report(y_test, test_predictions))

if __name__ == "__main__":
    train_and_evaluate_optimized()


Loading data...
Training data shape: (6920, 2)
Test data shape: (1821, 2)
Training final model with reduced features...

Validation Set Results:
              precision    recall  f1-score   support

           0       0.77      0.73      0.75       671
           1       0.76      0.80      0.78       713

    accuracy                           0.76      1384
   macro avg       0.77      0.76      0.76      1384
weighted avg       0.76      0.76      0.76      1384


Test Set Results:
              precision    recall  f1-score   support

           0       0.75      0.72      0.73       912
           1       0.73      0.76      0.74       909

    accuracy                           0.74      1821
   macro avg       0.74      0.74      0.74      1821
weighted avg       0.74      0.74      0.74      1821



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

def load_data(file_path):
    """
    Load and preprocess the text file data.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    texts = []
    labels = []
    for line in lines:
        label = int(line[0])
        text = line[2:].strip()
        texts.append(text)
        labels.append(label)
    return pd.DataFrame({'text': texts, 'label': labels})

def train_and_evaluate_knn():
    print("Loading data...")
    train_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-train.txt')
    test_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-test.txt')

    print(f"Training data shape: {train_data.shape}")
    print(f"Test data shape: {test_data.shape}")

    # Vectorizer with reduced features
    print("Vectorizing text data...")
    vectorizer = TfidfVectorizer(max_features=1000)
    X_train_vec = vectorizer.fit_transform(train_data['text']).toarray()
    X_test_vec = vectorizer.transform(test_data['text']).toarray()
    y_train = train_data['label']
    y_test = test_data['label']

    # Split train into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_vec, y_train, test_size=0.2, random_state=42
    )

    # Initialize the KNN classifier
    knn = KNeighborsClassifier(n_neighbors=5)  # Set k to 5 (can be adjusted)

    print("Training KNN model...")
    knn.fit(X_train, y_train)

    # Evaluate on validation set
    print("\nValidation Set Results:")
    val_predictions = knn.predict(X_val)
    print(classification_report(y_val, val_predictions))

    # Evaluate on test set
    print("\nTest Set Results:")
    test_predictions = knn.predict(X_test_vec)
    print(classification_report(y_test, test_predictions))

    # Print confusion matrix
    print("\nConfusion Matrix (Test Set):")
    print(confusion_matrix(y_test, test_predictions))

if __name__ == "__main__":
    train_and_evaluate_knn()


Loading data...
Training data shape: (6920, 2)
Test data shape: (1821, 2)
Vectorizing text data...
Training KNN model...

Validation Set Results:
              precision    recall  f1-score   support

           0       0.50      0.98      0.66       671
           1       0.79      0.08      0.14       713

    accuracy                           0.51      1384
   macro avg       0.65      0.53      0.40      1384
weighted avg       0.65      0.51      0.39      1384


Test Set Results:
              precision    recall  f1-score   support

           0       0.51      0.96      0.67       912
           1       0.69      0.09      0.16       909

    accuracy                           0.53      1821
   macro avg       0.60      0.52      0.41      1821
weighted avg       0.60      0.53      0.41      1821


Confusion Matrix (Test Set):
[[876  36]
 [828  81]]
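
The figures in the report above can be recomputed from the confusion matrix, which is a useful check on how each metric is defined. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(confusion, positive=1):
    """Return accuracy, precision, recall and F1 for one class of a 2x2 confusion matrix.

    confusion[i][j] counts samples whose true label is i and predicted label is j.
    """
    negative = 1 - positive
    tp = confusion[positive][positive]
    fp = confusion[negative][positive]
    fn = confusion[positive][negative]
    tn = confusion[negative][negative]
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision, "recall": recall, "f1": f1}

# Test-set confusion matrix printed by the KNN cell above.
knn_confusion = [[876, 36], [828, 81]]
for cls in (0, 1):
    metrics = binary_metrics(knn_confusion, positive=cls)
    print(cls, {name: round(value, 2) for name, value in metrics.items()})
```

Rounded to two decimals, the values agree with the test-set report (for class 1: precision 0.69, recall 0.09, F1 0.16; overall accuracy 0.53). The matrix also shows why recall for class 1 is so low: 1704 of the 1821 test reviews were predicted as class 0.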


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import classification_report, confusion_matrix

def load_data(file_path):
    """
    Load and preprocess the text file data.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    texts = []
    labels = []
    for line in lines:
        label = int(line[0])
        text = line[2:].strip()
        texts.append(text)
        labels.append(label)
    return pd.DataFrame({'text': texts, 'label': labels})

def train_and_evaluate_decision_tree():
    print("Loading data...")
    train_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-train.txt')
    test_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-test.txt')

    print(f"Training data shape: {train_data.shape}")
    print(f"Test data shape: {test_data.shape}")

    # Vectorizer with reduced features
    print("Vectorizing text data...")
    vectorizer = TfidfVectorizer(max_features=1000)
    X_train_vec = vectorizer.fit_transform(train_data['text']).toarray()
    X_test_vec = vectorizer.transform(test_data['text']).toarray()
    y_train = train_data['label']
    y_test = test_data['label']

    # Split train into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_vec, y_train, test_size=0.2, random_state=42
    )

    # Initialize the Decision Tree Classifier
    dt = DecisionTreeClassifier(criterion="gini", max_depth=10, random_state=42)

    print("Training Decision Tree model...")
    dt.fit(X_train, y_train)

    # Evaluate on validation set
    print("\nValidation Set Results:")
    val_predictions = dt.predict(X_val)
    print(classification_report(y_val, val_predictions))

    # Evaluate on test set
    print("\nTest Set Results:")
    test_predictions = dt.predict(X_test_vec)
    print(classification_report(y_test, test_predictions))

    # Print confusion matrix
    print("\nConfusion Matrix (Test Set):")
    print(confusion_matrix(y_test, test_predictions))

    # Display the decision tree rules
    print("\nDecision Tree Rules:")
    tree_rules = export_text(dt, feature_names=vectorizer.get_feature_names_out(), max_depth=3)
    print(tree_rules)

if __name__ == "__main__":
    train_and_evaluate_decision_tree()


Loading data...
Training data shape: (6920, 2)
Test data shape: (1821, 2)
Vectorizing text data...
Training Decision Tree model...

Validation Set Results:
              precision    recall  f1-score   support

           0       0.66      0.28      0.40       671
           1       0.56      0.86      0.68       713

    accuracy                           0.58      1384
   macro avg       0.61      0.57      0.54      1384
weighted avg       0.61      0.58      0.54      1384


Test Set Results:
              precision    recall  f1-score   support

           0       0.66      0.31      0.42       912
           1       0.55      0.84      0.66       909

    accuracy                           0.57      1821
   macro avg       0.60      0.57      0.54      1821
weighted avg       0.60      0.57      0.54      1821


Confusion Matrix (Test Set):
[[279 633]
 [145 764]]

Decision Tree Rules:
|--- too <= 0.20
|   |--- and <= 0.08
|   |   |--- bad <= 0.09
|   |   |   |--- only <= 0.23
|  

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

def load_data(file_path):
    """
    Load and preprocess the text file data.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    texts = []
    labels = []
    for line in lines:
        label = int(line[0])
        text = line[2:].strip()
        texts.append(text)
        labels.append(label)
    return pd.DataFrame({'text': texts, 'label': labels})

def train_and_evaluate_random_forest():
    print("Loading data...")
    train_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-train.txt')
    test_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-test.txt')

    print(f"Training data shape: {train_data.shape}")
    print(f"Test data shape: {test_data.shape}")

    # Vectorizer with reduced features
    print("Vectorizing text data...")
    vectorizer = TfidfVectorizer(max_features=1000)
    X_train_vec = vectorizer.fit_transform(train_data['text']).toarray()
    X_test_vec = vectorizer.transform(test_data['text']).toarray()
    y_train = train_data['label']
    y_test = test_data['label']

    # Split train into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_vec, y_train, test_size=0.2, random_state=42
    )

    # Initialize the Random Forest Classifier
    rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

    print("Training Random Forest model...")
    rf.fit(X_train, y_train)

    # Evaluate on validation set
    print("\nValidation Set Results:")
    val_predictions = rf.predict(X_val)
    print(classification_report(y_val, val_predictions))

    # Evaluate on test set
    print("\nTest Set Results:")
    test_predictions = rf.predict(X_test_vec)
    print(classification_report(y_test, test_predictions))

    # Print confusion matrix
    print("\nConfusion Matrix (Test Set):")
    print(confusion_matrix(y_test, test_predictions))

if __name__ == "__main__":
    train_and_evaluate_random_forest()


Loading data...
Training data shape: (6920, 2)
Test data shape: (1821, 2)
Vectorizing text data...
Training Random Forest model...

Validation Set Results:
              precision    recall  f1-score   support

           0       0.76      0.38      0.50       671
           1       0.60      0.89      0.72       713

    accuracy                           0.64      1384
   macro avg       0.68      0.63      0.61      1384
weighted avg       0.68      0.64      0.61      1384


Test Set Results:
              precision    recall  f1-score   support

           0       0.75      0.41      0.53       912
           1       0.59      0.86      0.70       909

    accuracy                           0.64      1821
   macro avg       0.67      0.64      0.62      1821
weighted avg       0.67      0.64      0.62      1821


Confusion Matrix (Test Set):
[[371 541]
 [123 786]]



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Download necessary NLTK data for tokenization
nltk.download('punkt')

def load_data(file_path):
    """
    Load and preprocess the text file data.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    texts = []
    labels = []
    for line in lines:
        label = int(line[0])
        text = line[2:].strip()
        texts.append(text)
        labels.append(label)
    return pd.DataFrame({'text': texts, 'label': labels})

def preprocess_text(texts):
    """
    Lowercase each text and split it into tokens with NLTK's word_tokenize.
    Punctuation tokens are kept; no characters are removed.
    """
    return [word_tokenize(text.lower()) for text in texts]

def train_and_evaluate_word2vec_knn():
    print("Loading data...")
    train_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-train.txt')
    test_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-test.txt')

    print(f"Training data shape: {train_data.shape}")
    print(f"Test data shape: {test_data.shape}")

    # Preprocess the text data
    print("Preprocessing text data...")
    train_texts = preprocess_text(train_data['text'])
    test_texts = preprocess_text(test_data['text'])

    # Train Word2Vec model
    print("Training Word2Vec model...")
    word2vec_model = Word2Vec(sentences=train_texts, vector_size=100, window=5, min_count=1, workers=4)
    word2vec_model.save("word2vec.model")  # Save the trained Word2Vec model

    # Convert texts to Word2Vec feature vectors (average word embeddings for each document)
    def vectorize_text(text, model):
        vectors = [model.wv[word] for word in text if word in model.wv]
        if vectors:
            return np.mean(vectors, axis=0)
        else:
            return np.zeros(model.vector_size)

    X_train_vec = np.array([vectorize_text(text, word2vec_model) for text in train_texts])
    X_test_vec = np.array([vectorize_text(text, word2vec_model) for text in test_texts])

    y_train = train_data['label']
    y_test = test_data['label']

    # Split train into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_vec, y_train, test_size=0.2, random_state=42
    )

    # Initialize the KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=5)

    print("Training KNN model...")
    knn.fit(X_train, y_train)

    # Evaluate on validation set
    print("\nValidation Set Results:")
    val_predictions = knn.predict(X_val)
    print(classification_report(y_val, val_predictions))

    # Evaluate on test set
    print("\nTest Set Results:")
    test_predictions = knn.predict(X_test_vec)
    print(classification_report(y_test, test_predictions))

    # Print confusion matrix
    print("\nConfusion Matrix (Test Set):")
    print(confusion_matrix(y_test, test_predictions))

if __name__ == "__main__":
    train_and_evaluate_word2vec_knn()


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/funnysahithi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Loading data...
Training data shape: (6920, 2)
Test data shape: (1821, 2)
Preprocessing text data...
Training Word2Vec model...
Training KNN model...

Validation Set Results:
              precision    recall  f1-score   support

           0       0.55      0.48      0.51       671
           1       0.56      0.62      0.59       713

    accuracy                           0.55      1384
   macro avg       0.55      0.55      0.55      1384
weighted avg       0.55      0.55      0.55      1384


Test Set Results:
              precision    recall  f1-score   support

           0       0.55      0.47      0.51       912
           1       0.53      0.61      0.57       909

    accuracy                           0.54      1821
   macro avg       0.54      0.54      0.54      1821
weighted avg       0.54      0.54      0.54      1821


Confusion Matrix (Test Set):
[[432 480]
 [357 552]]


In [None]:
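from transformers import BertTokenizer, BertModel
import torch
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# NOTE: load_data() is reused from the earlier cells; run one of them first.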

def preprocess_text(texts, tokenizer, max_length=512):
    """
    Tokenize the texts using BERT tokenizer, return input_ids and attention_masks.
    """
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length, return_tensors='pt')
    return encodings['input_ids'], encodings['attention_mask']

def get_bert_embeddings(texts, model, tokenizer, batch_size=16):
    """
    Generate BERT embeddings for a list of texts by batching and averaging the embeddings.
    """
    input_ids, attention_mask = preprocess_text(texts, tokenizer)

    # Create DataLoader for batching
    dataset = TensorDataset(input_ids, attention_mask)
    dataloader = DataLoader(dataset, batch_size=batch_size)

    embeddings = []
    model.eval()  # Set model to evaluation mode to deactivate dropout layers
    with torch.no_grad():
        for batch in dataloader:
            input_ids_batch, attention_mask_batch = batch
            outputs = model(input_ids_batch, attention_mask=attention_mask_batch)
            hidden_states = outputs.last_hidden_state
            # Take the mean of token embeddings in the last hidden state to represent the sentence
            sentence_embedding = hidden_states.mean(dim=1).cpu().numpy()
            embeddings.extend(sentence_embedding)

    return np.array(embeddings)

def train_and_evaluate_bert_knn():
    print("Loading data...")
    train_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-train.txt')
    test_data = load_data('/Users/funnysahithi/Downloads/exercise09_datacollection/stsa-test.txt')

    print(f"Training data shape: {train_data.shape}")
    print(f"Test data shape: {test_data.shape}")

    # Preprocess the text data
    print("Preprocessing text data...")
    train_texts = train_data['text'].tolist()  # Convert to list of strings
    test_texts = test_data['text'].tolist()  # Convert to list of strings

    # Load pre-trained BERT model
    print("Loading BERT model...")
    model = BertModel.from_pretrained('bert-base-uncased')
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Generate embeddings using BERT
    print("Generating BERT embeddings...")
    X_train_vec = get_bert_embeddings(train_texts, model, tokenizer)
    X_test_vec = get_bert_embeddings(test_texts, model, tokenizer)

    y_train = train_data['label']
    y_test = test_data['label']

    # Split train into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_vec, y_train, test_size=0.2, random_state=42
    )

    # Initialize the KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=5)

    print("Training KNN model...")
    knn.fit(X_train, y_train)

    # Evaluate on validation set
    print("\nValidation Set Results:")
    val_predictions = knn.predict(X_val)
    print(classification_report(y_val, val_predictions))

    # Evaluate on test set
    print("\nTest Set Results:")
    test_predictions = knn.predict(X_test_vec)
    print(classification_report(y_test, test_predictions))

    # Print confusion matrix
    print("\nConfusion Matrix (Test Set):")
    print(confusion_matrix(y_test, test_predictions))

if __name__ == "__main__":
    train_and_evaluate_bert_knn()


Loading data...
Training data shape: (6920, 2)
Test data shape: (1821, 2)
Preprocessing text data...
Loading BERT model...
Generating BERT embeddings...
Training KNN model...

Validation Set Results:
              precision    recall  f1-score   support

           0       0.79      0.78      0.78       671
           1       0.79      0.81      0.80       713

    accuracy                           0.79      1384
   macro avg       0.79      0.79      0.79      1384
weighted avg       0.79      0.79      0.79      1384


Test Set Results:
              precision    recall  f1-score   support

           0       0.78      0.73      0.75       912
           1       0.74      0.79      0.77       909

    accuracy                           0.76      1821
   macro avg       0.76      0.76      0.76      1821
weighted avg       0.76      0.76      0.76      1821


Confusion Matrix (Test Set):
[[662 250]
 [187 722]]
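
One detail of `get_bert_embeddings` above: `hidden_states.mean(dim=1)` averages over every position in the padded sequence, so padding positions also contribute to the sentence vector. A common alternative is to weight each position by the attention mask. The sketch below shows the arithmetic on plain Python lists rather than tensors so that it runs without PyTorch; `masked_mean` and the toy values are hypothetical.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens, not padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy sequence: two real tokens followed by two padding positions.
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

# Unmasked mean, as in the cell above: padding vectors pull the average away.
plain = [sum(vec[d] for vec in tokens) / len(tokens) for d in range(2)]
print("unmasked mean:", plain)
print("masked mean:  ", masked_mean(tokens, mask))
```

With tensors, the same idea is usually written as `(hidden_states * attention_mask.unsqueeze(-1)).sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)`. Whether this changes the KNN scores reported above has not been measured here.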


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You may also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
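
Of the methods listed above, K-means is the simplest to write out by hand, and doing so shows what the library call does on each iteration. The following is a minimal sketch of Lloyd's algorithm on hypothetical 2-D points. It uses the first `k` points as initial centroids for determinism, which is a simplification; scikit-learn's `KMeans` uses k-means++ initialization by default. The function name `kmeans` and the sample data are illustrative only.

```python
def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = list(points[:k])  # simplification: first k points as initial centroids
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its closest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

For text, each point would be a TF-IDF row or an embedding vector instead of a 2-D coordinate; the two steps stay the same.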

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# Load dataset (replace the path with your local path to the CSV file)
file_path = '/Users/funnysahithi/Downloads/Amazon_Unlocked_Mobile.csv'  # Adjust the path
data = pd.read_csv(file_path)

# Display first few rows of the data
print(data.head())

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Text preprocessing function
def preprocess_text(text):
    # Check if text is not a string (e.g., if it's NaN or a number)
    if not isinstance(text, str):
        return ''  # Return empty string for non-string values

    # Remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(words)

# Apply preprocessing to the text column, ensuring non-string values are handled
data['cleaned_text'] = data['Reviews'].apply(preprocess_text)

# Show the cleaned data
print(data[['Reviews', 'cleaned_text']].head())
# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)  # Limit features to avoid memory issues
X = vectorizer.fit_transform(data['cleaned_text'])

# Apply K-means clustering
num_clusters = 5  # Choose the number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(X)

# Assign cluster labels to the data
data['cluster'] = kmeans.labels_

# Print the cluster assignments
print("Cluster assignments:")
print(data[['Reviews', 'cleaned_text', 'cluster']].head(10))

# Reduce dimensionality for visualization using PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X.toarray())

# Plot the clustered data
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap='rainbow', alpha=0.7)
plt.colorbar(scatter)
plt.title("K-means Clustering Visualization")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()

# Analyze the clusters
for i in range(num_clusters):
    print(f"Cluster {i}:")
    print(data[data['cluster'] == i]['cleaned_text'].head(10))
    print("\n")
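
The `num_clusters = 5` value above is chosen by hand. One common way to check such a choice is to compare silhouette scores across several values of k. The sketch below is illustrative only: it uses a tiny made-up corpus (`docs`) in place of `data['cleaned_text']`, and the candidate range of 2 to 5 clusters is an assumption, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus standing in for data['cleaned_text'] (hypothetical examples).
docs = [
    "battery life great lasts all day",
    "battery drains fast poor battery life",
    "battery charger stopped working",
    "screen cracked after one week",
    "screen bright clear display",
    "display scratched screen dim",
    "shipping fast arrived early",
    "shipping slow arrived late",
    "package arrived damaged shipping box",
]

X_toy = TfidfVectorizer().fit_transform(docs)

# Fit K-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# A higher silhouette score suggests tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette score:", best_k)
```

On a large review matrix, the `sample_size` argument of `silhouette_score` may be needed to keep the pairwise-distance computation manageable.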




In [None]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import nltk

# Download NLTK stopwords
nltk.download('stopwords')

# Load dataset (replace with your dataset path)
file_path = '/Users/funnysahithi/Downloads/Amazon_Unlocked_Mobile.csv'  # Adjust the path
data = pd.read_csv(file_path)

# Display first few rows of the data
print("Original Data:")
print(data.head())

# Build the stopword set once so it is not reloaded for every review
STOP_WORDS = set(stopwords.words('english'))

# Text preprocessing function
def preprocess_text(text):
    if not isinstance(text, str):  # Handle non-string values
        return ''
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = text.lower()  # Convert to lowercase
    words = [word for word in text.split() if word not in STOP_WORDS]  # Remove stopwords
    return ' '.join(words)

# Apply preprocessing
data['cleaned_text'] = data['Reviews'].fillna("").apply(preprocess_text)

# Display cleaned data
print("\nCleaned Data:")
print(data[['Reviews', 'cleaned_text']].head())

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer(max_features=500)  # Reduce to 500 features for better performance
X = vectorizer.fit_transform(data['cleaned_text'])

# Reduce dimensionality using PCA
pca = PCA(n_components=50, random_state=42)  # Reduce to 50 dimensions for DBSCAN efficiency
X_reduced = pca.fit_transform(X.toarray())

# Apply DBSCAN clustering
eps_value = 0.3  # Distance threshold
min_samples_value = 10  # Minimum points per cluster
dbscan = DBSCAN(eps=eps_value, min_samples=min_samples_value, metric='euclidean')
dbscan_labels = dbscan.fit_predict(X_reduced)

# Assign cluster labels
data['dbscan_cluster'] = dbscan_labels

# Visualize DBSCAN Clustering (2D PCA)
pca_2d = PCA(n_components=2, random_state=42)
X_pca_2d = pca_2d.fit_transform(X_reduced)

plt.figure(figsize=(10, 6))
scatter_dbscan = plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=dbscan_labels, cmap='viridis', alpha=0.7)
plt.colorbar(scatter_dbscan)
plt.title("DBSCAN Clustering Visualization")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()

# Print clustering results
print("\nDBSCAN Clustering Results:")
unique_clusters = sorted(set(dbscan_labels))  # Sort so the noise label (-1) is printed first
for cluster in unique_clusters:
    print(f"\nCluster {cluster} ({'Noise' if cluster == -1 else 'Cluster'}):")
    print(data[data['dbscan_cluster'] == cluster]['cleaned_text'].head(5))


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/funnysahithi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Original Data:
                                        Product Name Brand Name   Price  \
0  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
1  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
2  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
3  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
4  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   

   Rating                                            Reviews  Review Votes  
0       5  I feel so LUCKY to have found this used (phone...           1.0  
1       4  nice phone, nice up grade from my pantach revu...           0.0  
2       5                                       Very pleased           0.0  
3       4  It works good but it goes slow sometimes but i...           0.0  
4       4  Great phone to replace my lost phone. The only...           0.0  

Cleaned Data:
                                             Reviews  \
0
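
The `eps_value = 0.3` and `min_samples_value = 10` settings in the DBSCAN cell are fixed by hand. A common heuristic for picking `eps` is to sort each point's distance to its `min_samples`-th nearest neighbour and look for the bend in that curve. The sketch below assumes synthetic 2-D blobs (`points`) in place of `X_reduced`, and the 90th percentile used as a cutoff is an arbitrary assumption for illustration; in practice the value would be read off a plot of `k_distances`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Two synthetic blobs standing in for X_reduced (hypothetical data).
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.1, size=(50, 2)),
])

min_samples = 10

# Querying the fitted points returns each point as its own first neighbour,
# which matches DBSCAN's convention that min_samples counts the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = nn.kneighbors(points)

# Sorted distance to the min_samples-th neighbour: the "k-distance" curve.
k_distances = np.sort(distances[:, -1])

# Assumed cutoff: the 90th percentile of the curve (illustrative only).
eps_candidate = float(np.percentile(k_distances, 90))

labels = DBSCAN(eps=eps_candidate, min_samples=min_samples).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("eps candidate:", eps_candidate)
print("clusters found (excluding noise):", n_clusters)
```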

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# Load dataset (replace with your dataset path)
file_path = '/Users/funnysahithi/Downloads/Amazon_Unlocked_Mobile.csv'  # Adjust the path
data = pd.read_csv(file_path)

# Display first few rows of the data
print(data.head())

# Build the stopword set once so it is not reloaded for every review
STOP_WORDS = set(stopwords.words('english'))

# Text preprocessing function
def preprocess_text(text):
    if not isinstance(text, str):  # Handle non-string values
        return ''
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = text.lower()  # Convert to lowercase
    words = [word for word in text.split() if word not in STOP_WORDS]  # Remove stopwords
    return ' '.join(words)

# Apply preprocessing to the text column
data['cleaned_text'] = data['Reviews'].apply(preprocess_text)

# Show cleaned data
print(data[['Reviews', 'cleaned_text']].head())

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)  # Limit features to avoid memory issues
X = vectorizer.fit_transform(data['cleaned_text']).toarray()  # Convert sparse matrix to dense

# Perform hierarchical clustering
linkage_matrix = linkage(X, method='ward')  # 'ward' minimizes variance within clusters

# Plot the dendrogram
plt.figure(figsize=(12, 8))
dendrogram(linkage_matrix, truncate_mode='level', p=5, leaf_rotation=90, leaf_font_size=10)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()

# Determine cluster labels
num_clusters = 5  # Choose the number of desired clusters
data['hierarchical_cluster'] = fcluster(linkage_matrix, num_clusters, criterion='maxclust')

# Reduce dimensionality for visualization using PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

# Plot the clustered data
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data['hierarchical_cluster'], cmap='rainbow', alpha=0.7)
plt.colorbar(scatter)
plt.title("Hierarchical Clustering Visualization")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()

# Analyze the clusters
for i in range(1, num_clusters + 1):
    print(f"Cluster {i}:")
    print(data[data['hierarchical_cluster'] == i]['cleaned_text'].head(10))
    print("\n")


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/funnysahithi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                        Product Name Brand Name   Price  \
0  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
1  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
2  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
3  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
4  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   

   Rating                                            Reviews  Review Votes  
0       5  I feel so LUCKY to have found this used (phone...           1.0  
1       4  nice phone, nice up grade from my pantach revu...           0.0  
2       5                                       Very pleased           0.0  
3       4  It works good but it goes slow sometimes but i...           0.0  
4       4  Great phone to replace my lost phone. The only...           0.0  
                                             Reviews  \
0  I feel so LUCKY to have foun
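
`linkage` works from all pairwise distances between rows, so its memory use grows roughly with the square of the number of rows (n rows need n(n-1)/2 distances). If the full TF-IDF matrix is too large for that, one option is to cluster a random sample instead. The sketch below is an illustration under stated assumptions: `X_full` is random data standing in for the dense TF-IDF array, and `sample_size = 500` is a hypothetical budget, not a tuned value.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)

# Random dense matrix standing in for the TF-IDF array (hypothetical data).
X_full = rng.random((5000, 20))

# Assumed budget: cluster only a random subset of rows.
sample_size = 500
sample_idx = rng.choice(X_full.shape[0], size=sample_size, replace=False)
X_sample = X_full[sample_idx]

# Ward linkage on the sample, then cut the tree into at most 5 clusters.
linkage_matrix = linkage(X_sample, method="ward")
sample_labels = fcluster(linkage_matrix, t=5, criterion="maxclust")

print("Linkage matrix shape:", linkage_matrix.shape)
print("Cluster sizes:", np.bincount(sample_labels)[1:])
```

The labels apply only to the sampled rows; `sample_idx` records which rows of the original matrix they correspond to.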

In [None]:
import numpy as np  # Needed for np.vstack below
import pandas as pd
import re
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from transformers import BertTokenizer, BertModel
import torch
from nltk.corpus import stopwords
import nltk

# Download NLTK stopwords
nltk.download('stopwords')

# Load dataset (replace with your dataset path)
file_path = '/Users/funnysahithi/Downloads/Amazon_Unlocked_Mobile.csv'  # Adjust the path
data = pd.read_csv(file_path)

# Display first few rows of the data
print(data.head())

# Build the stopword set once so it is not reloaded for every review
STOP_WORDS = set(stopwords.words('english'))

# Text preprocessing function
def preprocess_text(text):
    if not isinstance(text, str):  # Handle non-string values
        return ''
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = text.lower()  # Convert to lowercase
    words = [word for word in text.split() if word not in STOP_WORDS]  # Remove stopwords
    return ' '.join(words)

# Apply preprocessing to the text column
data['cleaned_text'] = data['Reviews'].fillna("").apply(preprocess_text)

# Show cleaned data
print("\nCleaned Data:")
print(data[['Reviews', 'cleaned_text']].head())

# Load pre-trained BERT model and tokenizer
bert_model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
model = BertModel.from_pretrained(bert_model_name)

# Function to generate BERT embeddings
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the mean of the last hidden state as the sentence embedding
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Generate BERT embeddings for all cleaned text
data['bert_embedding'] = data['cleaned_text'].apply(get_bert_embedding)

# Convert embeddings to matrix form for clustering
X = np.vstack(data['bert_embedding'])

# ----------- K-means Clustering -----------
num_clusters = 5  # Choose the number of clusters
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)  # Explicit n_init keeps results stable across scikit-learn versions
kmeans.fit(X)

# Assign cluster labels to the data
data['cluster'] = kmeans.labels_

# Reduce dimensionality for visualization using PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

# Plot the clustered data
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap='rainbow', alpha=0.7)
plt.colorbar(scatter)
plt.title("K-means Clustering with BERT Embeddings")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()

# Analyze the clusters
for i in range(num_clusters):
    print(f"Cluster {i}:")
    print(data[data['cluster'] == i]['Reviews'].head(10))
    print("\n")
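
`get_bert_embedding` averages `last_hidden_state` over every token position. That is fine when one text is encoded at a time, but if several texts were tokenized together with `padding=True`, a plain mean would also average the padding positions. A minimal sketch of mask-aware mean pooling follows, written with NumPy arrays standing in for the model output and the attention mask; the helper `masked_mean_pool` and the toy values are hypothetical.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0.

    hidden_states: array of shape (batch, tokens, hidden_size)
    attention_mask: array of shape (batch, tokens) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for a sequence with no real tokens.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: the third position of the first sequence is padding.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [[2. 3.] [4. 4.]]
```

The same arithmetic can be expressed with PyTorch tensors by using the `attention_mask` returned by the tokenizer.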


**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

'''
Please write your answer here:





'''