Homework 3: Sentiment Analysis
----

The following instructions apply to all notebooks and `.py` files you submit for this homework.

Due date: April 15th, 2024 11:59 PM (EST)

Total Points: (105)
- Task 0: 05 points
- Task 1: 10 points
- Task 2: 20 points
- Task 3: 25 points
- Task 4: 40 points (question in LSTM_EncDec.ipynb)

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data using different approaches to classification
- stress test your model (to some extent)


Allowed python modules:
- `numpy`, `matplotlib`, `keras`, `pytorch`, `nltk`, `pandas`, `sci-kit learn` (`sklearn`), `seaborn`, and all built-in python libraries (e.g. `math` and `string`)
- if you would like to use a library not on this list, please check with us on Campuswire first.
- all *necessary* imports have been included for you (all imports that we used in our solution)

Instructions:
- Complete outlined problems in this notebook.
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__.
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should.

Names & Sections
----
Names: __Nikita Vinod Mandal__

Task 0: Name, References, Reflection (5 points)
---

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
    - Sklearn's linear and logistic regression model
- https://www.geeksforgeeks.org/counters-in-python-set-2-accessing-counters/
    - mostcommon() function
- https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Architecture/feedforward.html
    - Feed Forward Neural network architecture
- https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html
    - Neural network implementation
    

AI Collaboration
---
Following the *Policy on the use of Generative AI* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for, including to improve language clarity in the written sections.

I used ChatGPT to debug a few cases. For some time I was getting the same vocabulary size for both custom vectorization and SKlearn. So I used ChatGPTthe possible causes for the same and the solutions that can be implemented.

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort? I tried my best. Have to learn more.
2. What was/were the most challenging part(s) of the assignment? Feedforward Neural Network
3. If you want feedback, what function(s) or problem(s) would you like feedback on and why? Feedforward Neural Network. I also want feedback on the RNN network since I got less accuracy.
4. Briefly reflect on how your partnership functioned--who did which tasks, how was the workload on each of you individually as compared to the previous homeworks, etc. N/A

Task 1: Provided Data Write-Up (10 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __provided__ movie review data set.

1. Where did you get the data from? The provided dataset(s) were sub-sampled from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
2. (1 pt) How was the data collected (where did the people acquiring the data get it from and how)? 

    - Collected from IMDB. Ratings > 7 are positive and ratings < 5 are negative.

3. (2 pts) How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets)

    - #reviews: 25,000 train set and 25,000 test set.
    - #tokens: 369127 in train set and 47155 in test set.

4. (1 pt) What is your data? (i.e. newswire, tweets, books, blogs, etc)

    - Movie reviews

5. (1 pt) Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)

    - Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher

6. (2 pts) What is the distribution of labels in the data (answer for both the train and the dev set, separately)?

    - Two classes - positive and negative

7. (2 pts) How large is the vocabulary (answer for both the train and the dev set, separately)?
    - 47698 in train set and 11713 in test set (vocab count)


8. (1 pt) How big is the overlap between the vocabulary for the train and dev set?
    - 7245 (vocab_train intersection vocab_test)

Task 2: Train a Logistic Regression Model (20 points)
----
1. Implement a custom function to read in a dataset, and return a list of tuples, using the Tf-Idf feature extraction technique.
2. Compare your implementation to `sklearn`'s TfidfVectorizer (imported below) by timing both on the provided datasets using the time module.
3. Using each set of features, and `sklearn`'s implementation of `LogisticRegression`, train a machine learning model to predict sentiment on the given dataset.

In [1]:
import nltk
#nltk.download('punkt')
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from collections import Counter
import time
from nltk.corpus import stopwords

#nltk.download('stopwords')
stopwords = stopwords.words('english')



In [2]:
# The following function reads a data-file and splits the contents by tabs.
# The first column is an ID, and thus is discarded. The second column consists of the actual reviews data.
# The third column is the true label for each data point.

# The function returns two objects - a list of all reviews, and a numpy array of labels.
# You will need to use this function later.

def get_lists(input_file):
    f=open(input_file, 'r')
    lines = [line.split('\t')[1:] for line in f.readlines()]
    X = [row[0] for row in lines]
    y=np.array([int(row[1]) for row in lines])
    return X, y

# Fill in the following function to take a corpus (list of reviews) as input,
# extract TfIdf values and return an array of features and the vocabulary.

# If the vocabulary argument is supplied, then the function should only convert the input corpus
# to feature vectors using the provided vocabulary and the max_features argument (if not None).
# In this case, the function should return feature vectors and the supplied vocabulary.

# If the max_features parameter is set to None, then all words in the corpus should be used.
# If the max_features parameter is specified (say, k),
# then only use the k most frequent words in the corpus to build your vocabulary.

# The function should return two things.

# The first object should be a numpy array of shape (n_documents, vocab_size),
# which contains the TF-IDF feature vectors for each document.

# The second object should be a dictionary of the words in the vocabulary,
# mapped to their corresponding index in alphabetical sorted order.


def get_tfidf_vectors(token_lists, max_features=None, vocabulary=None):
    # Tokenization and vocabulary creation
    if vocabulary is None:
        # Tokenize documents, remove stopwords, and count token occurrences
        all_tokens = [token for doc in token_lists for token in nltk.tokenize.word_tokenize(doc.lower()) if token not in stopwords]
        token_counts = Counter(all_tokens)
        
        # Select most common tokens up to max_features limit
        most_common_tokens = token_counts.most_common(max_features)
        
        # Create vocabulary from selected tokens
        vocabulary = {word: idx for idx, (word, _) in enumerate(most_common_tokens)}

    # Initialize TF-IDF matrix
    n_docs = len(token_lists)
    tfidf_matrix = np.zeros((n_docs, len(vocabulary)))
    
    # Compute document frequencies
    df = Counter()
    for doc in token_lists:
        tokens = set(nltk.tokenize.word_tokenize(doc.lower())) 
        filtered_tokens = [token for token in tokens if token in vocabulary]
        df.update(filtered_tokens)

    # Compute TF-IDF values
    for i, doc in enumerate(token_lists):
        tokens = nltk.tokenize.word_tokenize(doc.lower())
        filtered_tokens = [token for token in tokens if token in vocabulary]
        term_freq = Counter(filtered_tokens)
        doc_len = len(filtered_tokens)
        for word, count in term_freq.items():
            tf = count / doc_len  
            idf = np.log((n_docs + 1) / (df[word] + 1)) + 1 
            tfidf_matrix[i, vocabulary[word]] = tf * idf

    # Normalize TF-IDF matrix
    norms = np.sqrt((tfidf_matrix ** 2).sum(axis=1, keepdims=True))
    tfidf_matrix = tfidf_matrix / norms

    return tfidf_matrix, vocabulary

# Function to calculate TF-IDF vectors using scikit-learn
def get_tfidf_vectors_sk(token_lists, max_features=None, vocabulary=None):
    if vocabulary is not None:
        # Use provided vocabulary and max_features to create TF-IDF vectors
        tfidf_vectorizer = TfidfVectorizer(vocabulary=vocabulary, stop_words=stopwords, max_features=max_features)
    else:
        # Create TF-IDF vectors from token lists using scikit-learn's TfidfVectorizer
        tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, max_features=max_features)
    
    # Transform token lists into TF-IDF vectors
    tfidf_vectors = tfidf_vectorizer.fit_transform(token_lists)
    
    vocab = tfidf_vectorizer.vocabulary_

    return tfidf_vectors, vocab


We will now compare the runtime of our Tf-Idf implementation to the `sklearn` implementation. Call the respective functions with appropriate arguments in the code block below.

In [3]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
TEST_FILE = "movie_reviews_test.txt"

train_corpus, y_train = get_lists(TRAIN_FILE)

# First we will use our custom vectorizer to convert words to features, and time it.
start = time.time()
###### YOUR CODE HERE #######
X_train_c , vocb = get_tfidf_vectors(train_corpus, max_features=None, vocabulary=None)
end = time.time()
print("Time taken: ", end-start, " seconds")

# Next we will use sklearn's TfidfVectorizer to load in the data, and time it.

###### YOUR CODE HERE #######
start = time.time()
X_train_sk , vocb_sk = get_tfidf_vectors_sk(train_corpus, max_features=None, vocabulary=None)
end = time.time()

print("Time taken: ", end-start, " seconds")

Time taken:  9.877344131469727  seconds
Time taken:  0.28160858154296875  seconds


NOTE: Ideally, your vectorizer should be within one order of magnitude of the sklearn implementation.

In [4]:
# Displaying vocabulary lengths
print("Length of my vocab: ", len(vocb))
print("Length of sklearn's vocab: ", len(vocb_sk))

# Calculating sparsity of custom features
zero_elements = np.count_nonzero(X_train_c) 
total_elements = X_train_c.shape[0] * X_train_c.shape[1]  
sparsity = (1 - (zero_elements / total_elements)) * 100
print("Sparsity of my custom features: ", sparsity)

# Calculating sparsity of sklearn's features
zero_elements = np.count_nonzero(X_train_sk.toarray()) 
total_elements = X_train_sk.shape[0] * X_train_sk.shape[1]  
sparsity_sk = (1 - (zero_elements / total_elements)) * 100
print("Sparsity of sklearn's features: ", sparsity_sk)


Length of my vocab:  27035
Length of sklearn's vocab:  22460
Sparsity of my custom features:  99.60588588866285
Sparsity of sklearn's features:  99.562071460374


1. How large is the vocabulary generated by your vectorizer?  
    - 27035
2. How large is the vocabulary generated by the `sklearn` TfidfVectorizer?
    - 22460
3. Where might these differences be coming from?
    - These differences could stem from variations in parameter settings, preprocessing steps, or tokenization strategies employed in both approaches.
4. What steps did you take to ensure your vectorizer is optimized for best possible runtime?
    -  Matrix multiplication was utilized to calculate TF-IDF instead of using explicit for loops, which can significantly enhance computational efficiency. Additionally, stop words were removed during preprocessing to streamline the feature extraction process and improve vectorization performance.
5. How sparse are your custom features (average percentage of features per review that are zero)?
    - 99.6% sparse
6. How sparse are the TfidfVectorizer's features?
    - 99.56% sparse

NOTE: if you set the lowercase option to False, the sklearn vectorizer should have a vocabulary of around 50k words/tokens.

**Logistic Regression**

Now, we will compare how our custom features stack up against sklearn's TfidfVectorizer, by training two separate Logistic Regression classifiers - one on each set of feature vectors. Then load the test set, and convert it to two sets of feature vectors, one using our custom vectorizer (to do this, provide the vocabulary as a function argument), and one using sklearn's Tfidf (use the same object as before to transform the test inputs). For both classifiers, print the average accuracy on the test set and the F1 score.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Train Logistic Regression model using custom feature vectors
logisticReg_model_custom = LogisticRegression()
logisticReg_model_custom.fit(X_train_c, y_train)

# Load test data and extract features using custom vectorizer
test_data, test_labels = get_lists(TEST_FILE)
custom_features, vocab_c = get_tfidf_vectors(test_data, vocabulary=vocb)

# Predict using custom features and evaluate accuracy
predictions_custom = logisticReg_model_custom.predict(custom_features)
average_accuracy_custom = accuracy_score(test_labels, predictions_custom)
print(f"Average Accuracy using features extracted from custom vectorizer: {round(average_accuracy_custom * 100, 3)}%")
f1_custom = f1_score(test_labels, predictions_custom)
print(f"F1 Score using features extracted from custom vectorizer: {round(f1_custom * 100, 3)}%")

# Train Logistic Regression model using sklearn's feature vectors
logisticReg_model_sk = LogisticRegression()
logisticReg_model_sk.fit(X_train_sk, y_train)

# Extract features using sklearn's Tfidfvectorizer
sk_features, vocab_sk = get_tfidf_vectors_sk(test_data, vocabulary=vocb_sk)

# Predict using sklearn's features and evaluate accuracy
predictions_sk = logisticReg_model_sk.predict(sk_features)
average_accuracy_sk = accuracy_score(test_labels, predictions_sk)
print(f"Average Accuracy using features extracted from sklearn's Tfidfvectorizer: {round(average_accuracy_sk * 100, 3)}%")
f1_sk = f1_score(test_labels, predictions_sk)
print(f"F1 Score using features extracted from sklearn's Tfidfvectorizer: {round(f1_sk * 100, 3)}%")


Average Accuracy using features extracted from custom vectorizer: 80.0%
F1 Score using features extracted from custom vectorizer: 80.769%
Average Accuracy using features extracted from sklearn's Tfidfvectorizer: 81.5%
F1 Score using features extracted from sklearn's Tfidfvectorizer: 82.629%


NOTE: we're expecting to see a F1 score of around 80% using both your custom features and the sklearn features.

Finally, repeat the process (training and testing), but this time, set the max_features argument to 1000 for both our custom vectorizer and sklearn's Tfidfvectorizer. Report average accuracy and F1 scores for both classifiers.

In [6]:
###### YOUR CODE HERE #######

# First use sklearn's LogisticRegression classifier to do sentiment analysis using your custom feature vectors:

###### YOUR CODE HERE #######
X_train_c, vocab_c = get_tfidf_vectors(train_corpus, max_features=1000)
print(len(vocab_c))
logisticReg_model_custom = LogisticRegression()
logisticReg_model_custom.fit(X_train_c, y_train)

# Load the test data, extract features using your custom vectorizer, and test the performance of the LR classifier

###### YOUR CODE HERE #######
test_data, test_labels = get_lists(TEST_FILE)

custom_X_test, vocab_c_notuse = get_tfidf_vectors(test_data, max_features=1000)#vocabulary=vocab_c)
predictions = logisticReg_model_custom.predict(custom_X_test)


# Print the accuracy of your model on the test data

###### YOUR CODE HERE #######
average_accuracy = accuracy_score(test_labels, predictions)
print(f"Average Accuracy using features extracted from custom vectorizer with max features = 1000: {round(average_accuracy*100,3)}%")

f1 = f1_score(test_labels, predictions)
print(f"F1 Score using features extracted from custom vectorizer with max features = 1000: {round(f1*100,3)}%")

# Now repeat the above steps, but this time using features extracted by sklearn's Tfidfvectorizer

###### YOUR CODE HERE #######
X_train_sk, vocab_sk = get_tfidf_vectors_sk(train_corpus, max_features=1000)
logisticReg_model_sk = LogisticRegression()
logisticReg_model_sk.fit(X_train_sk, y_train)

sk_X_test, vocab_sk_notuse = get_tfidf_vectors_sk(test_data, max_features=1000)#vocabulary=vocab_sk)
predictions_sk = logisticReg_model_sk.predict(sk_X_test)

average_accuracy_sk = accuracy_score(test_labels, predictions_sk)
print(f"Average Accuracy using features extracted from sklearn's Tfidfvectorizer with max features = 1000: {round(average_accuracy_sk*100,3)}%")

f1_sk = f1_score(test_labels, predictions_sk)
print(f"F1 Score using features extracted from sklearn's Tfidfvectorizer with max features = 1000: {round(f1_sk*100,3)}%")

# Second part without max_features

###### YOUR CODE HERE #######
# First use sklearn's LogisticRegression classifier to do sentiment analysis using your custom feature vectors:

###### YOUR CODE HERE #######
X_train_c, vocab_c = get_tfidf_vectors(train_corpus)
logisticReg_model_custom = LogisticRegression()
logisticReg_model_custom.fit(X_train_c, y_train)

# Load the test data, extract features using your custom vectorizer, and test the performance of the LR classifier

###### YOUR CODE HERE #######
test_data, test_labels = get_lists(TEST_FILE)

custom_X_test, vocab_c_notuse = get_tfidf_vectors(test_data, vocabulary=vocab_c)
predictions = logisticReg_model_custom.predict(custom_X_test)


# Print the accuracy of your model on the test data

###### YOUR CODE HERE #######
average_accuracy = accuracy_score(test_labels, predictions)
print(f"Average Accuracy using features extracted from custom vectorizer: {round(average_accuracy*100,3)}%")

f1 = f1_score(test_labels, predictions)
print(f"F1 Score using features extracted from custom vectorizer: {round(f1*100,3)}%")

# Now repeat the above steps, but this time using features extracted by sklearn's Tfidfvectorizer

###### YOUR CODE HERE #######
X_train_sk, vocab_sk = get_tfidf_vectors_sk(train_corpus)
logisticReg_model_sk = LogisticRegression()
logisticReg_model_sk.fit(X_train_sk, y_train)

sk_X_test, vocab_sk_notuse = get_tfidf_vectors_sk(test_data, vocabulary= vocab_sk)
predictions_sk = logisticReg_model_sk.predict(sk_X_test)

average_accuracy_sk = accuracy_score(test_labels, predictions_sk)
print(f"Average Accuracy using features extracted from sklearn's Tfidfvectorizer: {round(average_accuracy_sk*100,3)}%")

f1_sk = f1_score(test_labels, predictions_sk)
print(f"F1 Score using features extracted from sklearn's Tfidfvectorizer: {round(f1_sk*100,3)}%")


1000
Average Accuracy using features extracted from custom vectorizer with max features = 1000: 52.5%
F1 Score using features extracted from custom vectorizer with max features = 1000: 56.621%
Average Accuracy using features extracted from sklearn's Tfidfvectorizer with max features = 1000: 57.0%
F1 Score using features extracted from sklearn's Tfidfvectorizer with max features = 1000: 59.048%
Average Accuracy using features extracted from custom vectorizer: 80.0%
F1 Score using features extracted from custom vectorizer: 80.769%
Average Accuracy using features extracted from sklearn's Tfidfvectorizer: 81.5%
F1 Score using features extracted from sklearn's Tfidfvectorizer: 82.629%


1. Is there a stark difference between the two vectorizers with 1000 features?
    - Yes, there's a notable distinction in the F1 scores. The custom vectorizer achieved an accuracy of 52.5% and an F1 score of 56.6%, while sklearn's TfidfVectorizer yielded an accuracy of 57% and an F1 score of 59%.
2. Use sklearn's documentation for the Tfidfvectorizer to figure out what may be causing the performance difference (or lack thereof).
    - Possible factors could include the loss of critical features due to restricting to the top 1000 tokens, or an abundance of high-frequency yet low-importance words that contribute minimally to the overall meaning.

NOTE: Irrespective of your conclusions, both implementations should be above 60% F1 Score.

Task 3: Train a Feedforward Neural Network Model (25 points)
----
1. Using PyTorch, implement a feedforward neural network to do sentiment analysis. This model should take sparse vectors of length 10000 as input (note this is 10000, not 1000), and have a single output with the sigmoid activation function. The number of hidden layers, and intermediate activation choices are up to you, but please make sure your model does not take more than ~1 minute to train.
2. Evaluate the model using PyTorch functions for average accuracy, area under the ROC curve and F1 scores (see [torcheval](https://pytorch.org/torcheval/stable/)) using both vectorizers, with max_features set to 10000 in both cases.

In [7]:
import torch
import torch.nn as nn

# if torch.backends.mps.is_available():
# 	device = torch.device("mps")
if torch.cuda.is_available():
	device = torch.device("cuda")
else:
	device = torch.device("cpu")

In [8]:

class FeedForward(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size):
        super(FeedForward, self).__init__()
        layers = []
        previous_size = input_size
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(previous_size, hidden_size))
            layers.append(nn.ReLU())  
            previous_size = hidden_size
        layers.append(nn.Linear(previous_size, output_size))
        layers.append(nn.Sigmoid())  
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def predict(self, x):
        with torch.no_grad():
            return self.forward(x)


In [9]:
# Load the data using custom and sklearn vectors

###### YOUR CODE HERE #######
from torch.utils.data import TensorDataset, DataLoader

def create_data_loader(X, y, batch_size=64):
    tensor_x = torch.Tensor(X)  
    tensor_y = torch.Tensor(y).unsqueeze(1)
    dataset = TensorDataset(tensor_x, tensor_y)  
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)

batch_size = 32 #tried different batch sizes like 64, 128, 256
train_loader_custom = create_data_loader(X_train_c, y_train, batch_size)
test_loader_custom = create_data_loader(custom_X_test, test_labels, batch_size)

train_loader_sklearn = create_data_loader(X_train_sk.toarray(), y_train, batch_size)
test_loader_sklearn = create_data_loader(sk_X_test.toarray(), test_labels, batch_size)


In [10]:
# Create a feedforward neural network model
# you may use any activation function on the hidden layers
# you should use binary cross-entropy as your loss function
# Adam is an appropriate optimizer for this task


###### YOUR CODE HERE #######
import torch.optim as optim


input_size_custom = X_train_c.shape[1]
input_size_sklearn = X_train_sk.shape[1]

model_custom = FeedForward(input_size_custom, [512, 256], 1).to(device)
model_sklearn = FeedForward(input_size_sklearn, [512, 256], 1).to(device)

criterion = nn.BCELoss()
optimizer_custom = torch.optim.Adam(model_custom.parameters(), lr=0.001) #tried changing learning rate
optimizer_sklearn = torch.optim.Adam(model_sklearn.parameters(), lr=0.001)


In [11]:
# Train the model for 50 epochs on both custom and sklearn vectors


###### YOUR CODE HERE #######

def train_model(model, train_loader, optimizer, epochs=50):
    model.train()
    for epoch in range(epochs):
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if (epoch+1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')


print("Training with Custom Vectors:")
train_model(model_custom, train_loader_custom, optimizer_custom, 50)

print("Training with Sklearn Vectors:")
train_model(model_sklearn, train_loader_sklearn, optimizer_sklearn, 50)


Training with Custom Vectors:
Epoch [10/50], Loss: 0.0000
Epoch [20/50], Loss: 0.0000
Epoch [30/50], Loss: 0.0000
Epoch [40/50], Loss: 0.0000
Epoch [50/50], Loss: 0.0000
Training with Sklearn Vectors:
Epoch [10/50], Loss: 0.0000
Epoch [20/50], Loss: 0.0000
Epoch [30/50], Loss: 0.0000
Epoch [40/50], Loss: 0.0000
Epoch [50/50], Loss: 0.0000


In [12]:
#!pip install torcheval

# Evaluate the model using custom and sklearn vectors

###### YOUR CODE HERE #######


from torcheval.metrics.functional import binary_f1_score
from torcheval.metrics import BinaryAUROC, BinaryAccuracy


def evaluate_model_metrics(model, test_loader):
    model.eval()
    y_true = []
    y_pred = []
    y_scores = []
    
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            
            predicted = outputs.round().squeeze()  
            y_true.extend(labels.squeeze().cpu().numpy())  
            y_pred.extend(predicted.cpu().numpy())
            y_scores.extend(outputs.squeeze().cpu().numpy())

    y_true = torch.tensor(y_true, dtype=torch.float32)
    y_pred = torch.tensor(y_pred, dtype=torch.float32)
    y_scores = torch.tensor(y_scores, dtype=torch.float32)
    f1 = binary_f1_score(y_pred, y_true)  


    auroc = BinaryAUROC()
    auroc.update(y_scores, y_true.int())
    auroc_score = auroc.compute()

    accuracy = BinaryAccuracy()
    accuracy.update(y_pred, y_true)
    accuracy_score = accuracy.compute()

    return f1.item(), auroc_score.item(), accuracy_score.item()

print("Evaluating Custom Model:")
f1_custom, auroc_custom, accuracy_custom = evaluate_model_metrics(model_custom, test_loader_custom)
print(f"Custom Model - F1 Score: {f1_custom:.4f}")
print(f"Custom Model - AUROC: {auroc_custom:.4f}")
print(f"Custom Model - Accuracy: {accuracy_custom:.4f}")

print("Evaluating Sklearn Model:")
f1_sklearn, auroc_sklearn, accuracy_sklearn = evaluate_model_metrics(model_sklearn, test_loader_sklearn)
print(f"Sklearn Model - F1 Score: {f1_sklearn:.4f}")
print(f"Sklearn Model - AUROC: {auroc_sklearn:.4f}")
print(f"Sklearn Model - Accuracy: {accuracy_sklearn:.4f}")


Evaluating Custom Model:
Custom Model - F1 Score: 0.8039
Custom Model - AUROC: 0.9078
Custom Model - Accuracy: 0.8000
Evaluating Sklearn Model:
Sklearn Model - F1 Score: 0.8038
Sklearn Model - AUROC: 0.8892
Sklearn Model - Accuracy: 0.7950


NOTE: As in the last task, we're expecting to see a F1 score of over 60% using both your custom features and the sklearn features.

5 points in this assignment are reserved for overall style (both for writing and for code submitted). All work submitted should be clear, easily interpretable, and checked for spelling, etc. (Re-read what you write and make sure it makes sense). Course staff are always happy to give grammatical help (but we won't pre-grade the content of your answers).