# Take-home Assignment: Sentiment Analysis with BERT Transformers

- For this assignment, you will work with the file **hotel_food_movie_reviews.csv**, which contains 30 reviews drawn from three sources: **Boston Airbnb**, **Amazon Fine Food**, and **IMDB Movie Reviews**. Begin by carefully reviewing each entry and assigning your own evaluation in the **human_label** column, classifying each review as **Positive**, **Negative**, or **Neutral**. Once you have completed your labeling, upload the dataset to the HPC server.  

- Next, use the model **cardiffnlp/twitter-roberta-base-sentiment** to perform sentiment analysis on each review and save the model’s predictions in a new column named **sentiment**. Compare your human evaluation (**human_label**, passed as **y_true**) with the model’s predictions (**sentiment**, passed as **y_pred**) by running the `classification_evaluation` function provided in the next cell. This will allow you to assess areas of agreement and disagreement between human and machine labels.  

- In addition, conduct an analysis of model performance across the three review sources (**Airbnb**, **Amazon Fine Food**, and **IMDB Movie Reviews**). Discuss whether the model performs differently across these categories and provide possible explanations for your findings.  
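The prediction and per-source steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the required solution: it assumes the model returns the raw labels `LABEL_0`, `LABEL_1`, and `LABEL_2` for negative, neutral, and positive (check this against the model card and your own output), and that the CSV has columns named `review` and `source`, which are hypothetical names. The helpers `predict_sentiment` and `accuracy_by_source` are illustrative and not part of any library.

```python
from collections import defaultdict

# Assumed mapping from the model's raw labels to the assignment's class names.
LABEL_MAP = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}


def predict_sentiment(texts, classifier):
    """Run a text-classification pipeline and return assignment class names."""
    # Truncate long reviews so they fit within a 512-token input limit.
    raw = classifier(list(texts), truncation=True, max_length=512)
    return [LABEL_MAP[item["label"]] for item in raw]


def accuracy_by_source(sources, y_true, y_pred):
    """Return {source: fraction of reviews where human and model labels agree}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for source, truth, pred in zip(sources, y_true, y_pred):
        totals[source] += 1
        hits[source] += int(truth == pred)
    return {source: hits[source] / totals[source] for source in totals}


# Hypothetical usage in the notebook (column names are assumptions):
# clf = pipeline("sentiment-analysis",
#                model="cardiffnlp/twitter-roberta-base-sentiment")
# df["sentiment"] = predict_sentiment(df["review"], clf)
# classification_evaluation(df["human_label"], df["sentiment"])
# print(accuracy_by_source(df["source"], df["human_label"], df["sentiment"]))
```

With only about ten reviews per source, per-source accuracy moves in steps of roughly ten percentage points, so small differences between sources should be interpreted cautiously.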

**Optional extension:** Apply a different BERT-based sentiment model (ensuring that it outputs the same three labels: **Positive**, **Negative**, and **Neutral**) to analyze the dataset. Compare its performance with **twitter-roberta-base-sentiment** and discuss whether the alternative model improves prediction accuracy.  
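Alternative models often name their labels differently (for example, lowercase strings), so their predictions may need normalizing before they can be passed to `classification_evaluation` or compared with the first model. The sketch below shows one possible approach; `normalize_label` and `accuracy` are hypothetical helpers, and the alias table is an assumption to adapt to whatever the chosen model actually emits.

```python
CLASSES = ("Positive", "Neutral", "Negative")

# Assumed aliases; extend after inspecting the alternative model's real output.
ALIASES = {"pos": "Positive", "neu": "Neutral", "neg": "Negative"}


def normalize_label(raw):
    """Map a raw model label onto one of the three assignment classes."""
    key = raw.strip().lower()
    for name in CLASSES:
        if key == name.lower():
            return name
    if key in ALIASES:
        return ALIASES[key]
    # Fail loudly rather than silently dropping an unknown label.
    raise ValueError(f"Unrecognized label: {raw!r}")


def accuracy(y_true, y_pred):
    """Fraction of reviews where the prediction matches the human label."""
    pairs = list(zip(y_true, y_pred))
    return sum(truth == pred for truth, pred in pairs) / len(pairs)
```

Computing `accuracy` once per model on the same `human_label` column gives a simple side-by-side comparison to accompany the confusion matrices.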

**Finally, put your discussion of the results at the end of this notebook as a Markdown cell and format it clearly and professionally.**

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt  # required by plt.title() and plt.show() below
from transformers import pipeline
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

# create a function to generate confusion matrix and classification report for sentiment analysis

def classification_evaluation(y_true, y_pred):
    # Fixed class order used everywhere
    classes = ["Positive", "Neutral", "Negative"]

    # generate Confusion matrix (fixed label order)
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
    disp.plot(cmap="Blues", values_format="d")
    plt.title("Confusion Matrix - Sentiment Analysis")
    plt.show()

    # generate classification report with the SAME labels & names. Use three decimal places
    print(classification_report(y_true, y_pred, labels=classes, target_names=classes, digits=3))

