# **Project 3: Movie Review Sentiment Analysis**


Team Contribution:


*  Christine Zhou     netID: xizhou4   UIN: *****6213 Online MCS
   - Contributions: Part I, Build a Binary Classification Model


*  Syed Ahmed         netID: syeda2    UIN: *****5315 Online MCS
   - Contributions: Part II, Interpretability Analysis


*  Jessica Tomas      netID: jptomas2  UIN: *****0877 Online MCS
   - Contributions: Part II, Interpretability Analysis



# **1. Build a Binary Classification Model**

The first objective is to construct a binary classification model to predict the sentiment of a movie review.

The evaluation metric for this project is the Area Under the Curve (AUC) on the test data. Your goal is to achieve an AUC score of at least 0.986 across all five test data splits.
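
As a reminder of what the metric measures, here is a minimal sketch on toy values (the labels and scores below are hypothetical, not project data). AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, which the sketch checks against `roc_auc_score`.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (hypothetical values, not project data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC as computed by scikit-learn
auc = roc_auc_score(y_true, y_score)

# Same quantity computed by hand: share of (positive, negative) pairs
# where the positive example is ranked higher (ties count as half)
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairwise = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(auc, pairwise)  # both are 0.75: three of the four pairs are ordered correctly
```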



In [33]:
import requests
from io import BytesIO
from sklearn.linear_model import LinearRegression

In [34]:
import time

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

system_specs = "Macbook Pro, 3.49 GHz, 24GB memory"
execution_times = []

# Load the data in jupyter notebook
def load_data(split_num):
    train_df = pd.read_csv(f'split_{split_num}/train.csv')
    test_df = pd.read_csv(f'split_{split_num}/test.csv')
    test_labels = pd.read_csv(f'split_{split_num}/test_y.csv')

    return train_df, test_df, test_labels

# Data preprocessing
def preprocess_data(data):
    data['review'] = data['review'].str.replace('<.*?>', ' ', regex=True)
    return data

# Train a logistic regression model on the embeddings (C values tried: 8.0, 10.0, 12.0)
def train_model(X_train, y_train, C_value=10.0, solver='liblinear', use_embeddings=True):
    """
    Train a logistic regression model using provided training data.

    Parameters:
    - X_train: Training feature data (NumPy array or DataFrame).
    - y_train: Training labels.
    - C_value: Inverse of regularization strength for Logistic Regression (smaller values regularize more).
    - solver: Solver to use in Logistic Regression.
    - use_embeddings: Unused; X_train must already contain only the embedding columns.

    Returns:
    - model: Trained Logistic Regression model.
    """
    # Assume X_train is already a NumPy array; no further slicing required
    model = LogisticRegression(random_state=42, solver=solver, max_iter=2000, C=C_value)
    model.fit(X_train, y_train)
    return model


In [35]:
# Evaluate the model using AUC
def evaluate_model(model, X_test, y_test, use_embeddings=True):
    """
    Evaluate the model using the AUC metric on the test dataset.

    Parameters:
    - model: The trained model.
    - X_test: The test feature data (DataFrame or NumPy array).
    - y_test: The actual labels for the test data.
    - use_embeddings: Whether to use the last 1536 embedding columns.

    Returns:
    - AUC score: The Area Under the Curve score for the test data.
    """
    if use_embeddings and isinstance(X_test, pd.DataFrame):
        # Use only the last 1536 columns if embeddings are included
        X_test = X_test.iloc[:, -1536:].values
    predictions = model.predict_proba(X_test)[:, 1]  # Probability for the positive class
    return roc_auc_score(y_test, predictions)

In [36]:
if __name__ == '__main__':

    # Hyperparameters tried:
    #   C_value: 8.0, 10.0, 12.0
    #   solver: liblinear, saga, lbfgs
    #   test_size: 0.1, 0.2, 0.3

    auc_scores = []
    for split_num in range(1, 6):
        start_time = time.time()

        # Load and preprocess data
        train_data, test_data , test_labels = load_data(split_num)
        train_data = preprocess_data(train_data)
        test_data = preprocess_data(test_data)

        # Prepare training and test data
        X_train = train_data.iloc[:, -1536:].values  # Converts DataFrame to NumPy array
        y_train = train_data['sentiment']
        X_test = test_data.iloc[:, -1536:]  # Keep X_test as a DataFrame
        y_test = test_labels['sentiment']

        # Train and evaluate model
        model = train_model(X_train, y_train, C_value=10.0, solver='liblinear')
        auc = evaluate_model(model, X_test, y_test)  # Pass the DataFrame X_test
        auc_scores.append(auc)

        execution_time = time.time() - start_time
        execution_times.append(execution_time)

        print(f"Split {split_num} -  AUC Score: {auc},     Execution Time - {execution_time} seconds")

    # Calculate average AUC across all splits
    avg_auc = np.mean(auc_scores)
    print("Average Validation AUC across all splits:", avg_auc)
    print(f"Average Execution Time: {np.mean(execution_times):.2f} seconds on {system_specs}")

Split 1 -  AUC Score: 0.987113392676245,     Execution Time - 24.539942741394043 seconds
Split 2 -  AUC Score: 0.9868045409410647,     Execution Time - 23.985557079315186 seconds
Split 3 -  AUC Score: 0.9864294594933899,     Execution Time - 24.255937099456787 seconds
Split 4 -  AUC Score: 0.9869832812693,     Execution Time - 23.31607437133789 seconds
Split 5 -  AUC Score: 0.9862851395212046,     Execution Time - 25.736347913742065 seconds
Average Validation AUC across all splits: 0.986723162780241
Average Execution Time: 24.37 seconds on Macbook Pro, 3.49 GHz, 24GB memory


### Conclusion for Building a Binary Classification Model


The objective of building a binary classification model to predict movie review sentiment was met using light preprocessing, precomputed embeddings, and logistic regression. The model achieved Area Under the Curve (AUC) scores above the 0.986 benchmark on all five test data splits, ranging from 0.9863 (split 5) to 0.9871 (split 1), with an average of approximately 0.9867. Preprocessing was limited to stripping HTML tags from the review text; the model was trained directly on the 1536 OpenAI embedding columns in each split. We tried several values of `C` and several solvers, and report results for `C=10.0` with the `liblinear` solver. Loading, training, and evaluation took about 24 seconds per split on a MacBook Pro (3.49 GHz, 24GB memory). These results indicate that logistic regression on the embeddings separates positive from negative reviews reliably across splits.
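
The per-split AUC values printed above can be checked against the 0.986 target with a few lines. This is a small sketch that copies the reported numbers by hand; the variable names are ours and are not part of the notebook code.

```python
# AUC values copied from the printed output for splits 1 to 5
reported_auc = {
    1: 0.987113392676245,
    2: 0.9868045409410647,
    3: 0.9864294594933899,
    4: 0.9869832812693,
    5: 0.9862851395212046,
}
THRESHOLD = 0.986  # project benchmark

# Splits that fall short of the benchmark (expected to be empty)
below = [split for split, auc in reported_auc.items() if auc < THRESHOLD]

# Worst split and average AUC
worst_split = min(reported_auc, key=reported_auc.get)
mean_auc = sum(reported_auc.values()) / len(reported_auc)

print("Splits below threshold:", below)
print("Lowest AUC is on split", worst_split)
print(f"Mean AUC: {mean_auc:.4f}")
```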

# **2. Interpretability Analysis**

Using split 1 and the corresponding trained model, implement an interpretability approach to identify which parts of each review have an impact on the sentiment prediction. Apply your method to 5 randomly selected positive reviews and 5 randomly selected negative reviews from the split 1 test data.

Set a random seed before selecting these 10 reviews (the seed does not need to relate to students’ UINs).

Provide visualizations (such as highlighted text) that show the key parts of a review contributing to the sentiment prediction. Discuss the effectiveness and limitations of the interpretability approach you chose.
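
One possible way to produce highlighted text is to wrap the selected words in HTML `<mark>` tags. The sketch below is an illustration only, not the method used in the cells that follow (which print ANSI color codes); the function name `highlight_html` and the default color are assumptions.

```python
import html
import re

def highlight_html(text, words, color="#c8f7c5"):
    """Return `text` as HTML with whole-word matches of `words` wrapped in <mark>."""
    if not words:
        return html.escape(text)
    # Longest phrases first so that a phrase wins over its own first word
    alternatives = sorted((re.escape(w) for w in words), key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(alternatives) + r")\b")
    # With one capturing group, split() alternates: plain text, match, plain text, ...
    parts = pattern.split(text)
    out = []
    for i, part in enumerate(parts):
        escaped = html.escape(part)
        if i % 2 == 1:
            out.append(f'<mark style="background:{color}">{escaped}</mark>')
        else:
            out.append(escaped)
    return "".join(out)

sample = highlight_html("A superb, well paced film", ["superb", "well"])
print(sample)
```

In a notebook, the returned string can be rendered with `IPython.display.HTML`.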

In [37]:
import random
import re
import warnings
from io import BytesIO

import joblib
import pandas as pd
import requests
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

warnings.filterwarnings('ignore')

In [38]:
# load the model from Part 1
url = "https://github.com/syedmustafaahmed/PSL-project-3/raw/refs/heads/main/trained_model.pkl"

# Download the file
response = requests.get(url)
response.raise_for_status()  # Check for HTTP request errors

# Load the model directly from the response content
model_from_github = joblib.load(BytesIO(response.content))

In [39]:
# load the embeddings from Part 1
url = "https://github.com/syedmustafaahmed/PSL-project-3/raw/refs/heads/main/test_scaled.pkl"

# Download the file
response = requests.get(url)
response.raise_for_status()  # Check for HTTP request errors

# Load the scaled test data directly from the response content
test_scaled_from_github = joblib.load(BytesIO(response.content))

In [40]:
# load the vocab words from Part 1
url = "https://github.com/syedmustafaahmed/PSL-project-3/raw/refs/heads/main/vocab_words.pkl"

# Download the file
response = requests.get(url)
response.raise_for_status()  # Check for HTTP request errors

# Load the vocabulary words directly from the response content
vocab_words_from_github = joblib.load(BytesIO(response.content))

In [41]:
# load the dtm_test from Part 1
url = "https://github.com/syedmustafaahmed/PSL-project-3/raw/refs/heads/main/dtm_test.pkl"

# Download the file
response = requests.get(url)
response.raise_for_status()  # Check for HTTP request errors

# Load the test document-term matrix directly from the response content
dtm_test_from_github = joblib.load(BytesIO(response.content))

In [42]:
# Keep only the embedding columns of the split 1 test set
test_df = pd.read_csv('split_1/test.csv')
test_df = test_df.drop(columns=['id', 'review'])

In [43]:
# Map the Part 1 test features into the OpenAI embedding format so that the Part 1 model can be applied

In [44]:
%%time

mapping_model = LinearRegression()
mapping_model.fit(test_scaled_from_github, test_df)

CPU times: user 10min 38s, sys: 20.9 s, total: 10min 59s
Wall time: 21min 13s
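
The mapping step above is a multi-output linear regression: one set of features is regressed onto every embedding dimension at once. The sketch below shows the idea on small synthetic data. All sizes and arrays are made up for illustration, and, unlike the cell above, the sketch fits on one portion of the data and predicts on a held-out portion (an assumption of the sketch, not the notebook's procedure).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_docs, n_terms, emb_dim = 200, 20, 8  # toy sizes; the notebook maps to 1536 dimensions

# Synthetic stand-ins: term-based features and target embeddings
X_bow = rng.normal(size=(n_docs, n_terms))
W_true = rng.normal(size=(n_terms, emb_dim))
E = X_bow @ W_true + 0.01 * rng.normal(size=(n_docs, emb_dim))

# Fit one linear map from the feature space to the embedding space
mapper = LinearRegression()
mapper.fit(X_bow[:150], E[:150])

# Predicted embeddings have the dimensionality a downstream classifier expects
E_hat = mapper.predict(X_bow[150:])
r2 = mapper.score(X_bow[150:], E[150:])

print(E_hat.shape, mapper.coef_.shape)
print(f"Held-out R^2: {r2:.4f}")
```

Scoring the map on rows it was not fitted on gives a rough sense of how much information survives the projection.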


In [45]:
mapping_predicted = mapping_model.predict(test_scaled_from_github)
model_from_github_predictions = model_from_github.predict(mapping_predicted)

In [46]:
Y_test = pd.read_csv('split_1/test_y.csv')
Y_test = Y_test['sentiment']

In [None]:
test_accuracy = accuracy_score(Y_test, model_from_github_predictions)
print(test_accuracy)

0.89692


In [None]:
# Load the split 1 test reviews and labels to sample five positive and five negative examples
X_test_split_1 = pd.read_csv("./split_1/test.csv")
Y_test_split_1 = pd.read_csv("./split_1/test_y.csv")

In [None]:
one_indexes = Y_test_split_1[Y_test_split_1['sentiment'] == 1].index.tolist()
zero_indexes = Y_test_split_1[Y_test_split_1['sentiment'] == 0].index.tolist()

In [None]:
random.seed(1)

In [None]:
random_one_indexes = random.sample(one_indexes, 5)
random_zero_indexes = random.sample(zero_indexes, 5)

In [None]:
print(model_from_github_predictions[random_one_indexes])
print(model_from_github_predictions[random_zero_indexes])

[1 1 1 1 1]
[0 0 0 0 0]


In [None]:
positive_reviews = X_test_split_1.iloc[random_one_indexes]['review']
positive_reviews

4347     This is an account of events that have been co...
18830    I had the good fortune to be at Perris Island ...
2017     Las Vegas is one of the most brilliant shows o...
8452     Having Just \Welcomed Home\" my 23 YR old daug...
3803     i watched this series when it first came out i...
Name: review, dtype: object

In [None]:
negative_reviews = X_test_split_1.iloc[random_zero_indexes]['review']
negative_reviews

16077    Contains spoilers The movie plot can be summar...
24883    I hated this crap, every Friday as part of tgi...
14583    Trite and tiring, the one-liners almost made m...
15319    I admit to liking a lot of the so-called \frat...
21192    Now we know where they got the idea of Snakes ...
Name: review, dtype: object

In [None]:
embeddings_positive = dtm_test_from_github[random_one_indexes].toarray()
embeddings_negative = dtm_test_from_github[random_zero_indexes].toarray()

In [None]:
# loop over each review and find the words that contributed to sentiment

In [None]:
negative_words_in_reviews = []
positive_words_in_reviews = []

In [None]:
def words_with_count_one(dtm_rows):
    # For each document-term row, keep the vocabulary terms whose count is exactly 1
    return [
        [vocab_words_from_github[i] for i, count in enumerate(row) if count == 1]
        for row in dtm_rows
    ]

positive_words_in_reviews = words_with_count_one(embeddings_positive)
negative_words_in_reviews = words_with_count_one(embeddings_negative)

In [None]:
negative_reviews = negative_reviews.values
positive_reviews = positive_reviews.values

In [None]:
# now visualize the positive reviews (use green for text that has positive sentiment)

In [None]:
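def highlight_words(color, text, words_to_highlight):
    if not words_to_highlight:  # nothing to highlight
        print(text)
        return
    # Longer phrases first so that "look like" is matched before "look"
    escaped_words = sorted((re.escape(w) for w in words_to_highlight), key=len, reverse=True)
    pattern = r'\b(' + '|'.join(escaped_words) + r')\b'
    print(re.sub(pattern, color, text))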

In [None]:
for i in range(len(positive_reviews)):
    print(f'Positive Review {i+1}:')
    positive_review = positive_reviews[i]
    # ANSI bold bright green around each matched word
    highlight_words(r'\033[1;92m\1\033[0m', positive_review, positive_words_in_reviews[i])

    print()

Positive Review 1:
This is an account of events that have been covered in print several times, and I had read two books - 'A Voyage for Madmen' and 'The Strange Last Voyage of Donald Crowhurst' before seeing the film in Sheffield just before Christmas. I must **say**, it exceeded all expectations in its telling of the 1968 Sunday Times Golden Globe yacht race. These men set out **to do** **something** that had never been done before **with no** support vessels, **wooden** boats, no satellite phones, no GPS, and just their wits and skill to **get** them round the globe in one piece. Not to **mention** the months of solitude, the thundering southern ocean, little sleep, and boats that were **often** literally falling apart around them.<br /><br />This documentary is **excellently** put **together** in my opinion, tightly edited, **well** paced with **superb** narration. The archive footage and the intervi

In [None]:
# now visualize the negative reviews (use bright red for text that has negative sentiment)

In [None]:
for i in range(len(negative_reviews)):
    print(f'Negative Review {i+1}:')
    negative_review = negative_reviews[i]
    # ANSI bold bright red around each matched word
    highlight_words(r'\033[1;91m\1\033[0m', negative_review, negative_words_in_reviews[i])

    print()

Negative Review 1:
Contains spoilers The movie **plot** can be summarized in a few sentences: Three guys go hunting in the forest. Two of them along other people **get** shot in the head without explanation. The last guy can stand in the clear, shout and **do** **anything** without getting shot. He gets to walk through an old factory and has the evil people walk right into his scope without a struggle. The villains are conveniently dressed in black and **look like** villains.<br /><br />That is the **whole** **story**, not summarized but in detail. Everything is drawn out with a guy standing ringing a door bell. We wait with him. Long shot of guys being **bored** in the woods and sleeping. We can take a nap with them. The one drawn out shot of following a female jogger could have **been** **redeeming**, if we could **see** her butt **or** boobs bouncing.<br /><br />There **dialog** is less then T

### Conclusion

The interpretability approach that we took was using bag of words. We made use of the the model and BERT embeddings that we generated for Split 1 in Part 1. Then in Part 2 we used a linear regression model so that we could map the BERT embeddings to fit the same dimensionality as the OpenAI embeddings and the model we training in Part 1. We then chose 5 random positive/negative reviews from split 1, and then highlighted words in each review that showed up in the top 2000 words from the BERT embeddings. In general, the highlighted words seem to reflect positive and negative sentiments. Although there are some words highlighted in the positive reviews that seem more neutral or negative (if, both, man who, etc.) And there are some words highlighted in the negative reviews that also seem more netural or positive (2, paper, to be, etc.) There are limitations with this approach as it disregards the word order and context. Another limitation (discussed on CampusWire) is that due to collinearity among features, some words may have negative coefficients even if their marginal effects are positive. The bag of words approach also can't handle out of vocabulary words that are in the test set but don't appear in the training set.