## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files are placed in the same folder as this notebook. The reviews are loaded as a pandas DataFrame, and we print the beginning of the first few reviews.

In [37]:
import numpy as np
import pandas as pd

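# Each file has a single column and no header row
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
# Encode the labels as integers: 1 = positive, 0 = negative
Y = (labels == 'positive').astype(np.int_)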

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


**(a)** Split the reviews and labels into train, validation and test sets. The train and validation sets will be used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# 60% train, 20% validation, 20% test; stratify so every split keeps the class balance
X_train, X_temp, y_train, y_temp = train_test_split(reviews, Y, test_size=0.4, random_state=42, stratify=Y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Fit the vocabulary on the training reviews only, then reuse it for validation and test
vectorizer = CountVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(X_train[0]).toarray()
X_val = vectorizer.transform(X_val[0]).toarray()
X_test = vectorizer.transform(X_test[0]).toarray()

print(f"Training data shape: {X_train.shape}")
print(f"Validation data shape: {X_val.shape}")
print(f"Test data shape: {X_test.shape}")

Training data shape: (15000, 10000)
Validation data shape: (5000, 10000)
Test data shape: (5000, 10000)


**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?
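
One way to think about the question: each word corresponds to one column index in the vocabulary (equivalently, a one-hot vector), and a review is the sum of the one-hot vectors of its words. The sketch below checks this on a made-up two-review corpus; `toy_vectorizer`, `one_hot` and the sentences are assumptions introduced only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["good movie", "bad movie"])

vocab_index = toy_vectorizer.vocabulary_  # maps each word to its column index
n_words = len(vocab_index)

def one_hot(word):
    """Return the one-hot vector of a single vocabulary word."""
    vec = np.zeros(n_words, dtype=int)
    vec[vocab_index[word]] = 1
    return vec

review = "good good movie"

# Summing the one-hot vectors of the words gives the Bag-of-Words vector
summed = sum(one_hot(word) for word in review.split())
direct = toy_vectorizer.transform([review]).toarray()[0]

print(summed)
print(direct)
```

Both vectors agree, and neither records the order in which the words appeared.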

In [39]:
print("Vocabulary size:", len(vectorizer.get_feature_names_out()))

vocab = vectorizer.get_feature_names_out()

review_vector = X_train[0]

nonzero_indices = review_vector.nonzero()[0]
print("Words and their counts in the first review:")

for idx in nonzero_indices:
    print(f"{vocab[idx]}: {review_vector[idx]}")

Vocabulary size: 10000
Words and their counts in the first review:
about: 1
action: 1
adventure: 1
adventures: 1
all: 2
am: 1
an: 1
and: 8
anticipation: 1
anyone: 1
as: 2
bad: 2
become: 1
been: 1
bond: 1
but: 1
by: 1
camp: 1
character: 1
characters: 1
check: 1
choices: 1
crew: 2
decided: 1
disjointed: 1
doc: 5
elements: 2
etc: 1
even: 1
familiar: 2
fan: 1
fanatic: 1
fans: 1
feel: 1
feeling: 1
film: 2
first: 1
for: 2
from: 1
good: 1
have: 1
here: 1
hero: 1
heroes: 1
his: 2
how: 1
in: 5
indiana: 1
inspiration: 1
into: 1
is: 4
it: 2
james: 1
jones: 1
just: 2
know: 1
long: 1
lot: 1
major: 1
many: 1
minutes: 1
more: 1
movie: 3
movies: 1
music: 1
my: 1
not: 2
number: 2
of: 9
one: 2
ones: 1
only: 1
other: 1
ought: 1
out: 1
overwhelming: 1
promise: 1
provided: 1
really: 1
respond: 1
resulting: 1
savage: 1
say: 1
see: 1
seeing: 1
should: 1
so: 1
somewhat: 1
spirit: 1
star: 2
superman: 1
that: 4
the: 11
their: 1
them: 2
then: 1
there: 2
they: 1
this: 3
those: 1
throw: 1
time: 1
to: 7
trek: 2
try

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 
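
Tuning can be organised as a simple loop: train one model per hyperparameter combination and keep the one with the best validation accuracy. The sketch below illustrates only that loop. To stay quick it swaps in scikit-learn's `MLPClassifier` and a small synthetic dataset from `make_classification`; the grids and all `_demo` names are assumptions, not values recommended for the IMDb data.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small synthetic stand-in for the real data (assumption, for speed)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Hypothetical grids: hidden layer width and L2 penalty strength
hidden_units_grid = [4, 16]
alpha_grid = [1e-4, 1e-2]

results = {}
for hidden_units, alpha in product(hidden_units_grid, alpha_grid):
    clf = MLPClassifier(
        hidden_layer_sizes=(hidden_units,),  # a single hidden layer
        alpha=alpha,
        max_iter=300,
        random_state=0,
    )
    clf.fit(X_tr, y_tr)
    # Score on the validation split only; the test set stays untouched
    results[(hidden_units, alpha)] = clf.score(X_va, y_va)

best_config = max(results, key=results.get)
print("Best (hidden_units, alpha):", best_config)
print("Validation accuracy:", results[best_config])
```

The same pattern carries over to a Keras model: build and fit the model inside the loop and compare the validation accuracies recorded for each configuration.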

In [40]:
import tensorflow as tf
from tensorflow.keras import layers

# Stop training once the validation accuracy has not improved for 10 consecutive epochs
callback = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=10)

model = tf.keras.models.Sequential([
    layers.Input(shape=(10000,)), 
    layers.Dense(units=16, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    layers.Dropout(0.2),
    layers.Dense(units=2, activation='softmax')
])

model.summary()


In [41]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.005), metrics=['accuracy'])

history = model.fit(X_train, y_train, 
                    epochs=100, 
                    validation_data=(X_val, y_val),
                    callbacks=[callback])


Epoch 1/100
469/469 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.7950 - loss: 0.5482 - val_accuracy: 0.8596 - val_loss: 0.5092
Epoch 2/100
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8748 - loss: 0.4800 - val_accuracy: 0.8596 - val_loss: 0.5319
Epoch 3/100
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8746 - loss: 0.5004 - val_accuracy: 0.8614 - val_loss: 0.5498
Epoch 4/100
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8723 - loss: 0.4973 - val_accuracy: 0.8640 - val_loss: 0.5375
Epoch 5/100
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8867 - loss: 0.4902 - val_accuracy: 0.8250 - val_loss: 0.6043
Epoch 6/100
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8724 - loss: 0.4937 - val_accuracy: 0.8590 - val_loss: 0.5605
Epoch 7/100
469/46

In [None]:
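# Alternative baseline with scikit-learn's MLPClassifier (one hidden layer of 100 units), left commented out
# from sklearn.neural_network import MLPClassifier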
# from sklearn.metrics import accuracy_score

# modelMLP = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', solver='adam', max_iter=10, random_state=42)
# modelMLP.fit(X_train, y_train.values.flatten())

# y_train_pred = modelMLP.predict(X_train)
# print("Training Accuracy:", accuracy_score(y_train, y_train_pred))

# y_val_pred = modelMLP.predict(X_val)
# print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))



Training Accuracy: 1.0
Validation Accuracy: 0.8682


**(d)** Test your sentiment-classifier on the test set.

In [51]:
# Test the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=2)

# Print the test results
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

157/157 - 0s - 2ms/step - accuracy: 0.8526 - loss: 0.5512
Test Loss: 0.551157534122467
Test Accuracy: 0.8525999784469604
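
Accuracy alone does not show which class the errors fall in. A confusion matrix computed from the arg-max of the softmax outputs gives that breakdown. The sketch below uses a handful of made-up probabilities and labels (`demo_probs`, `demo_true` are assumptions); on the real data, the output of `model.predict(X_test)` and `y_test` would take their place.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical softmax outputs: one row per review, columns = [negative, positive]
demo_probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.55, 0.45],
])
# Hypothetical true labels (0 = negative, 1 = positive)
demo_true = np.array([0, 1, 1, 1, 0])

# The predicted class is the column with the largest probability
demo_pred = demo_probs.argmax(axis=1)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(demo_true, demo_pred)
acc = accuracy_score(demo_true, demo_pred)

print(cm)
print("Accuracy:", acc)
```

In this toy case the single error is a positive review predicted as negative, which appears in the bottom-left entry of the matrix.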


**(e)** Use the classifier to classify a few sentences you write yourselves. 

In [49]:
custom_reviews = [
    "I loved this movie! It was amazing and thrilling.",
    "The film was terrible, I regret watching it.",
    "Pretty average, not too bad but not great either."
]

# Transform using the already-fitted vectorizer
X_custom = vectorizer.transform(custom_reviews).toarray()

# Predict with the trained model
predictions = model.predict(X_custom)

# Get predicted classes (0 or 1)
predicted_labels = predictions.argmax(axis=1)

# Map 0/1 to 'negative'/'positive'
label_map = {0: "negative", 1: "positive"}
classified_sentences = [label_map[label] for label in predicted_labels]

# Print the results
for sentence, label in zip(custom_reviews, classified_sentences):
    print(f"Sentence: '{sentence}' -> Sentiment: {label}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
Sentence: 'I loved this movie! It was amazing and thrilling.' -> Sentiment: positive
Sentence: 'The film was terrible, I regret watching it.' -> Sentiment: negative
Sentence: 'Pretty average, not too bad but not great either.' -> Sentiment: negative
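
When a prediction looks surprising, it can help to inspect which tokens actually reach the model. `CountVectorizer.build_analyzer()` returns the preprocessing and tokenizing function, and any token outside the fitted vocabulary is dropped. The sketch below uses a tiny made-up vocabulary (`demo_vectorizer` is an assumption); with the notebook's fitted `vectorizer` the same calls would apply.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vectorizer fitted on two made-up reviews, only for illustration
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(["the movie was not bad", "the movie was great"])

# The analyzer lowercases the text, strips punctuation and splits it into tokens
analyzer = demo_vectorizer.build_analyzer()
tokens = analyzer("Pretty average, not too bad but not great either.")

# Only tokens in the fitted vocabulary contribute to the Bag-of-Words vector
known = [t for t in tokens if t in demo_vectorizer.vocabulary_]
unknown = [t for t in tokens if t not in demo_vectorizer.vocabulary_]

print("Tokens: ", tokens)
print("Known:  ", known)
print("Ignored:", unknown)
```

Because word order is discarded, "not", "bad" and "great" are counted independently of each other, so a Bag-of-Words model cannot tell "not bad" apart from "bad" followed somewhere by "not".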
