## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files (`reviews.txt` and `labels.txt`) are in the same folder as this notebook. The reviews are loaded into a pandas DataFrame, and we print the beginning of the first few reviews.

In [43]:
import numpy as np
import pandas as pd
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import ModelCheckpoint


reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


**(a)** Split the reviews and labels into train, validation and test sets. The train and validation sets are used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (set via the `max_features` parameter of `CountVectorizer`).

In [44]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Split the data into train, validation, and test sets
X_train, X_temp, Y_train, Y_temp = train_test_split(reviews[0], Y, test_size=0.4, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.5, random_state=42)

# Create a Bag-of-Words representation
vectorizer = CountVectorizer(max_features=10000)
X_train_bow = vectorizer.fit_transform(X_train)
X_val_bow = vectorizer.transform(X_val)
X_test_bow = vectorizer.transform(X_test)

print("Training set shape:", X_train_bow.shape)
print("Validation set shape:", X_val_bow.shape)
print("Test set shape:", X_test_bow.shape)

Training set shape: (15000, 10000)
Validation set shape: (5000, 10000)
Test set shape: (5000, 10000)
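
The two-step split above yields a 60/20/20 partition: `test_size=0.4` sets aside 40% of the data, and `test_size=0.5` then halves that remainder. A minimal sketch of the arithmetic, assuming 25,000 reviews (the sum of the shapes printed above); it splits plain indices instead of the real data, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: 25,000 samples, matching 15000 + 5000 + 5000 above
n_samples = 25000
indices = np.arange(n_samples)

# First split: 60% train, 40% held back
train_idx, temp_idx = train_test_split(indices, test_size=0.4, random_state=42)
# Second split: the held-back 40% is halved into validation and test
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

split_sizes = (len(train_idx), len(val_idx), len(test_idx))
print(split_sizes)

# The three index sets are disjoint and together cover every sample
all_idx = np.concatenate([train_idx, val_idx, test_idx])
covers_everything = np.array_equal(np.sort(all_idx), indices)
print(covers_everything)
```

If the class balance should be preserved in every subset, `train_test_split` also accepts a `stratify` argument.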


**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [45]:
# Refit the vectorizer and convert the sparse matrices to dense NumPy arrays
vectorizer = CountVectorizer(max_features=10000)
X_train_bow = vectorizer.fit_transform(X_train).toarray()
X_val_bow = vectorizer.transform(X_val).toarray()
X_test_bow = vectorizer.transform(X_test).toarray()

print("Vocabulary size:", len(vectorizer.vocabulary_))
print("Sample vocabulary:", list(vectorizer.vocabulary_.items())[:10])

# Display the Bag-of-Words vector (one row of the document-term matrix) for the first training review
print("First review BoW representation:\n", X_train_bow[0])

Vocabulary size: 10000
Sample vocabulary: [('vince', np.int64(9580)), ('high', np.int64(4181)), ('school', np.int64(7752)), ('has', np.int64(4079)), ('new', np.int64(6005)), ('principal', np.int64(6854)), ('the', np.int64(8958)), ('evil', np.int64(3077)), ('ms', np.int64(5849)), ('mary', np.int64(5490))]
First review BoW representation:
 [0 0 0 ... 0 0 0]
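
With 10,000 columns the printed vector is almost all zeros, so the structure is hard to see. A toy sketch may make it clearer: each word maps to one column index, and each review becomes a vector of word counts. The two-sentence corpus and the `toy_*` names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not taken from the IMDb data
toy_reviews = ["good movie good plot", "bad movie"]

toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy_reviews).toarray()

# vocabulary_ maps each word to its column index
toy_vocab = toy_vectorizer.vocabulary_
toy_words = sorted(toy_vocab, key=toy_vocab.get)

print(toy_words)  # column order of the matrix
print(toy_bow)    # one row per review, one column per word
```

Here "good" occupies one column and appears twice in the first review, so that entry is 2. Word order is not stored anywhere in the representation.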


**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 

In [46]:
# Reshape the data to fit the CNN input requirements
X_train_bow = X_train_bow.reshape(X_train_bow.shape[0], X_train_bow.shape[1], 1)
X_val_bow = X_val_bow.reshape(X_val_bow.shape[0], X_val_bow.shape[1], 1)
X_test_bow = X_test_bow.reshape(X_test_bow.shape[0], X_test_bow.shape[1], 1)

# Stop when val_accuracy doesn't improve for 5 epochs and restore the best weights
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor='val_accuracy',
    patience=5,
    restore_best_weights=True,
    verbose=1
)

# Build the CNN model
cnn = models.Sequential([
    layers.Input(shape=(10000, 1)),
    layers.Conv1D(64, 3, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 3, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

cnn.summary()

In [47]:
# Compile the model
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = cnn.fit(X_train_bow, Y_train, epochs=10, validation_data=(X_val_bow, Y_val), batch_size=32, callbacks=[early_stopping])

# Evaluate the model on the test set
test_loss, test_acc = cnn.evaluate(X_test_bow, Y_test)
print(f"Test accuracy: {test_acc:.4f}")

# Parameter grid for hyperparameter tuning (the keys match scikit-learn's
# MLPClassifier; the grid is not used by the Keras model trained above)
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    'alpha': [0.0001, 0.001],
    'learning_rate': ['constant', 'adaptive']
}

Epoch 1/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 308s 654ms/step - accuracy: 0.7339 - loss: 0.6037 - val_accuracy: 0.8688 - val_loss: 0.3327
Epoch 2/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 305s 651ms/step - accuracy: 0.8885 - loss: 0.2739 - val_accuracy: 0.8792 - val_loss: 0.3050
Epoch 3/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 333s 711ms/step - accuracy: 0.9237 - loss: 0.1945 - val_accuracy: 0.8802 - val_loss: 0.3043
Epoch 4/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 303s 645ms/step - accuracy: 0.9420 - loss: 0.1490 - val_accuracy: 0.8776 - val_loss: 0.3352
Epoch 5/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 294s 627ms/step - accuracy: 0.9601 - loss: 0.1107 - val_accuracy: 0.8722 - val_loss: 0.4101
Epoch 6/10
469/469 ━━━━━━━━━━━━━━━━━━━━ 321s 684ms/step - accuracy: 0.9760 - loss: 0.0762 - val_accuracy: 0.8794 - val_loss: 0.4520
Epoc
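
Part (c) asks for a single hidden layer, and the keys of `param_grid` above match scikit-learn's `MLPClassifier`. The following is a minimal sketch of how such a grid could be searched, assuming scikit-learn was the intended tool. The random toy data, the reduced grid `small_grid`, and all variable names are hypothetical and chosen only to keep the example fast; this is not the model trained above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data standing in for Bag-of-Words counts
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 3, size=(200, 20)).astype(float)
y_toy = (X_toy[:, 0] > X_toy[:, 1]).astype(int)

# Reduced grid: every tuple has one entry, i.e. one hidden layer
small_grid = {
    'hidden_layer_sizes': [(8,), (16,)],
    'alpha': [0.0001, 0.001],
}

# 3-fold cross-validated search over the grid
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    small_grid,
    cv=3,
)
search.fit(X_toy, y_toy)

best_mlp = search.best_estimator_
print(search.best_params_)
# coefs_ holds one weight matrix per layer transition:
# input -> hidden and hidden -> output for a single hidden layer
print(len(best_mlp.coefs_))
```

Note that the tuple `(50, 50)` in the original grid would describe two hidden layers, so it falls outside what part (c) asks for.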

**(d)** Test your sentiment classifier on the test set.

In [48]:
# Predict the sentiment of the test set
Y_test_pred = cnn.predict(X_test_bow)
Y_test_pred = (Y_test_pred > 0.5).astype(int)

# Calculate the accuracy
test_accuracy = np.mean(Y_test_pred == Y_test.values)
print(f"Test set accuracy: {test_accuracy:.4f}")

[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 81ms/step
Test set accuracy: 0.8834
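
The accuracy computation above relies on the predictions and labels having the same shape: `predict` returns an `(n, 1)` array, and `Y_test.values` is also `(n, 1)` because `Y` is a DataFrame. A small NumPy sketch with made-up probabilities shows what happens if the labels were a flat `(n,)` array instead; all values and names here are illustrative.

```python
import numpy as np

# Hypothetical sigmoid outputs, shape (4, 1) like the output of predict
probs = np.array([[0.9], [0.2], [0.6], [0.4]])
labels_2d = np.array([[1], [0], [0], [0]])

# Threshold at 0.5 to obtain hard class labels
preds = (probs > 0.5).astype(int)

# Shapes match, so the comparison is element-wise
acc_ok = np.mean(preds == labels_2d)

# With 1-D labels, (4, 1) == (4,) broadcasts to a (4, 4) matrix
labels_1d = labels_2d.ravel()
broadcast_shape = (preds == labels_1d).shape

# Flattening both sides avoids the pitfall
acc_safe = np.mean(preds.ravel() == labels_1d)

print(acc_ok, broadcast_shape, acc_safe)
```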


**(e)** Use the classifier to classify a few sentences you write yourselves. 

In [49]:
# Define a few sentences
sentences = [
    "I absolutely loved this movie, it was fantastic!",
    "This was the worst film I have ever seen.",
    "The plot was interesting but the acting was terrible.",
    "An amazing experience, I would highly recommend it.",
    "Not bad, but could have been better."
]

# Transform the sentences into the Bag-of-Words representation
sentences_bow = vectorizer.transform(sentences).toarray()

# Reshape the transformed data to fit the CNN input requirements
sentences_bow = sentences_bow.reshape(sentences_bow.shape[0], sentences_bow.shape[1], 1)

# Use the trained classifier to predict the sentiment
predictions = cnn.predict(sentences_bow)
predictions = (predictions > 0.5).astype(int)

# Print the sentences with their predicted sentiment
for sentence, prediction in zip(sentences, predictions):
    sentiment = "positive" if prediction == 1 else "negative"
    print(f"Sentence: {sentence}\nPredicted Sentiment: {sentiment}\n")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
Sentence: I absolutely loved this movie, it was fantastic!
Predicted Sentiment: positive

Sentence: This was the worst film I have ever seen.
Predicted Sentiment: negative

Sentence: The plot was interesting but the acting was terrible.
Predicted Sentiment: negative

Sentence: An amazing experience, I would highly recommend it.
Predicted Sentiment: positive

Sentence: Not bad, but could have been better.
Predicted Sentiment: negative
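
When interpreting predictions like the one for "Not bad, but could have been better.", it may help to recall two properties of the Bag-of-Words input: word order is discarded, and words outside the fitted vocabulary are silently dropped. A toy sketch of both, using a made-up corpus and hypothetical `demo_*` names:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus used only to fit a small vocabulary
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(["not bad at all", "bad movie"])

# Column order: all, at, bad, movie, not
demo_bow = demo_vectorizer.transform(
    ["not bad", "bad not", "unseen word"]
).toarray()

# "not bad" and "bad not" get identical vectors
same_vector = (demo_bow[0] == demo_bow[1]).all()
# Out-of-vocabulary words contribute nothing
oov_total = int(demo_bow[2].sum())

print(demo_bow)
print(same_vector, oov_total)
```

A model on top of this representation can only weigh the presence of "not" and "bad" separately, which is one plausible reason a negated phrase gets classified by its individual words.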

