## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files are placed in the same folder as this notebook. We load the reviews into a pandas DataFrame and print the first few rows.

In [31]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import optimizers

In [15]:
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>

**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets will be used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [37]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(reviews[0], Y, test_size=0.2, random_state=42)

# Further split train set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Initialize CountVectorizer
max_features = 10000  # Number of most frequent words to use
vectorizer = CountVectorizer(max_features=max_features)

# Fit the vectorizer on the training data
vectorizer.fit(X_train)

# Transform train, validation and test sets using the same vectorizer
X_train_bow = vectorizer.transform(X_train)
X_val_bow = vectorizer.transform(X_val)
X_test_bow = vectorizer.transform(X_test)
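
To make the role of `max_features` and of fitting on the training split only more concrete, here is a simplified pure-Python sketch of the idea. It is not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary` and `transform` are made up for illustration, the tokenizer only approximates the default behaviour of `CountVectorizer` (lowercasing, tokens of at least two word characters), and breaking ties alphabetically is an assumption of this sketch.

```python
import re
from collections import Counter

def tokenize(text):
    # Approximation of the default tokenizer: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(train_texts, max_features):
    # Count every token over the training texts only
    counts = Counter(tok for text in train_texts for tok in tokenize(text))
    # Keep the max_features most frequent tokens (ties broken alphabetically in this sketch)
    top = sorted(counts, key=lambda tok: (-counts[tok], tok))[:max_features]
    # Map each kept token to a column index, in alphabetical order
    return {tok: idx for idx, tok in enumerate(sorted(top))}

def transform(texts, vocabulary):
    # One count vector per text; tokens outside the vocabulary are ignored
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for tok in tokenize(text):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train_texts = ["great movie great cast", "boring movie", "great plot"]
vocabulary = fit_vocabulary(train_texts, max_features=2)
unseen_vectors = transform(["great soundtrack and great movie"], vocabulary)

print(vocabulary)      # {'great': 0, 'movie': 1}
print(unseen_vectors)  # [[2, 1]]
```

Words that never occurred in the training texts ("soundtrack", "and") have no column and simply drop out, which is why the validation and test sets are transformed with the vectorizer fitted on the training set.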

**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [25]:
# Get the index of the most frequent word in the vocabulary
most_frequent_word_index = np.argmax(np.sum(X_train_bow, axis=0))

# Get the most frequent word
most_frequent_word = vectorizer.get_feature_names_out()[most_frequent_word_index]

# A single word is one column of the matrix: its count in every training review
most_frequent_word_representation = X_train_bow[:, most_frequent_word_index]

# Get the frequency of the most frequent word in the first review
most_frequent_word_frequency_in_first_review = X_train_bow[0, most_frequent_word_index]

# Get the Bag-of-Words representation of the first review (one row of the matrix)
first_review_bow_representation = X_train_bow[0, :]

np.set_printoptions(threshold=np.inf)

# Print the results
print("Single Word Representation:", most_frequent_word_representation)
print("Selected Word:", most_frequent_word)
print("Frequency in the First Review:", most_frequent_word_frequency_in_first_review)
print("\nWhole Review Representation (Bag-of-Words):")
print(first_review_bow_representation.toarray().flatten())

Single Word Representation:   (0, 0)	5
  (1, 0)	9
  (2, 0)	26
  (3, 0)	2
  (4, 0)	19
  (5, 0)	3
  (6, 0)	6
  (7, 0)	2
  (8, 0)	25
  (9, 0)	2
  (10, 0)	9
  (11, 0)	30
  (12, 0)	10
  (13, 0)	5
  (15, 0)	27
  (16, 0)	3
  (17, 0)	15
  (18, 0)	19
  (19, 0)	11
  (20, 0)	22
  (21, 0)	14
  (22, 0)	11
  (23, 0)	11
  (24, 0)	11
  (25, 0)	7
  :	:
  (15975, 0)	8
  (15976, 0)	7
  (15977, 0)	9
  (15978, 0)	10
  (15979, 0)	8
  (15980, 0)	13
  (15981, 0)	2
  (15982, 0)	8
  (15983, 0)	16
  (15984, 0)	12
  (15985, 0)	15
  (15986, 0)	24
  (15987, 0)	15
  (15988, 0)	18
  (15989, 0)	12
  (15990, 0)	20
  (15991, 0)	7
  (15992, 0)	11
  (15993, 0)	20
  (15994, 0)	8
  (15995, 0)	8
  (15996, 0)	9
  (15997, 0)	11
  (15998, 0)	18
  (15999, 0)	8
Selected Word: the
Frequency in the First Review: 5

Whole Review Representation (Bag-of-Words):

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 
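
As a reference for what such a network computes, the sketch below writes out the forward pass and the binary cross-entropy loss for one hidden layer in plain NumPy. It assumes a tanh hidden layer and a sigmoid output; the toy sizes, the random weights and the names `forward` and `binary_cross_entropy` are illustrative assumptions, and nothing here is trained.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    # Hidden layer: affine map followed by tanh
    hidden = np.tanh(X @ W1 + b1)
    # Output layer: a single logit squashed to a probability with the sigmoid
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-example loss
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_example))

rng = np.random.default_rng(0)
n_reviews, n_words, n_hidden = 4, 6, 3  # toy sizes, far smaller than 10,000 words
X = rng.integers(0, 3, size=(n_reviews, n_words)).astype(float)  # fake word counts
y = np.array([[1.0], [0.0], [1.0], [0.0]])                       # fake labels

W1 = rng.normal(scale=0.1, size=(n_words, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

probs = forward(X, W1, b1, W2, b2)  # shape (4, 1), values strictly between 0 and 1
loss = binary_cross_entropy(y, probs)
print(probs.shape, loss)
```

With all weights set to zero every predicted probability is 0.5 and the loss equals ln 2 ≈ 0.693, which is a handy baseline when reading a training log: a loss near that value means the model is doing no better than guessing.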

In [33]:
# Define the neural network architecture
model = Sequential([
    Dense(128, activation='tanh', input_shape=(max_features,)),  # Hidden layer; tanh is one common choice of hidden activation
    Dense(1, activation='sigmoid')  # Sigmoid output gives a probability for binary classification
])

sgd = optimizers.SGD(learning_rate=0.1)

# Compile the model
model.compile(optimizer=sgd,
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train_bow, y_train,
                    epochs=20,
                    batch_size=128,
                    validation_data=(X_val_bow, y_val))

Epoch 1/20
125/125 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.5741 - loss: 0.8534 - val_accuracy: 0.5295 - val_loss: 0.7899
Epoch 2/20
125/125 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.6924 - loss: 0.5789 - val_accuracy: 0.7075 - val_loss: 0.5505
Epoch 3/20
125/125 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7372 - loss: 0.5281 - val_accuracy: 0.8213 - val_loss: 0.4217
Epoch 4/20
125/125 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7680 - loss: 0.4841 - val_accuracy: 0.8100 - val_loss: 0.4426
Epoch 5/20
125/125 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7988 - loss: 0.4388 - val_accuracy: 0.8395 - val_loss: 0.3807
Epoch 6/20
125/125 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8043 - loss: 0.4302 - val_accuracy: 0.8432 - val_loss: 0.3749
Epoch 7/20
125/125 ━━━━━━━

**(d)** Test your sentiment-classifier on the test set.

In [34]:
# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(X_test_bow, y_test)
print('Test accuracy:', test_acc)

157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8715 - loss: 0.3082
Test accuracy: 0.873199999332428
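
For a binary classifier with a sigmoid output, the reported accuracy should correspond to thresholding the predicted probabilities at 0.5 and counting how often the result matches the label. A minimal NumPy sketch with made-up probabilities and labels (not the model's actual outputs), which also tallies the four confusion-matrix cells:

```python
import numpy as np

# Made-up example values, NOT taken from the trained model
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.91, 0.12, 0.48, 0.77, 0.63, 0.05, 0.88, 0.30])

# Threshold the probabilities to obtain hard class predictions
y_pred = (y_prob >= 0.5).astype(int)

accuracy = float(np.mean(y_pred == y_true))
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # positive reviews predicted positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # negative reviews predicted negative
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # negative reviews predicted positive
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # positive reviews predicted negative

print(accuracy, tp, tn, fp, fn)  # 0.75 3 3 1 1
```

Looking at false positives and false negatives separately can show whether the classifier errs more on one class, which a single accuracy number hides.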


**(e)** Use the classifier to classify a few sentences you write yourselves. 
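
When choosing sentences, keep in mind that a Bag-of-Words model only sees word counts, not word order, so negation and contrast are hard for it to capture. A small standard-library sketch of that limitation (the two sentences and the `bag_of_words` helper are illustrative assumptions, and the tokenizer only approximates the default of `CountVectorizer`):

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and count tokens of 2+ word characters; word order is discarded
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

a = bag_of_words("The plot was good, not bad.")
b = bag_of_words("The plot was bad, not good.")

print(a == b)  # True: opposite meanings, identical representation
```

Any classifier built on this representation has to assign both sentences the same score, so mixed or negated reviews are useful test cases.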

In [43]:
sentences = [
    "This movie is really great, and I love all the actors in here.",
    "The cinematography in this film is stunning, and the storyline kept me engaged from start to finish.",
    "I couldn't stop laughing throughout the entire movie! The humor was spot on.",
    "The acting performances were phenomenal, especially from the lead actor. A must-watch!",
    "The special effects were mind-blowing, and the soundtrack perfectly complemented the action sequences.",
    "This movie is absolutely horrible, and i hate all the actors in here. The plot is really bad and the actors are annoying.",
    "I was extremely disappointed by this film. The plot was confusing, and the pacing felt off.",
    "The characters were poorly developed, and I found it hard to connect with any of them.",
    "The dialogue was cringeworthy, and the acting was subpar at best. Not worth the price of admission.",
    "The movie was riddled with clichés, and the ending was unsatisfying. A total waste of time.",
    "I like most of the movie, but the part where the main actor says i hate feminist i really hate.",
    "While the movie had some redeeming qualities, such as the impressive visuals, it ultimately fell short of expectations.",
    "I had high hopes for this film, but it failed to deliver. Some scenes were enjoyable, but overall, it lacked coherence.",
    "The film had its moments, but they were overshadowed by the lackluster plot and unconvincing performances.",
    "Despite its flaws, there were aspects of the movie that I found enjoyable. However, they were few and far between."
]

sentences_transformed = vectorizer.transform(sentences)

predictions = model.predict(sentences_transformed).flatten()

# Map the predictions to 'positive' or 'negative' based on the threshold of 0.5
mapped_predictions = ['positive' if pred >= 0.5 else 'negative' for pred in predictions]

# Print the mapped predictions and predicted scores for each sentence
for i, (sentence, prediction, score) in enumerate(zip(sentences, mapped_predictions, predictions), 1):
    print(f"Review {i}:")
    print(f"Sentence: '{sentence}'")
    print(f"Mapped Prediction: {prediction}")
    print(f"Predicted Score: {score}")
    print()

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
Review 1:
Sentence: 'This movie is really great, and I love all the actors in here.'
Mapped Prediction: positive
Predicted Score: 0.8921685218811035

Review 2:
Sentence: 'The cinematography in this film is stunning, and the storyline kept me engaged from start to finish.'
Mapped Prediction: positive
Predicted Score: 0.6613864302635193

Review 3:
Sentence: 'I couldn't stop laughing throughout the entire movie! The humor was spot on.'
Mapped Prediction: positive
Predicted Score: 0.6835229992866516

Review 4:
Sentence: 'The acting performances were phenomenal, especially from the lead actor. A must-watch!'
Mapped Prediction: positive
Predicted Score: 0.7576891183853149

Review 5:
Sentence: 'The special effects were mind-blowing, and the soundtrack perfectly complemented the action sequences.'
Mapped Prediction: positive
Predicted Score: 0.8377390503883362

Review 6:
Sentence: 'This movie is absolutely horrible, and i 