## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files are placed in the same folder as this notebook. The reviews are loaded as a pandas DataFrame, and we print the beginning of the first few reviews.

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...


**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets will be used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [52]:
!pip install notebook tensorflow nltk setuptools
import tensorflow as tf
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

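# One-hot encode the labels; the resulting columns are '0_negative' and '0_positive'
labels = pd.get_dummies(labels)

# Hold out 25% of the data (the train_test_split default) as the test set.
# The validation set is taken from the training data later, via validation_split in model.fit.
X_train, X_test, Y_train, Y_test = train_test_split(reviews[0], labels, random_state=7)

# Learn the 10,000-word vocabulary on the training reviews only, then apply it to the test reviews
vectorized = CountVectorizer(max_features=10000)
X_train = vectorized.fit_transform(X_train).toarray()
X_test = vectorized.transform(X_test).toarray()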
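
The task asks for separate training, validation and test sets, whereas the cell above makes a single train/test split and leaves the validation split to `model.fit`. One way to get an explicit three-way split is to call `train_test_split` twice. The following is a minimal sketch on toy data; the 60/20/20 proportions, the `stratify` option and all variable names are assumptions for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the reviews and labels (hypothetical data)
texts = np.array([f"review {i}" for i in range(100)])
targets = np.array([i % 2 for i in range(100)])

# First split off 20% of the data as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, targets, test_size=0.2, random_state=7, stratify=targets
)
# ... then take 25% of the remaining 80% (20% overall) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=7, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

With an explicit validation set, the pair `(X_val, y_val)` could be passed to `model.fit` through its `validation_data` argument instead of using `validation_split`.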



**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

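`CountVectorizer` returns a document-term matrix. Each column corresponds to one word in the vocabulary, so a single word is represented by its column index. Each row corresponds to one review: a vector of length 10,000 whose entries are the integer counts of how often each vocabulary word occurs in that review. Word order is discarded.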

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 

In [43]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()

# Hidden layer with 10 neurons and ReLU activation (non-linearity); input_dim sets the input size
model.add(Dense(10, input_dim=X_train.shape[1], activation="relu"))
# Output layer: softmax over the 2 classes (negative, positive)
model.add(Dense(2, activation="softmax"))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(X_train.shape)
print(Y_train.shape)
model.fit(X_train, Y_train, epochs=19, batch_size=10, validation_split=0.2)

(18750, 10000)
(18750, 2)
Epoch 1/19
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8030 - loss: 0.4343 - val_accuracy: 0.8773 - val_loss: 0.3192
Epoch 2/19
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9390 - loss: 0.1771 - val_accuracy: 0.8757 - val_loss: 0.3514
Epoch 3/19
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9618 - loss: 0.1128 - val_accuracy: 0.8659 - val_loss: 0.4425
Epoch 4/19
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9740 - loss: 0.0759 - val_accuracy: 0.8640 - val_loss: 0.5094
Epoch 5/19
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9845 - loss: 0.0509 - val_accuracy: 0.8667 - val_loss: 0.5775
Epoch 6/19
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9913 - loss: 0.0282 - val_accuracy: 0.8688 - val_los

<keras.src.callbacks.history.History at 0x21b670ee840>
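
In the log above, the validation loss is lowest after the first epoch (0.3192) and rises in every later epoch shown, while the training loss keeps falling. This suggests the network starts to overfit early, so the number of epochs is one of the hyperparameters worth tuning. The sketch below applies a simple early-stopping rule to the five complete validation losses from the log. The helper `best_epoch` and the patience value are hypothetical and chosen only for illustration; Keras offers a comparable built-in through its `EarlyStopping` callback.

```python
# Validation losses for epochs 1-5, copied from the training log
val_losses = [0.3192, 0.3514, 0.4425, 0.5094, 0.5775]

def best_epoch(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    best_epoch is the epoch with the lowest loss seen so far;
    stop_epoch is where training would halt after `patience`
    consecutive epochs without improvement.
    """
    best, best_idx, waited = float("inf"), 0, 0
    for i, loss in enumerate(losses, start=1):
        if loss < best:
            best, best_idx, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                return best_idx, i
    return best_idx, len(losses)

print(best_epoch(val_losses))  # (1, 3)
```

Under this rule, training would stop after epoch 3 and the weights from epoch 1 would be the ones to keep.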

**(d)** Test your sentiment-classifier on the test set.

In [44]:
test_loss, test_accuracy = model.evaluate(X_test, Y_test)
print(f'Test accuracy: {test_accuracy:.4f}')

196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 967us/step - accuracy: 0.8736 - loss: 1.1756
Test accuracy: 0.8674
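
For a softmax output with one-hot labels, the accuracy metric is the fraction of reviews whose most probable class matches the true class. A small NumPy sketch of that computation follows; the probabilities and labels are made up for illustration and are not real model output.

```python
import numpy as np

# Hypothetical softmax outputs for 4 reviews: columns are (negative, positive)
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
# Matching one-hot labels (hypothetical)
y_true = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

# Pick the most probable class per review and compare with the true class
pred_class = probs.argmax(axis=1)
true_class = y_true.argmax(axis=1)
accuracy = float((pred_class == true_class).mean())
print(accuracy)  # 0.75
```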


**(e)** Use the classifier to classify a few sentences you write yourselves. 

In [50]:
new_reviews = ["Adam Sandler", "The plot was interisting", "That thing was amazing", "LETS GO", "That was just like Seven Deadly Sins Season 3", "So predictable"]
new_reviews_vectorized = vectorized.transform(new_reviews).toarray()
predictions = model.predict(new_reviews_vectorized)
predicted_labels = labels.columns[np.argmax(predictions, axis=1)]

for review, label in zip(new_reviews, predicted_labels):
    print(f"Review: '{review}' => Predicted label: '{label}'")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
Review: 'Adam Sandler' => Predicted label: '0_negative'
Review: 'The plot was interisting' => Predicted label: '0_positive'
Review: 'That thing was amazing' => Predicted label: '0_positive'
Review: 'LETS GO' => Predicted label: '0_positive'
Review: 'That was just like Seven Deadly Sins Season 3' => Predicted label: '0_negative'
Review: 'So predictable' => Predicted label: '0_negative'
