## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files are placed in the same folder as this notebook. The reviews are loaded as a pandas DataFrame, and we print the beginning of the first few reviews.

In [1]:
import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())
print(reviews.shape)

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...
(25000, 1)


In [2]:
print(labels.head())
print(labels.shape)

          0
0  positive
1  negative
2  positive
3  negative
4  positive
(25000, 1)


**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets are used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (set the `max_features` parameter of `CountVectorizer`).

In [3]:
#split reviews into training, testing and validation sets
from sklearn.model_selection import train_test_split

train_reviews, test_reviews, train_labels, test_labels = train_test_split(reviews, labels, test_size=0.2, random_state=1)
train_reviews, val_reviews, train_labels, val_labels = train_test_split(train_reviews, train_labels, test_size=0.2, random_state=1)

print(train_reviews.shape)
print(train_labels.shape)
print(test_reviews.shape)
print(test_labels.shape)
print(val_reviews.shape)
print(val_labels.shape)


(16000, 1)
(16000, 1)
(5000, 1)
(5000, 1)
(4000, 1)
(4000, 1)
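
The sizes printed above follow from applying `test_size=0.2` twice: once to all 25,000 reviews and once to the remaining 20,000. A minimal arithmetic check, using integer percentages (this only reproduces the arithmetic, not scikit-learn's own rounding logic; `split_sizes` is a hypothetical helper name):

```python
def split_sizes(n_samples, test_pct=20, val_pct=20):
    """Sizes of a two-stage split: hold out the test set first, then validation."""
    n_test = n_samples * test_pct // 100
    n_remaining = n_samples - n_test
    n_val = n_remaining * val_pct // 100
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(25000)
print(n_train, n_val, n_test)  # 16000 4000 5000
```

The effective proportions are therefore 64% training, 16% validation and 20% test.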


In [4]:
# use CountVectorizer to create a bag-of-words representation of the reviews (keeping only the 10000 most frequent words)
from sklearn.feature_extraction.text import CountVectorizer
# small hand-picked list of English stop words (articles only)
stop_words = ["a", "the", "an"]

vectorizer = CountVectorizer(max_features=10000, stop_words=stop_words)
train_reviews_bow = vectorizer.fit_transform(train_reviews[0])

**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [5]:
#print type and shape of train_reviews_bow
print(type(train_reviews_bow))
print(train_reviews_bow.shape)

#print the vocabulary
print(vectorizer.vocabulary_)

#print the bag of words representation
print(train_reviews_bow)

<class 'scipy.sparse._csr.csr_matrix'>
(16000, 10000)
  (0, 4850)	3
  (0, 9685)	1
  (0, 3363)	1
  (0, 3493)	1
  (0, 7125)	1
  (0, 9034)	2
  (0, 319)	4
  (0, 2954)	1
  (0, 5188)	1
  (0, 2091)	1
  (0, 7385)	1
  (0, 6167)	3
  (0, 1333)	1
  (0, 494)	2
  (0, 5854)	1
  (0, 773)	1
  (0, 4706)	1
  (0, 4886)	1
  (0, 3729)	1
  (0, 841)	2
  (0, 9966)	1
  (0, 8090)	1
  (0, 4066)	1
  (0, 9057)	1
  (0, 5485)	1
  :	:
  (15999, 4277)	1
  (15999, 8832)	1
  (15999, 2655)	2
  (15999, 6103)	1
  (15999, 5810)	1
  (15999, 7275)	1
  (15999, 907)	1
  (15999, 7329)	1
  (15999, 1065)	2
  (15999, 6916)	1
  (15999, 8075)	1
  (15999, 7207)	1
  (15999, 4383)	1
  (15999, 4433)	1
  (15999, 3772)	1
  (15999, 4643)	1
  (15999, 9440)	1
  (15999, 4544)	1
  (15999, 3013)	1
  (15999, 6491)	1
  (15999, 1585)	4
  (15999, 3611)	1
  (15999, 3151)	1
  (15999, 519)	1
  (15999, 3636)	9
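
One way to read the output above: each entry `(i, j)  c` says that review `i` contains the word with column index `j` exactly `c` times. A single word is therefore just a column index, and a whole review is a vector of counts over the vocabulary. The sketch below rebuilds that idea in plain Python on a toy corpus. It only loosely mimics `CountVectorizer` (lowercasing, tokens of at least two word characters, alphabetically ordered indices); `tokenize`, `build_vocabulary` and `to_counts` are hypothetical helper names, not part of scikit-learn.

```python
import re
from collections import Counter

def tokenize(doc):
    # lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def build_vocabulary(docs, max_features=None):
    # keep the most frequent words, then assign indices in alphabetical order
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: idx for idx, word in enumerate(sorted(kept))}

def to_counts(doc, vocab):
    # one count per vocabulary word; words outside the vocabulary are ignored
    vector = [0] * len(vocab)
    for tok in tokenize(doc):
        if tok in vocab:
            vector[vocab[tok]] += 1
    return vector

docs = ["good movie good cast", "bad movie"]
vocab = build_vocabulary(docs)
print(vocab)                      # {'bad': 0, 'cast': 1, 'good': 2, 'movie': 3}
print(to_counts(docs[0], vocab))  # [0, 1, 2, 1]
```

Word order is lost in this representation: "good movie good cast" and "cast good good movie" map to the same vector.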


## I. Using scikit-learn's MLPClassifier
- By tuning the hyperparameters, we reached 88.0% accuracy on the test data and 87.6% accuracy on the validation data.
- The hidden layer uses the logistic sigmoid activation function, f(x) = 1 / (1 + exp(-x)).
- The best-performing solver for weight optimization was stochastic gradient descent (`solver='sgd'`).
- The learning rate was set to `'adaptive'`. In `MLPClassifier`, this keeps the learning rate constant at `learning_rate_init` as long as the training loss keeps decreasing. Each time two consecutive epochs fail to decrease the training loss by at least `tol` (or fail to increase the validation score by at least `tol` when `early_stopping` is enabled), the current learning rate is divided by 5.
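
The adaptive rule described in the last bullet can be sketched in a few lines of plain Python. This is a simplified illustration of the rule as stated above, not scikit-learn's implementation; the function name and the `patience` parameter are assumptions made for the example.

```python
def adaptive_learning_rates(losses, learning_rate_init=0.001, tol=1e-4, patience=2):
    """Return the learning rate used in each epoch under a simplified adaptive rule."""
    lr = learning_rate_init
    best_loss = float("inf")
    stalled = 0  # consecutive epochs without an improvement of at least tol
    rates = []
    for loss in losses:
        rates.append(lr)
        if loss > best_loss - tol:
            stalled += 1
        else:
            stalled = 0
        best_loss = min(best_loss, loss)
        if stalled >= patience:
            lr /= 5  # divide the current learning rate by 5
            stalled = 0
    return rates

# hypothetical training losses: the loss stalls at 0.7 for two epochs
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.6]
rates = adaptive_learning_rates(losses, learning_rate_init=0.1)
print(rates)
```

With these made-up losses, the first five epochs run at the initial rate and the sixth runs at one fifth of it, because the two epochs after the first 0.7 bring no improvement.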


In [7]:
#train a neural network with one hidden layer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

clf = MLPClassifier(hidden_layer_sizes=(25,), activation='logistic', solver='sgd',
                    learning_rate='adaptive', max_iter=500, random_state=1)
clf.fit(train_reviews_bow, train_labels.values.ravel())

#predict the labels for the validation set
val_reviews_bow = vectorizer.transform(val_reviews[0])
val_predicted_labels = clf.predict(val_reviews_bow)

#compute the accuracy of the predictions
val_accuracy = accuracy_score(val_labels, val_predicted_labels)
print("Validation accuracy: ", val_accuracy)




Validation accuracy:  0.87625


### Testing it on the test set

In [8]:
#test the model on the test set
test_reviews_bow = vectorizer.transform(test_reviews[0])
test_predicted_labels = clf.predict(test_reviews_bow)

#compute the accuracy of the predictions
test_accuracy = accuracy_score(test_labels, test_predicted_labels)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.8802


### Testing it on new data, both positive and negative sentences

In [9]:
new_review = "This movie was atrocious."
new_review_bow = vectorizer.transform([new_review])
new_review_predicted_label = clf.predict(new_review_bow)[0]
print("Predicted label for new review: ", new_review_predicted_label)

Predicted label for new review:  negative


In [10]:
new_review = "A pleasant surprise to see such an ensemble cast in an indie debut film."
new_review_bow = vectorizer.transform([new_review])
new_review_predicted_label = clf.predict(new_review_bow)[0]
print("Predicted label for new review: ", new_review_predicted_label)

Predicted label for new review:  positive


In [11]:
new_review = "I would like to declare my contempt for these kinds of movies. They are lacking in many aspects."
new_review_bow = vectorizer.transform([new_review])
new_review_predicted_label = clf.predict(new_review_bow)[0]
print("Predicted label for new review: ", new_review_predicted_label)

Predicted label for new review:  negative


In [12]:
new_review = "Style over substance, no redeeming qualities for this show."
new_review_bow = vectorizer.transform([new_review])
new_review_predicted_label = clf.predict(new_review_bow)[0]
print("Predicted label for new review: ", new_review_predicted_label)

Predicted label for new review:  negative


In [13]:
new_review = "This debut delights its audience with some top of the line acting, for such great, coming of age actors."
new_review_bow = vectorizer.transform([new_review])
new_review_predicted_label = clf.predict(new_review_bow)[0]
print("Predicted label for new review: ", new_review_predicted_label)

Predicted label for new review:  positive


## II. Using Keras from TensorFlow


In [14]:
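from sklearn.model_selection import train_test_split

# factorize() numbers the labels in order of first appearance, so here 'positive' -> 0 and 'negative' -> 1
labels['target'] = labels[0].factorize()[0]
# note: this refits the shared vectorizer on all 25000 reviews, so its column mapping
# may no longer match the one clf was trained on in Part I
X = vectorizer.fit_transform(reviews[0]).toarray()
y = np.array(labels['target'])
# same two-stage split as before: 20% test, then 20% of the remainder for validation
X_, X_test, y_, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_, y_, test_size=0.2, random_state=1)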

In [20]:
import tensorflow as tf
from tensorflow import keras

# one hidden layer of 25 sigmoid units and one sigmoid output unit for the binary label
model = keras.Sequential([
    keras.layers.Dense(25, activation='sigmoid', input_dim=X_train.shape[1]),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# stop once val_loss has not improved for 20 epochs and restore the best weights
es = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
# note: batch_size=1 performs one weight update per review (16000 steps per epoch), which is slow
history = model.fit(X_train, y_train, epochs=100, batch_size=1, validation_data=(X_val, y_val), callbacks=[es])


Epoch 1/100


2023-11-05 21:12:47.565239: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1280000000 exceeds 10% of free system memory.


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
 3003/16000 [====>.........................] - ETA: 35s - loss: 0.0492 - accuracy: 0.9843

KeyboardInterrupt: 
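
Two numbers in the output above can be checked with simple arithmetic. The 1,280,000,000 bytes in the allocator warning are consistent with a dense 16000 × 10000 training matrix of 8-byte values, and the `3003/16000` progress counter reflects `batch_size=1`, i.e. one step per review. The sketch below assumes 64-bit values and uses a batch size of 32 purely as a comparison point; `steps_per_epoch` is a hypothetical helper name.

```python
import math

n_train, n_features = 16000, 10000  # shape of X_train in this notebook
bytes_per_value = 8                 # assumption: 64-bit values

# memory needed to hold the training matrix as a dense array
dense_bytes = n_train * n_features * bytes_per_value
print(dense_bytes)  # 1280000000

def steps_per_epoch(n_samples, batch_size):
    # number of weight updates needed to see every sample once
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(n_train, 1))   # 16000
print(steps_per_epoch(n_train, 32))  # 500
```

A larger batch size reduces the number of updates per epoch accordingly, at the cost of a different optimization trajectory.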

In [None]:
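import matplotlib.pyplot as plt

# history is only defined once model.fit above has run to completion
plt.figure()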
plt.title("Learning curves")
plt.xlabel("Epoch")
plt.ylabel("Cross entropy loss")
plt.plot(history.history['loss'], label = 'train')
plt.plot(history.history['val_loss'], label = 'valid')
plt.legend()
plt.show()