## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files (`reviews.txt` and `labels.txt`) are placed in the same folder as this notebook. The reviews are loaded as a pandas DataFrame, and we print the beginning of the first few reviews.

In [1]:
import numpy as np
import pandas as pd

reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
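# binary target: 1 = positive, 0 = negative (not used below; the classifier is trained on the string labels)
Y = (labels=='positive').astype(np.int_)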

print(type(reviews))
print(reviews.head())
print(reviews.shape)

<class 'pandas.core.frame.DataFrame'>
                                                   0
0  bromwell high is a cartoon comedy . it ran at ...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a brand new luxury    pla...
4  brilliant over  acting by lesley ann warren . ...
(25000, 1)


In [2]:
print(labels.head())
print(labels.shape)

          0
0  positive
1  negative
2  positive
3  negative
4  positive
(25000, 1)


**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets are used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [3]:
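#split reviews into training, testing and validation sets:
#first hold out 20% for testing, then 20% of the remainder for validation
#(16,000 training / 4,000 validation / 5,000 test reviews)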
from sklearn.model_selection import train_test_split

train_reviews, test_reviews, train_labels, test_labels = train_test_split(reviews, labels, test_size=0.2, random_state=1)
train_reviews, val_reviews, train_labels, val_labels = train_test_split(train_reviews, train_labels, test_size=0.2, random_state=1)

print(train_reviews.shape)
print(train_labels.shape)
print(test_reviews.shape)
print(test_labels.shape)
print(val_reviews.shape)
print(val_labels.shape)


(16000, 1)
(16000, 1)
(5000, 1)
(5000, 1)
(4000, 1)
(4000, 1)


In [4]:
# use CountVectorizer to create a bag-of-words representation of the reviews (keep only the 10,000 most frequent words)
from sklearn.feature_extraction.text import CountVectorizer
# minimal custom stop-word list, not a full English list; "a" is already dropped by the
# default token pattern, which ignores single-character tokens
stop_words = ["a", "the"]

vectorizer = CountVectorizer(max_features=10000, stop_words=stop_words)
train_reviews_bow = vectorizer.fit_transform(train_reviews[0])

**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

In [5]:
#print type and shape of train_reviews_bow
print(type(train_reviews_bow))
print(train_reviews_bow.shape)

#print the vocabulary
print(vectorizer.vocabulary_)

#print the bag of words representation
print(train_reviews_bow)

<class 'scipy.sparse._csr.csr_matrix'>
(16000, 10000)
  (0, 4844)	3
  (0, 9685)	1
  (0, 3363)	1
  (0, 3495)	1
  (0, 7125)	1
  (0, 9035)	2
  (0, 319)	4
  (0, 2953)	1
  (0, 5185)	1
  (0, 2096)	1
  (0, 7383)	1
  (0, 6160)	3
  (0, 1339)	1
  (0, 495)	2
  (0, 5852)	1
  (0, 778)	1
  (0, 4700)	1
  (0, 4880)	1
  (0, 3731)	1
  (0, 846)	2
  (0, 9967)	1
  (0, 8090)	1
  (0, 4066)	1
  (0, 9057)	1
  (0, 5483)	1
  :	:
  (15999, 4275)	1
  (15999, 8833)	1
  (15999, 2653)	2
  (15999, 6100)	1
  (15999, 5808)	1
  (15999, 7274)	1
  (15999, 912)	1
  (15999, 7328)	1
  (15999, 1072)	2
  (15999, 6913)	1
  (15999, 8074)	1
  (15999, 7207)	1
  (15999, 4381)	1
  (15999, 4427)	1
  (15999, 3771)	1
  (15999, 4637)	1
  (15999, 9440)	1
  (15999, 4536)	1
  (15999, 3012)	1
  (15999, 6488)	1
  (15999, 1590)	4
  (15999, 3612)	1
  (15999, 3152)	1
  (15999, 520)	1
  (15999, 3637)	9
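
The sparse printout above is hard to read at full scale, so here is a minimal sketch of the same representation on a toy corpus. The two sentences and the names `toy_reviews`, `toy_vectorizer` and `toy_bow` are made up for illustration and are not part of the exercise data. Each word maps to one column index (stored in `vocabulary_`), and each review becomes one row of word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-review corpus, only for illustration.
toy_reviews = ["good movie good plot", "bad movie"]

toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy_reviews)

# A single word is represented by its column index in the count matrix.
vocab = {word: int(index) for word, index in toy_vectorizer.vocabulary_.items()}
print(vocab)

# A whole review is one row: the count of every vocabulary word in that review.
dense = toy_bow.toarray()
print(dense)
```

The columns are ordered alphabetically, so `bad` is column 0 and `plot` is column 3. An entry such as `(0, 4844)  3` in the printout above reads the same way: review 0 contains the word with column index 4844 three times.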


**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 

In [6]:
#train a neural network with one hidden layer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

#a single hidden layer with 10 units; pass the label column as a 1-D Series
#to avoid the DataConversionWarning raised for a column-vector y
clf = MLPClassifier(hidden_layer_sizes=(10,), random_state=1, max_iter=300)
clf.fit(train_reviews_bow, train_labels[0])

#predict the labels for the validation set
val_reviews_bow = vectorizer.transform(val_reviews[0])
val_predicted_labels = clf.predict(val_reviews_bow)

#compute the accuracy of the predictions
val_accuracy = accuracy_score(val_labels, val_predicted_labels)
print("Validation accuracy: ", val_accuracy)


Validation accuracy:  0.8625
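
The cell above trains a single configuration. One possible way to do the tuning the task asks for is a small grid search scored on the validation set, sketched below. This is an illustration, not the notebook's method: the helper `tune_hidden_layer`, the candidate grids and the synthetic data from `make_classification` are all assumptions chosen to keep the example fast and self-contained. In the notebook you would pass `train_reviews_bow`, `train_labels[0]`, `val_reviews_bow` and `val_labels[0]` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def tune_hidden_layer(X_train, y_train, X_val, y_val,
                      sizes=(5, 10, 20), alphas=(1e-4, 1e-2)):
    """Fit one MLP per (hidden size, alpha) pair and score each on the validation set.

    Returns the best (size, alpha) pair, its validation accuracy and all scores.
    """
    results = {}
    for size in sizes:
        for alpha in alphas:
            model = MLPClassifier(hidden_layer_sizes=(size,), alpha=alpha,
                                  random_state=1, max_iter=300)
            model.fit(X_train, y_train)
            results[(size, alpha)] = model.score(X_val, y_val)
    best = max(results, key=results.get)
    return best, results[best], results


# Synthetic stand-in data (assumption), small enough to run in a few seconds.
X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1)

best_params, best_acc, results = tune_hidden_layer(X_tr, y_tr, X_va, y_va)
print("Best (hidden units, alpha):", best_params)
print("Validation accuracy:", best_acc)
```

Only the training and validation sets take part in the search, so the test set remains an unbiased check for part (d).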


**(d)** Test your sentiment classifier on the test set.

In [7]:
#test the model on the test set
test_reviews_bow = vectorizer.transform(test_reviews[0])
test_predicted_labels = clf.predict(test_reviews_bow)

#compute the accuracy of the predictions
test_accuracy = accuracy_score(test_labels, test_predicted_labels)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.8616
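
As a reminder of what the reported number means, accuracy is the fraction of reviews whose predicted label equals the true label. The short sketch below checks this against `accuracy_score` on five made-up labels. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, only for illustration: 4 of the 5 predictions are correct.
y_true = np.array(["positive", "negative", "positive", "negative", "positive"])
y_pred = np.array(["positive", "negative", "negative", "negative", "positive"])

# Fraction of positions where prediction and truth agree.
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```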


**(e)** Use the classifier to classify a few sentences you write yourselves. 

In [8]:
#use the classifier to predict the sentiment of a new review
new_review = "This movie was excellent"
new_review_bow = vectorizer.transform([new_review])
new_review_predicted_label = clf.predict(new_review_bow)[0]
print("Predicted label for new review: ", new_review_predicted_label)

Predicted label for new review:  positive
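
To classify several sentences at once and see how confident the model is, `predict` and `predict_proba` both accept a list of texts once the vectorizer and classifier are combined. The sketch below is self-contained and therefore trains on a tiny made-up corpus. The names `toy_texts`, `toy_labels`, `sentiment_model` and `my_sentences` are assumptions, and its predictions say nothing about the IMDb model. With the notebook's objects you would call `clf.predict(vectorizer.transform(my_sentences))` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set, only here to make the sketch runnable.
toy_texts = ["excellent wonderful movie", "great acting great story",
             "terrible boring movie", "awful plot bad acting"]
toy_labels = ["positive", "positive", "negative", "negative"]

# Vectorizer and classifier chained, so raw strings can be passed directly.
sentiment_model = make_pipeline(
    CountVectorizer(),
    MLPClassifier(hidden_layer_sizes=(10,), random_state=1, max_iter=500),
)
sentiment_model.fit(toy_texts, toy_labels)

my_sentences = ["an excellent story", "boring and awful", "qwerty asdf"]
predictions = sentiment_model.predict(my_sentences)
probabilities = sentiment_model.predict_proba(my_sentences)

for sentence, label, probs in zip(my_sentences, predictions, probabilities):
    print(f"{sentence!r}: {label}", dict(zip(sentiment_model.classes_, probs.round(3))))

# Words never seen during fitting are silently ignored by transform,
# so a sentence made only of unknown words becomes an all-zero row.
unseen_row = sentiment_model.named_steps["countvectorizer"].transform(["qwerty asdf"])
print("Non-zero entries for unseen words:", unseen_row.nnz)
```

The last line is worth keeping in mind when writing your own sentences: a prediction for text made only of out-of-vocabulary words is based on an empty input, so it reflects the model's bias rather than the sentence.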
