## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files are in the same folder as this notebook. The reviews are loaded as a pandas DataFrame, and we print the beginning of the first few reviews.

In [60]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer


In [61]:
reviews = pd.read_csv('reviews.txt', header=None, names=['review'])
labels = pd.read_csv('labels.txt', header=None, names=['label'])
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

print(type(labels))

<class 'pandas.core.frame.DataFrame'>
                                              review
0  omwell high is a cartoon comedy . it ran at th...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a and new luxury    plane...
4  illiant over  acting by lesley ann warren . be...
<class 'pandas.core.frame.DataFrame'>


### Data Cleaning

In [62]:
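reviews = pd.read_csv("reviews.txt", header=None, names=["review"])

# Remove leftover HTML line-break tokens ("br", "br /") from each review.
# The word boundaries (\b) matter: an unanchored pattern such as r'br\s*/?' also strips
# "br" inside ordinary words (compare "omwell" and "illiant" in the printed reviews above).
reviews["review"] = reviews["review"].apply(lambda x: re.sub(r'\bbr\b\s*/?', '', x, flags=re.IGNORECASE))

# Check that the number of reviews is unchanged
print(len(reviews))
print(f"Number of reviews after cleaning: {len(reviews)}")

# Caution: this overwrites the original file in place, so keep a backup copy of reviews.txt
reviews.to_csv("reviews.txt", index=False, header=False, encoding="utf-8")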


25000
Number of reviews after cleaning: 25000


**(a)** Split the reviews and labels into training, validation and test sets. The training and validation sets are used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use `CountVectorizer` from `sklearn.feature_extraction.text` to create a bag-of-words representation of the reviews. Use only the 10,000 most frequent words (the `max_features` parameter of `CountVectorizer`).

### Splitting the data into sets and generating a BOW

In [70]:
# combined view of labels and reviews, useful for inspection (the split below uses reviews and Y directly)
dataset = pd.concat([labels, reviews], axis=1)

# split the data into train (80%), validation (10%) and test (10%) sets
x_train, x_test, y_train, y_test = train_test_split(reviews, Y, test_size=0.2, random_state=42)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=42)

print(x_train)

# Fit the vocabulary on the training reviews only. Pass the text column (a Series of strings):
# iterating over a DataFrame yields its column names, not its rows.
vectorizer = CountVectorizer(max_features=10000)
x_train_bow = vectorizer.fit_transform(x_train["review"])

x_val_bow = vectorizer.transform(x_val["review"])
x_test_bow = vectorizer.transform(x_test["review"])

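def get_most_common_words(review, vectorizer):
    """Return every vocabulary word with its count in `review`, most frequent first."""
    review_vector = vectorizer.transform([review])  # `review` must be a single string
    feature_names = np.array(vectorizer.get_feature_names_out())
    word_counts = review_vector.toarray().flatten()
    return dict(sorted(zip(feature_names, word_counts), key=lambda item: item[1], reverse=True))

# pass the review text (a string), not a DataFrame row
get_most_common_words(review=reviews["review"].iloc[1], vectorizer=vectorizer)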

                                                  review
23311  the idea of making a miniseries about the berl...
23623  mona the vagabond lives on the fringes of fren...
1020   lillian hellman  one of america  s most famous...
12645  let me be clear . i  ve used imdb for years . ...
1533   i guess its possible that i  ve seen worse mov...
...                                                  ...
21575  it is a pity that you cannot vote zero stars o...
5390   david duchovney creates a role that he was to ...
860    i  m a huge fan of the dukes of hazzard tv sho...
15795  turkish cinema has a big problem . directors a...
23654  in any number of films  you can find nicholas ...

[20000 rows x 1 columns]


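Note: passing a whole row such as `reviews.iloc[1]` to the vectorizer raises `AttributeError: 'Series' object has no attribute 'lower'`, because the row is a pandas Series rather than a string. Pass the review text itself, e.g. `reviews["review"].iloc[1]`.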

**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

### Exploring the representation of reviews

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 
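A possible starting point, assuming scikit-learn's `MLPClassifier` is an acceptable choice of network: loop over a small grid of hyperparameters, fit on the training set and compare accuracy on the validation set. The toy texts, the grid values and the variable names below are placeholders rather than recommendations; in the exercise the bag-of-words matrices from (a) would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data; in the exercise, use the IMDb train/validation splits instead.
train_texts = ["great film", "loved it", "wonderful acting",
               "terrible film", "hated it", "awful acting"]
y_train = [1, 1, 1, 0, 0, 0]
val_texts = ["great acting", "awful film"]
y_val = [1, 0]

vectorizer = CountVectorizer()
x_train_bow = vectorizer.fit_transform(train_texts)
x_val_bow = vectorizer.transform(val_texts)

results = {}
for n_hidden in (4, 16):        # candidate hidden-layer widths (arbitrary choices)
    for alpha in (1e-4, 1e-2):  # candidate L2 penalties (arbitrary choices)
        # hidden_layer_sizes=(n_hidden,) means exactly one hidden layer
        model = MLPClassifier(hidden_layer_sizes=(n_hidden,), alpha=alpha,
                              max_iter=500, random_state=0)
        model.fit(x_train_bow, y_train)
        results[(n_hidden, alpha)] = model.score(x_val_bow, y_val)  # validation accuracy

best_params = max(results, key=results.get)
print(best_params, results[best_params])
```

The test set is deliberately not touched here, so that the later evaluation is not biased by the hyperparameter search.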

**(d)** Test your sentiment-classifier on the test set.
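For the final evaluation, accuracy together with a confusion matrix gives a slightly fuller picture than accuracy alone. The following is a minimal sketch on made-up data, with an arbitrarily chosen hidden-layer width; in the exercise the tuned model from (c) and the held-out test split would be used instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data standing in for the IMDb train and test splits.
train_texts = ["great film", "loved it", "wonderful acting",
               "terrible film", "hated it", "awful acting"]
y_train = [1, 1, 1, 0, 0, 0]
test_texts = ["wonderful film", "terrible acting", "loved acting"]
y_test = [1, 0, 1]

vectorizer = CountVectorizer()
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
model.fit(vectorizer.fit_transform(train_texts), y_train)

# Reuse the fitted vocabulary: transform (not fit_transform) the test reviews.
y_pred = model.predict(vectorizer.transform(test_texts))

test_acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true label, columns: predicted label
print(f"test accuracy: {test_acc:.2f}")
print(cm)
```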

**(e)** Use the classifier to classify a few sentences you write yourselves. 
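New sentences have to pass through the same vectorizer that was fitted on the training data before the network can classify them. One way to keep the two steps together, shown here as an assumption rather than the required approach, is a scikit-learn pipeline. The training texts and `my_sentences` are invented examples.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; in the exercise, use the IMDb training split.
train_texts = ["great film", "loved it", "wonderful acting",
               "terrible film", "hated it", "awful acting"]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# The pipeline vectorizes raw strings and feeds the counts to the network.
clf = make_pipeline(
    CountVectorizer(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0),
)
clf.fit(train_texts, y_train)

my_sentences = ["what a wonderful film", "i hated the acting"]
labels_pred = clf.predict(my_sentences)
probs = clf.predict_proba(my_sentences)  # columns follow clf.classes_, here [0, 1]

for sentence, label, p in zip(my_sentences, labels_pred, probs):
    print(f"{sentence!r}: label={label}, P(positive)={p[1]:.2f}")
```

Words that were not seen during training (here "what" and "the") are simply ignored by the vectorizer, so sentences made up mostly of unseen words carry little signal.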