## Sentiment Analysis

In this exercise we perform sentiment analysis on the IMDb dataset. The code below assumes that the data files (`reviews.txt` and `labels.txt`) are placed in the same folder as this notebook. The reviews are loaded into a pandas DataFrame, and we print the beginning of the first few reviews.

In [45]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


In [46]:
reviews = pd.read_csv('reviews.txt', header=None, names=['review'])
labels = pd.read_csv('labels.txt', header=None, names=['label'])
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

print(type(labels))

<class 'pandas.core.frame.DataFrame'>
                                              review
0  omwell high is a cartoon comedy . it ran at th...
1  story of a man who has unnatural feelings for ...
2  homelessness  or houselessness as george carli...
3  airport    starts as a and new luxury    plane...
4  illiant over  acting by lesley ann warren . be...
<class 'pandas.core.frame.DataFrame'>


### Data Cleaning

In [47]:
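reviews = pd.read_csv("reviews.txt", header=None, names=["review"])

# Remove HTML line-break remnants. The pattern only matches "<br />"-style tags
# or a standalone "br" token; an unanchored r'br\s*/?' would also strip "br"
# inside ordinary words (e.g. "bromwell" -> "omwell").
reviews["review"] = reviews["review"].apply(
    lambda x: re.sub(r'<br\s*/?>|\bbr\b', '', x, flags=re.IGNORECASE)
)

print(len(reviews))

# Check that the number of reviews is unchanged by the cleaning step
print(f"Number of reviews after cleaning: {len(reviews)}")

# Optional: save the result back to the original file (no new file is created).
# This overwrites reviews.txt in place and cannot be undone, so keep a backup.
reviews.to_csv("reviews.txt", index=False, header=False, encoding="utf-8")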


25000
Number of reviews after cleaning: 25000
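
The truncated words in the output further up (`omwell`, `illiant`) suggest what happens when a pattern such as `br\s*/?` is applied without anchors: it matches the letters "br" anywhere, including inside ordinary words. The sketch below contrasts this with an anchored pattern on a made-up sample sentence. It assumes that line-break remnants appear either as `<br />` tags or as a standalone `br` token; the sample text and variable names are illustrative only.

```python
import re

# Hypothetical sample; the standalone "br" tokens stand in for stripped <br /> tags.
sample = "bromwell high is a brilliant comedy . br  br it ran for years"

# Unanchored: removes every "br", including the start of "bromwell" and "brilliant".
unanchored = re.sub(r"br\s*/?", "", sample, flags=re.IGNORECASE)

# Anchored: removes only "<br />"-style tags or "br" as a whole word.
anchored = re.sub(r"<br\s*/?>|\bbr\b", "", sample, flags=re.IGNORECASE)

print(unanchored)
print(anchored)
```

Because the cleaned text is written back to `reviews.txt`, a mistake in the pattern is persisted, so it is worth checking a few reviews before saving.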


**(a)** Split the reviews and labels into training, validation, and test sets. The training and validation sets are used to train your model and tune hyperparameters; the test set is held out for the final evaluation. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

### Splitting the data into sets and generating a BOW

In [66]:
# combine labels and reviews into a single DataFrame (handy for inspection; the split below does not use it)
dataset = pd.concat([labels, reviews], axis=1)

# split the data into train (80%), test (10%) and validation (10%) sets:
# first hold out 20% of the data, then divide that part evenly
x_train, x_test, y_train, y_test = train_test_split(reviews['review'], labels, test_size=0.2, random_state=42)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=42)


x_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
x_val.reset_index(drop=True, inplace=True)
y_val.reset_index(drop=True, inplace=True)
x_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

print(x_train)

vectorizer = CountVectorizer(max_features=10000)
x_train_bow = vectorizer.fit_transform(x_train)

x_val_bow = vectorizer.transform(x_val)
x_test_bow = vectorizer.transform(x_test)

# get the most common words for a single review
def get_most_common_words_per_review(review, vectorizer):
    review_vector = vectorizer.transform([review])
    feature_names = np.array(vectorizer.get_feature_names_out())
    word_counts = review_vector.toarray().flatten()
    word_count_dict = dict(zip(feature_names, word_counts))
    return dict(sorted(word_count_dict.items(), key=lambda item: item[1], reverse=True))

most_common_words_single_review = get_most_common_words_per_review(reviews.iloc[3]['review'], vectorizer)
print("Most common words in a single review:", most_common_words_single_review)

# get the most popular words across all reviews
def get_most_common_words_across_reviews(reviews, vectorizer):
    reviews_vector = vectorizer.transform(reviews)
    feature_names = np.array(vectorizer.get_feature_names_out())
    # sum the sparse matrix column-wise; calling toarray() first would build a dense reviews x vocabulary matrix
    word_counts = np.asarray(reviews_vector.sum(axis=0)).ravel()
    word_count_dict = dict(zip(feature_names, word_counts))
    return dict(sorted(word_count_dict.items(), key=lambda item: item[1], reverse=True))

most_common_words_across_reviews = get_most_common_words_across_reviews(reviews['review'], vectorizer)
print("Most common words across all reviews:", most_common_words_across_reviews)

0        the idea of making a miniseries about the berl...
1        mona the vagabond lives on the fringes of fren...
2        lillian hellman  one of america  s most famous...
3        let me be clear . i  ve used imdb for years . ...
4        i guess its possible that i  ve seen worse mov...
                               ...                        
19995    it is a pity that you cannot vote zero stars o...
19996    david duchovney creates a role that he was to ...
19997    i  m a huge fan of the dukes of hazzard tv sho...
19998    turkish cinema has a big problem . directors a...
19999    in any number of films  you can find nicholas ...
Name: review, Length: 20000, dtype: object
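
As a quick sanity check on the two-stage split, the sketch below reproduces the 80/10/10 proportions on placeholder data with the same number of rows (25,000). The `stratify` argument is an addition that the cell above does not use; it keeps the class balance identical across the three sets. The placeholder arrays and the assumption of balanced labels are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 25,000 row ids and hypothetical balanced 0/1 labels.
X = np.arange(25000)
y = np.tile([0, 1], 12500)

# Stage 1: hold out 20% of the data.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Stage 2: divide the held-out part evenly into test and validation sets.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```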


**(b)** Explore the representation of the reviews. How is a single word represented? How about a whole review?

### Exploring the representation of reviews

In [68]:
# initialize with at most 10k terms (single words and n-grams)
tfidf = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 3),   # Use n-grams of 1 to 3 consecutive words
    min_df=2,             # Term must appear in at least 2 reviews
    max_df=0.9,           # Term must appear in at most 90% of the reviews
    stop_words='english'  # Removes common English words (such as 'and', 'to', etc.)
)

# fit on the training set only; the validation and test sets are only transformed
x_train_tfidf = tfidf.fit_transform(x_train)
x_val_tfidf = tfidf.transform(x_val)
x_test_tfidf = tfidf.transform(x_test)


# get the vocabulary; it should contain 10,000 terms (capped by max_features)
vocab = tfidf.get_feature_names_out()
print("Vocabulary size:", len(vocab))

# let's look at a single sample, here index 1 (the second training review)
print("OG text:", x_train[1]) # Original text
print("Word importance vector:", x_train_tfidf[1].toarray()[0], " - length:", len(x_train_tfidf[1].toarray()[0])) # TF-IDF vector (array of numbers representing word importance)

# let's match word vector vocabulary and sort by tfidf score
word_scores = {vocab[i]:score for i, score in enumerate(x_train_tfidf[1].toarray()[0]) if score > 0}
word_scores = dict(sorted(word_scores.items(), key=lambda x: x[1], reverse=True))

# Print scores without showing np.float64 format
print("Word importance scores: ", end="")
print(" | ".join([f"{word}: {score:.5f}" for word, score in word_scores.items()]))


Vocabulary size: 10000
OG text: mona the vagabond lives on the fringes of french society  in a life without meaning  purpose or direction .  i watched this because of all the stellar reviews  but i  m afraid i must have missed something . the character of mona has little or no personality while drifting through life being rude to people  getting high and contributing nothing to anyone  s life . she  s not interesting or exciting . she  s just useless .  i  ve seen and known enough people like that there is no secret meaning to what they  re doing . they are just lazy bums . i wouldn  t want mona anywhere near me  as she tends to steal anything that isn  t nailed down and leave her friends in the lurch . sure she  s enigmatic  because there isn  t anything to her . lots of junkies  winos and bums i  ve seen are enigmatic i wouldn  t want to see a film about them either .  possibly there is something there that i totally missed . otherwise i  m assuming that all the reviews are from peop
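
To make the answer to (b) concrete for the plain Bag-of-Words case, here is a minimal sketch on a toy corpus (the two sentences are made up). Each word in the vocabulary corresponds to one column index, and each review becomes a vector of word counts whose length equals the vocabulary size. With TF-IDF, as in the cell above, the layout is the same but the counts are replaced by weights.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the reviews.
docs = ["a great great movie", "a boring movie"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# A single word is represented by its column index in the vocabulary.
# Note: the default tokenizer drops one-character tokens such as "a".
print(vec.vocabulary_)

# A whole review is one row: the count of each vocabulary word in that review.
print(X.toarray())
```

In the notebook the matrix has 10,000 columns, so almost all entries of a row are zero, which is why scikit-learn returns a sparse matrix.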

**(c)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 
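
One possible approach to (c), sketched on a toy corpus: fit scikit-learn's `MLPClassifier` with a single hidden layer for a small grid of hyperparameters and keep the setting with the best validation accuracy. The choice of `MLPClassifier`, the grid values, and the toy texts are assumptions for illustration; the exercise does not prescribe a framework. In the notebook, `x_train_bow` / `x_val_bow` and integer labels would take the place of the toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for the training and validation sets (1 = positive, 0 = negative).
train_texts = [
    "great movie loved it", "wonderful acting great story",
    "loved the wonderful cast", "great fun loved it",
    "terrible movie hated it", "awful acting boring story",
    "hated the boring plot", "awful terrible waste",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
val_texts = ["loved the great story", "wonderful fun",
             "boring awful movie", "hated it terrible"]
val_labels = [1, 1, 0, 0]

# Fit the vocabulary on the training texts only.
vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)
X_val = vec.transform(val_texts)

best_acc, best_params = -1.0, None
for hidden_units in (4, 16):      # size of the single hidden layer
    for alpha in (1e-4, 1e-2):    # L2 regularization strength
        model = MLPClassifier(hidden_layer_sizes=(hidden_units,), alpha=alpha,
                              max_iter=500, random_state=0)
        model.fit(X_train, train_labels)
        acc = accuracy_score(val_labels, model.predict(X_val))
        if acc > best_acc:
            best_acc, best_params = acc, (hidden_units, alpha)

print("Best validation accuracy:", best_acc)
print("Best (hidden_units, alpha):", best_params)
```

Selecting hyperparameters on the validation set keeps the test set untouched until the final evaluation.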

**(d)** Test your sentiment-classifier on the test set.

**(e)** Use the classifier to classify a few sentences you write yourselves. 
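
For (d) and (e), a minimal sketch that bundles the vectorizer and the network into one scikit-learn pipeline, so that raw sentences can be scored and classified directly. The toy texts, the hidden layer size, and the label encoding (1 = positive, 0 = negative) are assumptions for illustration, not results from the IMDb data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the training and test sets (1 = positive, 0 = negative).
train_texts = [
    "great movie loved it", "wonderful acting great story",
    "loved the wonderful cast", "great fun loved it",
    "terrible movie hated it", "awful acting boring story",
    "hated the boring plot", "awful terrible waste",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great wonderful cast", "boring terrible plot"]
test_labels = [1, 0]

# The pipeline vectorizes raw text and then applies the classifier.
clf = make_pipeline(
    CountVectorizer(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0),
)
clf.fit(train_texts, train_labels)

# (d) Accuracy on the held-out test set.
test_acc = clf.score(test_texts, test_labels)
print(f"Test accuracy: {test_acc:.2f}")

# (e) Classify sentences of our own; words outside the vocabulary are ignored.
my_sentences = ["i loved this wonderful movie", "what an awful waste of time"]
preds = clf.predict(my_sentences)
probs = clf.predict_proba(my_sentences)[:, 1]  # estimated probability of class 1
for sentence, pred, prob in zip(my_sentences, preds, probs):
    print(f"{sentence!r} -> {pred} (p_positive={prob:.2f})")
```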