# Deep learning Bag-of-Words model for predicting movie review sentiment

This project aims to create an algorithm that gives a binary classification to a movie review, indicating a positive or a negative sentiment.

## The  data

The data used for this project is obtained from Kaggle, it is the [IMDB Movie Reviews Dataset](https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset). This dataset contains movie reviews along with their associated binary sentiment polarity labels. <br>
The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). <br>
In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. <br>
In addition to the review text files, we have a text file  with all the words used in the entirety of the reviews. We call this file the vocabulary of the reviews.

In [2]:
import os
PATH_TO_DATA = os.path.dirname(os.path.abspath("__file__")) + "/IMDB_reviews"

In [3]:
import glob
import random

# list of paths to the reviews
paths_to_reviews_test = glob.glob(PATH_TO_DATA + "/test/neg/*.txt") + glob.glob(PATH_TO_DATA + "/test/pos/*.txt")
random.shuffle(paths_to_reviews_test)
paths_to_reviews_train = glob.glob(PATH_TO_DATA + "/train/neg/*.txt") + glob.glob(PATH_TO_DATA + "/train/pos/*.txt")
random.shuffle(paths_to_reviews_train)

# we create our train and test set (list of (review, sentiment) with sentiment in {0, 1}).
train_set = []
for review_path in paths_to_reviews_train:
    filename = os.path.basename(review_path)
    rating = os.path.splitext(filename)[0][-1]
    label = 0 if int(rating) <= 4 else 1
    with open(review_path, 'rb') as f:
        review = str(f.read())
    train_set.append((review, label))

test_set = []
for review_path in paths_to_reviews_test:
    filename = os.path.basename(review_path)
    rating = os.path.splitext(filename)[0][-1]
    label = 0 if int(rating) <= 4 else 1
    with open(review_path, 'rb') as f:
        review = str(f.read())
    test_set.append((review, label))

## Methodology

We will use a common approach which is called "Bag of Words". Let's take a simple example to highlight how a bag-of-word model works : <br>
Let's say we have two sentences : 
* sentence 1 : "Allons enfants de la Patrie"
* sentence 2 : "Les enfants, allons à la piscine" </ul>

Given that the vocabulary from the two sentences is `{allons, les, enfants, de, la, à, piscine, patrie}`, we simply construct a vector based on the number of occurrences of each word in a sentence. The sequence of words within a sentence does not matter.<br>

|      Sentence                   |allons | les | enfants | de | la | à | piscine | patrie |
| :------------                   | :---: | :-: | :--:  | :---:|:--:|:-:|:------:| :-----: |
| Allons enfants de la Patrie     |  1    |  0 | 1    | 1 | 1| 0 | 0 | 1 |
| Les enfants, allons à la piscine |  1    |  1 | 1    | 0 | 1| 1 | 1 | 0 |

Each sentence can thus be transformed into a vector, with the length of the vector being the number of words in the vocabulary. During classification, the model will then possibly learn that higher occurrences of certain words are more likely to lead to a particular prediction.

## Vectorization of the reviews

### The vocabulary

First we need to choose the vocabulary that defines the vector space in which we will vectorize the reviews. We could use all the words that appears at least once in the whole dataset of reviews. However the dimension of our vector space would be huge. We will study the vocabulary to try to diminish this dimension.

###### Listing the whole vocabulary

In [38]:
dico_vocab = {}
for review, label in (train_set):
    already_seen = []
    for word in review.split(" "):
        word = word.lower()
        if len(word) > 1 and word.isalpha() and not word in already_seen:
            dico_vocab[word] = dico_vocab.get(word, 0) + 1
            already_seen.append(word)

In [39]:
print(f"number of different words in all the reviews : {len(dico_vocab)}")

number of different words in all the reviews : 60615


In [40]:
vocabulary = list(dico_vocab.items())
vocabulary.sort(key=lambda x: x[1])
vocabulary[60610:]

[('is', 22310), ('to', 23438), ('of', 23702), ('and', 24069), ('the', 24759)]

##### Reducing the size of the vocabulary

We drop stopwords from the vocabulary (ie 'the', 'or', 'and'...).

In [41]:
from nltk.corpus import stopwords
sw = stopwords.words("english")

for i, (word, occ) in enumerate(vocabulary):
    if word in sw:
        vocabulary.pop(i)

len(vocabulary)

60521

Now we choose to only keep the 5,000 most frequent words. It is arbitrary but if the score we'll get is not satisfactory we'll change this.

In [44]:
vocabulary_reduced = vocabulary[len(vocabulary) - 5000:]

Now we stock the vocabulary in a list (we drop the number of occurences).

In [46]:
vocab = []
for (word, occ) in vocabulary_reduced:
    vocab.append(word)

#### Text vectorization

Now that we have defined a vocabulary, we are able to transform the reviews in vectors.

In [47]:
def vectorizer(text, vocabulary):
    vector = [0 for _ in vocabulary]
    for word in text.split(" "):
        word = word.lower()
        if word in vocabulary:
            vector[vocabulary.index(word)] += 1
    return(vector)

Using the function `vectorizer` we vectorize the review both in the test and train set.

In [58]:
x_train, x_test = [], []
y_train, y_test = [], []

for i, (review, label) in enumerate(train_set):
    x_train.append(vectorizer(review, vocab))
    y_train.append(label)
    if i == 0 or i%1000 == 0:
        print(f"vectorizing review {i}/25,000 (train set)")

for i, (review, label) in enumerate(test_set):
    x_test.append(vectorizer(review, vocab))
    y_test.append(label)
    if i == 0 or i%1000 == 0:
        print(f"vectorizing review {i}/25,000 (test set)")

vectorizing review 0/25,000 (train set)
vectorizing review 1000/25,000 (train set)
vectorizing review 2000/25,000 (train set)
vectorizing review 3000/25,000 (train set)
vectorizing review 4000/25,000 (train set)
vectorizing review 5000/25,000 (train set)
vectorizing review 6000/25,000 (train set)
vectorizing review 7000/25,000 (train set)
vectorizing review 8000/25,000 (train set)
vectorizing review 9000/25,000 (train set)
vectorizing review 10000/25,000 (train set)
vectorizing review 11000/25,000 (train set)
vectorizing review 12000/25,000 (train set)
vectorizing review 13000/25,000 (train set)
vectorizing review 14000/25,000 (train set)
vectorizing review 15000/25,000 (train set)
vectorizing review 16000/25,000 (train set)
vectorizing review 17000/25,000 (train set)
vectorizing review 18000/25,000 (train set)
vectorizing review 19000/25,000 (train set)
vectorizing review 20000/25,000 (train set)
vectorizing review 21000/25,000 (train set)
vectorizing review 22000/25,000 (train set)
v

## Training and testing our model 

We will use a Logistic Regression.

In [76]:
from sklearn.linear_model import LogisticRegression

In [78]:
clf_LR = LogisticRegression(random_state=0, solver='sag')
clf_LR.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='sag',
          tol=0.0001, verbose=0, warm_start=False)

In [79]:
pred_test = np.array(clf_LR.predict(x_test))
accuracy = np.mean(pred_test == np.array(y_test).T[0])

In [80]:
accuracy

0.79488