# IMDb dataset Sentiment Analysis

Binary text classification of IMDb movie reviews (negative or positive) using Naive Bayes, Random Forest, and Logistic Regression.

> **Odysseas Spyropoulos**, 3200183 <br />
> **Lydia Christina Wallace**, 3200125<br />
> **Miltos Tsolkas**, 3200213 <br />

* First, we need to install `tensorflow`, if we don't already have it

In [2]:
pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


* Let's import the libraries that we will need

In [3]:
# Part 1
import tensorflow as tf
import numpy as np
import math
from tqdm import tqdm

# Part 2

# Part 3




# <ins>Part 1</ins>

## Load Data

* We will import the "IMDB Dataset" from `keras`
* We will split the data to `train`, `dev` and `test` datasets

In [4]:
# Load the IMDb dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data()

# Split the training set into training and validation (dev) sets
split_ratio = 0.9  # fraction of the original training set kept for training
split_index = int(len(x_train) * split_ratio)

x_train, x_dev = np.split(x_train, [split_index])
y_train, y_dev = np.split(y_train, [split_index])

* Let's see the sizes of the arrays

In [5]:
print(x_train.shape, y_train.shape)  # dimensions of the TRAIN dataset
print(x_dev.shape, y_dev.shape)  # dimensions of the VALIDATION dataset
print(x_test.shape, y_test.shape)  # dimensions of the TEST dataset

(22500,) (22500,)
(2500,) (2500,)
(25000,) (25000,)


* Let's see what's inside!
* We will check the sequence of indices for the first movie review in the training set

In [6]:
# Slice instead of indexing, so the review is shown as a one-element array
x_train[0:1]

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32])],
      dtype=object)

* We don't want numbers
* We would like to work with words
* so we will **map the numbers to words!**

## Word Index Mapping

* We will create a new dictionary, `index2word`, where:
    * **Keys** are the original word indices shifted by 3 (indices 0, 1 and 2 are reserved for special tokens)
    * **Values** are corresponding words from the `IMDb dataset`.

In [7]:
word_index = tf.keras.datasets.imdb.get_word_index()

# Create a mapping from index to word
index2word = dict((i + 3, word) for (word, i) in word_index.items())

# Add special tokens for padding ([pad]), beginning of the sentence ([bos]), and out-of-vocabulary ([oov]).
index2word[0] = '[pad]'
index2word[1] = '[bos]'
index2word[2] = '[oov]'
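
To make the shift by 3 concrete, here is a minimal sketch on a toy mapping. The name `toy_word_index`, its three entries, and the `decode` helper are hypothetical stand-ins for illustration; the `.get(..., '[oov]')` fallback is also an assumption and is not part of the cell above.

```python
# Toy stand-in for the real word index (hypothetical entries, for illustration only)
toy_word_index = {"the": 1, "and": 2, "film": 19}

# Shift every index by 3 to leave room for the special tokens
toy_index2word = {i + 3: word for word, i in toy_word_index.items()}
toy_index2word[0] = "[pad]"
toy_index2word[1] = "[bos]"
toy_index2word[2] = "[oov]"

def decode(sequence, mapping):
    """Turn a list of indices back into a space-separated string.

    Unknown indices fall back to '[oov]' (an assumption of this sketch).
    """
    return " ".join(mapping.get(idx, "[oov]") for idx in sequence)

encoded_review = [1, 4, 22, 5, 4, 22]
decoded_review = decode(encoded_review, toy_index2word)
print(decoded_review)  # [bos] the film and the film
```

Here index 4 decodes to the word stored under 1, index 22 to the word stored under 19, and so on.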

* Let's see what we got and what we created

In [8]:
word_index

{'fawn': 34701,
 'tsukino': 52006,
 'nunnery': 52007,
 'sonja': 16816,
 'vani': 63951,
 'woods': 1408,
 'spiders': 16115,
 'hanging': 2345,
 'woody': 2289,
 'trawling': 52008,
 "hold's": 52009,
 'comically': 11307,
 'localized': 40830,
 'disobeying': 30568,
 "'royale": 52010,
 "harpo's": 40831,
 'canet': 52011,
 'aileen': 19313,
 'acurately': 52012,
 "diplomat's": 52013,
 'rickman': 25242,
 'arranged': 6746,
 'rumbustious': 52014,
 'familiarness': 52015,
 "spider'": 52016,
 'hahahah': 68804,
 "wood'": 52017,
 'transvestism': 40833,
 "hangin'": 34702,
 'bringing': 2338,
 'seamier': 40834,
 'wooded': 34703,
 'bravora': 52018,
 'grueling': 16817,
 'wooden': 1636,
 'wednesday': 16818,
 "'prix": 52019,
 'altagracia': 34704,
 'circuitry': 52020,
 'crotch': 11585,
 'busybody': 57766,
 "tart'n'tangy": 52021,
 'burgade': 14129,
 'thrace': 52023,
 "tom's": 11038,
 'snuggles': 52025,
 'francesco': 29114,
 'complainers': 52027,
 'templarios': 52125,
 '272': 40835,
 '273': 52028,
 'zaniacs': 52130,

In [9]:
index2word

{34704: 'fawn',
 52009: 'tsukino',
 52010: 'nunnery',
 16819: 'sonja',
 63954: 'vani',
 1411: 'woods',
 16118: 'spiders',
 2348: 'hanging',
 2292: 'woody',
 52011: 'trawling',
 52012: "hold's",
 11310: 'comically',
 40833: 'localized',
 30571: 'disobeying',
 52013: "'royale",
 40834: "harpo's",
 52014: 'canet',
 19316: 'aileen',
 52015: 'acurately',
 52016: "diplomat's",
 25245: 'rickman',
 6749: 'arranged',
 52017: 'rumbustious',
 52018: 'familiarness',
 52019: "spider'",
 68807: 'hahahah',
 52020: "wood'",
 40836: 'transvestism',
 34705: "hangin'",
 2341: 'bringing',
 40837: 'seamier',
 34706: 'wooded',
 52021: 'bravora',
 16820: 'grueling',
 1639: 'wooden',
 16821: 'wednesday',
 52022: "'prix",
 34707: 'altagracia',
 52023: 'circuitry',
 11588: 'crotch',
 57769: 'busybody',
 52024: "tart'n'tangy",
 14132: 'burgade',
 52026: 'thrace',
 11041: "tom's",
 52028: 'snuggles',
 29117: 'francesco',
 52030: 'complainers',
 52128: 'templarios',
 40838: '272',
 52031: '273',
 52133: 'zaniacs',

* Remember that the first review of `x_train` was like this:

In [10]:
x_train[0:1]

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32])],
      dtype=object)

* Now, for all datasets, we will convert the index sequences back to text using the `index2word` mapping.
* We will use list comprehensions to join words for each index sequence.

In [11]:
x_train = np.array([' '.join([index2word[idx] for idx in text]) for text in x_train])
x_dev = np.array([' '.join([index2word[idx] for idx in text]) for text in x_dev])
x_test = np.array([' '.join([index2word[idx] for idx in text]) for text in x_test])

In [12]:
x_train[0:1]

array(["[bos] this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing a

* Now everything is set!
* We are ready to...

## Create the Vocabulary

* From the `x_train` reviews, counting in how many reviews each word appears
* Removing the `n` most frequent AND the `k` least frequent words
* Of the remaining words, we will keep the `m` most frequent <br>
*Where `n`, `k`, `m` are hyperparameters*

In [13]:
def create_voc(x_train, n, k, m):
    
    # Save each word to voc_dict
    voc_dict = dict()
    for text in x_train:
        tokens = set(text.split())
        for token in tokens:
            if token in voc_dict:
                voc_dict[token] += 1
            else:
                voc_dict[token] = 1   
    voc_dict.pop('[bos]', ' ') # Pop the special token
    
    # Print the size of the vocabulary, without trimming it
    print(f"Vocabulary size: {len(voc_dict)}")
    
    # Sort by frequency (low to high)
    vocabulary = sorted(voc_dict.items(), key = lambda x:x[1])
    print(vocabulary[-50:]) # Just to check the 50 most frequent

    # Determine the k hyperparameter as the number of words with frequency=1
    print(f"k (number of words with freq=1): {sum(freq == 1 for word, freq in vocabulary)}")

    # Skip the k least and the n most frequent
    vocabulary = vocabulary[k:len(vocabulary) - n]
    return np.array([x[0] for x in vocabulary[len(vocabulary) - m:]])

# Hyperparameters:
# n=50, chosen by testing
# k=37150, the number of words that appear in only one review (freq=1)
# m=1000, chosen by testing
vocabulary = create_voc(x_train, 50, 37150, 1000)
len(vocabulary)

Vocabulary size: 84149
[('when', 8120), ('more', 8153), ("it's", 8286), ('good', 8614), ('some', 8625), ('what', 8697), ('there', 8723), ('he', 8898), ('has', 9034), ('or', 9274), ('they', 9324), ('about', 9438), ('just', 9479), ('out', 9551), ('his', 9615), ('if', 9639), ('who', 9978), ('like', 10513), ('so', 10534), ('from', 10561), ('by', 10578), ('an', 10963), ('you', 11596), ('at', 11617), ('all', 11734), ('film', 12413), ('are', 12462), ('be', 12681), ('one', 12696), ('have', 12722), ('br', 13185), ('not', 13438), ('movie', 13707), ('on', 14119), ('as', 14450), ('was', 14548), ('with', 15699), ('for', 16052), ('but', 16136), ('i', 17308), ('that', 18002), ('it', 19183), ('in', 19826), ('is', 20181), ('this', 20371), ('to', 21115), ('of', 21369), ('and', 21738), ('a', 21765), ('the', 22315)]
k (number of words with freq=1): 37150


1000

* So we will use `vocabulary` with **1000 words**
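
As a small worked example of the trimming, the sketch below applies the same idea to a four-review toy corpus. The function `toy_create_voc` and the corpus are hypothetical. Unlike `create_voc`, it breaks frequency ties alphabetically so that the result is reproducible; that tie-breaking is an assumption of this sketch.

```python
from collections import Counter

def toy_create_voc(texts, n, k, m):
    """Drop the k rarest and n most common words, then keep the m most frequent of the rest."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.split()))  # count each word once per review
    doc_freq.pop("[bos]", None)  # remove the special token

    # Sort by frequency (low to high); ties are broken alphabetically (assumption)
    ranked = sorted(doc_freq.items(), key=lambda item: (item[1], item[0]))
    trimmed = ranked[k:len(ranked) - n]
    return [word for word, _ in trimmed[max(len(trimmed) - m, 0):]]

toy_texts = [
    "[bos] the plot was great",
    "[bos] the plot was dull",
    "[bos] the acting was great",
    "[bos] the ending",
]

# Frequencies: the=4, was=3, great=2, plot=2, acting=1, dull=1, ending=1
toy_vocab = toy_create_voc(toy_texts, n=1, k=3, m=2)
print(toy_vocab)  # ['plot', 'was']
```

With `k=3` the three words that occur in a single review are dropped, with `n=1` the most common word (`the`) is dropped, and `m=2` keeps the two most frequent of what is left.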

## Create binary vectors

* Now, for all our datasets (`train`, `dev`, `test`),
* we will **transform the texts into binary vectors**,
* with one entry per `vocabulary` word: 1 if the word appears in the review, 0 otherwise.

In [14]:
x_train_binary = list()
x_dev_binary = list()
x_test_binary = list()

# Binary vector for TRAIN data
for text in tqdm(x_train):
    tokens = set(text.split())  # a set makes the membership test below fast
    binary_vector = list()
    for vocab_token in vocabulary:
        if vocab_token in tokens:
            binary_vector.append(1)
        else:
            binary_vector.append(0)
    x_train_binary.append(binary_vector)
x_train_binary = np.array(x_train_binary)

# Binary vector for DEV data
for text in tqdm(x_dev):
    tokens = set(text.split())
    binary_vector = list()
    for vocab_token in vocabulary:
        if vocab_token in tokens:
            binary_vector.append(1)
        else:
            binary_vector.append(0)
    x_dev_binary.append(binary_vector)
x_dev_binary = np.array(x_dev_binary)

# Binary vector for TEST data
for text in tqdm(x_test):
    tokens = set(text.split())
    binary_vector = list()
    for vocab_token in vocabulary:
        if vocab_token in tokens:
            binary_vector.append(1)
        else:
            binary_vector.append(0)
    x_test_binary.append(binary_vector)
x_test_binary = np.array(x_test_binary)

100%|██████████| 22500/22500 [01:07<00:00, 335.41it/s]
100%|██████████| 2500/2500 [00:07<00:00, 332.23it/s]
100%|██████████| 25000/25000 [01:14<00:00, 335.49it/s]
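
To show what a single row of these arrays looks like, here is a minimal sketch on a three-word toy vocabulary. The helper `to_binary_vector` and the toy inputs are hypothetical and only mirror the loops above in compact form.

```python
import numpy as np

def to_binary_vector(text, vocabulary):
    """Return one 0/1 entry per vocabulary word: 1 if the word occurs in the text."""
    tokens = set(text.split())
    return np.array([1 if word in tokens else 0 for word in vocabulary])

toy_vocabulary = ["plot", "was", "great"]
toy_vector = to_binary_vector("[bos] the plot was dull", toy_vocabulary)
print(toy_vector.tolist())  # [1, 1, 0]
```

The vector has the same length as the vocabulary, and words of the review that are not in the vocabulary (`the`, `dull`) are simply ignored.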


* With the vectorization done, our data preparation is complete
* We will now move on to the implementation of our 3 classification algorithms:
    * Naive Bayes
    * Random Forest
    * Logistic Regression

## Naive Bayes

## Random Forest

## Logistic Regression

* We implemented **Logistic Regression** with **stochastic gradient descent**.
* Training runs for a fixed number of passes (`self.n_iters`), each performing as many weight updates as there are examples in the dataset (`X`).
* For each update, a single random training example is selected (`rand_example`) and used to adjust the weights.
* *Because the examples are drawn at random (with replacement), each update is a noisy estimate of the full gradient, which is what makes this stochastic gradient descent*

* **Initialization**
    * `learning_rate` and `n_iters` default to 0.001 and 100, values chosen by testing multiple combinations.
    * The weights are initialized to zeros.

* `fit()`
    * For each of the `n_iters` passes, performs one update per training example, adjusting the model weights using stochastic gradient descent.
$$\vec{\mathbf{w}} = \vec{\mathbf{w}} - \eta \cdot (\mathbf{X}_{\text{rand}}^T \cdot (\sigma(\mathbf{X}_{\text{rand}} \cdot \vec{\mathbf{w}}) - \mathbf{y}_{\text{rand}}))$$
        * **$\vec{\mathbf{w}}$** represents the weight vector
        * **η** is the `learning_rate`.
        * **$\mathbf{X}_{\text{rand}}^T$** is the feature vector for the randomly selected training example (`X[rand_example, :]`).
        * **σ** denotes the sigmoid activation function.
        * **$\mathbf{y}_{\text{rand}}$** is the true label for the randomly selected training example.

* `predict()`
    * Computes the linear prediction for each example in the input data.
    * Applies the sigmoid function to obtain the predicted probability of belonging to class 1.
    * Assigns a binary class label (0 or 1) based on a threshold (0.5 by default).
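
A single update of the rule above can be traced by hand. The sketch below is a toy example: the three-feature vector, its label, and the learning rate of 0.1 are illustrative assumptions (the class default is 0.001), chosen so that the numbers are easy to follow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.1                            # illustrative learning rate (assumption)
w = np.zeros(3)                      # weights start at zero, as in fit()
x_rand = np.array([1.0, 0.0, 1.0])   # toy binary feature vector
y_rand = 1                           # toy true label

prediction = sigmoid(np.dot(x_rand, w))     # sigmoid(0) = 0.5
gradient = x_rand * (prediction - y_rand)   # [-0.5, 0.0, -0.5]
w_new = w - eta * gradient                  # [0.05, 0.0, 0.05]

# After the update the model leans towards class 1 for this example
new_prediction = sigmoid(np.dot(x_rand, w_new))
predicted_class = 0 if new_prediction <= 0.5 else 1
print(w_new, predicted_class)
```

Only the weights of features that are present in the example (value 1) change, and they move in the direction that raises the predicted probability of the true label.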

<img src="sigmoid.png" alt="sigmoid" width="300"/>

In [33]:
class LogisticRegression():

    def __init__(self, learning_rate=0.001, n_iters=100):
        self.learning_rate = learning_rate
        self.n_iters = n_iters
        self.weights = None

    def fit(self, X, y):
        n_examples = X.shape[0]
        n_features = X.shape[1]

        self.weights = np.zeros(n_features)

        for i in range(self.n_iters):
            for j in range(n_examples):
                # Select a random training example
                rand_example = np.random.randint(n_examples)

                linear_pred = np.dot(X[rand_example, :], self.weights)
                prediction = self.sigmoid(linear_pred)
                dw = X[rand_example, :] * (prediction - y[rand_example])

                # SGD
                self.weights = self.weights - self.learning_rate * dw


    def predict(self, X):
        linear_pred = np.dot(X, self.weights)
        y_pred = self.sigmoid(linear_pred)  # predicted probability of class 1
        class_pred = [0 if y <= 0.5 else 1 for y in y_pred]
        return class_pred

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

* For the last section of Part 1,
* we will create the **curves** and the corresponding **tables**

In [35]:
lr = LogisticRegression()
lr.fit(x_train_binary, y_train)  # fit() updates the weights in place and returns nothing
y_pred = lr.predict(x_test_binary)

def accuracy(y_pred, y_test):
    return np.sum(y_pred==y_test)/len(y_test)

acc = accuracy(y_pred, y_test)
print(acc)

[0.24860365 0.9990608  0.46771163 ... 0.06588394 0.041644   0.35676541]
0.85896
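
The curves and tables themselves are not shown in this section. As one possible building block, and assuming that precision, recall and F1 for the positive class are among the quantities to tabulate (an assumption, since only accuracy is reported above), a minimal sketch could look like this. The helper `precision_recall_f1` and the toy labels are hypothetical.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the chosen positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 1, 1, 0, 0]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

In the toy case there are 2 true positives, 1 false positive and 1 false negative, so all three scores equal 2/3. The same call would accept `y_test` and the list returned by `lr.predict(x_test_binary)`.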
