# Text Classification for the IMDB Dataset using DL
**Objective:** classify IMDB reviews as positive or negative. <br>
In this notebook we explore different DL-based text classification models and compare their performance. <br>
The notebook is coded with Keras and explores the following three architectures:
1. CNN-based models with and without pre-trained embeddings
2. LSTM-based models with and without pre-trained embeddings
3. Transformer-based models with and without pre-trained embeddings (for you to do)

This notebook needs a GPU; Google Colab can be used.

**Useful documentation** <br>
- [Pre-trained embeddings with Keras](https://keras.io/examples/nlp/pretrained_word_embeddings/) 
- [Sentiment classification with LSTM keras](https://slundberg.github.io/shap/notebooks/deep_explainer/Keras%20LSTM%20for%20IMDB%20Sentiment%20Classification.html) 
# Installation of needed libraries

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:

!pip install tensorflow
#!pip install numpy==1.19.5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Importing needed libraries

In [3]:
import os, sys, numpy as np, pandas as pd
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from keras import layers, models, initializers, utils, preprocessing
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM
from keras.models import Model, Sequential
from keras.initializers import Constant

#from tensorflow.keras.preprocessing.text import Tokenizer
#from tensorflow.keras.preprocessing.sequence import pad_sequences
#from tensorflow.keras.utils import to_categorical
#from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
#from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM
#from tensorflow.keras.models import Model, Sequential
#from tensorflow.keras.initializers import Constant

# Downloading dataset & pre-trained GLOVE embeddings
1. [GLOVE](http://nlp.stanford.edu/data/glove.6B.zip)
2. [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)

They are both zipped, thus we need to unzip them. <br>
We will put the data and pre-trained embeddings into a folder called Data.


In [None]:
#if not os.path.exists('Data'):
#    os.mkdir('Data')
#if not os.path.exists('Data/glove.6B'):
#    temp = 'glove.6B.zip'
#    file = ZipFile(temp)
#    file.extractall('Data/glove.6B')
#    file.close()

In [4]:
BASE_DIR = '/content/drive/MyDrive/Data'
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')

df_train = pd.read_excel('/content/drive/MyDrive/Data/train_set_imdb_reviews.xlsx')
df_test = pd.read_excel('/content/drive/MyDrive/Data/test_set_imdb_reviews.xlsx')

# EDA
- Explore both datasets
- Clean the datasets: !!! The cleaning steps should be deduced from the exploration of the train set, not the test set, to guarantee **no data leakage**. However, they are applied to both sets.
- Check if the dataset is balanced or not 
- Bonus: fix the imbalance if it turns out to be the case

At the end of the EDA, set the cleaned reviews (texts) to the variables ``train_texts`` and ``test_texts`` and the sentiments to ``train_labels`` and ``test_labels``. <br>
If this step fails, use the following commands: <br>
1. ``train_texts = df_train.reviews.apply(lambda x: str(x)).tolist()``
2. ``test_texts = df_test.reviews.apply(lambda x: str(x)).tolist()``
3. ``train_labels = df_train.sentiment.tolist()``
4. ``test_labels = df_test.sentiment.tolist()``

In [None]:
# Explore train dataset
df_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   reviews    25000 non-null  object
 1   sentiment  25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [None]:
# Explore test dataset
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   reviews    25000 non-null  object
 1   sentiment  25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [None]:
df_train.isnull().sum(), set(df_train.sentiment)

(reviews      0
 sentiment    0
 dtype: int64,
 {0, 1})

In [None]:
df_test.isnull().sum(), set(df_test.sentiment)

(reviews      0
 sentiment    0
 dtype: int64,
 {0, 1})

In [None]:
df_train.head()

|   | reviews | sentiment |
|---|---------|-----------|
| 0 | Story of a man who has unnatural feelings for ... | 0 |
| 1 | Airport '77 starts as a brand new luxury 747 p... | 0 |
| 2 | This film lacked something I couldn't put my f... | 0 |
| 3 | Sorry everyone,,, I know this is supposed to b... | 0 |
| 4 | When I was little my parents took me along to ... | 0 |


In [None]:
df_train['sentiment'].value_counts()

0    12500
1    12500
Name: sentiment, dtype: int64

In [None]:
df_test['sentiment'].value_counts()

0    12500
1    12500
Name: sentiment, dtype: int64

Both sets are balanced.

In [None]:
df_train['reviews'][1]

"Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. The luxury jetliner takes off as planned but mid-air the plane is hi-jacked by the co-pilot Chambers (Robert Foxworth) & his two accomplice's Banker (Monte Markham) & Wilson (Michael Pataki) who knock the passengers & crew out with sleeping gas, they plan to steal the valuable cargo & land on a disused plane strip on an isolated island but while making his descent Chambers almost hits an oil rig in the Ocean & loses control of the plane sending it crashing into the sea where it sinks to the bottom right bang in the middle of the Bermuda Triangle. With air in short supply, water leaking in & having flown over 200 miles off course the problems mount for 

In [5]:
from bs4 import BeautifulSoup
import re

def strip(text):
    soup = BeautifulSoup(text, "html.parser")
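    # remove anything in square brackets (raw string avoids invalid-escape warnings)
    text = re.sub(r'\[[^]]*\]', '', soup.get_text())
    # keep only letters and whitespace; digits and punctuation are dropped
    pattern = r"[^a-zA-Z\s]"  # alternative: r"[^a-zA-Z0-9\s,']"
    text = re.sub(pattern, '', text)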
    return text
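
To see what the two regular expressions in ``strip`` do on their own, here is a minimal sketch that leaves out the BeautifulSoup step (so it assumes the text contains no HTML). The function name ``strip_regex_only`` and the sample sentence are made up for illustration.

```python
import re

def strip_regex_only(text):
    """Apply only the two regex steps of `strip` (no HTML parsing)."""
    # Step 1: drop anything enclosed in square brackets.
    text = re.sub(r"\[[^]]*\]", "", text)
    # Step 2: drop every character that is not a letter or whitespace.
    return re.sub(r"[^a-zA-Z\s]", "", text)

sample = "A [spoiler] film, rated 10/10! Don't miss it."
cleaned = strip_regex_only(sample)
print(repr(cleaned))
```

Note that digits and apostrophes disappear as well ("Don't" becomes "Dont"), and the removed spans leave double spaces behind. The tokenizer used later splits on whitespace, so the extra spaces should be harmless, but the lost apostrophes do change some tokens.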


In [6]:
df_train['reviews']=df_train['reviews'].astype(str).apply(strip)



In [7]:
df_test['reviews']=df_test['reviews'].astype(str).apply(strip)



In [16]:
### uncomment the following if you fail the cleaning
 #train_texts = df_train.reviews.apply(lambda x: str(x)).tolist()
 #test_texts = df_test.reviews.apply(lambda x: str(x)).tolist()

In [8]:
train_texts = df_train.reviews.tolist()
test_texts = df_test.reviews.tolist()

In [9]:
train_labels = df_train.sentiment.tolist()
test_labels = df_test.sentiment.tolist()

# Tokenization of sentences using keras Tokenizer

In Keras, unlike PyTorch, the Tokenizer not only splits the sentence into words but also converts the words into their ids.<br>
As we have mentioned in class, Keras is a high-level layer on top of TensorFlow implemented to allow novice DL users (more precisely traditional ML users) to develop DL models. <br>

**Remember**, the pre-processing is learnt by looking at the train dataset only to guarantee **no data leakage**, and it is applied on both datasets. 
&rarr; we fit the tokenizer on training data, then use it to tokenize both datasets. <br>
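
The idea of fitting on the training texts only and then applying the same mapping everywhere can be sketched in plain Python. This is a simplified stand-in for the Keras Tokenizer, not its actual implementation: it only lowercases and splits on whitespace, and the helper names ``fit_word_index`` and ``texts_to_ids`` are hypothetical.

```python
from collections import Counter

def fit_word_index(train_texts):
    """Rank words by frequency in the TRAINING texts; ids start at 1."""
    counts = Counter(w for text in train_texts for w in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Map words to ids, dropping unknown words and ids >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

train = ["the movie was good", "the movie was bad", "the plot"]
test = ["the plot was good", "unseen movie"]

word_index = fit_word_index(train)             # learnt from train only
train_ids = texts_to_ids(train, word_index, num_words=4)
test_ids = texts_to_ids(test, word_index, num_words=4)
print(word_index)
print(test_ids)
```

Words that never appear in the training texts (here "unseen") simply vanish from the test sequences, which is the price of not looking at the test set while building the vocabulary.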


In [10]:
#Vectorize these text samples into a 2D integer tensor using Keras Tokenizer 
# 
MAX_NUM_WORDS = 20000 #parameter
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS) #take 20000 most frequent 
tokenizer.fit_on_texts(train_texts) 
train_sequences = tokenizer.texts_to_sequences(train_texts) #Converting text to a vector of word indexes 
test_sequences = tokenizer.texts_to_sequences(test_texts) 
word_index = tokenizer.word_index 
print('Found %s unique tokens.' % len(word_index))

Found 138399 unique tokens.


In [None]:
train_sequences[0]

[62, 4, 3, 129, 34, 42, 7341, 1339, 15, 3, 4460, 492,
 43, 14, 3, 602, 131, 11, 6, 3, 1263, 446, 4, 1722,
 219, 3, 10698, 7452, 306, 6, 664, 76, 32, 2046, 1078, 2938,
 31, 1, 904, 4, 28, 5250, 486, 8, 2533, 1722, 1, 217,
 59, 14, 55, 788, 1270, 822, 245, 8, 40, 98, 125, 1449,
 53, 141, 35, 1, 1052, 137, 25, 664, 125, 1, 13542, 404,
 56, 93, 2192, 287, 757, 5, 3, 851, 13543, 19, 3, 1655,
 668, 28, 122, 69, 22, 226, 101, 14, 45, 48, 598, 31,
 693, 83, 693, 385, 3478, 12650, 2, 16497, 8076, 67, 25, 106,
 3344]

Since we are dealing with a classical ML/DL model, the input dimension must be fixed. <br>
Just as the number of attributes/features/columns is fixed in a traditional ML model, the input dimension is fixed in a DL model. <br>
In our case, the input features are sentences, i.e. lists of words. To give every input the same size, we fix a maximum length parameter (MAX_LEN), which is the maximum number of words kept per sentence. <br>
You might ask: every sentence has a different set of words, so shouldn't the input size equal the number of unique words in our corpus? <br>
The answer is no, because we never feed the words themselves to the model; we feed embeddings, and every word is embedded as a vector of the same dimension $d$ &rarr; every sentence of our corpus is transformed into an input of size MAX_LEN $\times d$ &rarr; all inputs have the same size. <br>
Thus: <br>
- sentences with more than MAX_LEN words are truncated; we chose post-truncation, i.e. the first MAX_LEN words are retained and the remaining words are removed. 
- sentences with fewer than MAX_LEN words are padded; we chose post-padding, i.e. the id 0 is appended after the ids of the words present in the sentence.
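
The two rules above can be sketched in a few lines of plain Python. This is only an illustration of the logic behind ``padding='post'`` and ``truncating='post'``; the helper name ``pad_post`` is hypothetical and the notebook itself uses Keras ``pad_sequences``.

```python
def pad_post(sequences, maxlen, value=0):
    """Post-truncate and post-pad lists of token ids to length `maxlen`."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                # keep the first maxlen ids
        kept += [value] * (maxlen - len(kept))   # fill the tail with the pad id
        padded.append(kept)
    return padded

short = [5, 8, 2]
long_ = [7, 1, 9, 4, 3, 6]
result = pad_post([short, long_], maxlen=4)
print(result)  # [[5, 8, 2, 0], [7, 1, 9, 4]]
```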

In [11]:
MAX_LEN = 1000
trainvalid_data = pad_sequences(sequences=train_sequences, maxlen=MAX_LEN, padding='post', truncating='post', value=0.0)
test_data = pad_sequences(sequences=test_sequences, maxlen=MAX_LEN, padding='post', truncating='post', value=0.0)

In [27]:
trainvalid_data[0], test_data[0]

(array([   10,   429,     9,    16,     2,    75,   102,     8,   171,
          196,   976,     5,   111,     4,   129,  1791,   129,  1791,
         1280,    10,   101,     9,     6,  3142, 12501,   113,    16,
           23,   509,  6150,    49,    69,    86,   148,   684,   379,
          221,   424,    15,  5029,     7,  2565,     4,  1340, 11627,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
      

# Converting the target into a categorical tensor variable for DL model training
Keras provides the function ``to_categorical``, which transforms each integer label into a one-hot encoded array whose dimension equals the number of unique categories: the value at index i is 1 if the data sample belongs to category i and 0 otherwise. ``to_categorical`` takes a single class index per sample, so each encoded label contains exactly one 1; a sample belonging to several categories at once (multi-label) would need a multi-hot encoding built another way. <br>
Here there are 2 categories: neg and pos &rarr; the dimension is 2. <br>
Example: the target of a pos review is converted with ``to_categorical`` to ``array([0,1])``, while the target of a neg review is converted to ``array([1,0])``.
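
As a quick illustration of the encoding described above, here is a minimal plain-Python sketch of one-hot encoding. The helper ``one_hot`` is hypothetical and only mirrors the basic behaviour of ``to_categorical`` for a flat list of integer labels.

```python
def one_hot(labels, num_classes=None):
    """Turn integer class labels into one-hot rows (lists of floats)."""
    if num_classes is None:
        num_classes = max(labels) + 1   # infer from the largest label
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0                # exactly one 1 per sample
        encoded.append(row)
    return encoded

encoded = one_hot([0, 1, 1])            # neg, pos, pos
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```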

In [12]:
trainvalid_labels = to_categorical(np.asarray(train_labels))
test_labels = to_categorical(np.asarray(test_labels))

In [13]:
train_labels[12500] , trainvalid_labels[12500]

(1, array([0., 1.], dtype=float32))

# Split the training data into a training set and a validation set

In [14]:
VALIDATION_SPLIT = 0.2
indices = np.arange(trainvalid_data.shape[0])
np.random.shuffle(indices)
trainvalid_data = trainvalid_data[indices]
trainvalid_labels = trainvalid_labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * trainvalid_data.shape[0])
x_train = trainvalid_data[:-num_validation_samples]
y_train = trainvalid_labels[:-num_validation_samples]
x_val = trainvalid_data[-num_validation_samples:]
y_val = trainvalid_labels[-num_validation_samples:]
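
The shuffle above is not seeded, so the validation set changes on every run. Below is a minimal sketch of a reproducible split on indices, using only the standard library; the function name ``split_indices`` and the seed value are assumptions for illustration. It also avoids the ``[:-0]`` edge case, which would return an empty training set if the validation size rounded down to 0.

```python
import random

def split_indices(n_samples, validation_split=0.2, seed=42):
    """Return (train_idx, val_idx) from a seeded shuffle of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # same order for the same seed
    n_val = int(validation_split * n_samples)
    split_at = n_samples - n_val              # safe even when n_val == 0
    return indices[:split_at], indices[split_at:]

train_idx, val_idx = split_indices(10, validation_split=0.2, seed=42)
print(len(train_idx), len(val_idx))  # 8 2
```

The resulting index lists can be used to select rows from both the data and the label arrays so that the two stay aligned.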

# Convert the token ids into embedding vectors
1. Extract embeddings from glove.6B.100d.txt
2. Convert the words in the dataset into embeddings using the dictionary from step 1
3. Create the embedding layer for keras; this will be the first layer of our DL model.

In [41]:
EMBEDDING_DIM = 100 
print('Preparing embedding matrix.')

# first, build index mapping words in the embeddings set to their embedding vector
#  every line in glove.6B.100d.txt contains the word followed by the embedding vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'),encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors in Glove embeddings.' % len(embeddings_index))

# prepare embedding matrix - rows are the words from word_index, columns are the embeddings of that word from glove.
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index keep their all-zeros row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Preparing embedding matrix.
Found 400000 word vectors in Glove embeddings.
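
To make the structure of the embedding matrix concrete, here is a toy version of the same two steps with 3-dimensional vectors and plain Python lists. The in-memory "file", the tiny vocabulary and the made-up word "zzyzx" are all assumptions for illustration; only the line format (a word followed by its vector) follows the GloVe text files.

```python
import io

EMBEDDING_DIM = 3  # toy dimension; the notebook uses 100

# Hypothetical stand-in for glove.6B.100d.txt: "word v1 v2 v3" per line.
toy_glove = io.StringIO(
    "good 0.1 0.2 0.3\n"
    "bad -0.1 -0.2 -0.3\n"
    "movie 0.5 0.0 0.5\n"
)

# Step 1: word -> vector dictionary.
embeddings_index = {}
for line in toy_glove:
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

# Toy tokenizer vocabulary; id 0 is reserved for padding.
word_index = {"movie": 1, "good": 2, "zzyzx": 3}

# Step 2: row i of the matrix holds the vector of the word with id i.
num_words = len(word_index) + 1
embedding_matrix = [[0.0] * EMBEDDING_DIM for _ in range(num_words)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:          # unknown words keep their zero row
        embedding_matrix[i] = vector

for row in embedding_matrix:
    print(row)
```

Row 0 (padding) and row 3 (the word missing from the toy embeddings) stay all-zeros, which is also what happens in the notebook for review words that GloVe does not cover.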


In [42]:
# load these pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed during training
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_LEN,
                            trainable=False)
print("Preparing of embedding matrix is done")

Preparing of embedding matrix is done


# Training and evaluating the DL model
We will test 3 DL models:
- 1D CNN-based architecture
- LSTM-based architecture
- Transformer-based architecture (to do it on your own)



### 1D CNN Model with pre-trained embedding

In [None]:
print('Define a 1D CNN model.')

cnnmodel = Sequential()
cnnmodel.add(embedding_layer)
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation='relu'))
#cnnmodel.add(Dense(len(labels_index), activation='softmax'))
cnnmodel.add(Dense(2, activation='softmax'))
cnnmodel.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
#Train the model. Tune to validation set. 
cnnmodel.fit(x_train, y_train,
          batch_size=128,
          epochs=1, validation_data=(x_val, y_val))
#Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print('Test accuracy with CNN:', acc)

Define a 1D CNN model.
Test accuracy with CNN: 0.6414399743080139


### 1D CNN model with training your own embedding
The only difference here is that the ``embedding_layer`` we created from the pre-trained GloVe embeddings is no longer used. Instead, we initialize an embedding layer with randomly initialized weights, ``Embedding(MAX_NUM_WORDS, 128)``, and train it together with the rest of the model.

In [None]:
print("Defining and training a CNN model, training embedding layer on the fly instead of using pre-trained embeddings")
cnnmodel = Sequential()
cnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation='relu'))
#cnnmodel.add(Dense(len(labels_index), activation='softmax'))
cnnmodel.add(Dense(2, activation='softmax'))
cnnmodel.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
#Train the model. Tune to validation set. 
cnnmodel.fit(x_train, y_train,
          batch_size=128,
          epochs=1, validation_data=(x_val, y_val))
#Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print('Test accuracy with CNN:', acc)

Defining and training a CNN model, training embedding layer on the fly instead of using pre-trained embeddings
Test accuracy with CNN: 0.501800000667572


### LSTM Model with training your own embedding 

In [None]:
print("Defining and training an LSTM model, training embedding layer on the fly")

#model
rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel.add(Dense(2, activation='sigmoid'))
rnnmodel.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print('Training the RNN')

rnnmodel.fit(x_train, y_train,
          batch_size=32,
          epochs=1,
          validation_data=(x_val, y_val))
score, acc = rnnmodel.evaluate(test_data, test_labels,
                            batch_size=32)
print('Test accuracy with RNN:', acc)
#Test accuracy with RNN: 0.82998



Defining and training an LSTM model, training embedding layer on the fly
Training the RNN
Test accuracy with RNN: 0.5000799894332886
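
The CNN cells above end in a softmax with ``categorical_crossentropy``, while the LSTM cells end in two sigmoid units with ``binary_crossentropy``. Both accept the one-hot labels built earlier, but they compute different quantities. The sketch below, in plain Python with made-up probabilities, shows the two formulas for a single sample; it ignores the small clipping Keras applies for numerical stability.

```python
import math

def categorical_crossentropy(y_true, probs):
    """-sum_i y_i * log(p_i); probs are softmax outputs that sum to 1."""
    return -sum(t * math.log(p) for t, p in zip(y_true, probs))

def binary_crossentropy(y_true, probs):
    """Mean over output units of the per-unit binary cross-entropy."""
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p)
             for t, p in zip(y_true, probs)]
    return -sum(terms) / len(terms)

y_true = [0.0, 1.0]            # one-hot label of a positive review
softmax_probs = [0.3, 0.7]     # made-up softmax output (sums to 1)
sigmoid_probs = [0.6, 0.7]     # made-up sigmoid outputs (independent units)

cce = categorical_crossentropy(y_true, softmax_probs)
bce = binary_crossentropy(y_true, sigmoid_probs)
print(cce, bce)
```

With two sigmoid units each output is scored as its own yes/no question, so the two outputs need not sum to 1. With softmax the two classes compete for the same probability mass.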


### LSTM Model using pre-trained Embedding Layer

In [None]:
print("Defining and training an LSTM model, using pre-trained embedding layer")

rnnmodel2 = Sequential()
rnnmodel2.add(embedding_layer)
rnnmodel2.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel2.add(Dense(2, activation='sigmoid'))
rnnmodel2.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print('Training the RNN')

rnnmodel2.fit(x_train, y_train,
          batch_size=32,
          epochs=1,
          validation_data=(x_val, y_val))
score, acc = rnnmodel2.evaluate(test_data, test_labels,
                            batch_size=32)
print('Test accuracy with RNN:', acc)
#Test accuracy with RNN: 0.793



Defining and training an LSTM model, using pre-trained embedding layer
Training the RNN
Test accuracy with RNN: 0.4999600052833557


### Transformer Model 
Refer to the [keras tutorial](https://keras.io/examples/nlp/text_classification_with_transformer/) to implement and evaluate your model 

In [44]:
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.2):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
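
The ``MultiHeadAttention`` layer inside the block is built around scaled dot-product attention. As a rough illustration, here is a single-head sketch in plain Python that uses the inputs directly as queries, keys and values. This is a simplification: the real layer first applies learned projections per head and a final output projection, none of which appear here.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x.

    x is a list of token vectors (seq_len x d).
    Returns (output, weights), where weights[i][j] is how much
    token i attends to token j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return output, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 toy tokens, d = 2
out, attn = self_attention(tokens)
for row in attn:
    print([round(w, 3) for w in row])
```

Each output vector is a weighted average of all token vectors, which is why the block can mix information across the whole review in a single layer.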

### Transformer Model using own Embedding Layer

In [49]:
print("Defining and training a transformer model, using own embedding layer")

num_heads = 2  # Number of attention heads
ff_dim = 32 # Hidden layer size in feed forward network inside transformer



transmodel = Sequential()
transmodel.add(Input(shape=(MAX_LEN,)))
transmodel.add(Embedding(input_dim=num_words, output_dim=EMBEDDING_DIM, input_length=MAX_LEN))
transmodel.add(TransformerBlock(EMBEDDING_DIM, num_heads, ff_dim))
transmodel.add(layers.GlobalAveragePooling1D())
transmodel.add(layers.Dropout(0.2))
transmodel.add(Dense(2, activation="sigmoid"))
transmodel.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

print('Training the Transformer')


transmodel.fit(x_train, y_train,
          batch_size=32,
          epochs=2,
          validation_data=(x_val, y_val))
score, acc = transmodel.evaluate(test_data, test_labels,
                            batch_size=32)
print('Test accuracy with Transformer:', acc)

Defining and training a transformer model, using own embedding layer
Training the Transformer
Epoch 1/2
Epoch 2/2
Test accuracy with Transformer: 0.8705599904060364


### Transformer Model using pre-trained Embedding Layer

In [48]:
print("Defining and training a transformer model, using pre-trained embedding layer")


num_words = MAX_NUM_WORDS
num_heads = 2 # Number of attention heads
ff_dim = 32 # Hidden layer size in feed forward network inside transformer



transmodel = Sequential()
transmodel.add(Input(shape=(MAX_LEN,)))
transmodel.add(embedding_layer)
transmodel.add(TransformerBlock(embed_dim=EMBEDDING_DIM, num_heads=num_heads, ff_dim=ff_dim))
transmodel.add(layers.GlobalAveragePooling1D())
transmodel.add(layers.Dropout(0.2))
transmodel.add(Dense(2, activation="sigmoid"))

transmodel.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

print('Training the Transformer')


transmodel.fit(x_train, y_train,
          batch_size=32,
          epochs=2,
          validation_data=(x_val, y_val))
score, acc = transmodel.evaluate(test_data, test_labels,
                            batch_size=32)
print('Test accuracy with Transformer:', acc)

Defining and training a transformer model, using pre-trained embedding layer
Training the Transformer
Epoch 1/2
Epoch 2/2
Test accuracy with Transformer: 0.839959979057312
