# Text Classification with Recurrent Neural Network (RNN)

In this tutorial, we illustrate how to perform text classification with a Recurrent Neural Network (RNN).

A Recurrent Neural Network (RNN) is a type of artificial neural network designed to process sequential data. It carries a hidden state from one step of the sequence to the next, which makes it well suited to tasks such as time series prediction, sequence generation, language modeling, and other natural language processing tasks.
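
To make the idea of "processing sequential data" concrete, here is a minimal NumPy sketch of the recurrence used by a basic RNN. The function name, the toy sizes, and the random weights are assumptions made for illustration only; this is not how TensorFlow implements its layers internally.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run a basic RNN over one sequence and return the final hidden state.

    inputs: array of shape (timesteps, input_dim)
    W_x:    array of shape (input_dim, hidden_dim)
    W_h:    array of shape (hidden_dim, hidden_dim)
    b:      array of shape (hidden_dim,)
    """
    h = np.zeros(W_h.shape[0])  # initial hidden state
    for x_t in inputs:
        # The new state depends on the current input and on the previous state.
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
timesteps, input_dim, hidden_dim = 5, 3, 4  # toy sizes, chosen for illustration
inputs = rng.normal(size=(timesteps, input_dim))
W_x = rng.normal(size=(input_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

final_state = simple_rnn_forward(inputs, W_x, W_h, b)
print(final_state.shape)  # (4,)
```

The same weights are reused at every time step, so the network can handle sequences of any length with a fixed number of parameters.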

An introduction to recurrent neural networks can be found at: https://en.wikipedia.org/wiki/Recurrent_neural_network

We will use Google's **TensorFlow** to build an RNN. Its high-level Keras API lets us define, compile, and train a neural network in a few lines of code.

### Install TensorFlow

In [1]:
# Run the following command to install tensorflow
! pip install tensorflow



In [2]:
# Verify Installation:
import tensorflow as tf
print(tf.__version__)


2.15.0


### Import Required Libraries

In [3]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Load and Process Dataset
In this tutorial, we continue to use the IMDB movie review dataset.

In [4]:
# Load the dataset
df = pd.read_csv("IMDB Dataset.csv")

# Convert sentiment labels to numerical values
df['sentiment'] = df['sentiment'].map({'negative': 0, 'positive': 1})

# Split the data into features (reviews) and targets (sentiments)
X = df['review']
y = df['sentiment']

### Tokenizing and Padding Sequences

In [5]:
# Initialize a Tokenizer object with a maximum vocabulary size of 10,000 words.
tokenizer = Tokenizer(num_words=10000)

# Fit the tokenizer on the text data to build the word index.
tokenizer.fit_on_texts(X)
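
As a rough illustration of what fitting the tokenizer does, the sketch below builds a frequency-ranked word index in plain Python. It is a simplified approximation, not the Keras implementation: the helper name `build_word_index` and the toy reviews are hypothetical, and the punctuation handling only roughly mirrors the Keras default filters.

```python
from collections import Counter
import string

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    # Treat punctuation (except the apostrophe) as whitespace.
    punctuation = string.punctuation.replace("'", "")
    table = str.maketrans(punctuation, " " * len(punctuation))
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_reviews = ["The movie was good.", "The plot was bad, the acting was good!"]
toy_index = build_word_index(toy_reviews)
print(toy_index)
# {'the': 1, 'was': 2, 'good': 3, 'movie': 4, 'plot': 5, 'bad': 6, 'acting': 7}
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value later. Because punctuation is removed before counting, HTML line-break tags in the reviews are likely the reason `br` ranks so high in the real word index.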

In [6]:
# Display the first 1,001 entries of the word index (most frequent words first)
word_index = tokenizer.word_index
for i, (word, index) in enumerate(word_index.items()):
    print(word, index)
    if i >= 1000:
        break

the 1
and 2
a 3
of 4
to 5
is 6
br 7
in 8
it 9
i 10
this 11
that 12
was 13
as 14
for 15
with 16
movie 17
but 18
film 19
on 20
not 21
you 22
are 23
his 24
have 25
be 26
one 27
he 28
all 29
at 30
by 31
an 32
they 33
so 34
who 35
from 36
like 37
or 38
just 39
her 40
out 41
about 42
if 43
it's 44
has 45
there 46
some 47
what 48
good 49
when 50
more 51
very 52
up 53
no 54
time 55
my 56
even 57
would 58
she 59
which 60
only 61
really 62
see 63
story 64
their 65
had 66
can 67
me 68
well 69
were 70
than 71
much 72
we 73
bad 74
been 75
get 76
do 77
great 78
other 79
will 80
also 81
into 82
people 83
because 84
how 85
first 86
him 87
most 88
don't 89
made 90
then 91
its 92
them 93
make 94
way 95
too 96
movies 97
could 98
any 99
after 100
think 101
characters 102
watch 103
films 104
two 105
many 106
seen 107
character 108
being 109
never 110
plot 111
love 112
acting 113
life 114
did 115
best 116
where 117
know 118
show 119
little 120
over 121
off 122
ever 123
does 124
your 125
better 126
end 127
...

In [7]:

# Convert the text data into sequences of integers based on the word index.
X = tokenizer.texts_to_sequences(X)
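
The conversion can be pictured with the following plain-Python sketch. It assumes the behavior of a tokenizer created with `num_words` and no out-of-vocabulary token: unknown words and words ranked at or beyond `num_words` are silently dropped. The helper name and the toy word index are hypothetical.

```python
def texts_to_sequences_sketch(texts, word_index, num_words=None):
    """Map each word to its index, dropping unknown and too-rare words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word was never seen while fitting
            if num_words is not None and idx >= num_words:
                continue  # word is too rare to be kept
            seq.append(idx)
        sequences.append(seq)
    return sequences

toy_word_index = {"the": 1, "was": 2, "good": 3, "movie": 4, "plot": 5}  # hypothetical index
toy_sequences = texts_to_sequences_sketch(
    ["the movie was good", "the plot was dull"], toy_word_index, num_words=5
)
print(toy_sequences)  # [[1, 4, 2, 3], [1, 2]]
```

Because words are dropped rather than replaced, an encoded review can be shorter than the original text.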

In [8]:
# Display the first 10 encoded reviews
for i in range(10):
    print(X[i])

[27, 4, 1, 79, 2102, 45, 1072, 12, 100, 147, 39, 307, 3184, 398, 474, 26, 3195, 33, 23, 203, 14, 11, 6, 621, 48, 596, 16, 68, 7, 7, 1, 86, 148, 12, 3241, 68, 42, 3184, 13, 92, 5398, 2, 134, 4, 570, 60, 268, 8, 203, 36, 1, 661, 139, 1740, 68, 11, 6, 21, 3, 119, 15, 1, 7888, 2333, 38, 11, 119, 2595, 54, 5911, 16, 5510, 5, 1479, 376, 38, 570, 92, 6, 3804, 8, 1, 360, 356, 4, 1, 661, 7, 7, 9, 6, 433, 3184, 14, 12, 6, 1, 358, 5, 1, 6813, 2538, 1064, 9, 2711, 1421, 20, 538, 32, 4636, 2468, 4, 1, 1208, 117, 29, 1, 7017, 25, 2970, 2, 391, 34, 6, 21, 299, 20, 1, 4910, 7364, 538, 6, 344, 5, 106, 8161, 5050, 7889, 2453, 2, 51, 34, 327, 9106, 7365, 2, 8697, 23, 110, 225, 243, 7, 7, 10, 58, 131, 1, 280, 1324, 4, 1, 119, 6, 693, 5, 1, 192, 12, 9, 269, 117, 79, 276, 589, 3024, 834, 180, 1320, 4161, 15, 2523, 1243, 834, 1443, 834, 887, 3184, 149, 954, 183, 1, 86, 398, 10, 123, 210, 3241, 68, 14, 34, 1637, 9, 13, 2239, 10, 413, 131, 10, 13, 1592, 15, 9, 18, 14, 10, 287, 51, 10, 1417, 3, 1280, 15, 3184, 

In [9]:
# Pad or truncate the sequences so that they all have the same length (200 in this case)
X = pad_sequences(X, maxlen=200)
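
By default, `pad_sequences` pads and truncates at the front of each sequence, which is why a long review keeps only its last 200 tokens. The plain-Python sketch below mimics that default behavior for a single sequence; the helper name is hypothetical and `maxlen` is assumed to be positive.

```python
def pad_sequence_sketch(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen tokens of long ones."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # 'pre' truncation: drop tokens from the front
    return [value] * (maxlen - len(seq)) + seq  # 'pre' padding: add zeros at the front

short_padded = pad_sequence_sketch([5, 8, 2], maxlen=5)
long_truncated = pad_sequence_sketch([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_padded)    # [0, 0, 5, 8, 2]
print(long_truncated)  # [3, 4, 5, 6, 7]
```

Padding at the front keeps the real tokens close to the end of the sequence, so the final hidden state of the RNN is computed right after the last words of the review.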

In [10]:
# Display some x values after padding
for i in range(10):
    print(X[i])

[  12    6    1  358    5    1 6813 2538 1064    9 2711 1421   20  538
   32 4636 2468    4    1 1208  117   29    1 7017   25 2970    2  391
   34    6   21  299   20    1 4910 7364  538    6  344    5  106 8161
 5050 7889 2453    2   51   34  327 9106 7365    2 8697   23  110  225
  243    7    7   10   58  131    1  280 1324    4    1  119    6  693
    5    1  192   12    9  269  117   79  276  589 3024  834  180 1320
 4161   15 2523 1243  834 1443  834  887 3184  149  954  183    1   86
  398   10  123  210 3241   68   14   34 1637    9   13 2239   10  413
  131   10   13 1592   15    9   18   14   10  287   51   10 1417    3
 1280   15 3184    2  189    5    1  299 2046    4 2150  570   21   39
  570   18 7658 7154 5010   26 2983   41   15    3 6904  504   20  642
    2   76  243   16    9   69 7598  651  710 6904  109  662   82 1208
  693    5   65  574    4  920 2021   38 1208  559  147 3184   22  200
  426 3819   16   48    6 3314  805 1603   43   22   67   76    8 1228
   16 

In [11]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Define the Neural Network

In [12]:
# Define the RNN model with TensorFlow's Keras API.
# A Sequential model is a linear stack of neural network layers.
model = tf.keras.Sequential([
    # Embedding layer: maps each integer word index to a dense 32-dimensional vector.
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32, input_length=200),
    # Simple recurrent layer with 64 hidden units; it returns the hidden state after the last time step.
    tf.keras.layers.SimpleRNN(64),
    # Fully connected output layer with one unit and a sigmoid activation for binary classification.
    tf.keras.layers.Dense(1, activation='sigmoid')
])
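
To get a feel for the size of this network, the number of trainable parameters in each layer can be worked out by hand. The arithmetic below assumes the default layer configurations (bias terms enabled); the variable names are chosen for illustration.

```python
vocab_size, embedding_dim, rnn_units = 10000, 32, 64

# Embedding: one 32-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# SimpleRNN: input weights + recurrent weights + bias.
rnn_params = embedding_dim * rnn_units + rnn_units * rnn_units + rnn_units
# Dense: one weight per RNN unit + one bias.
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 320000 6208 65 326273
```

Almost all of the parameters sit in the embedding layer. Calling `model.summary()` on the built model should report the same totals.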




### Compile the Neural Network

In [13]:
# Compile the model with the Adam optimizer, binary cross-entropy loss function, and accuracy metric.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) 
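
Binary cross-entropy penalizes confident wrong predictions much more heavily than confident correct ones. The following plain-Python sketch computes the mean loss for a few hand-picked probabilities; the helper name and the clipping constant `eps` are assumptions for illustration, not the exact Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 3))  # 0.105
print(round(confident_wrong, 3))  # 2.303
```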




### Train the Model

In [14]:
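# Train the model on the training data for 5 epochs with a batch size of 64.
# The test set is passed as validation data, so validation loss and accuracy are reported after each epoch.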
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_test, y_test))

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Evaluate the Model

In [15]:
# Evaluate the model on the test data and compute the test loss and accuracy.
loss, accuracy = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

Test Loss: 0.3460632860660553
Test Accuracy: 0.8543999791145325
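
The accuracy metric is derived from the sigmoid outputs by thresholding them at 0.5. The sketch below shows that step in plain Python on made-up numbers; the helper name, the labels, and the probabilities are hypothetical and are not taken from the model above.

```python
def accuracy_from_probabilities(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the 0/1 label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

toy_labels = [1, 0, 1, 1]
toy_probs = [0.92, 0.30, 0.45, 0.71]  # made-up sigmoid outputs
toy_accuracy = accuracy_from_probabilities(toy_labels, toy_probs)
print(toy_accuracy)  # 0.75
```

With the trained model, the probabilities would come from `model.predict(X_test)`, which returns one value between 0 and 1 per review.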
