***Sentiment analysis using GRU and Keras***

Sentiment analysis is a natural language processing technique for determining the polarity of a given text. There are several variants, but one of the most common classifies text as either positive or negative. This makes it possible to study comments, tweets, and customer reviews in order to understand feedback from an audience.

Let us see how we can implement this in Keras.

This example is based on the following [article](https://medium.com/@prateekgaurav/nlp-zero-to-hero-part-2-vanilla-rnn-lstm-gru-bi-directional-lstm-77fd60fc0b44).



In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

**Importing packages**

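We import the Keras modules needed to build the model, along with the nltk package and its submodules for preprocessing raw text. The IMDB dataset used in this notebook is already tokenized and integer-encoded, so the nltk tools are not called in the cells that follow.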

In [2]:
from __future__ import print_function

# import keras libraries for building GRU
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import GRU
from keras.datasets import imdb

# import nltk and other submodules for preprocessing the data
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import re

nltk.download('punkt')
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aswathyr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aswathyr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Loading the dataset**

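Let us load the IMDB movie review dataset, which Keras provides already split into train and test sets of 25,000 reviews each. Only the _max_features_ (20,000) most frequent words are kept in the vocabulary.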

In [3]:
max_features = 20000  # vocabulary size: keep only the most frequent words
maxlen = 80  # pad or truncate every review to this many tokens
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
[1m17464789/17464789[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 0us/step
25000 train sequences
25000 test sequences
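
The sequences returned by _imdb.load_data_ are lists of integer word indices rather than text. The sketch below shows how such a sequence might be mapped back to words, assuming the Keras defaults (0 for padding, 1 for the start marker, 2 for out-of-vocabulary words, and word ranks shifted by `index_from=3`). The helper `decode_review` and the toy `word_index` are hypothetical; in this notebook the real mapping would come from `imdb.get_word_index()`.

```python
# Hypothetical helper: turn an integer-encoded review back into text.
# Assumes the default reserved indices used by imdb.load_data.
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}


def decode_review(encoded, word_index, index_from=3):
    """Map integer indices back to words.

    word_index maps word -> frequency rank (1 = most frequent),
    which is the shape of the dictionary imdb.get_word_index() returns.
    """
    # Invert the dictionary and apply the same offset used during encoding.
    index_word = {rank + index_from: word for word, rank in word_index.items()}
    words = []
    for idx in encoded:
        if idx in SPECIAL_TOKENS:
            words.append(SPECIAL_TOKENS[idx])
        else:
            words.append(index_word.get(idx, "<unk>"))
    return " ".join(words)


# Toy vocabulary for illustration only; real ranks will differ.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
print(decode_review([1, 4, 5, 6, 7], toy_word_index))
```

With the real dictionary, the call would look like `decode_review(x_train[0], imdb.get_word_index())`, run before the sequences are padded.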


## Preprocessing step ##

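Once the data is loaded, we use the _pad_sequences_ function to give every sequence the same length, _maxlen_. Shorter reviews are padded with zeros and longer ones are truncated; by default, both padding and truncation happen at the start of the sequence.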


In [4]:
from keras.utils import pad_sequences
print('Pad sequences (samples x time)')
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)
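
To make the padding step concrete, here is a minimal pure-Python sketch of what _pad_sequences_ does to a single sequence under its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The function `pad_pre` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Return seq padded or truncated at the front to exactly maxlen items."""
    # Truncation from the front: keep only the last maxlen tokens.
    kept = list(seq)[-maxlen:]
    # Padding at the front: fill the missing positions with the pad value.
    return [value] * (maxlen - len(kept)) + kept


short_review = [11, 12, 13]
long_review = [21, 22, 23, 24, 25, 26]

print(pad_pre(short_review, maxlen=5))
print(pad_pre(long_review, maxlen=4))
```

One consequence of front truncation is that, with _maxlen_ set to 80, the model sees the last 80 tokens of each long review rather than the first 80.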


## Create the GRU model ##

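Using the _Sequential_ class, we create the model and add an _embedding_ layer, a _GRU_ layer, and a _dense_ output layer. Dropout is applied inside the GRU layer through its _dropout_ and _recurrent_dropout_ arguments rather than as a separate layer.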
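
For readers curious about what the GRU layer computes at each time step, the following is a rough pure-Python sketch of one textbook GRU update. It is an illustration under stated assumptions, not the Keras code: the function names and the `params` layout are hypothetical, dropout is omitted, and Keras differs in details such as where the reset gate is applied and how biases are stored. The state update follows the convention `h = z * h_prev + (1 - z) * h_candidate`; some texts swap the roles of `z` and `1 - z`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, params):
    """One GRU time step.

    params holds, for each gate g in {"z", "r", "h"}:
      W_g: input weights (units x input_dim)
      U_g: recurrent weights (units x units)
      b_g: bias (units)
    """
    def preactivation(gate, state):
        wx = matvec(params["W_" + gate], x)
        uh = matvec(params["U_" + gate], state)
        return [a + b + c for a, b, c in zip(wx, uh, params["b_" + gate])]

    z = [sigmoid(v) for v in preactivation("z", h_prev)]  # update gate
    r = [sigmoid(v) for v in preactivation("r", h_prev)]  # reset gate
    # The reset gate scales the previous state before the candidate is computed.
    gated_state = [ri * hi for ri, hi in zip(r, h_prev)]
    h_candidate = [math.tanh(v) for v in preactivation("h", gated_state)]
    # Blend the old state and the candidate according to the update gate.
    return [zi * hp + (1.0 - zi) * hc
            for zi, hp, hc in zip(z, h_prev, h_candidate)]


# Tiny example: 1 input feature, 1 hidden unit, all weights zero.
zero_params = {name: [[0.0]] for name in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")}
zero_params.update({name: [0.0] for name in ("b_z", "b_r", "b_h")})
print(gru_step([1.0], [0.4], zero_params))
```

With all weights at zero, both gates sit at 0.5 and the candidate state is 0, so the new state is simply half of the previous one.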

In [5]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(GRU(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()

Build model...
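
The summary table is not shown in the output above, so as a rough cross-check the trainable parameter counts can be worked out by hand. The sketch below assumes the GRU layer uses the `reset_after=True` variant, which keeps two bias vectors per gate; the helper functions are hypothetical and exist only for this calculation.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def gru_params(input_dim, units, reset_after=True):
    # Three weight sets (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights, and one or two bias vectors.
    biases_per_gate = 2 * units if reset_after else units
    return 3 * (units * input_dim + units * units + biases_per_gate)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(20000, 128),
    "gru": gru_params(128, 128),
    "dense": dense_params(128, 1),
}
total = sum(counts.values())
print(counts)
print("total:", total)
```

Under these assumptions almost all of the parameters sit in the embedding layer, which is why reducing _max_features_ or the embedding size shrinks the model far more than changing the number of GRU units.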


### Training the model

In [6]:
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [7]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          # note: the test set doubles as the validation set here
          validation_data=(x_test, y_test))

Train...
782/782 ━━━━━━━━━━━━━━━━━━━━ 27s 32ms/step - accuracy: 0.6973 - loss: 11.9552 - val_accuracy: 0.7439 - val_loss: 0.5230


<keras.src.callbacks.history.History at 0x1e9d3547800>

### Evaluating the scores

In [8]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

782/782 ━━━━━━━━━━━━━━━━━━━━ 5s 7ms/step - accuracy: 0.7422 - loss: 0.5259
Test score: 0.5230189561843872
Test accuracy: 0.7439200282096863
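
The two values returned by _model.evaluate_ correspond to the loss and the metric chosen at compile time: the "test score" is the mean binary cross-entropy and the "test accuracy" is the fraction of reviews classified correctly. The sketch below shows how these two quantities could be computed from predicted probabilities in plain Python. The function names, the clipping constant `eps`, and the toy values are assumptions made for illustration, with a 0.5 decision threshold.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels (0/1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) never occurs.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == bool(y))
    return hits / len(y_true)


# Toy labels and sigmoid outputs, not taken from the notebook.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
print("loss:", binary_crossentropy(labels, probs))
print("accuracy:", binary_accuracy(labels, probs))
```

For reference, a model that outputs 0.5 for every review would have a loss of ln 2 (about 0.693), which gives a baseline for reading the reported score.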
