# Sentiment Analysis using Bidirectional RNN
Sentiment analysis is the process of determining whether a piece of text is positive, negative, or neutral. It is widely used in social media monitoring, customer feedback and support, identification of derogatory tweets, product analysis, and similar tasks. Here we build a bidirectional RNN to classify a sentence as either positive or negative using the sentiment-140 dataset.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset
import zipfile
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import torchtext
import torch.nn.functional as F
from tqdm.notebook import tqdm as tqdm
# from tqdm.autonotebook import tqdm

## Step 1 - Importing the Dataset
First, import the sentiment-140 dataset. Since sentiment-140 consists of about 1.6 million data samples, let’s only import a subset of it. The current dataset has half a million tweets.

In [23]:
!pip3 install wget
import wget
wget.download("https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/sentiment-analysis-is-bad/data/sentiment140-subset.csv.zip")

!unzip -n sentiment140-subset.csv.zip

Defaulting to user installation because normal site-packages is not writeable
Archive:  sentiment140-subset.csv.zip


## Step 2 - Loading the Dataset
Make sure the pandas library is installed (for example with pip), then import it and read the CSV file. Only the first 50,000 rows are loaded here.

In [2]:
import pandas as pd

data = pd.read_csv('sentiment140-subset.csv', nrows=50000)

## Step 3 - Reading the Dataset
Print the data columns.

In [3]:
data.columns

Index(['polarity', 'text'], dtype='object')

The `text` column holds the tweet and the `polarity` column holds its sentiment label. `polarity` is either 0 or 1: 0 indicates a negative tweet and 1 a positive one.

In [4]:
data.shape

(50000, 2)

In [5]:
data.head(5)

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was ..."
3,1,Wii fit says I've lost 10 pounds since last ti...
4,0,@MrKinetik Not a thing!!! I don't really have...


## Step 4 - Processing the Dataset
A neural network operates on numbers rather than raw text, so each tweet has to be converted into a numeric representation first.
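
As a rough illustration of what "numeric representation" means here, the sketch below turns two short sentences into lists of integer ids using only the standard library. The helpers `simple_tokenize` and `build_vocab` and the special tokens `<pad>` and `<unk>` are hypothetical names chosen for this example; the notebook itself relies on torchtext for these steps.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Map each token to an integer id: special tokens first, then by frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}

sentences = ["Sick today", "Not sick today!"]
tokens = [simple_tokenize(s) for s in sentences]
stoi = build_vocab(tokens)

# Unknown tokens fall back to the id of "<unk>" (an assumption of this sketch)
ids = [[stoi.get(tok, stoi["<unk>"]) for tok in toks] for toks in tokens]
print(tokens)
print(ids)
```

Reserving id 0 for a dedicated padding token, as done in this sketch, keeps padding from being confused with a real word later on.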

In [6]:
# To do so, initialize your tokenizer by setting the maximum number
# of words (features/tokens) that you would want to tokenize a sentence to
from torchtext.data import get_tokenizer

max_features = 4000

tokenizer = get_tokenizer("basic_english")

In [7]:
# preprocess the data : replace in all the text inputs r'\s+' by a space ' '

data['text'] = data['text'].apply(lambda x: re.sub(r'\s+', ' ', x))
data.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was ..."
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have ...
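
To make the effect of the `r'\s+'` substitution concrete, here is a small standalone sketch. The function name `normalize_whitespace` is hypothetical, and the extra `.strip()` call is an addition that the notebook cell does not perform.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # .strip() additionally removes leading and trailing spaces (not done in the notebook)
    return re.sub(r"\s+", " ", text).strip()

raw = "Sick  today\tcoding\nfrom   the couch. "
clean = normalize_whitespace(raw)
print(repr(clean))
```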


In [8]:
# fit the tokenizer onto the text using torchtext library
# hint : set the tokens to lowercase

data['text'] = data['text'].apply(lambda x: tokenizer(x))
data['text'] = data['text'].apply(lambda x: [t.lower() for t in x])

In [9]:
# create the vocabulary from the data
vocab = torchtext.vocab.build_vocab_from_iterator(data['text'])

50000lines [00:00, 221822.73lines/s]


In [10]:
# use the resultant tokenizer to tokenize the text (convert text to sequences then to tensor)
data['text'] = data['text'].apply(lambda x: [vocab[token] for token in x])
data['text'] = data['text'].apply(lambda x: torch.tensor(x))

In [11]:
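# and lastly, pad the tokenized sequences with index 0 so that all input sequences have the same length (max_features)
# the padding uses dtype=torch.long so the result keeps the integer dtype expected by nn.Embedding
data['text'] = data['text'].apply(lambda x: torch.cat((x, torch.zeros(max_features - len(x), dtype=torch.long))))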
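
The padding step can be sketched in plain Python as follows. `pad_or_truncate` and `choose_length` are hypothetical helpers, and the sketch assumes that id 0 is reserved for padding. Tweets are short, so a target length derived from the observed sequence lengths is usually far smaller than a fixed 4000 and makes the recurrent layer much cheaper to run.

```python
def pad_or_truncate(ids, length, pad_id=0):
    """Return exactly `length` ids: truncate long inputs, right-pad short ones."""
    ids = list(ids)[:length]
    return ids + [pad_id] * (length - len(ids))

def choose_length(sequences, coverage=1.0):
    """Pick a length that fully covers the given fraction of the sequences."""
    lengths = sorted(len(s) for s in sequences)
    index = min(len(lengths) - 1, int(coverage * len(lengths)))
    return lengths[index]

sequences = [[5, 8], [3, 9, 4, 1], [7]]
target = choose_length(sequences)  # longest sequence when coverage is 1.0
padded = [pad_or_truncate(s, target) for s in sequences]
print(target, padded)
```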

In [12]:
# Finally, print the shape of the input vector.
data['text'].shape

(50000,)

### Prepare dataset

In [13]:
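# Create a one-hot encoded representation of the output labels using the get_dummies() method.
data = pd.get_dummies(data, columns=['polarity'])

# Retrieve all the input vectors (X) and the labels (y).
# polarity_1 is 1 for positive tweets and 0 for negative ones, so it matches the original labels
# (polarity_0 would flip them: 1 would then mean negative).
X_data = torch.stack(data['text'].tolist())
y_data = data["polarity_1"].astype(int)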

# Split train and test data using the train_test_split() method.
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, train_size=0.8, random_state=42)

In [14]:
# Print the shapes of train and test data.
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(torch.Size([40000, 4000]), torch.Size([10000, 4000]), (40000,), (10000,))
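
The relationship between the original 0/1 labels and the one-hot columns can be checked with a small standard-library sketch. The lists named `polarity_0` and `polarity_1` mimic the column names that `get_dummies()` produces for a `polarity` column. Note that `nn.CrossEntropyLoss` accepts integer class indices as targets, so a single column of 0/1 labels is enough for training.

```python
def one_hot(label, num_classes=2):
    """Return a list with a 1 at position `label` and 0 elsewhere."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

labels = [0, 0, 1, 1, 0]                  # 0 = negative, 1 = positive
encoded = [one_hot(y) for y in labels]

polarity_0 = [row[0] for row in encoded]  # 1 where the tweet is negative
polarity_1 = [row[1] for row in encoded]  # 1 where the tweet is positive

# The class index is recovered as the position of the 1 in each row
recovered = [row.index(1) for row in encoded]
print(polarity_0, polarity_1, recovered)
```

Only the `polarity_1` column reproduces the original labels; `polarity_0` is their complement.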

## Step 5 - Create a Model
Now, let’s create a Bidirectional RNN model. Make a class to define the model.
The model contains several blocks :

* The embedding layer is the input layer; it maps each token index to a vector with embed_dim dimensions.
* The bidirectional layer is an LSTM that reads the sequence in both directions, with lstm_out hidden units per direction.
* The Linear layer is the output layer with 2 nodes (one per class: negative and positive). It returns raw scores (logits); applying softmax to them gives the probability that a text is negative or positive. The softmax is not applied inside the model because nn.CrossEntropyLoss, used for training below, already applies a log-softmax to its input.

In [None]:
class BiRNNModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, lstm_out, output_dim):
        super(BiRNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.birnn = nn.LSTM(embed_dim, lstm_out, bidirectional=True, batch_first=True)
        # Forward and backward states are concatenated, hence lstm_out*2 input features
        self.fc = nn.Linear(lstm_out*2, output_dim)
        self.relu = nn.ReLU()

    def forward(self, text):
        text = text.long()                       # nn.Embedding expects integer indices
        embedded = self.embedding(text)          # (batch, seq_len, embed_dim)
        outputs, (hidden, cell) = self.birnn(embedded)
        # hidden[-2] is the final forward state, hidden[-1] the final backward state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself
        out = self.fc(self.relu(hidden))
        return out
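
To see how tensor shapes flow through the embedding and the bidirectional LSTM, here is a small standalone sketch with toy sizes. All dimensions below are arbitrary values picked for illustration, not the ones used in this notebook. It also checks which slices of `outputs` correspond to the two final hidden states that the model concatenates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, vocab_size, embed_dim, hidden_size = 3, 7, 50, 8, 5

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, bidirectional=True, batch_first=True)

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
embedded = embedding(token_ids)            # (batch, seq_len, embed_dim)
outputs, (hidden, cell) = lstm(embedded)   # outputs: (batch, seq_len, 2*hidden_size)
                                           # hidden:  (2, batch, hidden_size)

# Concatenate the final forward and backward states into one vector per sentence
sentence_vector = torch.cat((hidden[-2], hidden[-1]), dim=1)  # (batch, 2*hidden_size)

# Forward direction ends at the last time step, backward direction at the first
forward_last = outputs[:, -1, :hidden_size]
backward_last = outputs[:, 0, hidden_size:]
print(embedded.shape, outputs.shape, hidden.shape, sentence_vector.shape)
```

Because the forward state is taken after the last time step, long runs of trailing padding are also read by the forward LSTM before that state is produced.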

In [None]:
from torch.cuda import is_available as cuda_is_available

# Define constants
vocab_size = len(vocab)
embed_dim = 256
lstm_out = 196
num_classes = 2

# Set the device to GPU if available, else to CPU
device = torch.device('cuda' if cuda_is_available() else 'cpu')

# Finally, instantiate the model and put it on the device
model = BiRNNModel(vocab_size, embed_dim, lstm_out, num_classes).to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [None]:
# Print the model summary to understand its layer stack.
model

## Step 6 - Training the Model
Train the model for about 20 epochs with a batch size of 128. After each epoch, evaluate the model on the test data.
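
As a quick sanity check on these settings, the following sketch computes how many batches each epoch contains for the 40,000 training and 10,000 test samples produced by the split above. The helper names are hypothetical.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover all samples (the last one may be smaller)."""
    return math.ceil(num_samples / batch_size)

def last_batch_size(num_samples, batch_size):
    """Size of the final, possibly incomplete, batch."""
    remainder = num_samples % batch_size
    return remainder or batch_size

train_batches = batches_per_epoch(40000, 128)
test_batches = batches_per_epoch(10000, 128)
total_updates = 20 * train_batches  # optimizer steps over 20 epochs

print(train_batches, last_batch_size(40000, 128))
print(test_batches, last_batch_size(10000, 128))
print(total_updates)
```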

In [None]:
# Build the training and test dataloaders
y_train = torch.tensor(np.array(y_train), dtype=torch.long)
y_test = torch.tensor(np.array(y_test), dtype=torch.long)

loaders = {
    # Shuffle the training data so that each epoch sees the batches in a different order
    'train' : DataLoader(TensorDataset(X_train, y_train), batch_size=128, shuffle=True, num_workers=4),
    'test'  : DataLoader(TensorDataset(X_test, y_test), batch_size=128, num_workers=4)
}

In [None]:
# Build the training loop (and evaluate at the end of each epoch)
def train_model(model, epochs):
    total_batches = len(loaders['train'])
    train_losses = []
    test_accuracies = []
    
    for epoch in range(epochs):
        # Switch back to training mode: the evaluation at the end of each epoch calls model.eval()
        model.train()
        epoch_loss = 0.0
        for batch_num, (text, polarity) in enumerate(loaders['train'], 1):
            text, polarity = text.to(device), polarity.to(device)
            polarity = polarity.long()
            optimizer.zero_grad()
            output = model(text)
            loss = loss_function(output, polarity)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

            print(f"Epoch [{epoch+1}/{epochs}], Batch [{batch_num}/{total_batches}], Loss: {loss.item():.4f}")

        train_losses.append(epoch_loss / total_batches)

        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for text, polarity in loaders['test']:
                text, polarity = text.to(device), polarity.to(device)
                polarity = polarity.long()

                output = model(text)
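                # .item() converts the count to a Python number, so the accuracy can be plotted even when training on GPU
                correct += (torch.argmax(output, dim=1) == polarity).sum().item()
                total += output.shape[0]
            accuracy = correct / total
            test_accuracies.append(accuracy)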
            print(f"Train epoch: {epoch+1}/{epochs}, accuracy on test dataset: {accuracy:.4f}")

    # Plotting
    plt.figure(figsize=(12, 5))

    # Plot Loss
    plt.subplot(1, 2, 1)
    plt.plot(range(1, epochs + 1), train_losses, label='Train Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Training Loss')
    plt.legend()

    # Plot Accuracy
    plt.subplot(1, 2, 2)
    plt.plot(range(1, epochs + 1), test_accuracies, label='Test Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.title('Test Accuracy')
    plt.legend()

    plt.tight_layout()
    plt.show()
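
The accuracy computed in the evaluation loop can be sketched without PyTorch as follows. The scores and labels are made-up values used only for illustration, and the helper names are hypothetical.

```python
def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=lambda i: scores[i])

def accuracy(batch_scores, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    correct = sum(argmax(s) == y for s, y in zip(batch_scores, labels))
    return correct / len(labels)

# One row of two class scores per example (made-up numbers)
scores = [[2.0, -1.0], [0.1, 0.4], [-0.5, 0.3], [1.5, 1.0]]
labels = [0, 1, 0, 0]

predictions = [argmax(s) for s in scores]
acc = accuracy(scores, labels)
print(predictions, acc)
```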

In [None]:
# Train the model; the loss and accuracy graphs are plotted once training finishes.
train_model(model, 20)

## Step 7 - Perform Sentiment Analysis
Now predict the sentiment (positive or negative) of a user-given sentence. First, define the sentence and preprocess it the same way as the training data.

In [None]:
twt = ['I do not recommend this product']
# Tokenize it.
twt = twt[0].lower()
twt = tokenizer(twt)
twt = [vocab[token] for token in twt]

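# Pad it to the same length as the training inputs, keeping the integer dtype.
twt = torch.tensor(twt, dtype=torch.long)
twt = torch.cat((twt, torch.zeros(max_features - len(twt), dtype=torch.long)))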

In [None]:
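# Predict the sentiment by passing the sentence to the model we built.
model.eval()                            # inference mode
with torch.no_grad():                   # no gradients are needed for prediction
    twt = twt.to(device).unsqueeze(0)   # add a batch dimension
    output = model(twt)
# Show the class scores and the index of the predicted class
output, output.argmax(dim=1)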
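
To turn the two output scores into a readable result, something along the following lines could be used. This is a sketch under assumptions: it treats the inputs as raw scores (logits), it maps index 0 to "negative" and index 1 to "positive" as described in Step 3, and the example scores are made up. If the model already returns probabilities, only the argmax step is needed.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label convention: 0 = negative, 1 = positive
LABELS = {0: "negative", 1: "positive"}

def describe(scores):
    """Return the predicted label and its probability."""
    probs = softmax(scores)
    index = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[index], probs[index]

example_scores = [1.2, -0.3]  # made-up scores for one sentence
label, confidence = describe(example_scores)
print(label, round(confidence, 3))
```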