**LSTM and bi-LSTM Model for IMDB Sentiment Classification**

First, required modules are imported. For this task, torch, torchtext and nltk libraries are used.

In [29]:
import torch
print(torch.__version__) 
import torch.nn.functional as F
import torch.nn as nn
!pip install -U torch==1.8.0 torchtext==0.9.0
import torchtext.legacy as torchtext
import random
!pip install contractions
import contractions
from nltk.tokenize import TweetTokenizer
from nltk.stem import SnowballStemmer

1.7.0


Then, preprocessing function is defined. In this code, I used contradictions, which fixes contractions such as you're to you are. And, I applied tweet tokenizer to tokenize string input. Finally, stemmer function applied to normalize words into their roots roughly. Then, using these fields, IMDB dataset is processed and splitted into training and test. Lengths are also saved for TEXT field.

In [30]:
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
stemmer = SnowballStemmer("english")

def tokenize_fn(text):
  text = contractions.fix(text)
  tokens = tokenizer.tokenize(text)
  tokens = [stemmer.stem(token) for token in tokens]
  return tokens

stop_words = ["and",".",",","the","a","of","is","to","/","it",">","<","br","(",")","this","that","was","for","with","have","be"] # most frequent words and punctuations are dropped.

TEXT = torchtext.data.Field(tokenize = tokenize_fn, lower = True, stop_words=stop_words, include_lengths=True) # length of sequences are included.
LABEL = torchtext.data.LabelField(dtype = torch.float)

train_data, test_data = torchtext.datasets.IMDB.splits(TEXT, LABEL) 

Below, number of examples and 2 examples in detail are shown. We can see that input texts are in arbitrary length. Also, we can see that words are normalized into their roots.

In [31]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")


Number of training examples: 25000
Number of testing examples: 25000


Validation set is also created with splitting training data randomly. Below, train/test/val ratios can be seen.

In [32]:
import random

train_data, valid_data = train_data.split(random_state = random.seed(42))

print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


Below, vocabularies are created. We can see frequency of each instance. Additionally, string to integer (stoi), and integer to string(itos) conversions are provided. Notice that even though we set max_size 17000, number of unique tokens are 17002. It is because including 2 tokens; <\unk> and <\pad>.

In [33]:
TEXT.build_vocab(train_data,min_freq=5,max_size = 17000)
LABEL.build_vocab(train_data)

print(f"Unique tokens in source vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in TRG vocabulary: {len(LABEL.vocab)}")

Unique tokens in source vocabulary: 17002
Unique tokens in TRG vocabulary: 2


Below, training, validation and test iterators created by bucket iterator. It is important to use this module because since our input sequences has arbitrary lengths, this method creates batches with close lengths.

In [34]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = torchtext.data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

**Model**

Finally, model is constructed below. Network consists of an embedding layer, a LSTM layer and one fully connected layer with dropouts. I tried to initialize hidden and cell states, but I couldn't fit shapes of tensors, thus, I use default mode.

In [35]:
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 

                           dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim , output_dim) 
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        #h0 = torch.zeros((n_layers, BATCH_SIZE, hidden_dim)).to(device)
        #c0 = torch.zeros((n_layers, BATCH_SIZE, hidden_dim)).to(device)  
        embedded = self.dropout(self.embedding(text))

        #output, (hidden, cell) = self.rnn(embedded,(h0,c0))
        output, (hidden, cell) = self.rnn(embedded)

        #output = [sent len, batch size, hid dim * num directions]        
        hidden = self.dropout(output)[-1,:,:]
                         
        return self.fc(hidden)

In [36]:
input_dim = len(TEXT.vocab)
output_dim = 1
embedding_dim = 200
hidden_dim = 100
n_layers = 1
dropout = 0.5
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]

model = LSTM(input_dim, 
            embedding_dim, 
            hidden_dim, 
            output_dim, 
            n_layers,  
            dropout, 
            pad_idx)

print(model)

LSTM(
  (embedding): Embedding(17002, 200, padding_idx=1)
  (rnn): LSTM(200, 100, dropout=0.5)
  (fc): Linear(in_features=100, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


In [37]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-3)

criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

After defining custom functions for accuracy, training, evaluation and calculating epoch time, network is trained and, results are printed. I also tried training with more epochs, and see that accuracy is increasing over epochs.

In [38]:
def binary_accuracy(preds, y):

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [39]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()        
        text, text_lengths = batch.text        
        predictions = model(text, text_lengths).squeeze(1)        
        loss = criterion(predictions, batch.label)        
        acc = binary_accuracy(predictions, batch.label)        
        loss.backward()        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [40]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text    
            predictions = model(text, text_lengths).squeeze(1)           
            loss = criterion(predictions, batch.label)            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [41]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [42]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
           
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 5s
	Train Loss: 0.669 | Train Acc: 58.58%
	 Val. Loss: 0.588 |  Val. Acc: 70.55%
Epoch: 02 | Epoch Time: 0m 5s
	Train Loss: 0.613 | Train Acc: 67.46%
	 Val. Loss: 0.608 |  Val. Acc: 72.01%
Epoch: 03 | Epoch Time: 0m 5s
	Train Loss: 0.570 | Train Acc: 72.92%
	 Val. Loss: 0.568 |  Val. Acc: 73.45%
Epoch: 04 | Epoch Time: 0m 5s
	Train Loss: 0.528 | Train Acc: 76.04%
	 Val. Loss: 0.574 |  Val. Acc: 74.06%
Epoch: 05 | Epoch Time: 0m 5s
	Train Loss: 0.510 | Train Acc: 77.41%
	 Val. Loss: 0.575 |  Val. Acc: 74.69%


**bi-LSTM Model**

Model is constructed below. It is same as LSTM (embedding/LSTM/FC and dropouts), however in this part, in lstm layer, bidirectional=True. Also, number of hidden layers are multiplied by 2 since bilstm has 2 lstm layers, one for forward sweep, and one for backward.

Embeddings are packed with nn.utils.rnn.packed_padded_sequence before feeding to biLSTM layer. This enable biLSTM to only process non-padded elements in sequence. After processing through lstm, sequence is unpacked with nn.utils.rnn.pad_packed_sequence, to transform them back to tensors.

Additionally, since we have 2 hidden layers (backward, forward) we finally concatenate them into one single hidden layer output.

In [43]:
import torch.nn as nn

class biLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=True, # bidirectional lstm
                           dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim) # *2 because 2 lstms, fwd and bwd.
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
           
        embedded = self.dropout(self.embedding(text))

        #pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))       
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        #output = [sent len, batch size, hid dim * num directions]        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                         
        return self.fc(hidden)

In [44]:
input_dim = len(TEXT.vocab)
output_dim = 1
embedding_dim = 200
hidden_dim = 100
n_layers = 1
dropout = 0.5
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]

bi_model = biLSTM(input_dim, 
            embedding_dim, 
            hidden_dim, 
            output_dim, 
            n_layers,  
            dropout, 
            pad_idx)

print(model)

LSTM(
  (embedding): Embedding(17002, 200, padding_idx=1)
  (rnn): LSTM(200, 100, dropout=0.5)
  (fc): Linear(in_features=100, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


In [45]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-3)

criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
           
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 5s
	Train Loss: 0.462 | Train Acc: 80.46%
	 Val. Loss: 0.614 |  Val. Acc: 72.90%
Epoch: 02 | Epoch Time: 0m 5s
	Train Loss: 0.458 | Train Acc: 79.63%
	 Val. Loss: 0.493 |  Val. Acc: 79.56%
Epoch: 03 | Epoch Time: 0m 5s
	Train Loss: 0.452 | Train Acc: 79.66%
	 Val. Loss: 0.594 |  Val. Acc: 77.13%
Epoch: 04 | Epoch Time: 0m 5s
	Train Loss: 0.395 | Train Acc: 83.46%
	 Val. Loss: 0.454 |  Val. Acc: 79.64%
Epoch: 05 | Epoch Time: 0m 5s
	Train Loss: 0.361 | Train Acc: 85.01%
	 Val. Loss: 0.377 |  Val. Acc: 84.67%


We can see that bi-lstm model gave significantly better accuracy after 5 epochs. LSTM: 74.66%, bi-LSTM: 84.67%