<a href="https://colab.research.google.com/github/Srivishnu27feb/Sentiment-Analysis-Base-RNN-Pytorch-/blob/master/Sentiment_Analysis_Basic_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
from torchtext import data
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = "spacy")  # tokenize with spaCy (split each review into word tokens)
LABEL = data.LabelField(dtype = torch.float)  # the loss (criterion) used later expects 32-bit floats, but TorchText defaults to LongTensors (64-bit ints), so the dtype is set explicitly

In [2]:
from torchtext import datasets
# Load the IMDB dataset: TEXT holds the review tokens, LABEL the sentiment tag (pos/neg)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:07<00:00, 11.6MB/s]


In [3]:
print(vars(train_data.examples[0]))

{'text': ['Bela', 'made', '9', 'pics', 'for', 'Monogram', ',', 'but', 'it', 'was', 'only', 'at', 'THIS', 'one', ',', 'the', '4TH', ',', 'that', 'things', 'started', 'to', 'come', 'together', '.', 'All', 'the', 'rest', 'in', 'the', 'series', 'would', 'use', 'this', 'one', 'as', 'the', 'essential', 'template', 'for', 'production', ',', 'writing', 'and', 'character', 'development', '.', 'From', 'here', 'on', ',', 'better', 'or', 'worse', ',', 'the', 'series', 'would', 'also', 'deal', 'with', 'one', 'essential', 'theme', ':', 'a', 'scientist', '(', 'usually', 'Bela', ')', 'makes', 'experiments', 'in', 'the', 'basement', 'or', 'the', 'old', 'house', '(', 'sometimes', 'IN', 'the', 'basement', 'in', 'the', 'old', 'house', ')', 'that', 'causes', 'things', 'to', 'go', 'blooey', '.', 'This', 'was', 'also', 'the', 'first', 'time', 'that', 'Art', 'Director', 'Dave', 'Milton', 'got', 'a', 'chance', 'to', 'spread', 'his', 'wings', '.', 'He', 'came', 'on', 'board', 'for', 'BLACK', 'DRAGONS', ',', 'th

By default, the `split` function divides the data in a 70:30 ratio, so we call it on the training data to create a training set and a validation set.
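
As a rough illustration of what a seeded 70:30 split does, here is a minimal sketch in plain Python. It is not torchtext's implementation; the `split_examples` helper and its arguments are hypothetical names used only for this example.

```python
import random

def split_examples(examples, ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` with a fixed seed, then cut it at `ratio`."""
    rng = random.Random(seed)      # private RNG, so the global random state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

train_part, valid_part = split_examples(range(1000))
print(len(train_part), len(valid_part))  # 700 300
```

Because the seed is fixed, calling the helper again returns the same two parts, which is what makes the experiment reproducible.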

In [4]:
import random
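random.seed(SEED)  # random.seed() returns None, so seed first and pass the resulting state explicitly
train_data, valid_data = train_data.split(random_state = random.getstate())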

The training set contains about 100,000 unique words. One-hot encoding them would give 100,000-dimensional vectors and make processing slow, so we restrict the vocabulary to the 25,000 most frequent words and tag every other word as `<unk>` (unknown). The vocabulary is built from the training set only, because in a real setting the test set is unseen.
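
The idea of capping the vocabulary can be sketched in plain Python with `collections.Counter`. This is only an illustration under simplified assumptions, not what `build_vocab` does internally; `build_vocab` below is a hypothetical helper and the three-sentence corpus is made up.

```python
from collections import Counter

def build_vocab(token_lists, max_size):
    """Keep the `max_size` most frequent tokens; everything else maps to <unk>."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Special tokens come first, mirroring the order shown by TEXT.vocab.itos
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [["the", "film", "was", "great"],
          ["the", "film", "was", "dull"],
          ["the", "plot"]]
itos, stoi = build_vocab(corpus, max_size=3)

unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["the", "plot", "was", "great"]]
print(len(itos))  # 5
print(encoded)    # rare words "plot" and "great" share the <unk> index
```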

In [5]:
MAX_VOCAB_SIZE = 25000
TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

The vocabulary has 25,002 entries: the 25,000 words plus the `<unk>` and `<pad>` tokens. `<pad>` is used to pad shorter sentences so that all examples in a batch have the same length.
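
A minimal sketch of padding, assuming sentences are plain lists of tokens. The `pad_batch` helper is a hypothetical name; torchtext does this for us when it builds a batch.

```python
def pad_batch(sentences, pad_token="<pad>"):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]

batch = [["I", "hate", "it"],
         ["Great", "film", ",", "really", "great"]]
for row in pad_batch(batch):
    print(row)
```

After padding, both rows have five tokens, so the batch can be stored as one rectangular tensor.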

In [6]:
print(len(TEXT.vocab),len(LABEL.vocab))

25002 2


In [7]:
print(TEXT.vocab.freqs.most_common(2))

[('the', 203136), (',', 193205)]


In [8]:
print(TEXT.vocab.itos[:3])

['<unk>', '<pad>', 'the']


In [27]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f8f2a1741e0>, {'neg': 0, 'pos': 1})


The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a BucketIterator which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this with `torch.device`, which we then pass to the iterator.
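
To see why grouping examples of similar length reduces padding, here is a small sketch that counts pad tokens with and without sorting by length. The review lengths are made up and both helpers are hypothetical; the real BucketIterator is more elaborate than a plain sort.

```python
def make_batches(lengths, batch_size):
    """Cut a list of sentence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def count_padding(batches):
    """Total number of pad tokens needed to make each batch rectangular."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

lengths = [12, 310, 15, 290, 14, 305, 11, 300]   # made-up review lengths in tokens

unsorted_pad = count_padding(make_batches(lengths, batch_size=4))
bucketed_pad = count_padding(make_batches(sorted(lengths), batch_size=4))
print(unsorted_pad, bucketed_pad)  # 1203 43
```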

In [10]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)


# **Model Building**
The next stage is building the model:

**The process is as follows:**

One-hot encoding (sparse matrix) --> dense matrix (embedding) --> input to the RNN/LSTM/GRU.


The **super** call in the RNN class lets us invoke the base class initializer without naming the base class explicitly, and it also works correctly with multiple inheritance.

**Constructor** (`__init__`): All three layers (the embedding layer, our RNN, and a linear layer) are defined here. It is invoked first when an object of the RNN class is created. All layers have their parameters initialized to random values, unless explicitly specified.

**Embedding Layer:**
The embedding layer transforms our sparse one-hot vector into a dense embedding vector. It is simply a single fully connected layer. Besides reducing the dimensionality of the input to the RNN, the idea is that words with a similar impact on the sentiment of the review are mapped close together in this dense vector space.
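
The claim that an embedding is a fully connected layer applied to a one-hot vector can be checked by hand: multiplying a one-hot vector by the weight matrix just selects one row. The sketch below uses plain lists and a made-up 4 x 3 weight table; `one_hot` and `matvec` are hypothetical helpers.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    """Row vector times matrix: one entry per matrix column."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

# Hypothetical embedding table: 4 vocabulary entries, 3 dimensions each
weights = [
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.5, -0.4, 0.9],
    [-0.7, 0.8, 0.6],
]

token_index = 2
via_matmul = matvec(one_hot(token_index, len(weights)), weights)
via_lookup = weights[token_index]
print(via_matmul == via_lookup)  # True
```

This is why an embedding layer can be implemented as a table lookup instead of an actual matrix product.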

**The RNN layer**: It takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

The `forward` method is called when input examples are fed to the model.
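
The recurrence described above can be written out in a few lines of plain Python. This is a sketch assuming the standard Elman update with a tanh nonlinearity, $h_t = \tanh(W_{ih} x_t + W_{hh} h_{t-1} + b)$, with the two bias vectors merged into one. All weights and inputs are made-up values, and `rnn_step` is a hypothetical helper.

```python
import math

def rnn_step(x, h_prev, w_ih, w_hh, b):
    """One Elman step: h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)."""
    h_next = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(w * xj for w, xj in zip(w_ih[i], x))
        total += sum(w * hj for w, hj in zip(w_hh[i], h_prev))
        h_next.append(math.tanh(total))
    return h_next

# Hypothetical weights: 2-dimensional embeddings, 2-dimensional hidden state
w_ih = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [-0.4, 0.6]]
bias = [0.0, 0.1]
w_fc, b_fc = [1.5, -2.0], 0.05     # linear layer mapping h_T to one logit

sentence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three embedded tokens
hidden = [0.0, 0.0]                                # h_0
outputs = []                                       # hidden state at every step
for x_t in sentence:
    hidden = rnn_step(x_t, hidden, w_ih, w_hh, bias)
    outputs.append(hidden)

logit = sum(w * h for w, h in zip(w_fc, hidden)) + b_fc   # f(h_T)
print(outputs[-1] == hidden, logit)
```

The last entry of `outputs` is the final hidden state, which is the same relationship the `assert` in the model's `forward` method checks.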

In [11]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):

        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        # output holds the hidden state from every time step, whereas hidden is only the final hidden state; the assert confirms the two agree at the last step
        assert torch.equal(output[-1,:,:], hidden.squeeze(0)) # output[-1,:,:] is the forward RNN output for the last token
        
        # squeeze removes the leading dimension of size 1 before the hidden state is passed to the fully connected layer for the prediction
        return self.fc(hidden.squeeze(0))

In [12]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In [24]:
# Check no. of trainable parameters

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,592,105 trainable parameters
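
The reported total can be reproduced by hand from the layer sizes. The sketch below assumes the usual layout of these layers: an embedding table, an RNN with input-to-hidden and hidden-to-hidden weights plus two bias vectors, and a linear layer with a weight and a bias.

```python
vocab_size, emb_dim, hid_dim, out_dim = 25002, 100, 256, 1

embedding = vocab_size * emb_dim                            # lookup table
rnn = emb_dim * hid_dim + hid_dim * hid_dim + 2 * hid_dim   # W_ih, W_hh, b_ih, b_hh
linear = hid_dim * out_dim + out_dim                        # weight and bias

total = embedding + rnn + linear
print(f"{embedding:,} + {rnn:,} + {linear:,} = {total:,}")
# 2,500,200 + 91,648 + 257 = 2,592,105
```

Almost all of the parameters sit in the embedding table, which is one reason the vocabulary size was capped earlier.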


In [14]:
# Optimizer is SGD with learning rate 1e-3; the arguments are (parameters to optimize, learning rate)
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)

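The loss function is commonly called the criterion. Here we use binary cross-entropy with logits: `BCEWithLogitsLoss` applies a sigmoid to the raw model outputs (logits) so that the predictions lie between 0 and 1, and then computes the binary cross-entropy.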
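
What this loss computes can be sketched in plain Python: a sigmoid followed by binary cross-entropy, averaged over the batch. This is a naive version for illustration only; PyTorch uses a numerically more stable formulation, and the logits and targets below are made up.

```python
import math

def sigmoid(z):
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy, computed from raw scores (logits)."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

logits = [2.0, -1.0, 0.5]     # made-up model outputs
targets = [1.0, 0.0, 1.0]     # 1 = pos, 0 = neg
print(round(bce_with_logits(logits, targets), 4))
```

Confident correct predictions (a large positive logit for `pos`, a large negative one for `neg`) contribute a loss close to zero, while confident wrong ones are penalized heavily.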

In [15]:
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

In [18]:
# To calculate Accuracy
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [19]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        # Reset the gradients to zero before each batch
        optimizer.zero_grad()

        # The model output has shape [batch_size, 1]; squeeze the last dimension so it matches the labels' shape [batch_size], as the criterion expects
        predictions = model(batch.text).squeeze(1)
        
        ##Loss function
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        # Backpropagate to compute the gradient of the loss with respect to each parameter
        loss.backward()
        
        # Update the parameters using the computed gradients
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [20]:
## During evaluation we do not compute gradients or update the weights
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [21]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [22]:
## Run the model for the defined set of epochs
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 45s
	Train Loss: 0.694 | Train Acc: 50.37%
	 Val. Loss: 0.698 |  Val. Acc: 48.78%
Epoch: 02 | Epoch Time: 0m 45s
	Train Loss: 0.693 | Train Acc: 49.86%
	 Val. Loss: 0.698 |  Val. Acc: 49.21%
Epoch: 03 | Epoch Time: 0m 45s
	Train Loss: 0.693 | Train Acc: 49.91%
	 Val. Loss: 0.698 |  Val. Acc: 49.26%
Epoch: 04 | Epoch Time: 0m 45s
	Train Loss: 0.693 | Train Acc: 49.94%
	 Val. Loss: 0.698 |  Val. Acc: 48.94%
Epoch: 05 | Epoch Time: 0m 45s
	Train Loss: 0.693 | Train Acc: 50.24%
	 Val. Loss: 0.698 |  Val. Acc: 50.43%


In [23]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.711 | Test Acc: 46.10%
