<a href="https://colab.research.google.com/github/Ram001code/Sentiment_Analysis_pytorch/blob/main/Sentiment_Analysis_Pytorch_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [2]:
import random
import torch
# from torchtext import data


In [3]:
seed = 42

torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



```
class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)

# Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations.

```

tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.

tokenizer_language – The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.

include_lengths – Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
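To make the effect of `include_lengths` concrete, here is a minimal pure-Python sketch of the (padded minibatch, lengths) pair described above. The helper name `pad_batch` is hypothetical and this is not torchtext's implementation: the real Field returns index tensors rather than lists of strings.

```python
# Illustrative only: mimics the (padded batch, lengths) pair, not torchtext's code.
def pad_batch(sentences, pad_token="<pad>", tokenize=str.split):
    """Tokenize each sentence and pad all of them to the longest length."""
    tokenized = [tokenize(s) for s in sentences]
    lengths = [len(tokens) for tokens in tokenized]
    max_len = max(lengths)
    padded = [tokens + [pad_token] * (max_len - len(tokens)) for tokens in tokenized]
    return padded, lengths

padded, lengths = pad_batch(["this film was great", "horrible"])
print(padded)
print(lengths)  # [4, 1]
```

The lengths are what later let the model skip the padding positions when it packs the sequences.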


In [4]:
#pip install pytorch torchvision -c pytorch

!pip install torchtext==0.10.0      # torchtext.legacy was removed in later releases, so we pin version 0.10.0



The legacy components now live under torchtext.legacy.data, as follows:


torchtext.data.Pipeline -> torchtext.legacy.data.Pipeline

torchtext.data.Batch -> torchtext.legacy.data.Batch

torchtext.data.Example -> torchtext.legacy.data.Example

torchtext.data.Field -> torchtext.legacy.data.Field

torchtext.data.Iterator -> torchtext.legacy.data.Iterator

torchtext.data.Dataset -> torchtext.legacy.data.Dataset


In [5]:
from torchtext.legacy import data

In [6]:
txt = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  include_lengths = True)

In [7]:
from torchtext.legacy import datasets

In [8]:
''' 
Here we download the IMDB dataset for sentiment analysis and divide it into train, validation, and test splits.
The dataset already comes divided into a train and a test set; we carve a validation set out of the training set.

We also limit the vocabulary the model will learn to 25,000 words: the 25,000 most frequent words in the training data are kept.
This substantially reduces the size of the embedding layer, typically with little loss in accuracy.

'''


labels = data.LabelField(dtype = torch.float)

train_data, test_data = datasets.IMDB.splits(txt, labels)

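# random.seed() returns None, so seed Python's RNG first and pass its state explicitly.
random.seed(seed)
train_data, valid_data = train_data.split(random_state = random.getstate())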

num_words = 25_000



downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:01<00:00, 75.5MB/s]
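The train/validation split can be pictured with a small stdlib sketch. It assumes a 70/30 ratio (the default `split_ratio` of the legacy `split` method) and uses a hypothetical helper `split_examples`; the real method shuffles with its own shuffler, so the exact assignment of reviews will differ.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=42):
    """Shuffle a copy of the examples with a seeded RNG and cut it in two."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

reviews = list(range(25_000))  # stand-ins for the 25,000 IMDB training reviews
train_part, valid_part = split_examples(reviews)
print(len(train_part), len(valid_part))  # 17500 7500
```

Because the generator is seeded, calling the helper again yields the same split.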


In [9]:
''' 
build_vocab(*args, **kwargs)

Construct the Vocab object for this field from one or more datasets.

Parameters:	
arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that 
represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; 
individual columns can also be provided directly.

keyword arguments (Remaining) – Passed to the constructor of Vocab.


'''


txt.build_vocab(train_data, 
                 max_size = num_words, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

labels.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:39, 5.40MB/s]                           
100%|█████████▉| 399999/400000 [00:13<00:00, 29517.79it/s]
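As a rough illustration of what `max_size` does, the sketch below keeps only the most frequent tokens and places the two special tokens first. The function `build_vocab` here is a hypothetical stand-in written with `collections.Counter`, not torchtext's Vocab, but it shows why a cap of 25,000 words leads to 25,002 embedding rows.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Map the max_size most frequent tokens to indices, after the special tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [["the", "film", "was", "great"], ["the", "film", "was", "bad"], ["the", "cast"]]
itos, stoi = build_vocab(corpus, max_size=3)
print(itos)       # ['<unk>', '<pad>', 'the', 'film', 'was']
print(len(itos))  # 5, i.e. max_size + 2 special tokens
```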


In [10]:
'''
We now create training, validation, and test iterators that feed the data to the model in batches of 64 samples at a time.
Reduce the batch size if you get an out-of-memory error.
'''

btch_size = 64

train_itr, valid_itr, test_itr = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = btch_size,
    sort_within_batch = True,
    device = device)
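The idea behind bucketing can be sketched in a few lines: examples of similar length are batched together to minimize padding, and `sort_within_batch = True` orders each batch from longest to shortest, which is the order `pack_padded_sequence` expects by default. The helper `bucket_batches` is hypothetical and simplified; the real BucketIterator also shuffles and works on Example objects.

```python
def bucket_batches(examples, batch_size):
    """Group token lists of similar length, longest first inside each batch."""
    by_length = sorted(examples, key=len)
    batches = [by_length[i:i + batch_size] for i in range(0, len(by_length), batch_size)]
    return [sorted(batch, key=len, reverse=True) for batch in batches]

examples = [["a"] * n for n in (5, 1, 9, 2, 8, 3)]
for batch in bucket_batches(examples, batch_size=2):
    print([len(e) for e in batch])
# [2, 1]
# [5, 3]
# [9, 8]
```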



We now define the sentiment analysis model class and then pass our hyperparameters to an instance of it: the input (vocabulary) size, the embedding dimension, the hidden dimension, and the output dimension, along with the number of layers, the dropout rate, and a bidirectionality flag. We also pass the pad token index from the vocabulary that we created earlier.

In [11]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, word_limit, dimension_embedding, dimension_hidden, dimension_output, num_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(word_limit, dimension_embedding, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(dimension_embedding, 
                           dimension_hidden, 
                           num_layers=num_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
        self.fc = nn.Linear(dimension_hidden * 2, dimension_output)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, len_txt):
        
        
        embedded = self.dropout(self.embedding(text))
               

        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, len_txt.to('cpu'))
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                            
        return self.fc(hidden)



Now we print some details about our model, starting with the number of trainable parameters.

We then take the pre-trained embedding weights and copy them into our model, so that it does not need to learn the embeddings from scratch and can focus on the job at hand: learning the sentiment associated with those embeddings.

The pretrained embedding weights replace the initial ones.

In [12]:
dimension_input = len(txt.vocab)
dimension_embedding = 100
dimension_hddn = 256
dimension_out = 1
layers = 2
bidirectional = True
dropout = 0.5
idx_pad = txt.vocab.stoi[txt.pad_token]


model = RNN(dimension_input, 
            dimension_embedding, 
            dimension_hddn, 
            dimension_out, 
            layers, 
            bidirectional, 
            dropout, 
            idx_pad)

In [13]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
 
print(f'The model has {count_parameters(model):,} trainable parameters')
pretrained_embeddings = txt.vocab.vectors
 
print(pretrained_embeddings.shape)
unique_id = txt.vocab.stoi[txt.unk_token]
 
model.embedding.weight.data[unique_id] = torch.zeros(dimension_embedding)
model.embedding.weight.data[idx_pad] = torch.zeros(dimension_embedding)
 
print(model.embedding.weight.data)

The model has 4,810,857 trainable parameters
torch.Size([25002, 100])
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 1.3603,  1.1402, -1.0729,  ..., -0.0867, -0.9023, -0.9291],
        ...,
        [-0.6512, -0.2244, -0.3158,  ..., -1.5751,  1.8184,  0.0519],
        [ 0.8822, -0.6750, -0.6353,  ...,  1.4760, -1.5389, -0.0588],
        [ 0.7433,  0.7861,  1.1492,  ..., -0.4720, -1.1798,  0.7291]])
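The reported parameter count can be checked by hand. The sketch below is plain arithmetic, assuming the usual LSTM layout (four gates, each with input weights, recurrent weights, and two bias vectors per direction); the helper `lstm_params` is hypothetical and only mirrors the hyperparameters set in cell 12.

```python
def lstm_params(input_size, hidden_size, num_layers, bidirectional):
    """Count weights and biases of a stacked LSTM laid out like torch.nn.LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first receive the concatenated outputs of both directions.
        layer_input = input_size if layer == 0 else hidden_size * directions
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        per_direction = 4 * hidden_size * (layer_input + hidden_size) + 2 * 4 * hidden_size
        total += directions * per_direction
    return total

vocab_size, emb_dim, hidden, out_dim = 25_002, 100, 256, 1
embedding = vocab_size * emb_dim                      # 2,500,200
lstm = lstm_params(emb_dim, hidden, num_layers=2, bidirectional=True)
fc = (hidden * 2) * out_dim + out_dim                 # 513
print(f"{embedding + lstm + fc:,}")  # 4,810,857
```

More than half of the parameters sit in the embedding table, which is one reason the vocabulary cap matters.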


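Looking at the above steps in detail, this time copying the pretrained vectors into the embedding layer before zeroing the `<unk>` and `<pad>` rows: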

In [14]:
pretrained_embeddings = txt.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


In [15]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.1219,  0.7036,  0.0626,  ...,  0.1124,  0.2100, -0.1781],
        [-0.5005,  0.2531,  1.8303,  ...,  0.5205,  0.6098,  0.8963],
        [-1.8683,  0.1008, -0.1267,  ...,  0.3982,  0.7209,  0.0699]])

In [16]:
unique_id = txt.vocab.stoi[txt.unk_token]
 
model.embedding.weight.data[unique_id] = torch.zeros(dimension_embedding)
model.embedding.weight.data[idx_pad] = torch.zeros(dimension_embedding)
 
print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.1219,  0.7036,  0.0626,  ...,  0.1124,  0.2100, -0.1781],
        [-0.5005,  0.2531,  1.8303,  ...,  0.5205,  0.6098,  0.8963],
        [-1.8683,  0.1008, -0.1267,  ...,  0.3982,  0.7209,  0.0699]])


In [17]:
'''
Now we define the optimizer we are going to use and the loss criterion we need.
We choose the Adam optimizer for fast convergence, along with binary cross-entropy on logits (the logistic loss) as the loss function.
We place the model and the criterion on the selected device (the GPU when one is available).
'''

import torch.optim as optim
 
optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
 
model = model.to(device)
criterion = criterion.to(device)
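For intuition about what the criterion computes, here is a small stdlib sketch of binary cross-entropy evaluated directly on raw scores. The function `bce_with_logits` is a hypothetical re-derivation using the standard numerically stable form, not PyTorch's code.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw scores (logits)."""
    losses = []
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))]
        losses.append(max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
    return sum(losses) / len(losses)

print(round(bce_with_logits([0.0], [1.0]), 4))  # 0.6931, i.e. ln 2 for an undecided model
print(bce_with_logits([4.0, -4.0], [1.0, 0.0]))  # small loss for confident, correct scores
```

A loss near ln 2 (about 0.693) therefore means the model is doing no better than guessing.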

Training of the model

In [18]:
''' 
We now define the functions needed for training and evaluating the sentiment analysis model.

The first is the binary accuracy function, which we'll use to measure the accuracy of the model on each batch.
'''

def bin_acc(preds, y):
   
    predictions = torch.round(torch.sigmoid(preds))
    correct = (predictions == y).float() 
    acc = correct.sum() / len(correct)
    return acc
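The same computation can be followed on plain Python numbers. This is a hypothetical list-based mirror of `bin_acc` (named `binary_accuracy` here), meant only to show the sigmoid, the rounding at 0.5, and the averaging.

```python
import math

def binary_accuracy(logits, labels):
    """Fraction of examples whose thresholded sigmoid score equals the label."""
    predictions = [round(1.0 / (1.0 + math.exp(-x))) for x in logits]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Sigmoid scores are roughly 0.88, 0.27, 0.62, 0.05, so predictions are 1, 0, 1, 0.
print(binary_accuracy([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 0]))  # 0.75
```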

In [19]:
'''
We define the functions for training and evaluating the model. The process here is standard.
Each call makes one pass (one epoch) over its iterator; the number of iterations in that pass depends on the batch size that we defined.
We pass the text to the model, get its predictions, calculate the loss for each batch and, when training, backpropagate that loss and update the weights.

'''


def train(model, itr, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for i in itr:
        
        optimizer.zero_grad()
        
        text, len_txt = i.text
        
        predictions = model(text, len_txt).squeeze(1)
        
        loss = criterion(predictions, i.label)
        
        acc = bin_acc(predictions, i.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(itr), epoch_acc / len(itr)
 
def evaluate(model, itr, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for i in itr:
 
            text, len_txt = i.text
            
            predictions = model(text, len_txt).squeeze(1)
            
            loss = criterion(predictions, i.label)
            
            acc = bin_acc(predictions, i.label)
 
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(itr), epoch_acc / len(itr)

In [20]:
'''
We build a helper function epoch_time that converts the time each epoch takes into minutes and seconds for printing.
We set the number of epochs to 5 and then begin our training, printing the training and validation loss at each epoch in case we need to
understand or plot the training curve at a later point. We save the model that has the best validation loss.
'''


import time
 
def epoch_time(start_time, end_time):
    used_time = end_time - start_time
    used_mins = int(used_time / 60)
    used_secs = int(used_time - (used_mins * 60))
    return used_mins, used_secs

num_epochs = 5
 
best_valid_loss = float('inf')
 
for epoch in range(num_epochs):
 
    start_time = time.time()
    
    train_loss, train_acc = train(model, train_itr, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_itr, criterion)
    
    end_time = time.time()
 
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01 | Epoch Time: 0m 35s
	Train Loss: 0.671 | Train Acc: 58.60%
	 Val. Loss: 0.581 |  Val. Acc: 69.89%
Epoch: 02 | Epoch Time: 0m 36s
	Train Loss: 0.598 | Train Acc: 68.07%
	 Val. Loss: 0.467 |  Val. Acc: 79.09%
Epoch: 03 | Epoch Time: 0m 37s
	Train Loss: 0.459 | Train Acc: 79.00%
	 Val. Loss: 0.377 |  Val. Acc: 84.01%
Epoch: 04 | Epoch Time: 0m 37s
	Train Loss: 0.370 | Train Acc: 84.15%
	 Val. Loss: 0.321 |  Val. Acc: 86.75%
Epoch: 05 | Epoch Time: 0m 37s
	Train Loss: 0.298 | Train Acc: 88.09%
	 Val. Loss: 0.385 |  Val. Acc: 82.56%
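The checkpointing rule can be replayed on the validation losses from the log above. This small sketch only re-runs the comparison from the training loop; the variable `best_epoch` is added for illustration.

```python
# Validation losses copied from the epoch log above.
valid_losses = [0.581, 0.467, 0.377, 0.321, 0.385]

best_valid_loss = float("inf")
best_epoch = None
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_epoch = epoch  # this is where the notebook calls torch.save(...)

print(best_epoch, best_valid_loss)  # 4 0.321
```

The validation loss rises again in epoch 5, so the saved file holds the epoch 4 weights, and those are the weights loaded for testing.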


Testing the model

In [21]:
model.load_state_dict(torch.load('tut2-model.pt'))
 
test_loss, test_acc = evaluate(model, test_itr, criterion)
 
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.332 | Test Acc: 86.25%


In [22]:
"""
We can also check the model on our own data. It is trained to classify movie reviews as positive or negative
(the neutral band used below comes from thresholding its score), so we pass it movie-review-like text. For that we import and load spaCy to tokenize the input.
Earlier, while defining the preprocessing, we used the spaCy tokenizer built into torchtext, but here
we are not using batches, so the preprocessing can be handled by the spaCy library directly. We define a sentiment prediction
function for this. After the preprocessing, we convert the input into tensors ready to be passed to the model.

"""

import spacy
nlp = spacy.load('en_core_web_sm')
 
def pred(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [txt.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()
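One detail worth seeing in isolation is what happens to words the vocabulary has never seen. The sketch below assumes `stoi` behaves like a default dictionary that falls back to the `<unk>` index, which is our understanding of the legacy Vocab; the toy vocabulary and the helper `to_indices` are hypothetical.

```python
from collections import defaultdict

UNK_INDEX = 0  # assumed position of '<unk>' in this toy vocabulary
toy_stoi = defaultdict(
    lambda: UNK_INDEX,
    {"<unk>": 0, "<pad>": 1, "this": 2, "film": 3, "exist": 4},
)

def to_indices(tokens, stoi):
    """Look up each token, letting unseen ones fall back to the unknown index."""
    return [stoi[t] for t in tokens]

print(to_indices("why does this fil exist".split(), toy_stoi))  # [0, 0, 2, 0, 4]
```

A misspelled word such as "fil" is therefore fed to the model as an unknown token rather than raising an error.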

In [23]:
'''
We define another helper function that will print the sentiment of the comment based on the score that the model provides.
'''

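# In this run a low score corresponds to a positive review (see the outputs below).
# The mapping depends on labels.vocab.stoi, so check it before reusing these thresholds.
sent=["positive","neutral","negative"]
def print_sent(x):
  if (x<0.3): print(sent[0])
  elif (x<0.7): print(sent[1])   # covers 0.3 <= x < 0.7, including exactly 0.3
  else: print(sent[2])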


Now we can pass in any text and see what the model thinks about it.

In [24]:
print_sent(pred(model, "This film was great"))


positive


In [25]:
print_sent(pred(model, "This was the best movie i have seen in a while. The cast was great and the script was awesome, and the direction just blew my mind"))

positive


In [26]:
print_sent(pred(model, "This film is horrible"))

negative


In [27]:
print_sent(pred(model, "the cast was dumb"))

negative


In [28]:
print_sent(pred(model, "Why does this fil exist"))

negative
