# Libraries

In [240]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
# Field, LabelField, TabularDataset and BucketIterator live in torchtext.legacy (torchtext 0.9 to 0.11)
from torchtext.legacy import data
from torchtext.legacy import datasets

# Load the English spaCy pipeline (tokenizer, tagger, parser and named entity recognizer)

import spacy
nlp = spacy.load('en_core_web_sm')

# DOWNLOAD AND READING DATA

In [241]:
data_f = pd.read_csv('/content/drive/MyDrive/TXTA PROJ/IMDb-sample.csv',header=0)

In [242]:
data_f.drop(columns=['Index','URL'],axis=1,inplace=True)

In [243]:
data_f.head(5)

Unnamed: 0,Text,Sentiment
0,Girlfight follows a project dwelling New York ...,POS
1,Hollywood North is an euphemism from the movie...,POS
2,That '70s Show is definitely the funniest show...,POS
3,"9/10- 30 minutes of pure holiday terror. Okay,...",POS
4,"A series of random, seemingly insignificant th...",POS


In [244]:
# Assign column names
column_names = ['text', 'label']
data_f.columns = column_names

In [245]:
data_f.head()

Unnamed: 0,text,label
0,Girlfight follows a project dwelling New York ...,POS
1,Hollywood North is an euphemism from the movie...,POS
2,That '70s Show is definitely the funniest show...,POS
3,"9/10- 30 minutes of pure holiday terror. Okay,...",POS
4,"A series of random, seemingly insignificant th...",POS


In [246]:
data_f.shape
# 2000 rows (reviews), 2 columns (text and sentiment label)

(2000, 2)

In [247]:
data_f['label']=data_f['label'].apply(lambda x: 1 if x == 'POS' else 0)

In [248]:
data_f.head(5)

Unnamed: 0,text,label
0,Girlfight follows a project dwelling New York ...,1
1,Hollywood North is an euphemism from the movie...,1
2,That '70s Show is definitely the funniest show...,1
3,"9/10- 30 minutes of pure holiday terror. Okay,...",1
4,"A series of random, seemingly insignificant th...",1


In [249]:
# check for null values
data_f.isnull().sum()

# no null values in the data

text     0
label    0
dtype: int64

In [250]:
data_f.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2000 non-null   object
 1   label   2000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


In [233]:
# Optional scikit-learn split, kept for reference (not used below):
# from sklearn.model_selection import train_test_split
#
# train, test = train_test_split(data_f, test_size=0.2, random_state=0)

In [251]:
# save the cleaned dataset as a CSV file so it can be uploaded to Drive and reused directly

data_f.to_csv('IMDb_cleaned.csv')



# PREPARING DATA

In [252]:
#Reproducing same results
SEED = 2019

#Torch
torch.manual_seed(SEED)

#Cuda algorithms
torch.backends.cudnn.deterministic = True  

In [253]:
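#tokenizer_language names the same spaCy model that was loaded above
TEXT = data.Field(tokenize='spacy',tokenizer_language='en_core_web_sm',batch_first=True,include_lengths=True)
LABEL = data.LabelField(dtype = torch.float,batch_first=True)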

In [254]:
fields = [(None, None), ('text',TEXT),('label', LABEL)]

In [255]:
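#load the custom dataset
#skip_header=True drops the CSV header row; without it the header is parsed as the first example ({'text': ['text'], 'label': 'label'})
training_data=data.TabularDataset(path = '/content/IMDb_cleaned.csv',format = 'csv',fields = fields,skip_header = True)
#print preprocessed text
print(vars(training_data.examples[0]))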

{'text': ['text'], 'label': 'label'}


In [256]:
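import random

#random.seed() returns None, so seed first and pass the RNG state for a reproducible 70/30 split
random.seed(SEED)
train_data, valid_data = training_data.split(split_ratio=0.7, random_state = random.getstate())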

# Preparing input and output sequences:

The next step is to build the vocabulary for the text and convert each review into a sequence of integers. The vocabulary contains the unique words found in the training text, and each unique word is assigned an index. The relevant settings are listed below.

Parameters:

1. min_freq: Words that occur fewer than min_freq times are left out of the vocabulary and mapped to the unknown token.
2. Two special tokens, unknown and padding, are added to the vocabulary.
The unknown token handles out-of-vocabulary words.
The padding token brings all input sequences in a batch to the same length.

We then build the vocabulary and initialize the words with the pretrained GloVe embeddings.
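To make the role of min_freq and the two special tokens concrete, here is a minimal pure-Python sketch of vocabulary building. It is not torchtext's implementation; the helper names build_vocab and numericalize and the toy reviews are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1, unk_token="<unk>", pad_token="<pad>"):
    """Return (itos, stoi) for a list of token lists (illustrative helper)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # The special tokens take the first two indices.
    itos = [unk_token, pad_token]
    # Most frequent words first; ties are broken alphabetically for determinism.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(word)
    stoi = {word: idx for idx, word in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices; out-of-vocabulary words get the unknown index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

# Toy data: "bad" and "plot" occur once, so min_freq=2 drops them.
texts = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
itos, stoi = build_vocab(texts, min_freq=2)
encoded = numericalize(["good", "boring", "movie"], stoi)
print(itos)     # ['<unk>', '<pad>', 'good', 'movie']
print(encoded)  # [2, 0, 3]
```

The word "boring" never appeared in the toy data, so it falls back to index 0, the unknown token.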

In [257]:
#initialize glove embeddings
TEXT.build_vocab(train_data,min_freq=3,vectors = "glove.6B.100d")  
LABEL.build_vocab(train_data)

#No. of unique tokens in text
print("Size of TEXT vocabulary:",len(TEXT.vocab))

#No. of unique tokens in label
print("Size of LABEL vocabulary:",len(LABEL.vocab))

#Commonly used words
print(TEXT.vocab.freqs.most_common(10))  

#Word dictionary
print(TEXT.vocab.stoi)   


Size of TEXT vocabulary: 10033
Size of LABEL vocabulary: 3
[('the', 18701), (',', 17847), ('.', 17138), ('a', 10110), ('and', 10073), ('of', 9318), ('to', 8765), ('is', 6918), ('in', 5667), ('I', 5268)]


Now we will prepare batches for training the model. BucketIterator forms the batches in such a way that a minimum amount of padding is required.

In [258]:
#check whether cuda is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  

#set batch size
BATCH_SIZE = 64

#Load an iterator
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data), 
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.text),
    sort_within_batch=True,
    device = device)
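Why grouping reviews of similar length reduces padding can be checked with a little arithmetic. The sketch below is a simplified illustration, not BucketIterator's actual algorithm; the review lengths and the helper names pad_count and make_batches are made up for the example.

```python
def pad_count(batches):
    """Total padding tokens needed to make every batch rectangular."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

def make_batches(lengths, batch_size):
    """Cut a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

# Hypothetical review lengths (in tokens), mixing short and long reviews.
lengths = [3, 50, 4, 48, 5, 52]

unsorted_padding = pad_count(make_batches(lengths, 3))
bucketed_padding = pad_count(make_batches(sorted(lengths), 3))

print(unsorted_padding)  # 144
print(bucketed_padding)  # 9
```

Batching in arrival order pads every short review up to about 50 tokens, while batching by length keeps short reviews together and long reviews together.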

# Model Architecture

It is now time to define the architecture for this binary classification problem.

The class defines two methods, __init__ and forward:

1. __init__: This method is invoked automatically whenever an instance of the class is created, which is why it is called the constructor. It receives the arguments passed to the class, and it is where we define all the layers used in the model.

2. forward: This method defines the forward pass, that is, how the inputs flow through the layers to produce the output.

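The layers used to build the architecture, and their parameters:

Embedding layer: maps each word index to a dense vector

LSTM: the recurrent layer, with these parameters:

input_size: dimension of the input (here the embedding dimension)

hidden_size: No. of hidden nodes

num_layers: No. of stacked LSTM layers

batch_first: if True, the input and output tensors are shaped (batch, seq, feature)

dropout: if non-zero, applies dropout to the outputs of every LSTM layer except the last. Default: 0

bidirectional: if True, the LSTM reads the sequence in both directions

Linear layer: a dense layer, with these parameters:

in_features: No. of input features

out_features: No. of output features (here 1, the sentiment score)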

Pack padding: Packing lets the LSTM handle sequences of different lengths within one batch. Without it, the padding positions are also processed by the RNN, and the returned hidden state is that of the padded element rather than of the last real word.
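The effect of packing can be illustrated without PyTorch. A packed batch records, for each time step, how many sequences are still active, so the recurrent layer only processes real tokens. The sketch below recomputes that per-step count for a small batch; the helper name packed_batch_sizes and the lengths are assumptions for illustration.

```python
def packed_batch_sizes(lengths):
    """For each time step, count the sequences that still have a real token."""
    return [sum(1 for n in lengths if n > t) for t in range(max(lengths))]

# Three sequences sorted longest first, as in a length-sorted batch.
lengths = [4, 2, 1]

batch_sizes = packed_batch_sizes(lengths)
padded_steps = max(lengths) * len(lengths)  # steps processed when padding is fed to the RNN
packed_steps = sum(batch_sizes)             # steps processed once padding is skipped

print(batch_sizes)   # [3, 2, 1, 1]
print(padded_steps)  # 12
print(packed_steps)  # 7
```

With packing, each sequence stops at its own last real token, which is also why the final hidden state then belongs to a real word and not to a padding position.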

In [259]:
import torch.nn as nn

class classifier(nn.Module):
    
    #define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout):
        
        #Constructor
        super().__init__()          
        
        #embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        #lstm layer
        self.lstm = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout,
                           batch_first=True)
        
        #dense layer
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        #activation function
        self.act = nn.Sigmoid()
        
    def forward(self, text, text_lengths):
        
        #text = [batch size,sent_length]
        embedded = self.embedding(text)
        #embedded = [batch size, sent_len, emb dim]
      
        #packed sequence (recent PyTorch versions expect the lengths on the CPU)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(),batch_first=True)
        
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        #hidden = [num layers * num directions, batch size, hid dim]
        #cell = [num layers * num directions, batch size, hid dim]
        
        #concat the final forward and backward hidden state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
                
        #hidden = [batch size, hid dim * num directions]
        dense_outputs=self.fc(hidden)

        #Final activation function
        outputs=self.act(dense_outputs)
        
        return outputs

The next step would be to define the hyperparameters and instantiate the model. Here is the code block for the same:

In [260]:
#define hyperparameters
size_of_vocab = len(TEXT.vocab)
embedding_dim = 100
num_hidden_nodes = 32
num_output_nodes = 1
num_layers = 2
bidirection = True
dropout = 0.2

#instantiate the model
model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes,num_output_nodes, num_layers, 
                   bidirectional = bidirection, dropout = dropout)

Let us look at the model summary and initialize the embedding layer with the pretrained embeddings:

In [261]:
#architecture
print(model)

#No. of trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')

#Initialize the pretrained embedding
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

print(pretrained_embeddings.shape)

classifier(
  (embedding): Embedding(10033, 100)
  (lstm): LSTM(100, 32, num_layers=2, batch_first=True, dropout=0.2, bidirectional=True)
  (fc): Linear(in_features=64, out_features=1, bias=True)
  (act): Sigmoid()
)
The model has 1,062,757 trainable parameters
torch.Size([10033, 100])
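The reported parameter count can be reproduced by hand from the layer sizes in the summary above. The following is a small arithmetic sketch; the helper name lstm_layer_params is an assumption, and the formula relies on PyTorch's LSTM keeping four gates with input weights, recurrent weights and two bias vectors per gate.

```python
def lstm_layer_params(input_size, hidden_size, bidirectional=True):
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights and two bias vectors (b_ih and b_hh)."""
    per_direction = 4 * (input_size * hidden_size
                         + hidden_size * hidden_size
                         + 2 * hidden_size)
    return per_direction * (2 if bidirectional else 1)

vocab_size, embedding_dim, hidden_dim, output_dim = 10033, 100, 32, 1

embedding = vocab_size * embedding_dim                   # 1,003,300
lstm_1 = lstm_layer_params(embedding_dim, hidden_dim)    # first layer reads the embeddings
lstm_2 = lstm_layer_params(2 * hidden_dim, hidden_dim)   # second layer reads both directions
fc = 2 * hidden_dim * output_dim + output_dim            # weights plus bias

total = embedding + lstm_1 + lstm_2 + fc
print(f"{total:,}")  # 1,062,757
```

Almost all of the parameters sit in the embedding layer; the two LSTM layers and the dense layer together account for fewer than 60,000.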


Defining the optimizer, loss and metric for the model:

In [262]:
import torch.optim as optim

#define optimizer and loss
optimizer = optim.Adam(model.parameters())
criterion = nn.BCELoss()

#define metric
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(preds)
    
    correct = (rounded_preds == y).float() 
    acc = correct.sum() / len(correct)
    return acc
    
#push to cuda if available
model = model.to(device)
criterion = criterion.to(device)
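As a worked example of the loss and the metric, the sketch below computes binary cross-entropy and rounded accuracy in plain Python for four made-up predictions. It mirrors what nn.BCELoss (with its default mean reduction) and binary_accuracy compute, but the helper names and numbers are assumptions for illustration.

```python
import math

def bce_loss(preds, targets):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(preds, targets)]
    return sum(losses) / len(losses)

def accuracy(preds, targets):
    """Share of predictions that match the target after rounding at 0.5."""
    correct = [float(round(p) == y) for p, y in zip(preds, targets)]
    return sum(correct) / len(correct)

# Hypothetical sigmoid outputs and their true labels.
preds = [0.9, 0.2, 0.6, 0.4]
targets = [1.0, 0.0, 0.0, 1.0]

loss = bce_loss(preds, targets)
acc = accuracy(preds, targets)
print(f"loss={loss:.3f} acc={acc:.2f}")  # loss=0.540 acc=0.50
```

The first two predictions are confident and correct, so they add little to the loss; the last two fall on the wrong side of 0.5 and dominate it.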

There are 2 phases while building the model:

Training phase: model.train() puts the model in training mode and activates the dropout layers.

Inference phase: model.eval() puts the model in evaluation mode and deactivates the dropout layers.
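The difference between the two phases can be sketched with a toy version of dropout. This is a simplified pure-Python illustration of inverted dropout, not PyTorch's implementation; the function name, the activation values and the use of random.Random are assumptions.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: while training, zero each value with probability p
    and rescale the survivors by 1 / (1 - p); at evaluation time do nothing."""
    if not training or p == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(2019)
activations = [1.0] * 8

train_out = dropout(activations, p=0.2, training=True, rng=rng)   # some values zeroed, rest scaled to 1.25
eval_out = dropout(activations, p=0.2, training=False, rng=rng)   # unchanged

print(train_out)
print(eval_out)
```

Because evaluation leaves the activations untouched, predictions made after model.eval() are deterministic for a given input.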
Here is the code block that defines the function for training the model:

In [263]:
def train(model, iterator, optimizer, criterion):
    
    #initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    #set the model in training phase
    model.train()  
    
    for batch in iterator:
        
        #resets the gradients after every batch
        optimizer.zero_grad()   
        
        #retrieve text and no. of words
        text, text_lengths = batch.text   
        
        #convert [batch size, 1] to a 1D tensor; squeeze(1) also works for a batch of one
        predictions = model(text, text_lengths).squeeze(1)  
        
        #compute the loss
        loss = criterion(predictions, batch.label)        
        
        #compute the binary accuracy
        acc = binary_accuracy(predictions, batch.label)   
        
        #backpropagate the loss and compute the gradients
        loss.backward()       
        
        #update the weights
        optimizer.step()      
        
        #loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()    
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

So we have a function to train the model, but we also need a function to evaluate the model.

In [264]:
def evaluate(model, iterator, criterion):
    
    #initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    #deactivating dropout layers
    model.eval()
    
    #deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
        
            #retrieve text and no. of words
            text, text_lengths = batch.text
            
            #convert [batch size, 1] to a 1D tensor; squeeze(1) also works for a batch of one
            predictions = model(text, text_lengths).squeeze(1)
            
            #compute loss and accuracy
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            
            #keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We will train the model for a fixed number of epochs and save the weights whenever the validation loss improves.

In [265]:
N_EPOCHS = 5
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    
    #evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    #save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

	Train Loss: 0.693 | Train Acc: 52.07%
	 Val. Loss: 0.690 |  Val. Acc: 55.52%
	Train Loss: 0.683 | Train Acc: 61.44%
	 Val. Loss: 0.682 |  Val. Acc: 53.44%
	Train Loss: 0.636 | Train Acc: 65.69%
	 Val. Loss: 0.636 |  Val. Acc: 64.17%
	Train Loss: 0.502 | Train Acc: 76.14%
	 Val. Loss: 0.844 |  Val. Acc: 58.85%
	Train Loss: 0.507 | Train Acc: 75.63%
	 Val. Loss: 0.626 |  Val. Acc: 64.38%
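To see which epochs actually overwrite saved_weights.pt, the checkpoint rule can be replayed on the validation losses printed above. This is only a replay of the logged numbers with the same comparison as the training loop; the variable saved_epochs is introduced for illustration.

```python
# Val. Loss per epoch, copied from the training log above.
valid_losses = [0.690, 0.682, 0.636, 0.844, 0.626]

best_valid_loss = float("inf")
saved_epochs = []
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)  # the point where torch.save would run

print(saved_epochs)     # [1, 2, 3, 5]
print(best_valid_loss)  # 0.626
```

Epoch 4 is skipped because its validation loss rose to 0.844, so the file on disk ends up holding the weights from epoch 5.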


Next we load the best model and define an inference function that accepts user-defined input and makes a prediction:

In [266]:
#load weights
path='/content/saved_weights.pt'
model.load_state_dict(torch.load(path, map_location=device))
model.eval()

#inference 
import spacy
nlp = spacy.load('en_core_web_sm')

def predict(model, sentence):
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]  #tokenize the sentence 
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]          #convert to integer sequence
    length = [len(indexed)]                                    #compute no. of words
    tensor = torch.LongTensor(indexed).to(device)              #convert to tensor
    tensor = tensor.unsqueeze(0)                               #reshape to [1, no. of words]
    length_tensor = torch.LongTensor(length)                   #convert to tensor
    prediction = model(tensor, length_tensor)                  #prediction 
    return prediction.item()                          

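Let us use this model to make predictions for a few reviews. The returned value is the sigmoid output of the model, a score between 0 and 1; which sentiment a high score stands for depends on the index each label received in LABEL.vocab.stoi: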

In [267]:
#make predictions
predict(model, "Hollywood North is an euphemism from the movie industry as they went to Canada to make movies because of tax breaks and cheaper costs in a civilized city like Toronto, in this case, later in Vancouver. Peter O'Brian, the director, probably saw a lot of the invaders from California that this movie seems to be the right way to deal with the arriving personalities trying to capitalize on the economics that Canada presented.Needless to say, Moon Lantern, the successful novel written by a Canadian author is turned into Flight to Bogota, which has nothing to do with the original film. A great egotistical has-been, Michael Baytes, who is obsessed with what is happening in Iran, is offered the lead part, which turns to be a disaster.The film seems to be saying that too many cooks have spoiled the broth, which seems to be the case with the ultimate product, which is saved by its producer, Bobby Myers. With the help of Sandy Ryan, who has been around making a documentary of the film being shot in Toronto, parts of the film are transformed into a cohesive movie at last.The filming process is hilarious, and the acting, in general, is good.Hollywood North is an euphemism from the movie industry as they went to Canada to make movies because of tax breaks and cheaper costs in a civilized city like Toronto, in this case, later in Vancouver.")



0.5651946067810059

In [268]:
#make predictions
predict(model, "Fantastically putrid. I don't mean to imply above that only a few people should avoid Doc Savage. Almost every demographic group would be bored by this trivial, TV-movie-quality production. It's a little like the 60's Batman TV series, except it's not funny. Even accidentally. You're better off taking a nap.Fantastically putrid.")

0.8946316242218018

In [269]:
#make predictions
predict(model, "First of all, I was expecting Caged Heat to be along the same lines as Ilsa, The Wicked Warden. Boy, was I wrong! In no way is this film 70s exploitation, chix in chains, or women in prison. Sure, the plot consists of a bunch of women in prison, who wear street clothes btw (quite comical), but NOTHING happens.There aren't strong rivalries, no one tries to seduce the warden or doctor in order to try and escape, and no inmates make out. There are 2 shower scenes, that I suspect is just recycled footage, but no fights breaks out / no one is seduced here - or anywhere for that matter! Aside from the lack of plot, unconvincing, unsympathetic, and flat characters, a couple of inmates that do manage to escape actually return to the prison in order to free their fellow inmates??!!PUH-LEASE, the movie should have just ended off with the escapees riding off into the sunset...as opposed to letting this mess continue!I feel scammed.First of all, I was expecting Caged Heat to be along the same lines as Ilsa, The Wicked Warden.")

0.8833414316177368

In [270]:
#make predictions
predict(model, "I thought this movie was fantastic. It was hilarious. Kinda reminded me of Spinal Tap. This is a must see for any fan of 70's rock. (I hope me and my friends aren't like that in twenty years!)Bill Nighy gives an excellent performance as the off kilter lead singer trying to recapture that old spirit,Stephen Rea fits perfectly into the movie as the glue trying to hold the band together, but not succeeding well.If you love music, and were ever in a band, this movie is definitely for you. You won't regret seeing this movie. I know I don't. Even my family found it funny, and that's saying something.I thought this movie was fantastic.")

0.2392205446958542
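Turning these scores into class names takes one more step, because the training target is the label's index in the LABEL vocabulary and not the original label value. The sketch below shows that step under a loudly hypothetical mapping; the function name score_to_label and the itos list are assumptions, and the real order should be read from LABEL.vocab.itos.

```python
def score_to_label(score, itos, threshold=0.5):
    """Map a sigmoid score to a class name.
    itos[i] is the class name stored at target index i."""
    return itos[1] if score >= threshold else itos[0]

# Hypothetical mapping for illustration only: index 0 -> '0', index 1 -> '1'.
# The real order depends on how the label vocabulary was built.
itos = ["0", "1"]

# Scores rounded from the four predictions above.
scores = [0.565, 0.894, 0.883, 0.239]
labels = [score_to_label(s, itos) for s in scores]
print(labels)  # ['1', '1', '1', '0']
```

If the vocabulary stores the labels in the opposite order, the same scores map to the opposite classes, so it is worth printing LABEL.vocab.stoi before interpreting the predictions.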

In [None]:
#Source

Build Your First Text Classification model using PyTorch (Analytics Vidhya)
https://www.kaggle.com/columbine/pytorch-sentiment-analysis
https://colab.research.google.com/github/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb#scrollTo=wago_1cFtl1I
https://sofiadutta.github.io/datascience-ipynbs/pytorch/Sentiment-Analysis-using-PyTorch.html
https://captum.ai/tutorials/IMDB_TorchText_Interpret
https://dzlab.github.io/dltips/en/pytorch/torchtext-datasets/