
## Dataset Preview

Your first step to deep learning in NLP. We will be mostly using PyTorch. Just like torchvision, PyTorch provides an official library, torchtext, for handling text-processing pipelines. 

We will use the tweet dataset from the previous session. Let's preview it.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
df = pd.read_csv('drive/My Drive/END2/Session5-Assignment/tweets.csv')
df.head()

|   | tweets | labels |
|---|--------|--------|
| 0 | Obama has called the GOP budget social Darwini... | 1 |
| 1 | In his teen years, Obama has been known to use... | 0 |
| 2 | IPA Congratulates President Barack Obama for L... | 0 |
| 3 | RT @Professor_Why: #WhatsRomneyHiding - his co... | 0 |
| 4 | RT @wardollarshome: Obama has approved more ta... | 1 |


In [3]:
df.shape

(1364, 2)

In [4]:
df.labels.value_counts()

0    931
1    352
2     81
Name: labels, dtype: int64
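
The counts above are quite imbalanced, so it helps to know what a trivial classifier would score before reading any accuracy figure. The snippet below is a small worked calculation using only the numbers printed by `value_counts()`; the variable names are ours and purely illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 931, 1: 352, 2: 81}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
baseline_acc = counts[majority_label] / total

print(total)                   # 1364
print(majority_label)          # 0
print(f"{baseline_acc:.4f}")   # 0.6826
```

Any model worth keeping should therefore beat roughly 68% accuracy on this dataset.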

## Defining Fields

Now we define Label as a LabelField, a subclass of Field that sets sequential to False (as it's our numerical category class). Tweet is a standard Field object for which we use the spaCy tokenizer. Note that lower=True is not passed in the code below, so the original casing of the text is kept.

In [5]:
# Import Library
import random
import torch, torchtext
from torchtext.legacy import data 
import torch.nn as nn
import torch.optim as optim


# Manual Seed
SEED = 43
torch.manual_seed(SEED)

<torch._C.Generator at 0x7ff1a29911f0>

In [6]:
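# Tweet: tokenized with spaCy; include_lengths=True makes each batch a (token ids, lengths) pair
Tweet = data.Field(sequential=True, tokenize='spacy', batch_first=True, include_lengths=True)
# Label: LabelField already sets sequential=False and is_target=True, so no tokenizer is needed
Label = data.LabelField()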

Having defined those fields, we now need to produce a list that maps them onto the list of rows that are in the CSV:

In [7]:
fields = [('tweets', Tweet),('labels',Label)]

Armed with our declared fields, let's convert the pandas DataFrame into a list of torchtext Example objects. We could also use TabularDataset to apply the field definitions to the CSV directly, but we show an alternative approach here.

In [8]:
example = [data.Example.fromlist([df.tweets[i],df.labels[i]], fields) for i in range(df.shape[0])] 

In [9]:
# Creating dataset
#twitterDataset = data.TabularDataset(path="tweets.csv", format="CSV", fields=fields, skip_header=True)

twitterDataset = data.Dataset(example, fields)

Finally, we can split the dataset into training and validation sets by using the split() method:

In [10]:
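# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
(train, valid) = twitterDataset.split(split_ratio=[0.85, 0.15], random_state=random.getstate())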

In [11]:
type(valid)

torchtext.legacy.data.dataset.Dataset

In [12]:
(len(train), len(valid))

(1159, 205)
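
To make the 85/15 arithmetic and the role of the seed concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-cut split. It is not torchtext's implementation: `seeded_split` is a hypothetical helper, and the integers stand in for the 1364 examples.

```python
import random

def seeded_split(items, train_ratio, seed):
    """Shuffle a copy of items with a private, seeded RNG and cut it in two."""
    rng = random.Random(seed)          # independent of the global random state
    shuffled = list(items)
    rng.shuffle(shuffled)
    train_len = int(round(train_ratio * len(shuffled)))
    return shuffled[:train_len], shuffled[train_len:]

rows = list(range(1364))               # stand-in for the 1364 examples
train_part, valid_part = seeded_split(rows, 0.85, seed=43)
print(len(train_part), len(valid_part))   # 1159 205

# The same seed gives the same partition.
again, _ = seeded_split(rows, 0.85, seed=43)
print(train_part == again)                # True
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible even if other code touches the global random state.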

An example from the dataset:

In [13]:
vars(train.examples[10])

{'labels': 0,
 'tweets': ['Obama',
  ',',
  'Romney',
  'agree',
  ':',
  'Admit',
  'women',
  'to',
  'Augusta',
  'golf',
  'club',
  ':',
  'US',
  'President',
  'Barack',
  'Obama',
  'believes',
  'women',
  'should',
  'be',
  'allowe',
  '...',
  'http://t.co/PVKrepqI']}

## Building Vocabulary

At this point we would normally have to build a one-hot encoding of each word in the dataset by hand, a rather tedious process. Thankfully, torchtext builds the word-to-index mapping for us and accepts a max_size parameter that limits the vocabulary to the most common words. This is normally done to prevent the construction of a huge, memory-hungry model. We don't want our GPUs too overwhelmed, after all.

Let’s limit the vocabulary to a maximum of 5000 words in our training set:


In [14]:
Tweet.build_vocab(train,max_size=5000)
Label.build_vocab(train,max_size=5000)

By default, torchtext adds two special tokens: `<unk>` for unknown words and `<pad>`, a padding token used to bring every sequence in a batch to the same length so that batches can be processed efficiently on the GPU.
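
To see what a capped vocabulary with special tokens amounts to, here is a minimal pure-Python sketch. It only illustrates the idea and is not how torchtext implements `build_vocab` (tie-breaking between equally frequent words may differ, for instance). The helpers `build_vocab` and `numericalize` and the toy corpus are hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size=None, specials=("<unk>", "<pad>")):
    """Map the most frequent tokens to integer ids, placed after the special tokens."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Replace each token by its id; unseen tokens fall back to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["Obama", "wins"], ["Obama", "speaks", "today"], ["Romney", "speaks"]]
stoi, itos = build_vocab(corpus, max_size=2)

print(itos)                                             # ['<unk>', '<pad>', 'Obama', 'speaks']
print(numericalize(["Obama", "wins", "today"], stoi))   # [2, 0, 0]
```

Words that fall outside the cap, like "wins" and "today" here, all share the `<unk>` id, which is the trade-off max_size makes for a smaller embedding table.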

In [15]:
print('Size of input vocab : ', len(Tweet.vocab))
print('Size of label vocab : ', len(Label.vocab))
print('Top 10 most frequent words :', list(Tweet.vocab.freqs.most_common(10)))
print('Labels : ', Label.vocab.stoi)

Size of input vocab :  4651
Size of label vocab :  3
Top 10 most frequent words : [('Obama', 1069), (':', 783), ('#', 780), ('.', 761), (',', 598), ('"', 550), ('the', 542), ('RT', 516), ('?', 419), ('to', 400)]
Labels :  defaultdict(None, {0: 0, 1: 1, 2: 2})


**Lots of stopwords!!**

Now we need to create a data loader to feed into our training loop. torchtext provides the BucketIterator class, which batches examples of similar length together and yields what it calls a Batch, which is almost, but not quite, like the data loader we used on images.

First, declare the device we are using.

In [16]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device

In [17]:
train_iterator, valid_iterator = data.BucketIterator.splits((train, valid), batch_size = 32, 
                                                            sort_key = lambda x: len(x.tweets),
                                                            sort_within_batch=True, device = device)
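
The idea behind bucketing can be sketched in a few lines of plain Python: sort by length, cut into batches, and pad each batch only to its own longest sequence. This is a simplified illustration, not BucketIterator's actual algorithm (which also shuffles); `bucket_batches` is a hypothetical helper and `pad_id=1` is an assumed index for the padding token.

```python
def bucket_batches(sequences, batch_size, pad_id=1):
    """Group sequences of similar length and pad each batch to its longest member."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        # Longest first within the batch, as pack_padded_sequence expects by default.
        chunk = sorted(chunk, key=len, reverse=True)
        lengths = [len(seq) for seq in chunk]
        width = lengths[0]
        padded = [seq + [pad_id] * (width - len(seq)) for seq in chunk]
        batches.append((padded, lengths))
    return batches

seqs = [[5, 6, 7, 8, 9], [4], [2, 3], [7, 7, 7], [9, 9]]
batches = bucket_batches(seqs, batch_size=2)
for padded, lengths in batches:
    print(padded, lengths)
# [[2, 3], [4, 1]] [2, 1]
# [[7, 7, 7], [9, 9, 1]] [3, 2]
# [[5, 6, 7, 8, 9]] [5]
```

Only two padding tokens are needed here, whereas padding all five sequences to the global maximum of 5 would need twelve.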

Save the vocabulary's token-to-index mapping for later use at inference time:

In [18]:
import os, pickle
with open('drive/My Drive/END2/Session6-Assignment/tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(Tweet.vocab.stoi, tokens)

## Defining Our Model

### Defining encoder class

In [31]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers):
        super().__init__()

        self.hid_dim = hid_dim
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, batch_first=True)

    def forward(self, text, text_lengths):
        # text: [batch size, sentence length] -> embedded: [batch size, sentence length, emb dim]
        embedded = self.embedding(text)

        # pack the batch so the LSTM skips padding; the lengths must live on the CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)

        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # packed_output[0] is the packed data tensor: the hidden vector produced after every word (time step)
        print(f'Output of encoder at every step:{packed_output[0]}')

        # hidden holds the hidden state after the last time step of each sequence
        print(f'Output of encoder at last step:{hidden}')

        return hidden, cell
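
To check what the two printed tensors in the encoder actually contain, here is a small self-contained sketch on a toy batch. The sizes (vocabulary of 10, embedding of 4, hidden size of 5) and the token ids are made up for illustration; only the PyTorch calls mirror the encoder above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy batch: 2 sequences of token ids, padded with 1; true lengths are 3 and 2.
text = torch.tensor([[4, 5, 6],
                     [7, 8, 1]])
lengths = torch.tensor([3, 2])          # sorted longest first

embedding = nn.Embedding(10, 4)
rnn = nn.LSTM(4, 5, num_layers=1, batch_first=True)

packed = nn.utils.rnn.pack_padded_sequence(embedding(text), lengths, batch_first=True)
packed_output, (hidden, cell) = rnn(packed)

print(packed_output.data.shape)   # torch.Size([5, 5]): 3 + 2 real time steps, hid_dim 5
print(hidden.shape)               # torch.Size([1, 2, 5]): (n_layers, batch, hid_dim)

# Unpack to (batch, max_len, hid_dim) and compare with the final hidden state.
output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
last_steps = torch.stack([output[i, n - 1] for i, n in enumerate(lengths.tolist())])
print(torch.allclose(last_steps, hidden[-1]))   # True
```

So `hidden` is the per-step output taken at each sequence's last real token, with padding positions skipped.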



### Defining Decoder class

In [32]:
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, out_dim, hid_dim, n_layers):
        super().__init__()

        self.hid_dim = hid_dim
        self.out_dim = out_dim

        # not batch_first: expects input of shape [seq len, batch size, hid dim]
        self.rnn = nn.LSTM(hid_dim, hid_dim, num_layers=n_layers)
        self.fc = nn.Linear(hid_dim, out_dim)

    def forward(self, input, hidden, cell):
        # a single decoding step, starting from the encoder's final states
        output, (hidden, cell) = self.rnn(input, (hidden, cell))

        print(f'Output of decoder at every step (Note that here it is single step):{output[0]}')

        # hidden: [1, batch size, hid dim] -> prediction: [batch size, out dim]
        prediction = self.fc(hidden.squeeze(0))

        return prediction, hidden, cell


    


### Combining encoder decoder 

In [33]:
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        #print(encoder.hid_dim)
        #print(decoder.hid_dim)
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        
    def forward(self, text, text_len):

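        # encode the tweet; hidden and cell are the encoder's final states
        hidden, cell = self.encoder(text, text_len)

        # the final hidden state also serves as the decoder's single input step
        input = hidden

        output, hidden, cell = self.decoder(input, hidden, cell)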

        return output
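
It may not be obvious why the encoder's hidden state can be fed to the decoder as its input. The sketch below traces the shapes for the single-layer case, using random stand-in tensors and made-up sizes (batch of 4, hidden size of 8, 3 classes); it is an illustration rather than the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, hid_dim, out_dim = 4, 8, 3

# Stand-ins for the encoder's final states with n_layers = 1.
hidden = torch.randn(1, batch, hid_dim)
cell = torch.randn(1, batch, hid_dim)

rnn = nn.LSTM(hid_dim, hid_dim, num_layers=1)   # expects (seq_len, batch, input_size)
fc = nn.Linear(hid_dim, out_dim)

# hidden doubles as the input: its leading dimension of 1 is read as seq_len = 1.
output, (new_hidden, new_cell) = rnn(hidden, (hidden, cell))
logits = fc(new_hidden.squeeze(0))

print(output.shape)      # torch.Size([1, 4, 8])
print(logits.shape)      # torch.Size([4, 3])
# With one layer and one step, the step output equals the final hidden state.
print(torch.allclose(output[0], new_hidden[0]))   # True
```

This relies on n_layers being 1. With more layers, the leading dimension of the hidden state would be read as several time steps, and `squeeze(0)` would no longer remove it.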

In [34]:
INPUT_DIM = len(Tweet.vocab)
OUTPUT_DIM = len(Label.vocab)
ENC_EMB_DIM = 256
HID_DIM = 512
NUM_LAYERS = 1

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM,NUM_LAYERS)
dec = Decoder(OUTPUT_DIM,HID_DIM,NUM_LAYERS)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = EncoderDecoder(enc, dec, device).to(device)

In [35]:
model

EncoderDecoder(
  (encoder): Encoder(
    (embedding): Embedding(4651, 256)
    (rnn): LSTM(256, 512, batch_first=True)
  )
  (decoder): Decoder(
    (rnn): LSTM(512, 512)
    (fc): Linear(in_features=512, out_features=3, bias=True)
  )
)

### Defining loss and accuracy of the model

In [36]:
import torch.optim as optim

# define optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.CrossEntropyLoss()

# define metric
def categorical_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    _, predictions = torch.max(preds, 1)
    correct = (predictions == y).float() 
    acc = correct.sum() / len(correct)
    return acc
    
    
# push to cuda if available
criterion = criterion.to(device)
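
As a sanity check on what the loss and the metric compute, here is a pure-Python worked example on a tiny batch of made-up logits. It mirrors the usual definition of cross-entropy on raw scores (log-softmax followed by negative log-likelihood, averaged over the batch); the helper names are ours.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class, for one example."""
    m = max(logits)                                   # subtract the max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def accuracy(batch_logits, targets):
    """Fraction of examples whose highest-scoring class is the target."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

batch_logits = [[2.0, 0.5, -1.0],
                [0.1, 0.3, 0.2],
                [0.0, 0.0, 3.0],
                [1.5, 1.0, 0.0]]
targets = [0, 1, 2, 1]

losses = [cross_entropy(row, t) for row, t in zip(batch_logits, targets)]
mean_loss = sum(losses) / len(losses)     # batch mean
acc = accuracy(batch_logits, targets)

print(mean_loss)
print(acc)                                # 0.75
```

A model that outputs equal scores for all three classes gets a loss of ln 3, about 1.099, which is a handy reference point when reading the training log.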

### Training loop

In [37]:
def train(model, iterator, optimizer, criterion):
    
    # initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    # set the model in training phase
    model.train()  
    
    for batch in iterator:
        
        # resets the gradients after every batch
        optimizer.zero_grad()   
        
        # retrieve text and no. of words
        tweet, tweet_lengths = batch.tweets   

  
        # forward pass; predictions: [batch size, number of classes]
        # (squeeze() only drops dimensions of size 1, so it is a no-op here)
        predictions = model(tweet, tweet_lengths).squeeze()
        
        
        # compute the loss
        loss = criterion(predictions, batch.labels)        
        
        # compute the categorical accuracy
        acc = categorical_accuracy(predictions, batch.labels)

        # backpropagate the loss and compute the gradients
        loss.backward()       
        
        # update the weights
        optimizer.step()      
        
        # loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()  

    return epoch_loss / len(iterator), epoch_acc / len(iterator)


### Evaluation Loop

In [38]:
def evaluate(model, iterator, criterion):
    
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # deactivating dropout layers
    model.eval()
    
    # deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
        
            # retrieve text and no. of words
            tweet, tweet_lengths = batch.tweets 
            
            #print(batch.labels.shape)

            # forward pass; predictions: [batch size, number of classes]
            predictions = model(tweet, tweet_lengths).squeeze()
            
            # compute loss and accuracy
            loss = criterion(predictions, batch.labels)

            acc = categorical_accuracy(predictions, batch.labels)
            
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Model Training and Evaluation

In [39]:

N_EPOCHS = 10
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    # train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    
    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'drive/My Drive/END2/Session6-Assignment/encoder_decoder_classification_saved_weights.pt')
    
    print(f'\t Epoch: {epoch} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Epoch: {epoch} | Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')

	 Epoch: 0 | Train Loss: 0.879 | Train Acc: 67.01%
	 Epoch: 0 | Val. Loss: 0.763 |  Val. Acc: 68.30% 

	 Epoch: 1 | Train Loss: 0.692 | Train Acc: 73.52%
	 Epoch: 1 | Val. Loss: 0.684 |  Val. Acc: 74.55% 

	 Epoch: 2 | Train Loss: 0.591 | Train Acc: 77.57%
	 Epoch: 2 | Val. Loss: 0.670 |  Val. Acc: 75.45% 

	 Epoch: 3 | Train Loss: 0.485 | Train Acc: 81.50%
	 Epoch: 3 | Val. Loss: 0.670 |  Val. Acc: 77.23% 

	 Epoch: 4 | Train Loss: 0.371 | Train Acc: 86.82%
	 Epoch: 4 | Val. Loss: 0.716 |  Val. Acc: 79.46% 

	 Epoch: 5 | Train Loss: 0.240 | Train Acc: 91.64%
	 Epoch: 5 | Val. Loss: 0.683 |  Val. Acc: 78.57% 

	 Epoch: 6 | Train Loss: 0.207 | Train Acc: 93.07%
	 Epoch: 6 | Val. Loss: 0.740 |  Val. Acc: 80.80% 

	 Epoch: 7 | Train Loss: 0.147 | Train Acc: 95.35%
	 Epoch: 7 | Val. Loss: 0.749 |  Val. Acc: 78.57% 

	 Epoch: 8 | Train Loss: 0.072 | Train Acc: 98.65%
	 Epoch: 8 | Val. Loss: 0.917 |  Val. Acc: 80.80% 

	 Epoch: 9 | Train Loss: 0.035 | Train Acc: 99.49%
	 Epoch: 9 | Val. Loss
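
The log shows training accuracy climbing towards 99% while validation loss bottoms out early and then rises, which is the usual sign of overfitting. The snippet below replays the checkpointing rule from the training cell on the validation losses as printed. These are rounded to three decimals and epoch 9 is cut off, so this is only an approximation of what happened during the real run.

```python
# Validation losses as printed above (rounded to 3 decimals); epoch 9 is cut off.
val_losses = [0.763, 0.684, 0.670, 0.670, 0.716, 0.683, 0.740, 0.749, 0.917]

best_valid_loss = float("inf")
saved_epochs = []
for epoch, valid_loss in enumerate(val_losses):
    if valid_loss < best_valid_loss:      # same strict test as the training cell
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)

print(saved_epochs)       # [0, 1, 2]
print(best_valid_loss)    # 0.67
```

On these rounded figures the saved weights come from epoch 2. With the unrounded losses, epoch 3 could also have triggered a save, since the two values tie at three decimals.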

### Validating the model on sample tweets, with the encoder and decoder outputs printed at each time step

In [28]:
path = 'drive/My Drive/END2/Session6-Assignment/encoder_decoder_classification_saved_weights.pt'
model.load_state_dict(torch.load(path));
model.eval();
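# load the pickled token-to-index mapping saved earlier
with open('drive/My Drive/END2/Session6-Assignment/tokenizer.pkl', 'rb') as tokenizer_file:
    tokenizer = pickle.load(tokenizer_file)

# inference

import spacy
# the 'en' shortcut works in spaCy 2.x; spaCy 3 needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')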

def classify_tweet(tweet):
    
    categories = {0: "Negative", 1:"Positive", 2:"Neutral"}
    
    # tokenize the tweet
    tokenized = [tok.text for tok in nlp.tokenizer(tweet)]
    # convert to an integer sequence using the saved token-to-index dictionary
    indexed = [tokenizer[t] for t in tokenized]
    # number of tokens
    length = [len(indexed)]
    # convert to tensor
    tensor = torch.LongTensor(indexed).to(device)
    # reshape to [batch size = 1, number of tokens]
    tensor = tensor.unsqueeze(1).T
    # lengths as a tensor
    length_tensor = torch.LongTensor(length)

    # Get the model prediction                  
    prediction = model(tensor, length_tensor)

    _, pred = torch.max(prediction, 1) 
    
    return categories[pred.item()]

In [29]:
twt="Today is a beautiful day"
print(f'Tweet : {twt}')
print(f'Predicted Sentiment : {classify_tweet(twt)} \n')

Tweet : Today is a beautiful day
Output of encoder at every step:tensor([[ 4.1017e-02,  3.8177e-02, -1.4185e-01,  ...,  7.6511e-02,
         -6.1882e-02,  1.3359e-01],
        [ 1.5422e-01, -4.6522e-02, -1.4529e-01,  ..., -1.1428e-02,
         -1.1638e-01, -2.7448e-02],
        [ 8.7313e-02,  6.1904e-02, -1.0125e-01,  ...,  1.0101e-02,
          1.9596e-04,  5.5578e-02],
        [-5.3976e-02,  3.0859e-02, -2.3801e-01,  ...,  1.2530e-01,
         -2.4061e-03, -4.9863e-02],
        [-1.3361e-02, -1.2220e-01, -1.5472e-01,  ...,  5.5615e-02,
         -5.1296e-02, -1.1578e-01]], grad_fn=<CatBackward>)
Output of encoder at last step:tensor([[[-1.3361e-02, -1.2220e-01, -1.5472e-01,  1.7765e-01, -2.7626e-02,
          -1.1950e-01, -6.1540e-02, -1.6780e-01,  3.8703e-02, -6.7190e-02,
           2.0765e-01,  1.2201e-01, -1.2193e-01,  1.3181e-01,  1.0310e-01,
           6.4831e-02,  2.3413e-01,  1.5577e-01,  1.9202e-03,  3.6987e-02,
          -1.1714e-01,  8.2315e-02, -4.1326e-02, -5.8618e-02,  2.

In [30]:
twt="This is my first encoder decoder model"
print(f'Tweet : {twt}')
print(f'Predicted Sentiment : {classify_tweet(twt)} \n')

Tweet : This is my first encoder decoder model
Output of encoder at every step:tensor([[-0.0962,  0.0106, -0.0364,  ...,  0.0622,  0.0169, -0.0738],
        [ 0.0962, -0.0580, -0.1003,  ..., -0.0934, -0.0778, -0.1424],
        [-0.0245,  0.0443, -0.0313,  ...,  0.0421, -0.0003, -0.1922],
        ...,
        [-0.0727, -0.0507, -0.1769,  ...,  0.1477,  0.0570, -0.1120],
        [-0.1469, -0.0145, -0.2458,  ...,  0.1483,  0.0508, -0.0986],
        [-0.1820,  0.0041, -0.2727,  ...,  0.1500,  0.0334, -0.1067]],
       grad_fn=<CatBackward>)
Output of encoder at last step:tensor([[[-0.1820,  0.0041, -0.2727,  0.0875,  0.1219, -0.0830,  0.2838,
          -0.2005, -0.1425,  0.0525,  0.0711,  0.2949, -0.3792,  0.2098,
          -0.1306,  0.0744, -0.1069,  0.0665, -0.1161, -0.0232,  0.0521,
          -0.1231,  0.2043,  0.1250,  0.3290, -0.2222,  0.2749, -0.0139,
          -0.1411, -0.2938, -0.0789, -0.0240, -0.0107, -0.1770, -0.1599,
          -0.0317,  0.1608, -0.3691, -0.2762, -0.0548, -0.026