
# Lab04

Lab 4 covers recurrent neural networks and sequence modelling:


*   Recurrent Neural Networks
*   Sequence Modelling (Seq2Seq)


# Exercise (Text classification using LSTM)

In this exercise, you will implement an LSTM model for a text classification problem. The preprocessing and embedding of the dataset have already been done, so you only need to focus on the model.

**Sequence Modelling**

![alt text](https://usydnlpgroup.files.wordpress.com/2020/03/lstm_textclassification-e1584855309361.png)
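The many-to-one setup used for classification can be sketched as follows: the LSTM reads the whole sequence, and only the output at the final time step is passed to a linear layer. This is a minimal sketch with made-up sizes (the values of `batch_size`, `seq_length`, `n_input`, `n_hidden` and `n_class` here are assumptions for illustration, not the lab's data).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration
batch_size, seq_length, n_input, n_hidden, n_class = 4, 6, 5, 8, 3

lstm = nn.LSTM(n_input, n_hidden, batch_first=True)
linear = nn.Linear(n_hidden, n_class)

x = torch.randn(batch_size, seq_length, n_input)  # (batch, seq, feature)
output, (h_n, c_n) = lstm(x)                      # output: (batch, seq, hidden)

last_step = output[:, -1, :]                      # output at the final time step
logits = linear(last_step)                        # (batch, n_class)

print(output.shape, last_step.shape, logits.shape)
# For a single-layer, unidirectional LSTM the last output equals h_n
print(torch.allclose(last_step, h_n[-1]))
```

With `batch_first=True` the batch dimension comes first, which is why the last time step is selected with `[:, -1, :]`.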



In [0]:
import torch
#If you enable GPU here, device will be cuda, otherwise it will be cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Downloading dataset

In [0]:
# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = '1ORrHW9moXLcWwg8WY9o-Ulq8X9BAiD1P'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('train.pkl')

file_id = '1eb4gtE8XlN3TcZqzwS18Ik-H7MFAeW4z'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('label.pkl')

import pickle
# Load the pre-computed embeddings and labels, closing each file afterwards
with open("train.pkl", "rb") as f:
    input_embeddings = pickle.load(f)
with open("label.pkl", "rb") as f:
    label = pickle.load(f)

### Split the dataset

In [0]:
# Split into training and testing dataset using scikit-learn
# For more details, you can refer to: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
train_embeddings, test_embeddings, train_label, test_label = train_test_split(input_embeddings,label,test_size = 0.2, random_state=0)


## Generate batch

In [0]:
import numpy as np

def generate_batch(input_embeddings, label, batch_size):
    # Draw batch_size random sample indices (with replacement)
    idx = np.random.randint(input_embeddings.shape[0], size=batch_size)
    return input_embeddings[idx, :, :], label[idx]
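
To see what a batch looks like, the sketch below calls the same function on stand-in data. The arrays `fake_embeddings` and `fake_label` are hypothetical and only mimic the assumed layout of the real data: (samples, sequence length, embedding size) for the inputs and one integer label per sample.

```python
import numpy as np

def generate_batch(input_embeddings, label, batch_size):
    # Same sampling as above: indices are drawn with replacement
    idx = np.random.randint(input_embeddings.shape[0], size=batch_size)
    return input_embeddings[idx, :, :], label[idx]

# Hypothetical stand-in data: 10 documents, 4 tokens each, 3-dimensional embeddings
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 4, 3))
fake_label = np.arange(10) % 2

batch_x, batch_y = generate_batch(fake_embeddings, fake_label, batch_size=6)
print(batch_x.shape)  # (6, 4, 3)
print(batch_y.shape)  # (6,)
```

Because indices are drawn with replacement, the same sample can appear more than once in a batch, and one pass of the training loop is a single mini-batch update rather than a full pass over the training set.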

## Model (please complete the following sections)

**NOTE**: By tuning the hyperparameters, you should achieve a test-set "weighted avg" F1 of **at least 0.4**. (Training involves randomness, so tutors will run your code several times, and at least one run should reach 0.4.)

***What is F1?***

![alt text](https://1.bp.blogspot.com/-nkFFqViboVM/XWwaQ5x1YpI/AAAAAAAAAP8/XzTH9hfJSfswcRjxSeQFEU6-yKQCwc0EQCLcBGAs/s640/main-qimg-447d6cdb02d2cc097ff1e6083a6bdc37.png)
![alt text](https://i.stack.imgur.com/U0hjG.png)
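
As a small worked example, F1 is the harmonic mean of precision and recall, which can be computed directly from the counts of true positives, false positives and false negatives. The helper `f1_from_counts` and the counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one class
precision, recall, f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(round(precision, 4), round(recall, 4), round(f1, 4))
# prints: 0.8 0.6667 0.7273
```

The "weighted avg" F1 in a classification report is the mean of the per-class F1 scores, weighted by the number of true samples in each class.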


In [0]:
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(n_input, n_hidden, batch_first =True)
        self.linear = nn.Linear(n_hidden,n_class)

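    def forward(self, x):
        # Please complete the code for forward propagation
        # lstm layer: output x has shape (batch, seq_length, n_hidden)
        x, _ = self.lstm(x)

        # linear layer, applied to the output at the last time step
        x = self.linear(x[:, -1, :])

        # log-softmax layer (nn.NLLLoss expects log-probabilities)
        x = F.log_softmax(x, dim=1)
        return x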


In [0]:
import numpy as np
import torch.optim as optim

# Please assign values to these variables by using other variables (instead of hard-coding them)
seq_length = len(train_embeddings[0])  # Number of time steps (tokens) per input sequence
n_input = len(train_embeddings[0][0])  # Embedding dimension of each token
n_class = len(np.unique(label))  # Number of distinct classes, taken from the full label set

# Please choose the hyperparameters yourself
n_hidden = 128
batch_size = 128
total_epoch = 1000
learning_rate = 0.05
shown_interval = 100




In [7]:
from sklearn.metrics import accuracy_score

net = Net().to(device)
criterion = nn.NLLLoss()

# Please find which optimizer gives the higher F1
optimizer = optim.Adam(net.parameters(), lr=learning_rate)

# Note: each "epoch" here is a single update on one randomly sampled mini-batch
for epoch in range(total_epoch):

    input_batch, target_batch = generate_batch(train_embeddings,train_label, batch_size)
    input_batch_torch = torch.from_numpy(input_batch).float().to(device)
    target_batch_torch = torch.from_numpy(target_batch).view(-1).to(device)

    net.train()
    outputs = net(input_batch_torch) 
    loss = criterion(outputs, target_batch_torch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % shown_interval == shown_interval-1:
        net.eval()
        # Gradients are not needed for evaluation
        with torch.no_grad():
            outputs = net(input_batch_torch)
            train_loss = criterion(outputs, target_batch_torch)
        _, predicted = torch.max(outputs, 1)
        # accuracy_score takes (y_true, y_pred)
        train_acc = accuracy_score(target_batch_torch.cpu().numpy(), predicted.cpu().numpy())

        print('Epoch: %d, train loss: %.5f, train_acc: %.4f'%(epoch + 1, train_loss.item(), train_acc))

print('Finished Training')

## Prediction
net.eval()
# Gradients are not needed for prediction
with torch.no_grad():
    outputs = net(torch.from_numpy(test_embeddings).float().to(device))
_, predicted = torch.max(outputs, 1)

from sklearn.metrics import classification_report
print(classification_report(test_label, predicted.cpu().numpy(),digits=4))

Epoch: 100, train loss: 0.99796, train_acc: 0.4766
Epoch: 200, train loss: 0.47127, train_acc: 0.7734
Epoch: 300, train loss: 0.37888, train_acc: 0.8359
Epoch: 400, train loss: 0.34990, train_acc: 0.8359
Epoch: 500, train loss: 0.18187, train_acc: 0.9141
Epoch: 600, train loss: 0.08888, train_acc: 0.9688
Epoch: 700, train loss: 0.06564, train_acc: 0.9688
Epoch: 800, train loss: 0.01522, train_acc: 1.0000
Epoch: 900, train loss: 0.05386, train_acc: 0.9844
Epoch: 1000, train loss: 0.03382, train_acc: 0.9844
Finished Training
              precision    recall  f1-score   support

           0     0.7115    0.7957    0.7513        93
           1     0.9316    0.8934    0.9121       122
           2     0.8812    0.8165    0.8476       109
           3     0.8000    0.8125    0.8062       128

    accuracy                         0.8319       452
   macro avg     0.8311    0.8295    0.8293       452
weighted avg     0.8369    0.8319    0.8335       452
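
The "macro avg" and "weighted avg" F1 values in the report above can be checked by hand from the per-class rows. The sketch below copies the four F1 scores and supports from that report. The macro average is the plain mean, and the weighted average weights each class by its support.

```python
# Per-class F1 scores and supports copied from the report above
f1_scores = [0.7513, 0.9121, 0.8476, 0.8062]
supports = [93, 122, 109, 128]

macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 4))     # 0.8293
print(round(weighted_f1, 4))  # 0.8335
```

Both values match the report, and the weighted figure is the one the exercise threshold of 0.4 refers to.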



In [0]:
#The following is the sample output 
#As mentioned in the previous labs, it is impossible to get the same result (randomness in the training process).