
# Lab04

In Lab 4, we study Recurrent Neural Networks and sequence modelling:


*   Recurrent Neural Networks
*   Sequence Modelling (Seq2Seq)


# Exercise (Text classification using LSTM)

In this exercise, you will implement an LSTM model for text classification. The preprocessing and embedding steps have already been done for this dataset, so you only need to focus on the model.

**Sequence Modelling**

![alt text](https://usydnlpgroup.files.wordpress.com/2020/03/lstm_textclassification-e1584855309361.png)
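
The data flow through such a classifier can be sketched with toy tensors. This is a minimal illustration, not the lab solution: all sizes below are hypothetical and unrelated to the lab dataset. It shows that, with `batch_first=True`, the LSTM returns one hidden vector per time step, and that the vector at the last time step is what a linear layer can turn into class scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration
batch, seq_len, emb_dim, hidden, n_classes = 4, 7, 10, 16, 3

lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
linear = nn.Linear(hidden, n_classes)

x = torch.randn(batch, seq_len, emb_dim)  # (batch, seq_len, emb_dim)
out, (h_n, c_n) = lstm(x)                 # out: (batch, seq_len, hidden)
last_step = out[:, -1, :]                 # output at the final time step
logits = linear(last_step)                # (batch, n_classes)

print(out.shape, last_step.shape, logits.shape)
```

For a single-layer, unidirectional LSTM such as this one, `out[:, -1, :]` holds the same values as the final hidden state `h_n[-1]`.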



In [0]:
import torch
# If a GPU runtime is enabled, device will be cuda; otherwise it will be cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Downloading dataset

In [0]:
# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = '1ORrHW9moXLcWwg8WY9o-Ulq8X9BAiD1P'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('train.pkl')

file_id = '1eb4gtE8XlN3TcZqzwS18Ik-H7MFAeW4z'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('label.pkl')

import pickle
# Load the precomputed embeddings and labels; `with` closes each file afterwards
with open("train.pkl", "rb") as f:
    input_embeddings = pickle.load(f)
with open("label.pkl", "rb") as f:
    label = pickle.load(f)

### Split the dataset

In [0]:
# Split into training and testing dataset using scikit-learn
# For more details, you can refer to: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
train_embeddings, test_embeddings, train_label, test_label = train_test_split(input_embeddings,label,test_size = 0.2, random_state=0)

## Generate batch

In [0]:
import numpy as np

def generate_batch(input_embeddings, label, batch_size):
    # Draw batch_size random indices (with replacement) along the sample axis
    idx = np.random.randint(input_embeddings.shape[0], size=batch_size)
    return input_embeddings[idx, :, :], label[idx]
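
As a quick check of what `generate_batch` returns, here is a small sketch on made-up data (the array sizes and labels are hypothetical). Because `np.random.randint` samples indices with replacement, the same example can appear more than once in a batch.

```python
import numpy as np

def generate_batch(input_embeddings, label, batch_size):
    idx = np.random.randint(input_embeddings.shape[0], size=batch_size)
    return input_embeddings[idx, :, :], label[idx]

np.random.seed(0)

# Hypothetical toy data: 100 samples, 5 tokens each, 8-dimensional embeddings
toy_embeddings = np.zeros((100, 5, 8))
toy_label = np.arange(100) % 4  # four made-up classes

x_batch, y_batch = generate_batch(toy_embeddings, toy_label, batch_size=32)
print(x_batch.shape, y_batch.shape)  # (32, 5, 8) (32,)
```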

## Model (please complete the following sections)

**NOTE**: By tuning the hyperparameters, you should achieve a "weighted avg" F1 of **at least 0.4** on the test set. (Training involves randomness, so tutors will run your code several times; at least one of those runs should reach 0.4.)

***What is F1?***

![alt text](https://1.bp.blogspot.com/-nkFFqViboVM/XWwaQ5x1YpI/AAAAAAAAAP8/XzTH9hfJSfswcRjxSeQFEU6-yKQCwc0EQCLcBGAs/s640/main-qimg-447d6cdb02d2cc097ff1e6083a6bdc37.png)
![alt text](https://i.stack.imgur.com/U0hjG.png)
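
To make the "weighted avg" figure concrete, the sketch below computes it by hand in plain Python: per-class precision, recall and F1, then an average weighted by each class's support (its number of true samples). The function name `weighted_f1` and the toy labels are made up for illustration; in the lab itself the number comes from scikit-learn's `classification_report`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * n / total  # weight each class by its support
    return score

# Hypothetical toy labels: per-class F1 is 0.5, 0.8 and 2/3
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
result = weighted_f1(y_true, y_pred)
print(round(result, 4))  # 0.6556
```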


In [0]:
import torch.nn as nn
import torch.nn.functional as F


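class Net(nn.Module):
    # n_input, n_hidden and n_class are globals defined in the next cell,
    # so that cell must run before Net() is instantiated
    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(n_input, n_hidden, batch_first=True)
        self.linear = nn.Linear(n_hidden, n_class)

    def forward(self, x):
        # x: (batch, seq_len, n_input) -> LSTM output: (batch, seq_len, n_hidden)
        x, _ = self.lstm(x)
        # Classify from the output at the last time step
        x = self.linear(x[:, -1, :])
        # Log-probabilities, as expected by nn.NLLLoss
        x = F.log_softmax(x, dim=1)
        return x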
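
The model above returns log-probabilities from `F.log_softmax`, which is why the training cell pairs it with `nn.NLLLoss`. The short check below, on hypothetical random scores, illustrates that this pairing gives the same loss as applying `nn.CrossEntropyLoss` directly to the raw scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical raw scores for 5 samples and 4 classes
logits = torch.randn(5, 4)
targets = torch.tensor([0, 3, 1, 2, 1])

# log_softmax followed by NLLLoss ...
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
# ... matches CrossEntropyLoss on the raw scores
ce = nn.CrossEntropyLoss()(logits, targets)

same = torch.allclose(nll, ce)
print(same)  # True
```

Either formulation can be used, as long as `log_softmax` is not applied twice.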


In [0]:
import numpy as np
import torch.optim as optim

# Please derive these values from other variables (instead of hard-coding them)
n_input = train_embeddings.shape[2]
n_class = len(set(train_label))

# Please choose the hyperparameters here yourself
n_hidden = 128
batch_size = 32
total_epoch = 1200
learning_rate = 0.01
shown_interval = 10

In [39]:
from sklearn.metrics import accuracy_score

net = Net().to(device)
criterion = nn.NLLLoss()

# Please find which optimizer gives the higher F1
optimizer = optim.Adam(net.parameters(), lr=learning_rate) 

for epoch in range(total_epoch):

    input_batch, target_batch = generate_batch(train_embeddings,train_label, batch_size)
    input_batch_torch = torch.from_numpy(input_batch).float().to(device)
    target_batch_torch = torch.from_numpy(target_batch).view(-1).to(device)

    net.train()
    outputs = net(input_batch_torch) 
    loss = criterion(outputs, target_batch_torch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % shown_interval == shown_interval-1:
        net.eval()
        outputs = net(input_batch_torch) 
        train_loss = criterion(outputs, target_batch_torch)
        _, predicted = torch.max(outputs, 1)
        # accuracy_score expects (y_true, y_pred)
        train_acc = accuracy_score(target_batch_torch.cpu().numpy(), predicted.cpu().numpy())

        print('Epoch: %d, train loss: %.5f, train_acc: %.4f'%(epoch + 1, train_loss.item(), train_acc))

print('Finished Training')

## Prediction
net.eval()
outputs = net(torch.from_numpy(test_embeddings).float().to(device)) 
_, predicted = torch.max(outputs, 1)

from sklearn.metrics import classification_report
print(classification_report(test_label, predicted.cpu().numpy(),digits=4))

Epoch: 10, train loss: 1.40432, train_acc: 0.2500
Epoch: 20, train loss: 1.32862, train_acc: 0.4375
Epoch: 30, train loss: 1.39763, train_acc: 0.3438
Epoch: 40, train loss: 1.35672, train_acc: 0.3125
Epoch: 50, train loss: 3.93274, train_acc: 0.1875
Epoch: 60, train loss: 1.35486, train_acc: 0.1562
Epoch: 70, train loss: 1.34577, train_acc: 0.3125
Epoch: 80, train loss: 1.35674, train_acc: 0.1562
Epoch: 90, train loss: 1.36166, train_acc: 0.2812
Epoch: 100, train loss: 1.33405, train_acc: 0.3438
Epoch: 110, train loss: 1.27161, train_acc: 0.3125
Epoch: 120, train loss: 1.37252, train_acc: 0.3125
Epoch: 130, train loss: 1.29655, train_acc: 0.4062
Epoch: 140, train loss: 1.45031, train_acc: 0.1250
Epoch: 150, train loss: 1.38354, train_acc: 0.2812
Epoch: 160, train loss: 1.30397, train_acc: 0.2812
Epoch: 170, train loss: 1.32033, train_acc: 0.2500
Epoch: 180, train loss: 1.20329, train_acc: 0.4062
Epoch: 190, train loss: 1.22654, train_acc: 0.4375
Epoch: 200, train loss: 1.29728, train_a

In [0]:
# The following is a sample output.
# As mentioned in the previous labs, you will not get exactly the same result (there is randomness in the training process).

Epoch: 100, train loss: 1.21694, train_acc: 0.4850
Epoch: 200, train loss: 1.24536, train_acc: 0.3950
Epoch: 300, train loss: 1.35449, train_acc: 0.3450
Epoch: 400, train loss: 1.34606, train_acc: 0.3350
Epoch: 500, train loss: 1.35706, train_acc: 0.2800
Epoch: 600, train loss: 1.32822, train_acc: 0.3450
Epoch: 700, train loss: 1.15759, train_acc: 0.5050
Epoch: 800, train loss: 1.03310, train_acc: 0.5100
Epoch: 900, train loss: 1.37713, train_acc: 0.2700
Epoch: 1000, train loss: 1.36268, train_acc: 0.3200
Epoch: 1100, train loss: 1.07713, train_acc: 0.4650
Epoch: 1200, train loss: 0.81771, train_acc: 0.5200
Epoch: 1300, train loss: 0.76015, train_acc: 0.5100
Epoch: 1400, train loss: 0.76653, train_acc: 0.5650
Epoch: 1500, train loss: 0.76491, train_acc: 0.6650
Epoch: 1600, train loss: 0.59196, train_acc: 0.6950
Epoch: 1700, train loss: 1.06773, train_acc: 0.6250
Epoch: 1800, train loss: 0.54543, train_acc: 0.6950
Epoch: 1900, train loss: 0.50840, train_acc: 0.7450
Epoch: 2000, train lo