<a href="https://colab.research.google.com/github/JohnnyPeng123/NLP-USYD/blob/master/Lab04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab04

In Lab 4, we learn about Recurrent Neural Networks and sequence modelling:


*   Recurrent Neural Networks
*   Sequence Modelling (Seq2Seq)


In [0]:
import torch
# Use the GPU (cuda) if it is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# RNN
A **recurrent neural network (RNN)** is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs.
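The "internal state" mentioned above can be made concrete with a small sketch. Assuming the standard tanh recurrence that `nn.RNN` uses by default, the hidden state at each step is computed from the current input and the previous hidden state. The sizes below are hypothetical and chosen only for illustration; the manual loop is compared against PyTorch's own result.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
input_size, hidden_size, seq_len = 4, 3, 5

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
x = torch.randn(1, seq_len, input_size)  # (batch_size, seq_len, feature)

with torch.no_grad():
    # Reference result from PyTorch
    output, h_n = rnn(x)

    # Manual unrolling of h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
    h = torch.zeros(1, hidden_size)
    states = []
    for t in range(seq_len):
        h = torch.tanh(
            x[:, t, :] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )
        states.append(h)
    manual_output = torch.stack(states, dim=1)

matches = torch.allclose(output, manual_output, atol=1e-5)
print("manual recurrence matches nn.RNN:", matches)
```

Because the same weights are reused at every step, the loop works for any `seq_len`, which is why an RNN can process variable-length sequences.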


## Predict the last character of a word

In [0]:
import numpy as np

#Assume that we have the following character instances
char_arr = ['a', 'b', 'c', 'd', 'e', 'f', 'g',
            'h', 'i', 'j', 'k', 'l', 'm', 'n',

            'o', 'p', 'q', 'r', 's', 't', 'u',
            'v', 'w', 'x', 'y', 'z']

# Map each character to an integer index (used for one-hot encoding and decoding)
# {'a': 0, 'b': 1, 'c': 2, ..., 'j': 9, 'k': 10, ...}
num_dic = {n: i for i, n in enumerate(char_arr)}
dic_len = len(num_dic)


# A list of words used as sequence data (input and output)
seq_data = ['word', 'wood', 'deep', 'dive', 'cold', 'cool', 'load', 'love', 'kiss', 'kind']

# Make a batch that splits each word into input and output
# wor -> X, d -> Y
# dee -> X, p -> Y
def make_batch(seq_data):
    input_batch = []
    target_batch = []
    
    for seq in seq_data:
        # input data is:
        #     wor           woo        dee       div
        # [22, 14, 17] [22, 14, 14] [3, 4, 4] [3, 8, 21] ...
        
        input_data = [num_dic[n] for n in seq[:-1]]
        
        # target is :
        # d, d, p, e, ...
        # 3, 3, 15, 4, ...
        target = num_dic[seq[-1]]
        
        # convert input to one-hot encoding.
        # if input is [3, 4, 4]:
        # [[ 0,  0,  0,  1,  0,  0,  0, ... 0]
        #  [ 0,  0,  0,  0,  1,  0,  0, ... 0]
        #  [ 0,  0,  0,  0,  1,  0,  0, ... 0]]
        input_batch.append(np.eye(dic_len)[input_data])
        
        target_batch.append([target])

    return input_batch, target_batch


In [0]:
### Setting hyperparameters

learning_rate = 0.1
n_hidden = 128
total_epoch = 50

# Number of time steps per input sequence (the first 3 characters of each word)
n_step = 3

# number of inputs (dimension of input vector) = 26
n_input = dic_len
# number of classes = 26
n_class = dic_len


### Dropout

Dropout randomly disables hidden units during training. This makes each hidden unit more robust and drives it towards creating useful features on its own, without relying on other hidden units to correct its mistakes.

![dropout](https://cdn-images-1.medium.com/max/800/1*D8jriroKkjno8RztHKmMnA.png)
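As a minimal sketch of this behaviour, the snippet below applies `nn.Dropout` to a vector of ones in training mode and in evaluation mode. The dropout rate of 0.5 and the vector length are assumptions made only for this illustration; they are not the values used by the model in this lab.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative rate; the lab's models use smaller rates
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p)
drop.train()
y_train = drop(x)

# Evaluation mode: dropout is switched off and the input passes through unchanged
drop.eval()
y_eval = drop(x)

zero_fraction = (y_train == 0).float().mean().item()
survivor_values = y_train[y_train != 0].unique()

print("fraction zeroed in training mode:", zero_fraction)
print("value of surviving elements:", survivor_values)
print("unchanged in eval mode:", torch.equal(y_eval, x))
```

This is why the training loop calls `net.train()` before the update step and `net.eval()` before measuring loss and accuracy.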

## Model

In [0]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.metrics import accuracy_score

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # LSTM layer, the batch_first is False by default, which means the input and output tensors are provided as (seq_len, batch_size, feature) 
        # We need to set it to True because we are using input of shape (batch_size, seq_len, feature)  
        # Apply dropout to prevent overfitting, you can try to change the dropout rate. Note that this dropout is applied on outputs of each LSTM layer except the last layer
        self.lstm = nn.LSTM(n_input, n_hidden, num_layers=2, batch_first =True, dropout=0.2)
        # Linear layer for output
        self.linear = nn.Linear(n_hidden,n_class)

    def forward(self, x):
        # There are two outputs from nn.LSTM:
        # 1. tensor of shape (batch_size, seq_len, hidden_size) containing the output features from the last layer of the LSTM for each time step t
        # 2. the tuple containing the hidden state and cell state.  
        # Here we only care about the first output. Details for the two outputs can be found in PyTorch documentation for nn.LSTM: https://pytorch.org/docs/stable/nn.html#lstm
        x,_ = self.lstm(x)
        # Extract the output at the last time step only.
        # The last hidden state summarises everything the LSTM has seen so far,
        # so the prediction depends on the whole input sequence rather than only on the current time step.
        x = self.linear(x[:,-1,:])
        x = F.log_softmax(x, dim=1)
        return x

# Move the model to GPU
net = Net().to(device)
# Loss function and optimizer
criterion = nn.NLLLoss()
optimizer = optim.Adam(net.parameters(), lr=learning_rate)

# Preparing input
input_batch, target_batch = make_batch(seq_data)
# Convert input into tensors and move them to the device (GPU if available) by using tensor.to(device)
input_batch_torch = torch.from_numpy(np.array(input_batch)).float().to(device)
target_batch_torch = torch.from_numpy(np.array(target_batch)).view(-1).to(device)


for epoch in range(total_epoch):  
    
    # Set the flag to training
    net.train()
    
    # forward + backward + optimize
    outputs = net(input_batch_torch) 
    loss = criterion(outputs, target_batch_torch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Set the flag to evaluation, which will 'turn off' the dropout
    net.eval()
    outputs = net(input_batch_torch) 
    
    # Evaluation loss and accuracy calculation
    loss = criterion(outputs, target_batch_torch)
    _, predicted = torch.max(outputs, 1)
    # accuracy_score expects (y_true, y_pred)
    acc = accuracy_score(target_batch_torch.cpu().numpy(), predicted.cpu().numpy())

    print('Epoch: %d, loss: %.5f, train_acc: %.2f' %(epoch + 1, loss.item(), acc))

print('Finished Training')

## Prediction
predict_words = []
for i in range(len(predicted.cpu().numpy())):
    ind = predicted.cpu().numpy()[i]
    predict_words.append(seq_data[i][:-1]+char_arr[ind])

print('\n=== Prediction Result ===')
print('Input:', [w[:3] + ' ' for w in seq_data])
print('Predicted:', predict_words)
print('Accuracy: %.2f' %acc)


Epoch: 1, loss: 1.56924, train_acc: 0.50
Epoch: 2, loss: 2.45242, train_acc: 0.20
Epoch: 3, loss: 3.20936, train_acc: 0.10
Epoch: 4, loss: 5.20646, train_acc: 0.10
Epoch: 5, loss: 1.82331, train_acc: 0.10
Epoch: 6, loss: 1.82802, train_acc: 0.50
Epoch: 7, loss: 1.92820, train_acc: 0.50
Epoch: 8, loss: 1.62308, train_acc: 0.50
Epoch: 9, loss: 1.66038, train_acc: 0.20
Epoch: 10, loss: 1.50756, train_acc: 0.50
Epoch: 11, loss: 1.45583, train_acc: 0.50
Epoch: 12, loss: 1.44720, train_acc: 0.50
Epoch: 13, loss: 1.48125, train_acc: 0.50
Epoch: 14, loss: 1.50806, train_acc: 0.50
Epoch: 15, loss: 1.48324, train_acc: 0.50
Epoch: 16, loss: 1.41949, train_acc: 0.50
Epoch: 17, loss: 1.40930, train_acc: 0.50
Epoch: 18, loss: 1.43036, train_acc: 0.50
Epoch: 19, loss: 1.39648, train_acc: 0.60
Epoch: 20, loss: 1.37174, train_acc: 0.50
Epoch: 21, loss: 1.36832, train_acc: 0.50
Epoch: 22, loss: 1.35106, train_acc: 0.50
Epoch: 23, loss: 1.32657, train_acc: 0.50
Epoch: 24, loss: 1.27323, train_acc: 0.50
...

# Seq2Seq Model (N to M)

Seq2seq turns one sequence into another sequence. It does so using a recurrent neural network (RNN) or, more often, an LSTM or GRU, which mitigate the vanishing gradient problem. The context for each item is the output from the previous step. The primary components are an encoder network and a decoder network. The encoder turns each item into a corresponding hidden vector containing the item and its context. The decoder reverses the process, turning the vector into an output item and using the previous output as the input context.

We are going to implement a sequence to sequence model that translates playing card symbols (Ace, Jack, Queen, King) to their associated number.
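"Using the previous output as the input context" has a practical consequence for the training data: the decoder input is the target sequence shifted right by one position. The sketch below assumes a begin marker `'B'` and an end marker `'E'`, the same special tokens this lab defines in its preprocessing step; the helper name `decoder_pair` is hypothetical and exists only for this illustration.

```python
def decoder_pair(target_text, begin='B', end='E'):
    """Return (decoder_input, decoder_target) for one target string.

    At step t the decoder reads decoder_input[t] and is trained
    to emit decoder_target[t], i.e. the next character.
    """
    decoder_input = begin + target_text
    decoder_target = target_text + end
    return decoder_input, decoder_target


for symbol, number in [['ace', '01'], ['king', '13']]:
    dec_in, dec_out = decoder_pair(number)
    print(symbol, '-> decoder input:', dec_in, '| decoder target:', dec_out)
```

Feeding the ground-truth previous character during training, rather than the model's own prediction, is commonly called teacher forcing.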

## Preprocess data

In [0]:
import torch
import numpy as np


# Sequence data
seq_data = [['ace', '01'], ['jack', '11'],
            ['queen', '12'], ['king', '13']]

# Generate unique tokens list
chars = []
for seq in seq_data:    
    chars += list(seq[0])
    chars += list(seq[1])

# Sort so that the character-to-index mapping is the same on every run
char_arr = sorted(set(chars))

# Special tokens are required:
# B: Beginning of Sequence
# E: Ending of Sequence
# P: Padding of Sequence - for inputs of different lengths
# U: Unknown element of Sequence - placeholder for the decoder input at prediction time
char_arr.append('B')
char_arr.append('E')
char_arr.append('P')
char_arr.append('U')

num_dic = {n: i for i, n in enumerate(char_arr)}

dic_len = len(num_dic)

max_input_words_amount = 5
max_output_words_amount = 3


## Generate batch

In [0]:
# Pad the word with 'P' if it is shorter than the maximum input length
def add_paddings(word):
    diff = max_input_words_amount - len(word)
    return word + 'P' * diff
    

# generate a batch data for training/testing
def make_batch(seq_data):
    input_batch = []
    output_batch = []
    target_batch = []

    for seq in seq_data:
        # Input for encoder cell, convert to vector
        input_word = add_paddings(seq[0])
        input_data = [num_dic[n] for n in input_word]
        
        # Input for decoder cell, Add 'B' at the beginning of the sequence data
        output_data  = [num_dic[n] for n in ('B'+ seq[1])]
        
        # Output of decoder cell (Actual result), Add 'E' at the end of the sequence data
        target = [num_dic[n] for n in (seq[1] + 'E')]

        # Convert each character vector to one-hot encode data
        input_batch.append(np.eye(dic_len)[input_data])
        output_batch.append(np.eye(dic_len)[output_data])
        
        target_batch.append(target)

    return input_batch, output_batch, target_batch

## Build training model

In [0]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.metrics import accuracy_score



class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # RNN encoder. The parameters of nn.RNN are similar to those of nn.LSTM
        self.rnn_encoder = nn.RNN(n_input, n_hidden, batch_first= True)
        # Apply dropout to the output of the RNN. Note the difference compared to the 'dropout=0.2' argument we used for nn.LSTM above
        self.dropout_encoder = nn.Dropout(0.1)
        
        # RNN decoder
        self.rnn_decoder = nn.RNN(n_input, n_hidden, batch_first= True)
        self.dropout_decoder = nn.Dropout(0.1)
        self.linear = nn.Linear(n_hidden,n_class)

    def forward(self, x_encoder, x_decoder):
        # "hidden" contains the encoder's hidden state at t = seq_len.
        _,hidden = self.rnn_encoder(x_encoder)
        hidden = self.dropout_encoder(hidden)
        # [IMPORTANT] Use "hidden" as the initial state of rnn_decoder
        decoder_output,_ = self.rnn_decoder(x_decoder,hidden)
        decoder_output = self.dropout_decoder(decoder_output)
        prediction_output_before_softmax = self.linear(decoder_output)
        output = torch.log_softmax(prediction_output_before_softmax,dim=-1)
        return output

### Setting Hyperparameters
learning_rate = 0.01
n_hidden = 128
total_epoch = 200

n_class = dic_len
n_input = dic_len

net = Net().to(device)
criterion = nn.NLLLoss()
optimizer = optim.Adam(net.parameters(), lr=learning_rate)

input_batch, output_batch, target_batch = make_batch(seq_data)
input_batch_torch = torch.from_numpy(np.array(input_batch)).float().to(device)
output_batch_torch = torch.from_numpy(np.array(output_batch)).float().to(device) 
target_batch_torch = torch.from_numpy(np.array(target_batch)).view(-1).to(device)

for epoch in range(total_epoch):  # loop over the dataset multiple times
   
    net.train()
    optimizer.zero_grad()

    # forward + backward + optimize
    outputs = net(input_batch_torch,output_batch_torch) 
    loss = criterion(outputs.view(-1,outputs.size(-1)), target_batch_torch)
    loss.backward()
    optimizer.step()

    if epoch%10==9:
        print('Epoch: %d, loss: %.5f' %(epoch + 1, loss.item()))

print('Finished Training')

Epoch: 10, loss: 0.51959
Epoch: 20, loss: 0.28586
Epoch: 30, loss: 0.12102
Epoch: 40, loss: 0.11627
Epoch: 50, loss: 0.01521
Epoch: 60, loss: 0.00601
Epoch: 70, loss: 0.00212
Epoch: 80, loss: 0.00231
Epoch: 90, loss: 0.00102
Epoch: 100, loss: 0.00108
Epoch: 110, loss: 0.00076
Epoch: 120, loss: 0.00123
Epoch: 130, loss: 0.00071
Epoch: 140, loss: 0.00080
Epoch: 150, loss: 0.00057
Epoch: 160, loss: 0.00057
Epoch: 170, loss: 0.00051
Epoch: 180, loss: 0.00047
Epoch: 190, loss: 0.00044
Epoch: 200, loss: 0.00050
Finished Training


## Evaluation

In [0]:
def predict(word):
    # Use 'U' (Unknown) as a placeholder for each character to be predicted,
    # e.g. ['king', 'UU']
    word = add_paddings(word)
    
    seq_data = [word, 'U' * 2]

    input_batch, output_batch, target_batch = make_batch([seq_data])
    input_batch_torch = torch.from_numpy(np.array(input_batch)).float().to(device)   
    output_batch_torch = torch.from_numpy(np.array(output_batch)).float().to(device) 

    # Forward pass only (evaluation mode turns off dropout)
    net.eval()
    outputs = net(input_batch_torch,output_batch_torch) 
    _, predicted = torch.max(outputs, -1)
    answer=""
    for j in range(len(predicted.cpu().numpy()[0])-1):
        answer+=char_arr[predicted.cpu().numpy()[0][j]]

    return answer
    
print('=== Prediction result ===')
print('ace ->', predict('ace'))
print('jack ->', predict('jack'))
print('queen ->', predict('queen'))
print('king ->', predict('king'))


=== Prediction result ===
ace -> 01
jack -> 11
queen -> 12
king -> 13


## Please find the difference between standard RNN and LSTM

![alt text](https://usydnlpgroup.files.wordpress.com/2020/03/rnnvslstm-1-e1584852583803.png)

---


![alt text](https://usydnlpgroup.files.wordpress.com/2020/03/rnnvslstm_2-e1584852714227.png)
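One difference can also be checked numerically. As a rough sketch, the snippet below builds a single-layer `nn.RNN` and `nn.LSTM` with the same sizes and compares their parameter counts and returned states. The sizes (26 inputs, 128 hidden units) mirror the first model in this lab but are otherwise arbitrary, and `count_params` is a hypothetical helper written for this example.

```python
import torch
import torch.nn as nn

# Sizes mirror the character model above; any positive values would work
input_size, hidden_size = 26, 128

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)


def count_params(module):
    """Total number of trainable values in a module (helper for this example)."""
    return sum(p.numel() for p in module.parameters())


rnn_params = count_params(rnn)
lstm_params = count_params(lstm)

# A dummy batch: (batch_size=2, seq_len=3, feature=input_size)
x = torch.zeros(2, 3, input_size)
_, h_rnn = rnn(x)               # RNN returns a hidden state only
_, (h_lstm, c_lstm) = lstm(x)   # LSTM returns a hidden state and a cell state

print("RNN parameters: ", rnn_params)
print("LSTM parameters:", lstm_params)
print("hidden state shape:", tuple(h_lstm.shape), "cell state shape:", tuple(c_lstm.shape))
```

The LSTM holds four weight sets of the same shape as the RNN's single set (input, forget, cell and output gates), and it carries an extra cell state alongside the hidden state.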




# Exercise (Text classification using LSTM)

In this exercise, you are going to implement an LSTM model for a text classification problem. Note that the preprocessing and embedding of the dataset have already been done, so you only need to focus on the Model part.

**Sequence Modelling**

![alt text](https://usydnlpgroup.files.wordpress.com/2020/03/lstm_textclassification-e1584855309361.png)



In [0]:
import torch
#If you enable GPU here, device will be cuda, otherwise it will be cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Downloading dataset

In [0]:
# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Use file_id rather than id to avoid shadowing the Python built-in
file_id = '1ORrHW9moXLcWwg8WY9o-Ulq8X9BAiD1P'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('train.pkl')

file_id = '1eb4gtE8XlN3TcZqzwS18Ik-H7MFAeW4z'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('label.pkl')

import pickle
# Use context managers so that the files are closed after loading
with open("train.pkl", "rb") as f:
    input_embeddings = pickle.load(f)
with open("label.pkl", "rb") as f:
    label = pickle.load(f)

### Split the dataset

In [0]:
# Split into training and testing dataset using scikit-learn
# For more details, you can refer to: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
train_embeddings, test_embeddings, train_label, test_label = train_test_split(input_embeddings,label,test_size = 0.2, random_state=0)

## Generate batch

In [0]:
# Draw a random mini-batch of batch_size samples (indices are sampled with replacement)
def generate_batch(input_embeddings, label, batch_size):
    idx = np.random.randint(input_embeddings.shape[0], size=batch_size)
    return input_embeddings[idx,:,:], label[idx]

## Model (please complete the following sections)

**NOTE**: By tuning the hyperparameters, you should achieve **at least 0.4** for the test set "weighted avg" F1. (Training involves randomness, so tutors will run your code several times, and at least one of the runs should reach 0.4.)

***What is F1?***

![alt text](https://1.bp.blogspot.com/-nkFFqViboVM/XWwaQ5x1YpI/AAAAAAAAAP8/XzTH9hfJSfswcRjxSeQFEU6-yKQCwc0EQCLcBGAs/s640/main-qimg-447d6cdb02d2cc097ff1e6083a6bdc37.png)
![alt text](https://i.stack.imgur.com/U0hjG.png)
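To make the "weighted avg" figure concrete, here is a small pure-Python sketch that computes per-class precision, recall and F1, then averages the F1 scores weighted by each class's support (its number of true samples). The function name `weighted_f1` and the toy labels are assumptions made for this illustration; in the lab itself the figure is reported by `classification_report`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight each class by its share of the true labels
        score += f1 * support[c] / total
    return score


# Toy labels for three classes, two samples each
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
result = weighted_f1(y_true, y_pred)
print('weighted F1: %.4f' % result)
```

For these toy labels the per-class F1 scores are 0.5, 0.8 and about 0.667, so with equal supports the weighted average is about 0.6556.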


In [0]:
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(n_input, n_hidden, batch_first =True)
        self.linear = nn.Linear(n_hidden,n_class)

    def forward(self, x):        
        # Please complete the code for forward propagation
        # lstm layer
        # linear layer
        # softmax layer
        return x


In [0]:
import numpy as np
import torch.optim as optim

# Please derive these values from other variables (instead of hard-coding them)
seq_length = 
n_input = 
n_class = 

#Please decide the hyperparameters here by yourself
n_hidden = 
batch_size = 
total_epoch = 
learning_rate = 
shown_interval = 




In [0]:
from sklearn.metrics import accuracy_score

net = Net().to(device)
criterion = nn.NLLLoss()

# Please find which optimizer provides a higher F1
optimizer = 

for epoch in range(total_epoch):

    input_batch, target_batch = generate_batch(train_embeddings,train_label, batch_size)
    input_batch_torch = torch.from_numpy(input_batch).float().to(device)
    target_batch_torch = torch.from_numpy(target_batch).view(-1).to(device)

    net.train()
    outputs = net(input_batch_torch) 
    loss = criterion(outputs, target_batch_torch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % shown_interval == shown_interval-1:
        net.eval()
        outputs = net(input_batch_torch) 
        train_loss = criterion(outputs, target_batch_torch)
        _, predicted = torch.max(outputs, 1)
        # accuracy_score expects (y_true, y_pred)
        train_acc = accuracy_score(target_batch_torch.cpu().numpy(), predicted.cpu().numpy())

        print('Epoch: %d, train loss: %.5f, train_acc: %.4f'%(epoch + 1, train_loss.item(), train_acc))

print('Finished Training')

## Prediction
net.eval()
outputs = net(torch.from_numpy(test_embeddings).float().to(device)) 
_, predicted = torch.max(outputs, 1)

from sklearn.metrics import classification_report
print(classification_report(test_label, predicted.cpu().numpy(),digits=4))

In [0]:
# The following is sample output.
# As mentioned in the previous labs, you are unlikely to get exactly the same result (there is randomness in the training process).

Epoch: 100, train loss: 1.21694, train_acc: 0.4850
Epoch: 200, train loss: 1.24536, train_acc: 0.3950
Epoch: 300, train loss: 1.35449, train_acc: 0.3450
Epoch: 400, train loss: 1.34606, train_acc: 0.3350
Epoch: 500, train loss: 1.35706, train_acc: 0.2800
Epoch: 600, train loss: 1.32822, train_acc: 0.3450
Epoch: 700, train loss: 1.15759, train_acc: 0.5050
Epoch: 800, train loss: 1.03310, train_acc: 0.5100
Epoch: 900, train loss: 1.37713, train_acc: 0.2700
Epoch: 1000, train loss: 1.36268, train_acc: 0.3200
Epoch: 1100, train loss: 1.07713, train_acc: 0.4650
Epoch: 1200, train loss: 0.81771, train_acc: 0.5200
Epoch: 1300, train loss: 0.76015, train_acc: 0.5100
Epoch: 1400, train loss: 0.76653, train_acc: 0.5650
Epoch: 1500, train loss: 0.76491, train_acc: 0.6650
Epoch: 1600, train loss: 0.59196, train_acc: 0.6950
Epoch: 1700, train loss: 1.06773, train_acc: 0.6250
Epoch: 1800, train loss: 0.54543, train_acc: 0.6950
Epoch: 1900, train loss: 0.50840, train_acc: 0.7450
Epoch: 2000, train lo