# Assignment 2
## Section 1 [4 pts]

In the last assignment, you have implemented a simple neural network with fully connected layers to fit a polynomial function.
In this section, let's try something more challenging and realistic: implementing a CNN model for handwritten digit recognition.

This assignment will make you familiar with
1. loading and preprocessing data using built-in function
2. constructing a simple CNN model with PyTorch
3. the training and testing pipeline

You might find this tutorial useful: https://pytorch.org/tutorials/beginner/basics/intro.html.

In [1]:
# Import dependencies.
import time
import datetime
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Set up your device
cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if cuda else "cpu")

In [3]:
# Random seed helps you get the same result everytime you run the code.
# Keeping the same random seed is crucial to ensure that your result is reproducible.
seed = 1008
np.random.seed(seed)
torch.manual_seed(seed)
if cuda:
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    # TODO: Opening question: what does this line do?
    torch.backends.cudnn.benchmark = False

## 1. Data: MNIST [1 pt]
#### Use the built-in MNIST dataset in $\texttt{torchvision.datasets}$. It offers automatic data downloading and preprocessing.
#### Use $\texttt{torch.utils.data.DataLoader}$ to load data in batches.

MNIST dataset is a dataset of handwritten digits.
It is one of the most commonly used vision dataset for testing machine learning algorithms.
It has been built-in in PyTorch, so you can easily load it using $\texttt{torchvision.datasets}$.

The number of categories of this dataset is 10.
The shape of image in MNIST dataset is (28, 28, 1), each dimension representing height, width and channel respectively.

The normalization parameters we will use is (0.1307, 0.3081)

More details please refer to http://yann.lecun.com/exdb/mnist/.

### 1.1. Load Training Set

In [4]:
# DataLoader helps you load data from dataset and yield a batch of data in every iteration.
# Load the MNIST training set with batch size 128
# For training data, apply data shuffling and normalization
batch_size = 128
train_dataset = datasets.MNIST(
    root='./',
    # TODO: fill in the parameters to build the training set
    train=True,
    transform=transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize(mean=0.1307, std=0.3081)]),
    download=True
)
train_loader = DataLoader(
    # TODO: fill in the parameters to build the training loader
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True
)

### 1.2. Load Test Set

In [5]:
# Load the MNIST test set with batch size 128, apply normalization
test_dataset = datasets.MNIST(
    root='./',
    # TODO: fill in the parameters to build the test set
    train=False,
    transform=transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize(mean=0.1307, std=0.3081)]),
    download=True
)
test_loader = DataLoader(
    # TODO: fill in the parameters to build the test loader
    dataset=test_dataset,
    batch_size=batch_size
)

In [6]:
# Check the shape of data and label
for img_batch, label_batch in train_loader:
    print(img_batch.shape)
    print(label_batch.shape)
    break

torch.Size([128, 1, 28, 28])
torch.Size([128])


## 2. Models [1 pt]
#### You are going to define two convolutional neural networks which are trained to classify MNIST digits

### 2.1. CNN without Batch Norm

In [7]:
class NetWithoutBatchNorm(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(NetWithoutBatchNorm, self).__init__()
        # TODO: Fill in the values below that make this network valid for MNIST data
        # Hint: to make sure these, you may calculate the shape of x of every line in the forward.
        conv2_in_channels = 20
        fc1_in_features = 800
        fc2_in_features = 500
        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=20, kernel_size=5, stride=1)
        self.activation1 = nn.ReLU()
        self.pooling1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(in_channels=conv2_in_channels, out_channels=50, kernel_size=5, stride=1)
        self.activation2 = nn.ReLU()
        self.pooling2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(in_features=fc1_in_features, out_features=500)
        self.activation_fc = nn.ReLU()
        self.fc2 = nn.Linear(in_features=fc2_in_features, out_features=num_classes)

    def forward(self, x):
        x = self.activation1(self.conv1(x))
        x = self.pooling1(x)
        x = self.activation2(self.conv2(x))
        x = self.pooling2(x)
        # TODO: transform the shape of x to fit the fully connected layer
        x = x.view(-1, 800)
        x = self.activation_fc(self.fc1(x))
        x = self.fc2(x)
        return x

### 2.2. CNN with Batch Norm

In [8]:
# Please note that this model inheritance from NetWithoutBatchNorm.
class NetWithBatchNorm(NetWithoutBatchNorm):
    def __init__(self, in_channels, num_classes):
        super(NetWithBatchNorm, self).__init__(in_channels, num_classes)
        # TODO: Fill in the values below that make this network valid for MNIST data
        # Hint: to make sure these, you may calculate the shape of x of every line in the forward.
        conv1_bn_size = 20
        conv2_bn_size = 50
        self.conv1_bn = nn.BatchNorm2d(conv1_bn_size)
        self.conv2_bn = nn.BatchNorm2d(conv2_bn_size)

    def forward(self, x):
        # TODO: follow the forward function in 2.1, add batch norm layers
        x = self.activation1(self.conv1(x))
        x = self.conv1_bn(x)
        x = self.pooling1(x)
        x = self.activation2(self.conv2(x))
        x = self.conv2_bn(x)
        x = self.pooling2(x)
        x = x.view(-1, 800)
        x = self.activation_fc(self.fc1(x))
        x = self.fc2(x)
        return x


## 3. Training & Evaluation [1 pt]

### 3.1. Define training method

In [9]:
def train_one_epoch(model, data_loader, optimizer, loss_fn, epoch_cnt, log_interval = 100):
    # Set model to training mode
    model.train()
    # Loop through data points
    for batch_idx, (data, label) in enumerate(data_loader):
        # TODO: Send data and label to device
        data = data.to(device)
        label = label.to(device)
        # TODO: Zero out the optimizer
        optimizer.zero_grad()
        # TODO: Forward pass
        logits = model(data)
        # TODO: Compute the loss
        loss = loss_fn(logits, label)
        # TODO: Backpropagation
        loss.backward()
        # TODO: Update parameters
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch_cnt, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

### 3.2. Define test method

In [10]:
# Define test method
def test(model, data_loader, loss_fn):
    # TODO: Set model to evaluation mode
    # Variable for the total loss
    test_loss = 0
    # Counter for the correct predictions
    num_correct = 0
    # During evaluation, disable gradient tracking for efficiency
    with torch.no_grad():
        # Loop through data points
        for data, label in data_loader:
            # TODO: Send data to device
            data = data.to(device)
            label = label.to(device)
            # TODO: Forward pass
            logits = model(data)
            # TODO: Compute loss and add to total test_loss
            test_loss += loss_fn(logits, label)
            # TODO: Get predictions from the model for each data point
            correct = (logits.argmax(1) == label).sum()
            # TODO: Add number of correct predictions to total num_correct
            num_correct += correct
    # TODO: Compute the average loss on each data point
    avg_test_loss = test_loss / len(data_loader.dataset)
    print('-- Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(
        avg_test_loss, num_correct, len(test_loader.dataset),
        100. * num_correct / len(test_loader.dataset)))

### 3.3 Train the network without batch norm and the network with batch norm

In [11]:
# TODO: Define the network without batch norm and sent it to device
net_without_bn = NetWithoutBatchNorm(in_channels=1, num_classes=10).to(device)
# TODO: Define an SGD optimizer with learning rate of 1e-2 and momentum of 0.5
sgd_optimizer = optim.SGD(net_without_bn.parameters(), lr=1e-2, momentum=0.5)
# TODO: Define a cross entropy loss function, reduction='mean' for training
loss_fn_train = nn.CrossEntropyLoss(reduction='mean')
# TODO: Define a cross entropy loss function, reduction='sum' for testing
loss_fn_test = nn.CrossEntropyLoss(reduction='sum')

# Training loop with 10 epochs
for epoch in range(1, 10 + 1):
    start_time = time.time()
    # TODO: call the train_one_epoch function to train the model
    train_one_epoch(
        model=net_without_bn,
        data_loader=train_loader,
        optimizer=sgd_optimizer,
        loss_fn=loss_fn_train,
        epoch_cnt=epoch,
    )
    # TODO: call the test function to test the model
    test(
        model=net_without_bn,
        data_loader=test_loader,
        loss_fn=loss_fn_test,
    )
    end_time = time.time()
    time_cost = datetime.timedelta(seconds=end_time - start_time)
    print("Training without batch norm epoch {} finished, time cost: {}".format(epoch, time_cost))
    print('------------------------------------------------------------')

-- Test set: Average loss: 0.1491, Accuracy: 9562/10000 (96%)
Training without batch norm epoch 1 finished, time cost: 0:00:07.043003
------------------------------------------------------------
-- Test set: Average loss: 0.0895, Accuracy: 9718/10000 (97%)
Training without batch norm epoch 2 finished, time cost: 0:00:05.623758
------------------------------------------------------------
-- Test set: Average loss: 0.0671, Accuracy: 9801/10000 (98%)
Training without batch norm epoch 3 finished, time cost: 0:00:05.079337
------------------------------------------------------------
-- Test set: Average loss: 0.0592, Accuracy: 9826/10000 (98%)
Training without batch norm epoch 4 finished, time cost: 0:00:05.235083
------------------------------------------------------------
-- Test set: Average loss: 0.0461, Accuracy: 9867/10000 (99%)
Training without batch norm epoch 5 finished, time cost: 0:00:05.157286
------------------------------------------------------------
-- Test set: Average loss

In [12]:
# TODO: Define the network with batch norm and sent it to device
net_with_bn = NetWithBatchNorm(in_channels=1, num_classes=10).to(device)
# TODO: Define an SGD optimizer with learning rate of 1e-2 and momentum of 0.5
sgd_optimizer = optim.SGD(net_with_bn.parameters(), lr=1e-2, momentum=0.5)

# Training loop with 10 epochs
for epoch in range(1, 10 + 1):
    start_time = time.time()
    # TODO: call the train_one_epoch function to train the model
    train_one_epoch(
        model=net_with_bn,
        data_loader=train_loader,
        optimizer=sgd_optimizer,
        loss_fn=loss_fn_train,
        epoch_cnt=epoch,
    )
    # TODO: call the test function to test the model
    test(
        model=net_with_bn,
        data_loader=test_loader,
        loss_fn=loss_fn_test,
    )
    end_time = time.time()
    time_cost = datetime.timedelta(seconds=end_time - start_time)
    print("Training with batch norm epoch {} finished, time cost: {}".format(epoch, time_cost))
    print('------------------------------------------------------------')

-- Test set: Average loss: 0.0674, Accuracy: 9802/10000 (98%)
Training with batch norm epoch 1 finished, time cost: 0:00:04.975342
------------------------------------------------------------
-- Test set: Average loss: 0.0470, Accuracy: 9861/10000 (99%)
Training with batch norm epoch 2 finished, time cost: 0:00:04.798969
------------------------------------------------------------
-- Test set: Average loss: 0.0350, Accuracy: 9892/10000 (99%)
Training with batch norm epoch 3 finished, time cost: 0:00:04.693920
------------------------------------------------------------
-- Test set: Average loss: 0.0305, Accuracy: 9896/10000 (99%)
Training with batch norm epoch 4 finished, time cost: 0:00:04.640055
------------------------------------------------------------
-- Test set: Average loss: 0.0302, Accuracy: 9898/10000 (99%)
Training with batch norm epoch 5 finished, time cost: 0:00:05.044377
------------------------------------------------------------
-- Test set: Average loss: 0.0282, Accur

## 4. Opening questions [1pt]
### 4.1 In random seed setting section, what does $\texttt{torch.backends.cudnn.benchmark}$ determines? Why do we set it to $\texttt{False}$?
### 4.2 When defining the network, where do you add the batch norm layers? Why?

Answers:

4.1 若将`torch.backends.cudnn.benchmark`设置为`True`，则会使得`torch`的底层库`cuDNN`衡量多个卷积算法的速度并选择最快的；将其设置为`False`的原因是减少算法的不确定性，更好地控制除了有无batch norm以外的变量

4.2 加载activation层之后，可以使得下一次卷积输入的值是normalize过的，在相同的尺度范围内，有利于卷积层的学习

## Section 2 [6 pts]

You will tackle with a sentiment classification task using LSTM model and attention mechanism in this assigment.

## 1. Dependencies
Please make sure that you are using **GPU** to accelarate computation.
You can run with Google Colab if you don't have.
Colab FAQ: https://research.google.com/colaboratory/faq.html

### 1.1. Import dependencies

In [13]:
import torch
import os
import collections
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
import torch.nn.functional as F
from tqdm import tqdm
import math
import random
import numpy as np
import re

In [14]:
# Set up your device
cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if cuda else "cpu")

# The assertion is to make sure GPU is available
assert cuda == True

In [15]:
# Set up random seed to 1331. Do not change the random seed.
# Yes, these are all necessary when you run experiments!
seed = 1331
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if cuda:
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


### 1.2. Data
The script below will download the required sentiment analysis data.

If you use Google Colab, data folder will be visible in the Colab file-explorer pane, which is loacted at left side of the page.


In [16]:
# !wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1y-_KAx7oLqjzOnyXeZvt0tp9WC2ZgyUi' -O data.zip
# !unzip data.zip

### 1.3. Corpus
Glove will be used as the word embedding tool in this assigment.

In [17]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip

## 2. Preprocess [1.5 pt]
Preprocess data, then construct dataloader and vocabulary.

### 2.1. Load Glove pretrained word embedding. [0.5 pt]

In [18]:
# TODO
def get_glove_embedding():
    filename = 'glove.6B.100d.txt'
    glove_embedding = []
    with open(filename, 'r') as f:
        for line in tqdm(f):
            vals = line.rstrip().split(' ')
            glove_embedding.append([float(x) for x in vals[1:]])
    feature_size = len(glove_embedding[0])
    unk_feature = torch.randn(1, feature_size)
    pad_feature = torch.randn(1, feature_size)
    glove_embedding = torch.Tensor(glove_embedding)
    glove_embedding = torch.cat([glove_embedding, unk_feature, pad_feature])
    # print(glove_embedding.shape)
    glove_embedding_nn = nn.Embedding(*glove_embedding.shape)
    glove_embedding_nn.weight = torch.nn.Parameter(glove_embedding)
    # glove_embedding_nn.weight.requires_grad = False
    return glove_embedding_nn

glove_embedding = get_glove_embedding()

400000it [00:05, 69022.78it/s]


### 2.2. Construct your own vocabulary without other corpus. [0.5 pt]
Hint: You should construct a vocabulary to map the word to index.

In [19]:
# TODO
vocabs = []
filename = 'glove.6B.100d.txt'
with open(filename, 'r') as f:
    for line in tqdm(f):
        word = line.rstrip().split(' ')[0]
        vocabs.append(word)
vocabs.append('<UNK>')
vocabs.append('<PAD>')
word2idx = {word:idx for idx, word in enumerate(vocabs)}
vocab_size = len(word2idx)

400000it [00:01, 375344.75it/s]


### 2.3. Load data [0.5 pt]
Load data and construct dataloader.

In [20]:
data_dir = 'sentiment'
# TODO

def clean_str(s):
    s = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", s)
    s = re.sub(r"\'s", " \'s", s)  
    s = re.sub(r"\'ve", " \'ve", s)
    s = re.sub(r"n\'t", " n\'t", s)
    s = re.sub(r"\'re", " \'re", s)
    s = re.sub(r"\'d", " \'d", s) 
    s = re.sub(r"\'ll", " \'ll", s)
    s = re.sub(r"\'m", " \'m", s) 
    s = re.sub(r",", " , ", s)
    s = re.sub(r"!", " ! ", s)
    s = re.sub(r"\(", " ( ", s)
    s = re.sub(r"\)", " ) ", s)
    s = re.sub(r"\?", " ? ", s)
    s = re.sub(r"\s{2,}", " ", s)
    return s.strip().lower()

def build_dataloader(path, max_per_sentence=50, batch_size=128, shuffle=False):
    assert path in ['train', 'test', 'val']
    text_path = os.path.join(data_dir, path+'_text.txt')
    label_path = os.path.join(data_dir, path+'_labels.txt')
    texts = []
    labels = []
    with open(text_path, 'r') as f:
        for line in tqdm(f):
            texts.append(clean_str(line).split(' '))
    with open(label_path, 'r') as f:
        for line in tqdm(f):
            labels.append(int(line.strip()))
    sentences = []
    for text in texts:
        text += ['<PAD>'] * (max_per_sentence - len(text))
        unk_idx = word2idx['<UNK>']
        text2idx = [word2idx.get(word, unk_idx) for word in text][:max_per_sentence]
        sentences.append(text2idx)
    sentences = torch.tensor(sentences, dtype=torch.long)
    labels = torch.tensor(labels, dtype=torch.long)
    text_dataset = TensorDataset(sentences, labels)
    text_dataloader = DataLoader(dataset=text_dataset, batch_size=batch_size, shuffle=shuffle)
    return text_dataloader
train_loader = build_dataloader('train', shuffle=True)
val_loader = build_dataloader('val')
test_loader = build_dataloader('test')

45615it [00:00, 92985.20it/s]
45615it [00:00, 4515534.03it/s]
2000it [00:00, 36481.88it/s]
2000it [00:00, 4021384.47it/s]
12284it [00:00, 99621.28it/s]
12284it [00:00, 4729902.72it/s]


## 3. Model [1.5 pt]
Bidirectional LSTM and attention mechanism will be used in this section.

### 3.1. Model Zoo

In [21]:
class BiRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, pretrained_embedding=None, **kwargs):
        super(BiRNN, self).__init__()
        # TODO: load pretrained embedding
        ...
        self.embedding = pretrained_embedding \
            if pretrained_embedding is not None \
            else nn.Embedding(vocab_size, embed_size)

        self.encoder = nn.LSTM(embed_size, num_hiddens, num_layers=num_layers, bidirectional=True, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(4 * num_hiddens, num_hiddens), nn.Linear(num_hiddens, 3))

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        self.encoder.flatten_parameters()
        outputs, _ = self.encoder(embeddings)
        encoding = torch.cat((outputs[:,0,:], outputs[:,-1,:]), dim=1)
        outs = self.decoder(encoding)
        return outs

In [22]:
class BiRNN_attention(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, pretrained_embedding=None, **kwargs):
        super(BiRNN_attention, self).__init__()
        # TODO: load pretrained embedding
        ...
        self.embedding = pretrained_embedding \
            if pretrained_embedding is not None \
            else nn.Embedding(vocab_size, embed_size)

        self.encoder = nn.LSTM(embed_size, num_hiddens, num_layers=num_layers, bidirectional=True, batch_first=True)
        self.weight_W = nn.Parameter(torch.Tensor(2 * num_hiddens, 2 * num_hiddens))
        self.weight_proj = nn.Parameter(torch.Tensor(2 * num_hiddens, 1))

        self.decoder = nn.Sequential(nn.Linear(2 * num_hiddens, num_hiddens), nn.Linear(num_hiddens, 3))
        nn.init.uniform_(self.weight_W, -0.1, 0.1)
        nn.init.uniform_(self.weight_proj, -0.1, 0.1)


    def forward(self, inputs):
        mask = 1 - torch.clamp(inputs, min=0, max=1)
        embeddings = self.embedding(inputs)
        states, hidden = self.encoder(embeddings.permute([0, 1, 2]))
        u = torch.tanh(torch.matmul(states, self.weight_W))
        att = torch.matmul(u, self.weight_proj)
        att = att + mask.unsqueeze(2) * -1e7
        att_score = F.softmax(att, dim=1)
        scored_x = states * att_score
        encoding = torch.sum(scored_x, dim=1)
        outputs = self.decoder(encoding)

        return outputs

## 4. Training [3 pt]
You should train two models above with Glove pretrained word embedding and random initialized word embedding.

Evaluation on the validation set and print out accuracy after training one epoch is required.

You can tune some parameters and try different techniques, such as learning rate scheduler.

In [23]:
# Training Entry Point, you can add arbitrary parameters to replace **kwargs if you need.
def test(model, data_loader, loss_fn, phase='val'):
    assert phase in ['val', 'test']
    test_loss = 0
    num_correct = 0
    with torch.no_grad():
        for data, label in data_loader:
            data = data.to(device)
            label = label.to(device)
            logits = model(data)
            test_loss += loss_fn(logits, label)
            correct = (logits.argmax(1) == label).sum()
            num_correct += correct
    avg_test_loss = test_loss / len(data_loader.dataset)
    print('-- {} set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(
        phase, avg_test_loss, num_correct, len(data_loader.dataset),
        100. * num_correct / len(data_loader.dataset)))

def train(network_class, pretrained_embedding=None, **kwargs):
    # TODOembed_size, num_hiddens, num_layers, pretrained_embedding=None
    # vocab_size defined in section 2.2 constructing vocabulary
    embed_size = kwargs.get('embed_size', 100)
    num_hiddens = kwargs.get('num_hiddens', 100)
    num_layers = kwargs.get('num_layers', 1)
    epochs = kwargs.get('epochs', 30)

    model = network_class(vocab_size, embed_size, num_hiddens, num_layers, pretrained_embedding=pretrained_embedding)
    model = model.to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-3)
    scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for i in range(epochs):
        model.train()
        for data, label in tqdm(train_loader):
            data = data.to(device)
            label = label.to(device)

            optimizer.zero_grad()
            logits = model(data)
            loss = loss_fn(logits, label)
            loss.backward()
            optimizer.step()
        scheduler.step()
        print('Train Epoch: {}\tLoss: {:.6f}'.format(i+1, loss.item() / len(data)))
        if (i+1) % 10 == 0:
            test(model, val_loader, loss_fn, phase='val')
    test(model, test_loader, loss_fn, phase='test')


In [24]:
# Train BiRNN with Glove pretrained word embedding
# TODO
train(BiRNN, pretrained_embedding=glove_embedding)

100%|██████████| 357/357 [00:03<00:00, 90.27it/s] 


Train Epoch: 1	Loss: 0.019827


100%|██████████| 357/357 [00:03<00:00, 101.76it/s]


Train Epoch: 2	Loss: 0.018386


100%|██████████| 357/357 [00:03<00:00, 106.09it/s]


Train Epoch: 3	Loss: 0.015222


100%|██████████| 357/357 [00:03<00:00, 104.91it/s]


Train Epoch: 4	Loss: 0.014736


100%|██████████| 357/357 [00:03<00:00, 104.90it/s]


Train Epoch: 5	Loss: 0.013477


100%|██████████| 357/357 [00:03<00:00, 103.75it/s]


Train Epoch: 6	Loss: 0.014448


100%|██████████| 357/357 [00:03<00:00, 103.97it/s]


Train Epoch: 7	Loss: 0.016555


100%|██████████| 357/357 [00:03<00:00, 102.34it/s]


Train Epoch: 8	Loss: 0.017438


100%|██████████| 357/357 [00:03<00:00, 104.42it/s]


Train Epoch: 9	Loss: 0.017286


100%|██████████| 357/357 [00:03<00:00, 101.90it/s]


Train Epoch: 10	Loss: 0.016477
-- val set: Average loss: 0.0063, Accuracy: 1242/2000 (62%)


100%|██████████| 357/357 [00:03<00:00, 105.43it/s]


Train Epoch: 11	Loss: 0.013527


100%|██████████| 357/357 [00:03<00:00, 103.58it/s]


Train Epoch: 12	Loss: 0.014461


100%|██████████| 357/357 [00:03<00:00, 103.20it/s]


Train Epoch: 13	Loss: 0.017303


100%|██████████| 357/357 [00:03<00:00, 101.99it/s]


Train Epoch: 14	Loss: 0.017664


100%|██████████| 357/357 [00:03<00:00, 104.04it/s]


Train Epoch: 15	Loss: 0.019378


100%|██████████| 357/357 [00:03<00:00, 103.96it/s]


Train Epoch: 16	Loss: 0.016028


100%|██████████| 357/357 [00:03<00:00, 104.18it/s]


Train Epoch: 17	Loss: 0.020346


100%|██████████| 357/357 [00:03<00:00, 102.50it/s]


Train Epoch: 18	Loss: 0.016732


100%|██████████| 357/357 [00:03<00:00, 104.19it/s]


Train Epoch: 19	Loss: 0.016181


100%|██████████| 357/357 [00:03<00:00, 103.12it/s]


Train Epoch: 20	Loss: 0.017878
-- val set: Average loss: 0.0061, Accuracy: 1280/2000 (64%)


100%|██████████| 357/357 [00:03<00:00, 103.91it/s]


Train Epoch: 21	Loss: 0.015485


100%|██████████| 357/357 [00:03<00:00, 103.36it/s]


Train Epoch: 22	Loss: 0.014759


100%|██████████| 357/357 [00:03<00:00, 102.46it/s]


Train Epoch: 23	Loss: 0.016751


100%|██████████| 357/357 [00:03<00:00, 101.43it/s]


Train Epoch: 24	Loss: 0.016489


100%|██████████| 357/357 [00:03<00:00, 103.61it/s]


Train Epoch: 25	Loss: 0.014659


100%|██████████| 357/357 [00:03<00:00, 103.90it/s]


Train Epoch: 26	Loss: 0.014470


100%|██████████| 357/357 [00:03<00:00, 102.25it/s]


Train Epoch: 27	Loss: 0.016033


100%|██████████| 357/357 [00:03<00:00, 103.49it/s]


Train Epoch: 28	Loss: 0.015515


100%|██████████| 357/357 [00:03<00:00, 103.09it/s]


Train Epoch: 29	Loss: 0.018770


100%|██████████| 357/357 [00:03<00:00, 102.10it/s]


Train Epoch: 30	Loss: 0.012407
-- val set: Average loss: 0.0061, Accuracy: 1288/2000 (64%)
-- test set: Average loss: 0.0061, Accuracy: 7875/12284 (64%)


In [25]:
# Train BiRNN without pretrained word embedding
# TODO
train(BiRNN, pretrained_embedding=None)

100%|██████████| 357/357 [00:03<00:00, 104.11it/s]


Train Epoch: 1	Loss: 0.021657


100%|██████████| 357/357 [00:03<00:00, 103.15it/s]


Train Epoch: 2	Loss: 0.020470


100%|██████████| 357/357 [00:03<00:00, 103.86it/s]


Train Epoch: 3	Loss: 0.020289


100%|██████████| 357/357 [00:03<00:00, 105.33it/s]


Train Epoch: 4	Loss: 0.019367


100%|██████████| 357/357 [00:03<00:00, 104.40it/s]


Train Epoch: 5	Loss: 0.022210


100%|██████████| 357/357 [00:03<00:00, 104.92it/s]


Train Epoch: 6	Loss: 0.019365


100%|██████████| 357/357 [00:03<00:00, 108.48it/s]


Train Epoch: 7	Loss: 0.017486


100%|██████████| 357/357 [00:03<00:00, 104.26it/s]


Train Epoch: 8	Loss: 0.019598


100%|██████████| 357/357 [00:03<00:00, 105.47it/s]


Train Epoch: 9	Loss: 0.020165


100%|██████████| 357/357 [00:03<00:00, 105.00it/s]


Train Epoch: 10	Loss: 0.020129
-- val set: Average loss: 0.0074, Accuracy: 1113/2000 (56%)


100%|██████████| 357/357 [00:03<00:00, 105.87it/s]


Train Epoch: 11	Loss: 0.019909


100%|██████████| 357/357 [00:03<00:00, 104.63it/s]


Train Epoch: 12	Loss: 0.017175


100%|██████████| 357/357 [00:03<00:00, 104.56it/s]


Train Epoch: 13	Loss: 0.018298


100%|██████████| 357/357 [00:03<00:00, 105.62it/s]


Train Epoch: 14	Loss: 0.018032


100%|██████████| 357/357 [00:03<00:00, 105.06it/s]


Train Epoch: 15	Loss: 0.018362


100%|██████████| 357/357 [00:03<00:00, 103.31it/s]


Train Epoch: 16	Loss: 0.017259


100%|██████████| 357/357 [00:03<00:00, 105.30it/s]


Train Epoch: 17	Loss: 0.015649


100%|██████████| 357/357 [00:03<00:00, 105.04it/s]


Train Epoch: 18	Loss: 0.018744


100%|██████████| 357/357 [00:03<00:00, 103.45it/s]


Train Epoch: 19	Loss: 0.017507


100%|██████████| 357/357 [00:03<00:00, 103.62it/s]


Train Epoch: 20	Loss: 0.018774
-- val set: Average loss: 0.0072, Accuracy: 1171/2000 (59%)


100%|██████████| 357/357 [00:03<00:00, 104.45it/s]


Train Epoch: 21	Loss: 0.016941


100%|██████████| 357/357 [00:03<00:00, 103.64it/s]


Train Epoch: 22	Loss: 0.017106


100%|██████████| 357/357 [00:03<00:00, 103.82it/s]


Train Epoch: 23	Loss: 0.017442


100%|██████████| 357/357 [00:03<00:00, 104.39it/s]


Train Epoch: 24	Loss: 0.017863


100%|██████████| 357/357 [00:03<00:00, 103.98it/s]


Train Epoch: 25	Loss: 0.020233


100%|██████████| 357/357 [00:03<00:00, 103.38it/s]


Train Epoch: 26	Loss: 0.015128


100%|██████████| 357/357 [00:03<00:00, 104.16it/s]


Train Epoch: 27	Loss: 0.021037


100%|██████████| 357/357 [00:03<00:00, 103.65it/s]


Train Epoch: 28	Loss: 0.017929


100%|██████████| 357/357 [00:03<00:00, 105.05it/s]


Train Epoch: 29	Loss: 0.014678


100%|██████████| 357/357 [00:03<00:00, 106.68it/s]


Train Epoch: 30	Loss: 0.017414
-- val set: Average loss: 0.0067, Accuracy: 1238/2000 (62%)
-- test set: Average loss: 0.0077, Accuracy: 6442/12284 (52%)


In [26]:
# Train BiRNN_attention with Glove pretrained embedding
# TODO
train(BiRNN_attention, pretrained_embedding=glove_embedding)

100%|██████████| 357/357 [00:03<00:00, 101.55it/s]


Train Epoch: 1	Loss: 0.020929


100%|██████████| 357/357 [00:03<00:00, 98.15it/s] 


Train Epoch: 2	Loss: 0.019154


100%|██████████| 357/357 [00:03<00:00, 102.63it/s]


Train Epoch: 3	Loss: 0.018812


100%|██████████| 357/357 [00:03<00:00, 101.91it/s]


Train Epoch: 4	Loss: 0.017853


100%|██████████| 357/357 [00:03<00:00, 101.91it/s]


Train Epoch: 5	Loss: 0.014807


100%|██████████| 357/357 [00:03<00:00, 100.87it/s]


Train Epoch: 6	Loss: 0.014039


100%|██████████| 357/357 [00:03<00:00, 101.59it/s]


Train Epoch: 7	Loss: 0.014462


100%|██████████| 357/357 [00:03<00:00, 102.44it/s]


Train Epoch: 8	Loss: 0.017660


100%|██████████| 357/357 [00:03<00:00, 103.47it/s]


Train Epoch: 9	Loss: 0.019922


100%|██████████| 357/357 [00:03<00:00, 100.77it/s]


Train Epoch: 10	Loss: 0.015511
-- val set: Average loss: 0.0061, Accuracy: 1311/2000 (66%)


100%|██████████| 357/357 [00:03<00:00, 102.14it/s]


Train Epoch: 11	Loss: 0.018278


100%|██████████| 357/357 [00:03<00:00, 102.17it/s]


Train Epoch: 12	Loss: 0.016764


100%|██████████| 357/357 [00:03<00:00, 101.42it/s]


Train Epoch: 13	Loss: 0.014735


100%|██████████| 357/357 [00:03<00:00, 100.63it/s]


Train Epoch: 14	Loss: 0.012956


100%|██████████| 357/357 [00:03<00:00, 100.38it/s]


Train Epoch: 15	Loss: 0.014267


100%|██████████| 357/357 [00:03<00:00, 99.10it/s] 


Train Epoch: 16	Loss: 0.015883


100%|██████████| 357/357 [00:03<00:00, 102.87it/s]


Train Epoch: 17	Loss: 0.016039


100%|██████████| 357/357 [00:03<00:00, 100.96it/s]


Train Epoch: 18	Loss: 0.017624


100%|██████████| 357/357 [00:03<00:00, 100.83it/s]


Train Epoch: 19	Loss: 0.013295


100%|██████████| 357/357 [00:03<00:00, 100.96it/s]


Train Epoch: 20	Loss: 0.017065
-- val set: Average loss: 0.0060, Accuracy: 1322/2000 (66%)


100%|██████████| 357/357 [00:03<00:00, 100.59it/s]


Train Epoch: 21	Loss: 0.014910


100%|██████████| 357/357 [00:03<00:00, 100.44it/s]


Train Epoch: 22	Loss: 0.017940


100%|██████████| 357/357 [00:03<00:00, 101.44it/s]


Train Epoch: 23	Loss: 0.013310


100%|██████████| 357/357 [00:03<00:00, 98.28it/s] 


Train Epoch: 24	Loss: 0.015502


100%|██████████| 357/357 [00:03<00:00, 98.92it/s] 


Train Epoch: 25	Loss: 0.013002


100%|██████████| 357/357 [00:03<00:00, 98.53it/s] 


Train Epoch: 26	Loss: 0.017011


100%|██████████| 357/357 [00:03<00:00, 97.34it/s] 


Train Epoch: 27	Loss: 0.013705


100%|██████████| 357/357 [00:03<00:00, 99.79it/s] 


Train Epoch: 28	Loss: 0.015438


100%|██████████| 357/357 [00:03<00:00, 98.67it/s] 


Train Epoch: 29	Loss: 0.017534


100%|██████████| 357/357 [00:03<00:00, 100.97it/s]


Train Epoch: 30	Loss: 0.016359
-- val set: Average loss: 0.0059, Accuracy: 1321/2000 (66%)
-- test set: Average loss: 0.0061, Accuracy: 7821/12284 (64%)


In [27]:
# Train BiRNN_attention without pretrained word embedding
# TODO
train(BiRNN_attention, pretrained_embedding=None)

100%|██████████| 357/357 [00:03<00:00, 99.54it/s] 


Train Epoch: 1	Loss: 0.020111


100%|██████████| 357/357 [00:03<00:00, 100.82it/s]


Train Epoch: 2	Loss: 0.022807


100%|██████████| 357/357 [00:03<00:00, 99.05it/s] 


Train Epoch: 3	Loss: 0.018743


100%|██████████| 357/357 [00:03<00:00, 99.13it/s] 


Train Epoch: 4	Loss: 0.020329


100%|██████████| 357/357 [00:03<00:00, 99.41it/s] 


Train Epoch: 5	Loss: 0.021172


100%|██████████| 357/357 [00:03<00:00, 100.46it/s]


Train Epoch: 6	Loss: 0.019091


100%|██████████| 357/357 [00:03<00:00, 95.83it/s] 


Train Epoch: 7	Loss: 0.022365


100%|██████████| 357/357 [00:03<00:00, 95.56it/s] 


Train Epoch: 8	Loss: 0.020068


100%|██████████| 357/357 [00:03<00:00, 97.64it/s] 


Train Epoch: 9	Loss: 0.017855


100%|██████████| 357/357 [00:03<00:00, 96.51it/s] 


Train Epoch: 10	Loss: 0.017307
-- val set: Average loss: 0.0073, Accuracy: 1124/2000 (56%)


100%|██████████| 357/357 [00:03<00:00, 97.19it/s] 


Train Epoch: 11	Loss: 0.020739


100%|██████████| 357/357 [00:03<00:00, 98.36it/s] 


Train Epoch: 12	Loss: 0.022198


100%|██████████| 357/357 [00:03<00:00, 99.50it/s] 


Train Epoch: 13	Loss: 0.018029


100%|██████████| 357/357 [00:03<00:00, 99.49it/s] 


Train Epoch: 14	Loss: 0.017343


100%|██████████| 357/357 [00:03<00:00, 101.41it/s]


Train Epoch: 15	Loss: 0.016060


100%|██████████| 357/357 [00:03<00:00, 100.40it/s]


Train Epoch: 16	Loss: 0.016443


100%|██████████| 357/357 [00:03<00:00, 97.28it/s] 


Train Epoch: 17	Loss: 0.018839


100%|██████████| 357/357 [00:03<00:00, 98.77it/s] 


Train Epoch: 18	Loss: 0.017356


100%|██████████| 357/357 [00:03<00:00, 99.66it/s] 


Train Epoch: 19	Loss: 0.018465


100%|██████████| 357/357 [00:03<00:00, 97.80it/s] 


Train Epoch: 20	Loss: 0.018799
-- val set: Average loss: 0.0072, Accuracy: 1143/2000 (57%)


100%|██████████| 357/357 [00:03<00:00, 98.46it/s] 


Train Epoch: 21	Loss: 0.018667


100%|██████████| 357/357 [00:03<00:00, 98.22it/s] 


Train Epoch: 22	Loss: 0.018082


100%|██████████| 357/357 [00:03<00:00, 97.92it/s] 


Train Epoch: 23	Loss: 0.018440


100%|██████████| 357/357 [00:03<00:00, 97.93it/s] 


Train Epoch: 24	Loss: 0.016665


100%|██████████| 357/357 [00:03<00:00, 97.96it/s] 


Train Epoch: 25	Loss: 0.018928


100%|██████████| 357/357 [00:03<00:00, 97.57it/s] 


Train Epoch: 26	Loss: 0.019831


100%|██████████| 357/357 [00:03<00:00, 98.10it/s] 


Train Epoch: 27	Loss: 0.017644


100%|██████████| 357/357 [00:03<00:00, 97.81it/s] 


Train Epoch: 28	Loss: 0.018753


100%|██████████| 357/357 [00:03<00:00, 98.80it/s] 


Train Epoch: 29	Loss: 0.019202


100%|██████████| 357/357 [00:03<00:00, 97.95it/s] 


Train Epoch: 30	Loss: 0.015797
-- val set: Average loss: 0.0072, Accuracy: 1143/2000 (57%)
-- test set: Average loss: 0.0084, Accuracy: 6245/12284 (51%)


## 5. Report (optional)
You can briefly report what strategies you attempted in this assignment.

- Add weight decay to avoid overfitting
- Add Exponential lr schedule strategy