# Sentiment Analysis based on Attention Mechanism

# Train models (Inter-Attention-BiLSTM & Self-Inter-Attention-BiLSTM) on Four Datasets

## Dataset
- [**IMDB Large Movie Review Dataset**](http://ai.stanford.edu/~amaas/data/sentiment/)
    - **Binary** sentiment classification
    - Citation: [Andrew L. Maas et al., 2011](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)
    - **50,000** movie reviews for training and testing
    - Average review length: **231** words
    ---
- [**Yelp reviews-full**](http://xzh.me/docs/charconvnet.pdf)
    - **Multiclass** sentiment classification (5 stars)
    - Citation: [Xiang Zhang et al., 2015](https://arxiv.org/abs/1509.01626)
    - **650,000** training samples and **50,000** testing samples (each star rating has the same number of samples)
    - Average review length: **140** words
    ---
- [**Yelp reviews-polarity**](http://xzh.me/docs/charconvnet.pdf)
    - **Binary** sentiment classification
    - Citation: [Xiang Zhang et al., 2015](https://arxiv.org/abs/1509.01626)
    - **560,000** training samples and **38,000** testing samples (positive and negative samples are balanced)
    - Average review length: **140** words
    ---
- [**Douban Movie Reviews**](https://drive.google.com/open?id=1DsmQfB1Ff_BUoxOv4kfUMg7Y8M7tHB9F) 
    - A **custom Chinese** movie-review dataset that I scraped from **16,000** movies (each with more than 100 reviews) on [Douban](https://movie.douban.com/)
    - **Binary** sentiment classification
    - **700,000** movie reviews: **600,000** samples for training and **100,000** samples for testing (positive and negative samples are balanced)
    - Average review length: **50** characters, **27** words

## Paper references

> [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf)
>
> [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/pdf/1508.04025.pdf)
>
> [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

## Code references
> https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb
>
> https://github.com/graykode/nlp-tutorial/blob/master/4-3.Bi-LSTM(Attention)/Bi_LSTM(Attention)_Torch.ipynb
>
> https://github.com/bojone/attention/blob/master/attention_keras.py
>
> https://github.com/graykode/nlp-tutorial/blob/master/5-1.Transformer/Transformer(Greedy_decoder)_Torch.ipynb

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pack_sequence, pad_packed_sequence, pad_sequence
from torch.utils.data import DataLoader, Dataset, SequentialSampler
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm
import time
import random
import os
import copy
import warnings

from utils import pad_and_sort_batch, preprocess_for_batch, pad_or_truncate

warnings.filterwarnings('ignore')
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

%matplotlib inline
%load_ext autoreload
%autoreload 2
torch.__version__

'1.0.1.post2'

In [0]:
# set all random seeds so that results are reproducible across runs
def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    
def worker_init_fn(worker_id):
    setup_seed(torch.initial_seed() + worker_id)
    
GLOBAL_SEED = 2019
setup_seed(GLOBAL_SEED)

In [0]:
base_dir = './'

In [0]:
# paths to use when running on Google Colab
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'Colab Notebooks/'

In [1]:
# Google Colab setup: mount Google Drive, upgrade pandas, download NLTK data
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
!pip install -U pandas
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import pandas as pd

Mounted at /content/gdrive
Collecting pandas
  Downloading https://files.pythonhosted.org/packages/19/74/e50234bc82c553fecdbd566d8650801e3fe2d6d8c8d940638e3d8a7c5522/pandas-0.24.2-cp36-cp36m-manylinux1_x86_64.whl (10.1MB)
    100% |████████████████████████████████| 10.1MB 4.3MB/s 
fastai 1.0.50.post1 has requirement numpy>=1.15, but you'll have numpy 1.14.6 which is incompatible.
Installing collected packages: pandas
  Found existing installation: pandas 0.22.0
    Un

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


## Load training data (choose one of the four datasets)

In [0]:
#data_pth = base_dir + 'dataset/IMDB/'
#data_pth = base_dir + 'dataset/yelp_review_polarity_csv/'
#data_pth = base_dir + 'dataset/yelp_review_full_csv/'
data_pth = base_dir + 'dataset/Douban/'

In [0]:
X_train = pd.read_hdf(data_pth+'X_train.h5', key='s')
y_train = pd.read_hdf(data_pth+'y_train.h5', key='s')

X_val = pd.read_hdf(data_pth+'X_val.h5', key='s')
y_val = pd.read_hdf(data_pth+'y_val.h5', key='s')

word2num_series = pd.read_hdf(data_pth+'word2num_series.h5', key='s')

In [7]:
len(X_train)

550000

In [8]:
len(X_val)

50000

## Load pretrained word embedding matrix (GloVe for English, Chinese Word Vectors for Chinese)
- GloVe https://github.com/stanfordnlp/GloVe
- Chinese Word Vector https://github.com/Embedding/Chinese-Word-Vectors

In [0]:
# English
# pre_embedding = {}
# with open('/content/gdrive/My Drive/Colab Notebooks/word2vector/glove.twitter.27B.200d.txt', encoding='utf8') as f:
#     for line in f:
#         tmp = line.strip().split()
#         if tmp[0] in word2num_series:
#             pre_embedding[tmp[0]] = np.array(tmp[1:]).astype(np.float64)

# Chinese
pre_embedding = {}
with open(base_dir + 'word2vector/sgns.weibo.bigram-char', encoding='utf8') as f:
    for line in f:  # iterate lazily instead of loading every line with readlines()
        tmp = line.strip().split()
        # keep only vectors for words in the vocabulary (`in` checks the Series index)
        if tmp[0] in word2num_series:
            # np.float was removed in NumPy 1.24; np.float64 is the same dtype
            pre_embedding[tmp[0]] = np.array(tmp[1:]).astype(np.float64)

In [10]:
vocab_size = len(word2num_series)+3
dim = pre_embedding['movie'].shape[0]
print('dimension:', dim)
# stack the pretrained vectors once and reuse the array for every statistic
pretrained_vectors = np.array(list(pre_embedding.values()))
mean = pretrained_vectors.mean()
std = pretrained_vectors.std()
print('mean:', mean)
print('std:', std)
print('max:', pretrained_vectors.max())
print('min:', pretrained_vectors.min())

dimension: 300
mean: 0.009304564211605073
std: 0.398704565783848
max: 2.5235
min: -2.452009


In [0]:
embedding_matrix = np.random.randn(vocab_size, dim)*std

In [0]:
miss_word = 0
for word, idx in word2num_series.items():
    try:
        embedding_matrix[idx] = pre_embedding[word]
    except KeyError:
        # no pretrained vector for this word: keep its random initialization
        miss_word += 1
        #print(word)

In [13]:
miss_word

7471
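
To put the miss count in context, the following is a small worked calculation rather than part of the original notebook. It assumes that the `Embedding(40000, 300)` row count printed in the model summary later on comes from the same run, and that the `+3` in `vocab_size` accounts for rows that do not correspond to vocabulary words. The function name `pretrained_coverage` is made up for this illustration.

```python
def pretrained_coverage(embedding_rows, reserved_rows, missing_words):
    """Fraction of vocabulary words that received a pretrained vector."""
    vocab_words = embedding_rows - reserved_rows  # len(word2num_series) under the stated assumption
    return (vocab_words - missing_words) / vocab_words


# 40000 embedding rows, 3 reserved rows, 7471 words without a pretrained vector
print(round(pretrained_coverage(40000, 3, 7471), 4))  # 0.8132
```

Under those assumptions, roughly 81% of the vocabulary starts from a pretrained vector and the rest keeps the random initialization scaled by `std`.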

In [0]:
np.testing.assert_array_almost_equal(embedding_matrix[word2num_series['movie']], pre_embedding['movie'])

## Build PyTorch Dataset and DataLoader

In [0]:
BATCH_SIZE = 128

In [0]:
class CustomDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
    
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, idx):
        features = self.features[idx]
        if features.size(0) == 0:
            # empty sample: substitute a single token with index 1 so the length is never 0
            features = torch.LongTensor([1])
        # returns (token ids, label, sequence length)
        return features, self.labels[idx], features.size(0)

### Sort the training data by sentence length to reduce zero-padding in each batch, then shuffle the batches

In [0]:
# Convert list items to tensors to avoid a memory leak in the dataset.
# A bug was fixed here: X_train_sorted and y_train_sorted must hold tensors, not native Python lists (which can cause a memory leak)
# https://github.com/pytorch/pytorch/issues/13246

############ just for multiclass #############
# y_train = y_train-1
# y_val = y_val-1
##############################################

X_train_sorted_tensors, y_train_sorted_tensors = preprocess_for_batch(X_train, y_train, BATCH_SIZE) # ignore last batch if smaller than batch_size
X_valid_tensors = [torch.tensor(x, dtype=torch.int64) for x in X_val]
y_valid_tensors = [torch.tensor(y, dtype=torch.int64) for y in y_val]
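
The source of `preprocess_for_batch` lives in `utils` and is not shown here, so the block below is only a hypothetical, list-based sketch of the idea named in the heading: sort by length, cut into full batches, drop the incomplete tail, then shuffle whole batches. The function name, the `seed` argument and the use of plain lists instead of tensors are all assumptions made for illustration.

```python
import random


def sort_into_shuffled_batches(samples, labels, batch_size, seed=2019):
    """Hypothetical stand-in for utils.preprocess_for_batch (not the author's code)."""
    # Sort sample indices by sequence length so neighbours have similar lengths.
    order = sorted(range(len(samples)), key=lambda i: len(samples[i]))
    # Cut the sorted order into full batches; the incomplete tail is dropped.
    n_full = len(order) // batch_size
    batches = [order[b * batch_size:(b + 1) * batch_size] for b in range(n_full)]
    # Shuffle whole batches, not single samples, so each batch stays length-homogeneous.
    random.Random(seed).shuffle(batches)
    flat = [i for batch in batches for i in batch]
    return [samples[i] for i in flat], [labels[i] for i in flat]


samples = [[1] * n for n in (5, 1, 3, 2, 4, 6, 7)]
labels = [0, 1, 0, 1, 0, 1, 0]
xs, ys = sort_into_shuffled_batches(samples, labels, batch_size=2)
print(len(xs))  # 6: the single leftover sample is dropped
```

The same arithmetic explains the sizes in this notebook: with 550,000 training samples and `BATCH_SIZE = 128`, keeping only full batches leaves 4,296 × 128 = 549,888 samples, which matches the train-set size printed further down. Because the training `DataLoader` reads the data sequentially with the same batch size, batches prepared this way would be consumed unchanged.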

In [0]:
# truncate long sentences to at most MAXLEN tokens
########### just for Transformer model ##################
MAXLEN = 500
X_train_sorted_tensors = [pad_or_truncate(x, maxlen=MAXLEN, pad=False) for x in X_train_sorted_tensors] # truncate only; skip padding to save GPU memory
X_valid_tensors = [pad_or_truncate(x, maxlen=MAXLEN, pad=False) for x in X_valid_tensors]
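
`pad_or_truncate` is also imported from `utils` without its source being shown. As a rough illustration of what a call with `pad=False` does, here is a hypothetical list-based version. Keeping the first `maxlen` tokens (rather than the last) and padding on the right with 0 are assumptions, as is the name `truncate_or_pad`.

```python
def truncate_or_pad(tokens, maxlen, pad=True, pad_value=0):
    """Hypothetical list-based stand-in for utils.pad_or_truncate."""
    clipped = list(tokens[:maxlen])  # never longer than maxlen
    if pad and len(clipped) < maxlen:
        # right-pad short inputs up to maxlen
        clipped += [pad_value] * (maxlen - len(clipped))
    return clipped


print(truncate_or_pad([7, 8, 9, 10], maxlen=3, pad=False))  # [7, 8, 9]
print(truncate_or_pad([7, 8], maxlen=3, pad=False))         # [7, 8]
print(truncate_or_pad([7, 8], maxlen=3, pad=True))          # [7, 8, 0]
```

With `pad=False`, short reviews keep their original length, so padding only happens later, per batch, up to the longest sequence in that batch.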

In [0]:
train_dataset = CustomDataset(X_train_sorted_tensors, y_train_sorted_tensors)
val_dataset = CustomDataset(X_valid_tensors, y_valid_tensors)

In [20]:
print("train-set size:", len(train_dataset))
print("valid-set size:", len(val_dataset))

train-set size: 549888
valid-set size: 50000


In [0]:
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=SequentialSampler(train_dataset), shuffle=False, collate_fn=pad_and_sort_batch, worker_init_fn=worker_init_fn)
valid_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=pad_and_sort_batch, worker_init_fn=worker_init_fn)
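
Both loaders delegate batching to `pad_and_sort_batch`, whose source is not included in this notebook. The sketch below is a hypothetical list-based version of such a collate function: it takes the `(features, label, length)` triples produced by `CustomDataset`, orders them by decreasing length and pads with 0. The real helper must return tensors, since the training loop calls `input_batch.size(0)` and `labels.float()`; plain lists are used here only to keep the example dependency-free.

```python
def pad_and_sort(batch, pad_value=0):
    """Hypothetical list-based stand-in for utils.pad_and_sort_batch."""
    # Each item mirrors CustomDataset.__getitem__: (features, label, length).
    batch = sorted(batch, key=lambda item: item[2], reverse=True)
    max_len = batch[0][2]  # longest sequence comes first after sorting
    padded = [list(f) + [pad_value] * (max_len - n) for f, _, n in batch]
    labels = [label for _, label, _ in batch]
    lengths = [n for _, _, n in batch]
    return padded, labels, lengths


batch = [([4, 5], 1, 2), ([6, 7, 8, 9], 0, 4), ([3], 1, 1)]
inputs, labels, lengths = pad_and_sort(batch)
print(inputs)   # [[6, 7, 8, 9], [4, 5, 0, 0], [3, 0, 0, 0]]
print(lengths)  # [4, 2, 1]
```

Ordering by decreasing length is what `pack_padded_sequence` expects in the PyTorch version printed at the top of the notebook, which is likely why the RNN variants receive `lengths` alongside the padded batch.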



## Train the model

In [0]:
from SIAttentionBiLSTM import SIAttentionBiLSTM
from AvgBiLSTM import AvgBiLSTM
from Transformer import Transformer

### Set model hyperparameters

In [0]:
#AvgBiLSTM
# parameters = {
#     'hidden_size': 256,
#     'embedding': embedding_matrix,
#     'rnn_dropout': 0.2,
#     'embedding_dropout': 0.6,
# }

#InterAttentionBiLSTM
# parameters = {
#     'hidden_size': 256,
#     'embedding': embedding_matrix,
#     'att_method': 'concat',
#     'rnn_dropout': 0.3,
#     'embedding_dropout': 0.6,
#     'context_dropout': 0.2
# }

# SIAttention-BiLSTM
# parameters = {
#     'hidden_size': 128,
#     'embedding': embedding_matrix,
#     'att_method': 'concat',
#     'n_layers': 1,
#     'd_model': 256, 
#     'n_heads': 4, 
#     'd_k': 64, 
#     'd_v': 64, 
#     'rnn_dropout': 0.1,
#     'transformer_dropout':0.1,
#     'embedding_dropout': 0.5,
#     'context_dropout': 0.1,
#     'multihead_dropout': 0.1,
#     'inter_att_dropout':0.1,
#     'self_att_dropout': 0.1
# }

# Transformer
parameters = {
    'n_layers': 3,
    'd_model': 300, 
    'd_ff':512, 
    'n_heads': 5, 
    'd_k': 60, 
    'd_v': 60, 
    'final_att_method': 'concat',
    'embedding': embedding_matrix,
    'embedding_dropout': 0.5,
    'multihead_dropout': 0.1,
    'att_dropout': 0.1,
    'feedforward_dropout': 0.1,
    'final_dropout': 0.1
}
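
Both attention configurations above follow the usual multi-head convention that the per-head sizes add up to the model width (`n_heads * d_k == d_model`), and the Transformer's `d_model` of 300 equals the embedding dimension. Whether the model classes enforce this is not visible here, so the check below is only a hedged sanity test one could run before building a model; `check_attention_dims` is a made-up helper name.

```python
def check_attention_dims(params):
    """Return True when the per-head sizes add up to the model width."""
    heads, d_model = params['n_heads'], params['d_model']
    return heads * params['d_k'] == d_model and heads * params['d_v'] == d_model


# Values copied from the Transformer and SIAttention-BiLSTM settings above.
transformer_params = {'d_model': 300, 'n_heads': 5, 'd_k': 60, 'd_v': 60}
si_attention_params = {'d_model': 256, 'n_heads': 4, 'd_k': 64, 'd_v': 64}

print(check_attention_dims(transformer_params))   # True
print(check_attention_dims(si_attention_params))  # True
```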


In [24]:
# Polarity
#model = AvgBiLSTM(output_size=1, hidden_size=parameters['hidden_size'], embedding=parameters['embedding'], rnn_dropout=parameters['rnn_dropout'], embedding_dropout=parameters['embedding_dropout'])
#model = SIAttentionBiLSTM(n_layers=parameters['n_layers'], d_model=parameters['d_model'], d_k=parameters['d_k'], d_v=parameters['d_v'], n_heads=parameters['n_heads'], hidden_size=parameters['hidden_size'], embedding=parameters['embedding'], method=parameters['att_method'], rnn_dropout=parameters['rnn_dropout'], embedding_dropout=parameters['embedding_dropout'], multihead_dropout=parameters['multihead_dropout'], context_dropout=parameters['context_dropout'], transformer_dropout=parameters['transformer_dropout'], inter_att_dropout=parameters['inter_att_dropout'], self_att_dropout=parameters['self_att_dropout'])

architecture = (parameters['n_layers'], parameters['d_model'], parameters['d_ff'], parameters['n_heads'], parameters['d_k'], parameters['d_v'])
model = Transformer(output_size=1, architecture=architecture, embedding=parameters['embedding'], method=parameters['final_att_method'], maxpos=MAXLEN, embedding_dropout=parameters['embedding_dropout'], multihead_dropout=parameters['multihead_dropout'], att_dropout=parameters['att_dropout'], feedforward_dropout=parameters['feedforward_dropout'], final_dropout=parameters['final_dropout'])


# Full
#model = AvgBiLSTM(output_size=5, hidden_size=parameters['hidden_size'], embedding=parameters['embedding'], rnn_dropout=parameters['rnn_dropout'], embedding_dropout=parameters['embedding_dropout'])
#model = AttnBiLSTM(output_size=5, hidden_size=parameters['hidden_size'], embedding=parameters['embedding'], method=parameters['att_method'], rnn_dropout=parameters['rnn_dropout'], embedding_dropout=parameters['embedding_dropout'], context_dropout=parameters['context_dropout'])
#model = SIAttentionBiLSTM(output_size=5, n_layers=parameters['n_layers'], d_model=parameters['d_model'], d_k=parameters['d_k'], d_v=parameters['d_v'], n_heads=parameters['n_heads'], hidden_size=parameters['hidden_size'], embedding=parameters['embedding'], method=parameters['att_method'], rnn_dropout=parameters['rnn_dropout'], embedding_dropout=parameters['embedding_dropout'], multihead_dropout=parameters['multihead_dropout'], context_dropout=parameters['context_dropout'], transformer_dropout=parameters['transformer_dropout'], inter_att_dropout=parameters['inter_att_dropout'], self_att_dropout=parameters['self_att_dropout'])

model.to(DEVICE)

Transformer(
  (pos_add_word_embedding): PE_add_Embedding(
    (word_embedding): Embedding(40000, 300)
    (pos_embedding): Embedding(501, 300)
    (dropout): Dropout(p=0.5)
  )
  (encode_layers): ModuleList(
    (0): EncoderLayer(
      (multihead_attention): MultiHeadAttention(
        (linear_Q): Linear(in_features=300, out_features=300, bias=True)
        (linear_K): Linear(in_features=300, out_features=300, bias=True)
        (linear_V): Linear(in_features=300, out_features=300, bias=True)
        (dropout): Dropout(p=0.1)
        (attn): ScaleDotProductAttention(
          (dropout): Dropout(p=0.1)
        )
        (out_linear): Linear(in_features=300, out_features=300, bias=True)
        (layer_norm): LayerNorm(torch.Size([300]), eps=1e-05, elementwise_affine=True)
      )
      (postion_feedforward): PoswiseFeedForward(
        (conv1): Conv1d(300, 512, kernel_size=(1,), stride=(1,))
        (conv2): Conv1d(512, 300, kernel_size=(1,), stride=(1,))
        (dropout): Dropout(p=

In [0]:
# Polarity
criterion = nn.BCEWithLogitsLoss()

# Full
#criterion = nn.CrossEntropyLoss()


# param_optimizer = list(model.named_parameters())
# no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight', 'lstm_encoder.embedding.weight']
# optimizer_grouped_parameters = [
#     {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.00001},
#     {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
#     ]


optimizer = torch.optim.Adam(model.parameters())
#scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.9, last_epoch=-1)

In [0]:
def validate(model, criterion, history):
    model.eval()
    costs = []
    accs = []
    with torch.no_grad():
        for idx, batch in enumerate(valid_dataloader):
            input_batch, labels, lengths = batch
            labels = labels.float().unsqueeze(1) # polarity
            #output = model(input_batch, lengths) # RNN
            output = model(input_batch)  # Transformer
            loss = criterion(output, labels)
            costs.append(loss.item())
            # polarity
            accs.append((output>0).eq(labels>0).float().mean().item())
            
            # full
            #_, preds = torch.max(output, 1)
            #accs.append((preds == labels).float().mean().item())
            torch.cuda.empty_cache()
    mean_accs = np.mean(accs)
    mean_costs = np.mean(costs)
    if mean_accs > history['best_acc']:  
        history['best_acc'] = mean_accs
        history['best_model'] = copy.deepcopy(model.state_dict())
        
    history['validate_accuracy'].append(mean_accs)
    history['validate_loss'].append(mean_costs)
    return mean_costs, mean_accs


def train(model, criterion, optimizer, epoch, history, validate_points):
    model.train()
    costs = []
    accs = []
    with tqdm(total=len(train_dataset), desc='Epoch {}'.format(epoch)) as pbar:
        for idx, batch in enumerate(train_dataloader):
            input_batch, labels, lengths = batch
            labels = labels.float().unsqueeze(1) # polarity
            #output = model(input_batch, lengths) # RNN
            output = model(input_batch)  # Transformer
            loss = criterion(output, labels)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                costs.append(loss.item())
                # polarity
                accs.append((output>0).eq(labels>0).float().mean().item())

                # full
                #_, preds = torch.max(output, 1)
                #accs.append((preds == labels).float().mean().item())
                pbar.update(input_batch.size(0))
                pbar.set_postfix_str('train-loss: {:.4f}, train-acc: {:.4f}'.format(np.mean(costs), np.mean(accs)))
            if idx in validate_points:
                val_loss, val_acc = validate(model, criterion, history)
                pbar.set_postfix_str('train-loss: {:.4f}, train-acc: {:.4f}, val-loss: {:.4f}, val-acc: {:.4f}'.format(np.mean(costs), np.mean(accs), val_loss, val_acc))
                model.train()
            torch.cuda.empty_cache()
    
    history['train_loss'].append(costs)
    history['train_accuracy'].append(accs)
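
The polarity accuracy in both functions compares `output > 0` with `labels > 0`. This works because the model outputs raw logits (the loss is `BCEWithLogitsLoss`), and a logit above 0 corresponds to a sigmoid probability above 0.5. The following dependency-free sketch illustrates that equivalence with made-up numbers; `binary_accuracy` is a hypothetical helper, not part of the notebook.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_accuracy(logits, labels):
    """Fraction of samples where the sign of the logit agrees with the 0/1 label."""
    hits = [(z > 0) == (y > 0) for z, y in zip(logits, labels)]
    return sum(hits) / len(hits)


logits = [2.3, -0.7, 0.1, -1.5]
labels = [1.0, 0.0, 0.0, 0.0]
# Thresholding the logit at 0 gives the same decision as thresholding the probability at 0.5.
assert all((z > 0) == (sigmoid(z) > 0.5) for z in logits)
print(binary_accuracy(logits, labels))  # 0.75
```

One caveat worth keeping in mind: the reported metrics are means of per-batch means. For validation, 50,000 samples with a batch size of 128 give a final batch of only 80 samples, so the reported value can differ very slightly from the accuracy computed over all samples at once.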

In [0]:
history = { 'best_acc': 0,
            'best_model': None,
            'optimizer': optimizer.state_dict(),
            'train_accuracy': [],
            'train_loss': [],
            'validate_accuracy': [],
            'validate_loss': [],
            'batch_size': BATCH_SIZE,
            'num_of_batch': len(train_dataloader),
            'train_size': len(train_dataset),
            'validate_size': len(val_dataset),
            'validate_points': None,
            'epochs': 0,
            'embedding_size': embedding_matrix.shape,
            'parameters': parameters
          }

In [0]:
epochs = 10
validate_points = list(np.linspace(0, len(train_dataloader)-1, 4).astype(int))[1:]
history['epochs'] = epochs
history['validate_points'] = validate_points
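
To make the `validate_points` expression concrete, here is a pure-Python re-computation of `list(np.linspace(0, n - 1, 4).astype(int))[1:]`. It is a sketch written for this explanation (the name `validation_points` is not from the notebook) and uses the batch count implied by the sizes above, 549,888 / 128 = 4,296.

```python
def validation_points(num_batches, num_points=4):
    """Pure-Python equivalent of list(np.linspace(0, num_batches - 1, num_points).astype(int))[1:]."""
    last = num_batches - 1
    step = last / (num_points - 1)
    points = [int(i * step) for i in range(num_points)]  # int() truncates like astype(int)
    points[-1] = last  # linspace returns the endpoint exactly
    return points[1:]  # drop batch 0 so validation never runs before any training


print(validation_points(4296))  # [1431, 2863, 4295]
```

So validation runs three times per epoch, after roughly one third, two thirds and all of the batches, and the `val-loss` and `val-acc` shown at the end of each progress bar come from the last of those runs.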

## Attention Model on Douban dataset

### Transformer

In [39]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Transformer-on-Douban', timestamp))

Epoch 1: 100%|██████████| 549888/549888 [06:58<00:00, 1314.88it/s, train-loss: 0.4842, train-acc: 0.7662, val-loss: 0.4287, val-acc: 0.8027]
Epoch 2: 100%|██████████| 549888/549888 [06:52<00:00, 1213.96it/s, train-loss: 0.4324, train-acc: 0.8001, val-loss: 0.4205, val-acc: 0.8054]
Epoch 3: 100%|██████████| 549888/549888 [06:53<00:00, 1205.18it/s, train-loss: 0.4204, train-acc: 0.8075, val-loss: 0.4224, val-acc: 0.8054]
Epoch 4: 100%|██████████| 549888/549888 [06:55<00:00, 1221.90it/s, train-loss: 0.4118, train-acc: 0.8131, val-loss: 0.4253, val-acc: 0.8076]
Epoch 5: 100%|██████████| 549888/549888 [06:55<00:00, 1195.52it/s, train-loss: 0.4066, train-acc: 0.8154, val-loss: 0.4266, val-acc: 0.8073]
Epoch 6: 100%|██████████| 549888/549888 [06:52<00:00, 1181.50it/s, train-loss: 0.4011, train-acc: 0.8185, val-loss: 0.4213, val-acc: 0.8083]
Epoch 7: 100%|██████████| 549888/549888 [06:51<00:00, 1248.69it/s, train-loss: 0.3997, train-acc: 0.8189, val-loss: 0.4288, val-acc: 0.8080]
Epoch 8: 100%

CPU times: user 36min 15s, sys: 31min 46s, total: 1h 8min 1s
Wall time: 1h 9min 12s


### AvgBiLSTM

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Avg-BiLSTM-on-Douban', timestamp))

Epoch 1: 100%|██████████| 549888/549888 [03:20<00:00, 2744.37it/s, train-loss: 0.4324, train-acc: 0.7963, val-loss: 0.4882, val-acc: 0.8284]
Epoch 2: 100%|██████████| 549888/549888 [03:20<00:00, 2745.53it/s, train-loss: 0.3658, train-acc: 0.8356, val-loss: 0.4779, val-acc: 0.8341]
Epoch 3: 100%|██████████| 549888/549888 [03:20<00:00, 2744.15it/s, train-loss: 0.3396, train-acc: 0.8495, val-loss: 0.4667, val-acc: 0.8345]
Epoch 4: 100%|██████████| 549888/549888 [03:20<00:00, 2739.73it/s, train-loss: 0.3202, train-acc: 0.8595, val-loss: 0.4631, val-acc: 0.8378]
Epoch 5: 100%|██████████| 549888/549888 [03:20<00:00, 2744.05it/s, train-loss: 0.3047, train-acc: 0.8677, val-loss: 0.4620, val-acc: 0.8364]
Epoch 6: 100%|██████████| 549888/549888 [03:20<00:00, 2745.13it/s, train-loss: 0.2909, train-acc: 0.8744, val-loss: 0.4603, val-acc: 0.8336]
Epoch 7: 100%|██████████| 549888/549888 [03:20<00:00, 2749.22it/s, train-loss: 0.2784, train-acc: 0.8808, val-loss: 0.4585, val-acc: 0.8346]
Epoch 8: 100%

CPU times: user 17min 29s, sys: 14min 34s, total: 32min 3s
Wall time: 33min 37s


### Self-Attention + Inter-Attention + BiLSTM

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Self-Inter-Attention-BiLSTM-on-Douban', timestamp))

Epoch 1: 100%|██████████| 549888/549888 [04:11<00:00, 2190.35it/s, train-loss: 0.4296, train-acc: 0.7990, val-loss: 0.3705, val-acc: 0.8353]
Epoch 2: 100%|██████████| 549888/549888 [04:10<00:00, 2197.49it/s, train-loss: 0.3622, train-acc: 0.8382, val-loss: 0.3657, val-acc: 0.8400]
Epoch 3: 100%|██████████| 549888/549888 [04:09<00:00, 2204.35it/s, train-loss: 0.3341, train-acc: 0.8527, val-loss: 0.3735, val-acc: 0.8423]
Epoch 4: 100%|██████████| 549888/549888 [04:09<00:00, 2199.72it/s, train-loss: 0.3123, train-acc: 0.8639, val-loss: 0.3791, val-acc: 0.8427]
Epoch 5: 100%|██████████| 549888/549888 [04:08<00:00, 2210.39it/s, train-loss: 0.2948, train-acc: 0.8730, val-loss: 0.4060, val-acc: 0.8426]
Epoch 6: 100%|██████████| 549888/549888 [04:09<00:00, 2207.53it/s, train-loss: 0.2790, train-acc: 0.8807, val-loss: 0.4094, val-acc: 0.8403]
Epoch 7: 100%|██████████| 549888/549888 [04:09<00:00, 2207.03it/s, train-loss: 0.2662, train-acc: 0.8867, val-loss: 0.4383, val-acc: 0.8394]
Epoch 8: 100%

CPU times: user 22min 29s, sys: 18min 43s, total: 41min 13s
Wall time: 41min 48s


### BiLSTM + Inter-Attention (query=none, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-no-query-BiLSTM-on-Douban', timestamp))

Epoch 1: 100%|██████████| 549888/549888 [02:32<00:00, 3931.00it/s, train-loss: 0.4411, train-acc: 0.7907, val-loss: 0.3723, val-acc: 0.8309]
Epoch 2: 100%|██████████| 549888/549888 [02:33<00:00, 3572.18it/s, train-loss: 0.3738, train-acc: 0.8313, val-loss: 0.3566, val-acc: 0.8412]
Epoch 3: 100%|██████████| 549888/549888 [02:34<00:00, 3858.79it/s, train-loss: 0.3484, train-acc: 0.8457, val-loss: 0.3585, val-acc: 0.8430]
Epoch 4: 100%|██████████| 549888/549888 [02:33<00:00, 3869.03it/s, train-loss: 0.3301, train-acc: 0.8551, val-loss: 0.3588, val-acc: 0.8435]
Epoch 5: 100%|██████████| 549888/549888 [02:34<00:00, 3852.83it/s, train-loss: 0.3152, train-acc: 0.8622, val-loss: 0.3641, val-acc: 0.8437]
Epoch 6: 100%|██████████| 549888/549888 [02:34<00:00, 3877.36it/s, train-loss: 0.3020, train-acc: 0.8690, val-loss: 0.3709, val-acc: 0.8435]
Epoch 7: 100%|██████████| 549888/549888 [02:34<00:00, 3553.67it/s, train-loss: 0.2904, train-acc: 0.8745, val-loss: 0.3838, val-acc: 0.8423]
Epoch 8: 100%

CPU times: user 18min 28s, sys: 18min 36s, total: 37min 4s
Wall time: 38min 59s


### BiLSTM + Inter-Attention (query=independent parameter, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-independent-query-BiLSTM-on-Douban', timestamp))

Epoch 1: 100%|██████████| 549888/549888 [02:56<00:00, 3110.58it/s, train-loss: 0.4412, train-acc: 0.7906, val-loss: 0.3709, val-acc: 0.8323]
Epoch 2: 100%|██████████| 549888/549888 [03:00<00:00, 2917.48it/s, train-loss: 0.3744, train-acc: 0.8311, val-loss: 0.3623, val-acc: 0.8408]
Epoch 3: 100%|██████████| 549888/549888 [03:00<00:00, 2966.50it/s, train-loss: 0.3495, train-acc: 0.8451, val-loss: 0.3578, val-acc: 0.8417]
Epoch 4: 100%|██████████| 549888/549888 [02:57<00:00, 3482.33it/s, train-loss: 0.3314, train-acc: 0.8546, val-loss: 0.3557, val-acc: 0.8450]
Epoch 5: 100%|██████████| 549888/549888 [03:04<00:00, 2973.11it/s, train-loss: 0.3165, train-acc: 0.8621, val-loss: 0.3618, val-acc: 0.8445]
Epoch 6: 100%|██████████| 549888/549888 [03:00<00:00, 3570.65it/s, train-loss: 0.3034, train-acc: 0.8687, val-loss: 0.3719, val-acc: 0.8440]
Epoch 7: 100%|██████████| 549888/549888 [02:53<00:00, 3546.38it/s, train-loss: 0.2911, train-acc: 0.8745, val-loss: 0.3784, val-acc: 0.8415]
Epoch 8: 100%

CPU times: user 23min 32s, sys: 19min 55s, total: 43min 27s
Wall time: 44min 26s


### BiLSTM + Inter-Attention (query=h_n of BiLSTM, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-h-query-BiLSTM-on-Douban', timestamp))

Epoch 1: 100%|██████████| 549888/549888 [02:55<00:00, 3578.42it/s, train-loss: 0.4406, train-acc: 0.7906, val-loss: 0.3693, val-acc: 0.8330]
Epoch 2: 100%|██████████| 549888/549888 [02:59<00:00, 3059.96it/s, train-loss: 0.3734, train-acc: 0.8318, val-loss: 0.3611, val-acc: 0.8406]
Epoch 3: 100%|██████████| 549888/549888 [02:59<00:00, 2931.91it/s, train-loss: 0.3481, train-acc: 0.8455, val-loss: 0.3566, val-acc: 0.8425]
Epoch 4: 100%|██████████| 549888/549888 [02:59<00:00, 2962.10it/s, train-loss: 0.3296, train-acc: 0.8555, val-loss: 0.3569, val-acc: 0.8453]
Epoch 5: 100%|██████████| 549888/549888 [02:59<00:00, 2925.64it/s, train-loss: 0.3143, train-acc: 0.8632, val-loss: 0.3625, val-acc: 0.8449]
Epoch 6: 100%|██████████| 549888/549888 [02:55<00:00, 3596.45it/s, train-loss: 0.3015, train-acc: 0.8693, val-loss: 0.3749, val-acc: 0.8437]
Epoch 7: 100%|██████████| 549888/549888 [02:51<00:00, 3574.43it/s, train-loss: 0.2889, train-acc: 0.8753, val-loss: 0.3784, val-acc: 0.8431]
Epoch 8: 100%

CPU times: user 22min 6s, sys: 20min 21s, total: 42min 28s
Wall time: 43min 54s


### BiLSTM + Inter-Attention (query=c_n of BiLSTM, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-c-query-BiLSTM-on-Douban', timestamp))

Epoch 1: 100%|██████████| 549888/549888 [03:00<00:00, 3511.25it/s, train-loss: 0.4409, train-acc: 0.7907, val-loss: 0.3703, val-acc: 0.8327]
Epoch 2: 100%|██████████| 549888/549888 [03:05<00:00, 2913.23it/s, train-loss: 0.3735, train-acc: 0.8317, val-loss: 0.3614, val-acc: 0.8403]
Epoch 3: 100%|██████████| 549888/549888 [03:05<00:00, 2894.45it/s, train-loss: 0.3484, train-acc: 0.8452, val-loss: 0.3569, val-acc: 0.8424]
Epoch 4: 100%|██████████| 549888/549888 [03:05<00:00, 2878.36it/s, train-loss: 0.3300, train-acc: 0.8553, val-loss: 0.3581, val-acc: 0.8456]
Epoch 5: 100%|██████████| 549888/549888 [02:58<00:00, 3073.18it/s, train-loss: 0.3147, train-acc: 0.8627, val-loss: 0.3607, val-acc: 0.8458]
Epoch 6: 100%|██████████| 549888/549888 [03:08<00:00, 2848.56it/s, train-loss: 0.3017, train-acc: 0.8693, val-loss: 0.3762, val-acc: 0.8440]
Epoch 7: 100%|██████████| 549888/549888 [03:09<00:00, 2890.13it/s, train-loss: 0.2896, train-acc: 0.8749, val-loss: 0.3750, val-acc: 0.8438]
Epoch 8: 100%

CPU times: user 23min 12s, sys: 21min 38s, total: 44min 50s
Wall time: 47min 7s


## Attention Model on Yelp Full dataset

### Transformer

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Transformer-on-Yelp-Full', timestamp))

Epoch 1: 100%|██████████| 599936/599936 [33:31<00:00, 307.00it/s, train-loss: 1.0388, train-acc: 0.5428, val-loss: 0.9268, val-acc: 0.5935]
Epoch 2: 100%|██████████| 599936/599936 [33:26<00:00, 307.92it/s, train-loss: 0.9486, train-acc: 0.5844, val-loss: 0.9152, val-acc: 0.5980]
Epoch 3: 100%|██████████| 599936/599936 [33:33<00:00, 306.19it/s, train-loss: 0.9252, train-acc: 0.5951, val-loss: 0.9139, val-acc: 0.6032]
Epoch 4: 100%|██████████| 599936/599936 [33:35<00:00, 307.28it/s, train-loss: 0.9112, train-acc: 0.6022, val-loss: 0.9142, val-acc: 0.6043]
Epoch 5: 100%|██████████| 599936/599936 [33:35<00:00, 306.73it/s, train-loss: 0.9009, train-acc: 0.6077, val-loss: 0.9115, val-acc: 0.6049]
Epoch 6:  67%|██████▋   | 400000/599936 [19:45<06:16, 530.85it/s, train-loss: 0.8933, train-acc: 0.6110]

### Avg-BiLSTM

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Avg-BiLSTM-on-Yelp-Full', timestamp))

Epoch 1: 100%|██████████| 599808/599808 [09:55<00:00, 1007.92it/s, train-loss: 0.9595, train-acc: 0.5781, val-loss: 1.2271, val-acc: 0.6233]
Epoch 2: 100%|██████████| 599808/599808 [09:58<00:00, 1001.68it/s, train-loss: 0.8298, train-acc: 0.6365, val-loss: 1.1821, val-acc: 0.6471]
Epoch 3: 100%|██████████| 599808/599808 [10:00<00:00, 999.45it/s, train-loss: 0.7858, train-acc: 0.6566, val-loss: 1.1583, val-acc: 0.6586] 
Epoch 4: 100%|██████████| 599808/599808 [09:59<00:00, 1000.60it/s, train-loss: 0.7560, train-acc: 0.6699, val-loss: 1.1371, val-acc: 0.6622]
Epoch 5: 100%|██████████| 599808/599808 [10:00<00:00, 998.43it/s, train-loss: 0.7344, train-acc: 0.6792, val-loss: 1.1288, val-acc: 0.6620] 
Epoch 6: 100%|██████████| 599808/599808 [09:57<00:00, 1003.46it/s, train-loss: 0.7164, train-acc: 0.6888, val-loss: 1.1196, val-acc: 0.6635]
Epoch 7: 100%|██████████| 599808/599808 [09:54<00:00, 1009.75it/s, train-loss: 0.7017, train-acc: 0.6962, val-loss: 1.1125, val-acc: 0.6625]
Epoch 8: 100%

CPU times: user 40min 23s, sys: 56min 54s, total: 1h 37min 18s
Wall time: 1h 39min 37s


### Self-Attention + Inter-Attention + BiLSTM

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Self-Inter-Attention-BiLSTM-on-Yelp-Full', timestamp))

Epoch 1: 100%|██████████| 599808/599808 [15:16<00:00, 654.52it/s, train-loss: 0.7083, train-acc: 0.6938, val-loss: 0.8137, val-acc: 0.6677] 
Epoch 2: 100%|██████████| 599808/599808 [15:20<00:00, 651.73it/s, train-loss: 0.6905, train-acc: 0.7023, val-loss: 0.8259, val-acc: 0.6685] 
Epoch 3: 100%|██████████| 599808/599808 [15:20<00:00, 651.35it/s, train-loss: 0.6764, train-acc: 0.7092, val-loss: 0.8248, val-acc: 0.6664] 
Epoch 4: 100%|██████████| 599808/599808 [15:18<00:00, 653.36it/s, train-loss: 0.6621, train-acc: 0.7159, val-loss: 0.8326, val-acc: 0.6655] 
Epoch 5: 100%|██████████| 599808/599808 [15:18<00:00, 653.09it/s, train-loss: 0.6501, train-acc: 0.7219, val-loss: 0.8485, val-acc: 0.6652] 


CPU times: user 28min 15s, sys: 47min 34s, total: 1h 15min 49s
Wall time: 1h 16min 39s


### BiLSTM + Inter-Attention(query=none, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-no-query-BiLSTM-on-Yelp-Full', timestamp))

Epoch 1: 100%|██████████| 599808/599808 [10:31<00:00, 950.39it/s, train-loss: 0.9454, train-acc: 0.5862, val-loss: 0.8189, val-acc: 0.6435] 
Epoch 2: 100%|██████████| 599808/599808 [10:33<00:00, 946.85it/s, train-loss: 0.8198, train-acc: 0.6421, val-loss: 0.7827, val-acc: 0.6561] 
Epoch 3: 100%|██████████| 599808/599808 [10:34<00:00, 944.82it/s, train-loss: 0.7788, train-acc: 0.6595, val-loss: 0.7655, val-acc: 0.6656] 
Epoch 4: 100%|██████████| 599808/599808 [10:34<00:00, 945.14it/s, train-loss: 0.7532, train-acc: 0.6719, val-loss: 0.7584, val-acc: 0.6699] 
Epoch 5: 100%|██████████| 599808/599808 [10:34<00:00, 945.34it/s, train-loss: 0.7339, train-acc: 0.6808, val-loss: 0.7568, val-acc: 0.6713] 
Epoch 6: 100%|██████████| 599808/599808 [10:34<00:00, 945.37it/s, train-loss: 0.7178, train-acc: 0.6875, val-loss: 0.7579, val-acc: 0.6733] 
Epoch 7: 100%|██████████| 599808/599808 [10:35<00:00, 944.14it/s, train-loss: 0.7049, train-acc: 0.6940, val-loss: 0.7555, val-acc: 0.6732] 
Epoch 8: 100%

CPU times: user 1h 1min 53s, sys: 1h 33min 47s, total: 2h 35min 41s
Wall time: 2h 38min 45s


### BiLSTM + Inter-Attention(query=independent parameter, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time())) 
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-independent-query-BiLSTM-on-Yelp-Full', timestamp))

Epoch 1: 100%|██████████| 599808/599808 [14:36<00:00, 684.06it/s, train-loss: 0.9489, train-acc: 0.5854, val-loss: 0.8246, val-acc: 0.6403] 
Epoch 2: 100%|██████████| 599808/599808 [14:38<00:00, 682.84it/s, train-loss: 0.8219, train-acc: 0.6409, val-loss: 0.7819, val-acc: 0.6576] 
Epoch 3: 100%|██████████| 599808/599808 [14:38<00:00, 683.06it/s, train-loss: 0.7799, train-acc: 0.6594, val-loss: 0.7652, val-acc: 0.6650] 
Epoch 4: 100%|██████████| 599808/599808 [14:38<00:00, 682.86it/s, train-loss: 0.7529, train-acc: 0.6725, val-loss: 0.7577, val-acc: 0.6691] 
Epoch 5: 100%|██████████| 599808/599808 [14:38<00:00, 683.01it/s, train-loss: 0.7336, train-acc: 0.6812, val-loss: 0.7563, val-acc: 0.6725] 
Epoch 6: 100%|██████████| 599808/599808 [14:39<00:00, 682.18it/s, train-loss: 0.7166, train-acc: 0.6890, val-loss: 0.7570, val-acc: 0.6752] 
Epoch 7: 100%|██████████| 599808/599808 [14:39<00:00, 682.34it/s, train-loss: 0.7037, train-acc: 0.6949, val-loss: 0.7553, val-acc: 0.6759] 
Epoch 8: 100%

CPU times: user 1h 23min 54s, sys: 2h 13min 27s, total: 3h 37min 22s
Wall time: 3h 40min 5s


### BiLSTM + Inter-Attention(query=h_n of BiLSTM, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-h-query-BiLSTM-on-Yelp-Full', timestamp))

Epoch 1: 100%|██████████| 599808/599808 [14:49<00:00, 674.08it/s, train-loss: 0.9443, train-acc: 0.5872, val-loss: 0.8211, val-acc: 0.6428] 
Epoch 2: 100%|██████████| 599808/599808 [14:52<00:00, 672.42it/s, train-loss: 0.8198, train-acc: 0.6422, val-loss: 0.7812, val-acc: 0.6563] 
Epoch 3: 100%|██████████| 599808/599808 [14:51<00:00, 672.97it/s, train-loss: 0.7767, train-acc: 0.6611, val-loss: 0.7665, val-acc: 0.6651] 
Epoch 4: 100%|██████████| 599808/599808 [14:50<00:00, 673.51it/s, train-loss: 0.7495, train-acc: 0.6738, val-loss: 0.7572, val-acc: 0.6687] 
Epoch 5: 100%|██████████| 599808/599808 [14:58<00:00, 667.23it/s, train-loss: 0.7297, train-acc: 0.6823, val-loss: 0.7572, val-acc: 0.6716] 
Epoch 6: 100%|██████████| 599808/599808 [14:57<00:00, 668.21it/s, train-loss: 0.7128, train-acc: 0.6905, val-loss: 0.7554, val-acc: 0.6738] 
Epoch 7: 100%|██████████| 599808/599808 [14:58<00:00, 667.64it/s, train-loss: 0.6995, train-acc: 0.6968, val-loss: 0.7559, val-acc: 0.6749] 
Epoch 8: 100%

CPU times: user 1h 23min 46s, sys: 2h 14min 24s, total: 3h 38min 11s
Wall time: 3h 43min 55s
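
The model definition is not shown in this part of the notebook, so the following is only a rough sketch of what "query=h_n, key=value=BiLSTM output" means: the query scores every time step of the BiLSTM output, and the softmax of those scores weights the sum. Dot-product scoring, plain Python lists instead of tensors, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inter_attention(query, outputs):
    """Attention pooling with key = value = outputs.

    query:   one vector (stand-in for h_n, c_n, or a learned parameter)
    outputs: list of vectors, one per time step (stand-in for BiLSTM output)
    Scoring is a plain dot product, which is an assumption of this sketch.
    """
    scores = [sum(q * o for q, o in zip(query, step)) for step in outputs]
    weights = softmax(scores)
    dim = len(outputs[0])
    context = [sum(w * step[d] for w, step in zip(weights, outputs))
               for d in range(dim)]
    return context, weights


# Toy data: three time steps with two features each.
toy_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 1.0]
context, weights = inter_attention(toy_query, toy_outputs)
```

With an all-zero query every score is equal, so the weights become uniform and the context is the plain average of the outputs.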


### BiLSTM + Inter-Attention(query=c_n of BiLSTM, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-c-query-BiLSTM-on-Yelp-Full', timestamp))

Epoch 1: 100%|██████████| 599808/599808 [14:40<00:00, 681.25it/s, train-loss: 0.9493, train-acc: 0.5849, val-loss: 0.8277, val-acc: 0.6381] 
Epoch 2: 100%|██████████| 599808/599808 [14:40<00:00, 681.18it/s, train-loss: 0.8209, train-acc: 0.6417, val-loss: 0.7820, val-acc: 0.6588] 
Epoch 3: 100%|██████████| 599808/599808 [14:40<00:00, 681.16it/s, train-loss: 0.7779, train-acc: 0.6606, val-loss: 0.7651, val-acc: 0.6657] 
Epoch 4: 100%|██████████| 599808/599808 [14:40<00:00, 681.03it/s, train-loss: 0.7500, train-acc: 0.6719, val-loss: 0.7563, val-acc: 0.6696] 
Epoch 5: 100%|██████████| 599808/599808 [14:39<00:00, 681.61it/s, train-loss: 0.7301, train-acc: 0.6822, val-loss: 0.7548, val-acc: 0.6711] 
Epoch 6: 100%|██████████| 599808/599808 [14:40<00:00, 681.30it/s, train-loss: 0.7136, train-acc: 0.6899, val-loss: 0.7531, val-acc: 0.6731] 
Epoch 7: 100%|██████████| 599808/599808 [14:40<00:00, 681.13it/s, train-loss: 0.6995, train-acc: 0.6967, val-loss: 0.7561, val-acc: 0.6740] 
Epoch 8: 100%

CPU times: user 1h 25min 6s, sys: 2h 12min 20s, total: 3h 37min 26s
Wall time: 3h 41min 12s


## Attention Model on Yelp Polarity dataset

### Transformer

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Transformer-on-Yelp-Polarity', timestamp))

Epoch 1: 100%|██████████| 519936/519936 [29:31<00:00, 520.60it/s, train-loss: 0.2367, train-acc: 0.9004, val-loss: 0.1696, val-acc: 0.9321]
Epoch 2: 100%|██████████| 519936/519936 [29:31<00:00, 525.97it/s, train-loss: 0.1900, train-acc: 0.9241, val-loss: 0.1716, val-acc: 0.9337]
Epoch 3: 100%|██████████| 519936/519936 [29:31<00:00, 519.12it/s, train-loss: 0.1836, train-acc: 0.9273, val-loss: 0.2008, val-acc: 0.9232]
Epoch 4: 100%|██████████| 519936/519936 [29:30<00:00, 517.21it/s, train-loss: 0.1778, train-acc: 0.9296, val-loss: 0.1802, val-acc: 0.9301]
Epoch 5: 100%|██████████| 519936/519936 [29:26<00:00, 524.16it/s, train-loss: 0.1764, train-acc: 0.9304, val-loss: 0.1939, val-acc: 0.9274]
Epoch 6: 100%|██████████| 519936/519936 [29:19<00:00, 528.94it/s, train-loss: 0.1771, train-acc: 0.9303, val-loss: 0.2145, val-acc: 0.9170]
Epoch 7: 100%|██████████| 519936/519936 [29:19<00:00, 531.18it/s, train-loss: 0.1796, train-acc: 0.9291, val-loss: 0.2007, val-acc: 0.9218]
Epoch 8: 100%|██████

Buffered data was truncated after reaching the output size limit.

### Avg-BiLSTM

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Avg-BiLSTM-on-Yelp-Polarity', timestamp))

Epoch 1: 100%|██████████| 519936/519936 [08:30<00:00, 1774.95it/s, train-loss: 0.1927, train-acc: 0.9211, val-loss: 0.3699, val-acc: 0.9506]
Epoch 2: 100%|██████████| 519936/519936 [08:28<00:00, 1796.64it/s, train-loss: 0.1245, train-acc: 0.9521, val-loss: 0.3468, val-acc: 0.9578]
Epoch 3: 100%|██████████| 519936/519936 [08:29<00:00, 1799.08it/s, train-loss: 0.1061, train-acc: 0.9598, val-loss: 0.3362, val-acc: 0.9602]
Epoch 4: 100%|██████████| 519936/519936 [08:29<00:00, 1748.77it/s, train-loss: 0.0940, train-acc: 0.9648, val-loss: 0.3222, val-acc: 0.9615]
Epoch 5: 100%|██████████| 519936/519936 [08:28<00:00, 1760.31it/s, train-loss: 0.0853, train-acc: 0.9684, val-loss: 0.3184, val-acc: 0.9620]
Epoch 6: 100%|██████████| 519936/519936 [08:29<00:00, 1785.11it/s, train-loss: 0.0782, train-acc: 0.9710, val-loss: 0.3041, val-acc: 0.9633]
Epoch 7: 100%|██████████| 519936/519936 [08:29<00:00, 1782.52it/s, train-loss: 0.0719, train-acc: 0.9736, val-loss: 0.3071, val-acc: 0.9636]
Epoch 8: 100%

CPU times: user 33min 44s, sys: 49min 19s, total: 1h 23min 4s
Wall time: 1h 25min 12s


In [0]:
history['best_acc']

0.9651671974522293
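
The `train` function that fills `history` is not shown in this section. As a minimal sketch, a running best accuracy could be tracked as below. The function name `record_validation` and the `val_loss` / `val_acc` keys are hypothetical; only the `best_acc` key appears in the notebook.

```python
def record_validation(history, val_loss, val_acc):
    """Append one validation result and update the running best accuracy.

    Hypothetical helper: the real train() may store different fields.
    """
    history.setdefault('val_loss', []).append(val_loss)
    history.setdefault('val_acc', []).append(val_acc)
    history['best_acc'] = max(history.get('best_acc', 0.0), val_acc)
    return history


# Values taken from the first three epoch lines of the Avg-BiLSTM log above.
demo_history = {}
for loss, acc in [(0.3699, 0.9506), (0.3468, 0.9578), (0.3362, 0.9602)]:
    record_validation(demo_history, loss, acc)
```

If validation also runs at intermediate `validate_points` within an epoch, `best_acc` can be higher than every end-of-epoch `val-acc` printed in the progress bar.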

### Self-Attention + Inter-Attention + BiLSTM

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Self-Inter-Attention-BiLSTM-on-Yelp-Polarity', timestamp))

Epoch 1: 100%|██████████| 519936/519936 [16:11<00:00, 1379.93it/s, train-loss: 0.1767, train-acc: 0.9299, val-loss: 0.1385, val-acc: 0.9552]
Epoch 2: 100%|██████████| 519936/519936 [16:09<00:00, 1381.49it/s, train-loss: 0.1162, train-acc: 0.9558, val-loss: 0.1234, val-acc: 0.9600]
Epoch 3: 100%|██████████| 519936/519936 [16:09<00:00, 1392.16it/s, train-loss: 0.0991, train-acc: 0.9630, val-loss: 0.1340, val-acc: 0.9592]
Epoch 4: 100%|██████████| 519936/519936 [16:09<00:00, 1382.54it/s, train-loss: 0.0877, train-acc: 0.9675, val-loss: 0.1207, val-acc: 0.9630]
Epoch 5: 100%|██████████| 519936/519936 [16:10<00:00, 1372.75it/s, train-loss: 0.0775, train-acc: 0.9716, val-loss: 0.1302, val-acc: 0.9615]
Epoch 6: 100%|██████████| 519936/519936 [16:10<00:00, 1369.78it/s, train-loss: 0.0699, train-acc: 0.9746, val-loss: 0.1335, val-acc: 0.9631]
Epoch 7: 100%|██████████| 519936/519936 [16:09<00:00, 1394.66it/s, train-loss: 0.0637, train-acc: 0.9769, val-loss: 0.1355, val-acc: 0.9625]
Epoch 8: 100%

CPU times: user 57min 8s, sys: 1h 42min 50s, total: 2h 39min 59s
Wall time: 2h 41min 47s


### BiLSTM + Inter-Attention(query=none, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-no-query-BiLSTM-on-Yelp-Polarity', timestamp))

Epoch 1: 100%|██████████| 519936/519936 [09:28<00:00, 1754.25it/s, train-loss: 0.1859, train-acc: 0.9238, val-loss: 0.1206, val-acc: 0.9536]
Epoch 2: 100%|██████████| 519936/519936 [09:28<00:00, 1713.39it/s, train-loss: 0.1212, train-acc: 0.9536, val-loss: 0.1084, val-acc: 0.9597]
Epoch 3: 100%|██████████| 519936/519936 [09:29<00:00, 1713.94it/s, train-loss: 0.1034, train-acc: 0.9610, val-loss: 0.1026, val-acc: 0.9615]
Epoch 4: 100%|██████████| 519936/519936 [09:28<00:00, 1749.26it/s, train-loss: 0.0922, train-acc: 0.9655, val-loss: 0.0981, val-acc: 0.9636]
Epoch 5: 100%|██████████| 519936/519936 [09:29<00:00, 1709.94it/s, train-loss: 0.0834, train-acc: 0.9691, val-loss: 0.1047, val-acc: 0.9620]
Epoch 6: 100%|██████████| 519936/519936 [09:28<00:00, 1710.36it/s, train-loss: 0.0761, train-acc: 0.9718, val-loss: 0.1040, val-acc: 0.9642]
Epoch 7: 100%|██████████| 519936/519936 [09:30<00:00, 1703.49it/s, train-loss: 0.0705, train-acc: 0.9742, val-loss: 0.1007, val-acc: 0.9652]
Epoch 8: 100%

CPU times: user 56min 14s, sys: 1h 23min 4s, total: 2h 19min 19s
Wall time: 2h 22min 57s


### BiLSTM + Inter-Attention(query=independent parameter, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-independent-query-BiLSTM-on-Yelp-Polarity', timestamp))

Epoch 1: 100%|██████████| 519936/519936 [12:48<00:00, 1378.83it/s, train-loss: 0.1911, train-acc: 0.9213, val-loss: 0.1224, val-acc: 0.9529]
Epoch 2: 100%|██████████| 519936/519936 [12:50<00:00, 1385.24it/s, train-loss: 0.1237, train-acc: 0.9524, val-loss: 0.1099, val-acc: 0.9584]
Epoch 3: 100%|██████████| 519936/519936 [12:50<00:00, 1380.59it/s, train-loss: 0.1052, train-acc: 0.9603, val-loss: 0.1037, val-acc: 0.9614]
Epoch 4: 100%|██████████| 519936/519936 [12:52<00:00, 1376.14it/s, train-loss: 0.0935, train-acc: 0.9650, val-loss: 0.0976, val-acc: 0.9646]
Epoch 5: 100%|██████████| 519936/519936 [12:51<00:00, 1371.23it/s, train-loss: 0.0843, train-acc: 0.9686, val-loss: 0.1027, val-acc: 0.9633]
Epoch 6: 100%|██████████| 519936/519936 [12:49<00:00, 1378.91it/s, train-loss: 0.0772, train-acc: 0.9711, val-loss: 0.1019, val-acc: 0.9649]
Epoch 7: 100%|██████████| 519936/519936 [12:49<00:00, 1412.16it/s, train-loss: 0.0718, train-acc: 0.9735, val-loss: 0.1036, val-acc: 0.9650]
Epoch 8: 100%

CPU times: user 1h 14min 58s, sys: 1h 55min 18s, total: 3h 10min 17s
Wall time: 3h 12min 56s


### BiLSTM + Inter-Attention(query=h_n of BiLSTM, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-h-query-BiLSTM-on-Yelp-Polarity', timestamp))

Epoch 1: 100%|██████████| 519936/519936 [12:27<00:00, 1457.42it/s, train-loss: 0.1865, train-acc: 0.9236, val-loss: 0.1215, val-acc: 0.9535]
Epoch 2: 100%|██████████| 519936/519936 [12:27<00:00, 1450.70it/s, train-loss: 0.1212, train-acc: 0.9537, val-loss: 0.1076, val-acc: 0.9596]
Epoch 3: 100%|██████████| 519936/519936 [12:27<00:00, 1432.10it/s, train-loss: 0.1026, train-acc: 0.9611, val-loss: 0.1023, val-acc: 0.9620]
Epoch 4: 100%|██████████| 519936/519936 [12:28<00:00, 1443.66it/s, train-loss: 0.0909, train-acc: 0.9658, val-loss: 0.0975, val-acc: 0.9646]
Epoch 5: 100%|██████████| 519936/519936 [12:27<00:00, 1428.50it/s, train-loss: 0.0822, train-acc: 0.9694, val-loss: 0.1017, val-acc: 0.9633]
Epoch 6: 100%|██████████| 519936/519936 [12:30<00:00, 1406.56it/s, train-loss: 0.0744, train-acc: 0.9724, val-loss: 0.0996, val-acc: 0.9660]
Epoch 7: 100%|██████████| 519936/519936 [12:31<00:00, 1423.42it/s, train-loss: 0.0684, train-acc: 0.9747, val-loss: 0.1025, val-acc: 0.9656]
Epoch 8: 100%

CPU times: user 1h 10min 56s, sys: 1h 52min 58s, total: 3h 3min 55s
Wall time: 3h 7min 41s


In [0]:
history['best_acc']

0.9661375398089171

### BiLSTM + Inter-Attention(query=c_n of BiLSTM, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-c-query-BiLSTM-on-Yelp-Polarity', timestamp))

Epoch 1: 100%|██████████| 519936/519936 [12:28<00:00, 1418.73it/s, train-loss: 0.1867, train-acc: 0.9238, val-loss: 0.1210, val-acc: 0.9539]
Epoch 2: 100%|██████████| 519936/519936 [12:29<00:00, 1450.09it/s, train-loss: 0.1210, train-acc: 0.9537, val-loss: 0.1063, val-acc: 0.9596]
Epoch 3: 100%|██████████| 519936/519936 [12:27<00:00, 1434.72it/s, train-loss: 0.1026, train-acc: 0.9612, val-loss: 0.1050, val-acc: 0.9610]
Epoch 4: 100%|██████████| 519936/519936 [12:28<00:00, 1444.15it/s, train-loss: 0.0907, train-acc: 0.9661, val-loss: 0.0983, val-acc: 0.9642]
Epoch 5: 100%|██████████| 519936/519936 [12:28<00:00, 1443.83it/s, train-loss: 0.0817, train-acc: 0.9696, val-loss: 0.1016, val-acc: 0.9630]
Epoch 6: 100%|██████████| 519936/519936 [12:28<00:00, 1436.27it/s, train-loss: 0.0743, train-acc: 0.9723, val-loss: 0.0989, val-acc: 0.9661]
Epoch 7: 100%|██████████| 519936/519936 [12:27<00:00, 1430.79it/s, train-loss: 0.0682, train-acc: 0.9749, val-loss: 0.1029, val-acc: 0.9662]
Epoch 8: 100%

CPU times: user 1h 11min 55s, sys: 1h 52min 7s, total: 3h 4min 2s
Wall time: 3h 7min 36s


In [0]:
history['best_acc']

0.9661624203821656

## Attention Model on IMDB dataset

### Transformer

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Transformer-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 39936/39936 [04:31<00:00, 134.55it/s, train-loss: 0.5275, train-acc: 0.7186, val-loss: 0.4179, val-acc: 0.8088]
Epoch 2: 100%|██████████| 39936/39936 [04:30<00:00, 133.83it/s, train-loss: 0.3317, train-acc: 0.8580, val-loss: 0.2841, val-acc: 0.8771]
Epoch 3: 100%|██████████| 39936/39936 [04:30<00:00, 134.40it/s, train-loss: 0.2864, train-acc: 0.8810, val-loss: 0.3210, val-acc: 0.8664]
Epoch 4: 100%|██████████| 39936/39936 [04:30<00:00, 134.41it/s, train-loss: 0.2527, train-acc: 0.8983, val-loss: 0.2970, val-acc: 0.8738]
Epoch 5: 100%|██████████| 39936/39936 [04:30<00:00, 134.33it/s, train-loss: 0.2284, train-acc: 0.9093, val-loss: 0.2659, val-acc: 0.8908]
Epoch 6: 100%|██████████| 39936/39936 [04:30<00:00, 134.02it/s, train-loss: 0.2074, train-acc: 0.9193, val-loss: 0.3172, val-acc: 0.8695]
Epoch 7: 100%|██████████| 39936/39936 [04:30<00:00, 134.21it/s, train-loss: 0.2002, train-acc: 0.9225, val-loss: 0.2815, val-acc: 0.8848]
Epoch 8: 100%|██████████| 39936/39

CPU times: user 17min 1s, sys: 27min 42s, total: 44min 44s
Wall time: 45min 8s


### Avg-BiLSTM

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Avg-BiLSTM-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 39936/39936 [01:09<00:00, 565.49it/s, train-loss: 0.4055, train-acc: 0.8126, val-loss: 0.4188, val-acc: 0.8768]
Epoch 2: 100%|██████████| 39936/39936 [01:09<00:00, 566.63it/s, train-loss: 0.2764, train-acc: 0.8892, val-loss: 0.3761, val-acc: 0.8941]
Epoch 3: 100%|██████████| 39936/39936 [01:09<00:00, 553.25it/s, train-loss: 0.2331, train-acc: 0.9104, val-loss: 0.3555, val-acc: 0.9064]
Epoch 4: 100%|██████████| 39936/39936 [01:09<00:00, 563.56it/s, train-loss: 0.2063, train-acc: 0.9202, val-loss: 0.3725, val-acc: 0.9072]
Epoch 5: 100%|██████████| 39936/39936 [01:09<00:00, 553.81it/s, train-loss: 0.1821, train-acc: 0.9312, val-loss: 0.3359, val-acc: 0.9123]
Epoch 6: 100%|██████████| 39936/39936 [01:10<00:00, 562.08it/s, train-loss: 0.1587, train-acc: 0.9427, val-loss: 0.3314, val-acc: 0.9176]
Epoch 7: 100%|██████████| 39936/39936 [01:11<00:00, 547.17it/s, train-loss: 0.1395, train-acc: 0.9487, val-loss: 0.3000, val-acc: 0.9187]
Epoch 8: 100%|██████████| 39936/39

CPU times: user 8min, sys: 9min 13s, total: 17min 14s
Wall time: 17min 42s


In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Avg-BiLSTM-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 39936/39936 [01:11<00:00, 546.28it/s, train-loss: 0.0489, train-acc: 0.9827, val-loss: 0.2587, val-acc: 0.9137]
Epoch 2: 100%|██████████| 39936/39936 [01:10<00:00, 543.04it/s, train-loss: 0.0490, train-acc: 0.9815, val-loss: 0.2715, val-acc: 0.9059]
Epoch 3: 100%|██████████| 39936/39936 [01:10<00:00, 549.50it/s, train-loss: 0.0458, train-acc: 0.9832, val-loss: 0.2587, val-acc: 0.9086]
Epoch 4: 100%|██████████| 39936/39936 [01:10<00:00, 510.01it/s, train-loss: 0.0431, train-acc: 0.9846, val-loss: 0.2690, val-acc: 0.9049]
Epoch 5: 100%|██████████| 39936/39936 [01:10<00:00, 563.57it/s, train-loss: 0.0412, train-acc: 0.9847, val-loss: 0.2541, val-acc: 0.9107]
Epoch 6: 100%|██████████| 39936/39936 [01:10<00:00, 544.60it/s, train-loss: 0.0376, train-acc: 0.9857, val-loss: 0.2556, val-acc: 0.9115]
Epoch 7: 100%|██████████| 39936/39936 [01:10<00:00, 550.56it/s, train-loss: 0.0321, train-acc: 0.9889, val-loss: 0.2586, val-acc: 0.9100]
Epoch 8: 100%|██████████| 39936/39

CPU times: user 8min 3s, sys: 9min 14s, total: 17min 18s
Wall time: 17min 46s


### Self-Attention + Inter-Attention + BiLSTM

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Self-Inter-Attention-BiLSTM-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 39936/39936 [02:16<00:00, 310.31it/s, train-loss: 0.3596, train-acc: 0.8362, val-loss: 0.2623, val-acc: 0.8979]
Epoch 2: 100%|██████████| 39936/39936 [02:16<00:00, 311.93it/s, train-loss: 0.2459, train-acc: 0.9015, val-loss: 0.2274, val-acc: 0.9119]
Epoch 3: 100%|██████████| 39936/39936 [02:16<00:00, 311.63it/s, train-loss: 0.2034, train-acc: 0.9216, val-loss: 0.2192, val-acc: 0.9186]
Epoch 4: 100%|██████████| 39936/39936 [02:17<00:00, 309.94it/s, train-loss: 0.1736, train-acc: 0.9356, val-loss: 0.2247, val-acc: 0.9168]
Epoch 5: 100%|██████████| 39936/39936 [02:17<00:00, 309.60it/s, train-loss: 0.1455, train-acc: 0.9461, val-loss: 0.2208, val-acc: 0.9213]
Epoch 6: 100%|██████████| 39936/39936 [02:17<00:00, 312.04it/s, train-loss: 0.1243, train-acc: 0.9551, val-loss: 0.2336, val-acc: 0.9170]
Epoch 7: 100%|██████████| 39936/39936 [02:17<00:00, 313.05it/s, train-loss: 0.1099, train-acc: 0.9610, val-loss: 0.2373, val-acc: 0.9207]
Epoch 8: 100%|██████████| 39936/39

CPU times: user 8min 39s, sys: 14min, total: 22min 39s
Wall time: 22min 55s


In [0]:
history['best_acc']

0.924609375

### BiLSTM + Inter-Attention(query=none, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-no-query-BiLSTM-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 39936/39936 [01:22<00:00, 519.36it/s, train-loss: 0.3923, train-acc: 0.8160, val-loss: 0.2682, val-acc: 0.8879]
Epoch 2: 100%|██████████| 39936/39936 [01:22<00:00, 515.83it/s, train-loss: 0.2581, train-acc: 0.8958, val-loss: 0.2283, val-acc: 0.9051]
Epoch 3: 100%|██████████| 39936/39936 [01:22<00:00, 487.04it/s, train-loss: 0.2184, train-acc: 0.9157, val-loss: 0.2164, val-acc: 0.9113]
Epoch 4: 100%|██████████| 39936/39936 [01:22<00:00, 522.22it/s, train-loss: 0.1905, train-acc: 0.9266, val-loss: 0.2060, val-acc: 0.9170]
Epoch 5: 100%|██████████| 39936/39936 [01:22<00:00, 483.74it/s, train-loss: 0.1687, train-acc: 0.9368, val-loss: 0.2065, val-acc: 0.9180]
Epoch 6: 100%|██████████| 39936/39936 [01:23<00:00, 517.73it/s, train-loss: 0.1520, train-acc: 0.9428, val-loss: 0.2114, val-acc: 0.9232]
Epoch 7: 100%|██████████| 39936/39936 [01:22<00:00, 483.19it/s, train-loss: 0.1332, train-acc: 0.9509, val-loss: 0.2252, val-acc: 0.9240]
Epoch 8: 100%|██████████| 39936/39

CPU times: user 9min 21s, sys: 11min, total: 20min 22s
Wall time: 20min 49s


In [0]:
history['best_acc']

0.9240234375

### BiLSTM + Inter-Attention(query=independent parameter, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-independent-query-BiLSTM-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 39936/39936 [01:49<00:00, 401.88it/s, train-loss: 0.3852, train-acc: 0.8238, val-loss: 0.2697, val-acc: 0.8855]
Epoch 2: 100%|██████████| 39936/39936 [01:50<00:00, 385.10it/s, train-loss: 0.2608, train-acc: 0.8942, val-loss: 0.2375, val-acc: 0.9033]
Epoch 3: 100%|██████████| 39936/39936 [01:50<00:00, 393.64it/s, train-loss: 0.2176, train-acc: 0.9154, val-loss: 0.2166, val-acc: 0.9125]
Epoch 4: 100%|██████████| 39936/39936 [01:50<00:00, 393.31it/s, train-loss: 0.1908, train-acc: 0.9274, val-loss: 0.2154, val-acc: 0.9164]
Epoch 5: 100%|██████████| 39936/39936 [01:49<00:00, 391.90it/s, train-loss: 0.1703, train-acc: 0.9354, val-loss: 0.2108, val-acc: 0.9182]
Epoch 6: 100%|██████████| 39936/39936 [01:50<00:00, 390.37it/s, train-loss: 0.1530, train-acc: 0.9434, val-loss: 0.2145, val-acc: 0.9211]
Epoch 7: 100%|██████████| 39936/39936 [01:50<00:00, 394.30it/s, train-loss: 0.1302, train-acc: 0.9522, val-loss: 0.2320, val-acc: 0.9199]
Epoch 8: 100%|██████████| 39936/39

CPU times: user 11min 45s, sys: 15min 31s, total: 27min 17s
Wall time: 27min 39s


In [0]:
history['best_acc']

0.92265625

### BiLSTM + Inter-Attention(query=h_n of BiLSTM, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-h-query-BiLSTM-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 39936/39936 [01:51<00:00, 386.41it/s, train-loss: 0.3814, train-acc: 0.8254, val-loss: 0.2735, val-acc: 0.8852]
Epoch 2: 100%|██████████| 39936/39936 [01:51<00:00, 384.98it/s, train-loss: 0.2542, train-acc: 0.8976, val-loss: 0.2314, val-acc: 0.9041]
Epoch 3: 100%|██████████| 39936/39936 [01:51<00:00, 392.15it/s, train-loss: 0.2116, train-acc: 0.9167, val-loss: 0.2163, val-acc: 0.9166]
Epoch 4: 100%|██████████| 39936/39936 [01:51<00:00, 386.67it/s, train-loss: 0.1866, train-acc: 0.9282, val-loss: 0.2206, val-acc: 0.9158]
Epoch 5: 100%|██████████| 39936/39936 [01:51<00:00, 383.49it/s, train-loss: 0.1640, train-acc: 0.9388, val-loss: 0.2139, val-acc: 0.9234]
Epoch 6: 100%|██████████| 39936/39936 [01:51<00:00, 390.08it/s, train-loss: 0.1491, train-acc: 0.9445, val-loss: 0.2172, val-acc: 0.9213]
Epoch 7: 100%|██████████| 39936/39936 [01:51<00:00, 387.34it/s, train-loss: 0.1276, train-acc: 0.9529, val-loss: 0.2262, val-acc: 0.9223]
Epoch 8: 100%|██████████| 39936/39

CPU times: user 11min 45s, sys: 15min 45s, total: 27min 30s
Wall time: 28min 3s


In [0]:
history['best_acc']

0.9263671875

### BiLSTM + Inter-Attention(query=c_n of BiLSTM, key=value=BiLSTM output)

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Attention-c-query-BiLSTM-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 39936/39936 [01:51<00:00, 387.36it/s, train-loss: 0.3833, train-acc: 0.8236, val-loss: 0.2790, val-acc: 0.8834]
Epoch 2: 100%|██████████| 39936/39936 [01:51<00:00, 388.76it/s, train-loss: 0.2543, train-acc: 0.8966, val-loss: 0.2293, val-acc: 0.9090]
Epoch 3: 100%|██████████| 39936/39936 [01:51<00:00, 389.86it/s, train-loss: 0.2121, train-acc: 0.9164, val-loss: 0.2198, val-acc: 0.9121]
Epoch 4: 100%|██████████| 39936/39936 [01:51<00:00, 390.47it/s, train-loss: 0.1873, train-acc: 0.9281, val-loss: 0.2199, val-acc: 0.9168]
Epoch 5: 100%|██████████| 39936/39936 [01:51<00:00, 385.78it/s, train-loss: 0.1656, train-acc: 0.9379, val-loss: 0.2153, val-acc: 0.9213]
Epoch 6: 100%|██████████| 39936/39936 [01:51<00:00, 386.91it/s, train-loss: 0.1509, train-acc: 0.9439, val-loss: 0.2141, val-acc: 0.9252]
Epoch 7: 100%|██████████| 39936/39936 [01:51<00:00, 382.00it/s, train-loss: 0.1275, train-acc: 0.9529, val-loss: 0.2204, val-acc: 0.9248]
Epoch 8: 100%|██████████| 39936/39

CPU times: user 11min 44s, sys: 15min 39s, total: 27min 24s
Wall time: 27min 56s


In [0]:
history['best_acc']

0.9251953125
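
To compare runs without rereading every progress bar, the epoch lines can be parsed with a small script. The sketch below assumes the log format shown in this notebook (`Epoch N: ... train-loss: x, train-acc: x, val-loss: x, val-acc: x]`) and skips truncated lines such as `Epoch 8: 100%`. The function names are hypothetical.

```python
import re

# "name: value" metric pairs as they appear in the tqdm postfix.
METRIC_RE = re.compile(r"(train-loss|train-acc|val-loss|val-acc): ([0-9.]+)")
EPOCH_RE = re.compile(r"^Epoch (\d+):")


def parse_epoch_line(line):
    """Return (epoch, metrics) for a complete epoch line, else None."""
    epoch_match = EPOCH_RE.match(line)
    metrics = {name: float(value) for name, value in METRIC_RE.findall(line)}
    if epoch_match is None or len(metrics) < 4:
        return None  # truncated or unrelated line
    return int(epoch_match.group(1)), metrics


def best_epoch(lines, key="val-acc"):
    """Return the (epoch, metrics) pair with the highest value for `key`."""
    parsed = [p for p in map(parse_epoch_line, lines) if p is not None]
    return max(parsed, key=lambda item: item[1][key])


# Last three lines of the c_n-query IMDB log above, copied verbatim.
SAMPLE_LOG = [
    "Epoch 6: 100%|██████████| 39936/39936 [01:51<00:00, 386.91it/s, train-loss: 0.1509, train-acc: 0.9439, val-loss: 0.2141, val-acc: 0.9252]",
    "Epoch 7: 100%|██████████| 39936/39936 [01:51<00:00, 382.00it/s, train-loss: 0.1275, train-acc: 0.9529, val-loss: 0.2204, val-acc: 0.9248]",
    "Epoch 8: 100%|██████████| 39936/39",
]
epoch, metrics = best_epoch(SAMPLE_LOG)
```

For the sample lines this picks epoch 6 with a `val-acc` of 0.9252.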