# Sentiment Analysis based on Attention Mechanism

# Hierarchical Attention Networks

## Paper Reference

> [Hierarchical Attention Networks for Document Classification](https://www.aclweb.org/anthology/N16-1174)

> [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf)
>
> [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/pdf/1508.04025.pdf)


## Dataset
- [**IMDB Large Movie Review Dataset**](http://ai.stanford.edu/~amaas/data/sentiment/)
    - **Binary** sentiment classification
    - Citation: [Andrew L. Maas et al., 2011](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)
    - **50,000** movie reviews for training and testing
    - Average review length: **231** words
    ---
- [**Yelp reviews-full**](http://xzh.me/docs/charconvnet.pdf)
    - **Multiclass** sentiment classification (5 stars)
    - Citation: [Xiang Zhang et al., 2015](https://arxiv.org/abs/1509.01626)
    - **650,000** training samples and **50,000** testing samples (equal numbers of samples for each star rating)
    - Average review length: **140** words
    ---
- [**Yelp reviews-polarity**](http://xzh.me/docs/charconvnet.pdf)
    - **Binary** sentiment classification
    - Citation: [Xiang Zhang et al., 2015](https://arxiv.org/abs/1509.01626)
    - **560,000** training samples and **38,000** testing samples (equal numbers of positive and negative samples)
    - Average review length: **140** words
    ---
- [**Douban Movie Reviews**](https://drive.google.com/open?id=1DsmQfB1Ff_BUoxOv4kfUMg7Y8M7tHB9F) 
    - A **custom Chinese** movie review dataset that I scraped from **16,000** movies (each with more than 100 reviews) on [Douban](https://movie.douban.com/)
    - **Binary** sentiment classification
    - **700,000** movie reviews: **600,000** samples for training and **100,000** samples for testing (equal numbers of positive and negative samples)
    - Average review length: **50** characters, **27** words

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pack_sequence, pad_packed_sequence, pad_sequence
from torch.utils.data import DataLoader, Dataset, SequentialSampler
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm
from functools import partial
import time
import random
import os
import copy
import warnings

warnings.filterwarnings('ignore')
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

%matplotlib inline
%load_ext autoreload
%autoreload 2
torch.__version__

'1.0.1.post2'

In [0]:
# Set random seeds so that results are reproducible
def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    
def worker_init_fn(worker_id):
    # torch.initial_seed() is a 64-bit value inside DataLoader workers,
    # but np.random.seed only accepts seeds below 2**32
    setup_seed((torch.initial_seed() + worker_id) % 2**32)
    
GLOBAL_SEED = 2019
setup_seed(GLOBAL_SEED)
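
NumPy's legacy `np.random.seed` only accepts seeds in `[0, 2**32)`, while `torch.initial_seed()` can return a much larger integer inside DataLoader worker processes. A minimal sketch of folding a large seed into the valid range; `fold_seed` is a hypothetical helper for illustration, not part of this notebook.

```python
def fold_seed(seed: int, worker_id: int = 0) -> int:
    """Map an arbitrary non-negative seed into NumPy's [0, 2**32) range."""
    return (seed + worker_id) % 2**32


# A made-up 64-bit style seed, far larger than np.random.seed accepts.
big_seed = 2**40 + 2019

print(fold_seed(big_seed))      # worker 0
print(fold_seed(big_seed, 3))   # worker 3 gets a different seed
```

Each worker still gets a distinct seed, as long as the worker ids differ by less than `2**32`.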

In [0]:
base_dir = './'

In [0]:
# paths when running in Google Colab
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'Colab Notebooks2/'

In [1]:
# one-time setup when running in Google Colab
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
!pip install -U pandas
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import pandas as pd

Mounted at /content/gdrive
Requirement already up-to-date: pandas in /usr/local/lib/python3.6/dist-packages (0.24.2)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Load train data (Choose one of the four datasets to train)

In [0]:
#data_pth = base_dir + 'dataset/IMDB/'
#data_pth = base_dir + 'dataset/yelp_review_polarity_csv/'
data_pth = base_dir + 'dataset/yelp_review_full_csv/'
#data_pth = base_dir + 'dataset/Douban/'

In [0]:
X_train = pd.read_hdf(data_pth+'X_train.h5', key='s')
y_train = pd.read_hdf(data_pth+'y_train.h5', key='s')

X_val = pd.read_hdf(data_pth+'X_val.h5', key='s')
y_val = pd.read_hdf(data_pth+'y_val.h5', key='s')

word2num_series = pd.read_hdf(data_pth+'word2num_series.h5', key='s')

In [8]:
len(X_train)

600000

In [9]:
len(X_val)

50000

## Load pretrained word embedding matrix (GloVe)
- GloVe: https://github.com/stanfordnlp/GloVe
- Chinese Word Vectors: https://github.com/Embedding/Chinese-Word-Vectors

In [0]:
# English
pre_embedding = {}
with open('/content/gdrive/My Drive/Colab Notebooks/' + 'word2vector/glove.twitter.27B.200d.txt', encoding='utf8') as f:
    for line in f:  # stream the file instead of reading every line into memory
        tmp = line.strip().split()
        if tmp[0] in word2num_series:
            # np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
            pre_embedding[tmp[0]] = np.array(tmp[1:]).astype(np.float64)

# Chinese
# pre_embedding = {}
# with open('/content/gdrive/My Drive/Colab Notebooks/' + 'word2vector/sgns.weibo.bigram-char', encoding='utf8') as f:
#     for line in f:
#         tmp = line.strip().split()
#         if tmp[0] in word2num_series:
#             pre_embedding[tmp[0]] = np.array(tmp[1:]).astype(np.float64)
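
The loop above keeps only the vectors for words in the vocabulary. Here is a small self-contained sketch of the same parsing step, run on two made-up three-dimensional lines instead of the real 200-dimensional file. `load_vectors` and the sample data are hypothetical. The sketch uses a plain `set` for the vocabulary, whereas the notebook uses a pandas Series, where `in` tests the index labels.

```python
import io

import numpy as np


def load_vectors(lines, vocab):
    """Parse GloVe-style text lines ("word v1 v2 ...") for words in `vocab`."""
    vectors = {}
    for line in lines:
        parts = line.strip().split()
        word, values = parts[0], parts[1:]
        if word in vocab:
            vectors[word] = np.array([float(v) for v in values])
    return vectors


# Made-up stand-in for the embedding file.
fake_file = io.StringIO("movie 0.1 -0.2 0.3\nzebra 1.0 2.0 3.0\n")
vectors = load_vectors(fake_file, vocab={'movie', 'film'})

print(sorted(vectors))    # words found in both the file and the vocabulary
print(vectors['movie'])
```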

In [11]:
vocab_size = len(word2num_series)+10
dim = pre_embedding['movie'].shape[0]
print('dimension:', dim)
# stack the vectors once and reuse the statistics
pre_values = np.stack(list(pre_embedding.values()))
mean = pre_values.mean()
std = pre_values.std()
print('mean:', mean)
print('std:', std)
print('max:', pre_values.max())
print('min:', pre_values.min())

dimension: 200
mean: 0.004328889643746501
std: 0.43459718285079957
max: 3.0478
min: -6.7986


In [0]:
embedding_matrix = np.random.randn(vocab_size, dim)*std

In [0]:
miss_word = 0
for word, idx in word2num_series.items():
    try:
        embedding_matrix[idx] = pre_embedding[word]
    except KeyError:
        # no pretrained vector for this word: keep its random initialization
        miss_word += 1
        #print(word)

In [14]:
miss_word

6721

In [0]:
np.testing.assert_array_almost_equal(embedding_matrix[word2num_series['movie']], pre_embedding['movie'])
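
The cells above initialize every row randomly, scaled by the standard deviation of the pretrained vectors, and then overwrite the rows that have a pretrained vector. A toy sketch of that idea which also reports vocabulary coverage; the vocabulary, vectors and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy vocabulary (word -> row index) and pretrained vectors.
word2idx = {'movie': 1, 'great': 2, 'asdfgh': 3}
pretrained = {'movie': np.array([0.5, -0.5]), 'great': np.array([1.0, 0.0])}

dim = 2
std = np.std(np.stack(list(pretrained.values())))

# Rows without a pretrained vector keep this random initialization.
matrix = rng.randn(len(word2idx) + 1, dim) * std

missing = [word for word in word2idx if word not in pretrained]
for word, idx in word2idx.items():
    if word in pretrained:
        matrix[idx] = pretrained[word]

coverage = 1 - len(missing) / len(word2idx)
print(missing, round(coverage, 3))
```

Collecting the missing words instead of only counting them makes it easy to check whether they are typos, rare tokens or words that a different tokenization would recover.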

## Build pytorch dataset and dataloader

In [0]:
BATCH_SIZE = 32

In [0]:
class CustomDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
    
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

### Sort the training data by sentence length to reduce zero padding in each batch, then shuffle the batches

In [0]:
from utils import pad_and_truncate_hierarchical, preprocess_for_batch_hierarchical


############ multiclass only #################
# shift the 1-5 star labels to 0-4, the class indices nn.CrossEntropyLoss expects
y_train = y_train-1
y_val = y_val-1
##############################################


In [0]:
MAX_SEN_LEN = 50
MAX_SEN_NUM = 50

# Douban
# MAX_SEN_LEN = 25
# MAX_SEN_NUM = 25

In [0]:
X_train_sorted, y_train_sorted = preprocess_for_batch_hierarchical(X_train, y_train, BATCH_SIZE)
X_val_sorted, y_val_sorted = preprocess_for_batch_hierarchical(X_val, y_val, BATCH_SIZE)

train_dataset = CustomDataset(X_train_sorted, y_train_sorted)
val_dataset = CustomDataset(X_val_sorted, y_val_sorted)

train_dataloader = DataLoader(train_dataset, 
                              batch_size=BATCH_SIZE, 
                              sampler=SequentialSampler(train_dataset), 
                              shuffle=False, 
                              collate_fn=partial(pad_and_truncate_hierarchical, MAX_SEN_LEN=MAX_SEN_LEN, MAX_SEN_NUM=MAX_SEN_NUM), 
                              worker_init_fn=worker_init_fn)
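
`preprocess_for_batch_hierarchical` lives in `utils` and is not shown here. The sketch below is only a guess at the idea named in the heading: sort by length, cut into batches, shuffle the order of the batches, and flatten again. `bucket_batches` and the toy data are hypothetical, and the real helper may measure length differently (for example by number of sentences).

```python
import random


def bucket_batches(docs, labels, batch_size, seed=0):
    """Sort by length, cut into batches, shuffle the batch order, then flatten."""
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    flat = [i for batch in batches for i in batch]
    return [docs[i] for i in flat], [labels[i] for i in flat]


# Six toy "documents" of lengths 5, 1, 4, 2, 6, 3.
docs = [[1] * n for n in (5, 1, 4, 2, 6, 3)]
labels = [0, 1, 0, 1, 0, 1]

sorted_docs, sorted_labels = bucket_batches(docs, labels, batch_size=2)
lengths = [len(d) for d in sorted_docs]
print([lengths[i:i + 2] for i in range(0, len(lengths), 2)])
```

Because the data is already arranged batch by batch, reading it with `SequentialSampler` and `shuffle=False` keeps samples of similar length together.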

In [0]:
valid_dataloader = DataLoader(val_dataset, 
                              batch_size=BATCH_SIZE, 
                              sampler=SequentialSampler(val_dataset), 
                              shuffle=False, 
                              collate_fn=partial(pad_and_truncate_hierarchical, MAX_SEN_LEN=MAX_SEN_LEN, MAX_SEN_NUM=MAX_SEN_NUM), 
                              worker_init_fn=worker_init_fn)
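
`pad_and_truncate_hierarchical` is also imported from `utils` and not shown. As a rough illustration of what a hierarchical collate function might do, the sketch below pads a batch of documents (lists of sentences, each a list of token ids) into one `[batch, sentences, words]` tensor, capped at the two limits. The function name, the padding id `0` and the `[batch, 1]` label shape are assumptions.

```python
import torch


def pad_and_truncate(batch, max_sen_len, max_sen_num, pad_id=0):
    """Collate (document, label) pairs into a [batch, sentences, words] tensor."""
    docs, labels = zip(*batch)
    # Size the batch by its longest document and sentence, capped at the limits.
    n_sen = min(max(len(d) for d in docs), max_sen_num)
    n_tok = min(max(len(s) for d in docs for s in d), max_sen_len)
    out = torch.full((len(docs), n_sen, n_tok), pad_id, dtype=torch.long)
    for i, doc in enumerate(docs):
        for j, sent in enumerate(doc[:n_sen]):
            sent = sent[:n_tok]
            out[i, j, :len(sent)] = torch.tensor(sent, dtype=torch.long)
    # Labels get shape [batch, 1], matching the squeeze(1) in the training loop.
    return out, torch.tensor(labels, dtype=torch.long).unsqueeze(1)


batch = [([[4, 5, 6, 7], [8]], 2), ([[9, 10]], 0)]
x, y = pad_and_truncate(batch, max_sen_len=3, max_sen_num=50)
print(x.shape, y.shape)
print(x[0].tolist())
```

Padding to the longest item in the batch rather than always to the maximum is what makes sorting by length worthwhile.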

## Train the model

In [0]:
from HAN import HAN

### Set model hyperparameters

In [0]:
parameters = {
    'hidden_size': 128,
    'embedding': embedding_matrix,
    'att_method': 'concat',
    'rnn_dropout': 0.1,
    'embedding_dropout': 0.5,
    'word_dropout': 0.1, 
    'sent_dropout': 0.1
}


In [24]:
# Polarity
#model = HAN(hidden_size=parameters['hidden_size'], embedding=parameters['embedding'], method=parameters['att_method'], rnn_dropout=parameters['rnn_dropout'], embedding_dropout=parameters['embedding_dropout'], word_dropout=parameters['word_dropout'], sent_dropout=parameters['sent_dropout'])

# Full
model = HAN(output_size=5,
            hidden_size=parameters['hidden_size'],
            embedding=parameters['embedding'],
            method=parameters['att_method'],
            rnn_dropout=parameters['rnn_dropout'],
            embedding_dropout=parameters['embedding_dropout'],
            word_dropout=parameters['word_dropout'],
            sent_dropout=parameters['sent_dropout'])


model.to(DEVICE)

HAN(
  (word_encoder): EncoderBiLSTM(
    (embedding): Embedding(40000, 200)
    (embedding_dropout): Dropout(p=0.5)
    (lstm): LSTM(200, 128, batch_first=True, bidirectional=True)
    (rnn_dropout): Dropout(p=0.1)
  )
  (word_attn): Attention(
    (linear): Linear(in_features=512, out_features=256, bias=True)
    (v): Linear(in_features=256, out_features=1, bias=True)
  )
  (word_dropout): Dropout(p=0.1)
  (sent_lstm): LSTM(256, 128, batch_first=True, bidirectional=True)
  (sent_attn): Attention(
    (linear): Linear(in_features=512, out_features=256, bias=True)
    (v): Linear(in_features=256, out_features=1, bias=True)
  )
  (sent_dropout): Dropout(p=0.1)
  (out): Linear(in_features=256, out_features=5, bias=True)
)
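
The `Attention` modules printed above take 512 input features, twice the 256-dimensional BiLSTM output. That is consistent with the "concat" score from Luong et al., `v^T tanh(W [query; output_t])`. `HAN.py` is not shown, so the sketch below is an assumption about the layout, not the actual implementation. In particular, using the last time step as the query is a guess, and padding masks are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConcatAttention(nn.Module):
    """Luong-style 'concat' scoring: score_t = v^T tanh(W [query; output_t])."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(2 * dim, dim)
        self.v = nn.Linear(dim, 1)

    def forward(self, outputs, query):
        # outputs: [batch, steps, dim], query: [batch, dim]
        steps = outputs.size(1)
        expanded = query.unsqueeze(1).expand(-1, steps, -1)
        energy = torch.tanh(self.linear(torch.cat([expanded, outputs], dim=2)))
        weights = F.softmax(self.v(energy).squeeze(2), dim=1)          # [batch, steps]
        context = torch.bmm(weights.unsqueeze(1), outputs).squeeze(1)  # [batch, dim]
        return context, weights


attn = ConcatAttention(dim=256)
outputs = torch.randn(4, 7, 256)   # stand-in for BiLSTM outputs
query = outputs[:, -1, :]          # assumption: last step acts as the query
context, weights = attn(outputs, query)
print(context.shape, weights.shape)
```

The original HAN paper scores each hidden state against a learned context vector instead of a query taken from the sequence, so this variant differs from the paper in that respect.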

In [0]:
# Polarity
#criterion = nn.BCEWithLogitsLoss()

# Full
criterion = nn.CrossEntropyLoss()

#optimizer = torch.optim.Adam([{'params':list(model.parameters())[0], 'lr':1e-4}, {'params': list(model.parameters())[1:]}])
optimizer = torch.optim.Adam(model.parameters())
#scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.8, last_epoch=-1)
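
The two loss functions expect differently shaped and typed labels, which is why the training code toggles between `labels.float()` and `labels.squeeze(1)`. A short sketch with made-up tensors, assuming the labels arrive with shape `[batch, 1]`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = 4

# Five-class case: logits [batch, 5], integer class labels 0..4 with shape [batch].
stars = torch.tensor([[1], [5], [3], [4]])   # raw 1-5 star ratings, shape [batch, 1]
targets = (stars - 1).squeeze(1)
logits = torch.randn(batch, 5)
ce = nn.CrossEntropyLoss()(logits, targets)
preds = logits.argmax(dim=1)

# Binary case: one logit per sample, float labels with the same shape as the logits.
bin_logits = torch.randn(batch, 1)
bin_labels = torch.tensor([[1], [0], [1], [0]]).float()
bce = nn.BCEWithLogitsLoss()(bin_logits, bin_labels)
bin_preds = (bin_logits > 0).float()   # logit > 0 is the same as sigmoid > 0.5

print(ce.item(), bce.item())
```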

In [0]:
def validate(model, criterion, history):
    # Evaluate on the validation set and keep a copy of the best weights so far
    model.eval()
    costs = []
    accs = []
    with torch.no_grad():
        for idx, batch in enumerate(valid_dataloader):
            input_batch, labels = batch
            #labels = labels.float() # polarity
            labels = labels.squeeze(1) # full
            output = model(input_batch)
            loss = criterion(output, labels)
            costs.append(loss.item())
            # polarity
            #accs.append((output>0).eq(labels>0).float().mean().item())
            
            # full
            _, preds = torch.max(output, 1)
            accs.append((preds == labels).float().mean().item())
            torch.cuda.empty_cache()
    mean_accs = np.mean(accs)
    mean_costs = np.mean(costs)
    if mean_accs > history['best_acc']:  
        history['best_acc'] = mean_accs
        history['best_model'] = copy.deepcopy(model.state_dict())
        
    history['validate_accuracy'].append(mean_accs)
    history['validate_loss'].append(mean_costs)
    return mean_costs, mean_accs


def train(model, criterion, optimizer, epoch, history, validate_points):
    model.train()
    costs = []
    accs = []
    with tqdm(total=len(train_dataset), desc='Epoch {}'.format(epoch)) as pbar:
        for idx, batch in enumerate(train_dataloader):
            input_batch, labels = batch
            #labels = labels.float() # polarity
            labels = labels.squeeze(1) # full
            output = model(input_batch)
            loss = criterion(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                costs.append(loss.item())
                # polarity
                #accs.append((output>0).eq(labels>0).float().mean().item())

                # full
                _, preds = torch.max(output, 1)
                accs.append((preds == labels).float().mean().item())
                pbar.update(input_batch.size(0))
                pbar.set_postfix_str('train-loss: {:.4f}, train-acc: {:.4f}'.format(np.mean(costs), np.mean(accs)))
            if idx in validate_points:
                val_loss, val_acc = validate(model, criterion, history)
                pbar.set_postfix_str('train-loss: {:.4f}, train-acc: {:.4f}, val-loss: {:.4f}, val-acc: {:.4f}'.format(np.mean(costs), np.mean(accs), val_loss, val_acc))
                model.train()
            torch.cuda.empty_cache()
    
    history['train_loss'].append(costs)
    history['train_accuracy'].append(accs)
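
Both functions report `np.mean(accs)`, the unweighted mean of per-batch accuracies. When the last batch is smaller than the others, this differs slightly from the accuracy over all samples. A small sketch with made-up counts; the helper names are hypothetical.

```python
def batch_mean_accuracy(correct, sizes):
    """Unweighted mean of per-batch accuracies, as np.mean(accs) computes."""
    return sum(c / n for c, n in zip(correct, sizes)) / len(sizes)


def sample_accuracy(correct, sizes):
    """Accuracy over all samples, weighting each batch by its size."""
    return sum(correct) / sum(sizes)


# Hypothetical counts: two full batches of 32 and a final batch of 4.
correct = [24, 24, 4]
sizes = [32, 32, 4]

print(batch_mean_accuracy(correct, sizes))
print(sample_accuracy(correct, sizes))
```

With 50,000 validation samples and a batch size of 32 there are 1,563 batches and only the last one is short (16 samples), so the difference in this notebook should be small.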

In [0]:
history = { 'best_acc': 0,
            'best_model': None,
            'optimizer': optimizer.state_dict(),
            'train_accuracy': [],
            'train_loss': [],
            'validate_accuracy': [],
            'validate_loss': [],
            'batch_size': BATCH_SIZE,
            'num_of_batch': len(train_dataloader),
            'train_size': len(train_dataset),
            'validate_size': len(val_dataset),
            'validate_points': None,
            'epochs': 0,
            'embedding_size': embedding_matrix.shape,
            'parameters': parameters
          }

In [0]:
epochs = 8
validate_points = list(np.linspace(0, len(train_dataloader)-1, 4).astype(int))[1:]
history['epochs'] = epochs
history['validate_points'] = validate_points
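
`validate_points` places three validation runs in every epoch: after roughly one third, two thirds and all of the batches. A quick check of the indices for 600,000 training samples at batch size 32 (the numbers match the Yelp Full run; `.tolist()` is added here only to print plain Python integers):

```python
import numpy as np

num_batches = 600000 // 32   # 18750 batches
points = np.linspace(0, num_batches - 1, 4).astype(int)[1:].tolist()
print(points)  # [6249, 12499, 18749]
```

The first point (index 0) is dropped by the `[1:]` slice, so no validation runs before the model has trained on any data.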

## Douban

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Hierarchical-Attention-Networks-on-Douban', timestamp))

Epoch 1: 100%|██████████| 549984/549984 [14:06<00:00, 649.62it/s, train-loss: 0.4011, train-acc: 0.8158, val-loss: 0.3545, val-acc: 0.8423]
Epoch 2: 100%|██████████| 549984/549984 [14:08<00:00, 647.91it/s, train-loss: 0.3443, train-acc: 0.8483, val-loss: 0.3493, val-acc: 0.8462]
Epoch 3:  98%|█████████▊| 538080/549984 [13:42<00:18, 645.43it/s, train-loss: 0.3219, train-acc: 0.8601]

Buffered data was truncated after reaching the output size limit.

In [0]:
history['best_acc']

0.847275641025641

## Yelp Full

In [29]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Hierarchical-Attention-Networks-on-Yelp-Full', timestamp))

Epoch 1: 100%|██████████| 600000/600000 [35:30<00:00, 281.62it/s, train-loss: 0.8511, train-acc: 0.6262, val-loss: 0.7621, val-acc: 0.6679]
Epoch 2: 100%|██████████| 600000/600000 [35:23<00:00, 293.73it/s, train-loss: 0.7531, train-acc: 0.6712, val-loss: 0.7421, val-acc: 0.6768]
Epoch 3: 100%|██████████| 600000/600000 [35:22<00:00, 283.07it/s, train-loss: 0.7228, train-acc: 0.6855, val-loss: 0.7410, val-acc: 0.6785]
Epoch 4: 100%|██████████| 600000/600000 [35:24<00:00, 286.62it/s, train-loss: 0.7032, train-acc: 0.6949, val-loss: 0.7403, val-acc: 0.6784]
Epoch 5: 100%|██████████| 600000/600000 [35:10<00:00, 289.29it/s, train-loss: 0.6893, train-acc: 0.7009, val-loss: 0.7468, val-acc: 0.6780]
Epoch 6: 100%|██████████| 600000/600000 [35:04<00:00, 285.04it/s, train-loss: 0.6790, train-acc: 0.7079, val-loss: 0.7488, val-acc: 0.6776]
Epoch 7: 100%|██████████| 600000/600000 [34:59<00:00, 285.77it/s, train-loss: 0.6716, train-acc: 0.7108, val-loss: 0.7557, val-acc: 0.6775]
Epoch 8: 100%|██████

CPU times: user 2h 26min 54s, sys: 2h 7min 43s, total: 4h 34min 37s
Wall time: 4h 41min 37s


In [0]:
history['best_acc']

0.678437099871959

## Yelp Polarity

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Hierarchical-Attention-Networks-on-Yelp-Polarity', timestamp))

Epoch 1: 100%|██████████| 510000/510000 [38:41<00:00, 219.66it/s, train-loss: 0.1326, train-acc: 0.9484, val-loss: 0.1006, val-acc: 0.9618]
Epoch 2:  48%|████▊     | 247168/510000 [18:19<18:55, 231.55it/s, train-loss: 0.0974, train-acc: 0.9633]

Buffered data was truncated after reaching the output size limit.

In [0]:
history['best_acc']

## IMDB

In [0]:
%%time
timestamp = time.strftime('%Y-%m-%d-%H-%M',time.localtime(time.time()))
for epoch in range(1, epochs+1):
    train(model, criterion, optimizer, epoch, history, validate_points)
    torch.save(history, '{}model save/{}-{}.pth'.format(base_dir, 'Hierarchical-Attention-Networks-on-IMDB', timestamp))

Epoch 1: 100%|██████████| 40000/40000 [03:30<00:00, 190.08it/s, train-loss: 0.2980, train-acc: 0.8706, val-loss: 0.2101, val-acc: 0.9157]
Epoch 2: 100%|██████████| 40000/40000 [03:30<00:00, 178.31it/s, train-loss: 0.1928, train-acc: 0.9255, val-loss: 0.1956, val-acc: 0.9253]
Epoch 3: 100%|██████████| 40000/40000 [03:30<00:00, 177.65it/s, train-loss: 0.1442, train-acc: 0.9476, val-loss: 0.2017, val-acc: 0.9247]
Epoch 4: 100%|██████████| 40000/40000 [03:30<00:00, 190.26it/s, train-loss: 0.1105, train-acc: 0.9589, val-loss: 0.2111, val-acc: 0.9251]
Epoch 5: 100%|██████████| 40000/40000 [03:30<00:00, 177.08it/s, train-loss: 0.0848, train-acc: 0.9703, val-loss: 0.2394, val-acc: 0.9281]
Epoch 6: 100%|██████████| 40000/40000 [03:30<00:00, 190.28it/s, train-loss: 0.0677, train-acc: 0.9758, val-loss: 0.2591, val-acc: 0.9249]
Epoch 7: 100%|██████████| 40000/40000 [03:30<00:00, 178.17it/s, train-loss: 0.0540, train-acc: 0.9807, val-loss: 0.2853, val-acc: 0.9261]
Epoch 8: 100%|██████████| 40000/40

CPU times: user 15min 29s, sys: 11min 32s, total: 27min 1s
Wall time: 28min 8s


In [0]:
history['best_acc']

0.9284855769230769
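
Each run saves the whole `history` dictionary, including the best weights under `best_model`. A minimal sketch of saving and restoring such a checkpoint, with a tiny `nn.Linear` standing in for the HAN model and a temporary file standing in for the real path:

```python
import copy
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # stand-in for the HAN model
history = {'best_acc': 0.9, 'best_model': copy.deepcopy(model.state_dict())}

path = os.path.join(tempfile.mkdtemp(), 'history.pth')
torch.save(history, path)

# Later, possibly in a new session: rebuild the model, then load the best weights.
restored = nn.Linear(4, 2)
loaded = torch.load(path, map_location='cpu')
restored.load_state_dict(loaded['best_model'])
restored.eval()
print(loaded['best_acc'])
```

The real `history` also holds NumPy objects such as the embedding matrix, and recent PyTorch releases restrict what `torch.load` unpickles by default, so loading it may need additional options.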