**NOTE**: to run the notebooks, move them to the main dir. Simply:

```bash
cp notebook_name.ipynb ../
```

Let's have a look at the attention weights and see if they make any sense!

In [1]:
import numpy as np
import pickle
import os
import torch
import torch.nn.functional as F

from tqdm import tqdm
from pathlib import Path
from sklearn.metrics import accuracy_score, f1_score, precision_score
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
from IPython.display import display, HTML

from utils.plot_attention import plot_word_attention, plot_sent_attention
from models.pytorch_models import HierAttnNet

In [2]:
n_cpus = os.cpu_count()
use_cuda = torch.cuda.is_available()

Let's start by loading all the components we will need. If you reached this notebook, first, bravo! (🙏🏼 I know it is dense), and second, the dir structure should be familiar to you.

In [3]:
data_dir = Path("data")
log_dir  = Path("results")
test_dir = data_dir / "test"
model_weights = log_dir / "weights"
ftest = "han_test.npz"
tokf  = "HANPreprocessor.p"
model_name = "han_lr_0.001_wdc_0.01_bsz_128_whd_64_shd_64_emb_300_drp_0.2_sch_no_cycl_no_lrp_no_pre_no"

Let's load the test dataset

In [4]:
test_mtx = np.load(test_dir / ftest)
X_test = test_mtx["X_test"]
y_test = test_mtx["y_test"]
test_set = TensorDataset(
    torch.from_numpy(test_mtx["X_test"]), torch.from_numpy(test_mtx["y_test"]).long(),
)
test_loader = DataLoader(
    dataset=test_set, batch_size=512, num_workers=n_cpus, shuffle=False
)

And the tokenizer

In [5]:
# use a context manager so the file handle is closed after loading
with open(data_dir / tokf, "rb") as f:
    tok = pickle.load(f)

Let's build the model with the architecture encoded in the name of the weights file:

`han_lr_0.001_wdc_0.01_bsz_128_whd_64_shd_64_emb_300_drp_0.2_sch_no_cycl_no_lrp_no_pre_no`

so: 

In [6]:
model = HierAttnNet(
    vocab_size=len(tok.vocab.stoi),
    maxlen_sent=tok.maxlen_sent,
    maxlen_doc=tok.maxlen_doc,
    word_hidden_dim=64,
    sent_hidden_dim=64,
    padding_idx=1,
    embed_dim=300,
    weight_drop=0.0,
    embed_drop=0.0,
    locked_drop=0.0,
    last_drop=0.0,
    embedding_matrix=None,
    num_class=4
    )

In [7]:
if use_cuda:
    model = model.cuda()

Let's define a little function to get the predictions and the attention weights

In [8]:
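def get_predictions_and_attn_weights(model, eval_loader):
    # returns per-batch lists: raw model outputs, word-level attention
    # (model.sent_a) and sentence-level attention (model.doc_a)
    model.eval()
    preds, sent_a, doc_a = [], [], []
    with torch.no_grad():
        # the targets are not needed here, only the inputs
        for data, _ in tqdm(eval_loader):
            X = data.cuda() if use_cuda else data
            y_pred = model(X)
            preds.append(y_pred)
            sent_a.append(model.sent_a)
            doc_a.append(model.doc_a)
    return preds, sent_a, doc_a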

Let's pause here for a sec. If you remember, when I discussed the dropout mechanisms I said that the original `WeightDropout` implementation had a couple of drawbacks. First, it is not memory efficient, because it duplicates the weights to which dropout is applied. Second, it renames the modules. Truth be told, for a model of this size this is not a major drawback, and we can fix it by simply using the helper function below.

For more efficient implementations, please have a look at the references I included in Notebook 02, namely the fastai or MXNet's GluonNLP implementations.

For now, we will move on with this:

In [9]:
def adjust_weights_dict(weights_dict):
    new_dict = {}
    for k, v in weights_dict.items():
        if '_raw' not in k:
            new_k = k.replace('module.', '')
            new_dict[new_k] = v
    return new_dict
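
To make the key clean-up concrete, here is a minimal sketch on a toy dictionary. The key names below are made up for illustration (they are not read from the real checkpoint), and `strip_weight_drop_keys` is just a local copy of the logic in `adjust_weights_dict` so that the snippet runs on its own.

```python
# Toy stand-in for a checkpoint: the key names are hypothetical.
toy_weights = {
    "wordattnnet.rnn.module.weight_ih_l0": "w_ih",
    "wordattnnet.rnn.module.weight_hh_l0": "w_hh",
    "wordattnnet.rnn.weight_hh_l0_raw": "w_hh duplicate",
    "fc.weight": "w_fc",
}


def strip_weight_drop_keys(weights_dict):
    """Drop the duplicated '_raw' entries and remove the 'module.' prefix."""
    new_dict = {}
    for k, v in weights_dict.items():
        if "_raw" not in k:
            new_dict[k.replace("module.", "")] = v
    return new_dict


cleaned = strip_weight_drop_keys(toy_weights)
for name in cleaned:
    print(name)
```

The duplicated `_raw` entry disappears and the remaining keys lose the extra `module.` level, which is what lets `load_state_dict` match them against a freshly built model.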

In [10]:
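# map_location lets weights saved on a GPU load on a CPU-only machine
trained_weights_raw = torch.load(
    model_weights / (model_name + ".pt"),
    map_location="cuda" if use_cuda else "cpu",
)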
trained_weights = adjust_weights_dict(trained_weights_raw)
model.load_state_dict(trained_weights)

<All keys matched successfully>

In [11]:
preds_l, word_attn_l, sent_attn_l = get_predictions_and_attn_weights(model, test_loader)

100%|██████████| 55/55 [00:02<00:00, 26.55it/s]


In [12]:
preds = F.softmax(torch.cat(preds_l), 1).cpu().numpy()
word_attn = torch.cat(word_attn_l).cpu().numpy()
sent_attn = torch.cat(sent_attn_l).squeeze(2).cpu().numpy()

Now we need to turn the numeric input tokens back into text. This is easily done with the convenient `textify` method in fastai's `Vocab` class. Let's go step by step.

In [13]:
# (n_test_observations, maxlen_doc, maxlen_sent)
X_test.shape

(27767, 7, 20)

In [14]:
# reshape to avoid one loop
X_test_tmp = X_test.reshape(X_test.shape[0]*X_test.shape[1], X_test.shape[2])
X_test_tmp.shape

(194369, 20)

In [15]:
X_texts = [tok.vocab.textify(s) for s in X_test_tmp]
# some people will hate re-assignment, but I can live with it in here (...not always)
X_texts = np.array_split(X_texts, X_test.shape[0])
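
If the reshape-then-split trick looks suspicious, here is a small sketch on a toy array. The shapes are invented, and the string join is only a hypothetical stand-in for `tok.vocab.textify`; the point is that flattening to one sentence per row and splitting back into `n_docs` chunks keeps every sentence with its document.

```python
import numpy as np

# Toy data: 3 documents, 2 sentences per document, 4 tokens per sentence.
n_docs, maxlen_doc, maxlen_sent = 3, 2, 4
X_toy = np.arange(n_docs * maxlen_doc * maxlen_sent).reshape(
    n_docs, maxlen_doc, maxlen_sent
)

# Flatten so that each row is one sentence (row-major order is preserved).
X_flat = X_toy.reshape(n_docs * maxlen_doc, maxlen_sent)

# Hypothetical stand-in for tok.vocab.textify: join the token ids as text.
sentences = [" ".join(str(t) for t in row) for row in X_flat]

# Regroup: one array of sentences per document.
docs = np.array_split(sentences, n_docs)

for i, d in enumerate(docs):
    print(i, list(d))
```

Each chunk has `maxlen_doc` sentences, and the first sentence of document 1 is the third row of the flattened array, as expected.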

Let's check I have not done something stupid

In [16]:
# Review/Document length should be 7
len(X_texts[0])

7

In [17]:
X_texts[19]

array(['xxpad xxpad xxpad xxpad xxpad i have had very good luck with purchasing xxmaj skechers shoes for many years ...',
       'xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxmaj this time they hit the ball out of the park .',
       'xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxmaj they are so comfortable ; like walking on pillows .',
       'xxpad xxpad xxpad i typically buy wide - width shoes , and these fit perfect with my regular size .',
       'xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxmaj plenty of room in the toe - box .',
       'xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad i especially like the slip - on feature instead of ties .',
       "my third pair of xxmaj go xxmaj walks , and i 'm sure it wo n't be my last ."], dtype='<U169')

In [18]:
tmp = X_test[19]
tmp = [tok.vocab.textify(s) for s in tmp]
tmp

['xxpad xxpad xxpad xxpad xxpad i have had very good luck with purchasing xxmaj skechers shoes for many years ...',
 'xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxmaj this time they hit the ball out of the park .',
 'xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxmaj they are so comfortable ; like walking on pillows .',
 'xxpad xxpad xxpad i typically buy wide - width shoes , and these fit perfect with my regular size .',
 'xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxmaj plenty of room in the toe - box .',
 'xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad i especially like the slip - on feature instead of ties .',
 "my third pair of xxmaj go xxmaj walks , and i 'm sure it wo n't be my last ."]

OK, so it all makes sense. It is now time to do some plots. Let's code a quick helper to select indices according to some logic:

In [19]:
def get_indices(y_true, y_preds, n=1, agree=True, is_pos=True):
    pred_pos = np.where(y_preds[:, 3] > 0.8)[0]
    true_pos = np.where(y_true == 3)[0]
    pred_neg = np.where(y_preds[:, 0] > 0.8)[0] 
    true_neg = np.where(y_true == 1)[0]
    # if prediction and real agree
    if agree:
        # if real is positive
        if is_pos:
            idx = np.random.choice(np.intersect1d(pred_pos, true_pos), n)[0]
        else: 
            idx = np.random.choice(np.intersect1d(pred_neg, true_neg), n)[0]
    else:
        if is_pos:
            idx = np.random.choice(np.intersect1d(pred_neg, true_pos), n)[0]  
        else: 
            idx = np.random.choice(np.intersect1d(pred_pos, true_neg), n)[0]  
    return idx
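
As a quick illustration of the selection logic, here is a sketch with four toy observations. The probabilities and labels are invented; the 0.8 threshold and the use of class 3 as "positive" follow the helper above.

```python
import numpy as np

# Toy class probabilities (4 classes, rows sum to 1) and toy true labels.
toy_probs = np.array([
    [0.05, 0.05, 0.05, 0.85],  # confident class 3
    [0.90, 0.04, 0.03, 0.03],  # confident class 0
    [0.10, 0.10, 0.20, 0.60],  # not confident in any class
    [0.02, 0.03, 0.05, 0.90],  # confident class 3
])
toy_true = np.array([3, 3, 3, 1])

pred_pos = np.where(toy_probs[:, 3] > 0.8)[0]  # confidently predicted positive
pred_neg = np.where(toy_probs[:, 0] > 0.8)[0]  # confidently predicted negative
true_pos = np.where(toy_true == 3)[0]          # truly positive

# Prediction and label agree on "positive".
agree_pos = np.intersect1d(pred_pos, true_pos)
# Label says positive, model confidently says negative.
disagree_pos = np.intersect1d(pred_neg, true_pos)

print(agree_pos, disagree_pos)
```

Note that `np.random.choice` raises an error on an empty array, so if a combination has no candidates (which can easily happen for misclassifications with a high threshold) it is worth checking the size of the intersection first.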

In [28]:
idx = get_indices(y_test, preds)

In [29]:
doc = X_texts[idx]
y_true = y_test[idx]
word_w = word_attn[idx]
doc_w  = sent_attn[idx]

In [30]:
y_true

3

In [31]:
pred_pos_real_pos_word_attn = plot_word_attention(doc, word_w)

In [32]:
with open('pred_pos_real_pos_word_attn.html', 'w') as f:
    f.write(pred_pos_real_pos_word_attn)

In [1]:
from IPython.display import display, HTML
display(HTML('figures/pred_pos_real_pos_word_attn.html'))

In [34]:
pred_pos_real_pos_sent_attn = plot_sent_attention(doc, doc_w)
with open('pred_pos_real_pos_sent_attn.html', 'w') as f:
    f.write(pred_pos_real_pos_sent_attn)

In [2]:
display(HTML('figures/pred_pos_real_pos_sent_attn.html'))

In [40]:
idx = get_indices(y_test, preds, is_pos=False)
doc = X_texts[idx]
y_true = y_test[idx]
word_w = word_attn[idx]
doc_w  = sent_attn[idx]
y_true

1

In [41]:
pred_neg_real_neg_word_attn = plot_word_attention(doc, word_w)
with open('pred_neg_real_neg_word_attn.html', 'w') as f:
    f.write(pred_neg_real_neg_word_attn)

pred_neg_real_neg_sent_attn = plot_sent_attention(doc, doc_w)
with open('pred_neg_real_neg_sent_attn.html', 'w') as f:
    f.write(pred_neg_real_neg_sent_attn)

In [3]:
display(HTML('figures/pred_neg_real_neg_word_attn.html'))

In [4]:
display(HTML('figures/pred_neg_real_neg_sent_attn.html'))

So...one would say it makes sense! 🥳🥳🥳. 

Let's now have a look at some misclassifications.

In [52]:
idx = get_indices(y_test, preds, agree=False, is_pos=False)
doc = X_texts[idx]
y_true = y_test[idx]
word_w = word_attn[idx]
doc_w  = sent_attn[idx]
y_true

1

In [53]:
pred_neg_real_pos_word_attn = plot_word_attention(doc, word_w, cmap='Reds')
with open('pred_neg_real_pos_word_attn.html', 'w') as f:
    f.write(pred_neg_real_pos_word_attn)

pred_neg_real_pos_sent_attn = plot_sent_attention(doc, doc_w, cmap='Reds')
with open('pred_neg_real_pos_sent_attn.html', 'w') as f:
    f.write(pred_neg_real_pos_sent_attn)

In [5]:
display(HTML('figures/pred_neg_real_pos_word_attn.html'))

In [6]:
display(HTML('figures/pred_neg_real_pos_sent_attn.html'))

One would say that it is odd that this review got a bad score. Now, after some searching, I noticed that the observation with idx 15726 was a nice example to illustrate that the model works:

In [57]:
idx = 15726
doc = X_texts[idx]
y_true = y_test[idx]
word_w = word_attn[idx]
doc_w  = sent_attn[idx]
y_true

3

In [58]:
pred_pos_real_neg_word_attn = plot_word_attention(doc, word_w, cmap='Reds')
with open('pred_pos_real_neg_word_attn.html', 'w') as f:
    f.write(pred_pos_real_neg_word_attn)
    
pred_pos_real_neg_sent_attn = plot_sent_attention(doc, doc_w, cmap='Reds')
with open('pred_pos_real_neg_sent_attn.html', 'w') as f:
    f.write(pred_pos_real_neg_sent_attn)

In [7]:
display(HTML('figures/pred_pos_real_neg_word_attn.html'))

In [8]:
display(HTML('figures/pred_pos_real_neg_sent_attn.html'))

When I saw this I wondered how it is possible that this review got a score of 3 (remember, 3 originally means 5 stars, since I merged the 1 and 2 star scores and Python starts counting from 0), so I looked into this in the original dataset.
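
As a reminder of how the labels relate to the stars, here is a minimal sketch of the mapping. It is an assumption reconstructed from the description above (1 and 2 stars merged, counting from 0), not the actual preprocessing code, and `stars_to_class` is a hypothetical name.

```python
def stars_to_class(stars: int) -> int:
    """Assumed mapping: 1 and 2 stars -> 0, 3 -> 1, 4 -> 2, 5 -> 3."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected star rating: {stars}")
    # 1 and 2 stars collapse into class 0; the rest shift down by two.
    return max(stars - 2, 0)


mapping = {s: stars_to_class(s) for s in range(1, 6)}
print(mapping)
```

Under this mapping a label of 3 corresponds to a 5-star review and a label of 0 to a 1 or 2-star review.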

In [36]:
import pandas as pd

In [98]:
# Whatever path you have the data
df = pd.read_json('../datasets/amazon_reviews/reviews_Clothing_Shoes_and_Jewelry_5.json.gz', lines=True)

In [99]:
df[df.reviewText.str.contains('straps') & 
   df.reviewText.str.contains('messed') & 
   df.reviewText.str.contains('uncomfortable')
  ]

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 222835 | ACNM0UKRB6KNB | B00919GY7E | Michaela Reyes | [0, 0] | The straps are completely messed up. They are very uncomfortable. I would not buy boots from her... | 5 | I was not impressed | 1383004800 | 10 29, 2013 |


So... yep, this is a very negative review, yet the customer gave an overall score of 5! 🤷🏻‍♂️

There is not much any algorithm can do here, but there are a few last comments I want to make:

1. One can clearly see that better preprocessing is possible. Little to no preprocessing was used INTENTIONALLY. In the previous notebooks I include some code in case someone wants to do more thorough preprocessing. Of course, the possibilities are almost endless.
2. Maybe we could use some pseudo-labelling and relabel some of the examples where the review and the score are not consistent, as we saw before (if we find more).
3. The fact that attention is normally placed on isolated words or bigrams suggests that using tf-idf with bigrams or trigrams will probably lead to the same results as using HANs. However, tf-idf with bigrams will significantly increase the amount of memory required.
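
A possible first step towards the relabelling idea is simply to list the observations where the model is confident in a class that differs from the label, and then inspect them by hand. The sketch below is only that: `flag_label_conflicts` is a hypothetical helper, the 0.8 threshold is borrowed from `get_indices`, and the data is invented.

```python
import numpy as np


def flag_label_conflicts(y_true, y_probs, threshold=0.8):
    """Indices where the top predicted class is confident and differs from the label."""
    y_true = np.asarray(y_true)
    y_probs = np.asarray(y_probs)
    y_hat = y_probs.argmax(axis=1)                # most likely class per row
    confident = y_probs.max(axis=1) > threshold   # only keep confident rows
    return np.where(confident & (y_hat != y_true))[0]


# Toy class probabilities (rows sum to 1) and toy labels.
toy_probs = np.array([
    [0.05, 0.05, 0.05, 0.85],
    [0.90, 0.04, 0.03, 0.03],
    [0.10, 0.10, 0.20, 0.60],
    [0.02, 0.03, 0.05, 0.90],
])
toy_true = np.array([3, 3, 3, 1])

candidates = flag_label_conflicts(toy_true, toy_probs)
print(candidates)
```

The flagged indices are only candidates: a confident disagreement can be a mislabelled review, like the one above, or just a model error, so they would still need a manual check before relabelling anything.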

And with this I conclude my experimentation with Hierarchical Attention Networks.