# Discovering Important Words for Sentiments With NormLIME

This notebook loads the pretrained Bi-LSTM model provided by [PaddlePaddle Models](https://github.com/PaddlePaddle/models/tree/release/1.7) and performs sentiment analysis on review data. The full official PaddlePaddle sentiment classification tutorial can be found [here](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/sentiment_classification).

The NormLIME method aggregates local models (one LIME explanation per input) into global, class-specific interpretations, which makes it effective at recognizing the features that matter most to each class. In this notebook, we use NormLIME, specifically `NormLIMENLPInterpreter`, to discover the words that contribute the most to positive and negative sentiment predictions.
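
The aggregation step can be sketched in a few lines. This is a minimal illustration that assumes the formulation from the NormLIME paper as we understand it: each local weight is scaled by its share of the explanation's L1 norm and then averaged over the samples in which the feature occurs. It is not InterpretDL's actual implementation, and the name `aggregate_normlime` and the toy weights are hypothetical.

```python
from collections import defaultdict

def aggregate_normlime(local_explanations):
    """Aggregate per-sample LIME weights for ONE class into global weights.

    local_explanations: list of {feature_id: lime_weight} dicts, one per sample.
    Returns {feature_id: (mean_normalized_weight, num_occurrences)}.
    """
    contributions = defaultdict(list)
    for weights in local_explanations:
        l1_norm = sum(abs(w) for w in weights.values())
        if l1_norm == 0:
            continue  # an all-zero explanation carries no information
        for feature_id, w in weights.items():
            # |w| scaled by its share of the L1 norm, i.e. |w| * |w| / L1
            contributions[feature_id].append(w * w / l1_norm)
    return {
        fid: (sum(vals) / len(vals), len(vals))
        for fid, vals in contributions.items()
    }

# Toy local explanations (hypothetical values, not from the real dataset).
toy_local = [{1: 0.5, 2: -0.5}, {1: 1.0}]
toy_global = aggregate_normlime(toy_local)
```

In this sketch each feature ends up with a `(weight, frequency)` pair, the same shape that the ranking cells at the end of this notebook read from `normlime_weights`. Note that squaring discards the sign of the local weight, so the result measures importance rather than direction.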

If you haven't done so, please first download the pretrained model and sentiment datasets by running the following commands:
```
wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-1.0.0.tar.gz
tar -zxvf sentiment_classification-1.0.0.tar.gz

wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz
tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz
```

In [1]:
import sys, os
import numpy as np
import paddle.fluid as fluid
import io

sys.path.append('..')
import interpretdl as it
from assets.bilstm import bilstm
from interpretdl.data_processor.visualizer import VisualizationTextRecord, visualize_text

In [2]:
import warnings 
warnings.filterwarnings("ignore")

Load the word dictionary from the pretrained model path. Define `unk_id` as the word id of the empty token *""*. Other possible choices include the *\<unk\>* token and the *\<pad\>* token.

In [2]:
def load_vocab(file_path):
    """
    load the given vocabulary
    """
    vocab = {}
    with io.open(file_path, 'r', encoding='utf8') as f:
        wid = 0
        for line in f:
            if line.strip() not in vocab:
                vocab[line.strip()] = wid
                wid += 1
    vocab["<unk>"] = len(vocab)
    return vocab

MODEL_PATH = "../../senta_model/bilstm_model"
VOCAB_PATH = os.path.join(MODEL_PATH, "word_dict.txt")
PARAMS_PATH = os.path.join(MODEL_PATH, "params")

word_dict = load_vocab(VOCAB_PATH)
unk_id = word_dict[""]  # alternative: word_dict["<unk>"]
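
To see how the vocabulary loader assigns ids, here is a small sketch that runs the same logic on an in-memory stream instead of `word_dict.txt`. The helper `load_vocab_from_stream` and the toy vocabulary are hypothetical. The sketch assumes that the empty token *""* enters the vocabulary through a blank line in the file, and it falls back to *\<unk\>* when no such line exists.

```python
import io

def load_vocab_from_stream(stream):
    """Same id assignment as load_vocab, but reading from any text stream."""
    vocab = {}
    for line in stream:
        token = line.strip()
        if token not in vocab:
            vocab[token] = len(vocab)  # ids follow order of first appearance
    vocab["<unk>"] = len(vocab)
    return vocab

# Toy vocabulary file: one token per line, one blank line, one duplicate.
toy_file = io.StringIO("good\nbad\n\ngood\n")
toy_vocab = load_vocab_from_stream(toy_file)

# Prefer the empty token; fall back to <unk> if the file has no blank line.
toy_unk_id = toy_vocab.get("", toy_vocab["<unk>"])
```

Using `.get` with a fallback avoids a `KeyError` when a vocabulary file happens not to contain the empty token.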

Define the Paddle model, which can take an arbitrary number of inputs (in this case `word_ids` and `seq_len`) and outputs prediction probabilities.

In [3]:
DICT_DIM = 1256606
MAX_SEQ_LEN = 256
def paddle_model(word_ids, seq_len):
    probs = bilstm(word_ids, seq_len, None, DICT_DIM, is_prediction=True)
    return probs

Define a preprocessing function that takes in **a raw string** and outputs the model inputs that can be fed into paddle_model.

In this case, the raw string is first split on whitespace and mapped to word ids, then padded to a length of MAX_SEQ_LEN. *word_ids* is a list of lists, where each inner list contains a sequence of padded word ids. *seq_lens* is a list that contains the unpadded length of each sequence in *word_ids*.

Since the input data is a single raw string, both *word_ids* and *seq_lens* have length 1.

In [4]:
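def preprocess_fn(data):
    """Convert one raw string into ([padded word ids], [sequence length])."""
    word_ids = []
    # Map each whitespace-separated token to its id; unknown tokens get unk_id.
    sub_word_ids = [word_dict.get(d, unk_id) for d in data.split()]
    # Cap the reported length so it never exceeds the truncated sequence.
    seq_lens = [min(len(sub_word_ids), MAX_SEQ_LEN)]
    if len(sub_word_ids) < MAX_SEQ_LEN:
        # Pad with id 0, which is passed as pad_id to the interpreter below.
        sub_word_ids += [0] * (MAX_SEQ_LEN - len(sub_word_ids))
    word_ids.append(sub_word_ids[:MAX_SEQ_LEN])
    return word_ids, seq_lens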
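
The shape of the preprocessing output is easiest to see on a toy example. The sketch below repeats the same steps with a hypothetical four-entry vocabulary and a maximum length of 6; the names `toy_word_dict`, `TOY_MAX_SEQ_LEN`, and `toy_preprocess_fn` are made up for illustration and are not part of the notebook.

```python
TOY_MAX_SEQ_LEN = 6
toy_word_dict = {"<pad>": 0, "很": 1, "好": 2, "": 3}
toy_unk_id = toy_word_dict[""]

def toy_preprocess_fn(text):
    """Split on whitespace, map tokens to ids, then pad or truncate."""
    ids = [toy_word_dict.get(tok, toy_unk_id) for tok in text.split()]
    seq_lens = [min(len(ids), TOY_MAX_SEQ_LEN)]
    if len(ids) < TOY_MAX_SEQ_LEN:
        ids += [0] * (TOY_MAX_SEQ_LEN - len(ids))  # pad with id 0
    return [ids[:TOY_MAX_SEQ_LEN]], seq_lens

# "啊" is not in the toy vocabulary, so it maps to toy_unk_id (3).
short_ids, short_lens = toy_preprocess_fn("很 好 啊")

# Eight tokens are truncated to TOY_MAX_SEQ_LEN.
long_ids, long_lens = toy_preprocess_fn("好 好 好 好 好 好 好 好")
```

Both return values have length 1 because a single string is processed at a time.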

Read the sentiment test dataset into a list. There are 1200 sentences in the dataset.

In [5]:
DATA_PATH = "../../senta_data/test.tsv"

data = []
with io.open(DATA_PATH, "r", encoding='utf8') as fin:
    for line in fin:
        if line.startswith('text_a'):
            continue
        cols = line.strip().split("\t")
        if len(cols) != 2:
            sys.stderr.write("[NOTICE] Error Format Line!")
            continue
        data.append(cols[0])
print('total of %d sentences' % len(data))

total of 1200 sentences


Initialize the `NormLIMENLPInterpreter`. We save the intermediate results into a *.npz* file so that we don't have to run the whole process again if we want to rerun on the same dataset.

In [6]:
normlime = it.NormLIMENLPInterpreter(
    paddle_model, PARAMS_PATH, temp_data_file='assets/all_lime_weights_nlp.npz')

Run `interpret` on the whole dataset. This may take some time.

In [7]:
normlime_weights = normlime.interpret(
    data,
    preprocess_fn,
    unk_id=unk_id,
    pad_id=0,
    num_samples=500,
    batch_size=50)

  0%|          | 1/1200 [00:03<1:02:21,  3.12s/it]

Load model from ../../senta_model/bilstm_model/params


100%|██████████| 1200/1200 [04:03<00:00,  4.93it/s]
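
For intuition about the `unk_id` and `num_samples` arguments, the sketch below shows a generic LIME-style perturbation step: each sample randomly keeps some words and replaces the rest with the unknown id. This is an illustration of the general technique under our own assumptions (independent coin flips with probability 0.5), not InterpretDL's internal code, and the function `perturb_word_ids` is hypothetical.

```python
import random

def perturb_word_ids(word_ids, unk_id, num_samples, seed=0):
    """Return num_samples (keep_mask, perturbed_ids) pairs for one sentence."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        # True means the word is kept, False means it is masked out.
        keep_mask = [rng.random() < 0.5 for _ in word_ids]
        perturbed = [
            wid if keep else unk_id
            for wid, keep in zip(word_ids, keep_mask)
        ]
        samples.append((keep_mask, perturbed))
    return samples

# Toy sentence of four word ids, with 99 standing in for unk_id.
toy_sentence = [11, 12, 13, 14]
toy_samples = perturb_word_ids(toy_sentence, unk_id=99, num_samples=5)
```

In a LIME-style method, the model is then evaluated on every perturbed sequence, and a linear model fitted from the keep masks to the predicted probabilities yields one local weight per word.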


In the cells below, we rank the words by weight and print the top 20 for positive and negative sentiments. Only words that appear at least 5 times are included.

In [19]:
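import pandas as pd

# Invert the vocabulary so that word ids can be mapped back to words.
id2word = dict(zip(word_dict.values(), word_dict.keys()))
# Positive (class 1): each value is a (weight, frequency) pair.
temp = {
    id2word[wid]: normlime_weights[1][wid]
    for wid in normlime_weights[1]
}
# Keep words seen at least 5 times, then sort by weight in descending order.
W = [(word, weight[0], weight[1]) for word, weight in temp.items() if weight[1] >= 5]
pd.DataFrame(data=sorted(W, key=lambda x: -x[1])[:20], columns=['word', 'weight', 'frequency'])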

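|   | word | weight | frequency |
|---|------|--------|-----------|
| 0 | 爽 | 0.037562 | 7 |
| 1 | 挺好 | 0.031876 | 8 |
| 2 | 支持 | 0.026343 | 16 |
| 3 | 感动 | 0.023615 | 15 |
| 4 | 很漂亮 | 0.022065 | 14 |
| 5 | 优点 | 0.020294 | 9 |
| 6 | 满意 | 0.01717 | 26 |
| 7 | 超值 | 0.016647 | 11 |
| 8 | 很满意 | 0.016316 | 14 |
| 9 | 很方便 | 0.016003 | 22 |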


In [25]:
# Negative (class 0): each value is a (weight, frequency) pair.
temp = {
    id2word[wid]: normlime_weights[0][wid]
    for wid in normlime_weights[0]
}
# Keep words seen at least 5 times, then sort by weight in descending order.
W = [(word, weight[0], weight[1]) for word, weight in temp.items() if weight[1] >= 5]
pd.DataFrame(data=sorted(W, key=lambda x: -x[1])[:20], columns=['word', 'weight', 'frequency'])

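|   | word | weight | frequency |
|---|------|--------|-----------|
| 0 | 失望 | 0.057111 | 17 |
| 1 | 很一般 | 0.048728 | 19 |
| 2 | 上当 | 0.041603 | 9 |
| 3 | 粗糙 | 0.038233 | 6 |
| 4 | 恶心 | 0.036756 | 7 |
| 5 | 垃圾 | 0.03343 | 7 |
| 6 | 最差 | 0.033248 | 8 |
| 7 | 较差 | 0.031293 | 6 |
| 8 | 不值 | 0.025248 | 8 |
| 9 | 极差 | 0.02418 | 6 |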
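
The filter-and-sort step used for both classes can be factored into one helper that needs no pandas. This is a sketch with a hypothetical function name, `top_words`, and made-up toy weights; it assumes, as in the cells above, that each class maps word ids to `(weight, frequency)` pairs.

```python
def top_words(class_weights, id2word, min_freq=5, k=20):
    """Return up to k (word, weight, frequency) rows, largest weight first."""
    rows = [
        (id2word[wid], weight, freq)
        for wid, (weight, freq) in class_weights.items()
        if freq >= min_freq  # drop words that appear too rarely
    ]
    return sorted(rows, key=lambda row: -row[1])[:k]

# Toy inputs (hypothetical values, not taken from the results above).
toy_id2word = {0: "a", 1: "b", 2: "c"}
toy_class_weights = {0: (0.03, 7), 1: (0.05, 2), 2: (0.01, 9)}
toy_top = top_words(toy_class_weights, toy_id2word)
```

Here the word "b" has the largest weight but is dropped because it appears only twice, which shows why the frequency threshold matters: rare words can receive large but unreliable weights.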
