# Discovering Important Words for Sentiments With NormLIME

This notebook loads a pretrained Bi-LSTM model, trained by following the [PaddleNLP TextClassification](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/examples/text_classification/rnn) example, and performs sentiment analysis on review data. The full official PaddlePaddle sentiment classification tutorial can be found [here](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/examples/text_classification).

The NormLIME method aggregates local models into global and class-specific interpretations, and it is effective at recognizing important features. In this notebook, we use NormLIME, specifically `NormLIMENLPInterpreter`, to discover the words that contribute the most to positive and negative sentiment predictions.
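
To make the aggregation idea concrete, here is a minimal sketch of one normalize-then-average scheme in the spirit of NormLIME. It is not the `interpretdl` implementation: the function name `aggregate_local_weights` and the toy weights are hypothetical. Each local explanation is assumed to be a dict mapping a word id to its LIME weight, and a word's global score averages its absolute weight scaled by that word's share of the explanation's L1 norm.

```python
from collections import defaultdict


def aggregate_local_weights(local_explanations):
    """Combine per-sample LIME weights into one global score per feature.

    Hypothetical helper: each explanation is a dict {feature_id: weight}.
    Returns {feature_id: (global_score, number_of_explanations_with_it)}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for weights in local_explanations:
        l1_norm = sum(abs(w) for w in weights.values())
        if l1_norm == 0:
            continue  # an all-zero explanation carries no information
        for feature_id, w in weights.items():
            # Share of this explanation's total weight held by the feature.
            share = abs(w) / l1_norm
            totals[feature_id] += share * abs(w)
            counts[feature_id] += 1
    return {f: (totals[f] / counts[f], counts[f]) for f in totals}


# Two made-up local explanations over word ids 7, 9 and 11.
toy = [{7: 0.6, 9: 0.2}, {7: 0.3, 11: -0.3}]
global_scores = aggregate_local_weights(toy)
```

Word 7 appears in both explanations and dominates the first one, so it ends up with the highest global score. The `(score, count)` pairs have the same shape as the weight and frequency pairs read from `normlime_weights` later in this notebook.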

In [1]:
import paddle
import numpy as np
import interpretdl as it
import jieba

In [2]:
import warnings 
warnings.filterwarnings("ignore")

Load the word dictionary and specify the path to the pretrained model. Define `unk_id` as the word id of the *\[UNK\]* token. Other possible choices include the empty token *\"\"* and the *\[PAD\]* token.

To obtain the pretrained weights, train a BiLSTM model by following the [tutorial](https://github.com/PaddlePaddle/models/tree/release/2.0-beta/PaddleNLP/examples/text_classification/rnn) and set `PARAMS_PATH` below to the location of the final `.pdparams` file.

In [3]:
def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = {}
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        token = token.rstrip("\n").split("\t")[0]
        vocab[token] = index
    return vocab

PARAMS_PATH = "assets/final.pdparams"
VOCAB_PATH = "assets/senta_word_dict.txt"

vocab = load_vocab(VOCAB_PATH)
unk_id = vocab['[UNK]']
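
The parser above expects one token per line, optionally followed by a tab and further columns, and uses the line index as the word id. The sketch below checks that behavior on a temporary file. The file contents are made up and are not the real `senta_word_dict.txt`, and the helper name `read_vocab` is hypothetical.

```python
import os
import tempfile


def read_vocab(path):
    """Map each token (first tab-separated field) to its 0-based line index."""
    with open(path, "r", encoding="utf-8") as reader:
        return {
            line.rstrip("\n").split("\t")[0]: index
            for index, line in enumerate(reader)
        }


# Made-up vocabulary; the second column is ignored by the parser.
sample_lines = ["[PAD]\t0", "[UNK]\t0", "不错\t42", "很差\t7"]
with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("\n".join(sample_lines) + "\n")
    sample_path = f.name

sample_vocab = read_vocab(sample_path)
os.remove(sample_path)

sample_unk_id = sample_vocab["[UNK]"]
# A word missing from the vocabulary falls back to the [UNK] id.
looked_up = [sample_vocab.get(w, sample_unk_id) for w in ["不错", "没见过"]]
```

Here `looked_up` is `[2, 1]`: the known word keeps its own id and the unseen word maps to the id of `[UNK]`.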

Initialize the BiLSTM model using **paddlenlp.models** and load pretrained weights.

In [4]:
import paddlenlp as ppnlp
model = ppnlp.models.Senta(
        network='bilstm',
        vocab_size=len(vocab),
        num_classes=2)

state_dict = paddle.load(PARAMS_PATH)
model.set_dict(state_dict)

Define a preprocessing function that takes **a raw string** and returns the inputs that can be fed into the model.

In this case, the raw string is split into words and each word is mapped to its word id. *texts* is a list of lists, where each inner list contains a sequence of padded word ids. *seq_lens* is a list that contains the length of each sequence in *texts* before padding.

Since the input is a single raw string, both *texts* and *seq_lens* have length 1.

In [5]:
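def preprocess_fn(text):
    """Convert one raw string into (texts, seq_lens) tensors for the model."""
    # Segment the text into words with jieba.
    tokens = " ".join(jieba.cut(text)).split(' ')

    # Map each word to its id; out-of-vocabulary words fall back to [UNK].
    unk_id = vocab.get('[UNK]', None)
    ids = []
    for token in tokens:
        wid = vocab.get(token, unk_id)
        # Skip tokens without an id. Note that this check also drops id 0.
        if wid:
            ids.append(wid)

    # A single input string gives a batch of size 1, so no padding is needed.
    texts = paddle.to_tensor([ids])
    seq_lens = paddle.to_tensor([len(ids)])
    return texts, seq_lens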
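
With a single input string there is nothing to pad, which is why `preprocess_fn` never actually pads. For a batch of several strings, the padded *texts* and unpadded *seq_lens* described above could be built as in the following sketch. The helper name `pad_batch` and the toy ids are hypothetical, and the pad id of 0 is chosen to match the `pad_id=0` passed to `interpret` later.

```python
def pad_batch(id_sequences, pad_id=0):
    """Right-pad every id sequence to the longest length in the batch."""
    # Lengths are recorded before padding so the model can ignore pad ids.
    seq_lens = [len(ids) for ids in id_sequences]
    max_len = max(seq_lens, default=0)
    texts = [ids + [pad_id] * (max_len - len(ids)) for ids in id_sequences]
    return texts, seq_lens


batch_texts, batch_seq_lens = pad_batch([[5, 8, 2], [9], [4, 4]])
# batch_texts is [[5, 8, 2], [9, 0, 0], [4, 4, 0]]
# batch_seq_lens is [3, 1, 2]
```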

The dataset we'll be using is the **ChnSentiCorp** dataset from paddlenlp.

In [None]:
!pip install paddlenlp==2.0.0b

We use the first 1200 samples in the training set as our data.

In [6]:
from paddlenlp.datasets import ChnSentiCorp

train_ds = ChnSentiCorp.get_datasets(['train'])
data = [d[0] for d in list(train_ds)[:1200]]
print('total of %d sentences' % len(data))

total of 1200 sentences


Initialize the `NormLIMENLPInterpreter`. We save the intermediate results to a *.npz* file so that the whole process does not have to be repeated when the same dataset is interpreted again.

In [7]:
normlime = it.NormLIMENLPInterpreter(
    model, temp_data_file='assets/all_lime_weights_nlp.npz')

Run `interpret` on the whole dataset. This may take some time.

In [8]:
normlime_weights = normlime.interpret(
    data,
    preprocess_fn,
    unk_id=unk_id,
    pad_id=0,
    num_samples=500,
    batch_size=50)

  0%|          | 0/1200 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.842 seconds.
Prefix dict has been built successfully.
100%|██████████| 1200/1200 [02:21<00:00,  8.45it/s]
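
The `unk_id` and `num_samples` arguments suggest a LIME-style perturbation: each sentence is resampled many times with some words masked, and a linear model is then fitted to the classifier's outputs on those variants. The sketch below shows only the masking step, under the assumption that masked words are replaced by `unk_id`. The function `perturb_ids` is hypothetical and is not taken from `interpretdl`.

```python
import random


def perturb_ids(ids, unk_id, num_samples, seed=0):
    """Return (masks, variants); masks[i][j] == 1 means word j was kept."""
    rng = random.Random(seed)
    masks, variants = [], []
    for _ in range(num_samples):
        # Keep or mask each word independently with probability 0.5.
        mask = [rng.randint(0, 1) for _ in ids]
        variants.append(
            [wid if keep else unk_id for wid, keep in zip(ids, mask)])
        masks.append(mask)
    return masks, variants


# Toy sentence of three word ids, with 1 standing in for the [UNK] id.
masks, variants = perturb_ids([12, 7, 30], unk_id=1, num_samples=5)
```

Each binary mask is what a local linear model would be fitted on, so the learned coefficient for position `j` estimates how much keeping word `j` moves the prediction.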


In the cells below, we print the 20 words with the largest weights for the positive and negative sentiments. Only words that appear at least 5 times are included.

In [9]:
import pandas as pd
id2word = dict(zip(vocab.values(), vocab.keys()))
# Positive 
temp = {
    id2word[wid]: normlime_weights[1][wid]
    for wid in normlime_weights[1]
}
W = [(word, weight[0], weight[1]) for word, weight in temp.items() if  weight[1] >= 5]
pd.DataFrame(data = sorted(W, key=lambda x: -x[1])[:20], columns = ['word', 'weight', 'frequency'])

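| | word | weight | frequency |
|---|---|---|---|
| 0 | 小巧 | 0.06612 | 12 |
| 1 | 不错 | 0.05701 | 290 |
| 2 | 好书 | 0.036362 | 17 |
| 3 | 性价比 | 0.035803 | 42 |
| 4 | 干净 | 0.033354 | 27 |
| 5 | 每次 | 0.033267 | 13 |
| 6 | 海边 | 0.02879 | 7 |
| 7 | 热情 | 0.02785 | 20 |
| 8 | 一句 | 0.027751 | 5 |
| 9 | 很漂亮 | 0.027581 | 9 |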


In [10]:
# Negative
temp = {
    id2word[wid]: normlime_weights[0][wid]
    for wid in normlime_weights[0]
}
W = [(word, weight[0], weight[1]) for word, weight in temp.items() if  weight[1] >= 5]
pd.DataFrame(data = sorted(W, key=lambda x: -x[1])[:20], columns = ['word', 'weight', 'frequency'])

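| | word | weight | frequency |
|---|---|---|---|
| 0 | 很差 | 0.002387 | 9 |
| 1 | 是不是 | 0.00043 | 8 |
| 2 | 不好 | 0.000397 | 35 |
| 3 | 不 | 0.000387 | 220 |
| 4 | 不会 | 0.000318 | 18 |
| 5 | 贵 | 0.000273 | 6 |
| 6 | 隔音 | 0.000267 | 14 |
| 7 | 不如 | 0.000247 | 9 |
| 8 | 不是 | 0.000245 | 37 |
| 9 | 只能 | 0.000242 | 26 |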
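
The filter-and-rank step used in the two cells above can be tried on toy data without running the interpreter. In this sketch the dictionary `toy_weights` is made up, and it assumes the same layout that the cells read from `normlime_weights`: each word id maps to a `(weight, frequency)` pair. The helper `top_words` is hypothetical.

```python
def top_words(class_weights, id2word, min_freq=5, k=20):
    """Return the k highest-weighted words seen at least min_freq times."""
    rows = [
        (id2word[wid], weight, freq)
        for wid, (weight, freq) in class_weights.items()
        if freq >= min_freq
    ]
    # Sort by weight, largest first, and keep the top k rows.
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]


toy_id2word = {2: "不错", 3: "很差", 4: "海边"}
# Word 3 has the largest weight but appears only 3 times, so it is dropped.
toy_weights = {2: (0.05, 290), 3: (0.09, 3), 4: (0.02, 7)}
ranked = top_words(toy_weights, toy_id2word, k=2)
```

The frequency threshold matters because a word seen only a few times can receive a large weight from a handful of local models, which makes its global score unreliable.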
