# Interpreting Bi-LSTM Sentiment Classification Models With LIME

This notebook loads the pretrained Bi-LSTM model provided by [PaddlePaddle Models](https://github.com/PaddlePaddle/models/tree/release/1.7) and performs sentiment analysis on review data. The full official PaddlePaddle sentiment classification tutorial can be found [here](https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleNLP/sentiment_classification).

Interpretations of the predictions are generated and visualized using the LIME algorithm, specifically the `LIMENLPInterpreter` class.
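
For readers new to LIME, the core idea can be sketched in a few lines: perturb the input by dropping tokens, query the model on each perturbed input, and fit a weighted linear model whose coefficients serve as per-token importances. The sketch below is an illustration under stated assumptions, not the implementation inside `LIMENLPInterpreter`. The names `lime_text_sketch` and `toy_predict`, the proximity kernel, and the choice to remove tokens (rather than, say, replace them with an unknown-token id) are all hypothetical.

```python
import numpy as np

def lime_text_sketch(tokens, predict_fn, num_samples=200, seed=0):
    """Toy LIME for text: returns one linear weight per token."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    # Each row is a binary mask: 1 keeps the token, 0 removes it.
    masks = rng.integers(0, 2, size=(num_samples, n))
    masks[0] = 1  # keep the unperturbed input as one of the samples
    # Query the black-box model on every perturbed token list.
    probs = np.array([
        predict_fn([tok for tok, keep in zip(tokens, row) if keep])
        for row in masks
    ])
    # Samples closer to the original (fewer removed tokens) count more.
    # The kernel shape and width are assumptions of this sketch.
    removed_frac = 1.0 - masks.mean(axis=1)
    sample_weights = np.exp(-(removed_frac ** 2) / 0.25)
    # Weighted least squares with an intercept column.
    X = np.hstack([masks, np.ones((num_samples, 1))])
    sw = np.sqrt(sample_weights)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], probs * sw, rcond=None)
    return coef[:n]

def toy_predict(tokens):
    """Hypothetical stand-in for a model's positive-class probability."""
    return 0.5 + 0.3 * ("great" in tokens) - 0.3 * ("small" in tokens)

tokens = ["service", "great", "room", "small"]
weights = lime_text_sketch(tokens, toy_predict)
for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:+.3f}")
```

Because the toy model is exactly linear in the presence of each token, the fitted weights come out close to +0.3 for `great`, -0.3 for `small`, and roughly zero for the other two tokens. A real classifier is not linear, so the weights are only a local approximation around the given input.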

If you haven't done so, please first download the pretrained model by running the following commands:
```
wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-1.0.0.tar.gz
tar -zxvf sentiment_classification-1.0.0.tar.gz
```

In [1]:
import sys, os
import numpy as np
import paddle.fluid as fluid
import io

sys.path.append('..')
import interpretdl as it
from assets.bilstm import bilstm
from interpretdl.data_processor.visualizer import VisualizationTextRecord, visualize_text

In [2]:
import warnings 
warnings.filterwarnings("ignore")

Load the word dictionary from the pretrained model path. Define `unk_id` as the word id of the empty token *""*. Other possible choices include the *\<unk\>* token and the *\<pad\>* token.

In [3]:
def load_vocab(file_path):
    """
    load the given vocabulary
    """
    vocab = {}
    with io.open(file_path, 'r', encoding='utf8') as f:
        wid = 0
        for line in f:
            if line.strip() not in vocab:
                vocab[line.strip()] = wid
                wid += 1
    vocab["<unk>"] = len(vocab)
    return vocab

MODEL_PATH = "../../senta_model/bilstm_model"
VOCAB_PATH = os.path.join(MODEL_PATH, "word_dict.txt")
PARAMS_PATH = os.path.join(MODEL_PATH, "params")

word_dict = load_vocab(VOCAB_PATH)
unk_id = word_dict[""]  # alternative: word_dict["<unk>"]
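
To make the id assignment concrete, here is a small sketch that applies the same logic as `load_vocab` to an in-memory file. The helper `build_vocab` and the toy tokens are hypothetical and are not part of the pretrained model's vocabulary. It shows that a blank line in the file becomes the empty token *""*, that duplicate lines are skipped, and that *\<unk\>* is appended last.

```python
import io

def build_vocab(lines):
    """Same id-assignment logic as load_vocab, for any iterable of lines."""
    vocab = {}
    for line in lines:
        token = line.strip()
        if token not in vocab:
            # Ids are handed out in order of first appearance.
            vocab[token] = len(vocab)
    # The unknown-word token is added after all file entries.
    vocab["<unk>"] = len(vocab)
    return vocab

# Hypothetical four-line vocabulary file: one blank line, one duplicate.
toy_file = io.StringIO("good\n\nbad\ngood\n")
toy_vocab = build_vocab(toy_file)
print(toy_vocab)
```

In this toy vocabulary, `toy_vocab[""]` is the id of the blank line, which is the same kind of lookup the cell above uses to obtain `unk_id`.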

Define the paddle model that takes in an arbitrary number of inputs, in this case `word_ids` and `seq_len`, and outputs prediction probabilities.

In [4]:
DICT_DIM = 1256606
MAX_SEQ_LEN = 256
def paddle_model(word_ids, seq_len):
    probs = bilstm(word_ids, seq_len, None, DICT_DIM, is_prediction=True)
    return probs

Define a preprocessing function that takes in **a raw string** and outputs the model inputs that can be fed into paddle_model.

In this case, the raw string is first split on whitespace and mapped to word ids, then padded to a length of `MAX_SEQ_LEN`. *word_ids* is a list of lists, where each inner list contains one sequence of padded word ids. *seq_lens* is a list that contains the unpadded length of each sequence in *word_ids*.

Since the input is a single raw string, both *word_ids* and *seq_lens* have length 1.

In [5]:
def preprocess_fn(data):
    # Map each whitespace-separated token to its id, falling back to unk_id.
    sub_word_ids = [word_dict.get(d, unk_id) for d in data.split()]
    # Truncate first so the reported length never exceeds MAX_SEQ_LEN.
    sub_word_ids = sub_word_ids[:MAX_SEQ_LEN]
    seq_lens = [len(sub_word_ids)]
    # Pad with id 0 up to MAX_SEQ_LEN.
    sub_word_ids += [0] * (MAX_SEQ_LEN - len(sub_word_ids))
    return [sub_word_ids], seq_lens
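
The shape of the preprocessing output is easier to see on a tiny example. The sketch below mirrors the split, map, and pad steps with a hypothetical four-entry vocabulary and a maximum length of 6. The names `toy_vocab`, `TOY_UNK_ID`, `TOY_MAX_LEN`, and `toy_preprocess` are illustrative only. This toy version caps the reported sequence length at the maximum length, which is a choice of the sketch.

```python
# Hypothetical vocabulary: id 0 is reserved for padding, "" is the unknown token.
toy_vocab = {"<pad>": 0, "good": 1, "room": 2, "": 3}
TOY_UNK_ID = toy_vocab[""]
TOY_MAX_LEN = 6

def toy_preprocess(text, max_len=TOY_MAX_LEN, pad_id=0):
    # Split on whitespace, map to ids, and truncate to max_len.
    ids = [toy_vocab.get(tok, TOY_UNK_ID) for tok in text.split()][:max_len]
    seq_len = len(ids)
    # Right-pad so every sequence has exactly max_len entries.
    ids = ids + [pad_id] * (max_len - seq_len)
    # Wrap in lists: a batch containing a single example.
    return [ids], [seq_len]

word_ids, seq_lens = toy_preprocess("good room tiny")
print(word_ids)  # [[1, 2, 3, 0, 0, 0]]
print(seq_lens)  # [3]
```

The out-of-vocabulary word `tiny` maps to the unknown id 3, and the three trailing zeros are padding that the model is told to ignore through the sequence length.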



Initialize the `LIMENLPInterpreter`. Define the reviews that we want to analyze. 

The reviews are selected from the sentiment classification dataset. You can download it by running the following commands:
```
wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz
tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz
```

In [6]:
lime = it.LIMENLPInterpreter(paddle_model, PARAMS_PATH)

reviews = [
    '交通 方便 ；环境 很好 ；服务态度 很好 房间 较小',
    '这本书 实在 太烂 了 , 什么 朗读 手册 , 一点 朗读 的 内容 都 没有 . 看 了 几页 就 不 想 看 下去 了 .'
]

In the cell below, we call `interpret` on each review in turn and collect the weight of each token. For visualization purposes, the word weights of each review are normalized to unit L2 norm so that differences between weights are easier to see, and their signs are flipped when the interpreted class is 0. The result for each review is stored in a list as a `VisualizationTextRecord`.

In [7]:
true_labels = [1, 0]
recs = []

for i, review in enumerate(reviews):
    pred_class, pred_prob, lime_weights = lime.interpret(
        review,
        preprocess_fn,
        num_samples=200,
        batch_size=10,
        unk_id=unk_id,
        pad_id=0,
        return_pred=True)

    words = review.split()
    interp_class = list(lime_weights.keys())[0]
    word_importances = [t[1] for t in lime_weights[interp_class]]
    word_importances = np.array(word_importances) / np.linalg.norm(
        word_importances)
    true_label = true_labels[i]
    if interp_class == 0:
        word_importances = -word_importances
    rec = VisualizationTextRecord(words, word_importances, true_label,
                                  pred_class[0], pred_prob[0],
                                  interp_class)
    recs.append(rec)

visualize_text(recs)



Load model from ../../senta_model/bilstm_model/params


| True Label | Predicted Label (Prob) | Target Label | Word Importance |
| --- | --- | --- | --- |
| 1.0 | 1 (0.96) | 1.0 | 交通 方便 ；环境 很好 ；服务态度 很好 房间 较小 |
| 0.0 | 0 (1.00) | 0.0 | 这本书 实在 太烂 了 , 什么 朗读 手册 , 一点 朗读 的 内容 都 没有 . 看 了 几页 就 不 想 看 下去 了 . |
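
The normalization and sign flip applied to the LIME weights in the last cell can be looked at in isolation. The sketch below assumes, as the `t[1]` indexing in that cell suggests, that the weights arrive as pairs whose second element is the weight. The function name `normalize_weights`, the toy pairs, and the zero-norm guard are additions of this sketch and are not part of the notebook.

```python
import numpy as np

def normalize_weights(token_weights, interp_class):
    """Scale weights to unit L2 norm; flip signs when class 0 is interpreted."""
    w = np.array([weight for _, weight in token_weights], dtype=float)
    norm = np.linalg.norm(w)
    if norm > 0:  # guard added here to avoid dividing by zero
        w = w / norm
    # For class 0, a token that supports the prediction pushes toward
    # "negative", so its sign is flipped to read as negative sentiment.
    return -w if interp_class == 0 else w

toy_pairs = [(0, 3.0), (1, -4.0)]  # hypothetical (token position, weight) pairs
as_positive = normalize_weights(toy_pairs, interp_class=1)
as_negative = normalize_weights(toy_pairs, interp_class=0)
print(as_positive)  # [ 0.6 -0.8]
print(as_negative)  # [-0.6  0.8]
```

After this step the sign of a weight has the same meaning whichever class was interpreted: positive values indicate evidence for positive sentiment and negative values indicate evidence for negative sentiment.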
