# LIME Tabular Explainer via XAI for Regression

This tutorial demonstrates how to generate explanations using LIME's tabular explainer implemented by the XAI library for a regression task.

At a high level, explanations can be obtained from any XAI explanation algorithm in 3 steps:

1. Create an explainer via the `ExplainerFactory` class, which serves as the primary interface between the user and all XAI-supported explanation algorithms
2. Build the explainer by calling the `build_explainer` method (which is implemented by any XAI explanation algorithm) and providing arguments that are specific to that algorithm
3. Get explanations for some data instance by calling the `explain_instance` method (which is also common among all algorithms) and providing arguments that are specific to that algorithm
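
The three steps above map onto a small shared interface. The sketch below is a toy stand-in written for illustration, not XAI's code: `ToyExplainer`, `MeanBaselineExplainer`, and `ToyFactory` are hypothetical names, and only the method names `get_explainer`, `build_explainer`, and `explain_instance` are taken from this tutorial.

```python
from abc import ABC, abstractmethod


class ToyExplainer(ABC):
    """Hypothetical base class with the two methods every explainer shares."""

    @abstractmethod
    def build_explainer(self, **kwargs):
        """Algorithm-specific initialisation."""

    @abstractmethod
    def explain_instance(self, instance, **kwargs):
        """Algorithm-specific explanation for a single row."""


class MeanBaselineExplainer(ToyExplainer):
    """Toy algorithm: scores each feature by its deviation from the training mean."""

    def build_explainer(self, training_data, feature_names):
        n_rows = len(training_data)
        self.feature_names = list(feature_names)
        # Column-wise means of the training data
        self.means = [sum(col) / n_rows for col in zip(*training_data)]

    def explain_instance(self, instance, num_features=2):
        scores = [
            {"feature": name, "score": value - mean}
            for name, value, mean in zip(self.feature_names, instance, self.means)
        ]
        # Keep the features with the largest absolute deviation
        scores.sort(key=lambda item: abs(item["score"]), reverse=True)
        return scores[:num_features]


class ToyFactory:
    """Hypothetical factory that maps an algorithm name to an explainer class."""

    _registry = {"mean_baseline": MeanBaselineExplainer}

    @classmethod
    def get_explainer(cls, algorithm):
        return cls._registry[algorithm]()


# Step 1: create the explainer; step 2: build it; step 3: explain one row.
explainer = ToyFactory.get_explainer("mean_baseline")
explainer.build_explainer(
    training_data=[[1.0, 10.0], [3.0, 30.0]],
    feature_names=["a", "b"],
)
result = explainer.explain_instance([2.5, 10.0], num_features=1)
print(result)
```

The point of the design is that the calling code stays the same when the algorithm changes; only the keyword arguments passed to the two methods differ.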

### Step 1: Import libraries

`xai.explainer.ExplainerFactory` is the main class that users of XAI interact with. `xai` contains some constants that are used to instantiate an `AbstractExplainer` object.

In [1]:
# Some auxiliary imports for the tutorial
import sys
import random
import numpy as np
from pprint import pprint
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Set seed for reproducibility
np.random.seed(123456)

# Set the path so that we can import ExplainerFactory
sys.path.append('../..')

# Main XAI imports
import xai
from xai.explainer import ExplainerFactory

### Step 2: Train a model on a sample dataset

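We train a `RandomForestRegressor` model on the Boston housing dataset, a sample regression problem that is provided by scikit-learn. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older release.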

In [2]:
raw_data = datasets.load_boston()
X, y = raw_data['data'], raw_data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate a regressor, train, and evaluate on the test set
clf = RandomForestRegressor(n_estimators=1000)
clf.fit(X_train, y_train)
print('Random Forest MSError', np.mean((clf.predict(X_test) - y_test) ** 2))

Random Forest MSError 9.785803784803988
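
The figure printed above is the mean squared error (MSE) on the held-out test set. As a minimal sketch of the same arithmetic in plain Python, with made-up target values that are not taken from the dataset:

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and targets."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only
y_true = [24.0, 21.0, 30.0, 18.0]
y_pred = [26.0, 19.0, 32.0, 16.0]

mse = mean_squared_error(y_true, y_pred)
rmse = math.sqrt(mse)  # back in the units of the target
print(mse, rmse)

# Root of the MSE reported in the tutorial output
tutorial_rmse = math.sqrt(9.785803784803988)
print(round(tutorial_rmse, 2))
```

Taking the square root puts the error back in the units of the target. The Boston target is the median home value in thousands of dollars, so the reported MSE corresponds to a typical error of roughly 3.13, that is, about $3,100.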


### Step 3: Instantiate the explainer

This is where we instantiate the XAI explainer. The `ExplainerFactory` class is in charge of loading a particular explanation algorithm. Only one argument is required: `domain`, which indicates the domain of the training data (e.g. `tabular` or `text`). The available domains are listed in `xai.DOMAIN`. Users can also select a particular explanation algorithm by passing its name (registered in `xai.ALG`) to the `algorithm` parameter. If this argument is omitted, `ExplainerFactory.get_explainer` falls back to the pre-set algorithm for that domain, which is defined in `xai/explainer/config.py`.

We want to load the `LimeTabularExplainer`, so we provide `xai.DOMAIN.TABULAR` as the argument to `domain` and `xai.ALG.LIME` as the argument to `algorithm`. Note that `xai.ALG.LIME` is the default tabular explanation algorithm; hence this also works:

```python
explainer = ExplainerFactory.get_explainer(domain=xai.DOMAIN.TABULAR)
```

In [3]:
# Instantiate LimeTabularExplainer via the Explainer interface
explainer = ExplainerFactory.get_explainer(domain=xai.DOMAIN.TABULAR, 
                                           algorithm=xai.ALG.LIME)
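
The fallback to a per-domain default can be pictured as a simple table lookup. This is a sketch under assumptions, not the contents of `xai/explainer/config.py`: the string keys, the `DEFAULTS` and `SUPPORTED` tables, and `resolve_algorithm` are hypothetical, and only the tabular-to-LIME default is stated in this tutorial.

```python
# Assumed default table; only the tabular -> LIME entry comes from the tutorial.
DEFAULTS = {"tabular": "lime"}

# Hypothetical registry of the algorithms available for each domain.
SUPPORTED = {"tabular": {"lime"}}


def resolve_algorithm(domain, algorithm=None):
    """Return the algorithm to load, falling back to the domain default."""
    if domain not in SUPPORTED:
        raise ValueError(f"Unknown domain: {domain!r}")
    if algorithm is None:
        algorithm = DEFAULTS[domain]
    if algorithm not in SUPPORTED[domain]:
        raise ValueError(f"{algorithm!r} is not registered for domain {domain!r}")
    return algorithm


implicit = resolve_algorithm("tabular")
explicit = resolve_algorithm("tabular", algorithm="lime")
print(implicit == explicit)
```

Passing `algorithm` explicitly, as the cell above does, keeps the notebook's behaviour stable if the configured default ever changes.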

### Step 4: Build the explainer

`build_explainer` calls the explanation algorithm's initialization routine, which can include things like setting parameters or running a pre-training loop. The `LimeTabularExplainer` requires the following parameters:

* training_data (np.ndarray): 2d Numpy array representing the training data
    (or some representative subset) (**required**)
* mode (str): Whether the problem is 'classification' or 'regression' (**required**)
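* predict_fn (function): A function that wraps the target model's prediction function - it takes in a numpy array of samples and, for classification, outputs a vector of probabilities which should sum to 1; for regression, as in this tutorial, it outputs the predicted values (here `clf.predict`) (**required**)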

In this tutorial, we set mode to `xai.MODE.REGRESSION`. We also provide the following parameters:

* training_labels (list): Training labels, which can be used by the continuous feature
    discretizer
* feature_names (list): The names of the columns of the training data
* categorical_features (list): Integer list indicating the indices of categorical features

In [4]:
categorical_features = np.argwhere(np.array([len(set(raw_data.data[:,x])) 
                                             for x in range(raw_data.data.shape[1])]) <= 10).flatten()

explainer.build_explainer(
    training_data=X_train,
    training_labels=y_train,
    mode=xai.MODE.REGRESSION,
    predict_fn=clf.predict,
    feature_names=raw_data['feature_names'],
    categorical_features=categorical_features
)
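
The `categorical_features` expression above is a heuristic: any column with at most 10 distinct values is treated as categorical. A plain-Python sketch of the same rule follows, using a small made-up table; `low_cardinality_columns` is a hypothetical helper name, not part of XAI.

```python
def low_cardinality_columns(rows, max_unique=10):
    """Return the indices of columns with at most `max_unique` distinct values."""
    columns = zip(*rows)  # transpose rows into columns
    return [idx for idx, col in enumerate(columns) if len(set(col)) <= max_unique]


# Made-up rows: column 0 is binary, column 1 is continuous, column 2 has 3 levels
rows = [
    [0.0, 6.5, 4.0],
    [1.0, 5.9, 4.0],
    [0.0, 7.1, 24.0],
    [0.0, 6.2, 5.0],
]

print(low_cardinality_columns(rows, max_unique=3))
```

The threshold is a judgment call. A low-cardinality numeric column such as a count may still be better treated as continuous, so it is worth checking the selected indices against the feature names before building the explainer.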

### Step 5: Generate some explanations

Once we build the explainer, we can start generating some explanations via the `explain_instance` method. The `LimeTabularExplainer` requires one argument:
* instance (np.ndarray): A 1D numpy array corresponding to a row/single example (**required**)

You can also pass the following:

* labels (list): The list of class indexes to produce explanations for
* top_labels (int): If not None, this overwrites labels and the explainer instead produces
    explanations for the top k classes
* num_features (int): Number of features to include in an explanation
* num_samples (int): The number of perturbed samples to train the LIME model with
* distance_metric (str): The distance metric to use for weighting the loss function


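We restrict explanations to 5 features (meaning only 5 features will have scores attached to them). For classification, the output of `explain_instance` is a dictionary that maps each class to two things: the confidence of the model and a list of explanations. For regression, as in this tutorial, the output is a single dictionary with the model's `prediction` and a list of feature/score pairs under `explanation`.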

In [5]:
exp = explainer.explain_instance(
    instance=X_test[1],
    top_labels=2,
    num_features=5)

pprint(exp)

{'explanation': [{'feature': '7.17 < LSTAT <= 11.68',
                  'score': 2.3211043718179445},
                 {'feature': 'PTRATIO <= 17.40', 'score': 0.5885472330074492},
                 {'feature': 'RAD=4', 'score': -0.4687921251480069},
                 {'feature': 'DIS > 5.10', 'score': -0.5352117675978874},
                 {'feature': '6.19 < RM <= 6.63', 'score': -1.399854197466358}],
 'prediction': 25.04439999999986}
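
To read this output, it helps to rank the features by the magnitude of their score and split them by sign. The sketch below reuses the dictionary printed above. It assumes the usual LIME reading, in which each score is a weight in a local linear surrogate model, so a positive score pushes the prediction up and a negative score pushes it down in the neighbourhood of this instance.

```python
# The explanation printed above, copied verbatim
exp = {
    "explanation": [
        {"feature": "7.17 < LSTAT <= 11.68", "score": 2.3211043718179445},
        {"feature": "PTRATIO <= 17.40", "score": 0.5885472330074492},
        {"feature": "RAD=4", "score": -0.4687921251480069},
        {"feature": "DIS > 5.10", "score": -0.5352117675978874},
        {"feature": "6.19 < RM <= 6.63", "score": -1.399854197466358},
    ],
    "prediction": 25.04439999999986,
}

# Rank by absolute score so strong negative effects are not buried at the bottom
ranked = sorted(exp["explanation"], key=lambda item: abs(item["score"]), reverse=True)

pushes_up = [item["feature"] for item in ranked if item["score"] > 0]
pushes_down = [item["feature"] for item in ranked if item["score"] < 0]

print(f"prediction: {exp['prediction']:.2f}")
for item in ranked:
    print(f"{item['feature']:<24} {item['score']:+.2f}")
```

Ranked this way, the `LSTAT` bin is the strongest contributor and the `RM` bin is second, even though the latter appears last in the printed list.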


### Step 6: Save and load the explainer

Finally, every XAI explainer supports saving and loading functions.

In [6]:
# Save the explainer somewhere

explainer.save_explainer('artefacts/lime_tabular_regression.pkl')

In [7]:
# Load the saved explainer in a new Explainer instance

new_explainer = ExplainerFactory.get_explainer(domain=xai.DOMAIN.TABULAR, algorithm=xai.ALG.LIME)
new_explainer.load_explainer('artefacts/lime_tabular_regression.pkl')

exp = new_explainer.explain_instance(
    instance=X_test[0],
    top_labels=2,
    num_features=5)

pprint(exp)

{'explanation': [{'feature': 'DIS <= 2.06', 'score': 1.2463072456971418},
                 {'feature': '77.95 < AGE <= 94.15',
                  'score': -0.6001661104431496},
                 {'feature': 'NOX > 0.64', 'score': -0.7486631874520833},
                 {'feature': 'RM <= 5.88', 'score': -3.018462829585109},
                 {'feature': 'LSTAT > 17.15', 'score': -6.056198593262318}],
 'prediction': 8.738599999999975}
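
The `.pkl` extension suggests a pickle-based format, although the tutorial does not say how `save_explainer` stores its state. The sketch below shows a generic pickle round trip under that assumption, using a hypothetical `state` dictionary in place of a real explainer. It also creates the target directory first, which may be needed if `save_explainer` does not create missing directories (not verified here).

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for whatever a built explainer needs to persist
state = {"feature_names": ["LSTAT", "RM"], "mode": "regression"}

with tempfile.TemporaryDirectory() as tmp_dir:
    target_dir = os.path.join(tmp_dir, "artefacts")
    os.makedirs(target_dir, exist_ok=True)  # no error if it already exists
    path = os.path.join(target_dir, "toy_explainer.pkl")

    # Save
    with open(path, "wb") as fh:
        pickle.dump(state, fh)

    # Load into a fresh object
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == state)
```

Unpickling can execute arbitrary code, so saved explainers should only be loaded from sources you trust.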


### Step 7: Integration with `xai.model.interpreter.model_interpreter.ModelInterpreter`

You can also aggregate explanations across a set of data (e.g. the training set) by using the `ModelInterpreter` class.

In [8]:
from xai.model.interpreter.model_interpreter import ModelInterpreter

model_interpreter = ModelInterpreter(domain=xai.DOMAIN.TABULAR,
                                     algorithm=xai.ALG.LIME)

In [10]:
model_interpreter.build_interpreter(
    training_data=X_train,
    training_labels=y_train,
    mode=xai.MODE.REGRESSION,
    predict_fn=clf.predict,
    feature_names=raw_data['feature_names'],
    class_names=categorical_features
)
stats = model_interpreter.interpret_model(samples=X_train, stats_type='average_ranking', k=10)
stats


({0: {'CHAS <= 0.00': 4.814356435643564,
   'ZN <= 0.00': 2.910891089108911,
   'LSTAT <= 7.17': 2.4306930693069306,
   'RM > 6.63': 2.4034653465346536,
   '7.17 < LSTAT <= 11.68': 2.1064356435643563,
   '345.00 < TAX <= 666.00': 1.8836633663366336,
   '5.00 < RAD <= 24.00': 1.7673267326732673,
   'DIS <= 2.06': 1.5495049504950495,
   '9.90 < INDUS <= 18.10': 1.5074257425742574,
   '19.10 < PTRATIO <= 20.20': 1.4628712871287128,
   'RAD <= 4.00': 1.4603960396039604,
   '0.45 < NOX <= 0.54': 1.381188118811881,
   'CRIM > 3.84': 1.2920792079207921,
   'PTRATIO <= 17.40': 1.1707920792079207,
   'B > 395.81': 1.146039603960396,
   'NOX <= 0.45': 1.1262376237623761,
   '2.06 < DIS <= 3.10': 1.1212871287128714,
   'AGE <= 44.60': 1.113861386138614,
   'TAX <= 283.25': 1.0965346534653466,
   '0.09 < CRIM <= 0.29': 1.056930693069307,
   '0.54 < NOX <= 0.64': 1.051980198019802,
   '44.60 < AGE <= 77.95': 1.0123762376237624,
   '17.40 < PTRATIO <= 19.10': 1.00990099009901,
   'INDUS <= 5.29': 0.

The first level of the stats is the class, which in the case of regression is just a dummy `0`.
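
Given that layout, the aggregated values can be pulled out by indexing the dummy class `0` and sorting. The sketch below copies the first few entries from the output above into a plain dictionary; `top_features` is a hypothetical helper. It sorts in descending order to match the printed output, without assuming what a larger `average_ranking` value means.

```python
# First four entries of the printed output, keyed by the dummy class 0
stats_by_class = {
    0: {
        "CHAS <= 0.00": 4.814356435643564,
        "ZN <= 0.00": 2.910891089108911,
        "LSTAT <= 7.17": 2.4306930693069306,
        "RM > 6.63": 2.4034653465346536,
    }
}


def top_features(stats_by_class, class_key=0, n=3):
    """Return the n (feature, value) pairs with the largest values for one class."""
    ranked = sorted(
        stats_by_class[class_key].items(),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


for feature, value in top_features(stats_by_class, n=3):
    print(f"{feature:<16} {value:.3f}")
```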