# Overview 

This notebook includes:
- A LIME explanation for a regression prediction on a tabular dataset
- The regression task: predicting disease progression for patients with diabetes
- A demo explaining a Random Forest model with `LimeTabularExplainer`
- Coding challenge: implement the SHAP explanation for the same prediction

## Background
**Implementation of LIME for Regression**

The source code of LIME is available on [GitHub](https://github.com/marcotcr/lime).

In this notebook we use a Random Forest regressor.

The regression task is to predict a quantitative measure of disease progression one year after baseline for patients with diabetes.

## Acknowledgement
This example is based on the LIME regression tutorial.

The source code can be found at: https://github.com/marcotcr/lime/blob/master/doc/notebooks/Using%20lime%20for%20regression.ipynb

In [None]:
# Install lime package using pip package manager in the current jupyter environment
!pip install lime

Collecting lime
  Downloading https://files.pythonhosted.org/packages/f5/86/91a13127d83d793ecb50eb75e716f76e6eda809b6803c5a4ff462339789e/lime-0.2.0.1.tar.gz (275kB)

In [None]:
# Load dataset and import other required packages.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
import sklearn.model_selection
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

RANDOM_SEED = 426

## Load Data
  

In [None]:
diabetes = load_diabetes()
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra

In [None]:
# Instantiate a random forest regressor to act as the black-box model
rf = RandomForestRegressor(n_estimators=100, random_state=RANDOM_SEED)

In [None]:
# Split the dataset into training (80%) and testing (20%) sets.
# No random_state is passed here, so the partition changes from run to run.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    diabetes.data, diabetes.target, test_size=0.20)
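Because the split is unseeded, the model and the error figures reported further down differ between runs. A minimal sketch of a seeded 80/20 split using only NumPy follows. The helper name `seeded_split` and the toy arrays are assumptions for illustration; in the notebook itself, passing `random_state=RANDOM_SEED` to `train_test_split` would likely serve the same purpose.

```python
import numpy as np

RANDOM_SEED = 426  # same seed value the notebook defines


def seeded_split(X, y, test_size=0.20, seed=RANDOM_SEED):
    """Hypothetical helper: shuffle row indices with a seeded RNG, then cut."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))  # round the test count up
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy stand-in data with the diabetes dataset's shape (442 rows, 10 features).
X_demo = np.arange(442 * 10, dtype=float).reshape(442, 10)
y_demo = np.arange(442, dtype=float)

split_a = seeded_split(X_demo, y_demo)
split_b = seeded_split(X_demo, y_demo)

# The same seed yields the same partition on every call.
print(np.array_equal(split_a[1], split_b[1]))
print(split_a[0].shape, split_a[1].shape)
```

With a fixed seed, the instance indices and error metrics stay comparable across reruns of the notebook.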

In [None]:
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=426, verbose=0, warm_start=False)

In [None]:
# The regressor predicts the disease-progression score for a single test instance
rf.predict([X_test[14]])

array([86.66])

In [None]:
print('Random Forest Mean Squared Error', np.mean((rf.predict(X_test) - y_test) ** 2))

Random Forest Mean Squared Error 3345.4153741573036


In [None]:
print('Mean Squared Error when predicting the mean', np.mean((y_test.mean() - y_test) ** 2))

Mean Squared Error when predicting the mean 5205.979548036865
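
The two error figures above are easier to compare as a ratio. A minimal sketch, using only the values printed in the two preceding cells, of the fraction of squared error the random forest removes relative to always predicting the test-set mean. Because the baseline uses the test-set mean, this ratio is the usual R² on the test set. The variable names are illustrative.

```python
# Values copied from the notebook output above.
mse_forest = 3345.4153741573036    # random forest on the test set
mse_baseline = 5205.979548036865   # always predicting the test-set mean

# Fraction of the baseline's squared error removed by the model.
# With the baseline defined on the test-set mean, this equals R^2.
r_squared = 1.0 - mse_forest / mse_baseline
print(f"R^2 on the test set: {r_squared:.3f}")
```

A value above zero means the forest beats the constant baseline; the exact figure depends on the random split.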


In [None]:
# kernel_width is the width of the exponential kernel that weights perturbed samples
# by their distance to the instance; it controls how local the explanation is.
# None selects the default, sqrt(number of columns) * 0.75.

# feature_selection (not set here, so the default applies) is the strategy LIME uses
# to select the most important features for the explanation.
# discretize_continuous (default True) discretizes all non-categorical features into quartiles.
# class_names names the output; a single placeholder is given for the regression target.
# verbose=True prints the local model's intercept and prediction for each explanation.
# mode='regression' generates explanations for regressors.

explainer = LimeTabularExplainer(X_train, 
                                 feature_names=diabetes.feature_names, 
                                 class_names=['y'], 
                                 kernel_width = None,
                                 verbose=True, 
                                 mode='regression')
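
To make the `kernel_width` comment concrete: with `kernel_width=None`, the default works out to sqrt(10) * 0.75 for the ten diabetes features. The sketch below computes that value and an exponential kernel that turns a distance from the explained instance into a sample weight. The kernel formula follows my reading of the LIME source and should be treated as an assumption; the function name `proximity_weight` is illustrative.

```python
import math

n_features = 10  # the diabetes dataset has ten feature columns
kernel_width = math.sqrt(n_features) * 0.75  # default when kernel_width=None


def proximity_weight(distance, width=kernel_width):
    """Exponential kernel: weight 1 at distance 0, decaying as distance grows."""
    return math.sqrt(math.exp(-(distance ** 2) / width ** 2))


print(f"default kernel_width: {kernel_width:.3f}")
for d in (0.0, 1.0, 2.0, 4.0):
    print(f"distance {d:.1f} -> weight {proximity_weight(d):.3f}")
```

A smaller width makes the weights fall off faster, so the local model is fitted to a tighter neighbourhood around the instance.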

In [None]:
# Pick a random test instance to explain (unseeded, so it changes between runs).
idx = np.random.randint(0, X_test.shape[0])

# data_row is the test instance that LIME is going to explain.
# predict_fn is the function used to obtain the model's predictions.
# num_features is the maximum number of features shown in the explanation.
# num_samples (not set here; defaults to 5000) is the number of perturbed samples
# LIME generates to train the local model.
exp = explainer.explain_instance(data_row=X_test[idx],
                                 predict_fn=rf.predict,
                                 num_features=5)

Intercept 171.17031478557394
Prediction_local [100.66349281]
Right: 80.42


**Understanding the Explanations**

The list of features and weights below is the explanation generated by LIME. The weights are the coefficients of a weighted linear model fitted locally around the instance. In the plot, red bars show negative coefficients and green bars show positive coefficients.

For a regression model, a positive coefficient means the feature pushes the predicted value up, and a negative coefficient means the feature pushes the predicted value down.

The length of each bar represents how strongly the feature contributes to the regressor's prediction for this instance.
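
The weighted linear fit described above can be sketched with NumPy alone. This is a simplified illustration, not LIME's implementation: it perturbs the instance with Gaussian noise, skips discretization and feature selection, uses a toy function `black_box` in place of the random forest, and solves plain weighted least squares rather than the ridge regression LIME uses by default. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def black_box(X):
    """Toy stand-in for rf.predict: nonlinear in feature 0, linear in feature 1."""
    return 150.0 + 40.0 * np.tanh(3.0 * X[:, 0]) - 25.0 * X[:, 1]


instance = np.array([0.2, -0.1, 0.5])  # hypothetical row to explain
kernel_width = np.sqrt(instance.size) * 0.75

# 1. Sample perturbed points around the instance.
samples = instance + rng.normal(scale=0.5, size=(5000, instance.size))

# 2. Weight each sample by its proximity to the instance.
distances = np.linalg.norm(samples - instance, axis=1)
weights = np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))

# 3. Fit a weighted linear model: scale rows by sqrt(weight), then least squares.
design = np.column_stack([np.ones(len(samples)), samples])
root_w = np.sqrt(weights)
coef, *_ = np.linalg.lstsq(design * root_w[:, None],
                           black_box(samples) * root_w, rcond=None)

intercept, feature_weights = coef[0], coef[1:]
for i, w in enumerate(feature_weights):
    direction = "raises" if w > 0 else "lowers"
    print(f"feature {i}: weight {w:+.2f} ({direction} the local prediction)")
```

In this toy setup, feature 0 should receive a positive weight, feature 1 a negative one, and feature 2, which the toy model ignores, a weight close to zero.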

In [None]:
# Visualize local feature importance
import matplotlib.pyplot as plt
exp.as_pyplot_figure()
plt.tight_layout()

In [None]:
# Plot an explanation generated with LIME
exp.show_in_notebook(show_table=True)

In [None]:
# Print explanations as a list.
exp.as_list()

# Explore LIME Package Documentation

In [None]:
help(LimeTabularExplainer)

# Coding Challenge:
- Implement the SHAP explanation for the same prediction.
- For the Random Forest, obtain the global feature importance.
- Compare the feature explanations given by LIME, SHAP, and the Random Forest.
- Experiment with changing the number of samples (`num_samples`) and the kernel width (`kernel_width`) for the LIME explanations.