## Practical: LIME basics

Welcome to the first practical. This will take you through the basics of using `lime`. We'll use a larger dataset in the second practical, but for ease of visualisation we'll use a simple `sklearn` dataset here.

Let's begin by importing `lime` and one of its submodules, `lime.lime_tabular`. As the name suggests, this submodule is aimed at tabular data: we'll be using a continuous dataset, which can be treated as tabular for the purposes of `lime`.

In [None]:
import lime
import lime.lime_tabular

We should also import a handful of useful processing and pre-processing libraries. Most of these you should recognise: `pandas` streamlines data handling, `numpy` is a numerical array and linear algebra library, and `sklearn`, or scikit-learn, is Python's most widely used data analysis and machine learning library. `matplotlib` makes plots.

In [None]:
import pandas as pd
import numpy as np
import sklearn
import sklearn.datasets, sklearn.ensemble, sklearn.metrics, sklearn.model_selection
import matplotlib.pyplot as plt

We're also going to want to plot in the notebook, and make the plots nice and large.

In [None]:
%matplotlib inline
plt.rcParams['figure.figsize'] = [12, 8]

First of all, our data. We'll be using the half-moons dataset, which `sklearn.datasets.make_moons()` generates. Each data point $i$ consists of a real vector of length two, $[x_{i1}, x_{i2}]$, and a class label $y_i \in \{0,1\}$.

**Q1.** Generate a dataset of 10,000 data points. You'll want to use the `n_samples` and `noise` parameters. I recommend a noise value of 0.2, but feel free to experiment. 

In [None]:
#Add your code here..
X, y = sklearn.datasets.make_moons(n_samples=10000, noise=0.2)


**Q2.** Plot the first 100 points. Make sure to colour the two classes differently.

(**Hint**: `plt.scatter(x_val,y_val)` should get you started)

In [None]:
#Add your code here..
col = y[0:100]
z = X[0:100,:]

plt.scatter(
  x=z[:, 0],
  y=z[:, 1],
  c=col,
  cmap="viridis",
  alpha=0.8,
  s=60
);


**Things to think about**: how well do you think LIME is going to do with this dataset? Are there any points of those you've plotted where you think LIME's explanation might not be that informative?

Now let's simulate our black box model. For this, we're going to use a random forest with 200 trees. To get an idea of how well our random forest does on the data, we should split the dataset into a test set and a training set (we should **always** do this for any ML task, unless we have an extremely good reason not to).

In [None]:
train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(X, y, test_size=0.20)

We initialise the classifier as follows:

In [None]:
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=200)

This creates an instance of the classifier. As with most sklearn models it has a `.fit()` method and a `.predict()` method.

**Q3.** Remembering to keep your `train` and `test` separate, fit the classifier to the data.

In [None]:
#Add your code here..
clf.fit(train, labels_train)


**Q4.** Now find the precision, recall and F-score of your trained classifier on the test set.

(**Hint**: `sklearn` has a couple of useful functions; `sklearn.metrics.accuracy_score()`, and `sklearn.metrics.classification_report()` are both quick to use)

In [None]:
#Add your code here..
pred_test = clf.predict(test)  # predict once and reuse for both metrics
print(sklearn.metrics.accuracy_score(labels_test, pred_test))
print(sklearn.metrics.classification_report(labels_test, pred_test))
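
If the columns of the classification report are unfamiliar, it can help to compute precision, recall and F1 by hand once. The snippet below is a minimal sketch on a tiny set of made-up labels (they are not from the half-moons data), treating class 1 as the positive class.

```python
import numpy as np

# Toy labels, made up purely for illustration.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives

precision = tp / (tp + fp)  # of the points predicted 1, how many really are 1
recall = tp / (tp + fn)     # of the points that really are 1, how many we found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

The report prints these same quantities once per class, each time treating that class as the positive one.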


Hopefully, your random forest does pretty well on the half-moons dataset. It should: this isn't a complex dataset. My model gets an F1 score of 0.97; you should expect to be in this sort of range. 

**A question to consider**: if the black box model performs poorly on the dataset of interest, are the explanations less or more useful? What does it depend upon?

**Q5.** Now we have the black box model, we need a point of interest. From the first 100 data points, pick a point of interest, and plot its location relative to the rest of those 100 points.

In [None]:
#Add your code here..
indx = 8

plt.scatter(
  x=z[:, 0],
  y=z[:, 1],
  c=col,
  cmap="viridis",
  alpha=0.8,
  s=60
)

plt.scatter(z[indx, 0], z[indx, 1], c='black', s=80, marker='x');


Similar to the `sklearn` random forest, to get an explanation we first need to initialise an instance of `lime.lime_tabular.LimeTabularExplainer`, whose main input is the training data.


**Q6.** Initialise an instance of the `LimeTabularExplainer` and set the parameter `discretize_continuous=False`; discretising continuous features is necessary for complex data, but not for us.

Call your initialised instance of the `LimeTabularExplainer` `explainer`.

(**Note**: the explainer doesn't use the training data to 'train', per se. It uses it to estimate the distribution of the data: summary statistics of each feature guide how perturbed samples are drawn and scaled, and if it was discretising continuous inputs, it would also use the training data to choose an appropriate discretisation.)

In [None]:
#Add your code here..
explainer = lime.lime_tabular.LimeTabularExplainer(train, discretize_continuous=False)


Now we can explain our point! The `explainer` instance has an `.explain_instance()` method, which we call with the point we want to explain.

**Q7.** Call `.explain_instance()` on your point of interest, and also pass it the black box model's class-probability method, `.predict_proba`. We pass the function itself rather than its result, so that `.explain_instance()` can call it as needed; in Python that means leaving off the `()`.

In [None]:
#Add your code here..
exp = explainer.explain_instance(z[indx, :], clf.predict_proba)  # explain the point of interest chosen in Q5
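
The difference between handing over a function and calling it is easy to miss, so here is a small sketch that has nothing to do with `lime` itself. The names `double` and `apply_to_all` are hypothetical: `double` stands in for a model's prediction method, and `apply_to_all` stands in for any routine that needs to call that method many times on inputs of its own choosing.

```python
def double(x):
    """Stand-in for a model's prediction method (hypothetical example)."""
    return 2 * x


def apply_to_all(func, values):
    # `func` arrives as a function object; it is only called here, inside the loop.
    return [func(v) for v in values]


# Pass the function itself: no parentheses after `double`.
result = apply_to_all(double, [1, 2, 3])
print(result)

# `double` is callable; `double(1)` is just a number and could not be called again.
print(callable(double), callable(double(1)))
```

Writing `apply_to_all(double(1), [1, 2, 3])` instead would pass the number `2`, and the call inside the loop would fail with a `TypeError`.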


`.explain_instance` returns an explanation object. 

There are a variety of methods that can be used to inspect the explanation object. For example, we can call `.as_pyplot_figure()` to plot a bar graph of relative weights. This isn't super informative, given we only have two features, but run it now, and see if the weights make sense to you, relative to where the point you picked is located.

In [None]:
exp.as_pyplot_figure()
plt.tight_layout()

**Q8.** The explanation object also has a more detailed method called `.as_list()`. Try this now, to get an idea of what it returns.

In [None]:
#Add your code here..
exp.as_list()
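
`.as_list()` returns a list of `(feature description, weight)` pairs, and the order of the pairs is not guaranteed to follow the column order of the data. The sketch below uses made-up weights, and assumes the feature descriptions are the column indices as strings (which is what we'd expect here, since we gave the explainer no feature names and switched discretisation off). It shows one way to put the weights back into column order.

```python
# Made-up values in the shape `.as_list()` returns: (feature description, weight) pairs.
example_list = [("1", -0.21), ("0", 0.34)]

# Sort by feature description so the weights line up with the column order.
ordered = sorted(example_list, key=lambda pair: pair[0])
weights = [weight for _, weight in ordered]
print(weights)
```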


For the final part of this practical, we're going to use the `as_list()` method to make a pretty plot of explanations for all 100 of the points we plotted at the start. Here's some code to get you started:

In [None]:
def get_LIME_coeffs(points, explainer, clf):
    """Return [weight_0, weight_1, intercept] for the LIME explanation of each point."""
    output = []
    for point in points:
        exp = explainer.explain_instance(point, clf.predict_proba, num_features=2, top_labels=1)
        # Read the explained class from the explanation itself: in an irritating
        # edge case it can differ from clf.predict([point]).
        label = list(exp.intercept.keys())[0]
        intercept = exp.intercept[label]
        # Sort by feature name so the weights are always in column order.
        coefs = [w for _, w in sorted(exp.as_list(label), key=lambda x: x[0])]

        coefs.append(intercept)
        output.append(coefs)

    return output

**Q9.** Using this function, and `plt.quiver`, plot the explanations of each point as a vector starting at the point. 

**Questions to consider**: What do you notice? How satisfactory are the explanations? How would you develop a metric to put a number on how good the explanations are?

(**Note**: explaining 100 points will take quite a while: this is one of the downsides of perturbation methods. Because they have to draw a fresh set of samples for every point explained, explaining many points can become time consuming.)

In [None]:
#Add your code here..
lime_w = np.array(get_LIME_coeffs(z, explainer, clf))
viridis = plt.get_cmap('viridis', 2)  # two-colour version of viridis, one colour per class
labels = clf.predict(z)

plt.quiver(z[:,0], z[:,1], lime_w[:,0], lime_w[:,1], color=viridis(labels))
plt.show()
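
To see why every explained point needs its own batch of samples, it may help to look at a stripped-down version of the idea behind a perturbation-based local explanation. The following is a simplified sketch, not `lime`'s implementation: the real library differs in details such as how it samples and scales the data, the exact kernel, and the regularised linear model it fits. The names `black_box_proba` and `local_linear_explanation`, and the default parameter values, are assumptions made for this illustration.

```python
import numpy as np


def black_box_proba(X):
    """Hypothetical stand-in for clf.predict_proba: P(class 1) rises with x1 - x2."""
    p1 = 1.0 / (1.0 + np.exp(-3.0 * (X[:, 0] - X[:, 1])))
    return np.column_stack([1.0 - p1, p1])


def local_linear_explanation(point, predict_proba, num_samples=2000,
                             kernel_width=0.75, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Perturb: draw samples around the point of interest.
    samples = point + rng.normal(scale=1.0, size=(num_samples, point.shape[0]))
    # 2. Query the black box on every perturbed sample (the expensive step).
    target = predict_proba(samples)[:, 1]
    # 3. Weight each sample by its proximity to the point (exponential kernel).
    dist = np.linalg.norm(samples - point, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Fit a weighted linear model: scale rows by sqrt(weight), then least squares.
    design = np.column_stack([samples, np.ones(num_samples)])
    sw = np.sqrt(weights)
    solution, *_ = np.linalg.lstsq(design * sw[:, None], target * sw, rcond=None)
    return solution[:-1], solution[-1]  # coefficients, intercept


coefs, intercept = local_linear_explanation(np.array([0.5, 0.25]), black_box_proba)
print(coefs, intercept)
```

Steps 1 and 2 are repeated from scratch for each point, which is where the running time goes. For this toy black box the first coefficient should come out positive and the second negative, matching the way the stand-in probability was built.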
