## Description of the Data Passed

In [None]:
display(Markdown(f"A total of {len(ids)} examples were passed to the explainer."))

In [None]:
# Show the selected IDs inline only when the list is short enough to read.
if len(ids) <= 100:
    msg = f"The following examples were selected for explanation: {list(ids.values())}"
else:
    msg = ("Too many examples (>100) were selected to be displayed here. "
           f"Please refer to the `{experiment_id}_ids.pkl` file in the `output` directory.")

display(Markdown(msg))

### Background Data

The background data determines the *base value* of your SHAP explanations. Specifically, the base value is the average model prediction over the background data passed to `rsmexplain`.

SHAP computes feature contributions by replacing feature values in the data you wish to explain with values sampled from the background set and measuring the resulting changes in the prediction. The background data therefore defines the 'baseline' or 'average' state of each feature, and the SHAP values shown in the report below quantify the impact of a feature moving from that baseline to its current value. As long as **sufficiently large and diverse** background data is used, the SHAP values for individual examples in the explain data can be considered reliable.
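To make the roles of the base value and the background data concrete, here is a minimal sketch on a toy linear model. Everything in it (`WEIGHTS`, `BIAS`, `predict`, `background`, `example`) is hypothetical and is not how `rsmexplain` computes its values. It also relies on a shortcut that is exact only for linear models: swapping one feature at a time for its background values. General SHAP explainers average over many combinations of features instead.

```python
from statistics import fmean

# Hypothetical linear model: f(x) = BIAS + sum(w_i * x_i)
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = 1.0


def predict(x):
    """Return the toy model's prediction for one feature vector."""
    return BIAS + sum(w * v for w, v in zip(WEIGHTS, x))


# Hypothetical background data (3 rows, 3 features) and one example to explain.
background = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 4.0],
    [2.0, 4.0, 0.0],
]
example = [4.0, 1.0, 6.0]

# Base value: the average model prediction over the background data.
base_value = fmean(predict(row) for row in background)


def shap_value(feature_idx):
    """Average change in prediction when one feature is swapped for background values.

    This one-feature-at-a-time shortcut matches the SHAP value only because
    the toy model is linear (no feature interactions).
    """
    changes = []
    for row in background:
        perturbed = list(example)
        perturbed[feature_idx] = row[feature_idx]
        changes.append(predict(example) - predict(perturbed))
    return fmean(changes)


shap_values = [shap_value(i) for i in range(len(example))]

print("base value:", base_value)
print("SHAP values:", shap_values)
print("base + contributions:", base_value + sum(shap_values))
print("model prediction:", predict(example))
```

In this sketch the base value plus the per-feature contributions adds up to the model's prediction for the example, which is the additivity property that SHAP explanations are built around.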

However, running `rsmexplain` on very large background datasets can be computationally expensive and time-consuming. One way to reduce this cost is to "summarize" the background dataset with k-means clustering, which shrinks it without losing too much information. To do this, we run a k-means clustering algorithm on the background dataset, which yields `k` clusters of feature values, each represented by a "centroid" (the average of all data points in the cluster). These centroids are then used to compute the SHAP values, i.e., when we "omit" a feature and replace its value, we do so by sampling from the centroids rather than from the original dataset.
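The summarization step can be sketched in plain Python as follows. This is an illustrative implementation of standard k-means (Lloyd's algorithm), not the code `rsmexplain` itself runs. The function name `kmeans_summary`, the evenly spaced initialization, and the toy data are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_summary(rows, k, n_iter=20):
    """Summarize `rows` by k centroids and the share of rows in each cluster."""
    # Simplified, deterministic initialization: k evenly spaced rows.
    step = len(rows) // k
    centroids = [list(rows[i * step]) for i in range(k)]
    counts = [0] * k

    for _ in range(n_iter):
        # Assignment step: attach each row to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda j: squared_distance(row, centroids[j]))
            clusters[nearest].append(row)

        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid (and gets weight 0).
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        counts = [len(members) for members in clusters]

    weights = [count / len(rows) for count in counts]
    return centroids, weights


# Hypothetical background data: two well-separated groups of four rows each.
background = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
]

centroids, weights = kmeans_summary(background, k=2)
print("centroids:", centroids)
print("weights:", weights)
```

Here eight background rows are reduced to two centroids, and the weights record how much of the original data each centroid stands for. Keeping those weights lets a weighted average over the centroids reproduce the original feature means.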

`rsmexplain` applies this trick by default. The number of clusters `k` can be specified via the `background_kmeans_size` option in its configuration file. The default value is 500, which has been shown to represent a good compromise between accuracy and speed.

In [None]:
if background_kmeans_size:
    msg = (f"For this run, a value of {background_kmeans_size} was used for `background_kmeans_size`. "
           "A smaller value will be faster but less accurate.")
else:
    msg = ("For this run, the default value of 500 was used for `background_kmeans_size`. "
           "A smaller value will be faster but less accurate.")
display(Markdown(msg))