In this class, we will talk about model explainability, but more in the context of data explainability or root cause analysis. In many cases, building a very good machine learning model is not the ultimate goal. What is really wanted is an understanding of the data. A factory wants to know why the product is plagued with a defect, not to predict afterward whether there is a defect or not. A football team wants to know which position is the best for scoring a goal, not what the probability of scoring from a given position is. And even when they want a prediction, they would love to see a justification so that they can trust the model. Often a nice plot is worth more than sophisticated machine-learning approaches.

In [None]:
import dalex as dx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor

In [None]:
data = load_wine()

In [None]:
df = pd.DataFrame(data["data"], columns=data["feature_names"])
y = data["target"]
df["target"] = y

You should already be familiar with many data visualization techniques, so we will not practice them now. I just want to share a less popular type of data analysis. Usually, plotting the target against a single feature is not helpful, but after some modification we might be able to see some patterns.

In [None]:
plt.plot(df.flavanoids, y, "bo")

For each value, we can plot the average target for data:
 - below that value
 - above that value
 - around that value

Please note that for the "above that value" line, the further left we go, the larger the fraction of the data that is covered. The same holds for the "below that value" line as we move to the right.
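The three curves can be sketched on a toy array as follows. This is only an illustration under assumptions: the names `values`, `target`, and `window` are made up for the example, and the data is assumed to be already sorted by the feature.

```python
import numpy as np

# Toy data, already sorted by the feature value (illustrative only).
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
target = np.array([0.0, 0.0, 1.0, 1.0, 2.0])

# Average target over all rows with feature <= v, for each v.
below = np.array([target[values <= v].mean() for v in values])

# Average target over all rows with feature >= v, for each v.
above = np.array([target[values >= v].mean() for v in values])

# Average target in a window of 3 neighbouring rows.
# mode="valid" keeps only positions where the window fits entirely,
# so no implicit zero padding enters the average at the edges.
window = 3
around = np.convolve(target, np.ones(window) / window, mode="valid")

print(below)   # first entry uses one row, last entry uses all rows
print(above)   # first entry uses all rows, last entry uses one row
print(around)
```

With `mode="same"` the output has the same length as the input, but the values near both ends are averaged together with implicit zeros, so the ends of the "around" curve are pulled toward zero.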

In [None]:
for col in df.columns.drop("target"):
    tmp = df.sort_values(col)
    plt.title(col)
    plt.plot(
        tmp[col],
        tmp[col].apply(lambda x: tmp[tmp[col] <= x].target.mean()),
        label="<=",
    )
    plt.plot(
        tmp[col],
        tmp[col].apply(lambda x: tmp[tmp[col] >= x].target.mean()),
        label=">=",
    )
    plt.plot(
        tmp[col],
        np.convolve(np.ones(20) / 20, tmp.target, mode="same"),
        label="~=~",
    )
    plt.legend()
    plt.grid()
    plt.show()

Ok, let's just train a model. We are not interested in top performance right now, so we will skip hyperparameter optimization. Also, we want to find patterns in the data we have, so we do not split the data into training and test sets.

In [None]:
model = RandomForestRegressor()
x = df.drop("target", axis=1)
y = df.target
model.fit(x, y)

In [None]:
plt.plot(df.target, model.predict(x), "bo")

Dalex is a Python package for model explainability. We will use some of its functions to understand the data and the model better. First, we need to create an explainer object. Since we are not interested in checking the model performance but in the relation between the data and the target, we will use the whole dataset here. If we wanted to check the model performance, we would use the test set instead.

In [None]:
exp = dx.Explainer(model, x, y)

In [None]:
fi = exp.model_parts()

The first step is feature importance. It is a basic analysis in which we calculate the global impact of each feature. The idea behind the default approach in dalex is to measure how much the model performance worsens after a feature is removed. Actually removing it would require retraining the model, and the optimal set of hyperparameters might then be different, which could affect the results. To avoid these problems we do not retrain the model. Instead, we simulate the removal by replacing the feature with random values. To keep the values realistic they are not completely random: we shuffle the column in the dataframe, make predictions, check the performance, and repeat these steps multiple times.
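The shuffling procedure can be sketched by hand as below. This is a minimal illustration, not dalex's implementation: the function `permutation_importance`, the toy model `toy_predict`, and the choice of mean squared error as the loss are all assumptions made for the example.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE after shuffling each column of X."""
    rng = np.random.default_rng(seed)
    base_loss = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        losses = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column only; all other columns stay intact.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            losses.append(np.mean((predict(X_perm) - y) ** 2))
        importances[j] = np.mean(losses) - base_loss
    return importances


def toy_predict(X):
    """Hypothetical model that uses only the first column."""
    return 3.0 * X[:, 0]


rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0]
importance = permutation_importance(toy_predict, X_toy, y_toy)
print(importance)
```

Only the first column gets a positive importance here, because shuffling the other two columns does not change the toy model's predictions at all.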

In [None]:
fi.plot()

Another useful tool is a partial dependence plot. For a given feature, it shows the average output of the model for different values of that feature. For each considered value, we set the feature to this value in every row of the dataframe and calculate the average prediction.
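A hand-written version of this computation might look as follows. It is a sketch under assumptions: `partial_dependence` and `toy_predict` are hypothetical names introduced for the example and do not come from dalex.

```python
import numpy as np


def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # the same value in every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)


def toy_predict(X):
    """Hypothetical model: 2 * first column + second column."""
    return 2.0 * X[:, 0] + X[:, 1]


X_toy = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])
grid = np.array([0.0, 1.0, 2.0])
pdp = partial_dependence(toy_predict, X_toy, feature=0, grid=grid)
print(pdp)
```

For this additive toy model the curve is a straight line with slope 2, shifted by the mean of the second column.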

In [None]:
exp.model_profile().plot()


We can also create similar plots for single rows. Here for each column, we present what would be the output from the model assuming we keep all remaining values and change the value of this one selected feature.
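For a single row, the same idea reduces to copying the row once per grid value and changing only one column. The sketch below assumes a hypothetical helper `ceteris_paribus` and a toy model with an interaction between two features.

```python
import numpy as np


def ceteris_paribus(predict, row, feature, grid):
    """Predictions for copies of `row` that differ only in `feature`."""
    rows = np.tile(row, (len(grid), 1))  # one copy of the row per grid value
    rows[:, feature] = grid
    return predict(rows)


def toy_predict(X):
    """Hypothetical model with an interaction between two features."""
    return X[:, 0] * X[:, 1]


row = np.array([2.0, 3.0])
profile = ceteris_paribus(
    toy_predict, row, feature=1, grid=np.array([0.0, 1.0, 4.0])
)
print(profile)
```

Unlike the averaged plot, the slope of this profile depends on the other values in the selected row (here the first feature equals 2), which is why profiles of different rows can look different.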

In [None]:
exp.predict_profile(x.iloc[[15, 80]]).plot()

SHAP values are the equivalent of Shapley values for predictive models. They estimate the effect that a particular value of a particular feature has on the prediction for a considered row. This is also done by sampling: the value is replaced with values sampled from the data and the effect on the prediction is measured.
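One common way to estimate such values is to average marginal contributions over random feature orderings. The sketch below follows that idea; it is not the dalex implementation, and `sampled_shapley`, `toy_predict`, and the single all-zero background row are assumptions made for the example.

```python
import numpy as np


def sampled_shapley(predict, x, background, n_samples=200, seed=0):
    """Monte Carlo estimate of Shapley values for a single row `x`."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[0]
    contributions = np.zeros(n_features)
    for _ in range(n_samples):
        order = rng.permutation(n_features)
        # Start from a random background row and switch features
        # to the values of x one by one, in a random order.
        current = background[rng.integers(len(background))].copy()
        previous = predict(current[None, :])[0]
        for j in order:
            current[j] = x[j]
            updated = predict(current[None, :])[0]
            contributions[j] += updated - previous
            previous = updated
    return contributions / n_samples


def toy_predict(X):
    """Hypothetical model: interaction of features 0 and 1, plus feature 2."""
    return X[:, 0] * X[:, 1] + X[:, 2]


x = np.array([1.0, 1.0, 1.0])
background = np.zeros((1, 3))
shap_values = sampled_shapley(toy_predict, x, background)
print(shap_values)
```

The contributions always sum to the difference between the prediction for `x` and the prediction for the background row. The split of the interaction between the first two features, however, is only approximately equal and changes with the seed.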

In [None]:
exp.predict_parts(x.iloc[15], type="shap").plot()

In [None]:
exp.predict_parts(x.iloc[15], type="shap").plot()

The result is based on sampling, so the result for the same row can vary between runs.

In [None]:
exp.predict_parts(x.iloc[88], type="shap").plot()

In [None]:
exp.predict_parts(x.iloc[88], type="shap").result

**Task** For each class find the most representative examples and plot breakdown for them.

Imagine we have a model classifying dogs and cats. Then a good example would be to show e.g. 3 breeds of dogs and the same with cats. Showing 5 golden retrievers although cute is not the best approach.

There isn't a single best way to approach this task. There are many good solutions. Think about what you want to achieve and then how to do it.

In [None]:
from typing import Any, Callable, Generator

import seaborn as sns
from numpy.typing import ArrayLike
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances

In [None]:
DistanceFn = Callable[[np.ndarray, np.ndarray], np.ndarray]


def add_column(df: pd.DataFrame, column: ArrayLike, name: str) -> pd.DataFrame:
    """Add column to the dataframe. Best works with `pd.DataFrame.pipe`
    method.

    Args:
        df: Dataframe to extend
        column: Column to add
        name: Name of the column

    Returns:
        Dataframe with added column
    """  # noqa: D205
    return pd.concat(
        [df.reset_index(), pd.DataFrame({name: column})], axis=1
    ).set_index("index")


def annotate_representatives(
    df: pd.DataFrame,
    cluster_strategy: Any,
    distance: DistanceFn,
) -> pd.DataFrame:
    """Divide into clusters and for each pick representative closest to
    the centroid.

    Args:
        df: Data for picking representatives from.
        cluster_strategy: Clustering algorithm.
        distance: Distance function.

    Returns:
        Dataframe with additional column: booleans signifying whether the
        record is representative.
    """  # noqa: D205
    # Cluster
    cluster_strategy = cluster_strategy.fit(df)
    centroids = cluster_strategy.cluster_centers_
    labels = cluster_strategy.labels_

    # Distance from each record to each centroid, stored in
    # (N_records, N_centroids) shape
    distances_mat = distance(df.to_numpy(), centroids)

    result = df.pipe(add_column, column=labels, name="cluster")
    result["repr"] = False
    ids = []

    # Can't "groupby" because the same records might occur => exclude them
    # after each iter

    # Iterate by columns (distances corresponding to a single centroid)
    for distances in distances_mat.T:
        ids.extend(
            result.pipe(add_column, column=distances, name="distance")
            .drop(ids)
            .sort_values("distance")
            .head(1)
            .index.tolist()
        )

    result.loc[pd.Index(ids), "repr"] = True
    return result.sort_values("index")


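def annotate_by_classes(
    data: pd.DataFrame,
    n_representatives: int,
    target_feature: str,
    distance: DistanceFn = euclidean_distances,
    random_state: int | None = None,
) -> pd.DataFrame:
    """Annotate representatives separately within each class."""
    return pd.concat([
        annotate_representatives(
            cls_data,
            # Pass the seed so that the clustering is reproducible
            KMeans(
                n_representatives, n_init="auto", random_state=random_state
            ),
            # Use the distance function chosen by the caller
            distance,
        )
        for _, cls_data in data.groupby(target_feature)
    ])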

In [None]:
# Annotate representatives for each class
result = annotate_by_classes(
    data=df, n_representatives=5, target_feature="target", random_state=42
)

In [None]:
sns.set_theme()

# Perform dim reduction and prepare data
plot_data = PCA().fit_transform(
    result.drop(["target", "cluster", "repr"], axis=1)
)
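# Reuse the index of `result` so that the columns assigned below align
# with the PCA rows even if `result` is not ordered by index
plot_data = pd.DataFrame(plot_data, index=result.index)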
plot_data["target"] = result["target"]
plot_data["cluster"] = result["cluster"]
plot_data["repr"] = result["repr"]

# Plot for all classes
plt.tick_params(labelbottom=False, labelleft=False)
plt.title("All classes")
sns.scatterplot(data=plot_data, x=0, y=1, hue="target", palette="deep")
plt.show()

# Plot for each class
clss = result["target"].unique()
palette = sns.color_palette("deep")
for cls, color in zip(clss, palette):
    # Plotting params
    plt.title(f"Class {cls}")
    plt.tick_params(labelbottom=False, labelleft=False)

    # Plot points
    sns.scatterplot(data=plot_data, x=0, y=1, color="gray")
    sns.scatterplot(
        data=plot_data[plot_data["target"] == cls],
        x=0,
        y=1,
        color=color + (0.5,),
    )
    sns.scatterplot(
        data=plot_data[(plot_data["target"] == cls) & plot_data["repr"]],
        x=0,
        y=1,
        color="red",
    )

    plt.show()

There are other approaches that can be used for model explainability.
 - LIME - approximating model locally by a linear model
 - Anchor - approximating model locally by a rule-based model
 - Prototype - justifying a new prediction by showing a similar example from the data (a prototype)
 - Counterfactual Explanation - showing a similar example from the dataset with a different prediction to show what must be changed to change the prediction.
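As an illustration of the last item, a counterfactual taken from the dataset can be found with a nearest-neighbour search restricted to rows with a different prediction. The sketch below assumes Euclidean distance and a hypothetical helper `nearest_counterfactual`; dedicated libraries use more elaborate searches.

```python
import numpy as np


def nearest_counterfactual(X, predictions, query_index):
    """Index of the closest row whose prediction differs from the query's.

    If no row has a different prediction, all distances are infinite and
    index 0 is returned, so callers should check for that case.
    """
    differs = predictions != predictions[query_index]
    distances = np.linalg.norm(X - X[query_index], axis=1)
    distances[~differs] = np.inf  # ignore rows with the same prediction
    return int(np.argmin(distances))


X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [3.0, 3.0]])
pred_toy = np.array([0, 0, 1, 1])
cf_index = nearest_counterfactual(X_toy, pred_toy, query_index=0)
print(cf_index, X_toy[cf_index] - X_toy[0])
```

The difference between the counterfactual row and the query row shows which features would have to change, and by how much, to obtain the other prediction.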

# Task

- take a dataset you want
- perform an exploratory data analysis (data visualization)
- create a sklearn pipeline for data preprocessing
- add new features (one hot encoding for example)
- add predictive model as the last step of the pipeline
- prepare a report with model explainability

Send it to gmiebs@cs.put.poznan.pl within 144 hours after the class is finished. Start the subject of the email with [IR]

Assume your report will be read by a domain expert from the area of the data, in our case a wine expert, without any computer science / data science skills. It means the person will not get much from raw plots and diagrams. Everything has to be explained to be understood.

### 1. Dataset

#### Feature Descriptions
* `person_age`: Applicant’s age in years.
* `person_home_ownership`: Status of homeownership (e.g., Rent, Own, Mortgage).
* `person_income`: Annual income of the applicant in USD.
* `person_emp_length`: Length of employment in years.
* `loan_intent`: Purpose of the loan (e.g., Education, Medical, Personal).
* `loan_grade`: Risk grade assigned to the loan, assessing the applicant’s creditworthiness.
* `loan_amnt`: Total loan amount requested by the applicant.
* `loan_int_rate`: Interest rate associated with the loan.
* `loan_status`: The approval status of the loan (approved or not approved).
* `loan_percent_income`: Percentage of the applicant’s income allocated towards loan repayment.
* `cb_person_default_on_file`: Indicates if the applicant has a history of default ('Y' for yes, 'N' for no).
* `cb_person_cred_hist_length`: Length of the applicant’s credit history in years.

In [None]:
df = pd.read_csv("train.csv")
df.info()

### 2. Exploratory data analysis

In [None]:
from sklearn.preprocessing import OrdinalEncoder

df_vis = df.copy()
cat_cols = [
    "person_home_ownership",
    "loan_intent",
    "loan_grade",
    "cb_person_default_on_file",
]
df_vis[cat_cols] = OrdinalEncoder().fit_transform(df_vis[cat_cols])
plt.figure(figsize=(14, 14)).suptitle("Feature Correlation")
sns.heatmap(df_vis.corr(), annot=True)
plt.tight_layout()
plt.show()

The plot above presents a correlation heatmap for the columns of the dataset (categorical columns are encoded as ordinal numbers first). Values close to zero imply no linear correlation, while positive and negative values represent direct and inverse correlations respectively.


In [None]:
mean_int_rate_by_loan_grade = df.groupby("loan_grade")["loan_int_rate"].mean()

_, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# First plot (bar chart for mean interest rate by loan grade)
ax1.bar(mean_int_rate_by_loan_grade.index, mean_int_rate_by_loan_grade)
ax1.set_xlabel("Loan Grade")
ax1.set_ylabel("Mean Interest Rate")
ax1.set_title("Mean Interest Rate by Loan Grade")

# Second plot (violin plot for interest rate distribution by loan grade)
ticks, interest_by_rate_data = list(
    zip(*[
        (cls, cls_data["loan_int_rate"])
        for cls, cls_data in df.groupby("loan_grade")
    ])
)
ax2.violinplot(interest_by_rate_data, showmeans=False, showextrema=False)
ax2.set_xticks([y + 1 for y in range(len(interest_by_rate_data))], ticks)
ax2.set_title("Interest Rate Distribution by Loan Grade")
ax2.set_xlabel("Loan Grade")
ax2.set_ylabel("Interest Rate")

plt.tight_layout()
plt.show()

In [None]:
ticks, default_on_file_by_grade_data = list(
    zip(*[
        (cls, cls_data["cb_person_default_on_file"].value_counts())
        for cls, cls_data in df[df["loan_status"] == 1].groupby("loan_grade")
    ])
)

y_N, y_Y = [], []
for el in default_on_file_by_grade_data:
    el_N = el["N"] if "N" in el.index else 0
    el_Y = el["Y"] if "Y" in el.index else 0
    y_N.append(el_N / (el_N + el_Y))
    y_Y.append(el_Y / (el_N + el_Y))

fig, ax = plt.subplots()
ax.bar(ticks, y_N, color="green")
ax.bar(ticks, y_Y, color="red", bottom=y_N)
ax.set_xlabel("Loan Grade")
ax.set_ylabel("Share of people with positive loan status")
ax.set_title("How prior default influences future Loan Grade")
plt.legend(["No prior default", "Has prior default"])

plt.show()

The plot above shows that, among people with a positive loan status, those with a default history appear only from grade C onward (grades C to G).

### Plotting utils

In [None]:
def all_vs_positive(df: pd.DataFrame) -> pd.DataFrame:
    loan_status_both = df.reset_index()
    loan_status_positive = df[df["loan_status"] == 1].reset_index()

    loan_status_both["loan_status"] = "all"
    loan_status_positive["loan_status"] = "positive"

    # ignore_index avoids duplicated index labels in the concatenated frame
    return pd.concat(
        [loan_status_both, loan_status_positive], ignore_index=True
    )


def subplots(
    df_columns: int, n_cols: int, title: str
) -> tuple[plt.Figure, Generator[plt.Axes, str, None]]:
    n_rows = (df_columns + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 4 * n_rows))
    fig.suptitle(title)

    def axes_gen() -> Generator[plt.Axes, str, None]:
        collapse_flag = False
        for ax in axes.ravel():
            if not collapse_flag:
                collapse_flag = (yield ax) == "collapse"
            else:
                ax.axis("off")
        yield

    return fig, axes_gen()

#### Distribution of numerical features

In [None]:
kde_data = (
    (temp := all_vs_positive(df))
    .select_dtypes(include="number")
    .assign(loan_status=temp["loan_status"])
    .drop(["index", "id"], axis=1)
)

plotting_columns = list(filter(lambda x: x != "loan_status", kde_data.columns))
_, axes = subplots(
    df_columns=len(plotting_columns), n_cols=3, title="Numerical features"
)
for col, ax in zip(plotting_columns, axes):
    ax.tick_params(labelleft=False)
    sns.kdeplot(
        data=kde_data, x=col, hue="loan_status", ax=ax, fill=True, bw_adjust=2
    )
axes.send("collapse")
plt.tight_layout()
plt.show()

#### Distribution of categorical features

In [None]:
hist_data = (
    (temp := all_vs_positive(df))
    .select_dtypes(include="object")
    .assign(loan_status=temp["loan_status"])
    .assign(loan_intent=temp["loan_intent"].str[:3] + ".")
)

plotting_columns = list(
    filter(lambda x: x != "loan_status", hist_data.columns)
)
fig, axes = subplots(
    df_columns=len(plotting_columns), n_cols=2, title="Categorical features"
)
for col, ax in zip(plotting_columns, axes):
    total_value_counts = hist_data[hist_data["loan_status"] == "all"][
        col
    ].value_counts()
    positive_value_counts = hist_data[hist_data["loan_status"] == "positive"][
        col
    ].value_counts()

    positive_value_counts = (
        (positive_value_counts / total_value_counts * 100)
        .astype(int)
        .reset_index()
    )
    total_value_counts = total_value_counts.reset_index().assign(count=100)

    if col != "loan_grade":
        order = positive_value_counts.sort_values("count")[col]
    else:
        order = positive_value_counts[col]

    sns.barplot(
        data=total_value_counts,
        y=col,
        x="count",
        ax=ax,
        color="gray",
        order=order,
    )
    sns.barplot(
        data=positive_value_counts, y=col, x="count", ax=ax, order=order
    )
axes.send("collapse")

fig.set_figwidth(20)
plt.tight_layout()
plt.show()

### 3, 4, 5. Preprocessing, new features, classifier

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = df.drop(["loan_status"], axis=1).set_index("id")
y = df["loan_status"]

num_cols = X.select_dtypes(include=["int64", "float64"]).columns
cat_cols = X.select_dtypes(include=["object"]).columns

numerical_categorical_transformer = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    # Dense output: the classifier and the later np.hstack need dense arrays
    ("cat", OneHotEncoder(sparse_output=False), cat_cols),
])

pipe = Pipeline([
    ("preprocessing", numerical_categorical_transformer),
    ("clf", HistGradientBoostingClassifier(max_iter=1000)),
])

In [None]:
pipe.fit(X, y)

### 6. Explanations


In this report we explore how the classifier makes its decisions. We use the following explainability techniques:
* `Variable importance` (on the whole dataset)
* `Classification depending on a particular feature` (on particular examples)
* `Shapley values` (on particular examples)
* `LIME` (on particular examples)
* `Anchor` (on particular examples)


#### Selecting class representatives

Since a considerable portion of our report is dedicated to explaining the model's behavior for individual records, it is important to pick the most interesting ones. From an analytical perspective, class representatives are the most insightful.

These representatives are identified by grouping the records as follows: for each class (i.e. loan denied and loan approved), 3 groups are created based on the similarity between records. The record closest to the center of each group is then chosen as a class representative.

As a result, each class has 3 representative examples.

In [None]:
ndarray_X = pipe.named_steps["preprocessing"].transform(X)
ndarray_y = np.expand_dims(y.to_numpy(), axis=1)
columns = (
    num_cols.tolist()
    + pipe.named_steps["preprocessing"]
    .named_transformers_["cat"]
    .get_feature_names_out()
    .tolist()
    + ["loan_status"]
)
df_preprocessed = pd.DataFrame(
    np.hstack([ndarray_X, ndarray_y]), columns=columns
)
df_with_representatives = annotate_by_classes(
    data=df_preprocessed,
    n_representatives=3,
    target_feature="loan_status",
    random_state=42,
)

loan_negative_representatives = df_with_representatives[
    df_with_representatives["repr"]
    & (df_with_representatives["loan_status"] == 0)
].index
loan_positive_representatives = df_with_representatives[
    df_with_representatives["repr"]
    & (df_with_representatives["loan_status"] == 1)
].index

#### Variable importance

The plot below presents variable importance (VI) on the dataset. Variables with a higher VI have more influence on how the examples are classified. VI is calculated based on the dropout loss, which shows how much the quality of the classification would drop if the variable were not considered during classification.
We see that `loan_percent_income` is the variable with the highest importance, and dropping it would worsen the classification by about 8%.


In [None]:
exp = dx.Explainer(pipe, X, y)
fi = exp.model_parts()
fi.plot()

#### Classification depending on a particular feature

The plots below show how the classification result for a particular instance from the dataset would change if the value of a given feature were changed. For instance, if the loan percent income is 20%, the chances of getting a loan are close to 0; when it is increased to 40%, the classification result is positive with a probability of about 80%. Keep in mind that all other feature values remain unaltered.

To save space, multiple instances are shown on the same plots. The first set of plots concerns examples for which the loan was denied, and the second set contains examples with a positive classification.


In [None]:
exp.predict_profile(X.loc[loan_negative_representatives]).plot(
    title="Loan denied"
)

In [None]:
exp.predict_profile(X.loc[loan_positive_representatives]).plot(
    title="Loan approved"
)

#### Shapley values

The plot with `Shapley values` shows how particular features of the considered instance influenced its classification result. The bars show how much the feature pushed the decision in the direction of either approving the loan or denying it. Red bars represent denial while green bars show the approval. For clarity, the features are plotted in the descending order of their importance.

In this section and in the following ones we present explainability plots for particular instances. We have chosen them as the best representatives of each decision class that are nevertheless distant from each other. The first set of plots concerns examples for which the loan was denied, and the second set contains examples with a positive classification.

**Note:** In this and in the following sections we analyze instances from already preprocessed data. That means you can encounter new feature names and scaled feature values:
1. We use one-hot encoding for the categorical features, so you may notice some features that are not on the list presented earlier. The one-hot-encoded features are listed below:
    * `person_home_ownership` -> e.g. `person_home_ownership_RENT`, meaning that the person is renting a house
    * `loan_intent` -> e.g. `loan_intent_MEDICAL`
    * `loan_grade` -> e.g. `loan_grade_A` (grades are from A to G)
    * `cb_person_default_on_file` -> either `cb_person_default_on_file_Y` (has a prior default) or `cb_person_default_on_file_N` otherwise

2. Numerical values are scaled in order to improve classification accuracy (so that values with a larger scale do not distort the classification algorithm). The values of one-hot-encoded features are either 0 or 1, meaning that the category is absent or present, respectively.
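The two transformations described in the note can be illustrated on made-up values. This is a NumPy sketch of what standard scaling and one-hot encoding compute, not the pipeline's own code; the arrays `income` and `ownership` are invented for the example.

```python
import numpy as np

# Standard scaling: subtract the mean, divide by the standard deviation.
income = np.array([30_000.0, 50_000.0, 70_000.0])
scaled = (income - income.mean()) / income.std()

# One-hot encoding: one 0/1 column per category.
ownership = np.array(["RENT", "OWN", "RENT"])
categories = np.unique(ownership)  # sorted: ["OWN", "RENT"]
one_hot = (ownership[:, None] == categories).astype(float)

print(scaled)      # centered at 0 with unit standard deviation
print(categories)
print(one_hot)     # columns follow the order of `categories`
```

A scaled value of 0 therefore means "average" for that feature, and a 1 in a one-hot column means that the record belongs to that category.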


In [None]:
from typing import Iterable

from dalex.predict_explanations import Shap


def shapley_batch(
    data: pd.DataFrame,
    exp: dx.Explainer,
    batch: Iterable[int],
    random_state: int | None = None,
) -> list[Shap]:
    return [
        exp.predict_parts(
            data.loc[idx], type="shap", random_state=random_state
        )
        for idx in batch
    ]


def shapley_plot(predicted: list[Shap]) -> None:
    predicted[0].plot(objects=predicted[1:])

In [None]:
loan_negative_shap = shapley_batch(
    X, exp, loan_negative_representatives, random_state=42
)

In [None]:
loan_positive_shap = shapley_batch(
    X, exp, loan_positive_representatives, random_state=42
)

In [None]:
shapley_plot(loan_negative_shap)

In [None]:
shapley_plot(loan_positive_shap)

#### LIME

`Local Interpretable Model-agnostic Explanations (LIME)` provides information similar to `Shapley values`, but it is based on different mathematical ideas.

Briefly, LIME explores the classification results of the model in the neighborhood of a given instance. In doing so, it calculates how much each feature value pushes the decision toward either approving or denying the loan. Blue bars represent denial, while orange bars represent approval.

The probabilities to the right of the graph show the certainty with which the classifier assigned the corresponding target value.

To the left of the graph one can see the feature values (in the data after preprocessing).

**Note:** only the most influential features are presented in LIME.
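The core idea, fitting a weighted linear model to the classifier's outputs around one instance, can be sketched as follows. This is a simplified illustration and not the `lime` package's algorithm: the function `lime_sketch`, the Gaussian perturbations, the exponential kernel, and the toy model are all assumptions.

```python
import numpy as np


def lime_sketch(predict, x, n_samples=500, scale=1.0, kernel_width=1.0, seed=0):
    """Coefficients of a locally weighted linear surrogate around `x`."""
    rng = np.random.default_rng(seed)
    # Perturb the instance with Gaussian noise.
    samples = x + rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    outputs = predict(samples)
    # Weight the samples by their proximity to x.
    distances = np.linalg.norm(samples - x, axis=1)
    weights = np.exp(-(distances**2) / kernel_width**2)
    # Weighted least squares: scale rows by the square root of the weight.
    design = np.hstack([np.ones((n_samples, 1)), samples])
    sw = np.sqrt(weights)
    coefs, *_ = np.linalg.lstsq(
        design * sw[:, None], outputs * sw, rcond=None
    )
    return coefs[1:]  # drop the intercept


def toy_predict(X):
    """Hypothetical model that is exactly linear."""
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0


local_coefs = lime_sketch(toy_predict, np.array([1.0, 2.0]))
print(local_coefs)
```

For an exactly linear model the surrogate recovers the true coefficients. For a nonlinear classifier the coefficients describe the model only near the chosen instance, which is why the explanations differ between representatives.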


In [None]:
from lime.explanation import Explanation as LimeExplanation
from lime.lime_tabular import LimeTabularExplainer

X_preprocessed = pipe.named_steps["preprocessing"].transform(X)
exp = LimeTabularExplainer(
    X_preprocessed,
    feature_names=num_cols.tolist()
    + pipe.named_steps["preprocessing"]
    .named_transformers_["cat"]
    .get_feature_names_out()
    .tolist(),
    class_names=["negative", "positive"],
    discretize_continuous=False,
    verbose=True,
    random_state=42,
)

In [None]:
def lime_batch(
    data: np.ndarray,  # preprocessed feature matrix, indexed by position
    exp: LimeTabularExplainer,
    batch: Iterable[int],
    model: Any,
) -> list[LimeExplanation]:
    return [
        exp.explain_instance(data[idx], model.predict_proba) for idx in batch
    ]


def lime_plot(explained: list[LimeExplanation]) -> None:
    for lime in explained:
        lime.show_in_notebook(show_table=True)

In [None]:
loan_negative_lime = lime_batch(
    X_preprocessed, exp, loan_negative_representatives, pipe.named_steps["clf"]
)
lime_plot(loan_negative_lime)

In [None]:
loan_positive_lime = lime_batch(
    X_preprocessed, exp, loan_positive_representatives, pipe.named_steps["clf"]
)
lime_plot(loan_positive_lime)

#### Anchor

`Anchor`, similarly to `LIME`, explores the neighborhood of a particular instance from the dataset by slightly perturbing its values. As a result, we get a set of rules that lead to a certain probability of the correct classification. This is a benefit of Anchor: it describes not only the influence of a particular value (e.g. if age = 25 then ...), but also constructs rules that are easy to interpret (e.g. if age > 25 then ...).

In the plots below we present 3 representatives for each class (in the same way as in the plots above). The first 3 representatives are from the class with denied loan applications and the next 3 are from the class with approved loans. The value shown is the probability of being assigned to the corresponding class if the rules listed on the left are satisfied.

**Note:** When there are multiple rules, one can switch them on and off. Doing so gives a new probability of classification for the target class (dynamically adjusted for the new set of rules).
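The probability attached to a rule can be understood as the rule's precision: the share of records satisfying the rule that the model assigns to the given class. The sketch below estimates it directly on data rows. This is a simplification, since the `anchor` package evaluates rules on perturbed samples, and `rule_precision` and the toy model are hypothetical.

```python
import numpy as np


def rule_precision(predict, X, rule_mask, target_class):
    """Share of rows satisfying the rule that get `target_class`."""
    covered = X[rule_mask]
    if len(covered) == 0:
        return float("nan")  # the rule covers no rows
    return float(np.mean(predict(covered) == target_class))


def toy_predict(X):
    """Hypothetical classifier: class 1 when the first feature exceeds 0.5."""
    return (X[:, 0] > 0.5).astype(int)


X_toy = np.array([[0.7, 1.0], [0.9, 0.0], [0.2, 1.0], [0.4, 1.0]])

rule_a = X_toy[:, 0] > 0.6   # "first feature > 0.6"
rule_b = X_toy[:, 1] == 1.0  # "second feature = 1"

precision_a = rule_precision(toy_predict, X_toy, rule_a, target_class=1)
precision_b = rule_precision(toy_predict, X_toy, rule_b, target_class=1)
coverage_a = float(rule_a.mean())
print(precision_a, precision_b, coverage_a)
```

A good anchor has high precision while still covering a reasonable share of the data; the `threshold=0.8` argument used in the notebook sets the minimum precision a rule must reach.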


In [None]:
from anchor.anchor_tabular import AnchorTabularExplainer

mapping_col = {
    col: X[col].unique()
    for i, col in filter(lambda x: x[1] in cat_cols, enumerate(X.columns))
}
mapping_idx = {
    i: X[col].unique()
    for i, col in filter(lambda x: x[1] in cat_cols, enumerate(X.columns))
}


def encode(data: pd.DataFrame) -> pd.DataFrame:
    data = data.copy()
    for col, values in mapping_col.items():
        local_map = {value: i for i, value in enumerate(values)}
        data[col] = data[col].map(local_map)
    return data


def decode(data: np.ndarray, columns: list[str]) -> pd.DataFrame:
    data = pd.DataFrame(data, columns=columns)
    for col, values in mapping_col.items():
        local_map = {i: value for i, value in enumerate(values)}
        data[col] = data[col].map(local_map)
    return data


def anchor_batch(
    data: pd.DataFrame,
    exp: AnchorTabularExplainer,
    batch: Iterable[int],
    model: Any,
) -> list[Any]:  # anchor explanations, not LIME ones
    return [
        exp.explain_instance(
            encode(data.loc[[idx]]).to_numpy(),
            lambda x: model.predict(decode(x, columns=data.columns)),
            threshold=0.8,
        )
        for idx in batch
    ]


def anchor_plot(explained: list[Any]) -> None:
    for anchor in explained:
        anchor.show_in_notebook(show_table=True)


exp = AnchorTabularExplainer(
    class_names=y.unique(),
    feature_names=X.columns,
    train_data=encode(X).to_numpy(),
    categorical_names=mapping_idx,
)

In [None]:
loan_negative_anchor = anchor_batch(
    X, exp, loan_negative_representatives, pipe
)
anchor_plot(loan_negative_anchor)

In [None]:
loan_positive_anchor = anchor_batch(
    X, exp, loan_positive_representatives, pipe
)
anchor_plot(loan_positive_anchor)