# Explaining XGBoost Model trained on Adult Income Data

This notebook is a user guide to using TE2Rules to explain a gradient boosted binary classification model trained with scikit-learn (`GradientBoostingClassifier`) or xgboost (`XGBClassifier`). TE2Rules explains a tree ensemble model using rules. The notebook also shows the levers that control the faithfulness and interpretability of the extracted rules.

## Load Python libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics

# TE2Rules supports tree ensemble models from scikit-learn and xgboost   
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

import te2rules
from te2rules.explainer import ModelExplainer

print("Using te2rules version: " + str(te2rules.__version__))

## Load pre-processed training and testing data

The Adult Income data can be found in the [UCI repository](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/). It contains numerical and categorical features such as age, hours of work, capital-gain, education-level, marital-status, race, and sex. The task is to predict whether a person's annual income is above 50K USD.

The pre-processed data used in this notebook can be generated by running `python3 data_prep/data_prep_adult.py`. This script downloads the Adult Income data, cleans missing values, and one-hot encodes the categorical features. Records with income above 50K USD are labeled as positives and the rest are labeled as negatives.
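
To make the two pre-processing steps concrete, here is a minimal sketch of one-hot encoding and binary labeling on toy records. It is written in plain Python so the mechanics are visible; the record values, the helper names `one_hot_encode` and `add_binary_label`, and the output column name `label` are assumptions for illustration and are not taken from `data_prep_adult.py`.

```python
# Toy records standing in for raw Adult Income rows (illustrative values only).
records = [
    {"age": 39, "workclass": "State-gov", "income": ">50K"},
    {"age": 50, "workclass": "Private", "income": "<=50K"},
    {"age": 38, "workclass": "Private", "income": "<=50K"},
]


def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[column + "_" + category] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def add_binary_label(rows, column, positive_value, label_name="label"):
    """Map a raw target column to 1 (positive) or 0 (negative), kept as the last column."""
    labeled = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        new_row[label_name] = int(row[column] == positive_value)
        labeled.append(new_row)
    return labeled


prepared = add_binary_label(one_hot_encode(records, "workclass"), "income", ">50K")
for row in prepared:
    print(row)
```

Keeping the label as the last column matches how this notebook later splits features from labels with `[:, :-1]` and `[:, -1]`.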



In [None]:
np.random.seed(123)

training_path = "../data/adult/train.csv"
testing_path = "../data/adult/test.csv"

data_train = pd.read_csv(training_path)
data_test = pd.read_csv(testing_path)

In [None]:
cols = list(data_train.columns)
feature_names = cols[:-1]
label_name = cols[-1]

data_train = data_train.to_numpy()
data_test = data_test.to_numpy()

In [None]:
x_train = data_train[:, :-1]
y_train = data_train[:, -1]

x_test = data_test[:, :-1]
y_test = data_test[:, -1]

## Train an XGBoost model using scikit-learn or xgboost

In [None]:
# Scikit-Learn Model
model = GradientBoostingClassifier(n_estimators=10, max_depth=3)
model.fit(x_train, y_train)

# XGBoost Model
# model = XGBClassifier(n_estimators=10, max_depth=3)
# model.fit(x_train, y_train)

In [None]:
y_train_pred = model.predict(x_train)
y_train_pred_score = model.predict_proba(x_train)[:, 1]

y_test_pred = model.predict(x_test)
y_test_pred_score = model.predict_proba(x_test)[:, 1]

In [None]:
accuracy = model.score(x_test, y_test)
print("Accuracy")
print(accuracy)

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_pred_score)
auc = metrics.auc(fpr, tpr)
print("AUC")
print(auc)

## Explain the XGBoost model using TE2Rules

In [None]:
model_explainer = ModelExplainer(
    model=model, 
    feature_names=feature_names
)

rules = model_explainer.explain(
    X=x_train, y=y_train_pred,
    num_stages = 10,               # stages can be between 1 and the number of trees (n_estimators)
    min_precision = 0.95,          # higher min_precision can yield rules with more terms that overfit the training data
    jaccard_threshold = 0.4        # lower jaccard_threshold speeds up the rule exploration, but can miss some good rules
)
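
The `jaccard_threshold` parameter is named after Jaccard similarity. The sketch below shows that measure on two rules, assuming each rule is represented by the set of training-row indices it covers. The row sets and the helper name `jaccard_similarity` are hypothetical, and this is not TE2Rules' internal code.

```python
def jaccard_similarity(rows_a, rows_b):
    """Size of the intersection divided by size of the union of two row-index sets."""
    union = rows_a | rows_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(rows_a & rows_b) / len(union)


# Hypothetical supports: the training rows satisfied by each rule.
rule_1_rows = {0, 1, 2, 3, 4}
rule_2_rows = {3, 4, 5}

similarity = jaccard_similarity(rule_1_rows, rule_2_rows)
print(similarity)  # 2 shared rows out of 6 distinct rows
```

A value near 1 means the two rules cover almost the same rows, and a value near 0 means they cover mostly different rows.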

### Interpretability: Inspect the rules

In [None]:
print(str(len(rules)) + " rules found:")
print()
for i in range(len(rules)):
    print("Rule " + str(i) + ": " + str(rules[i]))

### Faithfulness: Fidelity of the rules

If the fidelity on positives is not high enough, try running with a larger `num_stages` and a higher `jaccard_threshold`.
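
As a rough illustration of what the three fidelity numbers measure, the sketch below computes agreement between model predictions and rule predictions on toy data. It assumes fidelity means the fraction of instances where the rule set and the model agree, computed overall, on the model's positives, and on the model's negatives. The helper `fidelity_scores` and the toy prediction lists are hypothetical and do not reproduce the library's implementation.

```python
def fidelity_scores(model_pred, rule_pred):
    """Return (overall, positive, negative) agreement between two lists of 0/1 predictions."""
    agree = [int(m == r) for m, r in zip(model_pred, rule_pred)]
    # Agreement restricted to instances the model predicts positive / negative.
    pos = [a for a, m in zip(agree, model_pred) if m == 1]
    neg = [a for a, m in zip(agree, model_pred) if m == 0]
    overall = sum(agree) / len(agree)
    positive = sum(pos) / len(pos) if pos else float("nan")
    negative = sum(neg) / len(neg) if neg else float("nan")
    return overall, positive, negative


# Toy predictions: the rules miss one model positive and fire on one model negative.
model_pred = [1, 1, 1, 0, 0, 0, 0, 0]
rule_pred = [1, 1, 0, 0, 0, 0, 0, 1]

overall, positive, negative = fidelity_scores(model_pred, rule_pred)
print(overall, positive, negative)
```

Under this reading, a low positive fidelity means that many of the model's positive predictions are not covered by any extracted rule.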

In [None]:
fidelity, positive_fidelity, negative_fidelity = model_explainer.get_fidelity()

print("The rules explain " + str(round(fidelity*100, 2)) + "% of the overall predictions of the model" )
print("The rules explain " + str(round(positive_fidelity*100, 2)) + "% of the positive predictions of the model" )
print("The rules explain " + str(round(negative_fidelity*100, 2)) + "% of the negative predictions of the model" )

## All possible explanations

TE2Rules provides one possible set of explanations for the positive model predictions. It first finds all candidate rules from the model and then shortlists a small subset that explains most of the positives. However, this subset is not the only possible way to explain the model.

TE2Rules can also show all possible explanations for the model predictions. From this longer set of candidate rules, a domain expert can use their domain knowledge to choose a smaller set of rules that closely aligns with the decision-making process in their domain. These shortlisted rules can be used as an alternative to the default subset of rules selected by TE2Rules.
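
One simple way a domain expert might shortlist rules is to discard any rule that mentions features they do not want in an explanation. The sketch below assumes each rule is a string of `feature operator threshold` terms joined by ` & `, and that feature names contain no spaces. The rule strings, feature names, and helper names are hypothetical.

```python
# Hypothetical candidate rules in the assumed "feature op threshold & ..." format.
candidate_rules = [
    "capital_gain > 7073.5",
    "education_num > 12.5 & hours_per_week > 45.5",
    "age > 29.5 & sex_Male > 0.5",
]

# Feature-name prefixes the expert wants to keep out of explanations (illustrative).
excluded_prefixes = ("sex_", "race_")


def features_in_rule(rule):
    """Return the feature name of every term in a rule string."""
    return [term.split()[0] for term in rule.split(" & ")]


def keep_rule(rule, prefixes):
    """Keep a rule only if none of its features starts with an excluded prefix."""
    return not any(name.startswith(prefixes) for name in features_in_rule(rule))


selected_rules = [rule for rule in candidate_rules if keep_rule(rule, excluded_prefixes)]
for rule in selected_rules:
    print(rule)
```

After any manual selection like this, it is worth re-checking how many of the model's positive predictions the remaining rules still cover.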

In [None]:
rules = model_explainer.longer_rules
print(str(len(rules)) + " rules found:")
print()

for i in range(len(rules)):
    print("Rule " + str(i) + ": " + str(rules[i]))

## Local Instance-Level Explanations

For a given input with a positive model prediction, TE2Rules can show the different possible reasons why the model assigned it the positive class. A domain expert can then choose the most plausible explanation among these reasons.
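
Conceptually, a rule is a possible reason for an instance when the instance satisfies every term of the rule. The sketch below checks this for one toy instance, assuming rules are strings of `feature operator threshold` terms joined by ` & ` and that feature names contain no spaces. The helper `rule_matches`, the instance values, and the rule strings are hypothetical and are not the library's implementation.

```python
import operator

# Comparison operators assumed to appear in rule terms.
OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}


def rule_matches(rule, instance):
    """Return True if the instance satisfies every term of the rule."""
    for term in rule.split(" & "):
        feature, op, threshold = term.split()
        if not OPS[op](instance[feature], float(threshold)):
            return False
    return True


# Toy instance given as a feature-name to value mapping.
instance = {"age": 42, "education_num": 13, "capital_gain": 0}

example_rules = [
    "capital_gain > 7073.5",
    "age > 29.5 & education_num > 12.5",
    "education_num <= 9.5",
]

matching = [rule for rule in example_rules if rule_matches(rule, instance)]
print(matching)
```

Only the second rule holds for this instance, so it is the only candidate reason in this toy example.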

In [None]:
# display_input is a helper from a local util module (not part of te2rules)
from util import display_input

explanations = model_explainer.explain_instance_with_rules(x_test)

In [None]:
print("Local Explanations of a particular model decision")
print()
for i in range(140, 155):
    if y_test_pred[i] == 1:
        print("Index:", i)
        print()
        print("Model Input:")
        display_input(x_test[i], feature_names)
        print()
        print("Model Prediction:", y_test_pred[i])
        print()
        print("Possible Reasons:")
        rules = explanations[i]
        for j in range(len(rules)):
            print("Rule", j+1, ":", rules[j])   
        print("--------------------------------")