# Explaining a Gradient Boosted Model Trained on Adult Income Data

This notebook is a user guide to using TE2Rules to explain a gradient boosted binary classification model trained with scikit-learn's `GradientBoostingClassifier`. TE2Rules explains a tree ensemble model using rules. The notebook also covers the levers that control the faithfulness and interpretability of the extracted rules.

## Load Python libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics


import te2rules
from te2rules.explainer import ModelExplainer

print("Using te2rules version: " + str(te2rules.__version__))

Using te2rules version: 1.0.1


## Load pre-processed training and testing data

The Adult Income data can be found in the [UCI repository](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/). It contains numerical and categorical features such as age, hours of work, capital-gain, education-level, marital-status, race, and sex, which are used to predict whether a person's annual income is above 50K USD.

The pre-processed data used in this notebook can be generated by running `python3 data_prep/data_prep_adult.py`. This script downloads the Adult Income data, cleans missing values, and one-hot encodes the categorical features. Records with income above 50K USD are labeled as positives and all other records are labeled as negatives.
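The preparation steps described above can be illustrated on a toy example. This is a minimal, dependency-free sketch, not the actual `data_prep_adult.py` script: the records, the `prepare` helper, and the choice to drop rows with missing values are all assumptions made for illustration. The raw Adult data marks missing values with `?`.

```python
# Hypothetical raw records; the real script's columns and cleaning rules may differ.
raw_records = [
    {"age": 39, "workclass": "Private", "income": ">50K"},
    {"age": 50, "workclass": "?", "income": "<=50K"},  # "?" marks a missing value
    {"age": 28, "workclass": "Self-emp", "income": "<=50K"},
]


def prepare(records, categorical, label, positive=">50K", missing="?"):
    """Drop rows with missing values, one-hot encode, and binarize the label."""
    clean = [r for r in records if missing not in r.values()]
    # Collect the sorted set of categories seen for each categorical column.
    categories = {c: sorted({r[c] for r in clean}) for c in categorical}
    rows = []
    for r in clean:
        row = {}
        for key, value in r.items():
            if key == label:
                continue
            if key in categorical:
                # One 0/1 indicator column per category.
                for cat in categories[key]:
                    row[f"{key}_{cat}"] = 1 if value == cat else 0
            else:
                row[key] = value
        # Keep the label as the last column, as the notebook expects.
        row[label] = 1 if r[label] == positive else 0
        rows.append(row)
    return rows


prepared = prepare(raw_records, categorical=["workclass"], label="income")
for row in prepared:
    print(row)
```

One-hot encoding is what makes thresholds such as `marital_status_married > 0.5` appear in the extracted rules later in this notebook: each indicator column is 0 or 1, so a split at 0.5 separates the two cases.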



In [2]:
np.random.seed(123)

training_path = "../data/adult/train.csv"
testing_path = "../data/adult/test.csv"

data_train = pd.read_csv(training_path)
data_test = pd.read_csv(testing_path)

In [3]:
cols = list(data_train.columns)
feature_names = cols[:-1]
label_name = cols[-1]

data_train = data_train.to_numpy()
data_test = data_test.to_numpy()

In [4]:
x_train = data_train[:, :-1]
y_train = data_train[:, -1]

x_test = data_test[:, :-1]
y_test = data_test[:, -1]

## Train a gradient boosted model using scikit-learn

In [5]:
model = GradientBoostingClassifier(n_estimators=10)
model.fit(x_train, y_train)

In [6]:
y_train_pred = model.predict(x_train)
y_train_pred_score = model.predict_proba(x_train)[:, 1]

y_test_pred = model.predict(x_test)
y_test_pred_score = model.predict_proba(x_test)[:, 1]

In [7]:
accuracy = model.score(x_test, y_test)
print("Accuracy")
print(accuracy)

Accuracy
0.8176904176904177


In [8]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_pred_score)
auc = metrics.auc(fpr, tpr)
print("AUC")
print(auc)

AUC
0.8907004647565667


## Explain the gradient boosted model using TE2Rules

In [9]:
model_explainer = ModelExplainer(
    model=model, 
    feature_names=feature_names
)

rules = model_explainer.explain(
    X=x_train, y=y_train_pred,
    num_stages=10,
    min_precision=0.95
)

100%|██████████| 73/73 [00:00<00:00, 1503.78it/s]
100%|██████████| 121/121 [00:00<00:00, 1741.40it/s]
100%|██████████| 7/7 [00:00<00:00, 7074.73it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
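To give some intuition for the `min_precision = 0.95` setting used above, here is a minimal sketch of a precision filter on toy data. It assumes that a rule's precision is the fraction of the instances it covers that the model predicts as positive; the names `rule_precision`, `rule_a`, and `rule_b` are hypothetical, and this is not TE2Rules' internal implementation.

```python
def rule_precision(covered, model_pred):
    """Fraction of covered rows that the model predicts as positive."""
    hits = [p for c, p in zip(covered, model_pred) if c]
    return sum(hits) / len(hits) if hits else 0.0


# Toy model predictions (1 = positive) for six rows.
model_pred = [1, 1, 1, 0, 1, 0]

# Hypothetical candidate rules, given as masks of the rows they cover.
candidates = {
    "rule_a": [True, True, True, False, False, False],  # covers 3 rows, all positive
    "rule_b": [True, True, False, True, True, False],   # covers 4 rows, 3 positive
}

min_precision = 0.95
kept = [
    name
    for name, covered in candidates.items()
    if rule_precision(covered, model_pred) >= min_precision
]
print(kept)
```

Under this assumption, raising the threshold keeps only rules that rarely cover negative predictions, while lowering it admits rules that cover more positives but are less exact.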


### Interpretability: Inspect the rules

In [10]:
print(str(len(rules)) + " rules found")

4 rules found


In [11]:
for i in range(len(rules)):
    print("Rule " + str(i) + ": " + str(rules[i]))

Rule 0: capital_gain > 5095.5 & marital_status_married > 0.5
Rule 1: capital_gain > 7139.5 & marital_status_married <= 0.5
Rule 2: capital_gain <= 5095.5 & capital_loss > 1793.5 & education_Bachelors > 0.5 & marital_status_married > 0.5
Rule 3: capital_gain <= 5095.5 & capital_loss > 1793.5 & marital_status_married > 0.5 & occupation_Exec_managerial > 0.5
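Each printed rule is a conjunction of threshold conditions joined by `&`. As a rough illustration of how to read them, the sketch below parses two of the rules above and applies them to hypothetical records. It assumes that an instance is predicted positive when it satisfies at least one rule, and that feature names contain no spaces; the helper names are illustrative and not part of the TE2Rules API.

```python
import operator

# Comparison operators that appear in the printed rules.
OPS = {"<=": operator.le, ">": operator.gt}


def parse_rule(rule):
    """Split 'a > 1 & b <= 2' into a list of (feature, op, threshold) terms."""
    terms = []
    for term in rule.split(" & "):
        feature, op, threshold = term.split(" ")
        terms.append((feature, OPS[op], float(threshold)))
    return terms


def rule_matches(rule, record):
    """A rule matches only if every one of its conditions holds."""
    return all(
        op(record[feature], threshold)
        for feature, op, threshold in parse_rule(rule)
    )


def predict_with_rules(rules, record):
    """Predict positive (1) if any rule is satisfied, else negative (0)."""
    return int(any(rule_matches(rule, record) for rule in rules))


example_rules = [
    "capital_gain > 5095.5 & marital_status_married > 0.5",
    "capital_gain > 7139.5 & marital_status_married <= 0.5",
]

# Hypothetical records, not rows from the dataset.
person_a = {"capital_gain": 6000.0, "marital_status_married": 1}
person_b = {"capital_gain": 6000.0, "marital_status_married": 0}

print(predict_with_rules(example_rules, person_a))
print(predict_with_rules(example_rules, person_b))
```

Here `person_a` satisfies the first rule, while `person_b` satisfies neither: the capital gain is below the higher threshold that applies to unmarried individuals.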


### Faithfulness: Fidelity of the rules

In [12]:
fidelity, positive_fidelity, negative_fidelity = model_explainer.get_fidelity()

print("The rules explain " + str(round(fidelity*100, 2)) + "% of the overall predictions of the model")
print("The rules explain " + str(round(positive_fidelity*100, 2)) + "% of the positive predictions of the model")
print("The rules explain " + str(round(negative_fidelity*100, 2)) + "% of the negative predictions of the model")

The rules explain 99.57% of the overall predictions of the model
The rules explain 93.5% of the positive predictions of the model
The rules explain 99.96% of the negative predictions of the model
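The three numbers above can be read as agreement rates between the rules and the model. The sketch below computes them on toy data, assuming fidelity is the fraction of instances on which the rule-based prediction equals the model prediction, measured over all instances, over the model's positive predictions, and over its negative predictions. The `fidelity_scores` helper is hypothetical and is not the code behind `get_fidelity()`.

```python
def fidelity_scores(model_pred, rule_pred):
    """Return (overall, positive, negative) agreement between rules and model."""
    pairs = list(zip(model_pred, rule_pred))

    def agreement(subset):
        # Share of pairs where the rules reproduce the model's prediction.
        return sum(m == r for m, r in subset) / len(subset) if subset else 0.0

    overall = agreement(pairs)
    positive = agreement([(m, r) for m, r in pairs if m == 1])
    negative = agreement([(m, r) for m, r in pairs if m == 0])
    return overall, positive, negative


# Toy predictions: 1 = positive, 0 = negative.
model_pred = [1, 1, 1, 0, 0, 0, 0, 0]
rule_pred = [1, 1, 0, 0, 0, 0, 0, 1]

overall, positive, negative = fidelity_scores(model_pred, rule_pred)
print(round(overall, 2), round(positive, 2), round(negative, 2))
```

Because positives are the minority class in this dataset, overall fidelity is dominated by the negatives, which is why positive fidelity is worth reporting separately.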


In [13]:
model_explainer.longer_rules

['capital_gain > 7139.5 & marital_status_married <= 0.5',
 'age > 20.5 & capital_gain > 7139.5 & marital_status_married <= 0.5',
 'capital_gain > 5095.5 & marital_status_married > 0.5',
 'age <= 60.5 & capital_gain > 5095.5 & marital_status_married > 0.5',
 'age > 60.5 & capital_gain > 5095.5 & marital_status_married > 0.5',
 'age <= 86.5 & capital_gain > 5095.5 & marital_status_married > 0.5',
 'age > 86.5 & capital_gain > 5095.5 & marital_status_married > 0.5',
 'capital_gain > 7669.5 & marital_status_married <= 0.5',
 'age > 20.5 & capital_gain > 7669.5 & marital_status_married <= 0.5',
 'capital_gain > 5095.5 & marital_status_married > 0.5 & occupation_Farming_fishing <= 0.5',
 'capital_gain > 5095.5 & marital_status_married > 0.5 & occupation_Farming_fishing > 0.5',
 'capital_gain > 8296.0 & marital_status_married <= 0.5',
 'capital_gain > 5095.5 & education_School > 0.5 & marital_status_married > 0.5',
 'capital_gain > 5095.5 & marital_status_married > 0.5 & relationship_Not_in_f