# HELOC Service User

The objective is to provide a Service User / Customer who is seeking a loan with information on:  
* important factors in their application,  
* comparison with applications having similar financial attributes, and  
* corrective measures that the applicant can undertake should the application be rejected.

The Loan Officer is intended to use this notebook when responding to inquiries from a (potential) customer.

## ContrastiveExplanation
This notebook uses the `ContrastiveExplanation` Python package, created by Marcel Robeer, University of Utrecht.  It is necessary to clone the repository: 
https://github.com/MarcelRobeer/ContrastiveExplanation   
Please ensure that the current working directory contains the `contrastive_explanation` subdirectory.  
For further information, see:  
https://pythonawesome.com/contrastive-explanation-foil-trees-developed-at-tno-utrecht-university/
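
Before importing the package, it may help to confirm that the cloned subdirectory is where the notebook expects it. The following is a minimal sketch; the helper name `has_package_dir` is hypothetical and not part of `ContrastiveExplanation`.

```python
from pathlib import Path


def has_package_dir(base: Path, name: str = "contrastive_explanation") -> bool:
    """Return True if `base` contains a subdirectory called `name`."""
    return (base / name).is_dir()


# Check the current working directory, as the notebook requires.
if not has_package_dir(Path.cwd()):
    print("contrastive_explanation/ not found in the current working directory.")
```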

###### Contents
1. [Feature-focused Model](#feature-focused)

2. [Case-based Profiles](#case-profile)

3. [Corrective Measures](#corrective-measures)
---

Set a seed for reproducibility, and import packages.

In [1]:
import numpy as np
import pandas as pd
import os.path
import urllib.request
import pprint

from pandas import read_csv
from matplotlib import pyplot

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn import datasets, model_selection, ensemble, metrics, pipeline, preprocessing

from interpret.glassbox import ExplainableBoostingClassifier
from interpret.glassbox import DecisionListClassifier
from interpret.glassbox import LogisticRegression
from interpret import show

seed = 2022
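
The value of `seed` only has an effect where it is passed on explicitly, for example as a `random_state` argument. As a minimal sketch of what a fixed seed provides, two NumPy generators created from the same seed produce identical draws (the generator names below are illustrative):

```python
import numpy as np

seed = 2022

# Two independent generators built from the same seed ...
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

draws_a = rng_a.integers(0, 100, size=5)
draws_b = rng_b.integers(0, 100, size=5)

# ... yield the same sequence, which is what makes a run repeatable.
print(np.array_equal(draws_a, draws_b))
```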

To print the features of an instance together with their values, we define a function `print_sample()`:

In [2]:
# Created by Marcel Robeer, University of Utrecht
def print_sample(feature_names, sample):
    print('\n'.join(f'{name}: {value}' for name, value in zip(feature_names, sample)))

---

## 1. Feature-focused Model <a name="feature-focused"></a>

This section uses the Home Equity Line of Credit ([HELOC](https://community.fico.com/s/explainable-machine-learning-challenge?tabset-158d9=3)) data set, which was also used in the FICO Explainability Challenge. 

1. Load the data set (and select the features with the greatest overall mean absolute score)  
2. Train an Explainable Boosting Machine model on the data

#### 1.1 Load dataset as Dataframe

In [3]:
url = 'https://raw.githubusercontent.com/JJCoen/Interpretable-ML-Models/master/data/train_imp.csv'
train = pd.read_csv(url)
del train['Unnamed: 0']

url2 = 'https://raw.githubusercontent.com/JJCoen/Interpretable-ML-Models/master/data/test_imp.csv'
test = pd.read_csv(url2)
del test['Unnamed: 0']

Read in the top ten features identified by the Data Scientist.  The machine learning model is restricted to 10 features to facilitate the formulation of corrective measures with the Contrastive Explanation package.

In [4]:
from pathlib import Path  
filepath = Path('../data/top-10.csv')
feat_df = pd.read_csv(filepath, index_col=False, header=0)
feat_select = feat_df['names']
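
The contents of `top-10.csv` are not shown here. As a hedged sketch of how such a file could be produced, assume a table of per-feature importance scores with columns `names` and `scores` (the `scores` column and all values below are made up for illustration); the top features are then the rows with the largest scores.

```python
import pandas as pd

# Hypothetical importance table; the scores are invented for illustration.
importance = pd.DataFrame({
    "names": ["ExternalRiskEstimate", "AverageMInFile",
              "NetFractionRevolvingBurden", "PercentTradesWBalance"],
    "scores": [0.41, 0.18, 0.27, 0.09],
})

top_k = 3  # the notebook uses 10
top = importance.nlargest(top_k, "scores").reset_index(drop=True)

# Column 'names' is what the notebook reads into `feat_select`.
feat_select_demo = top["names"]
print(feat_select_demo.tolist())
```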

In [5]:
# Identify the label column and fix the order of the label names
label = train.columns[0]
print("Label names: ", train[label].unique() )
label_names = ['Good', 'Bad']
print("Label names, reference level first: ", label_names )

Label names:  ['Bad' 'Good']
Label names, reference level first:  ['Good', 'Bad']


The first element of the list `label_names` needs to be the reference (0) level.
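
One way to keep the integer encoding and `label_names` in step is to derive the mapping from the list itself, so that each label's code is its position in the list. This is a sketch on a toy series, not the notebook's own cell:

```python
import pandas as pd

label_names = ["Good", "Bad"]  # reference (0) level first

# Position in the list becomes the integer code: Good -> 0, Bad -> 1.
code_of = {name: code for code, name in enumerate(label_names)}

labels = pd.Series(["Bad", "Bad", "Good", "Good", "Good"], name="RiskPerformance")
encoded = labels.map(code_of)
print(encoded.tolist())  # [1, 1, 0, 0, 0]
```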

In [6]:
# class distribution
print(train.groupby('RiskPerformance').size())
print(test.groupby('RiskPerformance').size())

RiskPerformance
Bad     3852
Good    3551
dtype: int64
RiskPerformance
Bad     1284
Good    1184
dtype: int64


In [7]:
# Split-out labels in train
X_train = train[feat_select]
y_train = train[label].apply(lambda x: 0 if x == "Good" else 1) #Turning response into 0 and 1
print(X_train.shape)
print(train[label].head())
print(y_train.head())

(7403, 10)
0     Bad
1     Bad
2    Good
3    Good
4    Good
Name: RiskPerformance, dtype: object
0    1
1    1
2    0
3    0
4    0
Name: RiskPerformance, dtype: int64


In [10]:
# split out labels in test
X_test = test[feat_select]
y_test = test[label].apply(lambda x: 0 if x == "Good" else 1) #Turning response into 0 and 1
print(X_test.shape)
print(y_test.head())

(2468, 10)
0    1
1    1
2    0
3    1
4    1
Name: RiskPerformance, dtype: int64


#### 1.2 Explainable Boosting Machine

In [11]:
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(random_state=2024, n_jobs=-1)
ebm.fit(X_train, y_train)   #Works on dataframes and numpy arrays

ExplainableBoostingClassifier(binning='quantile', early_stopping_rounds=50,
                              early_stopping_tolerance=0.0001,
                              feature_names=['MSinceMostRecentInqexcl7days',
                                             'ExternalRiskEstimate',
                                             'NetFractionRevolvingBurden',
                                             'AverageMInFile',
                                             'NumBank2NatlTradesWHighUtilization',
                                             'MSinceOldestTradeOpen',
                                             'MaxDelq2PublicRecLast12M',
                                             'PercentTradesWBalance',
                                             'PercentTradesNev...
                                             'continuous', 'continuous',
                                             'interaction', 'interaction',
                                             'interaction', 'intera

In [12]:
# Print out the classifier performance (F1-score)
print('Classifier performance (F1):', metrics.f1_score(y_test, ebm.predict(X_test), average='weighted'))

Classifier performance (F1): 0.7334120811183514
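
For context, the score can be set against a trivial baseline. The sketch below computes the weighted F1 of a classifier that always predicts the majority class, using only the test-set class counts printed earlier (1284 'Bad', 1184 'Good'). The function name is illustrative.

```python
def majority_baseline_weighted_f1(counts: dict) -> float:
    """Weighted F1 of a model that always predicts the most frequent class."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)

    # For the majority class: recall is 1, precision is its share of the data.
    precision = counts[majority] / total
    recall = 1.0
    f1_majority = 2 * precision * recall / (precision + recall)

    # Every other class is never predicted, so its F1 is 0.
    # Weighting by support leaves only the majority-class term.
    return (counts[majority] / total) * f1_majority


test_counts = {"Bad": 1284, "Good": 1184}
print(round(majority_baseline_weighted_f1(test_counts), 3))
```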


---

## 2. Case-based Profiles <a name="case-profile"></a>

**Cases similar to the one under consideration**

Consider the first instance in the test set.  
1. Identify the decision rule that applies to the specific case.  
2. Select similar cases based on the decision rule predicates and feature values of the case.  
3. Use the EBM model to make predictions on the similar cases.

In [13]:
seed = 2022
dl = DecisionListClassifier(random_state=seed)
dl.fit(X_train, y_train)

dl_local = dl.explain_local(X_test[0:1], y_test[0:1])

show(dl_local)


Feature values for the first test instance (loan application):

In [14]:
print(X_test.loc[0])

MSinceMostRecentInqexcl7days            0.0
ExternalRiskEstimate                   59.0
NetFractionRevolvingBurden             62.0
AverageMInFile                         78.0
NumBank2NatlTradesWHighUtilization      3.0
MSinceOldestTradeOpen                 137.0
MaxDelq2PublicRecLast12M                4.0
PercentTradesWBalance                  94.0
PercentTradesNeverDelq                 91.0
MSinceMostRecentDelq                    1.0
Name: 0, dtype: float64


In [15]:
# Logical indexing to identify similar cases
similar_cases = X_test[(X_test.MSinceMostRecentInqexcl7days <= 1) 
       & (X_test.ExternalRiskEstimate <= 65) 
      & (X_test.NetFractionRevolvingBurden > 60)
      & (X_test.AverageMInFile > 70)
      & (X_test.NumBank2NatlTradesWHighUtilization == 3.0)
      & (X_test.MSinceOldestTradeOpen > 130.0)
      & (X_test.MaxDelq2PublicRecLast12M == 4.0)
      & (X_test.PercentTradesWBalance > 90.0)]
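
The same filter can be written in a data-driven way, which makes it easier to reuse for another applicant. This is a sketch only: `select_similar` and the predicate format are hypothetical, and the small frame below stands in for `X_test`.

```python
import operator

import pandas as pd

OPS = {"<=": operator.le, ">": operator.gt, "==": operator.eq}


def select_similar(df: pd.DataFrame, predicates) -> pd.DataFrame:
    """Keep the rows of `df` that satisfy every (column, op, value) predicate."""
    mask = pd.Series(True, index=df.index)
    for column, op, value in predicates:
        mask &= OPS[op](df[column], value)
    return df[mask]


# Toy stand-in for X_test with two of the ten features.
demo = pd.DataFrame({
    "ExternalRiskEstimate": [59.0, 72.0, 54.0],
    "NetFractionRevolvingBurden": [62.0, 43.0, 64.0],
})
predicates = [
    ("ExternalRiskEstimate", "<=", 65),
    ("NetFractionRevolvingBurden", ">", 60),
]
print(select_similar(demo, predicates).index.tolist())  # [0, 2]
```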

In [16]:
#print(similar_cases.to_html)
from IPython.display import display, HTML

display(HTML(similar_cases.to_html()))

| | MSinceMostRecentInqexcl7days | ExternalRiskEstimate | NetFractionRevolvingBurden | AverageMInFile | NumBank2NatlTradesWHighUtilization | MSinceOldestTradeOpen | MaxDelq2PublicRecLast12M | PercentTradesWBalance | PercentTradesNeverDelq | MSinceMostRecentDelq |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 59.0 | 62.0 | 78 | 3.0 | 137.0 | 4 | 94.0 | 91 | 1.0 |
| 301 | 0.0 | 54.0 | 64.0 | 101 | 3.0 | 184.0 | 4 | 93.0 | 88 | 4.0 |
| 949 | 0.0 | 59.0 | 88.0 | 76 | 3.0 | 244.0 | 4 | 93.0 | 80 | 6.0 |
| 1216 | 0.506368 | 59.0 | 80.0 | 87 | 3.0 | 255.0 | 4 | 100.0 | 76 | 1.0 |
| 1914 | 0.0 | 61.0 | 87.0 | 95 | 3.0 | 247.0 | 4 | 93.0 | 92 | 1.0 |


In [17]:
print("The cases above had predictions: ", ebm.predict(similar_cases), " respectively")

The cases above had predictions:  [1 1 1 1 1]  respectively
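
For a customer-facing answer, the integer predictions can be mapped back to their label names. A small sketch, assuming the encoding used in this notebook (0 = 'Good', 1 = 'Bad') and using the predictions printed above as a literal array:

```python
import numpy as np

label_names = ["Good", "Bad"]            # index = integer class code
predictions = np.array([1, 1, 1, 1, 1])  # copied from the output above

readable = [label_names[code] for code in predictions]
print(readable)  # ['Bad', 'Bad', 'Bad', 'Bad', 'Bad']
```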


## 3. Corrective Measures <a name="corrective-measures"></a>

#### Perform Contrastive Explanation

As an example, the third application in the test set was rejected (predicted to default).

Perform a contrastive explanation on the third test instance (`X_test.iloc[2]`) by wrapping the training dataframe in a `DomainMapperPandas` and then calling the method `ContrastiveExplanation.explain_instance_domain()`.

In [18]:
# Contrastive explanation
# Import
import contrastive_explanation as ce

# Select the sample to explain (the 'questioned data point'): why did the model predict the fact instead of the foil?
sample = X_test.iloc[2]
print_sample(feat_select, sample)

MSinceMostRecentInqexcl7days: 0.0
ExternalRiskEstimate: 72.0
NetFractionRevolvingBurden: 43.0
AverageMInFile: 53.0
NumBank2NatlTradesWHighUtilization: 3.0
MSinceOldestTradeOpen: 84.0
MaxDelq2PublicRecLast12M: 7.0
PercentTradesWBalance: 64.0
PercentTradesNeverDelq: 100.0
MSinceMostRecentDelq: 32.6627197015183


In [19]:
# Create a domain mapper (map the explanation to meaningful labels for explanation)
dm = ce.domain_mappers.DomainMapperPandas(X_train,
                                           contrast_names=label_names)

# Create the contrastive explanation object (default is a Foil Tree explanator)
exp = ce.ContrastiveExplanation(dm)

# Explain the instance (sample) for the given model
exp.explain_instance_domain(ebm.predict_proba, sample)

"The model predicted 'Bad' instead of 'Good' because 'MSinceMostRecentInqexcl7days <= 2.296 and NumBank2NatlTradesWHighUtilization > 1.59'"

The class predicted by the `ExplainableBoostingClassifier` was 'Bad', whereas 'Good' may have been expected. The instance was classified 'Bad' rather than 'Good' because its `MSinceMostRecentInqexcl7days` is less than or equal to 2.296 and its `NumBank2NatlTradesWHighUtilization` is greater than 1.59.   

In other words, if `MSinceMostRecentInqexcl7days` were raised above 2.296 and `NumBank2NatlTradesWHighUtilization` reduced to 1.59 or less, the explanation suggests that the EBM classifier would change the outcome to 'Good'.
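
The reasoning above can be checked mechanically. The sketch below encodes the reported rule as (feature, operator, threshold) triples, tests whether an applicant satisfies it, and derives the negated conditions that point towards the foil. The helper names are hypothetical and independent of the `contrastive_explanation` package; the feature values are those of the third test instance.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt}
NEGATE = {"<=": ">", ">": "<="}

# Rule reported for the fact ('Bad').
fact_rule = [
    ("MSinceMostRecentInqexcl7days", "<=", 2.296),
    ("NumBank2NatlTradesWHighUtilization", ">", 1.59),
]


def satisfies(applicant: dict, rule) -> bool:
    """True if the applicant meets every condition in the rule."""
    return all(OPS[op](applicant[feature], threshold)
               for feature, op, threshold in rule)


def negate(rule):
    """Flip each condition: '<=' becomes '>' and '>' becomes '<='."""
    return [(feature, NEGATE[op], threshold) for feature, op, threshold in rule]


applicant = {"MSinceMostRecentInqexcl7days": 0.0,
             "NumBank2NatlTradesWHighUtilization": 3.0}

print(satisfies(applicant, fact_rule))  # True: the 'Bad' rule applies
for feature, op, threshold in negate(fact_rule):
    print(f"{feature} {op} {threshold}")
```

Note that negating `> 1.59` gives `<= 1.59`, so a value of exactly 1.59 would already fall on the foil side of that condition.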

The cell below applies the changes suggested by the contrastive explanation.  This lets a loan applicant see which changes would lead to approval (a "non-default" / 0 classification).

In [20]:
sample_dict = {"MSinceMostRecentInqexcl7days": 7.0,
"ExternalRiskEstimate": 72.0,
"NetFractionRevolvingBurden": 43.0,
"AverageMInFile": 53.0,
"NumBank2NatlTradesWHighUtilization": 1.0,
"MSinceOldestTradeOpen": 84.0,
"MaxDelq2PublicRecLast12M": 7.0,
"PercentTradesWBalance": 64.0,
"PercentTradesNeverDelq": 100.0,
"MSinceMostRecentDelq": 32.6627197015183}
sample_update = pd.Series(sample_dict)
X_test_update = pd.DataFrame([sample_dict])
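
Retyping every value is error-prone. As an alternative sketch, the updated application can be derived from the original sample by overriding only the features that change. The `_demo` names are hypothetical, and only three of the ten features are listed for brevity.

```python
import pandas as pd

# Stand-in for the original sample (subset of the ten features).
sample_demo = pd.Series({
    "MSinceMostRecentInqexcl7days": 0.0,
    "ExternalRiskEstimate": 72.0,
    "NumBank2NatlTradesWHighUtilization": 3.0,
})

# Only the values suggested by the contrastive explanation are overridden.
changes = {
    "MSinceMostRecentInqexcl7days": 7.0,
    "NumBank2NatlTradesWHighUtilization": 1.0,
}

sample_update_demo = sample_demo.copy()
for feature, value in changes.items():
    sample_update_demo[feature] = value

# One-row frame with the same column order as the original sample.
X_update_demo = sample_update_demo.to_frame().T
print(X_update_demo)
```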

In [21]:
ebm_local = ebm.explain_local(X_test_update)
show(ebm_local)

The loan applicant now receives a "Good" (0) classification.  
Let the money roll!!!

In [22]:
# Explain the instance (sample) for the given model
exp.explain_instance_domain(ebm.predict_proba, sample_update)

"The model predicted 'Good' instead of 'Bad' because 'MSinceMostRecentInqexcl7days > 4.977'"

No need to apply corrective measures since the desired outcome has been achieved.