## DICE (Diverse Counterfactual Explanations)

DICE (Diverse Counterfactual Explanations) is a method for explaining individual predictions of a machine learning model. Rather than scoring feature importance, it generates counterfactual examples: modified versions of an input for which the model predicts a different outcome. Below is an example of how to use DICE with a GradientBoostingClassifier on a binary classification problem derived from the scikit-learn Diabetes dataset.

In this code:

* We set a binary threshold on the median of y to create a binary classification target variable y_binary, where data points with y values above the median are labeled as 1 (high risk), and data points below or equal to the median are labeled as 0 (low risk).

* We train a Gradient Boosting Classifier (GradientBoostingClassifier) on the modified target variable y_binary.

* We calculate the accuracy score on the test data to evaluate the classifier's performance.

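* We wrap the data in dice_ml.Data (specifying the dataframe, the continuous features, and the outcome name) and the model in dice_ml.Model (specifying the backend), then create a DICE explainer from both using dice_ml.Dice.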

* We use the explainer's generate_counterfactuals method to generate 10 counterfactual examples of the opposite class for the chosen test instance, allowing every feature except age and sex to vary.
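
The median thresholding described in the first bullet can be checked on a toy array. This is a minimal sketch with made-up values (`y_toy` is hypothetical and not taken from the Diabetes dataset); it shows that values equal to the median fall into class 0 because the comparison is strict.

```python
import numpy as np

# Toy target values for illustration only (not from the Diabetes dataset)
y_toy = np.array([50.0, 120.0, 120.0, 200.0, 310.0])

# The median is the middle value of the sorted array
threshold = np.median(y_toy)

# Strictly above the median -> 1 (high risk); at or below the median -> 0 (low risk)
y_toy_binary = (y_toy > threshold).astype(int)

print("threshold:", threshold)
print("labels:", y_toy_binary.tolist())
```

Because ties at the median are assigned to class 0, the two classes are not guaranteed to be exactly the same size.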

In [1]:
!pip install dice_ml

Collecting dice_ml
  Obtaining dependency information for dice_ml from https://files.pythonhosted.org/packages/69/1c/ec136743072d7b4917d72d975e094c8dc9bce86920519aff97854a7dc3ce/dice_ml-0.11-py3-none-any.whl.metadata
  Downloading dice_ml-0.11-py3-none-any.whl.metadata (20 kB)
Collecting pandas<2.0.0 (from dice_ml)
  Downloading pandas-1.5.3-cp311-cp311-macosx_11_0_arm64.whl (10.8 MB)
Collecting raiutils>=0.4.0 (from dice_ml)
  Obtaining dependency information for raiutils>=0.4.0 from https://files.pythonhosted.org/packages/b9/56/93e179d1702ebf669f158b29f14d43cf340ccf5fa50c774c3e571caf4e61/raiutils-0.4.1-py3-none-any.whl.metadata
  Downloading raiutils-0.4.1-py3-none-any.whl.metadata (1.4 kB)
Downloading dice_ml-0.11-py3-none-any.whl (2.5 MB)

In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import dice_ml

In [3]:
random_state = 42

# Load the Diabetes dataset
diabetes_data = load_diabetes()
X = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
y = diabetes_data.target

# Set a threshold for binary classification (e.g., using the median of y)
threshold = np.median(y)
y_binary = (y > threshold).astype(int)  # 1 for high risk, 0 for low risk

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=random_state)

# Create and fit the Gradient Boosting Classifier model
model = GradientBoostingClassifier(n_estimators=100, random_state=random_state)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)  # Use binary target variable here
print("Accuracy on Test Data:", accuracy)

Accuracy on Test Data: 0.7191011235955056
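
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy obtained by always predicting the most frequent class. `majority_baseline_accuracy` is a hypothetical helper introduced here for illustration, and the labels are toy values rather than the notebook's `y_test`.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class in y_true."""
    y_true = np.asarray(y_true)
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / y_true.size

# Toy labels for illustration only
toy_labels = [0, 0, 1, 1, 1]
baseline = majority_baseline_accuracy(toy_labels)
print("Majority-class baseline accuracy:", baseline)
```

In the notebook, calling the helper with `y_test` would give a reference point for the reported test accuracy. Since the target was split at the median, that baseline should be roughly one half.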


In [4]:
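# Create a DICE data interface from the training split.
# X_train and y_train share the same row order, so the target is attached directly;
# concatenating the full X with y_train would misalign rows and leave NaN targets.
train_df = X_train.copy()
train_df['target'] = y_train

# load_diabetes() returns mean-centered, scaled features, so ranges in raw units
# (e.g. age in years) do not apply; DICE infers feature ranges from the dataframe.
d = dice_ml.Data(dataframe=train_df,
                 continuous_features=['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'],
                 outcome_name='target')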

# Create a DICE model interface
m = dice_ml.Model(model=model, backend="sklearn")

# Create a DICE explainer
exp = dice_ml.Dice(d, m)

# Choose a prediction to explain (e.g., the first test data point)
query_instance = X_test.iloc[0:1]

# Generate 10 counterfactual examples 
counterfactuals = exp.generate_counterfactuals(query_instance, total_CFs=10, desired_class='opposite',
                                               features_to_vary=['bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'])

# List the counterfactual examples
counterfactuals.visualize_as_dataframe()


100%|██████████| 1/1 [00:00<00:00, 15.96it/s]

Query instance (original outcome : 0)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | 0.125019 | 0.125198 | 0.019187 | 0.034309 | 0.032432 | -0.00522 | 0 |

Diverse Counterfactual set (new outcome: 1)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | 0.125019 | 0.068381 | 0.019187 | 0.034309 | 0.057561 | -0.00522 | 1 |
| 1 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | 0.125019 | 0.013368 | 0.019187 | 0.034309 | 0.032432 | -0.00522 | 1 |
| 2 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | 0.125019 | 0.125198 | 0.019187 | 0.034309 | 0.017763 | 0.115748 | 1 |
| 3 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | -0.058066 | 0.125198 | 0.019187 | 0.034309 | 0.032432 | -0.074289 | 1 |
| 4 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | 0.125019 | -0.060902 | 0.019187 | 0.034309 | -0.117409 | -0.00522 | 1 |
| 5 | 0.045341 | -0.044642 | -0.006206 | 0.054822 | 0.125019 | 0.125198 | -0.016366 | 0.034309 | 0.032432 | -0.00522 | 1 |
| 6 | 0.045341 | -0.044642 | -0.029556 | -0.015999 | 0.125019 | -0.093493 | 0.019187 | 0.034309 | 0.032432 | -0.00522 | 1 |
| 7 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | 0.125019 | 0.125198 | 0.019187 | 0.034309 | 0.032432 | 0.064414 | 1 |
| 8 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | 0.098354 | 0.051654 | 0.019187 | 0.034309 | 0.032432 | -0.00522 | 1 |
| 9 | 0.045341 | -0.044642 | -0.006206 | -0.015999 | 0.125019 | 0.125198 | 0.019187 | 0.034309 | 0.088542 | -0.00522 | 1 |
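
Counterfactual rows are easier to read as differences from the query instance. The sketch below uses plain pandas on the query and the first three counterfactuals, copied by hand from the output above. `changed_features` is a hypothetical helper, not part of dice_ml. In a live session the counterfactual frame can probably be read from the result object instead (for example via `counterfactuals.cf_examples_list[0].final_cfs_df`), but treat that attribute path as an assumption to verify against the installed dice_ml version.

```python
import pandas as pd

feature_cols = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]

# Query instance, copied from the output above
query = pd.Series(
    [0.045341, -0.044642, -0.006206, -0.015999, 0.125019,
     0.125198, 0.019187, 0.034309, 0.032432, -0.00522],
    index=feature_cols,
)

# First three counterfactuals, copied from the output above
cfs = pd.DataFrame(
    [
        [0.045341, -0.044642, -0.006206, -0.015999, 0.125019,
         0.068381, 0.019187, 0.034309, 0.057561, -0.00522],
        [0.045341, -0.044642, -0.006206, -0.015999, 0.125019,
         0.013368, 0.019187, 0.034309, 0.032432, -0.00522],
        [0.045341, -0.044642, -0.006206, -0.015999, 0.125019,
         0.125198, 0.019187, 0.034309, 0.017763, 0.115748],
    ],
    columns=feature_cols,
)

def changed_features(query, cfs, tol=1e-9):
    """Map each counterfactual index to {feature: change} for features that moved."""
    changes = {}
    for idx, row in cfs.iterrows():
        delta = row - query                # per-feature difference from the query
        moved = delta[delta.abs() > tol]   # keep only features that actually changed
        changes[idx] = {name: round(float(d), 6) for name, d in moved.items()}
    return changes

changes = changed_features(query, cfs)
for idx, moved in changes.items():
    print(idx, moved)
```

On these rows the helper reports s2 and s5 for the first counterfactual, s2 alone for the second, and s5 and s6 for the third. Each counterfactual changes only one or two features relative to the query, and age and sex stay fixed throughout, matching the features_to_vary setting.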
