# Data Centric XAI part - 1
## CHAPTER 03 - *Data Centric Approaches*

From **Applied Machine Learning Explainability Techniques** by [**Aditya Bhattacharya**](https://www.linkedin.com/in/aditya-bhattacharya-b59155b6/), published by **Packt**

### Objective

In this notebook, we will try to implement some of the concepts related to Data-Centric XAI as discussed in Chapter 3 - Data Centric Approaches.

### Installing the modules

Install the following libraries in Google Colab or your local environment, if not already installed.

In [None]:
!pip install --upgrade pandas numpy matplotlib seaborn scikit-learn deepchecks

### Loading the modules

In [51]:
from deepchecks import Dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from deepchecks.suites import single_dataset_integrity, train_test_validation, model_evaluation
from deepchecks.checks import TrustScoreComparison
import scipy.stats as stats
from IPython.display import display

### About the data

**Breast Cancer Wisconsin (Diagnostic) Data Set - UCI Machine Learning Repository**

This dataset, also known as the *Breast Cancer* dataset, is used to predict the presence of breast cancer. It is a multivariate dataset for classification problems and contains 30 features. More details about the data can be found at [https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic))

### Loading the data

In [4]:
data  = datasets.load_breast_cancer(as_frame=True).frame
data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [5]:
data.shape

(569, 31)

In [6]:
data.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

# Exploring Data-Centric Explainability Approaches

There are many ways to implement the concepts covered in the chapter, but I find the [Deepchecks open-source Python framework](https://deepchecks.com/) a fantastic library for implementing most of the concepts related to analyzing data consistency and data purity. Its out-of-the-box API methods enable us to perform these important steps in very few lines of code. In this notebook, we will apply the Deepchecks framework to the Breast Cancer dataset.

In [21]:
# Let's prepare the datasets and the models
label = 'target'
train_df, test_df = train_test_split(data, test_size=0.2, random_state=123) # Performing an 80-20 split

#Creating Deepchecks object
train = Dataset(train_df, label=label)
test = Dataset(test_df, label=label)

# training and testing dataframes
x_train = train_df.drop(label, axis = 1)
y_train = train_df[label]
x_test = test_df.drop(label, axis = 1)
y_test = test_df[label]

# Model Training
model = RandomForestClassifier()
model.fit(x_train, y_train)

RandomForestClassifier()

In [25]:
# Model Evaluation
model.score(x_test, y_test)

0.9912280701754386

Fortunately, we have a very good model with about 99% accuracy on the unseen data. We can therefore expect very few issues related to data purity and consistency, but let us still validate this using the Deepchecks framework.

# Data Purity Check with Deepchecks' Single Dataset Integrity Suite

In [16]:
# On the training set
purity_check = single_dataset_integrity()
purity_check.run(train_dataset = train)

Single Dataset Integrity Suite:   0%|          | 0/8 [00:00<?, ? Check/s]

Status,Check,Condition,More Info
✓,Single Value in Column - Train Dataset,Does not contain only a single value for all columns,
✓,Mixed Nulls - Train Dataset,Not more than 1 different null types for all columns,
✓,Mixed Data Types - Train Dataset,Rare data types in all columns are either more than 10.00% or less than 1.00% of the data,
✓,String Mismatch - Train Dataset,No string variants for all columns,
✓,Data Duplicates - Train Dataset,Duplicate data is not greater than 0%,
✓,String Length Out Of Bounds - Train Dataset,Ratio of outliers not greater than 0% string length outliers for all columns,
✓,Special Characters - Train Dataset,Ratio of entirely special character samples not greater than 0.10% for all columns,
✓,Label Ambiguity - Train Dataset,Ambiguous sample ratio is not greater than 0%,


Check,Reason
Single Value in Column - Train Dataset,Nothing found
Mixed Nulls - Train Dataset,Nothing found
Mixed Data Types - Train Dataset,Nothing found
String Mismatch - Train Dataset,Nothing found
Data Duplicates - Train Dataset,Nothing found
String Length Out Of Bounds - Train Dataset,Nothing found
Special Characters - Train Dataset,Nothing found
Label Ambiguity - Train Dataset,Nothing found


In [17]:
# On the testing set
purity_check.run(test_dataset = test)

Single Dataset Integrity Suite:   0%|          | 0/8 [00:00<?, ? Check/s]

Status,Check,Condition,More Info
✓,Single Value in Column - Test Dataset,Does not contain only a single value for all columns,
✓,Mixed Nulls - Test Dataset,Not more than 1 different null types for all columns,
✓,Mixed Data Types - Test Dataset,Rare data types in all columns are either more than 10.00% or less than 1.00% of the data,
✓,String Mismatch - Test Dataset,No string variants for all columns,
✓,Data Duplicates - Test Dataset,Duplicate data is not greater than 0%,
✓,String Length Out Of Bounds - Test Dataset,Ratio of outliers not greater than 0% string length outliers for all columns,
✓,Special Characters - Test Dataset,Ratio of entirely special character samples not greater than 0.10% for all columns,
✓,Label Ambiguity - Test Dataset,Ambiguous sample ratio is not greater than 0%,


Check,Reason
Single Value in Column - Test Dataset,Nothing found
Mixed Nulls - Test Dataset,Nothing found
Mixed Data Types - Test Dataset,Nothing found
String Mismatch - Test Dataset,Nothing found
Data Duplicates - Test Dataset,Nothing found
String Length Out Of Bounds - Test Dataset,Nothing found
Special Characters - Test Dataset,Nothing found
Label Ambiguity - Test Dataset,Nothing found


As the previous step shows, a single line of code gives us a thorough analysis of the dataset for missing values, duplicates, label ambiguity, and other common data integrity issues. If your problem frequently requires some other measure of data integrity, I strongly recommend reaching out to the owners of the framework with a feature request, or even contributing yourself by raising a pull request to help evolve this unified framework.
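
To make it more concrete what such a suite looks for, here is a minimal pandas sketch of a few similar checks: missing values, duplicate rows, single-value columns, and label ambiguity. The helper name `basic_purity_report` and the toy data are hypothetical and made up for illustration. This is a rough approximation of the ideas, not the Deepchecks implementation or API.

```python
import pandas as pd


def basic_purity_report(df: pd.DataFrame, label: str) -> dict:
    """Hypothetical helper: rough stand-ins for a few data integrity checks."""
    features = [c for c in df.columns if c != label]
    # Rows that share identical feature values but carry more than one
    # distinct label are treated as ambiguous.
    labels_per_group = df.groupby(features)[label].nunique()
    ambiguous_groups = int((labels_per_group > 1).sum())
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "single_value_columns": [c for c in df.columns if df[c].nunique() <= 1],
        "ambiguous_feature_groups": ambiguous_groups,
    }


# Toy data (made up for illustration) containing one issue of each kind.
toy = pd.DataFrame(
    {
        "a": [1.0, 1.0, 2.0, 2.0, None],
        "b": [5, 5, 5, 5, 5],
        "target": [0, 0, 1, 0, 1],
    }
)
report = basic_purity_report(toy, label="target")
print(report)
```

On the Breast Cancer data, the same helper could be called as `basic_purity_report(train_df, label='target')`. A dedicated framework remains preferable for real work, since it covers many more cases (mixed data types, string variants, special characters) than this sketch does.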

# Data Consistency Check using Deepchecks

Now we will use the Train Test Validation Suite to detect data drift, data distribution issues, data leakage, and other data consistency issues between the training and the inference data, again with very few lines of code.

In [26]:
data_consistency_check = train_test_validation()
data_consistency_check.run(model=model, train_dataset=train, test_dataset=test)

Train Test Validation Suite:   0%|          | 0/14 [00:00<?, ? Check/s]

Status,Check,Condition,More Info
✖,Dominant Frequency Change,Change in ratio of dominant value in data not more than 25.00%,"Found columns with high change in dominant value: ['mean concavity', 'mean concave points', 'concavity error', 'concave points error', 'worst concavity', 'worst concave points']"
✖,Single Feature Contribution Train-Test,Train-Test features' Predictive Power Score (PPS) difference is not greater than 0.2,Features with PPS difference above threshold: worst concavity
✓,Train Test Drift,PSI and Earth Mover's Distance cannot be greater than 0.2 and 0.1 respectively,
✓,Train Test Label Drift,PSI and Earth Mover's Distance for label drift cannot be greater than 0.2 or 0.1 respectively,
✓,Whole Dataset Drift,Drift value is not greater than 0.25,
✓,Datasets Size Comparison,Test-Train size ratio is not smaller than 0.01,
✓,Single Feature Contribution Train-Test,Train features' Predictive Power Score (PPS) is not greater than 0.7,
✓,Category Mismatch Train Test,Ratio of samples with a new category is not greater than 0% for all columns,
✓,New Label Train Test,Number of new label values is not greater than 0,
✓,String Mismatch Comparison,No new variants allowed in test data for all columns,


Column,Value,Train data %,Test data %,Train data #,Test data #,P value
mean concave points,0.0,1.98,3.51,9,4,0.53
worst concave points,0.0,1.98,3.51,9,4,0.53
mean concavity,0.0,1.98,3.51,9,4,0.53
worst concavity,0.0,1.98,3.51,9,4,0.53
concavity error,0.0,1.98,3.51,9,4,0.53
concave points error,0.0,1.98,3.51,9,4,0.53


Unnamed: 0,Train,Test
Size,455,114


Check,Reason
Date Train Test Leakage Duplicates,DeepchecksValueError: Check requires dataset to have a datetime column
Date Train Test Leakage Overlap,DeepchecksValueError: Check requires dataset to have a datetime column
Identifier Leakage,DeepchecksValueError: Dataset needs to have a date or index column.
Identifier Leakage,DeepchecksValueError: Dataset needs to have a date or index column.
Index Train Test Leakage,DeepchecksValueError: Check requires dataset to have an index column
Category Mismatch Train Test,Nothing found
New Label Train Test,Nothing found
String Mismatch Comparison,Nothing found
Train Test Samples Mix,Nothing found


This was a very robust check for data consistency, and the report is interesting. The summary gives us a glimpse of what we need to focus on and what we can ignore. For this dataset, the main concerns are the Dominant Frequency Change and Single Feature Contribution checks, which fail for the features listed in the report and suggest that the results are sensitive to those features. The difference in Predictive Power Score (PPS) between the train and test data points to a possible data leakage issue, but there is no significant feature drift or concept drift. Further analysis can certainly be done on the issues found, but to keep things simple and easy to understand for readers of all levels, I do not recommend over-analyzing this use case. I strongly recommend visiting the Deepchecks documentation to learn more: https://docs.deepchecks.com/.
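
The drift checks in the report mention the Population Stability Index (PSI) with a threshold of 0.2. As a minimal sketch of how such a score can be computed for one numeric feature, the snippet below bins the reference sample by its own quantiles and compares bin fractions. The function name, the quantile binning, and the `eps` floor are assumptions made for illustration. Deepchecks may bin and smooth differently, so the numbers will not necessarily match its report.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two 1-D samples, using bin edges taken from `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Quantile-based bin edges from the reference (training) sample.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))
    # Clip the comparison sample into the reference range so every value is counted.
    actual = np.clip(actual, edges[0], edges[-1])
    exp_frac = np.histogram(expected, bins=edges)[0] / expected.size
    act_frac = np.histogram(actual, bins=edges)[0] / actual.size
    # Floor the fractions to avoid log(0) and division by zero for empty bins.
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


# Synthetic data (not the Breast Cancer features) to show the behaviour.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)
same_dist = rng.normal(0.0, 1.0, size=5000)
shifted = rng.normal(1.0, 1.0, size=5000)

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (same distribution): {psi_same:.4f}")
print(f"PSI (shifted by one std): {psi_shifted:.4f}")
```

For a feature from this notebook, the call would look like `population_stability_index(train_df['worst radius'], test_df['worst radius'])`.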

Next let us review the trust score comparison between the train and test dataset.

### Trust Score Distribution

In [31]:
trust_score_distribution = TrustScoreComparison(min_test_samples = 100)
trust_score_distribution.run(train, test, model)

Unnamed: 0,Trust Score,Model Prediction,target,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
356,1.08,1,1,13.05,18.59,85.09,512.0,0.11,0.13,0.1,0.06,0.2,0.07,0.31,1.51,2.59,21.57,0.01,0.04,0.05,0.02,0.03,0.01,14.19,24.85,94.22,591.2,0.13,0.27,0.26,0.13,0.31,0.08
518,1.06,1,1,12.88,18.22,84.45,493.1,0.12,0.17,0.05,0.05,0.17,0.07,0.44,1.17,3.18,34.37,0.01,0.02,0.01,0.01,0.02,0.0,15.05,24.37,99.31,674.7,0.15,0.3,0.12,0.11,0.26,0.09
496,1.05,1,1,12.65,18.17,82.69,485.6,0.11,0.13,0.08,0.05,0.16,0.07,0.23,0.63,1.7,18.4,0.01,0.03,0.03,0.01,0.02,0.0,14.38,22.15,95.29,633.7,0.15,0.38,0.36,0.14,0.32,0.1
396,0.99,1,1,13.51,18.89,88.1,558.1,0.11,0.11,0.09,0.05,0.18,0.06,0.21,1.33,1.51,19.29,0.01,0.02,0.03,0.01,0.01,0.0,14.8,27.2,97.33,675.2,0.14,0.26,0.34,0.15,0.27,0.08
205,0.98,0,0,15.12,16.68,98.78,716.6,0.09,0.1,0.08,0.04,0.16,0.06,0.27,0.36,1.97,26.44,0.01,0.02,0.02,0.01,0.02,0.0,17.77,20.24,117.7,989.5,0.15,0.33,0.33,0.13,0.34,0.1


Unnamed: 0,Trust Score,Model Prediction,target,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
175,2.94,1,1,8.67,14.45,54.42,227.2,0.09,0.04,0.0,0.0,0.17,0.07,0.22,0.79,1.44,11.36,0.01,0.01,0.0,0.0,0.03,0.0,9.26,17.04,58.36,259.2,0.12,0.07,0.0,0.0,0.26,0.08
327,2.84,1,1,12.03,17.93,76.09,446.0,0.08,0.04,0.0,0.01,0.14,0.06,0.23,0.91,1.47,16.97,0.0,0.01,0.0,0.0,0.01,0.0,13.07,22.25,82.74,523.4,0.1,0.07,0.01,0.03,0.22,0.07
162,2.79,0,0,19.59,18.15,130.7,1214.0,0.11,0.17,0.25,0.13,0.2,0.06,0.74,1.05,4.79,97.07,0.0,0.02,0.04,0.01,0.02,0.0,26.73,26.39,174.9,2232.0,0.14,0.38,0.68,0.22,0.36,0.09
272,2.74,0,0,21.75,20.99,147.3,1491.0,0.09,0.2,0.22,0.11,0.17,0.06,1.17,1.35,8.87,156.8,0.01,0.05,0.06,0.02,0.02,0.0,28.19,28.18,195.9,2384.0,0.13,0.47,0.58,0.18,0.28,0.09
159,2.74,1,1,10.9,12.96,68.69,366.8,0.08,0.04,0.0,0.01,0.14,0.06,0.28,0.76,1.81,18.54,0.01,0.01,0.0,0.0,0.02,0.0,12.36,18.2,78.07,470.0,0.12,0.08,0.02,0.04,0.27,0.08
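
For intuition about the values in the two tables above: as far as I understand, the trust score follows the idea from Jiang et al. (2018), where a prediction is scored by the ratio of the distance to the nearest class other than the predicted one to the distance to the predicted class. Higher values mean the sample sits more clearly inside its predicted class. The snippet below is a deliberately simplified sketch of that ratio using plain nearest-neighbour distances. The function name and toy points are hypothetical, and the original method additionally filters out low-density training points, which is omitted here.

```python
import numpy as np


def simple_trust_scores(x_train, y_train, x_query, y_pred, eps=1e-12):
    """Simplified trust score: nearest other-class distance / predicted-class distance."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    x_query = np.asarray(x_query, dtype=float)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_train)
    # Distance from every query point to its nearest training point of each class.
    nearest = np.empty((x_query.shape[0], classes.size))
    for j, cls in enumerate(classes):
        members = x_train[y_train == cls]
        diffs = x_query[:, None, :] - members[None, :, :]
        nearest[:, j] = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    pred_idx = np.searchsorted(classes, y_pred)
    rows = np.arange(x_query.shape[0])
    d_pred = nearest[rows, pred_idx]
    # Mask the predicted class, then take the closest remaining class.
    others = nearest.copy()
    others[rows, pred_idx] = np.inf
    d_other = others.min(axis=1)
    return d_other / (d_pred + eps)


# Toy 2-D points (made up): class 0 near the origin, class 1 near (5, 5).
toy_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
toy_y = [0, 0, 1, 1]
queries = [[0.5, 0.5], [2.5, 3.0]]  # deep inside class 0 vs. on the boundary
scores = simple_trust_scores(toy_x, toy_y, queries, y_pred=[0, 0])
print(scores)
```

The first query lies well inside class 0 and gets a high score, while the second is equally far from both classes and scores close to 1, which is the region where a prediction deserves more scrutiny.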


# Data Forecastability using Deepchecks Model Evaluation

In [33]:
data_forecastability_check = model_evaluation()
data_forecastability_check.run(model=model, train_dataset=train, test_dataset=test)

Model Evaluation Suite:   0%|          | 0/12 [00:00<?, ? Check/s]

Status,Check,Condition,More Info
!,Unused Features,Number of high variance unused features is not greater than 5,"Found ['symmetry error', 'smoothness error', 'texture error', 'concave points error', 'fractal dimension error', 'mean smoothness', 'mean fractal dimension', 'mean symmetry', 'concavity error', 'compactness error', 'perimeter error'] unused high variance features"
✓,Performance Report,Train-Test scores relative degradation is not greater than 0.1,
✓,ROC Report - Train Dataset,Not less than 0.7 AUC score for all the classes,
✓,ROC Report - Test Dataset,Not less than 0.7 AUC score for all the classes,
✓,Simple Model Comparison,Model performance gain over simple model must be at least 10.00%,
✓,Model Error Analysis,The performance of the detected segments must not differ by more than 5.00%,
✓,Model Inference Time Check - Train Dataset,Average model inference time for one sample is not greater than 0.001,
✓,Model Inference Time Check - Test Dataset,Average model inference time for one sample is not greater than 0.001,


Check,Reason
Trust Score Comparison,"DeepchecksValueError: Number of samples in test dataset have not passed the minimum. you can change minimum samples needed to run with parameter ""min_test_samples"""
Regression Systematic Error,"DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary"
Regression Systematic Error,"DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary"
Regression Error Distribution,"DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary"
Regression Error Distribution,"DeepchecksValueError: Expected model to be a type from ['regression'], but received model of type: binary"
Boosting Overfit,DeepchecksValueError: Unsupported model of type: RandomForestClassifier


The Deepchecks model evaluation report lets us inspect model performance in detail across the metrics used for a classification problem. Custom metrics can also be used to evaluate data forecastability. Overall, the dataset is good and well curated, as is evident from the high model accuracy on the test data. We do see some unused features; these can be ignored here, but they would deserve a closer look if the model accuracy were not good enough.

# Data Profiling

Now, let me show a small demo of simple data profiling. Although the dataset has many features, we will pick the top three based on feature importance, create data profiles for the training and test sets, and compare the two profiles to look for any inconsistency.

In [36]:
# From the Data Forecastability section, the top three features are as follows:
important_features = ['worst radius', 'mean concave points', 'worst concave points']
data_profiling_train_df = train_df[important_features]
data_profiling_test_df = test_df[important_features]

In [37]:
data_profiling_train_df.head()

Unnamed: 0,worst radius,mean concave points,worst concave points
190,15.74,0.06618,0.1772
134,22.52,0.06847,0.1379
386,13.13,0.02534,0.0914
118,20.19,0.09479,0.2034
316,12.85,0.005051,0.01852


In [38]:
data_profiling_test_df.head()

Unnamed: 0,worst radius,mean concave points,worst concave points
333,12.76,0.002941,0.01667
273,10.75,0.01407,0.05159
201,20.42,0.07488,0.1939
178,14.0,0.001852,0.009259
85,22.93,0.08795,0.1642


Next, we will build a very simple data profile using common statistical measures: mean, median, and coefficient of variation. The appropriate complexity of the statistical measures may vary from dataset to dataset and from use case to use case.

In [61]:
def build_data_profile(df):
    '''
    Method to build statistical data profiles
    '''
    profile_parameter = []
    profile_value = []
    for feature in df.columns:
        # Mean
        profile_parameter.append('mean_'+ feature)
        profile_value.append(np.mean(df[feature]))
        # Median
        profile_parameter.append('median_'+ feature)
        profile_value.append(np.median(df[feature]))
        # Coefficient of variation (standard deviation divided by the mean)
        profile_parameter.append('cov_'+ feature)
        profile_value.append(np.std(df[feature]) / np.mean(df[feature]))
     
    data_profile_df = pd.DataFrame([profile_value], columns = profile_parameter)
    return data_profile_df


In [62]:
train_profile = build_data_profile(data_profiling_train_df)
train_profile

Unnamed: 0,mean_worst radius,median_worst radius,cov_worst radius,mean_mean concave points,median_mean concave points,cov_mean concave points,mean_worst concave points,median_worst concave points,cov_worst concave points
0,16.349314,15.05,0.297475,0.050018,0.034,0.777637,0.116275,0.1001,0.567213


In [63]:
test_profile = build_data_profile(data_profiling_test_df)
test_profile

Unnamed: 0,mean_worst radius,median_worst radius,cov_worst radius,mean_mean concave points,median_mean concave points,cov_mean concave points,mean_worst concave points,median_worst concave points,cov_worst concave points
0,15.949395,14.34,0.293118,0.044534,0.032035,0.851979,0.107946,0.09833,0.594014


In [64]:
# Calculate absolute percentage change in the DataFrame column values
rows = []
for col in train_profile.columns:
    train_value = train_profile[col].iloc[0]
    test_value = test_profile[col].iloc[0]
    rows.append({'Parameter': col, '% change': np.abs((test_value - train_value) / train_value * 100)})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
pct_df = pd.DataFrame(rows, columns=['Parameter', '% change'])

display(pct_df)

Unnamed: 0,Parameter,% change
0,mean_worst radius,2.446094
1,median_worst radius,4.717608
2,cov_worst radius,1.464589
3,mean_mean concave points,10.964007
4,median_mean concave points,5.779412
5,cov_mean concave points,9.559889
6,mean_worst concave points,7.162983
7,median_worst concave points,1.768232
8,cov_worst concave points,4.724976


From the above table, we can clearly observe that, for every statistical measure, the absolute percentage change in the profile values between the train and test sets is below 11%, well under the 20% threshold. An absolute percentage change of more than 20% would indicate the presence of some data drift in the dataset.
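
The threshold rule can also be applied programmatically. The following is a small sketch that recomputes the percentage change for three of the profile values shown in the tables above and flags any that exceed the cut-off. The names `DRIFT_THRESHOLD_PCT` and `drift_report` are hypothetical, and the 20% value is simply the rule of thumb from the text, not a universal constant.

```python
import pandas as pd

# A subset of the profile values, copied from the train and test profile tables.
train_values = pd.Series(
    {
        "mean_worst radius": 16.349314,
        "mean_mean concave points": 0.050018,
        "cov_mean concave points": 0.777637,
    }
)
test_values = pd.Series(
    {
        "mean_worst radius": 15.949395,
        "mean_mean concave points": 0.044534,
        "cov_mean concave points": 0.851979,
    }
)

DRIFT_THRESHOLD_PCT = 20.0  # assumed cut-off, taken from the text above

# Absolute percentage change of each test value relative to the train value.
pct_change = ((test_values - train_values) / train_values).abs() * 100
drift_report = pd.DataFrame(
    {"% change": pct_change, "drift_suspected": pct_change > DRIFT_THRESHOLD_PCT}
)
print(drift_report)
```

None of the three parameters is flagged, which agrees with the table. In practice, the threshold should be tuned per feature and per use case rather than applied blindly.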

## Final Thoughts

In this notebook, we have learnt how the Deepchecks framework can be used effectively to apply data-centric explainability methods. Where the framework does not provide a built-in, out-of-the-box API for a concept, we can always take a custom approach similar to the one in the Data Profiling section. Overall, detecting data consistency issues (such as data drift and data leakage), detecting data purity issues (such as missing values, duplicate values, and outliers), and assessing data forecastability using model evaluation metrics are essential measures that provide valuable explainability for our models and algorithms with respect to the underlying dataset.

## Reference

1. UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)
2. Deepchecks Open Source Python Framework - https://deepchecks.com/
3. Some of the utility functions and code are taken from the GitHub repository of the author, Aditya Bhattacharya - https://github.com/adib0073