<!-- Simon-Style -->
<p style="font-size:19px; text-align:left; margin-top:    15px;"><i>German Association of Actuaries (DAV) — Working Group "Explainable Artificial Intelligence"</i></p>
<p style="font-size:25px; text-align:left; margin-bottom: 15px"><b>Use Case SOA GLTD Experience Study:<br>
PDPs and impossible data for categorical variables
</b></p>
<p style="font-size:19px; text-align:left; margin-bottom: 15px">Guido Grützner (<a href="mailto:guido.gruetzner@quantakt.com">guido.gruetzner@quantakt.com</a>)</p>

$\newcommand{\expect}[1]{\mathbb{E}{\left[#1\right]}}$

# Introduction

This notebook discusses Partial Dependence Plots (PDP) with dependent inputs for the special case of categorical data. We use a simple case study with real actuarial data (GLTD use case) to demonstrate issues which will arise (or have arisen) in many practical applications. In particular, we discuss:
* What is meant by "impossible data" and how it is encountered in practice.
* How PDPs, as they are typically defined and implemented in standard libraries such as Scikit-learn, rely on impossible data.
* Why and how impossible data, and hence the use of PDPs, is misleading, as demonstrated in the example.
* Propose a simple and straightforward alternative to PDPs for categorical data.

A note on terminology: This notebook does not show any plots. We only discuss the *functions* which will ultimately provide the data for actual plots. It would be natural to call them partial dependence functions, but this abbreviates to *pdf*, which may be confusing. So we slightly generalize terminology and abbreviate those functions by PDP as well. We use the general term *marginal functions* for functions of a single input, which are somehow derived from a function with more than a single input. 

## Definition of PDP and impossible data

For the sake of simplicity, and because generalization is straightforward, we discuss only the case where our function or model of interest has two categorical variables as input:
$$ f:\mathcal{X}_A\times\mathcal{X}_B\rightarrow\mathbb{R}\quad, \quad (A,B)\mapsto f(A,B)$$
where $\mathcal{X}_A$ respectively $\mathcal{X}_B$ are the possible levels of each categorical variable.

In this case, there are two PDPs, one for each input. The PDP depending only on $A$ is called $f_A$, the other, depending only on $B$, is called $f_B$. The standard definition of PDP is based on the empirical measure, i.e. it is calculated using an iid sample $(a_k,b_k)\in\mathcal{X}_A\times\mathcal{X}_B$, $k=1,\ldots, N$, of size $N$
from the domain of definition of $f$ as follows:
\begin{align*}
\text{PDP: } \quad &  f_A(a_j) = \frac 1 N \sum_{k=1}^N f(a_j,b_k) \\
    & f_B(b_k) = \frac 1 N \sum_{j=1}^N f(a_j,b_k).
\end{align*} 
This expression is just a special case of the more general principle of "integrating out" or "marginalizing" a variable. Appreciating this, the PDP can be understood and written as an expectation:
$$ f_A(a) = \mathbb{E}_B[f(a,B)]$$
where $\mathbb{E}_B$ is the expectation with respect to the marginal distribution of $B$ and $a$ one of the levels from $\mathcal{X}_A$ of the first input. 
Another function, derived from $f$, which also depends only on a single argument, is the conditional expectation
$$ \tilde{f}_A(a) = \expect{f(A,B) \vert A=a}.$$ 
The difference between these expressions is that the first uses only the marginal distribution of $B$, whereas the second uses the joint distribution of $(A,B)$. If $A$ and $B$ are independent, both definitions agree, i.e. in that case $f_A=\tilde{f}_A.$ But in general, if the inputs are dependent, the functions $f_A$ and $\tilde{f}_A$ will differ. This notebook does nothing more than explore this difference in a concrete case study.

Notice that the first definition of PDP, using the sum over the $b_k$, implicitly assumes independence. The factor $\frac 1 N$ in front of an unweighted sum yields a valid average or expectation only if each term in the sum has equal probability. But the terms of the sum include combinations $(a_j, b_k)$ with $j\neq k$, which are different from the observed combinations $(a_k, b_k).$ For dependent inputs, these "new" combinations may not have identical probabilities of occurrence, or may not occur in the data at all. Treating them as an iid sample nonetheless creates the "impossible data problem".
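
The difference between the two marginal functions can be seen in a minimal sketch. The 2-by-2 model values `f` and the joint distribution `p_joint` are hypothetical numbers chosen for illustration; they are not taken from the GLTD data.

```python
import numpy as np

# Hypothetical model outputs f(a, b) on a 2-by-2 grid (rows: a, columns: b).
f = np.array([[1.0, 3.0],
              [2.0, 6.0]])

# Hypothetical dependent joint distribution of (A, B): most mass on the diagonal.
p_joint = np.array([[0.45, 0.05],
                    [0.05, 0.45]])

p_a = p_joint.sum(axis=1)  # marginal distribution of A
p_b = p_joint.sum(axis=0)  # marginal distribution of B

# PDP: average f(a, .) with the marginal weights of B, ignoring the dependence.
pdp_a = f @ p_b

# Conditional expectation: average f(a, .) with the weights P(B = b | A = a).
p_b_given_a = p_joint / p_a[:, None]
cond_a = (f * p_b_given_a).sum(axis=1)

# Under independence the joint is the outer product of the margins,
# and the conditional expectation coincides with the PDP.
p_indep = np.outer(p_a, p_b)
cond_a_indep = (f * p_indep / p_a[:, None]).sum(axis=1)

print("PDP:                       ", pdp_a)
print("conditional (dependent):   ", cond_a)
print("conditional (independent): ", cond_a_indep)
```

In this toy setting both margins are uniform, so the PDP weights each column equally, while the conditional expectation puts 90% of the weight on the diagonal cell of each row.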

## Discrete probabilities

This notebook studies a particularly simple case, categorical inputs, where all probabilities and expectations, conditional or not, can be computed by elementary arithmetic. In our example, the two inputs have three levels each, which means there are in total 3-by-3 or nine possible combinations. Recall that the probabilities of these combinations, which form the joint probability distribution, can be found by normalizing a crosstabulation of the data sample. This means nothing more than counting how often each combination occurs in the sample (crosstabulation) and then dividing by the total sample size (normalizing). The same is done for the three levels making up each of the two marginal distributions. An important advantage of discrete distributions over continuous ones is the possibility to calculate conditional probabilities and conditional expectations directly from a sample. Conditional probabilities, in terms of the 3-by-3 table of our example, are just the probabilities in a row or column, turned into a proper probability distribution in their own right by normalizing their sum to one, i.e. by dividing each probability by its row or column sum, which is the respective marginal probability. You can check and verify these (simple) calculations in detail in this notebook.
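
The counting and normalizing steps can be sketched on a tiny sample. The data frame `sample` and its column names `A` and `B` are hypothetical and only serve to illustrate the arithmetic.

```python
import pandas as pd

# Hypothetical toy sample of two categorical inputs.
sample = pd.DataFrame({
    "A": ["x", "x", "x", "y", "y", "z", "z", "z", "z", "z"],
    "B": ["u", "u", "v", "v", "v", "u", "w", "w", "w", "w"],
})

# Joint distribution: count each combination, then divide by the sample size.
p_joint = pd.crosstab(sample["A"], sample["B"], normalize=True)

# Marginal distributions: row sums and column sums of the joint table.
p_a = p_joint.sum(axis=1)
p_b = p_joint.sum(axis=0)

# Conditional distribution of B given A: divide each row by its row sum.
p_b_given_a = p_joint.div(p_a, axis=0)

print(p_joint)
print(p_b_given_a)
```

Each row of `p_b_given_a` sums to one, and combinations that never occur in the sample, such as `("y", "u")` here, keep probability zero.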

# Initialisation

In [1]:
from sklearn.inspection import partial_dependence
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from IPython.display import display, display_html

import numpy as np
import pandas as pd
pd.options.mode.copy_on_write = True

## Brief description of use case and data

To understand this notebook, no knowledge of the source of the data or its intended use is required. We just provide limited background for completeness.

The dataset, which is provided with this notebook, is an extract of a much larger dataset by the Society of Actuaries. The original data can be found at the Society of Actuaries' website under ["2019 Group Long-Term Disability Experience Study Preliminary Report"](https://www.soa.org/resources/experience-studies/2019/group-ltd-experience-study/). The goal of the study is to predict recovery probabilities of Group Long Term Disability claims for the US market for further use in pricing and reserving.

For our purposes, only the structure of the data is relevant. It consists of observations (rows) with two categorical columns, which are the inputs, and a corresponding numeric column taking the values 0 and 1, which is the response. A value of 1 means a recovery has occurred, a value of 0 means no recovery. The goal is to predict the probability of recovery, given the inputs. Hence, it is a standard Bernoulli regression. We are not interested in the model per se or its predictive quality since, in this simplified setting, it is more or less trivial anyway. We just use this simplified data and the model to demonstrate the issues with PDPs.

Adjust the path to the input file in the following block, if necessary. 

In [2]:
data = pd.read_feather("./GLTD_SSA_extract.feather")
print(f"There are {len(data)} observations in the data.")
data.dtypes

There are 1910337 observations in the data.


Updated_SSA     category
Original_SSA    category
Recovery           int64
dtype: object

In [3]:
# single quotes inside the f-strings keep the code valid before Python 3.12
print(f"The variable `Original_SSA` has the categories {list(data['Original_SSA'].cat.categories)}")
print(f"The variable `Updated_SSA` has the categories {list(data['Updated_SSA'].cat.categories)}")
print(f"The variable `Recovery` takes the values {list(data['Recovery'].unique())}")

The variable `Original_SSA` has the categories ['N', 'U', 'Y']
The variable `Updated_SSA` has the categories ['No', 'Unknown', 'Yes']
The variable `Recovery` takes the values [0, 1]


# Observed probabilities

In a first step, the observations are aggregated as a table. Normalizing the table entries gives the joint empirical probability distribution of the input variables in the sample. Doing the same for the margins gives the marginal empirical probabilities. 

In [4]:
nm_0 = "Original_SSA"
nm_1 = "Updated_SSA"
allcategories = [data[cat].cat.categories for cat in [nm_0, nm_1]]

# crosstab does all the work of binning the dataframe, note we normalize and
# calculate the margins as well in one go. 
tmp = pd.crosstab(data[nm_0], data[nm_1], 
                      normalize=True, margins=True)
# extract marginal probabilities
p_0 = tmp.loc[allcategories[0], "All"]
p_0.index.name = nm_0
p_0.name = "margin"
p_1 = tmp.loc["All", allcategories[1]]
p_1.index.name = nm_1
p_1.name = "margin"
# p_margin contains two Series over different indices, hence not a DataFrame
p_margin = {nm_0: p_0, nm_1: p_1}

# extract the joint probability
tmp.drop(["All"], axis=1,inplace=True)
tmp.drop(["All"], axis=0,inplace=True)
p_observed = tmp.stack()
p_observed.name = "joint"

The table of joint probabilities and the two marginal probability tables are displayed below. The "impossible" combinations, i.e. those which have not been observed, have probability zero and are coloured <span style="color:red">red</span>. 

In [5]:
# apply the conditional formatting
styled_df = p_observed.unstack().style.apply(
    lambda row: [None if el>0 else "color: red" for el in row])
styled_df = styled_df.set_table_attributes("style='display:inline'")
styled_m0 = p_margin[nm_0].to_frame().style.set_table_attributes("style='display:inline'")
styled_m1 = p_margin[nm_1].to_frame().style.set_table_attributes("style='display:inline'")
display_html(styled_df._repr_html_() 
             + styled_m0._repr_html_()
             + styled_m1._repr_html_(), raw=True)

| Original_SSA \ Updated_SSA | No | Unknown | Yes |
|---|---|---|---|
| N | 0.117028 | 0.0 | 0.0 |
| U | 0.0 | 0.066749 | 0.0 |
| Y | 0.129784 | 0.015232 | 0.671206 |

| Original_SSA | margin |
|---|---|
| N | 0.117028 |
| U | 0.066749 |
| Y | 0.816223 |

| Updated_SSA | margin |
|---|---|
| No | 0.246812 |
| Unknown | 0.081982 |
| Yes | 0.671206 |


Notice that 4 of the 9 entries have probability 0. This means the data contains no records with these combinations of categories. Given that the dataset is quite large, it is reasonable to assume that these observations are not missing by chance, but because these combinations are impossible. This hunch can be justified with some background knowledge about disability insurance. SSA here means Social Security Award. To claim disability benefits (["awards"](https://www.ssa.gov/oact/progdata/awardDef.html)) from the US Social Security Administration, claimants have to undergo a procedure which decides whether their claims are justified or not. The result of this procedure, i.e. claim accepted or not, is quite informative for the status of the insured with respect to other benefit claims, such as a claim under a group long-term disability policy. With this background, it becomes quite clear that the zeros are not missing at random. If the original award status is known to be N (=No), the update cannot be "Yes" or "Unknown", because there is nothing to update or know in the first place. The same holds for an original status of U (=Unknown). The update can then only be "Unknown" as well, because if it were otherwise ("No", "Yes"), there would have to be a known original award.

## Probability distribution assuming independent margins

The product probabilities are found by multiplication of the marginal probabilities.

In [6]:
# Multiindex for broadcasting
midx_prod = pd.MultiIndex.from_product([p_margin[nm_0].index,p_margin[nm_1].index])

# broadcast resp. reindex and multiply
p_indep = p_margin[nm_0].reindex(midx_prod, level=0).multiply(p_margin[nm_1])
p_indep.name = "p_indep"
p_indep.index.names = [nm_0, nm_1]
display(p_indep.unstack())

| Original_SSA \ Updated_SSA | No | Unknown | Yes |
|---|---|---|---|
| N | 0.028884 | 0.009594 | 0.07855 |
| U | 0.016475 | 0.005472 | 0.044803 |
| Y | 0.201454 | 0.066915 | 0.547854 |


Notice that these probabilities are quite different from the observed probabilities. Not only have the previous zero probabilities been filled with non-zero values, but all other probabilities have changed as well. For example, "(N, No)" has changed from 11.7% observed to 2.9% under independence. This is one of the two big problems with PDPs. If the inputs are dependent, PDPs rely on "artificial" independent joint probabilities, which are unrelated to anything in the real world. In our use case, they assign non-zero probabilities to impossible events and distort the probabilities of the observed events.
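
To put a number on this distortion, the following sketch recomputes the product distribution from the rounded joint table printed earlier and sums the probability mass that independence assigns to the never-observed combinations. It uses plain NumPy instead of the pandas objects of the notebook; the variable names are chosen for illustration only.

```python
import numpy as np

# Observed joint probabilities, copied (rounded) from the table printed earlier.
# Rows: Original_SSA (N, U, Y); columns: Updated_SSA (No, Unknown, Yes).
p_observed = np.array([
    [0.117028, 0.0,      0.0],
    [0.0,      0.066749, 0.0],
    [0.129784, 0.015232, 0.671206],
])

p_0 = p_observed.sum(axis=1)  # margin of Original_SSA
p_1 = p_observed.sum(axis=0)  # margin of Updated_SSA

# Product distribution under the independence assumption.
p_indep = np.outer(p_0, p_1)

# Cells that are never observed but receive positive mass under independence.
impossible = (p_observed == 0) & (p_indep > 0)
mass_on_impossible = p_indep[impossible].sum()

print(f"Unobserved cells with positive mass: {impossible.sum()}")
print(f"Mass on unobserved combinations: {mass_on_impossible:.3f}")
```

With these rounded inputs, roughly 15% of the product distribution sits on combinations that never occur in the data.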

# Model

A plain-vanilla logistic regression model is fitted. The reader should notice that a logistic model predicts probabilities, in this case the probability of recovery. This probability is completely different from the probabilities calculated in the sections above. Those were the probabilities, or empirical frequencies, of the values of the inputs, i.e. of the variables "Original_SSA" and "Updated_SSA" in the data. It is somewhat unfortunate, and purely coincidental, that the dependent variable in this regression example is also a probability. This is not relevant anywhere, neither for the calculations nor for the interpretation, and can safely be ignored by the reader.

In [7]:
# Data
alllevels = [sorted(data[vnm].unique().tolist()) for vnm in [nm_0, nm_1]]

# perform one hot encoding for the selected inputs in alllevels 
ohe_coding = ColumnTransformer(
    [("", OneHotEncoder(
            drop="first",
            dtype=int,
            sparse_output=False, 
            categories=alllevels),
        [nm_0, nm_1])],
    remainder="drop", verbose_feature_names_out=False)

# set up encoding and fit as a pipeline
rf = Pipeline(
    [   ("preprocess", ohe_coding),
        ("classifier", LogisticRegression(penalty=None, fit_intercept=True))])

# since this example is not concerned about prediction quality or
# overfitting no train/test split is made and all data used for fit  
xtrain = data[[nm_0, nm_1]]
ytrain = data["Recovery"]

rf.fit(xtrain, ytrain)
tmp = p_indep.index.to_frame()

# put the fitted model outputs, i.e. the predicted probabilities, into a Series
predictions = pd.Series(rf.predict_proba(tmp)[:,1], index=p_indep.index)
predictions.name = "predictions"

It always makes sense to compare model predictions with the actual data. An estimate for the probability of recovery for each combination of categories is the average of the observed recoveries in this class. 

In [8]:
# Group by the categorical variables and calculate the mean for each group.
actual = data.groupby([nm_0,nm_1], observed=True)["Recovery"].mean()

We can now compare the predictions with the estimates from above. Note that the predicted and actual values are quite close. This is no surprise: there is more than enough data available to determine the 1+2+2=5 parameters of the logistic regression. But note also that the model can make "predictions", or perhaps more appropriately can produce values, for the impossible combinations. And while it makes sense to assign a definite probability, i.e. zero, to these combinations, it makes no sense to assign any "Actual" recoveries to them, not even zero. This is why the observed recoveries at these combinations are NaN.

In [9]:
df1 = predictions.unstack().style
df1 = df1.set_table_attributes("style='display:inline'").set_caption("<b>Predictions</b>")
df2 = actual.unstack().style
df2 = df2.set_table_attributes("style='display:inline'").set_caption("<b>Actual</b>")
display_html(df1._repr_html_() + df2._repr_html_(), raw=True)

**Predictions**

| Original_SSA \ Updated_SSA | No | Unknown | Yes |
|---|---|---|---|
| N | 0.044172 | 0.054167 | 0.058865 |
| U | 0.071542 | 0.087165 | 0.09444 |
| Y | 0.003093 | 0.00383 | 0.004182 |

**Actual**

| Original_SSA \ Updated_SSA | No | Unknown | Yes |
|---|---|---|---|
| N | 0.044225 | NaN | NaN |
| U | NaN | 0.087167 | NaN |
| Y | 0.002952 | 0.00488 | 0.004186 |


The tables above demonstrate the second big problem with PDPs for dependent data. They evaluate the model on combinations which are impossible, as is the case here, and use these outputs in the result. Since the model has not been trained on these data points, there is no assurance whatsoever about the resulting values. They depend on "wild" extrapolation, which is unconstrained by data. In the categorical case, even implicit assurances such as smoothness are not available because, in general, the evaluation at one discrete data point carries no information about another discrete data point.

# Partial Dependence

We derive, and later compare, four different ways to estimate marginal functions:
* SKL: The implementation of PDPs in Scikit-learn for categorical inputs.
* Permutation: The estimate of the PDP based on the permutation of one input. This is the original PDP definition.
* Independent: A direct calculation using independent marginal probabilities.
* Observed: The alternative to PDPs using conditional expectations based on the actually observed joint probability of the inputs.   

## SKL

In [10]:
tmp = partial_dependence(rf, xtrain, nm_0,
                            categorical_features=[nm_0, nm_1],
                            feature_names=[nm_0, nm_1],
                            response_method="predict_proba",
                            method="brute")
res_0 = pd.Series(tmp["average"][0], index=tmp["grid_values"])
tmp = partial_dependence(rf, xtrain, nm_1,
                            categorical_features=[nm_0, nm_1],
                            feature_names=[nm_0, nm_1],
                            response_method="predict_proba",
                            method="brute")
res_1 = pd.Series(tmp["average"][0], index=tmp["grid_values"])

pdp_skl = pd.concat([res_0, res_1], keys=[nm_0, nm_1]).to_frame()
pdp_skl.columns = ["value"]
pdp_skl.index.names = ["categories", "variable"]
pdp_skl

| categories | variable | value |
|---|---|---|
| Original_SSA | N | 0.054854 |
| Original_SSA | U | 0.088192 |
| Original_SSA | Y | 0.003884 |
| Updated_SSA | No | 0.012469 |
| Updated_SSA | Unknown | 0.015283 |
| Updated_SSA | Yes | 0.016606 |


## Permutation

In [11]:
# create data with permutation of a variable
permudata = data.copy()
permudata[nm_0] = pd.Categorical(permudata[nm_0].sample(frac=1))
# make prediction on permuted inputs
permudata["f"] = rf.predict_proba(permudata)[:,1]
# integrals are obtained by averaging, 
# mean is here OK because data is (assumed to be) iid
res_0 = permudata.groupby(nm_0, observed=True)["f"].mean()
res_1 = permudata.groupby(nm_1, observed=True)["f"].mean()

pdp_permutation = pd.concat([res_0, res_1], keys=[nm_0, nm_1]).to_frame()
pdp_permutation.columns = ["value"]
pdp_permutation.index.names = ["categories", "variable"]
pdp_permutation

| categories | variable | value |
|---|---|---|
| Original_SSA | N | 0.054849 |
| Original_SSA | U | 0.088197 |
| Original_SSA | Y | 0.003884 |
| Updated_SSA | No | 0.012475 |
| Updated_SSA | Unknown | 0.015247 |
| Updated_SSA | Yes | 0.016608 |


## Independent

The two PDPs $f_A$ and $f_B$ are just the conditional expectations of the prediction with respect to the two variables using `p_indep`, i.e. the probabilities with independent margins.
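
Written out for the discrete case, the calculation in the next cell corresponds to the following formulas. The symbols $p_A$ and $p_B$ for the marginal probabilities are notation introduced here for illustration; the product $p_A(a)\,p_B(b)$ is the independent joint probability stored in `p_indep`.

\begin{align*}
f_A(a) &= \frac{1}{p_A(a)} \sum_{b\in\mathcal{X}_B} f(a,b)\, p_A(a)\, p_B(b) = \sum_{b\in\mathcal{X}_B} f(a,b)\, p_B(b), \\
f_B(b) &= \frac{1}{p_B(b)} \sum_{a\in\mathcal{X}_A} f(a,b)\, p_A(a)\, p_B(b) = \sum_{a\in\mathcal{X}_A} f(a,b)\, p_A(a).
\end{align*}

The marginal probability of the conditioning variable cancels, so each PDP is simply the model output averaged with the marginal weights of the other input.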

In [12]:
f_weighted = predictions.multiply(p_indep, axis=0)
res_0 = f_weighted.groupby([nm_0]).sum().divide(p_margin[nm_0], axis=0) 
res_1 = f_weighted.groupby([nm_1]).sum().divide(p_margin[nm_1], axis=0)

pdp_indep = pd.concat([res_0, res_1], keys=[nm_0, nm_1]).to_frame()
pdp_indep.columns = ["value"]
pdp_indep.index.names = ["categories", "variable"]
pdp_indep

| categories | variable | value |
|---|---|---|
| Original_SSA | N | 0.054854 |
| Original_SSA | U | 0.088192 |
| Original_SSA | Y | 0.003884 |
| Updated_SSA | No | 0.012469 |
| Updated_SSA | Unknown | 0.015283 |
| Updated_SSA | Yes | 0.016606 |


## Observed

Exactly the same calculation as above, but the conditional expectation functions now use the observed joint distribution, i.e. ``p_observed``, instead of the independent margins.

In [13]:
f_weighted = predictions.multiply(p_observed, axis=0)
res_0 = f_weighted.groupby([nm_0]).sum().divide(p_margin[nm_0], axis=0) 
res_1 = f_weighted.groupby([nm_1]).sum().divide(p_margin[nm_1], axis=0)

pdp_observed = pd.concat([res_0, res_1], keys=[nm_0, nm_1]).to_frame()
pdp_observed.columns = ["value"]
pdp_observed.index.names = ["categories", "variable"]
pdp_observed

| categories | variable | value |
|---|---|---|
| Original_SSA | N | 0.044172 |
| Original_SSA | U | 0.087165 |
| Original_SSA | Y | 0.004002 |
| Updated_SSA | No | 0.022571 |
| Updated_SSA | Unknown | 0.071682 |
| Updated_SSA | Yes | 0.004182 |
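
As a cross-check, these values can be reproduced by hand from the rounded joint probabilities and model predictions printed earlier in this notebook. The sketch below uses plain NumPy arrays with hard-coded, rounded numbers, so agreement is only expected up to rounding.

```python
import numpy as np

# Rounded values copied from the tables printed earlier in this notebook.
# Rows: Original_SSA (N, U, Y); columns: Updated_SSA (No, Unknown, Yes).
p_observed = np.array([
    [0.117028, 0.0,      0.0],
    [0.0,      0.066749, 0.0],
    [0.129784, 0.015232, 0.671206],
])
predictions = np.array([
    [0.044172, 0.054167, 0.058865],
    [0.071542, 0.087165, 0.094440],
    [0.003093, 0.003830, 0.004182],
])

# Weight each prediction with the observed joint probability of its cell.
weighted = predictions * p_observed

# Conditional expectations: sum over the other variable, divide by the margin.
cond_0 = weighted.sum(axis=1) / p_observed.sum(axis=1)  # by Original_SSA
cond_1 = weighted.sum(axis=0) / p_observed.sum(axis=0)  # by Updated_SSA

print("Original_SSA (N, U, Y):       ", cond_0.round(6))
print("Updated_SSA (No, Unknown, Yes):", cond_1.round(6))
```

Predictions at the impossible combinations are multiplied by zero and therefore do not enter the result at all.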


## Comparison of results

In [14]:
pdp = pd.concat([pdp_skl, pdp_permutation, pdp_indep, pdp_observed], 
                keys=["SKL", "permutation", "PDP", "Exact"], names=["method"])
pdp = pdp.reorder_levels(["method", "categories", "variable"])

The first comparison is between "SKL", i.e. the method used in Scikit-learn, and the explicit calculation using independent probabilities, "PDP". It shows that the results are indeed identical to the displayed precision. In contrast to estimates of the PDP involving permutations (as discussed next), the Scikit-learn method uses the conditional expectation under independent probabilities. To quote from the documentation of version 1.5.2 of the (internal) function `_partial_dependence_brute` in `sklearn.inspection`:
> ...for each value in `grid`, the method will average the prediction of each
        sample from `X` having that grid value for `features`.  

In [15]:
tmp = pdp.loc[["SKL", "PDP"]]
tmp.unstack(0)

| categories | variable | SKL | PDP |
|---|---|---|---|
| Original_SSA | N | 0.054854 | 0.054854 |
| Original_SSA | U | 0.088192 | 0.088192 |
| Original_SSA | Y | 0.003884 | 0.003884 |
| Updated_SSA | No | 0.012469 | 0.012469 |
| Updated_SSA | Unknown | 0.015283 | 0.015283 |
| Updated_SSA | Yes | 0.016606 | 0.016606 |


Next we compare the permutation method, by which PDPs are usually defined, with the direct calculation under independent probabilities ("PDP"). The permutation method involves sampling, so we cannot expect exact equality, but the values are convincingly close. One can conclude that integration by permutation is indeed an approximation to the calculation based on the conditional expectation under independence.

In [16]:
tmp = pdp.loc[["permutation", "PDP"]]
tmp.unstack(0)

| categories | variable | permutation | PDP |
|---|---|---|---|
| Original_SSA | N | 0.054849 | 0.054854 |
| Original_SSA | U | 0.088197 | 0.088192 |
| Original_SSA | Y | 0.003884 | 0.003884 |
| Updated_SSA | No | 0.012475 | 0.012469 |
| Updated_SSA | Unknown | 0.015247 | 0.015283 |
| Updated_SSA | Yes | 0.016608 | 0.016606 |
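
Why permutation approximates the average under independent margins can be illustrated with a small simulation. The model values `f`, the joint distribution `p_joint` and the sample size are hypothetical and unrelated to the GLTD data; the random seed is fixed only to make the sketch reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model outputs f(a, b) and a dependent joint distribution.
f = np.array([[1.0, 3.0],
              [2.0, 6.0]])
p_joint = np.array([[0.45, 0.05],
                    [0.05, 0.45]])

# Draw a sample of (a, b) index pairs from the joint distribution.
n = 200_000
cells = rng.choice(p_joint.size, size=n, p=p_joint.ravel())
a, b = np.unravel_index(cells, p_joint.shape)

# Permuting one column breaks the dependence between the two inputs.
a_perm = rng.permutation(a)
values = f[a_perm, b]

# Group means over the permuted column estimate the PDP of the first input.
pdp_perm = np.array([values[a_perm == level].mean() for level in range(2)])

# Direct calculation with the marginal weights of the second input.
pdp_direct = f @ p_joint.sum(axis=0)

print("permutation:", pdp_perm)
print("direct:     ", pdp_direct)
```

The two results agree up to sampling error, which shrinks as the sample size grows.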


Finally, we compare the marginal functions under observed and independent probabilities. There are some striking differences. In particular, the "Updated_SSA" values differ by a factor of roughly two to five! To further illustrate what is going on (or going wrong?), the actual recoveries aggregated over the margins are included. Clearly, the calculation using the observed probabilities, and predictions only at actually occurring data points, is much closer to the actual values than the PDP.

In [17]:
tmp = p_observed.multiply(actual).dropna()
res_0 = tmp.groupby(nm_0).sum().divide(p_margin[nm_0])
res_1 = tmp.groupby(nm_1).sum().divide(p_margin[nm_1])

pdp_actual = pd.concat([res_0, res_1], keys=[nm_0, nm_1])
pdp_actual.name = "Actual"
pdp_actual.index.names = ["variable", "categories"]
tmp = pdp.loc[["PDP", "Exact"]].unstack(["method"]).droplevel(0, axis=1)
pd.concat([tmp, pdp_actual], axis=1).style.set_caption(
    "<b>Comparison between PDP, exact calculation and Actuals</b>")

**Comparison between PDP, exact calculation and Actuals**

|  |  | PDP | Exact | Actual |
|---|---|---|---|---|
| Original_SSA | N | 0.054854 | 0.044172 | 0.044225 |
| Original_SSA | U | 0.088192 | 0.087165 | 0.087167 |
| Original_SSA | Y | 0.003884 | 0.004002 | 0.004003 |
| Updated_SSA | No | 0.012469 | 0.022571 | 0.022522 |
| Updated_SSA | Unknown | 0.015283 | 0.071682 | 0.071878 |
| Updated_SSA | Yes | 0.016606 | 0.004182 | 0.004186 |


# Conclusions

We claim that PDPs for dependent inputs may be misleading, and the notebook demonstrates this with a simple case study. PDPs on dependent inputs may be misleading because:
1. They use distorted probabilities, i.e. probabilities which are different from the probabilities of the phenomenon of interest.
2. They may incorporate model evaluations from inputs which are logically impossible.

This has the following consequences:
* Models with very similar predictive performance and structure may produce very different PDPs, suggesting differences which in reality do not exist.
* Either the PDP cannot be compared to actual data (for impossible points), or the comparison may be distorted by the different underlying probabilities, i.e. observed ones for the data, independent ones for the PDP. This may suggest a lack of fit although the model fits well, or it may hide a lack of fit.

The reliance on extrapolation has even inspired adversarial attacks. In those attacks, the extrapolation to unseen data is exploited to manipulate the PDP, for example to hide a dependence of the model on undesirable, e.g. protected, features.

The distortion of probabilities remains even when the independent distribution does not put mass on impossible combinations. If the inputs are dependent, the observed joint distribution differs from the independent one, and this difference distorts the averaging involved in the calculation of the PDP.

## Better ways to define marginal functions

The easiest way to avoid problems with impossible data is not to have it in the first place. The fact that 4 out of 9 combinations of the SSA variables are impossible is a consequence of the unfortunate encoding with two categorical variables. Without loss of information or any reduction in explainability, one could code the five possible combinations into one single variable "SSA" with 5 categories labelled (N, No), (U, Unknown), (Y, No), (Y, Unknown) and (Y, Yes). This would eliminate ALL the problems discussed in this notebook for this variable.
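
A minimal sketch of such a recoding is shown below. The toy data frame `df` and the label format of the combined levels are assumptions made for illustration; only the two column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy extract with the two original columns.
df = pd.DataFrame({
    "Original_SSA": ["N", "U", "Y", "Y", "Y"],
    "Updated_SSA": ["No", "Unknown", "No", "Unknown", "Yes"],
})

# Combine both columns into a single categorical variable with one level per
# observed combination; impossible combinations never become levels.
df["SSA"] = (
    df["Original_SSA"].astype(str) + ", " + df["Updated_SSA"].astype(str)
).astype("category")

print(list(df["SSA"].cat.categories))
```

With a single input there is nothing left to marginalize over, so the PDP and the conditional expectation of this variable coincide with the model prediction per level.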

The next best thing is to use conditional expectations based on the real, i.e. observed, probabilities. This is better than PDPs because:
* It uses only data which is relevant for the use case, not artificially constructed probabilities and model evaluations.
* It can be compared to raw averages directly derived from the data.
* Even though you have to calculate the conditional expectations yourself and cannot use the standard implementations of PDP, it is really easy to do this.
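
As a sketch of how little code this requires: averaging the model output over the observed rows, grouped by one input, is already the conditional expectation under the observed joint distribution. The toy data frame and its column names (`f` for the model output, `y` for the response) are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two dependent categorical inputs, a stand-in
# column "f" for the model output and a binary response "y".
df = pd.DataFrame({
    "A": ["x", "x", "x", "y", "y", "y"],
    "B": ["u", "u", "v", "v", "v", "v"],
    "f": [0.1, 0.1, 0.3, 0.4, 0.4, 0.4],
    "y": [0, 0, 1, 0, 1, 0],
})

# Conditional expectation of the model output given each level of A.
# Averaging over observed rows weights every (A, B) combination by its
# observed joint frequency, so unobserved combinations never enter.
cond_a = df.groupby("A")["f"].mean()

# Raw averages of the response per level, directly comparable to cond_a.
actual_a = df.groupby("A")["y"].mean()

print(pd.concat({"conditional": cond_a, "actual": actual_a}, axis=1))
```

This is equivalent to the weighted-sum calculation used in the "Observed" section, just expressed on row level instead of on the table of joint probabilities.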