<!-- Simon-Style -->
<p style="font-size:19px; text-align:left; margin-top:    15px;"><i>German Association of Actuaries (DAV) — Working Group "Explainable Artificial Intelligence"</i></p>
<p style="font-size:25px; text-align:left; margin-bottom: 15px"><b>Use Case SOA GLTD Experience Study:<br>
USE CASE GLTD - FANOVA with dependent categorical inputs
</b></p>
<p style="font-size:19px; text-align:left; margin-bottom: 15px">Guido Grützner (<a href="mailto:guido.gruetzner@quantakt.com">guido.gruetzner@quantakt.com</a>)</p>

# Introduction

This notebook demonstrates a variant of the functional ANOVA (FANOVA) decomposition for functions whose inputs are all categorical, with arbitrary dependence between them. In particular:
* It introduces the first order Sobol index as a measure of variable importance, which is intuitive and easy to compute.
* It calculates a total interaction index, which allows splitting the variance explained between main effects and interactions.
* It gives full details on pairwise interactions, i.e. second order Sobol indices.

Using the decomposition into main effects and interactions, we can provide an in-depth explanation for differences between models with interactions (e.g. boosted tree models) and models without interactions (e.g. main effects GLMs).

All calculations are based on conditional expectations derived from the original data distribution. The conditional expectation of a function with respect to an input variable is called the main effects function with respect to that variable. Main effects functions are the central tool and target of the analysis in this notebook. One major difference between the approach here and traditional statistics, such as Permutation Feature Importance or Friedman's H-statistic, is the use of main effects instead of partial dependence functions. If all inputs are stochastically independent, main effects and partial dependence functions are identical but, in general, they are (very) different.

The use of true conditional expectations is only practical for categorical inputs. Since the conditioning set of a numerical variable has measure zero, it is impossible, or at least very hard, to estimate conditional expectations from a sample. The underlying distribution of categorical variables, however, is discrete, and the empirical conditional expectation can be found by a simple and very fast grouping operation. The approach has several advantages over the traditional methods, which assume independent margins:
* No evaluations on "impossible data" and no distortion of probabilities in case of dependence.
* In addition to providing statistics for models, the statistics can also be calculated just using the raw data.
* Estimates are analytically derived from the original dataset and not by resampling of the data. This avoids resampling error and allows much faster execution speed.
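
The difference between main effects and partial dependence can be made concrete with a minimal sketch. The distribution, the function `f` and all variable names below are hypothetical and chosen for illustration only; they are not taken from the GLTD data. The main effect is obtained by a grouping operation on the joint distribution, while the partial dependence averages over the marginal of the other input and thereby ignores the dependence.

```python
import pandas as pd

def f(a, b):
    # hypothetical function of two binary inputs
    return a + 2 * b

# hypothetical joint distribution of two dependent binary inputs A and B
tbl = pd.DataFrame({
    "A": [0, 0, 1, 1],
    "B": [0, 1, 0, 1],
    "p": [0.4, 0.1, 0.1, 0.4],
})
tbl["f"] = f(tbl["A"], tbl["B"])

# main effect of A: the conditional expectation E[f | A] via grouping
p_A = tbl.groupby("A")["p"].sum()
main_effect = (tbl["f"] * tbl["p"]).groupby(tbl["A"]).sum() / p_A

# partial dependence of A: average f(a, B) over the marginal of B
p_B = tbl.groupby("B")["p"].sum()
partial_dep = pd.Series(
    {a: sum(f(a, b) * pb for b, pb in p_B.items()) for a in (0, 1)}
)

print(main_effect)   # 0.4 and 2.6
print(partial_dep)   # 1.0 and 2.0
```

In this toy case the main effect of `A` spans 0.4 to 2.6, because `A = 1` mostly occurs together with `B = 1`, whereas the partial dependence spans only 1.0 to 2.0.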

To apply these methods to the GLTD data, the numerical inputs have to be transformed into categorical inputs, i.e. discretized. Even though this results in some loss of accuracy, the advantages stated above are substantial enough to outweigh the disadvantage of discretization. This notebook uses a very simple discretization scheme, which can most certainly be improved.

Evaluation of the second order indices requires generalizations of the original FANOVA procedures to properly account for dependencies. These modifications are explained in more detail in the accompanying notebook "edu_Hcorr.ipynb".

Depending on the size of the dataset (which can be adjusted using the parameter `pct` in the second initialization block), the notebook may take several minutes to run. This is only due to the time required to fit the three models: a gradient boosted tree model with interactions ("GBT_with"), the same model without interactions ("GBT_wo") and a main effects GLM ("GLM"). The subsequent calculations of the statistics take only a few seconds. This is possible in spite of the considerable size and complexity of the dataset, with 21 input variables, cardinalities of the categorical variables of up to 60 and roughly 6.4 million observations.

## Initialisation

In [1]:
from sklearn.ensemble import HistGradientBoostingClassifier
from glum import GeneralizedLinearRegressor

from IPython.display import display_html

# adjust accordingly, more CPUs is faster but then script may block PC
import os
os.environ['LOKY_MAX_CPU_COUNT'] = '4'

from scipy import linalg

import sys
module_path = os.path.abspath(os.path.join(os.getcwd(), '../report versions/'))
if module_path not in sys.path:
    sys.path.append(module_path)

import gltd_utilities

from IPython.core.debugger import set_trace

import time
import pickle
import itertools
import numpy as np
import pandas as pd
pd.options.mode.copy_on_write = True

pd.options.display.max_rows = 200

* Adapt the path for the data file in the call of `load_gltd_data`, if necessary.
* Adapt `pct` to your requirements; any value with $0.05\leq pct\leq1$ is reasonable.
* A value of 1 uses all available data, lower values the respective fraction. Below a value of 0.05, predictions become somewhat volatile.

In [2]:
tic = time.time()
(X, Y, ID, nm_cat, nm_num, seed, rng) = gltd_utilities.load_gltd_data(
                                        "d:/tmp/GLTD data/", pct=0.3)
for vnm in nm_cat:
    X[vnm] = X[vnm].cat.remove_unused_categories()
seed

'129870251340744769036896803466667540219'

## Discretization of numerical inputs

As discussed in the introduction, the numerical inputs have to be transformed into categorical variables. No effort was made to find an efficient encoding; instead, the simplest approach was chosen, namely binning into 10 equal exposure buckets.
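
As a minimal sketch of how quantile binning behaves, the following toy example applies `pd.qcut` to two hypothetical numeric series (neither is taken from the GLTD data). It assumes that every row counts as one unit of exposure. With heavily tied values, `duplicates="drop"` merges identical bin edges, so fewer than the requested ten buckets may result.

```python
import numpy as np
import pandas as pd

# hypothetical numeric inputs for illustration only
smooth = pd.Series(np.arange(100), name="smooth")
ties = pd.Series([0] * 60 + list(range(1, 41)), name="ties")

bins_smooth = pd.qcut(smooth, 10, duplicates="drop")
bins_ties = pd.qcut(ties, 10, duplicates="drop")

# without ties every bucket holds the same number of rows
counts_smooth = bins_smooth.value_counts()
n_smooth = bins_smooth.nunique()
# with many ties, duplicate quantile edges are dropped
n_ties = bins_ties.nunique()

# conversion of the interval categories to plain string categories
as_category = bins_smooth.astype(str).astype("category")

print(n_smooth, n_ties)
```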

In [3]:
# recode SSA
tmp = (X['Original_Social_Security_Award_Status'].astype(str) 
        + "_" + X['Updated_Social_Security_Award_Status'].astype(str))
X["Combined_SSA"] = tmp.astype("category")
X.drop(['Original_Social_Security_Award_Status',
'Updated_Social_Security_Award_Status'], axis=1, inplace=True)

# bin into equal exposure buckets
nbucket = 10

Xbin = X
for vnm in nm_num:
    Xbin[vnm] = pd.qcut(X[vnm], nbucket, duplicates="drop")

# turn IntervalIndex into standard categorical one
# required to enable broadcasting later on
# I am not 100% sure why, but this is required
nm_var = Xbin.columns.to_list()
for vnm in nm_var:
    Xbin[vnm]=Xbin[vnm].astype(str)
    Xbin[vnm]=Xbin[vnm].astype("category")

# Fitting of models

We fit the three models ("GBT_with", "GBT_wo", "GLM") on non-aggregated data. This is necessary to ensure that observations are iid, in particular that each observation in the dataset carries the same probability weight. Since the properties and quality of the models have been discussed in other notebooks for this use case, we just fit the models and perform no train/test split or other evaluation.

In [None]:
xtrain = Xbin
ytrain = Y

sammler = []
# GBT        
md = HistGradientBoostingClassifier(
        interaction_cst = None,
        categorical_features="from_dtype",
        max_iter=1000,
        learning_rate=0.025,
        max_leaf_nodes=100,
        random_state=rng.integers(low=0, high=1000))
md.fit(xtrain, ytrain)
sammler.append(pd.Series(md.predict_proba(xtrain)[:,1], index=xtrain.index, name="GBT_with"))

md = HistGradientBoostingClassifier(
        interaction_cst = "no_interactions",
        categorical_features="from_dtype",
        max_iter=1000,
        learning_rate=0.025,
        max_leaf_nodes=100,
        random_state=rng.integers(low=0, high=1000))
md.fit(xtrain, ytrain)
sammler.append(pd.Series(md.predict_proba(xtrain)[:,1], index=xtrain.index, name="GBT_wo"))

# GLM
fml = " ~ " + "+".join(nm_var)
md = GeneralizedLinearRegressor(
            l1_ratio=0.0,
            alpha=1e-6,
            family="binomial", 
            link="logit",
            fit_intercept=True,
            drop_first=True,
            formula = fml
        )
md.fit(xtrain, ytrain)
sammler.append(pd.Series(md.predict(xtrain), index=xtrain.index, name="GLM"))

pred_tbl = pd.concat(sammler, axis=1)

# Initial aggregation

The initial aggregation collapses the iid dataset to unique(!) input tuples, stored as a MultiIndex. It also creates the corresponding discrete probability measure and represents the functions as Series over that index.
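
The idea can be sketched on a tiny hypothetical sample (column names and values are made up for illustration). Rows with identical inputs are collapsed into one tuple, the relative frequency of the tuple becomes its probability, and the cell mean becomes the function value. The probability-weighted mean over unique tuples then reproduces the plain mean over rows.

```python
import pandas as pd

# hypothetical iid sample: two categorical inputs and a 0-1 outcome
rows = pd.DataFrame({
    "A": ["x", "x", "x", "y", "y", "y", "y", "y"],
    "B": ["u", "u", "v", "u", "u", "u", "v", "v"],
    "Y": [1, 0, 0, 1, 1, 0, 0, 0],
})

agg = rows.groupby(["A", "B"], observed=True)["Y"].agg(["mean", "count"])
p = agg["count"] / agg["count"].sum()   # discrete probability measure
f = agg["mean"]                          # function values on unique tuples

weighted_mean = (f * p).sum()            # mean under the aggregated measure
plain_mean = rows["Y"].mean()            # mean over the original rows
print(weighted_mean, plain_mean)
```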

In [None]:
# pre-aggregation rows are still iid, hence "mean" is OK
agg_dict = {md: pd.NamedAgg(column=md, aggfunc="mean") 
            for md in pred_tbl.columns}
agg_dict["Actual_Recoveries"] = pd.NamedAgg(column="Actual_Recoveries", aggfunc="mean")
agg_dict["p"] = pd.NamedAgg(column="Actual_Recoveries", aggfunc="count")

tmp = pd.concat([Xbin, Y, pred_tbl], axis=1).set_index(nm_var, drop=True)\
    [["Actual_Recoveries"] + pred_tbl.columns.to_list()]
dfagg = tmp.groupby(nm_var, observed=True).agg(**agg_dict)
p_master = dfagg["p"] / dfagg["p"].sum()
p_master.name = "master"
idx_master = p_master.index
dfagg.drop("p", axis=1, inplace=True)

The functions to be analysed are not only the three models from above but also the raw data itself, i.e. the "Actual_Recoveries". This is possible because only observed inputs are used in the calculation of the statistics, and Actual_Recoveries exist for every observed input. These observations can be handled in exactly the same way as model outputs evaluated on the observed inputs.

# Analysis on probability level

Analysis on probability level means that the unmodified outputs of the models are used to calculate the various statistics. These outputs are probabilities, hence the name. This is in contrast to analysis of the linear response, where function outputs are transformed by a link function. We will see shortly that the level of analysis makes a big difference, especially for interactions.

## Mean and Variance

In [6]:
def mean_and_variance(fdf, p_master):
    f_mean = fdf.multiply(p_master, axis=0).sum()
    f_mean.name = "mean"
    f_mean.index.name = "f"
    ff = fdf - f_mean
    V = (ff ** 2).multiply(p_master, axis=0).sum()
    V.name = "variance"
    V.index.name = "f"
    return (f_mean, V)

In [None]:
f_mean, V = mean_and_variance(dfagg, p_master)

# this is just for the joint display
styled_A = f_mean.to_frame().style.set_table_attributes("style='display:inline'")
styled_B = V.to_frame().style.set_table_attributes("style='display:inline'")
display_html(styled_A._repr_html_()
             + styled_B._repr_html_(), raw=True)

| f | mean |
|---|---|
| GBT_with | 0.014165 |
| GBT_wo | 0.014149 |
| GLM | 0.014146 |
| Actual_Recoveries | 0.014146 |

| f | variance |
|---|---|
| GBT_with | 0.001611 |
| GBT_wo | 0.001297 |
| GLM | 0.00139 |
| Actual_Recoveries | 0.01291 |


* The means are more or less equal, as is to be expected, since all the models are well calibrated. Further information on the calibration of the models is contained in the notebook "rep_marginal".
* The variance of Actual_Recoveries is roughly 8 to 10 times larger than the variance of the models. This is because Actual_Recoveries are raw data: they contain residual error and consist of 0–1 observations instead of probabilities in $]0, 1[$. In contrast, the models are deterministic functions of the inputs, i.e. without residual error, and they never produce values of 0 or 1 but something closer to the mean, which reduces the variance.
* The models GLM and GBT_wo have about the same variance, which is roughly 15–20% lower than the variance of GBT_with. This is an indication of the greater flexibility of "GBT_with" to fit the data, but may also be an indication of overfitting.
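
The size of the raw-data variance can be put into perspective with a short calculation. A single 0–1 observation with mean $m$ has variance $m(1-m)$. The sketch below plugs in the mean and the variances reported in the tables above. It is a plausibility check only: the reported variance of Actual_Recoveries is computed on cell means, so it is expected to lie somewhat below this bound.

```python
# mean of Actual_Recoveries as reported in the table above
m = 0.014146
# variance of a single 0-1 observation with that mean
var_bernoulli = m * (1 - m)

# variances reported in the table above
var_actual = 0.01291
var_models = {"GBT_with": 0.001611, "GBT_wo": 0.001297, "GLM": 0.00139}

# ratio of the raw-data variance to each model variance
ratios = {name: var_actual / v for name, v in var_models.items()}
print(round(var_bernoulli, 6), {k: round(r, 1) for k, r in ratios.items()})
```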

## First order Sobol Indices

The first order Sobol index of a function is the variance of the conditional expectation with respect to one of the inputs; in the tables below it is reported relative to the total variance of the function. It is a very simple and intuitive notion of variable importance. A main effects function which is nearly constant will yield a low index, while functions with a wide range and variation between categories will get larger ones. Compare the following results with the plots of the marginal functions in the notebook "rep_marginal". The first order Sobol indices are quantitative summaries of those graphs and provide the foundation for the intuitive notion that a variable matters if it distinguishes well between categories.
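
Written out as a formula, the quantity computed by the code below can be stated as follows. The symbols $S_i$, $X_i$ and $p_i$ are introduced here for illustration only: $X_i$ is the $i$-th input, $p_i(c)$ the marginal probability of its category $c$, and $S_i$ the index normalised by the total variance.

$$ S_i = \frac{\mathbb{V}\big[\mathbb{E}[f\vert X_i]\big]}{\mathbb{V}[f]}, \qquad \mathbb{V}\big[\mathbb{E}[f\vert X_i]\big] = \sum_{c} p_i(c)\,\mathbb{E}[f\vert X_i=c]^2 - \Big(\sum_{c} p_i(c)\,\mathbb{E}[f\vert X_i=c]\Big)^2. $$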

In [8]:
def first_order_sobol(fdf):
    
    sammler = []
    for vnm in fdf.index.names:
        # marginal probability of the input
        p_1 = p_master.groupby(level=vnm, observed=True).sum()
        # conditional expectations
        f_1 = fdf.multiply(p_master, axis=0).groupby(level=vnm, observed=True).sum()
        # don't forget to divide by the marginal probability
        f_1 = f_1.divide(p_1, axis=0)
        V1 = ((f_1 ** 2).multiply(p_1, axis=0).sum() -
             (f_1.multiply(p_1, axis=0).sum())**2)
        sammler.append(V1)

    V_1 = pd.concat(sammler, keys=fdf.index.names, names= ["input", "f"])
    return V_1

In [20]:
S_1 = first_order_sobol(dfagg) / V
tmp = S_1.unstack()
round( tmp.sort_values(by="Actual_Recoveries", ascending=False) *100, 1)

| input | Actual_Recoveries | GBT_with | GBT_wo | GLM |
|---|---|---|---|---|
| Combined_SSA | 4.1 | 32.0 | 42.0 | 38.2 |
| Duration_Month | 2.6 | 20.8 | 25.3 | 24.4 |
| OwnOccToAnyTransition | 2.1 | 15.8 | 20.7 | 19.2 |
| Diagnosis_Category | 1.8 | 13.1 | 16.4 | 16.7 |
| Attained_Age | 0.8 | 6.0 | 7.8 | 7.3 |
| Elimination_Period | 0.3 | 2.4 | 3.1 | 3.2 |
| Age_at_Disability | 0.2 | 1.5 | 1.7 | 1.9 |
| Integration_with_STD | 0.1 | 0.9 | 1.1 | 1.2 |
| Benefit_Max_Limit_Proxy | 0.1 | 1.1 | 1.6 | 1.2 |
| Industry | 0.1 | 1.0 | 1.2 | 1.1 |


The table above shows the variance of each main effects function relative to the total variance, in percent. The values can be interpreted as the percentage of variance explained by the respective main effects.

Observations: 
* All functions agree on the ranking of the most important inputs; small differences in order occur only among the inputs with the lowest indices shown.
* Only a few inputs have large Sobol indices. The inputs deemed important and their ranking are consistent with the other importance measures, such as drop1 or permutation feature importance. See the respective notebooks `exp_drop1_*` and `exp_pfi` for more details on those.
* The index values for Actual_Recoveries are much lower. Recall that the Sobol index is the amount of variance explained by the main effects function. But Actual_Recoveries has a much larger total variance, i.e. the denominator of the Sobol index is much larger, hence the index is smaller. This makes a direct comparison between functions with very different total variances difficult.
* In comparison to the models without interactions, the values for GBT_with are lower. We will see that the reason for this is interactions.
* The values will, in general, not add up to 100%. There are two reasons for this. First, the contributions of the interactions are missing, since the indices only cover the main effects. Second, the inputs are not stochastically independent, hence the main effects functions will be correlated, which means their variances will not add up to the total variance.
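
The second reason can be illustrated with a minimal sketch. The function and the two distributions below are hypothetical. The function is purely additive, so interactions play no role; only the dependence between the two inputs changes.

```python
import numpy as np

def first_order_indices(p, f):
    """Var(E[f | A]) / Var(f) and Var(E[f | B]) / Var(f) for tables p[a, b], f[a, b]."""
    mean = (p * f).sum()
    var = (p * (f - mean) ** 2).sum()
    indices = []
    for axis in (0, 1):
        other = 1 - axis
        p_marg = p.sum(axis=other)                    # marginal probability
        cond_exp = (p * f).sum(axis=other) / p_marg   # E[f | input]
        indices.append((p_marg * (cond_exp - mean) ** 2).sum() / var)
    return tuple(indices)

# purely additive function f(a, b) = a + 2 b on two binary inputs
a, b = np.meshgrid([0, 1], [0, 1], indexing="ij")
f = a + 2 * b

p_indep = np.full((2, 2), 0.25)                 # independent inputs
p_dep = np.array([[0.4, 0.1], [0.1, 0.4]])      # positively dependent inputs

S_indep = first_order_indices(p_indep, f)
S_dep = first_order_indices(p_dep, f)
print(sum(S_indep), sum(S_dep))
```

Under independence the two indices add up to 1. With the positively dependent distribution the sum is about 1.57, although the function contains no interaction at all.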

## Total interaction index

The total interaction index is based on a split of each function into two uniquely defined components: an additive part, which is a linear combination of all available main effects functions, and its complement, i.e.
$$ f = f_\text{add} + f_\text{inter}.$$
As in FANOVA, and as described in "edu_Hcorr", the decomposition of $f$ is derived from an orthogonal decomposition of the function space into two components. In fact, $f_\text{add}$ is defined as the best least squares approximation of $f$ by a sum of main effects, and $f_\text{inter}$ as the residual. By construction, the two components are uncorrelated, hence they provide an additive split of the total variance of $f$:
$$ \mathbb{V}[f] = \mathbb{V}[f_\text{add}] + \mathbb{V}[f_\text{inter}].$$
Below we show the relative amounts of $\frac{\mathbb{V}[f_\text{add}]}{\mathbb{V}[f]}$ and $\frac{\mathbb{V}[f_\text{inter}]}{\mathbb{V}[f]}$ in percent for each of the three models and the raw data, i.e. the `Actual_Recoveries`.

The notebook "edu_Hcorr" contains more background and details on theory and implementation.

As a first step, a basis for the space of additive functions is constructed. Here, the "drop first" approach is used, where for each input the indicator column of its first category is removed from the matrix representing the basis. See also the discussion and examples in Section "Definition of $\mathscr{V}_1$" of the accompanying notebook `edu_Hcorr`. The linear independence of this basis is verified using `matrix_rank` from `np.linalg`.
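
Why a column has to be dropped can be seen in a minimal sketch with one hypothetical input of three categories. Together with the constant, the full set of indicator columns is linearly dependent, because the indicators sum to the constant. After removing the first indicator, the rank equals the number of columns.

```python
import numpy as np

n_levels = 3  # hypothetical input with three categories
one_hot = np.identity(n_levels)
const = np.ones((n_levels, 1))

full = np.hstack([const, one_hot])               # constant + all indicators
drop_first = np.hstack([const, one_hot[:, 1:]])  # constant + indicators 2..n

rank_full = np.linalg.matrix_rank(full)
rank_drop = np.linalg.matrix_rank(drop_first)
print(full.shape[1], rank_full)        # 4 columns, rank 3
print(drop_first.shape[1], rank_drop)  # 3 columns, rank 3
```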

In [23]:
sammler = []
# build main effects basis
for vnm in nm_var:
    ct = idx_master.levels[idx_master.names.index(vnm)]
    # df is "drop1" due to [:,1:]
    df = pd.DataFrame(np.identity(len(ct))[:,1:], index=ct.astype(str))
    # broadcast to full index
    sammler.append(df.reindex(index=idx_master, level=vnm))

tmp = pd.concat([pd.Series(1, index=idx_master)] + sammler, 
                keys=["const"] + nm_var, axis=1)
# normalise coordinates
basmat_main = tmp.multiply(np.sqrt(p_master), axis=0).to_numpy()
dim_bas = basmat_main.shape[1]
assert(dim_bas==np.linalg.matrix_rank(basmat_main))
print(f"The function space V_add has dimension: {dim_bas}.")

The function space V_add has dimension: 228.


Next, the functions' coordinates are transformed to normalized coordinates, such that the standard linear algebra routines can be applied. The scipy function `lstsq` does essentially all the computational work, and $f_\text{inter}$ is the residual of the least squares solution. One could retrieve the actual components of the vectors for further analysis (in normalized coordinates they are `basmat_main @ x` and `f_cb - basmat_main @ x`), but this is not done in this notebook.
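
The mechanics can be sketched on a toy problem with two binary inputs. The distribution and both functions are hypothetical, and `np.linalg.lstsq` is used here in place of the scipy routine. Multiplying by $\sqrt{p}$ turns the probability-weighted least squares problem into an ordinary one, and the squared norm of the residual is the variance of the interaction component.

```python
import numpy as np

# unique input tuples (A, B), their probabilities, and two hypothetical functions
A = np.array([0, 0, 1, 1])
B = np.array([0, 1, 0, 1])
p = np.array([0.4, 0.1, 0.1, 0.4])
functions = {
    "additive": A + 2.0 * B,
    "with_interaction": A + 2.0 * B + 3.0 * A * B,
}

# drop-first basis of the additive space: constant, 1[A=1], 1[B=1]
basis = np.column_stack([np.ones(4), A == 1, B == 1]).astype(float)
w = np.sqrt(p)
basis_w = basis * w[:, None]            # normalised coordinates

split = {}
for name, f in functions.items():
    f_w = f * w
    coef, *_ = np.linalg.lstsq(basis_w, f_w, rcond=None)
    resid_w = f_w - basis_w @ coef      # f_inter in normalised coordinates
    v_inter = resid_w @ resid_w
    f_add = basis @ coef                # f_add in original coordinates
    v_add = p @ (f_add - p @ f_add) ** 2
    v_total = p @ (f - p @ f) ** 2
    split[name] = (v_add, v_inter, v_total)

for name, (v_add, v_inter, v_total) in split.items():
    print(name, round(v_add / v_total, 3), round(v_inter / v_total, 3))
```

For the additive toy function the interaction share is zero up to rounding error; for the second function a positive share remains, and in both cases the two parts add up to the total variance.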

In [None]:
# extract functions and normalize coordinates
f_cb = dfagg.multiply(np.sqrt(p_master), axis=0).to_numpy()
x, resi, rk, sval = linalg.lstsq(basmat_main, f_cb)
# check if something went wrong
assert(dim_bas == rk)
V_inter_total = pd.Series(resi, index=dfagg.columns)
V_add_total = V - V_inter_total
round(pd.concat([V_add_total / V, V_inter_total / V], axis=1, keys=["V_add", "V_inter"]) * 100, 1)

| f | V_add | V_inter |
|---|---|---|
| GBT_with | 59.4 | 40.6 |
| GBT_wo | 75.1 | 24.9 |
| GLM | 72.5 | 27.5 |
| Actual_Recoveries | 7.8 | 92.2 |


Shown above are, per function, the shares of the total variance attributed to the additive component (`V_add`) and to the interaction component (`V_inter`), in percent.
* The low value for the additive component of Actual_Recoveries does not mean that the raw data is almost exclusively determined by interactions. Instead, this is again an effect of the high residual variance, which is fully assigned to the interaction part.
* The split for the models is a better indicator of the model structure, since this split is not distorted by residual variance, as the models are deterministic.
* The interaction values for the two models *without* interaction might seem surprising. They are smaller than those for the model with interaction, but they are still sizeable and definitely not zero! The reason is the link function. See the next section on analysis on logit level.

# Analysis on logit level 

The issue with probabilities as a target for the analysis of interactions stems from the fact that GLMs are indeed linear functions of features of the inputs, but the linear predictor is then transformed by the inverse of a link function. In this notebook, the logit is used as the link function. It is applied not only for the GLM but also for the GBT models. To quote from the scikit-learn documentation of `HistGradientBoostingClassifier`:
> Internally, the model fits one tree per boosting iteration and uses the logistic sigmoid function (expit) as inverse link function to compute the predicted positive class probability.

Analysis which takes place before the transformation by the inverse link function (i.e. on the scale of the logit) is called analysis on logit level. Analysis after the transformation, as done in the prior section, is called analysis on probability level. To appreciate the importance of this for the understanding of interactions, note that the recovery probability $\pi$ is linked to the linear predictor $X\beta$ of a GLM by the logit, i.e.
$$ \text{logit}(\pi(\beta)) = X\beta $$
and
$$ \pi(\beta)= \frac{\exp X\beta}{1 + \exp X\beta}.$$
But this means that the model will have interactions on the probability level, even if it is additive on the logit level. For this reason, interaction analysis should be performed on the logit level. 
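
A small numerical sketch makes this visible. The linear predictor `eta` below is hypothetical and purely additive in two binary inputs. The double difference, which vanishes for any additive function of two inputs, is zero on the logit level but clearly non-zero after applying the inverse link.

```python
import math

def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

def eta(a, b):
    # hypothetical additive linear predictor on logit level
    return -5.0 + 2.0 * a + 1.5 * b

def double_difference(g):
    # zero if and only if g is additive in its two binary inputs
    return g(1, 1) - g(1, 0) - g(0, 1) + g(0, 0)

dd_logit = double_difference(eta)
dd_prob = double_difference(lambda a, b: expit(eta(a, b)))
print(dd_logit, round(dd_prob, 4))
```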

## Transformation

The logit transformation is not defined for values of 0 or 1. Accordingly, analysis of the raw data is not possible on the logit level, and "Actual_Recoveries" has to be excluded.

In [13]:
df_logit = np.log(dfagg[pred_tbl.columns] / (1 - dfagg[pred_tbl.columns]))

## Mean and variance

In [24]:
f_logit_mean, V_logit = mean_and_variance(df_logit, p_master)

# this is just for the joint display
styled_A = f_logit_mean.to_frame().style.set_table_attributes("style='display:inline'")
styled_B = V_logit.to_frame().style.set_table_attributes("style='display:inline'")
display_html(styled_A._repr_html_()
             + styled_B._repr_html_(), raw=True)

| f | mean |
|---|---|
| GBT_with | -5.668123 |
| GBT_wo | -5.582278 |
| GLM | -5.701168 |

| f | variance |
|---|---|
| GBT_with | 2.15302 |
| GBT_wo | 2.20837 |
| GLM | 2.558656 |


## First order Sobol Indices

In [25]:
S_1_logit = first_order_sobol(df_logit) / V_logit
tmp = S_1_logit.unstack()
display_html(round( tmp.sort_values(by="GBT_wo", ascending=False) *100,1))

| input | GBT_with | GBT_wo | GLM |
|---|---|---|---|
| Duration_Month | 44.4 | 57.7 | 54.4 |
| Combined_SSA | 44.2 | 53.5 | 46.9 |
| OwnOccToAnyTransition | 32.4 | 39.8 | 36.2 |
| Attained_Age | 16.9 | 11.2 | 14.8 |
| Diagnosis_Category | 8.0 | 9.1 | 9.1 |
| Benefit_Max_Limit_Proxy | 5.6 | 6.1 | 6.3 |
| Integration_with_STD | 3.3 | 5.0 | 5.6 |
| Industry | 4.0 | 4.7 | 4.7 |
| Elimination_Period | 2.6 | 4.2 | 4.3 |
| Age_at_Disability | 4.1 | 3.3 | 4.2 |


Observations:
* On logit level, the same inputs dominate as on probability level, although the order changes in places: Duration_Month now ranks ahead of Combined_SSA, and Attained_Age ahead of Diagnosis_Category.
* In comparison to the probability level, the indices are mostly larger, i.e. main effects are more important.
* There is a clear difference between the model with interaction and the two models without interaction. For most inputs, the models without interaction show substantially larger main effects than the model with interaction.
* The sum of the indices is much larger than 100%, which indicates positive correlations between the main effects.

## Total interaction index

The interaction index is calculated in the same way as above. In particular, the basis is identical. The only difference is that the transformed functions are used.

In [16]:
# extract functions and normalise coordinates
f_cb = df_logit.multiply(np.sqrt(p_master), axis=0).to_numpy()
x, resi, rk, sval = linalg.lstsq(basmat_main, f_cb)
# check if something went wrong
assert(dim_bas == rk)
V_logit_inter_total = pd.Series(resi, index=df_logit.columns)
V_logit_add_total = V_logit - V_logit_inter_total
round(pd.concat([V_logit_add_total / V_logit, V_logit_inter_total / V_logit], 
                axis=1, keys=["V_add", "V_inter"]) * 100, 1)

| f | V_add | V_inter |
|---|---|---|
| GBT_with | 86.9 | 13.1 |
| GBT_wo | 100.0 | 0.0 |
| GLM | 100.0 | 0.0 |


Indeed, the split is as expected. The variance of models without interaction is fully explained by additive combinations of the main effects functions, while for the GBT with interactions, a substantial amount of variance remains.

## Second order indices

The total interaction index is based on the split of the function space into two orthogonal subspaces: the space of additive functions and its orthogonal complement. Total interaction can tell if, and how much, interaction is present. It does not provide information on which inputs are interacting. To shed light on this question, further refinement is required, and this section provides exactly this refinement. Instead of starting with the whole function space and splitting it into two components, we start with a bivariate function space, i.e. a space of conditional expectations with respect to two inputs, and split this into main effects and interaction. The implementation in the code block below works basically in the same way as the code for the total interaction index. The only differences are the loop over all pairs and the calculation of the bivariate conditional expectation for each pair.

At first sight, it may seem surprising that the same basis (`basmat_main`) is used for the bivariate functions as for the multivariate functions. But this is necessary due to the stochastic dependence between the inputs. Under dependence, the best additive approximation to $\mathbb{E}[f\vert A,B]$ will involve not only functions of the inputs $A$ and $B$ but also functions of correlated inputs. Again, this can be analysed in detail because all functions in the decomposition can be explicitly computed. This is not done here and is left to the interested reader.
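
The bivariate conditional expectation and its embedding into the full index can be sketched on a toy distribution over three binary inputs. All values and names below are hypothetical. As in the notebook code, the embedding uses a join on the pair of inputs. The result has the same mean as the original function and at most the same variance.

```python
import itertools
import pandas as pd

# hypothetical distribution and function over three binary inputs A, B, C
idx = pd.MultiIndex.from_tuples(
    list(itertools.product([0, 1], repeat=3)), names=["A", "B", "C"])
p = pd.Series([0.20, 0.05, 0.10, 0.15, 0.05, 0.10, 0.15, 0.20], index=idx)
f = pd.Series([0.0, 1.0, 2.0, 4.0, 1.0, 3.0, 2.0, 6.0], index=idx)

pair = ["A", "B"]
# marginal distribution of the pair and conditional expectation E[f | A, B]
p_2 = p.groupby(level=pair).sum()
f_2 = (f * p).groupby(level=pair).sum() / p_2

# embed E[f | A, B] as a function on the full index (constant in C)
frame = idx.to_frame().reset_index(drop=True)
f_2_emb = frame.join(f_2.rename("f_2"), on=pair).set_index(["A", "B", "C"])["f_2"]

mean_f = (f * p).sum()
mean_f_2 = (f_2_emb * p).sum()
var_f = ((f - mean_f) ** 2 * p).sum()
var_f_2 = ((f_2_emb - mean_f_2) ** 2 * p).sum()
print(mean_f, mean_f_2, var_f, var_f_2)
```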

Only selected pairs are shown, since there are in total about 200 pairs, and the analysis of first order and total indices shows that only a few of them have any relevance.  
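
The number of pairs follows from simple counting, as the sketch below shows. Whether the count of inputs is 21, as stated in the introduction, or 20 after merging the two SSA columns is an assumption here, so both are computed. The list of selected inputs is the one used in the code cell that follows.

```python
import itertools
import math

selected = ["Duration_Month", "Combined_SSA", "OwnOccToAnyTransition",
            "Attained_Age", "Diagnosis_Category", "Benefit_Max_Limit_Proxy"]

# pairs actually evaluated for the six selected inputs
n_selected_pairs = len(list(itertools.combinations(selected, 2)))
# all pairs for 20 or 21 inputs (the exact count is an assumption)
n_pairs_20 = math.comb(20, 2)
n_pairs_21 = math.comb(21, 2)
print(n_selected_pairs, n_pairs_20, n_pairs_21)  # 15 190 210
```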

In [None]:
# calculation for each pair of inputs
pairs = list(itertools.combinations(
    ["Duration_Month", "Combined_SSA", "OwnOccToAnyTransition",
     "Attained_Age", "Diagnosis_Category", "Benefit_Max_Limit_Proxy"], 2))
# the use of frames instead of indices is required for unknown Pandas reasons
# .reindex does not accept more than one level so that we need .join 
df_master = idx_master.to_frame().reset_index(drop=True)

sammler = []
for nm_pair in pairs:
    # determine conditional distribution of pair
    p_2 = p_master.groupby(level=nm_pair, observed=True).sum()
    # Conditional expectation on idx_2
    f_2 = df_logit.multiply(p_master, axis=0).groupby(level=nm_pair, observed=True).sum()
    f_2 = f_2.divide(p_2, axis=0)
    # find variance
    f_2_var = ((f_2 ** 2).multiply(p_2, axis=0).sum() 
               - (f_2.multiply(p_2, axis=0).sum())**2)
    # reindex/embed
    f_2_emb = df_master.join(f_2, on=nm_pair, how="left").set_index(nm_var)

    # to find residual f_2_emb has to be transformed to normalised coordinates 
    f_cb = f_2_emb.multiply(np.sqrt(p_master), axis=0).to_numpy()
    x, resi, rk, sval = linalg.lstsq(basmat_main, f_cb)
    sammler.append(pd.concat([f_2_var, pd.Series(resi, index=df_logit.columns)],
                    axis=1))

In [18]:
V_inter_pairs = pd.concat(sammler, keys=pairs, names=["A", "B", "f"])
V_inter_pairs.columns = ["V_2", "V_inter"]
round(V_inter_pairs.divide(V_logit, axis=0, level="f") * 100, 1)

| A | B | f | V_2 | V_inter |
|---|---|---|---|---|
| Duration_Month | Combined_SSA | GBT_with | 81.7 | 6.2 |
| Duration_Month | Combined_SSA | GBT_wo | 91.0 | 0.1 |
| Duration_Month | Combined_SSA | GLM | 85.1 | 0.1 |
| Duration_Month | OwnOccToAnyTransition | GBT_with | 47.0 | 0.2 |
| Duration_Month | OwnOccToAnyTransition | GBT_wo | 61.2 | 0.0 |
| Duration_Month | OwnOccToAnyTransition | GLM | 57.9 | 0.0 |
| Duration_Month | Attained_Age | GBT_with | 52.4 | 1.0 |
| Duration_Month | Attained_Age | GBT_wo | 61.4 | 0.9 |
| Duration_Month | Attained_Age | GLM | 61.4 | 1.0 |
| Duration_Month | Diagnosis_Category | GBT_with | 49.7 | 1.6 |


Observations:
* Interaction values larger than 1.5% occur only for GBT_with, and only for three pairs: (Duration_Month, Combined_SSA), (Combined_SSA, OwnOccToAnyTransition) and (Duration_Month, Diagnosis_Category).
* For all three models, most interaction values are small but non-zero.
* The values for GBT_with are larger than those of the models without interaction.

The most likely reason for the small but non-zero `V_inter` values, especially for the non-interaction models, is contributions from correlated bivariate functions. Since all these functions are explicitly available, further analysis could be performed, but again this is left to the interested reader.

In [19]:
print(f"Time it took: {np.ceil((time.time() - tic)/60)}min.")

Time it took: 6.0min.


# Conclusion

We have demonstrated how to split the models into two orthogonal components, one additive and one interaction component. We were able to clearly distinguish between the models with and without interaction and could measure the amount of interaction by allocated variance. All without the use of impossible data and within convenient run-time limits.  