# Explaining Distribution Shifts in Bayesian Networks using Random Forests and Feature Importance 

Bayesian networks play an important role in many areas of science because they convey complex statistical interactions between several variables in an easy-to-read way. They also gain relevance from their close connection to computational causality. Here we analyze data generated by a version of the popular lung cancer (aka Asia) Bayesian network.

We generated a dataset from the Asia network in which drift is induced by an intervention on the "bronc" node. The network consists of 8 nodes, each taking binary values. Before the drift we use the standard network. After the drift we increase the chance of "bronc" by a factor of 1.5, independent of the state of "smoke", which gives a chance of "bronc" of 0.9 if "smoke" is active and 0.45 otherwise. This corresponds to a seasonal increase in cases of bronchitis.
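
As a quick sanity check on the intervention, the marginal chance of "bronc" before and after the drift can be worked out from the stated conditionals. The sketch below assumes P(smoke) = 0.5, as in the sampling code; the helper name `bronc_marginal` is introduced here for illustration only.

```python
# Conditional chances of bronc given smoke, as stated in the text
p_smoke = 0.5                      # assumed from the sampling code
p_bronc_given_smoke = {1: 0.6, 0: 0.3}
factor = 1.5                       # multiplicative increase after the drift

def bronc_marginal(scale):
    """Marginal P(bronc = 1) when both conditionals are multiplied by `scale`."""
    return (p_smoke * p_bronc_given_smoke[1] * scale
            + (1 - p_smoke) * p_bronc_given_smoke[0] * scale)

before = bronc_marginal(1.0)
after = bronc_marginal(factor)
print(f"P(bronc) before drift: {before:.3f}")
print(f"P(bronc) after drift:  {after:.3f}")
```

Under these assumptions the marginal chance of "bronc" rises from 0.45 to 0.675, so the shift is visible in the "bronc" column alone and, indirectly, in its descendant "dysp".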

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import KFold
from sklearn.inspection import permutation_importance
import pandas as pd

In [None]:
features = ["asia","tub","smoke","lung","bronc","either","xray","dysp"]
def sample(n = 25000, T=0):
    asia   = (np.random.random(size=n) < 0.01).astype(float)
    tub    = (np.random.random(size=n) < (asia*0.05 + (1-asia)*0.01)).astype(float)
    smoke  = (np.random.random(size=n) < 0.5).astype(float)
    lung   = (np.random.random(size=n) < (smoke*0.1 + (1-smoke)*0.01)).astype(float)
    # After the drift (T=1) the chance of bronc is scaled by 1.5 (0.6 -> 0.9, 0.3 -> 0.45)
    bronc  = (np.random.random(size=n) < (smoke*0.6 + (1-smoke)*0.3)*(T*1.5 + (1-T)*1)).astype(float)
    either = (lung + tub > 0).astype(float)
    xray   = (np.random.random(size=n) < (either*0.98 + (1-either)*0.05)).astype(float)
    dysp   = (np.random.random(size=n) < (bronc*either*0.9 + (1-bronc)*either*0.7 + bronc*(1-either)*0.8 + (1-bronc)*(1-either)*0.1)).astype(float)
    
    return np.vstack( (asia,tub,smoke,lung,bronc,either,xray,dysp) ).T

In [None]:
# Generate data: X1 before the drift (T=0), X2 after the drift (T=1)
X1,X2 = sample(T=0), sample(T=1)
X,T = np.vstack((X1,X2)),np.array(X1.shape[0]*[0]+X2.shape[0]*[1], dtype=float)

# Add shadow features (5 permuted copies of each column) for a Boruta-like baseline/analysis
X_shadow = np.hstack( (X,np.vstack( [np.random.permutation(X[:,i]) for i in range(X.shape[1]) for _ in range(5)]).T) )
features_shadow = features + [s+" (shadow)" for s in features for _ in range(5)]

# Add noise to force the model to learn "stable" rules
X_noise = X_shadow + np.random.normal(size=X_shadow.shape)

In [None]:
# Train the model in a k-fold fashion (100 splits here). This allows us to check the computational stability of the analysis

T = T.astype(int)
results = []
for train,test in KFold(n_splits=100,shuffle=True).split(X):
    model = RandomForestClassifier().fit(X_shadow[train],T[train])
    score = model.score(X_shadow[test],T[test])
    print("Model score on current fold: ", score)
    results.append( dict(list(zip(features_shadow,model.feature_importances_))+[("score",score),("type","FI")]) )
    results.append( dict(list(zip(features_shadow,permutation_importance(model,X_shadow[test],T[test]).importances_mean))+[("score",score),("type","PFI")]) )

In [None]:
df = pd.DataFrame(results)
df["shadow"] = df.apply(lambda x: max([v for k,v in x.items() if "shadow" in k]), axis=1)

df[df["type"] == "FI"][features+["shadow","type"]].groupby("type").boxplot()
df[df["type"] == "PFI"][features+["shadow","type"]].groupby("type").boxplot()
    
df
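
One possible way to read the table above in a Boruta-like fashion is to count, for each real feature, how often its importance exceeds the largest shadow importance across folds. The following is a minimal sketch of that idea, not necessarily the analysis intended here; the helper `fraction_above_shadow` and the toy values are hypothetical and serve only as an illustration.

```python
import pandas as pd

def fraction_above_shadow(df, features, shadow_col="shadow"):
    """For each feature, the fraction of rows (folds) in which its
    importance is strictly larger than the maximum shadow importance."""
    return pd.Series({f: (df[f] > df[shadow_col]).mean() for f in features})

# Toy values for illustration only (not results from the experiment)
toy = pd.DataFrame({
    "bronc":  [0.30, 0.28, 0.31],
    "asia":   [0.01, 0.03, 0.02],
    "shadow": [0.02, 0.02, 0.02],
})

hits = fraction_above_shadow(toy, ["bronc", "asia"])
print(hits)
```

On the data frame built above, this could be applied per importance type, for example `fraction_above_shadow(df[df["type"] == "PFI"], features)`. A feature that beats the shadow maximum in nearly every fold is a plausible candidate for explaining the drift.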