## Tutorial: Including a Latent Variable in a BN Structure
This tutorial adds a latent variable (LV) to a Bayesian network (BN) and then uses the Expectation-Maximization (EM) algorithm to learn the parameters related to that variable.

1. Build the network without the LV and train it on complete data
2. Identify an LV and how it interacts with the model
3. Add the LV to the model
4. Establish constraints on the CPDs related to the LV
5. Fit the CPDs related to the LV using the EM algorithm

**Note**: CausalNex supports only discrete distributions, so each node should be discretized before the network is fitted.
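
As a minimal sketch of such a discretization step, the snippet below bins a continuous score into labelled categories with `pandas.cut`. The column name `score`, the bin edges, and the labels are hypothetical and are not taken from the dataset used in this tutorial.

```python
import pandas as pd

# Hypothetical continuous scores in [0, 1]; not the tutorial's data.
scores = pd.DataFrame({"score": [0.05, 0.20, 0.45, 0.70, 0.95]})

# Assumed bin edges and labels, chosen only for illustration.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
labels = ["-2", "-1", "+1", "+2"]

# include_lowest=True keeps a score of exactly 0.0 inside the first bin.
scores["score_bin"] = pd.cut(
    scores["score"], bins=edges, labels=labels, include_lowest=True
)

# Cast to plain strings so the column holds a small set of discrete states.
scores["score_bin"] = scores["score_bin"].astype(str)
print(scores)
```

Each resulting column then has a finite set of states, which is the form a discrete network expects.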


In [1]:
import numpy as np
import pandas as pd

url_data = "./data/finalboot.csv"
df = pd.read_csv(url_data)
df["LAT"] = np.nan

In [3]:
df.head(3)

Unnamed: 0,CHROM,POS,REF,ALT,GeneID,RS,DS,Y,REVEL,SAI,LAT
0,1,1232279,A,G,126792.0,0.188,0.0,1,-2,I,
1,1,1232280,T,A,126792.0,0.206,0.0,1,-2,I,
2,1,1232280,T,C,126792.0,0.203,0.0,1,-2,I,


In [2]:
df.shape

(496831, 11)

In [3]:
import warnings
from causalnex.structure import StructureModel
warnings.filterwarnings("ignore") # silence warnings
sm = StructureModel()

In [4]:
sm.add_edges_from([
    ('Y', 'REVEL'),
    ('Y', 'SAI')
])

In [5]:
sm.edges

OutEdgeView([('Y', 'REVEL'), ('Y', 'SAI')])

In [6]:
from causalnex.plots import plot_structure, NODE_STYLE, EDGE_STYLE

viz = plot_structure(
    sm,
    all_node_attributes=NODE_STYLE.WEAK,
    all_edge_attributes=EDGE_STYLE.WEAK,
)
viz.show("01_simple_plot.html")

01_simple_plot.html


In [7]:
df_bn = df[["Y","REVEL","SAI"]]

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
train, test = train_test_split(df_bn, train_size=0.8, test_size=0.2, random_state=7)


### Fitting the network to the data

In [8]:
from causalnex.network import BayesianNetwork
bn = BayesianNetwork(sm)

In [9]:
bn.fit_node_states_and_cpds(df_bn)

<causalnex.network.network.BayesianNetwork at 0x7fa089e43400>

### Adding the LV to the network

In [10]:
lat_edges_to_add = [('Y', 'LAT'),('LAT', 'REVEL'),('LAT', 'SAI')]

lat_edges_to_remove = [('Y', 'REVEL'),('Y', 'SAI')]

In [11]:
bn.add_node(node="LAT", edges_to_add=lat_edges_to_add, edges_to_remove=lat_edges_to_remove)

<causalnex.network.network.BayesianNetwork at 0x7fa089e43400>

In [12]:
# Plot the structure held by the network so that the new LAT node is shown
viz = plot_structure(
    bn.structure,
    all_node_attributes=NODE_STYLE.WEAK,
    all_edge_attributes=EDGE_STYLE.WEAK,
)
viz.show("01_simple_plot.html")

01_simple_plot.html
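
To see what the rewiring does to the graph, the plain-Python sketch below applies the same additions and removals to a set of edges and lists each node's parents before and after. It does not call CausalNex, and the helper `parents` is a hypothetical name introduced for illustration.

```python
# Plain-Python sketch of the rewiring; it does not call CausalNex.
edges = {("Y", "REVEL"), ("Y", "SAI")}
lat_edges_to_add = [("Y", "LAT"), ("LAT", "REVEL"), ("LAT", "SAI")]
lat_edges_to_remove = [("Y", "REVEL"), ("Y", "SAI")]

# Remove the direct edges first, then add the edges that pass through LAT.
new_edges = (edges - set(lat_edges_to_remove)) | set(lat_edges_to_add)


def parents(edge_set):
    """Map each node to the sorted list of its parents."""
    nodes = {n for edge in edge_set for n in edge}
    return {n: sorted(p for p, c in edge_set if c == n) for n in sorted(nodes)}


print(parents(edges))
print(parents(new_edges))
```

After the change, `REVEL` and `SAI` depend on `Y` only through `LAT`, so their CPDs are conditioned on `LAT` rather than on `Y`.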


### Establishing constraints on the parameters related to the LV

In [46]:
train["LAT"] = np.nan

In [13]:
bn.fit_latent_cpds(
    lv_name="LAT",
    lv_states=[0, 1],
    data=df,
    n_runs=30,
)

<causalnex.network.network.BayesianNetwork at 0x7fa089e43400>

We can also provide priors and constraints for the parameters related to the latent node, and we can try different sets of states for it.

- The default boundaries for every parameter are $(0, 1)$
- The default priors are 0, and these values can be overridden
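
One common way to read a prior on CPD parameters is as pseudo-counts that are added to the data counts before normalising. The sketch below illustrates that reading with NumPy. Whether CausalNex applies its priors in exactly this way is an assumption, and the function name `estimate_column` and the counts are hypothetical.

```python
import numpy as np


def estimate_column(counts, prior=None):
    """Normalise counts plus optional pseudo-counts into probabilities."""
    counts = np.asarray(counts, dtype=float)
    if prior is None:
        prior = np.zeros_like(counts)  # a zero prior leaves the counts unchanged
    total = counts + np.asarray(prior, dtype=float)
    return total / total.sum()


counts = [90, 10]  # hypothetical counts for two latent states
no_prior = estimate_column(counts)                     # relative frequencies
with_prior = estimate_column(counts, prior=[10, 90])   # pulled toward the prior
print(no_prior, with_prior)
```

With a zero prior the estimate is the plain relative frequency; a non-zero prior shifts it, and the shift matters most when the counts are small.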

In [14]:
bn.cpds['Y']

Y,Unnamed: 1
0,0.957106
1,0.042894


In [15]:
bn.cpds['LAT']

Y,0,1
LAT,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.066065,0.999999
1,0.933935,1e-06
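
As a small worked example of reading these two tables together, the marginal distribution of `LAT` follows from summing `Y` out, $P(\mathrm{LAT}) = \sum_y P(\mathrm{LAT} \mid Y = y)\,P(Y = y)$. The sketch below does this with NumPy, using the values printed above; the array names are introduced here for illustration.

```python
import numpy as np

# Values copied from the bn.cpds['Y'] and bn.cpds['LAT'] outputs.
p_y = np.array([0.957106, 0.042894])  # P(Y=0), P(Y=1)
p_lat_given_y = np.array([
    [0.066065, 0.999999],  # P(LAT=0 | Y=0), P(LAT=0 | Y=1)
    [0.933935, 0.000001],  # P(LAT=1 | Y=0), P(LAT=1 | Y=1)
])

# Marginalise Y out: P(LAT) = sum_y P(LAT | Y=y) * P(Y=y)
p_lat = p_lat_given_y @ p_y
print(p_lat)
```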


In [16]:
bn.cpds['SAI']

LAT,0,1
SAI,Unnamed: 1_level_1,Unnamed: 2_level_1
+1,0.016114,0.002527
+2,0.002346,0.0006
I,0.98154,0.996873


In [17]:
bn.cpds['REVEL']

LAT,0,1
REVEL,Unnamed: 1_level_1,Unnamed: 2_level_1
+1,0.158307,0.01556179
+2,0.2315,0.0001619213
+3,0.14122,1.420845e-09
+4,0.147479,2.229892e-18
-1,0.041892,0.1402624
-2,0.030209,0.2793804
-3,0.013849,0.2717417
-4,0.032354,0.131181
I,0.20319,0.1617109
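
The four tables above fully specify the model, so a posterior such as $P(Y \mid \mathrm{REVEL}, \mathrm{SAI})$ can be checked by hand. The sketch below sums `LAT` out and applies Bayes' rule with NumPy. The evidence `REVEL='+4'`, `SAI='I'` is an illustrative choice, and the array names are introduced here for the example.

```python
import numpy as np

# CPD values copied from the outputs above.
p_y = np.array([0.957106, 0.042894])                      # P(Y)
p_lat_given_y = np.array([[0.066065, 0.999999],
                          [0.933935, 0.000001]])          # indexed [lat, y]
p_revel4_given_lat = np.array([0.147479, 2.229892e-18])   # P(REVEL='+4' | LAT)
p_sai_i_given_lat = np.array([0.98154, 0.996873])         # P(SAI='I' | LAT)

# REVEL and SAI are independent given LAT, so their likelihoods multiply.
evidence_given_lat = p_revel4_given_lat * p_sai_i_given_lat

# Sum LAT out to get the likelihood of the evidence for each value of Y.
evidence_given_y = evidence_given_lat @ p_lat_given_y

# Bayes' rule: posterior is proportional to likelihood times prior.
joint = evidence_given_y * p_y
posterior_y = joint / joint.sum()
print(posterior_y)
```

Under these tables, observing a high `REVEL` state raises the probability of `Y=1` well above its prior of about 0.043.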


Questions to ask:
1. What does the data that we want to fit in the BN look like?
   - Are there any variants with missing scores?
2. If no data is missing except in the LV itself, the model has the following useful property:

   - The only parameters that need to be learned through EM are the _CPDs_ of the LV itself and of its children. The CPDs of the other nodes can be learned by MLE
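
This property can be illustrated with a small NumPy sketch of EM on synthetic data. It is not CausalNex's implementation: the binary nodes `R` and `S`, the generating probabilities, the helper names, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a hypothetical Y -> LAT -> {R, S} model with binary nodes.
n = 5000
y = rng.integers(0, 2, size=n)
lat = np.where(rng.random(n) < np.where(y == 1, 0.9, 0.2), 1, 0)
r = np.where(rng.random(n) < np.where(lat == 1, 0.8, 0.1), 1, 0)
s = np.where(rng.random(n) < np.where(lat == 1, 0.7, 0.3), 1, 0)
# From here on `lat` is treated as unobserved and is never read again.

# Y is fully observed and its CPD does not involve LAT, so counting (MLE) is enough.
p_y = np.bincount(y, minlength=2) / n


def random_cpd(rows, cols):
    """Random starting CPD whose columns sum to 1."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=0)


p_lat_y = random_cpd(2, 2)  # indexed [lat, y]
p_r_lat = random_cpd(2, 2)  # indexed [r, lat]
p_s_lat = random_cpd(2, 2)  # indexed [s, lat]


def joint_over_lat():
    """P(lat | y) P(r | lat) P(s | lat) for every row, shape (2, n)."""
    return p_lat_y[:, y] * p_r_lat[r, :].T * p_s_lat[s, :].T


history = []
for _ in range(50):
    # E-step: posterior weight of each latent state for each row.
    joint = joint_over_lat()
    resp = joint / joint.sum(axis=0)

    # M-step: expected counts take the place of observed counts.
    for state in (0, 1):
        p_lat_y[:, state] = resp[:, y == state].sum(axis=1)
        p_r_lat[state, :] = resp[:, r == state].sum(axis=1)
        p_s_lat[state, :] = resp[:, s == state].sum(axis=1)
    p_lat_y /= p_lat_y.sum(axis=0)
    p_r_lat /= p_r_lat.sum(axis=0)
    p_s_lat /= p_s_lat.sum(axis=0)

    # Log-likelihood of the observed children given Y under the updated CPDs.
    history.append(np.log(joint_over_lat().sum(axis=0)).sum())

print(p_y, p_lat_y, sep="\n")
```

Because the latent states are unlabelled, the learned states may come out swapped relative to the generating ones. The log-likelihood stored in `history` should still never decrease from one iteration to the next.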

### Querying the updated model

In [None]:
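from causalnex.inference import InferenceEngine

ie = InferenceEngine(bn)
# do_intervention needs a node and a state; Y=1 is an illustrative choice
ie.do_intervention("Y", 1)
ie.query()  # marginals of every node after the intervention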