### In this notebook, we implement the scenarios from the Azadkia paper.
# First scenario

First, we hardcode the SEM given in their simulation study:
$$
X_1, X_2, X_3, X_4, X_8, X_{10}, X_{12}, \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1) \\
X_5 = X_1 - \mathrm{arctan}(X_2) + \varepsilon_5 \\
X_6 = X_2 + X_4 + X_3^2 + \varepsilon_6 \\
X_7 = \sin(X_3) + \varepsilon_7 \\
X_9 = \sin(X_6 + \varepsilon_9) + \vert X_{10} \vert \\
X_{11} = X_6(X_{12} - X_8) + \varepsilon_{11} \\
X_{13} = \mathrm{arctan}(X_9^2 + \varepsilon_{13}) \\
X_{14} = \sin(X_{11}) + \varepsilon_{14} \\
X_{15} = \sqrt{\vert X_{12} \vert} + \varepsilon_{15} \\
X_{16} = \sin(X_{12}) + \varepsilon_{16}
$$
These equations are specified in the file `graph_description.yml`.
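As a cross-check that does not depend on PARCS, the equations above can also be sampled directly with NumPy. This is only a minimal sketch: the function name `sample_sem` and the use of `np.random.default_rng` are assumptions made here for illustration, so the draws will not match the values produced by the notebook cells.

```python
import numpy as np

def sample_sem(n, seed=2022):
    """Draw n samples from the first-scenario SEM; returns a dict of arrays."""
    rng = np.random.default_rng(seed)
    x = {}
    # Exogenous variables: independent standard normals.
    for i in (1, 2, 3, 4, 8, 10, 12):
        x[i] = rng.standard_normal(n)
    # One independent standard-normal noise term per endogenous variable.
    eps = {i: rng.standard_normal(n) for i in (5, 6, 7, 9, 11, 13, 14, 15, 16)}
    # Structural equations, in an order that respects the graph.
    x[5] = x[1] - np.arctan(x[2]) + eps[5]
    x[6] = x[2] + x[4] + x[3] ** 2 + eps[6]
    x[7] = np.sin(x[3]) + eps[7]
    x[9] = np.sin(x[6] + eps[9]) + np.abs(x[10])
    x[11] = x[6] * (x[12] - x[8]) + eps[11]
    x[13] = np.arctan(x[9] ** 2 + eps[13])
    x[14] = np.sin(x[11]) + eps[14]
    x[15] = np.sqrt(np.abs(x[12])) + eps[15]
    x[16] = np.sin(x[12]) + eps[16]
    return {f"X{i}": x[i] for i in sorted(x)}

data = sample_sem(100)
```

Note that the noise terms of $X_9$ and $X_{13}$ enter inside the nonlinearity, which is why the PARCS version needs extra nodes such as `e9` and `e13`.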

In [1]:
from parcs.cdag.graph_objects import Graph
from parcs.graph_builder.parsers import graph_file_parser
import numpy as np
np.random.seed(2022)

nodes, edges = graph_file_parser('graph_description.yml')
g = Graph(nodes=nodes, edges=edges)
samples = g.sample(size=100)

In [3]:
samples.head()

Unnamed: 0,X1,X10,X12,X2,X3,X4,X8,e13,e9,d12abssqrt,...,X15,X16,X5,X6,X7,X11,X9,X13,d11sin,X14
0,0.621095,-0.806534,0.409616,0.4516,0.117834,-0.072312,-1.504544,0.129164,1.20202,0.640013,...,1.00602,-0.75589,0.884877,0.764481,-0.022278,2.744968,1.72926,1.260582,0.386307,-1.414856
1,0.288757,-2.016827,0.491046,-0.016356,0.853276,0.196102,-0.796253,0.799684,-1.64322,0.700746,...,1.248989,0.808142,0.912757,0.856326,0.125988,0.141766,1.308663,1.191977,0.141292,-1.51027
2,0.431347,1.947913,-0.041378,1.609065,-0.358065,-0.437947,-1.097065,-0.692636,1.863751,0.203417,...,-0.955415,-0.844848,1.020144,1.092005,-0.026929,0.032233,2.132682,1.317031,0.032227,1.428412
3,-0.292549,0.731825,0.632722,0.513692,-0.100318,-0.15691,-1.685267,-0.818218,0.854992,0.795438,...,3.12358,1.665267,-0.670351,-0.579966,0.763192,-0.913608,1.003397,0.186398,-0.791713,-0.250875
4,-1.937319,1.113283,-1.668765,-0.230548,-2.341971,1.11968,-1.566945,0.753334,-0.084964,1.291807,...,1.20153,-1.04246,-0.744101,6.337996,4.962339,-0.225491,1.083135,1.092008,-0.223585,-1.525946


In [None]:
samples = samples.drop(columns = ['d2arctan', 'd3squared', 'd3sin', 'e9', 'e13', 'd11sin', 'd12abssqrt', 'd12sin'])
print(samples.columns)
samples.to_csv('azadkia_1.csv')
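The hard-coded list of helper columns above could also be replaced by the naming convention that the second scenario relies on, where every column whose name starts with a lowercase letter is dropped. The helper `observed_columns` below is a hypothetical name introduced for this sketch, and it works on plain column names so that it can be checked without a graph.

```python
def observed_columns(columns):
    """Keep every column name that does not start with a lowercase letter."""
    return [name for name in columns if not name[:1].islower()]

# Example column names taken from the first scenario.
columns = ["X1", "X2", "e9", "d2arctan", "X5", "d12sin"]
kept = observed_columns(columns)
print(kept)  # ['X1', 'X2', 'X5']
```

Assuming `samples` is a pandas DataFrame, as the calls to `head()` and `to_csv()` suggest, the same filter would read `samples[observed_columns(samples.columns)]`.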

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(samples)

### Observations while implementing:
* For SEMs, it is useful to have a deterministic dummy node to express the nonlinear behaviour. However, it is cumbersome to define the dummy node's function in a separate file rather than directly in `graph_description.yml`.
* We need a boolean flag on the variables, `dummy_nodes`, with default `False`. A node flagged as dummy would then not be returned after the sample is generated. Currently, the user always has to filter out the dummy nodes manually afterwards.
* Specifying the edges manually for every deterministic node takes time. Is it possible to automate this, so that the user gives the equation and the edges are generated automatically?
* There is a tendency to overuse deterministic nodes for X. Is this a problem?
* The `Identity()` edge is very common. Should it be the default?
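To make the question about automatic edge generation concrete, one possible approach is to scan the right-hand side of each equation for variable names and emit one edge per parent found. The sketch below is not a PARCS feature. The function `infer_edges`, the dictionary input format, and the assumption that all variables are named `X<number>` are hypothetical choices made for illustration.

```python
import re

def infer_edges(equations):
    """Map {child: right-hand side} to a sorted list of (parent, child) edges."""
    edges = set()
    for child, rhs in equations.items():
        # Every token of the form X<digits> on the right-hand side is a parent.
        for parent in re.findall(r"\bX\d+\b", rhs):
            if parent != child:
                edges.add((parent, child))
    return sorted(edges)

# Three equations from the first scenario, written as plain strings.
equations = {
    "X5": "X1 - arctan(X2)",
    "X6": "X2 + X4 + X3**2",
    "X11": "X6 * (X12 - X8)",
}
edges = infer_edges(equations)
```

A real implementation would additionally have to decide which edge function each parent gets, which this sketch leaves open.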

# Second scenario

In [1]:
from parcs.cdag.graph_objects import Graph
from parcs.graph_builder.parsers import graph_file_parser
import numpy as np

In [7]:
for run in range(100):
    # Use a different but reproducible seed for every run.
    np.random.seed(2022 + run)
    for attack in ["U12", "U14", "U47"]:
        print(attack)
        file = f'scenario_2_{attack}_graph.yml'
        nodes, edges = graph_file_parser(file)
        samples = Graph(nodes=nodes, edges=edges).sample(size=10000)
        # Drop helper columns, i.e. those whose names start with a lowercase letter.
        samples = samples.drop(columns=[colname for colname in samples.columns if colname[0].islower()])
        samples.to_csv(f'./../data5000/{attack}/{run}.csv')

U12
U14
U47
[... the same three lines repeat for the subsequent runs ...]


In [2]:
for run in range(5):
    # Use a different but reproducible seed for every run.
    np.random.seed(2022 + run)
    for attack in ["ones"]:
        print(attack)
        file = f'scenario_2_{attack}_graph.yml'
        nodes, edges = graph_file_parser(file)
        samples = Graph(nodes=nodes, edges=edges).sample(size=10000)
        # Drop helper columns, i.e. those whose names start with a lowercase letter.
        samples = samples.drop(columns=[colname for colname in samples.columns if colname[0].islower()])
        samples.to_csv(f'./../data5000/{attack}/{run}.csv')

ones
ones
ones
ones
ones


Now that we have generated the scenarios, we want a plot showing:
* on the x-axis, the strength of the new edge, or alternatively codec(edge) / ((codec(X2,X5) + codec(X3,X5)) / 2), i.e. the ratio of the new edge's strength to the average strength of the parent variables going into X5;
* on the y-axis, the number of exact recoveries (or the number of runs with false positives). It is best to save the found parent set for each run and compare afterwards;
* 4 colours representing the interventions on the different edges.
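The quantities for the two axes could be computed along the following lines once the found parent sets have been saved. This is a sketch under assumptions: the function names, the `(attack, found_parents)` input format, and the example data are all hypothetical, and the codec values are taken as given numbers rather than computed here.

```python
from collections import defaultdict

def strength_ratio(codec_edge, codec_x2_x5, codec_x3_x5):
    """Ratio of the new edge's codec to the mean codec of X2 -> X5 and X3 -> X5."""
    return codec_edge / ((codec_x2_x5 + codec_x3_x5) / 2)

def count_exact_recoveries(results, true_parents):
    """Count, per attack, the runs whose found parent set equals the true one.

    results: iterable of (attack, found_parents) pairs, one pair per run.
    """
    counts = defaultdict(int)
    for attack, found in results:
        counts[attack] += int(set(found) == set(true_parents))
    return dict(counts)

# Made-up illustration data, not results from the notebook.
results = [
    ("U12", {"X2", "X3"}),
    ("U12", {"X1", "X2", "X3"}),  # one false positive
    ("U14", {"X2", "X3"}),
    ("U47", {"X2"}),              # one missed parent
]
counts = count_exact_recoveries(results, true_parents={"X2", "X3"})
```

The resulting per-attack counts, together with the strength ratio of each setting, can then be passed to a plotting library with one colour per attack.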