# Chapter 6: Causal Structure Learning from Data

In the previous chapter, we built a causal model using the equation structure of the problem, which we knew from domain knowledge. In the real world, however, we rarely have the privilege of knowing the causal structure.
In this chapter, our goal is to create a causal model without knowing the causal structure of the problem. In other words, our problem is **causal structure identification**, also called **causal discovery**.

- **With structure learning, we want to determine the structure of the graph that best captures the causal dependencies between the variables in a data set.** 
- In other words, given a dataset, we derive a causal model that describes it.


## How can we estimate the causal structure from a dataset?

Unfortunately, there is no standard recipe for this, which is why causal inference is generally challenging. Causal discovery is an example of an inverse problem: we observe the outcome (the data) and must work backwards to the mechanism that generated it.

The usual approach to solving inverse problems is to **make assumptions** about what we are trying to investigate. This narrows down the possible solutions and hopefully makes the problem solvable. 

There are four common assumptions made across causal discovery algorithms. 

- **Acyclicity** — causal structure can be represented by a DAG $G$ (mentioned in [Chapter 3](/lectures/CH-3-Graphical-Causal-Models.ipynb))
- **Markov Property** — all nodes are independent of their non-descendants when conditioned on their parents (mentioned in [Chapter 3](/lectures/CH-3-Graphical-Causal-Models.ipynb))
- **Faithfulness** — all conditional independences in true underlying distribution $p$ are represented in $G$ 
- **Sufficiency** — any pair of nodes in $G$ has no common external cause


A comprehensive discussion of these causal assumptions is available in [Kalainathan et al., 2018](https://arxiv.org/abs/1803.04929).

<br/><br/>

Although these assumptions help narrow down the number of possible causal models, they do not guarantee that we arrive at a single causal model. This is where a few tricks for causal discovery are helpful. The table below lists these tricks together with algorithms that use them.


| **TRICK**                             | **ALGORITHM**                                           |
|---------------------------------------|---------------------------------------------------------|
| **Conditional Independence Testing**  | PC <br> Fast Causal Inference (FCI) <br> Inductive Causation (IC) |
| **Greedy Search on DAG Space**        | Greedy Equivalence Search (GES) <br> Greedy Interventional Equivalence Search (GIES) <br> Concave penalized Coordinate Descent with reparametrization (CCDr) |
| **Exploiting Asymmetry**              | Linear Non-Gaussian Acyclic Model (LiNGAM) <br> Nonlinear Additive Noise Models <br> Post-nonlinear Causal Model (PNL) <br> Granger Causality |
| **Hybrid**                            | Structural Agnostic Modeling (SAM) <br> Causal Additive Modeling (CAM) <br> Causal Generative Neural Network (CGNN) |


<br/><br/>

A broad overview of different causal structure search methods is available at:
[Review of Causal Discovery Methods Based on Graphical Models](https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full)

We also recommend Chapter 4, *Learning Cause-Effect Models*, of the [Elements of Causal Inference](https://mitpress.mit.edu/books/elements-causal-inference) book.


## Example for Causal Structure Learning: Shortness-of-breath disease

We will make use of the **bnlearn** library, which is built on top of the extensive *pgmpy* library. *pgmpy* is a Python implementation of Bayesian networks with various algorithms for structure learning, parameter estimation, approximate (sampling-based) and exact inference, and causal inference.

In this example we will analyse patient records regarding shortness of breath (dyspnoea). The dataset has few variables and is simulated by [Lauritzen and Spiegelhalter, 1988](https://www.jstor.org/stable/2345762?seq=1). The data is about the relationship between lung diseases (tuberculosis, lung cancer or bronchitis) and visits to infectious areas for 20000 patients.

**Background:** Shortness-of-breath (dyspnoea) may be due to **tuberculosis, lung cancer, bronchitis**, none of them, or more than one of them. A recent visit to infectious areas increases the chances of tuberculosis, while smoking is known to be a risk factor for both lung cancer and bronchitis. The results of a single chest X-ray do not discriminate between lung cancer and tuberculosis, and neither does the presence or absence of dyspnoea.

![img](img/ch6/dyspnoea.jpeg)


### Trick 1: Conditional Independence Testing

One of the earliest causal discovery algorithms is the PC algorithm, named after Peter Spirtes and Clark Glymour. This algorithm builds on the idea that two statistically independent variables are not causally linked. An outline of the PC algorithm is illustrated in the figure below (image taken from https://towardsdatascience.com/causal-discovery-6858f9af6dcb).

![img](img/ch6/Trick1.png)


**Step 1:** form a fully connected, undirected graph using every variable in the dataset. 

**Step 2:**  edges are deleted if the corresponding variables are independent. 

**Step 3:**  the remaining edges undergo conditional independence testing, e.g., an independence test of the bottom and far-right nodes conditioned on the middle node (see step 2 in the figure). If conditioning on a variable removes the dependence, the edge is deleted and that variable is added to the Separation set for those two variables. Depending on the size of the graph, conditional independence testing continues (i.e. conditioning on more variables) until there are no more candidates for testing.

**Step 4:**  colliders (i.e. $X \rightarrow Y \leftarrow Z$) are oriented based on the Separation set of node pairs. 

**Step 5:**  remaining edges are directed based on two constraints, 1) no new v-structures and 2) no directed cycles can be formed.
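
Steps 1 to 4 above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the full PC algorithm: the data are simulated from a hypothetical collider $X \rightarrow Y \leftarrow Z$ with Gaussian noise, the independence "test" is a crude cutoff on the absolute (partial) correlation instead of a proper statistical test, conditioning sets hold at most one variable, and Step 5 is omitted. The names `corr`, `partial_corr`, `pc_sketch` and the `threshold` value are made up for this example.

```python
import itertools
import math
import random


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(data, x, y, cond):
    """Correlation of x and y given zero or one conditioning variable."""
    r_xy = corr(data[x], data[y])
    if not cond:
        return r_xy
    (z,) = cond
    r_xz, r_yz = corr(data[x], data[z]), corr(data[y], data[z])
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def pc_sketch(data, threshold=0.1):
    nodes = sorted(data)
    # Step 1: fully connected undirected graph.
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    # Steps 2-3: test with conditioning sets of size 0, then size 1.
    for size in (0, 1):
        for x, y in itertools.combinations(nodes, 2):
            if y not in adj[x]:
                continue
            neighbours = sorted((adj[x] | adj[y]) - {x, y})
            for cond in itertools.combinations(neighbours, size):
                # Assumption: |partial correlation| below the cutoff counts as independent.
                if abs(partial_corr(data, x, y, cond)) < threshold:
                    adj[x].discard(y)
                    adj[y].discard(x)
                    sepset[frozenset((x, y))] = set(cond)
                    break
    # Step 4: orient x -> y <- z when x and z are non-adjacent
    # and y is not in their separation set.
    arrows = set()
    for y in nodes:
        for x, z in itertools.combinations(sorted(adj[y]), 2):
            if z not in adj[x] and y not in sepset[frozenset((x, z))]:
                arrows.update({(x, y), (z, y)})
    skeleton = {frozenset((u, v)) for u in nodes for v in adj[u]}
    return skeleton, arrows


# Hypothetical data: X and Z are independent causes of Y.
rng = random.Random(0)
n = 5000
X = [rng.gauss(0, 1) for _ in range(n)]
Z = [rng.gauss(0, 1) for _ in range(n)]
Y = [x + z + rng.gauss(0, 1) for x, z in zip(X, Z)]

skeleton, arrows = pc_sketch({"X": X, "Y": Y, "Z": Z})
print(sorted(tuple(sorted(edge)) for edge in skeleton))
print(sorted(arrows))
```

With this sample size, the sketch should drop the X-Z edge in the unconditional round (empty separation set), keep X-Y and Y-Z, and then orient both remaining edges into Y because Y is not in the separation set of X and Z.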


### Trick 2: Greedy Search of Graph Space

A greedy algorithm is an approach for solving a problem by selecting the best option available at the moment. 

- A greedy algorithm does not worry about whether the current best choice will lead to the globally optimal result. It never reverses an earlier decision, even if that choice was wrong.

- Greedy algorithms are usually easier to describe and can perform quite well compared with other algorithms. However, a greedy search cannot guarantee an optimal solution.

- For most problems, the space of possible DAGs is so big that finding a true optimal solution is challenging.

- The **Greedy Equivalence Search (GES)** algorithm uses this trick. GES starts with an empty graph and iteratively adds directed edges such that the improvement in a model fitness measure (i.e., score) is maximized.
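
The idea of starting from an empty graph and repeatedly adding the edge with the largest score improvement can be sketched as follows. This is a simplified forward-only search over DAGs with a BIC score for binary variables; it is not GES itself, which searches over equivalence classes and also has a backward phase that removes edges. The data are simulated from a hypothetical chain $A \rightarrow B \rightarrow C$, and the names `local_bic`, `reaches`, `greedy_search` and `noisy_copy` are made up for this example.

```python
import itertools
import math
import random
from collections import Counter


def local_bic(rows, child, parents):
    """BIC contribution of one binary node given a list of parent names."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    parent_totals = Counter()
    for (pa, _), count in joint.items():
        parent_totals[pa] += count
    loglik = sum(c * math.log(c / parent_totals[pa]) for (pa, _), c in joint.items())
    n_params = 2 ** len(parents)  # one free probability per parent configuration
    return loglik - 0.5 * math.log(len(rows)) * n_params


def reaches(parents, start, goal):
    """True if a directed path start -> ... -> goal exists."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(child for child, pa in parents.items() if node in pa)
    return False


def greedy_search(rows, nodes):
    parents = {v: set() for v in nodes}  # start from the empty graph
    while True:
        best_gain, best_edge = 0.0, None
        for u, v in itertools.permutations(nodes, 2):
            # Skip pairs that are already connected and edges u -> v that close a cycle.
            if u in parents[v] or v in parents[u] or reaches(parents, v, u):
                continue
            old = local_bic(rows, v, sorted(parents[v]))
            new = local_bic(rows, v, sorted(parents[v] | {u}))
            if new - old > best_gain:
                best_gain, best_edge = new - old, (u, v)
        if best_edge is None:  # no addition improves the score: stop
            return parents
        u, v = best_edge
        parents[v].add(u)  # greedy step: never revisited later


# Hypothetical data from the chain A -> B -> C (each link copies with 10% flips).
rng = random.Random(1)


def noisy_copy(bit):
    return bit if rng.random() < 0.9 else 1 - bit


rows = []
for _ in range(3000):
    a = rng.randint(0, 1)
    b = noisy_copy(a)
    c = noisy_copy(b)
    rows.append({"A": a, "B": b, "C": c})

learned = greedy_search(rows, ["A", "B", "C"])
skeleton = {frozenset((u, v)) for v, pa in learned.items() for u in pa}
print(learned)
```

Only the undirected skeleton (A-B and B-C, with no A-C edge) should be read from the result. BIC assigns the same score to graphs in the same equivalence class, so the directions printed by this sketch depend on tie-breaking.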


### Trick 3: Exploiting Asymmetries in Cause-Effect Relations

A fundamental property of causality is **asymmetry**. $A$ could cause $B$, but $B$ may not cause $A$. Some algorithms leverage this idea to select between candidate causal models, using asymmetries in time, complexity, or functional form.

- Time asymmetry is quite natural since causes happen before effects. This is also used in the **Granger causality test**. The Granger causality test says that a time-evolving variable $X$ causes another time-evolving variable $Y$ if predictions of $Y$ based on its own past values and on the past values of $X$ are better than predictions of $Y$ based only on $Y$'s own past values.

- Complexity asymmetry follows the principle of **Occam's razor**: simpler models are better. In other words, if you have a handful of candidate models to choose from, this idea says to pick the simplest one. One way of quantifying simplicity (or complexity) is **Kolmogorov complexity**.

- Functional asymmetry assumes that models that better fit a relationship are better candidates. For example, given two variables $X$ and $Y$, the nonlinear additive noise model (NANM) performs a nonlinear regression between $X$ and $Y$ in both directions, e.g. $y = f(x) + n$, where $n$ is the noise (residual) term. The model (of causation) is then accepted if the potential cause (e.g. $x$) is independent of the noise term (e.g. $n$).
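
The time-asymmetry idea behind Granger causality can be illustrated with a small sketch. It is a simplified stand-in for the actual test: it uses a single lag, no intercept, and compares the residual sum of squares of two least-squares fits instead of computing a formal F-statistic. The series are simulated so that past values of `x` drive `y` but not the reverse, and the names `rss_one`, `rss_two` and `granger_gain` are made up for this example.

```python
import random


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def rss_one(target, a):
    """Residual sum of squares of the least-squares fit target ~ beta * a."""
    beta = dot(a, target) / dot(a, a)
    return sum((t - beta * u) ** 2 for t, u in zip(target, a))


def rss_two(target, a, b):
    """Residual sum of squares of the fit target ~ b1 * a + b2 * b (normal equations)."""
    saa, sbb, sab = dot(a, a), dot(b, b), dot(a, b)
    say, sby = dot(a, target), dot(b, target)
    det = saa * sbb - sab ** 2
    b1 = (say * sbb - sab * sby) / det
    b2 = (saa * sby - sab * say) / det
    return sum((t - b1 * u - b2 * v) ** 2 for t, u, v in zip(target, a, b))


def granger_gain(cause, effect):
    """Relative drop in lag-1 prediction error of `effect` when the past of `cause` is added."""
    target, own_past, other_past = effect[1:], effect[:-1], cause[:-1]
    restricted = rss_one(target, own_past)          # own past only
    full = rss_two(target, own_past, other_past)    # own past plus the other series
    return 1 - full / restricted


# Hypothetical series: x evolves on its own, y depends on the past of x.
rng = random.Random(0)
x, y = [0.0], [0.0]
for _ in range(2000):
    x_prev, y_prev = x[-1], y[-1]
    x.append(0.5 * x_prev + rng.gauss(0, 1))
    y.append(0.5 * y_prev + 0.8 * x_prev + rng.gauss(0, 1))

gain_x_to_y = granger_gain(x, y)
gain_y_to_x = granger_gain(y, x)
print(f"x -> y: {gain_x_to_y:.3f}   y -> x: {gain_y_to_x:.3f}")
```

Adding the past of `x` should reduce the prediction error of `y` substantially, while adding the past of `y` should barely help in predicting `x`. That asymmetry is what the Granger test formalizes.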

### Trick 4: Hybrid Approaches

The last trick covers algorithms that do not fit the other tricks and exploit different assumptions. For example, neural networks have been used to explore causal relationships. Two examples follow:

- Causal Generative Neural Networks (CGNN), where the algorithm learns functional causal models from observational data based on generative neural networks. [CGNN](https://arxiv.org/pdf/1711.08936.pdf)

- Causal Recurrent Neural Networks (CRNN), where we developed a framework to explore causal structure in a multivariate time-series problem. [CRNN](https://ieeexplore.ieee.org/abstract/document/8437162)


In [2]:
# Load model
import bnlearn as bn
df = bn.load(filepath='data/smoke_dataset.pkl')
df


## Build a causal model when we have data and domain knowledge

As we saw in the lectures, expert knowledge can be included in causal models by using graphs in the form of a Directed Acyclic Graph (DAG). Let's assume that our knowledge about dyspnoea is limited to the following: smoking is related to lung cancer, smoking is related to bronchitis, and a patient with lung cancer or bronchitis may need an X-ray examination.

Therefore, we create a DAG based on this knowledge:

In [3]:
edges = [('smoke', 'lung'),
         ('smoke', 'bronc'),
         ('lung', 'xray'),
         ('bronc', 'xray')]

# Create the DAG from the edges
DAG = bn.make_DAG(edges)

# Plot and make sure the arrows are correct.
bn.plot(DAG)


<br/><br/>

At this point we have the data set in our dataframe (`df`), and we also have the DAG based on our expert knowledge.
We can use parameter learning to learn the conditional probability distributions (CPDs) of the variables in our model.
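
To make the idea of parameter learning concrete, the sketch below estimates one CPD, P(lung | smoke), by counting. It runs on a handful of hypothetical toy records rather than on the chapter's dataset, and it is not how bnlearn is implemented. The `pseudo_count` argument is only a rough stand-in for the prior that a Bayesian estimator would use, and the name `estimate_cpd` is made up for this example.

```python
from collections import Counter


def estimate_cpd(records, child, parent, pseudo_count=0.0):
    """Estimate P(child | parent) for two binary variables.

    Returns a dict keyed by (child_value, parent_value).
    """
    counts = Counter((r[parent], r[child]) for r in records)
    cpd = {}
    for pa in (0, 1):
        total = counts[(pa, 0)] + counts[(pa, 1)] + 2 * pseudo_count
        for ch in (0, 1):
            cpd[(ch, pa)] = (counts[(pa, ch)] + pseudo_count) / total
    return cpd


# Hypothetical toy records: 4 smokers (3 with lung cancer), 4 non-smokers (1 with lung cancer).
records = (
    [{"smoke": 1, "lung": 1}] * 3
    + [{"smoke": 1, "lung": 0}]
    + [{"smoke": 0, "lung": 1}]
    + [{"smoke": 0, "lung": 0}] * 3
)

mle = estimate_cpd(records, "lung", "smoke")                          # plain relative frequencies
smoothed = estimate_cpd(records, "lung", "smoke", pseudo_count=1.0)   # counts plus a pseudo-count

print(mle[(1, 1)])       # P(lung=1 | smoke=1) = 3/4
print(smoothed[(1, 1)])  # (3 + 1) / (4 + 2) = 2/3
```

The pseudo-count pulls the estimate towards 0.5, which matters most when a parent configuration has few observations.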

In [4]:
# Check the current CPDs in the DAG.
bn.print_CPD(DAG)
# [bnlearn] >No CPDs to print. Tip: use bn.plot(DAG) to make a plot.
# This is correct, we did not yet specify any CPD.

# Learn the parameters from the data set.
# As input we have the DAG without CPDs.
DAG = bn.parameter_learning.fit(DAG, df, methodtype='bayes')

# Print the CPDs
bn.print_CPD(DAG)
# At this point we have a DAG with the learned CPDs

Now we can combine our expert knowledge with a data set! We can then make inferences, which allows us to ask causal questions of the model. Let us demonstrate with a few questions...

<br/><br/>

**Question 1:**
What is the probability of lung cancer, given that we know that the patient smokes?

In [4]:
q1 = bn.inference.fit(DAG, variables=['lung'], evidence={'smoke':1})

[bnlearn] >Variable Elimination..

+----+--------+-----------+
|    |   lung |         p |
|  0 |      0 | 0.0330251 |
+----+--------+-----------+
|  1 |      1 | 0.966975  |
+----+--------+-----------+

<br/><br/>

**Question 2:**
What is the probability of bronchitis, given that we know that the patient smokes?

In [5]:
q2 = bn.inference.fit(DAG, variables=['bronc'], evidence={'smoke':1})

[bnlearn] >Variable Elimination..



+----+---------+----------+
|    |   bronc |        p |
|  0 |       0 | 0.311002 |
+----+---------+----------+
|  1 |       1 | 0.688998 |
+----+---------+----------+





<br/><br/>

**Question 3:** 
What is the probability of lung cancer, given that we know that the patient smokes and also has bronchitis?

In [6]:
q3 = bn.inference.fit(DAG, variables=['lung'], evidence={'smoke':1, 'bronc':1})

[bnlearn] >Variable Elimination..



+----+--------+-----------+
|    |   lung |         p |
|  0 |      0 | 0.0330251 |
+----+--------+-----------+
|  1 |      1 | 0.966975  |
+----+--------+-----------+
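
This output matches the distribution obtained for Question 1. In the expert DAG, `lung` and `bronc` share only the parent `smoke` and the common child `xray`. Once `smoke` is known and `xray` is not observed, `bronc` carries no further information about `lung`. The sketch below checks this by brute-force enumeration over the joint distribution. The CPD values are hypothetical and chosen only for illustration (they are not the learned CPDs), and the helper names `bern`, `joint` and `query` are made up for this example.

```python
import itertools

# Hypothetical CPDs for the DAG smoke -> lung, smoke -> bronc, lung -> xray <- bronc.
p_smoke = {1: 0.5, 0: 0.5}                    # P(smoke = value)
p_lung = {1: 0.10, 0: 0.01}                   # P(lung = 1 | smoke)
p_bronc = {1: 0.60, 0: 0.30}                  # P(bronc = 1 | smoke)
p_xray = {(1, 1): 0.95, (1, 0): 0.90,         # P(xray = 1 | lung, bronc)
          (0, 1): 0.20, (0, 0): 0.05}


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one


def joint(smoke, lung, bronc, xray):
    """Joint probability from the DAG factorization."""
    return (p_smoke[smoke]
            * bern(p_lung[smoke], lung)
            * bern(p_bronc[smoke], bronc)
            * bern(p_xray[(lung, bronc)], xray))


def query(target, evidence):
    """P(target = 1 | evidence) by summing the joint over all consistent worlds."""
    names = ("smoke", "lung", "bronc", "xray")
    weights = {0: 0.0, 1: 0.0}
    for values in itertools.product((0, 1), repeat=4):
        world = dict(zip(names, values))
        if all(world[k] == v for k, v in evidence.items()):
            weights[world[target]] += joint(**world)
    return weights[1] / (weights[0] + weights[1])


p_q1 = query("lung", {"smoke": 1})
p_q3 = query("lung", {"smoke": 1, "bronc": 1})
print(p_q1, p_q3)  # both equal P(lung=1 | smoke=1), up to rounding

# Once the common child xray is observed, bronc does become informative about lung.
p_with_xray = query("lung", {"smoke": 1, "xray": 1})
p_with_xray_bronc = query("lung", {"smoke": 1, "xray": 1, "bronc": 1})
print(p_with_xray, p_with_xray_bronc)
```

The second pair of queries differs because conditioning on a collider (`xray`) opens the path between its parents.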





<br/><br/>

**Question 4:**
Let's make the question even more specific. What is the joint probability of lung cancer and bronchitis, given that we know that the patient smokes but did not have an X-ray?

In [7]:
q4 = bn.inference.fit(DAG, variables=['bronc','lung'], evidence={'smoke':1, 'xray':0})

[bnlearn] >Variable Elimination..



+----+---------+--------+-----------+
|    |   bronc |   lung |         p |
|  0 |       0 |      0 | 0.0915345 |
+----+---------+--------+-----------+
|  1 |       0 |      1 | 0.226912  |
+----+---------+--------+-----------+
|  2 |       1 |      0 | 0.194173  |
+----+---------+--------+-----------+
|  3 |       1 |      1 | 0.487381  |
+----+---------+--------+-----------+





## Build a causal model when we have data and no domain knowledge

Suppose that we have the medical records of hundreds or even thousands of patients treated for shortness of breath (dyspnoea). Our goal is to determine the causal relationships between the variables given the data set. We do not have any prior knowledge; for example, we are data scientists who have just started to work with the dataset.

We use structure learning to estimate the DAG structure of the dataset.

In [5]:
df = bn.load(filepath='data/smoke_dataset.pkl')
# Structure learning on the data set
model = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')



Let's plot the learned DAG and examine the structure!

In [6]:
# Re-fit with constraint-based search ('cs'); this replaces the hill-climbing model learned above
model = bn.structure_learning.fit(df, methodtype='cs', scoretype='bic')

# Compute edge strength with the chi_square test statistic
# model = bn.independence_test(model, df, test='chi_square')

# Plot the DAG
bn.plot(model, interactive=False)


## References
For this chapter, we used an example from *bnlearn - Library for Bayesian network learning and inference*. The e-book has various examples with nice visualizations. Here is the Python library: [bnlearn](https://pypi.org/project/bnlearn/).

The Python bnlearn itself is inspired by [bnlearn - an R package for Bayesian network learning and inference](https://www.bnlearn.com), an amazing work by Marco Scutari, IDSIA.

For causal structure search methods, we recommend this paper: [Review of Causal Discovery Methods Based on Graphical Models](https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full)