# Chapter 3: Graphical Causal Models

## Thinking About Causality with Graphs

Directed Acyclic Graphs (DAGs) provide a systematic and visual framework for representing and analyzing causal relationships. In addition, DAGs assist in identifying confounding variables, selecting appropriate statistical identification strategies, and making causal inferences based on observed data. 

Here's how we can think about causality using DAGs:


**Directionality**: Causality is about the direction of influence between variables. In graphs, arrows indicate the direction of causality. If there is an arrow from Variable $X$ to Variable $Y$, it suggests that $X$ causally influences $Y$.


**Temporal Order**: Causal relationships are typically characterized by a temporal order, where the cause precedes the effect. In DAGs, causes are depicted as nodes with arrows pointing toward the effect nodes. 


**Conditional Independence**: Graphs provide a way to assess conditional independence, which is a key concept in causal inference. If two variables are **d-separated** in the graph by a set of variables, they are conditionally independent given that set (the converse additionally requires the faithfulness assumption). In that case, no association flows between them once the conditioning variables are taken into account.


**Confounding and Mediation**: DAGs help in understanding the concepts of confounding and mediation. 
* **Confounding** occurs when a variable, observed or unobserved, influences both the cause and the effect, creating a spurious association. DAGs make it explicit by including confounding variables in the graph. 
* **Mediation** is the concept of an intermediate variable that lies on the causal path between the cause and the effect. DAGs represent mediation by showing arrows from the cause to the mediator and from the mediator to the effect.


**Alternative Paths and Backdoor Paths**: Graphs help in identifying alternative paths that can convey indirect causal relationships. 
* An **alternative path** is any route between the cause and the effect other than the direct arrow from the cause to the effect. Understanding alternative paths helps recognize potential confounders and the importance of controlling for them. 
* **Backdoor paths** are non-causal paths that enter the cause through an incoming arrow. If left open, they carry non-causal associations that bias the estimate of the causal effect.


**Counterfactuals and interventions**: Graphs facilitate thinking about counterfactuals and interventions. 
* **Counterfactuals** involve comparing the outcome under two different conditions: the observed condition and a hypothetical condition where a specific intervention or treatment is applied. 
* Graphs provide a framework to represent **interventions** by removing or modifying arrows in the graph, allowing for reasoning about the potential effects of interventions and counterfactual scenarios.


**Guiding Statistical Analysis**: DAGs provide guidance on selecting appropriate statistical methods for estimating causal effects. Based on the graphical structure, DAGs help identify appropriate identification strategies, such as Randomized Controlled Trials (RCT), Structural Equation Modeling (SEM), Instrumental Variable Analysis (IV), Propensity Score Matching (PSM), Regression Discontinuity Design (RDD), etc. 

* A good reference on appropriate statistical methods for estimating causal effects is the book [Causal Inference: What If](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/) by Miguel Hernán and James Robins.

**Addressing Endogeneity and Selection Bias**: DAGs help address endogeneity and selection bias issues in observational studies. By explicitly representing the causal relationships, DAGs can guide the identification of instrumental variables, treatment assignment mechanisms, or appropriate matching strategies to account for bias.


It is important to note that DAGs are simplifications of complex causal systems and should be used in conjunction with rigorous statistical analysis and domain knowledge to draw reliable causal conclusions. DAGs should be seen as a part of a broader toolkit for causal inference rather than a standalone solution.


**<font color='blue'> What are the possible disadvantages of DAGs for Causal inference?</font>**





## Basic Terminologies for Graphs


For our purposes, it is very important to understand which independence and conditional independence assumptions a graphical model entails. To that end, let's go through some common graphical definitions and structures. They are quite simple, but they are sufficient building blocks for understanding everything about graphical models.

**Graph:** A graph $G = (V, E)$ is a set V of vertices (nodes) and a set E of edges, which can be graphically illustrated, for example:

In [49]:
import graphviz as gr  # Graphviz Python bindings, used for every figure in this chapter

g = gr.Digraph(graph_attr={'rankdir':'LR', 'label': "A graph with nodes L, U, A and Y"}, 
               node_attr={'shape': 'plaintext'}, 
               edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.edge("L", "U", dir = "none")  # undirected edge between L and U
g.edge("L", "A")
g.edge("U", "A")
g.edge("A", "Y")
g

**Nodes:** typically represent random variables.

**Edges (arrows):** can be undirected, directed or bi-directed and typically indicate a certain relationship between nodes or possible direct causal effects.

**Path:** A trail of edges going from one node to another, not necessarily following the direction of the arrows. A path cannot cross a node more than once.

**Cyclic Graph:** A cyclic graph has at least one path that can be followed through directed edges back to the original node.

**Acyclic Graph:** An acyclic graph is a graph that contains no such cycles.

<br/><br/>
**Example: Paths in a Graph**

What are paths from $X$ to $C$ in the following graph?

In [None]:
g = gr.Digraph(graph_attr={'rankdir':'LR'}, 
               edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("X", "X", color="blue", fontcolor="blue")
g.node("C", "C", color="blue", fontcolor="blue")
g.edge("U1", "Y")
g.edge("U1", "X")
g.edge("U2", "X")
g.edge("U2", "Y")
g.edge("X", "T")
g.edge("T", "C")
g.edge("C", "Y")
g.edge("X", "C")
g.edge("X", "Y")
g

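Paths from $X$ to $C$ are (the arrows show the direction of each edge; a path may traverse an edge against its arrow):

- $X \rightarrow C$
- $X \rightarrow T \rightarrow C$
- $X \rightarrow Y \leftarrow C$
- $X \leftarrow U_1 \rightarrow Y \leftarrow C$
- $X \leftarrow U_2 \rightarrow Y \leftarrow C$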
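The path search can also be done mechanically. The following is a minimal sketch in plain Python (the helper names `simple_paths` and `show` are illustrative, not part of any library): it ignores edge directions while walking the graph, never visits a node twice, and then prints each path with the direction of every edge it uses.

```python
# Directed edges of the example graph (same as in the Graphviz cell above).
edges = [("U1", "Y"), ("U1", "X"), ("U2", "X"), ("U2", "Y"), ("X", "T"),
         ("T", "C"), ("C", "Y"), ("X", "C"), ("X", "Y")]

# Undirected adjacency: a path may traverse an edge against its arrow.
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

def simple_paths(start, end, visited=()):
    """Yield every path from start to end that visits no node twice."""
    visited = visited + (start,)
    if start == end:
        yield visited
        return
    for nxt in sorted(neighbors[start]):
        if nxt not in visited:
            yield from simple_paths(nxt, end, visited)

def show(path):
    """Render a path, marking the arrow direction of each edge."""
    out = path[0]
    for a, b in zip(path, path[1:]):
        out += (" -> " if (a, b) in edges else " <- ") + b
    return out

paths = [show(p) for p in simple_paths("X", "C")]
for p in paths:
    print(p)
```

Two of the printed paths follow the arrows all the way from $X$ to $C$; the others pass through $Y$ against at least one arrow.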

### More Terminologies:

**Directed Acyclic Graph (DAG):** A DAG is a graph that is both directed and acyclic.​

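**Children and Parents:** The children of a node are the nodes it directly affects; its parents are the nodes that directly affect it.

**Ancestors and Descendants:** The ancestors of a node are the nodes that affect it directly or indirectly; its descendants are the nodes it affects directly or indirectly.

**Exogenous and Endogenous Nodes:** Exogenous nodes have no parents; endogenous nodes have at least one parent.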


<br/><br/>
**Example: Parentship in a Graph**

In [None]:
g = gr.Digraph(graph_attr={'rankdir':'LR'}, 
               edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("T", "T", color="blue", fontcolor="blue")
g.edge("U1", "Y")
g.edge("U1", "X")
g.edge("U2", "X")
g.edge("U2", "Y")
g.edge("X", "T")
g.edge("T", "C")
g.edge("C", "Y")
g.edge("X", "C")
g.edge("X", "Y")
g

Here are parental relations in the graph:

**Parents:** $pa(T) = \{ X \}$​

**Children:** $ch(T) = \{ C \}$​

**Ancestors:** $anc(T) = \{ X, U_1, U_2 \}$​

**Descendants:** $desc(T) = \{ C, Y \}$​
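These relations can be read off the edge list directly. A small sketch in plain Python, where `parents`, `children` and `reach` are illustrative helper names: parents and children are one step away, while ancestors and descendants are obtained by following the same step repeatedly.

```python
# Directed edges of the example graph (same as in the Graphviz cell above).
edges = [("U1", "Y"), ("U1", "X"), ("U2", "X"), ("U2", "Y"), ("X", "T"),
         ("T", "C"), ("C", "Y"), ("X", "C"), ("X", "Y")]

def parents(node):
    """Nodes with an arrow pointing into `node`."""
    return {a for a, b in edges if b == node}

def children(node):
    """Nodes that `node` points to."""
    return {b for a, b in edges if a == node}

def reach(node, step):
    """Collect every node reachable from `node` by repeatedly applying `step`."""
    found, frontier = set(), [node]
    while frontier:
        for nxt in step(frontier.pop()):
            if nxt not in found:
                found.add(nxt)
                frontier.append(nxt)
    return found

print("pa(T)   =", parents("T"))
print("ch(T)   =", children("T"))
print("anc(T)  =", reach("T", parents))   # follow arrows backward
print("desc(T) =", reach("T", children))  # follow arrows forward
```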

<br/><br/>

## DAGs for Causal Inference 

DAGs as models are mathematical objects, part of a larger class of graphical models such as Bayesian networks or Markov networks.​ DAGs graphically represent *non-parametric structural equation models*.​

- DAGs **advantages** for causal models:​

    - All pictures, no algebra​

    - Focus on causal links

    - Easy for deriving nonparametric analysis


- DAGs **disadvantages** for causal models:

    - Don’t display the parametric assumptions that are often necessary for estimation in practice.​

    - Generality can obscure important distinctions between estimands.​

## Notes on Causal DAGs

- Causal DAGs encode the qualitative causal assumptions of the data-generating model. It is the *model of how the world works*. 

<br/>

- Causal DAGs are **non-parametric**, i.e. they make no assumption about
  - The distribution of the variables (nodes) in the DAG
  - The functional form of the direct causal effects (arcs)

<br/>

- When we build a causal model (= drawing a DAG), we must consider all factors/variables that play a role in data generation, regardless of whether they are observed or unobserved.

  - Causal assumptions are encoded by the *direction and absence of arrows*. 
  - *Directed arrows* or arcs represent possible direct causal effects.
  - *Absence of arrows* or *missing arcs* represent sharp nulls of no-effect.

 <br/>
  
- Causal DAGs are **acyclic** because:

    - One cannot trace a sequence of arcs in the direction of the arrows and arrive where one started.
    - We impose acyclicity since a variable can’t cause itself.
    - The future cannot directly or indirectly cause the past.

In [None]:
g = gr.Digraph( graph_attr={'label': "The arrow from A to B means that A may affect B, but not the other way around. \n " +
                                    "The absence of an arrow from D to B means that D does not affect B \n" + 
                                    "The presence of C means that A and B may or may not have common causes."}, 
            edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.edge("C", "A")
g.edge("C", "B")
g.edge("A", "B")
g.node("D", "D")
g
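Acyclicity can be checked mechanically: a directed graph is acyclic exactly when its nodes can be put in a topological order. The sketch below uses the standard-library `graphlib` module (Python 3.9+) on the four-node graph above. The second graph, with an extra arrow from B to C, is a hypothetical counterexample added only for illustration.

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(parents_of):
    """Return True if the directed graph has no cycle.

    parents_of maps each node to the set of its parents.
    """
    try:
        # static_order() is lazy, so list() forces the sort to run.
        list(TopologicalSorter(parents_of).static_order())
        return True
    except CycleError:
        return False

# The graph drawn above: C -> A, C -> B, A -> B, and an isolated node D.
dag = {"A": {"C"}, "B": {"C", "A"}, "C": set(), "D": set()}

# Hypothetical variant with an added arrow B -> C, creating C -> A -> B -> C.
cyclic = {"A": {"C"}, "B": {"C", "A"}, "C": {"B"}, "D": set()}

print(is_acyclic(dag), is_acyclic(cyclic))
```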

### DAG Major Structure 1: Collider (V-Structure)

- A collider (also known as v-structure or head-to-head meeting) has two incoming arrows along a chosen path.

- A collider occurs when two arrows collide on a single variable. In this case, the two parent variables $A$ and $B$ share a common effect $C$, and they are marginally independent:

  $A \!\perp\!\!\!\perp B$   

In [None]:
g = gr.Digraph(graph_attr={'label': "A and B are independent in general"}, edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.edge("A", "C")
g.edge("B", "C")
g

- As a general rule, **conditioning (adjusting or controlling)** on a collider opens the (non-causal) path between its parents. Not conditioning on it leaves the path closed.
- This phenomenon is sometimes called **explaining away**, because one cause already explains the effect, making the other cause less likely.

  $A \not\!\perp\!\!\!\perp B | C$ 
  
  
- In other words, **conditioning on** or **observing** a collider can lead to spurious associations between the variables that connect it. This phenomenon is known as **collider bias** or **Berkson's paradox**. It occurs because conditioning on a collider variable can induce a correlation between the variables that influence it, even if they are not causally related.

In [None]:
g = gr.Digraph(graph_attr={'label': "A and B are not independent anymore if we condition on C (collider)"}, edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("C", color="red")
g.edge("A", "C")
g.edge("B", "C")
g

### Collider Example: A 6 Nodes Graph


In [None]:
g = gr.Digraph(graph_attr={'rankdir':'LR'}, 
               edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("X", "X", color = "blue")
g.edge("U1", "Y")
g.edge("U1", "X")
g.edge("U2", "X")
g.edge("U2", "Y")
g.edge("X", "T")
g.edge("T", "C")
g.edge("C", "Y")
g.edge("X", "C")
g.edge("X", "Y")
g

In the above graph, we have: 

- $X$ is a collider on the path $U_1 \rightarrow X \leftarrow U_2$

- $X$ is not a collider on the path $U_1 \rightarrow X \rightarrow T$

### Collider Example: School Admission

We can assume there are two ways to be admitted to school: you can either be good at math or be good at arts (this is just an example, don't worry, there are other ways too). 


In [None]:
g = gr.Digraph(edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("A", label="Talented in Arts")
g.node("B", label="Talented in Math")
g.node("C", label="Admitted to School")
g.edge("A", "C")
g.edge("B", "C")
g

If you don't condition on the admission to school (i.e. you don't know if a student has been admitted to school), then being good in arts or maths are independent conditions. 

* In other words, knowing that a student is good at math doesn't tell us anything about how good he is at arts (and vice versa). 

However, if you condition on the admission to school (i.e. you know the outcome of the admission), then being good in arts or maths become dependent. 

* If you know that a student has been admitted to school and he is not talented in arts, then it is more likely that he is talented in math. Conversely, if he is bad at math but he has been admitted to school, then he has to be good in arts.
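The effect of conditioning on admission can be illustrated with a short simulation. This is only a sketch under made-up assumptions: the 30% talent rates and the rule "admitted if talented in at least one subject" are hypothetical and chosen purely for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

students = []
for _ in range(100_000):
    good_arts = random.random() < 0.3   # hypothetical talent rate
    good_math = random.random() < 0.3   # hypothetical rate, drawn independently
    admitted = good_arts or good_math   # hypothetical admission rule (the collider)
    students.append((good_arts, good_math, admitted))

def p_math(rows):
    """Share of students talented in math among the given rows."""
    return sum(m for _, m, _ in rows) / len(rows)

# Without conditioning on admission: arts talent says nothing about math talent.
everyone_no_arts = [s for s in students if not s[0]]
everyone_arts = [s for s in students if s[0]]

# Conditioning on admission: arts talent now changes the chance of math talent.
admitted_no_arts = [s for s in students if s[2] and not s[0]]
admitted_arts = [s for s in students if s[2] and s[0]]

print("all students:  ", p_math(everyone_no_arts), p_math(everyone_arts))
print("admitted only: ", p_math(admitted_no_arts), p_math(admitted_arts))
```

Among all students the two shares are nearly equal, while among admitted students the share of math talent differs sharply depending on arts talent.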

### Collider Example: Fire and Smoke Machine

Intuitively we say that two variables are independent if knowing one variable doesn't provide any information on the other variable, i.e., it doesn't change our belief. 

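Let's consider the following numerical example of a collider (similar to the previous example) where we have a fog machine in a club. Let $F=1$ denote a fire, $M=1$ denote that the fog machine is on, and $S=1$ denote that smoke is observed: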

- $P(F=1) = 0.2$
- $P(M=1) = 0.1$
- $P(S=1|F=0; M=0) = 0.1$
- $P(S=1|F=1; M=0) = 0.8$
- $P(S=1|F=0; M=1) = 0.8$
- $P(S=1|F=1; M=1) = 0.9$

Following is the DAG:

In [None]:
g = gr.Digraph(edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("F", label="Fire")
g.node("M", label="Fog machine")
g.node("S", label="Smoke")
g.edge("F", "S")
g.edge("M", "S")
g

**Analytical Solution:**


The probability of fire is:
$$P(F=1) = 0.2$$


If we observe smoke ($S=1$), then the probability of fire increases to:

$$P(F=1|S=1) = \frac{P(S=1|F=1)P(F=1)}{P(S=1)} $$ 
where 
\begin{align*}
P(S=1) ={} & P(S=1|F=0;M=0)P(F=0)P(M=0) \\
           & + P(S=1|F=1;M=0)P(F=1)P(M=0) \\
           & + P(S=1|F=0;M=1)P(F=0)P(M=1) \\
           & + P(S=1|F=1;M=1)P(F=1)P(M=1)
\end{align*}

and 

\begin{align*}
P(S=1|F=1) ={} & P(S=1|F=1;M=0)P(M=0) \\
               & + P(S=1|F=1;M=1)P(M=1)
\end{align*}

Put the numbers in:

$$P(F=1|S=1) = 0.54$$

However, if we also observe that the fog machine is on, $M = 1$, then the probability of fire decreases to 

$$P(F=1|S=1;M=1) = 0.22$$

Here, $F$ and $M$ are not conditionally independent given $S$. When the probability of one explanation increases, the alternative explanations become less probable (they are explained away).
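The two posterior probabilities can be reproduced with a few lines of plain Python. This is a minimal sketch that only applies Bayes' rule to the numbers given above. For the second quantity it uses the marginal independence of $F$ and $M$, so that $P(F=1|S=1;M=1) = P(S=1|F=1;M=1)P(F=1) / \sum_f P(S=1|F=f;M=1)P(F=f)$.

```python
# Probabilities from the example above.
p_f = {1: 0.2, 0: 0.8}   # P(F=f)
p_m = {1: 0.1, 0: 0.9}   # P(M=m)
p_s1 = {(0, 0): 0.1, (1, 0): 0.8, (0, 1): 0.8, (1, 1): 0.9}  # P(S=1 | F=f, M=m)

# P(S=1): sum over all combinations of fire and fog machine.
p_s = sum(p_s1[f, m] * p_f[f] * p_m[m] for f in (0, 1) for m in (0, 1))

# P(S=1 | F=1): sum over the fog machine only.
p_s_given_f1 = sum(p_s1[1, m] * p_m[m] for m in (0, 1))

# Bayes' rule: P(F=1 | S=1).
p_f1_given_s1 = p_s_given_f1 * p_f[1] / p_s

# P(F=1 | S=1, M=1): F and M are marginally independent, so P(F | M) = P(F).
p_f1_given_s1_m1 = p_s1[1, 1] * p_f[1] / sum(p_s1[f, 1] * p_f[f] for f in (0, 1))

print(round(p_s, 3), round(p_f1_given_s1, 2), round(p_f1_given_s1_m1, 2))
```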




**Python Solution:**

To solve this problem or similar problems, we can also use the **bnlearn** library in Python, [here](https://pypi.org/project/bnlearn/). 
The original library was first developed for R by Marco Scutari. The package comes with a lot of examples. [Here](https://www.bnlearn.com) is the link.


In [None]:
import bnlearn as bn
from pgmpy.factors.discrete import TabularCPD

# Define the network structure
edges = [('Fire', 'Smoke'),
         ('Machine', 'Smoke')]

# Make the DAG
DAG = bn.make_DAG(edges)

# Define the conditional probability distributions (CPDs) of each node
cpt_fire = TabularCPD(variable='Fire', variable_card=2, values=[[0.8], [0.2]])
cpt_machine = TabularCPD(variable='Machine', variable_card=2, values=[[0.9], [0.1]])


cpt_smoke = TabularCPD(variable='Smoke', variable_card=2,
                           values=[[0.9, 0.2, 0.2, 0.1],
                                   [0.1, 0.8, 0.8, 0.9]],
                           evidence=['Fire', 'Machine'],
                           evidence_card=[2, 2])


DAG = bn.make_DAG(DAG, CPD=[cpt_fire, cpt_machine, cpt_smoke])

bn.print_CPD(DAG)

Calculate $P(F=1|S=1)$ :

In [None]:
q1 = bn.inference.fit(DAG, variables=['Fire'], evidence={'Smoke':1} )

Calculate $P(F=1|S=1;M=1)$:

In [None]:
q2 = bn.inference.fit(DAG, variables=['Fire'], evidence={'Smoke':1, 'Machine':1} )

### DAG Major Structure 2: Fork (Confounder)

- A fork (Confounder or common cause) is a node $C$ in a graph that has outgoing edges to two (or more) other variables $A$ and $B$.
- Fork $C$ causes confounding. Confounding means that $A$ and $B$ have a common cause (direct or indirect). 
- $A$ and $B$ are often referred to as "children" or "descendants" of the common cause.
- Fork $C$ is a common cause of $A$ and $B$. 

    $A \not\!\perp\!\!\!\perp B$ 



In [None]:
g = gr.Digraph(graph_attr={'label': "A and B are not independent in general"}, edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.edge("C", "A")
g.edge("C", "B")
g

- In fork structure, the dependence flows backward through the arrows and we have what is called a **backdoor path**. 

- We can close the backdoor path and shut down dependence by conditioning on the common cause.

    $A \!\perp\!\!\!\perp B | C$ 
    
    
- Forks represent situations where two variables appear to be associated or correlated, but the association is not due to a direct causal relationship between them. Instead, the association is induced by the shared influence of the common cause.

- Failure to address a fork confounder can lead to biased estimates of the causal effect between the treatment and outcome. Neglecting the confounder can result in mistakenly attributing the observed association to a direct causal effect, when it is actually due, at least in part, to the unmeasured confounder.

In [66]:
g = gr.Digraph(graph_attr={'label': "A and B are independent if we condition on C (fork)"}, edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("C", color="red")
g.edge("C", "A")
g.edge("C", "B")
g


### Fork Example: Ice Cream and Hot Summers

We can consider high temperature to influence both ice-cream consumption and the number of sunburns in a city. 


In [None]:
g = gr.Digraph(edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("A", label="Ice-cream consumption")
g.node("B", label="Number of sunburns")
g.node("C", label="Hot temperature")
g.edge("C", "A")
g.edge("C", "B")
g

If we don't know whether it is summer (hot) or winter (cold), ice-cream consumption and the number of sunburns are dependent (i.e., if we plot one against the other, we see a correlation). 

On the other hand, if we condition on the temperature (i.e. we know the temperature and whether it is summer or winter), ice-cream consumption and the number of sunburns become independent.

**Python Solution:**

In [None]:
edges = [('Hot', 'Ice-cream'),
         ('Hot', 'Solarburn')]

# Make the actual Bayesian DAG
DAG = bn.make_DAG(edges)

# Define the conditional probability distributions (CPDs) of each node
cpt_icecream= TabularCPD(variable='Ice-cream', variable_card=2, values=[[0.7, 0.4],[0.3, 0.6]], evidence = ['Hot'], evidence_card=[2])
cpt_solarburn = TabularCPD(variable='Solarburn', variable_card=2, values=[[0.9, 0.4],[0.1, 0.6]], evidence = ['Hot'], evidence_card=[2])

cpt_hot = TabularCPD(variable='Hot', variable_card=2, values=[[0.5], [0.5]])

DAG = bn.make_DAG(DAG, CPD=[cpt_icecream, cpt_solarburn, cpt_hot])
bn.print_CPD(DAG)

Calculate $P(IceCream | Hot)$:

In [None]:
q1 = bn.inference.fit(DAG, variables=['Ice-cream'], evidence={'Hot':1} )

Calculate $P(IceCream | Solarburn, Hot)$:

In [None]:
q2 = bn.inference.fit(DAG, variables=['Ice-cream'], evidence={'Solarburn':1, 'Hot':1} )

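Once we know that the temperature is hot, ice-cream consumption and sunburn are independent, so additionally observing a sunburn does not change the probability of ice-cream consumption: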

 $$P(IceCream |Hot) = P(IceCream | Solarburn, Hot)$$
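The same conclusion can be checked by hand from the conditional probability tables defined in the cell above. The sketch below, in plain Python, contrasts the marginal case, where observing a sunburn changes the probability of ice-cream consumption, with the case where the temperature is known.

```python
# Values copied from the CPDs defined above (state 1 = "yes").
p_hot = {0: 0.5, 1: 0.5}
p_ice1 = {0: 0.3, 1: 0.6}   # P(Ice-cream=1 | Hot=h)
p_sun1 = {0: 0.1, 1: 0.6}   # P(Solarburn=1 | Hot=h)

# Without conditioning on Hot: marginalize the common cause.
p_ice = sum(p_ice1[h] * p_hot[h] for h in (0, 1))
p_sun = sum(p_sun1[h] * p_hot[h] for h in (0, 1))
p_ice_given_sun = sum(p_ice1[h] * p_sun1[h] * p_hot[h] for h in (0, 1)) / p_sun

# Conditioning on Hot=1: P(Ice, Sun, Hot) = P(Hot) P(Ice | Hot) P(Sun | Hot).
p_ice_sun_hot1 = p_hot[1] * p_ice1[1] * p_sun1[1]
p_sun_hot1 = p_hot[1] * p_sun1[1]
p_ice_given_sun_hot1 = p_ice_sun_hot1 / p_sun_hot1

print("P(Ice=1)               =", round(p_ice, 3))
print("P(Ice=1 | Sun=1)       =", round(p_ice_given_sun, 3))
print("P(Ice=1 | Hot=1)       =", p_ice1[1])
print("P(Ice=1 | Sun=1,Hot=1) =", round(p_ice_given_sun_hot1, 3))
```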


### DAG Major Structure 3: Mediator (Chain)

- A node $C$ is a mediator if it lies on a directed path from $A$ to $B$.

- It helps explain the pathways of causation and understand the specific variables or processes that operate between the cause and effect.

- As a general rule, the dependence that flows along the directed path $A \rightarrow C \rightarrow B$ is blocked when we condition on the intermediary variable $C$.

  - $A \not\!\perp\!\!\!\perp B$   

  - $A \!\perp\!\!\!\perp B | C$ 


In [69]:
g = gr.Digraph(graph_attr={'rankdir':'LR', 'label': "A and B are not independent in general"}, edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.edge("A", "C")
g.edge("C", "B")
g


In [71]:
g = gr.Digraph(graph_attr={'rankdir':'LR', 'label': "A and B are independent if we condition on C (mediator)"}, edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("C", color="red")
g.edge("A", "C")
g.edge("C", "B")
g


### Chain Example: Throwing Dice

Let's assume we are throwing dice. If we don't know (i.e. we don't condition on) the sum of the first $n$ throws, then knowing the sum of the first $n-1$ throws helps us better estimate the sum of the first $n+1$ throws.

In [None]:
g = gr.Digraph(graph_attr={'rankdir':'LR'}, edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("A", label="Sum in n-1 throws")
g.node("B", label="Sum in n throws")
g.node("C", label="Sum in n+1 throws")
g.edge("A", "B")
g.edge("B", "C")
g

However, if we condition on (i.e. we observe) the sum of the first $n$ throws, then knowing the sum of the first $n-1$ throws doesn't provide any extra information for estimating the sum of the first $n+1$ throws. 
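A quick simulation makes the chain behaviour visible. This is a sketch only: the choice $n = 6$, the sample size and the helper `pearson` are illustrative assumptions. Once the sum after $n$ throws is known, the only unknown part of the sum after $n+1$ throws is the last throw, so the code checks whether that remainder is still related to the sum after $n-1$ throws.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)   # fixed seed so the sketch is reproducible
n_throws = 6     # hypothetical choice of n

a, b, c = [], [], []   # sums after n-1, n and n+1 throws
for _ in range(50_000):
    throws = [random.randint(1, 6) for _ in range(n_throws + 1)]
    a.append(sum(throws[:n_throws - 1]))
    b.append(sum(throws[:n_throws]))
    c.append(sum(throws))

# Not conditioning on the middle node: A and C are strongly related.
corr_ac = pearson(a, c)

# Given B, what is left of C is the last throw (C - B), which ignores A.
last_throw = [ci - bi for ci, bi in zip(c, b)]
corr_a_last = pearson(a, last_throw)

print(round(corr_ac, 3), round(corr_a_last, 3))
```

The first correlation is large, while the second is close to zero up to sampling noise.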


### DAG Major Structure 4: Causal Paths

A causal path describes the flow of influence or the causal mechanism from an initial variable ("cause" or "exposure") to a final variable ("effect" or "outcome") through a series of intermediate variables.

The causal paths from $X$ to $C$ mediate the causal effect of $X$ on $C$; the non-causal paths do not.

For example, in the graph below, the causal paths between $X$ and $C$ are highlighted in red.

In [None]:
g = gr.Digraph(graph_attr={'rankdir':'LR'}, 
               edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.node("X", "X", color = "blue")
g.node("C", "C", color = "blue")
g.edge("U1", "Y")
g.edge("U1", "X")
g.edge("U2", "X")
g.edge("U2", "Y")
g.edge("X", "T", color="red")
g.edge("T", "C", color="red")
g.edge("C", "Y")
g.edge("X", "C", color="red")
g.edge("X", "Y")
g

### Path Blocking Rules

Path blocking in Directed Acyclic Graphs (DAGs) refers to the process of identifying variables or conditions that can close or block a specific path between two variables. It involves determining which variables need to be controlled for or conditioned on to prevent spurious associations or biases when estimating causal effects.

An **active path** is *open*: it allows association (statistical dependence) to flow between the two variables. An **inactive path** is *blocked* or closed: no association flows along it.

Paths are either open or blocked, according to two rules:

- **Rule 1:** A path is blocked if somewhere along the path there is a variable $C$ that sits in a *chain* or in a *fork* and we have conditioned on $C$. 

- **Rule 2:** A path is blocked if somewhere along the path there is a variable $C$ that sits in a *collider* and we have conditioned on neither $C$ nor any of its descendants.

### Path Blocking Example: a 4 Nodes Graph

In [None]:
g = gr.Digraph(edge_attr={'arrowhead':'vee', 'arrowsize':'1'})
g.edge("V", "A")
g.edge("V", "W")
g.edge("Y", "W")
g

From the causal graph above we notice that:
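- Conditioning on **V** blocks the path from **A** to **W** (rule 1). 
- Conditioning on **W** opens the path from **A** to **Y** (rule 2): **W** is a collider on $A \leftarrow V \rightarrow W \leftarrow Y$, so this path is blocked when nothing is conditioned on. 
- Conditioning on both **V** and **W** blocks the path from **A** to **Y**, because **V** sits in a fork on that path (rule 1).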
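The two rules translate almost line by line into code. The sketch below is a minimal illustration in plain Python, not a library API: `descendants` and `is_blocked` are hypothetical helper names, and a path is passed in as an ordered list of nodes.

```python
# Directed edges of the four-node graph above.
edges = [("V", "A"), ("V", "W"), ("Y", "W")]

def descendants(node):
    """All nodes reachable from `node` by following arrows forward."""
    found, frontier = set(), [node]
    while frontier:
        cur = frontier.pop()
        for a, b in edges:
            if a == cur and b not in found:
                found.add(b)
                frontier.append(b)
    return found

def is_blocked(path, given):
    """Apply Rule 1 and Rule 2 to every interior node of the path."""
    given = set(given)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edges and (nxt, mid) in edges
        if collider:
            # Rule 2: blocked unless the collider or a descendant is conditioned on.
            if not ({mid} | descendants(mid)) & given:
                return True
        elif mid in given:
            # Rule 1: a conditioned chain or fork node blocks the path.
            return True
    return False

path_aw = ["A", "V", "W"]
path_ay = ["A", "V", "W", "Y"]
print(is_blocked(path_aw, {"V"}))        # conditioning on the fork node V
print(is_blocked(path_ay, set()))        # nothing conditioned: collider W
print(is_blocked(path_ay, {"W"}))        # conditioning on the collider W
print(is_blocked(path_ay, {"V", "W"}))   # conditioning on both V and W
```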

### Path Blocking and D-Separation 

D-separation, short for *directed separation*, is a criterion used to determine whether a specific set of variables blocks or renders inactive a path between two variables in a DAG. 

D-separation relies on a set of rules, often referred to as path blocking rules or **d-separation rules**, to determine when a path is blocked. These rules identify the necessary variables to condition on to close a specific path and prevent spurious associations or biases in estimating causal effects.


**d-separation in Sets:** Sets of variables $A$ and $B$ are d-separated (or blocked) by $C$ if all paths between $A$ and $B$ are blocked by $C$. d-separation implies: 

$A \perp \!\!\! \perp B | C$.

<br/>

**d-separation in Paths:** D-separation determines which paths transmit association, and which ones don’t.
Formally, a path **P** is said to be d-separated (or blocked) by a conditioning set of nodes $\{Z\}$ if:

1. **P** contains a chain $A \rightarrow M \rightarrow B$ or a fork $A \leftarrow M \rightarrow B$ such that the middle node $M$ is in $\{Z\}$, or
2. **P** contains a collider $A \rightarrow M \leftarrow B$ such that neither the middle node $M$, nor any descendant of $M$, is in $\{Z\}$.

<br/>

**d-connected in Paths:** A path **P** is said to be d-connected (or unblocked or open) given a conditioning set of nodes $\{Z\}$ if it is not d-separated by $\{Z\}$.


In other words:

- *Blocked* (d-separated) paths don’t transmit association (information). 

- *Unblocked* (d-connected) paths may transmit association (information).


<br/><br/>
The blocking criteria above can be rephrased as three rules: 

- Conditioning on a non-collider (the middle node of a chain or a fork) blocks a path, 

- Conditioning on a collider, or a descendant of a collider, unblocks a path, 

- Not conditioning on a collider leaves a path “naturally” blocked.

<br/>

### Path Blocking and Independence 

- Two variables that are d-separated along all paths given $\{Z\}$ are <font color='red'>conditionally independent given $\{Z\}$.</font>

- Two variables that are *NOT* d-separated along all paths given $\{Z\}$ are <font color='red'>potentially dependent given $\{Z\}$.</font>

### D-Separation Example, a 10 Nodes Graph

We use the following DAG from [Brady Neal's course](https://www.bradyneal.com/causal-inference-course) to see the blocked and unblocked paths between $T$ and $Y$ for different conditioning sets.

![img](img/ch3/graph_Dsep_example_case0.png)

<br/><br/>
**Case 1:** In the following graph, is  $T \perp \!\!\! \perp Y | M_1$  valid?

![img](img/ch3/graph_Dsep_example_case1.png)




**NO!**, 
- the chain path $T-M_1-M_2-Y$ is blocked ($M_1$ is in the conditioning set) 
- the fork path $T-W_1-W_2-W_3-Y$ is not blocked.
- the collider path $T-X_1-X_2-X_3-Y$ is blocked.

<br/><br/>
**Case 2:** In the following graph, is  $T \perp \!\!\! \perp Y | M_1, W_2 $  valid?  

![img](img/ch3/graph_Dsep_example_case2.png)


**YES!**, 
- the chain path $T-M_1-M_2-Y$ is blocked ($M_1$ is in the conditioning set) 
- the fork path $T-W_1-W_2-W_3-Y$ is blocked ($W_2$ is in the conditioning set).
- the collider path $T-X_1-X_2-X_3-Y$ is blocked.

<br/><br/>
**Case 3:** In the following graph, is  $T \perp \!\!\! \perp Y | M_1, W_3 $  valid?  

![img](img/ch3/graph_Dsep_example_case3.png)

**YES!**,
- the chain path $T-M_1-M_2-Y$ is blocked ($M_1$ is in the conditioning set). 
- the fork path $T-W_1-W_2-W_3-Y$ is blocked ($W_3$ is in the conditioning set).
- the collider path $T-X_1-X_2-X_3-Y$ is blocked.

<br/><br/>
**Case 4:** In the following graph, is  $T \perp \!\!\! \perp Y | M_1, W_1, W_2, X_2$  valid?  

![img](img/ch3/graph_Dsep_example_case4.png)

**NO!**,
- the chain path $T-M_1-M_2-Y$ is blocked ($M_1$ is in the conditioning set). 
- the fork path $T-W_1-W_2-W_3-Y$ is blocked ($W_1, W_2$ are in the conditioning set).
- the collider path $T-X_1-X_2-X_3-Y$ is NOT blocked ($X_2$ is in the conditioning set)

<br/><br/>
**Case 5:** In the following graph, is  $T \perp \!\!\! \perp Y | M_1, W_2, W_3, X_1, X_2$  valid?

![img](img/ch3/graph_Dsep_example_case5.png)

**YES!**,
- the chain path $T-M_1-M_2-Y$ is blocked ($M_1$ is in the conditioning set). 
- the fork path $T-W_1-W_2-W_3-Y$ is blocked ($W_2, W_3$ are in the conditioning set).
- the collider path $T-X_1-X_2-X_3-Y$ is blocked ($X_1$ is in the conditioning set)
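The five answers can be cross-checked programmatically. The sketch below deliberately uses a different but equivalent technique from the path-by-path rules: the *moralized ancestral graph* criterion. It keeps only the nodes of interest and their ancestors, connects ("marries") parents that share a child, drops the arrow directions, removes the conditioning nodes, and tests whether $T$ can still reach $Y$. The edge list is an assumption read off the paths named in the cases above, and `d_separated` is an illustrative helper, not a library function.

```python
# Edges assumed from the chain, fork and collider paths described above.
edges = [("T", "M1"), ("M1", "M2"), ("M2", "Y"),
         ("T", "X1"), ("X1", "X2"), ("Y", "X3"), ("X3", "X2"),
         ("W1", "T"), ("W2", "W1"), ("W2", "W3"), ("W3", "Y")]

def parents(node):
    return {a for a, b in edges if b == node}

def d_separated(x, y, given):
    """d-separation of x and y given a set, via the moralized ancestral graph."""
    given = set(given)
    # 1. Keep x, y, the conditioning set, and all of their ancestors.
    keep, frontier = set(), [x, y, *given]
    while frontier:
        node = frontier.pop()
        if node not in keep:
            keep.add(node)
            frontier.extend(parents(node))
    # 2. Moralize: link each child to its parents and co-parents to each other.
    links = {n: set() for n in keep}
    for child in keep:
        pa = parents(child)
        for p in pa:
            links[p].add(child)
            links[child].add(p)
            links[p] |= pa - {p}
    # 3. Remove the conditioning nodes and test whether y is reachable from x.
    seen, frontier = set(), [x]
    while frontier:
        node = frontier.pop()
        if node in seen or node in given:
            continue
        seen.add(node)
        frontier.extend(links[node])
    return y not in seen

cases = [{"M1"}, {"M1", "W2"}, {"M1", "W3"},
         {"M1", "W1", "W2", "X2"}, {"M1", "W2", "W3", "X1", "X2"}]
results = [d_separated("T", "Y", z) for z in cases]
for i, (z, r) in enumerate(zip(cases, results), start=1):
    print(f"Case {i}: given {sorted(z)} -> d-separated: {r}")
```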

### Automatic d-separation test with Python

Testing the d-separation manually is not feasible for large graphs.
We can use the [networkx](https://networkx.org) Python library instead to test for d-separation, and hence for the conditional independences implied by the graph.

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
G.add_edges_from(
    [
        ("T", "M1"),
        ("M1", "M2"),
        ("M2", "Y"),
        #
        ("T", "X1"),
        ("X1", "X2"),
        ("Y", "X3"),
        ("X3", "X2"),
        #
        ("W1", "T"),
        ("W2", "W1"),
        ("W2", "W3"),
        ("W3", "Y"),
    ]
)

# Use an alternative layout such as spring_layout
pos = nx.spring_layout(G)

# Drawing options
options = {
    'node_color': 'gray',
    'node_size': 500,
    'width': 1,
    'arrowstyle': '-|>',
    'arrowsize': 12,
}

# Draw the graph
nx.draw(G, pos, with_labels=True, **options)
plt.show()


Is  $T \perp \!\!\! \perp Y | M_1$  valid?

In [None]:
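# Are T and Y d-separated given M1?
# Note: newer NetworkX releases (3.3 and later) provide this test as
# nx.is_d_separator and deprecate nx.d_separated.
nx.d_separated(G, {"T"}, {"Y"}, {"M1"})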

Is  $T \perp \!\!\! \perp Y | M_1, W_2 $  valid? 

In [None]:
# Are T and Y d-separated given M1 and W2?
nx.d_separated(G, {"T"}, {"Y"}, {"M1", "W2"})

## Causal Assumptions

Causal DAGs require additional assumptions to make meaningful inferences about causal structure. These assumptions narrow down the possible solutions and provide a framework for drawing causal conclusions based on observed data. The goal is not to magically discover causal relationships but to understand what causality can be learned given the causal assumptions.

There are four common assumptions made across causal discovery algorithms. 

 1) **Acyclicity**: the causal structure can be represented by a DAG $\mathcal{G}$. We have already seen this.
 
* The acyclicity assumption is also known as the Directed Acyclic Graph (DAG) assumption or the no-feedback assumption.
* It states that the causal relationships among variables can be represented by a directed acyclic graph, where there are no cycles or feedback loops.
* It ensures that the causal relationships are well-defined and can be represented in a graph structure.


 2) **Markov Property**: all nodes are independent of their non-descendants when conditioned on their parents. 
 
* The Markov property assumption is also known as the Markov condition or the local independence assumption.
* It states that a variable is conditionally independent of its non-descendants given its direct causes or parents in a directed acyclic graph (DAG).
* It facilitates the identification of causal effects because it provides a way to isolate the effects of specific variables from the influence of other variables in the system. 


 3) **Faithfulness**: all conditional independences in the true underlying distribution $p$ are represented in $\mathcal{G}$.

* It states that a causal model should be faithful to the observed data, meaning that all the conditional independence relationships present in the data are reflected in the causal model.
* If two variables are statistically independent in the observed data, they should be d-separated in the underlying causal model. Similarly, if two variables are dependent in the observed data, there should be an open (d-connecting) path between them in the causal model, for example a direct effect or a common cause.


 4) **Sufficiency**: any pair of nodes in $\mathcal{G}$ has no common external cause.

* It implies that once we condition on the observed variables, there are no additional unobserved variables that provide further information or influence the relationships between the observed variables. 
* It ensures that the observed associations between variables are not confounded by unobserved factors.
* The Sufficiency assumption is an assumption and not a guarantee. It relies on the notion that all relevant confounding variables are observed and appropriately accounted for in the analysis. 


<br/>

In this lecture, we focus mainly on the Markov assumption. However, there is a comprehensive discussion of these four causal assumptions in [Kalainathan et al., 2018](https://arxiv.org/abs/1803.04929). 
Also check this nice paper, [An introduction to causal inference](https://www.cmu.edu/dietrich/philosophy/docs/scheines/introtocausalinference.pdf), by Richard Scheines, CMU.



## Markov Property
### Definition of Markov Property

The Markov Property assumption states that, when the causal relationships among variables are represented by a Directed Acyclic Graph (DAG), each variable is independent of its non-descendants given its parents. This assumption helps identify the minimal set of variables needed to estimate causal effects. 


- When a distribution $p$ is Markovian with respect to a graph $\mathcal{G}$, this graph encodes certain independence in the distribution.

- We will see how Markov property links causal DAGs to conditional probabilities (from data).

We use two basic definitions before further explaining Markov property:

<br/>

**Chain Rule:** We know from the definition of conditional probability that: 

$$P(X_1,X_2) = P(X_1|X_2) \cdot P(X_2) = P(X_2|X_1) \cdot P(X_1)$$

This can be generalized to multiple events (random variables): 

$$
\mathrm{P}\left(X_n, \ldots, X_1\right) = \mathrm{P}\left(X_n \mid X_{n-1}, \ldots, X_1\right) \cdot \mathrm{P}\left(X_{n-1}, \ldots, X_1\right)
$$

Repeating this process with each final term creates the product form:

$$
\mathrm{P}\left(X_n, \ldots, X_1\right) = \mathrm{P}\left(\bigcap_{k=1}^n X_k\right) = \prod_{k=1}^n \mathrm{P}\left(X_k \mid \bigcap_{j=1}^{k-1} X_j\right)
$$
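
As a quick sanity check, the chain rule can be verified numerically. The sketch below assumes three binary variables and an arbitrary, made-up joint table (the weights are illustrative only); it multiplies the conditionals $P(x_k \mid x_1, \ldots, x_{k-1})$ and compares the product with the joint probability.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
weights = [3, 1, 2, 2, 1, 4, 2, 5]
total = sum(weights)
joint = {xs: w / total for xs, w in zip(product([0, 1], repeat=3), weights)}

def marginal(k):
    """Return P(X_1, ..., X_k) by summing out the remaining variables."""
    out = {}
    for xs, p in joint.items():
        out[xs[:k]] = out.get(xs[:k], 0.0) + p
    return out

def chain_rule_product(xs):
    """Multiply P(x_k | x_1, ..., x_{k-1}) for k = 1..n."""
    result = 1.0
    for k in range(1, len(xs) + 1):
        numerator = marginal(k)[xs[:k]]
        denominator = marginal(k - 1)[xs[:k - 1]]  # marginal(0) is {(): 1.0}
        result *= numerator / denominator
    return result

chain_rule_holds = all(
    abs(chain_rule_product(xs) - p) < 1e-12 for xs, p in joint.items()
)
print(chain_rule_holds)
```

The chain rule holds for any joint distribution and any ordering of the variables; no graph is needed at this point.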

<br/>

**Conditional Independence:** Now let $X,Y,Z$ be three random variables.

- $X$ and $Y$ *are (marginally) independent* if: 

$$X \perp Y \Leftrightarrow  P(X,Y) = P(X) \cdot P(Y)$$

<br/>

- $X$ and $Y$ *are conditionally independent* if: 

$$X \perp Y|Z \Leftrightarrow  P(X, Y| Z) = P(X|Z) \cdot P(Y|Z)$$ 

<br/>

- For a DAG $\mathcal{G}$, the conditional independence of each variable from its other predecessors given its parents is mathematically equivalent to the statement that: *the joint distribution of the variables $X = \{ X_1 , X_2 , ..., X_n \}$ in $\mathcal{G}$ can be factorized using the Markov factorization (or Bayesian Network Factorization) given the parents $pa(X_i)$ of each variable*:

$$P(X) = \prod_{i=1}^n P(X_i|pa(X_i))$$

- In other words:
    - Conditional on its parents, a variable $X_i$ is independent of its remaining predecessors (conditional independence). 
    - The parents $pa(X_i)$ are the direct inputs of the mechanism that generates the values of $X_i$ (data).  
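
As a small numerical sketch of conditional independence and the Markov factorization, the code below builds the joint distribution of a chain $X \rightarrow Z \rightarrow Y$ from conditional probability tables. The tables and variable names are assumptions chosen for illustration. It then checks whether $X \perp Y | Z$ and whether $X \perp Y$.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain X -> Z -> Y.
p_x = {0: 0.7, 1: 0.3}
p_z_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_z_given_x[x][z]
p_y_given_z = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # p_y_given_z[z][y]

# Markov factorization: P(x, z, y) = P(x) P(z | x) P(y | z).
joint = {
    (x, z, y): p_x[x] * p_z_given_x[x][z] * p_y_given_z[z][y]
    for x, z, y in product([0, 1], repeat=3)
}

def prob(**fixed):
    """Sum the joint over all entries that match the fixed values."""
    names = ("x", "z", "y")
    return sum(
        p for key, p in joint.items()
        if all(key[names.index(n)] == v for n, v in fixed.items())
    )

# Conditional independence: P(x, y | z) == P(x | z) P(y | z) for every value.
cond_indep = all(
    abs(prob(x=x, y=y, z=z) / prob(z=z)
        - (prob(x=x, z=z) / prob(z=z)) * (prob(y=y, z=z) / prob(z=z))) < 1e-12
    for x, y, z in product([0, 1], repeat=3)
)

# Marginal independence: P(x, y) == P(x) P(y) for every value.
marg_indep = all(
    abs(prob(x=x, y=y) - prob(x=x) * prob(y=y)) < 1e-12
    for x, y in product([0, 1], repeat=2)
)

print(cond_indep, marg_indep)  # True False
```

With these tables, $X$ and $Y$ are dependent marginally but become independent once we condition on $Z$, which is what the factorization implies for a chain.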

<br/><br/>

<div class="alert alert-block alert-info">

**Definition 3.1 (Markov Property):** Given a DAG $\mathcal{G}$ and a joint distribution $P_X$, this distribution is said to satisfy:

<br/>

(a) the **global Markov property** with respect to the DAG $\mathcal{G}$ if:

$$\mathbf{A} \!\perp\!\!\!\perp_{\mathcal{G}} \mathbf{B}|\mathbf{C} \Rightarrow \mathbf{A} \!\perp\!\!\!\perp \mathbf{B}|\mathbf{C}$$


for all disjoint vertex sets $\mathbf{A}, \mathbf{B}, \mathbf{C}$, where the symbol $\!\perp\!\!\!\perp_{\mathcal{G}}$ denotes d-separation.

<br/>

(b) the **local Markov property** with respect to the DAG $\mathcal{G}$ if each variable is independent of its non-descendants given its parents, and

<br/>

(c) the **Markov factorization property** with respect to the DAG $\mathcal{G}$ if

$$
p(\mathbf{x})=p\left(x_{1}, \ldots, x_{n}\right)=\prod_{i=1}^{n} p\left(x_{i} \mid \mathbf{p} \mathbf{a}_{i}^{\mathcal{G}}\right)
$$

For this last property, we have to assume that $P_X$ has a density $p$; the factors in the product are referred to as causal Markov kernels describing the conditional distributions $P_{X_{i} \mid \mathbf{PA}_{i}^{\mathcal{G}}}$


</div>

It's worth noting that: 

* The Markov Property assumption relies on the acyclicity assumption (no cycles or feedback loops in the graph) and assumes that all relevant variables are included in the analysis. 
* The Markov Property is an assumption and may not hold in all real-world scenarios. 
* Violation of the Markov Property can result in biased causal estimates. Careful consideration of the causal structure and potential confounding factors is necessary for valid causal inference.


### Truncated Factorization

Markov factorization can be extended to describe the probability distribution generated by an *intervention* $do(x)$. 

For example, the distribution after the intervention $do( X_1 = x_1 )$ is given by a Truncated Factorization, in which the factor of the intervened variable is dropped:

$$P(X_2, \dots, X_n |do(X_1 = x_1)) = \prod_{i \neq 1} P(X_i | pa(X_i))$$

**Truncated Factorization** suggested by (Pearl, 1993) is also known as the **G-formula** (Robins, 1986), or **manipulation theorem** (Spirtes, 2000), or **intervention formula** (Lauritzen, 2002).

With $X = \{ X_1, X_2 , X_3 , ..., X_n \}$, the causal effect of $X_1$ on $X_2$ can now be derived by marginalizing (summing) the truncated factorization over $X' = \{ X_3 , ..., X_n \}$:

$$P(X_2|do(X_1=x_1)) = \sum_{x'} P(X_2, X'|do(X_1=x_1))$$
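
To make the difference between intervening and conditioning concrete, here is a minimal sketch for a three-node graph $Z \rightarrow X$, $Z \rightarrow Y$, $X \rightarrow Y$ with binary variables. The graph, the variable names, and all probability values are assumptions made for this illustration. Here $X$ plays the role of the intervened variable $X_1$, $Y$ the role of $X_2$, and $Z$ is the variable that is summed out.

```python
# Hypothetical CPTs for the graph Z -> X, Z -> Y, X -> Y (all binary).
p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_z[z][x]
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.7}  # P(Y=1 | x, z)

def p_y1_do_x(x):
    """Truncated factorization: drop the factor P(x | z), then sum out z."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in (0, 1))

def p_y1_given_x(x):
    """Ordinary conditioning keeps P(x | z), so z is reweighted by P(z | x)."""
    p_x = sum(p_z[z] * p_x_given_z[z][x] for z in (0, 1))
    numerator = sum(
        p_z[z] * p_x_given_z[z][x] * p_y1_given_xz[(x, z)] for z in (0, 1)
    )
    return numerator / p_x

interventional_effect = p_y1_do_x(1) - p_y1_do_x(0)
observational_gap = p_y1_given_x(1) - p_y1_given_x(0)
print(round(interventional_effect, 3), round(observational_gap, 3))  # 0.2 0.44
```

In this made-up example the observational difference is more than twice the interventional one, because conditioning on $X$ also shifts the distribution of the confounder $Z$.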

<br/><br/>

See an example with three nodes for Truncated Factorization in this short [video](https://www.youtube.com/watch?v=_gcmY5ukbWM) made by [Brady Neal](https://www.bradyneal.com/causal-inference-course).


### Local Markov Example: a 4 Nodes Graph

Given its parents in the DAG, a node $X$ is independent of all its other predecessors. For example, let us consider the graph below.

In [None]:
import graphviz as gr

# Draw the 4-node DAG used in the local Markov example
g = gr.Digraph(edge_attr={'arrowhead': 'vee', 'arrowsize': '1'}, graph_attr={'rankdir': 'LR', 'layout': 'circo'})
g.edge("X1", "X2")
g.edge("X1", "X3")
g.edge("X2", "X3")
g.edge("X3", "X4")
g

By the chain rule alone, without using the structure of the DAG above, we have:

$$
P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2|x_1) P(x_3| x_2,x_1) P(x_4 | x_3, x_2, x_1)
$$

What happens with the *Local Markov Assumption*? The only parent of $X_4$ is $X_3$, so the last factor simplifies:

$$
P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2|x_1) P(x_3| x_2,x_1) \underbrace{P(x_4 | x_3, x_2, x_1)}_{P(x_4 | x_3)}
$$
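
One practical consequence is that fewer numbers are needed to describe the distribution. The sketch below counts the free parameters of both factorizations, under the assumption (not stated in the text) that all four variables are binary. The helper `n_free_parameters` is a hypothetical name introduced for this illustration.

```python
# Parents of each node in the 4-node DAG above.
parents = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"], "X4": ["X3"]}
order = ["X1", "X2", "X3", "X4"]  # a topological order of the DAG

def n_free_parameters(conditioning_sets, n_states=2):
    """Count free parameters of a product of conditional probability tables."""
    return sum((n_states - 1) * n_states ** len(cond) for cond in conditioning_sets)

# Chain rule: each node is conditioned on all of its predecessors.
chain_rule_sets = [order[:i] for i in range(len(order))]
# Local Markov assumption: each node is conditioned on its parents only.
markov_sets = [parents[node] for node in order]

chain_rule_params = n_free_parameters(chain_rule_sets)
markov_params = n_free_parameters(markov_sets)
print(chain_rule_params, markov_params)  # 15 9
```

The saving comes entirely from the last factor: $P(x_4 | x_3, x_2, x_1)$ needs 8 parameters, while $P(x_4 | x_3)$ needs only 2.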


### D-separation in Graphs vs. Conditional independence in Distributions

We saw that DAGs offer an efficient (and visually easier) way to factorize the joint probability between random variables.
We present here a summary of the cases we have seen:

- $A$ and $B$ are **marginally dependent**  ($A \not \perp B$)

 $$P(A,B) \neq P(A) \cdot P(B)$$
![img](img/ch3/DAGs_PDFs_marginallyDep.png)

- $A$ and $B$ are **marginally independent**  ($A \perp B$)

 $$P(A,B) = P(A) \cdot P(B)$$
![img](img/ch3/DAGs_PDFs_marginallyIndep.png)


- $A$ and $B$ are **conditionally independent given $C$**  ($A \perp B | C$)

 $$P(A,B|C) = P(A|C) \cdot P(B|C)$$
![img](img/ch3/DAGs_PDFs_conditionallyIndep.png)


- $A$ and $B$ are **conditionally dependent given $C$**  ($A \not \perp B | C$)

 $$P(A,B|C) \neq P(A|C) \cdot P(B|C)$$
![img](img/ch3/DAGs_PDFs_conditionallyDep.png)



### Global Markov Assumption in a Graph

Two (sets of) nodes $X$ and $Y$ are d-separated by a set of nodes $\{Z\}$ if all the paths between (any node in) $X$ and (any node in) $Y$ are blocked by $\{Z\}$. 
d-separation implies independence. 

<div class="alert alert-block alert-info">

**Theorem 3.2. (Global Markov Assumption):** Given that distribution $P$ is Markov with respect to graph $\mathcal{G}$, d-separation in $\mathcal{G}$ implies conditional independence in distribution $P$. This is also called the *Global Markov assumption*.

$X \!\perp\!\!\!\perp_G Y |\{Z\} \Rightarrow X \!\perp\!\!\!\perp_P Y |\{Z\}$

When this implication holds for every d-separation in $\mathcal{G}$, we say that $P$ is Markov with respect to $\mathcal{G}$, or that $P$ satisfies the Markov assumption with respect to $\mathcal{G}$.

</div>
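
d-separation can be checked mechanically. The following is a minimal pure-Python sketch, not a library implementation. It uses one standard technique (restrict the DAG to the ancestors of the involved nodes, moralize, remove the conditioning set, and test connectivity). The function name `d_separated` is hypothetical, and the example reuses the 4-node graph from the local Markov example.

```python
from itertools import combinations

def d_separated(edges, xs, ys, zs):
    """Return True if node sets xs and ys are d-separated by zs in the DAG."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
        parents.setdefault(u, set())

    # 1) Keep only xs, ys, zs and their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2) Moralize: link nodes to their parents, "marry" co-parents, drop directions.
    neighbors = {node: set() for node in keep}
    for child in keep:
        pa = parents.get(child, set())
        for p in pa:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for p, q in combinations(pa, 2):
            neighbors[p].add(q)
            neighbors[q].add(p)

    # 3) Remove zs, then 4) search for any remaining path from xs to ys.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in ys:
            return False
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in neighbors[node] if n not in zs and n not in seen)
    return True

edges = [("X1", "X2"), ("X1", "X3"), ("X2", "X3"), ("X3", "X4")]
print(d_separated(edges, {"X1"}, {"X4"}, {"X3"}))  # True: X3 blocks every path
print(d_separated(edges, {"X1"}, {"X4"}, set()))   # False: the path through X3 is open
```

If $P$ is Markov with respect to this graph, the first result lets us conclude $X_1 \perp X_4 | X_3$ in $P$; the second result by itself does not tell us anything about dependence.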

### Markov Equivalence

Markov equivalence is a concept that refers to a set of DAGs that encode the same set of conditional independence relationships among variables. In other words, Markov equivalent DAGs exhibit the same pattern of statistical dependencies among variables, despite potentially having different graphical structures.

We formalize here this concept in two steps:

First, we introduce two important graph qualities that we can use to distinguish equivalent graphs: 

- **Skeleton:** an undirected graph obtained by removing directions 

- **Immorality (v-structure or collider):** a collider structure A → C ← B, such that there is no direct edge between A and B 


<br/><br/>

<div class="alert alert-block alert-info">

**Definition 3.3. (Markov Equivalence of Graphs):** We denote by $\mathcal{M}(\mathcal{G})$ the set of distributions that are Markovian with respect to $\mathcal{G}$ :

$\mathcal{M}(\mathcal{G})$ = {$P$ : $P$ satisfies the global (or local) Markov property with respect to $\mathcal{G}$}.

Two DAGs $\mathcal{G_1}$ and $\mathcal{G_2}$ are **Markov equivalent** if $\mathcal{M}(\mathcal{G_1})$ = $\mathcal{M}(\mathcal{G_2})$. 

* This is the case if and only if $\mathcal{G_1}$ and $\mathcal{G_2}$ satisfy the same set of d-separations, which means the Markov condition entails the same set of (conditional) independence conditions.
* The set of all DAGs that are Markov equivalent to a DAG $\mathcal{G}$ is called the **Markov equivalence class of $\mathcal{G}$**. 

</div>

<br/><br/>

<div class="alert alert-block alert-info">

**Lemma 3.4. (Graphical criteria for Markov equivalence):** Two DAGs $\mathcal{G_1}$ and $\mathcal{G_2}$ are Markov equivalent if and only if they have the same skeleton and the same immoralities (colliders).

   - Two graphs are Markov equivalent, if they entail the same conditional independencies. 
   - Two Markov equivalent graphs can be used for representing the same set of probability distributions.

</div>

<br/><br/>

The following figure shows an example of two Markov equivalent graphs.

![img](img/ch3/Markov_equivalent.png)

* Notice that chain and fork structures are treated the same in Markov equivalent DAGs.
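
The graphical criterion of Lemma 3.4 is simple enough to sketch in a few lines of plain Python. The helper names below (`skeleton`, `immoralities`, `markov_equivalent`) are hypothetical and chosen for this illustration; the three toy graphs are the usual chain, fork, and collider over nodes $A$, $B$, $C$.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(edge) for edge in edges}

def immoralities(edges):
    """Colliders a -> c <- b whose parents a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    return {
        (frozenset((a, b)), child)
        for child, pa in parents.items()
        for a, b in combinations(sorted(pa), 2)
        if frozenset((a, b)) not in skel
    }

def markov_equivalent(edges_1, edges_2):
    """Lemma 3.4: same skeleton and same immoralities."""
    return (skeleton(edges_1) == skeleton(edges_2)
            and immoralities(edges_1) == immoralities(edges_2))

chain = [("A", "C"), ("C", "B")]      # A -> C -> B
fork = [("C", "A"), ("C", "B")]       # A <- C -> B
collider = [("A", "C"), ("B", "C")]   # A -> C <- B

print(markov_equivalent(chain, fork))      # True
print(markov_equivalent(chain, collider))  # False
```

The chain and the fork share a skeleton and have no immorality, so they are equivalent; the collider has the same skeleton but adds the immorality $A \rightarrow C \leftarrow B$.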

<br/>


Markov equivalence has important implications for causal inference because: 

* Different causal graphs can lead to the same set of observed statistical dependencies. This means that two or more DAGs that are Markov equivalent cannot be distinguished based solely on observed data and statistical tests.

* Given observational data, it is generally not possible to uniquely determine the true causal structure among variables when multiple DAGs are Markov equivalent. 

* Additional interventions or domain knowledge can sometimes help resolve the ambiguity and identify the true causal structure.



See a short video on [Markov Equivalence](https://www.youtube.com/watch?v=nnjKCtdORwY&list=PLoazKTcS0Rzb6bb9L508cyJ1z-U9iWkA0&index=64) made by Brady Neal.

### Markov Equivalent Example, a 5 Node Graph using Python

You can use the [pgmpy](https://pgmpy.org/index.html) Python library to test Markov equivalence between DAGs.

pgmpy is a Python library for Bayesian Networks with a focus on Structure Learning, Parameter Estimation, and Causal Inference.


In [None]:
from pgmpy.base import DAG
import graphviz as gr

# Create the DAGs
G1 = DAG()
G1.add_edges_from([('X', 'Y'), ('X', 'Z'),
                  ('Z', 'Y'), 
                  ('V', 'Z'),
                  ('U','V')])

G2 = DAG()
G2.add_edges_from([('X', 'Y'), ('X', 'Z'),
                  ('Z', 'Y'), 
                  ('V', 'Z'),
                  ('V','U')])


def plot_from_model_pgmpy(edges):
    """Draw a graphviz Digraph from an iterable of (source, target) edges."""
    g = gr.Digraph()
    for source, target in edges:
        g.edge(source, target)
    return g

# Plot the DAGs
g1 = plot_from_model_pgmpy(G1.edges())
g2 = plot_from_model_pgmpy(G2.edges())

In [None]:
g1

In [None]:
g2

Are $G_1$ and $G_2$ Markov equivalent?

In [None]:
G1.is_iequivalent(G2)

What are the immoralities in $G_1$?

In [None]:
G1.get_immoralities()

What are the immoralities in $G_2$?

In [None]:
G2.get_immoralities()

## Markov Blanket

The **Markov Blanket** of a variable $X$ consists of three sets of variables:

* Parents of $X$: The parents of $X$ in the DAG are the variables that directly influence $X$. In terms of causality, the parents represent the direct causes of $X$.

* Children of $X$: The children of $X$ in the DAG are the variables that are directly influenced by $X$. In terms of causality, the children represent the direct effects of $X$.

* Parents of the children (spouses) of $X$: These are the variables that are parents of the children of $X$ but are not themselves parents of $X$. These variables provide information about the relationship between $X$ and its children, conditioning on other variables.

The **Markov blanket** of a node $X$, denoted by $MB(X)$, is the union of the children, parents and spouses (parents of children) of $X$. 

- $MB(X)$ is a subset that contains all the useful information about $X$. 
- The Markov blanket of a variable represents the **minimal set of variables** needed to predict $X$: once it is known, no other variable in the graph carries additional information about $X$. 
- $X$ is conditionally independent of all nodes outside its Markov blanket given its Markov blanket, where $N$ denotes the set of all nodes in the graph:

$$X \!\perp\!\!\!\perp N \setminus (MB(X) \cup \{X\}) | MB(X)$$


![img](img/ch3/Markov_blanket.png)
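
The definition translates directly into code. Below is a minimal sketch in plain Python; the function name `markov_blanket` and the small example DAG are assumptions made for illustration and are not taken from the figure above.

```python
def markov_blanket(edges, node):
    """Parents, children, and spouses (other parents of the children) of node."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    spouses = {u for u, v in edges if v in children and u != node}
    return parents | children | spouses

# Hypothetical DAG: A -> X, X -> C, S -> C, C -> D.
edges = [("A", "X"), ("X", "C"), ("S", "C"), ("C", "D")]
print(sorted(markov_blanket(edges, "X")))  # ['A', 'C', 'S']
```

Note that $D$, a grandchild of $X$, is not part of the blanket: given the child $C$, it carries no further information about $X$.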


### Markov Blanket Example, a 5 Node Graph using Python

We use the same pgmpy python package.

Let's find the Markov blanket for node $Z$ in the previous graph $G_1$.

In [None]:
g1

In [None]:
G1.get_markov_blanket('Z')

## Common Cause Principle

According to the **Common Cause Principle**, when variables $X$ and $Y$ are found to be associated or correlated, the association must have a causal explanation: either one variable causes the other, or there is a common cause (a confounding variable) that influences both $X$ and $Y$. In other words, the observed relationship between $X$ and $Y$ is not necessarily indicative of a direct causal relationship; it can also be the result of their shared dependence on a third variable.


We can use the Markov property to justify the **Common Cause Principle**. It states that when the random variables $X$ and $Y$ are dependent, there must be a *causal explanation* for this dependence as follows:

- $X$ is (possibly indirectly) causing $Y$, or 
- $Y$ is (possibly indirectly) causing $X$, or
- there is a (possibly unobserved) common cause $Z$ that (possibly indirectly) causes both $X$ and $Y$.

The following proposition justifies Reichenbach’s common cause principle with respect to a notion of “causing,” namely the existence of a directed path.

<br/>

<div class="alert alert-block alert-info">

**Proposition 3.5. (Common Cause Principle):** Assume that any pair of variables $X$ and $Y$ can be embedded into a larger system in the following sense. There exists a **causal model** over the collection $\mathbf{X}$ of random variables that contains $X$ and $Y$ within graph $\mathcal{G}$. Then common cause principle follows from the Markov property. If $X$ and $Y$ are dependent, then there is:

<br/>

- either a directed path from $X$ to $Y$, or 

<br/>

- a directed path from $Y$ to $X$, or 

<br/>

- there is a node $Z$ with a directed path from $Z$ to $X$ and from $Z$ to $Y$.

</div>


The Common Cause Principle highlights the importance of identifying and accounting for confounding variables when drawing causal conclusions from observational data. By controlling for confounders through study design, statistical adjustment, or randomized controlled experiments, researchers can reduce the likelihood of drawing incorrect causal inferences and gain a better understanding of the true causal relationships between variables.
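
A small simulation can illustrate the third case of the principle. The sketch below assumes a linear model with Gaussian noise in which $Z$ causes both $X$ and $Y$ and there is no edge between $X$ and $Y$; the coefficients, sample size, and helper names are arbitrary choices for this illustration. It compares the plain correlation of $X$ and $Y$ with their correlation after the linear effect of $Z$ has been removed from both.

```python
import random

random.seed(0)
n = 20_000

# Hypothetical linear model: Z is a common cause of X and Y;
# X and Y do not affect each other.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

def residuals(a, b):
    """Residuals of a after a least-squares fit on b (with intercept)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    slope = (sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
             / sum((bi - mb) ** 2 for bi in b))
    return [(ai - ma) - slope * (bi - mb) for ai, bi in zip(a, b)]

marginal_corr = corr(x, y)                             # about 0.5 in theory
partial_corr = corr(residuals(x, z), residuals(y, z))  # about 0 in theory
print(round(marginal_corr, 2), round(partial_corr, 2))
```

The dependence between $X$ and $Y$ is explained entirely by the common cause: once $Z$ is adjusted for, the remaining correlation is close to zero.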



## Causal Graphical Models

<div class="alert alert-block alert-info">

**Proposition 3.6. (Causal Model vs. Markov Property):** [Source](https://mitpress.mit.edu/books/elements-causal-inference) Assume that $P_X$ is induced by a Structural Causal Model (SCM) with graph $\mathcal{G}$. Then, $P_X$ is Markovian with respect to $\mathcal{G}$. The assumption that a distribution is Markovian with respect to the causal graph is sometimes called the **causal Markov condition**. 

</div>

<br/>

We will see in the next chapter that it is sufficient to know the observational data distribution and the related graph structure for defining intervention distributions for a process. Therefore, we define a causal graphical model as a pair consisting of a *graph* and an *observational distribution* such that the distribution is Markovian with respect to the graph (causal Markov condition).

<br/>

<div class="alert alert-block alert-info">

**Definition 3.7 (Causal graphical model):** A causal graphical model over random variables $\mathbf{X}=\left(X_{1}, \ldots, X_{d}\right)$ contains a graph $\mathcal{G}$ and a collection of functions (structure) $f_{j}\left(x_{j}, x_{\mathbf{P A}_{j}^{\mathcal{G}}}\right)$ that integrate to $1:$

$$
\int f_{j}\left(x_{j}, x_{\mathbf{P A}_{j}^{\mathcal{G}}}\right) d x_{j}=1
$$

This collection of functions (structure) induces a distribution $P_{\mathbf{X}}$ over $\mathbf{X}$ via

$$
p\left(x_{1}, \ldots, x_{d}\right)=\prod_{j=1}^{d} f_{j}\left(x_{j}, x_{\mathbf{P A}_{j}^{\mathcal{G}}}\right)
$$

and thus play the role of conditionals: 

$$
f_{j}\left(x_{j}, x_{\mathbf{P A}_{j}^{\mathcal{G}}}\right) = p\left(x_{j} \mid x_{\mathbf{P A}_{j}^{\mathcal{G}}}\right)
$$

Refer to [Elements of Causal Inference, Chapter 6](https://mitpress.mit.edu/books/elements-causal-inference) for the proof.
</div>

<br/>

If a distribution $P_X$ over $X$ is Markovian with respect to a graph $\mathcal{G}$ and allows for a strictly positive, continuous density $p$, the pair $(P_X,\mathcal{G})$ defines a **causal graphical model** by:

$$
f_{j}\left(x_{j}, x_{\mathbf{P A}_{j}^{\mathcal{G}}}\right) = p\left(x_{j} \mid x_{\mathbf{P A}_{j}^{\mathcal{G}}}\right)
$$


<br/>

See a short video on [Causal Graphs](https://www.youtube.com/watch?v=vjvP9oRgZyM&list=PLoazKTcS0Rzb6bb9L508cyJ1z-U9iWkA0&index=21) made by Brady Neal.


## Faithfulness and Causal Minimality


The faithfulness assumption states that a causal model is faithful if and only if it satisfies the following condition:

*For every conditional independence relationship that holds in the probability distribution induced by the causal model, there exists a corresponding d-separation relationship in the causal graph.*

In simpler terms, faithfulness asserts that if two variables, $X$ and $Y$, are conditionally independent given a set of other variables $Z$ in the distribution, then $X$ and $Y$ must be d-separated by $Z$ in the underlying DAG $\mathcal{G}$: every path between them is blocked once we condition on $Z$.


As we have seen, the **Markov** assumption enables us to read **independences** off a graph structure. **Faithfulness** allows us to infer **dependences** from the graph structure.


<div class="alert alert-block alert-info">

**Definition 3.8 (Faithfulness and causal minimality):** Consider a distribution $P_\mathbf{X}$ and a DAG $\mathcal{G}$.
    
**(a)** $P_\mathbf{X}$ is faithful to the DAG $\mathcal{G}$ if

$$
\mathbf{A} \!\perp\!\!\!\perp \mathbf{B}|\mathbf{C} \Rightarrow \mathbf{A} \!\perp\!\!\!\perp_{\mathcal{G}} \mathbf{B}|\mathbf{C}
$$

for all disjoint vertex sets $\mathbf{A}, \mathbf{B}, \mathbf{C}$.

**(b)** A distribution satisfies causal minimality with respect to $\mathcal{G}$ if it is Markovian with respect to $\mathcal{G}$, but not to any proper subgraph of $\mathcal{G}$.
</div>

<br/>

Part (a) posits an implication that is the converse of the global Markov condition, which reads:

$$
\mathbf{A} \!\perp\!\!\!\perp_{\mathcal{G}} \mathbf{B}|\mathbf{C} \Rightarrow \mathbf{A} \!\perp\!\!\!\perp \mathbf{B}| \mathbf{C}
$$

<div class="alert alert-block alert-info">

**Proposition 3.9 (Faithfulness implies causal minimality):** If $P_\mathbf{X}$ is faithful and Markovian with respect to $\mathcal{G}$, then causal minimality is satisfied.
    
</div>

Given that a distribution is Markovian with respect to $\mathcal{G}$, it satisfies causal minimality with respect to $\mathcal{G}$ if and only if no node is conditionally independent of any of its parents, given the remaining parents.


However, it's important to note that the **Faithfulness assumption** does not guarantee that all causal relationships can be identified or estimated. There may be cases where causal relationships are not reflected in the observed data due to various reasons such as unobserved confounding or measurement error.
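
A classic way in which faithfulness can fail is when two causal paths cancel each other exactly. The sketch below assumes a hypothetical linear SCM over $X$, $Z$, $Y$ with edges $X \rightarrow Z$, $Z \rightarrow Y$, and $X \rightarrow Y$, and with independent unit-variance noise terms. It computes $\mathrm{Cov}(X, Y)$ analytically for a generic choice of coefficients and for a choice in which the direct effect cancels the indirect one.

```python
# Hypothetical linear SCM with independent unit-variance noise terms:
#   X = N_x,   Z = b * X + N_z,   Y = a * X + c * Z + N_y
def cov_xy(a, b, c):
    """Covariance of X and Y, computed from their noise coefficients."""
    # Each variable is written as a weighted sum of (N_x, N_z, N_y).
    x = (1.0, 0.0, 0.0)
    z = (b * x[0], 1.0, 0.0)
    y = (a * x[0] + c * z[0], c * z[1], 1.0)
    return sum(xi * yi for xi, yi in zip(x, y))

generic = cov_xy(a=1.0, b=0.5, c=2.0)      # direct and indirect paths add up
cancelling = cov_xy(a=-1.0, b=0.5, c=2.0)  # a = -b * c, so the two paths cancel
print(generic, cancelling)  # 2.0 0.0
```

In the second case $X$ and $Y$ are uncorrelated (and, if the noise terms are additionally assumed to be jointly Gaussian, independent) even though they are connected by an edge, so the distribution is Markovian but not faithful with respect to the graph.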



## Causal Sufficiency

A set of variables $\mathbf{X}$ is usually said to be **causally sufficient** if there is no hidden common cause $C \notin \mathbf{X}$ that is causing more than one variable in $\mathbf{X}$.

- The definition **causally sufficient** matches the intuitive meaning of the set of “relevant” variables. However, it uses the concept of a **common cause**.

- A variable $C$ is a **common cause** of $X$ and $Y$ if there is a directed path from $C$ to $X$ and from $C$ to $Y$ that does not include $Y$ and $X$, respectively.

- **Common causes** are also called **confounders** and we use these terms interchangeably.
- **Causal Sufficiency** describes when a set of variables is large enough to perform causal reasoning, in the sense of computing observational and intervention distributions.


<div class="alert alert-block alert-info">

**Definition 3.10 (Interventional Sufficiency):** We call a set $\mathbf{X}$ of variables interventionally sufficient if there exists an SCM over $\mathbf{X}$ that cannot be falsified as an interventional model; that is, it induces observational and intervention distributions that coincide with what we observe in practice.

</div>



For more information on Causal or Interventional sufficiency, see Chapter 9. Hidden Variables from [Elements of Causal Inference](https://mitpress.mit.edu/books/elements-causal-inference) book.




## Causal DAGs in Action 

<font color='blue'>What we saw is part of the math under the hood! In this course, we focus on using causal DAGs, and DAGs can be of great practical use even without detailed knowledge of this mathematical background. If you want to know more in-depth math, see Judea Pearl's paper [Foundations of Causal Inference, 2010](https://ftp.cs.ucla.edu/pub/stat_ser/r355-reprint.pdf) and the books we suggested in the syllabus.</font>


Using causal assumptions, the associated properties, and the python libraries we presented in this chapter, researchers are making causal models to answer different causal questions in various domains.

For example, researchers in this [paper](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-017-0448-y#citeas) created a causal disease network that implements disease causality through text mining on biomedical literature.

![img](img/ch3/Causal_Model_Disease_Categories.png)


As another example, the authors of this [paper](https://www.nature.com/articles/s41467-020-15195-y) proposed a method that integrates observational data and causal inference techniques to identify causal relationships between climate variables. By constructing causal networks, they provide insights into the mechanisms driving climate change and enable more accurate predictions. 

![img](img/ch3/Causal_Model_Climate.png)



## References

The contents of this chapter are highly inspired by the [Elements of Causal Inference (Open Access) book](https://mitpress.mit.edu/books/elements-causal-inference) by Jonas Peters, Dominik Janzing and [Bernhard Schölkopf](https://www.is.mpg.de/~bs), especially for definitions and propositions.

We also used a graph example from the [Introduction to Causal Inference](https://www.bradyneal.com/causal-inference-course) course by Brady Neal. He made a very nice open-access online course accompanied by videos on [YouTube](https://youtube.com/playlist?list=PLoazKTcS0Rzb6bb9L508cyJ1z-U9iWkA0).

More proofs and theorems can be found in Judea Pearl's [Causal Inference in Statistics: A Primer](https://www.wiley.com/en-us/Causal+Inference+in+Statistics%3A+A+Primer-p-9781119186847).
