# Chapter 4. Structural Causal Models

**Structural equation modeling (SEM)** is a family of methods used in experimental and observational research. An SEM relates different aspects of a phenomenon to one another through a structure: a system of equations that implies statistical, and often causal, relationships between variables.

SEMs, also called **Structural Causal Models (SCMs)**, are important tools for relating causal and probabilistic statements.

## Graphs vs. Structural Equations

We saw in Chapter 3 that graphs (**G+**) have several strengths:
- **G+** Graphs are an excellent tool for communicating with subject matter experts.
- **G+** Graphs can be a helpful way to translate assumptions into a formal model.
- **G+** Graphs are also useful for seeing what restrictions (if any) our model puts on the joint distribution of the observed data.

Structural equations (**E+**) have their own advantages:
- **E+** As the model becomes more complicated, the equations become much easier to work with.
- **E+** Equations may help you resist the urge to oversimplify.


## Structural Causal/Equation Model

A **Structural Causal Model (SCM)** or **Structural Equation Model (SEM)** consists of:

1. *Endogenous variables* $X = \lbrace X_1, \ldots, X_J \rbrace$
    - Affected by other variables in the model
    - May or may not be observed

2. *Background (exogenous) variables* $U = \lbrace U_1, \ldots, U_J \rbrace$
    - Not affected by other factors in the model
    - Not observed
    - Each $U_j$ is a noise (error) variable, and $P_U$ is the joint distribution over the noise variables.
    - Each endogenous variable $X_j$ has an error $U_j$ (also written $U_{X_j}$).

3. *Functions* $F = \lbrace f_{X_1} , ... , f_{X_J} \rbrace $
    - The functions $F$ define a set of $J$ **structural equations** for each of the endogenous variables:
    $X_j = f_{X_j}(Pa(X_j),U_{X_j}), j = 1,...,J$

    $Pa(X_j) \subseteq X \setminus \lbrace X_j \rbrace$
    
where $Pa(X_j)$ is called the set of **parents** of $X_j$.

- We call the elements of $Pa(X_j)$ (also written $PA_j$) not only parents but also **direct causes** of $X_j$.
- We call $X_j$ a **direct effect** of each of its **direct causes** in $Pa(X_j)$.
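
The three ingredients above map naturally onto a small data structure. The following is a minimal sketch, assuming a W-A-Y model with parent sets $Pa(W)=\emptyset$, $Pa(A)=\lbrace W \rbrace$ and $Pa(Y)=\lbrace W, A \rbrace$. The concrete functions, the noise values and the helper name `solve` are hypothetical and chosen only for illustration.

```python
# Hypothetical W-A-Y structural causal model (all functions are illustrative).
parents = {"W": [], "A": ["W"], "Y": ["W", "A"]}

# One structural equation per endogenous variable: X_j = f_j(Pa(X_j), U_j).
functions = {
    "W": lambda pa, u: u,                          # W = U_W
    "A": lambda pa, u: int(pa["W"] + u > 0.5),     # A = 1{W + U_A > 0.5}
    "Y": lambda pa, u: 2 * pa["A"] + pa["W"] + u,  # Y = 2A + W + U_Y
}


def solve(parents, functions, noise, order):
    """Evaluate the structural equations in an order where parents come first."""
    values = {}
    for name in order:
        parent_values = {p: values[p] for p in parents[name]}
        values[name] = functions[name](parent_values, noise[name])
    return values


# One realization u of the background (exogenous) variables.
u = {"W": 1, "A": 0.2, "Y": -0.5}
x = solve(parents, functions, u, order=["W", "A", "Y"])
print(x)
```

Once the noise values are fixed, every endogenous variable is determined by the equations; randomness enters only through $P_U$.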



<br/><br/>

**W-A-Y Example, Make a Graph for SCM:** 

The graph $G$ of an SCM $\mathfrak{C}$ is obtained by creating one node for each $X_j$ and drawing directed edges from each parent in $PA_j$ to $X_j$. 

![img](img/ch4/Graph-SEM.png)

- Connect parents to children with a directed link.
- Each endogenous variable $X_j$ has an error $U_j$.
- Potential dependence between errors $U_j$ is encoded by dashed lines or double-headed arrows.

- We assume this graph $G$ is acyclic, that is, it has no directed cycles or feedback loops.
    - Instead of feedback loops, we can use temporal ordering.
    - In other words, we extend the graph (and the corresponding structural equations) over time.

- We will work with recursive SCMs. There is an ordering of $X = \lbrace X_1, \ldots, X_J \rbrace$ such that each $X_j$ is a function of a subset $Pa(X_j)$ of its **predecessors**.
    - Causes always precede their effects
    - A natural source of ordering is *time*

    $X_j = f_{X_j}(Pa(X_j),U_{X_j}), j = 1,...,J $ 

    $Pa(X_j) \subseteq \lbrace X_1, \ldots, X_{j-1} \rbrace$
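
Building the graph from the parent sets and checking that a recursive ordering exists can be sketched as follows. This is a minimal sketch, assuming Python 3.9+ (for the standard-library `graphlib` module) and hypothetical W-A-Y parent sets; the helper name `causal_order` is not from the text.

```python
from graphlib import CycleError, TopologicalSorter


def causal_order(parents):
    """Return an ordering in which every variable follows all of its parents.

    `parents` maps each variable to the set Pa(X_j). A CycleError is raised
    if the parent sets contain a directed cycle (a feedback loop).
    """
    return list(TopologicalSorter(parents).static_order())


# Hypothetical parent sets for the W-A-Y example.
parents = {"W": set(), "A": {"W"}, "Y": {"W", "A"}}

# One directed edge from each parent to its child.
edges = sorted((p, child) for child, pas in parents.items() for p in pas)
order = causal_order(parents)
print(edges)
print(order)

# A feedback loop A -> Y -> A violates the acyclicity assumption.
cyclic = {"A": {"Y"}, "Y": {"A"}}
try:
    causal_order(cyclic)
    is_acyclic = True
except CycleError:
    is_acyclic = False
print(is_acyclic)
```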

<br/><br/>


**W-A-Y Example, Causal Exclusion:** 

Exclusion restrictions are encoded in the graph through the absence of arrows between variables.
- The absence of an arrow means there is no direct effect.

![img](img/ch4/Graph-SEM-Excluded.png)

<br/><br/>


**W-A-Y Example, Independence Assumptions:** 
- The absence of a double-headed arrow between two background (exogenous) variables or errors $U$ means those two errors are **independent**.
    - This is an assumption on the distribution $P_U$.
    - There is no unmeasured common cause between those two $U_j$.

![img](img/ch4/Graph-SEM-Independence.png)


<div class="alert alert-block alert-info">

**Proposition 4.1. (SCM Entailed Distributions):** An SCM $\mathfrak{C}$ defines a unique distribution over the variables $X = (X_1,...,X_J)$ such that $X_j = f_{X_j}(Pa(X_j),U_{X_j})$ for $j = 1,...,J$. 
We refer to it as the entailed distribution $P^{\mathfrak{C}}_X$ and sometimes simply write $P_X$.

</div>


<br/><br/>
## Interventions on an SCM

- When we intervene on a variable $X_j$ and set it to a specific value, we expect the intervention to change the distribution of the system compared to its behavior without intervention.

    - Even if variable $X_j$ was causally influenced by other variables before, it is now influenced by nothing else. 
    - $X_j$ has no more causal parents. 

- The autonomy of structural equations means that we can make a targeted modification to the set of equations in order to represent our intervention of interest.


<div class="alert alert-block alert-info">

**Proposition 4.2. (Intervention Distribution):** [source](https://mitpress.mit.edu/books/elements-causal-inference) 

Consider an SCM $\mathfrak{C}$ and its entailed distribution $P^\mathfrak{C}_X$. We replace one (or several) of the structural assignments to obtain a new SCM $\tilde{\mathfrak{C}}$. 

<br/>

Assume that we replace the assignment for $X_j$ by:

$X_j = \tilde{f}_{X_j}(\tilde{Pa}(X_j),U_{X_j})$.

<br/>

We then call the entailed distribution of the new SCM $\tilde{\mathfrak{C}}$ an **intervention distribution** and say that the variables whose structural assignment we have replaced have been **intervened** on. We denote the new distribution $P^{\tilde{\mathfrak{C}}}_X$ by:

$P^{\tilde{\mathfrak{C}}}_X = P^{\mathfrak{C} ; do(\tilde{f}_{X_j}(\tilde{Pa}(X_j),U_{X_j}))}_X$.

<br/>

The set of noise (error) variables in $\tilde{\mathfrak{C}}$ now contains both some “new” $\tilde{U}$ 's and some "old" $U$'s, all of which are required to be jointly independent.

</div>


<br/><br/>

**W-A-Y Example, Intervention on A:** 
- We intervene on the system to set $A=1$: we replace $f_A$ with the constant function $A=1$.

![img](img/ch4/Grpah-SEM-Intervene.png)

- $Y_a(u)$ is defined as the solution to the equation $f_Y$ under an intervention on the system of equations to set $A=a$ (with input $U_Y=u$).
- We can think of $u$ as a particular realization of (values for) the background factors.
- $P_U$ and $F$ induce a probability distribution on $Y_a$, just as they do on $Y$.
- $Y_a$ is a **post-intervention** or **counterfactual** random variable.
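
The replacement of $f_A$ by a constant and the resulting $Y_a(u)$ can be sketched in a few lines. This is a minimal sketch under assumed structural equations for W-A-Y; the functions and the helper names `make_scm`, `do` and `solve` are hypothetical.

```python
def make_scm():
    """Hypothetical W-A-Y structural equations (illustrative only)."""
    return {
        "W": lambda v, u: u["W"],                         # W = U_W
        "A": lambda v, u: int(v["W"] + u["A"] > 0),       # A = 1{W + U_A > 0}
        "Y": lambda v, u: v["W"] + 3 * v["A"] + u["Y"],   # Y = W + 3A + U_Y
    }


def do(scm, variable, value):
    """Copy the SCM and replace one structural equation by a constant."""
    modified = dict(scm)
    modified[variable] = lambda v, u: value
    return modified


def solve(scm, u, order=("W", "A", "Y")):
    """Solve the equations for one realization u of the background factors."""
    values = {}
    for name in order:
        values[name] = scm[name](values, u)
    return values


u = {"W": -1.0, "A": 0.5, "Y": 0.25}
scm = make_scm()

observed = solve(scm, u)               # no intervention
y_1 = solve(do(scm, "A", 1), u)["Y"]   # Y_1(u): outcome under do(A = 1)
y_0 = solve(do(scm, "A", 0), u)["Y"]   # Y_0(u): outcome under do(A = 0)
print(observed, y_1, y_0)
```

Only the equation for $A$ is modified; $f_W$, $f_Y$ and the realization $u$ are left untouched, which is what the autonomy of the structural equations allows.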


<br/><br/>

**Simple Prediction Example, Intervention Targets:** 

This example considers prediction. It shows that even though some variables may be good predictors for a target variable $Y$ , intervening on them may leave the target variable unaffected. Consider the following SCM $\mathfrak{C}$:

![img](img/ch4/Exp_Predictors_Intervention_Targets.png)

$
\begin{cases}
 X_1 = U_{X_1}\\
 Y = X_1 + U_Y\\
 X_2 = Y + U_{X_2}
\end{cases}
$

with jointly independent noise variables distributed as:

$
\begin{cases}
U_{X_1} \stackrel{iid}{\sim} \mathcal{N}(0,1) \\
U_{X_2} \stackrel{iid}{\sim} \mathcal{N}(0,0.1) \\
U_{Y} \stackrel{iid}{\sim} \mathcal{N}(0,1) 
\end{cases}
$

<br/>

**Case 1: Intervene on $X_2$:** We are interested in predicting $Y$ from $X_1$ and $X_2$. Clearly, $X_2$ is a better predictor for $Y$ than $X_1$ is. For example, a linear model without $X_2$ leads to a (significantly) larger mean squared error than a linear model without $X_1$ would.


However, if we want to change $Y$, intervening on $X_2$ is useless. In other words, no matter how strongly we intervene on $X_2$, the distribution of $Y$ remains unaffected:

$P^{\mathfrak{C} ; do(X_2 = \tilde{U})}_Y = P^{\mathfrak{C}}_Y$

<br/>

**Case 2: Intervene on $X_1$:** An intervention on $X_1$, however, does change the distribution of $Y$. If $\tilde{U}$ is Gaussian:

$P^{\mathfrak{C} ; do(X_1 = \tilde{U})}_Y = \mathcal{N}(E(U_Y) + E(\tilde{U}), var(U_Y) + var(\tilde{U}))$ 

This differs from $P^{\mathfrak{C}}_Y = \mathcal{N}(0, 2)$ unless $\tilde{U} \sim \mathcal{N}(0,1)$, so in general:

$P^{\mathfrak{C} ; do(X_1 = \tilde{U})}_Y \neq P^{\mathfrak{C}}_Y$
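
Both cases can be checked by simulation. The following is a rough sketch using only the standard library. It assumes that the second parameter of $\mathcal{N}(\cdot,\cdot)$ is a variance (so the standard deviation of $U_{X_2}$ is $\sqrt{0.1}$) and picks, purely for illustration, $\tilde{U} \sim \mathcal{N}(3, 4)$. The helper names are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
N = 20000


def sample(do_x1=None, do_x2=None):
    """Draw one (x1, y, x2); do_x1 / do_x2 optionally replace an assignment."""
    x1 = do_x1() if do_x1 else random.gauss(0, 1)
    y = x1 + random.gauss(0, 1)
    # Assumption: N(0, 0.1) denotes variance 0.1, i.e. std = sqrt(0.1).
    x2 = do_x2() if do_x2 else y + random.gauss(0, 0.1 ** 0.5)
    return x1, y, x2


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def mse_of_linear_fit(x, y):
    """In-sample mean squared error of the least-squares line y ~ a + b * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return mean([(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)])


# Observational data: compare the two single-predictor linear models.
x1, y, x2 = zip(*(sample() for _ in range(N)))
mse_x1_only = mse_of_linear_fit(x1, y)  # model without X2
mse_x2_only = mse_of_linear_fit(x2, y)  # model without X1

# Interventions with a hypothetical U~ drawn from N(3, 4), i.e. std = 2.
u_tilde = lambda: random.gauss(3, 2)
y_do_x2 = [sample(do_x2=u_tilde)[1] for _ in range(N)]
y_do_x1 = [sample(do_x1=u_tilde)[1] for _ in range(N)]

print("MSE using X1 only:", mse_x1_only)
print("MSE using X2 only:", mse_x2_only)
print("Y observational  :", mean(y), variance(y))
print("Y under do(X2=U~):", mean(y_do_x2), variance(y_do_x2))
print("Y under do(X1=U~):", mean(y_do_x1), variance(y_do_x1))
```

Under these assumptions, the model that uses only $X_2$ should show a much smaller error than the one that uses only $X_1$, the distribution of $Y$ under $do(X_2 = \tilde{U})$ should match the observational one, and under $do(X_1 = \tilde{U})$ the mean and variance of $Y$ should move to roughly $3$ and $5$.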

## Calculating Intervention Distributions

We use a trivial but powerful invariance statement for calculating intervention distributions. Given an SCM $\mathfrak{C}$, and writing $p a(j):=\mathbf{P A}_{j}^{\mathcal{G}}$, we have:

$$
p^{\tilde{\mathfrak{C}}}\left(x_{j} \mid x_{p a(j)}\right)=p^{\mathfrak{C}}\left(x_{j} \mid x_{p a(j)}\right)
$$ 
<p style='text-align: center;'> (Eq.1) </p>

for any SCM $\tilde{\mathfrak{C}}$ that is constructed from $\mathfrak{C}$ by **intervening** on (some) $X_k$ but not on $X_j$.

- Eq. 1 shows that causal relationships are **autonomous** under interventions. This property is therefore sometimes called **autonomy**. It means that if we intervene on a variable, the other mechanisms remain invariant.


The autonomy formula is the basis for the **G-computation formula** (Robins, 1986), **truncated factorization** (Pearl, 1993), or **manipulation theorem** (Spirtes, 2000) that we saw in Chapter 3. As a refresher, we repeat those formulas.

We start with an SCM $\mathfrak{C}$ with density $p^{\mathfrak{C}}$ and the following structural assignments (here the noise variables are written $N_j$ instead of $U_j$, and there are $d$ variables instead of $J$):

$$
X_{j}:=f_{j}\left(X_{pa(j)}, N_{j}\right), \quad j=1, \ldots, d
$$
<p style='text-align: center;'> (Eq.2) </p>

Using the Markov property, we have:

$$
p^{\mathfrak{C}}\left(x_{1}, \ldots, x_{d}\right)=\prod_{j=1}^{d} p^{\mathfrak{C}}\left(x_{j} \mid x_{p a(j)}\right)
$$
<p style='text-align: center;'> (Eq.3) </p>

We now perform an **intervention** $do\left(X_{k}:=\tilde{N}_{k}\right)$, where $\tilde{N}_{k}$ has density $\tilde{p}$, so the SCM $\mathfrak{C}$ becomes $\mathfrak{\tilde{C}}$. Using the Markov property again, together with the invariance in Eq. 1, we get:

$$
\begin{aligned}
p^{\mathfrak{C} ; d o\left(X_{k}:=\tilde{N}_{k}\right)}\left(x_{1}, \ldots, x_{d}\right) &=\prod_{j \neq k} p^{\mathfrak{C} ; d o\left(X_{k}:=\tilde{N}_{k}\right)}\left(x_{j} \mid x_{p a(j)}\right) \cdot p^{\mathfrak{C} ; d o\left(X_{k}:=\tilde{N}_{k}\right)}\left(x_{k}\right) \\
&=\prod_{j \neq k} p^{\mathfrak{C}}\left(x_{j} \mid x_{p a(j)}\right) \tilde{p}\left(x_{k}\right) .
\end{aligned}
$$
<p style='text-align: center;'> (Eq.4) </p>
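
For discrete variables, Eq. 3 and Eq. 4 can be evaluated directly. The following is a minimal sketch for three binary variables with graph $W \to A$, $W \to Y$, $A \to Y$, intervening on $A$ (playing the role of $X_k$). All probability tables are hypothetical numbers chosen for illustration.

```python
# Hypothetical conditional probability tables for binary W, A, Y.
p_w = {0: 0.6, 1: 0.4}                                    # p(w)
p_a_given_w = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p(a | w), keyed [w][a]
p_y1_given_w_a = {(0, 0): 0.1, (0, 1): 0.5,               # p(Y = 1 | w, a)
                  (1, 0): 0.4, (1, 1): 0.9}


def p_y_given_w_a(y, w, a):
    p1 = p_y1_given_w_a[(w, a)]
    return p1 if y == 1 else 1.0 - p1


def joint(w, a, y, p_a_tilde=None):
    """Markov factorization (Eq. 3); with p_a_tilde, the truncated one (Eq. 4).

    Under do(A := N~), only the factor of A is replaced by the new marginal
    density p~(a); the factors of W and Y stay as in the original SCM.
    """
    factor_a = p_a_given_w[w][a] if p_a_tilde is None else p_a_tilde[a]
    return p_w[w] * factor_a * p_y_given_w_a(y, w, a)


states = [(w, a, y) for w in (0, 1) for a in (0, 1) for y in (0, 1)]

# Observational conditional p(Y = 1 | A = 1).
p_a1 = sum(joint(w, a, y) for w, a, y in states if a == 1)
p_y1_given_a1 = sum(joint(w, a, y) for w, a, y in states if a == 1 and y == 1) / p_a1

# Interventional p(Y = 1) under do(A := 1), i.e. p~ is a point mass on a = 1.
do_a1 = {0: 0.0, 1: 1.0}
p_y1_do_a1 = sum(joint(w, a, y, do_a1) for w, a, y in states if y == 1)

print("p(Y=1 | A=1)     =", round(p_y1_given_a1, 4))
print("p(Y=1 | do(A=1)) =", round(p_y1_do_a1, 4))
```

With these illustrative tables, conditioning gives $0.78$ while intervening gives $0.66$: the two differ because $W$ is a common cause of $A$ and $Y$.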




## Causal Models and Counterfactuals

Humans often think in the form of **counterfactuals**: *“I should have taken the train.”* or *“We should have invested in Bitcoin in January 2019!”* are only a few examples. 

Assume someone offers you \$10,000 if you predict the result of a coin flip. You guess “heads” and lose. Some people may then think, “Why did I not say ‘tails’?” even though there was no way they could have possibly known the outcome. 

Counterfactual thinking has a history as long as humankind. For example, Titus Livius (Livy) discusses in 25 BC what would have happened if Alexander the Great had not died on his way back from Persia and had attacked Rome. Livy argues that Rome and Carthage would have joined forces to crush the Macedonian army ([Geradin and Girgenson, 2011](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1970917)).

![img](img/ch4/Alexander_mosaic.jpeg)


<br/><br/>

**Example: Crop Planning**  

This example is from [Jerzy Neyman, 1923](https://en.wikipedia.org/wiki/Jerzy_Neyman#:~:text=After%20his%20return%20to%20Poland,studied%20randomized%20experiments%20in%201923.). Consider $m$ plots of land and $v$ varieties of crop. Denote by $U_{ij}$ the crop yield that would be observed if variety $i = 1, \ldots, v$ were planted in plot $j = 1, \ldots, m$.

For each plot $j$, we can only experimentally determine one $U_{ij}$ in each growing season. The other crop yields are called **counterfactuals**.


<div class="alert alert-block alert-info">

**Definition 4.3. (Counterfactuals):** A counterfactual corresponds to updating the noise distributions of an SCM (by conditioning) and then performing an intervention.
Consider an SCM $\mathfrak{C} = (S,P_U)$ over nodes $X$. Given some observations $x$, we define a *counterfactual SCM* by replacing the distribution of noise variables:

$$
\mathfrak{C}_{\mathbf{X}=\mathbf{x}}:=\left(\mathbf{S}, P_{\mathbf{U}}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x}}\right)
$$

where 
$$
P_{\mathbf{U}}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x}} = P_{\mathbf{U} \mid \mathbf{X}=\mathbf{x}}^{}
$$

The new set of noise variables need not be jointly independent anymore. *Counterfactual* statements can now be seen as *do-statements* in the new counterfactual SCM.

</div>

<br/><br/>

**Example: Three Integers, Computing Counterfactuals:** 

This example is from [Elements of Causal Inference Book, Chapter 6](https://mitpress.mit.edu/books/elements-causal-inference). 
Consider the following SCM $\mathfrak{C}$:

$
\begin{cases}
X = U_X \\
Y = X^{2} + U_Y\\
Z = 2Y + X + U_Z
\end{cases}
$

with uniformly distributed noise values on the integers between −5 and 5:

$
U_{X}, U_{Y}, U_{Z} \stackrel{\text { iid }}{\sim} \mathrm{U}(\{-5,-4, \ldots, 4, 5\})
$

<br/>

**Case 1: Observation:** We assume that we observe $(X,Y,Z) = (1,2,4)$.

Then $P_{\mathbf{U}}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x}}$ puts a point mass on $(U_X , U_Y , U_Z) = (1, 1, -1)$ because here all noise terms can be uniquely reconstructed from the observations.

<br/>

**Case 2: Counterfactual:** In the context of the observation $(X,Y,Z) = (1,2,4)$, we have the counterfactual statement: “$Z$ would have been 11 if $X$ had been set to 2.” 

Mathematically, this means that $P_{\mathbf{Z}}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x};do(X=2)}$ puts a point mass on 11: with the reconstructed noise $(U_X, U_Y, U_Z) = (1, 1, -1)$, setting $X=2$ gives $Y = 2^2 + 1 = 5$ and $Z = 2 \cdot 5 + 2 - 1 = 11$.

Similarly, the counterfactual statement “$Y$ would have been 5 if $X$ had been set to 2” holds.

Mathematically, $P_{\mathbf{Y}}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x};do(X=2)}$ puts a point mass on 5.

Counterfactual notation may look quite complicated. The following image provides further clarification:

![img](img/ch4/Counterfactuals_notation.png)
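
The two steps of Definition 4.3 (condition the noise on the observation, then intervene) can be sketched by brute force for the three-integer SCM, since the noise space is finite. The helper name `solve` and its `do_x` parameter are hypothetical.

```python
from itertools import product

NOISE_VALUES = range(-5, 6)  # integers -5, ..., 5


def solve(u_x, u_y, u_z, do_x=None):
    """Solve the structural equations; do_x replaces the assignment of X."""
    x = u_x if do_x is None else do_x
    y = x ** 2 + u_y
    z = 2 * y + x + u_z
    return x, y, z


observed = (1, 2, 4)

# Step 1 (conditioning): keep every noise value consistent with the observation.
# The prior is uniform, so the posterior is uniform over this list.
consistent_noise = [
    u for u in product(NOISE_VALUES, repeat=3) if solve(*u) == observed
]

# Step 2 (intervention): apply do(X = 2) while keeping the updated noise.
counterfactuals = {solve(*u, do_x=2) for u in consistent_noise}

print(consistent_noise)
print(counterfactuals)
```

Because exactly one noise vector is consistent with the observation, the counterfactual distribution is a point mass, matching the values $Y=5$ and $Z=11$ stated in the example.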


## Linking Observations to SCMs using Graphs


### <font color='blue'> **What causal structures can lead to dependence between two observed variables?**</font>


**1- Direct and Indirect Effects**

An effect of $A$ on $Y$ can result in an **association**.

![img](img/ch4/SEM-Observe-Direct-Effects.png)


**2- Shared Common Cause**

A common cause (measured or unmeasured) of $A$ and $Y$ can result in an association. When the common cause is not included in $X$, it is represented through the dependence it induces between errors $U$.

![img](img/ch4/SEM-Observe-Share-Cause.png)

**3- Neither Condition 1 nor 2**

If neither of these sources of dependence is present, $A$ and $Y$ will be **independent** in every probability distribution $P_0$ compatible with the SCM. 
In other words, any data generating experiment that is compatible with the SCM will give rise to an observed data distribution in which $A$ and $Y$ are independent regardless of functional form, strength of associations, etc.

![img](img/ch4/SEM-Observe-Independent.png)

**4- Conditioning on a Collider**

A collider is an “inverted fork”. Conditioning on a common effect (a descendant of both $A$ and $Y$) can result in an association between $A$ and $Y$. This is also called **Berkson’s bias or selection bias**.

![img](img/ch4/SEM-Observe-Colider.png)
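
Collider bias is easy to reproduce in a simulation. The following is a rough sketch with a hypothetical linear collider $C = A + Y + \text{noise}$ and a hypothetical selection rule $C > 1$; none of these choices come from the text.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


n = 20000
a = [random.gauss(0, 1) for _ in range(n)]   # A and Y are generated independently
y = [random.gauss(0, 1) for _ in range(n)]
c = [ai + yi + random.gauss(0, 0.5) for ai, yi in zip(a, y)]  # common effect

corr_all = pearson(a, y)

# Conditioning on the collider: keep only the units with C > 1.
selected = [(ai, yi) for ai, yi, ci in zip(a, y, c) if ci > 1.0]
corr_selected = pearson([s[0] for s in selected], [s[1] for s in selected])

print("corr(A, Y) in the full sample:", round(corr_all, 3))
print("corr(A, Y) given C > 1       :", round(corr_selected, 3))
```

In the full sample the correlation should be close to zero, while in the selected subsample it should be clearly negative: among units with a large $C$, a small $A$ has to be compensated by a large $Y$.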


### <font color='blue'> **What causal structures can remove a source of dependence between variables?**</font>

**Conditioning on a shared common cause** 

Conditioning on a causal intermediate or shared common cause between $A$ and $Y$ will remove that source of dependence.

![img](img/ch4/SEM-Observe-Confounder.png)


### <font color='blue'> **When does a SCM imply that variables are independent?**</font>

**$A$ is independent of $Y$** if there is no path between $A$ and $Y$.

![img](img/ch4/SEM-Observe-Independ-nopath.png)

**$A$ is independent of $Y$** if all paths between $A$ and $Y$ are “blocked” by a collider.

![img](img/ch4/SEM-Observe-Independ-blocked.png)

**$A$ is independent of $Y$ given $W$** if $W$ blocks all unblocked paths and doesn’t create any new unblocked paths. In other words, conditioning on a non-collider blocks a path.

![img](img/ch4/SEM-Observe-Independ-condition.png)

**$A$ is not necessarily independent of $Y$ given $W$** if $W$ is a collider (or a descendant of a collider): conditioning on it opens a path.

![img](img/ch4/SEM-Observe-Independ-condition-collider.png)
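
These blocking rules are the d-separation criterion, and they can be checked mechanically. The following is a minimal sketch that uses the standard "ancestral moral graph" construction rather than enumerating paths. It assumes a DAG given by parent sets in which every common cause is an explicit node (no double-headed arrows); the function names and example graphs are hypothetical.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, a, y, given=()):
    """True if `a` and `y` are d-separated by `given` in the DAG `parents`."""
    given = set(given)
    # 1. Restrict the graph to a, y, the conditioning set and their ancestors.
    keep = ancestors(parents, {a, y} | given)
    # 2. Moralize: link each child to its parents and "marry" co-parents.
    neighbours = {node: set() for node in keep}
    for child in keep:
        pas = list(parents.get(child, ()))
        for p in pas:
            neighbours[p].add(child)
            neighbours[child].add(p)
        for p, q in combinations(pas, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)
    # 3. Search for an undirected path from a to y that avoids `given`.
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(neighbours[node])
    return True


# Shared common cause: A <- W -> Y.
fork = {"W": [], "A": ["W"], "Y": ["W"]}
# Collider with a descendant: A -> C <- Y, C -> D.
collider = {"A": [], "Y": [], "C": ["A", "Y"], "D": ["C"]}

print(d_separated(fork, "A", "Y"))                  # path through W is open
print(d_separated(fork, "A", "Y", given=["W"]))     # conditioning on W blocks it
print(d_separated(collider, "A", "Y"))              # collider blocks the path
print(d_separated(collider, "A", "Y", given=["C"])) # conditioning opens it
print(d_separated(collider, "A", "Y", given=["D"])) # so does a descendant
```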

<br/><br/>

## Total Causal Effect

We define the existence of a total causal effect as follows based on [Judea Pearl, 2009](https://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf) and Chapter 6 of [Elements of Causal Inference](https://mitpress.mit.edu/books/elements-causal-inference).

<div class="alert alert-block alert-info">

**Definition 4.4. (Total Causal Effect):** Given an SCM $\mathfrak{C}$ over nodes $\mathbf{X}$, there is a total causal effect from $X$ to $Y$ if and only if, for some random variable $\tilde{U}_{X}$:

$$X \not\!\perp\!\!\!\perp Y \quad \text{in } P_{\mathbf{X}}^{\mathfrak{C} ; do\left(X:=\tilde{U}_{X}\right)}$$

</div>

<br/>

The existence of a total causal effect is also related to the existence of a directed path in the corresponding graph $\mathfrak{G}$. The correspondence, however, is not one-to-one. While a directed path is necessary for a total causal effect, it is not sufficient.

<br/>

<div class="alert alert-block alert-info">

**Proposition 4.5 (Graphical criteria for total causal effects):** Assume we are given an SCM $\mathfrak{C}$ with a corresponding graph $\mathfrak{G}$.

- (i) If there is no directed path from $X$ to $Y$ , then there is no total causal effect.
- (ii) Sometimes there is a directed path but no total causal effect.

</div>
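
To illustrate (ii), consider the following hypothetical SCM (an illustrative construction, with jointly independent noise variables), in which two directed paths cancel each other exactly:

$$
\begin{aligned}
X &= U_X \\
Z &= X + U_Z \\
Y &= Z - X + U_Y
\end{aligned}
$$

The graph contains the directed paths $X \to Z \to Y$ and $X \to Y$. Substituting the equation for $Z$ into the equation for $Y$ gives

$$
Y = (X + U_Z) - X + U_Y = U_Z + U_Y .
$$

Under any intervention $do(X := \tilde{U}_X)$, with $\tilde{U}_X$ independent of $(U_Z, U_Y)$, the variable $Y$ is therefore independent of $X$. So there is a directed path from $X$ to $Y$ but no total causal effect.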


## References

The contents of this chapter are highly inspired by the [Elements of Causal Inference (Open Access) book](https://mitpress.mit.edu/books/elements-causal-inference) by Jonas Peters, Dominik Janzing and [Bernhard Schölkopf](https://www.is.mpg.de/~bs).

We also used examples from the [Introduction to Causal Inference course](https://www.ucbbiostat.com) by Maya L. Petersen & Laura B. Balzer, UC Berkeley.

Bruno Gonçalves has a helpful [blog](https://medium.data4sci.com/causal-inference-part-iv-structural-causal-models-df10a83be580) on SEM too.