## ***"CAUSALITY FOR MACHINE LEARNING"*** by Bernhard Schölkopf

> "It argues that the hard open problems of machine learning and AI are intrinsically related to causality, and explains how the field is beginning to understand them."

Problem: 
> "if we compare what machine learning can do to what animals accomplish, we observe that the former is rather bad at some crucial feats where animals excel."

- generalization
    - *"interventions in the world, domain shifts, temporal structure; by and large, we consider these factors a nuisance and try to engineer them away."*
- thinking
    - *"thinking in the sense of Konrad Lorenz, i.e., acting in an imagined space."*

- The Mechanization of Information Processing
- From Statistical to Causal Models
    - Structural causal models (SCMs)
- Levels of Causal Modelling
- Independent Causal Mechanisms
    - Independent Causal Mechanisms (ICM) Principle.
    - Measures of dependence of mechanisms
    - Algorithmic independence
- Cause-Effect Discovery
- Half-Sibling Regression and Exoplanet Detection
- Invariance, Robustness, and Semi-Supervised Learning
    - Semi-supervised learning (SSL)
    - Adversarial vulnerability
    - Multi-task learning
    - Reinforcement Learning
- Causal Representation Learning
    - Learning transferable mechanisms
    - Learning disentangled representations
- Bernhard's Personal Notes and Conclusion


### 1. The Mechanization of Information Processing

- The first industrial revolution:
    - Generate and convert forms of **"energy"**
- The Digital to AI revolution:
    - ***"Cybernetics"***. It replaced energy by information.

> When machine learning is applied in industry, we often convert user data into predictions about future user behavior and thus money. Money may ultimately be a form of information, a view not inconsistent with the idea of bitcoins generated by solving cryptographic problems. The first industrial revolutions rendered energy a universal currency (Smil, 2017); the same may be happening to information.



Similar to the energy revolution, the information revolution has two parts:

- Industry (hardware and software):
    > the first one built on the advent of electronic computers, the development of high level programming languages, and the birth of the field of computer science, engendered by the vision to create AI by manipulation of symbols.

- Learning (information extraction):
    > The second one, which we are currently experiencing, relies upon learning. It allows to extract information also from unstructured data, and it automatically infers rules from data rather than relying on humans to conceive of and program these rules.
    

- Judea Pearl: classic AI with probability theory (Pearl, 1988)
    > graphical models

- Vladimir Vapnik: supervised machine learning (Vapnik, 1998)
    > IT industry, and the second part is transforming IT companies into “AI first” as well as creating an industry around data collection and “clickwork.” While the latter provides labelled data for the current workhorse of AI, supervised machine learning (Vapnik, 1998).

- (Klein, 1872; MacLane, 1971)
    > Symmetry transformations and defining objects by their behavior:
    
- (Chen and Cheung, 2018; Dai, 2018)
    > In the current data-driven phase of this revolution, China is beginning to use machine learning to observe and incentivize citizens to behave in ways deemed beneficial.

### 2. From Statistical to Causal Models

- Methods driven by independent and identically distributed (IID) data, e.g., (LeCun et al., 2015)
    > ***"a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent."***

- trends:
    1. we have massive amounts of data, often from simulations or large scale human labeling
    2. we use high capacity machine learning systems (i.e., complex function classes with many adjustable parameters) 
    3. we employ high performance computing systems (often ignored, but crucial when it comes to causality) 
    4. the problems are IID (independent and identically distributed).
        - (e.g., image recognition using benchmark datasets)
        - artificially made IID, e.g., DeepMind’s “experience replay” (Mnih et al., 2015)
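
To make the "artificially made IID" point concrete, here is a minimal sketch of a replay buffer. It is not DeepMind's implementation: the class name `ReplayBuffer`, the capacity, and the `(state, next_state)` transitions are all hypothetical. The idea it illustrates is that sampling uniformly from stored experience breaks the temporal ordering of consecutive transitions, so each training batch looks closer to an IID draw.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled past transitions."""

    def __init__(self, capacity, seed=0):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement: the batch no longer follows
        # the order in which the transitions were experienced.
        return self.rng.sample(list(self.buffer), batch_size)


buffer = ReplayBuffer(capacity=1000)
for t in range(1000):
    buffer.add((t, t + 1))  # hypothetical transition: (state, next_state)

batch = buffer.sample(32)
```

The batch still comes from a single trajectory, so this only approximates the IID setting; it does not remove the underlying temporal structure of the environment.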


#### Why IID?

-  IID data + statistical learning theory = strong universal consistency results
    - Nearest Neighbor Classifiers 
    - Support Vector Machines 

(Vapnik, 1998; Schölkopf and Smola, 2002; Steinwart and Christmann, 2008)

Otherwise:
- “adversarial vulnerability”
- “defense mechanisms”

Problem in recommender systems:
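- Suppose Alice is looking for a laptop keyboard on the internet (or any keyboard), and the web shop’s recommendation system suggests that she should buy a laptop to go along with the keyboard. This is odd: she probably already owns a laptop, otherwise she would not be looking for the keyboard.
- Knowing whether a customer bought a laptop reduces our uncertainty about whether she also bought a keyboard, and vice versa, by the same amount (the mutual information), so the directionality of cause and effect is lost.
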
> Recommending an item to buy constitutes an intervention in a system, taking us outside the IID setting. We no longer work with the observational distribution, but a distribution where certain variables or mechanisms have changed. This is the realm of causality.
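
The symmetry of mutual information can be checked numerically. The sketch below uses a made-up joint distribution over two binary purchase indicators (the probabilities are assumptions chosen for illustration, not data) and computes the uncertainty reduction in both directions from entropies.

```python
import math

# Hypothetical joint distribution over (bought_laptop, bought_keyboard).
joint = {
    (0, 0): 0.50,
    (0, 1): 0.05,
    (1, 0): 0.15,
    (1, 1): 0.30,
}


def marginal(joint, axis):
    """Marginal distribution of the variable at position `axis`."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint, given_axis):
    """H(other | given) = H(joint) - H(given)."""
    return entropy(joint) - entropy(marginal(joint, given_axis))


h_laptop = entropy(marginal(joint, 0))
h_keyboard = entropy(marginal(joint, 1))

# Uncertainty about one purchase removed by observing the other.
info_keyboard_from_laptop = h_keyboard - conditional_entropy(joint, given_axis=0)
info_laptop_from_keyboard = h_laptop - conditional_entropy(joint, given_axis=1)
```

Both quantities equal H(laptop) + H(keyboard) - H(laptop, keyboard), so they agree regardless of which purchase actually caused the other. A purely statistical measure of dependence therefore cannot tell the recommender which item to suggest.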

- Reichenbach (1956) connection between causality and statistical dependence. 

***"Common Cause Principle"***: if two observables X and Y are statistically dependent, then there exists a variable Z that causally influences both and explains all the dependence in the sense of making them independent when conditioned on Z.
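
A small simulation can illustrate the principle. The sketch below assumes a linear-Gaussian toy system (an assumption made here, not part of the principle itself) in which a hidden Z drives both X and Y. In that special case, conditioning on Z can be approximated by regressing Z out of both variables and correlating the residuals.

```python
import random


def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def residuals(target, regressor):
    """Residuals of the least-squares fit target ~ a + b * regressor."""
    mt, mr = sum(target) / len(target), sum(regressor) / len(regressor)
    cov = sum((t - mt) * (r - mr) for t, r in zip(target, regressor))
    var = sum((r - mr) ** 2 for r in regressor)
    slope = cov / var
    return [(t - mt) - slope * (r - mr) for t, r in zip(target, regressor)]


rng = random.Random(0)
n = 20000

z = [rng.gauss(0.0, 1.0) for _ in range(n)]   # common cause
x = [zi + rng.gauss(0.0, 1.0) for zi in z]    # X depends only on Z and its own noise
y = [zi + rng.gauss(0.0, 1.0) for zi in z]    # Y depends only on Z and its own noise

corr_xy = pearson(x, y)                                        # clearly non-zero
corr_xy_given_z = pearson(residuals(x, z), residuals(y, z))    # close to zero
```

X and Y never influence each other, yet they are correlated. Once Z is accounted for, the remaining correlation is close to zero. Outside the linear-Gaussian case, vanishing partial correlation does not imply conditional independence, so this check is only a sketch.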

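***"Structural causal models (SCMs)"***: these arise when we model functional relationships rather than probability distributions directly. We are given a set of ***observables*** X1, ..., Xn (modelled as random variables) associated with the vertices of a directed acyclic graph (DAG) G. We assume that each observable is the result of an assignment that depends deterministically, through fi, on its parents PAi in the graph and on a stochastic unexplained variable Ui, where the noises U1, ..., Un are jointly independent: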

```latex
X_i := f_i(\mathbf{PA}_i, U_i), \quad (i = 1, \dots, n) \tag{1}
```
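
As a rough sketch of how such assignments generate data, the code below encodes a hypothetical three-variable SCM. The graph, the linear mechanisms, and the Gaussian noises are all assumptions chosen for illustration. Each variable is computed from its parents and its own independent noise, in a topological order of the DAG.

```python
import random

# Hypothetical SCM on the DAG X1 -> X2 -> X3 with an extra edge X1 -> X3.
# Each entry maps a variable to (parents PA_i, mechanism f_i); the mechanism
# receives the parent values followed by the variable's own noise term U_i.
SCM = {
    "X1": ((), lambda u: u),
    "X2": (("X1",), lambda x1, u: 2.0 * x1 + u),
    "X3": (("X1", "X2"), lambda x1, x2, u: x1 - 0.5 * x2 + u),
}
ORDER = ["X1", "X2", "X3"]  # a topological order of the DAG


def sample(scm, order, rng):
    """Draw one joint sample by evaluating the assignments in causal order."""
    values = {}
    for name in order:
        parents, mechanism = scm[name]
        noise = rng.gauss(0.0, 1.0)  # a fresh, independent U_i for every variable
        values[name] = mechanism(*(values[p] for p in parents), noise)
    return values


rng = random.Random(0)
data = [sample(SCM, ORDER, rng) for _ in range(5)]
```

Storing parents and mechanisms separately per variable mirrors the modular structure of the assignments: changing one entry of `SCM` leaves every other mechanism untouched.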


> 1. SCM language makes it straightforward to formalize interventions as operations that modify a subset of assignments (1), e.g., changing Ui, or setting fi (and thus Xi) to a constant (Pearl, 2009a; Spirtes et al., 2000). 

> 2. Second, the graph structure along with the joint independence of the noises implies a canonical factorization of the joint distribution entailed by (1) into causal conditionals that we will refer to as the causal (or disentangled) factorization,

![image.png](./eq2.png)
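
The intervention described in point 1 can be sketched on a toy chain. Everything here is hypothetical: the chain X1 -> X2 -> X3, the coefficients, and the helper names `sample` and `draw`. Setting X2 to a constant replaces only the assignment for X2; the mechanisms for X1 and X3 are reused unchanged. Both regimes use the same seed, so they share the same noise values and differ only in the intervened assignment.

```python
import random


def sample(rng, do_x2=None):
    """One draw from a hypothetical SCM X1 -> X2 -> X3.

    If do_x2 is given, the assignment for X2 is replaced by that constant.
    """
    u1, u2, u3 = (rng.gauss(0.0, 1.0) for _ in range(3))
    x1 = u1                      # X1 := U1
    if do_x2 is None:
        x2 = 2.0 * x1 + u2       # X2 := 2 * X1 + U2
    else:
        x2 = do_x2               # intervention: X2 := constant
    x3 = x2 + u3                 # X3 := X2 + U3 (unchanged in both regimes)
    return x1, x2, x3


def draw(n, seed, do_x2=None):
    rng = random.Random(seed)
    return [sample(rng, do_x2) for _ in range(n)]


def mean(values):
    return sum(values) / len(values)


observational = draw(5000, seed=0)
interventional = draw(5000, seed=0, do_x2=3.0)

mean_x3_obs = mean([s[2] for s in observational])   # near 0
mean_x3_do = mean([s[2] for s in interventional])   # near 3
```

The upstream variable X1 is identical in both regimes, while the downstream X3 shifts with the intervened value. This is the sense in which an intervention changes one factor of the causal factorization and leaves the others alone.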


### 4. Levels of Causal Modelling

![image.png](./tbl1.png)

### 5. Independent Causal Mechanisms

> disentangled factorization (2) of the joint distribution p(X1, ..., Xn). This factorization according to the causal graph is always possible when the Ui are independent.

![image.png](./chair.png)

- One can express the above insights as follows (Schölkopf et al., 2012; Peters et al., 2017):

> ***"Independent Causal Mechanisms (ICM) Principle. The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other."***

> "In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms."
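
One way to see what the principle implies is to change the distribution of a cause and check which conditional stays the same. The sketch below assumes a toy linear mechanism `effect = 2 * cause + noise` (an illustrative assumption, not an example from the paper) and two environments that differ only in the spread of the cause.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def environment(cause_std, n, rng):
    """Same mechanism in every environment; only p(cause) differs."""
    cause = [rng.gauss(0.0, cause_std) for _ in range(n)]
    effect = [2.0 * c + rng.gauss(0.0, 1.0) for c in cause]
    return cause, effect


rng = random.Random(0)
cause_a, effect_a = environment(1.0, 20000, rng)
cause_b, effect_b = environment(3.0, 20000, rng)

# Causal direction: effect regressed on cause.
causal_a = slope(cause_a, effect_a)
causal_b = slope(cause_b, effect_b)

# Anticausal direction: cause regressed on effect.
anticausal_a = slope(effect_a, cause_a)
anticausal_b = slope(effect_b, cause_b)
```

The causal slope is about 2 in both environments, because the mechanism carries no information about the distribution of its input. The anticausal slope moves from roughly 0.40 to roughly 0.49, since that conditional mixes the mechanism with the cause distribution.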

- ***"Measures of dependence of mechanisms"***
    > Algorithmic independence: links between causal and statistical structures.
    - ***causal structure***