This repository contains the sources for my bachelor's thesis at ITMO University (in the international laboratory "Computer Technologies"). I wrote it in collaboration with the CoreML team at VK, one of the most popular social networks in Eastern Europe. The thesis is dedicated to causal inference (CI) between different metrics of social networks, which are usually obtained via A/B testing. In this repository I have published the different approaches I tried for this purpose.
There are two articles on Medium about this project:
- The simple article gives an intuition about causal inference. It also states the results, but doesn't go into too much detail. If you have never heard about causal inference, you are welcome to read this one.
- The [Advanced](https://medium.com/@itmo.mllab/causal-inference-in-the-context-of-social-network-metrics-analysis-edf541cccb51) article describes the classical algorithms and the proposed modification more formally. There is no need to read the simple version if you are already somewhat familiar with the area.
In the root package src you can find some basic approaches: proxy metrics and instrumental variables with L0-regularization. There is also a more complicated method based on random experiments in this notebook. It is intended for choosing the metrics whose sign stays the most stable across a lot of random experiments, which helps to overcome difficulties caused by the bias of ordinary regression.
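To give a rough idea of the random-experiments approach, here is a minimal sketch (my own illustration with made-up parameter names, not the code from the notebook): refit a linear regression on many random subsamples and measure how often each coefficient keeps its sign.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def sign_stability(X, y, n_experiments=100, subsample=0.5, seed=0):
    """Fit a regression on many random subsamples of the data and
    measure how stable the sign of each coefficient is."""
    rng = np.random.default_rng(seed)
    signs = np.zeros((n_experiments, X.shape[1]))
    for i in range(n_experiments):
        idx = rng.choice(X.shape[0], size=int(subsample * X.shape[0]),
                         replace=False)
        signs[i] = np.sign(LinearRegression().fit(X[idx], y[idx]).coef_)
    # |mean sign| equals 1.0 when a coefficient's sign never flips
    stability = np.abs(signs.mean(axis=0))
    return np.argsort(-stability), stability
```

Metrics ranked first by this procedure are the ones whose relationship with the target is least sensitive to which subsample happened to be drawn.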
Let's examine the causality package. I started by examining some greedy methods. Although an optimal tree can be built in polynomial time, optimal graph inference is an NP-hard problem, which is why these algorithms are time-consuming. However, the important idea of maximizing the likelihood of graphs appears here, and you can find its usage below.
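As a toy illustration of likelihood-based greedy search (my own sketch, not the code in the causality package): score each node by the Gaussian log-likelihood of its regression on its parents, and greedily add the edge with the best BIC-penalized gain.

```python
import numpy as np

def node_ll(data, node, parents):
    """Gaussian log-likelihood of `node` given its `parents`
    (up to constants that are equal for all candidate graphs)."""
    y = data[:, node]
    X = (np.column_stack([data[:, sorted(parents)], np.ones(len(y))])
         if parents else np.ones((len(y), 1)))
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return -0.5 * len(y) * np.log(resid.var() + 1e-12)

def creates_cycle(parents, src, dst):
    """True if adding src -> dst closes a directed cycle,
    i.e. dst is already an ancestor of src."""
    stack, seen = [src], set()
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def greedy_dag(data, max_parents=3):
    """Hill-climbing: repeatedly add the single edge with the best
    BIC-penalized likelihood gain until no edge improves the score."""
    n, d = data.shape
    penalty = 0.5 * np.log(n)        # BIC cost of one extra parameter
    parents = {v: set() for v in range(d)}
    score = {v: node_ll(data, v, parents[v]) for v in range(d)}
    while True:
        best_gain, best_edge = 0.0, None
        for a in range(d):
            for b in range(d):
                if (a == b or a in parents[b]
                        or len(parents[b]) >= max_parents
                        or creates_cycle(parents, a, b)):
                    continue
                gain = node_ll(data, b, parents[b] | {a}) - score[b] - penalty
                if gain > best_gain:
                    best_gain, best_edge = gain, (a, b)
        if best_edge is None:
            return parents
        a, b = best_edge
        parents[b].add(a)
        score[b] = node_ll(data, b, parents[b])
```

Even with the greedy shortcut, every step scans all O(d²) candidate edges, which hints at why these methods get slow on wide metric sets.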
The main part of my thesis is in the package pc. There I started with the classic causality algorithms: PC and FCI. I used an excellent library, pcalg. Since I didn't want to migrate the whole project from Python to R, I used rpy2 to call pcalg from R. So you can find a class for running the PC, FCI and FCI+ algorithms here, along with explained examples of usage. I also provided parsing of the logs generated by pcalg in LogParser.
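The wrapper class in the repository is the authoritative version; a minimal rpy2 sketch of such a call could look like this (argument names follow pcalg's documented `pc()`/`gaussCItest` interface; `run_pc` itself is a hypothetical helper):

```python
import numpy as np
from rpy2.robjects import ListVector, StrVector, numpy2ri, r
from rpy2.robjects.packages import importr

numpy2ri.activate()        # auto-convert numpy arrays to R matrices
pcalg = importr('pcalg')

def run_pc(data: np.ndarray, labels, alpha: float = 0.05):
    """Call pcalg::pc on an (n_samples, n_vars) matrix using the
    built-in Gaussian CI test based on partial correlations."""
    suff_stat = ListVector({'C': np.corrcoef(data, rowvar=False),
                            'n': data.shape[0]})
    fit = pcalg.pc(suffStat=suff_stat,
                   indepTest=pcalg.gaussCItest,
                   labels=StrVector(labels),
                   alpha=alpha)
    r['print'](fit)        # pcalg prints the estimated CPDAG
    return fit
```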
The output of the FCI algorithms is a bit tricky. It's not a classic oriented graph but an inducing path graph. Moreover, as causality algorithms are only able to infer an equivalence class, the FCI output is an equivalence class of inducing path graphs, or a PAG (partial ancestral graph). Because of these different semantics I've written the pag package to work with them.
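To make the different semantics concrete, here is a toy PAG representation (my own illustration, not the actual pag package API): unlike a DAG edge, a PAG edge carries a separate mark at each endpoint.

```python
from enum import Enum

class Mark(Enum):
    CIRCLE = 'o'   # endpoint orientation undetermined in the equivalence class
    ARROW = '>'    # the variable at this endpoint is NOT an ancestor of the other
    TAIL = '-'     # the variable at this endpoint IS an ancestor of the other

class PAG:
    """A partial ancestral graph: each edge has a mark at both ends,
    e.g. a o-> b means 'b is not an ancestor of a; nothing is known
    about a's end of the edge'."""
    def __init__(self):
        self.marks = {}                     # (u, v) -> mark at the v end

    def add_edge(self, u, v, mark_at_u, mark_at_v):
        self.marks[(v, u)] = mark_at_u
        self.marks[(u, v)] = mark_at_v

    def is_directed(self, u, v):
        """u -> v: tail at u and arrowhead at v."""
        return (self.marks.get((v, u)) is Mark.TAIL
                and self.marks.get((u, v)) is Mark.ARROW)

pag = PAG()
pag.add_edge('a', 'b', Mark.CIRCLE, Mark.ARROW)   # the edge a o-> b
```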
All causality algorithms rely on conditional independence tests. I tried to examine some of them in the package independence: FCIT and KCIT. Unfortunately, they don't suit my purposes as they depend on the number of samples. Moreover, the pcalg library natively implements a Gaussian conditional independence test based on partial correlation. You can find my Python implementation here. Of course, I didn't use it when calling pcalg, but I did use it for my new method of edge orientation, which I describe below.
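For reference, the standard Fisher z-test behind such a Gaussian CI test fits in a few lines (a generic sketch, not necessarily identical to the implementation in the repository):

```python
import numpy as np
from scipy.stats import norm

def gauss_ci_pvalue(data, i, j, cond=()):
    """Test X_i _||_ X_j | X_cond under a Gaussian model. The partial
    correlation is read off the inverse of the correlation matrix of
    the involved variables, then Fisher z-transformed."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)
    rho = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    rho = np.clip(rho, -1 + 1e-12, 1 - 1e-12)
    z = 0.5 * np.log((1 + rho) / (1 - rho))       # Fisher z-transform
    stat = np.sqrt(data.shape[0] - len(cond) - 3) * abs(z)
    return 2 * (1 - norm.cdf(stat))               # two-sided p-value
```

A large p-value means the data are consistent with conditional independence given the conditioning set.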
I also tried to reproduce the idea of many random experiments, previously applied to linear regression, now with graphs. I supposed that the most stable edges could be determined via these experiments. Specifically, I considered the edges between short metrics and one chosen long metric (for each long metric). I can't say anything good or bad about this idea, as it's hard to evaluate. You can read about the difficulties of comparison below.
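Schematically the procedure is a form of stability selection (a sketch under my own naming; `learn_graph` is a hypothetical placeholder for any structure-learning call that returns a set of edges, e.g. the PC wrapper sketched above):

```python
import numpy as np
from collections import Counter

def stable_edges(data, learn_graph, n_runs=50, subsample=0.7,
                 min_freq=0.8, seed=0):
    """Re-learn the graph on many random subsamples and keep the
    edges that appear in at least `min_freq` of the runs."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(n_runs):
        idx = rng.choice(len(data), size=int(subsample * len(data)),
                         replace=False)
        counts.update(learn_graph(data[idx]))
    return {e for e, c in counts.items() if c / n_runs >= min_freq}
```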
As I said previously, I've come up with a new method of edge orientation. It was motivated by the enormous number of edges produced by both the PC and FCI algorithms. I noticed that most of them were oriented as colliders (or v-structures), i.e. structures like a -> b <- c. In the classic algorithms these edges are oriented this way if b is not in a separating set of a and c (S_ac). In my implementation I allow b into the separating set as long as it does not decrease the probability of conditional independence of a and c. The code is here.
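One way to read that rule in code (my own paraphrase of the modification, reusing the `gauss_ci_pvalue` sketch from above; the authoritative version is the linked code):

```python
def orient_as_collider(data, a, b, c, sepset):
    """Classic rule: orient a -> b <- c whenever b is not in sepset(a, c).
    Modified rule sketched here: additionally require that conditioning
    on b makes a and c look *more* dependent, i.e. adding b to the
    separating set decreases the independence p-value."""
    if b in sepset:
        return False
    p_without = gauss_ci_pvalue(data, a, c, cond=tuple(sepset))
    p_with = gauss_ci_pvalue(data, a, c, cond=tuple(sepset) + (b,))
    return p_with < p_without
```

Under this reading, a vertex b that leaves the independence of a and c intact no longer forces a collider, which prunes many of the spurious v-structure orientations.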
Comparing causality algorithms is an extremely non-trivial task; there is a great paper about it. Usually a researcher has a choice: test approaches unconvincingly on real data (visual inspection, use of prior knowledge) or test them convincingly on (semi-)synthetic data. As one can guess, both approaches are questionable. So in the package methodsVerification I tried to examine both: I estimated the likelihood of graphs inferred from real data, and I estimated classification metrics on semi-synthetic data (which is needed for obtaining ground truth). I also tried to use explicit A/B testing and prior knowledge, but this approach turned out to be very laborious, and I can't pursue it right now.
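The classification-metrics view can be made concrete as follows (a sketch under my own conventions: both graphs given as adjacency matrices, and skeleton edges treated as binary predictions):

```python
import numpy as np

def edge_classification_metrics(true_amat, est_amat):
    """Treat every unordered pair of vertices as a binary
    'edge / no edge' decision and compute precision and recall
    of the estimated skeleton against the ground-truth graph."""
    iu = np.triu_indices_from(true_amat, k=1)
    y_true = (true_amat + true_amat.T)[iu] > 0
    y_pred = (est_amat + est_amat.T)[iu] > 0
    tp = np.sum(y_true & y_pred)
    precision = tp / max(y_pred.sum(), 1)
    recall = tp / max(y_true.sum(), 1)
    return precision, recall
```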