Competition link: https://www.kaggle.com/competitions/aial-causal-inference

### Challenge
This challenge requires identifying the causal direction between two synthetically generated one-dimensional samples X and Y, where either Y = f(X) + E (denoted X->Y) or vice versa.
Here f is a non-linear function and E is noise, independent of X.
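
To make the setup concrete, here is a minimal sketch of how one such pair could be generated. The function name `make_pair`, the choice f = tanh, the uniform cause, and the Gaussian noise are all assumptions for illustration; the actual generator used by the competition is not described here.

```python
import numpy as np

def make_pair(n=200, direction=0, seed=0):
    """Return one synthetic (X, Y) sample.

    direction=0 encodes X->Y, direction=1 encodes Y->X.
    The choice f = tanh, the uniform cause, and the Gaussian noise
    are assumptions made only for this illustration.
    """
    rng = np.random.default_rng(seed)
    cause = rng.uniform(-2.0, 2.0, size=n)
    noise = rng.normal(0.0, 0.2, size=n)    # E, drawn independently of the cause
    effect = np.tanh(2.0 * cause) + noise   # effect = f(cause) + E
    if direction == 0:
        return cause, effect                # X is the cause
    return effect, cause                    # Y is the cause

x, y = make_pair(direction=0)
print(x[:3], y[:3])
```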

### Data
The challenge data consist of two sets of (X, Y) pairs: public and private.
The public data set consists of 1000 samples of (X, Y) pairs stored as csv files, each with its associated (random) causal direction (X->Y or Y->X) indicated in the first line of the corresponding csv file.

The private data set is generated according to the same rules and format, but it is used only for model evaluation.

### Format
Each sample of pairs is saved as a separate csv file, [id].csv, where id is an integer from 0 to 1000 (the sample ID). The ground truth is saved in ground_truth.csv (if provided).
In addition, a plot of each sample is provided as [ID].png (for data review).

### Evaluation
Participants are required to construct a model F with high accuracy, where F(X,Y) = 0 if X->Y and F(X,Y) = 1 if Y->X. The accuracy evaluated on the private data set will be used as the performance measure.
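
The accuracy measure can be sketched as the share of samples whose predicted label matches the true one. The helper name `accuracy` and the toy label lists are hypothetical; the official scoring code is not shown in the competition description.

```python
def accuracy(predictions, ground_truth):
    """Fraction of samples where the predicted label equals the true label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")
    hits = sum(int(p == t) for p, t in zip(predictions, ground_truth))
    return hits / len(predictions)

# Labels use the encoding 0 for X->Y and 1 for Y->X (toy values)
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(score)
```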

In [513]:
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import numpy as np
from statsmodels.formula.api import ols
from statsmodels.stats.stattools import omni_normtest
from numpy import cov
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from collections import Counter

# 0 if X -> Y
# 1 if Y -> X

In [41]:
gt = pd.read_csv("public/ground_truth.csv", header = None).iloc[:, 0]
# Map sample ID -> true label (0 if X -> Y, 1 if Y -> X)
results = gt.to_dict()

In [511]:
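# Sweep the AIC multiplier from 0.95 to 1.04 and count correct predictions on the public set
for m in range(-5,5,1):
    multi = 1 + m/100
    count = 0
    for i in range(1000):
        d = pd.read_csv("public/"+str(i)+".csv", header = None, names = ['X', 'Y'])
        # First differences of each series; the first row becomes NaN and is dropped
        d["changex"] = d.X - d.X.shift()
        d["changey"] = d.Y - d.Y.shift()
        d = d.dropna()
        # Regress each differenced series on the other
        modelx = ols(formula = 'changex~changey', data=d).fit()
        modely = ols(formula = 'changey~changex', data=d).fit()
        # Predict 1 (Y -> X) when modelx has the lower scaled AIC, else 0 (X -> Y)
        if int(modelx.aic < modely.aic*multi) == results[i]:
            count+=1
    print(multi, count)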

0.95 687
0.96 691
0.97 698
0.98 700
0.99 703
1.0 705
1.01 708
1.02 712
1.03 709
1.04 707


### No information rate
Always predicting the larger class gives 50.8% accuracy.
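
The no-information rate can be computed directly from the label counts. This is a small sketch: `no_information_rate` is a hypothetical helper, and the toy label vector is made up to reproduce a 50.8% majority share. In the notebook it would be applied to `gt`.

```python
from collections import Counter

def no_information_rate(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical label vector; which class is the larger one is an assumption here
toy_labels = [0] * 492 + [1] * 508
print(no_information_rate(toy_labels))
```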
### Baseline model
The model with no multiplier (multiplier 1.0) reaches 70.5% accuracy.
### 0.01 granularity
A multiplier of 1.02 gives the best accuracy, 71.2%.
### 0.001 granularity
Every multiplier from 1.020 to 1.025 reaches 71.2% accuracy.
### 0.0001 granularity
Every multiplier from 1.0197 to 1.0256 reaches 71.2% accuracy.

I chose 1.023 as my multiplier, which lies near the middle of that range.
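
Finer grids become cheap if the two AIC values per sample are computed once and only the comparison is repeated for each multiplier. The sketch below assumes the AICs are already stored in arrays; `sweep_multipliers` and the toy numbers are hypothetical. Note that the direction in which a multiplier shifts the decision depends on the sign of the AIC values.

```python
import numpy as np

def sweep_multipliers(aic_x, aic_y, labels, multipliers):
    """Count correct predictions for each multiplier.

    aic_x[i] is the AIC of changex~changey and aic_y[i] the AIC of
    changey~changex for sample i; the rule predicts 1 (Y->X) when
    aic_x < aic_y * multiplier, otherwise 0 (X->Y).
    """
    aic_x = np.asarray(aic_x, dtype=float)
    aic_y = np.asarray(aic_y, dtype=float)
    labels = np.asarray(labels, dtype=int)
    counts = {}
    for m in multipliers:
        predicted = (aic_x < aic_y * m).astype(int)
        counts[m] = int(np.sum(predicted == labels))
    return counts

# Toy AIC values (made up for illustration, not taken from the data)
toy = sweep_multipliers([100, 200, 150], [101, 190, 149], [1, 0, 1], [1.0, 1.02])
print(toy)
```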

In [514]:
multi = 1.023
count = 0
for i in range(1000):
    d = pd.read_csv("public/"+str(i)+".csv", header = None, names = ['X', 'Y'])
    d["changex"] = d.X - d.X.shift()
    d["changey"] = d.Y - d.Y.shift()
    d = d.dropna()
    #sns.scatterplot(data = d, x = "changex", y = "changey")
    modelx = ols(formula = 'changex~changey', data=d).fit()
    modely = ols(formula = 'changey~changex', data=d).fit()
    #print(modelx.aic, modely.aic, int(modelx.aic < modely.aic), results[i])
    # Strict comparison, matching the multiplier sweep and the submission cell
    if int(modelx.aic < modely.aic*multi) == results[i]:
        count+=1
    
str(count/10) + "%"

'71.2%'
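
The per-sample steps repeated in each loop (difference, fit both regressions, compare scaled AICs) can be gathered into one function. The sketch below uses NumPy only and computes the AIC from the Gaussian log-likelihood of a least-squares fit with two parameters; that this matches the `aic` reported by statsmodels is an assumption, and `ols_aic` and `predict_direction` are hypothetical names.

```python
import numpy as np

def ols_aic(response, regressor):
    """AIC of a Gaussian least-squares fit response ~ 1 + regressor."""
    n = len(response)
    design = np.column_stack([np.ones(n), regressor])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    ssr = float(np.sum((response - design @ coef) ** 2))
    loglik = -0.5 * n * (np.log(2.0 * np.pi * ssr / n) + 1.0)
    return -2.0 * loglik + 2.0 * 2   # two fitted parameters: intercept and slope

def predict_direction(x, y, multi=1.023):
    """Return 1 (Y->X) if AIC(changex~changey) < multi * AIC(changey~changex), else 0."""
    dx = np.diff(np.asarray(x, dtype=float))   # same values as X - X.shift() after dropna
    dy = np.diff(np.asarray(y, dtype=float))
    return int(ols_aic(dx, dy) < ols_aic(dy, dx) * multi)

# Synthetic pair for illustration only (f = tanh is an assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=300)
y = np.tanh(2.0 * x) + rng.normal(0.0, 0.2, size=300)
print(predict_direction(x, y))
```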

Other models I tried, which performed the same or worse:
1. Comparing by BIC
2. Comparing by log-likelihood
3. Comparing by MSE
4. Comparing by residual MSE
5. Comparing by regressors' p-values
6. Comparing by intercept
7. Covariance and correlation on X, Y
8. Covariance and correlation on the X change and Y change
9. Logistic regression model on the joint data
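
One likely reason several of these tie with the baseline: both regressions have the same number of observations and parameters, so AIC, BIC, negative log-likelihood, and residual MSE all increase with the residual sum of squares and therefore rank the two models identically when no multiplier is applied. The sketch below checks this on a synthetic pair; the formulas assume a Gaussian least-squares fit, and `fit_stats` is a hypothetical helper rather than the code used for the experiments above.

```python
import numpy as np

def fit_stats(response, regressor):
    """Fit response ~ 1 + regressor by least squares and return comparison criteria."""
    n = len(response)
    k = 2                                    # intercept and slope
    design = np.column_stack([np.ones(n), regressor])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    ssr = float(np.sum((response - design @ coef) ** 2))
    loglik = -0.5 * n * (np.log(2.0 * np.pi * ssr / n) + 1.0)
    return {
        "aic": -2.0 * loglik + 2.0 * k,
        "bic": -2.0 * loglik + k * np.log(n),
        "neg_loglik": -loglik,
        "resid_mse": ssr / (n - k),
    }

# Synthetic pair for illustration only (f = tanh is an assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=300)
y = np.tanh(2.0 * x) + rng.normal(0.0, 0.2, size=300)
dx, dy = np.diff(x), np.diff(y)

stats_x = fit_stats(dx, dy)   # changex ~ changey
stats_y = fit_stats(dy, dx)   # changey ~ changex
# 1 means the criterion prefers the changex~changey model (lower is better)
decisions = {name: int(stats_x[name] < stats_y[name]) for name in stats_x}
print(decisions)
```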

In [505]:
submission = []
ids = []
for i in range(1000):
    d = pd.read_csv("private/"+str(i)+".csv", header = None, names = ['X', 'Y'])
    d["changex"] = d.X - d.X.shift()
    d["changey"] = d.Y - d.Y.shift()
    d = d.dropna()
    modelx = ols(formula = 'changex~changey', data=d).fit()
    modely = ols(formula = 'changey~changex', data=d).fit()
    if modelx.aic < modely.aic*1.023:
        submission.append(1)
    else:
        submission.append(0)
    ids.append(i)

In [508]:
sub = pd.DataFrame({'id':ids, 'direction':submission})
sub.to_csv("MG.csv", header = True, index = False)