In [1]:
!pip install causalinference



In [2]:
import pandas as pd

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

from causalinference import CausalModel

In [3]:
df = pd.read_csv('malocclusion.csv')

In [4]:
df.head()

Unnamed: 0,dANB,dPPPM,dIMPA,dCoA,dGoPg,dCoGo,dT,Growth,Treatment
0,-3.2,-1.1,-4.2,1.0,4.0,3.7,5,0,0
1,-0.6,-0.5,3.8,2.6,-0.1,1.4,3,1,0
2,-1.6,-3.1,-6.0,4.3,4.2,7.1,5,0,0
3,-1.1,-2.1,-12.1,14.1,20.7,17.5,9,0,0
4,-1.1,0.0,-6.7,7.7,8.8,11.0,5,0,0


### Treatment on dANB

The minimal sufficient adjustment set for estimating the effect of Treatment on dANB is {Growth}, since we can't include the Unobserved Confounders in an adjustment set. {Growth} is a valid adjustment set for (Treatment, dANB) because:
1) {Growth} does not contain descendants of nodes on a directed path from Treatment to dANB
2) {Growth} blocks all non-causal paths from Treatment to dANB

That said, {Growth, dT} could also be a valid adjustment set, but we will proceed with the minimal sufficient set, {Growth}.
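Adjusting for a single binary variable such as Growth amounts to comparing treated and control outcomes within each stratum and averaging the differences by stratum size. The following is a minimal sketch of that idea in plain Python; the function name `backdoor_ate` and the toy rows are hypothetical and are not taken from the malocclusion data.

```python
from collections import defaultdict

def backdoor_ate(rows):
    """ATE of a binary treatment via stratification on one discrete covariate.

    rows: iterable of (treatment, covariate, outcome) tuples.
    """
    # Group outcomes by covariate stratum and treatment arm.
    strata = defaultdict(lambda: {0: [], 1: []})
    for t, g, y in rows:
        strata[g][t].append(y)

    n = sum(len(arms[0]) + len(arms[1]) for arms in strata.values())
    ate = 0.0
    for arms in strata.values():
        # Positivity: every stratum needs both treated and control units.
        if not arms[0] or not arms[1]:
            raise ValueError("stratum lacks treated or control units")
        diff = sum(arms[1]) / len(arms[1]) - sum(arms[0]) / len(arms[0])
        weight = (len(arms[0]) + len(arms[1])) / n  # share of units in stratum
        ate += weight * diff
    return ate

# Hypothetical toy rows: (Treatment, Growth, outcome).
toy = [
    (0, 0, 1.0), (0, 0, 3.0), (1, 0, 4.0),
    (0, 1, 5.0), (1, 1, 8.0), (1, 1, 10.0),
]
ate_toy = backdoor_ate(toy)
print(ate_toy)
```

On the notebook's data, the same function could be called with `zip(df['Treatment'], df['Growth'], df['dANB'])`; unlike a regression with additive terms, this stratified version does not assume the treatment effect is the same in both Growth groups.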

Let's now use linear regression to estimate ATE:

In [5]:
model = smf.ols('dANB ~ Treatment + Growth', data=df)
fitted = model.fit()
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                   dANB   R-squared:                       0.407
Model:                            OLS   Adj. R-squared:                  0.398
Method:                 Least Squares   F-statistic:                     48.04
Date:                Tue, 12 Oct 2021   Prob (F-statistic):           1.31e-16
Time:                        23:37:08   Log-Likelihood:                -251.17
No. Observations:                 143   AIC:                             508.3
Df Residuals:                     140   BIC:                             517.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.5600      0.181     -8.609      0.0

Using linear regression, we get an ATE of $\approx 1.86$. However, since linear regression only works well when the underlying model is truly linear, let's use propensity score weighting to see whether we get a similar result for the ATE:

In [6]:
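# Propensity model: P(Treatment = 1 | Growth), with calibrated probabilities
cls = LogisticRegression()
cls = CalibratedClassifierCV(cls)

X = df[['Growth']]
y = df['Treatment']  # 1-D target; a one-column DataFrame makes sklearn warn
cls.fit(X, y)
df['e'] = cls.predict_proba(X)[:, 1]  # estimated propensity score
# Inverse probability weights: 1/e for treated units, 1/(1-e) for controls
df['w'] = df['Treatment'] / df['e'] + (1 - df['Treatment']) / (1 - df['e'])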

In [7]:
model = smf.wls('dANB ~ Treatment + Growth', data=df, weights=df['w'])
fitted = model.fit()
print(fitted.summary())

                            WLS Regression Results                            
Dep. Variable:                   dANB   R-squared:                       0.386
Model:                            WLS   Adj. R-squared:                  0.378
Method:                 Least Squares   F-statistic:                     44.06
Date:                Tue, 12 Oct 2021   Prob (F-statistic):           1.44e-15
Time:                        23:37:08   Log-Likelihood:                -253.68
No. Observations:                 143   AIC:                             513.4
Df Residuals:                     140   BIC:                             522.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.5572      0.205     -7.609      0.0

As we can see, we get a similar estimate for the ATE, $\approx 1.86$, which is in line with our linear regression model from above.
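The inverse probability weights can also be used directly, without a regression, as a weighted difference in mean outcomes (the normalised, or Hájek, form of the estimator). The sketch below is a plain-Python illustration under the assumption of a discrete covariate, so that the propensity score is simply the treated share within each covariate level; the function name `ipw_ate` and the toy rows are hypothetical.

```python
def ipw_ate(rows):
    """Normalised (Hajek) inverse-probability-weighted ATE.

    rows: iterable of (treatment, covariate, outcome) tuples with a
    discrete covariate and a 0/1 treatment indicator.
    """
    rows = list(rows)
    # Propensity score e(g): share of treated units at covariate level g.
    counts = {}
    for t, g, _ in rows:
        n_g, n_treated = counts.get(g, (0, 0))
        counts[g] = (n_g + 1, n_treated + t)
    e = {g: n_treated / n_g for g, (n_g, n_treated) in counts.items()}

    num1 = den1 = num0 = den0 = 0.0
    for t, g, y in rows:
        if t == 1:
            w = 1.0 / e[g]          # treated units are weighted by 1/e
            num1 += w * y
            den1 += w
        else:
            w = 1.0 / (1.0 - e[g])  # control units are weighted by 1/(1-e)
            num0 += w * y
            den0 += w
    # Difference of the two weighted means.
    return num1 / den1 - num0 / den0

# Hypothetical toy rows: (Treatment, Growth, outcome).
toy = [
    (0, 0, 1.0), (0, 0, 3.0), (1, 0, 4.0),
    (0, 1, 5.0), (1, 1, 8.0), (1, 1, 10.0),
]
ipw_toy = ipw_ate(toy)
print(ipw_toy)
```

This sketch estimates the propensity score by within-group frequencies rather than by a calibrated logistic regression, so its weights would generally differ somewhat from the `w` column computed in the notebook.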

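Now let's estimate the ATET using matching on Mahalanobis distance: each treated unit is compared with its most similar control units, so the matched control sample is forced to resemble the treated one. With a single covariate (`Growth`), the choice of distance metric does not change which units count as nearest neighbours.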

In [8]:
adjustment_set = ['Growth']

model = CausalModel(
    Y=df['dANB'].values,
    D=df['Treatment'].values,
    X=df[adjustment_set].values
)

In [9]:
model.est_via_matching(bias_adj=True)
print(model.estimates)


Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      1.856      0.237      7.829      0.000      1.392      2.321
           ATC      1.860      0.240      7.761      0.000      1.390      2.330
           ATT      1.852      0.240      7.723      0.000      1.382      2.322



As we can see, the ATET estimate (labelled ATT in the output) is $\approx 1.85$, which is very similar to the ATE.
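To make the matching idea concrete, here is a minimal plain-Python sketch of one-nearest-neighbour matching for the ATET on a single covariate. It is an illustration only and not the `causalinference` implementation: it applies no bias adjustment and computes no standard errors, and averaging over tied controls is a choice made for this sketch. The name `matching_att` and the toy rows are hypothetical.

```python
def matching_att(rows):
    """ATET via nearest-neighbour matching on a single covariate.

    rows: iterable of (treatment, covariate, outcome) tuples.
    """
    rows = list(rows)
    treated = [(x, y) for t, x, y in rows if t == 1]
    controls = [(x, y) for t, x, y in rows if t == 0]
    if not treated or not controls:
        raise ValueError("need at least one treated and one control unit")

    effects = []
    for x_t, y_t in treated:
        # Smallest covariate distance from this treated unit to any control.
        d_min = min(abs(x_t - x_c) for x_c, _ in controls)
        # Average the outcomes of all controls tied at that distance.
        matched = [y_c for x_c, y_c in controls if abs(x_t - x_c) == d_min]
        effects.append(y_t - sum(matched) / len(matched))
    # Mean of the unit-level differences over the treated units only.
    return sum(effects) / len(effects)

# Hypothetical toy rows: (Treatment, Growth, outcome).
toy = [
    (0, 0, 1.0), (0, 0, 3.0), (1, 0, 4.0),
    (0, 1, 5.0), (1, 1, 8.0), (1, 1, 10.0),
]
att_toy = matching_att(toy)
print(att_toy)
```

Because only treated units are matched and averaged over, the result describes the effect on the treated population; matching the controls to treated units in the same way would give the ATC instead.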

### Treatment on Growth

As we can see from the DAG, there is no suitable adjustment set for estimating the effect of Treatment on Growth, since we can't adjust for the Unobserved Confounders. The DAG also contains no directed path from Treatment to Growth, so the causal effect it implies is zero ($ATE=ATET=0$). What we can do instead is use front-door adjustment to estimate whether there is any effect of Treatment on Growth via a proxy node in our DAG. For that, the proxy node would need to:

1. intercept all directed paths from Treatment to Growth
2. have no unblocked backdoor path from Treatment to the proxy node
3. have all backdoor paths from the proxy node to Growth blocked by Treatment

The first condition is problematic in our case, since there are no directed paths from Treatment to Growth. However, we can assume that some causal relationship exists and calculate the corresponding causal effect to test that assumption. In this case, only one node in the DAG would satisfy all the conditions: `dCoA`. If we assume a causal chain `Treatment`->`dCoA`->`Growth`, there are no unblocked backdoor paths from `Treatment` to `dCoA` (they are blocked by the colliders `dGoPg` and `dCoGo`), and the one and only backdoor path from `dCoA` to `Growth` is blocked by `Treatment`.

That being said, let's evaluate the ATE using front-door adjustment and see whether it is close to zero, as we would expect given that the original DAG contains no causal effect `dCoA`->`Growth`. For that we are going to use the method presented here (https://www.degeneratestate.org/posts/2018/Sep/03/causal-inference-with-python-part-3-frontdoor-adjustment/):

In [10]:
def estimate_ate_frontdoor_linear(df, x, y, z):
    """
    Estimate the ATE of a system from a dataframe of samples `df`
    using frontdoor adjustment and assuming linear models.
    
    Arguments
    ---------
    df: pandas.DataFrame
    x: str
    y: str
    z: str
    
    Returns 
    -------
    ATE: float
    """
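    x = df[x].values
    y = df[y].values
    z = df[z].values

    # Stage 1: effect of x on the mediator z
    z_x_model = sm.OLS(z, sm.add_constant(x)).fit()

    # Residualise z on x, keeping only the variation in z not explained by x
    z_bar = z_x_model.predict(sm.add_constant(x))
    z_prime = z - z_bar

    # Stage 2: effect of z on y, estimated from the residual variation in z
    y_z_model = sm.OLS(y, sm.add_constant(z_prime)).fit()

    # Front-door ATE = (effect of x on z) * (effect of z on y)
    return y_z_model.params[1] * z_x_model.params[1]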

In [11]:
print('ATE using front-door adjustment:', estimate_ate_frontdoor_linear(df, 'Treatment', 'Growth', 'dCoA'))

ATE using front-door adjustment: -0.10280298739379183


As we can see, the front-door estimate of the ATE is rather close to zero (especially when compared with the ATE of `Treatment` on `dANB`), even under our assumption that there is a causal relationship from `dCoA` to `Growth`. This is in line with our original DAG, in which there is no directed path from `Treatment` to `Growth` and therefore $ATE=ATET=0$.
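As a sanity check of the linear front-door logic, the sketch below simulates a system in which an unobserved confounder `u` affects both `x` and `y`, while `x` influences `y` only through a mediator `z`. All coefficients (0.8, 0.5) and names are assumptions made up for this illustration; the true effect of `x` on `y` is then 0.8 * 0.5 = 0.4. It uses plain Python rather than `statsmodels` so that each step is visible.

```python
import random

def slope(u, v):
    """OLS slope of v on u (with intercept): cov(u, v) / var(u)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var = sum((a - mu) ** 2 for a in u)
    return cov / var

def frontdoor_linear(x, z, y):
    """Linear front-door estimate: (x -> z slope) * (z -> y slope given x)."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    b_zx = slope(x, z)
    # Residual of z after removing the part explained by x.
    z_res = [zi - (mz + b_zx * (xi - mx)) for xi, zi in zip(x, z)]
    b_yz = slope(z_res, y)
    return b_zx * b_yz

random.seed(0)
n = 20000
u = [random.gauss(0, 1) for _ in range(n)]      # unobserved confounder
x = [ui + random.gauss(0, 1) for ui in u]       # treatment, confounded by u
z = [0.8 * xi + random.gauss(0, 1) for xi in x] # mediator, driven by x only
y = [0.5 * zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]

naive = slope(x, y)             # biased by the confounder u
fd = frontdoor_linear(x, z, y)  # should land near the true effect of 0.4
print('naive:', naive, 'front-door:', fd)
```

In this simulated setting the naive regression slope is pulled away from 0.4 by the confounder, whereas the front-door product should land close to it, which is the behaviour the estimator relies on when its assumptions hold.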

To test the independence formally let's use the chi-squared test for independence:

In [12]:
ct = pd.crosstab(df['Treatment'], df['Growth'])
ct

Treatment,Growth=0,Growth=1
0,51,26
1,34,32

### Chi-squared test for independence
$H_0$: treatment and growth are independent

$H_1$: $H_0$ is false

In [13]:
stats.chi2_contingency(ct)

(2.612101564763837,
 0.10605113732163217,
 1,
 array([[45.76923077, 31.23076923],
        [39.23076923, 26.76923077]]))

$p=0.11 > 0.05$, so the null hypothesis is not rejected: the data provide no evidence against the independence of `Treatment` and `Growth`, which is consistent with the DAG. Also, none of the expected counts (the array returned above) are below 5, so the chi-squared approximation is appropriate.
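The numbers returned by `stats.chi2_contingency` can be reproduced by hand from the contingency table. The sketch below uses only the standard library and applies Yates' continuity correction, which is what the reported statistic of about 2.61 corresponds to for this 2x2 table.

```python
import math

# Observed counts: rows are Treatment 0/1, columns are Growth 0/1.
observed = [[51, 26], [34, 32]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)

# Expected counts under independence: row total * column total / n.
expected = [[r * c / n for c in col_tot] for r in row_tot]

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring.
chi2 = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# With 1 degree of freedom, the chi-squared survival function
# equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(chi2, p_value)
```

Without the continuity correction the statistic would be somewhat larger and the p-value smaller, so it is worth stating which variant is reported.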