# What is Regression Doing After All?

As we’ve seen so far, regression does an amazing job of controlling for additional variables when we compare test and control groups. If we have conditional independence, $(Y_0, Y_1) \perp T \mid X$, then regression can identify the ATE by controlling for X. The way regression does this is kind of magical. To get some intuition about it, let’s remember the case where all the variables in X are dummy variables. In that case, regression partitions the data into the dummy cells and computes the mean difference between test and control within each cell. This difference in means holds the Xs constant, since we compute it inside a fixed cell of the X dummies. It is as if we were computing $E[Y \mid T = 1, X = x] - E[Y \mid T = 0, X = x]$, where $x$ is a dummy cell (all dummies set to 1, for example). Regression then combines the estimates from each cell to produce a final ATE. It does this by weighting each cell in proportion to the variance of the treatment in that group.



![image.png](attachment:image.png)


To give an example, let’s suppose I’m trying to estimate the effect of a drug and I have 6 men and 4 women. My response variable is days hospitalised and I hope my drug can lower that. On men, the true causal effect is -3, so the drug lowers the stay period by 3 days. On women, it is -2. To make matters more interesting, men are much more affected by this illness and stay longer at the hospital. They also get much more of the drug. Only 1 out of the 6 men does not get the drug. On the other hand, women are more resistant to this illness, so they stay less at the hospital. 50% of the women get the drug.


In [4]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

from matplotlib import style
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf

#import graphviz as gr

%matplotlib inline

style.use("fivethirtyeight")

In [5]:

drug_example = pd.DataFrame(dict(
    sex= ["M","M","M","M","M","M", "W","W","W","W"],
    drug=[1,1,1,1,1,0,  1,0,1,0],
    days=[5,5,5,5,5,8,  2,4,2,4]
))

Note that a simple comparison of treatment and control yields an estimate that is biased toward zero, that is, the drug seems less effective than it truly is. This is expected, since we’ve omitted the sex confounder. The estimated ATE is smaller in magnitude than the true one because men both get more of the drug and are more affected by the illness, so the longer stays of men inflate the average stay of the treated group.
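The naive comparison can be checked directly on the toy data. The sketch below rebuilds `drug_example` so that it runs on its own; `treated_mean`, `control_mean` and `naive_diff` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Same toy data as above, rebuilt so this cell runs on its own.
drug_example = pd.DataFrame(dict(
    sex=["M", "M", "M", "M", "M", "M", "W", "W", "W", "W"],
    drug=[1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
    days=[5, 5, 5, 5, 5, 8, 2, 4, 2, 4],
))

# Difference in mean hospital days between treated and untreated, ignoring sex.
treated_mean = drug_example.query("drug == 1")["days"].mean()
control_mean = drug_example.query("drug == 0")["days"].mean()
naive_diff = treated_mean - control_mean

print(round(naive_diff, 2))  # -1.19
```

The result is negative, but well short of the within-group effects of -3 and -2.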

Since the true effect for men is -3 and the true effect for women is -2, the ATE should be

$ATE = \frac{(-3 \times 6) + (-2 \times 4)}{10} = -2.6$

This estimate is obtained by 1) partitioning the data into confounder cells, in this case men and women, 2) estimating the effect in each cell and 3) combining the estimates with a weighted average, where the weight is the sample size of the cell or covariate group. If we had exactly the same number of men and women in the data, the ATE estimate would be right in the middle of the ATEs of the 2 groups, -2.5. Since there are more men than women in our dataset, the ATE estimate is a little closer to the men’s ATE. This is called a non-parametric estimate, since it makes no assumption about how the data was generated.
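The three steps can be sketched with a `groupby`. This is a minimal illustration on the toy data, rebuilt here so the cell is self-contained; `cell_effect`, `cell_share` and `ate_stratified` are names chosen for this example.

```python
import pandas as pd

# Same toy data as above, rebuilt so this cell runs on its own.
drug_example = pd.DataFrame(dict(
    sex=["M", "M", "M", "M", "M", "M", "W", "W", "W", "W"],
    drug=[1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
    days=[5, 5, 5, 5, 5, 8, 2, 4, 2, 4],
))

# Steps 1 and 2: within each sex cell, treated mean minus control mean.
cell_means = drug_example.groupby(["sex", "drug"])["days"].mean().unstack("drug")
cell_effect = cell_means[1] - cell_means[0]   # M: -3.0, W: -2.0

# Step 3: weight each cell's effect by its share of the sample.
cell_share = drug_example["sex"].value_counts(normalize=True)  # M: 0.6, W: 0.4
ate_stratified = (cell_effect * cell_share).sum()

print(round(ate_stratified, 2))  # -2.6
```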

If we control for sex using regression, we add the assumption of linearity. Regression will also partition the data into men and women and estimate the effect in both of these groups. So far, so good. However, when it comes to combining the effects of the groups, it does not weight them by sample size alone. Instead, regression uses weights that are proportional to the variance of the treatment in that group (scaled by the group’s size). In our case, the variance of the treatment in men is smaller than in women, since only one man is in the control group. To be exact, the variance of T for men is $0.139 = 1/6 \times (1 - 1/6)$ and for women it is $0.25 = 2/4 \times (1 - 2/4)$. So regression will give a higher weight to women in our example and the ATE will be a bit closer to the women’s ATE of -2.
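To see the weighting at work, the sketch below combines the two within-group effects using weights equal to group size times treatment variance, and then cross-checks the result against an ordinary least squares fit. It assumes the weights take that form for this saturated dummy case. To keep the cell dependency-light it uses `np.linalg.lstsq` as a stand-in for a formula call such as `smf.ols("days ~ drug + C(sex)", data=drug_example)`. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this cell runs on its own.
drug_example = pd.DataFrame(dict(
    sex=["M", "M", "M", "M", "M", "M", "W", "W", "W", "W"],
    drug=[1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
    days=[5, 5, 5, 5, 5, 8, 2, 4, 2, 4],
))

# Within-cell effects: treated mean minus control mean for each sex.
cell_means = drug_example.groupby(["sex", "drug"])["days"].mean().unstack("drug")
cell_effect = cell_means[1] - cell_means[0]          # M: -3.0, W: -2.0

# Assumed regression weights: group size times the variance of the treatment.
p_treated = drug_example.groupby("sex")["drug"].mean()   # M: 5/6, W: 1/2
n_cell = drug_example.groupby("sex")["drug"].size()      # M: 6,   W: 4
weight = n_cell * p_treated * (1 - p_treated)            # M: 0.833, W: 1.0
ate_weighted = (cell_effect * weight).sum() / weight.sum()

# Cross-check: OLS of days on an intercept, drug and a dummy for women.
X = np.column_stack([
    np.ones(len(drug_example)),
    drug_example["drug"].to_numpy(dtype=float),
    (drug_example["sex"] == "W").to_numpy(dtype=float),
])
y = drug_example["days"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ols_drug = coef[1]

print(round(ate_weighted, 4), round(ols_drug, 4))  # -2.4545 -2.4545
```

Both numbers agree, and they sit between the sample-size-weighted -2.6 and the women’s effect of -2.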

In [1]:

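def F(x, y):
    # Labels are Spanish: "Grande" = large, "Pequeño" = small, "Mediano" = medium.
    if x <= y:
        s = 0
        # s is overwritten on every pass, so only the last i (i == y) matters.
        for i in range(x, y + 1):
            s = (i - 2) ** 3
    else:
        s = 0
        # Same pattern: only the last i (i == x) determines s.
        for i in range(y, x + 1):
            s = (i + 3) ** 2

    # Classify s with the same thresholds in both branches.
    if s > 10:
        return "Grande"
    elif s < -10:
        return "Pequeño"
    else:
        return "Mediano"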

F(1,2)    

'Mediano'

In [2]:
F(5,2)    

'Grande'

## Modifying to upload it to git

In [3]:
a=10
b=20
a+b

30