# First Graded Assignment for LP2

   ### Marcello Eduardo Anchante Fernandez - 20211804


# Solution to Problem 1

## What is Regression Doing After All?
As we’ve seen so far, regression does an amazing job of controlling for additional variables when we compare test and control groups. If we have conditional independence, $(Y_0, Y_1) \perp T | X$, then regression can identify the ATE by controlling for X. The way regression does this is kind of magical. To get some intuition about it, let’s remember the case when all variables X are dummy variables. In that case, regression partitions the data into the dummy cells and computes the mean difference between test and control within each cell. This difference in means holds the Xs constant, since we compute it inside a fixed cell of the X dummies. It is as if we were computing $E[Y|T = 1, X = x] - E[Y|T = 0, X = x]$, where $x$ is a dummy cell (all dummies set to 1, for example). Regression then combines the estimates from the cells into a final ATE, weighting each cell in proportion to the variance of the treatment in that group.


Code:

In [126]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from matplotlib import style
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf

%matplotlib inline

style.use("fivethirtyeight")

To give an example, let’s suppose I’m trying to estimate the effect of a drug and I have 6 men and 4 women. My response variable is days hospitalised and I hope my drug can lower that. On men, the true causal effect is -3, so the drug lowers the stay period by 3 days. On women, it is -2. To make matters more interesting, men are much more affected by this illness and stay longer at the hospital. They also get much more of the drug. Only 1 out of the 6 men does not get the drug. On the other hand, women are more resistant to this illness, so they stay less at the hospital. 50% of the women get the drug.

In [129]:
drug_example = pd.DataFrame(dict(
    sex= ["M","M","M","M","M","M", "W","W","W","W"],
    drug=[1,1,1,1,1,0,  1,0,1,0],
    days=[5,5,5,5,5,8,  2,4,2,4]
))

Note that a simple comparison of treatment and control yields an estimate that is biased toward zero, that is, the drug seems less effective than it truly is. This is expected, since we’ve omitted the sex confounder. In this case, the estimated ATE is smaller in magnitude than the true one because men get more of the drug and are also more affected by the illness.

In [12]:
drug_example.query("drug==1")["days"].mean() - drug_example.query("drug==0")["days"].mean()

-1.1904761904761898

Since the true effect for men is -3 and the true effect for women is -2, the ATE should be

$ATE = \frac{(-3 \times 6) + (-2 \times 4)}{10} = -2.6$

This estimate is obtained by 1) partitioning the data into confounder cells, in this case men and women, 2) estimating the effect in each cell, and 3) combining the estimates with a weighted average, where the weight is the sample size of the cell or covariate group. If we had exactly the same number of men and women in the data, the ATE estimate would be right in the middle of the ATEs of the 2 groups, -2.5. Since there are more men than women in our dataset, the ATE estimate is a little closer to the men’s ATE. This is called a non-parametric estimate, since it places no assumption on how the data was generated.
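The three steps described above can be sketched with pandas. This is a minimal illustration that rebuilds the same toy data; the names `cell_means`, `cell_effect`, `cell_share`, and `ate_stratified` are introduced here only for the example.

```python
import pandas as pd

# Same toy data as the drug example in this notebook.
drug_example = pd.DataFrame(dict(
    sex=["M", "M", "M", "M", "M", "M", "W", "W", "W", "W"],
    drug=[1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
    days=[5, 5, 5, 5, 5, 8, 2, 4, 2, 4],
))

# Steps 1 and 2: partition by sex, then take treated minus control in each cell.
cell_means = drug_example.groupby(["sex", "drug"])["days"].mean().unstack("drug")
cell_effect = cell_means[1] - cell_means[0]

# Step 3: combine the cell effects, weighting each cell by its share of the sample.
cell_share = drug_example["sex"].value_counts(normalize=True)
ate_stratified = (cell_effect * cell_share).sum()

print(cell_effect)
print(round(ate_stratified, 4))  # approximately -2.6
```

The multiplication aligns the two Series on the `sex` labels, so the order in which `value_counts` returns the groups does not matter.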

If we control for sex using regression, we add the assumption of linearity. Regression will also partition the data into men and women and estimate the effect in both of these groups. So far, so good. However, when it comes to combining the effects of the groups, it does not weigh them by sample size alone. Instead, regression uses weights that are proportional to the variance of the treatment in each group (multiplied by the group's size). In our case, the variance of the treatment for men is smaller than for women, since only one man is in the control group. To be exact, the variance of T for men is $0.139 = 1/6 \times (1 - 1/6)$ and for women is $0.25 = 2/4 \times (1 - 2/4)$. The resulting weights are $6 \times 0.139 \approx 0.83$ for men and $4 \times 0.25 = 1$ for women, so regression gives a higher weight to women in our example and the ATE will be a bit closer to the women’s ATE of -2.

In [13]:
smf.ols('days ~ drug + C(sex)', data=drug_example).fit().summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 7.5455 | 0.188 | 40.093 | 0.000 | 7.100 | 7.990 |
| C(sex)[T.W] | -3.3182 | 0.176 | -18.849 | 0.000 | -3.734 | -2.902 |
| drug | -2.4545 | 0.188 | -13.042 | 0.000 | -2.900 | -2.010 |
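The `drug` coefficient of -2.4545 can be reproduced by hand as a weighted average of the two cell effects. The sketch below assumes the weights are the cell size times the within-cell variance of the treatment, which is the standard result for a regression with a saturated set of dummies; the variable names are introduced only for this example.

```python
import pandas as pd

# Same toy data as the drug example in this notebook.
drug_example = pd.DataFrame(dict(
    sex=["M", "M", "M", "M", "M", "M", "W", "W", "W", "W"],
    drug=[1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
    days=[5, 5, 5, 5, 5, 8, 2, 4, 2, 4],
))

# Treated-minus-control difference within each sex.
cell_means = drug_example.groupby(["sex", "drug"])["days"].mean().unstack("drug")
cell_effect = cell_means[1] - cell_means[0]

# Variance of a binary treatment within each cell: p * (1 - p).
p_treated = drug_example.groupby("sex")["drug"].mean()
treat_var = p_treated * (1 - p_treated)
n_cell = drug_example.groupby("sex")["drug"].size()

# Assumed regression weights: cell size times treatment variance.
weights = n_cell * treat_var
ate_regression = (cell_effect * weights).sum() / weights.sum()

print(weights)
print(round(ate_regression, 4))  # approximately -2.4545
```

Men end up with a weight of about 0.83 and women with a weight of 1, which is why the regression estimate sits closer to the women's effect than the sample-size-weighted -2.6.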


This result is more intuitive with dummy variables, but, in its own weird way, regression also holds continuous variables constant while estimating the effect. With continuous variables too, the ATE estimate leans toward the regions where the covariates have more variance.
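One way to see how regression holds a continuous variable constant is the partialling-out view (the Frisch-Waugh-Lovell theorem, which the text above does not name). The sketch below uses synthetic data with an assumed true effect of 2.0; all names and numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                      # continuous covariate (synthetic)
t = 0.8 * x + rng.normal(size=n)            # treatment depends on x
y = 2.0 * t + 1.5 * x + rng.normal(size=n)  # outcome; assumed true effect of t is 2.0


def ols(design, target):
    """Least-squares coefficients of target on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


ones = np.ones(n)

# Full regression y ~ 1 + t + x; the coefficient on t is at position 1.
beta_full = ols(np.column_stack([ones, t, x]), y)[1]

# Partialling out: remove from t and y whatever x (and the intercept) explain,
# then regress the y residuals on the t residuals.
controls = np.column_stack([ones, x])
t_resid = t - controls @ ols(controls, t)
y_resid = y - controls @ ols(controls, y)
beta_partial = ols(t_resid[:, None], y_resid)[0]

print(beta_full, beta_partial)
```

Both numbers agree up to floating-point error: the coefficient on the treatment only uses the variation in `t` that `x` cannot explain.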

So we’ve seen that regression has its idiosyncrasies. It is linear, parametric, likes high variance features… This can be good or bad, depending on the context. Because of this, it’s important to be aware of other techniques we can use to control for confounders. Not only are they an extra tool in your causal tool belt, but understanding different ways to deal with confounding also deepens our understanding of the problem. For this reason, I now present the Subclassification Estimator!

# Solution to Problem 2

$$ f(x, y) =
\begin{cases}
\sum_{i=x}^{y} (i-2)^3 &,\, \text{if } x \leq y ,\\
\sum_{i=y}^{x} (i+3)^2 &,\, \text{if } y < x
\end{cases}  $$

In [138]:
def Funcion(x, y):
    # Evaluate the piecewise sum f(x, y) for integer x and y, then classify it by size
    if x <= y:
        # Case x <= y: sum of (i - 2)^3 for i = x, ..., y
        total = sum((i - 2) ** 3 for i in range(x, y + 1))
    else:
        # Case y < x: sum of (i + 3)^2 for i = y, ..., x
        total = sum((i + 3) ** 2 for i in range(y, x + 1))
    if total < -10:
        print('The value is Small')
    elif total > 10:
        print('The value is Large')
    else:
        print('The value is Medium')
    print('The value is', total)


Example 1:

We set $x = 1$ and $y = 5$.

In [142]:
Funcion(1,5)

The value is Large
The value is 35


Example 2:

We set $x = 4$ and $y = 2$.

In [140]:
Funcion(4,2)

The value is Large
The value is 110
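Because `Funcion` only prints its result, it is awkward to reuse or check programmatically. A possible variant, sketched below, separates the computation from the labeling and returns both. The names `piecewise_sum` and `classify` and the English labels are hypothetical, and integer inputs are assumed.

```python
def piecewise_sum(x, y):
    """Return f(x, y) for integer x and y."""
    if x <= y:
        return sum((i - 2) ** 3 for i in range(x, y + 1))
    return sum((i + 3) ** 2 for i in range(y, x + 1))


def classify(value):
    """Label a value using the same thresholds as Funcion (-10 and 10)."""
    if value < -10:
        return "small"
    if value > 10:
        return "large"
    return "medium"


for x, y in [(1, 5), (4, 2), (2, 2)]:
    value = piecewise_sum(x, y)
    print(x, y, value, classify(value))
```

The first two pairs reproduce the examples above (35 and 110, both large); the pair (2, 2) gives 0, which falls in the medium band.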


# Solution to Problem 3

In [5]:
import pandas as pd

dt = pd.read_csv('stud_perf_exam.csv')
dt

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 |


#### A) Report descriptive statistics for the scores (math, reading, and writing)


In [6]:
dt.loc[:,['math score','reading score','writing score']].describe()

| | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


#### B) Which of the math, reading, or writing scores has the largest range (maximum - minimum)?

###### Range of Math Score

In [7]:
rmath = dt.loc[:,'math score'].max()-dt.loc[:,'math score'].min()
rmath

100

###### Range of Reading Score

In [63]:
rreading=dt.loc[:,'reading score'].max()-dt.loc[:,'reading score'].min()
rreading

83

###### Range of Writing Score

In [64]:
rwriting=dt.loc[:,'writing score'].max()-dt.loc[:,'writing score'].min()
rwriting

90

##### Largest range across the three categories

In [67]:
max(rmath,rreading,rwriting)

100

The math score therefore has the largest range (100, versus 83 for reading and 90 for writing).
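The three ranges can also be computed in one pass, and `idxmax` then reports which column is widest. The sketch below uses a hypothetical five-row sample with the same column names as the exam data, so its numbers are illustrative only and do not match the full dataset.

```python
import pandas as pd

# Hypothetical mini-sample; the real data comes from stud_perf_exam.csv.
scores = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Range (max - min) of every score column at once.
ranges = scores.max() - scores.min()

# Label of the column with the largest range.
widest = ranges.idxmax()

print(ranges)
print(widest)
```

On this small sample the ranges are 43, 38, and 49, so the widest column is the writing score. Applied to the full data, the same two lines would name the column directly instead of only returning the largest value.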

#### C) How many distinct race/ethnicity values are observed among male students, and how many among female students?

In [8]:
dt1 = dt.loc[:, ['gender', 'race/ethnicity']]

# Count the distinct race/ethnicity values among male students
n_male = dt1.loc[dt1['gender'] == 'male', 'race/ethnicity'].nunique()
print('Among male students,', n_male, 'distinct values are observed')

# Count the distinct race/ethnicity values among female students
n_female = dt1.loc[dt1['gender'] == 'female', 'race/ethnicity'].nunique()
print('Among female students,', n_female, 'distinct values are observed')


Among male students, 5 distinct values are observed
Among female students, 5 distinct values are observed
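Filtering each gender separately makes it easy to reuse the wrong filter by accident. A grouped count avoids that, since both genders are handled by the same expression. The sketch below uses a hypothetical five-row sample with the same column names, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical mini-sample; the real data comes from stud_perf_exam.csv.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "race/ethnicity": ["group B", "group C", "group B", "group A", "group C"],
})

# Number of distinct race/ethnicity values within each gender.
distinct_by_gender = sample.groupby("gender")["race/ethnicity"].nunique()

print(distinct_by_gender)
```

In this sample each gender has two distinct groups (B and C for female students, A and C for male students).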


#### D) Percentage of students scoring above the mean in each area (math, reading, and writing), by parental level of education

In [82]:
# Compute the mean score for each area
math_mean = dt['math score'].mean()
reading_mean = dt['reading score'].mean()
writing_mean = dt['writing score'].mean()

# Flag the students whose score is above the overall mean of each area
dt['above_math_mean'] = dt['math score'] > math_mean
dt['above_reading_mean'] = dt['reading score'] > reading_mean
dt['above_writing_mean'] = dt['writing score'] > writing_mean

# The mean of a boolean column is the share of True values; multiply by 100 for a percentage
above_cols = ['above_math_mean', 'above_reading_mean', 'above_writing_mean']
result = dt.groupby('parental level of education')[above_cols].mean() * 100
result


| parental level of education | above_math_mean | above_reading_mean | above_writing_mean |
|---|---|---|---|
| associate's degree | 50.900901 | 56.756757 | 55.405405 |
| bachelor's degree | 55.932203 | 61.016949 | 64.40678 |
| high school | 40.306122 | 40.306122 | 35.204082 |
| master's degree | 59.322034 | 64.40678 | 67.79661 |
| some college | 52.654867 | 52.212389 | 54.424779 |
| some high school | 45.251397 | 44.692737 | 45.251397 |


#### E) Create a column labeled mean_score with the average of the three scores (math, reading, and writing)

In [84]:
# Row-wise average of the three score columns, stored under the requested label
dtmean = dt.assign(mean_score=dt[['math score', 'reading score', 'writing score']].mean(axis=1))

#### F) Compute a set of descriptive statistics to explore whether the type of lunch service students receive is related to the mean score across the three areas

In [92]:
dtmean

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | above_math_mean | above_reading_mean | above_writing_mean | mean_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | True | True | True | 72.666667 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | True | True | True | 82.333333 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | True | True | True | 92.666667 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | False | False | False | 49.333333 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | True | True | True | 76.333333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 | True | True | True | 94.000000 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 | False | False | False | 57.333333 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 | False | True | False | 65.000000 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 | True | True | True | 74.333333 |


In [125]:
dtmean.loc[dtmean['lunch'] == 'standard', 'mean_score'].mean()


70.83720930232563

In [132]:
dtmean.loc[dtmean['lunch'] == 'free/reduced', 'mean_score'].mean()


62.199061032863845

On average, students with the standard lunch score about 8.6 points higher than students with the free/reduced lunch.
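Item F asks for a set of descriptive statistics, and a grouped `describe` returns the count, mean, standard deviation, and quartiles for each lunch type in one call. The sketch below uses a hypothetical six-row sample; the column name `mean_score` follows the label requested in item E, and the values are illustrative only.

```python
import pandas as pd

# Hypothetical mini-sample; the real data comes from stud_perf_exam.csv.
sample = pd.DataFrame({
    "lunch": ["standard", "standard", "standard",
              "free/reduced", "standard", "free/reduced"],
    "mean_score": [72.67, 82.33, 92.67, 49.33, 76.33, 57.33],
})

# Descriptive statistics of the average score for each lunch type.
summary = sample.groupby("lunch")["mean_score"].describe()

print(summary[["count", "mean", "std", "50%"]])
```

Comparing the medians and spreads side by side gives a fuller picture than the two group means alone, although it still describes an association and not a causal effect.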

### The first comment covers the work up to this point.

In [15]:
dts = [[2,3,4],[2,5,3]]
dts

sum(dts[1])

10

### How do we sum the elements of a matrix?

In [27]:
def suma(datos):
    # Add up every element of a matrix stored as a list of rows
    sumtotal = 0
    for row in datos:
        sumtotal += sum(row)
    return sumtotal

In [28]:
suma(dts)

19
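The same total can be obtained without a helper function. The sketch below shows two equivalent one-liners using only the standard library; the names `total_nested` and `total_flat` are introduced here for illustration.

```python
from itertools import chain

dts = [[2, 3, 4], [2, 5, 3]]

# Sum each row, then sum the row totals.
total_nested = sum(sum(row) for row in dts)

# Flatten the rows into one stream of numbers and sum it.
total_flat = sum(chain.from_iterable(dts))

print(total_nested, total_flat)  # 19 19
```

Both forms also work when the rows have different lengths, since neither assumes a rectangular matrix.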