# Hypothesis Tests and Confidence Intervals in the Simple Linear Regression Model

In [4]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import scipy.stats as stats

import statsmodels.api as sm
import statsmodels.formula.api as smf

## Testing Two-Sided Hypotheses Concerning the Slope Coefficient

![title](images/chapter5/img1.png)

$$ SE(\hat{\beta}_1) = \sqrt{ \hat{\sigma}^2_{\hat{\beta}_1} } \ \ , \ \ 
  \hat{\sigma}^2_{\hat{\beta}_1} = \frac{1}{n} \times \frac{\frac{1}{n-2} \sum_{i=1}^n (X_i - \overline{X})^2 \hat{u}_i^2 }{ \left[ \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \right]^2}. $$

Reject the null hypothesis at the 5% significance level if $ | t^{act} | > 1.96 $ or, equivalently, if the $p$-value is less than $0.05$.
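The standard error formula and the rejection rule above can be traced step by step in a few lines. This is a minimal sketch on synthetic data (the data-generating process, the seed and all variable names are assumptions for illustration, not the CASchools data), using only NumPy and the standard library. It tests $H_0: \beta_1 = 0$ against the two-sided alternative.

```python
import math
import numpy as np

# Synthetic data (an assumption for illustration; not the CASchools data).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(14, 26, size=n)
# The error spread grows with x, so the errors are heteroskedastic.
u = rng.normal(0, 1, size=n) * (x - 10)
y = 700 - 2.0 * x + u

# OLS estimates for the simple regression of y on x.
x_dev = x - x.mean()
beta1_hat = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
u_hat = y - beta0_hat - beta1_hat * x

# Robust variance of beta1_hat, term by term as in the formula above.
numerator = np.sum(x_dev ** 2 * u_hat ** 2) / (n - 2)
denominator = (np.sum(x_dev ** 2) / n) ** 2
se_beta1 = math.sqrt(numerator / denominator / n)

# t-statistic for H0: beta1 = beta1_null.
beta1_null = 0.0
t_act = (beta1_hat - beta1_null) / se_beta1

# Two-sided p-value from the standard normal: 2 * Phi(-|t|) = erfc(|t| / sqrt(2)).
p_value = math.erfc(abs(t_act) / math.sqrt(2))

print(f"beta1_hat = {beta1_hat:.4f}, SE = {se_beta1:.4f}")
print(f"t = {t_act:.3f}, p-value = {p_value:.4f}, reject at 5%: {abs(t_act) > 1.96}")
```

The p-value here relies on the large-sample normal approximation, which is why the critical value is 1.96 rather than a quantile of the t distribution.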

![title](images/chapter5/img2.png)

## Regression when X is a Binary Variable

In [9]:
CAS = pd.read_csv('https://raw.githubusercontent.com/ejvanholm/DataProjects/master/CASchools.csv', index_col = 0)

CAS['STR'] = CAS['students']/CAS['teachers']

CAS['score'] = (CAS['read'] + CAS['math'])/2  

In [8]:
CAS.head()

Unnamed: 0,district,school,county,grades,students,teachers,calworks,lunch,computer,expenditure,income,english,read,math
1,75119,Sunol Glen Unified,Alameda,KK-08,195,10.9,0.5102,2.0408,67,6384.911133,22.690001,0.0,691.599976,690.0
2,61499,Manzanita Elementary,Butte,KK-08,240,11.15,15.4167,47.916698,101,5099.380859,9.824,4.583333,660.5,661.900024
3,61549,Thermalito Union Elementary,Butte,KK-08,1550,82.900002,55.032299,76.322601,169,5501.95459,8.978,30.000002,636.299988,650.900024
4,61457,Golden Feather Union Elementary,Butte,KK-08,243,14.0,36.475399,77.049202,85,7101.831055,8.978,0.0,651.900024,643.5
5,61523,Palermo Union Elementary,Butte,KK-08,1335,71.5,33.108601,78.427002,171,5235.987793,9.080333,13.857677,641.799988,639.900024


In [10]:
CAS['test'] = (CAS['STR'] < 20)
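When the regressor is a binary indicator like the one created above, the OLS intercept equals the sample mean of the outcome in the group where the indicator is 0, and the slope equals the difference in group means. The sketch below checks this numerically on synthetic data (the simulated `str_ratio` and `score` arrays are hypothetical stand-ins, not the CASchools columns).

```python
import numpy as np

# Synthetic stand-ins for STR and score (assumptions for illustration only).
rng = np.random.default_rng(1)
n = 300
str_ratio = rng.uniform(14, 26, size=n)
score = 700 - 2.0 * str_ratio + rng.normal(0, 15, size=n)

# Binary regressor: 1 if the student-teacher ratio is below 20, else 0.
d = (str_ratio < 20).astype(float)

# OLS of score on a constant and the dummy.
X = np.column_stack([np.ones(n), d])
beta_hat, *_ = np.linalg.lstsq(X, score, rcond=None)
beta0_hat, beta1_hat = beta_hat

# Group means for comparison.
mean_large = score[d == 0].mean()   # classes with STR >= 20
mean_small = score[d == 1].mean()   # classes with STR < 20

print(f"intercept = {beta0_hat:.3f}, mean when D=0 = {mean_large:.3f}")
print(f"slope = {beta1_hat:.3f}, difference in means = {mean_small - mean_large:.3f}")
```

Testing whether the slope is zero is therefore the same as testing whether the two group means are equal.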

In [50]:
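# t-statistic on the binary regressor for each cutoff between min(STR) and max(STR), in steps of 1
listTValue = []
for i in np.arange(CAS['STR'].min(), CAS['STR'].max()):
    # Temp = 1 if the district's student-teacher ratio is at or below the cutoff i
    CAS['Temp'] = (CAS['STR'] <= i).astype(int)
    fit = sm.OLS(CAS['score'], sm.add_constant(CAS['Temp'])).fit()
    # select the slope by label; positional [1] on a labeled Series is deprecated in recent pandas
    listTValue.append(fit.tvalues['Temp'])

print(listTValue)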

[-0.9750313016157538, 0.9920439347770672, 2.618051065905779, 2.0688161582999354, 4.090672409688811, 4.556167511036623, 3.879638576823098, 3.7897292964061116, 2.2368936178914423, 0.4395873451560955, 0.591716029481143, 0.6463732307305733]


## Heteroskedasticity and Homoskedasticity

![title](images/chapter5/img3.png)

This example suggests that the assumption of homoskedasticity is doubtful in economic applications. Should we care about heteroskedasticity? Yes, we should. As explained in the next section, ignoring heteroskedasticity can have serious negative consequences for hypothesis testing.
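To see how ignoring heteroskedasticity changes the standard errors, the sketch below computes both the homoskedasticity-only and the heteroskedasticity-robust (HC1) covariance matrices by hand. It uses synthetic data whose error variance grows with the regressor; the data-generating process and all names are assumptions for illustration.

```python
import numpy as np

# Synthetic heteroskedastic data (an assumption for illustration).
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 10, size=n)
u = rng.normal(0, 1, size=n) * (0.5 + x)   # error sd increases with x
y = 1.0 + 0.5 * x + u

# OLS via the normal equations.
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

# Homoskedasticity-only covariance: s^2 * (X'X)^{-1}.
s2 = np.sum(u_hat ** 2) / (n - 2)
cov_homo = s2 * XtX_inv

# HC1 robust covariance: n/(n-k) * (X'X)^{-1} X' diag(u_hat^2) X (X'X)^{-1}, with k = 2.
meat = X.T @ (X * u_hat[:, None] ** 2)
cov_hc1 = n / (n - 2) * XtX_inv @ meat @ XtX_inv

se_homo = np.sqrt(np.diag(cov_homo))
se_hc1 = np.sqrt(np.diag(cov_hc1))
print("homoskedasticity-only SEs:", se_homo)
print("robust (HC1) SEs:         ", se_hc1)
```

In a setup like this one the two slope standard errors will generally differ, so t-statistics and p-values based on the homoskedasticity-only formula can be misleading.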

In [53]:
sm.OLS(CAS['score'], sm.add_constant(CAS['STR'])).fit().cov_HC1

array([[107.4199931 ,  -5.36391137],
       [ -5.36391137,   0.26986917]])
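The matrix above is the heteroskedasticity-robust (HC1) estimate of the covariance matrix of the coefficients, so the robust standard errors are the square roots of its diagonal. A minimal sketch using the printed numbers, assuming the rows and columns follow the regressor order (constant first, then `STR`):

```python
import numpy as np

# Robust (HC1) covariance matrix copied from the output above.
# Assumed order of rows/columns: constant first, then STR.
cov_hc1 = np.array([[107.4199931, -5.36391137],
                    [-5.36391137, 0.26986917]])

# Standard errors are the square roots of the diagonal entries.
robust_se = np.sqrt(np.diag(cov_hc1))
se_const, se_str = robust_se

print(f"robust SE(const) = {se_const:.2f}, robust SE(STR) = {se_str:.2f}")
# prints: robust SE(const) = 10.36, robust SE(STR) = 0.52
```

statsmodels can also report these directly: fitting with `fit(cov_type='HC1')` should make the `bse` attribute and the summary table use the robust standard errors.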

A useful Stack Overflow thread on robust standard errors:  
https://stackoverflow.com/questions/23420454/newey-west-standard-errors-for-ols-in-python

I am still left wondering how to test this for multiple features.

## Exercises