# Core Statistics Using Python
### Hana Choi, Simon Business School, University of Rochester


# Heteroskedasticity-Robust Standard Error

## Topics covered

- Heteroskedasticity-Robust Standard Error: hprices.csv
- Other dataset examples

## Required packages

In [None]:
import pandas as pd
import statsmodels.formula.api as smf

# Example: House Prices

## Load data: hprices.csv

In [None]:
# Load data
hprices = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/hprices.csv")

# Display the first few rows of the dataframe
print(hprices.head())

## Homoskedasticity-Only Standard Error

In [None]:
# First, fit the model with the default (homoskedasticity-only) standard errors

# Run a simple linear regression of price on size (sqrft)
fit = smf.ols(formula='price ~ sqrft', data=hprices).fit()

# Print summary table
print(fit.summary().tables[1])

## Heteroskedasticity-Robust Standard Error (HR SE)

- It is very easy to get HR SEs in Python
- You just need to pass one additional argument, `cov_type`, to `.fit()` to specify the type of SEs you want

In [None]:
# You still have to choose which type of HR SE to use.
# I always use HC1 (the small-sample correction that Stata's `robust` option applies);
# statsmodels also offers HC0, HC2, and HC3.
fit_HRse = smf.ols(formula='price ~ sqrft', data=hprices).fit(cov_type='HC1')

# Print summary table
print(fit_HRse.summary().tables[1])

# Note that statsmodels now reports z-statistics: the p-values and confidence intervals
# in the table come from the Normal distribution (as they should).
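
To see what `cov_type='HC1'` does, the sketch below rebuilds the HC1 standard errors by hand from the sandwich formula and compares them with what statsmodels reports. It uses synthetic data (an assumption for illustration, not `hprices.csv`), and the names `df`, `res`, `bread`, and `meat` are made up for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption for illustration): the error spread grows
# with x, so the errors are heteroskedastic by construction.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=x, size=n)
df = pd.DataFrame({"x": x, "y": y})

res = smf.ols(formula="y ~ x", data=df).fit(cov_type="HC1")

# Sandwich formula: (X'X)^-1 [X' diag(e_i^2) X] (X'X)^-1, scaled by n / (n - k)
X = res.model.exog             # design matrix, including the intercept column
e = np.asarray(res.resid)      # OLS residuals
k = X.shape[1]                 # number of estimated coefficients
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (e ** 2)[:, None])
cov_hc1 = n / (n - k) * bread @ meat @ bread
se_manual = np.sqrt(np.diag(cov_hc1))

print("By hand:    ", se_manual)
print("statsmodels:", res.bse.values)
```

The two printed vectors should agree up to floating-point error. Dropping the `n / (n - k)` factor gives HC0 instead.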

## You can even use the Heteroskedasticity-Robust SEs in your predictions

In [None]:
# Confidence/Prediction intervals with Robust SEs 
new_data = pd.DataFrame({'sqrft': [1500, 2000, 2500, 3000, 4000]})
predictions = fit_HRse.get_prediction(new_data)
predictions.summary_frame(alpha=0.05)

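# Note that Python will now use the Normal distribution to construct the intervals.
# The robust SEs enter through mean_se (the interval for the mean);
# the obs_ci_* columns additionally add the estimated residual variance.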
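
As a quick check on the Normal-distribution remark, the sketch below rebuilds the 95% confidence interval for the mean by hand, using the Normal critical value and the robust `mean_se`, and compares it with the columns returned by `summary_frame`. The data are synthetic (an assumption for illustration), and `df`, `new`, and `frame` are names made up for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic heteroskedastic data (assumption for illustration)
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(1, 10, size=n)
df = pd.DataFrame({"x": x, "y": 1.0 + 2.0 * x + rng.normal(scale=x, size=n)})

res = smf.ols(formula="y ~ x", data=df).fit(cov_type="HC1")
new = pd.DataFrame({"x": [2.0, 5.0, 8.0]})
frame = res.get_prediction(new).summary_frame(alpha=0.05)

# Rebuild the interval for the mean: prediction +/- z * robust SE
z = stats.norm.ppf(0.975)      # Normal critical value for a 95% interval
lower = frame["mean"] - z * frame["mean_se"]
upper = frame["mean"] + z * frame["mean_se"]

print("use_t:", res.use_t)
print("lower bounds match:", np.allclose(lower, frame["mean_ci_lower"]))
print("upper bounds match:", np.allclose(upper, frame["mean_ci_upper"]))
```

If the fit had used the default `cov_type`, the same comparison would need the t critical value with `res.df_resid` degrees of freedom instead of `z`.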

# Other Dataset Examples

- For each dataset, compare the Heteroskedasticity-Robust SEs with the homoskedasticity-only results
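
One possible way to make this comparison easier to read is to put both sets of SEs in a single table. The helper `compare_se` below is hypothetical (it is not part of statsmodels), and the demo data are synthetic, so treat it as a sketch rather than the only way to do this.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def compare_se(formula, data, cov_type="HC1"):
    """Fit the same OLS model twice and line up the two sets of SEs."""
    classical = smf.ols(formula=formula, data=data).fit()
    robust = smf.ols(formula=formula, data=data).fit(cov_type=cov_type)
    table = pd.DataFrame(
        {
            "coef": classical.params,           # identical in both fits
            "se_homoskedastic": classical.bse,  # default SEs
            "se_robust": robust.bse,            # HR SEs
        }
    )
    table["ratio"] = table["se_robust"] / table["se_homoskedastic"]
    return table


# Synthetic heteroskedastic data (assumption for illustration)
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=500)
demo = pd.DataFrame({"x": x, "y": 5.0 + 1.5 * x + rng.normal(scale=x, size=500)})

table = compare_se("y ~ x", demo)
print(table)
```

With the course data loaded, a call such as `compare_se('price ~ sqrft', hprices)` would produce the same kind of table. The coefficients do not change between the two fits; only the SEs (and everything built on them) do.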

## Diamonds

In [None]:
# Load data
diamonds = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/diamonds.csv")

# Homoskedasticity-only SEs
fit_diamonds = smf.ols(formula='price ~ carats', data=diamonds).fit()
print(fit_diamonds.summary().tables[1])

# Heteroskedasticity-robust SEs (HC1)
fit_diamonds_HRse = smf.ols(formula='price ~ carats', data=diamonds).fit(cov_type='HC1')
print(fit_diamonds_HRse.summary().tables[1])

## Wines

In [None]:
# Load data
wines = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/wines.csv")

# Homoskedasticity-only SEs
fit_wines = smf.ols(formula='Price ~ Score', data=wines).fit()
print(fit_wines.summary().tables[1])

# Heteroskedasticity-robust SEs (HC1)
fit_wines_HRse = smf.ols(formula='Price ~ Score', data=wines).fit(cov_type='HC1')
print(fit_wines_HRse.summary().tables[1])

## Earnings data

In [None]:
# Load data
cps12 = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/cps12.csv")

# Homoskedasticity-only SEs
fit_cps12 = smf.ols(formula='earnings ~ male', data=cps12).fit()
print(fit_cps12.summary().tables[1])

# Heteroskedasticity-robust SEs (HC1)
fit_cps12_HRse = smf.ols(formula='earnings ~ male', data=cps12).fit(cov_type='HC1')
print(fit_cps12_HRse.summary().tables[1])