# Core Statistics Using Python
### Hana Choi, Simon Business School, University of Rochester


# Simple Linear Regression Part 2

## Topics covered

- Confidence intervals for beta: how precise is this estimate? 
- Additional outputs (ANOVA): how well is our regression working overall?

## Required packages

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import norm

## House Prices Example

In [None]:
# Let's again look at the linear regression summary output for hprices.csv

# Data
hprices = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/hprices.csv")

# Run a simple linear regression of price on size 
fit = smf.ols(formula='price ~ sqrft', data=hprices).fit()

# Print the summary of the regression results
print(fit.summary().tables[1])

# Confidence Intervals

## Using `conf_int()`

- Note 1: `conf_int()` uses the t distribution (with the residual degrees of freedom) to build the CIs.
- Note 2: By default, it returns 95% confidence intervals.

### 95% Confidence Interval

In [None]:
# Constructing 95% confidence interval
print(fit.conf_int()) 

### 90% Confidence Interval

- You can adjust the confidence level by setting the `alpha` parameter.
- `alpha` is the significance level: `alpha = 1 - confidence level`

In [None]:
# Computing significance level
confidence_level = 0.9
significance_level = 1 - confidence_level

print(fit.conf_int(alpha = significance_level)) 
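
To see what `conf_int()` is doing, the sketch below rebuilds the 90% interval by hand using a t cutoff. Since `hprices.csv` is not bundled here, it runs on simulated data: the names `sim` and `fit_sim`, and the numbers used to generate the data, are made up for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import t

# Simulated stand-in for hprices.csv (made-up numbers, illustration only)
rng = np.random.default_rng(0)
sqrft = rng.uniform(1000, 4000, size=100)
price = 10 + 0.14 * sqrft + rng.normal(0, 60, size=100)
sim = pd.DataFrame({"price": price, "sqrft": sqrft})

fit_sim = smf.ols(formula="price ~ sqrft", data=sim).fit()

confidence_level = 0.90
alpha = 1 - confidence_level

# t cutoff with n - 2 degrees of freedom (stored as df_resid)
t_cutoff = t.ppf(1 - alpha / 2, df=fit_sim.df_resid)

# CI = estimate +/- cutoff * standard error
manual_ci = pd.DataFrame({
    0: fit_sim.params - t_cutoff * fit_sim.bse,
    1: fit_sim.params + t_cutoff * fit_sim.bse,
})

print(manual_ci)
print(fit_sim.conf_int(alpha=alpha))
```

The two printed tables should agree, which confirms that `alpha` only changes the cutoff, not the estimates or their standard errors.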

## We can also compute CIs using the normal distribution

- Note that $$CI = \hat{\beta} \pm \text{cutoff} \times SE( \hat{\beta} )$$

In [None]:
# Extract the estimated coefficients and their standard errors
coefficients = fit.params
std_errors = fit.bse # Standard errors of the coefficient estimates

# Define the confidence level (e.g., 95%)
confidence_level = 0.95
cutoff = np.abs(norm.ppf( (1-confidence_level)/2 ))

# Calculate the confidence intervals
lower_bound = coefficients - cutoff * std_errors
upper_bound = coefficients + cutoff * std_errors

# Combine the lower and upper bounds and print the result
conf_intervals = np.column_stack((lower_bound, upper_bound))
print("beta0 CI:", conf_intervals[0])
print("beta1 CI:", conf_intervals[1])
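
The normal-based interval is slightly narrower than the one from `conf_int()`, because the normal cutoff is smaller than the t cutoff for any finite degrees of freedom. A minimal sketch comparing the two cutoffs follows; the degrees-of-freedom values are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

confidence_level = 0.95
tail = (1 - confidence_level) / 2

# Cutoff from the standard normal distribution
z_cutoff = norm.ppf(1 - tail)
print(f"normal cutoff: {z_cutoff:.4f}")

# Cutoffs from the t distribution for several (arbitrary) degrees of freedom
t_cutoffs = {df: t.ppf(1 - tail, df=df) for df in (5, 30, 100, 1000)}
for df, cutoff in t_cutoffs.items():
    print(f"t cutoff (df={df}): {cutoff:.4f}")
```

As the degrees of freedom grow, the t cutoff approaches the normal cutoff, so the two kinds of interval are nearly identical in large samples.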

# Additional Outputs (ANOVA)

## Python output

In [None]:
# Print the summary of the regression results again
print(fit.summary())

## Excel output

- Here is what Excel produces for the same regression: <br>

<img src='http://paulellickson.com/ClassData/Lec8HpricesExcel.png' alt="Excel regression output for hprices" align="center"> <br>

- Note that Python reports less output by default than Excel does.
- However, Python does compute and store the relevant outputs; you just need to ask for them.

## Sum of Squares

In [None]:
# ESS (Explained Sum of Squares)
print("ESS: ", fit.ess)

# SSR (Sum of Squared Residuals)
print("SSR: ", fit.ssr)

# TSS (Total Sum of Squares)
# There is no attribute named tss, so we build it ourselves here
# (statsmodels also stores the same number as fit.centered_tss).
# Recall TSS = ESS + SSR
tss = fit.ess+fit.ssr
print("TSS: ", tss)
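
As a check on the decomposition TSS = ESS + SSR, the sketch below computes TSS three ways and recovers $R^2 = ESS/TSS$. It runs on simulated data because `hprices.csv` is not bundled here; `sim`, `fit_sim`, and the data-generating numbers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for hprices.csv (made-up numbers, illustration only)
rng = np.random.default_rng(1)
sqrft = rng.uniform(1000, 4000, size=100)
price = 10 + 0.14 * sqrft + rng.normal(0, 60, size=100)
sim = pd.DataFrame({"price": price, "sqrft": sqrft})

fit_sim = smf.ols(formula="price ~ sqrft", data=sim).fit()

# 1) Built from the two pieces
tss_built = fit_sim.ess + fit_sim.ssr

# 2) Directly from the data: sum of squared deviations of y from its mean
tss_direct = np.sum((sim["price"] - sim["price"].mean()) ** 2)

# 3) As stored by statsmodels
tss_stored = fit_sim.centered_tss

print("TSS (ESS + SSR):    ", tss_built)
print("TSS (from data):    ", tss_direct)
print("TSS (centered_tss): ", tss_stored)

# R-squared is the share of TSS explained by the regression
r_squared = fit_sim.ess / tss_built
print("ESS / TSS:", r_squared, " rsquared:", fit_sim.rsquared)
```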

## DF and Mean Squares

In [None]:
# MS Regression 
print("DF Regression:", fit.df_model)
print("MS Regression: ", fit.ess / fit.df_model)

# MS Residual
# This is also called MSE (Mean Squared Error)
print("DF Residual:", fit.df_resid)
print("MS Residual: ", fit.ssr / fit.df_resid) # MSE = SSR/(n-2)
print("MS Residual:", fit.scale) # an easier way of getting MSE

## ANOVA Table

In [None]:
# Here's how to get most of the numbers in the ANOVA table
table = sm.stats.anova_lm(fit, typ=2)
print(table)
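
If you want a table with Regression, Residual, and Total rows, as in a typical Excel regression output, you can assemble one from the stored results. This is a sketch under assumptions: it uses simulated data in place of `hprices.csv`, and the names `sim`, `fit_sim`, and `anova` as well as the column labels are chosen here for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for hprices.csv (made-up numbers, illustration only)
rng = np.random.default_rng(2)
sqrft = rng.uniform(1000, 4000, size=100)
price = 10 + 0.14 * sqrft + rng.normal(0, 60, size=100)
sim = pd.DataFrame({"price": price, "sqrft": sqrft})

fit_sim = smf.ols(formula="price ~ sqrft", data=sim).fit()

# One row per source of variation; cells that do not apply are left as NaN
anova = pd.DataFrame(
    {
        "df": [fit_sim.df_model, fit_sim.df_resid,
               fit_sim.df_model + fit_sim.df_resid],
        "SS": [fit_sim.ess, fit_sim.ssr, fit_sim.centered_tss],
        "MS": [fit_sim.mse_model, fit_sim.mse_resid, np.nan],
        "F": [fit_sim.fvalue, np.nan, np.nan],
        "Significance F": [fit_sim.f_pvalue, np.nan, np.nan],
    },
    index=["Regression", "Residual", "Total"],
)
print(anova)
```

The F statistic in the first row is simply MS Regression divided by MS Residual.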

## SER (Standard Error of the Regression)

- This is the "Standard Error" reported in the upper-left part of the Excel output.
- Note that $SER=\sqrt{MSE}$

In [None]:
# SER
print(np.sqrt(fit.scale))
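
To connect the SER back to the residuals, the sketch below computes $\sqrt{SSR/(n-2)}$ directly and compares it with the stored values. It runs on simulated data because `hprices.csv` is not bundled here; `sim`, `fit_sim`, and the data-generating numbers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for hprices.csv (made-up numbers, illustration only)
rng = np.random.default_rng(3)
sqrft = rng.uniform(1000, 4000, size=100)
price = 10 + 0.14 * sqrft + rng.normal(0, 60, size=100)
sim = pd.DataFrame({"price": price, "sqrft": sqrft})

fit_sim = smf.ols(formula="price ~ sqrft", data=sim).fit()

# SER by hand: square root of SSR / (n - 2)
n = len(sim)
ser_manual = np.sqrt(np.sum(fit_sim.resid ** 2) / (n - 2))

print("SER (by hand):       ", ser_manual)
print("SER (sqrt scale):    ", np.sqrt(fit_sim.scale))
print("SER (sqrt mse_resid):", np.sqrt(fit_sim.mse_resid))
```

The SER is in the same units as the dependent variable, so it can be read as the typical size of a residual.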