# Core Statistics Using Python
### Hana Choi, Simon Business School, University of Rochester


# Simple Linear Regression Part 3

## Topics covered

- Predictions
- Two interval types: (i) Confidence intervals for average predictions and (ii) Prediction intervals for specific predictions
- Impact of sample size

## Required packages

In [None]:
import pandas as pd
import statsmodels.formula.api as smf

## House Prices Example

In [None]:
# Let's again analyze hprices.csv

# Data (update this path to wherever hprices.csv is stored on your machine)
hprices = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/hprices.csv")

# First, run the regression we want
fit = smf.ols(formula='price ~ sqrft', data=hprices).fit()

# Print the summary of the regression results
print(fit.summary().tables[1])

# Predictions

- Predictions are easy in Python.
- We first run the relevant regression.
- We then tell Python the X values that we want predictions for.
- Remember that you get a prediction by plugging the X value of interest into the fitted regression equation: predicted price = intercept + slope × sqrft.

In [None]:
# Suppose we want to predict the price for a 2500 square foot house:
size = 2500
manual_prediction = fit.params['Intercept'] + fit.params['sqrft'] * size
manual_prediction

## Set values at which to make predictions

- To automate this in Python, put the X values at which you want predictions into a DataFrame whose column name matches the regressor in the formula (here, `sqrft`).
- Then we can use `predict()` to get predicted values at the new Xs (instead of manually computing them).

In [None]:
# Let's pick several house sizes to make predictions
new_data = pd.DataFrame({'sqrft': [1500, 2000, 2500, 3000, 4000]})
new_data

## Predicted values

In [None]:
# Then you can compute the price predictions at these values
predicted_house_prices = fit.predict(new_data)
predicted_house_prices
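
As a quick sanity check that `predict()` performs the same plug-in calculation as the manual formula, here is a minimal sketch. The `toy` dataset below is made up for illustration (it is not hprices.csv), so the numbers themselves carry no meaning.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data (NOT from hprices.csv), used only for illustration
toy = pd.DataFrame({
    'sqrft': [1200, 1500, 1800, 2100, 2400, 3000],
    'price': [210, 250, 265, 310, 330, 405],
})
toy_fit = smf.ols(formula='price ~ sqrft', data=toy).fit()

# Manual prediction: intercept + slope * X
size = 2500
manual = toy_fit.params['Intercept'] + toy_fit.params['sqrft'] * size

# Same prediction via predict(); the column name must match the formula
auto = toy_fit.predict(pd.DataFrame({'sqrft': [size]})).iloc[0]

print(manual, auto)
```

The two printed values should agree up to floating-point rounding.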

# Confidence Intervals and Prediction Intervals

- You can also compute the intervals around these values
- There are two interval types:
    - (i) Confidence intervals for average predictions
    - (ii) Prediction intervals for specific predictions
- We can get both of these intervals using `get_prediction()`
- Note that Python uses the t-distribution to construct both interval types, so the results will differ slightly from what I computed in Excel.

In [None]:
# Compute the intervals (both types!) at several house sizes defined above (new_data)
predictions = fit.get_prediction(new_data)

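# Display a summary of the predictions
# We need to choose a confidence level for the summary table (e.g., 95%)
# alpha is the significance level = 1 - confidence level
# For a 99% interval, change alpha to 0.01
# mean_ci_lower/mean_ci_upper: confidence interval for the average prediction
# obs_ci_lower/obs_ci_upper: prediction interval for a specific prediction
predictions.summary_frame(alpha=0.05) # 95% confidence/prediction intervals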
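
To see where these intervals come from, the sketch below rebuilds both by hand using the textbook simple-regression formulas and a t critical value with n - 2 degrees of freedom, then compares them with `summary_frame()`. It assumes SciPy is available (statsmodels depends on it) and again uses a made-up `toy` dataset, not hprices.csv.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical toy data (NOT from hprices.csv), used only for illustration
toy = pd.DataFrame({
    'sqrft': [1200, 1500, 1800, 2100, 2400, 3000],
    'price': [210, 250, 265, 310, 330, 405],
})
toy_fit = smf.ols(formula='price ~ sqrft', data=toy).fit()

x0 = 2500                                  # X value to predict at
n = len(toy)
x = toy['sqrft']
s = np.sqrt(toy_fit.mse_resid)             # residual standard error (df = n - 2)
sxx = ((x - x.mean()) ** 2).sum()
leverage = 1 / n + (x0 - x.mean()) ** 2 / sxx

se_mean = s * np.sqrt(leverage)            # SE for the average prediction
se_obs = s * np.sqrt(1 + leverage)         # SE for a specific prediction
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)

new_x = pd.DataFrame({'sqrft': [x0]})
y_hat = toy_fit.predict(new_x).iloc[0]
manual_ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
manual_pi = (y_hat - t_crit * se_obs, y_hat + t_crit * se_obs)

# Compare with the intervals reported by statsmodels
frame = toy_fit.get_prediction(new_x).summary_frame(alpha=0.05)
print(manual_ci, manual_pi)
print(frame)
```

The only difference between the two standard errors is the extra 1 under the square root, which accounts for the noise in an individual house's price.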

# Impact of Sample Size

- Finally, let's do a quick illustration of how confidence/prediction intervals change as your sample size increases
- Here, I am going to "pretend" we have a larger house size dataset by simply repeating the same dataset 100 times (giving us 8800 observations instead of 88).
- You can think of this as something like having each house actually represent 100 houses exactly like it.
- In reality, a truly larger dataset would clearly have more variation both in sizes and prices, but this will give us a feel for how the sample size changes things.
- Note how the CIs for average predictions shrink dramatically relative to the smaller (real) dataset, while the PIs for specific predictions remain almost as wide.

In [None]:
# First, I will create a larger dataset by replicating the existing data
hprices_large = pd.concat([hprices]*100, ignore_index=True)

# Now let's re-run our regression with the larger dataset
fit_large = smf.ols(formula='price ~ sqrft', data=hprices_large).fit()

# Predict with the larger dataset
predictions_large = fit_large.get_prediction(new_data)
predictions_large.summary_frame(alpha=0.05) # 95% confidence/prediction intervals
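
The reason for this pattern shows up in the standard errors behind the two intervals. Here is a minimal sketch in plain Python, assuming the textbook simple-regression formulas; the helper `interval_ses` and the toy data are hypothetical and are not taken from hprices.csv.

```python
import math

# Hypothetical toy data (NOT from hprices.csv), used only for illustration
sqrft = [1200, 1500, 1800, 2100, 2400, 3000]
price = [210, 250, 265, 310, 330, 405]

def interval_ses(x, y, x0):
    """Return (SE for the average prediction, SE for a specific prediction) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))               # residual standard error
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx  # shrinks toward 0 as n grows
    return s * math.sqrt(leverage), s * math.sqrt(1 + leverage)

# Original toy sample vs. the same sample repeated 100 times
se_mean_small, se_obs_small = interval_ses(sqrft, price, 2500)
se_mean_large, se_obs_large = interval_ses(sqrft * 100, price * 100, 2500)

print(se_mean_small, se_mean_large)  # SE for the average prediction
print(se_obs_small, se_obs_large)    # SE for a specific prediction
```

The standard error for the average prediction falls by roughly a factor of ten, because every term under its square root shrinks with the sample size. The standard error for a specific prediction has a floor near the residual standard error, since the 1 under the square root never goes away. With only six toy observations it moves somewhat more than it does with the 88 real ones.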