# Core Statistics Using Python
### Hana Choi, Simon Business School, University of Rochester


# Variable & Model Selection

## Topics covered

- Perfect collinearity example: House prices
- Near perfect collinearity example: SUV data

## Required packages

In [None]:
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Perfect collinearity example: House prices

## Load data

In [None]:
# Load the hprices2.csv dataset
hprices2 = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/hprices2.csv")

# Display first few rows of the dataframe
print(hprices2.head())

## Sqrft and Sqrmeter

In [None]:
# Convert house size (sqrft) to square meters (1 square meter is about 10.764 square feet)
# and store it as a new "sqrmt" column in the hprices2 DataFrame
hprices2['sqrmt'] = hprices2['sqrft'] / 10.764

# Check correlation: expecting perfect correlation
hprices2[['sqrft', 'sqrmt']].corr()

## Regression analysis

### Run regression model with both sqrft and sqrmt
- The problem is that sqrmt = sqrft / 10.764, so the two predictors are perfectly correlated and their separate effects on price cannot be identified.

In [None]:
model1 = smf.ols('price ~ sqrft + sqrmt', data=hprices2).fit()
print(model1.summary().tables[1])
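
One way to see why this regression is ill-defined is to check the rank of the design matrix. The sketch below uses synthetic house sizes (not the hprices2 data) and only NumPy; the variable names are illustrative. With an intercept, `sqrft`, and `sqrmt`, the matrix has three columns but only two linearly independent ones. Note that statsmodels fits OLS with a pseudoinverse by default, so it typically still prints coefficients in this situation rather than raising an error; those numbers should not be interpreted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic house sizes in square feet (illustrative, not the hprices2 data)
sqrft = rng.uniform(1000, 4000, size=50)
sqrmt = sqrft / 10.764  # exact linear function of sqrft

# Design matrix: intercept column, sqrft, sqrmt
X = np.column_stack([np.ones_like(sqrft), sqrft, sqrmt])

# Rank counts the linearly independent columns
rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")
```

A rank smaller than the number of columns means that infinitely many coefficient vectors fit the data equally well.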

### Regression with one predictor at a time

- We have to drop one of the two variables to avoid perfect collinearity.
- It doesn't matter which one we drop, but the coefficient must be interpreted in the units of the variable we keep (price change per square foot vs. per square meter).

In [None]:
# Regression with sqrft only
model2 = smf.ols('price ~ sqrft', data=hprices2).fit()
print('Regression with sqrft only')
print(model2.summary().tables[1])
print('----')

# Regression with sqrmt
model3 = smf.ols('price ~ sqrmt', data=hprices2).fit()
print('Regression with sqrmt only')
print(model3.summary().tables[1])
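
The two single-predictor fits describe the same line in different units. As a minimal sketch on synthetic data (the price equation and noise level below are assumptions made for illustration, not estimates from hprices2), the slope per square meter should equal 10.764 times the slope per square foot, while the intercept is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative values only)
sqrft = rng.uniform(1000, 4000, size=50)
price = 20 + 0.12 * sqrft + rng.normal(0, 30, size=50)
sqrmt = sqrft / 10.764

# np.polyfit with degree 1 returns [slope, intercept]
slope_ft, intercept_ft = np.polyfit(sqrft, price, 1)
slope_mt, intercept_mt = np.polyfit(sqrmt, price, 1)

# Ratio of the two slopes recovers the unit conversion factor
print("slope ratio:", slope_mt / slope_ft)
print("intercepts:", intercept_ft, intercept_mt)
```

The same relationship should hold between `model2.params['sqrft']` and `model3.params['sqrmt']` in the cells above.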

# Near perfect collinearity example: SUV data

## Load data

In [None]:
# Load the suv.csv dataset
suv = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/suv.csv")

# Display first few rows of the dataframe
print(suv.head())

## Regression with near perfect collinearity

In [None]:
model_suv1 = smf.ols('mshare ~ Q("Invoice(in 1Ks)") + Q("MSRP(in 1Ks)")', data=suv).fit()
print(model_suv1.summary().tables[1])
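
A typical symptom of near-perfect collinearity is inflated standard errors. The sketch below illustrates this on synthetic data, not the SUV data: `x1` and `x2` are hypothetical stand-ins for two price measures that move almost one-for-one, and `ols_se` is a small helper written for this example that computes classical OLS standard errors with NumPy.

```python
import numpy as np

def ols_se(X, y):
    """Return OLS coefficients and classical standard errors for design matrix X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)       # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)  # covariance matrix of the coefficients
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 40
x1 = rng.normal(30, 5, size=n)              # hypothetical first price measure
x2 = 1.1 * x1 + rng.normal(0, 0.2, size=n)  # second measure, nearly a multiple of x1
y = 5 - 0.1 * x2 + rng.normal(0, 1, size=n)

ones = np.ones(n)
_, se_both = ols_se(np.column_stack([ones, x1, x2]), y)
_, se_one = ols_se(np.column_stack([ones, x2]), y)

print("SE of x2 with x1 included:", se_both[2])
print("SE of x2 alone:           ", se_one[1])
```

With both predictors included, the standard error on `x2` is much larger, so a real effect can appear statistically insignificant.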

## Check correlation between Invoice and MSRP

- The two variables are very highly correlated.
- We need to drop one of them.

In [None]:
# Use the same column names as in the regression formula above
suv[['Invoice(in 1Ks)', 'MSRP(in 1Ks)']].corr()
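
Beyond pairwise correlations, a common diagnostic is the variance inflation factor (VIF), defined for predictor j as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below is a hand-rolled NumPy version run on synthetic data (the `vif` helper and the data are illustrative assumptions); statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`. A VIF above roughly 10 is often treated as a warning sign, though that threshold is only a rule of thumb.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictor columns only, no intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
x1 = rng.normal(30, 5, size=40)              # hypothetical first price measure
x2 = 1.1 * x1 + rng.normal(0, 0.2, size=40)  # nearly a multiple of x1

vifs = vif(np.column_stack([x1, x2]))
r = np.corrcoef(x1, x2)[0, 1]

print("correlation:", r)
print("VIFs:", vifs)
```

With only two predictors, both VIFs equal 1 / (1 - r^2), so a correlation very close to 1 translates directly into a very large VIF.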

## Regression with one predictor to avoid near perfect collinearity

In [None]:
model_suv2 = smf.ols('mshare ~ Q("MSRP(in 1Ks)")', data=suv).fit()
print(model_suv2.summary().tables[1])

## Handling another set of highly correlated variables

- We run into the same problem if we include both city and highway miles per gallon, since both measure fuel efficiency.

### Another regression with near perfect collinearity

In [None]:
model_suv3 = smf.ols('mshare ~ city_mpg + hiway_mpg', data=suv).fit()
print(model_suv3.summary().tables[1])

### Correlation between city_mpg and hiway_mpg

- The two fuel-efficiency measures are highly correlated.

In [None]:
suv[['city_mpg', 'hiway_mpg']].corr()

### Regression with one predictor to avoid near perfect collinearity

- Drop one of the two. The remaining coefficient is still not statistically significant, probably because the dataset is small.

In [None]:
model_suv4 = smf.ols('mshare ~ city_mpg', data=suv).fit()
print(model_suv4.summary().tables[1])