
#### Importing Libraries

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Enhancing Regression Models

Objectives: be able to use the following.

Pre-processing:
- Handling non-numeric data
  - Ordinal: label encoder
  - Categorical: one-hot encoder (which level do you drop?)
  - Binary encoder
- Scaling

Creating new features:
- Interaction terms
- Polynomials
- Combinations of other variables

Evaluating:
- R^2 vs adjusted R^2
- AIC
- BIC
- Comparing model performance metrics: should each metric go up or down?


## Scenario: car seat sales

Description: simulated data set on sales of car seats<br>
Format: 400 observations on the following 11 variables
- Sales: unit sales at each location
- CompPrice: price charged by nearest competitor at each location
- Income: community income level
- Advertising: local advertising budget for company at each location
- Population: population size in region (in thousands)
- Price: price charged for car seat at each site
- ShelveLoc: quality of shelving location at site (Good | Bad | Medium)
- Age: average age of the local population
- Education: education level at each location
- Urban: whether the store is in an urban or rural location
- US: whether the store is in the US or not

 We will attempt to predict ${\tt Sales}$ (child car seat sales) in 400 locations based on a number of predictors.

#### Task
Before looking at the data, brainstorm with your neighbor which four variables you think *might* be related to sales.

In [2]:
df2 = pd.read_csv('Carseats.csv')
df2.head()

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,9.5,138,73,11,276,120,Bad,42,17,Yes,Yes
1,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
2,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
3,7.4,117,100,4,466,97,Medium,55,14,Yes,Yes
4,4.15,141,64,3,340,128,Bad,38,13,Yes,No


In [3]:
df2.dtypes

Sales          float64
CompPrice        int64
Income           int64
Advertising      int64
Population       int64
Price            int64
ShelveLoc       object
Age              int64
Education        int64
Urban           object
US              object
dtype: object

The ${\tt Carseats}$ data includes qualitative predictors such as ${\tt ShelveLoc}$, an indicator of the quality of the shelving location (that is, the space within a store in which the car seat is displayed) at each location. The predictor ${\tt ShelveLoc}$ takes on three possible values: ${\tt Bad}$, ${\tt Medium}$, and ${\tt Good}$.

Given a qualitative variable such as ${\tt ShelveLoc}$, the `statsmodels` formula interface generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.

In [4]:
x_vars=list(df2.columns[df2.columns!='Sales'])
pred = 'Income:Advertising+Price:Age +' +'+'.join(x_vars)
print(pred)

Income:Advertising+Price:Age +CompPrice+Income+Advertising+Population+Price+ShelveLoc+Age+Education+Urban+US
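
The formula above uses `:` to add interaction terms. As a minimal sketch of what that operator does, the example below compares `a:b` (product column only) with `a*b` (both main effects plus the product). The `toy` data frame and its coefficients are made up for illustration and are not the Carseats data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data; the values are invented for illustration only
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Income": rng.normal(70, 25, 100),
    "Advertising": rng.integers(0, 30, 100).astype(float),
})
toy["Sales"] = (
    5
    + 0.01 * toy["Income"]
    + 0.1 * toy["Advertising"]
    + 0.001 * toy["Income"] * toy["Advertising"]
    + rng.normal(0, 1, 100)
)

# `a:b` adds only the product column
only_product = smf.ols("Sales ~ Income:Advertising", data=toy)

# `a*b` is shorthand for a + b + a:b
full = smf.ols("Sales ~ Income*Advertising", data=toy)

print(only_product.exog_names)
print(full.exog_names)
```

In the notebook's formula the main effects are listed separately, so combining them with the `:` terms gives the same design as writing `Income*Advertising + Price*Age` plus the remaining predictors.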


In [5]:
model = smf.ols('Sales ~ ' + pred, data=df2)
print(model)

<statsmodels.regression.linear_model.OLS object at 0x1c1b4f0198>


In [6]:
results = model.fit()

In [7]:
print(results.summary())
# The formula API dummy-codes each categorical variable and, by default, uses its first level (alphabetically) as the reference category

                            OLS Regression Results                            
Dep. Variable:                  Sales   R-squared:                       0.876
Model:                            OLS   Adj. R-squared:                  0.872
Method:                 Least Squares   F-statistic:                     210.0
Date:                Tue, 03 Sep 2019   Prob (F-statistic):          6.14e-166
Time:                        14:32:10   Log-Likelihood:                -564.67
No. Observations:                 400   AIC:                             1157.
Df Residuals:                     386   BIC:                             1213.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               6.5756    

#### Task 
Again, with your neighbor:
- What issues do you see with this model?
- What would you change?

To learn how to set other coding schemes (or _contrasts_), see: http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/contrasts.html
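
As a small illustration of changing the coding scheme, the sketch below refits a model with a different reference level using `C(..., Treatment(reference=...))`. The `toy` data frame is synthetic and only borrows the column names; the effect sizes are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Carseats data (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ShelveLoc": rng.choice(["Bad", "Medium", "Good"], size=200),
    "Price": rng.normal(115, 20, size=200),
})
effect = toy["ShelveLoc"].map({"Bad": 0.0, "Medium": 2.0, "Good": 5.0})
toy["Sales"] = 13 - 0.05 * toy["Price"] + effect + rng.normal(0, 1, size=200)

# Default treatment coding: the alphabetically first level ("Bad") is the reference
default_fit = smf.ols("Sales ~ Price + ShelveLoc", data=toy).fit()

# Same model, but with "Medium" chosen as the reference level
medium_fit = smf.ols(
    "Sales ~ Price + C(ShelveLoc, Treatment(reference='Medium'))", data=toy
).fit()

print(default_fit.params.index.tolist())
print(medium_fit.params.index.tolist())
```

Changing the reference level only reparametrizes the model: the fitted values and R^2 are unchanged, but each dummy coefficient is now read as a difference from `Medium` rather than from `Bad`.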

### Polynomials

![polynomials](https://sc.cnbcfm.com/applications/cnbc.com/resources/files/2015/12/11/emotionandincome-01_0.png)

In [8]:
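from sklearn.preprocessing import OneHotEncoder  # converts a categorical column into one 0/1 column per category
from sklearn.preprocessing import LabelEncoder  # maps each category to an integer code
# BinaryEncoder is not part of sklearn.preprocessing; it is provided by the separate category_encoders package
from sklearn.preprocessing import LabelBinarizer  # converts a two-class column into a single column of ones and zeros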
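
For an ordinal variable such as `ShelveLoc`, the integer codes should follow the quality order. The sketch below, on a few hand-typed values, shows that `LabelEncoder` assigns codes alphabetically, and uses an ordered pandas `Categorical` as one possible way to impose Bad < Medium < Good. Treating `ShelveLoc` as ordinal is an assumption made for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few hand-typed values for illustration
shelf = pd.Series(["Bad", "Good", "Medium", "Medium", "Bad"], name="ShelveLoc")

# LabelEncoder sorts the classes alphabetically: Bad=0, Good=1, Medium=2
le = LabelEncoder()
alphabetical = le.fit_transform(shelf)

# An ordered Categorical lets us state the order explicitly: Bad=0, Medium=1, Good=2
ordered = pd.Categorical(shelf, categories=["Bad", "Medium", "Good"], ordered=True)
by_quality = ordered.codes

print(list(le.classes_), alphabetical)
print(list(ordered.categories), by_quality)
```

With the alphabetical codes, `Good` sits between `Bad` and `Medium`, which would mislead a linear model that treats the code as a quantity.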

In [11]:
df2.Urban.value_counts()

Yes    282
No     118
Name: Urban, dtype: int64

In [12]:
pre_obj_bin = LabelBinarizer()
type(pre_obj_bin)

sklearn.preprocessing.label.LabelBinarizer

In [13]:
pre_obj_bin.fit(df2['Urban'])

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

In [14]:
urban_bin = pre_obj_bin.fit_transform(df2['Urban'])
urban_bin

array([[1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
    

In [None]:
# Do the same for the US column
pre_obj_bin.fit(df2['US'])
us_bin = pre_obj_bin.fit_transform(df2['US'])
us_bin.shape

In [None]:
df2.columns

In [None]:
# Do the same for ShelveLoc (three classes, so the result has three columns)
shelver = LabelBinarizer()
shelver.fit(df2['ShelveLoc'])
shelver_bin = shelver.fit_transform(df2['ShelveLoc'])
shelver_bin

In [None]:
shelver.classes_

### One-hot encode ShelveLoc

In [None]:
onehot = OneHotEncoder()
# The encoder expects 2-D input, so pass a one-column DataFrame (not the encoder itself)
shelving = onehot.fit_transform(df2[['ShelveLoc']])
shelving

In [None]:
shelving.shape
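
On the question of which column to drop: a full one-hot encoding has one column per category, and those columns always sum to 1, which duplicates the intercept in a regression. A minimal sketch on a few hand-typed values, dropping the first category with either `OneHotEncoder(drop="first")` or `pd.get_dummies(..., drop_first=True)`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A few hand-typed values for illustration
shelf = pd.DataFrame({"ShelveLoc": ["Bad", "Good", "Medium", "Medium", "Bad"]})

# Full encoding: one column per category (Bad, Good, Medium)
full = OneHotEncoder().fit_transform(shelf).toarray()

# Drop the first category ("Bad") so it becomes the reference level
dropped = OneHotEncoder(drop="first").fit_transform(shelf).toarray()

# pandas equivalent
dummies = pd.get_dummies(shelf["ShelveLoc"], drop_first=True)

print(full.shape, dropped.shape)
print(list(dummies.columns))
```

Which level is dropped does not change the model's fit, only the interpretation: the remaining coefficients are differences from the dropped level.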

In [None]:
from sklearn.preprocessing import PolynomialFeatures

`medv ~ lstat + np.square(lstat)`
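
A formula like the one above adds a squared term directly in `statsmodels`. The sketch below fits it on synthetic data that reuses the names `medv` and `lstat` (the values and the curved relationship are invented for illustration), and shows that `PolynomialFeatures` builds the same squared column.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a curved relationship (made up for illustration)
rng = np.random.default_rng(2)
toy = pd.DataFrame({"lstat": rng.uniform(2, 35, 150)})
toy["medv"] = 40 - 2.0 * toy["lstat"] + 0.04 * toy["lstat"] ** 2 + rng.normal(0, 2, 150)

# Straight-line fit vs. fit with an added squared term
linear = smf.ols("medv ~ lstat", data=toy).fit()
quadratic = smf.ols("medv ~ lstat + np.square(lstat)", data=toy).fit()

# scikit-learn route: columns are [lstat, lstat^2] when include_bias=False
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(toy[["lstat"]])

print(linear.rsquared, quadratic.rsquared)
print(poly[:3])
```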

In [None]:
from sklearn.preprocessing import StandardScaler
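
A minimal sketch of scaling with `StandardScaler`, using the `Income` and `Price` values from the first five rows shown earlier. Each column is shifted to mean 0 and rescaled to standard deviation 1 (population standard deviation, `ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Income and Price from the first five rows of the data shown above
X = pd.DataFrame({
    "Income": [73.0, 48.0, 35.0, 100.0, 64.0],
    "Price": [120.0, 83.0, 80.0, 97.0, 128.0],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # returns a NumPy array

# The same transformation written out by hand
X_manual = (X - X.mean()) / X.std(ddof=0)

print(X_scaled.round(2))
```

Scaling does not change the fit of an ordinary least squares model, but it puts coefficients on a comparable footing and matters for regularized models.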

### Evaluating
#### Using `statsmodels`

![albon2](./img/aic-albon.png)

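**AIC**: The Akaike Information Criterion. Penalizes the log-likelihood for model complexity: $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood.


**BIC**: The Bayesian Information Criterion. Similar to the AIC, but its penalty also grows with the number of observations $n$: $BIC = k\ln(n) - 2\ln(\hat{L})$. It penalizes extra parameters more heavily than the AIC whenever $n \geq 8$.

Lower is better for both: when comparing models fit to the same data, prefer the one with the lower value.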

`results.aic`<br>
`results.bic`

![r-sqared](https://qph.fs.quoracdn.net/main-qimg-b932057f732059158062cf0ad9c1719f.webp)

![adj-r-sqr](https://i.stack.imgur.com/BTGK6.png)

`results.rsquared`<br>
`results.rsquared_adj`
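
To see why adjusted R^2 is reported alongside R^2, the sketch below adds a pure-noise predictor to a model fit on synthetic data. R^2 cannot go down when a predictor is added, while adjusted R^2 applies the penalty $1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, with $k$ the number of predictors. The variable names and data are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x only; `noise` is unrelated to y
rng = np.random.default_rng(4)
n = 200
toy = pd.DataFrame({"x": rng.normal(size=n), "noise": rng.normal(size=n)})
toy["y"] = 1 + 2 * toy["x"] + rng.normal(size=n)

small = smf.ols("y ~ x", data=toy).fit()
big = smf.ols("y ~ x + noise", data=toy).fit()

# Adjusted R^2 recomputed by hand for the larger model
adj_manual = 1 - (1 - big.rsquared) * (n - 1) / (n - big.df_model - 1)

print(small.rsquared, big.rsquared)
print(small.rsquared_adj, big.rsquared_adj, adj_manual)
```

Comparing the two printed rows shows how much of the R^2 gain from the extra column survives the penalty.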
