### Interpreting Coefficients

It is important not only that you can fit complex linear models, but also that you know which of the resulting coefficients you can interpret.

In this notebook, you will fit a few different models and use the quizzes below to match the appropriate interpretations to your coefficients when possible.

In some cases, the coefficients of your linear regression models wouldn't be kept because they are not statistically significant. But that is not the aim of this notebook - **this notebook is strictly to ensure you are comfortable with how to interpret coefficients when they are interpretable at all**.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('./house_prices.csv')
df.head()


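| | house_id | neighborhood | area | bedrooms | bathrooms | style | price |
|---|---|---|---|---|---|---|---|
| 0 | 1112 | B | 1188 | 3 | 2 | ranch | 598291 |
| 1 | 491 | B | 3512 | 5 | 3 | victorian | 1744259 |
| 2 | 5952 | B | 1134 | 3 | 2 | ranch | 571669 |
| 3 | 3525 | A | 1940 | 4 | 2 | ranch | 493675 |
| 4 | 5108 | B | 2208 | 6 | 4 | victorian | 1101539 |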


We will be fitting a number of different models to this dataset throughout this notebook.  For each model, there is a quiz question that will allow you to match the interpretations of the model coefficients to the corresponding values.  If there is no 'nice' interpretation, this is also an option!

### Model 1

`1.` For the first model, fit a model to predict `price` using `neighborhood`, `style`, and the `area` of the home.  Use the output to match the correct values to the corresponding interpretation in quiz 1 below.  Don't forget an intercept!  You will also need to build your dummy variables, and don't forget to drop one of the columns when you are fitting your linear model. It may be easiest to connect your interpretations to the values in the first quiz by creating the baselines as neighborhood C and home style **lodge**.

In [2]:
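# Build one 0/1 dummy column per neighborhood (A, B, C) and per style
# (lodge, ranch, victorian), then attach them to the original data.
dummies1_df = pd.get_dummies(df['neighborhood'])
dummies2_df = pd.get_dummies(df['style'])
new_df = df.join(dummies1_df)
newer_df = new_df.join(dummies2_df)
newer_df.head()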

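| | house_id | neighborhood | area | bedrooms | bathrooms | style | price | A | B | C | lodge | ranch | victorian |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1112 | B | 1188 | 3 | 2 | ranch | 598291 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 491 | B | 3512 | 5 | 3 | victorian | 1744259 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 5952 | B | 1134 | 3 | 2 | ranch | 571669 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 3525 | A | 1940 | 4 | 2 | ranch | 493675 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5108 | B | 2208 | 6 | 4 | victorian | 1101539 | 0 | 1 | 0 | 0 | 0 | 1 |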
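
The reason for dropping one dummy column per categorical variable can be seen in a minimal sketch. It uses a small hand-written list of neighborhoods rather than the notebook's data (an assumption made purely for illustration) and shows that the three neighborhood dummies always sum to 1, which is the same as the intercept column. Keeping all three alongside an intercept would therefore make the predictors perfectly collinear.

```python
# Hypothetical toy data, not taken from house_prices.csv.
neighborhoods = ["B", "B", "A", "C", "A"]
levels = ["A", "B", "C"]

# One-hot encode by hand: one 0/1 entry per level for each row.
dummies = [{lvl: int(n == lvl) for lvl in levels} for n in neighborhoods]

# Every row's dummies sum to 1, which duplicates the intercept column.
row_sums = [sum(row.values()) for row in dummies]
print(row_sums)

# Dropping the baseline level (C here) removes that redundancy; a row of
# all zeros now means "neighborhood C".
baseline = "C"
reduced = [{lvl: v for lvl, v in row.items() if lvl != baseline} for row in dummies]
print(reduced[3])  # the row that was neighborhood C
```

With C dropped, the coefficients on `A` and `B` are read as differences relative to neighborhood C, which is why the choice of baseline matters for the quiz interpretations.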


In [5]:
# Add the intercept column, then fit with C and lodge left out as baselines.
newer_df['intercept'] = 1
lm = sm.OLS(newer_df['price'], newer_df[['intercept', 'A', 'B', 'ranch', 'victorian', 'area']])
results = lm.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.919 |
| Model: | OLS | Adj. R-squared: | 0.919 |
| Method: | Least Squares | F-statistic: | 13720.0 |
| Date: | Tue, 15 Jan 2019 | Prob (F-statistic): | 0.0 |
| Time: | 03:59:33 | Log-Likelihood: | -80348.0 |
| No. Observations: | 6028 | AIC: | 160700.0 |
| Df Residuals: | 6022 | BIC: | 160700.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -1.983e+05 | 5540.744 | -35.791 | 0.000 | -2.09e+05 | -1.87e+05 |
| A | -194.2464 | 4965.459 | -0.039 | 0.969 | -9928.324 | 9539.832 |
| B | 5.243e+05 | 4687.484 | 111.844 | 0.000 | 5.15e+05 | 5.33e+05 |
| ranch | -1974.7032 | 5757.527 | -0.343 | 0.732 | -1.33e+04 | 9312.111 |
| victorian | -6262.7365 | 6893.293 | -0.909 | 0.364 | -1.98e+04 | 7250.586 |
| area | 348.7375 | 2.205 | 158.177 | 0.000 | 344.415 | 353.060 |

| | | | |
|---|---|---|---|
| Omnibus: | 114.369 | Durbin-Watson: | 2.002 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 139.082 |
| Skew: | 0.271 | Prob(JB): | 6.29e-31 |
| Kurtosis: | 3.509 | Cond. No. | 11200.0 |


- The predicted difference in the price of a home in neighborhood A as compared to neighborhood C, holding other variables constant: **-194.25**

- For every one unit increase in the area of a home, we predict the price of the home to increase by ___ (holding all other variables constant): **348.74**

- The predicted home price if the home is a lodge in neighborhood C with an area of 0: **-198,300**

- The predicted difference in price between a victorian and lodge home, holding all other variables constant, is ___: **-6262.74** (the lodge is predicted to be more expensive by 6262.74)
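
The interpretations above can be checked by plugging the Model 1 coefficients into the regression equation. The following is a minimal sketch: the coefficient values are copied from the summary table, while `predict_price` is a hypothetical helper written for illustration and is not part of statsmodels or the notebook.

```python
# Coefficients copied from the Model 1 summary. The intercept and B are
# printed there in scientific notation, so they carry only four digits.
coefs = {
    "intercept": -1.983e05,
    "A": -194.2464,
    "B": 5.243e05,
    "ranch": -1974.7032,
    "victorian": -6262.7365,
    "area": 348.7375,
}


def predict_price(area, neighborhood="C", style="lodge"):
    """Predicted price under Model 1; C and lodge are the baselines."""
    price = coefs["intercept"] + coefs["area"] * area
    price += coefs.get(neighborhood, 0.0)  # baseline C contributes 0
    price += coefs.get(style, 0.0)         # baseline lodge contributes 0
    return price


base = predict_price(2000)  # lodge in neighborhood C, area 2000
print(predict_price(0))                             # the intercept
print(predict_price(2000, neighborhood="A") - base)  # A vs. C
print(predict_price(2000, style="victorian") - base) # victorian vs. lodge
print(predict_price(2001) - base)                    # one more unit of area
```

Each difference reproduces a single coefficient (up to floating-point rounding) because every other term cancels, which is what "holding all other variables constant" means here.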




### Model 2

`2.` Now let's try a second model for predicting price.  This time, use `area` and `area squared` to predict price.  Also use the `style` of the home, but not `neighborhood` this time. You will again need to use your dummy variables, and add an intercept to the model. Use the results of your model to answer quiz questions 2 and 3.

In [6]:
newer_df['area squared'] = newer_df['area']*newer_df['area']

In [9]:
lm = sm.OLS(newer_df['price'], newer_df[['intercept', 'ranch', 'victorian', 'area squared', 'area']])
results = lm.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.678 |
| Model: | OLS | Adj. R-squared: | 0.678 |
| Method: | Least Squares | F-statistic: | 3173.0 |
| Date: | Tue, 15 Jan 2019 | Prob (F-statistic): | 0.0 |
| Time: | 04:26:04 | Log-Likelihood: | -84516.0 |
| No. Observations: | 6028 | AIC: | 169000.0 |
| Df Residuals: | 6023 | BIC: | 169100.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 1.855e+04 | 1.26e+04 | 1.467 | 0.142 | -6229.316 | 4.33e+04 |
| ranch | 9917.2547 | 1.27e+04 | 0.781 | 0.435 | -1.5e+04 | 3.48e+04 |
| victorian | 2509.3957 | 1.53e+04 | 0.164 | 0.870 | -2.75e+04 | 3.25e+04 |
| area squared | 0.0029 | 0.002 | 1.283 | 0.199 | -0.002 | 0.007 |
| area | 334.0146 | 13.525 | 24.696 | 0.000 | 307.501 | 360.528 |

| | | | |
|---|---|---|---|
| Omnibus: | 375.22 | Durbin-Watson: | 2.009 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 340.688 |
| Skew: | 0.519 | Prob(JB): | 1.05e-74 |
| Kurtosis: | 2.471 | Cond. No. | 43300000.0 |


- For every one unit increase in the area of the home, we predict the price to increase by ___: **it depends** (with `area squared` in the model, the change depends on the home's current area).

- For every one unit increase in the area of the home squared, the predicted increase in price is ___: **it depends** (`area squared` cannot change while `area` is held constant).

- Based on the results, do you think adding a higher order term for area is useful in predicting the price of the home? **No** (the `area squared` coefficient has a p-value of 0.199).

- The predicted difference between the price of a ranch home and a lodge, holding all other variables constant, is ___: **9917.25**



- With the higher order term, the coefficients associated with area and area squared are not easily interpretable. However, coefficients that are not associated with the higher order terms are still interpretable in the way you did earlier.
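
To see why the `area` and `area squared` coefficients have no single "per unit" reading, here is a minimal sketch using the Model 2 coefficients as printed in the summary. The helper names `area_part` and `one_unit_change` are hypothetical and exist only for this illustration; the areas in the loop are arbitrary example values.

```python
# Coefficients copied from the Model 2 summary (rounded as printed).
b_area = 334.0146
b_area_sq = 0.0029


def area_part(area):
    """Contribution of area and area squared to the predicted price."""
    return b_area * area + b_area_sq * area ** 2


def one_unit_change(area):
    """Predicted price change when area goes from `area` to `area + 1`."""
    return area_part(area + 1) - area_part(area)


# The change is b_area + b_area_sq * (2 * area + 1), so it grows with area.
for area in (1000, 2000, 3000):
    print(area, round(one_unit_change(area), 2))
```

Because the predicted change differs at each starting area, neither coefficient can be quoted on its own as "the" effect of one more unit of area.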

Comparing Model 1 and Model 2:

- The best model might include only the area and a dummy variable for neighborhood B vs. the other neighborhoods.
- Judging by the results of the two models you built, these are the only two predictors with significant coefficients in Model 1, so a model using just them would be simpler while still predicting well.

In [10]:
# Try one more model that also includes the neighborhoods, in response
#   to the prompt in question 3
lm = sm.OLS(newer_df['price'], newer_df[['intercept', 'A', 'B', 'ranch', 'victorian', 'area squared', 'area']])
results = lm.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.919 |
| Model: | OLS | Adj. R-squared: | 0.919 |
| Method: | Least Squares | F-statistic: | 11440.0 |
| Date: | Tue, 15 Jan 2019 | Prob (F-statistic): | 0.0 |
| Time: | 04:27:48 | Log-Likelihood: | -80345.0 |
| No. Observations: | 6028 | AIC: | 160700.0 |
| Df Residuals: | 6021 | BIC: | 160800.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -1.881e+05 | 7023.287 | -26.776 | 0.000 | -2.02e+05 | -1.74e+05 |
| A | -248.3127 | 4963.602 | -0.050 | 0.960 | -9978.750 | 9482.124 |
| B | 5.242e+05 | 4685.714 | 111.877 | 0.000 | 5.15e+05 | 5.33e+05 |
| ranch | 4458.3334 | 6361.432 | 0.701 | 0.483 | -8012.351 | 1.69e+04 |
| victorian | 1650.2605 | 7654.606 | 0.216 | 0.829 | -1.34e+04 | 1.67e+04 |
| area squared | 0.0027 | 0.001 | 2.374 | 0.018 | 0.000 | 0.005 |
| area | 333.5383 | 6.772 | 49.256 | 0.000 | 320.264 | 346.813 |

| | | | |
|---|---|---|---|
| Omnibus: | 57.788 | Durbin-Watson: | 2.002 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 70.025 |
| Skew: | 0.168 | Prob(JB): | 6.23e-16 |
| Kurtosis: | 3.407 | Cond. No. | 43300000.0 |


In [11]:
# Notice that R-squared is 0.919 again, the same as in Model 1 to three
#   decimals. The area squared coefficient is still small (0.0027, with
#   p = 0.018), so the higher order term adds little to the fit.
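
A small coefficient does not by itself mean a small effect, because `area squared` takes very large values. As a rough check, the sketch below compares the linear and quadratic contributions to the predicted price using the coefficients printed in the third summary. The helper `quadratic_share` is hypothetical, and the areas in the loop are arbitrary example values rather than statistics from the dataset.

```python
# Coefficients copied from the third model's summary (rounded as printed).
b_area = 333.5383
b_area_sq = 0.0027


def quadratic_share(area):
    """Fraction of the area-related prediction that comes from area squared."""
    linear = b_area * area
    quadratic = b_area_sq * area ** 2
    return quadratic / (linear + quadratic)


for area in (1000, 2000, 4000):
    linear = b_area * area
    quadratic = b_area_sq * area ** 2
    print(f"area={area}: linear={linear:,.0f}, "
          f"quadratic={quadratic:,.0f}, share={quadratic_share(area):.1%}")
```

For areas in this range the quadratic term stays a small fraction of the linear one, which is consistent with R-squared being unchanged from Model 1.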