<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">


# Regression and Classification with the Ames Housing Data


## Modeling, Part II

---



The aim of the project is to build a reliable estimator of a house's price from its characteristics.

The first part of the modeling estimates the value of homes from their fixed characteristics.
The second part determines how much value the changeable property characteristics add beyond what the fixed ones explain.


## Determining any value of changeable property characteristics unexplained by the fixed ones

---

Some examples of things that ARE renovatable:

- Roof and exterior features
- "Quality" metrics, such as kitchen quality
- "Condition" metrics, such as condition of garage
- Heating and electrical components

More generally, this covers anything you deem can be modified without major construction on the house.



- Train a model on pre-2010 data and evaluate its performance on the 2010 houses.
- Characterize your model. How well does it perform? What are the best estimates of price?


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## Modeling the renovatable features

---

The goal here is to determine any value of *changeable* property characteristics unexplained by the *fixed* ones.

Now that you have a model that estimates the price of a house from its static characteristics, we can move on to parts 2 and 3 of the plan: what are the costs and benefits of quality, condition, and renovations?

There are two specific requirements for these estimates:
1. The estimates of effects must be in terms of dollars added or subtracted from the house value. 
2. The effects must be on the variance in price remaining from the first model.

The residuals from the first model (training and testing) represent the variance in price unexplained by the fixed characteristics. Of that variance in price remaining, how much of it can be explained by the easy-to-change aspects of the property?

---

**Your goals:**
1. Evaluate the effect in dollars of the renovatable features.
2. How would your company use this second model and its coefficients to determine whether they should buy a property or not? Explain how the company can use the two models you have built to determine if they can make money.
3. Investigate how much of the variance in price remaining is explained by these features.
4. Do you trust your model? Should it be used to evaluate which properties to buy and fix up?
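One way to read the second goal is as simple arithmetic on the two models' outputs: the first model gives a baseline value from the fixed characteristics, and the second gives a dollar adjustment for the renovatable ones. The sketch below is only an illustration of that bookkeeping. The function name `estimate_flip_profit` and every dollar figure are hypothetical, not taken from the notebook.

```python
def estimate_flip_profit(fixed_value, adj_before, adj_after,
                         purchase_price, renovation_cost):
    """Combine the two models' outputs into a rough profit estimate.

    fixed_value:     first-model prediction from fixed characteristics ($)
    adj_before:      second-model prediction for the house as it is now ($)
    adj_after:       second-model prediction after the planned renovation ($)
    purchase_price:  what the company would pay ($)
    renovation_cost: estimated cost of the work ($)
    """
    value_before = fixed_value + adj_before
    value_after = fixed_value + adj_after
    return {
        "value_before": value_before,
        "value_after": value_after,
        "uplift": value_after - value_before,
        "profit": value_after - purchase_price - renovation_cost,
    }


# Hypothetical numbers, for illustration only.
result = estimate_flip_profit(
    fixed_value=150_000.0,
    adj_before=-4_000.0,
    adj_after=3_000.0,
    purchase_price=140_000.0,
    renovation_cost=5_000.0,
)
print(result)
```

Under this reading, a property is a candidate when `profit` is comfortably positive, with the margin depending on how much the second model's estimates are trusted.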

In [45]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
import patsy
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

In [46]:
# get out the residuals for training and testing, in terms of dollars.
tr_resids = y_train - optimal_ridge.predict(X_train)
te_resids = y_test - optimal_ridge.predict(X_test)

In [47]:
# things that could be done via renovation:
renovations = ['YrSold','RoofStyle','Exterior1st','ExterCond',
              'BsmtCond','HeatingQC','CentralAir','Electrical',
              'GarageFinish','GarageCond','PavedDrive',
              'ExterQual','BsmtQual','GarageQual','KitchenQual',
              'FireplaceQu']

In [48]:
renovation_f = '~ '+' + '.join(renovations)+' -1'

In [49]:
# make the renovation predictor matrix
Xren = patsy.dmatrix(renovation_f, data=house, return_type='dataframe')
Xren.shape

(1459, 68)

In [51]:
# standardize it
scaler = StandardScaler()
Xrens = scaler.fit_transform(Xren)

In [52]:
# split by year again
inds_recent = house.YrSold == 2010
Xren_tr, Xren_te = Xrens[~inds_recent.values], Xrens[inds_recent.values]
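The cells above fit the scaler on all rows before splitting by year, so the 2010 houses contribute to the means and standard deviations used for training. A common alternative is to fit the scaler on the training rows only and reuse it for the test rows. The sketch below shows that pattern on synthetic data; `X_demo` and `is_recent` are made-up stand-ins for the notebook's `Xren` and `inds_recent`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix (assumption: 100 rows, 3 columns).
rng = np.random.default_rng(2)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Pretend the last 20 rows are the held-out (most recent) sales.
is_recent = np.arange(100) >= 80

# Fit the scaler on training rows only, then apply it to both splits.
train_scaler = StandardScaler().fit(X_demo[~is_recent])
X_tr_demo = train_scaler.transform(X_demo[~is_recent])
X_te_demo = train_scaler.transform(X_demo[is_recent])

print(X_tr_demo.shape, X_te_demo.shape)
```

With this approach the training columns have exactly zero mean and unit variance, while the test columns are only approximately standardized, which is expected.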

In [53]:
ren_cv = RidgeCV(alphas=np.logspace(-5,4,300), cv=10)

In [54]:
Xren_tr.shape

(1284, 68)

In [55]:
ren_cv.fit(Xren_tr, tr_resids)



RidgeCV(alphas=array([1.00000e-05, 1.07177e-05, ..., 9.33039e+03, 1.00000e+04]),
    cv=10, fit_intercept=True, gcv_mode=None, normalize=False,
    scoring=None, store_cv_values=False)

In [56]:
ren_cv.alpha_

884.0733401525063

In [57]:
ren_cv.score(Xren_tr, tr_resids)

0.16528798449551152

In [58]:
ren_cv.score(Xren_te, te_resids)

0.1878237217720891

The renovatable features explain about 19% of the variance in price left unexplained by the fixed characteristics of the house (test R² ≈ 0.188, training R² ≈ 0.165).
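An R² on the residuals is a share of the leftover variance, not of the total variance in price. To convert it, multiply by the fraction the first model left unexplained. The sketch below checks that relationship on synthetic data. All names and numbers in it are assumptions for illustration, and plain `Ridge` stands in for the notebook's `RidgeCV`. The identity is exact on the training data, where the first-stage residuals average zero, and only approximate on held-out data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic prices driven by "fixed" and "renovatable" features plus noise.
rng = np.random.default_rng(0)
n = 500
X_fixed = rng.normal(size=(n, 3))
X_reno = rng.normal(size=(n, 2))
price = (180_000
         + X_fixed @ np.array([40_000.0, 25_000.0, 10_000.0])
         + X_reno @ np.array([8_000.0, 5_000.0])
         + rng.normal(scale=15_000, size=n))

# Stage 1: fixed features -> price. Stage 2: renovatable features -> residuals.
stage1 = Ridge(alpha=1.0).fit(X_fixed, price)
resids = price - stage1.predict(X_fixed)
stage2 = Ridge(alpha=1.0).fit(X_reno, resids)

r2_fixed = stage1.score(X_fixed, price)
r2_resid = stage2.score(X_reno, resids)

# Share of TOTAL price variance attributable to the second stage.
share_of_total = r2_resid * (1 - r2_fixed)

# R^2 of the two stages combined, computed directly.
r2_combined = r2_score(price, stage1.predict(X_fixed) + stage2.predict(X_reno))

print(r2_fixed, r2_resid, share_of_total, r2_combined)
```

Applying the same multiplication to this notebook's result requires the first model's R², which is reported in the earlier part of the analysis.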

In [62]:
renovation_coefs = pd.DataFrame(dict(coef=ren_cv.coef_,
                                     abscoef=np.abs(ren_cv.coef_),
                                     feature=Xren.columns))

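Below, the coefficients are sorted by magnitude.

The target here is the residual from the first model: actual price minus predicted price. A positive residual means the first model underestimated the house's price, and a negative one means it overestimated.

We can use these coefficients to estimate how much a renovation is worth in dollars. For example, to evaluate upgrading a fireplace from "typical/average" (TA) to "good" (Gd), we would take the difference between the two coefficients; that difference is the estimated change in value. Two cautions apply. First, the predictors were standardized, so each coefficient is in dollars per standard deviation of its dummy column and should be divided by that column's standard deviation before being read as dollars per category. Second, in the table below `FireplaceQu[T.Gd]` is negative (about -5,369) while `FireplaceQu[T.TA]` is positive (about 6,597), so the raw difference points the wrong way for an upgrade, which is worth keeping in mind when judging how far to trust individual coefficients.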
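Because the predictor matrix was standardized before fitting, a coefficient has to be divided by its column's standard deviation before a difference between two levels can be read as dollars. The sketch below illustrates that conversion on synthetic data with a single three-level quality feature. The levels, effect sizes, and variable names are all made up for the example, and a near-unpenalized `Ridge` stands in for the notebook's `RidgeCV`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400

# Synthetic quality levels with known dollar effects (assumed values).
quality = rng.choice(["Fa", "TA", "Gd"], size=n, p=[0.2, 0.5, 0.3])
true_effect = {"Fa": 0.0, "TA": 4_000.0, "Gd": 9_000.0}
resid = (np.array([true_effect[q] for q in quality])
         + rng.normal(scale=2_000, size=n))

# Dummy columns for TA and Gd; Fa is the baseline level.
levels = ["TA", "Gd"]
X = np.column_stack([quality == lvl for lvl in levels]).astype(float)

# Standardize, then fit on the standardized dummies.
scaler = StandardScaler().fit(X)
model = Ridge(alpha=1e-6).fit(scaler.transform(X), resid)

# Undo the scaling: dollars per unit change in each 0/1 dummy.
dollars = dict(zip(levels, model.coef_ / scaler.scale_))

# Estimated value of upgrading TA -> Gd (the simulated true value is 5,000).
upgrade_ta_to_gd = dollars["Gd"] - dollars["TA"]
print(dollars, upgrade_ta_to_gd)
```

In the notebook's own setting, the analogous step would be dividing `ren_cv.coef_` by `scaler.scale_` before comparing levels, keeping in mind that a strong ridge penalty also shrinks the coefficients toward zero.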

In [66]:
renovation_coefs.sort_values('abscoef', ascending=False, inplace=True)
renovation_coefs.head(15)

| | coef | abscoef | feature |
|---|---|---|---|
| 25 | 7897.29826 | 7897.29826 | BsmtCond[T.None] |
| 52 | 7897.29826 | 7897.29826 | BsmtQual[T.None] |
| 66 | 6596.769331 | 6596.769331 | FireplaceQu[T.TA] |
| 49 | 5912.081767 | 5912.081767 | ExterQual[T.TA] |
| 14 | 5519.15643 | 5519.15643 | Exterior1st[T.Plywood] |
| 63 | -5368.503639 | 5368.503639 | FireplaceQu[T.Gd] |
| 61 | 5313.700526 | 5313.700526 | KitchenQual[T.TA] |
| 38 | -5154.258612 | 5154.258612 | GarageFinish[T.RFn] |
| 53 | 3873.819388 | 3873.819388 | BsmtQual[T.TA] |
| 12 | -3400.112243 | 3400.112243 | Exterior1st[T.ImStucc] |
