<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3

### Regression and Classification with the Ames Housing Data

---

You have just joined a new "full stack" real estate company in Ames, Iowa. The strategy of the firm is two-fold:
- Own the entire process from the purchase of the land all the way to sale of the house, and anything in between.
- Use statistical analysis to optimize investment and maximize return.

The company is still small. Although investment is substantial, its short-term goals are oriented more towards purchasing and flipping existing houses than towards constructing entirely new ones. That said, the company has access to a large construction workforce operating at rock-bottom prices.

This project uses the [Ames housing data recently made available on Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [36]:
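import sys

import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Print arrays in full; threshold=np.nan raises a ValueError in recent NumPy versions
np.set_printoptions(threshold=sys.maxsize)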
pd.set_option('display.max_columns', 500)

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Estimating the value of homes from fixed characteristics.

---

Your superiors have outlined this year's strategy for the company:
1. Develop an algorithm to reliably estimate the value of residential houses based on *fixed* characteristics.
2. Identify characteristics of houses that the company can cost-effectively change/renovate with their construction team.
3. Evaluate the mean dollar value of different renovations.

These estimates can then be used to buy houses that are likely to sell for more than the cost of the purchase plus renovations.
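
The buying rule above amounts to a simple margin check. The sketch below is illustrative only: the function name `expected_margin` and all dollar figures are made-up assumptions, not values from the Ames data.

```python
def expected_margin(predicted_sale_price, purchase_price, renovation_cost):
    """Return the expected profit from buying, renovating, and reselling a house."""
    return predicted_sale_price - (purchase_price + renovation_cost)


# Hypothetical figures, for illustration only
margin = expected_margin(
    predicted_sale_price=210_000,
    purchase_price=170_000,
    renovation_cost=25_000,
)

# Buy only when the predicted sale price exceeds the total outlay
should_buy = margin > 0
print(margin, should_buy)
```

In practice the predicted sale price would come from the model built in this section, and its error should be weighed against the size of the margin.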

Your first job is to tackle #1. You have a dataset of housing sale data with a large number of features describing different aspects of each house. The data and the full description of its features are in two separate files:

    housing.csv
    data_description.txt
    
You need to build a reliable estimator for the price of the house given characteristics of the house that cannot be renovated. Some examples include:
- The neighborhood
- Square feet
- Bedrooms, bathrooms
- Basement and garage space

and many more. 

Some examples of things that **ARE renovate-able:**
- Roof and exterior features
- "Quality" metrics, such as kitchen quality
- "Condition" metrics, such as condition of garage
- Heating and electrical components

and generally anything you deem can be modified without having to undergo major construction on the house.

---

**Your goals:**
1. Perform any cleaning, feature engineering, and EDA you deem necessary.
2. Be sure to remove any houses that are not residential from the dataset.
3. Identify **fixed** features that can predict price.
4. Train a model on pre-2010 data and evaluate its performance on the 2010 houses.
5. Characterize your model. How well does it perform? What are the best estimates of price?

> **Note:** The EDA and feature engineering component of this project is not trivial! Be sure to always think critically and creatively. Justify your actions! Use the data description file!

In [37]:
# Load the data
house = pd.read_csv('./housing.csv')

### Changing all column names to lowercase

In [38]:
house.columns

Index([u'Id', u'MSSubClass', u'MSZoning', u'LotFrontage', u'LotArea',
       u'Street', u'Alley', u'LotShape', u'LandContour', u'Utilities',
       u'LotConfig', u'LandSlope', u'Neighborhood', u'Condition1',
       u'Condition2', u'BldgType', u'HouseStyle', u'OverallQual',
       u'OverallCond', u'YearBuilt', u'YearRemodAdd', u'RoofStyle',
       u'RoofMatl', u'Exterior1st', u'Exterior2nd', u'MasVnrType',
       u'MasVnrArea', u'ExterQual', u'ExterCond', u'Foundation', u'BsmtQual',
       u'BsmtCond', u'BsmtExposure', u'BsmtFinType1', u'BsmtFinSF1',
       u'BsmtFinType2', u'BsmtFinSF2', u'BsmtUnfSF', u'TotalBsmtSF',
       u'Heating', u'HeatingQC', u'CentralAir', u'Electrical', u'1stFlrSF',
       u'2ndFlrSF', u'LowQualFinSF', u'GrLivArea', u'BsmtFullBath',
       u'BsmtHalfBath', u'FullBath', u'HalfBath', u'BedroomAbvGr',
       u'KitchenAbvGr', u'KitchenQual', u'TotRmsAbvGrd', u'Functional',
       u'Fireplaces', u'FireplaceQu', u'GarageType', u'GarageYrBlt',
       u'GarageFinish',

In [39]:
# Lowercase every column name using the vectorized string accessor
house.columns = house.columns.str.lower()

house

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,lotconfig,landslope,neighborhood,condition1,condition2,bldgtype,housestyle,overallqual,overallcond,yearbuilt,yearremodadd,roofstyle,roofmatl,exterior1st,exterior2nd,masvnrtype,masvnrarea,exterqual,extercond,foundation,bsmtqual,bsmtcond,bsmtexposure,bsmtfintype1,bsmtfinsf1,bsmtfintype2,bsmtfinsf2,bsmtunfsf,totalbsmtsf,heating,heatingqc,centralair,electrical,1stflrsf,2ndflrsf,lowqualfinsf,grlivarea,bsmtfullbath,bsmthalfbath,fullbath,halfbath,bedroomabvgr,kitchenabvgr,kitchenqual,totrmsabvgrd,functional,fireplaces,fireplacequ,garagetype,garageyrblt,garagefinish,garagecars,garagearea,garagequal,garagecond,paveddrive,wooddecksf,openporchsf,enclosedporch,3ssnporch,screenporch,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,,Attchd,1993.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,Stone,186.0,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004.0,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,,,,0,8,2007,WD,Normal,307000
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240.0,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973.0,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,,,Shed,350,11,2009,WD,Normal,200000
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,,,,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939.0,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,,,,0,1,2008,WD,Normal,118000


In [40]:
house.describe()

Unnamed: 0,id,mssubclass,lotfrontage,lotarea,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,bsmtfinsf1,bsmtfinsf2,bsmtunfsf,totalbsmtsf,1stflrsf,2ndflrsf,lowqualfinsf,grlivarea,bsmtfullbath,bsmthalfbath,fullbath,halfbath,bedroomabvgr,kitchenabvgr,totrmsabvgrd,fireplaces,garageyrblt,garagecars,garagearea,wooddecksf,openporchsf,enclosedporch,3ssnporch,screenporch,poolarea,miscval,mosold,yrsold,saleprice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


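# Outliers
Outliers were detected, but attempts to remove them failed. The cell below only flags them: an entry is `True` when the value lies within three standard deviations of its column mean.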

In [41]:
# Flag values within 3 standard deviations of the column mean.
# True = within range, False = potential outlier (NaN also compares as False).
numeric = house.select_dtypes(include='number')
outliers = (numeric - numeric.mean()).abs() <= 3 * numeric.std()

outliers.head()

Unnamed: 0,id,mssubclass,lotfrontage,lotarea,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,bsmtfinsf1,bsmtfinsf2,bsmtunfsf,totalbsmtsf,1stflrsf,2ndflrsf,lowqualfinsf,grlivarea,bsmtfullbath,bsmthalfbath,fullbath,halfbath,bedroomabvgr,kitchenabvgr,totrmsabvgrd,fireplaces,garageyrblt,garagecars,garagearea,wooddecksf,openporchsf,enclosedporch,3ssnporch,screenporch,poolarea,miscval,mosold,yrsold,saleprice
0,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
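
One way the removal step could be completed is to keep only the rows where the chosen numeric columns pass the three-standard-deviation test. The sketch below uses a small made-up DataFrame (`toy`) rather than the housing data, and the helper name `drop_outlier_rows` is hypothetical. Applied to every numeric column of the real dataset, a filter like this would likely discard many rows, because sparse columns such as `poolarea` are almost entirely zero, so passing a hand-picked subset of columns may work better.

```python
import numpy as np
import pandas as pd


def drop_outlier_rows(df, columns, n_std=3.0):
    """Keep rows whose values in `columns` all lie within n_std standard deviations of the mean."""
    subset = df[columns]
    z = (subset - subset.mean()) / subset.std()
    # Treat missing values as "not an outlier" so NaN alone never drops a row
    within = (z.abs() <= n_std) | subset.isna()
    return df[within.all(axis=1)]


# Toy data: 19 ordinary lots and one extreme lot (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'lotarea': np.append(rng.normal(10_000, 1_000, size=19), 200_000),
    'grlivarea': rng.normal(1_500, 200, size=20),
})

cleaned = drop_outlier_rows(toy, ['lotarea'])
print(len(toy), len(cleaned))
```

Because the mean and standard deviation are themselves pulled by extreme values, a rule based on the median and interquartile range is a common, more robust alternative.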


In [42]:
# Drop rows whose zoning is not residential: A (agriculture), C (commercial),
# and I (industrial). FV is dropped too, although the data description lists
# it as "Floating Village Residential", so keeping it would also be defensible.
non_residential = ['A', 'C (all)', 'FV', 'I']
house.drop(house[house['mszoning'].isin(non_residential)].index, inplace=True)
house.shape

(1385, 81)

In [43]:
# sns.set(rc={'figure.figsize': (30,30)})
# sns.heatmap(house.corr(), annot=True, cmap='BrBG')

In [44]:
house[house.columns[house.isnull().any()]].isnull().sum()

lotfrontage      251
alley           1320
masvnrtype         5
masvnrarea         5
bsmtqual          37
bsmtcond          37
bsmtexposure      38
bsmtfintype1      37
bsmtfintype2      38
electrical         1
fireplacequ      641
garagetype        79
garageyrblt       79
garagefinish      79
garagequal        79
garagecond        79
poolqc          1378
fence           1108
miscfeature     1333
dtype: int64

In [45]:
# Fill the single missing electrical value with SBrkr, the most common category
house['electrical'] = house['electrical'].fillna('SBrkr')

# Fill missing numeric values with 0 (no recorded frontage, no garage, no veneer)
for col in ['lotfrontage', 'garageyrblt', 'masvnrarea']:
    house[col] = house[col].fillna(0.0)

# Fill the remaining missing values (all categorical) with the string 'None'
house.fillna(value='None', inplace=True)

# Cast mssubclass to object so that it is dummy-coded as a category
house['mssubclass'] = house['mssubclass'].astype('object')
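
Filling `lotfrontage` with 0 treats a missing measurement as a lot with no street frontage. A commonly used alternative, shown here only as a sketch on made-up data, is to fill each gap with the median frontage of the same neighborhood. The DataFrame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'neighborhood': ['CollgCr', 'CollgCr', 'CollgCr', 'OldTown', 'OldTown', 'OldTown'],
    'lotfrontage': [65.0, 68.0, np.nan, 50.0, np.nan, 60.0],
})

# Median frontage of each row's neighborhood (NaN values are ignored by median)
neighborhood_median = toy.groupby('neighborhood')['lotfrontage'].transform('median')

# Fill only the missing entries; observed values are left unchanged
toy['lotfrontage'] = toy['lotfrontage'].fillna(neighborhood_median)
print(toy)
```

Whether this improves the model on the real data would need to be checked; it is offered as an option, not as the approach used in this notebook.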

In [46]:
house.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1385 entries, 0 to 1459
Data columns (total 81 columns):
id               1385 non-null int64
mssubclass       1385 non-null object
mszoning         1385 non-null object
lotfrontage      1385 non-null float64
lotarea          1385 non-null int64
street           1385 non-null object
alley            1385 non-null object
lotshape         1385 non-null object
landcontour      1385 non-null object
utilities        1385 non-null object
lotconfig        1385 non-null object
landslope        1385 non-null object
neighborhood     1385 non-null object
condition1       1385 non-null object
condition2       1385 non-null object
bldgtype         1385 non-null object
housestyle       1385 non-null object
overallqual      1385 non-null int64
overallcond      1385 non-null int64
yearbuilt        1385 non-null int64
yearremodadd     1385 non-null int64
roofstyle        1385 non-null object
roofmatl         1385 non-null object
exterior1st      1385 no

# Select the fixed characteristics

In [47]:
# Select the fixed characteristics (those that cannot easily be renovated)
fixed_cols = ['id','mssubclass', 'mszoning', 'lotfrontage', 'lotarea','street', 'alley', 'lotshape', 'landcontour',
              'utilities','lotconfig', 'landslope', 'neighborhood', 'condition1','condition2','bldgtype', 'housestyle',
              'overallcond', 'yearbuilt', 'yearremodadd','masvnrtype','masvnrarea','foundation', 'totalbsmtsf',
              '1stflrsf', '2ndflrsf', 'grlivarea', 'bsmtfullbath', 'bsmthalfbath','fullbath', 'halfbath','bedroomabvgr',
              'kitchenabvgr', 'totrmsabvgrd','garagetype', 'garageyrblt', 'garagecars', 'garagearea', 'fireplaces',
              'mosold', 'yrsold']

# Copy so that later changes to `fixed` do not touch `house`
fixed = house[fixed_cols].copy()
fixed.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,lotconfig,landslope,neighborhood,condition1,condition2,bldgtype,housestyle,overallcond,yearbuilt,yearremodadd,masvnrtype,masvnrarea,foundation,totalbsmtsf,1stflrsf,2ndflrsf,grlivarea,bsmtfullbath,bsmthalfbath,fullbath,halfbath,bedroomabvgr,kitchenabvgr,totrmsabvgrd,garagetype,garageyrblt,garagecars,garagearea,fireplaces,mosold,yrsold
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,5,2003,2003,BrkFace,196.0,PConc,856,856,854,1710,1,0,2,1,3,1,8,Attchd,2003.0,2,548,0,2,2008
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,8,1976,1976,,0.0,CBlock,1262,1262,0,1262,0,1,2,0,3,1,6,Attchd,1976.0,2,460,1,5,2007
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,5,2001,2002,BrkFace,162.0,PConc,920,920,866,1786,1,0,2,1,3,1,6,Attchd,2001.0,2,608,1,9,2008
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,5,1915,1970,,0.0,BrkTil,756,961,756,1717,1,0,1,0,3,1,7,Detchd,1998.0,3,642,1,2,2006
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,5,2000,2000,BrkFace,350.0,PConc,1145,1145,1053,2198,1,0,2,1,4,1,9,Attchd,2000.0,3,836,1,12,2008


In [48]:
fixed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1385 entries, 0 to 1459
Data columns (total 41 columns):
id              1385 non-null int64
mssubclass      1385 non-null object
mszoning        1385 non-null object
lotfrontage     1385 non-null float64
lotarea         1385 non-null int64
street          1385 non-null object
alley           1385 non-null object
lotshape        1385 non-null object
landcontour     1385 non-null object
utilities       1385 non-null object
lotconfig       1385 non-null object
landslope       1385 non-null object
neighborhood    1385 non-null object
condition1      1385 non-null object
condition2      1385 non-null object
bldgtype        1385 non-null object
housestyle      1385 non-null object
overallcond     1385 non-null int64
yearbuilt       1385 non-null int64
yearremodadd    1385 non-null int64
masvnrtype      1385 non-null object
masvnrarea      1385 non-null float64
foundation      1385 non-null object
totalbsmtsf     1385 non-null int64
1stflrsf 

In [49]:
dummies = house[['mssubclass', 'mszoning', 'street', 'alley', 'lotshape', 
                 'landcontour', 'utilities','lotconfig', 'landslope', 'masvnrtype',
                 'neighborhood','condition1','condition2',  'bldgtype', 'housestyle','foundation', 'garagetype']]
dummies.head(10)

Unnamed: 0,mssubclass,mszoning,street,alley,lotshape,landcontour,utilities,lotconfig,landslope,masvnrtype,neighborhood,condition1,condition2,bldgtype,housestyle,foundation,garagetype
0,60,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,BrkFace,CollgCr,Norm,Norm,1Fam,2Story,PConc,Attchd
1,20,RL,Pave,,Reg,Lvl,AllPub,FR2,Gtl,,Veenker,Feedr,Norm,1Fam,1Story,CBlock,Attchd
2,60,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,BrkFace,CollgCr,Norm,Norm,1Fam,2Story,PConc,Attchd
3,70,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,,Crawfor,Norm,Norm,1Fam,2Story,BrkTil,Detchd
4,60,RL,Pave,,IR1,Lvl,AllPub,FR2,Gtl,BrkFace,NoRidge,Norm,Norm,1Fam,2Story,PConc,Attchd
5,50,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,,Mitchel,Norm,Norm,1Fam,1.5Fin,Wood,Attchd
6,20,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Stone,Somerst,Norm,Norm,1Fam,1Story,PConc,Attchd
7,60,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Stone,NWAmes,PosN,Norm,1Fam,2Story,CBlock,Attchd
8,50,RM,Pave,,Reg,Lvl,AllPub,Inside,Gtl,,OldTown,Artery,Norm,1Fam,1.5Fin,BrkTil,Detchd
9,190,RL,Pave,,Reg,Lvl,AllPub,Corner,Gtl,,BrkSide,Artery,Artery,2fmCon,1.5Unf,BrkTil,Attchd


In [50]:
dummies = pd.get_dummies(dummies)
dummies.head()

Unnamed: 0,mssubclass_20,mssubclass_30,mssubclass_40,mssubclass_45,mssubclass_50,mssubclass_60,mssubclass_70,mssubclass_75,mssubclass_80,mssubclass_85,mssubclass_90,mssubclass_120,mssubclass_160,mssubclass_180,mssubclass_190,mszoning_RH,mszoning_RL,mszoning_RM,street_Grvl,street_Pave,alley_Grvl,alley_None,alley_Pave,lotshape_IR1,lotshape_IR2,lotshape_IR3,lotshape_Reg,landcontour_Bnk,landcontour_HLS,landcontour_Low,landcontour_Lvl,utilities_AllPub,utilities_NoSeWa,lotconfig_Corner,lotconfig_CulDSac,lotconfig_FR2,lotconfig_FR3,lotconfig_Inside,landslope_Gtl,landslope_Mod,landslope_Sev,masvnrtype_BrkCmn,masvnrtype_BrkFace,masvnrtype_None,masvnrtype_Stone,neighborhood_Blmngtn,neighborhood_Blueste,neighborhood_BrDale,neighborhood_BrkSide,neighborhood_ClearCr,neighborhood_CollgCr,neighborhood_Crawfor,neighborhood_Edwards,neighborhood_Gilbert,neighborhood_IDOTRR,neighborhood_MeadowV,neighborhood_Mitchel,neighborhood_NAmes,neighborhood_NPkVill,neighborhood_NWAmes,neighborhood_NoRidge,neighborhood_NridgHt,neighborhood_OldTown,neighborhood_SWISU,neighborhood_Sawyer,neighborhood_SawyerW,neighborhood_Somerst,neighborhood_StoneBr,neighborhood_Timber,neighborhood_Veenker,condition1_Artery,condition1_Feedr,condition1_Norm,condition1_PosA,condition1_PosN,condition1_RRAe,condition1_RRAn,condition1_RRNe,condition1_RRNn,condition2_Artery,condition2_Feedr,condition2_Norm,condition2_PosA,condition2_PosN,condition2_RRAe,condition2_RRAn,condition2_RRNn,bldgtype_1Fam,bldgtype_2fmCon,bldgtype_Duplex,bldgtype_Twnhs,bldgtype_TwnhsE,housestyle_1.5Fin,housestyle_1.5Unf,housestyle_1Story,housestyle_2.5Fin,housestyle_2.5Unf,housestyle_2Story,housestyle_SFoyer,housestyle_SLvl,foundation_BrkTil,foundation_CBlock,foundation_PConc,foundation_Slab,foundation_Stone,foundation_Wood,garagetype_2Types,garagetype_Attchd,garagetype_Basment,garagetype_BuiltIn,garagetype_CarPort,garagetype_Detchd,garagetype_None
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
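
With its default settings, `pd.get_dummies` keeps one column per category, so the dummy columns of each feature always sum to 1 and are perfectly collinear with a model intercept. For an unregularized linear regression it is common to drop one level per feature. A minimal sketch on made-up data, assuming the same column naming as above:

```python
import pandas as pd

toy = pd.DataFrame({
    'mszoning': ['RL', 'RM', 'RH', 'RL'],
    'street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first level of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
```

Regularized models such as the lasso can tolerate the full set, so whether to drop a level depends on the model fitted later.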


In [51]:
# Append the dummy columns to the fixed-feature DataFrame.
# Note: re-running this cell appends the dummy columns a second time,
# so run it only once per session.
fixed = pd.concat([fixed, dummies], axis=1, join='inner')
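
The cell above cannot safely be run twice because `fixed` is overwritten with its own concatenation. One way to make such a step repeatable, sketched here on made-up data with hypothetical names (`base`, `build_features`), is to keep the untouched selection in its own variable and always build the combined frame from it.

```python
import pandas as pd

toy = pd.DataFrame({
    'lotarea': [8450, 9600, 11250],
    'mszoning': ['RL', 'RM', 'RL'],
})

base = toy[['lotarea', 'mszoning']]          # never reassigned
dummies = pd.get_dummies(toy[['mszoning']])


def build_features():
    """Rebuild the feature table from `base`, so repeated calls give the same result."""
    combined = pd.concat([base, dummies], axis=1, join='inner')
    return combined.drop(columns=['mszoning'])


features = build_features()
features = build_features()  # running the step again does not duplicate columns
print(list(features.columns))
```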

In [52]:
fixed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1385 entries, 0 to 1459
Columns: 154 entries, id to garagetype_None
dtypes: float64(3), int64(21), object(17), uint8(113)
memory usage: 607.3+ KB


In [53]:
fixed = fixed.drop(['id','mssubclass', 'mszoning', 'street', 'alley', 'lotshape', 'landcontour', 'utilities',
                    'lotconfig', 'landslope', 'masvnrtype','neighborhood','condition1','condition2',  'bldgtype',
                    'housestyle','foundation', 'garagetype'], axis=1)

In [54]:
list(fixed.columns)

['lotfrontage',
 'lotarea',
 'overallcond',
 'yearbuilt',
 'yearremodadd',
 'masvnrarea',
 'totalbsmtsf',
 '1stflrsf',
 '2ndflrsf',
 'grlivarea',
 'bsmtfullbath',
 'bsmthalfbath',
 'fullbath',
 'halfbath',
 'bedroomabvgr',
 'kitchenabvgr',
 'totrmsabvgrd',
 'garageyrblt',
 'garagecars',
 'garagearea',
 'fireplaces',
 'mosold',
 'yrsold',
 'mssubclass_20',
 'mssubclass_30',
 'mssubclass_40',
 'mssubclass_45',
 'mssubclass_50',
 'mssubclass_60',
 'mssubclass_70',
 'mssubclass_75',
 'mssubclass_80',
 'mssubclass_85',
 'mssubclass_90',
 'mssubclass_120',
 'mssubclass_160',
 'mssubclass_180',
 'mssubclass_190',
 'mszoning_RH',
 'mszoning_RL',
 'mszoning_RM',
 'street_Grvl',
 'street_Pave',
 'alley_Grvl',
 'alley_None',
 'alley_Pave',
 'lotshape_IR1',
 'lotshape_IR2',
 'lotshape_IR3',
 'lotshape_Reg',
 'landcontour_Bnk',
 'landcontour_HLS',
 'landcontour_Low',
 'landcontour_Lvl',
 'utilities_AllPub',
 'utilities_NoSeWa',
 'lotconfig_Corner',
 'lotconfig_CulDSac',
 'lotconfig_FR2',
 'lotconfi

In [55]:
fixed.describe()

Unnamed: 0,lotfrontage,lotarea,overallcond,yearbuilt,yearremodadd,masvnrarea,totalbsmtsf,1stflrsf,2ndflrsf,grlivarea,bsmtfullbath,bsmthalfbath,fullbath,halfbath,bedroomabvgr,kitchenabvgr,totrmsabvgrd,garageyrblt,garagecars,garagearea,fireplaces,mosold,yrsold,mssubclass_20,mssubclass_30,mssubclass_40,mssubclass_45,mssubclass_50,mssubclass_60,mssubclass_70,mssubclass_75,mssubclass_80,mssubclass_85,mssubclass_90,mssubclass_120,mssubclass_160,mssubclass_180,mssubclass_190,mszoning_RH,mszoning_RL,mszoning_RM,street_Grvl,street_Pave,alley_Grvl,alley_None,alley_Pave,lotshape_IR1,lotshape_IR2,lotshape_IR3,lotshape_Reg,landcontour_Bnk,landcontour_HLS,landcontour_Low,landcontour_Lvl,utilities_AllPub,utilities_NoSeWa,lotconfig_Corner,lotconfig_CulDSac,lotconfig_FR2,lotconfig_FR3,lotconfig_Inside,landslope_Gtl,landslope_Mod,landslope_Sev,masvnrtype_BrkCmn,masvnrtype_BrkFace,masvnrtype_None,masvnrtype_Stone,neighborhood_Blmngtn,neighborhood_Blueste,neighborhood_BrDale,neighborhood_BrkSide,neighborhood_ClearCr,neighborhood_CollgCr,neighborhood_Crawfor,neighborhood_Edwards,neighborhood_Gilbert,neighborhood_IDOTRR,neighborhood_MeadowV,neighborhood_Mitchel,neighborhood_NAmes,neighborhood_NPkVill,neighborhood_NWAmes,neighborhood_NoRidge,neighborhood_NridgHt,neighborhood_OldTown,neighborhood_SWISU,neighborhood_Sawyer,neighborhood_SawyerW,neighborhood_Somerst,neighborhood_StoneBr,neighborhood_Timber,neighborhood_Veenker,condition1_Artery,condition1_Feedr,condition1_Norm,condition1_PosA,condition1_PosN,condition1_RRAe,condition1_RRAn,condition1_RRNe,condition1_RRNn,condition2_Artery,condition2_Feedr,condition2_Norm,condition2_PosA,condition2_PosN,condition2_RRAe,condition2_RRAn,condition2_RRNn,bldgtype_1Fam,bldgtype_2fmCon,bldgtype_Duplex,bldgtype_Twnhs,bldgtype_TwnhsE,housestyle_1.5Fin,housestyle_1.5Unf,housestyle_1Story,housestyle_2.5Fin,housestyle_2.5Unf,housestyle_2Story,housestyle_SFoyer,housestyle_SLvl,foundation_BrkTil,foundation_CBlock,foundation_PConc,foundation_Slab,foundation_Stone,foundation_Wood,garagetype_2Types,garagetype_Attchd,garagetype_Basment,garagetype_BuiltIn,garagetype_CarPort,garagetype_Detchd,garagetype_None
count,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0,1385.0
mean,57.792058,10706.158845,5.607942,1970.048375,1984.1213,102.397834,1062.618773,1172.896751,336.516968,1515.031047,0.432491,0.059928,1.548736,0.368953,2.88231,1.048375,6.537906,1864.549458,1.753791,467.954513,0.627437,6.314079,2007.81083,0.376173,0.048375,0.002888,0.008664,0.101083,0.197834,0.042599,0.011552,0.041877,0.01444,0.037545,0.059206,0.029603,0.00722,0.020939,0.011552,0.831047,0.157401,0.002888,0.997112,0.036101,0.953069,0.01083,0.341516,0.027437,0.00722,0.623827,0.044765,0.035379,0.024549,0.895307,0.999278,0.000722,0.183394,0.067148,0.031769,0.002166,0.715523,0.945848,0.044765,0.009386,0.01083,0.311913,0.59278,0.084477,0.012274,0.001444,0.011552,0.041877,0.020217,0.108303,0.036823,0.072202,0.05704,0.020217,0.012274,0.035379,0.162455,0.006498,0.052708,0.029603,0.055596,0.080866,0.018051,0.05343,0.042599,0.015162,0.018051,0.027437,0.007942,0.034657,0.05704,0.857762,0.005776,0.013718,0.007942,0.018051,0.001444,0.00361,0.001444,0.00361,0.989892,0.000722,0.001444,0.000722,0.000722,0.001444,0.846931,0.021661,0.037545,0.024549,0.069314,0.108303,0.010108,0.508303,0.005776,0.00722,0.286643,0.026715,0.046931,0.103971,0.452708,0.420217,0.017329,0.00361,0.002166,0.004332,0.599278,0.012996,0.061372,0.005776,0.259206,0.05704
std,34.946347,10185.732173,1.125799,29.831024,20.554236,174.167142,443.785047,387.466021,436.241125,532.739682,0.521183,0.243449,0.551509,0.500336,0.820535,0.224508,1.624029,459.361788,0.755061,213.762881,0.649454,2.695583,1.326813,0.484599,0.214636,0.053683,0.092711,0.301548,0.39851,0.202025,0.106898,0.200381,0.119341,0.190162,0.236095,0.16955,0.084695,0.143231,0.106898,0.374846,0.36431,0.053683,0.053683,0.186609,0.211568,0.103541,0.474389,0.163412,0.084695,0.484599,0.206863,0.184803,0.154801,0.306268,0.02687,0.02687,0.387129,0.250369,0.175448,0.046507,0.451328,0.226398,0.206863,0.096462,0.103541,0.463442,0.491494,0.278202,0.110147,0.037987,0.106898,0.200381,0.140791,0.310875,0.188395,0.258916,0.232002,0.140791,0.110147,0.184803,0.369001,0.080378,0.22353,0.16955,0.229222,0.272728,0.133182,0.22497,0.202025,0.122243,0.133182,0.163412,0.088797,0.182976,0.232002,0.349421,0.075809,0.116361,0.088797,0.133182,0.037987,0.059997,0.037987,0.059997,0.100067,0.02687,0.037987,0.02687,0.02687,0.037987,0.360184,0.145625,0.190162,0.154801,0.254079,0.310875,0.100067,0.500112,0.075809,0.084695,0.452356,0.161307,0.211568,0.305333,0.497938,0.493772,0.130539,0.059997,0.046507,0.0657,0.490222,0.113299,0.240098,0.075809,0.438357,0.232002
min,0.0,1300.0,1.0,1872.0,1950.0,0.0,0.0,334.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,2006.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,43.0,7711.0,5.0,1953.0,1966.0,0.0,800.0,892.0,0.0,1120.0,0.0,0.0,1.0,0.0,2.0,1.0,5.0,1957.0,1.0,312.0,0.0,5.0,2007.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,63.0,9591.0,5.0,1971.0,1992.0,0.0,994.0,1095.0,0.0,1459.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1977.0,2.0,472.0,1.0,6.0,2008.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,79.0,11751.0,6.0,1999.0,2003.0,168.0,1306.0,1412.0,720.0,1784.0,1.0,0.0,2.0,1.0,3.0,1.0,7.0,2000.0,2.0,576.0,1.0,8.0,2009.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
max,313.0,215245.0,9.0,2010.0,2010.0,1378.0,6110.0,4692.0,2065.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,2010.0,4.0,1418.0,3.0,12.0,2010.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


_______________________________________________________________________________________________________________________

In [56]:
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import cross_val_score

#Training data
a= house[house['yrsold'] < 2010]
y = a['saleprice']
X = fixed[fixed['yrsold'] < 2010]

#Testing data
b = house[house['yrsold'] == 2010]
yTest = b['saleprice']
XTest = fixed[fixed['yrsold'] == 2010]

cols = list(fixed.columns)

# KBest using f_classif

In [57]:
# Scaling using standard scaler
ss = StandardScaler()
X_fitss = ss.fit_transform(X)

# Reuse the scaler fitted on the training data; do not refit on the test set
XTest_fitss = ss.transform(XTest)

# finding features using KBest
skb_f = SelectKBest(f_classif, k=5)
skb_f.fit(X_fitss, y)

kbest_fclass = pd.DataFrame([cols, list(skb_f.scores_)], 
                     index=['feature','f_classif']).T.sort_values('f_classif', ascending=False)
kbest_fclass.head(10)

Unnamed: 0,feature,f_classif
108,condition2_RRAn,inf
84,neighborhood_NridgHt,5.12569
83,neighborhood_NoRidge,4.1444
89,neighborhood_Somerst,3.41368
9,grlivarea,3.38479
106,condition2_PosN,3.30312
42,street_Pave,3.2995
41,street_Grvl,3.2995
1,lotarea,3.28119
5,masvnrarea,2.75151


In [58]:
# Skip the first row (condition2_RRAn, score inf) and keep the next 10 features
fclass = kbest_fclass.iloc[1:11, 0].tolist()
print(fclass)

['neighborhood_NridgHt', 'neighborhood_NoRidge', 'neighborhood_Somerst', 'grlivarea', 'condition2_PosN', 'street_Pave', 'street_Grvl', 'lotarea', 'masvnrarea', 'garagecars']
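
`f_classif` is an ANOVA F-test meant for categorical targets. With a continuous target such as `saleprice`, almost every distinct price is treated as its own class, which may contribute to the `inf` score in the table above. A minimal sketch of ranking features with `f_regression` instead, on synthetic data (the arrays and names below are made up for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
n = 200

# Three synthetic features; only the first one drives the target
X_demo = rng.normal(size=(n, 3))
y_demo = 50000 * X_demo[:, 0] + rng.normal(scale=1000, size=n)

# f_regression scores each feature by a univariate linear F-test
selector = SelectKBest(f_regression, k=1).fit(X_demo, y_demo)
top_feature = int(np.argmax(selector.scores_))
selected_mask = selector.get_support()
```

On the real data the same call would take `X_fitss` and `y` in place of the synthetic arrays; the resulting ranking would likely differ from the one shown above.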


## Fitting a linear regression on the selected features

In [59]:
# Wrap the scaled array in a DataFrame so columns can be selected by name
scaled_fclass = pd.DataFrame(X_fitss, columns = X.columns, index=X.index)
fclass_df = scaled_fclass[['neighborhood_NridgHt', 'neighborhood_NoRidge', 'neighborhood_Somerst',
                           'grlivarea', 'condition2_PosN', 'street_Pave', 'street_Grvl',
                           'lotarea', 'masvnrarea', 'garagecars']]

# Fit a linear regression on the selected scaled features
linear = LinearRegression()
linear.fit(fclass_df, y)


# Mean 5-fold cross-validated R^2
score = cross_val_score(linear, fclass_df, y, cv=5)
np.mean(score)

0.67470126510740713

# KBest using chi2

In [60]:
from sklearn.preprocessing import MinMaxScaler

# Use MinMaxScaler because chi2 does not accept negative values
mm = MinMaxScaler()
X_fitmm = mm.fit_transform(X)


skb_chi2 = SelectKBest(chi2, k=5)
skb_chi2.fit(X_fitmm, y)


kbest_chi2 = pd.DataFrame([cols, list(skb_chi2.scores_)], 
                     index=['feature', 'chi2 score']).T.sort_values('chi2 score', ascending=False)
kbest_chi2.head(10)

Unnamed: 0,feature,chi2 score
108,condition2_RRAn,1220.0
84,neighborhood_NridgHt,947.931
83,neighborhood_NoRidge,937.413
106,condition2_PosN,913.75
41,street_Grvl,912.75
89,neighborhood_Somerst,906.863
92,neighborhood_Veenker,855.857
126,foundation_Slab,844.037
48,lotshape_IR3,832.102
68,neighborhood_Blmngtn,788.007


In [61]:
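# Named chi2_features so the imported `chi2` scoring function is not overwritten
chi2_features = kbest_chi2.iloc[0:10, 0].tolist()
print(chi2_features)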

['condition2_RRAn', 'neighborhood_NridgHt', 'neighborhood_NoRidge', 'condition2_PosN', 'street_Grvl', 'neighborhood_Somerst', 'neighborhood_Veenker', 'foundation_Slab', 'lotshape_IR3', 'neighborhood_Blmngtn']


## Fitting a linear regression on the selected features

In [62]:
from sklearn.linear_model import Lasso  # used in the Lasso section below

# Wrap the min-max scaled array in a DataFrame so columns can be selected by name
scaled_chi2 = pd.DataFrame(X_fitmm, columns = X.columns, index=X.index)
# Named chi2_df so the imported `chi2` scoring function is not overwritten
chi2_df = scaled_chi2[['condition2_RRAn', 'neighborhood_NridgHt', 'neighborhood_NoRidge', 
                       'condition2_PosN', 'street_Grvl', 'neighborhood_Somerst', 'neighborhood_Veenker',
                       'foundation_Slab', 'lotshape_IR3', 'neighborhood_Blmngtn']]

linear = LinearRegression()
linear.fit(chi2_df, y)

# Mean 5-fold cross-validated R^2
score = cross_val_score(linear, chi2_df, y, cv=5)
np.mean(score)

0.32283118238972275

## Feature selection using Lasso

In [63]:
optimal_lasso = LassoCV(n_alphas=500, cv=5, verbose=1)
optimal_lasso.fit(X_fitss, y)

print(optimal_lasso.alpha_)

546.538232317


In [64]:
lasso = Lasso(alpha=optimal_lasso.alpha_)

model = lasso.fit(X_fitss, y)
lasso_scores = cross_val_score(lasso, X_fitss, y, cv=10)

print(lasso_scores)
print(np.mean(lasso_scores))


[ 0.87927916  0.86631135  0.87004831  0.75625206  0.83068253  0.72633422
  0.88854097  0.79964233  0.50921822  0.84379592]
0.79701050736
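
The scores above come from data that was standardized once, before cross-validation, so every validation fold has already influenced the scaler. One way to avoid this is to put the scaler and the model in a single pipeline, so the scaler is refit inside each fold. The sketch below uses synthetic data and an arbitrary `alpha`; both are assumptions for illustration, not the notebook's values.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)

# Synthetic unscaled features; the first two carry the signal
X_demo = rng.normal(loc=1500, scale=400, size=(300, 4))
y_demo = 100 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.normal(scale=5000, size=300)

# The scaler is fitted on the training part of each fold only
pipe = make_pipeline(StandardScaler(), Lasso(alpha=100.0))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
mean_r2 = pipe_scores.mean()
```

With the notebook's data, the unscaled `X` and `optimal_lasso.alpha_` would take the place of the synthetic inputs.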


In [65]:
lasso_coefs = pd.DataFrame({'variable':cols,
                            'coef':optimal_lasso.coef_,
                            'abs_coef':np.abs(optimal_lasso.coef_)})

lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)

lasso_coefs.head(10)

Unnamed: 0,abs_coef,coef,variable
9,31724.338479,31724.338479,grlivarea
84,14413.481565,14413.481565,neighborhood_NridgHt
3,11618.859125,11618.859125,yearbuilt
18,10421.977716,10421.977716,garagecars
90,8416.870528,8416.870528,neighborhood_StoneBr
2,8063.51599,8063.51599,overallcond
83,8019.570804,8019.570804,neighborhood_NoRidge
6,6231.058168,6231.058168,totalbsmtsf
5,6059.820047,6059.820047,masvnrarea
10,5805.361152,5805.361152,bsmtfullbath
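
Because the model was fit on standardized features, each coefficient above is the change in predicted price for a one-standard-deviation change in that feature, not for one unit. A rough sketch of converting back to dollars per original unit by dividing by the scaler's `scale_`. It uses synthetic data, hypothetical names, and a plain `LinearRegression`; with Lasso the shrinkage makes the converted value an approximation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(2)

# Synthetic living area in square feet; the true effect is $80 per square foot
area = rng.normal(loc=1500, scale=500, size=(500, 1))
price = 60000 + 80 * area[:, 0] + rng.normal(scale=2000, size=500)

scaler = StandardScaler().fit(area)
model = LinearRegression().fit(scaler.transform(area), price)

coef_per_std = model.coef_[0]                     # dollars per one standard deviation
coef_per_unit = coef_per_std / scaler.scale_[0]   # dollars per square foot
```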


In [66]:
scaled_lasso = pd.DataFrame(X_fitss, columns = X.columns, index=X.index)
lasso_scaled = scaled_lasso[['grlivarea', 'neighborhood_NridgHt', 'yearbuilt', 'garagecars', 
                      'neighborhood_StoneBr', 'overallcond', 'neighborhood_NoRidge', 
                      'totalbsmtsf', 'masvnrarea', 'bsmtfullbath']]
lasso_scaled.head()

Unnamed: 0,grlivarea,neighborhood_NridgHt,yearbuilt,garagecars,neighborhood_StoneBr,overallcond,neighborhood_NoRidge,totalbsmtsf,masvnrarea,bsmtfullbath
0,0.356162,-0.244736,1.096022,0.31106,-0.129046,-0.535356,-0.169244,-0.462109,0.521635,1.11443
1,-0.480729,-0.244736,0.198653,0.31106,-0.129046,2.14362,-0.169244,0.445481,-0.585025,-0.810208
2,0.498134,-0.244736,1.02955,0.31106,-0.129046,-0.535356,-0.169244,-0.319041,0.329664,1.11443
3,0.369238,-0.244736,-1.828736,1.629827,-0.129046,-0.535356,-0.169244,-0.685653,-0.585025,1.11443
4,1.267775,-0.244736,0.996314,1.629827,-0.129046,-0.535356,5.908618,0.183934,1.391153,1.11443


In [67]:
# Note: this refits `lasso` in place on the 10 selected columns only
lala = lasso.fit(lasso_scaled, y)
np.mean(cross_val_score(lala, lasso_scaled, y, cv=10))

0.77547871035357463

# Lasso gives the best cross-validated R2: 0.80 with all features, against 0.67 (f_classif) and 0.32 (chi2)

In [68]:
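# `lasso` was last fit on the 10-column subset, so predicting on the
# 136-column test matrix raised:
#   ValueError: shapes (164,136) and (10,) not aligned: 136 (dim 1) != 10 (dim 0)
# Refit on the full scaled training matrix so the column counts match.
lasso.fit(X_fitss, y)
lassoPredict = lasso.predict(XTest_fitss)
lassoPredict.shape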

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Predicted vs. actual sale price on the 2010 hold-out set
sns.set(rc={'figure.figsize': (10,10)})
g = sns.regplot(x=lassoPredict, y=yTest)
g.set(xlabel='Predicted Price', ylabel='Actual Price')
plt.show()


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Determine any value of *changeable* property characteristics unexplained by the *fixed* ones.

---

Now that you have a model that estimates the price of a house based on its static characteristics, we can move forward with parts 2 and 3 of the plan: what are the costs/benefits of quality, condition, and renovations?

There are two specific requirements for these estimates:
1. The estimates of effects must be in terms of dollars added or subtracted from the house value. 
2. The effects must be on the variance in price remaining from the first model.

The residuals from the first model (training and testing) represent the variance in price unexplained by the fixed characteristics. Of that variance in price remaining, how much of it can be explained by the easy-to-change aspects of the property?

---

**Your goals:**
1. Evaluate the effect in dollars of the renovate-able features. 
- How would your company use this second model and its coefficients to determine whether they should buy a property or not? Explain how the company can use the two models you have built to determine if they can make money. 
- Investigate how much of the variance in price remaining is explained by these features.
- Do you trust your model? Should it be used to evaluate which properties to buy and fix up?
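
The two requirements above can be sketched as a two-stage fit: model the price from fixed characteristics, then regress the leftover dollars on the renovate-able features. The example below is only an illustration on synthetic data; the arrays, the single "renovated" indicator, and the dollar amounts are all assumptions, and plain `LinearRegression` stands in for whatever models are actually chosen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(3)
n = 400

fixed_demo = rng.normal(size=(n, 2))                          # stand-in for fixed characteristics
reno_demo = rng.binomial(1, 0.5, size=(n, 1)).astype(float)   # stand-in: 1 if a feature was renovated
price_demo = (180000 + 30000 * fixed_demo[:, 0] + 10000 * fixed_demo[:, 1]
              + 15000 * reno_demo[:, 0] + rng.normal(scale=3000, size=n))

# Stage 1: price explained by fixed characteristics only
stage1 = LinearRegression().fit(fixed_demo, price_demo)
residuals = price_demo - stage1.predict(fixed_demo)

# Stage 2: residual dollars explained by the renovate-able feature
stage2 = LinearRegression().fit(reno_demo, residuals)
reno_effect_dollars = stage2.coef_[0]              # dollars added by the renovation
residual_r2 = stage2.score(reno_demo, residuals)   # share of remaining variance explained
```

Because the second stage is fit on unscaled features against residuals measured in dollars, its coefficients read directly as dollars added or subtracted. A purchase decision could then compare the asking price plus renovation cost against the stage 1 estimate plus the relevant stage 2 effects.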

In [None]:
# Finding mean square error

from sklearn.metrics import mean_squared_error as mse

mse(yTest, lassoPredict)

In [None]:
# Preview the house data without the first column and the last two columns

ww = house.iloc[:, 1:-2]
ww

In [None]:
reno=['overallqual', 'roofstyle','roofmatl', 'exterior1st', 'exterior2nd', 'exterqual', 'extercond', 'bsmtqual',
      'bsmtcond', 'bsmtexposure', 'bsmtfintype1', 'bsmtfinsf1','bsmtfintype2', 'bsmtfinsf2', 'bsmtunfsf', 'heating',
      'heatingqc', 'centralair', 'electrical', 'lowqualfinsf', 'kitchenqual', 'functional','fireplacequ', 
      'garagefinish', 'garagequal','garagecond', 'paveddrive', 'wooddecksf', 'openporchsf','enclosedporch',
      '3ssnporch', 'screenporch', 'poolarea', 'poolqc','fence', 'miscfeature', 'miscval']

reno = house[reno]
reno

In [None]:
reno.head(10)

In [None]:
# Categorical renovation features to one-hot encode. The numeric columns
# `overallqual` and `miscval` are left out: get_dummies passes numeric columns
# through unchanged, which would duplicate them in the concat below and then
# remove every copy in the drop step.
cat_cols = ['roofstyle','roofmatl', 'exterior1st','exterior2nd',  'exterqual',
            'extercond', 'bsmtqual','bsmtcond', 'bsmtexposure','bsmtfintype1', 'bsmtfintype2',
            'heating', 'heatingqc', 'centralair', 'electrical', 'kitchenqual',  'functional', 
            'fireplacequ','garagefinish','garagequal', 'garagecond', 'paveddrive', 'poolqc','fence', 
            'miscfeature']

dUmmies = pd.get_dummies(house[cat_cols], dummy_na=True)
dUmmies.head(10)

In [None]:
reno = pd.concat([reno, dUmmies], axis=1, join='inner')

In [None]:
# Drop the original categorical columns now that their dummies are present
reno = reno.drop(cat_cols, axis=1)

In [None]:
reno.info()

In [None]:
yrsoldless2010 = list(house[house['yrsold'] < 2010].index)
yrsold2010 = list(house[house['yrsold'] == 2010].index)

In [None]:
#Training data
a = house[house['yrsold'] < 2010]
s = a['saleprice']
R = reno.loc[yrsoldless2010]   # .loc replaces the deprecated .ix for label-based selection

#Testing data
b = house[house['yrsold'] == 2010]
sTest = b['saleprice']
RTest = reno.loc[yrsold2010]

cols = list(reno.columns)

In [None]:
s.shape

In [None]:
R.shape

In [None]:
# Scaling using standard scaler
ss = StandardScaler()
R_fitss = ss.fit_transform(R)

# Reuse the scaler fitted on the training data; do not refit on the test set
RTest_fitss = ss.transform(RTest)

In [None]:
# from sklearn.feature_selection import RFECV

# selector = RFECV(linear, step=1, cv=10)
# selector = selector.fit(R, s)

# support = selector.support_
# ranking = selector.ranking_

In [None]:
# pd.DataFrame([cols,  list(ranking)], index=['feature', 'ranking']).T.sort_values('ranking', ascending=True)

In [None]:
# rfecv_columns = np.array(cols)[selector.support_]
# rfecv_columns

In [None]:
# rfecvFeature = R[['heatingqc_Ex', 'heatingqc_Fa', 'heatingqc_Gd', 'heatingqc_Po',
#        'heatingqc_TA', 'electrical_FuseA', 'electrical_FuseF',
#        'electrical_FuseP', 'electrical_Mix', 'electrical_SBrkr',
#        'functional_Mod', 'functional_Sev', 'functional_Typ',
#        'fireplacequ_Ex', 'fireplacequ_Fa', 'fireplacequ_Gd',
#        'fireplacequ_None', 'fireplacequ_Po', 'fireplacequ_TA',
#        'garagefinish_Fin', 'garagefinish_None', 'garagefinish_RFn',
#        'garagefinish_Unf', 'garagequal_Ex', 'garagequal_Fa',
#        'garagequal_Gd', 'garagequal_None', 'garagequal_Po',
#        'garagequal_TA', 'garagecond_Ex', 'garagecond_Fa', 'garagecond_Gd',
#        'garagecond_None', 'garagecond_Po', 'garagecond_TA']]


# model = linear.fit(rfecvFeature, s)

# np.mean(cross_val_score(model, rfecvFeature, s, cv=10))


In [None]:
optimal_lasso = LassoCV(n_alphas=500, cv=5, verbose=1)
# `s` is the training sale price defined alongside R above
optimal_lasso.fit(R_fitss, s)

print(optimal_lasso.alpha_)

In [None]:
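lasso = Lasso(alpha=optimal_lasso.alpha_)

# `rfecvFeature` only exists in the commented-out RFECV cells above,
# so fit on the scaled renovation features instead
lasso.fit(R_fitss, s)
lassoScores = cross_val_score(lasso, R_fitss, s, cv=10)

print(lassoScores)
print(np.mean(lassoScores))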



<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. What property characteristics predict an "abnormal" sale?

---

The `SaleCondition` feature indicates the circumstances of the house sale. From the data file, we can see that the possibilities are:

       Normal	Normal Sale
       Abnorml	Abnormal Sale -  trade, foreclosure, short sale
       AdjLand	Adjoining Land Purchase
       Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit	
       Family	Sale between family members
       Partial	Home was not completed when last assessed (associated with New Homes)
       
One of the executives at your company has an "in" with higher-ups at the major regional bank. His friends at the bank have made him a proposal: if he can reliably indicate what features, if any, predict "abnormal" sales (foreclosures, short sales, etc.), then in return the bank will give him first dibs on the pre-auction purchase of those properties (at a dirt-cheap price).

He has tasked you with determining (and adequately validating) which features of a property predict this type of sale. 

---

**Your task:**
1. Determine which features predict the `Abnorml` category in the `SaleCondition` feature.
- Justify your results.

This is a challenging task that tests your ability to perform classification analysis in the face of severe class imbalance. You may find that simply running a classifier on the full dataset to predict the category ends up useless: when there is bad class imbalance classifiers often tend to simply guess the majority class.

It is up to you to determine how you will tackle this problem. I recommend doing some research to find out how others have dealt with the problem in the past. Make sure to justify your solution. Don't worry about it being "the best" solution, but be rigorous.

Be sure to indicate which features are predictive (if any) and whether they are positive or negative predictors of abnormal sales.
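
One common way to handle the imbalance described above is to reweight the classes and to score with metrics that do not reward guessing the majority class. The sketch below is one possible starting point, not a recommended solution: it uses synthetic data with a rare positive class, `LogisticRegression(class_weight='balanced')`, and stratified folds, and every name and value in it is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(4)
n = 2000

# Two synthetic features; the rare positive class is driven positively by feature 0
X_demo = rng.normal(size=(n, 2))
logits = -3.0 + 1.5 * X_demo[:, 0]
y_demo = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

# 'balanced' weights each class inversely to its frequency
clf = LogisticRegression(class_weight='balanced')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recall_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='recall')
auc_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='roc_auc')

# The sign of each coefficient shows whether a feature raises or lowers
# the predicted odds of the positive class
clf.fit(X_demo, y_demo)
coef_signs = np.sign(clf.coef_[0])
```

For the housing data, the target would be an indicator for `Abnorml` and the features would need to be standardized for the coefficient magnitudes to be comparable.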

In [None]:
# A: