# House Prices: Advanced Regression Techniques
The purpose of this exercise is to get familiar with both Jupyter and boosting techniques, using this [Kaggle challenge](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/). The goal of the challenge is to predict the sales price of each house. Submissions are scored on the RMSE between the logarithm of the predicted price and the logarithm of the observed sales price.
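As a small illustration of that metric, here is a minimal sketch of the RMSE on log prices. The function name `rmse_of_logs` and the prices are made up for this example; they are not part of the challenge material.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(observed) and log(predicted) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))

# Hypothetical prices, for illustration only
observed = [200000, 150000, 300000]
predicted = [220000, 150000, 270000]
print(rmse_of_logs(observed, predicted))
```

Because the error is taken on the logarithm, being 10% off on a cheap house costs as much as being 10% off on an expensive one.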

## Data
The provided data contains four files: 
* train.csv - the training set
* test.csv - the test set
* data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
* sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

Since the data_description file gives a full description, let's see what information we have available.

In [2]:
# Usual suspects
import numpy as np
import os # operating system dependent functionality.
import pandas as pd

# Classifiers and ML functionalities
from sklearn.ensemble import AdaBoostClassifier # Import AdaBoost classifier
from sklearn.svm import SVC # Importing the Support Vector Classifier
from sklearn.tree import DecisionTreeClassifier # Importing the Decision Tree Classifier
from xgboost import XGBClassifier # Importing XGBoost
from logitboost import LogitBoost # Importing LogitBoost
from sklearn.model_selection import train_test_split # Import train_test_split function

# Metrics / bookkeeping
from sklearn import metrics # Metrics module; this provides the mean squared error
from math import sqrt # Taking the square root of the MSE gives us the RMSE
from math import log # Edit 2019-08-10, added log since the evaluation is the RMSE of the log
import time # Keeping track of running time

In [3]:
# Read in a plain text file
with open(os.path.join("D:\\Training\\Kaggle\\House price prediction\\Data", "data_description.txt"), "r") as f:
    text = f.read()
    print(text)

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

We have data on the house type, the surroundings and neighbourhood, the lot (size, shape, contour) and the access to it. There is info about available utilities and proximity to features (big streets/railroad/parks). Next to that, there is information on the house itself: material and material quality, condition, age, roof, foundation, basement, heating/air/electricity, area per floor, bathrooms/kitchen/bedrooms. Finally there are some extras, such as garage, fireplaces, porch, pool and fences.

All in all, quite an extensive list of features. I imagine some need to be grouped in order to make sense. However, before diving into that, and since the idea is to get familiar with boosting, the next step is to find out what the idea behind boosting is.

## Boosting
Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones [(Wikipedia)](https://en.wikipedia.org/wiki/Boosting_(machine_learning)).
Since we have the observed sales data, it makes sense to use supervised learning. New for me are _ensemble_ algorithms. Here multiple models (all with their own hypothesis) are trained and the combined prediction gives a better score than the individual ones. This is based upon a paper from [Michael Kearns (1988)](http://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf) around the question "Can a set of weak learners create a single strong learner?" and the affirmative answer of [Robert E. Schapire (1990)](https://web.archive.org/web/20121010030839/http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf). Note that next to boosting, there are also bagging (bootstrap aggregation) and stacking inside the machine learning ensemble family.

Boosting is not a fixed set of algorithms, but centres on the idea of using multiple weak learning classifiers, which are aggregated into a final classifier using the accuracy of each weak classifier as its weight. AdaBoost was one of the first algorithms (developed by Schapire and Y. Freund), followed by many others, usually in the same framework as AdaBoost: boosting performs gradient descent in a function space using a convex cost function. Hence, the assumption is a convex cost function on which a gradient descent algorithm is applied. The latter is remarkably similar to [Column Generation](https://www.or.rwth-aachen.de/research/publications/colgen.pdf), where a subgradient algorithm is usually used to solve the Lagrangian relaxation. This is something we (myself with three others) applied in a [case study](https://github.com/rowtricker/SeminarEUR) for the Dutch rail company (NS/Nedtrain).
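To make "gradient descent in a function space" a bit more concrete, here is a minimal sketch under my own assumptions: squared loss, depth-1 regression trees as weak learners, and synthetic data. With squared loss the negative gradient is simply the residual, so each round fits a small tree to what the current model still gets wrong. This is an illustration of the idea, not the exact procedure of any of the libraries used in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
losses = []
for _ in range(50):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
    residual = y - prediction
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    # Take a small step in the direction of the fitted weak learner
    prediction += learning_rate * stump.predict(X)
    losses.append(float(np.mean((y - prediction) ** 2)))

print(losses[0], losses[-1])
```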

Testing all the different boosting algorithms would be a stretch, therefore I will limit this to the following three (plus one optional):
* AdaBoost, since it is the first
* XGBoost, which seems to be the most popular recently
* LogitBoost, which uses a logistic cost function. Since this Kaggle challenge is scored on the log of the price, perhaps it's a good fit.
* (Optional) Gradient Tree Boosting, next to XGBoost and AdaBoost among the most popular inside Kaggle competitions.

### AdaBoost
Ignoring the mathematical details, AdaBoost works in principle through these steps:
1. Select a random training subset.
2. Iteratively train the model, selecting the training set based upon the accuracy of the predictions in the previous round.
  * A higher weight gets assigned to observations that were wrongly classified, in order to increase the chance that these observations are picked in the next round.
  * In each iteration the weight of the trained classifier is set according to its accuracy: the higher the accuracy, the higher the weight.
  * Iteration is repeated until the complete training data fits without any errors or the specified maximum number of iterations is reached.
3. The final classification is done by "voting" over all the trained learners.
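The steps above can be sketched in a few lines. This is a minimal sketch under my own assumptions: a two-class problem with labels -1/+1, decision stumps as weak learners, synthetic data, and reweighting of the observations instead of resampling a subset. It is not the scikit-learn implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, labels recoded to -1 / +1
X, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(y), 1.0 / len(y))  # every observation starts equally important
stumps, alphas = [], []
for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    # Weighted error of this weak learner (clipped to avoid division by zero)
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1.0)
    if err >= 0.5:
        break  # no better than chance, stop boosting
    alpha = 0.5 * np.log((1 - err) / err)  # more accurate learner, larger vote
    # Misclassified observations (y * pred == -1) get a higher weight
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final classification: weighted vote over all weak learners
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.sign(votes)
print("training accuracy:", np.mean(ensemble_pred == y))
```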

Let's get started by creating the train and test sets.

In [4]:
df = pd.read_csv("D:\\Training\\Kaggle\\House price prediction\\Data\\train.csv")
df # Checking the file

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,307000
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,...,0,,,Shed,350,11,2009,WD,Normal,200000
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,...,0,,,,0,1,2008,WD,Normal,118000


The data is not yet ideal, with quite a few missing values. Next to that, there are columns with categorical and numerical data. Since this is about exploring the algorithm, first I'm going to test the algorithm with numerical data, thus ignoring the categorical data, and will replace the missing values by 0.

In [6]:
# Transforming MSSubClass, since this is numeric but still categorical
mapper = {20: 'class1', 30: 'class2', 40: 'class3', 45: 'class4', 50: 'class5', 60: 'class6', 70: 'class7', 
          75: 'class8', 80: 'class9', 85: 'class10', 90: 'class11', 120: 'class12', 150: 'class13', 160: 'class14', 
          180: 'class15', 190: 'class16'}
df1 = df.copy()
df1[['MSSubClass']] = df1[['MSSubClass']].replace(mapper)
# Turning every categorical value into dummies
df1 = pd.get_dummies(df1)
# Setting missing values to 0 and storing the result in df2
df2 = df1.copy().fillna(0)
# To make it easy, we split up in x and y (data and target)
x = df2.drop('SalePrice', axis=1)
# x holds every column except SalePrice (the Id column is still included); y is the target
y = df2[['SalePrice']]
# Split dataset into training set and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3) # 70% training and 30% test

Now that the train and test sets are created, we can start building the classifier. For the [classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html), there are 5 parameters:
* base_estimator. This is the base estimator from which the boosted ensemble is built.
* n_estimators. The maximum number of estimators used.
* learning_rate. Learning rate shrinks the contribution of each classifier by learning_rate.
* algorithm. The used algorithm, either SAMME or SAMME.R. Difference seems to be that the .R is real/continuous, whereas the other one is discrete.
* random_state. Can be used to set the seed used by the random number generator.

Based on this, the first three are the most interesting. One can plug in one's own estimator via _base_estimator_; for the other two, let's see what we get with some experimentation.

In [7]:
start_time = time.time()
# Having created the data sets, we can create the AdaBoost classifier
# Create AdaBoost classifier object
adaboostClassifier = AdaBoostClassifier(n_estimators=1000,
                         learning_rate=0.1)
# Train AdaBoost classifier
adaboostModel = adaboostClassifier.fit(x_train, y_train.values.ravel())

#Predict the response for test dataset
y_pred = adaboostModel.predict(x_test)

# Checking the accuracy of the prediction
print("RMSE: ",sqrt(metrics.mean_squared_error(np.log(y_test), np.log(y_pred))))
print("Executed in: %s seconds." % (time.time() - start_time))

RMSE:  0.29729645941689253
Executed in: 32.65752077102661 seconds.


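On a first attempt, I got an RMSE of 0.256 (the rerun shown above gives 0.297; the train/test split is random, so the score shifts between runs). That is neither good nor terrible; for reference, 85% of the entries at Kaggle score below 0.15. However, given this is just using the standard variables and we set missing values to 0, it's a start.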
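One thing worth keeping in mind is that a classifier treats every distinct sales price as its own class. A possible alternative, which is not what was run above, is scikit-learn's `AdaBoostRegressor` trained on the log of the price, so the model optimises on the same scale the challenge is scored on. The sketch below uses made-up data (a fake area and quality score), so the resulting number says nothing about the Kaggle data.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data: two numeric features and a positive "price"
rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=400)
quality = rng.integers(1, 11, size=400)
price = 1000 * area * (1 + 0.1 * quality) * rng.lognormal(0, 0.1, size=400)
X = np.column_stack([area, quality])

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.3, random_state=0)

# Fit on log(price); predictions are then log prices as well
model = AdaBoostRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
model.fit(X_train, np.log(y_train))
log_pred = model.predict(X_test)

rmse_log = np.sqrt(mean_squared_error(np.log(y_test), log_pred))
print("RMSE on log prices:", rmse_log)
```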

The first run was pretty fast, with an execution time just above 1 second; the run shown above, with 1000 estimators, takes about 33 seconds. After toying a bit with different inputs, it seems that execution time depends mostly on the number of estimators. At this point, the RMSE does not change that much for different settings.

We can use any classifier instead of the default, the DecisionTreeClassifier. Next, we're going to try the Support Vector Classifier. The SVC has a lot of [parameters](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). From these parameters, _kernel_, _tol_, _probability_ and _max_iter_ seem the most needed. Here _kernel_ sets the kernel to use, picking from ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. _Probability_ determines whether probability estimates are enabled. _Tol_ sets the tolerance level, while _max_iter_ sets the maximum number of iterations. As one of the breakthroughs in classifiers was the sigmoid function, that is the kernel used here as well. I've used SVM in Java before, where it was slow (30 minutes for >1000 observations), so I'm curious how that goes here.
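A minimal sketch of what plugging an SVC into AdaBoost could look like is given below. It runs on small synthetic data with settings I picked for illustration (5 estimators, standardised features), not the ones used for the timings in this notebook. The SVC is passed as the first positional argument because the keyword for it differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic two-class problem, for illustration only
X, y = make_classification(n_samples=200, n_features=5, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)
X = StandardScaler().fit_transform(X)  # SVMs are sensitive to feature scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Sigmoid kernel; probability=True so probability estimates are available
svc = SVC(kernel='sigmoid', probability=True, tol=1e-3, max_iter=1000)
boosted_svc = AdaBoostClassifier(svc, n_estimators=5, learning_rate=0.1,
                                 random_state=0)
boosted_svc.fit(X_train, y_train)
svc_pred = boosted_svc.predict(X_test)
print("accuracy:", (svc_pred == y_test).mean())
```

Each boosting round fits a full SVC, which is why the running time grows so quickly compared to decision trees.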

Execution time is between 20 and 90 minutes depending on the settings, while the results are worse. Let's first prepare the data set to make it worth waiting for. Since at the moment the categorical variables are ignored, the first step is to turn these into dummy variables and see how the algorithm handles the overload of information.
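As a small illustration of what turning a categorical column into dummy variables does, here is `pd.get_dummies` applied to a toy frame. The three rows reuse MSZoning and LotArea values from the table shown earlier; the frame name `toy` is just for this example.

```python
import pandas as pd

# Toy frame with one categorical and one numerical column
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM', 'RL'],
    'LotArea': [8450, 6120, 9600],
})

# Each category becomes its own indicator column; numeric columns pass through
encoded = pd.get_dummies(toy)
print(encoded)
```

Every distinct category adds a column, so a data set with many categorical features quickly becomes very wide.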

The RMSE is going down quite a bit, from 0.317 to 0.260. Still much room for improvement, but let's explore the other algorithms before really driving down the RMSE.

## XGBoost
XGBoost stands for extreme gradient boosting and is built upon the same idea as AdaBoost. The creator of this _package_ (!) gives a nice explanation of its history and workings at a [machine learning meetup group](https://www.youtube.com/watch?time_continue=2&v=Vly8xGnNiWs). 

In [10]:
# XGboost can handle missing data by itself, so we use df1 which still has the missing values
# To make it easy, we split up in x and y (data and target)
x = df1.drop(['SalePrice'], axis=1)
# x holds every column except SalePrice (the Id column is still included); y is the target
y = df1[['SalePrice']]
# Split dataset into training set and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3) # 70% training and 30% test

start_time = time.time()
xgboostModel = XGBClassifier() 
xgboostModel.fit(x_train, y_train.values.ravel()) # fitting the model to the training data
print(xgboostModel)


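# Predict the response for the test dataset
# Note: ntree_limit=4 restricts the prediction to the first 4 trees; newer XGBoost releases replace it with iteration_range
y_pred = xgboostModel.predict(x_test, ntree_limit=4)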

# Checking the accuracy of the prediction
print("RMSE: ",sqrt(metrics.mean_squared_error(np.log(y_test), np.log(y_pred))))
print("Executed in: %s seconds." % (time.time() - start_time))

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
RMSE:  0.3066449952056227
Executed in: 394.27342534065247 seconds.


With the default settings and without any changes in the data, XGBoost immediately improves on the score of AdaBoost, going from 0.260 to 0.243 (the run shown above, which predicts with only the first 4 trees via `ntree_limit=4`, scores 0.307). With an execution time of around 6 minutes, that seems a fair tradeoff.
When printing the model we can see that XGboost gives ample opportunity to tweak. [Here](https://xgboost.readthedocs.io/en/latest/parameter.html) a complete overview is given of the available parameters. To list a few:
 * booster, sets the booster to use. Above the default, _gbtree_, is used. Other options are _gblinear_ and _dart_; _dart_ and _gbtree_ use tree-based models, while _gblinear_ sticks to linear models.
 * eta, similar to the _learning rate_ in AdaBoost, this sets the step size shrinkage used in each update to prevent overfitting.
 * gamma, minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma, the more conservative the algorithm.
 * grow_policy, controls the way new nodes are added to the tree. The default _depthwise_ splits at nodes closest to the root, while _lossguide_ splits at nodes with the highest loss change.
 * objective, used to specify the learning task and the corresponding learning objective. While the default is _reg:squarederror_, regression with squared loss, in our case _reg:squaredlogerror_, a regression with squared log loss, is probably more suitable.
 * missing, XGBoost's automatic missing value handling. Now this is fancy: instead of imputation or setting to 0, XGBoost determines a default branch in a tree in case of a missing value. The specification is the notation of the missing value, here NaN.
 * plot_tree, strictly a plotting function rather than a parameter; it visualizes the trees in order to inspect the model.
 
Let's see what happens when we change the objective function to _reg:squaredlogerror_ and keep the missing values.

In [11]:
start_time = time.time()
xgboostModel = XGBClassifier() # note: no objective is passed here, so the objective shown in the printout below applies
xgboostModel.fit(x_train, y_train.values.ravel()) # fitting the model to the training data
print(xgboostModel)


#Predict the response for test dataset
y_pred = xgboostModel.predict(x_test)

# Checking the accuracy of the prediction
print("RMSE: ",sqrt(metrics.mean_squared_error(np.log(y_test), np.log(y_pred))))
print("Executed in: %s seconds." % (time.time() - start_time))

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
RMSE:  0.23311281775830925
Executed in: 387.2980833053589 seconds.


In [14]:
# To make it easy, we split up in x and y (data and target)
x = df2.drop('SalePrice', axis=1)
# x holds every column except SalePrice (the Id column is still included); y is the target
y = df2[['SalePrice']]
# Split dataset into training set and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3) # 70% training and 30% test

# Temp cell to try AdaBoost with the same data
start_time = time.time()
# Having created the data sets, we can create the AdaBoost classifier
# Create AdaBoost classifier object
adaboostClassifier = AdaBoostClassifier(n_estimators=1000,
                         learning_rate=0.1)
# Train AdaBoost classifier
adaboostModel = adaboostClassifier.fit(x_train, y_train.values.ravel())

#Predict the response for test dataset
y_pred = adaboostModel.predict(x_test)

# Checking the accuracy of the prediction
print("RMSE: ",sqrt(metrics.mean_squared_error(np.log(y_test), np.log(y_pred))))
print("Executed in: %s seconds." % (time.time() - start_time))

RMSE:  0.33129845100375993
Executed in: 37.23716592788696 seconds.


The gain can only be attributed to leaving missing values as missing; no difference was found between the two objective functions. Now, one of the things in boosting is the penalty for overfitting, _lambda_. By default it's set to 1 (see `reg_lambda=1` in the printout above), but since we have a huge set of features [..]

## LogitBoost
..

In [15]:

# Since LogitBoost cannot handle missing values, using the same input as AdaBoost
# To make it easy, we split up in x and y (data and target)
x = df2.drop('SalePrice', axis=1)
# x holds every column except SalePrice (the Id column is still included); y is the target
y = df2[['SalePrice']]
# Split dataset into training set and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3) # 70% training and 30% test

start_time = time.time()
logitBoostModel = LogitBoost(n_estimators=200, random_state=0)
logitBoostModel.fit(x_train, y_train.values.ravel())
#Predict the response for test dataset
y_pred = logitBoostModel.predict(x_test)

# Checking the accuracy of the prediction
print("RMSE: ",sqrt(metrics.mean_squared_error(np.log(y_test), np.log(y_pred))))
print("Executed in: %s seconds." % (time.time() - start_time))

RMSE:  0.28463250082642516
Executed in: 1016.8995733261108 seconds.


After running the three models on the complete dataset, we have
 1. XGBoost with 0.232 in 407 seconds
 2. AdaBoost with 0.266 in 25.8 seconds
 3. LogitBoost with 0.274 in 771 seconds
 
Looking at the Kaggle leaderboard, this would get us nowhere (top 89%, to be exact). For a top 75% a score below 0.165 is required, top 50% needs 0.134 and top 25% requires 0.119, while anything below 0.1 lands in the top 5.
Now that all boosters are up and running, the next step is to do some proper feature selection.

## Feature selection


In [16]:
from xgboost import plot_tree
import matplotlib.pyplot as plt
# plot_tree needs the Graphviz executables on the PATH; uncomment and adjust the lines below if they are not found
#import os
#os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
plot_tree(xgboostModel, num_trees=3, rankdir='LR')
fig = plt.gcf()
fig.set_size_inches(150, 100)

plt.show()

ExecutableNotFound: failed to execute ['dot', '-Tpng'], make sure the Graphviz executables are on your systems' PATH

In [None]:
dfNew = df.copy()
# Basement, creating a multiplier for each quality (type 1, 2 and unfinished) used on the square feet of each type. 
# That score is multiplied by a quality / condition measurement
# Finally a small multiplier for exterior access is applied
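# Note: read_csv parses the string 'NA' as NaN by default, so the 'NA' keys in these mappings only match if the file is read with keep_default_na=False
bsmtQualWrapper = {'Ex': 10, 'Gd': 8, 'TA': 6, 'Fa':4, 'Po':2, 'NA':0}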
bsmtQual = dfNew[['BsmtQual']].replace(bsmtQualWrapper)
bsmtCond = dfNew[['BsmtCond']].replace(bsmtQualWrapper)
bsmtExpWrapper = {'Gd': 10, 'Av': 8, 'Mn':6, 'No':4, 'NA':0}
bsmtExp = dfNew[['BsmtExposure']].replace(bsmtExpWrapper)
bsmtFinWrapper = {'GLQ' : 10, 'ALQ' : 8, 'BLQ' : 7, 'Rec' : 5, 'LwQ' : 3, 'Unf' : 1, 'NA' : 0}
BsmtFinType1 = dfNew[['BsmtFinType1']].replace(bsmtFinWrapper)
BsmtFinType2 = dfNew[['BsmtFinType2']].replace(bsmtFinWrapper)

dfNew = dfNew.assign(bsmtScore = pd.DataFrame(
    ((3+BsmtFinType1.values/10)*dfNew[['BsmtFinSF1']].values+(2+BsmtFinType2.values/10)*dfNew[['BsmtFinSF2']].values 
     + dfNew[['BsmtUnfSF']].values)*(1+0.1*bsmtQual.values*bsmtCond.values)*(1+0.01*bsmtExp.values)
    , columns=BsmtFinType1.columns, index=BsmtFinType1.index))
dfNew = dfNew.drop(['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF'], axis=1)

# Exterior: Changing the two exterior variables into one score on a 1-100 scale
exteriorMapper = {'Ex': 10, 'Gd': 8, 'TA': 6, 'Fa':4, 'Po':2}
extQual = dfNew[['ExterQual']].replace(exteriorMapper)
extCond = dfNew[['ExterCond']].replace(exteriorMapper)
dfNew = dfNew.assign(extOA = pd.DataFrame(extQual.values*extCond.values, columns=extQual.columns, index=extQual.index))
dfNew = dfNew.drop(['ExterQual','ExterCond'], axis=1)

# Garage


# Transforming MSSubClass, since this is numeric but still categorical
mapper = {20: 'class1', 30: 'class2', 40: 'class3', 45: 'class4', 50: 'class5', 60: 'class6', 70: 'class7', 
          75: 'class8', 80: 'class9', 85: 'class10', 90: 'class11', 120: 'class12', 150: 'class13', 160: 'class14', 
          180: 'class15', 190: 'class16'}
dfNew[['MSSubClass']] = dfNew[['MSSubClass']].replace(mapper)


In [None]:
df.plot(x='GarageCars', y='GarageArea', style='o')

In [None]:
df4 = dfNew.copy()
#df4 = df4.drop(['Fireplaces','FireplaceQu','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath'],1)
# Turning every categorical value into dummies
df4 = pd.get_dummies(df4)
# To make it easy, we split up in x and y (data and target)
#x = df4.drop(['SalePrice'], 1) 
x = df[['Neighborhood','LotArea','HouseStyle', 'YearBuilt','1stFlrSF', '2ndFlrSF','TotalBsmtSF','YrSold']]
inflationWrapper = {2015 : 0.001,2014 : 0.016,2013 : 0.015,2012 : 0.021,2011 : 0.032,2010 : 0.016,2009 : -0.004,2008 : 0.038,2007 : 0.028,2006 : 0.032,2005 : 0.034,2004 : 0.027,2003 : 0.023,2002 : 0.016}
inflation = df[['YrSold']].replace(inflationWrapper)
x = x.assign(inflation = pd.DataFrame(inflation, columns=inflation.columns, index=inflation.index))

x = pd.get_dummies(x)
# The target is the SalePrice column
y = df[['SalePrice']]
# Split dataset into training set and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3) # 70% training and 30% test

start_time = time.time()
xgboostModel = XGBClassifier() 
xgboostModel.fit(x_train, y_train.values.ravel()) # fitting the model to the training data
print(xgboostModel)


#Predict the response for test dataset
y_pred = xgboostModel.predict(x_test)

# Checking the accuracy of the prediction
print("RMSE: ",sqrt(metrics.mean_squared_error(np.log(y_test), np.log(y_pred))))
print("Executed in: %s seconds." % (time.time() - start_time))

In [None]:
# To make it easy, we split up in x and y (data and target)
x = df1.drop('SalePrice', axis=1)
inflationWrapper = {2015 : 0.001,2014 : 0.016,2013 : 0.015,2012 : 0.021,2011 : 0.032,2010 : 0.016,2009 : -0.004,2008 : 0.038,2007 : 0.028,2006 : 0.032,2005 : 0.034,2004 : 0.027,2003 : 0.023,2002 : 0.016}
inflation = df[['YrSold']].replace(inflationWrapper)
x = x.assign(inflation = pd.DataFrame(inflation, columns=inflation.columns, index=inflation.index))
# The target is the SalePrice column
y = df1[['SalePrice']]
# Split dataset into training set and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3) # 70% training and 30% test

start_time = time.time()
xgboostModel = XGBClassifier(n_estimators=1000,nthread=4) 
xgboostModel.fit(x_train, y_train.values.ravel()) # fitting the model to the training data
print(xgboostModel)


#Predict the response for test dataset
y_pred = xgboostModel.predict(x_test)

# Checking the accuracy of the prediction
print("RMSE: ",sqrt(metrics.mean_squared_error(np.log(y_test), np.log(y_pred))))
print("Executed in: %s seconds." % (time.time() - start_time))

In [None]:

from xgboost import plot_tree
#import matplotlib.pyplot as plt
#%matplotlib inline
#plt.hist(df[['YrSold']])
#plt.show()
#df[['YrSold']].plot(kind="hist")
plot_tree(xgboostModel)

In [None]:
print(list(df1.columns.values))

In [None]:
import matplotlib.pyplot as pyplot # needed for pyplot.show() below
#pyplot.bar(range(len(xgboostModel.feature_importances_)), xgboostModel.feature_importances_)
#pyplot.show()
from numpy import sort
from xgboost import plot_importance
from sklearn.feature_selection import SelectFromModel
plot_importance(xgboostModel)
pyplot.show()

# Fit model using each importance as a threshold
thresholds = sort(list(set(xgboostModel.feature_importances_)))
for thresh in thresholds:
	# select features using threshold
	selection = SelectFromModel(xgboostModel, threshold=thresh, prefit=True)
	select_X_train = selection.transform(x_train)
	# train model
	selection_model = XGBClassifier()
	selection_model.fit(select_X_train, y_train.values.ravel())
	# eval model
	select_X_test = selection.transform(x_test)
	y_pred = selection_model.predict(select_X_test)
	# This is the RMSE of the log prices, not an accuracy percentage
	rmse = sqrt(metrics.mean_squared_error(np.log(y_test), np.log(y_pred)))
	print("Thresh=%.3f, n=%d, RMSE: %.3f" % (thresh, select_X_train.shape[1], rmse))

In [None]:
selection = SelectFromModel(xgboostModel, threshold=0.038, prefit=True)
select_X_train = selection.transform(x_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train.values.ravel())
plot_importance(selection_model)
pyplot.show()

In [None]:
selection_model.feature_importances_