# Try 2 - Model selection, XGBoost dataset top15

We created new CSV files containing only the columns that ranked highest in feature importance, plus the target column 'Saleprice'.
We made three versions: top10, top12 and top15.

We ran each file through its own copy of the same notebook to see how the number of columns kept affected the R2 score.

This notebook uses the top15 file.
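
As a rough illustration of how such a reduced file could be produced, the sketch below keeps the k highest-ranked columns plus the target. The helper `keep_top_features`, the importance scores and the output file name are assumptions made for this example; they are not taken from the notebook that produced the real top10/top12/top15 files.

```python
import pandas as pd

def keep_top_features(df, importances, k, target="Saleprice"):
    """Return df restricted to the k most important features plus the target."""
    top = importances.sort_values(ascending=False).head(k).index.tolist()
    return df[top + [target]]

# Three rows copied from the dataset preview; the importance scores are made up.
df = pd.DataFrame({
    "OverallQual": [7, 6, 7],
    "GrLivArea": [1710, 1262, 1786],
    "LotArea": [8450, 9600, 11250],
    "Saleprice": [208500, 181500, 223500],
})
importances = pd.Series({"OverallQual": 0.5, "GrLivArea": 0.3, "LotArea": 0.2})

top2 = keep_top_features(df, importances, k=2)
print(top2.columns.tolist())

# Hypothetical file name; uncomment to write the reduced dataset to disk.
# top2.to_csv("top2andprice.csv")
```

Calling the same helper with k=10, k=12 and k=15 would give the three variants compared in this experiment.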

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## Importing the dataset

In [2]:
dataset = pd.read_csv('top15andprice.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
dataset.head()

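| | YearRemodAdd | OverallCond | BsmtQual | LotArea | MSZoning | TotRmsAbvGrd | YearBuilt | LandContour | BsmtFinSF1 | GarageCars | 1stFlrSF | TotalBsmtSF | 2ndFlrSF | GrLivArea | OverallQual | Saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 5 | Gd | 8450 | RL | 8 | 2003 | Lvl | 706 | 2 | 856 | 856 | 854 | 1710 | 7 | 208500 |
| 1 | 1976 | 8 | Gd | 9600 | RL | 6 | 1976 | Lvl | 978 | 2 | 1262 | 1262 | 0 | 1262 | 6 | 181500 |
| 2 | 2002 | 5 | Gd | 11250 | RL | 6 | 2001 | Lvl | 486 | 2 | 920 | 920 | 866 | 1786 | 7 | 223500 |
| 3 | 1970 | 5 | TA | 9550 | RL | 7 | 1915 | Lvl | 216 | 3 | 961 | 756 | 756 | 1717 | 7 | 140000 |
| 4 | 2000 | 5 | Gd | 14260 | RL | 9 | 2000 | Lvl | 655 | 3 | 1145 | 1145 | 1053 | 2198 | 8 | 250000 |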


## Encoding categorical data

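The output below shows that three of the columns (BsmtQual, MSZoning and LandContour) have dtype `object`, so they hold categorical data and must be one-hot encoded. BsmtQual also has 37 missing values (1423 of 1460 entries are non-null).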

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   YearRemodAdd  1460 non-null   int64 
 1   OverallCond   1460 non-null   int64 
 2   BsmtQual      1423 non-null   object
 3   LotArea       1460 non-null   int64 
 4   MSZoning      1460 non-null   object
 5   TotRmsAbvGrd  1460 non-null   int64 
 6   YearBuilt     1460 non-null   int64 
 7   LandContour   1460 non-null   object
 8   BsmtFinSF1    1460 non-null   int64 
 9   GarageCars    1460 non-null   int64 
 10  1stFlrSF      1460 non-null   int64 
 11  TotalBsmtSF   1460 non-null   int64 
 12  2ndFlrSF      1460 non-null   int64 
 13  GrLivArea     1460 non-null   int64 
 14  OverallQual   1460 non-null   int64 
 15  Saleprice     1460 non-null   int64 
dtypes: int64(13), object(3)
memory usage: 182.6+ KB
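
The same check can be done programmatically instead of reading the `info()` listing by eye. The sketch below is only an illustration on a tiny made-up frame (the missing BsmtQual entry is invented for the example); it lists the non-numeric columns and counts their missing values.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the NaN is placed by hand to mimic a missing BsmtQual.
sample = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", np.nan, "Gd"],
    "MSZoning": ["RL", "RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Every column that is not numeric is treated as categorical here.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of missing entries per categorical column.
missing_counts = sample[categorical_cols].isna().sum()

print(categorical_cols)
print(missing_counts)
```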


### One-hot encoding the columns

In [4]:
X = pd.get_dummies(X)
X.head(10)

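| | YearRemodAdd | OverallCond | LotArea | TotRmsAbvGrd | YearBuilt | BsmtFinSF1 | GarageCars | 1stFlrSF | TotalBsmtSF | 2ndFlrSF | ... | BsmtQual_TA | MSZoning_C (all) | MSZoning_FV | MSZoning_RH | MSZoning_RL | MSZoning_RM | LandContour_Bnk | LandContour_HLS | LandContour_Low | LandContour_Lvl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 5 | 8450 | 8 | 2003 | 706 | 2 | 856 | 856 | 854 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1976 | 8 | 9600 | 6 | 1976 | 978 | 2 | 1262 | 1262 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2002 | 5 | 11250 | 6 | 2001 | 486 | 2 | 920 | 920 | 866 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1970 | 5 | 9550 | 7 | 1915 | 216 | 3 | 961 | 756 | 756 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 2000 | 5 | 14260 | 9 | 2000 | 655 | 3 | 1145 | 1145 | 1053 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 5 | 1995 | 5 | 14115 | 5 | 1993 | 732 | 2 | 796 | 796 | 566 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 6 | 2005 | 5 | 10084 | 7 | 2004 | 1369 | 2 | 1694 | 1686 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 7 | 1973 | 6 | 10382 | 7 | 1973 | 859 | 2 | 1107 | 1107 | 983 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 8 | 1950 | 5 | 6120 | 8 | 1931 | 0 | 2 | 1022 | 952 | 752 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 9 | 1950 | 6 | 7420 | 5 | 1939 | 851 | 1 | 1077 | 991 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |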
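
Because BsmtQual contains missing values, it may help to see how `pd.get_dummies` treats them. The sketch below uses a three-row made-up column: by default a missing entry gets a 0 in every indicator column, while `dummy_na=True` adds a separate indicator for it. Whether the extra column would help the model here is untested. Newer pandas versions return booleans rather than the 0/1 integers shown in the output above; passing `dtype=int` is one way to get integers back.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing quality rating.
sample = pd.DataFrame({"BsmtQual": ["Gd", "TA", np.nan]})

# Default behaviour: the row with NaN is encoded as all zeros.
plain = pd.get_dummies(sample, dtype=int)

# dummy_na=True adds one extra indicator column that flags missing values.
with_na = pd.get_dummies(sample, dummy_na=True, dtype=int)

print(plain)
print(with_na)
```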


In [5]:
X = X.values
y = y.values

# Model: XGBoost regression

## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
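
To make the effect of `test_size=0.2` concrete, the sketch below applies the same call to placeholder arrays with 1460 rows, the row count reported by `dataset.info()`. The arrays `X_demo` and `y_demo` are stand-ins invented for the example, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (1460).
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (1168, 2) (292, 2)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))  # True
```

With 1460 rows, 292 go to the test set and 1168 to the training set, so the R2 score reported later is computed on 292 houses.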

## Training the XGBoost regression model 

In [7]:
from xgboost import XGBRegressor
# seed=0 is the default; it is set explicitly so the run is reproducible
regressor = XGBRegressor(seed=0)
regressor.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

## Predicting the Test set results

In [8]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Left column: predicted price, right column: actual price
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[194987.98 200624.  ]
 [144900.7  133000.  ]
 [103928.05 110000.  ]
 [202115.02 192000.  ]
 [ 83277.31  88000.  ]
 [ 81643.23  85000.  ]
 [254296.   282922.  ]
 [122497.67 141000.  ]
 [721684.38 745000.  ]
 [152616.72 148800.  ]
 [174471.2  208900.  ]
 [155370.28 136905.  ]
 [226463.73 225000.  ]
 [119197.67 123000.  ]
 [140856.53 119200.  ]
 [139740.08 145000.  ]
 [201286.03 190000.  ]
 [122232.5  123600.  ]
 [130949.71 149350.  ]
 [202103.61 155000.  ]
 [135043.22 166000.  ]
 [144899.81 144500.  ]
 [108221.31 110000.  ]
 [170863.33 174000.  ]
 [177146.73 185000.  ]
 [129176.41 168000.  ]
 [174406.67 177500.  ]
 [ 69404.24  84500.  ]
 [365272.09 320000.  ]
 [116556.85 118500.  ]
 [139712.   110000.  ]
 [192347.03 213000.  ]
 [139353.53 156000.  ]
 [292970.34 250000.  ]
 [256478.73 372500.  ]
 [170105.81 175000.  ]
 [288171.22 277500.  ]
 [113773.29 112500.  ]
 [221307.42 263000.  ]
 [326367.94 325000.  ]
 [232151.62 243000.  ]
 [129666.88 130000.  ]
 [182077.   164990.  ]
 [283675.44
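
The raw two-column printout is hard to scan. One possible alternative, sketched below with the first five prediction/actual pairs copied from the output above, is a small DataFrame with an explicit error column. The column names are chosen for this example only.

```python
import pandas as pd

# First five (predicted, actual) pairs from the printed output.
comparison = pd.DataFrame({
    "predicted": [194987.98, 144900.70, 103928.05, 202115.02, 83277.31],
    "actual": [200624.0, 133000.0, 110000.0, 192000.0, 88000.0],
})

# Signed error: positive means the model overestimated the price.
comparison["error"] = comparison["predicted"] - comparison["actual"]

# Absolute error as a percentage of the actual price.
comparison["abs_pct_error"] = (
    comparison["error"].abs() / comparison["actual"] * 100
).round(2)

print(comparison)
```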

## Evaluating the Model Performance

In [9]:
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))

0.8772459102377688
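
For reference, R2 is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below computes it by hand, together with the RMSE, which expresses the error in the same unit as the sale price. The helper names are made up for this example, and the five data points are only the first rows of the printed predictions, so the result will not match the score above, which uses the full test set.

```python
import numpy as np

def r2_manual(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_hat):
    """Root mean squared error, in the same unit as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# First five (actual, predicted) pairs from the printed output.
actual = np.array([200624.0, 133000.0, 110000.0, 192000.0, 88000.0])
predicted = np.array([194987.98, 144900.70, 103928.05, 202115.02, 83277.31])

print(r2_manual(actual, predicted))
print(rmse(actual, predicted))
```

A model that always predicts the mean of the actual values scores R2 = 0 and a perfect model scores 1, which gives a scale for comparing the top10, top12 and top15 runs.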
