# Try 2 - Model selection, XGBoost dataset top12

We created new CSV files containing only the columns that ranked highest in feature importance, plus the target column 'Saleprice'. We made three versions: top10, top12 and top15.

We ran each CSV file through its own copy of the same notebook to see how the number of columns kept affects the R2 value.

This notebook uses the CSV file for top12.
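The notebook that produced the reduced CSV files is not shown here, so the following is only a minimal sketch of how such a file could be built. The helper `keep_top_features`, the importance scores and the value `k=2` are hypothetical; only the column names and the first three data rows are taken from this notebook.

```python
import pandas as pd


def keep_top_features(df, importances, k, target="Saleprice"):
    """Return df restricted to the k highest-scoring features plus the target.

    Hypothetical helper: `importances` is a Series indexed by feature name.
    """
    top = importances.sort_values(ascending=False).head(k).index.tolist()
    return df[top + [target]]


# Small stand-in for the full housing dataset.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "GarageCars": [2, 2, 2],
    "Saleprice": [208500, 181500, 223500],
})

# Made-up importance scores; in practice they would come from a fitted model.
scores = pd.Series({"LotArea": 0.2, "YearBuilt": 0.5, "GarageCars": 0.3})

top2 = keep_top_features(toy, scores, k=2)
print(top2.columns.tolist())
```

Calling `top2.to_csv(..., index=False)` on the result would then write a file with the same layout as the one loaded below, with the target in the last column.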

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## Importing the dataset

In [2]:
dataset = pd.read_csv('top12andprice.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
dataset.head()

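| | LotArea | MSZoning | TotRmsAbvGrd | YearBuilt | LandContour | BsmtFinSF1 | GarageCars | 1stFlrSF | TotalBsmtSF | 2ndFlrSF | GrLivArea | OverallQual | Saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | RL | 8 | 2003 | Lvl | 706 | 2 | 856 | 856 | 854 | 1710 | 7 | 208500 |
| 1 | 9600 | RL | 6 | 1976 | Lvl | 978 | 2 | 1262 | 1262 | 0 | 1262 | 6 | 181500 |
| 2 | 11250 | RL | 6 | 2001 | Lvl | 486 | 2 | 920 | 920 | 866 | 1786 | 7 | 223500 |
| 3 | 9550 | RL | 7 | 1915 | Lvl | 216 | 3 | 961 | 756 | 756 | 1717 | 7 | 140000 |
| 4 | 14260 | RL | 9 | 2000 | Lvl | 655 | 3 | 1145 | 1145 | 1053 | 2198 | 8 | 250000 |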


## Encoding categorical data

As the output below shows, two of the columns (MSZoning and LandContour) contain categorical data and must be one-hot encoded.

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   LotArea       1460 non-null   int64 
 1   MSZoning      1460 non-null   object
 2   TotRmsAbvGrd  1460 non-null   int64 
 3   YearBuilt     1460 non-null   int64 
 4   LandContour   1460 non-null   object
 5   BsmtFinSF1    1460 non-null   int64 
 6   GarageCars    1460 non-null   int64 
 7   1stFlrSF      1460 non-null   int64 
 8   TotalBsmtSF   1460 non-null   int64 
 9   2ndFlrSF      1460 non-null   int64 
 10  GrLivArea     1460 non-null   int64 
 11  OverallQual   1460 non-null   int64 
 12  Saleprice     1460 non-null   int64 
dtypes: int64(11), object(2)
memory usage: 148.4+ KB
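Instead of reading the `Dtype` column by eye, the non-numeric columns can also be listed programmatically. This is a small sketch on a stand-in frame (`sample` is not the real dataset, just a few values copied from the rows shown earlier):

```python
import pandas as pd

# Stand-in frame with the same mix of numeric and text columns.
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RL", "RL"],
    "LandContour": ["Lvl", "Lvl", "Lvl"],
    "Saleprice": [208500, 181500, 223500],
})

# Every column that is not numeric is a candidate for one-hot encoding.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```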


### One-hot encoding the columns

In [4]:
X = pd.get_dummies(X)
X.head(10)

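| | LotArea | TotRmsAbvGrd | YearBuilt | BsmtFinSF1 | GarageCars | 1stFlrSF | TotalBsmtSF | 2ndFlrSF | GrLivArea | OverallQual | MSZoning_C (all) | MSZoning_FV | MSZoning_RH | MSZoning_RL | MSZoning_RM | LandContour_Bnk | LandContour_HLS | LandContour_Low | LandContour_Lvl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | 8 | 2003 | 706 | 2 | 856 | 856 | 854 | 1710 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 9600 | 6 | 1976 | 978 | 2 | 1262 | 1262 | 0 | 1262 | 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 11250 | 6 | 2001 | 486 | 2 | 920 | 920 | 866 | 1786 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 9550 | 7 | 1915 | 216 | 3 | 961 | 756 | 756 | 1717 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 14260 | 9 | 2000 | 655 | 3 | 1145 | 1145 | 1053 | 2198 | 8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 5 | 14115 | 5 | 1993 | 732 | 2 | 796 | 796 | 566 | 1362 | 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 6 | 10084 | 7 | 2004 | 1369 | 2 | 1694 | 1686 | 0 | 1694 | 8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 7 | 10382 | 7 | 1973 | 859 | 2 | 1107 | 1107 | 983 | 2090 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 8 | 6120 | 8 | 1931 | 0 | 2 | 1022 | 952 | 752 | 1774 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 9 | 7420 | 5 | 1939 | 851 | 1 | 1077 | 991 | 0 | 1077 | 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |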
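To make the effect of `pd.get_dummies` easier to see, here is a small sketch on a two-row stand-in frame (the values come from rows 7 and 8 above). The `dtype=int` argument is an addition that is not used in this notebook; it forces 0/1 integers, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Two-row stand-in: one numeric column and one categorical column.
sample = pd.DataFrame({
    "LotArea": [10382, 6120],
    "MSZoning": ["RL", "RM"],
})

# Numeric columns pass through unchanged; each category of a text column
# becomes its own indicator column named <column>_<category>.
encoded = pd.get_dummies(sample, dtype=int)
print(encoded.columns.tolist())
print(encoded.values.tolist())
```

Each row has exactly one 1 among the indicator columns of a given original column, which is why the table above always shows a single 1 among the MSZoning_* columns.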


In [5]:
X = X.values
y = y.values

# Model: XGBoost regression

## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 0)
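As a sanity check on what this split produces, the sketch below repeats the same call on placeholder arrays with the same number of rows (1460). The arrays `X_demo` and `y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with as many rows as the housing dataset.
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # 80% of the rows for training, 20% for testing

# Fixing random_state makes the shuffle repeatable: a second call with the
# same seed selects exactly the same test rows.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te_again))
```

Keeping the same `random_state` across the top10, top12 and top15 notebooks means the R2 values are compared on the same held-out rows.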

## Training the XGBoost regression model 

In [7]:
from xgboost import XGBRegressor
# seed=0 is the default; it is set explicitly here for reproducibility
regressor = XGBRegressor(seed = 0)
regressor.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

## Predicting the Test set results

In [8]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Show predictions (left column) next to the actual prices (right column)
print(np.column_stack((y_pred, y_test)))

[[237428.5  200624.  ]
 [141992.33 133000.  ]
 [131754.45 110000.  ]
 [209536.81 192000.  ]
 [ 88042.44  88000.  ]
 [ 97214.51  85000.  ]
 [256837.02 282922.  ]
 [118987.96 141000.  ]
 [704263.69 745000.  ]
 [164763.02 148800.  ]
 [214935.52 208900.  ]
 [142625.64 136905.  ]
 [223379.53 225000.  ]
 [118063.73 123000.  ]
 [144837.27 119200.  ]
 [147207.36 145000.  ]
 [203694.67 190000.  ]
 [116519.71 123600.  ]
 [119084.62 149350.  ]
 [185331.7  155000.  ]
 [143144.62 166000.  ]
 [145573.16 144500.  ]
 [109781.05 110000.  ]
 [173763.08 174000.  ]
 [176784.89 185000.  ]
 [149005.8  168000.  ]
 [177795.98 177500.  ]
 [ 71715.73  84500.  ]
 [340494.81 320000.  ]
 [116585.62 118500.  ]
 [139620.77 110000.  ]
 [190911.67 213000.  ]
 [144300.   156000.  ]
 [291510.16 250000.  ]
 [267326.41 372500.  ]
 [172026.48 175000.  ]
 [278281.78 277500.  ]
 [130753.86 112500.  ]
 [229289.67 263000.  ]
 [297611.38 325000.  ]
 [237772.2  243000.  ]
 [129125.66 130000.  ]
 [182958.58 164990.  ]
 [272591.34

## Evaluating the Model Performance

In [9]:
from sklearn.metrics import r2_score
print(r2_score(y_test,y_pred))

0.8728539932453862
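For reference, `r2_score` computes 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below checks this by hand on the first five prediction/actual pairs printed above. Because it uses only five rows, the value it prints is not the test-set score reported here.

```python
import numpy as np
from sklearn.metrics import r2_score

# First five (prediction, actual) pairs from the printed test results.
y_pred_sample = np.array([237428.5, 141992.33, 131754.45, 209536.81, 88042.44])
y_true_sample = np.array([200624.0, 133000.0, 110000.0, 192000.0, 88000.0])

ss_res = np.sum((y_true_sample - y_pred_sample) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true_sample - y_true_sample.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true_sample, y_pred_sample))  # should match the manual value
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean price.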
