# Try 2 - Model selection, XGBoost dataset top5

We created new CSV files that contain only the top-ranked columns by feature importance, plus the target column 'Saleprice'. We made three versions: top10, top12 and top15.
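
As a rough sketch of how such a file could be produced: assuming the feature importances are available as a pandas Series indexed by column name (for instance built from a fitted model's `feature_importances_`), the top-k columns plus the target can be selected as follows. The helper name `top_k_with_target`, the importance values and the output file name are hypothetical and not taken from the original notebook.

```python
import pandas as pd


def top_k_with_target(df, importances, k, target="Saleprice"):
    """Return df restricted to the k most important features plus the target."""
    top_features = importances.sort_values(ascending=False).head(k).index.tolist()
    return df[top_features + [target]]


# Toy illustration only: the importance values below are made up.
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7],
    "GrLivArea": [1710, 1262, 1786],
    "2ndFlrSF": [854, 0, 866],
    "Saleprice": [208500, 181500, 223500],
})
toy_importances = pd.Series({"OverallQual": 0.6, "GrLivArea": 0.3, "2ndFlrSF": 0.1})

top2 = top_k_with_target(toy, toy_importances, k=2)
# top2.to_csv("top2andprice.csv", index=False)  # hypothetical file name
```

Writing with `index=False` keeps the row index out of the CSV, so the target stays the last column when the file is read back.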

We ran each CSV file through a separate copy of the same notebook to see how the number of retained columns affected the R2 value.

After those three runs, we tested whether removing even more columns made the result better or worse.
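
The comparison across feature subsets could also be written as a single loop instead of separate notebook copies. The following is only a sketch under assumptions: `compare_feature_sets` is a hypothetical helper, it expects the target in the last column as in this notebook, and the model is passed in as a factory so that `XGBRegressor` or any other scikit-learn style regressor can be plugged in.

```python
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def compare_feature_sets(datasets, make_model, test_size=0.2, random_state=0):
    """Fit a fresh model per dataset and return {label: test-set R2}.

    datasets:   dict mapping a label to a DataFrame whose last column is the target.
    make_model: zero-argument callable returning an unfitted regressor.
    """
    scores = {}
    for name, df in datasets.items():
        X = df.iloc[:, :-1].values
        y = df.iloc[:, -1].values
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )
        model = make_model()
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores


# Possible usage (file names other than top5andprice.csv are assumed):
# datasets = {n: pd.read_csv(f"{n}andprice.csv") for n in ["top5", "top10", "top12", "top15"]}
# scores = compare_feature_sets(datasets, lambda: XGBRegressor(seed=0))
```

Using the same `test_size` and `random_state` for every subset keeps the train/test rows identical across runs, so differences in R2 come from the columns rather than from the split.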

This notebook shows the run for the top5 CSV file.


## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## Importing the dataset

In [2]:
dataset = pd.read_csv('top5andprice.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
dataset.head()

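|   | 1stFlrSF | TotalBsmtSF | 2ndFlrSF | GrLivArea | OverallQual | Saleprice |
|---|----------|-------------|----------|-----------|-------------|-----------|
| 0 | 856      | 856         | 854      | 1710      | 7           | 208500    |
| 1 | 1262     | 1262        | 0        | 1262      | 6           | 181500    |
| 2 | 920      | 920         | 866      | 1786      | 7           | 223500    |
| 3 | 961      | 756         | 756      | 1717      | 7           | 140000    |
| 4 | 1145     | 1145        | 1053     | 2198      | 8           | 250000    |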
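
Positional slicing with `iloc` works here because the target is the last column. A slightly more defensive variant, sketched on the five rows shown above, selects the target by name so the split does not depend on column order. The variable names `head`, `X_named` and `y_named` are illustrative.

```python
import pandas as pd

# The five rows printed by dataset.head() above.
head = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "TotalBsmtSF": [856, 1262, 920, 756, 1145],
    "2ndFlrSF": [854, 0, 866, 756, 1053],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "OverallQual": [7, 6, 7, 7, 8],
    "Saleprice": [208500, 181500, 223500, 140000, 250000],
})

X_named = head.drop(columns="Saleprice")  # every column except the target
y_named = head["Saleprice"]               # the target, selected by name

# Same result as the positional split used in this notebook.
assert X_named.equals(head.iloc[:, :-1])
assert y_named.equals(head.iloc[:, -1])
```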


## Encoding categorical data

The output below shows that all six columns are `int64`, so this subset contains no categorical data and nothing needs to be one-hot encoded.

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   1stFlrSF     1460 non-null   int64
 1   TotalBsmtSF  1460 non-null   int64
 2   2ndFlrSF     1460 non-null   int64
 3   GrLivArea    1460 non-null   int64
 4   OverallQual  1460 non-null   int64
 5   Saleprice    1460 non-null   int64
dtypes: int64(6)
memory usage: 68.6 KB
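
The same check can be made programmatically rather than by reading the `info()` output. A minimal sketch, assuming a pandas DataFrame as input; `non_numeric_columns` is a hypothetical helper name and the `Street` column is a made-up example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(df):
    """Return the names of columns that would need encoding before modelling."""
    return [col for col in df.columns if not is_numeric_dtype(df[col])]


# Toy frames for illustration only.
numeric_only = pd.DataFrame({"OverallQual": [7, 6], "Saleprice": [208500, 181500]})
with_text = numeric_only.assign(Street=["Pave", "Grvl"])

print(non_numeric_columns(numeric_only))
print(non_numeric_columns(with_text))
```

Applied to this notebook's `dataset`, an empty list would confirm that no encoding step is required.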


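### One-hot encoding of the columns

`pd.get_dummies` only encodes non-numeric columns, so with an all-integer `X` the call below returns the data unchanged.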

In [4]:
X = pd.get_dummies(X)
X.head(10)

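|   | 1stFlrSF | TotalBsmtSF | 2ndFlrSF | GrLivArea | OverallQual |
|---|----------|-------------|----------|-----------|-------------|
| 0 | 856      | 856         | 854      | 1710      | 7           |
| 1 | 1262     | 1262        | 0        | 1262      | 6           |
| 2 | 920      | 920         | 866      | 1786      | 7           |
| 3 | 961      | 756         | 756      | 1717      | 7           |
| 4 | 1145     | 1145        | 1053     | 2198      | 8           |
| 5 | 796      | 796         | 566      | 1362      | 5           |
| 6 | 1694     | 1686        | 0        | 1694      | 8           |
| 7 | 1107     | 1107        | 983      | 2090      | 7           |
| 8 | 1022     | 952         | 752      | 1774      | 7           |
| 9 | 1077     | 991         | 0        | 1077      | 5           |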
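
To illustrate what `pd.get_dummies` does when a text column is present, here is a small sketch. The `Street` column and its values are made up for the example and are not part of the top5 file.

```python
import pandas as pd

numeric = pd.DataFrame({"GrLivArea": [1710, 1262], "OverallQual": [7, 6]})
# Numeric-only input: get_dummies leaves the frame as it is.
assert pd.get_dummies(numeric).equals(numeric)

mixed = numeric.assign(Street=["Pave", "Grvl"])
encoded = pd.get_dummies(mixed)
# The text column is replaced by one indicator column per category.
print(encoded.columns.tolist())
```

The numeric columns pass through untouched, while `Street` is expanded into one indicator column per distinct value.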


In [5]:
X = X.values
y = y.values

# Model: XGBoost regression

## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
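
With 1460 rows and `test_size=0.2`, the split yields 1168 training rows and 292 test rows. The sketch below checks this on placeholder arrays of the same shape; `X_demo` and `y_demo` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as this notebook's data (1460 rows, 5 features).
X_demo = np.zeros((1460, 5))
y_demo = np.zeros(1460)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (1168, 5) (292, 5)
```

The R2 value reported at the end of this notebook is therefore computed on 292 houses.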

## Training the XGBoost regression model 

In [7]:
from xgboost import XGBRegressor
# seed=0 is the default; it is set explicitly here for reproducibility
regressor = XGBRegressor(seed=0)
regressor.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

## Predicting the Test set results

In [8]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Print predictions (left column) next to the actual sale prices (right column)
print(np.column_stack((y_pred, y_test)))

[[184692.2  200624.  ]
 [147056.38 133000.  ]
 [120415.55 110000.  ]
 [244656.03 192000.  ]
 [ 99908.59  88000.  ]
 [120669.    85000.  ]
 [241481.92 282922.  ]
 [120260.79 141000.  ]
 [753204.62 745000.  ]
 [128320.17 148800.  ]
 [204954.42 208900.  ]
 [178102.67 136905.  ]
 [205253.14 225000.  ]
 [126899.56 123000.  ]
 [119773.02 119200.  ]
 [168331.5  145000.  ]
 [182283.09 190000.  ]
 [133141.38 123600.  ]
 [151956.95 149350.  ]
 [199446.89 155000.  ]
 [132827.81 166000.  ]
 [141679.03 144500.  ]
 [126547.14 110000.  ]
 [148320.67 174000.  ]
 [164985.03 185000.  ]
 [249115.31 168000.  ]
 [182858.86 177500.  ]
 [ 79079.2   84500.  ]
 [330363.44 320000.  ]
 [119110.68 118500.  ]
 [148379.78 110000.  ]
 [148914.19 213000.  ]
 [131513.69 156000.  ]
 [270582.22 250000.  ]
 [262672.81 372500.  ]
 [169824.17 175000.  ]
 [260685.75 277500.  ]
 [117835.2  112500.  ]
 [216811.   263000.  ]
 [318690.28 325000.  ]
 [238655.12 243000.  ]
 [121670.43 130000.  ]
 [205681.59 164990.  ]
 [276969.5 
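
The printed array is hard to scan. One way to make the comparison easier to read, sketched here with the first five pairs from the output above, is to put predictions and actual prices in a DataFrame and add error columns. The column names are illustrative; in the notebook itself, `y_pred` and `y_test` would take the place of the hard-coded arrays.

```python
import numpy as np
import pandas as pd

# First five (prediction, actual) pairs from the printed output.
pred = np.array([184692.2, 147056.38, 120415.55, 244656.03, 99908.59])
actual = np.array([200624.0, 133000.0, 110000.0, 192000.0, 88000.0])

comparison = pd.DataFrame({"predicted": pred, "actual": actual})
comparison["error"] = comparison["predicted"] - comparison["actual"]        # signed error
comparison["pct_error"] = 100 * comparison["error"] / comparison["actual"]  # relative to actual
print(comparison.round(2))
```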

## Evaluating the Model Performance

In [9]:
from sklearn.metrics import r2_score
print(r2_score(y_test,y_pred))

0.8243776040452636
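
For reference, `r2_score` computes R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below reimplements this with NumPy and adds RMSE, which expresses the typical error in the target's own units. The function names `r2_manual` and `rmse` are hypothetical helpers, not part of the original notebook.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Calling `r2_manual(y_test, y_pred)` in this notebook should agree with the `r2_score` result up to floating-point rounding, and `rmse(y_test, y_pred)` would give the error in the same currency units as the sale price.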
