# Try 2 - Model selection, XGBoost dataset top7

We created new CSV files containing only the columns that ranked highest in feature importance, plus the target column 'Saleprice'. We made three versions: top10, top12 and top15.

We ran each of the new CSV files through its own copy of the same notebook to see how the number of columns we kept affected the R² value.

After running those three, we tested whether removing a few more columns made the result better or worse.

This notebook shows the results for the top7 CSV file.
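
A reduced file of this kind could be built along the following lines. This is only a minimal sketch under assumptions: the helper name `keep_top_features`, the `importances` Series (feature name to importance score), the `OtherFeature` column and the output file name are all hypothetical and do not come from the original workflow. Only the 'Saleprice' target name and the idea of keeping the top-N columns are taken from the text.

```python
import pandas as pd

def keep_top_features(df, importances, n, target="Saleprice"):
    """Return df restricted to the n most important features plus the target.

    importances is assumed to be a pd.Series indexed by feature name,
    for example built from a fitted model's feature importances.
    """
    # Make sure the target itself is never counted as a feature.
    ranked = importances.drop(labels=[target], errors="ignore")
    top = ranked.nlargest(n).index.tolist()
    return df[top + [target]]

# Toy frame standing in for the full table; OtherFeature is a made-up column.
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7],
    "GrLivArea": [1710, 1262, 1786],
    "OtherFeature": [1, 2, 3],
    "Saleprice": [208500, 181500, 223500],
})
# Illustrative importance scores, not measured values.
toy_importances = pd.Series(
    {"OverallQual": 0.5, "GrLivArea": 0.3, "OtherFeature": 0.2}
)

top2 = keep_top_features(toy, toy_importances, n=2)
print(list(top2.columns))
# top2.to_csv("top2andprice.csv", index=False)  # hypothetical file name
```

Keeping the target as the last column matches how this notebook later splits the data with `iloc[:, :-1]` and `iloc[:, -1]`.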


## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## Importing the dataset

In [2]:
dataset = pd.read_csv('top7andprice.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
dataset.head()

   BsmtFinSF1  GarageCars  1stFlrSF  TotalBsmtSF  2ndFlrSF  GrLivArea  OverallQual  Saleprice
0         706           2       856          856       854       1710            7     208500
1         978           2      1262         1262         0       1262            6     181500
2         486           2       920          920       866       1786            7     223500
3         216           3       961          756       756       1717            7     140000
4         655           3      1145         1145      1053       2198            8     250000


## Encoding categorical data

As the output below shows, none of the columns contain categorical data, so nothing needs to be encoded.

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   BsmtFinSF1   1460 non-null   int64
 1   GarageCars   1460 non-null   int64
 2   1stFlrSF     1460 non-null   int64
 3   TotalBsmtSF  1460 non-null   int64
 4   2ndFlrSF     1460 non-null   int64
 5   GrLivArea    1460 non-null   int64
 6   OverallQual  1460 non-null   int64
 7   Saleprice    1460 non-null   int64
dtypes: int64(8)
memory usage: 91.4 KB
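
The same check can be made programmatically instead of reading the `info()` output by eye. The sketch below is an illustration only: the helper name `non_numeric_columns` and the text column `Example` are hypothetical, and the small frames merely reuse a few values from the head of the dataset.

```python
import pandas as pd

def non_numeric_columns(df):
    """Return the names of all columns whose dtype is not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()

# Two numeric columns, as in the top7 file.
numeric_only = pd.DataFrame({"GarageCars": [2, 2], "OverallQual": [7, 6]})
# Same frame with a made-up text column added for contrast.
with_text = numeric_only.assign(Example=["a", "b"])

print(non_numeric_columns(numeric_only))  # []
print(non_numeric_columns(with_text))     # ['Example']
```

An empty list means every column is already numeric and no encoding step is required.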


In [5]:
# Convert the feature DataFrame and target Series to NumPy arrays
X = X.values
y = y.values

# Model: XGBoost regression

## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Training the XGBoost regression model

In [7]:
from xgboost import XGBRegressor
# seed=0 is the default; it is set explicitly here to make the run reproducible
regressor = XGBRegressor(seed=0)
regressor.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

## Predicting the Test set results

In [8]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Print predicted prices (left column) next to actual prices (right column)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[240702.77 200624.  ]
 [152925.48 133000.  ]
 [125636.73 110000.  ]
 [215485.78 192000.  ]
 [ 86568.7   88000.  ]
 [124416.64  85000.  ]
 [267422.91 282922.  ]
 [117050.02 141000.  ]
 [674705.81 745000.  ]
 [143692.94 148800.  ]
 [189190.95 208900.  ]
 [155771.75 136905.  ]
 [207611.72 225000.  ]
 [127531.34 123000.  ]
 [139433.94 119200.  ]
 [171770.62 145000.  ]
 [172965.14 190000.  ]
 [116246.52 123600.  ]
 [135565.56 149350.  ]
 [220693.03 155000.  ]
 [149670.89 166000.  ]
 [142493.84 144500.  ]
 [112896.7  110000.  ]
 [171144.97 174000.  ]
 [174872.62 185000.  ]
 [231421.39 168000.  ]
 [191123.53 177500.  ]
 [ 70944.99  84500.  ]
 [343369.03 320000.  ]
 [113790.27 118500.  ]
 [110396.42 110000.  ]
 [168420.86 213000.  ]
 [142039.39 156000.  ]
 [283699.38 250000.  ]
 [259947.38 372500.  ]
 [162396.28 175000.  ]
 [292225.53 277500.  ]
 [139924.88 112500.  ]
 [222978.7  263000.  ]
 [288101.44 325000.  ]
 [242286.36 243000.  ]
 [141949.36 130000.  ]
 [184888.67 164990.  ]
 [276222.84

## Evaluating the Model Performance

In [9]:
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))

0.8629247966353095
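
For reference, `r2_score` computes the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below reproduces that formula with NumPy. The function name `r2_manual` is hypothetical, and the five sample pairs are just the first rows of the prediction output above, so the number it prints describes those five rows only, not the full test set.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# First five (prediction, actual) pairs from the output above.
sample_pred = [240702.77, 152925.48, 125636.73, 215485.78, 86568.70]
sample_true = [200624.0, 133000.0, 110000.0, 192000.0, 88000.0]

print(r2_manual(sample_true, sample_pred))
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean sale price.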
