# House Prices - Numeric features

In this notebook, we use the dataset generated by FE_TotalSF_drop and train models using numerical features only.

## Preparation

In [None]:
from os.path import join
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
from model.catboost import catboost
from model.xgboost import xgboost

data_dir = join('..', '..', 'data')
input_dir = join(data_dir, 'feature_engineered', 'FE_TotalSF')
output_dir = join(data_dir, 'feature_engineered')

train = pd.read_csv(join(input_dir, 'TotalSF_drop_train.csv'))
test = pd.read_csv(join(input_dir, 'TotalSF_drop_test.csv'))

train.head()

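| Unnamed: 0 | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | TotalSF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 3 | 4.189655 | 9.04204 | 1 | 1 | 3 | 3 | 0 | 4 | ... | 3 | 4 | 1 | 0.0 | 2 | 2008 | 8 | 4 | 208500 | 1712 |
| 1 | 20 | 3 | 4.394449 | 9.169623 | 1 | 1 | 3 | 3 | 0 | 2 | ... | 3 | 4 | 1 | 0.0 | 5 | 2007 | 8 | 4 | 181500 | 2524 |
| 2 | 60 | 3 | 4.234107 | 9.328212 | 1 | 1 | 0 | 3 | 0 | 4 | ... | 3 | 4 | 1 | 0.0 | 9 | 2008 | 8 | 4 | 223500 | 1840 |
| 3 | 70 | 3 | 4.110874 | 9.164401 | 1 | 1 | 0 | 3 | 0 | 0 | ... | 3 | 4 | 1 | 0.0 | 2 | 2006 | 8 | 0 | 140000 | 1717 |
| 4 | 60 | 3 | 4.442651 | 9.565284 | 1 | 1 | 0 | 3 | 0 | 2 | ... | 3 | 4 | 1 | 0.0 | 12 | 2008 | 8 | 4 | 250000 | 2290 |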


In [2]:
original = pd.read_csv(join(data_dir, 'raw', 'train.csv'))
categorical_features = original.select_dtypes(include=['object']).columns.tolist()

train = train.drop(columns=categorical_features)
test = test.drop(columns=categorical_features)

train.head()

| Unnamed: 0 | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice | TotalSF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 4.189655 | 9.04204 | 7 | 5 | 2003 | 2003 | 5.283204 | 706 | 0.0 | ... | 4.127134 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 2008 | 208500 | 1712 |
| 1 | 20 | 4.394449 | 9.169623 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5 | 2007 | 181500 | 2524 |
| 2 | 60 | 4.234107 | 9.328212 | 7 | 5 | 2001 | 2002 | 5.09375 | 486 | 0.0 | ... | 3.7612 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9 | 2008 | 223500 | 1840 |
| 3 | 70 | 4.110874 | 9.164401 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | 0.0 | ... | 3.583519 | 5.609472 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 2006 | 140000 | 1717 |
| 4 | 60 | 4.442651 | 9.565284 | 8 | 5 | 2000 | 2000 | 5.860786 | 655 | 0.0 | ... | 4.442651 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12 | 2008 | 250000 | 2290 |
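
The cell above finds the categorical columns by their `object` dtype in the raw file and drops those names from the encoded frames. As a minimal sketch on toy data (the frames `raw` and `encoded` below are made up for illustration and are not the competition files), the same step can be written so that it tolerates a column that an earlier feature-engineering step already removed, by passing `errors='ignore'` to `drop`:

```python
import pandas as pd

# Toy stand-in for the raw file: two text columns and one numeric column.
raw = pd.DataFrame({
    'MSZoning': pd.Series(['RL', 'RM'], dtype=object),
    'Street': pd.Series(['Pave', 'Pave'], dtype=object),
    'LotArea': [8450, 9600],
})

# Toy stand-in for the encoded frame: 'Street' was already dropped upstream.
encoded = pd.DataFrame({
    'MSZoning': [3, 4],
    'LotArea': [9.04, 9.17],
    'TotalSF': [1712, 2524],
})

# Columns that were text in the raw data count as categorical.
categorical = raw.select_dtypes(include=['object']).columns.tolist()

# errors='ignore' skips names that are no longer present instead of raising.
numeric_only = encoded.drop(columns=categorical, errors='ignore')
print(numeric_only.columns.tolist())
```

Without `errors='ignore'`, `drop` raises a `KeyError` when any listed column is missing, which is the behaviour of the cell above.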


## Modelling

We will train models on the newly generated dataset. I also want to test dropping columns that I believe are unnecessary because their information is already captured by other features.
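
One way to check whether a group of columns is redundant is to test whether another column can be rebuilt from them. The sketch below runs on a small synthetic frame, not on the notebook's data. The `TotalBsmtSF` column and the helper `max_abs_reconstruction_error` are assumptions made for illustration, and the transformed columns in this notebook may not satisfy such an exact identity.

```python
import numpy as np
import pandas as pd

def max_abs_reconstruction_error(df, parts, total):
    """Largest absolute gap between df[total] and the row-wise sum of df[parts]."""
    return float((df[parts].sum(axis=1) - df[total]).abs().max())

# Synthetic stand-in for the basement columns (hypothetical values).
rng = np.random.default_rng(0)
parts = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF']
demo = pd.DataFrame({
    'BsmtFinSF1': rng.integers(0, 1000, size=50),
    'BsmtFinSF2': rng.integers(0, 300, size=50),
    'BsmtUnfSF': rng.integers(0, 800, size=50),
})
# Assumed relationship: the total is the sum of the three parts.
demo['TotalBsmtSF'] = demo[parts].sum(axis=1)

error = max_abs_reconstruction_error(demo, parts, 'TotalBsmtSF')
print(f'Max reconstruction error: {error}')
```

An error of zero means the parts add no information beyond the total for a model that can use sums. A non-zero error means the columns are only partly redundant, and dropping them is a trade-off to validate empirically, as the experiments below do.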

In [None]:
print('====== Numerical feature ======')
catboost(
    df=train,
    df_test=test,
)

xgboost(
    df=train,
    df_test=test,
)

print('\n====== Drop feature BsmtFinSF1, BsmtFinSF2, BsmtUnfSF ======')

catboost(
    df=train.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1),
    df_test=test.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1),
)

xgboost(
    df=train.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1),
    df_test=test.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1),
)

print('\n====== Drop feature BsmtFinSF1, BsmtFinSF2 ======')


catboost(
    df=train.drop(['BsmtFinSF1', 'BsmtFinSF2'], axis=1),
    df_test=test.drop(['BsmtFinSF1', 'BsmtFinSF2'], axis=1),
)

xgboost(
    df=train.drop(['BsmtFinSF1', 'BsmtFinSF2'], axis=1),
    df_test=test.drop(['BsmtFinSF1', 'BsmtFinSF2'], axis=1),
)


==== Fold 1 results for CatBoostRegressor ====
Fold 1 - R2: 0.8964 | RMSE: 28185.0869

==== Fold 2 results for CatBoostRegressor ====
Fold 2 - R2: 0.9082 | RMSE: 24983.0691

==== Fold 3 results for CatBoostRegressor ====
Fold 3 - R2: 0.6948 | RMSE: 41061.3784

==== Fold 4 results for CatBoostRegressor ====
Fold 4 - R2: 0.8807 | RMSE: 27368.7478

==== Fold 5 results for CatBoostRegressor ====
Fold 5 - R2: 0.9271 | RMSE: 19519.2313

==== Mean metrics ====
R2 Score: 0.8615
RMSE: 28223.5027

==== Fold 1 results for XGBRegressor ====
Fold 1 - R2: 0.8982 | RMSE: 27949.4469

==== Fold 2 results for XGBRegressor ====
Fold 2 - R2: 0.8547 | RMSE: 31431.2599

==== Fold 3 results for XGBRegressor ====
Fold 3 - R2: 0.6344 | RMSE: 44943.6084

==== Fold 4 results for XGBRegressor ====
Fold 4 - R2: 0.8371 | RMSE: 31981.1134

==== Fold 5 results for XGBRegressor ====
Fold 5 - R2: 0.8929 | RMSE: 23656.9482

==== Mean metrics ====
R2 Score: 0.8235
RMSE: 31992.4754


==== Fold 1 results for CatBoostRegre

As we can see, CatBoost consistently outperforms XGBoost, and the models score higher when the unnecessary features are removed. However, these scores are computed from the training dataset only and do not show how the models perform on unseen data. For that reason, I will submit the predictions of the two best models to Kaggle to see their performance.
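
The per-fold R2 and RMSE values printed above can be reproduced in spirit with a plain k-fold loop. The internals of the `catboost` and `xgboost` helpers are not shown in this notebook, so the sketch below is an assumption about the general procedure: it uses shuffled 5-fold splits, synthetic data, and ordinary least squares as a stand-in model. The function `kfold_scores` is a hypothetical name.

```python
import numpy as np

def kfold_scores(X, y, n_splits=5, seed=0):
    """Return a list of (R2, RMSE) pairs, one per held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Add an intercept column and fit on the training folds only.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold.
        residual = y[val_idx] - A_val @ coef
        rmse = float(np.sqrt(np.mean(residual ** 2)))
        ss_tot = float(np.sum((y[val_idx] - y[val_idx].mean()) ** 2))
        r2 = 1.0 - float(np.sum(residual ** 2)) / ss_tot
        scores.append((r2, rmse))
    return scores

# Synthetic regression data (not the house-price data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

scores = kfold_scores(X, y)
for fold, (r2, rmse) in enumerate(scores, start=1):
    print(f'Fold {fold} - R2: {r2:.4f} | RMSE: {rmse:.4f}')
r2_values = np.array([r2 for r2, _ in scores])
print(f'Mean R2: {r2_values.mean():.4f} | Std R2: {r2_values.std():.4f}')
```

Reporting the spread across folds alongside the mean is useful here, because Fold 3 scores well below the other folds in the runs above and the mean differences between configurations are small.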

In [None]:
print('\n====== Drop feature BsmtFinSF1, BsmtFinSF2, BsmtUnfSF ======')

catboost(
    path_to_log_csv=join('..', '..', 'log', 'numericFeature_drop3features', 'experiment_logger.csv'),
    author="Thien",
    df=train.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1),
    df_test=test.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1),
    name_folder="numericFeature_drop3features",
    save_log=True,
    save_model=True,
    save_submission=True
)

print('\n====== Drop feature BsmtFinSF1, BsmtFinSF2 ======')

catboost(
    path_to_log_csv=join('..', '..', 'log', 'numericFeature_drop2features', 'experiment_logger.csv'),
    author="Thien",
    df=train.drop(['BsmtFinSF1', 'BsmtFinSF2'], axis=1),
    df_test=test.drop(['BsmtFinSF1', 'BsmtFinSF2'], axis=1),
    name_folder="numericFeature_drop2features",
    save_log=True,
    save_model=True,
    save_submission=True
)



==== Fold 1 results for CatBoostRegressor ====
Fold 1 - R2: 0.8990 | RMSE: 27836.5268

==== Fold 2 results for CatBoostRegressor ====
Fold 2 - R2: 0.8950 | RMSE: 26714.5556

==== Fold 3 results for CatBoostRegressor ====
Fold 3 - R2: 0.7331 | RMSE: 38399.2221

==== Fold 4 results for CatBoostRegressor ====
Fold 4 - R2: 0.8861 | RMSE: 26745.4389

==== Fold 5 results for CatBoostRegressor ====
Fold 5 - R2: 0.9157 | RMSE: 20994.3366

==== Mean metrics ====
R2 Score: 0.8658
RMSE: 28138.0160
Logged experiment to ..\..\logs\numericFeature_drop3features\experiment_logger.csv
✅ Model saved to ..\..\log\numericFeature_drop3features\Model Pickles\CatBoostRegressor\CatBoostRegressor_numericFeature_drop3features.pkl
📤 Submission file saved to ..\..\data\submissions\numericFeature_drop3features\CatBoostRegressor\submission_CatBoostRegressor_numericFeature_drop3features.csv


==== Fold 1 results for CatBoostRegressor ====
Fold 1 - R2: 0.8914 | RMSE: 28857.0631

==== Fold 2 results for CatBoos

In [None]:
catboost(
    path_to_log_csv=join('..', '..', 'log', 'numericFeature', 'experiment_logger.csv'),
    author="Thien",
    df=train,
    df_test=test,
    name_folder="numericFeature",
    save_log=True,
    save_model=True,
    save_submission=True
)


==== Fold 1 results for CatBoostRegressor ====
Fold 1 - R2: 0.8964 | RMSE: 28185.0869

==== Fold 2 results for CatBoostRegressor ====
Fold 2 - R2: 0.9082 | RMSE: 24983.0691

==== Fold 3 results for CatBoostRegressor ====
Fold 3 - R2: 0.6948 | RMSE: 41061.3784

==== Fold 4 results for CatBoostRegressor ====
Fold 4 - R2: 0.8807 | RMSE: 27368.7478

==== Fold 5 results for CatBoostRegressor ====
Fold 5 - R2: 0.9271 | RMSE: 19519.2313

==== Mean metrics ====
R2 Score: 0.8615
RMSE: 28223.5027
Logged experiment to ..\..\logs\numericFeature\experiment_logger.csv
✅ Model saved to ..\..\log\numericFeature\Model Pickles\CatBoostRegressor\CatBoostRegressor_numericFeature.pkl
📤 Submission file saved to ..\..\data\submissions\numericFeature\CatBoostRegressor\submission_CatBoostRegressor_numericFeature.csv


## Result

<img src="images/output_kaggle_numericFeature.png" width="800">

These results show that models trained on numerical features only do not perform as well as those trained on all feature types. This suggests two things:
- A smaller feature set does not always give better model performance.
- Model performance can be improved by including categorical features, as long as those features are informative.

## Save file

In [None]:
# Create the output folder first; to_csv does not create missing directories
save_dir = join(output_dir, 'FE_numericFeature')
os.makedirs(save_dir, exist_ok=True)

# Full numeric feature set
train.to_csv(join(save_dir, 'numericFeature_train.csv'), index=False)
test.to_csv(join(save_dir, 'numericFeature_test.csv'), index=False)

# Variant without BsmtFinSF1 and BsmtFinSF2
train.drop(['BsmtFinSF1', 'BsmtFinSF2'], axis=1).to_csv(join(save_dir, 'numericFeature_drop2features_train.csv'), index=False)
test.drop(['BsmtFinSF1', 'BsmtFinSF2'], axis=1).to_csv(join(save_dir, 'numericFeature_drop2features_test.csv'), index=False)

# Variant without BsmtFinSF1, BsmtFinSF2 and BsmtUnfSF
train.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1).to_csv(join(save_dir, 'numericFeature_drop3features_train.csv'), index=False)
test.drop(['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'], axis=1).to_csv(join(save_dir, 'numericFeature_drop3features_test.csv'), index=False)

# The end