# Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset from a Kaggle competition challenges you to predict the final price of each home.

# Requirements

- Build and submit a scikit-learn pipeline with a neural network regressor as the model.
- Build and submit a scikit-learn pipeline with a gradient boosting regressor as the model.
- Build and submit an XGBoost model.
- Achieve a score better than 0.13 on the public leaderboard (lower is better).

# Evaluation criteria

- Public leaderboard score.
- Model simplicity.
- Prediction speed.
- Code quality.

# Imports

In [78]:
import missingno as msno
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

from functions import data_imputer, ordinal_encoder, imbalanced_features
from global_ import none_features, zero_features, quality_categories, quality_features

# Notebook display settings.
pd.set_option("display.max_columns", None)
sns.set_theme()
set_config(display="diagram")

# Data preparation

In [79]:
train_data = pd.read_csv("data/train.csv")
test_data = pd.read_csv("data/test.csv")

In [80]:
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [81]:
#msno.bar(train_data, labels=True, fontsize=12);

In [82]:
#msno.bar(test_data, labels=True, fontsize=12);

## Data cleaning

In [83]:
features_to_drop = ["Id", "Alley", "PoolArea", "PoolQC", "Fence", "MiscFeature"]

cleaned_train_data = train_data.drop(columns=features_to_drop)
cleaned_test_data = test_data.drop(columns=features_to_drop)
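The notebook does not say how the dropped columns were chosen. One common criterion, assumed here for illustration, is the share of missing values per column. The sketch below computes that share with plain pandas on a made-up toy frame; the `missing_share` helper and the 0.5 cut-off are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "Alley": [np.nan, np.nan, np.nan, "Grvl"],
        "PoolQC": [np.nan, np.nan, np.nan, np.nan],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

shares = missing_share(toy)

# Hypothetical cut-off: flag columns with more than half of the values missing.
mostly_missing = shares[shares > 0.5].index.tolist()

print(shares)
print(mostly_missing)
```

Running the same helper on `train_data` and `test_data` would give a numeric counterpart to the commented-out `msno.bar` plots.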

In [84]:
# Custom transformer for missing values imputation.
get_data_imputer = FunctionTransformer(data_imputer)
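The body of `data_imputer` lives in `functions.py` and is not shown in this notebook. As a rough idea of what such a function might do, the sketch below fills categorical gaps with an explicit `"None"` label and numeric gaps with zero. The column lists are hypothetical stand-ins for `none_features` and `zero_features`, and the real implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-ins for the lists imported from global_.
NONE_FEATURES = ["GarageType", "FireplaceQu"]
ZERO_FEATURES = ["GarageArea", "MasVnrArea"]


def impute_sketch(input_df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values without modifying the caller's frame."""
    df = input_df.copy()
    # A missing category is read as "feature absent" and gets an explicit label.
    df[NONE_FEATURES] = df[NONE_FEATURES].fillna("None")
    # A missing measurement for an absent feature is treated as zero.
    df[ZERO_FEATURES] = df[ZERO_FEATURES].fillna(0)
    return df


# Toy frame with made-up values.
toy = pd.DataFrame(
    {
        "GarageType": ["Attchd", np.nan],
        "FireplaceQu": [np.nan, "TA"],
        "GarageArea": [548.0, np.nan],
        "MasVnrArea": [np.nan, 162.0],
    }
)

# Wrapped the same way as data_imputer above.
imputed = FunctionTransformer(impute_sketch).fit_transform(toy)
print(imputed)
```

Because the function is stateless, `FunctionTransformer` applies exactly the same logic at fit time and at prediction time.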

## Feature engineering

### Features merging

In [85]:
def merger(input_df: pd.DataFrame) -> pd.DataFrame:
    """Merge rare or closely related categories into a single level."""
    # Work on a copy so the caller's DataFrame is not modified in place.
    input_df = input_df.copy()

    input_df["Functional"] = input_df["Functional"].replace(["Min1", "Min2"], "Min")
    input_df["LandContour"] = input_df["LandContour"].replace("Lvl", "Flat")
    input_df["LandContour"] = input_df["LandContour"].replace(["Bnk", "HLS", "Low"], "NotFlat")
    input_df["Condition1"] = input_df["Condition1"].replace(["RRAn", "RRAe", "RRNn", "RRNe"], "RR")
    input_df["Exterior2nd"] = input_df["Exterior2nd"].replace(["CmentBd", "HdBoard"], "Board")
    input_df["SaleType"] = input_df["SaleType"].replace(["ConLw", "ConLD", "ConLI"], "Con")

    return input_df

In [86]:
get_merger = FunctionTransformer(merger)
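To see what the category merging does to the data, the sketch below applies two of the merges from `merger` to a made-up toy frame. It uses a dictionary-driven variant (`MERGES` and `merge_categories` are hypothetical names, not part of the notebook), which is only one possible way to express the same replacements.

```python
import pandas as pd

# Two of the merges from merger(), written as {column: {old_level: new_level}}.
MERGES = {
    "Functional": {"Min1": "Min", "Min2": "Min"},
    "Condition1": {"RRAn": "RR", "RRAe": "RR", "RRNn": "RR", "RRNe": "RR"},
}


def merge_categories(input_df: pd.DataFrame, merges: dict = MERGES) -> pd.DataFrame:
    """Return a copy of input_df with the listed levels merged."""
    df = input_df.copy()
    for column, mapping in merges.items():
        df[column] = df[column].replace(mapping)
    return df


# Toy frame with made-up rows.
toy = pd.DataFrame(
    {
        "Functional": ["Typ", "Min1", "Min2"],
        "Condition1": ["Norm", "RRAe", "Feedr"],
    }
)

merged = merge_categories(toy)
print(merged)
```

Keeping the merges in a dictionary makes it easy to add or remove a merge when testing its effect on the cross-validation score.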

### Ordinal encoding

The best scores were achieved when only the quality-level features were ordinally encoded. Ordinally encoding the features for basement area, central air conditioning and garage interior as well gave worse results.

In [87]:
# Custom transformer for ordinal features encoding.
get_ordinal_encoder = FunctionTransformer(ordinal_encoder)
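The `ordinal_encoder` function is defined in `functions.py` and is not shown here. A minimal sketch of ordinal encoding for quality levels could look like the following, assuming the usual Ames ordering from `Po` (poor) to `Ex` (excellent). `QUALITY_ORDER` and `QUALITY_COLUMNS` are hypothetical stand-ins for `quality_categories` and `quality_features`, and mapping unknown levels to 0 is a design choice of this sketch.

```python
import pandas as pd

# Assumed ordering of the quality levels, worst to best.
QUALITY_ORDER = ["Po", "Fa", "TA", "Gd", "Ex"]
# Hypothetical subset of the quality columns.
QUALITY_COLUMNS = ["ExterQual", "KitchenQual"]


def encode_quality_sketch(input_df: pd.DataFrame) -> pd.DataFrame:
    """Replace quality labels with integer ranks on a copy of the frame."""
    df = input_df.copy()
    mapping = {level: rank for rank, level in enumerate(QUALITY_ORDER, start=1)}
    for column in QUALITY_COLUMNS:
        # Levels outside the mapping (for example an imputed "None") become 0.
        df[column] = df[column].map(mapping).fillna(0).astype(int)
    return df


# Toy frame with made-up values.
toy = pd.DataFrame(
    {
        "ExterQual": ["Gd", "TA", "Ex"],
        "KitchenQual": ["Fa", "None", "Po"],
    }
)

encoded = encode_quality_sketch(toy)
print(encoded)
```

Unlike one-hot encoding, this keeps the order of the levels, so a model can learn that `Ex` is better than `Gd` from a single numeric column.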

### One-hot encoding

In [88]:
ohe_features = [
    column for column in cleaned_train_data
    if cleaned_train_data[column].dtypes == object
]

ohe_encoder = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"), ohe_features)
    ],
    remainder="passthrough"
)
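The `handle_unknown="ignore"` setting matters when the test set contains a category that never appeared in the training set. The toy example below (made-up data, with a fictional `"Dirt"` street type) shows that such a value is encoded as all zeros instead of raising an error, while the remaining column is passed through unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data: "Dirt" appears only in the test frame.
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"], "LotArea": [8450, 9600, 11250]})
test = pd.DataFrame({"Street": ["Pave", "Dirt"], "LotArea": [9000, 7000]})

encoder = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"), ["Street"])
    ],
    remainder="passthrough",
)

encoder.fit(train)
encoded_test = encoder.transform(test)

# The result may be sparse or dense depending on its density; make it dense.
if hasattr(encoded_test, "toarray"):
    encoded_test = encoded_test.toarray()

# Columns: Street_Grvl, Street_Pave, LotArea.
print(encoded_test)
```

The second row has zeros in both `Street` columns, which is how the encoder represents the unseen category.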

# Modelling

In [89]:
target = "SalePrice"

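# Data used for cross-validation. The target is log-transformed, so the
# RMSE values reported below are measured on log prices.
X_cv = cleaned_train_data.drop(columns=target)
y_cv = np.log(cleaned_train_data[target])

# Data used for the final fit. The target is the raw sale price, so the
# fitted models predict prices directly.
X_train = cleaned_train_data.drop(columns=target)
y_train = cleaned_train_data[target]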
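The cell above applies `np.log` only to the cross-validation target, while the final fit uses raw prices. An alternative that is not used in this notebook is scikit-learn's `TransformedTargetRegressor`, which trains on the log target and converts predictions back to prices automatically. The sketch below shows it on synthetic data; the data, the `rmse_log` helper and the choice of regressor are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the housing data: one feature and a positive price.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 50_000 + 80 * X[:, 0] + rng.normal(0, 5_000, size=200)

# The wrapped regressor is fitted on log(y); predict() returns exp(prediction).
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=42),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)
predictions = model.predict(X)


def rmse_log(y_true, y_pred) -> float:
    """RMSE between the logarithms of the true and predicted prices."""
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# In-sample error on the toy data, for demonstration only.
print(rmse_log(y, predictions))
```

With this wrapper as the final pipeline step, the same object could be used for both cross-validation and the final fit, provided the scorer compares log prices as `rmse_log` does.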

## Neural network

In [90]:
nnet = MLPRegressor(
    hidden_layer_sizes=(150,150,150,150),
    alpha=0,
    max_iter=1000,
    random_state=7
)
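The pipelines in this notebook pass unscaled features to the neural network. Multi-layer perceptrons are generally sensitive to feature scale, so a scaling step before the model is a common addition. Whether it would improve the scores reported here has not been tested. The sketch below shows the idea on synthetic data with a much smaller network; all data and hyperparameters in it are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales: an area and a 1-10 rating.
rng = np.random.default_rng(7)
X = np.column_stack([rng.uniform(500, 3000, 300), rng.integers(1, 11, 300)])
# Synthetic target on a log-price-like scale.
y = 11.0 + 0.0003 * X[:, 0] + 0.05 * X[:, 1]

scaled_nnet = Pipeline(
    steps=[
        # Standardise every feature to zero mean and unit variance.
        ("scaler", StandardScaler()),
        ("model", MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=7)),
    ]
)

scaled_nnet.fit(X, y)
predictions = scaled_nnet.predict(X)
print(predictions[:5])
```

In the real pipeline the one-hot encoder can return a sparse matrix, which `StandardScaler` only accepts with `with_mean=False`, so the scaler settings would need to be adapted.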

In [91]:
nnet_pipeline = Pipeline(
    steps=[
        ("imputer", get_data_imputer),
        ("merger", get_merger),
        ("ordinal_encoder", get_ordinal_encoder),
        ("ohe_encoder", ohe_encoder),
        ("model", nnet)
    ]
)

In [92]:
nnet_scores = cross_val_score(nnet_pipeline, X_cv, y_cv, scoring="neg_root_mean_squared_error", error_score="raise")
abs(nnet_scores.mean())

2.8582526256176313

In [93]:
nnet_pipeline.fit(X_train, y_train)

In [94]:
nnet_predictions = nnet_pipeline.predict(cleaned_test_data)

output = pd.DataFrame({"Id": test_data["Id"], "SalePrice": nnet_predictions})
output.to_csv("submissions/nnet_predictions.csv", index=False)

## Gradient Boosting regressor

In [95]:
gbr = GradientBoostingRegressor(random_state=42)

In [96]:
gbr_pipeline = Pipeline(
    steps=[
        ("imputer", get_data_imputer),
        ("merger", get_merger),
        ("ordinal_encoder", get_ordinal_encoder),
        ("ohe_encoder", ohe_encoder),
        ("model", gbr)
    ]
)

In [97]:
gbr_scores = cross_val_score(gbr_pipeline, X_cv, y_cv, scoring="neg_root_mean_squared_error", error_score="raise")
abs(gbr_scores.mean())

0.12528518861466784

In [98]:
gbr_pipeline.fit(X_train, y_train)

In [99]:
gbr_predictions = gbr_pipeline.predict(cleaned_test_data)

output = pd.DataFrame({"Id": test_data["Id"], "SalePrice": gbr_predictions})
output.to_csv("submissions/gbr_predictions.csv", index=False)

## XGBoost regressor

In [100]:
xgb_r = xgb.XGBRegressor()

In [101]:
xgb_r_pipeline = Pipeline(
    steps=[
        ("imputer", get_data_imputer),
        ("merger", get_merger),
        ("ordinal_encoder", get_ordinal_encoder),
        ("ohe_encoder", ohe_encoder),
        ("model", xgb_r)
    ]
)

In [102]:
xgb_r_scores = cross_val_score(xgb_r_pipeline, X_cv, y_cv, scoring="neg_root_mean_squared_error", error_score="raise")
abs(xgb_r_scores.mean())

0.13527563374322832

In [103]:
xgb_r_pipeline.fit(X_train, y_train)

In [104]:
xgb_predictions = xgb_r_pipeline.predict(cleaned_test_data)

output = pd.DataFrame({"Id": test_data["Id"], "SalePrice": xgb_predictions})
output.to_csv("submissions/xgb_predictions.csv", index=False)

# Scores

In [105]:
# Cross-validation RMSE on log prices, recorded from the runs above.
nnet_score = 2.85825
gbr_score = 0.12529
xgboost_score = 0.13528

# Scores reported by Kaggle for the submitted predictions.
kaggle_nnet_score = 0.16374
kaggle_gbr_score = 0.13622
kaggle_xgboost_score = 0.14513

scores_table = pd.DataFrame(
    {
        "Score": [nnet_score, gbr_score, xgboost_score],
        "Kaggle Score": [kaggle_nnet_score, kaggle_gbr_score, kaggle_xgboost_score],
        # Recomputed from the cross-validation results of the current session.
        "Real-Time Score": [
            round(abs(nnet_scores.mean()), 5),
            round(abs(gbr_scores.mean()), 5),
            round(abs(xgb_r_scores.mean()), 5),
        ],
    },
    index=["Neural Network", "Gradient Boosting", "XGBoost"],
)

scores_table

| Model | Score | Kaggle Score | Real-Time Score |
|---|---|---|---|
| Neural Network | 2.85825 | 0.16374 | 2.85825 |
| Gradient Boosting | 0.12529 | 0.13622 | 0.12529 |
| XGBoost | 0.13528 | 0.14513 | 0.13528 |


# Fails

### Things that did not improve the final score

- Handling imbalanced features gave slightly worse results.
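The notebook imports `imbalanced_features` from `functions.py` but does not show it, so the exact treatment is unknown. One plausible reading, assumed here, is to detect columns where a single value covers almost all rows so that they can be dropped. The helper name `find_imbalanced_features` and the 0.95 threshold below are hypothetical.

```python
import pandas as pd


def find_imbalanced_features(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose most frequent value covers more than `threshold` of rows."""
    dominant_share = {
        # value_counts sorts by frequency, so the first entry is the largest share.
        column: df[column].value_counts(normalize=True, dropna=False).iloc[0]
        for column in df.columns
    }
    return [column for column, share in dominant_share.items() if share > threshold]


# Toy frame with made-up distributions.
toy = pd.DataFrame(
    {
        "Street": ["Pave"] * 99 + ["Grvl"],
        "Utilities": ["AllPub"] * 100,
        "LotShape": ["Reg"] * 60 + ["IR1"] * 40,
    }
)

imbalanced = find_imbalanced_features(toy)
print(imbalanced)
```

Dropping such columns simplifies the model, but as noted above it did not improve the score in this project.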