# Preprocessing the house prices dataset
In the following notebooks we preprocess the data: we handle missing values, transform the variables, and treat outliers. We also build a dedicated pipeline for those transformations.

In this notebook specifically, we choose the model features and transform them with pipelines, then apply the same steps to the test data.

In [1]:
# import dataset and libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, PowerTransformer, MinMaxScaler, FunctionTransformer, OneHotEncoder
from sklearn.compose import TransformedTargetRegressor, ColumnTransformer, make_column_transformer
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression

In [2]:
train_data = pd.read_csv("data/train_preprocessed.csv", index_col="Id")
test_data = pd.read_csv("data/test_preprocessed.csv", index_col="Id")

As in the non-pipeline version, we define the columns under consideration and the type of operation applied to each group.

In [3]:
ord_model = ["OverallQual", "ExterQual", "BsmtQual", "BsmtExposure", "CentralAir", "KitchenQual", "FireplaceQu", 
             "GarageFinish", "GarageCond", "Fence"]
int_model = ["YearBuilt", 'MoSold', 'YrSold']
nom_model = ["MSZoning"]
rat_model = ["LotArea", "MasVnrArea", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "GarageArea", 'WoodDeckSF', 'OpenPorchSF', 
             "TotRmsAbvGrd", "BsmtFullBath", "FullBath", "BedroomAbvGr", 'Fireplaces', "GarageCars"]
log_model = ["LotArea", "MasVnrArea", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "GarageArea", 'WoodDeckSF', 
                'OpenPorchSF', "TotRmsAbvGrd"]

target = "SalePrice"

model_features = nom_model + ord_model + int_model + rat_model

# Positional indexes of each feature group within model_features (used by the column transformers)
nom_index = [0]
ord_index = list(range(1, len(ord_model) + 1))
int_start = len(nom_model + ord_model)
int_index = list(range(int_start, int_start + len(int_model)))
rat_start = len(model_features) - len(rat_model)
rat_index = list(range(rat_start, len(model_features)))
# log_model matches the first len(log_model) entries of rat_model, so it starts at rat_start too
log_index = list(range(rat_start, rat_start + len(log_model)))
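
The index lists above depend on the order in which the groups are concatenated into model_features. As a hedged cross-check, the same positions can be derived from the column names. The feature lists are copied from the cell above; `positions` is a hypothetical helper introduced only for this sketch.

```python
ord_model = ["OverallQual", "ExterQual", "BsmtQual", "BsmtExposure", "CentralAir",
             "KitchenQual", "FireplaceQu", "GarageFinish", "GarageCond", "Fence"]
int_model = ["YearBuilt", "MoSold", "YrSold"]
nom_model = ["MSZoning"]
rat_model = ["LotArea", "MasVnrArea", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF",
             "GarageArea", "WoodDeckSF", "OpenPorchSF", "TotRmsAbvGrd",
             "BsmtFullBath", "FullBath", "BedroomAbvGr", "Fireplaces", "GarageCars"]
log_model = ["LotArea", "MasVnrArea", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF",
             "GarageArea", "WoodDeckSF", "OpenPorchSF", "TotRmsAbvGrd"]

model_features = nom_model + ord_model + int_model + rat_model


def positions(names, all_names):
    """Return the position of each name within all_names (hypothetical helper)."""
    return [all_names.index(name) for name in names]


nom_index = positions(nom_model, model_features)
ord_index = positions(ord_model, model_features)
int_index = positions(int_model, model_features)
rat_index = positions(rat_model, model_features)
log_index = positions(log_model, model_features)

# Round trip: the indexes must point back at the intended column names
assert [model_features[i] for i in log_index] == log_model
```

Looking positions up by name keeps the indexes correct even if a feature is later added to or removed from one of the lists.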

In [4]:
X_test = test_data[model_features]
X = train_data[model_features]
y = train_data[target]

First, the ord_model features are encoded as ordinal integers using df.replace.
Then:

Step_1:
- Ratio features which are heavily skewed or have a large number of outliers will be treated with a log transformation.
- All interval variables will be rescaled with MinMaxScaler so that they start from 0.
- "MSZoning" will be replaced by one-hot encoding.

Step_2:
- All ratio variables will be scaled with StandardScaler, everything else with MinMaxScaler.

Step_3:
- Target variable will be treated with a log transformation and then scaled using Standard Scaler.
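
Step_3 can be sketched with `TransformedTargetRegressor`, which this notebook imports. The snippet below is a minimal illustration on synthetic data, not the notebook's actual model: `X_demo`, `price_demo` and `target_transformer` are names made up for the example, and the target is constructed so that its log is exactly linear in the features.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in data (not the house prices set): log1p(price) is linear in X
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
price_demo = np.expm1(12.0 + 0.3 * X_demo[:, 0] - 0.2 * X_demo[:, 1])

# Target transformer: log1p first, then standardization; both steps are invertible
target_transformer = make_pipeline(
    FunctionTransformer(np.log1p, inverse_func=np.expm1),
    StandardScaler(),
)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=target_transformer,
)
model.fit(X_demo, price_demo)

# predict() applies the inverse transforms, so predictions come back in price units
pred_demo = model.predict(X_demo)
```

Because the inverse transform is applied automatically, predictions can be compared with the untransformed prices without any manual back-transformation.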

In [5]:
# Ordinal mappings copied from the non-pipeline notebook
exter_dict = {"Ex":5, "Gd":4, "TA":3, "Fa":2, "Po":1}
bsmt_qual_dict = {"Ex":5, "Gd":4, "TA":3, "Fa":2, "Po":1, "NoBsmt":0}
bsmt_exp_dict = {"Gd":4, "Av":3, "Mn":2, "No":1, "NoBsmt":0}
cent_dict = {"Y":1, "N":0}
kitch_dict = exter_dict
fire_dict = {"Ex":5, "Gd":4, "TA":3, "Fa":2, "Po":1, "NoFireplace":0}
garg_fin_dict = {"Fin":3, "RFn":2, "Unf":1, "NoGarage":0}
garg_cond_dict = {"Ex":5, "Gd":4, "TA":3, "Fa":2, "Po":1, "NoGarage":0}
fence_dict = {"GdPrv":2, "GdWo":2, "MnPrv":1, "MnWw":1, "NoFence":0}

dict_list = [exter_dict, bsmt_qual_dict, bsmt_exp_dict, cent_dict, kitch_dict, fire_dict, 
             garg_fin_dict, garg_cond_dict, fence_dict]
# OverallQual is already numeric, so the mappings start at ord_model[1]
replacement_dict = dict(zip(ord_model[1:], dict_list))

X_test = X_test.replace(replacement_dict)
X = X.replace(replacement_dict)
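
The nested dictionary passed to `replace` maps each column name to its own value mapping. A small sketch on a toy frame (the `toy*` names and the `leftover` check are illustrative assumptions, not part of the notebook) shows the behaviour and a simple way to spot labels that no mapping covered:

```python
import pandas as pd

# Toy frame with two of the ordinal columns, stored as object dtype
toy = pd.DataFrame(
    {"ExterQual": ["Ex", "TA", "Fa"], "CentralAir": ["Y", "N", "Y"]},
    dtype=object,
)

# Outer keys are column names, inner dicts map old value -> new value
toy_replacement = {
    "ExterQual": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1},
    "CentralAir": {"Y": 1, "N": 0},
}
toy_encoded = toy.replace(toy_replacement)

# Labels missing from a mapping are left unchanged, so checking for
# leftover strings can catch typos or unexpected categories
leftover = toy_encoded.apply(
    lambda col: col.map(lambda value: isinstance(value, str)).any()
)
```

If any entry of `leftover` is true, the corresponding column still contains a string label that its dictionary did not cover.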

In [6]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42, test_size=0.3)

In [7]:
log_transformer = FunctionTransformer(np.log1p, check_inverse=False)

step_1_trans = make_column_transformer(
    (log_transformer, log_index), 
    (MinMaxScaler(), int_index),
    (OneHotEncoder(), nom_index),
    remainder="passthrough",
    sparse_threshold=0
)

step_2_trans = make_column_transformer(
    (StandardScaler(), rat_index),
    (MinMaxScaler(), ord_index + nom_index),
    remainder="passthrough",
    sparse_threshold=0
)

In [8]:
pipe = make_pipeline(step_1_trans, step_2_trans)
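
One thing worth verifying when chaining two column transformers by position: a `ColumnTransformer` outputs the transformed columns first, in the order the transformers are listed, and appends the passthrough columns at the end. The index lists used in step_2_trans refer to positions in model_features, so they may not point at the same columns after step_1_trans has run. The toy array below is an assumption made purely to show the reordering.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

# Three columns; only the last one (position 2) is transformed
X_demo = np.array([[0.0, 10.0, 100.0],
                   [1.0, 20.0, 200.0]])

ct_demo = make_column_transformer(
    (FunctionTransformer(np.negative), [2]),
    remainder="passthrough",
)
out_demo = ct_demo.fit_transform(X_demo)

# The transformed column now sits at position 0,
# and the untouched columns follow in their original relative order
print(out_demo)
```

After this step the column originally at position 2 is at position 0, so a second transformer that still addresses "column 2" would act on a different feature.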

In [9]:
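# Fit the pipeline on the training split and transform it
pipe.fit(X_train)
X_train = pipe.transform(X_train)
# NOTE: the pipeline is refit on each split below, so the validation and test rows
# are scaled and encoded with their own statistics rather than the training ones
pipe.fit(X_valid)
X_valid = pipe.transform(X_valid)
pipe.fit(X_test)
X_test = pipe.transform(X_test)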
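
The usual pattern is to fit a preprocessing pipeline once on the training split and reuse it, already fitted, on the other splits. The sketch below uses synthetic data with a deliberately shifted validation set (`X_tr`, `X_va` and `demo_pipe` are illustrative names) to show how the two approaches differ.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
X_va = rng.normal(loc=12.0, scale=2.0, size=(30, 3))  # shifted on purpose

# Fit once on the training rows, then only transform the other split
demo_pipe = make_pipeline(StandardScaler())
demo_pipe.fit(X_tr)
X_tr_t = demo_pipe.transform(X_tr)
X_va_t = demo_pipe.transform(X_va)  # reuses the training mean and std

# For comparison: refitting on the validation rows recenters them at zero
X_va_refit = make_pipeline(StandardScaler()).fit_transform(X_va)

print("validation mean, train statistics:", X_va_t.mean(axis=0))
print("validation mean, refit statistics:", X_va_refit.mean(axis=0))
```

With the training statistics the shift in the validation data is preserved. After a refit it disappears, so a model trained on `X_tr_t` would see the same raw value encoded differently.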

In [10]:
reg_1 = LinearRegression(fit_intercept=False).fit(X_train, y_train)
reg_1.score(X_train, y_train)

0.7962186024276492

In [11]:
y_predict = reg_1.predict(X_valid)

In [12]:
mean_squared_error(np.log(y_valid + 1), np.log(y_predict + 1))

0.11764887162810558
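
The quantity computed above is the mean squared logarithmic error. As a hedged aside, scikit-learn provides it as `mean_squared_log_error`, which gives the same value for non-negative inputs. The prices below are illustrative numbers, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Illustrative prices and predictions (made up for this sketch)
y_true_demo = np.array([154500.0, 325000.0, 115000.0, 159000.0, 315500.0])
y_pred_demo = np.array([160000.0, 300000.0, 120000.0, 150000.0, 330000.0])

# Manual version, as in the cell above
msle_manual = mean_squared_error(np.log1p(y_true_demo), np.log1p(y_pred_demo))

# Built-in version; it raises an error if any value is negative
msle_builtin = mean_squared_log_error(y_true_demo, y_pred_demo)

# Taking the square root gives the RMSLE
rmsle = np.sqrt(msle_builtin)
```

The built-in function rejects negative values, which a linear model fitted on raw prices can produce, so the manual form may return NaN where the built-in raises an error.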

In [25]:
# Predict the prices for the test set
y_test_pred = reg_1.predict(X_test)

In [26]:
y_test_pred

array([1007946.89237482,  173290.87648018,  194900.34945233, ...,
        185674.74861987,  151661.21052994,  255610.74505277])

In [33]:
# Build the submission: the Id index plus the predicted SalePrice
submission = pd.DataFrame({"SalePrice": y_test_pred}, index=test_data.index)
submission.to_csv("data/submission_1.csv")

Fitting to log(y_train + 1)

In [13]:
log_y_train = np.log(y_train + 1)
# Standardized log target (Step_3); computed here but not passed to reg_2 below
scaled_y_train = (log_y_train - np.mean(log_y_train)) / np.std(log_y_train)

reg_2 = LinearRegression(fit_intercept=False).fit(X_train, log_y_train)
reg_2.score(X_train, log_y_train)

0.8644854795394824

In [14]:
y_predict_valid = reg_2.predict(X_valid)

In [20]:
y_valid

Id
893     154500
1106    325000
414     115000
523     159000
1037    315500
         ...  
332     139000
324     126175
651     205950
440     110000
799     485000
Name: SalePrice, Length: 438, dtype: int64

In [17]:
y_predict_valid

array([ 14.43574044,  15.35261038,  21.58407881,  21.8210499 ,
        15.44445898,  21.03051841,  14.95229682,  14.53222632,
        21.05165572,  14.49158856,  14.60396844,  14.35524689,
        15.88805917,  14.9009937 ,  14.76181304,  14.43966364,
        14.89210888,  14.36121823,  21.59873541,  15.00308336,
        14.80387565,  16.7049367 ,  14.75879117,  14.29667097,
        14.97649652,  14.68723826,  14.91006453,  14.30840323,
        14.78632326,  16.70015791,  14.37035367,  15.10425509,
        14.87221618,  14.18278138,  15.24758201,  14.57002999,
        14.53844611,  14.91728384,  15.3504325 ,  21.35853112,
        21.56553179,  15.08404339,  14.31271441,  15.3762245 ,
        14.43964669,  21.76519378,  14.27151654,  14.34180013,
        15.74209327,  14.43174278,  14.31128665,  14.90719351,
        21.43738133,  15.45975675,  21.78792056,  15.0817064 ,
        14.97827139,  14.71710963,  14.64141652,  14.13121481,
        20.88719845,  14.50165086,  15.26328721,  15.21

In [23]:
mean_squared_error(np.log(y_valid + 1), y_predict_valid) # Why is it so high!?

61.77296877708384
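
A hedged first step towards answering the question in the last cell is to look at individual residuals instead of only their mean. The sketch below uses the first five values of the displayed `y_valid` and `y_predict_valid` outputs, assuming the two are aligned row by row.

```python
import numpy as np

# First five values copied from the displayed outputs above
y_valid_head = np.array([154500.0, 325000.0, 115000.0, 159000.0, 315500.0])
pred_head = np.array([14.43574044, 15.35261038, 21.58407881, 21.8210499, 15.44445898])

log_true = np.log1p(y_valid_head)
residuals = pred_head - log_true

print("log1p(y_valid):", np.round(log_true, 2))
print("residuals:     ", np.round(residuals, 2))
print("MSE on these rows:", np.mean(residuals ** 2))
```

Each of these five predictions is above the log price by more than 2, which points to a systematic shift rather than noise. One plausible contributor, visible in the cell that transforms the splits, is that the pipeline is refit on X_valid, so the validation features are not scaled the way reg_2 saw them during training. This is a guess to check, not a verified diagnosis.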