# House Prices Advanced Regression Techniques 2

This is the next part of "House Price.ipynb", where we will try alternative methods to predict house prices.

## Rounding to 50

If we look at the SalePrice column of the training set, we can see that most values are multiples of 50. So we will change the post-processing step slightly and round each prediction to the nearest 50.

In [1]:
%matplotlib inline
import warnings

import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings('ignore')

In [2]:
df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

# Rows whose SalePrice is an exact multiple of 50
df_multiple_of_50 = df_train.loc[df_train['SalePrice'] % 50 == 0]
percentage = len(df_multiple_of_50) / len(df_train)
print(f'{percentage:.2%} of the SalePrice values are multiples of 50.')

93.22% of the SalePrice values are multiples of 50.


That's a lot. We expect the test set to follow the same pattern.
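As a quick illustration of the rounding idea, the sketch below snaps a few made-up prices (not taken from the competition data) to the nearest multiple of 50. It uses the same expression as the `round_nearest_50` helper defined later in this notebook. One detail worth knowing: `Series.round` follows NumPy's round-half-to-even rule, so a value exactly halfway between two multiples of 50 can go either down or up.

```python
import pandas as pd


def round_nearest_50(sale_price: pd.Series) -> pd.Series:
    # Scale down by 50, round to the nearest integer, then scale back up.
    return (sale_price / 50).round() * 50


# Made-up prices for illustration only.
prices = pd.Series([208_512.3, 181_024.9, 140_025.0, 140_075.0])
rounded = round_nearest_50(prices)

# 140_025 is exactly halfway and goes down (2800.5 -> 2800),
# while 140_075 is exactly halfway and goes up (2801.5 -> 2802).
print(rounded.tolist())
```

For prices in the hundreds of thousands, a shift of at most 25 is tiny on the log scale, so the rounding can only help if the true prices really are multiples of 50.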

In [3]:

from sklearn.linear_model import Lasso
from sklearn.ensemble import VotingRegressor
from sklearn.kernel_ridge import KernelRidge
from lightgbm import LGBMRegressor

In [4]:
df_train['SalePriceLog1p'] = np.log1p(df_train['SalePrice'])

The best params are taken from "House Price.ipynb".

In [5]:
kernel_ridge = KernelRidge(**{'alpha': 0.001, 'gamma': 0.001, 'kernel': 'rbf'})
lasso = Lasso(random_state=0, **{'alpha': 0.0001})
lightgbm = LGBMRegressor(random_state=0, **{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100})

voting_regressor = VotingRegressor(
    estimators=[
        ('kernel_ridge', kernel_ridge),
        ('lasso', lasso),
        ('lightgbm', lightgbm),
    ],
    n_jobs=-1,
    verbose=1
)

In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from scipy import stats

In [7]:
high_correlated_with_sale_price = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd',
                                   'AgeBuilt', 'AgeRemodAdd']

tobe_normalized_cols = high_correlated_with_sale_price.copy()

selected_columns = ['NormalizedOverallQual', 'NormalizedGrLivArea', 'NormalizedGarageCars', 'NormalizedTotalBsmtSF',
                    'NormalizedFullBath', 'NormalizedTotRmsAbvGrd', 'NormalizedAgeBuilt', 'NormalizedAgeRemodAdd',
                    'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale', 'Neighborhood_BrkSide',
                    'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards',
                    'Neighborhood_Gilbert', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
                    'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
                    'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer',
                    'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber',
                    'Neighborhood_Veenker']

The whole process is similar to part 1, except for the post-processing step.

In [8]:
def fill_missing(df: pd.DataFrame):
    continuous_imputer = SimpleImputer(strategy='median')
    discrete_imputer = SimpleImputer(strategy='most_frequent')

    continuous_columns_ = df.select_dtypes(include=['float64', 'int64']).columns
    discrete_columns_ = df.select_dtypes(include=['object']).columns

    df[continuous_columns_] = continuous_imputer.fit_transform(df[continuous_columns_])
    df[discrete_columns_] = discrete_imputer.fit_transform(df[discrete_columns_])

    return df

def year_to_age(df: pd.DataFrame):
    df['AgeBuilt'] = 2016 - df['YearBuilt']
    df['AgeRemodAdd'] = 2016 - df['YearRemodAdd']
    return df

def normalize_continuous_data(df: pd.DataFrame):
    for c in tobe_normalized_cols:
        df[c] = pd.to_numeric(df[c], errors='coerce')
        df[f'Normalized{c}'] = stats.boxcox(df[c], lmbda=0.2)

    return df

def encode_discrete_data(df: pd.DataFrame):
    if 'Neighborhood' in df.columns:
        df = pd.get_dummies(df, columns=['Neighborhood'])
    else:
        print('Neighborhood not found')
    return df

def select_features(df: pd.DataFrame):
    for c in selected_columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')
    return df[selected_columns]

def inverse_log_transform(sale_price_log1p: pd.Series) -> pd.Series:
    return np.expm1(sale_price_log1p)

def round_nearest_50(sale_price: pd.Series) -> pd.Series:
    return (sale_price / 50).round() * 50
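The `normalize_continuous_data` helper above applies a Box-Cox transform with a fixed lambda of 0.2. For a non-zero lambda, the transform is `(x ** lmbda - 1) / lmbda`. The sketch below checks that formula against `scipy.stats.boxcox` on a few made-up values and shows that `scipy.special.inv_boxcox` undoes it. The values are strictly positive on purpose; zero-valued inputs are not covered by this sketch.

```python
import numpy as np
from scipy import special, stats

lmbda = 0.2
# Made-up, strictly positive values (for example living areas in square feet).
x = np.array([1.0, 2.0, 856.0, 1710.0])

# With an explicit lmbda, stats.boxcox returns only the transformed array.
transformed = stats.boxcox(x, lmbda=lmbda)

# Same transform written out by hand.
manual = (x ** lmbda - 1) / lmbda

# The inverse transform recovers the original values.
recovered = special.inv_boxcox(transformed, lmbda)

print(np.allclose(transformed, manual), np.allclose(recovered, x))
```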

In [9]:
preprocess_pipeline = make_pipeline(
    FunctionTransformer(fill_missing, validate=False),
    FunctionTransformer(year_to_age, validate=False),
    FunctionTransformer(normalize_continuous_data, validate=False),
    FunctionTransformer(encode_discrete_data, validate=False),
    FunctionTransformer(select_features, validate=False),
)

postprocess_pipeline = make_pipeline(
    FunctionTransformer(inverse_log_transform, validate=False),
    FunctionTransformer(round_nearest_50, validate=False),
)

complete_pipeline = make_pipeline(
    preprocess_pipeline,
    voting_regressor,
    postprocess_pipeline
)

complete_pipeline.fit(df_train, df_train['SalePriceLog1p'])
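Because the last step of `complete_pipeline` is a transformer rather than a predictor, this notebook predicts in explicit steps (preprocess, predict, post-process) instead of calling `complete_pipeline.predict`. One possible alternative, which this notebook does not use, is scikit-learn's `TransformedTargetRegressor`. It applies `log1p` to the target during fitting and a custom inverse during prediction. The sketch below runs on synthetic data, with `LinearRegression` standing in for the voting regressor. The function name `to_price_rounded_50` is made up for this example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression


def to_price_rounded_50(y_log1p):
    # Undo log1p, then snap to the nearest multiple of 50.
    return np.round(np.expm1(y_log1p) / 50) * 50


# Synthetic data: one "living area" feature and a noisy, strictly positive price.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 50_000 + 80 * X[:, 0] + rng.normal(0, 5_000, size=200)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),      # stand-in for the voting regressor
    func=np.log1p,                     # applied to y before fitting
    inverse_func=to_price_rounded_50,  # applied to predictions
    check_inverse=False,               # rounding makes the inverse inexact on purpose
)
model.fit(X, y)

predictions = model.predict(X[:5])
print(predictions)
```

With this wrapper, `predict` returns prices that are already on the original scale and rounded to the nearest 50.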

Read test set and predict.

In [10]:
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

preprocess_test = preprocess_pipeline.transform(df_test)
predictions = voting_regressor.predict(preprocess_test)
post_predictions = postprocess_pipeline.transform(predictions)

df_test['SalePrice'] = post_predictions

In [11]:
df_test['Id'] = df_test['Id'].astype(int)
df_test[['Id', 'SalePrice']].to_csv('../input/house-prices-advanced-regression-techniques/submission/voting-regression-v2.1.csv', index=False)

### Score: 0.14878

## Try with Top 3 models

This time, let's try a Voting Regressor and a Stacking Regressor with the top 3 models.
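Before comparing the two, it may help to see how they differ. A `VotingRegressor` without weights returns the plain average of its base models' predictions. A `StackingRegressor` instead trains a final estimator on cross-validated predictions of the base models. The sketch below illustrates both on synthetic data, with `Ridge` and a small decision tree as stand-ins for the models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Stand-in base models; both ensembles clone them before fitting.
base_estimators = [
    ('ridge', Ridge()),
    ('tree', DecisionTreeRegressor(max_depth=3, random_state=0)),
]

# Voting: the prediction is the unweighted mean of the base predictions.
voting = VotingRegressor(estimators=base_estimators).fit(X, y)
base_predictions = np.column_stack([est.predict(X) for est in voting.estimators_])
voting_is_plain_mean = np.allclose(voting.predict(X), base_predictions.mean(axis=1))

# Stacking: a final Ridge model learns how to combine the base predictions.
stacking = StackingRegressor(
    estimators=base_estimators,
    final_estimator=Ridge(),
    cv=5,
).fit(X, y)

print(voting_is_plain_mean)
print(stacking.final_estimator_.coef_)  # one learned coefficient per base model
```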

In [12]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from xgboost import XGBRegressor
from sklearn.ensemble import VotingRegressor, StackingRegressor

In [13]:
kernel_ridge = KernelRidge(**{'alpha': 0.001, 'gamma': 0.001, 'kernel': 'rbf'})
svr = SVR(**{'C': 100, 'epsilon': 0.1, 'kernel': 'rbf'})
xgboost = XGBRegressor(random_state=0, **{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100})

voting_regressor_v2 = VotingRegressor(
    estimators=[
        ('kernel_ridge', kernel_ridge),
        ('svr', svr),
        ('xgboost', xgboost)
    ],
    n_jobs=-1,
    verbose=1
)

voting_regressor_score = cross_val_score(
    voting_regressor_v2,
    preprocess_pipeline.fit_transform(df_train),
    df_train['SalePriceLog1p'],
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print(f'Voting Regressor Score: {voting_regressor_score.mean()}')

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


Voting Regressor Score: -0.1455864782289374


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.9s finished


In [14]:
stack_regressor = StackingRegressor(
    estimators=[
        ('kernel_ridge', kernel_ridge),
        ('svr', svr),
        ('xgboost', xgboost)
    ],
    final_estimator=Ridge(),
    n_jobs=-1,
    verbose=1
)

stack_regressor_score = cross_val_score(
    stack_regressor,
    preprocess_pipeline.fit_transform(df_train),
    df_train['SalePriceLog1p'],
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print(f'Stack Regressor Score: {stack_regressor_score.mean()}')

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


Stack Regressor Score: -0.14564681848766386


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.6s finished


They give nearly the same result. Let's predict with both of them.
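The scores above are negative because scikit-learn scorers follow a higher-is-better convention, so `neg_root_mean_squared_error` returns the RMSE with its sign flipped. The two means differ by only about 0.00006, so it can be useful to look at the spread across folds as well. The sketch below shows one way to report that, using synthetic data and a `Ridge` model as stand-ins.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One negated RMSE value per fold.
neg_rmse = cross_val_score(Ridge(), X, y, cv=5, scoring='neg_root_mean_squared_error')

# Flip the sign to get the usual, positive RMSE.
rmse = -neg_rmse
print(f'RMSE: {rmse.mean():.4f} +/- {rmse.std():.4f}')
```

If the standard deviation across folds is much larger than the gap between two models, cross-validation alone cannot tell them apart.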

In [15]:
complete_voting_regressor_pipeline_v2 = make_pipeline(
    preprocess_pipeline,
    voting_regressor_v2,
    postprocess_pipeline
)

complete_voting_regressor_pipeline_v2.fit(df_train, df_train['SalePriceLog1p'])

In [16]:
complete_stack_regressor_pipeline = make_pipeline(
    preprocess_pipeline,
    stack_regressor,
    postprocess_pipeline
)

complete_stack_regressor_pipeline.fit(df_train, df_train['SalePriceLog1p'])

In [17]:
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
voting_regressor_v2_predictions = postprocess_pipeline.transform(
    voting_regressor_v2.predict(preprocess_pipeline.transform(df_test))
)

df_test['Id'] = df_test['Id'].astype(int)
df_test['SalePrice'] = voting_regressor_v2_predictions
df_test[['Id', 'SalePrice']].to_csv('../input/house-prices-advanced-regression-techniques/submission/voting-regression-v3.csv', index=False)

In [18]:
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

stack_regressor_predictions = postprocess_pipeline.transform(
    stack_regressor.predict(preprocess_pipeline.transform(df_test))
)

df_test['Id'] = df_test['Id'].astype(int)
df_test['SalePrice'] = stack_regressor_predictions
df_test[['Id', 'SalePrice']].to_csv('../input/house-prices-advanced-regression-techniques/submission/stack-regression-v1.csv', index=False)

### Score

- Voting Regressor v3: 0.14421
- Stack Regressor v1: 0.14412

## Try with Top 5 models

In [19]:
kernel_ridge = KernelRidge(**{'alpha': 0.001, 'gamma': 0.001, 'kernel': 'rbf'})
svr = SVR(**{'C': 100, 'epsilon': 0.1, 'kernel': 'rbf'})
xgboost = XGBRegressor(random_state=0, **{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100})
lasso = Lasso(**{'alpha': 0.0001})
ridge = Ridge(**{'alpha': 1})

In [20]:
voting_regressor_v3 = VotingRegressor(
    estimators=[
        ('kernel_ridge', kernel_ridge),
        ('svr', svr),
        ('xgboost', xgboost),
        ('lasso', lasso),
        ('ridge', ridge)
    ],
    n_jobs=-1,
    verbose=1
)

voting_regressor_score = cross_val_score(
    voting_regressor_v3,
    preprocess_pipeline.fit_transform(df_train),
    df_train['SalePriceLog1p'],
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print(f'Voting Regressor Score: {voting_regressor_score.mean()}')

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


Voting Regressor Score: -0.14665025559092487


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.3s finished


In [21]:
stack_regressor_v2 = StackingRegressor(
    estimators=[
        ('kernel_ridge', kernel_ridge),
        ('svr', svr),
        ('xgboost', xgboost),
        ('lasso', lasso),
        ('ridge', ridge)
    ],
    final_estimator=Ridge(),
    n_jobs=-1,
    verbose=1
)

stack_regressor_score = cross_val_score(
    stack_regressor_v2,
    preprocess_pipeline.fit_transform(df_train),
    df_train['SalePriceLog1p'],
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print(f'Stack Regressor Score: {stack_regressor_score.mean()}')

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


Stack Regressor Score: -0.14550885721591


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.4s finished


Now let's make predictions with both models.

In [22]:
complete_voting_regressor_pipeline_v3 = make_pipeline(
    preprocess_pipeline,
    voting_regressor_v3,
    postprocess_pipeline
)

complete_voting_regressor_pipeline_v3.fit(df_train, df_train['SalePriceLog1p'])

In [23]:
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
voting_regressor_v3_predictions = postprocess_pipeline.transform(
    voting_regressor_v3.predict(preprocess_pipeline.transform(df_test))
)

df_test['Id'] = df_test['Id'].astype(int)
df_test['SalePrice'] = voting_regressor_v3_predictions
df_test[['Id', 'SalePrice']].to_csv('../input/house-prices-advanced-regression-techniques/submission/voting-regression-v4.csv', index=False)

In [24]:
complete_stack_regressor_pipeline_v2 = make_pipeline(
    preprocess_pipeline,
    stack_regressor_v2,
    postprocess_pipeline
)

complete_stack_regressor_pipeline_v2.fit(df_train, df_train['SalePriceLog1p'])

In [25]:
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

stack_regressor_v2_predictions = postprocess_pipeline.transform(
    stack_regressor_v2.predict(preprocess_pipeline.transform(df_test))
)

df_test['Id'] = df_test['Id'].astype(int)
df_test['SalePrice'] = stack_regressor_v2_predictions
df_test[['Id', 'SalePrice']].to_csv('../input/house-prices-advanced-regression-techniques/submission/stack-regression-v2.csv', index=False)

### Score

- Voting Regressor v4: 0.14577
- Stack Regressor v2: 0.14395

## Try VotingRegressor with weights

The experiments above show that VotingRegressor scores better with the top 3 models (0.14421) than with the top 5 models (0.14577).

We will pick the top 3 version and try a few sets of weights.
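With `weights` set, `VotingRegressor` returns a weighted average of the base predictions instead of a plain mean. The sketch below checks this against `np.average` on synthetic data. The three base models are stand-ins, not the ones tuned in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

weights = [0.4, 0.3, 0.3]
voting = VotingRegressor(
    estimators=[
        ('ridge', Ridge()),
        ('lasso', Lasso(alpha=0.1)),
        ('tree', DecisionTreeRegressor(max_depth=3, random_state=0)),
    ],
    weights=weights,
).fit(X, y)

# Recompute the ensemble prediction by hand from the fitted base models.
base_predictions = np.column_stack([est.predict(X) for est in voting.estimators_])
manual = np.average(base_predictions, axis=1, weights=weights)

print(np.allclose(voting.predict(X), manual))
```

Since the weights act as a weighted average, only their relative sizes matter.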

In [26]:
from sklearn.model_selection import GridSearchCV

In [28]:
voting_regressor_v4 = VotingRegressor(
    estimators=[
        ('kernel_ridge', kernel_ridge),
        ('svr', svr),
        ('xgboost', xgboost)
    ],
    n_jobs=-1,
    verbose=1
)

voting_v4_params = {
    'weights': [
        [0.4, 0.3, 0.3],
        [0.5, 0.25, 0.25],
        [0.6, 0.2, 0.2],
        [0.7, 0.15, 0.15],
    ]
}

voting_regressor_v4_cv = GridSearchCV(
    voting_regressor_v4,
    voting_v4_params,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

df_preprocessed = preprocess_pipeline.fit_transform(df_train)

voting_regressor_v4_cv.fit(df_preprocessed, df_train['SalePriceLog1p'])

print(f'Best Score: {voting_regressor_v4_cv.best_score_}')
print(f'Best Params: {voting_regressor_v4_cv.best_params_}')

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best Score: -0.145565209088873
Best Params: {'weights': [0.4, 0.3, 0.3]}
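The best score here (-0.145565) is only marginally better than the unweighted top 3 voting score from earlier (-0.145586), so it is worth looking at all candidates rather than only the winner. A minimal sketch of how to list every candidate from `cv_results_`, using synthetic data and stand-in models:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    VotingRegressor(
        estimators=[
            ('ridge', Ridge()),
            ('tree', DecisionTreeRegressor(max_depth=3, random_state=0)),
        ]
    ),
    {'weights': [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]},
    cv=5,
    scoring='neg_root_mean_squared_error',
).fit(X, y)

# Keep only the columns needed to compare candidates side by side.
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame({c: search.cv_results_[c] for c in columns})
results = results.sort_values('rank_test_score')
print(results.to_string(index=False))
```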


In [30]:
voting_regressor_v4_pipeline = make_pipeline(
    preprocess_pipeline,
    voting_regressor_v4_cv.best_estimator_,
    postprocess_pipeline
)

voting_regressor_v4_pipeline.fit(df_train, df_train['SalePriceLog1p'])

In [31]:
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

voting_regressor_v4_predictions = postprocess_pipeline.transform(
    voting_regressor_v4_cv.best_estimator_.predict(preprocess_pipeline.transform(df_test))
)

df_test['Id'] = df_test['Id'].astype(int)
df_test['SalePrice'] = voting_regressor_v4_predictions
df_test[['Id', 'SalePrice']].to_csv('../input/house-prices-advanced-regression-techniques/submission/voting-regression-v5.csv', index=False)

### Score: 0.14428