<a href="https://www.kaggle.com/code/erkanhatipoglu/housing-prices-pipeline-starter-code?scriptVersionId=106625877" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction  <a id='introduction'></a>

This notebook is starter code for those who want to work with sklearn pipelines. The public score for this notebook is **13873.48**, which is within the top **1%** for this competition. However, the score may change from version to version because the model is not fully deterministic. Readers can likely improve the score further by applying grid search, EDA, and feature engineering.

Kagglers interested in an improved version of this notebook that uses grid search may refer to my notebook [Housing Prices: GridSearchCV Example](https://www.kaggle.com/erkanhatipoglu/housing-prices-gridsearchcv-example).

Kagglers interested in more advanced sklearn pipeline topics may refer to my notebook [Introduction to Sklearn Pipelines with Titanic](https://www.kaggle.com/erkanhatipoglu/introduction-to-sklearn-pipelines-with-titanic).

Thank you for reading.


# Table of Contents
* [Introduction](#introduction)
* [Helper Functions](#functions)
* [Preprocessing](#preprocessing) 
* [Validation](#validation) 
* [Cross-validation using full training set](#cross-validation)    
* [Prediction](#prediction) 
* [Saving and submission](#saving)  
* [References](#references)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files
# under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as
# output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the
# current session


from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, make_scorer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
import matplotlib.pyplot as plt
%matplotlib inline
from xgboost import XGBRegressor
pd.set_option('display.max_columns', None)
import category_encoders as ce
from sklearn.feature_selection import SelectKBest, f_classif, f_regression
from pandas_profiling import ProfileReport

/kaggle/input/home-data-for-ml-course/sample_submission.csv
/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz
/kaggle/input/home-data-for-ml-course/train.csv.gz
/kaggle/input/home-data-for-ml-course/data_description.txt
/kaggle/input/home-data-for-ml-course/test.csv.gz
/kaggle/input/home-data-for-ml-course/train.csv
/kaggle/input/home-data-for-ml-course/test.csv


# Helper functions   <a id='functions'></a>   

<div class="alert alert-block alert-info">
<b>Tip:</b> We will use some helper functions throughout the notebook. Collecting them in one place is a good idea, making the code more organized.
</div>

In [2]:
def save_file(predictions):
    """Save test predictions to submission.csv.

    Relies on the global sample_submission_file (read in a later cell)
    for the Id column.
    """
    output = pd.DataFrame({'Id': sample_submission_file.Id,
                           'SalePrice': predictions})
    output.to_csv('submission.csv', index=False)
    print("Submission file is saved")
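
Because `save_file` reads the `Id` column from a global variable, it only works after the sample submission file has been loaded. As a minimal sketch of an alternative, the helper below takes the ids as an argument. The name `build_submission` and the example values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def build_submission(ids, predictions):
    """Return a submission DataFrame (hypothetical variant of save_file)."""
    return pd.DataFrame({"Id": list(ids), "SalePrice": list(predictions)})


# Illustrative values only; real ids come from the sample submission file.
submission = build_submission([1, 2], [120000.0, 155000.0])
print(submission.columns.tolist())  # ['Id', 'SalePrice']
```

Returning the DataFrame instead of writing it directly makes the helper easy to check before calling `to_csv`.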

In [3]:
# Read the data
train_data = pd.read_csv('/kaggle/input/home-data-for-ml-course/train.csv',
                         index_col='Id')
X_test = pd.read_csv('/kaggle/input/home-data-for-ml-course/test.csv',
                     index_col='Id')
X = train_data.copy()

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice  

X.drop(['SalePrice', 'Utilities'], axis=1, inplace=True)
X_test.drop(['Utilities'], axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

sample_submission_file = pd.read_csv("/kaggle/input/home-data-for-ml-course/sample_submission.csv")

print('Data is OK')

Data is OK
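
The 80/20 split above can be illustrated on a toy dataset. This is a minimal sketch; the names `X_toy` and `y_toy` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing data: 10 rows, one feature, one target.
X_toy = pd.DataFrame({"LotArea": np.arange(10) * 1000})
y_toy = pd.Series(np.arange(10) * 10000, name="SalePrice")

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, test_size=0.2, random_state=0
)

print(X_tr.shape, X_va.shape)  # (8, 1) (2, 1)

# Features and target stay aligned by index after the split.
aligned = bool((X_tr.index == y_tr.index).all())
```

Fixing `random_state` makes the split reproducible, so the same rows end up in the validation set on every run.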


# Preprocessing  <a id='preprocessing'></a>

In [4]:
# Select object columns
categorical_cols = [cname for cname in X_train.columns if X_train[cname].dtype == "object"]

# Select numeric columns
numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64',
                                                                                 'float64']]

# Number of missing values in each column of training data
missing_val_count_by_column_train = (X_train.isnull().sum())
print("Number of missing values in each column:")
print(missing_val_count_by_column_train[missing_val_count_by_column_train > 0])

Number of missing values in each column:
LotFrontage      212
Alley           1097
MasVnrType         6
MasVnrArea         6
BsmtQual          28
BsmtCond          28
BsmtExposure      28
BsmtFinType1      28
BsmtFinType2      29
Electrical         1
FireplaceQu      551
GarageType        58
GarageYrBlt       58
GarageFinish      58
GarageQual        58
GarageCond        58
PoolQC          1164
Fence            954
MiscFeature     1119
dtype: int64
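
The counting pattern used above (`isnull().sum()` followed by a boolean filter) can be tried on a small frame. This is a minimal sketch with made-up values; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "Alley": [np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250],
})

# Count missing values per column, then keep only columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0]

print(missing.to_dict())  # {'LotFrontage': 1, 'Alley': 2}
```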


In [5]:
# Number of missing values in numerical columns
missing_val_count_by_column_numeric = (X_train[numerical_cols].isnull().sum())
print("Number of missing values in numerical columns:")
print(missing_val_count_by_column_numeric[missing_val_count_by_column_numeric > 0])

Number of missing values in numerical columns:
LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64


In [6]:
# Imputation lists

# Numerical columns whose missing values are imputed with a constant
constant_num_cols = ['GarageYrBlt', 'MasVnrArea']

# Numerical columns whose missing values are imputed with the column mean
mean_num_cols = list(set(numerical_cols).difference(set(constant_num_cols)))

# Categorical columns whose missing values are imputed with a constant
constant_categorical_cols = ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                             'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu','GarageType',
                             'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
                             'MiscFeature']

# Categorical columns whose missing values are imputed with the most frequent value
mf_categorical_cols = list(set(categorical_cols).difference(set(constant_categorical_cols)))

my_cols = constant_num_cols + mean_num_cols + constant_categorical_cols + mf_categorical_cols
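
The difference between the imputation strategies named in these lists can be seen on toy data. This is a minimal sketch with made-up values. The reading that a missing value in the "constant" columns means the feature is absent is an assumption, not something the notebook states.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

nums = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [100.0, np.nan, 0.0, 50.0],
})
cats = pd.DataFrame({
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA"],
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
}, dtype=object)

# 'mean': fill with the column average learned during fit.
lot = SimpleImputer(strategy="mean").fit_transform(nums[["LotFrontage"]]).ravel().tolist()
# 'constant': fill with a fixed value (0 for numbers, 'NA' for categories).
vnr = SimpleImputer(strategy="constant", fill_value=0).fit_transform(nums[["MasVnrArea"]]).ravel().tolist()
alley = SimpleImputer(strategy="constant", fill_value="NA").fit_transform(cats[["Alley"]]).ravel().tolist()
# 'most_frequent': fill with the most common value in the column.
elec = SimpleImputer(strategy="most_frequent").fit_transform(cats[["Electrical"]]).ravel().tolist()

print(lot)    # [60.0, 70.0, 80.0, 70.0]
print(vnr)    # [100.0, 0.0, 0.0, 50.0]
print(alley)  # ['Grvl', 'NA', 'NA', 'Pave']
print(elec)   # ['SBrkr', 'SBrkr', 'SBrkr', 'FuseA']
```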

In [7]:
# Define transformers
# Preprocessing for numerical data

numerical_transformer_m = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

numerical_transformer_c = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', StandardScaler())])


# Preprocessing for categorical data for most frequent
# (scikit-learn >= 1.2 renames OneHotEncoder's `sparse` argument to `sparse_output`)
categorical_transformer_mf = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore', sparse = False))
])

# Preprocessing for categorical data for constant
categorical_transformer_c = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='NA')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore', sparse = False))
])


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num_mean', numerical_transformer_m, mean_num_cols),
        ('num_constant', numerical_transformer_c, constant_num_cols),
        ('cat_mf', categorical_transformer_mf, mf_categorical_cols),
        ('cat_c', categorical_transformer_c, constant_categorical_cols)
    ])
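
To see how a `ColumnTransformer` like this one changes the number of columns, here is a minimal sketch on a three-column toy frame. The data values and the name `toy_preprocessor` are made up. The encoders are left at their default output format so the example does not depend on the `sparse` argument.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "Street": pd.Series(["Pave", "Pave", "Grvl", np.nan], dtype=object),
    "Alley": pd.Series([np.nan, "Grvl", np.nan, "Pave"], dtype=object),
})

num = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                ("scaler", StandardScaler())])
cat_mf = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                   ("onehot", OneHotEncoder(handle_unknown="ignore"))])
cat_c = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="NA")),
                  ("onehot", OneHotEncoder(handle_unknown="ignore"))])

toy_preprocessor = ColumnTransformer([
    ("num_mean", num, ["LotArea"]),
    ("cat_mf", cat_mf, ["Street"]),
    ("cat_c", cat_c, ["Alley"]),
])

out = toy_preprocessor.fit_transform(toy)
# 1 scaled numeric column + 2 Street categories + 3 Alley categories (incl. 'NA')
print(out.shape)  # (4, 6)
```

Each categorical column expands into one column per category, which is why the transformed housing data ends up with many more features than the raw data.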

In [8]:
# Define Model
model = XGBRegressor(learning_rate = 0.1,
                     n_estimators=500,
                     max_depth=5,
                     min_child_weight=1,
                     gamma=0,
                     subsample=0.8,
                     colsample_bytree=0.8,
                     reg_alpha = 0,
                     reg_lambda = 1,
                     random_state=0)

# Validation with early_stopping_rounds  <a id='validation'></a>

<div class="alert alert-block alert-danger">  
<p>If we want to use early_stopping_rounds with our pipeline, we cannot pass the validation set (X_valid) directly. Sklearn pipelines do not transform the eval_set that early_stopping_rounds relies on, so we need to preprocess the validation set ourselves before fitting.</p>
    
<p>There is a real danger here. If we forget to preprocess the validation set, and the processed data happens to have the same number of columns as the unprocessed data, we may not get an error. Validating against unprocessed data may then mislead us.</p>    
    
<p>To process the eval_set, we fit our preprocessor (the part of the pipeline that consists only of transformers, without a predictor) on the training data and then use it to transform X_valid.</p>
</div>
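
The validation data has to pass through transformers that were fitted on the training data. As a minimal sketch on toy data (the frames `train` and `valid` are made up), compare reusing a fitted encoder with refitting it on the validation set:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]}, dtype=object)
valid = pd.DataFrame({"Street": ["Pave", "Pave"]}, dtype=object)

# Correct: learn the categories from the training data, then reuse them.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
valid_ok = encoder.transform(valid)

# Incorrect: refitting on the validation data learns a different column layout.
valid_bad = OneHotEncoder(handle_unknown="ignore").fit_transform(valid)

print(valid_ok.shape)   # (2, 2)
print(valid_bad.shape)  # (2, 1)
```

Here the mismatch shows up as a different number of columns. With scalers or imputers the shapes would match and only the values would differ, which is harder to notice.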

In [9]:
# Preprocessing of validation data
X_valid_eval = preprocessor.fit(X_train, y_train).transform(X_valid)

In [10]:
# Display the number of remaining columns after transformation 
print("We have", X_valid_eval.shape[1], "features after transformation")

We have 296 features after transformation


In [11]:
# Define XGBRegressor fitting parameters for the pipeline
fit_params = {"model__early_stopping_rounds": 50,
              "model__eval_set": [(X_valid_eval, y_valid)],
              "model__verbose": False,
              "model__eval_metric" : "mae"}
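
The `model__` prefix in these keys tells the pipeline to forward each argument to the `fit` method of the step named `model`. Here is a minimal sketch of the same mechanism using scikit-learn only. `LinearRegression` and its `sample_weight` argument stand in for `XGBRegressor` and its early-stopping arguments, an assumption made so the example runs without xgboost.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 2.0, 10.0])    # last point is an outlier
weights = np.array([1.0, 1.0, 1.0, 0.0])   # weight 0 ignores the outlier

pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])

# 'model__sample_weight' is passed to LinearRegression.fit as sample_weight.
pipe.fit(X_toy, y_toy, model__sample_weight=weights)

# With the outlier ignored, the fitted line is y = x.
pred = float(pipe.predict(np.array([[3.0]]))[0])
print(round(pred, 6))  # 3.0
```

Only the keyword is routed to the step. The pipeline does not transform any data passed this way, which is why the eval_set must be preprocessed by hand.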

In [12]:
# Create and Evaluate the Pipeline
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train, **fit_params)

# Get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)

print("Score: {}".format(score))


Score: 16363.153547731165


# Cross-validation using full training set <a id='cross-validation'></a>

In [13]:
# Select the pipeline's input columns from the full training set and the test set
X_cv = X[my_cols].copy()
X_sub = X_test[my_cols].copy()

In [14]:
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X_cv, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE score:\n", scores)
print("MAE mean: {}".format(scores.mean()))
print("MAE std: {}".format(scores.std()))

MAE score:
 [14982.68008883 15842.70748609 16712.89875856 13384.43094499
 15631.34663955]
MAE mean: 15310.812783604453
MAE std: 1110.8660641167628
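
The sign flip is needed because sklearn scorers follow a "higher is better" convention, so error metrics are reported as negative numbers. This is a minimal sketch on synthetic data, with `LinearRegression` as a stand-in model (an assumption, chosen to keep the example light):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_toy = rng.rand(50, 2)
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=50)

# Raw scores are negative MAE values, one per fold.
raw = cross_val_score(LinearRegression(), X_toy, y_toy,
                      cv=5, scoring="neg_mean_absolute_error")

# Flip the sign to get ordinary (non-negative) MAE values.
mae = -raw

print("MAE mean:", mae.mean())
print("MAE std:", mae.std())
```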


# Prediction   <a id='prediction'></a>

In [15]:
# Fit the pipeline on the full training set (no early stopping here)
my_pipeline.fit(X_cv, y)

# Get predictions
preds = my_pipeline.predict(X_sub)

# Saving and submission   <a id='saving'></a>

In [16]:
# Use predefined utility function
save_file(preds)

Submission file is saved


# References   <a id='references'></a>
* [10 Simple Hacks to Speed Up Your Data Analysis - Parul Pandey](https://www.kaggle.com/parulpandey/10-simple-hacks-to-speed-up-your-data-analysis)
* [Dataset Transformations - Scikit-learn](https://scikit-learn.org/stable/data_transforms.html)
* [Intermediate Machine Learning Course - Pipelines](https://www.kaggle.com/alexisbcook/pipelines)
* [Kaggle Learn](https://www.kaggle.com/learn/overview)