In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import pandas_profiling

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

# all custom transformers live in a separate package and are documented with docstrings (see help(ClassName))
from custom_transformers import MyDropColumns, MyQualityEncoder, MyOtherOrdinalEncoder, MyBinaryEncoder, MySimpleImputer, MyLog1pTransformer

# 0. Business understanding

The goal is to predict the sale price of residential properties in Ames, Iowa. To do that, we are provided with a dataset that describes each property along 79 different dimensions. A short description of each variable is given in the document called "data_description.txt". After going through this document and trying to understand each variable and how it might affect the price (based on research on Google), I grouped all variables into the following 12 categories:

1. <u><b>location</b></u>: MSZoning (type of zoning implies different laws), Neighborhood (physical location: should strongly affect the price), Condition1 and Condition2 (close to parks is good, close to railroads is not)
2. <u><b>lot</b></u>: LotFrontage (wider frontage is usually better), LotArea (bigger area is usually better), LotShape (each has benefits and drawbacks), LandContour (flatness of the property: unlikely affecting the price), LandSlope (slope of the property: unlikely affecting the price), LotConfig (each has drawbacks and benefits)
3. <u><b>public infrastructure</b></u>: Street (type of road: paved is better than gravel), Alley (type of alley access: paved is better than gravel)
4. <u><b>building's general characteristics</b></u>: BldgType (might affect the price), HouseStyle (might affect the price), YearBuilt and YearRemodAdd (should affect the price), Foundation (type: should affect the price)
5. <u><b>building's quality and conditions (all should strongly affect the price)</b></u>: OverallQual, OverallCond, ExterQual and ExterCond (quality and conditions of exterior material)
6. <u><b>building's external characteristics (all might affect the price)</b></u>: RoofStyle and RoofMatl (type and material of roof), Exterior1st and Exterior2nd (exterior covering on house), MasVnrType and MasVnrArea (type and area of masonry veneer), WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch. 
7. <u><b>building interior characteristics (all should affect the price)</b></u>: 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd (number of rooms), Functional (home functionality), Fireplaces, FireplaceQu
8. <u><b>basement-related characteristics (if present: affect if useful)</b></u>: BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, BsmtHalfBath, Fireplace in basement
9. <u><b>utilities-related characteristics</b></u>: Utilities (generally more -> better), Heating and HeatingQC (type and quality of heating), CentralAir (presence is good), Electrical (system quality)
10. <u><b>additional characteristics</b></u>: PoolArea (should affect the price), PoolQC, Fence, MiscFeature and MiscVal (should strongly affect the price)
11. <u><b>garage-related characteristics</b></u>: GarageType, GarageYrBlt, GarageCars (size in car capacity), GarageArea (size in square feet), GarageQual, GarageCond, PavedDrive (paved driveway)
12. <u><b>sales-related info</b></u>: MoSold and YrSold (to determine the age: affects the price), SaleType (type of sale: might affect the price), SaleCondition (condition on sale: might affect the price).

Generally, in real estate there are two main ways to put a price on a property, and they are usually used in combination: cost-based and market-based. The cost-based approach implies that the price of a house should be higher than the sum of all costs related to it: the cost of land (hence lot characteristics should matter), the cost of the materials used to construct the building, the cost of additional features (pool, garage, etc.), and the cost of the building itself (hence the size, the number of rooms and bedrooms, and the usefulness of the basement matter). The market-based approach implies that the price of a house is driven by the market itself: hence location matters and sales-related information is important.

# 1. Data understanding (EDA)

In [2]:
original_train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
train = original_train.copy()

Generating the report is a slow operation; check the attached ready-made HTML5 report instead.

In [4]:
# initial_report = train.profile_report(style={'full_width':True}, title='House Prices: Advanced Regression Techniques (initial)')
# initial_report.to_file(output_file='house_prices_report_initial.html')

A few conclusions can be drawn from the report itself:

<u><b>Some features are meaningless</b></u>: Id, Utilities (all values but one are identical), MSSubClass (same info included in other variables), Street (all values but six are identical), Condition1 and Condition2 (most values are just "Norm")

<u><b>Features that might need log transformation (to fix skewness)</b></u>: 1stFlrSF, 2ndFlrSF (log1p), LotArea, LotFrontage, GrLivArea, BsmtUnfSF, OverallCond, SalePrice, TotalBsmtSF
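
The skewness shortlist above comes from reading the profiling report. A quick numeric cross-check is to compare sample skewness before and after `log1p`. The sketch below runs on synthetic data; the frame `demo` and its values are stand-ins, not the real training set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two numeric columns (NOT the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=1000),  # right-skewed
    "YearBuilt": rng.integers(1900, 2010, size=1000),          # roughly uniform
})

# Sample skewness per column, before and after the log1p transform.
skew_before = demo.skew()
skew_after = np.log1p(demo).skew()

skew_report = pd.DataFrame({"raw": skew_before, "log1p": skew_after})
print(skew_report.round(2))
```

A column whose skewness drops close to zero after the transform is a reasonable candidate for `cols_log`; a column that is already symmetric gains nothing from it.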

<u><b>Features that might need to be transformed into True/False, indicating presence, since most values are either NA or share a single value</b></u>: 2ndFlrSF, Alley, BldgType, MiscFeature, PoolArea, Fence, MasVnrType, MasVnrArea (drop this column, since after binarization it carries the same information as MasVnrType)

<u><b>Porch-related features to be analyzed</b></u>: 3SsnPorch, EnclosedPorch, OpenPorchSF, ScreenPorch, WoodDeckSF

<u><b>Categorical features with ordinal values to be encoded</b></u>: BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual, ExterCond, ExterQual, FireplaceQu, GarageCond, GarageFinish, GarageQual, HeatingQC, KitchenQual, LotShape, PavedDrive, PoolQC, Electrical, Functional, LandSlope, HouseStyle

<u><b>Categorical features without order (one-hot encoding or label encoding)</b></u>: Exterior1st, Exterior2nd, Foundation, GarageType, Heating, LandContour, LotConfig, MSZoning, Neighborhood, RoofMatl, RoofStyle, SaleCondition, SaleType

<u><b>Wrong data types</b></u>: CentralAir

# 2. Data preparation / cleaning 

In [5]:
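# 'number' selects every numeric dtype regardless of platform-specific integer width
cols_numerical = train.select_dtypes(include='number').columns
# string-valued columns are read by pandas as 'object'
cols_categorical = train.select_dtypes(include=['object']).columns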

### 2.1 outliers

### 2.1 meaningless features

In [6]:
cols_to_drop = ['MSSubClass', 'Id', 'Utilities', 'Street', 'MasVnrArea', 'Condition1', 'Condition2']
train = MyDropColumns(cols_to_drop).fit_transform(train)
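
The source of `MyDropColumns` lives in the separate package and is not shown here. A minimal sketch of how such a transformer could look, assuming it only drops the listed columns; the class name `DropColumnsSketch` is hypothetical and the real implementation may differ.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumnsSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for MyDropColumns: drops the listed columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; present for scikit-learn API compatibility.
        return self

    def transform(self, X):
        # errors="ignore" keeps the transformer usable on frames lacking a column.
        return X.drop(columns=self.columns, errors="ignore")


demo = pd.DataFrame({"Id": [1, 2], "Street": ["Pave", "Pave"], "LotArea": [8450, 9600]})
demo_out = DropColumnsSketch(["Id", "Street"]).fit_transform(demo)
print(list(demo_out.columns))
```

Because `transform` returns a new frame, the input is left untouched, which keeps the step safe to reuse on the test set.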

###  2.2 ordinal encoding with imputing missing values

Features related to quality and condition

In [7]:
cols_enc_quality = ['BsmtCond', 'BsmtQual', 'ExterCond', 'ExterQual', 'FireplaceQu', 
                'GarageCond', 'GarageQual', 'HeatingQC', 'KitchenQual', 'PoolQC']
train = MyQualityEncoder(cols_enc_quality).fit_transform(train)
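
The quality and condition columns share the Ex/Gd/TA/Fa/Po scale from "data_description.txt", with NA meaning the feature is absent. A minimal sketch of the kind of mapping `MyQualityEncoder` might apply; the integer codes, the zero for missing values, and the function name `encode_quality` are assumptions, not the package's actual code.

```python
import numpy as np
import pandas as pd

# Labels come from data_description.txt; the integer codes are an assumption.
QUALITY_SCALE = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}


def encode_quality(df, columns):
    """Return a copy with quality labels mapped to integers.

    Missing values (and any label outside the scale) become 0.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(QUALITY_SCALE).fillna(0).astype(int)
    return out


demo = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex"],
    "PoolQC": [np.nan, "Fa", np.nan],
})
encoded = encode_quality(demo, ["KitchenQual", "PoolQC"])
print(encoded)
```

Coding absence as 0 keeps the scale monotone, which matters later when quality columns are multiplied with area columns.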

Other features that can be ordinally encoded

In [9]:
cols_enc_ordinal = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'GarageFinish',
                'LotShape', 'PavedDrive', 'Electrical', 'Functional', 'HouseStyle']
train = MyOtherOrdinalEncoder(cols_enc_ordinal).fit_transform(train)

### 2.3 binary encoding with imputing missing values

In [10]:
cols_enc_binary = ['Alley', 'BldgType', 'MiscFeature', 'Fence', 'MasVnrType']
train = MyBinaryEncoder(cols_enc_binary).fit_transform(train)

###  2.4 missing values 

In [11]:
train.isnull().sum()[train.isnull().sum()>0]

LotFrontage    259
GarageType      81
GarageYrBlt     81
dtype: int64

Most likely these missing values indicate absence: no garage and zero lot frontage. Let's impute the values accordingly.

In [12]:
cols_impute_simple = ['LotFrontage', 'GarageType', 'GarageYrBlt']
train = MySimpleImputer(cols_impute_simple).fit_transform(train)
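
A minimal sketch of absence-style imputation as described above, assuming numeric columns are filled with 0 and categorical ones with the string "None". The function `impute_absence` and both fill values are hypothetical; `MySimpleImputer` may use different ones.

```python
import numpy as np
import pandas as pd


def impute_absence(df, columns, numeric_fill=0, categorical_fill="None"):
    """Fill missing values that are assumed to mean 'feature not present'."""
    out = df.copy()
    for col in columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(numeric_fill)
        else:
            out[col] = out[col].fillna(categorical_fill)
    return out


demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan],
    "GarageType": ["Attchd", np.nan],
    "GarageYrBlt": [2003.0, np.nan],
})
imputed = impute_absence(demo, list(demo.columns))
print(imputed)
```

One caveat with this choice: a year filled with 0 makes any later difference such as YrSold - GarageYrBlt meaningless for houses without a garage, so that feature would need separate handling.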

### 2.5 skewness

In [13]:
cols_log = ['1stFlrSF', 'LotArea', 'LotFrontage', 'GrLivArea', 'BsmtUnfSF', 'TotalBsmtSF']
train = MyLog1pTransformer(cols_log).fit_transform(train)
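
Judging by the `_log` columns that appear next to the originals later in the notebook, the transformer seems to add new columns rather than overwrite. A minimal sketch under that assumption; the class name `Log1pSketch` and the `suffix` parameter are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class Log1pSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: adds a log1p copy of each listed column."""

    def __init__(self, columns, suffix="_log"):
        self.columns = columns
        self.suffix = suffix

    def fit(self, X, y=None):
        # Stateless transform; nothing to fit.
        return self

    def transform(self, X):
        out = X.copy()
        for col in self.columns:
            # log1p(x) = log(1 + x), so zeros map to 0 instead of -inf.
            out[col + self.suffix] = np.log1p(out[col])
        return out


demo = pd.DataFrame({"LotArea": [0.0, 8450.0], "GrLivArea": [1710.0, 1262.0]})
logged = Log1pSketch(["LotArea"]).fit_transform(demo)
print(logged)
```

Using `log1p` instead of `log` matters here because columns such as LotFrontage and BsmtUnfSF can contain zeros.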

In [14]:
train.loc[:,'SalePrice_log'] = np.log1p(train.loc[:, 'SalePrice'])

# 3. Feature engineering (experimentation)

A few remarks:
1. Techniques: filtering, wrapper and embedded methods.
2. Evaluation: RMSLE
3. Test features using custom transformations
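
For reference, RMSLE is the root mean squared error computed on `log1p` of the prices. A minimal sketch; the helper name `rmsle` is mine and not part of the notebook's packages.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


prices = [200000.0, 150000.0]
guesses = [210000.0, 140000.0]
print(rmsle(prices, guesses))
```

Since `SalePrice_log` is already `log1p(SalePrice)`, a plain RMSE on that column gives the same number as RMSLE on the raw prices.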

## 3.1 feature creation

### Ideas for additional features

<u><b>Basement-related features</b></u>:
1. Basement exists: from any of Bsmt features
2. Basement SF value = BsmtFinType1 * BsmtFinSF1 + BsmtFinType2 * BsmtFinSF2 ---> <i>MyValueAddedFeature(basement=True)</i>
3. Basement Finished share = 1 - BsmtUnfSF / TotalBsmtSF (needs a fallback value when TotalBsmtSF = 0)
4. Basement Additional Value = Basement SF value * BsmtExposure ---> <i>MyValueAddedFeature(basement_adv=True)</i>
5. Basement evaluation mult = BsmtQual * BsmtCond ---> <i>MyQualityFeatures(basement_mult=True)</i>
6. Basement evaluation sum = BsmtQual + BsmtCond ---> <i>MyQualityFeatures(basement_sum=True)</i>
7. Basement size in comparison = TotalBsmtSF / GrLivArea

<u><b>Garage-related features</b></u>:
1. Garage value = GarageArea * GarageQual ---> <i>MyValueAddedFeature(garage=True)</i>
2. Garage evaluation mult = GarageQual * GarageCond ---> <i>MyQualityFeatures(garage_mult=True)</i>
3. Garage evaluation sum = GarageQual + GarageCond ---> <i>MyQualityFeatures(garage_sum=True)</i>

<u><b>Bedrooms/bathrooms/kitchen features</b></u>:
1. Total bathrooms = FullBath + 0.5 * HalfBath + BsmtFullBath + 0.5 * BsmtHalfBath
2. Bathrooms / bedrooms = Total bathrooms / BedroomAbvGr
3. Bedrooms share space = BedroomAbvGr / GrLivArea
4. Bedrooms share rooms = BedroomAbvGr / TotRmsAbvGrd
5. All rooms share space = TotRmsAbvGrd / GrLivArea
6. Kitchen value = KitchenAbvGr * KitchenQual ---> <i>MyValueAddedFeature(kitchen=True)</i>
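
A few of the room-related ideas above can be sketched directly in pandas. The output column names (`TotalBath`, `BathPerBedroom`, `BedroomShareRooms`) and the function `add_room_features` are hypothetical, and the zero-bedroom guard is my own assumption about how to treat that edge case.

```python
import numpy as np
import pandas as pd


def add_room_features(df):
    """Add bathroom and bedroom ratio features to a copy of df."""
    out = df.copy()
    out["TotalBath"] = (out["FullBath"] + 0.5 * out["HalfBath"]
                        + out["BsmtFullBath"] + 0.5 * out["BsmtHalfBath"])
    # Replace 0 bedrooms with NaN so the ratio is NaN rather than inf.
    bedrooms = out["BedroomAbvGr"].replace(0, np.nan)
    out["BathPerBedroom"] = out["TotalBath"] / bedrooms
    out["BedroomShareRooms"] = out["BedroomAbvGr"] / out["TotRmsAbvGrd"]
    return out


demo = pd.DataFrame({
    "FullBath": [2, 1], "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
    "BedroomAbvGr": [3, 0], "TotRmsAbvGrd": [8, 4],
})
rooms = add_room_features(demo)
print(rooms[["TotalBath", "BathPerBedroom", "BedroomShareRooms"]])
```

Ratios of this kind should be computed on the raw columns, not on their `_log` counterparts.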

<u><b>Time/date features</b></u>:
1. Seasonality = season(MoSold) ---> <i>MyTimeBasedFeatures(season=True)</i>
2. Time since construction (house) = YrSold - YearBuilt ---> <i>MyTimeBasedFeatures(since_house_built=True)</i>
3. Time since construction (garage) = YrSold - GarageYrBlt ---> <i>MyTimeBasedFeatures(since_garage_built=True)</i>
4. Time since renovation (house) = YrSold - YearRemodAdd ---> <i>MyTimeBasedFeatures(since_house_remod=True)</i>
5. Remodeled = True if YearRemodAdd is different than YearBuilt ---> <i>MyTimeBasedFeatures(isRemodeled=True)</i>
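
The time-based ideas above translate into a few column operations. This is a sketch only: the month-to-season mapping (northern-hemisphere meteorological seasons), the output column names, and the function `add_time_features` are assumptions, and `MyTimeBasedFeatures` may define them differently.

```python
import pandas as pd

# Assumed season definition: meteorological seasons, northern hemisphere.
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}


def add_time_features(df):
    """Add season of sale, ages at sale, and a remodeling flag to a copy of df."""
    out = df.copy()
    out["SeasonSold"] = out["MoSold"].map(SEASONS)
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["YearsSinceRemod"] = out["YrSold"] - out["YearRemodAdd"]
    out["IsRemodeled"] = out["YearRemodAdd"] != out["YearBuilt"]
    return out


demo = pd.DataFrame({
    "MoSold": [2, 7], "YrSold": [2008, 2010],
    "YearBuilt": [2003, 1976], "YearRemodAdd": [2003, 1990],
})
timed = add_time_features(demo)
print(timed[["SeasonSold", "HouseAge", "YearsSinceRemod", "IsRemodeled"]])
```

Garage age is left out of the sketch because GarageYrBlt is imputed for houses without a garage, so the difference needs its own treatment.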

<u><b>Space/area-related features</b></u>:
1. Total porch area = WoodDeckSF + OpenPorchSF + EnclosedPorch + 3SsnPorch + ScreenPorch
2. Free space left = (LotArea - TotalBsmtSF - GarageArea - PoolArea - Total porch area) / LotArea
3. House space share = TotalBsmtSF / LotArea

<u><b>Quality-related features</b></u>:
1. High Quality SF = 1 - LowQualFinSF / GrLivArea ---> <i>MyQualityFeatures(high_quality_sf=True)</i>
2. Overall evaluation mult = OverallQual * OverallCond ---> <i>MyQualityFeatures(overall_mult=True)</i>
3. Overall evaluation sum =  OverallQual + OverallCond ---> <i>MyQualityFeatures(overall_sum=True)</i>
4. External material evaluation mult = ExterQual * ExterCond ---> <i>MyQualityFeatures(external_mult=True)</i>
5. External material evaluation sum = ExterQual + ExterCond ---> <i>MyQualityFeatures(external_sum=True)</i>


<u><b>Luxury features</b></u>: 
1. Pool value = PoolArea * PoolQC ---> <i>MyValueAddedFeature(pool=True)</i>
2. Fireplace value = Fireplaces * FireplaceQu ---> <i>MyValueAddedFeature(fireplace=True)</i>


Binary-valued variables:
1. CentralAir
2. Fence
3. Alley
4. BldgType
5. MasVnrType
6. MiscFeature


One-hot variables
1. Electrical
2. Heating
3. MSZoning
4. Neighborhood -- MUST
5. LandContour
6. LotConfig
7. RoofStyle
8. RoofMatl
9. Exterior1st
10. Exterior2nd
11. Foundation
12. GarageType
13. LandSlope


####  Adding value-based features

In [None]:
train = MyValueAddedFeatures().fit_transform(train)

#### Adding quality/conditions related features

In [None]:
train = MyQualityFeatures().fit_transform(train)

In [40]:
train.columns.sort_values()

Index(['1stFlrSF', '1stFlrSF_log', '2ndFlrSF', '3SsnPorch', 'Alley',
       'BedroomAbvGr', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath',
       'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF', 'BsmtUnfSF_log', 'CentralAir',
       'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st',
       'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation',
       'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea',
       'GrLivArea_log', 'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle',
       'KitchenAbvGr', 'KitchenQual', 'LandContour', 'LandSlope', 'LotArea',
       'LotArea_log', 'LotConfig', 'LotFrontage', 'LotFrontage_log',
       'LotShape', 'LowQualFinSF', 'MSZoning', 'MasVnrType', 'MiscFeature',
       'MiscVal', 'MoSold', 'Neighborhood', 'OpenPorchSF', 'OverallCond',
       'OverallQual', 'PavedD

#### Adding 

## 3.2 feature selection
#### Recommended: Chi-Squared Independence Test + Information Gain

### a. filtering
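
As a starting point for the filter approach, features can be ranked by mutual information with the target, which is the regression counterpart of information gain. The chi-squared test is not shown because scikit-learn's `chi2` scorer is meant for class labels and non-negative features. The sketch uses synthetic data; the column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: one feature drives the target, the other is pure noise.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = 3.0 * demo["informative"] + rng.normal(scale=0.1, size=n)

# Estimated mutual information between each feature and the target.
scores = mutual_info_regression(demo, target, random_state=0)
mi = pd.Series(scores, index=demo.columns).sort_values(ascending=False)
print(mi)
```

On the real data the same ranking could be run against `SalePrice_log`, after all categorical columns have been encoded numerically.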

### b. wrapper

### c. embedded methods

# 4. Final pipeline