In [26]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

# 0. Business understanding

The goal is to predict the sale price of properties in Ames, Iowa. To do that, we are provided with a dataset describing each property along 79 dimensions. A short description of each variable is given in the document "data_description.txt". After going through this document and trying to understand each variable and how it might affect the price (based on research on Google), I clustered the variables into the following 12 categories:

1. <b>location</b>: MSZoning (type of zoning implies different laws), Neighborhood (physical location: should strongly affect the price), Condition1 and Condition2 (close to parks is good, close to railroads is not)
2. <b>lot</b>: LotFrontage (wider frontage is usually better), LotArea (bigger area is usually better), LotShape (each has benefits and drawbacks), LandContour (flatness of the property: unlikely to affect the price), LandSlope (slope of the property: unlikely to affect the price), LotConfig (each has drawbacks and benefits)
3. <b>public infrastructure</b>: Street (type of road: paved is better than gravel), Alley (type of alley access: paved is better than gravel)
4. <b>building's general characteristics</b>: BldgType (might affect the price), HouseStyle (might affect the price), YearBuilt and YearRemodAdd (should affect the price), Foundation (type: should affect the price)
5. <b>building's quality and conditions (all should strongly affect the price)</b>: OverallQual, OverallCond, ExterQual and ExterCond (quality and conditions of exterior material)
6. <b>building's external characteristics (all might affect the price)</b>: RoofStyle and RoofMatl (type and material of roof), Exterior1st and Exterior2nd (exterior covering on house), MasVnrType and MasVnrArea (type and area of masonry veneer), WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch. 
7. <b>building interior characteristics (all should affect the price)</b>: 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, FullBath, HalfBath, Bedroom, Kitchen, KitchenQual, TotRmsAbvGrd (number of rooms), Functional (home functionality), Fireplaces, FireplaceQu
8. <b>basement-related characteristics (if present: affect if useful)</b>: BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, BsmtHalfBath, Fireplace in basement
9. <b>utilities-related characteristics</b>: Utilities (generally more -> better), Heating and HeatingQC (type and quality of heating), CentralAir (presence is good), Electrical (system quality)
10. <b>additional characteristics</b>: PoolArea (should affect the price), PoolQC, Fence, MiscFeature and MiscVal (should strongly affect the price)
11. <b>garage-related characteristics</b>: GarageType, GarageYrBlt, GarageCars (size in car capacity), GarageArea (size in square feet), GarageQual, GarageCond, PavedDrive (paved driveway)
12. <b>sales-related info</b>: MoSold and YrSold (to determine the age: affects the price), SaleType (type of sale: might affect the price), SaleCondition (condition on sale: might affect the price)

Generally, there are two main ways to price real estate, and they are usually used in combination: cost-based and market-based. Cost-based pricing implies that the price of a house should be higher than the sum of all costs related to it: the cost of land (hence, lot characteristics should matter), the cost of materials used to construct the building, the cost of additional features (pool, garage, etc.), and the cost of the building itself (hence, the size, the number of rooms and bedrooms, and the usefulness of the basement matter). Market-based pricing implies that the price of a house is driven by the market itself: hence location matters and sales-related information is important.

# 1. Data understanding (EDA)

In [4]:
original_train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [98]:
train = original_train.copy()

In [8]:
report = train.profile_report(style={'full_width':True}, title='House Prices: Advanced Regression Techniques')

In [10]:
report.to_file(output_file='house_prices_report.html')

Based on the report:

Some features are meaningless: Id, Utilities (all values but one are the same), MSSubClass (the same information is contained in other variables), Street (all values but six are the same)

Features that might need log transformation (to fix skewness): 1stFlrSF, 2ndFlrSF (log1p), LotArea, LotFrontage, GrLivArea, BsmtUnfSF, MasVnrArea, OverallCond, SalePrice, TotalBsmtSF

Features that might need to be transformed into True/False, indicating presence, since most values are either NAs or a single value: 2ndFlrSF, Alley, BldgType, MiscFeature, PoolArea, WoodDeckSF, Fence

Porch-related features to be analyzed: 3SsnPorch, EnclosedPorch, OpenPorchSF, ScreenPorch, WoodDeckSF

Categorical features with ordinal values to be encoded: BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual, ExterCond, ExterQual, FireplaceQu, GarageCond, GarageFinish, GarageQual, HeatingQC, KitchenQual, LotShape, PavedDrive, PoolQC

Features that require additional analysis: Condition1, Condition2, Electrical, Exterior1st, Exterior2nd, Foundation, Functional, GarageType, Heating, HouseStyle, LandContour, LandSlope, LotConfig, MasVnrType, MSZoning, Neighborhood, RoofMatl, RoofStyle, SaleCondition, SaleType

Wrong data types: BsmtFullBath, BsmtHalfBath, CentralAir, Fireplaces, FullBath, HalfBath, KitchenAbvGr

# 2. Data preparation / cleaning 

Columns to work on: 1stFlrSF, 2ndFlrSF (log1p), LotArea, LotFrontage, GrLivArea, BsmtUnfSF, MasVnrArea, OverallCond, SalePrice, TotalBsmtSF, 2ndFlrSF, Alley, BldgType, MiscFeature, PoolArea, WoodDeckSF, 3SsnPorch, EnclosedPorch, OpenPorchSF, ScreenPorch, WoodDeckSF, Fence, LotShape, Condition1, Condition2, Electrical, Exterior1st, Exterior2nd, Foundation, Functional, GarageType, Heating, HouseStyle, LandContour, LandSlope, LotConfig, MasVnrType, MSZoning, Neighborhood, RoofMatl, RoofStyle, SaleCondition, SaleType, BsmtFullBath, BsmtHalfBath, CentralAir, Fireplaces, FullBath, HalfBath, KitchenAbvGr

### 2.1 meaningless features

In [99]:
class DropColumns(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)

        try:
            return X.drop(columns=self.columns)
        
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)

cols_to_drop = ['MSSubClass', 'Id', 'Utilities', 'Street']

train = DropColumns(cols_to_drop).fit_transform(train)

###  2.2 ordinal encoding with imputing missing values

Features related to quality and condition

In [100]:
class QualityEncoder(BaseEstimator, TransformerMixin):
    
    quality_measures = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1} 
    
    def __init__(self, columns):
        self.columns=columns
        
    def fit(self, X, y=None):
        return self
    
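    def transform(self, X):
        assert isinstance(X, pd.DataFrame)

        try:
            # work on a copy so the caller's DataFrame is not modified in place
            X = X.copy()

            # map the quality labels (Ex ... Po) to their ordinal scores
            X.loc[:, self.columns] = X.loc[:, self.columns].replace(self.quality_measures)

            # replace NAs with 0 since NAs indicate absence of the feature
            X.loc[:, self.columns] = X.loc[:, self.columns].fillna(0)

            return X

        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            # the %s placeholder is required; without it the formatting itself raises a TypeError
            raise KeyError('The DataFrame does not include the columns: %s' % cols_error)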
                           
cols_enc_quality = ['BsmtCond', 'BsmtQual', 'ExterCond', 'ExterQual', 'FireplaceQu', 
                'GarageCond', 'GarageQual', 'HeatingQC', 'KitchenQual', 'PoolQC']

train = QualityEncoder(cols_enc_quality).fit_transform(train)

Other features that can be ordinally encoded

In [101]:
class OtherOrdinalEncoder(BaseEstimator, TransformerMixin):
    order_encoder = {
        'BsmtExposure':  {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1},
        'BsmtFinType1': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1},
        'BsmtFinType2': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1},
        'GarageFinish': {'Fin': 3, 'RFn': 2, 'Unf': 1},
        'LotShape': {'Reg': 4, 'IR1': 3, 'IR2': 2, 'IR3': 1},
        'PavedDrive': {'Y': 1,'P': 0.5, 'N': 0}
    }

    def __init__(self, columns):
        self.columns = columns
    
    def fit(self, X,y=None):
        return self
    
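    def transform(self, X):

        assert isinstance(X, pd.DataFrame)

        try:
            # work on a copy so the caller's DataFrame is not modified in place
            X = X.copy()

            # each column has its own label-to-rank mapping
            for col in self.columns:
                X.loc[:, col] = X.loc[:, col].replace(self.order_encoder[col])

            # replace NAs with 0 since NAs indicate absence of the feature
            X.loc[:, self.columns] = X.loc[:, self.columns].fillna(0)

            return X

        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            # the %s placeholder is required; without it the formatting itself raises a TypeError
            raise KeyError('The DataFrame does not include the columns: %s' % cols_error)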

        

cols_ordinal = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'GarageFinish', 'LotShape', 'PavedDrive']

train = OtherOrdinalEncoder(cols_ordinal).fit_transform(train)


### 2.3 binary encoding with imputing missing values

In [145]:
class CustomBinaryEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self
    
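    def transform(self, X):

        assert isinstance(X, pd.DataFrame)

        try:
            # work on a copy so the caller's DataFrame is not modified in place
            X = X.copy()

            for col in self.columns:
                # idxmax returns the most frequent label (NaN included);
                # argmax returns a position, not a label, in recent pandas versions
                most_frequent_value = X[col].value_counts(dropna=False).idxmax()
                if pd.isna(most_frequent_value):
                    # mostly missing: 1 if the feature is present, 0 otherwise
                    X[col] = X[col].notnull().astype(int)
                else:
                    # 1 if the value equals the dominant category, 0 otherwise
                    X[col] = (X[col] == most_frequent_value).astype(int)

            return X

        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            # the %s placeholder is required; without it the formatting itself raises a TypeError
            raise KeyError('The DataFrame does not include the columns: %s' % cols_error)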
            

cols_binary = ['Alley', 'BldgType', 'MiscFeature', 'Fence']

train = CustomBinaryEncoder(cols_binary).fit_transform(train)

### 2.4 missing values

In [146]:
train.isnull().sum()[train.isnull().sum()>0].sort_values(ascending=False)

LotFrontage    259
GarageYrBlt     81
GarageType      81
MasVnrArea       8
MasVnrType       8
Electrical       1
dtype: int64
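
The remaining gaps could be filled along the following lines. This is only a sketch on toy data: the helper name `impute_missing`, the `'NoGarage'` and `'None'` placeholder labels, and every imputation rule (neighborhood median for LotFrontage, "missing means absent" for the garage and veneer columns, the mode for Electrical) are assumptions, not choices the notebook has made.

```python
import numpy as np
import pandas as pd

def impute_missing(df):
    """Hypothetical helper: fill the remaining gaps under the stated assumptions."""
    df = df.copy()
    # Assumption: houses in the same neighborhood have similar frontage.
    df['LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'].transform(
        lambda s: s.fillna(s.median()))
    # Assumption: a missing garage entry means the house has no garage.
    df['GarageType'] = df['GarageType'].fillna('NoGarage')
    df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)
    # Assumption: a missing veneer entry means there is no masonry veneer.
    df['MasVnrType'] = df['MasVnrType'].fillna('None')
    df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
    # A single missing value: fall back to the most frequent category.
    df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])
    return df

# Toy rows standing in for the real training data.
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'LotFrontage': [60.0, np.nan, 80.0, 50.0],
    'GarageType': ['Attchd', None, 'Detchd', 'Attchd'],
    'GarageYrBlt': [2000.0, np.nan, 1990.0, 1980.0],
    'MasVnrType': ['BrkFace', None, 'Stone', 'BrkFace'],
    'MasVnrArea': [100.0, np.nan, 50.0, 20.0],
    'Electrical': ['SBrkr', 'SBrkr', None, 'FuseA'],
})
filled = impute_missing(toy)
```

Filling GarageYrBlt with 0 is crude; using YearBuilt instead, or dropping the column in favor of an indicator, may work better and is worth testing.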


### 2.5 outliers

### 2.6 skewness
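
One common way to flag outliers is the interquartile-range rule. The sketch below is an assumption about how this step might look, not the notebook's method: the helper `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy GrLivArea values are all illustrative.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Hypothetical helper: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy living-area values with one very large house.
toy = pd.DataFrame({'GrLivArea': [1000, 1200, 1500, 1700, 1800, 5600]})
mask = iqr_outlier_mask(toy['GrLivArea'])

# Drop the flagged rows; on real data, inspect them before removing anything.
cleaned = toy[~mask]
```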
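
The log transformation noted in the EDA can be sketched as follows. The helper `log_transform_skewed` and the skewness threshold of 0.75 are assumptions for illustration, and the data is synthetic; `np.log1p` is used so that zero values (for example in 2ndFlrSF) stay defined.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, columns, threshold=0.75):
    """Hypothetical helper: apply log1p to columns whose |skewness| exceeds threshold."""
    df = df.copy()
    transformed = []
    for col in columns:
        if abs(df[col].skew()) > threshold:
            df[col] = np.log1p(df[col])
            transformed.append(col)
    return df, transformed

# Synthetic data: one right-skewed column and one symmetric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'LotArea': rng.lognormal(mean=9, sigma=1, size=500),
    'Symmetric': np.arange(500, dtype=float),
})
out, transformed = log_transform_skewed(toy, ['LotArea', 'Symmetric'])
```

If SalePrice is transformed the same way, predictions need `np.expm1` applied before they are reported in dollars.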

# 3. Feature engineering (experimentation)

A few remarks:
1. Techniques: filtering, wrapper and embedded methods.
2. Evaluation: RMSLE
3. Test features using custom transformations
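
For reference, the evaluation metric in remark 2 can be written as a small function. This is a minimal sketch of the usual RMSLE definition based on `log1p`; the function name `rmsle` is illustrative.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log1p so zeros are allowed."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Identical predictions give an error of zero.
perfect = rmsle([100000, 200000], [100000, 200000])

# Relative errors matter, not absolute ones: both predictions are 10% too high.
cheap = rmsle([100000], [110000])
expensive = rmsle([500000], [550000])
```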

## 3.1 feature creation

Ideas: 
1. Bathrooms / bedroom: counted total incl. basement
2. Total SF: 1stFlrSF + 2ndFlrSF + 
3. Free lot space: LotArea - 1stFlrSF - Garage - PoolArea - etc.
4. Basement-related: Basement exists, Basement value = BsmtFinType1 * BsmtFinSF1 + BsmtFinType2 * BsmtFinSF2 - Unfinished part, Unfinished SF /Total Bsmt SF
5. Fireplace value: Fireplaces * FireplaceQu
6. Kitchen value: KitchenAbvGr * KitchenQual
7. High Quality Floor SF: 1stFlrSF + 2ndFlrSF - LowQualFinSF
8. Pool value: PoolArea * PoolQC
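
A few of these ideas can be sketched as below. It assumes the quality columns are already ordinally encoded (0 meaning the feature is absent). The helper `add_features`, the new column names, and the 0.5 weight for half baths are hypothetical.

```python
import pandas as pd

def add_features(df):
    """Hypothetical helper implementing ideas 1, 5, 7 and 8 from the list above."""
    df = df.copy()
    # Idea 1: total bathrooms incl. basement (assumption: a half bath counts as 0.5).
    df['TotalBath'] = (df['FullBath'] + df['BsmtFullBath']
                       + 0.5 * (df['HalfBath'] + df['BsmtHalfBath']))
    # Idea 5: fireplace count weighted by encoded quality.
    df['FireplaceValue'] = df['Fireplaces'] * df['FireplaceQu']
    # Idea 7: floor area excluding low-quality finished area.
    df['HighQualSF'] = df['1stFlrSF'] + df['2ndFlrSF'] - df['LowQualFinSF']
    # Idea 8: pool area weighted by encoded quality.
    df['PoolValue'] = df['PoolArea'] * df['PoolQC']
    return df

# Two toy houses with already-encoded quality columns.
toy = pd.DataFrame({
    'FullBath': [2, 1], 'HalfBath': [1, 0],
    'BsmtFullBath': [1, 0], 'BsmtHalfBath': [0, 0],
    'Fireplaces': [1, 0], 'FireplaceQu': [4, 0],
    '1stFlrSF': [1000, 900], '2ndFlrSF': [800, 0], 'LowQualFinSF': [0, 100],
    'PoolArea': [0, 500], 'PoolQC': [0, 5],
})
features = add_features(toy)
```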

## 3.2 feature selection
#### Recommended: Chi-Squared Independence Test + Information Gain

### a. filtering
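
A filter method scores each feature against the target independently of any model. The sketch below uses scikit-learn's `mutual_info_regression` as an information-gain style score for a continuous target. The synthetic data and the 0.1 cut-off are assumptions, and the chi-squared test mentioned above is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: one feature drives the target, the other is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'informative': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y = 3 * X['informative'] + rng.normal(scale=0.1, size=300)

# Estimate the mutual information between each feature and the target.
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
mi = mi.sort_values(ascending=False)

# Keep features above an arbitrary, illustrative threshold.
selected = mi[mi > 0.1].index.tolist()
```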

### b. wrapper
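
A wrapper method repeatedly fits a model and removes the weakest features. One possible sketch uses scikit-learn's `RFE` (recursive feature elimination) around a linear regression. The synthetic data, the choice of estimator, and `n_features_to_select=2` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only columns 'a' and 'b' influence the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y = 5 * X['a'] - 4 * X['b'] + rng.normal(scale=0.1, size=200)

# Drop one feature per round until two remain.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
kept = X.columns[selector.support_].tolist()
```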

### c. embedded methods
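
Embedded methods select features as a side effect of model training. A minimal sketch with L1-regularized regression (`Lasso`), which shrinks the coefficients of uninformative features toward zero. The synthetic data and `alpha=0.1` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only columns 'a' and 'b' influence the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y = 5 * X['a'] - 4 * X['b'] + rng.normal(scale=0.1, size=200)

# Fit the penalized model and read the selection off the coefficients.
model = Lasso(alpha=0.1).fit(X, y)
coefs = pd.Series(model.coef_, index=X.columns)
kept = coefs[coefs != 0].index.tolist()
```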

# 4. Final pipeline
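
As a rough sketch of how the custom transformers could be chained with a model in a single scikit-learn `Pipeline`: the block redefines a simplified `DropColumns` so that it runs on its own. The `Ridge` model, the scaling step, the log-transformed target, and the toy data are assumptions, not the notebook's final design.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class DropColumns(BaseEstimator, TransformerMixin):
    """Simplified stand-in for the DropColumns transformer defined earlier."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)

# Each step receives the output of the previous one.
pipeline = Pipeline([
    ('drop', DropColumns(['Id'])),
    ('scale', StandardScaler()),
    ('model', Ridge(alpha=1.0)),  # assumed model, chosen for illustration only
])

toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'GrLivArea': [1000, 1500, 2000, 2500],
    'OverallQual': [5, 6, 7, 8],
})
# Train on the log of the price, consistent with an RMSLE-style evaluation.
y = np.log1p([100000, 150000, 200000, 250000])

pipeline.fit(toy, y)
predicted_price = np.expm1(pipeline.predict(toy))
```

Keeping every preprocessing step inside the pipeline means the same transformations are applied to the test set without repeating the cell-by-cell calls used on `train`.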