# Data preparation

This notebook is based on a Kaggle <a href="https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard">kernel</a>.

Highlights:
- Convert some numerical features to categorical ones
- Represent year and month sold as categories
- New feature TotalSF: the sum of the basement, first-floor and second-floor areas
- Box-Cox transformation of skewed features. <a href="http://onlinestatbook.com/2/transformations/box-cox.html">Link to reference</a>

In [1]:
# import the necessary libraries
import sys
sys.path.append("../../dstoolkit/")

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns

color = sns.color_palette()
sns.set_style('darkgrid')

import warnings

def ignore_warn(*args, **kwargs):
    pass

warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)


from scipy import stats
from scipy.stats import norm, skew #for some statistics


pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points

from eda import skewness_of_numeric_features
from preparation import label_encoding

In [16]:
train = pd.read_csv("../data/cleaned_train.csv")
test = pd.read_csv("../data/cleaned_test.csv")
ntrain = train.shape[0]

In [3]:
all_data = pd.concat([train.drop('SalePrice', axis=1), test])
all_data.reset_index(drop=True, inplace=True)

## Transforming types of variables

In [4]:
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)


#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)


#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

## Label Encoding

Encode categorical features whose values have a natural ordering.

Some features describe the quality, condition, or a similar graded property of part of the house.

For example, **OverallQual** has 10 levels of house quality, from 1 ('Worse') to 10 ('Super quality').

In [5]:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')

all_data = label_encoding(all_data, features=cols)

# Shape of the combined data
print('Shape all_data: {}'.format(all_data.shape))

Shape all_data: (2917, 78)
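
The `label_encoding` helper comes from a local module whose source is not shown here. The following is a minimal sketch of what such a helper might do, assuming it assigns integer codes in sorted order of the unique values (the behaviour of scikit-learn's `LabelEncoder`, which the original Kaggle kernel uses). The function name `label_encoding_sketch` and the toy values are hypothetical.

```python
import pandas as pd


def label_encoding_sketch(df, features):
    """Hypothetical stand-in for preparation.label_encoding (source not shown)."""
    df = df.copy()
    for col in features:
        # sort=True assigns codes by sorted order of the unique string values.
        codes, _ = pd.factorize(df[col].astype(str), sort=True)
        df[col] = codes
    return df


# Toy values for illustration only.
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "TA"],
    "MoSold": ["2", "10", "1", "2"],
})
encoded = label_encoding_sketch(toy, features=("ExterQual", "MoSold"))
print(encoded)
```

Under this assumption the codes follow alphabetical order, not the semantic one: `'Ex'` sorts before `'Gd'` and `'TA'`, and the string `'10'` sorts before `'2'`. If the true ordering matters for a model, an explicit mapping per feature may be a better fit.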


## Feature engineering

Create a TotalSF feature: the sum of the basement, first-floor and second-floor square footage.

In [6]:
# Adding total sqfootage feature 
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
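
A small illustration of the sum on made-up values (not taken from the dataset). It also shows one thing worth checking on real data: element-wise `+` propagates missing values, whereas `DataFrame.sum` skips them by default. The notebook reads already cleaned files, so this is presumably not an issue here.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; the second row has a missing basement area.
toy = pd.DataFrame({
    "TotalBsmtSF": [856.0, np.nan],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
})
area_cols = ["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]

# Element-wise '+' yields NaN as soon as one component is missing.
total_plus = toy["TotalBsmtSF"] + toy["1stFlrSF"] + toy["2ndFlrSF"]

# DataFrame.sum uses skipna=True by default, so NaN counts as 0.
total_sum = toy[area_cols].sum(axis=1)

print(total_plus.tolist())
print(total_sum.tolist())
```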

## Transform skewed features

In [14]:
skewed_features = skewness_of_numeric_features(all_data)
skewed_features.head()

| | Feature | Skew |
|---|---|---|
| PoolArea | PoolArea | 15.76 |
| 3SsnPorch | 3SsnPorch | 8.922 |
| LowQualFinSF | LowQualFinSF | 8.741 |
| MiscVal | MiscVal | 5.595 |
| LandSlope | LandSlope | 4.53 |
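
`skewness_of_numeric_features` is imported from a local `eda` module that is not shown. The sketch below is one plausible implementation, assuming it computes `scipy.stats.skew` for every numeric column and sorts the result in descending order, which matches the shape of the output above. The name `skewness_sketch` and the toy columns are hypothetical.

```python
import pandas as pd
from scipy.stats import skew


def skewness_sketch(df):
    """Hypothetical stand-in for eda.skewness_of_numeric_features."""
    numeric = df.select_dtypes(include="number")
    # Sample skewness per column, ignoring missing values.
    skews = numeric.apply(lambda col: skew(col.dropna()))
    out = pd.DataFrame(
        {"Feature": skews.index, "Skew": skews.values}, index=skews.index
    )
    return out.sort_values("Skew", ascending=False)


# Toy data: one symmetric column, one right-skewed column, one non-numeric column.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [0, 0, 0, 0, 100],
    "label": list("abcde"),
})
result = skewness_sketch(toy)
print(result)
```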


In [10]:
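from scipy.special import boxcox1p
# Keep only features whose absolute skewness exceeds 0.75.
skewed_features = skewed_features[skewed_features.Skew.abs() > 0.75].Feature
lam = 0.15  # fixed Box-Cox lambda applied to every skewed feature
for feat in skewed_features:
    # boxcox1p transforms 1 + x, so no manual "+ 1" shift is needed.
    all_data[feat] = boxcox1p(all_data[feat], lam)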
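
For reference, `boxcox1p(x, lam)` evaluates `((1 + x)**lam - 1) / lam` for a non-zero `lam` and `log(1 + x)` when `lam` is 0. The short check below confirms this numerically on illustrative inputs. The input values are made up; only `lam = 0.15` comes from the notebook.

```python
import numpy as np
from scipy.special import boxcox1p

lam = 0.15
x = np.array([0.0, 1.0, 10.0, 1000.0])  # illustrative inputs

# Closed form of the Box-Cox transform applied to 1 + x.
manual = ((1.0 + x) ** lam - 1.0) / lam
transformed = boxcox1p(x, lam)

# With lambda = 0 the transform reduces to log(1 + x).
log_case = boxcox1p(x, 0.0)

print(np.allclose(transformed, manual))
print(np.allclose(log_case, np.log1p(x)))
```

Because the transform is applied to `1 + x`, zero-valued entries (common in area features such as `PoolArea`) map to 0 and do not cause errors.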

## Get dummies

In [15]:
all_data = pd.get_dummies(all_data)
print(all_data.shape)

(2917, 220)
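
A minimal sketch of what `pd.get_dummies` does, on a made-up frame: numeric columns pass through unchanged, while each remaining string column is replaced by one indicator column per distinct value. Encoding the concatenated train and test data, as done above, means both parts end up with the same set of columns.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RM", "RL"],
})

dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
```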


In [17]:
train = all_data[:ntrain]
test = all_data[ntrain:]
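
The positional split works because the training rows were concatenated first and the index was reset afterwards. A small round-trip check on made-up frames, using `iloc` to make the positional intent explicit:

```python
import pandas as pd

# Toy frames for illustration only.
train_toy = pd.DataFrame({"x": [1, 2, 3]})
test_toy = pd.DataFrame({"x": [4, 5]})
n_train = train_toy.shape[0]

# Same pattern as above: concatenate, then reset the index.
combined = pd.concat([train_toy, test_toy]).reset_index(drop=True)

# The first n_train rows are the training part, the rest is the test part.
train_back = combined.iloc[:n_train]
test_back = combined.iloc[n_train:]

print(train_back["x"].tolist(), test_back["x"].tolist())
```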

## Save prepared features

In [18]:
train.to_csv("../data/prepared_train.csv", index=False)
test.to_csv("../data/prepared_test.csv", index=False)