## Feature Preparation and Engineering for House Pricing

In the previous exercise we explored the well-known Ames Housing dataset compiled by Dean De Cock. We saw that the dataset includes numerical and categorical features, some of which contain null values. Hence, the dataset is not yet ready to be fed to a Machine Learning model. Preparing it is the task of this exercise.

As always, we start by **importing** the standard libraries **pandas** as pd and **numpy** as np, **loading** the dataset as a dataframe called **houses** and looking at the metadata of the dataset by using the **info()** method.

In [2]:
# solution: import pandas as pd and numpy as np
import pandas as pd
import numpy as np

Please import the dataframe **houses** from the path **'../data/houses.csv'**.

In [3]:
# import data
houses = pd.read_csv('../data/houses.csv')

In [4]:
# info
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

Next, drop the column **'Id'**, which is irrelevant for the regression task since it does not help to explain the target 'SalePrice'.

To do so, please call **drop('Id')** on the **houses** dataframe. Set the argument **errors** to **'ignore'** and **inplace** to **True**, and do not forget the **correct axis** argument (we drop a column, not a row). Afterwards, check that the column has been dropped.

In [5]:
# solution
# drop the irrelevant column Id by using drop('Id'); axis=1 refers to columns
houses.drop('Id', axis=1, errors='ignore', inplace=True)

In [6]:
# solution (check)
houses.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
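
The effect of **errors='ignore'** can be illustrated on a small stand-in frame. This is only a sketch: the two-row frame `toy` is an assumption made for illustration and is not the real dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the houses dataframe
toy = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600]})

# First call removes the column Id (axis=1 means "columns")
toy.drop("Id", axis=1, errors="ignore", inplace=True)

# Second call finds no column Id; errors='ignore' turns this into a no-op
# instead of raising a KeyError, so the cell can be re-run safely
toy.drop("Id", axis=1, errors="ignore", inplace=True)

print(toy.columns.tolist())
```

Without errors='ignore', re-running the cell would fail because the column no longer exists.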


You may have already noticed that many columns contain NaN values. However, the Machine Learning algorithms we will use cannot handle NaN values, so we have to remove or fill them.


First, let us **investigate the ratio of null values** in each column. Count the null/NaN values in each column by using the **isnull()** and **sum()** methods. In order to get the ratio you need to **divide by the number of rows**. Please name the resulting pandas series **null_series** and print it.

In [7]:
# solution
# compute the ratio of null values in each column
null_series = houses.isnull().sum() / len(houses)
null_series

MSSubClass       0.000000
MSZoning         0.000000
LotFrontage      0.177397
LotArea          0.000000
Street           0.000000
                   ...   
MoSold           0.000000
YrSold           0.000000
SaleType         0.000000
SaleCondition    0.000000
SalePrice        0.000000
Length: 80, dtype: float64
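
As a small cross-check of the computation above, the following sketch applies the same expression to a toy frame (the frame `toy` and its column names are assumptions for illustration). It also shows that taking the mean of the boolean mask gives the same ratios, because True counts as 1 and False as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],   # 1 of 4 missing
    "b": ["x", None, None, "y"],    # 2 of 4 missing
    "c": [1, 2, 3, 4],              # nothing missing
})

# Same approach as in the exercise: count nulls, divide by number of rows
ratio_sum = toy.isnull().sum() / len(toy)

# Equivalent shortcut: mean of the boolean mask
ratio_mean = toy.isnull().mean()

print(ratio_sum)
```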

Sort the series in descending order by using the **sort_values()** method on the **null_series**.

In [8]:
# solution
null_series.sort_values(ascending=False)

PoolQC           0.995205
MiscFeature      0.963014
Alley            0.937671
Fence            0.807534
FireplaceQu      0.472603
                   ...   
CentralAir       0.000000
SaleCondition    0.000000
Heating          0.000000
TotalBsmtSF      0.000000
MSSubClass       0.000000
Length: 80, dtype: float64

Some of the columns contain a lot of null values (more than 15 %), and it is better to drop these than to fill the nulls with other values. For this reason create a list called **dropCols**: **filter** the pandas series (conditional indexing) so that it only contains the entries where the **ratio > 0.15**, access the attribute **index** of the resulting series and call the method **tolist()** on that result. Print the list **dropCols**.

In [9]:
# drop columns where ratio > 0.15
dropCols = null_series[null_series > 0.15].index.tolist()
dropCols

['LotFrontage', 'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
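
The chain "filter, then index, then tolist()" can be followed step by step on a toy series. This is a sketch only; the series `ratios`, its labels and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical null ratios for three columns
ratios = pd.Series({"col_sparse": 0.90, "col_some": 0.20, "col_full": 0.00})

threshold = 0.15

# Step 1: boolean mask, True where the ratio exceeds the threshold
mask = ratios > threshold

# Step 2: conditional indexing keeps only the entries where the mask is True
filtered = ratios[mask]

# Step 3: the index holds the column names; tolist() gives a plain Python list
drop_cols = filtered.index.tolist()

print(drop_cols)
```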

Next, **drop** all the columns contained in the list **dropCols** from the houses dataframe by using the **drop()** method. Please set the correct **axis** again, **errors** to **'ignore'** and **inplace** to **True**. Afterwards, check that the columns have been dropped.

In [10]:
# solution
houses.drop(dropCols, axis=1, errors='ignore', inplace=True)
houses.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,...,0,0,0,0,0,12,2008,WD,Normal,250000


Categorical and numerical columns will be treated differently. For this reason we need to split the dataframe into one containing only the numerical attributes and another one containing only the categorical ones.

To solve this task you can use the method **select_dtypes()** on the **houses** dataframe. Please name the resulting dataframe that contains only object data types **houses_cat** and the other one **houses_num**.

In [11]:
# split data in categorical and numerical
houses_cat = houses.select_dtypes(include=[object])
houses_num = houses.select_dtypes(exclude=[object])
#.fillna('NaN')
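
The split by data type can be tried out on a toy frame. The sketch below is a variation, not the exercise solution: it selects by the generic selector "number" instead of the object dtype, under the assumption that every non-numerical column should count as categorical. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two numerical and one string column
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "MasVnrArea": [196.0, 0.0],
    "Street": ["Pave", "Pave"],
})

# "number" matches all integer and float columns
toy_num = toy.select_dtypes(include="number")

# everything that is not numerical is treated as categorical here
toy_cat = toy.select_dtypes(exclude="number")

print(toy_num.columns.tolist(), toy_cat.columns.tolist())
```

Either way, each column should end up in exactly one of the two dataframes.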

### Treatment of Categorical Features
The categorical features need to be **transformed into numerical ones** in order to work with Sklearn Machine Learning models.
Therefore, we will use **one-hot encoding** in its dummy-encoding variant. This results in (k-1) new columns for an attribute with k different categories, because one category is dropped and serves as the reference level (plain one-hot encoding would create k columns). Hence, we introduce a lot of **new dimensions** to the dataframe. For this reason it is useful to first **check the cardinality** of the categorical features.

Please use the method **nunique()** on the **houses_cat** dataframe and sort the result in descending order by using the method **sort_values()** with **ascending=False**.

In [12]:
# solution
# compute the cardinality of the categorical features
houses_cat.nunique().sort_values(ascending=False)

Neighborhood     25
Exterior2nd      16
Exterior1st      15
SaleType          9
Condition1        9
RoofMatl          8
HouseStyle        8
Condition2        8
Functional        7
Foundation        6
RoofStyle         6
GarageType        6
Heating           6
BsmtFinType2      6
BsmtFinType1      6
SaleCondition     6
BldgType          5
LotConfig         5
ExterCond         5
MSZoning          5
GarageCond        5
HeatingQC         5
GarageQual        5
Electrical        5
ExterQual         4
LotShape          4
LandContour       4
BsmtQual          4
KitchenQual       4
BsmtExposure      4
BsmtCond          4
MasVnrType        4
LandSlope         3
PavedDrive        3
GarageFinish      3
CentralAir        2
Utilities         2
Street            2
dtype: int64

The feature with the highest cardinality is **Neighborhood**, which contains **25** different categories. This is not too many, so we do not have to drop or modify any of these columns.

As the next step we fill the null values of these features. Here, we use a very simple approach where we fill each NaN value with the string 'Unknown'. Therefore, use the method **fillna('Unknown')** on the dataframe **houses_cat** and assign the result again to the name **houses_cat**. Afterwards, use the method **info()** on the dataframe.

In [13]:
# solution Bonus
#for element 
#houses_cat.apply(lambda x: x.value_counts(), axis=1)

In [14]:
# solution
houses_cat = houses_cat.fillna('Unknown')
houses_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1460 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1460 non-null object
BsmtCond         1460 non-null object
BsmtExposure     1460 non-null object
BsmtFinType1     14

**Side remark**:
The column type of the categorical features is object, which is the generic data type pandas uses here to store strings. We can transform it to an actual category type (similar to the factor type in R). This data type occupies less memory.

Please **execute the cell** below and **compare the memory** usage.

In [15]:
# transform to categorical values
houses_cat = houses_cat.apply(lambda x: x.astype('category'))
houses_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
MSZoning         1460 non-null category
Street           1460 non-null category
LotShape         1460 non-null category
LandContour      1460 non-null category
Utilities        1460 non-null category
LotConfig        1460 non-null category
LandSlope        1460 non-null category
Neighborhood     1460 non-null category
Condition1       1460 non-null category
Condition2       1460 non-null category
BldgType         1460 non-null category
HouseStyle       1460 non-null category
RoofStyle        1460 non-null category
RoofMatl         1460 non-null category
Exterior1st      1460 non-null category
Exterior2nd      1460 non-null category
MasVnrType       1460 non-null category
ExterQual        1460 non-null category
ExterCond        1460 non-null category
Foundation       1460 non-null category
BsmtQual         1460 non-null category
BsmtCond         1460 non-null category
BsmtExposure 
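
The memory effect can also be measured directly. The following sketch compares both representations on a synthetic series (the series `s_obj` with 10,000 repeated strings is an assumption for illustration); `memory_usage(deep=True)` also counts the memory of the Python string objects themselves.

```python
import pandas as pd

# Hypothetical column with only two distinct values, repeated many times
s_obj = pd.Series(["Pave", "Grvl"] * 5000, dtype=object)

# The category type stores each distinct string once plus small integer codes
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)

print(f"object: {obj_bytes} bytes, category: {cat_bytes} bytes")
```

The saving is largest for columns with few distinct values and many rows.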

The final preparation step for the categorical data is one-hot encoding. An easy way to do that for all columns at once is by using the **get_dummies()** function of pandas. Please use that function and set the argument **data** to the dataframe **houses_cat**, **prefix_sep** to the **equal sign**, and **drop_first** to **True** so that each attribute yields (k-1) columns. Call the resulting dataframe **houses_cat_dum**. Finally, check the number of columns contained in the new dataframe.

In [16]:
# solution
houses_cat_dum = pd.get_dummies(houses_cat, prefix_sep='=', drop_first=True)
houses_cat_dum.info()
houses_cat_dum.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Columns: 207 entries, MSZoning=FV to SaleCondition=Partial
dtypes: uint8(207)
memory usage: 295.3 KB


Unnamed: 0,MSZoning=FV,MSZoning=RH,MSZoning=RL,MSZoning=RM,Street=Pave,LotShape=IR2,LotShape=IR3,LotShape=Reg,LandContour=HLS,LandContour=Low,...,SaleType=ConLI,SaleType=ConLw,SaleType=New,SaleType=Oth,SaleType=WD,SaleCondition=AdjLand,SaleCondition=Alloca,SaleCondition=Family,SaleCondition=Normal,SaleCondition=Partial
0,0,0,1,0,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,1,0
1,0,0,1,0,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,1,0
2,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
3,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
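
To see what **drop_first=True** changes, here is a minimal sketch on a toy frame (the frame `toy` and its values are assumptions for illustration). With two attributes of 2 and 3 categories, plain one-hot encoding creates 2 + 3 columns, whereas dropping the first level of each attribute leaves 1 + 2 columns.

```python
import pandas as pd

# Hypothetical toy frame: Street has 2 categories, Zone has 3
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Zone": ["RL", "RM", "FV"],
})

# Plain one-hot encoding: k columns per attribute
full = pd.get_dummies(toy, prefix_sep="=")

# Dummy encoding: the first (alphabetically smallest) level is dropped,
# leaving k-1 columns per attribute
reduced = pd.get_dummies(toy, prefix_sep="=", drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

A row with zeros in all remaining columns of an attribute then stands for the dropped reference level.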


We are done with the treatment of the categorical variables. Before we proceed we **rejoin** the transformed categorical dataframe **houses_cat_dum** with the numerical one **houses_num** by using the **join()** method. Please give the new dataframe the name **houses_prep**.

In [17]:
# solution
houses_prep = houses_num.join(houses_cat_dum)
houses_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Columns: 243 entries, MSSubClass to SaleCondition=Partial
dtypes: float64(2), int64(34), uint8(207)
memory usage: 705.9 KB
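
The **join()** method aligns rows by their index labels, not by their position. A short sketch with two hypothetical toy frames (`num` and `dum` are made up for illustration) shows this:

```python
import pandas as pd

# Hypothetical numerical part, index in natural order
num = pd.DataFrame({"LotArea": [100, 200, 300]}, index=[0, 1, 2])

# Hypothetical dummy part with the same index labels in reversed order
dum = pd.DataFrame({"Street=Pave": [1, 0, 0]}, index=[2, 1, 0])

# join matches the rows on the index labels of the left dataframe
joined = num.join(dum)

print(joined)
```

In the exercise both dataframes come from the same original dataframe and share the same index, so every row finds its partner.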


### Treatment of numerical features

Next, we fill the NaN values of the remaining features with the median of the corresponding column. The medians are learned from the data by a **fit** method. Every preprocessing step which requires such a method **must be fitted only on the training data**, otherwise information from the test data leaks into the preparation. Hence, we split our dataset into a **training** and a **test** dataset called **houses_train** and **houses_test**. The houses_test dataset will be used to evaluate our predictions at the very end.

Please **import** the function **train_test_split** from **sklearn.model_selection** and apply it. As the first argument use the dataframe **houses_prep**. The second argument should be set to **test_size=0.2**, i.e. we use 20% of the data as a test dataset. In order to get the same result for all participants we set the **random seed** to the fixed value 42. This gives us identical training and test datasets. Therefore, please use **random_state=42** as the third argument of the function. The function returns a list containing two dataframes which can be unpacked directly.

In [18]:
import sklearn

In [19]:
#sklearn.model_selection\
#.train_test_split()

In [20]:
from sklearn.model_selection import train_test_split
#from sklearn.externals import joblib
#from sklearn.preprocessing import Imputer
# fix random seed
np.random.seed(42)

In [21]:
#split dataframe in training and test dfs
houses_train, houses_test = train_test_split(houses_prep, test_size=0.2,
                                             random_state=42)
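
The role of **random_state** can be checked with a small experiment. The sketch below uses a hypothetical ten-row frame `toy` and splits it twice with the same seed; both calls should return the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with ten rows
toy = pd.DataFrame({"x": range(10)})

# Two independent calls with the same random_state
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

# 20% of ten rows go to the test set
print(len(train_a), len(test_a))

# Same seed, same split
print(train_a.index.equals(train_b.index))
```

Note that the shuffled index labels are kept, which is why the training dataframe does not start at index 0.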

In [22]:
houses_train.head()

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,SaleType=ConLI,SaleType=ConLw,SaleType=New,SaleType=Oth,SaleType=WD,SaleCondition=AdjLand,SaleCondition=Alloca,SaleCondition=Family,SaleCondition=Normal,SaleCondition=Partial
254,20,8400,5,6,1957,1957,0.0,922,0,392,...,0,0,0,0,1,0,0,0,1,0
1066,60,7837,6,7,1993,1994,0.0,0,0,799,...,0,0,0,0,1,0,0,0,1,0
638,30,8777,5,7,1910,1950,0.0,0,0,796,...,0,0,0,0,1,0,0,0,1,0
799,50,7200,5,7,1937,1950,252.0,569,0,162,...,0,0,0,0,1,0,0,0,1,0
380,50,5000,5,6,1924,1950,0.0,218,0,808,...,0,0,0,0,1,0,0,0,1,0


To fill the NaN values we create an imputer object of the preprocessing class **SimpleImputer**. Hence, **import** the class from the module **sklearn.impute**, instantiate it using **strategy='median'** as the argument and give the resulting object the name **imputer**.

In [24]:
# solution
from sklearn.impute import SimpleImputer

In [25]:
# solution
imputer = SimpleImputer(strategy='median')

Like most Sklearn Machine Learning model and preprocessing classes, the class SimpleImputer provides the methods fit, transform and fit_transform. Here, the method fit computes the median of each column and stores it inside the object, transform fills the NaN values with the stored medians, and fit_transform combines both steps.
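
The difference between fit_transform and transform becomes visible on a tiny example. This is a sketch with hypothetical one-column frames `train` and `test`; the medians learned during fit are available in the attribute `statistics_`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test data with one missing value each
train = pd.DataFrame({"a": [1.0, 2.0, np.nan, 10.0]})
test = pd.DataFrame({"a": [np.nan, 100.0]})

imp = SimpleImputer(strategy="median")

# fit_transform: learn the median from the training data and fill its NaNs
train_filled = imp.fit_transform(train)

# transform: reuse the stored training median, the test data is not refitted
test_filled = imp.transform(test)

print(imp.statistics_)   # median of the non-missing training values 1, 2, 10
print(test_filled)
```

The missing test value is filled with the training median 2.0, not with anything computed from the test data.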

Please, use the method fit_transform of the imputer object and use the dataframe houses_train as the argument. Call the result train_array. Afterwards, use only the transform method of the imputer on houses_test and call the result test_array. Print both results. What do you notice?

In [None]:
#imputer.fit()

In [26]:
# solution
train_array = imputer.fit_transform(houses_train)
test_array = imputer.transform(houses_test)
test_array

array([[2.0000e+01, 8.4140e+03, 6.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00],
       [6.0000e+01, 1.2256e+04, 8.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00],
       [3.0000e+01, 8.9600e+03, 5.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00],
       ...,
       [6.0000e+01, 8.1990e+03, 7.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00],
       [7.0000e+01, 9.0840e+03, 4.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00],
       [2.0000e+01, 8.1200e+03, 4.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        0.0000e+00]])

Hey, we lost the column and index names of our data. The reason is that sklearn objects accept pandas dataframes as input, but (in general) return numpy arrays, since sklearn uses numpy arrays and not pandas dataframes as its central data type. However, sometimes it is more convenient to work with dataframes. Therefore, we again create dataframes, called train_df and test_df, out of the two numpy arrays. We can get the column names and indices by accessing the attributes columns and index of the dataframes and transforming the results to lists with the method tolist(). Since the columns are the same in houses_prep, houses_train and houses_test, we only have to extract them from one of these dataframes. However, the indices differ and need to be extracted separately.

In [27]:
# solution
# extract columns and indices
cols = houses_prep.columns.tolist()
train_ind = houses_train.index.tolist()
test_ind = houses_test.index.tolist()
#cols

Now you can create a dataframe by using the pandas constructor DataFrame(). As the arguments use the array, the columns and the corresponding index. Call the resulting dataframes train_df and test_df.

In [28]:
train_df = pd.DataFrame(train_array, columns=cols, index=train_ind)
test_df = pd.DataFrame(test_array, columns=cols, index=test_ind)
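
The round trip from dataframe to numpy array and back can be sketched on a toy frame (the frame `df` below is hypothetical; only its index labels are borrowed from the output above). The sketch also passes the columns and index objects directly, which works as well as passing lists.

```python
import pandas as pd

# Hypothetical frame with non-consecutive index labels, as after a split
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=[254, 1066])

# Converting to a numpy array drops the column names and index labels
arr = df.to_numpy()

# Rebuild the dataframe by re-attaching columns and index
rebuilt = pd.DataFrame(arr, columns=df.columns, index=df.index)

print(rebuilt)
```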

In [29]:
train_df.head()

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,SaleType=ConLI,SaleType=ConLw,SaleType=New,SaleType=Oth,SaleType=WD,SaleCondition=AdjLand,SaleCondition=Alloca,SaleCondition=Family,SaleCondition=Normal,SaleCondition=Partial
254,20.0,8400.0,5.0,6.0,1957.0,1957.0,0.0,922.0,0.0,392.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1066,60.0,7837.0,6.0,7.0,1993.0,1994.0,0.0,0.0,0.0,799.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
638,30.0,8777.0,5.0,7.0,1910.0,1950.0,0.0,0.0,0.0,796.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
799,50.0,7200.0,5.0,7.0,1937.0,1950.0,252.0,569.0,0.0,162.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
380,50.0,5000.0,5.0,6.0,1924.0,1950.0,0.0,218.0,0.0,808.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


Before saving the data perform a quick check to see if there are still null values in the dataframes and if all the data types are numerical. 

In [30]:
# solution
train_df.isnull().sum().sum()
#test_df.isnull().sum().sum()

0

In [31]:
# solution
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1168 entries, 254 to 1126
Columns: 243 entries, MSSubClass to SaleCondition=Partial
dtypes: float64(243)
memory usage: 2.2 MB


In [32]:
# solution
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 292 entries, 892 to 722
Columns: 243 entries, MSSubClass to SaleCondition=Partial
dtypes: float64(243)
memory usage: 556.6 KB
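
Both checks can also be expressed as single True/False values, which is handy as a guard before saving. This is a sketch on a hypothetical frame `df`; the helper function `is_numeric_dtype` is part of `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical prepared frame: no missing values, only numerical columns
df = pd.DataFrame({"a": [1.0, 2.0], "b": [0.0, 1.0]})

# Total number of missing values over all columns
n_missing = int(df.isnull().sum().sum())

# True only if every column has a numerical data type
all_numeric = all(is_numeric_dtype(dtype) for dtype in df.dtypes)

print(n_missing, all_numeric)
```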


Finally, we save the two dataframes in a Python-specific binary format. This can be done by using the method to_pickle() on the dataframes. As the file paths please use '../data/houses_train.pkl' and '../data/houses_test.pkl'. The solution additionally writes both dataframes to CSV files.

In [36]:
# solution
# save the two dataframes
train_df.to_pickle('../data/houses_train.pkl')
test_df.to_pickle('../data/houses_test.pkl')

# csv
train_df.to_csv('../data/houses_train.csv', index=True)
test_df.to_csv('../data/houses_test.csv', index=True)
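
To check that saved data can be read back, the following sketch writes a toy frame to a temporary directory and loads it again. The frame `df` and the file names are assumptions for illustration; the real files of the exercise are not touched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with non-consecutive index labels
df = pd.DataFrame({"a": [1.0, 2.0]}, index=[254, 1066])

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")

    # Pickle keeps data types and index exactly as they are
    df.to_pickle(pkl_path)
    from_pkl = pd.read_pickle(pkl_path)

    # CSV is plain text; the index is written as the first column
    df.to_csv(csv_path, index=True)
    from_csv = pd.read_csv(csv_path, index_col=0)

print(from_pkl.equals(df))
print(from_csv.index.tolist())
```

When reading the CSV file, index_col=0 is needed to restore the index from the first column.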

In [34]:
train_df.describe()

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,SaleType=ConLI,SaleType=ConLw,SaleType=New,SaleType=Oth,SaleType=WD,SaleCondition=AdjLand,SaleCondition=Alloca,SaleCondition=Family,SaleCondition=Normal,SaleCondition=Partial
count,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,...,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0
mean,56.849315,10689.642123,6.121575,5.58476,1970.965753,1984.89726,103.23887,446.023973,45.152397,570.595034,...,0.003425,0.003425,0.083048,0.001712,0.866438,0.003425,0.005993,0.015411,0.825342,0.083904
std,42.531862,10759.366198,1.367619,1.116062,30.675495,20.733955,172.746354,459.070977,158.217499,446.364551,...,0.058445,0.058445,0.276073,0.041363,0.340326,0.058445,0.077216,0.123233,0.379837,0.277363
min,20.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,20.0,7587.25,5.0,5.0,1953.0,1966.0,0.0,0.0,0.0,222.5,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
50%,50.0,9600.0,6.0,5.0,1972.0,1994.0,0.0,384.5,0.0,480.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
75%,70.0,11700.0,7.0,6.0,2001.0,2004.0,166.0,721.0,0.0,810.25,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
max,190.0,215245.0,10.0,9.0,2010.0,2010.0,1378.0,5644.0,1127.0,2336.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### This is the end of the exercise.