#### Importing Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import numpy as np

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## 2. Data Preprocessing - Prepare the Data for Machine Learning Algorithms
 
 Get a sorted count of the missing values for all the attributes.

In [33]:
housing.isnull().sum().sort_values(ascending=False)

PoolQC           1453
MiscFeature      1406
Alley            1369
Fence            1179
FireplaceQu       690
LotFrontage       259
GarageType         81
GarageCond         81
GarageFinish       81
GarageQual         81
GarageYrBlt        81
BsmtFinType2       38
BsmtExposure       38
BsmtQual           37
BsmtCond           37
BsmtFinType1       37
MasVnrArea          8
MasVnrType          8
Electrical          1
RoofMatl            0
Exterior1st         0
RoofStyle           0
ExterQual           0
Exterior2nd         0
YearBuilt           0
ExterCond           0
Foundation          0
YearRemodAdd        0
SalePrice           0
OverallCond         0
                 ... 
GarageArea          0
PavedDrive          0
WoodDeckSF          0
OpenPorchSF         0
3SsnPorch           0
BsmtUnfSF           0
ScreenPorch         0
PoolArea            0
MiscVal             0
MoSold              0
YrSold              0
SaleType            0
Functional          0
TotRmsAbvGrd        0
KitchenQua

From the results above we can assume that PoolQC to Bsmt attributes are missing for the houses that do not have these facilities (houses without pools, basements, garage etc.). Therefore, the missing values could be filled in with “None”. MasVnrType and MasVnrArea both have 8 missing values, likely houses without masonry veneer.

### Deal With Missing Values
We are going to apply different approaches to fix our missing values, so that we can various approaches in action:

* Replace values for categorical attributes with None.
* Compute the median LotFrontage for all the houses in the same neighborhood, instead of the plain median for the entire column, and use that to impute on a neighborhood by neighborhood basis.
* Replace missing values for most of the numerical columns with zero and one with the mode.
* Drop one non-interesting column, Utilities.

In [34]:
# Imputing Missing Values

housing_processed = housing

# Categorical columns:
cat_cols_fill_none = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
                     'GarageCond', 'GarageQual', 'GarageFinish', 'GarageType',
                     'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'BsmtCond',
                     'MasVnrType']

# Replace missing values for categorical columns with None
for cat in cat_cols_fill_none:
    housing_processed[cat] = housing_processed[cat].fillna("None")
    
# Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
housing_processed['LotFrontage'] = housing_processed.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))    

# GarageYrBlt, GarageArea and GarageCars these are numerical columns, replace with zero
for col in ['GarageYrBlt', 'GarageArea', 'GarageCars']:
    housing_processed[col] = housing_processed[col].fillna(int(0))
    
#MasVnrArea : replace with zero
housing_processed['MasVnrArea'] = housing_processed['MasVnrArea'].fillna(int(0))

#Use the mode value 
housing_processed['Electrical'] = housing_processed['Electrical'].fillna(housing_processed['Electrical']).mode()[0]

#There is no need of Utilities so let's just drop this column
housing_processed = housing_processed.drop(['Utilities'], axis=1)

In [35]:
# Get the count again to verify that we do not have any more missing values
housing_processed.isnull().apply(sum).max()

0

### Deal With Outliers
Invoking the `quantile()` method on the DataFrame and then filtering based on the knowledge of the quantiles for each attribute, like so:

In [36]:
num_attributes = housing_processed.select_dtypes(exclude='object')

high_quant = housing_processed.quantile(.999)

for i in num_attributes.columns:
    housing_processed = housing_processed.drop(housing_processed[i][housing_processed[i]>high_quant[i]].index)

housing_processed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1422 entries, 0 to 1458
Data columns (total 79 columns):
MSSubClass       1422 non-null int64
MSZoning         1422 non-null object
LotFrontage      1422 non-null float64
LotArea          1422 non-null int64
Street           1422 non-null object
Alley            1422 non-null object
LotShape         1422 non-null object
LandContour      1422 non-null object
LotConfig        1422 non-null object
LandSlope        1422 non-null object
Neighborhood     1422 non-null object
Condition1       1422 non-null object
Condition2       1422 non-null object
BldgType         1422 non-null object
HouseStyle       1422 non-null object
OverallQual      1422 non-null int64
OverallCond      1422 non-null int64
YearBuilt        1422 non-null int64
YearRemodAdd     1422 non-null int64
RoofStyle        1422 non-null object
RoofMatl         1422 non-null object
Exterior1st      1422 non-null object
Exterior2nd      1422 non-null object
MasVnrType       1422 no

### Deal With Categorical Attributes

we can drop GarageArea because it is highly correlated with GarageCars and the reason for preferring GarageCars is because it is more correlated with price than area. 

In [37]:
#### Remove highly correlated features
# Remove attributes that were identified for excluding when viewing scatter plots & corr values
attributes_drop = ['MiscVal', 'MoSold', 'YrSold', 'BsmtFinSF2','BsmtHalfBath','MSSubClass',
                   'GarageArea', 'GarageYrBlt', '3SsnPorch']

housing_processed = housing_processed.drop(attributes_drop, axis=1)

### Handle Text and Categorical Data
Let's convert all the categories from text to numbers.
A common approach to deal with textual data is to create one binary attribute for each category of the feature: for example, for type of houses, we would have one attribute equal to 1 when the category is 1Story (and 0 otherwise), another attribute equal to 1 when the category is 2Story (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are also known as dummy attributes.

In [38]:
#### Transforming Cat variables
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_processed_1hot = cat_encoder.fit_transform(housing_processed)
housing_processed_1hot

<1422x7333 sparse matrix of type '<class 'numpy.float64'>'
	with 99540 stored elements in Compressed Sparse Row format>