# 2.0 Notebook 2: Feature Engineering

We have now acquired a good intuition of our data. Next step involves taking a deep dive into the data. In this section, we will rectify the quality issues with the data.

### 2.1. Introduction

What do we know of Feature Engineering? The main essense of machine learning ischoosing features, what is a feature and why do we need to engineer it? Simply puting into words all machine learning algorithms use some input data to create outputs. This input data comprise features, which are usually in the form of structured columns. Algorithms require features with some specific characteristic to work properly.Therefor we require feature engineering. Main objectives :

* Prepare the accurate input dataset, compatible with the machine learning algorithm requirements 
* Improve the performance of machine learning models.

### 2.2 Objective

Since many features in our data is categorical value, we need to find appropriate techniques to convert them into integers, as our model only inputs numerical values, remove outliers,etc. Few operations to be performed here:
1. Imputation
2. Handling Outliers
3. Log Transform
4. One-Hot Encoding
5. Feature Split
6. Scaling
7. Extracting Date

> Thus, the goal here is to properly mould our dataset to fit perfectly into our model. 


In [1]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer

import warnings

# ignore warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# display all columns
pd.set_option('display.max_columns', None)

We need to load our clean dataset exported from the previous notebook.

In [2]:
# load dataset
df_primary = pd.read_csv('data/data_v1.csv')

In [3]:
# check data
df_primary.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


In [4]:
#create copy of dataframe
df = df_primary.copy()

### Check for null values

We need to assess how many null values are present in each column in order to fix the issue. If the values are low we can straight away remove them, if not we can fill them by various measures of centrality (mean, mediam, mode) or by any othe means. But first of all we need to carefully analyze it.

In [5]:
def nullvalper(dataframe):
    res = {'col':[],'val':[]}
    for column in dataframe.columns:
        if dataframe[column].isnull().any() == True:
            print("{:<15} : {:<6}items, accounts {:.4f}%".format(
                column,dataframe[column].isnull().sum(),dataframe[column].isnull().sum()
                /len(dataframe[column])*100))

In [6]:
# check null percentage
nullvalper(df)

Price           : 7594  items, accounts 21.8344%
Distance        : 1     items, accounts 0.0029%
Postcode        : 1     items, accounts 0.0029%
Bathroom        : 8226  items, accounts 23.6515%
Car             : 8726  items, accounts 25.0891%
Landsize        : 11790 items, accounts 33.8988%
BuildingArea    : 21115 items, accounts 60.7102%
YearBuilt       : 19304 items, accounts 55.5032%
CouncilArea     : 3     items, accounts 0.0086%
Lattitude       : 7976  items, accounts 22.9327%
Longtitude      : 7976  items, accounts 22.9327%
Regionname      : 3     items, accounts 0.0086%
Propertycount   : 3     items, accounts 0.0086%


We will use measurement of centrality (mean, median, mode) to fill the null values.

#### Fill using meaurement of centrality

The most suitable way to __fill__ in the __null values__ for __Price, Landsize, Distance, BuildingArea, Lattitude, Longtitude and YearBuilt__ with __median__.

In [7]:
df['Price']= df['Price'].fillna(df['Price'].median())
df['Landsize']= df['Landsize'].fillna(df['Landsize'].median())
df['Distance'] = df['Distance'].fillna(df['Distance'].median())
df['BuildingArea']= df['BuildingArea'].fillna(df['BuildingArea'].median())
df['Lattitude']= df['Lattitude'].fillna(df['Lattitude'].median())
df['Longtitude']= df['Longtitude'].fillna(df['Longtitude'].median())
df['YearBuilt']= df['YearBuilt'].fillna(df['YearBuilt'].median())

And for __categorical values__ such as __Postcode, Bathroom, Car, CouncilArea, Regionname and Propertycount__ we use __mode__ to __fill__ in the __null values__.

In [8]:
df['Bathroom']= df['Bathroom'].fillna(df['Bathroom'].mode()[0])
df['Car']= df['Car'].fillna(df['Car'].mode()[0])
df['CouncilArea']= df['CouncilArea'].fillna(df['CouncilArea'].mode()[0])
df['Regionname']= df['Regionname'].fillna(df['Regionname'].mode()[0])
df['Propertycount']= df['Propertycount'].fillna(df['Propertycount'].mode()[0])
df['Postcode']= df['Postcode'].fillna(df['Postcode'].mode()[0])

In [9]:
# check null values
nullvalper(df)

No more null values. We can finally change Data type of Bathroom and Car to Integer, and Date to DateTime (pandas) format.

In [10]:
df['Car'] = pd.to_numeric(df['Car']).round(0).astype(int)
df['Bathroom'] = pd.to_numeric(df['Bathroom']).round(0).astype(int)
df['Date'] = pd.to_datetime(df['Date'])

### Handling Outliers

__Outlier detection using Standard Deviation__

A distance to the average higher than x * standard deviation can be assumed as an outlier. How should we estimate the value for x. Well there is no hard and fast rule for it, but usually, a value between 2 and 4 seems practical, as estimated by the statisticians.

In [11]:
def remove_outlier(data,factor):
    for column in df.select_dtypes(include=np.number).columns.tolist():
        upper_lim = data[column].mean () + data[column].std () * factor
        lower_lim = data[column].mean () - data[column].std () * factor
        
    return data[(data[column] < upper_lim) & (data[column] > lower_lim)]

In [12]:
# remove outlier
df = remove_outlier(df,3)

### Create Extra Features

We'll try to extract more features out the the current features. For example we can create a column HouseAge for using column Year built - 2018 (dataset creation year), 

Create extra column for age of the house. 

In [13]:
# create column HouseAge
df['HouseAge'] = 2018 - df['YearBuilt']

We will create two extra columns, day of year, year, season (all purchase data) using Date column.  

In [18]:
# create Day of Year (1-365) and Year of purchase
df['DayOfYear'] = df['Date'].dt.dayofyear
df['Year'] = df['Date'].dt.year

Now we create Season using the day of the year, as we know the months which have different seasons. We have primarily four seasons namely :
1. Spring (March 1 to May 31)
2. Summer (June 1 to August 31)
3. Autumn (September 1 to November 30)
4. Winter (December 1 to February 28)

In [19]:
# divide seasons by day of year, using the the above data (month detail)
spring = range(60, 152)
summer = range(152, 243)
autumn = range(243, 334)
# everything else would be winter.

In [20]:
season_series = []

for i in df['DayOfYear']:
    if i in spring:
        res = 'spring'
    elif i in summer:
        res = 'summer'
    elif i in autumn:
        res = 'autumn'
    else:
        res = 'winter'
    season_series.append(res)
    
df['season'] = pd.Series(season_series)

In [21]:
df

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,HouseAge,DayOfYear,Year,season
0,Abbotsford,68 Studley St,2,h,870000.0,SS,Jellis,2016-03-09,2.5,3067.0,1,1,126.0,136.0,1970.0,Yarra City Council,-37.80140,144.99580,Northern Metropolitan,4019.0,48.0,69,2016,spring
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,2016-03-12,2.5,3067.0,1,1,202.0,136.0,1970.0,Yarra City Council,-37.79960,144.99840,Northern Metropolitan,4019.0,48.0,72,2016,spring
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,2016-04-02,2.5,3067.0,1,0,156.0,79.0,1900.0,Yarra City Council,-37.80790,144.99340,Northern Metropolitan,4019.0,118.0,93,2016,spring
3,Abbotsford,18/659 Victoria St,3,u,870000.0,VB,Rounds,2016-04-02,2.5,3067.0,2,1,0.0,136.0,1970.0,Yarra City Council,-37.81140,145.01160,Northern Metropolitan,4019.0,48.0,93,2016,spring
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,2017-04-03,2.5,3067.0,2,0,134.0,150.0,1900.0,Yarra City Council,-37.80930,144.99440,Northern Metropolitan,4019.0,118.0,93,2017,spring
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34775,Yarraville,13 Burns St,4,h,1480000.0,PI,Jas,2018-02-24,6.3,3013.0,1,3,593.0,136.0,1970.0,Maribyrnong City Council,-37.81053,144.88467,Western Metropolitan,6543.0,48.0,55,2018,
34776,Yarraville,29A Murray St,2,h,888000.0,SP,Sweeney,2018-02-24,6.3,3013.0,2,1,98.0,104.0,2018.0,Maribyrnong City Council,-37.81551,144.88826,Western Metropolitan,6543.0,0.0,55,2018,
34777,Yarraville,147A Severn St,2,t,705000.0,S,Jas,2018-02-24,6.3,3013.0,1,2,220.0,120.0,2000.0,Maribyrnong City Council,-37.82286,144.87856,Western Metropolitan,6543.0,18.0,55,2018,
34778,Yarraville,12/37 Stephen St,3,h,1140000.0,SP,hockingstuart,2018-02-24,6.3,3013.0,1,2,520.5,136.0,1970.0,Maribyrnong City Council,-37.80770,145.00780,Western Metropolitan,6543.0,48.0,55,2018,


In [22]:
df.isna().any()

Suburb           False
Address          False
Rooms            False
Type             False
Price            False
Method           False
SellerG          False
Date             False
Distance         False
Postcode         False
Bathroom         False
Car              False
Landsize         False
BuildingArea     False
YearBuilt        False
CouncilArea      False
Lattitude        False
Longtitude       False
Regionname       False
Propertycount    False
HouseAge         False
DayOfYear        False
Year             False
season            True
dtype: bool

In [23]:
df.season.value_counts()

autumn    11240
summer     9478
spring     8917
winter     3468
Name: season, dtype: int64

In [27]:
df[df['DayOfYear'] == np.NaN]

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,HouseAge,DayOfYear,Year,season


In [28]:
df.DayOfYear.value_counts()

301    1095
76      950
55      925
255     907
329     886
       ... 
307      63
273      21
20       18
27       12
28        3
Name: DayOfYear, Length: 77, dtype: int64