# 2.0 Notebook 2: Feature Engineering

We have now acquired a good intuition of our data. The next step is to take a deeper dive into it. In this section, we will rectify the data quality issues.

### 2.1 Introduction

What do we know about feature engineering? Choosing features is at the heart of machine learning, so what is a feature, and why do we need to engineer it? Put simply, all machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly, which is why we need feature engineering. Its main objectives are:

* Prepare an input dataset that is compatible with the requirements of the machine learning algorithm.
* Improve the performance of machine learning models.

### 2.2 Objective

Since many features in our data are categorical, we need to find appropriate techniques to convert them into integers, as our model only accepts numerical values. We also need to remove outliers, and so on. A few operations to be performed here:
1. Imputation
2. Handling Outliers
3. Extracting Date

> Thus, the goal here is to properly mould our dataset to fit perfectly into our model. 


In [1]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer

import warnings

# ignore warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# display all columns
pd.set_option('display.max_columns', None)

We need to load our clean dataset exported from the previous notebook.

In [2]:
# load dataset
df_primary = pd.read_csv('data/data_v1.csv')

In [None]:
# check data
df_primary.head()

In [4]:
#create copy of dataframe
df = df_primary.copy()

### Check for null values

We need to assess how many null values are present in each column in order to fix the issue. If only a few values are missing, we can remove the affected rows straight away. If not, we can fill them using a measure of centrality (mean, median, mode) or by some other means. But first of all, we need to analyze them carefully.

In [5]:
def nullvalper(dataframe):
    # print the count and percentage of nulls for each column that has any
    for column in dataframe.columns:
        null_count = dataframe[column].isnull().sum()
        if null_count > 0:
            print("{:<15} : {:<6}items, accounts for {:.4f}%".format(
                column, null_count, null_count / len(dataframe) * 100))

In [None]:
# check null percentage
nullvalper(df)
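
The same check can also be expressed without an explicit loop. The following is a minimal sketch, not part of the original notebook: the helper name `null_summary` and the toy data are made up for illustration. It returns the result as a DataFrame, which is convenient when the null percentages need to be sorted or reused later.

```python
import numpy as np
import pandas as pd

def null_summary(dataframe):
    """Return null count and percentage for every column that has nulls."""
    counts = dataframe.isnull().sum()
    summary = pd.DataFrame({
        'null_count': counts,
        'null_percent': counts / len(dataframe) * 100,
    })
    # keep only columns with at least one null, largest share first
    return summary[summary['null_count'] > 0].sort_values(
        'null_percent', ascending=False)

# toy data, made up for illustration
toy = pd.DataFrame({
    'Price': [1.0, np.nan, 3.0, np.nan],
    'Rooms': [2, 3, 4, 5],
    'Car': [1.0, 2.0, np.nan, 1.0],
})
summary = null_summary(toy)
print(summary)
```

Columns without nulls (here `Rooms`) are left out of the summary, mirroring what `nullvalper` prints.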

We will use measures of centrality (mean, median, mode) to fill the null values.

#### Fill using measures of centrality

The most suitable way to __fill__ in the __null values__ for __Price, Landsize, Distance, BuildingArea, Lattitude, Longtitude and YearBuilt__ is to use the __median__ of each column.

In [7]:
df['Price']= df['Price'].fillna(df['Price'].median())
df['Landsize']= df['Landsize'].fillna(df['Landsize'].median())
df['Distance'] = df['Distance'].fillna(df['Distance'].median())
df['BuildingArea']= df['BuildingArea'].fillna(df['BuildingArea'].median())
df['Lattitude']= df['Lattitude'].fillna(df['Lattitude'].median())
df['Longtitude']= df['Longtitude'].fillna(df['Longtitude'].median())
df['YearBuilt']= df['YearBuilt'].fillna(df['YearBuilt'].median())

For __categorical and discrete values__ such as __Postcode, Bathroom, Car, CouncilArea, Regionname and Propertycount__, we use the __mode__ to __fill__ in the __null values__.

In [8]:
df['Bathroom']= df['Bathroom'].fillna(df['Bathroom'].mode()[0])
df['Car']= df['Car'].fillna(df['Car'].mode()[0])
df['CouncilArea']= df['CouncilArea'].fillna(df['CouncilArea'].mode()[0])
df['Regionname']= df['Regionname'].fillna(df['Regionname'].mode()[0])
df['Propertycount']= df['Propertycount'].fillna(df['Propertycount'].mode()[0])
df['Postcode']= df['Postcode'].fillna(df['Postcode'].mode()[0])
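
The notebook imports `SimpleImputer` from scikit-learn but fills the values with `fillna`. As an alternative, the same median and mode imputation could be sketched with `SimpleImputer` as shown below. This is only an illustration on made-up toy data; the column groupings `median_cols` and `mode_cols` are assumptions, not the notebook's actual lists.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy data, made up for illustration
toy = pd.DataFrame({
    'Price': [100.0, np.nan, 300.0, 1000.0],
    'Landsize': [200.0, 400.0, np.nan, 600.0],
    'Car': [1.0, 2.0, np.nan, 2.0],
    'Regionname': ['North', np.nan, 'North', 'South'],
})

median_cols = ['Price', 'Landsize']   # continuous columns
mode_cols = ['Car', 'Regionname']     # discrete / categorical columns

# fill continuous columns with the column median
median_imputer = SimpleImputer(strategy='median')
toy[median_cols] = median_imputer.fit_transform(toy[median_cols])

# fill discrete columns with the most frequent value,
# one column at a time so each column keeps its own dtype
mode_imputer = SimpleImputer(strategy='most_frequent')
for col in mode_cols:
    toy[col] = mode_imputer.fit_transform(toy[[col]]).ravel()

print(toy)
```

One reason to consider an imputer object is that the fitted statistics can be reused later with `transform` on new data, whereas `fillna` recomputes them each time.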

In [9]:
# check null values
nullvalper(df)

No null values remain. We can now change the data type of Bathroom and Car to integer, and Date to the pandas datetime format.

In [10]:
df['Car'] = pd.to_numeric(df['Car']).round(0).astype(int)
df['Bathroom'] = pd.to_numeric(df['Bathroom']).round(0).astype(int)
df['Date'] = pd.to_datetime(df['Date'])

### Handling Outliers

__Outlier detection using Standard Deviation__

A value whose distance from the mean is greater than x * standard deviation can be treated as an outlier. How should we choose the value of x? There is no hard and fast rule for it, but a value between 2 and 4 is commonly used in practice.

In [11]:
def remove_outlier(data,factor):
    # keep only rows that lie within mean +/- factor * std in every numeric column
    mask = pd.Series(True, index=data.index)
    for column in data.select_dtypes(include=np.number).columns.tolist():
        upper_lim = data[column].mean() + data[column].std() * factor
        lower_lim = data[column].mean() - data[column].std() * factor
        mask &= (data[column] < upper_lim) & (data[column] > lower_lim)

    return data[mask]

In [12]:
# remove outlier
df = remove_outlier(df,3)
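
To make the rule concrete, here is a small worked sketch on a single made-up series (the values are invented for illustration and are not taken from the dataset). It computes the lower and upper limits for x = 3 and flags the values that fall outside them.

```python
import pandas as pd

# toy prices: 20 ordinary values plus one extreme value, made up for illustration
prices = pd.Series([500.0 + 10 * i for i in range(20)] + [5000.0], name='Price')

factor = 3  # the "x" in x * standard deviation
mean, std = prices.mean(), prices.std()
lower_lim = mean - factor * std
upper_lim = mean + factor * std

# a value is an outlier if it lies outside [lower_lim, upper_lim]
is_outlier = (prices < lower_lim) | (prices > upper_lim)
print(prices[is_outlier])
```

Note that the extreme value itself inflates both the mean and the standard deviation, which is one reason the choice of x matters.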

### Create Extra Features

We'll try to derive more features from the existing ones. For example, we can create a HouseAge column for the age of the house by subtracting the YearBuilt column from 2018 (the dataset creation year).

In [13]:
# create column HouseAge
df['HouseAge'] = 2018 - df['YearBuilt']

We will create three extra columns from the Date column: day of year, month and year of purchase. The season of purchase is derived from these afterwards.

In [15]:
# create Day of Year (1-366), Month and Year of purchase
df['DayOfYear'] = df['Date'].dt.dayofyear
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year

Now we create Season using the day of the year, since we know which months belong to each season. There are four seasons:
1. Spring (March 1 to May 31)
2. Summer (June 1 to August 31)
3. Autumn (September 1 to November 30)
4. Winter (December 1 to the end of February)

In [16]:
#reset index
df.reset_index(inplace=True)

In [17]:
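# divide seasons by day of year, using the above data (month detail)
# day numbers are for a non-leap year; range() excludes its end value
spring = range(60, 152)   # March 1 (day 60) to May 31 (day 151)
summer = range(152, 244)  # June 1 (day 152) to August 31 (day 243)
autumn = range(244, 335)  # September 1 (day 244) to November 30 (day 334)
# everything else would be winter.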

In [18]:
season_series = []

for index,i in df.iterrows():
    if i['DayOfYear'] in spring:
        res = 'spring'
    elif i['DayOfYear'] in summer:
        res = 'summer'
    elif i['DayOfYear'] in autumn:
        res = 'autumn'
    else:
        res = 'winter'
    season_series.append(res)
    
df['season'] = pd.Series(season_series)
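
As an alternative to looping over rows, the season could also be derived directly from the month. The sketch below is an illustration on made-up dates, not the notebook's own method; the `month_to_season` dictionary is a hypothetical helper that follows the month boundaries listed above. Mapping by month is not affected by leap years, whereas day-of-year boundaries shift by one day after February 29.

```python
import pandas as pd

# month -> season, following the month boundaries listed above
month_to_season = {
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
    12: 'winter', 1: 'winter', 2: 'winter',
}

# toy purchase dates, made up for illustration (2016 is a leap year)
toy = pd.DataFrame({'Date': pd.to_datetime([
    '2016-02-29', '2016-03-01', '2017-08-31', '2017-11-30', '2017-12-01',
])})

# vectorized lookup: no explicit loop over rows is needed
toy['season'] = toy['Date'].dt.month.map(month_to_season)
print(toy)
```

Because the lookup is applied to the whole column at once, it also avoids the cost of `iterrows` on larger datasets.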

In [19]:
# save data
df.drop(columns=['index'],inplace=True)
df.to_csv('data/data_v2.csv',index=False)

We have completed feature engineering and now have a clean dataset. In the next notebook, we proceed with Exploratory Data Analysis to understand the data with the help of summary statistics and visualizations.