![license_header_logo](https://user-images.githubusercontent.com/59526258/124226124-27125b80-db3b-11eb-8ba1-488d88018ebb.png)
> **Copyright (c) 2020-2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful,
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

Feature engineering is a machine learning technique for deriving new features from raw data.
In this notebook, we will explore several feature engineering techniques:
1. Imputation of missing values.
2. One-hot encoding of categorical data.
3. Categorizing (binning) numerical data.
4. Handling outliers.

# Learning Outcome
By the end of this notebook, you should be able to:
1. Perform feature engineering for missing values.
2. Handle categorical data (binary, nominal, ordinal).
3. Implement feature engineering for numerical data.
4. Handle outlier values.

# Table Of Contents
* [Missing values](#missing)
* [Categorical data](#category)
* [Numerical data](#numeric)
* [Outlier](#outlier)
* [Exercise](#exercise)

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

In [None]:
# Employee dataset
df = pd.DataFrame({
    "Employee ID" : ['100', '101', '102', '103', '104', '105', '106', '107', '108', '109'],
    "Department" : ["Finance", "IT", "Sales", "Human Resource", "Finance", "IT", "Sales", "Finance", "Sales", "IT"],
    "Age" : [24, np.nan, 18, 28, 29, 28, 30, 35, np.nan, 35],
    "Gender" : ["F", "M", "F", "F", "F", "M", "M", "M", "F", "F"],
    "Education" : ["Diploma", "Diploma", "SPM", "Degree", "Degree", "SPM", "Degree", "SPM", "Diploma", "Degree"]
    })

In [None]:
# Convert Employee ID to int data type
df['Employee ID'] = df['Employee ID'].astype(int)

In [None]:
df.head()

In [None]:
df.info()

# <a name="missing">Missing Values</a>

There are two methods to handle missing values:
1. <b>Delete records</b> that contain missing values
    * Delete entire column
    * Delete entire rows
    
    
2. <b>Imputation</b>
    * Impute using mean 
    * Impute using mode

In [None]:
# Count the missing values in each column
df.isnull().sum()

## Deleting values
Delete the rows or columns that contain missing data.

### Column

In [None]:
del_column = df.dropna(axis=1)

In [None]:
del_column.head()

### Row

In [None]:
del_row = df.dropna(axis=0)

In [None]:
del_row.head()

## Imputation
Replace missing data with a statistical estimate or the most frequently occurring value of the variable.

Numerical values:
1. Mean
2. Mode
3. Median

Categorical values:
1. Most frequent value (mode imputation).
2. Adding a "missing"/"unknown" category.

In this section, we will show how to impute missing values by using the mean or the mode.
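
The notebook demonstrates only the mean and the mode. As a minimal sketch of the other two options listed above (median imputation and an explicit "unknown" category), assuming a small made-up DataFrame `toy` that is not part of the employee dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's employee dataset
toy = pd.DataFrame({
    "Age": [24, np.nan, 18, 28, 90],
    "Education": ["Diploma", None, "SPM", "Degree", "Degree"],
})

# Median imputation: less sensitive to extreme values (such as 90) than the mean
toy["Age_median"] = toy["Age"].fillna(toy["Age"].median())

# Categorical imputation: mark missing entries with an explicit "Unknown" label
toy["Education_filled"] = toy["Education"].fillna("Unknown")

print(toy)
```

The median of the observed ages here is 26, while the mean would be pulled up to 40 by the single value of 90.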

### Mean
Fill in the missing values with the column mean.

In [None]:
# numeric_only=True skips the text columns, which have no mean
mean = df.fillna(df.mean(numeric_only=True).astype(int))

In [None]:
mean.head()

### Mode
Fill in the missing values with the column mode.

In [None]:
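# df.mode() returns a DataFrame (one row per tied mode); take the first row
# so that every missing value in a column is filled with that column's mode
mode = df.fillna(df.mode().iloc[0])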

In [None]:
mode.head()

# <a name="category">Categorical data</a>
Carry out feature engineering on the binary, nominal and ordinal categorical columns.

In [None]:
# Print the unique values of each categorical (object) column in df
for i in df.columns:
    if df.dtypes[i] == 'object':
        print("Column: {}".format(i))
        print(df[i].unique())

From the result above, here is the list of categorical data:
* Binary : Gender
* Nominal : Department
* Ordinal : Education

## Binary Data
Perform one-hot encoding by using `pd.get_dummies`. It converts a categorical variable into dummy/indicator variables, one for each category.<br>

Setting `drop_first=True` removes the column for the first category. For a binary variable, that column carries the same information as the column for the second category; for a variable with more categories, its information is captured by the remaining columns. Dropping it prevents redundant data in our dataset.
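
To make the redundancy concrete, here is a small standalone sketch on a made-up two-category column (the `gender` DataFrame is an illustration only). It passes `dtype=int` so that the output shows 0/1 values regardless of the pandas version:

```python
import pandas as pd

# Hypothetical toy column with two categories
gender = pd.DataFrame({"Gender": ["F", "M", "F", "M"]})

# Without drop_first: two columns that always sum to 1 (redundant)
full = pd.get_dummies(gender, columns=["Gender"], dtype=int)
print(full)

# With drop_first: only Gender_M remains; Gender_F is implied by Gender_M == 0
reduced = pd.get_dummies(gender, columns=["Gender"], drop_first=True, dtype=int)
print(reduced)
```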

In [None]:
# dtype=int yields 0/1 values instead of the boolean default of newer pandas versions
df = pd.get_dummies(data=df, columns=['Gender'], drop_first=True, dtype=int)

In [None]:
df = df.rename(columns={"Gender_M":"Gender"})

In [None]:
df.head()

Take a quick look at the Gender column: the value 1 represents male and 0 represents female.

## Nominal Data
There are two methods to perform feature engineering for nominal data: label encoding and one-hot encoding.
### Label Encoding 

In [None]:
# make a copy of our data
df_encode = df.copy()

In [None]:
label_encoder = preprocessing.LabelEncoder()

In [None]:
df_encode['Department'] = label_encoder.fit_transform(df_encode['Department'])

In [None]:
df_encode.head()

Label encoding assigns integer codes in alphabetical order of the labels. The values in the 'Department' column are encoded as follows:<br>
<b>Finance = 0, Human Resource = 1, IT = 2, Sales = 3</b><br>

Department names do not have an order or rank, but label encoding imposes an ordinal relationship between the categories that a model may treat as meaningful. Thus, label encoding is less preferred when transforming nominal features. It is most commonly used to transform the target variable only.
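
The code assigned to each label can be inspected through the encoder's `classes_` attribute. The following is a minimal sketch on a standalone list of department names (the list itself is made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, deliberately not in alphabetical order
departments = ["Sales", "IT", "Finance", "Human Resource", "IT"]

encoder = LabelEncoder()
codes = encoder.fit_transform(departments)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

The mapping depends only on the sorted label names, not on the order in which they appear in the data.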

### One-Hot Encoding

In [None]:
df = pd.get_dummies(data=df, columns=['Department'])

In [None]:
df.head()

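**Challenges of One-Hot Encoding**
* It expands the feature space: one new column is created for every category.
* The new columns are not independent (each one can be derived from the others), which introduces redundant information unless one column is dropped.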

## Ordinal Data
Ordinal data has a natural order between its categories, for example a grade, a review score or a ranking.

In [None]:
# Creating dictionary for mapping the ordinal numerical value
education_dict = {'SPM':1, 'Diploma':2, 'Degree':3}

# Assigning ordinal numerical value to all types of education
df['Education'] = df['Education'].map(education_dict)

In [None]:
df.head()

For ordinal variables, it is best practice to define the mapping manually so that the numeric values follow the true order of the categories.
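
An alternative to a hand-written dictionary is an ordered pandas `Categorical`, where the order is declared once as a list. This is a sketch on a standalone Series, assuming the same SPM < Diploma < Degree ranking as the dictionary above:

```python
import pandas as pd

education = pd.Series(["Diploma", "SPM", "Degree", "SPM"])

# Declare the order explicitly, from lowest to highest
levels = ["SPM", "Diploma", "Degree"]
ordered = pd.Categorical(education, categories=levels, ordered=True)

# codes start at 0, so add 1 to match the 1/2/3 mapping used above
education_rank = pd.Series(ordered.codes + 1, index=education.index)
print(education_rank.tolist())
```

Values that are not listed in `levels` would receive the code -1, which makes unexpected categories easy to spot.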

# <a name="numeric">Numeric data</a>
Categorize the values in the Age column into age groups.
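
As a small standalone illustration of how `pd.cut` assigns values to intervals, the sketch below bins a made-up Series of ages. The label names are invented for readability and are not used elsewhere in this notebook:

```python
import pandas as pd

ages = pd.Series([18, 24, 30, 35])

# Intervals are right-inclusive by default: (0, 18], (18, 25], (25, 30], (30, 35]
# The label names are made up for illustration
groups = pd.cut(ages, bins=[0, 18, 25, 30, 35],
                labels=["<=18", "19-25", "26-30", "31-35"])
print(groups.tolist())
```

Because the right edge is included, an age of exactly 18 falls into the first group rather than the second.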

In [None]:
# Bin age values into intervals
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 18, 25, 30, 35, 40])

In [None]:
df.head()

# <a name="outlier">Handling Outliers</a>
## What Is an Outlier?
Mathematically, an outlier is a data point that is significantly greater or lower than the other values in the dataset.

## Finding Outliers Using a Boxplot
<img src="https://matplotlib.org/3.2.2/_images/boxplot_explanation.png" width="500"/>

[Image Source: Matplotlib](https://matplotlib.org/3.2.2/faq/howto_faq.html)

## Understanding the Boxplot
- A boxplot is a method for displaying the distribution of data.
- The interquartile range (IQR) spans from the first quartile (Q1) to the third quartile (Q3), so it contains the middle 50% of the data. We can use it to observe how the data is spread.

In [None]:
student = pd.DataFrame({
    "Student ID" : ['100', '101', '102', '103', '104', '105', '106', '107', '108', '109'],
    "Marks" : [75, 62, 1, 66, 80, 194, 80, 90, 2, 65]
    })

In [None]:
student

## Interquartile Range (IQR)
The interquartile range is the difference between the third and first quartiles: IQR = Q3 - Q1.

In [None]:
# sort in increasing order
sorted(student['Marks'])

In [None]:
# find q1(25%) and q3(75%)
q1, q3 = np.percentile(student['Marks'],[25,75])
print(q1, q3)

In [None]:
# find IQR
IQR = q3 - q1
print(IQR)

In [None]:
# find lower bound and upper bound
lower_bound = q1 - (1.5 * IQR)
upper_bound = q3 + (1.5 * IQR)

print(lower_bound, upper_bound)

As you can see from the example above, marks that fall below the lower bound or above the upper bound are considered outliers.
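
The steps above can be wrapped into a small reusable helper. This is a sketch only: the function name `iqr_bounds` and its parameter `k` are hypothetical, and the marks are repeated here so that the block runs on its own:

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

marks = pd.Series([75, 62, 1, 66, 80, 194, 80, 90, 2, 65])
low, high = iqr_bounds(marks)

# Boolean mask: True where a mark lies outside the bounds
is_outlier = (marks < low) | (marks > high)
print(low, high)
print(marks[is_outlier].tolist())
```

For these marks the mask flags the values 1, 2 and 194.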

In [None]:
student.boxplot()

The boxplot above shows the outliers as individual points beyond the whiskers.

## Trimming

In [None]:
stud_trim = student.copy() # copy so we do not change the original dataset

In [None]:
# replace the outlier values in the "Marks" column with NaN
stud_trim.loc[stud_trim['Marks'] < lower_bound, 'Marks'] = np.nan
stud_trim.loc[stud_trim['Marks'] > upper_bound, 'Marks'] = np.nan

In [None]:
stud_trim.boxplot()

The outliers have been removed from the 'Marks' column (their values are now NaN).
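
Trimming can also mean dropping the affected rows entirely instead of blanking the values. The following is a sketch of that variant, with the data and bounds recomputed so the block is self-contained:

```python
import numpy as np
import pandas as pd

student = pd.DataFrame({"Marks": [75, 62, 1, 66, 80, 194, 80, 90, 2, 65]})

q1, q3 = np.percentile(student["Marks"], [25, 75])
lower_bound = q1 - 1.5 * (q3 - q1)
upper_bound = q3 + 1.5 * (q3 - q1)

# Keep only rows whose mark lies within the bounds (inclusive)
trimmed_rows = student[student["Marks"].between(lower_bound, upper_bound)]
print(len(student), "->", len(trimmed_rows))
```

Which variant is preferable depends on whether the other columns of an outlier row are still useful for later analysis.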

## Winsorizing

In [None]:
stud_winsor = student.copy()

In [None]:
from scipy.stats.mstats import winsorize

In [None]:
stud_winsor

In [None]:
stud_winsor['Marks'] = winsorize(stud_winsor['Marks'], limits=[0.25, 0.1])

In [None]:
stud_winsor.boxplot()

In [None]:
stud_winsor

As you can see from the table above, the dataset has been winsorized: the extreme values (outliers) have been replaced by the lowest and highest of the remaining, non-extreme values.
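
To see what the `limits` argument does, the sketch below applies the same call to a plain NumPy array of the marks. With 10 values, a lower limit of 0.25 replaces the 2 smallest values and an upper limit of 0.1 replaces the single largest one:

```python
import numpy as np
from scipy.stats.mstats import winsorize

marks = np.array([75, 62, 1, 66, 80, 194, 80, 90, 2, 65])

# limits=[0.25, 0.1] on 10 values: the 2 smallest and the 1 largest are replaced
capped = np.asarray(winsorize(marks, limits=[0.25, 0.1]))
print(capped.tolist())
```

Unlike trimming, winsorizing keeps the number of observations unchanged.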

# <a name="exercise">Exercise</a>
Perform feature engineering on the dataset given below.

In [None]:
food = pd.read_csv('../data/food_preference.csv')

In [None]:
food.head()

<b>You may follow the guidelines below to begin this exercise. Good luck!</b>

In [None]:
# Step 1 : Check whether there are missing values
food.info()

In [None]:
# Step 2 : Delete rows that contain null values
food.dropna(axis=0, inplace=True)

In [None]:
# Step 3 : Find unique values for categorical columns only

# remove 'Timestamp' and 'Participant_ID' columns
cat_df = food[food.columns.difference(['Timestamp', 'Participant_ID'])]

# Print the unique values of each categorical column in cat_df
for i in cat_df.columns:
    if cat_df.dtypes[i] == 'object':
        print("Column: {}".format(i))
        print(cat_df[i].unique())

In [None]:
# Step 4 : Standardize Nationality values to malaysian and non-malaysian

# Hint : use this values
nationality = {
    "malaysian" : "malaysian",
    "indian" : "non-malaysian",
    "pakistani" : "non-malaysian",
    "tanzanian" : "non-malaysian",
    "indonesia" : "non-malaysian",
    "pakistan" : "non-malaysian",
    "maldivian" : "non-malaysian",
    "my" : "malaysian",
    "indonesian" : "non-malaysian",
    "malaysia" : "malaysian",
    "canadian" : "non-malaysian",
    "nigerian" : "non-malaysian",
    "algerian" : "non-malaysian",
    "korean" : "non-malaysian",
    "seychellois" : "non-malaysian",
    "indonesain" : "non-malaysian",
    "japan" : "non-malaysian",
    "china" : "non-malaysian",
    "mauritian" : "non-malaysian",
    "yemen" : "non-malaysian"
}

# Start your solution here
food["Nationality"] = food["Nationality"].apply(str.lower).apply(str.strip).apply(lambda x:nationality[x])

In [None]:
# Step 5 : Perform feature engineering for the rest of categorical data

cat_df = food[food.columns.difference(['Timestamp', 'Participant_ID', 'Age'])]

# Start your solution here
food = pd.get_dummies(food, columns=cat_df.columns, drop_first=True)

In [None]:
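# Step 6 (Bonus) : Extract date from 'Timestamp' column

import datetime
# Hint : New date format %d/%m/%Y

def date_convert(date_to_convert):
    try:
        # Only the date part is kept; the time fields are parsed and then discarded
        return datetime.datetime.strptime(date_to_convert, "%Y/%m/%d %H:%M:%S %p GMT+8").strftime('%d/%m/%Y')
    except (ValueError, TypeError):
        # The value is not a string in the expected timestamp format
        return "Error"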

# Start your solution here
food['new_date'] = food['Timestamp'].apply(date_convert)

In [None]:
food.head()

Congratulations, now you have a better understanding of how to include feature engineering in your project.

# Further Readings
* <a href=https://heartbeat.comet.ml/hands-on-with-feature-engineering-techniques-encoding-categorical-variables-be4bc0715394>More feature engineering techniques</a>