---
layout: post
title:  "Feature Engineering"
date:   2023-06-02 10:14:54 +0700
categories: MachineLearning
---

# Introduction
Intuitively, feature engineering means getting to know the data well enough that we can handcraft new features that represent the dataset better and improve the model's predictions.

Some common methods are:

- Binning/bucketing: For example, in a dataset about home credit default rates, it can make more sense to divide the client's age into categories: under 20, from 20 to 30, from 30 to 40, from 40 to 50, and above 50. The reasoning behind this division is that clients under 20 are not allowed to take a loan, and clients above 50 can be put into one group because the most common ages for taking loans are from 20 to 50. The range from 20 to 50 is then divided equally. These unequal buckets can describe the age groups better than the raw age and may help the model generalize.

- Polynomial features: We can add powers of a feature, for example its square, when we assume that the feature has a nonlinear relationship with the target.

- Feature interaction: This combines different features under the assumption that they are related to one another. For example, we can combine the family-related features of a client (with a simple linear combination or a more complicated formula). The new feature then gives an overview of the client's family status.

- Categorical feature handling: Since we usually need to transform categorical features into numerical ones, common options are one-hot encoding (encode the value as a vector of 0s with a single 1 at the position of the category it belongs to) and label encoding (encode each category as a different integer).

- Date time variables: If we have date and time data, we can add a lagged variable (the value of the feature on some day in the past) or calculate the interval between two dates (for example, the age of the house or car of a client who requests a loan).

- Feature scaling: Since features are different in nature, they naturally use different units and scales. This can hurt models that are sensitive to feature magnitude (for example, distance-based models or models trained with gradient descent), because features with large values tend to dominate. We can bring all features onto a comparable scale so that the model can use them more evenly. The two most popular ways are min-max scaling and standardization. In min-max scaling, we rescale each feature to a fixed range, typically from 0 to 1. This is also called normalization. In standardization, we subtract the mean from each value and divide by the standard deviation of the sample.
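Several of the methods above can be sketched in a few lines of pandas. The snippet below is a minimal illustration on made-up data: every column name, value, and the `application_date` reference date are assumptions for the example and do not come from the Home Credit files. The age buckets are assumed to be closed on the left, so a 20-year-old falls into the "20-30" bucket.

```python
import numpy as np
import pandas as pd

# Toy data: all names and values are hypothetical, for illustration only.
df = pd.DataFrame({
    "age": [18, 25, 34, 47, 63],
    "income": [20_000.0, 35_000.0, 50_000.0, 80_000.0, 40_000.0],
    "n_children": [0, 1, 2, 3, 0],
    "n_family_members": [1, 3, 4, 5, 2],
    "car_purchased": pd.to_datetime(
        ["2020-01-01", "2015-06-15", "2018-03-10", "2010-09-30", "2022-12-01"]
    ),
})
application_date = pd.Timestamp("2023-06-02")  # assumed reference date

# Binning: unequal-width age buckets, left-closed ([20, 30), [30, 40), ...).
age_edges = [0, 20, 30, 40, 50, np.inf]
age_labels = ["<20", "20-30", "30-40", "40-50", "50+"]
df["age_bucket"] = pd.cut(df["age"], bins=age_edges, labels=age_labels, right=False)

# Polynomial feature: the square of an existing feature.
df["income_squared"] = df["income"] ** 2

# Feature interaction: combine two family-related columns into one ratio.
df["children_ratio"] = df["n_children"] / df["n_family_members"]

# Date interval: age of the car in days at application time.
df["car_age_days"] = (application_date - df["car_purchased"]).dt.days

# Min-max scaling: rescale to the range [0, 1].
income = df["income"]
df["income_minmax"] = (income - income.min()) / (income.max() - income.min())

# Standardization: subtract the mean, divide by the sample standard deviation.
df["income_standardized"] = (income - income.mean()) / income.std()

print(df[["age", "age_bucket", "income_minmax", "income_standardized"]])
```

In a real pipeline the scaling statistics (min, max, mean, standard deviation) would be computed on the training set only and then reused for the test set.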

# Code example

In the Home Credit Default Risk dataset, there are about 50 features that describe the building the client lives in. We can combine those features into a new one named "living_condition". The cells below prepare the data and gather those columns into one table.

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
```

```python
labels = pd.read_csv('home-credit-default-risk/HomeCredit_columns_description.csv', encoding='ISO-8859-1')
data = pd.read_csv('home-credit-default-risk/application_train.csv')
```

```python
labels.head()
```

|  | Unnamed: 0 | Table | Row | Description | Special |
|---|---|---|---|---|---|
| 0 | 1 | application_{train\|test}.csv | SK_ID_CURR | ID of loan in our sample |  |
| 1 | 2 | application_{train\|test}.csv | TARGET | Target variable (1 - client with payment diffi... |  |
| 2 | 5 | application_{train\|test}.csv | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving |  |
| 3 | 6 | application_{train\|test}.csv | CODE_GENDER | Gender of the client |  |
| 4 | 7 | application_{train\|test}.csv | FLAG_OWN_CAR | Flag if the client owns a car |  |


```python
# First, take the names of all the features related to the building
living_condition = labels['Row'][44:91]
living_condition
```

```
44                  APARTMENTS_AVG
45                BASEMENTAREA_AVG
46     YEARS_BEGINEXPLUATATION_AVG
47                 YEARS_BUILD_AVG
48                  COMMONAREA_AVG
49                   ELEVATORS_AVG
50                   ENTRANCES_AVG
51                   FLOORSMAX_AVG
52                   FLOORSMIN_AVG
53                    LANDAREA_AVG
54            LIVINGAPARTMENTS_AVG
55                  LIVINGAREA_AVG
56         NONLIVINGAPARTMENTS_AVG
57               NONLIVINGAREA_AVG
58                 APARTMENTS_MODE
59               BASEMENTAREA_MODE
60    YEARS_BEGINEXPLUATATION_MODE
61                YEARS_BUILD_MODE
62                 COMMONAREA_MODE
63                  ELEVATORS_MODE
64                  ENTRANCES_MODE
65                  FLOORSMAX_MODE
66                  FLOORSMIN_MODE
67                   LANDAREA_MODE
68           LIVINGAPARTMENTS_MODE
69                 LIVINGAREA_MODE
70        NONLIVINGAPARTMENTS_MODE
71              NONLIVINGAREA_MODE
72                 A
```

```python
# Take a first look at the raw data before preprocessing
data.head()
```

|  | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 |  |  |  |  |  |  |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


```python
y_train = data['TARGET']
X_train = data.drop(['TARGET'], axis=1)
y_train = y_train.to_frame()
y_train
```

|  | TARGET |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 307506 | 0 |
| 307507 | 0 |
| 307508 | 0 |
| 307509 | 1 |


```python
# Handle categorical / numerical variables and missing values

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']

# Split the columns by dtype
X_train_categorical = X_train.select_dtypes(include=categoricals)
X_train_numerical = X_train.select_dtypes(include=numerics)

categorical_columns = X_train_categorical.columns
numerical_columns = X_train_numerical.columns

# Fill missing categories with the most frequent value and missing numbers with the column mean
categorical_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
numerical_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform returns a NumPy array, so rebuild each DataFrame with its original column names
X_train_categorical = categorical_imputer.fit_transform(X_train_categorical)
X_train_categorical = pd.DataFrame(data=X_train_categorical, columns=categorical_columns)

X_train_numerical = numerical_imputer.fit_transform(X_train_numerical)
X_train_numerical = pd.DataFrame(data=X_train_numerical, columns=numerical_columns)
```


The catch with using a label encoder instead of a one-hot encoder is that label encoding carries an inherent assumption that the values are ordered. This might or might not reflect the qualitative meaning of the values in reality. For example, suppose we assign houses to 3 districts (district 1, district 2, district 3) and encode them as 0, 1, and 2. Since 1 > 0, the model might infer that district 2 ranks above district 1, which may not reflect the real situation if there is no inherent difference between the two locations (for example, they are at an equal distance from the center). We can take this bias into account and create a new variable (via clustering or via distance to the center) to compensate for it in the model. The same goes for the days of the week: the meaning of Monday, Tuesday, through Sunday might not be linear. We can hope that the model has enough data to learn this representation anyway.

One-hot encoding, on the other hand, treats the categories as unordered: it puts a 1 at the position of the value's category and 0s elsewhere in the representation vector. For example, a house in district 1 is represented as [1,0,0].
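To make the contrast concrete, here is a minimal sketch on a made-up `district` column (the values are illustrative and not taken from the dataset). It uses `LabelEncoder` from scikit-learn for label encoding and `pd.get_dummies` from pandas for one-hot encoding, which is only one of several ways to produce one-hot vectors.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with three unordered categories.
districts = pd.Series(
    ["district 1", "district 2", "district 3", "district 1"], name="district"
)

# Label encoding: each category becomes one integer (classes are sorted first),
# which implicitly introduces an order 0 < 1 < 2.
label_encoded = LabelEncoder().fit_transform(districts)
print(label_encoded)

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(districts, dtype=int)
print(one_hot)
```

The one-hot version needs one column per category, so it grows quickly for features with many distinct values, while the label-encoded version always stays a single column.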

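```python
# Label-encode every categorical column; the encoder is refit on each column in turn
X_train_categorical = X_train_categorical.apply(LabelEncoder().fit_transform)
X_train_categorical
```

|  | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 6 | 7 | 4 | 3 | 1 | 8 | 6 | 5 | 2 | 0 | 5 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 4 | 1 | 1 | 1 | 3 | 1 | 39 | 2 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 6 | 7 | 4 | 3 | 1 | 8 | 1 | 11 | 2 | 0 | 4 | 0 |
| 3 | 0 | 0 | 0 | 1 | 6 | 7 | 4 | 0 | 1 | 8 | 6 | 5 | 2 | 0 | 4 | 0 |
| 4 | 0 | 1 | 0 | 1 | 6 | 7 | 4 | 3 | 1 | 3 | 4 | 37 | 2 | 0 | 4 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 0 | 1 | 0 | 0 | 6 | 7 | 4 | 2 | 5 | 14 | 4 | 43 | 2 | 0 | 5 | 0 |
| 307507 | 0 | 0 | 0 | 1 | 6 | 3 | 4 | 5 | 1 | 8 | 1 | 57 | 2 | 0 | 5 | 0 |
| 307508 | 0 | 0 | 0 | 1 | 6 | 7 | 1 | 2 | 1 | 10 | 4 | 39 | 2 | 0 | 4 | 0 |
| 307509 | 0 | 0 | 0 | 1 | 6 | 1 | 4 | 1 | 1 | 8 | 6 | 3 | 2 | 0 | 5 | 0 |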


```python
# Some of the features are categorical ('FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'),
# the rest are numerical
living_condition_categoricals = ['FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
# np.setdiff1d returns the sorted names that are in living_condition but not in the categorical list.
# An order-preserving alternative:
# living_condition_numericals = [e for e in living_condition if e not in living_condition_categoricals]
living_condition_numericals = np.setdiff1d(living_condition, living_condition_categoricals)
X_train_numerical[living_condition_numericals]
```

|  | APARTMENTS_AVG | APARTMENTS_MEDI | APARTMENTS_MODE | BASEMENTAREA_AVG | BASEMENTAREA_MEDI | BASEMENTAREA_MODE | COMMONAREA_AVG | COMMONAREA_MEDI | COMMONAREA_MODE | ELEVATORS_AVG | ... | NONLIVINGAREA_AVG | NONLIVINGAREA_MEDI | NONLIVINGAREA_MODE | TOTALAREA_MODE | YEARS_BEGINEXPLUATATION_AVG | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_AVG | YEARS_BUILD_MEDI | YEARS_BUILD_MODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02470 | 0.02500 | 0.025200 | 0.036900 | 0.036900 | 0.038300 | 0.014300 | 0.014400 | 0.014400 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.014900 | 0.972200 | 0.972200 | 0.972200 | 0.619200 | 0.624300 | 0.634100 |
| 1 | 0.09590 | 0.09680 | 0.092400 | 0.052900 | 0.052900 | 0.053800 | 0.060500 | 0.060800 | 0.049700 | 0.080000 | ... | 0.009800 | 0.010000 | 0.000000 | 0.071400 | 0.985100 | 0.985100 | 0.985100 | 0.796000 | 0.798700 | 0.804000 |
| 2 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.028358 | 0.028236 | 0.027022 | 0.102547 | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 |
| 3 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.028358 | 0.028236 | 0.027022 | 0.102547 | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 |
| 4 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.028358 | 0.028236 | 0.027022 | 0.102547 | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 0.20210 | 0.20400 | 0.100800 | 0.088700 | 0.088700 | 0.017200 | 0.020200 | 0.020300 | 0.017200 | 0.220000 | ... | 0.109500 | 0.111800 | 0.012500 | 0.289800 | 0.987600 | 0.987600 | 0.978200 | 0.830000 | 0.832300 | 0.712500 |
| 307507 | 0.02470 | 0.02500 | 0.025200 | 0.043500 | 0.043500 | 0.045100 | 0.002200 | 0.002200 | 0.002200 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.021400 | 0.972700 | 0.972700 | 0.972700 | 0.626000 | 0.631000 | 0.640600 |
| 307508 | 0.10310 | 0.10410 | 0.105000 | 0.086200 | 0.086200 | 0.089400 | 0.012300 | 0.012400 | 0.012400 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.797000 | 0.981600 | 0.981600 | 0.981600 | 0.748400 | 0.751800 | 0.758300 |
| 307509 | 0.01240 | 0.01250 | 0.012600 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.028358 | 0.028236 | 0.027022 | 0.008600 | 0.977100 | 0.977100 | 0.977200 | 0.752471 | 0.755746 | 0.759637 |


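```python
X_train_categorical[living_condition_categoricals]
```

|  | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE |
|---|---|---|---|---|
| 0 | 2 | 0 | 5 | 0 |
| 1 | 2 | 0 | 0 | 0 |
| 2 | 2 | 0 | 4 | 0 |
| 3 | 2 | 0 | 4 | 0 |
| 4 | 2 | 0 | 4 | 0 |
| ... | ... | ... | ... | ... |
| 307506 | 2 | 0 | 5 | 0 |
| 307507 | 2 | 0 | 5 | 0 |
| 307508 | 2 | 0 | 4 | 0 |
| 307509 | 2 | 0 | 5 | 0 |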


```python
X_train_living_condition = pd.concat([X_train_numerical[living_condition_numericals], X_train_categorical[living_condition_categoricals]], axis=1)
X_train_living_condition
```

|  | APARTMENTS_AVG | APARTMENTS_MEDI | APARTMENTS_MODE | BASEMENTAREA_AVG | BASEMENTAREA_MEDI | BASEMENTAREA_MODE | COMMONAREA_AVG | COMMONAREA_MEDI | COMMONAREA_MODE | ELEVATORS_AVG | ... | YEARS_BEGINEXPLUATATION_AVG | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_AVG | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02470 | 0.02500 | 0.025200 | 0.036900 | 0.036900 | 0.038300 | 0.014300 | 0.014400 | 0.014400 | 0.000000 | ... | 0.972200 | 0.972200 | 0.972200 | 0.619200 | 0.624300 | 0.634100 | 2 | 0 | 5 | 0 |
| 1 | 0.09590 | 0.09680 | 0.092400 | 0.052900 | 0.052900 | 0.053800 | 0.060500 | 0.060800 | 0.049700 | 0.080000 | ... | 0.985100 | 0.985100 | 0.985100 | 0.796000 | 0.798700 | 0.804000 | 2 | 0 | 0 | 0 |
| 2 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
| 3 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
| 4 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 0.20210 | 0.20400 | 0.100800 | 0.088700 | 0.088700 | 0.017200 | 0.020200 | 0.020300 | 0.017200 | 0.220000 | ... | 0.987600 | 0.987600 | 0.978200 | 0.830000 | 0.832300 | 0.712500 | 2 | 0 | 5 | 0 |
| 307507 | 0.02470 | 0.02500 | 0.025200 | 0.043500 | 0.043500 | 0.045100 | 0.002200 | 0.002200 | 0.002200 | 0.000000 | ... | 0.972700 | 0.972700 | 0.972700 | 0.626000 | 0.631000 | 0.640600 | 2 | 0 | 5 | 0 |
| 307508 | 0.10310 | 0.10410 | 0.105000 | 0.086200 | 0.086200 | 0.089400 | 0.012300 | 0.012400 | 0.012400 | 0.000000 | ... | 0.981600 | 0.981600 | 0.981600 | 0.748400 | 0.751800 | 0.758300 | 2 | 0 | 4 | 0 |
| 307509 | 0.01240 | 0.01250 | 0.012600 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977100 | 0.977100 | 0.977200 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 5 | 0 |
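
The post stops at gathering the building columns and does not show how to collapse them into a single feature. One simple option, sketched here as an assumption rather than as the method used in this post, is to standardize the numerical columns and average them per client. The `toy` frame below is a hypothetical stand-in for `X_train_living_condition`; its three columns and three rows reuse values from the output above only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the numerical part of X_train_living_condition.
toy = pd.DataFrame({
    "APARTMENTS_AVG": [0.0247, 0.0959, 0.2021],
    "BASEMENTAREA_AVG": [0.0369, 0.0529, 0.0887],
    "YEARS_BUILD_AVG": [0.6192, 0.7960, 0.8300],
})

# Standardize each column so that no single feature dominates the average.
standardized = (toy - toy.mean()) / toy.std()

# Assumed combination rule: the per-row mean of the standardized columns.
toy["living_condition"] = standardized.mean(axis=1)
print(toy)
```

A plain average treats every column as equally important and assumes that larger values always mean the same thing, so it is only a starting point. The label-encoded categorical columns are left out on purpose, because averaging arbitrary category codes has no clear meaning.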
