## Machine Learning Process <font color='blue'> Flowchart </font>:
### 1. <font color='blue'> Importing Data to Python</font> 
    * Drop Duplicates 
### 2. <font color='blue'> Data Preprocessing:</font> 
    * Input-Output Split, Train-Test Split
    * Imputation, Processing Categorical, Normalization 
### 3. <font color='blue'> Training Machine Learning:</font> 
    * Choose Score to optimize and Hyperparameter Space
### 4. <font color='blue'> Test Prediction:</font> 
    * Evaluate model performance on Test Data
    

## 1. <font color='blue'> Importing Data to Python</font>

In [164]:
# Import libraries
# Import NumPy as np
# and pandas as pd
import pandas as pd
import numpy as np


## Dataset Information
 
 
### House Prices: Advanced Regression Techniques

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 31 explanatory variables describing aspects of residential homes in Ames, Iowa, this exercise challenges you to predict the final price of each home.  

See the data description here: https://drive.google.com/open?id=1tFd-1tD3Z13XJvz-JcMjHg6pg2-H8T3Wi-4fDIaZta0

In [165]:
# Read the dataset
data = pd.read_csv("train.csv")

In [166]:
# Check the first 5 observations of the dataset
data.head()

Unnamed: 0.1,Unnamed: 0,LotFrontage,LotArea,Utilities,MasVnrType,MasVnrArea,HouseStyle,Heating,BsmtQual,BsmtCond,...,GarageQual,GarageCond,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,SaleType,SaleCondition,SalePrice
0,0,65.0,8450,AllPub,BrkFace,196.0,2Story,GasA,Gd,TA,...,TA,TA,0,61,0,0,0,WD,Normal,208500
1,1,80.0,9600,AllPub,,0.0,1Story,GasA,Gd,TA,...,TA,TA,298,0,0,0,0,WD,Normal,181500
2,2,68.0,11250,AllPub,BrkFace,162.0,2Story,GasA,Gd,TA,...,TA,TA,0,42,0,0,0,WD,Normal,223500
3,3,60.0,9550,AllPub,,0.0,2Story,GasA,TA,Gd,...,TA,TA,0,35,272,0,0,WD,Abnorml,140000
4,4,84.0,14260,AllPub,BrkFace,350.0,2Story,GasA,Gd,TA,...,TA,TA,192,84,0,0,0,WD,Normal,250000


## Dropping Duplicates

In [167]:
# Check the shape of the data before dropping duplicates
data.shape

(1460, 33)

In [168]:
# Check whether there are any duplicate observations
data.duplicated().sum()

0

In [169]:
# Print the relative frequency of each value, column by column
for i in data.columns:
    print(data[i].value_counts(normalize=True), '\n\n')

1459    0.000685
478     0.000685
480     0.000685
481     0.000685
482     0.000685
483     0.000685
484     0.000685
485     0.000685
486     0.000685
487     0.000685
488     0.000685
489     0.000685
490     0.000685
491     0.000685
492     0.000685
493     0.000685
494     0.000685
495     0.000685
496     0.000685
497     0.000685
498     0.000685
479     0.000685
477     0.000685
500     0.000685
476     0.000685
457     0.000685
458     0.000685
459     0.000685
460     0.000685
461     0.000685
          ...   
995     0.000685
996     0.000685
997     0.000685
998     0.000685
999     0.000685
1000    0.000685
1001    0.000685
982     0.000685
981     0.000685
980     0.000685
969     0.000685
961     0.000685
962     0.000685
963     0.000685
964     0.000685
965     0.000685
966     0.000685
967     0.000685
968     0.000685
970     0.000685
979     0.000685
971     0.000685
972     0.000685
973     0.000685
974     0.000685
975     0.000685
976     0.000685
977     0.0006

In [170]:
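# Select the numeric columns via the private helper _get_numeric_data()
numeric = data._get_numeric_data()
# NOTE: this line reassigns `numeric` from `data`, not from the numeric subset,
# so the result still contains the categorical columns (hence 32 columns below)
numeric = data.drop("Fireplaces", axis = 1)
numeric.shape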

(1460, 32)

In [171]:
categoric = ['MasVnrType', "HouseStyle",
             "BsmtQual", "BsmtExposure", "BsmtFinType1", 
            "Fireplaces", "GarageType", 
             "GarageFinish", "SaleType", 
             "SaleCondition"]
categoric_numeric = ['Fireplaces']
drop =['Unnamed: 0','Heating','Utilities', "BsmtCond","BsmtFinType2","Electrical", "GarageQual",
       "GarageCond",'BsmtFinSF2','EnclosedPorch','3SsnPorch','ScreenPorch']
numeric = [x for x in data.columns if x not in categoric and x not in drop]


### Make function to import and drop 

Write a function that does the following:

 1. imports the data
 2. checks the NUMBER OF OBSERVATIONS and the NUMBER OF COLUMNS
 3. drops duplicates
 4. drops unnecessary columns
 5. checks the NUMBER OF OBSERVATIONS and the NUMBER OF COLUMNS again after dropping
 6. returns the data after dropping

The function is named `importData` and takes 2 arguments:

 1. `filepath`     : the path where the data is stored
 2. `drop_columns` : the names of the columns to remove
 
Then assign the result of the function to a variable named `data`.

In [172]:
# Define the function
def importData(filepath, drop_columns):
    data = pd.read_csv(filepath)
    data = data.drop(drop_columns, axis = 1)
    data = data.drop_duplicates()
    
    return data

In [173]:
# Assign the result of the function to the variable data

data = importData("train.csv", drop)
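The specification above also asks for the number of observations and columns before and after dropping, which `importData` does not report. A minimal sketch of how those checks might look follows. The name `import_data_verbose` and the tiny in-memory CSV are assumptions made for illustration, not part of the original notebook.

```python
import io
import pandas as pd

def import_data_verbose(source, drop_columns):
    """Hypothetical variant of importData that also reports shapes."""
    data = pd.read_csv(source)
    print("before:", data.shape)            # (rows, columns) as loaded
    data = data.drop_duplicates()           # remove fully identical rows
    data = data.drop(columns=drop_columns)  # remove unneeded columns
    print("after: ", data.shape)            # (rows, columns) after both drops
    return data

# Tiny in-memory CSV standing in for train.csv: the last two rows are identical
csv_text = "Id,LotArea,SalePrice\n1,8450,208500\n2,9600,181500\n2,9600,181500\n"
demo = import_data_verbose(io.StringIO(csv_text), ["Id"])
```

Here duplicates are dropped before the columns, following the order of the specification; `importData` drops the columns first, which can change which rows count as duplicates.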

## 2. <font color='blue'>Data Preprocessing </font>
### Input-Output Split

Here we separate the columns into input and output.

The data used as input is named `x`, and the output is named `y`.

In this dataset, only the `SalePrice` column is needed as the output.

In [174]:
# Inspect the data with head()
data.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrType,MasVnrArea,HouseStyle,BsmtQual,BsmtExposure,BsmtFinType1,BsmtFinSF1,1stFlrSF,...,Fireplaces,GarageType,GarageFinish,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,SaleType,SaleCondition,SalePrice
0,65.0,8450,BrkFace,196.0,2Story,Gd,No,GLQ,706,856,...,0,Attchd,RFn,2,548,0,61,WD,Normal,208500
1,80.0,9600,,0.0,1Story,Gd,Gd,ALQ,978,1262,...,1,Attchd,RFn,2,460,298,0,WD,Normal,181500
2,68.0,11250,BrkFace,162.0,2Story,Gd,Mn,GLQ,486,920,...,1,Attchd,RFn,2,608,0,42,WD,Normal,223500
3,60.0,9550,,0.0,2Story,TA,No,ALQ,216,961,...,1,Detchd,Unf,3,642,0,35,WD,Abnorml,140000
4,84.0,14260,BrkFace,350.0,2Story,Gd,Av,GLQ,655,1145,...,1,Attchd,RFn,3,836,192,84,WD,Normal,250000


### Make function for input and output

Write a function that produces the following:

1. data_input
2. data_output
3. returns data_input and data_output
* The purpose of wrapping this in a function is so that it can be reused in other cases.

The function is named `extractInputOutput` and takes 2 arguments:

1. `data`   : the dataset to split
2. `output` : the name of the column to use as the output


In [175]:
# Define the function here
data = data.drop("Fireplaces",axis=1)
def extractInputOutput(data, output):
    y = data[output]
    x = data.drop(output, axis=1)
    print(x.columns) #optional
    print(y.head()) #optional
    return x,y

# Assign the result of the function to x, y.
# x: input data
# y: output data
x,y = extractInputOutput(data, 'SalePrice')

Index(['LotFrontage', 'LotArea', 'MasVnrType', 'MasVnrArea', 'HouseStyle',
       'BsmtQual', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', '1stFlrSF',
       '2ndFlrSF', 'GarageType', 'GarageFinish', 'GarageCars', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'SaleType', 'SaleCondition'],
      dtype='object')
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64


## Train and Test Split

In this part, x and y are each divided into 2 sets: training and test. We use the `train_test_split` function from the scikit-learn library.

In [176]:
# import the train_test_split function from scikit-learn
from sklearn.model_selection import train_test_split

#### Train Test Split Function
1. x is the input
2. y is the output
3. test size = how large the test set is, e.g. 0.20 for a test set of 20% of the data
4. random state is the seed for the random split; it must be set to the same value to reproduce the split, e.g. random_state = 123
5. Output: 
    * x_train = input of the training data
    * x_test = input of the test data
    * y_train = output of the training data
    * y_test = output of the test data
6. the order of x_train, x_test, y_train and y_test must not be swapped

In [177]:
# Split dataset
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =0.2,
                                                        random_state =100)
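To illustrate the role of `test_size` and `random_state`, here is a small sketch on toy data. The frame `x_demo` and series `y_demo` are made up for the example and only mimic the shape of the real inputs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 observations, one feature, one target
x_demo = pd.DataFrame({"LotArea": np.arange(10) * 1000})
y_demo = pd.Series(np.arange(10) * 10000, name="SalePrice")

# Two calls with the same random_state produce the same split
a_train, a_test, b_train, b_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=100)
c_train, c_test, d_train, d_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=100)

print(a_train.shape, a_test.shape)          # 8 training rows, 2 test rows
print(a_test.index.equals(c_test.index))    # True: the split is reproducible
print(a_train.index.equals(b_train.index))  # True: x and y rows stay aligned
```

The return order is always input-train, input-test, output-train, output-test, which is why the unpacking order must not be swapped.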

In [178]:
# Check the first rows of each set (x_train, x_test, y_train, y_test)
print(x_train.head(), y_train.head())
print(x_test.head(), y_test.head())


      LotFrontage  LotArea MasVnrType  MasVnrArea HouseStyle BsmtQual  \
657          60.0     7200       None         0.0     2Story       Gd   
965          65.0    10237       None         0.0     2Story       Gd   
1441          NaN     4426    BrkFace       147.0     1Story       Gd   
1444         63.0     8500    BrkFace       106.0     1Story       Gd   
522          50.0     5000       None         0.0     1.5Fin       TA   

     BsmtExposure BsmtFinType1  BsmtFinSF1  1stFlrSF  2ndFlrSF GarageType  \
657            No          Unf           0       851       651     Attchd   
965            No          Unf           0       783       701     Attchd   
1441           Av          GLQ         697       848         0     Attchd   
1444           Av          Unf           0      1422         0     Attchd   
522            No          ALQ         399      1004       660     Detchd   

     GarageFinish  GarageCars  GarageArea  WoodDeckSF  OpenPorchSF SaleType  \
657           RFn  

## Separating Numerical and Categorical Data Manually

## Getting Numerical

In [179]:
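# get numeric columns using ._get_numeric_data()
# NOTE: this cell uses the full input x, not x_train; the function below is applied to x_train
x_train_num = x._get_numeric_data()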

In [180]:
# check the columns
x_train_num.columns

Index(['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', '1stFlrSF',
       '2ndFlrSF', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF'],
      dtype='object')

In [181]:
x_train_num.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF
0,65.0,8450,196.0,706,856,854,2,548,0,61
1,80.0,9600,0.0,978,1262,0,2,460,298,0
2,68.0,11250,162.0,486,920,866,2,608,0,42
3,60.0,9550,0.0,216,961,756,3,642,0,35
4,84.0,14260,350.0,655,1145,1053,3,836,192,84


## Getting Categorical


In [182]:
# Get the categorical columns by dropping the numeric ones (again from the full input x)
x_train_cat = x.drop(x_train_num.columns, axis=1)

In [183]:
# check the top observations!
x_train_cat.head()

Unnamed: 0,MasVnrType,HouseStyle,BsmtQual,BsmtExposure,BsmtFinType1,GarageType,GarageFinish,SaleType,SaleCondition
0,BrkFace,2Story,Gd,No,GLQ,Attchd,RFn,WD,Normal
1,,1Story,Gd,Gd,ALQ,Attchd,RFn,WD,Normal
2,BrkFace,2Story,Gd,Mn,GLQ,Attchd,RFn,WD,Normal
3,,2Story,TA,No,ALQ,Detchd,Unf,WD,Abnorml
4,BrkFace,2Story,Gd,Av,GLQ,Attchd,RFn,WD,Normal


### Make a function for Separating Numerical and Categorical

In [184]:
# Def a function that returns x_train numerical and x_train categorical
def splitNumCat(data):
    data_num = data._get_numeric_data()
    data_cat = data.drop(list(data_num.columns.values) , axis = 1)

    return data_num, data_cat

x_train_num, x_train_cat = splitNumCat(x_train)
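`_get_numeric_data()` is a private pandas method. A possible alternative, sketched here under the assumption that the public `select_dtypes` API is acceptable for this data, is shown below. The function name `split_num_cat_public` and the toy frame are hypothetical, and the two approaches may treat some dtypes (for example booleans) differently.

```python
import pandas as pd

def split_num_cat_public(data):
    """Hypothetical variant of splitNumCat built on the public select_dtypes API."""
    data_num = data.select_dtypes(include="number")  # int and float columns
    data_cat = data.select_dtypes(exclude="number")  # everything else
    return data_num, data_cat

# Toy frame (hypothetical) with two numeric columns and one categorical column
demo = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, None],
    "HouseStyle": ["2Story", "1Story"],
})
demo_num, demo_cat = split_num_cat_public(demo)
print(list(demo_num.columns))  # ['LotArea', 'LotFrontage']
print(list(demo_cat.columns))  # ['HouseStyle']
```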


In [185]:
# check the top of the x_train numerical observations!
x_train_num.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF
657,60.0,7200,0.0,0,851,651,1,270,0,0
965,65.0,10237,0.0,0,783,701,2,393,0,72
1441,,4426,147.0,697,848,0,2,420,149,0
1444,63.0,8500,106.0,0,1422,0,2,626,192,60
522,50.0,5000,0.0,399,1004,660,2,420,0,24


In [186]:
# check the top of the x_train categorical observations!
x_train_cat.head()

Unnamed: 0,MasVnrType,HouseStyle,BsmtQual,BsmtExposure,BsmtFinType1,GarageType,GarageFinish,SaleType,SaleCondition
657,,2Story,Gd,No,Unf,Attchd,RFn,WD,Normal
965,,2Story,Gd,No,Unf,Attchd,Fin,New,Partial
1441,BrkFace,1Story,Gd,Av,GLQ,Attchd,RFn,WD,Normal
1444,BrkFace,1Story,Gd,Av,Unf,Attchd,RFn,WD,Normal
522,,1.5Fin,TA,No,ALQ,Detchd,Unf,WD,Normal


## Data Imputation

Data imputation is the process of filling in missing values, which are usually shown as NaN.

The process is split into 2 parts:
* Numerical Imputation
* Categorical Imputation

In [187]:
# Check for missing values in the training set input
x_train.isnull().sum()

LotFrontage      209
LotArea            0
MasVnrType         7
MasVnrArea         7
HouseStyle         0
BsmtQual          26
BsmtExposure      27
BsmtFinType1      26
BsmtFinSF1         0
1stFlrSF           0
2ndFlrSF           0
GarageType        68
GarageFinish      68
GarageCars         0
GarageArea         0
WoodDeckSF         0
OpenPorchSF        0
SaleType           0
SaleCondition      0
dtype: int64

## Numerical Data Imputation

### Checking for NaN (Not A Number)

In [188]:
# check the missing value of the x_train_num
x_train_num.isnull().sum()

LotFrontage    209
LotArea          0
MasVnrArea       7
BsmtFinSF1       0
1stFlrSF         0
2ndFlrSF         0
GarageCars       0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
dtype: int64

In [189]:
# Import library for imputation
# (Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it)
from sklearn.impute import SimpleImputer

In [190]:
# instantiate SimpleImputer as imput; do not forget the parentheses ()
# missing_values is the marker for missing values in the data, e.g. NaN, 9999, or "KOSONG"
# strategy='median' is the imputation strategy: missing values are replaced with the column median
# strategy can also be set to 'mean' to use the average
# see median: https://en.wikipedia.org/wiki/Median

imput = SimpleImputer(missing_values=np.nan, strategy='median')

* fit: lets the imputer learn the mean or median of each column
* transform: fills the missing values with that median or mean
* the output of transform is a NumPy array, so it is wrapped in a pandas DataFrame
* name the columns of x_train_num_imputed after those of x_train_num.
     - WHY? because the column names are lost after imputation
* set the index of x_train_num_imputed to that of x_train_num.
     - WHY? because the index is lost after imputation

In [191]:
# these are the statements that will go into the new function
# the imputer must first be fitted to the data
imput.fit(x_train_num)
x_train_num_imputed = pd.DataFrame(imput.transform(x_train_num))
x_train_num_imputed.columns = x_train_num.columns
x_train_num_imputed.index =  x_train_num.index
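The fit/transform split matters because the median is learned from the training data only and then reused elsewhere. The sketch below illustrates this with `SimpleImputer` from `sklearn.impute`, the imputer class in current scikit-learn releases. The toy frames and the helper `impute` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical) with the same column as the real data
train = pd.DataFrame({"LotFrontage": [50.0, 60.0, np.nan, 80.0]},
                     index=[657, 965, 1441, 1444])
test = pd.DataFrame({"LotFrontage": [np.nan, 100.0]}, index=[3, 7])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="median")
imputer_demo.fit(train)  # learns the median of the training column only

def impute(frame):
    # transform returns a NumPy array, so column names and index are restored here
    return pd.DataFrame(imputer_demo.transform(frame),
                        columns=frame.columns, index=frame.index)

train_imputed = impute(train)
test_imputed = impute(test)
print(imputer_demo.statistics_)  # median learned from the training data
print(test_imputed)              # the test NaN is filled with the training median
```

The missing test value is filled with the training median (60.0 here), not with a statistic computed from the test set.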

In [192]:
# check x_train_num_imputed
x_train_num_imputed.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF
657,60.0,7200.0,0.0,0.0,851.0,651.0,1.0,270.0,0.0,0.0
965,65.0,10237.0,0.0,0.0,783.0,701.0,2.0,393.0,0.0,72.0
1441,68.0,4426.0,147.0,697.0,848.0,0.0,2.0,420.0,149.0,0.0
1444,63.0,8500.0,106.0,0.0,1422.0,0.0,2.0,626.0,192.0,60.0
522,50.0,5000.0,0.0,399.0,1004.0,660.0,2.0,420.0,0.0,24.0


In [193]:
# check the imputer's result again: are there still missing values?
x_train_num_imputed.isnull().sum()

LotFrontage    0
LotArea        0
MasVnrArea     0
BsmtFinSF1     0
1stFlrSF       0
2ndFlrSF       0
GarageCars     0
GarageArea     0
WoodDeckSF     0
OpenPorchSF    0
dtype: int64

## Categorical Imputation

In [194]:
# check missing values in x_train_cat
x_train_cat.isnull().sum()

MasVnrType        7
HouseStyle        0
BsmtQual         26
BsmtExposure     27
BsmtFinType1     26
GarageType       68
GarageFinish     68
SaleType          0
SaleCondition     0
dtype: int64

In [195]:
# replace missing values with a new category "KOSONG" (Indonesian for "empty")
x_train_cat_imputed = x_train_cat.fillna(value="KOSONG")

In [196]:
# check the missing values again
x_train_cat_imputed.isnull().sum()

MasVnrType       0
HouseStyle       0
BsmtQual         0
BsmtExposure     0
BsmtFinType1     0
GarageType       0
GarageFinish     0
SaleType         0
SaleCondition    0
dtype: int64

## Make a Function
* Make a function for numerical imputation

In [197]:
# function definition
from sklearn.impute import SimpleImputer

def numericalImputation(data):
    # Fit a median imputer and return both the imputed data and the fitted imputer
    imputer = SimpleImputer(missing_values=np.nan, strategy='median')
    imputer.fit(data)
    # transform returns a NumPy array, so restore the column names and index
    data_imputed = pd.DataFrame(imputer.transform(data))
    data_imputed.columns = data.columns
    data_imputed.index = data.index
    
    return data_imputed, imputer

In [198]:
# return imputed data and imputer
x_train_num_imputed , imputer = numericalImputation(x_train_num)

In [199]:
# check imputed data
x_train_num_imputed.isnull().sum()

LotFrontage    0
LotArea        0
MasVnrArea     0
BsmtFinSF1     0
1stFlrSF       0
2ndFlrSF       0
GarageCars     0
GarageArea     0
WoodDeckSF     0
OpenPorchSF    0
dtype: int64

In [200]:
# dump imputer
# (sklearn.externals.joblib was removed in scikit-learn 0.23; joblib is imported directly)
import joblib
joblib.dump(imputer,'Imputer.pkl')

['Imputer.pkl']
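Dumping the fitted imputer allows the same learned statistics to be reloaded later when the test data is preprocessed. A minimal round-trip sketch follows. The dictionary stands in for a fitted object, and the file name `demo.pkl` is hypothetical.

```python
import os
import tempfile
import joblib

# Any picklable Python object can be dumped; this dict is a stand-in for the fitted imputer
fitted_state = {"strategy": "median", "statistics": [68.0, 9600.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")  # hypothetical file name
    joblib.dump(fitted_state, path)           # write the object to disk
    restored = joblib.load(path)              # read it back

print(restored == fitted_state)  # the restored object equals the original
```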

 * Make a function for categorical imputation

In [201]:
# function definition
def categoricalImputation(data):
    data_cat_imputed = data.fillna(value="KOSONG")
    return data_cat_imputed

In [202]:
# return imputed data
x_train_cat_imputed = categoricalImputation(x_train_cat)
x_train_cat_imputed.isnull().sum()

MasVnrType       0
HouseStyle       0
BsmtQual         0
BsmtExposure     0
BsmtFinType1     0
GarageType       0
GarageFinish     0
SaleType         0
SaleCondition    0
dtype: int64

## Preprocessing Categorical Variables

* create dummy variable for each of categorical variable

In [203]:
# create dummies
categorical_dummies =  pd.get_dummies(x_train_cat_imputed)

In [204]:
# check the top observations
categorical_dummies.head()

Unnamed: 0,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_KOSONG,MasVnrType_None,MasVnrType_Stone,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
657,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
965,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
1441,0,1,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
1444,0,1,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
522,0,0,0,1,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


### Make a function to get the dummies

In [205]:
# function definition
def categoricalDummies(data):
    cat_dummies = pd.get_dummies(data)
    
    return cat_dummies, cat_dummies.columns

In [206]:
categorical_dummies, dummy_col = categoricalDummies(x_train_cat_imputed)

In [207]:
# dump dummy_columns
# (sklearn.externals.joblib was removed in scikit-learn 0.23; joblib is imported directly)
import joblib
joblib.dump(dummy_col,'dummy_col.pkl')

['dummy_col.pkl']
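The dummy column names are worth saving because a test set may lack some categories seen in training, or contain categories never seen. One way to handle this, sketched here as an assumption rather than as the notebook's own method, is to reindex the test dummies to the training layout. The toy frames are hypothetical.

```python
import pandas as pd

# Toy categorical data (hypothetical): "New" is absent from test, "COD" is unseen in training
train_cat = pd.DataFrame({"SaleType": ["WD", "New", "WD"]})
test_cat = pd.DataFrame({"SaleType": ["WD", "COD"]})

train_dummies = pd.get_dummies(train_cat)
dummy_columns = train_dummies.columns  # what dummy_col.pkl would hold

# Reindex to the training layout: missing columns are added and filled with 0,
# columns unseen during training are dropped
test_dummies = pd.get_dummies(test_cat).reindex(columns=dummy_columns, fill_value=0)

print(list(test_dummies.columns))  # ['SaleType_New', 'SaleType_WD']
```

After reindexing, the test matrix has exactly the columns the standardizer and the model were fitted on.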

In [208]:
# check the top observations
categorical_dummies.head()

Unnamed: 0,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_KOSONG,MasVnrType_None,MasVnrType_Stone,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
657,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
965,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
1441,0,1,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
1444,0,1,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
522,0,0,0,1,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


## Join Numerical and Categorical Data

In [209]:
# take the numerical variables that no longer have missing values and the categorical variables that are now dummies
# join those columns back together as x_train_concat
x_train_concat = pd.concat([x_train_num_imputed,categorical_dummies],axis = 1)

In [210]:
x_train_concat.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
657,60.0,7200.0,0.0,0.0,851.0,651.0,1.0,270.0,0.0,0.0,...,0,0,0,1,0,0,0,0,1,0
965,65.0,10237.0,0.0,0.0,783.0,701.0,2.0,393.0,0.0,72.0,...,0,1,0,0,0,0,0,0,0,1
1441,68.0,4426.0,147.0,697.0,848.0,0.0,2.0,420.0,149.0,0.0,...,0,0,0,1,0,0,0,0,1,0
1444,63.0,8500.0,106.0,0.0,1422.0,0.0,2.0,626.0,192.0,60.0,...,0,0,0,1,0,0,0,0,1,0
522,50.0,5000.0,0.0,399.0,1004.0,660.0,2.0,420.0,0.0,24.0,...,0,0,0,1,0,0,0,0,1,0


In [211]:
# Check NaN values
x_train_concat.isnull().sum().sum()

0

## Standardizing Variables

- PURPOSE: put the input variables on the same scale
- fit: lets the standardizer learn the mean and standard deviation of each column
- transform: replaces the data with the standardized values
- the output of transform is wrapped in a pandas DataFrame
- the fitted standardizer is returned because it will be reused on the test data

In [212]:
#Import Standard Scaler
from sklearn.preprocessing import StandardScaler
standardizer = StandardScaler()

In [213]:
# define function for standardizing data
def standardize(data):
    standardizer = StandardScaler()
    standardizer.fit(data)
    
    data_standard = pd.DataFrame(standardizer.transform(data), index=data.index)
    data_standard.columns = data.columns
    
    return data_standard, standardizer

In [214]:
# return standardized data and standardizer
x_train_clean, normalizer = standardize(x_train_concat)
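As a small illustration of what the standardizer learns and how it is reused, the sketch below fits `StandardScaler` on a toy training frame and applies it to a toy test frame. Both frames are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames (hypothetical)
train = pd.DataFrame({"LotArea": [1000.0, 2000.0, 3000.0]})
test = pd.DataFrame({"LotArea": [2000.0, 4000.0]})

scaler = StandardScaler().fit(train)  # learns mean and std from the training data only
train_std = scaler.transform(train)
test_std = scaler.transform(test)     # reuses the training mean and std

print(scaler.mean_)                            # training mean
print(np.allclose(train_std.mean(axis=0), 0))  # True: standardized training column has mean 0
print(np.allclose(train_std.std(axis=0), 1))   # True: and unit standard deviation
print(test_std.ravel())                        # test values expressed on the training scale
```

The test values are not guaranteed to have mean 0 and standard deviation 1, since they are scaled with statistics from the training set.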

In [215]:
# dump standardizer
# (sklearn.externals.joblib was removed in scikit-learn 0.23; joblib is imported directly)
import joblib
joblib.dump(normalizer,'normalizer.pkl')

['normalizer.pkl']

In [216]:
# check data
x_train_clean.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,1stFlrSF,2ndFlrSF,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
657,-0.459996,-0.326654,-0.558175,-1.008986,-0.8221,0.679145,-1.022443,-0.952439,-0.753305,-0.704892,...,-0.065597,-0.30781,-0.041434,0.392814,-0.273104,-0.058646,-0.092968,-0.114109,0.471158,-0.309475
965,-0.215433,-0.042901,-0.558175,-1.008986,-1.001643,0.793097,0.307759,-0.374271,-0.753305,0.388821,...,-0.065597,3.248762,-0.041434,-2.545735,-0.273104,-0.058646,-0.092968,-0.114109,-2.122432,3.231281
1441,-0.068696,-0.585834,0.257985,0.579047,-0.830021,-0.804516,0.307759,-0.247357,0.433644,-0.704892,...,-0.065597,-0.30781,-0.041434,0.392814,-0.273104,-0.058646,-0.092968,-0.114109,0.471158,-0.309475
1444,-0.313258,-0.205192,0.030349,-1.008986,0.685537,-0.804516,0.307759,0.720956,0.776186,0.206536,...,-0.065597,-0.30781,-0.041434,0.392814,-0.273104,-0.058646,-0.092968,-0.114109,0.471158,-0.309475
522,-0.949121,-0.532204,-0.558175,-0.099911,-0.418127,0.699656,0.307759,-0.247357,-0.753305,-0.340321,...,-0.065597,-0.30781,-0.041434,0.392814,-0.273104,-0.058646,-0.092968,-0.114109,0.471158,-0.309475


## 3. <font color='blue'>Training Machine Learning</font>

* We must beat the benchmark
* Choose Score to optimize and Hyperparameter Space
* Cross-Validation: Random Search CV 


### Benchmark:

In [217]:
y_train.value_counts(normalize = True).value_counts()

0.000857    362
0.001714     87
0.002571     49
0.003428     28
0.004284     16
0.005141     12
0.006855      7
0.005998      7
0.009426      3
0.008569      2
0.007712      2
0.011140      1
0.012853      1
0.013710      1
Name: SalePrice, dtype: int64
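One way to read the output above, assuming the benchmark is the accuracy of always predicting the most frequent value, is to take the largest share directly (0.013710 in the table above). A sketch on a toy series, which is hypothetical, follows.

```python
import pandas as pd

# Toy target (hypothetical): 140000 appears 3 times out of 5
y_demo = pd.Series([140000, 140000, 140000, 181500, 208500], name="SalePrice")

# Share of each distinct value; the largest share is the accuracy obtained
# by always predicting the most frequent value
shares = y_demo.value_counts(normalize=True)
baseline_accuracy = shares.max()
print(shares.idxmax(), baseline_accuracy)  # 140000 0.6
```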

In [218]:
from sklearn.tree import DecisionTreeClassifier
decTree = DecisionTreeClassifier(random_state=123)

In [219]:
decTree.fit(x_train_clean,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best')

In [220]:
decTree.score(x_train_clean,y_train)

0.99828620394173095

In [221]:
from sklearn.model_selection import RandomizedSearchCV

In [222]:
decTree_param = {'max_depth':[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] }

randomdecTree = RandomizedSearchCV(DecisionTreeClassifier(random_state=123), param_distributions=decTree_param, n_iter=5, cv = 5, scoring ='accuracy')

In [223]:
randomdecTree.fit(x_train_clean,y_train)



RandomizedSearchCV(cv=5, error_score='raise',
          estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
          fit_params={}, iid=True, n_iter=5, n_jobs=1,
          param_distributions={'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='accuracy', verbose=0)

In [224]:
randomdecTree.score(x_train_clean,y_train)

0.024850042844901457

In [225]:
randomdecTree.best_params_

{'max_depth': 2}

In [226]:
best_decTree = DecisionTreeClassifier(max_depth = randomdecTree.best_params_.get('max_depth'))

In [227]:
best_decTree.fit(x_train_clean, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [228]:
joblib.dump(best_decTree,'Best_DT.pkl')

['Best_DT.pkl']
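`SalePrice` is a continuous target, and the training accuracy of about 0.025 for the tuned tree above suggests that treating every distinct price as a class leaves the classifier little to learn. A possible alternative, not part of the original notebook, is to search over the same `max_depth` space with `DecisionTreeRegressor` and an error-based score. The sketch below runs on synthetic data standing in for `x_train_clean` and `y_train`; the choice of mean absolute error is an assumption.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for x_train_clean / y_train: price grows with the first feature
rng = np.random.RandomState(123)
x_demo = rng.normal(size=(200, 3))
y_demo = 180000 + 50000 * x_demo[:, 0] + rng.normal(scale=5000, size=200)

param_space = {"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]}
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=123),
    param_distributions=param_space,
    n_iter=5,                           # try 5 randomly chosen depths
    cv=5,                               # 5-fold cross-validation
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
    random_state=123,
)
search.fit(x_demo, y_demo)

print(search.best_params_)   # depth with the lowest cross-validated error
print(-search.best_score_)   # cross-validated mean absolute error
```

With a regression score, the benchmark to beat would be an error value (for example the error of always predicting the mean price) rather than an accuracy.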

## 4. <font color='blue'>Test Prediction</font>

### Preprocessing Test Data

Write a function to preprocess the test data. The preprocessing consists of imputation (numerical and categorical) and standardizing.
 

The function is named `extractTest` and takes 6 arguments:
 1. `data` : the data to process
 2. `numerical_columns`   : names of the numerical columns
 3. `categorical_columns` : names of the categorical columns
 4. `dummy_column`        : names of the dummy columns
 5. `imput_numericals`    : imputer for the numerical data (from preprocessing the training data)
 6. `standardizer`        : standardizer (from preprocessing the training data)
 
 
Then assign the result of the function to a variable named `x_test_clean`.



In [229]:
# function definition
def extractTest(data, numerical_columns, categorical_columns, dummy_column, imput_numericals, standardizer):
        
    numerical_data = data[numerical_columns]
    categorical_data = data[categorical_columns]
    
    numerical_data = pd.DataFrame(imput_numericals.transform(numerical_data)) # impute numerical test data
    numerical_data.columns = numerical_columns
    numerical_data.index = data.index
    categorical_data = categorical_data.fillna(value="KOSONG") # impute categorical
    categorical_data.index = data.index
    categorical_data = pd.get_dummies(categorical_data) # create dummies
    # align with the training dummy columns: add missing ones as 0, drop unseen ones
    categorical_data = categorical_data.reindex(columns=dummy_column, fill_value=0)
    x_valid = pd.concat([ numerical_data, categorical_data], axis = 1)
    x_valid_transform = pd.DataFrame(standardizer.transform(x_valid)) # standardization
    x_valid_transform.columns = x_valid.columns # restore the column names
    x_valid_transform.index = x_valid.index
    
    return x_valid_transform

In [230]:
# load necessary object
# object = joblib.load("filename.pkl")
num_columns = list(x_train_num_imputed.columns.values)
cat_columns = list(x_train_cat_imputed.columns.values)
dummy_col = joblib.load('dummy_col.pkl')
imput = joblib.load('Imputer.pkl')
standardizer = joblib.load('normalizer.pkl')

In [None]:
x_test_clean = extractTest(x_test, num_columns, cat_columns, dummy_col, imput, standardizer)