# Machine Learning

Machine learning is a field of study that enables machines to learn patterns from data.

This notebook practices machine learning on the `Melbourne Housing Snapshot` dataset from Kaggle.

The goal is to predict `Price` from a set of input columns (the `features`) that the model learns from.

## Importing the Dataset

Use the `pandas` library to import the dataset and load it into a DataFrame.

`read_csv()` is the `pandas` function that reads a CSV file and returns its contents as a DataFrame.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('../../../datasets/melb_data.csv/melb_data.csv')
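
To see what `read_csv()` returns without the Kaggle file on disk, the sketch below reads a small in-memory CSV. The three rows reuse values from the first rows of the dataset; the `io.StringIO` buffer and the names `csv_text` and `sample_df` are assumptions made only for illustration.

```python
import io

import pandas as pd

# A tiny stand-in for melb_data.csv (three columns, three rows).
csv_text = """Suburb,Rooms,Price
Abbotsford,2,1480000.0
Abbotsford,2,1035000.0
Abbotsford,3,1465000.0
"""

# read_csv accepts a file path or any file-like object and returns a DataFrame.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df).__name__)  # DataFrame
print(sample_df.shape)           # (3, 3)
```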

## Exploring the Dataset

Exploration shows the characteristics of the dataset we are about to use, including whether it contains missing values.

### Viewing Samples

- `head()` returns the first five rows by default
- `tail()` returns the last five rows by default

In [4]:
df.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |
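
Both methods also accept the number of rows to return. A minimal sketch on a hypothetical ten-row frame (`toy` is not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame with ten rows, used only to illustrate head() and tail().
toy = pd.DataFrame({"Rooms": range(1, 11)})

first_five = toy.head()    # default n=5: the first five rows
last_three = toy.tail(3)   # pass n to change how many rows are returned

print(first_five.index.tolist())  # [0, 1, 2, 3, 4]
print(last_three.index.tolist())  # [7, 8, 9]
```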


### Checking Rows and Columns

- `shape` returns the number of rows and columns as a tuple
- `columns` returns the name of every column

In [4]:
df.shape

(13580, 21)

In [5]:
df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
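
Because `shape` is a plain tuple and `columns` is an index object, both can be unpacked or converted for later use. A small sketch, assuming a hypothetical two-column frame named `toy`:

```python
import pandas as pd

# Hypothetical frame; the values are illustrative only.
toy = pd.DataFrame({"Rooms": [2, 3, 4], "Price": [1480000.0, 1035000.0, 1465000.0]})

# shape is (rows, columns), so it unpacks into two integers.
n_rows, n_cols = toy.shape

# columns is an Index; tolist() turns it into a regular Python list.
column_names = toy.columns.tolist()

print(n_rows, n_cols)   # 3 2
print(column_names)     # ['Rooms', 'Price']
```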

### Viewing a Dataset Summary

The `describe()` method produces simple summary statistics for the numeric columns of a dataset.

In [6]:
df.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |
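
In the summary, the `count` for `Car` (13518), `BuildingArea` (7130), and `YearBuilt` (8205) is lower than the 13580 rows in the dataset, which indicates missing values in those columns. One way to count them directly is `isna().sum()`; the sketch below uses a hypothetical four-row frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing BuildingArea values.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isna().sum()

# count() reports non-missing cells, the same number describe() shows as "count".
non_missing = toy.count()

print(missing_per_column)
print(non_missing)
```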


## Data Cleaning

To handle missing values, use the `dropna()` method, which drops every row that contains at least one missing value. The downside is that the dataset shrinks, here from 13580 rows to 6196.

In [7]:
df = df.dropna()
df.shape

(6196, 21)
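
One possible way to lose fewer rows, assuming only some columns are needed later, is to restrict the check with the `subset` parameter of `dropna()`. This is a sketch on a hypothetical frame, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only row 1 is complete in every column.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [202.0, 156.0, np.nan, 94.0],
    "BuildingArea": [np.nan, 79.0, 142.0, np.nan],
})

# Drop a row if ANY column is missing (the behaviour used in the notebook).
all_columns = toy.dropna()

# Drop a row only if one of the listed columns is missing.
needed_only = toy.dropna(subset=["Rooms", "Landsize"])

print(len(all_columns), len(needed_only))  # 1 3
```

The trade-off is that the columns left out of `subset` may still contain missing values, so they should not be used as features without further handling.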

# Machine Learning Steps

## Choosing the Prediction Target

Here the prediction target is the `Price` column.

In [8]:
y = df['Price']
y

1        1035000.0
2        1465000.0
4        1600000.0
6        1876000.0
7        1636000.0
           ...    
12205     601000.0
12206    1050000.0
12207     385000.0
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64

## Choosing Features

Features are the columns the model uses as input when it learns. We do not use every column as a feature, only a selected few.

The features chosen here are `Rooms`, `Bathroom`, `Landsize`, `Lattitude`, and `Longtitude`.

In [9]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = df[features]

In [10]:
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


## Creating the Model

Here we use the `DecisionTreeRegressor` model.

In [11]:
from sklearn.tree import DecisionTreeRegressor

## Configuring the Model

In [12]:
df_model = DecisionTreeRegressor(random_state=1)

## Training the Model

In [13]:
df_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')

## Making Predictions

In [14]:
df_model.predict(X.head())

array([1035000., 1465000., 1600000., 1876000., 1636000.])

In [15]:
y.head()

1    1035000.0
2    1465000.0
4    1600000.0
6    1876000.0
7    1636000.0
Name: Price, dtype: float64

# Model Evaluation

## `mean_absolute_error`

The smaller the MAE (mean absolute error), the closer the predictions are to the actual prices.

In [16]:
from sklearn.metrics import mean_absolute_error

In [17]:
y_hat = df_model.predict(X)
mean_absolute_error(y, y_hat)

1115.7467183128902
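
MAE is the average of the absolute differences between each actual value and its prediction. The sketch below computes it by hand with NumPy and compares the result with `mean_absolute_error`; the `y_pred` values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([1035000.0, 1465000.0, 1600000.0])
y_pred = np.array([1000000.0, 1500000.0, 1600000.0])  # hypothetical predictions

# Absolute errors are 35000, 35000 and 0, so the mean is 70000 / 3.
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```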

## Training and Testing Data

In [18]:
from sklearn.model_selection import train_test_split

### Splitting the Data into Two Parts

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
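
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing; for the 6196 rows here that is 4647 for training and 1549 for testing. The sketch below shows the default and an explicit `test_size` on hypothetical arrays of 100 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 feature columns.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Default split: 75% train, 25% test.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

# Explicit split: 80% train, 20% test.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)

print(len(X_tr), len(X_te))    # 75 25
print(len(X_tr2), len(X_te2))  # 80 20
```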

### Configuring and Training the Model

In [20]:
df_model = DecisionTreeRegressor(random_state=1)
df_model.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')

### Evaluation

In [21]:
y_hat = df_model.predict(X_test)
mean_absolute_error(y_test, y_hat)

251688.7630729503
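
This MAE is far higher than the roughly 1116 measured earlier, because that earlier score was computed on the same rows the model was trained on. A decision tree with no size limit can memorize its training rows, so only the score on held-out rows says how it behaves on new data. The sketch below reproduces the effect on synthetic data; the data-generating formula is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise (illustrative assumption).
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 2))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 2.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1)

# An unrestricted tree keeps splitting until it fits the training rows.
tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
test_mae = mean_absolute_error(y_te, tree.predict(X_te))

print(f"train MAE: {train_mae:.3f}  test MAE: {test_mae:.3f}")
```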

## Model Optimization

In [22]:
def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    """Train a tree limited to max_leaf_nodes leaves and return its MAE on the test data."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_hat)
    return mae

In [23]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    leaf_mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    print(f'Max Leaf Nodes : {max_leaf_nodes} \t Mean Absolute Error : {int(leaf_mae)}')

Max Leaf Nodes : 5 	 Mean Absolute Error : 369673
Max Leaf Nodes : 50 	 Mean Absolute Error : 266644
Max Leaf Nodes : 500 	 Mean Absolute Error : 243613
Max Leaf Nodes : 5000 	 Mean Absolute Error : 256227
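
In this output the MAE falls as the tree grows from 5 to 500 leaves and rises again at 5000, so among the four candidates 500 gives the lowest error. A minimal sketch of picking that value programmatically, with the MAE figures copied from the output above (`mae_by_leaves` and `best_leaves` are illustrative names):

```python
# MAE values copied from the loop output, keyed by max_leaf_nodes.
mae_by_leaves = {5: 369673, 50: 266644, 500: 243613, 5000: 256227}

# min() over the keys, ranked by their MAE, returns the best candidate.
best_leaves = min(mae_by_leaves, key=mae_by_leaves.get)

print(best_leaves)  # 500
```

A final model could then be created with `max_leaf_nodes=best_leaves`. Only four values were tried, so a finer search between 50 and 5000 might find a better setting.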


# Exploring a Random Forest Model

In [24]:
from sklearn.ensemble import RandomForestRegressor

In [26]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=1)
rf_model.fit(X_train, y_train)
y_hat = rf_model.predict(X_test)
print(f'Mean Absolute Error : {int(mean_absolute_error(y_test, y_hat))}')

Mean Absolute Error : 190414
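
The random forest reaches a lower MAE (190414) than both the unrestricted tree (about 251689) and the best size-limited tree (243613). A `RandomForestRegressor` trains many trees and averages their predictions. The sketch below checks that averaging on synthetic data; the data and the choice of 10 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: a linear signal plus noise (illustrative assumption).
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 2))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 2.0, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_syn, y_syn)

# estimators_ holds the fitted trees; collect each tree's predictions.
per_tree = np.array([tree.predict(X_syn[:5]) for tree in forest.estimators_])

# Averaging across trees should match the forest's own prediction.
averaged = per_tree.mean(axis=0)

print(np.allclose(averaged, forest.predict(X_syn[:5])))
```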
