# 7. Regression with KNN (K-Nearest Neighbours)

- KNN is a machine learning model that makes predictions based on how close a sample's features are to those of its nearest neighbours in the training data
- The same approach works for both classification and regression tasks

Reference: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
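To make the idea concrete before turning to scikit-learn, here is a minimal sketch of KNN regression written directly with NumPy. The function name `knn_regress` and the toy data are hypothetical and only serve as illustration; the sketch assumes Euclidean distance and an unweighted mean of the `k` nearest targets.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k=3):
    """Predict a value for x_query as the mean target of its k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(x_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Hypothetical toy data: one feature, target equal to twice the feature
x_toy = [[1], [2], [3], [10]]
y_toy = [2, 4, 6, 20]

prediction = knn_regress(x_toy, y_toy, [2], k=3)
print(prediction)
```

The three points nearest to `2` are `1`, `2` and `3`, so the prediction is the mean of their targets; the distant point `10` has no influence.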

In [4]:
import pandas as pd

sensus={
    'tinggi':[158, 170, 183, 191, 155, 168, 180, 150, 170],
    'berat': [64, 86, 84, 80, 49, 59, 67, 54, 60],
    'jk':[
        'pria', 'pria', 'pria', 'pria', 'wanita', 'wanita', 'wanita', 'wanita','wanita'
    ]
}
sensus_df=pd.DataFrame(sensus)
sensus_df

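|   | tinggi | berat | jk |
|---|---|---|---|
| 0 | 158 | 64 | pria |
| 1 | 170 | 86 | pria |
| 2 | 183 | 84 | pria |
| 3 | 191 | 80 | pria |
| 4 | 155 | 49 | wanita |
| 5 | 168 | 59 | wanita |
| 6 | 180 | 67 | wanita |
| 7 | 150 | 54 | wanita |
| 8 | 170 | 60 | wanita |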


## Regression dengan KNN

### Features dan Target

In [5]:
import numpy as np

x_train= np.array(sensus_df[['tinggi', 'jk']])
y_train= np.array(sensus_df['berat'])

print(f'x_train:\n{x_train}\n')
print(f'y_train:{y_train}')

x_train:
[[158 'pria']
 [170 'pria']
 [183 'pria']
 [191 'pria']
 [155 'wanita']
 [168 'wanita']
 [180 'wanita']
 [150 'wanita']
 [170 'wanita']]

y_train:[64 86 84 80 49 59 67 54 60]


### Preprocessing Dataset

In [6]:
x_train_transposed = np.transpose(x_train)  # swap the axes: rows become columns and columns become rows

print(f'x_train:\n{x_train}\n')
print(f'x_train_transposed:\n{x_train_transposed}')

x_train:
[[158 'pria']
 [170 'pria']
 [183 'pria']
 [191 'pria']
 [155 'wanita']
 [168 'wanita']
 [180 'wanita']
 [150 'wanita']
 [170 'wanita']]

x_train_transposed:
[[158 170 183 191 155 168 180 150 170]
 ['pria' 'pria' 'pria' 'pria' 'wanita' 'wanita' 'wanita' 'wanita'
  'wanita']]


In [7]:
# convert the string values to numeric values using LabelBinarizer
from sklearn.preprocessing import LabelBinarizer

lb=LabelBinarizer()
jk_binarised = lb.fit_transform(x_train_transposed[1])
print(f'jk: {x_train_transposed[1]}\n')
print(f'jk_binarised:\n{jk_binarised}')

# note: pria = 0, wanita = 1

jk: ['pria' 'pria' 'pria' 'pria' 'wanita' 'wanita' 'wanita' 'wanita' 'wanita']

jk_binarised:
[[0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]]


In [8]:
jk_binarised = jk_binarised.flatten()  # convert to a one-dimensional array
jk_binarised

array([0, 0, 0, 0, 1, 1, 1, 1, 1])

In [9]:
x_train_transposed[1] = jk_binarised
x_train = x_train_transposed.transpose()

print(f'x_train_transposed:\n{x_train_transposed}\n')
print(f'x_train\n{x_train}')

x_train_transposed:
[[158 170 183 191 155 168 180 150 170]
 [0 0 0 0 1 1 1 1 1]]

x_train
[[158 0]
 [170 0]
 [183 0]
 [191 0]
 [155 1]
 [168 1]
 [180 1]
 [150 1]
 [170 1]]
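The transpose, binarise and transpose-back steps above can also be condensed. The sketch below is one possible shortcut, not the approach used in this notebook: it encodes `jk` with a boolean comparison on the pandas column (assuming only the two labels `pria` and `wanita` occur) and stacks the result next to the height column. The name `x_train_alt` is hypothetical.

```python
import numpy as np
import pandas as pd

sensus_df = pd.DataFrame({
    'tinggi': [158, 170, 183, 191, 155, 168, 180, 150, 170],
    'jk': ['pria', 'pria', 'pria', 'pria',
           'wanita', 'wanita', 'wanita', 'wanita', 'wanita'],
})

# Encode jk directly on the column: pria -> 0, wanita -> 1
jk_encoded = (sensus_df['jk'] == 'wanita').astype(int)

# Stack height and the encoded label into an (n_samples, 2) numeric array
x_train_alt = np.column_stack([sensus_df['tinggi'], jk_encoded])
print(x_train_alt)
```

Unlike the object-typed array produced by the steps above, this array has an integer dtype from the start.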


### Training KNN Regression Model

In [10]:
from sklearn.neighbors import KNeighborsRegressor

K = 3
model = KNeighborsRegressor(n_neighbors=K)
model.fit(x_train, y_train)

KNeighborsRegressor(n_neighbors=3)

### Predicting Body Weight

In [11]:
x_new = np.array([[155, 1]])
x_new

array([[155,   1]])

In [12]:
y_pred = model.predict(x_new)
y_pred

array([55.66666667])
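To see where the value 55.67 comes from, the fitted model can be asked which training points it treats as neighbours. The sketch below rebuilds the model from the encoded training data shown earlier and uses `kneighbors`, which returns the distances and indices of the nearest training samples; the variable name `neighbour_weights` is introduced here for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

x_train = np.array([[158, 0], [170, 0], [183, 0], [191, 0], [155, 1],
                    [168, 1], [180, 1], [150, 1], [170, 1]])
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 60])

model = KNeighborsRegressor(n_neighbors=3).fit(x_train, y_train)

# Distances to, and row indices of, the 3 nearest training points
distances, indices = model.kneighbors(np.array([[155, 1]]))
neighbour_weights = y_train[indices[0]]

print(distances[0])
print(neighbour_weights)
print(neighbour_weights.mean())
```

The three nearest points are `[155, 1]`, `[158, 0]` and `[150, 1]`, with weights 49, 64 and 54, and their mean is 167 / 3 ≈ 55.67.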

### Evaluating the KNN Regression Model

In [13]:
x_test = np.array([[168, 0], [180, 0], [160, 1], [169, 1]])
y_test = np.array([65, 96, 52, 67])

print(f'x_test:\n{x_test}\n')
print(f'y_test: {y_test}')


x_test:
[[168   0]
 [180   0]
 [160   1]
 [169   1]]

y_test: [65 96 52 67]


In [14]:
y_pred = model.predict(x_test)
y_pred

array([68.33333333, 79.        , 57.33333333, 68.33333333])

## Coefficient of Determination or $R^2$

Reference: https://en.wikipedia.org/wiki/Coefficient_of_determination

In [15]:
from sklearn.metrics import r2_score

r_squared = r2_score(y_test, y_pred)

print(f'R-squared: {r_squared}')

R-squared: 0.680528691166989
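As a cross-check, $R^2$ can be computed by hand as $1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The sketch below uses the test targets and the predictions printed above (rounded to eight decimals); the names `ss_res`, `ss_tot` and `r_squared_manual` are introduced here for illustration.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67])
y_pred = np.array([68.33333333, 79.0, 57.33333333, 68.33333333])

# Sum of squared residuals: how far the predictions are from the targets
ss_res = np.sum((y_test - y_pred) ** 2)
# Total sum of squares: how far the targets are from their own mean
ss_tot = np.sum((y_test - y_test.mean()) ** 2)

r_squared_manual = 1 - ss_res / ss_tot
print(r_squared_manual)
```

Here $SS_{res} \approx 330.33$ and $SS_{tot} = 1034$, which gives roughly 0.68, in line with `r2_score`.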


## Mean Absolute Error (MAE) or Mean Absolute Deviation (MAD)

### _M A E_ is the mean of the absolute errors of the predictions

### _M A E_ $= \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i |$

$y_i$ = the actual target value

$\hat{y}_i$ = the predicted value

Reference: https://en.wikipedia.org/wiki/Mean_absolute_error

In [20]:
from sklearn.metrics import mean_absolute_error

MAE = mean_absolute_error(y_test, y_pred)
print(f'MAE: {MAE}')

MAE: 6.749999999999998


$y_i$ = `y_test`

$\hat{y}_i$ = `y_pred`
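The formula maps directly onto NumPy. The sketch below recomputes the MAE from the test targets and the predictions printed earlier (rounded to eight decimals); `mae_manual` is a name introduced here for illustration.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67])
y_pred = np.array([68.33333333, 79.0, 57.33333333, 68.33333333])

# Absolute error per test sample, then the mean over all samples
absolute_errors = np.abs(y_test - y_pred)
mae_manual = absolute_errors.mean()

print(absolute_errors)
print(mae_manual)
```

The absolute errors sum to about 27 kg over four samples, which gives the 6.75 reported by `mean_absolute_error`.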

## Mean Squared Error (MSE) or Mean Squared Deviation (MSD)

### _M S E_ is the mean of the squared errors of the predictions

### _M S E_ $= \frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2$

$y_i$ = the actual target value of each sample in the testing set

$\hat{y}_i$ = the predicted value

Reference: https://en.wikipedia.org/wiki/Mean_squared_error

In [22]:
from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(y_test, y_pred)
print(f'MSE: {MSE}')

MSE: 82.58333333333333
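Because the errors are squared, a single large miss dominates the MSE. The sketch below, using the same test targets and rounded predictions as before, prints the squared error per sample to make this visible; `squared_errors` and `mse_manual` are names introduced here for illustration.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67])
y_pred = np.array([68.33333333, 79.0, 57.33333333, 68.33333333])

# Squared error per test sample, then the mean over all samples
squared_errors = (y_test - y_pred) ** 2
mse_manual = squared_errors.mean()

print(squared_errors)
print(mse_manual)
```

The second test sample is off by 17 kg and contributes 289 of the roughly 330 total squared error, which is why the MSE (about 82.58) is so much larger than the MAE.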


## The Feature Scaling Problem

In [27]:
from scipy.spatial.distance import euclidean

# height in millimetres
x_train = np.array([[1700, 0], [1600, 1]])
x_new = np.array([[1640, 0]])

[euclidean(x_new[0], d) for d in x_train]

[60.0, 40.01249804748511]

note: here the new data point is closer to the second data point than to the first

In [29]:
# height in metres
x_train = np.array([[1.7, 0], [1.6, 1]])
x_new = np.array([[1.64, 0]])

[euclidean(x_new[0], d) for d in x_train]

[0.06000000000000005, 1.0007996802557442]

note: here the new data point is closer to the first data point than to the second, so merely changing the unit of height changes which neighbour is nearest

## Applying the Standard Scaler (Standard Score or Z-Score)

### Standardise the feature values by removing the mean and scaling to unit variance

### $Z = \frac{x - \bar{x}}{S}$

Reference: https://en.wikipedia.org/wiki/Standard_score
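The formula can be checked against scikit-learn by computing the z-score by hand. The sketch below uses the nine heights from the training data and assumes $S$ is the population standard deviation (NumPy's default, `ddof=0`), which is what `StandardScaler` uses; `z_manual` and `z_sklearn` are names introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

heights = np.array([[158.0], [170.0], [183.0], [191.0], [155.0],
                    [168.0], [180.0], [150.0], [170.0]])

# Z = (x - mean) / S, computed per column
z_manual = (heights - heights.mean(axis=0)) / heights.std(axis=0)

# The same transformation via scikit-learn
z_sklearn = StandardScaler().fit_transform(heights)

print(np.allclose(z_manual, z_sklearn))
```

After the transformation the column has mean 0 and standard deviation 1, regardless of the unit the heights were originally measured in.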

In [35]:
from sklearn.preprocessing import StandardScaler

ss= StandardScaler()

In [47]:
# height in millimetres
x_train = np.array([[1700, 0], [1600, 1]])
x_train_scaled = ss.fit_transform(x_train)
print(f'x_train_scaled:\n{x_train_scaled}\n')

# height in millimetres
# note: fit_transform refits the scaler on x_new alone, so the single sample maps to zeros
x_new = np.array([[1640, 0]])
x_new_scaled = ss.fit_transform(x_new)
print(f'x_new_scaled:\n{x_new_scaled}\n')

jarak = [euclidean(x_new_scaled[0], d) for d in x_train_scaled]
print(f'jarak: {jarak}')

x_train_scaled:
[[ 1. -1.]
 [-1.  1.]]

x_new_scaled:
[[0. 0.]]

jarak: [1.4142135623730951, 1.4142135623730951]


In [48]:
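# height in metres
x_train = np.array([[1.7, 0], [1.6, 1]])
x_train_scaled = ss.fit_transform(x_train)
print(f'x_train_scaled:\n{x_train_scaled}\n')

# height in metres
# note: fit_transform refits the scaler on x_new alone, so the single sample maps to zeros
x_new = np.array([[1.64, 0]])
x_new_scaled = ss.fit_transform(x_new)
print(f'x_new_scaled:\n{x_new_scaled}\n')

jarak = [euclidean(x_new_scaled[0], d) for d in x_train_scaled]
print(f'jarak: {jarak}')

x_train_scaled:
[[ 1. -1.]
 [-1.  1.]]

x_new_scaled:
[[0. 0.]]

jarak: [1.4142135623730967, 1.4142135623730934]

note: after scaling, the distances are the same whether height is given in millimetres or metres. Keep in mind that `ss.fit_transform(x_new)` refits the scaler on the single new sample, which is why `x_new_scaled` is `[[0. 0.]]`; to scale new data with the mean and standard deviation of the training set, use `ss.transform`, as is done for `x_test` in the section that follows.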
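The two cells above scale `x_new` with `fit_transform`, which re-estimates the mean and standard deviation from the new sample itself. A variant that fits the scaler on the training data only and then applies `transform` to the new point is sketched below. The helper `scaled_distances` is hypothetical and introduced only to run the same steps for both units.

```python
import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.preprocessing import StandardScaler

def scaled_distances(x_train, x_new):
    """Scale with training statistics only, then measure distances to x_new."""
    scaler = StandardScaler().fit(x_train)       # statistics come from x_train
    x_train_scaled = scaler.transform(x_train)
    x_new_scaled = scaler.transform(x_new)       # reuse them, do not refit
    return [euclidean(x_new_scaled[0], d) for d in x_train_scaled]

# height in millimetres
jarak_mm = scaled_distances(np.array([[1700, 0], [1600, 1]]),
                            np.array([[1640, 0]]))
# height in metres
jarak_m = scaled_distances(np.array([[1.7, 0], [1.6, 1]]),
                           np.array([[1.64, 0]]))

print(f'jarak (mm): {jarak_mm}')
print(f'jarak (m):  {jarak_m}')
```

With this variant the distances are still independent of the unit (about 1.2 and 2.15 in both cases), but the new point keeps its position relative to the training data and is closer to the first training point.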


## Applying Feature Scaling to KNN

### Dataset

In [52]:
# Training Set
x_train = np.array([[158, 0], [170, 0], [183, 0], [191, 0], [155, 1], [168, 1],
                    [180, 1], [150, 1], [170, 1]])

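# note: the last target is 67 here, whereas the sensus data at the start of this section has 60,
# so the metrics below are not directly comparable with the unscaled results above
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67])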

# Test Set
x_test = np.array([[168, 0], [180, 0], [160, 1], [169, 1]])
y_test = np.array([65, 96, 52, 67])

### Feature Scaling (Standard Scaler)

In [53]:
x_train_scaled = ss.fit_transform(x_train)
x_test_scaled = ss.transform(x_test)

print(f'x_train_scaled:\n{x_train_scaled}\n')
print(f'x_test_scaled:\n{x_test_scaled}\n')

x_train_scaled:
[[-0.89238551 -1.11803399]
 [ 0.04331968 -1.11803399]
 [ 1.05700031 -1.11803399]
 [ 1.68080378 -1.11803399]
 [-1.12631181  0.89442719]
 [-0.11263118  0.89442719]
 [ 0.82307401  0.89442719]
 [-1.51618897  0.89442719]
 [ 0.04331968  0.89442719]]

x_test_scaled:
[[-0.11263118 -1.11803399]
 [ 0.82307401 -1.11803399]
 [-0.73643464  0.89442719]
 [-0.03465575  0.89442719]]



### Training and Evaluating the Model

In [54]:
model.fit(x_train_scaled, y_train)
y_pred = model.predict(x_test_scaled)

MAE = mean_absolute_error(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)

print(f'MAE: {MAE}')
print(f'MSE: {MSE}')

MAE: 7.583333333333336
MSE: 85.13888888888893
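Scaling and fitting can also be bundled so that the scaler is always fitted on the training data and merely applied to the test data. The sketch below is one way to do this with scikit-learn's `make_pipeline`; it is an alternative to the manual steps above, not the approach taken in this notebook, and the name `knn_pipeline` is hypothetical. It uses the same training and test sets as this section and should give the same metrics as the manual steps.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Training set
x_train = np.array([[158, 0], [170, 0], [183, 0], [191, 0], [155, 1],
                    [168, 1], [180, 1], [150, 1], [170, 1]])
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67])

# Test set
x_test = np.array([[168, 0], [180, 0], [160, 1], [169, 1]])
y_test = np.array([65, 96, 52, 67])

# fit() scales x_train and trains KNN; predict() scales x_test with the same statistics
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
knn_pipeline.fit(x_train, y_train)
y_pred = knn_pipeline.predict(x_test)

print(f'MAE: {mean_absolute_error(y_test, y_pred)}')
print(f'MSE: {mean_squared_error(y_test, y_pred)}')
```

Keeping both steps in one object removes the risk of accidentally calling `fit_transform` on test or new data.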


source: https://www.youtube.com/watch?v=W8adIcfv16M Trs_m