# Regression with KNN (K Nearest Neighbours)
KNN is a machine learning model that makes predictions for a new data point based on how close its characteristics are to a number of nearest neighbours in the training data.

These predictions can be applied to both classification and regression tasks.
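To make the idea concrete, here is a minimal from-scratch sketch of KNN regression using only NumPy. The function name `knn_regress` is hypothetical, and the sketch assumes Euclidean distance, an unweighted mean of the neighbours' targets, and the sex column already encoded as 0 and 1 (as is done later in this notebook). It is an illustration of the technique, not scikit-learn's implementation.

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict a target as the mean of the k nearest training targets."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Regression: average the neighbours' targets
    return y_train[nearest].mean()

# Height and sex (0 = pria, 1 = wanita) as features, weight as target
X_train = np.array([[158, 0], [170, 0], [183, 0], [191, 0], [155, 1],
                    [163, 1], [180, 1], [158, 1], [178, 1]], dtype=float)
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67], dtype=float)

prediction = knn_regress(X_train, y_train, np.array([155, 1]), k=3)
print(round(prediction, 2))  # 55.67
```

For classification the only change would be the last step: a majority vote over the neighbours' labels instead of a mean.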

In [2]:
import pandas as pd

sensus = {
    'tinggi': [158, 170, 183, 191, 155, 163, 180, 158, 178],  # height (cm)
    'berat': [64, 86, 84, 80, 49, 59, 67, 54, 67],            # weight (kg)
    'jk': [  # jenis kelamin (sex): 'pria' = male, 'wanita' = female
        'pria', 'pria', 'pria', 'pria', 'wanita', 'wanita', 'wanita', 'wanita',
        'wanita'
    ]
}

sensus_df = pd.DataFrame(sensus)
sensus_df

   tinggi  berat      jk
0     158     64    pria
1     170     86    pria
2     183     84    pria
3     191     80    pria
4     155     49  wanita
5     163     59  wanita
6     180     67  wanita
7     158     54  wanita
8     178     67  wanita


In [5]:
import numpy as np

X_train = np.array(sensus_df[['tinggi', 'jk']])
y_train = np.array(sensus_df['berat'])

print(f'X_train: {X_train}')
print(f'y_train: {y_train}')

X_train: [[158 'pria']
 [170 'pria']
 [183 'pria']
 [191 'pria']
 [155 'wanita']
 [163 'wanita']
 [180 'wanita']
 [158 'wanita']
 [178 'wanita']]
y_train: [64 86 84 80 49 59 67 54 67]


# Preprocess dataset: converting labels to numeric values

In [7]:
X_train_transposed = np.transpose(X_train)

# transposing swaps rows and columns

In [10]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
jk_binarised = lb.fit_transform(X_train_transposed[1])

print(f'jk: {X_train_transposed[1]}\n')
print(f'jk_binarised:\n {jk_binarised}')

jk: ['pria' 'pria' 'pria' 'pria' 'wanita' 'wanita' 'wanita' 'wanita' 'wanita']

jk_binarised:
 [[0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]]


### Flatten
The `flatten` method converts a multidimensional array into a one-dimensional array.

In [11]:
jk_binarised = jk_binarised.flatten()
jk_binarised

array([0, 0, 0, 0, 1, 1, 1, 1, 1])

In [13]:
X_train_transposed[1] = jk_binarised
X_train = X_train_transposed.transpose()

print(f'X_train_transposed: {X_train_transposed}\n')
print(f'X_train: {X_train}')

X_train_transposed: [[158 170 183 191 155 163 180 158 178]
 [0 0 0 0 1 1 1 1 1]]

X_train: [[158 0]
 [170 0]
 [183 0]
 [191 0]
 [155 1]
 [163 1]
 [180 1]
 [158 1]
 [178 1]]
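The transpose, binarise, flatten, and transpose-back sequence above can also be done in a couple of lines. The following is a minimal sketch with NumPy only, assuming the same mapping that `LabelBinarizer` produced (pria to 0, wanita to 1); the names `jk_encoded` and `X_train_numeric` are made up for this illustration. One side effect is that the result has a numeric dtype rather than the mixed object array built from the DataFrame.

```python
import numpy as np

tinggi = np.array([158, 170, 183, 191, 155, 163, 180, 158, 178])
jk = np.array(['pria', 'pria', 'pria', 'pria', 'wanita', 'wanita',
               'wanita', 'wanita', 'wanita'])

# Same mapping LabelBinarizer produced above: pria -> 0, wanita -> 1
jk_encoded = np.where(jk == 'wanita', 1, 0)

# Stack the two 1-D columns side by side into a (9, 2) numeric matrix
X_train_numeric = np.column_stack([tinggi, jk_encoded])
print(X_train_numeric.dtype, X_train_numeric.shape)
```

Hard-coding the mapping is only reasonable for a two-category column like this one; `LabelBinarizer` remains the more general tool.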


## Training KNN Regression Model

In [14]:
from sklearn.neighbors import KNeighborsRegressor

K = 3
model = KNeighborsRegressor(n_neighbors=K)
model.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=3)

## Predicting Body Weight


In [15]:
X_new = np.array([[155, 1]])
X_new

array([[155,   1]])

In [16]:
y_pred = model.predict(X_new)
y_pred

array([55.66666667])

For a height of 155 cm and female sex (`jk` = 1), the predicted body weight is about 55.7 kg.
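To see where this number comes from, one can ask the fitted model which training rows it used. The sketch below rebuilds the same model and calls `kneighbors`, which returns the distances to and the indices of the K nearest training points. The variable name `neighbour_weights` is introduced here for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[158, 0], [170, 0], [183, 0], [191, 0], [155, 1],
                    [163, 1], [180, 1], [158, 1], [178, 1]])
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67])

model = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# kneighbors returns the distances to, and row indices of, the K nearest
# training points for each query row
distances, indices = model.kneighbors(np.array([[155, 1]]))

neighbour_weights = y_train[indices[0]]
print(X_train[indices[0]])       # the three nearest training rows
print(neighbour_weights.mean())  # same value as model.predict for this row
```

The three neighbours are the rows (155, 1), (158, 1), and (158, 0), with weights 49, 54, and 64, whose mean is the 55.67 reported above.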

# Evaluating the KNN Regression Model


In [32]:
# Testing set
X_test = np.array([[168, 0], [180, 0], [160, 1], [169, 1]])
y_test = np.array([65, 96, 52, 67])

print(f'X_test: {X_test}')
print(f'y_test: {y_test}')

X_test: [[168   0]
 [180   0]
 [160   1]
 [169   1]]
y_test: [65 96 52 67]


In [33]:
y_pred = model.predict(X_test)
y_pred

array([69.66666667, 72.66666667, 59.        , 70.66666667])

For a height of 168 cm and male sex, the predicted weight is about 69.7 kg while the expected value is 65 kg, and so on for the remaining rows.

## Coefficient of Determination or R-squared
The coefficient of determination measures how well the model explains the combined (simultaneous) influence of the independent variables on the dependent variable. Ghozali (2016) reads this from the adjusted R-squared; the `r2_score` function used here returns the plain, unadjusted R-squared.

In [34]:
from sklearn.metrics import r2_score

r_squared = r2_score(y_test, y_pred)

print(f'R-squared: {r_squared}')

R-squared: 0.39200515796260493


The coefficient of determination (R-squared) is 0.392, which means the model explains about 39.2% of the variance in body weight (the dependent variable) on the test set using height and sex (the independent variables). The remaining 60.8% is not captured by the model.
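The value can be reproduced by hand from the definition R-squared = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below assumes the predictions are copied from the model output above (rounded to eight decimals), so the result matches only to that precision.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67])
# Predictions copied from the model output above
y_pred = np.array([69.66666667, 72.66666667, 59.0, 70.66666667])

ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))  # 0.392
```

Because SS_res can exceed SS_tot, R-squared on a test set can be negative when a model predicts worse than simply using the mean of the actual values.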

## Mean Absolute Error (MAE) or Mean Absolute Deviation (MAD)

MAE is the average of the absolute errors of the predictions.

An absolute error is the absolute difference between a predicted value and the actual value.

The smaller the MAE, the better.

In [35]:
from sklearn.metrics import mean_absolute_error

MAE = mean_absolute_error(y_test, y_pred)

print(f'MAE: {MAE}')

MAE: 9.666666666666668
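The same figure can be computed directly from the definition. This is a small sketch, again assuming the predictions are copied from the earlier model output; `absolute_errors` is a name introduced for illustration.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67])
y_pred = np.array([69.66666667, 72.66666667, 59.0, 70.66666667])

absolute_errors = np.abs(y_test - y_pred)  # one error per test sample, in kg
mae = absolute_errors.mean()

print(absolute_errors.round(2))
print(round(mae, 2))  # 9.67
```

Since MAE keeps the unit of the target, it reads as "the predictions are off by about 9.7 kg on average".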


## Mean Squared Error (MSE) or Mean Squared Deviation (MSD)
MSE is the average of the squared errors of the predictions.

An error is the difference between a predicted value and the actual value; squaring it penalizes large errors more heavily than small ones.

The smaller the MSE, the better.

In [36]:
from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(y_test, y_pred)

print(f'MSE: {MSE}')

MSE: 157.16666666666663
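A by-hand version shows why MSE is so much larger than MAE here. The sketch below assumes the predictions copied from the earlier output, and the `share` variable is introduced only to show how much each test sample contributes to the total squared error.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67])
y_pred = np.array([69.66666667, 72.66666667, 59.0, 70.66666667])

squared_errors = (y_test - y_pred) ** 2
mse = squared_errors.mean()

# Fraction of the total squared error contributed by each test sample
share = squared_errors / squared_errors.sum()

print(round(mse, 2))  # 157.17
print(share.round(2))
```

The second test sample (actual 96 kg, predicted about 72.7 kg) accounts for most of the total, which illustrates how squaring lets a single large miss dominate the metric.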


# Scaling Problems with Features

In [37]:
from scipy.spatial.distance import euclidean

# height in millimeters
X_train = np.array([[1700, 0], [1600, 1]])
X_new = np.array([[1640, 0]])

[euclidean(X_new[0], d) for d in X_train]

[60.0, 40.01249804748511]

In [39]:
# height in meters
X_train = np.array([[1.7, 0], [1.6, 1]])
X_new = np.array([[1.64, 0]])

[euclidean(X_new[0], d) for d in X_train]

[0.06000000000000005, 1.0007996802557444]

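The nearest neighbour depends on the unit: with heights in millimeters the second training point is closer (40.01 versus 60.0), whereas with heights in meters the first training point is closer (0.06 versus 1.0008). The feature with the larger numeric range dominates the Euclidean distance.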
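The flip can be stated explicitly by asking which training row is nearest in each unit. This is a minimal NumPy sketch; the helper `nearest_index` is hypothetical and uses `np.linalg.norm` in place of `scipy.spatial.distance.euclidean`.

```python
import numpy as np

def nearest_index(X_train, x_new):
    """Return the row index of the training point closest to x_new."""
    return int(np.argmin(np.linalg.norm(X_train - x_new, axis=1)))

# Same two people and the same query, heights in millimeters vs meters
mm = nearest_index(np.array([[1700, 0], [1600, 1]]), np.array([1640, 0]))
m = nearest_index(np.array([[1.7, 0], [1.6, 1]]), np.array([1.64, 0]))

print(mm, m)  # 1 0
```

With K = 1, a KNN model would therefore return a different person's weight depending only on the unit chosen for height.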

## Applying the Standard Scaler (Standard Score or Z-score)

In [41]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

In [44]:
# height in millimeters
X_train = np.array([[1700, 0], [1600, 1]])
X_train_scaled = ss.fit_transform(X_train)
print(f'X_train_scaled:\n {X_train_scaled}\n')

X_new = np.array([[1640, 0]])
X_new_scaled = ss.transform(X_new)
print(f'X_new_scaled:\n {X_new_scaled}\n')

jarak = [euclidean(X_new_scaled[0], d) for d in X_train_scaled]
print(f'Jarak: {jarak}')

X_train_scaled:
 [[ 1. -1.]
 [-1.  1.]]

X_new_scaled:
 [[-0.2 -1. ]]

Jarak: [1.2, 2.154065922853802]


In [46]:
# height in meters
X_train = np.array([[1.7, 0], [1.6, 1]])
X_train_scaled = ss.fit_transform(X_train)
print(f'X_train_scaled:\n {X_train_scaled}\n')

X_new = np.array([[1.64, 0]])
X_new_scaled = ss.transform(X_new)
print(f'X_new_scaled:\n {X_new_scaled}\n')

jarak = [euclidean(X_new_scaled[0], d) for d in X_train_scaled]
print(f'Jarak: {jarak}')

X_train_scaled:
 [[ 1. -1.]
 [-1.  1.]]

X_new_scaled:
 [[-0.2 -1. ]]

Jarak: [1.2000000000000026, 2.1540659228538006]


After standardization, the distances are the same (up to floating-point rounding) whether height is given in meters or millimeters.
Applying a standard scaler therefore gives consistent results regardless of the units of the features.
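The reason is visible in the z-score formula itself, z = (x - mean) / std: multiplying a feature by a constant multiplies its mean and standard deviation by the same constant, so the factor cancels. The sketch below reproduces the scaled values with plain NumPy; the helper `standardize` is hypothetical and assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np

def standardize(X_train, X_new):
    """Z-score X_new using the training mean and population std (ddof=0)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # ddof=0, the convention StandardScaler uses
    return (X_new - mean) / std

mm = standardize(np.array([[1700.0, 0], [1600.0, 1]]), np.array([[1640.0, 0]]))
m = standardize(np.array([[1.7, 0], [1.6, 1]]), np.array([[1.64, 0]]))

print(mm, m)
print(np.allclose(mm, m))  # True
```

Note that the new point is scaled with the training mean and standard deviation, never with statistics computed from the new data itself.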

# Applying Feature Scaling to KNN
## Dataset

In [48]:
# Training set
X_train = np.array([[158, 0], [170, 0], [183, 0], [191, 0], [155, 1],
                    [163, 1], [180, 1], [158, 1], [170, 1]])
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67])

# Test set
X_test = np.array([[168, 0], [180, 0], [160, 1], [169, 1]])
y_test = np.array([65, 96, 52, 67])

## Feature Scaling (Standard Scaler)

In [50]:
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

print(f'X_train_scaled: {X_train_scaled}\n')
print(f'X_test_scaled: {X_test_scaled}')

X_train_scaled: [[-0.9908706  -1.11803399]
 [ 0.01869567 -1.11803399]
 [ 1.11239246 -1.11803399]
 [ 1.78543664 -1.11803399]
 [-1.24326216  0.89442719]
 [-0.57021798  0.89442719]
 [ 0.86000089  0.89442719]
 [-0.9908706   0.89442719]
 [ 0.01869567  0.89442719]]

X_test_scaled: [[-0.14956537 -1.11803399]
 [ 0.86000089 -1.11803399]
 [-0.82260955  0.89442719]
 [-0.06543485  0.89442719]]


## Training & Evaluating the Model

In [51]:
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

MAE = mean_absolute_error(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)

print(f'MAE: {MAE}')
print(f'MSE: {MSE}')

MAE: 7.583333333333336
MSE: 85.13888888888893
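Scaling and model fitting can also be bundled so the scaler is never accidentally fitted on test data. The following is one possible sketch using scikit-learn's `make_pipeline`, with the training and test sets from this section; the names `pipeline`, `y_pred_pipeline`, and `y_pred_manual` are introduced here for illustration. It compares the pipeline with the manual two-step route used above rather than asserting specific metric values.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[158, 0], [170, 0], [183, 0], [191, 0], [155, 1],
                    [163, 1], [180, 1], [158, 1], [170, 1]])
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67])
X_test = np.array([[168, 0], [180, 0], [160, 1], [169, 1]])

# The pipeline fits the scaler on the training data only, then reuses the
# same mean and standard deviation whenever it predicts
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)

# Manual route from the cells above, for comparison
ss = StandardScaler()
knn = KNeighborsRegressor(n_neighbors=3).fit(ss.fit_transform(X_train), y_train)
y_pred_manual = knn.predict(ss.transform(X_test))

print(np.allclose(y_pred_pipeline, y_pred_manual))
```

A pipeline like this can be passed as a single estimator to tools such as cross-validation, which keeps the scaling step inside each training fold.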
