# Theory
kNN is one of the simplest classification algorithms. A new data point is assigned to the class that dominates among its K nearest neighbors, where the K nearest neighbors are the K sample points with the smallest distance to the new point.

The kNN algorithm proceeds as follows:
1. Choose the parameter `K`, a positive integer giving the number of nearest neighbors.
2. Compute the distance from the new data point to every sample. Any one of the following three metrics can be used:
   - Euclidean distance

     <img src="assets/euclid.png">
   - City block (Manhattan) distance

     <img src="assets/city.png">
   - Minkowski distance

     <img src="assets/minkowski.png">
3. Sort the samples by their distance to the new data point (nearest first) and take the K samples with the smallest distance (the nearest neighbors).
4. Group those K nearest neighbors by their class attribute.
5. Assign the new data point to the class with the most nearest neighbors (the majority).
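
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own implementation: the function names `minkowski_distance` and `knn_predict` and the toy data are assumptions made for the example. It relies on the fact that the Minkowski distance with `p = 1` is the Manhattan distance and with `p = 2` is the Euclidean distance.

```python
from collections import Counter

import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def knn_predict(samples, labels, new_point, k=3, p=2):
    """Classify new_point by majority vote among its k nearest samples."""
    # Step 2: distance from the new point to every sample
    distances = [minkowski_distance(s, new_point, p) for s in samples]
    # Step 3: indices of the k samples with the smallest distance
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: group the neighbors by class and take the majority
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters labelled "g" and "h"
samples = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
labels = ["g", "g", "g", "h", "h", "h"]

print(minkowski_distance([0, 0], [3, 4], p=2))  # Euclidean: 5.0
print(minkowski_distance([0, 0], [3, 4], p=1))  # Manhattan: 7.0
print(knn_predict(samples, labels, [2.0, 2.0], k=3))  # g
```

With two classes, an odd `K` avoids tied votes, which is why the practice section below tries K = 5, 7 and 9.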





# Practice

## 0) Dataset Info

1. Title of Database: MAGIC gamma telescope data 2004

2. Sources:

   (a) Original owner of the database:

       R. K. Bock
       Major Atmospheric Gamma Imaging Cherenkov Telescope project (MAGIC)
       http://wwwmagic.mppmu.mpg.de
       rkb@mail.cern.ch

   (b) Donor:

       P. Savicky
       Institute of Computer Science, AS of CR
       Czech Republic
       savicky@cs.cas.cz

   (c) Date received: May 2007

3. Past Usage:

   (a) Bock, R.K., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T.,
       Jirina, M., Klaschka, J., Kotrc, E., Savicky, P., Towers, S.,
       Vaicilius, A., Wittek W. (2004).
       Methods for multidimensional event classification: a case study
       using images from a Cherenkov gamma-ray telescope.
       Nucl.Instr.Meth. A, 516, pp. 511-528.

   (b) P. Savicky, E. Kotrc.
       Experimental Study of Leaf Confidences for Random Forest.
       Proceedings of COMPSTAT 2004, In: Computational Statistics.
       (Ed.: Antoch J.) - Heidelberg, Physica Verlag 2004, pp. 1767-1774.

   (c) J. Dvorak, P. Savicky.
       Softening Splits in Decision Trees Using Simulated Annealing.
       Proceedings of ICANNGA 2007, Warsaw, (Ed.: Beliczynski et. al),
       Part I, LNCS 4431, pp. 721-729.

4. Relevant Information:

   The data are MC generated (see below) to simulate registration of high energy
   gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the
   imaging technique. Cherenkov gamma telescope observes high energy gamma rays,
   taking advantage of the radiation emitted by charged particles produced
   inside the electromagnetic showers initiated by the gammas, and developing in the
   atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks
   through the atmosphere and gets recorded in the detector, allowing reconstruction
   of the shower parameters. The available information consists of pulses left by
   the incoming Cherenkov photons on the photomultiplier tubes, arranged in a
   plane, the camera. Depending on the energy of the primary gamma, a total of
   few hundreds to some 10000 Cherenkov photons get collected, in patterns
   (called the shower image), allowing to discriminate statistically those
   caused by primary gammas (signal) from the images of hadronic showers
   initiated by cosmic rays in the upper atmosphere (background).

   Typically, the image of a shower after some pre-processing is an elongated
   cluster. Its long axis is oriented towards the camera center if the shower axis
   is parallel to the telescope's optical axis, i.e. if the telescope axis is
   directed towards a point source. A principal component analysis is performed
   in the camera plane, which results in a correlation axis and defines an ellipse.
   If the depositions were distributed as a bivariate Gaussian, this would be
   an equidensity ellipse. The characteristic parameters of this ellipse
   (often called Hillas parameters) are among the image parameters that can be
   used for discrimination. The energy depositions are typically asymmetric
   along the major axis, and this asymmetry can also be used in discrimination.
   There are, in addition, further discriminating characteristics, like the
   extent of the cluster in the image plane, or the total sum of depositions.

   The data set was generated by a Monte Carlo program, Corsika, described in 
      D. Heck et al., CORSIKA, A Monte Carlo code to simulate extensive air showers,
      Forschungszentrum Karlsruhe FZKA 6019 (1998).
   The program was run with parameters allowing to observe events with energies down
   to below 50 GeV.

5. Number of Instances: 19020

6. Number of Attributes: 11 (including the class)

7. Attribute information:

    1.  fLength:  continuous  # major axis of ellipse [mm]
    2.  fWidth:   continuous  # minor axis of ellipse [mm] 
    3.  fSize:    continuous  # 10-log of sum of content of all pixels [in #phot]
    4.  fConc:    continuous  # ratio of sum of two highest pixels over fSize  [ratio]
    5.  fConc1:   continuous  # ratio of highest pixel over fSize  [ratio]
    6.  fAsym:    continuous  # distance from highest pixel to center, projected onto major axis [mm]
    7.  fM3Long:  continuous  # 3rd root of third moment along major axis  [mm] 
    8.  fM3Trans: continuous  # 3rd root of third moment along minor axis  [mm]
    9.  fAlpha:   continuous  # angle of major axis with vector to origin [deg]
   10.  fDist:    continuous  # distance from origin to center of ellipse [mm]
   11.  class:    g,h         # gamma (signal), hadron (background)

8. Missing Attribute Values: None

9. Class Distribution:

   g = gamma (signal):     12332
   h = hadron (background): 6688

   For technical reasons, the number of h events is underestimated.
   In the real data, the h class represents the majority of the events.

   The simple classification accuracy is not meaningful for this data, since
   classifying a background event as signal is worse than classifying a signal
   event as background. For comparison of different classifiers an ROC curve
   has to be used. The relevant points on this curve are those, where the
   probability of accepting a background event as signal is below one of the
   following thresholds: 0.01, 0.02, 0.05, 0.1, 0.2 depending on the required
   quality of the sample of the accepted events for different experiments.
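
The dataset notes above ask for classifiers to be compared at fixed false-positive rates instead of by plain accuracy. A minimal sketch of that idea follows. The helper name `tpr_at_fpr`, the toy labels and the toy scores are all assumptions for illustration; a score here is any number that is higher when the classifier is more confident the event is a gamma.

```python
import numpy as np


def tpr_at_fpr(y_true, scores, max_fpr):
    """Best true-positive rate reachable while keeping the FPR <= max_fpr."""
    y_true = np.asarray(y_true, dtype=bool)   # True = gamma (signal)
    scores = np.asarray(scores, dtype=float)
    best = 0.0
    # Try every observed score as the acceptance cut
    for cut in np.unique(scores):
        accepted = scores >= cut
        fpr = np.sum(accepted & ~y_true) / np.sum(~y_true)
        tpr = np.sum(accepted & y_true) / np.sum(y_true)
        if fpr <= max_fpr:
            best = max(best, tpr)
    return best


# Hypothetical toy data: 4 gamma events (1) and 4 hadron events (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.05]

for threshold in (0.01, 0.02, 0.05, 0.1, 0.2):
    print(threshold, tpr_at_fpr(y_true, scores, threshold))
```

For a kNN classifier, one possible score is the fraction of the K nearest neighbors that belong to class `g`.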



## 1) Importing Libraries

In [19]:
import pandas as pd
import numpy as np
import copy
from sklearn.preprocessing import StandardScaler
from IPython.display import display

## 2) Loading the Dataset

In [71]:
# Loading the UCI MAGIC Gamma Telescope dataset
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv("dataset/magic04.data", names=cols)
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19020 entries, 0 to 19019
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   fLength   19020 non-null  float64
 1   fWidth    19020 non-null  float64
 2   fSize     19020 non-null  float64
 3   fConc     19020 non-null  float64
 4   fConc1    19020 non-null  float64
 5   fAsym     19020 non-null  float64
 6   fM3Long   19020 non-null  float64
 7   fM3Trans  19020 non-null  float64
 8   fAlpha    19020 non-null  float64
 9   fDist     19020 non-null  float64
 10  class     19020 non-null  object 
dtypes: float64(10), object(1)
memory usage: 1.6+ MB
None


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
2,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g


In [72]:
# Calculating the class distribution
df['class'].value_counts()

class
g    12332
h     6688
Name: count, dtype: int64

In [18]:
# Basic Info
df.describe()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist
count,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0
mean,53.250154,22.180966,2.825017,0.380327,0.214657,-4.331745,10.545545,0.249726,27.645707,193.818026
std,42.364855,18.346056,0.472599,0.182813,0.110511,59.206062,51.000118,20.827439,26.103621,74.731787
min,4.2835,0.0,1.9413,0.0131,0.0003,-457.9161,-331.78,-205.8947,0.0,1.2826
25%,24.336,11.8638,2.4771,0.2358,0.128475,-20.58655,-12.842775,-10.849375,5.547925,142.49225
50%,37.1477,17.1399,2.7396,0.35415,0.1965,4.01305,15.3141,0.6662,17.6795,191.85145
75%,70.122175,24.739475,3.1016,0.5037,0.285225,24.0637,35.8378,10.946425,45.88355,240.563825
max,334.177,256.382,5.3233,0.893,0.6752,575.2407,238.321,179.851,90.0,495.561


In [50]:
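# Take the first row (without the class label) as the query sample.
# Caution: if this cell runs after the 'Distances' column has been inserted
# (as in the output below), that column is included as an extra trailing value.
sample = df.iloc[0, :-1].values
print(sample)
df.iloc[0]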

[28.7967 16.0021 2.6449 0.3918 0.1982 27.7004 22.011 -8.2027 40.092
 81.8828 0.0]


fLength      28.7967
fWidth       16.0021
fSize         2.6449
fConc         0.3918
fConc1        0.1982
fAsym        27.7004
fM3Long       22.011
fM3Trans     -8.2027
fAlpha        40.092
fDist        81.8828
Distances        0.0
class              g
Name: 0, dtype: object

In [51]:
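# Feature matrix: every column except the class label (the output below also
# contains 'Distances' because the cell ran after that column was inserted)
features = df[df.columns[:-1]].values
features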

array([[ 28.7967    ,  16.0021    ,   2.6449    , ...,  40.092     ,
         81.8828    ,   0.        ],
       [ 31.6036    ,  11.7235    ,   2.5185    , ...,   6.3609    ,
        205.261     , 128.04149583],
       [162.052     , 136.031     ,   4.0612    , ...,  76.96      ,
        256.788     , 284.53750181],
       ...,
       [ 75.4455    ,  47.5305    ,   3.4483    , ...,  30.2987    ,
        256.5166    , 188.41775972],
       [120.5135    ,  76.9018    ,   3.9939    , ...,  84.6874    ,
        408.3166    , 370.93562505],
       [187.1814    ,  53.0014    ,   3.2093    , ...,  52.731     ,
        272.3174    , 372.50780741]])

In [35]:
# Euclidean distance from the sample to every row
distances = []
for x in features:
    distances.append(np.linalg.norm(x - sample))
distances

[0.0,
 128.04149583322587,
 284.5375018093221,
 54.51574300410845,
 284.7272625790337,
 166.10785141347174,
 146.3865289331638,
 160.27753132120546,
 218.72145929375563,
 47.59534410055252,
 257.9250423397852,
 91.49968496169811,
 123.1865387557017,
 74.96896577958108,
 302.4548489336549,
 134.6936684863472,
 129.4831945367815,
 31.534849542054268,
 109.75508779929064,
 148.43609998059773,
 251.71025040128978,
 249.665545432084,
 230.4847622826073,
 22.900144826834605,
 277.24879753843845,
 250.77944487148466,
 47.55987737957279,
 29.707568020960586,
 110.50910685332678,
 65.39097440320032,
 69.67955554364566,
 303.1076265329858,
 83.21382591613005,
 134.12368811932512,
 229.0176314226265,
 201.68281112241567,
 98.19003893002588,
 63.09801974388736,
 104.38408083754918,
 307.41154777317655,
 234.4360517406826,
 97.43819453161063,
 144.84420871505358,
 48.17554896760389,
 165.8065575156785,
 57.94271842811657,
 115.51219877450173,
 93.4631185237792,
 237.81561875398344,
 99.538610517225

In [36]:
df.insert(10, 'Distances', distances)
df.head()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,0.0,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,128.041496,g
2,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,284.537502,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,54.515743,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,284.727263,g
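
The row-by-row loop used to fill `distances` can also be written as a single vectorized NumPy call. The sketch below uses a tiny hypothetical matrix in place of the notebook's `features` and `sample`, and checks that both versions agree.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's `features` matrix and `sample` row
features = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
sample = features[0]

# Loop version, one row at a time
loop_distances = [np.linalg.norm(x - sample) for x in features]

# Vectorized version: broadcasting subtracts `sample` from every row,
# and axis=1 takes the norm of each row separately
vector_distances = np.linalg.norm(features - sample, axis=1)

print(vector_distances.tolist())  # [0.0, 5.0, 10.0]
```

On a matrix with 19020 rows the vectorized form avoids a Python-level loop, which usually makes it noticeably faster.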


In [38]:
df_sorted = df.sort_values(by='Distances')
df_sorted

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.0110,-8.2027,40.0920,81.8828,0.000000,g
17788,33.2724,11.1166,2.7903,0.4457,0.2796,29.6062,16.6947,-7.4326,36.4037,84.3813,9.811468,h
16368,34.8773,10.9165,2.7528,0.4170,0.2677,31.9997,25.0411,-7.3443,34.4381,86.9540,12.204103,h
16559,24.8882,7.8008,2.4232,0.6792,0.4472,25.3809,14.8384,-9.8301,38.8998,77.8319,12.650584,h
6038,21.7287,11.3452,2.3598,0.5939,0.3079,20.9225,20.1144,-9.9040,35.7407,75.2580,13.675918,g
...,...,...,...,...,...,...,...,...,...,...,...,...
12636,303.5676,38.3100,3.5777,0.1476,0.0812,-449.9526,200.3148,-19.2293,71.5012,86.0291,580.575217,h
15355,276.2766,35.8288,3.5838,0.0604,0.0572,-457.9161,-169.1552,15.1931,68.1382,187.0187,588.554009,h
18979,265.5424,69.4251,3.2266,0.2337,0.1531,464.6310,-287.3636,42.0901,74.8054,145.7859,594.425684,h
12663,265.5827,76.4762,3.1605,0.2898,0.1615,473.0654,-284.7038,40.4617,77.5514,175.0569,603.822656,h


In [44]:
# Drop row 0: it is the query sample itself (distance 0), not a neighbor
df_sorted.drop(0, inplace=True)
df_sorted

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
17788,33.2724,11.1166,2.7903,0.4457,0.2796,29.6062,16.6947,-7.4326,36.4037,84.3813,9.811468,h
16368,34.8773,10.9165,2.7528,0.4170,0.2677,31.9997,25.0411,-7.3443,34.4381,86.9540,12.204103,h
16559,24.8882,7.8008,2.4232,0.6792,0.4472,25.3809,14.8384,-9.8301,38.8998,77.8319,12.650584,h
6038,21.7287,11.3452,2.3598,0.5939,0.3079,20.9225,20.1144,-9.9040,35.7407,75.2580,13.675918,g
17042,30.2128,22.8856,2.6758,0.3291,0.2120,16.9338,25.8419,-12.6206,39.7337,83.5209,14.223855,h
...,...,...,...,...,...,...,...,...,...,...,...,...
12636,303.5676,38.3100,3.5777,0.1476,0.0812,-449.9526,200.3148,-19.2293,71.5012,86.0291,580.575217,h
15355,276.2766,35.8288,3.5838,0.0604,0.0572,-457.9161,-169.1552,15.1931,68.1382,187.0187,588.554009,h
18979,265.5424,69.4251,3.2266,0.2337,0.1531,464.6310,-287.3636,42.0901,74.8054,145.7859,594.425684,h
12663,265.5827,76.4762,3.1605,0.2898,0.1615,473.0654,-284.7038,40.4617,77.5514,175.0569,603.822656,h


In [80]:
K = 5
df_k = df_sorted[:K]
print("Highest class count: {}".format(df_k['class'].value_counts().idxmax()))
df_k

Highest class count: h


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
17788,33.2724,11.1166,2.7903,0.4457,0.2796,29.6062,16.6947,-7.4326,36.4037,84.3813,9.811468,h
16368,34.8773,10.9165,2.7528,0.417,0.2677,31.9997,25.0411,-7.3443,34.4381,86.954,12.204103,h
16559,24.8882,7.8008,2.4232,0.6792,0.4472,25.3809,14.8384,-9.8301,38.8998,77.8319,12.650584,h
6038,21.7287,11.3452,2.3598,0.5939,0.3079,20.9225,20.1144,-9.904,35.7407,75.258,13.675918,g
17042,30.2128,22.8856,2.6758,0.3291,0.212,16.9338,25.8419,-12.6206,39.7337,83.5209,14.223855,h


In [73]:
K = 7
df_k = df_sorted[:K]
print("Highest class: {}".format(df_k['class'].value_counts().idxmax()))
df_k

Highest class: h


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
17788,33.2724,11.1166,2.7903,0.4457,0.2796,29.6062,16.6947,-7.4326,36.4037,84.3813,9.811468,h
16368,34.8773,10.9165,2.7528,0.417,0.2677,31.9997,25.0411,-7.3443,34.4381,86.954,12.204103,h
16559,24.8882,7.8008,2.4232,0.6792,0.4472,25.3809,14.8384,-9.8301,38.8998,77.8319,12.650584,h
6038,21.7287,11.3452,2.3598,0.5939,0.3079,20.9225,20.1144,-9.904,35.7407,75.258,13.675918,g
17042,30.2128,22.8856,2.6758,0.3291,0.212,16.9338,25.8419,-12.6206,39.7337,83.5209,14.223855,h
8116,21.7358,10.5355,2.2695,0.5753,0.293,17.5263,16.8871,-4.7663,41.1555,80.2342,15.011597,g
3745,23.12,13.0592,2.3847,0.466,0.2577,24.7351,14.5916,2.8713,42.462,77.5756,15.862067,g


In [79]:
K = 9
df_k = df_sorted[:K]
print("Highest class: {}".format(df_k['class'].value_counts().idxmax()))
df_k

Highest class: h


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
17788,33.2724,11.1166,2.7903,0.4457,0.2796,29.6062,16.6947,-7.4326,36.4037,84.3813,9.811468,h
16368,34.8773,10.9165,2.7528,0.417,0.2677,31.9997,25.0411,-7.3443,34.4381,86.954,12.204103,h
16559,24.8882,7.8008,2.4232,0.6792,0.4472,25.3809,14.8384,-9.8301,38.8998,77.8319,12.650584,h
6038,21.7287,11.3452,2.3598,0.5939,0.3079,20.9225,20.1144,-9.904,35.7407,75.258,13.675918,g
17042,30.2128,22.8856,2.6758,0.3291,0.212,16.9338,25.8419,-12.6206,39.7337,83.5209,14.223855,h
8116,21.7358,10.5355,2.2695,0.5753,0.293,17.5263,16.8871,-4.7663,41.1555,80.2342,15.011597,g
3745,23.12,13.0592,2.3847,0.466,0.2577,24.7351,14.5916,2.8713,42.462,77.5756,15.862067,g
18572,28.9643,22.8684,2.6415,0.3632,0.1945,30.6721,29.7985,-4.9916,45.6028,91.8297,16.0085,h
11775,25.643,13.8085,2.4409,0.4058,0.2047,28.4416,20.009,-7.4265,33.9455,67.7682,16.029998,g
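
The three votes above can be reproduced from the neighbor labels alone. The sketch below copies the class column of the K = 9 table, in order of increasing distance, and counts the votes for each K; the variable name `neighbor_labels` is an assumption for the example.

```python
from collections import Counter

# Class labels of the nine nearest neighbors, in order of increasing
# distance, copied from the K = 9 table above
neighbor_labels = ["h", "h", "h", "g", "h", "g", "g", "h", "g"]

for k in (5, 7, 9):
    votes = Counter(neighbor_labels[:k])
    winner, count = votes.most_common(1)[0]
    print(f"K={k}: votes={dict(votes)} -> {winner}")
```

The margin shrinks from 4:1 at K = 5 to 5:4 at K = 9, and the predicted class `h` disagrees with the sample's actual label `g`.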


**With standardization**

In [88]:
# Separate numerical features and categorical label
numerical_features = df.iloc[:, :-1]  # All columns except the last (label)
categorical_label = df.iloc[:, -1]

# Create a StandardScaler object
scaler = StandardScaler()

# Standardize the numerical features
scaled_features = scaler.fit_transform(numerical_features)

# Reconstruct the DataFrame with standardized features and original label
df_standardized = pd.DataFrame(scaled_features, columns=numerical_features.columns)
df_standardized['class'] = categorical_label
df_standardized.head()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,-0.577226,-0.336804,-0.38113,0.062759,-0.148923,0.541042,0.224818,-0.405842,0.476816,-1.497866,g
1,-0.510969,-0.570027,-0.648595,0.820383,1.471776,0.516919,0.260364,-0.490094,-0.815418,0.153125,g
2,2.568278,6.205858,2.615783,-1.875883,-1.773241,2.044992,-1.478536,-2.18303,1.889224,0.842635,g
3,-0.694768,-0.687259,-1.029478,1.282069,1.606608,0.532771,-0.333515,-0.355359,-0.658804,-1.031463,g
4,0.516622,0.476384,0.711157,-0.347506,-0.28466,-0.0202,0.353086,1.03662,-0.881039,2.176427,g


In [89]:
df_standardized.describe()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist
count,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0,19020.0
mean,7.172671000000001e-17,-1.195445e-16,-2.24146e-16,-3.025971e-16,1.128201e-16,-5.9772260000000005e-18,4.1840580000000006e-17,-9.712992e-18,1.195445e-17,-8.965839e-18
std,1.000026,1.000026,1.000026,1.000026,1.000026,1.000026,1.000026,1.000026,1.000026,1.000026
min,-1.155862,-1.209064,-1.869959,-2.008809,-1.939745,-7.661315,-6.712427,-9.897993,-1.059103,-2.57642
25%,-0.6825213,-0.562379,-0.7361978,-0.7905934,-0.7798731,-0.2745535,-0.4586055,-0.5329216,-0.8465631,-0.6868179
50%,-0.3800999,-0.2747838,-0.1807437,-0.1431941,-0.1643062,0.1409487,0.09350332,0.01999694,-0.3818041,-0.02631582
75%,0.3982656,0.1394619,0.5852541,0.6748758,0.6385776,0.4796163,0.4959385,0.5136004,0.6986894,0.6255307
max,6.631304,12.76608,5.286407,2.804429,4.167511,9.78933,4.466292,8.623528,2.388785,4.037785
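
The `std` row above reads 1.000026, not exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). The sketch below checks this on a randomly generated column; the random data is a stand-in, only the row count is taken from the dataset.

```python
import numpy as np

n = 19020                                  # number of rows in the dataset
rng = np.random.default_rng(0)
x = rng.normal(50.0, 40.0, size=n)         # hypothetical feature column

# What StandardScaler computes per column: population std (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

sample_std = z.std(ddof=1)                 # what DataFrame.describe() reports
expected = np.sqrt(n / (n - 1))

print(round(float(sample_std), 6), round(float(expected), 6))
```

Both printed values round to 1.000026, matching the table.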


In [90]:
sample = df_standardized.iloc[0, :-1].values
print(sample)
df_standardized.iloc[0]

[-0.5772260242856194 -0.3368041914485611 -0.38113036657157895
 0.06275932712882905 -0.148922706222731 0.541042355831885
 0.22481823665617398 -0.40584193694844856 0.4768158700593327
 -1.4978655495625293]


fLength    -0.577226
fWidth     -0.336804
fSize       -0.38113
fConc       0.062759
fConc1     -0.148923
fAsym       0.541042
fM3Long     0.224818
fM3Trans   -0.405842
fAlpha      0.476816
fDist      -1.497866
class              g
Name: 0, dtype: object

In [91]:
distances = []
features = df_standardized[df_standardized.columns[:-1]].values
for x in features:
    distances.append(np.linalg.norm(x - sample))
distances

[0.0,
 2.7812873782491536,
 9.158241072413437,
 2.635736937561619,
 4.581915385022413,
 3.0143083956923284,
 2.8044156133927416,
 2.6005321067151157,
 6.071510042810718,
 1.4429708024427441,
 4.479829667463676,
 2.1003962959860005,
 2.7795867610882135,
 1.4481948394340696,
 4.916524820935779,
 2.1653366824127973,
 2.684506074767927,
 0.761145900644724,
 2.5976260126331576,
 2.649106278485142,
 6.696573909920667,
 4.329598598669334,
 4.2250832426924685,
 2.3600427347493844,
 4.266678000436825,
 4.953480247795195,
 1.7159597416408465,
 0.7491350328379319,
 2.4780199456756873,
 2.038434139625506,
 2.4072469065386506,
 4.923269715519643,
 1.9241097886226601,
 2.3282201608075184,
 4.178942288149308,
 4.5268692304338884,
 2.4034443956924956,
 1.9685989229304366,
 2.0277687339121537,
 5.065264313006271,
 4.309879121172895,
 2.37880169015694,
 2.3659170126553675,
 3.3403444898659407,
 3.460600880791433,
 2.284784828330354,
 2.7294712018656786,
 3.606480824543858,
 4.366708475228871,
 1.7942261

In [92]:
df_standardized.insert(10, 'Distances', distances)
df_standardized.head()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
0,-0.577226,-0.336804,-0.38113,0.062759,-0.148923,0.541042,0.224818,-0.405842,0.476816,-1.497866,0.0,g
1,-0.510969,-0.570027,-0.648595,0.820383,1.471776,0.516919,0.260364,-0.490094,-0.815418,0.153125,2.781287,g
2,2.568278,6.205858,2.615783,-1.875883,-1.773241,2.044992,-1.478536,-2.18303,1.889224,0.842635,9.158241,g
3,-0.694768,-0.687259,-1.029478,1.282069,1.606608,0.532771,-0.333515,-0.355359,-0.658804,-1.031463,2.635737,g
4,0.516622,0.476384,0.711157,-0.347506,-0.28466,-0.0202,0.353086,1.03662,-0.881039,2.176427,4.581915,g


In [93]:
df_sorted = df_standardized.sort_values(by='Distances')
df_sorted.drop(0, inplace=True)
df_sorted

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
7155,-0.728353,-0.179469,-0.259459,-0.072355,-0.281945,0.641148,0.033384,-0.448373,0.688152,-1.617288,0.453597,g
18572,-0.573270,0.037471,-0.388325,-0.093689,-0.182404,0.591236,0.377518,-0.251661,0.687934,-1.364761,0.526918,h
2397,-0.512104,-0.364609,-0.753126,0.074247,-0.053002,0.694127,0.269188,-0.468232,0.372985,-1.171393,0.547164,g
9625,-0.692981,-0.134477,-0.112185,-0.055397,-0.252083,0.186049,0.121029,-0.391375,0.386328,-1.416027,0.550597,g
11775,-0.651669,-0.456375,-0.812798,0.139342,-0.090103,0.553562,0.185562,-0.368573,0.241344,-1.686741,0.556514,g
...,...,...,...,...,...,...,...,...,...,...,...,...
12990,5.150321,9.155235,1.656381,-1.602920,-1.448377,7.579368,-3.384700,4.339033,1.412021,-0.557462,14.776927,h
15675,4.303252,8.980206,4.982124,-2.005527,-1.877306,-0.736330,-2.435579,7.423829,2.233346,-0.554585,14.854079,h
13496,4.243932,9.767099,3.294601,-1.803129,-1.638409,-3.969385,-4.478100,6.304588,-0.603404,0.646212,15.420724,h
17717,4.972739,12.766078,1.996214,-1.724358,-1.647458,3.797342,0.821574,-4.574068,-0.664052,-2.166479,15.609649,h


In [94]:
K = 5
df_k = df_sorted[:K]
print("Highest class count: {}".format(df_k['class'].value_counts().idxmax()))
df_k

Highest class count: g


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
7155,-0.728353,-0.179469,-0.259459,-0.072355,-0.281945,0.641148,0.033384,-0.448373,0.688152,-1.617288,0.453597,g
18572,-0.57327,0.037471,-0.388325,-0.093689,-0.182404,0.591236,0.377518,-0.251661,0.687934,-1.364761,0.526918,h
2397,-0.512104,-0.364609,-0.753126,0.074247,-0.053002,0.694127,0.269188,-0.468232,0.372985,-1.171393,0.547164,g
9625,-0.692981,-0.134477,-0.112185,-0.055397,-0.252083,0.186049,0.121029,-0.391375,0.386328,-1.416027,0.550597,g
11775,-0.651669,-0.456375,-0.812798,0.139342,-0.090103,0.553562,0.185562,-0.368573,0.241344,-1.686741,0.556514,g


In [95]:
K = 7
df_k = df_sorted[:K]
print("Highest class count: {}".format(df_k['class'].value_counts().idxmax()))
df_k

Highest class count: g


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
7155,-0.728353,-0.179469,-0.259459,-0.072355,-0.281945,0.641148,0.033384,-0.448373,0.688152,-1.617288,0.453597,g
18572,-0.57327,0.037471,-0.388325,-0.093689,-0.182404,0.591236,0.377518,-0.251661,0.687934,-1.364761,0.526918,h
2397,-0.512104,-0.364609,-0.753126,0.074247,-0.053002,0.694127,0.269188,-0.468232,0.372985,-1.171393,0.547164,g
9625,-0.692981,-0.134477,-0.112185,-0.055397,-0.252083,0.186049,0.121029,-0.391375,0.386328,-1.416027,0.550597,g
11775,-0.651669,-0.456375,-0.812798,0.139342,-0.090103,0.553562,0.185562,-0.368573,0.241344,-1.686741,0.556514,g
10854,-0.626591,-0.2375,-0.07727,-0.122134,-0.197788,0.370491,0.185129,-0.353496,0.088086,-1.659959,0.593288,g
17042,-0.543799,0.038409,-0.315745,-0.280223,-0.024045,0.359188,0.299936,-0.617967,0.463089,-1.475945,0.603149,h


In [96]:
K = 9
df_k = df_sorted[:K]
print("Highest class count: {}".format(df_k['class'].value_counts().idxmax()))
df_k

Highest class count: g


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,Distances,class
7155,-0.728353,-0.179469,-0.259459,-0.072355,-0.281945,0.641148,0.033384,-0.448373,0.688152,-1.617288,0.453597,g
18572,-0.57327,0.037471,-0.388325,-0.093689,-0.182404,0.591236,0.377518,-0.251661,0.687934,-1.364761,0.526918,h
2397,-0.512104,-0.364609,-0.753126,0.074247,-0.053002,0.694127,0.269188,-0.468232,0.372985,-1.171393,0.547164,g
9625,-0.692981,-0.134477,-0.112185,-0.055397,-0.252083,0.186049,0.121029,-0.391375,0.386328,-1.416027,0.550597,g
11775,-0.651669,-0.456375,-0.812798,0.139342,-0.090103,0.553562,0.185562,-0.368573,0.241344,-1.686741,0.556514,g
10854,-0.626591,-0.2375,-0.07727,-0.122134,-0.197788,0.370491,0.185129,-0.353496,0.088086,-1.659959,0.593288,g
17042,-0.543799,0.038409,-0.315745,-0.280223,-0.024045,0.359188,0.299936,-0.617967,0.463089,-1.475945,0.603149,h
1534,-0.529537,-0.350508,-0.357431,0.189668,0.044729,0.095742,0.279923,-0.522216,0.266288,-1.26132,0.609736,g
7166,-0.652573,-0.364827,-0.310032,-0.038987,-0.092818,0.192408,0.09597,-0.505761,0.296641,-1.083945,0.613956,g
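
With standardized features the vote goes to `g`, whereas the raw features gave `h`. One way to see why scaling matters is to ask how much each raw feature contributes to a squared Euclidean distance on average. For two independently drawn rows, the expected squared difference on one feature is twice that feature's variance, so each feature's share of the total variance is a rough proxy for its weight in the distance. The sketch below uses the standard deviations from the `df.describe()` output of the raw data; the variable names are assumptions.

```python
import numpy as np

feature_names = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
                 "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist"]
# Standard deviations copied from the df.describe() output of the raw data
raw_std = np.array([42.364855, 18.346056, 0.472599, 0.182813, 0.110511,
                    59.206062, 51.000118, 20.827439, 26.103621, 74.731787])

# Share of the total variance carried by each feature
variance_share = raw_std ** 2 / np.sum(raw_std ** 2)

# Print the features from largest to smallest share
for name, share in sorted(zip(feature_names, variance_share), key=lambda t: -t[1]):
    print(f"{name:9s} {share:.4%}")
```

On the raw scale, `fDist` and `fAsym` dominate, while `fSize`, `fConc` and `fConc1` barely register. After standardization every feature has the same variance and therefore the same average weight.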


---

In [13]:
# Shuffle the dataset
df_shuffled = df.sample(frac=1)
df_shuffled.reset_index(drop=True, inplace=True)
df_shuffled.head(10)

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,13.2353,10.2651,2.1804,0.7987,0.4917,13.0601,6.659,10.9489,25.938,197.826,g
1,42.9187,19.7107,3.0238,0.2887,0.1236,-8.7055,-29.7166,-15.8674,63.1147,102.2467,h
2,20.5434,8.9884,2.4346,0.6287,0.3438,24.4636,9.6247,-3.306,39.6377,235.628,h
3,26.0564,17.3963,2.6848,0.4008,0.2593,-0.9388,14.4535,-9.3874,62.601,122.55,g
4,153.344,40.1891,3.0958,0.2447,0.1504,177.5813,-84.68,-19.4847,2.1684,199.65,h
5,56.6984,11.2842,2.8735,0.3346,0.1884,-18.5883,29.1725,-5.3136,24.6332,120.1266,h
6,39.9538,24.8292,3.3153,0.2385,0.1362,19.2151,33.1294,13.6998,0.139,127.164,g
7,39.0165,23.3996,3.1053,0.237,0.1377,8.3237,-22.4536,-21.3361,3.2298,139.102,g
8,36.6913,28.2534,2.7623,0.3509,0.1858,-28.6774,-34.1515,18.7668,69.125,105.144,g
9,21.138,12.1025,2.6004,0.5521,0.3375,16.3835,16.7364,-4.3067,10.21,97.0489,h


In [16]:
# Splitting the dataset with 80/20 ratio
LENGTH = len(df)
TRAIN_SIZE = .8
df_train = df_shuffled[:int(LENGTH * TRAIN_SIZE)]
df_test = df_shuffled[int(LENGTH * TRAIN_SIZE):]
print("Train dataset: ")
print("Length : {}".format(len(df_train)))
display(df_train.head())
print("\nTest dataset: ")
print("Length : {}".format(len(df_test)))
display(df_test.head())

Train dataset: 
Length : 15216


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,13.2353,10.2651,2.1804,0.7987,0.4917,13.0601,6.659,10.9489,25.938,197.826,g
1,42.9187,19.7107,3.0238,0.2887,0.1236,-8.7055,-29.7166,-15.8674,63.1147,102.2467,h
2,20.5434,8.9884,2.4346,0.6287,0.3438,24.4636,9.6247,-3.306,39.6377,235.628,h
3,26.0564,17.3963,2.6848,0.4008,0.2593,-0.9388,14.4535,-9.3874,62.601,122.55,g
4,153.344,40.1891,3.0958,0.2447,0.1504,177.5813,-84.68,-19.4847,2.1684,199.65,h



Test dataset: 
Length : 3804


Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
15216,32.0093,14.5778,2.659,0.3487,0.1941,-29.9439,-25.0603,10.5972,72.0623,250.414,h
15217,34.7886,19.5066,3.0282,0.2605,0.1429,-19.2971,-17.978,10.9297,14.054,221.171,g
15218,64.3192,26.8663,3.4539,0.1762,0.0891,6.8989,64.5494,-21.575,7.653,245.097,g
15219,105.5703,30.1647,3.2129,0.294,0.1617,-161.2025,78.1421,-23.5357,49.2844,237.1594,h
15220,19.2295,11.4438,2.3655,0.5991,0.3039,-10.422,-18.4222,8.9212,24.262,222.838,g
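
The shuffle above is unseeded, so the rows that land in each split change on every run. A reproducible variant can be sketched with a seeded NumPy generator that shuffles row positions. The seed value and the variable names are arbitrary choices for illustration.

```python
import numpy as np

n_rows = 19020            # number of instances in the dataset
train_fraction = 0.8

rng = np.random.default_rng(seed=42)   # the seed value is an arbitrary choice
shuffled = rng.permutation(n_rows)     # shuffled row positions

# First 80% of the shuffled positions for training, the rest for testing
split = int(n_rows * train_fraction)
train_idx, test_idx = shuffled[:split], shuffled[split:]

print(len(train_idx), len(test_idx))  # 15216 3804
```

The index arrays can then select rows with `df.iloc[train_idx]` and `df.iloc[test_idx]`. Passing `random_state` to `df.sample(frac=1, ...)` is another way to make the shuffle repeatable.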


In [62]:
# Separating the features from the label column
x = df_train[df_train.columns[:-1]].values
y = df_train[df_train.columns[-1:]].values
print(x)
print(y)

[[ 13.2353  10.2651   2.1804 ...  10.9489  25.938  197.826 ]
 [ 42.9187  19.7107   3.0238 ... -15.8674  63.1147 102.2467]
 [ 20.5434   8.9884   2.4346 ...  -3.306   39.6377 235.628 ]
 ...
 [128.1116  85.0757   3.4305 ... -20.4238   2.8948 328.6118]
 [ 42.6021  12.401    2.2516 ...  -8.6619   6.8576 187.296 ]
 [ 26.1559   9.9788   2.3701 ...   5.9681  61.282   95.2881]]
[['g']
 ['h']
 ['h']
 ...
 ['h']
 ['g']
 ['g']]


In [63]:
# Performing standardization
scaler = StandardScaler()
x = scaler.fit_transform(x)
x

array([[-0.94539759, -0.64989048, -1.36460104, ...,  0.514308  ,
        -0.06310968,  0.05446809],
       [-0.24396917, -0.13208123,  0.42389071, ..., -0.77923321,
         1.36016745, -1.22359322],
       [-0.77270481, -0.71987937, -0.82555124, ..., -0.17330739,
         0.46137123,  0.55994652],
       ...,
       [ 1.76916684,  3.45123811,  1.28632799, ..., -0.99902085,
        -0.94529825,  1.80330149],
       [-0.25145053, -0.53280011, -1.2136162 , ..., -0.43166061,
        -0.79358595, -0.0863363 ],
       [-0.6400796 , -0.66558549, -0.96232823, ...,  0.27404853,
         1.29000415, -1.3166418 ]])
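
The scaler above is fitted on the training rows. When the test rows are scaled later, the usual practice is to reuse the training mean and standard deviation so that no information from the test set leaks into preprocessing. The sketch below shows the arithmetic with plain NumPy on randomly generated stand-in matrices; with `StandardScaler` the equivalent is `scaler.fit(x_train)` followed by `scaler.transform(x_test)`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw feature matrices standing in for the train and test splits
x_train = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
x_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Statistics come from the training rows only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)               # ddof=0, as StandardScaler uses

x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std   # reuse the training statistics

print(x_train_scaled.mean(axis=0).round(6), x_test_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 exactly (up to rounding), while the test columns are only close to 0, which is the intended behavior.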

In [None]:
# Separating the features from the label column
features = df.iloc[:, :-1]
label = df.iloc[:, -1]


In [57]:
# Define a range of K values to explore
k_values = range(1, 11, 2)  # Odd values of K from 1 to 9

# Initialize variables to store errors
errors = {k: [] for k in k_values}
errors

{1: [], 3: [], 5: [], 7: [], 9: []}
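
Once each list in `errors` holds one entry per validation row (1 for a misclassification, 0 for a correct prediction), the best K is the one with the lowest mean error. The sketch below uses made-up error lists purely to show the selection step.

```python
# Hypothetical error lists: 1 marks a misclassified validation row, 0 a correct one
errors = {
    1: [1, 0, 1, 0],
    3: [0, 0, 1, 0],
    5: [0, 0, 0, 0],
    7: [0, 1, 0, 0],
    9: [1, 1, 0, 0],
}

# Mean error per K, then the K with the smallest mean error
error_rate = {k: sum(v) / len(v) for k, v in errors.items()}
best_k = min(error_rate, key=error_rate.get)

print(error_rate)
print(best_k)  # 5
```

If several K values tie, `min` returns the first one in dictionary order, which here means the smallest K.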

In [None]:
# K = 7
# df_k = df_sorted[:K]
# df_k['class'].value_counts().idxmax()

In [None]:
# distances = []
# for x in features:
#     distances.append(np.linalg.norm(x - sample))
# distances

In [None]:
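# Finding the optimal K with leave-one-out validation on the training set
# (slow: every pass compares one row against all remaining rows)
y_flat = y.ravel()  # y has shape (n, 1); flatten it to 1-D labels
for i in range(len(df_train)):
    # Validation data: the single held-out row
    val_features = x[i]
    val_label = y_flat[i]

    # Training data: every other row
    train_features = np.delete(x, i, axis=0)
    train_label = np.delete(y_flat, i)

    # Distances do not depend on K, so compute and sort them once per row
    distances = np.linalg.norm(train_features - val_features, axis=1)
    order = np.argsort(distances)

    for k in k_values:
        # Majority vote among the k nearest neighbors
        labels, counts = np.unique(train_label[order[:k]], return_counts=True)
        predicted = labels[np.argmax(counts)]
        # Record 1 for a misclassification, 0 for a correct prediction
        errors[k].append(int(predicted != val_label))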

In [None]:
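def scale_dataset(df, oversample=False):
    # Note: `oversample` is accepted but not acted on in this version
    scaler = StandardScaler()
    x = df[df.columns[:-1]].values
    y = df[df.columns[-1]].values  # Label data (1d array)
    x = scaler.fit_transform(x)
    # Rebuild a DataFrame holding the scaled features plus the label column
    data = pd.DataFrame(x, columns=df.columns[:-1])
    data[df.columns[-1]] = y
    return data, x, y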

In [20]:
x = df_train[df_train.columns[:-1]].values
x

array([[ 13.2353,  10.2651,   2.1804, ...,  10.9489,  25.938 , 197.826 ],
       [ 42.9187,  19.7107,   3.0238, ..., -15.8674,  63.1147, 102.2467],
       [ 20.5434,   8.9884,   2.4346, ...,  -3.306 ,  39.6377, 235.628 ],
       ...,
       [128.1116,  85.0757,   3.4305, ..., -20.4238,   2.8948, 328.6118],
       [ 42.6021,  12.401 ,   2.2516, ...,  -8.6619,   6.8576, 187.296 ],
       [ 26.1559,   9.9788,   2.3701, ...,   5.9681,  61.282 ,  95.2881]])

In [17]:
def find_k(df, k_start, k_end):
    best_k = 0
    scaler = StandardScaler()
    return best_k    

fLength     32.0093
fWidth      14.5778
fSize         2.659
fConc        0.3487
fConc1       0.1941
fAsym      -29.9439
fM3Long    -25.0603
fM3Trans    10.5972
fAlpha      72.0623
fDist       250.414
class             h
Name: 15216, dtype: object

In [None]:
def knn(df, k):
    # Body not written yet; `pass` keeps the cell syntactically valid
    pass

## 3) Applying the K-Nearest Neighbors Algorithm