# Diabetes Risk Prediction with K-Nearest Neighbors (KNN)

##### Source Dataset: https://www.kaggle.com/datasets/mathchi/diabetes-data-set/data

## 1. Data and Preprocessing

In [59]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [60]:
data = pd.read_csv("dataset/diabetes.csv")

### - Displaying the data

In [61]:
data

Index,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


### - Displaying the data type of each column

In [62]:
print(data.dtypes)

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object


### - Selecting the columns used for prediction (Glucose, BMI, Age, DiabetesPedigreeFunction)

In [63]:
kolom_dipilih = ['Glucose', 'BMI', 'Age', 'DiabetesPedigreeFunction', 'Outcome']
data = data[kolom_dipilih]

### - Descriptive statistics

In [64]:
data[kolom_dipilih].describe()

Statistic,Glucose,BMI,Age,DiabetesPedigreeFunction,Outcome
count,768.0,768.0,768.0,768.0,768.0
mean,120.894531,31.992578,33.240885,0.471876,0.348958
std,31.972618,7.88416,11.760232,0.331329,0.476951
min,0.0,0.0,21.0,0.078,0.0
25%,99.0,27.3,24.0,0.24375,0.0
50%,117.0,32.0,29.0,0.3725,0.0
75%,140.25,36.6,41.0,0.62625,1.0
max,199.0,67.1,81.0,2.42,1.0


### - Data Cleaning

In [65]:
kolom_tidak_valid = ['Glucose', 'BMI']
# Take an explicit copy so the assignment below does not target a slice of the original frame
data = data.copy()
for kolom in kolom_tidak_valid:
    # A value of 0 is not a plausible measurement for these columns, so mark it as missing
    data[kolom] = data[kolom].replace(0, np.nan)
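The descriptive statistics report a minimum of 0.0 for both Glucose and BMI, which is why zeros in those two columns are treated as missing. The following is a minimal sketch of that step on a small made-up frame (the name `toy` and its values are illustrative assumptions, not rows from the dataset). It copies the column subset first, which is one common way to avoid pandas' `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Explicit copy of the selected columns before modifying them
subset = toy[["Glucose", "BMI", "Outcome"]].copy()

invalid_cols = ["Glucose", "BMI"]
zero_counts = (subset[invalid_cols] == 0).sum()   # zeros per column before cleaning
subset[invalid_cols] = subset[invalid_cols].replace(0, np.nan)
nan_counts = subset[invalid_cols].isna().sum()    # missing values per column after cleaning

print(zero_counts)
print(nan_counts)
```

Counting the zeros before replacing them gives a quick check that the number of NaN values afterwards matches what was expected. The `Outcome` column is left alone because 0 is a valid class label there.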


#### Filling NaN values with the median

In [66]:
data = data.fillna(data.median(numeric_only=True))
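To illustrate what the median fill does, here is a small sketch on made-up values (the frame `toy` is an assumption for illustration). `DataFrame.median` skips NaN by default, and passing the resulting Series to `fillna` fills each column with its own median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
})

medians = toy.median(numeric_only=True)  # NaN values are ignored when computing the median
filled = toy.fillna(medians)             # each column is filled with its own median

print(medians)
print(filled)
```

The median is less sensitive to extreme values than the mean, which is a common reason for preferring it on skewed clinical measurements.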

#### Separating features and target

In [67]:
X = data.drop('Outcome', axis=1)

In [68]:
y = data['Outcome']

In [69]:
# The median imputation already ran in In [66]; confirm that no missing values remain
print(data.isnull().sum())

Glucose                     0
BMI                         0
Age                         0
DiabetesPedigreeFunction    0
Outcome                     0
dtype: int64


### - Feature Scaling

In [70]:
from sklearn.preprocessing import StandardScaler

#### Creating the scaler and scaling the features

In [71]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
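`StandardScaler` centers each column on its mean and divides by its population standard deviation, z = (x - mean) / std. Scaling matters for KNN because distances would otherwise be dominated by the features with the largest numeric range. The sketch below checks the formula against scikit-learn on a small array (`X_toy` and its values are illustrative assumptions).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: two columns on different scales
X_toy = np.array([
    [148.0, 33.6],
    [85.0, 26.6],
    [183.0, 23.3],
    [89.0, 28.1],
])

scaler_toy = StandardScaler()
scaled = scaler_toy.fit_transform(X_toy)

# Manual z-score; np.std uses ddof=0 (population standard deviation) by default
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

match = np.allclose(scaled, manual)
print(match)
```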

#### Building a DataFrame from the scaled values

In [75]:
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

In [76]:
import joblib
joblib.dump(scaler, 'scaler_knn.pkl')

['scaler_knn.pkl']
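The saved scaler is meant to be reloaded later so that new inputs are transformed with the same means and standard deviations. The following round-trip sketch assumes a toy scaler and a temporary file (the file name `scaler_toy.pkl` and the input values are hypothetical).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features used only to fit a scaler for this demonstration
X_toy = np.array([[148.0, 33.6], [85.0, 26.6], [183.0, 23.3]])
scaler_toy = StandardScaler().fit(X_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler_toy.pkl")  # hypothetical file name
    joblib.dump(scaler_toy, path)               # persist the fitted scaler
    restored = joblib.load(path)                # load it back from disk

# A new observation must go through transform(), not fit_transform()
new_row = np.array([[120.0, 30.0]])
same = np.allclose(scaler_toy.transform(new_row), restored.transform(new_row))
print(same)
```

Calling `transform` on the restored object reuses the statistics learned at fit time, whereas calling `fit_transform` on a single new row would recompute them from that row alone.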

#### Displaying the result

In [77]:
print("Data after scaling:")
print(X_scaled_df.head())
print("\nDescriptive statistics after scaling:")
print(X_scaled_df.describe())

Data after scaling:
    Glucose       BMI       Age  DiabetesPedigreeFunction
0  0.866045  0.166619  1.425995                  0.468492
1 -1.205066 -0.852200 -0.190672                 -0.365061
2  2.016662 -1.332500 -0.105584                  0.604397
3 -1.073567 -0.633881 -1.041549                 -0.920763
4  0.504422  1.549303 -0.020496                  5.484909

Descriptive statistics after scaling:
            Glucose           BMI           Age  DiabetesPedigreeFunction
count  7.680000e+02  7.680000e+02  7.680000e+02              7.680000e+02
mean   4.625929e-18  2.613650e-16  1.931325e-16              2.451743e-16
std    1.000652e+00  1.000652e+00  1.000652e+00              1.000652e+00
min   -2.552931e+00 -2.074783e+00 -1.041549e+00             -1.189553e+00
25%   -7.201630e-01 -7.212087e-01 -7.862862e-01             -6.889685e-01
50%   -1.530732e-01 -2.258989e-02 -3.608474e-01             -3.001282e-01
75%    6.112653e-01  6.032562e-01  6.602056e-01              4.662269e-01
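The reported `std` of 1.000652 rather than exactly 1 is expected: `StandardScaler` divides by the population standard deviation (ddof=0), while pandas' `describe()` reports the sample standard deviation (ddof=1). For n = 768 rows the ratio is sqrt(768 / 767) ≈ 1.000652. A short sketch that reproduces this on random data (the random column `x` is an assumption, only the row count matches the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 768  # same number of rows as the dataset
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"x": rng.normal(size=n)})  # hypothetical random feature

scaled = pd.DataFrame(StandardScaler().fit_transform(X_toy), columns=X_toy.columns)

pop_std = scaled["x"].std(ddof=0)     # what StandardScaler normalizes to 1
sample_std = scaled["x"].std(ddof=1)  # what describe() reports
expected = np.sqrt(n / (n - 1))

print(pop_std, sample_std, expected)
```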

### - Feature Selection

#### Computing the correlation between features

In [80]:
correlation = X.corr()
print(correlation)

                           Glucose       BMI       Age  \
Glucose                   1.000000  0.231049  0.266909   
BMI                       0.231049  1.000000  0.025597   
Age                       0.266909  0.025597  1.000000   
DiabetesPedigreeFunction  0.137327  0.153438  0.033561   

                          DiabetesPedigreeFunction  
Glucose                                   0.137327  
BMI                                       0.153438  
Age                                       0.033561  
DiabetesPedigreeFunction                  1.000000  
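One way to read this matrix for feature selection is to look for pairs of features that are so strongly correlated that one of them is redundant. The sketch below rebuilds the matrix from the values printed above and lists every pair whose absolute correlation exceeds a cut-off. The 0.8 threshold is an assumption chosen for illustration, not a value taken from the notebook.

```python
from itertools import combinations

import pandas as pd

cols = ["Glucose", "BMI", "Age", "DiabetesPedigreeFunction"]

# Correlation values copied from the printed output above
corr = pd.DataFrame(
    [
        [1.000000, 0.231049, 0.266909, 0.137327],
        [0.231049, 1.000000, 0.025597, 0.153438],
        [0.266909, 0.025597, 1.000000, 0.033561],
        [0.137327, 0.153438, 0.033561, 1.000000],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.8  # assumed cut-off for "redundant" features

# Each unordered pair of distinct features and its correlation
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(cols, 2)}

strongest_pair = max(pairs, key=lambda p: abs(pairs[p]))
redundant = [p for p, r in pairs.items() if abs(r) > threshold]

print(strongest_pair, pairs[strongest_pair])
print(redundant)
```

With these values the strongest pair is Glucose and Age at about 0.27, well below the assumed cut-off, so none of the four features would be dropped as redundant under this criterion.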


## 2. Train/Test Split