This dataset is collected from the National Health Insurance Service in Korea. All personal information and sensitive data were excluded.
1. The purpose of this dataset is to classify Drinker or Not.
2. Perform KKN & Decision Tree Classifier.
3. Evaluate Both ML Models by Confusion Matrix and Relevant Performance Parameters.

In [19]:
import numpy as np
import pandas as pd
from pandas.core.dtypes.common import is_numeric_dtype

In [20]:
df = pd.read_csv('/content/smoking_driking_dataset_Ver01.csv')

In [21]:
df.head()

Unnamed: 0,sex,age,height,weight,waistline,sight_left,sight_right,hear_left,hear_right,SBP,...,LDL_chole,triglyceride,hemoglobin,urine_protein,serum_creatinine,SGOT_AST,SGOT_ALT,gamma_GTP,SMK_stat_type_cd,DRK_YN
0,Male,35,170,75,90.0,1.0,1.0,1.0,1.0,120.0,...,126.0,92.0,17.1,1.0,1.0,21.0,35.0,40.0,1.0,Y
1,Male,30,180,80,89.0,0.9,1.2,1.0,1.0,130.0,...,148.0,121.0,15.8,1.0,0.9,20.0,36.0,27.0,3.0,N
2,Male,40,165,75,91.0,1.2,1.5,1.0,1.0,120.0,...,74.0,104.0,15.8,1.0,0.9,47.0,32.0,68.0,1.0,N
3,Male,50,175,80,91.0,1.5,1.2,1.0,1.0,145.0,...,104.0,106.0,17.6,1.0,1.1,29.0,34.0,18.0,1.0,N
4,Male,50,165,60,80.0,1.0,1.2,1.0,1.0,138.0,...,117.0,104.0,13.8,1.0,0.8,19.0,12.0,25.0,1.0,N


In [22]:
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()

In [23]:
X = df1.drop('DRK_YN', axis=1)
Y = df1['DRK_YN']

**Encoding sex column**

In [24]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [25]:
for col in X.columns:
  if is_numeric_dtype(X[col]):
    continue
  else:
    X[col] = le.fit_transform(X[col])

**Normalization**

In [26]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()

In [27]:
numeric_columns = X.select_dtypes(include=['float64']).columns
for col in numeric_columns:
    X[col] = mms.fit_transform(X[numeric_columns])
X.head(5)

Unnamed: 0,sex,age,height,weight,waistline,sight_left,sight_right,hear_left,hear_right,SBP,...,HDL_chole,LDL_chole,triglyceride,hemoglobin,urine_protein,serum_creatinine,SGOT_AST,SGOT_ALT,gamma_GTP,SMK_stat_type_cd
0,1,35,170,75,0.082745,0.082745,0.082745,0.082745,0.082745,0.082745,...,0.082745,0.082745,0.082745,0.082745,0.082745,0.082745,0.082745,0.082745,0.082745,0.082745
1,1,30,180,80,0.081736,0.081736,0.081736,0.081736,0.081736,0.081736,...,0.081736,0.081736,0.081736,0.081736,0.081736,0.081736,0.081736,0.081736,0.081736,0.081736
2,1,40,165,75,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,...,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754
3,1,50,175,80,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,...,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754,0.083754
4,1,50,165,60,0.072654,0.072654,0.072654,0.072654,0.072654,0.072654,...,0.072654,0.072654,0.072654,0.072654,0.072654,0.072654,0.072654,0.072654,0.072654,0.072654


**Training and Testing Sets**

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
Xtrain , Xtest, ytrain, ytest = train_test_split(X, Y ,test_size=0.3, random_state=42)

**K-Nearest Neighbors (KNN)**

In [34]:
from sklearn.neighbors import KNeighborsClassifier

In [35]:
knn = KNeighborsClassifier(n_neighbors=5)

In [36]:
knn.fit(Xtrain, ytrain)

In [37]:
knn_predictions = knn.predict(Xtest)

**Evaluate the Models**

In [38]:
from sklearn.metrics import confusion_matrix, classification_report

# Evaluate the KNN model
knn_cm = confusion_matrix(ytest, knn_predictions)
knn_classification_report = classification_report(ytest, knn_predictions)

In [39]:
print("K-Nearest Neighbors Confusion Matrix:")
print(knn_cm)
print("K-Nearest Neighbors Classification Report:")
print(knn_classification_report)

K-Nearest Neighbors Confusion Matrix:
[[ 94303  54844]
 [ 46207 102050]]
K-Nearest Neighbors Classification Report:
              precision    recall  f1-score   support

           N       0.67      0.63      0.65    149147
           Y       0.65      0.69      0.67    148257

    accuracy                           0.66    297404
   macro avg       0.66      0.66      0.66    297404
weighted avg       0.66      0.66      0.66    297404

