# [9660] Homework 3 - KNN
Data file:
* https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/cardiovascular_disease_adults_60K.csv

## Homework Submission Rules (for all homework assignments)
* Homework is due by 6:05 PM on the due date
  * No late submission will be accepted
* You must submit a cleanly executed notebook (*.ipynb)
  * Verify that you are submitting the correct homework file
* Homework file naming convention
  * LastName_FirstName_HwX.ipynb  [Replace X with the homework #]
    * 1 point deducted for submitting homework not complying with naming convention
* Before submission, execute "Kernel -> Restart Kernel and Run All Cells"
  * 1 point deducted for not submitting a cleanly executed notebook

## Homework 3 Requirements
* Load data
  * Do NOT use meaningless columns (e.g. 'id') as independent variables
* Identify missing values and use SimpleImputer to replace missing values
* Ordinal Encode independent variables: 'smoker', 'alcohol_drinker', 'physically_active', 'cholesterol' and 'glucose'
  * From a health perspective:
    * It is better to NOT BE a 'smoker', NOT BE an 'alcohol_drinker', and TO BE 'physically_active'
    * For 'cholesterol' and 'glucose', 'average' is better than 'above_average', which is better than 'high'
* Dummy (one-hot) independent variable: encode 'gender'
* Label encode dependent variable: 'cardiovascular_disease'
* Separate independent and dependent variables
* Standardize independent variables
* Split data into training and test sets
* Train KNeighborsClassifier (with default hyperparameters)
* Calculate accuracy for KNeighborsClassifier (with default hyperparameters)
* Re-train KNeighborsClassifier (change n_neighbors hyperparameter and at least one other hyperparameter)
  * NOTE: The objective of changing these hyperparameters is to improve model accuracy
    * If you used hyperparameter random_state in your initial model training, do NOT change this value during model retrainings
    * Do NOT re-split training and test sets during model retrainings
* Calculate accuracy for re-trained KNeighborsClassifier (with updated hyperparameters)

In [1]:
from datetime import datetime
print(f'Run time: {datetime.now().strftime("%D %T")}')

Run time: 04/28/25 23:46:33


### Import libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

### Load data

Risk Factors for Cardiovascular Heart Disease

age: Age of participant (integer)  
gender : Gender of participant (string - male, female)  
height : Height measured in centimeters (integer)  
weight : Weight measured in kilograms (integer)  
systolic_bp  : Systolic blood pressure reading taken from patient (integer)  
diastolic_bp  : Diastolic blood pressure reading taken from patient (integer)  
cholesterol : Total cholesterol level (string - average, above-average, high)  
glucose : Glucose level (string - average, above-average, high)  
smoker : Whether person smokes or not (string - N, Y)  
alcohol_drinker : Whether person drinks alcohol or not (string - NO, YES)  
physically_active : Whether person is physically active or not (string - no, yes)  
cardiovascular_disease : Whether person suffers from cardiovascular diseases or not (string - No, Yes)

In [3]:
from posixpath import sep
# Read cardiovascular_disease_adults_60K.csv into dataframe
#  NOTES:
#   Field separator is '|'
#   Use column 'id' as index_col
df=pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-9660/main/data/cardiovascular_disease_adults_60K.csv',sep='|',index_col='id')

### Examine data

In [4]:
df.shape

(60000, 12)

In [5]:
df.head()

Unnamed: 0_level_0,age,gender,height,weight,systolic_bp,diastolic_bp,cholesterol,glucose,smoker,alcohol_drinker,physically_active,cardiovascular_disease
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,50.0,Male,168.0,62.0,110,80,average,average,N,NO,yes,No
1,55.0,Female,156.0,85.0,140,90,high,average,N,NO,yes,Yes
2,51.0,Female,165.0,64.0,130,70,high,average,N,NO,no,Yes
3,48.0,Male,169.0,82.0,150,100,average,average,N,NO,yes,Yes
4,47.0,Female,156.0,56.0,100,60,average,average,N,NO,no,No


### Prepare data for model training

#### Use the SimpleImputer to replace missing values

In [6]:
# Check for missing values
df.isnull().sum()

Unnamed: 0,0
age,139
gender,167
height,229
weight,74
systolic_bp,0
diastolic_bp,0
cholesterol,195
glucose,0
smoker,84
alcohol_drinker,0


In [7]:
imp_mean=SimpleImputer(missing_values=np.nan,strategy='mean')
imp_most_freq=SimpleImputer(missing_values=np.nan,strategy='most_frequent')

In [8]:
age_impute=['age']
df[age_impute]=imp_mean.fit_transform(df[age_impute])

In [9]:
gender_impute=['gender']
df[gender_impute]=imp_most_freq.fit_transform(df[gender_impute])

In [10]:
height_impute=['height']
df[height_impute]=imp_mean.fit_transform(df[height_impute])

In [11]:
weight_impute=['weight']
df[weight_impute]=imp_mean.fit_transform(df[weight_impute])

In [12]:
cholesterol_impute=['cholesterol']
df[cholesterol_impute]=imp_most_freq.fit_transform(df[cholesterol_impute])

In [13]:
smoker_impute=['smoker']
df[smoker_impute]=imp_most_freq.fit_transform(df[smoker_impute])

#### Check for missing values again

In [14]:
df.isnull().sum()

Unnamed: 0,0
age,0
gender,0
height,0
weight,0
systolic_bp,0
diastolic_bp,0
cholesterol,0
glucose,0
smoker,0
alcohol_drinker,0


#### Ordinal Encode 'smoker', 'alcohol_drinker', 'physically_active', 'cholesterol' and 'glucose'

In [15]:
df['smoker'].unique()
oe=OrdinalEncoder(categories=[['Y','N']])
df['smoker']=oe.fit_transform(df[['smoker']])

In [16]:
df['alcohol_drinker'].unique()
oe=OrdinalEncoder(categories=[['YES','NO']])
df['alcohol_drinker']=oe.fit_transform(df[['alcohol_drinker']])

In [17]:
df['physically_active'].unique()
oe=OrdinalEncoder(categories=[['no','yes']])
df['physically_active']=oe.fit_transform(df[['physically_active']])

In [18]:
df['cholesterol'].unique()
oe=OrdinalEncoder(categories=[['high','above_average','average']])
df['cholesterol']=oe.fit_transform(df[['cholesterol']])

In [19]:
df['glucose'].unique()
oe=OrdinalEncoder(categories=[['high','above_average','average']])
df['glucose']=oe.fit_transform(df[['glucose']])

In [20]:
# Display first few rows of updated dataframe
df.head(5)

Unnamed: 0_level_0,age,gender,height,weight,systolic_bp,diastolic_bp,cholesterol,glucose,smoker,alcohol_drinker,physically_active,cardiovascular_disease
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,50.0,Male,168.0,62.0,110,80,2.0,2.0,1.0,1.0,1.0,No
1,55.0,Female,156.0,85.0,140,90,0.0,2.0,1.0,1.0,1.0,Yes
2,51.0,Female,165.0,64.0,130,70,0.0,2.0,1.0,1.0,0.0,Yes
3,48.0,Male,169.0,82.0,150,100,2.0,2.0,1.0,1.0,1.0,Yes
4,47.0,Female,156.0,56.0,100,60,2.0,2.0,1.0,1.0,0.0,No


#### Dummy (one-hot) encode gender

In [21]:
one_hot_encode=['gender']
df=pd.get_dummies(df,columns=one_hot_encode,dtype=int)

In [22]:
# Display first few rows of updated dataframe
df.head(5)

Unnamed: 0_level_0,age,height,weight,systolic_bp,diastolic_bp,cholesterol,glucose,smoker,alcohol_drinker,physically_active,cardiovascular_disease,gender_Female,gender_Male
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
0,50.0,168.0,62.0,110,80,2.0,2.0,1.0,1.0,1.0,No,0,1
1,55.0,156.0,85.0,140,90,0.0,2.0,1.0,1.0,1.0,Yes,1,0
2,51.0,165.0,64.0,130,70,0.0,2.0,1.0,1.0,0.0,Yes,1,0
3,48.0,169.0,82.0,150,100,2.0,2.0,1.0,1.0,1.0,Yes,0,1
4,47.0,156.0,56.0,100,60,2.0,2.0,1.0,1.0,0.0,No,1,0


#### Label encode target variable 'cardiovascular_disease'

In [23]:
le=LabelEncoder()
df['cardiovascular_disease']=le.fit_transform(df['cardiovascular_disease'])

In [24]:
# Display first few rows of updated dataframe
df.head()

Unnamed: 0_level_0,age,height,weight,systolic_bp,diastolic_bp,cholesterol,glucose,smoker,alcohol_drinker,physically_active,cardiovascular_disease,gender_Female,gender_Male
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
0,50.0,168.0,62.0,110,80,2.0,2.0,1.0,1.0,1.0,0,0,1
1,55.0,156.0,85.0,140,90,0.0,2.0,1.0,1.0,1.0,1,1,0
2,51.0,165.0,64.0,130,70,0.0,2.0,1.0,1.0,0.0,1,1,0
3,48.0,169.0,82.0,150,100,2.0,2.0,1.0,1.0,1.0,1,0,1
4,47.0,156.0,56.0,100,60,2.0,2.0,1.0,1.0,0.0,0,1,0


### Separate independent and dependent variables
* Independent variables: All remaining variables except 'cardiovascular_disease'
* Dependent variable: 'cardiovascular_disease'

In [25]:
df_dependent=df['cardiovascular_disease']
df_independent=df_independent=df.drop(['cardiovascular_disease'],axis=1)

### Standardize independent variables

In [26]:
scaler = StandardScaler()
df['age'] = scaler.fit_transform(np.array(df[['age']]))
df['height'] = scaler.fit_transform(np.array(df[['height']]))
df['weight'] = scaler.fit_transform(np.array(df[['weight']]))
df['systolic_bp'] = scaler.fit_transform(np.array(df[['systolic_bp']]))
df['diastolic_bp'] = scaler.fit_transform(np.array(df[['diastolic_bp']]))
df['cholesterol'] = scaler.fit_transform(np.array(df[['cholesterol']]))
df['glucose'] = scaler.fit_transform(np.array(df[['glucose']]))
df['smoker'] = scaler.fit_transform(np.array(df[['smoker']]))
df['alcohol_drinker'] = scaler.fit_transform(np.array(df[['alcohol_drinker']]))
df['physically_active'] = scaler.fit_transform(np.array(df[['physically_active']]))
df['gender_Female'] = scaler.fit_transform(np.array(df[['gender_Female']]))
df['gender_Male'] = scaler.fit_transform(np.array(df[['gender_Male']]))

In [27]:
df.head()

Unnamed: 0_level_0,age,height,weight,systolic_bp,diastolic_bp,cholesterol,glucose,smoker,alcohol_drinker,physically_active,cardiovascular_disease,gender_Female,gender_Male
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
0,-0.419503,0.446201,-0.849737,-0.115022,-0.086953,0.537961,0.396094,0.309823,0.236964,0.493429,0,-1.372005,1.372005
1,0.319334,-1.019936,0.752236,0.065819,-0.03433,-2.409905,0.396094,0.309823,0.236964,0.493429,1,0.72886,-0.72886
2,-0.271736,0.079667,-0.710435,0.005539,-0.139576,-2.409905,0.396094,0.309823,0.236964,-2.026636,1,0.72886,-0.72886
3,-0.715038,0.568379,0.543283,0.126099,0.018293,0.537961,0.396094,0.309823,0.236964,0.493429,1,-1.372005,1.372005
4,-0.862805,-1.019936,-1.267643,-0.175302,-0.192199,0.537961,0.396094,0.309823,0.236964,-2.026636,0,0.72886,-0.72886


### Split data into training and test sets

In [28]:
X_train, X_test, y_train, y_test = train_test_split(df_independent, df_dependent, stratify=df_dependent, test_size=0.2, random_state=42)

### Train KNeighborsClassifier (with default hyperparameters)


In [29]:
knn = KNeighborsClassifier()

In [30]:
knn.fit(X_train, y_train)

### Evaluate performance for KNeighborsClassifier (with default hyperparameters)

In [31]:
y_pred = knn.predict(X_test)

In [32]:
# Print model accuracy score
acc_score= accuracy_score(y_test, y_pred)
print(f"Accuracy = {round((acc_score * 100), 4)}%")

Accuracy = 68.6917%


### Train KNeighborsClassifier (change n_neighbors hyperparameter and at least one other hyperparameter)
NOTE: The objective of changing these hyperparameters is to improve model accuracy

In [33]:
knn = KNeighborsClassifier(n_neighbors=20,leaf_size=50)

In [34]:
knn.fit(X_train,y_train)

### Evaluate performance for KNeighborsClassifier (with updated hyperparameters)

In [35]:
y_pred = knn.predict(X_test)

In [36]:
acc_score= accuracy_score(y_test, y_pred)
print(f"Accuracy = {round((acc_score * 100), 4)}%")

Accuracy = 71.8417%
