# K-Nearest Neighbors (KNN) in Healthcare

## What is KNN?

K-Nearest Neighbors (KNN) is a simple, non-parametric machine learning algorithm used for both classification and regression tasks. It operates on the principle that similar data points tend to have similar outcomes. The algorithm makes predictions by finding the 'k' most similar instances in the training dataset and using their values to predict the outcome for a new, unseen data point.

KNN is considered a "lazy learning" algorithm because it doesn't build an explicit model during training. Instead, it stores all training data and makes predictions at query time by computing distances between the query point and all training examples.

## KNN Applications in Healthcare

### Medical Diagnosis
- **Disease Classification**: Classifying patients into disease categories based on symptoms, lab results, and medical history
- **Cancer Detection**: Identifying malignant vs. benign tumors using imaging features
- **Diabetes Risk Assessment**: Predicting diabetes onset based on patient demographics and clinical markers

### Treatment Recommendation
- **Drug Response Prediction**: Recommending medications based on patient similarity and historical treatment outcomes
- **Personalized Treatment Plans**: Matching patients to successful treatment protocols used by similar patients

### Medical Imaging
- **Radiology Analysis**: Classifying X-rays, MRIs, and CT scans by comparing with similar historical cases
- **Pathology Diagnosis**: Analyzing tissue samples by finding similar histological patterns

### Patient Monitoring
- **Risk Stratification**: Identifying high-risk patients by comparing with historical patient profiles
- **Outcome Prediction**: Predicting patient recovery times and treatment success rates


## Steps of KNN Algorithm

### 1. Data Preparation
- Clean and preprocess the dataset
- Handle missing values and outliers
- Normalize or standardize features to ensure equal weight in distance calculations
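
To illustrate why scaling matters for distance calculations, here is a minimal sketch with three made-up patients described by glucose (mg/dL) and waist/hip ratio. The values and the helper name `euclidean` are assumptions for illustration only; they are not taken from the dataset used later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical patients: [glucose (mg/dL), waist/hip ratio]
patients = np.array([
    [90.0, 0.80],
    [95.0, 1.00],   # similar glucose, very different waist/hip ratio
    [130.0, 0.81],  # very different glucose, similar waist/hip ratio
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from patient 0: glucose dominates because of its larger scale
raw_d01 = euclidean(patients[0], patients[1])
raw_d02 = euclidean(patients[0], patients[2])

# After standardization each feature has mean 0 and unit variance
scaled = StandardScaler().fit_transform(patients)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_d01:.2f}  d(0,2)={raw_d02:.2f}")
print(f"scaled: d(0,1)={scaled_d01:.2f}  d(0,2)={scaled_d02:.2f}")
```

On the raw values the waist/hip difference barely registers; after standardization both features contribute on a comparable scale.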

### 2. Choose the Value of K
- Select the number of nearest neighbors to consider
- Common approaches: odd numbers (3, 5, 7) to avoid tied votes in binary classification
- Use cross-validation to find optimal k value

### 3. Calculate Distance
- Compute distance between the query point and all training data points
- Common distance metrics:
  - **Euclidean Distance**: √(Σ(xi - yi)²)
  - **Manhattan Distance**: Σ|xi - yi|
  - **Minkowski Distance**: (Σ|xi - yi|^p)^(1/p)
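
The three metrics above are closely related: Manhattan and Euclidean distance are the Minkowski distance with p = 1 and p = 2. A minimal NumPy sketch (the function names are chosen here for illustration) makes that relationship explicit:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two 1-D arrays."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def manhattan(x, y):
    # Minkowski distance with p = 1
    return minkowski(x, y, p=1)

def euclidean(x, y):
    # Minkowski distance with p = 2
    return minkowski(x, y, p=2)

a = [0.0, 0.0]
b = [3.0, 4.0]
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
```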

### 4. Find K-Nearest Neighbors
- Sort all distances in ascending order
- Select the k data points with smallest distances

### 5. Make Prediction
- **For Classification**: Use majority voting among k neighbors
- **For Regression**: Calculate mean or weighted average of k neighbors' values
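
Steps 3 to 5 can be sketched from scratch in a few lines. The following is an illustrative implementation for classification only, using Euclidean distance and an unweighted majority vote; the function name `knn_predict` and the toy data are assumptions, not part of the project code.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 3: Euclidean distance from the query to every training point
    distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    # Step 4: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 5: majority vote among the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy, made-up data: two standardized features per patient
X_toy = [[-1.0, -0.8], [-0.9, -1.1], [-1.2, -0.7],
         [1.0, 0.9], [1.1, 1.2], [0.8, 1.0]]
y_toy = ["No diabetes", "No diabetes", "No diabetes",
         "Diabetes", "Diabetes", "Diabetes"]

print(knn_predict(X_toy, y_toy, query=[0.9, 1.0], k=3))    # Diabetes
print(knn_predict(X_toy, y_toy, query=[-1.0, -1.0], k=3))  # No diabetes
```

For regression, the vote in step 5 would be replaced by the mean (or a distance-weighted average) of the neighbors' target values.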

### 6. Evaluate Performance
- Use appropriate metrics (accuracy, precision, recall for classification; MSE, MAE for regression)
- Apply cross-validation to assess model generalization

## Advantages of KNN

### Simplicity and Interpretability
- Easy to understand and implement
- Makes no assumptions about the underlying data distribution
- Results are easily interpretable by healthcare professionals

### Versatility
- Works for both classification and regression problems
- Can handle multi-class classification naturally
- Adapts well to new data without retraining

### No Training Period
- No model building phase required
- Can incorporate new patient data immediately
- Suitable for dynamic healthcare environments

### Effective with Small Datasets
- Can perform reasonably well with limited training data, provided the features are few and informative
- Valuable in medical specialties with rare conditions


## Disadvantages of KNN

### Computational Complexity
- High computational cost during prediction phase
- Memory-intensive as it stores entire training dataset
- Slow query times with large datasets
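
One common way to reduce query cost is to index the stored points instead of scanning all of them. The sketch below compares scikit-learn's `algorithm="brute"` and `algorithm="kd_tree"` options of `KNeighborsClassifier` on synthetic data; the dataset sizes are arbitrary assumptions, and actual speed-ups depend on the number of samples and features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a larger patient table (not real data)
X_big, y_big = make_classification(n_samples=500, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)
X_fit, y_fit = X_big[:400], y_big[:400]
X_query = X_big[400:]

# Brute force compares each query with every stored point
brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X_fit, y_fit)
# A KD-tree index can skip regions of the feature space during the search
tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X_fit, y_fit)

agreement = np.mean(brute.predict(X_query) == tree.predict(X_query))
print(f"Share of identical predictions: {agreement:.2%}")
```

Both settings search for the same neighbors, so the choice affects speed and memory rather than the predictions themselves.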

### Sensitivity to Irrelevant Features
- Performance degrades with high-dimensional data (curse of dimensionality)
- Irrelevant features can dominate distance calculations
- Requires careful feature selection in medical applications

### Sensitivity to Data Quality
- Highly sensitive to outliers and noisy data
- Skewed class distributions can bias predictions
- Missing data can significantly impact performance

### Parameter Selection Challenges
- Choice of k value significantly affects performance
- Distance metric selection impacts results
- Requires domain expertise for optimal configuration

### Scalability Issues
- Poor scalability with increasing dataset size
- Real-time prediction challenges in clinical settings
- Storage requirements grow linearly with data


## Best Practices for KNN in Healthcare

- **Feature Scaling**: Always normalize medical measurements and lab values
- **Feature Selection**: Use domain knowledge to select relevant clinical variables
- **Cross-Validation**: Employ proper validation techniques to avoid overfitting
- **Privacy Considerations**: Implement appropriate data protection measures
- **Clinical Validation**: Always validate predictions with medical professionals
- **Regular Updates**: Continuously update the model with new patient data
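
Feature scaling and cross-validation interact: if the scaler is fit on all rows first, statistics from the held-out folds influence preprocessing. One way to avoid this, sketched here on synthetic data rather than the project dataset, is to put the scaler and the classifier into a single scikit-learn pipeline so that scaling is re-fit inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a clinical dataset (not real patient data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

# The scaler is re-fit on the training portion of every fold,
# so no statistics from the held-out fold leak into preprocessing
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```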

## Conclusion

KNN offers a straightforward approach to pattern recognition in healthcare, making it valuable for medical diagnosis, treatment recommendation, and patient monitoring. While it has limitations in terms of computational efficiency and sensitivity to data quality, its interpretability and ease of implementation make it a useful tool in the healthcare professional's analytical toolkit. Success with KNN in healthcare depends on careful data preprocessing, appropriate parameter selection, and close collaboration with medical domain experts.

This project uses the following libraries:

*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data.
*   [`NumPy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for mathematical operations.
*   [`sklearn`](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for machine learning and machine learning pipeline-related functions.
*   [`seaborn`](https://seaborn.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for visualizing the data.
*   [`Matplotlib`](https://matplotlib.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for additional plotting tools.


In [1]:
# Importing required libraries
from tqdm import tqdm
import numpy as np
import pandas as pd
from itertools import accumulate
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler, scale
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix
from sklearn.feature_selection import f_classif
from sklearn.utils import resample

In [2]:
# Load the data
df = pd.read_excel('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/X4i8vXLw81g4wEH473zIFA/Diabetes-Classification.xlsx')

In [3]:
df.head()

Unnamed: 0,Patient number,Cholesterol,Glucose,HDL Chol,Chol/HDL ratio,Age,Gender,Height,Weight,BMI,Systolic BP,Diastolic BP,waist,hip,Waist/hip ratio,Diabetes,Unnamed: 16,Unnamed: 17
0,1,193,77,49,3.9,19,female,61,119,22.5,118,70,32,38,0.84,No diabetes,6.0,6.0
1,2,146,79,41,3.6,19,female,60,135,26.4,108,58,33,40,0.83,No diabetes,,
2,3,217,75,54,4.0,20,female,67,187,29.3,110,72,40,45,0.89,No diabetes,,
3,4,226,97,70,3.2,20,female,64,114,19.6,122,64,31,39,0.79,No diabetes,,
4,5,164,91,67,2.4,20,female,70,141,20.2,122,86,32,39,0.82,No diabetes,,


In [5]:
# Drop the two unnamed trailing columns and keep the result
df = df.drop(columns=['Unnamed: 16', 'Unnamed: 17'])
df

Unnamed: 0,Patient number,Cholesterol,Glucose,HDL Chol,Chol/HDL ratio,Age,Gender,Height,Weight,BMI,Systolic BP,Diastolic BP,waist,hip,Waist/hip ratio,Diabetes
0,1,193,77,49,3.9,19,female,61,119,22.5,118,70,32,38,0.84,No diabetes
1,2,146,79,41,3.6,19,female,60,135,26.4,108,58,33,40,0.83,No diabetes
2,3,217,75,54,4.0,20,female,67,187,29.3,110,72,40,45,0.89,No diabetes
3,4,226,97,70,3.2,20,female,64,114,19.6,122,64,31,39,0.79,No diabetes
4,5,164,91,67,2.4,20,female,70,141,20.2,122,86,32,39,0.82,No diabetes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
385,386,227,105,44,5.2,83,female,59,125,25.2,150,90,35,40,0.88,No diabetes
386,387,226,279,52,4.3,84,female,60,192,37.5,144,88,41,48,0.85,Diabetes
387,388,301,90,118,2.6,89,female,61,115,21.7,218,90,31,41,0.76,No diabetes
388,389,232,184,114,2.0,91,female,61,127,24.0,170,82,35,38,0.92,Diabetes


In [6]:
frequency_table = df['Diabetes'].value_counts()
props = frequency_table.apply(lambda x: x / len(df['Diabetes']))
print(props)

Diabetes
No diabetes    0.846154
Diabetes       0.153846
Name: count, dtype: float64
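
The same class proportions can also be obtained in one call with `value_counts(normalize=True)`. A small sketch on a made-up label column (13 labels chosen so the split mirrors the proportions printed above):

```python
import pandas as pd

# Small made-up label column standing in for df['Diabetes']
labels = pd.Series(["No diabetes"] * 11 + ["Diabetes"] * 2, name="Diabetes")

# normalize=True returns relative frequencies instead of raw counts
props_demo = labels.value_counts(normalize=True)
print(props_demo)
```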


In [7]:
df_reduced = df[["Diabetes", "Cholesterol", "Glucose", "BMI", "Waist/hip ratio", "HDL Chol", "Chol/HDL ratio", "Systolic BP", "Diastolic BP", "Weight"]]

numerical_columns = df_reduced.iloc[:, 1:10]

# Applying scaling (note: the scaler is fit on all rows here, before the train/test split)
scaler = StandardScaler()
preproc_reduced = scaler.fit(numerical_columns)

df_standardized = preproc_reduced.transform(numerical_columns)

# Converting the standardized array back to a DataFrame
df_standardized = pd.DataFrame(df_standardized, columns=numerical_columns.columns)

In [8]:
# Summary statistics of standardized data
df_standardized.describe()

Unnamed: 0,Cholesterol,Glucose,BMI,Waist/hip ratio,HDL Chol,Chol/HDL ratio,Systolic BP,Diastolic BP,Weight
count,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0,390.0
mean,7.287618000000001e-17,-1.457524e-16,2.2773810000000003e-17,-6.741046e-16,4.3270230000000006e-17,-6.376666000000001e-17,2.915047e-16,-3.006142e-16,-1.867452e-16
std,1.001285,1.001285,1.001285,1.001285,1.001285,1.001285,1.001285,1.001285,1.001285
min,-2.896986,-1.104399,-2.059272,-2.754229,-2.21747,-1.743891,-2.064517,-2.617764,-1.942901
25%,-0.6328534,-0.4902078,-0.7092421,-0.7027598,-0.7108267,-0.7637287,-0.6628646,-0.6149262,-0.6729533
50%,-0.09484179,-0.3227011,-0.1479938,-0.01893664,-0.2472441,-0.1871623,-0.04964184,-0.0956721,-0.1092203
75%,0.4880041,0.007659498,0.5308134,0.6648866,0.5060777,0.5047173,0.4759777,0.4977612,0.5598254
max,5.285274,5.167799,4.099291,3.536944,4.040895,8.51899,4.943744,3.019853,3.657259


In [9]:
df_stdize = pd.concat([df_reduced['Diabetes'], df_standardized], axis=1)
df_stdize

Unnamed: 0,Diabetes,Cholesterol,Glucose,BMI,Waist/hip ratio,HDL Chol,Chol/HDL ratio,Systolic BP,Diastolic BP,Weight
0,No diabetes,-0.319013,-0.564655,-0.951944,-0.565995,-0.073401,-0.360132,-0.838071,-0.985822,-1.447312
1,No diabetes,-1.372619,-0.527432,-0.360358,-0.702760,-0.536983,-0.533102,-1.276087,-1.875972,-1.050840
2,No diabetes,0.218998,-0.601879,0.079539,0.117828,0.216339,-0.302476,-1.188484,-0.837464,0.237692
3,No diabetes,0.420753,-0.192418,-1.391841,-1.249818,1.143504,-0.763729,-0.662865,-1.430897,-1.571209
4,No diabetes,-0.969111,-0.304089,-1.300828,-0.839524,0.969660,-1.224982,-0.662865,0.201045,-0.902163
...,...,...,...,...,...,...,...,...,...,...
385,No diabetes,0.443170,-0.043523,-0.542385,-0.018937,-0.363140,0.389404,0.563581,0.497761,-1.298635
386,Diabetes,0.420753,3.194941,1.323387,-0.429231,0.100443,-0.129506,0.300771,0.349403,0.361590
387,No diabetes,2.102039,-0.322701,-1.073295,-1.660112,3.924999,-1.109668,3.542092,0.497761,-1.546430
388,Diabetes,0.555256,1.426814,-0.724411,0.528122,3.693208,-1.455608,1.439613,-0.095672,-1.249076


In [10]:
X = df_stdize.drop(columns=['Diabetes'])
y = df_stdize['Diabetes']

# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
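
With only about 15% positive cases, a purely random split can leave the training and test sets with different class proportions. As a hedged alternative, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. The sketch below uses made-up labels with the same 15% positive rate; it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 15 positives and 85 negatives
X_imb = np.arange(100).reshape(-1, 1)
y_imb = np.array([1] * 15 + [0] * 85)

# stratify=y_imb keeps the positive rate the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

print(f"Positive rate in train: {y_tr.mean():.2f}")
print(f"Positive rate in test:  {y_te.mean():.2f}")
```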

In [11]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit the encoder on the training labels only, then reuse the same mapping for the test labels
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
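
The following sketch, using made-up label lists, shows how `LabelEncoder` assigns integer codes and why the mapping learned from the training labels should be reused for the test labels. If a test set happened to contain only one class, re-fitting the encoder on it would silently change the codes.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["No diabetes", "Diabetes", "No diabetes"]
test_labels = ["No diabetes", "No diabetes"]  # only one class present

# Classes are sorted alphabetically, so 'Diabetes' -> 0 and 'No diabetes' -> 1
encoder = LabelEncoder().fit(train_labels)
print(encoder.classes_.tolist())       # ['Diabetes', 'No diabetes']
print(encoder.transform(test_labels))  # [1 1]

# Re-fitting on the test labels alone would map 'No diabetes' to 0 instead
refit_codes = LabelEncoder().fit_transform(test_labels)
print(refit_codes)                     # [0 0]
```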

In [12]:
# Fit the KNN model
# Create a KNN classifier
knn = KNeighborsClassifier()

knn.fit(X_train, y_train_encoded)

# Calculate overall accuracy
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test_encoded, y_pred)
print(f'Accuracy: {accuracy:.2%}')

Accuracy: 88.46%


In [13]:
# Hyperparameter tuning
# Create a KNN classifier
knn = KNeighborsClassifier()

param_grid = {'n_neighbors': range(1, 12)}

# Perform grid search with cross-validation
grid_search = GridSearchCV(knn, param_grid, cv=10)
grid_search.fit(X_train, y_train_encoded)


# Best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print(f"Best accuracy score: {grid_search.best_score_:.3f}")

# Full results
results = grid_search.cv_results_
for mean_score, std_score, params in zip(results['mean_test_score'], results['std_test_score'], results['params']):
    print(f"Mean accuracy: {mean_score:.3f} (std: {std_score:.3f}) with: {params}")

Best parameters found:  {'n_neighbors': 7}
Best accuracy score: 0.917
Mean accuracy: 0.875 (std: 0.053) with: {'n_neighbors': 1}
Mean accuracy: 0.820 (std: 0.047) with: {'n_neighbors': 2}
Mean accuracy: 0.901 (std: 0.037) with: {'n_neighbors': 3}
Mean accuracy: 0.897 (std: 0.038) with: {'n_neighbors': 4}
Mean accuracy: 0.913 (std: 0.038) with: {'n_neighbors': 5}
Mean accuracy: 0.913 (std: 0.038) with: {'n_neighbors': 6}
Mean accuracy: 0.917 (std: 0.043) with: {'n_neighbors': 7}
Mean accuracy: 0.917 (std: 0.043) with: {'n_neighbors': 8}
Mean accuracy: 0.917 (std: 0.036) with: {'n_neighbors': 9}
Mean accuracy: 0.917 (std: 0.036) with: {'n_neighbors': 10}
Mean accuracy: 0.913 (std: 0.038) with: {'n_neighbors': 11}


In [14]:
# ANOVA for feature selection
fs_score, fs_p_value = f_classif(X, y)

# Combine scores with feature names
fs_scores = pd.DataFrame({'Feature': X.columns, 'F-Score': fs_score, 'P-Value': fs_p_value})
fs_scores = fs_scores.sort_values(by='F-Score', ascending=False)

print(fs_scores)

           Feature     F-Score       P-Value
1          Glucose  350.809177  3.205119e-56
5   Chol/HDL ratio   31.242678  4.298115e-08
0      Cholesterol   16.893380  4.827353e-05
6      Systolic BP   15.931795  7.853024e-05
3  Waist/hip ratio   12.348083  4.935038e-04
8           Weight   10.588454  1.237749e-03
2              BMI    8.365055  4.040512e-03
4         HDL Chol    5.973355  1.496812e-02
7     Diastolic BP    0.947292  3.310160e-01


In [15]:
# Downsampling
# Converting Diabetes column into binary (0 for No Diabetes and 1 for Diabetes)
df_stdize['Diabetes'] = np.where(df_stdize['Diabetes'] == 'Diabetes', 1, 0)
df_stdize

Unnamed: 0,Diabetes,Cholesterol,Glucose,BMI,Waist/hip ratio,HDL Chol,Chol/HDL ratio,Systolic BP,Diastolic BP,Weight
0,0,-0.319013,-0.564655,-0.951944,-0.565995,-0.073401,-0.360132,-0.838071,-0.985822,-1.447312
1,0,-1.372619,-0.527432,-0.360358,-0.702760,-0.536983,-0.533102,-1.276087,-1.875972,-1.050840
2,0,0.218998,-0.601879,0.079539,0.117828,0.216339,-0.302476,-1.188484,-0.837464,0.237692
3,0,0.420753,-0.192418,-1.391841,-1.249818,1.143504,-0.763729,-0.662865,-1.430897,-1.571209
4,0,-0.969111,-0.304089,-1.300828,-0.839524,0.969660,-1.224982,-0.662865,0.201045,-0.902163
...,...,...,...,...,...,...,...,...,...,...
385,0,0.443170,-0.043523,-0.542385,-0.018937,-0.363140,0.389404,0.563581,0.497761,-1.298635
386,1,0.420753,3.194941,1.323387,-0.429231,0.100443,-0.129506,0.300771,0.349403,0.361590
387,0,2.102039,-0.322701,-1.073295,-1.660112,3.924999,-1.109668,3.542092,0.497761,-1.546430
388,1,0.555256,1.426814,-0.724411,0.528122,3.693208,-1.455608,1.439613,-0.095672,-1.249076


In [16]:
# Number of rows for positive diabetes
positive_diabetes = df_stdize[df_stdize['Diabetes'] == 1].shape[0]
print('Number of rows for positive diabetes: ', positive_diabetes)

# Sample negative cases to match positive cases
negative_diabetes = df_stdize[df_stdize['Diabetes'] == 0]
negative_diabetes_downsampled = resample(negative_diabetes, replace=False, n_samples=positive_diabetes, random_state=42)

# Combine the downsampled negative cases with all positive cases into one balanced DataFrame
balanced = pd.concat([negative_diabetes_downsampled, df_stdize[df_stdize['Diabetes'] == 1]])
balanced.sample(5)

Number of rows for positive diabetes:  60


Unnamed: 0,Diabetes,Cholesterol,Glucose,BMI,Waist/hip ratio,HDL Chol,Chol/HDL ratio,Systolic BP,Diastolic BP,Weight
335,0,-0.520768,-0.583267,-0.754749,1.34871,-0.479035,-0.071849,-0.312452,-0.614926,0.188133
145,0,-0.767356,-0.452984,-0.754749,1.211945,-0.768775,0.101121,-0.662865,-0.540747,-1.174738
307,0,-0.072425,0.38455,-0.208669,0.664887,0.621973,-0.706072,0.169366,-0.169851,0.064236
237,0,0.331084,-0.471596,-1.664881,-1.660112,2.128617,-1.109668,0.563581,1.981344,-1.670327
379,1,1.900285,2.376019,-0.769918,0.938416,0.274286,0.447061,1.439613,0.497761,-0.307456


In [17]:
balanced['Diabetes'].value_counts()

Diabetes,count
0,60
1,60


In [18]:
# Fit a simpler model that uses only the Glucose feature
X_simple = balanced[['Glucose']]
y = balanced['Diabetes']

# Split the data
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(X_simple, y, test_size=0.2, random_state=42)


In [19]:
knn_simple = KNeighborsClassifier()
knn_simple.fit(X_train_simple, y_train_simple)
y_pred_simple = knn_simple.predict(X_test_simple)
accuracy = accuracy_score(y_test_simple, y_pred_simple)
print(f'Accuracy: {accuracy:.2%}')

Accuracy: 91.67%


In [20]:
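# Evaluate the confusion matrix of the full-feature model from In [12]
# (rows: true class, columns: predicted class; 0 = Diabetes, 1 = No diabetes)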
cm = confusion_matrix(y_test_encoded, y_pred)

# Print confusion matrix
print("Confusion Matrix:")
print(cm)
accuracy = accuracy_score(y_test_encoded, y_pred)
print(f'Accuracy: {accuracy:.2%}')

Confusion Matrix:
[[ 8  8]
 [ 1 61]]
Accuracy: 88.46%
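
Because only about 15% of the patients in this dataset have diabetes, overall accuracy says little about how well the positive class is detected. The sketch below recomputes per-class metrics directly from the confusion matrix printed above. It assumes the alphabetical label order used by `LabelEncoder`, so class 0 is 'Diabetes' and class 1 is 'No diabetes'; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class)
cm_reported = np.array([[8, 8],
                        [1, 61]])

tp = cm_reported[0, 0]  # diabetic patients predicted as diabetic
fn = cm_reported[0, 1]  # diabetic patients predicted as non-diabetic
fp = cm_reported[1, 0]  # non-diabetic patients predicted as diabetic
tn = cm_reported[1, 1]  # non-diabetic patients predicted as non-diabetic

recall_diabetes = tp / (tp + fn)
precision_diabetes = tp / (tp + fp)
specificity = tn / (tn + fp)
accuracy_check = (tp + tn) / cm_reported.sum()

print(f"Recall (sensitivity) for Diabetes: {recall_diabetes:.2%}")     # 50.00%
print(f"Precision for Diabetes:            {precision_diabetes:.2%}")  # 88.89%
print(f"Specificity:                       {specificity:.2%}")         # 98.39%
print(f"Accuracy:                          {accuracy_check:.2%}")      # 88.46%
```

Read this way, the model identifies only half of the diabetic patients in the test set even though its overall accuracy is above 88%.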


In [26]:
# Train the model first
knn = KNeighborsClassifier(n_neighbors=5)  # or use best k from grid search
knn.fit(X_train, y_train_encoded)

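# Sample patient data (hypothetical values)
# Note: Height and Weight in the training data appear to be in inches and pounds
# (e.g., 61 and 119 give the listed BMI of 22.5), so Weight here must use the same unit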
new_patient = {
    'Cholesterol': 220,
    'Glucose': 120,
    'BMI': 28.5,
    'Waist/hip ratio': 0.90,
    'HDL Chol': 40,
    'Chol/HDL ratio': 5.5,
    'Systolic BP': 130,
    'Diastolic BP': 85,
    'Weight': 75
}

# Convert to DataFrame
test_data = pd.DataFrame([new_patient])

# Standardize the data
test_scaled = preproc_reduced.transform(test_data)

# Make prediction
prediction = knn.predict(test_scaled)
probability = knn.predict_proba(test_scaled)

# Show results
predicted_class = label_encoder.inverse_transform(prediction)[0]
confidence = max(probability[0])

print(f"Prediction: {predicted_class}")
print(f"Confidence: {confidence:.2%}")

Prediction: No diabetes
Confidence: 80.00%




In [27]:
# Multiple sample patients with different risk profiles

sample_patients = [
   # High risk patient
   {
       'Cholesterol': 280,
       'Glucose': 160,
       'BMI': 35.0,
       'Waist/hip ratio': 1.0,
       'HDL Chol': 30,
       'Chol/HDL ratio': 9.3,
       'Systolic BP': 160,
       'Diastolic BP': 100,
       'Weight': 95
   },

   # Low risk patient
   {
       'Cholesterol': 180,
       'Glucose': 85,
       'BMI': 22.0,
       'Waist/hip ratio': 0.75,
       'HDL Chol': 60,
       'Chol/HDL ratio': 3.0,
       'Systolic BP': 110,
       'Diastolic BP': 70,
       'Weight': 65
   },

   # Medium risk patient
   {
       'Cholesterol': 240,
       'Glucose': 110,
       'BMI': 29.0,
       'Waist/hip ratio': 0.88,
       'HDL Chol': 42,
       'Chol/HDL ratio': 5.7,
       'Systolic BP': 135,
       'Diastolic BP': 88,
       'Weight': 80
   }
]

# Test all patients
for i, patient in enumerate(sample_patients, 1):
   test_data = pd.DataFrame([patient])
   test_scaled = preproc_reduced.transform(test_data)

   prediction = knn.predict(test_scaled)
   probability = knn.predict_proba(test_scaled)

   predicted_class = label_encoder.inverse_transform(prediction)[0]
   confidence = max(probability[0])

   print(f"Patient {i}: {predicted_class} (Confidence: {confidence:.2%})")

Patient 1: Diabetes (Confidence: 60.00%)
Patient 2: No diabetes (Confidence: 100.00%)
Patient 3: No diabetes (Confidence: 60.00%)
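
The "confidence" values above come from `predict_proba`, which for a default (uniformly weighted) `KNeighborsClassifier` is the share of the k neighbors that belong to each class. With k = 5 it can therefore only take values in steps of 20%. A minimal sketch on made-up one-feature data shows the correspondence:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-feature training set: 0 = negative, 1 = positive
X_votes = np.array([[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [1.2]])
y_votes = np.array([0, 0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=5).fit(X_votes, y_votes)

query = np.array([[0.8]])
# Indices of the 5 nearest training points and their labels
_, idx = clf.kneighbors(query)
neighbor_labels = y_votes[idx[0]]
vote_fraction = np.bincount(neighbor_labels, minlength=2) / 5

print(vote_fraction)             # [0.4 0.6]
print(clf.predict_proba(query))  # [[0.4 0.6]]
```

A value such as 60% therefore means that three of the five nearest neighbors voted for the predicted class; it is not a calibrated clinical probability.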


