In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("diabetes.csv")
df

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,Pedigree,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [3]:
# Checking for null values
print("Null values:\n",df.isna().sum(),"\n\n")

# No null values are reported above; dropna() is applied anyway as a safeguard
print("Before:", df.shape)
df = df.dropna()
print("After:", df.shape)  # Unchanged shape confirms that no rows were dropped

Null values:
 Pregnancies      0
Glucose          0
BloodPressure    0
SkinThickness    0
Insulin          0
BMI              0
Pedigree         0
Age              0
Outcome          0
dtype: int64 


Before: (768, 9)
After: (768, 9)


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Applying KNN
X = df.drop(columns=['Outcome'])
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion matrix:")
print(conf_matrix)

# Confusion matrix format
#            Predicted
#            N      P
# Actual  N  TN     FP
#         P  FN     TP

TN = conf_matrix[0][0]
FP = conf_matrix[0][1]
FN = conf_matrix[1][0]
TP = conf_matrix[1][1]

print("\nUsing confusion matrix...")
print("Accuracy:", (TP + TN)/(TP + FP + TN + FN))
print("Error:", (FP + FN)/(TP + FP + TN + FN))
print("Precision:", (TP)/(TP + FP))
print("Recall:", (TP)/(TP + FN))


accuracy = accuracy_score(y_test, y_pred)
error = 1 - accuracy
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("\nUsing functions...")
print("Accuracy:", accuracy)
print("Error:", error)
print("Precision:", precision)
print("Recall:", recall)



Confusion matrix:
[[70 29]
 [23 32]]

Using confusion matrix...
Accuracy: 0.6623376623376623
Error: 0.33766233766233766
Precision: 0.5245901639344263
Recall: 0.5818181818181818

Using functions...
Accuracy: 0.6623376623376623
Error: 0.33766233766233766
Precision: 0.5245901639344263
Recall: 0.5818181818181818


# ... Code Explanation: KNN Diabetes Classification


This Jupyter Notebook, Exp5.ipynb, implements the K-Nearest Neighbors (KNN) classification algorithm on the diabetes.csv dataset to predict diabetes status (Outcome). It then computes and compares key classification performance metrics.

1. Setup and Data Loading (Cells 1, 2)
Imports: The third-party libraries pandas (pd) and numpy (np) are imported for data manipulation. Only pandas is actually used in the cells shown.

Data Loading: The diabetes.csv file, containing 768 records and 9 columns (8 features plus the target), is loaded into a pandas DataFrame named df. The features are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, Pedigree, and Age. The target variable is Outcome (0 for non-diabetic, 1 for diabetic).

2. Data Preprocessing (Cell 3)
Handling Null Values:

Python
print("Null values:\n",df.isna().sum(),"\n\n")
df = df.dropna()
The code explicitly checks for and removes any rows containing null values. The output confirms that the initial dataset had no missing values, as the DataFrame shape remains unchanged at (768, 9) before and after the operation.

3. KNN Implementation and Evaluation (Cell 4)
This section sets up the model, trains it, and calculates the required performance metrics.

Data Preparation and Splitting
Python
X = df.drop(columns=['Outcome'])
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Feature/Target Separation: The features (X) are defined as all columns except Outcome, and the target (y) is the Outcome column.

Train-Test Split: The data is split into training (80%) and testing (20%) sets. Using random_state=42 ensures the split is consistent and reproducible.
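The effect of a fixed seed on an 80/20 split can be sketched with the standard library alone. This is a minimal illustration, not scikit-learn's implementation: `seeded_split` is a hypothetical helper, and it will not select the same rows as `train_test_split(..., random_state=42)`. It only shows that a fixed seed makes the shuffle repeatable and that 768 rows yield 614 training and 154 test rows.

```python
import math
import random

def seeded_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off the test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # same seed -> same order every run
    n_test = math.ceil(n_rows * test_size)    # test size rounded up, as assumed here
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

train_idx, test_idx = seeded_split(768)
print(len(train_idx), len(test_idx))  # 614 154
```

The 154 test rows match the total of the confusion matrix entries reported in the notebook output (70 + 29 + 23 + 32).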

Model Training and Prediction
Python
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
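A K-Nearest Neighbors Classifier is initialized with scikit-learn's default settings (n_neighbors=5, so each prediction is a majority vote among the 5 nearest training points) and then trained using the X_train and y_train data.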

The trained model then generates predictions (y_pred) on the unseen X_test data.

Evaluation using Confusion Matrix and Metrics
The notebook first computes the confusion matrix, from which the other metrics reported here are derived:

Python
conf_matrix = confusion_matrix(y_test, y_pred)
# ... manual calculation of metrics ...
The resulting matrix is:

$$\begin{pmatrix} 70 & 29 \\ 23 & 32 \end{pmatrix}$$
where the entries correspond to:

TN (True Negatives): 70 (Non-Diabetic correctly predicted as Non-Diabetic)

FP (False Positives): 29 (Non-Diabetic incorrectly predicted as Diabetic)

FN (False Negatives): 23 (Diabetic incorrectly predicted as Non-Diabetic)

TP (True Positives): 32 (Diabetic correctly predicted as Diabetic)

The script then calculates the performance metrics in two ways: manually using the confusion matrix values and using the built-in sklearn.metrics functions. Both methods yielded the same results:

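Accuracy:

$$(\text{TP} + \text{TN}) / \text{Total} = (32 + 70) / 154 \approx 0.662$$

Error Rate:

$$1 - \text{Accuracy} \approx 0.338$$

Precision:

$$\text{TP} / (\text{TP} + \text{FP}) = 32 / (32 + 29) \approx 0.525$$

Recall:

$$\text{TP} / (\text{TP} + \text{FN}) = 32 / (32 + 23) \approx 0.582$$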
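The arithmetic above can be checked without scikit-learn. The sketch below hard-codes the confusion matrix from the notebook output and recomputes each metric; the lowercase variable names are chosen here for illustration.

```python
# Confusion matrix reported in the notebook output: [[TN, FP], [FN, TP]].
conf_matrix = [[70, 29], [23, 32]]
(tn, fp), (fn, tp) = conf_matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of correct predictions
error_rate = (fp + fn) / total    # share of incorrect predictions
precision = tp / (tp + fp)        # correct among predicted positives
recall = tp / (tp + fn)           # correct among actual positives

print(f"total={total}")                                       # total=154
print(f"accuracy={accuracy:.3f}, error={error_rate:.3f}")     # accuracy=0.662, error=0.338
print(f"precision={precision:.3f}, recall={recall:.3f}")      # precision=0.525, recall=0.582
```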

# ... Relevant Viva Questions and Answers

Core KNN Concepts

Q: How does the K-Nearest Neighbors (KNN) algorithm make a prediction?

A: KNN is a non-parametric, lazy learning algorithm that is instance-based. To classify a new data point, it finds the K nearest neighbors (data points) in the feature space based on a distance metric (e.g., Euclidean distance). The new point is then assigned the class that is most common (majority vote) among those $K$ neighbors.
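The majority-vote procedure described in this answer can be sketched in a few lines of plain Python. This is an illustrative brute-force version, not the implementation used by `KNeighborsClassifier`; the function name `knn_predict` and the toy data are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point (brute force).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data (invented for illustration, not taken from diabetes.csv).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.5)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))  # 0
print(knn_predict(train_points, train_labels, (6.0, 7.0)))  # 1
```

There is no training step beyond storing the data, which is why KNN is called a lazy learner: all of the work happens at prediction time.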

Q: What is the main weakness of the KNN algorithm, especially when dealing with high-dimensional data or large datasets?

A: With a brute-force neighbor search, KNN is computationally expensive during the prediction phase, because it must calculate the distance between the new data point and every point in the training set. This is particularly slow for large datasets. Its performance also degrades in high-dimensional spaces (the "curse of dimensionality"), where distances between points become increasingly similar and therefore less informative.
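One way to see why distances lose meaning in high dimensions is to compare the nearest and farthest neighbor of a random query as the number of dimensions grows. The sketch below uses uniformly random points, which is an assumption made purely for illustration; `relative_contrast` is a hypothetical helper, not a library function.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

# The contrast shrinks as the dimension grows: the nearest and farthest
# points end up at nearly the same distance from the query.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast is close to zero, the "nearest" neighbors are barely closer than any other point, so a vote among them carries little information.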

Q: Did this code apply feature scaling (like StandardScaler)? Should it have, and why?

A: No, the provided code did not apply feature scaling, and it should have. KNN relies on distance calculations, so it is heavily influenced by the magnitude of the features. Features with larger numerical ranges (like Insulin or Glucose) will dominate the distance calculation over small-range features (like Pedigree), regardless of their actual importance. Scaling the features to a consistent range is critical for KNN performance.

Evaluation Metrics
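To make the feature-scaling answer concrete, the sketch below takes the Insulin and Pedigree values of four rows from the DataFrame preview and finds the nearest neighbor of row 763 before and after z-score standardization. Computing the mean and standard deviation from only four rows is a simplification for illustration; in practice a scaler such as StandardScaler would be fitted on the training set. The helper names are hypothetical.

```python
import math
import statistics

# (Insulin, Pedigree) for four rows shown in the DataFrame preview.
rows = {
    3: (94.0, 0.167),
    4: (168.0, 2.288),
    763: (180.0, 0.171),
    765: (112.0, 0.245),
}

def standardize(data):
    """Z-score each column using the mean and population std of `data`."""
    columns = list(zip(*data.values()))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return {
        key: tuple((v - m) / s for v, m, s in zip(values, means, stds))
        for key, values in data.items()
    }

def nearest_to(data, query_key):
    """Return the key of the row closest (Euclidean) to the query row."""
    others = (key for key in data if key != query_key)
    return min(others, key=lambda key: math.dist(data[key], data[query_key]))

nearest_raw = nearest_to(rows, 763)                   # Insulin dominates the distance
nearest_scaled = nearest_to(standardize(rows), 763)   # both features contribute
print(nearest_raw, nearest_scaled)  # 4 765
```

Without scaling, row 4 looks closest because its Insulin value differs by only 12, even though its Pedigree value (2.288) is far from that of row 763 (0.171). After scaling, row 765 becomes the nearest neighbor.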

Q: Based on the confusion matrix, which value represents a severe clinical error in this diabetes prediction model?

A: The False Negative (FN) value, which is 23. A False Negative means the model predicted a patient was Non-Diabetic (0) when they were actually Diabetic (1). In a medical context, missing a positive diagnosis can lead to delayed treatment and severe consequences.

Q: Explain the difference between Precision and Recall for this model, and why both are important.

A: Precision (0.525): Out of all patients the model predicted were diabetic (TP + FP), only 52.5% actually were. Low precision means a high rate of False Positives (wasted follow-up tests or unnecessary worry).

Recall (0.582): Out of all patients who actually have diabetes (TP + FN), the model correctly identified 58.2% of them. Recall measures the model's ability to find all the positive cases. Since minimizing False Negatives (missing a diagnosis) is medically critical, a high recall is often prioritized for screening tests.

Q: How is the Error Rate related to Accuracy?

A: The Error Rate is simply the complement of Accuracy:

$$\text{Error Rate} = 1 - \text{Accuracy}$$

It represents the proportion of total predictions that were incorrect, $(\text{FP} + \text{FN}) / \text{Total}$.