In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [11]:
# Load dataset
data = pd.read_csv('./diabetes.csv')


In [12]:
print(data.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   Pedigree  Age  Outcome  
0     0.627   50        1  
1     0.351   31        0  
2     0.672   32        1  
3     0.167   21        0  
4     2.288   33        1  


In [13]:
#Check for null or missing values
data.isnull().sum()

Pregnancies      0
Glucose          0
BloodPressure    0
SkinThickness    0
Insulin          0
BMI              0
Pedigree         0
Age              0
Outcome          0
dtype: int64

In [14]:
# Replace zeros with mean for selected columns
cols_to_replace = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
for column in cols_to_replace:
    # Assign the result back to the column; chained inplace=True on data[column]
    # acts on a temporary copy and stops working in pandas 3.0
    data[column] = data[column].replace(0, np.nan)
    data[column] = data[column].fillna(round(data[column].mean(skipna=True)))


In [15]:
# Features and target
X = data.iloc[:, :8]   # first 8 columns are features
Y = data['Outcome']    # target column

In [16]:
# Split data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

In [17]:
# Initialize KNN
knn = KNeighborsClassifier(n_neighbors=5)  # you can change k
knn.fit(X_train, Y_train)

In [18]:
# Predictions
knn_pred = knn.predict(X_test)

In [19]:
# Metrics
cm = confusion_matrix(Y_test, knn_pred)
accuracy = accuracy_score(Y_test, knn_pred)
error_rate = 1 - accuracy
precision = precision_score(Y_test, knn_pred)
recall = recall_score(Y_test, knn_pred)
f1 = f1_score(Y_test, knn_pred)

In [20]:
# Print results
print("Confusion Matrix:\n", cm)
print("Accuracy Score:", accuracy)
print("Error Rate:", error_rate)
print("Precision Score:", precision)
print("Recall Score:", recall)
print("F1 Score:", f1)

Confusion Matrix:
 [[88 19]
 [19 28]]
Accuracy Score: 0.7532467532467533
Error Rate: 0.24675324675324672
Precision Score: 0.5957446808510638
Recall Score: 0.5957446808510638
F1 Score: 0.5957446808510638
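
The scores printed above can be re-derived by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]); specificity is an extra quantity not computed in the notebook.

```python
import numpy as np

# Confusion matrix reported above: [[TN, FP], [FN, TP]]
cm = np.array([[88, 19],
               [19, 28]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions / all predictions
precision = tp / (tp + fp)             # predicted positives that are truly positive
recall = tp / (tp + fn)                # true positives that were found (sensitivity)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)           # true negatives that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} specificity={specificity:.4f}")
# accuracy=0.7532 precision=0.5957 recall=0.5957 f1=0.5957 specificity=0.8224
```

Precision, recall and F1 coincide here only because FP and FN both happen to equal 19.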


In [None]:
K-Nearest Neighbors (KNN) Algorithm
Concept

KNN is a supervised machine learning algorithm used for classification (and sometimes regression).
It predicts the class of a new data point from the majority class of its K nearest neighbors in the training data.

It is based on the idea that:

"Similar things exist close to each other."

How It Works (Step-by-Step)

1. Choose K: the number of nearest neighbors to consider (e.g., K=3 or K=5).

2. Calculate the distance between the new data point and all training points. Euclidean distance is the usual choice.

3. Find the K nearest points (those with the smallest distances).

4. Majority voting: the most common class among these K neighbors becomes the predicted class.

Example: If among 5 neighbors, 3 are "leave" and 2 are "stay", the result is "leave".
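
The procedure above can be sketched from scratch with NumPy. This is a minimal illustration, not the scikit-learn implementation; the function name `knn_predict` and the toy points are made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three "leave" points and three "stay" points
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_toy = np.array(["leave", "leave", "leave", "stay", "stay", "stay"])

# With k=5 the neighbors are 3 "leave" and 2 "stay", so the vote is "leave"
print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=5))  # leave
print(knn_predict(X_toy, y_toy, [3.0, 2.8], k=3))  # stay
```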

Example

Suppose you have data about customers:

Red dots = customers who left

Blue dots = customers who stayed

Now, a new customer (green point) appears.
KNN checks the K closest customers; if most of them are red, it predicts "will leave".




In [None]:
One-line explanation for each code cell

Cell 0: Import numerical, data-handling and machine-learning libraries (NumPy, pandas, scikit-learn utilities).

Cell 1: Load the diabetes dataset CSV into a pandas DataFrame.

Cell 2: Print the first few rows to inspect feature names and sample records.

Cell 3: Check for nulls/missing values to assess data quality.

Cell 4: Replace zero values in specific medical columns with the column mean (treat zeros as missing and impute).

Cell 5: Select the first eight columns as features X and the Outcome column as target Y.

Cell 6: Split the data into 80% training and 20% test sets with a fixed random seed.

Cell 7: Create a KNN classifier with k=5 and fit it on the training set.

Cell 8: Predict the class of each test record.

Cell 9: Compute the confusion matrix, accuracy, error rate, precision, recall and F1 score.

Cell 10: Print the evaluation results.

Short code summary (2-4 sentences)

The notebook loads a diabetes dataset, inspects it, and cleans it by treating zero values in physiological columns as missing and imputing them with column means. It then splits the data 80/20, fits a KNN classifier with k=5, and reports the confusion matrix, accuracy, error rate, precision, recall and F1; feature scaling is not applied. The pipeline focuses on producing a binary classifier to predict diabetes presence.

Theory (concise, exam-style)

Problem type: Binary classification, i.e. predicting presence (1) or absence (0) of diabetes from clinical features.

Missing data treatment: Zeroes in medical measurements (e.g., Glucose, BloodPressure) often indicate missing rather than actual zero; imputing (mean/median) recovers usable values but shrinks the column's variance.

Feature scaling: Many classifiers (KNN, SVM, logistic regression with gradient descent, neural nets) perform better when features are standardized (zero mean, unit variance).

Common models: Logistic Regression (probabilistic linear classifier), Decision Trees / Random Forests (nonparametric, handle unscaled inputs), KNN (instance-based, sensitive to scale), SVM (max-margin, sensitive to scale).

Evaluation metrics: Accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrix; choose metrics based on class balance and the clinical cost of FN vs FP.

Cross-validation & hyperparameter tuning: Use k-fold CV (or stratified CV) and Grid/Random Search to select robust hyperparameters and avoid overfitting.
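
The tuning idea in the last point can be sketched with scikit-learn's GridSearchCV. Everything about the data here is an assumption: `make_classification` produces a synthetic stand-in with 8 features and roughly 35% positives, and the candidate values of k and the choice of recall as the scoring metric are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)

# Scaling sits inside the pipeline so each CV fold fits its own scaler
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate neighbor counts (illustrative values)
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11, 15]}

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="recall")
search.fit(X_demo, y_demo)

print("best k:", search.best_params_["knn__n_neighbors"])
print("mean CV recall:", search.best_score_)
```

Placing the scaler inside the pipeline matters: fitting it on the full data before cross-validation would leak information from the validation folds.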

Algorithm: step-by-step (mapped to notebook cells)

Imports (Cell 0): load numpy, pandas and sklearn modules.

Load data (Cell 1): read CSV into DataFrame.

Inspect (Cell 2): preview data to see columns like Pregnancies, Glucose, BloodPressure, BMI, Outcome.

Check missing (Cell 3): isnull().sum() counts true nulls per column (none here); it does not flag placeholder zeros, which Cell 4 handles.

Impute zeros (Cell 4): for selected numeric columns, replace zeros with column mean (simple imputation).

Feature/target split (Cell 5): X = first eight columns (features), Y = Outcome.

Train/test split (Cell 6): train_test_split(X, Y, test_size=0.2, random_state=0); adding stratify=Y would preserve the class ratio.

Scale features (not done in the notebook): StandardScaler().fit_transform(X_train), then transform X_test with the same fitted scaler.

Train model (Cell 7): fit KNeighborsClassifier(n_neighbors=5) on X_train/Y_train.

Evaluate (Cells 8-10): predict on X_test and compute confusion matrix, accuracy, error rate, precision, recall and F1; ROC-AUC is not computed.

Tune & validate (not done in the notebook): use cross-validation and GridSearchCV to pick hyperparameters such as k.
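
The steps listed above can be combined into one scikit-learn Pipeline so that imputation and scaling statistics are learned from the training split only. This is a sketch under assumptions: the data frame `demo` is synthetic (the real diabetes.csv is not loaded here), it mimics only four of the columns, and the names `demo`, `zero_cols` and `model` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with a few of the notebook's column names
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "Pregnancies": rng.integers(0, 10, n),
    "Glucose": rng.normal(120, 30, n).round(),
    "BMI": rng.normal(32, 6, n).round(1),
    "Age": rng.integers(21, 70, n),
})
demo["Outcome"] = (demo["Glucose"] + rng.normal(0, 20, n) > 130).astype(int)

# Inject placeholder zeros to imitate unrecorded measurements
demo.loc[rng.choice(n, 20, replace=False), "Glucose"] = 0
demo.loc[rng.choice(n, 20, replace=False), "BMI"] = 0

X_demo = demo.drop(columns="Outcome")
y_demo = demo["Outcome"]

# Split first, stratified on the target, so the test set stays unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Treat 0 as missing only in columns where zero is not a valid value
zero_cols = ["Glucose", "BMI"]
impute = ColumnTransformer(
    [("zeros", SimpleImputer(missing_values=0, strategy="mean"), zero_cols)],
    remainder="passthrough",
)

model = Pipeline([
    ("impute", impute),                          # means come from X_tr only
    ("scale", StandardScaler()),                 # scaling stats come from X_tr only
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
model.fit(X_tr, y_tr)

test_recall = recall_score(y_te, model.predict(X_te))
print("test recall:", test_recall)
```

Restricting the imputer to `zero_cols` keeps legitimate zeros, such as zero pregnancies, untouched.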

Key concepts: definition + notebook-specific example (one line each)

Imputation: Filling missing values (here replacing zeros with mean for columns like Glucose).

Train/test split: Hold out data for unbiased evaluation. The notebook splits after imputation, which lets test-set values influence the imputed means; splitting first and fitting the imputer on the training set avoids this leakage.

Stratification: Keep the same class proportion in train and test; important because Outcome is imbalanced (the test set holds 107 negatives and 47 positives).

Standardization: Scale features (mean=0, std=1); important for distance-based and gradient-based models, including the KNN classifier used here, which the notebook trains on unscaled features.

Logistic regression: Linear model returning probabilities via the logistic function; a good baseline for binary medical outcomes.

Confusion matrix: Table of TP/FP/TN/FN; here TN=88, FP=19, FN=19, TP=28, so recall (sensitivity), which is critical in diagnostics, is 28/47 ≈ 0.60.

ROC & AUC: ROC curve plots TPR vs FPR; AUC measures separability independent of threshold.

Cross-validation: Average performance across folds to reduce variance in estimates and tune hyperparameters.

Overfitting: Model fits training noise; detect it via worse validation/test performance than training performance.

Bias of mean imputation: Mean imputation reduces variance and can bias estimators; median or model-based imputation may be better.
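
The variance shrinkage caused by mean imputation can be checked numerically. The sketch below uses made-up data: `values` is a hypothetical measurement column and about 30% of it is hidden at random.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=120.0, scale=30.0, size=1000)  # hypothetical measurements
missing = rng.random(1000) < 0.3                        # hide roughly 30% at random

observed = values[~missing]

# Fill every hidden entry with the mean (or median) of the observed entries
mean_filled = np.where(missing, observed.mean(), values)
median_filled = np.where(missing, np.median(observed), values)

print("std of observed values:     ", observed.std())
print("std after mean imputation:  ", mean_filled.std())
print("std after median imputation:", median_filled.std())

# Mean imputation keeps the mean but scales the standard deviation
# by sqrt(fraction of observed values)
shrink = np.sqrt((~missing).mean())
print("expected shrink factor:", shrink)
```

Every imputed entry sits exactly at the center of the distribution, so the spread of the filled column is necessarily smaller than the spread of the observed values.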

Conclusion & result interpretation (concise)

What preprocessing achieved: Replacing zeros with column means recovers usable numeric values but may underestimate variance and affect models sensitive to distribution.

Practical evaluation: Model accuracy alone is insufficient; in medical settings prioritize recall (sensitivity) to minimize missed diabetic cases, while controlling false positives to avoid unnecessary follow-ups.

Exam-style conclusion: The notebook correctly treats placeholder zeros as missing, but its KNN model is trained on unscaled features with an unstratified split and a fixed k=5. With recall at about 0.60 (19 of 47 diabetic cases missed), the next steps should include scaling, stratified splitting, CV-based tuning, and clinically aware metric reporting (sensitivity/ROC). Without those, results are incomplete.
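
The trade-off between recall and false positives noted above can be illustrated by lowering the decision threshold on predicted probabilities. This is a sketch on synthetic data, not on the diabetes dataset; the thresholds 0.5 and 0.3 and k=15 are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 8 features, roughly 35% positives
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)

# Fraction of the 15 neighbors that belong to the positive class
proba = knn.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_te, pred),
        precision_score(y_te, pred, zero_division=0),
    )
    print(f"threshold={threshold}: recall={results[threshold][0]:.3f} "
          f"precision={results[threshold][1]:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases; the usual cost is lower precision, that is, more false alarms.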

10 likely viva questions (with concise, exam-ready answers)

Q: Why might zeros in medical columns be treated as missing?
A: Because physiological measures like glucose or blood pressure cannot be truly zero in living patients; zeros usually signal missing or unrecorded measurements.

Q: Why is mean imputation a weak choice sometimes?
A: It reduces variance and can bias distributions; median or model-based imputation is often more robust, especially with skewed data.

Q: When should you standardize features and why?
A: Standardize before training models that rely on distances or gradient descent (KNN, SVM, logistic regression) to ensure features contribute proportionally.

Q: What is stratified sampling in train/test split?
A: It maintains the same class proportion in train and test sets to yield reliable performance estimates for imbalanced classes.

Q: How do you choose evaluation metrics for a medical classifier?
A: Prioritize recall/sensitivity (catch as many diseased as possible), and use precision and ROC-AUC to balance false positives.

Q: How can you detect overfitting?
A: If train accuracy is high while validation/test accuracy is substantially lower, the model is likely overfitting.

Q: Why might you prefer logistic regression as a baseline?
A: It is simple, interpretable, fast to train, and outputs probabilities helpful for clinical decision thresholds.

Q: What is the effect of imbalanced classes and how to handle it?
A: Imbalance can inflate accuracy; handle with stratified splits, resampling (SMOTE/undersampling), class weights, or threshold tuning.

Q: Why choose median over mean imputation for some columns?
A: Median is robust to outliers and skewed distributions, producing less biased central tendency.

Q: How do you explain model predictions to clinicians?
A: Provide predicted probabilities, confusion matrix tradeoffs, and feature-level explanations (SHAP/LIME) that show which features drive predictions.