# Diabetes Prediction using Logistic Regression and KNN

This notebook builds a **Diabetes Prediction Model** on the Pima Indians Diabetes dataset.  

We will follow these steps:
1. Load the data in a DataFrame  
2. Perform Data Preprocessing  
3. Perform Exploratory Data Analysis (EDA)  
4. Build models using **Logistic Regression** and **K-Nearest Neighbour (KNN)**  
5. Evaluate both models with accuracy, a confusion matrix, and a classification report

In [None]:
import pandas as pd

# Load dataset
data = pd.read_csv("diabetes.csv")

# Show first few rows
data.head()

In [None]:
import numpy as np

# Column dtypes and non-null counts
data.info()

# NaN count per column (printed explicitly: a cell only displays its last expression)
print(data.isnull().sum())

# Zeros in these columns are biologically implausible, so treat them as missing
cols_with_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[cols_with_zero] = data[cols_with_zero].replace(0, np.nan)

# Fill NaN with each column's median
# Note: the medians are computed on the full dataset, before the train/test split
data.fillna(data.median(), inplace=True)

# Summary statistics after imputation
data.describe()
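
The imputation above computes medians over every row, so the rows that later become the test set influence the fill values. A minimal sketch of one alternative, fitting the medians on the training rows only, is shown below. It assumes scikit-learn's `SimpleImputer` and uses a small made-up frame (`toy`, with hypothetical values) in place of `diabetes.csv`; it is an illustration, not the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for diabetes.csv (hypothetical values, not real data)
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 0.0, 89.0, 137.0, 116.0, 78.0, 0.0],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Mark implausible zeros as missing, as in the cell above
zero_as_missing = ["Glucose", "BMI"]
features = toy.drop(columns="Outcome")
features[zero_as_missing] = features[zero_as_missing].replace(0, np.nan)

# Split first, so the test rows cannot affect the medians
train_X, test_X, train_y, test_y = train_test_split(
    features, toy["Outcome"], test_size=0.25, random_state=0
)

# Learn the medians from the training rows, then reuse them on the test rows
imputer = SimpleImputer(strategy="median")
train_imputed = pd.DataFrame(
    imputer.fit_transform(train_X), columns=train_X.columns, index=train_X.index
)
test_imputed = pd.DataFrame(
    imputer.transform(test_X), columns=test_X.columns, index=test_X.index
)

print(train_imputed.isna().sum())
```

On a dataset this small the difference is usually minor, but the split-then-impute order keeps the test set untouched by anything learned during preprocessing.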

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Outcome distribution
sns.countplot(x='Outcome', data=data)
plt.show()

# Correlation heatmap
plt.figure(figsize=(10,6))
sns.heatmap(data.corr(), annot=True, cmap="coolwarm")
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']

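# Split: hold out 20% for testing; stratify keeps the class ratio equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)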

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Train
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predict
y_pred_log = log_reg.predict(X_test)

# Evaluation
print("Accuracy (Logistic Regression):", accuracy_score(y_test, y_pred_log))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_log))
print("\nClassification Report:\n", classification_report(y_test, y_pred_log))
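
Accuracy alone can look good on an imbalanced target, so it may help to report recall, precision, and ROC AUC for the positive class as well. The sketch below shows how these could be computed from `predict` and `predict_proba`. It assumes synthetic data from `make_classification` (not the diabetes dataset), and the names `X_demo`, `y_demo`, `model`, and `scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary problem (assumption: stands in for the real data)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling and the classifier in one pipeline, fitted on the training split only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(Xtr, ytr)

pred = model.predict(Xte)               # hard 0/1 labels
proba = model.predict_proba(Xte)[:, 1]  # estimated probability of class 1

scores = {
    "recall": recall_score(yte, pred),
    "precision": precision_score(yte, pred),
    "roc_auc": roc_auc_score(yte, proba),  # uses probabilities, not labels
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Recall is often the metric to watch in a screening setting, since a false negative (a missed positive case) tends to cost more than a false positive.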

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Train
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict
y_pred_knn = knn.predict(X_test)

# Evaluation
print("Accuracy (KNN):", accuracy_score(y_test, y_pred_knn))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))
print("\nClassification Report:\n", classification_report(y_test, y_pred_knn))
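
The KNN cell fixes `n_neighbors=5`. One common way to choose this value is cross-validated grid search, sketched below under stated assumptions: the data is synthetic (`make_classification`), the candidate list `k_values` is arbitrary, and the pipeline step names `scale` and `knn` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled diabetes features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so it is refitted within each CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate neighbour counts (odd values avoid ties in binary voting)
k_values = [3, 5, 7, 9, 11]
search = GridSearchCV(
    pipe, {"knn__n_neighbors": k_values}, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
print("Best k:", best_k)
print("Mean CV accuracy:", round(search.best_score_, 3))
```

In the notebook itself, the search would be fitted on the training split only, with the held-out test set reserved for the final comparison against Logistic Regression.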