# 🧠 Diabetes Prediction using Logistic Regression
This project uses the **Pima Indians Diabetes Dataset** to build a binary classification model that predicts whether a patient is likely to have diabetes.

Key steps:
- Data cleaning (handling zero entries in medical fields)
- Exploratory Data Analysis (EDA)
- Model training using Logistic Regression
- Saving predictions to CSV for external use

Tool: **Kaggle Notebook** | Language: **Python** | Library: **scikit-learn**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


In [None]:
df = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
df.head()

## 🔍 Data Overview and Cleaning
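We checked for explicit missing values with `isnull()` and also counted zeros in the medical columns `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI`. A zero in these fields is not physiologically plausible, so we treat it as hidden missing data and replace it with the column median.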

In [None]:
# Shape of the dataset
print("Dataset shape:", df.shape)

# Basic info
df.info()

# Check for missing values
print("\nMissing values:\n", df.isnull().sum())

# Check for zero values in certain columns (zero might be invalid)
zero_check_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print("\nZeros in key columns:")
print((df[zero_check_cols] == 0).sum())

In [None]:
import numpy as np

# Replace 0s with NaN in relevant columns
cols_to_clean = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_to_clean] = df[cols_to_clean].replace(0, np.nan)

# Fill NaNs with each column's median (note: computed on the full dataset, before the train/test split)
df[cols_to_clean] = df[cols_to_clean].fillna(df[cols_to_clean].median())

# Double check
df[cols_to_clean].isnull().sum()
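
To see why the zeros are converted to `NaN` before the median is taken, here is a minimal sketch on a made-up `Insulin` column (the values are illustrative, not taken from the dataset). Leaving the zeros in place pulls the median down, so the value used for imputation would itself be biased toward zero.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): the three zeros stand in for unrecorded measurements.
toy = pd.DataFrame({"Insulin": [0, 0, 0, 94, 168, 88, 543]})

# Median with the zeros still counted as real measurements.
median_with_zeros = toy["Insulin"].median()

# Median after marking zeros as missing; pandas skips NaN by default.
insulin_clean = toy["Insulin"].replace(0, np.nan)
median_without_zeros = insulin_clean.median()

# Impute the missing entries with the median of the observed values.
insulin_imputed = insulin_clean.fillna(median_without_zeros)

print("Median with zeros:   ", median_with_zeros)
print("Median without zeros:", median_without_zeros)
print(insulin_imputed.tolist())
```

In this toy case the median moves from 88 to 131 once the zeros are excluded, which is the value the missing rows then receive.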

## ⚙️ Feature Scaling and Model Training
We split the dataset into training (80%) and test (20%) sets, standardized the features with `StandardScaler` (fit on the training set only, then applied to the test set), and trained a Logistic Regression model.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
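
The cleaning step earlier fills missing values with medians computed on the full dataset, so the test rows influence the imputed values. One possible way to keep the test set fully unseen is to move imputation into a scikit-learn `Pipeline`, so that the medians and the scaling statistics are learned from the training split only. The sketch below is an alternative, not what this notebook does. It uses synthetic data from `make_classification` as a stand-in for the diabetes features, and the injected missing values and the `*_demo` variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

# Knock out roughly 10% of the entries to mimic unrecorded measurements.
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each step is fit on the training split only when pipe.fit is called.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

print("Test accuracy on synthetic data:", pipe.score(X_te, y_te))
```

With the real data, the zeros would first be replaced by `NaN` as in the cleaning cell, and the `fillna` call would be dropped in favor of the imputer inside the pipeline.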

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Model training
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

## 📈 Evaluation & Prediction Output
The model achieved an accuracy of **75.3%** on the test set. We saved the test-set predictions, keyed by the original row index, to `logistic_regression_predictions.csv` for later use.
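
Accuracy on its own can be hard to interpret when the classes are imbalanced, so it can help to read precision and recall off the confusion matrix and to compare against a majority-class baseline. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`), not this notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = diabetic, 0 = not diabetic.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

# Accuracy obtained by always predicting the most common class.
positive_rate = np.mean(y_true_demo)
baseline = max(positive_rate, 1 - positive_rate)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} baseline={baseline:.2f}")
```

Here the toy classifier reaches 0.70 accuracy against a 0.60 baseline, while recall is only 0.50. That kind of gap is worth checking on the real confusion matrix as well.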

In [None]:
# Save predictions to CSV
submission = pd.DataFrame({
    'Index': X_test.index,
    'Predicted_Outcome': y_pred
})

submission.to_csv('logistic_regression_predictions.csv', index=False)
submission.head()
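
For downstream use it can be handy to store the predicted probability next to the hard label, which `predict_proba` provides for logistic regression. This is a minimal sketch on synthetic data. The column name `Predicted_Probability`, the feature names, and the output path are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled diabetes features.
X_arr, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = LogisticRegression().fit(X_tr, y_tr)

out = pd.DataFrame({
    "Index": X_te.index,
    "Predicted_Outcome": clf.predict(X_te),
    # Column 1 of predict_proba is the probability of the positive class.
    "Predicted_Probability": clf.predict_proba(X_te)[:, 1],
})

# Written to the system temp directory here; in a Kaggle notebook a plain
# filename in the working directory would be used instead.
csv_path = os.path.join(tempfile.gettempdir(), "predictions_with_probabilities.csv")
out.to_csv(csv_path, index=False)

reloaded = pd.read_csv(csv_path)
print(reloaded.head())
```

Keeping the probability lets a consumer of the file apply a decision threshold other than the default 0.5 without retraining.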