# Diabetes Prediction using Logistic Regression from Scratch

In this notebook, we implement a logistic regression model without high-level estimators such as `sklearn.linear_model.LogisticRegression`. We build the model with `numpy` to expose the underlying mathematics: the sigmoid activation function, the cost calculation, and gradient descent.

## 1. Import Libraries

We start by importing the necessary libraries:
* `numpy` for matrix operations.
* `pandas` for data manipulation.
* `sklearn` for data splitting (`train_test_split`) and feature scaling (`StandardScaler`).

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## 2. Logistic Regression Class Definition

Here we define the `LogisticRegression` class. This class encapsulates the logic for the model:
* **`sigmoid`**: Maps any real number to a value in the interval (0, 1), which we interpret as the probability of the positive class.
* **`cost`**: Computes the binary cross-entropy loss, which measures how far the predicted probabilities are from the true labels.
* **`fit`**: Updates the weights and bias with gradient descent to minimize the cost over a specified number of iterations.
* **`predict`**: Uses the learned weights and bias to predict binary class labels: 1 if the predicted probability is at least 0.5, otherwise 0.

In [2]:
class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=10000):
        self.lr = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = 0
        self.cost_history = []

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def cost(self, h, y):
        # Binary cross-entropy; clip h away from 0 and 1 so np.log never receives 0.
        h = np.clip(h, 1e-15, 1 - 1e-15)
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

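    def fit(self, X, y):
        m, n = X.shape
        y = np.asarray(y)  # accept a pandas Series or a NumPy array
        # Reset state so that calling fit again starts from scratch.
        self.weights = np.zeros(n)
        self.bias = 0
        self.cost_history = []

        for _ in range(self.iterations):
            # Forward pass: predicted probabilities for the current parameters.
            z = np.dot(X, self.weights) + self.bias
            h = self.sigmoid(z)

            # Gradients of the cross-entropy cost w.r.t. weights and bias.
            dw = 1 / m * np.dot(X.T, (h - y))
            db = 1 / m * np.sum(h - y)

            # Gradient descent step.
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

            # Cost of the pre-update parameters (h was computed before the step).
            self.cost_history.append(self.cost(h, y))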

    def predict(self, X):
        return (self.sigmoid(np.dot(X, self.weights) + self.bias) >= 0.5).astype(int)
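
One way to gain confidence in the gradient formulas used in `fit` is to compare them against a numerical gradient of the cost. The sketch below is not part of the original notebook: the helper names (`bce_cost`, `analytic_grad`, `numeric_grad`) and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_cost(w, b, X, y):
    # Mean binary cross-entropy for parameters (w, b).
    h = sigmoid(X @ w + b)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad(w, b, X, y):
    # Same formulas as in fit: dw = X^T (h - y) / m, db = mean(h - y).
    h = sigmoid(X @ w + b)
    return X.T @ (h - y) / len(y), np.mean(h - y)

def numeric_grad(w, b, X, y, eps=1e-6):
    # Central finite differences, perturbing one parameter at a time.
    dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        dw[j] = (bce_cost(w + step, b, X, y) - bce_cost(w - step, b, X, y)) / (2 * eps)
    db = (bce_cost(w, b + eps, X, y) - bce_cost(w, b - eps, X, y)) / (2 * eps)
    return dw, db

# Synthetic data (hypothetical, not the diabetes dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.5).astype(float)
w_demo = rng.normal(size=3)
b_demo = 0.1

dw_a, db_a = analytic_grad(w_demo, b_demo, X_demo, y_demo)
dw_n, db_n = numeric_grad(w_demo, b_demo, X_demo, y_demo)
max_diff = max(np.max(np.abs(dw_a - dw_n)), abs(db_a - db_n))
print(f"largest gradient mismatch: {max_diff:.2e}")
```

If the analytic and numerical gradients agree to within a small tolerance, the update rule is consistent with the cost being minimized.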

## 3. Load Data

We load and separate the data into features ($X$) and the target variable ($Y$, 'Outcome').

In [3]:
data = pd.read_csv('diabetes.csv')
Y = data['Outcome']
X = data.drop('Outcome', axis=1)

## 4. Data Preprocessing

1.  **Train-Test Split**: We split the dataset into training (80%) and testing (20%) sets to validate the model's performance on unseen data.
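2.  **Feature Scaling**: We use `StandardScaler` to standardize each feature to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training. Standardization helps gradient descent converge faster and more reliably.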

In [4]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
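
To make the scaling step concrete, the following sketch reproduces the same z-score transform with plain `numpy` (`StandardScaler` uses the population standard deviation, i.e. `ddof=0`). The arrays `train` and `test` are synthetic stand-ins, assumed here purely for illustration.

```python
import numpy as np

# Synthetic features on very different scales (hypothetical data).
rng = np.random.default_rng(0)
train = rng.normal(loc=[100.0, 0.5], scale=[20.0, 0.1], size=(200, 2))
test = rng.normal(loc=[100.0, 0.5], scale=[20.0, 0.1], size=(50, 2))

# Statistics are estimated from the training set only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse the training statistics

print("train mean/std:", train_scaled.mean(axis=0), train_scaled.std(axis=0))
print("test  mean/std:", test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

The scaled training features have mean 0 and standard deviation 1 by construction. The test features end up only approximately standardized, which is expected because they are transformed with the training statistics.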

## 5. Model Training

We instantiate our custom `LogisticRegression` model with the default learning rate (0.01) and 4000 iterations. The `fit` method then learns the parameters (weights and bias) from the training data.

In [5]:
model = LogisticRegression(iterations=4000)
model.fit(X_train, Y_train)
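
A quick way to check that training behaves as intended is to look at how the cost evolves over the iterations. The sketch below uses a compact stand-alone training loop (`train_logreg`, a hypothetical helper that mirrors the update rule of the class above) on synthetic data, since the real dataset is not assumed to be available here.

```python
import numpy as np

def train_logreg(X, y, lr=0.01, iterations=4000):
    # Batch gradient descent for logistic regression, recording the cost each step.
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []
    for _ in range(iterations):
        h = 1 / (1 + np.exp(-(X @ w + b)))
        history.append(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))
        w -= lr * (X.T @ (h - y)) / m
        b -= lr * np.mean(h - y)
    return w, b, np.array(history)

# Synthetic, roughly standardized data with noisy labels (hypothetical).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
true_w = np.array([1.5, -2.0, 0.5])  # made-up weights used to generate labels
y_demo = (rng.random(300) < 1 / (1 + np.exp(-(X_demo @ true_w)))).astype(float)

w, b, history = train_logreg(X_demo, y_demo)
print(f"cost at start: {history[0]:.4f}, cost at end: {history[-1]:.4f}")
print("cost never increased:", bool(np.all(np.diff(history) <= 1e-12)))
```

With all parameters initialized to zero, every predicted probability is 0.5, so the first recorded cost is ln 2 ≈ 0.693. In the notebook itself, the same kind of check can be applied to `model.cost_history` after calling `fit`.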

## 6. Evaluation

Finally, we test the trained model on the `X_test` set. We compare the predicted labels against the actual `Y_test` labels to calculate the accuracy score.

In [6]:
predictions = model.predict(X_test)
accuracy = np.mean(predictions == Y_test)
print(f"Model accuracy: {accuracy:.2f}")

Model accuracy: 0.75
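
Accuracy alone does not show which kinds of errors the model makes. As a rough sketch, the counts behind it can be broken down with `numpy`. The label arrays below are made up for illustration and are not the notebook's predictions.

```python
import numpy as np

# Made-up labels for illustration only (not the notebook's results).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The same expressions work with `predictions` and `np.asarray(Y_test)` in place of the toy arrays.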
