# Logistic regression from scratch

$\hat{y} = \frac{1}{1 + e^{-z}}$ &nbsp;&nbsp;&nbsp;&nbsp; $z = w \cdot X + b$ <br>
where
- $\hat{y}$ is the predicted probability that the sample belongs to class 1
- $X$ is the input feature vector
- $w$ is the weight vector of the model
- $b$ is the bias of the model
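
As a quick illustration of the model equation, the sketch below computes the linear score and passes it through the sigmoid. The feature values, weights, and bias are made-up numbers chosen for illustration, not values from the dataset used later.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values only (assumed, not taken from the notebook's data).
X = np.array([[1.0, 2.0],
              [-1.0, 0.5]])   # two samples, two features
w = np.array([0.5, -0.25])    # one weight per feature
b = 0.0                       # bias

z = X.dot(w) + b              # linear score for each sample
y_hat = sigmoid(z)            # predicted probability of class 1
print(y_hat)
```

A score of exactly zero maps to a probability of 0.5, which is why 0.5 is a natural decision threshold.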

### Cost function
Binary cross-entropy (log-loss) cost function, averaged over the $n$ training samples: <br>
$J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_{i}\log\hat{y}_{i} + (1-y_{i})\log(1-\hat{y}_{i}) \right)$
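
The cost can be evaluated directly with NumPy. This is a minimal sketch: the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions added here to avoid taking the log of zero, and the labels and probabilities are made-up values.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean log-loss for labels y in {0, 1} and predicted probabilities y_hat."""
    y = np.asarray(y, dtype=float)
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    # (eps is an assumed safeguard, not part of the formula itself).
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Illustrative labels and predictions (assumed values).
y = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # mostly right, and sure about it
uncertain = np.array([0.5, 0.5, 0.5, 0.5])   # always predicts 0.5

loss_confident = binary_cross_entropy(y, confident)
loss_uncertain = binary_cross_entropy(y, uncertain)
print(loss_confident, loss_uncertain)
```

Predicting 0.5 for every sample gives a cost of $\log 2$, so a trained model should end up below that value.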

### Gradient descent
$w_{t+1} = w_{t} - \alpha D_{w_{t}}$ <br>
$b_{t+1} = b_{t} - \alpha D_{b_{t}}$ <br>

where
- $D_{w}$ is the partial derivative of the cost function with respect to $w$
- $D_{b}$ is the partial derivative of the cost function with respect to $b$
- $\alpha$ is the learning rate of the algorithm

$D_{w} = \frac{1}{n} \sum_{i=1}^{n} X_{i}(\hat{y}_{i} - y_{i})$ <br>
$D_{b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_{i} - y_{i})$
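
One way to gain confidence in these gradient formulas is to compare them with a finite-difference approximation of the cost. The sketch below does this on random synthetic data. The helper names, the seed, and the step size `h` are all assumptions made for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Binary cross-entropy averaged over all samples."""
    y_hat = sigmoid(X.dot(w) + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradients(w, b, X, y):
    """Analytic gradients D_w and D_b from the formulas above."""
    y_hat = sigmoid(X.dot(w) + b)
    n = X.shape[0]
    dw = X.T.dot(y_hat - y) / n
    db = np.sum(y_hat - y) / n
    return dw, db

# Synthetic data (assumed, for the check only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

dw, db = gradients(w, b, X, y)

# Central finite differences, one parameter at a time.
h = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = h
    dw_numeric[j] = (cost(w + step, b, X, y) - cost(w - step, b, X, y)) / (2 * h)
db_numeric = (cost(w, b + h, X, y) - cost(w, b - h, X, y)) / (2 * h)

max_error = max(np.max(np.abs(dw - dw_numeric)), abs(db - db_numeric))
print(max_error)
```

If the analytic and numeric gradients agree to within a small tolerance, the update rule is at least consistent with the cost function.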

In [1]:
import numpy as np

In [18]:
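class LogisticRegression:
    """Binary logistic regression trained with batch gradient descent."""

    def __init__(self, learning_rate, epochs):
        self.learning_rate = learning_rate  # step size (alpha)
        self.epochs = epochs                # number of gradient descent iterations

    @staticmethod
    def _sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        # m = number of training samples (n in the formulas above),
        # n = number of features
        self.m, self.n = X.shape

        # Start from zero weights and zero bias
        self.w = np.zeros(self.n)
        self.b = 0

        self.X = X
        self.y = y

        for _ in range(self.epochs):
            self.update_weights()

    def update_weights(self):
        # Predicted probabilities for every training sample
        y_hat = self._sigmoid(self.X.dot(self.w) + self.b)

        # Gradients of the cost with respect to w and b
        dw = (1 / self.m) * np.dot(self.X.T, (y_hat - self.y))
        db = (1 / self.m) * np.sum(y_hat - self.y)

        # Gradient descent step
        self.w -= self.learning_rate * dw
        self.b -= self.learning_rate * db

    def predict(self, X):
        # Classify as 1 when the predicted probability exceeds 0.5
        y_pred = self._sigmoid(X.dot(self.w) + self.b)
        return np.where(y_pred > 0.5, 1, 0)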

In [5]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [6]:
diabetes_data = pd.read_csv('Dataset/diabetes.csv')
diabetes_data.head()

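| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |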


In [7]:
diabetes_data.shape

(768, 9)

In [8]:
diabetes_data.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


In [9]:
features = diabetes_data.drop(columns='Outcome')
target = diabetes_data['Outcome']

In [10]:
scaler = StandardScaler()
scaler.fit(features)
features = scaler.transform(features)
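
For intuition, the standardization step subtracts each column's mean and divides by its standard deviation (the population standard deviation, `ddof=0`). The NumPy sketch below reproduces that arithmetic on a tiny made-up array. The array values are assumptions for illustration, not rows from the diabetes dataset.

```python
import numpy as np

# Two columns on very different scales (assumed toy values).
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

mean = data.mean(axis=0)       # per-column mean
std = data.std(axis=0)         # per-column population std (ddof=0)
scaled = (data - mean) / std   # zero mean, unit variance per column

print(scaled)
```

After scaling, both columns carry the same values, so no feature dominates the gradient simply because of its units.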

In [11]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, shuffle=True)
print(features.shape, X_train.shape, X_test.shape)

(768, 8) (514, 8) (254, 8)


In [22]:
model = LogisticRegression(0.01, 10000)
model.fit(X_train, y_train)

In [23]:
X_train_predictions = model.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_predictions)
print(f'Training accuracy: {training_data_accuracy}')

Training accuracy: 0.7665369649805448


In [26]:
X_test_predictions = model.predict(X_test)
testing_data_accuracy = accuracy_score(y_test, X_test_predictions)
print(f'Testing accuracy: {testing_data_accuracy}')

Testing accuracy: 0.7992125984251969
