# Logistic Regression from Scratch (Diabetes Dataset)

In this notebook, I implement logistic regression *from scratch*, without scikit-learn.
The dataset is the Pima Indians Diabetes dataset.

---

## 1. Import Dependencies, Load and Explore the Dataset


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("diabetes.csv")
df.head()

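|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |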


## 2. Train/Test Split
We split the dataset into train and test sets using a custom shuffle function.


In [3]:
def Shuffle_split(data, test_ratio, seed=42):
    """Shuffle a dataset and split it into train and test sets.

    Args:
        data (DataFrame): dataset to split
        test_ratio (float): fraction of rows assigned to the test set
        seed (int): random seed for reproducibility

    Returns:
        (DataFrame, DataFrame): train_set, test_set
    """
    np.random.seed(seed)
    shuffled_indices = np.random.permutation(len(data))
    test_size = int(test_ratio * len(data))
    test_indices = shuffled_indices[:test_size]
    train_indices = shuffled_indices[test_size:]

    return data.iloc[train_indices], data.iloc[test_indices]

In [4]:
train_set, test_set = Shuffle_split(df,0.2)
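
A quick sanity check on the split can be sketched on a small synthetic DataFrame. The `toy` frame and the `shuffle_split_sketch` helper below are hypothetical stand-ins, not the notebook's data or function; the helper restates the same idea with a local `RandomState` so the block runs on its own. The two parts should have the expected sizes, share no rows, and together cover every row.

```python
import numpy as np
import pandas as pd

def shuffle_split_sketch(data, test_ratio, seed=42):
    # Hypothetical restatement of Shuffle_split, using a local random state
    # instead of seeding NumPy's global generator.
    rng_state = np.random.RandomState(seed)
    shuffled = rng_state.permutation(len(data))
    n_test = int(test_ratio * len(data))
    return data.iloc[shuffled[n_test:]], data.iloc[shuffled[:n_test]]

# Synthetic data: 10 rows, one feature, alternating labels.
toy = pd.DataFrame({"feature": np.arange(10), "Outcome": [0, 1] * 5})
toy_train, toy_test = shuffle_split_sketch(toy, 0.2)

# Rows present in both parts would indicate leakage between train and test.
overlap = set(toy_train.index) & set(toy_test.index)

print(len(toy_train), len(toy_test))  # 8 2
print(len(overlap))                   # 0
```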

In [5]:
# Separate features and labels
X = train_set.drop("Outcome", axis=1).to_numpy()
Y = train_set["Outcome"].to_numpy().reshape(-1, 1)

## 3. Normalization

In [6]:
def zscore_normalize_features(X):
    """
    Normalize features using z-score method.

    Args:
        X (ndarray): feature matrix

    Returns:
        X_norm, mu, sigma
    """
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma


In [7]:
X_norm, mean, sigma = zscore_normalize_features(X)
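
When the test set is evaluated later, its features would normally be scaled with the mean and standard deviation computed on the training set, not recomputed on the test rows. The sketch below illustrates that on made-up arrays (`X_train_demo` and `X_test_demo` are hypothetical). It also guards against constant columns, where `sigma` would be 0; that guard is an added assumption, not part of the function above.

```python
import numpy as np

# Hypothetical stand-ins for the train and test feature matrices.
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)

# Added safeguard (assumption): avoid dividing by zero for constant columns.
sigma_safe = np.where(sigma == 0, 1.0, sigma)

X_train_norm = (X_train_demo - mu) / sigma_safe
X_test_norm = (X_test_demo - mu) / sigma_safe  # reuse training mu and sigma

# After scaling, each training column has mean ~0 and standard deviation ~1.
print(X_train_norm.mean(axis=0))
print(X_train_norm.std(axis=0))
```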

## 4. Sigmoid Function

In [8]:
def sigmoid(z):
    """Map z to the interval (0, 1) with the logistic function."""
    return 1 / (1 + np.exp(-z))
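
For very negative inputs, `np.exp(-z)` in the formula above can overflow and emit a runtime warning, even though the result still rounds to 0. One common workaround, sketched here under the assumption that `z` is a NumPy array, evaluates the two algebraically equivalent forms of the logistic function on the side where each is safe. The name `stable_sigmoid` is hypothetical and not part of the notebook.

```python
import numpy as np

def stable_sigmoid(z):
    # Expects an array-like input; returns an array of the same shape.
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0

    # For z >= 0, exp(-z) is at most 1, so this form cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))

    # For z < 0, use exp(z) / (1 + exp(z)); exp(z) is below 1 here.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)  # approximately 0, 0.5 and 1
```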

## 5. Cost Function

In [9]:
# Define the cost function: the mean binary cross-entropy (log loss), not mean squared error.
def compute_cost_logistic(X, Y, w, b):
    """
    Compute the logistic regression cost (mean binary cross-entropy).

    Args:
        X (ndarray): shape (m, n) feature matrix
        Y (ndarray): shape (m, 1) labels
        w (ndarray): shape (n, 1) weights
        b (float): bias term

    Returns:
        cost (ndarray): shape (1,) array holding the mean loss over the m examples
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        z_i = np.dot(X[i], w) + b    # linear score for example i
        f_wb_i = sigmoid(z_i)        # predicted probability of class 1

        # Cross-entropy loss for example i
        cost += -Y[i] * np.log(f_wb_i) - (1 - Y[i]) * np.log(1 - f_wb_i)

    cost = cost / m
    return cost

In [10]:
w_tmp = np.ones((8,1))
b_tmp = -1.5
X_train = X_norm
y_train = Y
print(compute_cost_logistic(X_train, y_train, w_tmp, b_tmp))

[0.84478499]
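
The loop in `compute_cost_logistic` evaluates the binary cross-entropy

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f_{w,b}(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - f_{w,b}(x^{(i)})\right) \right], \qquad f_{w,b}(x) = \mathrm{sigmoid}(w \cdot x + b).$$

The same quantity can be computed without a Python loop. The sketch below is one possible vectorized variant, checked against a loop version on random data; `compute_cost_vectorized`, `compute_cost_loop`, and the `*_demo` arrays are hypothetical names introduced for illustration, and unlike the notebook's function the vectorized version returns a plain float.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_vectorized(X, Y, w, b):
    # X: (m, n), Y: (m, 1), w: (n, 1), b: scalar
    m = X.shape[0]
    f_wb = sigmoid(X @ w + b)  # predicted probabilities, shape (m, 1)
    losses = -Y * np.log(f_wb) - (1 - Y) * np.log(1 - f_wb)
    return float(np.sum(losses) / m)

def compute_cost_loop(X, Y, w, b):
    # Loop version restated here so the comparison is self-contained.
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)
        cost += -Y[i] * np.log(f_wb_i) - (1 - Y[i]) * np.log(1 - f_wb_i)
    return cost / m

# Made-up data: 5 examples, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
Y_demo = np.array([[0], [1], [1], [0], [1]])
w_demo = np.ones((3, 1))
b_demo = -1.5

loop_cost = compute_cost_loop(X_demo, Y_demo, w_demo, b_demo)
vec_cost = compute_cost_vectorized(X_demo, Y_demo, w_demo, b_demo)
print(np.allclose(loop_cost, vec_cost))  # True

# With zero weights and bias every prediction is 0.5, so the cost is log(2).
zero_cost = compute_cost_vectorized(X_demo, Y_demo, np.zeros((3, 1)), 0.0)
print(np.isclose(zero_cost, np.log(2)))  # True
```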
