# Logistic regression

Logistic regression is named for the function at the core of the method, the logistic function.
Like linear regression, it represents the model as an equation: a weighted sum of the inputs. The difference is that this sum is passed through the logistic function to produce the output.


The equation is written as:

yhat = 1 / (1 + e^(-z))

where z = b0 + b1 * x1

yhat is a real value between 0 and 1 that needs to be rounded to an integer value and mapped to a predicted class value.
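
To make the mapping from z to a class concrete, here is a minimal sketch. The helper names `logistic` and `to_class` and the explicit 0.5 threshold are assumptions made for illustration; they are not part of the notebook's own code, which simply rounds yhat.

```python
from math import exp

def logistic(z):
    """Squash any real number z into the open interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def to_class(yhat, threshold=0.5):
    """Map a score between 0 and 1 to a class label (0 or 1)."""
    return 1 if yhat > threshold else 0

# Large negative z gives a score near 0, large positive z a score near 1.
for z in (-4.0, 0.0, 4.0):
    yhat = logistic(z)
    print(z, round(yhat, 3), to_class(yhat))
```

A score of exactly 0.5 is mapped to class 0 here, which matches what Python's `round` does for 0.5.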

# Stochastic gradient descent

As with linear regression, we use stochastic gradient descent to estimate the coefficients, but the update rule applied to each coefficient is different:

b = b + (learning rate) * (y - yhat) * yhat * (1 - yhat) * x
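
A single application of this update can be sketched as follows. The function name `sgd_update` and the choice of starting coefficients are assumptions for illustration; the intercept is updated with x set to 1 because it has no input attached.

```python
from math import exp

def sgd_update(b, x, y, yhat, learning_rate):
    """Apply b = b + learning_rate * (y - yhat) * yhat * (1 - yhat) * x."""
    return b + learning_rate * (y - yhat) * yhat * (1.0 - yhat) * x

# One training example with a single input x1 and class label y.
x1, y = 2.7810836, 0
b0, b1 = 0.0, 0.0

# With all coefficients at zero, z = 0 and the prediction is 0.5.
yhat = 1.0 / (1.0 + exp(-(b0 + b1 * x1)))

# Both coefficients are updated from the same prediction.
b0 = sgd_update(b0, 1.0, y, yhat, learning_rate=0.3)
b1 = sgd_update(b1, x1, y, yhat, learning_rate=0.3)
print(b0, b1)
```

Because the true class is 0 and the prediction is 0.5, both coefficients move in the negative direction, which lowers the prediction for this example on the next pass.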

# Steps involved in this notebook

There are 3 parts here:
1. Making predictions
2. Estimating coefficients
3. Case study on [Pima Indians Diabetes](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

# 1. Making predictions

In [2]:
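from math import exp

def predict(row, coefficients):
    # Start with the intercept b0, then add each coefficient times its input.
    yhat = coefficients[0]
    # The last column of row is the class label, so it is skipped.
    for i in range(len(row) - 1):
        yhat += coefficients[i+1] * row[i]
    # Pass the weighted sum through the logistic function.
    return 1.0 / (1.0 + exp(-yhat))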

In [4]:
# testing the predict function

dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]

coef = [-0.406605464, 0.852573316, -1.104746259]

for row in dataset:
    yhat = predict(row, coef)
    print(f"Expected: {round(row[-1], 3)} Predicted: {round(yhat, 3)}  [{round(yhat)}]")

Expected: 0 Predicted: 0.299  [0]
Expected: 0 Predicted: 0.146  [0]
Expected: 0 Predicted: 0.085  [0]
Expected: 0 Predicted: 0.22  [0]
Expected: 0 Predicted: 0.247  [0]
Expected: 1 Predicted: 0.955  [1]
Expected: 1 Predicted: 0.862  [1]
Expected: 1 Predicted: 0.972  [1]
Expected: 1 Predicted: 0.999  [1]
Expected: 1 Predicted: 0.905  [1]
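
The rounded predictions can be summarised as a classification accuracy. The sketch below is an illustration only: the `accuracy` helper is a hypothetical name, and the probabilities are copied from the printed output above rather than recomputed.

```python
def accuracy(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

# Expected labels and predicted probabilities, copied from the output above.
expected = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
probabilities = [0.299, 0.146, 0.085, 0.22, 0.247,
                 0.955, 0.862, 0.972, 0.999, 0.905]

# Round each probability to the nearest class label.
labels = [round(p) for p in probabilities]
print(accuracy(expected, labels))  # 100.0
```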


# 2. Estimating coefficients

We use stochastic gradient descent to estimate the coefficients from the training data.

In [7]:
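def coefficients_sgd(train, l_rate, n_epoch):
    # One coefficient per input column plus the intercept, all starting at 0.
    coef = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            sum_error += (error ** 2)
            # Update the intercept, which has no associated input value.
            coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
            # Update the coefficient of each input column.
            for i in range(len(row) - 1):
                coef[i+1] = coef[i+1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
        # Report the sum of squared errors accumulated over this epoch.
        print(f"Epoch: {epoch}, lrate: {l_rate}, error: {round(sum_error, 3)}")
    return coef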

Training on the same small dummy dataset, with a learning rate of 0.3 and 50 epochs:

In [10]:
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]
l_rate = 0.3
n_epoch = 50
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)

Epoch: 0, lrate: 0.3, error: 2.217
Epoch: 1, lrate: 0.3, error: 1.613
Epoch: 2, lrate: 0.3, error: 1.113
Epoch: 3, lrate: 0.3, error: 0.827
Epoch: 4, lrate: 0.3, error: 0.623
Epoch: 5, lrate: 0.3, error: 0.494
Epoch: 6, lrate: 0.3, error: 0.412
Epoch: 7, lrate: 0.3, error: 0.354
Epoch: 8, lrate: 0.3, error: 0.31
Epoch: 9, lrate: 0.3, error: 0.276
Epoch: 10, lrate: 0.3, error: 0.248
Epoch: 11, lrate: 0.3, error: 0.224
Epoch: 12, lrate: 0.3, error: 0.205
Epoch: 13, lrate: 0.3, error: 0.189
Epoch: 14, lrate: 0.3, error: 0.174
Epoch: 15, lrate: 0.3, error: 0.162
Epoch: 16, lrate: 0.3, error: 0.151
Epoch: 17, lrate: 0.3, error: 0.142
Epoch: 18, lrate: 0.3, error: 0.134
Epoch: 19, lrate: 0.3, error: 0.126
Epoch: 20, lrate: 0.3, error: 0.119
Epoch: 21, lrate: 0.3, error: 0.113
Epoch: 22, lrate: 0.3, error: 0.108
Epoch: 23, lrate: 0.3, error: 0.103
Epoch: 24, lrate: 0.3, error: 0.098
Epoch: 25, lrate: 0.3, error: 0.094
Epoch: 26, lrate: 0.3, error: 0.09
Epoch: 27, lrate: 0.3, error: 0.087
Epoc

# 3. Case study


[1.1999999999999995, 1.9999999999999996, 3.5999999999999996, 2.8, 4.3999999999999995]
RMSE for the dummy dataset is:  0.693
RMSE is:  33.62982326492123
