# Logistic regression

Logistic regression is named for the function at the core of the method, the logistic function.
Like linear regression, it represents the model as an equation: a weighted sum of the inputs, which is then passed through the logistic function.


The equation is written as:

yhat = 1 / (1 + e^(-z))
where z = b0 + b1*x1

yhat is a real value between 0 and 1 that needs to be rounded to an integer value and mapped to a predicted class value.
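
As a quick check of the equation, the following minimal sketch evaluates the logistic function at a few values of z and maps each result to a class label. The function name `logistic`, the sample z values, and the explicit 0.5 threshold are illustrative assumptions, not part of the notebook.

```python
from math import exp

def logistic(z):
    """Map any real z to the open interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

for z in (-4.0, 0.0, 4.0):
    yhat = logistic(z)
    # Explicit threshold: values of 0.5 or more are assigned to class 1.
    label = 1 if yhat >= 0.5 else 0
    print(f"z={z:+.1f} yhat={yhat:.3f} class={label}")
```

Python's built-in `round` rounds halves to the nearest even integer, so `round(0.5)` is 0. An explicit threshold makes the tie-breaking rule visible when yhat is exactly 0.5.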

# Stochastic gradient descent

We use stochastic gradient descent here as well, but the coefficient update formula is different:

b = b + (learning rate) * (y - yhat) * yhat * (1 - yhat) * x
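
To make the update rule concrete, here is a minimal sketch of a single update applied to one training row. The helper name `sgd_step` is hypothetical, and treating the intercept's input as 1.0 is an assumption consistent with the formula, where b0 has no x term.

```python
from math import exp

def sgd_step(coef, row, l_rate):
    """Apply one update of the rule to a copy of coef.

    row holds the inputs followed by the class label in the last position.
    """
    z = coef[0] + sum(c * x for c, x in zip(coef[1:], row[:-1]))
    yhat = 1.0 / (1.0 + exp(-z))
    # Shared factor: (y - yhat) * yhat * (1 - yhat)
    grad = (row[-1] - yhat) * yhat * (1.0 - yhat)
    # The intercept b0 has no input, so its x is treated as 1.0.
    inputs = [1.0] + list(row[:-1])
    return [c + l_rate * grad * x for c, x in zip(coef, inputs)]

coef = sgd_step([0.0, 0.0, 0.0], [2.7810836, 2.550537003, 0], l_rate=0.3)
print(coef)
```

Starting from zero coefficients, yhat is 0.5, so the intercept moves by 0.3 * (0 - 0.5) * 0.5 * 0.5 = -0.0375, and each remaining coefficient moves by that amount times its input.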

# Steps involved in this notebook

There are 3 parts here:
1. Making predictions
2. Estimating coefficients
3. Case study on [Pima Indians Diabetes](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

# 1. Making predictions

In [2]:
from math import exp
def predict(row, coefficients):
    yhat = coefficients[0]
    for i in range(len(row) - 1):
        yhat += coefficients[i+1] * row[i]
    return 1.0 / (1.0 + exp(-yhat))

In [3]:
# testing the predict function

dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]

coef = [-0.406605464, 0.852573316, -1.104746259]

for row in dataset:
    yhat = predict(row, coef)
    print(f"Expected: {round(row[-1], 3)} Predicted: {round(yhat, 3)}  [{round(yhat)}]")

Expected: 0 Predicted: 0.299  [0]
Expected: 0 Predicted: 0.146  [0]
Expected: 0 Predicted: 0.085  [0]
Expected: 0 Predicted: 0.22  [0]
Expected: 0 Predicted: 0.247  [0]
Expected: 1 Predicted: 0.955  [1]
Expected: 1 Predicted: 0.862  [1]
Expected: 1 Predicted: 0.972  [1]
Expected: 1 Predicted: 0.999  [1]
Expected: 1 Predicted: 0.905  [1]


# 2. Estimating coefficients

We use stochastic gradient descent to estimate the coefficients from the training data.

In [5]:
def coefficients_sgd(train, l_rate, n_epoch):
    # One coefficient per input column, plus the intercept at index 0
    coef = [0.0 for _ in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0.0
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            sum_error += error ** 2
            # Shared factor of the update rule: l_rate * (y - yhat) * yhat * (1 - yhat)
            step = l_rate * error * yhat * (1.0 - yhat)
            coef[0] = coef[0] + step
            for i in range(len(row) - 1):
                coef[i+1] = coef[i+1] + step * row[i]
        print(f"Epoch: {epoch}, lrate: {l_rate}, error: {round(sum_error, 3)}")
    return coef

Example on dummy data

In [6]:
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]
l_rate = 0.3
n_epoch = 50
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)

Epoch: 0, lrate: 0.3, error: 2.217
Epoch: 1, lrate: 0.3, error: 1.613
Epoch: 2, lrate: 0.3, error: 1.113
Epoch: 3, lrate: 0.3, error: 0.827
Epoch: 4, lrate: 0.3, error: 0.623
Epoch: 5, lrate: 0.3, error: 0.494
Epoch: 6, lrate: 0.3, error: 0.412
Epoch: 7, lrate: 0.3, error: 0.354
Epoch: 8, lrate: 0.3, error: 0.31
Epoch: 9, lrate: 0.3, error: 0.276
Epoch: 10, lrate: 0.3, error: 0.248
Epoch: 11, lrate: 0.3, error: 0.224
Epoch: 12, lrate: 0.3, error: 0.205
Epoch: 13, lrate: 0.3, error: 0.189
Epoch: 14, lrate: 0.3, error: 0.174
Epoch: 15, lrate: 0.3, error: 0.162
Epoch: 16, lrate: 0.3, error: 0.151
Epoch: 17, lrate: 0.3, error: 0.142
Epoch: 18, lrate: 0.3, error: 0.134
Epoch: 19, lrate: 0.3, error: 0.126
Epoch: 20, lrate: 0.3, error: 0.119
Epoch: 21, lrate: 0.3, error: 0.113
Epoch: 22, lrate: 0.3, error: 0.108
Epoch: 23, lrate: 0.3, error: 0.103
Epoch: 24, lrate: 0.3, error: 0.098
Epoch: 25, lrate: 0.3, error: 0.094
Epoch: 26, lrate: 0.3, error: 0.09
Epoch: 27, lrate: 0.3, error: 0.087
...

# 3. Case study


In [16]:
%run helper-functions.ipynb

In [17]:
# Logistic regression algorithm with SGD
def logistic_regression(train, test, l_rate, n_epoch):
    predictions = list()
    # Fit the coefficients on the training rows only
    coef = coefficients_sgd(train, l_rate, n_epoch)

    for row in test:
        yhat = predict(row, coef)
        # Round the probability to a 0/1 class label
        yhat = round(yhat)
        predictions.append(yhat)
    return predictions
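
The `evaluate_algorithm` helper loaded from `helper-functions.ipynb` is not shown in this notebook. A function with the `(train, test, *args)` signature used above is typically driven by a k-fold cross-validation loop. The sketch below is an assumption about what such a loop might look like, not the helper's actual code; the names `kfold_accuracy` and `always_zero`, the round-robin fold split, and the toy data are all hypothetical.

```python
from random import Random

def kfold_accuracy(dataset, algorithm, n_folds, *args, seed=1):
    """Return one accuracy percentage per fold."""
    rows = list(dataset)
    Random(seed).shuffle(rows)
    # Deal the shuffled rows round-robin into n_folds roughly equal folds.
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predicted = algorithm(train, test, *args)
        correct = sum(1 for row, p in zip(test, predicted) if row[-1] == p)
        scores.append(100.0 * correct / len(test))
    return scores

# A stand-in algorithm that always predicts class 0, used only to exercise the loop.
def always_zero(train, test):
    return [0 for _ in test]

data = [[float(i), i % 2] for i in range(10)]
scores = kfold_accuracy(data, always_zero, 5)
print(scores)
```

Each row is used for testing exactly once, so averaging the fold scores gives an estimate of accuracy on unseen data rather than on the rows the coefficients were fitted to.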


In [23]:
# Testing the algorithm on dataset

seed(1)
filename = 'diabetes.csv'
dataset = load_csv(filename)

for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)

#normalize
minmax = dataset_minmax(dataset)
normalize_dataset(dataset, minmax)
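
The helpers `dataset_minmax` and `normalize_dataset` come from `helper-functions.ipynb` and are not shown here. As a rough illustration of min-max normalization, the sketch below rescales every column to the range 0 to 1. The function names `column_minmax` and `rescale`, the constant-column guard, and the sample values are assumptions made for this example, not the helper notebook's code.

```python
def column_minmax(rows):
    """Return a [min, max] pair for every column."""
    return [[min(col), max(col)] for col in zip(*rows)]

def rescale(rows, minmax):
    """Rescale each value in place to (value - min) / (max - min)."""
    for row in rows:
        for i, (lo, hi) in enumerate(minmax):
            # Guard against constant columns to avoid dividing by zero.
            row[i] = (row[i] - lo) / (hi - lo) if hi > lo else 0.0

sample = [[50.0, 30.0], [20.0, 90.0], [35.0, 60.0]]
rescale(sample, column_minmax(sample))
print(sample)  # [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
```

Putting all inputs on the same scale keeps any single large-valued column from dominating the coefficient updates at a given learning rate.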


In [27]:
# Evaluate algorithm

n_folds = 5
l_rate = 0.1
n_epoch = 100
scores = evaluate_algorithm(dataset, logistic_regression, n_folds, l_rate, n_epoch)
print("Scores: %s" % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))

Epoch: 0, lrate: 0.1, error: 139.813
Epoch: 1, lrate: 0.1, error: 133.309
Epoch: 2, lrate: 0.1, error: 128.382
Epoch: 3, lrate: 0.1, error: 124.424
Epoch: 4, lrate: 0.1, error: 121.213
Epoch: 5, lrate: 0.1, error: 118.571
Epoch: 6, lrate: 0.1, error: 116.362
Epoch: 7, lrate: 0.1, error: 114.487
Epoch: 8, lrate: 0.1, error: 112.875
Epoch: 9, lrate: 0.1, error: 111.473
Epoch: 10, lrate: 0.1, error: 110.24
Epoch: 11, lrate: 0.1, error: 109.148
Epoch: 12, lrate: 0.1, error: 108.172
Epoch: 13, lrate: 0.1, error: 107.294
Epoch: 14, lrate: 0.1, error: 106.501
Epoch: 15, lrate: 0.1, error: 105.78
Epoch: 16, lrate: 0.1, error: 105.122
Epoch: 17, lrate: 0.1, error: 104.518
Epoch: 18, lrate: 0.1, error: 103.963
Epoch: 19, lrate: 0.1, error: 103.45
Epoch: 20, lrate: 0.1, error: 102.975
Epoch: 21, lrate: 0.1, error: 102.533
Epoch: 22, lrate: 0.1, error: 102.122
Epoch: 23, lrate: 0.1, error: 101.739
Epoch: 24, lrate: 0.1, error: 101.38
Epoch: 25, lrate: 0.1, error: 101.044
...