# Logistic Regression

Logistic regression is a model used to classify samples. When the fitted model is plotted, the y-axis runs from 0 to 1 and gives the predicted probability that a sample belongs to the positive class. The S-shaped (sigmoid) curve is fitted by maximum likelihood and gives that probability at each x-value.
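The S-curve can be sketched directly with the logistic (sigmoid) function. This is a minimal illustration for a single feature. The weight `w` and intercept `b` are hypothetical values chosen for readability, not coefficients fitted to the heart-failure data.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for a single feature value x."""
    return sigmoid(w * x + b)

# Hypothetical weight and intercept, for illustration only
w, b = 1.5, -3.0

# Probabilities rise along the S-curve as x increases
probs = [predict_proba(x, w, b) for x in (0.0, 2.0, 4.0)]

# A sample is assigned to class 1 when its probability is at least 0.5
labels = [int(p >= 0.5) for p in probs]
print(probs, labels)
```

With these assumed values the curve crosses 0.5 at x = 2, which is where the predicted class switches from 0 to 1.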

In [123]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
data = pd.read_csv("heart.csv")

In [124]:
data.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [125]:
data = data.dropna()

In [126]:
data['DEATH_EVENT'].value_counts()

0    203
1     96
Name: DEATH_EVENT, dtype: int64

In [127]:
from sklearn.model_selection import train_test_split

In [128]:
x = data[['serum_creatinine', 'ejection_fraction']]
y = data['DEATH_EVENT']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
logReg = LogisticRegression()
logReg.fit(x_train, y_train)

LogisticRegression()

In [129]:
score = logReg.score(x_test, y_test)
score

0.7466666666666667

The cell above shows that the accuracy of our logistic regression model on the test set is 74.67%.

In [130]:
y_pred = logReg.predict(x_test)
print('Accuracy of logistic regression classifier on test set:', (logReg.score(x_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.7466666666666667


In [131]:
from sklearn.metrics import confusion_matrix
# Store the result under a new name so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[46  2]
 [17 10]]


The cell above shows a confusion matrix, in which rows are the true classes and columns are the predicted classes. The main diagonal counts the correct predictions and the off-diagonal entries count the incorrect ones. In this case, 46 + 10 = 56 predictions are correct, whereas 17 + 2 = 19 are incorrect.
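As a quick check, the same counts can be recomputed by hand. The sketch below copies the printed matrix into a plain nested list (the variable names are illustrative) and derives the accuracy from it.

```python
# Confusion matrix printed above: rows are true classes, columns are predicted classes
matrix = [[46, 2],
          [17, 10]]

# Correct predictions sit on the main diagonal
correct = sum(matrix[i][i] for i in range(len(matrix)))

# Every entry counts one test sample
total = sum(sum(row) for row in matrix)

incorrect = total - correct
accuracy = correct / total
print(correct, incorrect, round(accuracy, 4))
```

The resulting accuracy, 56 / 75, agrees with the score reported by `logReg.score(x_test, y_test)`.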

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.


The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall. It reaches its best value at 1 and its worst at 0. The beta parameter sets how much recall counts relative to precision: beta > 1 favors recall, beta < 1 favors precision, and beta = 1.0 means recall and precision are equally important (this is the F1 score).

The support is the number of occurrences of each class in y_test.
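These definitions can be applied by hand to the confusion matrix printed earlier. The helper below is an illustrative sketch, not part of scikit-learn. It takes the true positive, false positive and false negative counts for one class and returns the three metrics.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Compute precision, recall and F-beta from raw counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = beta ** 2
    fbeta = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, fbeta

# Class 1 (death event): 10 true positives, 2 false positives, 17 false negatives
p1, r1, f1 = precision_recall_fbeta(tp=10, fp=2, fn=17)

# Class 0 (no death event): treating 0 as "positive" swaps the roles of the off-diagonal counts
p0, r0, f0 = precision_recall_fbeta(tp=46, fp=17, fn=2)

print(round(p1, 2), round(r1, 2), round(f1, 2))
print(round(p0, 2), round(r0, 2), round(f0, 2))
```

The low recall for class 1 (10 of 27 found) shows that the model misses most death events even though its overall accuracy is about 75%.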


In [134]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      0.96      0.83        48
           1       0.83      0.37      0.51        27

    accuracy                           0.75        75
   macro avg       0.78      0.66      0.67        75
weighted avg       0.77      0.75      0.72        75

