In [None]:
# LOGISTIC REGRESSION ALGORITHM
#  ----------------------------

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv('diabetes.csv')

# Separate features (X) and target (y)
x = data.drop('Outcome', axis=1)  # All columns except 'Outcome'
y = data['Outcome']  # Only the 'Outcome' column
# print(x)
# print(y)

# Split the data into training and test sets. test_size sets the fraction of rows held out for testing (0.2 = 20%)
# random_state seeds the shuffle so the split is reproducible. Without it the code still runs, but each run gives a different split and different scores
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=51)

# Standardize the features (mean=0, variance=1)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # Fit and transform the training data
x_test = scaler.transform(x_test)  # Only transform the test data
# print(x_train)
# print(x_test)

# Initialize the Logistic Regression model
logreg = LogisticRegression()

# Train the model using the training data
logreg.fit(x_train, y_train)

# Make predictions on the test data
y_pred = logreg.predict(x_test)

# Evaluate the model
# Print classification report
print("1. Classification Report:\n", classification_report(y_test, y_pred))
print("----------------------------------------------------------------\n")

# Graph the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens', 
            xticklabels=['Predicted Negative', 'Predicted Positive'], 
            yticklabels=['Actual Negative', 'Actual Positive'])
plt.title('2. Confusion Matrix')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()
print("----------------------------------------------------------------\n")

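# Print ROC AUC score, computed from the predicted probability of the positive class
# (hard 0/1 predictions would evaluate the model at a single threshold only)
y_prob = logreg.predict_proba(x_test)[:, 1]
print("3. ROC AUC Score:\n", roc_auc_score(y_test, y_prob))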


----------------------------------------------------------------------------------------------------------------------------
The script relies mainly on functions from the scikit-learn (sklearn) library.

The dataset is first loaded via pandas, and the features and the outcome column are separated into the variables x and y respectively. The data is then split into two: the test_size argument determines the fraction used for testing (here 0.2, i.e. 20%), whilst the remainder goes towards training. Fixing random_state seeds the shuffle, so the same split is produced on every run. The features are then standardized with StandardScaler() so that each has mean 0 and variance 1; the scaler is fitted on the training data only and then applied to the test data, which keeps information about the test set out of training.
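
As a rough illustration of the split and scaling steps, the sketch below uses a small made-up array in place of diabetes.csv (x_demo, y_demo and scaler_demo are assumptions for the example, not part of the script above). It checks that test_size=0.2 holds out 20% of the rows, that a fixed random_state reproduces the same split, and that the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for diabetes.csv: 10 rows, 2 features
x_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 holds out 20% of the rows (2 of 10) for testing
split_a = train_test_split(x_demo, y_demo, test_size=0.2, random_state=51)
split_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=51)
x_tr, x_te, y_tr, y_te = split_a

# The same random_state gives the same test rows on every call
same_split = np.array_equal(split_a[1], split_b[1])

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
x_tr_scaled = scaler_demo.fit_transform(x_tr)
x_te_scaled = scaler_demo.transform(x_te)

print("train/test shapes:", x_tr.shape, x_te.shape)
print("split reproducible:", same_split)
print("train mean per feature:", x_tr_scaled.mean(axis=0).round(6))
print("train std per feature:", x_tr_scaled.std(axis=0).round(6))
```

Note that only the training rows end up with mean 0 and variance 1 exactly; the test rows are scaled with the training statistics, so their mean and variance will usually differ slightly.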

The model is then initialized with LogisticRegression() and trained on the training data, and predictions are made on the test features. The predictions are then evaluated by comparing them to y_test (the actual target values), and the final results are printed.
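
To make the prediction step more concrete, here is a minimal sketch on synthetic data generated with make_classification (an assumption for illustration; it is not the diabetes dataset, and the *_syn names are hypothetical). It suggests that, for a binary problem, the 0/1 labels returned by predict() correspond to thresholding the positive-class probability from predict_proba() at 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption for this sketch)
x_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_syn, y_syn, test_size=0.2, random_state=51
)

# Same preprocessing order as the script: fit on train, transform test
scaler_syn = StandardScaler()
x_tr = scaler_syn.fit_transform(x_tr)
x_te = scaler_syn.transform(x_te)

# Train, then predict on the held-out features
model_syn = LogisticRegression()
model_syn.fit(x_tr, y_tr)

hard_labels = model_syn.predict(x_te)            # 0/1 class labels
pos_proba = model_syn.predict_proba(x_te)[:, 1]  # P(class == 1)
thresholded = (pos_proba > 0.5).astype(int)      # manual 0.5 cut-off

print("labels match 0.5 threshold:", (hard_labels == thresholded).all())
```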

In summary, features are the inputs the model uses to learn patterns, and the target is the output the model aims to predict.

There are three outputs.


1. Classification Report
Provides a breakdown of the two classes: tested negative (0), and tested positive (1) for diabetes.

First, the report outputs four values for each class:
1a. Precision == The ratio of correctly predicted positive observations to the total predicted positives, TP / (TP + FP).
1b. Recall == The ratio of correctly predicted positive observations to all observations that actually belong to the class, TP / (TP + FN).
1c. F1-Score == The harmonic mean of Precision and Recall, 2 * Precision * Recall / (Precision + Recall).
1d. Support == The number of actual occurrences of the class in the evaluated data (here, the test set).

Second, the accuracy: the proportion of correctly predicted instances out of the total number of instances.
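
The report values can be checked by hand. The sketch below uses ten made-up labels (y_true_demo and y_pred_demo are assumptions, not results from the diabetes data), counts the true/false positives and negatives for the positive class, and compares the hand-computed values with the corresponding sklearn functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels for illustration: 4 actual positives, 6 actual negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = tp + fn                                   # actual positives
accuracy = (tp + tn) / len(pairs)

print("precision:", round(precision, 4), precision_score(y_true_demo, y_pred_demo))
print("recall:   ", round(recall, 4), recall_score(y_true_demo, y_pred_demo))
print("f1:       ", round(f1, 4), f1_score(y_true_demo, y_pred_demo))
print("support:  ", support)
print("accuracy: ", accuracy, accuracy_score(y_true_demo, y_pred_demo))
```

With these labels the F1-Score (4/7) lies below the plain average of Precision (2/3) and Recall (1/2), which is the expected behaviour of a harmonic mean.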


2. Confusion Matrix
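Shows the number of correct and incorrect predictions the model made for each class. The output is a 2x2 matrix: rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions.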
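
As a small sketch of how to read the matrix, the example below builds one from ten made-up labels (y_true_demo and y_pred_demo are assumptions for illustration) and unpacks the four cells. In sklearn's layout the rows are the actual classes and the columns the predicted classes, both ordered 0 then 1.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 4 actual positives, 6 actual negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[5 1]
#  [2 2]]

# Flattening row by row gives: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print("correct:", tn + tp, "incorrect:", fp + fn)
```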


3. ROC AUC Score
The ROC (Receiver Operating Characteristic) AUC (Area Under the Curve) score is a performance measurement for classification problems across all decision thresholds. It measures how well the model distinguishes between the classes: 0.5 corresponds to random guessing and 1.0 to perfect separation. As a common rule of thumb, a score of 0.7 and above is considered acceptable.
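
One way to build intuition for the score: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses four made-up scores (y_true_demo, scores_demo and the helper pairwise_auc are assumptions for illustration) and also shows that passing hard 0/1 labels instead of scores can give a different, usually less informative, value.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up example: two negatives followed by two positives
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.45, 0.8]                 # predicted P(class == 1)
labels_demo = [int(s >= 0.5) for s in scores_demo]  # hard labels: [0, 0, 0, 1]

def pairwise_auc(y_true, scores):
    """Hypothetical helper: share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

auc_scores = roc_auc_score(y_true_demo, scores_demo)
auc_labels = roc_auc_score(y_true_demo, labels_demo)

print("AUC from scores:", auc_scores, pairwise_auc(y_true_demo, scores_demo))
print("AUC from labels:", auc_labels, pairwise_auc(y_true_demo, labels_demo))
```

Here every positive is scored above every negative, so the score-based AUC is 1.0, while the thresholded labels lose that ranking information and give 0.75.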