Logistic Regression (3-class vs 2-class) Apply Normalization on Dataset Use this as a baseline.

### TODO:
1) Test on cleaned data (did uncleaned)
2) Test without all predictors (only necessary ones)
3) Apply other testing statistics:
      train time
      AIC -> low is good
      Confusion Matrix
      ROC curve later (read up on it)  

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')

In [2]:
# Step 1: Identify binary and continuous columns
binary_cols = [col for col in df.columns if sorted(df[col].dropna().unique()) == [0, 1]]
continuous_cols = [col for col in df.columns if col not in binary_cols]
# Step 2: Calculate Q1, Q2, Q3, IQR only for continuous columns
Q1 = df[continuous_cols].quantile(0.25)
Q2 = df[continuous_cols].quantile(0.50)  # Median
Q3 = df[continuous_cols].quantile(0.75)
IQR = Q3 - Q1

# Step 3: Remove outliers for continuous columns only
# (Binary columns are untouched)
filter_condition = ~((df[continuous_cols] < (Q1 - IQR)) | (df[continuous_cols] > (Q3 + IQR))).any(axis=1)
df_clean = df[filter_condition]

print("\nOriginal shape:", df.shape)
print("Cleaned shape:", df_clean.shape)


Original shape: (253680, 22)
Cleaned shape: (144834, 22)


In [4]:
# pre data cleaned model simple benchmark classification
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score
y = list(df["Diabetes_012"])

# omit target
X = df.to_numpy()
X = np.delete(X, 0,1)
print(X)

[[1. 1. 1. ... 9. 4. 3.]
 [0. 0. 0. ... 7. 6. 1.]
 [1. 1. 1. ... 9. 4. 8.]
 ...
 [0. 0. 1. ... 2. 5. 2.]
 [1. 0. 1. ... 7. 5. 1.]
 [1. 1. 1. ... 9. 6. 2.]]


Standardize All Features
Transform data, both categorical and continuous to be on same scale.. mean of 0 and standard deviation of 1... btwn [-1, 1]
Aka z-score normalization
Create Test and Training sets

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

standardize = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=42)

X_train = standardize.fit_transform(X_train)
X_test = standardize.fit_transform(X_test)

print("0:1:2 proportions in the test dataset")
print(y_test.count(0)/len(y_test)) #no diabetes
print(y_test.count(1)/len(y_test)) #prediabets
print(y_test.count(2)/len(y_test)) #diabetes

0:1:2 proportions in the test dataset
0.8432355723746452
0.01859035004730369
0.13817407757805109


### Baseline Logistic Regression Model (Multinomial)
- Including all 3 classes for classification

In [45]:
#Multi-class Regression with:
#lbfgs solver... Quasi-Newton Method
#class_weight = 'balanced'.. all classes treated equally

logistic_mulitnomial = LogisticRegression(
                    penalty='l2',
                    fit_intercept=True,
                    class_weight=None, # default... change for class imbalance
                    solver='lbfgs',
                   #multi_class = 'auto' -> function defaults to multinomial
                    max_iter=100,
                    random_state=0,
                ).fit(X_train, y_train)

y_pred = logistic_mulitnomial.predict(X_test)

print(f"Logistic Regression (unequal class weight) accuracy: {accuracy_score(y_test, y_pred)}")

Logistic Regression (unequal class weight) accuracy: 0.847839798170924


#### Description:
We can see that accuracy is roughly 85%, even with imbalanced class weighting. From our other model statistics, we know this to be the case because the model is almost always predicting no_diabetes (unequal weights). Therefore, this outcome doesn't actually prove that our model is performing well.

In [37]:
logistic_mulitnomial = LogisticRegression(
                    penalty='l2',
                    fit_intercept=True,
                    class_weight=None,
                    solver='saga',
                   #multi_class = 'auto' -> function defaults to multinomial
                    max_iter=100,
                    random_state=0,
                ).fit(X_train, y_train)

y_pred = logistic_mulitnomial.predict(X_test)

print(f"Logistic Regression accuracy: {accuracy_score(y_test, y_pred)}")

Logistic Regression accuracy: 0.847839798170924


Solvers 'lbfgs', 'newton-cg', 'sag', 'saga' all converge to same solution for class imbalance

In [38]:
#Multi-class regression with:
#lbfgs solver... Quasi-Newton Method
#class_weight = 'balanced'.. all classes treated equally

logistic_mulitnomial = LogisticRegression(
                    penalty='l2',
                    fit_intercept=True,
                    class_weight='balanced',
                    solver='lbfgs',
                   #multi_class = 'auto' -> function defaults to multinomial
                    max_iter=100,
                    random_state=0,
                ).fit(X_train, y_train)

y_pred = logistic_mulitnomial.predict(X_test)

print(f"Logistic Regression (lbfgs) accuracy: {accuracy_score(y_test, y_pred)}")

Logistic Regression (lbfgs) accuracy: 0.6465152948596657


#### Description:
As one can see, the accuracy dropped significantly compared to an unqual class weighting. In reality, this may not be bad, because it shows that the model is making an effort to predict class 1 or 2, not just class 0. Calculated below are the class weights, which illustrate a significant penalization for incorrectly classifying group_2 compared to no_diabetes.

In [24]:
# Weights for balanced classes
# Model is heavily penalized for incorrect group_2 classifications

import numpy as np
from collections import Counter

# Assume y_train is your training label array
class_counts = Counter(y_train)
n_samples = len(y_train)
n_classes = len(class_counts)

weights = {cls: n_samples / (n_classes * count) for cls, count in class_counts.items()}

print("Computed class weights (balanced):", weights)


Computed class weights (balanced): {0.0: 0.3958183804025589, 2.0: 2.3857352443290827, 1.0: 18.371958285052145}


In [44]:
# Weights for balanced classes
# solver = newton-cg

logistic_mulitnomial = LogisticRegression(
                    class_weight='balanced',
                    solver='newton-cg',
                    random_state=0,
                ).fit(X_train, y_train)

y_pred = logistic_mulitnomial.predict(X_test)

print(f"Logistic Regression (newton-cg) accuracy: {accuracy_score(y_test, y_pred)}")

logistic_mulitnomial = LogisticRegression(
                    class_weight='balanced',
                    solver='saga',
                    max_iter=100,
                    random_state=0,
                ).fit(X_train, y_train)

y_pred = logistic_mulitnomial.predict(X_test)

print(f"Logistic Regression (sag) accuracy: {accuracy_score(y_test, y_pred)}")

Logistic Regression (newton-cg) accuracy: 0.6463418479974772
Logistic Regression (sag) accuracy: 0.5932986439608956




#### Description:
Multinomial Logistic Regression (lbfgs) accuracy is about 65%, same for Newton-CG loss minimizer, while the Sag solver didn't converge, so for 100 iterations it's about 59%. Trying the saga solver, it also didn't converge and after 1000 iterations, only had about 53% accuracy