Logistic Regression (3-class vs 2-class)
Apply Normalization on Dataset
Use this as a baseline.

Vary:
1) optimization techniques
2) regularization (L2 vs L1 vs none) aka penalty
3) cleaned vs uncleaned data
4) all predictors vs only uncorrelated ones
Measure:
1) accuracy (correct/total)
2) train time
3) AIC -> low is good
4) Confusion Matrix
5) ROC curve later (read up on it)
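
The comparison plan above can be sketched as a small loop that fits one model per regularization setting and records accuracy, train time, AIC, and the confusion matrix. This is only an illustration: the data is synthetic (`make_classification`), and the `configs` grid and the `evaluate` helper are hypothetical names, not part of the notebook. AIC is computed as 2k - 2 ln(L) from the training log-likelihood, which is only an approximation once a penalty is applied.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 21 features, imbalanced binary target.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=21, n_informative=8,
    weights=[0.85, 0.15], random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Hypothetical grid: one entry per regularization setting to compare.
configs = {
    "l2": dict(penalty="l2", solver="lbfgs", C=1.0),
    "l1": dict(penalty="l1", solver="liblinear", C=1.0),
    # A very large C makes the L2 penalty negligible (approximately "none").
    "almost_none": dict(penalty="l2", solver="lbfgs", C=1e6),
}


def evaluate(params):
    """Fit one configuration and return the four measurements."""
    model = LogisticRegression(max_iter=1000, **params)
    start = time.perf_counter()
    model.fit(Xtr, ytr)
    train_time = time.perf_counter() - start

    y_hat = model.predict(Xte)
    # AIC = 2k - 2 ln(L), with ln(L) evaluated on the training data.
    n_params = model.coef_.size + model.intercept_.size
    log_lik = -log_loss(ytr, model.predict_proba(Xtr), normalize=False)
    return {
        "accuracy": accuracy_score(yte, y_hat),
        "train_time_s": train_time,
        "aic": 2 * n_params - 2 * log_lik,
        "confusion_matrix": confusion_matrix(yte, y_hat),
    }


results = {name: evaluate(params) for name, params in configs.items()}
for name, res in results.items():
    print(name, round(res["accuracy"], 4), round(res["aic"], 1))
```

Lower AIC is better only when the models are compared on the same training rows, so the cleaned and uncleaned datasets would need separate comparisons.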

In [2]:
!git pull origin main  

Updating 5406643..5949bf3
Fast-forward
 .gitignore          |    6 +-
 ljmurphy.ipynb      | 2238 --------------------------------------------
 murphy-5050-v.ipynb | 2552 +++++++++++++++++++++++++++++++++++++++++++++++++++
 murphy-base-v.ipynb |  122 +++
 4 files changed, 2678 insertions(+), 2240 deletions(-)
 delete mode 100644 ljmurphy.ipynb
 create mode 100644 murphy-5050-v.ipynb
 create mode 100644 murphy-base-v.ipynb


From https://github.com/LJMurphyy/diabetes-risk-predictor
 * branch            main       -> FETCH_HEAD
   5406643..5949bf3  main       -> origin/main


In [18]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

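# The cells below use the 3-class target column Diabetes_012 (253,680 rows),
# which is in the 3-class file rather than the 50/50 binary split.
df = pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')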

## Remove Outliers for Continuous Predictors

In [19]:
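# Step 1: Identify binary and continuous columns
binary_cols = [col for col in df.columns if sorted(df[col].dropna().unique()) == [0, 1]]
# NOTE: every column that is not strictly 0/1 lands here, including ordinal
# columns and the 3-valued target Diabetes_012. About 84% of rows are class 0,
# so Q1 = Q3 = 0 for the target and the filter below drops every class 1 or 2 row.
continuous_cols = [col for col in df.columns if col not in binary_cols]

# Step 2: Calculate Q1, Q2, Q3, IQR only for continuous columns
Q1 = df[continuous_cols].quantile(0.25)
Q2 = df[continuous_cols].quantile(0.50)  # Median (computed for reference, not used below)
Q3 = df[continuous_cols].quantile(0.75)
IQR = Q3 - Q1

# Step 3: Remove outliers for continuous columns only
# (Binary columns are untouched)
# The fences are Q1 - 1.0*IQR and Q3 + 1.0*IQR, stricter than the conventional 1.5*IQR.
filter_condition = ~((df[continuous_cols] < (Q1 - IQR)) | (df[continuous_cols] > (Q3 + IQR))).any(axis=1)
df_clean = df[filter_condition]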

print("\nOriginal shape:", df.shape)
print("Cleaned shape:", df_clean.shape)


Original shape: (253680, 22)
Cleaned shape: (144834, 22)
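
To make the effect of the IQR rule concrete, here is a minimal sketch on a toy frame. The `iqr_filter` helper, its `k` parameter, and the toy values are all hypothetical. It shows the difference between filtering on a feature alone and accidentally including an imbalanced label among the filtered columns.

```python
import pandas as pd


def iqr_filter(frame, cols, k=1.5):
    """Keep rows whose values in `cols` all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[cols].quantile(0.25)
    q3 = frame[cols].quantile(0.75)
    iqr = q3 - q1
    outside = (frame[cols] < q1 - k * iqr) | (frame[cols] > q3 + k * iqr)
    return frame[~outside.any(axis=1)]


# Toy data (assumption): a label that is mostly zero and one extreme BMI value.
toy = pd.DataFrame({
    "target": [0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    "bmi":    [22, 23, 24, 25, 26, 27, 28, 29, 30, 95],
})

# Filtering on the feature alone removes only the extreme BMI row.
features_only = iqr_filter(toy, ["bmi"])
print(features_only.shape)

# Including the label makes its IQR zero, so every non-zero class is dropped.
with_target = iqr_filter(toy, ["target", "bmi"])
print(with_target["target"].unique())
```

Passing only genuine feature columns to the filter keeps the minority classes in the cleaned data.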


### Combine classes to binary classification 
- Create numpy array containing all 21 features

In [20]:
import numpy as np

# Proportions of the three classes in Diabetes_012
y = list(df["Diabetes_012"])
length = len(y)
print(y.count(0) / length)  # 0 = no diabetes
print(y.count(1) / length)  # 1 = pre-diabetes
print(y.count(2) / length)  # 2 = diabetes

# Simple benchmark on the uncleaned data:
# merge pre-diabetes (1) and diabetes (2) into a single positive class
y = [1 if label in (1, 2) else 0 for label in y]

# Drop the target (column 0) so X holds only the 21 features
X = df.to_numpy()
X = np.delete(X, 0, axis=1)
print(X)

0.8424116997792495
0.018255282245348472
0.13933301797540207
[[1. 1. 1. ... 9. 4. 3.]
 [0. 0. 0. ... 7. 6. 1.]
 [1. 1. 1. ... 9. 4. 8.]
 ...
 [0. 0. 1. ... 2. 5. 2.]
 [1. 0. 1. ... 7. 5. 1.]
 [1. 1. 1. ... 9. 6. 2.]]


### Standardize All Features
- Transform all features, categorical and continuous alike, to a common scale with mean 0 and standard deviation 1. Unlike min-max scaling, this does not confine values to a fixed range such as [-1, 1].
- Aka z-score normalization
- Create Test and Training sets

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score

standardize = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=42)

X_train = standardize.fit_transform(X_train)
# Reuse the training mean/std on the test set; refitting here would leak test statistics
X_test = standardize.transform(X_test)

print("0:1 proportions in the test dataset")
print(y_test.count(0)/len(y_test))
print(y_test.count(1)/len(y_test))

0:1 proportions in the test dataset
0.8432355723746452
0.15676442762535478
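
As a quick check of what z-score standardization does, the sketch below fits a `StandardScaler` on training rows only and reuses it on the test rows. The data is synthetic and the variable names are made up for illustration. It also shows that standardized values have mean 0 and standard deviation 1 on the training set but are not limited to [-1, 1].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): one skewed continuous column, one binary column.
X_demo = np.column_stack([
    rng.exponential(scale=5.0, size=1000),
    rng.integers(0, 2, size=1000),
]).astype(float)

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)  # learn mean/std from the training rows
X_te_std = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
print(X_tr_std.min(), X_tr_std.max())  # z-scores can fall well outside [-1, 1]
```

The test set will generally not have exactly zero mean after this transform, which is expected: it is scaled with the training statistics.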


### Baseline Logistic Regression Model (Multinomial)
- Including all 3 classes for classification

In [17]:

logistic_baseline = LogisticRegression(
                    penalty='l2',
                    dual=False,
                    tol=0.0001,
                    C=1.0,
                    fit_intercept=True,
                    intercept_scaling=1,
                    class_weight=None,
                    solver='lbfgs',
                    max_iter=100,
                    multi_class='auto',
                    random_state=None,
                    n_jobs=None,
                    l1_ratio=None
                )
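
The cell above only constructs the estimator. A minimal sketch of fitting a three-class model might look like the following. It uses synthetic data with roughly the class proportions printed earlier, since the binary `y` built in this notebook no longer carries three classes. The names `X3`, `y3`, and `model3` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in (assumption), imbalanced roughly like 0.84 / 0.02 / 0.14.
X3, y3 = make_classification(
    n_samples=3000, n_features=21, n_informative=8, n_classes=3,
    weights=[0.84, 0.02, 0.14], random_state=0,
)
X3_tr, X3_te, y3_tr, y3_te = train_test_split(
    X3, y3, test_size=0.25, random_state=42, stratify=y3
)

scaler3 = StandardScaler()
X3_tr = scaler3.fit_transform(X3_tr)
X3_te = scaler3.transform(X3_te)

# With the lbfgs solver and more than two classes, scikit-learn fits a
# multinomial (softmax) model over all classes.
model3 = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X3_tr, y3_tr)

proba = model3.predict_proba(X3_te)  # one probability column per class
print(model3.classes_, proba.shape)
print("3-class accuracy:", model3.score(X3_te, y3_te))
```

With a class as rare as pre-diabetes, the model may rarely or never predict it, so per-class metrics are more informative here than overall accuracy.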

### Binomial Logistic Regression
- One that Varsha did

In [25]:
logistic_binomial = LogisticRegression(
                    penalty='l2',
                    class_weight=None,
                    solver='lbfgs',
                    max_iter=100,
                    multi_class='auto',
                    random_state=0,
                    n_jobs=None,
                    l1_ratio=None).fit(X_train, y_train)

y_pred = logistic_binomial.predict(X_test)

print(f"Logistic Regression accuracy: {accuracy_score(y_test, y_pred)}")



Logistic Regression accuracy: 0.8489120151371807
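
Since about 84.3% of the test rows are class 0, an accuracy of 84.9% is only slightly above always predicting "no diabetes". A hedged sketch of how to put the number in context is shown below, comparing against a majority-class baseline and computing the ROC curve and AUC. The data is synthetic, so the values it prints are not the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 21 features, about 84% negatives.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=21, n_informative=8,
    weights=[0.84, 0.16], random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Baseline that always predicts the most frequent training class.
majority = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

acc_majority = accuracy_score(yte, majority.predict(Xte))
acc_model = accuracy_score(yte, clf.predict(Xte))

# ROC uses the predicted probability of the positive class, not hard labels.
scores = clf.predict_proba(Xte)[:, 1]
fpr, tpr, thresholds = roc_curve(yte, scores)
auc = roc_auc_score(yte, scores)

print(f"majority baseline accuracy: {acc_majority:.4f}")
print(f"logistic regression accuracy: {acc_model:.4f}")
print(f"ROC AUC: {auc:.4f}")
```

Plotting `fpr` against `tpr` with matplotlib gives the ROC curve itself. AUC is threshold-independent, which makes it a more useful summary than accuracy on an imbalanced target.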
