In [36]:
# Importing the pandas and numpy packages.
import pandas as pd
import numpy as np

In [37]:
# Reading in the data set.
df = pd.read_csv('df.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,general_health,physical_health_days,mental_health_days,has_health_plan,meets_aerobic_guidelines,physical_activity_150min,muscle_strengthening,high_blood_pressure,high_cholesterol,...,height_inches,bmi,education_level,income_group,smoking_status,alcohol_consumption,binge_drinking,heavy_drinking,diabetes_status,difficulty_walking
0,4,4.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,...,68.0,25.85,3.0,5.0,4.0,1.0,1.0,1.0,1.0,1.0
1,8,3.0,5.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,...,64.0,33.47,3.0,4.0,4.0,0.0,1.0,1.0,3.0,1.0
2,9,3.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,...,70.0,22.96,2.0,5.0,3.0,0.0,1.0,1.0,1.0,0.0
3,10,3.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,...,71.0,41.84,3.0,5.0,4.0,0.0,1.0,1.0,3.0,0.0
4,12,2.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,...,68.0,29.65,4.0,6.0,3.0,1.0,1.0,1.0,3.0,1.0


In [38]:
# Deleting an unnecessary column and recoding all non-diabetic entries (coded 3) as 0.
del df['Unnamed: 0']
df['diabetes_status'] = df['diabetes_status'].replace(3,0)

### What scoring metric should be used

There are four metrics that I will consider and discuss:

1. Accuracy
2. Precision
3. Recall
4. F<sub>1</sub> score

* Accuracy is the number of correct predictions divided by the total number of predictions. However, accuracy combines predictions for both classes, so it does not indicate whether the model performs better on one class than the other. With an imbalanced target variable, accuracy can be misleading: a model that predicts the majority class for almost every individual can still achieve high accuracy while identifying very few members of the minority class.
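
To make the accuracy problem concrete, here is a minimal sketch using hypothetical labels (850 non-diabetic, 150 diabetic; these counts are an assumption and are not taken from the data set). A "model" that always predicts non-diabetic looks accurate but finds no diabetic individuals.

```python
# Hypothetical labels: 850 non-diabetic (0) and 150 diabetic (1) individuals.
y_true = [0] * 850 + [1] * 150

# A trivial "model" that predicts non-diabetic for everyone.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the diabetic class: share of diabetic individuals that were found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)

print('accuracy:', accuracy)
print('recall for the diabetic class:', recall)
```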

The other three metrics—precision, recall, and F<sub>1</sub> score—are class-dependent. In other words, they require specifying the positive label. Since the main objective of this project is to predict whether an individual is diabetic, we will define our positive label as 1 (representing diabetic individuals).

* Precision is the number of true positives (correctly predicted diabetic individuals) divided by the sum of true positives and false positives (non-diabetic individuals incorrectly predicted as diabetic) (precision_score, n.d.).

* Recall is the number of true positives divided by the sum of true positives and false negatives (diabetic individuals incorrectly predicted as non-diabetic) (recall_score, n.d.).

* F<sub>1</sub> score is the harmonic mean of precision and recall. Equivalently, it is two times the number of true positives divided by the sum of two times the number of true positives, the number of false positives, and the number of false negatives (f1_score, n.d.).
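
The three definitions above can be sketched directly from confusion-matrix counts. The counts below are hypothetical and chosen only for illustration, and `precision_recall_f1` is a helper name introduced here, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Hypothetical counts: 60 true positives, 90 false positives, 40 false negatives.
tp, fp, fn = 60, 90, 40
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# The count-based formula agrees with the harmonic mean of precision and recall.
harmonic_mean = 2 * precision * recall / (precision + recall)

print('precision:', round(precision, 2))
print('recall:', round(recall, 2))
print('f1:', round(f1, 2))
```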

In a medical context, there are pros and cons to the above metrics. Focusing on precision would minimize the number of false positives, reducing the need for unnecessary resources. On the other hand, prioritizing recall would minimize false negatives, allowing the model to correctly identify as many diabetic individuals as possible. The F<sub>1</sub> score combines both recall and precision, offering a balanced view of performance.
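
One way to see the trade-off between precision and recall is to vary the cutoff applied to a model's predicted probabilities. The scores and labels below are hypothetical, and the notebook itself does not tune a cutoff; this is only an illustration of why the two metrics pull in opposite directions.

```python
# Hypothetical predicted probabilities of being diabetic, with the true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

def precision_recall_at(threshold):
    """Precision and recall for the positive class at a given cutoff."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))
    return tp / (tp + fp), tp / (tp + fn)

# A higher cutoff predicts fewer positives: precision rises while recall falls.
results = {t: precision_recall_at(t) for t in (0.25, 0.55, 0.85)}
for t, (p, r) in results.items():
    print(f'threshold={t}: precision={p:.2f}, recall={r:.2f}')
```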

Since both false positives and false negatives have significant consequences in medical predictions, the F<sub>1</sub> score will be the focal point of this evaluation. By balancing precision and recall, the F<sub>1</sub> score ensures that the model performs well across both metrics, making it a more comprehensive measure of its effectiveness in predicting diabetic individuals while minimizing potential risks.

## Data Preprocessing

In [39]:
# Importing train test split function
from sklearn.model_selection import train_test_split

In [40]:
# Creating separate diabetes status variable, and deleting diabetes status variable from original dataset.
df_target = df['diabetes_status']
del df['diabetes_status']

In [41]:
# Splitting up the data set into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(df, df_target, test_size=0.30, random_state=22)
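
With an imbalanced target, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The split above does not use it. The sketch below runs on a hypothetical synthetic target (15% positives) rather than the notebook's data, so it only illustrates the option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 150 positives and 850 negatives.
y = np.array([1] * 150 + [0] * 850)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder single feature

# stratify=y preserves the share of positives in the training and testing sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=22, stratify=y
)

print('positive share in train:', y_tr.mean())
print('positive share in test:', y_te.mean())
```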

In [42]:
# Importing preprocessing functions. 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer

In [43]:
# Creating initial pipeline.
pipe1 = Pipeline([
    ('log', FunctionTransformer(func=np.log1p)),
    ('scaler', StandardScaler()),
    ])

# Creating lists of the continuous and discrete variables.
cont = ['age','height_inches','bmi']
disc = ['general_health', 'physical_health_days', 'mental_health_days',
       'has_health_plan', 'meets_aerobic_guidelines',
       'physical_activity_150min', 'muscle_strengthening',
       'high_blood_pressure', 'high_cholesterol', 'heart_disease',
       'lifetime_asthma', 'arthritis', 'sex', 
       'education_level', 'income_group', 'smoking_status',
       'alcohol_consumption', 'binge_drinking', 'heavy_drinking',
       'difficulty_walking']

# Creating column transformer to send all continuous variables to pipe1 and scale all the discrete variables. 
ct = ColumnTransformer([
    ('num', pipe1, cont),
    ('disc', StandardScaler(), disc),
])
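
As a small illustration of what the continuous-variable pipeline does, the sketch below applies the same two steps (`np.log1p`, then `StandardScaler`) to a hypothetical BMI column and checks the result against the manual calculation. The values are made up for the example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Same two steps as pipe1: log(1 + x), then standardization.
pipe = Pipeline([
    ('log', FunctionTransformer(func=np.log1p)),
    ('scaler', StandardScaler()),
])

# Hypothetical right-skewed BMI values (one column).
bmi = np.array([[18.5], [22.0], [25.0], [30.0], [60.0]])
out = pipe.fit_transform(bmi)

# Manual equivalent: StandardScaler uses the population standard deviation (ddof=0).
logged = np.log1p(bmi)
manual = (logged - logged.mean(axis=0)) / logged.std(axis=0)

print(np.allclose(out, manual))
```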

In [44]:
# Fitting ct with the training predictors, transforming the training predictors using ct, and transforming the testing predictors with ct.
X_train1 = ct.fit_transform(X_train)
X_test1 = ct.transform(X_test)

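# Creating data sets with newly transformed predictor data.
# ColumnTransformer outputs columns in the order the transformers are listed (cont, then disc),
# so the labels must follow that order rather than X_train.columns.
X_train1 = pd.DataFrame(X_train1, columns=cont + disc)
X_test1 = pd.DataFrame(X_test1, columns=cont + disc)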
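
`ColumnTransformer` concatenates its outputs in the order the transformers are listed, not in the order of the input columns, so labels for the transformed array are safest taken from `cont + disc`. A minimal sketch with a hypothetical three-column frame (the values are invented for the example):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame whose column order differs from the transformer order.
toy = pd.DataFrame({
    'general_health': [1.0, 2.0, 3.0, 4.0],
    'age': [30.0, 35.0, 50.0, 65.0],
    'bmi': [22.0, 27.0, 31.0, 35.0],
})
cont = ['age', 'bmi']
disc = ['general_health']

ct = ColumnTransformer([
    ('num', StandardScaler(), cont),
    ('disc', StandardScaler(), disc),
])
arr = ct.fit_transform(toy)

# Labels that match the output order of the transformer.
labelled = pd.DataFrame(arr, columns=cont + disc)

# Column 0 of the output is standardized age, not general_health.
age_scaled = (toy['age'] - toy['age'].mean()) / toy['age'].std(ddof=0)
gh_scaled = (toy['general_health'] - toy['general_health'].mean()) / toy['general_health'].std(ddof=0)

print(np.allclose(arr[:, 0], age_scaled))
print(np.allclose(arr[:, 0], gh_scaled))
```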

In [45]:
print('Dimensions of Predictor Training Set:', X_train1.shape)
print('Dimensions of Predictor Testing Set:', X_test1.shape)

Dimensions of Predictor Training Set: (152586, 23)
Dimensions of Predictor Testing Set: (65395, 23)


## Model Building

In [46]:
# Importing f1_score function.
from sklearn.metrics import f1_score

### Logistic Regression

In [47]:
# Importing Logistic Regression function
from sklearn.linear_model import LogisticRegression

In [48]:
# Creating Logistic Regression model.
lg = LogisticRegression()

# Fitting lg with training data.
lg.fit(X_train1, y_train)

# Calculating the f1_score on the training data for lg.
pred_target = lg.predict(X_train1)
f1_train = np.round(f1_score(y_train, pred_target),2)
print('f1_score on training data:', f1_train)

# Calculating the f1_score on the testing data for lg.
pred_target = lg.predict(X_test1)
f1_test = np.round(f1_score(y_test, pred_target),2)
print('f1_score on testing data:', f1_test)

f1_score on training data: 0.28
f1_score on testing data: 0.28


This is not a strong start, but it serves as a baseline: the model uses default parameters with no tuning.

In [49]:
# Creating LogisticRegression model with class_weight set to balanced.
lg = LogisticRegression(class_weight='balanced')

# Fitting lg with training data.
lg.fit(X_train1, y_train)

# Calculating f1_score on training data.
pred_target = lg.predict(X_train1)
f1_train = np.round(f1_score(y_train, pred_target),2)
print('f1_score on training data:', f1_train)

# Calculating f1_score on testing data. 
pred_target = lg.predict(X_test1)
f1_test = np.round(f1_score(y_test, pred_target),2)
print('f1_score on testing data:', f1_test)

f1_score on training data: 0.46
f1_score on testing data: 0.46


Setting the class_weight parameter to 'balanced' increased the f1_score on the testing data by .18, from .28 to .46.
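
For reference, scikit-learn documents the 'balanced' setting as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a hypothetical target (850 negatives, 150 positives; the counts are an assumption). The resulting ratio between the two weights equals the ratio of negatives to positives. The value 5.69 that appears in the later parameter grids may be that ratio for the training data, but the notebook does not say so.

```python
import numpy as np

# Hypothetical imbalanced target: 850 negatives and 150 positives.
y = np.array([0] * 850 + [1] * 150)

# 'balanced' weights: n_samples / (n_classes * count of each class).
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
balanced = dict(zip(classes.tolist(), weights.tolist()))

# The positive class is up-weighted by the negative-to-positive ratio.
ratio = balanced[1] / balanced[0]

print(balanced)
print('weight ratio:', round(ratio, 2))
```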

In [50]:
# Importing grid search cv function, stratified fold function for cv, and make scorer function for custom scoring metric
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer

In [51]:
# Creating dictionary for different parameter possibilities. 
param_dist = {
    'C': [0.001, .01, .1, .5],
    'class_weight': [{0: 1, 1: w} for w in [1, 2, 3, 5.69]]
}

# Specifying the initial Logistic Regression model.
lg = LogisticRegression(random_state=42)
# Creating custom scoring metric with diabetic (1) as the positive label.
f1 = make_scorer(f1_score, pos_label=1)
# Creating the GridSearchCV object with stratified 4-fold cross-validation.
grid = GridSearchCV(lg, param_dist, cv=StratifiedKFold(n_splits=4), scoring=f1, verbose=1)
# Fitting grid with the training data.
grid.fit(X_train1, y_train)

Fitting 4 folds for each of 16 candidates, totalling 64 fits
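
The candidate count in the message above follows from the grid: four values of C times four class weights gives 16 combinations, each fitted once per fold. A minimal sketch that reproduces the arithmetic with scikit-learn's `ParameterGrid` (the name `param_grid` is used here only for the example):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {
    'C': [0.001, .01, .1, .5],
    'class_weight': [{0: 1, 1: w} for w in [1, 2, 3, 5.69]],
}

# Every combination of the listed values is one candidate.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 4

print('candidates:', n_candidates)
print('total fits:', n_candidates * n_folds)
```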


In [52]:
print(grid.best_params_)

{'C': 0.001, 'class_weight': {0: 1, 1: 3}}


In [53]:
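# Taking the best estimator found by the grid search.
lg = grid.best_estimator_
# Refitting on the training data (GridSearchCV already refits the best estimator by default, so this is redundant but harmless).
lg.fit(X_train1, y_train)
# Calculating f1_score on training data.
pred_target = lg.predict(X_train1)
f1_train = np.round(f1_score(y_train, pred_target),2)
print('f1_score on training data:', f1_train)
# Calculating f1_score on testing data.
pred_target = lg.predict(X_test1)
f1_test = np.round(f1_score(y_test, pred_target),2)
print('f1_score on testing data:', f1_test)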

f1_score on training data: 0.47
f1_score on testing data: 0.47


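I expected a larger improvement, but the grid search raised the f1_score on the testing data by only .01 over the balanced model. Note that an f1_score below .5 does not mean the model is worse than a coin flip: F<sub>1</sub> is not accuracy, and random guessing on a target where the positive class is the minority also scores below .5.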
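
As a rough check on the coin-flip comparison, the F<sub>1</sub> score of a random guesser can be approximated from the class prevalence alone: in a large sample its precision is close to the prevalence and its recall is close to the rate at which it predicts positive. The prevalence of 0.15 below is a hypothetical value, not one measured from this data set, and `expected_random_f1` is a helper name introduced for the sketch.

```python
def expected_random_f1(prevalence, positive_rate=0.5):
    """Large-sample approximation of F1 for a guesser that predicts positive at random."""
    precision = prevalence    # share of predicted positives that are truly positive
    recall = positive_rate    # share of true positives the guesser happens to flag
    return 2 * precision * recall / (precision + recall)

# ASSUMPTION: 15% of individuals are diabetic (hypothetical prevalence).
coin_flip_f1 = expected_random_f1(prevalence=0.15)

print('approximate f1_score of a coin flip:', round(coin_flip_f1, 2))
```

Under that assumption the coin flip lands well below the .47 reported above.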

### Random Forest Classifier

In [54]:
# Importing Random Forest Classifier function.
from sklearn.ensemble import RandomForestClassifier

In [55]:
# Creating Random Forest Classifier model with random state set to 42 for reproducibility.
rfc = RandomForestClassifier(random_state=42)

# Fitting rfc with training data. 
rfc.fit(X_train1, y_train)

# Calculating the f1_score on the training data for rfc.
pred_target = rfc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating the f1_score on the testing data for rfc.
pred_target = rfc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 1.0
f1_score on testing data: 0.26


The model is overfitting: the f1_score drops from 1.0 on the training data to .26 on the testing data.
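
A common reason for a perfect training score is that fully grown trees can memorize the training set. The sketch below compares an unrestricted forest with a depth-limited one on hypothetical synthetic data from `make_classification`. It does not use the notebook's data, and the exact scores it prints depend on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical noisy, imbalanced data (about 15% positives before label noise).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.85, 0.15], flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

# Train/test f1_score for an unrestricted forest (None) and a shallow one (3).
results = {}
for depth in (None, 3):
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    rf.fit(X_tr, y_tr)
    results[depth] = (f1_score(y_tr, rf.predict(X_tr), pos_label=1),
                      f1_score(y_te, rf.predict(X_te), pos_label=1))
    print('max_depth =', depth, '| train/test f1_score:', results[depth])
```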

In [56]:
# Creating random forest classifier with random state and class_weight set to balanced.
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')

# Fitting rfc with training data. 
rfc.fit(X_train1, y_train)

# Calculating f1_score on training data for rfc.
pred_target = rfc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_Score on testing data for rfc.
pred_target = rfc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 1.0
f1_score on testing data: 0.23


Setting the class_weight parameter to 'balanced' lowered the testing f1_score from .26 to .23, and the model still overfits.

In [57]:
# Importing randomized search cv function.
from sklearn.model_selection import RandomizedSearchCV

In [58]:
# Creating parameter distributions for the randomized search.
param_dist = [{
    'n_estimators': [10, 30, 100],
    'max_depth': [5, 10],
    'max_features': ['sqrt', 'log2'],
    'class_weight': [{0:1, 1:w} for w in [1, 3, 5.69]] + [None]
}]

# Specifying the initial Random Forest Classifier model.
rfc = RandomForestClassifier(random_state=22)

# Creating custom scoring metric.
f1 = make_scorer(f1_score, pos_label=1)
# Creating the RandomizedSearchCV object (n_iter defaults to 10, so 10 candidates are sampled).
grid = RandomizedSearchCV(rfc, param_dist, cv=StratifiedKFold(n_splits=4), scoring=f1, verbose=1, random_state=42)
# Fitting the randomized search with the training data.
grid.fit(X_train1, y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits
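
Unlike the grid search, the randomized search tries only a sample of the possible combinations: `n_iter` defaults to 10, which explains the 10 candidates in the message above. A minimal sketch using scikit-learn's `ParameterGrid` and `ParameterSampler` to show the size of the full grid next to the size of the sample (the sampled combinations here are not guaranteed to match the ones the search above drew):

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same parameter lists as in the randomized search above.
param_dist = {
    'n_estimators': [10, 30, 100],
    'max_depth': [5, 10],
    'max_features': ['sqrt', 'log2'],
    'class_weight': [{0: 1, 1: w} for w in [1, 3, 5.69]] + [None],
}

# Size of the full grid versus the number of sampled candidates.
full_grid_size = len(ParameterGrid(param_dist))
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

print('combinations in the full grid:', full_grid_size)
print('candidates sampled:', len(sampled))
```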


In [59]:
print(grid.best_params_)

{'n_estimators': 30, 'max_features': 'sqrt', 'max_depth': 10, 'class_weight': {0: 1, 1: 3}}


In [60]:
# Creating random forest classifier from the best estimator of the randomized search (stored in grid).
rfc = grid.best_estimator_

# Fitting rfc with training data.
rfc.fit(X_train1, y_train)

# Calculating f1_score on training data for rfc
pred_target = rfc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data for rfc.
pred_target = rfc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 0.51
f1_score on testing data: 0.46


The gap between the training and testing f1_scores is now much smaller (.51 versus .46), but the testing f1_score is still not great.

### Histogram-Based Gradient Boosting Classifier

In [61]:
# Importing HistGradientBoostingClassifier function.
from sklearn.ensemble import HistGradientBoostingClassifier

In [62]:
# Creating HistGradientBoostingClassifier with random state for reproducibility.
hbc = HistGradientBoostingClassifier(random_state=22)

# Fitting hbc with training data.
hbc.fit(X_train1, y_train)

# Calculating f1_score on training data for hbc.
pred_target = hbc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('Basic HistGradBoostingClassifier')
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data for hbc.
pred_target = hbc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

print()

# Creating HistGradientBoostingClassifier with random state and class weight set to balanced.
hbc = HistGradientBoostingClassifier(random_state=22, class_weight='balanced')

# Fitting hbc with training data.
hbc.fit(X_train1, y_train)

# Calculating f1_score on training data for hbc.
pred_target = hbc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('HistGradientBoostingClassifier with class weight set to balanced.')
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data for hbc.
pred_target = hbc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))


Basic HistGradBoostingClassifier
f1_score on training data: 0.29
f1_score on testing data: 0.28

HistGradientBoostingClassifier with class weight set to balanced.
f1_score on training data: 0.47
f1_score on testing data: 0.46


This model doesn't have the overfitting that the random forest classifier had. Like the logistic regression model, we were able to increase the f1_score by setting class weight to balanced.

In [63]:
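# Creating parameter distributions for the randomized search.
param_dist = [{
    'learning_rate': [.1, .5, .9],
    'max_iter': [10, 50, 100],
    'max_leaf_nodes': [5, 15],
    'max_depth': [5,10,20],
    'min_samples_leaf': [5,10],
    'l2_regularization': [.1, .25, 1],
    'class_weight': [{0:1, 1:w} for w in [1, 3, 5.69]]
}]

# Creating custom scoring metric.
f1_scorer = make_scorer(f1_score, pos_label=1)

# Specifying the initial HistGradientBoostingClassifier model.
hgbc = HistGradientBoostingClassifier(random_state=22)

# Creating the RandomizedSearchCV object, sampling 50 candidates with stratified 4-fold cross-validation.
rand_search = RandomizedSearchCV(hgbc, param_distributions=param_dist,
                                 scoring=f1_scorer, cv=StratifiedKFold(n_splits=4), verbose=1, n_iter=50, random_state=22)

# Fitting rand_search with the training data.
rand_search.fit(X_train1, y_train)

# Printing the best estimator and its mean cross-validated f1_score.
print(rand_search.best_estimator_)
print(rand_search.best_score_)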

Fitting 4 folds for each of 50 candidates, totalling 200 fits
HistGradientBoostingClassifier(class_weight={0: 1, 1: 3},
                               l2_regularization=0.25, max_depth=20,
                               max_leaf_nodes=5, min_samples_leaf=10,
                               random_state=22)
0.4763985799866082


In [64]:
# Creating HistGradientBoostingClassifier based on best estimator from rand_search.
hbc = rand_search.best_estimator_

# Fitting hbc with training data.
hbc.fit(X_train1, y_train)

# Calculating f1_score on training data for hbc.
pred_target = hbc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data for hbc.
pred_target = hbc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 0.48
f1_score on testing data: 0.48


The tuned model reaches the highest testing f1_score so far (.48), but that is still not great.

# Conclusion

While I eventually want to explore more complex models such as XGBoost and neural networks, the current approach leaves room for improvement first. For instance, one-hot encoding the discrete variables, rather than scaling them as if they were numeric, might prove more helpful. However, this would significantly increase the dimensionality of the training and testing datasets, so analyzing feature importance or reducing dimensionality could be beneficial.
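
To give a sense of how quickly one-hot encoding adds columns, here is a minimal sketch with `pd.get_dummies` on a hypothetical frame. The column names are borrowed from the data set, but the values and the number of levels per column are invented for the example.

```python
import pandas as pd

# Hypothetical discrete predictors with 5, 6, and 4 distinct levels.
toy = pd.DataFrame({
    'general_health': [1, 2, 3, 4, 5, 3],
    'income_group': [1, 2, 3, 4, 5, 6],
    'smoking_status': [1, 2, 3, 4, 4, 3],
})

# Each column is replaced by one indicator column per level.
encoded = pd.get_dummies(toy, columns=list(toy.columns))

print(toy.shape[1], 'columns ->', encoded.shape[1], 'columns')
```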

# References

* precision_score. (n.d.). Scikit-learn. https://scikit-learn.org/dev/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score

* recall_score. (n.d.). Scikit-learn. https://scikit-learn.org/dev/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score

* f1_score. (n.d.). Scikit-learn. https://scikit-learn.org/dev/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score