### Loading in and preprocessing the data

In [91]:
# Importing pandas and numpy library
import pandas as pd
import numpy as np

# Loading in data frame
df = pd.read_csv('df.csv')

# Deleting the unnecessary index column and recoding all non-diabetic records as 0 instead of 3.
del df['Unnamed: 0']
df['diabetes_status'] = df['diabetes_status'].replace(3,0)

# Isolating target variable. 
df_target = df['diabetes_status']
del df['diabetes_status']

In [92]:
# Importing functions to split up and transform our data.
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [93]:
# Creating lists to contain the continuous variable names and the discrete variable names.
cont = ['age','height_inches','bmi']
disc = ['general_health', 'physical_health_days', 'mental_health_days',
       'has_health_plan', 'meets_aerobic_guidelines',
       'physical_activity_150min', 'muscle_strengthening',
       'high_blood_pressure', 'high_cholesterol', 'heart_disease',
       'lifetime_asthma', 'arthritis', 'sex', 
       'education_level', 'income_group', 'smoking_status',
       'alcohol_consumption', 'binge_drinking', 'heavy_drinking',
       'difficulty_walking']

In [94]:
# Splitting up predictor and target variable into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(df, df_target, test_size=0.30, random_state=22, stratify=df_target)
# Stratifying on the target keeps the percentage of diabetic and non-diabetic individuals roughly the same in the training and testing sets.

# Creating pipeline to logarithmically transform and min max scale all continuous variables.
cont_pipeline = Pipeline([
    ('log', FunctionTransformer(func=np.log1p)),
    ('scaler', MinMaxScaler()),
    ])

# Creating pipeline to one hot encode all discrete variables while dropping the first.
disc_pipeline = Pipeline([
    ('one_hot', OneHotEncoder(sparse_output=False, drop='first'))
])

# Creating a column transformer to send all discrete variables to disc_pipeline and all continuous variables to cont_pipeline.
column_transformer = ColumnTransformer([
    ('cont', cont_pipeline, cont),
    ('disc', disc_pipeline, disc),
])

# Fitting the column transformer on the training data and then transforming the training data with it.
X_train1 = column_transformer.fit_transform(X_train)
# Transforming the testing data with the already-fitted column transformer.
X_test1 = column_transformer.transform(X_test)

# Creating data frames based on X_train1 and X_test1 for the feature importance analysis later on.
X_train1 = pd.DataFrame(X_train1)
X_test1 = pd.DataFrame(X_test1)
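
To make the continuous-variable pipeline concrete, here is a minimal sketch of what `log1p` followed by min-max scaling does to a single column, written in plain Python rather than scikit-learn. The BMI values and the helper names `fit_min_max` and `transform` are hypothetical and only illustrate the arithmetic; they are not taken from `df.csv`.

```python
import math

def fit_min_max(values):
    """Return the min and max of the log1p-transformed training values."""
    logged = [math.log1p(v) for v in values]
    return min(logged), max(logged)

def transform(values, lo, hi):
    """Apply log1p, then rescale using the training min and max."""
    return [(math.log1p(v) - lo) / (hi - lo) for v in values]

# Hypothetical BMI values, made up for illustration only.
train_bmi = [18.5, 24.0, 31.2, 45.0]
test_bmi = [22.0, 50.0]

# The min and max are learned from the training values only.
lo, hi = fit_min_max(train_bmi)
train_scaled = transform(train_bmi, lo, hi)
test_scaled = transform(test_bmi, lo, hi)

print(train_scaled)  # spans exactly 0.0 to 1.0
print(test_scaled)   # the value 50.0 lands above 1.0
```

Because the minimum and maximum come from the training data alone, a test value outside the training range maps outside [0, 1]. This mirrors fitting the column transformer on `X_train` and only calling `transform` on `X_test`.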

## Model Building

The goal of this section is to once again explore different models: Logistic Regression, Random Forest Classifier, Balanced Random Forest Classifier, Hist Gradient Boosting Classifier, and XGBClassifier.

### Logistic Regression

In [95]:
# Importing Logistic Regression and f1_score function.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [96]:
# Creating Logistic Regression model with class weight set to balanced and max iterations to 500.
lg = LogisticRegression(class_weight='balanced', max_iter=500)        # max_iter was raised because the model did not converge within the default 100 iterations.

# Fitting lg with training data.
lg.fit(X_train1, y_train)

# Calculating the f1_score on the training data for lg.
pred_target = lg.predict(X_train1)
f1_train = np.round(f1_score(y_train, pred_target),2)
print('f1 score on training data:', f1_train)

# Calculating the f1_score on the testing data for lg.
pred_target = lg.predict(X_test1)
f1_test = np.round(f1_score(y_test, pred_target),2)
print('f1 score on testing data:', f1_test)

f1 score on training data: 0.46
f1 score on testing data: 0.46


Revisiting the 'first_model_building_process.ipynb' file, I noticed that the F1 score on both the training and testing datasets remained at 0.46. Despite applying different transformations and stratifying the target variable, these changes didn't significantly impact the model's performance at this stage.
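
For reference, the F1 score reported throughout is the harmonic mean of precision and recall on the diabetic class. The sketch below computes it from raw counts in plain Python. The counts are hypothetical, chosen only to show how a score in this range can arise when recall is high but precision is low; they are not results from this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: most diabetic cases are found (recall 0.8),
# but two out of three positive predictions are wrong (precision 1/3).
score = f1_from_counts(tp=80, fp=160, fn=20)
print(round(score, 2))
```

Since the harmonic mean is pulled toward the smaller of the two values, a model can reach high recall and still report a modest F1 score.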

In [97]:
# Importing grid search cv function, stratified fold function for cv, and make scorer function for custom scoring metric
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer

In [98]:
# Creating list to house dictionary of possible parameter values.
param_dist = [{
    'C': [0.001, .01, .1, .5],
    'class_weight': [{0: 1, 1: w} for w in [1, 2, 3, 5.69]]
}]

# Specifying initial Logistic Regression model.
lg = LogisticRegression(max_iter=500)
# Creating GridSearchCV function.
grid = GridSearchCV(lg, param_dist, 
                    cv=StratifiedKFold(n_splits=4),                    # Used to split up the data into 4 sets for cross validation while stratifying the target variable.
                    scoring = make_scorer(f1_score, pos_label=1),      # Calling custom scoring function to calculate f1_score on 1 instances.
                    verbose=1)                                         # Setting verbose to 1 allows the total number of fits to be printed.
# Fitting grid with the training data.
grid.fit(X_train1, y_train)

# Printing the best parameters from the best estimator from grid.
print(grid.best_params_)

Fitting 4 folds for each of 16 candidates, totalling 64 fits
{'C': 0.1, 'class_weight': {0: 1, 1: 3}}


While the class weight matched the best parameter identified in the logistic regression grid search in 'first_model_building_process.ipynb', the optimal value for the regularization parameter C differed: it was 0.1 in this case, compared to 0.001 in the earlier analysis.
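
The `class_weight` dictionaries in the grid scale how much each sample contributes to the loss depending on its class. The sketch below illustrates that idea with a weighted log loss in plain Python. It is a simplified illustration, not scikit-learn's exact objective (which also includes the regularization term controlled by `C`), and the labels and predicted probabilities are hypothetical.

```python
import math

def weighted_log_loss(y_true, p_pred, class_weight):
    """Sum of per-sample log losses, each multiplied by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * loss
    return total

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.1, 0.1, 0.3]

plain = weighted_log_loss(y_true, p_pred, {0: 1, 1: 1})
weighted = weighted_log_loss(y_true, p_pred, {0: 1, 1: 3})

# The poorly predicted positive sample now costs three times as much.
print(plain, weighted)
```

With a weight of 3 on class 1, the optimizer is pushed harder to raise the predicted probability for diabetic records, which typically trades precision for recall.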

In [99]:
# Creating Logistic Regression model based on the best estimator from grid.
lg = grid.best_estimator_
# Fitting lg with training data.
lg.fit(X_train1, y_train)
# Calculating the f1_score on the training data by lg.
pred_target = lg.predict(X_train1)
f1_train = np.round(f1_score(y_train, pred_target),2)
print('f1 score on training data:', f1_train)
# Calculating the f1_score on the testing data by lg.
pred_target = lg.predict(X_test1)
f1_test = np.round(f1_score(y_test, pred_target),2)
print('f1 score on testing data:', f1_test)

f1 score on training data: 0.47
f1 score on testing data: 0.48


While the F1 score on the training data remains consistent with the grid search best estimator in 'first_model_building_process.ipynb', the F1 score on the testing data improved by .1.

### Random Forest Classifier

In [100]:
# Importing RandomForestClassifier function.
from sklearn.ensemble import RandomForestClassifier

In [101]:
# Creating a RandomForestClassifier model, rfc, with random_state set to 42 and class_weight set to balanced.
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')

# Fitting rfc with training data. 
rfc.fit(X_train1, y_train)

# Calculating f1_score on training data.
pred_target = rfc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data.
pred_target = rfc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 1.0
f1_score on testing data: 0.23


As expected, there is still clear overfitting.

In [102]:
# Importing randomized search cv function.
from sklearn.model_selection import RandomizedSearchCV

One advantage of the GridSearchCV function is that it exhaustively evaluates all possible combinations of parameters from the specified dictionary. However, this can become computationally expensive when there are many combinations.

In contrast, the RandomizedSearchCV function addresses this issue by randomly sampling a pre-specified number of parameter combinations. This approach makes it more efficient, especially when the parameter space is large, while still providing a good chance of finding an optimal or near-optimal set of parameters.
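
The difference in cost can be seen by counting candidates. The sketch below enumerates the logistic regression grid used earlier and the random forest grid used in this section with `itertools.product`, then draws 10 combinations at random, matching the default `n_iter=10` of `RandomizedSearchCV`. It uses plain Python instead of scikit-learn, and the key `class_weight_pos` is a hypothetical shorthand for the positive-class weight, not a real parameter name.

```python
import itertools
import random

# Logistic regression grid (4 values of C, 4 positive-class weights).
lr_grid = {
    'C': [0.001, 0.01, 0.1, 0.5],
    'class_weight_pos': [1, 2, 3, 5.69],
}

# Random forest grid (3 x 2 x 2 x 4 options).
rf_grid = {
    'n_estimators': [10, 30, 100],
    'max_depth': [5, 10],
    'max_features': ['sqrt', 'log2'],
    'class_weight_pos': [1, 3, 5.69, None],
}

def all_combinations(grid):
    """Return every combination of values in the grid as a list of dictionaries."""
    keys = list(grid)
    value_lists = [grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]

lr_candidates = all_combinations(lr_grid)
rf_candidates = all_combinations(rf_grid)

# Randomized search only evaluates a fixed-size sample of the combinations.
rng = random.Random(42)
rf_sample = rng.sample(rf_candidates, k=10)

print(len(lr_candidates))  # 16
print(len(rf_candidates))  # 48
print(len(rf_sample))      # 10
```

With 4-fold cross-validation, an exhaustive search over the random forest grid would need 192 fits, while the sampled search needs 40.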

In [103]:
# Creating list to house dictionary of possible parameter values.
param_dist = [{
    'n_estimators': [10, 30, 100],
    'max_depth': [5, 10],
    'max_features': ['sqrt', 'log2'],
    'class_weight': [{0:1, 1:w} for w in [1, 3, 5.69]] + [None]
}]

# Creating initial RandomForestClassifier model.
rfc = RandomForestClassifier(random_state=22)

# Creating RandomizedSearchCV function.
rand_search = RandomizedSearchCV(rfc, param_dist, 
                                 cv=StratifiedKFold(n_splits=4), 
                                 scoring = make_scorer(f1_score, pos_label=1), 
                                 verbose=1, random_state=42)
# Fitting rand_search with the training data.
rand_search.fit(X_train1, y_train)

# Printing the best parameters from the best estimator from rand_search.
print(rand_search.best_params_)

Fitting 4 folds for each of 10 candidates, totalling 40 fits
{'n_estimators': 30, 'max_features': 'sqrt', 'max_depth': 10, 'class_weight': {0: 1, 1: 3}}


The parameters above are the same as the best parameters found using RandomizedSearchCV in the 'first model building process.ipynb' file.

In [104]:
# Creating a RandomForestClassifier model, rfc, from the best estimator from rand_search.
rfc = rand_search.best_estimator_

# Fitting rfc with training data.
rfc.fit(X_train1, y_train)

# Calculating f1_score on training data.
pred_target = rfc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data.
pred_target = rfc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 0.48
f1_score on testing data: 0.45


Both the F1 scores for the training and testing data decreased by 0.03 compared to the best estimator from the RandomizedSearchCV results in 'first model building process.ipynb'. Since the estimator used the same parameters, it appears that the stratification and one-hot encoding strategy did not improve performance as expected.

### Balanced Random Forest Classifier

In [105]:
# Importing BalancedRandomForestClassifier function.
from imblearn.ensemble import BalancedRandomForestClassifier

In [106]:
# Creating BalancedRandomForestClassifier model (brfc).
brfc = BalancedRandomForestClassifier(random_state=42)

# Fitting brfc with training data.
brfc.fit(X_train1, y_train)
print()
print()

# Calculating f1_score on training data.
pred_target = brfc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data.
pred_target = brfc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

  warn(
  warn(
  warn(




f1_score on training data: 0.6
f1_score on testing data: 0.45


First, the warnings above are nothing to worry about. From the output, we can observe some overfitting, but the testing f1 score remains decent compared to the initial Random Forest Classifier model testing f1 score (.45 to .23), especially considering that none of the model parameters have been fine-tuned yet.

In [107]:
# Creating a dictionary inside of a list with different parameter possibilities.
param_dist = [{
    'n_estimators': [5, 25, 50, 100],
    'max_depth': [5, 10, 20],
    'max_features': ['sqrt','log2',None],
    # Note: w is not used inside the dictionary, so this yields {0: 1, 1: 2} three times plus None.
    'class_weight': [{0:1,1:2} for w in [1, 3, 5.69]] + [None],
}]

# Creating initial BalancedRandomForestClassifier.
brfc = BalancedRandomForestClassifier(random_state=42)

# Creating RandomizedSearchCV object.
# Note: the estimator passed here is rfc (the tuned RandomForestClassifier from above), not brfc.
rand_search = RandomizedSearchCV(rfc, param_dist, 
                                 cv=StratifiedKFold(n_splits=4), 
                                 scoring = make_scorer(f1_score, pos_label=1), 
                                 verbose=1, 
                                 random_state=42, 
                                 n_iter=15)                                     # Specifies that 15 parameter combinations will be sampled.
# Fitting rand_search with the training data.
rand_search.fit(X_train1, y_train)

# Printing the best parameters from the best estimator from rand_search.
print(rand_search.best_params_)

Fitting 4 folds for each of 15 candidates, totalling 60 fits
{'n_estimators': 50, 'max_features': None, 'max_depth': 10, 'class_weight': {0: 1, 1: 2}}


In [108]:
# Creating a BalancedRandomForestClassifier model, brfc, from the best estimator from rand_search.
brfc = rand_search.best_estimator_
# Fitting brfc with training data.
brfc.fit(X_train1, y_train)
# Calculating f1_score on training data.
pred_target = brfc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))
# Calculating f1_score on testing data.
pred_target = brfc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 0.49
f1_score on testing data: 0.43


The overfitting is less apparent than before; however, the F1 score on the testing data has decreased, which is a negative outcome.

### Hist Gradient Boosting Classifier

In [109]:
# Importing HistGradientBoostingClassifier function.
from sklearn.ensemble import HistGradientBoostingClassifier

In [110]:
# Creating a HistGradientBoostingClassifier model, hgbc, with random state set to 42 and class_weight to balanced.
hgbc = HistGradientBoostingClassifier(random_state=42, class_weight='balanced')

# Fitting hgbc with training data.
hgbc.fit(X_train1, y_train)

# Calculating f1_score on training data.
pred_target = hgbc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data.
pred_target = hgbc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 0.46
f1_score on testing data: 0.46


Compared to the initial HistGradientBoostingClassifier model from the 'first model building process.ipynb' file, the F1 score on the training data is .01 worse, while the F1 score on the testing data is exactly the same at .46.

In [111]:
# Creating list to house dictionary of possible parameter values.
param_dist = [{
    'learning_rate': [.1, .5, .9],
    'max_iter': [10, 25, 50, 100],
    'max_leaf_nodes': [5, 15],
    'max_depth': [5,10,20],
    'min_samples_leaf': [5,10],
    'l2_regularization': [.1, .25, .5, 1],
    'class_weight': [{0:1, 1:w} for w in [1, 2, 3, 5.69]]
}]

# Creating initial HistGradientBoostingClassifier model.
hgbc = HistGradientBoostingClassifier(random_state=42)
# Creating initial RandomizedSearchCV function.
rand_search = RandomizedSearchCV(hgbc, param_distributions=param_dist, 
                                scoring= make_scorer(f1_score, pos_label=1),
                                cv=StratifiedKFold(n_splits=4), 
                                verbose=1, n_iter=130, random_state=22)
# Fitting rand_search with training data.
rand_search.fit(X_train1, y_train)

# Printing the best parameters from the best estimator from rand_search.
print(rand_search.best_params_)

Fitting 4 folds for each of 130 candidates, totalling 520 fits
{'min_samples_leaf': 5, 'max_leaf_nodes': 15, 'max_iter': 100, 'max_depth': 20, 'learning_rate': 0.1, 'l2_regularization': 0.5, 'class_weight': {0: 1, 1: 3}}


In [112]:
# Creating a HistGradientBoostingClassifier model, hgbc, based on best estimator from rand_search.
hgbc = rand_search.best_estimator_

# Fitting hgbc with training data.
hgbc.fit(X_train1, y_train)

# Calculating f1_score on training data.
pred_target = hgbc.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data.
pred_target = hgbc.predict(X_test1)
f1_test = f1_score(y_test, pred_target, pos_label=1)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 0.48
f1_score on testing data: 0.48


Both the training and testing scores increased, but the testing score is still subpar at .48.

### XGB Classifier

In [113]:
# Importing XGBClassifier function.
from xgboost import XGBClassifier

In [114]:
# Creating a XGBClassifier model (xgbc) with random_state set to 42 and scale_pos_weight set to 5.69.
xgbc = XGBClassifier(random_state=42, scale_pos_weight=5.69)

# Fitting xgbc with training data. 
xgbc.fit(X_train1, y_train)

# Calculating f1_score on training data.
pred_target = xgbc.predict(X_train1)
f1_train = np.round(f1_score(y_train, pred_target),2)
print('f1_score on training data:', f1_train)

# Calculating f1_score on testing data.
pred_target = xgbc.predict(X_test1)
f1_test = np.round(f1_score(y_test, pred_target),2)
print('f1_score on testing data:', f1_test)

f1_score on training data: 0.5
f1_score on testing data: 0.46


In [115]:
# Creating list to house dictionary with possible parameter values.
param_dist = [{
    'eta': [0.1, 0.3, 0.7, 0.9],
    'gamma': [0.1, 0.4, 1],
    'max_depth': [2, 6, 9, 14],
    'scale_pos_weight': [1, 2, 3, 5.69]
}]
# Creating initial XGBClassifier model.
xgbc = XGBClassifier(random_state=42)
# Creating RandomizedSearchCV function.
rand_search = RandomizedSearchCV(xgbc, param_distributions=param_dist, 
                                scoring=make_scorer(f1_score, pos_label=1), 
                                cv=StratifiedKFold(n_splits=4), 
                                verbose=1, n_iter=20, random_state=22)
# Fitting rand_search with training data.
rand_search.fit(X_train1, y_train)

# Printing the best parameters from the best estimator from rand_search.
print(rand_search.best_params_)

Fitting 4 folds for each of 20 candidates, totalling 80 fits
{'scale_pos_weight': 3, 'max_depth': 2, 'gamma': 0.1, 'eta': 0.3}


In [116]:
# Creating a XGBClassifier model, xgbc, based on the best estimator from rand_search.
xgbc = rand_search.best_estimator_

# Fitting xgbc with training data. 
xgbc.fit(X_train1, y_train)

# Calculating f1_score on training data.
pred_target = xgbc.predict(X_train1)
f1_train = np.round(f1_score(y_train, pred_target),2)
print('f1_score on training data:', f1_train)

# Calculating f1_score on testing data.
pred_target = xgbc.predict(X_test1)
f1_test = np.round(f1_score(y_test, pred_target),2)
print('f1_score on testing data:', f1_test)

f1_score on training data: 0.48
f1_score on testing data: 0.48


This F1 score on the testing data matches the testing F1 score of the 'best' HistGradientBoostingClassifier model.

In [117]:
# Creating data frame with all feature names and corresponding importance.
feature_importance = pd.DataFrame()
feature_importance['features'] = xgbc.feature_names_in_
feature_importance['importance'] = xgbc.feature_importances_

# Printing the shape of the data frame that only contains features with no importance.
print(feature_importance[feature_importance['importance'] == 0].shape)


(50, 2)


There are 50 variables that have no importance.

In [118]:
# Creating new data frame with features that have some importance.
impt_features = feature_importance.loc[feature_importance['importance'] != 0]
# Creating a list with the values from the 'features' columns.
imp_cols = impt_features['features'].values.astype(int).tolist()

# Creating new training and testing predictor data frame with the important variables.
X_train1_imp = X_train1[imp_cols]
X_test1_imp = X_test1[imp_cols]

In [119]:
# Creating XGBClassifier,xgbc, based on the best parameters from the rand_search best estimator.
xgbc = XGBClassifier(scale_pos_weight= 3, max_depth= 2, gamma= 0.1, eta= 0.3, random_state=42)

# Fitting xgbc with training data. 
xgbc.fit(X_train1_imp, y_train)

# Calculating f1_score on training data.
pred_target = xgbc.predict(X_train1_imp)
f1_train = np.round(f1_score(y_train, pred_target),2)
print('f1_score on training data:', f1_train)

# Calculating f1_score on testing data.
pred_target = xgbc.predict(X_test1_imp)
f1_test = np.round(f1_score(y_test, pred_target),2)
print('f1_score on testing data:', f1_test)

f1_score on training data: 0.48
f1_score on testing data: 0.48


While I hoped that removing non-important variables would improve the F1 score on the testing data, it remained unchanged at 0.48.

## Conclusion

The initial goal of this notebook was to determine whether one-hot encoding discrete variables would enable the model to make more accurate assumptions about each discrete variable. Additionally, stratifying the target variable when splitting the data into training and testing sets aimed to help the model generalize better to unseen data. However, while these adjustments did not decrease the F1 score for the target variable, they also did not result in any significant improvement.

**Next Steps**

Moving forward, the focus will shift to evaluating not only the F1 score for diabetic instances but also metrics such as recall, precision, and the confusion matrix. By analyzing these metrics, we can hopefully identify the factors affecting the F1 score. This analysis will primarily be conducted using HistGradientBoostingClassifier and XGBClassifier.
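
As a preview of that analysis, the sketch below shows how the four confusion-matrix counts, and precision and recall from them, can be derived from true and predicted labels in plain Python. The label lists and the helper name `confusion_counts` are hypothetical; in the notebook the inputs would be `y_test` and a model's predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the given positive label."""
    counts = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            counts['tp'] += 1
        elif p == positive:
            counts['fp'] += 1
        elif t == positive:
            counts['fn'] += 1
        else:
            counts['tn'] += 1
    return counts

# Hypothetical labels, made up for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision = counts['tp'] / (counts['tp'] + counts['fp'])
recall = counts['tp'] / (counts['tp'] + counts['fn'])

print(counts)             # {'tp': 2, 'fp': 2, 'fn': 1, 'tn': 3}
print(precision, recall)
```

Looking at the false positives and false negatives separately shows whether a low F1 score comes from too many false alarms or from missed diabetic cases, which the single score cannot reveal.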