*Adjusting the Preprocessing Strategy*

Before moving on, I want to refine the preprocessing strategy.
So far, I have split my predictor variables into continuous and discrete groups and one-hot encoded every discrete variable. That is generally fine, but discrete variables are not all alike, and the difference matters once interaction terms are added: one-hot encoding every discrete variable and then adding interaction terms could significantly increase the dimensionality of the data. Since the training set already contains a large number of instances, a substantial increase in dimensionality would greatly increase the computation time required for each step.
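
To make the dimensionality concern concrete, here is a rough sketch of how the feature count grows when every pairwise interaction term is added. The column counts are hypothetical and chosen only for illustration, and the helper name is my own; the formula matches what scikit-learn's `PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)` would produce.

```python
from math import comb


def n_features_with_pairwise_interactions(n_columns: int) -> int:
    """Original columns plus one product term per unordered pair of columns."""
    return n_columns + comb(n_columns, 2)


# Hypothetical column counts, not taken from this data set.
for n in (20, 40, 80):
    print(n, "columns ->", n_features_with_pairwise_interactions(n), "features")
```

The growth is quadratic, so every extra one-hot column adds far more than one feature once interactions are included.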

As mentioned in the initial notebook, when our data is in numeric format, values like 2 are interpreted as greater than 1. However, in some cases, there is no mathematical relationship or inherent order between such values. To handle this, we can classify discrete variables as either nominal or ordinal:

* Nominal variables are discrete variables with no inherent order between categories.
* Ordinal variables have an inherent order or ranking between categories.

Below are examples of each:

**Ordinal Variables**
* general_health: As this value increases, an individual's health worsens, indicating an inherent order.
* Other examples: physical_activity_150min, education_level, income_group, smoking_status, physical_health_days, and mental_health_days.

**Nominal Variables**
* sex: This variable represents either male or female. There is no inherent order, making it nominal.
* Other examples: has_health_plan, meets_aerobic_guidelines, muscle_strengthening, high_blood_pressure, high_cholesterol, heart_disease, lifetime_asthma, arthritis, alcohol_consumption, binge_drinking, heavy_drinking, and difficulty_walking.

Since ordinal variables have an inherent order and are already represented by numerical values, there is no need to one-hot encode them. However, nominal variables should be one-hot encoded to ensure they are appropriately represented in the model.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Importing the pandas and numpy libraries
import pandas as pd
import numpy as np

# Loading in data frame
df = pd.read_csv('/content/drive/MyDrive/diabetic/df.csv')



# Deleting unnecessary column and symbolizing all non-diabetic records with 0 instead of 3.
del df['Unnamed: 0']
df['diabetes_status'] = df['diabetes_status'].replace(3,0)

df.info()

df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217981 entries, 0 to 217980
Data columns (total 24 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   general_health            217981 non-null  float64
 1   physical_health_days      217981 non-null  float64
 2   mental_health_days        217981 non-null  float64
 3   has_health_plan           217981 non-null  float64
 4   meets_aerobic_guidelines  217981 non-null  float64
 5   physical_activity_150min  217981 non-null  float64
 6   muscle_strengthening      217981 non-null  float64
 7   high_blood_pressure       217981 non-null  float64
 8   high_cholesterol          217981 non-null  float64
 9   heart_disease             217981 non-null  float64
 10  lifetime_asthma           217981 non-null  float64
 11  arthritis                 217981 non-null  float64
 12  sex                       217981 non-null  float64
 13  age                       217981 non-null  f

Index(['general_health', 'physical_health_days', 'mental_health_days',
       'has_health_plan', 'meets_aerobic_guidelines',
       'physical_activity_150min', 'muscle_strengthening',
       'high_blood_pressure', 'high_cholesterol', 'heart_disease',
       'lifetime_asthma', 'arthritis', 'sex', 'age', 'height_inches', 'bmi',
       'education_level', 'income_group', 'smoking_status',
       'alcohol_consumption', 'binge_drinking', 'heavy_drinking',
       'diabetes_status', 'difficulty_walking'],
      dtype='object')

In [None]:
# Casting every column to float32, which halves the memory footprint relative to float64.
df = df.astype('float32')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217981 entries, 0 to 217980
Data columns (total 24 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   general_health            217981 non-null  float32
 1   physical_health_days      217981 non-null  float32
 2   mental_health_days        217981 non-null  float32
 3   has_health_plan           217981 non-null  float32
 4   meets_aerobic_guidelines  217981 non-null  float32
 5   physical_activity_150min  217981 non-null  float32
 6   muscle_strengthening      217981 non-null  float32
 7   high_blood_pressure       217981 non-null  float32
 8   high_cholesterol          217981 non-null  float32
 9   heart_disease             217981 non-null  float32
 10  lifetime_asthma           217981 non-null  float32
 11  arthritis                 217981 non-null  float32
 12  sex                       217981 non-null  float32
 13  age                       217981 non-null  f

In [None]:
# Isolating target variable.
df_target = df['diabetes_status']
del df['diabetes_status']

# Importing functions to split up and transform our data.
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Creating lists to contain the continuous, ordinal, and nominal variable names
continuous = ['age','height_inches','bmi']

ordinal = ['general_health', 'physical_health_days', 'mental_health_days', 'physical_activity_150min',
           'education_level', 'income_group', 'smoking_status']

nominal = ['has_health_plan', 'meets_aerobic_guidelines', 'muscle_strengthening',
           'high_blood_pressure', 'high_cholesterol', 'heart_disease',
           'lifetime_asthma', 'arthritis', 'sex',
           'alcohol_consumption', 'binge_drinking', 'heavy_drinking',
           'difficulty_walking']

# Splitting up predictor and target variable into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(df, df_target, test_size=0.30, random_state=22, stratify=df_target)
# The stratify parameter helps ensure that the percentage of diabetic and non-diabetic individuals is about the same in the training and testing sets.

# Creating pipeline to logarithmically transform and scale all continuous variables.
continuous_pipeline = Pipeline([
    ('log', FunctionTransformer(func=np.log1p)),
    ('scaler', StandardScaler()),
    ])

# Creating pipeline to scale each ordinal variable.
ordinal_pipeline = Pipeline([
    ('scaler', StandardScaler()),
])

# Creating pipeline to one-hot encode all nominal variables, dropping the first category of each, and then scale the resulting columns.
nominal_pipeline = Pipeline([
    ('one_hot', OneHotEncoder(sparse_output=False, drop='first')),
    ('scaler', StandardScaler()),
])

# Creating a column transformer to send all continuous variables to the continuous_pipeline, all ordinal variables to the ordinal_pipeline, and all nominal variables to the nominal_pipeline.
column_transformer = ColumnTransformer([
    ('cont', continuous_pipeline, continuous),
    ('ord', ordinal_pipeline, ordinal),
    ('nom', nominal_pipeline, nominal),
])


# Fitting the column transformer on the training data and then transforming the training data with it.
X_train1 = column_transformer.fit_transform(X_train)
# Transforming the testing data using the column transformer fitted on the training data.
X_test1 = column_transformer.transform(X_test)

nominal_columns = column_transformer.named_transformers_['nom']['one_hot'].get_feature_names_out(nominal).tolist()

correct_columns = continuous + ordinal + nominal_columns

# Creating data frames based on X_train1 and X_test1 for feature importance analysis later on.
X_train1 = pd.DataFrame(X_train1, columns=correct_columns)
X_test1 = pd.DataFrame(X_test1, columns=correct_columns)
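
The effect of the `stratify` argument used in the split above can be checked on synthetic data. This is only a sketch: the class counts and features below are made up for illustration, while `test_size` and `random_state` mirror the values used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 150 positives out of 1000 (hypothetical counts).
rng = np.random.default_rng(0)
y = np.array([1] * 150 + [0] * 850)
X = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=22, stratify=y
)

# With stratification, the positive rate is preserved in both splits.
print("overall:", y.mean())
print("train:  ", y_tr.mean())
print("test:   ", y_te.mean())
```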

In [None]:
X_train1.head()

Unnamed: 0,age,height_inches,bmi,general_health,physical_health_days,mental_health_days,physical_activity_150min,education_level,income_group,smoking_status,...,high_blood_pressure_1.0,high_cholesterol_1.0,heart_disease_1.0,lifetime_asthma_1.0,arthritis_1.0,sex_1.0,alcohol_consumption_1.0,binge_drinking_1.0,heavy_drinking_1.0,difficulty_walking_1.0
0,-0.186894,1.574059,-3.222404,0.439036,2.951524,-0.514054,-0.697145,-1.347789,-2.396734,0.66324,...,0.86452,0.847498,-0.314338,0.416234,1.358582,1.016614,-1.141708,0.402352,0.260923,2.374006
1,-0.691372,-1.032328,0.052506,2.378831,-0.503027,3.246575,1.614918,-1.347789,0.189562,0.66324,...,0.86452,0.847498,-0.314338,0.416234,1.358582,-0.983658,-1.141708,0.402352,0.260923,-0.421229
2,0.837489,0.903223,-0.023848,-0.530861,-0.503027,-0.514054,1.614918,-1.347789,-0.457012,0.66324,...,-1.156711,-1.179944,-0.314338,0.416234,-0.736062,1.016614,0.87588,0.402352,-3.832553,-0.421229
3,-2.056301,0.903223,0.645197,0.439036,-0.503027,3.246575,-0.697145,-1.347789,0.189562,-0.494853,...,0.86452,0.847498,-0.314338,-2.402492,-0.736062,1.016614,0.87588,0.402352,0.260923,-0.421229
4,-0.979195,-1.032328,0.61674,-0.530861,-0.503027,1.36626,0.458887,-0.229358,0.189562,-0.494853,...,0.86452,0.847498,-0.314338,0.416234,-0.736062,-0.983658,-1.141708,0.402352,0.260923,-0.421229


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [None]:
# Importing the randomized search CV function and the stratified k-fold splitter for cross-validation.
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

In [None]:
y_train.head()

Unnamed: 0,diabetes_status
180036,0.0
106667,1.0
41133,0.0
178616,0.0
42232,0.0


In [None]:
# Creating a LogisticRegression model, lg, with random state set to 42 and class_weight set to balanced.
lg = LogisticRegression(random_state=42, class_weight='balanced')

# Fitting lg with training data.
lg.fit(X_train1, y_train)

# Calculating f1_score on training data.
pred_target = lg.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1.0)
print('f1_score on training data:', np.round(f1_train,2))
# Printing the predicted probability of class 0 (non-diabetic) for the second training record.
print(lg.predict_proba(X_train1)[1,0])
# Calculating f1_score on testing data.
pred_target = lg.predict(X_test1)
f1_test = f1_score(y_test, pred_target,pos_label=1.0)
print('f1_score on testing data:', np.round(f1_test,2))

f1_score on training data: 0.46
0.5111452400921723
f1_score on testing data: 0.46
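
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that calculation on a hypothetical target with made-up class counts (not the counts from this data set) and compares it with scikit-learn's own helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 850 negatives and 150 positives.
y = np.array([0] * 850 + [1] * 150)
classes = np.unique(y)

# Weights as computed by scikit-learn for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights computed by hand from the documented formula.
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), np.round(weights, 3).tolist())))
print("positive-to-negative weight ratio:", np.round(weights[1] / weights[0], 2))
```

The ratio of the two weights equals the ratio of the class counts, so the rarer class receives proportionally more weight in the loss.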


In [None]:
# Creating list to house dictionary of possible parameter values.
param_dist = [{
    'max_iter': [25, 50, 100, 150],
    'class_weight': [{0:1, 1:w} for w in [1, 2, 3, 5.69]]
}]

# Creating initial LogisticRegression model.
lg = LogisticRegression(random_state=42)
# Creating the RandomizedSearchCV object.
rand_search = RandomizedSearchCV(lg, param_distributions=param_dist,
                                scoring= 'balanced_accuracy',
                                cv=StratifiedKFold(n_splits=4),
                                verbose=1, n_iter=40, random_state=22)
# Fitting rand_search with training data.
rand_search.fit(X_train1, y_train)

# Printing the best parameters from the best estimator from rand_search.
print(rand_search.best_params_)

best_estimator = rand_search.best_estimator_
# Calculating f1_score on training data.
pred_target = best_estimator.predict(X_train1)
f1_train = f1_score(y_train, pred_target, pos_label=1.0)
print('f1_score on training data:', np.round(f1_train,2))

# Calculating f1_score on testing data.
pred_target = best_estimator.predict(X_test1)
f1_test = f1_score(y_test, pred_target,pos_label=1.0)
print('f1_score on testing data:', np.round(f1_test,2))



Fitting 4 folds for each of 16 candidates, totalling 64 fits
{'max_iter': 25, 'class_weight': {0: 1, 1: 5.69}}
f1_score on training data: 0.46
f1_score on testing data: 0.46


In [None]:
# Refitting the column transformer on the full data set and transforming it for the final model.
pred_df = column_transformer.fit_transform(df)

nominal_columns = column_transformer.named_transformers_['nom']['one_hot'].get_feature_names_out(nominal).tolist()

correct_columns = continuous + ordinal + nominal_columns

pred_df = pd.DataFrame(pred_df, columns=correct_columns)

In [None]:
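# Refitting logistic regression on the full data set with the best parameters found by rand_search.
lg = LogisticRegression(random_state=42, class_weight={0:1, 1:5.69}, max_iter=25)
lg.fit(pred_df, df_target)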

pred_target = lg.predict(pred_df)
f1_train = f1_score(df_target, pred_target, pos_label=1.0)
print('f1_score on training data:', np.round(f1_train,2))


f1_score on training data: 0.46


In [None]:
import joblib

In [None]:
joblib.dump(lg, 'lg.joblib.diab')

['lg.joblib.diab']

In [None]:
joblib.dump(column_transformer, 'column_transformer.joblib.diab')

['column_transformer.joblib.diab']

In [None]:
import pickle

# Serializing the fitted model to a bytes object in memory.
lg_diab = pickle.dumps(lg)
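
Saved artifacts like the ones above can later be restored with `joblib.load`. The round trip below is a self-contained sketch on synthetic data: the model, the data, and the file name `model.joblib` are all placeholders for illustration, not the objects saved in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small stand-in model (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
model = LogisticRegression(random_state=42).fit(X, y)

# Dumping the model to a temporary file and loading it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should make the same predictions as the original.
same = np.array_equal(model.predict(X), restored.predict(X))
print("predictions match:", same)
```

When reloading, new records need to pass through the saved column transformer's `transform` method before being handed to the model, and the scikit-learn version should match the one used for saving.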