#### Solve a **Multivariate Classification** problem using the **RandomForestClassifier** algorithm

- Hyperparameters tuned with

    - **Bayesian search** optimization using the **Optuna** library


> ### Dataset used: &emsp;[Pima Indians Diabetes Database](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

In [4]:
import numpy as np
import pandas as pd

In [2]:
# define column names
column_names: list[str] = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Load the data from the .csv file
df_patient_data = pd.read_csv('../data/processed/pima-indians-diabetes.csv',
                          header = None,
                          names = column_names)

df_patient_data.head(10)              # show the first 10 rows; head() without an argument returns only the first 5

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [3]:
df_patient_data['Age'].value_counts()           # count the patients per distinct "Age" value (a basic EDA step)

Age
22    72
21    63
25    48
24    46
23    38
28    35
26    33
27    32
29    29
31    24
41    22
30    21
37    19
42    18
33    17
38    16
36    16
32    16
45    15
34    14
46    13
43    13
40    13
39    12
35    10
50     8
51     8
52     8
44     8
58     7
47     6
54     6
49     5
48     5
57     5
53     5
60     5
66     4
63     4
62     4
55     4
67     3
56     3
59     3
65     3
69     2
61     2
72     1
81     1
64     1
70     1
68     1
Name: count, dtype: int64

In [181]:
df_patient_data.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [5]:
df_patient_data.info()              # shows each column's dtype and its non-null count

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [6]:
df_patient_data.shape           # (768, 9) --> 768 rows and 9 columns

(768, 9)

In [184]:
df_patient_data.isnull().sum()              # number of null values per column; a count >= 1 means the column has missing values

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [185]:
df_patient_data.isnull().any()              # True for every column that contains at least one null value

Pregnancies                 False
Glucose                     False
BloodPressure               False
SkinThickness               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                     False
dtype: bool

In [186]:
(df_patient_data == 0).any()          # True for every column that contains at least one zero

Pregnancies                  True
Glucose                      True
BloodPressure                True
SkinThickness                True
Insulin                      True
BMI                          True
DiabetesPedigreeFunction    False
Age                         False
Outcome                      True
dtype: bool
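
The boolean check above only reports whether a column contains a zero at all. As a rough sketch, the zeros could also be counted per column to see how much imputation each column would need. The toy frame `df_demo` is a hypothetical stand-in for `df_patient_data`, not the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_patient_data
df_demo = pd.DataFrame({
    'Glucose': [148, 85, 0, 89],
    'Insulin': [0, 0, 0, 94],
    'Age': [50, 31, 32, 21],
})

# Count the zeros per column and express them as a share of all rows
zero_counts = (df_demo == 0).sum()
zero_share = (df_demo == 0).mean().round(2)

print(pd.DataFrame({'zeros': zero_counts, 'share': zero_share}))
```

Applied to the real frame, the same two expressions would show which of the measurement columns are most affected.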

In [187]:
# Display patients with an Insulin value of 0 (i.e. not measured / missing value).
# Such columns need imputation: replace the missing values with a statistic (mean, median or mode) of the non-missing values.

df_patient_zero = df_patient_data[df_patient_data['Insulin'] == 0]
df_patient_zero

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
5,5,116,74,0,0,25.6,0.201,30,0
7,10,115,0,0,0,35.3,0.134,29,0
...,...,...,...,...,...,...,...,...,...
761,9,170,74,31,0,44.0,0.403,43,1
762,9,89,62,0,0,22.5,0.142,33,0
764,2,122,70,27,0,36.8,0.340,27,0
766,1,126,60,0,0,30.1,0.349,47,1


> #### Feature Engineering: Simple Imputation

In [188]:
# Replace zeros with NaN in columns where a zero value does not make sense
cols_with_nonsensical_zeros: list[str] = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df_patient_data[cols_with_nonsensical_zeros] = df_patient_data[cols_with_nonsensical_zeros].replace(0, np.nan)

# Impute the NaNs with the mean of the respective column
df_patient_data.fillna(df_patient_data.mean(), inplace = True)

# Optional: check the number of remaining NaNs per column
# print(df_patient_data.isna().sum())

# Check again which columns still contain zeros; only "Pregnancies" and "Outcome" should, where zero is a valid value
(df_patient_data == 0).any()

Pregnancies                  True
Glucose                     False
BloodPressure               False
SkinThickness               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                      True
dtype: bool
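
The cell above computes the column means on the full dataset, before the train/test split, so test rows contribute to the imputed values. A minimal sketch of the alternative, learning the means from the training rows only, is shown here. The toy frame and the positional split are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 0 marks a missing measurement
df_demo = pd.DataFrame({
    'Glucose': [100.0, 0.0, 120.0, 0.0, 140.0, 90.0],
    'BMI': [30.0, 25.0, 0.0, 35.0, 0.0, 28.0],
})
df_demo = df_demo.replace(0, np.nan)

# Simple positional split for illustration: first 4 rows train, last 2 rows test
train, test = df_demo.iloc[:4].copy(), df_demo.iloc[4:].copy()

# Learn the column means from the training rows only ...
train_means = train.mean()

# ... and apply them to both splits
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means)
print(test_filled)
```

Whether this matters for a dataset of this size is an empirical question; the sketch only shows where the statistics would come from.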

> #### Split Train, Test data

In [189]:
# separate the features (X) from the target (y), then split both into train and test sets
from pandas import DataFrame, Series
from sklearn.model_selection import train_test_split

X: DataFrame = df_patient_data.drop('Outcome', axis = 1)
y: Series = df_patient_data['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    random_state = 42)

print(f'Features or input variables:\n, {X.head()}, \n\nand shape is, {X.shape}')

print(f'\nTarget or output variables:\n, {y.head()}, \nand shape is, {y.shape}')

Features or input variables:
,    Pregnancies  Glucose  BloodPressure  SkinThickness     Insulin   BMI  \
0            6    148.0           72.0       35.00000  155.548223  33.6   
1            1     85.0           66.0       29.00000  155.548223  26.6   
2            8    183.0           64.0       29.15342  155.548223  23.3   
3            1     89.0           66.0       23.00000   94.000000  28.1   
4            0    137.0           40.0       35.00000  168.000000  43.1   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
2                     0.672   32  
3                     0.167   21  
4                     2.288   33  , 

and shape is, (768, 8)

Target or output variables:
, 0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64, 
and shape is, (768,)
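
The split above is purely random. When the target classes are imbalanced, a stratified split keeps the class ratio the same in both parts. The sketch below uses the `stratify` parameter of `train_test_split` on synthetic labels; `X_demo` and `y_demo` are hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 65 negatives, 35 positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 65 + [1] * 35)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,   # keep the class ratio in both splits
)

print('train positive share:', y_tr.mean())
print('test positive share:', y_te.mean())
```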


In [190]:
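# Optional: scale the features (fit on the train data, then apply to both train and test data).
# Tree-based models such as random forests are largely insensitive to feature scaling, so this step is not required here.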
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f'Scaled X_train:\n, {X_train[:5]}, \n\nand shape is, {X_train.shape}')       # prints the first 5 scaled rows as a NumPy array

print(f'\nScaled X_test:\n, {X_test[:5]}, \n\nand shape is, {X_test.shape}')

Scaled X_train:
, [[-0.8362943  -0.89610788 -1.00440048 -1.27450178 -1.14686808 -1.20403257
  -0.61421636 -0.94861028]
 [ 0.39072767 -0.56399695 -0.02026586  0.02449184  1.99579164  0.66428525
  -0.90973787 -0.43466673]
 [-1.14304979  0.43233584 -0.34831073  1.55966612  1.11302206  1.44035573
  -0.30699103 -0.77729576]
 [ 0.08397217  0.29949146 -0.34831073 -0.92023079  0.12432012  0.11816158
  -0.90681191 -0.43466673]
 [-0.8362943  -0.63041914 -3.46473705  1.08730481 -0.85261155  1.58407249
  -0.83951493 -0.00638043]], 

and shape is, (537, 8)

Scaled X_test:
, [[ 6.97483158e-01 -7.96474605e-01 -1.16842292e+00  4.96853160e-01
   4.06806388e-01  2.47506663e-01 -1.16803926e-01  8.50192166e-01]
 [-5.29538810e-01 -3.31519303e-01  2.25767799e-01  3.78762831e-01
   1.29998139e-03  4.91825147e-01 -9.41923376e-01 -1.03426754e+00]
 [-5.29538810e-01 -4.64363675e-01 -6.76355609e-01  4.26092128e-02
   1.29998139e-03 -2.12386954e-01 -9.12663821e-01 -1.03426754e+00]
 [ 1.31099414e+00 -4.97574768e-01
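
The scaler learns the mean and standard deviation from the training data and reuses them for the test data. A small NumPy sketch of that computation, with hypothetical arrays in place of `X_train` and `X_test`:

```python
import numpy as np

# Hypothetical feature arrays
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)   # population std (ddof=0), the same estimator StandardScaler uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma   # test data reuses the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, every training column has mean 0 and standard deviation 1; the test columns generally do not, because they are shifted and scaled by the training statistics.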

#### RandomForestClassifier model hyper-params

In [191]:
from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier().get_params()
# print(len(RandomForestClassifier().get_params()))       # total 19 keys

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [192]:
from sklearn.model_selection import cross_val_score
import optuna
from optuna import Study

> #### Define the **Objective function** and create a **Study**

In [193]:
def objective(trial):
    # Define hyperparameter values
    n_estimators = trial.suggest_categorical('n_estimators', [20, 60, 100, 120])
    max_features = trial.suggest_categorical('max_features', [0.2, 0.6, 1.0])
    max_depth = trial.suggest_categorical('max_depth', [2, 8, None])
    max_samples = trial.suggest_categorical('max_samples', [0.5, 0.75, 1.0])
    bootstrap = trial.suggest_categorical('bootstrap', [True, False])
    min_samples_split = trial.suggest_categorical('min_samples_split', [2, 5])
    min_samples_leaf = trial.suggest_categorical('min_samples_leaf', [1, 2])

    # Create model with suggested hyperparameters
    model_params = {
        'n_estimators': n_estimators,
        'max_features': max_features,
        'max_depth': max_depth,
        'max_samples': max_samples if bootstrap else None,
        'bootstrap': bootstrap,
        'min_samples_split': min_samples_split,
        'min_samples_leaf': min_samples_leaf,
        'random_state': 42
    }

    model = RandomForestClassifier(**model_params)

    # Perform 3-fold cross-validation and calculate accuracy
    score = cross_val_score(model,
                            X_train,
                            y_train,
                            cv = 3,
                            scoring='accuracy').mean()

    return score                # Return the accuracy score for Optuna to maximize

> #### Connect to **Dagshub** repo

In [196]:
import mlflow as mfl
import dagshub as dgb
from dotenv import load_dotenv
import os

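load_dotenv()                       # read variables from a local .env file, if present
repo_owner: str | None = os.getenv('DAGSHUB_REPO_OWNER')
repo_name: str | None = os.getenv('DAGSHUB_REPO_NAME')

# os.getenv returns None for a missing variable instead of raising, so check explicitly
if not repo_owner or not repo_name:
    raise ValueError('DAGSHUB_REPO_OWNER and DAGSHUB_REPO_NAME must be set')

# Initialize Dagshub
dgb.init(repo_owner = repo_owner,
         repo_name = repo_name,
         mlflow = True)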

ImRaNM-001 mlflow-experiment-hp-tuning
https://dagshub.com/ImRaNM-001/mlflow-experiment-hp-tuning.mlflow


In [173]:
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
# Disabled local uri due to connection made to dagshub
# if not mfl.is_tracking_uri_set():
#     mfl.set_tracking_uri(uri = 'http://127.0.0.1:5000')

mfl.set_tracking_uri(f'https://dagshub.com/{repo_owner}/{repo_name}.mlflow')

# set the experiment name
mfl.set_experiment('RandomForest Bayesian Search HyperParameter Tuning')

with mfl.start_run(run_name = 'series_of_runs') as parent:  
    study: Study = optuna.create_study(direction = 'maximize',                      # maximize accuracy
                                       sampler = optuna.samplers.TPESampler())      # Tree-structured Parzen Estimator sampler
    study.optimize(objective, n_trials = 5)            # Run 5 trials to find the best hyperparameters

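    # Train the RandomForestClassifier model using the best hyperparameters from Optuna.
    # Mirror the objective: max_samples is only valid when bootstrap is True,
    # otherwise scikit-learn raises a ValueError during fit.
    best_params = dict(study.best_trial.params)
    if not best_params['bootstrap']:
        best_params['max_samples'] = None

    best_model = RandomForestClassifier(**best_params,
                                        random_state = 42)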

    # Fit the model to the training data
    best_model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = best_model.predict(X_test)

    # Infer the model signature (input and output schema) from the test data and its predictions
    signature = mfl.models.infer_signature(model_input = X_test,
                                           model_output = y_pred)

    # Calculate the metrics "accuracy", "precision_score" and "confusion_matrix" on the test set
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    # Start a nested run for each trial
    for trial in study.trials:
        with mfl.start_run(run_name = f'Experiment: {trial.number}', nested = True) as nested_run:
            # Log trial hyperparameters and metrics
            mfl.log_params(trial.params)
            mfl.log_metric('trial_value', trial.value)

[I 2025-01-26 05:02:16,484] A new study created in memory with name: no-name-ea8d4a22-7dc3-4545-87ad-2243bd72f10e


[I 2025-01-26 05:02:16,909] Trial 0 finished with value: 0.7653631284916201 and parameters: {'n_estimators': 120, 'max_features': 1.0, 'max_depth': 8, 'max_samples': 0.5, 'bootstrap': True, 'min_samples_split': 5, 'min_samples_leaf': 2}. Best is trial 0 with value: 0.7653631284916201.
[I 2025-01-26 05:02:17,106] Trial 1 finished with value: 0.7709497206703911 and parameters: {'n_estimators': 120, 'max_features': 0.2, 'max_depth': 8, 'max_samples': 0.75, 'bootstrap': True, 'min_samples_split': 2, 'min_samples_leaf': 1}. Best is trial 1 with value: 0.7709497206703911.
[I 2025-01-26 05:02:17,270] Trial 2 finished with value: 0.7616387337057727 and parameters: {'n_estimators': 120, 'max_features': 0.6, 'max_depth': 2, 'max_samples': 0.75, 'bootstrap': False, 'min_samples_split': 2, 'min_samples_leaf': 1}. Best is trial 1 with value: 0.7709497206703911.
[I 2025-01-26 05:02:17,420] Trial 3 finished with value: 0.7597765363128491 and parameters: {'n_estimators': 100, 'max_features': 0.2, 'max

🏃 View run Experiment: 0 at: http://127.0.0.1:5000/#/experiments/638940344264241299/runs/67fb9b97a71f4a0694de74b76eb9346d
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/638940344264241299
🏃 View run Experiment: 1 at: http://127.0.0.1:5000/#/experiments/638940344264241299/runs/5facae863e08416298b8aff0340a7ba1
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/638940344264241299
🏃 View run Experiment: 2 at: http://127.0.0.1:5000/#/experiments/638940344264241299/runs/55cd7a8684364855b891dcf509a6e142
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/638940344264241299
🏃 View run Experiment: 3 at: http://127.0.0.1:5000/#/experiments/638940344264241299/runs/3296b7c2905346c2bd96d772cfc5b635
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/638940344264241299
🏃 View run Experiment: 4 at: http://127.0.0.1:5000/#/experiments/638940344264241299/runs/bdab82b20bea41bf9187d32e81e93bd1
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/638940344264241299
🏃 Vie

In [174]:
# Print the best results
best_trial_val: str = f'{study.best_trial.value:.2f}'
print(f'Best trial accuracy: {best_trial_val}')

best_hyper_params = study.best_trial.params
print(f'Best hyperparameters: {best_hyper_params}')

# Print the test accuracy, precision_score and confusion_matrix
print(f'Test Accuracy with best hyperparameters: {accuracy:.2f}')
print(f'Best precision score: {precision:.2f}')
print(f'Best confusion matrix: {conf_matrix}')

Best trial accuracy: 0.77
Best hyperparameters: {'n_estimators': 120, 'max_features': 0.2, 'max_depth': 8, 'max_samples': 0.75, 'bootstrap': True, 'min_samples_split': 2, 'min_samples_leaf': 1}
Test Accuracy with best hyperparameters: 0.75
Best precision score: 0.64
Best confusion matrix: [[122  29]
 [ 28  52]]
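
The reported accuracy and precision can be re-derived from the confusion matrix above. scikit-learn lays the binary matrix out as `[[TN, FP], [FN, TP]]`. The sketch below recomputes both metrics and adds recall, which the notebook does not report; the `_demo` names are illustrative only.

```python
import numpy as np

# Confusion matrix reported above; scikit-learn layout is [[TN, FP], [FN, TP]]
conf_matrix_demo = np.array([[122, 29],
                             [28, 52]])
tn, fp, fn, tp = conf_matrix_demo.ravel()

accuracy_demo = (tp + tn) / conf_matrix_demo.sum()   # correct predictions / all predictions
precision_demo = tp / (tp + fp)                      # predicted positives that are truly positive
recall_demo = tp / (tp + fn)                         # true positives that were found

print(f'accuracy:  {accuracy_demo:.2f}')
print(f'precision: {precision_demo:.2f}')
print(f'recall:    {recall_demo:.2f}')
```

The recomputed accuracy and precision round to the 0.75 and 0.64 printed by the cell above.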


In [197]:
# Save the confusion matrix using joblib
import joblib as jb
# with open('confusion_matrix.pkl', 'wb') as conf_matrix_file:
#     pickle.dump(conf_matrix, conf_matrix_file)

with open('confusion_matrix.joblib', 'wb') as conf_matrix_file:
    jb.dump(conf_matrix, conf_matrix_file)

##### Advantages of joblib over plain pickle for saving artifacts:

Note: in the MLflow version used here, `mlflow.sklearn.log_model` accepts only `'pickle'` and `'cloudpickle'` as `serialization_format` (see the code comment in the model logging cell), so `'joblib'` cannot be passed there. The comparison below applies to artifacts that are saved manually, such as the confusion matrix above.

1. Better for large NumPy arrays:

    - Joblib is optimized for objects that carry large NumPy arrays, which are very common in machine learning. It stores the array buffers efficiently and can memory-map them on load (`mmap_mode`), which can make saving and loading large arrays faster than with plain pickle.

2. Potential for parallelism:

    - Joblib has built-in support for parallel processing (`joblib.Parallel`). This plays no role when saving a model or artifact, but it is an advantage if the workflow contains other data processing tasks.

3. (Slightly) better compression:

    - Joblib has built-in compression (the `compress` argument of `joblib.dump`). In some cases this produces slightly smaller files than pickle, but the difference is often not significant.

##### When pickle might be okay:

- Small objects: if the model or artifact is relatively small and does not involve massive NumPy arrays, pickle works just fine.

- Simplicity: pickle is built into Python, so it avoids adding joblib as a direct dependency.

In summary: for artifacts that hold large arrays, joblib is usually the better choice because it can save and load them faster. For a simple use case with small objects, plain pickle is sufficient.
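
The trade-off can be checked empirically. The sketch below saves the same array once with `joblib.dump` (with compression enabled) and once with `pickle.dump`, reloads the joblib file and reports both file sizes. The array `big_array` and the file names are hypothetical; actual sizes and timings depend on the data.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np

# Hypothetical array standing in for a large model attribute or artifact
big_array = np.arange(1_000_000, dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp_dir:
    joblib_path = os.path.join(tmp_dir, 'array.joblib')
    pickle_path = os.path.join(tmp_dir, 'array.pkl')

    # joblib with zlib compression (compress accepts levels 0-9)
    joblib.dump(big_array, joblib_path, compress=3)

    # plain pickle, no compression
    with open(pickle_path, 'wb') as fh:
        pickle.dump(big_array, fh)

    restored = joblib.load(joblib_path)
    joblib_size = os.path.getsize(joblib_path)
    pickle_size = os.path.getsize(pickle_path)

print('round trip ok:', np.array_equal(big_array, restored))
print('joblib bytes:', joblib_size, '| pickle bytes:', pickle_size)
```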

In [176]:
with mfl.start_run(run_name = 'best_model') as best_run:       # separate name, so the fitted best_model is not overwritten by the run object
    mfl.log_params(best_hyper_params)
    mfl.log_metric('trial_value', study.best_trial.value)      # log the numeric value, not the formatted string
    mfl.log_metric('accuracy_score', accuracy)
    mfl.log_metric('precision_score', precision)
    
    # Log the confusion matrix as an artifact
    mfl.log_artifact('confusion_matrix.joblib')

    # sklearn model supported formats: ['pickle', 'cloudpickle']
    mfl.sklearn.log_model(best_model,
                          artifact_path = 'final_model',
                          registered_model_name = 'RandomForestClassifier_model',
                          signature = signature)

Registered model 'RandomForestClassifier_model' already exists. Creating a new version of this model...
2025/01/26 05:02:19 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: RandomForestClassifier_model, version 6


🏃 View run best_model at: http://127.0.0.1:5000/#/experiments/638940344264241299/runs/d5517dc5fe37444da57e0483d6aea5c1
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/638940344264241299


Created version '6' of model 'RandomForestClassifier_model'.
