Working with the Models Subpackage
----------------------------------

The ``models`` subpackage provides tools for creating and managing machine learning models within the ``MED3pa`` package.


## Using the ModelFactory Class
The `ModelFactory` class in the `models` subpackage creates machine learning models either from predefined configurations or from serialized states. The steps below show how to use it:


### Step 1: Importing Necessary Modules
Start by importing the required classes and utilities for model management:


In [1]:
import sys
import os

sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

from pprint import pprint
from MED3pa.models import factories


### Step 2: Creating an Instance of ModelFactory
Instantiate the `ModelFactory`, which serves as your gateway to generating various model instances:


In [2]:
factory = factories.ModelFactory()


### Step 3: Discovering Supported Models
Before creating a model, check which models are currently supported by the factory:


In [3]:
print("Supported models:", factory.get_supported_models())


Supported models: ['XGBoostModel']


### Step 4: Creating a Model Using the Factory
The factory can create a model in two main ways: from hyperparameters or from a serialized (pickled) file.

#### Creating a Model from Hyperparameters

In [4]:
xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'eta': 0.1,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 1,
    'nthread': 4,
    'tree_method': 'hist',
    'device': 'cpu'
}

xgb_model = factory.create_model_with_hyperparams('XGBoostModel', xgb_params)
pprint(xgb_model.get_info())


{'data_preparation_strategy': 'ToDmatrixStrategy',
 'model': 'XGBoostModel',
 'model_type': 'Booster',
 'params': {'colsample_bytree': 0.8,
            'device': 'cpu',
            'eta': 0.1,
            'eval_metric': 'auc',
            'max_depth': 6,
            'min_child_weight': 1,
            'nthread': 4,
            'objective': 'binary:logistic',
            'subsample': 0.8,
            'tree_method': 'hist'},
 'pickled_model': False}
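
To make the name-based lookup easier to picture, here is a minimal sketch of how a factory could map a model name to a constructor. It is an illustration only and not MED3pa's implementation: `SketchModel` and `SketchFactory` are hypothetical classes, and only the method names mirror the calls shown above.

```python
from typing import Any, Callable, Dict, List


class SketchModel:
    """Hypothetical stand-in for a wrapped model; not a MED3pa class."""

    def __init__(self, params: Dict[str, Any]) -> None:
        # Copy the parameters so later changes to the caller's dict do not leak in.
        self.params = dict(params)

    def get_info(self) -> Dict[str, Any]:
        return {"model": type(self).__name__, "params": self.params}


class SketchFactory:
    """Hypothetical factory that maps a model name to a constructor."""

    def __init__(self) -> None:
        # Registry of supported model names (assumed design, for illustration).
        self._registry: Dict[str, Callable[[Dict[str, Any]], SketchModel]] = {
            "SketchModel": SketchModel,
        }

    def get_supported_models(self) -> List[str]:
        return list(self._registry)

    def create_model_with_hyperparams(self, name: str, params: Dict[str, Any]) -> SketchModel:
        if name not in self._registry:
            raise ValueError(f"Unsupported model: {name!r}")
        return self._registry[name](params)


sketch_factory = SketchFactory()
sketch_model = sketch_factory.create_model_with_hyperparams("SketchModel", {"max_depth": 6})
print(sketch_model.get_info())
```

With a registry like this, supporting another model type would only require adding an entry to the dictionary, which is one reason to check the supported names before requesting a model.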


#### Loading a Model from a Serialized State
For pre-trained models, use the `create_model_from_pickled` method to load a model from its serialized (pickled) state. You only need to specify the path to the pickled file; the method examines the file and extracts the necessary information:


In [5]:
xgb_model_pkl = factory.create_model_from_pickled('./models/diabetes_xgb_model.pkl')
pprint(xgb_model_pkl.get_info())


{'data_preparation_strategy': 'ToDmatrixStrategy',
 'model': 'XGBoostModel',
 'model_type': 'Booster',
 'params': {'alpha': 0,
            'base_score': 0.3500931,
            'boost_from_average': 1,
            'booster': 'gbtree',
            'cache_opt': 1,
            'colsample_bylevel': 1,
            'colsample_bynode': 1,
            'colsample_bytree': 0.824717641,
            'debug_synchronize': 0,
            'device': 'cpu',
            'disable_default_eval_metric': 0,
            'eta': 0.0710294247,
            'eval_metric': ['auc'],
            'fail_on_invalid_gpu_id': 0,
            'gamma': 0.302559406,
            'grow_policy': 'depthwise',
            'interaction_constraints': '',
            'lambda': 1,
            'learning_rate': 0.0710294247,
            'max_bin': 256,
            'max_cached_hist_node': 65536,
            'max_cat_threshold': 64,
            'max_cat_to_onehot': 4,
            'max_delta_step': 0,
            'max_depth': 9,
           

## Using the Model Class
This section shows how to train a model, generate predictions, and evaluate the results, using the `xgb_model` created from hyperparameters in the previous section.


### Step 1: Training the Model
Generate Training and Validation Data:

Prepare the data for training and validation. The following example generates synthetic data for demonstration purposes:


In [6]:
import numpy as np

np.random.seed(0)
X_train = np.random.randn(1000, 10)
y_train = np.random.randint(0, 2, 1000)
X_val = np.random.randn(1000, 10)
y_val = np.random.randint(0, 2, 1000)


When training a model, you can pass additional `training_parameters`; if none are given, the model uses its initialization parameters. You can also choose whether to balance the training classes.

If a validation set is provided, the model uses it for validation and then outputs the evaluation results on that set.


In [7]:
training_params = {
    'eval_metric': 'logloss',
    'eta': 0.1,
    'max_depth': 6
}
xgb_model.train(X_train, y_train, X_val, y_val, training_params, balance_train_classes=True)


Evaluation Results:
logloss: 16.82
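
This guide does not describe how `balance_train_classes` works internally. As a rough illustration of what class balancing usually means, the sketch below assigns each sample a weight inversely proportional to its class frequency, so that every class contributes the same total weight. The function `balanced_sample_weights` is hypothetical and is not part of MED3pa.

```python
from collections import Counter
from typing import Dict, List, Sequence


def balanced_sample_weights(labels: Sequence[int]) -> List[float]:
    """Return one weight per sample so that each class has equal total weight.

    Assumed heuristic: weight = n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    class_weight: Dict[int, float] = {
        cls: n_samples / (n_classes * count) for cls, count in counts.items()
    }
    return [class_weight[label] for label in labels]


# Three negatives and one positive: the single positive gets a larger weight.
labels = [0, 0, 0, 1]
weights = balanced_sample_weights(labels)
print(weights)
```

In this toy example the three negative samples together carry the same total weight as the single positive sample.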


### Step 2: Predicting Using the Trained Model
Model Prediction:

Once the model is trained, use it to predict labels or probabilities on a new dataset. This step predicts binary labels for the test data. The `return_proba` parameter controls whether `predict` returns the predicted probabilities or the predicted labels; the labels are derived from the probabilities using the `threshold`:


In [8]:
X_test = np.random.randn(1000, 10)
y_test = np.random.randint(0, 2, 1000)
y_pred = xgb_model.predict(X_test, return_proba=False, threshold=0.5)
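
The relationship between probabilities, `threshold`, and labels can be sketched in a few lines. This is an assumption about the general technique, not MED3pa's code: the helper `probabilities_to_labels` is hypothetical, and whether a probability exactly equal to the threshold maps to 1 or 0 is a choice made here for illustration.

```python
from typing import List, Sequence


def probabilities_to_labels(probabilities: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map each probability to 1 if it reaches the threshold, otherwise 0.

    Assumption: ties (p == threshold) are treated as the positive class.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


example_probabilities = [0.1, 0.5, 0.9]
default_labels = probabilities_to_labels(example_probabilities)         # threshold 0.5
strict_labels = probabilities_to_labels(example_probabilities, 0.7)     # higher threshold
print(default_labels, strict_labels)
```

Raising the threshold makes positive predictions rarer, which typically trades recall for precision.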


### Step 3: Evaluating the Model
Evaluate the model's performance using various metrics to understand its effectiveness in making predictions. The supported metrics include Accuracy, AUC, Precision, Recall, and F1 Score, among others. The `evaluate` method will handle the model predictions and then evaluate the model based on these predictions. You only need to specify the test data.

To retrieve the list of supported `classification_metrics`, you can use `ClassificationEvaluationMetrics.supported_metrics()`:



In [9]:
from MED3pa.models import ClassificationEvaluationMetrics

# Display supported metrics
print("Supported evaluation metrics:", ClassificationEvaluationMetrics.supported_metrics())

# Evaluate the model
evaluation_results = xgb_model.evaluate(X_test, y_test, eval_metrics=['Auc', 'Accuracy'], print_results=True)


Supported evaluation metrics: ['Accuracy', 'BalancedAccuracy', 'Precision', 'Recall', 'F1Score', 'Specificity', 'Sensitivity', 'Auc', 'LogLoss', 'Auprc', 'NPV', 'PPV', 'MCC']
Evaluation Results:
Auc: 0.51
Accuracy: 0.50
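
The near-chance scores above are expected, because the synthetic labels are random and carry no relationship to the features. To show what some of the listed metrics measure, here is a small sketch that computes accuracy, precision, recall, and F1 from confusion counts in plain Python. The helper names are hypothetical and this is not how `ClassificationEvaluationMetrics` is implemented.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


true_labels = [1, 0, 1, 1, 0, 0]
predicted_labels = [1, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(true_labels, predicted_labels)
accuracy = safe_ratio(tp + tn, tp + fp + tn + fn)
precision = safe_ratio(tp, tp + fp)
recall = safe_ratio(tp, tp + fn)
f1_score = safe_ratio(2 * precision * recall, precision + recall)
print(accuracy, precision, recall, f1_score)
```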


### Step 4: Retrieving Model Information
The `get_info` method provides detailed information about the model, including its type, parameters, data preparation strategy, and whether it's a pickled model. This is useful for understanding the configuration and state of the model:


In [10]:
model_info = xgb_model.get_info()
pprint(model_info)


{'data_preparation_strategy': 'ToDmatrixStrategy',
 'model': 'XGBoostModel',
 'model_type': 'Booster',
 'params': {'colsample_bytree': 0.8,
            'device': 'cpu',
            'eta': 0.1,
            'eval_metric': 'logloss',
            'max_depth': 6,
            'min_child_weight': 1,
            'nthread': 4,
            'objective': 'binary:logistic',
            'subsample': 0.8,
            'tree_method': 'hist'},
 'pickled_model': False}
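
Note that `eval_metric` is now `logloss` rather than the `auc` value used at creation, which is consistent with the training parameters taking precedence over the initialization parameters. A minimal sketch of such a merge, offered as an assumption about the behavior rather than MED3pa's code (`merge_params` is a hypothetical helper):

```python
from typing import Any, Dict, Optional


def merge_params(init: Dict[str, Any], training: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a new dict where training parameters override initialization ones."""
    merged = dict(init)  # copy so the original initialization parameters stay intact
    if training:
        merged.update(training)
    return merged


init_params = {"objective": "binary:logistic", "eval_metric": "auc", "eta": 0.1, "max_depth": 6}
override_params = {"eval_metric": "logloss", "eta": 0.1, "max_depth": 6}

merged_params = merge_params(init_params, override_params)
print(merged_params)
```

Keys that appear only in the initialization parameters, such as `objective`, survive the merge unchanged.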


### Step 5: Saving Model Information
You can save the model with the `save` method, which writes the underlying model instance to a pickled file and the model's information to a `.json` file:


In [11]:
xgb_model.save("./models/saved_model")
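
For readers unfamiliar with this two-file layout, the sketch below shows the general idea using only the standard library: the object goes into a pickle file and its descriptive information into a JSON file, and both are read back afterwards. The file names `model_instance.pkl` and `model_info.json`, the helper functions, and the stand-in objects are all hypothetical; the actual files written by `save` may differ.

```python
import json
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, Tuple


def save_sketch(model_obj: Any, info: Dict[str, Any], directory: Path) -> None:
    """Write the object as a pickle and its description as JSON (file names assumed)."""
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / "model_instance.pkl", "wb") as f:
        pickle.dump(model_obj, f)
    with open(directory / "model_info.json", "w", encoding="utf-8") as f:
        json.dump(info, f, indent=2)


def load_sketch(directory: Path) -> Tuple[Any, Dict[str, Any]]:
    """Read back both files written by save_sketch."""
    with open(directory / "model_instance.pkl", "rb") as f:
        model_obj = pickle.load(f)  # only unpickle files from a trusted source
    with open(directory / "model_info.json", "r", encoding="utf-8") as f:
        info = json.load(f)
    return model_obj, info


# Stand-ins for a trained model and its get_info() output.
stand_in_model = {"weights": [0.1, 0.2, 0.3]}
stand_in_info = {"model": "SketchModel", "params": {"max_depth": 6}, "pickled_model": False}

with tempfile.TemporaryDirectory() as tmp_dir:
    target_dir = Path(tmp_dir) / "saved_model"
    save_sketch(stand_in_model, stand_in_info, target_dir)
    restored_model, restored_info = load_sketch(target_dir)

print(restored_model, restored_info)
```

Keeping the information in JSON makes the configuration readable without unpickling anything, which matters because unpickling executes code and should only be done with trusted files.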