# Modeling tutorial

In this notebook we'll walk through the main functions of the `modeling` package. For this tutorial, we'll use a preprocessed subset of a Kaggle dataset you can find [here](https://www.kaggle.com/datasets/edumagalhaes/quality-prediction-in-a-mining-process). The goal is to predict the output silica concentration of a [flotation](https://en.wikipedia.org/wiki/Froth_flotation) process at an iron mine.

Though we use a tag dictionary in this example, most functions can be used without one. We'll highlight these features below.

Of course, this notebook assumes you have a cleaned dataset.

Currently, the model performance reporting in `modeling` only works for regression problems.

## Setup

First, we'll read in the data and a simple tag dictionary.

In [1]:
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout)

In [2]:
# Resolve path when used in a usecase project
from pathlib import Path

sys.path.insert(0, str(Path("../../").resolve()))

In [3]:
from modeling.datasets import get_sample_model_input_data, get_sample_tag_dict

df = get_sample_model_input_data()
tag_dict = get_sample_tag_dict()

INFO:numexpr.utils:NumExpr defaulting to 8 threads.


In [4]:
df.describe().round(2)

Unnamed: 0,air_flow01,air_flow02,air_flow03,air_flow04,air_flow05,air_flow06,air_flow07,amina_flow,column_level01,column_level02,...,ore_pulp_flow,ore_pulp_ph,silica_conc,silica_feed,starch_flow,iron_minus_silica,feed_diff_divide_silica,total_column_level,total_air_flow,silica_conc_lagged
count,1368.0,1368.0,1368.0,1368.0,1368.0,1368.0,1368.0,1473.0,1368.0,1368.0,...,1368.0,1368.0,1368.0,1368.0,1473.0,1368.0,1368.0,1473.0,1473.0,1367.0
mean,280.4,277.26,281.19,298.99,299.02,286.84,287.05,480.04,518.9,522.32,...,397.35,9.77,12.81,14.64,2947.89,41.66,4.1,3033.23,1867.43,2.32
std,28.0,28.43,27.6,1.8,2.21,22.42,22.08,77.78,114.37,110.31,...,7.4,0.34,3.24,6.62,708.08,11.59,3.19,967.32,530.16,1.01
min,200.0,200.0,200.0,293.46,287.21,200.0,200.0,300.0,303.84,228.22,...,378.32,9.0,8.15,2.0,2000.0,15.0,0.5,0.0,0.0,1.0
25%,250.09,250.09,250.08,299.7,299.75,273.89,279.21,422.58,428.3,449.56,...,399.28,9.55,10.3,9.18,2296.32,32.97,1.69,2771.38,1846.63,1.55
50%,299.85,299.54,299.9,299.91,299.89,299.83,299.81,491.99,499.89,500.0,...,399.91,9.8,12.25,14.18,2925.19,41.91,2.95,3124.3,2084.03,2.05
75%,299.92,299.86,299.93,299.95,299.97,299.94,299.94,540.93,599.8,599.38,...,400.34,10.01,14.5,19.58,3442.27,50.5,5.48,3548.3,2099.05,2.91
max,300.0,300.0,300.0,300.0,300.0,300.0,300.0,698.68,800.0,800.0,...,410.0,10.77,27.77,30.0,5943.95,63.0,31.5,5094.78,2099.91,5.0


In our tag dictionary, we're interested in three columns:

- The tag: the literal column name in our dataset.
- The tag type: what the tag represents in the process. Here we're only concerned with `"control"`, `"input"`, and `"output"` tags.
- The feature indicator column: this boolean column is `True` for tags that are features in the model we'd like to build. Note that we can have more than one of these columns in the tag dictionary.

This last column is named `td_features_column` here and in all `modeling` function arguments. See more below.

<div class="alert alert-info">
<b>Note</b>

Remember, we are modeling dependencies between physical properties of the process and the target, so each feature name should represent a clear, human-readable physical property, not a copy-pasted tag identifier.

* An example of a good feature name is `pump_pressure_kPa | H105`. It represents the **human-readable physical property of the process**, the unit of measurement, and a reference to the original tag this physical property is calculated from. If the preprocessing recipe or formula for this specific feature changes, a new feature called `pump_pressure_kPa | H105` will be created, while `pump_pressure_kPa | H105` will still exist and use the old recipe or the old formula.

* An example of a bad feature name is `ZO.RHONH955.H105.SP` because it is not human-readable, and it is hard to tie the tag identifier to its physical meaning. Additionally, full tag identifiers are never used by SMEs in day-to-day communications.

</div>

In [5]:
td_features_column = "silica_model_features"

tag_dict.to_frame()[["tag", "tag_type", td_features_column]]

Unnamed: 0,tag,tag_type,silica_model_features
0,iron_feed,input,Y
1,silica_feed,input,Y
2,starch_flow,control,Y
3,amina_flow,control,Y
4,ore_pulp_flow,control,Y
5,ore_pulp_ph,control,Y
6,ore_pulp_density,control,Y
7,air_flow01,input,
8,air_flow02,input,
9,air_flow03,input,


Instead of passing `td` and `td_features_column` to the functions below, you can pass a list of feature names.

In [6]:
model_features = tag_dict.select(td_features_column)

The first function we can use from `modeling` is `drop_nan_rows`. This function simply drops rows that contain NaN values in the features or the target that we're including in our model.

In [7]:
target_column = "silica_conc"
datetime_column = "timestamp"

In [8]:
from modeling.utils import drop_nan_rows

df_dropna = drop_nan_rows(
    df,
    td=tag_dict,
    td_features_column=td_features_column,
    target_column=target_column,
)

INFO:modeling.utils:Dropping 105 rows with NaN values. Original sample size was 1473 and is now 1368.
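For intuition, the effect of this step resembles a plain pandas `dropna` restricted to the feature and target columns. The snippet below is a minimal sketch on made-up toy data, not the implementation of `drop_nan_rows`; the `toy` frame and its values are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the model input table (values are made up).
toy = pd.DataFrame(
    {
        "silica_feed": [14.2, np.nan, 9.1, 12.0],
        "starch_flow": [2900.0, 3100.0, 2500.0, 2700.0],
        "air_flow01": [np.nan, 250.0, 300.0, 299.0],  # not a model feature
        "silica_conc": [2.1, 1.8, np.nan, 3.0],
    }
)
features = ["silica_feed", "starch_flow"]
target = "silica_conc"

# Only NaNs in the feature and target columns cause a row to be dropped;
# the NaN in `air_flow01` is ignored because it is not in `subset`.
toy_dropna = toy.dropna(subset=[*features, target])
print(f"Dropped {len(toy) - len(toy_dropna)} rows, {len(toy_dropna)} remain.")
```

Rows with missing values only in columns outside the model (here `air_flow01`) are kept, which is why restricting the check to features and target preserves more data than a blanket `dropna()`.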


## Model training

<div class="alert alert-info">
<b>Note</b>
    
> The idea of modeling in OptimusAI is to **learn joint dependencies** between features and target as accurately as possible, meaning the models we build are **descriptive**, not predictive.
    
</div>

In this tutorial we will build a [`SklearnPipeline`](../../../../../../docs/build/apidoc/modeling/modeling.models.sklearn_pipeline.html#modeling.models.sklearn_pipeline.model.SklearnPipeline), which is a wrapper for `sklearn.pipeline.Pipeline`. However, you are welcome to use [any other model from the list](../../../../../../docs/build/apidoc/modeling/modeling.models.sklearn_model.html) or [implement your own modeling logic](./model_base.ipynb#Create-a-custom-inheritor-of-ModelBase).

The modeling procedure in the `modeling` package consists of 5 major steps:

1. initialize a `ModelFactory` using the `create_model_factory` function
2. produce a model with the `ModelFactory` using the `create_model` function
3. split the data into train and test datasets using the `create_splitter` and `split_data` functions
4. tune hyperparameters using the `tune_model` function
5. train the model using the `train_model` function

Modeling logic is encapsulated in the model classes; the functions we'll use in this section are mostly for pipelining purposes.

Please see

- [tutorial](./model_base.ipynb) on ModelBase to learn what that class is, what methods it has, and how components interact with each other.
- [tutorial](./splitter_base.ipynb) on SplitterBase to understand what the splitter subpackage offers.

After executing these steps, **you'll get a trained model instance and the other entities needed to run the model performance report**.

### `create_model_factory_from_tag_dict`

First, we need to initialize a model factory. `create_model_factory` simply creates a factory object using the `init_config`. Learn more about `init_config` structure in 

`ModelFactory` creates models based on the configuration representation. `ModelFactory` instances are also required for tuning model hyperparameters. Learn about the `model_init_config` structure for each of the builder classes in the `ModelFactory` [tutorial](./model_base.ipynb) as well.

<div class="alert alert-info">
<b>Note</b>
    
Each of the `ModelFactory` classes requires its own initialization config structure, which is described in the API section.
 
</div>

In [9]:
from modeling import create_model_factory_from_tag_dict

sklearn_pipeline_init_config = {
    'estimator': {
        'class_name': 'sklearn.linear_model.SGDRegressor',
        'kwargs': {
            'penalty': 'elasticnet', 
            'random_state': 123,
        }
    },
    'transformers': [
        {
            'class_name': 'sklearn.preprocessing.StandardScaler',
            'kwargs': {},
            'name': 'standard_scaler',
            'wrapper': 'preserve_columns',
        }
    ]
}

sklearn_pipeline_factory = create_model_factory_from_tag_dict(
    # Class type can be passed as well. See function API.
    "modeling.SklearnPipelineFactory",
    sklearn_pipeline_init_config,
    tag_dict,
    td_features_column,
    target_column,
)

If you don't use a `TagDict`, you can call `modeling.create_model_factory`, which does the same job but requires the features to be provided manually.

### `create_model`

Then, we'll create a model. This function calls the factory's `.create()` method.

In [10]:
from modeling import create_model

sklearn_pipeline = create_model(sklearn_pipeline_factory)
sklearn_pipeline

SklearnPipeline(estimator=Pipeline(steps=[('standard_scaler',
                 SklearnTransform(transformer=StandardScaler())),
                ('estimator',
                 SGDRegressor(penalty='elasticnet', random_state=123))]), target="silica_conc" ,features_in=['iron_feed', 'silica_feed', 'starch_flow', 'amina_flow', 'ore_pulp_flow', 'ore_pulp_ph', 'ore_pulp_density', 'total_air_flow', 'total_column_level', 'feed_diff_divide_silica'], features_out=None)
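Judging by the representation printed above, the init config describes roughly the following plain `scikit-learn` pipeline. This is a sketch for orientation only: it does not reproduce what `SklearnPipeline` adds on top (the `preserve_columns` wrapper and the feature and target bookkeeping), and the variable name `plain_pipeline` is an assumption.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names follow the printed representation: the transformer keeps the
# `name` from the config, and the final step is called "estimator".
plain_pipeline = Pipeline(
    steps=[
        ("standard_scaler", StandardScaler()),
        ("estimator", SGDRegressor(penalty="elasticnet", random_state=123)),
    ]
)

# Nested parameters are addressed as "<step name>__<parameter>".
print(plain_pipeline.get_params()["estimator__penalty"])
```

The `<step name>__<parameter>` convention is the same one that appears later in the tuner's `param_grid` keys, such as `estimator__alpha`.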

### `split_data`

The `modeling` module contains a few useful classes for splitting data into train and test datasets. We've demonstrated how to use those in the `splitters` [tutorial](./splitter_base.ipynb).

Each of the classes has the same API defined by `SplitterBase` ([API](../../../../../../docs/build/apidoc/modeling/modeling.splitters.html#modeling.splitters.base_splitter.SplitterBase)). Call the `.split` method on the data to split it into train and test datasets.

We'll create a splitter instance using the `create_splitter` function and then split the data into train and test sets with `split_data`.

In [11]:
from modeling import create_splitter, split_data

split_datetime = "2017-08-30 23:00:00"

splitter = create_splitter(
    "date", 
    splitting_parameters={
        "datetime_column": datetime_column,
        "split_datetime": split_datetime,
    },
)
train_data, test_data = split_data(df_dropna, splitter)

INFO:modeling.splitters._splitters.base_splitter:Length of data before splitting is 1368
INFO:modeling.splitters._splitters.by_date_splitter:Splitting by datetime: 2017-08-30 23:00:00
INFO:modeling.splitters._splitters.base_splitter:Length of the train data after splitting is 1287, length of the test data after splitting is 81.
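Conceptually, a split by date puts rows before the split datetime into the train set and the remaining rows into the test set. The sketch below shows this with plain pandas on made-up data. It is not the splitter's implementation, and sending the row that falls exactly on the split datetime to the test set is an assumption, based on the first test row shown later in this notebook having that exact timestamp.

```python
import pandas as pd

# Toy time series with a 3-hour step (values are made up).
toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2017-08-30 17:00:00",
                "2017-08-30 20:00:00",
                "2017-08-30 23:00:00",
                "2017-08-31 02:00:00",
                "2017-08-31 05:00:00",
            ]
        ),
        "silica_conc": [2.1, 1.8, 2.4, 3.0, 2.7],
    }
)
split_at = pd.Timestamp("2017-08-30 23:00:00")

# Rows strictly before the split go to train; the rest go to test.
train_toy = toy[toy["timestamp"] < split_at]
test_toy = toy[toy["timestamp"] >= split_at]
print(len(train_toy), len(test_toy))
```

Splitting on time rather than at random keeps the test set strictly later than the train set, which mimics how the model will see data in production.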


### `tune_model`

The next step is to tune the model hyperparameters. This is an optional step that produces a `ModelBase` instance with its hyperparameters tuned.

As usual, we'll use function calls: `create_tuner` initializes a `ModelTuner` and `tune_model` calls its `.tune()` method.

`ModelTuner` tunes models based on the configuration specification. Learn about the `model_tuner_config` structure for each of the tuner classes in the `ModelBase` [tutorial](./model_base.ipynb) as well.


<div class="alert alert-info">
<b>Note</b>
    
Each of the `ModelTuner` classes requires its own initialization config structure, which is described in the API section.

</div>

In [12]:
sklearn_pipeline_tuner_config = {
    'class_name': 'sklearn.model_selection.GridSearchCV',
    'kwargs': {
        'n_jobs': -1,
        'refit': 'mae',
        'param_grid': {
            'estimator__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10],
            'estimator__l1_ratio': [1e-05, 0.0001, 0.001, 0.01, 0.1, 1]
        },
        'scoring': {
            'mae': 'neg_mean_absolute_error',
            'rmse': 'neg_root_mean_squared_error',
            'r2': 'r2',
        }
    }
}

In [13]:
from modeling import create_tuner, tune_model

In [14]:
model_tuner = create_tuner(
    sklearn_pipeline_factory,
    model_tuner_type="modeling.SklearnPipelineTuner",
    tuner_config=sklearn_pipeline_tuner_config,
)
sklearn_pipeline = tune_model(
    model_tuner=model_tuner, 
    data=train_data,
    hyperparameters_config=None,
)

INFO:modeling.models.sklearn_pipeline.model:`features_out` attribute is not specified. Setting `features_out` based on factual data.
INFO:modeling.models.sklearn_pipeline.tuner:Initializing sklearn hyperparameters tuner...
INFO:modeling.models.sklearn_pipeline.tuner:Tuning hyperparameters...
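The tuner config above mirrors the arguments of `sklearn.model_selection.GridSearchCV`: a `param_grid` keyed by `<step name>__<parameter>`, a dictionary of scorers, and `refit` naming the scorer used to pick the final model. The sketch below runs the same kind of search directly in `scikit-learn` on synthetic data, with a smaller grid to keep it fast. It illustrates the underlying search only and is not the `SklearnPipelineTuner` implementation; the synthetic data and variable names are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the train set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipeline = Pipeline(
    steps=[
        ("standard_scaler", StandardScaler()),
        ("estimator", SGDRegressor(penalty="elasticnet", random_state=123)),
    ]
)
search = GridSearchCV(
    pipeline,
    param_grid={
        "estimator__alpha": [0.0001, 0.01, 1],
        "estimator__l1_ratio": [0.001, 0.1, 1],
    },
    # With several scorers, `refit` must name the one that selects the winner.
    scoring={"mae": "neg_mean_absolute_error", "r2": "r2"},
    refit="mae",
)
search.fit(X, y)
print(search.best_params_)
```

The grid in the notebook has 6 x 6 = 36 candidate combinations, each evaluated on every cross-validation fold, so tuning time grows quickly as values are added to the grid.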


### `train_model`

Finally, the model is trained. This function simply calls the `.fit()` method of `ModelBase`. For training and testing we'll use the `train_data` and `test_data` that we split above.

<div class="alert alert-info">
<b>Note</b>
    
We only use the train data for the final model training step. The test data is not concatenated with it and is not used in any way to train the final version of the model. Instead, the test data will be used to produce test metrics in the next section.

**This approach is used by default to ensure that the model is validated and its expected behavior is captured on unseen, and likely the latest available, data before the model is used for optimization.**

</div>

In [15]:
from modeling import train_model

sklearn_pipeline = train_model(sklearn_pipeline, train_data)

## Evaluating model performance

These steps use the trained model to extract predictions, metrics, and feature importance from the provided data.

<div class="alert alert-info">
<b>Note</b>
    
The steps below are mostly needed for extracting predictions and metrics into DataFrames to store as artifacts of the training procedure. We expect the functions below to be used in pipelining tools, e.g. Kedro.

</div>

The updated model performance report in the `reporting` package does not use the DataFrames produced by these functions, unlike the previous version of the model performance report.

### `calculate_model_predictions`

The first datasets needed by the model report are the train and test sets with model predictions appended.

In [16]:
from modeling import calculate_model_predictions


train_predictions = calculate_model_predictions(
    train_data, sklearn_pipeline,
)
test_predictions = calculate_model_predictions(
    test_data, sklearn_pipeline,
)

train_predictions.head()

Unnamed: 0,model_prediction
0,12.742889
1,13.378801
2,12.819613
3,12.468984
4,11.458713


### `calculate_metrics`

Next, we'll need the metrics dataset provided by the `calculate_metrics` function.

We can either use the predictions we just created or let the function use the model to create the predictions again. The call below passes the data together with the model via the `model` argument.

In [17]:
from modeling import calculate_metrics

train_metrics = calculate_metrics(
    train_data, model=sklearn_pipeline,
)
test_metrics = calculate_metrics(
    test_data, model=sklearn_pipeline,
)

test_metrics

{'mae': 1.9282782055363596,
 'rmse': 2.4244794086208996,
 'mse': 5.878100402826748,
 'mape': 0.14826119319371578,
 'r_squared': 0.4873476180975649,
 'var_score': 0.4873609075368811}
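To make the metric names concrete, the sketch below computes the same six quantities with NumPy on a tiny made-up example. The exact definitions used inside `calculate_metrics` are not shown in this notebook, so treat these formulas as assumptions: in particular, `mape` is taken as a fraction rather than a percentage, and `var_score` is taken to be the explained variance score.

```python
import numpy as np

# Made-up actuals and predictions.
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 11.0, 15.0, 15.0])
errors = actual - predicted

mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors**2)  # mean squared error
rmse = np.sqrt(mse)  # root mean squared error, same units as the target
mape = np.mean(np.abs(errors / actual))  # mean absolute percentage error (fraction)
# R^2: share of the target's variation around its mean explained by the model.
r_squared = 1 - np.sum(errors**2) / np.sum((actual - actual.mean()) ** 2)
# Explained variance: like R^2, but it ignores a constant bias in the errors.
var_score = 1 - np.var(errors) / np.var(actual)

print({"mae": mae, "rmse": rmse, "mse": mse, "mape": mape,
       "r_squared": r_squared, "var_score": var_score})
```

Under these definitions, `r_squared` and `var_score` coincide when the errors average to zero and diverge slightly otherwise, which is consistent with the two nearly identical values in the test metrics above.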

### `calculate_model_prediction_bounds`

We can use the above metrics to create approximate confidence intervals for the predictions. These intervals can be useful for monitoring live model performance.

In practice, we would apply the model metrics from the test set to new data to calculate the lower and upper bounds. Below we show an example using the test metrics and the test data set.

In [18]:
from modeling import calculate_model_prediction_bounds

prediction_bounds = calculate_model_prediction_bounds(
    data=test_data,
    model=sklearn_pipeline,
    model_metrics=test_metrics,
    error_metric="rmse",
    error_multiplier=1.96,
)
prediction_bounds.head()

Unnamed: 0,timestamp,actuals,predictions,lower_bound,upper_bound
1392,2017-08-30 23:00:00,15.023342,12.470228,7.718249,17.222208
1393,2017-08-31 02:00:00,14.987169,13.19436,8.44238,17.946339
1394,2017-08-31 05:00:00,14.170544,14.128286,9.376306,18.880265
1395,2017-08-31 08:00:00,10.17083,11.654594,6.902615,16.406574
1396,2017-08-31 11:00:00,11.712113,12.100266,7.348287,16.852246
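The bounds in the table are consistent with a symmetric interval of `error_multiplier` times the chosen error metric around each prediction. The sketch below reproduces the first row by hand, using the test RMSE and the first prediction printed above; the formula is inferred from those numbers rather than taken from the package source.

```python
rmse = 2.4244794086208996  # test RMSE reported by `calculate_metrics` above
error_multiplier = 1.96
prediction = 12.470228  # first test prediction in the table above

# Symmetric interval: prediction +/- multiplier * error metric.
half_width = error_multiplier * rmse
lower_bound = prediction - half_width
upper_bound = prediction + half_width
print(round(lower_bound, 4), round(upper_bound, 4))
```

A multiplier of 1.96 corresponds to a 95% interval only if the prediction errors are roughly normal and centered on zero, which is why the text calls these intervals approximate.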


### `cross_validate`

In addition to studying model performance on a single train-test split, it is good practice to _cross validate_ its performance on multiple splits.

`modeling` provides a `cross_validate()` function with an intuitive API for doing that.

In [19]:
from modeling import cross_validate

cv_strategy_config = {
    "class_name": "sklearn.model_selection.TimeSeriesSplit",
    "kwargs": {
        "n_splits": 3,
    },
}

cross_validation_metrics = cross_validate(
    model=sklearn_pipeline,
    data=train_data,
    cv_strategy_config=cv_strategy_config,
)
cross_validation_metrics

INFO:modeling.models._cross_validation:Cross-validating using: TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None)


Unnamed: 0_level_0,mae,mae,mape,mape,mse,mse,r_squared,r_squared,rmse,rmse,var_score,var_score
Unnamed: 0_level_1,train,test,train,test,train,test,train,test,train,test,train,test
Fold,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
0,1.69,2.95,0.12,0.27,4.76,11.61,0.31,-0.1,2.18,3.41,0.31,0.33
1,1.76,2.65,0.13,0.23,5.22,9.89,0.51,0.14,2.28,3.14,0.51,0.18
2,1.96,2.49,0.16,0.22,6.34,8.72,0.44,-0.25,2.52,2.95,0.44,-0.11


Via the `cv_strategy_config` argument of this function, you can use a cross-validator of your choice (e.g. `ShuffleSplit`) and customize its parameters. The general recommendations are:

* Explore model performance with **multiple splitting strategies** provided in `sklearn.model_selection`. `TimeSeriesSplit` and `ShuffleSplit` are a must in most cases.
* Check **not only the average values** of metrics across folds, but also variation from fold to fold and difference between train and test values.
* If you are using a **`Tuner`** that involves cross-validation, **re-use its CV strategy** for model performance as well.

In the example above, we are using a `TimeSeriesSplit` strategy which is usually a good proxy of what accuracy to expect in production.
[Here in the `sklearn` documentation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#visualize-cross-validation-indices-for-many-cv-objects) you can find visual examples of this and other available strategies.
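To see what `TimeSeriesSplit` does with `n_splits=3`, the short sketch below prints the fold indices for a toy array of 12 ordered samples. It uses `scikit-learn` directly and is independent of the `modeling` package; the array and variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve samples in time order; only the row positions matter here.
X = np.arange(12).reshape(-1, 1)

folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X)
]
for fold, (train_idx, test_idx) in enumerate(folds):
    print(fold, train_idx, test_idx)
```

Each fold trains on an expanding window of past samples and tests on the block that immediately follows, so earlier folds have less training data. This is one reason metrics can vary noticeably from fold to fold.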

Additionally, you may be interested in exploring the specific data slices behind each fold to see why the metrics differ. You can do this by passing `return_splits=True` (the default is `False`):

In [20]:
cross_validation_metrics, cv_splits = cross_validate(
    model=sklearn_pipeline,
    data=train_data,
    cv_strategy_config=cv_strategy_config,
    return_splits=True,
)

# This is a mapping from integer fold indices to dictionaries with train and test data.
cv_splits[0]["test_data"].head()

INFO:modeling.models._cross_validation:Cross-validating using: TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None)


Unnamed: 0,timestamp,air_flow01,air_flow02,air_flow03,air_flow04,air_flow05,air_flow06,air_flow07,amina_flow,column_level01,...,ore_pulp_flow,ore_pulp_ph,silica_conc,silica_feed,starch_flow,iron_minus_silica,feed_diff_divide_silica,total_column_level,total_air_flow,silica_conc_lagged
429,2017-05-02 14:00:00,299.923863,299.815867,299.991593,299.967524,299.936769,299.785333,299.987863,531.154194,462.590172,...,399.186259,9.395195,16.946918,20.836667,2000.0,32.216667,1.546153,2879.716996,2099.408811,2.426667
430,2017-05-02 17:00:00,299.15868,299.772339,299.771094,299.90325,299.938078,299.795719,299.968526,494.676863,451.185796,...,400.169934,9.320372,15.676978,19.99,2000.0,32.93,1.647324,2759.806224,2098.307685,3.473333
431,2017-05-02 20:00:00,299.740215,299.944333,299.928565,299.928594,299.912035,299.640856,299.900798,516.819924,450.692141,...,400.296176,9.422287,13.138355,19.99,2110.479007,32.93,1.647324,2913.800158,2098.995396,3.688333
432,2017-05-02 23:00:00,299.970343,299.699035,299.970652,299.917854,299.991417,299.726178,299.861983,507.627161,450.277143,...,399.477817,9.502582,13.928617,15.966667,2000.0,39.416667,2.468685,2928.556098,2099.137461,1.913333
433,2017-05-03 02:00:00,299.935894,299.909102,299.968813,299.89732,299.880098,299.984594,299.710052,553.140015,440.183637,...,400.489189,9.390444,16.486466,7.92,2000.0,52.39,6.614899,2893.950035,2099.285874,2.19


### `calculate_feature_importance`

Then, we'll get the feature importances. The function below simply calls the `.get_feature_importance` method of the model and stores the results in a DataFrame.

In the example below we're extracting feature importance from `SklearnPipeline`. The default implementation tries to access the `feature_importances_` attribute of the estimator; if it does not exist, `sklearn.inspection.permutation_importance` [will be used instead](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance).

In [21]:
from modeling import calculate_feature_importance

importances = calculate_feature_importance(train_data, sklearn_pipeline)
importances

INFO:modeling.models.sklearn_pipeline.model:Estimator of type <class 'sklearn.linear_model._stochastic_gradient.SGDRegressor'> does not have `feature_importances_` using sklearn.inspection.permutation_importances instead.


Unnamed: 0_level_0,feature_importance
feature_name,Unnamed: 1_level_1
ore_pulp_density,0.318981
total_air_flow,0.156597
feed_diff_divide_silica,0.055192
starch_flow,0.027987
ore_pulp_ph,0.023104
total_column_level,0.011798
amina_flow,0.003491
silica_feed,0.000695
iron_feed,-0.004983
ore_pulp_flow,-0.016836
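Permutation importance measures how much a model's score drops when one feature's values are shuffled. The sketch below calls `sklearn.inspection.permutation_importance` directly on synthetic data with a plain `LinearRegression`; it is meant to help read the table above and is not how `SklearnPipeline` invokes it internally. The data, the model, and the feature labels are assumptions.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only the first two columns drive the target; the third is pure noise.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X, y)
# Each feature is shuffled `n_repeats` times; the mean score drop is reported.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, score in zip(["strong", "weak", "noise"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

Values near zero, including slightly negative ones such as those for `iron_feed` and `ore_pulp_flow` above, indicate that shuffling the feature did not hurt the score, so the model relies on it very little.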


### `calculate_shap_feature_importance`

Finally, we'll extract SHAP feature importance from the train data.
The `ModelBase.get_shap_feature_importance()` method is used, and the result is returned as a DataFrame.

Note that `ModelBase.get_shap_feature_importance()` will calculate SHAP importance for all the features provided in the input dataset.

In [22]:
from modeling import calculate_shap_feature_importance
import numpy as np

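# Draw 200 rows at random to keep the SHAP computation fast.
# Note: `np.random.choice` samples with replacement by default.
samples = train_data.loc[np.random.choice(train_data.index, 200)]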
shap_importances = calculate_shap_feature_importance(samples, sklearn_pipeline)
shap_importances

INFO:modeling.models.sklearn_pipeline.model:`Using model-agnostic` <class 'shap.explainers._exact.ExactExplainer'>` to extract SHAP values... `shap` can't apply model-specific algorithms for <class 'modeling.models.sklearn_pipeline.model.SklearnPipeline'>. Consider switching to `SklearnModel` if computation time or quality don't fit your needs.


Unnamed: 0_level_0,shap_feature_importance
feature_name,Unnamed: 1_level_1
ore_pulp_density,0.889351
total_air_flow,0.803316
feed_diff_divide_silica,0.53858
starch_flow,0.330423
ore_pulp_flow,0.242422
total_column_level,0.23933
ore_pulp_ph,0.228754
iron_feed,0.185321
amina_flow,0.049499
silica_feed,0.007501
