## Workbook Intention

This workbook demonstrates an end-to-end model development workflow, using scalable and efficient coding and infrastructure practices built on our internal Xometry ML Platform capabilities.

**It should be possible to complete this workflow without creating a SageMaker project or any other kind of governed, persistent infrastructure.** In this way, these practices apply to one-off, POC-type development work as well as to work that must result in a deployable model.

The expected steps are as follows:

1. Read data from a database
1. Split data into train/test/validation sets
1. Conduct feature engineering
1. Conduct HPO for multiple algorithms 
1. Train a model based on the best HPO run
1. Conduct inference on new data using the trained model



In [None]:
# Library imports
import boto3
import os
import random
import joblib

import xoml_sagemaker.pipeline_types
import xoml_sagemaker.generate_data
import xoml_sagemaker.processing
import xoml_sagemaker.feature_engineering
from xoml_sagemaker.pipeline_types import StaticHyperparameter

In [None]:
# Replace with your project name.
# This will be prepended to remote job names for easier identification when debugging.
project_name = 'train-updt'

#### Step 1. Read data from database

**GOAL:**

Create a dataset on which to train a model.

**TODO:**
 - support feature store queries
 - expose the `instance_type` parameter on the processing job. Our method of moving data from the source to S3 loads it into memory, so at some point a dataset will be large enough that the instance type needs to be adjusted.

In [None]:
config = xoml_sagemaker.pipeline_types.GenerateData(
    JobName=project_name,
    JobType="Snowflake",
    SQLFile="./1. sql/query.sql",
)
data_res = xoml_sagemaker.generate_data.launch_generate_data_job(config)
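The memory concern in the TODO above can be illustrated with a chunked export. This is a minimal sketch and not how `launch_generate_data_job` works internally: `export_query_to_csv` and `demo_connection` are hypothetical names, and an in-memory SQLite database stands in for the real source so that the example runs anywhere. It assumes only that the database driver follows the Python DB-API (`cursor.fetchmany`).

```python
import csv
import sqlite3


def export_query_to_csv(conn, query, out_path, chunk_size=10_000):
    """Stream query results to a CSV file, holding at most chunk_size rows in memory."""
    cursor = conn.cursor()
    cursor.execute(query)
    header = [col[0] for col in cursor.description]
    rows_written = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while True:
            chunk = cursor.fetchmany(chunk_size)
            if not chunk:
                break
            writer.writerows(chunk)
            rows_written += len(chunk)
    return rows_written


def demo_connection():
    """Build a small in-memory SQLite table standing in for the real source (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parts (id INTEGER, price REAL)")
    conn.executemany(
        "INSERT INTO parts VALUES (?, ?)", [(i, i * 1.5) for i in range(25)]
    )
    conn.commit()
    return conn
```

With this shape, peak memory depends on `chunk_size` rather than on the size of the result set, so the instance type would matter less for large queries.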

#### Step 2. Split data
**GOAL:**

Split the result of Step 1 according to the instructions in `2. split_data/split_data.py`.

**TODO:**

 - demo support for args (e.g., for train/test and train/test/validation)

In [None]:
split_config = xoml_sagemaker.pipeline_types.Preprocess(
    JobType="Managed",
    JobName=project_name,
    Framework="SKLearn",
    FrameworkVersion="0.20.0",
    Code="split_data.py",
    CodeSourceDir="./2. split_data/",
    InputS3Dir=data_res.output_path
)

split_res = xoml_sagemaker.processing.launch_preprocessing_job(split_config)
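The TODO about argument support could look something like the sketch below. It is a dependency-free illustration of what a script such as `split_data.py` might do, not the actual script: the flag names (`--test-fraction`, `--validation-fraction`, `--seed`) and the function `split_rows` are assumptions made for this example.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse split settings; a validation fraction of 0 means a train/test split only."""
    parser = argparse.ArgumentParser(description="Split rows into train/test[/validation].")
    parser.add_argument("--test-fraction", type=float, default=0.2)
    parser.add_argument("--validation-fraction", type=float, default=0.0)
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args(argv)


def split_rows(rows, test_fraction, validation_fraction=0.0, seed=42):
    """Shuffle rows reproducibly and cut them into named splits."""
    if test_fraction < 0 or validation_fraction < 0:
        raise ValueError("fractions must be non-negative")
    if test_fraction + validation_fraction >= 1:
        raise ValueError("test and validation fractions must sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * validation_fraction)
    splits = {
        "test": shuffled[:n_test],
        "train": shuffled[n_test + n_val:],
    }
    # Only emit a validation split when one was requested.
    if validation_fraction > 0:
        splits["validation"] = shuffled[n_test:n_test + n_val]
    return splits
```

Calling `split_rows(rows, 0.2)` yields train and test only, while `split_rows(rows, 0.2, 0.1)` adds a validation split, so one script can cover both cases mentioned in the TODO.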

#### Feature Engineering

**GOAL:**

Conduct feature engineering on each of the outputs of the `split_data` step, as instructed in `3. feature_engineering/ft_eng.py`.

**TODO:**

 - update the example to not take paths for output or python version

In [None]:
feature_engineering_config = xoml_sagemaker.pipeline_types.FeatureEngineering(
    job_name=project_name,
    job_type="Managed",
    # required params for a managed job
    framework="SKLearn",
    framework_version="0.23-1",
    code_source_dir="./3. feature_engineering/",
    code="ft_eng.py",
    python_version="py3",
    # params for input and output
    input_s3_dir="s3://data-science-826190527795-lizleki/train-updt-45846674-pwgb/output/result/", #split_res.output_path,
    model_s3_dir="s3://data-science-826190527795-lizleki/train-updt/",
    transformed_data_s3_dir="s3://data-science-826190527795-lizleki/train-updt",
)
    
ft_eng_res = xoml_sagemaker.feature_engineering.launch_feature_engineering_job(feature_engineering_config)
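Because the job takes both a `model_s3_dir` and a `transformed_data_s3_dir`, one plausible reading is that `ft_eng.py` fits a transformer on the training split and then applies it to every split. The sketch below illustrates that pattern with a simple standardization; it is an assumption about the script's shape, and `fit_scaler` and `apply_scaler` are hypothetical names.

```python
import statistics


def fit_scaler(train_values):
    """Learn standardization parameters from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    # Guard against a constant column, which would otherwise divide by zero.
    return {"mean": mean, "std": std if std > 0 else 1.0}


def apply_scaler(scaler, values):
    """Apply previously fitted parameters to any split."""
    return [(v - scaler["mean"]) / scaler["std"] for v in values]


# Toy data standing in for the outputs of the split step.
splits = {"train": [1.0, 2.0, 3.0], "test": [2.0, 4.0], "validation": [0.0]}

scaler = fit_scaler(splits["train"])
transformed = {name: apply_scaler(scaler, values) for name, values in splits.items()}
```

Fitting on the training split alone keeps test and validation statistics out of the transformer. In a real job, the fitted object would presumably be persisted (for example with `joblib`) to the model location and the transformed splits written to the data location.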

#### Hyper Parameter Optimization & Experiment Tracking

At this stage, we want to run an HPO job to build many models and compare their performance on an objective metric. We also want to track each of those runs in our MLflow server.

functional expectations
 - the HPO job can compare results across one or more estimators (e.g., XGBoost vs. PyTorch), with an HPO grid for each
 - exposure of relevant job configuration via arguments with logical default values
 - every run of the HPO job is tracked to a single experiment
 - enforcement of a standard job name
 - enforcement of a standard experiment name
    - also the ability to override the experiment name, so the DS can run this step many times and log to the same experiment
 - enforcement of a standard destination for the outputs
    - something like user_bucket/hpo/jobname/train.csv, user_bucket/hpo/jobname/test.csv, user_bucket/hpo/jobname/val.csv
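The expectations above (one grid per estimator, a single experiment per job, and an overridable experiment name) can be sketched as follows. This is a toy illustration rather than a proposed API: `run_hpo`, `expand_grid`, and `toy_evaluate` are hypothetical, the `<job_name>-hpo` default is an assumed convention, and a real implementation would train models and log each run to MLflow instead of collecting dictionaries.

```python
import itertools


def expand_grid(grid):
    """Yield every hyperparameter combination in a {name: [values]} grid."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def run_hpo(job_name, grids, evaluate, experiment_name=None):
    """Evaluate every combination for every estimator; a lower score is better."""
    # Default experiment name is derived from the job name unless overridden.
    experiment = experiment_name or f"{job_name}-hpo"
    runs = []
    for estimator, grid in grids.items():
        for params in expand_grid(grid):
            runs.append({
                "experiment": experiment,
                "estimator": estimator,
                "params": params,
                "score": evaluate(estimator, params),
            })
    best = min(runs, key=lambda run: run["score"])
    return best, runs


def toy_evaluate(estimator, params):
    """Toy objective standing in for a real train-and-validate step."""
    targets = {
        "xgboost": {"max_depth": 5, "eta": 0.2},
        "pytorch": {"hidden_units": 64, "lr": 0.01},
    }
    offset = {"xgboost": 0.0, "pytorch": 0.5}
    return offset[estimator] + sum(abs(params[k] - targets[estimator][k]) for k in params)


grids = {
    "xgboost": {"max_depth": [3, 5, 7], "eta": [0.1, 0.2]},
    "pytorch": {"hidden_units": [32, 64], "lr": [0.01, 0.1]},
}

best, runs = run_hpo("train-updt", grids, toy_evaluate)
```

Passing `experiment_name="..."` on later calls would let repeated runs of this step log to the same experiment, which is the override behavior described in the list.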

other thoughts 
 - this will be the trickiest to abstract, in my opinion



#### Train Final Model

At this stage, we know which algorithm and hyperparameter specification performs best. We now want to write a script to train that model, so that it can be submitted to our pipeline for registration and deployment.

The goal here is simply to test that our `train.py` script does result in the model we expect.

functional expectations
 - enforcement of a standard job name
 - exposure of relevant job configuration via arguments with logical default values
 - support for custom Docker images, if needed (the standard will be to use managed images)
 - enforcement of a standard destination for the outputs
    - something like user_bucket/train/jobname/
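One way to enforce a standard job name and output destination is a pair of small helpers like the ones below. This is a sketch under assumptions: the `<project>-<step>-<timestamp>-<suffix>` pattern, `standard_job_name`, and `output_prefix` are hypothetical, and the 63-character limit reflects the documented constraint on SageMaker training job names.

```python
import re
import secrets
import time

MAX_JOB_NAME_LENGTH = 63  # assumed limit, matching SageMaker training job names


def standard_job_name(project_name, step, timestamp=None, suffix=None):
    """Build '<project>-<step>-<timestamp>-<suffix>', trimmed to the length limit."""
    timestamp = int(time.time()) if timestamp is None else timestamp
    suffix = secrets.token_hex(2) if suffix is None else suffix
    tail = f"-{step}-{timestamp}-{suffix}"
    # Replace characters other than letters, digits, and hyphens with hyphens.
    head = re.sub(r"[^a-zA-Z0-9-]", "-", project_name).strip("-")
    # Trim the project part so the unique tail always survives.
    head = head[: MAX_JOB_NAME_LENGTH - len(tail)].rstrip("-")
    return head + tail


def output_prefix(user_bucket, step, job_name):
    """Return the standard output location, e.g. s3://user_bucket/train/jobname/."""
    return f"s3://{user_bucket}/{step}/{job_name}/"
```

Deriving the output prefix from the job name means a data scientist never has to pass an explicit output path, and the outputs of every run remain easy to locate.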



In [None]:

train_config = xoml_sagemaker.pipeline_types.Train(
    job_type="Managed",
    job_name=project_name,
    framework="XGBoost",
    framework_version="1.7-1",
    # Directory containing the training code; `code` names the entry-point script inside it.
    code_source_dir="./4. train/",
    code="train.py",
    python_version="py3",
    input_data_dir="s3://data-science-826190527795-lizleki/train-updt",
    output_data_dir=f"s3://data-science-826190527795-lizleki/train-updt/{project_name}/output",
    static_hyperparameters=[
        StaticHyperparameter(Key="num_round", Value="50"),
        StaticHyperparameter(Key="max_depth", Value="5"),
        StaticHyperparameter(Key="eta", Value="0.2"),
        StaticHyperparameter(Key="objective", Value="reg:squarederror"),
        StaticHyperparameter(Key="gamma", Value="4"),
    ],
)
# NOTE: `train_module` is not imported in this notebook; it must refer to the
# xoml_sagemaker module that provides `launch_training_job`.
launched_job = train_module.launch_training_job(train_config)
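For reference, a `train.py` entry point typically receives these static hyperparameters as command-line arguments when run in SageMaker script mode, and reads data and model locations from environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`. The sketch below covers only the argument handling; it is not the actual `train.py`, the helper `xgboost_params` is hypothetical, and the channel name `train` and the local fallback paths are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and I/O locations for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters, mirroring the static_hyperparameters in the cell above.
    parser.add_argument("--num_round", type=int, default=50)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--objective", type=str, default="reg:squarederror")
    parser.add_argument("--gamma", type=float, default=4.0)
    # Locations set by the container; the fallbacks allow a local dry run.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    return parser.parse_args(argv)


def xgboost_params(args):
    """Collect booster parameters; the number of rounds is passed to training separately."""
    return {
        "max_depth": args.max_depth,
        "eta": args.eta,
        "objective": args.objective,
        "gamma": args.gamma,
    }
```

Keeping the parsing in its own function makes it possible to check the script locally, for example with `parse_args(["--eta", "0.3"])`, before submitting a remote job.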

**ALTERNATIVE TO TRAIN FINAL MODEL**
#### Ensemble the Top HPO Runs

We know that some DS teams create an ensemble model based on the top N HPO runs (as an example, see [CNC Cost Model](https://github.com/xometry/datasci-cnc-cost-model/tree/master)). For these types of models, it does not make sense to run the `train_model` step. Instead, we would need a function to handle the ensembling workflow, so the data scientist could assess the ensembled model.

This is probably a processing job. We should use the same code used by the legacy infrastructure. Let's avoid changes to ensure that we don't introduce an unexpected outcome.

functional expectations
- takes as input the number of models to ensemble
- has logical defaults that can be overridden
    - e.g., the name of the HPO job to reference (we can default to the user's last run, if we're working on an object)
- enforcement of a standard destination for the outputs
    - something like user_bucket/ensemble/jobname/
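As a rough illustration of the ensembling workflow, the sketch below selects the top N runs by score and averages their predictions. Simple averaging is an assumption made for this example; the legacy code may combine models differently. The names `top_n_runs` and `ensemble_predict`, and the run dictionaries, are hypothetical.

```python
def top_n_runs(runs, n, higher_is_better=False):
    """Return the n best runs according to their objective score."""
    return sorted(runs, key=lambda run: run["score"], reverse=higher_is_better)[:n]


def ensemble_predict(selected_runs):
    """Average per-row predictions across the selected runs."""
    columns = zip(*(run["predictions"] for run in selected_runs))
    return [sum(column) / len(column) for column in columns]


# Toy HPO results; lower scores are better here (e.g., RMSE).
hpo_runs = [
    {"name": "a", "score": 0.3, "predictions": [1.0, 2.0]},
    {"name": "b", "score": 0.1, "predictions": [3.0, 4.0]},
    {"name": "c", "score": 0.2, "predictions": [5.0, 6.0]},
]

selected = top_n_runs(hpo_runs, n=2)
averaged = ensemble_predict(selected)
```

The number of models to ensemble is the only required input, which matches the first expectation in the list; the HPO job to read from could default to the most recent one.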

In [None]:
# Proposed API: `workflow` is not defined yet in this notebook.
ensemble_desc = workflow.ensemble_model(n_jobs=5)

In [None]:
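import pandas as pd  # missing from the imports at the top of the notebook

df = pd.read_csv(process_desc['test_data'])  # `process_desc`: result of an earlier processing step (not defined here)
endpoint_desc.predict(df.iloc[[0]])  # first row as a one-row DataFrame; `df.loc(0)` does not select a row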

**OPTIONAL**
#### Generate the config.yaml

Our model training and registration pipeline expects configurations to be passed via a YAML file. Data scientists will need to populate this YAML, which has a predefined structure, with the parameter values 'discovered' during this interactive workflow. It has been proposed that we generate the YAML for them, which may remove that manual step.

functional expectations 
- values based on the latest runs of the processing, training, and endpoint generation jobs
- ability to override those values, if desired
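The two expectations above amount to merging user overrides into values collected from the latest runs and then serializing the result. The sketch below shows one way to do that without extra dependencies. The configuration keys are invented for illustration and do not reflect the pipeline's real schema, `deep_merge` and `to_yaml` are hypothetical helpers, and a real implementation would more likely use a YAML library.

```python
import copy
import json


def deep_merge(base, overrides):
    """Return a copy of base with overrides applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def to_yaml(mapping, indent=0):
    """Render nested dicts of scalars as block-style YAML text."""
    lines = []
    pad = "  " * indent
    for key, value in mapping.items():
        if isinstance(value, dict) and value:
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 1))
        else:
            # JSON scalars (and empty mappings) are also valid YAML.
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return "\n".join(lines)


# Hypothetical values captured from the latest training run.
latest = {"train": {"framework": "XGBoost", "hyperparameters": {"max_depth": 5, "eta": 0.2}}}
overrides = {"train": {"hyperparameters": {"eta": 0.1}}}

config = deep_merge(latest, overrides)
yaml_text = to_yaml(config)
```

Because `deep_merge` copies its inputs, the values recorded from the latest runs stay untouched, and the same defaults can be reused with different overrides.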

In [None]:
workflow.update_config(destination="./config.yaml")