In [5]:
from openadmet.models.anvil.workflow import AnvilWorkflow, AnvilDeepLearningWorkflow


# 2.1 Anvil walkthrough

## Sections in an Anvil Workflow 


First, let's look at the Anvil workflow class normally used for training a traditional ML model (e.g., an LGBM). It contains a series of sub-schemas ([Pydantic schemas](https://docs.pydantic.dev/latest/) at their core) that correspond to the diagram above, covering all of the steps required to train and evaluate a model. Each step has a required API that makes it interoperable with the upstream and downstream sections of the workflow.

In [3]:
AnvilWorkflow?

Init signature:
AnvilWorkflow(
    *,
    metadata: openadmet.models.anvil.specification.Metadata,
    data_spec: openadmet.models.anvil.specification.DataSpec,
    transform: Optional[openadmet.models.transforms.transform_base.TransformBase] = None,
    split: openadmet.models.split.split_base.SplitterBase,
    feat: openadmet.models.features.feature_base.FeaturizerBase,
    model: openadmet.models.architecture.model_base.ModelBase,
    ensemble: openadmet.models.active_learning.ensemble_base.EnsembleBase | None = None,
    trainer: openadmet.models.trainer.trainer_base.TrainerBase,
    evals: list[openadmet.models.eval.eval_base.EvalBase],
    parent_spec: openadmet.models.anvil.specification.AnvilSpecification,
    debug: bool = False,
    driver: openadmet.models.anvil.Drivers = <Drivers.SKLEARN: 'sklearn'>,
) -> None
Docstring:      Workflow for ru

Let's construct just one sub-section of the workflow: the **Metadata section**. This section is designed to capture the **who**, **what**, and **why** of this workflow. As we are building a model of CYP inhibition from our pre-curated data, let's make sure we 


In [7]:
from openadmet.models.anvil.specification import Metadata
Metadata?

Init signature:
Metadata(
    *,
    version: Literal['v1'],
    driver: str = 'sklearn',
    name: str,
    build_number: Annotated[int, Ge(ge=0)],
    description: str,
    tag: str,
    authors: str,
    email: pydantic.networks.EmailStr,
    biotargets: list[str],
    tags: list[str],
) -> None
Docstring:     
Metadata specification.

Attributes
----------
version : Literal["v1"]
    The version of the metadata schema.
driver : str
    The driver for the workflow.
name : str
    The name of the workflow.
build_number : int
    The build number of the workflow (must be non-negative).
description : str
    Description of the workflow.
tag : str
    Primary tag for the workflow.
authors : str
    Name of the authors.
email : EmailStr
    Email address of the contact person.
biotargets : list[str]
    List of biotargets associated with the workflow.
tags : list[str]
    Additional tags for the workflow.

In [8]:
metadata_instance = Metadata(version="v1",
                             build_number=0,
                             name="CYP3A4-1st-attempt",
                             description="trying out anvil on ChEMBL curated CYP3A4 data",
                             tag="CYP3A4-attempt-1",
                             authors="Jane ADMET",
                             email="jane.admet@therapeutics.co",
                             biotargets=["CYP3A4"],
                             tags=["ChEMBL"])
metadata_instance

Metadata(version='v1', driver='sklearn', name='CYP3A4-1st-attempt', build_number=0, description='trying out anvil on ChEMBL curated CYP3A4 data', tag='CYP3A4-attempt-1', authors='Jane ADMET', email='jane.admet@therapeutics.co', biotargets=['CYP3A4'], tags=['ChEMBL'])

We can see that we have constructed a metadata schema for our run. All of the possible options are described in our detailed `Anvil` [reference guide](https://openadmet-models.readthedocs.io/en/latest/anvil_reference.html).
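To illustrate what schema validation buys us, here is a minimal sketch using plain Pydantic. `MetadataSketch` and `try_build` are hypothetical names that copy only three of the fields from the signature above; this is not the real `Metadata` class, and it assumes Pydantic is installed.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class MetadataSketch(BaseModel):
    """Hypothetical stand-in for a few Metadata fields (not the real class)."""

    version: Literal["v1"]
    name: str
    build_number: int = Field(..., ge=0)  # must be non-negative


def try_build(**kwargs):
    """Return (instance, None) on success or (None, error) if validation fails."""
    try:
        return MetadataSketch(**kwargs), None
    except ValidationError as err:
        return None, err


ok, ok_err = try_build(version="v1", name="CYP3A4-1st-attempt", build_number=0)
bad, bad_err = try_build(version="v1", name="CYP3A4-1st-attempt", build_number=-1)

print(ok)       # a populated MetadataSketch
print(bad_err)  # a validation error raised for the negative build_number
```

A malformed section is rejected when it is constructed, well before any training starts.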

**Let's make a few more sections**

In [9]:
# Data section 

from openadmet.models.anvil.specification import DataSpec
data_instance = DataSpec(
    type="intake",
    resource="../01_Data_Curation/processed_data/processed_CYP3A4_inhibition.csv",
    target_cols="OPENADMET_LOGAC50",
    input_col="OPENADMET_CANONICAL_SMILES",
)

Here we specified our curated ChEMBL CYP3A4 inhibition data to be read in. `target_cols` gives our $y$ values or "targets" (here CYP3A4 pIC50s), and `input_col` names the column from which we read our chemical structures.
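As a rough picture of what these two column names select, the sketch below pulls inputs and targets out of a tiny in-memory CSV using only the standard library. The rows are made up for illustration, and this is not how Anvil itself loads data.

```python
import csv
import io

# Toy stand-in for the curated CSV; the rows are made up for illustration only.
toy_csv = io.StringIO(
    "OPENADMET_CANONICAL_SMILES,OPENADMET_LOGAC50\n"
    "CCO,4.2\n"
    "c1ccccc1,5.1\n"
    "CC(=O)O,3.8\n"
)

input_col = "OPENADMET_CANONICAL_SMILES"
target_col = "OPENADMET_LOGAC50"

rows = list(csv.DictReader(toy_csv))
smiles = [row[input_col] for row in rows]           # model inputs (structures)
targets = [float(row[target_col]) for row in rows]  # y values to predict

print(smiles)
print(targets)
```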

In [11]:
# Split section 

from openadmet.models.anvil.specification import SplitSpec

split_instance = SplitSpec(type="L")

Here we specified a simple random split of our data. You can use any of the classes in `openadmet.models.split`.
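For intuition, a random split amounts to shuffling row indices and cutting the shuffled list in two. The sketch below uses only the standard library; `random_split`, the 80/20 fraction, and the seed are illustrative assumptions, not Anvil's implementation or defaults.

```python
import random


def random_split(n_items, train_fraction=0.8, seed=42):
    """Shuffle indices 0..n_items-1 and cut them into train and test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_train = int(round(train_fraction * n_items))
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = random_split(10)
print(train_idx, test_idx)
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models trained on the same data.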

In [13]:
# Model Section, LGBM gradient boosting regressor 

from openadmet.models.architecture.lgbm import LGBMRegressorModel

model_instance = LGBMRegressorModel()

We will make use of a powerful traditional ML technique: a gradient boosting regressor from [LightGBM](https://lightgbm.readthedocs.io/en/stable/). You can use any of the classes in `openadmet.models.architecture`.
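To give a feel for the idea behind gradient boosting, here is a pure-Python sketch of least-squares boosting with one-split "stumps" on toy 1-D data. All function names and data are illustrative assumptions. LightGBM's actual algorithm is far more sophisticated; this only shows the core loop of fitting each new weak learner to the current residuals.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.3):
    """Least-squares gradient boosting: each stump fits the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction is the mean of y
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared-error loss
        # (up to a constant factor).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy 1-D data, made up for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

model = fit_boosted(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict_boosted(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

The boosted model's training error falls below that of the constant mean predictor because each round corrects part of what the previous rounds got wrong.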

In [None]:
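# feat_instance, trainer_instance, and evals are placeholders for sections not constructed above
anvil_wf = AnvilWorkflow(metadata=metadata_instance,
                         data_spec=data_instance,
                         feat=feat_instance,
                         split=split_instance,
                         model=model_instance,
                         trainer=trainer_instance,
                         evals=evals)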