## MLflow Fundamentals
* See slide deck for schematic overview

### 0. Core idea
* Code is packaged as an MLflow project (a wrapper around the code)
* The challenge is therefore to package it correctly:
    * the conda environment is set up (we can also use Docker if we want to)
    * the project description and commands are set up
* This way, we can execute our projects successfully

### 1. Example of building an MLflow component
* A Python script `download_data.py` can be run from the command line like this:

In [None]:
python download_data.py \
       --file_url https://raw.githubusercontent.com/scikit-learn/scikit-learn/4dfdfb4e1bb3719628753a4ece995a1b2fa5312a/sklearn/datasets/data/iris.csv \
       --artifact_name iris \
       --artifact_type raw_data \
       --artifact_description "The sklearn IRIS dataset"
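
The notes do not show `download_data.py` itself. As a minimal sketch of the command-line interface it would need to accept the flags above, the script could define an `argparse` parser along these lines. The function names `build_parser` and `go` and the placeholder body are assumptions for illustration; the real script would download the file and log it to W&B.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the four flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="Download a file and log it as an artifact"
    )
    parser.add_argument("--file_url", type=str, required=True,
                        help="URL of the file to download")
    parser.add_argument("--artifact_name", type=str, required=True,
                        help="Name for the artifact that will be created")
    parser.add_argument("--artifact_type", type=str, default="raw_data",
                        help="Type of the artifact to create")
    parser.add_argument("--artifact_description", type=str, required=True,
                        help="Description for the artifact")
    return parser


def go(args: argparse.Namespace) -> str:
    # Placeholder body (hypothetical): the real script would download
    # args.file_url and upload it as a W&B artifact.
    return f"Would download {args.file_url} as artifact '{args.artifact_name}'"
```

In the script itself, `go(build_parser().parse_args())` would sit under the usual `if __name__ == "__main__":` guard.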

* To run it via MLflow, it needs an environment file, `conda.yml`:

In [None]:
name: download_data
channels:
  - conda-forge
  - defaults
dependencies:
  - requests=2.24.0
  - pip=20.3.3
  - pip:
      - wandb==0.10.21

* and a project file, `MLproject`:

In [None]:
name: download_data
conda_env: conda.yml

entry_points:
  main:
    parameters:
      file_url:
        description: URL of the file to download
        type: uri
      artifact_name:
        description: Name for the W&B artifact that will be created
        type: str
      artifact_type:
        description: Type of the artifact to create
        type: str
        default: raw_data
      artifact_description:
        description: Description for the artifact
        type: str

    command: >-
      python download_data.py --file_url {file_url} \
                              --artifact_name {artifact_name} \
                              --artifact_type {artifact_type} \
                              --artifact_description {artifact_description}

* It can then be run from the command line as follows:

In [None]:
# here . is used assuming the project (MLproject file and script) is in our current directory
mlflow run . -P file_url={file_url} \
        -P artifact_name={artifact_name} \
        -P artifact_type={artifact_type} \
        -P artifact_description={artifact_description}
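
To make the link between the `-P` options and the `{placeholders}` in the `MLproject` command concrete, here is a rough sketch of the substitution step in plain Python. This is not MLflow's own code. The helper `render_command` is hypothetical, and the shell-quoting of values is an assumption about how the values are made safe for the shell.

```python
import shlex

# Command template modelled on the MLproject file above
# (flattened onto one line for readability).
COMMAND_TEMPLATE = (
    "python download_data.py --file_url {file_url} "
    "--artifact_name {artifact_name} "
    "--artifact_type {artifact_type} "
    "--artifact_description {artifact_description}"
)

# Defaults declared in the MLproject file (only artifact_type has one).
DEFAULTS = {"artifact_type": "raw_data"}


def render_command(template: str, user_params: dict, defaults: dict) -> str:
    """Fill the template: defaults first, then -P values, each shell-quoted."""
    merged = {**defaults, **user_params}
    quoted = {key: shlex.quote(str(value)) for key, value in merged.items()}
    # A parameter with no default and no -P value raises KeyError here.
    return template.format(**quoted)


command = render_command(
    COMMAND_TEMPLATE,
    {
        "file_url": "https://example.com/iris.csv",
        "artifact_name": "iris",
        "artifact_description": "The sklearn IRIS dataset",
    },
    DEFAULTS,
)
print(command)
```

Because `artifact_type` has a default, it can be left out of the `-P` options; the other three have none and must be supplied.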

### 2. Running an MLflow project - options

In [None]:
# mlflow run takes the location of the project code: a local folder or a git repository

mlflow run /path/to/the/local/folder

mlflow run git@github.com:my_username/my_repo.git

In [None]:
# Parameters are specified using -P [parameter_name]=[parameter_value]; specify one -P option for each parameter
mlflow run ./my_project -P file_url=https://myurl.com -P artifact_name=my-artifact

In [None]:
# -e to select an entry point other than main
mlflow run ./my_project -e other_script -P parameter=value

In [None]:
# -v to specify a specific release
mlflow run git@github.com:my_username/my_repo.git -v 1.2.3 \
            -P file_url=https://myurl.com \
            -P artifact_name=my-artifact
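
When these commands are launched from a script, it can help to assemble the options from a dictionary rather than by hand. The sketch below builds the argument list for `mlflow run` using only the `-e`, `-v` and `-P` options shown above. The helper name `build_mlflow_run_args` is hypothetical.

```python
from typing import Dict, List, Optional


def build_mlflow_run_args(
    uri: str,
    parameters: Dict[str, str],
    entry_point: Optional[str] = None,
    version: Optional[str] = None,
) -> List[str]:
    """Return the argument list for an `mlflow run` invocation."""
    args = ["mlflow", "run", uri]
    if entry_point is not None:
        args += ["-e", entry_point]   # entry point other than main
    if version is not None:
        args += ["-v", version]       # specific release of a git project
    for name, value in parameters.items():
        args += ["-P", f"{name}={value}"]  # one -P option per parameter
    return args


run_args = build_mlflow_run_args(
    "./my_project",
    {"file_url": "https://myurl.com", "artifact_name": "my-artifact"},
    version="1.2.3",
)
print(run_args)
```

The resulting list can be passed to `subprocess.run(run_args, check=True)`; passing a list avoids any shell-quoting concerns.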

### 3. Linking components into a pipeline

* A pipeline is implemented in MLflow as an MLflow project that calls other MLflow projects
    * The main.py of the 'control project' calls the other MLflow projects
    * The control project has its own environment file and `MLproject` description
* We use mlflow.run in the main script; this calls our sub-projects
    * Q: by using mlflow.run, it looks like we sacrifice some flexibility in which entry points can be called - these need to be predefined up front (?)

In [None]:
# example of mlflow.run and how it compares to a command line implementation

import mlflow

mlflow.run(
  # URI can be a local path or the URL to a git repository
  # here we are going to use a local path
  uri="my_project",
  # Entry point to call
  entry_point="main",           ## this value here is where i think it gets more specific
  # Parameters for that entry point
  parameters={
    "file_url": "https://...",
    "artifact_name": "my_data.csv"
  }
)

# equivalent to

! mlflow run my_project -e main -P file_url="https://..." -P artifact_name="my_data.csv"

In [None]:
# a main.py script could look like this
# uri points at the sub-project: a local path (here a folder next to main.py) or the URL of a git repository

# mlflow.run uses the project description in the same way as the command line,
# so each mlflow.run call runs the component via its MLproject file


import mlflow

mlflow.run(
  uri="download_data",
  entry_point="main",
  parameters={
    "file_url": "https://...",
    "output_artifact": "raw_data.csv"
  }
)

mlflow.run(
  uri="remove_duplicates",
  entry_point="main",
  parameters={
    "input_artifact": "raw_data.csv:latest",
    "output_artifact": "clean_data.csv"
  }
)

* Note: we can turn this file into something usable from the command line by adding a `go` function and the relevant `argparse` parameters.
* This, then, is what Hydra will do
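
As a rough sketch of that idea, the main.py above could be wrapped as follows. The `--file_url` flag and the injectable `run` argument are assumptions for illustration; `run` simply defaults to `mlflow.run`, so the function can be tried out without launching real projects.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for the pipeline (one illustrative flag)."""
    parser = argparse.ArgumentParser(description="Run the two-step pipeline")
    parser.add_argument("--file_url", required=True,
                        help="URL of the raw data file")
    return parser


def go(args: argparse.Namespace, run=None) -> None:
    """Run both components in order, passing CLI values through as parameters."""
    if run is None:
        # Imported lazily so the sketch can be loaded without MLflow installed.
        import mlflow
        run = mlflow.run

    run(
        uri="download_data",
        entry_point="main",
        parameters={
            "file_url": args.file_url,
            "output_artifact": "raw_data.csv",
        },
    )
    run(
        uri="remove_duplicates",
        entry_point="main",
        parameters={
            "input_artifact": "raw_data.csv:latest",
            "output_artifact": "clean_data.csv",
        },
    )
```

Calling `go(build_parser().parse_args())` under an `if __name__ == "__main__":` guard turns this into a script. Every new parameter needs another `add_argument` line, which is the bookkeeping a config file saves us.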

### 4. Hydra

* We can set up a Hydra config file that lets us set parameters in main.py from the command line (need to verify these last two elements)

In [None]:
# example config file
main:
  project_name: my_project
  experiment_name: dev
data:
  train_data: "exercise_6/data_train.csv:latest"
random_forest_pipeline:
  random_forest:
    n_estimators: 100
    criterion: gini
    max_depth: null

In [None]:
# the config file defines the parameters that Hydra can override

# we need to import hydra and add the hydra decorator, referencing the name of our config file
import mlflow
import hydra

@hydra.main(config_name="config")
def go(config):
  # Now here config is a dictionary-like object (a DictConfig) with our configuration
  # For example, to access the parameter train_data in the data
  # section we can just do
  train_data = config["data"]["train_data"]

  ...


if __name__=="__main__":
  go()

In [None]:
# finally, we update our MLflow project description file to accommodate Hydra

name: main
conda_env: conda.yml

entry_points:
  main:
    parameters:
      hydra_options:                                ## here
        description: Hydra parameters to override
        type: str
        default: ''
    command: >-
      python main.py $(echo {hydra_options})        ## here
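
The `$(echo {hydra_options})` wrapper is easy to misread. One plausible reason for it, assuming MLflow shell-quotes each parameter value before substituting it into the command (worth checking against the installed version), is that several overrides passed in one string would otherwise reach `main.py` as a single argument. The sketch below uses `shlex` to emulate the two cases; it does not call MLflow or a real shell.

```python
import shlex

hydra_options = "main.experiment_name=my_experiment main.project_name=test"

# Case 1: the quoted value is substituted directly.
# The shell would hand main.py ONE argument containing a space.
direct = f"python main.py {shlex.quote(hydra_options)}"

# Case 2: what `$(echo ...)` achieves. The shell splits echo's output on
# whitespace, so each override becomes its own argument. Emulated here by
# splitting the string before building the command.
resplit = "python main.py " + " ".join(
    shlex.quote(word) for word in hydra_options.split()
)

print(shlex.split(direct))
print(shlex.split(resplit))
```

Hydra expects one `key=value` override per argument, which is why the second form is the one we want.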

* How Hydra overrides parameters when running from the command line:

In [None]:
# examples

mlflow run [path or URL to the pipeline] \
       -P hydra_options="main.experiment_name=my_experiment"

mlflow run [path or URL to the pipeline] \
       -P hydra_options="random_forest_pipeline.random_forest.n_estimators=50"

mlflow run [path or URL to the pipeline] \
       -P hydra_options="main.experiment_name=my_experiment main.project_name=test"
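
The dotted key in each override follows the nesting of the config file, from the outermost section down to the parameter. As a minimal sketch of that lookup on a plain nested dictionary (this is not Hydra's implementation, and the helper `apply_override` is hypothetical):

```python
import copy


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of config with one 'a.b.c=value' override applied."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(config)
    node = updated
    for part in parents:
        node = node[part]  # KeyError if the path does not match the nesting
    if leaf not in node:
        raise KeyError(f"{dotted_key} is not in the config")
    node[leaf] = raw_value  # kept as a string; this sketch skips type conversion
    return updated


# Trimmed copy of the example config file from this section.
config = {
    "main": {"project_name": "my_project", "experiment_name": "dev"},
    "random_forest_pipeline": {"random_forest": {"n_estimators": 100}},
}

updated = apply_override(
    config, "random_forest_pipeline.random_forest.n_estimators=50"
)
print(updated["random_forest_pipeline"]["random_forest"])
```

A key path that does not match the nesting of the config (for example with two sections swapped) fails here with a `KeyError`.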


### 5. Organising the pipeline so that it is tracked with W&B

In [None]:
# Setting these two environment variables ensures that the pipeline's runs are grouped together in a single experiment within the appropriate project.

# NOTE: this only works if the components do NOT set the experiment name and the project themselves when calling wandb.init.

import hydra
import mlflow
import os

@hydra.main(config_name="config")
def go(config):

  os.environ["WANDB_PROJECT"] = config["main"]["project_name"]
  os.environ["WANDB_RUN_GROUP"] = config["main"]["experiment_name"]

  mlflow.run(
    ...
  )

  ...


if __name__ == "__main__":
  go()
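
Why does setting `os.environ` in main.py affect the components? A likely explanation, assuming each component is launched as a child process of main.py (which is how local runs appear to behave), is that child processes inherit the parent's environment. The sketch below shows that inheritance with a stand-in child process; it does not use MLflow or W&B.

```python
import os
import subprocess
import sys

# Set in the parent, as the pipeline's main.py does before calling mlflow.run.
os.environ["WANDB_PROJECT"] = "my_project"
os.environ["WANDB_RUN_GROUP"] = "dev"

# Stand-in for a component: a child Python process that only reads its
# environment, much as wandb.init would when no project or group is passed.
child_code = (
    "import os; "
    "print(os.environ['WANDB_PROJECT'], os.environ['WANDB_RUN_GROUP'])"
)
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)
child_output = result.stdout.strip()
print(child_output)
```

This also explains the note above: a component that passes its own project or group to `wandb.init` no longer relies on the inherited values.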

In [None]:
# here is another example with the root path added; Hydra may change the working directory at run time, so sub-project paths are built from the original one

import mlflow
import os
import wandb
import hydra
from omegaconf import DictConfig


# This automatically reads in the configuration
@hydra.main(config_name='config')
def go(config: DictConfig):

    # Setup the wandb experiment. All runs will be grouped under this name
    os.environ["WANDB_PROJECT"] = config["main"]["project_name"]
    os.environ["WANDB_RUN_GROUP"] = config["main"]["experiment_name"]

    # You can get the path at the root of the MLflow project with this:
    root_path = hydra.utils.get_original_cwd()

    _ = mlflow.run(
        os.path.join(root_path, "download_data"),
        "main",
        parameters={
            "file_url": config["data"]["file_url"],
            "artifact_name": "iris.csv",
            "artifact_type": "raw_data",
            "artifact_description": "Input data"
        },
    )

    _ = mlflow.run(
        os.path.join(root_path, "process_data"),
        "main",
        parameters={
            "input_artifact": "iris.csv:latest",
            "artifact_name": "clean_data.csv",
            "artifact_type": "processed_data",
            "artifact_description": "Cleaned data"
        },
    )



if __name__ == "__main__":
    go()
