# "Mlflow : Inferred Assumptions and Tricky Parts"


- toc: true
- branch: main
- badges: false
- comments: true
- categories: [experimentation, mlflow, python]

I have been using Mlflow for a while now to track experiments, and while adding mlflow logging to existing code I had to infer some design choices for the following by trial and error:
1. Logging as MLProjects or using Fluent API 
2. Logging parameters (`mlflow.log_params()` or `mlflow.log_param()`)
3. Logging artifacts (`mlflow.log_artifacts()` or `mlflow.log_artifact()`)
4. Logging models (`mlflow.<insert-library>.log_model()`)

I wanted to collect them in one place because, when working with large codebases that have long data loading and training periods, some of these _gotchas_ only surface long after the script has started running, and depending on the task and infrastructure that can be costly.  

# Pre-requisites

I am going to use a simple program that doesn't do any real machine learning: it takes a parameter (an integer `index`) and iterates up to that number, writing each number to a file.
```python
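from pathlib import Path


def iter_and_log(index: int, log_file: Path) -> None:
    # Open the file once; reopening it with "w" inside the loop truncates it each time
    with open(log_file, "w") as lf:
        for i in range(index):
            lf.write(f"{i}\n")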
```
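
To see why it matters where the file is opened, here is a minimal sketch comparing two variants of this function. The names `write_once` and `write_reopening` and the use of a temporary directory are my own, for illustration only; the point is just that `"w"` mode truncates the file every time it is opened.

```python
import tempfile
from pathlib import Path


def write_once(index: int, log_file: Path) -> None:
    # Open the file a single time; every number ends up in the file.
    with open(log_file, "w") as lf:
        for i in range(index):
            lf.write(f"{i}\n")


def write_reopening(index: int, log_file: Path) -> None:
    # Reopen in "w" mode on every iteration; each open truncates the file,
    # so only the last number survives.
    for i in range(index):
        with open(log_file, "w") as lf:
            lf.write(f"{i}\n")


with tempfile.TemporaryDirectory() as tmp:
    once_file = Path(tmp) / "once.txt"
    reopened_file = Path(tmp) / "reopened.txt"
    write_once(3, once_file)
    write_reopening(3, reopened_file)
    once_lines = once_file.read_text().splitlines()
    reopened_lines = reopened_file.read_text().splitlines()

print(once_lines)      # ['0', '1', '2']
print(reopened_lines)  # ['2']
```

If the file really has to be reopened on each iteration, append mode (`"a"`) would keep the earlier lines, at the cost of also keeping lines from previous runs.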

But a commonly followed pattern in machine learning / deep learning training code is:
1. Accept parameters from the CLI using packages like `argparse` or `click`.
2. Create some artifacts (our log file here is an artifact that we can use).
3. Save the model during and after training.

So I mimic that pattern with the following script:

```python
# iterate.py

import pickle
from pathlib import Path

import click


@click.command()
@click.option(
    "--index",
    "-i",
    type=click.INT,
    required=True,
    help="Upper limit for iteration",
)
@click.option(
    "--log-dir",
    "-L",
    # path_type (click >= 8.0) converts the CLI string to a pathlib.Path
    type=click.Path(path_type=Path),
    default=Path("log_artifacts"),
    help="Log artifact directory",
)
@click.option(
    "--model-dir",
    "-M",
    type=click.Path(path_type=Path),
    default=Path("model_dir"),
    help="'Model' directory",
)
def iterate(index: int, log_dir: Path, model_dir: Path) -> None:
    # create dirs if they don't exist
    log_dir.mkdir(parents=True, exist_ok=True)
    model_dir.mkdir(parents=True, exist_ok=True)

    # set log_file path
    artifact_file = log_dir / "log_file.txt"

    # perform iteration and logging
    iter_and_log(index, artifact_file)

    # serialize and save the function that does iteration and logging
    # this is our proxy for a model
    with open(model_dir / "model_pickle", "wb") as model_file:
        pickle.dump(iter_and_log, model_file)


def iter_and_log(index: int, log_file: Path) -> None:
    """Function that does the iteration and logging
    Iterates to `index` and logs the values to `log_file`
    """
    with open(log_file, "w") as lf:
        for i in range(index):
            lf.write(f"{i}\n")


if __name__ == "__main__":
    iterate()
```

This can be run for 10 iterations as follows:
```bash
python iterate.py -i 10
```
and it will create the artifact directories, iterate and log, and then serialize the function and dump it to a file.  
Now, with the pre-requisites done, let's talk about adding mlflow logging and running the project.
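
Before that, one detail of the "model" file is worth a quick check. The sketch below repeats `iter_and_log` so it is self-contained, pickles it the same way `iterate.py` does, loads it back, and calls it. The temporary directory and the variable names (`restored`, `payload`, `logged`) are my own assumptions for illustration. It is meant to be run as a script, since `pickle` stores functions by reference (module and function name) rather than by value.

```python
import pickle
import tempfile
from pathlib import Path


def iter_and_log(index: int, log_file: Path) -> None:
    """Same body as in iterate.py, repeated so the sketch is self-contained."""
    with open(log_file, "w") as lf:
        for i in range(index):
            lf.write(f"{i}\n")


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model_pickle"
    log_file = Path(tmp) / "log_file.txt"

    # Save the "model" the same way iterate.py does.
    with open(model_path, "wb") as model_file:
        pickle.dump(iter_and_log, model_file)

    # Load it back and call it like the original function.
    with open(model_path, "rb") as model_file:
        restored = pickle.load(model_file)
    restored(3, log_file)

    logged = log_file.read_text().splitlines()
    payload = model_path.read_bytes()

# The pickle holds the function's name, not its code.
print(b"iter_and_log" in payload)
print(logged)
```

Because only a reference is stored, loading `model_pickle` from another process works only if `iter_and_log` can be found there under the same module name. That is fine for a stand-in model, but it is something to keep in mind when the file is loaded elsewhere.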

# Mlflow Projects and Fluent API

A project that has mlflow logging can be run using one of two patterns:

### [Mlflow Projects](https://mlflow.org/docs/latest/projects.html)
* In this pattern, an [MLproject file gets defined](https://mlflow.org/docs/latest/projects.html#mlproject-file) at the project's root directory. 
* To run the project, the `mlflow run` command is used to run an entry point where the 