# Overview

PyTrack is designed as an object-oriented mapper for [DVC](https://dvc.org/).
DVC provides tracking of large data files within a Git repository.
Therefore, all PyTrack instances will later be executed inside a Git repository.
Furthermore, DVC provides methods for building a dependency graph, tracking parameters, comparing metrics and querying multiple runs.

**Why does it need an object-oriented mapper?**

Whilst DVC provides all this functionality, it is designed to be independent of any programming language. PyTrack is designed purely for building Python packages and is optimized for that purpose.

## Stages


DVC organizes its pipeline in multiple stages (see https://dvc.org/doc/start for more information).
In the case of PyTrack, every stage is a class decorated with `pytrack.PyTrack`, as follows:

In [1]:
# We are working in a temporary directory for easier cleanup
import os
import shutil
from tempfile import TemporaryDirectory
from pytrack import PyTrackProject
from IPython.display import Pretty, display

temp_dir = TemporaryDirectory()
cwd = os.getcwd()

shutil.copy("Introduction.ipynb", temp_dir.name)

os.chdir(temp_dir.name)

project = PyTrackProject()
project.create_dvc_repository()

In [2]:
from pytrack import PyTrack, DVC


@PyTrack(nb_name="Introduction.ipynb")
class Stage:
    def __init__(self):
        """Class constructor

        Definition of parameters and results
        """
        self.n_1 = DVC.params()
        self.n_2 = DVC.params()
        self.sum = DVC.result()
        self.dif = DVC.result()

    def __call__(self, n_1, n_2):
        """User input

        Parameters
        ----------
        n_1: First number
        n_2: Second number
        """
        self.n_1 = n_1
        self.n_2 = n_2

    def run(self):
        """Actual computation
        """
        self.sum = self.n_1 + self.n_2
        self.dif = self.n_1 - self.n_2

Jupyter support is an experimental feature! Please save your notebook before running this command!
Submit issues to https://github.com/zincware/py-track.


This example defines a DVC stage that performs an addition and a subtraction on two numbers, `n_1` and `n_2`.
To use the stage, we have to place it inside a directory that has been initialized with `git init` and `dvc init`.
If we now instantiate the stage and call it, e.g. `Stage()(5, 10)`, three important files will be generated for us:

In [3]:
stage = Stage()
stage(5, 10)

Updating n_1 with PyTrackOption and value None!
Updating n_2 with PyTrackOption and value None!
Updating sum with PyTrackOption and value None!
Updating dif with PyTrackOption and value None!
Updating n_1 with 5
Updating n_2 with 10
'DVCParams' object has no attribute 'params'
'DVCParams' object has no attribute 'params'
No results found!
'DVCParams' object has no attribute 'result'
No results found!
'DVCParams' object has no attribute 'result'
--- Writing new DVC file! ---
Overwriting existing configuration!
running script: dvc run -n Stage_0 --outs outs\0_Stage.json --params config\pytrack.json:Stage.0.params --deps src\Stage.py --force --no-exec python -c "from src.Stage import Stage; Stage(id_=0).run()"


## dvc.yaml
The first file we are interested in, ``dvc.yaml``, defines all DVC stages:

In [4]:
display(Pretty("dvc.yaml"))

stages:
  Stage_0:
    cmd: python -c "from src.Stage import Stage; Stage(id_=0).run()"
    deps:
    - src\Stage.py
    params:
    - config/pytrack.json:
      - Stage.0.params
    outs:
    - outs\0_Stage.json


In this scenario we are using `@PyTrack(nb_name="Introduction.ipynb")`, which allows us to define the stage in a Jupyter notebook.
PyTrack then writes our class definition to the file `src/Stage.py`.
We can see here that DVC will run `python -c "from src.Stage import Stage; Stage(id_=0).run()"`.
This means that all information needed to run this command must be provided through files.
It is crucial that this command can run without anything being passed to the `__init__` of the class!
Furthermore, we see that the argument `id_=0` is passed, which is not defined in our `__init__`; PyTrack handles it for us automatically.
The file also specifies the dependencies and outputs of our stage. This information can then be used to generate, e.g., the DAG.
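
The requirement that the command runs without constructor arguments can be illustrated with a minimal sketch. The class `FileBackedStage` and its `config_file` argument are hypothetical names chosen for this illustration, and the code is not PyTrack's actual implementation. It only shows the general pattern: the constructor takes no user parameters and recovers them from a JSON file instead.

```python
import json
import tempfile
from pathlib import Path


class FileBackedStage:
    """Hypothetical stand-in for a stage; NOT PyTrack's actual implementation."""

    def __init__(self, id_=0, config_file="config/pytrack.json"):
        # No user parameters are accepted here. Everything needed by run()
        # is read from disk, so `FileBackedStage(id_=0).run()` is sufficient.
        self.params = {}
        config_file = Path(config_file)
        if config_file.exists():
            config = json.loads(config_file.read_text())
            self.params = config["Stage"][str(id_)]["params"]

    def run(self):
        """Compute the sum of the two stored parameters."""
        return self.params["n_1"] + self.params["n_2"]


def demo():
    """Write a config file to a temporary directory and run the stage from it."""
    with tempfile.TemporaryDirectory() as tmp:
        config_file = Path(tmp) / "pytrack.json"
        config = {"Stage": {"0": {"params": {"n_1": 5, "n_2": 10}}}}
        config_file.write_text(json.dumps(config))
        # config_file is passed only because the demo uses a temporary path
        return FileBackedStage(id_=0, config_file=config_file).run()


print(demo())
```

Because the parameters live in a file, a separate Python process started by DVC can rebuild the same object state without access to the notebook session.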


## pytrack.json

All `DVC.params()` are stored in ``config/pytrack.json``. Our file contains two numbers and looks as follows:

In [5]:
display(Pretty("config/pytrack.json"))

{
    "Stage": {
        "0": {
            "params": {
                "n_1": 5,
                "n_2": 10
            }
        }
    },
    "default": null
}

Here ``Stage`` is the name of the stage, which is usually the name of the class.
Therefore, it is important that PyTrack stages don’t share a name within one pipeline.
The key ``0`` corresponds to ``id_=0`` and allows a single stage to hold multiple sets of parameters.
This is usually not a good idea, and therefore 0 is used as the default.
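
The entry `Stage.0.params` listed under `params` in ``dvc.yaml`` is a dot-separated path into this nested structure. As a rough sketch of how such a path maps onto the JSON content, the hypothetical helper `resolve_key` below walks the dictionary one key at a time. It is written for illustration only and is not part of PyTrack or DVC.

```python
import json

# Content copied from the config/pytrack.json shown above.
CONFIG_TEXT = """
{
    "Stage": {"0": {"params": {"n_1": 5, "n_2": 10}}},
    "default": null
}
"""


def resolve_key(config, dotted_key):
    """Walk a nested dict along a dot-separated key such as 'Stage.0.params'."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError if a level is missing
    return node


config = json.loads(CONFIG_TEXT)
params = resolve_key(config, "Stage.0.params")
print(params)
```

Note that the id appears as the string `"0"` in the file, since JSON object keys are always strings.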

## 0_Stage.json

The file `outs/0_Stage.json` is the output of the stage.
It contains the values of `Stage(id_=0).sum` and `Stage(id_=0).dif` after running the stage.
PyTrack needs to know which attributes are considered results, which is why they are declared with `DVC.result()` in `__init__`.
This allows accessing and sharing the result of a stage without manually opening the generated files.
In general, all paths should be handled through PyTrack, in a way described later.
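
The log output of `dvc repro` in this notebook suggests that each result is merged into the existing output file as it is assigned. The sketch below mimics that behaviour under this assumption. The function `update_results` is a hypothetical helper, not PyTrack's API, and the real implementation may differ.

```python
import json
import tempfile
from pathlib import Path


def update_results(results_file, name, value):
    """Merge one result into a JSON file, keeping results written earlier."""
    results_file = Path(results_file)
    results = {}
    if results_file.exists():
        results = json.loads(results_file.read_text())
    results[name] = value
    # Create the output directory on first use
    results_file.parent.mkdir(parents=True, exist_ok=True)
    results_file.write_text(json.dumps(results, indent=4))
    return results


def demo():
    """Store a sum and a difference, then read the file back."""
    with tempfile.TemporaryDirectory() as tmp:
        results_file = Path(tmp) / "outs" / "0_Stage.json"
        update_results(results_file, "sum", 5 + 10)
        update_results(results_file, "dif", 5 - 10)
        return json.loads(results_file.read_text())


print(demo())
```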

We can use ``dvc repro`` or `PyTrackProject` to run our code.

In [6]:
stage.sum

No results found!


In [7]:
!dvc repro
stage.sum


Updating n_1 with PyTrackOption and value None!
Updating n_2 with PyTrackOption and value None!
Updating sum with PyTrackOption and value None!
Updating dif with PyTrackOption and value None!
Updating sum with 15
Processing value {'sum': 15}
No results found!
Writing {'sum': 15} to outs\0_Stage.json
successful!
Updating dif with -5
Processing value {'dif': -5}
Loading results {'sum': 15}
Writing {'sum': 15, 'dif': -5} to outs\0_Stage.json
successful!
Loading results {'sum': 15, 'dif': -5}


Running stage 'Stage_0':
> python -c "from src.Stage import Stage; Stage(id_=0).run()"
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.


15

This will create `outs/0_Stage.json` with the following content:

In [8]:
display(Pretty("outs/0_Stage.json"))


{
    "sum": 15,
    "dif": -5
}

In [9]:
# Clean up all files: return to the original working directory saved in `cwd`,
# then remove the temporary directory
os.chdir(cwd)
temp_dir.cleanup()