# Hamilton + Prefect


#### Requirements:

- Set up Prefect 

- Install dependencies (listed in `requirements.txt`)

More details on how to set up your environment can be found [here](https://github.com/DAGWorks-Inc/hamilton/blob/main/examples/prefect/README.md#prefect-setup).

***

Uncomment and run the cell below if you are in a Google Colab environment. It will:
1. Mount Google Drive. You will be asked to authenticate and grant permissions.
2. Change directory to Google Drive.
3. Make a directory named "hamilton-tutorials".
4. Change directory to it.
5. Clone this repository to your Google Drive.
6. Change directory to the Prefect example.
7. Install the requirements.

This means that any modifications will be saved, and you won't lose them if you close your browser.

In [1]:
## 1. Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
## 2. Change directory to Google Drive.
# %cd /content/drive/MyDrive
## 3. Make a directory "hamilton-tutorials"
# !mkdir hamilton-tutorials
## 4. Change directory to it.
# %cd hamilton-tutorials
## 5. Clone this repository to your Google Drive
# !git clone https://github.com/DAGWorks-Inc/hamilton/
## 6. Change directory to the Prefect example
# %cd hamilton/examples/prefect
## 7. Install requirements.
# %pip install -r requirements.txt
# from IPython.display import clear_output; clear_output()  # optionally clear outputs
# To check your current working directory you can type `!pwd` in a cell and run it.

***

In this example, we're going to show how to run a simple `data preprocessing -> model training -> model evaluation` workflow using Hamilton within Prefect tasks.

The functions that support this workflow are logically grouped in the modules `prepare_data`, `train_model`, and `evaluate_model`, which are imported below.
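
To make the idea of "functions grouped in modules" concrete, here is a toy sketch of the convention such modules follow: each function name is the name of a value, and its parameter names are the values it depends on. The functions, the input names (`age`, `son`), and the `resolve` helper below are hypothetical stand-ins written for illustration. They are not the contents of `prepare_data` and not how Hamilton is implemented.

```python
import inspect
from typing import Optional


# Hypothetical feature functions: the function name is the node name,
# and the parameter names declare which nodes or inputs it depends on.
def age_mean(age: list) -> float:
    return sum(age) / len(age)


def age_zero_mean(age: list, age_mean: float) -> list:
    return [a - age_mean for a in age]


def has_children(son: list) -> list:
    return [int(n > 0) for n in son]


FUNCTIONS = {f.__name__: f for f in (age_mean, age_zero_mean, has_children)}


def resolve(name: str, inputs: dict, cache: Optional[dict] = None):
    """Toy resolver: compute `name` by first computing each of its parameters."""
    cache = {} if cache is None else cache
    if name in inputs:  # externally supplied value
        return inputs[name]
    if name not in cache:
        fn = FUNCTIONS[name]
        kwargs = {
            param: resolve(param, inputs, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


inputs = {"age": [30.0, 40.0, 50.0], "son": [0, 2, 1]}
result = {var: resolve(var, inputs) for var in ["age_zero_mean", "has_children"]}
```

In the notebook itself, the Hamilton `Driver` plays the role of `resolve`: you name the outputs you want in `final_vars` and supply the external values in `inputs`.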

***

In [2]:
# We use the autoreload extension that comes with ipython to automatically reload modules when
# the code in them changes.

# import the jupyter extension
%load_ext autoreload
# set it to only reload the modules imported
%autoreload 1
# import the function modules you want to reload when they change.
# i.e. these should be your modules you write your functions in. As you change them,
# they will be reimported without you having to do anything.
%aimport prepare_data
%aimport train_model
%aimport evaluate_model

import pandas as pd
from prefect import flow, task
from hamilton import base, driver

***
The Prefect workflow has two tasks, `prepare_data_task` and `train_and_evaluate_model_task`, which define how and where our modular functions are executed.
***
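
The two tasks communicate through a file: the first writes its result to disk and returns the path, and the second receives that path and reloads the data. The sketch below shows only this handoff, using plain functions and the standard library. The names `prepare` and `train` and the sample rows are hypothetical, and no Prefect or Hamilton calls are involved.

```python
import csv
import tempfile
from pathlib import Path


def prepare(rows: list, results_dir: str) -> str:
    """Stand-in for the first task: write the features and return their location."""
    features_path = str(Path(results_dir) / "features.csv")
    with open(features_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return features_path


def train(features_path: str) -> int:
    """Stand-in for the second task: reload the features from the given path."""
    with open(features_path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))  # count the data rows


with tempfile.TemporaryDirectory() as tmp:
    # The return value of the first step is the only thing passed to the second.
    path = prepare([{"id": 1, "has_pet": 0}, {"id": 2, "has_pet": 1}], tmp)
    n_rows = train(path)
```

Because the second function takes the first function's return value as an argument, the order of execution follows from the data flow rather than from an explicit schedule.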

In [3]:
# use the @task decorator to define Prefect tasks, which adds logging, retries, etc.
# the function parameters define the config and inputs needed by Hamilton
@task
def prepare_data_task(
    raw_data_location: str,
    hamilton_config: dict,
    label: str,
    results_dir: str,
) -> str:
    """Load external data, preprocess dataset, and store cleaned data"""
    raw_df = pd.read_csv(raw_data_location, sep=";")

    dr = driver.Driver(hamilton_config, prepare_data)

    # prepare_data.ALL_FEATURES is a constant defined in the module
    features_df = dr.execute(
        final_vars=prepare_data.ALL_FEATURES + [label],
        inputs={"raw_df": raw_df},
    )
    
    # uncomment these lines to produce a local DAG visualization file:
    # dr.visualize_execution(
    #     final_vars=prepare_data.ALL_FEATURES + [label],
    #     inputs={"raw_df": raw_df},
    #     output_file_path="hamilton_dag",
    #     render_kwargs={"format": "png"},
    # )

    # save results to local file; for prod, save to an S3 bucket instead
    features_path = f"{results_dir}/features.csv"
    features_df.to_csv(features_path)

    return features_path


@task
def train_and_evaluate_model_task(
    features_path: str,
    hamilton_config: dict,
    label: str,
    feature_set: list[str],
    validation_user_ids: list[str],
) -> None:
    """Train and evaluate machine learning model"""
    dr = driver.Driver(
        hamilton_config,
        train_model,
        evaluate_model,
        adapter=base.SimplePythonGraphAdapter(base.DictResult()),
    )

    dr.execute(
        final_vars=["save_validation_preds", "model_results"],
        inputs=dict(
            features_path=features_path,
            label=label,
            feature_set=feature_set,
            validation_user_ids=validation_user_ids,
        ),
    )

In [4]:
# use @flow to define the Prefect flow.
# the function parameters define the config and inputs needed by all tasks;
# this keeps constants from being hardcoded in the flow or task bodies
@flow(
    name="hamilton-absenteeism-prediction",
    description="Predict absenteeism using Hamilton and Prefect",
)
def absenteeism_prediction_flow(
    raw_data_location: str = "./data/Absenteeism_at_work.csv",
    feature_set: list[str] = [
        "age_zero_mean_unit_variance",
        "has_children",
        "has_pet",
        "is_summer",
        "service_time",
    ],
    label: str = "absenteeism_time_in_hours",
    validation_user_ids: list[str] = [
        "1",
        "2",
        "4",
        "15",
        "17",
        "24",
        "36",
    ],
):
    """Predict absenteeism using Hamilton and Prefect

    The workflow is composed of 2 tasks, each with its own Hamilton driver.
    Notice that the task `prepare_data_task` relies on the Python module `prepare_data.py`,
    while the task `train_and_evaluate_model_task` relies on two Python modules
    `train_model.py` and `evaluate_model.py`.
    """

    # the task returns the string value `features_path`; passing this value
    # to the next task lets Prefect infer the dependency between the two tasks
    features_path = prepare_data_task(
        raw_data_location=raw_data_location,
        hamilton_config=dict(
            development_flag=True,
        ),
        label=label,
        results_dir="./data",
    )

    train_and_evaluate_model_task(
        features_path=features_path,
        hamilton_config=dict(
            development_flag=True,
            task="binary_classification",
            pred_path="./data/predictions.csv",
            model_config={},
            scorer_name="accuracy",
            bootstrap_iter=1000,
        ),
        label=label,
        feature_set=feature_set,
        validation_user_ids=validation_user_ids,
    )

In [5]:
absenteeism_prediction_flow()

[Completed(message=None, type=COMPLETED, result=UnpersistedResult(type='unpersisted', artifact_type='result', artifact_description='Unpersisted result of type `str`')),
 Completed(message=None, type=COMPLETED, result=UnpersistedResult(type='unpersisted', artifact_type='result', artifact_description='Unpersisted result of type `NoneType`'))]

***
For more tips on working with Hamilton and Prefect, see the [example README](https://github.com/DAGWorks-Inc/hamilton/blob/main/examples/prefect/README.md#tips).