Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs/how-tos/ pre-commit #750

Merged
merged 2 commits into from
Mar 11, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added docs/how-tos/_pre-commit/validate_changes.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 0 additions & 2 deletions docs/how-tos/cli-reference.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,6 @@ You can verify the installation with
hamilton --help
```

# Reference

## `hamilton` (global)

**Options**:
Expand Down
1 change: 1 addition & 0 deletions docs/how-tos/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,5 +20,6 @@ directory. If there's an example you want but don't see, reach out or open an is
use-hamilton-for-lineage
use-hamilton-for-llm-workflows
cli-reference
pre-commit-hooks
cache-nodes
custom-driver
136 changes: 136 additions & 0 deletions docs/how-tos/pre-commit-hooks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
# Hamilton pre-commit
## Use pre-commit hooks for safer Hamilton code changes

This page gives an introduction to pre-commit hooks and how to use custom hooks to validate your Hamilton code.

## What are pre-commit hooks?
A pre-commit hook is a script or command that's executed automatically before making a commit. The goal of these hooks is to standardize code formatting and catch erroneous code before being committed. For example, popular hooks include ensuring files have no syntax errors, sorting imports, and normalizing line breaks.

Note that it's different from testing, which focuses on the behavior of the code. You can think of pre-commit hooks as checks and formatting you would do everytime you save a file.

## Add pre-commit hooks to your project
Hooks are a mechanism of the `git` version control system. You can find your project's hooks under the `.git/hooks` directory (it might be hidden by default). There should be many files with the `.sample` extension that serve as example scripts.

The preferred way of working with pre-commit hooks is through the [pre-commit](https://pre-commit.com/) Python library. This library allows you to import and configure hooks for your repository with a `.pre-commit-config.yaml` file.

### Steps to get started
1. install the pre-commit library
```python
pip install pre-commit
```

2. add a `.pre-commit-config.yaml` to your repository
```yaml
# .pre-commit-config.yaml
repos:
# repository with hook definitions
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0 # release version of the repo
hooks: # list of hooks from the repo to include in this project
- id: end-of-file-fixer
- id: trailing-whitespace
- id: check-yaml
args: ['--unsafe'] # some accept arguments

# download another repository with hooks
- repo: https://github.com/psf/black
rev: 22.10.0
hooks:
- id: black
```

3. install the hooks defined in `.pre-commit-config.yaml`
```console
pre-commit install
```
Now, hooks will automatically run on `git commit`

4. to manually run hooks
```console
pre-commit run --all-files
```

## Custom Hamilton pre-commit hooks
pre-commit hooks are great developer tools, but off-the-shelf solutions aren't aware of the Hamilton framework. Hence, we developed a pre-commit hook to help you author Hamilton dataflows! Under the hood, they leverage the `hamilton` CLI, so if you are unfamiliar with it, feel free to install it and view the `--help` messages.

```console
pip install sf-hamilton[cli]
hamilton --help
```

### Checking dataflow definition
Hamilton doesn't have many syntactic constraints, but there's a few things we want to catch:
- functions parameters and return are type annotated
- a node consistently has the same type (e.g., a parameter in multiple functions)
- functions with a name starting with underscore (`_`) are ignored from the dataflow
- functions with a `@config` decorator received a trailing double underscore with a suffix (e.g., `hello__weekday()`, `hello__weekend()`)

Instead of reimplementing this logic, we can try to build the Hamilton Driver with the command `hamilton build MODULES` and catch errors. This also ensures the verification is always in sync with the actual build mechanism. This hook will help prevent us from committing invalid dataflow definitions.

### Checking dataflow paths
A dataflow definition might be valid, but it might break paths in unexpected ways. The command `hamilton validate` (which internally uses `Driver.validate_execution()`) can check if a node is reachable.

For example, take a look at `my_module.py`, which contains the nodes `A, B, C`, and the changes between `v1` and `v2`.

```python
# my_module.py - v1
def A() -> int: ...

def B(A: int) -> float: ...

def C(A: int, B: float) -> None: ...

# driver code
dr = driver.Builder().with_mdoules(my_module).build()
dr.validate_execution(final_vars=["C"]) # <- success
```

```python
# my_module.py - v2
def B(X: int) -> float: ...

def C(A: int, B: float) -> None: ...

# driver code
dr = driver.Builder().with_mdoules(my_module).build()
dr.validate_execution(final_vars=["C"]) # <- failure. missing `A`
```

![alt text](_pre-commit/validate_changes.png)

In `v1`, the dataflow could be validated for `C` without any inputs. Now, a developer made `B` depend on `X` instead of `A` and removed `A`. This change accidentally impacted `C` which now depends on the external input `A`. Note that both `v1` and `v2` have a valid dataflow definition. To catch breaking changes to the path to `C`, we could use `hamilton validate --context context.json my_module.py` with the context:

```json
// context.json
{ "HAMILTON_FINAL_VARS": ["C"] }
// will call .validate_execution(final_vars["C"])
```

```{note}
pre-commit hooks can prevent commits from breaking a core path, but you should use unit and integration tests for more robust checks.
```

## Add Hamilton pre-commit to your project
We alluded to the relationship between pre-commit hooks and the `hamilton` command line tool. In fact, the basic hook is designed to take a list of `hamilton` commands and will execute them in order when hooks are triggered.

To use them, add this snippet to your `.pre-commit-config.yaml` and adapt it to your project:

```yaml
- repo: https://github.com/dagworks/hamilton-pre-commit
rev: v0.1.2 # use a ref >= 0.1.2
hooks:
- id: cli-command
name: Hamilton CLI command
args: [ # list of CLI commands to execute
hamilton build my_module.py,
hamilton build my_module2.py,
hamilton validate --context context.json my_module.py my_module2.py,
]
```

The above snippet would:
- check the dataflow definition of `my_module.py`
- check the dataflow definition of `my_module2.py`
- validate the execution path specified in `context.json` for dataflow composed of `my_module.py` and `my_module2.py`

You can pass any `hamilton` CLI command to the pre-commit hook, but it will only care about it succeeding or failing.