Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the logic of openlineage.airflow.macros.lineage_parent_id #2488

Closed
blacklight opened this issue Mar 4, 2024 · 4 comments · Fixed by #2578
Closed

Update the logic of openlineage.airflow.macros.lineage_parent_id #2488

blacklight opened this issue Mar 4, 2024 · 4 comments · Fixed by #2578

Comments

@blacklight
Copy link
Contributor

The macro openlineage.airflow.macros.lineage_parent_id currently generates the parent_id as namespace/name/run_id, where name however is a UUID generated from the task metadata, and run_id is the concatenation of the task instance timestamp and try number.

This isn't compliant with the OpenLineage conventions. I'd expect name=<dag_id>.<task_id>, as it's a convention in other parts of the codebase, and run_id to be a UUID.

Copy link

boring-cyborg bot commented Mar 4, 2024

Thanks for creating your first OpenLineage issue! Your feedback is valuable and improves the project. If you haven't already, please be sure to follow the issue template!

blacklight added a commit to blacklight/OpenLineage that referenced this issue Mar 4, 2024
…arent_id`.

- Returned format: `<namespace>/<name>/<run_id>`.

- `name` should be `<dag_id>.<task_id>`, not a UUID.

- `run_id` should be a UUID, not `<run_timestamp>.<try_number>`.
blacklight added a commit to blacklight/OpenLineage that referenced this issue Mar 4, 2024
…arent_id`.

- Returned format: `<namespace>/<name>/<run_id>`.

- `name` should be `<dag_id>.<task_id>`, not a UUID.

- `run_id` should be a UUID, not `<run_timestamp>.<try_number>`.

Signed-off-by: Fabio Manganiello <fabio.manganiello@booking.com>
@dolfinus
Copy link
Contributor

dolfinus commented Mar 30, 2024

May I propose is to just remove (or ar least deprecate) the lineage_parent_id macro, and instead use lineage_run_id with some new macro lineage_namespace. This allows passing values to child job like Spark session without parsing, e.g.:

SparkSubmitOperator(
  ...,
  config={
    "spark.openlineage.parentJobNamespace": "{{macros.OpenLineageProviderPlugin.namespace()}}",
    "spark.openlineage.parentJobName": "{{dag.dag_id}}.{{task.task_id}}",
    "spark.openlineage.parentRunId": "{{macros.OpenLineageProviderPlugin.lineage_run_id(task_instance)}}",
)

@kacpermuda
Copy link
Contributor

There is no need to deprecate the macro, we can simply add two new macros - then users can choose to either pass all the information and parse it in spark or pass everything separately.

blacklight added a commit to blacklight/OpenLineage that referenced this issue Apr 4, 2024
…arent_id`.

- Returned format: `<namespace>/<name>/<run_id>`.

- `name` should be `<dag_id>.<task_id>`, not a UUID.

- `run_id` should be a UUID, not `<run_timestamp>.<try_number>`.

Signed-off-by: Fabio Manganiello <fabio.manganiello@booking.com>
Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
blacklight added a commit to blacklight/OpenLineage that referenced this issue Apr 4, 2024
…arent_id`.

- Returned format: `<namespace>/<name>/<run_id>`.

- `name` should be `<dag_id>.<task_id>`, not a UUID.

- `run_id` should be a UUID, not `<run_timestamp>.<try_number>`.

- Both `lineage_run_id` and `lineage_parent_id` should expose the same
  interface - only a `TaskInstance` object is now required as argument.

- Import `_DAG_NAMESPACE` instead of inferring it again.

Airflow-Reference: apache/airflow#37877
Signed-off-by: Fabio Manganiello <fabio.manganiello@booking.com>
blacklight added a commit to blacklight/OpenLineage that referenced this issue Apr 4, 2024
…arent_id`.

- Returned format: `<namespace>/<name>/<run_id>`.

- `name` should be `<dag_id>.<task_id>`, not a UUID.

- `run_id` should be a UUID, not `<run_timestamp>.<try_number>`.

- Both `lineage_run_id` and `lineage_parent_id` should expose the same
  interface - only a `TaskInstance` object is now required as argument.

- Import `_DAG_NAMESPACE` instead of inferring it again.

Airflow reference: apache/airflow#37877

Signed-off-by: Fabio Manganiello <fabio.manganiello@booking.com>
blacklight added a commit to blacklight/OpenLineage that referenced this issue Apr 4, 2024
…arent_id`.

- Returned format: `<namespace>/<name>/<run_id>`.

- `name` should be `<dag_id>.<task_id>`, not a UUID.

- `run_id` should be a UUID, not `<run_timestamp>.<try_number>`.

- Both `lineage_run_id` and `lineage_parent_id` should expose the same
  interface - only a `TaskInstance` object is now required as argument.

- Import `_DAG_NAMESPACE` instead of inferring it again.

Airflow reference: apache/airflow#37877

Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
kacpermuda pushed a commit that referenced this issue Apr 5, 2024
#2578)

* [#2488] Fixed format returned by `airflow.macros.lineage_parent_id`.

- Returned format: `<namespace>/<name>/<run_id>`.

- `name` should be `<dag_id>.<task_id>`, not a UUID.

- `run_id` should be a UUID, not `<run_timestamp>.<try_number>`.

- Both `lineage_run_id` and `lineage_parent_id` should expose the same
  interface - only a `TaskInstance` object is now required as argument.

- Import `_DAG_NAMESPACE` instead of inferring it again.

Airflow reference: apache/airflow#37877

Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>

* Fixed redefinition of `get_unknown_source_attribute_run_facet` introduced upon merge.

Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>

* Addressed #2578 (comment)

Signed-off-by: Fabio Manganiello <fabio.manganiello@booking.com>

* Fixed failing macro test

Signed-off-by: Fabio Manganiello <fabio.manganiello@booking.com>

---------

Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
Signed-off-by: Fabio Manganiello <fabio.manganiello@booking.com>
@dolfinus
Copy link
Contributor

dolfinus commented Apr 5, 2024

we can simply add two new macros

Done: #2582

blacklight added a commit to blacklight/OpenLineage that referenced this issue Apr 9, 2024
…eage events are emitted.

This is a follow-up [/missed corner case] of
OpenLineage#2560

Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
mobuchowski pushed a commit that referenced this issue Apr 9, 2024
… are emitted. (#2591)

This is a follow-up [/missed corner case] of
#2560

Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants