Create column lineage endpoint proposal #2077
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,143 @@ | ||
# Proposal: Column lineage endpoint | ||
|
||
Author(s): @julienledem | ||
|
||
Created: 2022-08-18 | ||
|
||
Discussion: [column lineage endpoint issue #2045](https://github.com/MarquezProject/marquez/issues/2045) | ||
|
||
## Overview | ||
|
||
### Use cases | ||
- Find the current upstream dependencies of a column. A column in a dataset is derived from columns in upstream datasets. | ||
- See column-level lineage in the dataset level lineage when available. | ||
- Retrieve point-in-time upstream lineage for a dataset or a column. What did the lineage look like yesterday compared to today? | ||
|
||
### Existing elements | ||
|
||
- OpenLineage defines a [column-level lineage facet](https://github.com/OpenLineage/OpenLineage/blob/ff0d87d30ed6c9fe39472788948266a6d3190585/spec/facets/ColumnLineageDatasetFacet.md). | ||
|
||
- Marquez has a lineage endpoint `GET /api/v1/lineage` that returns the current lineage graph connected to a job or a dataset | ||
|
||
### New Elements | ||
We propose to add the following: | ||
- Add column lineage to the lineage endpoint | ||
- A new column-lineage endpoint leveraging the column lineage facet to retrieve lineage for a given column. | ||
- Point-in-time upstream (dataset or column level) lineage given a version of a dataset. | ||
|
||
## Proposal | ||
|
||
### Add column lineage to the existing endpoint | ||
In the `GET /lineage` API, add column lineage to the `data` of DATASET nodes: | ||
```diff | ||
{ | ||
"id": "dataset:food_delivery:public.categories", | ||
"type": "DATASET", | ||
"data": { | ||
"type": "DATASET", | ||
"id": { | ||
"namespace": "food_delivery", | ||
"name": "public.categories" | ||
}, | ||
"type": "DB_TABLE", | ||
... | ||
"fields": [{ | ||
... | ||
}], | ||
> columnLineage: { | ||
Calls to
I think yes, we would reuse the columnLineage facet object. The OL javadoc needs to be updated. It is not an automated process at the moment. |
||
> "a": { | ||
> inputFields: [ | ||
> {namespace: "ns", name: "name", "field": "a"}, | ||
> ... other inputs | ||
> ], | ||
> transformationDescription: "identical", | ||
> transformationType: "IDENTITY" | ||
> }, | ||
> "b": ... other output fields | ||
> } | ||
}, | ||
"inEdges": [{ | ||
"origin": "job:food_delivery:etl_orders.etl_categories", | ||
"destination": "dataset:food_delivery:public.categories" | ||
}], | ||
"outEdges": [{ | ||
"origin": "dataset:food_delivery:public.categories", | ||
"destination": "job:food_delivery:etl_orders.etl_orders_7_days" | ||
}] | ||
} | ||
``` | ||
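As a rough illustration of how a client might read the proposed `columnLineage` field from a DATASET node, the sketch below maps each output column to the input fields that feed it. It assumes the payload shape shown above. The helper name `input_fields_by_column` and the sample values are hypothetical and not part of the Marquez API.

```python
from typing import Dict, List, Tuple

# A field reference: (namespace, dataset name, field name).
FieldRef = Tuple[str, str, str]


def input_fields_by_column(node: dict) -> Dict[str, List[FieldRef]]:
    """Map each output column of a DATASET node to its upstream input fields.

    Hypothetical helper: assumes node["data"]["columnLineage"] follows the
    shape proposed above. Nodes without column lineage yield an empty dict.
    """
    column_lineage = node.get("data", {}).get("columnLineage", {})
    return {
        column: [
            (f["namespace"], f["name"], f["field"])
            for f in entry.get("inputFields", [])
        ]
        for column, entry in column_lineage.items()
    }


# Sample node mirroring the example payload (values are illustrative only).
sample_node = {
    "id": "dataset:food_delivery:public.categories",
    "type": "DATASET",
    "data": {
        "columnLineage": {
            "a": {
                "inputFields": [{"namespace": "ns", "name": "name", "field": "a"}],
                "transformationDescription": "identical",
                "transformationType": "IDENTITY",
            }
        }
    },
}

print(input_fields_by_column(sample_node))
```

Because the field is optional ("when available"), the helper treats a missing `columnLineage` as empty rather than raising.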
|
||
### Add a column-level lineage endpoint | ||
|
||
``` | ||
GET /column-lineage?nodeId=dataset:food_delivery:public.delivery_7_days&column=a | ||
Given that we have the following endpoint to query lineage:
I'm not sure there's much advantage to defining a separate endpoint for column-level lineage. Although a new endpoint would contextualize the API call, with proper documentation we can extend our current lineage endpoint to support columns:
If the query param
On the backend, these calls would be handled differently. When querying for upstream lineage, the graph returned would consist of only nodes upstream of
You can then recursively follow the in edges to traverse the upstream graph consisting of job-to-dataset relationships: {
.
.
"inEdges": [{
"origin": "job:{namespace}:{job}",
"destination": "dataset:{namespace}:{dataset}"
}],
"outEdges": [{
"origin": "job:{namespace}:{job}",
"destination": "dataset:{namespace}:{dataset}"
}]
}
For column-level lineage, the in / out node edges in the upstream lineage
By consistent, I mean that the backend can assist in better representing the dataset-to-dataset relationship (or rather the dataset-column-to-dataset-column relationship) on a given dataset for a particular column by defining the following node ID:
For example, with the node ID defined, an upstream lineage call would now be:
{
.
.
"inEdges": [{
"origin": "dataset:my-namespace:my-dataset#my-field",
"destination": "dataset:my-namespace:my-other-dataset#my-other-field"
}],
"outEdges": [{
"origin": "dataset:my-namespace:my-other-dataset#my-other-field",
"destination": "dataset:my-namespace:some-other-dataset#some-other-field"
}]
}
I'm proposing a different endpoint for /column-lineage because the payload would be different, containing only datasets. I was considering that the columnLineage facet already provides edges and that the inEdges and outEdges fields of the lineage graph become unnecessary. To me, /upstream or /downstream is not an endpoint, as they are more of a filter on the lineage than a different result.
I would then change the payload from a graph consisting of nodes (with in/out edges that aren't really relevant) to more of an array of dataset objects that don't have in / out edges, as much of the metadata that is relevant for lineage wouldn't apply here. My thinking is this: the lineage call returns a set of nodes, but doesn't specify if they all have to be datasets, or all have to be jobs. It's generic in that way. What matters are the nodeIDs and that the
Basically, I think column-level lineage should still be represented via a graph data structure. If we will only be using
Ohh man, it's a great discussion, although it took me ten readings to understand what you are talking about. I tried to combine Julien's initial idea with Willy's feedback. Some key design decisions:
Other:
Thanks for the update @pawel-big-lebowski. This looks good to me. I left a minor comment below. |
||
``` | ||
`column` is a new parameter that must be a column in the schema of the provided dataset `nodeId`. | ||
|
||
The logic is layered on the existing lineage endpoint, filtering down to the datasets that contribute to that column. | ||
It only returns dataset nodes. | ||
|
||
```diff | ||
{ | ||
graph: [ | ||
{ | ||
"id": "dataset:db1:table2", | ||
"type": "DATASET", | ||
data: { | ||
namespace: "DB1", | ||
name: "table2", | ||
> columnLineage: { | ||
> "a": { | ||
> inputFields: [ | ||
> {namespace: "DB1", name: "table1", "field": "a"} | ||
> ], | ||
> transformationDescription: "identical", | ||
> transformationType: "IDENTITY" | ||
> }, | ||
> "b": ... other output fields | ||
> } | ||
}, | ||
... | ||
} | ||
] | ||
} | ||
``` | ||
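The filtering step described above can be sketched as a breadth-first walk over `columnLineage.inputFields`, starting from the requested column. This is only an illustration under stated assumptions: node IDs are taken to have the form `dataset:{namespace}:{name}`, the function name `upstream_dataset_ids` is hypothetical, and the actual Marquez implementation may differ.

```python
from collections import deque
from typing import List, Set, Tuple


def upstream_dataset_ids(graph: List[dict], node_id: str, column: str) -> Set[str]:
    """Return IDs of datasets that contribute to `column` of `node_id`.

    Hypothetical sketch: follows columnLineage inputFields transitively and
    assumes dataset node IDs look like "dataset:{namespace}:{name}". The
    starting dataset is included in the result.
    """
    # Only DATASET nodes are relevant; job nodes are ignored.
    nodes = {n["id"]: n for n in graph if n.get("type") == "DATASET"}
    seen: Set[Tuple[str, str]] = {(node_id, column)}
    queue = deque(seen)
    while queue:
        current_id, current_col = queue.popleft()
        node = nodes.get(current_id)
        if node is None:
            continue  # referenced dataset is not part of the returned graph
        lineage = node.get("data", {}).get("columnLineage", {})
        for f in lineage.get(current_col, {}).get("inputFields", []):
            ref = (f"dataset:{f['namespace']}:{f['name']}", f["field"])
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return {dataset_id for dataset_id, _ in seen}


def _field(name: str, field: str) -> dict:
    return {"namespace": "db1", "name": name, "field": field}


# Illustrative graph: table3.c <- table2.a <- table1.a, and table2.b <- table0.x.
sample_graph = [
    {"id": "dataset:db1:table3", "type": "DATASET",
     "data": {"columnLineage": {"c": {"inputFields": [_field("table2", "a")]}}}},
    {"id": "dataset:db1:table2", "type": "DATASET",
     "data": {"columnLineage": {"a": {"inputFields": [_field("table1", "a")]},
                                "b": {"inputFields": [_field("table0", "x")]}}}},
    {"id": "dataset:db1:table1", "type": "DATASET", "data": {}},
    {"id": "dataset:db1:table0", "type": "DATASET", "data": {}},
]

print(sorted(upstream_dataset_ids(sample_graph, "dataset:db1:table3", "c")))
```

In this sample, `table0` only feeds column `b` of `table2`, so it is not reached when asking for column `c` of `table3`.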
|
||
### Point-in-time upstream lineage | ||
Return historical upstream lineage from a given dataset version. | ||
This adds the version element to the nodeId in both the existing `/api/v1/lineage` endpoint and the newly proposed `/api/v1/column-lineage` endpoint. | ||
``` | ||
GET /lineage?nodeId=dataset:food_delivery:public.delivery_7_days:{version} | ||
GET /column-lineage?nodeId=dataset:food_delivery:public.delivery_7_days:{version}&column=a | ||
|
||
``` | ||
In the current proposal, this returns only upstream lineage, because upstream lineage is well defined for a specific version while downstream lineage is not. The data payload adds a `version` field. | ||
```diff | ||
{ | ||
graph: [ | ||
{ | ||
< "id": "dataset:db1:table2", | ||
> "id": "dataset:db1:table2#{VERSION UUID}", | ||
"type": "DATASET", | ||
data: { | ||
namespace: "DB1", | ||
name: "table2", | ||
> version: "{VERSION UUID}" | ||
... | ||
} | ||
} | ||
] | ||
} | ||
``` | ||
|
||
## Implementation | ||
|
||
### Column lineage facet in lineage | ||
Adding the columnLineage facet requires formatting the existing facet data. | ||
### Column lineage endpoint | ||
The `/column-lineage` endpoint leverages the `/lineage` endpoint and then filters down the payload to return the expected result. | ||
### Point-in-time upstream lineage | ||
The point-in-time upstream lineage leverages the run-to-dataset-version relation to trace back the lineage of a given dataset or job version. | ||
Dataset version -> run that produced it -> consumed Dataset Versions. | ||
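The traversal above (dataset version, then the run that produced it, then the dataset versions that run consumed) can be sketched with two in-memory mappings. The names `produced_by`, `inputs_of`, and `upstream_versions`, as well as the version identifiers, are hypothetical stand-ins for whatever the database schema ends up providing.

```python
from typing import Dict, List, Set


def upstream_versions(
    dataset_version: str,
    produced_by: Dict[str, str],
    inputs_of: Dict[str, List[str]],
) -> Set[str]:
    """Collect every dataset version upstream of `dataset_version`.

    Hypothetical sketch:
      produced_by: dataset version -> ID of the run that produced it
      inputs_of:   run ID -> dataset versions that run consumed
    """
    upstream: Set[str] = set()
    stack = [dataset_version]
    while stack:
        version = stack.pop()
        run_id = produced_by.get(version)
        if run_id is None:
            continue  # no recorded producing run: treat as a source version
        for input_version in inputs_of.get(run_id, []):
            if input_version not in upstream:
                upstream.add(input_version)
                stack.append(input_version)
    return upstream


# Illustrative data only; the identifiers below are made up.
produced_by = {"delivery_7_days@v2": "run-2", "orders@v5": "run-1"}
inputs_of = {
    "run-2": ["orders@v5", "categories@v1"],
    "run-1": ["raw_orders@v9"],
}

print(sorted(upstream_versions("delivery_7_days@v2", produced_by, inputs_of)))
```

Each dataset version has at most one producing run in this sketch, which is what makes the upstream direction well defined for a specific version.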
|
||
## Next Steps | ||
|
||
Review of this proposal and production of a detailed design for the implementation, in particular for the point-in-time lineage, which might affect the database schema. |
The compare use case needs a proposal of its own 😉
Fair enough, I'm hoping someone else can take over that part and go into the details.
I'm thinking of starting with just point-in-time upstream lineage and adding compare later.
I think this proposal should be limited to point-in-time within column-level lineage. We should leave aside the compare feature and also point-in-time for the `lineage` endpoint, which has nothing to do with column level.