add more details
Signed-off-by: Julien Le Dem <julien@apache.org>
julienledem committed Aug 19, 2022
1 parent bd9cea9 commit 258f38c
Showing 1 changed file with 86 additions and 81 deletions.
167 changes: 86 additions & 81 deletions proposals/2045-column-lineage-endpoint.md
@@ -8,87 +8,75 @@ Dicussion: [column lineage endpoint issue #2045](https://github.com/MarquezProje

## Overview

OpenLineage defines a [column-level lineage facet](https://github.com/OpenLineage/OpenLineage/blob/ff0d87d30ed6c9fe39472788948266a6d3190585/spec/facets/ColumnLineageDatasetFacet.md).
We propose to add a Marquez endpoint leveraging this facet to filter down lineage for a given column.

### Use cases
- Find the current upstream dependencies of a column. A column in a dataset is derived from columns in upstream datasets.
- See column-level lineage in the dataset-level lineage when available.
- Retrieve point-in-time upstream lineage for a dataset or a column. What did the lineage look like yesterday compared to today?

### Existing elements

- OpenLineage defines a [column-level lineage facet](https://github.com/OpenLineage/OpenLineage/blob/ff0d87d30ed6c9fe39472788948266a6d3190585/spec/facets/ColumnLineageDatasetFacet.md).
- Marquez has a lineage endpoint `GET /api/v1/lineage` that returns the current lineage graph connected to a job or a dataset.

### New elements
We propose to add the following:
- Column lineage in the responses of the existing lineage endpoint.
- A new column-lineage endpoint leveraging the column lineage facet to retrieve lineage for a given column.
- Point-in-time upstream (dataset or column level) lineage given a version of a dataset.

## Proposal

### Add column lineage to the existing endpoint
In the `GET /api/v1/lineage` API, add column lineage to the `data` of `DATASET` nodes:
```
{
"id": "dataset:food_delivery:public.categories",
"type": "DATASET",
"data": {
"type": "DATASET",
"id": {
"namespace": "food_delivery",
"name": "public.categories"
},
"type": "DB_TABLE",
"name": "public.categories",
"physicalName": "public.categories",
"createdAt": "2021-03-09T02:33:18.468719Z",
"updatedAt": "2022-08-04T05:08:09.190723Z",
"namespace": "food_delivery",
"sourceName": "analytics_db",
"fields": [{
"name": "id",
"type": "INTEGER",
"tags": [],
"description": "The unique ID of the category."
}, {
"name": "name",
"type": "VARCHAR",
"tags": [],
"description": "The name of the category."
}, {
"name": "menu_id",
"type": "INTEGER",
"tags": [],
"description": "The ID of the menu related to the category."
}, {
"name": "description",
"type": "TEXT",
"tags": [],
"description": "The description of the category."
}],
> columnLineage: {
> "a": {
> inputFields: [
> {namespace: "ns", name: "name", "field": "a"},
> ... other inputs
> ],
> transformationDescription: "identical",
> transformationType: "IDENTITY"
> },
> "b": ... other output fields
> }
"tags": [],
"lastModifiedAt": "2022-08-04T05:03:09.190723Z",
"description": null,
"lastlifecycleState": null
},
"inEdges": [{
"origin": "job:food_delivery:etl_orders.etl_categories",
"destination": "dataset:food_delivery:public.categories"
}],
"outEdges": [{
"origin": "dataset:food_delivery:public.categories",
"destination": "job:food_delivery:etl_orders.etl_orders_7_days"
}]
}
```

```diff
{
"id": "dataset:food_delivery:public.categories",
"type": "DATASET",
"data": {
"type": "DATASET",
"id": {
"namespace": "food_delivery",
"name": "public.categories"
},
"type": "DB_TABLE",
...
"fields": [{
...
}],
> columnLineage: {
> "a": {
> inputFields: [
> {namespace: "ns", name: "name", "field": "a"},
> ... other inputs
> ],
> transformationDescription: "identical",
> transformationType: "IDENTITY"
> },
> "b": ... other output fields
> }
},
"inEdges": [{
"origin": "job:food_delivery:etl_orders.etl_categories",
"destination": "dataset:food_delivery:public.categories"
}],
"outEdges": [{
"origin": "dataset:food_delivery:public.categories",
"destination": "job:food_delivery:etl_orders.etl_orders_7_days"
}]
}
```
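
As a rough illustration of how a client might consume the proposed `columnLineage` block, the Python sketch below lists the upstream fields of each output column. The node literal is a trimmed, hypothetical example that follows the payload shape proposed here, and the helper name `upstream_fields` is an assumption for illustration, not part of Marquez.

```python
from typing import Dict, List, Tuple

# Hypothetical DATASET node, trimmed to the parts relevant to column lineage.
node = {
    "id": "dataset:food_delivery:public.categories",
    "type": "DATASET",
    "data": {
        "namespace": "food_delivery",
        "name": "public.categories",
        "columnLineage": {
            "a": {
                "inputFields": [
                    {"namespace": "ns", "name": "name", "field": "a"},
                ],
                "transformationDescription": "identical",
                "transformationType": "IDENTITY",
            },
        },
    },
}


def upstream_fields(dataset_node: dict) -> Dict[str, List[Tuple[str, str, str]]]:
    """Map each output column to its (namespace, dataset, field) inputs."""
    lineage = dataset_node.get("data", {}).get("columnLineage", {})
    return {
        column: [
            (f["namespace"], f["name"], f["field"])
            for f in details.get("inputFields", [])
        ]
        for column, details in lineage.items()
    }


result = upstream_fields(node)
print(result)
```

Nodes without a `columnLineage` entry simply yield an empty mapping, which matches the "when available" wording in the use cases.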

### Add a column-level lineage endpoint

```
GET /column-lineage?nodeId=dataset:food_delivery:public.delivery_7_days&column=a
```
`column` is a new parameter that must be a column in the schema of the dataset identified by `nodeId`.

The logic is layered on the existing lineage endpoint, filtering down to the datasets that contribute to that column.
It only returns dataset nodes.

```diff
{
graph: [
{
@@ -97,31 +85,48 @@ It also only returns dataset nodes
data: {
namespace: "DB1",
name: "table2",
> columnLineage: {
> "a": {
> inputFields: [
> {namespace: "DB1", name: "table1", "field": "a"}
> ],
> transformationDescription: "identical",
> transformationType: "IDENTITY"
> },
> "b": ... other output fields
> }
},
...
}
]
}
```
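
The filtering described for this endpoint can be pictured as a transitive walk over `inputFields`. The sketch below is only one possible reading of it, using a hypothetical in-memory map (`DIRECT_INPUTS`) in place of whatever storage Marquez would actually query. Apart from `DB1.table1` and `DB1.table2`, the table and column names are invented for the example.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Column = Tuple[str, str, str]  # (namespace, dataset name, field)

# Hypothetical lineage data: output column -> its direct input columns.
DIRECT_INPUTS: Dict[Column, List[Column]] = {
    ("DB1", "table3", "a"): [("DB1", "table2", "a")],
    ("DB1", "table2", "a"): [("DB1", "table1", "a")],
    ("DB1", "table2", "b"): [("DB1", "table0", "b")],
}


def contributing_datasets(start: Column) -> Set[Tuple[str, str]]:
    """Return every (namespace, dataset) that feeds `start`, transitively."""
    seen: Set[Column] = {start}
    queue = deque([start])
    datasets: Set[Tuple[str, str]] = set()
    while queue:
        current = queue.popleft()
        for parent in DIRECT_INPUTS.get(current, []):
            if parent not in seen:  # guards against cycles and repeats
                seen.add(parent)
                queue.append(parent)
                datasets.add((parent[0], parent[1]))
    return datasets


upstream = contributing_datasets(("DB1", "table3", "a"))
print(sorted(upstream))
```

In this example `table0` is left out because it only feeds column `b`, which is the narrowing that a column-level query adds over the dataset-level lineage graph.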

### Point-in-time upstream lineage
Return historical upstream lineage from a given dataset version.
This adds the version element to the `nodeId` in both the existing `/api/v1/lineage` endpoint and the newly proposed `/api/v1/column-lineage` endpoint.
```
GET /lineage?nodeId=dataset:food_delivery:public.delivery_7_days:{version}
GET /column-lineage?nodeId=dataset:food_delivery:public.delivery_7_days:{version}&column=a
```
This proposal returns only upstream lineage, because upstream lineage is well defined for a specific version while downstream lineage is not. The data payload would also add a version field.
```diff
{
graph: [
{
< "id": "dataset:db1:table2",
> "id": "dataset:db1:table2#{VERSION UUID}",
"type": "DATASET",
data: {
namespace: "DB1",
name: "table2",
> version: "{VERSION UUID}"
...
}
}
]
}
```
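
To make the request shapes concrete, here is a small Python sketch that assembles the URLs for the current and point-in-time variants. The base URL, the helper name `lineage_url`, and the sample version value are assumptions for illustration. The version is appended to `nodeId` with a colon, as in the request examples above.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000/api/v1"  # assumed host and port


def lineage_url(
    namespace: str,
    dataset: str,
    version: Optional[str] = None,
    column: Optional[str] = None,
) -> str:
    """Build a lineage or column-lineage URL for a dataset node."""
    node_id = f"dataset:{namespace}:{dataset}"
    if version is not None:
        # Point-in-time lookup: the version becomes part of the nodeId.
        node_id = f"{node_id}:{version}"
    params = {"nodeId": node_id}
    endpoint = "lineage"
    if column is not None:
        endpoint = "column-lineage"
        params["column"] = column
    # Keep ':' readable in the query string instead of percent-encoding it.
    return f"{BASE_URL}/{endpoint}?{urlencode(params, safe=':')}"


SAMPLE_VERSION = "00000000-0000-0000-0000-000000000000"  # placeholder value
current = lineage_url("food_delivery", "public.delivery_7_days")
historical = lineage_url(
    "food_delivery", "public.delivery_7_days", version=SAMPLE_VERSION, column="a"
)
print(current)
print(historical)
```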

## Implementation
