# Dataset symlink demo

A dataset is sometimes identified by different names depending on the context. This can happen with Spark tables, which can be identified either by their physical location or by a logical identifier made of a catalog and table name. This inconsistency can break lineage: imagine one job writing to a physical location and another reading the same data by catalog and table name. With no extra information, the backend cannot tell that both jobs touch the same dataset, so it cannot produce consistent lineage.

The dataset symlink feature is an extra facet on a dataset that says: *Hey, these are the alternate names of this dataset*.

The information is stored on the Marquez side and, more importantly, it is used to identify a dataset by any of its possible names. If a dataset's primary name is a physical location, we can still retrieve it by its logical name (catalog and table name), provided that name was given in the symlink facet. All the names are used to create edges of the lineage graph.
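The lookup behaviour described above can be pictured with a small in-memory sketch. This is not how Marquez implements it; `SymlinkRegistry`, `register`, and `resolve` are hypothetical names used only for illustration, and the dataset names are the ones that appear later in this demo.

```python
from typing import Dict, Iterable, Optional, Tuple

DatasetId = Tuple[str, str]  # (namespace, name)


class SymlinkRegistry:
    """Hypothetical illustration: maps every known name to one primary dataset id."""

    def __init__(self) -> None:
        self._primary_by_name: Dict[DatasetId, DatasetId] = {}

    def register(self, primary: DatasetId, symlinks: Iterable[DatasetId] = ()) -> None:
        # The primary name resolves to itself; every symlink resolves to the primary.
        self._primary_by_name[primary] = primary
        for alias in symlinks:
            self._primary_by_name[alias] = primary

    def resolve(self, name: DatasetId) -> Optional[DatasetId]:
        # Returns the primary id, or None if the name was never registered.
        return self._primary_by_name.get(name)


registry = SymlinkRegistry()
registry.register(
    primary=("file", "/home/jovyan/notebooks/spark-warehouse/some_table"),
    symlinks=[("hive://metastore", "default.some_table")],
)
print(registry.resolve(("hive://metastore", "default.some_table")))
```

Because both names resolve to the same primary id, a job that writes by physical location and a job that reads by logical name end up attached to the same node.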

This demo shows the above in action.

Let us first verify that Marquez is up and running:

In [1]:
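import json
import requests

marquez_url = "http://marquez:5000"  # this may depend on your local setup

try:
    # A 200 response from the namespaces endpoint means the API is reachable.
    ok = requests.get("{}/api/v1/namespaces".format(marquez_url)).status_code == 200
except requests.exceptions.RequestException:
    ok = False  # connection errors would otherwise raise instead of printing
print("Marquez is OK." if ok else "Cannot connect to Marquez")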

Marquez is OK.


Next we start a local Spark session with the OpenLineage connector enabled.
We enable Hive support for the Spark session and specify a fake Hive metastore, "http://metastore". This trick emulates a dataset that has two possible identifiers:
 * its physical location on disk,
 * its logical location identified by a catalog and table name.

In [3]:
from pyspark.sql import SparkSession

import warnings
warnings.filterwarnings('ignore')

spark = (SparkSession.builder.master('local')
         .appName('sample_spark')
         .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
         .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.15.1')
         .config('spark.openlineage.url', '{}/api/v1/namespaces/dataset-symlinks/'.format(marquez_url))
         .config("spark.sql.catalogImplementation", "hive")
         .config("spark.sql.hive.metastore.uris", "http://metastore")
         .enableHiveSupport()
         .getOrCreate())

In [4]:
spark.sql("DROP TABLE IF EXISTS some_table;")

22/10/25 12:57:53 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
22/10/25 12:57:53 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
22/10/25 12:57:57 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
22/10/25 12:57:57 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore UNKNOWN@172.18.0.4
22/10/25 12:58:00 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
22/10/25 12:58:00 WARN LogicalPlanSerializer: Can't register jackson scala module for serializing LogicalPlan
22/10/25 12:58:01 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
22/10/25 12:58:01 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
22/10/25 12:58:01 WARN DropTableCommandVisitor: Unable to find table by identifier `default`.`s

DataFrame[]

22/10/25 12:58:02 WARN DropTableCommandVisitor: Unable to find table by identifier `default`.`some_table` - Table or view 'some_table' not found in database 'default'


In [12]:
spark.sql("CREATE TABLE IF NOT EXISTS some_table (key INT, value STRING) USING hive;")

22/10/25 12:59:46 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
22/10/25 12:59:47 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
22/10/25 12:59:47 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
22/10/25 12:59:47 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
22/10/25 12:59:47 WARN HiveMetaStore: Location: file:/home/jovyan/notebooks/spark-warehouse/some_table specified for non-external table:some_table


DataFrame[]

Let's see the latest event sent to Marquez and its output datasets:

In [13]:
event = requests.get("{}/api/v1/events/lineage?limit=1".format(marquez_url)).json()['events'][0]

print(json.dumps(event['job'], indent=2))

{
  "namespace": "dataset-symlinks",
  "name": "sample_spark.execute_create_table_command",
  "facets": {
    "documentation": null,
    "sourceCodeLocation": null,
    "sql": null
  }
}


In [14]:
# dataset name 
print("Primary dataset name is \n\t namespace:'{}', \n\t name:'{}'".format(
    event['outputs'][0]['namespace'], 
    event['outputs'][0]['name']
))
print("\nSymlinks facet:")
# and symlink name
print(json.dumps(event['outputs'][0]['facets']['symlinks'], indent=2))

Primary dataset name is 
	 namespace:'file', 
	 name:'/home/jovyan/notebooks/spark-warehouse/some_table'

Symlinks facet:
{
  "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.15.1/integration/spark",
  "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
  "identifiers": [
    {
      "namespace": "hive://metastore",
      "name": "default.some_table",
      "type": "TABLE"
    }
  ]
}
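To make the relationship between the primary name and the symlink facet concrete, here is a minimal sketch that lists every name under which a dataset can be found. The helper `dataset_names` is hypothetical (it is not part of OpenLineage or Marquez), and `sample_output` is a hand-written copy of the relevant fields printed above.

```python
from typing import Any, Dict, List, Tuple


def dataset_names(dataset: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return the primary (namespace, name) followed by any symlink identifiers."""
    names = [(dataset["namespace"], dataset["name"])]
    # The symlinks facet is optional, so fall back to empty containers.
    symlinks = (dataset.get("facets") or {}).get("symlinks") or {}
    for identifier in symlinks.get("identifiers", []):
        names.append((identifier["namespace"], identifier["name"]))
    return names


# Hand-written sample mirroring the output dataset shown above.
sample_output = {
    "namespace": "file",
    "name": "/home/jovyan/notebooks/spark-warehouse/some_table",
    "facets": {
        "symlinks": {
            "identifiers": [
                {"namespace": "hive://metastore", "name": "default.some_table", "type": "TABLE"}
            ]
        }
    },
}

for namespace, name in dataset_names(sample_output):
    print(namespace, name)
```

In the notebook, the same helper could be called with `event['outputs'][0]` instead of the sample dictionary.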


## Get dataset by different names

Let's URL-encode the dataset name (`/home/jovyan/notebooks/spark-warehouse/some_table`), retrieve the dataset by namespace and name, and print its `id`:

In [15]:
import urllib.parse  # the parse submodule must be imported explicitly
dataset = requests.get(
    "{}/api/v1/namespaces/file/datasets/{}".format(
        marquez_url, 
        urllib.parse.quote_plus(event['outputs'][0]['name'])
    )
).json()
print(json.dumps(dataset['id'], indent=2))

{
  "namespace": "file",
  "name": "/home/jovyan/notebooks/spark-warehouse/some_table"
}


Let's now try to retrieve the dataset by its symlink name. We expect to get the same dataset even though we request it by a different name:

In [16]:
dataset = requests.get(
    "{}/api/v1/namespaces/{}/datasets/{}".format(
        marquez_url, 
        urllib.parse.quote_plus("hive://metastore"),
        "default.some_table"
    )
).json()
print(json.dumps(dataset['id'], indent=2))

{
  "namespace": "file",
  "name": "/home/jovyan/notebooks/spark-warehouse/some_table"
}
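Both lookups above build the same kind of URL, so the encoding step can be factored into one place. The sketch below is only a convenience wrapper: `dataset_url` is a hypothetical helper name, and the endpoint path is the one used in the cells above.

```python
from urllib.parse import quote


def dataset_url(base_url: str, namespace: str, name: str) -> str:
    """Build the dataset endpoint URL, percent-encoding namespace and name."""
    # safe="" makes quote() encode "/" and ":" too, which both appear in
    # namespaces such as "hive://metastore" and in file paths.
    return "{}/api/v1/namespaces/{}/datasets/{}".format(
        base_url.rstrip("/"),
        quote(namespace, safe=""),
        quote(name, safe=""),
    )


print(dataset_url("http://marquez:5000", "hive://metastore", "default.some_table"))
print(dataset_url("http://marquez:5000", "file", "/home/jovyan/notebooks/spark-warehouse/some_table"))
```

Encoding both path segments avoids the situation where a slash inside a namespace or name is interpreted as a path separator.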


## Create lineage graph with symlink edges

Assume there is another job that reads the Hive table by its logical (catalog) name.

In [17]:
other_job_event = {
    'eventType': 'COMPLETE',
    'eventTime': '2022-10-25T11:35:31.341Z',
    'job': {
        'namespace': 'another-job-namespace',
        'name': 'another-job',
        'facets': {'documentation': None, 'sourceCodeLocation': None, 'sql': None}
    },
    'run': {
        'runId': 'ae8b3ab7-254b-4d81-9c0a-c152f6sdac90'
    },
    'inputs': [
        {
            'namespace': 'hive://metastore', # read a dataset by its logical name
            'name': 'default.some_table'
        }
    ],
    'outputs': [
         {
            'namespace': 'another-namespace', # write to some other dataset
            'name': 'another-table'
        }
    ],
    'producer': 'https://github.com/OpenLineage/OpenLineage/tree/0.15.1/integration/spark',
    'schemaURL': 'https://openlineage.io/spec/1-0-3/OpenLineage.json#/$defs/RunEvent'
}
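The `runId` and `eventTime` above are hard-coded. When adapting this event for repeated runs, fresh values could be generated instead, as in the sketch below. `new_run_metadata` is a hypothetical helper, and the timestamp format simply mimics the one used in the event above.

```python
import uuid
from datetime import datetime, timezone


def new_run_metadata() -> dict:
    """Return a fresh run id and a UTC timestamp shaped like the event above."""
    now = datetime.now(timezone.utc)
    return {
        "runId": str(uuid.uuid4()),  # random UUID, unique per run
        # e.g. 2022-10-25T11:35:31.341Z (millisecond precision, "Z" suffix)
        "eventTime": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
    }


meta = new_run_metadata()
print(meta)
```

The returned values could then be assigned to `other_job_event['run']['runId']` and `other_job_event['eventTime']` before posting.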

In [18]:
requests.post(
    "{}/api/v1/lineage/".format(marquez_url), 
    data=json.dumps(other_job_event), 
    headers={"Content-Type":"application/json"}
)

<Response [201]>

Let's fetch the lineage graph for `/home/jovyan/notebooks/spark-warehouse/some_table`:

In [19]:
graph = requests.get(
    "{}/api/v1/lineage?nodeId=dataset:file:{}".format(
        marquez_url, 
        urllib.parse.quote_plus("/home/jovyan/notebooks/spark-warehouse/some_table")
    )
).json()['graph']

and print the nodes that correspond to datasets:

In [20]:
for node in graph:
    if node['type']=='DATASET':
        print(node['id'])

dataset:another-namespace:another-table
dataset:file:/home/jovyan/notebooks/spark-warehouse/some_table


The node `dataset:another-namespace:another-table` is on the list, which means Marquez was able to build a lineage dependency based on a symlink.
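The same check can be written as an assertion rather than read off the printed list. The sketch below runs against a hand-written `sample_graph` that mimics only the `id` and `type` fields used in the loop above; the job node id in it is an assumption made for illustration, and `dataset_node_ids` is a hypothetical helper.

```python
from typing import Any, Dict, List


def dataset_node_ids(graph: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted ids of all DATASET nodes in a lineage graph."""
    return sorted(node["id"] for node in graph if node["type"] == "DATASET")


# Hand-written sample; the JOB node id format is assumed, not taken from Marquez output.
sample_graph = [
    {"id": "dataset:file:/home/jovyan/notebooks/spark-warehouse/some_table", "type": "DATASET"},
    {"id": "job:another-job-namespace:another-job", "type": "JOB"},
    {"id": "dataset:another-namespace:another-table", "type": "DATASET"},
]

ids = dataset_node_ids(sample_graph)
# The downstream dataset is reachable only if the symlink was resolved.
assert "dataset:another-namespace:another-table" in ids
print(ids)
```

In the notebook, passing the `graph` variable fetched above in place of `sample_graph` would turn the visual check into an automated one.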