Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally capture memory only actions like collect(), count() etc #361

Closed
ganeshnikumbh opened this issue Nov 27, 2021 · 6 comments · Fixed by #565
Closed

Optionally capture memory only actions like collect(), count() etc #361

ganeshnikumbh opened this issue Nov 27, 2021 · 6 comments · Fixed by #565
Assignees
Labels
Milestone

Comments

@ganeshnikumbh
Copy link

ganeshnikumbh commented Nov 27, 2021

Hi,

I have code where I am collecting the spark DF rows into a variable and then iterating over those rows to insert in Cosmos DB graph. When I run this notebook I expect to see some lineage generated in driver logs (I tried setting the dispatcher to console as well as logging). But I do not see anything.
But I replace the collect action with writing to a file in Azure datalake, then I can see the lineage details in driver logs.
Any idea what could be the issue?

@wajda
Copy link
Contributor

wajda commented Nov 28, 2021

This is as designed - only persistent actions lineage is captured. The reason is that the data stored into memory doesn't have any stable identifier, so the lineage of it might only be useful temporarily, but not quite in a long term perspective.

On the other hand, when thinking about it after some time passed I acknowledge the existing of a valid use case for that.
I think about adding a configuration property via that the user would be able to enable capturing lineage from memory only actions.

@wajda wajda added the feature label Nov 28, 2021
@wajda wajda added this to the 1.0.0 milestone Nov 28, 2021
@ganeshnikumbh
Copy link
Author

@wajda , thanks for clarifying. I hope to see the feature soon :)..

@wajda wajda reopened this Nov 29, 2021
@wajda wajda modified the milestones: 1.0.0, 1.1.0 Dec 28, 2021
@wajda wajda changed the title Lineage no generated for collect and count actions Optionally capture memory only actions like collect(), count() etc Sep 22, 2022
@yuribogomolov
Copy link

+100 to the feature request.
We found multiple coverage gaps caused by this. The pattern is quite common for data quality jobs, validate-and-publish workflows and ML flows (not all ML methods are supported by Spark and data scientists usually aggregate data in Spark and use df.toPandas() to collect data and run a model before moving data back to Spark).

@wajda
Copy link
Contributor

wajda commented Oct 18, 2022

The lineage data model requires existence of a target data source that is represented by a URI.
I guess we could simply use a made up URI that would contain a host name, a type and a hash code of a result object/collection. Something like memory://1.2.3.4/[Lorg.apache.spark.sql.Row;@486e6b30

@yuribogomolov
Copy link

Yeah, that works. I guess multiple steps need changes to support "read-only" actions. collect and toPandas are implemented outside of Catalyst, and the only action-specific information provided by the Spark QueryListener is available in funcName, but it's not propagated to SplineAgent and downstream classes at this point.

@wajda wajda modified the milestones: 1.1.0, 1.0.0 Oct 19, 2022
@wajda wajda self-assigned this Dec 20, 2022
wajda added a commit that referenced this issue Dec 22, 2022
…e classes. Add more attributes to their signature.
wajda added a commit that referenced this issue Dec 22, 2022
wajda added a commit that referenced this issue Dec 22, 2022
wajda added a commit that referenced this issue Dec 22, 2022
@wajda
Copy link
Contributor

wajda commented Dec 22, 2022

Fixed.
Capturing of non-persistent actions are disabled by default. To enable it the following property needs to be set:
as YAML:

spline:
    plugins:
        za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin:
            enabled: true

or in a properties format:

spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true

wajda added a commit that referenced this issue Dec 22, 2022
wajda added a commit that referenced this issue Dec 22, 2022
wajda added a commit that referenced this issue Dec 22, 2022
* issue #361 pass `funcName` down to the harvester components

* issue #361 Deduplicate "Synthetic" flag in components' extras

* issue #361 Promote type aliases ReadNodeInfo and WriteNodeInfo to case classes. Add more attributes to their signature.

* issue #361 Make plugins configurable

* issue #361 Optionally capture memory only actions like `collect()`, `count()` etc

* (no issue) Rename RDD example job for clarity

* issue #361 Scala 2.11 syntax fix

* issue #361 README

* issue #361 add test

* issue #565 minor: remove unnecessary "toString" call

* issue #565 Restart Spark context after enabling non-persistent actions plugin
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants