
Update spark job name to reflect spark application name and execution node #1191

Merged 2 commits from spark_job_name into main on Apr 6, 2021

Conversation

collado-mike (Collaborator)

This change updates the job naming behavior, creating a new job name for each query execution in the Spark application. For each query execution, a new job is created with a unique runId and a parent facet pointing to the run identified by the parameters passed into the agent.

As an example, one job name I generated in testing was orders_dump_to_gcs.execute_insert_into_hadoop_fs_relation_command, where orders_dump_to_gcs is the Spark application name and execute_insert_into_hadoop_fs_relation_command is the node name returned by the DataWritingCommandExec physical plan node.
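
For illustration, here is a minimal sketch of that naming scheme, not the agent's actual code: the class and helper names are hypothetical, and the snake-casing rule is inferred from the example above.

```java
import java.util.UUID;

// Hypothetical sketch of the naming convention described in this PR.
// Assumes the node name comes from the physical plan node (e.g.
// "Execute InsertIntoHadoopFsRelationCommand") and is lower-snake-cased
// before being appended to the Spark application name.
public class JobNameSketch {

  // "Execute InsertIntoHadoopFsRelationCommand"
  //   -> "execute_insert_into_hadoop_fs_relation_command"
  static String toSnakeCase(String nodeName) {
    return nodeName
        .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
        .replaceAll("\\s+", "_")
        .toLowerCase();
  }

  static String jobName(String appName, String nodeName) {
    return appName + "." + toSnakeCase(nodeName);
  }

  public static void main(String[] args) {
    // Prints "orders_dump_to_gcs.execute_insert_into_hadoop_fs_relation_command"
    System.out.println(
        jobName("orders_dump_to_gcs", "Execute InsertIntoHadoopFsRelationCommand"));

    // Each query execution would also get a fresh, unique runId:
    System.out.println(UUID.randomUUID());
  }
}
```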

codecov bot commented Apr 6, 2021

Codecov Report

Merging #1191 (807212c) into main (4438c5d) will not change coverage.
The diff coverage is n/a.


@@            Coverage Diff            @@
##               main    #1191   +/-   ##
=========================================
  Coverage     74.44%   74.44%           
  Complexity      803      803           
=========================================
  Files           180      180           
  Lines          4790     4790           
  Branches        368      368           
=========================================
  Hits           3566     3566           
  Misses          852      852           
  Partials        372      372           


Signed-off-by: Michael Collado <mike@datakin.com>
@@ -2,7 +2,7 @@
   "eventType": "COMPLETE",
   "eventTime": "2021-01-01T00:00:00Z",
   "run": {
-    "runId": "ea445b5c-22eb-457a-8007-01c7c52b6e54",
+    "runId": "fake_run_id",
     "facets": {
       "parent": {
A reviewer (Member) commented on this diff:

Minor: Did you also want to update the parent runID used in the test?

@wslulciuc (Member) left a comment:

This looks great, @collado-mike! The naming convention now more closely aligns with our Airflow integration. Excited to see how a Spark job launched via Airflow can be linked to its parent runID (= the operator that submitted the Spark job) and displayed in our lineage graph.

@wslulciuc merged commit 9e167e5 into main on Apr 6, 2021
@wslulciuc deleted the spark_job_name branch on Apr 6, 2021 23:12
@wslulciuc added this to Review in Marquez 0.14.0 via automation on Apr 8, 2021
@wslulciuc moved this from Review to Done in Marquez 0.14.0 on Apr 8, 2021