
Is it possible to get lineage from a Spark execution plan? #659

Closed
liujiawen opened this issue Apr 20, 2023 · 2 comments

@liujiawen

Background

In my production environment I've found it hard to add a listener to the Spark shell, so I thought it might work to get the execution plans of Spark jobs from the history server instead. Is it possible to use an execution plan I downloaded rather than the Spark job listener?
Thank you!
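
For context, roughly what I had in mind: on newer Spark versions (Spark 3.1+, if I'm not mistaken) the history server's REST API can return SQL execution details, including a textual plan description. A rough sketch; the host, port, and application ID below are made up:

```scala
import scala.io.Source

// Sketch only: fetch SQL execution details (including plan descriptions)
// from the Spark history server REST API. Host, port, and application ID
// are placeholders.
object FetchPlanFromHistoryServer {
  def main(args: Array[String]): Unit = {
    val appId = "application_1681891424000_0001"
    val url   = s"http://history-server:18080/api/v1/applications/$appId/sql?details=true"
    val json  = Source.fromURL(url, "UTF-8").mkString
    println(json) // JSON array of SQL executions with plan descriptions
  }
}
```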

Feature

Add a feature that reads a Spark execution plan file, extracts the lineage, and sends the data to the Spline producer.
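
For illustration only, a very rough sketch of what such a reader could look like if it operated on the textual plan (the object name, regexes, and file argument are hypothetical; as the answer below notes, the real agent relies on the structured plan available at runtime):

```scala
import scala.io.Source

// Hypothetical sketch: pull lineage endpoints out of a textual logical plan
// like the example below. It recognizes only two node types and is no
// substitute for the structured plan the agent harvests at runtime.
object PlanFileLineage {
  private val Write = """InsertIntoHadoopFsRelationCommand (\S+),""".r
  private val Read  = """HiveTableRelation \[([\w.]+),""".r

  def main(args: Array[String]): Unit = {
    val plan   = Source.fromFile(args(0), "UTF-8").mkString
    val writes = Write.findAllMatchIn(plan).map(_.group(1)).toList
    val reads  = Read.findAllMatchIn(plan).map(_.group(1)).toList
    println(s"reads:  $reads")  // e.g. List(dbinit.ods_example_table)
    println(s"writes: $writes") // e.g. List(s3://example/dw/ods/ods_example_table)
    // Posting the result to the Spline producer API would go here.
  }
}
```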

Example [Optional]

Example logical plan:

```
InsertIntoHadoopFsRelationCommand s3://example/dw/ods/ods_example_table, [dt=2023-04-18], false, [dt#62], Parquet, [field.delim=\u0001, line.delim=\n, serialization.format=\u0001, partitionOverwriteMode=DYNAMIC, parquet.compression=SNAPPY, mergeSchema=false], Overwrite, CatalogTable(
Database: default
Table: ods_example_table
Owner: hdfs
Created Time: Wed Apr 19 16:03:44 CST 2023
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: EXTERNAL
Provider: hive
Table Properties: [bucketing_version=2, parquet.compression=SNAPPY, serialization.null.format=, transient_lastDdlTime=1681891424]
Location: s3://example/dw/ods/ods_example_table
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=\u0001, line.delim=\n, field.delim=\u0001]
Partition Provider: Catalog
Partition Columns: [dt]
Schema: root
|-- seller_id: string (nullable = true)
|-- code: string (nullable = true)
|-- country_code: string (nullable = true)
|-- org_code: string (nullable = true)
|-- name_cn: string (nullable = true)
|-- name_en: string (nullable = true)
|-- api: string (nullable = true)
|-- create_time: string (nullable = true)
|-- creater: string (nullable = true)
|-- update_time: string (nullable = true)
|-- operator: string (nullable = true)
|-- status: string (nullable = true)
|-- org_code_add: string (nullable = true)
|-- user_name: string (nullable = true)
|-- is_online: string (nullable = true)
|-- dt: string (nullable = true)
), org.apache.spark.sql.execution.datasources.CatalogFileIndex@a03e7be2, [seller_id, code, country_code, org_code, name_cn, name_en, api, create_time, creater, update_time, operator, status, org_code_add, user_name, is_online, dt]
+- Project [seller_id#47, code#48, country_code#49, org_code#50, name_cn#51, name_en#52, api#53, create_time#54, creater#55, update_time#56, operator#57, status#58, org_code_add#59, user_name#60, is_online#61, cast(2023-04-18 as string) AS dt#62]
   +- Project [cast(seller_id#0 as string) AS seller_id#47, cast(code#1 as string) AS code#48, cast(country_code#2 as string) AS country_code#49, cast(org_code#3 as string) AS org_code#50, cast(name_cn#4 as string) AS name_cn#51, cast(name_en#5 as string) AS name_en#52, cast(api#6 as string) AS api#53, cast(create_time#7 as string) AS create_time#54, cast(creater#8 as string) AS creater#55, cast(update_time#9 as string) AS update_time#56, cast(operator#10 as string) AS operator#57, cast(status#11 as string) AS status#58, cast(org_code_add#12 as string) AS org_code_add#59, cast(user_name#13 as string) AS user_name#60, cast(is_online#14 as string) AS is_online#61]
      +- Project [seller_id#0, code#1, country_code#2, org_code#3, name_cn#4, name_en#5, api#6, create_time#7, creater#8, update_time#9, operator#10, status#11, org_code_add#12, user_name#13, is_online#14]
         +- SubqueryAlias spark_catalog.dbinit.ods_example_table
            +- HiveTableRelation [dbinit.ods_example_table, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [seller_id#0, code#1, country_code#2, org_code#3, name_cn#4, name_en#5, api#6, create_time#7, cre..., Partition Cols: []]
```

Proposed Solution [Optional]

@wajda (Contributor) commented Apr 21, 2023

No, the Spark Agent is meant to be used as a listener; the execution plan representation from the Spark history server isn't enough. Such functionality could, however, be implemented as another project, something like a Spline Agent for Spark History Server.
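
For reference, the listener-based wiring the agent is designed for looks roughly like this; a minimal sketch assuming the Spline agent bundle jar is on the classpath, with a placeholder producer URL:

```scala
import org.apache.spark.sql.SparkSession

object SplineListenerExample {
  def main(args: Array[String]): Unit = {
    // Codeless init: register Spline's query execution listener and point it
    // at the producer endpoint. The URL is a placeholder; the Spline agent
    // bundle jar must be on the classpath.
    val spark = SparkSession.builder()
      .appName("spline-listener-example")
      .master("local[*]") // local master just for the demo
      .config("spark.sql.queryExecutionListeners",
        "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
      .config("spark.spline.producer.url", "http://spline-server:8080/producer")
      .getOrCreate()

    // Any write from here on is harvested by the listener and sent to Spline.
    spark.range(10).write.mode("overwrite").parquet("/tmp/spline-demo")
    spark.stop()
  }
}
```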

wajda closed this as completed Apr 21, 2023
@liujiawen (Author)

> No, the Spark Agent is meant to be used as a listener; the execution plan representation from the Spark history server isn't enough. Such functionality could, however, be implemented as another project, something like a Spline Agent for Spark History Server.

Thank you, wajda, for your instructive answer!
