-
Notifications
You must be signed in to change notification settings - Fork 294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[INTEGRATION][AIRFLOW] Introduce a new extractor for Trino #1288
Conversation
Signed-off-by: Kengo Seki <sekikn@apache.org>
Though I submitted a PR, there's one known issue in the current implementation. Let me explain it. It correctly works if input and output database (or catalog, in Trino's terminology) is the same as the one TrinoOperator connects (it's specified by the extra parameter of Airflow's connection).
In this case, the OL proxy receives the following event after triggering the DAG, which contains both input and output correctly.
But if the catalog of input or output table is differ from the one TrinoOperator connects, that table is not picked up by the extractor (Trino allows users to insert or join tables over catalogs).
This is because SqlExtractor doesn't support the multiple database situation for now. I'd like to fix it in the future, but it may be time-taking, especially if it requires to dig into openlineage-sql, because I'm not familar with Rust :) |
As a note, there is separate Trino integration: https://github.com/takezoe/trino-openlineage/ In the end, internal integration is better because database always has the correct info, and it might be able to avoid any of the issues resulting from the use of a SQL parser.
I think the bug results from us not being able to get information about this table by using |
Totally agreed. I just got in touch with its author @takezoe, who is a famous developer in Japan as a creator of open source git-hosting software and an author of several Java and Scala books. He said that it is OK for him to contribute his code if the OL community welcomes it, but the current implementation is just a prototype and it will take some time to improve its quality. So I'm looking forward to his contribution in the future, when he finds some time to do it. ;) By the way, I think this extractor is still useful in some cases, for example users leverage Trino but don't have its ownership. So I'll also investigate the real cause of the problem in question. |
@sekikn Thanks for reaching out to @takezoe! We'll be happy to accept the contribution, and help with improving native Trino integration.
I agree. I'm happy to accept the current extractor, just that we have similar issue related to Airflow 2.1.4 as SFTP Extractor PR: #1263 In addition, could you open GitHub issue regarding multiple database problem? |
Thank you for summoning me :-) My trino-openlineage implementation is very experimental as @sekikn mentioned so it may take some time to make it practical. Anyway, I will try to find a time to do it! |
Yes, I did. @sekikn, you should set Integration tests need fixes - DAG lacks |
Thanks everyone, will update the PR! |
Hmm, only setting |
@sekikn I'm not too familiar with how Trino works with Airflow. I tested it with postgresql and tpch catalogs and after this change it works. Somehow I could not create table in memory catalog with |
Thank you for the investigation, @JDarDagran! Yes, I also ensured it works in some cases, but it doesn't seem to work with the current integration test for some reason. With
But the latter is lost from the event sent to the backend, and additionally, wrong database and schema+table seem to be wrongly combined. It still occurs if I inserted an additional task to create "default" schema inside the "memory" catalog before the insertion query.
I'm not familar enough with the functions in sql_extractor.py and dbapi_utils.py yet, so I think I should understand them first. |
I finally figured out its cause. I had to give the optional column which refers to the database name in information_schema to TrinoExtractor, otherwise dbapi_utils assumes that all tables and columns are in the current database. Thank you for your help @JDarDagran @mobuchowski, I'll update the PR soon! |
…fferent databases Signed-off-by: Kengo Seki <sekikn@apache.org>
b15830c
to
e9f9828
Compare
The CI failure occurred on the Spark tests and seems unrelated with this fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sekikn, thank you for your marvelous contribution!
Signed-off-by: Kengo Seki sekikn@apache.org
Problem
Currently OL doesn't have a built-in extractor for Trino.
Solution
This PR add TrinoExtractor corresponding to TrinoOperator.
If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports
S3
andGCS
filesystem operations, tested with AWS EMR).Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)