
spark: allow running custom integration tests as a CLI target #2692

Open
mobuchowski opened this issue May 13, 2024 · 0 comments
Labels: area:integration/spark, kind:feature (A request or addition of new functionality)

Comments

@mobuchowski (Member)

A common need is validating that the OpenLineage integration produces the expected OpenLineage events for a given Spark job.
However, that is not easy right now: you have to clone the project, set up the whole development environment, and, if you use custom dependencies or a customized Spark build, edit a lot of Java test files so that the tests resemble your actual environment.

The other option is to set the integration up against an actual production job, which has a different set of problems: among others, it requires you to run a real OpenLineage backend just to confirm that the emitted events match expectations.

However, we can simplify the experience by creating a test CLI application that reuses the common integration test framework and returns a binary OK/FAILED result given a Spark job, files with the expected JSON events, and a Docker image with Spark - allowing any customizations: custom dependencies or other deviations from the Apache-hosted Spark libs.

The idea is that we'd run a given Spark SQL statement within a pre-prepared job using the given Spark image, similar to what we do in the integration tests now. Expected events could be provided as separate files or as a single newline-delimited JSON file.
The resulting events could be matched either via mockserver, or via the File transport writing to a bind-mounted directory.
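If we go the File transport route, the job under test could simply be pointed at the bind-mounted directory through the standard OpenLineage Spark configuration. A rough sketch (the exact property names and paths below are assumptions and should be checked against the Spark integration docs):

spark-submit \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=file \
  --conf spark.openlineage.transport.location=/mnt/openlineage/events.ndjson \
  ...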

A first iteration could accept just a SQL job, without any other job customizations:

void testGivenSQL() {
    // read the SQL statement supplied by the user
    String givenSQL = readGivenSQL();

    // run it on the pre-configured SparkSession with the OpenLineage listener attached
    spark.sql(givenSQL);

    // compare the emitted OpenLineage events against the expected JSON
    checkInput(inputJson());
}
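checkInput and inputJson above are placeholders. A hypothetical sketch of how such a check could work against a newline-delimited expected-events file is below; this is not the existing test framework API, just an illustration, and a real implementation would presumably ignore volatile fields such as eventTime and runId before comparing:

// Hypothetical helper, assuming Jackson is on the classpath; not part of the
// actual integration test framework.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

class ExpectedEventsChecker {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // expectedFile: newline-delimited JSON, one expected OpenLineage event per line
    // emittedFile: events captured via the File transport (or dumped from mockserver)
    static boolean eventsMatch(Path expectedFile, Path emittedFile) throws Exception {
        List<JsonNode> expected = readEvents(expectedFile);
        List<JsonNode> emitted = readEvents(emittedFile);
        // naive exact-containment check: every expected event must appear verbatim
        return expected.stream().allMatch(emitted::contains);
    }

    private static List<JsonNode> readEvents(Path file) throws Exception {
        return Files.readAllLines(file).stream()
            .filter(line -> !line.isBlank())
            .map(ExpectedEventsChecker::parseEvent)
            .collect(Collectors.toList());
    }

    private static JsonNode parseEvent(String line) {
        try {
            return MAPPER.readTree(line);
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid JSON event: " + line, e);
        }
    }
}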

The CLI interface could look like this:

./openlineage-spark-test --sql test.sql --image custom-docker-image:1.0.0 --events events.json
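Here events.json is a hypothetical example of the expected-events file: one OpenLineage event per line, containing only the fields the user wants verified:

{"eventType": "START", "job": {"namespace": "default", "name": "my_spark_job"}}
{"eventType": "COMPLETE", "job": {"namespace": "default", "name": "my_spark_job"}, "outputs": [{"namespace": "file", "name": "/output/orders"}]}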
@pawel-big-lebowski pawel-big-lebowski self-assigned this May 14, 2024
@kacpermuda kacpermuda added kind:feature A request or addition of new functionality area:integration/spark labels May 17, 2024