
spark: allow running custom integration tests as a CLI target #2692

Open
mobuchowski opened this issue May 13, 2024 · 0 comments
Labels: area:integration/spark, kind:feature (A request or addition of new functionality)

Comments

@mobuchowski (Member)

A common need is validating that the OpenLineage integration produces the expected OpenLineage events for a given Spark job.
However, that is not easy right now: you have to clone the project, set up the whole development environment, and, if you use custom dependencies or a customized Spark build, edit a lot of Java test files so that the tests resemble your actual environment.

The other option is to set the integration up against an actual production job, which has a different set of problems: among others, it requires you to run a real OpenLineage backend just to confirm that the emitted events match expectations.

However, we can simplify the experience by creating a test CLI application that reuses the common integration test framework and returns a binary OK/FAILED result given a Spark job, files with the expected JSON events, and a Docker image with Spark - allowing any customizations: custom dependencies or other deviations from the Apache-hosted Spark libs.

The idea is that we'd run a given Spark SQL statement within a pre-prepared job using the given Spark image, similar to what we do in the integration tests now. Expected events could be provided as separate files or as a single newline-delimited JSON file.
The resulting events could be matched either via mockserver, or via the File transport writing to a bind-mounted directory.
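If we go the File transport route, the job under test could simply be pointed at the bind-mounted directory through the standard OpenLineage Spark configuration. A rough sketch (the exact property names and paths below are assumptions and should be checked against the Spark integration docs):

spark-submit \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=file \
  --conf spark.openlineage.transport.location=/mnt/openlineage/events.ndjson \
  ...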

A first iteration could accept just a SQL job, without any other job customizations:

void testGivenSQL() {
    // read the SQL statement supplied by the user
    String givenSQL = readGivenSQL();

    // run it on the pre-configured SparkSession with the OpenLineage listener attached
    spark.sql(givenSQL);

    // compare the emitted OpenLineage events against the expected JSON
    checkInput(inputJson());
}
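checkInput and inputJson above are placeholders. A hypothetical sketch of how such a check could work against a newline-delimited expected-events file is below; this is not the existing test framework API, just an illustration, and a real implementation would presumably ignore volatile fields such as eventTime and runId before comparing:

// Hypothetical helper, assuming Jackson is on the classpath; not part of the
// actual integration test framework.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

class ExpectedEventsChecker {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // expectedFile: newline-delimited JSON, one expected OpenLineage event per line
    // emittedFile: events captured via the File transport (or dumped from mockserver)
    static boolean eventsMatch(Path expectedFile, Path emittedFile) throws Exception {
        List<JsonNode> expected = readEvents(expectedFile);
        List<JsonNode> emitted = readEvents(emittedFile);
        // naive exact-containment check: every expected event must appear verbatim
        return expected.stream().allMatch(emitted::contains);
    }

    private static List<JsonNode> readEvents(Path file) throws Exception {
        return Files.readAllLines(file).stream()
            .filter(line -> !line.isBlank())
            .map(ExpectedEventsChecker::parseEvent)
            .collect(Collectors.toList());
    }

    private static JsonNode parseEvent(String line) {
        try {
            return MAPPER.readTree(line);
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid JSON event: " + line, e);
        }
    }
}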

The CLI interface could look like this:

./openlineage-spark-test --sql test.sql --image custom-docker-image:1.0.0 --events events.json
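Here events.json is a hypothetical example of the expected-events file: one OpenLineage event per line, containing only the fields the user wants verified:

{"eventType": "START", "job": {"namespace": "default", "name": "my_spark_job"}}
{"eventType": "COMPLETE", "job": {"namespace": "default", "name": "my_spark_job"}, "outputs": [{"namespace": "file", "name": "/output/orders"}]}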
@pawel-big-lebowski pawel-big-lebowski self-assigned this May 14, 2024
@kacpermuda kacpermuda added kind:feature A request or addition of new functionality area:integration/spark labels May 17, 2024