matthewgillett edited this page Dec 28, 2023 · 15 revisions

What does MegaSparkDiff Solve?

How do you compare and diff different object types in the cloud at scale? For example, comparing data in S3 to a JDBC source, or comparing a Hive table to a Hadoop file.

Data analysis between versions of objects requires a diff operation between data that resides in similar or different data sources. Because that diff must also run at scale, there is a need for a tool that can conduct the comparison between pair combinations of data sources at scale.

The use cases for data comparison fall into one of the following:

  • (a) Columnar based, i.e. there is a schema, and columns can be included in or excluded from the diff.
  • (b) Treat all inputs as text, i.e. the contents are treated as line-separated textual data (for example, flat files with variable column counts).
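
To make the two modes concrete, here is a minimal sketch in plain Java collections. It is not MegaSparkDiff's API and does not use Spark; the class and method names (`DiffSketch`, `onlyIn`, `project`) and the sample rows are hypothetical, chosen only for illustration. It assumes a diff is reported as two sets of rows (those found only on the left and those found only on the right) and, for simplicity, it ignores duplicate counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class DiffSketch {

    // Rows present in `left` but absent from `right`.
    // Assumption: duplicates are not counted, only presence matters.
    static <T> List<T> onlyIn(List<T> left, List<T> right) {
        Set<T> rightSet = new HashSet<>(right);
        return left.stream()
                .filter(row -> !rightSet.contains(row))
                .collect(Collectors.toList());
    }

    // Columnar mode helper: drop the excluded column indexes before comparing.
    static List<List<String>> project(List<List<String>> rows, Set<Integer> excluded) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < row.size(); i++) {
                if (!excluded.contains(i)) {
                    kept.add(row.get(i));
                }
            }
            out.add(kept);
        }
        return out;
    }

    public static void main(String[] args) {
        // (a) Columnar: compare on id and name, excluding the timestamp column (index 2).
        List<List<String>> left = Arrays.asList(
                Arrays.asList("1", "alice", "2023-01-01"),
                Arrays.asList("2", "bob", "2023-01-02"));
        List<List<String>> right = Arrays.asList(
                Arrays.asList("1", "alice", "2023-06-30"),
                Arrays.asList("2", "bobby", "2023-01-02"));
        Set<Integer> excluded = new HashSet<>(Arrays.asList(2));
        List<List<String>> l = project(left, excluded);
        List<List<String>> r = project(right, excluded);
        System.out.println("columnar, left only:  " + onlyIn(l, r));
        System.out.println("columnar, right only: " + onlyIn(r, l));

        // (b) Text: every line is one opaque string, so column counts may vary.
        List<String> leftLines = Arrays.asList("a,b,c", "d,e");
        List<String> rightLines = Arrays.asList("a,b,c", "d,e,f");
        System.out.println("text, left only:  " + onlyIn(leftLines, rightLines));
        System.out.println("text, right only: " + onlyIn(rightLines, leftLines));
    }
}
```

In the columnar case, row 1 differs only in the excluded timestamp column, so it drops out of the diff and only row 2 is reported on each side. In the text case any character difference makes a line distinct. A distributed implementation would apply the same idea to partitioned datasets instead of in-memory lists.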

MegaSparkDiff uses Spark internally to parallelize the comparison operation.

Pair Combinations of Data Sources

Execution Environment

(a) Databricks / EMR / EC2

Specify a pair of data sources for comparison. Execution is conducted via a Spark job parallelized over EMR or EC2. The report is saved to a file and/or can be sent by email once the comparison is done.

As a sub-component of a larger big data process, the user includes Apples3Apples as a Maven dependency in their big data process, invokes the diff operation, and receives the outputs as the return value of a method.

(b) Local Machine / CI Build (non-cluster runs)

  1. The user can launch the comparison on a local machine as an independent tool.
  2. The user can invoke the comparison by means of a Maven dependency in a JVM-based project.

Overall Use Case

Articles and How To

(a) Data Transformation Testing Using MegaSparkDiff To Test ETLs At Scale. Sample code can be found here.

(b) How to Visually Compare Data within a Databricks Notebook.