How to Compare Data within a DataBricks Environment

Ahmed Ibrahim edited this page Oct 6, 2018 · 7 revisions

The steps below show how to compare two DataFrames and visualize the results in a DataBricks notebook.

Sample data in DataFrames for comparison

The columns key1 and key2 together form the primary key of both tables.

val left = Seq(
  ("1", "1", "Adam", "Andreson"),
  ("2", "2", "Bob", "Branson"),
  ("4", "4", "Chad", "Charly"),
  ("5", "5", "Joe", "Smith"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal")
).toDF("key1", "key2", "value1", "value2")

val right = Seq(
  ("1", "1", null, null),
  ("3", "3", "Young", "Yan"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal"),
  (null, null, "null key", "null key")
).toDF("key1", "key2", "value1", "value2")

Initialize MSD for DataBricks

The method receives the notebook's spark session and keeps a reference to it for later Spark operations.

import org.finra.msd.sparkfactory.SparkFactory
SparkFactory.initializeDataBricks(spark)

Compare and Visualize

Note that the val key holds the sequence of column names that make up the primary key. The parameter 100 specifies how many records to display as HTML.

import org.finra.msd.sparkcompare.SparkCompare
import org.finra.msd.visualization.Visualizer
val comparisonResult = SparkCompare.compareSchemaDataFrames(left, right)
val key: Seq[String] = Seq("key1", "key2")
val joinedDf = SparkCompare.fullOuterJoinDataFrames(comparisonResult.getLeft, comparisonResult.getRight, key)
val html = Visualizer.renderHorizontalTable(joinedDf, 100)
displayHTML(html)

Step By Step

(1) Create a Cluster

Create a DataBricks Cluster

(2) Import MegaSparkDiff from Maven as a Library

Import MegaSparkDiff as a Library in DataBricks

(3) After Import, Make Sure It Is Attached to the Cluster

Make Sure MegaSparkDiff is Attached to the Cluster

(4) Create a New Scala Notebook and Attach to the New Cluster

Create a new Scala Notebook

(5) Comparison Results Displayed in a DataBricks Notebook

Comparison Results Displayed in a DataBricks Notebook

(6) Results Explained

The key columns are the ones in the middle. The differences between left and right are highlighted in yellow.
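To make the layout of the result table easier to read, here is a minimal sketch of a full outer join on a composite key, written with plain Scala collections. It is an illustration only and is not meant to mirror what SparkCompare does internally. The names Record, Joined and fullOuterJoin are hypothetical, the rows are a subset of the sample data above, and the sketch assumes that each key appears at most once per side and that no key is null.

```scala
// Illustration only: plain Scala collections, not Spark and not MegaSparkDiff internals.
final case class Record(key1: String, key2: String, value1: String, value2: String)

// One joined row: the composite key plus whichever side(s) carry that key.
final case class Joined(key: (String, String), left: Option[Record], right: Option[Record]) {
  // A row is a difference when one side is missing or the two records disagree.
  def differs: Boolean = left != right
}

// Assumption: keys are unique within each side and never null.
def fullOuterJoin(left: Seq[Record], right: Seq[Record]): Seq[Joined] = {
  val leftByKey = left.map(rec => (rec.key1, rec.key2) -> rec).toMap
  val rightByKey = right.map(rec => (rec.key1, rec.key2) -> rec).toMap
  // Every key seen on either side yields exactly one joined row.
  (leftByKey.keySet ++ rightByKey.keySet).toSeq.sorted
    .map(k => Joined(k, leftByKey.get(k), rightByKey.get(k)))
}

val leftRows = Seq(
  Record("1", "1", "Adam", "Andreson"),
  Record("2", "2", "Bob", "Branson"),
  Record("6", "6", "Edward", "Eddy")
)
val rightRows = Seq(
  Record("1", "1", null, null),
  Record("3", "3", "Young", "Yan"),
  Record("6", "6", "Edward", "Eddy")
)

val joined = fullOuterJoin(leftRows, rightRows)
val differences = joined.filter(_.differs)
differences.foreach(println)
```

In this sketch, key (1, 1) differs because the values disagree, keys (2, 2) and (3, 3) differ because one side is missing, and key (6, 6) matches on both sides. These are the kinds of rows where a left and right cell would not agree in the rendered table.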