Visual DataFrame Diff in a Jupyter Notebook
Ahmed Ibrahim edited this page Nov 12, 2018 · 5 revisions
The simple steps below show how to compare two DataFrames and visualize the results in a Jupyter notebook.
docker pull jupyter/all-spark-notebook
docker run -p 8888:8888 jupyter/all-spark-notebook
To open Jupyter, navigate to http://127.0.0.1:8888/?token={token}, where {token} is the value printed in the terminal window.
%AddDeps org.finra.megasparkdiff mega-spark-diff 0.2.1
The columns key1 and key2 constitute the primary key of both tables.
val left = Seq(
  ("1", "1", "Adam", "Andreson"),
  ("2", "2", "Bob", "Branson"),
  ("4", "4", "Chad", "Charly"),
  ("5", "5", "Joe", "Smith"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal")
).toDF("key1", "key2", "value1", "value2")
val right = Seq(
  ("3", "3", "Young", "Yan"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal"),
  (null, null, "null key", "null key")
).toDF("key1", "key2", "value1", "value2")
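As a rough, Spark-free illustration of what a comparison of these two tables has to surface, the sketch below counts the copies of each distinct row and keeps the rows whose count differs on the other side. It is not how MegaSparkDiff is implemented. Row, countRows and onlyIn are hypothetical names, and treating a different number of duplicate copies as a difference is an assumption. It can be pasted into a notebook cell or a Scala REPL.

```scala
// Plain Scala sketch (no Spark): which rows differ between the two sample tables?
// Row, countRows and onlyIn are hypothetical names used for illustration only.
type Row = (String, String, String, String)

val leftRows: Seq[Row] = Seq(
  ("1", "1", "Adam", "Andreson"),
  ("2", "2", "Bob", "Branson"),
  ("4", "4", "Chad", "Charly"),
  ("5", "5", "Joe", "Smith"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal")
)

val rightRows: Seq[Row] = Seq(
  ("3", "3", "Young", "Yan"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal"),
  (null, null, "null key", "null key")
)

// Count how many times each distinct row occurs.
def countRows(rows: Seq[Row]): Map[Row, Int] =
  rows.groupBy(identity).map { case (row, copies) => row -> copies.size }

// Keep the rows of `a` that are absent from `b` or occur a different number of times there.
// Assumption: a differing duplicate count is reported as a difference.
def onlyIn(a: Map[Row, Int], b: Map[Row, Int]): Map[Row, Int] =
  a.filter { case (row, n) => !b.get(row).contains(n) }

val leftCounts = countRows(leftRows)
val rightCounts = countRows(rightRows)

val onlyInLeft = onlyIn(leftCounts, rightCounts)
val onlyInRight = onlyIn(rightCounts, leftCounts)

onlyInLeft.foreach(println)
onlyInRight.foreach(println)
```

With the sample data, onlyInLeft holds the rows with keys 1, 2 and 4 plus the Joe Smith row with a count of 2, while onlyInRight holds the row with key 3, the Joe Smith row with a count of 1, and the null-key row.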
The initializeSparkContext method receives the Spark context and keeps a reference to it for subsequent Spark operations.
import org.finra.msd.sparkfactory.SparkFactory
SparkFactory.initializeSparkContext
Note that the val key holds a sequence naming the primary key columns, and that the parameter 100 specifies how many records to display as HTML.
import org.finra.msd.sparkcompare.SparkCompare
import org.finra.msd.visualization.Visualizer
val comparisonResult = SparkCompare.compareSchemaDataFrames(left, right)
val key: Seq[String] = Seq("key1", "key2")
val joinedDf = SparkCompare.fullOuterJoinDataFrames(comparisonResult.getLeft, comparisonResult.getRight, key)
kernel.magics.html(Visualizer.renderHorizontalTable(joinedDf , 100))
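To show why the differing rows are joined on the primary key before rendering, here is a minimal plain Scala sketch of a full outer join on a composite key. It only illustrates the join idea and makes no claim about how fullOuterJoinDataFrames is written. Key, Values and fullOuterJoin are hypothetical names, and the inputs are the sample rows that lack an identical counterpart in the other table. The null-key row is left out because SQL-style joins treat null keys specially.

```scala
// Plain Scala sketch (no Spark) of a full outer join on a composite key.
// Key, Values and fullOuterJoin are hypothetical names, not part of MegaSparkDiff.
type Key = (String, String)    // (key1, key2)
type Values = (String, String) // (value1, value2)

// Illustrative inputs: sample rows without an identical counterpart on the other side.
val leftDiff: Map[Key, Values] = Map(
  ("1", "1") -> ("Adam", "Andreson"),
  ("2", "2") -> ("Bob", "Branson"),
  ("4", "4") -> ("Chad", "Charly"),
  ("5", "5") -> ("Joe", "Smith")
)

val rightDiff: Map[Key, Values] = Map(
  ("3", "3") -> ("Young", "Yan"),
  ("5", "5") -> ("Joe", "Smith")
)

// Every key from either side appears once; a missing side is represented as None.
def fullOuterJoin(
    l: Map[Key, Values],
    r: Map[Key, Values]
): Seq[(Key, Option[Values], Option[Values])] =
  (l.keySet ++ r.keySet).toSeq.sorted.map(k => (k, l.get(k), r.get(k)))

val joined = fullOuterJoin(leftDiff, rightDiff)
joined.foreach(println)
```

Each result row pairs the left and right values for one key, so a row present on only one side shows up with None on the other. This side-by-side layout is what makes a horizontal table easy to read.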