
Visual DataFrame Diff in a Jupyter Notebook

Ahmed Ibrahim edited this page Nov 12, 2018 · 5 revisions

The steps below show how to compare two DataFrames and visualize the differences in a Jupyter notebook.

How to Run Jupyter in Docker

docker pull jupyter/all-spark-notebook
docker run -p 8888:8888 jupyter/all-spark-notebook

To open Jupyter, navigate to http://127.0.0.1:8888/?token={token}, where {token} is the value printed in the terminal window.

Import MSD into Jupyter

%AddDeps org.finra.megasparkdiff mega-spark-diff 0.2.1

Sample Data in DataFrames for Comparison

The columns key1 and key2 together form the primary key of both tables.

val left = Seq(
  ("1", "1", "Adam", "Andreson"),
  ("2", "2", "Bob", "Branson"),
  ("4", "4", "Chad", "Charly"),
  ("5", "5", "Joe", "Smith"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal")
).toDF("key1", "key2", "value1", "value2")

val right = Seq(
  ("3", "3", "Young", "Yan"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal"),
  (null, null, "null key", "null key")
).toDF("key1", "key2", "value1", "value2")
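The left sample data repeats the key (5, 5), so the primary key is not unique on that side. Before comparing, it can help to know which keys repeat. The following is a minimal sketch in plain Scala collections, with no Spark dependency; SampleRow and duplicateKeys are hypothetical names introduced for illustration and are not part of MSD.

```scala
// Plain Scala sketch (no Spark needed): find composite keys that occur more than once.
// SampleRow and duplicateKeys are hypothetical helpers; the rows mirror the left sample data.
final case class SampleRow(key1: String, key2: String, value1: String, value2: String)

val leftRows = Seq(
  SampleRow("1", "1", "Adam", "Andreson"),
  SampleRow("2", "2", "Bob", "Branson"),
  SampleRow("4", "4", "Chad", "Charly"),
  SampleRow("5", "5", "Joe", "Smith"),
  SampleRow("5", "5", "Joe", "Smith"),
  SampleRow("6", "6", "Edward", "Eddy"),
  SampleRow("7", "7", "normal", "normal")
)

// Group rows by (key1, key2) and keep only the keys that appear more than once.
def duplicateKeys(rows: Seq[SampleRow]): Map[(String, String), Int] =
  rows
    .groupBy(r => (r.key1, r.key2))
    .map { case (k, rs) => k -> rs.size }
    .filter { case (_, n) => n > 1 }

val dups = duplicateKeys(leftRows)
println(dups)
```

For this sample the only repeated key is ("5", "5"), which occurs twice.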

Initialize MSD for Jupyter

The method receives the Spark context and keeps a reference to it for subsequent Spark operations.

import org.finra.msd.sparkfactory.SparkFactory
SparkFactory.initializeSparkContext

Compare and Visualize

Note that the val key holds the sequence of column names that make up the primary key. The parameter 100 specifies how many records to display as HTML.

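import org.finra.msd.sparkcompare.SparkCompare
import org.finra.msd.visualization.Visualizer
// Compare the two DataFrames
val comparisonResult = SparkCompare.compareSchemaDataFrames(left, right)
// Columns that make up the primary key
val key: Seq[String] = Seq("key1", "key2")
// Join the two sides of the comparison result on the primary key
val joinedDf = SparkCompare.fullOuterJoinDataFrames(comparisonResult.getLeft, comparisonResult.getRight, key)
// Render 100 records of the joined result as an HTML table in the notebook
kernel.magics.html(Visualizer.renderHorizontalTable(joinedDf, 100))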
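To illustrate why the primary key matters for the join step, here is a minimal sketch of a full outer join on a composite key, written with plain Scala collections. It shows only the general technique and is not MSD's implementation; Rec, fullOuterJoin and the sample rows are hypothetical, and the sketch assumes each key occurs at most once per side and that keys are non-null.

```scala
// Plain Scala sketch of a full outer join on a composite key (no Spark needed).
// Rec, fullOuterJoin and the sample rows are hypothetical, for illustration only.
final case class Rec(key1: String, key2: String, value1: String, value2: String)

// Pair up rows by (key1, key2); a side with no matching row becomes None.
// Assumes each key occurs at most once per side and that keys are non-null.
def fullOuterJoin(
    left: Seq[Rec],
    right: Seq[Rec]
): Seq[((String, String), Option[Rec], Option[Rec])] = {
  val leftByKey  = left.map(r => (r.key1, r.key2) -> r).toMap
  val rightByKey = right.map(r => (r.key1, r.key2) -> r).toMap
  (leftByKey.keySet ++ rightByKey.keySet).toSeq.sorted
    .map(k => (k, leftByKey.get(k), rightByKey.get(k)))
}

val onlyInLeft  = Seq(Rec("1", "1", "Adam", "Andreson"), Rec("2", "2", "Bob", "Branson"))
val onlyInRight = Seq(Rec("2", "2", "Bob", "Brandon"), Rec("3", "3", "Young", "Yan"))

val joined = fullOuterJoin(onlyInLeft, onlyInRight)
joined.foreach { case (k, l, r) =>
  println(s"$k left=${l.map(_.value2)} right=${r.map(_.value2)}")
}
```

Rows whose key exists on only one side get an empty counterpart, while rows sharing a key line up side by side, which makes differing values easy to spot.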

The Notebook

Full Jupyter Notebook

The Visual Diff Result

MegaSparkDiff in Jupyter