A quick walkthrough of wrangling a DataFrame in HoloClean

In [23]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = spark.read.csv("../../../datasets/food/food_input_holo.csv", header=True, encoding='utf-8')

Take note of the large number of variants on 'Chicago' in this dataset. Our wrangler attempts to merge these values into one.

In [24]:
print([i.city for i in data.select('city').distinct().collect()])

[u'GLENCOE', u'LAKE ZURICH', u'Maywood', u'EAST HAZEL CREST', u'SCHAUMBURG', u'SCHILLER PARK', u'chicago', u'BURNHAM', u'CHicago', u'BLOOMINGDALE', u'WORTH', u'CHICAGOI', u'INACTIVE', u'DES PLAINES', u'ELK GROVE VILLAGE', u'STREAMWOOD', u'EVERGREEN PARK', u'CALUMET CITY', u'OAK PARK', None, u'BRIDEVIEW', u'MAYWOOD', u'BERWYN', u'NILES NILES', u'BEDFORD PARK', u'OOLYMPIA FIELDS', u'CHCICAGO', u'Chicago', u'JUSTICE', u'ELMHURST', u'CHARLES A HAYES', u'BANNOCKBURNDEERFIELD', u'CHICAGO', u'CICERO', u'CCHICAGO', u'TINLEY PARK', u'CHICAGO HEIGHTS', u'EVANSTON', u'Norridge', u'OAK LAWN', u'CHICAGOCHICAGO', u'CHCHICAGO', u'OLYMPIA FIELDS', u'LOMBARD', u'alsip', u'COUNTRY CLUB HILLS', u'FRANKFORT', u'CHESTNUT STREET', u'BOLINGBROOK', u'NAPERVILLE', u'ALSIP', u'SKOKIE', u'BLUE ISLAND', u'SUMMIT', u'BROADVIEW', u'WESTMONT']


In [25]:
from wrangler import Wrangler

wrangler = Wrangler()

In [26]:
from transformer import Transformer
from transform_functions import lowercase, trim

functions = [lowercase, trim]
columns = ["city", "dbaname"]

transformer = Transformer(functions, columns)

In [27]:
wrangler.add_transformer(transformer)

By default, our wrangler uses Levenshtein distance, but it can take any distance function for comparing strings.

The only trick is that you must specify the threshold at which to stop clustering. For example, Levenshtein distance uses a default threshold of 3, so 'chicago' and 'checago' (distance 1) will be clustered but 'chicago' and 'cafcebo' (distance 4) will not. Choose this threshold according to the distance function used and the known properties of the column's data.
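To make the threshold concrete, here is a minimal, standalone sketch of Levenshtein distance in plain Python. It is not the wrangler's own implementation, and treating "within the threshold" as `<=` is an assumption made for illustration.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


THRESHOLD = 3  # the default threshold mentioned above

for other in ["checago", "cafcebo"]:
    d = levenshtein("chicago", other)
    # Assumption: a pair is clustered when its distance is <= THRESHOLD.
    print("{0}: distance {1}, clustered: {2}".format(other, d, d <= THRESHOLD))
# checago: distance 1, clustered: True
# cafcebo: distance 4, clustered: False
```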

In [28]:
from col_norm_info import ColNormInfo
import distance

cols = list()
cols.append(ColNormInfo("city"))
cols.append(ColNormInfo("dbaname", distance.jaccard, 0.7))
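Unlike an edit distance, a Jaccard distance is bounded between 0 and 1, which is why its threshold looks so different from the default of 3. The sketch below is a standalone, set-based Jaccard distance over the characters of two strings. It assumes `distance.jaccard` behaves like this standard definition; the helper name `jaccard_distance` is hypothetical.

```python
def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| over the sets of characters in a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty strings are treated as identical.
        return 0.0
    return 1.0 - float(len(set_a & set_b)) / len(union)


print(jaccard_distance("abc", "abd"))           # 2 shared of 4 total characters
print(jaccard_distance("chicago", "chicagoi"))  # same character set
print(jaccard_distance("abc", "xyz"))           # no shared characters
```

Because this measure ignores character order and repetition, it suits columns where values differ by extra or missing tokens more than by small typos.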

Besides the column information, our normalizer takes the maximum number of distinct values that we permit it to compare. Beyond that limit, the process becomes too time- and space-intensive, so we simply do not normalize any column that exceeds it.
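The cost concern comes from comparing distinct values pairwise, which grows quadratically. The sketch below shows the arithmetic and a hypothetical guard. The function names are illustrative and are not part of the normalizer, and whether the real limit is inclusive is an assumption.

```python
def pairwise_comparisons(n):
    """Number of unordered pairs among n distinct values."""
    return n * (n - 1) // 2


def should_normalize(values, max_distinct=1000):
    """Hypothetical guard: skip columns with too many distinct values."""
    # Assumption: a column with exactly max_distinct values is still normalized.
    return len(set(values)) <= max_distinct


print(pairwise_comparisons(1000))  # 499500
print(should_normalize(["chicago", "Chicago", "chicago"], max_distinct=2))  # True
```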

In [29]:
from normalizer import Normalizer

normalizer = Normalizer(cols, max_distinct=1000)

In [30]:
wrangler.add_normalizer(normalizer)

In [31]:
wrangled_df = wrangler.wrangle(data)

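Note that all values have been lowercased and trimmed, and most Chicago typos have been merged into 'chicago'. A few variants survive: 'chicagochicago', for example, is more than 3 edits away from 'chicago', so it is not clustered with it.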

In [32]:
print([i.city for i in wrangled_df.select('city').distinct().collect()])

[u'charles a hayes', u'maywood', u'streamwood', u'broadview', u'bannockburndeerfield', u'chicago', u'summit', u'tinley park', u'calumet city', None, u'bolingbrook', u'worth', u'country club hills', u'burnham', u'blue island', u'evergreen park', u'niles niles', u'norridge', u'des plaines', u'chicago heights', u'bloomingdale', u'evanston', u'lombard', u'skokie', u'lake zurich', u'glencoe', u'frankfort', u'east hazel crest', u'westmont', u'schiller park', u'schaumburg', u'oak park', u'alsip', u'elmhurst', u'bedford park', u'inactive', u'chestnut street', u'elk grove village', u'berwyn', u'naperville', u'oolympia fields', u'justice', u'chicagochicago']
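The output above still contains 'chicagochicago'. One way to see why is a greedy, threshold-based clustering sketch like the one below. This is an illustrative assumption about how such clustering could work, not the normalizer's actual algorithm. The input counts are made up for the example, and `cluster_values` is a hypothetical name.

```python
from collections import Counter


def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def cluster_values(values, threshold=3):
    """Map each value to the first representative within the threshold."""
    representatives = []
    mapping = {}
    # Visit the most frequent values first so they become representatives.
    for value, _ in Counter(values).most_common():
        for rep in representatives:
            if levenshtein(value, rep) <= threshold:
                mapping[value] = rep
                break
        else:
            # No representative is close enough: start a new cluster.
            representatives.append(value)
            mapping[value] = value
    return mapping


# Made-up frequencies, for illustration only.
cities = ["chicago"] * 5 + ["chcicago", "cchicago", "chicagoi", "chicagochicago"]
mapping = cluster_values(cities)
print(mapping["chcicago"])        # chicago
print(mapping["chicagochicago"])  # chicagochicago (distance 7 exceeds 3)
```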
