# A Step-by-Step Guide to a HoloClean Example

Noisy and erroneous data is a major bottleneck in analytics: data cleaning and repair account for about 60% of a data scientist's work. To address this bottleneck, we recently introduced HoloClean, a semi-automated data repairing framework that relies on statistical learning and inference to repair errors in structured data. HoloClean builds on the paradigm of weak supervision and shows how to leverage diverse signals, including user-defined heuristic rules (such as generalized data integrity constraints) and external dictionaries, to repair erroneous data.

In this post, we walk through the process of running HoloClean on a simple end-to-end example.

### This post is an executable Jupyter notebook: you’re encouraged to download it and experiment with the examples yourself!

# Setup

First, we import all the modules from HoloClean that we will use.

In [1]:
from holoclean.holoclean import HoloClean, Session
from holoclean.errordetection.errordetector import ErrorDetectors
from holoclean.featurization.featurizer import Signal_Init, Signal_cooccur, Signal_dc
from holoclean.learning.accuracy import Accuracy
import time

# Initialization

In this part, we create the HoloClean and Session objects that we will use for this example. In addition, the user provides the path to the input file and the path to the file that contains the denial constraints.

In [2]:
start_time_dd = time.time()
holo_obj = HoloClean()
session = Session("Session", holo_obj)
#csv_path=raw_input("Please give the full path for the csv file:")
csv_path="test/test.csv"
session.ingest_dataset(csv_path)
#dc_path=raw_input("Please give the full path for the file of the denial constraints:")
dc_path="test/dc1.txt"
session.denial_constraints(dc_path)
finish_time_dd = time.time()
print("The running time of this part was: "+ str((finish_time_dd-start_time_dd)) +" seconds")

The running time of this part was: 57.4448869228 seconds
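
To make the notion of a denial constraint concrete, here is a minimal sketch in plain Python. It does not use HoloClean's API or the syntax of `dc1.txt`, which this post does not show; the rows, the column names `zip` and `city`, and the helpers `violates_zip_city` and `find_violations` are all hypothetical. The sketch encodes the constraint "no two tuples may share a zip code and differ on city" and lists the pairs of rows that violate it.

```python
from itertools import combinations

# Hypothetical toy rows; these are NOT the contents of test/test.csv.
rows = [
    {"id": 1, "zip": "60608", "city": "Chicago"},
    {"id": 2, "zip": "60608", "city": "Chicago"},
    {"id": 3, "zip": "60608", "city": "Cicago"},
    {"id": 4, "zip": "53703", "city": "Madison"},
]


def violates_zip_city(t1, t2):
    """Denial constraint: two tuples must not agree on zip and differ on city."""
    return t1["zip"] == t2["zip"] and t1["city"] != t2["city"]


def find_violations(rows, predicate):
    """Return the id pairs of all row pairs for which the predicate holds."""
    return [(a["id"], b["id"]) for a, b in combinations(rows, 2) if predicate(a, b)]


violations = find_violations(rows, violates_zip_city)
print(violations)  # prints [(1, 3), (2, 3)]
```

Checking every pair of rows is quadratic in the number of rows, which is fine for a toy example but is one reason a real system pushes this work into a database engine.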


# Error detection

In this part, we set up and run error detection. The output of this part is the C_dk table, which contains all the noisy cells, and the C_Clean table, which contains the clean cells.

In [3]:
start_time_dd = time.time()

err_detector = ErrorDetectors(session.Denial_constraints, holo_obj.dataengine,
                              holo_obj.spark_session, session.dataset)
session.add_error_detector(err_detector)
session.ds_detect_errors()

finish_time_dd = time.time()
print("The running time of this part was: "+ str((finish_time_dd-start_time_dd)) +" seconds")

The running time of this part was: 114.105597019 seconds
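
As a rough illustration of the split into noisy and clean cells, the sketch below marks every cell that takes part in a constraint violation as noisy and treats all remaining cells as clean. It is plain Python over hypothetical data, with a hypothetical `split_cells` helper and a hard-coded zip/city constraint; it is not the logic of `ErrorDetectors`, whose internals this post does not describe.

```python
# Hypothetical toy rows keyed by tuple id.
rows = {
    1: {"zip": "60608", "city": "Chicago"},
    2: {"zip": "60608", "city": "Cicago"},
    3: {"zip": "53703", "city": "Madison"},
}
constrained_attrs = ("zip", "city")


def split_cells(rows, attrs):
    """Return (noisy, clean) sets of (tuple_id, attribute) cells."""
    noisy = set()
    ids = sorted(rows)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            same_zip = rows[a]["zip"] == rows[b]["zip"]
            different_city = rows[a]["city"] != rows[b]["city"]
            if same_zip and different_city:
                # Every cell mentioned by the violated constraint becomes noisy.
                for tid in (a, b):
                    for attr in attrs:
                        noisy.add((tid, attr))
    all_cells = {(tid, attr) for tid in rows for attr in rows[tid]}
    return noisy, all_cells - noisy


noisy_cells, clean_cells = split_cells(rows, constrained_attrs)
print(sorted(noisy_cells))
print(sorted(clean_cells))
```

Note that a violation only tells us that some cell among those involved is wrong, not which one, so the sketch conservatively flags all of them.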


# Domain pruning

In this part, we prune the domain of each cell. The output of this part is the possible_values table, which contains all the candidate values for each cell.

In [4]:
start_time_dd = time.time()

pruning_threshold=0.5
session.ds_domain_pruning(pruning_threshold)

finish_time_dd = time.time()
print("The running time of this part was: "+ str((finish_time_dd-start_time_dd)) +" seconds")

The running time of this part was: 3.64925289154 seconds
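
One way to read a pruning threshold such as 0.5 is as a cutoff on co-occurrence statistics: a value stays in a cell's domain only if it appears often enough together with another value in the same row. The sketch below illustrates that idea under this assumption; the data and the `candidate_values` helper are hypothetical, and `ds_domain_pruning` may work differently internally.

```python
from collections import Counter

# Hypothetical toy rows; not the contents of test/test.csv.
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "53703", "city": "Madison"},
]


def candidate_values(rows, target_attr, given_attr, given_value, threshold):
    """Keep values v of target_attr with P(v | given_attr == given_value) >= threshold."""
    matching = [r[target_attr] for r in rows if r[given_attr] == given_value]
    counts = Counter(matching)
    total = float(len(matching))
    return sorted(v for v, n in counts.items() if n / total >= threshold)


# "Chicago" co-occurs with zip 60608 in 2 of 3 rows, "Cicago" in 1 of 3.
print(candidate_values(rows, "city", "zip", "60608", threshold=0.5))  # prints ['Chicago']
```

A lower threshold keeps more candidates per cell (here, 0.3 would also keep "Cicago"), which gives the later stages more to choose from at the cost of a larger model.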


# Featurization

In this part, we run the featurization module of HoloClean. We choose the signals that we want to use; the output of this part is the featurization table, which contains the factors that we will use.

In [5]:
start_time_dd = time.time()

initial_value_signal = Signal_Init(session.Denial_constraints, holo_obj.dataengine,
                                   session.dataset)
session.add_featurizer(initial_value_signal)
statistics_signal = Signal_cooccur(session.Denial_constraints, holo_obj.dataengine,
                                   session.dataset)
session.add_featurizer(statistics_signal)
dc_signal = Signal_dc(session.Denial_constraints, holo_obj.dataengine, session.dataset)
session.add_featurizer(dc_signal)
session.ds_featurize()

finish_time_dd = time.time()
print("The running time of this part was: "+ str((finish_time_dd-start_time_dd)) +" seconds")

The running time of this part was: 0.773250102997 seconds
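
To give a feel for what such signals might encode, the sketch below builds two features for each candidate value of one cell: whether the candidate equals the cell's initial value, and how often it co-occurs with the row's zip code in the other rows. This is an assumption-laden illustration in plain Python; the `featurize` helper, the feature names `init` and `cooccur`, and the data are hypothetical and do not reflect the schema of HoloClean's featurization table.

```python
# Hypothetical toy rows; not the contents of test/test.csv.
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
]


def featurize(rows, row_index, target_attr, context_attr, candidates):
    """Return {candidate: {feature_name: value}} for one cell."""
    observed = rows[row_index][target_attr]
    context_value = rows[row_index][context_attr]
    # Other rows that share the context value (here: the same zip code).
    same_context = [
        r for i, r in enumerate(rows)
        if i != row_index and r[context_attr] == context_value
    ]
    features = {}
    for value in candidates:
        agree = sum(1 for r in same_context if r[target_attr] == value)
        features[value] = {
            # Initial-value signal: prefer keeping what was observed.
            "init": 1.0 if value == observed else 0.0,
            # Co-occurrence signal: fraction of similar rows with this value.
            "cooccur": agree / float(len(same_context)) if same_context else 0.0,
        }
    return features


cell_features = featurize(rows, 2, "city", "zip", ["Chicago", "Cicago"])
print(cell_features)
```

The two signals pull in opposite directions for the misspelled cell, and it is the job of the learning phase to decide how much weight each one deserves.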


# Learning

In the learning phase, we create a wrapper for Numbskull, which we use for Gibbs sampling. The output of this part is the new weight table.

In [6]:
start_time_dd = time.time()

session._numskull()

finish_time_dd = time.time()
print("The running time of this part was: "+ str((finish_time_dd-start_time_dd)) +" seconds")

The running time of this part was: 3.92434692383 seconds
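
For readers unfamiliar with Gibbs sampling, here is a minimal, self-contained sketch on a toy model with two binary variables. It does not use Numbskull and says nothing about its API; the weights and the functions `score`, `exact_marginal_x`, and `gibbs_marginal_x` are made up for illustration. Each step resamples one variable from its conditional distribution given the other, and the fraction of samples with `x == 1` approximates the marginal probability, which we compare against the exact value obtained by enumeration.

```python
import math
import random


def score(x, y, w_x=0.5, w_y=-0.3, w_xy=1.0):
    """Unnormalized log-probability of the state (x, y); weights are hypothetical."""
    return w_x * x + w_y * y + w_xy * (1 if x == y else 0)


def exact_marginal_x():
    """P(x = 1) computed by enumerating all four states."""
    states = [(x, y) for x in (0, 1) for y in (0, 1)]
    z = sum(math.exp(score(x, y)) for x, y in states)
    return sum(math.exp(score(1, y)) for y in (0, 1)) / z


def gibbs_marginal_x(num_samples=20000, burn_in=1000, seed=0):
    """Estimate P(x = 1) by alternately resampling x given y and y given x."""
    rng = random.Random(seed)
    x, y = 0, 0
    hits = 0
    for step in range(num_samples + burn_in):
        p_x1 = 1.0 / (1.0 + math.exp(score(0, y) - score(1, y)))
        x = 1 if rng.random() < p_x1 else 0
        p_y1 = 1.0 / (1.0 + math.exp(score(x, 0) - score(x, 1)))
        y = 1 if rng.random() < p_y1 else 0
        if step >= burn_in:
            hits += x
    return hits / float(num_samples)


print(exact_marginal_x())
print(gibbs_marginal_x())
```

On a model this small, enumeration is of course cheaper; sampling pays off when the number of joint states is far too large to enumerate.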


In this part, we use the new weights to compute the probability of each candidate value for each cell.

In [7]:
start_time_dd = time.time()

session.ds_repair()

finish_time_dd = time.time()
print("The running time of this part was: "+ str((finish_time_dd-start_time_dd)) +" seconds")

The running time of this part was: 1.01248002052 seconds
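
As a simplified picture of how weights can turn into per-value probabilities, the sketch below scores each candidate value of a single cell as a weighted sum of its features and normalizes the scores with a softmax. Treating each cell independently is an assumption made here for brevity, and the weights, feature names, and `value_probabilities` helper are hypothetical; `ds_repair` may do something more involved.

```python
import math

# Hypothetical learned weights and per-candidate features for one cell.
weights = {"init": 1.0, "cooccur": 2.0}
features = {
    "Chicago": {"init": 0.0, "cooccur": 1.0},
    "Cicago": {"init": 1.0, "cooccur": 0.0},
}


def value_probabilities(features, weights):
    """Softmax over the weighted feature sums of the candidate values."""
    scores = {
        value: sum(weights[name] * x for name, x in feats.items())
        for value, feats in features.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {value: math.exp(s - top) for value, s in scores.items()}
    z = sum(exps.values())
    return {value: e / z for value, e in exps.items()}


probs = value_probabilities(features, weights)
repair = max(probs, key=probs.get)  # most probable candidate
print(probs)
print(repair)  # prints Chicago
```

Keeping the full distribution, rather than only the top value, makes it possible to flag low-confidence repairs for human review.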


# Evaluation

In this part, we measure the accuracy of our results by comparing them to the clean version of our initial data.

In [8]:
start_time_dd = time.time()

#csv_path_correction=raw_input("Please give the full path for the csv file for testing the accuracy:")
csv_path_correction="test/testGT.csv"
acc = Accuracy(holo_obj.dataengine, csv_path_correction, session.dataset, holo_obj.spark_session)
acc.accuracy_calculation()

finish_time_dd = time.time()
print("The running time of this part was: "+ str((finish_time_dd-start_time_dd)) +" seconds")

The precision that we have is :0.75
The recall that we have is :0.75
The F1 score that we have is :0.75
The running time of this part was: 12.4441890717 seconds
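
For reference, the sketch below shows one common way to define these three metrics for data repair: precision is the fraction of performed repairs that are correct, recall is the fraction of erroneous cells that were correctly repaired, and F1 is their harmonic mean. We assume these definitions here; the post does not spell out how `Accuracy` computes its numbers. The `repair_metrics` helper and the toy cells are hypothetical and are not the contents of `test/testGT.csv`.

```python
def repair_metrics(dirty, repaired, truth):
    """Return (precision, recall, f1) over dicts mapping cell -> value."""
    errors = {c for c in truth if dirty[c] != truth[c]}
    changed = {c for c in repaired if repaired[c] != dirty[c]}
    correct = {c for c in changed if repaired[c] == truth[c]}
    precision = len(correct) / float(len(changed)) if changed else 0.0
    recall = len(correct) / float(len(errors)) if errors else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy cells: four errors, four repairs, three of them correct.
truth = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
dirty = {1: "x", 2: "x", 3: "x", 4: "x", 5: "e"}
repaired = {1: "a", 2: "b", 3: "c", 4: "z", 5: "e"}

precision, recall, f1 = repair_metrics(dirty, repaired, truth)
print(precision, recall, f1)
```

Because F1 is a harmonic mean, it equals precision and recall whenever those two are equal, as in the toy numbers above.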
