# A Step-by-Step Guide to a HoloClean Example

Noisy and erroneous data is a major bottleneck in analytics. Data cleaning and repairing account for about 60% of the work of data scientists. To address this bottleneck, we recently introduced HoloClean, a semi-automated data repairing framework that relies on statistical learning and inference to repair errors in structured data. In HoloClean, we build upon the paradigm of weak supervision and demonstrate how to leverage diverse signals, including user-defined heuristic rules (such as generalized data integrity constraints) and external dictionaries, to repair erroneous data. 

In this post, we walk through the process of using HoloClean by building a simple end-to-end example.

# Setup

First, we import all the modules from HoloClean that we will use.

In [1]:
from holoclean.holoclean import HoloClean, Session
from holoclean.errordetection.errordetector import ErrorDetectors
from holoclean.featurization.featurizer import SignalInit, SignalCooccur, SignalDC
from holoclean.learning.accuracy import Accuracy
from time import time as t

## Initialization
In this part, we create the HoloClean and Session objects that we will use for this example.

In [2]:
        holo_obj = HoloClean()
        session = Session("Session", holo_obj) 
        print "Testing started :"+str(t())
        

Testing started :1517347649.77




## Read Input and DC from file
The test data and the denial constraints (DCs) are read using the Session's ingestor.
After ingestion, the test data is loaded into MySQL tables, along with entries in a metadata table.

In [3]:
        fx = open('execution_time.txt', 'w')
        list_time = []
        start_time = t()
        
        session.ingest_dataset("test/inputDatabase.csv")
        #session.ingest_dataset("test/test.csv")
        d = t()-start_time
        list_time.append(d)
        holo_obj.logger.info('ingest csv time: '+str(d)+'\n')
        fx.write('ingest csv time: '+str(d)+'\n')
        print 'Init table'
        sql = holo_obj.dataengine.get_table_to_dataframe("Init", session.dataset)
        sql.select('index','ProviderNumber','HospitalName', 'Address1').show()
        print 'ingest csv time: '+str(d)+'\n'
        
        start_time = t()
        
        session.denial_constraints("test/inputConstraint.txt")
        #session.denial_constraints("test/dc1.txt")
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('read denial constraints time: '+str(d)+'\n')
        fx.write('read denial constraints time: '+str(d)+'\n')
        print 'read denial constraints time: '+str(d)+'\n'


Init table
+-----+--------------+--------------------+--------------------+
|index|ProviderNumber|        HospitalName|            Address1|
+-----+--------------+--------------------+--------------------+
|    1|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    2|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    3|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    4|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    5|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    6|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    7|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    8|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    9|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   10|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   11|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   12|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   13|       

## Error Detection
In this part, we create the error detector and run it. The output of this part is the C_dk table, which contains all the noisy ("don't know") cells, and the C_clean table, which contains the clean cells.

In [4]:
        start_time = t()
        err_detector = ErrorDetectors(session.Denial_constraints, holo_obj.dataengine,
                                      holo_obj.spark_session, session.dataset)
        session.add_error_detector(err_detector)
        session.ds_detect_errors()
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('error detection time: '+str(d)+'\n')
        fx.write('error detection time: '+str(d)+'\n')
        
        print 'Clean table'
        sql = holo_obj.dataengine.get_table_to_dataframe("C_clean", session.dataset)
        sql.show()
        print 'Don\'t know table'
        sql = holo_obj.dataengine.get_table_to_dataframe("C_dk", session.dataset)
        sql.show()
        print 'error detection time: '+str(d)+'\n'

Clean table
+---+----------------+
|ind|            attr|
+---+----------------+
|106|        Address3|
|132|   HospitalOwner|
|106|EmergencyService|
|107|    HospitalName|
|126|      CountyName|
|104|    HospitalType|
| 10|        Address1|
|163|  ProviderNumber|
|165|        Stateavg|
|111|        Address1|
|110|    HospitalName|
|110|  ProviderNumber|
|147|        Address2|
|167|   HospitalOwner|
|140|           Score|
|110|    HospitalType|
|187|EmergencyService|
|152|           Score|
|109|    HospitalType|
|156|      CountyName|
+---+----------------+
only showing top 20 rows

Don't know table
+---+-----------+
|ind|       attr|
+---+-----------+
|897|       City|
|658|       City|
|952|       City|
|677|    ZipCode|
|228|    ZipCode|
|466|    ZipCode|
| 52|       City|
|433|    ZipCode|
|853|    ZipCode|
|596|       City|
|703|PhoneNumber|
|533|    ZipCode|
|219|    ZipCode|
|206|       City|
|199|       City|
|643|    ZipCode|
|941|    ZipCode|
|377|    ZipCode|
|772|PhoneNumbe
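
To make the split between the C_dk and C_clean tables more concrete, here is a minimal sketch in plain Python. It is not HoloClean's implementation: it assumes a single hypothetical denial constraint (two rows with the same ZipCode must not differ in City) and a tiny hand-made table. The helper names `violates` and `detect_errors` and the sample rows are made up for illustration.

```python
from itertools import combinations

# Hypothetical toy rows keyed by tuple index; not taken from the dataset above.
rows = {
    1: {"City": "BIRMINGHAM", "ZipCode": "35233"},
    2: {"City": "BIRMINGHAM", "ZipCode": "35233"},
    3: {"City": "BIRMINGHAN", "ZipCode": "35233"},  # typo in City
    4: {"City": "OPP", "ZipCode": "36467"},
}


def violates(r1, r2):
    """Assumed constraint: same ZipCode but different City is not allowed."""
    return r1["ZipCode"] == r2["ZipCode"] and r1["City"] != r2["City"]


def detect_errors(rows, attrs_in_constraint=("City", "ZipCode")):
    """Split all cells into noisy ("don't know") and clean sets of (ind, attr)."""
    dk = set()
    for (i, r1), (j, r2) in combinations(sorted(rows.items()), 2):
        if violates(r1, r2):
            # Every cell that takes part in a violation becomes suspicious.
            for attr in attrs_in_constraint:
                dk.add((i, attr))
                dk.add((j, attr))
    all_cells = {(i, attr) for i, r in rows.items() for attr in r}
    return dk, all_cells - dk


c_dk, c_clean = detect_errors(rows)
```

In this toy run, rows 1 and 2 land in the noisy set even though only row 3 contains the typo. A violated constraint says that a pair of rows is inconsistent, not which of the two is wrong, which is why the later steps are needed to decide on the actual repair.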

## Domain Pruning
In this part, we prune the domain of each cell. The output of this part is the Possible_values table, which contains all the candidate values for each cell.

In [5]:
        start_time = t()
        pruning_threshold = 0.5
        session.ds_domain_pruning(pruning_threshold)
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('domain pruning time: '+str(d)+'\n')
        fx.write('domain pruning time: '+str(d)+'\n')
        
        print 'Possible Values table'
        sql = holo_obj.dataengine.get_table_to_dataframe("Possible_values", session.dataset)
        sql.show()
        print 'domain pruning time: '+str(d)+'\n'

Possible Values table
+---+-----------+-------------+--------+---------+
|tid|  attr_name|     attr_val|observed|data_type|
+---+-----------+-------------+--------+---------+
|691|    ZipCode|        35233|       1|   String|
|805|   Stateavg|     AL_AMI-3|       1|   String|
|574|   Stateavg|      AL_HF-1|       1|   String|
|456|       City|    SHEFFIELD|       1|   String|
|232|       City|   FORT PAYNE|       1|   String|
|113|   Stateavg|AL_SCIP-INF-1|       1|   String|
|805|   Stateavg|      AL_HF-3|       0|   String|
|691|PhoneNumber|   2059344011|       1|   String|
|575|       City|       VALLEY|       1|   String|
|456|       City|   BIRMINGHAM|       0|   String|
|114|       City|          OPP|       1|   String|
|232|      State|           AL|       1|   String|
|806|       City|      GADSDEN|       1|   String|
|114|      State|           AL|       1|   String|
|232|    ZipCode|        35968|       1|   String|
|575|      State|           AL|       1|   String|
|806|    
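
One way to picture domain pruning is as a co-occurrence filter: a value stays in a cell's candidate set only if it appears often enough alongside the other values in the same row. The sketch below illustrates that idea in plain Python. It is a simplified illustration under that assumption, not HoloClean's pruning code; `candidate_values`, the toy rows, and the exact meaning given to the threshold are hypothetical.

```python
from collections import Counter

# Hypothetical toy rows; not taken from the dataset above.
rows = [
    {"City": "SHEFFIELD", "ZipCode": "35660"},
    {"City": "SHEFFIELD", "ZipCode": "35660"},
    {"City": "SHEFFIELD", "ZipCode": "35660"},
    {"City": "BIRMINGHAM", "ZipCode": "35660"},  # suspicious City
    {"City": "BIRMINGHAM", "ZipCode": "35233"},
]


def candidate_values(rows, row, target_attr, threshold):
    """Return the candidate values for row[target_attr].

    A value v is kept if, for some other attribute a of the row,
    P(target_attr = v | a = row[a]) >= threshold. The observed value
    is always kept so that "no change" remains a possible outcome.
    """
    candidates = {row[target_attr]}
    for attr, val in row.items():
        if attr == target_attr:
            continue
        # Count the target values that co-occur with this context value.
        counts = Counter(r[target_attr] for r in rows if r[attr] == val)
        total = sum(counts.values())
        for v, c in counts.items():
            if float(c) / total >= threshold:
                candidates.add(v)
    return candidates


domain = candidate_values(rows, rows[3], "City", threshold=0.5)
```

With a threshold of 0.5, the suspicious cell keeps its observed value and gains SHEFFIELD as an alternative, much like the observed column in the Possible_values table distinguishes the original value (1) from the added candidates (0). Raising the threshold shrinks the domain, which in this sketch trades recall of possible repairs for a smaller model.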

# Featurization

In this part, we run the featurization module of HoloClean. We choose the signals that we want to use, and the output of this part is the Feature table, which contains the factors that we will use in the learning step.

## Initial Value Signal 

In [6]:
        start_time = t()
        start_time1 = t()
        initial_value_signal = SignalInit(session.Denial_constraints, holo_obj.dataengine,
                                          session.dataset)
        session.add_featurizer(initial_value_signal)
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('init signal time: '+str(d)+'\n')
        fx.write('init signal time: '+str(d)+'\n')
        print 'init signal time: '+str(d)+'\n'

init signal time: 0.000649213790894



## Co-occurrence Signal

In [7]:
        start_time = t()
   
        statistics_signal = SignalCooccur(session.Denial_constraints, holo_obj.dataengine,
                                          session.dataset)
        session.add_featurizer(statistics_signal)
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('cooccur signal time: '+str(d)+'\n')
        fx.write('cooccur signal time: '+str(d)+'\n')
        print 'cooccur signal time: '+str(d)+'\n'

cooccur signal time: 0.000582933425903



## DC Signal

In [8]:
        start_time = t()
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('dc signal time: '+str(d)+'\n')
        fx.write('dc signal time: '+str(d)+'\n')
        print 'dc signal time: '+str(d)+'\n'
        start_time = t()
        dc_signal = SignalDC(session.Denial_constraints, holo_obj.dataengine, session.dataset)
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('dc featurize time: '+str(d)+'\n')
        fx.write('dc featurize time: '+str(d)+'\n')
        print 'dc featurize time: '+str(d)+'\n'
        session.add_featurizer(dc_signal)

dc signal time: 4.41074371338e-05

dc featurize time: 0.000277042388916



We now apply the signals that we chose in the previous steps. The output of this part is the Feature table, which contains the factors that we will use in the next step.

In [9]:
        session.ds_featurize()
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('total featurization time: '+str(d)+'\n')
        fx.write('total featurization time: '+str(d)+'\n')
        print 'Feature table'
        sql = holo_obj.dataengine.get_table_to_dataframe("Feature", session.dataset)
        sql.show()
        
        print 'total featurization time: '+str(d)+'\n'

adding weight_id to feature table...
adding weight_id to feature table is finished
Feature table
+---------+--------+---------+--------------------+--------------------+-------+---------+
|var_index|rv_index|  rv_attr|        assigned_val|             feature|   TYPE|weight_id|
+---------+--------+---------+--------------------+--------------------+-------+---------+
|        1|       1|     City|          BIRMINGHAM|     Init=BIRMINGHAM|   init|       11|
|        2|       1|     City|          BIRMINGHAM|ProviderNumber=10018|cooccur|      578|
|        3|       1|     City|          BIRMINGHAM|HospitalName=CALL...|cooccur|      579|
|        4|       1|     City|          BIRMINGHAM|Address1=1720 UNI...|cooccur|      580|
|        5|       1|     City|          BIRMINGHAM|      Address2=Empty|cooccur|      571|
|        6|       1|     City|          BIRMINGHAM|      Address3=Empty|cooccur|      572|
|        7|       1|     City|          BIRMINGHAM|            State=AL|cooccur|    
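
The init and cooccur rows of the Feature table above can be read as follows: each candidate value of a cell gets one feature per other attribute in the same row, plus an init feature tied to the initial value. The sketch below reproduces that shape in plain Python. It is an illustration rather than HoloClean's featurizer: `featurize_cell` is a hypothetical helper, the second candidate value is made up, and the rule that the init feature fires only for the observed value is an assumption.

```python
def featurize_cell(rv_index, row, rv_attr, candidates):
    """Build (rv_index, rv_attr, assigned_val, feature, type) tuples for one cell."""
    features = []
    for cand in candidates:
        # Assumption: the init feature fires only for the observed value.
        if cand == row[rv_attr]:
            features.append((rv_index, rv_attr, cand, "Init=" + cand, "init"))
        # One co-occurrence feature per other attribute in the same row.
        for attr in sorted(row):
            if attr != rv_attr:
                features.append(
                    (rv_index, rv_attr, cand, attr + "=" + row[attr], "cooccur")
                )
    return features


# Row values as they appear in the Feature table above; SHEFFIELD is a
# hypothetical second candidate added for illustration.
row = {"City": "BIRMINGHAM", "ProviderNumber": "10018", "State": "AL"}
feature_rows = featurize_cell(1, row, "City", ["BIRMINGHAM", "SHEFFIELD"])
```

The sketch leaves out the weight_id column, which in the real table links each feature row to the weight that the learning step estimates.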

# Learning
In the learning phase, we create a wrapper for numbskull, which we use for Gibbs sampling. The output of this part is the new weight table.

In [10]:
        start_time = t()
        session._numskull()
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('numbskull time: '+str(d)+'\n')
        fx.write('numbskull time: '+str(d)+'\n')
        print 'numbskull time: '+str(d)+'\n'
        start_time = t()

numbskull is starting
wrapper is starting
wrapper is finished
1
numbskull is finished
adding weight is finished is finished
numbskull time: 224.518476963
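
For intuition about what a Gibbs sampler does with a set of weights, here is a toy sketch in plain Python. It is not numbskull's API or HoloClean's model: it assumes just two cells with two candidate values each, hypothetical per-value weights (`unary`), and one hypothetical weight (`agree_weight`) that rewards the two cells for taking the same value. The sampler repeatedly redraws each cell given the current value of the other and counts how often each value appears.

```python
import math
import random


def gibbs_marginals(unary, agree_weight, n_samples=20000, burn_in=1000, seed=0):
    """Estimate per-cell marginals for two binary cells with Gibbs sampling."""
    rng = random.Random(seed)
    state = [0, 0]
    counts = [[0, 0], [0, 0]]
    for step in range(n_samples + burn_in):
        for i in (0, 1):
            other = state[1 - i]
            # Unnormalized conditional scores for value 0 and value 1.
            scores = [
                math.exp(unary[i][v] + (agree_weight if v == other else 0.0))
                for v in (0, 1)
            ]
            p_one = scores[1] / (scores[0] + scores[1])
            state[i] = 1 if rng.random() < p_one else 0
        # Only count samples drawn after the burn-in period.
        if step >= burn_in:
            for i in (0, 1):
                counts[i][state[i]] += 1
    return [[c / float(n_samples) for c in row] for row in counts]


# Hypothetical weights: cell 0 prefers value 1, cell 1 has no preference.
unary = [[0.0, 1.0], [0.0, 0.0]]
marginals = gibbs_marginals(unary, agree_weight=1.0)
```

For this small model the exact marginals can be worked out by hand: the first cell takes value 1 with probability e/(1+e), roughly 0.73, and the second with probability (1+e²)/(1+e)², roughly 0.61. The sampled estimates should land close to those values, and the second cell leans toward value 1 only because it is coupled to the first.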



In this part, we use the new weights to infer the probability of each candidate value for each cell.


In [12]:
        session.ds_repair()
        d = t() - start_time
        list_time.append(d)
        holo_obj.logger.info('repair time: '+str(d)+'\n')
        fx.write('repair time: '+str(d)+'\n')
        print 'repair time: '+str(d)+'\n'

        holo_obj.logger.info('Total time: ' + str(sum(list_time)) + '\n')
        fx.write('Total time: ' + str(sum(list_time)) + '\n')
        print 'Total time: ' + str(sum(list_time)) + '\n'

        fx.close()

starting repairs
repairs are finished
+-------+-------+--------+------------+-------------------+
|rv_attr|rv_attr|rv_index|assigned_val|        probability|
+-------+-------+--------+------------+-------------------+
|   City|   City|       1|  BIRMINGHAM|                1.0|
|   City|   City|      10|  BIRMINGHAM|0.26885085414689114|
|   City|   City|      10|   SHEFFIELD| 0.7311491458531089|
|   City|   City|     100|         OPP|                1.0|
|   City|   City|    1000|     ONEONTA|                1.0|
|   City|   City|     101|         OPP|                1.0|
|   City|   City|     102|         OPP|                1.0|
|   City|   City|     103|         OPP|                1.0|
|   City|   City|     104|         OPP|                1.0|
|   City|   City|     105|         OPP|                1.0|
|   City|   City|     106|         OPP|                1.0|
|   City|   City|     107|         OPP|                1.0|
|   City|   City|     108|         OPP|                1.0|
| 
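
A natural way to read the probability table is to keep, for each cell, the candidate value with the highest probability. The sketch below does this for a few rows copied from the output above; `choose_repairs` is a hypothetical helper written for this post, not part of the HoloClean API.

```python
# (rv_attr, rv_index, assigned_val, probability) rows copied from the output above.
marginals = [
    ("City", 1, "BIRMINGHAM", 1.0),
    ("City", 10, "BIRMINGHAM", 0.26885085414689114),
    ("City", 10, "SHEFFIELD", 0.7311491458531089),
    ("City", 100, "OPP", 1.0),
]


def choose_repairs(marginals):
    """Keep the most probable candidate value for every (rv_index, rv_attr) cell."""
    best = {}
    for attr, index, value, prob in marginals:
        cell = (index, attr)
        if cell not in best or prob > best[cell][1]:
            best[cell] = (value, prob)
    return best


repairs = choose_repairs(marginals)
```

For cell (10, City) this picks SHEFFIELD over BIRMINGHAM. Keeping the probability next to the chosen value makes it easy to send low-confidence repairs to a human for review instead of applying them blindly.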