# Writing Custom Error Detectors

HoloClean learns to clean data by first splitting it into two categories: `clean` and `dont_know` (`dk` for short). It then uses the `clean` set to learn a factor graph. We provide one kind of error detector, the `DCErrorDetector`, which uses denial constraints to make these splits. However, HoloClean accepts arbitrary splits through the `ErrorDetector` class.
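
To make the split concrete, here is a minimal plain-Python sketch. It is not HoloClean code: it assumes a cell can be identified by a `(row index, attribute)` pair, and `looks_noisy` and `split_cells` are hypothetical helpers with made-up sample rows.

```python
# Illustrative only: plain Python, not HoloClean's API.
rows = [
    {"index": 1, "ZipCode": "35233"},
    {"index": 2, "ZipCode": "3x233"},
    {"index": 3, "ZipCode": "35660"},
]


def looks_noisy(value):
    """Hypothetical rule: a zip code containing 'x' is suspicious."""
    return "x" in value


def split_cells(rows, attr):
    """Route each (index, attr) cell to the clean set or the dk set."""
    clean, dk = [], []
    for row in rows:
        cell = (row["index"], attr)
        if looks_noisy(row[attr]):
            dk.append(cell)
        else:
            clean.append(cell)
    return clean, dk


clean, dk = split_cells(rows, "ZipCode")
print(clean)  # [(1, 'ZipCode'), (3, 'ZipCode')]
print(dk)     # [(2, 'ZipCode')]
```

Only the `clean` cells would then be used as training evidence; the `dk` cells are the ones whose values are in question.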

# A `hello world` Example
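The heart of an error detector is two functions, `get_noisy_cells` and `get_clean_cells`. In the example below, each returns a Spark DataFrame with one row per cell, identified by a row index (`ind`) and an attribute name (`attr`).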

In [1]:
class SimpleErrorDetector:
    """Flags every ZipCode cell that contains the letter 'x' as noisy."""

    def __init__(self, spark_session):
        self.spark_session = spark_session

    def get_noisy_cells(self, spark_data_frame):
        # Register the input so it can be queried with Spark SQL.
        spark_data_frame.createOrReplaceTempView("table1")
        query = "SELECT index as ind "\
                "FROM table1 "\
                "WHERE "\
                "ZipCode LIKE '%x%'"

        result = self.spark_session.sql(query)
        # Pair each selected row index with the attribute name to form (ind, attr) cells.
        attr_frame = self.spark_session.createDataFrame([['ZipCode']], ['attr'])
        result = result.crossJoin(attr_frame)
        return result

    def get_clean_cells(self, spark_data_frame, noisy_cells_data_frame):
        # noisy_cells_data_frame is not used here; the complement is queried directly.
        spark_data_frame.createOrReplaceTempView("table1")
        query = "SELECT index as ind "\
                "FROM table1 "\
                "WHERE "\
                "ZipCode NOT LIKE '%x%'"

        result = self.spark_session.sql(query)
        # Same (ind, attr) shape as get_noisy_cells.
        attr_frame = self.spark_session.createDataFrame([['ZipCode']], ['attr'])
        result = result.crossJoin(attr_frame)
        return result
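
The two queries in the detector above can be tried without a Spark cluster. The sketch below is an approximation under stated assumptions: it swaps in Python's built-in `sqlite3` for Spark SQL, uses a made-up three-row table, and builds the `(ind, attr)` pairs in plain Python instead of with `crossJoin`. The `cells` helper is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; the rows are invented sample data.
# "index" is quoted because it is a reserved word in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE table1 ("index" INTEGER, ZipCode TEXT)')
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "35233"), (2, "3x233"), (3, None)],
)


def cells(where_clause):
    """Run the detector-style query and tag each row index with the attribute."""
    query = (
        'SELECT "index" AS ind FROM table1 WHERE '
        + where_clause
        + " ORDER BY ind"
    )
    return [(ind, "ZipCode") for (ind,) in conn.execute(query)]


# SQLite's LIKE ignores ASCII case by default, so the sample data
# sticks to a lowercase 'x' to keep the comparison unambiguous.
noisy = cells("ZipCode LIKE '%x%'")
clean = cells("ZipCode NOT LIKE '%x%'")

print(noisy)  # [(2, 'ZipCode')]
print(clean)  # [(1, 'ZipCode')]
```

Row 3 appears in neither list: under SQL's three-valued logic, both `LIKE` and `NOT LIKE` evaluate to NULL for a NULL `ZipCode`, so such a row is dropped by both `WHERE` clauses. A detector written this way may therefore leave some cells in neither set unless NULLs are handled explicitly.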

In [2]:
from holoclean.holoclean import HoloClean, Session
from holoclean.errordetection.errordetector import ErrorDetectors
from holoclean.featurization.featurizer import SignalInit, SignalCooccur, SignalDC
from holoclean.featurization.featurizer import Featurizer
from holoclean.learning.softmax import SoftMax
from holoclean.learning.accuracy import Accuracy
import time

holo_obj = HoloClean(mysql_driver = "../holoclean/lib/mysql-connector-java-5.1.44-bin.jar" )
session = Session("Session", holo_obj)

In [3]:
dataset = "../datasets/hospital1k/hospital_dataset.csv"

denial_constraints = "../datasets/hospital1k/hospital_constraints.txt"

ground_truth = "../datasets/hospital1k/groundtruth.csv"

# Ingesting Dataset and Denial Constraints
start_time = time.time()
t0 = time.time()
session.ingest_dataset(dataset)
t1 = time.time()
total = t1 - t0


print('time for ingesting file: ' + str(total) + '\n')
session.denial_constraints(denial_constraints)
print('Init table')
sql = holo_obj.dataengine.get_table_to_dataframe("Init", session.dataset)
sql.select('index','ProviderNumber','HospitalName', 'Address1').show()

time for ingesting file: 4.73261594772

Init table
+-----+--------------+--------------------+--------------------+
|index|ProviderNumber|        HospitalName|            Address1|
+-----+--------------+--------------------+--------------------+
|    1|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    2|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    3|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    4|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    5|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    6|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    7|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    8|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    9|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   10|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   11|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   12|         10019|HELEN KELLER MEMO

In [4]:
#err_detector = ErrorDetectors(session.Denial_constraints, holo_obj.dataengine,
                            # holo_obj.spark_session, session.dataset)
err_2 = ErrorDetectors(session.Denial_constraints, holo_obj.dataengine,
                       holo_obj.spark_session, session.dataset,
                       SimpleErrorDetector(holo_obj.spark_session))
session.add_error_detector(err_2)
session.ds_detect_errors()

In [8]:
sql = holo_obj.dataengine.get_table_to_dataframe("C_clean", session.dataset)
sql.count()

970

In [None]:
sql = holo_obj.dataengine.get_table_to_dataframe("Init", session.dataset)
sql.select("ZipCode").show()