# Writing Custom Error Detectors

HoloClean learns to clean data by first splitting it into two categories: `clean` and `dont_know` (`dk` for short). It then uses the `clean` set to learn a factor graph. We provide one kind of error detector, the `DCErrorDetector`, which uses Denial Constraints to make these splits. However, HoloClean accepts arbitrary splits through the `ErrorDetector` class.

# A `hello world` Example
The heart of an error detector is two functions: `get_noisy_cells` and `get_clean_cells`. We are using the hospital dataset from before. We know that some Zip Codes are formatted incorrectly, so we'll write an error detector that gives HoloClean all the erroneous Zip Codes using a simple regular expression.

In [1]:
class SimpleErrorDetector:
    def __init__(self, session):
        self.spark_session = session.holo_env.spark_session
        self.dataengine = session.holo_env.dataengine
        self.dataset = session.dataset
    
    def get_noisy_cells(self):
        '''
            We'll get a Spark DataFrame of our data
            and return a new DataFrame with the schema
            |ind|attr|

            where ind is the index of the row
            and attr is the name of the column
            (or columns) we believe are dirty
        '''
        spark_data_frame = self.dataengine.get_table_to_dataframe('Init', self.dataset)
    
        spark_data_frame.createOrReplaceTempView("table1")
        query = "SELECT __ind as ind "\
                "FROM table1 "\
                "WHERE "\
                "ZipCode NOT RLIKE '[0-9]{5}'"
            
        result = self.spark_session.sql(query)
        attr_frame = self.spark_session.createDataFrame([['ZipCode']], ['attr'])
        result = result.crossJoin(attr_frame)
        return result

    def get_clean_cells(self):
        '''
            The same as before, but now we return
            the cells we believe are clean
        '''
        spark_data_frame = self.dataengine.get_table_to_dataframe('Init', self.dataset)
        
        spark_data_frame.createOrReplaceTempView("table1")
        query = "SELECT __ind as ind "\
                "FROM table1 "\
                "WHERE "\
                "ZipCode RLIKE '[0-9]{5}'"
            
        result = self.spark_session.sql(query)
        attr_frame = self.spark_session.createDataFrame([['ZipCode']], ['attr'])
        result = result.crossJoin(attr_frame)
        return result
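
To see the contract without a Spark session, the sketch below mirrors the two methods in plain Python. `InMemoryZipDetector` and its `rows` argument are hypothetical stand-ins for illustration, not part of HoloClean; the only things taken from the cell above are the `[0-9]{5}` pattern and the `(ind, attr)` output shape. The sample Zip Codes other than `x5957` are made up.

```python
import re

# Same pattern as the SQL above. re.search, like RLIKE, matches anywhere in the string.
ZIP_PATTERN = re.compile(r"[0-9]{5}")


class InMemoryZipDetector:
    """Hypothetical plain-Python stand-in for SimpleErrorDetector."""

    def __init__(self, rows):
        # rows: list of dicts, each with an "__ind" key and a "ZipCode" key
        self.rows = rows

    def get_noisy_cells(self):
        # (ind, attr) pairs for cells whose ZipCode lacks five consecutive digits
        return [(r["__ind"], "ZipCode") for r in self.rows
                if not ZIP_PATTERN.search(r["ZipCode"])]

    def get_clean_cells(self):
        # (ind, attr) pairs for cells whose ZipCode contains five consecutive digits
        return [(r["__ind"], "ZipCode") for r in self.rows
                if ZIP_PATTERN.search(r["ZipCode"])]


rows = [
    {"__ind": 1, "ZipCode": "35233"},  # made-up sample value
    {"__ind": 2, "ZipCode": "x5957"},  # malformed value of the kind we want to catch
    {"__ind": 3, "ZipCode": "35660"},  # made-up sample value
]
detector = InMemoryZipDetector(rows)
noisy = detector.get_noisy_cells()
clean = detector.get_clean_cells()
```

Because the two methods use complementary conditions, every row with a non-null Zip Code lands in exactly one of the two sets.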

### Now we'll start up HoloClean

In [2]:
from holoclean.holoclean import HoloClean, Session

holo = HoloClean(
    holoclean_path="..",         # path to holoclean package
    verbose=False,
    # to limit possible values for training data
    pruning_threshold1=0.1,
    # to limit possible values for training data to less than k values
    pruning_clean_breakoff=6,
    # to limit possible values for dirty data (applied after
    # Threshold 1)
    pruning_threshold2=0,
    # to limit possible values for dirty data to less than k values
    pruning_dk_breakoff=6,
    # learning parameters
    learning_iterations=30,
    learning_rate=0.001,
    batch_size=5
)
session = Session(holo)



### And ingest the dataset

You can review what's happening here in our [Data Loading & Denial Constraints Tutorial](Tutorial_1.ipynb).

In [3]:
dataset = "data/hospital.csv"

denial_constraints = "data/hospital_constraints.txt"

ground_truth = "data/hospital_clean.csv"

# Ingesting Dataset and Denial Constraints

data = session.load_data(dataset)

data.select('__ind','ProviderNumber','HospitalName', 'Address1').show()

Time to Load Data: 6.77303004265

+-----+--------------+--------------------+--------------------+
|__ind|ProviderNumber|        HospitalName|            Address1|
+-----+--------------+--------------------+--------------------+
|    1|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    2|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    3|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    4|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    5|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    6|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    7|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    8|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    9|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   10|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   11|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   12|         10019|HELEN KELLER MEMO...|1300 SOUTH MO

# Adding Your Error Detector to HoloClean

In [4]:
'''
    We instantiate our SimpleErrorDetector
    and pass it to HoloClean in a list
    of error detectors
'''
err = SimpleErrorDetector(session)
# run error detection
error_detector_list = []
error_detector_list.append(err)
clean, dirty = session.detect_errors(error_detector_list)

Time for Error Detection: 0.957987070084
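
This tutorial passes a single detector, and it does not say how `session.detect_errors` merges results when the list holds several. One plausible policy, shown here purely as an assumption, treats a cell as dirty if any detector flags it and as clean only if some detector marked it clean and none flagged it. `combine_detections` is a hypothetical helper, not a HoloClean function, and the second detector's output is invented for the example.

```python
def combine_detections(results):
    """Merge per-detector (noisy, clean) results into (clean, dirty) sets.

    results: iterable of (noisy_cells, clean_cells) pairs, where each
    element is a collection of (ind, attr) tuples.
    Assumed policy (not taken from HoloClean): union the noisy cells,
    then remove them from the union of the clean cells.
    """
    dirty = set()
    clean = set()
    for noisy_cells, clean_cells in results:
        dirty.update(noisy_cells)
        clean.update(clean_cells)
    return clean - dirty, dirty


# Output shaped like the Zip Code detector's: row 45 is noisy
zip_result = ([(45, "ZipCode")], [(1, "ZipCode"), (2, "ZipCode")])
# A second, imaginary detector that disagrees about rows 2 and 45
other_result = ([(2, "ZipCode")], [(1, "ZipCode"), (45, "ZipCode")])

clean_cells, dirty_cells = combine_detections([zip_result, other_result])
```

Under this policy a disagreement always resolves toward `dirty`, which keeps doubtful cells out of the set used for learning.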



# Viewing the results

The following table shows the first five cells that are believed to be erroneous:

In [5]:
dirty.show(5)

+---+-------+
|ind|   attr|
+---+-------+
| 45|ZipCode|
| 64|ZipCode|
| 71|ZipCode|
| 94|ZipCode|
|138|ZipCode|
+---+-------+
only showing top 5 rows



If we look up index 45 in the original dataset, the value confirms our suspicion:

In [6]:
data.filter(data.__ind == 45).select(["__ind","ZipCode"]).show()

+-----+-------+
|__ind|ZipCode|
+-----+-------+
|   45|  x5957|
+-----+-------+
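
The value `x5957` is flagged because it contains only four consecutive digits, so `[0-9]{5}` finds no match. The pattern is unanchored, though, and `RLIKE` succeeds when the pattern matches anywhere in the string, so a value with extra characters around five digits would pass as clean. The sketch below uses Python's `re` module to show the difference; `re.search` stands in for `RLIKE` here, and every sample value except `x5957` is made up for illustration.

```python
import re

UNANCHORED = re.compile(r"[0-9]{5}")    # pattern used in the detector
ANCHORED = re.compile(r"^[0-9]{5}$")    # stricter variant: exactly five digits

samples = ["x5957", "35233", "352331", "35233-1234"]

# True means "looks clean" under that pattern
unanchored_clean = {v: bool(UNANCHORED.search(v)) for v in samples}
anchored_clean = {v: bool(ANCHORED.search(v)) for v in samples}

for value in samples:
    print(value, unanchored_clean[value], anchored_clean[value])
```

Whether the stricter pattern is the right one depends on the data: if nine-digit ZIP+4 values are legitimate in your dataset, the anchored expression would need to allow them.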

