# Writing Custom Error Detectors

HoloClean learns to clean data by first splitting it into two categories: `clean` and `dont_know` (`dk` for short). It then uses the `clean` set to learn a factor graph. We provide one kind of error detector, the `DCErrorDetector`, which uses denial constraints to make this split. However, HoloClean accepts arbitrary splits through the `ErrorDetectors` class.

# A `hello world` Example
The heart of an error detector is two functions: `get_noisy_cells` and `get_clean_cells`. We are using the hospital dataset from before. We know that some Zip Codes are formatted incorrectly, so we'll write an error detector that gives HoloClean all the erroneous Zip Codes using a simple regular expression.

In [11]:
class SimpleErrorDetector:
    def __init__(self, spark_session):
        self.spark_session = spark_session
    
    def get_noisy_cells(self, spark_data_frame):
        '''
            We'll get a Spark DataFrame instance of our data
            and return a new DataFrame with the schema
            |ind|attr|

            where ind is the index of a row in our data
            and attr is the name of the column
            (or columns) we believe are dirty
        '''
    
        spark_data_frame.createOrReplaceTempView("table1")
        query = "SELECT index as ind "\
                "FROM table1 "\
                "WHERE "\
                "ZipCode NOT RLIKE '[0-9]{5}'"
            
        result = self.spark_session.sql(query)
        attr_frame = self.spark_session.createDataFrame([['ZipCode']], ['attr'])
        result = result.crossJoin(attr_frame)
        return result

    def get_clean_cells(self, spark_data_frame, noisy_cells_data_frame):
        '''
            The same as before, but now we also get a
            reference to the noisy cells in case we need it
        '''
        spark_data_frame.createOrReplaceTempView("table1")
        query = "SELECT index as ind "\
                "FROM table1 "\
                "WHERE "\
                "ZipCode RLIKE '[0-9]{5}'"
            
        result = self.spark_session.sql(query)
        attr_frame = self.spark_session.createDataFrame([['ZipCode']], ['attr'])
        result = result.crossJoin(attr_frame)
        return result
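
The two queries above split the cells with an unanchored pattern. As far as we know, Spark SQL's `RLIKE` succeeds when the pattern matches any part of the string, so a value that merely contains five consecutive digits would land in the clean set. The sketch below reproduces that behaviour in plain Python with `re.search` and compares it against an anchored variant. The sample values, the `split_cells` helper, and the anchored pattern are assumptions made for illustration; they are not part of HoloClean or of the detector above.

```python
import re

# Unanchored pattern, mirroring the one in the SQL queries above.
UNANCHORED = re.compile(r"[0-9]{5}")
# Anchored variant (an assumption, not used by the original detector).
ANCHORED = re.compile(r"^[0-9]{5}$")

# Hypothetical index -> ZipCode values chosen for illustration.
samples = {1: "35233", 45: "x5957", 900: "352334", 901: "3523"}


def split_cells(rows, pattern, attr="ZipCode"):
    """Return (noisy, clean) lists of (ind, attr) pairs for one column."""
    noisy, clean = [], []
    for ind, value in sorted(rows.items()):
        # re.search looks for a match anywhere in the string.
        target = clean if pattern.search(value) else noisy
        target.append((ind, attr))
    return noisy, clean


noisy_u, clean_u = split_cells(samples, UNANCHORED)
noisy_a, clean_a = split_cells(samples, ANCHORED)

print("unanchored noisy:", noisy_u)
print("anchored noisy:  ", noisy_a)
```

With these sample values, the six-digit string `352334` counts as clean under the unanchored pattern and as noisy under the anchored one. If over-long values should be flagged, anchoring the pattern in both queries is one option to consider.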

### Now we'll start up HoloClean

In [2]:
from holoclean.holoclean import HoloClean, Session
from holoclean.errordetection.errordetector import ErrorDetectors
from holoclean.featurization.featurizer import SignalInit, SignalCooccur, SignalDC
from holoclean.featurization.featurizer import Featurizer
from holoclean.learning.softmax import SoftMax
from holoclean.learning.accuracy import Accuracy


holo_obj = HoloClean(mysql_driver = "../holoclean/lib/mysql-connector-java-5.1.44-bin.jar" )
session = Session("Session", holo_obj)



### And ingest the dataset

In [3]:
dataset = "../datasets/hospital1k/hospital_dataset.csv"

denial_constraints = "../datasets/hospital1k/hospital_constraints.txt"

ground_truth = "../datasets/hospital1k/groundtruth.csv"

# Ingesting Dataset and Denial Constraints

session.ingest_dataset(dataset)


sql = holo_obj.dataengine.get_table_to_dataframe("Init", session.dataset)
sql.select('index','ProviderNumber','HospitalName', 'Address1').show()

time for ingesting file: 5.46357989311

Init table
+-----+--------------+--------------------+--------------------+
|index|ProviderNumber|        HospitalName|            Address1|
+-----+--------------+--------------------+--------------------+
|    1|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    2|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    3|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    4|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    5|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    6|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    7|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    8|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|
|    9|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   10|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   11|         10019|HELEN KELLER MEMO...|1300 SOUTH MONTGO...|
|   12|         10019|HELEN KELLER MEMO

# Adding Your Error Detector to HoloClean

In [4]:
'''
    We instantiate the ErrorDetectors wrapper class
    and give it an instance of our
    SimpleErrorDetector object
'''
err = ErrorDetectors(SimpleErrorDetector(holo_obj.spark_session))
session.add_error_detector(err)
# Run error detection
session.ds_detect_errors()
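
Because `get_noisy_cells` and `get_clean_cells` are written independently, it can be worth checking that no cell ends up in both sets before relying on the split. The following is a minimal sketch of such a check on plain `(ind, attr)` tuples, assuming the two sets are meant to be disjoint. The `find_overlap` helper and the sample cells are hypothetical and not part of HoloClean.

```python
def find_overlap(noisy_cells, clean_cells):
    """Return the sorted (ind, attr) pairs that appear in both collections."""
    return sorted(set(noisy_cells) & set(clean_cells))


# Hypothetical cells; with Spark these pairs could be built from the rows
# of each returned DataFrame, e.g. [(r.ind, r.attr) for r in df.collect()].
noisy = [(45, "ZipCode"), (64, "ZipCode")]
clean = [(1, "ZipCode"), (2, "ZipCode")]
overlapping = find_overlap(noisy, clean)

# A deliberately inconsistent split, to show what a failure looks like.
bad_overlap = find_overlap(noisy, clean + [(45, "ZipCode")])

print("overlap:", overlapping)
print("bad overlap:", bad_overlap)
```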

# Viewing the results

The following table lists every cell (row index and attribute) that is believed to be erroneous:

In [9]:
sql = holo_obj.dataengine.get_table_to_dataframe("C_dk", session.dataset)
sql.show()

+---+-------+
|ind|   attr|
+---+-------+
| 45|ZipCode|
| 64|ZipCode|
| 71|ZipCode|
| 94|ZipCode|
|138|ZipCode|
|140|ZipCode|
|150|ZipCode|
|158|ZipCode|
|197|ZipCode|
|233|ZipCode|
|268|ZipCode|
|284|ZipCode|
|291|ZipCode|
|326|ZipCode|
|333|ZipCode|
|341|ZipCode|
|367|ZipCode|
|376|ZipCode|
|407|ZipCode|
|421|ZipCode|
+---+-------+
only showing top 20 rows



If we view the original dataset, the row at index 45 confirms our suspicion:

In [7]:
sql = holo_obj.dataengine.get_table_to_dataframe("Init", session.dataset)
sql.select(["index","ZipCode"]).show(45)


+-----+-------+
|index|ZipCode|
+-----+-------+
|    1|  35233|
|    2|  35233|
|    3|  35233|
|    4|  35233|
|    5|  35233|
|    6|  35233|
|    7|  35233|
|    8|  35233|
|    9|  35660|
|   10|  35660|
|   11|  35660|
|   12|  35660|
|   13|  35660|
|   14|  35660|
|   15|  35660|
|   16|  35660|
|   17|  35660|
|   18|  35660|
|   19|  35660|
|   20|  36302|
|   21|  36302|
|   22|  36302|
|   23|  36302|
|   24|  36302|
|   25|  36302|
|   26|  36302|
|   27|  36302|
|   28|  36302|
|   29|  36302|
|   30|  36302|
|   31|  36302|
|   32|  36302|
|   33|  36302|
|   34|  36302|
|   35|  36302|
|   36|  36302|
|   37|  36302|
|   38|  36302|
|   39|  36302|
|   40|  36302|
|   41|  36302|
|   42|  36302|
|   43|  36302|
|   44|  36302|
|   45|  x5957|
+-----+-------+
only showing top 45 rows

