In [1]:
# Load up a HoloClean session

from holoclean.holoclean import HoloClean, Session

holo_obj = HoloClean(mysql_driver="../holoclean/lib/mysql-connector-java-5.1.44-bin.jar")
session = Session(holo_obj)
dataset = "data/hospital_dataset.csv"
data = session.load_data(dataset)



# Introduction to Denial Constraints

HoloClean's goal is to clean your data, and the system is driven by a description of what clean data *should* look like. This description is expressed as a set of Denial Constraints, which are similar to [functional dependencies](https://en.wikipedia.org/wiki/Functional_dependency). The difference is that a functional dependency states something that should hold for your data, whereas a denial constraint states what clean data must not look like.
## An Example: The Hospital Dataset

This tutorial will walk through one of the Denial Constraints used in the Hospital Dataset. The data has the following fields:

`index, ProviderNumber, HospitalName, Address1, Address2, Address3, City, State, ZipCode, CountyName, PhoneNumber, HospitalType, HospitalOwner, EmergencyService, Condition, MeasureCode, MeasureName, Score, Sample, Stateavg`


We also know that the data contains errors. For example, some city names have been mistyped, so we see results like:






In [13]:
data.select('City', 'ZipCode').show()

+----------+-------+
|      City|ZipCode|
+----------+-------+
|BIRMINGHAM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGHxM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGxAM|  35233|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFxELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
|    DOTHAN|  36302|
+----------+-------+
only showing top 20 rows



Clearly we have an issue with misspelled cities such as `BIRMINGxAM`. However, we know that whenever two zip codes are the same, the cities should be the same too. In the language of functional dependencies we could write this as: for any records $t_1, t_2$

$$t_1.ZipCode = t_2.ZipCode \implies t_1.City = t_2.City$$

However, the equivalent HoloClean denial constraint is:

`t1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.City,t2.City)`

Let's break down how this works. `t1&t2` specifies that two records are involved in the error. `EQ(t1.ZipCode, t2.ZipCode)&IQ(t1.City, t2.City)` says that the records have equal zip codes but unequal cities. Any pair of records in the hospital dataset that satisfies all of these predicates is marked as potentially dirty.
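
To make the semantics concrete, here is a minimal plain-Python sketch of what this constraint flags. It is not how HoloClean evaluates constraints internally; the `rows` list and the `find_violations` helper are hypothetical names invented for this illustration, and the rows are a small hand-picked subset of the table above.

```python
from itertools import combinations

# Toy rows mirroring the City/ZipCode columns shown earlier (illustrative only).
rows = [
    {"City": "BIRMINGHAM", "ZipCode": "35233"},
    {"City": "BIRMINGHxM", "ZipCode": "35233"},
    {"City": "SHEFFIELD", "ZipCode": "35660"},
    {"City": "SHEFFxELD", "ZipCode": "35660"},
    {"City": "DOTHAN", "ZipCode": "36302"},
]


def find_violations(records):
    """Return index pairs (i, j) whose records have equal ZipCode but unequal City."""
    violations = []
    # Compare every unordered pair of records exactly once.
    for (i, t1), (j, t2) in combinations(enumerate(records), 2):
        if t1["ZipCode"] == t2["ZipCode"] and t1["City"] != t2["City"]:
            violations.append((i, j))
    return violations


violating_pairs = find_violations(rows)
print(violating_pairs)
```

Note that the constraint only says a pair is inconsistent. It does not say which of the two records holds the wrong value, which is why such pairs are treated as *potentially* dirty.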


# Adding Denial Constraints to HoloClean
There are multiple ways to add denial constraints to the system. The first is to load them from a text file:

In [None]:


# Load a set of denial constraints
dc_path = "data/hospital_constraints.txt"
dcs = session.load_denial_constraints(dc_path)
dcs

# Adding/Removing Constraints one-by-one

In [None]:
dcs = session.add_denial_constraint('t1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.Stateavg,t2.Stateavg)')
dcs

# Denial Constraint Operators

For a thorough introduction to denial constraints, refer to the [HoloClean Paper](https://arxiv.org/pdf/1702.00820.pdf). As a brief summary, the available operators are:

|Operator|Meaning|
|--------|-----|
|`EQ(x.y,z.w)`| `x.y==z.w` |
|`IQ(x.y,z.w)`| `x.y != z.w` |
|`GT(x.y,z.w)`| `x.y > z.w` |
|`GTE(x.y,z.w)`| `x.y >= z.w` |
|`LT(x.y,z.w)`| `x.y < z.w` |
|`LTE(x.y,z.w)`| `x.y <= z.w` |

All denial constraints are of the form `t1&t2&<X>&<Y>&...`, where `<X>` and `<Y>` are predicates built from the operators above.
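
As a rough illustration of this format, the sketch below splits a constraint string into its tuple variables and predicates. It is not HoloClean's own parser: `OPERATORS`, `PREDICATE_RE`, and `parse_denial_constraint` are hypothetical names made up for this example, and `LTE` is assumed to be the operator name for `<=` by analogy with `GTE`.

```python
import re

# Operator names mapped to comparison symbols (LTE is an assumed name for <=).
OPERATORS = {"EQ": "==", "IQ": "!=", "GT": ">", "GTE": ">=", "LT": "<", "LTE": "<="}

# Matches predicates such as EQ(t1.ZipCode,t2.ZipCode), with optional space after the comma.
PREDICATE_RE = re.compile(r"^(\w+)\((\w+)\.(\w+),\s*(\w+)\.(\w+)\)$")


def parse_denial_constraint(dc):
    """Split a constraint string into (tuple_vars, predicates)."""
    tuple_vars, predicates = [], []
    for part in dc.split("&"):
        part = part.strip()
        match = PREDICATE_RE.match(part)
        if match is None:
            # Anything that is not a predicate is treated as a tuple variable (t1, t2).
            tuple_vars.append(part)
            continue
        op, left_tuple, left_attr, right_tuple, right_attr = match.groups()
        if op not in OPERATORS:
            raise ValueError("unknown operator: " + op)
        predicates.append(
            (OPERATORS[op], left_tuple + "." + left_attr, right_tuple + "." + right_attr)
        )
    return tuple_vars, predicates


tuple_vars, predicates = parse_denial_constraint(
    "t1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.City,t2.City)"
)
print(tuple_vars)
print(predicates)
```

Reading a constraint this way makes the structure explicit: the leading terms name the records being compared, and every remaining term is one comparison that must hold for the pair to count as a violation.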

# Next Steps

Denial Constraints are just one of the error detectors that HoloClean uses for learning. If you'd like to write your own, check out our error detectors tutorial. If you want to learn about the next steps in the HoloClean pipeline, check out the [repairs]() tutorial.