Our `Session` automatically manages a connection to the Postgres database and allows us to save intermediate results.

The first in our set of tutorials introduces the infrastructure of `HoloClean` and presents the initial steps needed to get your data interacting with `HoloClean`. We'll also discuss Denial Constraints, the primary source of information that `HoloClean` uses to perform repairs.

# Intro to HoloClean: Data Loading and Denial Constraints

## Part 1: Setup & Loading Data

### Connecting to the Database

Without further ado, let's see some code! We begin by initializing `HoloClean` and `Session` objects.

In [1]:
from holoclean.holoclean import HoloClean, Session

holo = HoloClean(
    holoclean_path="..",  # path to holoclean package
    verbose=False,
    # to limit possible values for training data
    pruning_threshold1=0.1,
    # to limit possible values for training data to less than k values
    pruning_clean_breakoff=6,
    # to limit possible values for dirty data (applied after Threshold 1)
    pruning_threshold2=0,
    # to limit possible values for dirty data to less than k values
    pruning_dk_breakoff=6,
    # learning parameters
    learning_iterations=30,
    learning_rate=0.001,
    batch_size=5
)
session = Session(holo)


### Loading Data

Next, we ingest the hospital data we'd like to clean. This is a commonly used research dataset that we'll be using for all of our introductory tutorials.

In [2]:
data_path = "data/hospital.csv"

data = session.load_data(data_path)

Time to Load Data: 12.2033860683



At this time, `.csv` is the only supported data format.

The data is loaded into the database and a representation of it is returned. `HoloClean` uses PySpark DataFrames as its internal data structure, so any PySpark DataFrame operation can be applied to the result.

For example:

In [3]:
data.select('HospitalName', 'City').show(5)

+--------------------+----------+
|        HospitalName|      City|
+--------------------+----------+
|CALLAHAN EYE FOUN...|BIRMINGHAM|
|CALLAHAN EYE FOUN...|BIRMINGHAM|
|CALLAHAN EYE FOUN...|BIRMINGHAM|
|CALLAHAN EYE FOUN...|BIRMINGHxM|
|CALLAHAN EYE FOUN...|BIRMINGHAM|
+--------------------+----------+
only showing top 5 rows



In [4]:
data.printSchema()

root
 |-- ProviderNumber: string (nullable = true)
 |-- HospitalName: string (nullable = true)
 |-- Address1: string (nullable = true)
 |-- Address2: string (nullable = true)
 |-- Address3: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- ZipCode: string (nullable = true)
 |-- CountyName: string (nullable = true)
 |-- PhoneNumber: string (nullable = true)
 |-- HospitalType: string (nullable = true)
 |-- HospitalOwner: string (nullable = true)
 |-- EmergencyService: string (nullable = true)
 |-- Condition: string (nullable = true)
 |-- MeasureCode: string (nullable = true)
 |-- MeasureName: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Sample: string (nullable = true)
 |-- Stateavg: string (nullable = true)
 |-- __ind: long (nullable = true)



In [5]:
data.count()

1000

Please see the [Apache Spark website](https://spark.apache.org/docs/latest/sql-programming-guide.html) for a full guide to DataFrames and their functionality.

## Part 2: Introduction to Denial Constraints


HoloClean's goal is to clean your data, and the system is driven by a description of what clean data *should* look like. This description is expressed as a set of Denial Constraints, which are similar to [functional dependencies](https://en.wikipedia.org/wiki/Functional_dependency). The difference is one of direction: a functional dependency states something that should hold for your data, whereas a denial constraint states what clean data must *not* look like.
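Stated a little more formally (a sketch in standard first-order notation, not HoloClean's own syntax; the attributes $X$ and $Y$ are placeholders introduced here only for illustration), a functional dependency asserts a property of every pair of records, while the corresponding denial constraint forbids the pairs that would break it:

$$\forall t_1, t_2:\; t_1.X = t_2.X \implies t_1.Y = t_2.Y$$

$$\neg \exists\, t_1, t_2:\; t_1.X = t_2.X \,\land\, t_1.Y \neq t_2.Y$$

The two statements are logically equivalent. The second simply names the forbidden combination directly.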

### An Example: The Hospital Dataset

This tutorial walks through one of the Denial Constraints used for the Hospital dataset. The data has the following fields:

`index, ProviderNumber, HospitalName, Address1, Address2, Address3, City, State, ZipCode, CountyName, PhoneNumber, HospitalType, HospitalOwner, EmergencyService, Condition, MeasureCode, MeasureName, Score, Sample, Stateavg`

We know that there are some errors in our data. For example, some city names have been mistyped, so we see results like:

In [6]:
data.select('City', 'ZipCode').show()

+----------+-------+
|      City|ZipCode|
+----------+-------+
|BIRMINGHAM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGHxM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGHAM|  35233|
|BIRMINGxAM|  35233|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFxELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
| SHEFFIELD|  35660|
|    DOTHAN|  36302|
+----------+-------+
only showing top 20 rows



Clearly we have an issue with city names such as `BIRMINGHxM` and `BIRMINGxAM`. However, we know that whenever two records have the same zip code, they should also have the same city. In the language of functional dependencies we could write this as: for any records $t_1, t_2$

$$t_1.ZipCode = t_2.ZipCode \implies t_1.City = t_2.City$$

In HoloClean, however, the same rule is written as the following denial constraint:

`t1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.City,t2.City)`

Let's break down how this works: `t1&t2` specifies that two records are involved in the error. `EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.City,t2.City)` says that the two records have equal zip codes but unequal cities. Any pair of records in the hospital dataset that makes this expression true will be marked as potentially dirty.
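To make these semantics concrete, here is a minimal plain-Python sketch that applies the same check to a handful of rows copied from the output above. It is an illustration only, not how HoloClean evaluates constraints internally, and the helper name `violating_pairs` is hypothetical.

```python
from itertools import combinations

# A few (City, ZipCode) rows taken from the table shown earlier.
rows = [
    {"City": "BIRMINGHAM", "ZipCode": "35233"},
    {"City": "BIRMINGHxM", "ZipCode": "35233"},
    {"City": "SHEFFIELD", "ZipCode": "35660"},
    {"City": "SHEFFxELD", "ZipCode": "35660"},
    {"City": "DOTHAN", "ZipCode": "36302"},
]


def violating_pairs(records):
    """Return index pairs (i, j) whose records have equal ZipCode but unequal City.

    Hypothetical helper: mirrors EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.City,t2.City).
    """
    return [
        (i, j)
        for (i, t1), (j, t2) in combinations(enumerate(records), 2)
        if t1["ZipCode"] == t2["ZipCode"] and t1["City"] != t2["City"]
    ]


pairs = violating_pairs(rows)
print(pairs)
```

Note that both records of a violating pair are flagged: the constraint alone cannot tell which of the two spellings is the wrong one, which is why such records are only *potentially* dirty.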


## Adding Denial Constraints to HoloClean
There are multiple ways to add denial constraints to the system. The first is to load them from a text file:

In [7]:
# Load a set of denial constraints
dc_path = "data/hospital_constraints.txt"
dcs = session.load_denial_constraints(dc_path)
dcs

['t1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.City,t2.City)',
 't1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.State,t2.State)',
 't1&t2&EQ(t1.PhoneNumber,t2.PhoneNumber)&IQ(t1.ZipCode,t2.ZipCode)',
 't1&t2&EQ(t1.PhoneNumber,t2.PhoneNumber)&IQ(t1.City,t2.City)',
 't1&t2&EQ(t1.PhoneNumber,t2.PhoneNumber)&IQ(t1.State,t2.State)',
 't1&t2&EQ(t1.ProviderNumber,t2.ProviderNumber)&EQ(t1.MeasureCode,t2.MeasureCode)&IQ(t1.Stateavg,t2.Stateavg)',
 't1&t2&EQ(t1.MeasureCode,t2.MeasureCode)&IQ(t1.MeasureName,t2.MeasureName)',
 't1&t2&EQ(t1.MeasureCode,t2.MeasureCode)&IQ(t1.Condition,t2.Condition)',
 't1&t2&EQ(t1.State,t2.State)&EQ(t1.MeasureCode,t2.MeasureCode)&IQ(t1.Stateavg,t2.Stateavg)']

## Adding/Removing Constraints one-by-one

In [8]:
dcs = session.add_denial_constraint('t1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.Stateavg,t2.Stateavg)')
dcs

['t1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.City,t2.City)',
 't1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.State,t2.State)',
 't1&t2&EQ(t1.PhoneNumber,t2.PhoneNumber)&IQ(t1.ZipCode,t2.ZipCode)',
 't1&t2&EQ(t1.PhoneNumber,t2.PhoneNumber)&IQ(t1.City,t2.City)',
 't1&t2&EQ(t1.PhoneNumber,t2.PhoneNumber)&IQ(t1.State,t2.State)',
 't1&t2&EQ(t1.ProviderNumber,t2.ProviderNumber)&EQ(t1.MeasureCode,t2.MeasureCode)&IQ(t1.Stateavg,t2.Stateavg)',
 't1&t2&EQ(t1.MeasureCode,t2.MeasureCode)&IQ(t1.MeasureName,t2.MeasureName)',
 't1&t2&EQ(t1.MeasureCode,t2.MeasureCode)&IQ(t1.Condition,t2.Condition)',
 't1&t2&EQ(t1.State,t2.State)&EQ(t1.MeasureCode,t2.MeasureCode)&IQ(t1.Stateavg,t2.Stateavg)',
 't1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.Stateavg,t2.Stateavg)']
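Every constraint listed above follows the same pattern: one or more `EQ` predicates on the determining columns, followed by a single `IQ` predicate on the dependent column. As a small sketch, a constraint string of this shape can be generated from column names. The helper `fd_to_dc` is hypothetical and not part of the HoloClean API; its result is simply a string that could be passed to `session.add_denial_constraint`.

```python
def fd_to_dc(determinants, dependent):
    """Build a denial-constraint string for the dependency determinants -> dependent.

    Hypothetical helper for illustration; it only formats a string.
    """
    equal = ["EQ(t1.{0},t2.{0})".format(col) for col in determinants]
    unequal = "IQ(t1.{0},t2.{0})".format(dependent)
    return "&".join(["t1", "t2"] + equal + [unequal])


# Same zip code implies same city.
zip_city = fd_to_dc(["ZipCode"], "City")

# Same provider and measure implies same state average.
provider_avg = fd_to_dc(["ProviderNumber", "MeasureCode"], "Stateavg")

print(zip_city)
print(provider_avg)
```

Both generated strings match constraints that appear in the list loaded from `hospital_constraints.txt`.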

# Denial Constraint Operators

For a thorough introduction to denial constraints, refer to the [HoloClean Paper](https://arxiv.org/pdf/1702.00820.pdf). As a brief introduction, the available logical operators are:

|Operator|Meaning|
|--------|-----|
|`EQ(x.y, z.w)`| `x.y == z.w` |
|`IQ(x.y, z.w)`| `x.y != z.w` |
|`GT(x.y, z.w)`| `x.y > z.w` |
|`GTE(x.y, z.w)`| `x.y >= z.w` |
|`LT(x.y, z.w)`| `x.y < z.w` |
|`LTE(x.y, z.w)`| `x.y <= z.w` |

All denial constraints are of the form `t1&t2&<X>&<Y>&...`, where `<X>` and `<Y>` are predicates built from the logical operators above.
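The structure described above can be sketched as a small parser and evaluator. This is an illustration under stated assumptions, not HoloClean's own parser: the names `OPERATORS`, `parse_constraint`, and `violates` are hypothetical, the operator names follow the table above (using `LTE` for `<=`), and records are plain dictionaries.

```python
import operator
import re

# Operator names mapped to Python comparison functions.
OPERATORS = {
    "EQ": operator.eq,
    "IQ": operator.ne,
    "GT": operator.gt,
    "GTE": operator.ge,
    "LT": operator.lt,
    "LTE": operator.le,
}

# Matches predicates such as EQ(t1.ZipCode,t2.ZipCode).
PREDICATE = re.compile(r"^(\w+)\((\w+)\.(\w+),\s*(\w+)\.(\w+)\)$")


def parse_constraint(dc):
    """Split a constraint string into record names and predicate tuples."""
    parts = [part.strip() for part in dc.split("&")]
    names = [part for part in parts if "(" not in part]
    predicates = []
    for part in parts:
        if "(" not in part:
            continue
        match = PREDICATE.match(part)
        if match is None or match.group(1) not in OPERATORS:
            raise ValueError("Unrecognised predicate: " + part)
        predicates.append(match.groups())
    return names, predicates


def violates(dc, t1, t2):
    """Return True when the pair (t1, t2) satisfies every predicate in dc."""
    records = {"t1": t1, "t2": t2}
    _, predicates = parse_constraint(dc)
    return all(
        OPERATORS[op](records[left][lcol], records[right][rcol])
        for op, left, lcol, right, rcol in predicates
    )


dc = "t1&t2&EQ(t1.ZipCode,t2.ZipCode)&IQ(t1.City,t2.City)"
clean = {"ZipCode": "35233", "City": "BIRMINGHAM"}
dirty = {"ZipCode": "35233", "City": "BIRMINGHxM"}
print(parse_constraint(dc))
print(violates(dc, clean, dirty))
```

Because every column in the hospital schema is a string, the ordering operators in this sketch compare values lexicographically. Numeric comparisons would require converting the values first.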

# Next Steps

Denial Constraints are just one of the error detectors that HoloClean uses for learning. If you'd like to write your own, check out our [Error Detectors](Tutorial_3.ipynb) tutorial. If you want to learn about the next steps in the HoloClean pipeline, check out our [Complete Pipeline](Tutorial_2.ipynb) tutorial.