# Intro to HoloClean: Repairing Erroneous Hospital Data

In this post, we walk through using `HoloClean` to repair a dataset of hospital information. The dataset is an easy benchmark in which errors make up ~5% of the data. It also contains a lot of duplicated information, which makes it the ideal environment for `HoloClean`.

## Part 1: Setup & Loading Data

We've split this tutorial into 4 separate parts. Each is a simple, self-contained guide to one of the major components of `HoloClean`.

Our first notebook introduces the infrastructure and performs the basic steps needed to load data for repairs.

### Connecting to the Database

Without further ado, let's see some code! We begin by initializing `HoloClean` and `Session` objects.

In [1]:
from holoclean.holoclean import HoloClean, Session

# Path to the MySQL JDBC driver jar.
holo = HoloClean(mysql_driver="../holoclean/lib/mysql-connector-java-5.1.44-bin.jar")
# The Session manages the database connection for us.
session = Session(holo)

Our `Session` automatically manages a connection to the MySQL database and allows us to save intermediate results.

### Loading Data

Next, we ingest the hospital dataset we'd like to clean.

In [3]:
data_path = "../datasets/hospital1k/hospital_dataset.csv"

data = session.load_data(data_path)

At this time, `.csv` is the only supported input format. `load_data` loads the file into the database and returns a representation of it. `HoloClean` uses PySpark DataFrames as its internal data structure, so any PySpark DataFrame operation can be used on the result.

For example:

In [7]:
data.show(1, truncate=True)

+-----+--------------+--------------------+--------------------+--------+--------+----------+-----+-------+----------+-----------+--------------------+--------------------+----------------+--------------------+-----------+--------------------+-----+------+--------------+
|index|ProviderNumber|        HospitalName|            Address1|Address2|Address3|      City|State|ZipCode|CountyName|PhoneNumber|        HospitalType|       HospitalOwner|EmergencyService|           Condition|MeasureCode|         MeasureName|Score|Sample|      Stateavg|
+-----+--------------+--------------------+--------------------+--------+--------+----------+-----+-------+----------+-----------+--------------------+--------------------+----------------+--------------------+-----------+--------------------+-----+------+--------------+
|    1|         10018|CALLAHAN EYE FOUN...|1720 UNIVERSITY BLVD|   Empty|   Empty|BIRMINGHAM|   AL|  35233| JEFFERSON| 2053258100|Acute Care Hospitals|Voluntary non-pro...|            

In [8]:
data.printSchema()

root
 |-- index: string (nullable = true)
 |-- ProviderNumber: string (nullable = true)
 |-- HospitalName: string (nullable = true)
 |-- Address1: string (nullable = true)
 |-- Address2: string (nullable = true)
 |-- Address3: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- ZipCode: string (nullable = true)
 |-- CountyName: string (nullable = true)
 |-- PhoneNumber: string (nullable = true)
 |-- HospitalType: string (nullable = true)
 |-- HospitalOwner: string (nullable = true)
 |-- EmergencyService: string (nullable = true)
 |-- Condition: string (nullable = true)
 |-- MeasureCode: string (nullable = true)
 |-- MeasureName: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Sample: string (nullable = true)
 |-- Stateavg: string (nullable = true)



In [9]:
data.count()

1000

See the Spark SQL programming guide at https://spark.apache.org/docs/latest/sql-programming-guide.html for more information on DataFrame operations.

Even a quick look at the `City` column turns up several errors.

In [20]:
data.select('City').show(15)

+----------+
|      City|
+----------+
|BIRMINGHAM|
|BIRMINGHAM|
|BIRMINGHAM|
|BIRMINGHxM|
|BIRMINGHAM|
|BIRMINGHAM|
|BIRMINGHAM|
|BIRMINGxAM|
| SHEFFIELD|
| SHEFFIELD|
| SHEFFxELD|
| SHEFFIELD|
| SHEFFIELD|
| SHEFFIELD|
| SHEFFIELD|
+----------+
only showing top 15 rows
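
To make the role of duplicate information concrete, here is a minimal sketch in plain Python. It is not `HoloClean`'s repair method and does not use its API; it only illustrates why redundancy helps. It takes the 15 `City` values printed above, counts how often each one occurs, and flags any value seen once that differs from a more frequent value by a single character. The helper names `one_substitution_apart` and `suspect_values` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# The 15 City values printed above, copied from the output in the same order.
cities = (
    ["BIRMINGHAM"] * 3 + ["BIRMINGHxM"] + ["BIRMINGHAM"] * 3 + ["BIRMINGxAM"]
    + ["SHEFFIELD"] * 2 + ["SHEFFxELD"] + ["SHEFFIELD"] * 4
)


def one_substitution_apart(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def suspect_values(values):
    """Map each value seen only once to a more frequent value one substitution away."""
    counts = Counter(values)
    suspects = {}
    for value, n in counts.items():
        if n > 1:
            continue  # values that repeat are treated as trustworthy here
        for candidate, m in counts.most_common():
            if m > n and one_substitution_apart(value, candidate):
                suspects[value] = candidate
                break
    return suspects


suspects = suspect_values(cities)
for bad, likely in suspects.items():
    print(f"{bad!r} looks like a typo of {likely!r}")
```

A heuristic like this only works because the correct spelling appears many times. It would miss errors in rare values and could wrongly flag legitimate one-off values, which is part of why a more principled approach is needed.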



In Tutorial 2, we'll introduce the driving force behind `HoloClean`, Denial Constraints, and see how they can be written to detect and repair these sorts of errors.