# End-to-End Data Cleaning Pipeline with Raha and Baran
We build an end-to-end data cleaning pipeline with our configuration-free data error detection and correction systems, Raha and Baran.

## Error Detection with Raha

### 1. Instantiating the Detection Class
We first load the `raha` module and instantiate the `Detection` class.

In [1]:
import raha
app_1 = raha.Detection()

# How many tuples would you label?
app_1.LABELING_BUDGET = 20

app_1.VERBOSE = True

# Do you want to filter out irrelevant error detector strategies?
app_1.STRATEGY_FILTERING = True
app_1.HISTORICAL_DATASETS = [
    {
        "name": "hospital",
        "path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/hospital/dirty.csv",
        "clean_path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/hospital/clean.csv"
    },
    {
        "name": "beers",
        "path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/beers/dirty.csv",
        "clean_path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/beers/clean.csv"
    }
]
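
The absolute paths above are specific to one machine. A minimal sketch of building the same dictionaries from a base directory, assuming the `datasets/<name>/dirty.csv` and `datasets/<name>/clean.csv` layout visible in those paths. `DATASETS_DIR` and `dataset_entry` are hypothetical names introduced here, not part of `raha`.

```python
import os

# Hypothetical base directory; point it at your local copy of the datasets folder.
DATASETS_DIR = os.path.join("raha", "datasets")

def dataset_entry(name, base_dir=DATASETS_DIR):
    """Build a dataset dictionary with the same keys as the ones used above."""
    folder = os.path.join(base_dir, name)
    return {
        "name": name,
        "path": os.path.join(folder, "dirty.csv"),
        "clean_path": os.path.join(folder, "clean.csv"),
    }

# Same structure as the HISTORICAL_DATASETS list, without hard-coded absolute paths.
historical_datasets = [dataset_entry(name) for name in ("hospital", "beers")]
```

The resulting list could then be assigned to `app_1.HISTORICAL_DATASETS` in place of the literal list.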

### 2. Instantiating the Dataset
We next load and instantiate the dataset object.

In [3]:
dataset_dictionary = {
    "name": "flights",
    "path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/dirty.csv",
    "clean_path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/clean.csv"
}
d = app_1.initialize_dataset(dataset_dictionary)
d.dataframe.head()

Unnamed: 0,tuple_id,src,flight,sched_dep_time,act_dep_time,sched_arr_time,act_arr_time
0,1,aa,AA-3859-IAH-ORD,7:10 a.m.,7:16 a.m.,9:40 a.m.,9:32 a.m.
1,2,aa,AA-1733-ORD-PHX,7:45 p.m.,7:58 p.m.,10:30 p.m.,
2,3,aa,AA-1640-MIA-MCO,6:30 p.m.,,7:25 p.m.,
3,4,aa,AA-518-MIA-JFK,6:40 a.m.,6:54 a.m.,9:25 a.m.,9:28 a.m.
4,5,aa,AA-3756-ORD-SLC,12:15 p.m.,12:41 p.m.,2:45 p.m.,2:50 p.m.


### 3. Running Error Detection Strategies
Raha runs the error detector strategies on the dataset: either all of them or, when strategy filtering is enabled, only the promising ones. This step can take a while because each of these strategies has to be run on the dataset.

In [4]:
app_1.run_strategies(d)

209 strategy profiles are collected.


In [5]:
import pandas
# DataFrame.append was removed in pandas 2.0, so collect the rows first and build the frame once.
strategies_df = pandas.DataFrame(
    [
        {"Name": sp["name"], "Score": sp["score"], "New Column": sp["new_column"], "Historical Column": sp["historical_column"]}
        for sp in d.strategy_profiles
    ],
    columns=["Name", "Score", "New Column", "Historical Column"],
)

strategies_df.head()

Unnamed: 0,Name,Score,New Column,Historical Column
0,"[""OD"", [""histogram"", ""0.7"", ""0.7""]]",0.920941,flights.tuple_id,hospital.phone
1,"[""OD"", [""histogram"", ""0.5"", ""0.3""]]",0.920941,flights.tuple_id,hospital.phone
2,"[""OD"", [""histogram"", ""0.9"", ""0.7""]]",0.920941,flights.tuple_id,hospital.phone
3,"[""OD"", [""histogram"", ""0.1"", ""0.3""]]",0.920941,flights.tuple_id,hospital.phone
4,"[""OD"", [""histogram"", ""0.1"", ""0.7""]]",0.920941,flights.tuple_id,hospital.phone
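
To give a feel for what a single strategy produces, here is a toy sketch of a detector that flags rare values in one column. It is only an illustration: `flag_rare_values`, its `min_ratio` threshold, and the sample values are hypothetical, and this is not the outlier detector that Raha runs. The point is the output shape, a set of `(row, column)` cells that one strategy considers suspicious.

```python
from collections import Counter

def flag_rare_values(column_values, column_index, min_ratio):
    """Flag cells whose value makes up less than min_ratio of the column."""
    counts = Counter(column_values)
    total = len(column_values)
    return {
        (row, column_index)
        for row, value in enumerate(column_values)
        if counts[value] / total < min_ratio
    }

# Toy column: nine identical values and one rare variant at row 4.
toy_column = ["aa", "aa", "aa", "aa", "a@", "aa", "aa", "aa", "aa", "aa"]
toy_flagged = flag_rare_values(toy_column, column_index=1, min_ratio=0.2)
print(toy_flagged)  # {(4, 1)}
```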


### 4. Generating Features
Raha then generates a feature vector for each data cell based on the output of error detector strategies. 

In [6]:
app_1.generate_features(d)

24 Features are generated for column 0.
19 Features are generated for column 1.
22 Features are generated for column 2.
12 Features are generated for column 3.
11 Features are generated for column 4.
12 Features are generated for column 5.
22 Features are generated for column 6.
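
The idea of turning strategy outputs into per-cell feature vectors can be sketched as follows. This is a simplified illustration under stated assumptions, not Raha's implementation: each strategy is assumed to output a set of flagged cells, and each cell gets one binary feature per strategy. `build_feature_vectors` and the strategy names are hypothetical.

```python
def build_feature_vectors(strategy_outputs, cells):
    """Return {cell: [0/1 per strategy]}; 1 means the strategy flagged the cell."""
    strategy_names = sorted(strategy_outputs)  # fixed feature order
    return {
        cell: [int(cell in strategy_outputs[name]) for name in strategy_names]
        for cell in cells
    }

# Toy outputs of three strategies on the cells of column 0.
toy_strategy_outputs = {
    "strategy_a": {(0, 0), (2, 0)},
    "strategy_b": {(2, 0)},
    "strategy_c": set(),
}
toy_cells = [(0, 0), (1, 0), (2, 0)]
toy_features = build_feature_vectors(toy_strategy_outputs, toy_cells)
print(toy_features)
# {(0, 0): [1, 0, 0], (1, 0): [0, 0, 0], (2, 0): [1, 1, 0]}
```

Cells with similar vectors were judged similarly by the strategies, which is what makes the vectors usable for clustering and classification in the following steps.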


### 5. Building Clusters
Raha next builds a hierarchical clustering model for our clustering-based sampling approach.

In [7]:
app_1.build_clusters(d)

A hierarchical clustering model is built for column 0.
A hierarchical clustering model is built for column 1.
A hierarchical clustering model is built for column 2.
A hierarchical clustering model is built for column 3.
A hierarchical clustering model is built for column 4.
A hierarchical clustering model is built for column 5.
A hierarchical clustering model is built for column 6.


### 6. Interactive Tuple Sampling and Labeling
Raha then iteratively samples tuples. For each sampled tuple, we label its data cells as dirty or clean.

In [8]:
import pandas
import IPython.display
import ipywidgets

def on_button_clicked(_):
    # Record a 1 (dirty) or 0 (clean) label for every cell of the sampled tuple.
    for j in range(len(checkboxes)):
        cell = (d.sampled_tuple, j)
        d.labeled_cells[cell] = 1 if checkboxes[j].value else 0
    # Mark the tuple as labeled once all of its cells are recorded.
    d.labeled_tuples[d.sampled_tuple] = 1

app_1.sample_tuple(d)
print("Mark the dirty cells in the following sampled tuple.")
sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
IPython.display.display(sampled_tuple)          
checkboxes = [ipywidgets.Checkbox(value=False, description=label) for label in d.dataframe.columns]
button = ipywidgets.Button(description="Save the Annotation")
button.on_click(on_button_clicked)
output = ipywidgets.VBox(children=checkboxes + [button])
IPython.display.display(output)

Tuple 984 is sampled.
Mark the dirty cells in the following sampled tuple.


Unnamed: 0,tuple_id,src,flight,sched_dep_time,act_dep_time,sched_arr_time,act_arr_time
984,985,flightwise,AA-4330-CVG-ORD,3:35 p.m.,3:36 p.m.,3:50 p.m.,3:26 p.m.


VBox(children=(Checkbox(value=False, description='tuple_id'), Checkbox(value=False, description='src'), Checkb…

In [12]:
print(len(d.labeled_tuples))
print(d.labeled_cells)

3
{(2105, 0): 1, (2105, 1): 0, (2105, 2): 0, (2105, 3): 0, (2105, 4): 0, (2105, 5): 0, (2105, 6): 1, (285, 0): 0, (285, 1): 0, (285, 2): 1, (285, 3): 0, (285, 4): 0, (285, 5): 1, (285, 6): 0, (170, 0): 0, (170, 1): 0, (170, 2): 0, (170, 3): 1, (170, 4): 1, (170, 5): 0, (170, 6): 0}


In [9]:
import pandas
import IPython.display
import ipywidgets

while len(d.labeled_tuples) < app_1.LABELING_BUDGET:
    app_1.sample_tuple(d)
    if d.has_ground_truth:
        app_1.label_with_ground_truth(d)

Tuple 1834 is sampled.
Tuple 1834 is labeled.
Tuple 629 is sampled.
Tuple 629 is labeled.
Tuple 2069 is sampled.
Tuple 2069 is labeled.
Tuple 490 is sampled.
Tuple 490 is labeled.
Tuple 1681 is sampled.
Tuple 1681 is labeled.
Tuple 641 is sampled.
Tuple 641 is labeled.
Tuple 566 is sampled.
Tuple 566 is labeled.
Tuple 1085 is sampled.
Tuple 1085 is labeled.
Tuple 2035 is sampled.
Tuple 2035 is labeled.
Tuple 1557 is sampled.
Tuple 1557 is labeled.
Tuple 2042 is sampled.
Tuple 2042 is labeled.
Tuple 1571 is sampled.
Tuple 1571 is labeled.
Tuple 799 is sampled.
Tuple 799 is labeled.
Tuple 1364 is sampled.
Tuple 1364 is labeled.
Tuple 1601 is sampled.
Tuple 1601 is labeled.
Tuple 1562 is sampled.
Tuple 1562 is labeled.
Tuple 783 is sampled.
Tuple 783 is labeled.
Tuple 1357 is sampled.
Tuple 1357 is labeled.
Tuple 787 is sampled.
Tuple 787 is labeled.
Tuple 1042 is sampled.
Tuple 1042 is labeled.


### 7. Propagating User Labels
Raha then propagates each user label through its cluster.

In [13]:
app_1.propagate_labels(d)

The number of labeled data cells increased from 21 to 2384.
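
A minimal sketch of how a handful of user labels can grow into many labels through clusters. It assumes each cell has a cluster id and that a label is propagated only when all user labels in a cluster agree; that agreement rule, the function `propagate_through_clusters`, and the toy data are assumptions of this sketch, not a description of Raha's internals.

```python
def propagate_through_clusters(cluster_of_cell, labeled_cells):
    """Extend user labels to every cell that shares a cluster with a labeled cell."""
    # Collect the user labels seen in each cluster.
    labels_per_cluster = {}
    for cell, label in labeled_cells.items():
        labels_per_cluster.setdefault(cluster_of_cell[cell], set()).add(label)

    extended = dict(labeled_cells)
    for cell, cluster in cluster_of_cell.items():
        labels = labels_per_cluster.get(cluster, set())
        # Propagate only when the cluster's user labels agree on a single value.
        if cell not in extended and len(labels) == 1:
            extended[cell] = next(iter(labels))
    return extended

toy_clusters = {(0, 0): "c1", (1, 0): "c1", (2, 0): "c2", (3, 0): "c2", (4, 0): "c3"}
toy_user_labels = {(0, 0): 1, (2, 0): 0}
toy_extended = propagate_through_clusters(toy_clusters, toy_user_labels)
print(toy_extended)
# {(0, 0): 1, (2, 0): 0, (1, 0): 1, (3, 0): 0}
```

The cell in cluster `c3` stays unlabeled because no user label falls into that cluster.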


### 8. Predicting Labels of Data Cells
Raha then trains and applies one classifier per data column to predict the labels of the remaining data cells.

In [14]:
app_1.predict_labels(d)

A classifier is trained and applied on column 0.
A classifier is trained and applied on column 1.
A classifier is trained and applied on column 2.
A classifier is trained and applied on column 3.
A classifier is trained and applied on column 4.
A classifier is trained and applied on column 5.
A classifier is trained and applied on column 6.


### 9. Storing Results
Raha can also store the error detection results.

In [15]:
app_1.store_results(d)

The results are stored in /media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/raha-baran-results-flights/error-detection/detection.dictionary.


### 10. Evaluating the Error Detection Task
We can finally evaluate our error detection task.

In [16]:
p, r, f = d.get_data_cleaning_evaluation(d.detected_cells)[:3]
print("Raha's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Raha's performance on flights:
Precision = 0.25
Recall = 0.22
F1 = 0.23
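
For reference, cell-level precision, recall, and F1 follow the standard definitions and can be sketched as below. The internals of `get_data_cleaning_evaluation` are not shown in this notebook, so `detection_scores` and the toy cell sets are hypothetical and only illustrate the metrics.

```python
def detection_scores(detected_cells, actual_errors):
    """Cell-level precision, recall, and F1 for a set of detected cells."""
    detected = set(detected_cells)
    actual = set(actual_errors)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 4 detected cells, 5 actual errors, 2 cells in common.
toy_detected = {(0, 1), (2, 3), (5, 6), (7, 2)}
toy_actual = {(0, 1), (2, 3), (4, 4), (6, 0), (8, 5)}
toy_p, toy_r, toy_f = detection_scores(toy_detected, toy_actual)
print("Precision = {:.2f}, Recall = {:.2f}, F1 = {:.2f}".format(toy_p, toy_r, toy_f))
# Precision = 0.50, Recall = 0.40, F1 = 0.44
```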


## Error Correction with Baran

### 1. Instantiating the Correction Class
We first instantiate the `Correction` class.

In [20]:
app_2 = raha.Correction()

# How many tuples would you label?
app_2.LABELING_BUDGET = 20
app_2.VERBOSE = True

# Have you pretrained the value-based models already?
app_2.PRETRAINED_VALUE_BASED_MODELS_PATH = ""

### 2. Initializing the Dataset Object
We next initialize the dataset object.

In [21]:
d = app_2.initialize_dataset(d)
d.dataframe.head()

Unnamed: 0,tuple_id,src,flight,sched_dep_time,act_dep_time,sched_arr_time,act_arr_time
0,1,aa,AA-3859-IAH-ORD,7:10 a.m.,7:16 a.m.,9:40 a.m.,9:32 a.m.
1,2,aa,AA-1733-ORD-PHX,7:45 p.m.,7:58 p.m.,10:30 p.m.,
2,3,aa,AA-1640-MIA-MCO,6:30 p.m.,,7:25 p.m.,
3,4,aa,AA-518-MIA-JFK,6:40 a.m.,6:54 a.m.,9:25 a.m.,9:28 a.m.
4,5,aa,AA-3756-ORD-SLC,12:15 p.m.,12:41 p.m.,2:45 p.m.,2:50 p.m.


### 3. Initializing the Error Corrector Models
Baran initializes the error corrector models.

In [22]:
app_2.initialize_models(d)

The error corrector models are initialized.


### 4. Interactive Tuple Sampling, Labeling, Model Updating, Feature Generation, and Correction Prediction
Baran then iteratively samples tuples. For each sampled tuple, we provide the correct values of its data cells. Baran then updates the models accordingly and generates a feature vector for each pair of a data error and a correction candidate. Finally, it trains and applies a classifier on each data column to predict the final correction of each data error.

In [None]:
import pandas
import IPython.display

while len(d.labeled_tuples) < app_2.LABELING_BUDGET:
    app_2.sample_tuple(d)
    if d.has_ground_truth:
        app_2.label_with_ground_truth(d)
    else:
        # Without ground truth, ask the user for the correct value of each cell.
        print("Label the dirty cells in the following sampled tuple.")
        sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
        IPython.display.display(sampled_tuple)
        for j in range(d.dataframe.shape[1]):
            cell = (d.sampled_tuple, j)
            value = d.dataframe.iloc[cell]
            d.labeled_cells[cell] = input("What is the correction for value '{}'?\n".format(value))
        d.labeled_tuples[d.sampled_tuple] = 1
    # Refresh the models, features, and predicted corrections after each labeled tuple.
    app_2.update_models(d)
    app_2.generate_features(d)
    app_2.predict_corrections(d)

In [29]:
import pandas
import IPython.display
import ipywidgets

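def on_button_clicked(_):
    # Store a correction only for the cells whose text was edited.
    for j in range(len(texts)):
        cell = (d.sampled_tuple, j)
        if d.dataframe.iloc[cell] != texts[j].value:
            d.labeled_cells[cell] = texts[j].value
    # Mark the tuple as labeled once all of its cells are checked.
    d.labeled_tuples[d.sampled_tuple] = 1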

app_2.sample_tuple(d)
print("Fix the dirty cells in the following sampled tuple.")
sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
IPython.display.display(sampled_tuple)  
texts = [ipywidgets.Text(value=d.dataframe.iloc[d.sampled_tuple, j]) for j in range(d.dataframe.shape[1])]
button = ipywidgets.Button(description="Save the Annotation")
button.on_click(on_button_clicked)
output = ipywidgets.VBox(children=texts + [button])
IPython.display.display(output)

Tuple 2134 is sampled.
Fix the dirty cells in the following sampled tuple.


Unnamed: 0,tuple_id,src,flight,sched_dep_time,act_dep_time,sched_arr_time,act_arr_time
2134,2135,businesstravellogue,AA-789-ORD-DEN,1:05 p.m.,1:19 p.m.,2:35 p.m.,3:13 p.m.


VBox(children=(Text(value='2135'), Text(value='businesstravellogue'), Text(value='AA-789-ORD-DEN'), Text(value…

In [30]:
print(len(d.labeled_tuples))
print(d.labeled_cells)

3
{(988, 0): '98900', (1842, 0): '184322222222222', (2134, 6): '3:13 p.m.666666666666666'}


### 5. Storing Results
Baran can also store the error correction results.

In [None]:
app_2.store_results(d)

### 6. Evaluating the Error Correction Task
We can finally evaluate our error correction task.

In [None]:
p, r, f = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]
print("Baran's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))
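
As a rough sketch of what correction metrics measure, the function below scores a dictionary of proposed corrections against a clean table. It assumes corrections are keyed by `(row, column)` like the labeled cells shown earlier, and that a correction counts as right only when it replaces an actual error with the clean value. `correction_scores` and the toy tables are hypothetical and are not taken from Baran's code.

```python
def correction_scores(corrections, dirty, clean):
    """Precision, recall, and F1 of proposed corrections against a clean table."""
    # Actual errors are the cells where the dirty and clean values differ.
    errors = {
        (i, j)
        for i, row in enumerate(dirty)
        for j, value in enumerate(row)
        if value != clean[i][j]
    }
    # A correction is right only if it fixes a real error with the clean value.
    right = sum(
        1
        for (i, j), value in corrections.items()
        if (i, j) in errors and value == clean[i][j]
    )
    precision = right / len(corrections) if corrections else 0.0
    recall = right / len(errors) if errors else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy tables with two erroneous cells; one of the two corrections is right.
toy_dirty = [["7:10 a.m.", ""], ["6:30 p.m.", "7:25 pm"]]
toy_clean = [["7:10 a.m.", "9:32 a.m."], ["6:30 p.m.", "7:25 p.m."]]
toy_corrections = {(0, 1): "9:32 a.m.", (1, 1): "7:25 a.m."}
toy_scores = correction_scores(toy_corrections, toy_dirty, toy_clean)
print(toy_scores)  # (0.5, 0.5, 0.5)
```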