# End-to-End Data Cleaning Pipeline with Raha and Baran
We build an end-to-end data cleaning pipeline with our configuration-free data error detection and correction systems, Raha and Baran.

## Error Detection with Raha

### 1. Instantiating the Detection Class
We first load the `raha` module and instantiate the `Detection` class.

In [1]:
import raha
app_1 = raha.Detection()

# How many tuples would you label?
app_1.LABELING_BUDGET = 20

# Do you want to filter out irrelevant error detector strategies?
app_1.STRATEGY_FILTERING = False
app_1.HISTORICAL_DATASETS = [
    {
        "name": "hospital",
        "path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/hospital/dirty.csv",
        "clean_path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/hospital/clean.csv"
    },
    {
        "name": "beers",
        "path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/beers/dirty.csv",
        "clean_path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/beers/clean.csv"
    }
]

### 2. Instantiating the Dataset
We next load and instantiate the dataset object.

In [2]:
dataset_dictionary = {
    "name": "flights",
    "path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/dirty.csv",
    "clean_path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/clean.csv"
}
d = app_1.initialize_dataset(dataset_dictionary)
d.dataframe.head()

| | tuple_id | src | flight | sched_dep_time | act_dep_time | sched_arr_time | act_arr_time |
|---|---|---|---|---|---|---|---|
| 0 | 1 | aa | AA-3859-IAH-ORD | 7:10 a.m. | 7:16 a.m. | 9:40 a.m. | 9:32 a.m. |
| 1 | 2 | aa | AA-1733-ORD-PHX | 7:45 p.m. | 7:58 p.m. | 10:30 p.m. | |
| 2 | 3 | aa | AA-1640-MIA-MCO | 6:30 p.m. | | 7:25 p.m. | |
| 3 | 4 | aa | AA-518-MIA-JFK | 6:40 a.m. | 6:54 a.m. | 9:25 a.m. | 9:28 a.m. |
| 4 | 5 | aa | AA-3756-ORD-SLC | 12:15 p.m. | 12:41 p.m. | 2:45 p.m. | 2:50 p.m. |


### 3. Running Error Detection Strategies
Raha runs either all error detector strategies or only the promising ones on the dataset, depending on `STRATEGY_FILTERING`. This step can take a while because every selected strategy has to be run on the dataset.

In [3]:
app_1.run_strategies(d)

I just load strategies' results as they have already been run on the dataset!


2326 strategy profiles are collected.


### 4. Generating Features
Raha then generates a feature vector for each data cell based on the output of error detector strategies. 

In [4]:
app_1.generate_features(d)

40 Features are generated for column 0.
66 Features are generated for column 1.
62 Features are generated for column 2.
156 Features are generated for column 3.
73 Features are generated for column 4.
156 Features are generated for column 5.
88 Features are generated for column 6.
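
To make the idea of a per-cell feature vector concrete, here is a minimal sketch. It is not Raha's actual implementation: it assumes that each strategy simply reports the set of cells it flags, and the strategy names and the `build_feature_vectors` helper are hypothetical.

```python
# Hypothetical illustration: the strategy names and their outputs are made up.
def build_feature_vectors(cells, strategy_outputs):
    """Return one binary vector per cell: 1 if the strategy flagged the cell, else 0."""
    strategy_names = sorted(strategy_outputs)  # fixed feature order
    return {
        cell: [1 if cell in strategy_outputs[name] else 0 for name in strategy_names]
        for cell in cells
    }

# Cells are (row, column) pairs; all of them belong to column 3 here.
cells = [(0, 3), (1, 3), (2, 3)]
strategy_outputs = {
    "outlier_detector": {(1, 3)},
    "pattern_violation": {(1, 3), (2, 3)},
    "rule_violation": set(),
}

features = build_feature_vectors(cells, strategy_outputs)
print(features[(1, 3)])  # [1, 1, 0]
```

Under this assumption, cells flagged by the same strategies end up with identical vectors, which is what makes them good candidates for sharing a label later on.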


### 5. Building Clusters
Raha next builds one hierarchical clustering model per column, which our clustering-based sampling approach relies on.

In [5]:
app_1.build_clusters(d)

A hierarchical clustering model is built for column 0.
A hierarchical clustering model is built for column 1.
A hierarchical clustering model is built for column 2.
A hierarchical clustering model is built for column 3.
A hierarchical clustering model is built for column 4.
A hierarchical clustering model is built for column 5.
A hierarchical clustering model is built for column 6.
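
As a rough illustration of hierarchical clustering over the feature vectors of one column, the following sketch merges the two closest clusters until `k` clusters remain. The Hamming distance, the single-linkage rule, and the `agglomerative_clusters` helper are choices made for this sketch only and are not taken from Raha's code.

```python
def hamming(u, v):
    """Number of positions in which two equally long vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def agglomerative_clusters(vectors, k):
    """Repeatedly merge the two closest clusters (single linkage) until k remain."""
    clusters = [[i] for i in range(len(vectors))]  # start with one cluster per cell
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(hamming(vectors[i], vectors[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy binary feature vectors of four cells in one column (made-up values).
vectors = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
clusters = agglomerative_clusters(vectors, 2)
print(clusters)  # [[0, 1], [2, 3]]
```

Because the merges form a hierarchy, the same model can be cut at different values of `k`.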


### 6. Interactive Tuple Sampling and Labeling
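Raha then iteratively samples tuples until the labeling budget is exhausted. We label every data cell of each sampled tuple as dirty or clean; when the ground truth is available, as it is here, the labels are taken from it automatically.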

In [6]:
while len(d.labeled_tuples) < app_1.LABELING_BUDGET:
    app_1.sample_tuple(d)
    if d.has_ground_truth:
        app_1.label_with_ground_truth(d)
    else:
        import pandas
        import IPython.display
        print("Label the dirty cells in the following sampled tuple.")
        sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
        IPython.display.display(sampled_tuple)
        for j in range(d.dataframe.shape[1]):
            cell = (d.sampled_tuple, j)
            value = d.dataframe.iloc[cell]
            d.labeled_cells[cell] = int(input("Is the value '{}' dirty?\nType 1 for yes.\nType 0 for no.\n".format(value)))
        d.labeled_tuples[d.sampled_tuple] = 1

Tuple 1538 is sampled.
Tuple 1538 is labeled.
Tuple 91 is sampled.
Tuple 91 is labeled.
Tuple 990 is sampled.
Tuple 990 is labeled.
Tuple 943 is sampled.
Tuple 943 is labeled.
Tuple 1855 is sampled.
Tuple 1855 is labeled.
Tuple 1403 is sampled.
Tuple 1403 is labeled.
Tuple 92 is sampled.
Tuple 92 is labeled.
Tuple 225 is sampled.
Tuple 225 is labeled.
Tuple 238 is sampled.
Tuple 238 is labeled.
Tuple 1183 is sampled.
Tuple 1183 is labeled.
Tuple 814 is sampled.
Tuple 814 is labeled.
Tuple 664 is sampled.
Tuple 664 is labeled.
Tuple 1025 is sampled.
Tuple 1025 is labeled.
Tuple 1339 is sampled.
Tuple 1339 is labeled.
Tuple 1267 is sampled.
Tuple 1267 is labeled.
Tuple 2120 is sampled.
Tuple 2120 is labeled.
Tuple 2227 is sampled.
Tuple 2227 is labeled.
Tuple 795 is sampled.
Tuple 795 is labeled.
Tuple 676 is sampled.
Tuple 676 is labeled.
Tuple 1430 is sampled.
Tuple 1430 is labeled.


### 7. Propagating User Labels
Raha then propagates each user label through its cluster.

In [7]:
app_1.propagate_labels(d)

The number of labeled data cells increased from 140 to 12805.
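
The 140 initial labels are consistent with 20 labeled tuples of 7 columns each. The sketch below shows one way such labels could spread through clusters. The `propagate_labels` helper and the rule of propagating only when the user labels in a cluster agree are assumptions of this example, not a description of Raha's internals.

```python
def propagate_labels(clusters, user_labels):
    """Copy a user label to every unlabeled cell of a cluster whose user labels agree."""
    propagated = dict(user_labels)
    for cluster in clusters:
        labels_in_cluster = {user_labels[c] for c in cluster if c in user_labels}
        # Skip clusters without labels or with conflicting labels.
        if len(labels_in_cluster) == 1:
            label = labels_in_cluster.pop()
            for cell in cluster:
                propagated.setdefault(cell, label)
    return propagated

# Made-up clusters of (row, column) cells in column 3; 1 = dirty, 0 = clean.
clusters = [
    [(0, 3), (1, 3), (2, 3)],  # one user label: it is propagated
    [(3, 3), (4, 3)],          # conflicting user labels: nothing is propagated
    [(5, 3), (6, 3)],          # no user label: the cells stay unlabeled
]
user_labels = {(0, 3): 1, (3, 3): 0, (4, 3): 1}

propagated = propagate_labels(clusters, user_labels)
print(len(user_labels), "->", len(propagated))  # 3 -> 5
```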


### 8. Predicting Labels of Data Cells
Raha then trains and applies one classifier per data column to predict the labels of the remaining data cells.

In [8]:
app_1.predict_labels(d)

A classifier is trained and applied on column 0.
A classifier is trained and applied on column 1.
A classifier is trained and applied on column 2.
A classifier is trained and applied on column 3.
A classifier is trained and applied on column 4.
A classifier is trained and applied on column 5.
A classifier is trained and applied on column 6.


### 9. Storing Results
Raha can also store the error detection results.

In [9]:
app_1.store_results(d)

The results are stored in /media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/raha-results-flights/error-detection/detection.dictionary.


### 10. Evaluating the Error Detection Task
We can finally evaluate our error detection task.

In [10]:
p, r, f = d.get_data_cleaning_evaluation(d.detected_cells)[:3]
print("Raha's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Raha's performance on flights:
Precision = 0.84
Recall = 0.89
F1 = 0.86
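
For readers who want to see how such scores are typically derived, here is a small sketch using the standard definitions of precision, recall, and F1 over sets of cells. The `detection_scores` helper and the example cells are hypothetical; `get_data_cleaning_evaluation` may compute its values differently.

```python
def detection_scores(detected_cells, actual_errors):
    """Precision, recall, and F1 of detected cells against the truly erroneous cells."""
    detected, actual = set(detected_cells), set(actual_errors)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up (row, column) cells: two of the four detections are real errors.
detected = {(0, 1), (1, 2), (2, 3), (3, 4)}
actual = {(0, 1), (1, 2), (5, 5)}

p, r, f = detection_scores(detected, actual)
print("Precision = {:.2f}, Recall = {:.2f}, F1 = {:.2f}".format(p, r, f))
```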


## Error Correction with Baran

### 1. Instantiating the Correction Class
We first instantiate the `Correction` class.

In [11]:
app_2 = raha.Correction()

# How many tuples would you label?
app_2.LABELING_BUDGET = 20

# Have you pretrained the value-based models already?
app_2.PRETRAINED_VALUE_BASED_MODELS_PATH = ""

### 2. Initializing the Dataset Object
We next initialize the dataset object.

In [12]:
d = app_2.initialize_dataset(d)
d.dataframe.head()

| | tuple_id | src | flight | sched_dep_time | act_dep_time | sched_arr_time | act_arr_time |
|---|---|---|---|---|---|---|---|
| 0 | 1 | aa | AA-3859-IAH-ORD | 7:10 a.m. | 7:16 a.m. | 9:40 a.m. | 9:32 a.m. |
| 1 | 2 | aa | AA-1733-ORD-PHX | 7:45 p.m. | 7:58 p.m. | 10:30 p.m. | |
| 2 | 3 | aa | AA-1640-MIA-MCO | 6:30 p.m. | | 7:25 p.m. | |
| 3 | 4 | aa | AA-518-MIA-JFK | 6:40 a.m. | 6:54 a.m. | 9:25 a.m. | 9:28 a.m. |
| 4 | 5 | aa | AA-3756-ORD-SLC | 12:15 p.m. | 12:41 p.m. | 2:45 p.m. | 2:50 p.m. |


### 3. Initializing the Error Corrector Models
Baran initializes the error corrector models.

In [13]:
app_2.initialize_models(d)

The error corrector models are initialized.


### 4. Interactive Tuple Sampling, Labeling, Model Updating, Feature Generation, and Correction Prediction
Baran then iteratively samples tuples, and we provide the correct value for the data cells of each sampled tuple. After each labeled tuple, Baran updates the models accordingly and generates a feature vector for each pair of a data error and a correction candidate. Finally, it trains and applies a classifier on each data column to predict the final correction of each data error.

In [14]:
while len(d.labeled_tuples) < app_2.LABELING_BUDGET:
    app_2.sample_tuple(d)
    if d.has_ground_truth:
        app_2.label_with_ground_truth(d)
    else:
        import pandas
        import IPython.display
        print("Label the dirty cells in the following sampled tuple.")
        sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[d.sampled_tuple, :]], columns=d.dataframe.columns)
        IPython.display.display(sampled_tuple)
        for j in range(d.dataframe.shape[1]):
            cell = (d.sampled_tuple, j)
            value = d.dataframe.iloc[cell]
            d.labeled_cells[cell] = input("What is the correction for value '{}'?\n".format(value))
        d.labeled_tuples[d.sampled_tuple] = 1
    app_2.update_models(d)
    app_2.generate_features(d)
    app_2.predict_corrections(d)

Tuple 95 is sampled.
Tuple 95 is labeled.
The error corrector models are updated with new labeled tuple 95.
448080 pairs of (a data error, a potential correction) are featurized.
51% (2654 / 5213) of data errors are corrected.
Tuple 1481 is sampled.
Tuple 1481 is labeled.
The error corrector models are updated with new labeled tuple 1481.
451920 pairs of (a data error, a potential correction) are featurized.
88% (4603 / 5213) of data errors are corrected.
Tuple 220 is sampled.
Tuple 220 is labeled.
The error corrector models are updated with new labeled tuple 220.
457693 pairs of (a data error, a potential correction) are featurized.
90% (4676 / 5213) of data errors are corrected.
Tuple 1213 is sampled.
Tuple 1213 is labeled.
The error corrector models are updated with new labeled tuple 1213.
463876 pairs of (a data error, a potential correction) are featurized.
90% (4678 / 5213) of data errors are corrected.
Tuple 1389 is sampled.
Tuple 1389 is labeled.
The error corrector models are 

### 5. Storing Results
Baran can also store the error correction results.

In [15]:
app_2.store_results(d)

The results are stored in /media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/baran-results-flights/error-correction/correction.dictionary.


### 6. Evaluating the Error Correction Task
We can finally evaluate our error correction task.

In [16]:
p, r, f = d.get_data_cleaning_evaluation(d.corrected_cells)[-3:]
print("Baran's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Baran's performance on flights:
Precision = 0.62
Recall = 0.62
F1 = 0.62
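
Correction quality is stricter than detection quality, because a corrected cell only counts if the proposed value equals the clean value. The sketch below uses the common definitions (precision over all proposed corrections, recall over all actual errors). The `correction_scores` helper and all values are made up for illustration, and the library's own evaluation may differ in detail.

```python
def correction_scores(corrections, clean_values):
    """Score proposed corrections against the clean values of all erroneous cells.

    corrections:  {cell: proposed value} for the cells that were corrected
    clean_values: {cell: true value} for every cell that is actually erroneous
    """
    correct = sum(1 for cell, value in corrections.items()
                  if clean_values.get(cell) == value)
    precision = correct / len(corrections) if corrections else 0.0
    recall = correct / len(clean_values) if clean_values else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: four erroneous cells, three corrected, two of them correctly.
clean_values = {(1, 5): "8:00 p.m.", (2, 5): "6:40 p.m.",
                (2, 7): "7:30 p.m.", (4, 6): "2:45 p.m."}
corrections = {(1, 5): "8:00 p.m.", (2, 5): "6:45 p.m.", (2, 7): "7:30 p.m."}

p, r, f = correction_scores(corrections, clean_values)
print("Precision = {:.2f}, Recall = {:.2f}, F1 = {:.2f}".format(p, r, f))
```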
