# Raha
Welcome to Raha, a configuration-free error detection system!  
This notebook walks through Raha's pipeline step by step and shows how to detect data errors in a dataset with it.

## 1. Instantiating Raha
We first import `raha`, instantiate the detection application, and set a few of its parameters.

In [29]:
import raha
app = raha.Detection()

app.LABELING_BUDGET = 20
app.STRATEGY_FILTERING = False
app.HISTORICAL_DATASETS = ["hospital", "beers"]   # ["hospital", "beers",...]

## 2. Instantiating the Dataset
We next load and instantiate the dataset object.

In [31]:
dataset_dictionary = {
    "name": "flights",
    "path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/dirty.csv",
    "clean_path": "/media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/clean.csv"
}
d = app.init_dataset(dataset_dictionary)
d.dataframe.head()

| | tuple_id | src | flight | sched_dep_time | act_dep_time | sched_arr_time | act_arr_time |
|---|---|---|---|---|---|---|---|
| 0 | 1 | aa | AA-3859-IAH-ORD | 7:10 a.m. | 7:16 a.m. | 9:40 a.m. | 9:32 a.m. |
| 1 | 2 | aa | AA-1733-ORD-PHX | 7:45 p.m. | 7:58 p.m. | 10:30 p.m. | |
| 2 | 3 | aa | AA-1640-MIA-MCO | 6:30 p.m. | | 7:25 p.m. | |
| 3 | 4 | aa | AA-518-MIA-JFK | 6:40 a.m. | 6:54 a.m. | 9:25 a.m. | 9:28 a.m. |
| 4 | 5 | aa | AA-3756-ORD-SLC | 12:15 p.m. | 12:41 p.m. | 2:45 p.m. | 2:50 p.m. |
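The dataset dictionary above uses absolute paths from the author's machine. As a minimal sketch, assuming your datasets live in folders that each contain a `dirty.csv` and a `clean.csv`, the same dictionary could be built from a root folder. The helper name `make_dataset_dictionary` and the `"datasets"` root are hypothetical; only the keys `name`, `path`, and `clean_path` come from the notebook.

```python
from pathlib import Path


def make_dataset_dictionary(datasets_root, name):
    """Build a dictionary with the keys used in the notebook for one dataset folder.

    Hypothetical helper: it assumes <datasets_root>/<name>/ holds dirty.csv and clean.csv.
    """
    folder = Path(datasets_root) / name
    return {
        "name": name,
        "path": str(folder / "dirty.csv"),
        "clean_path": str(folder / "clean.csv"),
    }


# "datasets" is an assumed location; replace it with your own folder.
dataset_dictionary = make_dataset_dictionary("datasets", "flights")
print(dataset_dictionary)
```

The resulting dictionary can then be passed to `app.init_dataset` exactly as in the cell above.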


## 3. Running Error Detection Strategies
Raha runs its error detection strategies on the dataset: either all of them or only the promising ones. This step can take a while because every selected strategy has to be run on the whole dataset.

In [32]:
app.run_strategies(d)

I just load strategies' results as they have already been run on the dataset!
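To make the idea of a strategy concrete, here is a minimal sketch of one possible detector. It is not taken from Raha's code: the function `detect_pattern_violations` and the time pattern are assumptions for illustration. The only point it makes is that a strategy can be seen as a function that returns the set of cells it considers erroneous.

```python
import re


def detect_pattern_violations(rows, column, pattern):
    """Return the (row, column) cells whose value does not fully match `pattern`.

    Illustrative stand-in for a single strategy, not Raha's implementation.
    """
    regex = re.compile(pattern)
    return {
        (i, column)
        for i, row in enumerate(rows)
        if regex.fullmatch(row[column]) is None
    }


# Two rows from the flights sample: src, flight, sched_dep_time, act_dep_time.
rows = [
    ["aa", "AA-3859-IAH-ORD", "7:10 a.m.", "7:16 a.m."],
    ["aa", "AA-1640-MIA-MCO", "6:30 p.m.", ""],
]
time_pattern = r"\d{1,2}:\d{2} [ap]\.m\."  # assumed format for time values
flagged = detect_pattern_violations(rows, 3, time_pattern)
print(flagged)  # {(1, 3)}: the empty act_dep_time is flagged
```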


## 4. Generating Features
Raha then generates a feature vector for each data cell based on the outputs of the error detection strategies.

In [33]:
app.generate_features(d)

40 Features are extracted for column 0.
66 Features are extracted for column 1.
62 Features are extracted for column 2.
156 Features are extracted for column 3.
73 Features are extracted for column 4.
156 Features are extracted for column 5.
88 Features are extracted for column 6.
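One simple way to picture these feature vectors is one binary entry per strategy: 1 if the strategy flagged the cell, 0 otherwise. The sketch below follows that assumption. The helper `build_feature_vectors` and the three strategy outputs are hypothetical, and the notebook does not show how Raha arrives at the per-column feature counts printed above.

```python
def build_feature_vectors(n_rows, column, strategy_outputs):
    """Map each cell of one column to a binary vector, one entry per strategy.

    Illustrative only: entry s is 1 when strategy s flagged the cell.
    """
    return {
        (i, column): [1 if (i, column) in flagged else 0 for flagged in strategy_outputs]
        for i in range(n_rows)
    }


# Hypothetical outputs of three strategies on column 3 of a three-row dataset.
strategy_outputs = [
    {(1, 3), (2, 3)},  # strategy A flags rows 1 and 2
    {(2, 3)},          # strategy B flags row 2 only
    set(),             # strategy C flags nothing
]
features = build_feature_vectors(3, 3, strategy_outputs)
for cell, vector in features.items():
    print(cell, vector)
```

Cells that receive the same vector were treated identically by every strategy, which is what makes the vectors usable for grouping similar cells.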


## 5. Building Clusters
Raha next builds a hierarchical clustering model for our clustering-based sampling approach.

In [34]:
app.build_clusters(d)

A hierarchical clustering model is built for column 0.
A hierarchical clustering model is built for column 1.
A hierarchical clustering model is built for column 2.
A hierarchical clustering model is built for column 3.
A hierarchical clustering model is built for column 4.
A hierarchical clustering model is built for column 5.
A hierarchical clustering model is built for column 6.
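As a rough illustration of hierarchical (agglomerative) clustering over the feature vectors of one column, the sketch below repeatedly merges the two closest clusters until `k` remain. The Hamming distance and average linkage are assumptions chosen for simplicity; the notebook does not state which distance or linkage Raha uses.

```python
def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hamming(vectors[a], vectors[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerative_clusters(vectors, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical binary feature vectors for four cells of one column.
vectors = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
clusters = agglomerative_clusters(vectors, 2)
print(clusters)  # [[0, 1], [2, 3]]
```

Stopping at a smaller or larger `k` gives coarser or finer groups from the same merging process, which is convenient when the number of clusters changes between iterations.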


## 6. Interactive Tuple Sampling and Labeling
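Raha then iteratively samples tuples, one per iteration. For each sampled tuple, every data cell is labeled as dirty (1) or clean (0): the labels are taken from the ground truth when it is available, and otherwise we enter them by hand.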

In [35]:
import pandas
import IPython.display

for k in d.clustering_range:
    # Ask Raha for the index of the next tuple to label.
    si = app.sample_tuple(d, k)
    if d.has_ground_truth:
        # A clean version of the dataset is available, so labels come from it.
        app.label_with_ground_truth(d, k, si)
    else:
        # Otherwise, show the sampled tuple and ask the user about each cell.
        print("Label the dirty cells in the following sampled tuple.")
        sampled_tuple = pandas.DataFrame(data=[d.dataframe.iloc[si, :]], columns=d.dataframe.columns)
        IPython.display.display(sampled_tuple)
        for j in range(d.dataframe.shape[1]):
            cell = (si, j)
            value = d.dataframe.iloc[cell]
            d.labeled_cells[cell] = int(input("Is the value '{}' dirty?\nType 1 for yes.\nType 0 for no.\n".format(value)))
            # Record the label under the cluster that contains this cell.
            if cell in d.cells_clusters_k_j_ce[k][j]:
                c = d.cells_clusters_k_j_ce[k][j][cell]
                d.labels_per_cluster[(j, c)][cell] = d.labeled_cells[cell]
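The manual branch above calls `int(input(...))` directly, so a typo raises an exception and any integer is accepted as a label. A minimal sketch of a stricter prompt follows. The helpers `parse_label` and `ask_label` are hypothetical and not part of Raha, and the reader function is injectable so the example runs without a keyboard.

```python
def parse_label(text):
    """Convert a user answer to 0 or 1, rejecting anything else."""
    text = text.strip()
    if text not in ("0", "1"):
        raise ValueError("Expected 0 or 1, got {!r}.".format(text))
    return int(text)


def ask_label(value, read=input):
    """Keep asking until the answer is a valid label."""
    prompt = "Is the value '{}' dirty?\nType 1 for yes.\nType 0 for no.\n".format(value)
    while True:
        try:
            return parse_label(read(prompt))
        except ValueError as error:
            print(error)


# Simulated answers: two invalid entries followed by a valid one.
answers = iter(["yes", "2", " 1 "])
label = ask_label("7:16 a.m.", read=lambda prompt: next(answers))
print(label)  # 1
```

In the notebook loop, `ask_label(value)` could stand in for the `int(input(...))` expression.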

## 7. Propagating User Labels
Raha then propagates each user label through its cluster.

In [36]:
app.propagate_labels(d)
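The idea of propagation can be sketched as follows: if all user-labeled cells in a cluster agree, the remaining cells of that cluster inherit the same label. This agreement rule is an assumption for illustration; the notebook does not show how `app.propagate_labels` handles clusters with conflicting labels, and the function below is a hypothetical stand-in.

```python
def propagate_labels(clusters, labeled_cells):
    """Extend user labels to unlabeled cells of clusters whose labels all agree.

    Illustrative rule only: clusters with no labels or with mixed labels are skipped.
    """
    extended = dict(labeled_cells)
    for cells in clusters:
        labels = {labeled_cells[c] for c in cells if c in labeled_cells}
        if len(labels) == 1:
            label = labels.pop()
            for c in cells:
                extended.setdefault(c, label)
    return extended


# Hypothetical clusters of cells in column 3 and a few user labels.
clusters = [
    [(0, 3), (1, 3)],
    [(2, 3), (3, 3)],
    [(4, 3), (5, 3), (6, 3)],
]
labeled = {(0, 3): 0, (2, 3): 1, (4, 3): 0, (5, 3): 1}
extended = propagate_labels(clusters, labeled)
print(extended)
```

Here cells `(1, 3)` and `(3, 3)` inherit labels 0 and 1, while `(6, 3)` stays unlabeled because its cluster contains conflicting labels.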

## 8. Classifying Cells
Raha then trains one classifier per data column and applies it to predict the labels of the remaining data cells.

In [37]:
app.classify_cells(d)

A classification model is trained and tested for column 0.
A classification model is trained and tested for column 1.
A classification model is trained and tested for column 2.
A classification model is trained and tested for column 3.
A classification model is trained and tested for column 4.
A classification model is trained and tested for column 5.
A classification model is trained and tested for column 6.
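To illustrate the per-column step, the sketch below labels each unlabeled cell with the label of its nearest labeled cell in feature space. This 1-nearest-neighbour rule is a deliberately simple stand-in, not the model Raha trains; the function `predict_column` and the sample data are hypothetical.

```python
def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def predict_column(features, known_labels):
    """Give every cell of one column the label of its nearest labeled cell."""
    predictions = {}
    for cell, vector in features.items():
        if cell in known_labels:
            predictions[cell] = known_labels[cell]
            continue
        nearest = min(known_labels, key=lambda other: hamming(vector, features[other]))
        predictions[cell] = known_labels[nearest]
    return predictions


# Hypothetical feature vectors and (propagated) labels for column 3.
features = {
    (0, 3): [0, 0, 0],
    (1, 3): [1, 1, 0],
    (2, 3): [0, 0, 1],
    (3, 3): [1, 1, 1],
}
known_labels = {(0, 3): 0, (1, 3): 1}
predictions = predict_column(features, known_labels)
detected = {cell for cell, label in predictions.items() if label == 1}
print(detected)
```

Running one such predictor per column and collecting the cells predicted as dirty gives a set of detected cells, comparable in role to `d.detected_cells` in the next steps.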


## 9. Storing Results
We can also store the error detection results.

In [39]:
app.store_results(d)

The error detection results are stored in /media/mohammad/C20E45C80E45B5E7/Projects/raha/datasets/flights/raha-results-flights/error-detection/detection.dictionary.


## 10. Evaluating Error Detection
Finally, we can evaluate the detected cells against the ground truth.

In [40]:
p, r, f = d.get_data_cleaning_evaluation(d.detected_cells)[:3]
print("Raha's performance on {}:\nPrecision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(d.name, p, r, f))

Raha's performance on flights:
Precision = 0.87
Recall = 0.84
F1 = 0.85
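For reference, the three reported metrics can be computed from two sets of cells using their standard definitions. The sketch below is not the body of `get_data_cleaning_evaluation`, which the notebook does not show; the function name and the example cells are hypothetical.

```python
def precision_recall_f1(detected_cells, actual_errors):
    """Standard precision, recall, and F1 over sets of (row, column) cells."""
    detected = set(detected_cells)
    actual = set(actual_errors)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical cells: 3 of the 4 detected cells are among the 5 real errors.
detected = {(1, 3), (2, 3), (2, 6), (4, 1)}
actual = {(1, 3), (2, 3), (2, 6), (1, 6), (3, 2)}
p, r, f = precision_recall_f1(detected, actual)
print("Precision = {:.2f}\nRecall = {:.2f}\nF1 = {:.2f}".format(p, r, f))
```

Precision is the share of detected cells that are real errors, recall is the share of real errors that were detected, and F1 is their harmonic mean.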
