## ETIQ Data Fingerprinting Example

Data fingerprinting lets us compare the datasets held in different snapshots and raises issues when the data has changed significantly.
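As a rough illustration of the idea (not ETIQ's implementation), a fingerprint for a continuous column can be thought of as a summary of the baseline data, such as its observed range, which new data is then checked against. The sketch below uses plain pandas on made-up values; the helper names `range_fingerprint` and `count_out_of_range` are hypothetical.

```python
import pandas as pd


def range_fingerprint(series: pd.Series) -> tuple:
    """Summarise a continuous column by its observed (min, max)."""
    return (series.min(), series.max())


def count_out_of_range(series: pd.Series, fingerprint: tuple) -> int:
    """Count values in `series` that fall outside the baseline range."""
    low, high = fingerprint
    # `between` is inclusive on both ends by default.
    return int((~series.between(low, high)).sum())


# Toy data (illustrative only): the current data has two ages above the baseline range.
baseline = pd.DataFrame({"age": [17, 25, 40, 66]})
current = pd.DataFrame({"age": [17, 25, 40, 66, 70, 90]})

fp = range_fingerprint(baseline["age"])
violations = count_out_of_range(current["age"], fp)
print(fp, violations)
```

Each value outside the baseline range would count as one issue for that column.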

In [1]:
import etiq

Thanks for using the ETIQ.AI toolkit

Help improve our product: Call `etiq.enable_telemetry()` to provide
anonymous library usage statistics.
        


Here we load a sample dataset and build two versions of it: the full data, and a mock dataset restricted to rows where `age` is below 67.
<div class="alert alert-block alert-warning">
    Note that we need to label our column types correctly as categorical or continuous for us to apply the correct fingerprint to each column.
</div>
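If the column lists are not known up front, one possible starting point is to split columns by dtype and then correct the result by hand. This is a sketch in plain pandas, not part of the ETIQ API; `split_column_types` and its `force_categorical` argument are hypothetical. Numerically encoded categories such as `educational-num` need to be listed explicitly, because their dtype alone would mark them as continuous.

```python
import pandas as pd


def split_column_types(df: pd.DataFrame, force_categorical=()):
    """Split columns into (categorical, continuous) lists based on dtype.

    Columns named in `force_categorical` are treated as categorical even
    if they are stored as numbers.
    """
    cat_col, cont_col = [], []
    for name in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[name])
        if is_numeric and name not in force_categorical:
            cont_col.append(name)
        else:
            cat_col.append(name)
    return cat_col, cont_col


# Toy frame (illustrative values only) using a few of the column names above.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "workclass": ["Private", "Local-gov", "Private"],
    "educational-num": [7, 9, 12],
    "hours-per-week": [40, 50, 40],
})

toy_cat, toy_cont = split_column_types(toy, force_categorical=["educational-num"])
print(toy_cat, toy_cont)
```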

In [None]:
dataframe = etiq.load_sample("adultdataset")

cat_col = ["workclass", "education", "educational-num", "marital-status", "occupation", "relationship", "race", "gender", "native-country"]
cont_col = ["age", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week", "income"]

dataset1 = etiq.datasets.SimpleDatasetBuilder.dataset(dataframe[dataframe.age < 67], label="income", cat_col=cat_col, cont_col=cont_col)
dataset2 = etiq.datasets.SimpleDatasetBuilder.dataset(dataframe, label="income", cat_col=cat_col, cont_col=cont_col)

In [38]:
project = etiq.projects.open("Data Fingerprinting Example")

model = etiq.model.DefaultXGBoostClassifier()

snapshot1 = project.snapshots.create(name="Snapshot 1", model=model, dataset=dataset1)
snapshot2 = project.snapshots.create(name="Snapshot 2", model=model, dataset=dataset2)

INFO:etiq.charting:Created histogram summary of data (15 fields)
INFO:etiq.charting:Histogram summary already created for this data.


### The Comparison

Here we run the comparison: we check the dataset in our current snapshot (`snapshot2`) against the dataset in `snapshot1`, which acts as the baseline.

Note the `margin` argument, which sets how much difference we are willing to tolerate. Here `0.1` means a tolerance of +/- 10%.
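To make the tolerance concrete, here is a minimal sketch of how a relative margin could translate into a lower and upper bound. This is an assumption about the arithmetic, not ETIQ's code, and `tolerance_band` and `within_band` are hypothetical helpers. It is consistent with the unique-value-count thresholds in the output further down, such as `(8.1, 9.9)`, which would correspond to a baseline count of 9.

```python
def tolerance_band(baseline_value, margin):
    """Return (lower, upper) bounds at +/- `margin` around a baseline value."""
    # Rounding only tidies up floating-point noise in the bounds.
    lower = round(baseline_value * (1 - margin), 10)
    upper = round(baseline_value * (1 + margin), 10)
    return lower, upper


def within_band(new_value, baseline_value, margin):
    """True if `new_value` stays inside the tolerated band."""
    lower, upper = tolerance_band(baseline_value, margin)
    return lower <= new_value <= upper


band = tolerance_band(9, 0.1)        # baseline of 9 distinct values
print(band)
print(within_band(9, 9, 0.1))        # unchanged count
print(within_band(11, 9, 0.1))       # two extra categories appear
```

A larger `margin` widens the band and so reports fewer issues; a smaller one makes the comparison stricter.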

In [43]:
segments, issues, issueaggregates = snapshot2.scan_data_changes(snapshot1, margin=0.1)

INFO:etiq.pipeline.DataPipeline0105:Starting pipeline
INFO:great_expectations.datasource.fluent.config:Loading 'datasources' ->
[]
INFO:great_expectations.validator.validator:	30 expectation(s) included in expectation_suite. result_format settings filtered.
INFO:etiq.pipeline.DataPipeline0105:Running Great Expectation scans
INFO:great_expectations.validator.validator:	30 expectation(s) included in expectation_suite.


Calculating Metrics:   0%|          | 0/113 [00:00<?, ?it/s]

INFO:etiq.pipeline.DataPipeline0105:Completed pipeline


In [44]:
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports

header = lambda x: display(HTML(f"<h2>{x}</h2>"))

header("Segments")
display(segments)
header("Issues")
display(issues)
header("Issue Aggregates")
display(issueaggregates)


| | name | business_rule | mask | tags | is_global | number_of_samples | total_number_of_samples | raw_business_rule | parent_segment_business_rule |
|---|---|---|---|---|---|---|---|---|---|
| 0 | all | all | [] | {} | True | 0.0 | 0 |  |  |


| | name | feature | segment | measure | measure_value | metric | metric_value | threshold | value | record |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |
| 1 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |
| 2 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |
| 3 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |
| 4 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1561 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |
| 1562 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |
| 1563 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |
| 1564 | expect_column_values_to_be_between | age | all |  |  |  |  | (17, 66) |  |  |


| | name | metric | measure | features | segments | total_issues_tested | issues_found | threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | expect_column_values_to_be_between |  |  | {age} | {all} | 48842 | 1566 | (17, 66) |
| 1 | expect_column_values_to_not_be_null |  |  | {age} | {} | 48842 | 0 | (nan, nan) |
| 2 | expect_column_unique_value_count_to_be_between |  |  | {workclass} | {} | 0 | 0 | (8.1, 9.9) |
| 3 | expect_column_values_to_not_be_null |  |  | {workclass} | {} | 48842 | 0 | (nan, nan) |
| 4 | expect_column_values_to_be_between |  |  | {fnlwgt} | {} | 48842 | 0 | (12285, 1490400) |
| 5 | expect_column_values_to_not_be_null |  |  | {fnlwgt} | {} | 48842 | 0 | (nan, nan) |
| 6 | expect_column_unique_value_count_to_be_between |  |  | {education} | {} | 0 | 0 | (14.4, 17.6) |
| 7 | expect_column_values_to_not_be_null |  |  | {education} | {} | 48842 | 0 | (nan, nan) |
| 8 | expect_column_unique_value_count_to_be_between |  |  | {educational-num} | {} | 0 | 0 | (14.4, 17.6) |
| 9 | expect_column_values_to_not_be_null |  |  | {educational-num} | {} | 48842 | 0 | (nan, nan) |

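With many checks in the aggregate table, it can help to narrow it down to the ones that actually found something. The sketch below assumes `issueaggregates` behaves like a pandas DataFrame with the columns shown above; to stay self-contained it uses a hand-built stand-in with three rows copied from that output (the `features` values are simplified to plain strings), and the `issue_rate` column is a hypothetical addition.

```python
import pandas as pd

# Stand-in for `issueaggregates`, copied from rows 0, 1 and 4 of the output above.
aggregates = pd.DataFrame({
    "name": [
        "expect_column_values_to_be_between",
        "expect_column_values_to_not_be_null",
        "expect_column_values_to_be_between",
    ],
    "features": ["age", "age", "fnlwgt"],
    "total_issues_tested": [48842, 48842, 48842],
    "issues_found": [1566, 0, 0],
})

# Keep only the checks that reported at least one issue.
failing = aggregates[aggregates["issues_found"] > 0].copy()

# Share of tested rows that violated the check.
failing["issue_rate"] = failing["issues_found"] / failing["total_issues_tested"]

print(failing[["name", "features", "issues_found", "issue_rate"]])
```

In this example only the range check on `age` fails, affecting roughly 3% of the rows tested, which matches the rows excluded from the first dataset by the `age < 67` filter being outside its `(17, 66)` range.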