# Data Fingerprinting in Etiq

Fingerprinting compares summary metrics for two datasets to show how they relate, for example whether one is a pivot of the other.

## Use Cases:

* Determining whether a new dataset has the same characteristics as the old one.
* Determining whether a transformed dataset has the correct number of rows based on the original one.


## Metrics:

The following metrics are computed for each column in both datasets, though the list can be restricted if required.

Each metric is applied only to features of a suitable type. In the example output below, the text column `ClientID` is checked only for count, missing and unique.

* Count - Number of rows in the dataset.
* Minimum
* Maximum
* Mean
* Median
* Missing - Number of rows with a missing value in this column.
* Sum
* Unique - Count of distinct values in this column.
* Standard Deviation
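
As a rough illustration of what these metrics measure, the sketch below computes them for a single column in plain Python. It is not etiq's implementation: the helper names `is_missing` and `fingerprint_column` are hypothetical, and treating only numeric columns as suitable for the numeric metrics is an assumption. The sample values are copied from the claims table in the example that follows.

```python
import math
import statistics
from numbers import Number


def is_missing(value):
    """Treat None and NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fingerprint_column(values):
    """Summarise one column; numeric metrics are added only for numeric data."""
    present = [v for v in values if not is_missing(v)]
    summary = {
        "count": len(values),                   # rows in the dataset
        "missing": len(values) - len(present),  # rows with no value
        "unique": len(set(present)),            # distinct non-missing values
    }
    # Assumption: "suitable type" means numeric for the remaining metrics.
    numeric = present and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    )
    if numeric:
        summary.update({
            "min": min(present),
            "max": max(present),
            "mean": statistics.fmean(present),
            "median": statistics.median(present),
            "sum": math.fsum(present),
            # Sample standard deviation needs at least two values.
            "std": statistics.stdev(present) if len(present) > 1 else 0.0,
        })
    return summary


# Columns copied from the claims table in the example.
claims = {
    "ClientID": ["C01", "C02", "C01", "C03", "C05", "C05", "C01", "C02", "C02", "C04"],
    "Amount": [1.0, 2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
fingerprint = {name: fingerprint_column(col) for name, col in claims.items()}
for name, summary in fingerprint.items():
    print(name, summary)
```

Comparing two such summaries column by column is, conceptually, what lets a fingerprint say which metrics agree between datasets and which do not.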


## Getting Started

Let's load our datasets. For this example we use synthetic data: insurance claims, insurance premiums, and a profitability dataset derived from the other two:

In [1]:
from pathlib import Path
import pandas as pd

datapath = Path("./Data")
claims_df = pd.read_csv(datapath / "claims.csv")
premiums_df = pd.read_csv(datapath / "premiums.csv")
profitability_df = pd.read_csv(datapath / "profitability.csv")

In [2]:
# Claims made per client
claims_df

Unnamed: 0,ClaimID,ClientID,Month,Amount
0,A001,C01,1,1.0
1,A002,C02,2,2.0
2,A003,C01,1,2.0
3,A004,C03,3,2.0
4,A005,C05,4,0.5
5,A006,C05,6,0.5
6,A007,C01,11,0.5
7,A008,C02,12,0.5
8,A009,C02,8,0.5
9,A005,C04,9,0.5


In [3]:
# Premiums paid per customer
premiums_df.head(10)

Unnamed: 0,ClientID,Month,PremiumPaid
0,C01,1,0.1
1,C01,2,0.1
2,C01,3,0.1
3,C01,4,0.1
4,C01,5,0.1
5,C01,6,0.1
6,C01,7,0.1
7,C01,8,0.1
8,C01,9,0.1
9,C01,10,0.1


In [4]:
# Profitability per client - total claims (Amount) and total premiums (PremiumPaid).
profitability_df

Unnamed: 0,ClientID,Amount,PremiumPaid
0,C01,3.5,1.2
1,C02,3.0,1.2
2,C03,2.0,1.2
3,C04,0.5,1.2
4,C05,1.0,1.2


We wrap each dataset in an etiq adapter:

In [5]:
import etiq

claims_data = etiq.SimpleDatasetBuilder.datasets(validation_features=claims_df)
profitability_data = etiq.SimpleDatasetBuilder.datasets(validation_features=profitability_df)
premiums_data = etiq.SimpleDatasetBuilder.datasets(validation_features=premiums_df)



Create our project:

In [6]:
project = etiq.projects.open(name="Fingerprint Project")

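Create a snapshot for each dataset we want to compare. In etiq, a snapshot provides many methods for testing data issues. The claims snapshot is commented out because it is not used in the comparison below: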

In [7]:
profitability_snapshot = project.snapshots.create(name="Profitability", dataset=profitability_data, model=None)
#claims_snapshot = project.snapshots.create(name="Claims", dataset=claims_data, model=None)
premiums_snapshot = project.snapshots.create(name="Premiums", dataset=premiums_data, model=None)

INFO:etiq.charting:Histogram summary already created for this data.
INFO:etiq.charting:Histogram summary already created for this data.


Now that etiq knows about our datasets, we can start comparing them. How does our profitability data compare to our premiums data?

In [8]:
segments, issues, aggregate_issues = profitability_snapshot.scan_fingerprints(premiums_snapshot)

print("## Issues")
display(issues)
print("## Aggregate Issues")
display(aggregate_issues)

INFO:etiq.pipeline.BasePipeline0403:Starting pipeline
INFO:etiq.pipeline.BasePipeline0403:Completed pipeline
## Issues


Unnamed: 0,name,feature,segment,measure,measure_value,metric,metric_value,threshold,value,record
0,pivot,ClientID,all,,,count,60.0,"(0.99, 0.99)",,
1,pivot,PremiumPaid,all,,,count,60.0,"(0.99, 0.99)",,
2,pivot,PremiumPaid,all,,,min,0.1,"(0.99, 0.99)",,
3,pivot,PremiumPaid,all,,,max,0.1,"(0.99, 0.99)",,
4,pivot,PremiumPaid,all,,,mean,0.1,"(0.99, 0.99)",,
5,pivot,PremiumPaid,all,,,median,0.1,"(0.99, 0.99)",,
6,pivot,PremiumPaid,all,,,std,4.198471e-17,"(0.99, 0.99)",,


## Aggregate Issues


Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,pivot,count,,{ClientID},{all},1,1,"(0.99, 0.99)"
1,pivot,missing,,{ClientID},{},1,0,"(0.99, 0.99)"
2,pivot,unique,,{ClientID},{},1,0,"(0.99, 0.99)"
3,pivot,count,,{PremiumPaid},{all},1,1,"(0.99, 0.99)"
4,pivot,min,,{PremiumPaid},{all},1,1,"(0.99, 0.99)"
5,pivot,max,,{PremiumPaid},{all},1,1,"(0.99, 0.99)"
6,pivot,mean,,{PremiumPaid},{all},1,1,"(0.99, 0.99)"
7,pivot,median,,{PremiumPaid},{all},1,1,"(0.99, 0.99)"
8,pivot,missing,,{PremiumPaid},{},1,0,"(0.99, 0.99)"
9,pivot,sum,,{PremiumPaid},{},1,0,"(0.99, 0.99)"


## Interpreting Results:

* The `name` column reads "pivot": etiq has inferred that the profitability data is a pivot of the premiums table.
* We can spot the differences between the two tables:
  * The count differs between the tables. This is expected, because the pivot collapses the 60 premium rows into one row per client.
  * No issues were found for missing or unique `ClientID` values - a good indication that the pivot kept every client.
  * The count, minimum, maximum, mean, median and standard deviation of `PremiumPaid` differ, *but* no issue was found for the sum, which suggests our aggregation is correct.

Note too that only the fields common to both tables (`ClientID` and `PremiumPaid`) have been tested.
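
The reasoning above can be checked by hand with a small plain-Python sketch. This is not how etiq detects a pivot. It only shows why a group-and-sum changes the count while preserving the sum and the set of client IDs. The premiums data is reconstructed under an assumption: 12 monthly payments of 0.1 for each of five clients, which is consistent with the 60-row count and the 1.2 totals shown earlier.

```python
import math
from collections import defaultdict

# Assumption: 5 clients x 12 months, each paying 0.1 per month.
premiums = [
    {"ClientID": f"C{client:02d}", "Month": month, "PremiumPaid": 0.1}
    for client in range(1, 6)
    for month in range(1, 13)
]

# Pivot: one row per client holding the summed premium.
totals = defaultdict(float)
for row in premiums:
    totals[row["ClientID"]] += row["PremiumPaid"]
pivoted = [{"ClientID": cid, "PremiumPaid": total} for cid, total in totals.items()]

# Only columns present in both tables can be compared.
common_columns = sorted(set(premiums[0]) & set(pivoted[0]))

source_values = [row["PremiumPaid"] for row in premiums]
pivot_values = [row["PremiumPaid"] for row in pivoted]

# The row count changes, but the total and the client IDs survive the pivot.
count_changed = len(source_values) != len(pivot_values)
sum_preserved = math.isclose(math.fsum(source_values), math.fsum(pivot_values))
ids_preserved = {r["ClientID"] for r in premiums} == {r["ClientID"] for r in pivoted}

print(common_columns)
print(count_changed, sum_preserved, ids_preserved)
```

The sums are compared with `math.isclose` rather than `==` because repeated floating-point addition of 0.1 does not produce exactly 1.2.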

To cover:

* Specific Metrics
* Specific Groupings
* Custom Margin