# Notebook Summary 


### Quickstart

  1. Import the etiq library. For installation instructions, see our docs (https://docs.etiq.ai/).

  2. Log in to the dashboard so that you can send results to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai).

  3. Create or open a project 
  
### Data Issues


  4. Load the Adult dataset
  
  5. Create a comparison dataset based on Adult, for example purposes
  
  6. Load your config file and create your snapshot
  
  7. Scan for data issues


## What are data issues?

Data collection and validation form an essential part of any machine learning pipeline. A number of issues can arise at the data collection phase, and the Etiq library provides a way of detecting them. Instead of requiring the user to define explicit rules for what constitutes valid data, Etiq generates the rules automatically from an exemplar dataset.

The different kinds of data issues detected are:

1. **Identical Feature**  
This is a data issue where a feature in the comparison dataset has values that are identical copies of those in the exemplar dataset.

2. **Missing Feature**  
This is a data issue where a feature in the exemplar dataset is missing from the comparison dataset.

3. **Unknown Feature**  
This is a data issue where a feature in the comparison dataset is missing from the exemplar dataset.

4. **Missing Feature Category**  
This is a data issue where a categorical feature has values in the exemplar dataset which are missing from the comparison dataset.

5. **Unknown Feature Category**  
This is a data issue where a categorical feature has values in the comparison dataset which are missing from the exemplar dataset.

6. **Feature Value Below Minimum**  
This is a data issue where a continuous feature has one or more values in the comparison dataset which are lower than the minimum value for that feature in the exemplar dataset.

7. **Feature Value Above Maximum**  
This is a data issue where a continuous feature has one or more values in the comparison dataset which are higher than the maximum value for that feature in the exemplar dataset.

8. **Ordering**  
This is a data issue where, within a record from the dataset, some predefined ordering is violated, e.g. an end date feature occurs before a start date feature.

9. **Missing IDs**  
This is a data issue where a record in the dataset has no ID.

10. **Duplicate Records**  
This is a data issue where the dataset contains duplicate records. The set of fields that determines a unique record can be set in the configuration; if it is not set, this scan is skipped.

11. **Data Drift**  
This is a data issue where the underlying distribution of a feature has changed or "drifted" between the base (exemplar) dataset and the comparison dataset.
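
To make the feature, category and range rules above concrete, here is a minimal sketch in plain Python of how such rules could be derived from an exemplar and checked against a comparison dataset. The helper names (`feature_issues`, `category_issues`, `range_issues`) and the tiny sample values are hypothetical illustrations; they are not part of the Etiq API and do not describe how Etiq implements these checks internally.

```python
# Illustrative only: hypothetical helpers, not Etiq functions.

def feature_issues(exemplar_columns, comparison_columns):
    """Return (missing, unknown) feature names between two sets of columns."""
    exemplar, comparison = set(exemplar_columns), set(comparison_columns)
    return exemplar - comparison, comparison - exemplar


def category_issues(exemplar_values, comparison_values):
    """Return (missing, unknown) category values for one categorical feature."""
    exemplar, comparison = set(exemplar_values), set(comparison_values)
    return exemplar - comparison, comparison - exemplar


def range_issues(exemplar_values, comparison_values):
    """Return (below_min, above_max) values for one continuous feature."""
    low, high = min(exemplar_values), max(exemplar_values)
    below_min = [v for v in comparison_values if v < low]
    above_max = [v for v in comparison_values if v > high]
    return below_min, above_max


# Made-up sample values that mimic the changes applied later in this notebook.
exemplar_status = ["Never-married", "Divorced", "Widowed", "Married-civ-spouse"]
comparison_status = ["Single", "Single", "Single", "Married-civ-spouse"]
missing_categories, unknown_categories = category_issues(exemplar_status, comparison_status)

exemplar_hours = [1, 40, 50, 99]
comparison_hours = [1.5 * h for h in exemplar_hours]
below_min, above_max = range_issues(exemplar_hours, comparison_hours)

missing_features, unknown_features = feature_issues(["age", "gender"], ["age", "sex"])

print(sorted(missing_categories), sorted(unknown_categories))
print(below_min, above_max)
print(missing_features, unknown_features)
```

The point of the sketch is that every rule is inferred from the exemplar values themselves (its set of categories, its minimum and maximum), so no thresholds have to be written by hand.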

## Set-up

In [1]:
import numpy as np
import pandas as pd
# Import etiq
import etiq
from etiq import SimpleDatasetBuilder, etiq_config

Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.




In [1]:
from etiq import login as etiq_login
# Uncomment and supply your own token to send results to the dashboard
#etiq_login("https://dashboard.etiq.ai/", '<your-token>')


In [3]:
# Can get/create a single named project
project = etiq.projects.open(name="Data Issues Fixed")

## Create the test datasets based on the Adult Income Dataset

In [4]:
from numpy.random import RandomState

prng = RandomState(seed=3)
# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdata")
data = data.replace('?', np.nan)
data.dropna(inplace=True)
data.reset_index(inplace=True, drop=True)
# Create a fake id
data['ID'] = range(0,len(data))
data.drop(['native-country', 'capital-loss'], axis=1, inplace=True)
# Create a couple of fake timestamp columns
start = pd.to_datetime('2022-01-01').value//10**9
end = pd.to_datetime('2022-12-31').value//10**9

data['start_date'] = [f'{pd.to_datetime(prng.randint(start, end), unit="s").date()}' for _ in range(len(data))]
data['end_date'] = data.apply(lambda row: f'{(pd.Timestamp(row["start_date"]) + pd.Timedelta(days=1)).date()}', axis=1)

data_comparison = data.copy()
# For 10 distinct random rows swap the start and end dates
# (replace=False stops a row being picked twice and swapped back)
date_violation_rows = prng.choice(range(0,len(data)), size=10, replace=False)
for idx in date_violation_rows:
    # Use separate names so the epoch bounds `start`/`end` above are not overwritten
    row_start, row_end = data_comparison.loc[idx, 'start_date'], data_comparison.loc[idx, 'end_date']
    data_comparison.loc[idx, 'end_date'] = row_start
    data_comparison.loc[idx, 'start_date'] = row_end
# For five distinct random rows remove the ID
missing_id_rows = prng.choice(range(0,len(data)), size=5, replace=False)
for idx in missing_id_rows:
    data_comparison.loc[idx, 'ID'] = pd.NA
# For five distinct random rows duplicate the ID (the last four copy the ID of the first)
duplicate_id_rows = prng.choice(range(0,len(data)), size=5, replace=False)
for idx in duplicate_id_rows[1:]:
    data_comparison.loc[idx, 'ID'] = data_comparison.loc[duplicate_id_rows[0], 'ID']
# Expand the number of hours worked per week by 50%
data_comparison['hours-per-week'] = 1.5 * data['hours-per-week']
# Consolidate "Never-married", "Divorced" and "Widowed" into a new "Single" status
data_comparison.loc[data.eval('`marital-status` in ["Never-married", "Divorced", "Widowed"]'), 'marital-status'] = 'Single' 
# Shuffle the data (seeded with prng so the shuffle is reproducible)
data_comparison = data_comparison.sample(frac=1, random_state=prng).reset_index(drop=True)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,hours-per-week,income,ID,start_date,end_date
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,40,<=50K,0,2022-07-15,2022-07-16
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,50,<=50K,1,2022-01-21,2022-01-22
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,40,>50K,2,2022-09-07,2022-09-08
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,40,>50K,3,2022-07-19,2022-07-20
4,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,30,<=50K,4,2022-04-02,2022-04-03


In [5]:
data_comparison.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,hours-per-week,income,ID,start_date,end_date
0,23,Private,437161,Some-college,10,Single,Other-service,Unmarried,Black,Female,0,60.0,<=50K,14648,2022-06-17,2022-06-18
1,27,Private,223751,Some-college,10,Single,Craft-repair,Not-in-family,White,Male,0,67.5,<=50K,37076,2022-11-29,2022-11-30
2,34,Private,162442,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,30.0,>50K,34934,2022-01-15,2022-01-16
3,34,Federal-gov,149368,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,60.0,>50K,27239,2022-06-20,2022-06-21
4,35,Private,185366,Bachelors,13,Single,Sales,Not-in-family,White,Male,0,75.0,<=50K,43911,2022-10-19,2022-10-20


## Find data issues


In [6]:
with etiq.etiq_config('data-issues-config.json'):
    # Construct etiq datasets for the base and comparison datasets
    base_dataset = SimpleDatasetBuilder.dataset(data)
    comparison_dataset = SimpleDatasetBuilder.dataset(data_comparison)
    # Get/create a single named project
    project = etiq.projects.open(name="Data Issues")
    # Creating a snapshot
    base_snapshot = project.snapshots.create(name="Base Snapshot",
                                            dataset=base_dataset,
                                            model=None)
    # Run data issues scan on the base dataset
    (base_segments, base_issues, base_issue_summary) = base_snapshot.scan_data_issues()
    # Creating a snapshot
    comparison_snapshot = project.snapshots.create(name="Comparison Snapshot",
                                            dataset=comparison_dataset,
                                            model=None)
    # Run data issues scan on the comparison dataset
    (comparison_segments, comparison_issues, comparison_issue_summary) = comparison_snapshot.scan_data_issues()

INFO:etiq.charting:Histogram summary already created for this data.
INFO:etiq.pipeline.DataPipeline0604:Starting pipeline
INFO:etiq.pipeline.DataPipeline0604:Completed pipeline
INFO:etiq.charting:Created histogram summary of data (13 fields)
INFO:etiq.pipeline.DataPipeline0544:Starting pipeline
INFO:etiq.pipeline.DataPipeline0544:Completed pipeline


The base dataset should have no issues. The scan of the comparison snapshot should find the issues injected above: five missing IDs, five duplicated IDs and ten order violations.

In [7]:
base_issue_summary

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,order_violation,,,{},{},45222,0,"(nan, nan)"
1,missing_id,,,{},{},45222,0,"(nan, nan)"
2,duplicate_record,,,{},{},45222,0,"(nan, nan)"


In [8]:
comparison_issue_summary

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,order_violation,,,{start_date},{all},45222,10,"(nan, nan)"
1,missing_id,,,{ID},{all},45222,5,"(nan, nan)"
2,duplicate_record,,,{},{all},45222,5,"(nan, nan)"
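
The three record-level checks reported in these summaries (ordering, missing IDs and duplicate records) can be illustrated with a small library-free sketch. The helper names and the four sample records below are hypothetical; this shows the idea behind each check under simple assumptions and is not Etiq's implementation or its counting convention.

```python
from collections import Counter
from datetime import date

# Hypothetical records; the field names mirror the columns used in this notebook.
records = [
    {"ID": 0, "start_date": "2022-07-15", "end_date": "2022-07-16"},
    {"ID": 1, "start_date": "2022-01-22", "end_date": "2022-01-21"},     # dates swapped
    {"ID": None, "start_date": "2022-09-07", "end_date": "2022-09-08"},  # ID missing
    {"ID": 0, "start_date": "2022-04-02", "end_date": "2022-04-03"},     # ID repeated
]


def order_violations(rows, earlier, later):
    """Indices of rows where the `earlier` date falls after the `later` date."""
    return [
        i for i, row in enumerate(rows)
        if date.fromisoformat(row[earlier]) > date.fromisoformat(row[later])
    ]


def missing_ids(rows, id_field):
    """Indices of rows whose ID field is empty."""
    return [i for i, row in enumerate(rows) if row[id_field] is None]


def duplicate_records(rows, key_fields):
    """Indices of rows whose key (the chosen fields) occurs more than once."""
    keys = [tuple(row[field] for field in key_fields) for row in rows]
    # Rows with an empty key field are left to the missing-ID check.
    counts = Counter(key for key in keys if None not in key)
    return [i for i, key in enumerate(keys) if counts[key] > 1]


print(order_violations(records, "start_date", "end_date"))
print(missing_ids(records, "ID"))
print(duplicate_records(records, ["ID"]))
```

Passing the key fields explicitly mirrors the note earlier in this notebook that the fields defining a unique record come from the configuration.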


In [10]:
with etiq.etiq_config('data-issues-config.json'):
    # Construct etiq datasets for the base and comparison datasets
    base_dataset = SimpleDatasetBuilder.dataset(data)
    comparison_dataset = SimpleDatasetBuilder.dataset(data_comparison)
    # Get/create a single named project
    project = etiq.projects.open(name="Data Issues")
    # Creating a snapshot
    snapshot = project.snapshots.create(name="Data Issues Snapshot",
                                        dataset=base_dataset,
                                        comparison_dataset=comparison_dataset,
                                        model=None)
    # Run data issues scan
    (segments, issues, issue_summary) = snapshot.scan_data_issues()
    # Scan for data drift
    (drift_segments, drift_issues, drift_issue_summary) = snapshot.scan_drift_metrics(drift_measures = ['psi'], 
                                                                                      features=['hours-per-week',
                                                                                                'age',
                                                                                                'capital-gain',
                                                                                                'gender'])


INFO:etiq.charting:Histogram summary already created for this data.
INFO:etiq.charting:Histogram summary already created for this data.
INFO:etiq.pipeline.DataPipeline0749:Starting pipeline
INFO:etiq.pipeline.DataPipeline0749:Completed pipeline
INFO:etiq.pipeline.DriftPipeline0273:Starting pipeline
INFO:etiq.pipeline.DriftPipeline0273:Calculated breakpoints for hours-per-week
INFO:etiq.pipeline.DriftPipeline0273:Calculated breakpoints for capital-gain
INFO:etiq.pipeline.DriftPipeline0273:Calculated breakpoints for age
INFO:etiq.pipeline.DriftPipeline0273:Calculated drift measures.
INFO:etiq.pipeline.DriftPipeline0273:Identifying drift measure issues.
INFO:etiq.pipeline.DriftPipeline0273:Completed pipeline


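The base/comparison scan will find five issues. Four of them concern the "marital-status" feature: three categories present in the base dataset are missing from the comparison dataset ("Divorced", "Never-married", "Widowed") and one category is new ("Single"). The fifth issue highlights the changed range of the "hours-per-week" feature, whose values now exceed the maximum seen in the base dataset. The drift scan should find that only the **hours-per-week** feature shows drift.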

In [11]:
issues

Unnamed: 0,name,feature,segment,measure,measure_value,metric,metric_value,threshold,value,record
0,missing_category,marital-status,all,,,,,"(nan, nan)",Divorced,
1,missing_category,marital-status,all,,,,,"(nan, nan)",Never-married,
2,missing_category,marital-status,all,,,,,"(nan, nan)",Widowed,
3,unrecognized_category,marital-status,all,,,,,"(nan, nan)",Single,
4,feature_value_above_maximum,hours-per-week,all,,,,,"(1, 99)",148.5,


In [12]:
issue_summary

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,identical_feature_values,,,{},{},16,0,"(nan, nan)"
1,missing_feature,,,{},{},16,0,"(nan, nan)"
2,unrecognized_feature,,,{},{},16,0,"(nan, nan)"
3,missing_category,,,{marital-status},{all},16,1,"(nan, nan)"
4,unrecognized_category,,,{marital-status},{all},16,1,"(nan, nan)"
5,feature_value_below_minimum,,,{},{},16,0,"(nan, nan)"
6,feature_value_above_maximum,,,{hours-per-week},{all},16,1,"(nan, nan)"


In [13]:
drift_issue_summary

Unnamed: 0,name,metric,measure,features,segments,total_issues_tested,issues_found,threshold
0,feature_drift_above_threshold,,<function psi at 0x7f0b17905f70>,{hours-per-week},{all},4,1,"(0.0, 0.2)"
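
For readers unfamiliar with the `psi` measure, the following is a minimal sketch of the Population Stability Index as it is commonly defined: both datasets are binned with the same breakpoints and the per-bin proportions are compared. The function name, the fixed breakpoints, the `eps` floor for empty bins and the sample values are all assumptions made for illustration; Etiq's own breakpoint calculation and smoothing may differ. The 0.2 upper threshold is the one shown in the summary above.

```python
import bisect
import math


def population_stability_index(expected, actual, breakpoints, eps=1e-6):
    """PSI between two samples, binned using shared, sorted breakpoints."""

    def proportions(values):
        counts = [0] * (len(breakpoints) + 1)
        for value in values:
            # bisect_right gives the index of the bin that contains `value`
            counts[bisect.bisect_right(breakpoints, value)] += 1
        # Floor each proportion at eps so that empty bins do not produce log(0)
        return [max(count / len(values), eps) for count in counts]

    expected_p = proportions(expected)
    actual_p = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected_p, actual_p))


# Made-up hours-per-week values; the comparison is scaled by 1.5 as in this notebook.
base_hours = [20, 30, 40, 40, 40, 40, 50, 60]
comparison_hours = [1.5 * h for h in base_hours]
breakpoints = [30, 40, 50]

psi_same = population_stability_index(base_hours, base_hours, breakpoints)
psi_shifted = population_stability_index(base_hours, comparison_hours, breakpoints)
print(psi_same, psi_shifted, psi_shifted > 0.2)
```

Identical samples give a PSI of zero, while the scaled sample moves most of its mass into the top bin and lands well above the 0.2 threshold.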
