# Notebook Summary 


### Quickstart

  1. Import the etiq library. For installation instructions, see our docs (https://docs.etiq.ai/)

  2. Log in to the dashboard so that scan results are sent to your dashboard instance (the Etiq AWS instance if you use the SaaS version). To deploy on your own cloud instance, get in touch (info@etiq.ai)

  3. Create or open a project 
  
### Data Issues


  4. Load the Adult dataset
  
  5. Create a comparison dataset based on Adult, for example purposes
  
  6. Load your config file and create your snapshot
  
  7. Scan for data issues


## What are data issues?

Data collection and validation form an essential part of any machine learning pipeline. A number of issues can arise during data collection, and the Etiq library provides a way of detecting them. Instead of requiring the user to define explicit rules for what constitutes valid data, the library generates the rules automatically from an exemplar dataset.
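To make the idea of rules generated from an exemplar concrete, here is a minimal sketch in plain pandas. It is not Etiq's implementation: `derive_rules` is a hypothetical helper, and the rule format (a min/max range for numeric columns, a set of allowed values for everything else) is an assumption chosen for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def derive_rules(exemplar: pd.DataFrame) -> dict:
    """Derive simple validity rules from an exemplar DataFrame (illustrative only)."""
    rules = {}
    for column in exemplar.columns:
        series = exemplar[column].dropna()
        if is_numeric_dtype(series):
            # Continuous feature: remember the observed range.
            rules[column] = {"kind": "continuous",
                             "min": series.min(),
                             "max": series.max()}
        else:
            # Categorical feature: remember the set of observed values.
            rules[column] = {"kind": "categorical",
                             "categories": set(series.unique())}
    return rules


# Toy exemplar, invented for this sketch.
exemplar = pd.DataFrame({
    "hours-per-week": [40, 50, 30],
    "marital-status": ["Never-married", "Divorced", "Married-civ-spouse"],
})
rules = derive_rules(exemplar)
print(rules["hours-per-week"])
print(rules["marital-status"])
```

Any later dataset can then be checked against `rules` without the user writing a single rule by hand.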

The different kinds of data issues detected are:

1. **Identical Feature**:  
This is a data issue where a feature in the comparison dataset has values that are identical copies of that feature's values in the exemplar dataset.

2. **Missing Feature**:  
This is a data issue where a feature in the exemplar dataset is missing from the comparison dataset.

3. **Unknown Feature**:  
This is a data issue where a feature in the comparison dataset is missing from the exemplar dataset. The scan output reports it as `unrecognized_feature`.

4. **Missing Feature Category**:  
This is a data issue where a categorical feature has values in the exemplar dataset which are missing from the comparison dataset.

5. **Unknown Feature Category**:  
This is a data issue where a categorical feature has values in the comparison dataset which are missing from the exemplar dataset. The scan output reports it as `unrecognized_category`.

6. **Feature Value Below Minimum**:  
This is a data issue where a continuous feature has one or more values in the comparison dataset which are lower than the minimum value for that feature in the exemplar dataset.

7. **Feature Value Above Maximum**:  
This is a data issue where a continuous feature has one or more values in the comparison dataset which are higher than the maximum value for that feature in the exemplar dataset.
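The seven checks listed above can be sketched in plain pandas as follows. This is an illustration under stated assumptions, not Etiq's implementation: `find_issues` is a hypothetical helper, the identical-feature check is assumed to be a row-by-row comparison, and the toy data is invented. The issue labels mirror the names that appear in the scan output later in this notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_issues(exemplar: pd.DataFrame, comparison: pd.DataFrame) -> list:
    """Return (issue_name, feature, value) tuples (illustrative only)."""
    issues = []
    # Feature-level checks: columns present in one dataset but not the other.
    issues += [("missing_feature", c, None)
               for c in exemplar.columns if c not in comparison.columns]
    issues += [("unrecognized_feature", c, None)
               for c in comparison.columns if c not in exemplar.columns]

    for c in [c for c in exemplar.columns if c in comparison.columns]:
        base = exemplar[c].reset_index(drop=True)
        other = comparison[c].reset_index(drop=True)
        # Assumption: "identical" means equal row by row, with the same dtype.
        if base.equals(other):
            issues.append(("identical_feature_values", c, None))
        if is_numeric_dtype(base) and is_numeric_dtype(other):
            # Continuous feature: compare against the exemplar range.
            if other.min() < base.min():
                issues.append(("feature_value_below_minimum", c, other.min()))
            if other.max() > base.max():
                issues.append(("feature_value_above_maximum", c, other.max()))
        else:
            # Categorical feature: compare the sets of observed values.
            base_cats, other_cats = set(base.dropna()), set(other.dropna())
            issues += [("missing_category", c, v)
                       for v in sorted(base_cats - other_cats)]
            issues += [("unrecognized_category", c, v)
                       for v in sorted(other_cats - base_cats)]
    return issues


exemplar = pd.DataFrame({
    "age": [25, 38, 28],
    "hours-per-week": [40, 50, 30],
    "marital-status": ["Never-married", "Married-civ-spouse", "Divorced"],
    "income": ["<=50K", "<=50K", ">50K"],
})
comparison = pd.DataFrame({
    "age": [25, 38, 28],                       # unchanged copy
    "hours-per-week": [20, 75, 40],            # outside the exemplar range
    "marital-status": ["Single", "Married-civ-spouse", "Single"],
    "extra": [1, 2, 3],                        # not in the exemplar
})
issues = find_issues(exemplar, comparison)
for issue in issues:
    print(issue)
```

Each check only needs the exemplar as a reference, which is why no explicit rules have to be written.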

## Set-up

In [1]:
import numpy as np
# Import etiq
import etiq
from etiq import SimpleDatasetBuilder, etiq_config

Thanks for trying out the ETIQ.ai toolkit!

Visit our getting started documentation at https://docs.etiq.ai/

Visit our Slack channel at https://etiqcore.slack.com/ for support or feedback.




In [2]:
from etiq import login as etiq_login
etiq_login("https://dashboard.etiq.ai/", '<your-token>')
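To keep the token out of the notebook itself, it can be read from an environment variable before logging in. The sketch below is one way to do that: the variable name `ETIQ_TOKEN` and the helpers `load_token` and `login_from_env` are hypothetical, and the only Etiq call used is the `login(url, token)` call shown in the cell above.

```python
import os


def load_token(env_var: str = "ETIQ_TOKEN") -> str:
    """Read the dashboard token from the environment (variable name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your dashboard token"
        )
    return token


def login_from_env(url: str = "https://dashboard.etiq.ai/") -> None:
    """Log in using the token from the environment."""
    # Imported here so that load_token can be used without etiq installed.
    from etiq import login as etiq_login
    etiq_login(url, load_token())
```

Calling `login_from_env()` then takes the place of the hard-coded token in the cell above.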

In [3]:
# Can get/create a single named project
project = etiq.projects.open(name="Data Issues Fixed")

## Create the test datasets based on the Adult Income Dataset

In [4]:
# Loading a dataset. We're using the adult dataset
data = etiq.utils.load_sample("adultdata")
data = data.replace('?', np.nan)
data.dropna(inplace=True)
data.reset_index(inplace=True, drop=True)

data_comparison = data.copy()
# Expand the number of hours worked per week by 50%
data_comparison['hours-per-week'] = 1.5 * data['hours-per-week']
# Consolidate "Never-married", "Divorced" and "Widowed" into a new "Single" status
data_comparison.loc[data.eval('`marital-status` in ["Never-married", "Divorced", "Widowed"]'), 'marital-status'] = 'Single' 
# Shuffle the data
data_comparison = data_comparison.sample(frac=1).reset_index(drop=True)
data.head()

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 34 | Private | 198693 | 10th | 6 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 30 | United-States | <=50K |


In [5]:
data_comparison.head()

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | Private | 216469 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 1579 | 75.0 | United-States | <=50K |
| 1 | 25 | Private | 99126 | Assoc-voc | 11 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 7688 | 0 | 60.0 | United-States | >50K |
| 2 | 62 | Private | 197918 | HS-grad | 9 | Single | Adm-clerical | Unmarried | White | Female | 0 | 0 | 60.0 | United-States | <=50K |
| 3 | 23 | Private | 97472 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 60.0 | United-States | <=50K |
| 4 | 29 | Private | 251526 | Some-college | 10 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 30.0 | United-States | <=50K |
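The intended differences between the two datasets can be checked directly with pandas before running a scan. The sketch below applies the same two transformations to a four-row stand-in for the Adult data (the toy rows are invented for illustration, and `isin` is used in place of `eval` for the row selection).

```python
import pandas as pd

# Toy stand-in for the Adult data, invented for this sketch.
data = pd.DataFrame({
    "marital-status": ["Never-married", "Married-civ-spouse", "Divorced", "Widowed"],
    "hours-per-week": [40, 50, 30, 99],
})

data_comparison = data.copy()
# Same transformations as in the notebook cell: scale hours, merge statuses.
data_comparison["hours-per-week"] = 1.5 * data["hours-per-week"]
single = data["marital-status"].isin(["Never-married", "Divorced", "Widowed"])
data_comparison.loc[single, "marital-status"] = "Single"

# Categories that disappeared from, or were added to, the comparison dataset.
removed = set(data["marital-status"]) - set(data_comparison["marital-status"])
added = set(data_comparison["marital-status"]) - set(data["marital-status"])

print(sorted(removed))   # ['Divorced', 'Never-married', 'Widowed']
print(sorted(added))     # ['Single']
print(data["hours-per-week"].max(), data_comparison["hours-per-week"].max())  # 99 148.5
```

Three categories disappear, one new category appears, and the maximum number of hours moves outside the original range.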


## Find data issues


In [6]:
with etiq.etiq_config('data-issues-config.json'):
    # Construct etiq datasets for the base and comparison datasets
    base_dataset = SimpleDatasetBuilder.dataset(data)
    comparison_dataset = SimpleDatasetBuilder.dataset(data_comparison)
    # Get/create a single named project
    project = etiq.projects.open(name="Data Issues")
    # Creating a snapshot
    snapshot = project.snapshots.create(name="Data Issues Snapshot",
                                        dataset=base_dataset,
                                        comparison_dataset=comparison_dataset,
                                        model=None)
    # Run data issues scan
    (segments, issues, issue_summary) = snapshot.scan_data_issues()


INFO:etiq.pipeline.DataPipeline0609:Starting pipeline
INFO:etiq.pipeline.DataPipeline0609:Completed pipeline


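The scan finds five issues. Four relate to the "marital-status" feature: three categories present in the exemplar (base) dataset are missing from the comparison dataset, and one new category appears in it. The fifth flags the changed range of the "hours-per-week" feature, where scaling by 1.5 pushes the maximum from 99 to 148.5.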

In [7]:
issues

| | name | feature | segment | measure | measure_value | metric | metric_value | threshold | value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | missing_category | marital-status | all | | | | | (nan, nan) | Divorced |
| 1 | missing_category | marital-status | all | | | | | (nan, nan) | Never-married |
| 2 | missing_category | marital-status | all | | | | | (nan, nan) | Widowed |
| 3 | unrecognized_category | marital-status | all | | | | | (nan, nan) | Single |
| 4 | feature_value_above_maximum | hours-per-week | all | | | | | (1, 99) | 148.5 |
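The output above looks like a pandas DataFrame, so, assuming that `issues` is one, the usual pandas operations can be used to slice it. The sketch below rebuilds the three relevant columns by hand from the rows shown (so that it runs without Etiq) and then counts issues per type and lists the affected values for one feature.

```python
import pandas as pd

# Hand-copied from the output above; in the notebook, use the `issues` object directly.
issues = pd.DataFrame({
    "name": ["missing_category"] * 3
            + ["unrecognized_category", "feature_value_above_maximum"],
    "feature": ["marital-status"] * 4 + ["hours-per-week"],
    "value": ["Divorced", "Never-married", "Widowed", "Single", 148.5],
})

# Number of issues per issue type.
counts = issues.groupby("name").size().to_dict()
print(counts)

# All values flagged for a single feature.
marital_values = issues.loc[issues["feature"] == "marital-status", "value"].tolist()
print(marital_values)
```

The per-type counts should agree with the `issues_found` column of `issue_summary`.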


In [8]:
issue_summary

| | name | metric | measure | features | segments | total_issues_tested | issues_found | threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | identical_feature_values | | | {} | {} | 15 | 0 | (0.0, 1.0) |
| 1 | missing_feature | | | {} | {} | 15 | 0 | (0.0, 1.0) |
| 2 | unrecognized_feature | | | {} | {} | 15 | 0 | (0.0, 1.0) |
| 3 | missing_category | | | {marital-status} | {all} | 15 | 3 | (nan, nan) |
| 4 | unrecognized_category | | | {marital-status} | {all} | 15 | 1 | (nan, nan) |
| 5 | feature_value_below_minimum | | | {} | {} | 15 | 0 | (0.0, 1.0) |
| 6 | feature_value_above_maximum | | | {hours-per-week} | {all} | 15 | 1 | (1, 99) |
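In an automated pipeline, a summary like this can act as a simple gate: list the checks that found issues and decide whether to proceed. The sketch below assumes `issue_summary` is a pandas DataFrame with the `name` and `issues_found` columns shown above. `failing_checks` is a hypothetical helper, and the data is copied by hand from the output so that the example runs without Etiq.

```python
import pandas as pd

# Hand-copied from the output above; in the notebook, use `issue_summary` directly.
issue_summary = pd.DataFrame({
    "name": [
        "identical_feature_values", "missing_feature", "unrecognized_feature",
        "missing_category", "unrecognized_category",
        "feature_value_below_minimum", "feature_value_above_maximum",
    ],
    "issues_found": [0, 0, 0, 3, 1, 0, 1],
})


def failing_checks(summary: pd.DataFrame, ignore=()) -> dict:
    """Map each check that found issues to its issue count, skipping ignored checks."""
    mask = (summary["issues_found"] > 0) & ~summary["name"].isin(list(ignore))
    failed = summary[mask]
    return dict(zip(failed["name"].tolist(), failed["issues_found"].tolist()))


failed = failing_checks(issue_summary)
if failed:
    print(f"Data issues found: {failed}")
else:
    print("No data issues found")
```

The `ignore` argument allows known, accepted differences (for example an expected new category) to pass without failing the gate.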
