# Lesson Overview 

1. Review of TF Data Validation (TFDV) & Installation notes

2. Dataset Review, compute & visualize statistics

3. Generate Dataset Statistics

4. Infer Schema

5. Anomalies & Skew

6. Freeze Schema



## Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TFDV provides:

+ Scalable computation of summary statistics for train/test/validation data. Whereas `scikit-learn` is limited to datasets that fit into RAM, this is not a concern for TFDV.
+ A viewer for data distributions and statistics (integration with [Facets](https://pair-code.github.io/facets/))
+ Automatic schema inference
+ Schema generation that describes expectations about the data (such as required values, ranges, and vocabularies)
+ A schema viewer to help you inspect the schema
+ Anomaly detection to identify anomalies (missing features, out-of-range values, or wrong feature types)
+ An anomalies viewer to see which features have anomalies
+ Validation of new data at inference time, to ensure no bad features are processed
+ Validation that new data seen during inference falls within the part of the decision surface the model was trained on
+ Validation of data after it has been transformed by TF Transform, to ensure nothing unexpected has happened to it
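
To make the schema and anomaly ideas above concrete, here is a minimal sketch in plain Python. It is not TFDV's implementation or API: the helpers `infer_schema` and `find_anomalies` and the sample rows are hypothetical, and the "schema" here is only a per-feature type plus the observed value range.

```python
# Toy illustration of schema inference and anomaly detection.
# NOT the TFDV API: infer_schema and find_anomalies are hypothetical helpers.

def infer_schema(rows):
    """Record each feature's type and observed [min, max] from training rows."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            entry = schema.setdefault(
                name, {"type": type(value), "min": value, "max": value})
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)
    return schema


def find_anomalies(rows, schema):
    """Return (row_index, feature, reason) for every schema violation."""
    anomalies = []
    for i, row in enumerate(rows):
        for name, spec in schema.items():
            if name not in row:
                anomalies.append((i, name, "missing feature"))
            elif not isinstance(row[name], spec["type"]):
                anomalies.append((i, name, "wrong type"))
            elif not spec["min"] <= row[name] <= spec["max"]:
                anomalies.append((i, name, "out of range"))
    return anomalies


# Made-up rows, for illustration only.
train_rows = [
    {"passenger_count": 1, "trip_distance": 9.78},
    {"passenger_count": 2, "trip_distance": 1.34},
]
serving_rows = [
    {"passenger_count": 1, "trip_distance": 5.0},    # fine
    {"passenger_count": 9, "trip_distance": 2.0},    # count outside seen range
    {"passenger_count": "2", "trip_distance": 2.0},  # wrong type
    {"trip_distance": 3.0},                          # missing feature
]

schema = infer_schema(train_rows)
anomalies = find_anomalies(serving_rows, schema)
for anomaly in anomalies:
    print(anomaly)
```

A real schema carries more than a type and a range (for example vocabularies and presence constraints), which is why the lesson relies on TFDV rather than hand-written checks like these.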

## Dataset

The dataset we will be using throughout this session is the New York Yellow Cab dataset, available via [BigQuery public datasets](https://console.cloud.google.com/marketplace/details/city-of-new-york/nyc-tlc-trips?filter=solution-type:dataset&filter=category:encyclopedic).

Here is an example of how to extract the data:
```sql
SELECT vendor_id,
       EXTRACT(MONTH FROM pickup_datetime) AS pickup_month,
       EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
       EXTRACT(DAYOFWEEK FROM pickup_datetime) AS pickup_day_of_week, 
       EXTRACT(MONTH FROM dropoff_datetime) AS dropoff_month,
       EXTRACT(HOUR FROM dropoff_datetime) AS dropoff_hour,
       EXTRACT(DAYOFWEEK FROM dropoff_datetime) AS dropoff_day_of_week,
       passenger_count,
       store_and_fwd_flag, 
       trip_distance,
       fare_amount,
       tip_amount,
       payment_type,
       trip_type
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2018`  

```
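
For readers less familiar with BigQuery's `EXTRACT`, the sketch below reproduces the three derived time features in Python. The function name `time_features` and the sample timestamp are hypothetical. BigQuery's `DAYOFWEEK` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's `isoweekday()`, so the conversion is spelled out.

```python
from datetime import datetime


def time_features(ts):
    """Mirror EXTRACT(MONTH/HOUR/DAYOFWEEK FROM ts) for a datetime.

    BigQuery DAYOFWEEK: 1 = Sunday ... 7 = Saturday.
    Python isoweekday(): 1 = Monday ... 7 = Sunday.
    """
    return {
        "month": ts.month,
        "hour": ts.hour,
        "day_of_week": ts.isoweekday() % 7 + 1,
    }


# Hypothetical pickup time: Monday, 1 January 2018, shortly after midnight.
features = time_features(datetime(2018, 1, 1, 0, 21, 5))
print(features)
```

For this timestamp the month is 1, the hour is 0, and the day of week is 2 (Monday).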

## Third party packages already installed!

Third-party dependencies are listed in `requirements.txt` and have already been installed. The specific versions matter; see the [TFX compatible versions table](https://github.com/tensorflow/tfx#compatible-versions) for more information.

In [1]:
%%bash
pip install tensorflow==1.14.0
pip install tfx==0.14.0rc1
pip install apache-beam==2.14.0 
pip install tensorflow-data-validation==0.14.1
pip install tensorflow-metadata==0.14.0
pip install tensorflow-model-analysis==0.14.0
pip install tensorflow-transform==0.14.0


Collecting tfx==0.14.0rc1
  Downloading https://files.pythonhosted.org/packages/33/f0/dec724b6bf3bbcb657f153a9f6279d14c4917c454d9b75f9ea10b7785015/tfx-0.14.0rc1-py2-none-any.whl (383kB)
Collecting tensorflow-data-validation<0.15,>=0.14.1 (from tfx==0.14.0rc1)
  Downloading https://files.pythonhosted.org/packages/35/bf/b5ce7a4ab497f2fe9e5e379eee2c9044f2cd7de3f53e8c29d5cc5c4ae86b/tensorflow_data_validation-0.14.1-cp27-cp27mu-manylinux2010_x86_64.whl (2.4MB)
Collecting ml-metadata<0.15,>=0.14 (from tfx==0.14.0rc1)
  Downloading https://files.pythonhosted.org/packages/0f/8f/dd2ff819eead90569c45a4f26c910ac70095590b71d6c602f3fc7caa07fb/ml_metadata-0.14.0-cp27-cp27mu-manylinux2010_x86_64.whl (4.8MB)
Collecting tensorflow-transform<0.15,>=0.14 (from tfx==0.14.0rc1)
  Downloading https://files.pythonhosted.org/packages/44/84/c8770b330a3fbe4e6a727e3e922a04d3a755a79870e4ee090b959cb01983/tensorflow-transform-0.14.0.tar.gz (221kB)
Collecting apache-beam[gcp]<3,>=2.14 (from tfx==0.14.0rc1)
  Downloa

ERROR: fastai 0.7.0 has requirement torch<0.4, but you'll have torch 1.1.0 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.
ERROR: multiprocess 0.70.8 has requirement dill>=0.3.0, but you'll have dill 0.2.9 which is incompatible.
ERROR: google-cloud-storage 1.16.1 has requirement google-cloud-core<2.0dev,>=1.0.0, but you'll have google-cloud-core 0.29.1 which is incompatible.
ERROR: google-cloud-translate 1.5.0 has requirement google-cloud-core<2.0dev,>=1.0.0, but you'll have google-cloud-core 0.29.1 which is incompatible.


## Load necessary packages

In [2]:
# __future__ imports must come before any other statement.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import warnings
warnings.filterwarnings("ignore")

import tempfile
import urllib
import tfx
from tfx.components.evaluator.component import Evaluator
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.components.example_validator.component import ExampleValidator
from tfx.components.model_validator.component import ModelValidator
from tfx.components.pusher.component import Pusher
from tfx.components.schema_gen.component import SchemaGen
from tfx.components.statistics_gen.component import StatisticsGen
from tfx.components.trainer.component import Trainer
from tfx.components.transform.component import Transform
from tfx.orchestration.interactive.interactive_context import InteractiveContext
from tfx.proto import evaluator_pb2
from tfx.proto import pusher_pb2
from tfx.proto import trainer_pb2
from tfx.proto.evaluator_pb2 import SingleSlicingSpec
from tfx.utils.dsl_utils import csv_input

import os
import shutil
import argparse
import tensorflow as tf
import apache_beam as beam  
import tensorflow_transform as tft
import tensorflow_model_analysis as tfma
import tensorflow_data_validation as tfdv
import tensorflow_transform.beam as tft_beam
from google.protobuf import text_format 
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2
from tensorflow_transform import coders as tft_coders
from tensorflow_transform.tf_metadata import metadata_io
from tensorflow_transform.saved import saved_transform_io
from tensorflow_transform.beam.tft_beam_io import transform_fn_io
from tensorflow_transform.coders import example_proto_coder
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import schema_utils

print('TFDV version: {}'.format(tfdv.version.__version__))
print('TF version: {}'.format(tf.VERSION))

W0911 03:32:03.409130 139853902325632 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/tfx/components/transform/executor.py:57: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

W0911 03:32:03.410499 139853902325632 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tfx/components/transform/executor.py:57: from_feature_spec (from tensorflow_transform.tf_metadata.dataset_schema) is deprecated and will be removed in a future version.
Instructions for updating:
from_feature_spec is a deprecated, use schema_utils.schema_from_feature_spec


TFDV version: 0.14.1
TF version: 1.14.0


## Download Data

The data we will be using for this lesson is available via Google Cloud Storage. Here's how to download it to your Colab instance. 

In [3]:
%%bash
mkdir -p /tmp/data/
mkdir -p /tmp/data/train
mkdir -p /tmp/data/serving
mkdir -p /tmp/data/eval
mkdir -p /tmp/data/module

wget -P /tmp/data/train/ https://storage.googleapis.com/tfx_course/data/train/train.csv 
wget -P /tmp/data/serving/ https://storage.googleapis.com/tfx_course/data/serving/serving.csv 
wget -P /tmp/data/eval/ https://storage.googleapis.com/tfx_course/data/eval/eval.csv 
wget -P /tmp/data/module https://storage.googleapis.com/tfx_course/taxi_utils.py

--2019-09-11 03:32:03--  https://storage.googleapis.com/tfx_course/data/train/train.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.142.128, 2607:f8b0:400e:c08::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.142.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 273804 (267K) [application/octet-stream]
Saving to: ‘/tmp/data/train/train.csv’

     0K .......... .......... .......... .......... .......... 18% 42.9M 0s
    50K .......... .......... .......... .......... .......... 37% 71.8M 0s
   100K .......... .......... .......... .......... .......... 56% 74.9M 0s
   150K .......... .......... .......... .......... .......... 74% 77.1M 0s
   200K .......... .......... .......... .......... .......... 93% 75.5M 0s
   250K .......... .......                                    100% 71.7M=0.004s

2019-09-11 03:32:03 (65.5 MB/s) - ‘/tmp/data/train/train.csv’ saved [273804/273804]

--2019-09-11 03:32:03--  https://s

## Define lesson-wide variables

In [4]:
BASE_DIR = os.getcwd()
# The data lives under an absolute path, so it does not depend on BASE_DIR.
DATA_DIR = '/tmp/data'
OUTPUT_DIR = BASE_DIR

# directories containing the train, eval, and serving data
TRAIN_DATA_DIR = os.path.join(DATA_DIR, 'train')
EVAL_DATA_DIR = os.path.join(DATA_DIR, 'eval')
SERVING_DATA_DIR = os.path.join(DATA_DIR, 'serving')

TRAIN_DATA = os.path.join(TRAIN_DATA_DIR, 'train.csv')
EVAL_DATA = os.path.join(EVAL_DATA_DIR, 'eval.csv')
SERVING_DATA = os.path.join(SERVING_DATA_DIR, 'serving.csv')

TF_OUTPUT_BASE_DIR = os.path.join(OUTPUT_DIR, 'tf')
# taxi_utils.py was downloaded to /tmp/data/module in the previous step
_taxi_module_file = os.path.join(DATA_DIR, 'module', 'taxi_utils.py')

context = InteractiveContext()
print(context.metadata_connection_config)

sqlite {
  filename_uri: "/tmp/tfx-interactive-2019-09-11T03_32_04.659029-usNeKj/metadata.sqlite"
  connection_mode: READWRITE_OPENCREATE
}
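
The data directories defined above are absolute paths under `/tmp/data`. It is worth remembering that `os.path.join` drops every component that precedes an absolute one, so a join involving `/tmp/data` never ends up under the notebook's working directory. A short sketch, using `posixpath` so that it behaves the same on any platform; the `/content` working directory is only an assumed example:

```python
import posixpath  # POSIX flavour of os.path, so results don't depend on the OS

base_dir = "/content"  # assumed working directory, for illustration only

# An absolute later component discards everything before it.
data_dir = posixpath.join(base_dir, "/tmp/data")
print(data_dir)

# Relative components are appended as usual.
train_csv = posixpath.join(data_dir, "train", "train.csv")
print(train_csv)

# Without the leading slash the result would live under base_dir instead.
relative_dir = posixpath.join(base_dir, "tmp/data")
print(relative_dir)
```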



## Remove output from previous runs

In [0]:
shutil.rmtree(TF_OUTPUT_BASE_DIR, ignore_errors=True)

## Preview dataset

Let's peek at the data. This is a tabular dataset of taxi cab rides in New York City during 2018.

In [6]:
! head -n 5 /tmp/data/train/train.csv

vendor_id,pickup_month,pickup_hour,pickup_day_of_week,dropoff_month,dropoff_hour,dropoff_day_of_week,passenger_count,trip_distance,fare_amount,tip_amount,payment_type,trip_type
2,1,0,2,1,1,2,1,9.78,35,0,2,1
2,1,0,2,1,0,2,2,1.34,26,0,1,1
2,1,1,2,1,2,2,1,10.34,34.5,7.16,1,1
2,1,1,2,1,2,2,1,9.79,30.5,0,2,1


## Read in Data using Pandas

In [7]:
import pandas as pd 
data_train = pd.read_csv(os.path.join(TRAIN_DATA_DIR, 'train.csv'))
data_train.head(5)                                                  

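|   | vendor_id | pickup_month | pickup_hour | pickup_day_of_week | dropoff_month | dropoff_hour | dropoff_day_of_week | passenger_count | trip_distance | fare_amount | tip_amount | payment_type | trip_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 2 | 1 | 1 | 2 | 1 | 9.78 | 35.0 | 0.0 | 2 | 1 |
| 1 | 2 | 1 | 0 | 2 | 1 | 0 | 2 | 2 | 1.34 | 26.0 | 0.0 | 1 | 1 |
| 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 10.34 | 34.5 | 7.16 | 1 | 1 |
| 3 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 9.79 | 30.5 | 0.0 | 2 | 1 |
| 4 | 2 | 1 | 2 | 2 | 1 | 3 | 2 | 1 | 10.99 | 35.0 | 9.08 | 1 | 1 |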


## Compute Summary Statistics

In [8]:
data_train.describe()

|   | vendor_id | pickup_month | pickup_hour | pickup_day_of_week | dropoff_month | dropoff_hour | dropoff_day_of_week | passenger_count | trip_distance | fare_amount | tip_amount | payment_type | trip_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 | 7999.0 |
| mean | 1.846856 | 1.715214 | 12.830479 | 4.178522 | 1.715464 | 13.044131 | 4.166521 | 1.373922 | 9.350869 | 33.173018 | 2.395086 | 1.356045 | 1.054882 |
| std | 0.360149 | 0.715741 | 5.777055 | 1.949893 | 0.716015 | 5.900859 | 1.955372 | 1.043119 | 4.955842 | 27.704877 | 3.815409 | 0.581879 | 0.227764 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | -100.0 | -0.8 | 1.0 | 1.0 |
| 25% | 2.0 | 1.0 | 9.0 | 2.0 | 1.0 | 9.0 | 2.0 | 1.0 | 6.91 | 26.5 | 0.0 | 1.0 | 1.0 |
| 50% | 2.0 | 2.0 | 13.0 | 4.0 | 2.0 | 13.0 | 4.0 | 1.0 | 8.65 | 30.5 | 0.0 | 1.0 | 1.0 |
| 75% | 2.0 | 2.0 | 17.0 | 6.0 | 2.0 | 18.0 | 6.0 | 1.0 | 11.25 | 38.0 | 5.15 | 2.0 | 1.0 |
| max | 2.0 | 3.0 | 23.0 | 7.0 | 3.0 | 23.0 | 7.0 | 9.0 | 101.87 | 2126.0 | 63.0 | 5.0 | 2.0 |
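
The summary already hints at data-quality problems: `fare_amount` ranges from -100 to 2126, `tip_amount` goes negative, and some trips report zero passengers. A quick manual check can count such rows before any tooling is involved. The sketch below is one way to do it with the standard library; the bounds in `EXPECTED_RANGES` are assumptions chosen for illustration (not values from the lesson), and `SAMPLE_CSV` is a small stand-in for `train.csv` that mixes two preview rows with two made-up bad rows.

```python
import csv

# Assumed plausibility bounds, for illustration only (not from the lesson).
EXPECTED_RANGES = {
    "passenger_count": (1, 9),
    "fare_amount": (0.0, 500.0),
    "tip_amount": (0.0, 100.0),
}

# Stand-in for train.csv: two rows from the preview plus two made-up bad rows.
SAMPLE_CSV = """passenger_count,trip_distance,fare_amount,tip_amount
1,9.78,35,0
2,1.34,26,0
0,3.10,12.5,0
1,8.65,-100,-0.8
"""


def count_out_of_range(lines, expected_ranges):
    """Count, per feature, the rows whose value falls outside [low, high]."""
    counts = {name: 0 for name in expected_ranges}
    for row in csv.DictReader(lines):
        for name, (low, high) in expected_ranges.items():
            if not low <= float(row[name]) <= high:
                counts[name] += 1
    return counts


violations = count_out_of_range(SAMPLE_CSV.splitlines(), EXPECTED_RANGES)
print(violations)
```

On the sample, each of the three checked features has exactly one violating row. Hand-picked bounds like these do not scale to many features, which is the gap that schema inference and anomaly detection are meant to fill.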


## ExampleGen TFX Pipeline Component

This component ingests data into TFX pipelines.


*   Input: data from CSV files, TFRecord files, or BigQuery

*   Output: `tf.Example` records


We’ll use `CsvExampleGen` to convert the CSV file into `tf.Example` records.
The currently supported sources are:

*   CSV files
*   TFRecord files with TF Example data format
*   BigQuery queries

For BigQuery-based ExampleGen, see the [query-based ExampleGen guide](https://github.com/tensorflow/tfx/blob/master/docs/guide/examplegen.md#query-based-examplegen) for more details.

In [9]:
# ingest CSV data
examples = csv_input(TRAIN_DATA_DIR)
example_gen = CsvExampleGen(input_base=examples)
context.run(example_gen)

W0911 03:32:06.289539 139853902325632 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/tfx/orchestration/component_launcher.py:87: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W0911 03:32:06.356410 139853902325632 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/tfx/components/base/base_driver.py:44: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

W0911 03:32:06.397552 139853902325632 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/tfx/components/example_gen/csv_example_gen/executor.py:83: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

W0911 03:32:10.374464 139853902325632 tfrecordio.py:57] Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.


.execution_id: 1
.component: CsvExampleGen
.component.inputs['input_base']: Channel of type u'ExternalPath' (1 artifact)
    [0] Artifact of type u'ExternalPath'
        .uri: /tmp/data/train
        .span: 0
        .split:
.component.outputs['examples']: Channel of type u'ExamplesPath' (2 artifacts)
    [0] Artifact of type u'ExamplesPath'
        .uri: /tmp/tfx-interactive-2019-09-11T03_32_04.659029-usNeKj/CsvExampleGen/examples/1/train/
        .span: 0
        .split: train
    [1] Artifact of type u'ExamplesPath'
        .uri: /tmp/tfx-interactive-2019-09-11T03_32_04.659029-usNeKj/CsvExampleGen/examples/1/eval/
        .span: 0
        .split: eval
.component.exec_properties
    ['custom_config']: None
    ['input_config']: { "splits": [ { "name": "single_split", "pattern": "*" } ] }
    ['output_config']: { "splitConfig": { "splits": [ { "hashBuckets": 2, "name": "train" }, { "hashBuckets": 1, "name": "eval" } ] } }


## Compute statistics

TFDV can compute descriptive statistics that provide an overview of the data in terms of the features that are present and the shapes of their distributions. The statistics are computed over training and serving data.


We'll be using `StatisticsGen` to compute statistics for our training data. It can scale to large datasets using Apache Beam.

*   Input: Datasets produced by ExampleGen component

*   Output: Dataset stats

Here’s how you can use `StatisticsGen`:

```
# train_stats = tfdv.generate_statistics_from_csv(data_location = TRAIN_DATA)
statistics_gen = StatisticsGen(
    input_data=example_gen.outputs['examples'])
context.run(statistics_gen)
```

Execution 2 ran the `StatisticsGen` component with empty `exec_properties` (`{}`). Its channels were:

+ `inputs['input_data']`: channel of type `ExamplesPath` with 2 artifacts, both with span 0
  + train: `/tmp/tfx-interactive-2019-09-11T03_32_04.659029-usNeKj/CsvExampleGen/examples/1/train/`
  + eval: `/tmp/tfx-interactive-2019-09-11T03_32_04.659029-usNeKj/CsvExampleGen/examples/1/eval/`
+ `outputs['output']`: channel of type `ExampleStatisticsPath` with 2 artifacts, both with span 0
  + train: `/tmp/tfx-interactive-2019-09-11T03_32_04.659029-usNeKj/StatisticsGen/output/2/train/`
  + eval: `/tmp/tfx-interactive-2019-09-11T03_32_04.659029-usNeKj/StatisticsGen/output/2/eval/`

TFDV is able to scale to datasets which don't fit in RAM since it uses [Apache Beam's](https://beam.apache.org/releases/pydoc/2.9.0/) data-parallel processing framework to scale the computation of statistics. The API also exposes a Beam PTransform for statistics generation.
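To make the scaling idea concrete, here is a minimal pure-Python sketch of a mergeable summary. It is not TFDV's or Beam's actual implementation; the `NumericStats` class, the `summarize` helper, and the sample values are all hypothetical. It only illustrates why statistics such as count, mean, min, and max can be computed on separate shards and then combined without ever holding the full dataset in memory.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass
class NumericStats:
    """Mergeable summary of one numeric feature (hypothetical helper)."""
    count: int = 0
    missing: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: Optional[float]) -> None:
        """Fold a single value into the summary."""
        if value is None:
            self.missing += 1
            return
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "NumericStats") -> "NumericStats":
        """Combine two partial summaries computed on different shards."""
        return NumericStats(
            count=self.count + other.count,
            missing=self.missing + other.missing,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def summarize(shard: Iterable[Optional[float]]) -> NumericStats:
    """Summarize one shard in a single pass."""
    stats = NumericStats()
    for value in shard:
        stats.add(value)
    return stats


# Made-up trip_distance values split across three shards.
shards = [[1.2, 0.0, 3.4], [None, 2.0], [5.5, 0.7, None]]
partials = [summarize(shard) for shard in shards]  # each could run on its own worker
combined = reduce(NumericStats.merge, partials)
print(combined.count, combined.missing, combined.minimum, combined.maximum)
```

Because `merge` is associative, the partial summaries can be combined in any grouping, which is the property a data-parallel framework relies on when it distributes the work.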

## Visualize statistics

`tfdv.visualize_statistics` uses [Facets](https://pair-code.github.io/facets/) to create a visualization of our training data.

+ If you have both numeric and categorical features, they are visualized separately. Each chart shows the distribution of a single feature.

+ Features with missing or zero values display a percentage in red to indicate that there may be issues with examples in those features. The percentage is the share of examples that have missing or zero values for that feature. For example, `tip_amount` has a value of zero for 63% of the rows.

+ Try clicking "expand" above the charts to change the display.

+ Try hovering over bars in the charts to display bucket ranges and counts.

+ Try switching between the log and linear scales.

+ Try selecting "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages.
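The red percentages described above can be approximated by hand. The sketch below is a rough illustration in plain Python, not the code Facets runs; the `zero_or_missing_pct` function and the four sample rows are assumptions made up for this example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def zero_or_missing_pct(rows: List[Row], feature: str) -> Dict[str, float]:
    """Return the percentage of rows where `feature` is missing or zero."""
    total = len(rows)
    if total == 0:
        return {"missing": 0.0, "zeros": 0.0}
    missing = sum(1 for row in rows if row.get(feature) is None)
    zeros = sum(1 for row in rows if row.get(feature) == 0)
    return {"missing": 100.0 * missing / total, "zeros": 100.0 * zeros / total}


# Made-up rows; the real dataset has more columns and far more examples.
rows = [
    {"tip_amount": 0.0, "fare_amount": 7.5},
    {"tip_amount": 2.0, "fare_amount": 12.0},
    {"tip_amount": 0.0, "fare_amount": 5.0},
    {"tip_amount": None, "fare_amount": 9.0},
]
tip_summary = zero_or_missing_pct(rows, "tip_amount")
fare_summary = zero_or_missing_pct(rows, "fare_amount")
print(tip_summary)
print(fare_summary)
```

A high zero percentage is not necessarily an error (many riders simply do not tip), so the number is a prompt to look closer rather than a verdict.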

```
# Import TFDV and get the path to the train statistics
import os

import tensorflow_data_validation as tfdv
from tfx.types.artifact_utils import get_split_uri

artifact_list = statistics_gen.outputs['output'].get()
train_artifact_uri = get_split_uri(artifact_list, 'train')
train_stats_path = os.path.join(train_artifact_uri, 'stats_tfrecord')
```

```
# Load the statistics and visualize them
train_stats = tfdv.load_statistics(train_stats_path)
tfdv.visualize_statistics(train_stats)
```



## Infer Schema

For machine learning projects with structured data, we must understand the semantic meaning of each column, it's provenance, and the type/range of values. We can use `SchemaGen` to create a schema for our data. Manually inferring a schema can be a lengthy & error prone task, especially for datasets with large number of features.

It's really important to ensure the schema has been correctly generated as this will be used by the machine learning pipeline both during model training & inference. The schema also serves as documentation for the data, which can be useful for other data scientists, business analysts and/or developers on a project. Let's use `tfdv.display_schema` to display the inferred schema so that we can review it.

`SchemaGen` produces an automatically inferred schema proto (written as `schema.pbtxt`) that records the data type of each feature, whether the feature must be present, and its value ranges.


*   Input: Statistics from a StatisticsGen component

*   Output: Data schema proto

Here’s how you can call it...


In [13]:
infer_schema = SchemaGen(
    stats=statistics_gen.outputs['output'],
    infer_feature_shape=False)
context.run(infer_schema)

# get schema path
schema_dir = infer_schema.outputs['output'].get()[0].uri
schema_path = os.path.join(schema_dir, 'schema.pbtxt')

schema = tfdv.load_schema_text(schema_path)
tfdv.display_schema(schema)




| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'payment_type' | INT | required | single | - |
| 'trip_distance' | FLOAT | required | single | - |
| 'pickup_month' | INT | required | single | - |
| 'pickup_day_of_week' | INT | required | single | - |
| 'dropoff_month' | INT | required | single | - |
| 'trip_type' | INT | required | single | - |
| 'tip_amount' | FLOAT | required | single | - |
| 'pickup_hour' | INT | required | single | - |
| 'vendor_id' | INT | required | single | - |
| 'fare_amount' | FLOAT | required | single | - |
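To build intuition for what schema inference does, here is a minimal pure-Python sketch that derives a type and a presence for each column from a handful of rows. It is an illustration under simplifying assumptions, not TFDV's algorithm: `infer_feature` and `infer_schema_sketch` are hypothetical helpers, the rows are made up, and TFDV infers from precomputed statistics rather than raw rows.

```python
def infer_feature(values):
    """Infer a coarse type and presence for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        ftype = "INT"
    elif all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        ftype = "FLOAT"
    else:
        ftype = "STRING"
    # A feature is 'required' only if every row carries a value for it.
    presence = "required" if len(present) == len(values) else "optional"
    return {"type": ftype, "presence": presence}


def infer_schema_sketch(rows):
    """Infer a per-feature description from a list of dict rows."""
    names = sorted({name for row in rows for name in row})
    return {name: infer_feature([row.get(name) for row in rows]) for name in names}


# Made-up rows; the last one has no tip_amount.
rows = [
    {"vendor_id": 1, "trip_distance": 2.5, "tip_amount": 0.0},
    {"vendor_id": 2, "trip_distance": 0.8, "tip_amount": 1.5},
    {"vendor_id": 2, "trip_distance": 5.1},
]
schema_sketch = infer_schema_sketch(rows)
for name, description in schema_sketch.items():
    print(name, description)
```

Because the inferred schema only reflects the data it was given, it should be reviewed by someone who knows the dataset before it is relied on downstream.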


## Train vs Evaluation Data Validation

For supervised machine learning with structured data, it's critical that we...

+ Ensure the distribution (range of values) of the training data matches that of the evaluation set. Otherwise, it's likely that what the model learns using the training data wouldn't generalize to new data during inference.

+ Ensure that the train, test, and validation data, as well as new data seen during inference, all conform to the same schema

+ Ensure that we reduce training-serving skew, which is the difference between model performance during training and performance during serving. This skew can be caused by:

  + A discrepancy between how you handle data in the training and serving pipelines.

  + A change in the data between when you train and when you serve.

  + A feedback loop between your model and your algorithm.
  
TFDV can help us with a majority of these scenarios.

In [14]:
eval_stats = tfdv.generate_statistics_from_csv(data_location = EVAL_DATA)




In [15]:
# compare stats of train vs eval data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

A few things to keep in mind...

• Notice that each feature now includes statistics for both the training and evaluation datasets.

• Notice that the charts now have both the training and evaluation datasets overlaid, making it easy to compare them.

• Notice that the charts now include a percentages view, which can be combined with log or the default linear scales.

• The distribution of `trip_distance` differs between the training and evaluation sets. Is this an issue? What problems could it cause?

## Check for Train vs Evaluation set Anomalies

There is one important question to ask before we continue: does our evaluation dataset match the schema from our training dataset? You will need to be careful with categorical features, as there may be values present in the training data which aren't in the evaluation set, or vice versa.

Let's think about the following scenarios...

1) What would happen if you tried to evaluate using data with categorical feature values that were not in our training dataset? 

2) What about numeric features that are outside the ranges in our training dataset?
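Both scenarios can be checked mechanically. The sketch below is a simplified illustration and not TFDV's implementation: `find_anomalies` is a hypothetical helper, the rows are made up, and it treats the observed training values and the training min/max as the expected domain and range.

```python
def find_anomalies(train_rows, eval_rows, categorical, numeric):
    """Compare eval rows against expectations derived from the training rows."""
    anomalies = []
    for name in categorical:
        seen = {row[name] for row in train_rows}
        unseen = sorted({row[name] for row in eval_rows} - seen)
        if unseen:
            anomalies.append((name, "unexpected values", unseen))
    for name in numeric:
        low = min(row[name] for row in train_rows)
        high = max(row[name] for row in train_rows)
        outside = sorted({row[name] for row in eval_rows
                          if not low <= row[name] <= high})
        if outside:
            anomalies.append((name, "out of range", outside))
    return anomalies


# Made-up data: eval contains a payment_type and a trip_distance never seen in training.
train_rows = [
    {"payment_type": 1, "trip_distance": 0.5},
    {"payment_type": 2, "trip_distance": 12.0},
    {"payment_type": 1, "trip_distance": 3.2},
]
eval_rows = [
    {"payment_type": 2, "trip_distance": 4.0},
    {"payment_type": 3, "trip_distance": 30.0},
]
for anomaly in find_anomalies(train_rows, eval_rows,
                              categorical=["payment_type"],
                              numeric=["trip_distance"]):
    print(anomaly)
```

In a real schema these expectations would be stated explicitly as domains on the relevant features rather than recomputed from the training rows each time.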

In [16]:
# perform anomaly detection based on statistics and data schema
validate_stats = ExampleValidator(
    stats=statistics_gen.outputs['output'],
    schema=infer_schema.outputs['output'])
context.run(validate_stats)

validation_dir = validate_stats.outputs['output'].get()[0].uri
anomalies_path = os.path.join(validation_dir, 'anomalies.pbtxt')

# visualize the anomalies.
anomalies = tfdv.load_anomalies_text(anomalies_path)
tfdv.display_anomalies(anomalies)

## Fix Data Anomalies in the Schema

There are various reasons why data anomalies exist. Often, there is an issue in the data collection or in the pipeline which feeds data downstream, so you'll want to investigate and fix any underlying data issues in upstream processes before you continue.

Another common anomaly occurs when a categorical value appears in the evaluation set but not in the training set from which the schema was inferred. If the new value is legitimate, add it to the feature's domain with:

`tfdv.get_domain(schema, feature_name).value.append('new_unique_value')`.

While we can't fix all the anomalies, we should fix the issues we are not comfortable accepting.

In [17]:
# update the schema based on the observed anomalies
vendor_id = tfdv.get_feature(schema, 'vendor_id')
# we want feature vendor_id to be populated in at least 50% of the examples
vendor_id.presence.min_fraction = 0.5

# validate eval stats after updating the schema 
updated_anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(updated_anomalies)

We are confident that the training and evaluation data are now consistent!
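The `presence.min_fraction` setting used in the cell above expresses a simple rule: the feature must be populated in at least that fraction of examples. A minimal sketch of the rule follows; `presence_fraction` and `satisfies_min_fraction` are hypothetical helpers and the values are made up.

```python
def presence_fraction(values):
    """Fraction of examples in which the feature has a value (None means missing)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def satisfies_min_fraction(values, min_fraction):
    """True if the feature is populated often enough to satisfy the schema."""
    return presence_fraction(values) >= min_fraction


# Made-up column: vendor_id is present in 5 of 8 examples.
vendor_ids = [1, 2, None, 1, None, 2, 2, None]
print(presence_fraction(vendor_ids))
print(satisfies_min_fraction(vendor_ids, 1.0))  # strict: every example must have it
print(satisfies_min_fraction(vendor_ids, 0.5))  # relaxed: half the examples suffice
```

Relaxing a constraint silences the anomaly but does not fix the data, so it should be a deliberate decision rather than a default.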

## Schema Environments

For this training session, we will need to create a `serving` dataset. Typically, all datasets in a pipeline should use the same schema; however, there are some notable exceptions. For instance, in supervised learning we need to include labels in our dataset, but when we serve the model for inference the labels will not be included. For this reason, we need to make a slight schema variation.

We can use `Environments` to maintain slightly different schema definitions for each use case (training, model validation, inference). Specifically, we can use `in_environment` and `not_in_environment` to indicate which environments each feature in the schema belongs to or is excluded from.

For example, in our dataset the `fare_amount` feature is included as the label for training, but it's missing in the serving data. Without an environment specified, it will show up as an anomaly.

In [18]:
serving_stats = tfdv.generate_statistics_from_csv(data_location = SERVING_DATA)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'fare_amount' | Column dropped | Column is completely missing |



Now we just have the `fare_amount` feature (which is our label) showing up as an anomaly ('Column dropped'). Of course we don't expect to have labels in our serving data, so let's tell TFDV to ignore that.

In [19]:
# by default, all features are in the TRAINING, EVAL and SERVING environments
schema.default_environment.append('TRAINING')
schema.default_environment.append('EVAL')
schema.default_environment.append('SERVING')

# indicate that 'fare_amount' feature is not in SERVING environment.
tfdv.get_feature(schema, 'fare_amount').not_in_environment.append('SERVING')

serving_anomalies_with_env = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')

tfdv.display_anomalies(serving_anomalies_with_env)
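The effect of environments can be pictured as a filter on which features are expected in a given dataset. The sketch below models that idea with plain dictionaries; it is an assumption-laden illustration, not TFDV's implementation, and `expected_features` and `missing_columns` are hypothetical helpers.

```python
def expected_features(schema, environment):
    """Names of the features the schema expects to see in the given environment."""
    default_envs = set(schema["default_environment"])
    expected = []
    for name, feature in schema["features"].items():
        # Start from the defaults, add explicit inclusions, then apply exclusions.
        envs = default_envs | set(feature.get("in_environment", []))
        envs -= set(feature.get("not_in_environment", []))
        if environment in envs:
            expected.append(name)
    return expected


def missing_columns(schema, observed_columns, environment):
    """Expected features that are absent from the observed dataset."""
    return [name for name in expected_features(schema, environment)
            if name not in observed_columns]


# Dictionary stand-in for the schema: the label is excluded from SERVING.
schema_sketch = {
    "default_environment": ["TRAINING", "EVAL", "SERVING"],
    "features": {
        "fare_amount": {"not_in_environment": ["SERVING"]},
        "tip_amount": {},
        "trip_distance": {},
    },
}
serving_columns = {"tip_amount", "trip_distance"}
print(missing_columns(schema_sketch, serving_columns, "SERVING"))
print(missing_columns(schema_sketch, serving_columns, "TRAINING"))
```

Validating the serving data in the `SERVING` environment reports nothing missing, while validating the same data as if it were training data flags the absent label.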

## Check for Skew

In addition to checking whether a dataset conforms to the expectations set in the schema, TFDV also provides functionalities to detect drift and skew. TFDV performs this check by comparing the statistics of the different datasets based on the drift/skew comparators specified in the schema.

TFDV can detect three different kinds of skew in your data - schema skew, feature skew, and distribution skew.

**1) Schema Skew** 

We saw that the schema for training and serving is expected to differ slightly: the label feature is present in the training data but not in the serving data. This should be specified through the environment fields in the schema.

**2) Feature Skew** 

Feature skew occurs when the feature values that a model trains on are different from the feature values that it sees at serving time. For example, this can happen when there is a trend such as inflation in the price of fares. 

**3) Distribution Skew** 

Distribution skew occurs when the distribution of the training dataset is significantly different from the distribution of the serving dataset. One of the key causes for distribution skew is using different code or different data sources to generate the training dataset. 

Read up on `skew_comparator.infinity_norm.threshold` and `drift_comparator.infinity_norm.threshold` to see examples of how to set a threshold for categorical features.
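The infinity norm named in these thresholds is the largest absolute difference between the relative frequencies of any single value in the two datasets. Here is a minimal sketch of that distance for a categorical feature; `linf_distance` is a hypothetical helper, the data and the threshold value are made up, and this is not TFDV's own code.

```python
from collections import Counter


def linf_distance(train_values, serving_values):
    """Largest absolute difference in relative frequency across all observed values."""
    train_counts = Counter(train_values)
    serving_counts = Counter(serving_values)
    all_values = set(train_counts) | set(serving_counts)
    return max(
        abs(train_counts[v] / len(train_values) - serving_counts[v] / len(serving_values))
        for v in all_values
    )


# Made-up payment_type columns: value 3 only appears at serving time.
train_payment = [1] * 6 + [2] * 4
serving_payment = [1] * 3 + [2] * 5 + [3] * 2

distance = linf_distance(train_payment, serving_payment)
threshold = 0.01  # illustrative value only
print(distance, distance > threshold)
```

A feature whose distance exceeds its threshold would be reported as skewed (training vs serving) or drifted (consecutive spans of the same data), depending on which comparator is set.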

## Freeze Schema

We want to persist our schema so that it can be used by other team members as well as the rest of the TensorFlow Transform & Serving pipeline. 

In [20]:
from tensorflow.python.lib.io import file_io
from google.protobuf import text_format

file_io.recursive_create_dir(OUTPUT_DIR)
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')
tfdv.write_schema_text(schema, schema_file)

!cat {schema_file}

feature {
  name: "payment_type"
  value_count {
    min: 1
    max: 1
  }
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}
feature {
  name: "trip_distance"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}
feature {
  name: "pickup_month"
  value_count {
    min: 1
    max: 1
  }
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}
feature {
  name: "pickup_day_of_week"
  value_count {
    min: 1
    max: 1
  }
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}
feature {
  name: "dropoff_month"
  value_count {
    min: 1
    max: 1
  }
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}
feature {
  name: "trip_type"
  value_count {
    min: 1
    max: 1
  }
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
}
feature {
  name: "tip_amount"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1.0
 