# Analyzing data with TensorFlow Data Validation

This notebook demonstrates how to use TensorFlow Data Validation (TFDV) to analyze and validate your data: generating descriptive statistics, inferring and fine-tuning a schema, checking for and fixing anomalies, and detecting drift and skew. It's important to understand your dataset's characteristics, including how they might change over time in your production pipeline. It's also important to look for anomalies in your data and to compare your training, evaluation, and serving datasets to make sure that they're consistent. TFDV is designed for these tasks.

You are going to use a variant of the Cover Type dataset. For more information about the dataset, refer to [the dataset summary page](../../datasets/covertype/README.md).

## Lab setup
### Import packages and check the versions

In [42]:
import os
import tempfile
import tensorflow as tf
import tensorflow_data_validation as tfdv
import time

from apache_beam.options.pipeline_options import (
    DebugOptions,
    GoogleCloudOptions,
    PipelineOptions,
    SetupOptions,
    StandardOptions,
    WorkerOptions,
)
from tensorflow_metadata.proto.v0 import schema_pb2

print('TensorFlow version: {}'.format(tf.__version__))
print('TensorFlow Data Validation version: {}'.format(tfdv.__version__))

TensorFlow version: 2.0.0
TensorFlow Data Validation version: 0.15.0


### Set the locations

In [44]:
TRAINING_DATASET='gs://workshop-datasets/covertype/training/covertype_training.csv'
TRAINING_DATASET_WITH_MISSING_VALUES='gs://workshop-datasets/covertype/training_missing/covertype_training_missing.csv'
EVALUATION_DATASET='gs://workshop-datasets/covertype/evaluation/covertype_evaluation.csv'
EVALUATION_DATASET_WITH_ANOMALIES='gs://workshop-datasets/covertype/evaluation_anomalies/covertype_evaluation_anomalies.csv'
SERVING_DATASET='gs://workshop-datasets/covertype/serving/covertype_serving.csv'
LAB_ROOT_FOLDER='/home/mlops-labs/Lab-11-TFDV/01-Covertype-Dataset'

### Configure GCP settings

In [45]:
PROJECT_ID = 'jk-mlops-demo'
REGION = 'us-central1'
STAGING_BUCKET = 'gs://{}-lab11'.format(PROJECT_ID)

### Create a GCP staging bucket

In [46]:
!gsutil mb -p $PROJECT_ID $STAGING_BUCKET 

Creating gs://jk-mlops-demo-lab11/...
ServiceException: 409 Bucket jk-mlops-demo-lab11 already exists.


## Computing and visualizing descriptive statistics

 
TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.

Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets. For applications that wish to integrate deeper with TFDV (e.g., attach statistics generation at the end of a data-generation pipeline), the API also exposes a Beam PTransform for statistics generation.

Let's start by using `tfdv.generate_statistics_from_csv` to compute statistics for the training data split.

Notice that although your dataset is in Google Cloud Storage, you will run your computation locally on the notebook's host, using the Beam DirectRunner. Later in the lab, you will use Cloud Dataflow to calculate statistics on a remote distributed cluster.

In [48]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAINING_DATASET_WITH_MISSING_VALUES
)

You can now use `tfdv.visualize_statistics` to create a visualization of your data. `tfdv.visualize_statistics` uses [Facets](https://pair-code.github.io/facets/), which provides succinct, interactive visualizations to aid in understanding and analyzing machine learning datasets.

In [49]:
tfdv.visualize_statistics(train_stats)

The interactive widget you see is **Facets Overview**. 
- Numeric features and categorical features are visualized separately, including charts showing the distributions for each feature.
- Features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features. The percentage is the percentage of examples that have missing or zero values for that feature.
- Try clicking "expand" above the charts to change the display
- Try hovering over bars in the charts to display bucket ranges and counts
- Try switching between the log and linear scales
- Try selecting "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages
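
To make the "missing" and "zeros" percentages concrete, here is a minimal plain-Python sketch that computes them for a tiny in-memory CSV. It is only an illustration of the idea, not how TFDV or Facets compute their statistics; the sample rows and the helper name `missing_and_zero_pct` are made up for this example, and only the column names come from the Cover Type data.

```python
import csv
import io

# Hypothetical four-row sample: column names are from the Cover Type data,
# the values are invented for illustration.
SAMPLE_CSV = """Elevation,Horizontal_Distance_To_Hydrology,Wilderness_Area
3051,0,Commanche
2820,,Rawah
2288,134,Cache
3155,256,Commanche
"""


def missing_and_zero_pct(csv_text):
    """Return {column: {'missing_pct': float, 'zero_pct': float}}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stats = {}
    for column in rows[0].keys():
        missing = 0
        zeros = 0
        for row in rows:
            value = row[column].strip()
            if value == "":
                missing += 1
                continue
            try:
                if float(value) == 0:
                    zeros += 1
            except ValueError:
                # Non-numeric (categorical) values are never counted as zero.
                pass
        stats[column] = {
            "missing_pct": 100.0 * missing / len(rows),
            "zero_pct": 100.0 * zeros / len(rows),
        }
    return stats


stats = missing_and_zero_pct(SAMPLE_CSV)
for name, values in stats.items():
    print(name, values)
```

In this sample, one of the four `Horizontal_Distance_To_Hydrology` values is empty and one is zero, so each percentage is 25 for that column and 0 for the others.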

## Inferring the schema
Now let's use `tfdv.infer_schema` to create a schema for the data. A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data. For categorical features the schema also defines the domain - the list of acceptable values. Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

In [50]:
schema = tfdv.infer_schema(train_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'Wilderness_Area',STRING,required,,'Wilderness_Area'
'Aspect',INT,required,,-
'Cover_Type',INT,required,,-
'Elevation',INT,required,,-
'Hillshade_3pm',INT,required,,-
'Hillshade_9am',INT,required,,-
'Hillshade_Noon',INT,required,,-
'Horizontal_Distance_To_Fire_Points',INT,required,,-
'Horizontal_Distance_To_Hydrology',FLOAT,optional,single,-
'Horizontal_Distance_To_Roadways',INT,required,,-


Domain,Values
'Wilderness_Area',"'Cache', 'Commanche', 'Neota', 'Rawah'"


Notice that `tfdv.infer_schema` did not infer all features properly. Although both `Soil_Type` and `Cover_Type` are of type `INT`, they should be interpreted as categorical rather than numeric. You can use `tfdv` functions to fine-tune the schema manually.

In [51]:
soil_type_domain = [
"2702", "2703", "2704", "2705", "2706", "2717", "3501", "3502", "4201", "4703", "4704", "4744", "4758", "5101", 
"5151", "6101", "6102", "6731", "7101", "7102", "7103", "7201", "7202", "7700", "7701", "7702", "7709", "7710", 
"7745", "7746", "7755", "7756", "7757", "7790", "8703", "8707", "8708", "8771", "8772", "8776",
]

tfdv.get_feature(schema, 'Soil_Type').type = schema_pb2.FeatureType.BYTES
tfdv.set_domain(schema, 'Soil_Type', schema_pb2.StringDomain(name='Soil_Type', value=soil_type_domain))

tfdv.set_domain(schema, 'Cover_Type', schema_pb2.IntDomain(name='Cover_Type', min=1, max=7, is_categorical=True))
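
As a rough illustration of what these domain constraints mean, the plain-Python sketch below flags values that fall outside a string domain or an integer range. It does not use TFDV and is not how TFDV validates data; the helper names, the abbreviated domain, and the observed values are hypothetical.

```python
# First few entries of the Soil_Type domain (abbreviated for this example).
SOIL_TYPE_DOMAIN = {"2702", "2703", "2704", "2705", "2706", "2717"}
# Same bounds as the IntDomain set on Cover_Type.
COVER_TYPE_MIN, COVER_TYPE_MAX = 1, 7


def out_of_domain(values, domain):
    """Return the sorted distinct values (as strings) that are not in the domain."""
    return sorted({str(v) for v in values} - set(domain))


def out_of_range(values, low, high):
    """Return the sorted distinct values that fall outside [low, high]."""
    return sorted({v for v in values if not low <= v <= high})


# Hypothetical observations, invented for illustration.
soil_observed = [2702, 2705, 9999, 2717]
cover_observed = [1, 7, 0, 3, 8]

bad_soil = out_of_domain(soil_observed, SOIL_TYPE_DOMAIN)
bad_cover = out_of_range(cover_observed, COVER_TYPE_MIN, COVER_TYPE_MAX)

print(bad_soil)   # ['9999']
print(bad_cover)  # [0, 8]
```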

In [52]:
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'Wilderness_Area',STRING,required,,'Wilderness_Area'
'Aspect',INT,required,,-
'Cover_Type',INT,required,,"[1,7]"
'Elevation',INT,required,,-
'Hillshade_3pm',INT,required,,-
'Hillshade_9am',INT,required,,-
'Hillshade_Noon',INT,required,,-
'Horizontal_Distance_To_Fire_Points',INT,required,,-
'Horizontal_Distance_To_Hydrology',FLOAT,optional,single,-
'Horizontal_Distance_To_Roadways',INT,required,,-


Domain,Values
'Wilderness_Area',"'Cache', 'Commanche', 'Neota', 'Rawah'"
'Soil_Type',"'2702', '2703', '2704', '2705', '2706', '2717', '3501', '3502', '4201', '4703', '4704', '4744', '4758', '5101', '5151', '6101', '6102', '6731', '7101', '7102', '7103', '7201', '7202', '7700', '7701', '7702', '7709', '7710', '7745', '7746', '7755', '7756', '7757', '7790', '8703', '8707', '8708', '8771', '8772', '8776'"


In [56]:
stats_options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)

train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAINING_DATASET_WITH_MISSING_VALUES,
    stats_options=stats_options,
)

tfdv.visualize_statistics(train_stats)

## Creating statistics using Cloud Dataflow

Previously, you created descriptive statistics using local compute. This may work for smaller datasets, but for large datasets you may not have enough local compute power. The `tfdv.generate_statistics_*` functions can utilize `DataflowRunner` to run Beam processing on a distributed Dataflow cluster.

To run TFDV on Google Cloud Dataflow, the TFDV library must be installed on the Dataflow workers. There are different ways to install additional packages on Dataflow. You are going to use the Python `setup.py` file approach.

You will also configure `tfdv.generate_statistics_from_csv` to use the schema created in the previous steps.

### Configure Dataflow settings

In [53]:
%%writefile setup.py

from setuptools import setup

setup(
    name='tfdv',
    description='TFDV Runtime.',
    version='0.1',
    install_requires=[
      'tensorflow_data_validation==0.15.0'
    ]
)

Overwriting setup.py


### Regenerate statistics

In [54]:
options = PipelineOptions()
options.view_as(GoogleCloudOptions).project = PROJECT_ID
options.view_as(GoogleCloudOptions).region = REGION
options.view_as(GoogleCloudOptions).job_name = "tfdv-{}".format(time.strftime("%Y%m%d-%H%M%S"))
options.view_as(GoogleCloudOptions).staging_location = STAGING_BUCKET + '/staging/'
options.view_as(GoogleCloudOptions).temp_location = STAGING_BUCKET + '/tmp/'
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(SetupOptions).setup_file = os.path.join(LAB_ROOT_FOLDER, 'setup.py')

stats_options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)

train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAINING_DATASET_WITH_MISSING_VALUES,
    stats_options=stats_options,
    pipeline_options=options,
    output_path=STAGING_BUCKET + '/output/'
)



In [55]:
tfdv.visualize_statistics(train_stats)

## Analyzing evaluation data

So far we've only been looking at the training data. It's important that our evaluation data is consistent with our training data, including that it uses the same schema. It's also important that the evaluation data includes examples of roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training. The same is true for categorical features. Otherwise, we may have training issues that are not identified during evaluation, because we didn't evaluate part of our loss surface.

You will now generate statistics for the evaluation split and visualize both training and evaluation splits on the same chart:

- The training and evaluation datasets overlay, making it easy to compare them.
- The charts now include a percentages view, which can be combined with log or the default linear scales.
- Notice that some features differ significantly between the training and evaluation datasets; in particular, check the mean and median. Will that cause problems?
- Click "expand" on the Numeric Features chart and select the log scale. Compare the max of each numeric feature across the two datasets. Will evaluation miss parts of the loss surface?
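
One way to put a number on how far two splits differ for a categorical feature is the L-infinity distance between their normalized value frequencies, a metric that TFDV's skew and drift comparators also support. The sketch below is a plain-Python illustration of that metric only, not TFDV's implementation; the function names are hypothetical and the `Wilderness_Area` counts are invented.

```python
from collections import Counter


def value_frequencies(values):
    """Map each distinct value to its share of the total count."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(lhs_values, rhs_values):
    """Largest absolute difference in frequency over all observed values."""
    lhs = value_frequencies(lhs_values)
    rhs = value_frequencies(rhs_values)
    all_values = set(lhs) | set(rhs)
    return max(abs(lhs.get(v, 0.0) - rhs.get(v, 0.0)) for v in all_values)


# Hypothetical Wilderness_Area samples, invented for illustration.
train = ["Rawah"] * 5 + ["Commanche"] * 4 + ["Cache"] * 1
evaluation = ["Rawah"] * 2 + ["Commanche"] * 4 + ["Cache"] * 2 + ["Neota"] * 2

distance = l_infinity_distance(train, evaluation)
print(round(distance, 3))
```

Here the largest gap is for `Rawah` (50% of the training sample versus 20% of the evaluation sample), so the distance is about 0.3. A value near 0 means the two distributions are close.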

In [25]:
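# Use the curated training schema so that both splits are analyzed with the same
# feature types; eval_schema is not defined until a later cell.
stats_options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)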

eval_stats = tfdv.generate_statistics_from_csv(
    data_location=EVALUATION_DATASET_WITH_ANOMALIES,
    stats_options=stats_options
)

In [26]:
tfdv.visualize_statistics(eval_stats)

### Checking for anomalies

In [24]:
eval_schema = tfdv.infer_schema(eval_stats)
tfdv.display_schema(schema=eval_schema)

Feature name,Type,Presence,Valency,Domain
'Aspect',INT,required,,-
'Cover_Type',INT,required,,-
'Elevation',INT,required,,-
'Hillshade_3pm',INT,required,,-
'Hillshade_9am',INT,required,,-
'Hillshade_Noon',INT,required,,-
'Horizontal_Distance_To_Fire_Points',INT,required,,-
'Horizontal_Distance_To_Hydrology',INT,required,,-
'Horizontal_Distance_To_Roadways',INT,required,,-
'Slope',INT,required,,-


Domain,Values
'Wilderness_Area',"'Cache', 'Commanche', 'Neota', 'Rawah'"


In [21]:
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

In [None]:
stats_options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)

# Keep the serving statistics under their own name so train_stats is not overwritten.
serving_stats = tfdv.generate_statistics_from_csv(
    data_location=SERVING_DATASET,
    stats_options=stats_options
)

tfdv.visualize_statistics(serving_stats)

In [None]:
#schema = schema_pb2.Schema()

#schema.feature.add(name='Elevation', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Aspect', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Slope', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Horizontal_Distance_To_Hydrology', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Vertical_Distance_To_Hydrology', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Horizontal_Distance_To_Roadways', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Hillshade_9am', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Hillshade_Noon', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Hillshade_3pm', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Horizontal_Distance_To_Fire_Points', type=schema_pb2.FeatureType.FLOAT)

#schema.feature.add(name='Wilderness_Area', type=schema_pb2.FeatureType.BYTES)
#schema.feature.add(name='Soil_Type', type=schema_pb2.FeatureType.BYTES)

#schema.feature.add(name='Cover_Type', type=schema_pb2.FeatureType.INT)
#tfdv.set_domain(schema, 'Cover_Type', schema_pb2.IntDomain(min=1, max=7, is_categorical=True))

In [30]:
%pip install gcsfs

Collecting gcsfs
  Downloading https://files.pythonhosted.org/packages/d8/10/891a143325fb237bd4f990efcd13fac257af8dcd6525f804981eb5f6f632/gcsfs-0.5.3-py2.py3-none-any.whl
Collecting fsspec>=0.6.0
  Downloading https://files.pythonhosted.org/packages/04/1e/6108c48f2d4ad9ef1a6bff01fb58245c009f37b2bd0505ec6d0f55cc326d/fsspec-0.6.1-py3-none-any.whl (62kB)
Installing collected packages: fsspec, gcsfs
Successfully installed fsspec-0.6.1 gcsfs-0.5.3
Note: you may need to restart the kernel to use updated packages.


In [31]:
import pandas as pd

In [32]:
eval_df = pd.read_csv(EVALUATION_DATASET)
eval_df.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area,Soil_Type,Cover_Type
0,3051,7,8,0,0,2568,211,225,151,778,Commanche,7202,1
1,2820,212,15,127,22,3679,205,253,180,3237,Rawah,7746,2
2,2288,2,30,134,18,626,165,170,130,601,Cache,4703,6
3,3155,31,13,256,11,3079,216,210,128,3120,Commanche,7700,2
4,2945,90,11,127,13,3637,237,223,116,6449,Rawah,7745,2


In [34]:
eval_df.Soil_Type.unique()

array([7202, 7746, 4703, 7700, 7745, 7756, 4744, 4758, 2704, 2703, 7101,
       7757, 7755, 2717, 2705, 8772, 7102, 6102, 7702, 8771, 6101, 7790,
       7201, 4704, 2702, 7103, 8776, 7709, 7710, 5101, 8703, 3502, 6731,
       2706, 4201, 7701, 8708, 3501, 8707])

In [35]:
len(eval_df.Soil_Type.unique())

39

In [36]:
train_df = pd.read_csv(TRAINING_DATASET)


In [37]:
train_df.Soil_Type.unique()

array([7202, 7746, 7755, 7756, 2703, 4703, 7745, 2717, 6101, 4758, 7201,
       7700, 8771, 8703, 2702, 7102, 7702, 2704, 8707, 4744, 7757, 4704,
       2705, 8776, 8772, 7790, 5101, 7103, 7101, 6102, 2706, 7710, 7709,
       6731, 7701, 3502, 4201, 8708, 3501, 5151])

In [38]:
len(train_df.Soil_Type.unique())

40
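
The two counts differ by one. A quick set difference over the values printed above shows which soil type is missing from the evaluation split. The sketch copies the arrays from the cell outputs so that it runs without access to the CSV files; the variable names are chosen for this example.

```python
# Values copied from the eval_df.Soil_Type.unique() output above.
eval_soil_types = {
    7202, 7746, 4703, 7700, 7745, 7756, 4744, 4758, 2704, 2703, 7101,
    7757, 7755, 2717, 2705, 8772, 7102, 6102, 7702, 8771, 6101, 7790,
    7201, 4704, 2702, 7103, 8776, 7709, 7710, 5101, 8703, 3502, 6731,
    2706, 4201, 7701, 8708, 3501, 8707,
}

# Values copied from the train_df.Soil_Type.unique() output above.
train_soil_types = {
    7202, 7746, 7755, 7756, 2703, 4703, 7745, 2717, 6101, 4758, 7201,
    7700, 8771, 8703, 2702, 7102, 7702, 2704, 8707, 4744, 7757, 4704,
    2705, 8776, 8772, 7790, 5101, 7103, 7101, 6102, 2706, 7710, 7709,
    6731, 7701, 3502, 4201, 8708, 3501, 5151,
}

only_in_train = sorted(train_soil_types - eval_soil_types)
only_in_eval = sorted(eval_soil_types - train_soil_types)

print(only_in_train)  # [5151]
print(only_in_eval)   # []
```

With the DataFrames loaded, the same comparison can be written as `set(train_df.Soil_Type.unique()) - set(eval_df.Soil_Type.unique())`.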

In [39]:
eval_df = pd.read_csv(EVALUATION_DATASET_WITH_ANOMALIES)

In [40]:
len(eval_df.Soil_Type.unique())

40