# ML with TensorFlow Extended (TFX) -- Part 1
The purpose of this tutorial is to show how to do end-to-end ML with TFX libraries on Google Cloud Platform. This tutorial covers:
1. Data analysis and schema generation with **TF Data Validation**.
2. Data preprocessing with **TF Transform**.
3. Model training with **TF Estimator**.
4. Model evaluation with **TF Model Analysis**.

This notebook has been tested in Jupyter on the Deep Learning VM.

## 0. Setup Python and Cloud environment

Install the libraries we need and set up variables to reference our project and bucket.

In [None]:
%pip install -q --upgrade grpcio_tools tensorflow_data_validation

In [1]:
import apache_beam as beam
import platform
import tensorflow as tf
import tensorflow_data_validation as tfdv
import tensorflow_transform as tft
import tornado

print('tornado version: {}'.format(tornado.version))
print('Python version: {}'.format(platform.python_version()))
print('TF version: {}'.format(tf.__version__))
print('TFT version: {}'.format(tft.__version__))
print('TFDV version: {}'.format(tfdv.__version__))
print('Apache Beam version: {}'.format(beam.__version__))

  'Running the Apache Beam SDK on Python 3 is not yet fully supported. '


tornado version: 6.0.2
Python version: 3.5.3
TF version: 1.13.1
TFT version: 0.13.0
TFDV version: 0.13.1
Apache Beam version: 2.11.0


In [2]:
PROJECT = 'cloud-training-demos'    # Replace with your PROJECT
BUCKET = 'cloud-training-demos-ml'  # Replace with your BUCKET
REGION = 'us-central1'              # Choose an available region for Cloud MLE

import os

os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
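The values are copied into `os.environ` because the `%%bash` cells that follow run in a child shell, which only sees environment variables, not Python variables. A minimal sketch of that inheritance, assuming a POSIX `sh` is available; `DEMO_PROJECT` and its value are hypothetical names used so the sketch does not overwrite the real `PROJECT` setting:

```python
import os
import subprocess

# Hypothetical variable name and value, used only for this illustration.
os.environ['DEMO_PROJECT'] = 'my-project-id'

# A child shell inherits the environment of the notebook's Python process,
# so it can expand $DEMO_PROJECT just like the %%bash cells expand $PROJECT.
output = subprocess.check_output(['sh', '-c', 'echo "$DEMO_PROJECT"'])
shell_value = output.decode().strip()
print(shell_value)
```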

In [3]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

# Ensure local predictions use the current Python environment.
gcloud config set ml_engine/local_python "$(which python)"

Updated property [core/project].
Updated property [compute/region].
Updated property [ml_engine/local_python].


<img valign="middle" src="images/tfx.jpeg">

### Flights dataset

We'll use the flights dataset from the book [Data Science on Google Cloud Platform](http://shop.oreilly.com/product/0636920057628.do).

In [5]:
DATA_BUCKET = "gs://cloud-training-demos/flights/chapter8/output/"
TRAIN_DATA_PATTERN = DATA_BUCKET + "train*"
EVAL_DATA_PATTERN = DATA_BUCKET + "test*"
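The `train*` and `test*` suffixes are wildcard patterns that select the sharded CSV files under the bucket prefix. As a rough local illustration, the sketch below uses Python's `fnmatch` to show which object names each pattern would pick up. This only approximates how Cloud Storage resolves wildcards, `match` is a hypothetical helper, and the names in `sample_objects` are a hand-written subset modeled on the `NNNNN-of-00007` shard naming rather than a real bucket listing.

```python
from fnmatch import fnmatch

DATA_BUCKET = "gs://cloud-training-demos/flights/chapter8/output/"
TRAIN_DATA_PATTERN = DATA_BUCKET + "train*"
EVAL_DATA_PATTERN = DATA_BUCKET + "test*"

# Hand-written object names for illustration (not a real bucket listing).
sample_objects = [
    DATA_BUCKET + "trainFlights-00000-of-00007.csv",
    DATA_BUCKET + "trainFlights-00001-of-00007.csv",
    DATA_BUCKET + "testFlights-00000-of-00007.csv",
]

def match(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatch(name, pattern)]

train_files = match(TRAIN_DATA_PATTERN, sample_objects)
eval_files = match(EVAL_DATA_PATTERN, sample_objects)
print(len(train_files), "training shards,", len(eval_files), "evaluation shards")
```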

In [16]:
!gcloud storage ls --long $TRAIN_DATA_PATTERN
!gcloud storage ls --long $EVAL_DATA_PATTERN
!gcloud storage cat $DATA_BUCKET'trainFlights-00000-of-00007.csv' | head -1

  19791059  2018-11-30T01:26:30Z  gs://cloud-training-demos/flights/chapter8/output/trainFlights-00000-of-00007.csv
 113651981  2018-11-30T01:26:33Z  gs://cloud-training-demos/flights/chapter8/output/trainFlights-00001-of-00007.csv
 141696199  2018-11-30T01:26:34Z  gs://cloud-training-demos/flights/chapter8/output/trainFlights-00002-of-00007.csv
   1861214  2018-11-30T01:26:29Z  gs://cloud-training-demos/flights/chapter8/output/trainFlights-00003-of-00007.csv
   6713759  2018-11-30T01:26:29Z  gs://cloud-training-demos/flights/chapter8/output/trainFlights-00004-of-00007.csv
   1900597  2018-11-30T01:26:29Z  gs://cloud-training-demos/flights/chapter8/output/trainFlights-00005-of-00007.csv
 151664685  2018-11-30T01:26:35Z  gs://cloud-training-demos/flights/chapter8/output/trainFlights-00006-of-00007.csv
TOTAL: 7 objects, 437279494 bytes (417.02 MiB)
   2686954  2018-11-30T01:26:29Z  gs://cloud-training-demos/flights/chapter8/output/testFlights-00000-of-00007.csv
    749937  2018-11-30T01:

## 1. Data Analysis
For data analysis, visualization, and schema generation, we use [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) to perform the following:
1. **Analyze** the training data and produce **statistics**.
2. Generate data **schema** from the produced statistics.
3. **Configure** the schema.
4. **Validate** the evaluation data against the schema.
5. **Save** the schema for later use.

In [7]:
import tensorflow_data_validation as tfdv
print('TFDV version: {}'.format(tfdv.__version__))

TFDV version: 0.13.1


### 1.1 Compute and visualize statistics

In [8]:
CSV_COLUMNS = ('ontime,dep_delay,taxiout,distance,avg_dep_delay,avg_arr_delay' + 
               ',carrier,dep_lat,dep_lon,arr_lat,arr_lon,origin,dest').split(',')
TARGET_FEATURE_NAME = 'ontime'
DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0],
            ['na'], [0.0], [0.0], [0.0], [0.0], ['na'], ['na']]
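`CSV_COLUMNS` names the fields in file order, and `DEFAULTS` gives one default per column; the Python type of each default (float or string) also suggests the column's type. A minimal sketch of how the two lists line up when decoding a single row follows. `parse_row` is a hypothetical helper, and the sample line is made up for illustration rather than taken from the dataset.

```python
import csv

CSV_COLUMNS = ('ontime,dep_delay,taxiout,distance,avg_dep_delay,avg_arr_delay'
               ',carrier,dep_lat,dep_lon,arr_lat,arr_lon,origin,dest').split(',')
DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0],
            ['na'], [0.0], [0.0], [0.0], [0.0], ['na'], ['na']]

def parse_row(line):
    """Hypothetical helper: map one CSV line to a {column: value} dict."""
    fields = next(csv.reader([line]))
    row = {}
    for name, default, raw in zip(CSV_COLUMNS, DEFAULTS, fields):
        if raw == '':
            row[name] = default[0]      # missing value: fall back to the default
        elif isinstance(default[0], float):
            row[name] = float(raw)      # numeric column
        else:
            row[name] = raw             # string column
    return row

# Made-up sample line with an empty taxiout field.
sample = '1.0,-3.0,,1235.0,10.5,8.2,AA,33.9,-118.4,40.6,-73.7,LAX,JFK'
row = parse_row(sample)
print(row['carrier'], row['taxiout'], row['distance'])
```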

In [9]:
# This is a convenience function for CSV. We can write a Beam pipeline for other formats.
# https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv
train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA_PATTERN, 
    column_names=CSV_COLUMNS,
    stats_options=tfdv.StatsOptions(sample_rate=0.1)
)



Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
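Setting `sample_rate=0.1` asks TFDV to compute statistics on roughly a tenth of the examples, which trades some accuracy for speed on a dataset of this size. The sketch below illustrates the idea with independent per-row sampling in plain Python. It is an assumption made for illustration, not a description of how TFDV samples internally, and `sample_rows` is a hypothetical helper.

```python
import random

def sample_rows(rows, sample_rate, seed=0):
    """Keep each row independently with probability `sample_rate`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < sample_rate]

rows = list(range(100000))  # stand-in for 100,000 CSV rows
sampled = sample_rows(rows, sample_rate=0.1)
fraction = len(sampled) / len(rows)
print('kept {:.1%} of rows'.format(fraction))
```

Because the sample is random, statistics such as counts and rare categorical values are estimates; a rarely seen value may be missing from the sampled statistics altogether.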




In [10]:
tfdv.visualize_statistics(train_stats)

### 1.2 Infer Schema

In [11]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'distance',FLOAT,required,,-
'arr_lon',FLOAT,required,,-
'origin',BYTES,required,,-
'dep_lon',FLOAT,required,,-
'avg_arr_delay',FLOAT,required,,-
'ontime',FLOAT,required,,-
'taxiout',FLOAT,required,,-
'dest',BYTES,required,,-
'arr_lat',FLOAT,required,,-
'dep_delay',FLOAT,required,,-


Domain,Values
'carrier',"'AA', 'AS', 'B6', 'DL', 'EV', 'F9', 'HA', 'MQ', 'NK', 'OO', 'UA', 'US', 'VX', 'WN'"


### 1.3 Configure Schema

Specify some tolerance for values.

In [12]:
# Relax the minimum fraction of values that must come from the domain for the carrier feature.
carrier = tfdv.get_feature(schema, 'carrier')
carrier.distribution_constraints.min_domain_mass = 0.9

# All features are by default in both TRAINING and SERVING environments.
#schema.default_environment.append('TRAINING')
#schema.default_environment.append('EVALUATION')
#schema.default_environment.append('SERVING')

# Specify that the target feature is not in the SERVING environment.
#tfdv.get_feature(schema, TARGET_FEATURE_NAME).not_in_environment.append('SERVING')
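With `min_domain_mass = 0.9`, at least 90% of the observed `carrier` values must belong to the schema's domain, so a small share of previously unseen carriers does not raise an anomaly. The sketch below shows that check in plain Python. The `domain_mass` helper and the sample values are hypothetical, and the code illustrates the idea rather than TFDV's implementation.

```python
# Carrier domain as inferred in the schema above.
CARRIER_DOMAIN = {'AA', 'AS', 'B6', 'DL', 'EV', 'F9', 'HA',
                  'MQ', 'NK', 'OO', 'UA', 'US', 'VX', 'WN'}
MIN_DOMAIN_MASS = 0.9

def domain_mass(values, domain):
    """Fraction of values that belong to the domain."""
    values = list(values)
    if not values:
        return 1.0  # assumption: with no values, nothing is out of domain
    return sum(v in domain for v in values) / len(values)

# Made-up evaluation values: 19 known carriers and 1 unseen code ('ZZ').
eval_carriers = ['AA'] * 10 + ['DL'] * 9 + ['ZZ']
mass = domain_mass(eval_carriers, CARRIER_DOMAIN)
is_anomaly = mass < MIN_DOMAIN_MASS
print('domain mass = {:.2f}, anomaly = {}'.format(mass, is_anomaly))
```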

### 1.4 Validate evaluation data

In [None]:
eval_stats = tfdv.generate_statistics_from_csv(EVAL_DATA_PATTERN, column_names=CSV_COLUMNS)
eval_anomalies = tfdv.validate_statistics(eval_stats, schema) #, environment='EVALUATION')
tfdv.display_anomalies(eval_anomalies)

### 1.5 Freeze the schema

In [13]:
RAW_SCHEMA_LOCATION = 'raw_schema.pbtxt'

In [14]:
# Write the schema as a text-format protocol buffer so later steps can reuse it.
tfdv.write_schema_text(schema, RAW_SCHEMA_LOCATION)

In [15]:
!cat {RAW_SCHEMA_LOCATION}

feature {
  name: "distance"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "arr_lon"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "origin"
  type: BYTES
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dep_lon"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "avg_arr_delay"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "ontime"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "taxiout"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1

## License

Copyright 2019 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

---
**Disclaimer**: This is not an official Google product. The sample code is provided for educational purposes.

---