# Diving into data
The first task in any data science or ML project is to understand and clean the data.

* Understand the data types for each feature
* Look for anomalies and missing values
* Understand the distributions for each feature

![examplegen1.png](img/examplegen1.png)
![examplegen2.png](img/examplegen2.png)

* [ExampleGen](https://www.tensorflow.org/tfx/guide/examplegen) ingests and splits the input dataset.
* [StatisticsGen](https://www.tensorflow.org/tfx/guide/statsgen) calculates statistics for the dataset.
* [SchemaGen](https://www.tensorflow.org/tfx/guide/schemagen) examines the statistics and creates a data schema.
* [ExampleValidator](https://www.tensorflow.org/tfx/guide/exampleval) looks for anomalies and missing values in the dataset.
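
As a rough, library-free illustration of the StatisticsGen, SchemaGen, and ExampleValidator steps listed above, the sketch below computes simple per-feature statistics, infers a schema from them, and checks a second batch of data against that schema. The function names, the dictionary-based schema, and the sample rows are all hypothetical. The real components rely on TensorFlow Data Validation and check far more than types and missing values.

```python
from collections import Counter


def compute_stats(rows):
    """Per-feature counts of observed Python types and missing values."""
    features = sorted({key for row in rows for key in row})
    stats = {}
    for name in features:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        stats[name] = {
            "types": Counter(type(v).__name__ for v in present),
            "missing": len(values) - len(present),
        }
    return stats


def infer_schema(stats):
    """Use the most common type per feature; require it if it was never missing."""
    return {
        name: {
            "type": s["types"].most_common(1)[0][0] if s["types"] else None,
            "required": s["missing"] == 0,
        }
        for name, s in stats.items()
    }


def find_anomalies(stats, schema):
    """Describe where new statistics disagree with the schema."""
    anomalies = {}
    for name, s in stats.items():
        if name not in schema:
            anomalies[name] = "feature not in schema"
            continue
        expected = schema[name]
        unexpected = set(s["types"]) - {expected["type"]}
        if unexpected:
            anomalies[name] = "unexpected type(s): " + ", ".join(sorted(unexpected))
        elif expected["required"] and s["missing"] > 0:
            anomalies[name] = "{} missing value(s)".format(s["missing"])
    return anomalies


# Hypothetical rows, loosely modeled on the taxi dataset.
train_rows = [
    {"fare": 12.5, "payment_type": "Cash"},
    {"fare": 7.0, "payment_type": "Credit Card"},
]
eval_rows = [
    {"fare": "n/a", "payment_type": "Cash"},
    {"fare": 9.25, "payment_type": None},
]

schema = infer_schema(compute_stats(train_rows))
anomalies = find_anomalies(compute_stats(eval_rows), schema)
print(schema)
print(anomalies)
```

The schema is inferred once from the training rows and then reused to judge later data, which mirrors the order of the components in the pipeline.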


## Data Validation

Use the code below to run TensorFlow Data Validation on your pipeline. Start by importing and opening the metadata store.

In [None]:
from __future__ import print_function

!pip install -q papermill
!pip install -q matplotlib
!pip install -q networkx

import os
import tfx_utils
%matplotlib notebook

def _make_default_sqlite_uri(pipeline_name):
    return os.path.join(os.environ['HOME'], 'airflow/tfx/metadata', pipeline_name, 'metadata.db')

def get_metadata_store(pipeline_name):
    return tfx_utils.TFXReadonlyMetadataStore.from_sqlite_db(_make_default_sqlite_uri(pipeline_name))

pipeline_name = 'taxi'

pipeline_db_path = _make_default_sqlite_uri(pipeline_name)
print('Pipeline DB:\n{}'.format(pipeline_db_path))

store = get_metadata_store(pipeline_name)

**Now print out the data artifacts:**

In [None]:
# Visualize properties of example artifacts
store.get_artifacts_of_type_df(tfx_utils.TFXArtifactTypes.EXAMPLES)

**Now visualize the dataset features.**

_Hint: try ID 2 or 3_

In [None]:
# Visualize stats for data
store.display_stats_for_examples(1, split='Split-train')

**Now plot the artifact lineage:**

In [None]:
# Try different IDs here. Click stop in the plot when changing IDs.
%matplotlib inline
store.plot_artifact_lineage(1)


#### More advanced example
The example presented here is really only meant to get you started. For a more advanced example see the [TensorFlow Data Validation Colab](https://www.tensorflow.org/tfx/tutorials/data_validation/chicago_taxi).
For more information on using TFDV to explore and validate a dataset, [see the examples on tensorflow.org](https://www.tensorflow.org/tfx/data_validation/get_started).

# Feature Engineering
You can increase the predictive quality of your data and/or reduce dimensionality with feature engineering.
* Feature crosses
* Vocabularies
* Embeddings
* PCA
* Categorical encoding

One of the benefits of using TFX is that you will write your transformation code once, and the resulting transforms will be consistent between training and serving.

![transform.png](img/transform.png)

[Transform](https://www.tensorflow.org/tfx/guide/transform) performs feature engineering on the dataset.
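
To make the "write once, apply consistently" idea concrete, here is a minimal plain-Python sketch of the analyze-then-transform pattern. It is not how TensorFlow Transform is implemented. The helper names and sample fares are hypothetical, and the use of the population standard deviation is an assumption made for illustration.

```python
import math


def analyze(train_values):
    """Full pass over the training data to compute the constants needed later."""
    mean = sum(train_values) / len(train_values)
    variance = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    return {"mean": mean, "std": math.sqrt(variance)}


def make_transform(constants):
    """Freeze the analyzed constants into one function used everywhere."""
    def transform(value):
        return (value - constants["mean"]) / constants["std"]
    return transform


train_fares = [5.0, 10.0, 15.0]  # hypothetical training values
scale_fare = make_transform(analyze(train_fares))

# The same function, with the same constants, serves both phases.
train_scaled = [scale_fare(v) for v in train_fares]
serving_scaled = scale_fare(10.0)
print(train_scaled, serving_scaled)
```

Because the mean and standard deviation are computed once and captured in `scale_fare`, a serving request is scaled exactly as the training data was, which is the kind of skew that a shared transform is meant to prevent.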

Use the code below to run TensorFlow Transform on some example data using the schema from your pipeline. Start by importing and opening the metadata store.

In [None]:
from __future__ import print_function

import os
import tempfile
import pandas as pd

import tensorflow as tf
import tensorflow_transform as tft
from tensorflow_transform import beam as tft_beam
import tfx_utils
from tfx.utils import io_utils
from tensorflow_metadata.proto.v0 import schema_pb2

# For DatasetMetadata boilerplate
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils

def _make_default_sqlite_uri(pipeline_name):
    return os.path.join(os.environ['HOME'], 'airflow/tfx/metadata', pipeline_name, 'metadata.db')

def get_metadata_store(pipeline_name):
    return tfx_utils.TFXReadonlyMetadataStore.from_sqlite_db(_make_default_sqlite_uri(pipeline_name))

pipeline_name = 'taxi'

pipeline_db_path = _make_default_sqlite_uri(pipeline_name)
print('Pipeline DB:\n{}'.format(pipeline_db_path))

store = get_metadata_store(pipeline_name)

**Get the schema URI from the metadata store**

In [None]:
# Get the schema URI from the metadata store
schemas = store.get_artifacts_of_type_df(tfx_utils.TFXArtifactTypes.SCHEMA)
print(schemas.URI)
schema_uri = schemas.URI.iloc[0] + '/schema.pbtxt'
print('Schema URI:\n{}'.format(schema_uri))

**Get the schema that was inferred by TensorFlow Data Validation**

In [None]:
schema_proto = io_utils.parse_pbtxt_file(file_name=schema_uri, message=schema_pb2.Schema())
feature_spec, domains = schema_utils.schema_as_feature_spec(schema_proto)
legacy_metadata = dataset_metadata.DatasetMetadata(schema_utils.schema_from_feature_spec(feature_spec, domains))

**Define features and create functions for TensorFlow Transform**

In [None]:
# Categorical features are assumed to each have a maximum value in the dataset.
_MAX_CATEGORICAL_FEATURE_VALUES = [24, 31, 12]

_NUMERICAL_FEATURES = ['trip_miles', 'fare', 'trip_seconds']

_BUCKET_FEATURES = [
    'pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
    'dropoff_longitude'
]
# Number of buckets used by tf.transform for encoding each feature.
_FEATURE_BUCKET_COUNT = 10

_CATEGORICAL_NUMERICAL_FEATURES = [
    'trip_start_hour', 'trip_start_day', 'trip_start_month',
    'pickup_census_tract', 'dropoff_census_tract', 'pickup_community_area',
    'dropoff_community_area'
]

_CATEGORICAL_STRING_FEATURES = [
    'payment_type',
    'company',
]

# Number of vocabulary terms used for encoding categorical features.
_VOCAB_SIZE = 1000

# Number of out-of-vocabulary buckets into which unrecognized categorical values are hashed.
_OOV_SIZE = 10

# Keys
_LABEL_KEY = 'tips'
_FARE_KEY = 'fare'


def t_name(key):
  """
  Rename the feature keys so that they don't clash with the raw keys when
  running the Evaluator component.
  Args:
    key: The original feature key
  Returns:
    key with '_xf' appended
  """
  return key + '_xf'


def _make_one_hot(x, key):
  """Make a one-hot tensor to encode categorical features.
  Args:
    x: A dense tensor
    key: A string key for the feature in the input
  Returns:
    A dense one-hot tensor as a float list
  """
  integerized = tft.compute_and_apply_vocabulary(x,
          top_k=_VOCAB_SIZE,
          num_oov_buckets=_OOV_SIZE,
          vocab_filename=key, name=key)
  depth = (
      tft.experimental.get_vocabulary_size_by_name(key) + _OOV_SIZE)
  one_hot_encoded = tf.one_hot(
      integerized,
      depth=tf.cast(depth, tf.int32),
      on_value=1.0,
      off_value=0.0)
  return tf.reshape(one_hot_encoded, [-1, depth])


def _fill_in_missing(x):
  """Replace missing values in a SparseTensor.
  Fills in missing values of `x` with '' or 0, and converts to a dense tensor.
  Args:
    x: A `SparseTensor` of rank 2.  Its dense shape should have size at most 1
      in the second dimension.
  Returns:
    A rank 1 tensor where missing values of `x` have been filled in.
  """
  if not isinstance(x, tf.sparse.SparseTensor):
    return x

  default_value = '' if x.dtype == tf.string else 0
  return tf.squeeze(
      tf.sparse.to_dense(
          tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),
          default_value),
      axis=1)

def preprocessing_fn(inputs):
  """tf.transform's callback function for preprocessing inputs.

  Args:
    inputs: map from feature keys to raw not-yet-transformed features.

  Returns:
    Map from string feature key to transformed feature operations.
  """
  outputs = {}
  for key in _NUMERICAL_FEATURES:
    # If sparse, make it dense (filling missing values with 0 or ''), then apply z-score scaling.
    outputs[t_name(key)] = tft.scale_to_z_score(
        _fill_in_missing(inputs[key]), name=key)

  for key in _BUCKET_FEATURES:
    outputs[t_name(key)] = tf.cast(tft.bucketize(
            _fill_in_missing(inputs[key]), _FEATURE_BUCKET_COUNT, name=key),
            dtype=tf.float32)

  for key in _CATEGORICAL_STRING_FEATURES:
    outputs[t_name(key)] = _make_one_hot(_fill_in_missing(inputs[key]), key)

  for key in _CATEGORICAL_NUMERICAL_FEATURES:
    outputs[t_name(key)] = _make_one_hot(_fill_in_missing(inputs[key]), key)

  # Was this passenger a big tipper?
  taxi_fare = _fill_in_missing(inputs[_FARE_KEY])
  tips = _fill_in_missing(inputs[_LABEL_KEY])
  outputs[_LABEL_KEY] = tf.where(
      tf.math.is_nan(taxi_fare),
      tf.cast(tf.zeros_like(taxi_fare), tf.int64),
      # Test if the tip was > 20% of the fare.
      tf.cast(
          tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))

  return outputs
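
The vocabulary, out-of-vocabulary (OOV) bucket, and one-hot steps used by `_make_one_hot` can be hard to picture from the TensorFlow calls alone. The following plain-Python sketch illustrates the general idea under simplifying assumptions: the helper names and sample values are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash TensorFlow Transform uses internally.

```python
from collections import Counter
import zlib


def build_vocabulary(values, top_k):
    """Analysis step: keep the top_k most frequent values, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then by value, so that ties are deterministic.
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    return ranked[:top_k]


def integerize(value, vocabulary, num_oov_buckets):
    """Map a value to its vocabulary index, or hash it into an OOV bucket."""
    if value in vocabulary:
        return vocabulary.index(value)
    bucket = zlib.crc32(value.encode("utf-8")) % num_oov_buckets
    return len(vocabulary) + bucket


def one_hot(index, depth):
    """Dense one-hot float list with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


# Hypothetical payment types seen during analysis.
payment_types = ["Cash", "Credit Card", "Cash", "Cash", "Credit Card", "No Charge"]
vocab = build_vocabulary(payment_types, top_k=2)
num_oov = 3
depth = len(vocab) + num_oov  # vocabulary size plus OOV buckets

cash = one_hot(integerize("Cash", vocab, num_oov), depth)
unseen = one_hot(integerize("Prcard", vocab, num_oov), depth)
print(vocab, cash, unseen)
```

Values inside the vocabulary get a stable index, while anything else lands in one of the trailing OOV positions, so the encoded width stays fixed even when serving data contains categories that were not kept in the vocabulary.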

**Display the results of transforming some example data**

In [None]:
from IPython.display import display
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    raw_examples = [
        {
            "fare": [100.0],
            "trip_start_hour": [12],
            "pickup_census_tract": ['abcd'],
            "dropoff_census_tract": [12345], 
            "company": ['taxi inc.'],
            "trip_start_timestamp": [123456],
            "pickup_longitude": [12.0],
            "trip_start_month": [5],
            "trip_miles": [8.0],
            "dropoff_longitude": [12.05],
            "dropoff_community_area": [123],
            "pickup_community_area": [123],
            "payment_type": ['visa'],
            "trip_seconds": [600],
            "trip_start_day": [12],
            "tips": [10.0],
            "pickup_latitude": [80.0],
            "dropoff_latitude": [80.01],
        }
    ]
    (transformed_examples, transformed_metadata), transform_fn = (
        (raw_examples, legacy_metadata)
        | 'AnalyzeAndTransform' >> tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))
    display(pd.DataFrame(transformed_examples))
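
As a quick sanity check on the label logic in `preprocessing_fn`, the rule "tip greater than 20% of the fare" can be restated in plain Python. This is only a sketch of the arithmetic; the function name `big_tipper_label` is hypothetical and the code does not use TensorFlow.

```python
import math


def big_tipper_label(fare, tips, threshold=0.2):
    """Return 1 if tips exceed threshold * fare, else 0. A NaN fare yields 0."""
    if math.isnan(fare):
        return 0
    return int(tips > fare * threshold)


# Values taken from the raw example above: fare 100.0, tips 10.0.
example_label = big_tipper_label(fare=100.0, tips=10.0)
generous_label = big_tipper_label(fare=10.0, tips=2.5)
missing_fare_label = big_tipper_label(fare=float("nan"), tips=5.0)
print(example_label, generous_label, missing_fare_label)
```

For the raw example, 20% of the 100.0 fare is 20.0 and the tip is only 10.0, so by this rule the record is not a big tipper.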

#### More advanced example
The example presented here is really only meant to get you started. For a more advanced example see the [TensorFlow Transform Colab](https://www.tensorflow.org/tfx/tutorials/transform/census).

# Training
Train a TensorFlow model with your nice, clean, transformed data.
* Include the transformations from earlier so that they are applied consistently
* Save the results as a SavedModel for production
* Visualize and explore the training process using TensorBoard
* Also save an EvalSavedModel for analysis of model performance

[Trainer](https://www.tensorflow.org/tfx/guide/trainer) trains the model using TensorFlow [Estimators](https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/estimators.md).

## Train a Model

Use the code below to run TensorBoard on the model in your pipeline.

In [None]:
from __future__ import print_function

import os
import webbrowser
import tensorflow as tf

!pip install -q tensorboard
tf.get_logger().propagate = False

pipeline_name = 'taxi'
tensorboard_logdir = os.path.join(os.environ['HOME'], 'airflow/tfx/pipelines', pipeline_name, 'Trainer/model_run')
print('tensorboard_logdir: {}'.format(tensorboard_logdir))
#os.environ['TENSORBOARD_LOGDIR'] = tensorboard_logdir

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

**Start TensorBoard**

Wait for TensorBoard to start.

In [None]:
%tensorboard --logdir $tensorboard_logdir

#### More advanced example
The example presented here is really only meant to get you started. For a more advanced example see the [TensorBoard Tutorial](https://www.tensorflow.org/tensorboard/get_started).

# Analyzing model performance
Look beyond the top-level metrics to understand how your model really performs.
* Users experience model performance only for their own queries
* Poor performance on slices of data can be hidden by top-level metrics
* Model fairness is important
* Often key subsets of users or data are very important, and may be small
  * Performance in critical but unusual conditions
  * Performance for key audiences such as influencers
* If you’re replacing a model that is currently in production, first make sure that the new one is better
* Evaluator tells the Pusher component if the model is OK

[Evaluator](https://www.tensorflow.org/tfx/guide/evaluator) performs deep analysis of the training results, and ensures that the model is "good enough" to be pushed to production.
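
The point about slices can be shown with a small plain-Python sketch. The example records, the 0.7 threshold, and the rule that every slice must pass are assumptions made purely for illustration; they are not Evaluator's actual criteria, and the code does not use TensorFlow Model Analysis.

```python
from collections import defaultdict


def accuracy_by_slice(examples, slicing_column):
    """Return overall accuracy and accuracy per distinct value of slicing_column."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ex in examples:
        key = ex[slicing_column]
        totals[key] += 1
        correct[key] += int(ex["label"] == ex["prediction"])
    overall = sum(correct.values()) / sum(totals.values())
    per_slice = {key: correct[key] / totals[key] for key in totals}
    return overall, per_slice


# Hypothetical predictions: the model is right at noon and wrong at 3 a.m.
examples = (
    [{"trip_start_hour": 12, "label": 1, "prediction": 1}] * 8
    + [{"trip_start_hour": 3, "label": 1, "prediction": 0}] * 2
)

overall, per_slice = accuracy_by_slice(examples, "trip_start_hour")

# Assumed rule: the overall metric and every slice must meet the threshold.
min_accuracy = 0.7
good_enough = overall >= min_accuracy and all(
    acc >= min_accuracy for acc in per_slice.values()
)
print(overall, per_slice, good_enough)
```

Here the overall accuracy looks acceptable while one slice fails completely, which is exactly the situation that a top-level metric alone would hide.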

Use the code below to run TensorFlow Model Analysis on the model in your pipeline. Start by importing and opening the metadata store.

In [None]:
from __future__ import print_function

import os
import tfx_utils

def _make_default_sqlite_uri(pipeline_name):
    return os.path.join(os.environ['HOME'], 'airflow/tfx/metadata', pipeline_name, 'metadata.db')

def get_metadata_store(pipeline_name):
    return tfx_utils.TFXReadonlyMetadataStore.from_sqlite_db(_make_default_sqlite_uri(pipeline_name))

pipeline_name = 'taxi' # or taxi_solution
pipeline_db_path = _make_default_sqlite_uri(pipeline_name)
print('Pipeline DB:\n{}'.format(pipeline_db_path))

store = get_metadata_store(pipeline_name)

**Now print out the model artifacts:**

In [None]:
store.get_artifacts_of_type_df(tfx_utils.TFXArtifactTypes.MODEL)

**Now analyze the model performance:**

In [None]:
store.display_tfma_analysis(13, slicing_column='trip_start_hour')

**Now plot the artifact lineage:**

In [None]:
# Try different IDs here. Click stop in the plot when changing IDs.
%matplotlib inline
store.plot_artifact_lineage(13)

# Ready for production

If the new model is ready, make it so.
* Pusher deploys SavedModels to well-known locations

Deployment targets receive new models from well-known locations
* TensorFlow Serving
* TensorFlow Lite
* TensorFlow.js
* TensorFlow Hub

[Pusher](https://www.tensorflow.org/tfx/guide/pusher) deploys the model to a serving infrastructure.
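
One way to picture a "well-known location" is a base directory containing numbered version subdirectories, which is the layout TensorFlow Serving conventionally loads models from. The sketch below imitates that layout with the standard library only. The helper names are hypothetical, the placeholder file is not a real SavedModel, and this is not a description of how Pusher itself is implemented.

```python
import os
import shutil
import tempfile


def push_model(model_dir, serving_base_dir, version):
    """Copy an exported model into serving_base_dir/<version> and return that path."""
    destination = os.path.join(serving_base_dir, str(version))
    shutil.copytree(model_dir, destination)
    return destination


def latest_version(serving_base_dir):
    """Return the highest numeric version directory, or None if there is none."""
    versions = [int(name) for name in os.listdir(serving_base_dir) if name.isdigit()]
    return max(versions) if versions else None


# Temporary directories stand in for real pipeline and serving locations.
workspace = tempfile.mkdtemp()
model_dir = os.path.join(workspace, "trained_model")
os.makedirs(model_dir)
with open(os.path.join(model_dir, "saved_model.pb"), "wb") as f:
    f.write(b"placeholder")  # Not a real SavedModel, just a file to copy.

serving_base_dir = os.path.join(workspace, "saved_models", "taxi")
os.makedirs(serving_base_dir)
push_model(model_dir, serving_base_dir, version=1)
pushed = push_model(model_dir, serving_base_dir, version=2)
print(pushed, latest_version(serving_base_dir))
```

Writing each model to a new numbered directory keeps earlier versions available for rollback while letting a deployment target pick up the newest one.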

# Next Steps
You have now trained and validated your model and exported a **SavedModel** under the `~/airflow/saved_models/taxi` directory. Your model is ready for production, and you can deploy it to any of the TensorFlow deployment targets, including:

* [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving), for serving your model on a server or server farm and processing REST and/or gRPC inference requests.
* [TensorFlow Lite](https://www.tensorflow.org/lite), for including your model in an Android or iOS native mobile application, or in a Raspberry Pi, IoT, or microcontroller application.
* [TensorFlow.js](https://www.tensorflow.org/js), for running your model in a web browser or Node.js application.

#### More advanced example
The example presented here is really only meant to get you started. For a more advanced example see the [TFMA Chicago Taxi Tutorial](https://www.tensorflow.org/tfx/tutorials/model_analysis/tfma_basic).