##### Copyright &copy; 2020 The TensorFlow Authors.

<font size=-1>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.</font>

# Create a TFX pipeline using templates

## Introduction

This document provides instructions to create a TensorFlow Extended (TFX) pipeline
using *templates* which are provided with the TFX Python package.
Many of the instructions are Linux shell commands, which will run on an AI Platform Notebooks instance. Corresponding Jupyter Notebook code cells which invoke those commands using `!` are provided.

You will build a pipeline using the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew)
released by the City of Chicago. We strongly encourage you to try building
your own pipeline with your own dataset, using this pipeline as a baseline.


## Step 1. Set up your environment.

AI Platform Pipelines will prepare a development environment to build a pipeline, and a Kubeflow Pipeline cluster to run the newly built pipeline.

**NOTE:** To select a particular TensorFlow version, or select a GPU instance, create a TensorFlow pre-installed instance in AI Platform Notebooks.

**NOTE:** There might be some errors during package installation. For example: 

>"ERROR: some-package 0.some_version.1 has requirement other-package!=2.0.,&lt;3,&gt;=1.15, but you'll have other-package 2.0.0 which is incompatible." Please ignore these errors at this moment.


Install `tfx`, `kfp`, and `skaffold`, and add the installation path to the `PATH` environment variable.

In [1]:
# Install tfx and kfp Python packages.
import sys

!{sys.executable} -m pip install --user --upgrade -q -v --log /tmp/pip.log  --use-feature=2020-resolver tfx==0.26.0
!{sys.executable} -m pip install --user --upgrade -q -v --log /tmp/pip.log  --use-feature=2020-resolver kfp==1.0.0
# Download skaffold and set it executable.
!curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 && chmod +x skaffold && mv skaffold /home/jupyter/.local/bin/



Collecting tfx==0.26.0
  Downloading tfx-0.26.0-py3-none-any.whl (2.2 MB)
Collecting ml-pipelines-sdk==0.26.0
  Downloading ml_pipelines_sdk-0.26.0-py3-none-any.whl (1.1 MB)
Collecting pyarrow<0.18,>=0.17
  Downloading pyarrow-0.17.1-cp37-cp37m-manylinux2014_x86_64.whl (63.8 MB)
Collecting kubernetes<12,>=10.0.1
  Downloading kubernetes-11.0.0-py3-none-any.whl (1.5 MB)
Collecting httplib2<0.18.0,>=0.8
  Downloading httplib2-0.17.4-py3-none-any.whl (95 kB)
Collecting numpy<2,>=1.14.3
  Downloading numpy-1.18.5-cp37-cp37m-manylinux1_x86_64.whl (20.1 MB)
Collecting

In [2]:
# Set `PATH` to include user python binary directory and a directory containing `skaffold`.
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

env: PATH=/usr/local/cuda/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/jupyter/.local/bin
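
If you prefer plain Python over the `%env` magic, the same `PATH` update can be sketched as follows. This is only an illustrative alternative and not part of the template: the helper name `append_to_path` is hypothetical, and the demo works on a copy of the environment so nothing is actually changed.

```python
import os


def append_to_path(directory, environ=os.environ):
    """Append `directory` to PATH in `environ` unless it is already listed."""
    entries = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in entries:
        entries.append(directory)
    environ["PATH"] = os.pathsep.join(entries)
    return environ["PATH"]


# Use a throwaway mapping here so the sketch does not touch the real environment.
demo_env = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
print(append_to_path("/home/jupyter/.local/bin", demo_env))
```

Calling `append_to_path("/home/jupyter/.local/bin")` without the second argument would modify the environment of the running kernel. The helper skips directories that are already listed, so re-running the cell does not grow `PATH`.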


Let's check the installed version of TFX.

In [3]:
!python3 -c "from tfx import version ; print('TFX version: {}'.format(version.__version__))"

TFX version: 0.26.1


In AI Platform Pipelines, TFX is running in a hosted Kubernetes environment using [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/).

Let's set some environment variables to use Kubeflow Pipelines.

First, get your GCP project ID.

In [4]:
# Read GCP project id from env.
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
GOOGLE_CLOUD_PROJECT=shell_output[0]
%env GOOGLE_CLOUD_PROJECT={GOOGLE_CLOUD_PROJECT}
print("GCP project ID:" + GOOGLE_CLOUD_PROJECT)

env: GOOGLE_CLOUD_PROJECT=cloud-training-281409
GCP project ID:cloud-training-281409


We also need access to your KFP cluster. You can reach it in your Google Cloud Console under the "AI Platform > Pipeline" menu. The "endpoint" of the KFP cluster can be found in the URL of the Pipelines dashboard, or in the URL of the Getting Started page where you launched this notebook. Let's create an `ENDPOINT` environment variable and set it to the KFP cluster endpoint. **ENDPOINT should contain only the hostname part of the URL.** For example, if the URL of the KFP dashboard is `https://1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com/#/start`, the ENDPOINT value is `1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com`.

>**NOTE: You MUST set your ENDPOINT value below.**

In [5]:
# This refers to the KFP cluster endpoint
ENDPOINT='3b47c1a069ee9385-dot-us-west1.pipelines.googleusercontent.com' # Enter your ENDPOINT here (hostname only, without the https:// prefix).
if not ENDPOINT:
    from absl import logging
    logging.error('Set your ENDPOINT in this cell.')
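
Since only the hostname part of the dashboard URL belongs in `ENDPOINT`, a small helper can extract it from whatever you copied out of the browser. This is a minimal sketch using the standard library; the function name `endpoint_from_url` is hypothetical and not part of TFX or KFP.

```python
from urllib.parse import urlparse


def endpoint_from_url(url):
    """Return only the hostname of a KFP dashboard URL."""
    # urlparse only recognises the host when a scheme is present,
    # so add one if a bare hostname was passed in.
    if "://" not in url:
        url = "https://" + url
    return urlparse(url).hostname


dashboard_url = (
    "https://1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com/#/start"
)
print(endpoint_from_url(dashboard_url))
# prints: 1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com
```

The helper accepts both a full URL and a bare hostname, so it is safe to run on a value that is already in the expected form.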

Set the pipeline name to `titanic_pipeline`. The Docker image for the pipeline will be named after it under the current GCP project.

In [46]:
import pipeline.configs as configs
#PIPELINE_NAME=configs.PIPELINE_NAME
#print(PIPELINE_NAME)
PIPELINE_NAME="titanic_pipeline"
print(PIPELINE_NAME)

titanic_pipeline3
titanic_pipeline


In [43]:
# Docker image name for the pipeline image.
CUSTOM_TFX_IMAGE='gcr.io/' + GOOGLE_CLOUD_PROJECT + '/' + PIPELINE_NAME

## Load kaggle

In [8]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!rm kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

And, it's done. We are ready to create a pipeline.

## Step 2. Copy the predefined template to your project directory.

In this step, we will create a working pipeline project directory and files by copying additional files from a predefined template.

You may give your pipeline a different name by changing the `PIPELINE_NAME` below. This will also become the name of the project directory where your files will be put.

We don't do that here, since the template was already copied.

TFX includes the `taxi` template with the TFX Python package. If you are planning to solve a point-wise prediction problem, including classification and regression, this template could be used as a starting point.

The `tfx template copy` CLI command copies predefined template files into your project directory.

Change the working directory context in this notebook to the project directory.

>NOTE: Don't forget to change directory in `File Browser` on the left by clicking into the project directory once it is created.

## Step 3. Browse your copied source files

The TFX template provides basic scaffold files to build a pipeline, including Python source code, sample data, and Jupyter Notebooks to analyse the output of the pipeline. The `taxi` template uses the same *Chicago Taxi* dataset and ML model as the [Airflow Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/airflow_workshop).

Here is a brief introduction to each of the Python files.
-   `pipeline` - This directory contains the definition of the pipeline
    -   `configs.py` — defines common constants for pipeline runners
    -   `pipeline.py` — defines TFX components and a pipeline
-   `models` - This directory contains ML model definitions.
    -   `features.py`, `features_test.py` — defines features for the model
    -   `preprocessing.py`, `preprocessing_test.py` — defines preprocessing
        jobs using `tf::Transform`
    -   `estimator` - This directory contains an Estimator based model.
        -   `constants.py` — defines constants of the model
        -   `model.py`, `model_test.py` — defines DNN model using TF estimator
    -   `keras` - This directory contains a Keras based model.
        -   `constants.py` — defines constants of the model
        -   `model.py`, `model_test.py` — defines DNN model using Keras
-   `local_runner.py`, `kubeflow_runner.py` — define runners for each orchestration engine


You might notice that there are some files with `_test.py` in their name. These are unit tests of the pipeline, and it is recommended to add more unit tests as you implement your own pipelines.
You can run unit tests by supplying the module name of the test files with the `-m` flag. You can usually get a module name by deleting the `.py` extension and replacing `/` with `.`. For example:

In [9]:
!{sys.executable} -m models.features_test
!{sys.executable} -m models.keras.model_test

Running tests under Python 3.7.9: /opt/conda/bin/python
[ RUN      ] FeaturesTest.testNumberOfBucketFeatureBucketCount
INFO:tensorflow:time(__main__.FeaturesTest.testNumberOfBucketFeatureBucketCount): 0.0s
I0220 22:15:00.970785 140096448493376 test_util.py:1973] time(__main__.FeaturesTest.testNumberOfBucketFeatureBucketCount): 0.0s
[       OK ] FeaturesTest.testNumberOfBucketFeatureBucketCount
[ RUN      ] FeaturesTest.testTransformedNames
INFO:tensorflow:time(__main__.FeaturesTest.testTransformedNames): 0.0s
I0220 22:15:00.971262 140096448493376 test_util.py:1973] time(__main__.FeaturesTest.testTransformedNames): 0.0s
[       OK ] FeaturesTest.testTransformedNames
[ RUN      ] FeaturesTest.test_session
[  SKIPPED ] FeaturesTest.test_session
----------------------------------------------------------------------
Ran 3 tests in 0.001s

OK (skipped=1)
Running tests under Python 3.7.9: /opt/conda/bin/python
[ RUN      ] ModelTest.testBuildKerasModel
2021-02-20 22:15:09.351002: I tensorflow
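
The rule for deriving a module name from a test file path (drop the `.py` extension, replace `/` with `.`) can be sketched in a few lines of Python. The helper name `module_name` is hypothetical and is shown here only to make the rule concrete.

```python
from pathlib import PurePosixPath


def module_name(test_file):
    """Convert a relative path such as 'models/features_test.py' to a module name."""
    path = PurePosixPath(test_file)
    if path.suffix != ".py":
        raise ValueError(f"not a Python file: {test_file}")
    # Drop the extension, then join the path components with dots.
    return ".".join(path.with_suffix("").parts)


print(module_name("models/features_test.py"))     # models.features_test
print(module_name("models/keras/model_test.py"))  # models.keras.model_test
```

The resulting strings are what you pass to `python -m`, as in the cell above.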

## Step 4. Run your first TFX pipeline

Components in the TFX pipeline will generate outputs for each run as [ML Metadata Artifacts](https://www.tensorflow.org/tfx/guide/mlmd), and they need to be stored somewhere. You can use any storage which the KFP cluster can access, and for this example we will use Google Cloud Storage (GCS). A default GCS bucket should have been created automatically. Its name will be `<your-project-id>-kubeflowpipelines-default`.
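
The default bucket name follows directly from the project ID, so GCS paths for the pipeline can be built programmatically. The sketch below assumes that naming convention. The helper names and the project ID `my-project` are hypothetical, and the object path is the one used later in this notebook.

```python
def default_bucket(project_id):
    """Name of the bucket that AI Platform Pipelines creates by default."""
    return f"{project_id}-kubeflowpipelines-default"


def gcs_uri(project_id, *parts):
    """Build a gs:// URI inside the default bucket from path components."""
    return "gs://" + "/".join([default_bucket(project_id), *parts])


print(gcs_uri("my-project", "tfx-template", "data", "titanic", "data.csv"))
# prints: gs://my-project-kubeflowpipelines-default/tfx-template/data/titanic/data.csv
```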


In [11]:
import os
from pathlib import Path

_data_root = os.path.join(".", 'data')
_train_dirpath = os.path.join(_data_root, "train")
_train_filepath = os.path.join(_train_dirpath, "train.csv")
_test_dirpath = os.path.join(_data_root, "test")
_test_filepath = os.path.join(_test_dirpath, "test.csv")

Let's download the sample data from Kaggle first. We will then upload it to the GCS bucket so that we can use it in our pipeline later.

In [12]:
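# Create the per-split directories before copying files into them.
!mkdir -p {_train_dirpath} {_test_dirpath}
!kaggle competitions download -c titanic -p {_data_root} --force
!unzip -o {_data_root}/"titanic.zip" -d {_data_root}
!cp {_data_root}/"train.csv" {_train_filepath}
!cp {_data_root}/"test.csv" {_test_filepath}

# Clean up the downloaded archive and the extracted top-level CSV files.
!rm  {_data_root}/*.csv  {_data_root}/*.zip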

Downloading titanic.zip to ./data
  0%|                                               | 0.00/34.1k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 34.1k/34.1k [00:00<00:00, 11.9MB/s]
Archive:  ./data/titanic.zip
  inflating: ./data/gender_submission.csv  
  inflating: ./data/test.csv         
  inflating: ./data/train.csv        
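
The file shuffling done with shell commands above (one subdirectory per split, each holding its own CSV) can also be expressed in Python. The following is a sketch under the assumption that `train.csv` and `test.csv` sit directly in the data root. The helper name `arrange_splits` is hypothetical, and the demo runs on placeholder files in a temporary directory rather than on the real download.

```python
import shutil
import tempfile
from pathlib import Path


def arrange_splits(data_root):
    """Move <root>/train.csv and <root>/test.csv into per-split subdirectories."""
    data_root = Path(data_root)
    moved = {}
    for split in ("train", "test"):
        source = data_root / f"{split}.csv"
        target_dir = data_root / split
        # The target directory must exist before a file can be moved into it.
        target_dir.mkdir(parents=True, exist_ok=True)
        moved[split] = Path(shutil.move(str(source), str(target_dir / source.name)))
    return moved


# Demonstrate on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.csv", "test.csv"):
        (Path(tmp) / name).write_text("placeholder\n")
    result = arrange_splits(tmp)
    print(sorted(p.relative_to(tmp).as_posix() for p in result.values()))
    # prints: ['test/test.csv', 'train/train.csv']
```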


Copy the CSV data to the bucket.

In [10]:
!gsutil cp data/train/train.csv gs://{GOOGLE_CLOUD_PROJECT}-kubeflowpipelines-default/tfx-template/data/titanic/data.csv

Copying file://data/train/train.csv [Content-Type=text/csv]...
/ [1 files][ 59.8 KiB/ 59.8 KiB]                                                
Operation completed over 1 objects/59.8 KiB.                                     


Let's create a TFX pipeline using the `tfx pipeline create` command.

>Note: When creating a pipeline for KFP, we need a container image which will be used to run our pipeline, and `skaffold` will build the image for us. Because skaffold pulls base images from Docker Hub, the first build takes 5 to 10 minutes; subsequent builds take much less time.

In [44]:
!tfx pipeline create  \
--pipeline-path=kubeflow_runner.py \
--endpoint={ENDPOINT} \
--build-target-image={CUSTOM_TFX_IMAGE}

CLI
Creating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
No local setup.py, copying the directory and configuring the PYTHONPATH.
[Skaffold] Generating tags...
[Skaffold]  - gcr.io/cloud-training-281409/titanic_pipeline4 -> gcr.io/cloud-training-281409/titanic_pipeline4:latest
[Skaffold] Checking cache...
[Skaffold]  - gcr.io/cloud-training-281409/titanic_pipeline4: Not found. Building
[Skaffold] Building [gcr.io/cloud-training-281409/titanic_pipeline4]...
[Skaffold] Sending build context to Docker daemon  1.116MB
[Skaffold] Step 1/4 : FROM tensorflow/tfx:0.26.1
[Skaffold]  ---> 6dd91a0791af
[Skaffold] Step 2/4 : WORKDIR /pipeline
[Skaffold]  ---> Using cache
[Skaffold]  ---> 7ba72b899108
[Skaffold] Step 3/4 : COPY ./ ./
[Skaffold]  ---> 3c4da537067b
[Skaffold] Step 4/4 : ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
[Skaffold]  ---> Running in 97eaba83dc33
[Skaffold] Removing intermediate container 97eaba83

While creating a pipeline, `Dockerfile` and `build.yaml` will be generated to build a Docker image. Don't forget to add these files to the source control system (for example, git) along with other source files.

A pipeline definition file for [argo](https://argoproj.github.io/argo/) will be generated, too. The name of this file is `${PIPELINE_NAME}.tar.gz`. For example, it will be `my_pipeline.tar.gz` if the name of your pipeline is `my_pipeline`. It is recommended NOT to include this pipeline definition file into source control, because it will be generated from other Python files and will be updated whenever you update the pipeline. For your convenience, this file is already listed in `.gitignore` which is generated automatically.

NOTE: `kubeflow` will be automatically selected as an orchestration engine if `airflow` is not installed and `--engine` is not specified.

Now start an execution run with the newly created pipeline using the `tfx run create` command.

In [45]:
!tfx run create --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}

CLI
Creating a run for pipeline: titanic_pipeline4
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Run created for pipeline: titanic_pipeline4
+-------------------+--------------------------------------+----------+---------------------------+---------------------------------------------------------------------------------------------------------------------------+
| pipeline_name     | run_id                               | status   | created_at                | link                                                                                                                      |
| titanic_pipeline4 | 431ce8f1-0162-4aad-9b30-67473ac2bc29 |          | 2021-02-21T00:18:00+00:00 | https://3b47c1a069ee9385-dot-us-west1.pipelines.googleusercontent.com/#/runs/details/431ce8f1-0162-4aad-9b30-67473ac2bc29 |
+-------------------+--------------------------------------+----------+---------------------------+--------------------------------------------------

You can also start a run from the KFP Dashboard. The new execution run will be listed under Experiments in the KFP Dashboard. Clicking into the experiment will allow you to monitor progress and visualize the artifacts created during the execution run.

We recommend visiting the KFP Dashboard in either case. You can access it from the Cloud AI Platform Pipelines menu in Google Cloud Console. Once there, you will be able to find the pipeline and a wealth of information about it.
For example, you can find your runs under the *Experiments* menu, and when you open an execution run you can find all of its artifacts under the *Artifacts* menu.
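
The `link` column printed by `tfx run create` above suggests that the page of a run is addressed by the endpoint hostname plus the run ID. Based only on that output, a helper to rebuild the link might look like the sketch below. The function name is hypothetical, and the URL pattern is inferred from this notebook's output rather than taken from KFP documentation.

```python
def run_details_url(endpoint, run_id):
    """Build the dashboard URL of a run from a hostname-only endpoint."""
    return f"https://{endpoint}/#/runs/details/{run_id}"


print(
    run_details_url(
        "3b47c1a069ee9385-dot-us-west1.pipelines.googleusercontent.com",
        "431ce8f1-0162-4aad-9b30-67473ac2bc29",
    )
)
```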

>Note: If your pipeline run fails, you can see detailed logs for each TFX component in the Experiments tab in the KFP Dashboard.
    
One of the major sources of failure is permission-related problems. Please make sure your KFP cluster has permissions to access Google Cloud APIs. This can be configured [when you create a KFP cluster in GCP](https://cloud.google.com/ai-platform/pipelines/docs/setting-up), or see the [Troubleshooting document in GCP](https://cloud.google.com/ai-platform/pipelines/docs/troubleshooting).

## Create local pipeline

In [None]:
# Select the local orchestrator explicitly; otherwise Kubeflow is auto-detected.
!tfx pipeline create  \
--engine=local \
--pipeline-path=local_runner.py

## Run local pipeline

In [None]:
!python local_runner.py

## Update the pipeline

In [23]:
# Update the pipeline
!tfx pipeline update \
--pipeline-path=kubeflow_runner.py \
--endpoint={ENDPOINT}
# You can run the pipeline the same way.
#!tfx run create --pipeline-name {PIPELINE_NAME} --endpoint={ENDPOINT}

CLI
Updating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
New container image is built. Target image is available in the build spec file.
Instructions for updating:
external_input is deprecated, directly pass the uri to ExampleGen.
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Adding upstream dependencies for component CsvExampleGen
INFO:absl:Adding upstream dependencies for component ResolverNode_latest_blessed_model_resolver
INFO:absl:Adding upstream dependencies for component StatisticsGen
INFO:absl:   ->  Component: CsvExampleGen
INFO:absl:Adding upstream dependencies for component SchemaGen
INFO:absl:   ->  Component: StatisticsGen
INFO:absl:Adding upstream dependencies for component ExampleValidator
INFO:absl:   ->  Component: SchemaGen
INFO:absl:   ->  Component: StatisticsGe

### Check pipeline outputs

Visit the KFP dashboard to find pipeline outputs in the page for your pipeline run. Click the *Experiments* tab on the left, and *All runs* in the Experiments page. You should be able to find the latest run under the name of your pipeline.

## Step 6. Add components for training.

In this step, you will add components for training and model validation including `Transform`, `Trainer`, `ResolverNode`, `Evaluator`, and `Pusher`.

>**Double-click to open `pipeline.py`**. Find and uncomment the 5 lines which add `Transform`, `Trainer`, `ResolverNode`, `Evaluator` and `Pusher` to the pipeline. (Tip: search for `TODO(step 6):`)

As you did before, you now need to update the existing pipeline with the modified pipeline definition. The instructions are the same as Step 5. Update the pipeline using `tfx pipeline update`, and create an execution run using `tfx run create`.

In [22]:
!tfx pipeline update \
--pipeline-path=kubeflow_runner.py \
--endpoint={ENDPOINT}
!tfx run create --pipeline-name {PIPELINE_NAME} --endpoint={ENDPOINT}

CLI
Updating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
[Skaffold] Generating tags...
[Skaffold]  - gcr.io/cloud-training-281409/titanic-pipeline -> gcr.io/cloud-training-281409/titanic-pipeline:latest
[Skaffold] Checking cache...
[Skaffold]  - gcr.io/cloud-training-281409/titanic-pipeline: Not found. Building
[Skaffold] Building [gcr.io/cloud-training-281409/titanic-pipeline]...
[Skaffold] Sending build context to Docker daemon  1.161MB
[Skaffold] Step 1/4 : FROM tensorflow/tfx:0.26.1
[Skaffold]  ---> 6dd91a0791af
[Skaffold] Step 2/4 : WORKDIR /pipeline
[Skaffold]  ---> Using cache
[Skaffold]  ---> a0efa056cb46
[Skaffold] Step 3/4 : COPY ./ ./
[Skaffold]  ---> 583ab8d30268
[Skaffold] Step 4/4 : ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
[Skaffold]  ---> Running in 55b6d9177100
[Skaffold] Removing intermediate container 55b6d9177100
[Skaffold]  ---> ef7781c9c0d1
[Skaffold] Successfully built ef7781c9c0d1

When this execution run finishes successfully, you have now created and run your first TFX pipeline in AI Platform Pipelines!

**NOTE:** You might have noticed that every time we create a pipeline run, every component runs again and again even though the input and the parameters were not changed.
This is a waste of time and resources, and you can skip those executions with pipeline caching. You can enable caching by specifying `enable_cache=True` for the `Pipeline` object in `pipeline.py`.
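
To build intuition for what pipeline caching buys you, here is a toy model of the idea: a component's output is reused whenever its name, inputs, and parameters are unchanged. This is not how TFX implements caching (its real mechanism is more involved); the class `ToyExecutionCache` and everything in it are purely illustrative.

```python
import hashlib
import json


class ToyExecutionCache:
    """Reuses a component's output when its inputs and parameters are unchanged."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0  # how many times a component actually ran

    def run(self, component_name, inputs, params, fn):
        # The cache key is a hash of everything that can change the result.
        payload = json.dumps([component_name, inputs, params], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._outputs:
            self.executions += 1
            self._outputs[key] = fn(inputs, params)
        return self._outputs[key]


def count_rows(inputs, params):
    """Stand-in for an expensive component."""
    return len(inputs["rows"]) * params["repeat"]


cache = ToyExecutionCache()
first = cache.run("Counter", {"rows": [1, 2, 3]}, {"repeat": 2}, count_rows)
second = cache.run("Counter", {"rows": [1, 2, 3]}, {"repeat": 2}, count_rows)  # cache hit
third = cache.run("Counter", {"rows": [1, 2, 3]}, {"repeat": 3}, count_rows)  # parameter changed
print(first, second, third, cache.executions)  # 6 6 9 2
```

Three calls result in only two executions, because the second call repeats the first one exactly.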


## Step 7. (*Optional*) Try BigQueryExampleGen

[BigQuery](https://cloud.google.com/bigquery) is a serverless, highly scalable, and cost-effective cloud data warehouse. BigQuery can be used as a source for training examples in TFX. In this step, we will add `BigQueryExampleGen` to the pipeline.

>**Double-click to open `pipeline.py`**. Comment out `CsvExampleGen` and uncomment the line which creates an instance of `BigQueryExampleGen`. You also need to uncomment the `query` argument of the `create_pipeline` function.

We need to specify which GCP project to use for BigQuery, and this is done by setting `--project` in `beam_pipeline_args` when creating a pipeline.
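
As a rough illustration of what such an argument list looks like, the sketch below builds Beam flags containing `--project` and a temporary GCS location. The helper name, the project ID, and the bucket name are hypothetical. The template's own `BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS` constant in `configs.py` remains the authoritative definition and may contain different entries.

```python
def bigquery_direct_runner_args(project_id, bucket_name):
    """Beam flags for reading from BigQuery with the default (direct) runner."""
    if not project_id:
        raise ValueError("a GCP project ID is required for BigQuery")
    return [
        f"--project={project_id}",
        # A scratch location for temporary files; a GCS path is assumed here.
        f"--temp_location=gs://{bucket_name}/tmp",
    ]


print(bigquery_direct_runner_args("my-project", "my-project-kubeflowpipelines-default"))
```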

>**Double-click to open `configs.py`**. Uncomment the definition of `GOOGLE_CLOUD_REGION`, `BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS` and `BIG_QUERY_QUERY`. You should replace the region value in this file with the correct values for your GCP project.

>**Note: You MUST set your GCP region in the `configs.py` file before proceeding.**

>**Change directory one level up.** Click the name of the directory above the file list. The name of the directory is the name of the pipeline, which is `my_pipeline` if you didn't change it.

>**Double-click to open `kubeflow_runner.py`**. Uncomment two arguments, `query` and `beam_pipeline_args`, for the `create_pipeline` function.

Now the pipeline is ready to use BigQuery as an example source. Update the pipeline as before and create a new execution run as we did in steps 5 and 6.

In [None]:
!tfx pipeline update \
--pipeline-path=kubeflow_runner.py \
--endpoint={ENDPOINT}
!tfx run create --pipeline-name {PIPELINE_NAME} --endpoint={ENDPOINT}

## Step 8. (*Optional*) Try Dataflow with KFP

Several [TFX components use Apache Beam](https://www.tensorflow.org/tfx/guide/beam) to implement data-parallel pipelines, which means that you can distribute data processing workloads using [Google Cloud Dataflow](https://cloud.google.com/dataflow/). In this step, we will set the Kubeflow orchestrator to use Dataflow as the data processing back-end for Apache Beam.

>**Double-click `pipeline` to change directory, and double-click to open `configs.py`**. Uncomment the definition of `GOOGLE_CLOUD_REGION`, and `DATAFLOW_BEAM_PIPELINE_ARGS`.

>**Change directory one level up.** Click the name of the directory above the file list. The name of the directory is the name of the pipeline, which is `my_pipeline` if you didn't change it.

>**Double-click to open `kubeflow_runner.py`**. Uncomment `beam_pipeline_args`. (Also make sure to comment out current `beam_pipeline_args` that you added in Step 7.)

Now the pipeline is ready to use Dataflow. Update the pipeline and create an execution run as we did in steps 5 and 6.
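
Conceptually, switching Beam to Dataflow comes down to passing a different runner and a region in the pipeline arguments. The sketch below shows that transformation on a plain list of flags. The helper name `with_dataflow` and the sample values are hypothetical, and the template's `DATAFLOW_BEAM_PIPELINE_ARGS` constant may include further options not shown here.

```python
def with_dataflow(base_args, region):
    """Return a copy of `base_args` that selects the Dataflow runner."""
    # Drop any runner or region already present so the new values win.
    kept = [a for a in base_args if not a.startswith(("--runner=", "--region="))]
    return kept + ["--runner=DataflowRunner", f"--region={region}"]


base = ["--project=my-project", "--temp_location=gs://my-bucket/tmp"]
print(with_dataflow(base, "us-central1"))
```

The input list is left untouched, so the same base arguments can still be used for a direct-runner configuration.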

In [None]:
!tfx pipeline update \
--pipeline-path=kubeflow_runner.py \
--endpoint={ENDPOINT}
!tfx run create --pipeline-name {PIPELINE_NAME} --endpoint={ENDPOINT}

You can find your Dataflow jobs in [Dataflow in Cloud Console](http://console.cloud.google.com/dataflow).

>**Double-click to open `pipeline.py`**. Reset the value of `enable_cache` to `True`.


In [None]:
!tfx pipeline update \
--pipeline-path=kubeflow_runner.py \
--endpoint={ENDPOINT}
!tfx run create --pipeline-name {PIPELINE_NAME} --endpoint={ENDPOINT}

You can find your training jobs in [Cloud AI Platform Jobs](https://console.cloud.google.com/ai-platform/jobs). If your pipeline completed successfully, you can find your model in [Cloud AI Platform Models](https://console.cloud.google.com/ai-platform/models).

## Step 10. Ingest YOUR data to the pipeline

We made a pipeline for a model using the Chicago Taxi dataset. Now it's time to put your data into the pipeline.

Your data can be stored anywhere your pipeline can access, including GCS, or BigQuery. You will need to modify the pipeline definition to access your data.

1. If your data is stored in files, modify the `DATA_PATH` in `kubeflow_runner.py` or `local_runner.py` and set it to the location of your files. If your data is stored in BigQuery, modify `BIG_QUERY_QUERY` in `pipeline/configs.py` to correctly query for your data.
1. Add features in `models/features.py`.
1. Modify `models/preprocessing.py` to [transform input data for training](https://www.tensorflow.org/tfx/guide/transform).
1. Modify `models/keras/model.py` and `models/keras/constants.py` to [describe your ML model](https://www.tensorflow.org/tfx/guide/trainer).
  - You can use an estimator based model, too. Change `RUN_FN` constant to `models.estimator.model.run_fn` in `pipeline/configs.py`.
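
As an example of the second item, a feature definition for the Kaggle Titanic CSV might look like the sketch below. The column names come from that dataset, but the grouping into dense, vocabulary, and categorical features is only one possible choice, and the constant names and the `_xf` suffix are assumptions about the template's `models/features.py` rather than a copy of it.

```python
# Hypothetical feature lists for the Kaggle Titanic CSV; adjust to your data.
DENSE_FLOAT_FEATURE_KEYS = ["Age", "Fare"]
VOCAB_FEATURE_KEYS = ["Sex", "Embarked"]
CATEGORICAL_FEATURE_KEYS = ["Pclass", "SibSp", "Parch"]
LABEL_KEY = "Survived"


def transformed_name(key):
    """Name of a feature after preprocessing, e.g. 'Age' -> 'Age_xf'."""
    return key + "_xf"


def all_feature_keys():
    """All input feature keys, excluding the label."""
    return DENSE_FLOAT_FEATURE_KEYS + VOCAB_FEATURE_KEYS + CATEGORICAL_FEATURE_KEYS


# Basic sanity checks of the kind a features_test.py could contain.
keys = all_feature_keys()
assert LABEL_KEY not in keys, "the label must not be used as an input feature"
assert len(keys) == len(set(keys)), "each feature should be listed only once"
print([transformed_name(k) for k in keys])
```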

Please see the [Trainer component guide](https://www.tensorflow.org/tfx/guide/trainer) for more details.

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Alternatively, you can clean up individual resources by visiting each console:
- [Google Cloud Storage](https://console.cloud.google.com/storage)
- [Google Container Registry](https://console.cloud.google.com/gcr)
- [Google Kubernetes Engine](https://console.cloud.google.com/kubernetes)
