# Guided Project

(Adapted from [Create a TFX pipeline using templates](https://www.tensorflow.org/tfx/tutorials/tfx/template))

In [None]:
import os

## Step 1. Environment setup

### `tfx` and `kfp` tools setup

In [None]:
%%bash

TFX_PKG="tfx==0.22.0"
KFP_PKG="kfp==0.5.1"

pip freeze | grep $TFX_PKG || pip install -Uq $TFX_PKG
pip freeze | grep $KFP_PKG || pip install -Uq $KFP_PKG

You may need to restart the kernel at this point.
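
After the restart, it can help to confirm that the kernel actually sees the pinned versions. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical and not part of TFX or KFP.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Versions pinned in the setup cell above.
expected = {"tfx": "0.22.0", "kfp": "0.5.1"}
for name, wanted in expected.items():
    found = installed_version(name)
    status = "OK" if found == wanted else "MISMATCH"
    print(f"{name}: expected {wanted}, found {found} [{status}]")
```

A `MISMATCH` line usually means the kernel was not restarted or a different Python environment is active.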

### `skaffold` tool setup

In [None]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

In [None]:
%%bash

LOCAL_BIN="/home/jupyter/.local/bin"
SKAFFOLD_URI="https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64"

test -d $LOCAL_BIN || mkdir -p $LOCAL_BIN

which skaffold || (
    curl -Lo skaffold $SKAFFOLD_URI &&
    chmod +x skaffold               &&
    mv skaffold $LOCAL_BIN
)

The first cell in this section appends `/home/jupyter/.local/bin` to the `PATH` environment variable so that `skaffold` is available.

At this point, the `which` command should show the location of the `skaffold` tool:

In [None]:
!which skaffold
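
If `which` prints nothing, the install directory is probably missing from `PATH`. Here is a small sketch of an idempotent fix from Python; `append_to_path` is a hypothetical helper, and the directory is the one used in the cells above.

```python
import os
import shutil


def append_to_path(path_value: str, directory: str) -> str:
    """Return `path_value` with `directory` appended, unless already present."""
    entries = [p for p in path_value.split(os.pathsep) if p]
    if directory not in entries:
        entries.append(directory)
    return os.pathsep.join(entries)


LOCAL_BIN = "/home/jupyter/.local/bin"
os.environ["PATH"] = append_to_path(os.environ.get("PATH", ""), LOCAL_BIN)

# shutil.which returns the full path of the executable, or None if not found.
print(shutil.which("skaffold"))
```

Because the helper checks for an existing entry first, re-running the cell does not keep growing `PATH`.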

### Environment variable setup

In AI Platform Pipelines, TFX is running in a hosted Kubernetes environment using [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/).

Let's set some environment variables to use Kubeflow Pipelines.

First, get your GCP project ID.

In [None]:
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
GOOGLE_CLOUD_PROJECT=shell_output[0]

%env GOOGLE_CLOUD_PROJECT={GOOGLE_CLOUD_PROJECT}

We also need to access your KFP cluster. You can access it in your Google Cloud Console under the "AI Platform > Pipeline" menu.

The "endpoint" of the KFP cluster can be found from the URL of the Pipelines dashboard, 
or you can get it from the URL of the Getting Started page where you launched this notebook.

Let's create an ENDPOINT environment variable and set it to the KFP cluster endpoint.

ENDPOINT should contain only the hostname part of the URL. 
For example, if the URL of the KFP dashboard is

<a href="https://1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com/#/start">https://1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com/#/start</a>, 

then the `ENDPOINT` value is `1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com`.

In [None]:
ENDPOINT = ''  # Enter your ENDPOINT here.
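
If you copied the full dashboard URL, the hostname can be extracted rather than trimmed by hand. This is a minimal sketch using the standard library; `endpoint_from_url` is a hypothetical helper, and the URL is the example from the text above.

```python
from urllib.parse import urlparse


def endpoint_from_url(dashboard_url: str) -> str:
    """Extract the hostname from a KFP dashboard URL."""
    host = urlparse(dashboard_url).hostname
    if host is None:
        # No scheme was given, so treat the input as already being a hostname
        # (dropping any trailing path).
        host = dashboard_url.split("/", 1)[0]
    return host


url = "https://1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com/#/start"
print(endpoint_from_url(url))
```

The result can be assigned to `ENDPOINT` directly, for example `ENDPOINT = endpoint_from_url(url)`.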

Set the image name to `tfx-pipeline` under the current GCP project:

In [None]:
# Docker image name for the pipeline image.
CUSTOM_TFX_IMAGE = 'gcr.io/' + GOOGLE_CLOUD_PROJECT + '/tfx-pipeline'
CUSTOM_TFX_IMAGE

## Step 2. Copy the predefined template to your project directory

In this step, we will create a working pipeline project directory and 
files by copying additional files from a predefined template.

You may give your pipeline a different name by changing the PIPELINE_NAME below. 

This will also become the name of the project directory where your files will be put.

In [None]:
PIPELINE_NAME = "tfx_templated_pipeline"
PROJECT_DIR = os.path.join(os.path.expanduser("."), PIPELINE_NAME)
PROJECT_DIR

TFX includes the `taxi` template with the TFX Python package.

If you are planning to solve a point-wise prediction problem,
including classification and regression, this template could be used as a starting point.

The `tfx template copy` CLI command copies predefined template files into your project directory.

In [None]:
!tfx template copy \
  --pipeline-name={PIPELINE_NAME} \
  --destination-path={PROJECT_DIR} \
  --model=taxi

In [None]:
%cd {PROJECT_DIR}

## Step 3. Browse your copied source files

The TFX template provides basic scaffold files to build a pipeline, including Python source code,
sample data, and Jupyter Notebooks to analyse the output of the pipeline. 

The `taxi` template uses the same Chicago Taxi dataset and ML model as 
the [Airflow Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/airflow_workshop).

Here is a brief introduction to each of the Python files:

`pipeline` - This directory contains the definition of the pipeline.
* `configs.py` — defines common constants for pipeline runners
* `pipeline.py` — defines TFX components and a pipeline

`models` - This directory contains ML model definitions.
* `features.py`, `features_test.py` — defines features for the model
* `preprocessing.py`, `preprocessing_test.py` — defines preprocessing jobs using tf::Transform

`models/estimator` - This directory contains an Estimator based model.
* `constants.py` — defines constants of the model
* `model.py`, `model_test.py` — defines DNN model using TF estimator

`models/keras` - This directory contains a Keras based model.
* `constants.py` — defines constants of the model
* `model.py`, `model_test.py` — defines DNN model using Keras

`beam_dag_runner.py`, `kubeflow_dag_runner.py` — define runners for each orchestration engine


**Running the tests:**
You might notice that some files have `_test.py` in their name.
These are unit tests of the pipeline, and it is recommended to add more unit
tests as you implement your own pipelines.
You can run unit tests by supplying the module name of a test file with the `-m` flag.
You can usually get the module name by deleting the `.py` extension and replacing `/` with `.`.

For example:

In [None]:
!python -m models.features_test
!python -m models.keras.model_test
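
The path-to-module rule described above can be sketched in a few lines. `module_name` is a hypothetical helper, and it assumes a path relative to the project directory.

```python
from pathlib import PurePosixPath


def module_name(test_file: str) -> str:
    """Convert a relative path such as 'models/features_test.py' to a module name."""
    path = PurePosixPath(test_file)
    if path.suffix != ".py":
        raise ValueError(f"expected a .py file, got {test_file!r}")
    # Drop the extension, then join the path components with dots.
    return ".".join(path.with_suffix("").parts)


print(module_name("models/features_test.py"))
print(module_name("models/keras/model_test.py"))
```

The two printed names match the arguments passed to `python -m` in the cell above.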

## Step 4. Run your first TFX pipeline

Components in the TFX pipeline will generate outputs for each run as
[ML Metadata Artifacts](https://www.tensorflow.org/tfx/guide/mlmd), and they need to be stored somewhere.
You can use any storage which the KFP cluster can access, and for this example we
will use Google Cloud Storage (GCS).

Let us create this bucket. Its name will be `<YOUR_PROJECT>-kubeflowpipelines-default`.

In [None]:
GCS_BUCKET_NAME = GOOGLE_CLOUD_PROJECT + '-kubeflowpipelines-default'
GCS_BUCKET_NAME

In [None]:
!gsutil mb gs://{GCS_BUCKET_NAME}

Let's upload our sample data to the GCS bucket so that we can use it in our pipeline later.

In [None]:
!gsutil cp data/data.csv gs://{GCS_BUCKET_NAME}/tfx-template/data/data.csv

Let's create a TFX pipeline using the `tfx pipeline create` command.

**Note:** When creating a pipeline for KFP, we need a container image which will
be used to run our pipeline. `skaffold` will build the image for us. Because `skaffold`
pulls base images from Docker Hub, the first build takes 5 to 10 minutes;
subsequent builds are much faster.

In [None]:
!tfx pipeline create  \
--pipeline-path=kubeflow_dag_runner.py \
--endpoint={ENDPOINT} \
--build-target-image={CUSTOM_TFX_IMAGE}

While creating a pipeline, `Dockerfile` and `build.yaml` will be generated to build a Docker image.

Don't forget to add these files to the source control system (for example, git) along with other source files.

A pipeline definition file for [argo](https://argoproj.github.io/argo/) will be generated, too. 
The name of this file is `${PIPELINE_NAME}.tar.gz`.
For example, it will be `tfx_templated_pipeline.tar.gz` if the name of your pipeline is `tfx_templated_pipeline`.
It is recommended NOT to include this pipeline definition file in source control, because it is generated from other Python files and is updated whenever you update the pipeline. For your convenience, this file is already listed in the automatically generated `.gitignore`.
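
To see what the generated package holds without unpacking it, the standard `tarfile` module is enough. This is a minimal sketch; `list_package_members` is a hypothetical helper, and the file name assumes the default `PIPELINE_NAME` used in this notebook.

```python
import os
import tarfile
from typing import List


def list_package_members(package_path: str) -> List[str]:
    """Return the names of the files stored in a .tar.gz package."""
    with tarfile.open(package_path, mode="r:gz") as archive:
        return archive.getnames()


package = "tfx_templated_pipeline.tar.gz"
if os.path.exists(package):
    print(list_package_members(package))
else:
    print(f"{package} not found; run `tfx pipeline create` first.")
```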

Now start an execution run with the newly created pipeline using the `tfx run create` command.

**Note:** You may see the following error `Error importing tfx_bsl_extension.coders.` Please ignore it.

In [None]:
!tfx run create --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}

You can also run the pipeline from the KFP Dashboard. The new execution run will be listed
under Experiments in the KFP Dashboard.
Clicking into the experiment will allow you to monitor progress and visualize
the artifacts created during the execution run.

We recommend visiting the KFP Dashboard to inspect your runs. You can access it from
the Cloud AI Platform Pipelines menu in Google Cloud Console. Once there,
you can find the pipeline and a wealth of information about it.
For example, your runs are listed under the Experiments menu, and opening an
execution run there shows all artifacts from the pipeline under the Artifacts menu.

**Note:** If your pipeline run fails, you can see detailed logs for each TFX component in the Experiments tab in the KFP Dashboard.

One of the major sources of failure is permission-related problems.
Please make sure your KFP cluster has permissions to access Google Cloud APIs.
This can be configured [when you create a KFP cluster in GCP](https://cloud.google.com/ai-platform/pipelines/docs/setting-up).
For other failures, see the [Troubleshooting document in GCP](https://cloud.google.com/ai-platform/pipelines/docs/troubleshooting).

## Step 5. Add components for data validation

In this step, you will add components for data validation including `StatisticsGen`, `SchemaGen`, and `ExampleValidator`.
If you are interested in data validation, please see
[Get started with TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started).

**Double-click to change directory to `pipeline` and double-click again to open** `pipeline.py`.
Find and uncomment the 3 lines which add `StatisticsGen`, `SchemaGen`, and `ExampleValidator` to the pipeline.
(Tip: search for comments containing `TODO(step 5):`.) Make sure to save `pipeline.py` after you edit it.
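
The search suggested in the tip can also be done from a notebook cell. This is a minimal sketch, assuming the current directory is the project directory; `find_todos` is a hypothetical helper, not part of TFX.

```python
import re
from typing import List, Tuple


def find_todos(source: str, step: int) -> List[Tuple[int, str]]:
    """Return (line_number, text) for lines containing 'TODO(step N)'."""
    marker = re.compile(rf"TODO\(step {step}\)")
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if marker.search(line)
    ]


path = "pipeline/pipeline.py"  # relative to the project directory
try:
    with open(path, encoding="utf-8") as handle:
        for number, text in find_todos(handle.read(), step=5):
            print(f"{path}:{number}: {text}")
except FileNotFoundError:
    print(f"{path} not found; run this from the project directory.")
```

Each printed line number points at a marker comment; the line to uncomment should be near it.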