# Source
https://cloud.google.com/solutions/machine-learning/data-preprocessing-for-ml-with-tf-transform-pt2#introduction  
https://github.com/GoogleCloudPlatform/tf-estimator-tutorials/blob/master/00_Miscellaneous/tf_transform/tft-01%20-%20Babyweight%20preprocessing%20with%20tf.Transform.ipynb

## Description
This is an adaptation of *01 - Babyweight preprocessing with tf.Transform [HAS ERRORS].ipynb* that runs on Cloud Dataflow. The following changes have been applied:  
* the code has been moved to pipelines/01-babyweight/main.py
* the function prep_bq_row has been fixed and no longer turns every field into string
* pipelines/01-babyweight/requirements.txt has been defined to set up the local environment that launches the pipeline, with dependencies compatible with tensorflow-transform==0.28.0 according to the [compatibility table](https://pypi.org/project/tfx/). Use this file only to install dependencies in the local environment that launches the pipeline on Cloud Dataflow. Dependencies installed on the remote Dataflow workers are defined in pipelines/01-babyweight/setup.py.
* pipelines/01-babyweight/setup.py has been defined with a dependency on the correct version of tensorflow_transform according to the [compatibility table](https://pypi.org/project/tfx/)
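
To illustrate the `prep_bq_row` fix, here is a minimal sketch of a version that preserves types. It assumes the column names and types from the schema that the pipeline prints further down in this notebook (FLOAT and BYTES features). The constants `NUMERIC_COLUMNS` and `STRING_COLUMNS` are hypothetical, and the actual code in pipelines/01-babyweight/main.py may differ.

```python
# Hypothetical column groupings, derived from the schema printed by the pipeline
# (FLOAT vs BYTES features). The real main.py may organise this differently.
NUMERIC_COLUMNS = ("gestation_weeks", "key", "mother_age", "plurality", "weight_pounds")
STRING_COLUMNS = ("is_male", "mother_race")


def prep_bq_row(bq_row):
    """Convert one BigQuery row (a dict) into a dict with typed values.

    Assumes every listed column is present and non-null in the row.
    """
    result = {}
    for name in NUMERIC_COLUMNS:
        # Keep numeric fields numeric instead of casting them to str.
        result[name] = float(bq_row[name])
    for name in STRING_COLUMNS:
        # Categorical fields are kept as strings.
        result[name] = str(bq_row[name])
    return result


# Made-up example row, for illustration only.
example_row = {
    "gestation_weeks": 39,
    "key": 1,
    "mother_age": 28,
    "plurality": 1,
    "weight_pounds": 7.5,
    "is_male": True,
    "mother_race": "1",
}
print(prep_bq_row(example_row))
```

The original version cast every value with `str(...)`, which would make all features look like BYTES to tf.Transform. Casting per column keeps the numeric features as floats.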

# Babyweight Data Preprocessing with tf.Transform

### Install required packages

In [1]:
%%bash
pip install --upgrade --ignore-installed -r ../../pipelines/01-babyweight/requirements.txt

Collecting pyarrow==2.0.0
  Downloading pyarrow-2.0.0-cp37-cp37m-manylinux2014_x86_64.whl (17.7 MB)
Collecting apache-beam[gcp]==2.28.0
  Downloading apache_beam-2.28.0-cp37-cp37m-manylinux2010_x86_64.whl (9.0 MB)
Collecting tensorflow==2.4.0
  Downloading tensorflow-2.4.0-cp37-cp37m-manylinux2010_x86_64.whl (394.7 MB)
Collecting tfx==0.28.0
  Downloading tfx-0.28.0-py3-none-any.whl (2.3 MB)
Collecting tfx-bsl==0.28.1
  Downloading tfx_bsl-0.28.1-cp37-cp37m-manylinux2010_x86_64.whl (2.2 MB)
Collecting ml-metadata==0.28.0
  Downloading ml_metadata-0.28.0-cp37-cp37m-manylinux2010_x86_64.whl (2.9 MB)
Collecting tensorflow-data-validation==0.28.0
  Downloading tensorflow_data_validation-0.28.0-cp37-cp37m-manylinux2010_x86_64.whl (1.3 MB)
Collecting tensorflow-metadata==0.28.0
  Downloading tensorflow_metadata-0.28.0-py3-none-any.whl (47 kB)
Collecting tensorflow-model-analysis==0.28.0
  Downloading tensorflow_model_analysis-0.28.0-py3-none-any.whl (1.7 MB)
Collecting tensorflow-serving-api

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda 4.9.2 requires ruamel_yaml>=0.11.14, which is not installed.
tensorflow-probability 0.11.0 requires cloudpickle==1.3, but you have cloudpickle 1.6.0 which is incompatible.
tensorflow-io 0.15.0 requires tensorflow<2.4.0,>=2.3.0, but you have tensorflow 2.4.0 which is incompatible.
jupyterlab-git 0.11.0 requires nbdime<2.0.0,>=1.1.0, but you have nbdime 2.1.0 which is incompatible.
fairness-indicators 0.26.0 requires tensorflow!=2.0.*,!=2.1.*,!=2.2.*,!=2.4.*,<3,>=1.15.2, but you have tensorflow 2.4.0 which is incompatible.
fairness-indicators 0.26.0 requires tensorflow-data-validation<0.27,>=0.26, but you have tensorflow-data-validation 0.28.0 which is incompatible.
fairness-indicators 0.26.0 requires tensorflow-model-analysis<0.27,>=0.26, but you have tensorflow-model-analysis 0.28.0 which is incompatible.
ex

In [2]:
!pip list | grep 'tensorflow'
!pip list | grep 'beam'
!pip list | grep 'cloud-dataflow'

tensorflow                     2.4.0
tensorflow-cloud               0.1.13
tensorflow-data-validation     0.28.0
tensorflow-datasets            3.0.0
tensorflow-estimator           2.4.0
tensorflow-hub                 0.9.0
tensorflow-io                  0.15.0
tensorflow-metadata            0.28.0
tensorflow-model-analysis      0.28.0
tensorflow-probability         0.11.0
tensorflow-serving-api         2.4.0
tensorflow-transform           0.28.0
apache-beam                    2.28.0


### Set global flags

In [3]:
PROJECT ='mlteam-ml-specialization-2021' # change to your project_Id
BUCKET = 'mlteam-ml-specialization-2021-taxi' # change to your bucket name
REGION = 'europe-west1' # change to your region
ROOT_DIR = 'babyweight_tft' # directory where the output is stored locally or on GCS

RUN_LOCAL = False # if True, the DirectRunner is used, else DataflowRunner
DATA_SIZE = 10000 # number of records to be retrieved from BigQuery
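
The flags above typically end up as Apache Beam pipeline options. The sketch below shows one way `RUN_LOCAL` could select the runner. It is an assumption about how a launcher might be written, not the contents of main.py. The function name `build_pipeline_args`, the `tmp` and `staging` sub-directories, and the sample values are hypothetical. The option names themselves (`--runner`, `--project`, `--region`, `--job_name`, `--temp_location`, `--staging_location`, `--setup_file`) are standard Beam options.

```python
def build_pipeline_args(project, bucket, region, root_dir, run_local, job_name):
    """Return Beam command-line style options for a local or a Dataflow run."""
    if run_local:
        # DirectRunner executes the pipeline in the local process.
        return ["--runner=DirectRunner"]
    gcs_root = "gs://{}/{}".format(bucket, root_dir)
    return [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--region={}".format(region),
        "--job_name={}".format(job_name),
        # 'tmp' and 'staging' are hypothetical sub-directory names.
        "--temp_location={}/tmp".format(gcs_root),
        "--staging_location={}/staging".format(gcs_root),
        # setup.py declares the dependencies installed on the remote workers.
        "--setup_file=./setup.py",
    ]


# Placeholder values, for illustration only.
dataflow_args = build_pipeline_args(
    project="my-project",
    bucket="my-bucket",
    region="europe-west1",
    root_dir="babyweight_tft",
    run_local=False,
    job_name="preprocess-babyweight-demo",
)
print(dataflow_args)
```

A list like this can be passed to `apache_beam.options.pipeline_options.PipelineOptions(flags=...)` when constructing the pipeline.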

In [4]:
import os

os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['ROOT_DIR'] = ROOT_DIR
os.environ['RUN_LOCAL'] = str(RUN_LOCAL)

### Launch pipeline
NOTE: before launching the pipeline, edit **pipelines/01-babyweight/main.py** so that its global flags match the ones defined in this notebook.
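
As an alternative to editing the file by hand, a script launched from a `%%bash` cell inherits the environment variables exported earlier in this notebook, so it could read them directly. The following is a minimal sketch under that assumption. The helper name `read_flags` is hypothetical and main.py, as described here, does not do this. Note that `DATA_SIZE` is not exported by the notebook, so the sketch falls back to a default for it.

```python
import os


def read_flags(environ=None):
    """Read the notebook's global flags from environment variables."""
    environ = os.environ if environ is None else environ
    return {
        "PROJECT": environ["PROJECT"],
        "BUCKET": environ["BUCKET"],
        "REGION": environ["REGION"],
        "ROOT_DIR": environ["ROOT_DIR"],
        # Environment values are always strings, so 'False' must be parsed
        # explicitly: bool('False') would be True.
        "RUN_LOCAL": environ.get("RUN_LOCAL", "False").strip().lower() == "true",
        # DATA_SIZE is not exported by the notebook; 10000 mirrors its default.
        "DATA_SIZE": int(environ.get("DATA_SIZE", "10000")),
    }


# Placeholder values, for illustration only.
sample_env = {
    "PROJECT": "my-project",
    "BUCKET": "my-bucket",
    "REGION": "europe-west1",
    "ROOT_DIR": "babyweight_tft",
    "RUN_LOCAL": "False",
}
print(read_flags(sample_env))
```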

In [8]:
%%bash
cd ../../pipelines/01-babyweight
python3 main.py

feature {
  name: "gestation_weeks"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "is_male"
  type: BYTES
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "key"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "mother_age"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "mother_race"
  type: BYTES
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "plurality"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "weight_pounds"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}

Launching DataflowRunner job preprocess-babweight-data-tft-210408085211 ... hang on

Sample data size: 10000
Sink transformed data files location: gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transformed
Sink transform artefact location: gs://mlteam-ml-specialization-2021-taxi/babyweight_tf

2021-04-08 08:52:09.100169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Instructions for updating:
ColumnSchema is a deprecated, use from_feature_spec to create a `Schema`
Instructions for updating:
Schema is a deprecated, use schema_utils.schema_from_feature_spec to create a `Schema`
2021-04-08 08:52:12.466898: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-08 08:52:12.467307: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-04-08 08:52:12.467339: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-04-08 08:52:12.467366: I tensorflow/stream_executor/cuda/c

## Explore the produced artefacts 

In [9]:
%%bash

echo 'local run:' ${RUN_LOCAL}
echo 'directory:' ${ROOT_DIR}
echo ''

# RUN_LOCAL is exported as str(RUN_LOCAL), i.e. exactly 'True' or 'False'
echo 'transformed data:'
if [ "${RUN_LOCAL}" = "True" ]
then ls ${ROOT_DIR}/transformed
else gsutil ls gs://${BUCKET}/${ROOT_DIR}/transformed
fi
echo ''

echo 'transformed metadata:'
if [ "${RUN_LOCAL}" = "True" ]
then ls ${ROOT_DIR}/transform/transformed_metadata
else gsutil ls gs://${BUCKET}/${ROOT_DIR}/transform/transformed_metadata
fi
echo ''

echo 'transform artefact:'
if [ "${RUN_LOCAL}" = "True" ]
then ls ${ROOT_DIR}/transform/transform_fn
else gsutil ls gs://${BUCKET}/${ROOT_DIR}/transform/transform_fn
fi
echo ''

echo 'transform assets:'
if [ "${RUN_LOCAL}" = "True" ]
then ls ${ROOT_DIR}/transform/transform_fn/assets
else gsutil ls gs://${BUCKET}/${ROOT_DIR}/transform/transform_fn/assets
fi
echo ''

local run: False
directory: babyweight_tft

transformed data:
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transformed/eval-00000-of-00001.tfrecords
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transformed/train-00000-of-00003.tfrecords
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transformed/train-00001-of-00003.tfrecords
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transformed/train-00002-of-00003.tfrecords

transformed metadata:
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transform/transformed_metadata/
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transform/transformed_metadata/asset_map
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transform/transformed_metadata/schema.pbtxt

transform artefact:
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transform/transform_fn/
gs://mlteam-ml-specialization-2021-taxi/babyweight_tft/transform/transform_fn/saved_model.pb
gs://mlteam-ml-specialization-2021-taxi/babyweigh