# Data Preprocessing using TFX

Almost every machine learning application needs its data transformed into a more meaningful format. To help with these data processing tasks, TFX provides a component based on the `TensorFlow Transform` (TFT) library. It allows us to build transformation steps as TensorFlow graphs.

One of the major concerns in a production ML system is training/serving data skew. To address it, TFT builds a preprocessing graph that processes the data and preserves the values determined from the training data, such as the boundary values of the features. The same graph can then be used in the inference phase of the model, which ensures that inference data goes through the same preprocessing.

To achieve that, we can export the model together with the preprocessing graph and the parameters it used, and then serve that combination from the API server. This gives us the additional benefit of being able to analyze the inference input data. One example is identifying out-of-vocabulary words in the inputs to an NLP model.
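To make the out-of-vocabulary idea concrete, here is a minimal plain-Python sketch. It does not use the TFT API: `build_vocabulary` and `find_oov` are hypothetical helper names, and the token lists are made-up sample data. The point is only that the vocabulary is computed once from the training data and then reused unchanged on serving inputs.

```python
from collections import Counter


def build_vocabulary(training_tokens):
    """Analyze step: map each token seen in training to an integer index."""
    counts = Counter(training_tokens)
    # Assumption for this sketch: most frequent tokens get the smallest
    # indices, and ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}


def find_oov(tokens, vocabulary):
    """Serving step: return the tokens the saved vocabulary has never seen."""
    return [tok for tok in tokens if tok not in vocabulary]


# Made-up sample data, for illustration only.
vocabulary = build_vocabulary(["good", "movie", "good", "plot"])
oov_tokens = find_oov(["good", "soundtrack", "plot"], vocabulary)
print(oov_tokens)
```

Because the vocabulary is stored rather than recomputed, any token reported here is genuinely new to the model, which is the kind of signal we may want to monitor in production.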

### Data Preprocessing with TFT

TFT processes the data we ingested into our pipeline, using the data schema generated earlier, and outputs two artifacts.

1. Preprocessed training and evaluation datasets in TFRecord format.
2. An exported preprocessing graph, which we can use when we export our ML model.

TFT expects us to define a function named `preprocessing_fn`, which receives the raw data, applies the transformations and returns the processed data. Note that all transformations applied to the data must be TensorFlow operations. This allows TFT to distribute the processing effectively during execution.

<center>

`pip install tensorflow-transform`
</center>


Once the package is installed, we can define the transformations we need in `preprocessing_fn`, as mentioned earlier. An example preprocessing function is shown below.

In [1]:
import tensorflow_transform as tft
import tensorflow as tf

In [2]:
def preprocessing_fn(inputs):

    x = inputs['x']
    x_normalized = tft.scale_to_0_1(x)
    return { 'x_xf': x_normalized }

A function of this type receives a batch of inputs as a Python dictionary, with the feature names as keys and the raw data (as tensors) as values. It is expected to return the transformed features as a dictionary as well.

Some other important points about TFT preprocessing are listed below.

- **Feature naming**: As these features will get used by the components down the pipeline it is important to keep track of the naming of the transformed features.

- **Data types**: It is important to cast the data to the proper data types after transformations (string, int32/64, float32/64).

- **Preprocessing happens in batches**: We need to keep this in mind when applying transformations.

- **No eager execution**: This means we can only use TensorFlow graph operations.
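The batching point is easy to underestimate, so here is a small plain-Python sketch of why it matters. It is not TFT code: the local `scale_to_0_1` helper only mimics the idea of min-max scaling, and the numbers are invented. If each batch computed its own minimum and maximum, the same raw value would be scaled differently depending on the batch it landed in. Computing the bounds once over the full dataset avoids that.

```python
def scale_to_0_1(values, lo, hi):
    """Min-max scale values using externally supplied bounds."""
    return [(v - lo) / (hi - lo) for v in values]


# Invented sample data, split into two batches.
dataset = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
batches = [dataset[:3], dataset[3:]]

# Naive approach: every batch uses its own min and max, so the results
# are not comparable across batches.
per_batch = [scale_to_0_1(b, min(b), max(b)) for b in batches]

# Full-pass approach: compute the bounds once over the whole dataset,
# then apply them batch by batch.
lo, hi = min(dataset), max(dataset)
full_pass = [scale_to_0_1(b, lo, hi) for b in batches]

print(per_batch)
print(full_pass)
```

In the naive version both 3.0 and 9.0 end up as 1.0, while the full-pass version keeps them apart.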

Since the transformations need to run as TensorFlow graph operations, TFT provides a considerable number of ready-made transformations for typical ML tasks. Below are some examples.

* `tft.scale_to_z_score`: Normalizes a feature to a mean of 0 and a standard deviation of 1.

* `tft.bucketize`: Separates a feature into bins.

* `tft.pca`: Computes the principal components of a feature.

* `tft.compute_and_apply_vocabulary`: Computes the unique items in a feature column and maps each one to an integer index.
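As a rough illustration of what the first two transformations compute, here is a plain-Python sketch using only the standard library. It is not the TFT implementation. The helper names and sample values are made up, the z-score uses the population standard deviation as an assumption, and the bucket boundaries are passed in by hand here.

```python
import bisect
import statistics


def z_score(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation (divide by N).
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def apply_buckets(values, boundaries):
    """Return, for each value, the index of the bin it falls into.

    A value equal to a boundary goes into the upper bin.
    """
    return [bisect.bisect_right(boundaries, v) for v in values]


# Invented sample data.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_score(values)
buckets = apply_buckets(values, boundaries=[3.0, 6.0])

print(normalized)
print(buckets)
```

Both helpers need statistics of the whole feature (mean and standard deviation, or the bin boundaries) before any single value can be transformed.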

Below are some NLP-specific functions.

* `tft.ngrams`: Calculates the n-grams of the given string data.

* `tft.bag_of_words`: Calculates the bag of words.

* `tft.tfidf`: Calculates the TF-IDF of string data.
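For readers new to these terms, the sketch below shows n-grams and a bag of words in plain Python. The two functions are hypothetical stand-ins written for this illustration, not the TFT functions themselves. The real ones work on tensors of tokens, and their exact output ordering may differ.

```python
def ngrams(tokens, ngram_range=(1, 2), separator=" "):
    """Return all n-grams whose length lies within ngram_range (inclusive)."""
    shortest, longest = ngram_range
    result = []
    for start in range(len(tokens)):
        for n in range(shortest, longest + 1):
            if start + n <= len(tokens):
                result.append(separator.join(tokens[start:start + n]))
    return result


def bag_of_words(tokens, ngram_range=(1, 1), separator=" "):
    """Return the unique n-grams, keeping the order of first appearance."""
    return list(dict.fromkeys(ngrams(tokens, ngram_range, separator)))


# Invented sample sentence, already split into tokens.
tokens = ["the", "cat", "the", "hat"]
print(ngrams(tokens))
print(bag_of_words(tokens))
```

The bag of words drops the repeated "the", while the n-gram list keeps every occurrence together with its neighbours.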

The [TensorFlow Text](https://github.com/tensorflow/text#installation) package also provides more support for text processing, including text normalization, tokenization and support for language models like BERT.

For computer vision problems, TensorFlow provides many options via the `tf.image` and `tf.io` modules for image decoding, resizing, adjusting, converting and so on.

Below is an example of reading a TFRecord-encoded image and preprocessing it.

In [3]:
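def preprocessing_fn(raw_images):
    # raw_images: a batch of JPEG-encoded byte strings

    def decode_and_resize(raw_image):
        # tf.io.decode_jpeg expects a single (scalar) encoded image
        img_rgb = tf.io.decode_jpeg(raw_image, channels=3)
        img_gray = tf.image.rgb_to_grayscale(img_rgb)
        img = tf.image.convert_image_dtype(img_gray, tf.float32)
        return tf.image.resize_with_pad(img, target_height=300, target_width=300)

    raw_image_flatten = tf.reshape(raw_images, [-1])
    # Decode each image separately, then stack the results into one batch
    images = tf.map_fn(decode_and_resize, raw_image_flatten,
                       fn_output_signature=tf.float32)

    return tf.reshape(images, [-1, 300, 300, 1])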

The preprocessing function can be executed in two ways: using TFT as a standalone package, or as part of a TFX pipeline.

Below is a sample standalone usage of TFT.

In [4]:
import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_data = [
            {'x': 1.20},
            {'x': 2.99},
            {'x': 100.00},
            ]

First we need to define the metadata for the dataset.

In [5]:
raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'x': tf.io.FixedLenFeature([], tf.float32),
    }))

Now that we have the data and the metadata, we can execute the transformation. To do that we can use the Apache Beam bindings that TFT provides, through the `AnalyzeAndTransformDataset` function. This function performs a two-step process: it first analyzes the dataset with its metadata to calculate the necessary values, and then applies the transformation. We can define other configuration, such as the batch size, as well.
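The two-step idea can be pictured in plain Python before looking at the Beam version. The sketch below is only an analogy. `analyze` and `transform` are hypothetical helpers, not TFT functions, and they hard-code min-max scaling for a single feature.

```python
raw_data = [
    {"x": 1.20},
    {"x": 2.99},
    {"x": 4.00},
]


def analyze(rows, feature):
    """Step 1: make a full pass over the data and collect the statistics."""
    values = [row[feature] for row in rows]
    return {"min": min(values), "max": max(values)}


def transform(rows, feature, stats):
    """Step 2: apply the scaling row by row using the stored statistics."""
    span = stats["max"] - stats["min"]
    return [
        {feature + "_xf": (row[feature] - stats["min"]) / span}
        for row in rows
    ]


stats = analyze(raw_data, "x")
transformed = transform(raw_data, "x", stats)
for row in transformed:
    print(row)

# The stored statistics can be reused later on data that was never analyzed.
serving_row = transform([{"x": 2.0}], "x", stats)
```

For this sample the scaled values come out at roughly 0.0, 0.64 and 1.0. Keeping `stats` separate from `transform` is what lets the same scaling be applied again at serving time.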

In [19]:
%%writefile scripts/tft_transform_standalone.py

import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft  # needed for tft.scale_to_0_1 below
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_data = [
            {'x': 1.20},
            {'x': 2.99},
            {'x': 4.00},
            ]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'x': tf.io.FixedLenFeature([], tf.float32),
    }))

def preprocessing_fn(inputs):
    x = inputs['x']
    x_normalized = tft.scale_to_0_1(x)

    return { 'x_xf': x_normalized }

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset
pprint.pprint(transformed_data)

Overwriting scripts/tft_transform_standalone.py


For some reason `tft_beam` does not work inside a Jupyter notebook (I could not find out why!), so I executed it as a Python file. More details about this usage can be found in the [documentation](https://www.tensorflow.org/tfx/tutorials/transform/simple).

Instead of running the transformation as a standalone function, we can also integrate it into ML pipelines. This is not complex, as we only need to define `preprocessing_fn` with the necessary transformations. We do need to keep the transformations properly structured, though (remember that the output should be a dictionary with the feature names as keys and the transformed data as values).
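One possible way to keep that structure under control is a small naming helper plus a sanity check on the output dictionary. The sketch below is plain Python and purely illustrative. `transformed_name` and `check_outputs` are hypothetical helpers, and the `_xf` suffix is simply the convention used in the examples above, not something TFT requires.

```python
def transformed_name(key, suffix="_xf"):
    """Derive the output name of a transformed feature (assumed convention)."""
    return key + suffix


def check_outputs(outputs, input_keys, suffix="_xf"):
    """Return a list of problems found in a preprocessing function's output."""
    if not isinstance(outputs, dict):
        return ["output must be a dictionary keyed by feature name"]
    expected = {transformed_name(key, suffix) for key in input_keys}
    problems = []
    for name in outputs:
        if name not in expected:
            problems.append(f"unexpected output feature: {name}")
    return problems


# Invented example: "y" was returned without the agreed suffix.
outputs = {transformed_name("x"): [0.0, 0.5, 1.0], "y": [1, 2, 3]}
problems = check_outputs(outputs, input_keys=["x", "y"])
print(problems)
```

A check like this can run in a unit test for the preprocessing module, so that a renamed feature is caught before a downstream component fails on it.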