## Stage 1: Installing dependencies and setting up the environment

In [0]:
!pip install tensorflow-transform

Collecting tensorflow-transform
  Downloading https://files.pythonhosted.org/packages/dd/b2/eb6b34eedf6f61003e292975a033b4ebc1b1be8537a9b5094d2a6944f229/tensorflow-transform-0.13.0.tar.gz (173kB)
Collecting apache-beam[gcp]<3,>=2.11 (from tensorflow-transform)
  Downloading https://files.pythonhosted.org/packages/81/7f/1cbf2e0967a11971a55ee56d61641ab77ecc5acdc4f66cb4948e878e0c65/apache_beam-2.13.0-cp27-cp27mu-manylinux1_x86_64.whl (2.7MB)
Collecting pydot<1.3,>=1.2.0 (from tensorflow-transform)
  Downloading https://files.pythonhosted.org/packages/c3/f1/e61d6dfe6c1768ed2529761a68f70939e2569da043e9f15a8d84bf56cadf/pydot-1.2.4.tar.gz (132kB)
Collecting avro<2.0.0,>=1.8.1; python_version < "3.0" (from apache-beam[gcp]<3,>=2.11->tensorflow-transform)
  Downloading https://files.pythonhosted.o

## Stage 2: Import project dependencies

In [0]:
# __future__ imports must come before any other statement in a module.
from __future__ import print_function

import tempfile

import pandas as pd
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, dataset_schema

## Stage 3: Dataset preprocessing

### Loading the Pollution dataset

In [0]:
dataset = pd.read_csv("pollution_small.csv")

IOError: ignored

In [0]:
dataset.head()

### Dropping the Date column

In [0]:
features = dataset.drop("Date", axis=1)

In [0]:
features.head()

### Converting the dataset from a DataFrame to a list of Python dictionaries

In [0]:
dict_features = list(features.to_dict("index").values())

In [0]:
dict_features[:2]
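
The conversion above can be checked on a tiny DataFrame. The sketch below is only an illustration: the numeric values are the first two raw rows printed later in this notebook, while the `Date` strings and the variable names (`sample`, `via_index`, `via_records`) are placeholders invented for the example. It also shows that `to_dict("records")` should produce the same list of dictionaries in one call.

```python
import pandas as pd

# Two rows with the raw values printed later in this notebook.
# The "Date" strings are placeholders, not values from the real CSV.
sample = pd.DataFrame({
    "Date": ["day-1", "day-2"],
    "no2": [14.1, 14.1],
    "so2": [44.38, 29.75],
    "pm10": [98.67, 52.33],
    "soot": [34.81, 33.06],
})
sample_features = sample.drop("Date", axis=1)

# The approach used in this notebook: one dict per index label, then keep the values.
via_index = list(sample_features.to_dict("index").values())

# An equivalent one-call alternative: one dict per row.
via_records = sample_features.to_dict("records")

print(via_index[0])
print(via_index == via_records)
```

Either form gives one dictionary per row, keyed by column name, which is the shape the transform step in Stage 5 consumes.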

### Defining the dataset metadata

In [0]:
data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        "no2": tf.FixedLenFeature([], tf.float32),
        "so2": tf.FixedLenFeature([], tf.float32),
        "pm10": tf.FixedLenFeature([], tf.float32),
        "soot": tf.FixedLenFeature([], tf.float32),
    })
)

In [0]:
data_metadata

## Stage 4: The preprocessing function

In [0]:
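def preprocessing_fn(inputs):
    # Each input is a tensor holding one column of the dataset.
    no2 = inputs['no2']
    pm10 = inputs['pm10']
    so2 = inputs['so2']
    soot = inputs['soot']

    # Center no2 and so2 around zero by subtracting the mean over the full dataset.
    no2_normalized = no2 - tft.mean(no2)
    so2_normalized = so2 - tft.mean(so2)

    # Rescale pm10 and soot to the [0, 1] range. scale_by_min_max defaults to
    # output_min=0 and output_max=1, so both calls scale the same way.
    pm10_normalized = tft.scale_to_0_1(pm10)
    soot_normalized = tft.scale_by_min_max(soot)

    return {
        "no2_normalized": no2_normalized,
        "so2_normalized": so2_normalized,
        "pm10_normalized": pm10_normalized,
        "soot_normalized": soot_normalized,
    }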
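
To make the arithmetic concrete, here is a minimal plain-Python sketch of the two operations used in `preprocessing_fn`. It is not how TensorFlow Transform implements them (the real analyzers run over the whole dataset through Beam and handle edge cases such as a constant column). The helper names `center` and `min_max_scale` and the sample values are made up for illustration.

```python
def center(values):
    """Subtract the column mean from every value (like `x - tft.mean(x)`)."""
    mean = sum(values) / float(len(values))
    return [v - mean for v in values]


def min_max_scale(values):
    """Map values linearly onto [0, 1] (the idea behind `tft.scale_to_0_1`).

    Assumes the column is not constant, otherwise this divides by zero.
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column values, chosen so the result is easy to check by hand.
column = [10.0, 20.0, 30.0]

centered = center(column)
scaled = min_max_scale(column)

print(centered)  # [-10.0, 0.0, 10.0]
print(scaled)    # [0.0, 0.5, 1.0]
```

Both operations need a statistic of the entire column (the mean, or the minimum and maximum) before any single row can be transformed, which is why the analysis has to pass over the full dataset first.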

## Stage 5: Putting everything together

TensorFlow Transform uses **Apache Beam** in the background to perform scalable data transforms. In the function below we use Beam's direct runner, which executes the pipeline locally.

Arguments to provide to the runner:

    dict_features - Our dataset, converted into a list of Python dictionaries.
    data_metadata - The metadata we created for the dataset.
    preprocessing_fn - The main preprocessing function. It defines the preprocessing operation for each column.


Apache Beam uses a special pipe (`|`) syntax to stack operations and invoke transforms on data:

```
result = data_to_pass | where_to_pass_the_data
```

Let's break down our case:

**result**  -> `transformed_dataset, transform_fn`

**data_to_pass** -> `(dict_features, data_metadata)`

**where_to_pass_the_data** -> `tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)` 

```
transformed_dataset, transform_fn = ((dict_features, data_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
```
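
The `|` form works on a plain tuple because the object on the right-hand side implements Python's reflected "or" operator, `__ror__` (Beam's `PTransform` defines it). The sketch below imitates that mechanism in plain Python. `ToyTransform` and `center_rows` are hypothetical names invented here and are not part of Apache Beam or TensorFlow Transform.

```python
class ToyTransform(object):
    """Hypothetical stand-in for a Beam transform; not part of Apache Beam."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, left):
        # Python calls this for `left | self` when `left` (here a tuple)
        # does not implement `|` itself.
        return self.fn(left)


def center_rows(data_and_label):
    """Toy 'analyze and transform' step: subtract the mean from each value."""
    rows, label = data_and_label
    mean = sum(rows) / float(len(rows))
    return [r - mean for r in rows], label


# Same shape as `(dict_features, data_metadata) | SomeTransform(...)`.
result = ([1.0, 2.0, 3.0], "toy-metadata") | ToyTransform(center_rows)

print(result)  # ([-1.0, 0.0, 1.0], 'toy-metadata')
```

Read left to right, the expression means "take this data and pass it into that transform", which is why several transforms can be chained with further `|` operators.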

If you want to learn more about the syntax, we recommend this link: 
https://beam.apache.org/documentation/programming-guide/#applying-transforms

LINKS:
> more about Apache Beam: https://beam.apache.org/ 

In [0]:
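def data_transform():
    # The Beam context needs a temporary directory where tf.Transform
    # writes its intermediate files.
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        transformed_dataset, transform_fn = (
            (dict_features, data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
        )

    # The transformed dataset is a (data, metadata) pair.
    transformed_data, transformed_metadata = transformed_dataset

    # Print each raw row next to its transformed counterpart.
    for raw_row, transformed_row in zip(dict_features, transformed_data):
        print("Raw: ", raw_row)
        print("Transformed:", transformed_row)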

In [0]:
data_transform()

Instructions for updating:
Use tf.cast instead.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /tmp/tmpmKEV30/tftransform_tmp/2f30aa15810948ca8b3cbaeffd43fa53/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /tmp/tmpmKEV30/tftransform_tmp/e4407da90c6b4ed2a16ca0078ba4c5c5/saved_model.pb
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /tmp/tmpmKEV30/tftransform_tmp/4f7c06a389444e749832be3893f638c1/saved_model.pb
INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Raw:  {'so2': 44.38, 'no2': 14.1, 'pm10': 98.67, 'soot': 34.81}
Transformed: {u'no2_normalized': -18.577978, u'soot_normalized': 0.2834235, u'pm10_normalized': 0.34071696, u'so2_normalized': 28.855408}
Raw:  {'so2': 29.75, 'no2': 14.1, 'pm10': 52.33, 'soot': 33.06}
Transformed: {u'no2_normalized': -18.577978, u'soot_normalized': 0.26620758, u'pm10_normalized': 0.16963857, u'so2_normalized': 14.225407}
Raw:  {'so2': 36.25, 'no2': 20.5, 'pm10': 74.67, 'soot': 39.25}
Transformed: {u'no2_normalized': -12.1779785, u'soot_normalized': 0.32710278, u'pm10_normalized': 0.25211355, u'so2_normalized': 20.725407}
Raw:  {'so2': 46.44, 'no2': 17.3, 'pm10': 72.0, 'soot': 34.38}
Transformed: {u'no2_normalized': -15.377979, u'soot_normalized': 0.2791933, u'pm10_normalized': 0.24225645, u'so2_normalized': 30.915405}
Raw:  {'so2': 56.56, 'no2': 25.64, 'pm10': 81.0, 'soot': 45.59}
Transformed: {u'no2_normalized': -7.037979, u'soot_normalized': 0.38947365, u'pm10_normalized': 0.2754827, u'so2_normalized': 