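# Trying out TFT (TensorFlow Transform)
> This notebook adds feature engineering to the Estimator created in `estimator3_train_and_evaluate_(1).ipynb`. The preprocessing is kept simple: the coordinate features are bucketized and the passenger count is scaled.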


## About TFT
> TFT works in two broad phases:
>
> 1. Analyze phase
> 2. Transform phase
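
To make the two phases concrete, here is a minimal pure-Python sketch (it does not use TFT itself). The helper names `analyze` and `transform` are hypothetical. The point is only that the analyze phase needs a full pass over the dataset to compute statistics, while the transform phase applies those fixed statistics one row at a time.

```python
def analyze(rows, column):
    """Analyze phase: one full pass over the data to collect statistics."""
    values = [row[column] for row in rows]
    return {"min": min(values), "max": max(values)}


def transform(row, column, stats):
    """Transform phase: apply the precomputed statistics to a single row."""
    span = stats["max"] - stats["min"]
    scaled = (row[column] - stats["min"]) / span if span else 0.0
    return {**row, column: scaled}


# Made-up rows for illustration only.
rows = [{"passengers": 1.0}, {"passengers": 3.0}, {"passengers": 5.0}]
stats = analyze(rows, "passengers")
scaled_rows = [transform(row, "passengers", stats) for row in rows]
print(scaled_rows)
```

Because the statistics are computed once and then frozen, the same `stats` can be applied to unseen data without another pass over the training set.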

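### The actual steps

1. Define the schema of the training dataset.
2. Use a PTransform that analyzes and then transforms the data; it returns the preprocessed data together with the transform function.
    1. Read the data with Beam functions.
    2. Filter out data that is not needed for training.
    3. Pass the raw data, plus metadata containing the schema defined above, through `AnalyzeAndTransformDataset(preprocess func)`.
3. Write the preprocessed training data out as TFRecords.
 
`Note`: the preprocess func must be written with TensorFlow (and TensorFlow Transform) functions.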
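
The filtering step (2.2) is not exercised in the code in this notebook, so here is one possible reading of it as a hypothetical illustration. The predicate `is_valid_row` and its rule (keep rows with a positive fare and at least one passenger) are assumptions, not part of the original notebook. In a Beam pipeline, a predicate like this could be passed to `beam.Filter`.

```python
def is_valid_row(row):
    """Hypothetical rule: keep rows that look usable for training."""
    return row["fare_amount"] > 0 and row["passengers"] >= 1


# Made-up rows for illustration only.
rows = [
    {"fare_amount": 12.0, "passengers": 1.0},
    {"fare_amount": -3.0, "passengers": 2.0},  # negative fare: dropped
    {"fare_amount": 4.5, "passengers": 0.0},   # no passengers: dropped
]
kept = [row for row in rows if is_valid_row(row)]
print(kept)
```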

## Writing the code

In [14]:
# import libraries
import os
import os.path
import tempfile
import tensorflow as tf
import apache_beam as beam
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam

from apache_beam.io import tfrecordio
from tensorflow_transform.coders import ExampleProtoCoder
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.beam.tft_beam_io import transform_fn_io

### First, define the metadata

In [15]:
NUMERIC_FEATURE_NAMES = ['pickuplon','pickuplat', 'dropofflon','dropofflat','passengers']
TARGET_FEATURE_NAME = 'fare_amount'
KEY_COLUMN = 'key'

def create_raw_metadata():  
    
    raw_data_schema = {}
    
    # key feature schema
    raw_data_schema[KEY_COLUMN]= dataset_schema.ColumnSchema(
        tf.float32, [], dataset_schema.FixedColumnRepresentation())
    
    # target feature schema
    raw_data_schema[TARGET_FEATURE_NAME]= dataset_schema.ColumnSchema(
        tf.float32, [], dataset_schema.FixedColumnRepresentation())
        
    # numerical features schema
    raw_data_schema.update({ column_name : dataset_schema.ColumnSchema(
        tf.float32, [], dataset_schema.FixedColumnRepresentation())
                            for column_name in NUMERIC_FEATURE_NAMES})
    
    # create dataset_metadata given raw_schema
    raw_metadata = dataset_metadata.DatasetMetadata(
        dataset_schema.Schema(raw_data_schema))
    
    return raw_metadata



The data can also be read from CSV. (In this case the first row does not contain the column names, so they must be given explicitly.)
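
As a plain-Python illustration of reading a header-less CSV, the sketch below passes the column names explicitly to `csv.DictReader`. The column order is taken from the `ordered_columns` list in the Beam cell of this notebook; the two sample rows are made up and only stand in for the real file.

```python
import csv
import io

# Column order assumed to match the header-less CSV file.
ORDERED_COLUMNS = [
    "fare_amount", "pickuplon", "pickuplat", "dropofflon",
    "dropofflat", "passengers", "key",
]

# Made-up sample rows standing in for the real file.
sample_csv = io.StringIO(
    "12.0,-73.98,40.75,-73.99,40.73,1,0\n"
    "4.5,-73.97,40.76,-73.96,40.77,2,1\n"
)

# With fieldnames given, the first row is treated as data, not as a header.
reader = csv.DictReader(sample_csv, fieldnames=ORDERED_COLUMNS)
rows = [{name: float(value) for name, value in row.items()} for row in reader]
print(rows[0]["fare_amount"], rows[1]["key"])
```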

The schema can also be defined as follows, using `from_feature_spec` instead of `ColumnSchema`:

NUMERIC_FEATURE_NAMES = ['pickuplon','pickuplat', 'dropofflon','dropofflat','passengers']
TARGET_FEATURE_NAME = 'fare_amount'
KEY_COLUMN = 'key'

def create_raw_metadata():  
    
    raw_data_schema = {}
    
    # key feature schema
    raw_data_schema[KEY_COLUMN] = tf.io.FixedLenFeature([], tf.float32)
    
    # target feature schema
    raw_data_schema[TARGET_FEATURE_NAME] = tf.io.FixedLenFeature([], tf.float32)
        
    # numerical features schema
    raw_data_schema.update({column_name: tf.io.FixedLenFeature([], tf.float32)
                            for column_name in NUMERIC_FEATURE_NAMES})
    
    # create dataset_metadata given raw_schema
    raw_metadata = dataset_metadata.DatasetMetadata(
        dataset_schema.from_feature_spec(raw_data_schema))
    
    return raw_metadata



In [16]:
create_raw_metadata()

W0806 14:06:32.122081 4459820480 deprecation.py:323] From <ipython-input-15-a953fbc5c3e9>:11: ColumnSchema (from tensorflow_transform.tf_metadata.dataset_schema) is deprecated and will be removed in a future version.
Instructions for updating:
ColumnSchema is a deprecated, use from_feature_spec to create a `Schema`
W0806 14:06:32.171721 4459820480 deprecation_wrapper.py:119] From /Users/baito/Works/TFX/.venv/lib/python3.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py:107: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

W0806 14:06:32.184382 4459820480 deprecation_wrapper.py:119] From /Users/baito/Works/TFX/.venv/lib/python3.7/site-packages/tensorflow_transform/tf_metadata/schema_utils.py:63: The name tf.SparseFeature is deprecated. Please use tf.io.SparseFeature instead.



{'_schema': Schema(feature {
  name: "dropofflat"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "dropofflon"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "fare_amount"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "key"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "passengers"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "pickuplat"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "pickuplon"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
)}

### Define the preprocessing fn

In [17]:
def preprocessing_fn(inputs):
    # Copy the inputs first: some features are modified while the others pass through unchanged.
    outputs = inputs.copy()
    
    # bucketize
    outputs['pickuplon'] = tft.bucketize(inputs['pickuplon'], num_buckets=5)
    outputs['pickuplat'] = tft.bucketize(inputs['pickuplat'], num_buckets=5)
    outputs['dropofflon'] = tft.bucketize(inputs['dropofflon'], num_buckets=5)
    outputs['dropofflat'] = tft.bucketize(inputs['dropofflat'], num_buckets=5)
    
    # Scaling
    outputs['passengers'] = tft.scale_to_0_1(inputs['passengers'])
    
    return outputs
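
To show roughly what `tft.bucketize` produces, the following pure-Python sketch computes quantile boundaries in an analyze step and then maps each value to a bucket index. This is a simplification under stated assumptions: TFT computes approximate quantiles, whereas the hypothetical helpers `quantile_boundaries` and `assign_bucket` below use exact order statistics, so the boundaries may differ from what TFT would choose.

```python
import bisect


def quantile_boundaries(values, num_buckets):
    """Analyze step: pick num_buckets - 1 cut points from the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[k * n // num_buckets] for k in range(1, num_buckets)]


def assign_bucket(value, boundaries):
    """Transform step: bucket index = number of boundaries <= value."""
    return bisect.bisect_right(boundaries, value)


# Made-up values for illustration only.
values = [float(v) for v in range(10)]
boundaries = quantile_boundaries(values, num_buckets=5)
buckets = [assign_bucket(v, boundaries) for v in values]
print(boundaries)
print(buckets)
```

With `num_buckets=5` the indices fall in the range 0 to 4, which is consistent with the integer values seen when the transformed TFRecords are read back later in this notebook.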

### Write the Beam pipeline
> The TFT steps are wrapped inside an ordinary Beam pipeline.

In [19]:

with beam.Pipeline() as p:  # add pipeline options here if needed
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        # Csv coder
        ordered_columns = [
            'fare_amount', 'pickuplon','pickuplat','dropofflon',
            'dropofflat','passengers', 'key'
        ]
        raw_metadata = create_raw_metadata()
        converter = tft.coders.CsvCoder(ordered_columns, raw_metadata.schema)
        
        # Read the training data
        raw_data = (
            p
            | "ReadTrainData" >> beam.io.ReadFromText('./data/taxi-train.csv')
            | "DecodeTrainCsv" >> beam.Map(converter.decode)
        )
        raw_dataset = (raw_data, raw_metadata)
        
        ## Analyze the training data and apply the preprocessing function
        transformed_dataset, transform_fn = (
            raw_dataset | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
        )
        transformed_data, transformed_metadata = transformed_dataset
        
        ## Save the preprocessed training data as TFRecords
        transformed_data_coder = ExampleProtoCoder(transformed_metadata.schema)
        _ = (
            transformed_data
            | "EncodeTrainData" >> beam.Map(transformed_data_coder.encode)
            | "WriteTrainData" >> beam.io.WriteToTFRecord(
                './tfrecord/train_transformed'
            )
        )
        
        # Read the test data
        raw_test_data = (
            p
            | "ReadTestData" >> beam.io.ReadFromText('./data/taxi-test.csv')
            | "DecodeTestCsv" >> beam.Map(converter.decode)
        )
        raw_test_dataset = (raw_test_data, raw_metadata)
        
        ## Apply the transform function obtained from the training data
        transformed_test_dataset = (
            (raw_test_dataset, transform_fn) | tft_beam.TransformDataset()
        )
        transform_test_data, _ = transformed_test_dataset
        
        ## Save the preprocessed test data as TFRecords
        _ = (
            transform_test_data
            | "EncodeTestData" >> beam.Map(transformed_data_coder.encode)
            | "WriteTestData" >> beam.io.WriteToTFRecord('./tfrecord/test_transformed')
        )
        
        # Save the transform function
        _ = (
            transform_fn
            | "WriteTransformFn" >> tft_beam.WriteTransformFn("./transform_fn")
        )

W0806 14:07:44.162734 4459820480 tfrecordio.py:57] Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.


#### Read back the transformed TFRecords and check them

In [20]:
def _parse_function(example_proto):
    features={name : tf.FixedLenFeature([], tf.float32) for name in [
            'fare_amount','passengers', 'key'
        ]}
    features.update({
        name: tf.FixedLenFeature([], tf.int64) for name in [
            'pickuplon','pickuplat','dropofflon',
            'dropofflat']
    })
    parsed_features = tf.parse_single_example(example_proto, features)
    return parsed_features

dataset = tf.data.TFRecordDataset(['./tfrecord/train_transformed-00000-of-00001'])
dataset = dataset.map(_parse_function)

In [21]:
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

In [22]:
with tf.Session() as sess:
    print(sess.run(next_element))
    print(sess.run(next_element))

{'dropofflat': 4, 'dropofflon': 3, 'fare_amount': 12.0, 'key': 0.0, 'passengers': 0.0, 'pickuplat': 2, 'pickuplon': 1}
{'dropofflat': 4, 'dropofflon': 4, 'fare_amount': 4.5, 'key': 1.0, 'passengers': 0.0, 'pickuplat': 4, 'pickuplon': 4}


The output looks fine.
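
The read-back above hard-codes a single shard name. Beam writes sharded output named `<prefix>-SSSSS-of-NNNNN` by default, so with a different runner or larger data there may be several files. A minimal sketch, assuming that naming pattern, collects all shards with `glob`. The helper `list_shards` is hypothetical, and the temporary directory with empty files only stands in for real pipeline output.

```python
import glob
import os
import tempfile


def list_shards(prefix):
    """Return all files matching the '<prefix>-SSSSS-of-NNNNN' naming pattern."""
    return sorted(glob.glob(prefix + "-*-of-*"))


# Stand-in for pipeline output: two empty files named like Beam shards.
out_dir = tempfile.mkdtemp()
prefix = os.path.join(out_dir, "train_transformed")
for shard in ("00000-of-00002", "00001-of-00002"):
    open(prefix + "-" + shard, "w").close()

shards = list_shards(prefix)
print([os.path.basename(path) for path in shards])
```

The resulting list can be passed to `tf.data.TFRecordDataset` in place of the single hard-coded path.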

In [28]:
def create_feature_eval_columns(transformed_metadata):
    feature_columns = []
    
    column_schemas = transformed_metadata.schema.as_feature_spec()
    
    for feature_name in column_schemas:
        column_schema = column_schemas[feature_name]
        
        if column_schema.dtype == tf.float32:
            feature_columns.append(tf.feature_column.numeric_column(feature_name))
        elif column_schema.dtype == tf.int64:
            feature_columns.append(
                tf.feature_column.categorical_column_with_identity(
                    feature_name, num_buckets=5+1))
    return feature_columns

In [31]:
def _parse_func2(example_proto):
    feat_eval_col = create_feature_eval_columns(transformed_metadata)
    feature_spec =  tf.feature_column.make_parse_example_spec(
      feat_eval_col)
    features = tf.io.parse_single_example(example_proto, feature_spec)
    return features

In [32]:
from tensorflow_transform.tf_metadata import metadata_io
TRANSFORM_ARTEFACTS_DIR = 'transform_fn'
transformed_metadata = metadata_io.read_metadata(
        os.path.join(TRANSFORM_ARTEFACTS_DIR,"transformed_metadata"))

In [33]:
dataset = tf.data.TFRecordDataset(['./tfrecord/train_transformed-00000-of-00001'])
dataset = dataset.map(_parse_func2)

In [34]:
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

In [35]:
with tf.Session() as sess:
    print(sess.run(next_element))
    print(sess.run(next_element))

{'dropofflat': SparseTensorValue(indices=array([[0]]), values=array([4]), dense_shape=array([1])), 'dropofflon': SparseTensorValue(indices=array([[0]]), values=array([3]), dense_shape=array([1])), 'pickuplat': SparseTensorValue(indices=array([[0]]), values=array([2]), dense_shape=array([1])), 'pickuplon': SparseTensorValue(indices=array([[0]]), values=array([1]), dense_shape=array([1])), 'fare_amount': array([12.], dtype=float32), 'key': array([0.], dtype=float32), 'passengers': array([0.], dtype=float32)}
{'dropofflat': SparseTensorValue(indices=array([[0]]), values=array([4]), dense_shape=array([1])), 'dropofflon': SparseTensorValue(indices=array([[0]]), values=array([4]), dense_shape=array([1])), 'pickuplat': SparseTensorValue(indices=array([[0]]), values=array([4]), dense_shape=array([1])), 'pickuplon': SparseTensorValue(indices=array([[0]]), values=array([4]), dense_shape=array([1])), 'fare_amount': array([4.5], dtype=float32), 'key': array([1.], dtype=float32), 'passengers': arra