# TFDV
A walkthrough of what TensorFlow Data Validation (TFDV) can do.

This follows https://www.tensorflow.org/tfx/data_validation/get_started#checking_for_errors_on_a_per-example_basis as a reference.

TFDV can:
* compute descriptive statistics
* infer a schema
* detect data anomalies

## Computing descriptive data statistics
Basic summary statistics can be computed with a single call.

In [17]:
import tensorflow_data_validation as tfdv

In [40]:
stats = tfdv.generate_statistics_from_tfrecord('./tfrecord/train_transformed-00000-of-00001')

In [3]:
tfdv.visualize_statistics(stats)

Statistics can also be generated from a CSV file. (Here the first row is not a header, so the column names have to be passed explicitly.)

In [4]:
stats2 = tfdv.generate_statistics_from_csv('./data/taxi-train.csv', column_names=['dropofflon','dropofflat','passengers', 'fare_amount', 'pickuplon','pickuplat', 'key'])

In [5]:
tfdv.visualize_statistics(stats2)
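
To make the role of `column_names` concrete, here is a minimal sketch in plain Python (not TFDV itself) that reads a header-less CSV with explicitly supplied column names and computes a few descriptive statistics per column. The sample data, `SAMPLE_CSV`, and the `describe` helper are hypothetical and only stand in for a real file.

```python
import csv
import io
import statistics

# Hypothetical header-less CSV content standing in for a real file.
SAMPLE_CSV = "1.0,2.5\n3.0,\n5.0,4.5\n"
# The data has no header row, so the names are supplied explicitly.
COLUMN_NAMES = ["fare_amount", "passengers"]


def describe(csv_text, column_names):
    """Return count, missing count, mean, min and max for each column."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=column_names)
    values = {name: [] for name in column_names}
    missing = {name: 0 for name in column_names}
    for row in reader:
        for name in column_names:
            cell = row[name]
            if cell is None or cell == "":
                missing[name] += 1
            else:
                values[name].append(float(cell))
    return {
        name: {
            "count": len(col),
            "missing": missing[name],
            "mean": statistics.mean(col) if col else None,
            "min": min(col) if col else None,
            "max": max(col) if col else None,
        }
        for name, col in values.items()
    }


summary = describe(SAMPLE_CSV, COLUMN_NAMES)
print(summary["fare_amount"])
print(summary["passengers"])
```

If the names were omitted, the first data row would be consumed as a header, which is why they are passed explicitly for a file like this.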

## Inferring a schema over the data
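> A schema describing the properties the features are expected to have can be created. The properties include:
* which features are expected to be present
* their type
* the number of values for a feature in each example
* the presence of each feature across all examples
* the expected domain of each feature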

In short, the schema describes what correct data looks like. It is used to detect errors, and the same schema can also be used by TFT (TensorFlow Transform).
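
As a rough illustration of what "inferring a schema" means, the sketch below derives a type and a presence flag for each feature from a handful of examples. It is plain Python under simplifying assumptions, not how TFDV works internally (TFDV infers from aggregated statistics), and the names `infer_type`, `infer_toy_schema`, and the sample examples are hypothetical.

```python
def infer_type(values):
    """Pick the narrowest type name that fits every observed value."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "INT"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "FLOAT"
    return "BYTES"


def infer_toy_schema(examples):
    """Infer a type and a presence flag for each feature seen in the examples."""
    names = []
    for example in examples:
        for name in example:
            if name not in names:
                names.append(name)
    schema = {}
    for name in names:
        values = [ex[name] for ex in examples if name in ex]
        schema[name] = {
            "type": infer_type(values),
            # A feature seen in every example is treated as required.
            "presence": "required" if len(values) == len(examples) else "optional",
        }
    return schema


# Hypothetical examples, not taken from the taxi dataset.
examples = [
    {"passengers": 1, "fare_amount": 7.5},
    {"passengers": 2, "fare_amount": 12.0, "note": "airport"},
]
toy_schema = infer_toy_schema(examples)
for name, spec in toy_schema.items():
    print(name, spec)
```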
   

In [41]:
# Infer a schema from the statistics
schema = tfdv.infer_schema(stats)

In [8]:
schema

feature {
  name: "passengers"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "fare_amount"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplat"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflat"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "key"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplon"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflon"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1

In [9]:
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'passengers' | FLOAT | required | | - |
| 'fare_amount' | FLOAT | required | | - |
| 'pickuplat' | INT | required | | - |
| 'dropofflat' | INT | required | | - |
| 'key' | FLOAT | required | | - |
| 'pickuplon' | INT | required | | - |
| 'dropofflon' | INT | required | | - |


In [10]:
tfdv.get_feature(schema, 'dropofflat').presence.min_fraction = 0.5

In [11]:
schema

feature {
  name: "passengers"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "fare_amount"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplat"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflat"
  type: INT
  presence {
    min_fraction: 0.5
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "key"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplon"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflon"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1

The `presence.min_fraction` value of `dropofflat` has changed from 1.0 to 0.5.
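
What a lowered `min_fraction` permits can be sketched as follows. This is a plain-Python illustration assuming that `min_fraction` is the minimum share of examples that must contain the feature. The helper names and sample values are hypothetical.

```python
def presence_fraction(examples, feature_name):
    """Share of examples in which the feature has a value."""
    if not examples:
        return 0.0
    present = sum(1 for ex in examples if ex.get(feature_name) is not None)
    return present / len(examples)


def satisfies_min_fraction(examples, feature_name, min_fraction):
    """True when the feature is present in at least min_fraction of the examples."""
    return presence_fraction(examples, feature_name) >= min_fraction


# Hypothetical examples: dropofflat is missing in half of them.
examples = [{"dropofflat": 40}, {"dropofflat": 41}, {"dropofflat": None}, {}]

fraction = presence_fraction(examples, "dropofflat")
print(fraction)
print(satisfies_min_fraction(examples, "dropofflat", 1.0))  # original requirement
print(satisfies_min_fraction(examples, "dropofflat", 0.5))  # relaxed requirement
```

With the original value of 1.0 a single missing `dropofflat` would be flagged, while 0.5 tolerates it being absent in up to half of the examples.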

## Checking the data for errors 
TFDV can check whether a dataset matches expectations and report where it deviates from them.

### Matching the statistics of the dataset against a schema
Statistics from a different dataset are compared against the schema to detect whether that data meets the requirements.

In [12]:
other_stats = tfdv.generate_statistics_from_tfrecord('./tfrecord/test_transformed-00000-of-00001')

In [13]:
anomalies = tfdv.validate_statistics(statistics=other_stats, schema=schema)
tfdv.display_anomalies(anomalies)

No anomalous features are reported. As a test, let's run the same check on the CSV statistics, which are certain to produce errors.

In [14]:
anomalies2 = tfdv.validate_statistics(statistics=stats2, schema=schema)
tfdv.display_anomalies(anomalies2)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'pickuplon' | Expected data of type: INT but got FLOAT | |
| 'dropofflat' | Expected data of type: INT but got FLOAT | |
| 'dropofflon' | Expected data of type: INT but got FLOAT | |
| 'key' | Expected data of type: FLOAT but got INT | |


## Checking for errors on a per-example basis
TFDV can also detect anomalies in individual examples.
At the time of writing, however, the call below fails with an error on both Python 2 and Python 3.

In [18]:
options = tfdv.StatsOptions(schema=schema)
anomalous_example_stats = tfdv.validate_tfexamples_in_tfrecord(
   data_location='./tfrecord/test_transformed-00000-of-00001', stats_options=options) 
tfdv.display_anomalies(anomalous_example_stats)

AttributeError: 'module' object has no attribute 'validate_tfexamples_in_tfrecord'
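
Since the call above could not be run, here is a plain-Python sketch of the idea only: each example is checked against the expected types, and the ones that violate them are collected. It does not use or reproduce TFDV. `EXPECTED_TYPES`, `find_anomalous_examples`, and the sample data are hypothetical.

```python
# Hypothetical schema: feature name -> expected Python type.
EXPECTED_TYPES = {"passengers": float, "fare_amount": float}


def find_anomalous_examples(examples, expected_types):
    """Return (index, reasons) for every example that violates the schema."""
    anomalous = []
    for index, example in enumerate(examples):
        reasons = []
        for name, expected in expected_types.items():
            if name not in example:
                reasons.append(f"{name}: missing")
            elif not isinstance(example[name], expected):
                actual = type(example[name]).__name__
                reasons.append(f"{name}: expected {expected.__name__}, got {actual}")
        if reasons:
            anomalous.append((index, reasons))
    return anomalous


# Hypothetical examples: the second has a wrong type, the third lacks a feature.
examples = [
    {"passengers": 1.0, "fare_amount": 7.5},
    {"passengers": 2, "fare_amount": 12.0},
    {"passengers": 1.0},
]
anomalous = find_anomalous_examples(examples, EXPECTED_TYPES)
for index, reasons in anomalous:
    print(index, reasons)
```

Unlike the statistics-level check, this reports which individual examples are off, not just that the dataset as a whole deviates.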

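## Schema environments
Datasets within a pipeline generally share a single schema, but in some cases more than one variant of the schema is needed.  
A typical example is a label that is used during training but not during serving. Such differences can be expressed by assigning environments in the schema.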

In [21]:
%%bash
# Create a CSV with the label column removed
cat ./data/taxi-test.csv | cut -d "," -f 2-7 > ./data/serving.csv 

In [28]:
# Training-time statistics and schema
train_stats = tfdv.generate_statistics_from_csv('./data/taxi-valid.csv', column_names=['fare_amount','dropofflon','dropofflat', 'pickuplon','pickuplat','passengers','key'])
schema2 = tfdv.infer_schema(train_stats)

In [29]:
# Serving-time statistics
serving_stats = tfdv.generate_statistics_from_csv('./data/serving.csv', column_names=['dropofflon','dropofflat', 'pickuplon','pickuplat','passengers','key'])
serving_anomalies = tfdv.validate_statistics(serving_stats, schema2)
tfdv.display_anomalies(serving_anomalies)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'fare_amount' | Column dropped | Column is completely missing |


The check confirms that `fare_amount` is missing at serving time. The schema can be configured so that this feature is allowed to be absent when serving.

In [30]:
schema2.default_environment.append('TRAINING')
schema2.default_environment.append('SERVING')

In [33]:
schema2

feature {
  name: "passengers"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "fare_amount"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplat"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflat"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "key"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplon"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflon"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      siz

In [36]:
# Allow fare_amount to be absent at serving time
tfdv.get_feature(schema2, 'fare_amount').not_in_environment.append('SERVING')
schema2

feature {
  name: "passengers"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "fare_amount"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  not_in_environment: "SERVING"
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplat"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflat"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "key"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplon"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflon"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
 

In [38]:
serving_anomalies_with_env = tfdv.validate_statistics(
    serving_stats, schema2, environment='SERVING'
)

In [39]:
tfdv.display_anomalies(serving_anomalies_with_env)

The missing `fare_amount` is no longer reported as an anomaly.
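
The effect of `not_in_environment` can be sketched in plain Python: a feature is only expected in environments it has not been excluded from. The dictionary layout and the helpers `expected_features` and `missing_features` are hypothetical and only mimic the behaviour seen above.

```python
def expected_features(schema, environment=None):
    """Feature names that are expected in the given environment."""
    return [
        name
        for name, spec in schema.items()
        if environment is None
        or environment not in spec.get("not_in_environment", [])
    ]


def missing_features(schema, observed_columns, environment=None):
    """Expected features that do not appear among the observed columns."""
    observed = set(observed_columns)
    return [
        name
        for name in expected_features(schema, environment)
        if name not in observed
    ]


# Hypothetical schema: fare_amount (the label) is excluded from SERVING.
toy_schema = {
    "fare_amount": {"not_in_environment": ["SERVING"]},
    "passengers": {},
    "key": {},
}
serving_columns = ["passengers", "key"]

print(missing_features(toy_schema, serving_columns))
print(missing_features(toy_schema, serving_columns, environment="SERVING"))
```

Without an environment the label is reported as missing. With `environment="SERVING"` it is skipped, which matches the empty anomaly output above.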

## Checking data skew and drift

In [45]:
# Check for skew
serving_stats = tfdv.generate_statistics_from_tfrecord('./tfrecord/test_transformed-00000-of-00001')

tfdv.get_feature(schema, 'passengers').skew_comparator.infinity_norm.threshold = 0.01
schema

feature {
  name: "passengers"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  skew_comparator {
    infinity_norm {
      threshold: 0.01
    }
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "fare_amount"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplat"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflat"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "key"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplon"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflon"
  type: INT
  presence {
    

In [43]:
skew_anomalies = tfdv.validate_statistics(
    statistics=stats, schema=schema, serving_statistics=serving_stats)

In [44]:
tfdv.display_anomalies(skew_anomalies)
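
For reference, the `infinity_norm` threshold refers to the L-infinity distance between two value distributions: the largest absolute difference in relative frequency over all values. The sketch below computes it in plain Python for a categorical feature. The helper names and sample values are hypothetical, and as far as I understand the TFDV documentation this comparator targets categorical features, so this is not meant to reproduce the `passengers` result above.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(a_values, b_values):
    """Largest absolute difference in relative frequency over all values."""
    a = relative_frequencies(a_values)
    b = relative_frequencies(b_values)
    return max(
        (abs(a.get(key, 0.0) - b.get(key, 0.0)) for key in set(a) | set(b)),
        default=0.0,
    )


# Hypothetical categorical feature in training and serving data.
training = ["cash", "cash", "card", "card"]
serving = ["cash", "card", "card", "card"]

THRESHOLD = 0.01  # same value as the threshold set in the schema above
distance = l_infinity_distance(training, serving)
print(distance)
print(distance > THRESHOLD)  # True would be reported as skew
```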

In [51]:
# Check for drift
tfdv.get_feature(schema, 'passengers').drift_comparator.infinity_norm.threshold = 0.01
schema

feature {
  name: "passengers"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  skew_comparator {
    infinity_norm {
      threshold: 0.01
    }
  }
  drift_comparator {
    infinity_norm {
      threshold: 0.01
    }
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "fare_amount"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplat"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dropofflat"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "key"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "pickuplon"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1

In [52]:
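# Drift is evaluated against previous_statistics; serving_statistics only drives the skew check.
# serving_stats stands in for an earlier data span here.
drift_anomalies = tfdv.validate_statistics(
    statistics=stats, schema=schema, previous_statistics=serving_stats)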

In [53]:
tfdv.display_anomalies(drift_anomalies)

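## Writing a custom data connector
Input formats other than the built-in ones such as CSV and TFRecord can also be supported with a custom implementation.
This does, however, require writing the connector yourself against the low-level API.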