## Design pattern: hashed feature
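Context
- Addresses three problems with categorical features:
    1. Incomplete vocabulary: a feature may take a category value that never appeared in the training data.
    2. Model size due to cardinality: the number of distinct categories may be too large.
    3. Cold start: closely related to problem 1, since new categories show up after the model has been trained.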

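Implementation
- Group the categorical values into buckets by hashing, and accept a representation in which different categories may collide in the same bucket.
    - [tf.feature_column.categorical_column_with_hash_bucket](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket)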
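The bucketing idea above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper `hash_bucket` is hypothetical, and MD5 is used here simply as a deterministic hash, not as the fingerprint function TensorFlow uses internally.

```python
import hashlib


def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a category string to a bucket index in [0, num_buckets)."""
    # Assumption: MD5 stands in for a deterministic fingerprint function.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


vocabulary = ["Tensorflow", "Keras", "RNN", "LSTM", "CNN"]
buckets = {word: hash_bucket(word, num_buckets=10) for word in vocabulary}

# A category that was never seen in training still gets a valid bucket,
# which is how hashing sidesteps the incomplete-vocabulary problem.
unseen_bucket = hash_bucket("Transformer", num_buckets=10)

print(buckets)
print(unseen_bucket)
```

Because the mapping depends only on the string and the bucket count, no vocabulary file has to be stored or kept up to date.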

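Advantages
- Solves the cold-start problem

Disadvantages
- Loses some model accuracy, because categories that collide in the same bucket can no longer be told apart.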
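The accuracy trade-off comes from collisions, and how often they happen depends on the number of buckets. The following is a rough sketch for counting them. The category names are made up, and `hash_bucket` and `count_collided` are hypothetical helpers that use MD5 as a stand-in hash.

```python
import hashlib
from collections import Counter


def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a category string to a bucket index in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def count_collided(categories, num_buckets):
    """Count categories that share a bucket with at least one other category."""
    counts = Counter(hash_bucket(c, num_buckets) for c in categories)
    return sum(n for n in counts.values() if n > 1)


# Hypothetical category names, used only for illustration.
categories = [f"category_{i}" for i in range(1000)]

for num_buckets in (10, 100, 1000, 10000):
    print(num_buckets, count_collided(categories, num_buckets))
```

Fewer buckets keep the model small but force more categories to share a representation, so the bucket count is a tuning knob between model size and accuracy.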


In [1]:
# bigquery

In [2]:
import tensorflow as tf

In [8]:
features = {
    'words': 
        tf.constant([
            ['Tensorflow', 'Keras', 'RNN', 'LSTM','CNN'], 
            ['LSTM', 'CNN', 'Tensorflow', 'Keras', 'RNN'], 
            ['CNN', 'Tensorflow','LSTM', 'Keras', 'RNN']])
}

# Create the hashed categorical column
words = tf.feature_column.categorical_column_with_hash_bucket(
    key='words', hash_bucket_size=10, dtype=tf.string
)  # key matches the 'words' entry in features above

# Wrap the hashed column in a 16-dimensional embedding
words_embedded = tf.feature_column.embedding_column(categorical_column=words, dimension=16)
columns = [words_embedded]

# Turn the feature dict into a dense tensor (one row per example)
input_layer = tf.keras.layers.DenseFeatures(columns)
dense_tensor = input_layer(features)

In [9]:
dense_tensor

<tf.Tensor: shape=(3, 16), dtype=float32, numpy=
array([[ 0.18836786, -0.10930021, -0.00112247, -0.3538489 , -0.1907592 ,
         0.02508214, -0.03321335, -0.01071842, -0.05055372,  0.16551732,
        -0.04368504,  0.19830208, -0.08976967,  0.21544245,  0.05605598,
        -0.012444  ],
       [ 0.18836786, -0.10930023, -0.00112247, -0.3538489 , -0.19075921,
         0.02508214, -0.03321335, -0.01071842, -0.05055372,  0.16551732,
        -0.04368504,  0.19830206, -0.08976966,  0.21544245,  0.05605597,
        -0.012444  ],
       [ 0.18836786, -0.10930023, -0.00112247, -0.3538489 , -0.19075921,
         0.02508214, -0.03321335, -0.01071842, -0.05055372,  0.16551732,
        -0.04368504,  0.19830206, -0.08976966,  0.21544245,  0.05605597,
        -0.012444  ]], dtype=float32)>
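The three output rows are almost identical, differing only in the last digit of a few entries. Each example contains the same five words in a different order, and `embedding_column` combines the per-word embeddings with `combiner='mean'` by default, which does not depend on order. A minimal pure-Python sketch of that effect, using a made-up embedding table and a hypothetical `mean_embedding` helper:

```python
# Hypothetical 2-dimensional embeddings, one per bucket id (values are made up).
embedding_table = {
    0: [0.1, -0.2],
    1: [0.4, 0.3],
    2: [-0.5, 0.6],
}


def mean_embedding(bucket_ids):
    """Average the embedding vectors of the given bucket ids."""
    vectors = [embedding_table[b] for b in bucket_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


row_a = mean_embedding([0, 1, 2])
row_b = mean_embedding([2, 0, 1])  # same ids, different order

print(row_a)
print(row_b)
```

The two rows agree up to floating-point rounding, which is consistent with the small last-digit differences in the tensor above.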