In [1]:
import tensorflow as tf

# Sample Reading and Feature Input
Raw feature values -> feature names -> actual representation of the feature values


## Data Processing
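1. Raw feature values -> feature names

input_fn: takes the serialized feature values (protos) and the feature names (features), and parses them, according to the feature names and the input feature specs, into a dict {feature_name: Tensor/SparseTensor}.

2. Feature names -> actual representation of the feature values

feature_column: produces the final representation of each feature that the model consumes (the representation of categorical features is trainable).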

## Input Layer
InputLayer = DenseFeatures(feature_column): each feature_column looks up its value in feature_dict by feature_name; passing that value through the feature_column yields the representation that enters the model.


# Parsing tf_record
Parse the serialized tf_record samples into the corresponding Tensor or SparseTensor, **following the `xxFeature` mapping specified by features**.
```python
feature_dict = tf.io.parse_example(proto, features)
# The result has the following form:
#   feature_dict['playtime']     -> Tensor
#   feature_dict['featureField'] -> SparseTensor
```
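A minimal runnable sketch of the call above. The helper `make_example` and the sample values are made up for illustration; only the feature names `playtime` and `featureField` come from the snippet.

```python
import tensorflow as tf

def make_example(playtime, field_ids):
    # Hypothetical helper: packs one sample into a serialized tf.train.Example.
    return tf.train.Example(features=tf.train.Features(feature={
        'playtime': tf.train.Feature(float_list=tf.train.FloatList(value=[playtime])),
        'featureField': tf.train.Feature(int64_list=tf.train.Int64List(value=field_ids)),
    })).SerializeToString()

# A batch of two serialized samples; the second has no featureField ids.
serialized = tf.constant([make_example(12.5, [3, 7]), make_example(0.0, [])])

features = {
    'playtime': tf.io.FixedLenFeature((), tf.float32, 0.0),
    'featureField': tf.io.VarLenFeature(tf.int64),
}
feature_dict = tf.io.parse_example(serialized, features)

print(type(feature_dict['playtime']).__name__)      # a dense Tensor type
print(type(feature_dict['featureField']).__name__)  # SparseTensor
```

The fixed-length feature comes back as a dense Tensor with one entry per sample, while the variable-length feature comes back as a SparseTensor that only stores the ids that are present.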

## Format of serialized/protos
```python
proto = serialized = [
  features
    { feature { key: "key" value { float_list { value: [1.0, 2.0] } } } },
  features
    { feature []}]
```
```python
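## check tf_record
import json
from google.protobuf.json_format import MessageToJson

example = tf.train.Example()  # reusable message that each record is parsed into
with open("features.json", 'w') as f:
    # unzip_file: path to the uncompressed TFRecord file (defined elsewhere)
    for serialized_example in tf.compat.v1.python_io.tf_record_iterator(unzip_file):
        example.ParseFromString(serialized_example)
        f.write(MessageToJson(example))
        try:
            # json.loads parses the JSON string without executing it, unlike eval
            print(json.loads(MessageToJson(example))['features']['feature']['playtime'])
            break
        except KeyError:
            # 'playtime' is missing from this record
            print('loss')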
```
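The same check can be sketched with the TF2-style `tf.data.TFRecordDataset` reader. To keep the example self-contained it first writes a one-record file to a temporary directory; the file name and the `playtime` value are assumptions made for illustration.

```python
import os
import tempfile

import tensorflow as tf
from google.protobuf.json_format import MessageToDict

# Write a tiny TFRecord file so that the example does not depend on external data.
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
example = tf.train.Example(features=tf.train.Features(feature={
    "playtime": tf.train.Feature(float_list=tf.train.FloatList(value=[12.5])),
}))
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Read the first record back and inspect it as a plain Python dict.
for raw in tf.data.TFRecordDataset(path).take(1):
    decoded = tf.train.Example.FromString(raw.numpy())
    as_dict = MessageToDict(decoded)
    playtime = as_dict["features"]["feature"]["playtime"]
    print(playtime)
```

`MessageToDict` uses camelCase field names by default, so the float values appear under the `floatList` key.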

## Format of features
features is a predefined dict of the form {feature_name: feature_config}; each value specifies whether the feature is a VarLenFeature, SparseFeature, FixedLenFeature, FixedLenSequenceFeature, or RaggedFeature.
```python
# Build the feature dict keyed by the custom feature names
features={
    ## featureFieldxx: variable-length features, all of type tf.int64
    ## densexxxx: fixed-length features, all of type tf.float32
    'dense3442':tf.io.FixedLenFeature((), tf.float32, 0.0),
    't_rseat_list': tf.io.VarLenFeature(tf.string),
    'featureField28': tf.io.VarLenFeature(tf.int64)
}
```

### FixedLenFeature
#### Fixed-length single value: Tensor
Set a default for missing values and configure a fixed shape.
```python
  [
    features {
      feature { key: "age" value { int64_list { value: [ 0 ] } } }
      feature { key: "gender" value { bytes_list { value: [ "f" ] } } }
     },
     features {
      feature { key: "age" value { int64_list { value: [] } } }
      feature { key: "gender" value { bytes_list { value: [ "f" ] } } }
    }]
  ```
```python
  features: {
      "age": FixedLenFeature([], dtype=tf.int64, default_value=-1),
      "gender": FixedLenFeature([], dtype=tf.string),
  }
  ```
  And the expected output is a dictionary of dense Tensors:
  ```python
  {
    "age": [0, -1],
    "gender": ["f", "f"],
  }
```
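A runnable sketch of the default-value behaviour. The helper `make_example` is hypothetical, and the second sample omits `age` altogether (a missing key), which is the case where `default_value` is certain to apply. With `shape=[]`, each parsed tensor has shape `[batch_size]`.

```python
import tensorflow as tf

def make_example(gender, age=None):
    # Hypothetical helper: 'age' is left out of the proto when age is None.
    feature = {'gender': tf.train.Feature(bytes_list=tf.train.BytesList(value=[gender]))}
    if age is not None:
        feature['age'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[age]))
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

serialized = tf.constant([make_example(b'f', age=0), make_example(b'f')])
parsed = tf.io.parse_example(serialized, {
    'age': tf.io.FixedLenFeature([], dtype=tf.int64, default_value=-1),
    'gender': tf.io.FixedLenFeature([], dtype=tf.string),
})

print(parsed['age'].numpy())     # the missing age is replaced by the default -1
print(parsed['gender'].numpy())
```

`gender` has no default, so parsing would fail if any sample omitted it.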

### VarLenFeature
#### Variable-length feature: SparseTensor
The values are placed at the positions given by indices.
```python
  [
    features {
      feature { key: "kw" value { bytes_list { value: [ "knit", "big" ] } } }
      feature { key: "gps" value { float_list { value: [] } } }
    },
    features {
      feature { key: "kw" value { bytes_list { value: [ "emmy" ] } } }
      feature { key: "dank" value { int64_list { value: [ 42 ] } } }
      feature { key: "gps" value { } }
    }
  ]
  ```
  And arguments
  ```python
  example_names: ["input0", "input1"],
  features: {
      "kw": VarLenFeature(tf.string),
      "dank": VarLenFeature(tf.int64),
      "gps": VarLenFeature(tf.float32),
  }
  ```
  Then the output is a dictionary of SparseTensor objects:
  ```python
  {
    "kw": SparseTensor(
        indices=[[0, 0], [0, 1], [1, 0]],
        values=["knit", "big", "emmy"],
        dense_shape=[2, 2]),
    "dank": SparseTensor(
        indices=[[1, 0]],
        values=[42],
        dense_shape=[2, 1]),
    "gps": SparseTensor(
        indices=[],
        values=[],
        dense_shape=[2, 0]),
  }
# Dense view of the same data (x marks a missing entry):
# kw:   [['knit', 'big'],  # example_1
#        ['emmy',  x   ]]  # example_2
# dank: [[x ],             # example_1
#        [42]]             # example_2
# gps:  [[],               # example_1 (no values)
#        []]               # example_2 (no values)
```
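The example above can be reproduced with a short sketch. The helpers `bytes_feature`, `int64_feature` and `serialize` are hypothetical, and `gps` is left out of both samples here, which also yields an empty SparseTensor for that key.

```python
import tensorflow as tf

def bytes_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def serialize(feature):
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

serialized = tf.constant([
    serialize({'kw': bytes_feature([b'knit', b'big'])}),
    serialize({'kw': bytes_feature([b'emmy']), 'dank': int64_feature([42])}),
])
parsed = tf.io.parse_example(serialized, {
    'kw': tf.io.VarLenFeature(tf.string),
    'dank': tf.io.VarLenFeature(tf.int64),
    'gps': tf.io.VarLenFeature(tf.float32),
})

# Densify for inspection; missing entries are filled with an explicit default.
kw_dense = tf.sparse.to_dense(parsed['kw'], default_value='')
dank_dense = tf.sparse.to_dense(parsed['dank'], default_value=-1)
print(kw_dense.numpy())
print(dank_dense.numpy())
print(parsed['gps'].dense_shape.numpy())
```

The second dimension of each `dense_shape` is the longest value list seen in the batch for that key, so it can change from batch to batch.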

### SparseFeature
```python
  [
    features {
      feature { key: "val" value { float_list { value: [ 0.5, -1.0 ] } } }
      feature { key: "ix" value { int64_list { value: [ 3, 20 ] } } }
    },
    features {
      feature { key: "val" value { float_list { value: [ 0.0 ] } } }
      feature { key: "ix" value { int64_list { value: [ 42 ] } } }
    }]
  ```
And arguments
```python
  example_names: ["input0", "input1"],
  features: {
      "sparse": SparseFeature(index_key="ix", value_key="val", dtype=tf.float32, size=100),
  }
  ```
  Then the output is a dictionary:
  ```python
  {
    "sparse": SparseTensor(
        indices=[[0, 3], [0, 20], [1, 42]],
        values=[0.5, -1.0, 0.0],
        dense_shape=[2, 100]),
  }
  ```
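A runnable sketch of the same SparseFeature example; the helper `serialize` is hypothetical. The `ix` list supplies the column index of each entry in `val`, and `size=100` fixes the width of the resulting SparseTensor.

```python
import tensorflow as tf

def serialize(values, indices):
    # Hypothetical helper: stores values and their positions as two parallel lists.
    return tf.train.Example(features=tf.train.Features(feature={
        'val': tf.train.Feature(float_list=tf.train.FloatList(value=values)),
        'ix': tf.train.Feature(int64_list=tf.train.Int64List(value=indices)),
    })).SerializeToString()

serialized = tf.constant([serialize([0.5, -1.0], [3, 20]), serialize([0.0], [42])])
parsed = tf.io.parse_example(serialized, {
    'sparse': tf.io.SparseFeature(index_key='ix', value_key='val',
                                  dtype=tf.float32, size=100),
})
sp = parsed['sparse']
print(sp.indices.numpy())
print(sp.values.numpy())
print(sp.dense_shape.numpy())
```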

### FixedLenSequenceFeature
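Sequence feature with fixed-size elements: FixedLenSequenceFeature. Within a batch, shorter sequences are padded with the default value, so the output has shape [batch_size, max_length].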
```python
# output:
{"ft": [[1.0, 2.0], [3.0, -1.0]]}
```
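A sketch that produces the output shown above, assuming two samples whose `ft` lists have lengths 2 and 1; the helper `serialize` is hypothetical. `allow_missing=True` is what lets sequences of different lengths be padded with `default_value`.

```python
import tensorflow as tf

def serialize(values):
    # Hypothetical helper: one float sequence per sample under the key 'ft'.
    return tf.train.Example(features=tf.train.Features(feature={
        'ft': tf.train.Feature(float_list=tf.train.FloatList(value=values)),
    })).SerializeToString()

serialized = tf.constant([serialize([1.0, 2.0]), serialize([3.0])])
parsed = tf.io.parse_example(serialized, {
    'ft': tf.io.FixedLenSequenceFeature([], dtype=tf.float32,
                                        allow_missing=True, default_value=-1.0),
})
print(parsed['ft'].numpy())  # the shorter sequence is padded with -1.0
```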


### RaggedFeature
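A minimal sketch of `tf.io.RaggedFeature`, which parses a variable-length feature into a RaggedTensor instead of a SparseTensor. The feature name `ids`, the helper `serialize` and the sample values are assumptions made for illustration.

```python
import tensorflow as tf

def serialize(ids):
    # Hypothetical helper: one variable-length id list per sample.
    return tf.train.Example(features=tf.train.Features(feature={
        'ids': tf.train.Feature(int64_list=tf.train.Int64List(value=ids)),
    })).SerializeToString()

serialized = tf.constant([serialize([3, 7]), serialize([5])])
parsed = tf.io.parse_example(serialized, {'ids': tf.io.RaggedFeature(tf.int64)})
ragged = parsed['ids']
print(ragged.to_list())  # one row per sample, each row keeps its own length
```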

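# Configuring FeatureColumn
Input to feature_column: with feature_name as the key, create the feature_column for each feature; different columns are routed to the wide part or the deep part of the model.

Class hierarchy: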
1. FeatureColumn -> DenseColumn,CategoricalColumn,SequenceDenseColumn

1.1 DenseColumn -> NumericColumn, BucketizedColumn, EmbeddingColumn, IndicatorColumn

1.2 CategoricalColumn -> HashedCategoricalColumn, WeightedCategoricalColumn, CrossedColumn


## dense_feature

### numeric_column
Represents real valued or numerical features

In [2]:
num_col = tf.feature_column.numeric_column(key='num_1',shape=(1,),default_value=None,dtype=tf.float32,normalizer_fn=None)
num_col

NumericColumn(key='num_1', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)

## sparse_feature

### bucketized_column
Represents discretized dense input bucketed by `boundaries`.
dense->bkt->sparse

In [3]:
bkt_col = tf.feature_column.bucketized_column(source_column=num_col, boundaries=list(range(4)))
bkt_col

BucketizedColumn(source_column=NumericColumn(key='num_1', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(0, 1, 2, 3))
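To make the dense -> bucket step concrete, this sketch reproduces the bucket assignment for the same boundaries with `tf.raw_ops.Bucketize`. Using this op as a stand-in for the column's internal transformation is an assumption, and the sample values are made up.

```python
import tensorflow as tf

values = tf.constant([-0.5, 0.1, 1.0, 2.5, 3.0])
# Boundaries (0, 1, 2, 3) define 5 buckets:
# (-inf, 0), [0, 1), [1, 2), [2, 3), [3, +inf)
bucket_ids = tf.raw_ops.Bucketize(input=values, boundaries=[0.0, 1.0, 2.0, 3.0])
print(bucket_ids.numpy())  # [0 1 2 3 4]
```

Each real value is replaced by an integer bucket id, which can then be treated like any other categorical id.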

### categorical_column_with_hash_bucket
Represents sparse feature where ids are set by hashing

In [4]:
categorical_hash_bkt = tf.feature_column.categorical_column_with_hash_bucket(key='cate_bkt_1',hash_bucket_size=4, dtype=tf.int64)
categorical_hash_bkt

HashedCategoricalColumn(key='cate_bkt_1', hash_bucket_size=4, dtype=tf.int64)
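A rough sketch of what hashing ids into 4 buckets looks like, using `tf.strings.to_hash_bucket_fast` on the string form of the ids. Treating this as equivalent to the column's internal hashing is an assumption, and the ids are made up; the exact bucket numbers are not important, only that they fall in `[0, hash_bucket_size)` and that equal ids share a bucket.

```python
import tensorflow as tf

ids = tf.constant([1000, 1002, 1000, 7], dtype=tf.int64)
# Integer ids are converted to strings before hashing.
buckets = tf.strings.to_hash_bucket_fast(tf.strings.as_string(ids), num_buckets=4)
print(buckets.numpy())
```

Because different ids can land in the same bucket, `hash_bucket_size` trades memory for collision rate.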

### crossed_column
Returns a column for performing crosses of categorical features

In [5]:
crossed_col = tf.feature_column.crossed_column(keys=['t1','t2'], hash_bucket_size=8, hash_key=None)
crossed_col

CrossedColumn(keys=('t1', 't2'), hash_bucket_size=8, hash_key=None)

## transform 2 emb

### embedding_column
**DenseColumn** that converts from sparse, categorical input.

In [6]:
emb = tf.feature_column.embedding_column(
    categorical_column=crossed_col,
    dimension = 8,combiner='mean',
    initializer=None,ckpt_to_load_from=None,tensor_name_in_ckpt=None,max_norm=None,trainable=True,use_safe_embedding_lookup=True)
emb

EmbeddingColumn(categorical_column=CrossedColumn(keys=('t1', 't2'), hash_bucket_size=8, hash_key=None), dimension=8, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x7f2f39d52410>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True, use_safe_embedding_lookup=True)
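The `combiner='mean'` setting can be illustrated with `tf.nn.safe_embedding_lookup_sparse`. This is a sketch under assumptions: the embedding table is a fixed 8 x 2 tensor rather than a trainable variable, and the sparse ids are made up.

```python
import tensorflow as tf

# Toy embedding table: row i is [2*i, 2*i + 1].
table = tf.reshape(tf.range(16, dtype=tf.float32), [8, 2])

# Three samples: sample 0 has id 1, sample 1 has no ids, sample 2 has ids 3 and 5.
sparse_ids = tf.sparse.SparseTensor(
    indices=[[0, 0], [2, 0], [2, 1]],
    values=tf.constant([1, 3, 5], dtype=tf.int64),
    dense_shape=[3, 2])

pooled = tf.nn.safe_embedding_lookup_sparse(table, sparse_ids, combiner='mean')
print(pooled.numpy())
# [[2. 3.]    row 1 of the table
#  [0. 0.]    no ids, so a zero vector
#  [8. 9.]]   mean of rows 3 and 5
```

Whatever the number of ids per sample, the result is one fixed-size dense vector per sample.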

### shared_embedding_columns_v2
List of dense columns that convert from sparse, categorical input.

In [None]:
# tf.compat.v1.disable_v2_behavior()
# shared_list =tf.feature_column.shared_embeddings(
#     categorical_columns=[categorical_hash_bkt], dimension=8, combiner='sqrtn',
#     initializer=tf.keras.initializers.VarianceScaling(distribution='uniform')
# )
# shared_list

# InputLayer of the Deep Part
## Testing Tensor/SparseTensor input
Generate a `dense Tensor` from the `feature_columns`; it serves as the first layer of the model, the input_layer.

In [7]:
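# Sequence ids: when a sequence is shorter than the fixed length, the engine sends -1 so that the dimensions stay consistent; -1 has a special meaning of missing feature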
features ={
    'num_1':tf.constant([[0.1],[0.2],[0.3]]), # numerical column
    't1': tf.constant([[1000,-1,-1], [-1,1002,-1],[1000,1002,-1]]), # sparse tensor
    't2': tf.constant([[2],[-1],[3]]) # sparse tensor
}
features

{'num_1': <tf.Tensor: shape=(3, 1), dtype=float32, numpy=
 array([[0.1],
        [0.2],
        [0.3]], dtype=float32)>,
 't1': <tf.Tensor: shape=(3, 3), dtype=int32, numpy=
 array([[1000,   -1,   -1],
        [  -1, 1002,   -1],
        [1000, 1002,   -1]], dtype=int32)>,
 't2': <tf.Tensor: shape=(3, 1), dtype=int32, numpy=
 array([[ 2],
        [-1],
        [ 3]], dtype=int32)>}

In [8]:
input_layer = tf.compat.v1.keras.layers.DenseFeatures([num_col,emb])
input_layer

<keras.feature_column.dense_features.DenseFeatures at 0x7f2f39d37a10>

In [9]:
net = input_layer(features, training=True)
net

<tf.Tensor: shape=(3, 9), dtype=float32, numpy=
array([[ 0.1       , -0.16142218, -0.39117107,  0.29652628,  0.2053033 ,
         0.37033555, -0.21723793,  0.19645822,  0.31968176],
       [ 0.2       ,  0.01981424, -0.23201554,  0.27135864,  0.2385637 ,
         0.45666438, -0.25182116, -0.37126732,  0.3487841 ],
       [ 0.3       ,  0.16646463,  0.44881603, -0.36725768,  0.14276004,
        -0.25586963, -0.15189463, -0.11481848,  0.34413606]],
      dtype=float32)>

## tf.compat.v1.keras.layers.DenseFeatures
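1. Build DenseFeatures from the FeatureColumn list of Section 3 (FeatureColumn configuration). Every column must be a subclass of DenseColumn (`numeric_column`, `embedding_column`, `bucketized_column`, `indicator_column`);
2. Build a FeatureTransformationCache from the feature_dict ({key: Tensor/SparseTensor}) produced by the input_fn parsing of Section 2 (tf_record parsing);
3. DenseFeatures.call iterates over all FeatureColumns and uses the cache to obtain each feature's transformed `dense Tensor`, which is then fed to the model; a dict {key: dense_tensor} can optionally be configured for lookup.

Specifically:
1. numeric: returns the dense tensor directly;
2. emb: first obtains the sparse_tensors, then converts them to dense_tensors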
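
The shape of the layer output can be sketched without the layer itself, assuming the per-column dense tensors are concatenated along the last axis. The zero tensor below is only a placeholder for the output of an 8-dimensional embedding column.

```python
import tensorflow as tf

# What a numeric column with shape (1,) yields for a batch of 3 samples.
numeric_part = tf.constant([[0.1], [0.2], [0.3]])
# Placeholder for an 8-dimensional embedding column output (values are illustrative).
embedding_part = tf.zeros([3, 8])

# One row per sample, 1 + 8 = 9 columns.
net = tf.concat([numeric_part, embedding_part], axis=1)
print(net.shape)  # (3, 9)
```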