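# Feature Columns
This document describes feature columns in detail. Think of feature columns as the intermediary between raw data and Estimators. Feature columns are rich: they let you convert a wide variety of raw data into formats that Estimators can use, which makes experimentation easy.

You specify a model's inputs through the feature_columns argument of an Estimator (for example, the DNNClassifier used for the Iris dataset). Feature columns bridge the input data (as returned by input_fn) and the model.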

# Numeric Columns

In [None]:
import tensorflow as tf

In [None]:
# Defaults to a tf.float32 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")

In [None]:
# Represent a tf.float64 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength",
                                                          dtype=tf.float64)

In [None]:
# Represent a 10-element vector in which each cell contains a tf.float32.
vector_feature_column = tf.feature_column.numeric_column(key="Bowling",
                                                         shape=10)

# Represent a 10x5 matrix in which each cell contains a tf.float32.
matrix_feature_column = tf.feature_column.numeric_column(key="MyMatrix",
                                                         shape=[10,5])

# Bucketized Columns
A bucketized column splits a numeric value into categories based on the numeric range it falls in.

In [None]:
# First, convert the raw input to a numeric column.
numeric_feature_column = tf.feature_column.numeric_column("Year")

# Then, bucketize the numeric column on the years 1960, 1980, and 2000.
bucketized_feature_column = tf.feature_column.bucketized_column(
    source_column = numeric_feature_column,
    boundaries = [1960, 1980, 2000])
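To make the bucketing rule concrete, the following pure-Python sketch mimics what the three boundaries above imply: four buckets, with a value equal to a boundary landing in the upper bucket. The helper names bucket_index and one_hot are illustrative and are not part of TensorFlow.

```python
import bisect

# Boundaries taken from the bucketized column above.
BOUNDARIES = [1960, 1980, 2000]

def bucket_index(year, boundaries=BOUNDARIES):
    """Return the bucket id; a value equal to a boundary goes to the upper bucket."""
    return bisect.bisect_right(boundaries, year)

def one_hot(index, depth):
    """Encode a bucket id as a one-hot list of length `depth`."""
    return [1 if i == index else 0 for i in range(depth)]

# Three boundaries produce four buckets.
NUM_BUCKETS = len(BOUNDARIES) + 1

for year in (1955, 1960, 1999, 2024):
    idx = bucket_index(year)
    print(year, idx, one_hot(idx, NUM_BUCKETS))
```

The model therefore sees a four-element vector per example instead of a single scalar year, so it can learn a separate weight for each era.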

# Categorical Identity Columns
A categorical identity column can be seen as a special case of a bucketized column.

In [None]:
# Create categorical output for an integer feature named "my_feature_b",
# The values of my_feature_b must be >= 0 and < num_buckets
identity_feature_column = tf.feature_column.categorical_column_with_identity(
    key='my_feature_b',
    num_buckets=4) # Values [0, 4)

# In order for the preceding call to work, the input_fn() must return
# a dictionary containing 'my_feature_b' as a key. Furthermore, the values
# assigned to 'my_feature_b' must belong to the set [0, 4).
def input_fn():
    ...
    return ({ 'my_feature_a':[7, 9, 5, 2], 'my_feature_b':[3, 1, 2, 2] },
            [Label_values])
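As a rough illustration of the identity mapping, the sketch below uses the integer value itself as the category id and one-hot encodes it. The function identity_one_hot is hypothetical, and rejecting out-of-range values with an exception is a simplification chosen for this sketch; the library's own handling of such values may differ.

```python
def identity_one_hot(value, num_buckets):
    """One-hot encode an integer whose value is already its category id."""
    if not 0 <= value < num_buckets:
        # Simplification for this sketch: reject anything outside [0, num_buckets).
        raise ValueError(f"{value} is outside [0, {num_buckets})")
    return [1 if i == value else 0 for i in range(num_buckets)]

# The 'my_feature_b' values from the input_fn above.
my_feature_b = [3, 1, 2, 2]
encoded = [identity_one_hot(v, num_buckets=4) for v in my_feature_b]
for value, row in zip(my_feature_b, encoded):
    print(value, row)
```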

# Categorical Vocabulary Columns
1. tf.feature_column.categorical_column_with_vocabulary_list

2. tf.feature_column.categorical_column_with_vocabulary_file

In [None]:
# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature by mapping the input to one of
# the elements in the vocabulary list.
vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature_name_from_input_fn,
        vocabulary_list=["kitchenware", "electronics", "sports"])

In [None]:
# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature to our model by mapping the input to one of
# the elements in the vocabulary file
vocabulary_feature_column = (
    tf.feature_column.categorical_column_with_vocabulary_file(
        key=feature_name_from_input_fn,
        vocabulary_file="product_class.txt",
        vocabulary_size=3))
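Both vocabulary functions boil down to a lookup from string to integer id. The sketch below shows that lookup in plain Python, assuming a vocabulary file with one entry per line and an id of -1 for strings that are not in the vocabulary. The helpers load_vocabulary and vocabulary_id are illustrative names, and the in-memory file stands in for product_class.txt.

```python
import io

VOCABULARY = ["kitchenware", "electronics", "sports"]

def build_lookup(words):
    """Map each vocabulary entry to its position in the list."""
    return {word: i for i, word in enumerate(words)}

def load_vocabulary(lines, vocabulary_size):
    """Read up to `vocabulary_size` entries, one per line (assumed file layout)."""
    words = [line.strip() for line in lines if line.strip()]
    return build_lookup(words[:vocabulary_size])

def vocabulary_id(value, lookup, default=-1):
    """Return the id of `value`, or `default` when it is out of vocabulary."""
    return lookup.get(value, default)

list_lookup = build_lookup(VOCABULARY)

# Stand-in for product_class.txt so the sketch needs no file on disk.
fake_file = io.StringIO("kitchenware\nelectronics\nsports\n")
file_lookup = load_vocabulary(fake_file, vocabulary_size=3)

print(vocabulary_id("electronics", list_lookup))
print(vocabulary_id("toys", list_lookup))
```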

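# Hashed Columns
So far, the examples we have worked with contain only a few categories. For example, the product_class example has just 3 categories. Often, though, the number of categories is so large that giving each vocabulary word or integer its own category would consume too much memory. In such cases we can turn the question around and ask: "How many categories am I willing to have for my input?" The tf.feature_column.categorical_column_with_hash_bucket function lets you specify that number. For this type of feature column, the model computes a hash of the input and then uses the modulo operator to place it in one of the hash_bucket_size categories.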

In [None]:
# pseudocode
feature_id = hash(raw_feature) % hash_bucket_size

In [None]:
hashed_feature_column = (
    tf.feature_column.categorical_column_with_hash_bucket(
        key="some_feature",
        hash_bucket_size=100))  # The number of categories
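The pseudocode above can be turned into a small runnable sketch. It uses MD5 from the standard library as a stand-in hash, because Python's built-in hash() for strings is salted per process; TensorFlow uses its own hashing function, so the actual bucket ids will differ. The category names are made up for illustration.

```python
import hashlib

def hash_bucket(raw_feature, hash_bucket_size):
    """Map any value to a bucket id in [0, hash_bucket_size)."""
    # Stand-in hash for this sketch: MD5 of the string form of the value.
    digest = hashlib.md5(str(raw_feature).encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Hypothetical categories: five inputs squeezed into three buckets.
categories = ["kitchenware", "electronics", "sports", "garden", "toys"]
bucket_ids = {name: hash_bucket(name, hash_bucket_size=3) for name in categories}
for name, bucket in bucket_ids.items():
    print(name, bucket)
```

With five categories and only three buckets, at least two categories must share a bucket. Choosing hash_bucket_size is therefore a trade-off between memory use and the number of unrelated inputs that end up in the same category.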

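# Crossed Columns
By combining multiple features into a single feature (known as a feature cross), the model can learn a separate weight for each combination of features.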

In [None]:
def make_dataset(latitude, longitude, labels):
    assert latitude.shape == longitude.shape == labels.shape

    features = {'latitude': latitude.flatten(),
                'longitude': longitude.flatten()}
    labels=labels.flatten()

    return tf.data.Dataset.from_tensor_slices((features, labels))

# Bucketize the latitude and longitude using the `edges`
latitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'),
    list(atlanta.latitude.edges))

longitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('longitude'),
    list(atlanta.longitude.edges))

# Cross the bucketized columns, using 5000 hash bins.
crossed_lat_lon_fc = tf.feature_column.crossed_column(
    [latitude_bucket_fc, longitude_bucket_fc], 5000)

fc = [
    latitude_bucket_fc,
    longitude_bucket_fc,
    crossed_lat_lon_fc]

# Build and train the Estimator.
est = tf.estimator.LinearRegressor(fc, ...)
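The idea behind the cross can be sketched without TensorFlow: bucketize each coordinate, join the two bucket ids into one key, and hash that key into a fixed number of bins. The edge values, the key format, and the MD5 hash below are all assumptions made for illustration; they are not the values in atlanta or the hashing scheme the library uses.

```python
import bisect
import hashlib

# Hypothetical bucket edges; the real ones come from `atlanta` above.
LAT_EDGES = [33.6, 33.7, 33.8, 33.9]
LON_EDGES = [-84.5, -84.4, -84.3]

def cell(lat, lon):
    """Return the (latitude bucket, longitude bucket) pair for a point."""
    return (bisect.bisect_right(LAT_EDGES, lat),
            bisect.bisect_right(LON_EDGES, lon))

def crossed_id(lat, lon, hash_bucket_size=5000):
    """Hash the pair of bucket ids into one of `hash_bucket_size` bins."""
    lat_bucket, lon_bucket = cell(lat, lon)
    key = f"{lat_bucket}_X_{lon_bucket}"  # illustrative key format
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Two points in the same grid cell share a crossed id.
print(cell(33.65, -84.45), crossed_id(33.65, -84.45))
print(cell(33.69, -84.41), crossed_id(33.69, -84.41))
```

Each grid cell gets its own id, so a linear model can learn a weight for a specific neighborhood rather than for a latitude band or a longitude band alone.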

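# Indicator and Embedding Columns
Indicator columns and embedding columns never work on features directly; instead, they take categorical columns as input.
An indicator column treats each category as one element of a one-hot vector, in which the matching category has the value 1 and all other categories have the value 0.

Rather than representing the data as a one-hot vector with many dimensions, an embedding column represents it as a lower-dimensional ordinary vector in which each cell can contain any number, not just 0 or 1. Because each cell can hold a richer range of numbers, an embedding column needs far fewer cells than an indicator column.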

In [None]:
categorical_column = ... # Create any type of categorical column.

# Represent the categorical column as an indicator column.
indicator_column = tf.feature_column.indicator_column(categorical_column)

In [None]:
categorical_column = ... # Create any categorical column

# Represent the categorical column as an embedding column.
# This means creating an embedding vector lookup table with one row for each category.
embedding_column = tf.feature_column.embedding_column(
    categorical_column=categorical_column,
    dimension=dimension_of_embedding_vector)
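The difference between the two representations can be sketched in plain Python. An indicator is a one-hot vector as wide as the number of categories, while an embedding is a row looked up in a table with a much smaller width. The sizes below are arbitrary, and the table is filled with random numbers only as a placeholder for values a real model would learn during training.

```python
import random

NUM_CATEGORIES = 4   # hypothetical number of categories
EMBEDDING_DIM = 2    # hypothetical embedding width

def indicator(category_id, num_categories=NUM_CATEGORIES):
    """One-hot vector: 1.0 at the category's position, 0.0 elsewhere."""
    return [1.0 if i == category_id else 0.0 for i in range(num_categories)]

# Lookup table with one row per category. Random placeholders here;
# in a real model these values are learned during training.
rng = random.Random(0)
EMBEDDING_TABLE = [[rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
                   for _ in range(NUM_CATEGORIES)]

def embed(category_id, table=EMBEDDING_TABLE):
    """Return the dense vector stored for this category."""
    return table[category_id]

print(indicator(2))
print(embed(2))
```

With four categories the saving is small, but the indicator width grows with the number of categories, while the embedding width is a choice you make independently of it.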

# Passing Feature Columns to Estimators
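As the following list shows, not every Estimator accepts every type of feature column in its feature_columns argument:

    LinearClassifier and LinearRegressor: accept all types of feature columns.

    DNNClassifier and DNNRegressor: accept only dense columns. Other column types must be wrapped in an indicator_column or embedding_column.

    DNNLinearCombinedClassifier and DNNLinearCombinedRegressor:

        The linear_feature_columns argument accepts any type of feature column.

        The dnn_feature_columns argument accepts only dense columns.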