# Feature Columns
Feature columns are the bridge between raw data and an estimator or model. In this tutorial, you will learn:

1. What TensorFlow Feature Columns are
2. Numeric Feature Columns
   - Bucketized Feature Columns
3. Categorical Feature Columns
   - Indicator Feature Columns
   - Embedding Feature Columns
   - Hashed Feature Columns
4. Crossed Feature Columns


As the following figure suggests, you specify the input to a model through the `feature_columns` argument of an Estimator (`DNNClassifier` for Iris). Feature columns bridge input data (as returned by `input_fn`) with your model.

![alt text](https://www.tensorflow.org/images/feature_columns/inputs_to_model_bridge.jpg)

   Feature columns bridge raw data with the data your model needs. 
   
   To create feature columns, call functions from the `tf.feature_column` module. This tutorial demonstrates seven of the functions in that module. As the following figure shows, all nine functions pictured return either a Categorical-Column or a Dense-Column object, except `bucketized_column`, which inherits from both classes:
   
   ![alt text](https://www.tensorflow.org/images/feature_columns/some_constructors.jpg)
   Feature column methods fall into two main categories and one hybrid category. 
   
   Let's look at these functions in more detail.

## Import TensorFlow and other libraries

In [6]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd


import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers

## Create Demo data


In [7]:
data = {'marks': [55,21,63,88,74,54,95,41,84,52],
        'grade': ['average','poor','average','good','good','average','good','average','good','average'],
        'point': ['c','f','c+','b+','b','c','a','d+','b+','c']}

## Demo Data

In [8]:
data_df = pd.DataFrame(data)
data_df

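| | marks | grade | point |
|---|---|---|---|
| 0 | 55 | average | c |
| 1 | 21 | poor | f |
| 2 | 63 | average | c+ |
| 3 | 88 | good | b+ |
| 4 | 74 | good | b |
| 5 | 54 | average | c |
| 6 | 95 | good | a |
| 7 | 41 | average | d+ |
| 8 | 84 | good | b+ |
| 9 | 52 | average | c |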


In [9]:
# A utility method to show the transformation produced by a feature column
def demo(column):
  # The parameter is named `column` so it does not shadow the imported
  # `feature_column` module.
  feature_layer = layers.DenseFeatures(column)
  print(feature_layer(data).numpy())

### Numeric columns
- A [numeric column](https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column) is the simplest type of column. It is used to represent real valued features. 
- When using this column, your model will receive the column value from the dataframe unchanged.

In [10]:
marks = feature_column.numeric_column("marks")

# A layer that produces a dense Tensor based on the given feature columns
feature_layer = layers.DenseFeatures(marks)
print(feature_layer(data))
# print(feature_layer(data).numpy())

tf.Tensor(
[[55.]
 [21.]
 [63.]
 [88.]
 [74.]
 [54.]
 [95.]
 [41.]
 [84.]
 [52.]], shape=(10, 1), dtype=float32)


### Bucketized columns
Instead of representing marks as a numeric column, we could split the marks into several buckets using a [bucketized column](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column).
- Notice that the one-hot values below describe which marks range each row matches. Buckets **include the left boundary and exclude the right boundary**.
- For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:
![alt text](https://www.tensorflow.org/images/feature_columns/bucketized_column.jpg)

Dividing year data into four buckets.

The model will represent the buckets as follows:
>Date Range | Represented as
>------------|--------------------
>< 1960 | [1, 0, 0, 0]
>≥ 1960 but < 1980 | [0, 1, 0, 0]
>≥ 1980 but < 2000 | [0, 0, 1, 0]
>≥ 2000 | [0, 0, 0, 1]


The following code demonstrates how to create a bucketized feature:

In [11]:
marks = feature_column.numeric_column("marks")
marks_buckets = feature_column.bucketized_column(marks, boundaries=[30,40,50,60,70,80,90])

feature_layer = layers.DenseFeatures(marks_buckets)
print(feature_layer(data))

tf.Tensor(
[[0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]], shape=(10, 8), dtype=float32)
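
The bucketing rule described above (left boundary included, right boundary excluded) can be sketched in plain Python. This is only an illustration of the rule, not how TensorFlow implements `bucketized_column`; the helper names `bucketize` and `one_hot` are hypothetical.

```python
from bisect import bisect_right

def bucketize(value, boundaries):
    """Return the bucket index of value; each bucket includes its left boundary."""
    # bisect_right counts the boundaries that are <= value, so a value equal
    # to a boundary falls into the bucket that starts at that boundary.
    return bisect_right(boundaries, value)

def one_hot(index, depth):
    """Return a list of length depth with 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

boundaries = [30, 40, 50, 60, 70, 80, 90]  # 7 boundaries give 8 buckets
marks_values = [55, 21, 63, 88, 74, 54, 95, 41, 84, 52]

bucket_ids = [bucketize(m, boundaries) for m in marks_values]
print(bucket_ids)  # [3, 0, 4, 6, 5, 3, 7, 2, 6, 3]

# One-hot row for the first mark (55), matching the first row printed above.
print(one_hot(bucket_ids[0], len(boundaries) + 1))
```

The positions of the ones in the tensor printed above agree with `bucket_ids`: seven boundaries produce eight buckets, which is why the output has shape `(10, 8)`.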


## Categorical Columns

## Indicator and embedding columns
- Indicator columns: represent each category as a one-hot vector.
- Embedding columns: as the number of categories grows large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation.

### Indicator columns
Categorical vocabulary columns provide a good way to represent strings as a one-hot vector. For example:

![alt text](https://www.tensorflow.org/images/feature_columns/categorical_column_with_vocabulary.jpg)

Mapping string values to vocabulary columns

In [12]:
grade = feature_column.categorical_column_with_vocabulary_list(
      'grade', ['poor', 'average', 'good'])

grade_one_hot = feature_column.indicator_column(grade)
feature_layer = layers.DenseFeatures(grade_one_hot)
print(feature_layer(data))

tf.Tensor(
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]], shape=(10, 3), dtype=float32)
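
To make the mapping concrete, here is a minimal plain-Python sketch of what a vocabulary list combined with an indicator column computes for the `grade` data. The helper `vocabulary_one_hot` is a hypothetical name used for illustration, and returning an all-zero row for an unknown string is a choice made in this sketch.

```python
def vocabulary_one_hot(value, vocabulary):
    """One-hot encode value by its position in vocabulary.

    In this sketch, a value that is not in the vocabulary gets all zeros.
    """
    return [1.0 if value == word else 0.0 for word in vocabulary]

vocabulary = ['poor', 'average', 'good']
grades = ['average', 'poor', 'average', 'good', 'good',
          'average', 'good', 'average', 'good', 'average']

encoded = [vocabulary_one_hot(g, vocabulary) for g in grades]
for grade_value, row in zip(grades[:4], encoded[:4]):
    print(grade_value, row)
```

The position of the 1 in each row is the index of the string in the vocabulary list, so the order in which the vocabulary is given determines the column order of the output.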


#### Point column as indicator_column

In [13]:
point = feature_column.categorical_column_with_vocabulary_list(
      'point', data_df['point'].unique())

point_one_hot = feature_column.indicator_column(point)
feature_layer = layers.DenseFeatures(point_one_hot)
print(feature_layer(data))

tf.Tensor(
[[1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]], shape=(10, 7), dtype=float32)


### Embedding columns
Instead of representing the data as a one-hot vector of many dimensions, an [embedding column](https://www.tensorflow.org/api_docs/python/tf/feature_column/embedding_column) represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (4, in the example below) is a parameter that must be tuned.

Key point: **using an embedding column is best when a categorical column has many possible values**. We are using one here for demonstration purposes only.


Let's look at an example comparing indicator and embedding columns. Suppose our input examples consist of different words from a limited palette of only 81 words. Further suppose that the data set provides the following input words in 4 separate examples:


   *  "dog"
   *  "spoon"
   *  "scissors"
   *   "guitar"
   
   In that case, the following figure illustrates the processing path for embedding columns or indicator columns.
   
   ![alt text](https://www.tensorflow.org/images/feature_columns/embedding_vs_indicator.jpg)

An **embedding column** stores categorical data in a *lower-dimensional* vector than an **indicator column**. (We just placed random numbers into the embedding vectors; training determines the actual numbers.) 

When an example is processed, one of the categorical_column_with... functions maps the example string to a numerical categorical value. For example, a function maps "spoon" to [32]. (The 32 comes from our imagination—the actual values depend on the mapping function.) You may then represent these numerical categorical values in either of the following two ways:

  *   As an indicator column. A function converts each numeric categorical value into an 81-element one-hot vector (because our palette consists of 81 words), placing a 1 in the index of the categorical value (0, 32, 79, 80) and a 0 in all the other positions.

  *  As an embedding column. A function uses the numerical categorical values (0, 32, 79, 80) as indices to a lookup table. Each slot in that lookup table contains a 3-element vector.

How do the values in the embedding vectors get assigned? The assignments happen during training. That is, the model learns the best way to map your input numeric categorical values to embedding vector values in order to solve your problem.
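
The two representations can be compared with a small plain-Python sketch. It assumes the 81-word palette, the 3-element embedding vectors, and the category ids (0, 32, 79, 80) from the example above. The lookup table is filled with random numbers that stand in for values training would learn, and all function names are hypothetical.

```python
import random

VOCAB_SIZE = 81     # size of the word palette in the example above
EMBEDDING_DIM = 3   # each slot of the lookup table holds a 3-element vector

random.seed(0)
# Untrained lookup table: random numbers stand in for learned values.
embedding_table = [[random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
                   for _ in range(VOCAB_SIZE)]

def indicator(category_id):
    """81-element one-hot vector with a 1 at category_id."""
    return [1.0 if i == category_id else 0.0 for i in range(VOCAB_SIZE)]

def embedding(category_id):
    """3-element dense vector looked up by category_id."""
    return embedding_table[category_id]

def one_hot_times_table(category_id):
    """Multiply the one-hot vector by the table; this equals a plain lookup."""
    hot = indicator(category_id)
    return [sum(hot[i] * embedding_table[i][d] for i in range(VOCAB_SIZE))
            for d in range(EMBEDDING_DIM)]

for category_id in (0, 32, 79, 80):
    print(category_id, len(indicator(category_id)), embedding(category_id))
```

Looking up a row by index gives the same result as multiplying the one-hot vector by the table, which is why an embedding can be read as a compact, learned replacement for the one-hot encoding.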

#### Point column as embedding_column

In [14]:
# Notice the input to the embedding column is the categorical column
# we previously created
point_embedding = feature_column.embedding_column(point, dimension=4)
feature_layer = layers.DenseFeatures(point_embedding)
print(feature_layer(data))

tf.Tensor(
[[ 0.24854095  0.49465972  0.03039111 -0.53011465]
 [-0.26425385 -0.39394012 -0.17774487  0.52569956]
 [-0.5754715  -0.666224    0.12761454 -0.18463595]
 [-0.9779533  -0.6311813   0.818969    0.1169072 ]
 [-0.21336433  0.63465536 -0.2656139  -0.52830386]
 [ 0.24854095  0.49465972  0.03039111 -0.53011465]
 [ 0.1225443  -0.11145852 -0.37519047  0.6560355 ]
 [ 0.69775164 -0.19064797 -0.97839147  0.19655545]
 [-0.9779533  -0.6311813   0.818969    0.1169072 ]
 [ 0.24854095  0.49465972  0.03039111 -0.53011465]], shape=(10, 4), dtype=float32)


### Hashed feature columns

Another way to represent a categorical column with a large number of values is to use a [categorical_column_with_hash_bucket](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket). This feature column calculates a hash value of the input, then selects one of the `hash_bucket_size` buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make `hash_bucket_size` significantly smaller than the number of actual categories to save space.

Key point: An important downside of this technique is that there may be collisions in which different strings are mapped to the same bucket. In practice, this can work well for some datasets regardless.
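
The idea can be sketched in plain Python. This sketch uses `zlib.crc32` as a stand-in hash function, which is an assumption made for illustration; TensorFlow uses its own hash, so the bucket ids it assigns will differ from the ones printed here. The helper name `hash_bucket` is hypothetical.

```python
import zlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in the range [0, hash_bucket_size)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash()
    # for strings.
    return zlib.crc32(value.encode('utf-8')) % hash_bucket_size

points = ['c', 'f', 'c+', 'b+', 'b', 'a', 'd+']  # the 7 distinct 'point' values
buckets = {p: hash_bucket(p, 4) for p in points}
print(buckets)

# 7 distinct strings cannot fit into 4 buckets without sharing, so at least
# one bucket must receive two or more strings (a collision).
print(len(set(buckets.values())), "buckets used for", len(points), "strings")
```

No vocabulary is needed: any string, including one never seen before, maps to some bucket. The price is that the model cannot tell apart strings that share a bucket.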

In [15]:
point_hashed = feature_column.categorical_column_with_hash_bucket(
      'point', hash_bucket_size=4)
demo(feature_column.indicator_column(point_hashed))

[[1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]


At this point, you might rightfully think: "This is crazy!" After all, we are forcing the different input values to a smaller set of categories. This means that two probably unrelated inputs will be mapped to the same category, and consequently mean the same thing to the neural network. The following figure illustrates this dilemma, showing that kitchenware and sports both get assigned to category (hash bucket) 12:

![alt text](https://www.tensorflow.org/images/feature_columns/hashed_column.jpg)
Representing data with hash buckets. 

As with many counterintuitive phenomena in machine learning, it turns out that hashing often works well in practice. That's because hash categories provide the model with some separation. The model can use additional features to further separate kitchenware from sports.

### Crossed feature columns
Combining features into a single feature, better known as [feature crosses](https://developers.google.com/machine-learning/glossary/#feature_cross), enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of marks and grade. Note that `crossed_column` does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a hashed column, so you can choose how large the table is through `hash_bucket_size`.


More concretely, suppose we want our model to calculate real estate prices in Atlanta, GA. Real-estate prices within this city vary greatly depending on location. Representing latitude and longitude as separate features isn't very useful in identifying real-estate location dependencies; however, crossing latitude and longitude into a single feature can pinpoint locations. Suppose we represent Atlanta as a grid of 100x100 rectangular sections, identifying each of the 10,000 sections by a feature cross of latitude and longitude. This feature cross enables the model to train on pricing conditions related to each individual section, which is a much stronger signal than latitude and longitude alone.

The following figure shows our plan, with the latitude & longitude values for the corners of the city in red text:

![alt text](https://www.tensorflow.org/images/feature_columns/Atlanta.jpg)

Map of Atlanta. Imagine this map divided into 10,000 sections of equal size. 
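
The 100x100 grid can be sketched in plain Python as follows. The bounding-box coordinates are hypothetical values chosen for illustration, not the ones shown in the figure, and the helper names are made up for this sketch. Unlike `crossed_column`, which hashes the combination into a fixed number of buckets, this sketch builds the full table of 10,000 section ids.

```python
GRID = 100  # 100 x 100 sections

# Hypothetical bounding box, for illustration only.
LAT_MIN, LAT_MAX = 33.6, 33.9
LON_MIN, LON_MAX = -84.6, -84.3

def to_bucket(value, low, high, n_buckets):
    """Index of the equal-width bucket containing value, clamped to the valid range."""
    index = int((value - low) / (high - low) * n_buckets)
    return min(max(index, 0), n_buckets - 1)

def cross_id(lat, lon):
    """Combine the latitude and longitude buckets into one section id."""
    lat_bucket = to_bucket(lat, LAT_MIN, LAT_MAX, GRID)
    lon_bucket = to_bucket(lon, LON_MIN, LON_MAX, GRID)
    return lat_bucket * GRID + lon_bucket  # unique id in [0, 10000)

print(cross_id(33.75, -84.39))
```

Each section id identifies one latitude bucket paired with one longitude bucket, so a model can learn a separate weight per section instead of one weight per latitude band and one per longitude band.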



In [16]:
crossed_feature = feature_column.crossed_column([marks_buckets, grade], hash_bucket_size=10)
demo(feature_column.indicator_column(crossed_feature))

[[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
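
A rough plain-Python sketch of a hashed cross of the marks bucket and the grade is shown below. It is an illustration under stated assumptions: `zlib.crc32` stands in for TensorFlow's hash function, and the `_X_` separator and the name `crossed_bucket` are arbitrary choices, so the bucket ids will not match the tensor printed above.

```python
import zlib
from bisect import bisect_right

boundaries = [30, 40, 50, 60, 70, 80, 90]  # same boundaries as marks_buckets

def crossed_bucket(mark, grade_value, hash_bucket_size=10):
    """Hash the (marks bucket, grade) pair into one of hash_bucket_size ids."""
    marks_bucket = bisect_right(boundaries, mark)
    key = f"{marks_bucket}_X_{grade_value}"  # one string per combination
    return zlib.crc32(key.encode('utf-8')) % hash_bucket_size

rows = [(55, 'average'), (21, 'poor'), (54, 'average'), (88, 'good'), (84, 'good')]
for mark, grade_value in rows:
    print(mark, grade_value, crossed_bucket(mark, grade_value))
```

Rows that share both the marks bucket and the grade always land in the same crossed bucket. The same pattern is visible in the output above, where the three rows with marks 55, 54, and 52 and grade `average` share one column.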
