## 1. Categorical Features

### `categorical_column_with_identity`

Use this column when the feature values are already integers in the contiguous range `[0, num_buckets)`. Each value is used directly as its category ID.



In [1]:
import tensorflow as tf

# Define the feature column
identity_column = tf.feature_column.categorical_column_with_identity(
    key='identity_column', num_buckets=5)

# Example input
features = {'identity_column': [0, 1, 2, 3, 4]}

# Create a dense tensor from the feature column
identity_column_indicator = tf.feature_column.indicator_column(identity_column)
tensor = tf.keras.layers.DenseFeatures([identity_column_indicator])(features)

print(tensor)

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.


tf.Tensor(
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]], shape=(5, 5), dtype=float32)
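
To make the output above concrete, here is a minimal pure-Python sketch of what the identity column followed by the indicator column computes: each integer is used as the position of the 1 in a vector of length `num_buckets`. The helper name `one_hot_identity` is hypothetical and not part of TensorFlow, and raising on out-of-range values is a choice made for this sketch.

```python
def one_hot_identity(values, num_buckets):
    """One-hot encode integers that already lie in [0, num_buckets)."""
    rows = []
    for v in values:
        if not 0 <= v < num_buckets:
            # Sketch-only behavior: reject values outside the range
            raise ValueError(f"{v} is outside [0, {num_buckets})")
        row = [0.0] * num_buckets
        row[v] = 1.0  # the value itself is the category ID
        rows.append(row)
    return rows

encoded = one_hot_identity([0, 1, 2, 3, 4], num_buckets=5)
for row in encoded:
    print(row)
```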


### `categorical_column_with_vocabulary_list`

This column is useful when you know all possible values of the feature and can list them.



In [2]:
import tensorflow as tf

# Define the feature column
vocabulary_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key='vocabulary_column', vocabulary_list=['apple', 'banana', 'cherry'])

# Example input
features = {'vocabulary_column': ['apple', 'banana', 'apple', 'cherry', 'banana']}

# Create a dense tensor from the feature column
vocabulary_column_indicator = tf.feature_column.indicator_column(vocabulary_column)
tensor = tf.keras.layers.DenseFeatures([vocabulary_column_indicator])(features)

print(tensor)



tf.Tensor(
[[1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]], shape=(5, 3), dtype=float32)
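
The mapping behind this output can be sketched in plain Python: each string is replaced by its position in the vocabulary list, and that position is then one-hot encoded. The helper `one_hot_vocab` is a hypothetical name used for illustration only. In this sketch, a value that is not in the vocabulary produces an all-zero row.

```python
vocabulary = ['apple', 'banana', 'cherry']
# Position in the list is the category ID: apple -> 0, banana -> 1, cherry -> 2
index_of = {word: i for i, word in enumerate(vocabulary)}

def one_hot_vocab(values):
    rows = []
    for value in values:
        row = [0.0] * len(vocabulary)
        idx = index_of.get(value)  # None for out-of-vocabulary values
        if idx is not None:
            row[idx] = 1.0
        rows.append(row)
    return rows

encoded = one_hot_vocab(['apple', 'banana', 'apple', 'cherry', 'banana'])
for row in encoded:
    print(row)
```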


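### `categorical_column_with_vocabulary_file`

This column is useful when you have a large vocabulary stored in a file, one entry per line. In the example, `num_oov_buckets=1` adds one extra bucket for out-of-vocabulary values, which is why the output has four columns rather than three.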


In [3]:
import tensorflow as tf

# Create a vocabulary file
vocabulary_file = 'vocabulary.txt'
with open(vocabulary_file, 'w') as f:
    f.write('\n'.join(['apple', 'banana', 'cherry']))

# Define the feature column
vocabulary_file_column = tf.feature_column.categorical_column_with_vocabulary_file(
    key='vocabulary_file_column', vocabulary_file=vocabulary_file, num_oov_buckets=1)

# Example input
features = {'vocabulary_file_column': ['apple', 'banana', 'apple', 'cherry', 'banana']}

# Create a dense tensor from the feature column
vocabulary_file_column_indicator = tf.feature_column.indicator_column(vocabulary_file_column)
tensor = tf.keras.layers.DenseFeatures([vocabulary_file_column_indicator])(features)

print(tensor)



tf.Tensor(
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]], shape=(5, 4), dtype=float32)
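
A rough sketch of the lookup, assuming a vocabulary file with one entry per line: known values get their line index, and unknown values are sent to one of the extra out-of-vocabulary buckets placed after the vocabulary. The helpers `load_vocab` and `lookup` are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash TensorFlow uses internally.

```python
import os
import tempfile
import zlib

def load_vocab(path):
    """Read one vocabulary entry per line, skipping empty lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def lookup(value, vocab_index, num_oov_buckets):
    if value in vocab_index:
        return vocab_index[value]
    # Unknown value: pick one of the OOV buckets that follow the vocabulary
    return len(vocab_index) + zlib.crc32(value.encode('utf-8')) % num_oov_buckets

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'vocabulary.txt')
    with open(path, 'w') as f:
        f.write('\n'.join(['apple', 'banana', 'cherry']))
    vocab = load_vocab(path)

vocab_index = {word: i for i, word in enumerate(vocab)}
ids = [lookup(v, vocab_index, num_oov_buckets=1) for v in ['apple', 'banana', 'durian']]
print(ids)  # [0, 1, 3]
```

With a single OOV bucket every unknown value lands on index 3, so the encoded vector needs 3 + 1 = 4 positions.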


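### `categorical_column_with_hash_bucket`

This column is useful when the possible values of the feature are not known in advance, or when there are too many of them to list. Each value is hashed into one of `hash_bucket_size` buckets, so different values can collide in the same bucket.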



In [4]:
import tensorflow as tf

# Define the feature column
hash_bucket_column = tf.feature_column.categorical_column_with_hash_bucket(
    key='hash_bucket_column', hash_bucket_size=10)

# Example input
features = {'hash_bucket_column': ['apple', 'banana', 'cherry', 'date', 'elderberry']}

# Create a dense tensor from the feature column
hash_bucket_column_indicator = tf.feature_column.indicator_column(hash_bucket_column)
tensor = tf.keras.layers.DenseFeatures([hash_bucket_column_indicator])(features)

print(tensor)



tf.Tensor(
[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]], shape=(5, 10), dtype=float32)
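
The idea can be sketched with any deterministic hash: hash the string and take the result modulo the number of buckets. The function `hash_bucket` below is hypothetical, and `zlib.crc32` is a stand-in, so the bucket indices it produces will not match the TensorFlow output above. Collisions are expected with this technique: in the output above, the first and third rows ('apple' and 'cherry') share bucket 1.

```python
import zlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket ID in [0, hash_bucket_size)."""
    return zlib.crc32(value.encode('utf-8')) % hash_bucket_size

fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry']
buckets = [hash_bucket(fruit, hash_bucket_size=10) for fruit in fruits]
print(buckets)
```

No vocabulary has to be stored, which is the appeal of this approach. The cost is that two unrelated values may become indistinguishable to the model.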


## 2. Crossed Categorical Features


### `tf.feature_column.bucketized_column`

This column type is used to transform continuous features into categorical features by placing them into specified buckets.



In [5]:
import tensorflow as tf

# Define the feature column
age = tf.feature_column.numeric_column('age')

# Define the bucket boundaries
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

# Example input
features = {'age': [15, 20, 28, 35, 42, 50, 58, 65, 70]}

# Create a dense tensor from the feature column
bucketized_column_indicator = tf.feature_column.indicator_column(age_buckets)
tensor = tf.keras.layers.DenseFeatures([bucketized_column_indicator])(features)

print(tensor)



tf.Tensor(
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]], shape=(9, 11), dtype=float32)
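
The bucket assignment shown above can be reproduced with the standard library. This is a sketch of the rule, not TensorFlow's implementation: with 10 boundaries there are 11 buckets, and a value equal to a boundary falls into the bucket to its right, which `bisect_right` gives directly.

```python
from bisect import bisect_right

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ages = [15, 20, 28, 35, 42, 50, 58, 65, 70]

# bisect_right counts how many boundaries are <= age, which is the bucket ID
bucket_ids = [bisect_right(boundaries, age) for age in ages]
print(bucket_ids)  # [0, 1, 2, 4, 5, 7, 8, 10, 10]
```

These indices match the positions of the 1s in the rows of the tensor above.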


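### `tf.feature_column.crossed_column`

This column type creates a single categorical feature by crossing two or more categorical features. The combined values are hashed into `hash_bucket_size` buckets, which is why the output in the example has 1,000 columns.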



In [6]:
import tensorflow as tf

# Define individual categorical columns
age_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('age'), boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

education = tf.feature_column.categorical_column_with_vocabulary_list(
    key='education', vocabulary_list=['HighSchool', 'Bachelors', 'Masters', 'PhD'])

# Cross the age and education columns
age_education_crossed = tf.feature_column.crossed_column(
    [age_buckets, education], hash_bucket_size=1000)

# Example input
features = {
    'age': [25, 30, 45, 50, 65],
    'education': ['Bachelors', 'PhD', 'Masters', 'HighSchool', 'Bachelors']
}

# Create a dense tensor from the crossed column
age_education_crossed_indicator = tf.feature_column.indicator_column(age_education_crossed)
tensor = tf.keras.layers.DenseFeatures([age_education_crossed_indicator])(features)

print(tensor)

Instructions for updating:
Use `tf.keras.layers.experimental.preprocessing.HashedCrossing` instead for feature crossing when preprocessing data to train a Keras model.


tf.Tensor(
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]], shape=(5, 1000), dtype=float32)
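
As a rough illustration of feature crossing, the sketch below joins the age bucket and the education value into one key and hashes that key into 1,000 buckets. The function `cross_bucket`, the `_X_` separator and the use of `zlib.crc32` are assumptions made for this example; TensorFlow uses its own hashing, so the IDs will differ from the ones behind the tensor above.

```python
from bisect import bisect_right
import zlib

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def cross_bucket(age, education, hash_bucket_size=1000):
    """Map an (age bucket, education) pair to one of hash_bucket_size IDs."""
    age_bucket = bisect_right(boundaries, age)
    key = f"{age_bucket}_X_{education}".encode('utf-8')  # one key per combination
    return zlib.crc32(key) % hash_bucket_size

ages = [25, 30, 45, 50, 65]
education = ['Bachelors', 'PhD', 'Masters', 'HighSchool', 'Bachelors']
crossed_ids = [cross_bucket(a, e) for a, e in zip(ages, education)]
print(crossed_ids)
```

Two people in the same age bucket with the same education get the same crossed ID, for example ages 25 and 27 with 'Bachelors'.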


## 3. Encoding Categorical Features Using One-Hot Vectors


*  One-Hot vectors => Few categories
*  Embeddings => Large vocabulary



### `indicator_column`

An `indicator_column` in TensorFlow is used to convert categorical columns into dense one-hot encoded vectors.

In [7]:
import tensorflow as tf

# Define the categorical column
education = tf.feature_column.categorical_column_with_vocabulary_list(
    key='education', vocabulary_list=['HighSchool', 'Bachelors', 'Masters', 'PhD'])

# Convert the categorical column to an indicator column
education_indicator = tf.feature_column.indicator_column(education)

# Example input
features = {'education': ['Bachelors', 'PhD', 'Masters', 'HighSchool', 'Bachelors']}

# Create a dense tensor from the indicator column
tensor = tf.keras.layers.DenseFeatures([education_indicator])(features)

print(tensor)

tf.Tensor(
[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]], shape=(5, 4), dtype=float32)




*   Fewer than 10 categories => One-Hot Encoding
*   More than 50 categories => Embeddings
*   Between 10 and 50 categories => Test both of them



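### `embedding_column`

An embedding is a trainable dense vector that represents a category. Because the vectors are learned during training, categories that behave similarly can end up with similar representations.

An `embedding_column` represents a high-dimensional categorical feature in a lower-dimensional space.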

In [8]:
import tensorflow as tf

# Define the categorical column
education = tf.feature_column.categorical_column_with_vocabulary_list(
    key='education', vocabulary_list=['HighSchool', 'Bachelors', 'Masters', 'PhD'])

# Convert the categorical column to an embedding column with 2D embeddings
education_embedding = tf.feature_column.embedding_column(education, dimension=2)

# Example input
features = {'education': ['Bachelors', 'PhD', 'Masters', 'HighSchool', 'Bachelors']}

# Create a dense tensor from the embedding column
tensor = tf.keras.layers.DenseFeatures([education_embedding])(features)

print(tensor)



tf.Tensor(
[[ 0.5059846  -0.26062796]
 [ 0.9991912  -0.21856011]
 [ 0.45055795 -0.29276913]
 [ 0.7213919   0.11462164]
 [ 0.5059846  -0.26062796]], shape=(5, 2), dtype=float32)
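
Conceptually, an embedding is a lookup table with one row per category. The sketch below builds such a table with random initial values and looks up a row for each input. The names `embedding_table` and `embed`, the seed and the Gaussian initialization are assumptions for illustration; in a real model the rows are trainable weights that get updated during training.

```python
import random

vocabulary = ['HighSchool', 'Bachelors', 'Masters', 'PhD']
dimension = 2

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# One row of `dimension` floats per category
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(dimension)]
                   for _ in vocabulary]
index_of = {word: i for i, word in enumerate(vocabulary)}

def embed(values):
    """Replace each category with its row of the embedding table."""
    return [embedding_table[index_of[v]] for v in values]

vectors = embed(['Bachelors', 'PhD', 'Masters', 'HighSchool', 'Bachelors'])
for vector in vectors:
    print(vector)
```

As in the TensorFlow output above, the first and last rows are identical because both inputs are 'Bachelors'.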


In [9]:
import tensorflow as tf

# Define the categorical column
words = tf.feature_column.categorical_column_with_vocabulary_list(
    key='words', vocabulary_list=['cat', 'dog', 'bird', 'fish', 'elephant'])

# Convert the categorical column to an embedding column with word embeddings
word_embedding = tf.feature_column.embedding_column(words, dimension=3)

# Example input
features = {'words': ['cat', 'dog', 'fish', 'elephant', 'bird']}

# Create a dense tensor from the embedding column
tensor = tf.keras.layers.DenseFeatures([word_embedding])(features)

print(tensor)

tf.Tensor(
[[ 0.5343626   0.38480031 -1.083207  ]
 [-0.57303953 -0.12368282 -0.13121653]
 [ 0.08513737  0.16181467  0.41971907]
 [ 1.1402742  -0.03480695 -0.8175274 ]
 [-0.05962566  0.21273744 -0.34122324]], shape=(5, 3), dtype=float32)


## 4. Using Feature Columns for Parsing


The `make_parse_example_spec` function generates a parsing spec based on the feature columns provided. This is useful when you want to parse `tf.train.Example` records.

In [10]:
import tensorflow as tf

# Define feature columns
feature_columns = [
    tf.feature_column.numeric_column('latitude'),
    tf.feature_column.numeric_column('longitude'),
    tf.feature_column.categorical_column_with_vocabulary_list('ocean_proximity', ['NEAR BAY', 'INLAND', 'NEAR OCEAN', 'ISLAND'])
]

# Generate the parsing spec
parse_spec = tf.feature_column.make_parse_example_spec(feature_columns)

# Example serialized Example
example = tf.train.Example(features=tf.train.Features(feature={
    'latitude': tf.train.Feature(float_list=tf.train.FloatList(value=[37.7749])),
    'longitude': tf.train.Feature(float_list=tf.train.FloatList(value=[-122.4194])),
    'ocean_proximity': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'NEAR BAY']))
}))

serialized_example = example.SerializeToString()

# Parse the example
parsed_features = tf.io.parse_single_example(serialized_example, parse_spec)
print(parsed_features)



{'ocean_proximity': SparseTensor(indices=tf.Tensor([[0]], shape=(1, 1), dtype=int64), values=tf.Tensor([b'NEAR BAY'], shape=(1,), dtype=string), dense_shape=tf.Tensor([1], shape=(1,), dtype=int64)), 'latitude': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([37.7749], dtype=float32)>, 'longitude': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-122.4194], dtype=float32)>}


In [13]:
import tensorflow as tf

# Define feature descriptions
feature_descriptions = {
    'latitude': tf.io.FixedLenFeature([], tf.float32),
    'longitude': tf.io.FixedLenFeature([], tf.float32),
    'ocean_proximity': tf.io.FixedLenFeature([], tf.string),
    'median_house_value': tf.io.FixedLenFeature([], tf.float32)
}

# Define the parse_examples function
def parse_examples(serialized_examples):
    examples = tf.io.parse_example(serialized_examples, feature_descriptions)
    targets = examples.pop('median_house_value')  # separate the targets
    return examples, targets

# Create example data
examples = [
    {
        'latitude': 37.7749,
        'longitude': -122.4194,
        'ocean_proximity': 'NEAR BAY',
        'median_house_value': 1000000.0
    },
    {
        'latitude': 34.0522,
        'longitude': -118.2437,
        'ocean_proximity': 'INLAND',
        'median_house_value': 850000.0
    },
    {
        'latitude': 40.7128,
        'longitude': -74.0060,
        'ocean_proximity': 'NEAR OCEAN',
        'median_house_value': 1200000.0
    }
]

# Write the example data to a TFRecord file
tfrecord_filename = 'internal_data.tfrecord'
with tf.io.TFRecordWriter(tfrecord_filename) as writer:
    for example in examples:
        feature = {
            'latitude': tf.train.Feature(float_list=tf.train.FloatList(value=[example['latitude']])),
            'longitude': tf.train.Feature(float_list=tf.train.FloatList(value=[example['longitude']])),
            'ocean_proximity': tf.train.Feature(bytes_list=tf.train.BytesList(value=[example['ocean_proximity'].encode()])),
            'median_house_value': tf.train.Feature(float_list=tf.train.FloatList(value=[example['median_house_value']]))
        }
        tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(tf_example.SerializeToString())

# Set up the dataset
batch_size = 2
dataset = tf.data.TFRecordDataset([tfrecord_filename])
dataset = dataset.repeat().shuffle(10000).batch(batch_size).map(parse_examples)

# Example iteration over the dataset
for batch in dataset.take(1):
    features, targets = batch
    print('Features:', features)
    print('Targets:', targets)


Features: {'latitude': <tf.Tensor: shape=(2,), dtype=float32, numpy=array([40.7128, 40.7128], dtype=float32)>, 'longitude': <tf.Tensor: shape=(2,), dtype=float32, numpy=array([-74.006, -74.006], dtype=float32)>, 'ocean_proximity': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'NEAR OCEAN', b'NEAR OCEAN'], dtype=object)>}
Targets: tf.Tensor([1200000. 1200000.], shape=(2,), dtype=float32)


## 5. Using Feature Columns in Your Models


In [14]:
import tensorflow as tf
from tensorflow import keras

# Define feature columns
feature_columns = [
    tf.feature_column.numeric_column('latitude'),
    tf.feature_column.numeric_column('longitude'),
    tf.feature_column.categorical_column_with_vocabulary_list('ocean_proximity', ['NEAR BAY', 'INLAND', 'NEAR OCEAN', 'ISLAND']),
    tf.feature_column.numeric_column('median_house_value')
]

In [15]:
# Invent example data
examples = [
    {
        'latitude': 37.7749,
        'longitude': -122.4194,
        'ocean_proximity': 'NEAR BAY',
        'median_house_value': 1000000.0
    },
    {
        'latitude': 34.0522,
        'longitude': -118.2437,
        'ocean_proximity': 'INLAND',
        'median_house_value': 850000.0
    },
    {
        'latitude': 40.7128,
        'longitude': -74.0060,
        'ocean_proximity': 'NEAR OCEAN',
        'median_house_value': 1200000.0
    },
    {
        'latitude': 36.7783,
        'longitude': -119.4179,
        'ocean_proximity': 'INLAND',
        'median_house_value': 650000.0
    },
    {
        'latitude': 32.7157,
        'longitude': -117.1611,
        'ocean_proximity': 'NEAR BAY',
        'median_house_value': 950000.0
    }
]

# Write the example data to a TFRecord file
tfrecord_filename = 'internal_data.tfrecord'
with tf.io.TFRecordWriter(tfrecord_filename) as writer:
    for example in examples:
        feature = {
            'latitude': tf.train.Feature(float_list=tf.train.FloatList(value=[example['latitude']])),
            'longitude': tf.train.Feature(float_list=tf.train.FloatList(value=[example['longitude']])),
            'ocean_proximity': tf.train.Feature(bytes_list=tf.train.BytesList(value=[example['ocean_proximity'].encode()])),
            'median_house_value': tf.train.Feature(float_list=tf.train.FloatList(value=[example['median_house_value']]))
        }
        tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(tf_example.SerializeToString())

In [16]:
# Define feature descriptions
feature_descriptions = {
    'latitude': tf.io.FixedLenFeature([], tf.float32),
    'longitude': tf.io.FixedLenFeature([], tf.float32),
    'ocean_proximity': tf.io.FixedLenFeature([], tf.string),
    'median_house_value': tf.io.FixedLenFeature([], tf.float32)
}

# Define the parse_examples function
def parse_examples(serialized_examples):
    examples = tf.io.parse_example(serialized_examples, feature_descriptions)
    targets = examples.pop('median_house_value')  # separate the targets
    return examples, targets

# Set up the dataset
batch_size = 2
dataset = tf.data.TFRecordDataset([tfrecord_filename])
dataset = dataset.repeat().shuffle(10000).batch(batch_size).map(parse_examples)

In [18]:
# Define the feature columns without the target
ocean_proximity = feature_columns[2]
ocean_proximity = tf.feature_column.indicator_column(ocean_proximity)
columns_without_target = [
    feature_columns[0],
    feature_columns[1],
    ocean_proximity  # Use the wrapped column
]

# Define the Keras model
model = keras.models.Sequential([
    keras.layers.DenseFeatures(feature_columns=columns_without_target),
    keras.layers.Dense(1)
])

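# Compile the model
# NOTE: "accuracy" is not a meaningful metric for a regression loss such as MSE.
# With unscaled features and targets, plain SGD also diverges here, as the inf
# losses in the output below show. Standardizing the data and tracking a metric
# such as "mae" would be more appropriate.
model.compile(loss="mse", optimizer="sgd", metrics=["accuracy"])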

# Determine steps_per_epoch
steps_per_epoch = len(examples) // batch_size

# Train the model
history = model.fit(dataset, steps_per_epoch=steps_per_epoch, epochs=5)

print(history.history)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
{'loss': [4.416362399137792e+16, 2.008283007779263e+26, 8.918753868311994e+35, inf, inf], 'accuracy': [0.0, 0.0, 0.0, 0.0, 0.0]}


### TF Transform
TF Transform (`tensorflow_transform`) is a way to speed up data preprocessing in production for the kind of feature encoding shown above. The preprocessing is defined once as a function:

In [None]:
import tensorflow_transform as tft

def preprocess(inputs):  # inputs is a batch of input features
  median_age = inputs["housing_median_age"]
  ocean_proximity = inputs["ocean_proximity"]
  # scale_to_z_score already subtracts the mean, so no manual centering is needed
  standardized_age = tft.scale_to_z_score(median_age)
  ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
  return {
    "standardized_median_age": standardized_age,
    "ocean_proximity_id": ocean_proximity_id
  }
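
To show what the two operations in `preprocess` compute, here is a plain-Python sketch that works on small in-memory lists. The helpers `z_score` and `vocab_ids` are hypothetical stand-ins, not the TF Transform API. The sketch assumes the population standard deviation and a vocabulary ordered by descending frequency. The key point is that both need a full pass over the data (the mean, the standard deviation, the vocabulary) before any single value can be transformed.

```python
from collections import Counter
import statistics

def z_score(values):
    """Standardize values to zero mean and unit (population) variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def vocab_ids(values):
    """Build a vocabulary ordered by descending frequency, then map values to IDs."""
    vocab = [word for word, _ in Counter(values).most_common()]
    index_of = {word: i for i, word in enumerate(vocab)}
    return [index_of[v] for v in values]

standardized = z_score([10.0, 20.0, 30.0])
ids = vocab_ids(['INLAND', 'NEAR BAY', 'INLAND', 'INLAND', 'NEAR BAY', 'ISLAND'])
print(standardized)
print(ids)  # [0, 1, 0, 0, 1, 2]
```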

# The TensorFlow Datasets (TFDS) Project


In [19]:
import tensorflow_datasets as tfds

dataset = tfds.load(name="mnist")
mnist_train, mnist_test = dataset["train"], dataset["test"]

Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...



Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.


In [20]:
mnist_train = mnist_train.repeat(5).batch(32).prefetch(1)
for item in mnist_train:
  images = item["image"]
  labels = item["label"]

In [21]:
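# Start again from the raw split: mnist_train was already repeated and batched in the previous cell
mnist_train = dataset["train"].repeat(5).batch(32)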
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.prefetch(1)

In [23]:
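import tensorflow_datasets as tfds
from tensorflow import keras

# Reload MNIST as batched (image, label) tuples, the form that fit() expects
dataset = tfds.load(name="mnist", batch_size=32, as_supervised=True)
mnist_train = dataset["train"].repeat().prefetch(1)
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(28, 28, 1)),  # TFDS MNIST images have shape (28, 28, 1)
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')  # class probabilities, as the loss below expects
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd")
model.fit(mnist_train, steps_per_epoch=60000 // 32, epochs=5)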

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x78eb4ec5ebc0>