# Advanced Feature Engineering in Keras 

**Learning Objectives**
1. Define and adopt stateful preprocessing layers.
2. Apply custom transformations with `Lambda` layers.

## Overview
In this notebook, we use Keras to build a taxi fare prediction model and apply feature engineering to improve the predicted fare amount for NYC taxi rides.

## Setup
Start by importing the necessary libraries for this lab.

In [None]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import datetime
import shutil

import keras
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import Model
from keras.callbacks import TensorBoard
from keras.layers import (
    CategoryEncoding,
    Concatenate,
    Dense,
    Discretization,
    Embedding,
    Flatten,
    HashedCrossing,
    Input,
    Lambda,
)
from matplotlib import pyplot as plt

In [None]:
%matplotlib inline

### Load raw data 

We will use the taxifare dataset from the CSV files that we created in the first notebook of this sequence. Those files are saved in `../data`.

In [None]:
!ls -l ../data/taxi-*.csv

### Use tf.data to read the CSV files

We wrote these functions for reading data from the CSV files above in the [previous notebook](2_dataset_api.ipynb).

The `tf.data` API efficiently loads and preprocesses data. 
- `parse_csv`: Parses a CSV row into features and a label. Features are returned as a tuple for Functional API compatibility with multiple inputs.
- `create_dataset`: Builds a `tf.data.Dataset` from CSV files, including mapping `parse_csv`, repeating, shuffling (for training), and batching.

In [None]:
def parse_csv(row):
    ds = tf.strings.split(row, ",")
    # Label: fare_amount
    label = tf.strings.to_number(ds[0])
    # Feature: pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude
    feature = tf.strings.to_number(ds[2:6])  # use some features only
    # Passing feature in tuple so that we can handle them separately.
    return (feature[0], feature[1], feature[2], feature[3]), label


def create_dataset(pattern, batch_size, mode="eval"):
    ds = tf.data.TextLineDataset(pattern)
    ds = ds.map(parse_csv).repeat()
    if mode == "train":
        # shuffle() returns a new dataset, so the result must be reassigned
        ds = ds.shuffle(buffer_size=1000)
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds
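
To make the column layout concrete, here is a plain-Python sketch of what `parse_csv` extracts from a single line. The sample row is made up for illustration, and the contents of the columns that are skipped (1, 6, and beyond) are an assumption; only the positions of the label (column 0) and of the four coordinates (columns 2 to 5) come from the code above.

```python
def parse_row(row: str):
    """Plain-Python counterpart of parse_csv for one CSV line."""
    fields = row.split(",")
    # Column 0 holds the label (fare_amount).
    label = float(fields[0])
    # Columns 2..5 hold pickup lon/lat and dropoff lon/lat.
    features = tuple(float(value) for value in fields[2:6])
    return features, label


# Hypothetical row: the values and the skipped columns are illustrative only.
sample_row = "12.0,2012-01-01 00:00:00 UTC,-73.98,40.74,-73.99,40.75,1,unused"
features, label = parse_row(sample_row)
print(features, label)
```

Returning the features as a tuple of four scalars, rather than one vector, is what lets each coordinate feed its own `Input` layer later on.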

## Define Advanced Feature Engineering
Next, we'll try to improve performance by adding more feature engineering:
1.  **Normalization:** Applied to coordinates before distance calculation and other processing.
2.  **Euclidean Distance:** Calculated using a `Lambda` layer.


### Setup Feature Normalization with `Normalization` Layer

The `keras.layers.Normalization` layer standardizes features by scaling them to have zero mean and unit variance.

Because the layer is stateful (it stores a mean and a variance), we need to either:
1. Precompute the statistics and pass them when instantiating the layer.
```python
keras.layers.Normalization(mean=..., variance=...)
```
2. Or, compute the values using the `adapt()` method.

Here we take the second approach.

We first load data to compute these statistics. Here we retrieve the latitude and longitude columns.

In [None]:
def lat_lon_parser(row, pick_lat):
    ds = tf.strings.split(row, ",")
    # latitude idx: 3 and 5, longitude idx: 2 and 4
    idx = [3, 5] if pick_lat else [2, 4]
    return tf.strings.to_number(tf.gather(ds, idx))


ds = tf.data.Dataset.list_files("../data/taxi-train.csv")
ds = ds.flat_map(tf.data.TextLineDataset)
lat_values = ds.map(lambda x: lat_lon_parser(x, True)).batch(10000)
lon_values = ds.map(lambda x: lat_lon_parser(x, False)).batch(10000)

Then, we instantiate `Normalization` layers (`lat_scaler`, `lon_scaler`) and use their `adapt()` method on the loaded latitude and longitude values to learn the mean and variance.
These adapted layers can then be used in the model to apply the learned normalization.

In [None]:
lat_scaler = keras.layers.Normalization(axis=None)
lon_scaler = keras.layers.Normalization(axis=None)

lat_scaler.adapt(lat_values)
lon_scaler.adapt(lon_values)

print("Computed statistics for latitude:")
print(f"mean: {lat_scaler.mean}, variance: {lat_scaler.variance}")
print("+++++")
print("Computed statistics for longitude:")
print(f"mean: {lon_scaler.mean}, variance: {lon_scaler.variance}")
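
As a rough illustration of what `adapt()` learns and how the layer then transforms its input, the sketch below computes a single mean and variance over a handful of values and standardizes them. The helper names `compute_stats` and `standardize` and the sample latitudes are hypothetical, and the sketch ignores the small epsilon that the Keras layer uses to guard against division by zero.

```python
import math
import statistics


def compute_stats(values):
    """Return the mean and population variance of a list of floats."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    return mean, variance


def standardize(x, mean, variance):
    """Shift by the mean and scale by the standard deviation."""
    return (x - mean) / math.sqrt(variance)


# Made-up latitudes, for illustration only.
latitudes = [40.70, 40.74, 40.76, 40.80]
mean, variance = compute_stats(latitudes)
scaled = [standardize(x, mean, variance) for x in latitudes]
print(mean, variance)
print(scaled)
```

With `axis=None`, as used above, the layer keeps one scalar mean and one scalar variance for all values it sees, which is why pickup and dropoff latitudes can share `lat_scaler`.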

### Define Input Layers

In [None]:
INPUT_COLS = [
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
]

# input layer is all float
inputs = {
    colname: Input(name=colname, shape=(1,), dtype="float32")
    for colname in INPUT_COLS
}

### Custom Feature: Euclidean Distance with a `Lambda` Layer

The `euclidean` function calculates straight-line distance. We'll use a [`keras.layers.Lambda` layer](https://keras.io/api/layers/core_layers/lambda/) later to wrap this function, allowing its direct integration into our Keras model for feature engineering. This keeps preprocessing bundled with the model.

In [None]:
def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff * londiff + latdiff * latdiff)
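
For a quick sanity check of the formula, the same computation can be sketched in plain Python on scalars. The function name `euclidean_py` and the input values are illustrative and not part of the notebook.

```python
import math


def euclidean_py(lon1, lat1, lon2, lat2):
    """Straight-line distance between two points, as in `euclidean`."""
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return math.sqrt(londiff * londiff + latdiff * latdiff)


# A 3-4-5 right triangle makes the result easy to verify by hand.
distance = euclidean_py(0.0, 0.0, 3.0, 4.0)
print(distance)  # 5.0
```

Note that the model applies this function to normalized coordinates, so the resulting feature is unitless rather than a distance in kilometres or degrees.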

### Define Preprocessing, Normalization, and Lambda Layer Integration

We apply the feature engineering steps as follows:
1. **Bucket Boundaries:** Set for normalized data, with `NBUCKETS` evenly spaced boundaries over the range `[-5, 5]`.
2. **Normalization:** Raw coordinates are scaled using the adapted `lon_scaler` and `lat_scaler`.
3. **Lambda Layer:** The `euclidean` function calculates the distance on these *normalized* coordinates.
4. **Discretization:** Normalized coordinates are bucketized.
5. **Feature Crossing & Embedding:** Applied to the (now normalized and discretized) features.
6. **Concatenate:** The normalized coordinates, `euclidean_distance`, and the flattened `pd_embed` are combined to be fed into the DNN.
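
To see how steps 1 and 4 interact, the sketch below bucketizes a few normalized values in plain Python. It assumes 16 boundaries evenly spaced over `[-5, 5]`, mirroring the `NBUCKETS = 16` setting in the next cell, and uses `bisect_right` to mimic the convention that a value equal to a boundary falls into the upper bucket. The helper name `bucketize` is hypothetical.

```python
from bisect import bisect_right

NBUCKETS = 16

# 16 evenly spaced boundaries from -5 to 5 (same spacing as np.linspace).
boundaries = [-5 + 10 * i / (NBUCKETS - 1) for i in range(NBUCKETS)]


def bucketize(x, boundaries):
    """Return the index of the bucket that x falls into."""
    return bisect_right(boundaries, x)


# Values below the first or above the last boundary land in the edge buckets.
indices = [bucketize(x, boundaries) for x in (-6.0, 0.0, 6.0)]
print(indices)  # [0, 8, 16]
```

Sixteen boundaries produce seventeen buckets (indices 0 to 16), which is why the crossing layers size their bins with `NBUCKETS + 1`.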

In [None]:
NBUCKETS = 16

latbuckets = np.linspace(start=-5, stop=5, num=NBUCKETS).tolist()
lonbuckets = np.linspace(start=-5, stop=5, num=NBUCKETS).tolist()

# Normalize longitude
scaled_plon = lon_scaler(inputs["pickup_longitude"])
scaled_dlon = lon_scaler(inputs["dropoff_longitude"])

# Normalize latitude
scaled_plat = lat_scaler(inputs["pickup_latitude"])
scaled_dlat = lat_scaler(inputs["dropoff_latitude"])

# Lambda layer for the custom euclidean function
euclidean_distance = Lambda(euclidean, name="euclidean")(
    [scaled_plon, scaled_plat, scaled_dlon, scaled_dlat]
)

# Discretization
plon = Discretization(lonbuckets, name="plon_bkt")(scaled_plon)
plat = Discretization(latbuckets, name="plat_bkt")(scaled_plat)
dlon = Discretization(lonbuckets, name="dlon_bkt")(scaled_dlon)
dlat = Discretization(latbuckets, name="dlat_bkt")(scaled_dlat)


# Feature Cross with HashedCrossing layer
p_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 2, name="p_fc")((plon, plat))
d_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 2, name="d_fc")((dlon, dlat))
pd_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 4, name="pd_fc")((p_fc, d_fc))

# Embedding with Embedding layer
pd_embed = Flatten()(
    Embedding(input_dim=(NBUCKETS + 1) ** 4, output_dim=10, name="pd_embed")(
        pd_fc
    )
)

deep = Concatenate()(
    [
        scaled_plon,
        scaled_dlon,
        scaled_plat,
        scaled_dlat,
        euclidean_distance,
        pd_embed,
    ]
)
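
The idea behind the hashed crosses can be sketched in plain Python: each pair of bucket indices is hashed into a fixed number of bins, and the resulting index selects a row of the embedding table. The hash below (SHA-256 over a joined string) is only a stand-in chosen for illustration; Keras uses its own hashing scheme, so the actual indices will differ. The function name `hashed_cross` and the sample bucket indices are hypothetical.

```python
import hashlib

NBUCKETS = 16
BINS_PER_AXIS = NBUCKETS + 1  # 16 boundaries give 17 buckets


def hashed_cross(a: int, b: int, num_bins: int) -> int:
    """Map a pair of integer ids to one bin (stand-in hash, not Keras's)."""
    key = f"{a}_X_{b}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


# Cross lon/lat buckets per location, then cross pickup with dropoff.
pickup_cell = hashed_cross(8, 9, BINS_PER_AXIS**2)
dropoff_cell = hashed_cross(7, 10, BINS_PER_AXIS**2)
trip_cell = hashed_cross(pickup_cell, dropoff_cell, BINS_PER_AXIS**4)

# Each of the 17**4 trip bins owns a 10-dimensional embedding row.
embedding_weights = BINS_PER_AXIS**4 * 10
print(BINS_PER_AXIS**2, BINS_PER_AXIS**4, embedding_weights)  # 289 83521 835210
```

Because different pairs can hash to the same bin, collisions are possible; sizing `num_bins` to the full number of combinations keeps them infrequent at the cost of a larger embedding table.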

### Define the DNN Layers

The concatenated tensor (normalized coordinates, `euclidean_distance`, and `pd_embed`) is passed through `Dense` layers.

In [None]:
dnn_hidden_units = [32, 8]

# Add hidden Dense layers
for i, num_nodes in enumerate(dnn_hidden_units, start=1):
    deep = Dense(num_nodes, activation="relu", name=f"hidden_{i}")(deep)

# final output is a linear activation because this is regression
output = Dense(1, activation="linear", name="fare")(deep)

### Instantiate and Compile the Engineered Model

Define the Keras Model with the original inputs and the final engineered output.

In [None]:
def rmse(y_true, y_pred):
    # y_pred has shape (batch, 1); take column 0 to match y_true's shape
    squared_error = keras.ops.square(y_pred[:, 0] - y_true)
    return keras.ops.sqrt(keras.ops.mean(squared_error))
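
As a small worked example of the metric, the sketch below computes the same quantity on two made-up fares in plain Python. The name `rmse_py` and the numbers are illustrative only.

```python
import math


def rmse_py(y_true, y_pred):
    """Root mean squared error over two equal-length lists."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 1 and 7 give squared errors 1 and 49, mean 25, root 5.
value = rmse_py([10.0, 20.0], [11.0, 27.0])
print(value)  # 5.0
```

The `y_pred[:, 0]` slice in the Keras version matters because the model outputs shape `(batch, 1)` while the labels have shape `(batch,)`; subtracting them without the slice would broadcast to a `(batch, batch)` matrix.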

In [None]:
model = keras.Model(inputs=list(inputs.values()), outputs=output)

# Compile model
model.compile(optimizer="adam", loss="mse", metrics=[rmse], run_eagerly=True)

Let's see how our model architecture has changed now.

In [None]:
tf.keras.utils.plot_model(model, show_layer_names=True, rankdir="LR")

## Train the Engineered Model

Train the new model using the same setup as before.

In [None]:
BATCH_SIZE = 64
NUM_TRAIN_EXAMPLES = 10000 * 10  # training dataset will repeat, wrap around
NUM_EVALS = 10  # how many times to evaluate
NUM_EVAL_EXAMPLES = 1000

In [None]:
trainds = create_dataset(
    pattern="../data/taxi-train.csv", batch_size=BATCH_SIZE, mode="train"
)

evalds = create_dataset(
    pattern="../data/taxi-valid.csv", batch_size=BATCH_SIZE, mode="eval"
).take(NUM_EVAL_EXAMPLES // BATCH_SIZE)

steps_per_epoch = NUM_TRAIN_EXAMPLES // (BATCH_SIZE * NUM_EVALS)

history = model.fit(
    trainds,
    validation_data=evalds,
    epochs=NUM_EVALS,
    steps_per_epoch=steps_per_epoch,
)
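
The step arithmetic above can be checked in plain Python. The constants are copied from the notebook; the derived variable names other than `steps_per_epoch` are introduced here for illustration.

```python
BATCH_SIZE = 64
NUM_TRAIN_EXAMPLES = 10000 * 10
NUM_EVALS = 10
NUM_EVAL_EXAMPLES = 1000

# Training: split the total budget of examples across NUM_EVALS epochs.
steps_per_epoch = NUM_TRAIN_EXAMPLES // (BATCH_SIZE * NUM_EVALS)
examples_per_epoch = steps_per_epoch * BATCH_SIZE
total_examples = examples_per_epoch * NUM_EVALS

# Evaluation: number of full batches taken from the validation set.
eval_batches = NUM_EVAL_EXAMPLES // BATCH_SIZE
eval_examples = eval_batches * BATCH_SIZE

print(steps_per_epoch, examples_per_epoch, total_examples)  # 156 9984 99840
print(eval_batches, eval_examples)  # 15 960
```

Integer division rounds down, so slightly fewer than `NUM_TRAIN_EXAMPLES` and `NUM_EVAL_EXAMPLES` examples are actually used.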

Let's plot the training and validation RMSE for each epoch.

In [None]:
RMSE_COLS = ["rmse", "val_rmse"]

pd.DataFrame(history.history)[RMSE_COLS].plot()

Let's make a prediction with this new model, which uses engineered features, on the example we had above.

In [None]:
model.predict(
    {
        "pickup_longitude": tf.convert_to_tensor([-73.982683]),
        "pickup_latitude": tf.convert_to_tensor([40.742104]),
        "dropoff_longitude": tf.convert_to_tensor([-73.983766]),
        "dropoff_latitude": tf.convert_to_tensor([40.755174]),
    },
    steps=1,
)

Copyright 2025 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.