# Introducing the Keras Functional API and Feature Engineering

**Prerequisites:** This notebook assumes you have some familiarity with the Keras Sequential API, covered in previous notebooks.

**Learning Objectives**
  1. Understand the Keras Functional API and its advantages for building flexible model architectures.
  2. Learn how to implement Wide & Deep models, understanding the concepts of memorization (wide) and generalization (deep).
  3. Utilize various Keras preprocessing layers for feature engineering (e.g., `Discretization`, `HashedCrossing`, `Embedding`).
  4. Apply custom feature engineering transformations using Keras `Lambda` layers.
  5. Build, train, and evaluate a model using the Functional API, incorporating engineered features.

### Introduction: Beyond Sequential Models

In previous notebooks, you've become acquainted with the Keras Sequential API, which is excellent for building models as a linear stack of layers. However, many real-world problems require more complex model architectures. This is where the [Keras Functional API](https://www.tensorflow.org/guide/keras#functional_api) shines.

The Functional API allows you to create a graph of layers, offering greater flexibility. With it, you can build models with:
*   **Multiple input and output layers:** Imagine a model that takes both an image and text as input, or predicts multiple attributes simultaneously.
*   **Shared layers:** Use the same layer multiple times in different parts of your model.
*   **Non-sequential data flows:** Create architectures with branches, skip connections (like in ResNets), or other custom pathways.
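
As a rough illustration of these three properties, here is a minimal sketch of a toy model. The input names, layer sizes, and the `toy_` prefix are hypothetical and are not part of the taxi fare model built later in this notebook.

```python
import keras
from keras import layers

# Two separate inputs (hypothetical names and sizes, for illustration only).
toy_left = keras.Input(shape=(4,), name="left")
toy_right = keras.Input(shape=(4,), name="right")

# A shared layer: the same Dense instance (same weights) is applied to both inputs.
toy_shared = layers.Dense(8, activation="relu", name="shared_dense")

# Non-sequential flow: two branches are merged into a single tensor.
toy_merged = layers.Concatenate(name="merged")(
    [toy_shared(toy_left), toy_shared(toy_right)]
)
toy_output = layers.Dense(1, name="toy_output")(toy_merged)

toy_model = keras.Model(inputs=[toy_left, toy_right], outputs=toy_output)
toy_model.summary()
```

Because `toy_shared` is reused, its weights are counted only once in the summary, which is something a `Sequential` model cannot express.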

In this notebook, we'll explore these capabilities by first building a **Wide & Deep model**. This type of model combines the strengths of learning simple, direct rules (the "wide" part) with the ability to learn complex, abstract patterns (the "deep" part). We'll then enhance our model further by applying custom feature engineering techniques using `Lambda` layers, demonstrating another powerful aspect of the Functional API.

## Building Wide & Deep models with the Keras Functional API

The core idea behind **Wide & Deep models** is to combine the strengths of two distinct types of learning: **memorization** and **generalization**. You can explore the original research paper here: [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792).

<img src='assets/wide_deep.png' width='80%'>
<sup>(image: https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html)</sup>

Let's break down what this means:

*   **The "Wide" Component (Memorization):**
    *   This part of the model is typically a linear model. It excels at *memorizing* direct interactions between features and the target variable. 
    *   Think of it as learning simple, explicit rules from the data. For example, "if a user frequently buys product A and product B together, they are likely to click on an ad for product C."
    *   It often uses features like cross-product transformations (e.g., `pickup_location` AND `dropoff_location`) to learn these specific rules.

*   **The "Deep" Component (Generalization):**
    *   This part is a deep neural network (DNN).
    *   It's good at *generalizing* by learning complex, abstract patterns in the data through multiple layers of non-linear transformations. 
    *   It often uses embedding layers to represent categorical features in a dense, lower-dimensional space, allowing the model to discover nuanced relationships. For example, it might learn that certain types of pickup locations (e.g., airports, business districts) have similar fare characteristics, even if the specific locations themselves are different.

**Why combine them?**
By joining these two components, a Wide & Deep model can leverage the strengths of both. The wide part quickly learns simple rules, while the deep part captures more complex patterns that the wide part might miss. This often leads to better performance on tasks where both types of learning are valuable, such as recommendation systems, search ranking, and classification problems with diverse features.

In this notebook, we'll construct both the wide and deep sections using Keras preprocessing layers and then combine them using the Functional API to predict taxi fare amounts.

## Setup

First, let's import the necessary libraries. We'll need `os` for environment variables, `shutil` for file operations, `numpy` and `pandas` for data manipulation, `tensorflow` and `keras` for building our model, and `matplotlib` for plotting.

The `os.environ["KERAS_BACKEND"] = "tensorflow"` line explicitly sets Keras to use the TensorFlow backend. While Keras 3 is multi-backend, this ensures consistency for the lab. We also import specific layers and utilities from Keras that we'll use extensively.

In [None]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import datetime
import shutil

import keras
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import Model
from keras.callbacks import TensorBoard
from keras.layers import (
    CategoryEncoding,
    Concatenate,
    Dense,
    Discretization,
    Embedding,
    Flatten,
    HashedCrossing,
    Input,
    Lambda,
)
from matplotlib import pyplot as plt

In [None]:
%matplotlib inline

### Load and Inspect Raw Data

We'll continue working with the NYC taxifare dataset. The necessary CSV files (`taxi-train.csv`, `taxi-valid.csv`) were prepared in the first notebook of this series and are expected to be in the `../data` directory.

The `!ls -l ../data/*.csv` command below is a quick way to verify that the data files exist and to see their sizes.

In [None]:
!ls -l ../data/*.csv

### Create Data Input Pipeline with `tf.data`

To efficiently load and preprocess our data, we'll use the `tf.data` API. The functions `parse_csv` and `create_dataset` are similar to what we used in the [previous notebook (2_dataset_api.ipynb)](2_dataset_api.ipynb).

**Key changes and points to note:**
*   **`parse_csv(row)` function:**
    *   It takes a row from the CSV file (a string) and parses it.
    *   It extracts the label (`fare_amount`) and a subset of features: `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, and `dropoff_latitude`.
    *   **Important for Functional API:** The features are returned as a tuple: `(feature[0], feature[1], feature[2], feature[3])`. This is crucial because when using the Functional API with multiple `Input` layers (as we will do shortly), Keras expects the input data to be provided in a way that can be mapped to these distinct inputs. A tuple of individual feature tensors (or a dictionary mapping input names to tensors) serves this purpose.
*   **`create_dataset(pattern, batch_size, mode)` function:**
    *   `tf.data.TextLineDataset(pattern)`: Reads lines from text files matching the given pattern (e.g., `../data/taxi-train.csv`).
    *   `.map(parse_csv)`: Applies our parsing function to each line.
    *   `.repeat()`: Repeats the dataset indefinitely. This is useful when training for a fixed number of steps rather than a fixed number of epochs.
    *   `.shuffle(buffer_size=1000)` (if `mode == "train"`): Randomly shuffles the data to ensure that the model doesn't learn from the order of examples. `buffer_size` determines how many elements are buffered for shuffling.
    *   `.batch(batch_size, drop_remainder=True)`: Groups examples into batches. `drop_remainder=True` ensures all batches have the same size, which can simplify model training.
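
The slicing in `parse_csv` can be checked without TensorFlow. The sketch below applies the same indices to a made-up row using plain Python string handling. The values are illustrative and are not taken from the dataset.

```python
# Hypothetical CSV row, laid out in the column order used by this dataset:
# fare_amount, pickup_datetime, pickup_longitude, pickup_latitude,
# dropoff_longitude, dropoff_latitude, passenger_count, key
row = "12.0,2014-01-01 00:00:00 UTC,-73.98,40.75,-73.95,40.78,1,0"

fields = row.split(",")

# Column 0 is the label (fare_amount).
label = float(fields[0])

# Columns 2..5 are the four coordinates; columns 1, 6 and 7 are skipped.
features = tuple(float(value) for value in fields[2:6])

print(label)
print(features)
```

Returning the coordinates as a four-element tuple, rather than one stacked tensor, is what lets each value be routed to its own `Input` layer.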

In [None]:
def parse_csv(row):
    ds = tf.strings.split(row, ",")
    # Label: fare_amount
    label = tf.strings.to_number(ds[0])
    # Feature: pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude
    feature = tf.strings.to_number(ds[2:6])  # use some features only
    # Passing feature in tuple so that we can handle them separately.
    return (feature[0], feature[1], feature[2], feature[3]), label


def create_dataset(pattern, batch_size, mode="eval"):
    ds = tf.data.TextLineDataset(pattern)
    ds = ds.map(parse_csv).repeat()
    if mode == "train":
        # shuffle() returns a new dataset, so the result must be reassigned.
        ds = ds.shuffle(buffer_size=1000)
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds

### Build a Wide & Deep Model with the Keras Functional API

Now, let's construct our Wide & Deep model. As discussed earlier, this involves creating two distinct pathways for our data: 
1.  **Wide Path:** Processes features (often categorical or crossed features) through a linear model or simple transformations. This path helps the model memorize specific feature interactions.
2.  **Deep Path:** Processes features (often numerical or embeddings of categorical features) through a multi-layer neural network. This path allows the model to generalize by learning complex patterns.

Finally, the outputs of these two paths are combined to make a prediction. The Keras Functional API is ideal for this, as it allows us to define these separate pathways and then merge them.

#### Define Input Layers

With the Functional API, we begin by defining the entry points for our data. Each distinct input feature will have its own `keras.Input` layer. Think of these as placeholders that will receive the data when the model is run.

For our taxi fare model, we're using four input features: `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, and `dropoff_latitude`.

*   `Input(name=colname, shape=(1,), dtype="float32")`:
    *   `name`: Giving each input a unique name is a good practice, especially for complex models. It helps in debugging and understanding model summaries. The name should match the key if you're feeding data as a dictionary.
    *   `shape=(1,)`: This specifies that each input feature is a scalar (a single floating-point number). For example, one pickup longitude value at a time.
    *   `dtype="float32"`: Defines the data type of the input.

We store these input layers in a dictionary called `inputs`, where keys are the column names and values are the corresponding `Input` layer objects. This dictionary structure will be convenient later when we construct the `Model`.

In [None]:
INPUT_COLS = [
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
]

inputs = {
    colname: Input(name=colname, shape=(1,), dtype="float32")
    for colname in INPUT_COLS
}

#### Define Preprocessing Logic for Wide & Deep Features

Now that we have our input layers, we'll define the preprocessing steps. This is where the Functional API's flexibility starts to become evident. We can create different processing pipelines for different inputs or different model paths (wide vs. deep).

**Constants for Preprocessing:**
*   `dnn_hidden_units = [32, 8]`: Defines the number of neurons in the hidden layers of our deep path.
*   `NBUCKETS = 16`: The number of boundary values we'll use for discretizing our latitude and longitude features. Because 16 boundaries split the number line into 17 intervals (including the two open-ended outer ones), each discretized feature takes one of `NBUCKETS + 1` integer values.

**Preprocessing Steps:**

1.  **Bucketization with `Discretization` Layer:**
    *   Continuous features like latitude and longitude are often more powerful when transformed into categorical features. Bucketization (or binning) achieves this.
    *   `np.linspace(...)`: We first define the boundaries between buckets. For example, `latbuckets` will be a list of 16 evenly spaced latitude values from 40.5 to 41.0.
    *   `Discretization(lonbuckets, name="plon_bkt")(inputs["pickup_longitude"])`:
        *   This layer takes the raw `pickup_longitude` input and assigns each value to the interval between boundaries in `lonbuckets` that contains it. Values below the first boundary or above the last one fall into the two open-ended outer intervals.
        *   The output `plon` (pickup longitude bucketized) will be an integer bucket index from 0 to `NBUCKETS` (17 possible values).
    *   We do this for pickup/dropoff latitudes and longitudes, creating `plon`, `plat`, `dlon`, `dlat`.

2.  **Feature Crossing with `HashedCrossing` Layer:**
    *   Feature crossing combines two or more categorical features into a new feature that represents their joint interaction. This is a key technique for the "wide" part of the model, helping it memorize specific combinations.
    *   `HashedCrossing(num_bins=(NBUCKETS + 1) ** 2, name="p_fc")((plon, plat))`:
        *   This layer takes the bucketized pickup longitude (`plon`) and latitude (`plat`) and creates a new crossed feature `p_fc` (pickup feature cross).
        *   `num_bins`: The number of hash bins for the output. A common practice is to set this to roughly the number of possible input combinations. Each discretized feature takes one of `NBUCKETS + 1` values (16 boundaries define 17 intervals, including the two open-ended outer ones), so `(NBUCKETS + 1) ** 2` covers every longitude/latitude pair. Because the layer hashes its inputs, distinct pairs can still collide in the same bin.
    *   We create three crossed features:
        *   `p_fc`: Crossing pickup longitude and latitude.
        *   `d_fc`: Crossing dropoff longitude and latitude.
        *   `pd_fc`: Crossing `p_fc` and `d_fc` (a higher-order cross representing the interaction between pickup area and dropoff area).

Notice how we chain operations: the output of one layer (e.g., `inputs["pickup_longitude"]`) becomes the input to the next (e.g., `Discretization(...)`). This is the essence of the Functional API – building a graph of computations.
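
To see why the code uses `NBUCKETS + 1` rather than `NBUCKETS`, here is a small standard-library sketch of the bucketing arithmetic. It assumes the documented `Discretization` behavior: the outer bins extend to infinity, and a value equal to a boundary falls in the bin to its right. The helper `bucket_index` is hypothetical and only mimics the layer.

```python
from bisect import bisect_right

NBUCKETS = 16
start, stop = -74.2, -73.7

# Same boundaries that np.linspace(start, stop, num=NBUCKETS) would produce.
step = (stop - start) / (NBUCKETS - 1)
lonbuckets = [start + i * step for i in range(NBUCKETS)]


def bucket_index(value, boundaries):
    """Count the boundaries <= value, which is the index of the enclosing interval."""
    return bisect_right(boundaries, value)


print(bucket_index(-75.0, lonbuckets))   # west of every boundary
print(bucket_index(-73.98, lonbuckets))  # inside the range
print(bucket_index(-73.0, lonbuckets))   # east of every boundary

# 16 boundaries give 17 possible indices, so a cross of two such features
# has 17 ** 2 possible pairs, and a cross of two crosses has 17 ** 4.
print((NBUCKETS + 1) ** 2, (NBUCKETS + 1) ** 4)
```

The smallest and largest indices correspond to coordinates outside the 40.5 to 41.0 and -74.2 to -73.7 ranges, so out-of-range trips still get a valid bucket.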

In [None]:
dnn_hidden_units = [32, 8]
NBUCKETS = 16

# Define Bucketization boundaries
latbuckets = np.linspace(start=40.5, stop=41.0, num=NBUCKETS).tolist()
lonbuckets = np.linspace(start=-74.2, stop=-73.7, num=NBUCKETS).tolist()

# Bucketization with Discretization layer
plon = Discretization(lonbuckets, name="plon_bkt")(inputs["pickup_longitude"])
plat = Discretization(latbuckets, name="plat_bkt")(inputs["pickup_latitude"])
dlon = Discretization(lonbuckets, name="dlon_bkt")(inputs["dropoff_longitude"])
dlat = Discretization(latbuckets, name="dlat_bkt")(inputs["dropoff_latitude"])

# Feature Cross with HashedCrossing layer
p_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 2, name="p_fc")((plon, plat))
d_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 2, name="d_fc")((dlon, dlat))
pd_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 4, name="pd_fc")((p_fc, d_fc))

#### Build the Deep Path

The deep path of our model is responsible for generalization. It typically involves taking numerical features and/or embeddings of categorical features and passing them through several neural network layers.

1.  **Embedding Layer for Crossed Features:**
    *   The `pd_fc` (pickup-dropoff crossed feature) is a high-cardinality categorical feature. To make it usable by a neural network, we embed it into a lower-dimensional dense vector space using an `Embedding` layer.
    *   `Embedding(input_dim=(NBUCKETS + 1) ** 4, output_dim=10, name="pd_embed")(pd_fc)`:
        *   `input_dim`: This is the size of the vocabulary, i.e., the number of possible categories for `pd_fc`. It's `(NBUCKETS + 1) ** 4` because `pd_fc` was created by crossing two features, each hashed into `(NBUCKETS + 1) ** 2` bins, and its own `num_bins` was set to the product of the two.
        *   `output_dim=10`: This is the dimensionality of the embedding vector. Each category of `pd_fc` will be mapped to a 10-dimensional vector. The model will learn these embeddings during training.
    *   The output `pd_embed` will have a shape like `(batch_size, 1, 10)` if the input `pd_fc` is `(batch_size, 1)`. The `Flatten` layer will remove the middle dimension.

2.  **Concatenate Inputs for the Deep Network:**
    *   We gather all features that will go into the deep path:
        *   The original numerical inputs: `inputs["pickup_longitude"]`, `inputs["pickup_latitude"]`, `inputs["dropoff_longitude"]`, `inputs["dropoff_latitude"]`.
        *   The flattened embedding of our crossed feature: `Flatten(name="flatten_embedding")(pd_embed)`.
    *   `Concatenate(name="deep_input")([...])`: This layer merges these features into a single tensor, which will serve as the input to the subsequent `Dense` layers.

3.  **Add Hidden Dense Layers:**
    *   We create a stack of fully connected (`Dense`) layers. These layers allow the model to learn complex, non-linear relationships between the deep features.
    *   `for i, num_nodes in enumerate(dnn_hidden_units, start=1): deep = Dense(num_nodes, activation="relu", name=f"hidden_{i}")(deep)`:
        *   We iterate through `dnn_hidden_units` (which we defined as `[32, 8]`).
        *   Each `Dense` layer has a `relu` activation function, which introduces non-linearity.
        *   The output of one `Dense` layer becomes the input to the next, progressively transforming the features.

The final `deep` tensor represents the output of the deep path of our model.
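
The shapes described above can be traced with NumPy alone. This is only a sketch of the shape bookkeeping: the embedding table below is random, whereas the real `Embedding` layer learns its table during training, and the zero-filled `coords` array is a placeholder for the four raw inputs.

```python
import numpy as np

vocab_size, embed_dim = 17 ** 4, 10
rng = np.random.default_rng(0)

# Stand-in for the trainable embedding table: one row per crossed-feature id.
table = rng.normal(size=(vocab_size, embed_dim)).astype("float32")

# A batch of 3 hypothetical crossed-feature ids, shaped (batch_size, 1) like pd_fc.
ids = np.array([[5], [42], [83520]])

embedded = table[ids]                        # lookup keeps the extra axis
flattened = embedded.reshape(len(ids), -1)   # what Flatten does to that axis
coords = np.zeros((3, 4), dtype="float32")   # placeholder for the 4 raw inputs
deep_input = np.concatenate([coords, flattened], axis=1)

print(embedded.shape, flattened.shape, deep_input.shape)
```

The deep path therefore starts from 14 values per example: 4 raw coordinates plus the 10-dimensional embedding.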

In [None]:
# Embedding with Embedding layer
pd_embed = Embedding(
    input_dim=(NBUCKETS + 1) ** 4, output_dim=10, name="pd_embed"
)(pd_fc)

# Concatenate and define inputs for deep network
deep = Concatenate(name="deep_input")(
    [
        inputs["pickup_longitude"],
        inputs["pickup_latitude"],
        inputs["dropoff_longitude"],
        inputs["dropoff_latitude"],
        Flatten(name="flatten_embedding")(pd_embed),
    ]
)

# Add hidden Dense layers
for i, num_nodes in enumerate(dnn_hidden_units, start=1):
    deep = Dense(num_nodes, activation="relu", name=f"hidden_{i}")(deep)

#### Build the Wide Path

The wide path of our model is designed for memorization, primarily using sparse, categorical features created through transformations like crossing.

1.  **One-Hot Encoding with `CategoryEncoding` Layer:**
    *   The crossed features (`p_fc`, `d_fc`, `pd_fc`) are categorical (represented as integers). To use them in a typically linear wide model (or as direct inputs to the final layer), they are often one-hot encoded.
    *   `CategoryEncoding(num_tokens=(NBUCKETS + 1) ** 2, name="p_onehot")(p_fc)`:
        *   This layer takes the integer-encoded `p_fc` and converts it into a binary vector of length `num_tokens`. `HashedCrossing` emits exactly one integer per example, so although the layer's default output mode is multi-hot, each vector here contains a single 1 and is effectively one-hot.
        *   `num_tokens`: This should be equal to the `num_bins` used in the corresponding `HashedCrossing` layer.
    *   We create one-hot encoded versions for `p_fc`, `d_fc`, and `pd_fc`.

2.  **Concatenate Inputs for the Wide Network:**
    *   `wide = Concatenate(name="wide_input")([p_onehot, d_onehot, pd_onehot])`:
        *   All the one-hot encoded sparse features are concatenated together to form the input for the wide part of the model.

The `wide` tensor now holds the combined sparse features for the memorization path.
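
For intuition, the encoding of a single-id input can be reproduced in NumPy. The helper `one_hot` and the sample ids are hypothetical; the sketch only shows the result for the shape used in this notebook, one id per example.

```python
import numpy as np


def one_hot(ids, num_tokens):
    """Turn a (batch, 1) array of integer ids into a (batch, num_tokens) binary matrix."""
    ids = np.asarray(ids).reshape(-1)
    encoded = np.zeros((len(ids), num_tokens), dtype="float32")
    encoded[np.arange(len(ids)), ids] = 1.0
    return encoded


# Two hypothetical pickup crosses; valid ids run from 0 to num_tokens - 1.
p_ids = [[3], [288]]
p_encoded = one_hot(p_ids, num_tokens=(16 + 1) ** 2)

print(p_encoded.shape)
print(p_encoded.sum(axis=1))  # exactly one active position per example
```

Each row is almost entirely zeros, which is why these features are described as sparse.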

In [None]:
# Onehot Encoding with CategoryEncoding layer
p_onehot = CategoryEncoding(num_tokens=(NBUCKETS + 1) ** 2, name="p_onehot")(
    p_fc
)
d_onehot = CategoryEncoding(num_tokens=(NBUCKETS + 1) ** 2, name="d_onehot")(
    d_fc
)
pd_onehot = CategoryEncoding(num_tokens=(NBUCKETS + 1) ** 4, name="pd_onehot")(
    pd_fc
)

# Concatenate and define inputs for wide network
wide = Concatenate(name="wide_input")([p_onehot, d_onehot, pd_onehot])

#### Combine Wide and Deep Paths

With both the `deep` and `wide` pathways processed, the next step is to combine them. This is a hallmark of Wide & Deep models.

*   `concat = Concatenate(name="concatenate")([deep, wide])`:
    *   The `Concatenate` layer takes the output tensor from the deep path (`deep`) and the output tensor from the wide path (`wide`) and merges them into a single, wider tensor.

This `concat` tensor now contains information from both the generalization (deep) and memorization (wide) parts of our model.

Finally, we define the output layer:
*   `output = Dense(1, activation=None, name="output")(concat)`:
    *   A `Dense` layer with a single unit is used to produce the final prediction.
    *   `activation=None`: For regression tasks (like predicting taxi fare, which is a continuous value), the output layer typically has no activation function (or a linear activation). This allows the model to output values in any range.
    *   The input to this layer is the `concat` tensor that combines both wide and deep features.
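
A quick back-of-the-envelope calculation shows how lopsided the concatenated tensor is. It assumes each `CategoryEncoding` output has width `num_tokens`, as it does for the one-id-per-example inputs used here.

```python
NBUCKETS = 16
dnn_hidden_units = [32, 8]

# The deep path ends with the last hidden layer.
deep_width = dnn_hidden_units[-1]

# The wide path concatenates p_onehot, d_onehot and pd_onehot.
wide_width = 2 * (NBUCKETS + 1) ** 2 + (NBUCKETS + 1) ** 4

concat_width = deep_width + wide_width

# The output Dense(1) has one weight per concatenated value plus one bias.
output_params = concat_width + 1

print(deep_width, wide_width, concat_width, output_params)
```

Almost all of the output layer's weights belong to the wide features, each acting as a per-bin linear term, while the deep path contributes only 8 values.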

In [None]:
# Concatenate wide & deep networks
concat = Concatenate(name="concatenate")([deep, wide])

# Define the final output layer
output = Dense(1, activation=None, name="output")(concat)

#### Define Evaluation Metric, Instantiate and Compile the Model

Before we can train our model, we need a few more things:

1.  **Custom RMSE Metric (`rmse` function):**
    *   Root Mean Squared Error (RMSE) is a common metric for regression problems. It measures the square root of the average of squared differences between predicted and actual values. This penalizes larger errors more heavily.
    *   Keras allows custom metrics. Our `rmse` function calculates this: `tf.keras.ops.square(y_pred[:, 0] - y_true)` computes the squared error, `tf.keras.ops.mean(...)` averages it, and `tf.keras.ops.sqrt(...)` takes the square root.
    *   `y_pred[:, 0]` is used because our model outputs a tensor of shape `(batch_size, 1)`, and we need to compare it with `y_true` which is typically `(batch_size,)`.

2.  **Instantiate the Model (`keras.Model`):**
    *   `model = Model(inputs=list(inputs.values()), outputs=output)`
    *   This is where we officially define our Keras model using the Functional API.
    *   `inputs`: We provide a list of all the input layers we defined earlier (obtained from `list(inputs.values())`).
    *   `outputs`: We specify the final output tensor of our model, which is `output` (the `Dense` layer we defined in the previous step).
    *   Keras automatically traces the graph of layers from the inputs to the outputs to create the complete model structure.

3.  **Compile the Model (`model.compile`):**
    *   `model.compile(optimizer="adam", loss="mse", metrics=[rmse], run_eagerly=True)`
    *   `optimizer="adam"`: Adam is a widely used and generally effective optimization algorithm.
    *   `loss="mse"`: Mean Squared Error (MSE) is chosen as the loss function. The model will try to minimize this value during training. For regression, MSE is a standard choice.
    *   `metrics=[rmse]`: We include our custom RMSE metric to monitor during training and evaluation.
    *   `run_eagerly=True`: This is often useful for debugging, as it runs the model step-by-step, making it easier to trace errors. For production or performance-critical training, you'd typically set this to `False` (which is the default) to leverage TensorFlow's graph execution capabilities.

4.  **Visualize the Model (`tf.keras.utils.plot_model`):**
    *   `tf.keras.utils.plot_model(model, show_shapes=False, rankdir="LR")`
    *   This utility is incredibly helpful for visualizing the architecture of your Functional API model. It generates an image of the layer graph.
    *   `show_shapes=False`: Hides the input/output shapes of each layer in the plot for a cleaner look. Set to `True` if you want to see them.
    *   `rankdir="LR"`: Renders the graph from Left to Right, which is often intuitive for model architectures.
    *   Take a moment to examine the generated plot. You should be able to trace the wide and deep paths, the preprocessing steps, and how they combine. This visual confirmation is a great way to ensure your model is connected as intended.
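
The `y_pred[:, 0]` indexing matters more than it may appear. The NumPy sketch below, with made-up values, shows what goes wrong without it: subtracting a `(batch_size,)` array from a `(batch_size, 1)` array broadcasts to a `(batch_size, batch_size)` matrix, which would silently produce the wrong metric.

```python
import math

import numpy as np

# Hypothetical labels and predictions for a batch of 3 trips.
y_true = np.array([10.0, 12.0, 8.0])        # shape (3,)
y_pred = np.array([[11.0], [10.0], [8.0]])  # shape (3, 1), like the model output

# Without indexing, broadcasting pairs every prediction with every label.
print((y_pred - y_true).shape)

# With indexing, each prediction is compared with its own label only.
errors = y_pred[:, 0] - y_true
rmse_value = math.sqrt(np.mean(np.square(errors)))

print(errors.shape, rmse_value)
```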

In [None]:
def rmse(y_true, y_pred):
    squared_error = tf.keras.ops.square(y_pred[:, 0] - y_true)
    return tf.keras.ops.sqrt(tf.keras.ops.mean(squared_error))

In [None]:
model = Model(inputs=list(inputs.values()), outputs=output)

model.compile(optimizer="adam", loss="mse", metrics=[rmse], run_eagerly=True)

In [None]:
tf.keras.utils.plot_model(model, show_shapes=False, rankdir="LR")

### Train and Evaluate the Wide & Deep Model

With our model defined and compiled, we can now proceed to train it.

#### Setup Training Parameters and Datasets

*   **Training Parameters:**
    *   `BATCH_SIZE = 64`: The number of examples processed in each training step.
    *   `NUM_TRAIN_EXAMPLES = 10000 * 10`: The total number of training examples to process over the whole training run, across all `NUM_EVALS` virtual epochs. Because the dataset repeats, this number can exceed the size of the underlying file.
    *   `NUM_EVALS = 10`: We will evaluate our model 10 times during the entire training process.
    *   `NUM_EVAL_EXAMPLES = 1000`: The number of examples from the validation set to use for each evaluation.

*   **Create Datasets:**
    *   `trainds = create_dataset(...)`: Creates the training dataset using our `create_dataset` function. `mode="train"` ensures shuffling.
    *   `evalds = create_dataset(...).take(...)`: Creates the validation dataset. `.take(NUM_EVAL_EXAMPLES // BATCH_SIZE)` keeps only the first `1000 // 64 = 15` batches (960 examples) of the otherwise repeating validation dataset, making each evaluation fast and consistent.

**A Note on "Virtual Epochs"**: We're training for a fixed number of steps (`steps_per_epoch`) over a number of evaluations (`NUM_EVALS`). The `trainds` dataset is set to `.repeat()`. This means we are not training for a fixed number of passes over the *entire* training dataset (which is the traditional definition of an epoch). Instead, each "epoch" in the Keras `fit` call corresponds to processing `steps_per_epoch` batches. This approach, sometimes referred to as using "virtual epochs," is common when working with very large or streaming datasets. For more details, you can refer to the blog post [ML Design Pattern #3: Virtual Epochs](https://medium.com/google-cloud/ml-design-pattern-3-virtual-epochs-f842296de730).
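
The arithmetic behind these virtual epochs is easy to verify by hand. The sketch below uses the same constants as the training cell and plain integer division.

```python
BATCH_SIZE = 64
NUM_TRAIN_EXAMPLES = 10000 * 10
NUM_EVALS = 10
NUM_EVAL_EXAMPLES = 1000

# Batches processed between two consecutive evaluations.
steps_per_epoch = NUM_TRAIN_EXAMPLES // (BATCH_SIZE * NUM_EVALS)

# Examples seen per virtual epoch and over the whole run.
examples_per_epoch = steps_per_epoch * BATCH_SIZE
total_examples = examples_per_epoch * NUM_EVALS

# Batches (and examples) used for each evaluation.
eval_batches = NUM_EVAL_EXAMPLES // BATCH_SIZE
eval_examples = eval_batches * BATCH_SIZE

print(steps_per_epoch, examples_per_epoch, total_examples)
print(eval_batches, eval_examples)
```

Integer division rounds down, so the run processes slightly fewer than `NUM_TRAIN_EXAMPLES` examples in total.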

#### Train the Model

In [None]:
BATCH_SIZE = 64
NUM_TRAIN_EXAMPLES = 10000 * 10  # training dataset will repeat, wrap around
NUM_EVALS = 10  # how many times to evaluate
NUM_EVAL_EXAMPLES = 1000  # enough to get a reasonable sample

trainds = create_dataset(
    pattern="../data/taxi-train.csv", batch_size=BATCH_SIZE, mode="train"
)

evalds = create_dataset(
    pattern="../data/taxi-valid.csv", batch_size=BATCH_SIZE, mode="eval"
).take(NUM_EVAL_EXAMPLES // BATCH_SIZE)

In [None]:
%%time
steps_per_epoch = NUM_TRAIN_EXAMPLES // (BATCH_SIZE * NUM_EVALS)

OUTDIR = "./taxi_trained"
shutil.rmtree(path=OUTDIR, ignore_errors=True)  # start fresh each time

history = model.fit(
    x=trainds,
    steps_per_epoch=steps_per_epoch,
    epochs=NUM_EVALS,
    validation_data=evalds,
    callbacks=[TensorBoard(OUTDIR)],
)

#### Visualize Training History

The `history` object returned by `model.fit()` contains a record of the loss and metrics at each epoch. We can use `pandas` and `matplotlib` to plot the training and validation RMSE to observe how our model learned over time.

Look for:
*   **Decreasing RMSE:** Both training and validation RMSE should ideally decrease.
*   **Convergence:** The curves should start to flatten out, indicating the model is no longer improving significantly.
*   **Overfitting:** If the training RMSE continues to decrease while the validation RMSE starts to increase, it's a sign of overfitting. If your plot shows this, the usual remedies are regularization, more data, or a simpler model.

In [None]:
RMSE_COLS = ["rmse", "val_rmse"]

pd.DataFrame(history.history)[RMSE_COLS].plot()

## Improve Model Performance with Advanced Feature Engineering

Our first Wide & Deep model provides a good baseline. Now, let's explore how targeted feature engineering can potentially improve its performance. We'll introduce two key techniques:
1.  **Custom Transformation with `Lambda` Layers:** We'll calculate the Euclidean distance between pickup and dropoff points.
2.  **Normalization:** We'll normalize our input coordinates before calculating the distance and feeding them into other layers.

This time, for simplicity and to focus on these new feature engineering steps, we will build a DNN (Deep Neural Network) model only, but these techniques could also be incorporated back into a Wide & Deep structure.

### Custom Feature: Euclidean Distance with a `Lambda` Layer

The raw latitude and longitude coordinates are useful, but the direct distance between pickup and dropoff points is likely a very strong signal for predicting taxi fare. We can engineer this feature.

**Euclidean Distance:**
The function `euclidean(params)` calculates the straight-line distance between two points (`lon1`, `lat1`) and (`lon2`, `lat2`) using the formula: `sqrt((lon2-lon1)^2 + (lat2-lat1)^2)`.
This is a simplification, as it doesn't account for the Earth's curvature (Haversine distance would be more accurate for longer distances) or road networks. However, for relatively short taxi rides within a city, it's often a good and computationally cheap approximation.

**Using `keras.layers.Lambda`:**
To incorporate this custom Python function directly into our Keras model graph, we use a `Lambda` layer. 
*   **What it does:** A `Lambda` layer wraps an arbitrary expression or function as a Keras layer.
*   **Why it's useful:** It allows for quick experimentation with custom transformations without needing to write a full custom layer class. It's great for simple, stateless operations.
*   **How we'll use it:** Later, when building the model, we'll pass the `euclidean` function to a `Lambda` layer, and then call that layer with the appropriate input tensors (pickup and dropoff coordinates).

This approach keeps our feature engineering logic as part of the model itself, which is excellent for deployment and consistency, as the preprocessing is bundled with the model.
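
Before wiring the function into a model, the formula itself can be sanity-checked on arrays shaped like the model's inputs. The sketch below is a NumPy mirror of `euclidean` (the name `euclidean_np` and the sample coordinates are hypothetical); it is not the `Lambda` layer itself.

```python
import numpy as np


def euclidean_np(params):
    """NumPy counterpart of the distance formula, applied element-wise."""
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return np.sqrt(londiff * londiff + latdiff * latdiff)


# Two hypothetical trips; each input is shaped (batch_size, 1) like the Input layers.
lon1 = np.array([[0.0], [-73.98]])
lat1 = np.array([[0.0], [40.75]])
lon2 = np.array([[3.0], [-73.98]])
lat2 = np.array([[4.0], [40.75]])

dist = euclidean_np([lon1, lat1, lon2, lat2])
print(dist.shape)
print(dist)
```

The first trip forms a 3-4-5 triangle and the second has identical pickup and dropoff points, so the expected distances are easy to confirm. The output keeps the `(batch_size, 1)` shape, which is what allows it to be concatenated with other features later.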

In [None]:
def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff * londiff + latdiff * latdiff)

### Feature Normalization with `Normalization` Layer

Neural networks often perform better when input numerical features are on a similar scale. **Normalization** in the sense used here (standardization) rescales a feature to have zero mean and unit variance; other schemes instead map values to a fixed range such as [0, 1]. Either can help with faster convergence during training and prevent features with larger magnitudes from dominating the learning process.

Keras provides a `keras.layers.Normalization` layer for this purpose. It's a stateful layer, meaning it can compute statistics (mean and variance) from the data and then use these statistics to normalize subsequent data.

**Steps:**
1.  **Load Data for Adaptation:**
    *   To calculate the mean and variance, the `Normalization` layer needs to see some data. We load the `taxi-train.csv` into a Pandas DataFrame.
    *   We then extract all latitude values (`pickup_latitude`, `dropoff_latitude`) into `lat_values` and all longitude values (`pickup_longitude`, `dropoff_longitude`) into `lon_values`.

2.  **Instantiate and Adapt Normalization Layers:**
    *   `lat_scaler = keras.layers.Normalization(axis=None)`: We create a `Normalization` layer for latitude. `axis=None` means the normalization will be computed across all values provided.
    *   `lon_scaler = keras.layers.Normalization(axis=None)`: Similarly, for longitude.
    *   `lat_scaler.adapt(lat_values)`: This is the crucial step. The `adapt()` method takes the `lat_values` (a NumPy array of all latitude values from the training set) and computes the mean and variance. These statistics are then stored within the `lat_scaler` layer.
    *   The same is done for `lon_scaler` with `lon_values`.

3.  **Inspect Computed Statistics (Optional):**
    *   The `print` statements show the computed `mean` and `variance` for both latitude and longitude. This is a good sanity check.

Once these layers are "adapted," they can be used as part of our model to normalize new incoming data using the statistics learned from the training set. This ensures that the same transformation is applied during training, evaluation, and inference.
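
Conceptually, `adapt()` followed by a call to the layer amounts to the computation below. This is a NumPy sketch on made-up latitude values, assuming a single scalar mean and population variance (as with `axis=None`); the layer's exact numerics, such as its handling of near-zero variance, may differ.

```python
import numpy as np

# Hypothetical latitude values standing in for the training data.
sample_lats = np.array([40.6, 40.7, 40.8, 40.9])

# What adapt() stores: one mean and one variance over all values.
sample_mean = sample_lats.mean()
sample_variance = sample_lats.var()  # population variance (divides by N)

# What calling the adapted layer does to incoming data.
normalized = (sample_lats - sample_mean) / np.sqrt(sample_variance)

print(sample_mean, sample_variance)
print(normalized)
```

After the transformation the values are centered on zero with unit spread, regardless of the original units.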

In [None]:
CSV_COLUMNS = [
    "fare_amount",
    "pickup_datetime",
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
    "passenger_count",
    "key",
]

df = pd.read_csv("../data/taxi-train.csv", names=CSV_COLUMNS)
lat_values = pd.concat(
    [df["pickup_latitude"], df["dropoff_latitude"]], ignore_index=True
).to_numpy()
lon_values = pd.concat(
    [df["pickup_longitude"], df["dropoff_longitude"]], ignore_index=True
).to_numpy()

In [None]:
lat_scaler = keras.layers.Normalization(axis=None)
lon_scaler = keras.layers.Normalization(axis=None)

lat_scaler.adapt(lat_values)
lon_scaler.adapt(lon_values)

print(
    f"Computed statistics for latitude:\nmean: {lat_scaler.mean}, variance: {lat_scaler.variance}"
)
print("+++++")
print(
    f"Computed statistics for longitude:\nmean: {lon_scaler.mean}, variance: {lon_scaler.variance}"
)

### Building a DNN Model with Engineered Features

Now we'll construct a new Deep Neural Network (DNN) that incorporates our engineered features: the Euclidean distance (via a `Lambda` layer) and normalized coordinates.

The overall flow will be:
1.  Define `Input` layers for our raw coordinate features.
2.  Normalize these raw coordinates using the `Normalization` layers we adapted earlier (`lat_scaler`, `lon_scaler`).
3.  Calculate the Euclidean distance using the `Lambda` layer with our `euclidean` function, operating on the *normalized* coordinates.
4.  Discretize the *normalized* coordinates.
5.  Create feature crosses from these discretized, normalized coordinates.
6.  Embed the final crossed feature.
7.  Concatenate the Euclidean distance feature and the embedding.
8.  Pass the concatenated result through a few `Dense` layers to make the final prediction.

This model demonstrates how to integrate various preprocessing and custom transformation layers seamlessly using the Keras Functional API.
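
Because step 3 operates on normalized coordinates, the resulting distance is expressed in standard-deviation units rather than degrees. The NumPy sketch below illustrates this under stated assumptions: `euclidean_np` is a hypothetical stand-in for the notebook's `euclidean` function (assumed here to be the square root of the summed squared coordinate differences), and the means and standard deviations are made-up numbers.

```python
import numpy as np


def euclidean_np(lon1, lat1, lon2, lat2):
    # Hypothetical stand-in: straight-line distance between two points.
    return np.sqrt((lon2 - lon1) ** 2 + (lat2 - lat1) ** 2)


# Made-up statistics standing in for the adapted scalers.
lon_mean, lon_std = -73.97, 0.04
lat_mean, lat_std = 40.75, 0.03

# One made-up trip (pickup and dropoff) in raw degrees.
plon, plat, dlon, dlat = -73.99, 40.73, -73.95, 40.76

raw = euclidean_np(plon, plat, dlon, dlat)
scaled = euclidean_np(
    (plon - lon_mean) / lon_std,
    (plat - lat_mean) / lat_std,
    (dlon - lon_mean) / lon_std,
    (dlat - lat_mean) / lat_std,
)

print(raw)  # distance in degrees
print(scaled)  # distance in standard-deviation units
```

The means cancel in the coordinate differences, so only the two standard deviations change the result: longitude and latitude differences end up weighted by different factors.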

#### Define Inputs

In [None]:
INPUT_COLS = [
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
]

# Every input is a scalar float32 feature, keyed by its column name
inputs = {
    colname: Input(name=colname, shape=(1,), dtype="float32")
    for colname in INPUT_COLS
}

#### Define Preprocessing, Normalization, and Lambda Layer Integration

This is where we apply the feature engineering concepts we've discussed:

1.  **Updated Bucket Boundaries:**
    *   `latbuckets = np.linspace(start=-5, stop=5, num=NBUCKETS).tolist()`
    *   `lonbuckets = np.linspace(start=-5, stop=5, num=NBUCKETS).tolist()`
    *   **Important Change:** The bucket boundaries for latitude and longitude are now defined over the range `[-5, 5]` because we feed *normalized* coordinates into the `Discretization` layers. Normalized data (zero mean, unit variance) mostly falls within a small range (roughly -3 to +3 for 99.7% of the data if it were perfectly Gaussian), so `[-5, 5]` leaves a comfortable margin. Values outside this range simply land in the first or last bucket. Note that `num=NBUCKETS` yields 16 boundaries, which delimit 17 buckets; this is why the crossing layers below are sized with `NBUCKETS + 1`.

2.  **Apply Normalization:**
    *   `scaled_plon = lon_scaler(inputs["pickup_longitude"])`
    *   The raw input longitudes and latitudes are passed through their respective adapted `Normalization` layers (`lon_scaler`, `lat_scaler`). The outputs (`scaled_plon`, `scaled_dlon`, `scaled_plat`, `scaled_dlat`) are the normalized versions of these coordinates.

3.  **Lambda Layer for Euclidean Distance:**
    *   `euclidean_distance = Lambda(euclidean, name="euclidean")([scaled_plon, scaled_plat, scaled_dlon, scaled_dlat])`
    *   Here, our `euclidean` function (wrapped in a `Lambda` layer) is called with the *normalized* coordinates. The result is a Euclidean distance in scaled units, not in degrees, so it acts as a proxy for trip length rather than a geographic distance.

4.  **Discretization of Normalized Coordinates:**
    *   `plon = Discretization(lonbuckets, name="plon_bkt")(scaled_plon)`
    *   The *normalized* pickup longitude (`scaled_plon`) is now bucketized using the new `lonbuckets` (ranging from -5 to 5).
    *   This is repeated for all four coordinate features.

5.  **Feature Crossing and Embedding (Similar to before, but on normalized+discretized data):**
    *   `p_fc = HashedCrossing(...)((plon, plat))`
    *   `d_fc = HashedCrossing(...)((dlon, dlat))`
    *   `pd_fc = HashedCrossing(...)((p_fc, d_fc))`
    *   The feature crossing logic remains the same, but it now operates on features derived from normalized data.
    *   `pd_embed = Flatten()(Embedding(...)(pd_fc))`: The final crossed feature `pd_fc` is embedded and flattened as before.

6.  **Concatenate Engineered Features for the DNN:**
    *   `deep = Concatenate()([euclidean_distance, pd_embed])`
    *   The input to our deep stack of layers now consists of two engineered features:
        *   `euclidean_distance`: Our custom-calculated distance.
        *   `pd_embed`: The embedding of the crossed feature derived from normalized and discretized coordinates.

This sequence demonstrates a more sophisticated feature engineering pipeline built directly into the model.
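
The bucket arithmetic above can be checked with a short NumPy sketch. It uses `np.digitize` as a stand-in for the `Discretization` layer (an assumption made for illustration; the layer itself is not called here) and shows why the crossing layers are sized with `NBUCKETS + 1`.

```python
import numpy as np

NBUCKETS = 16

# 16 evenly spaced boundaries over [-5, 5], as in the notebook.
boundaries = np.linspace(start=-5, stop=5, num=NBUCKETS)

# 16 boundaries split the number line into 17 buckets (ids 0..16):
# one below -5, fifteen between neighbouring boundaries, and one at or above 5.
scaled_values = np.array([-7.0, -5.0, 0.0, 4.9, 9.0])
bucket_ids = np.digitize(scaled_values, boundaries)
print(bucket_ids)

# Distinct (lon, lat) bucket pairs for one location, and for a full trip.
pairs_per_location = (NBUCKETS + 1) ** 2
pairs_per_trip = (NBUCKETS + 1) ** 4
print(pairs_per_location, pairs_per_trip)
```

Values far outside `[-5, 5]` are not dropped; they fall into the first or last bucket.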

In [None]:
NBUCKETS = 16

latbuckets = np.linspace(start=-5, stop=5, num=NBUCKETS).tolist()
lonbuckets = np.linspace(start=-5, stop=5, num=NBUCKETS).tolist()

# Normalize longitude
scaled_plon = lon_scaler(inputs["pickup_longitude"])
scaled_dlon = lon_scaler(inputs["dropoff_longitude"])

# Normalize latitude
scaled_plat = lat_scaler(inputs["pickup_latitude"])
scaled_dlat = lat_scaler(inputs["dropoff_latitude"])

# Lambda layer for the custom euclidean function
euclidean_distance = Lambda(euclidean, name="euclidean")(
    [scaled_plon, scaled_plat, scaled_dlon, scaled_dlat]
)

# Discretization
plon = Discretization(lonbuckets, name="plon_bkt")(scaled_plon)
plat = Discretization(latbuckets, name="plat_bkt")(scaled_plat)
dlon = Discretization(lonbuckets, name="dlon_bkt")(scaled_dlon)
dlat = Discretization(latbuckets, name="dlat_bkt")(scaled_dlat)


# Feature Cross with HashedCrossing layer
p_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 2, name="p_fc")((plon, plat))
d_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 2, name="d_fc")((dlon, dlat))
pd_fc = HashedCrossing(num_bins=(NBUCKETS + 1) ** 4, name="pd_fc")((p_fc, d_fc))

# Embedding with Embedding layer
pd_embed = Flatten()(
    Embedding(input_dim=(NBUCKETS + 1) ** 4, output_dim=10, name="pd_embed")(
        pd_fc
    )
)

deep = Concatenate()([euclidean_distance, pd_embed])

#### Define the DNN Layers

After the feature engineering, the `deep` tensor (containing the concatenated `euclidean_distance` and `pd_embed`) is passed through a stack of `Dense` layers, similar to our first model.

*   `dnn_hidden_units = [32, 8]`: We use the same hidden layer structure as before (two hidden layers with 32 and 8 units respectively, and ReLU activations).
*   `output = Dense(1, activation="linear", name="fare")(deep)`: The final output layer is a single `Dense` unit with a linear activation, suitable for predicting the continuous fare amount. We've named it "fare" for clarity.
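
To get a feel for the size of this network, the plain-Python sketch below counts trainable parameters by hand. It assumes the concatenated input has 11 features (one distance value plus the 10-dimensional embedding) and that each `Dense` layer has a weight matrix plus one bias per unit; `model.summary()` remains the authoritative source for the real counts.

```python
NBUCKETS = 16
EMBED_DIM = 10  # output_dim of the Embedding layer


def dense_params(n_in, n_out):
    # One weight per input-output pair, plus one bias per output unit.
    return n_in * n_out + n_out


# Assumed width of the concatenated tensor: 1 distance + 10 embedding values.
n_features = 1 + EMBED_DIM

embedding = (NBUCKETS + 1) ** 4 * EMBED_DIM
hidden_1 = dense_params(n_features, 32)
hidden_2 = dense_params(32, 8)
fare = dense_params(8, 1)

print(embedding, hidden_1, hidden_2, fare)
```

Under these assumptions the embedding table dominates the parameter count by a wide margin, which is worth keeping in mind when choosing `NBUCKETS` and the embedding dimension.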

In [None]:
dnn_hidden_units = [32, 8]

# Add hidden Dense layers
for i, num_nodes in enumerate(dnn_hidden_units, start=1):
    deep = Dense(num_nodes, activation="relu", name=f"hidden_{i}")(deep)

# final output is a linear activation because this is regression
output = Dense(1, activation="linear", name="fare")(deep)

#### Instantiate and Compile the Engineered Model

Now we instantiate and compile our new model, which includes the normalization and Lambda layers for feature engineering.

*   `model = keras.Model(inputs=list(inputs.values()), outputs=output)`: We define the model by specifying its inputs (the original raw coordinate `Input` layers) and the final `output` tensor.
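*   `model.compile(optimizer="adam", loss="mse", metrics=[rmse], run_eagerly=True)`: As before, we compile with the Adam optimizer, MSE loss, and our custom RMSE metric. `run_eagerly=True` makes Keras execute the training step eagerly instead of compiling it into a graph, which is easier to debug but typically slower.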
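
The relationship between the loss and the metric can be illustrated in NumPy. `rmse_np` below is a hypothetical stand-in for the notebook's custom `rmse` metric (assumed to be the square root of the mean squared error), and the fares are made-up numbers.

```python
import numpy as np


def rmse_np(y_true, y_pred):
    # Hypothetical stand-in: square root of the mean squared error.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


y_true = np.array([10.0, 12.0, 8.0, 20.0])  # made-up fares
y_pred = np.array([11.0, 10.0, 8.0, 16.0])  # made-up predictions

mse = np.mean((y_pred - y_true) ** 2)
print(mse, rmse_np(y_true, y_pred))
```

RMSE is expressed in the same units as the fare, which makes it easier to interpret than the squared-unit MSE that the optimizer minimizes.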

In [None]:
model = keras.Model(inputs=list(inputs.values()), outputs=output)

# Compile model
model.compile(optimizer="adam", loss="mse", metrics=[rmse], run_eagerly=True)

#### Visualize the Engineered Model Architecture

Let's plot our new model to see how the architecture incorporates the feature engineering steps. Pay attention to:
*   The initial `Input` layers.
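*   How these inputs are fed into the `Normalization` layers (e.g., `normalization_...`). Each scaler is applied to both its pickup and dropoff coordinate, so it is a shared layer with two incoming connections.
*   The `Lambda` layer (named `euclidean`, as set by `name="euclidean"`) taking the normalized coordinates to compute `euclidean_distance`.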
*   The `Discretization` layers operating on the normalized coordinates.
*   The subsequent `HashedCrossing` and `Embedding` layers.
*   The final `Concatenate` layer that brings together the `euclidean_distance` and the `pd_embed` before feeding into the `Dense` stack.

This visualization clearly shows the flow of data through our custom feature engineering pipeline integrated within the Keras model.

Once we've plotted this more complex model, we'll train it and see whether our feature engineering efforts yield an improvement in performance.

In [None]:
# Use the same `keras` namespace as the rest of this section
keras.utils.plot_model(
    model, to_file="dnn_model_engineered.png", show_shapes=False, rankdir="LR"
)

### Train and Evaluate the Engineered Model

The training setup (batch size, number of examples, evaluation steps) will be the same as for the first model. We create new `trainds` and `evalds` datasets (as the previous ones might have been fully consumed or be in an unknown state) and then call `model.fit()`.

After training, we'll plot the RMSE curves again. Compare these curves (and the final RMSE values) to those from the first Wide & Deep model. 

**Things to consider:**
*   Did the RMSE improve (i.e., is it lower)?
*   Does the model converge faster or slower?
*   Is there more or less overfitting?

Feature engineering is an iterative process. Sometimes, engineered features provide a significant boost; other times, their impact might be minimal or even detrimental if not designed carefully. The key is to experiment and evaluate systematically.
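
The arithmetic behind the training schedule can be checked by hand. The sketch below repeats the constants used in this notebook and derives the number of steps per epoch and the size of the evaluation subset.

```python
BATCH_SIZE = 64
NUM_TRAIN_EXAMPLES = 10000 * 10
NUM_EVALS = 10
NUM_EVAL_EXAMPLES = 1000

# Training is split into NUM_EVALS "epochs" so that validation
# runs NUM_EVALS times over the course of training.
steps_per_epoch = NUM_TRAIN_EXAMPLES // (BATCH_SIZE * NUM_EVALS)
examples_seen = steps_per_epoch * BATCH_SIZE * NUM_EVALS

# Integer division drops the partial batch from the evaluation subset.
eval_batches = NUM_EVAL_EXAMPLES // BATCH_SIZE
eval_examples = eval_batches * BATCH_SIZE

print(steps_per_epoch, examples_seen)
print(eval_batches, eval_examples)
```

Because of the integer divisions, slightly fewer than `NUM_TRAIN_EXAMPLES` training examples and `NUM_EVAL_EXAMPLES` evaluation examples are actually used.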

In [None]:
BATCH_SIZE = 64
NUM_TRAIN_EXAMPLES = 10000 * 10  # total examples to train on; the dataset repeats (wraps around)
NUM_EVALS = 10  # how many times to evaluate
NUM_EVAL_EXAMPLES = 1000

In [None]:
trainds = create_dataset(
    pattern="../data/taxi-train.csv", batch_size=BATCH_SIZE, mode="train"
)

evalds = create_dataset(
    pattern="../data/taxi-valid.csv", batch_size=BATCH_SIZE, mode="eval"
).take(NUM_EVAL_EXAMPLES // BATCH_SIZE)

steps_per_epoch = NUM_TRAIN_EXAMPLES // (BATCH_SIZE * NUM_EVALS)

history = model.fit(
    trainds,
    validation_data=evalds,
    epochs=NUM_EVALS,
    steps_per_epoch=steps_per_epoch,
)

In [None]:
RMSE_COLS = ["rmse", "val_rmse"]

pd.DataFrame(history.history)[RMSE_COLS].plot()

Copyright 2025 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.