# Feature Crosses

**Learning Objectives:**
  * Improve a linear regression model by adding synthetic features (this is a continuation of the previous exercise)
  * Use an input function to convert pandas `DataFrame` objects to `Tensors` and invoke the input function in `fit()` and `predict()` operations
  * Use the FTRL optimization algorithm for model training
  * Create new synthetic features through one-hot encoding, binning, and feature crosses

## Setup

First, install the TensorFlow 2.0 beta release used by this notebook.

In [2]:
!pip install -q tensorflow==2.0.0-beta1

mesh-tensorflow          0.0.5                
tensor2tensor            1.11.0               
tensorboard              1.14.0               
tensorboardcolab         0.0.22               
tensorflow               2.0.0b1              
tensorflow-estimator     1.14.0               
tensorflow-hub           0.5.0                
tensorflow-metadata      0.14.0               
tensorflow-probability   0.7.0                


In [0]:
import io
import math
import logging
import numpy as np
import pandas as pd
import tensorflow as tf

from functools import reduce
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
from sklearn import metrics
from IPython.display import display
from datetime import datetime


In [0]:
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
logging.getLogger('tensorflow').disabled = True
%load_ext tensorboard

Load the data from the CSV file into a pandas `DataFrame` and shuffle the rows.

In [0]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")

california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

Next, prepare the input features and targets from the California housing data set.

In [0]:
def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

In [0]:
def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets

In [8]:
# Choose the first 12000 (out of 17000) examples for training.
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_targets = preprocess_targets(california_housing_dataframe.head(12000))

# Choose the last 5000 (out of 17000) examples for validation.
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))

# Double-check that we've done the right thing.
print("Training examples summary:")
display(training_examples.describe())
print("Validation examples summary:")
display(validation_examples.describe())

print("Training targets summary:")
display(training_targets.describe())
print("Validation targets summary:")
display(validation_targets.describe())

Training examples summary:


| | latitude | longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | rooms_per_person |
|---|---|---|---|---|---|---|---|---|---|
| count | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 |
| mean | 35.6 | -119.6 | 28.7 | 2650.5 | 541.5 | 1431.4 | 503.0 | 3.9 | 2.0 |
| std | 2.1 | 2.0 | 12.6 | 2194.4 | 426.5 | 1133.9 | 389.1 | 1.9 | 1.2 |
| min | 32.5 | -124.3 | 2.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.5 | 0.0 |
| 25% | 33.9 | -121.8 | 18.0 | 1468.0 | 297.0 | 791.0 | 282.0 | 2.6 | 1.5 |
| 50% | 34.2 | -118.5 | 29.0 | 2140.0 | 436.0 | 1167.0 | 410.0 | 3.5 | 1.9 |
| 75% | 37.7 | -118.0 | 37.0 | 3158.2 | 650.0 | 1720.0 | 607.0 | 4.7 | 2.3 |
| max | 42.0 | -114.3 | 52.0 | 37937.0 | 6445.0 | 28566.0 | 6082.0 | 15.0 | 55.2 |


Validation examples summary:


| | latitude | longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | rooms_per_person |
|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 35.6 | -119.5 | 28.3 | 2627.1 | 534.3 | 1425.3 | 496.8 | 3.9 | 2.0 |
| std | 2.1 | 2.0 | 12.6 | 2145.0 | 409.2 | 1180.7 | 373.3 | 1.9 | 1.1 |
| min | 32.5 | -124.3 | 1.0 | 12.0 | 4.0 | 8.0 | 2.0 | 0.5 | 0.1 |
| 25% | 33.9 | -121.8 | 18.0 | 1446.0 | 295.0 | 787.8 | 279.0 | 2.6 | 1.5 |
| 50% | 34.2 | -118.5 | 28.0 | 2105.0 | 428.0 | 1166.0 | 407.0 | 3.5 | 1.9 |
| 75% | 37.7 | -118.0 | 37.0 | 3135.0 | 647.0 | 1721.2 | 601.0 | 4.8 | 2.3 |
| max | 42.0 | -114.5 | 52.0 | 30401.0 | 4957.0 | 35682.0 | 4769.0 | 15.0 | 29.4 |


Training targets summary:


| | median_house_value |
|---|---|
| count | 12000.0 |
| mean | 207.2 |
| std | 116.0 |
| min | 15.0 |
| 25% | 118.8 |
| 50% | 180.4 |
| 75% | 264.2 |
| max | 500.0 |


Validation targets summary:


| | median_house_value |
|---|---|
| count | 5000.0 |
| mean | 207.6 |
| std | 116.1 |
| min | 15.0 |
| 25% | 121.5 |
| 50% | 180.1 |
| 75% | 266.8 |
| max | 500.0 |


## FTRL Optimization Algorithm

High-dimensional linear models benefit from a variant of gradient-based optimization called FTRL (Follow The Regularized Leader). The algorithm scales the learning rate separately for each coefficient, which is useful when some features rarely take non-zero values. It is also well suited to L1 regularization. We can apply FTRL with [`tf.keras.optimizers.Ftrl`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/optimizers/Ftrl).
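To make the per-coefficient learning rate concrete, here is a minimal pure-Python sketch of the per-coordinate FTRL-Proximal update for a linear model with squared loss. It is an illustration under stated assumptions, not the implementation behind `tf.keras.optimizers.Ftrl`: the class name `FtrlProximal`, the parameters `alpha` and `beta`, and the toy data are all hypothetical.

```python
import math


class FtrlProximal:
    """Per-coordinate FTRL-Proximal sketch for a linear model, squared loss."""

    def __init__(self, num_features, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * num_features  # adjusted gradient sums
        self.n = [0.0] * num_features  # sums of squared gradients

    def weights(self):
        """Closed-form weights; coordinates with |z| <= l1 are exactly zero."""
        w = []
        for z_i, n_i in zip(self.z, self.n):
            if abs(z_i) <= self.l1:
                w.append(0.0)
            else:
                # The more gradient a coordinate has seen (large n_i), the
                # smaller its effective learning rate becomes.
                step = (self.beta + math.sqrt(n_i)) / self.alpha + self.l2
                w.append(-(z_i - math.copysign(self.l1, z_i)) / step)
        return w

    def update(self, x, y):
        """One training step on a single example (x: list of floats)."""
        w = self.weights()
        error = sum(w_i * x_i for w_i, x_i in zip(w, x)) - y
        for i, x_i in enumerate(x):
            g = error * x_i  # gradient of 0.5 * error**2 w.r.t. w[i]
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g


# Toy data (hypothetical): feature 0 is always active, feature 1 is active
# once every 10 examples, and feature 2 is never active.
model = FtrlProximal(num_features=3, alpha=0.5, l1=0.01)
for step in range(2000):
    rare = 1.0 if step % 10 == 0 else 0.0
    model.update([1.0, rare, 0.0], 2.0 + 3.0 * rare)
final_weights = model.weights()
print(final_weights)
```

In this sketch the rarely active feature accumulates less squared gradient than the always-active one, so it keeps a larger effective step size, and the never-active feature stays at exactly zero because of the L1 threshold.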

To use bucketized features (see below), the model needs an input [`DenseFeatures`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/DenseFeatures) layer whose constructor receives the `feature_columns` parameter (as in TensorFlow 1). If the input has no bucketized columns, we can skip this layer and feed the raw values to the model directly.

In [0]:
# Define Linear regressor layer

class RegressorLayer(tf.keras.layers.Layer):

  def __init__(self, input_dim):
    super(RegressorLayer, self).__init__()
    w_init = tf.zeros_initializer()
    self.w = tf.Variable(initial_value=w_init(shape=(input_dim, 1),
                                              dtype='float32'),
                         trainable=True)
    b_init = tf.zeros_initializer()
    self.b = tf.Variable(initial_value=b_init(shape=(1,),
                                              dtype='float32'),
                         trainable=True)

  def call(self, inputs):
    return tf.add(tf.matmul(inputs, self.w), self.b)

  def compute_output_shape(self, input_shape):
    shape = tf.TensorShape(input_shape).as_list()
    shape[-1] = 1
    return tf.TensorShape(shape)

  def get_config(self):
    base_config = super(RegressorLayer, self).get_config()
    return base_config

  @classmethod
  def from_config(cls, config):
    return cls(**config)

# Define LinearRegressor Model
  
class LinearRegressorModel(tf.keras.models.Model):

  def __init__(self, num_classes, feature_columns):
    super(LinearRegressorModel, self).__init__(name='linear_regressor_model', dynamic=True)
    self.feature_columns = None
    if feature_columns is not None:
      self.feature_columns = [x.source_column.name for x in feature_columns if x.__class__.__name__ != 'EmbeddingColumn']
      self.feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
      num_classes = reduce(lambda y, x: y + int(x.variable_shape[-1]), feature_columns, 0)
    self.regressor = RegressorLayer(num_classes)

  def call(self, inputs):
    if self.feature_columns is not None:
      features = {key:np.array(value) for key,value in zip(self.feature_columns, np.array(inputs).T)}
      inputs = self.feature_layer(features)
    results = self.regressor(inputs)
    return results

In [0]:
def fit_model(learning_rate,
              steps_per_epoch,
              batch_size,
              training_examples,
              training_targets,
              validation_examples,
              validation_targets,
              feature_columns=None):
  """Trains a linear regression model of multiple features.
  Args:
    learning_rate: A `float`, the learning rate.
    steps_per_epoch: A non-zero `int`, the number of training steps per epoch.
      A training step consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    training_examples: A `DataFrame` containing one or more columns from
      `california_housing_dataframe` to use as input features for training.
    training_targets: A `DataFrame` containing exactly one column from
      `california_housing_dataframe` to use as target for training.
    validation_examples: A `DataFrame` containing one or more columns from
      `california_housing_dataframe` to use as input features for validation.
    validation_targets: A `DataFrame` containing exactly one column from
      `california_housing_dataframe` to use as target for validation.
    feature_columns: An optional collection of feature columns. If given, the
      model builds its inputs through a `DenseFeatures` layer.

  Returns:
    The trained `LinearRegressorModel`.
  """
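  epochs = 10

  if feature_columns is not None:
    # The model pairs input columns with feature names by position, so keep
    # only the source columns of the feature columns, in the same order.
    feature_columns = list(feature_columns)
    column_names = [c.source_column.name for c in feature_columns
                    if c.__class__.__name__ != 'EmbeddingColumn']
    training_examples = training_examples[column_names]
    validation_examples = validation_examples[column_names]

  model = LinearRegressorModel(training_examples.shape[1], feature_columns)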
  model.compile(optimizer=tf.keras.optimizers.Ftrl(learning_rate=learning_rate, clipnorm=5.0),
                loss='mse',
                metrics=[tf.keras.metrics.RootMeanSquaredError()])

  def log(epoch, logs):
    root_mean_squared_error = logs["root_mean_squared_error"]
    print("  epoch %02d : %0.2f" % (epoch, root_mean_squared_error))

  model_callback = tf.keras.callbacks.LambdaCallback(
      on_epoch_end=lambda epoch, logs: log(epoch, logs))
  # Write logs under logs/feature_crosses so the %tensorboard cells can find them.
  logdir = "logs/feature_crosses/scalars/" + datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logdir,
                                                        #histogram_freq=1,
                                                        update_freq='epoch')
  
  print("Train model...")
  print("RMSE (on training data):")
  _ = model.fit(training_examples.values,
            training_targets.values,
            validation_data=(validation_examples.values, validation_targets.values),
            epochs=epochs,
            steps_per_epoch=steps_per_epoch,
            batch_size=batch_size,
            callbacks=[model_callback, tensorboard_callback],
            verbose=0)
  print("Model training finished.")
  return model


In [0]:
!rm -rf logs/feature_crosses

In [18]:
model = fit_model(
    learning_rate=0.00000001,
    steps_per_epoch=100,
    batch_size=20,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

Train model...
RMSE (on training data):
  epoch 00 : 237.40
  epoch 01 : 237.40
  epoch 02 : 237.40
  epoch 03 : 237.40
  epoch 04 : 237.40
  epoch 05 : 237.40
  epoch 06 : 237.40
  epoch 07 : 237.40
  epoch 08 : 237.40
  epoch 09 : 237.40
Model training finished.


In [19]:
%tensorboard --logdir logs/feature_crosses

## One-Hot Encoding for Discrete Features

Discrete features (e.g. strings, enumerations, integers) are usually converted into families of binary features before training a linear or logistic regression model.

For example, suppose we created a synthetic feature that can take any of the values `0`, `1` or `2`, and that we have a few training points:

| # | feature_value |
|---|---------------|
| 0 |             2 |
| 1 |             0 |
| 2 |             1 |

For each possible categorical value, we make a new **binary** feature of **real values** that can take one of just two possible values: 1.0 if the example has that value, and 0.0 if not. In the example above, the categorical feature would be converted into three features, and the training points now look like:

| # | feature_value_0 | feature_value_1 | feature_value_2 |
|---|-----------------|-----------------|-----------------|
| 0 |             0.0 |             0.0 |             1.0 |
| 1 |             1.0 |             0.0 |             0.0 |
| 2 |             0.0 |             1.0 |             0.0 |
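The conversion in the two tables above can be sketched in a few lines of plain Python. The helper name `one_hot` is hypothetical and is used here only for illustration.

```python
def one_hot(value, num_values):
    """Returns a list of num_values floats with 1.0 at index `value`."""
    if not 0 <= value < num_values:
        raise ValueError("value %d is outside [0, %d)" % (value, num_values))
    return [1.0 if i == value else 0.0 for i in range(num_values)]


# The three training points from the first table.
feature_values = [2, 0, 1]
encoded = [one_hot(v, num_values=3) for v in feature_values]
for row in encoded:
    print(row)
# [0.0, 0.0, 1.0]
# [1.0, 0.0, 0.0]
# [0.0, 1.0, 0.0]
```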

## Bucketized (Binned) Features

Bucketization is also known as binning.

We can bucketize `population` into the following 3 buckets (for instance):
- `bucket_0` (`< 5000`): corresponding to less populated blocks
- `bucket_1` (`5000 - 25000`): corresponding to mid populated blocks
- `bucket_2` (`> 25000`): corresponding to highly populated blocks

Given the preceding bucket definitions, the following `population` vector:

    [[10001], [42004], [2500], [18000]]

becomes the following bucketized feature vector:

    [[1], [2], [0], [1]]

The feature values are now the bucket indices. Note that these indices are considered to be discrete features. Typically, they are further converted into one-hot representations as above, but this is done transparently.
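The `population` example above can be reproduced with the standard-library `bisect` module. This is a minimal sketch: the helper name `bucketize` is hypothetical, and placing a value that equals a boundary in the upper bucket is an assumption, since the bucket definitions above do not say which side the boundaries belong to.

```python
import bisect


def bucketize(values, boundaries):
    """Maps each value to the index of the bucket it falls into.

    `boundaries` must be sorted. A value equal to a boundary is placed in
    the upper bucket (an assumption of this sketch).
    """
    return [bisect.bisect_right(boundaries, v) for v in values]


population = [10001, 42004, 2500, 18000]
boundaries = [5000, 25000]  # two boundaries define three buckets
bucket_indices = bucketize(population, boundaries)
print(bucket_indices)  # [1, 2, 0, 1]
```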

To define feature columns for bucketized features, we can use [`bucketized_column`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/feature_column/bucketized_column), which takes a numeric column as input and transforms it into a bucketized feature using the bucket boundaries given in the `boundaries` argument. The following code defines bucketized feature columns for `households` and `longitude`. The `get_quantile_based_boundaries` function calculates boundaries based on quantiles, so that each bucket contains roughly the same number of elements.

In [0]:
def get_quantile_based_boundaries(feature_values, num_buckets):
  boundaries = np.arange(1.0, num_buckets) / num_buckets
  quantiles = feature_values.quantile(boundaries)
  return [quantiles[q] for q in quantiles.keys()]

# Divide households into 7 buckets.
households = tf.feature_column.numeric_column("households")
bucketized_households = tf.feature_column.bucketized_column(
  households, boundaries=get_quantile_based_boundaries(
    california_housing_dataframe["households"], 7))

# Divide longitude into 10 buckets.
longitude = tf.feature_column.numeric_column("longitude")
bucketized_longitude = tf.feature_column.bucketized_column(
  longitude, boundaries=get_quantile_based_boundaries(
    california_housing_dataframe["longitude"], 10))

## Task 1: Train the Model on Bucketized Feature Columns
**Bucketize all the real-valued features in our example, train the model, and see if the results improve.**

In the preceding code block, two real-valued columns (namely `households` and `longitude`) have been transformed into bucketized feature columns. Your task is to bucketize the rest of the columns, then run the code to train the model. There are various heuristics for choosing the bucket ranges. This exercise uses a quantile-based technique, which chooses the bucket boundaries so that each bucket has roughly the same number of examples.
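As a quick check of the quantile idea, the following standard-library sketch computes quantile boundaries for a hypothetical list of values and counts how many values land in each bucket. The helper `quantile_boundaries` and the sample data are assumptions made for illustration; the notebook itself uses pandas in `get_quantile_based_boundaries`.

```python
import bisect
import statistics


def quantile_boundaries(values, num_buckets):
    """Returns num_buckets - 1 cut points taken at evenly spaced quantiles."""
    return statistics.quantiles(values, n=num_buckets, method="inclusive")


values = list(range(1, 101))  # hypothetical feature values 1..100
boundaries = quantile_boundaries(values, num_buckets=4)

# Count how many values fall into each of the four buckets.
counts = [0] * 4
for v in values:
    counts[bisect.bisect_right(boundaries, v)] += 1

print(boundaries)  # [25.75, 50.5, 75.25]
print(counts)      # [25, 25, 25, 25]
```

With real data that contains many repeated values, the buckets can end up with unequal counts, because all copies of a value fall on the same side of a boundary.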

In [0]:
def construct_feature_columns():
  """Construct the TensorFlow Feature Columns.

  Returns:
    A set of feature columns
  """ 
  households = tf.feature_column.numeric_column("households")
  longitude = tf.feature_column.numeric_column("longitude")
  latitude = tf.feature_column.numeric_column("latitude")
  housing_median_age = tf.feature_column.numeric_column("housing_median_age")
  median_income = tf.feature_column.numeric_column("median_income")
  rooms_per_person = tf.feature_column.numeric_column("rooms_per_person")
  
  # Divide households into 7 buckets.
  bucketized_households = tf.feature_column.bucketized_column(
    households, boundaries=get_quantile_based_boundaries(
      training_examples["households"], 7))

  # Divide longitude into 10 buckets.
  bucketized_longitude = tf.feature_column.bucketized_column(
    longitude, boundaries=get_quantile_based_boundaries(
      training_examples["longitude"], 10))
  
  # Divide latitude into 10 buckets.
  bucketized_latitude = tf.feature_column.bucketized_column(
    latitude, boundaries=get_quantile_based_boundaries(
      training_examples["latitude"], 10))

  # Divide housing_median_age into 7 buckets.
  bucketized_housing_median_age = tf.feature_column.bucketized_column(
    housing_median_age, boundaries=get_quantile_based_boundaries(
      training_examples["housing_median_age"], 7))
  
  # Divide median_income into 7 buckets.
  bucketized_median_income = tf.feature_column.bucketized_column(
    median_income, boundaries=get_quantile_based_boundaries(
      training_examples["median_income"], 7))
  
  # Divide rooms_per_person into 7 buckets.
  bucketized_rooms_per_person = tf.feature_column.bucketized_column(
    rooms_per_person, boundaries=get_quantile_based_boundaries(
      training_examples["rooms_per_person"], 7))
  
  feature_columns = [
    bucketized_longitude,
    bucketized_latitude,
    bucketized_housing_median_age,
    bucketized_households,
    bucketized_median_income,
    bucketized_rooms_per_person]
  
  return feature_columns

**NOTE**

When the input contains bucketized columns, we have to pass these columns to the model through the `feature_columns` parameter.

In [22]:
_ = fit_model(
    learning_rate=0.001,
    steps_per_epoch=500,
    batch_size=100,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets,
    feature_columns=construct_feature_columns())

Train model...
RMSE (on training data):
  epoch 00 : 237.33
  epoch 01 : 237.25
  epoch 02 : 237.20
  epoch 03 : 237.17
  epoch 04 : 237.13
  epoch 05 : 237.10
  epoch 06 : 237.08
  epoch 07 : 237.05
  epoch 08 : 237.03
  epoch 09 : 237.01
Model training finished.


In [23]:
%tensorboard --logdir logs/feature_crosses

Reusing TensorBoard on port 6006 (pid 300), started 0:00:47 ago. (Use '!kill 300' to kill it.)

## Feature Crosses

Crossing two (or more) features is a clever way to learn non-linear relations using a linear model. In our problem, if we use only the feature `latitude` for learning, the model might learn that city blocks at a particular latitude (or within a particular range of latitudes, since we have bucketized it) are more likely to be expensive than others. The same holds for the feature `longitude`. However, if we cross `longitude` with `latitude`, the crossed feature represents a well-defined city block. If the model learns that certain city blocks (within a range of latitudes and longitudes) are more likely to be expensive than others, that is a stronger signal than the two features considered individually.
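A small worked example shows what the cross buys. Assume a hypothetical 2 x 2 grid of mean house values in which the expensive cells sit on one diagonal. A model with one weight per latitude bucket and one per longitude bucket can only add a row effect to a column effect, whereas a crossed model has one weight per cell. The numbers below are made up for illustration.

```python
# Hypothetical mean house values (in thousands) on a 2 x 2 grid of
# latitude bucket (rows) by longitude bucket (columns).
prices = [[100.0, 200.0],
          [200.0, 100.0]]

grand_mean = sum(sum(row) for row in prices) / 4
row_means = [sum(row) / 2 for row in prices]
col_means = [(prices[0][j] + prices[1][j]) / 2 for j in range(2)]

# Best least-squares fit without a cross on this balanced grid:
# row effect + column effect.
additive_fit = [[row_means[i] + col_means[j] - grand_mean for j in range(2)]
                for i in range(2)]

# With a cross, every (latitude, longitude) cell has its own weight, so the
# best fit is simply the value of each cell.
crossed_fit = [[prices[i][j] for j in range(2)] for i in range(2)]

print(additive_fit)  # [[150.0, 150.0], [150.0, 150.0]]
print(crossed_fit)   # [[100.0, 200.0], [200.0, 100.0]]
```

Without the cross, the model predicts the overall mean everywhere and misses the pattern entirely; with the cross, it captures it exactly.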

Currently, the feature columns API only supports discrete features for crosses. To cross two continuous values, like `latitude` and `longitude`, we can bucketize them first.

If we cross the `latitude` and `longitude` features (supposing, for example, that `longitude` was bucketized into `2` buckets, while `latitude` has `3` buckets), we actually get six crossed binary features. Each of these features will get its own separate weight when we train the model.
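The six crossed features can be enumerated explicitly. The sketch below assigns one index to every (longitude bucket, latitude bucket) pair and one-hot encodes it; the helper `cross_one_hot` and the index layout are assumptions chosen for illustration.

```python
NUM_LONGITUDE_BUCKETS = 2
NUM_LATITUDE_BUCKETS = 3


def cross_one_hot(longitude_bucket, latitude_bucket):
    """One-hot vector over all (longitude, latitude) bucket pairs."""
    index = longitude_bucket * NUM_LATITUDE_BUCKETS + latitude_bucket
    size = NUM_LONGITUDE_BUCKETS * NUM_LATITUDE_BUCKETS
    return [1.0 if i == index else 0.0 for i in range(size)]


print(cross_one_hot(0, 0))  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(cross_one_hot(1, 2))  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

Each position in the vector corresponds to one grid cell, and a linear model learns a separate weight for it.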

## Task 2: Train the Model Using Feature Crosses

**Add a feature cross of `longitude` and `latitude` to your model, train it, and determine whether the results improve.**

Refer to the TensorFlow API docs for [`crossed_column()`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/feature_column/crossed_column) to build the feature column for your cross. Use a `hash_bucket_size` of `1000`.
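The role of `hash_bucket_size` can be illustrated with a short sketch: instead of enumerating every combination, the combination of bucket indices is hashed and reduced modulo the number of hash buckets. The function `hashed_cross` and the use of MD5 are assumptions made for this illustration; they are not the hash that TensorFlow uses internally.

```python
import hashlib


def hashed_cross(bucket_indices, hash_bucket_size):
    """Maps a tuple of bucket indices to an id in [0, hash_bucket_size)."""
    key = "_X_".join(str(i) for i in bucket_indices).encode("utf-8")
    digest = hashlib.md5(key).digest()  # stand-in hash, chosen for the sketch
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


# Cross of longitude bucket 3 and latitude bucket 7 (hypothetical values).
cross_id = hashed_cross((3, 7), hash_bucket_size=1000)
print(cross_id)
```

Different combinations can collide on the same id. A larger `hash_bucket_size` makes collisions less likely at the cost of more weights to learn.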

To use a crossed column in a Keras model, you need to wrap it in an [`embedding_column`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/feature_column/embedding_column), which converts it from a sparse to a dense representation.
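Conceptually, an embedding turns a sparse id into a dense vector by looking up one row of a table. The sketch below shows only that lookup, under stated assumptions: the helper names, the random initial values, and the dimension of 4 are hypothetical, and in a real model the table rows are trained along with the other weights.

```python
import random


def make_embedding_table(num_ids, dimension, seed=0):
    """Creates a num_ids x dimension table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dimension)]
            for _ in range(num_ids)]


def embed(sparse_id, table):
    """Dense representation of one sparse id: the matching table row."""
    return table[sparse_id]


table = make_embedding_table(num_ids=1000, dimension=4)
dense = embed(42, table)
print(len(dense))  # 4
```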

In [0]:
def construct_feature_columns():
  """Construct the TensorFlow Feature Columns.

  Returns:
    A set of feature columns
  """ 
  households = tf.feature_column.numeric_column("households")
  longitude = tf.feature_column.numeric_column("longitude")
  latitude = tf.feature_column.numeric_column("latitude")
  housing_median_age = tf.feature_column.numeric_column("housing_median_age")
  median_income = tf.feature_column.numeric_column("median_income")
  rooms_per_person = tf.feature_column.numeric_column("rooms_per_person")
  
  # Divide households into 7 buckets.
  bucketized_households = tf.feature_column.bucketized_column(
    households, boundaries=get_quantile_based_boundaries(
      training_examples["households"], 7))

  # Divide longitude into 10 buckets.
  bucketized_longitude = tf.feature_column.bucketized_column(
    longitude, boundaries=get_quantile_based_boundaries(
      training_examples["longitude"], 10))
  
  # Divide latitude into 10 buckets.
  bucketized_latitude = tf.feature_column.bucketized_column(
    latitude, boundaries=get_quantile_based_boundaries(
      training_examples["latitude"], 10))

  # Divide housing_median_age into 7 buckets.
  bucketized_housing_median_age = tf.feature_column.bucketized_column(
    housing_median_age, boundaries=get_quantile_based_boundaries(
      training_examples["housing_median_age"], 7))
  
  # Divide median_income into 7 buckets.
  bucketized_median_income = tf.feature_column.bucketized_column(
    median_income, boundaries=get_quantile_based_boundaries(
      training_examples["median_income"], 7))
  
  # Divide rooms_per_person into 7 buckets.
  bucketized_rooms_per_person = tf.feature_column.bucketized_column(
    rooms_per_person, boundaries=get_quantile_based_boundaries(
      training_examples["rooms_per_person"], 7))
  
  # Pass the keys as a list so that their order is deterministic.
  long_x_lat = tf.feature_column.embedding_column(tf.feature_column.crossed_column(
    [bucketized_longitude, bucketized_latitude], hash_bucket_size=1000), 1000)
  
  feature_columns = set([
    bucketized_longitude,
    bucketized_latitude,
    bucketized_housing_median_age,
    bucketized_households,
    bucketized_median_income,
    bucketized_rooms_per_person,
    long_x_lat])
  
  return feature_columns

In [25]:
_ = fit_model(
    learning_rate=0.001,
    steps_per_epoch=500,
    batch_size=100,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets,
    feature_columns=construct_feature_columns())

Train model...
RMSE (on training data):
  epoch 00 : 237.05
  epoch 01 : 236.21
  epoch 02 : 235.35
  epoch 03 : 234.50
  epoch 04 : 233.65
  epoch 05 : 232.80
  epoch 06 : 231.96
  epoch 07 : 231.13
  epoch 08 : 230.30
  epoch 09 : 229.47
Model training finished.


In [26]:
%tensorboard --logdir logs/feature_crosses

Reusing TensorBoard on port 6006 (pid 300), started 0:01:43 ago. (Use '!kill 300' to kill it.)

## Optional Challenge: Try Out More Synthetic Features

So far, we've tried simple bucketized columns and feature crosses, but there are many more combinations that could potentially improve the results. For example, you could cross multiple columns. What happens if you vary the number of buckets? What other synthetic features can you think of? Do they improve the model?