# Feature engineering with TensorFlow (notebook 02)

These are my personal notes on the "Google Cloud - Feature Engineering" course on Coursera (https://www.coursera.org/learn/feature-engineering). This notebook continues where notebook 1 in this repo left off. Based on the same housing price dataset, it covers feature crosses and embeddings in TensorFlow.

## Approach

We will compare (the evaluation set loss of) **four** models for predicting house prices, where the first model serves as baseline for the others: 

1. Linear regressor // Data: house location (bucketized longitude and latitude), median income of inhabitants, house properties (i.e., median age, rooms per house, bedrooms per room)
2. Linear regressor // Data: Adding longitude x latitude feature crosses 
3. Linear regressor // Data: Substituting feature crosses with embeddings of longitude x latitude feature crosses  
4. DNN regressor    // Data: Substituting feature crosses with embeddings of longitude x latitude feature crosses  

We will use four general functions: 

- **add_features(df):** can be used to add additional features to the dataset (i.e., by combining existing features)
- **make_input_fn(df, num_epochs):** creates a node in the computation graph that feeds the data. It calls add_features(df)
- **create_feature_cols():** defines which features are passed to the model (and does some feature transformation, like one-hot-encoding) 
- **train_and_evaluate(output_dir, num_train_steps):** runs training and evaluation when called. Instantiates a model (linear regressor or DNN regressor) and calls the previous three functions  

## A brief look at the data again 

In [1]:
# importing packages 
import itertools
import tensorflow as tf 
import tensorflow.feature_column as fc 
import pandas as pd
import numpy as np

In [2]:
# importing data into a pandas dataframe
df = pd.read_csv("data/california_housing_train.csv")  # forward slash avoids the invalid "\c" escape and works on all platforms

#### Columns in the dataset
- **longitude and latitude** -- long and lat values for the US West Coast area
- **housing_median_age** -- median age of houses in the area
- **total_rooms** -- total number of rooms of all houses in the area 
- **total_bedrooms** -- total number of bedrooms of all houses in the area
- **population** -- total number of people living in the area
- **households** -- total number of households in the area
- **median_income** -- median income 
- **median_house_value** -- **value to be predicted**

In [3]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


#### A brief look at the summary statistics
- no empty cells (17,000 entries for each column)
- house values range between USD 15,000 - 500,000
- longitude values range from -124 to -114
- latitude values range from 32.5 to 42

In [4]:
df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
| std | 2.005166 | 2.13734 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.79 | 33.93 | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.566375 | 119400.0 |
| 50% | -118.49 | 34.25 | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.5446 | 180400.0 |
| 75% | -118.0 | 37.72 | 37.0 | 3151.25 | 648.25 | 1721.0 | 605.25 | 4.767 | 265000.0 |
| max | -114.31 | 41.95 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [5]:
# splitting data into train and evaluation set 
def split_train_dev(train_split, df):
    train_df = df.sample(frac=train_split,random_state=1)
    dev_df = df.drop(train_df.index)
    return train_df, dev_df

train_df, eval_df = split_train_dev(0.8, df)

## Setting up the linear and dnn regressor 

In [16]:
# SHARED INPUT FUNCTIONS 

# input function 
def make_input_fn(df, num_epochs):
    return tf.estimator.inputs.pandas_input_fn(
        x = add_features(df),
        y = df['median_house_value'] / 100000, # scale the target to units of USD 100,000
        batch_size = 128,
        num_epochs = num_epochs,
        shuffle = True,
        queue_capacity = 1000,
        num_threads = 1
    )

# LINEAR REGRESSOR 

# Create estimator train and evaluate function
def linear_train_and_evaluate(output_dir, num_train_steps):
    
    # Specify output directory  
    run_config = tf.estimator.RunConfig(
                 model_dir=output_dir,      
                 save_summary_steps=100,                       
                 save_checkpoints_steps=100)   # dictates max frequency of eval 
    
    # specify model 
    estimator = tf.estimator.LinearRegressor(config=run_config,
                                             feature_columns = create_feature_cols())
    
    # specify train set
    train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(train_df, None), 
                                             max_steps = num_train_steps)
    
    # specify eval set 
    eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(eval_df, 1), 
                                    steps = None, 
                                    throttle_secs = 5)  # evaluate at most once every 5 seconds
    
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    
# DNN REGRESSOR 

def dnn_train_and_evaluate(output_dir, num_train_steps):
    
    # Specify output directory  
    run_config = tf.estimator.RunConfig(
                 model_dir=output_dir,      
                 save_summary_steps=100,                       
                 save_checkpoints_steps=100)   # dictates max frequency of eval 
    
    # specify model 
    estimator = tf.estimator.DNNRegressor(hidden_units=[75, 25, 7],
                                          config=run_config,
                                          feature_columns = create_feature_cols(), 
                                          batch_norm=True)
    
    # specify train set
    train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(train_df, None), 
                                             max_steps = num_train_steps)
    
    # specify eval set 
    eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(eval_df, 1), 
                                    steps = None, 
                                    throttle_secs = 5)  # evaluate at most once every 5 seconds
    
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

## Feature crosses

**Idea:** Combine input features in such a way that the model does not have to explicitly learn their interactions/dependencies, but rather receives feature combinations as input.

**For example,** in this dataset we have data for **latitude and longitude** for each sample. We then proceeded and put these values into **discrete buckets** grouping floating point values together roughly every 0.5 degrees. During training, the model "learns" weights to multiply these input features with to optimize the prediction loss.  

**The issue with this method:** Let's assume there is a certain quadrant of land (at longitude 110.0-110.5 / latitude 33.0-33.5) where property prices are extremely high. In this case, the model cannot simply assign a very high weight to the longitude 110.0-110.5 bucket if, slightly to the north of the high-value quadrant (let's say at longitude 110.0-110.5 / latitude 34.0-34.5), house prices are for some reason low. In other words, **only the combination of these two specific coordinates indicates high prices in this case.** While a linear model will not be able to "learn" this, a deep neural network can, though at the cost of complexity (i.e., compute power).

If we instead **create new feature columns,** where each column is the product of one specific longitude bucket and one specific latitude bucket (i.e., output == 1 if a property falls into that specific area), then the model can attach a weight to every single quadrant described in this way. This means that even a simple linear model can learn to predict high prices for some quadrants and lower prices for others.
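To make the idea concrete, here is a minimal pure-Python sketch of what a crossed one-hot feature looks like. It is not the TensorFlow implementation; the helper names `bucket_index` and `cross_one_hot` and the boundary values are made up for illustration.

```python
import bisect

def bucket_index(value, boundaries):
    """Index of the bucket that `value` falls into.

    With boundaries [b0, b1], the buckets are (-inf, b0), [b0, b1), [b1, +inf).
    """
    return bisect.bisect_right(boundaries, value)

def cross_one_hot(lon, lat, lon_bounds, lat_bounds):
    """One-hot vector with one slot per (latitude bucket, longitude bucket) pair."""
    n_lon = len(lon_bounds) + 1
    n_lat = len(lat_bounds) + 1
    vec = [0] * (n_lon * n_lat)
    vec[bucket_index(lat, lat_bounds) * n_lon + bucket_index(lon, lon_bounds)] = 1
    return vec

# Hypothetical boundaries: 3 longitude buckets x 3 latitude buckets = 9 areas
lon_bounds = [-120.0, -118.0]
lat_bounds = [34.0, 36.0]

house = cross_one_hot(lon=-119.0, lat=35.0, lon_bounds=lon_bounds, lat_bounds=lat_bounds)
print(house)  # [0, 0, 0, 0, 1, 0, 0, 0, 0]
```

A linear model on this input has one weight per area, so it can price each area independently of its neighbours.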

<img src='images\feature_cross.jpg' width='1000' height='1000'/>

We can thus expect the linear model to close the performance gap to the neural network in our example (and potentially our neural net to converge faster and require less depth/complexity).

**The risk of feature crosses:** the model can overfit if we present too many combinations of the same data.

## Feature crosses in tensorflow

The method **tf.feature_column.crossed_column([<cat_column>, <cat_column>], hash_bucket_size)** requires a list of categorical input columns. The hash bucket size determines how many buckets (i.e., output columns) the feature cross combinations are distributed over.*
- If the number of buckets is smaller than the number of category combinations from the feature columns, then multiple combinations necessarily fall into the same column. This forces the model to generalize more.
- If the number of buckets matches the number of category combinations, there is on average one combination per column. Because the assignment is done by hashing, however, some combinations will typically still share a column while other columns stay empty.
- If the number of buckets is much larger than the number of category combinations, collisions become rare and almost every combination gets its own column. This allows the model a higher degree of "memorizing" the training data.

*Intuitively, the method computes a hash for each feature cross combination and takes it modulo the number of buckets; the result is the bucket the combination is assigned to. A given combination therefore always lands in the same bucket.
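The hash-then-modulo idea can be sketched in a few lines of plain Python. This is only an illustration: it uses MD5 as a stand-in hash, which is an assumption of this sketch and not the hash function TensorFlow uses, and the helper names `hashed_bucket` and `columns_used` are made up.

```python
import hashlib
from itertools import product

def hashed_bucket(lat_bucket, lon_bucket, hash_bucket_size):
    """Assign a (lat, lon) bucket pair to one of `hash_bucket_size` columns."""
    key = f"{lat_bucket}_x_{lon_bucket}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()  # stand-in hash, NOT TensorFlow's
    return int(digest, 16) % hash_bucket_size

def columns_used(n_lat, n_lon, hash_bucket_size):
    """Number of distinct columns hit by all n_lat * n_lon combinations."""
    return len({hashed_bucket(i, j, hash_bucket_size)
                for i, j in product(range(n_lat), range(n_lon))})

# 30 x 30 = 900 combinations hashed into fewer, equally many, and more columns
for size in (100, 900, 9000):
    print(size, columns_used(30, 30, size))
```

Comparing the printed counts shows how many columns are actually occupied for each bucket size, and therefore how many combinations share a column.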

## Baseline model: Linear regressor without feature crosses

In [7]:
# creating some input features 
def add_features(df):
    df['avg_rooms_per_house'] = df['total_rooms'] / df['households'] #expect positive correlation
    df['avg_persons_per_room'] = df['population'] / df['total_rooms'] #expect negative correlation
    df['avg_bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms'] # expect negative correlation 
    return df


# Defining which features to include and bucketizing longitude and latitude
def create_feature_cols():
    
    # define the bucket boundaries for longitude and latitude (30 boundaries give 31 buckets each)
    num_buckets = 30 
    long_buckets = np.linspace(-124.0, -114.5, num_buckets).tolist()
    lat_buckets = np.linspace(32.0, 42, num_buckets).tolist()
    
    # define input features 
    return [
    fc.bucketized_column(fc.numeric_column('longitude'), 
                         boundaries = long_buckets),  
    fc.bucketized_column(fc.numeric_column('latitude'), 
                         boundaries = lat_buckets),
    fc.numeric_column('median_income'),
    fc.numeric_column('housing_median_age'),
    fc.numeric_column('avg_rooms_per_house'),
    fc.numeric_column('avg_persons_per_room'),
    fc.numeric_column('avg_bedrooms_per_room')
    ]
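One detail worth checking in the cell above: `np.linspace(..., num_buckets)` returns `num_buckets` boundaries, and n boundaries split the number line into n + 1 buckets (including the two open-ended ones at either end). The sketch below reproduces this in plain Python; `linspace` is a small stand-in for `np.linspace`, written here only so the example has no dependencies.

```python
import bisect

def linspace(start, stop, num):
    """Plain-Python stand-in for np.linspace(start, stop, num)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

num_buckets = 30
long_boundaries = linspace(-124.0, -114.5, num_buckets)
bucket_width = long_boundaries[1] - long_boundaries[0]

# bucket index = number of boundaries that are <= value
westmost = bisect.bisect_right(long_boundaries, -124.35)  # min longitude in the data
eastmost = bisect.bisect_right(long_boundaries, -114.31)  # max longitude in the data

print(len(long_boundaries))    # 30 boundaries
print(westmost, eastmost)      # 0 and 30, i.e. 31 distinct buckets
print(round(bucket_width, 2))  # roughly a third of a degree per bucket
```

With 31 buckets per axis there are 31 x 31 = 961 possible longitude/latitude combinations, slightly more than the `num_buckets**2 = 900` used elsewhere in this notebook.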

In [9]:
# prevent verbose output
tf.logging.set_verbosity(tf.logging.FATAL)

# run baseline model
linear_train_and_evaluate(output_dir='CHECKPOINTS/feat_eng_02/model_base', num_train_steps = 3000)

Here is the baseline loss achieved by the first model on the evaluation set

<img src='images\feature_eng2_base.PNG' width='600' height='600'/>
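Because `make_input_fn` divides the target by 100,000, the loss is expressed in those scaled units. Assuming the plotted metric is the estimator's mean squared error, it can be converted back to a root mean squared error in USD as sketched below. The loss value used here is a made-up placeholder, not a result from this notebook.

```python
import math

TARGET_SCALE = 100000  # same factor used to scale median_house_value in make_input_fn

def rmse_in_usd(average_loss, scale=TARGET_SCALE):
    """Convert a mean squared error on the scaled target into an RMSE in USD."""
    return math.sqrt(average_loss) * scale

example_loss = 0.25  # hypothetical value, for illustration only
print(rmse_in_usd(example_loss))  # 50000.0
```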

## Model 2: Linear regressor with feature cross

Below, we are adding a feature cross between the bucketized values of longitude and latitude. Each new feature consists of a unique combination of longitude (e.g., 123.5-124) and latitude (e.g., 34.5-35) that in this case make up a physical area on a map. Each sample from the housing data falls into exactly one of those areas.  

In [10]:
# creating new input features 
def add_features(df):
    df['avg_rooms_per_house'] = df['total_rooms'] / df['households'] #expect positive correlation
    df['avg_persons_per_room'] = df['population'] / df['total_rooms'] #expect negative correlation
    df['avg_bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms'] # expect negative correlation 
    return df


# Add feature cross
def create_feature_cols():
    num_buckets = 30 
    long_buckets = np.linspace(-124.0, -114.5, num_buckets).tolist()
    lat_buckets = np.linspace(32.0, 42, num_buckets).tolist()
    
    b_long = fc.bucketized_column(fc.numeric_column('longitude'), long_buckets)  
    b_lat = fc.bucketized_column(fc.numeric_column('latitude'), lat_buckets)
    
    return [
    # add feature cross
    fc.crossed_column([b_lat, b_long], num_buckets**2), 
    # add other features 
    fc.numeric_column('median_income'),
    fc.numeric_column('housing_median_age'),
    fc.numeric_column('avg_rooms_per_house'),
    fc.numeric_column('avg_persons_per_room'),
    fc.numeric_column('avg_bedrooms_per_room')
    ]

In [11]:
# run model 2
linear_train_and_evaluate(output_dir='CHECKPOINTS/feat_eng_02/model_2', num_train_steps = 3000)

From the average loss recorded during training (chart below), we see a clear improvement from adding the feature crosses (light blue line) relative to the original baseline (dark blue line).

<img src='images\feature_eng2_model_2.PNG' width='600' height='600'/>

## Embeddings

**Why embeddings:** 

The issue with feature crosses is that they create a very **sparse encoding**, i.e., for each sample in the dataset we have hundreds of possible longitude / latitude bucket combinations, of which exactly one has a value of 1 (i.e., the bucket that corresponds to the area where the house is located) while all the others have a value of 0. Embeddings translate this sparse representation into something more dense (and meaningful):

Rather than feeding the regressor with all the hundreds of feature cross values, we run those through a dense layer with one or more neurons, which then feed into the network. Like any other parameter, the model trains the weights of this dense layer with respect to the objective function (i.e., in our case minimizing the loss from the difference between predicted and actual house prices).  

Each embedding feature is a real floating point number (the weighted sum of feature crosses). Feature crosses (in our case areas of land) that are similar to each other in ways that determine house prices, will receive similar values from this embedding exercise. Crucially, when looking at a new sample, the weights applied to its location will make the model treat it similarly to other houses from the same location (at least with respect to the impact of location on house prices). 
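The "weighted sum of feature crosses" can be illustrated with a small pure-Python sketch. Because the input is one-hot, passing it through a dense layer (without bias or activation) is the same as looking up one row of the weight matrix. The sizes and the random weights below are made up; in the real model the weights are learned during training.

```python
import random

random.seed(0)

num_areas = 6        # hypothetical number of feature-cross columns
embedding_dim = 2    # each area is represented by 2 floats

# Embedding weights: one row per area (randomly initialised here, trained in practice)
weights = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
           for _ in range(num_areas)]

def one_hot(index, size):
    """Sparse encoding: 1.0 at `index`, 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dense_layer(inputs, weights):
    """Weighted sum of the inputs for each output neuron (no bias, no activation)."""
    return [sum(x * row[j] for x, row in zip(inputs, weights))
            for j in range(len(weights[0]))]

area = 4
via_dense_layer = dense_layer(one_hot(area, num_areas), weights)
via_lookup = weights[area]

print(via_dense_layer == via_lookup)  # True: the dense layer just selects row 4
```

Two areas with similar rows in `weights` are treated similarly by everything downstream, which is what makes the dense representation meaningful.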

Embeddings are a critical part of recommendation engines (e.g., movies on Netflix) and natural language models (e.g., Google Translate).

### Substituting the raw feature crosses with embeddings in our input features

In [17]:
# creating new input features 
def add_features(df):
    df['avg_rooms_per_house'] = df['total_rooms'] / df['households'] #expect positive correlation
    df['avg_persons_per_room'] = df['population'] / df['total_rooms'] #expect negative correlation
    df['avg_bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms'] # expect negative correlation 
    return df


# Add an embedding of the feature cross
def create_feature_cols():
    num_buckets = 30 
    long_buckets = np.linspace(-124.0, -114.5, num_buckets).tolist()
    lat_buckets = np.linspace(32.0, 42, num_buckets).tolist()
    
    b_long = fc.bucketized_column(fc.numeric_column('longitude'), long_buckets)  
    b_lat = fc.bucketized_column(fc.numeric_column('latitude'), lat_buckets)
    
    feature_cross = fc.crossed_column([b_lat, b_long], num_buckets**2)
    
    return [
    # add embedding
    fc.embedding_column(feature_cross, num_buckets//4),  # embedding dimension: 30 // 4 = 7
    # add other features 
    fc.numeric_column('median_income'),
    fc.numeric_column('housing_median_age'),
    fc.numeric_column('avg_rooms_per_house'),
    fc.numeric_column('avg_persons_per_room'),
    fc.numeric_column('avg_bedrooms_per_room')
    ]

## Model 3: Linear regressor with embeddings


In [13]:
# run model 3
linear_train_and_evaluate(output_dir='CHECKPOINTS/feat_eng_02/model_3', num_train_steps = 3000)

This gives us a very similar outcome to model 2. However, by converting the sparse input features into dense input features through embeddings, **we can now run the dataset through a deep neural network with TensorFlow - see model 4 below**

<img src='images\feature_eng2_model_3.PNG' width='600' height='600'/>

## Model 4: DNN regressor with embeddings

In [18]:
# run model 4
dnn_train_and_evaluate(output_dir='CHECKPOINTS/feat_eng_02/model_6', num_train_steps = 3000)

Running the embedding through a (somewhat arbitrarily configured) four-layer neural network again improves on the loss previously observed with the linear regressor. The deep NN is better able to capture non-linearities in the input data, which the linear model simply cannot capture.

<img src='images\feature_eng2_model_4.PNG' width='600' height='600'/>
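For a rough sense of model size, the sketch below counts the trainable weights of the dense layers for `hidden_units=[75, 25, 7]`. It assumes the input layer receives 12 values (7 embedding dimensions from `num_buckets//4` plus the 5 numeric features) and ignores the extra parameters introduced by batch normalization, so treat the numbers as an approximation.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

embedding_dim = 30 // 4          # 7, as in create_feature_cols()
num_numeric_features = 5         # income, age, and the three ratio features
hash_buckets = 30 ** 2           # 900 feature-cross columns

# input -> 75 -> 25 -> 7 -> 1 output
layer_sizes = [embedding_dim + num_numeric_features, 75, 25, 7, 1]

embedding_params = hash_buckets * embedding_dim
dnn_params = dense_param_count(layer_sizes)

print(embedding_params)  # 6300
print(dnn_params)        # 3065
```

Under these assumptions, the embedding table holds roughly twice as many weights as the dense layers that follow it.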