source: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/04_features/a_features.ipynb

A nearly identical source example: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/05_artandscience/a_handtuning.ipynb

I have made some changes to the original source notebook.

You can run this notebook in Google Cloud Datalab.

If you want to run this notebook in Jupyter, you may have to install google.datalab.ml for TensorBoard. For instructions, see: https://github.com/googledatalab/pydatalab

Versions: I have tested this notebook in Google Cloud Datalab with TensorFlow 1.8 and in Jupyter with TensorFlow 1.12.


# Results

Results will differ from run to run.

The best result from a few quick runs with different models and parameters:

estimator = tf.estimator.DNNRegressor(model_dir = output_dir, 
                                        feature_columns=create_feature_cols(), 
                                        hidden_units=[128, 64, 16], 
                                        activation_fn=tf.nn.tanh, 
                                        dropout=0.15)

global step 10000: average_loss = 0.41720954, global_step = 10000, label/mean = 2.0454628, loss = 52.35207, prediction/mean = 2.2294657
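
Because the input function later in the notebook divides the label by `SCALE = 100000`, `average_loss` is a mean squared error measured in units of (100,000 dollars) squared. A minimal sketch of converting it back to an RMSE in dollars, using only the numbers reported above:

```python
import math

SCALE = 100000             # same scaling constant the notebook applies to the label
average_loss = 0.41720954  # mean squared error reported at global step 10000

rmse_scaled = math.sqrt(average_loss)  # RMSE in units of $100,000
rmse_dollars = rmse_scaled * SCALE     # RMSE in dollars

print(round(rmse_scaled, 4), round(rmse_dollars))
```

Under that reading, the best run has an RMSE of about $64,600 on the evaluation set.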

# Trying out features

**Learning Objectives:**
  * Improve the accuracy of a model by adding new features with the appropriate representation

The data is based on 1990 census data from California. The data is at the city block level, so features such as total_rooms and population reflect the total number of rooms in a block and the total number of people who live on that block, respectively.

## Set Up
In this first cell, we'll load the necessary libraries.

In [1]:
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
import os

from google.datalab.ml import TensorBoard

print(tf.__version__)
tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

1.12.0


Next, we'll load our data set.

In [2]:
#df = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")

df = pd.read_csv("data/california_housing_train.csv", sep=",")


## Examine and split the data

It's a good idea to get to know your data a little before you work with it.

We'll look at the first few rows, then print a quick summary of a few useful statistics for each column.

The summary includes the mean, standard deviation, max, min, and various quantiles.

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.3,34.2,15.0,5612.0,1283.0,1015.0,472.0,1.5,66900.0
1,-114.5,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.8,80100.0
2,-114.6,33.7,17.0,720.0,174.0,333.0,117.0,1.7,85700.0
3,-114.6,33.6,14.0,1501.0,337.0,515.0,226.0,3.2,73400.0
4,-114.6,33.6,20.0,1454.0,326.0,624.0,262.0,1.9,65500.0


In [4]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.6,35.6,28.6,2643.7,539.4,1429.6,501.2,3.9,207300.9
std,2.0,2.1,12.6,2179.9,421.5,1147.9,384.5,1.9,115983.8
min,-124.3,32.5,1.0,2.0,1.0,3.0,1.0,0.5,14999.0
25%,-121.8,33.9,18.0,1462.0,297.0,790.0,282.0,2.6,119400.0
50%,-118.5,34.2,29.0,2127.0,434.0,1167.0,409.0,3.5,180400.0
75%,-118.0,37.7,37.0,3151.2,648.2,1721.0,605.2,4.8,265000.0
max,-114.3,42.0,52.0,37937.0,6445.0,35682.0,6082.0,15.0,500001.0


Now, split the data into two parts -- training and evaluation.

In [5]:
np.random.seed(seed=1) #makes result reproducible
msk = np.random.rand(len(df)) < 0.8
traindf = df[msk]
evaldf = df[~msk]
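
The split above draws one uniform random number per row and keeps rows below 0.8 for training, so the training share is roughly, not exactly, 80%. A small sketch of the same idea follows. The helper name `split_mask` is hypothetical, and it uses a local `np.random.RandomState` rather than the global seed so that it does not disturb other random draws:

```python
import numpy as np

def split_mask(n_rows, train_fraction=0.8, seed=1):
    """Return a boolean mask that is True for rows assigned to training."""
    rng = np.random.RandomState(seed)  # local generator; the global seed is untouched
    return rng.rand(n_rows) < train_fraction

mask_a = split_mask(17000)
mask_b = split_mask(17000)

print(mask_a.mean())                   # fraction of rows in the training set
print(np.array_equal(mask_a, mask_b))  # the same seed gives the same split
```

Fixing the seed makes the split itself repeatable. It does not make training repeatable, because the estimator has its own random initialization and shuffling.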

In [6]:
traindf.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,13612.0,13612.0,13612.0,13612.0,13612.0,13612.0,13612.0,13612.0,13612.0
mean,-119.6,35.6,28.7,2632.0,536.0,1423.3,498.1,3.9,207986.5
std,2.0,2.1,12.6,2163.3,416.7,1126.0,379.3,1.9,116514.3
min,-124.3,32.5,1.0,8.0,1.0,3.0,1.0,0.5,14999.0
25%,-121.8,33.9,18.0,1461.0,296.0,787.0,281.0,2.6,119600.0
50%,-118.5,34.2,29.0,2117.5,432.0,1168.0,408.0,3.6,180800.0
75%,-118.0,37.7,37.0,3146.0,644.2,1715.0,602.0,4.8,266300.0
max,-114.3,42.0,52.0,37937.0,5471.0,35682.0,5189.0,15.0,500001.0


In [7]:
evaldf.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,3388.0,3388.0,3388.0,3388.0,3388.0,3388.0,3388.0,3388.0,3388.0
mean,-119.6,35.7,28.3,2690.4,553.0,1454.8,513.7,3.8,204546.3
std,2.0,2.1,12.6,2245.5,440.2,1231.5,404.7,1.8,113802.5
min,-124.3,32.5,2.0,2.0,2.0,6.0,2.0,0.5,22500.0
25%,-121.8,33.9,18.0,1467.0,300.0,796.0,283.8,2.5,118800.0
50%,-118.6,34.3,28.0,2171.5,441.0,1160.0,414.0,3.5,178650.0
75%,-118.0,37.7,37.0,3167.2,667.0,1756.2,615.2,4.7,258825.0
max,-114.6,41.9,52.0,32627.0,6445.0,28566.0,6082.0,15.0,500001.0


## Training and Evaluation

In this exercise, we'll be trying to predict **median_house_value**. It will be our label (sometimes also called a target).

We'll modify the feature_cols and input function to represent the features you want to use.

Note: total_rooms is a per-block total, so to get per-household values such as rooms per house we need to apply some transformations.

We divide **total_rooms** by **households** to get **avg_rooms_per_house**, which we expect to correlate positively with **median_house_value**.

We also divide **population** by **total_rooms** to get **avg_persons_per_room** which we expect to negatively correlate with **median_house_value**.

In [8]:
def add_more_features(df):
  df['avg_rooms_per_house'] = df['total_rooms'] / df['households'] # positive correlation
  df['avg_bedrooms_per_house'] = df['total_bedrooms'] / df['households'] # positive correlation
  df['avg_persons_per_room'] = df['population'] / df['total_rooms'] # negative correlation
  return df
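
`add_more_features` assigns new columns directly on the DataFrame it receives. When that DataFrame is a slice such as `df[msk]`, pandas emits the "A value is trying to be set on a copy of a slice" warning seen in the training output further down. One possible way to avoid it, sketched here under the assumption that an extra copy of the data is acceptable, is to derive the features on a copy. The name `add_more_features_copy` is hypothetical:

```python
import pandas as pd

def add_more_features_copy(df):
    """Hypothetical variant of add_more_features that leaves its input untouched."""
    out = df.copy()  # work on a copy so slices such as df[msk] are never modified
    out['avg_rooms_per_house'] = out['total_rooms'] / out['households']
    out['avg_bedrooms_per_house'] = out['total_bedrooms'] / out['households']
    out['avg_persons_per_room'] = out['population'] / out['total_rooms']
    return out

# Values taken from the first row of df.head() shown earlier
sample = pd.DataFrame({'total_rooms': [5612.0], 'total_bedrooms': [1283.0],
                       'population': [1015.0], 'households': [472.0]})
enriched = add_more_features_copy(sample)
print(enriched[['avg_rooms_per_house', 'avg_bedrooms_per_house', 'avg_persons_per_room']])
```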

In [9]:
#add_more_features(df)

In [10]:
# Create a pandas input function: returns a function with the signature () -> (dict of features, target)
SCALE=100000
BATCH_SIZE=128

def make_input_fn(df, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    x = add_more_features(df),
    #x = df,  
    y = df['median_house_value'] / SCALE, # will talk about why later in the course
    batch_size = BATCH_SIZE,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 1000,
    num_threads = 1
  )
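
With `num_epochs = None` the training input function keeps cycling through the data, so the amount of training is set by the number of steps rather than by epochs. A rough arithmetic sketch of how many passes over the training data the run later in this notebook makes, assuming Python 3 division and the row count reported by `traindf.describe()`:

```python
import math

BATCH_SIZE = 128
num_train_rows = 13612   # count reported by traindf.describe()
num_train_steps = 10000  # value passed to train_and_evaluate() later in the notebook

steps_per_epoch = math.ceil(num_train_rows / BATCH_SIZE)
approx_epochs = num_train_steps * BATCH_SIZE / num_train_rows
print(steps_per_epoch, round(approx_epochs, 1))
```

This is only an estimate, since the input queue does not necessarily align batches with epoch boundaries.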

In [11]:
# Define your feature columns

# np.arange() is similar to np.linspace(), but takes a step size instead of the number of samples.

def create_feature_cols():
  return [
    tf.feature_column.numeric_column('housing_median_age'),
    tf.feature_column.bucketized_column(tf.feature_column.numeric_column('latitude'), boundaries = np.linspace(32.0, 42, num=10).tolist()),
    tf.feature_column.bucketized_column(tf.feature_column.numeric_column('longitude'), boundaries = np.linspace(-124.3, -114.3, num=10).tolist()),
    tf.feature_column.numeric_column('avg_rooms_per_house'),
    tf.feature_column.numeric_column('avg_bedrooms_per_house'),
    tf.feature_column.numeric_column('avg_persons_per_room'),
    tf.feature_column.numeric_column('median_income')
  ]
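
To make the bucketized columns more concrete, the sketch below prints the latitude boundaries produced by `np.linspace` and looks up which bucket a value falls into. It uses `np.digitize` as a stand-in for illustration only and is not TensorFlow's implementation. The helper name `bucket_index` is hypothetical:

```python
import numpy as np

# 10 evenly spaced edges, as in create_feature_cols()
lat_boundaries = np.linspace(32.0, 42, num=10)

def bucket_index(value, boundaries):
    """Index of the bucket containing value; 0 means below the first edge."""
    return int(np.digitize(value, boundaries))

print(np.round(lat_boundaries, 2))
print(len(lat_boundaries) + 1)             # n boundaries give n + 1 buckets
print(bucket_index(34.2, lat_boundaries))  # latitude of the first row in df.head()
```

Ten boundaries produce eleven buckets, including one open-ended bucket at each extreme, so the model sees latitude as a one-hot category rather than as a single number.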

## Choose a model ...

## LinearRegressor

Note: by default, LinearRegressor calculates its loss as the mean squared error.
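
As a reminder of what mean squared error measures, here is a minimal NumPy sketch. The labels are the first three `median_house_value` entries from `df.head()` divided by 100000, and the predictions are made-up numbers for illustration:

```python
import numpy as np

def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    labels = np.asarray(labels, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.mean((labels - predictions) ** 2))

labels = [0.669, 0.801, 0.857]       # scaled house values from df.head()
predictions = [0.769, 0.701, 0.857]  # hypothetical model outputs

mse = mean_squared_error(labels, predictions)
print(mse)
```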

In [12]:
# Create estimator train and evaluate function
def train_and_evaluate(output_dir, num_train_steps):
  estimator = tf.estimator.LinearRegressor(model_dir = output_dir, feature_columns = create_feature_cols())
  train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(traindf, None), 
                                      max_steps = num_train_steps)
  eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(evaldf, 1), 
                                    steps = None, 
                                    start_delay_secs = 1, # start evaluating after N seconds, 
                                    throttle_secs = 5)  # evaluate every N seconds
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

In [13]:
# Create estimator train and evaluate function
def train_and_evaluate(output_dir, num_train_steps):
  estimator = tf.estimator.LinearRegressor(model_dir = output_dir, 
                                           feature_columns = create_feature_cols(),
                                           optimizer=tf.train.FtrlOptimizer(learning_rate=0.1, l1_regularization_strength=0.001))
  train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(traindf, None), 
                                      max_steps = num_train_steps)
  eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(evaldf, 1), 
                                    steps = None, 
                                    start_delay_secs = 1, # start evaluating after N seconds, 
                                    throttle_secs = 5)  # evaluate every N seconds
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

## DNNRegressor

Note: by default, DNNRegressor calculates its loss as the mean squared error.
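
The evaluation line in the Results section reports both `average_loss` and `loss`. Those numbers are consistent with `average_loss` being the squared error averaged per example and `loss` being the squared error summed per batch and then averaged over the evaluation batches. A quick arithmetic check under that assumption, using the evaluation set size from `evaldf.describe()`:

```python
import math

BATCH_SIZE = 128
num_eval_rows = 3388       # count reported by evaldf.describe()
average_loss = 0.41720954  # reported at global step 10000
reported_loss = 52.35207   # reported at global step 10000

num_eval_batches = math.ceil(num_eval_rows / BATCH_SIZE)  # 26 full batches plus one partial
total_squared_error = average_loss * num_eval_rows
mean_loss_per_batch = total_squared_error / num_eval_batches
print(num_eval_batches, round(mean_loss_per_batch, 3))
```

This is why `loss` is roughly, but not exactly, `average_loss` times the batch size.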

In [14]:
# Create estimator train and evaluate function
def train_and_evaluate(output_dir, num_train_steps):

  estimator = tf.estimator.DNNRegressor(model_dir = output_dir, 
                                        feature_columns=create_feature_cols(), 
                                        hidden_units=[128, 64, 16], 
                                        activation_fn=tf.nn.tanh, 
                                        dropout=0.15)
  
  train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(traindf, None), 
                                      max_steps = num_train_steps)
  eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(evaldf, 1), 
                                    steps = None, 
                                    start_delay_secs = 1, # start evaluating after N seconds, 
                                    throttle_secs = 5)  # evaluate every N seconds
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

In [15]:
# Create estimator train and evaluate function
def train_and_evaluate(output_dir, num_train_steps):

  estimator = tf.estimator.DNNRegressor(model_dir = output_dir, 
                                        feature_columns=create_feature_cols(), 
                                        hidden_units=[128, 64, 16], 
                                        activation_fn=tf.nn.tanh, 
                                        dropout=0.25,
                                        optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.1,l1_regularization_strength=0.001))
  
  train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(traindf, None), 
                                      max_steps = num_train_steps)
  eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(evaldf, 1), 
                                    steps = None, 
                                    start_delay_secs = 1, # start evaluating after N seconds, 
                                    throttle_secs = 5)  # evaluate every N seconds
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

## Try adding an RMSE metric

In [15]:
# Create estimator train and evaluate function
def train_and_evaluate(output_dir, num_train_steps):

  estimator = tf.estimator.DNNRegressor(model_dir = output_dir, 
                                        feature_columns=create_feature_cols(), 
                                        hidden_units=[128, 64, 16], 
                                        activation_fn=tf.nn.tanh, 
                                        dropout=0.25)

  # --- Add RMSE evaluation metric: it is simply the square root of the default metric: MSE ---
  def rmse(labels, predictions):
    pred_values = tf.cast(predictions['predictions'], tf.float64)
    return {'rmse': tf.metrics.root_mean_squared_error(labels, pred_values)}
  
  estimator = tf.contrib.estimator.add_metrics(estimator,rmse)
  
  # --- continue ---

  train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(traindf, None), 
                                      max_steps = num_train_steps)

  eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(evaldf, 1), 
                                    steps = None, 
                                    start_delay_secs = 1, # start evaluating after N seconds, 
                                    throttle_secs = 5)  # evaluate every N seconds
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
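
Evaluation runs over many batches, so an RMSE metric has to be accumulated across them. The plain-Python sketch below illustrates the idea of keeping a running total of squared error and a running count, then taking the square root at the end. It is an illustration only, not TensorFlow's implementation, and the class name `StreamingRMSE` is hypothetical:

```python
import math

class StreamingRMSE:
    """Hypothetical helper: accumulate squared error across batches."""

    def __init__(self):
        self.total_squared_error = 0.0
        self.count = 0

    def update(self, labels, predictions):
        for y, p in zip(labels, predictions):
            self.total_squared_error += (y - p) ** 2
            self.count += 1

    def result(self):
        return math.sqrt(self.total_squared_error / self.count)

metric = StreamingRMSE()
metric.update([1.0, 2.0], [1.0, 4.0])  # batch of 2: squared errors 0 and 4
metric.update([3.0], [2.0])            # batch of 1: squared error 1
print(metric.result())                 # square root of 5 / 3
```

Accumulating totals this way gives the RMSE over all examples, which generally differs from the average of per-batch RMSE values when batches have different sizes.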

## Continue ...

In [15]:
OUTDIR = './trained_model'

try:
  os.makedirs(OUTDIR)
except OSError:
  pass

In [16]:
# Run the model

shutil.rmtree(OUTDIR, ignore_errors = True)

train_and_evaluate(OUTDIR, 10000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './trained_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f2583ac7c88>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Saving checkpoints for 0 into ./trained_model/model.ckpt.
INFO:tensorflow:loss = 338.76926, step = 1
INFO:tensorflow:global_step/sec: 199.52
INFO:tensorfl

INFO:tensorflow:loss = 37.670395, step = 6701 (0.489 sec)
INFO:tensorflow:global_step/sec: 207.371
INFO:tensorflow:loss = 55.90632, step = 6801 (0.480 sec)
INFO:tensorflow:global_step/sec: 211.85
INFO:tensorflow:loss = 82.52406, step = 6901 (0.472 sec)
INFO:tensorflow:global_step/sec: 222.077
INFO:tensorflow:loss = 58.363945, step = 7001 (0.451 sec)
INFO:tensorflow:global_step/sec: 213.576
INFO:tensorflow:loss = 49.954376, step = 7101 (0.471 sec)
INFO:tensorflow:global_step/sec: 210.336
INFO:tensorflow:loss = 77.60665, step = 7201 (0.476 sec)
INFO:tensorflow:global_step/sec: 211.426
INFO:tensorflow:loss = 40.904007, step = 7301 (0.469 sec)
INFO:tensorflow:global_step/sec: 219.261
INFO:tensorflow:loss = 64.83626, step = 7401 (0.456 sec)
INFO:tensorflow:global_step/sec: 210.669
INFO:tensorflow:loss = 46.562035, step = 7501 (0.474 sec)
INFO:tensorflow:global_step/sec: 217.627
INFO:tensorflow:loss = 46.480667, step = 7601 (0.462 sec)
INFO:tensorflow:global_step/sec: 224.001
INFO:tensorflow

In [17]:
# Launch tensorboard

# Note: If you use Jupyter, you may have to open the TensorBoard address shown in the terminal console instead of the link below.

TensorBoard().start(OUTDIR)

18179