# Neural Network

**Learning Objectives:**
  * Use the `DNNRegressor` class in TensorFlow to predict median housing price

The data is based on 1990 census data from California. It is aggregated at the city block level, so features such as `total_rooms` and `population` are totals for a whole block rather than values for a single house.

Let's use a set of features to predict house value.

## Set Up
In these first cells, we'll fix the ownership of the working directory, check the installed TensorFlow version, and load the necessary libraries.

In [None]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

In [None]:
# Ensure the right version of TensorFlow is installed.
!pip freeze | grep tensorflow==2.1

In [1]:
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf
import logging

logger = tf.get_logger()
logger.setLevel(logging.INFO)

#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

Next, we'll load our data set.

In [2]:
df = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")

## Examine the data

It's a good idea to get to know your data a little before you work with it.

We'll look at the first few rows, then print a quick summary of useful statistics for each column: the count, mean, standard deviation, min, max, and quartiles.

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.3,34.2,15.0,5612.0,1283.0,1015.0,472.0,1.5,66900.0
1,-114.5,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.8,80100.0
2,-114.6,33.7,17.0,720.0,174.0,333.0,117.0,1.7,85700.0
3,-114.6,33.6,14.0,1501.0,337.0,515.0,226.0,3.2,73400.0
4,-114.6,33.6,20.0,1454.0,326.0,624.0,262.0,1.9,65500.0


In [4]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.6,35.6,28.6,2643.7,539.4,1429.6,501.2,3.9,207300.9
std,2.0,2.1,12.6,2179.9,421.5,1147.9,384.5,1.9,115983.8
min,-124.3,32.5,1.0,2.0,1.0,3.0,1.0,0.5,14999.0
25%,-121.8,33.9,18.0,1462.0,297.0,790.0,282.0,2.6,119400.0
50%,-118.5,34.2,29.0,2127.0,434.0,1167.0,409.0,3.5,180400.0
75%,-118.0,37.7,37.0,3151.2,648.2,1721.0,605.2,4.8,265000.0
max,-114.3,42.0,52.0,37937.0,6445.0,35682.0,6082.0,15.0,500001.0


Because the data is at the city block level, features such as `total_rooms` and `population` are totals for a block. Let's create different, more appropriate features. Because we are predicting the price of a single house, we should try to make all our features correspond to a single house as well.
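
To make the per-household idea concrete, here is a minimal sketch in plain Python that applies the same division to the first row shown by `df.head()` above. The helper name `per_household` is hypothetical and exists only for this illustration; the notebook itself does the equivalent with pandas column arithmetic in the next cell.

```python
# Block-level totals for the first row of the data set (values from df.head()).
block = {
    "total_rooms": 5612.0,
    "total_bedrooms": 1283.0,
    "population": 1015.0,
    "households": 472.0,
}


def per_household(row):
    """Convert block-level totals into per-household averages.

    Hypothetical helper for illustration only.
    """
    households = row["households"]
    return {
        "num_rooms": row["total_rooms"] / households,
        "num_bedrooms": row["total_bedrooms"] / households,
        "persons_per_house": row["population"] / households,
    }


features = per_household(block)
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Dividing by `households` turns each block total into an average per home, which is a better match for a label that describes a single house.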

In [5]:
df['num_rooms'] = df['total_rooms'] / df['households']
df['num_bedrooms'] = df['total_bedrooms'] / df['households']
df['persons_per_house'] = df['population'] / df['households']
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,num_rooms,num_bedrooms,persons_per_house
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.6,35.6,28.6,2643.7,539.4,1429.6,501.2,3.9,207300.9,5.4,1.1,3.0
std,2.0,2.1,12.6,2179.9,421.5,1147.9,384.5,1.9,115983.8,2.5,0.5,4.0
min,-124.3,32.5,1.0,2.0,1.0,3.0,1.0,0.5,14999.0,0.8,0.3,0.7
25%,-121.8,33.9,18.0,1462.0,297.0,790.0,282.0,2.6,119400.0,4.4,1.0,2.4
50%,-118.5,34.2,29.0,2127.0,434.0,1167.0,409.0,3.5,180400.0,5.2,1.0,2.8
75%,-118.0,37.7,37.0,3151.2,648.2,1721.0,605.2,4.8,265000.0,6.1,1.1,3.3
max,-114.3,42.0,52.0,37937.0,6445.0,35682.0,6082.0,15.0,500001.0,141.9,34.1,502.5


In [6]:
df.drop(['total_rooms', 'total_bedrooms', 'population', 'households'], axis = 1, inplace = True)
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,median_income,median_house_value,num_rooms,num_bedrooms,persons_per_house
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.6,35.6,28.6,3.9,207300.9,5.4,1.1,3.0
std,2.0,2.1,12.6,1.9,115983.8,2.5,0.5,4.0
min,-124.3,32.5,1.0,0.5,14999.0,0.8,0.3,0.7
25%,-121.8,33.9,18.0,2.6,119400.0,4.4,1.0,2.4
50%,-118.5,34.2,29.0,3.5,180400.0,5.2,1.0,2.8
75%,-118.0,37.7,37.0,4.8,265000.0,6.1,1.1,3.3
max,-114.3,42.0,52.0,15.0,500001.0,141.9,34.1,502.5


## Build a neural network model

In this exercise, we'll be trying to predict `median_house_value`. It will be our label (sometimes also called a target). We'll use the remaining columns as our input features.

To train our model, we'll first use the [LinearRegressor](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor) interface. Then we'll switch to `DNNRegressor`.


In [8]:
for colname in 'housing_median_age,median_income,num_rooms,num_bedrooms,persons_per_house'.split(','):
    print(colname)

housing_median_age
median_income
num_rooms
num_bedrooms
persons_per_house


In [9]:
featcols = {
  colname : tf.feature_column.numeric_column(colname)
    for colname in 'housing_median_age,median_income,num_rooms,num_bedrooms,persons_per_house'.split(',')
}
# Bucketize lat, lon so it's not so high-res; California is mostly N-S, so more lats than lons
featcols['longitude'] = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('longitude'),
                                                   np.linspace(-124.3, -114.3, 5).tolist())
featcols['latitude'] = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('latitude'),
                                                  np.linspace(32.5, 42, 10).tolist())
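
The bucketizing step above can be sketched without TensorFlow to show what the boundaries look like and which bucket a value lands in. This is an illustration under stated assumptions, not TensorFlow's implementation: the helpers `linspace` and `bucket_index` are hypothetical, and the sketch assumes the documented behavior that `n` boundaries produce `n + 1` buckets with left-inclusive edges.

```python
import bisect


def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (like np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def bucket_index(value, boundaries):
    """Index of the bucket holding value; each boundary starts a new bucket."""
    return bisect.bisect_right(boundaries, value)


# Same ranges and counts as the feature columns defined above.
lon_boundaries = linspace(-124.3, -114.3, 5)
lat_boundaries = linspace(32.5, 42, 10)

# n boundaries split the number line into n + 1 buckets.
num_lon_buckets = len(lon_boundaries) + 1
num_lat_buckets = len(lat_boundaries) + 1

# Median longitude and latitude from df.describe().
lon_bucket = bucket_index(-118.5, lon_boundaries)
lat_bucket = bucket_index(34.2, lat_boundaries)

print("longitude boundaries:", [round(b, 1) for b in lon_boundaries])
print("longitude buckets:", num_lon_buckets, "latitude buckets:", num_lat_buckets)
print("median location falls in buckets:", lon_bucket, lat_bucket)
```

Because the boundaries start at the minimum and end at the maximum observed coordinate, the first bucket (everything below the minimum) stays empty on this data, and the last bucket holds only values equal to the maximum.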

In [10]:
featcols.keys()

dict_keys(['housing_median_age', 'median_income', 'num_rooms', 'num_bedrooms', 'persons_per_house', 'longitude', 'latitude'])

In [24]:
# Split into train and eval
msk = np.random.rand(len(df)) < 0.8
traindf = df[msk]
evaldf = df[~msk]

SCALE = 100000
BATCH_SIZE = 100
OUTDIR = './housing_trained'

def make_input_fn(df, mode, batch_size = BATCH_SIZE):
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # loop indefinitely
        shuffle=True
    else:
        num_epochs = 1 # one run and it's over
        shuffle=False
    
    return tf.compat.v1.estimator.inputs.pandas_input_fn(x = df[list(featcols.keys())],
                                                y = df["median_house_value"] / SCALE,  # note the scaling
                                                num_epochs = num_epochs,
                                                batch_size = batch_size, 
                                                shuffle = shuffle)

def train_input_fn(df, batch_size=BATCH_SIZE):
    return make_input_fn(df, mode=tf.estimator.ModeKeys.TRAIN, batch_size=batch_size)

def eval_input_fn(df):
    return make_input_fn(df, mode=tf.estimator.ModeKeys.EVAL, batch_size=len(df))
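
The input function divides the label by `SCALE`, and the `rmse` metric in the training cells that follow multiplies labels and predictions by `SCALE` again. The sketch below, in plain Python, checks why that works: RMSE is linear in the scale, so rescaling before or after computing it gives the same number. The prediction values are made up for illustration, and the helper `rmse_plain` is hypothetical.

```python
import math

SCALE = 100000  # same constant as in the notebook


def rmse_plain(labels, predictions):
    """Root mean squared error over two equal-length sequences."""
    squared = [(l - p) ** 2 for l, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# First three labels from df.head(); the predictions are invented.
labels = [66900.0, 80100.0, 85700.0]
predictions = [70000.0, 75000.0, 90000.0]

# RMSE measured directly in dollars.
dollar_rmse = rmse_plain(labels, predictions)

# RMSE measured on scaled values, as the model sees them during training.
scaled_rmse = rmse_plain([l / SCALE for l in labels],
                         [p / SCALE for p in predictions])

print(f"RMSE in dollars:       {dollar_rmse:.2f}")
print(f"scaled RMSE * SCALE:   {scaled_rmse * SCALE:.2f}")
```

Training on labels near 1.0 instead of near 200,000 keeps the loss and gradients in a range that suits a learning rate such as 0.01, while the rescaled metric stays readable in dollars.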


In [29]:
# Linear Regressor
def train_and_evaluate(output_dir, num_train_steps):
    myopt = tf.keras.optimizers.Ftrl(learning_rate = 0.01) # note the learning rate
    estimator = tf.estimator.LinearRegressor(
                       model_dir = output_dir, 
                       feature_columns = featcols.values(),
                       optimizer = myopt)
  
    #Add rmse evaluation metric
    def rmse(labels, predictions):
        pred_values = tf.cast(predictions['predictions'],tf.float64)
        return {'rmse': tf.compat.v1.metrics.root_mean_squared_error(labels*SCALE, pred_values*SCALE)}
    estimator = tf.compat.v1.estimator.add_metrics(estimator,rmse)
    
    #estimator = tf.estimator.add_metrics(estimator, rmse)

    train_spec = tf.estimator.TrainSpec(input_fn = train_input_fn(df = traindf, batch_size = BATCH_SIZE),
                                      max_steps = num_train_steps)
    eval_spec = tf.estimator.EvalSpec(input_fn = eval_input_fn(df = evaldf),
                                        steps = None,                                        
                                        start_delay_secs = 1, # start evaluating after N seconds
                                        throttle_secs = 10  # evaluate every N seconds
                                     )
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    
    
# Run training    
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
train_and_evaluate(OUTDIR, num_train_steps = (20 * len(traindf)) // BATCH_SIZE)  # about 20 epochs, as a whole number of steps

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './housing_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Using config: {'_model_dir': './housing_trained', '_tf_random_seed': None, '_save_summary_steps': 10
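
The `num_train_steps` argument in the cell above is an epoch count converted to steps: each step consumes one batch, so one pass over the data takes `len(traindf) / BATCH_SIZE` steps. A minimal sketch of that arithmetic follows. The helper `steps_for_epochs` is hypothetical, and the row count of 13,600 is an assumption (roughly 80% of the 17,000 rows); the real value depends on the random split.

```python
def steps_for_epochs(num_examples, batch_size, num_epochs):
    """Number of whole training steps needed for num_epochs passes."""
    return (num_epochs * num_examples) // batch_size


# Assumed training-set size: about 80% of 17,000 rows.
approx_train_rows = 13600
batch_size = 100

linear_steps = steps_for_epochs(approx_train_rows, batch_size, num_epochs=20)
dnn_steps = steps_for_epochs(approx_train_rows, batch_size, num_epochs=15)

print("steps for 20 epochs:", linear_steps)
print("steps for 15 epochs:", dnn_steps)
```

The sketch uses integer division so that the result is a whole number, which is what a step count represents.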

In [45]:
# DNN Regressor
def train_and_evaluate(output_dir, num_train_steps):
    myopt = tf.keras.optimizers.Ftrl(learning_rate = 0.01) # note the learning rate
    
    estimator = tf.estimator.DNNRegressor(
        model_dir = output_dir, 
        feature_columns = featcols.values(),
        hidden_units = [100, 50, 20],
        optimizer = myopt,
        dropout=0.1
        )
        
    #Add rmse evaluation metric
    def rmse(labels, predictions):
        pred_values = tf.cast(predictions['predictions'],tf.float64)
        return {'rmse': tf.compat.v1.metrics.root_mean_squared_error(labels*SCALE, pred_values*SCALE)}
    #estimator = tf.compat.v1.estimator.add_metrics(estimator,rmse)
    
    estimator = tf.estimator.add_metrics(estimator, rmse)
  
    train_spec=tf.estimator.TrainSpec(
                       input_fn = train_input_fn(df= traindf),
                       max_steps = num_train_steps)
    eval_spec=tf.estimator.EvalSpec(
                       input_fn = eval_input_fn(df = evaldf),
                       steps = None,
                       start_delay_secs = 1, # start evaluating after N seconds
                       throttle_secs = 10,  # evaluate every N seconds
                       )
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

# Run training    
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
tf.compat.v1.summary.FileWriterCache.clear() 
train_and_evaluate(OUTDIR, num_train_steps = (15 * len(traindf)) // BATCH_SIZE)  # about 15 epochs, as a whole number of steps

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './housing_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Using config: {'_model_dir': './housing_trained', '_tf_random_seed': None, '_save_summary_steps': 10
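
To get a feel for the size of the network requested by `hidden_units = [100, 50, 20]`, the sketch below counts weights and biases for a plain fully connected network. It rests on assumptions that are not verified against the estimator's actual variables: each of the five numeric columns contributes one input, each bucketized column is expanded to a one-hot vector with one slot per bucket (number of boundaries plus one), and every layer has a bias. The helper `dense_param_count` is hypothetical.

```python
def dense_param_count(input_width, hidden_units, output_width=1):
    """Weights plus biases for a stack of fully connected layers."""
    widths = [input_width] + list(hidden_units) + [output_width]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


# Assumed input layout for the feature columns defined earlier.
numeric_inputs = 5            # housing_median_age, median_income, ...
longitude_inputs = 5 + 1      # 5 boundaries -> 6 one-hot slots
latitude_inputs = 10 + 1      # 10 boundaries -> 11 one-hot slots
input_width = numeric_inputs + longitude_inputs + latitude_inputs

total_params = dense_param_count(input_width, [100, 50, 20])

print("input width:", input_width)
print("trainable parameters (under these assumptions):", total_params)
```

Dropout adds no parameters; it only zeroes a fraction of activations during training. Under these assumptions the network has a few thousand parameters, compared with one weight per input plus a bias for the linear model.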

In [13]:
%%bash
pip install datalab

Collecting datalab
  Downloading datalab-1.2.0-py3-none-any.whl (1.4 MB)
Collecting plotly>=1.12.5
  Downloading plotly-4.8.2-py2.py3-none-any.whl (11.5 MB)
Collecting configparser>=3.5.0
  Downloading configparser-5.0.0-py3-none-any.whl (22 kB)
Collecting pandas-profiling==1.4.0
  Downloading pandas-profiling-1.4.0.tar.gz (18 kB)
Collecting google-cloud-monitoring==0.31.1
  Downloading google_cloud_monitoring-0.31.1-py2.py3-none-any.whl (138 kB)
Building wheels for collected packages: pandas-profiling
  Building wheel for pandas-profiling (setup.py): started
  Building wheel for pandas-profiling (setup.py): finished with status 'done'
  Created wheel for pandas-profiling: filename=pandas_profiling-1.4.0-py2.py3-none-any.whl size=24018 sha256=d17cd135bf9a5753902f057fa68d671b20496b0879d539c568ada16722d412f2
  Stored in directory: /home/jupyter/.cache/pip/wheels/99/1b/22/fd21454e83092576a63d092a437c8415cb531d116c2552e19d
Successfully built pandas-profiling
Installing collected packages: 

In [46]:
from google.datalab.ml import TensorBoard
#OUTDIR = gs://${BUCKET}
#TensorBoard().start(OUTDIR)
TensorBoard.list()

Unnamed: 0,pid,logdir,port
0,7880,./housing_trained,49677


In [40]:
#TensorBoard.stop(3445)
TensorBoard.start(OUTDIR)

7880

In [None]:
%load_ext tensorboard
%tensorboard --logdir './housing_trained'