# Time-series prediction (temperature from weather stations)

This notebook illustrates:

* Predicting the "next" value of a long time-series
* Using an LSTM model on numeric data
* Serving an LSTM model

<b>Note:</b>
See [Time series prediction with RNNs and TensorFlow](../05_artandscience/d_customestimator.ipynb) for a very similar example, except that it works with multiple short sequences.

In [None]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [None]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [None]:
%%datalab project set -p $PROJECT

# Data exploration and cleanup

The data are temperature data from US weather stations. This is a public dataset from NOAA.

In [None]:
from __future__ import print_function  # __future__ imports must come first in the cell
import numpy as np
import seaborn as sns
import pandas as pd
import tensorflow as tf
import google.datalab.bigquery as bq

In [None]:
query="""
SELECT
  stationid, date,
  MAX(tmin) AS tmin,
  MAX(tmax) AS tmax,
  IF (MOD(ABS(FARM_FINGERPRINT(stationid)), 10) < 7, True, False) AS is_train
FROM (
  SELECT
    wx.id as stationid,
    wx.date as date,
    CONCAT(wx.id, " ", CAST(wx.date AS STRING)) AS recordid,
    IF (wx.element = 'TMIN', wx.value/10, NULL) AS tmin,
    IF (wx.element = 'TMAX', wx.value/10, NULL) AS tmax
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_2016` AS wx
  WHERE STARTS_WITH(id, 'USW000')
)
GROUP BY
  stationid, date
ORDER BY
  stationid, date
"""
df = bq.Query(query).execute().result().to_dataframe()
df.head()

In [None]:
df.describe()

Unfortunately, there are missing observations on some days.

In [None]:
df.isnull().sum()

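One way to fix this is to pivot the table so that each station becomes a column, fill the nulls forward, and then fill any remaining leading nulls backward. Any station that still has nulls after that is dropped.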

In [None]:
def cleanup_nulls(df, variablename):
  df2 = df.pivot_table(variablename, 'date', 'stationid', fill_value=np.nan)
  print('Before: {} null values'.format(df2.isnull().sum().sum()))
  df2.fillna(method='ffill', inplace=True)
  df2.fillna(method='bfill', inplace=True)
  df2.dropna(axis=1, inplace=True)
  print('After: {} null values'.format(df2.isnull().sum().sum()))
  return df2
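
To see what the forward and backward fill does to a single station's column, here is a minimal pure-Python sketch. The helper name `ffill_bfill` and the toy values are illustrative assumptions, not part of the notebook; pandas performs the equivalent operation on every column of the pivoted table.

```python
def ffill_bfill(values):
    """Fill None gaps with the previous value, then leading gaps with the next one."""
    filled = list(values)

    # Forward pass: carry the last seen observation into later gaps.
    last = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last
        else:
            last = v

    # Backward pass: only leading gaps can still be None at this point.
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled


# Hypothetical station with a missing first day and gaps in the middle and end.
station = [None, 3.0, None, None, 5.5, None]
filled = ffill_bfill(station)
print(filled)
```

A column with no observations at all stays entirely empty after both passes, which is the case that `dropna(axis=1)` guards against.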

In [None]:
traindf = cleanup_nulls(df[df['is_train']], 'tmin')

In [None]:
traindf.head()

In [None]:
seq = traindf.iloc[:,0]
print('{} values in the sequence'.format(len(seq)))
ax = sns.tsplot(seq)
ax.set(xlabel='day-number', ylabel='temperature');

In [None]:
seq.to_string(index=False).replace('\n', ',')

In [None]:
# Save the data to disk in such a way that each time series is on a single line
def to_csv(indf, filename):
  df = cleanup_nulls(indf, 'tmin')
  print('Writing {} sequences to {}'.format(len(df.columns), filename))
  with open(filename, 'w') as ofp:
    for i in range(0, len(df.columns)):
      if i%10 == 0:
        print('{}'.format(i), end='...')
      # Read from the cleaned-up frame for this file, not the global traindf
      seq = df.iloc[:,i]
      line = seq.to_string(index=False, header=False).replace('\n', ',')
      ofp.write(line + '\n')
    print('Done')
to_csv(df[df['is_train']], 'train.csv')
to_csv(df[~df['is_train']], 'eval.csv')

In [None]:
%bash
ls -l *.csv
head -1 eval.csv | tr ',' ' ' | wc
wc *.csv

Each sequence in our CSV files consists of 366 numbers, one per day of 2016. Each number is an input, and the model's output for it is the next number in the sequence, predicted from the history seen so far. For training, values 0 through 364 of each instance are the inputs and values 1 through 365 are the labels, so 365 inputs yield 365 outputs. At prediction time, the task is: given a series of numbers, predict the next n numbers.
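
The input/label shift described above can be sketched in a few lines of plain Python. The function name `split_inputs_and_labels` and the toy sequence are assumptions made for illustration; the notebook does the same thing later with `tf.slice` on whole batches.

```python
def split_inputs_and_labels(sequence):
    """Pair each value with the value that follows it."""
    inputs = sequence[:-1]  # everything except the last value
    labels = sequence[1:]   # the same series shifted forward by one step
    return inputs, labels


# Toy sequence standing in for one 366-value line of train.csv.
toy = [1.5, 2.0, 2.5, 3.0, 3.5]
x, y = split_inputs_and_labels(toy)
print(x)
print(y)

# A full-length sequence yields 365 inputs and 365 labels.
full_x, full_y = split_inputs_and_labels(list(range(366)))
print(len(full_x), len(full_y))
```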

# Model


We will use TensorFlow's [Estimator](https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Estimator) to build our model. Estimators help construct the training, evaluation, and prediction graphs: they share the common parts of the graph and fork only where needed (for example, at the input_fn). They also handle model export, and exported models can be deployed to Google Cloud ML Engine for online prediction.

In [None]:
import tensorflow as tf
import shutil
import tensorflow.contrib.learn as tflearn
import tensorflow.contrib.layers as tflayers
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib.learn.python.learn.utils import saved_model_export_utils
import tensorflow.contrib.rnn as rnn

# tf.decode_csv requires DEFAULTS to infer data types and default values.
SEQ_LEN = 366
DEFAULTS = [[0.0] for _ in range(SEQ_LEN)]

# The Estimator API requires named features.
TIMESERIES_FEATURE_NAME = 'rawdata'

# Training batch size.
BATCH_SIZE = 25

## Input

Our CSV file structure is quite simple -- a bunch of floating point numbers (note the type of DEFAULTS). We ask for the data to be read BATCH_SIZE sequences at a time.

In [None]:
def create_input_fn(filename, mode):  
  """Creates an input_fn for estimator in training or evaluation."""
  
  def _input_fn():
    """Returns named features and labels, as required by Estimator."""    
    # could be a path to one file or a file pattern.
    input_file_names = tf.train.match_filenames_once(filename)
    
    filename_queue = tf.train.string_input_producer(
        input_file_names, num_epochs=None, shuffle=True)
    reader = tf.TextLineReader()
    _, value = reader.read_up_to(filename_queue, num_records=BATCH_SIZE)

    # parse the csv values
    batch_data = tf.decode_csv(value, record_defaults=DEFAULTS)
    batch_data = tf.transpose(batch_data) # [BATCH_SIZE, SEQ_LEN]

    # Get x and y. They are both of shape [BATCH_SIZE, SEQ_LEN - 1]
    batch_len = tf.shape(batch_data)[0]
    x = tf.slice(batch_data, [0, 0], [batch_len, SEQ_LEN-1])
    y = tf.slice(batch_data, [0, 1], [batch_len, SEQ_LEN-1])
    
    return {TIMESERIES_FEATURE_NAME: x}, y   # dict of features, target

  return _input_fn

## Inference Graph

Following Estimator's requirements, we will create a model_fn representing the inference model. Note that this function defines the graph that will be used in training, evaluation and prediction.

To supply a model function to the Estimator API, you need to return a ModelFnOps. The rest of the function creates the necessary objects.

In [None]:
# Think of the number of LSTM units as how much history you want the network to remember
LSTM_SIZE = [10, 20]
DNN_SIZE  = [50, 25]
  
# scale the temperatures to make the optimization easier; tmin values are -58 to 38, scale it to be 0 to 1
def scale_temperature(t):
  return (t + 58) / (38+58)

def unscale_temperature(sc):
  return (sc*(38+58)) - 58

def model_fn(features, targets, mode):
  """Define the inference model."""
  
  # scale the input values to lie between 0-1. this will help optimization
  input_seq = scale_temperature(features[TIMESERIES_FEATURE_NAME])
  
  #lat = features['latitude']

  # dynamic_rnn expects a rank-3 input [batch, time, features], so add a trailing dimension.
  input_seq = tf.expand_dims(input_seq, axis=-1)
  
  # LSTM output will be [BATCH_SIZE, SEQ_LEN - 1, lstm_output_size]
  lstm_layer = [tf.nn.rnn_cell.LSTMCell(size) for size in LSTM_SIZE]
  lstm_cell = tf.nn.rnn_cell.MultiRNNCell(lstm_layer)
  lstm_outputs, _ = tf.nn.dynamic_rnn(cell=lstm_cell,
                                      inputs=input_seq,
                                      dtype=tf.float32)
  
  # Reshape to [BATCH_SIZE * (SEQ_LEN - 1), lstm_output] so it is 2-D and can
  # be fed to next layer.
  lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size])
  
  #extras = [lstm_outputs, lat, lon]
  
  # Add hidden layers on top of the LSTM layer to add some nonlinearity to the model.
  prev_layer = [lstm_outputs]
  for h in DNN_SIZE:
    hidden1 = tf.contrib.layers.fully_connected(inputs=prev_layer[-1], num_outputs=h)
    prev_layer.append(hidden1)
    
  uniform_initializer = tf.random_uniform_initializer(minval=-0.08, maxval=0.08)
  predictions = tf.contrib.layers.fully_connected(inputs=prev_layer[-1],
                                                  num_outputs=1,
                                                  activation_fn=None,
                                                  weights_initializer=uniform_initializer,
                                                  biases_initializer=uniform_initializer)

  # predictions are all we need when mode is not train/eval.
  # but remember to unscale the values
  predictions_dict = {"predicted_temperature": unscale_temperature(predictions)}

  # If train/evaluation, we'll need to compute loss.
  # If train, we will also need to create an optimizer.
  loss, train_op, eval_metric_ops = None, None, None
  if mode == tf.contrib.learn.ModeKeys.TRAIN or mode == tf.contrib.learn.ModeKeys.EVAL:
    # scale the temperature so that we match the 0-1 scale of predictions
    # it's better to do this rather than unscale the predictions because the
    # learning rate and optimizers are all set up for small numbers
    targets = scale_temperature(targets)
      
    # Note: The reshape below is needed because Estimator needs to know
    # loss shape. Without reshaping below, loss's shape would be unknown.
    targets = tf.reshape(targets, [tf.size(targets)])
    predictions = tf.reshape(predictions, [tf.size(predictions)])
    loss = tf.losses.mean_squared_error(targets, predictions)
    eval_metric_ops = {
      "rmse_scaled": tf.metrics.root_mean_squared_error(targets, predictions)
    }

    if mode == tf.contrib.learn.ModeKeys.TRAIN:
      train_op = tf.contrib.layers.optimize_loss(
          loss=loss,
          global_step=tf.contrib.framework.get_global_step(),
          learning_rate=0.01,
          optimizer="Adagrad")
  
  # return ModelFnOps as Estimator requires.
  return tflearn.ModelFnOps(
      mode=mode,
      predictions=predictions_dict,
      loss=loss,
      train_op=train_op,
      eval_metric_ops=eval_metric_ops)
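
The scaling used in the model is a plain min-max transform, so it can be checked by hand. The sketch below restates it in pure Python, assuming the same -58 to 38 range quoted in the cell above; the names `scale`, `unscale`, and `rmse_in_degrees` are illustrative. Because the transform is affine, an error reported on the 0-1 scale (such as the `rmse_scaled` metric) converts back to temperature units by multiplying by the width of the range.

```python
T_MIN, T_MAX = -58.0, 38.0  # tmin range quoted in the notebook
T_RANGE = T_MAX - T_MIN     # 96 degrees


def scale(t):
    """Map a temperature in [T_MIN, T_MAX] onto [0, 1]."""
    return (t - T_MIN) / T_RANGE


def unscale(s):
    """Invert scale(): map a value in [0, 1] back to a temperature."""
    return s * T_RANGE + T_MIN


def rmse_in_degrees(rmse_scaled):
    """Convert an RMSE measured on the 0-1 scale to temperature units."""
    return rmse_scaled * T_RANGE


print(scale(T_MIN), scale(T_MAX))
print(unscale(scale(12.3)))
print(rmse_in_degrees(0.05))  # a hypothetical scaled RMSE, not a measured result
```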

## Training

Training (including distributed training) is launched through an Experiment. The key point is that we use tflearn.Estimator rather than a canned estimator such as tflearn.DNNRegressor, which lets us supply our own model_fn (the RNN defined above). Note also that we specify a serving_input_fn, which defines how input data is parsed at prediction time when requests arrive through gcloud or Cloud ML Engine online prediction.

In [None]:
def get_train():
  return create_input_fn('train.csv', mode=tf.contrib.learn.ModeKeys.TRAIN)


def get_eval():
  return create_input_fn('eval.csv', mode=tf.contrib.learn.ModeKeys.EVAL)


def serving_input_fn():
  feature_placeholders = {
      TIMESERIES_FEATURE_NAME: tf.placeholder(tf.float32, [None, None])
  }
  return tflearn.utils.input_fn_utils.InputFnOps(
      feature_placeholders,
      None,
      feature_placeholders
  )


def experiment_fn(output_dir):
    """An experiment_fn required for Estimator API to run training."""

    estimator = tflearn.Estimator(model_fn=model_fn,
                                  model_dir=output_dir,
                                  config=tf.contrib.learn.RunConfig(save_checkpoints_steps=500))
    return tflearn.Experiment(
        estimator,
        train_input_fn=get_train(),
        eval_input_fn=get_eval(),
        export_strategies=[saved_model_export_utils.make_export_strategy(
            serving_input_fn,
            default_output_alternative_key=None,
            exports_to_keep=1
        )],
        train_steps=1000
    )


shutil.rmtree('training', ignore_errors=True) # start fresh each time.
learn_runner.run(experiment_fn, 'training')
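
The serving_input_fn above exposes a single placeholder keyed by `rawdata`, so a prediction request needs to supply one list of floats under that name. As a rough sketch, assuming the newline-delimited JSON instance format commonly used for online prediction requests (the exact request format depends on how the model is deployed), one instance could be built like this. The helper name `to_json_instance` and the sample values are hypothetical.

```python
import json

FEATURE_NAME = 'rawdata'  # must match TIMESERIES_FEATURE_NAME in the notebook


def to_json_instance(sequence):
    """Serialize one time series as a single-line JSON instance."""
    return json.dumps({FEATURE_NAME: [float(v) for v in sequence]})


# Made-up temperatures; a real request would carry the observed history.
line = to_json_instance([3.3, 4.4, 5.0])
print(line)
```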

## Model Summary

We can plot the model's training summary events using Datalab's ML library.

In [None]:
from google.datalab.ml import Summary

summary = Summary('./training')
summary.plot(['OptimizeLoss/loss', 'loss'])

# Prediction

Let's pull up a curve and see how we do at predicting the last few values of the series. A week's forecast is reasonable.

In [None]:
def get_one_series(filename, dayno):
  with open(filename) as fp:
    fields = fp.readline().strip().split(',')
    prediction_data = list(map(float, fields))  # list() so slicing also works on Python 3
  
    # Up to dayno as input; up to dayno+7 as prediction
    prediction_x = list(prediction_data[:dayno])
    prediction_y = list(prediction_data[dayno:(dayno+7)])

    sns.tsplot(prediction_x, color='blue')
    y_truth_curve = [np.nan] * (len(prediction_x)-1) + [prediction_x[-1]] + prediction_y
    sns.tsplot(y_truth_curve, color='green')
    return prediction_x, prediction_y, y_truth_curve

prediction_x, prediction_y, y_truth_curve = get_one_series('eval.csv', 285)
print('{} inputs; expecting {} outputs'.format(len(prediction_x), len(prediction_y)))

The first prediction simply sends in x. For each value in x, the model returns a prediction for the very next time step, so we can compare the predicted values with the truth (x shifted forward by one step).
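
To make that comparison concrete: output i is a forecast for input i+1, so the last output has no matching truth value and the first input has no matching forecast. A minimal sketch of the alignment, with an RMSE on top, might look like the following. The function name `one_step_rmse` and all numbers are made up for illustration and are not results from the trained model.

```python
import math


def one_step_rmse(inputs, outputs):
    """RMSE between one-step-ahead forecasts and the values they target."""
    truth = inputs[1:]        # the value that actually followed each input
    forecasts = outputs[:-1]  # drop the forecast for the unseen next step
    squared = [(f - t) ** 2 for f, t in zip(forecasts, truth)]
    return math.sqrt(sum(squared) / len(squared))


inputs = [10.0, 12.0, 11.0, 13.0]
outputs = [11.0, 12.0, 12.0, 14.0]  # hypothetical model outputs, one per input
error = one_step_rmse(inputs, outputs)
print(error)
```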

In [None]:
# Load model.
estimator = tflearn.Estimator(model_fn=model_fn, model_dir='training')

# Feed Prediction data.
predict_input_fn = lambda: {TIMESERIES_FEATURE_NAME: tf.constant([prediction_x])}

predicted = list(estimator.predict(input_fn=predict_input_fn))
predicted = [p['predicted_temperature'] for p in predicted]

# Plot prediction source.
sns.tsplot(prediction_x, color='green')

# Plot predicted values.
sns.tsplot([prediction_x[0]] + predicted, color='red');

This time, let's send in x and predict the next n values.
We do this by invoking the prediction on x, taking the last predicted value, appending it to x, and predicting again.
Repeat n times and we've created n predictions.
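
The loop described above can be sketched independently of TensorFlow. Here `toy_predict` is a hypothetical stand-in for the trained model (it returns a running mean, one output per input step) and `rollout` is an illustrative name; only the feed-the-forecast-back-in structure mirrors what the next cell does with the estimator.

```python
def toy_predict(history):
    """Stand-in for the model: output i is a forecast for step i+1.

    This toy version just returns the running mean of the inputs so far.
    """
    outputs = []
    total = 0.0
    for i, v in enumerate(history):
        total += v
        outputs.append(total / (i + 1))
    return outputs


def rollout(history, n, predict_fn=toy_predict):
    """Forecast n steps ahead by feeding each forecast back in as input."""
    x_total = list(history)
    for _ in range(n):
        outputs = predict_fn(x_total)
        x_total.append(outputs[-1])  # only the last output is a new forecast
    return x_total[len(history):]    # drop the original inputs


forecast = rollout([10.0, 12.0, 14.0], n=3)
print(forecast)
```

Each forecast is built on earlier forecasts rather than on observations, so errors can compound as n grows.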

In [None]:
estimator = tflearn.Estimator(model_fn=model_fn, model_dir='training')

# Prediction data starts with x.
x_total = list(prediction_x)

# Make n predictions.
for i in range(len(prediction_y)):
  predict_input_fn = lambda: {TIMESERIES_FEATURE_NAME: tf.constant([x_total])}
  p = list(estimator.predict(input_fn=predict_input_fn))
  # For each step, append the tail element of the last predicted values,
  # converted to a plain float so that x_total stays a flat list of numbers.
  x_total.append(float(np.ravel(p[-1]['predicted_temperature'])[0]))

# The first len(prediction_x) elements are prediction source. So remove them.
y_predicted = x_total[len(prediction_x):]

# Zero out prediction source (making them nan), add the last value of prediction source
# so the first edge in the curve is plotted, and add predicted values.
y_predicted_curve = [np.nan] * (len(prediction_x)-1) + [prediction_x[-1]] + y_predicted

# Plot prediction source.
sns.tsplot(prediction_x, color='blue')

# Plot truth curve.
sns.tsplot(y_truth_curve, color='green')

# Plot predicted curve.
sns.tsplot(y_predicted_curve, color='red')