# Time-series prediction (temperature from weather stations)

This notebook is a companion to [(Time series prediction, end-to-end)](./sinewaves.ipynb), but it works on a real dataset.

In [None]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [None]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

# Data exploration and cleanup

The data are daily minimum and maximum temperatures from US weather stations. They come from NOAA's public GHCN-Daily dataset, which is queried below through BigQuery.

In [None]:
from __future__ import print_function  # __future__ imports must come first in the cell
import numpy as np
import seaborn as sns
import pandas as pd
import tensorflow as tf
import google.datalab.bigquery as bq

In [None]:
def query_to_dataframe(year):
  query="""
SELECT
  stationid, date,
  MIN(tmin) AS tmin,
  MAX(tmax) AS tmax,
  IF (MOD(ABS(FARM_FINGERPRINT(stationid)), 10) < 7, True, False) AS is_train
FROM (
  SELECT
    wx.id as stationid,
    wx.date as date,
    CONCAT(wx.id, " ", CAST(wx.date AS STRING)) AS recordid,
    IF (wx.element = 'TMIN', wx.value/10, NULL) AS tmin,
    IF (wx.element = 'TMAX', wx.value/10, NULL) AS tmax
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_{}` AS wx
  WHERE STARTS_WITH(id, 'USW000')
)
GROUP BY
  stationid, date
""".format(year)
  df = bq.Query(query).execute().result().to_dataframe()
  return df

df = query_to_dataframe(2016)
df.head()

In [None]:
df.describe()

Unfortunately, there are missing observations on some days.

In [None]:
df.isnull().sum()

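One way to fix this is to build a pivot table (dates as rows, stations as columns) and replace each null with a neighboring valid value from the same station: forward-fill first, then backward-fill any gaps left at the start. Stations that have no valid values at all are dropped.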
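
To make the fill order concrete, here is a minimal sketch in plain Python (no pandas) of what a forward fill followed by a backward fill does to a single station's column. The helper names `forward_fill` and `fill_gaps` and the readings are made up for illustration and are not part of the notebook; `None` stands in for a missing observation.

```python
def forward_fill(values):
  """Replace each None with the most recent non-None value seen so far."""
  filled, last = [], None
  for v in values:
    if v is not None:
      last = v
    filled.append(last)
  return filled

def fill_gaps(values):
  """Forward fill, then backward fill whatever is still missing at the start."""
  forward = forward_fill(values)
  # A backward fill is a forward fill applied to the reversed sequence.
  return forward_fill(forward[::-1])[::-1]

# Hypothetical daily tmin readings for one station; None marks a missing day.
readings = [None, 3.5, None, None, 4.1, None]
filled = fill_gaps(readings)
```

Note that a gap in the middle takes the previous valid value rather than the strictly nearest one; only leading gaps are filled from the next valid value. A column that is entirely `None` stays that way, which is the case that `dropna(axis=1)` handles in the notebook's function.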

In [None]:
def cleanup_nulls(df, variablename):
  df2 = df.pivot_table(variablename, 'date', 'stationid', fill_value=np.nan)
  print('Before: {} null values'.format(df2.isnull().sum().sum()))
  df2.fillna(method='ffill', inplace=True)
  df2.fillna(method='bfill', inplace=True)
  df2.dropna(axis=1, inplace=True)
  print('After: {} null values'.format(df2.isnull().sum().sum()))
  return df2

In [None]:
traindf = cleanup_nulls(df[df['is_train']], 'tmin')

In [None]:
traindf.head()

In [None]:
seq = traindf.iloc[:,0]
print('{} values in the sequence'.format(len(seq)))
ax = sns.tsplot(seq)
ax.set(xlabel='day-number', ylabel='temperature');

In [None]:
seq.to_string(index=False).replace('\n', ',')

In [None]:
# Save the data to disk in such a way that each time series is on a single line
# save to sharded files, one for each year
# This takes about 15 minutes
import shutil, os
shutil.rmtree('data/temperature', ignore_errors=True)
os.makedirs('data/temperature')

def to_csv(indf, filename):
  df = cleanup_nulls(indf, 'tmin')
  print('Writing {} sequences to {}'.format(len(df.columns), filename))
  with open(filename, 'w') as ofp:
    for i in xrange(0, len(df.columns)):
      if i%10 == 0:
        print('{}'.format(i), end='...')
      seq = df.iloc[:365,i]  # chop to 365 days to avoid leap-year problems ...
      line = seq.to_string(index=False, header=False).replace('\n', ',')
      ofp.write(line + '\n')
    print('Done')

for year in xrange(2000, 2017):
  print('Querying data for {} ... hang on'.format(year))
  df = query_to_dataframe(year)
  to_csv(df[df['is_train']], 'data/temperature/train-{}.csv'.format(year))
  to_csv(df[~df['is_train']], 'data/temperature/eval-{}.csv'.format(year))

In [None]:
%bash
head -1 data/temperature/eval-2004.csv | tr ',' ' ' | wc
head -1 data/temperature/eval-2005.csv | tr ',' ' ' | wc
wc -l data/temperature/train*.csv
wc -l data/temperature/eval*.csv

In [None]:
%bash
gsutil -m rm -rf gs://${BUCKET}/temperature/*
gsutil -m cp data/temperature/*.csv gs://${BUCKET}/temperature

Each line of our CSV files holds one sequence of 365 values. For training, the first 364 values of each sequence are the inputs, and the 365th (last) value is the label.
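
As an illustration of that split, the sketch below parses one CSV line and separates inputs from the label. It assumes the convention that the last value of the sequence is the label; the function name `split_sequence` and its `n_outputs` parameter are hypothetical, and the actual parsing lives in the `sinemodel` package, which may differ in detail.

```python
def split_sequence(csv_line, n_outputs=1):
  """Parse one comma-separated line into (inputs, labels)."""
  # float() tolerates the padding that to_string() leaves around each number.
  values = [float(v) for v in csv_line.strip().split(',')]
  return values[:-n_outputs], values[-n_outputs:]

# A shortened, made-up sequence of 5 values instead of 365.
inputs, labels = split_sequence(' 1.5, 2.0, 2.5, 3.0, 3.5\n')
```

With a full 365-value line, `inputs` would have 364 entries and `labels` one.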

# Model

This is the same model code (the `sinemodel` package) as in [(Time series prediction, end-to-end)](./sinewaves.ipynb).


In [None]:
%bash
#for MODEL in dnn; do
for MODEL in linear cnn dnn lstm lstm2 lstmN; do
  OUTDIR=gs://${BUCKET}/temperature/$MODEL
  JOBNAME=temperature_${MODEL}_$(date -u +%y%m%d_%H%M%S)
  REGION=us-central1
  gsutil -m rm -rf $OUTDIR
  gcloud ml-engine jobs submit training $JOBNAME \
     --region=$REGION \
     --module-name=sinemodel.task \
     --package-path=${PWD}/sinemodel \
     --job-dir=$OUTDIR \
     --staging-bucket=gs://$BUCKET \
     --scale-tier=BASIC_GPU \
     --runtime-version=$TFVERSION \
     -- \
     --train_data_path="gs://${BUCKET}/temperature/train*.csv" \
     --eval_data_path="gs://${BUCKET}/temperature/eval*.csv"  \
     --output_dir=$OUTDIR \
     --train_steps=5000 --sequence_length=365 --model=$MODEL
done

## Monitor training with TensorBoard

Use this cell to launch TensorBoard. If TensorBoard appears blank, try refreshing after 5 minutes.

In [None]:
from google.datalab.ml import TensorBoard
TensorBoard().start('gs://{}/temperature'.format(BUCKET))

In [None]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  # print() call, because print_function was imported from __future__ earlier
  print('Stopped TensorBoard with pid {}'.format(pid))

## Results

When I ran it, these were the RMSEs that I got for different models:

| Model | # of steps | Training time | RMSE |
| --- | --- | --- | --- |
| dnn | 5000 | 19 min | 9.82 |
| cnn | 5000 | 22 min | 6.68 |
| lstm | 5000 | 41 min | 3.15 |
| lstm2 | 5000 | 107 min | 3.91 |
| lstmN | 5000 | 107 min | 11.5 |

As you can see, LSTMs can really shine on real-world time-series data, but the version that was highly tuned for the synthetic data doesn't work as well on a similar, but different, problem. We would probably have to retune it for this dataset.
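
For reference, RMSE is the square root of the mean squared difference between predictions and observed values, so lower is better. A minimal sketch of the metric, using made-up numbers rather than anything from the training runs:

```python
import math

def rmse(predictions, truths):
  """Root-mean-square error between two equal-length sequences."""
  squared_errors = [(p - t) ** 2 for p, t in zip(predictions, truths)]
  return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Made-up predicted and observed temperatures.
error = rmse([10.0, 12.0, 9.0], [11.0, 12.0, 7.0])
```

Because the errors are squared before averaging, one badly missed prediction raises the RMSE more than several small misses do.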

## Next steps
This is likely not the best way to formulate this problem. A better approach would be to pull arbitrary, shorter sequences (say, of length 20) out of the input sequences. This would be akin to image augmentation in that we would get arbitrary subsets of the data, and it would let us predict the next value from just the last 20 values instead of requiring a whole year. It would also avoid the current problem that we are training to predict only Dec. 30/31.
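
One way this could look is sketched below: enumerate every contiguous window of a given length from a year-long sequence. The function name `sliding_windows` is hypothetical, and a list of day numbers stands in for real temperatures.

```python
def sliding_windows(sequence, window_length=20):
  """Yield every contiguous run of window_length values from sequence."""
  for start in range(len(sequence) - window_length + 1):
    yield sequence[start:start + window_length]

# A stand-in for one year of daily values (day numbers 0..364).
year = list(range(365))
windows = list(sliding_windows(year, window_length=20))
```

A 365-value sequence yields 346 windows of length 20, each ending on a different day of the year. Sampling from these at random, rather than using all of them, would give the "arbitrary subsets" described above.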

Feature engineering would also help. For example, we might also add a climatological average (average temperature at this location over the last 10 years on this date) as one of the inputs. I'll leave both these improvements as exercises for the reader :)
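
As a starting point for the climatological average, here is a minimal sketch in plain Python that averages the temperature per station and day of year over whatever years are passed in. The function name `climatology`, the record layout, and the station ID are all assumptions made for illustration; the readings are made up.

```python
from collections import defaultdict

def climatology(records):
  """Average temperature per (stationid, day_of_year) across all years given.

  records: iterable of (stationid, year, day_of_year, temperature) tuples.
  """
  totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
  for stationid, _year, day_of_year, temperature in records:
    entry = totals[(stationid, day_of_year)]
    entry[0] += temperature
    entry[1] += 1
  return {key: total / count for key, (total, count) in totals.items()}

# Made-up readings for one hypothetical station on day 1 of three years.
records = [
  ('STATION_A', 2013, 1, -2.0),
  ('STATION_A', 2014, 1, 0.0),
  ('STATION_A', 2015, 1, 2.5),
]
averages = climatology(records)
```

To use this as a model input without leaking the answer, the average for a given year should be computed only from earlier years.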

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.