<h1> Creating high-resolution Landcover data using Machine Learning </h1>

In this notebook, we train a TensorFlow model to fit Landsat 8 bands to a low-resolution landcover map. Then, we use that model on the high-resolution Landsat data to create a high-resolution landcover map. In essence, we are using TensorFlow to <a href="https://gisclimatechange.ucar.edu/question/63">statistically downscale</a> the landcover data (note that the term "downscaling" is counterintuitive -- downscaling an image increases its resolution or upsamples it).

<div id="toc"></div>

In [18]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


<h2> Preprocessing using Cloud Dataflow </h2>

Cloud Dataflow can scale up and simplify preprocessing in Cloud ML. We need to read the GeoTIFFs and merge them so that all the data corresponding to a pixel becomes a single TFRecord. We also need to scale the pixel values to lie in the range [0,1]. Done naively, this will run out of memory or burn through your wallet -- the images alone total 25 GB, and fitting everything in memory would require machines with 3-4 times that much RAM.

We'll use Cloud Dataflow to distribute the preprocessing onto an autoscaled cluster of machines with 3 GB of RAM each. (Note: to get a small dataset to work with, replace 'nrows' in the row loop with a small number such as 10, and make sure that the number of workers is within your quota for simultaneous Compute Engine VMs.)
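To make the two preprocessing goals concrete, here is a minimal sketch of per-pixel merging and [0,1] scaling on a toy row. The helper names `scale_to_unit` and `merge_row` and the sample values are hypothetical. The minimum and maximum are taken from the toy row itself, whereas real preprocessing would presumably need them over the whole dataset.

```python
# Hypothetical sketch: helper names and values are illustrative only.

def scale_to_unit(values):
    """Min-max scale a sequence of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant band: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]

def merge_row(row_by_band, featnames):
    """Turn one image row, stored band by band, into one dict per pixel."""
    ncols = len(row_by_band[0])
    return [
        dict((name, row_by_band[i][col]) for i, name in enumerate(featnames))
        for col in range(ncols)
    ]

featnames = ['b1', 'b2']
row = [[10.0, 20.0, 30.0],            # band b1, three columns
       [5.0, 5.0, 15.0]]              # band b2, the same three columns
scaled = [scale_to_unit(band) for band in row]
pixels = merge_row(scaled, featnames)
print(pixels[1])
```

Each element of `pixels` holds every feature for one pixel, which is the unit that would end up in a single record.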

In [4]:
# a Python generator that packs all the training data line-by-line
def get_next_line(x):
  '''
      yield (lineno, linedata, featnames)
      where linedata is a 2D array: the first dimension is the feature index and the second is the column in the image
  '''  
  import osgeo.gdal as gdal
  import struct
  import os
  import subprocess
  
  # The gdal library cannot read from Cloud Storage, so this class downloads the data to the local VM
  class LandsatReader():
    def __init__(self, gsfile, destdir='./'):
      self.gsfile = gsfile
      self.dest = os.path.join(destdir, os.path.basename(self.gsfile))
      if os.path.exists(self.dest):
        print 'Using already existing {}'.format(self.dest)
      else:
        print 'Getting {0} to {1} '.format(self.gsfile, self.dest)
        subprocess.check_call(['gsutil', 'cp', self.gsfile, self.dest])
      self.dataset = gdal.Open( self.dest, gdal.GA_ReadOnly )
    def __enter__(self):
      return self  # lets the reader be used in a 'with' block so that __exit__ runs
    def __exit__(self, exc_type=None, exc_val=None, exc_tb=None):
      os.remove( self.dest ) # cleanup of the local copy; only runs via 'with' or an explicit call
    def ds(self):
      return self.dataset

  # open all the necessary files
  input_dir = 'gs://mdh-test/landsat-ml/'
  featnames = ['b{}'.format(band) for band in xrange(1,8)] # bands b1..b7; xrange(1,8) stops before 8
  filenames = [os.path.join(input_dir, 'landsat8-{}.tif'.format(band)) for band in featnames]
  filenames.append(os.path.join(input_dir, 'srtm-elevation.tif')); featnames.append('elev')
  filenames.append(os.path.join(input_dir, 'mcd12-labels.tif')); featnames.append('landcover')
  readers = [LandsatReader(filename) for filename in filenames]
  bands = [reader.ds().GetRasterBand(1) for reader in readers] 
  print "Opened ", filenames
      
  # read one row from each of the images and yield them
  ncols = bands[0].XSize
  nrows = bands[0].YSize
  print "Reading ", nrows, "x", ncols, " images corresponding to ", featnames
  packformat = 'f' * ncols
  for line in xrange(0, nrows):  #WARN! Change 'nrows' here to 10 to get a small (0.05% of whole) dataset to work with.
        line_data = [struct.unpack(packformat, band.ReadRaster(0, line, ncols, 1, ncols, 1, gdal.GDT_Float32)) for band in bands]
        yield (line, line_data, featnames)
      
def get_features_from_line(args):
  '''
      yield (1, dict)  or (0, dict)
      where the first number is 1 or 0 depending on whether this pixel belongs to the training (1)
      or eval (0) partition.
      dict is the set of features formed from the pixel's values in all the bands
  ''' 
  (line, line_data, featnames) = args
  if line_data:
    ncols = len(line_data[0])
    for col in xrange(0, ncols): # ncols
          featdict = {'rowcol': '{},{}'.format(line,col)}
          for f in xrange(0, len(featnames)):
            featdict[featnames[f]] = line_data[f][col]
          featdict['landcover'] = '{}'.format(int(featdict['landcover']+0.5))
          yield ( 0 if (line+col)%3==0 else 1, featdict )    # 1/3 are eval

def get_partition(group_and_featdict, nparts):
  (is_train, featdict) = group_and_featdict
  return is_train # 0 or 1

def get_featdict(group_and_featdict):
  (is_train, featdict) = group_and_featdict
  return featdict

def run():
  import os
  import numpy as np
  import apache_beam as beam
  import google.cloud.ml as ml
  import google.cloud.ml.io as io
  import google.cloud.ml.features as features

  # Change as needed
  BUCKET = 'cloud-training-demos-ml'
  PROJECT = 'cloud-training-demos' 
  OUTPUT_DIR = 'gs://{0}/landcover/preproc'.format(BUCKET); RUNNER = 'DataflowPipelineRunner'
  #OUTPUT_DIR = './preproc'; RUNNER = 'DirectPipelineRunner'
  
  pipeline = beam.Pipeline(argv=['--project', PROJECT,
                               '--runner', RUNNER,
                               '--job_name', 'landcover',
                               '--extra_package', ml.sdk_location,
                               '--max_num_workers', '10',
                               '--no_save_main_session', 'True',  # to prevent pickling and uploading Datalab itself!
                               '--setup_file', './preproc/setup.py',  # for gdal installation on the cloud -- see CUSTOM_COMMANDS in setup.py
                               '--staging_location', 'gs://{0}/landcover/staging'.format(BUCKET),
                               '--temp_location', 'gs://{0}/landcover/temp'.format(BUCKET)])
        
  print ml.sdk_location
  (evalg, traing) = (pipeline 
     | beam.Create([0]) # make the generator function like a source
     | beam.FlatMap(get_next_line)
     | beam.FlatMap(get_features_from_line) # (is_train, featdict)
     | beam.Partition(get_partition, 2)
  )  # eval, train both contain (is_train, featdict)
  eval = evalg | 'eval_features' >> beam.Map(get_featdict)
  train = traing | 'train_features' >> beam.Map(get_featdict)
  
  class LandcoverFeatures(object):
    columns = ('rowcol', 'b1', 'b2', 'landcover')
    key = features.key('rowcol')
    landcover = features.target('landcover').discrete()  # classification problem
    inputbands = [
      features.numeric('b1').scale(),
      features.numeric('b2').scale(),
      features.numeric('b3').scale(),
      features.numeric('b4').scale(),
      features.numeric('b5').scale(),
      features.numeric('b6').scale(),
      features.numeric('b7').scale(),
      #features.numeric('el').discretize(buckets=[1,5001,50], sparse=True),  # elevation
    ]
  feature_set = LandcoverFeatures()
  (metadata, train_features, eval_features) = ((train, eval) |
   'Preprocess' >> ml.Preprocess(feature_set))
  (metadata
     | 'SaveMetadata'
     >> io.SaveMetadata(os.path.join(OUTPUT_DIR, 'metadata.yaml')))
  (train_features
     | 'WriteTraining'
     >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_train')))
  (eval_features
     | 'WriteEval'
     >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_eval')))
  pipeline.run()
  
run()

gs://cloud-ml/sdk/cloudml-0.1.6-alpha.dataflow.tar.gz
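The cell above does three small things per pixel that are easy to check in isolation: it decodes a raster row from raw float32 bytes, rounds the float label to an integer string, and assigns the pixel to eval or train using `(line+col)%3`. The sketch below reproduces those steps on made-up data. The function names are hypothetical, and the bytes are packed locally rather than read with `ReadRaster`.

```python
import struct

# Pack three float32 values to imitate the raw bytes of one raster row
# (illustrative data; in the pipeline the bytes come from band.ReadRaster).
ncols = 3
raw = struct.pack('f' * ncols, 1.0, 2.5, 16.4)
row = struct.unpack('f' * ncols, raw)

def label_from_float(value):
    """Round a non-negative float label to the nearest integer, as a string."""
    return '{}'.format(int(value + 0.5))

def partition_index(line, col):
    """Return 0 (eval) when line + col is a multiple of 3, otherwise 1 (train)."""
    return 0 if (line + col) % 3 == 0 else 1

labels = [label_from_float(v) for v in row]

# Count how a 6x6 block of pixels is split between the two partitions.
counts = [0, 0]
for line in range(6):
    for col in range(6):
        counts[partition_index(line, col)] += 1

print(labels)   # ['1', '3', '16']
print(counts)   # [12, 24]
```

On this block, one third of the pixels land in the eval partition, and the eval pixels form a diagonal pattern instead of whole rows.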




In [20]:
!gsutil ls gs://cloud-training-demos-ml/landcover/preproc/

gs://cloud-training-demos-ml/landcover/preproc/features_eval-00000-of-00003.tfrecord.gz
gs://cloud-training-demos-ml/landcover/preproc/features_eval-00001-of-00003.tfrecord.gz
gs://cloud-training-demos-ml/landcover/preproc/features_eval-00002-of-00003.tfrecord.gz
gs://cloud-training-demos-ml/landcover/preproc/features_train-00000-of-00003.tfrecord.gz
gs://cloud-training-demos-ml/landcover/preproc/features_train-00001-of-00003.tfrecord.gz
gs://cloud-training-demos-ml/landcover/preproc/features_train-00002-of-00003.tfrecord.gz
gs://cloud-training-demos-ml/landcover/preproc/metadata.yaml
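The output files are sharded with names of the form `prefix-SSSSS-of-NNNNN.tfrecord.gz`. As a minimal sketch, the snippet below parses the shard names from the listing above and checks that every training shard is present. The regular expression and the `shards_for` helper are hypothetical, and `fnmatch` is only an approximation of how a Cloud Storage wildcard would behave.

```python
import fnmatch
import re

# File names copied from the listing above (bucket path removed).
listing = [
    'features_eval-00000-of-00003.tfrecord.gz',
    'features_eval-00001-of-00003.tfrecord.gz',
    'features_eval-00002-of-00003.tfrecord.gz',
    'features_train-00000-of-00003.tfrecord.gz',
    'features_train-00001-of-00003.tfrecord.gz',
    'features_train-00002-of-00003.tfrecord.gz',
    'metadata.yaml',
]

SHARD = re.compile(
    r'^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.tfrecord\.gz$')

def shards_for(prefix, names):
    """Return sorted (index, total) pairs for the shards that share a prefix."""
    found = []
    for name in names:
        match = SHARD.match(name)
        if match and match.group('prefix') == prefix:
            found.append((int(match.group('index')), int(match.group('total'))))
    return sorted(found)

train = shards_for('features_train', listing)
complete = [index for index, _ in train] == list(range(train[0][1]))
matched = fnmatch.filter(listing, 'features_train-0000*')

print(train)      # [(0, 3), (1, 3), (2, 3)]
print(complete)   # True
```

A pattern such as `features_train-0000*` matches all three shards here. Under these shell-style matching rules it would not match a shard numbered 00010 or higher, which matters if a larger run writes ten or more shards.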


<h2> Create ML model using TensorFlow </h2>

I cheated here. I simply took the <a href="https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/iris">Cloud ML sample for Iris classification</a> and copied it into my repo. The only change I had to make was to the three fields below (the key column name happens to stay the same), changing:
<pre>
KEY_FEATURE_COLUMN = 'key'
TARGET_FEATURE_COLUMN = 'species'
REAL_VALUED_FEATURE_COLUMNS = 'measurements'
</pre>
to
<pre>
KEY_FEATURE_COLUMN = 'key'
TARGET_FEATURE_COLUMN = 'landcover'
REAL_VALUED_FEATURE_COLUMNS = 'inputbands'
</pre>
Essentially, my new values match the attribute names in the LandcoverFeatures class used during preprocessing (see above). This is needed because those are the names now encoded in the TFRecord files that the preprocessing step wrote out.
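A quick way to catch a mismatch between the trainer constants and the preprocessing class is to check the names against each other. The sketch below uses a stand-in class with plain strings in place of the real feature objects, and `missing_columns` is a hypothetical helper, not part of the Cloud ML SDK.

```python
# Stand-in for the preprocessing class; the strings replace the real
# features.key(...) / features.target(...) objects for illustration only.
class LandcoverFeatures(object):
    key = 'key feature built from rowcol'
    landcover = 'discrete target'
    inputbands = ['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7']

# The three constants edited in the trainer.
KEY_FEATURE_COLUMN = 'key'
TARGET_FEATURE_COLUMN = 'landcover'
REAL_VALUED_FEATURE_COLUMNS = 'inputbands'

def missing_columns(feature_class, names):
    """Return the names that are not attributes of feature_class."""
    return [name for name in names if not hasattr(feature_class, name)]

trainer_names = [KEY_FEATURE_COLUMN, TARGET_FEATURE_COLUMN,
                 REAL_VALUED_FEATURE_COLUMNS]
iris_names = ['key', 'species', 'measurements']

print(missing_columns(LandcoverFeatures, trainer_names))  # []
print(missing_columns(LandcoverFeatures, iris_names))     # ['species', 'measurements']
```

The unmodified Iris names fail the check on exactly the two fields that had to change.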

The model itself is a classification neural network with 2 hidden layers. The Iris sample uses the tf.learn API and takes care of all the saving, exporting, distribution, etc., so I'm not going to worry too much about those details. The samples are quite useful in that way. You can use the Iris sample for classification and the Census sample for regression -- you won't have to change much provided your inputs are similar. In my case, all my inputs are like the Iris sample in that they are all real-valued columns.
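To show the shape of such a network, here is a minimal forward pass in plain Python, not the sample's actual tf.learn code. The hidden layer sizes 30 and 10 are the ones passed to training later in this notebook and the seven inputs are the Landsat bands. The ReLU activation, the random weights, and the count of 17 landcover classes are assumptions made for illustration.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    peak = max(values)                  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, for shape only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
N_INPUTS, LAYER1, LAYER2 = 7, 30, 10
N_CLASSES = 17                          # assumption: number of landcover classes

w1, b1 = init_layer(rng, N_INPUTS, LAYER1)
w2, b2 = init_layer(rng, LAYER1, LAYER2)
w3, b3 = init_layer(rng, LAYER2, N_CLASSES)

pixel = [0.2, 0.4, 0.1, 0.9, 0.5, 0.3, 0.7]   # seven scaled band values (made up)
hidden1 = relu(dense(pixel, w1, b1))
hidden2 = relu(dense(hidden1, w2, b2))
probs = softmax(dense(hidden2, w3, b3))
predicted_class = probs.index(max(probs))
print(len(hidden1), len(hidden2), len(probs))
```

Training adjusts the weights so that the highest probability falls on the landcover label of each pixel. With the untrained weights here, the prediction is meaningless.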

In [21]:
!ls -lR landcover

landcover:
total 8
-rw-r--r-- 1 root root  746 Nov  3 18:28 setup.py
drwxr-xr-x 2 root root 4096 Nov  3 23:07 trainer

landcover/trainer:
total 24
-rw-r--r-- 1 root root  677 Nov  3 21:54 __init__.py
-rw-r--r-- 1 root root 9176 Nov  3 23:07 task.py
-rw-r--r-- 1 root root 5553 Nov  3 22:51 util.py


<h2> Train model using Cloud ML </h2>

Let's train the model locally on a subset of the data to ensure that we get things right. Then, we can train on the cloud with all of the data.

<br/>
<h4> Local training </h4>

In [22]:
%bash
rm -rf /content/training-data-analyst/blogs/landsat-ml/landcover_trained
tar cvfz landcover.tgz landcover

landcover/
landcover/trainer/
landcover/trainer/task.py
landcover/trainer/util.py
landcover/trainer/__init__.py
landcover/setup.py


In [23]:
%mlalpha train
package_uris: /content/training-data-analyst/blogs/landsat-ml/landcover.tgz
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths: gs://cloud-training-demos-ml/landcover/preproc/features_train-0000*
  eval_data_paths: gs://cloud-training-demos-ml/landcover/preproc/features_eval-0000*
  metadata_path: gs://cloud-training-demos-ml/landcover/preproc/metadata.yaml
  output_path: /content/training-data-analyst/blogs/landsat-ml/landcover_trained
  max_steps:  1000
  batch_size: 1000
  layer1_size: 30
  layer2_size: 10
  learning_rate: 0.01

In [None]:
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.