# Preparation of a labeled set of aerial images from public data sources

This notebook illustrates how a set of aerial images, labeled by land use classification, was generated from freely available U.S. datasets. The image sets generated in this notebook were used to train a DNN for image classification and evaluate its performance. For more detail, please see the rest of the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository.

This walkthrough will demonstrate the technique using source data from Middlesex County, MA, the home of Microsoft's New England Research & Development (NERD) Center. The method was applied to many U.S. counties in parallel to generate the full training and validation sets.

<img src="./img/data_overview/middlesex_ma.png" />

Note: it is not necessary to complete all steps in this tutorial in order to proceed to other tutorials in this series. For a fast start, complete the steps under the [Prepare an Azure Data Science Virtual Machine for image extraction](#dsvm) section, then proceed directly to the [Dataset preparation for deep learning](#prep) section.

## Outline
- [Source data](#sourcedata)
   - [National Agriculture Imagery Program](#naip)
   - [National Land Cover Database](#nlcd)
- [Prepare an Azure Data Science Virtual Machine for image extraction](#dsvm)
   - [Provision the VM](#provision)
   - [Connect to the VM by remote desktop](#rd)
   - [Install Python packages and supporting software](#install)
   - [Download input data locally](#download)
- [Produce image sets for training and evaluation](#production)
   - [Overview of the approach](#overview)
   - [Convert NAIP images from MrSID to GeoTIFF](#conversion)
   - [Add tiling for fast image extraction](#tiling)
   - [Find a lat-lon bounding box for the region](#box)
   - [Get the approximate dimensions of the bounding box in meters](#meters)
   - [Find tiles with consistent land use class](#consistent)
   - [Extract and save images for tiles of interest](#extraction)
   - [Dataset partitioning](#partition)
- [Dataset preparation for deep learning](#prep)
   - [Cognitive Toolkit (CNTK)](#cntk)
   - [TensorFlow](#tf)
- [Next Steps](#nextsteps)

<a name="sourcedata"></a>
## Source data

<a name="naip"></a>
### National Agriculture Imagery Program

The US [National Agriculture Imagery Program](https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/index), run by the [US Department of Agriculture](https://www.usda.gov/)'s [Farm Service Agency](https://www.fsa.usda.gov/), provides images within one month of collection every 1-2 years, in natural (RGB) color with 1-meter ground sample distance. In this tutorial, we use Compressed County Mosaics obtained from the [Geospatial Data Gateway](https://gdg.sc.egov.usda.gov/); you can download the 2016 Middlesex County, MA mosaic directly from [our mirror](https://mawahstorage.blob.core.windows.net/aerialimageclassification/naipsample/ortho_imagery_NAIPM16_ma017_3344056_01.zip) if you prefer.

A zoomed-out view of the 2016 NAIP data covering Middlesex County, MA is shown below (Cambridge and Boston at ESE edge):

<img src="./img/data_overview/mediumnaip_white.png" width="500px"/>

<a name="nlcd"></a>
### National Land Cover Database

The US [National Land Cover Database](https://www.mrlc.gov/nlcd2011.php) (NLCD), maintained by the [Multi-Resolution Land Characteristics Consortium](https://www.mrlc.gov/), provides land use labels at 30-meter spatial resolution for the United States. The [sixteen land use classes](https://www.mrlc.gov/nlcd11_leg.php) in the NLCD are organized hierarchically into major (Developed, Forested, Cultivated, etc.) and minor (e.g. Deciduous, Evergreen, or Mixed Forest) categories. Classifications are based largely on seasonally-collected satellite imagery, including data collected outside the visual spectrum. Because of the extensive post-processing required to prepare the NLCD dataset, new versions are published approximately every five years, often with a multi-year delay between data collection and publication. For this project, we use the [NLCD 2011 Land Cover](https://www.mrlc.gov/nlcd11_data.php) dataset.

The colorized NLCD data for Middlesex County, MA are shown below. Developed land is shown in shades of red, water/wetlands in blue/cyan, forested land in green, and cultivated land in magenta.

<img src="./img/data_overview/mediumnlcd.png" width="500px"/>

For more information on the NLCD, please see the following publication:

Homer, C.G., Dewitz, J.A., Yang, L., Jin, S., Danielson, P., Xian, G., Coulston, J., Herold, N.D., Wickham, J.D., and Megown, K., 2015, [Completion of the 2011 National Land Cover Database for the conterminous United States - Representing a decade of land cover change information](http://bit.ly/1K7WjO3). Photogrammetric Engineering and Remote Sensing, v. 81, no. 5, p. 345-354 

<a name="dsvm"></a>
## Prepare an Azure Data Science Virtual Machine for image extraction

For reproducibility and brevity, this walkthrough describes how to set up an [Azure Data Science Virtual Machine](https://azure.microsoft.com/en-us/marketplace/partners/microsoft-ads/standard-data-science-vm/) with the [Deep Learning Toolkit](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning) for image extraction and later deep neural network training. We recommend provisioning a Deep Learning DSVM now so that you can progress seamlessly to the model training step afterward, but GPU compute is not necessary for the image extraction portion of this tutorial. If you prefer not to create a VM at this stage, you can likely adapt the steps below to install needed software on your local computer.

<a name="provision"></a>
### Provision the VM

1. In the [Azure Portal](https://ms.portal.azure.com), begin provisioning a new Deep Learning VM.
    1. Click the "+ New" button at upper left to launch a search pane.
    1. Type in "Deep Learning Toolkit for the DSVM" and press Enter.
    1. In the search results, choose the "Deep Learning Toolkit for the DSVM" published by Microsoft.
    1. After reading the description, press "Create" to begin customization.
1. In the "Basics" pane, choose a username, password, resource group, and location.
    - We recommend creating a new resource group so that you can easily delete all associated resources, like network interfaces and IP addresses, when you are finished using the VM.
    - Note that GPU VMs are not available in all regions. We used the South Central US region for this tutorial.
1. In the "Settings" pane, choose a [virtual machine size](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-sizes) that includes a graphics card based on your needs. (The default will suffice for this tutorial.)
1. Confirm your settings on the "Summary" pane, then click "Purchase" on the "Buy" pane to provision the VM.

VM deployment may take 20-30 minutes.

<a name="rd"></a>
### Connect to the VM by remote desktop

After the VM deployment is finished, you can connect to the VM by remote desktop as follows:
1. Navigate to the VM's pane in Azure Portal (e.g. by searching for the VM's name).
1. Click "Connect" along the bar on top of the pane to download an RDP (remote desktop) file.
1. Double-click the RDP file to start the connection.
1. Supply the username and password you chose earlier. You may need to specify the "domain" (VM name) as well as your username, e.g. "\\myvmname\myusername", so that the connection doesn't attempt to use your computer's default domain name.

### (Optional) Run this notebook using the VM's Jupyter notebook server

We recommend downloading this notebook on the VM and running it with the VM's Jupyter notebook server, which will allow you to easily run the code examples while reading along. (Alternatively, you can type the code snippets into the Python interpreter run from an Anaconda prompt.) Instructions for starting the VM's Jupyter server and loading notebook files can be found [here](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm#how-to-create-a-strong-password-for-jupyter-and-start-the-notebook-server).

To download this notebook on your VM, open a browser via remote desktop and navigate to [this file's entry in the git repository](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/image_set_preparation.ipynb): right-click on the "raw" button near the top of the page and save the file. Alternatively, you may download or clone [the entire repo](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) to the VM at this time. If placed in the `C:\dsvm\notebooks` directory, the file will be easily launchable from the default starting page of the Jupyter notebook server.

<a name="install"></a>
### Install Python packages and supporting software

#### LizardTech's GeoExpress Command Line Applications

Download LizardTech's free [GeoExpress Command Line Applications](https://s3.amazonaws.com/bin.us.lizardtech.com/utilities/GeoExpressCLUtils-9.5.0.4326-win64.zip) for 64-bit Windows (available on the [GIS Tools and Utilities](https://www.lizardtech.com/gis-tools/tools-and-utilities) page), which we will use to convert NAIP imagery from the provided [MrSID](https://en.wikipedia.org/wiki/MrSID) format to GeoTIFF format. After downloading and decompressing the archive on your VM, copy the files to a permanent directory of your choice. Define the path to the binary file `mrsidgeodecode` in a Python variable named `mrsidgeodecode_path`:

In [1]:
mrsidgeodecode_path = 'C:\\Program Files\\LizardTech\\GeoExpressCLUtils-9.5.0.4326-win64\\bin\\mrsidgeodecode'

Note that if you are running this notebook on the VM's Jupyter notebook server, you can define this variable by clicking on the code cell and pressing Ctrl+Enter.

#### Python packages
To read the resulting GeoTIFF, we will use the GDAL package from within Python. As of this writing, Christoph Gohlke's [Unofficial Windows Binaries for Python Extension Packages](http://www.lfd.uci.edu/~gohlke/pythonlibs/) provide the easiest means to install the GDAL package, binaries, and library. We recommend that you download the Python 3.5 (`cp35`), `win_amd64` wheels for the following packages from Christoph's site:
- [GDAL](http://www.lfd.uci.edu/~gohlke/pythonlibs/#gdal)
- [PyPROJ](http://www.lfd.uci.edu/~gohlke/pythonlibs/#pyproj)
- [Basemap](http://www.lfd.uci.edu/~gohlke/pythonlibs/#basemap)

To install these wheels:
1. Launch an Anaconda Prompt.
1. Type the following commands (in this order):

  ```
  activate py35
  pip install [GDAL wheel filepath]
  pip install [PyPROJ wheel filepath]
  pip install [Basemap wheel filepath]
  ```
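
To confirm that the wheels landed in the active environment, a quick check along the following lines may help. This is a minimal sketch: the helper name `check_modules` is our own, and the module names assume the usual import names for these packages (`osgeo` for GDAL, `pyproj`, and `mpl_toolkits.basemap`).

```python
import importlib.util

def check_modules(module_names):
    """Return {module_name: True/False} depending on whether each module can be located."""
    found = {}
    for name in module_names:
        try:
            found[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises if the parent package of a dotted name is missing
            found[name] = False
    return found

for name, ok in check_modules(['osgeo', 'pyproj', 'mpl_toolkits.basemap']).items():
    print('{:<22} {}'.format(name, 'found' if ok else 'MISSING'))
```

Any package reported as missing most likely means the corresponding `pip install` was run outside the `py35` environment.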

<a name="download"></a>
### Download input data locally

In this step, you will download the raw imagery and land use labels needed for image extraction. You may skip directly to the [Dataset preparation for deep learning](#prep) section if you prefer to work with the extracted image sets we provide.

The input data and intermediate files generated in this tutorial are quite large (>90 GB). We recommend downloading and storing all files in your VM's temporary filespace on the `D:` drive: files in this location will be deleted if the machine is restarted, but this will not be problematic if you complete the tutorial in one sitting and store the output images in a more permanent location. Alternatively, you can [provision an additional data disk](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-attach-disk-portal) to ensure that your files will not be deleted if the VM restarts.
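
Before downloading, it may be worth checking that the chosen drive has enough room. The snippet below is a small sketch using the standard library; the helper name `free_space_gb` and the `work_dir` placeholder are assumptions to adapt to your own setup.

```python
import shutil

def free_space_gb(path):
    """Return free space (in GB, counting 1 GB as 1e9 bytes) on the drive containing `path`."""
    return shutil.disk_usage(path).free / 1e9

required_gb = 90  # rough figure quoted above for inputs plus intermediate files
work_dir = '.'    # assumption: replace with your working location, e.g. 'D:\\'
available_gb = free_space_gb(work_dir)
if available_gb < required_gb:
    print('Only {:.0f} GB free; about {} GB is needed'.format(available_gb, required_gb))
else:
    print('{:.0f} GB free: enough for this tutorial'.format(available_gb))
```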

#### National Land Cover Database

[Download](https://www.mrlc.gov/nlcd11_data.php) the compressed 2011 National Land Cover Database file (1.1 GB) and extract its contents to a folder of your choice. The main files of interest in this directory are the `.img` image file (referred to in the Windows Explorer description as a "Disc Image File") and its `.ige` "image extension". Store the full path to the `.img` file by editing and executing the cell below:

In [2]:
nlcd_filepath = 'D:\\nlcd_2011_landcover_2011_edition_2014_10_10\\nlcd_2011_landcover_2011_edition_2014_10_10.img'

#### National Agriculture Imagery Program

NAIP Compressed County Mosaics can be obtained via the [Geospatial Data Gateway](https://gdg.sc.egov.usda.gov/). You may choose any county of interest, but for simplicity, this tutorial will assume that you used Middlesex County, MA 2016 data (which we have [mirrored for your convenience](https://mawahstorage.blob.core.windows.net/aerialimageclassification/naipsample/ortho_imagery_NAIPM16_ma017_3344056_01.zip)).
Download and decompress the file, storing the folder path in the following variable:

In [7]:
naip_dir = 'D:\\ortho_imagery_NAIPM16_ma017_3344056_01\\ortho_imagery'

The main file of interest in this directory is the `.sid` ([MrSID](https://en.wikipedia.org/wiki/MrSID)) file. The image is substantially compressed in this format: it will grow by ~40x when we convert the image to GeoTIFF format for further analysis. 

<a name="production"></a>
## Produce image sets for training and evaluation

<a name="overview"></a>
### Overview of the approach

Our goal is to extract many smaller aerial images ("tiles") from the NAIP image of the entire county. Each tile will be labeled with a land use classification determined from the NLCD data. Before launching into the technical details of this process, we provide an overview of the steps involved.

#### Choosing tile size
ResNet DNNs are traditionally trained with 224 px x 224 px images: this sets our minimum tile size at 224 meters x 224 meters (because our aerial imagery has approx. 1 meter resolution). A sample tile with these dimensions is shown below at full size.

<img src="./img/extraction/sample_tile.png" />

#### Proposing candidate tiles

We divided the Middlesex County, MA region along a rectangular grid of tiles, as illustrated in the approximately 1 km x 1 km region shown below: 

<img src="./img/extraction/common_naip_tiled.png" />

To create the grid, we will need to find the latitude/longitude boundaries and dimensions in meters of the county-level NAIP image. We do this by converting the image to GeoTIFF format and using the Python GDAL package to extract the necessary information from the header. Once we have defined the grid, we can use GDAL to extract the image for each tile from the county-level image.

#### Assigning land use classes to tiles

The tiles described above often contain multiple types of land, making it difficult to assign a single land use label. To illustrate this problem, we overlaid the NLCD land use labels and the NAIP aerial image in a region centered on the [Boston Common](https://binged.it/2kTjwOe) (NAIP-only image shown above):

<img src="./img/extraction/common_tiled_only.png" />

A straightforward approach to avoid ambiguity when assigning labels would be to remove any tiles with multiple land uses. Unfortunately, we found this requirement too restrictive: the number of tiles retained by this method was too small for model training. We developed an arbitrary, more lenient criterion for labeling in which we sampled the land use labels at nine regularly-spaced points within each tile, as depicted below:

<img src="./img/extraction/common_points.png" />

If all nine points had the same land use class, we assigned that land use class as the tile's label. Images that could not be assigned a label by this method were not included in the training and validation sets.
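
The nine-point rule can be sketched in isolation as follows. This is an illustration rather than the code used later in this notebook: `nine_sample_points`, `unanimous_label`, and the toy lookup function are hypothetical names, and the sample points are assumed to sit at 1/6, 3/6, and 5/6 of the tile's width and height.

```python
from collections import namedtuple

LatLonBounds = namedtuple('LatLonBounds', ['llcrnrlat', 'llcrnrlon', 'urcrnrlat', 'urcrnrlon'])

def nine_sample_points(bounds):
    """Return (lat, lon) pairs at 1/6, 3/6 and 5/6 of the tile's height and width."""
    height = bounds.urcrnrlat - bounds.llcrnrlat
    width = bounds.urcrnrlon - bounds.llcrnrlon
    fractions = [1 / 6, 3 / 6, 5 / 6]
    return [(bounds.llcrnrlat + fy * height, bounds.llcrnrlon + fx * width)
            for fy in fractions for fx in fractions]

def unanimous_label(bounds, label_lookup):
    """Return the shared label if all nine points agree, else None."""
    labels = {label_lookup(lat, lon) for lat, lon in nine_sample_points(bounds)}
    return labels.pop() if len(labels) == 1 else None

def toy_lookup(lat, lon):
    """Toy land cover: 'Forest' west of longitude 0.5, 'Developed' elsewhere."""
    return 'Forest' if lon < 0.5 else 'Developed'

print(unanimous_label(LatLonBounds(0.0, 0.0, 0.4, 0.4), toy_lookup))  # tile entirely in the forest
print(unanimous_label(LatLonBounds(0.0, 0.3, 0.4, 0.7), toy_lookup))  # tile straddling the boundary
```

A tile that straddles two land use classes yields `None` and would be discarded, which is the behavior described above.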

<a name="conversion"></a>
### Convert NAIP images from MrSID to GeoTIFF

We used LizardTech's [GeoExpress Command Line Application](https://www.lizardtech.com/gis-tools/tools-and-utilities) `mrsidgeodecode` to convert the aerial imagery data from MrSID format to GeoTIFF. This command may take several minutes to run. The resulting GeoTIFF file is roughly 45 GB in size.

In [9]:
import os
from subprocess import call

filename_base = None
for filename in os.listdir(naip_dir):
    if filename.endswith('.sid'):
        filename_base = filename.split('.sid')[0]
        break
if filename_base is None:
    raise Exception("Couldn't find the MrSID file in folder {}; is the name correct?".format(naip_dir))
    
geodecode_command = [mrsidgeodecode_path, '-wf',
                     '-i', os.path.join(naip_dir, '{}.sid'.format(filename_base)),
                     '-o', os.path.join(naip_dir, '{}.tif'.format(filename_base))]
call(geodecode_command)
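
Note that `call` returns the tool's exit code without raising an error, so a failed conversion can go unnoticed until a later cell. One possible safeguard, sketched below, is to run external commands through `subprocess.run` with `check=True`. The helper name `run_checked` is our own, and the demonstration uses the Python interpreter as a stand-in for the conversion tool.

```python
import subprocess
import sys

def run_checked(command):
    """Run an external command; raise CalledProcessError if it exits with a nonzero status."""
    completed = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                               universal_newlines=True, check=True)
    return completed.stdout

# Demonstration with the Python interpreter standing in for the conversion tool
print(run_checked([sys.executable, '-c', 'print("converted")']))
try:
    run_checked([sys.executable, '-c', 'import sys; sys.exit(3)'])
except subprocess.CalledProcessError as err:
    print('Command failed with exit code {}'.format(err.returncode))
```

With this helper, `run_checked(geodecode_command)` could take the place of `call(geodecode_command)` in the cell above.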

<a name="tiling"></a>
### Add tiling for fast image extraction

To speed up extraction of images from arbitrary regions in the 45 GB file, we add tiling to the image. Since we must write the tiled image before deleting the untiled version, we briefly double our disk space usage at this step.

In [11]:
gdal_command = ['C:\\Anaconda\\envs\\py35\\Lib\\site-packages\\osgeo\\gdal_translate',
            '-co', 'TILED=YES',
            os.path.join(naip_dir, '{}.tif'.format(filename_base)),
            os.path.join(naip_dir, '{}_tiled.tif'.format(filename_base))]
call(gdal_command)
os.remove(os.path.join(naip_dir, '{}.tif'.format(filename_base)))
naip_filepath = os.path.join(naip_dir, '{}_tiled.tif'.format(filename_base))

<a name="box"></a>
### Find a lat-lon bounding box for the region

We now extract the county's boundaries in latitude/longitude coordinates from the GeoTIFF file. Note that because of the county's irregular shape, some regions inside this lat/lon bounding box will not contain any data. The bounding box is nonetheless useful for establishing the dimensions of a tiling grid.

In [12]:
from osgeo import gdal, osr
from osgeo.gdalconst import GA_ReadOnly
from mpl_toolkits.basemap import Basemap
from collections import namedtuple

LatLonBounds = namedtuple('LatLonBounds', ['llcrnrlat', 'llcrnrlon', 'urcrnrlat', 'urcrnrlon'])

def get_bounding_box(naip_filepath):
    ''' Finds a bounding box for the NAIP GeoTIFF in lat/lon '''
    naip_image = gdal.Open(naip_filepath, GA_ReadOnly)
    naip_proj = osr.SpatialReference()
    naip_proj.ImportFromWkt(naip_image.GetProjection())
    naip_ulcrnrx, naip_xstep, _, naip_ulcrnry, _, naip_ystep = naip_image.GetGeoTransform()

    world_map = Basemap(lat_0 = 0,
                        lon_0 = 0,
                        llcrnrlat=-90, urcrnrlat=90,
                        llcrnrlon=-180, urcrnrlon=180,
                        resolution='c', projection='stere')
    world_proj = osr.SpatialReference()
    world_proj.ImportFromProj4(world_map.proj4string)
    ct_to_world = osr.CoordinateTransformation(naip_proj, world_proj)
    
    lats = []
    lons = []
    for corner_x, corner_y in [(naip_ulcrnrx, naip_ulcrnry),
                               (naip_ulcrnrx, naip_ulcrnry + naip_image.RasterYSize * naip_ystep),
                               (naip_ulcrnrx + naip_image.RasterXSize * naip_xstep,
                                naip_ulcrnry + naip_image.RasterYSize * naip_ystep),
                               (naip_ulcrnrx + naip_image.RasterXSize * naip_xstep, naip_ulcrnry)]:
        xpos, ypos, _ = ct_to_world.TransformPoint(corner_x, corner_y)
        lon, lat = world_map(xpos, ypos, inverse=True)
        lats.append(lat)
        lons.append(lon)

    return(LatLonBounds(llcrnrlat=min(lats),
                        llcrnrlon=min(lons),
                        urcrnrlat=max(lats),
                        urcrnrlon=max(lons)))

region_bounds = get_bounding_box(naip_filepath)
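
The corner arithmetic in `get_bounding_box` relies on GDAL's six-element geotransform. The following GDAL-free sketch shows how the four corners follow from that tuple and the raster dimensions; `raster_corners` and the example numbers are made up for illustration. Unlike the function above, which discards the skew terms, this version keeps them for completeness.

```python
def raster_corners(geotransform, raster_x_size, raster_y_size):
    """Return projected (x, y) of the UL, LL, LR and UR corners of a raster."""
    ulx, xstep, xskew, uly, yskew, ystep = geotransform

    def pixel_to_projected(col, row):
        # GDAL's affine model: position of the upper-left corner of pixel (col, row)
        return (ulx + col * xstep + row * xskew,
                uly + col * yskew + row * ystep)

    return [pixel_to_projected(0, 0),
            pixel_to_projected(0, raster_y_size),
            pixel_to_projected(raster_x_size, raster_y_size),
            pixel_to_projected(raster_x_size, 0)]

# Made-up north-up raster: 1 m pixels, 1000 columns x 500 rows
example_geotransform = (300000.0, 1.0, 0.0, 4700000.0, 0.0, -1.0)
for corner in raster_corners(example_geotransform, 1000, 500):
    print(corner)
```

The negative `ystep` reflects the usual convention that row numbers increase southward while projected y coordinates increase northward.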

<a name="meters"></a>
### Get the approximate dimensions of the bounding box in meters

Here we approximate the width/height of the bounding box in meters, using the fact that latitude does not change substantially on the county scale. This information is needed later to define a grid of 224 meter x 224 meter tiles.

In [13]:
RegionSize = namedtuple('RegionSize', ['width', 'height'])  # in meters!
import numpy as np

def get_approx_region_size(region_bounds):
    ''' Returns the region width (at mid-lat) and height in meters'''
    mid_lat_radians = (region_bounds.llcrnrlat + region_bounds.urcrnrlat) * \
                      (np.pi / 360)
    earth_circumference = 6.371E6 * 2 * np.pi # in meters
    region_middle_width_meters = (region_bounds.urcrnrlon - region_bounds.llcrnrlon) * \
                                 earth_circumference * np.cos(mid_lat_radians) / (360)
    region_height_meters = (region_bounds.urcrnrlat - region_bounds.llcrnrlat) * \
                           earth_circumference / (360)
    return(RegionSize(region_middle_width_meters, region_height_meters))

approx_region_size = get_approx_region_size(region_bounds)
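
As a rough check of the arithmetic above, the same approximation can be written with the standard library only. The function name `degrees_to_meters` and the half-degree example box are illustrative assumptions, not values taken from the Middlesex County data.

```python
import math

EARTH_RADIUS_M = 6.371e6  # same mean radius as used above

def degrees_to_meters(delta_lat_deg, delta_lon_deg, mid_lat_deg):
    """Approximate east-west and north-south distances (meters) for small lat/lon spans."""
    meters_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360
    height = delta_lat_deg * meters_per_degree
    # Meridians converge toward the poles, so east-west distance shrinks by cos(latitude)
    width = delta_lon_deg * meters_per_degree * math.cos(math.radians(mid_lat_deg))
    return width, height

# Illustrative values only: a 0.5 x 0.5 degree box centered at 42.5 N
width_m, height_m = degrees_to_meters(0.5, 0.5, 42.5)
print('width  ~ {:.1f} km'.format(width_m / 1000))
print('height ~ {:.1f} km'.format(height_m / 1000))
print('224 m tiles: {} wide x {} tall'.format(int(width_m // 224), int(height_m // 224)))
```

One degree of latitude works out to roughly 111 km under this spherical model, while a degree of longitude covers less ground the farther the region lies from the equator.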

<a name="consistent"></a>
### Find tiles with consistent land use class

#### Define helper functions

The function below creates two helper functions that let us query both input images (the NAIP aerial image and the NLCD land use label image) using latitude/longitude coordinates, even though each image is stored in its own projection.

In [14]:
from PIL import Image

def create_helper_functions(region_bounds, nlcd_filepath, naip_filepath):
    ''' Makes helper functions to label points (NLCD) and extract tiles (NAIP) '''
    nlcd_image = gdal.Open(nlcd_filepath, GA_ReadOnly)
    nlcd_proj = osr.SpatialReference()
    nlcd_proj.ImportFromWkt(nlcd_image.GetProjection())
    nlcd_ulcrnrx, nlcd_xstep, _, nlcd_ulcrnry, _, nlcd_ystep = nlcd_image.GetGeoTransform()
    
    naip_image = gdal.Open(naip_filepath, GA_ReadOnly)
    naip_proj = osr.SpatialReference()
    naip_proj.ImportFromWkt(naip_image.GetProjection())
    naip_ulcrnrx, naip_xstep, _, naip_ulcrnry, _, naip_ystep = naip_image.GetGeoTransform()
    
    region_map = Basemap(lat_0 = (region_bounds.llcrnrlat + region_bounds.urcrnrlat)/2,
                         lon_0 = (region_bounds.llcrnrlon + region_bounds.urcrnrlon)/2,
                         llcrnrlat=region_bounds.llcrnrlat,
                         llcrnrlon=region_bounds.llcrnrlon,
                         urcrnrlat=region_bounds.urcrnrlat,
                         urcrnrlon=region_bounds.urcrnrlon,
                         resolution='c',
                         projection='stere')
    
    region_proj = osr.SpatialReference()
    region_proj.ImportFromProj4(region_map.proj4string)
    ct_to_nlcd = osr.CoordinateTransformation(region_proj, nlcd_proj)
    ct_to_naip = osr.CoordinateTransformation(region_proj, naip_proj)

    def get_nlcd_label(point):
        ''' Project lat/lon point to NLCD GeoTIFF; return label of that point '''
        basemap_coords = region_map(point.lon, point.lat)  # NB unusual argument order
        x, y, _ = [int(i) for i in ct_to_nlcd.TransformPoint(*basemap_coords)]
        xoff = int(round((x - nlcd_ulcrnrx) / nlcd_xstep))
        yoff = int(round((y - nlcd_ulcrnry) / nlcd_ystep))
        label = int(nlcd_image.ReadAsArray(xoff=xoff, yoff=yoff, xsize=1, ysize=1))
        return(label)
        
    def get_naip_tile(tile_bounds, tile_size):
        ''' Check that tile lies within county bounds; if so, extract its image '''
        
        # Transform tile bounds in lat/lon to NAIP projection coordinates
        xmax, ymax = region_map(tile_bounds.urcrnrlon, tile_bounds.urcrnrlat)
        xmin, ymin = region_map(tile_bounds.llcrnrlon, tile_bounds.llcrnrlat)
        xstep = (xmax - xmin) / tile_size.width
        ystep = (ymax - ymin) / tile_size.height

        grid = np.mgrid[xmin:xmax:tile_size.width * 1j, ymin:ymax:tile_size.height * 1j]
        shape = grid[0, :, :].shape
        size = grid[0, :, :].size
        xy_target = np.array(ct_to_naip.TransformPoints(grid.reshape(2, size).T))
        xx = xy_target[:,0].reshape(shape)
        yy = xy_target[:,1].reshape(shape)
        
        # Extract rectangle from NAIP GeoTIFF containing superset of needed points
        xoff = int(round((xx.min() - naip_ulcrnrx) / naip_xstep))
        yoff = int(round((yy.max() - naip_ulcrnry) / naip_ystep))
        xsize_to_use = int(np.ceil((xx.max() - xx.min())/np.abs(naip_xstep))) + 1
        ysize_to_use = int(np.ceil((yy.max() - yy.min())/np.abs(naip_ystep))) + 1
        data = naip_image.ReadAsArray(xoff=xoff,
                                      yoff=yoff,
                                      xsize=xsize_to_use,
                                      ysize=ysize_to_use)        
        # Map the pixels of interest in NAIP GeoTIFF to the tile (might involve rotation or scaling)
        image = np.zeros((xx.shape[1], xx.shape[0], 3)).astype(int)  # rows are height, cols are width, third dim is color
        
        try:
            for i in range(xx.shape[0]):
                for j in range(xx.shape[1]):
                    x_idx = int(round((xx[i,j] - naip_ulcrnrx) / naip_xstep)) - xoff
                    y_idx = int(round((yy[i,j] - naip_ulcrnry) / naip_ystep)) - yoff
                    image[xx.shape[1] - j - 1, i, :] = data[:, y_idx, x_idx]
        except TypeError as e:
            # The following can occur if our pixel superset request exceeds the GeoTIFF's bounds
            return(None)
        
        if np.sum(image.sum(axis=2) == 0) > 10: # too many nodata pixels
            return None
        
        image = Image.fromarray(image.astype('uint8'))
        return(image)
    
    return(get_nlcd_label, get_naip_tile)
get_nlcd_label, get_naip_tile = create_helper_functions(region_bounds, nlcd_filepath, naip_filepath)
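
Both helpers convert projected coordinates into pixel offsets with the same rounding expression. The sketch below isolates that step; `projected_to_pixel` and the example geotransform are hypothetical, and a north-up raster (zero skew terms) is assumed, as in the helpers above.

```python
def projected_to_pixel(x, y, geotransform):
    """Return the (column, row) offset whose upper-left corner is nearest to (x, y).

    Mirrors the int(round(...)) arithmetic used above; assumes zero skew terms.
    """
    ulx, xstep, _, uly, _, ystep = geotransform
    col = int(round((x - ulx) / xstep))
    row = int(round((y - uly) / ystep))
    return col, row

# Made-up 30 m raster whose upper-left corner is at (1000000, 2000000)
example_geotransform = (1000000.0, 30.0, 0.0, 2000000.0, 0.0, -30.0)
print(projected_to_pixel(1000300.0, 1999400.0, example_geotransform))
```

Because the offset is rounded rather than floored, a point can be assigned to a neighboring pixel when it lies in the far half of its containing pixel. For 30-meter labels sampled within 224-meter tiles, that shift of at most half a pixel is small, but replacing `round` with a floor operation would select the containing pixel exactly.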

#### Tile selection

We form a rectangular grid of candidate tiles spanning as much of the bounding box as possible. The boundaries of tiles that could be assigned a label are noted, and will later be used to extract the corresponding aerial image.

In [19]:
LatLonPosition = namedtuple('LatLonPosition', ['lat', 'lon'])
Tile = namedtuple('Tile', ['bounds', 'label'])

nlcd_label_to_class = {21: 'Developed',
                       22: 'Developed',
                       23: 'Developed',
                       24: 'Developed',
                       31: 'Barren',
                       41: 'Forest',
                       42: 'Forest',
                       43: 'Forest',
                       90: 'Forest',
                       51: 'Shrub',
                       52: 'Shrub',
                       71: 'Herbaceous',
                       72: 'Herbaceous',
                       73: 'Herbaceous',
                       74: 'Herbaceous',
                       95: 'Herbaceous',
                       81: 'Cultivated',
                       82: 'Cultivated'}

def find_tiles_with_consistent_labels(region_bounds, region_size, tile_size):
    ''' Find tiles for which nine grid points all have the same label '''
    tiles_wide = int(np.floor(region_size.width / tile_size.width))
    tiles_tall = int(np.floor(region_size.height / tile_size.height))
    tile_width = (region_bounds.urcrnrlon - region_bounds.llcrnrlon) / (region_size.width / tile_size.width)
    tile_height = (region_bounds.urcrnrlat - region_bounds.llcrnrlat) / (region_size.height / tile_size.height)
    
    current_lat = region_bounds.llcrnrlat
    current_lon = region_bounds.llcrnrlon
    
    tiles_to_use = []
    for i in range(tiles_tall):
        for j in range(tiles_wide):
            try:
                labels = []
                for k in range(3):
                    for ell in range(3):
                        my_label = get_nlcd_label(LatLonPosition(lat=current_lat + tile_height * (1 + 2*k) / 6,
                                                                 lon=current_lon + tile_width * (1 + 2*ell) / 6))
                        labels.append(nlcd_label_to_class[my_label])
                num_matching = np.sum(np.array(labels) == labels[4])
                if (num_matching == 9):
                    bounds = LatLonBounds(llcrnrlat=current_lat,
                                          llcrnrlon=current_lon,
                                          urcrnrlat=current_lat + tile_height,
                                          urcrnrlon=current_lon + tile_width)
                    tiles_to_use.append(Tile(bounds=bounds,
                                             label=labels[4]))
            except KeyError:
                pass
            current_lon += tile_width
        current_lon = region_bounds.llcrnrlon
        current_lat += tile_height
    return(tiles_to_use)

tile_size = RegionSize(224, 224)
tiles = find_tiles_with_consistent_labels(region_bounds, approx_region_size, tile_size)
print('Found {} tiles to extract'.format(len(tiles)))

Note that we have grouped together similar NLCD land cover labels:
- **Developed**: 21, 22, 23, 24
- **Forest**: 41, 42, 43, 90
- **Barren**: 31
- **Shrub**: 51, 52
- **Herbaceous**: 71, 72, 73, 74, 95
- **Cultivated**: 81, 82

<a name="extraction"></a>
### Extract and save images for tiles of interest

The most time-consuming step is the extraction of images from the GeoTIFF. We use a helper function defined earlier, `get_naip_tile`, to check that the tile lies entirely inside the county boundary and, if so, return the extracted image. Images are sorted into directories based on land use class. Expect this step to take up to a few hours. You can monitor the progress by checking the number of images created in the output directory.

In [22]:
import uuid

def extract_tiles(tiles, dest_folder, filename_base):
    ''' Extracts each tile's image and saves it as a PNG in a folder named for its label '''
    if not os.path.exists(dest_folder):
        os.makedirs(dest_folder)

    # Number tiles from 1 so that filenames reflect each tile's position in the list
    for i, tile in enumerate(tiles, start=1):
        tile_image = get_naip_tile(tile.bounds, tile_size)  # tile_size is defined in an earlier cell
        if tile_image is None:
            continue  # tile did not lie entirely within the county boundary (it was at least partially blank)
        my_directory = os.path.join(dest_folder, '{}'.format(tile.label))
        my_filename = os.path.join(my_directory, '{}_{}.png'.format(filename_base, i))
        if not os.path.exists(my_directory):
            os.makedirs(my_directory)
        tile_image.save(my_filename, 'PNG')

output_dir = 'D:\\tiles'
extract_tiles(tiles, output_dir, filename_base)
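
To monitor progress while extraction runs, the number of images in each class folder can be tallied from a second notebook or Python prompt. The following is a minimal sketch; `count_images_by_class` is a hypothetical helper, and `tiles_dir` assumes the `output_dir` value used above.

```python
import os

def count_images_by_class(root_dir, extension='.png'):
    """Return {class_folder_name: number of files ending in `extension`}."""
    counts = {}
    for entry in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, entry)
        if os.path.isdir(class_dir):
            counts[entry] = sum(1 for name in os.listdir(class_dir)
                                if name.lower().endswith(extension))
    return counts

tiles_dir = 'D:\\tiles'  # assumption: the output_dir used above
if os.path.isdir(tiles_dir):
    for class_name, n_images in count_images_by_class(tiles_dir).items():
        print('{:<12} {}'.format(class_name, n_images))
```

The same tally also gives an early look at how unevenly the land use classes are represented in a county.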

<a name="partition"></a>
### Dataset partitioning

To create large and diverse training/validation sets, we repeated the steps above for 12 counties spread across the United States. We partitioned the images into training and validation sets at the county level: this division allows a reasonably accurate estimate of model performance by ensuring some dissimilarity between images seen during training and those used to evaluate the model's performance. We balanced each image set by randomly removing images from overrepresented classes.

While you are welcome to repeat those steps yourself, for expediency we recommend that you continue with this tutorial by downloading our training and validation sets (linked in the next section).
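
For readers who do repeat the extraction, the balancing step can be approximated as follows. This is a sketch under an assumption the text above does not state, namely that every class is trimmed to the size of the smallest class; the function name and the toy filenames are made up.

```python
import random

def balance_by_downsampling(paths_by_class, seed=0):
    """Randomly drop items from larger classes so every class matches the smallest one."""
    rng = random.Random(seed)  # fixed seed so that the selection is repeatable
    target = min(len(paths) for paths in paths_by_class.values())
    return {label: sorted(rng.sample(paths, target))
            for label, paths in paths_by_class.items()}

# Toy example with made-up filenames
toy = {'Forest': ['f{}.png'.format(i) for i in range(10)],
       'Developed': ['d{}.png'.format(i) for i in range(4)],
       'Barren': ['b{}.png'.format(i) for i in range(6)]}
balanced = balance_by_downsampling(toy)
print({label: len(paths) for label, paths in balanced.items()})
```

To preserve the county-level split described above, a function like this would be applied to the training counties and the validation counties separately.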

<a name="prep"></a>
## Prepare deep learning framework-specific input files

If you have not generated your own training and validation sets through image extraction, download the following files and decompress them in your VM's temporary (`D:\`) storage:
- [Balanced training image set (~3 GB)](https://mawahstorage.blob.core.windows.net/aerialimageclassification/imagesets/balanced_training_set.zip)
- [Balanced validation image set (~1 GB)](https://mawahstorage.blob.core.windows.net/aerialimageclassification/imagesets/balanced_validation_set.zip)

The image sets linked above contain raw PNG images sorted into folders by their assigned label. Many deep learning frameworks require proprietary image formats or supporting files to efficiently load images in minibatches for training. We now produce the supporting files needed by our CNTK and TensorFlow training scripts. Note that we do not need to produce similar supporting files for the validation image set, because we will not use minibatching when applying the trained models to the validation set.

Update the `training_image_dir` variable below to reflect the directory where your training set has been saved. The `label_to_number_dict` variable specifies the correspondence between the label names and their numeric codes; it does not need to be modified unless you have changed the labeling scheme.

In [None]:
training_image_dir = 'D:\\balanced_training_set'
label_to_number_dict = {'Barren': 0,
                        'Forest': 1,
                        'Shrub': 2,
                        'Cultivated': 3,
                        'Herbaceous': 4,
                        'Developed': 5}
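
If you have changed the labeling scheme, it may help to confirm that every label folder has a numeric code before generating the supporting files, since the cells below look up each folder name in `label_to_number_dict`. The following is a minimal sketch; the helper name `find_unmapped_labels` and the temporary demo folders are hypothetical and are not part of the original notebook.

```python
import os
import tempfile

def find_unmapped_labels(image_dir, label_to_number):
    """Return the subfolder names of image_dir that have no numeric code."""
    folders = [name for name in sorted(os.listdir(image_dir))
               if os.path.isdir(os.path.join(image_dir, name))]
    return [name for name in folders if name not in label_to_number]

# Demo on a throwaway directory: 'Water' is a made-up label with no code.
demo_dir = tempfile.mkdtemp()
for name in ['Forest', 'Water']:
    os.makedirs(os.path.join(demo_dir, name))

unmapped = find_unmapped_labels(demo_dir, {'Forest': 1})
print(unmapped)
```

On your own data, you would call `find_unmapped_labels(training_image_dir, label_to_number_dict)` and expect an empty list.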

<a name="cntk"></a>
### Cognitive Toolkit (CNTK)

Our CNTK training script uses a MAP file -- a tab-delimited file where each line lists an image's filepath and label -- to load image data in minibatches. We generate the MAP file as follows:

In [30]:
with open(os.path.join(training_image_dir, 'map.txt'), 'w') as map_file:
    for label in np.sort(os.listdir(training_image_dir)):
        my_dir = os.path.join(training_image_dir, label)
        if not os.path.isdir(my_dir):
            continue
        for filename in os.listdir(my_dir):
            map_file.write('{}\t{}\n'.format(os.path.join(my_dir, filename),
                                             label_to_number_dict[label]))
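
To check the result, you can read the MAP file back and count the images listed for each numeric label. This is a sketch under the assumption that each line has exactly two tab-separated fields, as written by the cell above; the helper name `count_map_labels` and the demo paths are hypothetical.

```python
import collections
import os
import tempfile

def count_map_labels(map_path):
    """Count the number of lines per numeric label in a tab-delimited MAP file."""
    counts = collections.Counter()
    with open(map_path, 'r') as map_file:
        for line_number, line in enumerate(map_file, start=1):
            fields = line.rstrip('\r\n').split('\t')
            if len(fields) != 2:
                raise ValueError('Line {} is not "path<TAB>label"'.format(line_number))
            counts[int(fields[1])] += 1
    return counts

# Demo on a small, made-up MAP file.
demo_path = os.path.join(tempfile.mkdtemp(), 'map.txt')
with open(demo_path, 'w') as demo_file:
    demo_file.write('D:\\demo\\Forest\\a.png\t1\n')
    demo_file.write('D:\\demo\\Forest\\b.png\t1\n')
    demo_file.write('D:\\demo\\Developed\\c.png\t5\n')

label_counts = count_map_labels(demo_path)
print(dict(label_counts))
```

For a balanced training set, the counts should be the same for every label.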

<a name="tf"></a>
### TensorFlow

We made use of the [`tf-slim` API](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim) for TensorFlow, which provides pre-trained ResNet models and helpful scripts for retraining and scoring. These scripts require converting the image set into [TFRecords](https://www.tensorflow.org/how_tos/reading_data/#file_formats) for minibatching. (Each TFRecord file contains many images as well as their labels.) We also create a `labels.txt` file mapping the labels to integer values, and a `dataset_split_info.csv` file describing the images assigned to the training set.

The following code was modified from the [TensorFlow models repo's slim subdirectory](https://github.com/tensorflow/models/tree/master/slim).

In [None]:
# Original Copyright 2016 The TensorFlow Authors. All Rights Reserved.
# Modified 2017 by Microsoft Corporation.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import tensorflow as tf
import pandas as pd

np.random.seed(5318)

class ImageReader(object):
    def __init__(self):
        # Initializes function that decodes RGB PNG data.
        self._decode_png_data = tf.placeholder(dtype=tf.string)
        self._decode_png = tf.image.decode_png(self._decode_png_data, channels=3)

    def read_image_dims(self, sess, image_data):
        image = self.decode_png(sess, image_data)
        return image.shape[0], image.shape[1]

    def decode_png(self, sess, image_data):
        image = sess.run(self._decode_png,
                         feed_dict={self._decode_png_data: image_data})
        assert len(image.shape) == 3
        assert image.shape[2] == 3
        return image
    
def image_to_tfexample(image_data, image_format, height, width, class_id):
    return tf.train.Example(features=tf.train.Features(feature={
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_data])),
        'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_format])),
        'image/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[class_id])),
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
    }))

def find_images(image_dir):
    training_filenames = []
    for folder in os.listdir(image_dir):
        folder_path = os.path.join(image_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        # This is a new directory/label: consider all images inside it.
        my_filenames = []
        for filename in os.listdir(folder_path):
            my_filenames.append(os.path.join(folder_path, filename))
        training_filenames.extend(my_filenames)
    print('Found {} training images'.format(len(training_filenames)))
    return training_filenames
    
def write_dataset(dataset_name, split_name, my_filenames, image_dir, n_shards=5):
    # Use float division so the shard size is rounded up under Python 2 as well.
    num_per_shard = int(np.ceil(len(my_filenames) / float(n_shards)))
    records = []
    with tf.Graph().as_default():
        image_reader = ImageReader()
        with tf.Session('') as sess:
            for shard_idx in range(n_shards):
                shard_filename = os.path.join(image_dir,
                                              '{}_{}_{:05d}-of-{:05d}.tfrecord'.format(dataset_name,
                                                                                       split_name,
                                                                                       shard_idx+1,
                                                                                       n_shards))
                with tf.python_io.TFRecordWriter(shard_filename) as tfrecord_writer:
                    for image_idx in range(num_per_shard * shard_idx,
                                           min(num_per_shard * (shard_idx+1), len(my_filenames))):
                        with open(my_filenames[image_idx], 'rb') as f:
                            image_data = f.read()
                        height, width = image_reader.read_image_dims(sess, image_data)
                        class_name = os.path.basename(os.path.dirname(my_filenames[image_idx]))
                        class_id = label_to_number_dict[class_name]
                        example = image_to_tfexample(image_data, b'png', height, width, class_id)
                        tfrecord_writer.write(example.SerializeToString())
                        records.append([dataset_name, split_name, my_filenames[image_idx], shard_idx,
                                        image_idx, class_name, class_id])
    df = pd.DataFrame(records, columns=['dataset_name', 'split_name', 'filename', 'shard_idx', 'image_idx',
                                        'class_name', 'class_id'])
    return df
 
training_image_dir = 'D:\\balanced_training_set'
training_filenames = find_images(training_image_dir)
training_filenames = np.random.permutation(training_filenames)
df = write_dataset('aerial', 'train', training_filenames, training_image_dir, n_shards=50)
df.to_csv(os.path.join(training_image_dir, 'dataset_split_info.csv'), index=False)

with open(os.path.join(training_image_dir, 'labels.txt'), 'w') as f:
    for key, value in label_to_number_dict.items():
        f.write('{}:{}\n'.format(key, value))
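
The way `write_dataset` divides images among shards can be examined without TensorFlow. The sketch below reproduces only the index arithmetic from that function (a rounded-up shard size, with each shard's upper bound clipped to the number of images); the helper name `shard_ranges` and the example sizes are hypothetical.

```python
import math

def shard_ranges(n_images, n_shards):
    """Return the (start, stop) image index range assigned to each shard."""
    num_per_shard = int(math.ceil(n_images / float(n_shards)))
    return [(num_per_shard * shard_idx,
             min(num_per_shard * (shard_idx + 1), n_images))
            for shard_idx in range(n_shards)]

# Example: 12 images in 5 shards gives a shard size of ceil(12 / 5) = 3.
ranges = shard_ranges(12, 5)
for shard_idx, (start, stop) in enumerate(ranges):
    print(shard_idx, start, stop, max(stop - start, 0))
```

Because the shard size is rounded up, the last shards can receive fewer images than the others, or none at all when the number of images is small relative to `n_shards`.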

## Next Steps

* To retrain image classification DNNs using the training image set, proceed to the [Model Training](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/model_training.ipynb) notebook.
   * You may skip this step if you choose to use our example retrained DNNs.
* To apply DNNs to the validation set images on Spark, proceed to the [Scoring on Spark](./scoring_on_spark.ipynb) notebook.