# Preparation of a labeled set of aerial images from public data sources

This notebook illustrates how a set of aerial images, labeled by land use classification, was generated from freely available U.S. datasets. The image sets generated in this notebook were used to train a DNN for image classification and evaluate its performance. For more detail, please see the rest of the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository.

This walkthrough will demonstrate the technique using source data from Middlesex County, MA, the home of Microsoft's New England Research & Development (NERD) Center. The method was applied to many U.S. counties in parallel to generate the full training and evaluation sets.

<img src="./img/data_overview/middlesex_ma.png" />

## Outline
- [Source data](#sourcedata)
   - [National Agriculture Imagery Program](#naip)
   - [National Land Cover Database](#nlcd)
- [Preparing an Azure Data Science Virtual Machine for image extraction](#dsvm)
   - [Provisioning and accessing the DSVM](#provision)
   - [Installing Python packages and supporting software](#install)
   - [Downloading input data locally](#download)
- [Producing image sets for training and evaluation](#production)
   - [Overview of the approach](#overview)
   - [Converting NAIP images from MrSID to GeoTIFF](#conversion)
   - [Add tiling for fast data extraction](#tiling)
   - [Find a lat-lon bounding box for the region](#box)
   - [Get the approximate dimensions of the bounding box in meters](#meters)
   - [Finding tiles with consistent land use class](#consistent)
   - [Extract and save images for tiles of interest](#extraction)
   - [Combining similar labels](#combine)
   - [Dataset partitioning](#partition)
- [Dataset preparation for deep learning](#prep)
   - [Cognitive Toolkit (CNTK)](#cntk)
   - [TensorFlow](#tf)
- [Data Transfer](#transfer)
- [Next Steps](#nextsteps)

<a name="sourcedata"></a>
## Source data

<a name="naip"></a>
### National Agriculture Imagery Program

The US [National Agriculture Imagery Program](https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/index) is run by the [US Department of Agriculture](https://www.usda.gov/)'s [Farm Service Agency](https://www.fsa.usda.gov/). NAIP images are provided within one month of image collection, in natural (RGB) color with 1-meter ground sample distance. In this tutorial, we use Compressed County Mosaics obtained from the [Geospatial Data Gateway](https://gdg.sc.egov.usda.gov/); you can download the 2016 Middlesex County, MA data directly from [our mirror](https://mawahstorage.blob.core.windows.net/aerialimageclassification/naipsample/ortho_imagery_NAIPM16_ma017_3344056_01.zip) if you prefer.

A zoomed-out view of the 2016 NAIP data covering Middlesex County, MA is shown below (Cambridge and Boston are at the southern end of the eastern edge):

<img src="./img/data_overview/mediumnaip_white.png" width="300 px" />

<a name="nlcd"></a>
### National Land Cover Database

The US [National Land Cover Database](https://www.mrlc.gov/nlcd2011.php) (NLCD), maintained by the [Multi-Resolution Land Characteristic Consortium](https://www.mrlc.gov/), provides land use labels at 30-meter spatial resolution for the United States. The [sixteen land use classes](https://www.mrlc.gov/nlcd11_leg.php) in the NLCD are organized hierarchically into major (Developed, Forested, Planted/Cultivated, etc.) and minor (e.g. Deciduous, Evergreen, or Mixed Forest) categories. Classifications are based largely on seasonally collected satellite imagery (including data collected outside the visual spectrum). Because of the extensive post-processing required to prepare this dataset, new versions are published approximately every five years, often with a multi-year delay between data collection and publication. For this project, we use the [NLCD 2011 Land Cover](https://www.mrlc.gov/nlcd11_data.php) dataset.

The colorized NLCD data for Middlesex County, MA are shown below. Developed land is shown in shades of red, water/wetlands in blue/cyan, forested land in green, and cultivated land in magenta.

<img src="./img/data_overview/mediumnlcd.png" width="300 px" />

For more information on the NLCD, please see the following publication:

Homer, C.G., Dewitz, J.A., Yang, L., Jin, S., Danielson, P., Xian, G., Coulston, J., Herold, N.D., Wickham, J.D., and Megown, K., 2015, [Completion of the 2011 National Land Cover Database for the conterminous United States - Representing a decade of land cover change information](http://bit.ly/1K7WjO3). Photogrammetric Engineering and Remote Sensing, v. 81, no. 5, p. 345-354 

<a name="dsvm"></a>
## Preparing an Azure Data Science Virtual Machine for image extraction

For reproducibility and brevity, this walkthrough describes how to set up an [Azure Data Science Virtual Machine](https://azure.microsoft.com/en-us/marketplace/partners/microsoft-ads/standard-data-science-vm/) for image extraction. It is very likely that the steps below can be adapted for your own computer.

<a name="provision"></a>
### Provisioning and accessing the DSVM

To provision a Data Science Virtual Machine, you will need an [Azure account](https://azure.microsoft.com/en-us/free/). Click the "Deploy to Azure" button to begin the process:

<a href="https://azuredeploy.net/?repository=https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Data-Science-Virtual-Machine/Windows">![Deploy to Azure](https://camo.githubusercontent.com/a941ea1d057c4efc2dcc0a680f43c97728ec0bd8/687474703a2f2f617a7572656465706c6f792e6e65742f6465706c6f79627574746f6e2e737667)</a>

After logging in, you can select a name for your VM, login information, and [VM size](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-sizes#a-series). We chose an "A4" VM because of the large temporary hard disk space included. (Aerial imagery data for some counties can take up >100 GB when decompressed and converted to GeoTIFF format.) If your hard disk requirements are smaller, you may prefer a VM with a solid-state drive for faster file I/O. Confirm your selections to begin deployment.

After deployment concludes, navigate to the VM's overview pane in [Azure Portal](https://portal.azure.com) by e.g. searching for the VM or resource name. Click the "Connect" button along the top of the pane to download a Remote Desktop connection file. Once the download completes, double-click the file to connect to your VM.

Note that your VM will continue to run even after you disconnect from Remote Desktop. You can stop or delete the VM at any time from [Azure Portal](https://portal.azure.com).

<a name="install"></a>
### Installing Python packages and supporting software

#### LizardTech's GeoExpress Command Line Applications
NAIP imagery is provided in [MrSID](https://en.wikipedia.org/wiki/MrSID) format; we will convert it to GeoTIFF format using LizardTech's free [GeoExpress Command Line Applications](https://www.lizardtech.com/gis-tools/tools-and-utilities). After downloading and decompressing the archive on your VM, copy the files to a permanent directory of your choice. Add the path to the binary file `mrsidgeodecode` here:

In [None]:
mrsidgeodecode_path = 'C:\\Program Files\\LizardTech\\GeoExpressCLUtils-9.5.0.4326-win64\\bin\\mrsidgeodecode'

#### Python packages
To read the resulting GeoTIFF, we will use the GDAL package from within Python. As of this writing, Christoph Gohlke's [Unofficial Windows Binaries for Python Extension Packages](http://www.lfd.uci.edu/~gohlke/pythonlibs/) provide the easiest means to install the GDAL package, binaries, and library. We downloaded the Python 3.5 wheels for the following packages from Christoph's site for this walkthrough:
- [GDAL](http://www.lfd.uci.edu/~gohlke/pythonlibs/#gdal)
- [Basemap](http://www.lfd.uci.edu/~gohlke/pythonlibs/#basemap)

To install these wheels:
1. Launch an Anaconda Prompt
1. Activate the Python 3.5 environment pre-installed on your DSVM using the following command:

  `activate py35`<br/><br/>
  
1. Install each wheel (`[wheel filename]`) with the following command:

  `pip install [wheel filename]`
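
After installing the wheels, it can be helpful to confirm that the packages are importable from the active environment before moving on. The following is a minimal sketch, not part of the original workflow; `missing_packages` is a hypothetical helper, and we assume the GDAL wheel is imported as `osgeo` and Basemap as `mpl_toolkits.basemap` (the import names used later in this notebook).

```python
import importlib.util

def missing_packages(module_names):
    """Return the module names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. mpl_toolkits) is itself absent
            found = False
        if not found:
            missing.append(name)
    return missing

# An empty list means both packages were found in this environment
print(missing_packages(['osgeo', 'mpl_toolkits.basemap']))
```

If the list is not empty, check that the `py35` environment was active when you ran `pip install`.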

<a name="download"></a>
### Downloading input data locally

#### National Land Cover Database

[Download](https://www.mrlc.gov/nlcd11_data.php) the compressed 2011 National Land Cover Database file (1.1 GB) and extract its contents to a folder of your choice (we used D:\nlcd). Store the path to the extracted file with the `.img` extension by editing and executing the cell below:

In [1]:
nlcd_filepath = 'C:\\Users\\mawah\\Documents\\landcover\\nlcd_2011_landcover_2011_edition_2014_10_10\\nlcd_2011_landcover_2011_edition_2014_10_10.img'

#### National Agriculture Imagery Program

As of this writing (1/31/2017), an unplanned outage complicates access to NAIP Compressed County Mosaics via the [Geospatial Data Gateway](https://gdg.sc.egov.usda.gov/). We have mirrored [a sample file](https://mawahstorage.blob.core.windows.net/aerialimageclassification/naipsample/ortho_imagery_NAIPM16_ma017_3344056_01.zip) (Middlesex County, MA 2016 data) for your convenience in following this tutorial. (You can make additional county requests using the email address provided on the gateway.) Download and decompress the file, storing the folder path in the following variable:

In [5]:
naip_dir = 'C:\\Users\\mawah\\Documents\\landcover\\ortho_imagery_NAIPM16_ma017_3344056_01\\ortho_imagery'

<a name="production"></a>
## Producing image sets for training and evaluation

<a name="overview"></a>
### Overview of the approach

Our goal is to extract many smaller aerial images ("tiles") from the NAIP GeoTIFF. Each tile will be labeled with a land use classification determined from the NLCD data.

#### Choosing tile size
ResNet DNNs are traditionally trained with 224 px x 224 px images: this sets our minimum tile size at 224 meters x 224 meters (because our aerial imagery has approx. 1 meter resolution). A sample tile with these dimensions is shown below at full size.
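
The arithmetic behind this choice can be written out as a short sketch. The resolutions are the approximate values quoted in this notebook (1 m for NAIP, 30 m for the NLCD); the variable names are our own.

```python
NAIP_RESOLUTION_M = 1.0    # approximate NAIP ground sample distance
NLCD_RESOLUTION_M = 30.0   # NLCD cell size
TILE_SIZE_PX = 224         # input width/height expected by the ResNet models

# Side length of one tile on the ground, in meters
tile_side_m = TILE_SIZE_PX * NAIP_RESOLUTION_M

# How many NLCD cells one tile spans along a side, and in total
nlcd_cells_per_side = tile_side_m / NLCD_RESOLUTION_M
nlcd_cells_per_tile = nlcd_cells_per_side ** 2

print(tile_side_m, round(nlcd_cells_per_side, 2), round(nlcd_cells_per_tile, 1))
```

Each tile therefore covers roughly 7.5 NLCD cells per side (about 56 cells in total), which is why a tile can easily contain more than one land use class.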

<img src="./img/extraction/sample_tile.png" />

#### Proposing candidate tiles

We divided the Middlesex County, MA region along a rectangular grid of candidate tiles.

<img src="./img/extraction/common_naip_tiled.png" />

Many tiles from this grid were not retained in our image set because they could not be assigned a land use class label (see next subsection).

#### Assigning land use classes to tiles

It is straightforward to label a tile if the land use classification is homogeneous throughout the tile. Unfortunately, it is very common for a 224 m x 224 m region chosen at random to include multiple different land uses. To illustrate this problem, we overlaid the NLCD and NAIP data in a region centered on the [Boston Common](https://binged.it/2kTjwOe) (NAIP-only image shown above):

<img src="./img/extraction/common_tiled_only.png" />

We found that too few tiles would be extracted if we required that each tile have a single land use. We developed an arbitrary, more lenient criterion for tile inclusion by sampling the land use class at nine regularly-spaced points within each tile, as depicted below. If all nine points sampled had the same land use class, we retained the tile for our image set and assigned the land use class as a label.

<img src="./img/extraction/common_points.png" />
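
The nine-point criterion can be sketched independently of any geospatial library. In the sketch below, `label_at` is a hypothetical lookup function standing in for the NLCD query defined later in this notebook, and the toy label map is invented for illustration. The sample points sit at 1/6, 3/6, and 5/6 of the tile's width and height.

```python
def nine_point_label(label_at, x0, y0, width, height):
    """Return the shared label if all nine sample points agree, else None."""
    labels = []
    for k in range(3):
        for ell in range(3):
            x = x0 + width * (1 + 2 * k) / 6
            y = y0 + height * (1 + 2 * ell) / 6
            labels.append(label_at(x, y))
    if all(label == labels[0] for label in labels):
        return labels[0]
    return None

# Toy label map: class 41 (deciduous forest) left of x = 100, class 21 elsewhere
def toy_label_at(x, y):
    return 41 if x < 100 else 21

print(nine_point_label(toy_label_at, 0, 0, 90, 90))   # 41: all points in forest
print(nine_point_label(toy_label_at, 60, 0, 90, 90))  # None: tile straddles the boundary
```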

<a name="conversion"></a>
### Converting NAIP images from MrSID to GeoTIFF

We used [LizardTech's GeoExpress Command Line Application](https://www.lizardtech.com/gis-tools/tools-and-utilities) `mrsidgeodecode` to convert the aerial imagery data from MrSID format to GeoTIFF.

In [6]:
import os
from subprocess import run

# Locate the MrSID file in the NAIP directory
filename_base = None
for filename in os.listdir(naip_dir):
    if filename.endswith('.sid'):
        filename_base = filename.split('.sid')[0]
        break
if filename_base is None:
    raise Exception("Couldn't find the MrSID file in {}".format(naip_dir))
    
# Convert the MrSID file to GeoTIFF with mrsidgeodecode
geodecode_command = [mrsidgeodecode_path, '-wf',
                     '-i', os.path.join(naip_dir, '{}.sid'.format(filename_base)),
                     '-o', os.path.join(naip_dir, '{}.tif'.format(filename_base))]
results = run(geodecode_command)

<a name="tiling"></a>
### Add tiling for fast data extraction

To speed up the extraction of images from arbitrary regions of the GeoTIFF, we add tiling to the image. Since we must write the tiled image before deleting the untiled version, we effectively double our disk space usage at this step.
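
To see why tiling helps, consider a simplified model in which the reader must decode every storage block that overlaps the requested window. The sketch below rests on several assumptions: an untiled GeoTIFF stored as full-width strips one row tall, a tiled GeoTIFF using 256 x 256 blocks (the GDAL default when `TILED=YES`), and a hypothetical raster 60,000 pixels wide. Real read costs also depend on compression and caching.

```python
def blocks_touched(offset, length, block):
    """Number of blocks of the given size overlapped by [offset, offset + length)."""
    first = offset // block
    last = (offset + length - 1) // block
    return last - first + 1

def pixels_decoded(xoff, yoff, width, height, block_w, block_h):
    """Pixels that must be decoded to serve a window, under the block model above."""
    n_blocks = blocks_touched(xoff, width, block_w) * blocks_touched(yoff, height, block_h)
    return n_blocks * block_w * block_h

raster_width = 60000  # hypothetical county mosaic width, in pixels
stripped = pixels_decoded(10100, 20100, 224, 224, block_w=raster_width, block_h=1)
tiled = pixels_decoded(10100, 20100, 224, 224, block_w=256, block_h=256)
print(stripped, tiled)
```

Under these assumptions, a 224 x 224 window touches 224 full-width strips in the untiled file but only four blocks in the tiled file.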

In [None]:
gdal_command = ['C:\\Anaconda\\envs\\py35\\Lib\\site-packages\\osgeo\\gdal_translate',
            '-co', 'TILED=YES',
            os.path.join(naip_dir, '{}.tif'.format(filename_base)),
            os.path.join(naip_dir, '{}_tiled.tif'.format(filename_base))]
results = run(gdal_command)
os.remove(os.path.join(naip_dir, '{}.tif'.format(filename_base)))

In [7]:
naip_filepath = os.path.join(naip_dir, '{}_tiled.tif'.format(filename_base))

<a name="box"></a>
### Find a lat-lon bounding box for the region

We now find the lat-lon bounding box of the NAIP GeoTIFF. Some locations inside this bounding box will not have any available data, but the box is still useful for first-order estimation of the usable region.

In [8]:
from osgeo import gdal, osr
from osgeo.gdalconst import GA_ReadOnly
from mpl_toolkits.basemap import Basemap
from collections import namedtuple

LatLonBounds = namedtuple('LatLonBounds', ['llcrnrlat', 'llcrnrlon', 'urcrnrlat', 'urcrnrlon'])

def get_bounding_box(naip_filepath):
    ''' Finds a bounding box for the NAIP GeoTIFF in lat/lon '''
    naip_image = gdal.Open(naip_filepath, GA_ReadOnly)
    naip_proj = osr.SpatialReference()
    naip_proj.ImportFromWkt(naip_image.GetProjection())
    naip_ulcrnrx, naip_xstep, _, naip_ulcrnry, _, naip_ystep = naip_image.GetGeoTransform()

    world_map = Basemap(lat_0 = 0,
                        lon_0 = 0,
                        llcrnrlat=-90, urcrnrlat=90,
                        llcrnrlon=-180, urcrnrlon=180,
                        resolution='c', projection='stere')
    world_proj = osr.SpatialReference()
    world_proj.ImportFromProj4(world_map.proj4string)
    ct_to_world = osr.CoordinateTransformation(naip_proj, world_proj)
    
    lats = []
    lons = []
    for corner_x, corner_y in [(naip_ulcrnrx, naip_ulcrnry),
                               (naip_ulcrnrx, naip_ulcrnry + naip_image.RasterYSize * naip_ystep),
                               (naip_ulcrnrx + naip_image.RasterXSize * naip_xstep,
                                naip_ulcrnry + naip_image.RasterYSize * naip_ystep),
                               (naip_ulcrnrx + naip_image.RasterXSize * naip_xstep, naip_ulcrnry)]:
        xpos, ypos, _ = ct_to_world.TransformPoint(corner_x, corner_y)
        lon, lat = world_map(xpos, ypos, inverse=True)
        lats.append(lat)
        lons.append(lon)

    return(LatLonBounds(llcrnrlat=min(lats),
                        llcrnrlon=min(lons),
                        urcrnrlat=max(lats),
                        urcrnrlon=max(lons)))

region_bounds = get_bounding_box(naip_filepath)
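
The corner coordinates used in `get_bounding_box` follow from GDAL's six-element affine geotransform. The sketch below reproduces that step without GDAL; `raster_corners` is a hypothetical helper and the example raster is invented for illustration.

```python
def raster_corners(geotransform, n_cols, n_rows):
    """Return projected (x, y) for the UL, LL, LR, and UR corners of a raster."""
    ulx, xstep, xrot, uly, yrot, ystep = geotransform

    def to_xy(col, row):
        # GDAL's affine mapping from pixel/line indices to projected coordinates
        return (ulx + col * xstep + row * xrot,
                uly + col * yrot + row * ystep)

    return [to_xy(0, 0), to_xy(0, n_rows), to_xy(n_cols, n_rows), to_xy(n_cols, 0)]

# Hypothetical north-up raster: 1 m pixels, upper-left corner at (300000, 4700000)
corners = raster_corners((300000.0, 1.0, 0.0, 4700000.0, 0.0, -1.0), 1000, 2000)
print(corners)
```

Note that `ystep` is negative for a north-up image, so row indices increase southward.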

<a name="meters"></a>
### Get the approximate dimensions of the bounding box in meters

Here we approximate the width/height of the bounding box in meters, using the fact that latitude does not change substantially on the county scale.

In [9]:
RegionSize = namedtuple('RegionSize', ['width', 'height'])  # in meters!
import numpy as np

def get_approx_region_size(region_bounds):
    ''' Returns the region width (at mid-lat) and height in meters'''
    mid_lat_radians = (region_bounds.llcrnrlat + region_bounds.urcrnrlat) * \
                      (np.pi / 360)
    earth_circumference = 6.371E6 * 2 * np.pi # in meters
    region_middle_width_meters = (region_bounds.urcrnrlon - region_bounds.llcrnrlon) * \
                                 earth_circumference * np.cos(mid_lat_radians) / (360)
    region_height_meters = (region_bounds.urcrnrlat - region_bounds.llcrnrlat) * \
                           earth_circumference / (360)
    return(RegionSize(region_middle_width_meters, region_height_meters))

approx_region_size = get_approx_region_size(region_bounds)
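
To get a feel for how much error the mid-latitude approximation introduces, we can compare the width of a box at its southern, middle, and northern edges. This is an illustrative sketch: the box below is hypothetical (about half a degree tall, at a latitude similar to Massachusetts), and `edge_widths_m` is not part of the original workflow.

```python
import math

EARTH_RADIUS_M = 6.371e6
METERS_PER_DEGREE = 2 * math.pi * EARTH_RADIUS_M / 360

def edge_widths_m(llcrnrlat, urcrnrlat, lon_span_deg):
    """East-west width in meters of a lat/lon box at three latitudes."""
    latitudes = {'south': llcrnrlat,
                 'middle': (llcrnrlat + urcrnrlat) / 2,
                 'north': urcrnrlat}
    return {name: lon_span_deg * METERS_PER_DEGREE * math.cos(math.radians(lat))
            for name, lat in latitudes.items()}

widths = edge_widths_m(llcrnrlat=42.2, urcrnrlat=42.7, lon_span_deg=0.6)
relative_spread = (widths['south'] - widths['north']) / widths['middle']
print(widths, relative_spread)
```

For a box of this size the width changes by less than one percent between the southern and northern edges.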

<a name="consistent"></a>
### Finding tiles with consistent land use class

#### Define helper functions

The function below defines a point labeler that we will use for tile selection. At the same time, we define an image extractor that we will use later in the tutorial. (Both functions require a coordinate transform from lat-lon to GeoTIFF-specific coordinates; defining them together keeps things succinct.)

In [10]:
from PIL import Image

def create_helper_functions(region_bounds, nlcd_filepath, naip_filepath):
    ''' Makes helper functions to label points (NLCD) and extract tiles (NAIP) '''
    nlcd_image = gdal.Open(nlcd_filepath, GA_ReadOnly)
    nlcd_proj = osr.SpatialReference()
    nlcd_proj.ImportFromWkt(nlcd_image.GetProjection())
    nlcd_ulcrnrx, nlcd_xstep, _, nlcd_ulcrnry, _, nlcd_ystep = nlcd_image.GetGeoTransform()
    
    naip_image = gdal.Open(naip_filepath, GA_ReadOnly)
    naip_proj = osr.SpatialReference()
    naip_proj.ImportFromWkt(naip_image.GetProjection())
    naip_ulcrnrx, naip_xstep, _, naip_ulcrnry, _, naip_ystep = naip_image.GetGeoTransform()
    
    region_map = Basemap(lat_0 = (region_bounds.llcrnrlat + region_bounds.urcrnrlat)/2,
                         lon_0 = (region_bounds.llcrnrlon + region_bounds.urcrnrlon)/2,
                         llcrnrlat=region_bounds.llcrnrlat,
                         llcrnrlon=region_bounds.llcrnrlon,
                         urcrnrlat=region_bounds.urcrnrlat,
                         urcrnrlon=region_bounds.urcrnrlon,
                         resolution='c',
                         projection='stere')
    
    region_proj = osr.SpatialReference()
    region_proj.ImportFromProj4(region_map.proj4string)
    ct_to_nlcd = osr.CoordinateTransformation(region_proj, nlcd_proj)
    ct_to_naip = osr.CoordinateTransformation(region_proj, naip_proj)

    def get_nlcd_label(point):
        ''' Project lat/lon point to NLCD GeoTIFF; return label of that point '''
        basemap_coords = region_map(point.lon, point.lat)  # NB unusual argument order
        x, y, _ = [int(i) for i in ct_to_nlcd.TransformPoint(*basemap_coords)]
        xoff = int(round((x - nlcd_ulcrnrx) / nlcd_xstep))
        yoff = int(round((y - nlcd_ulcrnry) / nlcd_ystep))
        label = int(nlcd_image.ReadAsArray(xoff=xoff, yoff=yoff, xsize=1, ysize=1))
        return(label)
        
    def get_naip_tile(tile_bounds, tile_size):
        ''' Check that tile lies within county bounds; if so, extract its image '''
        
        # Transform tile bounds in lat/lon to NAIP projection coordinates
        xmax, ymax = region_map(tile_bounds.urcrnrlon, tile_bounds.urcrnrlat)
        xmin, ymin = region_map(tile_bounds.llcrnrlon, tile_bounds.llcrnrlat)
        xstep = (xmax - xmin) / tile_size.width
        ystep = (ymax - ymin) / tile_size.height

        grid = np.mgrid[xmin:xmax:tile_size.width * 1j, ymin:ymax:tile_size.height * 1j]
        shape = grid[0, :, :].shape
        size = grid[0, :, :].size
        xy_target = np.array(ct_to_naip.TransformPoints(grid.reshape(2, size).T))
        xx = xy_target[:,0].reshape(shape)
        yy = xy_target[:,1].reshape(shape)
        
        # Extract rectangle from NAIP GeoTIFF containing superset of needed points
        xoff = int(round((xx.min() - naip_ulcrnrx) / naip_xstep))
        yoff = int(round((yy.max() - naip_ulcrnry) / naip_ystep))
        xsize_to_use = int(np.ceil((xx.max() - xx.min())/np.abs(naip_xstep))) + 1
        ysize_to_use = int(np.ceil((yy.max() - yy.min())/np.abs(naip_ystep))) + 1
        data = naip_image.ReadAsArray(xoff=xoff,
                                      yoff=yoff,
                                      xsize=xsize_to_use,
                                      ysize=ysize_to_use)        
        # Map the pixels of interest in NAIP GeoTIFF to the tile (might involve rotation or scaling)
        image = np.zeros((xx.shape[1], xx.shape[0], 3)).astype(int)  # rows are height, cols are width, third dim is color
        
        try:
            for i in range(xx.shape[0]):
                for j in range(xx.shape[1]):
                    x_idx = int(round((xx[i,j] - naip_ulcrnrx) / naip_xstep)) - xoff
                    y_idx = int(round((yy[i,j] - naip_ulcrnry) / naip_ystep)) - yoff
                    image[xx.shape[1] - j - 1, i, :] = data[:, y_idx, x_idx]
        except TypeError as e:
            # The following can occur if our pixel superset request exceeds the GeoTIFF's bounds
            return(None)
        
        if np.sum(image.sum(axis=2) == 0) > 10: # too many nodata pixels
            return None
        
        image = Image.fromarray(image.astype('uint8'))
        return(image)
    
    return(get_nlcd_label, get_naip_tile)
get_nlcd_label, get_naip_tile = create_helper_functions(region_bounds, nlcd_filepath, naip_filepath)
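
Both helper functions convert projected coordinates into pixel offsets by inverting the raster's geotransform. A minimal sketch of that step for a north-up raster is shown below; `world_to_pixel` is a hypothetical helper and the 30 m grid is invented for illustration.

```python
def world_to_pixel(x, y, geotransform):
    """Convert projected (x, y) to (column, row) offsets for a north-up raster."""
    ulx, xstep, _, uly, _, ystep = geotransform
    col = int(round((x - ulx) / xstep))
    row = int(round((y - uly) / ystep))
    return col, row

# Hypothetical 30 m grid with its upper-left corner at (0, 3000)
nlcd_like_transform = (0.0, 30.0, 0.0, 3000.0, 0.0, -30.0)
print(world_to_pixel(95.0, 2920.0, nlcd_like_transform))  # (3, 3)
```

This sketch rounds to the nearest offset, as the helper functions above do. Because the geotransform origin is the outer corner of the upper-left pixel, using `math.floor` instead would select the cell that strictly contains the point; the two choices differ by at most one cell.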

#### Tile selection

We form a rectangular grid of candidate tiles spanning as much of the (approximated-as-rectangular) bounding box as possible. Tiles that have "reasonably homogeneous" land use classification are retained for image extraction in the next step. 

In [11]:
LatLonPosition = namedtuple('LatLonPosition', ['lat', 'lon'])
Tile = namedtuple('Tile', ['bounds', 'label'])

def find_tiles_with_consistent_labels(region_bounds, region_size, tile_size):
    ''' Find tiles for which nine grid points all have the same label '''
    tiles_wide = int(np.floor(region_size.width / tile_size.width))
    tiles_tall = int(np.floor(region_size.height / tile_size.height))
    tile_width = (region_bounds.urcrnrlon - region_bounds.llcrnrlon) / (region_size.width / tile_size.width)
    tile_height = (region_bounds.urcrnrlat - region_bounds.llcrnrlat) / (region_size.height / tile_size.height)
    
    current_lat = region_bounds.llcrnrlat
    current_lon = region_bounds.llcrnrlon
    
    tiles_to_use = []
    for i in range(tiles_tall):
        for j in range(tiles_wide):
            try:
                labels = []
                for k in range(3):
                    for ell in range(3):
                        labels.append(get_nlcd_label(LatLonPosition(lat=current_lat + tile_height * (1 + 2*k) / 6,
                                                                    lon=current_lon + tile_width * (1 + 2*ell) / 6)))
                num_matching = np.sum(np.array(labels) == labels[4])
                if (num_matching == 9) and (labels[4] > 10):
                    bounds = LatLonBounds(llcrnrlat=current_lat,
                                          llcrnrlon=current_lon,
                                          urcrnrlat=current_lat + tile_height,
                                          urcrnrlon=current_lon + tile_width)
                    tiles_to_use.append(Tile(bounds=bounds,
                                             label=labels[4]))
            except KeyError:
                pass
            current_lon += tile_width
        current_lon = region_bounds.llcrnrlon
        current_lat += tile_height
    return(tiles_to_use)

tile_size = RegionSize(224, 224)
tiles = find_tiles_with_consistent_labels(region_bounds, approx_region_size, tile_size)
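
The size of the candidate grid follows directly from the region's approximate dimensions. The sketch below repeats that arithmetic for a hypothetical region (0.6 degrees of longitude by 0.5 degrees of latitude, measuring 49,000 m by 55,600 m); `grid_dimensions` is our own illustrative helper.

```python
import math

def grid_dimensions(lon_span_deg, lat_span_deg, width_m, height_m, tile_m=224):
    """Return the tile counts and the tile width/height in degrees."""
    tiles_wide = math.floor(width_m / tile_m)
    tiles_tall = math.floor(height_m / tile_m)
    tile_width_deg = lon_span_deg * tile_m / width_m
    tile_height_deg = lat_span_deg * tile_m / height_m
    return tiles_wide, tiles_tall, tile_width_deg, tile_height_deg

tiles_wide, tiles_tall, tile_width_deg, tile_height_deg = grid_dimensions(
    lon_span_deg=0.6, lat_span_deg=0.5, width_m=49000, height_m=55600)
print(tiles_wide, tiles_tall, tiles_wide * tiles_tall)
```

In this example the grid holds 218 x 248 candidate tiles, and each candidate requires nine NLCD lookups, so the labeling step scales with the area of the bounding box.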

<a name="extraction"></a>
### Extract and save images for tiles of interest

The most time-consuming step is the extraction of images from the GeoTIFF. We use a helper function defined earlier, `get_naip_tile`, to check that the tile lies entirely within the county's imagery (i.e., that it contains essentially no blank pixels) and, if so, return the extracted image. Images are sorted into directories based on land use class.

In [None]:
def extract_tiles(tiles, dest_folder, filename_base):
    ''' Extracts an image for each tile and saves it in a folder named for its label '''
    if not os.path.exists(dest_folder):
        os.makedirs(dest_folder)
        
    # Number tiles from 1 so each filename records the tile's position in the list
    for i, tile in enumerate(tiles, start=1):
        tile_image = get_naip_tile(tile.bounds, tile_size)
        if tile_image is None:
            continue  # tile did not lie entirely within the county boundary (at least partially blank)
        my_directory = os.path.join(dest_folder, '{:02d}'.format(tile.label))
        my_filename = os.path.join(my_directory, '{}_{}.png'.format(filename_base, i))
        if not os.path.exists(my_directory):
            os.makedirs(my_directory)
        tile_image.save(my_filename, 'PNG')
    return

output_dir = naip_dir  # if analyzing multiple counties, we recommend saving to the same output directory
extract_tiles(tiles, output_dir, filename_base)

<a name="combine"></a>
### Combining similar labels

Some land use classes are difficult to distinguish in the visible spectrum using "leaf-on" imagery. For example, in the images extracted above, you are probably unable to distinguish the types of forest/woody wetland, or separate "lightly developed" from "moderately developed" land. For our use case, we found it acceptable to combine similar land use types into larger categories. If you choose this route, you may elect to combine the image folders after extraction, or increase the number of usable images by incorporating your grouping into the `find_tiles_with_consistent_labels()` function.

We combined the land use types as follows:
- **Developed**: 21, 22, 23, 24
- **Water/Wetlands**: 11, 12, 95 (note: our dataset did not include any images of 12)
- **Forest**: 41, 42, 43, 90
- **Barren**: 31
- **Shrubland**: 51, 52 (note: our dataset did not include any examples of 51)
- **Grassland**: 71, 72, 73, 74 (note: our dataset did not include any examples of 72, 73, 74)
- **Cultivated**: 81, 82
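
The grouping above can be expressed as a lookup table. The sketch below is one way to do so; the dictionary and function names are our own, and the group names are simply the headings from the list above.

```python
# NLCD codes belonging to each combined category (from the list above)
NLCD_GROUPS = {
    'Developed': [21, 22, 23, 24],
    'Water/Wetlands': [11, 12, 95],
    'Forest': [41, 42, 43, 90],
    'Barren': [31],
    'Shrubland': [51, 52],
    'Grassland': [71, 72, 73, 74],
    'Cultivated': [81, 82],
}

# Invert the table so each NLCD code maps to its combined category
NLCD_TO_GROUP = {code: group
                 for group, codes in NLCD_GROUPS.items()
                 for code in codes}

def combined_label(nlcd_code):
    """Return the combined category for an NLCD code, or None if it is not grouped."""
    return NLCD_TO_GROUP.get(nlcd_code)

print(combined_label(90), combined_label(95))  # Forest Water/Wetlands
```

A table like this could be applied either when merging image folders after extraction or when comparing labels during tile selection.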

<a name="partition"></a>
### Dataset partitioning

We chose to divide images into training and evaluation datasets at the county level. This partitioning imposes a realistic dissimilarity between images in each set, which in turn contributes to a more accurate estimate of model performance. Other important considerations during partitioning include maintaining similar class compositions in training vs. evaluation sets; balancing classes within each dataset (through pruning if necessary); and including diverse examples spanning the range of images likely to be encountered by the trained model after deployment.

After combining labels and partitioning, our images were grouped by subfolders (named 0-6) in the folder `E:\combined\train_subsample`.
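
A county-level split can be sketched as follows. This is only an illustration of the idea: the random assignment, the 20% evaluation fraction, and the county identifiers are all assumptions, and the sketch does not address class balance.

```python
import random

def split_by_county(county_ids, eval_fraction=0.2, seed=0):
    """Assign whole counties to either the training or the evaluation set."""
    counties = sorted(set(county_ids))
    rng = random.Random(seed)
    rng.shuffle(counties)
    n_eval = max(1, round(eval_fraction * len(counties)))
    eval_counties = set(counties[:n_eval])
    train_counties = set(counties[n_eval:])
    return train_counties, eval_counties

# Hypothetical county identifiers
example_counties = ['county_{:02d}'.format(i) for i in range(10)]
train_counties, eval_counties = split_by_county(example_counties)
print(sorted(eval_counties))
```

Because every image from a given county lands in the same set, no tile in the evaluation set is adjacent to a tile seen during training.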

<a name="prep"></a>
## Dataset preparation for deep learning

Above, we produced raw PNG images sorted into folders by label. We now produce the supporting input files that our training scripts require. Note that we do not need to produce these supporting files for the evaluation set.

<a name="cntk"></a>
### Cognitive Toolkit (CNTK)

Our CNTK training script expects a tab-delimited "map" file that includes the full filepath and label for each image:

In [8]:
import os

def describe_image(filename, label, map_file):
    ''' Writes one tab-delimited line (filepath, label) to the map file '''
    map_file.write('{}\t{}\n'.format(filename, label))
    return
    
image_dir = 'E:\\combined\\train_subsample'

# Each subfolder of image_dir is named for the label of the images it contains
with open(os.path.join(image_dir, 'map.txt'), 'w') as map_file:
    for label in os.listdir(image_dir):
        my_dir = os.path.join(image_dir, label)
        if not os.path.isdir(my_dir):
            continue
        for filename in os.listdir(my_dir):
            describe_image(filename=os.path.join(my_dir, filename), label=label, map_file=map_file)
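
Once the map file has been written, a quick sanity check is to read it back and count the images per label. The sketch below demonstrates the idea on a small temporary map file with made-up entries; `count_labels` is a hypothetical helper.

```python
import collections
import os
import tempfile

def count_labels(map_path):
    """Count the images listed for each label in a tab-delimited map file."""
    counts = collections.Counter()
    with open(map_path) as map_file:
        for line in map_file:
            filepath, label = line.rstrip('\n').split('\t')
            counts[label] += 1
    return counts

# Demonstrate on a temporary map file containing three made-up entries
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'map.txt')
    with open(demo_path, 'w') as demo_file:
        demo_file.write('E:\\combined\\train_subsample\\0\\a_1.png\t0\n')
        demo_file.write('E:\\combined\\train_subsample\\0\\a_2.png\t0\n')
        demo_file.write('E:\\combined\\train_subsample\\1\\b_7.png\t1\n')
    demo_counts = count_labels(demo_path)

print(dict(demo_counts))
```

Large differences between the per-label counts may indicate that the training set needs rebalancing.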

<a name="tf"></a>
### TensorFlow

We made use of the [`tf-slim` API](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim) for TensorFlow, which provides pre-trained ResNet models and helpful scripts for retraining and scoring. Below, we convert our raw PNG images to the [TFRecords](https://www.tensorflow.org/how_tos/reading_data/#file_formats) files that those scripts expect as input. (Our evaluation images will be scored on Spark without conversion to TFRecord format.) We also create a `labels.txt` file mapping the folder names to integer labels, and a `dataset_split_info.csv` file describing the images assigned to the training set.

The following code was modified from the [TensorFlow models repo's slim subdirectory](https://github.com/tensorflow/models/tree/master/slim).

In [None]:
# Original Copyright 2016 The TensorFlow Authors. All Rights Reserved.
# Modified 2017 by Microsoft Corporation.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import numpy as np
import tensorflow as tf
import pandas as pd
import os

np.random.seed(5318)

class ImageReader(object):
    def __init__(self):
        # Initializes function that decodes RGB PNG data.
        self._decode_png_data = tf.placeholder(dtype=tf.string)
        self._decode_png = tf.image.decode_png(self._decode_png_data, channels=3)

    def read_image_dims(self, sess, image_data):
        image = self.decode_png(sess, image_data)
        return image.shape[0], image.shape[1]

    def decode_png(self, sess, image_data):
        image = sess.run(self._decode_png,
                         feed_dict={self._decode_png_data: image_data})
        assert len(image.shape) == 3
        assert image.shape[2] == 3
        return image
    
def image_to_tfexample(image_data, image_format, height, width, class_id):
    return tf.train.Example(features=tf.train.Features(feature={
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_data])),
        'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_format])),
        'image/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[class_id])),
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
    }))

def find_and_split_images(image_dir, validation_fraction=0.2):
    class_names = []
    training_filenames = []
    validation_filenames = []
    
    for folder in os.listdir(image_dir):
        folder_path = os.path.join(image_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        # This is a new directory/label -- consider all images inside it
        class_names.append(folder)
        my_filenames = []
        for filename in os.listdir(folder_path):
            my_filenames.append(os.path.join(folder_path, filename))
        n_validation = int(np.ceil(validation_fraction * len(my_filenames)))
        validation_filenames.extend(my_filenames[:n_validation])
        training_filenames.extend(my_filenames[n_validation:])
    print('Found {} training and {} validation images'.format(len(training_filenames),
                                                              len(validation_filenames)))
    return(sorted(class_names), training_filenames, validation_filenames)
    
def write_dataset(dataset_name, split_name, my_filenames, class_names_to_ids, image_dir, n_shards=5):
    num_per_shard = int(np.ceil(len(my_filenames) / n_shards))
    records = []
    with tf.Graph().as_default():
        image_reader = ImageReader()
        with tf.Session('') as sess:
            for shard_idx in range(n_shards):
                shard_filename = os.path.join(image_dir,
                                              '{}_{}_{:05d}-of-{:05d}.tfrecord'.format(dataset_name,
                                                                                       split_name,
                                                                                       shard_idx+1,
                                                                                       n_shards))
                with tf.python_io.TFRecordWriter(shard_filename) as tfrecord_writer:
                    for image_idx in range(num_per_shard * shard_idx,
                                           min(num_per_shard * (shard_idx+1), len(my_filenames))):
                        with open(my_filenames[image_idx], 'rb') as f:
                            image_data = f.read()
                        # Getting some sort of early EOF error with this version on Windows Server 2012:
                        # image_data = tf.gfile.FastGFile(my_filenames[image_idx], 'r').read()
                        height, width = image_reader.read_image_dims(sess, image_data)
                        class_name = os.path.basename(os.path.dirname(my_filenames[image_idx]))
                        class_id = class_names_to_ids[class_name]
                        example = image_to_tfexample(image_data, b'png', height, width, class_id)
                        tfrecord_writer.write(example.SerializeToString())
                        records.append([dataset_name, split_name, my_filenames[image_idx], shard_idx,
                                        image_idx, class_name, class_id])
    df = pd.DataFrame(records, columns=['dataset_name', 'split_name', 'filename', 'shard_idx', 'image_idx',
                                        'class_name', 'class_id'])
    return(df)
 
image_dir = 'E:\\combined\\train_subsample'

# Our validation set has already been created and resides in a separate folder.
# Assign all of the images in image_dir to the training set.
class_names, training_filenames, _ = find_and_split_images(image_dir, 0.0)
training_filenames = np.random.permutation(training_filenames)
class_names_to_ids = dict(zip(class_names, list(range(len(class_names)))))
df = write_dataset('aerial', 'train', training_filenames, class_names_to_ids, image_dir, n_shards=50)
df.to_csv(os.path.join(image_dir, 'dataset_split_info.csv'), index=False)

# Write one "integer label:folder name" line per class
with open(os.path.join(image_dir, 'labels.txt'), 'w') as f:
    for i, class_name in enumerate(class_names):
        f.write('{}:{}\n'.format(i, class_name))
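
The way `write_dataset` divides images among shards can be examined in isolation. The sketch below reproduces the index arithmetic with a hypothetical helper, `shard_ranges`, using small numbers chosen for illustration.

```python
import math

def shard_ranges(n_items, n_shards):
    """Return the half-open (start, stop) index range assigned to each shard."""
    per_shard = math.ceil(n_items / n_shards)
    return [(min(per_shard * idx, n_items), min(per_shard * (idx + 1), n_items))
            for idx in range(n_shards)]

ranges = shard_ranges(n_items=12, n_shards=5)
print(ranges)  # [(0, 3), (3, 6), (6, 9), (9, 12), (12, 12)]
```

Because the per-shard count is rounded up, the final shards can be short or even empty when the number of images is small relative to the number of shards. With many images per shard, as in our training set, the shards are nearly equal in size.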

<a name="transfer"></a>

## Data Transfer

After generating the training and evaluation input data, we recommend creating a backup (and potentially transferring to a more powerful computer or VM for model training). We chose to transfer our data to an [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account) account using the command line tool [AzCopy](https://docs.microsoft.com/en-us/azure/storage/storage-use-azcopy) (but see [Azure Storage Explorer](http://storageexplorer.com/) for a GUI-based alternative).

Don't forget to delete any unneeded resources once you have completed the tutorial!

<a name="nextsteps"></a>
## Next Steps

Please see the next notebook in the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository, [Model Training](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/model_training.ipynb), for information on training or retraining a deep neural network for image classification.