To get ground truth, we're going to download data from OpenStreetMap (OSM).  OSM data can be hard to work with, but fortunately there is a great library that can help, called geopandas_osm.  If you haven't already, you will need to set up the geospatial tools before you can use this script.

This script downloads map features from OSM.  OSM has many kinds of features, such as buildings, waterways, and natural areas.  You can find a description of the features on the [OSM Wiki](http://wiki.openstreetmap.org/wiki/Map_Features).

In [1]:
import json

import shapely.geometry
import geopandas as gpd
import geopandas_osm.osm

# Bounding box around all geometries in the image metadata file.
meta_df = gpd.read_file('vectors/image_metadata.geojson')
poly = shapely.geometry.box(*meta_df.unary_union.bounds)

# Download all OSM ways tagged as buildings inside the bounding box.
osm_df = geopandas_osm.osm.query_osm('way', poly, recurse='down', tags='building')
building_columns = osm_df.columns

# Keep only rows with a building tag, then reduce each geometry to its centroid.
buildings = osm_df[~osm_df.building.isnull()][['building', 'name', 'geometry']]
building_centroids = buildings.set_geometry(buildings.centroid, inplace=False)
building_centroids.to_file('vectors/building_centers.geojson', 'GeoJSON')


Now that we've downloaded the buildings, let's look at the different building types and how often they occur.

In [3]:
from collections import Counter

Counter(building_centroids.building.values)

Counter({'abandoned': 1,
         'amenity': 1,
         'apartments': 18,
         'church': 15,
         'civic': 5,
         'commercial': 2,
         'construction': 3,
         'farm': 1,
         'garage': 133,
         'garages': 10,
         'greenhouse': 1,
         'historic': 1,
         'hospital': 1,
         'house': 54,
         'industrial': 1,
         'kindergarten': 1,
         'office': 5,
         'public': 1,
         'residential': 1010,
         'retail': 49,
         'roof': 38,
         'school': 41,
         'shop': 1,
         'temple': 1,
         'train_station': 13,
         'walled': 1,
         'warehouse': 1,
         'yes': 68644})
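
To put the tally in perspective, the counts can be ranked and turned into shares. This is a minimal sketch that copies the counts printed above into a plain `Counter` instead of downloading them again; the names `counts`, `total`, `top_three`, and `yes_share` are illustrative and not part of the notebook.

```python
from collections import Counter

# Counts copied by hand from the output above (not re-downloaded from OSM).
counts = Counter({
    'abandoned': 1, 'amenity': 1, 'apartments': 18, 'church': 15,
    'civic': 5, 'commercial': 2, 'construction': 3, 'farm': 1,
    'garage': 133, 'garages': 10, 'greenhouse': 1, 'historic': 1,
    'hospital': 1, 'house': 54, 'industrial': 1, 'kindergarten': 1,
    'office': 5, 'public': 1, 'residential': 1010, 'retail': 49,
    'roof': 38, 'school': 41, 'shop': 1, 'temple': 1,
    'train_station': 13, 'walled': 1, 'warehouse': 1, 'yes': 68644,
})

total = sum(counts.values())          # number of tagged buildings
top_three = counts.most_common(3)     # most frequent building values
yes_share = counts['yes'] / total     # fraction tagged only as 'yes'

print(top_three)
print(f"{yes_share:.1%} of buildings carry only the generic 'yes' value")
```

The generic `yes` value dominates, so for most features the building tag says only that something is a building, not what kind.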

We can see that these match up with the building types described on the [OSM Wiki](http://wiki.openstreetmap.org/wiki/Map_Features#Building).  If you want to view the GeoJSON file you downloaded, you can use QGIS or [geojson.io](http://geojson.io).
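
If neither tool is at hand, a GeoJSON file can also be sanity-checked with the standard library alone. The sketch below assumes the file is a standard GeoJSON FeatureCollection; `summarize_geojson` is a hypothetical helper, and the inline sample (with made-up coordinates and names) stands in for the real file so the block runs on its own.

```python
import json
from collections import Counter

def summarize_geojson(feature_collection):
    """Return (feature count, geometry-type counts, property-key counts)."""
    features = feature_collection.get('features', [])
    # Count geometry types; features with a null geometry are counted as None.
    geometry_types = Counter(
        (f.get('geometry') or {}).get('type') for f in features
    )
    # Count how many features carry each property key.
    property_keys = Counter(
        key for f in features for key in (f.get('properties') or {})
    )
    return len(features), geometry_types, property_keys

# Made-up sample in the same shape as a FeatureCollection of building centroids.
sample = json.loads('''
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "properties": {"building": "yes", "name": null},
   "geometry": {"type": "Point", "coordinates": [0.0, 0.0]}},
  {"type": "Feature",
   "properties": {"building": "house", "name": "Example"},
   "geometry": {"type": "Point", "coordinates": [1.0, 1.0]}}
]}
''')

n, geom_types, prop_keys = summarize_geojson(sample)
print(n, dict(geom_types), dict(prop_keys))
```

To check the real download, load `vectors/building_centers.geojson` with `json.load` inside a `with open(...)` block and pass the result to the helper in place of `sample`.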

Next, let's get the land-use data that's in OSM.

In [4]:
osm_df = geopandas_osm.osm.query_osm('way', poly, recurse='down', tags='landuse')
landuse = osm_df[~osm_df.landuse.isnull()]
print(landuse.landuse.unique())
print(landuse.shape)
landuse.to_file('vectors/landuse.geojson', 'GeoJSON')


['cemetery' 'grass' 'forest' 'basin' 'retail' 'recreation_ground'
 'industrial' 'farmyard' 'quarry' 'school' 'residential' 'vineyard'
 'allotments' 'commercial' 'farmland' 'construction' 'sand' 'brownfield'
 'greenhouse_horticulture' 'subdivision' 'meadow']
(332, 50)


Now let's download the waterways.

In [5]:
osm_df = geopandas_osm.osm.query_osm('way', poly, recurse='down', tags='waterway')
waterways = osm_df[~osm_df.waterway.isnull()]
print(waterways.waterway.unique())
print(waterways.shape)
waterways.to_file('vectors/waterways.geojson', 'GeoJSON')

['stream' 'riverbank' 'river']
(41, 6)



Finally, let's get the natural features that are marked in OSM.

In [6]:
osm_df = geopandas_osm.osm.query_osm('way', poly, recurse='down', tags='natural')
nature = osm_df[~osm_df.natural.isnull()]
print(nature.natural.unique())
print(nature.shape)
nature.to_file('vectors/nature.geojson', 'GeoJSON')

['fell' 'water' 'wood' 'sand']
(58, 8)



We need to decide which categories to train our model on.  To do that, let's compare the GlobCover categories with the land types in OSM and decide how to group them.

Here are the land types in GlobCover:
* Post-flooding or irrigated croplands (or aquatic)
* Rainfed croplands
* Mosaic cropland (50-70%) / vegetation (grassland/shrubland/forest) (20-50%)
* Mosaic vegetation (grassland/shrubland/forest) (50-70%) / cropland (20-50%) 
* Closed to open (>15%) broadleaved evergreen or semi-deciduous forest (>5m)
* Closed (>40%) broadleaved deciduous forest (>5m)
* Open (15-40%) broadleaved deciduous forest/woodland (>5m)
* Closed (>40%) needleleaved evergreen forest (>5m)
* Open (15-40%) needleleaved deciduous or evergreen forest (>5m)
* Closed to open (>15%) mixed broadleaved and needleleaved forest (>5m)
* Mosaic forest or shrubland (50-70%) / grassland (20-50%)
* Mosaic grassland (50-70%) / forest or shrubland (20-50%) 
* Closed to open (>15%) (broadleaved or needleleaved, evergreen or deciduous) shrubland (<5m)
* Closed to open (>15%) herbaceous vegetation (grassland, savannas or lichens/mosses)
* Sparse (<15%) vegetation
* Closed to open (>15%) broadleaved forest regularly flooded (semi-permanently or temporarily) - Fresh or brackish water
* Closed (>40%) broadleaved forest or shrubland permanently flooded - Saline or brackish water
* Closed to open (>15%) grassland or woody vegetation on regularly flooded or waterlogged soil - Fresh, brackish or saline water
* Artificial surfaces and associated areas (Urban areas >50%)
* Bare areas
* Water bodies
* Permanent snow and ice
* No data (burnt areas, clouds,…)

Looking at the data from OSM, I think reasonable categories would be:
* Water
* Wood
* Urban

We can find the land in each category by looking at the OSM features.
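
One way this lookup could work is a small table from OSM tag values to the three categories. The mapping below is an assumption for illustration, built only from tag values that appear in the outputs above; which values belong in which category is a judgment call, and `CATEGORY_RULES` and `categorize` are hypothetical names, not part of the notebook.

```python
# Assumed mapping from category to the (OSM key -> values) that count toward it.
CATEGORY_RULES = {
    'Water': {
        'natural': {'water'},
        'waterway': {'stream', 'riverbank', 'river'},
        'landuse': {'basin'},
    },
    'Wood': {
        'natural': {'wood'},
        'landuse': {'forest'},
    },
    'Urban': {
        'landuse': {'residential', 'retail', 'commercial', 'industrial'},
    },
}

def categorize(key, value):
    """Return the training category for an OSM tag, or None if unmapped."""
    # Assumption: any feature with a building tag counts as urban.
    if key == 'building' and value is not None:
        return 'Urban'
    for category, rules in CATEGORY_RULES.items():
        if value in rules.get(key, ()):
            return category
    return None

examples = [('natural', 'water'), ('landuse', 'forest'),
            ('building', 'yes'), ('landuse', 'vineyard')]
for key, value in examples:
    print(key, value, '->', categorize(key, value))
```

Values that fall outside the table, such as `vineyard`, return `None` so they can be reviewed or dropped explicitly instead of being forced into a category.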