# Wrangling OpenStreetMap Data for Waterloo, Canada
## By @IanEdington



### Map Area: Region of Waterloo, Canada
    https://www.openstreetmap.org/relation/2062154
    https://www.openstreetmap.org/relation/2062153

### References used during this project
    https://www.udacity.com/course/viewer#!/c-ud032-nd/l-760758686/m-817328934
    https://docs.python.org/3/library/xml.etree.elementtree.html
    https://docs.python.org/2/library/re.html
    http://stackoverflow.com/questions/5029934/python-defaultdict-of-defaultdict


In [1]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from pprint import pprint
import re
from importlib import reload

#-- show plots in notebook
%matplotlib inline

#-- Import wrangling functions using my lasso
import Lasso as l

In [77]:
# Reloading my lasso whenever it gets low
l = reload(l) #assign to l so it stops showing the module output

## Understanding the data
Getting an idea of what is going on inside the chosen area
by looking at the possible values for each tag type.

In [78]:
atr_d, st_atr_d, s_st_d, tag_k_v_dict = l.summarizes_data_2_tags_deep('waterloo-OSM-data.osm')

In [62]:
# types of top level tags:
# Expected [node, way, relation]
pprint (sorted(list(atr_d.keys())))

['member', 'meta', 'nd', 'node', 'osm', 'relation', 'tag', 'way']


###### Why do member, nd and tag show up as top-level tags?
It looks like all start tags were analysed, not just the top-level ones. This won't change our analysis. It will actually be useful to see how tag is used as a child of different tags.
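
To make the difference concrete, here is a minimal sketch that counts tags both ways. The XML fragment is made up for illustration and the function names are hypothetical; this is not the code inside Lasso.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up OSM fragment, used only for illustration.
SAMPLE_OSM = b"""<osm version="0.6">
  <node id="1" lat="43.46" lon="-80.52"><tag k="amenity" v="cafe"/></node>
  <way id="2"><nd ref="1"/><tag k="highway" v="residential"/></way>
  <relation id="3"><member type="way" ref="2" role=""/><tag k="type" v="route"/></relation>
</osm>"""


def count_all_start_tags(source):
    """Count every element that iterparse reports, at any depth."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("start",)):
        counts[elem.tag] += 1
    return counts


def count_top_level_tags(source):
    """Count only the direct children of the root <osm> element."""
    root = ET.parse(source).getroot()
    return Counter(child.tag for child in root)


all_tags = count_all_start_tags(io.BytesIO(SAMPLE_OSM))
top_tags = count_top_level_tags(io.BytesIO(SAMPLE_OSM))
print(sorted(all_tags))  # ['member', 'nd', 'node', 'osm', 'relation', 'tag', 'way']
print(sorted(top_tags))  # ['node', 'relation', 'way']
```

Counting start events picks up the root and every nested child, which would explain why osm, member, nd and tag appear in the list above.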

###### What about &lt;osm&gt;?
Only one osm element in set:<br/>
&lt;osm version="0.6" generator="Overpass API"&gt;

###### What about &lt;note&gt;?
Only one note element in set:<br/>
&lt;note&gt; The data included in this document is from www.openstreetmap.org. The data is made available under ODbL. &lt;/note&gt;

###### What about &lt;meta&gt;?
Only one meta element in set:<br/>
&lt;meta osm_base="2015-07-16T03:14:03Z"/&gt;

## Focus on node, way, relation

### Understand the difference between them

In [82]:
print(dict(s_st_d['node']))
print(dict(s_st_d['way']))
print(dict(s_st_d['relation']))

{'node': {'tag'}}
{'way': {'nd', 'tag'}}
{'relation': {'member', 'tag'}}


    All three have 'tag' tags.
    Ways have 'nd' tags.
    Relations have 'member' tags.
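
A parent-to-children summary like the one above could be built roughly as follows. This is only a sketch under my own assumptions: `child_tags_by_parent` is a hypothetical name, the sample XML is made up, and the returned structure is flatter than what Lasso prints.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up fragment, for illustration only.
SAMPLE_OSM = b"""<osm version="0.6">
  <node id="1" lat="43.46" lon="-80.52"><tag k="amenity" v="cafe"/></node>
  <way id="2"><nd ref="1"/><tag k="highway" v="residential"/></way>
  <relation id="3"><member type="way" ref="2" role=""/><tag k="type" v="route"/></relation>
</osm>"""


def child_tags_by_parent(source, parents=("node", "way", "relation")):
    """Map each parent tag name to the set of child tag names seen under it."""
    children = defaultdict(set)
    # On an "end" event the element is complete, so its children are available.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in parents:
            children[elem.tag].update(child.tag for child in elem)
            elem.clear()  # release the element's children to keep memory use low
    return dict(children)


structure = child_tags_by_parent(io.BytesIO(SAMPLE_OSM))
print(structure)
```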

### nodes
#### node attributes
Attributes of node: Expected [id, lat, lon, ...]
Address keys should be here.
All keys should be indexable: check for problem keys.

In [64]:
#list of node attribute names (keys)
node_attribute_key_list = list(atr_d['node'].keys())

# check for problem keys
print (l.check_keys_list(node_attribute_key_list))
print (sorted(node_attribute_key_list))

[]
['changeset', 'id', 'lat', 'lon', 'timestamp', 'uid', 'user', 'version']


    No problem chars in node attribute keys. Seems interesting that these are the only attributes.
    Everything else must be kept in the tags.
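
The implementation of `l.check_keys_list` is not shown in this notebook. As a rough sketch of what a problem-key check could look like, the following flags keys containing characters that are awkward as database field names. The regular expression and the function name are assumptions of mine.

```python
import re

# Characters that tend to cause trouble in field names.
# This pattern is an assumption, not necessarily the one Lasso uses.
PROBLEM_CHARS = re.compile(r'[=+/&<>;\'"?%#$@,. \t\r\n]')


def find_problem_keys(keys):
    """Return the keys that contain at least one problem character, sorted."""
    return sorted(key for key in keys if PROBLEM_CHARS.search(key))


clean_keys = ['changeset', 'id', 'lat', 'lon', 'timestamp', 'uid', 'user', 'version']
print(find_problem_keys(clean_keys))                       # []
print(find_problem_keys(['addr:street', 'bad key', 'a.b']))  # ['a.b', 'bad key']
```

Colons are deliberately allowed here, since namespaced keys such as `addr:street` are normal in OSM data.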

#### Node Tag k:v pairs

k:v pairs of Tags on node.
Expected: a bunch of different keys describing different types of nodes.
All keys should be indexable: check for problem keys.

Keys from tags shouldn't conflict with attribute keys from same node. How can we check this? Set check of node_attribute_key_list vs node_tag_key_list.

In [65]:
#list of node tag keys
node_tag_key_list = list(tag_k_v_dict['node'].keys())

# check for problem keys, conflict with node attribute keys, print all keys
print ('Problem Keys:')
print (l.check_keys_list(node_tag_key_list))
print ('Set check against node_attribute_key_list:')
print (set(node_tag_key_list)&set(node_attribute_key_list))
print ('list of all tag keys')
pprint (sorted(node_tag_key_list))

Problem Keys:
[]
Set check against node_attribute_key_list:
set()
list of all tag keys
['FIXME',
 'access',
 'addr:city',
 'addr:country',
 'addr:housename',
 'addr:housenumber',
 'addr:interpolation',
 'addr:postcode',
 'addr:province',
 'addr:state',
 'addr:street',
 'addr:unit',
 'administrative',
 'aerialway',
 'aeroway',
 'alcohol',
 'alt_name',
 'amenity',
 'artist',
 'artist_name',
 'artwork_type',
 'atm',
 'automated',
 'backrest',
 'barrier',
 'beauty',
 'bench',
 'bicycle',
 'bicycle_parking',
 'bin',
 'board_type',
 'books',
 'booth',
 'bottle',
 'brand',
 'building',
 'building:levels',
 'built',
 'bus',
 'button',
 'button_operated',
 'canvec:UUID',
 'capacity',
 'car',
 'clothes',
 'colour',
 'contact:phone',
 'contact:website',
 'content',
 'contents',
 'covered',
 'craft',
 'created_by',
 'crossing',
 'crossing:barrier',
 'crossing:bell',
 'crossing:light',
 'cuisine',
 'currency:CAD',
 'cycleway',
 'dbh_cm',
 'denomination',
 'description',
 'designation',
 'destinatio

Observations:
    1. No problem chars in node tag keys.
    2. There doesn't seem to be overlap between the two key lists
    3. "addr:" fields are only one level deep (i.e. no extra colons)
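
The third observation can be checked mechanically instead of by eye. A minimal sketch, assuming a hypothetical helper `depth_summary` and using a few keys copied from the node and way key lists in this notebook:

```python
from collections import Counter


def depth_summary(keys, prefix="addr:"):
    """Count how many colons each key with the given prefix contains."""
    return Counter(key.count(":") for key in keys if key.startswith(prefix))


# A few keys copied from the key lists printed in this notebook.
NODE_KEYS = ['addr:city', 'addr:housenumber', 'addr:street', 'addr:unit', 'amenity']
WAY_KEYS = ['addr:city', 'addr:housename', 'addr:housename:zh', 'addr:street']

print(depth_summary(NODE_KEYS))  # every node addr key has exactly one colon
print(depth_summary(WAY_KEYS))   # 'addr:housename:zh' has two
```

Keys that are two levels deep need their own handling when the address is reshaped into a nested structure.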

Interesting keys:
*    Stand alone: ['FIXME', 'place', 'note', 'fixme', 'dbh_cm']
*    Address: ['addr:city', 'addr:country', 'addr:housename', 'addr:housenumber', 'addr:interpolation', 'addr:postcode', 'addr:province', 'addr:state', 'addr:street', 'addr:unit',]


#### Check interesting keys for problems

In [14]:
# print contents of interesting keys. (leave address to later section)
for key in ['FIXME', 'place', 'note', 'fixme', 'dbh_cm']:
    print ('This is the contents of ' + key)
    pprint (tag_k_v_dict['node'][key])

This is the contents of FIXME
{'believe this is further E. To be checked',
 'construction continues. Check early 2010',
 'not so sure about this corner',
 'split?',
 'survey'}
This is the contents of place
{'locality', 'city', 'village', 'neighbourhood', 'hamlet', 'suburb'}
This is the contents of note
{'2x',
 '71.73.77.80.83.84.89.90.93,95.94.99.100.105.,109.110.113.114.119.120.121.124.132.136',
 'Archaeological and Historic Sites Board of Ontario plaque',
 'Erected by the United Rubber Workers',
 'FIXME: Correct designation for an historic building???',
 'Full Serve',
 'Historic Sites and Monuments Board of Canada plaque',
 'IRA Needles',
 'Iron Horse Trail',
 'Kitchener City Hall',
 'Kitchener Market',
 'Level 2 charger',
 'Men, Women, Family Washroom',
 'New transit hub (proposed)',
 'Permanent barrier blocks vehicles. Road continues',
 'Regional road number not referenced on exit signs',
 'Rental property',
 'Route 7D',
 'Sells Bitcoin: tinkercoin.com/dvlb',
 'Still used, but only

    'FIXME': marks elements that warrant a closer look
    'place': just another location descriptor
    'note': lots of notes, one of which starts with 'FIXME:'
    'fixme': marks elements that warrant a closer look
    'dbh_cm': appears only once, seems to refer to a tree

FIXME and fixme should be merged under a single key.
The note that starts with 'FIXME:' should be moved out of the notes and into that key.
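
One possible way to do this cleanup on a single element's tags is sketched below. The function name, the choice of lowercase `fixme` as the target key and the `'; '` separator are all assumptions, not decisions made elsewhere in this project.

```python
def normalize_fixme(tags):
    """Return a copy of `tags` with all FIXME-style text under one 'fixme' key."""
    cleaned = {}
    fixmes = []
    for key, value in tags.items():
        if key.lower() == "fixme":
            # Covers both 'FIXME' and 'fixme'.
            fixmes.append(value)
        elif key == "note" and value.upper().startswith("FIXME:"):
            # A note that is really a fixme: strip the marker and move it.
            fixmes.append(value[len("FIXME:"):].strip())
        else:
            cleaned[key] = value
    if fixmes:
        cleaned["fixme"] = "; ".join(fixmes)
    return cleaned


example = {
    "FIXME": "survey",
    "note": "FIXME: Correct designation for an historic building???",
    "place": "locality",
}
print(normalize_fixme(example))
```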

### Ways

#### Way attributes
Attributes of way: Expected ['changeset', 'id', 'lat', 'lon', 'timestamp', 'uid', 'user', 'version']

All keys should be indexable: check for problem keys.



In [44]:
#list of way attribute names (keys)
way_attribute_key_list = list(atr_d['way'].keys())

# check for problem keys
print (l.check_keys_list(way_attribute_key_list))
print (sorted(way_attribute_key_list))


[]
['changeset', 'id', 'timestamp', 'uid', 'user', 'version']


    Interesting: ways have no lat or lon attributes. All other attributes are the same as for nodes.
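
Since a way carries `nd` children instead of coordinates, its position has to come from the nodes it references. Here is a small sketch of that lookup, using a made-up XML fragment and a hypothetical function name:

```python
import io
import xml.etree.ElementTree as ET

# Made-up fragment, for illustration only.
SAMPLE_OSM = b"""<osm version="0.6">
  <node id="1" lat="43.4643" lon="-80.5204"/>
  <node id="2" lat="43.4650" lon="-80.5210"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""


def way_coordinates(source):
    """Map each way id to the (lat, lon) pairs of the nodes it references."""
    root = ET.parse(source).getroot()
    positions = {
        node.get("id"): (float(node.get("lat")), float(node.get("lon")))
        for node in root.iter("node")
    }
    ways = {}
    for way in root.iter("way"):
        refs = [nd.get("ref") for nd in way.iter("nd")]
        # Skip references to nodes that are not part of this extract.
        ways[way.get("id")] = [positions[ref] for ref in refs if ref in positions]
    return ways


coords = way_coordinates(io.BytesIO(SAMPLE_OSM))
print(coords)
```

This loads the whole file into memory, which is fine for a sketch but may not suit a large extract.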

#### Way Tag k:v pairs

k:v pairs of Tags on way.
Expected: a bunch of different keys describing different types of ways.
All keys should be indexable: check for problem keys.

Keys from tags shouldn't conflict with attribute keys from same way. How can we check this? Set check of way_attribute_key_list vs way_tag_key_list.

In [45]:
#list of way tag keys
way_tag_key_list = list(tag_k_v_dict['way'].keys())

# check for problem keys, conflict with way attribute keys, print all keys
print ('Problem Keys:')
print (l.check_keys_list(way_tag_key_list))
print ('Set check against way_attribute_key_list:')
print (set(way_tag_key_list)&set(way_attribute_key_list))
print ('list of all tag keys')
pprint (sorted(way_tag_key_list))

Problem Keys:
[]
Set check against way_attribute_key_list:
set()
list of all tag keys
['FIXME',
 'FIXME:access',
 'GeoBaseNHN:ACQTECH',
 'GeoBaseNHN:DatasetName',
 'GeoBaseNHN:PROVIDER',
 'GeoBaseNHN:UUID',
 'GeoBaseNHN:VALDATE',
 'NHS',
 'README',
 'abutters',
 'access',
 'accuracy:meters',
 'addr:city',
 'addr:country',
 'addr:full',
 'addr:housename',
 'addr:housename:zh',
 'addr:housenumber',
 'addr:interpolation',
 'addr:postcode',
 'addr:province',
 'addr:state',
 'addr:street',
 'aerialway',
 'aerialway:occupancy',
 'aeroway',
 'alt_name',
 'alt_name:fr',
 'alt_name:zh',
 'amenity',
 'area',
 'atm',
 'attribute',
 'attribution',
 'automated',
 'barrier',
 'basin',
 'bench',
 'bicycle',
 'boat',
 'boundary',
 'branch',
 'brand',
 'bridge',
 'building',
 'building:colour',
 'building:flats',
 'building:levels',
 'building:material',
 'building:min_level',
 'building:part',
 'buildingpart',
 'bus',
 'cables',
 'canvec:CODE',
 'canvec:UUID',
 'capacity',
 'capacity:disabled',
 'capa

    no problem keys
    no conflict with attribute keys
    interesting keys found:
    ['FIXME', 'FIXME:access', 'capacity:women', 'fixme', 'note', 'psv', 'to', 'type']
    
    address: ['addr:city', 'addr:country', 'addr:full', 'addr:housename', 'addr:housename:zh', 'addr:housenumber', 'addr:interpolation', 'addr:postcode', 'addr:province', 'addr:state', 'addr:street']

In [46]:
# print contents of interesting keys. (leave address to later section)
for key in ['FIXME', 'FIXME:access', 'capacity:women', 'fixme', 'note', 'psv', 'to', 'type']:
    print ('This is the contents of ' + key)
    pprint (sorted(tag_k_v_dict['way'][key]))

This is the contents of FIXME
['Check. May not exist the whole way any more',
 'Estimated. Needs a proper gps track',
 'FIXME',
 'Survey this parking lot more fully',
 "The correct name may be Arthur St. Please note that King's Highway 85 "
 'begins at and heads south from King St.',
 'abandoned?',
 'approximate',
 'check surface',
 'detail TBA',
 'divided',
 'does not match Bing imagery',
 'name (not in Statscan data)',
 'name?',
 'road moved',
 'surface',
 'unmapped private road',
 'yes']
This is the contents of FIXME:access
["access=official means it's officially open to all traffic - if not, use "
 'access=no or private']
This is the contents of capacity:women
['no']
This is the contents of fixme
['Does this waterway still exist?',
 'What is "ROC"?',
 'approximate',
 'boundary is very rough',
 'is this new or demolished?',
 'just added the pathes into the parking lot',
 'quickly drawn from Bing',
 'rough boundary',
 'separate park from woods',
 'sport type is not known',
 'stream',

    nothing seems to be out of order here
    
    fixme and FIXME should be standardized

### Relation

#### Relation attributes
Attributes of relation: Expected ['changeset', 'id', 'lat', 'lon', 'timestamp', 'uid', 'user', 'version']

All keys should be indexable: check for problem keys.



In [47]:
#list of relation attribute names (keys)
relation_attribute_key_list = list(atr_d['relation'].keys())

# check for problem keys
print (l.check_keys_list(relation_attribute_key_list))
print (sorted(relation_attribute_key_list))


[]
['changeset', 'id', 'timestamp', 'uid', 'user', 'version']


    again no lat or lon, like way

#### Relation Tag k:v pairs

k:v pairs of Tags on relation.
Expected: a bunch of different keys describing different types of relations.
All keys should be indexable: check for problem keys.

Keys from tags shouldn't conflict with attribute keys from same relation. How can we check this? Set check of relation_attribute_key_list vs relation_tag_key_list.

In [48]:
#list of relation tag keys
relation_tag_key_list = list(tag_k_v_dict['relation'].keys())

# check for problem keys, conflict with relation attribute keys, print all keys
print ('Problem Keys:')
print (l.check_keys_list(relation_tag_key_list))
print ('Set check against relation_attribute_key_list:')
print (set(relation_tag_key_list)&set(relation_attribute_key_list))
print ('list of all tag keys')
pprint (sorted(relation_tag_key_list))

Problem Keys:
[]
Set check against relation_attribute_key_list:
set()
list of all tag keys
['FIXME',
 'access',
 'addr:city',
 'addr:housenumber',
 'addr:street',
 'admin_level',
 'alt_name',
 'alt_name:fr',
 'alt_name:zh',
 'amenity',
 'area',
 'bicycle',
 'boundary',
 'boundary_type',
 'building',
 'building:buildyear',
 'ca_on_county',
 'ca_on_edr',
 'ca_on_trailblazer',
 'colour',
 'cycle_network',
 'description',
 'destination',
 'direction',
 'enforcement',
 'fee',
 'foot',
 'from',
 'highway',
 'horse',
 'is_in:state',
 'landuse',
 'leisure',
 'level',
 'level:usage',
 'lit',
 'modifier',
 'name',
 'name:GO',
 'name:GRT',
 'name:Greyhound',
 'name:de',
 'name:en',
 'name:es',
 'name:fr',
 'name:it',
 'name:ru',
 'name:zh',
 'natural',
 'network',
 'note',
 'note_1',
 'old_name',
 'opening_hours',
 'operator',
 'osmc:symbol',
 'parking',
 'public_transport',
 'ref',
 'restriction',
 'route',
 'short_turn',
 'site',
 'source',
 'sport',
 'state',
 'surface',
 'symbol',
 'to',
 'ty

    no problem keys
    no conflict with attribute keys
    interesting keys found:
    ['FIXME', 'note', 'note_1']
    
    address: ['addr:city', 'addr:housenumber', 'addr:street']

In [50]:
# print contents of interesting keys. (leave address to later section)
for key in ['FIXME', 'note', 'note_1']:
    print ('This is the contents of ' + key)
    pprint (sorted(tag_k_v_dict['relation'][key]))
    print ()

This is the contents of FIXME
['Confirm classifications such as "Road" and "Township". (e.g. This may be a '
 '"Concession".) See note and note_1 tags.']

This is the contents of note
['Be aware that some municipalities may be classifying roads to match '
 'another municipality or the previous jurisdiction of the road. (e.g. A '
 'city-status municipality may classify its roads with "Regional" or '
 '"County".)',
 'Davis Centre',
 'Unsigned 7000 Series Highway designation',
 'Waterloo University, Laurier University, Kitchener, Cambridge SmartCentres '
 '(westbound request stop drop-off only), Aberfoyle Park & Ride, Milton Park '
 '& Ride, Square One']

This is the contents of note_1
['Classifications are mentioned on pg. 114 of Book 8 (Volume 1, May 2010) of '
 'the Ontario Traffic Manual.']



    note and note_1 should be standardized
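
One way to standardize them would be to collect every `note`-style key into a single list. The sketch below assumes a hypothetical target key `notes` and a hypothetical helper name; neither is fixed anywhere else in this project.

```python
import re

# Matches 'note' and numbered variants such as 'note_1'.
NOTE_KEY = re.compile(r"^note(_\d+)?$")


def merge_notes(tags):
    """Return a copy of `tags` with note, note_1, ... gathered into one list."""
    cleaned = {}
    notes = []
    # Sorting the keys puts 'note' ahead of 'note_1'.
    for key in sorted(tags):
        if NOTE_KEY.match(key):
            notes.append(tags[key])
        else:
            cleaned[key] = tags[key]
    if notes:
        cleaned["notes"] = notes
    return cleaned


example = {"note_1": "second note", "note": "first note", "type": "route"}
print(merge_notes(example))
```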

### Address
Check the ends of addresses.
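
A common way to audit address endings is to pull the last word off each `addr:street` value and group the ones that are not on an expected list. The sketch below does that. The expected set and the sample street names are illustrative values I picked, not results from this dataset.

```python
import re
from collections import defaultdict

# Last whitespace-delimited token of the street name.
STREET_ENDING_RE = re.compile(r"\b\S+\.?$")

# Illustrative list of acceptable endings; adjust after looking at real data.
EXPECTED_ENDINGS = {"Street", "Avenue", "Road", "Drive", "Court",
                    "Crescent", "Boulevard", "Place", "Lane", "Way"}


def audit_street_endings(street_names, expected=EXPECTED_ENDINGS):
    """Group street names by their ending when the ending is not expected."""
    unexpected = defaultdict(set)
    for name in street_names:
        match = STREET_ENDING_RE.search(name.strip())
        if match and match.group() not in expected:
            unexpected[match.group()].add(name)
    return dict(unexpected)


# Hypothetical values, used only to exercise the function.
sample_streets = ["King Street North", "University Ave", "Erb St.", "Weber Street"]
report = audit_street_endings(sample_streets)
print(report)
```

Endings such as a compass direction would be flagged by this check even though they may be perfectly valid, so the report is a starting point for review and not a list of errors.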

### Potential Problem areas
#### address

## Make a plan for how to store the data
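
As a starting point for the plan, here is a sketch of how one element might be reshaped into a nested dictionary suitable for a document store. The field names (`created`, `pos`, `address`, `tags`, `node_refs`) and the function name are assumptions, not a settled schema.

```python
import xml.etree.ElementTree as ET

# Bookkeeping attributes shared by node, way and relation.
CREATED_FIELDS = ("changeset", "timestamp", "uid", "user", "version")


def shape_element(elem):
    """Turn a node, way or relation element into a nested dict, else None."""
    if elem.tag not in ("node", "way", "relation"):
        return None
    doc = {
        "type": elem.tag,
        "id": elem.get("id"),
        "created": {k: elem.get(k) for k in CREATED_FIELDS if elem.get(k) is not None},
    }
    # Only nodes carry coordinates.
    if elem.get("lat") is not None and elem.get("lon") is not None:
        doc["pos"] = [float(elem.get("lat")), float(elem.get("lon"))]
    for tag in elem.iter("tag"):
        key, value = tag.get("k"), tag.get("v")
        if key.startswith("addr:"):
            doc.setdefault("address", {})[key[len("addr:"):]] = value
        else:
            # Kept in a sub-dict so a tag named 'type' cannot clash with doc['type'].
            doc.setdefault("tags", {})[key] = value
    node_refs = [nd.get("ref") for nd in elem.iter("nd")]
    if node_refs:
        doc["node_refs"] = node_refs
    return doc


# Made-up element, for illustration only.
sample_node = ET.fromstring(
    '<node id="1" lat="43.46" lon="-80.52" user="someone" version="1">'
    '<tag k="addr:street" v="King Street North"/>'
    '<tag k="amenity" v="cafe"/></node>'
)
shaped = shape_element(sample_node)
print(shaped)
```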



### 1. Problems Encountered in the Map
Student response describes the challenges encountered while auditing, fixing and processing the dataset for the area of their choice. Some of the problems encountered during data audit are cleaned programmatically.


Student response shows understanding of the process of auditing, and ways to correct or standardize the data, including dealing with problems specific to the location, e.g. related to language or traditional ways of formatting.

### 2. Data Overview
Student provides a statistical overview about their chosen dataset, like:

    size of the file
    number of unique users
    number of nodes and ways
    number of chosen type of nodes, like cafes, shops etc
    
Student response provides the statistics about their chosen map area.

Student response also includes the MongoDB queries used to obtain the statistics.

### 3. Additional Ideas
Other ideas about the datasets

Student is able to analyze the dataset and recognize opportunities for using it in other projects

Student proposes one or more additional ways of improving and analyzing the data and gives thoughtful discussion about the benefits and anticipated problems in implementing the improvement.

### Code Review:
#### Code Functionality

All required Lesson 6 problems are solved correctly with the submitted code. Final project code functionality reflects the description in the project document.

#### Code Readability

Final project code is well structured.

Final project code is commented as necessary.
Final project code follows an intuitive, easy-to-follow logical structure.

Final project code that is not intuitively readable is well-documented with comments.

### Thoroughness and Succinctness of Submission

Student submission is long enough to thoroughly answer the questions asked without giving unnecessary detail.
A good general guideline is that your question responses should take about 3-6 pages.