# Wrangling OSM data of Waterloo, ON, Canada
## By @IanEdington



### Map Area: Region of Waterloo, ON, Canada
    https://www.openstreetmap.org/relation/2062154
    https://www.openstreetmap.org/relation/2062153

### References used during this project
    https://www.udacity.com/course/viewer#!/c-ud032-nd/l-760758686/m-817328934
    https://docs.python.org/3/library/xml.etree.elementtree.html
    https://docs.python.org/2/library/re.html
    http://stackoverflow.com/questions/5029934/python-defaultdict-of-defaultdict
    http://stackoverflow.com/questions/16614648/canadian-postal-code-regex


In [1]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from pprint import pprint
import re
from importlib import reload

#-- show plots in notebook
%matplotlib inline

#-- Import wrangling functions using my lasso
import Lasso as l

#-- XML file name
osmfile = 'waterloo-OSM-data.osm'
osmsample ='waterloo-OSM-sample.osm'

In [9]:
# Reloading my lasso whenever it gets low
l = reload(l) #assign to l so it stops showing the module output

## Understanding the data
Getting an idea of what is inside the chosen area by looking at the possible values for each tag type.

In [15]:
atr_d, st_atr_d, s_st_d, tag_k_v_dict = l.summarizes_data_2_tags_deep(osmfile)

In [16]:
# types of top level tags:
# Expected [node, way, relation]
pprint (sorted(atr_d.keys()))

['member', 'meta', 'nd', 'node', 'osm', 'relation', 'tag', 'way']


###### Why are member, nd & tag listed as top-level tags?
It looks like all start tags were analysed, not just the top-level ones. This won't change our analysis; it will actually be useful to see how tag is used as a child of different tags.
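The difference between counting direct children of the root and counting every start tag can be reproduced on a tiny document. This is only an illustrative sketch, not the `Lasso` implementation: the inline XML and the names `SAMPLE`, `top_level` and `all_tags` are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature OSM document, used only for illustration.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="43.46" lon="-80.52"><tag k="amenity" v="cafe"/></node>
  <way id="2"><nd ref="1"/><tag k="highway" v="residential"/></way>
  <relation id="3"><member type="way" ref="2" role=""/></relation>
</osm>"""

root = ET.fromstring(SAMPLE)

# Direct children of <osm> only: the true top-level elements.
top_level = Counter(child.tag for child in root)

# Every element in the tree (root included), which is what
# counting all start tags yields.
all_tags = Counter(elem.tag for elem in root.iter())

print(sorted(top_level))
print(sorted(all_tags))
```

The second list contains `member`, `nd`, `tag` and `osm` in addition to the three top-level types, which matches the shape of the output above.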

###### What about &lt;osm&gt;?
Only one osm element in set:<br/>
&lt;osm version="0.6" generator="Overpass API"&gt;

###### What about &lt;note&gt;?
Only one note element in set:<br/>
&lt;note&gt; The data included in this document is from www.openstreetmap.org. The data is made available under ODbL. &lt;/note&gt;

###### What about &lt;meta&gt;?
Only one meta element in set:<br/>
&lt;meta osm_base="2015-07-16T03:14:03Z"/&gt;

## Focus on node, way, relation

### Understand the difference between them

In [5]:
s_st_d_2 = {x:y for k,v in s_st_d.items() if k != 'osm' for x,y in v.items()}
pprint(s_st_d_2)

{'node': {'tag'}, 'relation': {'tag', 'member'}, 'way': {'tag', 'nd'}}


    All three have 'tag' tags.
    Ways have 'nd' tags.
    Relations have 'member' tags.
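A parent-to-children summary like the one above could be computed as follows. This is a minimal sketch under the assumption that the whole document fits in memory; `child_tags_by_parent` and `SAMPLE` are hypothetical names, not part of `Lasso`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature OSM document, used only for illustration.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="43.46" lon="-80.52"><tag k="amenity" v="cafe"/></node>
  <way id="2"><nd ref="1"/><tag k="highway" v="residential"/></way>
  <relation id="3">
    <member type="way" ref="2" role=""/>
    <tag k="type" v="route"/>
  </relation>
</osm>"""


def child_tags_by_parent(root):
    """Map each top-level tag name to the set of its direct child tag names."""
    children = defaultdict(set)
    for parent in root:
        for child in parent:
            children[parent.tag].add(child.tag)
    return dict(children)


result = child_tags_by_parent(ET.fromstring(SAMPLE))
print(result)
```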

### nodes
#### node attributes
Attributes of node: Expected [id, lat, lon, ...]
Address keys should be here.
All keys should be indexable: check for problem keys.
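The body of `l.check_keys_list` is not shown in this notebook. A minimal sketch of such a check, assuming "problem keys" means keys containing characters that are awkward as field names in a document store, might look like this. The pattern `PROBLEM_CHARS` and the function name `find_problem_keys` are assumptions made for illustration.

```python
import re

# Assumed pattern: characters that tend to cause trouble in field names
# (whitespace, dots, dollar signs, quotes and similar punctuation).
PROBLEM_CHARS = re.compile(r"[=+/&<>;'\"?%#$@,. \t\r\n]")


def find_problem_keys(keys):
    """Return the keys that contain at least one problem character."""
    return [key for key in keys if PROBLEM_CHARS.search(key)]


node_attribute_keys = ['changeset', 'id', 'lat', 'lon',
                       'timestamp', 'uid', 'user', 'version']
print(find_problem_keys(node_attribute_keys))
print(find_problem_keys(['addr:street', 'bad key', 'a.b']))
```

Colons are deliberately left out of the pattern so that namespaced keys such as `addr:street` pass the check.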

In [6]:
#list of node attribute names (keys)
node_attribute_key_list = list(atr_d['node'].keys())

# check for problem keys
print (l.check_keys_list(node_attribute_key_list))
print (sorted(node_attribute_key_list))

[]
['changeset', 'id', 'lat', 'lon', 'timestamp', 'uid', 'user', 'version']


    No problem chars in node attribute keys.
    Seems interesting that these are the only attributes.
    Everything else must be kept in the tags.

#### Node Tag k:v pairs

k:v pairs of Tags on node.
Expected: a bunch of different keys describing different types of nodes.
All keys should be indexable: check for problem keys.

Keys from tags shouldn't conflict with attribute keys from same node. How can we check this? Set check of node_attribute_key_list vs node_tag_key_list.

In [7]:
#list of node tag names (keys)
node_tag_key_list = list(tag_k_v_dict['node'].keys())

# check for problem keys, conflict with node attribute keys, print all keys
print ('Problem Keys:')
print (l.check_keys_list(node_tag_key_list))
print ('Set check against node_attribute_key_list:')
print (set(node_tag_key_list)&set(node_attribute_key_list))
print ('list of all tag keys')
pprint (sorted(node_tag_key_list))

Problem Keys:
[]
Set check against node_attribute_key_list:
set()
list of all tag keys
['FIXME',
 'access',
 'addr:city',
 'addr:country',
 'addr:housename',
 'addr:housenumber',
 'addr:interpolation',
 'addr:postcode',
 'addr:province',
 'addr:state',
 'addr:street',
 'addr:unit',
 'administrative',
 'aerialway',
 'aeroway',
 'alcohol',
 'alt_name',
 'amenity',
 'artist',
 'artist_name',
 'artwork_type',
 'atm',
 'automated',
 'backrest',
 'barrier',
 'beauty',
 'bench',
 'bicycle',
 'bicycle_parking',
 'bin',
 'board_type',
 'books',
 'booth',
 'bottle',
 'brand',
 'building',
 'building:levels',
 'built',
 'bus',
 'button',
 'button_operated',
 'canvec:UUID',
 'capacity',
 'car',
 'clothes',
 'colour',
 'contact:phone',
 'contact:website',
 'content',
 'contents',
 'covered',
 'craft',
 'created_by',
 'crossing',
 'crossing:barrier',
 'crossing:bell',
 'crossing:light',
 'cuisine',
 'currency:CAD',
 'cycleway',
 'dbh_cm',
 'denomination',
 'description',
 'designation',
 'destinatio

    Observations:
        1. No problem chars in node tag keys.
        2. There doesn't seem to be overlap between the two key lists
        3. "addr:" fields are only one deep (ie. no extra :'s )

    Interesting keys:
    *    Stand alone: ['FIXME', 'place', 'note', 'fixme', 'dbh_cm']
    *    Address:

    ['addr:city', 'addr:country', 'addr:housename', 'addr:housenumber', 'addr:interpolation', 'addr:postcode', 'addr:province', 'addr:state', 'addr:street', 'addr:unit',]

#### Check interesting keys for problems

In [8]:
# print contents of interesting keys. (leave address to later section)
for key in ['FIXME', 'place', 'note', 'fixme', 'dbh_cm']:
    print ('This is the contents of ' + key)
    pprint (tag_k_v_dict['node'][key])

This is the contents of FIXME
{'believe this is further E. To be checked',
 'construction continues. Check early 2010',
 'not so sure about this corner',
 'split?',
 'survey'}
This is the contents of place
{'village', 'neighbourhood', 'locality', 'city', 'suburb', 'hamlet'}
This is the contents of note
{'2x',
 '71.73.77.80.83.84.89.90.93,95.94.99.100.105.,109.110.113.114.119.120.121.124.132.136',
 'Archaeological and Historic Sites Board of Ontario plaque',
 'Erected by the United Rubber Workers',
 'FIXME: Correct designation for an historic building???',
 'Full Serve',
 'Historic Sites and Monuments Board of Canada plaque',
 'IRA Needles',
 'Iron Horse Trail',
 'Kitchener City Hall',
 'Kitchener Market',
 'Level 2 charger',
 'Men, Women, Family Washroom',
 'New transit hub (proposed)',
 'Permanent barrier blocks vehicles. Road continues',
 'Regional road number not referenced on exit signs',
 'Rental property',
 'Route 7D',
 'Sells Bitcoin: tinkercoin.com/dvlb',
 'Still used, but only

    'FIXME': marks elements that warrant a closer look
    'place': just another location descriptor
    'note': lots of notes, one fixme starting with 'FIXME:'
    'fixme': marks elements that warrant a closer look
    'dbh_cm': only once, seems to be in reference to a tree

    FIXME and fixme should be merged under the same key
    FIXME should be taken out of the notes section
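One way to apply both clean-ups to a single element's tags is sketched below. It is a hypothetical helper, not the code used in `Lasso`; in particular, joining several messages with `'; '` is an assumption made for this example.

```python
def normalize_fixme(tags):
    """Return a copy of `tags` with 'fixme' folded into 'FIXME' and any
    note that starts with 'FIXME:' moved under the 'FIXME' key."""
    cleaned = dict(tags)
    messages = []

    # Collect the values stored under either spelling of the key.
    for key in ('FIXME', 'fixme'):
        if key in cleaned:
            messages.append(cleaned.pop(key))

    # Move a note that is really a fixme, dropping the 'FIXME:' prefix.
    note = cleaned.get('note', '')
    if note.startswith('FIXME:'):
        messages.append(cleaned.pop('note')[len('FIXME:'):].strip())

    if messages:
        # Assumption: multiple messages are joined into one string.
        cleaned['FIXME'] = '; '.join(messages)
    return cleaned


print(normalize_fixme({'fixme': 'survey', 'name': 'Iron Horse Trail'}))
```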

### Ways

#### Way attributes
Attributes of way: Expected ['changeset', 'id', 'lat', 'lon', 'timestamp', 'uid', 'user', 'version']

All keys should be indexable: check for problem keys.



In [9]:
#list of way attribute names (keys)
way_attribute_key_list = list(atr_d['way'].keys())

# check for problem keys
print (l.check_keys_list(way_attribute_key_list))
print (sorted(way_attribute_key_list))


[]
['changeset', 'id', 'timestamp', 'uid', 'user', 'version']


    Interesting: **no lat & lon for way.** All other attributes are the same as for node.

#### Way Tag k:v pairs

k:v pairs of Tags on way.
Expected: a bunch of different keys describing different types of ways.
All keys should be indexable: check for problem keys.

Keys from tags shouldn't conflict with attribute keys from same way. How can we check this? Set check of way_attribute_key_list vs way_tag_key_list.

In [10]:
#list of way tag names (keys)
way_tag_key_list = list(tag_k_v_dict['way'].keys())

# check for problem keys, conflict with way attribute keys, print all keys
print ('Problem Keys:')
print (l.check_keys_list(way_tag_key_list))
print ('Set check against way_attribute_key_list:')
print (set(way_tag_key_list)&set(way_attribute_key_list))
print ('list of all tag keys')
pprint (sorted(way_tag_key_list))

Problem Keys:
[]
Set check against way_attribute_key_list:
set()
list of all tag keys
['FIXME',
 'FIXME:access',
 'GeoBaseNHN:ACQTECH',
 'GeoBaseNHN:DatasetName',
 'GeoBaseNHN:PROVIDER',
 'GeoBaseNHN:UUID',
 'GeoBaseNHN:VALDATE',
 'NHS',
 'README',
 'abutters',
 'access',
 'accuracy:meters',
 'addr:city',
 'addr:country',
 'addr:full',
 'addr:housename',
 'addr:housename:zh',
 'addr:housenumber',
 'addr:interpolation',
 'addr:postcode',
 'addr:province',
 'addr:state',
 'addr:street',
 'aerialway',
 'aerialway:occupancy',
 'aeroway',
 'alt_name',
 'alt_name:fr',
 'alt_name:zh',
 'amenity',
 'area',
 'atm',
 'attribute',
 'attribution',
 'automated',
 'barrier',
 'basin',
 'bench',
 'bicycle',
 'boat',
 'boundary',
 'branch',
 'brand',
 'bridge',
 'building',
 'building:colour',
 'building:flats',
 'building:levels',
 'building:material',
 'building:min_level',
 'building:part',
 'buildingpart',
 'bus',
 'cables',
 'canvec:CODE',
 'canvec:UUID',
 'capacity',
 'capacity:disabled',
 'capa

    no problem keys
    no conflict with attribute keys
    interesting keys found:
    ['FIXME', 'FIXME:access', 'capacity:women', 'fixme', 'note', 'psv', 'to', 'type']
    
    address: ['addr:city', 'addr:country', 'addr:full', 'addr:housename', 'addr:housename:zh', 'addr:housenumber', 'addr:interpolation', 'addr:postcode', 'addr:province', 'addr:state', 'addr:street']

In [11]:
# print contents of interesting keys. (leave address to later section)
for key in ['FIXME', 'FIXME:access', 'capacity:women', 'fixme', 'note', 'psv', 'to', 'type']:
    print ('This is the contents of ' + key)
    pprint (sorted(tag_k_v_dict['way'][key]))

This is the contents of FIXME
['Check. May not exist the whole way any more',
 'Estimated. Needs a proper gps track',
 'FIXME',
 'Survey this parking lot more fully',
 "The correct name may be Arthur St. Please note that King's Highway 85 "
 'begins at and heads south from King St.',
 'abandoned?',
 'approximate',
 'check surface',
 'detail TBA',
 'divided',
 'does not match Bing imagery',
 'name (not in Statscan data)',
 'name?',
 'road moved',
 'surface',
 'unmapped private road',
 'yes']
This is the contents of FIXME:access
["access=official means it's officially open to all traffic - if not, use "
 'access=no or private']
This is the contents of capacity:women
['no']
This is the contents of fixme
['Does this waterway still exist?',
 'What is "ROC"?',
 'approximate',
 'boundary is very rough',
 'is this new or demolished?',
 'just added the pathes into the parking lot',
 'quickly drawn from Bing',
 'rough boundary',
 'separate park from woods',
 'sport type is not known',
 'stream',

    nothing seems to be out of order here
    
    fixme and FIXME should be standardized

#### nd tag

In [33]:
print (st_atr_d['way']['nd'].keys())
for v in st_atr_d['way']['nd']['ref']:
    try:
        int(v)
    except ValueError:
        print ('not an int: ' + v)

dict_keys(['ref'])


    'nd' is always an int -> keep in array of ints
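Keeping the references as an array of ints could look like the sketch below. The `<way>` snippet and the function name `node_refs` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical way element with two node references.
WAY = ET.fromstring(
    '<way id="2"><nd ref="10"/><nd ref="11"/>'
    '<tag k="highway" v="path"/></way>'
)


def node_refs(way):
    """Collect the 'ref' attribute of every <nd> child as an int, in order."""
    return [int(nd.attrib['ref']) for nd in way.iter('nd')]


print(node_refs(WAY))
```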

### Relation

#### Relation attributes
Attributes of relation: Expected ['changeset', 'id', 'lat', 'lon', 'timestamp', 'uid', 'user', 'version']

All keys should be indexable: check for problem keys.



In [12]:
#list of relation attribute names (keys)
relation_attribute_key_list = list(atr_d['relation'].keys())

# check for problem keys
print (l.check_keys_list(relation_attribute_key_list))
print (sorted(relation_attribute_key_list))


[]
['changeset', 'id', 'timestamp', 'uid', 'user', 'version']


    again no lat or lon, like way

#### Relation Tag k:v pairs

k:v pairs of Tags on relation.
Expected: a bunch of different keys describing different types of relations.
All keys should be indexable: check for problem keys.

Keys from tags shouldn't conflict with attribute keys from same relation. How can we check this? Set check of relation_attribute_key_list vs relation_tag_key_list.

In [13]:
#list of relation tag names (keys)
relation_tag_key_list = list(tag_k_v_dict['relation'].keys())

# check for problem keys, conflict with relation attribute keys, print all keys
print ('Problem Keys:')
print (l.check_keys_list(relation_tag_key_list))
print ('Set check against relation_attribute_key_list:')
print (set(relation_tag_key_list)&set(relation_attribute_key_list))
print ('list of all tag keys')
pprint (sorted(relation_tag_key_list))

Problem Keys:
[]
Set check against relation_attribute_key_list:
set()
list of all tag keys
['FIXME',
 'access',
 'addr:city',
 'addr:housenumber',
 'addr:street',
 'admin_level',
 'alt_name',
 'alt_name:fr',
 'alt_name:zh',
 'amenity',
 'area',
 'bicycle',
 'boundary',
 'boundary_type',
 'building',
 'building:buildyear',
 'ca_on_county',
 'ca_on_edr',
 'ca_on_trailblazer',
 'colour',
 'cycle_network',
 'description',
 'destination',
 'direction',
 'enforcement',
 'fee',
 'foot',
 'from',
 'highway',
 'horse',
 'is_in:state',
 'landuse',
 'leisure',
 'level',
 'level:usage',
 'lit',
 'modifier',
 'name',
 'name:GO',
 'name:GRT',
 'name:Greyhound',
 'name:de',
 'name:en',
 'name:es',
 'name:fr',
 'name:it',
 'name:ru',
 'name:zh',
 'natural',
 'network',
 'note',
 'note_1',
 'old_name',
 'opening_hours',
 'operator',
 'osmc:symbol',
 'parking',
 'public_transport',
 'ref',
 'restriction',
 'route',
 'short_turn',
 'site',
 'source',
 'sport',
 'state',
 'surface',
 'symbol',
 'to',
 'ty

    no problem keys
    no conflict with attribute keys
    interesting keys found:
    ['FIXME', 'note', 'note_1']
    
    address: ['addr:city', 'addr:housenumber', 'addr:street']

In [14]:
# print contents of interesting keys. (leave address to later section)
for key in ['FIXME', 'note', 'note_1']:
    print ('This is the contents of ' + key)
    pprint (sorted(tag_k_v_dict['relation'][key]))
    print ()

This is the contents of FIXME
['Confirm classifications such as "Road" and "Township". (e.g. This may be a '
 '"Concession".) See note and note_1 tags.']

This is the contents of note
['Be aware that some municipalities may be classifying roads to match '
 'another municipality or the previous jurisdiction of the road. (e.g. A '
 'city-status municipality may classify its roads with "Regional" or '
 '"County".)',
 'Davis Centre',
 'Unsigned 7000 Series Highway designation',
 'Waterloo University, Laurier University, Kitchener, Cambridge SmartCentres '
 '(westbound request stop drop-off only), Aberfoyle Park & Ride, Milton Park '
 '& Ride, Square One']

This is the contents of note_1
['Classifications are mentioned on pg. 114 of Book 8 (Volume 1, May 2010) of '
 'the Ontario Traffic Manual.']



    note and note_1 should be standardized

#### member tags

In [35]:
print (st_atr_d['relation']['member'].keys())

dict_keys(['type', 'ref', 'role'])


In [42]:
for key, val in st_atr_d['relation']['member'].items():
    print (key +' values')
    print (list(val)[0:100], '\n')

type values
['relation', 'way', 'node'] 

ref values
['41017195', '4000832', '41065477', '86807706', '3928500', '86805375', '309123326', '124007420', '4040114', '10919534', '217650820', '265236737', '144358794', '4710032', '28294310', '305206499', '35525167', '39971804', '1930995349', '210604729', '39551063', '41017645', '27872897', '42916030', '26977406', '27934219', '331048665', '13858766', '39547537', '39971717', '41809912', '143865910', '35526560', '325049746', '35525037', '39971863', '39971742', '28294534', '39458399', '268204209', '39547540', '202029543', '42850295', '286671333', '182704478', '132600396', '193612579', '4000816', '42687476', '35509160', '23400166', '33765238', '67379274', '35513929', '125362471', '42239407', '302977962', '193622375', '39457459', '26240578', '68594567', '263914724', '350498959', '41065492', '41841481', '19888167', '341890783', '312021813', '190133360', '26456337', '40253736', '33859800', '53538375', '2043274989', '147712920', '123890597', '31977249

    'member': [{'type': sub_tag.attrib.get('type'),
                'ref':  sub_tag.attrib.get('ref'),
                'role': sub_tag.attrib.get('role')},
               {...},
               {...}]
               
    role of "" -> None
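A runnable version of this plan might look as follows. It is a sketch only: the function name `members` and the sample relation are hypothetical, and converting `ref` to an int is an assumption made to match the treatment of `nd` references.

```python
import xml.etree.ElementTree as ET


def members(relation):
    """Build one dict per <member> child; an empty role becomes None."""
    result = []
    for member in relation.iter('member'):
        result.append({
            'type': member.attrib.get('type'),
            'ref': int(member.attrib['ref']),   # assumption: store refs as ints
            'role': member.attrib.get('role') or None,
        })
    return result


# Hypothetical relation reusing two ref values from the output above.
RELATION = ET.fromstring(
    '<relation id="3">'
    '<member type="way" ref="41017195" role=""/>'
    '<member type="node" ref="4000832" role="stop"/>'
    '</relation>'
)
print(members(RELATION))
```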

### Address

All address elements to check:
* 'addr:state'
* 'addr:interpolation'
* 'addr:city'
* 'addr:street'
* 'addr:postcode'
* 'addr:housename:zh'
* 'addr:unit'
* 'addr:country'
* 'addr:housenumber'
* 'addr:province'
* 'addr:full'
* 'addr:housename'

#### Street endings: 'addr:street'

In [15]:
street_types = l.process_audit_address_type(tag_k_v_dict)
pprint (sorted(list(street_types)))

['154',
 'AVenue',
 'Avenue',
 'Boardwalk',
 'Boulevard',
 'Circle',
 'Court',
 'Crescent',
 'Cresent',
 'Cross',
 'Dr',
 'Dr.',
 'Drive',
 'E',
 'East',
 'Forwell',
 'Gate',
 'Lane',
 'Line',
 'Lodge',
 'N',
 'North',
 'Parkway',
 'Place',
 'Rd',
 'Road',
 'Sideroad',
 'Sirling',
 'South',
 'St',
 'St.',
 'Steet',
 'Street',
 'Townline',
 'Trail',
 'W',
 'Walk',
 'Way',
 'West',
 'tdcanadatrust.com']


    lots of E, East, W, ...
    
    Check second last word if last word is in
    ['N', 'North', 'E', 'East', 'South', 'West', 'W',]

In [16]:
street_directions = ['N', 'North', 'E', 'East', 'South', 'West', 'W',]
street_types_no_directions = l.process_audit_address_type(
        tag_k_v_dict, directions=street_directions)
pprint (sorted(list(street_types_no_directions)))
print(street_types - street_types_no_directions)

['154',
 'AVenue',
 'Ave',
 'Avenue',
 'Boardwalk',
 'Boulevard',
 'Circle',
 'Court',
 'Crescent',
 'Cresent',
 'Cross',
 'Dr',
 'Dr.',
 'Drive',
 'Forwell',
 'Gate',
 'Lane',
 'Line',
 'Lodge',
 'Parkway',
 'Place',
 'Rd',
 'Road',
 'Sideroad',
 'Sirling',
 'St',
 'St.',
 'Steet',
 'Street',
 'Townline',
 'Trail',
 'Walk',
 'Way',
 'tdcanadatrust.com']
{'West', 'East', 'E', 'South', 'W', 'North', 'N'}


    Map should be made to transform:
        S, s to South
        E, e to East
        W, w to West
        N, n to North
        
    Somehow check second level when these are found and replace:
        'AVenue', 'Ave', 'Avenue'
        'Crescent', 'Cresent',
        'Dr', 'Dr.', 'Drive',
        'Rd', 'Road',
        'St', 'St.', 'Steet', 'Street',
    
    tdcanadatrust shouldn't be an address
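The two-step idea (expand a trailing direction, then fix the street type in front of it) can be sketched as below. The mappings are taken from the values listed above; the function name `clean_street` and the exact matching rules are assumptions, not the `Lasso` implementation.

```python
DIRECTIONS = {'N': 'North', 'S': 'South', 'E': 'East', 'W': 'West'}
STREET_TYPES = {
    'AVenue': 'Avenue', 'Ave': 'Avenue',
    'Cresent': 'Crescent',
    'Dr': 'Drive', 'Dr.': 'Drive',
    'Rd': 'Road',
    'St': 'Street', 'St.': 'Street', 'Steet': 'Street',
}


def clean_street(name):
    """Expand a trailing direction and normalise the street type before it."""
    words = name.split()
    if not words:
        return name

    # Expand a one-letter direction such as 'N', 'n' or 'N.'.
    key = words[-1].rstrip('.').upper()
    if len(words) > 1 and key in DIRECTIONS:
        words[-1] = DIRECTIONS[key]

    # If the last word is a direction, the street type is the word before it.
    type_index = len(words) - 1
    if len(words) > 1 and words[-1] in DIRECTIONS.values():
        type_index -= 1

    words[type_index] = STREET_TYPES.get(words[type_index], words[type_index])
    return ' '.join(words)


for raw in ['King St. N', 'Weber Steet', 'University AVenue w']:
    print(raw, '->', clean_street(raw))
```

Values such as `tdcanadatrust.com` are left untouched by this sketch and still need to be handled separately.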

#### Postal Codes: 'addr:postcode'

In [17]:
RE_POSTAL_CODE = re.compile(r"^([a-zA-Z]\d[a-zA-Z]( )?\d[a-zA-Z]\d)$")
all_postal_codes = l.wrap_up_tag_k_v_dict(tag_k_v_dict, 'addr:postcode')

for postal in list(all_postal_codes):
    if not RE_POSTAL_CODE.search(postal):
        print (postal)


    all postal codes are formatted correctly
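The pattern accepts upper and lower case letters and an optional space, so codes that pass the audit can still differ in form. A possible normalisation to the `A1A 1A1` form is sketched below; the two-group variant of the regex and the name `normalize_postal_code` are assumptions made for this example.

```python
import re

# Variant of the audit pattern with the two halves captured separately.
RE_POSTAL_PARTS = re.compile(r"^([a-zA-Z]\d[a-zA-Z]) ?(\d[a-zA-Z]\d)$")


def normalize_postal_code(raw):
    """Return the code as 'A1A 1A1', or None if it does not match."""
    match = RE_POSTAL_PARTS.match(raw.strip())
    if match is None:
        return None
    return (match.group(1) + ' ' + match.group(2)).upper()


for code in ['n2l3g1', 'N2B 3N1', 'N2L-3G1']:
    print(repr(code), '->', normalize_postal_code(code))
```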

#### state, interpolation, city, country, province:

In [18]:
elements_to_check = ['addr:state', 'addr:interpolation',
                     'addr:city', 'addr:country', 'addr:province']

for a in elements_to_check:
    print ('Address part: '+a)
    print (sorted(l.wrap_up_tag_k_v_dict(tag_k_v_dict, a)))

Address part: addr:state
['ON']
Address part: addr:interpolation
['228', 'all', 'even', 'odd', 'yes']
Address part: addr:city
['Bloomingdale', 'Breslau', 'Cambridge', 'City of Cambridge', 'City of Kitchener', 'City of Waterloo', 'Kitchener', 'New Dundee', 'Petersburg', 'Saint Agatha', 'St. Agatha', 'St. Petersburg', 'Township of Guelph/Eramosa', 'Township of North Dumfries', 'Township of Wellesley', 'Township of Wilmot', 'Township of Woolwich', 'Waterloo', 'Wilmot', 'kitchener', 'waterloo']
Address part: addr:country
['CA']
Address part: addr:province
['ON', 'Ontario', 'on', 'ontario']


    state should be empty
        if province is populated:
            disregard state
        else:
            assign state to province
    
    province should be all 'Ontario'
    
    interpolation Yes is not a valid value
        Add 'FIXME' : 'Yes is not a valid entry for addr:interpolation'
    
    city:
        'City of Cambridge' -> 'Cambridge'
        'City of Kitchener' -> 'Kitchener'
        'kitchener' -> 'Kitchener'
        'City of Waterloo' -> 'Waterloo'
        'waterloo' -> 'Waterloo'
        'St. Agatha' -> 'Saint Agatha'
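These rules for state, province and city could be applied to one address dictionary as sketched below. `clean_address`, `CITY_MAP` and `PROVINCE_MAP` are hypothetical names, and the input is assumed to use the keys without the `addr:` prefix.

```python
CITY_MAP = {
    'City of Cambridge': 'Cambridge',
    'City of Kitchener': 'Kitchener',
    'kitchener': 'Kitchener',
    'City of Waterloo': 'Waterloo',
    'waterloo': 'Waterloo',
    'St. Agatha': 'Saint Agatha',
}
PROVINCE_MAP = {'on': 'Ontario', 'ontario': 'Ontario'}


def clean_address(addr):
    """Return a cleaned copy of an address dict (keys without 'addr:')."""
    cleaned = dict(addr)

    # 'state' is dropped; it only fills in 'province' when that is missing.
    state = cleaned.pop('state', None)
    if state and 'province' not in cleaned:
        cleaned['province'] = state

    if 'province' in cleaned:
        province = cleaned['province']
        cleaned['province'] = PROVINCE_MAP.get(province.lower(), province)

    if 'city' in cleaned:
        cleaned['city'] = CITY_MAP.get(cleaned['city'], cleaned['city'])
    return cleaned


print(clean_address({'city': 'City of Kitchener', 'state': 'ON'}))
```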

#### addr:housename:zh, addr:housenumber, addr:full, addr:housename

In [19]:
elements_to_check = ['addr:housename:zh', 'addr:housenumber',
                     'addr:full', 'addr:housename']

for a in elements_to_check:
    print ('\nAddress part: '+a)
    print (sorted(l.wrap_up_tag_k_v_dict(tag_k_v_dict, a)))


Address part: addr:housename:zh
['学生生活中心']

Address part: addr:housenumber
['0', '1', '1-255', '1-399', '1-50', '10', '10-50', '100', '100 1/2', '1000', '1001', '101', '1014', '102', '1027', '103', '1030', '1031', '1032', '1033', '1035', '104', '1042', '1048', '1049', '105', '1050', '1051', '1058', '106', '1065', '107', '108', '1080', '109', '1095', '1097', '11', '11-50', '110', '1100', '1102', '1103', '1104', '1105', '111', '1110', '1111', '1113', '1118', '1119', '112', '1120', '1121', '1127', '1128', '1129', '113', '1132', '1133', '1134', '1135', '1138', '114', '1142', '1143', '1144', '1145', '115', '1151', '116', '1166', '1169', '117', '1171', '1172', '1173', '1174', '1175', '1176', '1177', '1178', '1179', '118', '1180', '1182', '1183', '1184', '1187', '1188', '119', '1191', '1198', '1199', '119A', '119B', '119C', '12', '12 - 370', '12-50', '120', '1200', '1201', '121', '1211 ', '1214', '1218', '1219', '122', '1222', '1224', '1225', '1226', '1227', '123', '1235', '124', '1248', '12

    addr:full is only used once; it will be included in the data.
    Everything else seems normal.

## Make a plan for how to store the data



### 1. Problems Encountered in the Map
Student response describes the challenges encountered while auditing, fixing and processing the dataset for the area of their choice. Some of the problems encountered during data audit are cleaned programmatically.


Student response shows understanding of the process of auditing, and ways to correct or standardize the data, including dealing with problems specific to the location, e.g. related to language or traditional ways of formatting. Some of the problems encountered during data audit are cleaned programmatically.  

### Problem areas identified:

nd values:
    'nd' is always an int -> keep in array of ints

member tags:
    'member': [{'type': sub_tag.attrib.get('type'),
                'ref':  sub_tag.attrib.get('ref'),
                'role': sub_tag.attrib.get('role')},
               {...},
               {...}]
               
    role of "" -> None

tag keys:
    'fixme' -> 'FIXME'
        
    'note': lots of notes, one fixme starting with 'FIXME:'
    'note' & 'note_1'

tag values:
    addr:street
        Map should be made to transform:
            S, s to South
            E, e to East
            W, w to West
            N, n to North

        Somehow check second level when these are found and replace:
            'AVenue', 'Ave', 'Avenue'
            'Crescent', 'Cresent',
            'Dr', 'Dr.', 'Drive',
            'Rd', 'Road',
            'St', 'St.', 'Steet', 'Street',
    
        tdcanadatrust shouldn't be an address

    addr:state should be empty
        if province is populated:
            disregard state
        else:
            assign state to province
    
    addr:province should be all 'Ontario'
    
    addr:interpolation Yes is not a valid value
        Add 'FIXME' : 'Yes is not a valid entry for addr:interpolation'
    
    addr:city:
        'City of Cambridge' -> 'Cambridge'
        'City of Kitchener' -> 'Kitchener'
        'kitchener' -> 'Kitchener'
        'City of Waterloo' -> 'Waterloo'
        'waterloo' -> 'Waterloo'
        'St. Agatha' -> 'Saint Agatha'

### Data Structure
    {'type':    xml_tree.tag
    
     'id':      int(xml_tree('id'))
     
     'pos':     [float(xml_tree('lat')),
                 float(xml_tree('lon'))],
                 
     'created': {'version':     int(xml_tree('version')),
                 'changeset':   int(xml_tree('changeset')),
                 'timestamp':   xml_tree('timestamp'),
                 'user':        xml_tree('user'),
                 'uid':         int(xml_tree('uid'))},
                 
     'address': {'housenumber': tag_tag['addr:housenumber'],
                 'postcode': tag_tag['addr:postcode'],
                 'street': tag_tag['addr:street'], ...},
                 
     'member':  [{'type': member_tag('type'),
                  'ref':  int(member_tag('ref')),
                  'role': member_tag('role')},
                 {.....................................}],
                 
     'node_refs':[int(nd_tag['ref']),
                  int(nd_tag['ref']), ... ],
                  
      tag['k']:  tag_tag['v'],
      tag['k']:  tag_tag['v'], ... }
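The plan above can be turned into a small shaping function. The sketch below is an assumption about how that might look, not the `process_map` code in `Lasso`: it reads attributes through `element.attrib`, and it stores tag keys that collide with the reserved top-level keys under a `tag:` prefix, which is a choice made only for this example.

```python
import xml.etree.ElementTree as ET

CREATED_FIELDS = ('version', 'changeset', 'timestamp', 'user', 'uid')
INT_FIELDS = ('version', 'changeset', 'uid')
RESERVED = {'type', 'id', 'pos', 'created', 'address', 'member', 'node_refs'}


def shape_element(element):
    """Turn a node, way or relation element into a dict; None otherwise."""
    if element.tag not in ('node', 'way', 'relation'):
        return None
    attrib = element.attrib
    doc = {'type': element.tag, 'id': int(attrib['id'])}
    if 'lat' in attrib and 'lon' in attrib:
        doc['pos'] = [float(attrib['lat']), float(attrib['lon'])]
    doc['created'] = {
        field: int(attrib[field]) if field in INT_FIELDS else attrib[field]
        for field in CREATED_FIELDS if field in attrib
    }
    for child in element:
        if child.tag == 'tag':
            key, value = child.attrib['k'], child.attrib['v']
            if key.startswith('addr:'):
                doc.setdefault('address', {})[key[len('addr:'):]] = value
            elif key in RESERVED:
                # Assumption: avoid overwriting structural keys such as 'type'.
                doc['tag:' + key] = value
            else:
                doc[key] = value
        elif child.tag == 'nd':
            doc.setdefault('node_refs', []).append(int(child.attrib['ref']))
        elif child.tag == 'member':
            doc.setdefault('member', []).append({
                'type': child.attrib.get('type'),
                'ref': int(child.attrib['ref']),
                'role': child.attrib.get('role') or None,
            })
    return doc


# Hypothetical elements used to exercise the function.
NODE = ET.fromstring(
    '<node id="1" lat="43.46" lon="-80.52" version="2" changeset="5" '
    'timestamp="2015-07-16T03:14:03Z" user="someone" uid="7">'
    '<tag k="addr:street" v="King Street North"/>'
    '<tag k="amenity" v="cafe"/></node>'
)
RELATION = ET.fromstring(
    '<relation id="3"><member type="way" ref="2" role=""/>'
    '<tag k="type" v="route"/></relation>'
)
print(shape_element(NODE))
print(shape_element(RELATION))
```

The `RESERVED` check matters because tags can carry a `type` key, which would otherwise replace the element type.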

#### Iteration 1:
Implement data storage plan

In [10]:
json_sample = l.process_map(osmsample)
mashup = {'addr':{}}
for element in json_sample:
    for key, val in element.items():
        if key == 'addr':
            for k, v in val.items():
                mashup['addr'][k] = v
        else:
            mashup[key] = val
pprint (mashup)

#### Iteration 2:
Implement these changes #done

    create int's of the following attributes:
        uid                #done
        version            #done
        changeset          #done
        id                 #done
        nd ref             #done
        relation ref       #done
    rename containers that are not consistent with the original data:
        node_refs -> nd    #done
        members -> member  #done
        address -> addr    #done

In [13]:
l = reload(l)
json_sample = l.process_map(osmsample)

mashup = {'addr':{}}
for element in json_sample:
    for key, val in element.items():
        if key == 'addr':
            for k, v in val.items():
                mashup['addr'][k] = v
        else:
            mashup[key] = val
pprint (mashup)

{'FIXME': 'Confirm classifications such as "Road" and "Township". (e.g. This '
          'may be a "Concession".) See note and note_1 tags.',
 'GeoBaseNHN:ACQTECH': 'Vector Data',
 'GeoBaseNHN:DatasetName': '02GA000',
 'GeoBaseNHN:PROVIDER': 'federal',
 'GeoBaseNHN:UUID': 'be20b7e01a9842f58883156b00625f53',
 'GeoBaseNHN:VALDATE': '1990',
 'NHS': 'yes',
 'access': 'private',
 'accuracy:meters': '10',
 'addr': {'city': 'Kitchener',
          'country': 'CA',
          'housename': 'Timeless Material Co',
          'housename:zh': '学生生活中心',
          'housenumber': '176',
          'interpolation': 'odd',
          'postcode': 'N2B 3N1',
          'province': 'Ontario',
          'state': 'ON',
          'street': 'Westchester Drive',
          'unit': '230'},
 'admin_level': '8',
 'administrative': 'city of K',
 'aerialway': 'pylon',
 'aeroway': 'taxiway',
 'alt_name': 'The Hub',
 'amenity': 'parking',
 'area': 'yes',
 'artist_name': 'Ron Baird',
 'artwork_type': 'sculpture',
 'atm': 'ye

#### Iteration 3:
merge 'FIXME' and 'fixme' into 'FIXME'
Finish address processing

    merge 'FIXME' and 'fixme' into 'FIXME'  #done

    addr:street
        Map should be made to transform:
            S, s to South
            E, e to East
            W, w to West
            N, n to North

        Somehow check second level when these are found and replace:
            'AVenue', 'Ave', 'Avenue'
            'Crescent', 'Cresent',
            'Dr', 'Dr.', 'Drive',
            'Rd', 'Road',
            'St', 'St.', 'Steet', 'Street',
    
        tdcanadatrust shouldn't be an address

    addr:state should be empty
        if province is populated:
            disregard state
        else:
            assign state to province
    
    addr:province should be all 'ON'
        
    addr:city:
        'City of Cambridge' -> 'Cambridge'
        'City of Kitchener' -> 'Kitchener'
        'kitchener' -> 'Kitchener'
        'City of Waterloo' -> 'Waterloo'
        'waterloo' -> 'Waterloo'
        'St. Agatha' -> 'Saint Agatha'

In [18]:
l = reload(l)
json_sample = l.process_map(osmsample)

mashup = {'addr':{}}
for element in json_sample:
    for key, val in element.items():
        if key == 'addr':
            for k, v in val.items():
                mashup['addr'][k] = v
        else:
            mashup[key] = val
pprint (mashup)

{'FIXME': 'Confirm classifications such as "Road" and "Township". (e.g. This '
          'may be a "Concession".) See note and note_1 tags.',
 'GeoBaseNHN:ACQTECH': 'Vector Data',
 'GeoBaseNHN:DatasetName': '02GA000',
 'GeoBaseNHN:PROVIDER': 'federal',
 'GeoBaseNHN:UUID': 'be20b7e01a9842f58883156b00625f53',
 'GeoBaseNHN:VALDATE': '1990',
 'NHS': 'yes',
 'access': 'private',
 'accuracy:meters': '10',
 'addr': {'city': 'Kitchener',
          'country': 'CA',
          'housename': 'Timeless Material Co',
          'housename:zh': '学生生活中心',
          'housenumber': '176',
          'interpolation': 'odd',
          'postcode': 'N2B 3N1',
          'province': 'ON',
          'street': 'Westchester Drive',
          'unit': '230'},
 'admin_level': '8',
 'administrative': 'city of K',
 'aerialway': 'pylon',
 'aeroway': 'taxiway',
 'alt_name': 'The Hub',
 'amenity': 'parking',
 'area': 'yes',
 'artist_name': 'Ron Baird',
 'artwork_type': 'sculpture',
 'atm': 'yes',
 'attribution': 'GeoBase®'

#### Manually:

    addr:interpolation Yes is not a valid value
        Add 'FIXME' : 'Yes is not a valid entry for addr:interpolation'
    addr:street
        tdcanadatrust shouldn't be an address -> "url"
    'note': starting with 'FIXME:' -> 'FIXME'
    merge 'note_1' into 'note'
    

#### Process full dataset with manual fixes:

In [4]:
l = reload(l)
json_sample = l.process_map(osmfile)
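How `process_map` writes its output is not shown here. Since `mongoimport` by default reads one JSON document per line, a writer along these lines would produce a compatible file. This is a sketch under that assumption; `write_json_lines` is a hypothetical name.

```python
import io
import json


def write_json_lines(documents, handle):
    """Write one JSON document per line and return how many were written."""
    count = 0
    for doc in documents:
        # ensure_ascii=False keeps values such as Chinese names readable.
        handle.write(json.dumps(doc, ensure_ascii=False) + '\n')
        count += 1
    return count


# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
written = write_json_lines(
    [{'type': 'node', 'id': 1}, {'type': 'way', 'id': 2}], buffer)
print(written)
print(buffer.getvalue())
```

With a real file, the handle would typically be opened with `encoding='utf-8'`.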

#### Import into Mongo DB using mongoimport

    from xml_tree:
    node:     248,288
    way:       31,662
    relation:     234
    total:    280,184

    $ mongoimport -d osm -c elements --file waterloo-OSM-data.osm.json
    >>> imported 280184 documents

### 2. Data Overview
Student provides a statistical overview about their chosen dataset, like:

In [7]:
from pymongo import MongoClient
client = MongoClient('localhost:27017')
osm = client.osm

#### size of the file:

    original OSM xml: 5,649,775 bytes
    JSON:             5,913,023 bytes

#### number of unique users & top 5 contributors:

In [19]:
pipeline = [{'$group': {'_id': '$created.uid',
                        'user': {'$first': '$created.user'},
                        'count': {'$sum':1}}},
            {'$sort': {'count': -1}}]

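# Note: indexing with ['result'] relies on PyMongo 2.x, where aggregate() returns a dict.
# With PyMongo 3+, aggregate() returns a cursor: use list(osm.elements.aggregate(pipeline)).
users = osm.elements.aggregate(pipeline)['result']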

print('There were ' + str(len(users)) + ' unique users who contributed to the Waterloo Map.\n')

for ur in users[:5]:
    print (ur['user'] + ' added ' + str(ur['count']) + ' elements')

There were 321 unique users who contributed to the Waterloo Map.

permute added 127990 elements
fuego added 30268 elements
andrewpmk added 28966 elements
rw__ added 20155 elements
Xylem added 12611 elements


#### number of nodes and ways:

number of chosen type of nodes, like cafes, shops etc
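These counts have not been computed yet. Aggregation pipelines for them might look like the sketch below, assuming the documents keep the element type under `type` and tag keys such as `amenity` at the top level, as in the data structure plan. The names `TYPE_COUNTS`, `amenity_counts` and `run` are hypothetical, and `run` assumes PyMongo 3 or later, where `aggregate()` returns a cursor.

```python
# Count documents per element type (node, way, relation).
TYPE_COUNTS = [
    {'$group': {'_id': '$type', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
]


def amenity_counts(limit=10):
    """Pipeline for the most common amenity values among nodes."""
    return [
        {'$match': {'type': 'node', 'amenity': {'$exists': True}}},
        {'$group': {'_id': '$amenity', 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': limit},
    ]


def run(collection, pipeline):
    """Run a pipeline on a PyMongo collection and return the rows as a list.

    Assumes PyMongo 3+, where aggregate() returns an iterable cursor.
    """
    return list(collection.aggregate(pipeline))
```

With the client set up earlier, this would be called as `run(osm.elements, TYPE_COUNTS)` or `run(osm.elements, amenity_counts(5))`.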

    
Student response provides the statistics about their chosen map area.

Student response also includes the MongoDB queries used to obtain the statistics.

### 3. Additional Ideas
Other ideas about the datasets

Student is able to analyze the dataset and recognize opportunities for using it in other projects

Student proposes one or more additional ways of improving and analyzing the data and gives thoughtful discussion about the benefits and anticipated problems in implementing the improvement.