# OpenStreetMap Case Study

## Step One - Complete Programming Exercises
Make sure all programming exercises are solved correctly in the "Case Study: OpenStreetMap Data" Lesson in the course you have chosen (MongoDB or SQL). This is the last lesson in that section.

In [1]:
import xml.etree.ElementTree as ET  # cElementTree is deprecated and was removed in Python 3.9
from collections import defaultdict
import re
import pprint
import csv
import codecs
import cerberus
from street_suffix import map_street_suffix
from audit import audit_addresses
from update import update_addr
from prepare_for_database import process_map

osm_file = 'philadelphia_pennsylvania.osm'

### Iterative Parsing
Your task is to use iterative parsing to process the map file and find out not only which tags are present but also how many of each, to get a sense of how much of each kind of data to expect in the map. Return a dictionary with the tag name as the key and the number of times that tag occurs in the map as the value.

In [None]:
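tags = defaultdict(int)
# iterparse reads the file incrementally and yields each element once, on its closing tag
for event, element in ET.iterparse(osm_file):
    tags[element.tag] += 1
    # free the element's contents once counted to limit memory use on a large file
    element.clear()
print(tags)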
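
To show the shape of the returned dictionary, here is a minimal sketch that runs the same counting loop over a tiny in-memory sample. The `SAMPLE_OSM` data and the `count_tags` helper are hypothetical names introduced only for illustration; they are not part of the course code.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature OSM extract, used only for illustration.
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="39.95" lon="-75.16">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="39.96" lon="-75.17"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""


def count_tags(source):
    """Return {tag name: occurrences} for an OSM XML path or file object."""
    counts = defaultdict(int)
    # By default iterparse yields each element on its "end" event,
    # so every element is counted exactly once.
    for _, element in ET.iterparse(source):
        counts[element.tag] += 1
        # Drop the element's contents once counted to reduce memory use.
        element.clear()
    return dict(counts)


sample_counts = count_tags(io.BytesIO(SAMPLE_OSM))
print(sample_counts)
```

The same function should also accept a file path such as `osm_file`, since `ET.iterparse` takes either a filename or a file object.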

### Tag Types
Your task is to explore the data a bit more. Before you process the data and add it to your database, check the "k" value of each tag for potential problems. We have provided three regular expressions to check for certain patterns in the tags. As we saw in the earlier quiz, we would like to change the data model and expand "addr:street"-type keys into a dictionary like this: {"address": {"street": "Some value"}}. So we have to see whether we have such tags, and whether any tags contain problematic characters.

Please complete the function 'key_type', such that we have a count of each of
four tag categories in a dictionary:
  - "lower", for tags that contain only lowercase letters and are valid,
  - "lower_colon", for otherwise valid tags with a colon in their names,
  - "problemchars", for tags with problematic characters, and
  - "other", for other tags that do not fall into the other three categories.

In [None]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    """Increment the count of the category that the element's "k" attribute falls into."""
    if element.tag == "tag":
        k = element.attrib['k']
        # Patterns are checked in order; the first one that matches decides the category.
        if lower.search(k):
            keys['lower'] += 1
        elif lower_colon.search(k):
            keys['lower_colon'] += 1
        elif problemchars.search(k):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1

    return keys



keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
for _, element in ET.iterparse(osm_file):
    keys = key_type(element, keys)

print(keys)
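
To make the four categories concrete, the sketch below applies the same three regular expressions to a handful of sample keys. The `classify_key` helper and the sample keys are hypothetical and exist only for illustration. Note that keys containing digits, uppercase letters, or more than one colon fall through to "other".

```python
import re

# Same three patterns as in the exercise above.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category name for a single "k" value (hypothetical helper)."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"


# Hypothetical sample keys, one or more per category.
for sample_key in ["highway", "addr:street", "addr street",
                   "tiger:name_base_1", "FIXME", "addr:street:name"]:
    print("%-20s %s" % (sample_key, classify_key(sample_key)))
```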

### Exploring Users
Your task is to explore the data a bit more.  The first task is a fun one - find out how many unique users have contributed to the map in this particular area!

In [None]:
users = set()
for _, element in ET.iterparse(osm_file):
    uid = element.get('uid')
    if uid is not None:
        users.add(uid)
print("%d total users" % len(users))

### Improving Street Names
Your task in this exercise has two steps:

- audit the osm_file and change the variable 'mapping' to reflect the changes needed to fix the unexpected street types to the appropriate ones in the expected list.
- write the update_name function, to actually fix the street name.

In [None]:
with open(osm_file, "rb") as f:
    addr_address_fields = []
    tiger_address_fields = []
    # Use "end" events: on "start" an element's child tags may not have been parsed yet.
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                k = tag.attrib['k']
                if "addr" in k and k not in addr_address_fields:
                    addr_address_fields.append(k)
                elif ("tiger:name" in k or "tiger:zip" in k) and \
                        k not in tiger_address_fields:
                    tiger_address_fields.append(k)

print("Addr: fields ")
print(addr_address_fields)
print("______________")
print("Tiger: fields ")
print(tiger_address_fields)

In [None]:
expected, mapping = map_street_suffix()
pprint.pprint(dict(mapping))

In [None]:
audit_addresses(osm_file, expected)
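
The second step of this exercise, the `update_name` function, is not shown in this notebook, so the following is only a minimal sketch of one way it could look. The `street_type_re` pattern and the `SAMPLE_MAPPING` entries are assumptions made for illustration; the real mapping comes from `map_street_suffix()`, and the project's own `update_addr` may work differently.

```python
import re

# Assumed pattern: the last whitespace-delimited token of the street name.
street_type_re = re.compile(r'\S+\.?$')

# Hypothetical mapping from unexpected street types to the expected ones.
SAMPLE_MAPPING = {
    "St": "Street",
    "St.": "Street",
    "Ave": "Avenue",
    "Rd.": "Road",
}


def update_name(name, mapping):
    """Replace the trailing street type of `name` if `mapping` has an entry for it."""
    match = street_type_re.search(name)
    if match is None:
        return name  # empty or whitespace-only name: nothing to fix
    street_type = match.group()
    if street_type in mapping:
        # Keep everything before the street type and append the corrected form.
        return name[:match.start()] + mapping[street_type]
    return name


for street in ["Market St", "N Broad St.", "Walnut Street"]:
    print("%s => %s" % (street, update_name(street, SAMPLE_MAPPING)))
```

Only the final token is rewritten, so an abbreviation earlier in the name (for example a leading "N") is left untouched.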

### Preparing for Database - SQL
After auditing is complete, the next step is to prepare the data for insertion into a SQL database. To do so, you will parse the elements in the OSM XML file, transforming them from document format to tabular format so that they can be written to .csv files. These csv files can then easily be imported into a SQL database as tables.

The process for this transformation is as follows:
- Use iterparse to iteratively step through each top level element in the XML
- Shape each element into several data structures using a custom function
- Utilize a schema and validation library to ensure the transformed data is in the correct format
- Write each data structure to the appropriate .csv files

We've already provided the code needed to load the data, perform iterative parsing and write the output to csv files. Your task is to complete the shape_element function that will transform each element into the correct format. To make this process easier we've already defined a schema (see the schema.py file in the last code tab) for the .csv files and the eventual tables. Using the cerberus library we can validate the output against this schema to ensure it is correct.
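
The last two steps of the process above (validate, then write to .csv) can be sketched as follows. This is a simplified illustration written for Python 3, not the provided course code: `check_fields` is a hypothetical stand-in for the cerberus schema validation, `write_rows` is a hypothetical helper, and the sample node values are invented. The field names are the node attributes listed in this notebook.

```python
import csv
import io

# Column order for the nodes table (attribute names from the node description).
NODE_FIELDS = ["id", "lat", "lon", "user", "uid", "version", "changeset", "timestamp"]


def check_fields(row, fields):
    """Stand-in for schema validation: require exactly the expected keys."""
    missing = set(fields) - set(row)
    extra = set(row) - set(fields)
    if missing or extra:
        raise ValueError("missing=%s extra=%s" % (sorted(missing), sorted(extra)))


def write_rows(rows, fields, out):
    """Write a header line followed by one validated row per dictionary."""
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        check_fields(row, fields)
        writer.writerow(row)


# Invented sample node, shaped as a flat dictionary.
sample_node = {"id": "1", "lat": "39.95", "lon": "-75.16", "user": "alice",
               "uid": "42", "version": "1", "changeset": "100",
               "timestamp": "2016-01-01T00:00:00Z"}

buffer = io.StringIO()  # in-memory stand-in for an open .csv file
write_rows([sample_node], NODE_FIELDS, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```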

#### Shape Element Function
The function should take as input an iterparse Element object and return a dictionary.

##### If the element top level tag is "node":
The dictionary returned should have the format {"node": ..., "node_tags": ...}

The "node" field should hold a dictionary of the following top level node attributes:
- id
- user
- uid
- version
- lat
- lon
- timestamp
- changeset

All other attributes can be ignored.

The "node_tags" field should hold a list of dictionaries, one per secondary tag. Secondary tags are
child tags of node which have the tag name/type: "tag". Each dictionary should have the following
fields from the secondary tag attributes:

- id: the top level node id attribute value
- key: the full tag "k" attribute value if no colon is present or the characters after the colon if one is.
- value: the tag "v" attribute value
- type: either the characters before the colon in the tag "k" value or "regular" if a colon is not present.

Additionally,

- if the tag "k" value contains problematic characters, the tag should be ignored
- if the tag "k" value contains a ":" the characters before the ":" should be set as the tag type
  and characters after the ":" should be set as the tag key
- if there are additional ":" in the "k" value, they should be ignored and kept as part of
  the tag key. For example:

  `<tag k="addr:street:name" v="Lincoln"/>`
  should be turned into
  `{'id': 12345, 'key': 'street:name', 'value': 'Lincoln', 'type': 'addr'}`

- If a node has no secondary tags then the "node_tags" field should just contain an empty list.
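
The tag rules above can be sketched as a small helper. This is only an illustration under stated assumptions, not the graded `shape_element` solution: `shape_tag` and `shape_tags` are hypothetical names, and the id is kept as the raw attribute string on the assumption that type conversion happens later (for example during validation or import).

```python
import re
import xml.etree.ElementTree as ET

PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def shape_tag(element_id, tag, default_type="regular"):
    """Shape one secondary tag, or return None if its key must be ignored."""
    key = tag.attrib["k"]
    if PROBLEMCHARS.search(key):
        return None  # keys with problematic characters are skipped
    if ":" in key:
        # Split on the first colon only; later colons stay in the key.
        tag_type, key = key.split(":", 1)
    else:
        tag_type = default_type
    return {"id": element_id, "key": key, "value": tag.attrib["v"], "type": tag_type}


def shape_tags(element):
    """Return the list of shaped tags for a node or way (empty if it has none)."""
    element_id = element.attrib["id"]
    shaped = (shape_tag(element_id, tag) for tag in element.findall("tag"))
    return [tag for tag in shaped if tag is not None]


# Invented sample node with a multi-colon key, a plain key, and a bad key.
node = ET.fromstring(
    '<node id="12345">'
    '<tag k="addr:street:name" v="Lincoln"/>'
    '<tag k="amenity" v="cafe"/>'
    '<tag k="bad key" v="ignored"/>'
    '</node>'
)
node_tags = shape_tags(node)
print(node_tags)
```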


##### If the element top level tag is "way":
The dictionary should have the format {"way": ..., "way_tags": ..., "way_nodes": ...}

The "way" field should hold a dictionary of the following top level way attributes:

- id
- user
- uid
- version
- timestamp
- changeset

All other attributes can be ignored.

The "way_tags" field should again hold a list of dictionaries, following the exact same rules as for "node_tags".

Additionally, the dictionary should have a field "way_nodes". "way_nodes" should hold a list of dictionaries, one for each nd child tag.  Each dictionary should have the fields:

- id: the top level element (way) id
- node_id: the ref attribute value of the nd tag
- position: the index, starting at 0, of the nd tag, i.e. the order in which the nd tag appears within the way element
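
As a minimal sketch of the "way_nodes" rules above, `enumerate` gives each nd child its zero-based position. The `shape_way_nodes` helper and the sample way are hypothetical, and the id and ref values are kept as raw attribute strings on the assumption that any type conversion happens later.

```python
import xml.etree.ElementTree as ET


def shape_way_nodes(way):
    """Return one dictionary per nd child, in document order."""
    way_id = way.attrib["id"]
    return [
        {"id": way_id, "node_id": nd.attrib["ref"], "position": position}
        for position, nd in enumerate(way.findall("nd"))
    ]


# Invented sample way with three node references and one tag.
way = ET.fromstring(
    '<way id="10">'
    '<nd ref="101"/><nd ref="102"/><nd ref="103"/>'
    '<tag k="highway" v="residential"/>'
    '</way>'
)
way_nodes = shape_way_nodes(way)
print(way_nodes)
```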

In [2]:
process_map(osm_file, validate=False)

check in: 0
check in: 10000
check in: 20000
check in: 30000
check in: 40000
check in: 50000
check in: 60000
check in: 70000
check in: 80000
check in: 90000
check in: 100000
check in: 110000
check in: 120000
check in: 130000
check in: 140000
check in: 150000
check in: 160000
check in: 170000
check in: 180000
check in: 190000
check in: 200000
check in: 210000
check in: 220000
check in: 230000
check in: 240000
check in: 250000
check in: 260000
check in: 270000
check in: 280000
check in: 290000
check in: 300000
check in: 310000
check in: 320000
check in: 330000
check in: 340000
check in: 350000
check in: 360000
check in: 370000
check in: 380000
check in: 390000
check in: 400000
check in: 410000
check in: 420000
check in: 430000
check in: 440000
check in: 450000
check in: 460000
check in: 470000
check in: 480000
check in: 490000
check in: 500000
check in: 510000
check in: 520000
check in: 530000
check in: 540000
check in: 550000
check in: 560000
check in: 570000
check in: 580000
check in: 5