# OpenStreetMap Case Study

## Step One - Complete Programming Exercises
Make sure all programming exercises are solved correctly in the "Case Study: OpenStreetMap Data" Lesson in the course you have chosen (MongoDB or SQL). This is the last lesson in that section.

In [1]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
import requests
from bs4 import BeautifulSoup
import csv
import codecs
import cerberus

osm_file = 'sample.osm'

### Iterative Parsing
Your task is to use iterative parsing to process the map file and find out not only which tags are present but also how many of each, to get a feel for how much of which data you can expect to have in the map. Return a dictionary with the tag name as the key and the number of times this tag is encountered in the map as the value.

In [2]:
tags = defaultdict(int)
for event, element in ET.iterparse(osm_file):
    tags[element.tag] += 1
print tags

defaultdict(<type 'int'>, {'node': 32697, 'nd': 38628, 'member': 500, 'tag': 20670, 'relation': 51, 'way': 3410, 'osm': 1})
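
As a rough, self-contained illustration of the counting approach above, the following sketch runs the same idea on a tiny made-up OSM fragment instead of `sample.osm`. The `SAMPLE_OSM` data and the `count_tags` helper are assumptions introduced here, and the code is written for Python 3 (unlike the notebook cells, which use Python 2 `print` statements).

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature OSM document, used only for illustration.
SAMPLE_OSM = b"""<osm>
  <node id="1" uid="10"><tag k="amenity" v="cafe"/></node>
  <node id="2" uid="11"/>
  <way id="3" uid="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""


def count_tags(source):
    """Count how often each element name occurs, one 'end' event at a time."""
    counts = Counter()
    for _, element in ET.iterparse(source):  # default events: ("end",)
        counts[element.tag] += 1
        element.clear()  # children were already counted; drop them
    return dict(counts)


tag_counts = count_tags(io.BytesIO(SAMPLE_OSM))
print(tag_counts)
```

Calling `element.clear()` after counting is optional; it is one common way to reduce memory use on large extracts.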


### Tag Types
Your task is to explore the data a bit more. Before you process the data and add it to your database, you should check the "k" value for each tag and see if there are any potential problems. We have provided you with 3 regular expressions to check for certain patterns in the tags. As we saw in the quiz earlier, we would like to change the data model and expand the "addr:street" type of keys to a dictionary like this: {"address": {"street": "Some value"}}. So, we have to see if we have such tags, and if we have any tags with problematic characters.

Please complete the function 'key_type', such that we have a count of each of
four tag categories in a dictionary:
  - "lower", for tags that contain only lowercase letters and are valid,
  - "lower_colon", for otherwise valid tags with a colon in their names,
  - "problemchars", for tags with problematic characters, and
  - "other", for other tags that do not fall into the other three categories.

In [3]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    if element.tag == "tag":
        result1 = lower.search(element.attrib['k'])
        result2 = lower_colon.search(element.attrib['k'])
        result3 = problemchars.search(element.attrib['k'])
        
        if result1 is not None:
            keys['lower'] += 1
        elif result2 is not None:
            keys['lower_colon'] += 1
        elif result3 is not None:
            keys['problemchars'] += 1
        else:
            keys['other'] += 1

    return keys



keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
for _, element in ET.iterparse(osm_file):
    keys = key_type(element, keys)

print keys

{'problemchars': 0, 'lower': 9913, 'other': 1798, 'lower_colon': 8959}
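
To make the four categories more concrete, here is a small sketch that applies the same three regular expressions to a handful of hand-picked keys. The `classify_key` helper and the example keys are assumptions made for illustration (Python 3); they are not taken from the dataset.

```python
import re

LOWER = re.compile(r'^([a-z]|_)*$')
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category name for one tag key, checked in the same order as key_type."""
    if LOWER.search(key):
        return 'lower'
    if LOWER_COLON.search(key):
        return 'lower_colon'
    if PROBLEMCHARS.search(key):
        return 'problemchars'
    return 'other'


# Hypothetical example keys.
examples = ['highway', 'addr:street', 'addr street', 'tiger:name_base_1', 'FIXME']
labels = {key: classify_key(key) for key in examples}
for key, label in labels.items():
    print(key, '->', label)
```

A key with digits, uppercase letters, or more than one colon matches none of the three patterns and therefore lands in "other".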


### Exploring Users
Your task is to explore the data a bit more.  The first task is a fun one - find out how many unique users have contributed to the map in this particular area!

In [4]:
users = set()
for _, element in ET.iterparse(osm_file):
    uid=element.get('uid')
    if uid is not None:
        users.add(uid)
print len(users), " Total users"

766  Total users


### Improving Street Names
Your task in this exercise has two steps:

- audit the osm_file and change the variable 'mapping' to reflect the changes needed to fix the unexpected street types to the appropriate ones in the expected list.
- write the update_name function, to actually fix the street name.

In [5]:
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

In [6]:
#Scrape 'https://pe.usps.com/text/pub28/28apc_002.htm' to obtain list of
#primary street suffix names
with requests.Session() as session:
    response = session.get('https://pe.usps.com/text/pub28/28apc_002.htm', headers={'user-agent': 'Chrome/60.0.3112.113'})
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id='ep533076')
    expected = []
    mapping = {}
    for each in table.find_all('tr')[1:]:
        if len(each) == 6:
            text = str(each.text)
            text = text.split(" ")
            while '' in text: #remove all blank spaces created from converting unicode to str
                text.remove('')
            street_suffix_name = str(text[0]).title() 
            abbr = str(text[2]).title()
            expected.append(street_suffix_name)
            mapping[abbr] = street_suffix_name



In [7]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        formatted_street_type = street_type.title().replace(".", "") #only capitalize first letter and remove "." from abbreviations
        if formatted_street_type not in expected:
            street_types[formatted_street_type].add(street_name)
    return street_types

with open(osm_file, "r") as f:
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(f, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == "addr:street":
                    audit_street_type(street_types, tag.attrib['v'])
pprint.pprint(dict(street_types))

{'3': set(['Main St #3']),
 '70': set(['State Route 70']),
 '73': set(['State Route 73']),
 'Atreet': set(['Arch Atreet']),
 'Audubon': set(['John James Audubon']),
 'B': set(['Salem Ave Building B']),
 'Centura': set(['Centura']),
 'Chanticleer': set(['Chanticleer']),
 'Cir': set(['Woodfield Cir']),
 'Croft': set(['Kings Croft']),
 'Ct': set(['Portsmouth Ct']),
 'Ii': set(['The Woods Ii']),
 'Rd': set(["Arney's Mount Rd", 'South Easton Rd']),
 'Royal': set(['Five Crown Royal']),
 'Sheffield': set(['Sheffield']),
 'St': set(['Carson St',
            'Green St',
            'N 24th St.',
            'Spring Garden St',
            'jackson st',
            'livingston st',
            'mercer st',
            's broad st']),
 'West': set(['Coventry Circle West']),
 'Woods': set(['The Woods'])}
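
The keys in the audit output are whatever `street_type_re` picks out as the last token of each street name. The sketch below shows that behaviour on a few names from the output; the `last_token` helper is a name introduced here for illustration, and the code targets Python 3.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


def last_token(street_name):
    """Return the trailing token the audit treats as the street type, or None."""
    m = street_type_re.search(street_name)
    return m.group() if m else None


samples = ['N 24th St.', 'Main St #3', 'State Route 70', 'Coventry Circle West']
extracted = {name: last_token(name) for name in samples}
for name, token in extracted.items():
    print(repr(name), '->', repr(token))
```

Unit numbers, route numbers, and directional suffixes are picked up as if they were street types, which is why entries such as '3', '70', and 'West' show up in the audit.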


In [8]:
def update_name(name):
    name = name.title()
    name_split = name.split()
    word_to_replace = name_split[-1].replace(".", "")
    if word_to_replace not in expected:
        try:
            name_split[-1] = mapping[word_to_replace]
            name = " ".join(name_split)
            print name
        except KeyError:
            pass
    return name
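
One side effect of `str.title()` is that it capitalizes the letter after any non-letter character, which is how names such as `N 24Th Street` or `Arney'S Mount Road` arise. A possible variation, sketched here under assumptions, uses `string.capwords`, which only capitalizes the first character of each whitespace-separated word. The `MAPPING` dictionary is a small hypothetical stand-in for the scraped USPS mapping, and `update_name_capwords` is a name introduced for this example (Python 3).

```python
import string

# Hypothetical subset of the scraped abbreviation mapping, for illustration only.
MAPPING = {'St': 'Street', 'Rd': 'Road', 'Ct': 'Court', 'Cir': 'Circle'}
EXPECTED = set(MAPPING.values())


def update_name_capwords(name):
    """Expand a trailing street-type abbreviation without mangling '24th' or "Arney's"."""
    words = string.capwords(name).split()
    if not words:
        return name  # nothing to fix in an empty name
    last = words[-1].replace('.', '')
    if last not in EXPECTED and last in MAPPING:
        words[-1] = MAPPING[last]
    return ' '.join(words)


raw_names = ['N 24th St.', "Arney's Mount Rd", 's broad st', 'Kings Croft']
cleaned = [update_name_capwords(n) for n in raw_names]
print(cleaned)
```

Names whose last word is not in the mapping, such as 'Kings Croft', are returned with only their capitalization normalized.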


### Preparing for Database - SQL
After auditing is complete, the next step is to prepare the data to be inserted into a SQL database. To do so, you will parse the elements in the OSM XML file, transforming them from document format to tabular format, thus making it possible to write to .csv files. These csv files can then easily be imported to a SQL database as tables.

The process for this transformation is as follows:
- Use iterparse to iteratively step through each top level element in the XML
- Shape each element into several data structures using a custom function
- Utilize a schema and validation library to ensure the transformed data is in the correct format
- Write each data structure to the appropriate .csv files
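
The last step of the list above can be sketched with the standard `csv.DictWriter`. This example writes to an in-memory buffer rather than a file, and the two rows are made-up data; it is meant only to show how a field list fixes the column order (Python 3).

```python
import csv
import io

NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']

# Hypothetical shaped rows, as a list of dictionaries (one per secondary tag).
rows = [
    {'id': 1, 'key': 'street', 'value': 'Main Street', 'type': 'addr'},
    {'id': 1, 'key': 'amenity', 'value': 'cafe', 'type': 'regular'},
]

buffer = io.StringIO(newline='')  # let the csv module control line endings
writer = csv.DictWriter(buffer, fieldnames=NODE_TAGS_FIELDS)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Because the writer takes its column order from `fieldnames`, the dictionaries themselves can list their keys in any order.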

We've already provided the code needed to load the data, perform iterative parsing and write the output to csv files. Your task is to complete the shape_element function that will transform each element into the correct format. To make this process easier we've already defined a schema (see the schema.py file in the last code tab) for the .csv files and the eventual tables. Using the cerberus library we can validate the output against this schema to ensure it is correct.

#### Shape Element Function
The function should take as input an iterparse Element object and return a dictionary.

##### If the element top level tag is "node":
The dictionary returned should have the format {"node": .., "node_tags": ...}

The "node" field should hold a dictionary of the following top level node attributes:
- id
- user
- uid
- version
- lat
- lon
- timestamp
- changeset

All other attributes can be ignored

The "node_tags" field should hold a list of dictionaries, one per secondary tag. Secondary tags are
child tags of node which have the tag name/type: "tag". Each dictionary should have the following
fields from the secondary tag attributes:

- id: the top level node id attribute value
- key: the full tag "k" attribute value if no colon is present or the characters after the colon if one is.
- value: the tag "v" attribute value
- type: either the characters before the colon in the tag "k" value or "regular" if a colon
        is not present.

Additionally,

- if the tag "k" value contains problematic characters, the tag should be ignored
- if the tag "k" value contains a ":" the characters before the ":" should be set as the tag type
  and characters after the ":" should be set as the tag key
- if there are additional ":" in the "k" value, they should be ignored and kept as part of
  the tag key. For example:

  <tag k="addr:street:name" v="Lincoln"/>
  should be turned into
  {'id': 12345, 'key': 'street:name', 'value': 'Lincoln', 'type': 'addr'}

- If a node has no secondary tags then the "node_tags" field should just contain an empty list.
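
The tag rules above can be condensed into a short sketch. The `tag_row` helper is a hypothetical name used only for this illustration, and the code is written for Python 3; it splits on the first colon only, so any further colons stay in the key.

```python
import re

PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def tag_row(element_id, k, v, default_type='regular'):
    """Build one secondary-tag dictionary, or return None if the key is problematic."""
    if PROBLEMCHARS.search(k):
        return None  # tags with problematic characters are ignored
    tag_type, colon, rest = k.partition(':')  # split at the first colon only
    if colon:
        return {'id': element_id, 'key': rest, 'value': v, 'type': tag_type}
    return {'id': element_id, 'key': k, 'value': v, 'type': default_type}


lincoln = tag_row(12345, 'addr:street:name', 'Lincoln')
cafe = tag_row(12345, 'amenity', 'cafe')
ignored = tag_row(12345, 'bad key', 'x')
print(lincoln)
print(cafe)
print(ignored)
```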


##### If the element top level tag is "way":
The dictionary should have the format {"way": ..., "way_tags": ..., "way_nodes": ...}

The "way" field should hold a dictionary of the following top level way attributes:

- id
- user
- uid
- version
- timestamp
- changeset

All other attributes can be ignored

The "way_tags" field should again hold a list of dictionaries, following the exact same rules as for "node_tags".

Additionally, the dictionary should have a field "way_nodes". "way_nodes" should hold a list of dictionaries, one for each nd child tag.  Each dictionary should have the fields:

- id: the top level element (way) id
- node_id: the ref attribute value of the nd tag
- position: the index, starting at 0, of the nd tag, i.e. the order in which the nd tag appears within the way element
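
A minimal sketch of the "way_nodes" rule, assuming a made-up way element: `enumerate` supplies the zero-based position of each nd child. The `WAY_XML` string and `way_node_rows` helper are introduced here for illustration only (Python 3).

```python
import xml.etree.ElementTree as ET

# Hypothetical way that visits node 101 twice (a closed loop).
WAY_XML = '<way id="3"><nd ref="101"/><nd ref="102"/><nd ref="101"/></way>'


def way_node_rows(way):
    """Return one dictionary per nd child, numbered in document order from 0."""
    way_id = way.get('id')
    return [
        {'id': way_id, 'node_id': nd.get('ref'), 'position': position}
        for position, nd in enumerate(way.iter('nd'))
    ]


way_rows = way_node_rows(ET.fromstring(WAY_XML))
for row in way_rows:
    print(row)
```

The attribute values are still strings at this point; in the notebook, converting them to integers is left to the `coerce` entries in the schema.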

In [9]:
# schema.py

schema = {
    'node': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'lat': {'required': True, 'type': 'float', 'coerce': float},
            'lon': {'required': True, 'type': 'float', 'coerce': float},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'node_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    },
    'way': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'way_nodes': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'node_id': {'required': True, 'type': 'integer', 'coerce': int},
                'position': {'required': True, 'type': 'integer', 'coerce': int}
            }
        }
    },
    'way_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    }
}

In [10]:
OSM_PATH = osm_file

NODES_PATH = "csv_files/nodes.csv"
NODE_TAGS_PATH = "csv_files/nodes_tags.csv"
WAYS_PATH = "csv_files/ways.csv"
WAY_NODES_PATH = "csv_files/ways_nodes.csv"
WAY_TAGS_PATH = "csv_files/ways_tags.csv"

LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Make sure the fields order in the csvs matches the column order in the sql table schema
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']

def get_tags(element, element_id, problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    tags = []
    for tag in element.iter('tag'):
        tags_dict = {}
        key = tag.get('k')
        if re.search(problem_chars, key) is None:
            tags_dict["id"] = element_id
            if ":" in key:
                key = key.split(":", 1)
                tags_dict['key']=str(key[1])
                tags_dict['type']=str(key[0])
            else:
                tags_dict['key']=str(key)
                tags_dict['type']=default_tag_type
            if tags_dict['key'] == 'street' and tags_dict['type'] == 'addr':
                st_name = tag.get('v')
                tags_dict['value'] = update_name(st_name)
            else:
                tags_dict['value']=tag.get('v')
            tags.append(tags_dict)
    return tags
    
def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS):
    """Clean and shape node or way XML element to Python dict"""

    node_attribs = {}
    way_attribs = {}
    way_nodes = []

    if element.tag == "node":
        for each in node_attr_fields:
            attribute = element.get(each)
            node_attribs[each] = attribute
        node_id = node_attribs['id']
        tags = get_tags(element, node_id)
        return {'node': node_attribs, 'node_tags': tags}

    elif element.tag == 'way':
        for each in way_attr_fields:
            attribute = element.get(each)
            way_attribs[each] = attribute
        way_id = way_attribs['id']
        tags = get_tags(element, way_id)
        for i, node in enumerate(element.iter('nd')):
            nodes_dict = {}
            nodes_dict['id'] = way_id
            nodes_dict['node_id'] = node.get('ref')
            nodes_dict['position'] = i
            way_nodes.append(nodes_dict)
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}



# ================================================== #
#               Helper Functions                     #
# ================================================== #
def get_element(osm_file, tags):
    """Yield element if it is the right type of tag"""

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


def validate_element(element, validator, schema=schema):
    """Raise an Exception if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(validator.errors.iteritems())
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_string = pprint.pformat(errors)
        
        raise Exception(message_string.format(field, error_string))


class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


# ================================================== #
#               Main Function                        #
# ================================================== #
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
        codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
        codecs.open(WAYS_PATH, 'w') as ways_file, \
        codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
        codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])



process_map(OSM_PATH, validate=True)

Portsmouth Court
South Easton Road
N 24Th Street
Green Street
Spring Garden Street
Carson Street
S Broad Street
Jackson Street
Mercer Street
Mercer Street
Livingston Street
Arney'S Mount Road
Woodfield Circle
Hearthstone Boulevard
