# OpenStreetMap Case Study

## Step One - Complete Programming Exercises
Make sure all programming exercises are solved correctly in the "Case Study: OpenStreetMap Data" Lesson in the course you have chosen (MongoDB or SQL). This is the last lesson in that section.

In [1]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
import requests
from bs4 import BeautifulSoup

### Iterative Parsing
Your task is to use iterative parsing to process the map file and find out not only which tags it contains but also how many of each, to get a feel for how much of each kind of data to expect in the map. Return a dictionary with the tag name as the key and the number of times that tag occurs in the map as the value.

In [2]:
tags = defaultdict(int)
for event, element in ET.iterparse("philadelphia_pennsylvania.osm"):
    tags[element.tag] += 1
print tags

defaultdict(<type 'int'>, {'node': 3269680, 'nd': 3961981, 'bounds': 1, 'member': 60781, 'tag': 2071283, 'relation': 5114, 'way': 340971, 'osm': 1})
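
With over three million nodes in the file, memory use during parsing may become a concern. The following is a minimal sketch, not the course solution, of the same counting idea with each element cleared once it has been counted. The inline `SAMPLE_OSM` data and the `count_tags` helper name are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny hypothetical OSM-like sample, used only for illustration.
SAMPLE_OSM = b"""<osm>
  <node id="1" uid="10"><tag k="amenity" v="cafe"/></node>
  <node id="2" uid="11"/>
  <way id="3" uid="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""

def count_tags(source):
    """Count element tags from a filename or a binary file-like object."""
    counts = defaultdict(int)
    # The default event is "end", so each element is complete when it is seen.
    for _, element in ET.iterparse(source):
        counts[element.tag] += 1
        # Release the element's children and attributes once counted;
        # the emptied element itself stays attached to its parent.
        element.clear()
    return dict(counts)

tag_counts = count_tags(io.BytesIO(SAMPLE_OSM))
print(tag_counts)
```

Passing the real file name instead of the `io.BytesIO` wrapper should work the same way, since `ET.iterparse` accepts either.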


### Tag Types
Your task is to explore the data a bit more. Before you process the data and add it to your database, you should check the "k" value of each tag and see if there are any potential problems. We have provided you with three regular expressions to check for certain patterns in the tags. As we saw in the quiz earlier, we would like to change the data model and expand keys of the "addr:street" type into a dictionary like this: {"address": {"street": "Some value"}}. So we have to see whether we have such tags, and whether we have any tags with problematic characters.
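
The target data model can be sketched roughly as follows. The `shape_tags` helper and its list-of-pairs input are assumptions made for this example; the actual shaping code belongs to a later exercise and may differ.

```python
def shape_tags(tags):
    """Nest 'addr:*' keys under an 'address' dictionary.

    `tags` is a list of (key, value) pairs; both the function name and
    the input format are hypothetical, chosen for this sketch.
    """
    shaped = {}
    for key, value in tags:
        if key.startswith("addr:"):
            parts = key.split(":")
            # Only the simple two-part form (e.g. "addr:street") is handled;
            # keys with a second colon are skipped in this sketch.
            if len(parts) == 2:
                shaped.setdefault("address", {})[parts[1]] = value
        else:
            shaped[key] = value
    return shaped

example = shape_tags([("addr:street", "Some value"), ("amenity", "cafe")])
print(example)
```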

Please complete the function 'key_type', such that we have a count of each of
four tag categories in a dictionary:
  - "lower", for tags that contain only lowercase letters and are valid,
  - "lower_colon", for otherwise valid tags with a colon in their names,
  - "problemchars", for tags with problematic characters, and
  - "other", for other tags that do not fall into the other three categories.

In [3]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    if element.tag == "tag":
        result1 = lower.search(element.attrib['k'])
        result2 = lower_colon.search(element.attrib['k'])
        result3 = problemchars.search(element.attrib['k'])
        
        if result1 is not None:
            keys['lower'] += 1
        elif result2 is not None:
            keys['lower_colon'] += 1
        elif result3 is not None:
            keys['problemchars'] += 1
        else:
            keys['other'] += 1

    return keys



keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
for _, element in ET.iterparse("philadelphia_pennsylvania.osm"):
    keys = key_type(element, keys)

print keys

{'problemchars': 8, 'lower': 983593, 'other': 182986, 'lower_colon': 904696}
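
To see why so many keys can fall into "other", it may help to run the three expressions on a few individual keys. The sample keys and the `classify_key` helper below are made up for illustration; only the regular expressions come from the exercise.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify_key(key):
    """Return the category name for a single tag key (hypothetical helper)."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"

# Illustrative keys only; they are not taken from the Philadelphia file.
sample_keys = ["highway", "addr:street", "cycleway right",
               "name_1", "FIXME", "addr:street:name"]
categories = {key: classify_key(key) for key in sample_keys}
for key in sample_keys:
    print("%-18s %s" % (key, categories[key]))
```

Keys containing digits, uppercase letters, or more than one colon match none of the three patterns, so they are counted as "other" rather than as problematic.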


### Exploring Users
Your task is to explore the data a bit more.  The first task is a fun one - find out how many unique users have contributed to the map in this particular area!

In [4]:
users = set()
for _, element in ET.iterparse("philadelphia_pennsylvania.osm"):
    uid=element.get('uid')
    if uid is not None:
        users.add(uid)
print len(users), " Total users"

2353  Total users


### Improving Street Names
Your task in this exercise has two steps:

- audit the OSMFILE and change the variable 'mapping' so that it maps each unexpected street type to the appropriate one in the expected list.

- write the update_name function, to actually fix the street name.

In [5]:
OSMFILE = "philadelphia_pennsylvania.osm"
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
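
As a quick illustration of what this expression captures, the sketch below applies it to a few street names. The `last_token` helper is a hypothetical name introduced here. Because `\b` requires a word boundary, a trailing "#1" yields just "1", which explains the numeric keys in the audit output.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

def last_token(street_name):
    """Return the trailing token matched by street_type_re, or None."""
    m = street_type_re.search(street_name)
    return m.group() if m else None

samples = ["Walnut St", "Manor Ave.", "Sansom St #1", "Route 38"]
extracted = [last_token(s) for s in samples]
for name, token in zip(samples, extracted):
    print("%-14s -> %s" % (name, token))
```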

In [6]:
#Scrape 'https://pe.usps.com/text/pub28/28apc_002.htm' to obtain list of
#primary street suffix names
with requests.Session() as session:
    response = session.get('https://pe.usps.com/text/pub28/28apc_002.htm', headers={'user-agent': 'Chrome/60.0.3112.113'})
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id='ep533076')
    expected = []
    mapping = {}
    for each in table.find_all('tr')[1:]:
        if len(each) == 6:
            text = str(each.text)
            text = text.split(" ")
            while '' in text: #remove all blank spaces created from converting unicode to str
                text.remove('')
            street_suffix_name = str(text[0]).title() 
            abbr = str(text[2]).title()
            expected.append(street_suffix_name)
            mapping[abbr] = street_suffix_name



In [7]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        formatted_street_type = street_type.title().replace(".", "") #only capitalize first letter and remove "." from abbreviations
        if formatted_street_type not in expected:
            street_types[formatted_street_type].add(street_name)
    return street_types

with open(OSMFILE, "r") as f:
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(f, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == "addr:street":
                    initial_dict = audit_street_type(street_types, tag.attrib['v'])
pprint.pprint(dict(initial_dict))

{'1': set(['Bloomfield Dr, Unit 1',
           'Route 1',
           'S Newtown Street Rd #1',
           'Sansom St #1',
           'Walnut St #1']),
 '111': set(['South Clinton Avenue Ste. 111']),
 '118': set(['Upland Ave #118']),
 '168': set(['Marlton Pike East Ste. 168']),
 '17': set(['Lancaster Avenue #17']),
 '19047': set(['200 Manor Ave. Langhorne, PA 19047',
               '2245 E. Lincoln Hwy, Langhorne, PA 19047',
               '2275 E Lincoln Hwy, Langhorne, PA 19047',
               '2300  East Lincoln Highway, Pennsylvania 19047']),
 '19067': set(['East Trenton Avenue Morrisville, PA 19067']),
 '2': set(['Buck Rd #2']),
 '205': set(['Office Center Dr #205']),
 '206': set(['US 206', 'US 70 & US 206']),
 '3': set(['Main St #3']),
 '302': set(['Route 73 North, Suite 302']),
 '312': set(['312']),
 '315': set(['Heritage Center Dr #315']),
 '33': set(['33', 'Route 33']),
 '37Th': set(['N 37th']),
 '38': set(['New Jersey 38', 'Route 38', 'State Route 38']),
 '39Th': set(['N 39th

In [8]:
print len(initial_dict)

121


In [9]:
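def update_name(name, mapping):
    # Replace only the trailing street type; str.replace() on the whole name
    # would also rewrite earlier matches (e.g. the "St" in "Stanley St").
    name_split = name.split()
    if not name_split:
        return name
    word_to_replace = name_split[-1]
    formatted_word_to_replace = word_to_replace.title().replace(".", "")
    if formatted_word_to_replace in mapping:
        stripped = name.rstrip()
        name = stripped[:len(stripped) - len(word_to_replace)] + mapping[formatted_word_to_replace]
    return name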

st_types = defaultdict(set)
for street_type, ways in street_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping)
        final_dict = audit_street_type(st_types, better_name)        
pprint.pprint(dict(final_dict))

{'1': set(['Bloomfield Dr, Unit 1',
           'Route 1',
           'S Newtown Street Rd #1',
           'Sansom St #1',
           'Walnut St #1']),
 '111': set(['South Clinton Avenue Ste. 111']),
 '118': set(['Upland Ave #118']),
 '168': set(['Marlton Pike East Ste. 168']),
 '17': set(['Lancaster Avenue #17']),
 '19047': set(['200 Manor Ave. Langhorne, PA 19047',
               '2245 E. Lincoln Hwy, Langhorne, PA 19047',
               '2275 E Lincoln Hwy, Langhorne, PA 19047',
               '2300  East Lincoln Highway, Pennsylvania 19047']),
 '19067': set(['East Trenton Avenue Morrisville, PA 19067']),
 '2': set(['Buck Rd #2']),
 '205': set(['Office Center Dr #205']),
 '206': set(['US 206', 'US 70 & US 206']),
 '3': set(['Main St #3']),
 '302': set(['Route 73 North, Suite 302']),
 '312': set(['312']),
 '315': set(['Heritage Center Dr #315']),
 '33': set(['33', 'Route 33']),
 '37Th': set(['N 37th']),
 '38': set(['New Jersey 38', 'Route 38', 'State Route 38']),
 '39Th': set(['N 39th

In [10]:
print len(final_dict)

109
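
A quick way to sanity-check abbreviation replacement is to run it on a handful of names with a small mapping. The sketch below uses a hypothetical two-entry `demo_mapping` and a hypothetical `replace_street_type` helper that rewrites only the final token. It also shows that names ending in a number are left unchanged, which is consistent with many entries remaining in the audit after the update.

```python
def replace_street_type(name, mapping):
    """Replace the last whitespace-separated token if it is a known abbreviation."""
    tokens = name.split()
    if not tokens:
        return name
    key = tokens[-1].title().replace(".", "")
    if key in mapping:
        tokens[-1] = mapping[key]
    # Note: joining with single spaces also collapses repeated whitespace.
    return " ".join(tokens)

# Hypothetical mapping for illustration; the notebook builds its own from USPS data.
demo_mapping = {"St": "Street", "Ave": "Avenue"}

fixed = [replace_street_type(n, demo_mapping)
         for n in ["Stanley St", "Manor Ave.", "Route 38"]]
print(fixed)

# For comparison, str.replace substitutes every occurrence, including substrings.
naive = "Stanley St".replace("St", "Street")
print(naive)
```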
