
#  Introduction


For this project, I downloaded an extract of the Phoenix metropolitan area, predominantly Tempe, AZ, where I am doing my Master's in CS at Arizona State University. Choosing Tempe lets me validate the data against my personal knowledge of the area. The extract was downloaded through the Overpass API using the following link:

<center> http://overpass-api.de/api/map?bbox=-112.117,33.3394,-111.865,33.4727 </center>

Below is a screenshot of the mapped region in OSM:

<img src="https://raw.githubusercontent.com/parthoiiitm/Data-Wrangling-with-OpenStreetMap/master/tempe_screenshot.png" width="500" height="500" />

# Data auditing, cleaning, and problems encountered in map

First, we want to find out which elements are present in the OSM file, how often each occurs, and which of them matter for the analysis.

In [14]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import pprint
import re
import codecs
import json

OSM_FILE = "tempeaz.osm"  

In [3]:
def count_tags(filename):
    # Count how many times each element tag occurs in the file
    dict_ = {}
    for event,elem in ET.iterparse(filename):
        if elem.tag not in dict_:
            dict_[elem.tag] = 1
        else:
            dict_[elem.tag] += 1
    return dict_


tags = count_tags(OSM_FILE)
pprint.pprint(tags)

{'bounds': 1,
 'member': 24852,
 'meta': 1,
 'nd': 324070,
 'node': 261860,
 'note': 1,
 'osm': 1,
 'relation': 549,
 'tag': 263230,
 'way': 39135}


As we can see, the most frequent elements are member, nd, node, relation, tag and way. We will audit these elements, clean them, and convert them to JSON so that they can be loaded into a MongoDB database.

Because the full file is larger than 65 MB, processing it takes time. We therefore create a smaller file, sample.osm, containing every k-th top-level element, and use it to test our shape_element function.

In [4]:
SAMPLE_FILE = "sample.osm"

k = 10 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

Next, we check whether the "k" attribute of each "< tag >" element has any issues. To do this, we divide the keys into four categories:

* "lower", for keys that contain only lowercase letters and underscores and are valid
* "lower_colon", for otherwise valid keys with one colon in their names
* "problemchars", for keys with problematic characters, and
* "other", for keys that do not fall into the other three categories.

We classify each key using regular expressions and count the occurrences per category. We then write the unique keys of each category to a separate file for later inspection.

In [5]:

# Regular expression for each category
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

lo = set()
lo_co = set()
pro_co = set()
oth = set()

def key_type(element, keys):
    if element.tag == "tag":
        # Test the key against each pattern; the first matching category wins
        low = lower.search(element.attrib['k'])
        low_col = lower_colon.search(element.attrib['k'])
        prob = problemchars.search(element.attrib['k'])
        if low:
            keys["lower"] += 1
            lo.add(element.attrib['k']) 
        elif low_col:
            keys["lower_colon"] +=1
            lo_co.add(element.attrib['k']) 
        elif prob:
            keys["problemchars"] +=1
            pro_co.add(element.attrib['k']) 
        else:
            keys["other"] +=1
            oth.add(element.attrib['k'])
        
    return keys

def write_data(data, filename):
    with open(filename, 'wb') as f:
        for x in data:
            f.write(x + "\n")

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}

    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    
    write_data(lo, 'lower.txt')
    write_data(lo_co, 'lower_colon.txt')
    write_data(pro_co, 'problem_chars.txt')
    write_data(oth, 'other.txt')
    return keys


keys = process_map(OSM_FILE)
pprint.pprint(keys)

{'lower': 163667, 'lower_colon': 95639, 'other': 3924, 'problemchars': 0}


Opening other.txt, I found several keys that contain uppercase letters (e.g. gnis:County), some that contain digits (e.g. POP2010), some that are numbered instances of the same key (e.g. tiger:name_base_1, tiger:name_base_2), and some that have more than one colon separator (e.g. service:bicycle:screwdriver).
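To see why these keys end up in "other", the short sketch below runs the same three patterns over a few of the keys mentioned above. The helper name classify_key is only for illustration and is not part of the audit code; the snippet is written so that it runs under both Python 2 and Python 3.

```python
import re

# Same three patterns as in the audit cell above
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category name for a single tag key (hypothetical helper)."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"


for key in ["highway", "addr:street", "gnis:County", "POP2010",
            "tiger:name_base_1", "service:bicycle:screwdriver"]:
    print("{0} -> {1}".format(key, classify_key(key)))
```

An uppercase letter, a digit, or a second colon is enough to fail both the lower and lower_colon patterns, and none of these characters is listed in problemchars, so such keys fall through to "other".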

We will lowercase all keys in shape_element for uniformity. Keys with more than one colon separator are infrequent, so we normalize them to at most one colon separator. Infrequent keys with a single colon separator are replaced with a single key. These renames are stored in a dictionary called **replace**, where each value is the key that replaces the existing one.

replace = {
    
    'destination:ref:to':'destination:ref_to',
    'generator:output:electricity':'output_electricity',
    'plant:output:electricity':'plant_output_electricity',
    'service:bicycle':'service_bicycle',
    'source:hgv:national_network':'hgv:national_network_source',
    'turn:lanes':'turn_lanes', 
    'turn:lanes:backward':'turn_lanes:backward',
    'turn:lanes:forward':'turn_lanes:forward',
    'turn:lanes:both_ways':'turn_lanes:both_ways', 
    'wheelchair:description':'wheelchair_description'
}

Numbered instances of the same key are grouped under a parent key using a dictionary called **group**. Each value is the parent key under which the values will be collected.

group = {  
    
    'name:vi':'name:vi',
    'alt_name:vi':'name:vi',
    'official_name:vi':'name:vi',
    
    'name':'name:name',
    'name_1':'name:name',
    'name_2':'name:name',
    'alt_name':'name:name',
    
    'note':'note:note',
    'note_2':'note:note',
    'old_ref':'old_ref',
    'old_ref2':'old_ref',
    
    
    'tiger:name_base':'tiger:name_base',               
    'tiger:name_base_1':'tiger:name_base',
    'tiger:name_base_2':'tiger:name_base',
    
    'tiger:name_direction_prefix':  'tiger:name_direction_prefix',
    'tiger:name_direction_prefix_1':  'tiger:name_direction_prefix',
    'tiger:name_direction_prefix_2':  'tiger:name_direction_prefix',
    
    'tiger:name_type' : 'tiger:name_type',
    'tiger:name_type_1' : 'tiger:name_type', 
    'tiger:name_type_2' : 'tiger:name_type',
    'tiger:name_type_4' : 'tiger:name_type',
    
    'tiger:zip_left': 'tiger:zip_left',
    'tiger:zip_left_1': 'tiger:zip_left',
    'tiger:zip_left_2': 'tiger:zip_left',
    'tiger:zip_left_3': 'tiger:zip_left',
    'tiger:zip_left_4': 'tiger:zip_left',
    
    'tiger:zip_right':'tiger:zip_right',
    'tiger:zip_right_1':'tiger:zip_right',
    'tiger:zip_right_2':'tiger:zip_right',
    'tiger:zip_right_3':'tiger:zip_right',
    'tiger:zip_right_4':'tiger:zip_right'
}

Both **replace** and **group** are used in the shape_element function.
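As a rough sketch of how the two dictionaries are meant to be applied to a raw key, consider the snippet below. It uses only small excerpts of **replace** and **group**, and the helper name normalize_key is an assumption made for illustration; the real logic lives inside shape_element.

```python
# Small excerpts of the two dictionaries, for illustration only
replace = {
    'turn:lanes:forward': 'turn_lanes:forward',
    'wheelchair:description': 'wheelchair_description',
}
group = {
    'tiger:name_base_1': 'tiger:name_base',
    'name_1': 'name:name',
}


def normalize_key(raw_key):
    """Lowercase a tag key, then apply a rename or grouping rule if one exists."""
    key = raw_key.lower()
    if key in replace:
        return replace[key]
    if key in group:
        return group[key]
    return key


for raw in ['turn:lanes:forward', 'tiger:name_base_1', 'name_1', 'gnis:County']:
    print("{0} -> {1}".format(raw, normalize_key(raw)))
```

Keys that appear in neither dictionary, such as gnis:County, are only lowercased.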

Some keys appear both on their own (in lower) and as the parent of a colon-separated key (in lower_colon). When both forms occur on the same element, e.g.

    <tag k="is_in" v="USA"/>
    <tag k="is_in:continent" v="America"/>

nesting fails, because the parent key would have to hold both a plain string and a dictionary of sub-keys. We therefore rename such single keys from < key > to < key >:< key > (e.g. is_in:is_in), so that the plain value can be nested under the parent key alongside the sub-keys. For example:

    {
    is_in:{
            is_in:USA, 
            continent:America
          }
    } 
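The renaming idea can be sketched in a few lines. This is a simplified illustration rather than the shape_element code itself: the function name nest_tags is hypothetical, and it assumes every key has already been reduced to at most one colon.

```python
def nest_tags(tags, conflict_keys):
    """Nest (key, value) pairs by the part of the key before the colon."""
    doc = {}
    for key, value in tags:
        if ':' in key:
            parent, child = key.split(':', 1)
            doc.setdefault(parent, {})[child] = value
        elif key in conflict_keys:
            # A plain key that is also used as a parent becomes <key>:<key>
            doc.setdefault(key, {})[key] = value
        else:
            doc[key] = value
    return doc


tags = [("is_in", "USA"), ("is_in:continent", "America"), ("amenity", "cafe")]
print(nest_tags(tags, conflict_keys={"is_in"}))
```

Because the parent dictionary is created on first use, the result is the same whichever of the two is_in tags comes first.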


In [7]:
lo = set()
lo_co = set()


def read_data(filename):
    with open(filename, 'rb') as f:
        lines = f.readlines()
        for l in lines:
            spl = l.strip('\n').split(":")
            if len(spl) == 2:
                #extract the parent key
                lo_co.add(spl[0])
            else:
                lo.add(spl[0])
                
            

read_data('lower.txt')
read_data('lower_colon.txt')
common = lo.intersection(lo_co)
pprint.pprint(common)

set(['alt_name',
     'bridge',
     'building',
     'capacity',
     'communication',
     'crossing',
     'cycleway',
     'destination',
     'disused',
     'flag',
     'golf',
     'height',
     'hgv',
     'internet_access',
     'is_in',
     'lanes',
     'name',
     'note',
     'old_name',
     'oneway',
     'operator',
     'parking',
     'population',
     'public_transport',
     'ref',
     'restriction',
     'source',
     'toilets',
     'traffic_signals',
     'wheelchair'])


We will use these potentially conflicting keys in shape_element.

We also need to examine the street names. Some are written in full, some are abbreviated, and some are over-abbreviated, so they are inconsistent. To deal with this, we first audit the street names and collect the variations that occur.

In [10]:
#Regular expression to find the street type
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]


def audit_street_type(street_types, street_name):
     # find street type using regex
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        
        #if street type not in expected list, we add the new street type and add the name along with it
        if street_type not in expected:
            street_types[street_type].add(street_name)


#returns true if key tag is a street name
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")


def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types


st_types = audit(OSM_FILE)
pprint.pprint(dict(st_types))

{'108': set(['W Elliot Rd #108']),
 '1400-1532': set(['N. Central Avenue, Suite 1400-1532']),
 '270': set(['East Washington Street Suite 270']),
 '900': set(['Rio Salado Parkway #900']),
 'Ave': set(['S Central Ave',
             'S Farmer Ave',
             'South Forest Ave',
             'South Longmore Ave',
             'Terrace Ave']),
 'Blvd.': set(['Apache Blvd.']),
 'Circle': set(['South Arizona Mills Circle']),
 'Dobson': set(['South Dobson']),
 'Longmore': set(['South Longmore']),
 'Mall': set(['East Lemon Mall',
              'East Orange Mall',
              'East Tyler Mall',
              'South Cady Mall',
              'South Forest Mall']),
 'Park': set(['East Gammage Park']),
 'Pkwy': set(['East Rio Salado Pkwy']),
 'South': set(['East Sky Harbor Circle South']),
 'St': set(['W 18th St']),
 'Sycamore': set(['North Sycamore']),
 'Valencia': set(['South Valencia']),
 'Way': set(['West Ikea Way'])}
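Entries such as '108' or 'Dobson' appear as "street types" above because the regular expression simply takes the trailing token of the name. The sketch below applies the same pattern to a few of the names from the audit output; the helper name street_type is only for illustration.

```python
import re

# Same pattern as in the audit cell above
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


def street_type(name):
    """Return the trailing token that the audit treats as the street type."""
    match = street_type_re.search(name)
    return match.group() if match else None


for name in ["W Elliot Rd #108", "Apache Blvd.", "South Dobson", "West Ikea Way"]:
    print("{0!r} -> {1!r}".format(name, street_type(name)))
```

Suite numbers and names with no street type at all therefore need separate handling; an abbreviation mapping alone cannot fix them.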


Based on the above dictionary, we create a **mapping** dictionary to fix the street names. The function update_name rewrites each street name so that all names follow the same convention. We test the function on our sample file.

In [88]:
mapping = {
        'Ave':  'Avenue',
        'Blvd.': 'Boulevard',
        'E':'East',
        'E.':'East',
        'N.':'North',
        'Pkwy':'Parkway',
        'Rd':'Road',
        'S':'South',
        'S.':'South',
        'St':'Street',
        'W':'West'
        }


all_street_word = set()
name_list = []

def audit_street(osmfile):
    osm_file = open(osmfile, "r")
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag in["node", "way", "relation"] :
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    better_name = update_name(tag.attrib['v'])
                    name_list.append(better_name)
                    
    osm_file.close()
    return name_list

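def update_name(name):
    # Replace whole words only. str.replace would also rewrite matching
    # substrings inside other words (e.g. the 'S' in 'S Stanley St').
    words = [mapping.get(word, word) for word in name.split()]
    return " ".join(words)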



st_name = audit_street(SAMPLE_FILE)
pprint.pprint(st_name)

['West Baseline Road',
 'West Baseline Road',
 'South Longmore',
 'West Portland Street',
 'West Portland Street',
 'West Roosevelt Street',
 'North 4th Avenue',
 'North 3rd Avenue',
 'North 4th Avenue',
 'East Adams Street',
 'West McDowell Road',
 'North 5th Avenue',
 'West Willetta Street',
 'West McDowell Road',
 'West Lynwood Street',
 'West Lynwood Street',
 'West Portland Street',
 'West Portland Street',
 'West Lynwood Street',
 'West Willetta Street',
 'West Lynwood Street',
 'West Portland Street',
 'West Portland Street',
 'West Culver Street',
 'North 5th Avenue',
 'North 5th Avenue',
 'West Willetta Street',
 'West Culver Street',
 'North 4th Avenue',
 'West Lynwood Street',
 'West Lynwood Street',
 'West Lynwood Street',
 'West Willetta Street',
 'North 24th Street',
 'East Washington Street',
 'East Washington Street',
 'East Washington Street',
 'East Washington Street',
 'North 24th Street',
 'East Adams Street',
 'North 1st Street',
 'East Washington Street',
 'East J

Postal codes also have several issues. The following code audits them and collects every value that is not a plain five-digit code.

In [120]:

#street_type_re = re.compile(r'\d{5}([ \-]\d{4})?$', re.IGNORECASE)
postal_type_re = re.compile(r'^\d{5}$', re.IGNORECASE)
postal_types = defaultdict(set)


expected = [ 'tiger:zip_left','tiger:zip_left_1','tiger:zip_left_2',
             'tiger:zip_left_3','tiger:zip_left_4', 'tiger:zip_right',
             'tiger:zip_right_1','tiger:zip_right_2','tiger:zip_right_3',
             'tiger:zip_right_4','addr:postcode']

def audit_postal_code(postal_value, elem):
    m = postal_type_re.search(postal_value)
    if not m:
        postal_types[elem.attrib['k']].add(postal_value)

def is_postal_code(elem):
    return (elem.attrib['k'] in expected)


def audit(osmfile):
    osm_file = open(osmfile, "r")
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag in["node", "way", "relation"] :
            for tag in elem.iter("tag"):
                if is_postal_code(tag):
                    audit_postal_code(tag.attrib['v'], tag)
    osm_file.close()
    return postal_types


audit(OSM_FILE)
pprint.pprint(dict(postal_types))

{'addr:postcode': set(['8',
                       '85003-1333',
                       '85003-1376',
                       '85004-1323',
                       '85004-1418',
                       '85004-1455',
                       '85004-1506',
                       '85004-1722',
                       '85004-1820',
                       '85004-1873',
                       '85004-4527',
                       '85006-3651',
                       '85006-3678',
                       '85007-1908',
                       '85007-1909',
                       '85007-2101',
                       '85007-2121',
                       '85007-2126',
                       '85007-2129',
                       '85007-2309',
                       '85007-2604',
                       '85007-2607',
                       '85007-2616',
                       '85007-3232',
                       '85008-4905',
                       '85284-1103',
                       'AZ 85007',
            

The audit shows that in tiger:zip_left and tiger:zip_right, multiple postal codes are separated by semicolons. We need to split these and store them as a list of separate postal codes. In addr:postcode, we want to drop the leading state abbreviation (as in "AZ 85281") and the four-digit extension following a hyphen (as in "85284-1103"). In addition, the first addr:postcode result, '8', is clearly incorrect and needs to be corrected. For this we use the Python library pygeocoder, which can look up the postal code for a given address.

All of this is done in a function called update_postal_code. To test it on the sample file, we iterate over the file with the audit_pin function.

In [174]:
from pygeocoder import Geocoder as gc


def is_incorrect_postal_code(postal_value, tag):
    if is_postal_code(tag):
        m = postal_type_re.search(postal_value)
        if not m:
            return True
        return False


def audit_pin(osmfile):
    osm_file = open(osmfile, "r")
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag in["node", "way", "relation"] :
            for tag in elem.iter("tag"):
                if is_incorrect_postal_code(tag.attrib['v'], tag):
                    print tag.attrib['v'], update_postal_code(tag.attrib['v'], elem), elem.attrib['id']
    osm_file.close()
    

def update_postal_code(postal_value, element):
    if len(postal_value) > 5:
        if '-' in postal_value:
            return postal_value[:5]
        
        elif 'AZ' in postal_value:
            return postal_value[3:]
        
        elif ';' in postal_value or ':' in postal_value:
            return list(set(x.strip() for x in re.split(';|:',postal_value)))
    else:
        
        # In case the postal code is smaller than 5 digits, we search using 
        # pygeocoder by inputting address
        
        home, street, city, state= "","","",""
        for tag in element.iter('tag'):
            if tag.attrib['k'] =='addr:housenumber':
                    home = tag.attrib['v'] + ", "
            elif tag.attrib['k'] =='addr:street':
                    street = tag.attrib['v'] + ", "
            elif tag.attrib['k'] =='addr:city':
                    city = tag.attrib['v'] + ", "
            elif tag.attrib['k'] =='addr:state':
                    state = tag.attrib['v']
        
        address = home+street+city+state
        return gc.geocode(address.lower()).postal_code

audit_pin(SAMPLE_FILE)

85007-2126 85007 1682771572
AZ 85008 85008 51990955
85004-4527 85004 102348550
85007-2616 85007 109835069
85007-2604 85007 109835152
85006-3678 85006 125115768
85004-1418 85004 145385690
85004-1873 85004 147227042
85017;85017;85002 ['85002', '85017'] 358494996
85009;85017;85009 ['85009', '85017'] 358494996
85262; 85254 ['85262', '85254'] 402543835
85262; 85254 ['85262', '85254'] 402543835
85262; 85254 ['85262', '85254'] 436940703
85262; 85254 ['85262', '85254'] 436940703
85262; 85254 ['85262', '85254'] 436940722
85262; 85254 ['85262', '85254'] 436940722
85262; 85254 ['85262', '85254'] 436940732
85262; 85254 ['85262', '85254'] 436940732
85262; 85254 ['85262', '85254'] 436944796
85262; 85254 ['85262', '85254'] 436944796
85017;85017;85002 ['85002', '85017'] 436970519
85009;85017;85009 ['85009', '85017'] 436970519
AZ 85042 85042 3547143
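For comparison, the string-based cases above (state prefix, hyphenated extension, semicolon-separated lists) could also be handled by a single regular expression that pulls out every five-digit group. This is an alternative sketch, not the update_postal_code function used in this project; the name extract_zip_codes is hypothetical, and it does no geocoder lookup, so a value such as '8' yields an empty list and would still need the address lookup.

```python
import re

# Exactly five digits, not part of a longer run of digits
five_digits = re.compile(r'\b\d{5}\b')


def extract_zip_codes(value):
    """Return the unique five-digit codes in value, in order of first appearance."""
    codes = []
    for code in five_digits.findall(value):
        if code not in codes:
            codes.append(code)
    return codes


for value in ["85007-2126", "AZ 85008", "85017;85017;85002", "85262; 85254", "8"]:
    print("{0!r} -> {1}".format(value, extract_zip_codes(value)))
```

Unlike update_postal_code, this variant always returns a list, which keeps the type of the stored value uniform.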


Now we wrangle the data and reshape it into the document model we built in the quiz. We use the **group** and **replace** dictionaries discussed above, together with a list called **conflict_keys** that holds the keys that may cause conflicts when nesting.

The details of an element's creation, such as user id and timestamp, are grouped under a key called **created**. The latitude and longitude attributes are stored in a **pos** array. If a tag key contains a ':', the related descriptors are aggregated under a single top-level key.

E.g.

    {
    "id": "2406124091",
    "type": "node",
    "visible":"true",
    "created": {
              "version":"2",
              "changeset":"17206049",
              "timestamp":"2013-08-03T16:43:42Z",
              "user":"linuxUser16",
              "uid":"1219059"
            },
    "pos": [41.9757030, -87.6921867],
    "addr": {
              "housenumber": "5157",
              "postcode": "60625",
              "street": "North Lincoln Ave"
            },
    "amenity": "restaurant",
    "cuisine": "mexican",
    "name": "La Cabana De Don Luis",
    "phone": "1 (773)-271-5176"
    }
    
    :
    :

This will result in a more organized structure of data.
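A minimal sketch of this shaping step is shown below. The XML snippet is a hypothetical reconstruction of part of the example document above, and the function shape_basic is a stripped-down illustration that covers only the **created**, **pos** and one-colon nesting rules, not the full shape_element function.

```python
import xml.etree.ElementTree as ET

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

# Hypothetical XML matching part of the example document above
xml_snippet = """
<node id="2406124091" lat="41.9757030" lon="-87.6921867" version="2"
      changeset="17206049" timestamp="2013-08-03T16:43:42Z"
      user="linuxUser16" uid="1219059">
  <tag k="addr:housenumber" v="5157"/>
  <tag k="addr:postcode" v="60625"/>
  <tag k="amenity" v="restaurant"/>
</node>
"""


def shape_basic(element):
    """Build the top-level fields and nest one-colon keys under their parent."""
    doc = {
        "id": element.attrib["id"],
        "type": element.tag,
        "created": {k: element.attrib[k] for k in CREATED},
        "pos": [float(element.attrib["lat"]), float(element.attrib["lon"])],
    }
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if ":" in key:
            parent, child = key.split(":", 1)
            doc.setdefault(parent, {})[child] = value
        else:
            doc[key] = value
    return doc


node = shape_basic(ET.fromstring(xml_snippet.strip()))
print(node["created"])
print(node["pos"])
print(node["addr"])
```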

For "way" elements, we encounter node references such as:

      <nd ref="305896090"/>
      <nd ref="1719825889"/>

They are grouped under node_refs as follows:

    "node_refs": ["305896090", "1719825889"]

Member elements appear inside relations. I observed that, within a relation, members commonly share the same **type** and **role** and differ only in their ref. I therefore group the member references into lists by type and role under a **member** dictionary. Inside this dictionary, each key is a member type, and its value is a list of dictionaries, each holding a role and its member_refs.

E.g. 

    <member type="way" ref="30650245" role="outer"/>
    <member type="way" ref="30650242" role="outer"/>
    <member type="way" ref="30420603" role="inner"/>
    <member type="way" ref="30420613" role="inner"/>

will be transformed into

    "member": {
        "way": [
          {
            "member_refs": [
              "30650245",
              "30650242"
            ],
            "role": "outer"
          },
          {
            "member_refs": [
              "30420603",
              "30420613"
            ],
            "role": "inner"
          }
        ]
    }
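The grouping can be sketched compactly as follows. This is a simplified illustration of the idea rather than the exact code in shape_element: the function name group_members and the relation id are assumptions, while the 'NA' placeholder for blank roles follows the convention used in this project.

```python
import xml.etree.ElementTree as ET

# The four members from the example above, wrapped in a hypothetical relation
relation_xml = """<relation id="1">
  <member type="way" ref="30650245" role="outer"/>
  <member type="way" ref="30650242" role="outer"/>
  <member type="way" ref="30420603" role="inner"/>
  <member type="way" ref="30420613" role="inner"/>
</relation>"""


def group_members(relation):
    """Group member refs by type, then by role (blank roles become 'NA')."""
    members = {}
    for m in relation.iter("member"):
        role = m.attrib["role"] or "NA"
        entries = members.setdefault(m.attrib["type"], [])
        for entry in entries:
            if entry["role"] == role:
                entry["member_refs"].append(m.attrib["ref"])
                break
        else:
            # No entry for this role yet, so start a new one
            entries.append({"role": role, "member_refs": [m.attrib["ref"]]})
    return members


print(group_members(ET.fromstring(relation_xml)))
```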

In [170]:
#List of single keys that can cause a conflict while nesting

conflict_keys = [
                 'building',
                 'capacity',
                 'crossing',
                 'cycleway',
                 'destination',
                 'height',
                 'hgv',
                 'iso3166-1',
                 'internet_access',
                 'is_in',
                 'lanes',
                 'old_name',
                 'oneway',
                 'operator',
                 'population',
                 'ref',
                 'source',
                 'traffic_signals'
                 #'bridge',         # Commented-out keys don't cause any conflict, so there is
                 #'communication',  # no need to nest them unnecessarily
                 #'disused',
                 #'flag',
                 #'golf',
                 #'name',
                 #'note',
                 #'parking',
                 #'public_transport',
                 #'restriction',
                 #'toilets',
                 #'turn_lanes',
                 #'wheelchair'
                ]

#Dict to rename the key
replace = {
    
    'destination:ref:to':'destination:ref_to',
    'generator:output:electricity':'output_electricity',
    'plant:output:electricity':'plant_output_electricity',
    'service:bicycle':'service_bicycle',
    'source:hgv:national_network':'hgv:national_network_source',
    'turn:lanes':'turn_lanes',
    'turn:lanes:backward':'turn_lanes:backward',
    'turn:lanes:forward':'turn_lanes:forward',
    'turn:lanes:both_ways':'turn_lanes:both_ways', 
    'wheelchair:description':'wheelchair_description'
}

#Dict to group in common element
group = {  
    
    'name:vi':'name:vi',
    'alt_name:vi':'name:vi',
    'official_name:vi':'name:vi',
    
    'name':'name:name',
    'name_1':'name:name',
    'name_2':'name:name',
    'alt_name':'name:name',
    
    'note':'note:note',
    'note_2':'note:note',
    'old_ref':'old_ref',
    'old_ref2':'old_ref',
    
    
    'tiger:name_base':'tiger:name_base',               
    'tiger:name_base_1':'tiger:name_base',
    'tiger:name_base_2':'tiger:name_base',
    'tiger:name_base_3':'tiger:name_base',
    'tiger:name_base_4':'tiger:name_base',
    
    'tiger:name_direction_prefix':  'tiger:name_direction_prefix',
    'tiger:name_direction_prefix_1':  'tiger:name_direction_prefix',
    'tiger:name_direction_prefix_2':  'tiger:name_direction_prefix',
    
    'tiger:name_type' : 'tiger:name_type',
    'tiger:name_type_1' : 'tiger:name_type', 
    'tiger:name_type_2' : 'tiger:name_type',
    'tiger:name_type_3' : 'tiger:name_type',
    'tiger:name_type_4' : 'tiger:name_type',
    
    'tiger:zip_left': 'tiger:zip_left',
    'tiger:zip_left_1': 'tiger:zip_left',
    'tiger:zip_left_2': 'tiger:zip_left',
    'tiger:zip_left_3': 'tiger:zip_left',
    'tiger:zip_left_4': 'tiger:zip_left',
    
    'tiger:zip_right':'tiger:zip_right',
    'tiger:zip_right_1':'tiger:zip_right',
    'tiger:zip_right_2':'tiger:zip_right',
    'tiger:zip_right_3':'tiger:zip_right',
    'tiger:zip_right_4':'tiger:zip_right'
}



CREATED = [ "version", "changeset", "timestamp", "user", "uid"]


def shape_element(element):
    node = {}
    if element.tag in ['node','way','relation']:
        node['id'] = element.attrib['id']
        node['tag_type'] = element.tag
        node['created'] = {k: element.attrib[k] for k in CREATED}
        if element.tag == 'node':
            node['pos'] = [float(element.attrib['lat']), float(element.attrib['lon'])]

        for tag in element.iter('tag'):
            
            #Lowercase all the keys
            attribute_string =  tag.attrib['k'].lower()
            
            #Dictionary key formation to rename the single key which can cause conflict while nesting
            if attribute_string in conflict_keys:
                if attribute_string not in node:
                    node[attribute_string] = {attribute_string : tag.attrib['v']}
                    
                else:
                    node[attribute_string].update({attribute_string : tag.attrib['v']})
            
        
            
            # Rename the existing key in dictionary 
            elif attribute_string in replace:
                replaced_key = replace[attribute_string]
                replaced_key_list = replaced_key.split(':')
                replaced_key_length = len(replaced_key_list)
                
                # if key is separated by ':'
                if replaced_key_length == 2:
                    # nest the value under the sub-key (second part of the renamed key)
                    if replaced_key_list[0] not in node:
                        node[replaced_key_list[0]] = { replaced_key_list[1] : tag.attrib['v'] }
                    else:
                        node[replaced_key_list[0]].update({ replaced_key_list[1] : tag.attrib['v'] })
                
                # if key is a single word
                elif replaced_key_list[0] not in node:
                        node[replaced_key_list[0]] = tag.attrib['v']
                
                        
            # Group values in a common parent key  
            elif attribute_string in group:
                group_list = group[attribute_string].split(":")
                
                # if key is separated by ':'
                if len(group_list) == 2:
                    if group_list[0] not in node:
                        node[group_list[0]] = { group_list[1]:[tag.attrib['v']] }
                    elif group_list[1] not in node[group_list[0]]:
                        node[group_list[0]][group_list[1]] = [tag.attrib['v']] 
                    else:
                        node[group_list[0]][group_list[1]].append(tag.attrib['v'])
                    
                    
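                    #correct postal codes for tiger:zip_left or tiger:zip_right
                    #(is_incorrect_postal_code is the check defined in the postal code audit above)
                    if is_incorrect_postal_code(tag.attrib['v'], tag):
                        fixed = update_postal_code(tag.attrib['v'], element)
                        # update_postal_code returns a list for ';'-separated values and a
                        # string otherwise; wrap strings so set() below sees whole codes
                        if not isinstance(fixed, list):
                            fixed = [fixed]
                        node[group_list[0]][ group_list[1] ] = fixed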
                        
                    
                    #group unique elements
                    node[group_list[0]][group_list[1]] = list(set(node[group_list[0]][group_list[1]]))

                # if key is a single word
                else:
                    if group_list[0] not in node:
                        node[group_list[0]] = [tag.attrib['v']]
                    else:
                        node[group_list[0]].append(tag.attrib['v'])
                    
                    #group unique elements
                    node[group_list[0]] = list(set(node[group_list[0]]))
            
            
            #Handle all other cases
            else:
                other_list = attribute_string.split(":")
                if len(other_list) == 2:
                    if other_list[0] not in node:
                        node[other_list[0]] = {other_list[1]:tag.attrib['v']}
                    else:
                        node[other_list[0]].update({other_list[1]:tag.attrib['v']})
                    
                    # Correct the street names 
                    if attribute_string == "addr:street" :
                        node[other_list[0]][ other_list[1] ] = update_name(tag.attrib['v'])
                    
                    # Correct the postal code in case of addr:postcode
                    # (is_incorrect_postal_code is the check defined in the postal code audit above)
                    if is_incorrect_postal_code(tag.attrib['v'], tag):
                            node[other_list[0]][ other_list[1] ] = update_postal_code(tag.attrib['v'], element)
                
                else:
                    node[other_list[0]] = tag.attrib['v']
        
        # Group all node reference ids belonging to a node in a list
        if element.tag == 'way':
            node["node_refs"] = []
            for tag in element.iter('nd'):
                node['node_refs'].append(tag.attrib['ref'])
        
        # Grouping of member tag in the node as described above
        if element.tag == 'relation':
            node['member'] = {}
            
            for tag in element.iter('member'):
                # In case the member type is not yet in the member dictionary, we create an entry
                if tag.attrib['type'] not in node['member']:
                    # in case the role is blank, we will group under 'type' key 
                    # of member with role as 'NA', meaning Not Available
                    if tag.attrib['role']=="":
                        node['member'].update({tag.attrib['type']:[{'role':'NA',
                                                                   'member_refs':[tag.attrib['ref']]}]})
                    else:
                        #Otherwise we group with respective role, and also group the member references in list
                        node['member'].update({tag.attrib['type']:[{'role': tag.attrib['role'], 
                                                               'member_refs':[tag.attrib['ref']]}]})
                else:
                    if tag.attrib['role']=="":
                        for i,item in enumerate(node['member'][tag.attrib['type']]):
                            if item['role']=='NA':
                                item['member_refs'].append(tag.attrib['ref'])
                                break
                    else:
                        # We need to push the member reference in the appropriate dictionary among
                        # the list of dictionaries, according to the role
                        if not any(d.get('role', None) == tag.attrib['role'] for d in node['member'][tag.attrib['type']]):
                            node['member'][tag.attrib['type']].append({'role':tag.attrib['role'],
                                                                       'member_refs':[tag.attrib['ref']]})
                        else:
                            for i,item in enumerate(node['member'][tag.attrib['type']]):
                                if item['role']==tag.attrib['role']:
                                    item['member_refs'].append(tag.attrib['ref'])
                                    break


        #remove keys with empty values
        node = {k: v for k, v in node.items() if v}
        
        return node
    else:
        return None


def process_map(file_in, pretty = False):
    # Shape each element and write one JSON document per line
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data


#data = process_map('sample.osm', False)
data = process_map('tempe.osm', False)
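
To make the `member` grouping above concrete, here is a minimal standalone sketch of the same idea applied to plain dictionaries instead of XML elements. The helper name `group_members` and the sample values are made up for illustration; this is not the project's `shape_element` code.

```python
def group_members(members):
    """Group member refs by type, then by role (blank roles become 'NA')."""
    grouped = {}
    for m in members:
        role = m['role'] or 'NA'
        groups = grouped.setdefault(m['type'], [])
        for item in groups:
            if item['role'] == role:
                # Role already seen for this type: append the reference
                item['member_refs'].append(m['ref'])
                break
        else:
            # First member with this role: start a new group
            groups.append({'role': role, 'member_refs': [m['ref']]})
    return grouped


# Hypothetical members of a single relation
sample = [
    {'type': 'way', 'role': 'outer', 'ref': '101'},
    {'type': 'way', 'role': 'inner', 'ref': '102'},
    {'type': 'way', 'role': 'outer', 'ref': '103'},
    {'type': 'node', 'role': '', 'ref': '7'},
]
grouped = group_members(sample)
```

Here the two `outer` ways end up in one group, the `inner` way in another, and the node with a blank role is filed under `NA`.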

#  Overview of the Data

After converting the OSM data to JSON, we want to store it in MongoDB so that we can run queries and get some insight into the data. I imported the JSON data into a MongoDB database called **tempeosm**, as shown below:

<img src="https://raw.githubusercontent.com/parthoiiitm/Data-Wrangling-with-OpenStreetMap/master/db_import.png" width="700" height="700" />

In [171]:
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

db = get_db('tempeosm')
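
The import into **tempeosm** is only shown as a screenshot. As a rough sketch, the line-delimited JSON file written by `process_map` could also be loaded from Python in batches. The names `read_json_batches` and `import_json_lines` and the batch size are assumptions for illustration; `insert_many` is PyMongo's bulk insert method on a collection.

```python
import json


def read_json_batches(path, batch_size=1000):
    """Yield lists of documents from a file with one JSON object per line."""
    batch = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # last, possibly smaller, batch


def import_json_lines(collection, path, batch_size=1000):
    """Insert every document from path into collection; return the count."""
    total = 0
    for batch in read_json_batches(path, batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total
```

With a local MongoDB server running, a call such as `import_json_lines(db.tempeosm, 'tempe.osm.json')` would load the file into the `tempeosm` collection.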

Now, let's look at some basic statistics about the dataset, along with the MongoDB queries used to gather them.
 
### File Size

* tempe.osm ......... 66 MB
* tempe.osm.json .... 69 MB

### Number of documents

In [259]:
doc_length = db.tempeosm.find().count()
doc_length

300751

### Number of nodes

In [175]:
db.tempeosm.find({'tag_type':'node'}).count()

261354

### Number of ways

In [176]:
db.tempeosm.find({'tag_type':'way'}).count()

38851

### Number of relations

In [177]:
db.tempeosm.find({'tag_type':'relation'}).count()

546
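
The three counts above can be cross-checked against the document total, and they could also be obtained in one pass with a `$group` stage. The sketch below is an alternative formulation, not the queries used above; `make_tag_type_pipeline` is a name chosen here for illustration.

```python
def make_tag_type_pipeline():
    # Count documents per tag_type in a single aggregation
    return [{"$group": {"_id": "$tag_type", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}}]


# Counts reported by the find().count() queries above
reported = {"node": 261354, "way": 38851, "relation": 546}

# Nodes, ways and relations together account for every document
total = sum(reported.values())
print("total documents: {0}".format(total))  # 300751
```

Running `db.tempeosm.aggregate(make_tag_type_pipeline())` would return one document per tag type, and the sum matches the 300751 documents counted earlier.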

### Number of unique users

In [353]:
print len(db.tempeosm.distinct("created.user"))

367


### Top contributing user

In [338]:
def make_contributor_pipeline():
    
    pipeline = [{"$group":{"_id":"$created.user", 
                            "count":{"$sum":1}}}, 
                       {"$sort":{"count":-1}}, 
                       {"$limit":1}]
    return pipeline

def top_results(db, pipeline):
    # Run the aggregation and return all result documents as a list
    return list(db.tempeosm.aggregate(pipeline))


pipeline = make_contributor_pipeline()
result = top_results(db, pipeline)
for i in result:
    print i['_id'], i['count']

Dr Kludge 169093


### Top 5 postal codes in the Phoenix metropolitan region

In [281]:
def make_postal_code_pipeline():
    
    pipeline = [{"$match":{"addr.postcode":{"$exists":1}}}, 
                       {"$group":{"_id":"$addr.postcode",
                                  "count":{"$sum":1}}}, 
                       {"$sort":{"count":-1}},
                       {"$limit":5}]

    return pipeline

pipeline = make_postal_code_pipeline()
result = top_results(db, pipeline)

for i in result:
    print i['_id'], i['count']

85006 207
85003 112
85004 87
85281 85
85201 50


# Additional ideas about the dataset

Here are some user contribution statistics, as percentages of all documents [1]:

In [271]:
def make_topuser_pipeline():
    
    pipeline = [ 
                    {"$group":{"_id":"$created.user",
                               "count":{"$sum":1}}},
                    {"$project":{"count":1,
                                 "percentage":{"$multiply":[{"$divide":[100,doc_length]},"$count"]}}
                    },
                    {"$sort":{"count":-1}}, 
                    {"$limit":5}
                ]
        
    return pipeline

pipeline = make_topuser_pipeline()
result = top_results(db, pipeline)

for i in result:
    print  i

{u'count': 169093, u'percentage': 56.22358695399184, u'_id': u'Dr Kludge'}
{u'count': 31392, u'percentage': 10.437870530771303, u'_id': u'TheDutchMan13'}
{u'count': 11832, u'percentage': 3.9341515073931594, u'_id': u'jfuredy'}
{u'count': 7695, u'percentage': 2.558594983890328, u'_id': u'dwh1985'}
{u'count': 6255, u'percentage': 2.0797935833962318, u'_id': u'woodpeck_fixbot'}


Top user contribution percentage ("Dr Kludge"): 56.22%

Top 5 users' contributions:

* Dr Kludge: 169093 (~56.22%)
* TheDutchMan13: 31392 (~10.44%)
* jfuredy: 11832 (~3.93%)
* dwh1985: 7695 (~2.56%)
* woodpeck_fixbot: 6255 (~2.08%)

Combined top 5 users' contribution percentage: 75.23%
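
As a quick check of the figures above, the percentages can be recomputed directly from the counts and the document total (300751). This is plain arithmetic on the reported numbers, with no database access:

```python
doc_length = 300751  # total number of documents reported earlier

top_users = [
    ("Dr Kludge", 169093),
    ("TheDutchMan13", 31392),
    ("jfuredy", 11832),
    ("dwh1985", 7695),
    ("woodpeck_fixbot", 6255),
]

# Share of all documents contributed by each user, in percent
shares = [(user, 100.0 * count / doc_length) for user, count in top_users]
for user, share in shares:
    print("{0}: {1:.2f}%".format(user, share))

combined = sum(share for _, share in shares)
print("combined: {0:.2f}%".format(combined))  # 75.23%
```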

<br><br>

**Let's explore some more aspects of the data:**

As I bike to ASU daily, the following query is one of the most interesting to me.

### Top 3 surface types of designated bicycle roads

In [287]:
# Matching on the value already implies that the "bicycle" field exists
pipeline = [{"$match":{"bicycle":"designated"}},
            {"$group":{"_id":"$surface", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit":3}]

result = top_results(db, pipeline)
for i in result:
    print  i


{u'count': 367, u'_id': u'asphalt'}
{u'count': 15, u'_id': u'concrete'}
{u'count': 12, u'_id': u'paved'}


As Tempe's population consists mostly of students, let's see what the top amenities in Tempe are. [1]

### Top 5 amenities in Tempe 

In [311]:
pipeline = [{"$match":{"amenity":{"$exists":1}}}, 
                {"$group":{"_id":"$amenity",
                            "count":{"$sum":1}}},
                {"$sort":{"count":-1}}, 
                {"$limit":5}]


result = top_results(db, pipeline)

for i in result:
    print i['_id']

parking
restaurant
fast_food
school
place_of_worship


Yup, these amenities are in line with the basic needs of student life.

We students love to eat a lot of fast food as well, so let's look at the following query. [2]

### 3 most popular fast food cuisines in Tempe

In [273]:
def make_cuisine_pipeline():
    
    pipeline = [{"$match":{"amenity":"fast_food", 
                            "cuisine":{"$exists":1}}}, 
                       {"$group":{"_id":"$cuisine", 
                            "count":{"$sum":1}}}, 
                       {"$sort":{"count":-1}}, 
                       {"$limit":3}]
    return pipeline

pipeline = make_cuisine_pipeline()
result = top_results(db, pipeline)

for i in result:
    print i['_id'], i['count']

burger 31
sandwich 10
mexican 9
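
The postal code, amenity and cuisine queries all follow the same match, group, sort, limit shape. A hypothetical helper could build them from a field name; the name `make_top_values_pipeline` and its parameters are not part of the original code.

```python
def make_top_values_pipeline(field, limit, extra_match=None):
    """Build a pipeline returning the `limit` most common values of `field`."""
    match = {field: {"$exists": 1}}
    if extra_match:
        match.update(extra_match)  # e.g. restrict to one amenity
    return [{"$match": match},
            {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": limit}]


# Same stages as make_cuisine_pipeline() above
cuisine_pipeline = make_top_values_pipeline(
    "cuisine", 3, extra_match={"amenity": "fast_food"})
```

The result can be passed to `top_results(db, cuisine_pipeline)` in the same way as the hand-written pipelines.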


# Conclusion

* Interestingly, bots are among the major contributors of geotagging data; for example, woodpeck_fixbot is one of the top 5 contributors.

* The data has only been categorized on a subset of the Phoenix data. The auditing functions and dictionaries (e.g., group, replace) built from this subset might not be enough to correctly categorize and clean every kind of problem in the full Phoenix data (~600 MB).

* A good amount of positional data is available in the OSM file. Using more features of pygeocoder and other libraries, the data could probably be cleaned even further.

# References

[1] Matthew Banbury, "OpenStreetMap Sample Project Data Wrangling with MongoDB".
    https://docs.google.com/document/d/1F0Vs14oNEs2idFJR3C_OPxwS6L0HPliOii-QpbmrMo4/pub

[2] Miadad Rashid, "Exploratory Analysis of Open Street Map Data".
    http://www.slideshare.net/MiadadRashid/project-48464910?ref=https://www.linkedin.com/

<br>

# APPENDIX : Python code for Case Study quizzes

In [None]:
#Quiz 1: Iterative Parsing

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Your task is to use the iterative parsing to process the map file and
find out not only what tags are there, but also how many, to get the
feeling on how much of which data you can expect to have in the map.
Fill out the count_tags function. It should return a dictionary with the 
tag name as the key and number of times this tag can be encountered in 
the map as value.

Note that your code will be tested with a different data file than the 'example.osm'
"""
import xml.etree.cElementTree as ET
import pprint

def count_tags(filename):
    # YOUR CODE HERE
    dict_ = {}
    for event,elem in ET.iterparse(filename):
        if elem.tag not in dict_:
            dict_[elem.tag] = 1
        else:
            dict_[elem.tag] += 1
    return dict_


def test():

    tags = count_tags('example.osm')
    pprint.pprint(tags)
    assert tags == {'bounds': 1,
                     'member': 3,
                     'nd': 4,
                     'node': 20,
                     'osm': 1,
                     'relation': 1,
                     'tag': 7,
                     'way': 1}

    

if __name__ == "__main__":
    test()

### Quiz 2: Data Model

<img src="https://raw.githubusercontent.com/parthoiiitm/Data-Wrangling-with-OpenStreetMap/master/quiz.png" width="500" height="500" />

In [None]:
#QUIZ 3: Tag Types

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.cElementTree as ET
import pprint
import re
"""
Your task is to explore the data a bit more.
Before you process the data and add it into your database, you should check the
"k" value for each "<tag>" and see if there are any potential problems.

We have provided you with 3 regular expressions to check for certain patterns
in the tags. As we saw in the quiz earlier, we would like to change the data
model and expand the "addr:street" type of keys to a dictionary like this:
{"address": {"street": "Some value"}}
So, we have to see if we have such tags, and if we have any tags with
problematic characters.

Please complete the function 'key_type', such that we have a count of each of
four tag categories in a dictionary:
  "lower", for tags that contain only lowercase letters and are valid,
  "lower_colon", for otherwise valid tags with a colon in their names,
  "problemchars", for tags with problematic characters, and
  "other", for other tags that do not fall into the other three categories.
See the 'process_map' and 'test' functions for examples of the expected format.
"""


lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        # YOUR CODE HERE
        low = lower.search(element.attrib['k'])
        low_col = lower_colon.search(element.attrib['k'])
        prob = problemchars.search(element.attrib['k'])
        if low:
            keys["lower"] += 1
        elif low_col:
            keys["lower_colon"] +=1
        elif prob:
            keys["problemchars"] +=1
        else:
            keys["other"] +=1
        
    return keys




def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys



def test():
    # You can use another testfile 'map.osm' to look at your solution
    # Note that the assertion below will be incorrect then.
    # Note as well that the test function here is only used in the Test Run;
    # when you submit, your code will be checked against a different dataset.
    keys = process_map('example.osm')
    pprint.pprint(keys)
    assert keys == {'lower': 5, 'lower_colon': 0, 'other': 1, 'problemchars': 1}


if __name__ == "__main__":
    test()
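
To illustrate how the three regular expressions partition tag keys, here is a small sketch that classifies a handful of sample keys in the same order that `key_type` checks them. The helper name `classify_key` and the sample keys are chosen for illustration only.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category that key_type would count this key under."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"


# Sample keys chosen for illustration
samples = ["highway", "addr:street", "addr:street:name", "step.count", "FIXME"]
categories = {key: classify_key(key) for key in samples}
```

Note that a key with two colons or with uppercase letters matches none of the three patterns and therefore falls into "other".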

In [None]:
#QUIZ 4: Exploring Users

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.cElementTree as ET
import pprint
import re
"""
Your task is to explore the data a bit more.
The first task is a fun one - find out how many unique users
have contributed to the map in this particular area!

The function process_map should return a set of unique user IDs ("uid")
"""

def get_user(element):
    # Return the element's "uid" attribute, or None if it has none
    return element.attrib.get('uid')


def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        uid = get_user(element)
        if uid is not None:
            users.add(uid)

    return users


def test():

    users = process_map('example.osm')
    pprint.pprint(users)
    assert len(users) == 6



if __name__ == "__main__":
    test()

In [None]:
# QUIZ 5: Improving Street Names

"""
Your task in this exercise has two steps:

- audit the OSMFILE and change the variable 'mapping' to reflect the changes needed to fix 
    the unexpected street types to the appropriate ones in the expected list.
    You have to add mappings only for the actual problems you find in this OSMFILE,
    not a generalized solution, since that may and will depend on the particular area you are auditing.
- write the update_name function, to actually fix the street name.
    The function takes a string with street name as an argument and should return the fixed name
    We have provided a simple test so that you see what exactly is expected
"""
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint

OSMFILE = "example.osm"
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

# UPDATE THIS VARIABLE
mapping = { "St": "Street",
            "St.": "Street",
            "Ave":"Avenue",
            "Rd.":"Road"
            }


def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)


def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")


def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types


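def update_name(name, mapping):

    # Replace only the trailing street type; names whose type is not in
    # the mapping are returned unchanged
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name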


def test():
    st_types = audit(OSMFILE)
    assert len(st_types) == 3
    pprint.pprint(dict(st_types))

    for st_type, ways in st_types.iteritems():
        for name in ways:
            better_name = update_name(name, mapping)
            print name, "=>", better_name
            if name == "West Lexington St.":
                assert better_name == "West Lexington Street"
            if name == "Baldwin Rd.":
                assert better_name == "Baldwin Road"


if __name__ == '__main__':
    test()
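
As a worked example of the street type pattern, the sketch below applies the same regular expression and mapping to a few sample names. A name such as "St. Louis St." shows why only the trailing match should be rewritten. The helper name `fix_street_type` and the sample names are assumptions made for illustration.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
mapping = {"St": "Street", "St.": "Street", "Ave": "Avenue", "Rd.": "Road"}


def fix_street_type(name, mapping):
    """Replace only the trailing street type; leave unmapped names unchanged."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        return name[:m.start()] + mapping[m.group()]
    return name


# Sample names chosen for illustration
names = ["West Lexington St.", "St. Louis St.", "Baldwin Rd.", "Main Plaza"]
fixed = [fix_street_type(n, mapping) for n in names]
```

The leading "St." in "St. Louis St." is left alone, and "Main Plaza" is returned as-is because "Plaza" has no entry in the mapping.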

In [None]:
# QUIZ 6: Preparing for database

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
"""
Your task is to wrangle the data and transform the shape of the data
into the model we mentioned earlier. The output should be a list of dictionaries
that look like this:

{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}

You have to complete the function 'shape_element'.
We have provided a function that will parse the map file, and call the function with the element
as an argument. You should return a dictionary, containing the shaped data for that element.
We have also provided a way to save the data in a file, so that you could use
mongoimport later on to import the shaped data into MongoDB. 

Note that in this exercise we do not use the 'update street name' procedures
you worked on in the previous exercise. If you are using this code in your final
project, you are strongly encouraged to use the code from previous exercise to 
update the street names before you save them to JSON. 

In particular the following things should be done:
- you should process only 2 types of top level tags: "node" and "way"
- all attributes of "node" and "way" should be turned into regular key/value pairs, except:
    - attributes in the CREATED array should be added under a key "created"
    - attributes for latitude and longitude should be added to a "pos" array,
      for use in geospatial indexing. Make sure the values inside "pos" array are floats
      and not strings. 
- if the second level tag "k" value contains problematic characters, it should be ignored
- if the second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
- if the second level tag "k" value does not start with "addr:", but contains ":", you can
  process it in a way that you feel is best. For example, you might split it into a two-level
  dictionary like with "addr:", or otherwise convert the ":" to create a valid key.
- if there is a second ":" that separates the type/direction of a street,
  the tag should be ignored, for example:

<tag k="addr:housenumber" v="5158"/>
<tag k="addr:street" v="North Lincoln Avenue"/>
<tag k="addr:street:name" v="Lincoln"/>
<tag k="addr:street:prefix" v="North"/>
<tag k="addr:street:type" v="Avenue"/>
<tag k="amenity" v="pharmacy"/>

  should be turned into:

{...
"address": {
    "housenumber": "5158",
    "street": "North Lincoln Avenue"
},
"amenity": "pharmacy",
...
}

- for "way" specifically:

  <nd ref="305896090"/>
  <nd ref="1719825889"/>

should be turned into
"node_refs": ["305896090", "1719825889"]
"""


lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]


def shape_element(element):
    node = {}
    if element.tag == "node" or element.tag == "way" :
        node['id'] = element.attrib['id']
        node['type'] = element.tag
        node['created'] = {k: element.attrib[k] for k in CREATED}
        if 'visible' in element.attrib:
            node['visible'] = element.attrib['visible']
        if element.tag != "way":
            node['pos'] = [float(element.attrib['lat']), float(element.attrib['lon'])]
        node['address'] = {}
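        for tag in element.iter('tag'):
            key = tag.attrib['k']
            # Ignore keys that contain problematic characters
            if problemchars.search(key):
                continue
            attribute_string = key.split(':')
            length = len(attribute_string)
            if length > 2:
                # Ignore keys with a second ':' such as addr:street:name
                pass
            elif length == 2:
                if attribute_string[0] == 'addr':
                    node['address'].update({attribute_string[1]:tag.attrib['v']})
                else:
                    # Nest other prefix:suffix keys under the prefix, unless the
                    # prefix already holds a plain (non-dictionary) value
                    sub = node.setdefault(attribute_string[0], {})
                    if isinstance(sub, dict):
                        sub[attribute_string[1]] = tag.attrib['v']
            else:
                node[attribute_string[0]] = tag.attrib['v']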
        
        if element.tag == 'way':
            node["node_refs"] = []
            for tag in element.iter('nd'):
                node['node_refs'].append(tag.attrib['ref'])
        node = {k: v for k, v in node.items() if v}
        
        return node
    else:
        return None

def process_map(file_in, pretty = False):
    # You do not need to change this file
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

def test():
    # NOTE: if you are running this code on your computer, with a larger dataset, 
    # call the process_map procedure with pretty=False. The pretty=True option adds 
    # additional spaces to the output, making it significantly larger.
    data = process_map('example.osm', True)
    #pprint.pprint(data)
    
    correct_first_elem = {
        "id": "261114295", 
        "visible": "true", 
        "type": "node", 
        "pos": [41.9730791, -87.6866303], 
        "created": {
            "changeset": "11129782", 
            "user": "bbmiller", 
            "version": "7", 
            "uid": "451048", 
            "timestamp": "2012-03-28T18:31:23Z"
        }
    }
    assert data[0] == correct_first_elem
    assert data[-1]["address"] == {
                                    "street": "West Lexington St.", 
                                    "housenumber": "1412"
                                      }
    assert data[-1]["node_refs"] == [ "2199822281", "2199822390",  "2199822392", "2199822369", 
                                    "2199822370", "2199822284", "2199822281"]

if __name__ == "__main__":
    test()
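
To see the `addr:` handling in isolation, the sketch below shapes the tags of a single inline element modelled on the example in the quiz description. The helper name `shape_tags` and the sample XML are made up for illustration, and other `prefix:suffix` keys are simply skipped to keep it short. It imports `xml.etree.ElementTree` rather than `cElementTree`, which was removed in Python 3.9.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<node id="1">
  <tag k="addr:housenumber" v="5158"/>
  <tag k="addr:street" v="North Lincoln Avenue"/>
  <tag k="addr:street:name" v="Lincoln"/>
  <tag k="amenity" v="pharmacy"/>
</node>
"""


def shape_tags(element):
    """Fold <tag> children into a dict, nesting addr:* keys under 'address'."""
    shaped = {}
    for tag in element.iter('tag'):
        parts = tag.attrib['k'].split(':')
        if len(parts) > 2:
            continue  # keys with a second ':' (addr:street:name) are ignored
        if len(parts) == 2 and parts[0] == 'addr':
            shaped.setdefault('address', {})[parts[1]] = tag.attrib['v']
        elif len(parts) == 1:
            shaped[parts[0]] = tag.attrib['v']
        # other prefix:suffix keys are skipped in this sketch
    return shaped


shaped = shape_tags(ET.fromstring(SAMPLE.strip()))
```

The result keeps the house number and street under `address`, drops `addr:street:name`, and stores `amenity` as a top-level key.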