Project Summary: In this project, I apply data wrangling techniques to clean OpenStreetMap data, assessing its quality for validity, accuracy, completeness, consistency and uniformity. I then convert the dataset from XML to CSV, import the cleaned .csv files into a SQL database, and run SQL queries to provide a statistical overview of the dataset. Finally, I give some additional suggestions for improving and analyzing the data.
Map Area: Stockholm
Split the OSM file into a smaller sample (SAMPLE_FILE). The original file (Stockholm) is 2 GB. Challenges: I had to activate Python 2 via source activate py2 to be able to run the following code. I started with k=30.

In [1]:
import xml.etree.ElementTree as ET  # Use cElementTree or lxml if too slow

OSM_FILE = "stockholm_sweden.osm"  # Replace this with your osm file
SAMPLE_FILE = "sample.osm"

k = 10 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

Parse the dataset with iterative parsing and count how often each tag type occurs.

In [2]:
import xml.etree.cElementTree as ET
import pprint
from collections import defaultdict
import collections

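def count_tags(filename):
    """Count how many times each tag name occurs in the file."""
    tag_counts = defaultdict(int)
    # iterparse yields (event, element) pairs; by default only 'end' events
    for _, elem in ET.iterparse(filename):
        tag_counts[elem.tag] += 1
    return dict(tag_counts)
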
def test():

    tags = count_tags(SAMPLE_FILE)
    pprint.pprint(tags)

if __name__ == "__main__":
    test()

{'member': 19825,
 'nd': 745157,
 'node': 610544,
 'osm': 1,
 'relation': 1012,
 'tag': 216022,
 'way': 69782}


Number of unique users who contributed to the map in this particular area:

In [3]:
def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        if "uid" in element.attrib:
            users.add(element.get('uid'))

    return users

def test():

    users = process_map(SAMPLE_FILE)
    pprint.pprint(len(users))
#    assert len(users) == 6

if __name__ == "__main__":
    test()

1883


# Auditing
One of the usual problems in OpenStreetMap datasets is inconsistent street name abbreviation. I did not find any such problems just by looking at the OSM file, so here I try to find them with code.
1-Build a regular expression that matches the last element in the string, where the street type usually sits.
2-Then, based on the street abbreviations found, create a mapping of the names that need to be cleaned.

I tried all sorts of changes in my code, but the result is always an empty {}. Note that the regular expression below appears to match only names whose last word ends in "ä", so the empty result may reflect the pattern rather than perfectly clean data. Or perhaps Swedes are just really good at documentation after all ;)

In [42]:
import re
import xml.etree.cElementTree as ET
import pprint
from collections import defaultdict
import collections

street_types= defaultdict(set)
street_type_re = re.compile(r'\b\S+\ä\.?$', re.IGNORECASE)

expected = [ "Vag", "Gatan", "Alle", "Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Cove", "Alley", "Park", "Way", "Walk", "Circle", "Highway", 
            "Plaza", "Path", "Center", "Mission"]

mapping = { "väg": "Vag" ,
            "gata":"Gatan",
            "gatan": "Gatan" ,
            "alle" :"Alle",
            }

def audit_street_type(street_types, street_name):
   
    m = street_type_re.search(street_name) #finds the pattern 
    if m:
        street_type = m.group() #returns the last word
        if street_type not in expected: 
            street_types[street_type].add(street_name)
            
def is_street_name(elem):
    return (elem.attrib["k"] == "name")


def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = collections.defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types

def update_name(name, mapping, regex):
    m = regex.search(name)
    if m:
        st_type = m.group()
        if st_type in mapping:
            name = re.sub(regex, mapping[st_type], name)
    return name
 

sf_st_types = audit(SAMPLE_FILE)
pprint.pprint(dict(sf_st_types) )   

for street_type, ways in sf_st_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping, street_type_re)
        print name, "=>", better_name

{}
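Swedish street names are usually written as one compound word (for example "Drottninggatan"), so the street type is a suffix rather than a separate last word. The following is a minimal sketch of an audit based on suffixes instead of last words. It is written for Python 3, and the suffix list, the abbreviation mapping and the function names are illustrative assumptions, not part of the original analysis.

```python
from collections import defaultdict

# Assumption: a small, non-exhaustive list of common Swedish street-type endings.
EXPECTED_SUFFIXES = ("gatan", "vägen", "gränd", "stigen", "backen", "allén", "torget", "plan")

# Assumption: hypothetical abbreviations and the endings they stand for.
ABBREVIATIONS = {"g.": "gatan", "v.": "vägen"}


def audit_street_name(street_types, street_name):
    """Group names that do not end in one of the expected suffixes by their last word."""
    name = street_name.strip()
    if not name:
        return
    lowered = name.lower()
    if lowered.endswith(EXPECTED_SUFFIXES):
        return
    street_types[lowered.split()[-1]].add(name)


def expand_abbreviation(name):
    """Replace a trailing abbreviation with its full ending, if one is known."""
    for abbr, full in ABBREVIATIONS.items():
        if name.lower().endswith(abbr):
            return name[:-len(abbr)] + full
    return name


street_types = defaultdict(set)
for sample_name in ["Drottninggatan", "Sveavägen", "Kungsg.", "Main Street"]:
    audit_street_name(street_types, sample_name)
```

With these made-up sample names, only "Kungsg." and "Main Street" are reported, and expand_abbreviation("Kungsg.") gives "Kungsgatan".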


Checking the ‘k’ value of each tag and counting the keys by category in a dictionary. Regular expressions: lower matches valid keys made of lowercase letters and underscores only. lower_colon matches otherwise valid keys that contain a colon. problemchars matches keys with problematic characters.

In [5]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        if re.match(lower, element.attrib['k']):
            keys["lower"] += 1
        elif re.match(lower_colon, element.attrib['k']):
            keys["lower_colon"] += 1
        elif re.search(problemchars, element.attrib['k']):
            keys["problemchars"] += 1
        else:
            keys['other'] += 1
    return keys


def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

sf_all_keys = process_map(SAMPLE_FILE)
print sf_all_keys

{'problemchars': 1, 'lower': 148584, 'other': 1683, 'lower_colon': 65754}
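To make the four categories concrete, here is a small sketch that applies the same three regular expressions to a handful of example keys. The helper name classify_key and the example keys are assumptions chosen for illustration.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category a tag key falls into, checked in the same order as key_type."""
    if lower.match(key):
        return "lower"
    if lower_colon.match(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"


examples = ["highway", "addr:street", "my key", "seamark:light:colour", "FIXME"]
categories = {key: classify_key(key) for key in examples}
```

Keys with two colons or with uppercase letters match none of the three patterns, which is why they end up in the "other" bucket.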


Auditing postal codes. The audit assumes that the first two digits of a postal code should be 72.

In [41]:
import collections
def audit_zipcode(invalid_zipcodes, zipcode):
    twoDigits = zipcode[0:2]
    
    if twoDigits != 72 or not twoDigits.isdigit():
        invalid_zipcodes[twoDigits].add(zipcode)
        
def is_zipcode(elem):
    return (elem.attrib['k'] == "addr:postcode")

def audit_zip(osmfile):
    osm_file = open(osmfile, "r")
    invalid_zipcodes = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_zipcode(tag):
                    audit_zipcode(invalid_zipcodes,tag.attrib['v'])

    return invalid_zipcodes

sf_zipcode = audit_zip(SAMPLE_FILE)
pprint.pprint(dict(sf_zipcode))


{'04': set(['04435']),
 '10': set(['102 41', '10315', '10691']),
 '11': set(['111 27',
            '111 29',
            '111 37',
            '111 43',
            '111 45',
            '111 47',
            '111 52',
            '11120',
            '11121',
            '11129',
            '11134',
            '11140',
            '11143',
            '11144',
            '11145',
            '11146',
            '11148',
            '11157',
            '11160',
            '11218',
            '11221',
            '11226',
            '11227',
            '11232',
            '11238',
            '11243',
            '11248',
            '11264',
            '11265',
            '11269',
            '113 27',
            '113 43',
            '113 58',
            '113 61',
            '11320',
            '11329',
            '11336',
            '11338',
            '11351',
            '11359',
            '114 39',
            '114 60',
            '114 86',
            '11415

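Problems Encountered: Inconsistent postal codes! Although the audit flags a broad set of postal codes as invalid, many of the codes above are valid and have no problems. One reason is that audit_zipcode compares the two-character prefix (a string) with the integer 72, which is never equal, so every postal code is flagged. I also assumed that postal codes in the Stockholm area all begin with “72” or “41”, but the codes listed above begin with other digits such as 10 and 11, so that assumption does not fit the data.
In the following code, I clean the postal codes by removing the blank in the middle (for example, "113 58" becomes "11358") so that all codes share a consistent five-digit format.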

In [86]:
def update_zip(zipcode):
    """Remove spaces and return the first run of digits, or None if there is none."""
    zipcode = zipcode.replace(" ", "")
    digits = re.findall(r'\d+', zipcode)
    if digits:
        return digits[0]
    return None
        


for street_type, ways in sf_zipcode.iteritems():
    for name in ways:
        better_name = update_zip(name)
        print name, "=>", better_name

11269 => 11269
113 58 => 11358
111 37 => 11137
115 43 => 11543
115 45 => 11545
115 25 => 11525
11320 => 11320
11134 => 11134
11243 => 11243
11338 => 11338
11227 => 11227
115 24 => 11524
115 27 => 11527
115 26 => 11526
115 21 => 11521
115 20 => 11520
115 23 => 11523
115 22 => 11522
111 47 => 11147
11157 => 11157
11743 => 11743
111 43 => 11143
11248 => 11248
111 27 => 11127
116 63 => 11663
111 29 => 11129
114 39 => 11439
11265 => 11265
11264 => 11264
116 21 => 11621
11226 => 11226
11221 => 11221
113 61 => 11361
116 28 => 11628
11415 => 11415
113 27 => 11327
11646 => 11646
111 45 => 11145
11218 => 11218
11129 => 11129
11144 => 11144
113 43 => 11343
115 53 => 11553
11238 => 11238
11121 => 11121
11553 => 11553
11336 => 11336
117 62 => 11762
11232 => 11232
11631 => 11631
114 86 => 11486
11143 => 11143
11146 => 11146
11120 => 11120
11359 => 11359
11145 => 11145
118 52 => 11852
11140 => 11140
11639 => 11639
11419 => 11419
114 60 => 11460
11160 => 11160
11428 => 11428
11148 => 11148
11738 => 11
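The audit and the cleaning step can also be combined so that a code is normalized first and checked afterwards. The sketch below assumes that a valid code is exactly five digits once whitespace is removed. The prefix set is a hypothetical choice based on the prefixes visible in the output above and should be adjusted to the area of interest. The function names are assumptions as well.

```python
import re

POSTCODE_RE = re.compile(r'^\d{5}$')

# Assumption: illustrative two-digit prefixes only; not an authoritative list.
EXPECTED_PREFIXES = {"10", "11"}


def clean_postcode(raw):
    """Strip all whitespace and return the code if it is exactly five digits, else None."""
    compact = re.sub(r'\s+', '', raw)
    return compact if POSTCODE_RE.match(compact) else None


def is_expected(postcode, prefixes=EXPECTED_PREFIXES):
    """Compare the prefix as a string, so '11' is tested against '11' and not against 11."""
    return postcode is not None and postcode[:2] in prefixes


cleaned = [clean_postcode(value) for value in ["113 58", "11269", "7591"]]
```

Comparing strings with strings avoids the situation where every code is flagged, and returning None for codes such as "7591" keeps malformed values visible instead of silently passing them on.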

After auditing is complete, the next step is to prepare the data for insertion into a SQL database.
To do so, I parsed the elements in the OSM XML file, transforming them from document format to
tabular format so that they can be written to .csv files. These .csv files can then easily be
imported into a SQL database as tables.

In [None]:
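import csv
import codecs
import re
import xml.etree.cElementTree as ET

import cerberus  # third-party validation library used by validate_element

import schema  # assumed to be a local schema.py that defines `schema`

NODES_PATH = "nodes.csv"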
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"

LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

SCHEMA = schema.schema

# Make sure the fields order in the csvs matches the column order in the sql table schema
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']


def shape_element(element, node_attr_fields = NODE_FIELDS, way_attr_fields = WAY_FIELDS,
                  problem_chars = PROBLEMCHARS, default_tag_type = 'regular'):
    """Clean and shape node or way XML element to Python dict"""

    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements

    if element.tag == 'node':
        for node in NODE_FIELDS:
            node_attribs[node] = element.attrib[node]
        for child in element:
            tag = {}
            if PROBLEMCHARS.search(child.attrib["k"]):
                continue
        
            elif LOWER_COLON.search(child.attrib["k"]):
                tag_type = child.attrib["k"].split(':',1)[0]
                tag_key = child.attrib["k"].split(':',1)[1]
                tag["key"] = tag_key
                if tag_type:
                    tag["type"] = tag_type
                else:
                    tag["type"] = 'regular'
            
                tag["id"] = element.attrib["id"]
                tag["value"] = child.attrib["v"]
            else:
                tag["value"] = child.attrib["v"]
                tag["key"] = child.attrib["k"]
                tag["type"] = "regular"
                tag["id"] = element.attrib["id"]
            if tag:
                tags.append(tag)
        return {'node': node_attribs, 'node_tags': tags}
    elif element.tag == 'way':
        for way in WAY_FIELDS:
            way_attribs[way] = element.attrib[way]
        for child in element:
            nd = {}
            tag = {}
            if child.tag == 'tag':
                if PROBLEMCHARS.search(child.attrib["k"]):
                    continue
                elif LOWER_COLON.search(child.attrib["k"]):
                    tag_type = child.attrib["k"].split(':',1)[0]
                    tag_key = child.attrib["k"].split(':',1)[1]
                    tag["key"] = tag_key
                    if tag_type:
                        tag["type"] = tag_type
                    else:
                        tag["type"] = 'regular'
                    tag["id"] = element.attrib["id"]
                    tag["value"] = child.attrib["v"]
    
                else:
                    tag["value"] = child.attrib["v"]
                    tag["key"] = child.attrib["k"]
                    tag["type"] = "regular"
                    tag["id"] = element.attrib["id"]
                if tag:
                    tags.append(tag)
                    
            elif child.tag == 'nd':
                nd['id'] = element.attrib["id"]
                nd['node_id'] = child.attrib["ref"]
                nd['position'] = len(way_nodes)
            
                if nd:
                    way_nodes.append(nd)
            else:
                continue
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
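To illustrate the document-to-table transformation on a single element, here is a simplified, self-contained sketch for Python 3. It is not the shape_element function above: it splits a key at the first colon without the LOWER_COLON and PROBLEMCHARS checks, and the XML snippet and the helper names split_key and node_to_rows are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical node, invented for illustration only.
SAMPLE_XML = """<node id="1" lat="59.33" lon="18.06" user="alice" uid="7"
      version="1" changeset="10" timestamp="2017-01-01T00:00:00Z">
  <tag k="addr:street" v="Drottninggatan"/>
  <tag k="amenity" v="cafe"/>
</node>"""


def split_key(key, default_type="regular"):
    """Split 'type:key' at the first colon; keys without a colon get the default type."""
    if ":" in key:
        tag_type, tag_key = key.split(":", 1)
        return tag_key, tag_type
    return key, default_type


def node_to_rows(element):
    """Return one row for the node itself and one row per child tag."""
    node_row = dict(element.attrib)
    tag_rows = []
    for child in element.iter("tag"):
        key, tag_type = split_key(child.attrib["k"])
        tag_rows.append({
            "id": element.attrib["id"],
            "key": key,
            "value": child.attrib["v"],
            "type": tag_type,
        })
    return node_row, tag_rows


node_row, tag_rows = node_to_rows(ET.fromstring(SAMPLE_XML))
```

Each dictionary in tag_rows corresponds to one line of nodes_tags.csv, and node_row corresponds to one line of nodes.csv.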


In [81]:


tree = ET.parse(SAMPLE_FILE)
root = tree.getroot()

counttotal = 0
count = 0
wp = []
regex = re.compile(r'^(5)(6)\d{4}$')  # matches only six-digit codes starting with 56, so five-digit codes never match
for i in tree.getiterator('tag'):
    k1 = i.get("k")
    if k1 == "addr:postcode":
        v1 = i.get("v")
        m1 = regex.match(v1)
        if not m1:
            counttotal = counttotal +1
            if len(v1) != 6:
                v1 = v1.replace(" ","")
                v1 = v1.replace(",","")
                v1 = v1.replace("-","")
                v1 = v1.replace('"',"")
                v1 = v1.replace('55',"5")
                m2 = regex.match(v1)
                if not m2:
                    wp.append(v1)
                    count = count +1
            elif len(v1) == 6:
                wp.append(v1)
                count = count +1

print wp
print count
print counttotal

['15391', '163 56', '11419', '7591', '746 93', '134 39', '164 94', '760 10', '192 79', '187 70', '75228', '120 55', '114 86', '120 51', '131 40', '131 54', '1865', '120 57', '120 57', '120 59', '120 59', '120 59', '120 59', '120 59', '120 59', '120 58', '120 57', '120 57', '12051', '12051', '12051', '12051', '11146', '12352', '123 41', '17672', '164 40', '131 31', '131 31', '131 71', '122 48', '18651', '18651', '18651', '18651', '18651', '18651', '11227', '11265', '120 67', '120 54', '120 58', '120 58', '120 58', '120 54', '120 54', '120 55', '120 55', '120 55', '120 57', '120 55', '120 55', '120 57', '120 57', '120 59', '120 54', '120 55', '120 60', '120 60', '11265', '120 60', '120 60', '120 58', '120 60', '12052', '120 60', '120 60', '120 60', '11264', '120 60', '120 60', '120 60', '120 60', '120 58', '120 58', '120 58', '120 58', '120 58', '11243', '760 10', '120 58', '120 60', '120 54', '120 54', '120 56', '120 56', '120 56', '120 57', '11733', '11320', '120 54', '120 57', '120 56

Helper functions used to read the elements, validate them against the schema and write the .csv files.

In [None]:
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""

    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(validator.errors.iteritems())
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_strings = (
            "{0}: {1}".format(k, v if isinstance(v, str) else ", ".join(v))
            for k, v in errors.iteritems()
        )
        raise cerberus.ValidationError(
            message_string.format(field, "\n".join(error_strings))
        )


class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])


sf_sample = r"data\sf_sample.osm"  # raw string so the backslash is kept literally
                    
if __name__ == '__main__':
    # Note: Validation is ~ 10X slower. For the project consider using a small
    # sample of the map when validating.
    process_map(sf_sample, validate = False)
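The import of the .csv files into the database is not shown in this notebook. One possible way to do it with the standard library is sketched below for Python 3 and SQLite. The helper name import_csv, the all-TEXT column types and the tiny demonstration file are assumptions; a real import would use the column types from the project's SQL schema.

```python
import csv
import os
import sqlite3
import tempfile


def import_csv(conn, table, csv_path, columns):
    """Create `table` with TEXT columns and load all rows from a CSV file with a header."""
    col_defs = ", ".join("{} TEXT".format(col) for col in columns)
    conn.execute("CREATE TABLE IF NOT EXISTS {} ({})".format(table, col_defs))
    placeholders = ", ".join("?" for _ in columns)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [tuple(row[col] for col in columns) for row in csv.DictReader(f)]
    conn.executemany("INSERT INTO {} VALUES ({})".format(table, placeholders), rows)
    conn.commit()
    return len(rows)


# Demonstration with a small made-up file instead of the real nodes_tags.csv.
fields = ["id", "key", "value", "type"]
csv_path = os.path.join(tempfile.mkdtemp(), "nodes_tags.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerow({"id": "1", "key": "amenity", "value": "cafe", "type": "regular"})
    writer.writerow({"id": "2", "key": "street", "value": "Drottninggatan", "type": "addr"})

conn = sqlite3.connect(":memory:")
loaded = import_csv(conn, "nodes_tags", csv_path, fields)
```

The table and column names are formatted into the SQL text, so they should only ever come from trusted constants such as the field lists defined earlier; the values themselves go through ? placeholders.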

Overview of the data
This section contains basic statistics about the dataset, the SQL queries used to gather them, and some additional ideas about the data in context.
File Size
Number of Nodes
Number of Ways
Number of unique users
Top 10 contributors
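The statistics listed above could be gathered with queries along the following lines. This is a sketch using sqlite3 and an in-memory database with a few made-up rows, so the numbers it produces are not results for the Stockholm dataset. The table and column names follow the field lists used for the .csv files, reduced to the columns the queries need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Made-up rows standing in for the imported nodes.csv and ways.csv.
c.execute("CREATE TABLE nodes (id INTEGER, user TEXT, uid INTEGER)")
c.execute("CREATE TABLE ways (id INTEGER, user TEXT, uid INTEGER)")
c.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
              [(1, "alice", 7), (2, "bob", 8), (3, "alice", 7)])
c.executemany("INSERT INTO ways VALUES (?, ?, ?)",
              [(10, "carol", 9), (11, "alice", 7)])

# Number of nodes and number of ways.
num_nodes = c.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
num_ways = c.execute("SELECT COUNT(*) FROM ways").fetchone()[0]

# Number of unique users across nodes and ways.
num_users = c.execute(
    "SELECT COUNT(DISTINCT uid) "
    "FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways)"
).fetchone()[0]

# Top 10 contributors by number of nodes and ways created.
top_users = c.execute(
    "SELECT user, COUNT(*) AS num "
    "FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) "
    "GROUP BY user ORDER BY num DESC, user LIMIT 10"
).fetchall()
```

The file size is not a database statistic; it can be read from the file system, for example with os.path.getsize.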



Additional ideas:
List of top 20 Amenities in Stockholm

In [None]:
c.execute("SELECT value, COUNT(*) as num \
            FROM nodes_tags \
           WHERE key='amenity' \
           GROUP BY value \
           ORDER BY num DESC \
           LIMIT 20;")

pprint.pprint(c.fetchall())

Conclusion
The auditing process shows that the dataset is fairly clean, although there are some minor errors such as inconsistent postal codes. With thousands of contributing users, some human input errors are inevitable. One idea is to create a monitoring system that regularly checks every contribution. In addition, because OpenStreetMap is an open source project, many areas of the map are still outdated, such as my hometown Baoding, Hebei Province, China. I hope OpenStreetMap can obtain these data from other open data sources.