# Project Description

For this project, data from [OpenStreetMap](https://www.openstreetmap.org/#map=5/51.500/-0.100) is downloaded. The data comes in XML format, is wrangled, and is then stored in a MongoDB database.

To this end, the data is first inspected for inconsistent or incorrect values; in the next step it is cleaned and then transformed for storage in a MongoDB database.

# Import Libraries

In [1]:
import os
from sys import exit
import urllib.request

from collections import defaultdict
import re
import pprint

import xml.etree.ElementTree as ElementTree  # cElementTree was removed in Python 3.9

from pymongo import MongoClient
from bson.objectid import ObjectId 

# Set Global Variables

In [2]:
URL = 'http://overpass-api.de/api/map?bbox=11.4388,48.0593,11.6448,48.2507'
DATASET_NAME = 'muenchen'
DATASET_PATH = 'data/' + DATASET_NAME + '.osm'
SAMPLE_PATH = 'data/' + DATASET_NAME + '_sample' + '.osm'

MONGODB_DB = 'mongodb://localhost:27017'

# Project Preparation

## Function to download the map data

In [3]:
def load_OpenStreetMap_data(url, path):
    """Download an XML file representing an area based on the OpenStreetMap data

    Args:
        url of the XML map
        path file name for storage

    Returns:
        None
    """
    ### Create data dir if it does not exist
    if not os.path.exists('data/'):
        os.makedirs('data/')
    
    urllib.request.urlretrieve(url, path)

## Function to generate a sample for auditing

In [4]:
def sample_xml_data(path, path_sample, k):
    """Creates a sample of a given OpenStreetMap file and stores the sample to disk

    Args:
        path of the original osm file
        path of the sample osm file
        k every kth element is sampled

    Returns:
        None
    """
    
    def get_element(osm_file, tags=('node', 'way', 'relation')):
        """Yield element if it is the right type of tag

        Reference:
        http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
        """
        context = iter(ElementTree.iterparse(osm_file, events=('start', 'end')))
        _, root = next(context)
        for event, elem in context:
            if event == 'end' and elem.tag in tags:
                yield elem
                root.clear()


    with open(path_sample, 'w', encoding='utf-8') as output:
        output.write('<?xml version="1.0" encoding="UTF-8"?>\r\n')
        output.write('<osm>\r\n  ')

        # Write every kth top level element
        for i, element in enumerate(get_element(path)):
            if i % k == 0:
                # encoding='unicode' returns a str; str() on the default bytes
                # result would write the b'...' representation instead
                output.write(ElementTree.tostring(element, encoding='unicode') + '\r\n')
        output.write('</osm>')
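
In Python 3, `ElementTree.tostring` returns `bytes` by default, and wrapping the result in `str()` produces the `b'...'` representation rather than plain XML text. The following minimal sketch shows the difference; the element and its attribute values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element, only for illustration
element = ET.Element('node', {'id': '1', 'lat': '48.14', 'lon': '11.58'})

as_bytes = ET.tostring(element)                     # bytes (default)
as_text = ET.tostring(element, encoding='unicode')  # str

print(type(as_bytes).__name__, type(as_text).__name__)
print(str(as_bytes).startswith("b'"))  # str() keeps the bytes prefix
print(as_text)
```

Writing `as_text` to a file opened in text mode keeps the sample a clean XML document.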

# Load the Dataset

In [5]:
"""
Downloads the data to the data folder of the local repository. Uncomment the line below for the first run;
after that, comment it out again to prevent the code from downloading the data every time you run it.
"""  
#load_OpenStreetMap_data(URL, DATASET_PATH)

"""
Generates a sample of the dataset for the auditing process. Uncomment the line below for the first run;
after that, comment it out again to prevent the code from regenerating the sample every time you run it.
"""  
#sample_xml_data(DATASET_PATH, SAMPLE_PATH, 100)

# Audit of the data

## Audit functions

### Generic function

In [6]:
def audit_tag(tag_type, expected=(), splitting=False):
    """Audits a given tag type in the XML file

    Args:
        tag_type tag in the XML file that should be audited
        expected values that are marked as correct
        splitting specifies if the value should be split based on punctuation

    Returns:
        None
    """
    
    ### Checks if the tag key equals the specified tag type
    def is_tag_type(element, tag_type):
        return (element.attrib['k'] == tag_type)

    ### Audits the tag value
    def audit_tag_type(tag_types, tag_type, expected, splitting):
    
        if splitting:
            tag_type = tag_type.split(',')[0]
            tag_type = tag_type.split(';')[0]
            tag_type = tag_type.split('_')[0]
            tag_type = tag_type.lower()

        if tag_type not in expected:
            tag_types[tag_type].add(tag_type)
    
    ### Creates the dict for storing the invalid values
    tag_types = defaultdict(set)
    
    ### Loops over the XML file and audits the tag; "end" events guarantee
    ### that the child tags of each element have already been parsed
    with open(SAMPLE_PATH, "rb") as osm_file:
        for event, element in ElementTree.iterparse(osm_file, events=("end",)):
            if element.tag == "node" or element.tag == "way":
                for tag in element.iter('tag'):
                    if is_tag_type(tag, tag_type):
                        audit_tag_type(tag_types, tag.attrib['v'], expected, splitting)
    
    ### Prints the values that are not marked as valid
    pprint.pprint(tag_types)
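
`iterparse` only guarantees that the children of an element are available once its `end` event has been emitted, so `end` events are the safe choice when the `tag` children are inspected. The sketch below shows the same audit loop on a tiny in-memory document; the XML snippet and the helper name `tag_values` are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hypothetical OSM extract, for illustration only
SAMPLE = b"""<osm>
  <node id="1"><tag k="amenity" v="cafe"/></node>
  <node id="2"><tag k="amenity" v="bar"/><tag k="name" v="X"/></node>
</osm>"""

def tag_values(source, key):
    """Collect the values of `key` from node and way elements."""
    values = []
    for event, element in ET.iterparse(source, events=('end',)):
        if element.tag in ('node', 'way'):
            for tag in element.iter('tag'):
                if tag.attrib['k'] == key:
                    values.append(tag.attrib['v'])
            element.clear()  # drop the children of the processed element
    return values

print(tag_values(io.BytesIO(SAMPLE), 'amenity'))  # ['cafe', 'bar']
```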

### Generic regex function

In [7]:
def audit_tag_reg(tag_type, regex = ".*"):
    """Audits a given tag type in the xml file

    Args:
        tag_type tag in the xml file that should be audited
        regex pattern; values that match it are collected and printed
        
    Returns:
        None
    """
    
    ### Checks if the tag key equals the specified tag type
    def is_tag_type_reg(element, tag_type):
        return (element.attrib['k'] == tag_type)

    ### Audits the tag value
    def audit_tag_type_reg(tag_types, tag_type, regex):
    
        audit_tag_type_re = re.compile(regex, re.IGNORECASE)

        m = audit_tag_type_re.search(tag_type)

        if m:
            tag_type_name = m.group()

            tag_types[tag_type_name].add(tag_type)

    ### Creates the dict for storing the values matched by the regex
    tag_types = defaultdict(set)
    
    ### Loops over the XML file and audits the tag; "end" events guarantee
    ### that the child tags of each element have already been parsed
    with open(SAMPLE_PATH, "rb") as osm_file:
        for event, element in ElementTree.iterparse(osm_file, events=("end",)):
            if element.tag == "node" or element.tag == "way":
                for tag in element.iter('tag'):
                    if is_tag_type_reg(tag, tag_type):
                        audit_tag_type_reg(tag_types, tag.attrib['v'], regex)
    
    ### Prints the values matched by the regex
    pprint.pprint(tag_types)

### Specific street function

In [8]:
def audti_street_names(expected=[]):
    """Audits a given street name

    Args:
        expected values that are marked as correct
        
    Returns:
        None
    """
    
    ### Checks if the tag key is the street name tag
    def is_street_name(element):
        return (element.attrib['k'] == "addr:street")

    ### Audits the street type
    def audit_street_type(street_types, street_name, expected):

        street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)

        m = street_type_re.search(street_name)

        if m:
            street_type = m.group()
            if street_type not in expected:
                street_types[street_type].add(street_name)
            #street_types[street_type] += 1
    
    ### Creates the dict for storing the invalid values
    street_types = defaultdict(set)
    
    ### Loops over the XML file and audits the tag; "end" events guarantee
    ### that the child tags of each element have already been parsed
    with open(SAMPLE_PATH, "rb") as osm_file:
        for event, element in ElementTree.iterparse(osm_file, events=("end",)):
            if element.tag == "node" or element.tag == "way":
                for tag in element.iter('tag'):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'], expected)
    
    ### Prints the values that are not marked as valid
    pprint.pprint(street_types)

## Audit housenumbers

In [9]:
### Audit the housenumber tag of the xml file
audit_tag("addr:housenumber")

defaultdict(<class 'set'>,
            {'1': {'1'},
             '10': {'10'},
             '10-18': {'10-18'},
             '100': {'100'},
             '101': {'101'},
             '102': {'102'},
             '102b': {'102b'},
             '103': {'103'},
             '104': {'104'},
             '105': {'105'},
             '106': {'106'},
             '107': {'107'},
             '107a': {'107a'},
             '109': {'109'},
             '10a': {'10a'},
             '10c': {'10c'},
             '10e': {'10e'},
             '11': {'11'},
             '110': {'110'},
             '111': {'111'},
             '111a': {'111a'},
             '112': {'112'},
             '113': {'113'},
             '114': {'114'},
             '114a': {'114a'},
             '115': {'115'},
             '116': {'116'},
             '117': {'117'},
             '117c': {'117c'},
             '118': {'118'},
             '118b': {'118b'},
             '119': {'119'},
             '11a': {'11a'},
        

## Audit postcodes

In [10]:
### Audit the postcode tag of the XML file; any code starting with five digits is expected to be valid
#audit_tag("addr:postcode")
audit_tag_reg("addr:postcode", "^(?![0-9]{5})")

defaultdict(<class 'set'>, {})
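
Note that the pattern `^(?![0-9]{5})` only inspects the first five characters, so values such as `803311` or `80331 München` would not be reported. A stricter check could look like the following sketch, which uses `re.fullmatch`; the helper name and the sample values are hypothetical.

```python
import re

# Exactly five digits and nothing else
POSTCODE_RE = re.compile(r'[0-9]{5}')

def invalid_postcodes(values):
    """Return the values that are not exactly five digits."""
    return [v for v in values if not POSTCODE_RE.fullmatch(v)]

# Hypothetical sample values, not taken from the dataset
samples = ['80331', '8033', '803311', '80331 München', 'D-80331']
print(invalid_postcodes(samples))
# ['8033', '803311', '80331 München', 'D-80331']
```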


## Audit street

In [11]:
### Audit the street names tag of the xml file
audti_street_names()

defaultdict(<class 'set'>,
            {'Abensbergstraße': {'Abensbergstraße'},
             'Abenthumstraße': {'Abenthumstraße'},
             'Aberlestraße': {'Aberlestraße'},
             'Achatstraße': {'Achatstraße'},
             'Adalbertstraße': {'Adalbertstraße'},
             'Adam-Berg-Straße': {'Adam-Berg-Straße'},
             'Adelsbergstraße': {'Adelsbergstraße'},
             'Adenauerring': {'Adenauerring'},
             'Adolf-Mathes-Weg': {'Adolf-Mathes-Weg'},
             'Aggensteinstraße': {'Aggensteinstraße'},
             'Agilolfingerplatz': {'Agilolfingerplatz'},
             'Agnes-Bernauer-Straße': {'Agnes-Bernauer-Straße'},
             'Agnes-Miegel-Straße': {'Agnes-Miegel-Straße'},
             'Agricolastraße': {'Agricolastraße'},
             'Aiblingerstraße': {'Aiblingerstraße'},
             'Aidenbachstraße': {'Aidenbachstraße'},
             'Aindorferstraße': {'Aindorferstraße'},
             'Albert-Roßhaupter-Straße': {'Albert-Roßhaupter-Straße'

## Audit amenity

In [12]:
### Audit the amenity tag of the xml file
audit_tag("amenity")

defaultdict(<class 'set'>,
            {'atm': {'atm'},
             'bank': {'bank'},
             'bar': {'bar'},
             'bench': {'bench'},
             'bicycle_parking': {'bicycle_parking'},
             'bicycle_rental': {'bicycle_rental'},
             'cafe': {'cafe'},
             'cinema': {'cinema'},
             'clock': {'clock'},
             'dentist': {'dentist'},
             'doctors': {'doctors'},
             'dormitory': {'dormitory'},
             'fast_food': {'fast_food'},
             'fire_station': {'fire_station'},
             'food_court': {'food_court'},
             'fountain': {'fountain'},
             'ice_cream': {'ice_cream'},
             'kindergarten': {'kindergarten'},
             'marketplace': {'marketplace'},
             'parking': {'parking'},
             'parking_entrance': {'parking_entrance'},
             'pharmacy': {'pharmacy'},
             'place_of_worship': {'place_of_worship'},
             'post_box': {'post_box'},
     

## Audit cuisine

In [13]:
### Audit the cuisine tag of the XML file; common cuisine types are expected to be valid
#audit_tag("cuisine")
audit_tag("cuisine", ["asian", "coffee_shop", "german", "greek", "indian",\
                     "italian", "pizza", "portuguese"], True)

defaultdict(<class 'set'>, {'regional': {'regional'}, 'coffee': {'coffee'}})


## Audit phone

In [14]:
### Audit the phone tag of the XML file; any number beginning with a "+NN NN " prefix such as "+49 89 " is expected to be valid
#audit_tag("phone")
audit_tag_reg("phone", r"^(?![+][0-9]{2}\s[0-9]{2}\s).*")

defaultdict(<class 'set'>,
            {'+49 (0) 89 21 99 89 42': {'+49 (0) 89 21 99 89 42'},
             '+49 (0) 89/932574': {'+49 (0) 89/932574'},
             '+49 (0)89/895 580 68-0': {'+49 (0)89/895 580 68-0'},
             '+49 800 344226622': {'+49 800 344226622'},
             '+4989-36 10 33 97': {'+4989-36 10 33 97'},
             '+498923709821': {'+498923709821'},
             '089 20206561': {'089 20206561'},
             '089 39 25 05': {'089 39 25 05'},
             '089 507743': {'089 507743'},
             '089 89059980': {'089 89059980'},
             '089/2167-10090': {'089/2167-10090'},
             '089/4482274': {'089/4482274'}})


# Prepare Data For Storage

## Cleaning data

### Cleaning cuisine values

In [15]:
def clean_cuisine_value(item):
    """Process a string and replace invalid entries with valid entries

    Args:
        item unprocessed value
        
    Returns:
        item cleansed value
    """
    
    ### dict that specifies the mapping of invalid to valid entries
    mapping = {
    "afghan": "afghanisch",
    "afghani": "afghanisch",
    "israel": "israeli",
    "doener": "döner",
    "mediteran": "mediterranean",
    "vietnam": "vietnamese",
    "türkisch": "turkish"
    }
    
    ### Processing of the string
    item = item.split(',')[0]
    item = item.split(';')[0]
    item = item.split('_')[0]
    item = item.lower()
    
    ### Replace the value if it has an entry in the mapping dict
    return mapping.get(item, item)

### Cleaning amenity values

In [16]:
def clean_amenity_value(item):
    """Process a string and replace invalid entries with valid entries

    Args:
        item unprocessed value
        
    Returns:
        item cleansed value
    """
        
    ### dict that specifies the mapping of invalid to valid entries
    mapping = {
    "automatenservice": "atm",
    "baggage checkroom": "baggage_checkroom",
    "no": None
    }
    
    ### Processing of the string
    item = item.lower()
    
    ### Replace the value if it has an entry in the mapping dict
    return mapping.get(item, item)

### Clean phone

In [17]:
def clean_phone_value(value):
    """Process a phone number and replace invalid entries with valid entries

    Args:
        value unprocessed value
        
    Returns:
        value cleansed value, or None if the number is not valid
    """
    
    ### Remove punctuation from the phone number
    value = "".join(c for c in value if c not in ('(', ')', '-', '/'))
    
    ### Process the phone number
    char_list = list(value)
    if len(char_list) >= 10:
        if char_list[0] == '0':
            del char_list[0]

        if char_list[3] == '8':
            char_list.insert(3, '0')

        if char_list[3] == '0':
            char_list[3] = ' '

        if char_list[4] == '0':
            del char_list[4]
            del char_list[4]

        if char_list[6] != ' ':
            char_list.insert(6, ' ')

        if char_list[0] == '+':
            value = ''.join(char_list)
        else:
            value = '+49 ' + ''.join(char_list)
    else:
        return None
            
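    ### Check if the phone number is valid, i.e. starts with a "+NN NN " prefix.
    ### The audit pattern uses a negative lookahead and matches invalid numbers,
    ### so the positive form of the pattern is used here.
    valid_phone_re = re.compile(r"^[+][0-9]{2}\s[0-9]{2}\s")
    
    ### Return the number only if it is valid
    if valid_phone_re.search(value):
        return value
    return None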

## Converting the data into json style and clean it

In [18]:
def convert_clean_xml_data():
    """Process the data in json style format and clean it

    Args:
        None
        
    Returns:
        entry_list cleansed data in a json style format
    """
        
    entry_list = []

    ### Load xml file
    osm_file = open(DATASET_PATH, "rb")
    
    ### Loops over the XML file and converts it to a cleaned JSON-style format.
    ### "end" events guarantee that the child tags of each element have been parsed.
    for event, element in ElementTree.iterparse(osm_file, events=("end",)):

        ### Just consider the tags node or way
        if element.tag == "node" or element.tag == "way":
            
            ### Container for the json data
            json_data = dict()
            
            ### Fields shared by node and way elements
            json_data['id'] = element.attrib['id']
            json_data['type'] = element.tag
            json_data['visible'] = 'true'

            created_fields = dict()
            created_fields['version'] = element.attrib['version']
            created_fields['changeset'] = element.attrib['changeset']
            created_fields['timestamp'] = element.attrib['timestamp']
            created_fields['user'] = element.attrib['user']
            created_fields['uid'] = element.attrib['uid']
            json_data['created'] = created_fields

            ### Only node elements carry a position
            if element.tag == "node":
                json_data['pos'] = [element.attrib['lat'], element.attrib['lon']]

            ### Container for the address data
            address_fields = dict()
            
            ### Loops over the child tag elements
            for tag in element:
                if tag.tag == 'tag':

                    if tag.attrib['k'] == "addr:housenumber":
                        address_fields['housenumber'] = tag.attrib['v']

                    if tag.attrib['k'] == "addr:postcode" :
                        address_fields['postcode'] = tag.attrib['v']

                    if tag.attrib['k'] == "addr:street" :
                        address_fields['street'] = tag.attrib['v']

                    if tag.attrib['k'] == "amenity" :
                        item = clean_amenity_value(tag.attrib['v'])
                        json_data['amenity'] = item

                    if tag.attrib['k'] == "cuisine" :
                        item = clean_cuisine_value(tag.attrib['v'])
                        json_data['cuisine'] = item

                    if tag.attrib['k'] == "name" :
                        json_data['name'] = tag.attrib['v']

                    if tag.attrib['k'] == "phone" :
                        item = clean_phone_value(tag.attrib['v'])
                        json_data['phone'] = item

            json_data['address'] = address_fields

            entry_list.append(json_data)

    ### Close the connection to the file
    osm_file.close()
    
    return entry_list

# Store The Data in MongoDB

In [21]:
### Convert the xml data into json and clean the found errors in the auditing process
json_entries = convert_clean_xml_data()

### Establish a connection to a MongoDB instance
client = MongoClient(MONGODB_DB)

### Open database openStreetMapData and store it
db = client.openStreetMapData

### Insert entries into the collection map
for entry in json_entries:
    entry['_id']  = ObjectId()
    db.map.insert_one(entry)
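
Inserting roughly 1.7 million documents one at a time requires one round trip per document. A possible alternative is to send them in batches with `insert_many`, as in the sketch below. The helper name `insert_in_batches` and the batch size are assumptions; `collection` is expected to be a pymongo collection such as `db.map`, which also adds a missing `_id` automatically.

```python
def insert_in_batches(collection, documents, batch_size=1000):
    """Insert documents with insert_many, batch_size at a time.

    `collection` is assumed to be a pymongo Collection (any object with an
    insert_many method works). Returns the number of documents sent.
    """
    total = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        collection.insert_many(batch)
        total += len(batch)
    return total

# Usage with the objects defined above (requires a running MongoDB):
# insert_in_batches(db.map, json_entries)
```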

# Query The Dataset in MongoDB

In [22]:
print("Query The Dataset in MongoDB")

### Size of the database
print("Size of database: " + str(round(db.command("dbstats")['storageSize'] / (1024*1024))) + " MB")

print('-----')

# Number of documents (Cursor.count() was removed in PyMongo 4; count_documents replaces it)
print('Number of documents: ' + str(db.map.count_documents({})))

print('-----')

# Number of nodes
print('Number of nodes: ' + str(db.map.count_documents({"type":"node"})))

print('-----')

# Number of ways
print('Number of ways: ' + str(db.map.count_documents({"type":"way"})))

print('-----')

# Number of unique users
print('Number of unique users: ' + str(len(db.map.distinct("created.user"))))

print('-----')

# Top 10 contributing users
users = db.map.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, {"$sort":{"count":-1}}, {"$limit":10}])
print("Top 10 most active users:")
for user in users:
    print(user)
    
print('-----')
    
# Number of cuisines appearing only once 
count_single_cuisine = db.map.aggregate([{"$group":{"_id":"$cuisine", "count":{"$sum":1}}}, {"$group":{"_id":"$count", "count single cuisines":{"$sum":1}}}, {"$sort":{"_id":1}}, {"$limit":1}])
print("Number of cuisines appearing only once: " + str(count_single_cuisine.next()["count single cuisines"]))

print('-----')

# Cuisines appearing only once
cuisines = db.map.aggregate([{"$group":{"_id":"$cuisine", "count":{"$sum":1}}}, {"$sort":{"count":1}}, {"$limit":10}, {'$match':{'count': {'$lte': 1}}}])
print("Sample of 5 cuisines appearing only once:")
for i in range(5):
    print(cuisines.next()["_id"])

print('-----')

# Postcode areas with the highest number of bars
postcode_bars = db.map.aggregate([{"$match":{"amenity":"bar"}},
{"$match":{"address.postcode":{"$exists":1}}},
{"$group":{"_id":"$address.postcode", "count":{"$sum":1}}},
{"$sort":{"count":-1}}, {"$limit":5}
])

print("Top 5 bar density postcode areas:")
for i in range(5):
    print(postcode_bars.next())

Query The Dataset in MongoDB
Size of database: 112 MB
-----
Number of documents: 1695297
-----
Number of nodes: 1420595
-----
Number of ways: 274702
-----
Number of unique users: 2881
-----
Top 10 most active users:
{'_id': 'BeKri', 'count': 179732}
{'_id': 'rolandg', 'count': 131495}
{'_id': 'ToniE', 'count': 131199}
{'_id': 'heilbron', 'count': 117306}
{'_id': 'KonB', 'count': 107448}
{'_id': 'Basstoelpel', 'count': 60448}
{'_id': 'marek kleciak', 'count': 54966}
{'_id': 'osmkurt', 'count': 44081}
{'_id': 'Filius Martii', 'count': 36452}
{'_id': 'ludwich', 'count': 31992}
-----
Number of cuisines appearing only once: 58
-----
Sample of 5 cuisines appearing only once:
snack
yugoslavian
senegalese
panasia
pasta
-----
Top 5 bar density postcode areas:
{'_id': '80469', 'count': 27}
{'_id': '80331', 'count': 10}
{'_id': '80802', 'count': 9}
{'_id': '80799', 'count': 8}
{'_id': '81667', 'count': 8}
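
To make the `$group`, `$sort` and `$limit` stages of the user ranking easier to follow, the same logic can be sketched in plain Python with `collections.Counter`. The function name and the sample documents are hypothetical; only the `created.user` field layout is taken from the conversion step above.

```python
from collections import Counter

def top_users(documents, n=10):
    """Mimic $group by created.user, $sort by count descending, $limit n."""
    counts = Counter(doc['created']['user'] for doc in documents)
    return [{'_id': user, 'count': count} for user, count in counts.most_common(n)]

# Hypothetical documents in the shape produced by convert_clean_xml_data
docs = [
    {'created': {'user': 'alice'}},
    {'created': {'user': 'bob'}},
    {'created': {'user': 'alice'}},
]
print(top_users(docs, n=2))
# [{'_id': 'alice', 'count': 2}, {'_id': 'bob', 'count': 1}]
```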


# Conclusion

The review of the data for the Munich area shows that it is already very tidy. However, there are some minor inconsistencies, such as the use of different languages for tagging amenities and different notations for phone numbers.
A better data processor could address these issues by handling all phone number notations used in Germany and converting them to a single format, and by identifying tags that are not in English and translating them.
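
As a starting point for such a processor, the sketch below normalizes the notations that showed up in the phone audit to a `+49 89 <number>` form. It is only an illustration under stated assumptions: the function name `normalize_munich_phone` is hypothetical, only the Munich area code 89 is handled, and numbers with other prefixes are rejected rather than reformatted.

```python
import re

def normalize_munich_phone(raw):
    """Return the number as '+49 89 <digits>', or None if it does not fit."""
    # Drop the optional "(0)" trunk prefix, then keep only digits and "+"
    cleaned = raw.replace('(0)', '')
    digits = re.sub(r'[^0-9+]', '', cleaned)

    # Strip the country code or the national trunk prefix
    if digits.startswith('+49'):
        national = digits[3:]
    elif digits.startswith('0049'):
        national = digits[4:]
    elif digits.startswith('0'):
        national = digits[1:]
    else:
        return None
    national = national.lstrip('0')

    # Assumption: only the Munich area code 89 is accepted
    if not national.startswith('89') or len(national) < 5:
        return None
    return '+49 89 ' + national[2:]

for raw in ['+49 (0) 89 21 99 89 42', '089/4482274', '+49 800 344226622']:
    print(raw, '->', normalize_munich_phone(raw))
```

A complete solution would also need to handle other area codes, service numbers such as 0800, and extensions.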