*** DA3 OPEN STREET MAP - DATA WRANGLING WITH MONGODB***

*Map Area: Berlin, Germany*

https://www.openstreetmap.org/#map=13/52.5180/13.4076

The area contains the central area of Berlin. As it is one of the biggest cities in Europe, we can expect a lot of additional information about this area from the osm project.

In [None]:
#Sample of nodes in an .osm file:

<node id="302864488" visible="true" version="7" changeset="36059354" timestamp="2015-12-20T06:35:42Z" user="atpl_pilot" uid="881429" lat="52.5259586" lon="13.3894424">
  <tag k="addr:city" v="Berlin"/>
  <tag k="addr:country" v="DE"/>
  <tag k="addr:housenumber" v="45"/>
  <tag k="addr:postcode" v="10117"/>
  <tag k="addr:street" v="Oranienburger Straße"/>
  <tag k="addr:suburb" v="Mitte"/>
  <tag k="amenity" v="restaurant"/>
  <tag k="contact:phone" v="+493028040505"/>
  <tag k="cuisine" v="cuban"/>
  <tag k="name" v="QBA"/>
  <tag k="website" v="http://www.qba-restaurant.de/"/>
  <tag k="wheelchair" v="no"/>
 </node>
 
 <way id="5090250" visible="true" timestamp="2009-01-19T19:07:25Z" version="8" changeset="816806" user="Blumpsy" uid="64226">
    <nd ref="822403"/>
    <nd ref="21533912"/>
    <nd ref="821601"/>
    <nd ref="21533910"/>
    <nd ref="135791608"/>
    <nd ref="333725784"/>
    <nd ref="333725781"/>
    <nd ref="333725774"/>
    <nd ref="333725776"/>
    <nd ref="823771"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Clipstone Street"/>
    <tag k="oneway" v="yes"/>
  </way>

In [1]:
import xml.etree.ElementTree as ET
import pprint
import re
import os
import codecs
import json
from collections import defaultdict

#https://www.openstreetmap.org/export#map=13/52.5180/13.4076
OSM_FILE = 'berlin_map.osm'
SAMPLE_FILE = 'berlin_map_reduced.osm'

In [3]:
#check file size
#resource: http://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python

def convert_bytes(num):
    """
    this function will convert bytes to MB.... GB... etc
    """
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f %s" % (num, x)
        num /= 1024.0

def file_size(file_path):
    """
    this function will return the file size
    """
    if os.path.isfile(file_path):
        file_info = os.stat(file_path)
        return convert_bytes(file_info.st_size)
    
size = file_size(OSM_FILE)
print ('OSMSize', size)

OSMSize 154.6 MB


**Create sample file with the k - th size**

In [4]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

k = 14 # Parameter: take every k-th top level element
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag
    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

with open(SAMPLE_FILE, 'wb') as output:
    b = bytearray()
    b.extend('<?xml version="1.0" encoding="UTF-8"?>\n'.encode())
    b.extend('<osm>\n  '.encode())
    output.write(b)

    # Write every kth top level element
    print (OSM_FILE)
    for i, element in enumerate(get_element(OSM_FILE)):

        if not i % k:
            output.write(ET.tostring(element, encoding='utf-8'))
    b_end = bytearray()
    b_end.extend('</osm>'.encode())
    output.write(b_end)

berlin_map.osm


In [5]:
#check size of sample file
sample_size = file_size(SAMPLE_FILE)
print ('SampleSize', sample_size)

SampleSize 11.3 MB


**AUDITING THE .OSM FILE**

In [6]:
#get benchmark data
"""
    Reference:
    https://classroom.udacity.com/nanodegrees/nd002/parts/0021345404/modules/316820862075462/lessons/768058569/concepts/8426285720923#
"""
def get_benchmark_data(filename):
    users = set()
    count_nodes = 0
    count_ways = 0
    count_relations = 0
    
    for _, element in ET.iterparse(filename):
        if element.tag == 'node':
            count_nodes += 1
            user = element.attrib['uid']
            if user in users:
                pass
            else:
                users.add(user)
        if element.tag == 'way':
            count_ways += 1
        if element.tag == 'relation':
            count_relations += 1
    return users, count_nodes, count_ways, count_relations

users, count_nodes, count_ways, count_relations = get_benchmark_data(OSM_FILE)

print ('UNIQUE USERS: ', len(users))
print ('NODES:', count_nodes)
print ('WAYS:', count_ways)
print ('RELATIONS', count_relations)

UNIQUE USERS:  2003
NODES: 594858
WAYS: 86201
RELATIONS 3203


**AUDIT STREET NAMES**

*Problems encounterd while auditing street names:*

   In German language there a lot of different namings and writings for streets: For example a 'Straße' can be 'Auerstraße' or     'Antwerpener Straße' or 'Alfred-Jung-Straße' or 'Straße des 17. Juni'. This happens also to other street types, like              '*weg*', '*zeile*', '*platz*' and also for streets that are or used to be near rivers like: '*damm*', '*ufer*', '*graben*'.
   Because there a lot of bridges in Berlin, bridges belong to ways *brücke*. 
   A street name starting with 'Zur ', 'Am ' or 'An ' simply means 'at ', but is also a valid street name. 
    
   Another problem is the case sensitiveness. Those key words may be written with uppercase or lowercase. In our case we simply    transform it to lowercase, to audit the street names. 
    
   Another problem are the german Umlaute. Because we don't wan't to get troubled, instead of using regex, we will use a simple    loop to audit the street names.
    
   A further task would be to transform all Umlaute of the file at the beginning, to make sure not to get in trouble later.

In [7]:
'''reference:
https://classroom.udacity.com/nanodegrees/nd002/parts/0021345404/modules/316820862075462/lessons/768058569/concepts/8755386150923#'''

#start with lower case letters, due to different ways of writing street names in german
expected = ['straße', 'platz', 'gasse', 'weg', 'allee', 'damm', 'ufer', 'graben', 'brücke', 
            'promenade', 'park', 'am', 'an', 'markt', 'steg', 'hof', 'zeile']

def audit_street_type(street_types, street_name):
    found = False
    for e in expected:
        if e in street_name.lower():
            found = True
            break
           
    if found:
        street_types[e].add(street_name)
    else:
        street_types[street_name.title()].add(street_name)    
                              
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

#audit the street names
def audit_street_names(osmfile):
    osm_file = open(osmfile, 'rt', encoding='utf-8')
    #set is a list without duplicates
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    print('Numbers of different street types in Berlin')
    pretty_print(street_types)

def pretty_print(d):
    for sorted_key in sorted(d, key=lambda k: len(d[k]), reverse=True):
        v = d[k]
        print (sorted_key.title(), ':', len(d[sorted_key]))
        
audit_street_names(OSM_FILE)

Numbers of different street types in Berlin
Straße : 620
Platz : 48
Ufer : 28
Allee : 18
Damm : 12
Weg : 9
Am : 8
Park : 6
An : 5
Graben : 4
Brücke : 4
Zeile : 3
Markt : 3
Promenade : 3
Hof : 2
Stadtbahnbogen : 1
Dohnagestell : 1
Südstern : 1
Unter Den Linden : 1
Südring : 1
Alt-Moabit : 1
Wriezener Karree : 1
Zur Börse : 1
Alt-Stralau : 1
Viehtrift : 1
Fischerinsel : 1
Vor Dem Schlesischen Tor : 1
In Den Ministergärten : 1
Steg : 1
Prenzlauer Berg : 1
Südpassage : 1
Zur Innung : 1
Hinter Der Katholischen Kirche : 1
Neue Welt : 1
Fischzug : 1
Zur Waage : 1
Großer Stern : 1
Westring : 1


It seems that all street names are valid!

** Audit postal codes and suburbs**

Audit Post codes and suburbs The postal code in Berlin ranges from 10115 to 14199 In the following task we want to check if the postal code is valid for Berlin and if it matches with the suburb according to thist list : https://en.wikipedia.org/wiki/List_of_postal_codes_in_Germany#Berlin

In [8]:
suburbs_with_postal_codes = {
                            'Mitte': [10115, 10117, 10119, 10178, 10179],
                             'Gesundbrunnen': [13347, 13353, 13355, 13357, 13359, 13409],
                             'Friedrichshain': [10243, 10245, 10247, 10249],
                             'Prenzlauer Berg': [10405, 10407, 10409, 10435, 10437, 10439, 10369],
                             'Kreuzberg': [10961, 10963, 10965, 10967, 10969, 10997, 10999],
                             'Tiergarten': [10551, 10553, 10785, 10787, 10559, 10555, 10557],
                             'Charlottenburg': [10585, 10587, 10589, 10623, 10625, 10627, 10629],
                             'Wilmersdorf': [10707, 10709, 10711, 10713, 10715, 10719, 10717],
                             'Tempelhof': [10777, 10779, 10781, 10783, 10789],
                             'Schöneberg': [10823, 10825, 10827, 10829, 10717, 10783],
                             'Neukölln': [12043, 12045, 12047, 12049, 12051, 12053, 12055, 12057, 12059],
                             'Steglitz': [12157, 12161, 12163, 12165, 12167, 12169],
                             'Lichterfelde': [12203, 12205, 12207, 12209],
                             'Wedding': [13347, 13349, 13351, 13353, 13355, 13357, 13359],
                             'Reinickendorf': [13403, 13405, 13407, 13409],
                             'Lichtenberg': [13055, 13053, 10365, 10367, 10317],
                             'Pankow': [13187, 13189],
                             'Zehlendorf': [14163, 14165, 14167, 14169],
                             'Wannsee': [14109],
                             'Wittenau': [13435, 13437, 13439],
                             'Weißensee': [13086, 13088, 13089],
                             'Mahrzahn': [12679, 12681, 12683, 12685, 12687, 12689],
                             'Köpenick': [12555, 12557, 12559, 12587, 12435],
                             'Adlershof': [12487, 12489],
                             'Lichtenrade': [12305, 12307, 12309],
                             'Marienfelde': [12277, 12279],
                             'Lankwitz': [12247, 12249]
                            }

In [9]:
#create a list with all postal codes
all_postal_codes = set()
for v in suburbs_with_postal_codes.values():
    for v_ in v:
        all_postal_codes.add(v_)
#create a list to output wrong postal codes    
wrong_postal_codes = set()

#check if postal code is correct and then check if it corresponds to correct suburbs
def audit_postal_code(postal_code):
    if int(postal_code) not in all_postal_codes:
        wrong_postal_codes.add(int(postal_code))
        
def is_postal_code(elem):
    return (elem.attrib['k'] == "addr:postcode")

def audit_postal_codes(osmfile):
    osm_file = open(osmfile, 'rb')
    #set is a list without duplicates
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "way" or elem.tag == "node":
            for tag in elem.iter("tag"):
                if is_postal_code(tag):
                    audit_postal_code(tag.attrib['v'])
    osm_file.close()

audit_postal_codes(OSM_FILE)
print (wrong_postal_codes)

set()


All postal codes are correct!

Now we want to check our suburbs:

In [10]:
all_suburbs = set()
for v in suburbs_with_postal_codes.keys():
    all_suburbs.add(v)
#create a list to output wrong postal codes    
wrong_suburbs = set()

def audit_suburbs(osmfile):
    osm_file = open(osmfile, 'rb')
    #set is a list without duplicates
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "way" or elem.tag == "node":
            for tag in elem.iter("tag"):
                if is_suburb(tag):
                    audit_suburb(tag.attrib['v'])
    osm_file.close()

def audit_suburb(suburb):
    if (suburb) not in all_suburbs:
        wrong_suburbs.add(suburb)    
    
def is_suburb(elem):
    return (elem.attrib['k'] == "addr:suburb")

audit_suburbs(SAMPLE_FILE)
print (wrong_suburbs)

{'Plänterwald', 'Moabit', 'Alt-Hohenschönhausen', 'Rummelsburg', 'Fennpfuhl', 'Friedrichsfelde', 'Hansaviertel', 'Alt-Treptow'}


Uhh. There are some neighbourhoods that aren't suburbs.

Fennpfuhl, Alt-Hohenschönhausen, Friedrichsfelde and Rummelsburg are neighbourhoods in Lichtenberg.
Moabit and the Hansaviertel are neighbourhoods in Tiergarten.
Alt-Treptow and Plänterwald are neighbourhoods in Köpenick.

We will map these neighbourhoods to the corresponding suburbs before saving them to the mongoDB.

In the next step we want to see if postal codes are correct for the corresponding suburb.

In [11]:
#check if postal code and suburb are correct
check_suburbs_and_postal_codes = defaultdict(set)
    
def audit_postal_code_and_suburb(osmfile):
    osm_file = open(osmfile, 'rb')
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        current_suburb = None
        current_postal_code = None
        if elem.tag == "way" or elem.tag == "node":
            for tag in elem.iter("tag"):
                if is_suburb(tag):
                    current_suburb = tag.attrib['v']
                if is_postal_code(tag):
                    current_postal_code = tag.attrib['v']
        if current_postal_code and current_suburb:        
            check_suburbs_and_postal_codes[current_suburb] = int(current_postal_code)                             
    osm_file.close()

audit_postal_code_and_suburb(OSM_FILE) 
print (check_suburbs_and_postal_codes)

defaultdict(<class 'set'>, {'Plänterwald': 12435, 'Friedrichshain': 10245, 'Weißensee': 13088, 'Tiergarten': 10785, 'Rummelsburg': 10317, 'Pankow': 10439, 'Hansaviertel': 10555, 'Lichtenberg': 10367, 'Alt-Treptow': 12435, 'Mitte': 10115, 'Charlottenburg-Wilmersdorf': 10623, 'Lichtenrade': 10777, 'Prenzlauer Berg': 10405, 'Schöneberg': 10787, 'Moabit': 10559, 'Alt-Hohenschönhausen': 13055, 'Friedrichsfelde': 10245, 'Gesundbrunnen': 13357, 'Neukölln': 12045, 'Fennpfuhl': 10367, 'Wedding': 13351, 'Kreuzberg': 10969, 'Charlottenburg': 10623, 'Wilmersdorf': 10789})


In [12]:
def check_check(a, b):
    #check if key, value also exist in key and value array
    for key in a:
        if key in b:
            first, second = a[key], b[key]
            if first in second:
                pass
            else:
                print(key, first, second)
        else:
            print ('no suburb named', key)
            
check_check(check_suburbs_and_postal_codes, suburbs_with_postal_codes)  

no suburb named Plänterwald
no suburb named Rummelsburg
Pankow 10439 [13187, 13189]
no suburb named Hansaviertel
no suburb named Alt-Treptow
no suburb named Charlottenburg-Wilmersdorf
Lichtenrade 10777 [12305, 12307, 12309]
Schöneberg 10787 [10823, 10825, 10827, 10829, 10717, 10783]
no suburb named Moabit
no suburb named Alt-Hohenschönhausen
no suburb named Friedrichsfelde
no suburb named Fennpfuhl
Wilmersdorf 10789 [10707, 10709, 10711, 10713, 10715, 10719, 10717]


The wrong suburbs from above, where already found in the audit_suburbs(). And we found Charlottenburg-Wilmersdorf, this value combines two suburbs.These are going to be corrected later.
There are some suburbs that don't match with our postal codes.
In a next step we could check the exact address and find out if the postal code or the suburb is wrong. Or both.
But for now it looks ok.

In [13]:
''' 
reference:
https://classroom.udacity.com/nanodegrees/nd002/parts/0021345404/modules/316820862075462/lessons/768058569/concepts/8755386150923#'''

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
BICYCLE_WAYS = ["cycleway", "bicycle_road", "bicycle"]
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def shape_element(element):
    node = {}
    created_dict = {}
    pos_list = []
    long = 0
    lat = 0
    address_dict = {}
    bicycle_way = False

    if element.tag == "node" or element.tag == "way" :      
        node["type"] = element.tag
        for name, value in element.items():
            if name in CREATED:
                created_dict[name] = value
            elif name == 'long':
                lon = float(value)
            elif name == 'lat':
                lat = float(value)
            else:
                node[name] = value

            if len(created_dict):
                node["created"] = created_dict
            if lat:
                pos_list.append(lat)
            if long:
                pos_list.append(long)
            if len(pos_list):
                node["pos"] = pos_list
        
        for tag in element.iter("tag"):
            k = tag.attrib['k']
            v = tag.attrib['v']
            
            #get all bicycle ways, the german and the english values
            if k in BICYCLE_WAYS and value != "no":
                bicycle_way = True
            #get the address   
            if k == 'addr:suburb':
                address_dict['suburb'] = update_suburbs(v, mapping_suburbs)
            elif k == 'addr:postcode':
                address_dict['postal_code'] = v
            elif k == 'addr:street':
                address_dict['street'] = v
            elif k == 'addr:housenumber':
                address_dict['housenumber'] = v
            elif k == 'addr:country':
                pass
            elif k == 'addr:city':
                pass 
            elif problemchars.search(k):
                pass
            else: node[k] = v                 
                
        node_refs_list = []
        for nd in element.iter("nd"):
            node_refs_list.append(nd.attrib['ref'])
        if len(node_refs_list):
            node["node_refs"] = node_refs_list
        
        if bicycle_way is True: 
            node['bicycle_way'] = 'Yes'
        if len(address_dict):
            node['address'] = address_dict
        return node
    else:
        return None
                
#fix suburbs before exporting to mongoDB
mapping_suburbs = { "Alt-Treptow": "Köpenick",
                    "Moabit": "Tiergarten",
                    "Hansaviertel": "Tiergarten",
                    "Fennpfuhl": "Lichtenberg",
                    "Alt-Hohenschönhausen": "Lichtenberg",
                    "Rummelsburg": "Lichtenberg",
                    "Plänterwald": "Köpenick",
                    "Charlottenburg-Wilmersdorf": "Charlottenburg" }

def update_suburbs(suburb, mapping_suburbs):
    for key, value in mapping_suburbs.items():
        if suburb == key:
            suburb = value
    return suburb

def process_map(file_in):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            
            if el:
                data.append(el)
                fo.write(json.dumps(el) + "\n")
    return data

data = process_map('berlin_map.osm')

**Show first five elements of the generated .json data**

In [14]:
example = data[:5]
print (example)

[{'lon': '13.4477657', 'pos': [52.5265215, 52.5265215, 52.5265215], 'created': {'timestamp': '2015-07-19T12:27:05Z', 'uid': '1439784', 'version': '6', 'user': 'der-martin', 'changeset': '32731923'}, 'id': '12614600', 'type': 'node'}, {'lon': '13.4816429', 'pos': [52.512938, 52.512938, 52.512938], 'created': {'timestamp': '2015-12-06T19:23:44Z', 'uid': '43566', 'version': '9', 'user': 'anbr', 'changeset': '35793006'}, 'id': '12614606', 'type': 'node'}, {'lon': '13.4287152', 'pos': [52.5375413, 52.5375413, 52.5375413], 'created': {'timestamp': '2015-11-15T08:45:31Z', 'uid': '43566', 'version': '12', 'user': 'anbr', 'changeset': '35323067'}, 'id': '12614644', 'type': 'node'}, {'lon': '13.4444757', 'pos': [52.5295609, 52.5295609, 52.5295609], 'created': {'timestamp': '2016-08-31T19:16:25Z', 'uid': '43566', 'version': '4', 'user': 'anbr', 'changeset': '41833634'}, 'id': '12614650', 'type': 'node'}, {'lon': '13.4696325', 'pos': [52.5141418, 52.5141418, 52.5141418], 'created': {'timestamp': '

**Save data to mongoDB**

In [15]:
import pymongo
from pymongo import MongoClient

client = MongoClient('localhost', 27017)

db = client.berlin

def insert_osm_data(infile, db):
    db.berlin.drop()      
    #import data into a collection named "berlin"
    db.berlin.insert_many(infile)
    print (db.berlin.find_one())

insert_osm_data(data, db)

{'lon': '13.4477657', '_id': ObjectId('587f2bf3c7993a32c49f991e'), 'type': 'node', 'pos': [52.5265215, 52.5265215, 52.5265215], 'created': {'timestamp': '2015-07-19T12:27:05Z', 'uid': '1439784', 'changeset': '32731923', 'user': 'der-martin', 'version': '6'}, 'id': '12614600'}


In [29]:
#print (len(db.berlin.distinct("created")));

:( distinct works only up to 16mb

**Get statistics of our database**

In [34]:
stats = (db.command("dbstats"))

{'dataSize': 192994830.0, 'avgObjSize': 283.3746121848474, 'collections': 1, 'numExtents': 0, 'storageSize': 62980096.0, 'db': 'berlin', 'objects': 681059, 'ok': 1.0, 'views': 0, 'indexes': 1, 'indexSize': 5992448.0}


In [41]:
print ('Size of Collection')
print (convert_bytes(62980096.0))

Size of Collection
60.1 MB


**Count unique users**

In [17]:
print (len(db.berlin.distinct("created.user")));

2277


In [18]:
def pretty_print_list(d):
    for member in d:
        print (member['_id'], ':', member['count'])

** Count entries for suburbs **

In [19]:
# get top ten of contributing users
def count_entries_by_suburbs():
    pipeline = [{'$match': {'address.suburb': {'$exists': 1}}},
                {'$group' : { '_id' : '$address.suburb', 'count' : {'$sum' : 1}}},       
                {'$sort': {'count': -1}}]
    return pipeline

def aggregate(db, pipeline):
    return [doc for doc in db.berlin.aggregate(pipeline)]

pipeline = count_entries_by_suburbs()
result = aggregate(db, pipeline)
print ('Entries for suburbs:')
pretty_print_list(result);

Entries for suburbs:
Mitte : 7513
Prenzlauer Berg : 6303
Kreuzberg : 6100
Friedrichshain : 5866
Tiergarten : 4407
Lichtenberg : 3032
Schöneberg : 2943
Wedding : 1293
Gesundbrunnen : 1240
Köpenick : 588
Neukölln : 581
Wilmersdorf : 546
Charlottenburg : 481
Weißensee : 324
Pankow : 1
Lichtenrade : 1
Friedrichsfelde : 1


**Get top 10 contributiong users**

In [20]:
# get top ten of contributing users
def get_top_ten_users():
    pipeline = [{'$group' : { '_id' : '$created.user', 'count' : {'$sum' : 1}}},       
                {'$sort': {'count': -1}},
                { '$limit': 10}]
    return pipeline

def aggregate(db, pipeline):
    return [doc for doc in db.berlin.aggregate(pipeline)]

pipeline = get_top_ten_users()
result = aggregate(db, pipeline)
print ('Top 10 contributing users with counted entries:')
pretty_print_list(result);

Top 10 contributing users with counted entries:
atpl_pilot : 246334
MorbZ : 48504
Bot45715 : 34936
anbr : 34122
toaster : 19297
Polarbear : 13495
Berliner Igel : 13286
Shmias : 10802
wicking : 8506
Elwood : 8266


**Get an overview of different types**

In [21]:
def get_ways_and_nodes():
    pipeline =  [{'$group' : { '_id' : '$type', 'count' : {'$sum' : 1}}},       
                {'$sort': {'count': -1}}]
    return pipeline

pipeline = get_ways_and_nodes()
result = aggregate(db, pipeline)
print ('Different Types:')
pretty_print_list(result)  

Different Types:
node : 594822
way : 86165
multipolygon : 30
Schrank : 22
property_line : 4
kiosk : 2
bazar : 1
noise_barrier : 1
sewage : 1
furniture : 1
television : 1
sundial : 1
schwäbisch : 1
MFG18 : 1
Poliscan : 1
cable_distribution_cabinet : 1
parking_tickets : 1
public_transport : 1
woman : 1
turkish : 1


In [None]:
There are a lot of rare types. In one next step, we could clean this up aswell.

**Get top 10 amenities**

In [22]:
def get_overview_amenities():
    pipeline = [{'$match':{'amenity':{'$exists':1}}},
                {'$group' : { '_id': '$amenity', 'count' : {'$sum':1}}},
               {'$sort': {'count': -1}},
               {'$limit': 10}]
    return pipeline
    
pipeline = get_overview_amenities()
result = aggregate(db, pipeline)
print ('Amenities:')
pretty_print_list(result)

Amenities:
restaurant : 1709
bench : 1542
cafe : 1005
parking : 960
fast_food : 723
bicycle_parking : 623
recycling : 580
waste_basket : 578
kindergarten : 572
vending_machine : 499


**Get number of bycicle roads**

As we can see, there are 623 bycicle_parkings counted, which seems to be a lot. Because in Berlin there is currently a referndum  - https://volksentscheid-fahrrad.de/english/ - to make the city more bycicle friendly, we would like to take a closer look at the bycicle ways in the city.
Because of the mixture of german and english there are different nodes to mark cicleways. http://wiki.openstreetmap.org/wiki/Key:bicycle_road

In [23]:
def get_roads():
    pipeline = [{ '$group': {'_id': 'highway', 'count': {'$sum': 1}}}]
    return pipeline
    
pipeline = get_roads()
result = aggregate(db, pipeline)
print ('Roads:')
pretty_print_list(result)
print ('~~~~~~~~~~~~~~~~~~~~~~~')       
    
#find bycicle roads in berlin
def get_bicycle_roads():
    pipeline = [{ '$match': {'$or':
                    [{'bicycle': { '$in': ['official', 'designated', 'use_sidepath']}},
                    {'bicycle_road': 'yes'},
                    {'cycleway': {'$in': ['lane', 'opposite', 'shared', 'share_busway', 'track']}}]
                    }},
                {'$group' : { '_id': '$highway', 'count' : {'$sum':1}}},
                {'$sort': {'count': -1}}]          
    return pipeline

pipeline = get_bicycle_roads()
result = aggregate(db, pipeline)
print ('Bicycle Roads:')
pretty_print_list(result)
print ('~~~~~~~~~~~~~~~~~~~~~~~')   

def get_total_amount_of_cycleways():   
    pipeline = [{ '$match': {'$or':
                    [{'bicycle': { '$in': ['official', 'designated', 'use_sidepath']}},
                    {'bicycle_road': 'yes'},
                    {'cycleway': {'$in': ['lane', 'opposite', 'shared', 'share_busway', 'track']}}]
                    }},
                {'$group' : { '_id': None, 'count' : {'$sum':1}}}] 
    return pipeline
    
pipeline = get_total_amount_of_cycleways()
result = aggregate(db, pipeline)
print ('Bicycle Roads Total:')
pretty_print_list(result)
print ('~~~~~~~~~~~~~~~~~~~~~~~')   

def get_total_cycleways_with_cleaned_data():
    pipeline = [{ '$match': {'bicycle_way': 'Yes'} },
                {'$group' : { '_id': None, 'count' : {'$sum':1}}}] 
    return pipeline

pipeline = get_total_cycleways_with_cleaned_data()
result = aggregate(db, pipeline)
print ('Bicycle Roads Total - cleaned data:')
pretty_print_list(result)
print ('~~~~~~~~~~~~~~~~~~~~~~~') 

Roads:
highway : 681059
~~~~~~~~~~~~~~~~~~~~~~~
Bicycle Roads:
secondary : 996
primary : 529
tertiary : 347
residential : 195
path : 66
service : 20
cycleway : 20
pedestrian : 17
footway : 6
living_street : 6
construction : 3
None : 3
secondary_link : 2
primary_link : 1
~~~~~~~~~~~~~~~~~~~~~~~
Bicycle Roads Total:
None : 2211
~~~~~~~~~~~~~~~~~~~~~~~
Bicycle Roads Total - cleaned data:
None : 5652
~~~~~~~~~~~~~~~~~~~~~~~


Compared to the total number of highways, the number of bicycle roads in this map is extremly small
It would be also interresting to calculate the length of all bicycle ways and compare them to the normal street net. 
The Query with the cleaned data for bicycles shows over fifty percent more entries. It seems that we missed some values for the different bicycles keys in the query for the osm data.

*Hunger*
Because Berlin is known for its masses of restaurants we want to take a closer look at the types of restaurants:

In [24]:
def get_overview_amenities():
    pipeline = [{'$match': {'amenity': {'$exists': 1}, 
                            'amenity': {'$in': ['restaurant', 'fast_food', 'food_court', 'biergarten', 'bar', 'bbq', 'cafe'] 
                           }}},
               {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},
               {"$sort":{"count":-1}},
               {"$limit": 11}]

    return pipeline
    
pipeline = get_overview_amenities()
result = aggregate(db, pipeline)
print ('Restaurants by cuisine:')
pretty_print_list(result)

Restaurants by cuisine:
None : 1885
italian : 265
german : 117
asian : 114
kebab : 95
burger : 90
vietnamese : 84
regional : 83
indian : 81
coffee_shop : 70
pizza : 69


## Additional suggestions for improving and analyzing the data

#### Length of bicycle ways and alles bio
- it would be to calculate the length of all bycicley ways compared to the 'normal' street length. 
- It also would be interessting to calculate the the percentage of the surface types, because it is said that Berlin is a "green" city.

#### Improve the dataset
- Gamification would be a good way to encourage people to contribute more and correct data. For example games that are using geodata, like Pokemon Go or Geocaching,
    could add functionality to the osm data. People could be attracted by leveling up or get other benefits for the game.  
- Because nowadays google maps is very widely used and people can easily add amenities etc. via google, it would be great if that data could be merged into the OSM data.

#### Potential problems
 - The language gap may be a problem. It may be that keys from different languages exist in one area, altough they mean the same. When doing analytics this can be missed out easily. 
 - Missing or wrong data is a problem. This may skew findings in analytics or simply misslead people. http://maproulette.org/ could help to minor the problems.
 - Add totally wrong data in bad faith. As everone can add data to the OSM Project, it can happen that people add false data just for fun.

## Conclusion
The data set was cleaned, so that the language gap is bridged for our investigations. We could have cleaned out the data more, but as we want to have an basic overwiew, the cleaning amount is right. 
It is quite surprising that in a city where almost all inhabitants are "super greenies", the amount of bicycle ways is that small compared to the normal highways. It would be interesting to check the amount of bicycle ways in Amsterdam and compare it to Berlin.
