# OpenStreetMap Sample Project Data Wrangling with MongoDB
### Author: 夏强 （Xia Qiang）

##  Problems Encountered in the Map

1. Overlong street names (五一路w青年中路北255&学田南路南205) need to be shortened (五一路).
2. Street names written in two languages (松花一村 Songhua Community #1) need to be reduced to the Chinese part (松花一村).
3. English and Pinyin street names (Wuzhong Rd.) need to be translated into Chinese (吴中路).


## References

* http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
* http://blog.csdn.net/gatieme/article/details/43235791
* http://stackoverflow.com/questions/28544686/unicodeencodeerror-ascii-codec-cant-encode-characters-in-position-0-5-ordin

## Fixing

* Problems 1 and 2 can be fixed by keeping only the first continuous block of Chinese characters.

 I use re.compile(u"[\u4e00-\u9fa5]{2,}") to match the street name and store the first match in the maps dict.
 
 for example:
  u'\u677e\u82b1\u4e00\u6751 Songhua Community #1': u'\u677e\u82b1\u4e00\u6751',
  
  u'\u6843\u575e\u8defs(\u5317\u4fa7)\u8dc3\u9f99\u8def\u897f35&\u8398\u56ed\u8def\u4e1c340': u'\u6843\u575e\u8def',
 

* For problem 3, I use the Python translate package, and manually translate the street names that the package does not translate well.
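
The first-block rule for problems 1 and 2 can be sketched in isolation as below. This is a minimal illustration written in Python 3 syntax, not the notebook's exact code; the helper name `first_chinese_block` is made up for this example.

```python
import re

# Same character range as the notebook: two or more consecutive CJK ideographs.
chinese_re = re.compile(u"[\u4e00-\u9fa5]{2,}")


def first_chinese_block(name):
    """Return the first run of Chinese characters in name, or None if there is none."""
    m = chinese_re.search(name)
    return m.group(0) if m else None


print(first_chinese_block(u"松花一村 Songhua Community #1"))      # 松花一村
print(first_chinese_block(u"五一路w青年中路北255&学田南路南205"))  # 五一路
print(first_chinese_block(u"Wuzhong Rd."))                        # None
```

Names with no Chinese block return None here, which corresponds to the case that falls through to translation (problem 3).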

Update code:

def update_name(name):

    m = chinese_re.search(name)
    better_name = name
    if m:
        better_name = m.group(0)
    #### Manual translation for some street names
    elif name =='Lane 1555 Jinshajiang road(west)':
        better_name = u'金沙西路'
    elif name =='WenSanLu DianZi XinXi JieQu, Xihu':
        better_name = u'文三路'
    elif name =='ZhongShangNanEr Lu':
        better_name = u'中山南二路'
    elif name =='Yuanli Rd':
        better_name = u'袁立路'
    elif name =='Huashang Rd':
        better_name = u'华商路'
    elif name =='Wensan West Rode':
        better_name = u'文三西路'
    elif name =='hehuaxing':
        better_name = u'荷花形'
    #### machine translation for other street names
    else:
        better_name = translator.translate(name)
    return better_name
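
The chain of `elif` branches can also be written as a lookup table, which keeps the manual translations in one place. The sketch below is an alternative formulation in Python 3 syntax, not the code that was actually run; `MANUAL_TRANSLATIONS`, the `translate` parameter, and `fake_translate` are names assumed for illustration, and the callable only stands in for `translator.translate`.

```python
import re

chinese_re = re.compile(u"[\u4e00-\u9fa5]{2,}")

# Manual translations copied from the elif chain in the notebook.
MANUAL_TRANSLATIONS = {
    u"Lane 1555 Jinshajiang road(west)": u"金沙西路",
    u"WenSanLu DianZi XinXi JieQu, Xihu": u"文三路",
    u"ZhongShangNanEr Lu": u"中山南二路",
    u"Yuanli Rd": u"袁立路",
    u"Huashang Rd": u"华商路",
    u"Wensan West Rode": u"文三西路",
    u"hehuaxing": u"荷花形",
}


def update_name(name, translate):
    """Clean one street name; translate is any callable mapping str to str."""
    m = chinese_re.search(name)
    if m:
        # Problems 1 and 2: keep the first block of Chinese characters.
        return m.group(0)
    if name in MANUAL_TRANSLATIONS:
        # Problem 3, names handled by hand.
        return MANUAL_TRANSLATIONS[name]
    # Problem 3, everything else goes to machine translation.
    return translate(name)


def fake_translate(name):
    # Hypothetical placeholder so the sketch runs offline.
    return u"<translated: %s>" % name


print(update_name(u"hehuaxing", fake_translate))
print(update_name(u"Ding Xiang Road", fake_translate))
```

Passing the translator in as an argument is a design choice of this sketch; it makes the function testable without network access.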
 

## Data Overview
### File sizes
                                                
shanghai_china.osm ....... 575M

shanghai_china.osm.json .... 591M 

### Summary Statistics of Data
* Number of documents: 3118886
* Number of unique users: 1754
* Number of nodes: 2782288
* Number of ways: 336598

### Top 10 users
* {u'count': 385090, u'_id': u'Chen Jia'}
* {u'count': 173176, u'_id': u'aighes'}
* {u'count': 128684, u'_id': u'katpatuka'}
* {u'count': 128497, u'_id': u'XBear'}
* {u'count': 115870, u'_id': u'yangfl'}
* {u'count': 103682, u'_id': u'dkt'}
* {u'count': 103017, u'_id': u'Holywindon'}
* {u'count': 95124, u'_id': u'u_kubota'}
* {u'count': 86470, u'_id': u'jamesks'}
* {u'count': 84132, u'_id': u'zzcolin'}

* The top 10 contributing users account for 45.00780086223094% of all documents.
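
The percentage can be cross-checked from the figures listed above without touching the database. The snippet is only an arithmetic check in Python 3 syntax; the variable names are chosen for this example.

```python
# Counts of the top 10 users and the summary statistics, as reported above.
top10_counts = [385090, 173176, 128684, 128497, 115870,
                103682, 103017, 95124, 86470, 84132]
total_documents = 3118886
nodes, ways = 2782288, 336598

top10_total = sum(top10_counts)
share = 100.0 * top10_total / total_documents

print(top10_total)                       # 1403742
print(round(share, 4))                   # 45.0078
print(nodes + ways == total_documents)   # True
```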

## Additional Ideas
#### Suggestions for improving and analyzing the data
* Problem: multilingual names.
* Suggestion: build a dictionary of place names so that every place has both an English and a Chinese name, then produce two versions of the OSM file, Shanghai_Chinese_placename.OSM and Shanghai_English_placename.OSM.

#### Benefits and problems of the improvement

* Benefits: it removes naming problems such as one place having several names, and makes the map more readable for people from different countries.
* Problems: increased workload.



## code and results


# 1. Find problems

##  street types and street names


In [41]:

import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
import sys

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
            "Trail", "Parkway", "Commons"]

mapping = { "St": "Street",
            "St.": "Street",
            "Ave": "Avenue",
            "Rd.": "Road"
            }

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)


def is_street_name(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "addr:street")


def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if is_street_name(elem):
            for tag in elem.iter():
                audit_street_type(street_types, tag.attrib['v'])

    return street_types

def run(osm_file):
    st_types = audit(osm_file)
    st_dict = dict(st_types).items()
    for k,v in st_dict:
        print 'type: ' 
        print k.encode('utf8')
        print 'names: '
        for i in v:
            print i.encode('utf8')  

In [42]:
run('sample.osm')

type: 
2731弄
names: 
沪南路2731弄
type: 
w(新秦灶人家园门南)永怡路北495&永昌路南75
names: 
国强路w(新秦灶人家园门南)永怡路北495&永昌路南75
type: 
1028弄2支弄
names: 
秀沿路1028弄2支弄
type: 
2727弄
names: 
沪南路2727弄
type: 
Rd.
names: 
Wuzhong Rd.
type: 
e(入口北)桃园路北195&崇川路南405
names: 
工农南路e(入口北)桃园路北195&崇川路南405
type: 
2729弄
names: 
沪南路2729弄
type: 
s(北侧)跃龙路西35&莘园路东340
names: 
桃坞路s(北侧)跃龙路西35&莘园路东340
type: 
e(门北)江淮路北35&崇川路南385
names: 
崇文路e(门北)江淮路北35&崇川路南385
type: 
26弄
names: 
安宁路26弄
type: 
1
names: 
松花一村 Songhua Community #1
type: 
450号w(门北)濠北路北320&钟秀中路南400
names: 
工农路450号w(门北)濠北路北320&钟秀中路南400
type: 
358弄
names: 
鹤庆路358弄
type: 
88弄
names: 
安宁路88弄
type: 
hehuaxing
names: 
hehuaxing
type: 
road(west)
names: 
Lane 1555 Jinshajiang road(west)
type: 
w青年中路北255&学田南路南205
names: 
五一路w青年中路北255&学田南路南205
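
The odd-looking "types" above follow from where `street_type_re` can start matching. In Python 2, a pattern compiled without `re.UNICODE` treats only ASCII letters, digits, and the underscore as word characters, so `\b` cannot match between two Chinese characters, and the match most likely begins at the last ASCII token of the name. The sketch below reproduces this under Python 3 by adding `re.ASCII`; that flag is an assumption made here to mimic the Python 2 behaviour and is not part of the notebook.

```python
import re

# re.ASCII restricts \w (and therefore \b) to ASCII, as Python 2 does by default.
street_type_re_ascii = re.compile(r'\b\S+\.?$', re.IGNORECASE | re.ASCII)

samples = [u"沪南路2731弄", u"Wuzhong Rd.", u"松花一村 Songhua Community #1"]
for name in samples:
    m = street_type_re_ascii.search(name)
    print(name, "->", m.group() if m else None)
```

With these inputs the reported types are `2731弄`, `Rd.`, and `1`, matching the audit output above.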


## Things that need to be changed
1. Overlong street names (五一路w青年中路北255&学田南路南205) need to be shortened (五一路).
2. Street names written in two languages (松花一村 Songhua Community #1) need to be reduced to the Chinese part (松花一村).
3. English and Pinyin street names (Wuzhong Rd.) need to be translated into Chinese (吴中路).

# 2. Fix problems

* Problems 1 and 2 can be fixed by keeping only the first continuous block of Chinese characters.
* For problem 3, I use the Python translate package, and manually translate the street names that the package does not translate well.

### Translation errors (names translated manually):
* 'Lane 1555 Jinshajiang road(west)'
* 'WenSanLu DianZi XinXi JieQu, Xihu'
* 'ZhongShangNanEr Lu'
* 'Yuanli Rd'
* 'Huashang Rd'
* 'Wensan West Rode'
* 'hehuaxing'



In [44]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
import sys
from translate import Translator as T
translator= T(to_lang="zh")

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
## re match continuous Chinese characters block
chinese_re = re.compile(u"[\u4e00-\u9fa5]{2,}")

def is_street_name(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "addr:street")


def update_name(name):
    m = chinese_re.search(name)
    better_name = name
    if m:
        better_name = m.group(0)
    # Manual translation for some street names
    elif name =='Lane 1555 Jinshajiang road(west)':
        better_name = u'金沙西路'
    elif name =='WenSanLu DianZi XinXi JieQu, Xihu':
        better_name = u'文三路'
    elif name =='ZhongShangNanEr Lu':
        better_name = u'中山南二路'
    elif name =='Yuanli Rd':
        better_name = u'袁立路'
    elif name =='Huashang Rd':
        better_name = u'华商路'
    elif name =='Wensan West Rode':
        better_name = u'文三西路'
    elif name =='hehuaxing':
        better_name = u'荷花形'
    # machine translation for other street names
    else:
        better_name = translator.translate(name)
        
        
    
        
    return better_name
# store each street name and its better name as key and value in a dict
def audit(osmfile):
    osm_file = open(osmfile,'r')
    lst = []
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if is_street_name(elem):
            name = elem.attrib['v']
            better_name = update_name(name)
            if (name,better_name) not in lst:
                lst.append((name,better_name))
        elem.clear()
    return dict(lst)
def run(osmfile):
    maps = audit(osmfile)
    pprint.pprint(maps)
    

In [3]:
run('sample.osm')

{'Alley 3668, Xiuyan Road': u'\u5c0f\u5df7 3668\uff0c\u5cab\u5ca9\u8def',
 'Ding Xiang Road': u'\u4e01\u4e61\u8def',
 'Lane 1555 Jinshajiang road(west)': u'\u91d1\u6c99\u897f\u8def',
 'Wuzhong Rd.': u'\u5434\u4e2d\u8def',
 'Xiuyan Road': u'\u5cab\u5ca9\u8def',
 'hehuaxing': u'\u8377\u82b1\u5f62',
 u'\u4e09\u65b0\u5317\u8def': u'\u4e09\u65b0\u5317\u8def',
 u'\u4e2d\u5c71\u4e1c\u4e00\u8def': u'\u4e2d\u5c71\u4e1c\u4e00\u8def',
 u'\u4e30\u6f6d\u8def': u'\u4e30\u6f6d\u8def',
 u'\u4e94\u4e00\u8defw\u9752\u5e74\u4e2d\u8def\u5317255&\u5b66\u7530\u5357\u8def\u5357205': u'\u4e94\u4e00\u8def',
 u'\u4f1a\u6587\u8def': u'\u4f1a\u6587\u8def',
 u'\u4fdd\u4e50\u8def': u'\u4fdd\u4e50\u8def',
 u'\u5174\u4e1a\u8def': u'\u5174\u4e1a\u8def',
 u'\u56ed\u4e8c\u8def': u'\u56ed\u4e8c\u8def',
 u'\u56fd\u5b9a\u8def': u'\u56fd\u5b9a\u8def',
 u'\u56fd\u5f3a\u8defw(\u65b0\u79e6\u7076\u4eba\u5bb6\u56ed\u95e8\u5357)\u6c38\u6021\u8def\u5317495&\u6c38\u660c\u8def\u535775': u'\u56fd\u5f3a\u8def',
 u'\u5b66\u9662\u8def':

In [None]:
maps = audit("shanghai_china.osm")

### A Problem has been Encountered
#### audit("shanghai_china.osm") takes too much time and does not seem to finish

* Fix: iterparse the OSM file once to get the list of unique street names, then fix the names in that list
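
One plausible reason the first version was slow is that `update_name`, and with it the translation call, ran for every `addr:street` tag instead of once per distinct name. The sketch below shows the "collect unique names first" step on a small in-memory document, in Python 3 syntax. The function name `unique_street_names`, the use of a set for membership tests, and the sample XML are assumptions of this example, not the notebook's code.

```python
import io
import xml.etree.ElementTree as ET


def unique_street_names(osm_source):
    """Stream through an OSM file and return distinct addr:street values in first-seen order."""
    seen = set()      # O(1) membership test, unlike `name in list`
    ordered = []
    for _, elem in ET.iterparse(osm_source, events=("end",)):
        if elem.tag == "tag" and elem.attrib.get("k") == "addr:street":
            name = elem.attrib["v"]
            if name not in seen:
                seen.add(name)
                ordered.append(name)
        elem.clear()  # release the element once it has been inspected
    return ordered


# Tiny made-up sample standing in for shanghai_china.osm.
sample = b"""<osm>
  <node id="1"><tag k="addr:street" v="Wuzhong Rd."/></node>
  <node id="2"><tag k="addr:street" v="Wuzhong Rd."/></node>
  <way id="3"><tag k="addr:street" v="Xiuyan Road"/><tag k="name" v="x"/></way>
</osm>"""

names = unique_street_names(io.BytesIO(sample))
print(names)
```

Each name in the resulting list then needs to be cleaned or translated only once.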

In [1]:
import xml.etree.cElementTree as ET
import re
import pprint

def is_street_name(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "addr:street")
def getname(osmfile):
    osm_file = open(osmfile, "r")
    lst = []
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if is_street_name(elem):
            name = elem.attrib['v']
            if not name in lst:
                lst.append(name)
             
        elem.clear()
    return lst


In [2]:
names = getname("shanghai_china.osm")

In [3]:
len(names)

1079

In [4]:
import xml.etree.cElementTree as ET
import re
import pprint
from translate import Translator as T
translator= T(to_lang="zh")

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
chinese_re = re.compile(u"[\u4e00-\u9fa5]{2,}")


def update_name(name):
    m = chinese_re.search(name)
    better_name = name
    if m:
        better_name = m.group(0)
    # Manual translation for some street names
    elif name =='Lane 1555 Jinshajiang road(west)':
        better_name = u'金沙西路'
    elif name =='WenSanLu DianZi XinXi JieQu, Xihu':
        better_name = u'文三路'
    elif name =='ZhongShangNanEr Lu':
        better_name = u'中山南二路'
    elif name =='Yuanli Rd':
        better_name = u'袁立路'
    elif name =='Huashang Rd':
        better_name = u'华商路'
    elif name =='Wensan West Rode':
        better_name = u'文三西路'
    elif name =='hehuaxing':
        better_name = u'荷花形'
    # machine translation for other street names
    else:
        better_name = translator.translate(name)
        
        
    
        
    return better_name
# store each street name and its better name as key and value in a dict
def audit(names):
    lst = []
    for name in names:
        better_name = update_name(name)
        if (name,better_name) not in lst:
                lst.append((name,better_name))
    return dict(lst)

In [5]:
maps = audit(names)

In [7]:
## store maps in case needed later
import pickle
mapfile = "maps.pkl"
with open(mapfile, "wb") as fo:  # binary mode is the portable choice for pickle
    pickle.dump(maps, fo)

# 3. Shape the data for MongoDB

In [77]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
import pickle
"""
Your task is to wrangle the data and transform the shape of the data
into the model we mentioned earlier. The output should be a list of dictionaries
that look like this:

{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}

You have to complete the function 'shape_element'.
We have provided a function that will parse the map file, and call the function with the element
as an argument. You should return a dictionary, containing the shaped data for that element.
We have also provided a way to save the data in a file, so that you could use
mongoimport later on to import the shaped data into MongoDB. 

Note that in this exercise we do not use the 'update street name' procedures
you worked on in the previous exercise. If you are using this code in your final
project, you are strongly encouraged to use the code from previous exercise to 
update the street names before you save them to JSON. 

In particular the following things should be done:
- you should process only 2 types of top level tags: "node" and "way"
- all attributes of "node" and "way" should be turned into regular key/value pairs, except:
    - attributes in the CREATED array should be added under a key "created"
    - attributes for latitude and longitude should be added to a "pos" array,
      for use in geospatial indexing. Make sure the values inside "pos" array are floats
      and not strings. 
- if the second level tag "k" value contains problematic characters, it should be ignored
- if the second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
- if the second level tag "k" value does not start with "addr:", but contains ":", you can
  process it in a way that you feel is best. For example, you might split it into a two-level
  dictionary like with "addr:", or otherwise convert the ":" to create a valid key.
- if there is a second ":" that separates the type/direction of a street,
  the tag should be ignored, for example:

<tag k="addr:housenumber" v="5158"/>
<tag k="addr:street" v="North Lincoln Avenue"/>
<tag k="addr:street:name" v="Lincoln"/>
<tag k="addr:street:prefix" v="North"/>
<tag k="addr:street:type" v="Avenue"/>
<tag k="amenity" v="pharmacy"/>

  should be turned into:

{...
"address": {
    "housenumber": 5158,
    "street": "North Lincoln Avenue"
}
"amenity": "pharmacy",
...
}

- for "way" specifically:

  <nd ref="305896090"/>
  <nd ref="1719825889"/>

should be turned into
"node_refs": ["305896090", "1719825889"]
"""
with open("maps.pkl", "rb") as f:
    maps =  pickle.load(f)

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]


def shape_element(element):
    node = {}
    address = {}
    if element.tag == "node" or element.tag == "way" :
        created = {}
        pos = [None, None]
        noderef = []
        node['type'] = element.tag
 
        for k,v in element.items(): 
            # fill created and pos
            if k in CREATED:
                created[k]= v
            elif k == "lat": pos[0] = float(v)
            elif k == "lon": pos[1] = float(v)
            else : 
                node[k] = v
        node['created'] = created
        node['pos'] = pos
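        for tag in element.iter("tag"):
            if not 'k' in tag.keys():
                continue
            tagk = tag.attrib['k']
            tagv = tag.attrib['v']
            m = problemchars.search(tagk)
            if not m:
                if tagk == "addr:street":
                    # maps is a dict, so look the name up instead of calling it;
                    # fall back to the raw value for names missing from maps
                    address['street'] = maps.get(tagv, tagv)
                elif tagk == 'addr:housenumber':
                    address['housenumber'] = tagv
                elif 'addr' not in tagk:
                    node[tagk] = tagv
                else:
                    pass
        # attach the address once, even when only a street is present
        if address:
            node['address'] = address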
        
        # fill noderef   
        for nd in element.iter("nd"):
            if not 'ref' in nd.keys():
                continue
            noderef.append(nd.attrib['ref'])
            node['node_refs'] = noderef
        return node
    else :
        return None


def process_map(file_in, pretty = False):
    # You do not need to change this file
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
            element.clear()
    return data
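
The address handling inside `shape_element` can be illustrated on its own. Because `maps` is a plain dict of original name to cleaned name, the cleaned street is obtained by a lookup with the raw value as fallback. The sketch below uses Python 3 syntax; the helper name `shape_address`, the sample element, and its coordinates are made up for this example.

```python
import xml.etree.ElementTree as ET


def shape_address(element, maps):
    """Build the "address" sub-document for one node or way element."""
    address = {}
    for tag in element.iter("tag"):
        key = tag.attrib.get("k")
        value = tag.attrib.get("v")
        if key == "addr:street":
            # Use the cleaned name when one exists, otherwise keep the raw value.
            address["street"] = maps.get(value, value)
        elif key == "addr:housenumber":
            address["housenumber"] = value
    return address


sample = ET.fromstring(
    '<node id="1" lat="31.2" lon="121.4">'
    '<tag k="addr:street" v="Wuzhong Rd."/>'
    '<tag k="addr:housenumber" v="12"/>'
    '<tag k="amenity" v="cafe"/>'
    '</node>'
)
maps = {u"Wuzhong Rd.": u"吴中路"}
print(shape_address(sample, maps))
```

An element without any `addr:` tags yields an empty dict, so the caller can skip adding an `address` key in that case.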
    
   

In [78]:
def run(osm_file):
    # NOTE: if you are running this code on your computer, with a larger dataset,
    # call the process_map procedure with pretty=False. The pretty=True option adds
    # additional spaces to the output, making it significantly larger.
    data = process_map(osm_file, False)
    return data

In [79]:
data = run("shanghai_china.osm")

## Insert the data into local MongoDB Database

In [80]:
from pymongo import MongoClient
client = MongoClient()
db = client.Shanghai
collection = db.Map
collection.drop()
collection = db.Map
collection.insert_many(data)

<pymongo.results.InsertManyResult at 0x179bd8a68>

In [81]:
collection

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'Shanghai'), u'Map')

## Data Overview

In [70]:
# Number of documents
collection.find().count()

3118886

In [71]:
# Number of nodes
collection.find({"type":"node"}).count()

2782288

In [72]:
# Number of ways
collection.find({"type":"way"}).count()

336598

In [72]:
# Number of unique users
len(collection.distinct( "created.user" ))

1754

In [30]:
# Top 10 contributing users
Top10_user = collection.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, {"$sort":{"count":-1}}, {"$limit":10}])
for i in Top10_user:
    print i

{u'count': 385090, u'_id': u'Chen Jia'}
{u'count': 173176, u'_id': u'aighes'}
{u'count': 128684, u'_id': u'katpatuka'}
{u'count': 128497, u'_id': u'XBear'}
{u'count': 115870, u'_id': u'yangfl'}
{u'count': 103682, u'_id': u'dkt'}
{u'count': 103017, u'_id': u'Holywindon'}
{u'count': 95124, u'_id': u'u_kubota'}
{u'count': 86470, u'_id': u'jamesks'}
{u'count': 84132, u'_id': u'zzcolin'}
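
For readers less familiar with the aggregation framework, the `$group`, `$sort`, `$limit` pipeline above computes roughly what the following pure-Python sketch does on an in-memory list. It is an analogy in Python 3 syntax, not a replacement for the query; `top_contributors` and the sample documents are invented for this example.

```python
from collections import Counter


def top_contributors(documents, limit=10):
    """Count documents per created.user and return the most frequent users."""
    counts = Counter(doc["created"]["user"] for doc in documents)   # like $group with $sum: 1
    return [{"_id": user, "count": n}
            for user, n in counts.most_common(limit)]               # like $sort desc + $limit


docs = [
    {"created": {"user": "alice"}},
    {"created": {"user": "bob"}},
    {"created": {"user": "alice"}},
    {"created": {"user": "carol"}},
    {"created": {"user": "alice"}},
    {"created": {"user": "bob"}},
]
print(top_contributors(docs, limit=2))
```

One difference worth noting: users with equal counts come back in first-seen order here, whereas the order of ties after a MongoDB `$sort` on `count` alone is not guaranteed.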


### Additional data exploration using MongoDB queries

In [31]:
## Top 10 contributing users as a percentage of total documents
top10_doc = collection.find({"created.user":{"$in":['Chen Jia','aighes','katpatuka','XBear','yangfl',
                                                    'dkt','Holywindon','u_kubota','jamesks','zzcolin']}}).count()
total_doc = collection.find().count()

In [32]:
100*float(top10_doc)/float(total_doc)

45.00780086223094