# 1. Problems Encountered in the Map

## 1.1 Running "audit.py"

I downloaded the map of Beijing, China in XML format. After a little exploration, I ran "audit.py" against the XML file and printed the results in an organized way.

In [21]:
execfile( "audit.py")



***********************************************************
Problem 1: inconsistency in keys
Unicode keys in the "tag" element are:
[u'\u758f\u6563\u4eba\u6570', u'\u9ec4\u5357\u82d1\u5c0f\u533a', u'\u8f66\u5e93']
-----------------------------------------------------------
Strings containing hyphen are:
['name:bat-smg', 'name:roa-rup', 'x-shop', 'x-highway', 'x-name', 'name:cbk-zam', 'name:zh-simplified', 'name:roa-tara', 'name:be-x-old', 'name:zh-yue', 'name:zh-traditional', 'name:zh-min-nan', 'name:zh-classical']
-----------------------------------------------------------
Strings containing uppercase letters or abbreviated words are:
['No.', 'currency:CNY', 'gns:UFI', 'gns:UNI', 'gns:DSG', 'gns:ADM1', 'FIXME']
***********************************************************




***********************************************************
Problem 2: colon(s) in keys
Strings containing single colon:
['ref:en', 'name:uz', 'name:ur', 'name:ug', 'payment:debit_cards', 'name:ilo', 'name:ia', '
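
The categories reported above can be reproduced with a few simple checks on each key. The following is only a minimal sketch, not the actual contents of "audit.py": the function names `classify_key` and `audit_keys` and the category labels are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Uppercase letters or a dot (as in the abbreviation 'No.')
UPPER_OR_DOT = re.compile(r"[A-Z.]")


def classify_key(key):
    """Return the set of audit categories that a tag key falls into."""
    categories = set()
    if any(ord(ch) > 127 for ch in key):
        categories.add("non_ascii")            # e.g. keys written in Chinese
    if "-" in key:
        categories.add("hyphen")
    if UPPER_OR_DOT.search(key):
        categories.add("uppercase_or_abbrev")
    colons = key.count(":")
    if colons == 1:
        categories.add("single_colon")
    elif colons >= 2:
        categories.add("multi_colon")
    return categories


def audit_keys(keys):
    """Group keys by category; a key may appear in several categories."""
    report = defaultdict(set)
    for key in keys:
        for category in classify_key(key):
            report[category].add(key)
    return report
```

A key such as 'name:zh-yue' lands in both the hyphen group and the single-colon group, which is why the two problems are handled as separate cleaning steps.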

## 1.2 Three Problems

Three problems are presented:

1. Inconsistent tag keys. Uniform keys make searching in MongoDB easier. I took three steps to make the keys more uniform. First, a handful of keys in the XML are written in Chinese characters; I discard these tags. Second, many multi-word keys use either an underscore ("_") or a hyphen ("-") as the separator; I replace every hyphen with an underscore. Third, I convert all uppercase letters to lowercase. These steps are implemented in "xml_to_json.py".

2. Colons in tag keys. As the output of "audit.py" shows, some keys contain a single colon (":") and some contain two. A colon usually marks a sub-class, and a smaller number of keys contain two colons, which indicates a second-level sub-class. I treat a single level of sub-classing as the standard form, so I replace the second colon with an underscore, leaving every such key with exactly one colon. I then build sub-classes from these keys. For example, the three keys "name:ch:simplified", "name:ch:traditional" and "name:en" form the structure name_other:{ch_simplified:..., ch_traditional:..., en:...}. Note that I add the suffix "_other" to name the new group. This step is also in "xml_to_json.py".

3. Names. I noticed that some values of the 'name' tag are neither Chinese characters nor English words, such as "Guxiang 20" in the "audit.py" output above. These names are all pinyin, a phonetic system for Chinese characters, and pinyin is rarely used for names in Chinese. Luckily, these pinyin names can easily be translated into more meaningful Chinese words, so I replace all pinyin values of the 'name' tag during processing. This step is also in "xml_to_json.py".
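
The first two fixes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from "xml_to_json.py": the names `normalize_key` and `shape_tags` are hypothetical, the input is assumed to be a flat dictionary of raw tag keys and values, and any colon after the first one is turned into an underscore. The pinyin replacement is left out because it depends on a translation table that is not shown here.

```python
def normalize_key(key):
    """Return a cleaned key, or None if the key should be discarded."""
    if any(ord(ch) > 127 for ch in key):
        return None                      # drop keys written in Chinese characters
    return key.replace("-", "_").lower()  # hyphen -> underscore, then lowercase


def shape_tags(raw_tags):
    """Turn a flat {key: value} mapping into a dict with one level of nesting."""
    shaped = {}
    for raw_key, value in raw_tags.items():
        key = normalize_key(raw_key)
        if key is None:
            continue
        if ":" not in key:
            shaped[key] = value
            continue
        # Split at the first colon only; later colons become underscores.
        parent, child = key.split(":", 1)
        child = child.replace(":", "_")
        # The "_other" suffix keeps the group apart from a plain "name" key.
        shaped.setdefault(parent + "_other", {})[child] = value
    return shaped
```

With this sketch, the keys "name:ch:simplified", "name:ch:traditional" and "name:en" end up under a single "name_other" entry with the children "ch_simplified", "ch_traditional" and "en", while a plain "name" key is kept alongside it.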

## 1.3 Structures of Cleaning Process

I will briefly describe the structure of the program that deals with the problems above. The following figure shows the main structure of "xml_to_json.py".

<img src="figures/main_program_flow.png" style="max-width:100%; width: 40%; max-width: none">

The program builds a list containing one dictionary per XML element. This large list is built by shaping each element individually. There are three shaping methods, one for each kind of element: node, way and relation. The structure for shaping a node is as follows:

<img src="figures/shape_node_flow.png" style="max-width:100%; width: 40%; max-width: none">

and way:

<img src="figures/shape_way_flow.png" style="max-width:100%; width: 40%; max-width: none">

and relation:

<img src="figures/shape_relation_flow.png" style="max-width:100%; width: 40%; max-width: none">

Problems 2 and 3 are solved in the functions "process_node_tags", "process_way_tags" and "process_relation_tags". Problem 1 is solved in "node_key_filters", "way_key_filters" and "relation_key_filters". The program is structured this way to accommodate further data wrangling steps in the future.
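
One way to picture this structure is a per-element-type list of cleaning stages that each shaped element passes through. The sketch below is an assumption about the general shape only, not the actual program: `STAGES`, `shape_element` and the two example stages are hypothetical names, and the real filters and processors do more than is shown.

```python
def drop_non_ascii_keys(tags):
    """Example key filter: discard tags whose key is not plain ASCII."""
    return {k: v for k, v in tags.items() if all(ord(c) < 128 for c in k)}


def lowercase_keys(tags):
    """Example key filter: convert every key to lowercase."""
    return {k.lower(): v for k, v in tags.items()}


# Hypothetical registry: each element type has its own ordered list of stages,
# so a new cleaning step is added by appending one function to a list.
STAGES = {
    "node": [drop_non_ascii_keys, lowercase_keys],
    "way": [drop_non_ascii_keys, lowercase_keys],
    "relation": [drop_non_ascii_keys, lowercase_keys],
}


def shape_element(element_type, attributes, tags):
    """Build one document from an element, or return None for other elements."""
    if element_type not in STAGES:
        return None
    for stage in STAGES[element_type]:
        tags = stage(tags)
    document = {"type": element_type}
    document.update(attributes)
    if tags:
        document["tag"] = tags   # elements without tags get no "tag" field
    return document
```

Keeping a separate list per element type means a step that only makes sense for ways or relations can be added without touching how nodes are shaped.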

# 2. Data Overview

The data shaped by the process above is imported into MongoDB. The JSON objects are inserted into the database "examples" as the collection "beijing_maps".

In [5]:
from db_functions import *
insert_maps(JSON_FILE, DB_NAME)
all_collections = get_collections(DB_NAME)
print JSON_FILE
print DB_NAME

../beijing_china.osm.json
examples


## 2.1 Size of xml file and json file

In [6]:
!du -h ../beijing_china.osm
!du -h ../beijing_china.osm.json

130M	../beijing_china.osm
142M	../beijing_china.osm.json


## 2.2 Some Total Counts

### 2.2.1 Total count of all the documents.

In [7]:
all_collections.find().count()

699609

### 2.2.2 Total count of nodes, ways and relations.

In [8]:
all_collections.find({"type":"node"}).count()

605204

In [9]:
all_collections.find({"type":"way"}).count()

88958

In [10]:
all_collections.find({"type":"relation"}).count()

5447

### 2.2.3 Counts associated with 'tag' element

Number of node, way and relation elements containing 'tag'.

In [11]:
all_collections.find({"type":"node", "tag":  {"$exists": 1}}).count()

34490

In [12]:
all_collections.find({"type":"way", "tag":  {"$exists": 1}}).count()

88240

In [13]:
all_collections.find({"type":"relation", "tag":  {"$exists": 1}}).count()

5437

Number of node, way and relation elements containing 'name'.

In [14]:
all_collections.find({"type":"node", "tag.name":  {"$exists": 1}}).count()

9610

In [15]:
all_collections.find({"type":"way", "tag.name":  {"$exists": 1}}).count()

17124

In [16]:
all_collections.find({"type":"relation", "tag.name":  {"$exists": 1}}).count()

4761

## 2.3 Some knowledge of updating time

Earliest update time in the map.

In [17]:
sorted_times = get_unique_time_sorted(all_collections)
print sorted_times[0]

2007-03-14 18:09:10


Latest update time in the downloaded map file.

In [18]:
print sorted_times[-1]

2015-08-28 14:04:24
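
The helper `get_unique_time_sorted` lives in "db_functions" and is not shown in this report. As a rough idea of what such a helper might do, here is a minimal sketch that works on a plain list of timestamp strings. The name `unique_times_sorted` is hypothetical, and the input format is assumed to be the ISO style seen in the stored documents, for example '2015-06-12T06:31:41Z'.

```python
from datetime import datetime

# Assumed format of the stored timestamp strings
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def unique_times_sorted(timestamps):
    """Parse timestamp strings, drop duplicates and sort ascending."""
    parsed = {datetime.strptime(stamp, TIMESTAMP_FORMAT) for stamp in timestamps}
    return sorted(parsed)
```

Printing a `datetime` object gives the 'YYYY-MM-DD HH:MM:SS' form, which matches the shape of the two outputs above.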


The timestamp at which the most elements were updated.

In [20]:
pipeline = [{'$group':{'_id':'$timestamp',
                       'count':{'$sum':1}}},
            {'$sort':{'count':-1}},
            {'$limit':1}
            ]
list(all_collections.aggregate(pipeline))

[{u'_id': u'2015-06-12T06:31:41Z', u'count': 92}]
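
The pipeline groups documents by their exact timestamp, so the result is the single second with the most updates. The same logic can be written in plain Python, which also makes it easy to group by a coarser unit. This is a minimal sketch: `busiest_timestamp` and its `prefix_length` parameter are hypothetical, and unlike the pipeline it skips documents that have no 'timestamp' field.

```python
from collections import Counter


def busiest_timestamp(documents, prefix_length=None):
    """Mirror $group / $sort / $limit: return the most frequent timestamp.

    prefix_length=None groups by the full string; 10 groups by day
    ('YYYY-MM-DD'), assuming ISO-formatted timestamp strings.
    """
    counts = Counter()
    for doc in documents:
        stamp = doc.get("timestamp")
        if stamp is None:
            continue                      # assumption: ignore missing timestamps
        counts[stamp[:prefix_length]] += 1
    if not counts:
        return None
    stamp, count = counts.most_common(1)[0]
    return {"_id": stamp, "count": count}
```

Calling it with `prefix_length=10` would answer a slightly different question, namely which day saw the most updates rather than which second.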

# 3. Additional Ideas