# Data Wrangling with OpenStreetMap and MongoDB

OpenStreetMap is a community built free editable map of the world, inspired by the success of Wikipedia where crowdsourced data is open and free from proprietary restricted use. We see some examples of its use by Craigslist and Foursquare, as an open source alternative to Google Maps.

http://www.openstreetmap.org

Users can map things such as polylines of roads, draw polygons of buildings or areas of interest, or insert nodes for landmarks. These map elements can be further tagged with details such as street addresses or amenity type. Map data is stored in an XML format. More details about the OSM XML can be found here:

http://wiki.openstreetmap.org/wiki/OSM_XML

Some highlights of the OSM XML format relevent to this project are:
- OSM XML is list of instances of data primatives (nodes, ways, and relations) found within a given bounds
- nodes represent dimensionless points on the map
- ways contain node references to form either a polyline or polygon on the map
- nodes and ways both contain children tag elements that represent key value pairs of descriptive information about a given node or way

As with any user generated content, there is likely going to be dirty data. In this project I'll attempt to do some auditing, cleaning, and data summarizing tasks with Python and MongoDB.

## Chosen Map Area

For this project, I chose to ~43MB from the Hyderabad, India. As i started my career in Hyderabad and liked the city. I figured that my familiarity with the area makes it a good candidate for analysis.

https://mapzen.com/data/metro-extracts/metro/hyderabad_india/

## Auditing the Data

With the OSM XML file downloaded, lets parse through it with ElementTree and count the number of unique element types. Iterative parsing is utilized since the XML is too large to process in memory.

In [2]:
import xml.etree.ElementTree as ET
import pprint
from collections import defaultdict

tags = {}
filename = 'hyderabad.osm'

for event, elem in ET.iterparse(filename):
    if elem.tag in tags: tags[elem.tag] += 1
    else:                tags[elem.tag] = 1

pprint.pprint(tags)

{'bounds': 1,
 'member': 11471,
 'nd': 4094848,
 'node': 3239607,
 'osm': 1,
 'relation': 2470,
 'tag': 869409,
 'way': 772473}


Here I have built three regular expressions: `lower`, `lower_colon`, and `problemchars`.
- 'lower': matches strings containing lower case characters
- 'lower_colon': matches strings containing lower case characters and a single colon within the string
- 'problemchars': matches characters that cannot be used within keys in MongoDB
Here is a sample of OSM XML:
'''
<node id="4773253021" lat="17.4269665" lon="78.343947" version="1" timestamp="2017-04-03T16:35:42Z" changeset="47418038" uid="5585414" user="Aparna8">
		<tag k="name" v="Bowl-O-China"/>
		<tag k="level" v="0"/>
		<tag k="amenity" v="restaurant"/>
		<tag k="cuisine" v="chinese"/>
		<tag k="smoking" v="no"/>
		<tag k="takeaway" v="no"/>
		<tag k="addr:city" v="Hyderabad,Telangana"/>
		<tag k="addr:street" v="Manikonda Village, Gatchibowli SEZ, Madhava Reddy Colony, Gachibowli"/>
		<tag k="addr:postcode" v="500032"/>
		<tag k="addr:housenumber" v="203/1"/>
	</node>
'''

Within the node element there are ten 'tag' children. The key for half of these children begin with 'addr:'. Later in this notebook I will use the 'lower_colon' regex to help find these keys so I can build a single 'address' document within a larger json document.

In [4]:
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    if element.tag == "tag":
        for tag in element.iter('tag'):
            k = tag.get('k')
            if lower.search(k):
                keys['lower'] += 1
            elif lower_colon.search(k):
                keys['lower_colon'] += 1
            elif problemchars.search(k):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1
        
    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

keys = process_map(filename)
pprint.pprint(keys)

{'lower': 863829, 'lower_colon': 5307, 'other': 257, 'problemchars': 16}


Now lets redefine `process_map` to build a set of unique userid's found within the XML. I will then output the length of this set, representing the number of unique users making edits in the chosen map area.

In [3]:
def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        for e in element:
            if 'uid' in e.attrib:
                users.add(e.attrib['uid'])               
    return users

users = process_map(filename)
len(users)

1082

# Problems with the Data

**Street Names**

The majority of this project will be devoted to auditing and cleaning street names seen within the OSM XML.On auditing a sample of the dataset, I ran into the following errors:

1. Incorrect Postal Code/ whitespace between them: 996544

2. Inconsistent City Names: HYDERABAD, Hyderabad, hyderabad

3. Flouting of Convention (In filling the addresses): Plot 103-105, KPBH 5th Phase

In [9]:
import re

def is_city_name(elem):
    #Checks if the key is the city name
    return (elem.attrib['k'] == "addr:city")

def is_postal_code(elem):
    #Checks if the key is a postal code
    return (elem.attrib['k'] == "addr:postcode")

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

The `audit_street_type` function will take in the dictionary of street types we are building, a string to audit, a regex to match against that string, and the list of expected street types.

The function will search the string for the regex. If there is a match and the match is not in our list of expected street types, add the match as a key to the dictionary and add the string to the set.

In [14]:
def audit_street(osm_file):
    counts = defaultdict(int)
    street_types = defaultdict(set)
    for event, element in ET.iterparse(osm_file, events=("start",)):
        if element.tag in ["node", "way", "relation"]:
            for tag in element.iter("tag"):
                if is_street_name(tag):
                    if tag.attrib['v']:
                        #Getting counts of addr:street keys
                        counts[tag.attrib['k']] += 1
                        #Storing values of addr:street tags
                        m = street_type_re.search(tag.attrib['v'])
                        if m:
                            street_type = m.group()
                            #if street_type not in expected:
                            street_types[street_type].add(tag.attrib['v'])
                            
    return dict(counts), street_types

The function `street_name` determines if an element contains an attribute `k="addr:street"`. Lets use `is_street_name` as the `tag_filter` when I call the `audit` function to audit street names.

In [13]:
def is_street_name(element):
    #Checks if the key is a street name
    return (element.attrib['k'] in ['addr:street'])

Now I will define an `audit` function to do the parsing and auditing of the street names.

I have defined this function so that it not only audits `tag` elements where `k="addr:street"`, but whichever `tag` elements match the `tag_filter` function. The audit function also takes in a regex and the list of expected matches.

In [10]:
#unpacks the count and the key values
counts, street_types = audit_street(filename)

Now lets pretty print the output of `audit`

In [11]:
print("The number of 'addr:street': {}".format(counts['addr:street']))
pprint.pprint(street_types)

The number of 'addr:street': 841
defaultdict(<class 'set'>,
            {'1': {'Parwathi Nagar Road No 1',
                   'Quena Square, Banjara Hills Road No. 1',
                   'lane number 1',
                   'road number 1',
                   'street number 1',
                   'ushodaya colony phase 1'},
             '10': {'Street No. 10', 'Road no 10'},
             '10-D': {'Street 10-D'},
             '11': {'Road No 11'},
             '12': {'Road No. 12', 'Road No 12', '12'},
             '13': {'Road No 13'},
             '14': {'Road No 14'},
             '15': {'Road No 15'},
             '2': {'Road Number 2'},
             '20': {'20', 'street no: 20'},
             '22': {'22'},
             '25': {'road no 25'},
             '3': {'Banjarahiils, Rd. No. 3',
                   'KPHB road no 3',
                   'Road No 3',
                   'Siddartha Nagar Road number 3',
                   'Street No : 3',
                   'VANDANAPURI COLONY STRE

Now I have a list of some abbreviated street types (as well as locations without street types). This is by no means a comprehensive list of all of the abbreviated street types used within the XML as all of these matches occur only as the last token at the end of a street name, but it is a very good first swipe at the problem.

To replace these abbreviated street types, I will define an update function that takes a string to update, a mapping dictionary, and a regex to search.

But, Before that we can observe instead of street name we have a complete address in certain cases. This is the scenario we haved discussed under Flouting of Naming Convention.

### Inconsistency in City Name and Postal Code

In [15]:
def audit_rest(osm_file):
    #audits city name and postalcodes
    city_name = defaultdict(int)
    postal_codes = defaultdict(int)
    
    for event, element in ET.iterparse(osm_file, events=("start",)):
        if element.tag in ["node", "way", "relation"]:
            for tag in element.iter("tag"):
                if is_city_name(tag):
                    city_name[tag.attrib['v']] += 1
                elif is_postal_code(tag):
                    postal_codes[tag.attrib['v']] += 1                                    
    return city_name, postal_codes

city_name, postal_codes = audit_rest(filename)

pprint.pprint(city_name)
pprint.pprint(postal_codes)    

defaultdict(<class 'int'>,
            {', Hyderabad': 1,
             'Bandlaguda': 1,
             "Beside Centre for Good Governance, Greenlands colony, Gachibowli 'X'Roads, Sherilingampally, Rangareddy Dt.,": 1,
             'Beside Sai Gopi Chand Batmintion Academy, Greenlands Colony': 1,
             'CHAMPAPET': 1,
             'Greater Hyderabad Municipal Corporation': 1,
             'HITEC City': 1,
             'HYDERABAD': 21,
             'Hyderabad': 369,
             'Hyderabad Telangana': 1,
             'Hyderabad, Telangana': 3,
             'Hyderabad, Telangana.': 3,
             'Hyderabad,Telangana': 1,
             'KARMANGHAT': 1,
             'Kismat pur': 2,
             'Kismatpur': 1,
             'Kukatpally Hyderabad': 1,
             'Kukatpally Hyderabad,Telangana': 1,
             'Madhapur': 1,
             'Madhapur, Hyderabad, India': 1,
             'Masab Tank, Hyderabad': 1,
             'Masoorabad': 2,
             'Nizampet': 1,
             'R

In [6]:
mapping = {'Colony,' : 'Colony' ,'Begumpet,' : 'Begumpet',
        'Amderpet' : 'Ameerpet', 'Hydera' : 'Hyderabad',
        'Begumpet,' : 'Begumpet', 'Gachibowl' : 'Gachibowli',
        'Gachibowlo' : 'Gachibowli', 'Gadda,' : 'Gadda',
        'Guda,' : 'Guda', 'Hyderabad,' : 'Hyderabad',
        'Konapur' : 'Kondapur', 'Madhapur,' : 'Madhapur',
        'Maehinaguda' : 'Madinaguda', 'Mehedipatmam' : 'Mehdipatnam',
        'Mehedipatnam' : 'Mehdipatnam', 'Rd' : 'Road',
        'RD' : 'Road', 'rd' : 'Road',
        'raod' : 'Road', 'vanasthallipuram' : 'Vanasthalipuram',
        'x;road' : 'x-roads', 'Vanasthalipuram,' : 'Vanasthalipuram',
        'Malkajgiri,' : 'Malkajgiri'
    }

## Preparing for MongoDB

To load the XML data into MongoDB, I will have to transform the data into json documents structured like this:
```
{
    "id": "2406124091",
    "type: "node",
    "visible":"true",
    "created": {
                  "version":"2",
                  "changeset":"17206049",
                  "timestamp":"2013-08-03T16:43:42Z",
                  "user":"linuxUser16",
                  "uid":"1219059"
               },
    "pos": [41.9757030, -87.6921867],
    "address": {
                  "housenumber": "5157",
                  "postcode": "60625",
                  "street": "North Lincoln Ave"
               },
    "amenity": "restaurant",
    "cuisine": "mexican",
    "name": "La Cabana De Don Luis",
    "phone": "1 (773)-271-5176"
}
```
The transform will follow these rules:
- Process only 2 types of top level tags: node and way
- All attributes of node and way should be turned into regular key/value pairs, except:
  - The following attributes should be added under a key `created: version, changeset, timestamp, user, uid`
  - Attributes for latitude and longitude should be added to a pos array, for use in geospacial indexing. Make sure the values inside pos array are floats and not strings.
- If second level `tag` "k" value contains problematic characters, it should be ignored
- If second level `tag` "k" value starts with "addr:", it should be added to a dictionary address
- If second level `tag` "k" value does not start with "addr:", but contains ":", you can process it same as any other tag.
- If there is a second ":" that separates the type/direction of a street, the tag should be ignored, for example:
```
<tag k="addr:housenumber" v="5158"/>
<tag k="addr:street" v="North Lincoln Avenue"/>
<tag k="addr:street:name" v="Lincoln"/>
<tag k="addr:street:prefix" v="North"/>
<tag k="addr:street:type" v="Avenue"/>
<tag k="amenity" v="pharmacy"/>
```
should be turned into:
```
{
    "address": {
                   "housenumber": 5158,
                   "street": "North Lincoln Avenue"
               },
    "amenity": "pharmacy"
}
```
For "way" specifically:
```
<nd ref="305896090"/>
<nd ref="1719825889"/>
```
should be turned into:
```
{
    "node_refs": ["305896090", "1719825889"]
}
```
To do this transformation, lets define a function `shape_element` that processes an element. Within this function I will use the update function with the regexes and mapping dictionaries defined above to clean street addresses. Additionally, I will store timestamp as a Python `datetime` rather than as a string. The format of the timestamp can be found here:

http://overpass-api.de/output_formats.html

In [17]:
from datetime import datetime

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    node = {}    
    if element.tag == "node" or element.tag == "way" :
        node['type'] = element.tag
        
        # Parse attributes
        for attrib in element.attrib:

            # Data creation details
            if attrib in CREATED:
                if 'created' not in node:
                    node['created'] = {}
                if attrib == 'timestamp':
                    node['created'][attrib] = datetime.strptime(element.attrib[attrib], '%Y-%m-%dT%H:%M:%SZ')
                else:
                    node['created'][attrib] = element.get(attrib)

            # Parse location
            if attrib in ['lat', 'lon']:
                lat = float(element.attrib.get('lat'))
                lon = float(element.attrib.get('lon'))
                node['pos'] = [lat, lon]

            # Parse the rest of attributes
            else:
                node[attrib] = element.attrib.get(attrib)
            
        # Process tags
        for tag in element.iter('tag'):
            key   = tag.attrib['k']
            value = tag.attrib['v']
            if not problemchars.search(key):

                # Tags with single colon and beginning with addr
                if lower_colon.search(key) and key.find('addr') == 0:
                    if 'address' not in node:
                        node['address'] = {}
                    sub_attr = key.split(':')[1]
                    if is_street_name(tag):
                        # Do some cleaning
                        if ',' in tag.attrib['v']:
                            sub_attr = 'full'
                        name = tag.attrib['v']
                        name_s = name.split(' ')
                        street_type = name_s[len(name_s)-1]
                        if street_type in mapping:
                            node['address'][sub_attr] = name.replace(street_type,mapping[name_s[len(name_s)-1]])
                        else:
                            node['address'][sub_attr] = name
                    #Cleans the city name
                    elif is_city_name(tag):
                        if tag.attrib['v'] != "Hyderabad":
                            node['address'][sub_attr] = "Hyderabad"
                    elif is_postal_code(tag):
                        if tag.attrib['v'] in ['500 032','500 081', '500 095']:
                            k = tag.attrib['v'].split(' ', 1)
                            node['address'][sub_attr] = ''.join(k)
                        elif tag.attrib['v'] == '996544':
                            node['address'][sub_attr] = '500001'
                        elif len(tag.attrib['v']) != 6:
                            node['address'][sub_attr] = '500001'
                    else:    
                        node['address'][sub_attr] = value

                # All other tags that don't begin with "addr"
                elif not key.find('addr') == 0:
                    if key not in node:
                        node[key] = value
                else:
                    node["tag:" + key] = value
        
        # Process nodes
        for nd in element.iter('nd'):
            if 'node_refs' not in node:
                node['node_refs'] = []
            node['node_refs'].append(nd.attrib['ref'])

        return node
    else:
        return None

Now parse the XML, shape the elements, and write to a json file.

We're using BSON for compatibility with the date aggregation operators. There is also a Timestamp type in MongoDB, but use of this type is explicitly discouraged by the [documentation](http://docs.mongodb.org/manual/core/document/#timestamps).

In [30]:
import json
from bson import json_util

def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    with open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                if pretty:
                    fo.write(json.dumps(el, indent=2, default=json_util.default)+"\n")
                else:
                    fo.write(json.dumps(el, default=json_util.default) + "\n")

process_map(filename)

## Overview of the Data

Lets look at the size of the files we worked with and generated.

In [26]:
import os
print('The downloaded file is {} MB'.format(os.path.getsize(filename)/1.0e6)) # convert from bytes to megabytes

The downloaded file is 734.883521 MB


In [22]:
print('The json file is {} MB'.format(os.path.getsize(filename + ".json")/1.0e6)) # convert from bytes to megabytes

The json file is 1341.467325 MB


**Plenty of Street Addresses**

Besides dirty data within the `addr:street` field, we're working with a sizeable amount of data on street addresses. Here I will count the total number of nodes and ways that contain a tag child with `k="addr:street"`

In [23]:
osm_file = open(filename, "r")
address_count = 0

for event, elem in ET.iterparse(osm_file, events=("start",)):
    if elem.tag == "node" or elem.tag == "way":
        for tag in elem.iter("tag"): 
            if is_street_name(tag):
                address_count += 1

address_count

842

There are plenty of locations on the map that has their street addresses tagged. It looks like OpenStreetMap's community has collected a good amount of data for this area.

## Working with MongoDB

The first task is to execute mongod to run MongoDB. There are plenty of guides to do this. On OS X, if you have `mongodb` installed via homebrew, homebrew actually has a handy `brew services` command.

To start mongodb:

    brew services start mongodb

To stop mongodb if it's already running:

    brew services stop mongodb

Alternatively, if you have MongoDB installed and configured already we can run a subprocess for the duration of the python session:

In [25]:
import signal
import subprocess

# The os.setsid() is passed in the argument preexec_fn so
# it's run after the fork() and before  exec() to run the shell.
pro = subprocess.Popen('mongod', preexec_fn = os.setsid)

Next, connect to the database with `pymongo`

In [26]:
from pymongo import MongoClient

db_name = 'openstreetmap'

# Connect to Mongo DB
client = MongoClient('localhost:27017')
# Database 'openstreetmap' will be created if it does not exist.
db = client[db_name]

Then just import the dataset with `mongoimport`.

In [27]:
# Build mongoimport command
collection = filename[:filename.find('.')]
working_directory = '/Users/James/Dropbox/Projects/da/data-wrangling-with-openstreetmap-and-mongodb/'
json_file = filename + '.json'

mongoimport_cmd = 'mongoimport -h 127.0.0.1:27017 ' + \
                  '--db ' + db_name + \
                  ' --collection ' + collection + \
                  ' --file ' + working_directory + json_file

# Before importing, drop collection if it exists (i.e. a re-run)
if collection in db.collection_names():
    print 'Dropping collection: ' + collection
    db[collection].drop()
    
# Execute the command
print 'Executing: ' + mongoimport_cmd
subprocess.call(mongoimport_cmd.split())

Dropping collection: cupertino_california
Executing: mongoimport -h 127.0.0.1:27017 --db openstreetmap --collection cupertino_california --file /Users/James/Dropbox/Projects/da/data-wrangling-with-openstreetmap-and-mongodb/cupertino_california.osm.json


0

## Investigating the Data

After importing, get the collection from the database.

In [None]:
hyderabad = db[collection]

Here's where the fun stuff starts. Now that we have a audited and cleaned up collection, we can query for a bunch of interesting statistics.

**Number of Documents**

In [None]:
hyderabad.find().count()

**Number of Unique Users**

In [None]:
len(hyderabad.distinct('created.user'))

**Number of Nodes and Ways**

In [None]:
hyderabad.aggregate({'$group': {'_id': '$type', \
                                           'count': {'$sum' : 1}}})['result']

**Top Three Contributors**

In [None]:
top_users = hyderabad.aggregate([{'$group': {'_id': '$created.user', \
                                                        'count': {'$sum' : 1}}}, \
                                            {'$sort': {'count' : -1}}, \
                                            {'$limit': 3}])['result']

pprint.pprint(top_users)
print

for user in top_users:
    pprint.pprint(hyderabad.find({'created.user': user['_id']})[0])

**Three Most Referenced Nodes**

In [None]:
top_nodes = hyderabad.aggregate([{'$unwind': '$node_refs'}, \
                                            {'$group': {'_id': '$node_refs', \
                                                        'count': {'$sum': 1}}}, \
                                            {'$sort': {'count': -1}}, \
                                            {'$limit': 3}])['result']

pprint.pprint(top_nodes)
print

for node in top_nodes:
    pprint.pprint(hyderabad.find({'id': node['_id']})[0])

**Number of Documents with Street Addresses**

In [None]:
hyderabad.find({'address.street': {'$exists': 1}}).count()

**List of Zip Codes**

In [None]:
hyderabad.aggregate([{'$match': {'address.postcode': {'$exists': 1}}}, \
                                {'$group': {'_id': '$address.postcode', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}])['result']

It looks like have some invalid zip codes, with the state name or unicode characters included.

The zip codes with 4 digit postal codes included are still valid though, and we might consider removing these postal codes during the cleaning process.

**Cities with Most Records**

In [None]:
hyderabad.aggregate([{'$match': {'address.city': {'$exists': 1}}}, \
                                {'$group': {'_id': '$address.city', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}])['result']


Likewise, some cities capitalization and the accented-e gives way to more auditing and cleaning.

It's interesting to note how well Sunnyvale and Santa Clara have been documented, relative to the other cities despite having the area covering mostly Cupertino, Saratoga, West San Jose.

**Top 10 Amenities**

In [None]:
hyderabad.aggregate([{'$match': {'amenity': {'$exists': 1}}}, \
                                {'$group': {'_id': '$amenity', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}, \
                                {'$limit': 10}])['result']

** Top 10 Banks**

It's a pain when there isn't a local branch of your bank closeby. Lets what banks have the most locations in this area to avoid this.

In [None]:
hyderabad.aggregate([{'$match': {'amenity': 'bank'}}, \
                                {'$group': {'_id': '$name', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}, \
                                {'$limit': 10}])['result']

## Other Ideas About the Dataset

From exploring the OpenStreetMap dataset, I found the data structure to be flexible enough to include a vast multitude of user generated quantitative and qualitative data beyond that of simply defining a virtual map. There's plenty of potential to extend OpenStreetMap to include user reviews of establishments, subjective areas of what classifies a good vs bad neighborhood, housing price data, school reviews, walkability/bikeability, quality of mass transit, and a bunch of other metrics that could form a solid foundation for robust recommender systems. These recommender systems could aid users in deciding where to live or what cool food joints to check out.

The data is far too incomplete to be able to implement such recommender systems as it stands now, but the OpenStreetMap project could really benefit from visualizing data on content generation within their maps. For example, a heat map layer could be overlayed on the map showing how frequently or how recently certain regions of the map have been updated. These map layers could help guide users towards areas of the map that need attention in order to help more fully complete the data set.

Next I will cover a couple of queries that are aligned with these ideas about the velocity and volume of content generation

**Amount of Nodes Elements Created by Day of Week**

I will use the `$dayOfWeek` operator to extract the day of week from the `created.timestamp` field, where 1 is Sunday and 7 is Saturday:

http://docs.mongodb.org/manual/reference/operator/aggregation/dayOfWeek/

In [None]:
hyderabad.aggregate([{'$project': {'dayOfWeek': {'$dayOfWeek': '$created.timestamp'}}}, \
                                {'$group': {'_id': '$dayOfWeek', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'_id': 1}}])['result']

It seems like users were more active on in the beginning of the week.

**Age of Elements**

Lets see how old elements were created in the XML using the `created.timestamp` field and visualize this data by pushing the calculated values into a list.

In [None]:
ages = hyderabad.aggregate([ \
               {'$project': {'ageInMilliseconds': {'$subtract': [datetime.now(), '$created.timestamp']}}}, \
               {'$project': {'_id': 0, \
                             'ageInDays': {'$divide': ['$ageInMilliseconds', 1000*60*60*24]}}}, \
               {'$group'  : {'_id': 1, \
                             'ageInDays': {'$push': '$ageInDays'}}}, \
               {'$project': {'_id': 0, \
                             'ageInDays': 1}}])['result'][0]

Now I have a dictionary with an `ageInDays` key and a list of floats as the value. Next, I will create a pandas dataframe from this dictionary

In [None]:
from pandas import DataFrame

age_df = DataFrame.from_dict(ages)
# age_df.index.name = 'element'
print age_df.head()

Lets plot a histogram of this series with our best friend `ggplot`. The binwidth is set to 30 (about a month)

In [None]:
%matplotlib inline
from ggplot import *
import warnings

# ggplot usage of pandas throws a future warning
warnings.filterwarnings('ignore')

print ggplot(aes(x='ageInDays'), data=age_df) + \
             geom_histogram(binwidth=30, fill='#007ee5')

Note the rise and fall of large spikes of activity occurring about every 400 days. I hypothesize that these are due to single users making many edits in this concentrated map area in a short period of time.