# Introduction 

I have lived in the Boston area for the last few years, since grad school, so the dataset analyzed in this project covers the Boston area. It was exported from [OpenStreetMap](http://www.openstreetmap.org/#map=17/40.71652/-73.94470&layers=H). The analysis included the following steps:

* **Question Phase:** This phase involves asking general questions about the dataset. The questions relate to the problem we are trying to solve.
* **Data Auditing:** This phase involves auditing the data to identify anomalies and patterns. E.g., in the street map data we could run into street names that contain special characters, or zip codes in the Boston area that contain alphabetical characters.
* **Data Cleansing:** This phase involves classifying the anomalies identified in the previous step and devising approaches to clean up the data. The cleansing can be done either manually or programmatically. This project takes a programmatic approach to cleansing some of the anomalies.

Data Auditing and Data Cleansing are repeated iteratively until a fair number of the data anomalies have been identified and cleansed appropriately.

* **Conclusion:** This phase involves drawing conclusions about the dataset, based on the auditing and cleansing steps.
* **Communication:** This phase involves communicating the results of the analysis to the audience. In a real-life scenario this would be the business users who make decisions based on the dataset analysis.

In addition, this project involves importing the dataset into [MongoDB](https://www.mongodb.com/) and then running some of MongoDB's aggregation commands to further analyze the imported dataset.






# Question Phase





# Data Auditing
## Identifying the Tags along with the Count of Occurrences of Each Tag

This step is an initial analysis of the dataset: it assesses the XML nodes and counts the number of instances of each node type. While it is a good start to the data auditing process, it leaves many questions unanswered. It does confirm that the XML is well formed, since the XML parser (ET.iterparse) is able to parse through the entire file.

In [1]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

filename = 'boston.osm'

def test(): 
    tag_list = []
    tag_dict = defaultdict(int)
    for _, element in ET.iterparse(filename): 
        tag_list.append(element.tag)
    for item in tag_list: 
        tag_dict[item] += 1
    pprint.pprint(tag_dict)


if __name__ == "__main__":
    test()

defaultdict(<type 'int'>, {'node': 444899, 'nd': 551066, 'bounds': 1, 'member': 5284, 'tag': 221456, 'relation': 645, 'way': 75362, 'osm': 1})
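
The same count can be collected in a single pass without first building a list of every tag. The sketch below is only an illustration, not the code that produced the output above: it uses collections.Counter, calls element.clear() to release each element once it has been counted, and runs on a tiny made-up XML sample (the `sample` variable is hypothetical) instead of boston.osm. It imports xml.etree.ElementTree so that it runs on both Python 2 and Python 3.

```python
from collections import Counter
from io import BytesIO
import xml.etree.ElementTree as ET


def count_tags(source):
    """Count element tags while streaming through an XML source."""
    counts = Counter()
    # iterparse yields each element when its closing tag is reached
    for _, element in ET.iterparse(source):
        counts[element.tag] += 1
        element.clear()  # drop the element's children, text and attributes
    return counts


# Hypothetical miniature stand-in for boston.osm
sample = BytesIO(
    b"<osm><node id='1'/><node id='2'/>"
    b"<way id='3'><nd ref='1'/><nd ref='2'/></way></osm>"
)
tag_counts = count_tags(sample)
print(tag_counts.most_common())
```

For the real file, count_tags('boston.osm') should work the same way, since ET.iterparse accepts a filename as well as a file object.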


# Data Auditing (contd)
## Identifying the Count of Unique Users (based on UIDs) 
Since OpenStreetMap is an openly available map that can be updated by users all over the world, it made sense to get an idea of the number of unique users who have contributed to the map.



In [2]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

filename = 'boston.osm'

def test(): 
    users = set()
    for _,element in ET.iterparse(filename):
        if 'uid' in element.attrib:
            users.add(element.get('uid'))
    print "Count of Unique Users:", (len(users))
    print (users)
    
if __name__ == "__main__":
    test()

Count of Unique Users: 848
set(['701372', '3057995', '378464', '14850', '967832', '152074', '4581744', '2176051', '113450', '45027', '1084189', '1723055', '2219985', '70696', '1836471', '1723831', '616774', '2406578', '843366', '396743', '60604', '256900', '4801391', '985060', '252811', '2859541', '1956174', '487535', '110489', '966176', '1802093', '1964257', '3753426', '3401472', '151559', '352232', '2080154', '2411240', '1815503', '2825553', '94578', '1679807', '4667099', '3355238', '2663591', '3846187', '233028', '3711221', '15750', '4732', '1896093', '1691206', '1884900', '2318', '83629', '29320', '4459317', '5013298', '3461700', '2859289', '621028', '674872', '47892', '8609', '38487', '2320085', '52411', '372391', '198831', '3245344', '118021', '5014050', '668245', '297464', '1963816', '336460', '5089369', '701297', '602100', '42429', '1870581', '2031562', '398754', '2851892', '2500516', '3571425', '510166', '4937878', '563947', '69777', '128470', '690266', '3582', '927001', '5827

# Data Auditing (contd)
## Identifying the Count of Unique Users (based on Users) 
This test is very similar to the one above; the only difference is that it uses the "user" attribute instead of the "uid" attribute. It is done to check that the results agree, and both tests do return the same count (848).

In [3]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

filename = 'boston.osm'

def test(): 
    users = set()
    for _,element in ET.iterparse(filename):
        if 'user' in element.attrib:
            users.add(element.get('user'))
    print len(users)
    print users
    
if __name__ == "__main__":
    test()

848
set(['maxmetcalfe', 'Roger Neumann', 'dloutzen', 'TuftsReady', 'Matej Cepl', 'Steven Deeds', 'Brett Camper', 'noobi', 'Thia564', 'signed0', 'iandees', 'OSMF Redaction Account', 'pcs14', 'scs', 'YunmoW', 'RichRico', 'xybot', 'genuinejack', 'Feddy Pariona Rojas', 'jraviles', 'aonline1', 'ramyaragupathy', 'Mashin', 'phyzome', 'cgu66', 'pierlux', 'Miselajus', 'Manu1400', 'moyogo', 'yurasi', 'smita1', 'gameboo', 'jborthwick', 'maxerickson', 'digdesign', 'cowsandmilk', 'tmcw', 'marnen', 'jak119', 'sivan00', 'smithbone', 'pilotrobert', 'TheDimka', 'Htg610', 'Asumu Takikawa', 'mattbert', 'paolodepetrillo', 'lm0nster', 'Ivanaf', 'Joshua Gerber', 'pezespe', 'Chris Paci', 'db248', 'trsmith', 'wolfgang8741', 'EricTufts', 'Manuel Aristaran', 'beweta', 'Ahlzen', 'kumarhk', 'dfieldsarlington', '42429', 'afreeman', 'minewman', '0123456789', 'virtualxtc', 'Bill Ricker', 'Eliyak', 'Carnildo', 'srajkovic', 'willber118', 'Roadsguy', 'Latze', 'oldtopos', 'Sudip Chandra Paudel', 'woodpeck_repair', 'John
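
Equal counts are a good sign, but they do not by themselves show that every uid maps to exactly one user name. A stricter check, sketched below under the assumption that both attributes sit on the same element, collects the names seen for each uid and the uids seen for each name and reports any that have more than one. The function name `uid_user_conflicts` and the sample XML are made up for illustration.

```python
from collections import defaultdict
from io import BytesIO
import xml.etree.ElementTree as ET


def uid_user_conflicts(source):
    """Return (uids with several names, names with several uids)."""
    names_by_uid = defaultdict(set)
    uids_by_name = defaultdict(set)
    for _, element in ET.iterparse(source):
        uid, user = element.get('uid'), element.get('user')
        if uid is not None and user is not None:
            names_by_uid[uid].add(user)
            uids_by_name[user].add(uid)
    multi_name = {u: n for u, n in names_by_uid.items() if len(n) > 1}
    multi_uid = {n: u for n, u in uids_by_name.items() if len(u) > 1}
    return multi_name, multi_uid


# Hypothetical sample: uid 1 appears under two different user names
sample = BytesIO(
    b"<osm><node uid='1' user='a'/><node uid='1' user='b'/>"
    b"<node uid='2' user='c'/></osm>"
)
multi_name, multi_uid = uid_user_conflicts(sample)
print(multi_name)
print(multi_uid)
```

Two empty dictionaries on the real file would confirm that the two audits are counting the same set of contributors.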

# Data Auditing (contd)
## User Contribution Count

The purpose of this audit is to count the number of times each user has contributed to the map. It was done to get a sense of who the top contributors are and how much they contributed.

In [4]:
import xml.etree.cElementTree as ET
import pprint
import codecs
import json
import operator
from collections import defaultdict

filename = 'boston.osm'

def test(): 
    user_list = []
    user_dict = defaultdict(int)
    for _,element in ET.iterparse(filename):
        if 'user' in element.attrib:
            user_list.append(element.get('user'))
        else: 
            continue
    
    for item in user_list: 
        user_dict[item] +=1 
    sorted_user_dict = sorted(user_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
    pprint.pprint(sorted_user_dict)
    
    
if __name__ == "__main__":
    test()

[('crschmidt', 269245),
 ('jremillard-massgis', 64989),
 ('wambag', 29490),
 ('OceanVortex', 27828),
 ('ryebread', 21770),
 ('morganwahl', 20412),
 ('mapper999', 8315),
 ('cspanring', 6817),
 ('JasonWoof', 5445),
 ('synack', 5054),
 ('ingalls_imports', 4166),
 ('Alexey Lukin', 3516),
 ('fiveisalive', 3145),
 ('MassGIS Import', 3115),
 ('Utible', 2872),
 ('probiscus', 1779),
 ('Prithason', 1439),
 ('phyzome', 1409),
 ('Extant', 1193),
 ('Alan Bragg', 1171),
 ('massDOT', 1167),
 ('Steven Deeds', 1124),
 ('pkoby', 1090),
 ('Ahlzen', 1080),
 ('thetornado76', 955),
 ('JessAk71', 953),
 ('3yoda', 890),
 ('jokeefe', 881),
 ('ceyockey', 790),
 ('woodpeck_repair', 771),
 ('Aredhel', 731),
 ('Pouletic', 702),
 ('jwass', 683),
 ('mterry', 669),
 ('dloutzen', 633),
 ('Peter Dobratz', 585),
 ('pokey', 574),
 ('KindredCoda', 553),
 ('dannya222', 546),
 ('aroach', 509),
 ('onurozgun', 504),
 ('headwatersolver', 464),
 ('kalanz', 410),
 ('nimapper', 394),
 ('Echo Echo', 393),
 ('nkhall', 382),
 ('spac

**The top ten contributors based on the results of the output above:**
[('crschmidt', 269245),
 ('jremillard-massgis', 64989),
 ('wambag', 29490),
 ('OceanVortex', 27828),
 ('ryebread', 21770),
 ('morganwahl', 20412),
 ('mapper999', 8315),
 ('cspanring', 6817),
 ('JasonWoof', 5445),
 ('synack', 5054)]

# Data Auditing (contd)
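## Auditing the Values of the addr Tags

This step involves auditing the value of each addr tag and classifying it into one of the following categories:

1. Lower Case Characters 
2. Lower Case Characters with Colon 
3. Problematic Characters 

Values that match none of these are counted under "other". The final output is a dictionary with one key per category and the count of values that fall into each category.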

In [5]:
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

filename = 'boston.osm'


def process_address(addr_value):
    if re.search(lower,addr_value):
        keys['lower'] += 1
    elif re.search(lower_colon, addr_value): 
        keys['lower_colon'] += 1
    elif re.search(problemchars, addr_value): 
        keys['problemchars'] += 1 
    else: 
        keys['other'] += 1
                

def read_file():
    for _, element in ET.iterparse(filename): 
        if element.tag == 'tag': 
            key=element.get('k')
            if 'addr:' in key:
                addr_value = element.get('v')
                process_address(addr_value)
    return keys
                    

if __name__ == "__main__":
    keys = defaultdict(int)
    read_key = read_file()
    print read_key


defaultdict(<type 'int'>, {'problemchars': 3332, 'lower': 83, 'other': 8195})
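
To see what each category actually catches, the three patterns from the cell above can be applied to a few hand-picked strings. The sketch below reuses the same regular expressions; the `classify` helper and the sample strings are made up for illustration. Note that the problemchars pattern includes the space character, so any multi-word value such as a street name lands in that category, which may explain much of the 3332 count.

```python
import re

# Same patterns as in the audit cell above
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify(text):
    """Return the audit category for a single string."""
    if lower.search(text):
        return 'lower'
    if lower_colon.search(text):
        return 'lower_colon'
    if problemchars.search(text):
        return 'problemchars'
    return 'other'


# Hypothetical sample values
samples = ['boston', 'addr:street', 'Massachusetts Avenue', '02139', 'MA']
categories = dict((s, classify(s)) for s in samples)
for s in samples:
    print('{} -> {}'.format(s, categories[s]))
```

Because the patterns were written with tag keys in mind, a value only counts as "lower" when it consists entirely of lowercase letters and underscores; digits and capitalised words fall through to "other".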


# Data Auditing (contd) 
## Auditing Street Names and Identifying Anomalies
This step involves auditing the last word of each street name (usually the street type). The last word is extracted with a regular expression. 

The logic compares this last word with the entries in the "expected" list. If the last word is not found in the expected list, it is added to a dictionary as a key, and the full street name is added to the set of values stored under that key. 

In [6]:
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Heights", "North", "East", "West", "South"]


filename = 'boston.osm'
                
def process_addressanomalies(addressvalue): 
    match = street_type_re.search(addressvalue)
    if match: 
        street_type = match.group()
        if street_type not in expected: 
            street_types[street_type].add(addressvalue)
        
def read_file():
    for _, element in ET.iterparse(filename): 
        if element.tag == 'tag': 
            key=element.get('k')
            if 'addr:street' in key:
                addr_value = element.get('v')
                process_addressanomalies(addr_value)
    return street_types                

if __name__ == "__main__":
    street_types = defaultdict(set)
    read_key = read_file()
    pprint.pprint(dict(read_key))
    


{'1100': set(['First Street, Suite 1100']),
 '1302': set(['Cambridge Street #1302']),
 '3': set(['Kendall Square - 3']),
 '303': set(['First Street, Suite 303']),
 '501': set(['Bromfield Street #501']),
 '6': set(['South Station, near Track 6']),
 '846028': set(['PO Box 846028']),
 'Ave': set(['738 Commonwealth Ave',
             'Boston Ave',
             'College Ave',
             'Commonwealth Ave',
             'Concord Ave',
             'Francesca Ave',
             'Highland Ave',
             'Josephine Ave',
             'Lexington Ave',
             'Massachusetts Ave',
             'Morrison Ave',
             'Mystic Ave',
             'Somerville Ave',
             'Western Ave',
             'Willow Ave']),
 'Ave.': set(['Brighton Ave.',
              'Massachusetts Ave.',
              'Somerville Ave.',
              'Spaulding Ave.']),
 'Broadway': set(['Broadway']),
 'Cambrdige': set(['Cambrdige']),
 'Center': set(['Cambridge Center', 'Financial Center']),
 'Circle':

A quick review of the output from the previous step shows that:
1. Street appears in several variations, e.g. "st", "St", "St.", "St," and also "ST". 
2. Similarly, Avenue is sometimes written as "Ave" or "Ave.". 

# Data Cleansing

This step involves creating the logic to cleanse the anomalies identified in the previous step. It uses a mapping dictionary that maps each incorrect form of a street type to its correct form, as key/value pairs. The logic reuses the earlier code that builds a dictionary by grouping street names that end with the same value. 

The cleansing step loops through that dictionary and checks whether the last word of each street name can be found in the mapping dictionary. If a mapping is found, the street name is corrected using the Python replace method.


In [7]:
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Heights", "North", "East", "West", "South"]

mapping = { "St": "Street",
            "St.": "Street",
            "St,": "Street",
            "ST": "Street",
            "Rd.": "Road",
            "Ave": "Avenue",
            "Ave.": "Avenue"
            }


filename = 'boston.osm'
                
def process_addressanomalies(addressvalue): 
    match = street_type_re.search(addressvalue)
    if match: 
        street_type = match.group()
        if street_type not in expected: 
            street_types[street_type].add(addressvalue)

def update_name(): 
    for k, v in read_key.iteritems(): 
        for vitem in v: 
            match = street_type_re.search(vitem) 
            val =  match.group(0)
            if val in mapping: 
                new_name = vitem.replace(match.group(0), mapping[match.group(0)])
                print vitem, "==>", new_name
            

def read_file():
    for _, element in ET.iterparse(filename): 
        if element.tag == 'tag': 
            key=element.get('k')
            if 'addr:street' in key:
                addr_value = element.get('v')
                process_addressanomalies(addr_value)
    return street_types                

if __name__ == "__main__":
    street_types = defaultdict(set)
    read_key = read_file()
    update_name()  # prints each proposed correction; it does not return a value

Walnut St, ==> Walnut Street
Pearl St. ==> Pearl Street
Banks St. ==> Banks Street
Marshall St. ==> Marshall Street
Prospect St. ==> Prospect Street
Main St. ==> Main Street
Albion St. ==> Albion Street
Saint Mary's St. ==> Saint Mary's Street
Boylston St. ==> Boylston Street
Stuart St. ==> Stuart Street
Elm St. ==> Elm Street
Newton ST ==> Newton Street
Brighton Ave. ==> Brighton Avenue
Spaulding Ave. ==> Spaulding Avenue
Massachusetts Ave. ==> Massachusetts Avenue
Somerville Ave. ==> Somerville Avenue
Brentwood St ==> Brentwood Street
Athol St ==> Athol Street
Everett St ==> Everett Street
South Waverly St ==> South Waverly Street
Litchfield St ==> Litchfield Street
Hampshire St ==> Hampshire Street
Main St ==> Main Street
Cambridge St ==> Cambridge Street
Arsenal St ==> Arsenal Street
Merrill St ==> Merrill Street
Antwerp St ==> Antwerp Street
1629 Cambridge St ==> 1629 Cambridge Street
Elm St ==> Elm Street
Lothrop St ==> Lothrop Street
Charles St ==> Charles Street
Dane St ==> Dan
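
One caveat with the replace method: it substitutes every occurrence of the abbreviation in the string, not just the last word. For example, 'Stuart St'.replace('St', 'Street') gives 'Streetuart Street'. A minimal sketch of an alternative is shown below; it reuses the same regular expression and mapping as the cell above but lets the regular expression do the substitution, so only the trailing street type is touched. This version of `update_name`, which takes the name and mapping as arguments, is an illustration and not the code that produced the output above.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

mapping = {"St": "Street", "St.": "Street", "St,": "Street", "ST": "Street",
           "Rd.": "Road", "Ave": "Avenue", "Ave.": "Avenue"}


def update_name(name, mapping):
    """Replace only the trailing street type, if it has a known correction."""
    def fix(match):
        # Fall back to the matched text when there is no mapping for it
        return mapping.get(match.group(), match.group())
    return street_type_re.sub(fix, name)


for name in ['Stuart St', 'Walnut St,', 'Brighton Ave.', 'Kendall Square']:
    print('{} ==> {}'.format(name, update_name(name, mapping)))
```

Names whose last word is not in the mapping, such as 'Kendall Square', are returned unchanged.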

# XML to a JSON Conversion 
This step involves parsing through the XML file to create a JSON file, which is then imported into MongoDB. The translation follows the rules below:

* Process only 2 types of top level tags: "node" and "way"
* All attributes of "node" and "way" should be turned into regular key/value pairs, with two exceptions: attributes listed in the CREATED array should be added under a key "created", and the latitude and longitude attributes should be added to a "pos" array for use in geospatial indexing. The values inside the "pos" array must be floats, not strings.
* If a second level tag "k" value contains problematic characters, it should be ignored
* If a second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
* If a second level tag "k" value does not start with "addr:" but contains ":", it can be processed the same as any other tag
* If there is a second ":" that separates the type/direction of a street, the tag should be ignored
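
The rules for the second level tag keys can be sketched as a small routing function. This is only an illustration of the rules as listed, with a made-up helper name (`route_tag_key`) and made-up return labels; the conversion cell that follows takes a narrower approach and keeps only the address, amenity and name tags.

```python
import re

# Same problemchars pattern used elsewhere in this notebook
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def route_tag_key(key):
    """Return (action, field) for the "k" value of a second level tag.

    action is 'skip', 'address' or 'field' (hypothetical labels).
    """
    if problemchars.search(key):
        return ('skip', None)
    if key.startswith('addr:'):
        parts = key.split(':')
        if len(parts) > 2:  # a second colon, e.g. addr:street:type
            return ('skip', None)
        return ('address', parts[1])
    # Everything else, with or without a colon, is a regular field
    return ('field', key)


for k in ['addr:street', 'addr:street:type', 'amenity', 'gnis:feature_id', 'bad key']:
    print('{} -> {}'.format(k, route_tag_key(k)))
```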

In [8]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
import sys
sys.setrecursionlimit(10000)
from collections import defaultdict


lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

def floatOrNofloat(n):
    return float(n) if n else None
    
def shape_element(element): 
    node = defaultdict(dict) 
    if element.tag == "node" or element.tag == "way":
        
        node["tag"] = element.tag

        node ["id"] = element.get('id')

        lat = element.get('lat')

        lon = element.get('lon')

        if lat or lon:
            node['pos'] = [floatOrNofloat(lat), floatOrNofloat(lon)]
        
        node["created"] = {}

        for key in CREATED:
            node["created"][key] = element.get(key)
        
        for child in element.getchildren():
            
            key = child.get("k")
            ref = child.get("ref")
            
            if key == 'address': 
                node['fulladdress'] = child.get('v')
            
            if key is not None: 
                if key.startswith('addr:'):
                    split_key = key.split(":")
                    node['address'][split_key[1]] = child.get('v')
                elif 'amenity' in key: 
                    node['amenity'] = child.get('v')
                elif 'name' in key: 
                    node['name'] = child.get('v')
            
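            if ref: 
                # Create the list on the first ref, then always append,
                # so that the first node reference is not dropped.
                if "node_refs" not in node: 
                    node["node_refs"] = []
                node["node_refs"].append(ref)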
        
        return node
    else:
        return None
        

    
def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

def test():
    data = process_map('Boston.osm', True)
    

if __name__ == "__main__":
    test()

The output of the step above is the **"Boston.osm.json"** file, which is later imported into MongoDB. 

# Setting up for Mongo Data Analysis

In [9]:
import pymongo
from pymongo import MongoClient
import pprint

In [10]:
client = MongoClient()
db = client.boston
print db

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'boston')


# Data Analysis/Data Exploration in MongoDB

# Assessing the Size of the Original OSM File and the JSON File 

In [11]:
import os
print 'The original OSM file is {} MB'.format(os.path.getsize('Boston.osm')/1.0e6)
print 'The JSON file is {} MB'.format(os.path.getsize('Boston.osm' + ".json")/1.0e6)

The original OSM file is 100.298776 MB
The JSON file is 145.946684 MB


In [12]:
boston = db['bostonc']

# Number of Documents

In [13]:
boston.find().count()

520261

# Number of Unique Users

In [14]:
len(boston.distinct('created.user'))

827
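
This count (827) is lower than the 848 unique users found during auditing. One likely explanation, which I have not verified against the file, is that the audit read the "user" attribute from every element, including "relation" elements, while only "node" and "way" elements were converted to JSON. The sketch below shows how the audit could be restricted to specific tags so the two numbers can be compared like for like; `unique_users` and the sample XML are made up for illustration.

```python
from io import BytesIO
import xml.etree.ElementTree as ET


def unique_users(source, tags=None):
    """Collect distinct 'user' values, optionally only from the given tags."""
    users = set()
    for _, element in ET.iterparse(source):
        if tags is not None and element.tag not in tags:
            continue
        user = element.get('user')
        if user is not None:
            users.add(user)
    return users


# Hypothetical sample: user 'c' has only ever edited a relation
xml_bytes = (b"<osm><node user='a'/><way user='b'/>"
             b"<relation user='c'/></osm>")
all_users = unique_users(BytesIO(xml_bytes))
node_way_users = unique_users(BytesIO(xml_bytes), tags=('node', 'way'))
print(len(all_users), len(node_way_users))
```

The same reasoning would account for the slightly lower per-user counts in the MongoDB top 10 compared with the audit output.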

# Number of Nodes and Ways

In [15]:
print "Number of nodes:",boston.find({'tag': 'node'}).count()
print "Number of ways:", boston.find({'tag': 'way'}).count()

Number of nodes: 444899
Number of ways: 75362


# Top 10 Contributors along with the UserNames

In [16]:
result = boston.aggregate( [
                                        { "$group" : {"_id" : "$created.user", "count" : { "$sum" : 1} } },
                                        { "$sort" : {"count" : -1} }, 
                                        { "$limit" : 10 } ] )

pprint.pprint(list(result))

[{u'_id': u'crschmidt', u'count': 269155},
 {u'_id': u'jremillard-massgis', u'count': 64989},
 {u'_id': u'wambag', u'count': 29468},
 {u'_id': u'OceanVortex', u'count': 27793},
 {u'_id': u'ryebread', u'count': 21755},
 {u'_id': u'morganwahl', u'count': 20291},
 {u'_id': u'mapper999', u'count': 8309},
 {u'_id': u'cspanring', u'count': 6817},
 {u'_id': u'JasonWoof', u'count': 5439},
 {u'_id': u'synack', u'count': 5042}]


# List of Top 50 Amenities in the Boston Area

In [17]:
result = boston.aggregate( [            {'$match': {'amenity': {'$exists': 1}}},
                                        { "$group" : {"_id" : "$amenity", "count" : { "$sum" : 1} } },
                                        { "$sort" : {"count" : -1} }, 
                                        { "$limit" : 50 } ] )

pprint.pprint(list(result))

[{u'_id': u'parking', u'count': 545},
 {u'_id': u'bench', u'count': 495},
 {u'_id': u'restaurant', u'count': 398},
 {u'_id': u'bicycle_parking', u'count': 214},
 {u'_id': u'school', u'count': 205},
 {u'_id': u'place_of_worship', u'count': 184},
 {u'_id': u'library', u'count': 162},
 {u'_id': u'cafe', u'count': 158},
 {u'_id': u'fast_food', u'count': 114},
 {u'_id': u'bicycle_rental', u'count': 89},
 {u'_id': u'university', u'count': 77},
 {u'_id': u'post_box', u'count': 69},
 {u'_id': u'bank', u'count': 65},
 {u'_id': u'waste_basket', u'count': 59},
 {u'_id': u'pub', u'count': 49},
 {u'_id': u'fuel', u'count': 41},
 {u'_id': u'fountain', u'count': 34},
 {u'_id': u'pharmacy', u'count': 34},
 {u'_id': u'atm', u'count': 33},
 {u'_id': u'hospital', u'count': 31},
 {u'_id': u'fire_station', u'count': 31},
 {u'_id': u'drinking_water', u'count': 31},
 {u'_id': u'car_sharing', u'count': 28},
 {u'_id': u'bar', u'count': 28},
 {u'_id': u'parking_space', u'count': 27},
 {u'_id': u'post_office', u

# Extracting the List of Colleges from the DataSet

In [18]:
colleges = boston.aggregate([{"$match":{"amenity":{"$exists":1},
                                 "amenity":"college",}},      
                      {"$group":{"_id":{"Name":"$name"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "College":"$_id.Name",
                                  "Name":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":10}])
pprint.pprint(list(colleges))

[{u'College': u'North Bennet Street School', u'Name': 1},
 {u'College': u'Radcliffe Quad', u'Name': 1},
 {u'College': u'Bunker Hill Community College', u'Name': 1},
 {u'College': u'Emerson College', u'Name': 7},
 {u'College': u'Berklee College of Music', u'Name': 7},
 {u'College': u'Emerson College \u2013 Walker Building', u'Name': 1},
 {u'College': u'Emerson College \u2013 Tuffte Performing Arts Center',
  u'Name': 1},
 {u'College': u'Fisher College', u'Name': 1},
 {u'College': u'Emerson College - Little Building', u'Name': 1},
 {u'College': u'Emerson College \u2013 Piano Row', u'Name': 1}]


**This list is missing some of the key universities in the Boston area, such as Harvard, MIT and Northeastern. On further review of the dataset I noticed that the missing schools and colleges are in fact part of the dataset; they just don't have an amenity of "college" attached to them.** 
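
Two details of the query above are worth a second look. First, the $project stage stores the count under the key "Name" and the $sort stage then sorts on "Count", a field that does not exist at that point, so the results cannot come back ordered by count. Second, the amenity list earlier in this notebook shows 77 "university" entries, so the missing institutions may be tagged that way (this is an assumption, not something checked here). The sketch below builds a pipeline that sorts before projecting and matches several amenity values at once; `name_count_pipeline` is a made-up helper, and the function only builds the list of stages so it can be inspected without a database connection.

```python
import pprint


def name_count_pipeline(amenities, limit=10):
    """Build an aggregation pipeline counting documents per name."""
    return [
        # Match any of the given amenity values
        {"$match": {"amenity": {"$in": list(amenities)}}},
        {"$group": {"_id": "$name", "count": {"$sum": 1}}},
        # Sort on the field that actually exists at this stage
        {"$sort": {"count": -1}},
        {"$limit": limit},
        {"$project": {"_id": 0, "name": "$_id", "count": 1}},
    ]


pipeline = name_count_pipeline(["college", "university"])
pprint.pprint(pipeline)
```

With the collection defined earlier, the pipeline would be run as pprint.pprint(list(boston.aggregate(pipeline))).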

# Extracting the list of Public Buildings in the Boston Area

In [19]:
building = boston.aggregate([{"$match":{"amenity":{"$exists":1},
                                 "amenity":"public_building",}},      
                      {"$group":{"_id":{"Name":"$name"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "Building":"$_id.Name",
                                  "Name":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":10}])
pprint.pprint(list(building))

[{u'Building': None, u'Name': 1},
 {u'Building': u'Social Security Administration', u'Name': 1},
 {u'Building': u'Suffolk', u'Name': 5},
 {u'Building': u'Middlesex', u'Name': 2}]


# Extracting the Top Cities in the Boston Area

In [20]:
cities = boston.aggregate([
        {"$match": {"address.city":{"$exists":1}}}, 
        {"$group":{"_id":"$address.city", "count":{"$sum":1}}},
        {"$sort": {"count": -1}}, 
        {"$limit":10}                                 
    ])

pprint.pprint(list(cities))

[{u'_id': u'Boston', u'count': 619},
 {u'_id': u'Cambridge', u'count': 555},
 {u'_id': u'Somerville', u'count': 240},
 {u'_id': u'Arlington', u'count': 172},
 {u'_id': u'Allston', u'count': 17},
 {u'_id': u'Arlington, MA', u'count': 9},
 {u'_id': u'Charlestown', u'count': 9},
 {u'_id': u'Watertown', u'count': 9},
 {u'_id': u'Cambridge, MA', u'count': 8},
 {u'_id': u'Brookline', u'count': 7}]


# Conclusion and Other Suggested Improvements

Given the size of the Boston dataset that was analyzed, the data was in much better shape than I anticipated, even with the issues that were found. That said, there are definitely areas for improvement. E.g., we noticed inconsistencies in the street names while auditing in Python. In addition, we found other issues around missing data, or data being associated with different types: 
1. When we executed a query to extract the list of colleges in the Boston area based on amenity == "college", the result set was missing some of the key institutions in the Boston area (e.g. MIT, Harvard, Northeastern). On further analysis of the OSM file we noticed that the data is in fact present, but associated with a different type. 
2. Similarly, when we executed a query to extract the list of public buildings in the Boston area based on amenity == "public_building", not many buildings showed up. 

The root cause of 1 and 2 is most likely that the map is built from manual contributions by hundreds of users over the web. One way to reduce this would be a structured input form that forces users to adhere to a standard structure/format; this should reduce the number of anomalies.

We could also do further analysis on other data elements. Some examples are: 

1. Analyzing the post codes to see whether the zip codes belong to the Boston area, and whether any of them contain alphabetical characters.
2. Regularly syncing the user-entered data in OpenStreetMap with other mapping software, such as Google Maps via the Google API, and cleaning up some of the data formats to be consistent. 

Note: I am not recommending filling in missing data; all I am recommending is cleaning up/updating the already entered data with a cleaner version of the data (if available) via the Google API. 
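
The post code audit suggested in item 1 could start from something like the sketch below. It flags values that are not a plain 5-digit zip code (optionally followed by a 4-digit extension) and values that do not start with one of a few expected prefixes. The prefixes '021', '022' and '024' are my assumption for the area covered by this extract, and `audit_postcode` and the sample values are made up for illustration.

```python
import re

# 5 digits, optionally followed by a ZIP+4 extension
zip_re = re.compile(r'^\d{5}(-\d{4})?$')


def audit_postcode(value, prefixes=('021', '022', '024')):
    """Classify a post code as 'ok', 'malformed' or 'outside_area'."""
    value = value.strip()
    if not zip_re.match(value):
        return 'malformed'
    if not value.startswith(prefixes):
        return 'outside_area'
    return 'ok'


# Hypothetical sample values
for code in ['02139', '02139-4307', 'MA 02116', '01801']:
    print('{} -> {}'.format(code, audit_postcode(code)))
```

Values flagged as malformed could then be reviewed in the same way as the street names, with a mapping or a regular expression that extracts the 5-digit part.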

Finally, we could also run competitions (or coding meetups) in which specific parts of the dataset are assigned to groups of people, who then do a comprehensive analysis and update OpenStreetMap. 

# References

Listed below are some of the references I used:
* http://wiki.openstreetmap.org/wiki/Browsing
* Udacity Lectures 
* A lot of StackOverflow threads (every time I ran into an error in Python or with regular expressions)
* https://docs.python.org/2/library/re.html
* https://regexone.com/references/python
* https://pymotw.com/2/xml/etree/ElementTree/parse.html#parsing-an-entire-document
* http://effbot.org/zone/element-iterparse.htm



