## Project OpenStreet Map Data Wrangling - Processing Coding Part

#### Map Area

New York, NY, US

- [https://mapzen.com/data/metro-extracts/metro/new-york_new-york/](https://mapzen.com/data/metro-extracts/metro/new-york_new-york/)

New York is the area of the United States I know best, and I am interested in what the OpenStreetMap data reveals about this metro area.

<img src="img/image.png" alt="Drawing" style="width: 400px;"/>


In [None]:
import re
import xml.etree.cElementTree as ET
import pprint
from collections import defaultdict
import json

##### Data Sampling

Because the full dataset is very large, I first extract a sample (every k-th top-level element) and use it to test the whole process.

In [2]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET  # Use cElementTree or lxml if too slow

OSM_FILE = "new-york.osm"  # Replace this with your osm file
SAMPLE_FILE = "sample1.osm"

k = 5 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

#### Data Structure - Node/Way Structure

Below are sample node and way structures similar to the ones in the dataset.

For the node,
```xml
    <node id="757860928" visible="true" version="2" changeset="5288876" timestamp="2010-07-22T16:16:51Z" user="uboot" uid="26299" lat="41.9747374" lon="-87.6920102">
    <tag k="amenity" v="fast_food"/>
    <tag k="cuisine" v="sausage"/>
    <tag k="name" v="Shelly's Tasty Freeze"/>
    </node>
```

For the way,

```xml
    <way id="209809850" visible="true" version="1" changeset="15353317" timestamp="2013-03-13T15:58:04Z" user="chicago-buildings" uid="674454">
      <nd ref="2199822281"/>
      <nd ref="2199822390"/>
      <nd ref="2199822392"/>
      <nd ref="2199822369"/>
      <nd ref="2199822370"/>
      <nd ref="2199822284"/>
      <nd ref="2199822281"/>
      <tag k="addr:housenumber" v="1412"/>
      <tag k="addr:street" v="West Lexington St."/>
      <tag k="addr:street:name" v="Lexington"/>
      <tag k="addr:street:prefix" v="West"/>
      <tag k="addr:street:type" v="Street"/>
      <tag k="building" v="yes"/>
      <tag k="building:levels" v="1"/>
      <tag k="chicago:building_id" v="366409"/>
     </way>
 
 ```  
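As a minimal sketch of how such a structure is read in Python, the snippet below parses the node sample above with `ElementTree` and separates the attributes of the element from its second-level `<tag>` children. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Node sample copied from the structure shown above.
NODE_XML = """
<node id="757860928" visible="true" version="2" changeset="5288876"
      timestamp="2010-07-22T16:16:51Z" user="uboot" uid="26299"
      lat="41.9747374" lon="-87.6920102">
  <tag k="amenity" v="fast_food"/>
  <tag k="cuisine" v="sausage"/>
  <tag k="name" v="Shelly's Tasty Freeze"/>
</node>
""".strip()

node = ET.fromstring(NODE_XML)

# Top-level attributes live on the element itself.
node_id = node.get("id")
position = (float(node.get("lat")), float(node.get("lon")))

# Second-level <tag> children carry free-form key/value pairs.
tags = {tag.get("k"): tag.get("v") for tag in node.iter("tag")}

print(node_id, position)
print(tags)
```

All attribute values come back as strings, so numeric fields such as `lat` and `lon` have to be converted explicitly.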
    

### Auditing Data

Before loading the data into the database, I first audit it to understand the data and the problems associated with it.

##### Auditing the Primary (top) Tag

In [None]:
#tag_count = {}
#for event, element in ET.iterparse('sample1.osm'):
#    if element.tag in tag_count:
#        tag_count[element.tag] += 1
#   else:
#       tag_count[element.tag] = 1

#pprint.pprint(tag_count)

In the New York region OpenStreetMap data there are 11578310 nodes, 1816197 ways, and 10360 relations.

In [3]:
tag_count = {}
for event, element in ET.iterparse('new-york.osm'):
    if element.tag in tag_count:
        tag_count[element.tag] += 1
    else:
        tag_count[element.tag] = 1

pprint.pprint(tag_count)


{'bounds': 1,
 'member': 115813,
 'nd': 14628206,
 'node': 11578310,
 'osm': 1,
 'relation': 10360,
 'tag': 9766470,
 'way': 1816197}
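The same count can be written more compactly with `collections.Counter`. The sketch below is one possible variant, run on a tiny made-up document held in memory so it can be tried without the full extract; `count_tags` and `SAMPLE` are illustrative names, not part of the original notebook.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags(source):
    """Count element names in an OSM XML source (path or file object)."""
    counts = Counter()
    for _, elem in ET.iterparse(source):   # default: 'end' events only
        counts[elem.tag] += 1
        elem.clear()                        # drop children that were already counted
    return counts

# Made-up miniature OSM document, used only for this illustration.
SAMPLE = b"""<osm>
  <node id="1" lat="0" lon="0"><tag k="name" v="A"/></node>
  <node id="2" lat="0" lon="0"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""

tag_count = count_tags(io.BytesIO(SAMPLE))
print(dict(tag_count))
```

Calling `elem.clear()` after each element keeps memory use lower on a large file, at the cost of not being able to inspect children later.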


##### Auditing the Distinct Users

Using the get_user function, we can count the unique user ids, 5221 in total.

In [None]:
def get_user(element):
    return element.get('uid')


def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
            users.add(get_user(element))
    #print users
    return users

In [None]:
users = process_map('new-york.osm')
pprint.pprint(len(users))
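Elements such as `<tag>` and `<nd>` carry no `uid` attribute, so `element.get('uid')` returns `None` for them. A possible variant that skips those elements is sketched below on a small made-up document; `unique_uids` and `SAMPLE` are hypothetical names introduced for this illustration.

```python
import io
import xml.etree.ElementTree as ET

def unique_uids(source):
    """Return the set of uid values found in an OSM XML source."""
    uids = set()
    for _, elem in ET.iterparse(source):
        uid = elem.get("uid")
        if uid is not None:          # <tag>, <nd>, <osm> carry no uid
            uids.add(uid)
    return uids

# Made-up miniature document; the uid values are taken from the samples above.
SAMPLE = b"""<osm>
  <node id="1" uid="26299" lat="0" lon="0"><tag k="name" v="A"/></node>
  <node id="2" uid="26299" lat="0" lon="0"/>
  <way id="3" uid="674454"><nd ref="1"/><nd ref="2"/></way>
</osm>"""

users = unique_uids(io.BytesIO(SAMPLE))
print(len(users))
```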

##### Auditing Type of tag

Beyond the primary tags summarized above, and since we focus only on nodes and ways, I look at the second-level `<tag>` elements and check their "k" attribute, which is the key of the tag.

Here, I set up 4 categories based on the tags.

- "lower", for tags that contain only lowercase letters and are valid.
    
- "lower_colon", for otherwise valid tags with a colon in their names.
    
- "problemchars", for tags with problematic characters.
    
- "other", for other tags that do not fall into the other three categories.

In [4]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
lo = set()
lo_co = set()
pro_co = set()
oth = set()

In [5]:
def key_type(element, keys):
    """Audit the types of tag keys defined above"""
    if element.tag == "tag":
        if lower.match(element.attrib['k']):
            keys['lower']+=1
            lo.add(element.attrib['k']) 
        elif lower_colon.match(element.attrib['k']):
            keys['lower_colon']+=1
            lo_co.add(element.attrib['k']) 
        elif problemchars.search(element.attrib['k']):  # search: the character may appear anywhere in the key
            keys['problemchars']+=1
            pro_co.add(element.attrib['k']) 
        else:
            keys['other']+=1
            oth.add(element.attrib['k'])
    return keys

In [6]:
keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
for _, element in ET.iterparse('new-york.osm'):
    keys = key_type(element, keys)
pprint.pprint(keys)

{'lower': 191752, 'lower_colon': 290361, 'other': 6634, 'problemchars': 0}


In [16]:
def write_data(data, filename):
    """Split and save different types of tags"""
    with open(filename, 'wb') as f:
        for x in data:
            f.write(x + "\n")

In [17]:
write_data(lo, 'lower.txt')
write_data(lo_co, 'lower_colon.txt')
write_data(pro_co, 'problem_chars.txt')
write_data(oth, 'other.txt')

The key types defined above are lowercase, lowercase with a colon, problematic characters, and other.

Looking through other.txt, the most common cases are:
- Include uppercase letters (magic:DESC_)
- Include numbers in the text (ISO3166-1:alpha3)
- Include more than one colon (railway:signal:combined:hood)
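To make the four categories concrete, here is a small sketch that classifies individual keys with the same three regular expressions. `classify_key` is a hypothetical helper, and the key `"bad key"` is made up to show the problematic-character case; the other keys come from the examples above.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify_key(key):
    """Return the audit category for a single tag key."""
    if lower.match(key):
        return "lower"
    if lower_colon.match(key):
        return "lower_colon"
    if problemchars.search(key):   # search: the character may sit anywhere
        return "problemchars"
    return "other"

for key in ["highway", "addr:street", "bad key", "magic:DESC_",
            "ISO3166-1:alpha3", "railway:signal:combined:hood"]:
    print(key, "->", classify_key(key))
```

The last three keys fall into "other" because of uppercase letters, digits, and more than one colon respectively.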

##### Auditing Street Name

Continuing from the tag keys explored above: since this is map data, I am mostly interested in the data related to addresses. Here I explore the street names to check the data quality.

In [2]:
street_type_re=re.compile(r'\b\S+\.?$',re.IGNORECASE) #character nospace . (0/1) end

# The standard name that could be the street type.
expected=["Street", "Avenue", "Alley","Boulevard","Broadway", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons","Path","Plaza","Terrace","Walk","Way","Circle","Crescent","Expressway","Highway","Loop","Terminal"]

def audit_street_type(street_types,street_name):
    m=street_type_re.search(street_name)
    if m:
        street_type=m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

In [3]:
def is_street_name(elem):
    return (elem.attrib['k']=="addr:street")

In [4]:
def audit(osmfile):
    """Collect street names whose last token is not an expected street type."""
    street_types = defaultdict(set)
    with open(osmfile, "r") as osm_file:
        # Use the default "end" events: at "start" the child <tag> elements
        # are not guaranteed to have been parsed yet.
        for event, elem in ET.iterparse(osm_file):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    return dict(street_types)


In [7]:
st_types = audit('new-york.osm')
pprint.pprint(dict(st_types))

{'1': set(['36th St Front 1',
           'Burt Drive ste 1',
           'Cushing St # 1',
           'Graham Avenue #1',
           'Grand Concourse #1',
           'New Jersey 17 N #1',
           'North US Highway 1',
           'ROAD 1',
           'U.S. 1',
           'US 1',
           'US HIGHWAY 1',
           'US Highway 1']),
 '10': set(['NJ Route 10', 'Nwe Jersey 10', 'Route 10']),
 '10003': set(['Irvine Place, #1, New York, NY, 10003']),
 '10024': set(['West 80th Street NYC 10024']),
 '101': set(['20th Ave, Suite 101']),
 '106B': set(['Lincoln Highway #106B']),
 '109': set(['Central Park South Suite 109',
             'Route 109',
             'State Highway 109']),
 '10C': set(['78th #10C']),
 '11217': set(['305 Schermerhorn St., Brooklyn, NY 11217']),
 '12': set(['Main St #12']),
 '120': set(['H Highway 34 #120']),
 '1204': set(['Journal Square #1204']),
 '1603': set(['STREET 1603']),
 '17': set(['New Jersey 17', 'Route 17']),
 '18': set(['NJ 18', 'State Route 18']),
 '180

 'Hl': set(['Blueberry Hl', 'Diana Hl', 'Eagle Rock Hl', 'Turnip Hl']),
 'Hudson': set(['Castle Point on Hudson']),
 'Hwy': set(['Sunrise Hwy']),
 'I': set(['Avenue I', 'E Putnam Ave #I', 'Lane I', 'NICHOL AVENUE BLD I']),
 'Island': set(['Captree Island',
                'Governors Island',
                'Great Island',
                'Liberty Island',
                'Randalls Island',
                'Wards Island']),
 'J': set(['Avenue J', 'NICHOL AVENUE BLD J']),
 'Jefferson': set(['North Jefferson']),
 'Jefffries': set(['Jefffries']),
 'John': set(['Avenue Saint John']),
 'K': set(['Avenue K']),
 'Knls': set(['Idle Day Knls']),
 'Knolls': set(['The Knolls']),
 'L': set(['Avenue L', 'Lane L', 'NICHOL AVENUE BLD L']),
 'LANE': set(['HOES LANE',
              'MARVIN LANE',
              'POULTRY LANE',
              'RYDERS LANE',
              'SHEEPFOLD LANE',
              'SUTTON LANE']),
 'LC': set(['E 72nd St #LC']),
 'Lafayette': set(['Lafayette']),
 'Landing': set(['Bay 

As we can see, there are several problems in the street names:

- Abbreviation: some names abbreviate the street type, e.g. Ave. for Avenue.
- Not ending with the street type: some names end with a direction, a number, or a single letter.
- Capitalization: some names are all uppercase or all lowercase.
- Typos
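The grouping in the output above follows from the regular expression, which captures only the last whitespace-separated token of a name. The sketch below replays that step on a few names from the audit output; the `expected` set is shortened, and "Lexington Street" is a made-up name added to show a value that passes the check.

```python
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = {"Street", "Avenue", "Lane"}   # shortened list for the example

names = [
    "Sunrise Hwy", "HOES LANE", "Avenue K", "Main St #12",
    "West 80th Street NYC 10024",
    "Lexington Street",   # made-up name that ends in an expected type
]

street_types = defaultdict(set)
for name in names:
    m = street_type_re.search(name)
    if m and m.group() not in expected:
        street_types[m.group()].add(name)

for street_type in sorted(street_types):
    print(street_type, sorted(street_types[street_type]))
```

The comparison with `expected` is case-sensitive, which is why "HOES LANE" is reported under "LANE" even though "Lane" is an expected type.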


In [38]:
#for st_type, ways in st_types.iteritems():
 #   for name in ways:
  #      better_name = update_street_name(name, mapping)

In [None]:
#for event,elem in ET.iterparse('new-york.osm',events=("start",)):
#    if elem.tag=="way":
 #       for tag in elem.iter("tag"):
 #           if is_street_name(tag):
 #               audit_street_type(street_types,tag.attrib['v'])
#pprint.pprint(dict(street_types))    
                    

##### Auditing City Name

As with the street names, I audit the city names to check the data quality.

In [6]:
city_name_re=re.compile(r'\b\S+\.?$',re.IGNORECASE) #character nospace . (0/1) end

def is_city_name(elem):
    return (elem.attrib['k']=="addr:city")

In [7]:
city_names = defaultdict(set)
with open('new-york.osm', "r") as osm_file:
    # Default "end" events, so the child <tag> elements are fully parsed.
    for event, elem in ET.iterparse(osm_file):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_city_name(tag):
                    m=city_name_re.search(tag.attrib['v'])
                    if m:
                        city_name=m.group()
                        city_names[city_name].add(tag.attrib['v'])
                    

In [8]:
city_names

defaultdict(set,
            {'08901': {'New Brunswick, NJ 08901'},
             '1': {'1'},
             '11370': {'East Elmhurst, NY 11370', 'Jackson Heights, NY 11370'},
             'Aberdeen': {'Aberdeen'},
             'Albans': {'Saint Albans', 'St Albans'},
             'Albertson': {'Albertson'},
             'Allendale': {'Allendale'},
             'Alpine': {'Alpine'},
             'Amboy': {'Perth Amboy', 'South Amboy'},
             'Amityville': {'Amityville', 'North Amityville'},
             'Ardsley': {'Ardsley'},
             'Arlington': {'North Arlington'},
             'Arverne': {'Arverne'},
             'Asharoken': {'Asharoken'},
             'Astoria': {'Astoria'},
             'Avenel': {'Avenel'},
             'Babylon': {'Babylon', 'North Babylon', 'West Babylon'},
             'Baldwin': {'Baldwin'},
             'Bank': {'Red Bank'},
             'Bay': {'Huntington Bay', 'Oyster Bay'},
             'Bayonne': {'Bayonne'},
             'Bayside': {'Bayside

Based on the city names extracted from the OpenStreetMap data, the same city appears under several variants:

- Capitalization: not every word starts with a capital letter, e.g. New York City: new york, New York city.
- Typos: some names are misspelled, e.g. Brooklyn: brooklyn, Brookklyn.
- Abbreviation: some cities appear in abbreviated form, e.g. New York: NY.
- Trailing punctuation: some names end with punctuation such as a comma or a parenthesis.
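One way to surface the capitalization and punctuation variants automatically is to group names by a normalized form. This is a different technique from the last-token grouping used in the audit above, shown only as a sketch; `variant_groups` is a hypothetical helper and the input list is a small hand-picked sample.

```python
from collections import defaultdict

def variant_groups(names):
    """Group names that differ only in case or surrounding punctuation."""
    groups = defaultdict(set)
    for name in names:
        key = name.strip(" ,.()").lower()
        groups[key].add(name)
    # Keep only the normalized forms that have more than one spelling.
    return {key: found for key, found in groups.items() if len(found) > 1}

sample_cities = ["Brooklyn", "brooklyn", "New York City", "New York city",
                 "Queens)", "Queens", "Bayonne"]

duplicates = variant_groups(sample_cities)
for key in sorted(duplicates):
    print(key, sorted(duplicates[key]))
```

Misspellings such as Brookklyn are not caught by this normalization and still need a manual mapping.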



#### Data Update

To address the city and street name issues found above, the two functions below update the city name and street name while the data is prepared for import into MongoDB.

##### Update the street name

In [None]:
# Maps the street types found in the audit to their standard form.
mapping = { "St": "Street",
            "St.": "Street",
            "Ave": "Avenue",
            "AVENUE": "Avenue",
            "AVenue": "Avenue",
            "Ave.": "Avenue",
            "AVE.": "Avenue",
            "Ave,": "Avenue",
            "Avenue,": "Avenue",
            "Avene": "Avenue",
            "Aveneu": "Avenue",
            "Avenue,#392": "Avenue",
            "ave": "Avenue",
            "avenue": "Avenue",
            "Blv.": "Boulevard",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "boulevard": "Boulevard",
            "Broadwat": "Broadway",
            "Broadway.": "Broadway",
            "CIRCLE": "Circle",
            "Cir": "Circle",
            "Cres": "Crescent",
            "DRIVE": "Drive",
            "drive": "Drive",
            "Dr": "Drive",
            "Dr.": "Drive",
            "Driveway": "Drive",
            "E": "East",
            "EAST": "East",
            "Expy": "Expressway",
            "Hwy": "Highway",
            "LANE": "Lane",
            "lane": "Lane",
            "N": "North",
            "north": "North",
            "PARKWAY": "Parkway",
            "Pkwy": "Parkway",
            "Pky": "Parkway",
            "PLACE": "Place",
            "Pl": "Place",
            "Plz": "Plaza",      # Plz abbreviates Plaza
            "PLAZA": "Plaza",
            "ROAD": "Road",
            "Rd": "Road",
            "Rd.": "Road",
            "road": "Road",
            "S": "South",
            "SOUTH": "South",
            "ST": "Street",
            "st": "Street",
            "STREET": "Street",
            "Steet": "Street",
            "street": "Street",
            "Streeet": "Street",
            "Ter": "Terrace",    # Ter abbreviates Terrace
            "Tunpike": "Turnpike",
            "Tunrpike": "Turnpike",
            "Turnlike": "Turnpike",
            "W": "West",
            "WEST": "West",
            "west": "West",
            "WAY": "Way"
            }

def update_street_name(name, mapping):
    """Replace the street type at the end of name using mapping."""
    m=street_type_re.search(name)
    better_name=name  # names whose street type is not in the mapping are returned unchanged
    if m:
        if m.group() in mapping:  # group() must be called to get the matched text
            better_street_type=mapping[m.group()]
            better_name=street_type_re.sub(better_street_type,name)

    return better_name
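The following self-contained sketch shows the intended effect of the street name update on a few names from this notebook, using a shortened mapping (`sample_mapping` is an illustrative subset, not the full table above).

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Shortened mapping, for illustration only.
sample_mapping = {"St": "Street", "St.": "Street", "Ave": "Avenue", "Hwy": "Highway"}

def update_street_name(name, mapping):
    """Replace the last token of name when it is a key of mapping."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        return street_type_re.sub(mapping[m.group()], name)
    return name

for name in ["West Lexington St.", "Sunrise Hwy", "N. Lincoln Ave",
             "Avenue K", "Main St #12"]:
    print(name, "->", update_street_name(name, sample_mapping))
```

Because only the last token is inspected, a name like "Main St #12" is left unchanged: its last token is "12", not "St".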



##### Update City Name

In [9]:
# update the city name

city_mapping={
    "new york": "New York City",
    "New York city": "New York City",
    "york":"New York",
    "CITY": "New York City",
    "CIty": "New York City",
    "brooklyn":"Brooklyn",
    "Brookklyn": "Brooklyn",
    "BrookLyn": "Brooklyn",
    "Bronx, NY": "Bronx", 
    "bloomfield":"Bloomfield",
    "caldwell":"West Caldwell",
    "flushing": "Flushing",
    "FARMINGDALE": "Farmingdale",
    "island":"Staten Island",
    'linden':"Linden",
    "northport":"Northport",
    "new rochelle": "New Rochelle",
    "ny":"New York",
    "Lake":"Lakes",
    'plaine':"White Plains",
    "PISCATAWAY":"Piscataway",
    "Queens)":"Queens",
    "queens":"Queens",
    "ridgewood":"Ridgewood",
    "rochelle":"New Rochelle",
    "stamford":"Stamford",
    "vernon":'Mount Vernon'  
}

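def update_city_name(name, city_mapping):
    # Multi-word keys such as "new york" can only match the whole name,
    # because city_name_re captures just the last word.
    if name in city_mapping:
        return city_mapping[name]
    m=city_name_re.search(name)
    update=name
    if m:
        if m.group() in city_mapping:
            better_city_name=city_mapping[m.group()]
            update=city_name_re.sub(better_city_name,name)
    return update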
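The city regular expression captures only the last word of a name, so replacing that word with a multi-word city name can duplicate words. The sketch below contrasts a last-word-only substitution with a variant that looks up the whole name first; both helper names and the shortened `sample_city_mapping` are assumptions made for this illustration.

```python
import re

city_name_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# A few entries in the spirit of city_mapping above.
sample_city_mapping = {
    "new york": "New York City",
    "york": "New York",
    "brooklyn": "Brooklyn",
}

def last_word_only(name, mapping):
    """Substitute only the last word of name when it is a mapping key."""
    m = city_name_re.search(name)
    if m and m.group() in mapping:
        return city_name_re.sub(mapping[m.group()], name)
    return name

def whole_name_first(name, mapping):
    """Try the complete name first, then fall back to the last word."""
    if name in mapping:
        return mapping[name]
    return last_word_only(name, mapping)

print(last_word_only("new york", sample_city_mapping))    # last word "york" is replaced
print(whole_name_first("new york", sample_city_mapping))  # whole name is replaced
```

With the last-word approach, "new york" becomes "new New York", which is why a whole-name lookup is worth trying before the substitution.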


### Data Transformation




The task is to wrangle the data and reshape it into a list of dictionaries that can be imported into MongoDB. The transformation rules are listed below. Because the transformation is too slow inside the IPython notebook, I prefer to run it as a standalone Python file.

For the dictionary structure for the element,
- I would process "node" and "way" elements
- All attributes of "node" and "way" should be turned into regular key/value pairs, except:
    - Attributes for latitude and longitude should be added to a "pos" key
    - Attributes listed in the CREATED array (version, changeset, timestamp, user, uid) should be added under a key 'created'.
- if second level tag "k" value contains problematic characters, it should be ignored.
- if second level tag "k" value starts with "addr:", it should be added to a dictionary "address".
- if second level tag "k" value does not start with "addr:", but contains ":", you can process it same as any other tag.
- if there is a second ":" that separates the type/direction/prefix of a street, the tag should be ignored.
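The tag rules above can be sketched on their own, independent of XML parsing. In this minimal version `shape_tags` is a hypothetical helper that receives (key, value) pairs; the input values are taken from the Chicago example used in this section.

```python
import re

PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def shape_tags(pairs):
    """Apply the tag rules to a list of (key, value) pairs."""
    doc = {}
    for key, value in pairs:
        if PROBLEMCHARS.search(key):
            continue                      # problematic key: ignore
        if key.startswith("addr:"):
            sub_key = key[len("addr:"):]
            if ":" in sub_key:
                continue                  # second colon, e.g. addr:street:type
            doc.setdefault("address", {})[sub_key] = value
        else:
            doc[key] = value              # any other tag, with or without a colon
    return doc

shaped = shape_tags([
    ("addr:city", "Chicago"),
    ("addr:housenumber", "4874"),
    ("addr:street", "N. Lincoln Ave"),
    ("addr:street:type", "Avenue"),
    ("name", "Matty Ks"),
    ("shop", "doityourself"),
])
print(shaped)
```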

For example, for a node, the transformation looks like this.
```xml
    <node id="757860928" visible="true" version="2" changeset="5288876" timestamp="2010-07-22T16:16:51Z" user="uboot" uid="26299" lat="41.9747374" lon="-87.6920102">
      <tag k="amenity" v="fast_food"/>
      <tag k="cuisine" v="sausage"/>
      <tag k="name" v="Shelly's Tasty Freeze"/>
     </node>

     <tag k="addr:city" v="Chicago"/>
      <tag k="addr:country" v="US"/>
      <tag k="addr:housenumber" v="4874"/>
      <tag k="addr:postcode" v="60625"/>
      <tag k="addr:state" v="Illinois"/>
      <tag k="addr:street" v="N. Lincoln Ave"/>
      <tag k="addr:street:type" v="Avenue"/> ##this will be ignored
      <tag k="name" v="Matty Ks"/>
      <tag k="phone" v="(773)-654-1347"/>
      <tag k="shop" v="doityourself"/>
      <tag k="source" v="GPS"/>
```

should be turned into,

    {...
    "address": {
              "city": "Chicago",
              "country": "US",
              "housenumber": "4874",
              "postcode": "60625",
              "state": "Illinois",
              "street": "N. Lincoln Ave",
            },
    "name":"Matty Ks",
    "phone": "(773)-654-1347",
    "shop": "doityourself",
    "source": "GPS"
    ...
    }
 
 
If we take a short node example, the function returns a dictionary containing the shaped data for the element.

    {
    'id': 757860928,
    'type':'node',
    "created":{
        'version': '2',
        'changeset': 5288876,
        'timestamp': '2010-07-22T16:16:51Z',
        'user': 'uboot',
         'uid': 26299,
         },
    "pos": { 
              'lat': 41.9747374,
              'lon': -87.6920102,
    },
    "address": {
              "housenumber": "5157",
              "postcode": "60625",
              "street": "North Lincoln Ave"
            },
    "amenity": 'fast_food',
    "cuisine": 'sausage',
    "name": "Shelly's Tasty Freeze"
    }



For "way" elements specifically, the node references are collected into an array stored under the key 'node_refs'. For example,
```xml
    <nd ref="305896090"/>
    <nd ref="1719825889"/>
```
should be turned into "node_refs": ["305896090", "1719825889"]
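A minimal sketch of that step, assuming the references sit in a `<way>` element (the way id used here is made up):

```python
import xml.etree.ElementTree as ET

# The two <nd> references from the example above, wrapped in a made-up way.
WAY_XML = '<way id="1"><nd ref="305896090"/><nd ref="1719825889"/></way>'

way = ET.fromstring(WAY_XML)

# Keep the references in document order; they stay strings, as in the example.
node_refs = [nd.get("ref") for nd in way.iter("nd")]
print({"node_refs": node_refs})
```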

For example in way,
```xml
     <way id="209809850" visible="true" version="1" changeset="15353317" timestamp="2013-03-13T15:58:04Z" user="chicago-buildings" uid="674454">
      <nd ref="2199822281"/>
      <nd ref="2199822390"/>
      <nd ref="2199822392"/>
      <nd ref="2199822369"/>
      <nd ref="2199822370"/>
      <nd ref="2199822284"/>
      <nd ref="2199822281"/>
      <tag k="addr:housenumber" v="1412"/>
      <tag k="addr:street" v="West Lexington St."/>
      <tag k="addr:street:name" v="Lexington"/>
      <tag k="addr:street:prefix" v="West"/>
      <tag k="addr:street:type" v="Street"/>
      <tag k="building" v="yes"/>
      <tag k="building:levels" v="1"/>
      <tag k="chicago:building_id" v="366409"/>
     </way>
```
should be turned into,

    {
    'id': 209809850,
    'type':'way',
    "created":{
        'version': '1',
        'changeset': 15353317,
        'timestamp': '2013-03-13T15:58:04Z',
        'user': 'chicago-buildings',
         'uid': 674454,
         },
    "node_refs": ["2199822281", "2199822390","2199822392", "2199822369", "2199822370","2199822284" ,"2199822281"],
    "address": {
              "housenumber": "1412",
              "street": "West Lexington St."
            },
    "building": "yes",
    "building:levels": "1",
    "chicago:building_id": "366409"
    }


In [10]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
lower_dot=re.compile(r'^([a-z]|_)*\.([a-z]|_)*$')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    node={}
    if element.tag=='node' or element.tag == "way" :
        node["type"]=element.tag
        # go through the element attributes
        for attrib, value in element.attrib.items():
            # Collect the attributes listed in CREATED under the "created" dictionary
            if attrib in CREATED:
                if "created" not in node:
                    node['created']={}
                node['created'][attrib]=value
            elif attrib=='lat' or attrib=='lon':
                if "pos" not in node:
                    node['pos']={}
                node['pos'][attrib]=float(value)
            else:
                node[attrib]=value

        # process the child tags once per element, not once per attribute
        for tag in element.iter("tag"):
            tag_key=tag.attrib['k']
            tag_value=tag.attrib['v']

            if lower_dot.match(tag_key):
                # nest dotted keys first; the dot is also part of PROBLEMCHARS
                a,b=tag_key.split('.')
                if a not in node:
                    node[a]={}
                node[a][b]=tag_value
            elif PROBLEMCHARS.search(tag_key):  # problematic character anywhere in the key
                continue
            elif tag_key.startswith("addr:"):
                if "address" not in node:
                    node["address"] = {}
                addr_key = tag_key[len("addr:") : ]
                if lower_colon.match(addr_key): # a second colon, e.g. addr:street:type
                    continue
                elif addr_key=='city':
                    node["address"][addr_key]=update_city_name(tag_value, city_mapping)
                elif addr_key=='street':
                    node["address"][addr_key]=update_street_name(tag_value, mapping)
                else:
                    node["address"][addr_key]=tag_value
            else:
                node[tag_key]=tag_value

        # collect the node references of a way, again once per element
        for nd in element.iter("nd"):
            if "node_refs" not in node:
                node["node_refs"] = []
            node["node_refs"].append(nd.attrib['ref'])

        return node
           

In [11]:
#Read the generated list of dictionaries, and dump into the json file

def process_map(file_in):
    
    filename = "{0}.json".format(file_in)
    data=[]
    with open(filename, 'wb') as outfile:
        for _,element in ET.iterparse(file_in):
            dic=shape_element(element)
            if dic:
                data.append(dic)
        json.dump(data, outfile)
    return data
                

In [None]:
data = process_map('new-york.osm')

### Import of the data

With the generated JSON file new-york.osm.json, it is time to import the data into MongoDB.
I did this in two ways. Because the sample I tested first is quite small, I used Python (PyMongo) for the import.

In [122]:
# insert data through Pymongo
from pymongo import MongoClient
client=MongoClient('mongodb://localhost:27017')

In [123]:
client.drop_database('udacity_openstreet') # drop the database first so the import starts from an empty database

# create the database
db=client['udacity_openstreet']
collection=db['map']


In [124]:
#insert data
for item in data:
    collection.insert_one(item)

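Inserting one document per call can be slow for a larger list. A possible alternative, sketched here under the assumption that PyMongo 3.0 or later is installed, is to send the documents in batches with `insert_many`. The helpers `batched` and `insert_in_batches` and the batch size of 1000 are illustrative choices, not part of the original notebook.

```python
def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def insert_in_batches(collection, documents, size=1000):
    """Insert documents batch by batch and return how many were sent.

    `collection` is assumed to be a PyMongo collection, whose
    insert_many method accepts a list of dictionaries.
    """
    total = 0
    for batch in batched(documents, size):
        collection.insert_many(batch)
        total += len(batch)
    return total

# The batching itself can be checked without a database.
print([len(b) for b in batched(range(2500), 1000)])
```

With a running MongoDB instance, the call would look like `insert_in_batches(collection, data)`.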


Another method is to import the JSON file into MongoDB from the command line with mongoimport, which handles large files much better.


In [None]:
#~/Desktop/project/Udacity/data_analyst/data_Wrangling/OpenstreetMap
# insert by mongo command
#mongod --dbpath ~/Desktop/project/Udacity/data_analyst/data_Wrangling/OpenstreetMap
mongoimport --db udacitymap --collection map --file ~/Desktop/project/Udacity/data_analyst/data_Wrangling/OpenstreetMap/new-york.osm.json --jsonArray
mongoimport --db dbName --collection collectionName --file fileName.json --jsonArray