# OpenStreetMap Data Wrangling Process 

## A Short Introduction on OpenStreetMap

The OpenStreetMap (OSM) project is building a free and editable map of the world, enabling the development of freely reusable geospatial data. OpenStreetMap data is used by many applications, such as Foursquare and Craigslist.

The OpenStreetMap data structure has four core elements: Nodes, Ways, Relations, and Tags.

- Nodes are points with a geographic position stored as lat (latitude) and lon (longitude). They are used to show points on the map, such as points of interest.
- Ways are ordered lists of nodes, representing a polyline, or a polygon if they form a closed loop. They are used to show streets, rivers, and areas such as parks, lakes, etc.
- Relations are ordered lists of nodes, ways and relations (called 'members'), where each member can have a 'role' (a string). They are used to show the relationship between nodes and ways, such as restrictions on roads.
- Tags are pairs of Keys and Values. They are used to store metadata of the map such as address, type of building, or any sort of physical property. Tags are always attached to a node, way or relation and are not stand-alone elements.
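
To make the four element types concrete, here is a minimal sketch that parses a tiny OSM XML fragment and collects what each element carries. The fragment is made up for illustration (ids, coordinates, and tag values are not taken from the San Francisco data).

```python
import xml.etree.ElementTree as ET

# Hypothetical OSM fragment: two nodes, one way that references them, one relation.
SAMPLE = """
<osm>
  <node id="1" lat="37.77" lon="-122.42">
    <tag k="amenity" v="cafe" />
  </node>
  <node id="2" lat="37.78" lon="-122.41" />
  <way id="10">
    <nd ref="1" />
    <nd ref="2" />
    <tag k="highway" v="residential" />
  </way>
  <relation id="100">
    <member type="way" ref="10" role="outer" />
    <tag k="type" v="multipolygon" />
  </relation>
</osm>
"""

root = ET.fromstring(SAMPLE)

# Nodes: a geographic position (lat, lon).
nodes = {n.get("id"): (float(n.get("lat")), float(n.get("lon")))
         for n in root.findall("node")}

# Ways: an ordered list of node references.
way_refs = {w.get("id"): [nd.get("ref") for nd in w.findall("nd")]
            for w in root.findall("way")}

# Relations: members, each with a type, a reference, and a role.
members = [(m.get("type"), m.get("ref"), m.get("role")) for m in root.iter("member")]

# Tags: key/value pairs, always attached to a parent element.
tags = {(parent.tag, parent.get("id")): {t.get("k"): t.get("v") for t in parent.findall("tag")}
        for parent in root}
```

Note that the tags are looked up through their parent element, which mirrors the point above that tags never stand alone.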

To look at the map, or download your area of interest, you can visit the http://www.openstreetmap.org website.

For more background information, you can check the Wikipedia article on OpenStreetMap:
https://en.wikipedia.org/wiki/OpenStreetMap

Users can add points to the map, create relations between nodes and ways, and assign properties such as an address or a type to them. The data is stored in the OSM format and can be accessed in different formats. For the purpose of this project, I use the OSM XML format.
http://wiki.openstreetmap.org/wiki/OSM_XML

In this project, I work with the raw data from one area. Since the data is entered by many different users, I expect it to be quite messy; therefore, I use auditing and cleaning functions to prepare the data for analysis. I export the data to CSV files and use them to create tables in my SQLite database. Then I run queries on the database to extract information such as the number of nodes and ways, the most active contributors, the top points of interest, etc. I conclude the project by suggesting an improvement to the data and discussing its benefits as well as some anticipated problems in implementing it.

## Area Chosen for Analysis

For this project, I chose the San Francisco area in the US. I chose this area because it interests me with its big IT corporations; it is also a place I want to travel to one day. I downloaded the file to my local machine.
https://mapzen.com/data/metro-extracts/metro/san-francisco_california/

The 'metro extracts' provide the map of the metropolitan area (i.e. where you can find more elements to work with).

The original file is about 1.01GB in size; however, I use a sample file of about 50MB for my initial analysis. Once I am satisfied with the code, I run it on the original file to create the CSV files for my database.

The results shown in this Jupyter notebook come from the sample file, which keeps the output short. All the functions shown here, as well as the function that creates the CSV files, are also available as separate .py files in my repository. Using those, you can run the code on the original file.
https://github.com/Nazaniiin/OpenStreetMap_DataWrangling

## Exploring the Data a bit

Let's start going through the data to find its problems and clean them. First, we'll parse the dataset with ElementTree and extract information such as the different types of elements (nodes, ways, etc.) in the OSM XML file.

ET.iterparse (i.e. iterative parsing) is a good fit here because the original file is too large to load into memory at once. Iterative parsing yields each element as soon as the parser has finished reading it, so the file can be processed piece by piece.
http://stackoverflow.com/questions/12792998/elementtree-iterparse-strategy
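
As a rough illustration of why iterative parsing keeps memory use low, the sketch below streams through a small in-memory document and discards each top-level element once it has been counted. The helper name `count_top_level` and the sample XML are made up for this example, and clearing the root after each element is a common ElementTree pattern rather than something the project code does.

```python
import io
import xml.etree.ElementTree as ET

# Made-up document standing in for a large .osm file.
SAMPLE = b"""<osm>
  <node id="1" lat="0" lon="0"><tag k="name" v="A" /></node>
  <node id="2" lat="1" lon="1" />
  <way id="10"><nd ref="1" /><nd ref="2" /></way>
</osm>"""

def count_top_level(source, wanted=("node", "way", "relation")):
    """Count top-level elements while keeping at most one of them in memory."""
    counts = {}
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)            # the first event is the start of <osm>
    for event, elem in context:
        if event == "end" and elem.tag in wanted:
            counts[elem.tag] = counts.get(elem.tag, 0) + 1
            root.clear()               # drop the finished element and its children
    return counts

result = count_top_level(io.BytesIO(SAMPLE))
```

With a real file, the path can be passed in place of the `BytesIO` object.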

In this code, I iterate through the different elements in the XML (node, way, relation, member, ...) and count them in a dictionary whose keys are the tag names.

In [7]:
import xml.etree.cElementTree as ET
import pprint

OSMFILE = '/Users/nazaninmirarab/Desktop/Data Science/P3/Project/Submission2/san-francisco_california_sample.osm'

def count_tags(filename):
    tags= {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tags.keys():
            tags[elem.tag] = 1
        else:
            tags[elem.tag] += 1
    
    pprint.pprint(tags)
    
count_tags(OSMFILE)

{'member': 2530,
 'nd': 281292,
 'node': 235744,
 'osm': 1,
 'relation': 283,
 'tag': 87071,
 'way': 27558}
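
The same tally can be written more compactly with `collections.Counter`. This is only an alternative sketch; the function name `count_tags_counter` and the inline sample are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags_counter(source):
    """Return a Counter mapping each element name to its number of occurrences."""
    return Counter(elem.tag for _, elem in ET.iterparse(source))

# Made-up sample; with a real file, pass its path instead of a BytesIO object.
sample = io.BytesIO(b'<osm><node id="1" /><node id="2" /><way id="3"><nd ref="1" /></way></osm>')
tag_counts = count_tags_counter(sample)
```

A Counter also offers `most_common()`, which is handy when only the most frequent element types are of interest.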


Next, I explore the data a bit more. In the OSM XML file, each 'tag' element holds a key-value pair with information about a point (node) or way on the map. I classify the keys using the following regular expressions:
- lower -> ^([a-z]|_)*$ : This matches strings that contain only lower case letters and underscores. It starts at the beginning of the string and matches a character in the range 'a' to 'z', or the underscore '_', zero or more times up to the end of the string.

- lower_colon -> ^([a-z]|_)*:([a-z]|_)*$ : This matches strings made of lower case letters and underscores that also contain exactly one colon ':' character, e.g. addr:street, a tag key which specifies a street name.

- problemchars -> [=\+/&<>;\'"\?%#$@\,\. \t\r\n] : This matches keys that contain any of the problematic characters listed in the regex pattern.

Take this section of the map as an example:

    <node changeset="30175357" id="358830340" lat="37.6504905" lon="-122.4896963" timestamp="2015-04-12T22:43:37Z" 
        uid="35667" user="encleadus" version="4">
		<tag k="name" v="Ocean Shore School" />
		<tag k="phone" v="+1 650 738 6650" />
		<tag k="amenity" v="school" />
		<tag k="website" v="http://www.oceanshoreschool.org/" />
		<tag k="addr:city" v="Pacifica" />
		<tag k="addr:state" v="CA" />
		<tag k="addr:street" v="Oceana Boulevard" />
		<tag k="gnis:created" v="04/06/1998" />
		<tag k="addr:postcode" v="94044" />
		<tag k="gnis:state_id" v="06" />
		<tag k="gnis:county_id" v="081" />
		<tag k="gnis:feature_id" v="1785657" />
		<tag k="addr:housenumber" v="411" />
	</node>
    
This node element has 13 tag elements inside it. Several keys contain the ':' character, so they match the 'lower_colon' regular expression. Keys like name, phone, and amenity match the 'lower' regular expression. There are no problematic characters in this specific node.
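
To see the three patterns at work before running them on the whole file, the sketch below classifies a handful of keys. The first three keys come from the node above; the last two are made up to trigger the 'problemchars' and 'other' cases. The helper name `classify` is an assumption for this example.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify(key):
    """Return the name of the first category whose pattern matches the key."""
    if lower.search(key):
        return 'lower'
    if lower_colon.search(key):
        return 'lower_colon'
    if problemchars.search(key):
        return 'problemchars'
    return 'other'

# 'street name' contains a space; 'addr:street:name' has two colons.
examples = ['name', 'addr:street', 'gnis:county_id', 'street name', 'addr:street:name']
labels = {key: classify(key) for key in examples}
```

A key with two colons falls into 'other' because 'lower_colon' allows exactly one colon and the colon is not in the problematic character set.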

In [8]:
import xml.etree.cElementTree as ET
import pprint
import re

OSMFILE = '/Users/nazaninmirarab/Desktop/Data Science/P3/Project/Submission2/san-francisco_california_sample.osm'

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    # only 'tag' elements carry a 'k' attribute
    if element.tag == "tag":
        k = element.attrib['k'] #looking for the tag attribute 'k' which contains the keys
        if re.search(lower, k):
            keys['lower'] += 1
        elif re.search(lower_colon, k):
            keys['lower_colon'] += 1
        elif re.search(problemchars, k):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1

    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    pprint.pprint(keys)
    
process_map(OSMFILE)

{'lower': 51867, 'lower_colon': 33909, 'other': 1281, 'problemchars': 14}


I now want to collect some information about the users who contributed to the OpenStreetMap data for the San Francisco area, starting with the number of unique users.

To find the users, we need to look at the attributes of the node, way and relation elements. The 'uid' attribute is what we need to count.

    <node changeset="30175357" id="358830340" lat="37.6504905" lon="-122.4896963" timestamp="2015-04-12T22:43:37Z" 
    uid="35667" user="encleadus" version="4">
    

In [11]:
OSMFILE = '/Users/nazaninmirarab/Desktop/Data Science/P3/Project/Submission2/san-francisco_california_sample.osm'

def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        if element.tag == 'node' or element.tag == 'way' or element.tag == 'relation':
                userid = element.attrib['uid']
                users.add(userid)

    print len(users)
    
process_map(OSMFILE)

1284


## Auditing Street Names

Because many users enter data into OpenStreetMap, the way street names are formatted can vary. For example, the street type 'Avenue' can be written in formats such as:
- Avenue (starting with capital letter)
- Ave
- Ave.
- avenue (starting with small letter)

To be able to process the data, we need to make these street types uniform. Then, if we later search for specific avenue names, a single search for the word 'Avenue' is enough, and we can be sure that we are not missing entries that use an abbreviation of Avenue.

For this part, we need to:
- Regular Expression -> \b\S+\.?$ : Have a regular expression that extracts the street type, which might or might not end with the '.' character (e.g. Ave. has the '.' character). This regular expression matches the last run of non-whitespace characters (\S+) in the string, followed by zero or one '.'
- Have a list of names that we expect the street types to have (e.g. Avenue, Street, Highway, ...)
- Collect all the street types that do NOT match the ones in the expected list, and change those to one matching the expected list. For example, changing 'St.' to 'Street'.
- Parse through the tags whose keys are equal to 'addr:street' and collect their value attribute. I will use the regular expression explained earlier to read the street type from the value attribute.
For example:
      <tag k="addr:street" v="Oceana Blv." />

- Make a dictionary whose keys are the street types found by the regular expression and whose values are the sets of street names (the value attributes of the tags) with that type.
- Create a dictionary called 'mapping' and enter all the different varieties of street types found
- Use the mapping to change those street names to match the ones in the expected list
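
Before wiring the regular expression into the audit, it helps to check what it extracts from a few street names. The sketch below uses the same pattern; the helper name `street_type` is made up for the example, and the sample names are taken from this notebook.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

def street_type(name):
    """Return the trailing token of a street name, or None if nothing matches."""
    m = street_type_re.search(name)
    return m.group() if m else None

names = ['Oceana Blv.', 'Santa Cruz Ave.', 'Hodges Alley', 'Upton St #A']
found = [street_type(n) for n in names]
```

For 'Upton St #A' the pattern returns 'A', because the word boundary falls between '#' and 'A'. Unit suffixes like this therefore show up as street types in the audit and have to be judged by eye.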

In [12]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint

OSMFILE = "/Users/nazaninmirarab/Desktop/Data Science/P3/Project/Submission2/san-francisco_california_sample.osm"
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# the list of street types that we want to have
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]


The 'audit_street_type' function takes a street name, extracts its street type with the regular expression, and compares it to the expected list. If the type is not in the expected list, the function adds the street name to the street_types dictionary under that type.

In [14]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

The 'is_street_name' function takes a 'tag' element and returns True if its key is equal to 'addr:street'.

The 'audit' function uses iterative parsing to go through the XML file, pick out node and way elements, and iterate through their 'tag' children. For each street name tag, it calls the 'audit_street_type' function with the value attribute of the tag (i.e. the street name).

In [17]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")


def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    
    # parse the XML file; 'end' events guarantee that an element's children have been read
    for event, elem in ET.iterparse(osm_file, events=("end",)):
        # iterate through the 'tag' element of node and way elements
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types
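
The audit logic can be tried out on a small in-memory document before running it on the real file. The sketch below is a condensed variant, not the project code: the sample XML, the shortened expected list, and the name `audit_sample` are assumptions. It listens for 'end' events, the point at which ElementTree guarantees that an element's children are available.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard"]   # shortened list for the example

# Made-up document with two addr:street tags.
SAMPLE = b"""<osm>
  <node id="1" lat="0" lon="0"><tag k="addr:street" v="Tehama Ave" /></node>
  <way id="2"><nd ref="1" /><tag k="addr:street" v="Oceana Boulevard" /></way>
</osm>"""

def audit_sample(source):
    """Group street names by street type, keeping only unexpected types."""
    street_types = defaultdict(set)
    # "end" events fire once an element and all of its children have been read.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("node", "way"):
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == "addr:street":
                    m = street_type_re.search(tag.attrib['v'])
                    if m and m.group() not in expected:
                        street_types[m.group()].add(tag.attrib['v'])
    return dict(street_types)

unexpected = audit_sample(io.BytesIO(SAMPLE))
```

'Oceana Boulevard' is filtered out because 'Boulevard' is in the expected list, so only the abbreviated 'Ave' entry remains.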

Now I will call the 'audit' function and use pretty print to get a nice-looking output of the dictionary that has the names of the streets in it.

In [19]:
street_types = audit(OSMFILE)

pprint.pprint(dict(street_types))

{'A': set(['Upton St #A']),
 'Alameda': set(['Alameda', 'The Alameda']),
 'Alley': set(['Hodges Alley']),
 'Ave': set(['San Pablo Ave', 'Tehama Ave']),
 'Ave.': set(['Santa Cruz Ave.']),
 'Avenie': set(['Garvin Avenie']),
 'Blvd': set(['N California Blvd']),
 'Broadway': set(['Broadway']),
 'Building': set(['Ferry Building']),
 'Center': set(['Bon Air Center', 'Westlake Center']),
 'Circle': set(['Blossom Circle',
                'Croydon Circle',
                'Gloria Circle',
                'Inner Circle',
                'Wilson Circle']),
 'Gardens': set(['Wildwood Gardens']),
 'H': set(['Avenue H']),
 'Highway': set(['Bayshore Highway', 'Great Highway', 'Shoreline Highway']),
 'Hyde': set(['Hyde']),
 'Las': set(['Alameda De Las']),
 'Lugano': set(['Via Lugano']),
 'Marina': set(['Pacific Marina']),
 'Mason': set(['Fort Mason']),
 'Ness': set(['Van Ness']),
 'North': set(['Mission Bay Boulevard North']),
 'Ora': set(['Avenue Del Ora']),
 'Path': set(['Indian Rock Path', 'Mendoci

======================================================================================================================

Going through the street types found by the audit, I use them to build the 'mapping' dictionary. In this dictionary, each key is a street type as it was found in the file (left) and each value is the format it needs to be changed to (right).

The mapping does not cover every street type in the data, but it covers a good number of them. I go through the audit output and see which entries can be changed. For instance:
- 'Alameda': set(['Alameda', 'The Alameda'])

is a name that does not need any changes; therefore, I leave it out. However, an entry like:
- 'St': set(['Bell St', 'Delancey St']

can definitely change, e.g. 'Bell St -> Bell Street'. So, I add 'St' to the mapping and specify what I expect to get after updating it.

In [21]:
# Dictionary of street types that need to be changed, mapped to their standard form
mapping = { "St": "Street",
            "St.": "Street",
            "street": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "AVE": "Avenue",
            "avenue": "Avenue",
            "Rd.": "Road",
            "Rd": "Road",
            "road": "Road",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Blvd,": "Boulevard",
            "boulevard": "Boulevard",
            "broadway": "Broadway",
            "square": "Square",
            "way": "Way",
            "Dr.": "Drive",
            "Dr": "Drive",
            "ct": "Court",
            "Ct": "Court",
            "court": "Court",
            "Sq": "Square",
            "cres": "Crescent",
            "Cres": "Crescent",
            "Ctr": "Center",
            "Hwy": "Highway",
            "hwy": "Highway",
            "Ln": "Lane",
            "Ln.": "Lane",
            "parkway": "Parkway"
            }

To replace the abbreviated street types with the ones in the expected list, I wrote a function that uses the mapping to do this conversion.

I take the street name (e.g. N California Blvd) and split it at the space character. If any part matches a key in the mapping, I replace it with the format I have specified for it. When the function finds 'Blvd', it looks it up in the mapping and maps it to 'Boulevard', so the final street name comes out as 'N California Boulevard'.

In [22]:
def update_name(name, mapping):
    output = list()
    parts = name.split(" ")
    for part in parts:
        if part in mapping:
            output.append(mapping[part])
        else:
            output.append(part)
    return " ".join(output)
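
A quick check of this behaviour, using a reduced mapping. The compact `update_name` below is a rewrite for illustration that should behave like the function above; the street name 'St Francis Blvd' is hypothetical and is included to show a side effect of the approach.

```python
def update_name(name, mapping):
    """Replace every space-separated token found in mapping with its full form."""
    return " ".join(mapping.get(part, part) for part in name.split(" "))

# Reduced mapping for illustration; the full one is defined above.
small_mapping = {"St": "Street", "Blvd": "Boulevard", "Ave.": "Avenue"}

fixed = update_name("N California Blvd", small_mapping)
unit = update_name("Upton St #A", small_mapping)
saint = update_name("St Francis Blvd", small_mapping)   # hypothetical name
```

Because every token is looked up, not only the last one, 'Upton St #A' is fixed correctly, but a leading 'St' that stands for 'Saint' would also be expanded to 'Street'. Restricting the replacement to the street type position would avoid that, at the cost of missing names with a unit suffix.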

Let's print the results to see how the changes are applied. I iterate through street_types, which holds the street types collected by the 'audit' function, and call the 'update_name' function on each street name.

In [24]:
for st_type, ways in street_types.iteritems():
        for name in ways:
            better_name = update_name(name, mapping)
            print name, "=>", better_name

Upton St #A => Upton Street #A
Via Lugano => Via Lugano
Van Ness => Van Ness
Devonshire Way => Devonshire Way
Ranleigh Way => Ranleigh Way
Piedmont Way => Piedmont Way
Bel Air Way => Bel Air Way
Berkeley Way => Berkeley Way
Black Fox Way => Black Fox Way
Windsor Way => Windsor Way
Boulevard Way => Boulevard Way
Sussex Way => Sussex Way
Mitchell Way => Mitchell Way
Eastlake Way => Eastlake Way
Bayridge Way => Bayridge Way
Cheshire Way => Cheshire Way
Bristol Way => Bristol Way
Martin Luther King Jr Way => Martin Luther King Jr Way
Chelsea Way => Chelsea Way
Mcnulty Way => Mcnulty Way
Mills Way => Mills Way
Dwight Way => Dwight Way
Glenn Way => Glenn Way
Camberly Way => Camberly Way
Lincoln Way => Lincoln Way
Sterling Way => Sterling Way
Buena Vista Way => Buena Vista Way
Shepard Way => Shepard Way
Red Oak Way => Red Oak Way
Park Way => Park Way
Altamont Way => Altamont Way
California Way => California Way
Marina Way => Marina Way
Bill Drake Way => Bill Drake Way
Lancaster Way => Lancast

## Auditing Postcodes

Postcodes are another type of data that is entered into the map inconsistently. The inconsistency is either in how they are represented (with or without the state abbreviation) or in how long they are.

The idea behind auditing postcodes is the same as for auditing street names.

I first need to check how the postcodes appear in the map.

The 'dicti' function updates a dictionary that holds the postcodes. The dictionary key is the postcode itself and the dictionary value is the number of times that postcode appears in the map.

In [26]:
OSMFILE = '/Users/nazaninmirarab/Desktop/Data Science/P3/Project/Submission2/san-francisco_california_sample.osm'

def dicti(data, item):
    data[item] += 1

The 'get_postcode' function takes a 'tag' element as input and returns True if its key is equal to 'addr:postcode'.

The 'audit' function, like the one for street names, parses the XML file and iterates through node and way elements. It extracts the value attribute (i.e. the postcode) and passes it to the 'dicti' function, which counts it in the dictionary.

In [28]:
def get_postcode(elem):
    return (elem.attrib['k'] == "addr:postcode")

def audit(osmfile):
    osm_file = open(osmfile, "r")
    data = defaultdict(int)
    # parse the XML file; 'end' events guarantee that an element's children have been read
    for event, elem in ET.iterparse(osm_file, events=("end",)):

        # iterating through node and way elements.
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if get_postcode(tag):
                    dicti(data, tag.attrib['v'])
    osm_file.close()
    return data

Now I will call the 'audit' function and print the output, which should be a dictionary mapping each postcode to its count.

In [36]:
postcodes = audit(OSMFILE)

pprint.pprint(dict(postcodes))

{'90214': 1,
 '94002': 2,
 '94005': 1,
 '94010': 3,
 '94014': 1,
 '94015': 2,
 '94019': 1,
 '94025': 1,
 '94027': 1,
 '94044': 4,
 '94061': 9,
 '94063': 17,
 '94065': 4,
 '94066': 1,
 '94070': 1,
 '94080': 2,
 '94102': 8,
 '94103': 39,
 '94104': 4,
 '94105': 2,
 '94107': 13,
 '94108': 8,
 '94109': 20,
 '94110': 8,
 '94111': 2,
 '94112': 3,
 '94113': 4,
 '94114': 20,
 '94115': 3,
 '94116': 102,
 '94117': 60,
 '94118': 4,
 '94121': 12,
 '94121-3131': 1,
 '94122': 254,
 '94123': 10,
 '94124': 2,
 '94127': 32,
 '94128': 1,
 '94129': 1,
 '94131': 9,
 '94132': 3,
 '94133': 46,
 '94134': 1,
 '94143': 1,
 '94158': 2,
 '94166': 1,
 '94303': 1,
 '94402': 1,
 '94403': 3,
 '94404': 3,
 '94501': 5,
 '94507': 1,
 '94530': 5,
 '94536': 1,
 '94544': 2,
 '94545': 1,
 '94546': 1,
 '94549': 1,
 '94552': 1,
 '94556': 1,
 '94577': 5,
 '94578': 5,
 '94587': 16,
 '94596': 1,
 '94598': 3,
 '94601': 1,
 '94602': 4,
 '94606': 6,
 '94607': 5,
 '94608': 4,
 '94609': 2,
 '94610': 69,
 '94611': 144,
 '94612': 6,
 '

======================================================================================================================
### Different Postcodes and Ways to Clean Them up
The output shows that the postcodes are in these formats:
- A 5-digit format (e.g. 12345)
- A 5-digit format followed by more numbers after a hyphen (e.g. 12345-6789)

After running the code on the original file, I also found postcodes in these formats:
- A format in which the state abbreviation is mentioned in the beginning (e.g. CA 12345)
- A format shorter or longer than 5 digits (e.g. 515 or 134567)
- A format that only has the 'CA' characters, with no digits

To deal with the postcodes, I divide them into three categories:
- The first category includes the ones:
    - Where the length equals 5 (e.g. 12345)
    - Where the length is longer than 5 because they contain letters, such as a state abbreviation (e.g. CA 12345)
    
- The second category includes the ones:
    - Where the length is longer than 5 because the 5 digits are followed by a hyphen and more digits (e.g. 12345-6789)
    
- The third category includes the ones:
    - Where the length is longer than 5, with no hyphen (e.g. 123456)
    - Where the length is shorter than 5 (e.g. 1234, 515)
    - Where the postcode equals 'CA'
    
For the first category, I use a regular expression to extract the 5 digits from the pattern. This regex asserts the position at the start of the string ( ^ ), skips any characters that are NOT digits ( \D* ), and then captures exactly 5 digits ( \d{5} ) that must end the string ( $ ). In case the postcode starts with letters (e.g. CA 12345), the letters are skipped and the single capturing group returns '12345'.

    ^\D*(\d{5})$

For the second category, I use another regular expression to extract the first 5 digits. This regex matches exactly 5 digits, followed by a '-', followed by exactly 4 digits.

    ^(\d{5})-\d{4}$
    
For the third category, postcodes that are shorter or longer than 5 digits are not valid. To clean them up, I replace those postcodes with '00000'. I use a regular expression to find the ones that are exactly 6 digits long; the shorter ones and 'CA' are caught with plain comparisons.

    ^\d{6}$ is to find postcodes that are exactly 6 digits long
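
The three categories can be checked on a few sample postcodes before writing the cleaning function. The sketch below uses patterns equivalent to those in the 'update_postcode' function; the helper name `categorize` and the sample values are made up for the example.

```python
import re

first_category = re.compile(r'^\D*(\d{5})$')      # 12345 or CA 12345
second_category = re.compile(r'^(\d{5})-\d{4}$')  # 12345-6789
third_category = re.compile(r'^\d{6}$')           # 123456

def categorize(code):
    """Return 1, 2 or 3 for the matching category, or None if no rule applies."""
    if first_category.search(code):
        return 1
    if second_category.search(code):
        return 2
    if third_category.search(code) or code == 'CA' or len(code) < 5:
        return 3
    return None

samples = ['94122', 'CA 94122', '94121-3131', '941222', '515', 'CA', '9412-ABCDE']
categories = [categorize(s) for s in samples]
```

The last sample matches none of the rules, which shows that values outside the three categories need to be looked at separately.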
    

Now I write the 'update_postcode' function. I use different conditions in the function to match the postcodes in the 3 categories I explained. 

In [79]:
# Patterns for the postcode categories, compiled once
first_category = re.compile(r'^\D*(\d{5})$')
second_category = re.compile(r'^(\d{5})-\d{4}$')
third_category = re.compile(r'^\d{6}$')

def update_postcode(digit):
    # First category: exactly 5 digits, optionally preceded by non-digits (e.g. 'CA 94122')
    m = first_category.search(digit)
    if m:
        return m.group(1)

    # Second category: 5 digits, a hyphen and 4 digits; keep the first 5 digits
    m = second_category.search(digit)
    if m:
        return m.group(1)

    # Third category: 6 digits, 'CA' on its own, or shorter than 5 characters
    if third_category.search(digit) or digit == 'CA' or len(digit) < 5:
        return '00000'

    # Anything else matches no category; return an empty string, as before
    return ''


I will print the output after the changes are applied to the postcodes.

In [80]:
for postcode, nums in postcodes.iteritems():
    better_code = update_postcode(postcode)
    print postcode, "=>", better_code

94121-3131 => 94121
94404 => 94404
94402 => 94402
94403 => 94403
94118 => 94118
94611 => 94611
94610 => 94610
94612 => 94612
94112 => 94112
94113 => 94113
94110 => 94110
94111 => 94111
94116 => 94116
94618 => 94618
94114 => 94114
94115 => 94115
90214 => 90214
94577 => 94577
94578 => 94578
94124 => 94124
94123 => 94123
94122 => 94122
94121 => 94121
94129 => 94129
94128 => 94128
94606 => 94606
94607 => 94607
94602 => 94602
94601 => 94601
94044 => 94044
94804 => 94804
94608 => 94608
94609 => 94609
94965 => 94965
94131 => 94131
94132 => 94132
94133 => 94133
94134 => 94134
94507 => 94507
94501 => 94501
94587 => 94587
94303 => 94303
95498 => 95498
94904 => 94904
94143 => 94143
94027 => 94027
94025 => 94025
94596 => 94596
94598 => 94598
94619 => 94619
94117 => 94117
94015 => 94015
94014 => 94014
94010 => 94010
94019 => 94019
94158 => 94158
94109 => 94109
94108 => 94108
94080 => 94080
94536 => 94536
94530 => 94530
94005 => 94005
94166 => 94166
94104 => 94104
94549 => 94549
94545 => 94545
94544

## Preparing the Data for the Database

To load the data to the SQLite database, I need to transfer it from the XML file to CSV files. I create multiple CSV files, and later create the corresponding tables in my database based on them.

The CSV files I want to have are:
- Node
- Node_tags
- Way
- Way_tags
- Way_nodes

Each of these CSV files contains different columns and stores data based on those columns. The columns used in the CSV files will be the table columns in the database. This is the schema:
- NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
- NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
- WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
- WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
- WAY_NODES_FIELDS = ['id', 'node_id', 'position']
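
As a minimal sketch of how one of these schemas turns into a CSV file, the example below writes a single node row with `csv.DictWriter`. It is written for Python 3, uses an in-memory buffer instead of a real file, and the file name mentioned in the comment is hypothetical. The attribute values come from the sample node shown earlier in this notebook.

```python
import csv
import io

NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']

# One row shaped like the "node" dictionary.
row = {'id': '358830340', 'lat': '37.6504905', 'lon': '-122.4896963',
       'user': 'encleadus', 'uid': '35667', 'version': '4',
       'changeset': '30175357', 'timestamp': '2015-04-12T22:43:37Z'}

# An in-memory buffer stands in for a file such as nodes.csv (hypothetical name).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=NODE_FIELDS, lineterminator='\n')
writer.writeheader()        # column names become the first line
writer.writerow(row)        # values are written in NODE_FIELDS order
csv_text = buffer.getvalue()
```

Passing the field list as `fieldnames` keeps the column order fixed, so the CSV columns line up with the table columns in the database.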

To create these files, I will parse the 'node' and 'way' elements and extract the tags inside them.

The 'shape_element' function takes as input an iterparse Element object and returns a dictionary. Depending on whether the element is 'node' or 'way', the dictionary looks different. 

Here's an example from the 'node' element in the XML file:

    <node changeset="27772228" id="358830414" lat="37.6668652" lon="-122.4895243" timestamp="2014-12-29T09:43:14Z" uid="14293" user="KindredCoda" version="2">
		<tag k="ele" v="118" />
		<tag k="name" v="Longview Park" />
		<tag k="leisure" v="park" />
		<tag k="gnis:created" v="04/06/1998" />
		<tag k="gnis:state_id" v="06" />
		<tag k="gnis:county_id" v="081" />
		<tag k="gnis:feature_id" v="1785701" />
	</node>

### For Node:

The returned dictionary has the format {"node": .., "node_tags": ...}

The "node" field holds a dictionary of the following top level node attributes:

id, user, uid, version, lat, lon, timestamp, changeset (All other attributes are ignored)

The "node_tags" field holds a list of dictionaries, one per secondary tag. Secondary tags are child tags of node which have the tag name/type: "tag". Each dictionary has the following fields from the secondary tag attributes:

- id: the top level node id attribute value 
    - This will be node['id'] = '358830414' in the above example
- key: the full tag "k" attribute value if no colon is present or the characters after the colon if one is.
    - k="name" is one example 
- value: the tag "v" attribute value
    - v="Longview Park" is the value for the key k="name"
- type: either the characters before the colon in the tag "k" value or "regular" if a colon is not present.
    - For k="name", the type would be 'regular'
- if the tag "k" value contains problematic characters, the tag should be ignored
- if the tag "k" value contains a ":" the characters before the ":" should be set as the tag type and characters after the ":" should be set as the tag key
    - In < tag k="gnis:county_id" > since a ':' is present, the tag['type'] = 'gnis' and tag['key'] = 'county_id'
- if there are additional ":" in the "k" value, they should be ignored and kept as part of the tag key. For example, < tag k="addr:street:name" v="Lincoln" > should be turned into:
    - {'id': 12345, 'key': 'street:name', 'value': 'Lincoln', 'type': 'addr'}
- If a node has no secondary tags then the "node_tags" field should just contain an empty list.
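
The key and type rules above can be sketched with a split at the first colon. This is an illustration of the rules only: it uses `str.split` rather than the regular expressions used in the project code, and the helper names `split_key` and `shape_tag` are made up.

```python
def split_key(k):
    """Split a tag key into (type, key) at the first colon."""
    if ':' in k:
        tag_type, key = k.split(':', 1)   # maxsplit=1 keeps later colons in the key
        return tag_type, key
    return 'regular', k

def shape_tag(element_id, k, v):
    """Build one secondary-tag dictionary (hypothetical helper)."""
    tag_type, key = split_key(k)
    return {'id': element_id, 'key': key, 'value': v, 'type': tag_type}

plain = shape_tag('358830414', 'name', 'Longview Park')
single = shape_tag('358830414', 'gnis:county_id', '081')
double = shape_tag(12345, 'addr:street:name', 'Lincoln')
```

The third call reproduces the 'addr:street:name' example from the list: the type is 'addr' and the remaining 'street:name' stays together as the key.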

### For Way:

The dictionary has the format {"way": ..., "way_tags": ..., "way_nodes": ...}

The "way" field should hold a dictionary of the following top level way attributes:

id, user, uid, version, timestamp, changeset (All other attributes are ignored)

The "way_tags" field again holds a list of dictionaries, following the exact same rules as for "node_tags".
Additionally, the dictionary has a field "way_nodes". "way_nodes" holds a list of dictionaries, one for each nd child tag. Each dictionary has the fields:

- id: the top level element (way) id
- node_id: the ref attribute value of the nd tag
- position: the index starting at 0 of the nd tag i.e. what order the nd tag appears within the way element
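
The "way_nodes" list can be built with `enumerate`, which supplies the position of each nd child. The sketch below uses a made-up way with three node references.

```python
import xml.etree.ElementTree as ET

# Made-up way with three node references.
way = ET.fromstring('<way id="10"><nd ref="101" /><nd ref="102" /><nd ref="103" /></way>')

# One dictionary per nd child; enumerate() yields the 0-based position.
way_nodes = [
    {'id': way.attrib['id'], 'node_id': nd.attrib['ref'], 'position': position}
    for position, nd in enumerate(way.iter('nd'))
]
```

Keeping the position matters because a way is an ordered list of nodes, and the order is lost once the rows sit in a database table.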

======================================================================================================================
To write the data into CSV files, I defined the 'shape_element' function where I will also use my 'update_name' and 'update_postcode' functions to clean the street names and postcodes before they are inserted into the CSV files.

In my shaping_csv.py file where the shape_element function is located, I import the audit.py file which contains the update_name and update_postcode functions. How I import the audit.py script is:
- Have audit.py and shaping_csv.py files in the same directory
- In shaping_csv.py script, use the following command:
    - from audit import *   
- The above command imports all the functions from the audit.py script into the shaping_csv.py script

I call the update_name and update_postcode functions twice in the code: once for the node element and once for the way element. They are called while iterating through the 'tag' children of a node or way, at the point where attrib['v'] is read.

For tag['key'] and tag['type'], I used regular expressions to split the "k" attribute. If there is a colon ':' in the "k" value, I apply the following patterns:

- Characters before colon -> ^[a-zA-Z]*:
    - This pattern starts from the beginning of the string and matches letters (a-z and A-Z) up to and including the first colon.
    - This pattern is used to define the ['type'] 

- Characters after colon -> :[a-zA-Z_]+
    - This pattern starts at a colon ':' and matches the letters and underscores that follow it.
    - Because the underscore '_' is included, in k="gnis:county_id" the pattern sets the ['key'] = county_id, not county.
    - This pattern is used to define the ['key']
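
The two patterns can be tried on sample keys with `re.findall`. The helper `split_with_patterns` below is a guess at how the patterns combine, written for illustration; it is not copied from the project code.

```python
import re

def split_with_patterns(k):
    """Apply the two patterns above and return (type, key)."""
    before = re.findall(r'^[a-zA-Z]*:', k)
    after = re.findall(r':[a-zA-Z_]+', k)
    tag_type = before[0][:-1] if before else 'regular'   # drop the trailing colon
    key = after[0][1:] if after else 'regular'           # drop the leading colon
    return tag_type, key

simple = split_with_patterns('gnis:county_id')
nested = split_with_patterns('addr:street:name')
```

For a key with two colons, taking the first match of the second pattern yields 'street' rather than 'street:name', so keys of that shape need extra care if the part after the first colon is to be kept whole.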

One issue I noticed while validating my CSV files against the expected schema is that some 'uid' values are missing from the data. According to OpenStreetMap best practice, user information (the uid and user fields) should be recorded when data is submitted, but in some cases it is not, which causes validation to throw errors.

To fix this issue, I added a try and except statement and assigned a placeholder value to the missing attribute.
        
        try:       
            way_attribs[item] = element.attrib[item]  
        except KeyError:
            way_attribs[item] = "9999999"  
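
An alternative sketch of the same fallback uses `dict.get` with a default value instead of try and except. The function name `way_attributes` and the sample attribute dictionary are made up; the placeholder value is the one used above.

```python
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
MISSING = "9999999"   # same placeholder value as in the snippet above

def way_attributes(attrib):
    """Copy the expected fields, substituting the placeholder for absent ones."""
    return {field: attrib.get(field, MISSING) for field in WAY_FIELDS}

# Made-up attribute dictionary with no 'uid' or 'user'.
attrs = way_attributes({'id': '10', 'version': '1', 'changeset': '5',
                        'timestamp': '2015-04-12T22:43:37Z'})
```

Both approaches fill the gap the same way. Whichever is used, the placeholder should be a value that cannot collide with a real uid.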
            
The same problem of empty fields applies to the 'k' attribute as well, meaning users did not add information for all the attributes in node or way. To overcome this, I used a conditional statement that sets the ['type'] and ['key'] to 'regular' in case the field is empty. Otherwise, it uses the regular expressions to find the corresponding patterns.

In [85]:
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def shape_element(element):

    node_attribs = {} # Handle the attributes in node element
    way_attribs = {} # Handle the attributes in way element
    way_nodes = [] # Handle the 'nd' tag in the way element
    tags = []  # Handle secondary tags the same way for both node and way elements
    
    # Handling node elements
    if element.tag == 'node':
        for item in NODE_FIELDS:
            try:
                node_attribs[item] = element.attrib[item]
            except KeyError:  # the attribute is missing from this element
                node_attribs[item] = "9999999"
        
        # Iterating through the 'tag' tags in the node element
        for tg in element.iter('tag'):
            if not PROBLEMCHARS.search(tg.attrib['k']):
                tag_dict_node = {}
                tag_dict_node['id'] = element.attrib['id']

                # Calling the update_name function to clean up problematic street names based on audit.py file
                if is_street_name(tg):
                    better_name = update_name(tg.attrib['v'], mapping)
                    tag_dict_node['value'] = better_name

                # Calling the update_postcode function to clean up problematic postcodes based on audit.py file
                elif get_postcode(tg):
                    better_postcode = update_postcode(tg.attrib['v'])
                    tag_dict_node['value'] = better_postcode
                
                # For other values that are not street names or postcodes
                else:
                    tag_dict_node['value'] = tg.attrib['v']

                if ':' not in tg.attrib['k']:
                    tag_dict_node['key'] = tg.attrib['k']
                    tag_dict_node['type'] = 'regular'
                else: 
                    character_before_colon = re.findall('^[a-zA-Z]*:', tg.attrib['k'])
                    character_after_colon = re.findall(':[a-zA-Z_]+' , tg.attrib['k'])
                    if len(character_after_colon) != 0:
                        tag_dict_node['key'] = character_after_colon[0][1:]
                    else:
                        tag_dict_node['key'] = 'regular'

                    if len(character_before_colon) != 0:
                        tag_dict_node['type'] = character_before_colon[0][: -1]
                    else:
                        tag_dict_node['type'] = 'regular'
                tags.append(tag_dict_node)
            
        return {'node': node_attribs, 'node_tags': tags}
        
    # Handling way elements
    elif element.tag == 'way':
        for item in WAY_FIELDS:
            try:
                way_attribs[item] = element.attrib[item]
            except KeyError:  # the attribute is missing from this element
                way_attribs[item] = "9999999"
        
        # Iterating through 'tag' tags in way element
        for tg in element.iter('tag'):
            if not PROBLEMCHARS.search(tg.attrib['k']):
                tag_dict_way = {}
                tag_dict_way['id'] = element.attrib['id']

                # Calling the update_name function to clean up problematic street names based on audit.py file
                if is_street_name(tg):
                    better_name_way = update_name(tg.attrib['v'], mapping)
                    tag_dict_way['value'] = better_name_way

                # Calling the update_postcode function to clean up problematic postcodes based on audit.py file
                elif get_postcode(tg):
                    better_postcode_way = update_postcode(tg.attrib['v'])
                    tag_dict_way['value'] = better_postcode_way

                # For other values that are not street names or postcodes
                else:
                    tag_dict_way['value'] = tg.attrib['v']

                if ':' not in tg.attrib['k']:
                    tag_dict_way['key'] = tg.attrib['k']
                    tag_dict_way['type'] = 'regular'
                else:
                    character_before_colon = re.findall('^[a-zA-Z]*:', tg.attrib['k'])
                    character_after_colon = re.findall(':[a-zA-Z_]+', tg.attrib['k'])
                
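                    if len(character_after_colon) == 1:
                        tag_dict_way['key'] = character_after_colon[0][1:]
                    elif len(character_after_colon) > 1:
                        tag_dict_way['key'] = character_after_colon[0][1:] + character_after_colon[1]
                    else:
                        # Nothing matched after the colon: fall back to 'regular', as in the node branch
                        tag_dict_way['key'] = 'regular'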
                
                    if len(character_before_colon) != 0:
                        tag_dict_way['type'] = character_before_colon[0][: -1]
                    else:
                        tag_dict_way['type'] = 'regular'
                
                tags.append(tag_dict_way)
        
        # Iterating through 'nd' tags in way element
        count = 0
        for tg in element.iter('nd'):
            tag_dict_nd = {}
            tag_dict_nd['id'] = element.attrib['id']
            tag_dict_nd['node_id'] = tg.attrib['ref']
            tag_dict_nd['position'] = count
            count += 1
            
            way_nodes.append(tag_dict_nd)
        
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}


With the shape_element function in place, I can now parse and shape the data, and write it to CSV files.

The main block calls process_map, which applies the audit functions to update street names and postcodes while each element is shaped. The Python script shaping_csv.py takes care of creating the CSV files.

In [None]:
OSM_PATH = '/Users/nazaninmirarab/Desktop/Data Science/P3/Project/Submission2/san-francisco_california_sample.osm'

if __name__ == '__main__':

    process_map(OSM_PATH)
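
The body of process_map is not shown in this document. As a rough idea of what such a function does, here is a minimal Python 3 sketch that streams OSM XML and writes one CSV row per node and per node tag. Everything in it is an assumption for illustration: shape_node is a simplified stand-in for shape_element, the field lists are shortened, and in-memory buffers replace the real files.

```python
import csv
import io
import xml.etree.ElementTree as ET

NODE_FIELDS = ['id', 'lat', 'lon']                 # shortened, hypothetical field list
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']

def shape_node(element):
    """Simplified stand-in for shape_element: node elements only, no cleaning."""
    node = {f: element.attrib.get(f, '9999999') for f in NODE_FIELDS}
    tags = [{'id': element.attrib['id'], 'key': t.attrib['k'],
             'value': t.attrib['v'], 'type': 'regular'}
            for t in element.iter('tag')]
    return {'node': node, 'node_tags': tags}

def process_nodes(osm_file, nodes_out, tags_out):
    """Stream the XML and write shaped nodes and node tags as CSV rows."""
    nodes_writer = csv.DictWriter(nodes_out, NODE_FIELDS)
    tags_writer = csv.DictWriter(tags_out, NODE_TAGS_FIELDS)
    nodes_writer.writeheader()
    tags_writer.writeheader()
    for _, element in ET.iterparse(osm_file, events=('end',)):
        if element.tag == 'node':
            shaped = shape_node(element)
            nodes_writer.writerow(shaped['node'])
            tags_writer.writerows(shaped['node_tags'])
            element.clear()  # drop the element's children and attributes once written

sample = io.BytesIO(b'''<osm>
  <node id="1" lat="37.77" lon="-122.42">
    <tag k="amenity" v="cafe" />
  </node>
  <node id="2" lat="37.78" lon="-122.41" />
</osm>''')
nodes_csv, tags_csv = io.StringIO(), io.StringIO()
process_nodes(sample, nodes_csv, tags_csv)
print(nodes_csv.getvalue())
print(tags_csv.getvalue())
```

Parsing with iterparse instead of loading the whole tree matters for a file of this size, since only one element at a time needs to be held while it is shaped and written.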

After the CSV files are created (by running the shaping_csv.py file), it is time to create the database and insert the information from those CSV files to their corresponding tables.

I created a database called 'openstreetmap_sf_db', created tables whose columns match the columns of the CSV files, and inserted the data from each CSV file into its corresponding table. The 'creating_database.py' file takes care of creating the tables and inserting the data.

After the tables are created, I can start investigating them and running queries on them.

The code below shows how I created the tables in the database. This is example code in which tables_name, column_name and filename.csv are replaced according to the table I want to create.

First I connect to the SQLite file and drop the table I want to create if it already exists.
Using 'cur.execute' I send commands to the database from Python. After creating the table, I insert the data from the CSV file into it. I repeated this process for every table I wanted to create in the database. 

In [None]:
import sqlite3
import csv
from pprint import pprint

sqlite_file = 'db.sqlite'

conn = sqlite3.connect(sqlite_file)
cur = conn.cursor()

#dropping the table if it already exists so that it can be created from scratch
cur.execute('DROP TABLE IF EXISTS tables_name')
conn.commit()

#creating the table
cur.execute('''
    CREATE TABLE tables_name(column_name type, column_name type, column_name type, column_name type)
''')
conn.commit()

with open('filename.csv', 'rb') as f:
    dr = csv.DictReader(f)
    in_db = [(i['column_name'], i['column_name'], i['column_name'], i['column_name']) for i in dr]
    
#insert the data
cur.executemany('INSERT INTO tables_name(column_name, column_name, column_name, column_name) VALUES(?, ?, ?, ?);', in_db)
conn.commit()
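
To make the template above concrete, here is a minimal Python 3 sketch for a single table. The column names of nodes_tags follow the dictionaries built in shape_element (id, key, value, type); the column types, the sample rows, and the use of an in-memory database and CSV buffer are assumptions made so the example runs without any files.

```python
import csv
import io
import sqlite3

# In-memory stand-ins for 'nodes_tags.csv' and the SQLite file
csv_text = io.StringIO(
    'id,key,value,type\n'
    '1,amenity,cafe,regular\n'
    '1,street,Market Street,addr\n'
)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Drop and recreate the table so the script can be rerun safely
cur.execute('DROP TABLE IF EXISTS nodes_tags')
cur.execute('''
    CREATE TABLE nodes_tags(id INTEGER, key TEXT, value TEXT, type TEXT)
''')

# Read the CSV rows as dictionaries and convert them to tuples in column order
rows = [(r['id'], r['key'], r['value'], r['type']) for r in csv.DictReader(csv_text)]
cur.executemany('INSERT INTO nodes_tags(id, key, value, type) VALUES (?, ?, ?, ?);', rows)
conn.commit()

count = cur.execute('SELECT COUNT(*) FROM nodes_tags').fetchone()[0]
print(count)
```

The '?' placeholders let sqlite3 handle quoting, which is safer than building the INSERT statement by string formatting when values contain quotes or commas.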


## Data Overview

I want to get some information regarding the CSV files and the database I created.

By importing 'hurry.filesize' I can translate the file sizes from bytes to KB or MB. To install the library, run 'pip install hurry.filesize' on your machine. I got the idea of using this method from the post below:  
https://discussions.udacity.com/t/display-files-and-their-sizes-in-directory/186741

In [150]:
from pprint import pprint
import os
from hurry.filesize import size 

dirpath = '/Users/nazaninmirarab/Desktop/Data Science/P3/Project/Submission2/Sizes'

files_list = []
for path, dirs, files in os.walk(dirpath):
    files_list.extend([(filename, size(os.path.getsize(os.path.join(path, filename)))) for filename in files])

for filename, file_size in files_list:
    print '{:.<40s}: {:5s}'.format(filename, file_size)

nodes.csv...............................: 378M 
nodes_tags.csv..........................: 8M   
openstreetmap_sf_db.sqlite..............: 514M 
san-francisco_california.osm............: 966M 
san-francisco_california_sample.osm.....: 48M  
ways.csv................................: 31M  
ways_nodes.csv..........................: 128M 
ways_tags.csv...........................: 48M  
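
If installing hurry.filesize is not an option, a similar short label can be produced without it. This is a minimal sketch, assuming binary (1024-based) units and rounding down; the helper human_size is hypothetical and is not guaranteed to match the output of hurry.filesize in every case.

```python
def human_size(num_bytes):
    """Format a byte count with a binary suffix, rounding down (for example '48M')."""
    for factor, suffix in [(1024 ** 3, 'G'), (1024 ** 2, 'M'), (1024, 'K')]:
        if num_bytes >= factor:
            return '{}{}'.format(num_bytes // factor, suffix)
    return '{}B'.format(num_bytes)

for n in [512, 1500, 48 * 1024 * 1024]:
    print(n, '->', human_size(n))
```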


Now that I have audited and cleaned the data and transferred everything into tables in my database, I can start running queries on it. The queries answer questions such as:   
- Number of nodes
- Number of ways
- Number of unique users
- Most contributing users
- Number of users who contributed only once
- Top amenities in San Francisco
- Cuisines in San Francisco
- Shops in San Francisco
- Users who added amenities 

This is basically when the fun with the data starts. You can go through it and extract as much information as you like. You just need to come up with a question, and write your query to find its answer!

First, I make sure I am connected to the database:

In [143]:
import sqlite3

sqlite_file = '/Users/nazaninmirarab/Desktop/Data Science/P3/Project/Submission2/openstreetmap_sf_db.sqlite'
con = sqlite3.connect(sqlite_file)
cur = con.cursor()

### Number of nodes

In [144]:
def number_of_nodes():
    output = cur.execute('SELECT COUNT(*) FROM nodes')
    return output.fetchone()[0]

print 'Number of nodes: \n' , number_of_nodes()

Number of nodes: 
4714877


### Number of ways

In [145]:
def number_of_ways():
    output = cur.execute('SELECT COUNT(*) FROM ways')
    return output.fetchone()[0]

print 'Number of ways: \n' , number_of_ways()

Number of ways: 
551145


### Number of unique users

In [131]:
def number_of_unique_users():
    output = cur.execute('SELECT COUNT(DISTINCT e.uid) FROM \
                         (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e')
    return output.fetchone()[0]

print 'Number of unique users: \n' , number_of_unique_users()

Number of unique users: 
2579


### Most contributing users

In [132]:
def most_contributing_users():
    
    output = cur.execute('SELECT e.user, COUNT(*) as num FROM \
                         (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \
                         GROUP BY e.user \
                         ORDER BY num DESC \
                         LIMIT 10 ')
    # fetchall() consumes the cursor, so it is called once and the rows are returned
    return output.fetchall()

print 'Most contributing users: \n'
pprint(most_contributing_users())

Most contributing users: 

[(u'ediyes', 918915),
 (u'Luis36995', 710456),
 (u'Rub21', 395077),
 (u'RichRico', 224724),
 (u'calfarome', 185498),
 (u'oldtopos', 167538),
 (u'KindredCoda', 151208),
 (u'karitotp', 135330),
 (u'samely', 125861),
 (u'abel801', 108313)]


### Number of users who contributed once

In [133]:
def number_of_users_contributed_once():
    
    output = cur.execute('SELECT COUNT(*) FROM \
                             (SELECT e.user, COUNT(*) as num FROM \
                                 (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \
                                  GROUP BY e.user \
                                  HAVING num = 1) u')
    
    return output.fetchone()[0]
                         
print 'Number of users who have contributed once: \n', number_of_users_contributed_once()

Number of users who have contributed once: 
634
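
The three user queries above share one pattern: nodes and ways are stacked with UNION ALL, and the combined rows are then counted or grouped. The following minimal Python 3 sketch checks that pattern on a tiny in-memory database; the table layout is reduced to three columns and the sample rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE nodes(id INTEGER, user TEXT, uid INTEGER)')
cur.execute('CREATE TABLE ways(id INTEGER, user TEXT, uid INTEGER)')
cur.executemany('INSERT INTO nodes VALUES (?, ?, ?)',
                [(1, 'alice', 10), (2, 'alice', 10), (3, 'bob', 20)])
cur.executemany('INSERT INTO ways VALUES (?, ?, ?)',
                [(100, 'alice', 10), (101, 'carol', 30)])

# UNION ALL keeps duplicates, so every node and way counts as one contribution
COMBINED = '(SELECT user, uid FROM nodes UNION ALL SELECT user, uid FROM ways) e'

unique_users = cur.execute(
    'SELECT COUNT(DISTINCT e.uid) FROM ' + COMBINED).fetchone()[0]
top_users = cur.execute(
    'SELECT e.user, COUNT(*) AS num FROM ' + COMBINED +
    ' GROUP BY e.user ORDER BY num DESC, e.user LIMIT 10').fetchall()
one_timers = cur.execute(
    'SELECT COUNT(*) FROM (SELECT e.user, COUNT(*) AS num FROM ' + COMBINED +
    ' GROUP BY e.user HAVING num = 1) u').fetchone()[0]

print(unique_users)
print(top_users)
print(one_timers)
```

One caveat that follows from the shaping step: elements whose 'uid' was missing all received the placeholder "9999999", so they may be counted together as a single user in the COUNT(DISTINCT uid) query.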


### Top 20 amenities in San Francisco

In [134]:
def top_ten_amenities_in_sf():
    # Despite the function name, the query returns the 20 most frequent amenities
    output = cur.execute('SELECT value, COUNT(*) as num FROM nodes_tags\
                            WHERE key="amenity" \
                            GROUP BY value \
                            ORDER BY num DESC \
                            LIMIT 20' )
    # fetchall() consumes the cursor, so it is called once and the rows are returned
    return output.fetchall()

print 'Top 20 amenities: \n'
pprint(top_ten_amenities_in_sf())

Top 20 amenities: 

[(u'restaurant', 2816),
 (u'bench', 1137),
 (u'cafe', 943),
 (u'place_of_worship', 719),
 (u'post_box', 680),
 (u'school', 604),
 (u'fast_food', 562),
 (u'bicycle_parking', 556),
 (u'drinking_water', 492),
 (u'toilets', 394),
 (u'bank', 364),
 (u'bar', 314),
 (u'parking', 272),
 (u'fuel', 270),
 (u'car_sharing', 225),
 (u'waste_basket', 203),
 (u'pub', 200),
 (u'atm', 189),
 (u'post_office', 156),
 (u'pharmacy', 143)]


### Top 10 cuisines in San Francisco

In [135]:
def cuisines_in_sf():
    output = cur.execute ('SELECT value, COUNT(*) as num FROM ways_tags \
                           WHERE key="cuisine" \
                           GROUP BY value \
                           ORDER BY num DESC \
                           LIMIT 10')
    # fetchall() consumes the cursor, so it is called once and the rows are returned
    return output.fetchall()

print 'Top 10 cuisines: \n'
pprint(cuisines_in_sf())

Top 10 cuisines: 

[(u'burger', 71),
 (u'mexican', 47),
 (u'pizza', 29),
 (u'chinese', 26),
 (u'american', 20),
 (u'coffee_shop', 19),
 (u'italian', 15),
 (u'japanese', 15),
 (u'sushi', 12),
 (u'seafood', 10)]


### Different types of shops

In [136]:
def shops_in_sf():
    output = cur.execute('SELECT value, COUNT(*) as num FROM nodes_tags\
                            WHERE key="shop" \
                            GROUP BY value \
                            ORDER BY num DESC' )
    # fetchall() consumes the cursor, so it is called once and the rows are returned
    return output.fetchall()

print 'Different types of shops: \n'
pprint(shops_in_sf())

### Users who added amenities to the map

Since the full list is long, I limited the output to 10 rows. Because the query groups by amenity value, each row shows one amenity value together with one of the users who added it, and the users appear in no particular order. If 'LIMIT 10' is removed, you can view the whole list.

In [137]:
def users_who_added_amenity():
    output = cur.execute('SELECT DISTINCT(nodes.user), nodes_tags.value FROM \
                            nodes join nodes_tags \
                            on nodes.id=nodes_tags.id \
                            WHERE key="amenity" \
                            GROUP BY value \
                            LIMIT 10' ) # Remove this part to view the whole list of users
    # fetchall() consumes the cursor, so it is called once and the rows are returned
    return output.fetchall()

print 'Users who added amenity to the map: \n'
pprint(users_who_added_amenity())

Users who added amenity to the map: 

[(u'claysmalley', u'Corner Market'),
 (u'dchiles', u'Note 281478'),
 (u'lxbarth', u'Pet grooming shop'),
 (u'oldtopos', u'addr:housenumber'),
 (u'JessAk71', u'amusements'),
 (u'Mark Mavromatis', u'animal_shelter'),
 (u'Jothirnadh', u'arts_centre'),
 (u'manings', u'atm'),
 (u'JessAk71', u'bakery'),
 (u'poornibadrinath', u'bank')]


### List of postcodes

Since I did some cleaning on the varieties of postcode format, I want to query them and see what they look like now. While auditing and cleaning the postcodes, I decided that values that were 6 digits long, shorter than 5 digits, or equal to only 'CA' were not valid, so I set them to zero ('00000'). 

Running the query below on the database after applying the changes shows that there are 10 postcodes set to zero:   

        (u'00000', 10)
        
For the purpose of this documentation, I have only printed the top 5. If you remove the 'LIMIT 5' part from the query, you will be able to see the complete list of postcodes.

We also see that the most frequent postcode is 94122, with 5109 occurrences.

In [213]:
def list_of_postcodes():
    output = cur.execute('SELECT e.value, COUNT(*) as num FROM \
                            (SELECT value FROM nodes_tags WHERE key="postcode"\
                             UNION ALL SELECT value FROM ways_tags WHERE key="postcode") e \
                            GROUP BY e.value \
                            ORDER BY num DESC \
                            LIMIT 5' ) # Remove this limit to see the complete list of postcodes
    # fetchall() consumes the cursor, so it is called once and the rows are returned
    return output.fetchall()

print 'List of postcodes: \n'
pprint(list_of_postcodes())


List of postcodes: 

[(u'94122', 5109),
 (u'94611', 2990),
 (u'94116', 2202),
 (u'94610', 1357),
 (u'94117', 1220)]

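
The postcode rule described in this section can be sketched as follows. This is a minimal Python 3 illustration, not the update_postcode function from the project, which is not shown here. The helper name clean_postcode is hypothetical, and stripping a leading state abbreviation is an assumption about how an acceptable value such as 'CA 94122' is handled.

```python
import re

FIVE_DIGITS = re.compile(r'^\d{5}$')
# Assumption: a two-letter state abbreviation in front of five digits is stripped
STATE_PREFIX = re.compile(r'^[A-Za-z]{2}\s*(\d{5})$')

def clean_postcode(value):
    """Return a 5-digit postcode, or '00000' when the value is not valid."""
    value = value.strip()
    if FIVE_DIGITS.match(value):
        return value
    prefixed = STATE_PREFIX.match(value)
    if prefixed:
        return prefixed.group(1)
    # 6-digit codes, codes shorter than 5 digits, and a bare 'CA' end up here
    return '00000'

for raw in ['94122', 'CA 94122', 'CA', '941222', '9412']:
    print(repr(raw), '->', clean_postcode(raw))
```

Any format not covered by the description, such as a postcode with a four-digit extension, also falls through to '00000' in this sketch; the real cleaning function may treat such values differently.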

### Amenities around 94122 Postcode

Since 94122 is the most frequent postcode, I wanted to see which amenities are around this area. Since the list was quite long, I limited it to the 20 most frequent amenities.

In [221]:
def amenities_around_94122():
    # The join keeps only the nodes that also carry a postcode tag equal to 94122
    output = cur.execute('SELECT nodes_tags.value, COUNT(*) as num \
                          FROM nodes_tags \
                            JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE key="postcode" AND value="94122") AS in_94122 \
                            ON nodes_tags.id = in_94122.id \
                            WHERE nodes_tags.key="amenity"\
                            GROUP BY nodes_tags.value \
                            ORDER BY num DESC \
                            LIMIT 20' ) # Remove this limit to see the complete list of amenities
    # fetchall() consumes the cursor, so it is called once and the rows are returned
    return output.fetchall()

print 'Amenities around 94122 postcode: \n'
pprint(amenities_around_94122())

### Popular Cafes in San Francisco

In [222]:
def most_popular_cafes():
    output = cur.execute('SELECT nodes_tags.value, COUNT(*) as num \
                          FROM nodes_tags \
                            JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value="coffee_shop") AS cafes \
                            ON nodes_tags.id = cafes.id \
                            WHERE nodes_tags.key="name"\
                            GROUP BY nodes_tags.value \
                            ORDER BY num DESC \
                            LIMIT 10' ) # Remove this limit to see the complete list of cafes
    # fetchall() consumes the cursor, so it is called once and the rows are returned
    return output.fetchall()

print 'Most popular cafes in San Francisco: \n'
pprint(most_popular_cafes())

Most popular cafes in San Francisco: 

[(u'Starbucks', 51),
 (u"Peet's Coffee & Tea", 15),
 (u'Starbucks Coffee', 15),
 (u"Peet's Coffee and Tea", 7),
 (u'Philz Coffee', 5),
 (u"Peet's Coffee", 4),
 (u'Beanery', 3),
 (u'Blue Bottle Coffee', 3),
 (u'Highwire Coffee Roasters', 3),
 (u'Alchemy Collective Cafe', 2)]


It is no surprise that Starbucks is the most frequent name in this list. There are, however, some inconsistencies in the names. For example, 'Starbucks' and 'Starbucks Coffee' are the same brand entered under different names, and Peet's appears in three different spellings.
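
One possible way to merge such variants is to map each name to a canonical brand before counting. The following minimal Python 3 sketch uses the counts from the output above; the BRAND_PREFIXES mapping and the canonical_name helper are hypothetical and were not part of the cleaning done in this project.

```python
from collections import Counter

# Counts copied from the query output above
cafe_counts = [
    ('Starbucks', 51), ("Peet's Coffee & Tea", 15), ('Starbucks Coffee', 15),
    ("Peet's Coffee and Tea", 7), ('Philz Coffee', 5), ("Peet's Coffee", 4),
]

# Hypothetical mapping from a lowercase name prefix to one canonical brand name
BRAND_PREFIXES = {'starbucks': 'Starbucks', "peet's": "Peet's Coffee"}

def canonical_name(name):
    """Return the canonical brand for known prefixes, otherwise the name itself."""
    lowered = name.lower()
    for prefix, brand in BRAND_PREFIXES.items():
        if lowered.startswith(prefix):
            return brand
    return name

merged = Counter()
for name, num in cafe_counts:
    merged[canonical_name(name)] += num

print(merged.most_common())
```

A prefix rule like this is crude and would need a hand-checked mapping in practice, since unrelated businesses can share the first word of their names.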

## Discussions about the Data

### Anticipated Problems

Data wrangling for this project has been a time-consuming and complicated task due to many inconsistencies in the data. I could spot only a few of those problems and clean them up, but I am sure there are many that I could not catch. The main reason behind these inconsistencies is human error; however, there are ways to improve the overall quality of the data, especially when it is meant to be used for analysis.

#### Empty user id fields

This was one of the problems I found while creating my CSV files and validating them against the expected schema. I received an error that the 'uid' field was empty. This came as a surprise to me, since I thought the least contributors could do was to not leave the 'uid' field empty. 

According to OpenStreetMap best practices, the 'uid' should be included; however, there were many empty fields in the data for this item. OpenStreetMap could make this a mandatory step for contributors, but on the other hand this might decrease the number of contributions.

#### Invalid format for postcodes

Postcode formats were another issue I had to catch and fix several times during this project. While auditing the postcodes in the sample file, I was able to spot some issues, but after running the script against the original file, I saw there were more problems that I had not seen. The standard postcode format is a 5-digit code with no letters or other characters. Some values had the state abbreviation before the digits, which is still acceptable. However, values that were 6 digits long, shorter than 5 digits, or only 'CA' were invalid, and in my opinion there should be some checks before a user can add data to the map. For example, simple checks on fields like postcode could make sure no invalid data is entered. This would also help contributors know what is expected of them.

### Suggestion for Improvement

#### Gamification

One way to enhance the quality and accuracy of the data could be gamification, such as a 'top contributors' ranking, to increase contributors' motivation to submit more data. This gamification, though, needs to incorporate OpenStreetMap best practices so that the submitted data has fewer inconsistencies.