# Project 3: OpenStreetMap 


## Project Overview:

OpenStreetMap (OSM) is a free, editable map of the world. Its growth has been motivated by the need for freely available map information across the world. The data is generated by many different users and stored in XML format. 
This project is a requirement of the data wrangling section of the Udacity Data Analysis Nanodegree. The aim of the project is to support new and interesting uses of the data.
For this project, I chose Ottawa as my map area and audited its data for validity, accuracy, completeness, consistency and uniformity.  

## Map Area
#### Ottawa, Canada 

Ottawa is the capital city of Canada. I am visiting Ottawa next summer and would like the opportunity to contribute to its improvement on OpenStreetMap. 
* https://www.openstreetmap.org/relation/4136816#map=9/45.2723/-75.7256
* https://mapzen.com/data/metro-extracts/metro/ottawa_canada/

In [2]:
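# Imports used by the cells below
import os
import re
import pprint
import xml.etree.ElementTree as ET
from collections import defaultdict

OSM_FILE = "ottawa_canada.osm"  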

SAMPLE_FILE = "ottawa_sample.osm"

#### File Size

In [3]:
def convert_bytes(num):
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f %s" % (num, x)
        num /= 1024.0

def file_size(filename):
    if os.path.isfile(filename):
        file_info = os.stat(filename)
        return convert_bytes(file_info.st_size)
    
size = file_size(OSM_FILE)
print ('OSMSize', size)

OSMSize 1001.4 MB


## Generate Sample Data

In [4]:
'''
create a sample of the file 
'''
    
k = 100
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag
    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

with open(SAMPLE_FILE, 'wb') as output:
    b = bytearray()
    b.extend('<?xml version="1.0" encoding="UTF-8"?>\n'.encode())
    b.extend('<osm>\n  '.encode())
    output.write(b)

    # Write every kth top level element
    print (OSM_FILE)
    for i, element in enumerate(get_element(OSM_FILE)):

        if not i % k:
            output.write(ET.tostring(element, encoding='utf-8'))
    b_end = bytearray()
    b_end.extend('</osm>'.encode())
    output.write(b_end)

ottawa_canada.osm


The original dataset is approximately 1 GB, so I used the code above to take a sample of the original OSM file. With k = 100, every 100th top-level element is kept, so the sample file is about 10 MB.


In [5]:
sample_size = file_size(SAMPLE_FILE)
print ('SampleSize', sample_size)

SampleSize 10.2 MB
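
To make the sampling rule concrete, here is a minimal sketch that applies the same "keep every k-th top-level element" idea to a tiny in-memory document. The `TINY_OSM` data and the `sample_ids` helper are hypothetical and exist only for illustration; they are not part of the project code.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature OSM document, used only for illustration.
TINY_OSM = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node id="1" lat="45.40" lon="-75.70"/>
  <node id="2" lat="45.41" lon="-75.71"/>
  <node id="3" lat="45.42" lon="-75.72"/>
  <way id="4"><nd ref="1"/><nd ref="2"/></way>
  <node id="5" lat="45.43" lon="-75.73"/>
  <relation id="6"><member type="way" ref="4" role=""/></relation>
</osm>"""

def sample_ids(xml_bytes, k, tags=("node", "way", "relation")):
    """Return the ids of every k-th top-level element, in document order."""
    kept = []
    count = 0
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag in tags:
            if count % k == 0:
                kept.append(elem.get("id"))
            count += 1
    return kept

print(sample_ids(TINY_OSM, 2))
print(sample_ids(TINY_OSM, 3))
```

Child elements such as `nd` and `member` are not counted on their own; they travel with the `way` or `relation` that contains them, which is why the sample shrinks by roughly a factor of k.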


After taking a sample of the dataset, I need to figure out what kind of elements are in the OSM file, and how important they are. 

As seen in the output below, the major elements are member, nd, node, relation, tag and way. 

In [6]:
def count_tags(filename):
    tags = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag in tags: 
            tags[elem.tag] += 1
        else:
            tags[elem.tag] = 1
    return tags
pprint.pprint(count_tags(SAMPLE_FILE))

{'member': 531,
 'nd': 47955,
 'node': 44738,
 'osm': 1,
 'relation': 43,
 'tag': 44676,
 'way': 4518}


#### Patterns in the tags

In [7]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    # Classify the key of a single <tag> element into one of four buckets.
    if element.tag == "tag":
        k = element.attrib['k']
        if lower.search(k):
            keys['lower'] += 1
        elif lower_colon.search(k):
            keys['lower_colon'] += 1
        elif problemchars.search(k):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys
pprint.pprint(process_map(SAMPLE_FILE))

{'lower': 23420, 'lower_colon': 21140, 'other': 116, 'problemchars': 0}
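
As a small illustration of what the four buckets mean, the sketch below runs the same three regular expressions on a few sample keys. The `classify_key` helper and the sample keys are hypothetical and chosen only to show one key per bucket.

```python
import re

# Same patterns as in the cell above.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify_key(key):
    """Return the bucket a tag key falls into (hypothetical helper)."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"

for sample in ["highway", "addr:street", "cycle way", "FIXME", "addr:street:name"]:
    print(sample, "->", classify_key(sample))
```

Keys with uppercase letters or more than one colon match none of the three patterns, so they land in the 'other' bucket.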


### Auditing the OSM file:

### Auditing Street Names:

Problems encountered while auditing street names:
 
   Letter case: some street types are capitalized while others are written in lowercase, for example Road and road.
  
   Abbreviation: most street types are written out in full; however, the rest are abbreviated.
  
To clean this, I wrote the code below to capitalize the first letter of each word and to expand every abbreviation.   

In [8]:
# With search(), this pattern matches the first whitespace-delimited token of the street name.
regex = re.compile(r'\b\S+\.?', re.IGNORECASE)
expected = ["Rue", "Road", "Street", "Avenue", "Way", "Circle", "Drive", "Court", "Crescent", "Lane", "Parkway", "Garden", "Private", "Place", "Bridge", "Boulevard", "Square", "Ridge", "Gate", "Grove"] #expected names in the dataset

def audit_street(street_types, street_name): 
    m = regex.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
            
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def audit(osmfile):
    osm_file = open(osmfile, "r", encoding='utf-8')
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types
    
def pretty_print(d):
    # Print the street types that have at least 5 distinct street names, most common first.
    for sorted_key in sorted(d, key=lambda k: len(d[k]), reverse=True):
        if len(d[sorted_key]) >= 5:
            print(sorted_key.title(), ':', len(d[sorted_key]))

print('Numbers of different street types in Ottawa:')
pretty_print(audit(SAMPLE_FILE))

Numbers of different street types in Ottawa:
Chemin : 52
County : 14
St. : 13
Impasse : 13
Concession : 12
Des : 11
Old : 10
Route : 9
Rideau : 8
South : 7
River : 5
Country : 5
Promenade : 5
Willow : 5
Montée : 5


In [9]:
mapping = {"Ave": "Avenue",
            "Ave.": "Avenue",
            "avenue": "Avenue",
            "ave": "Avenue",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Blvd,": "Boulevard",
            "Boulavard": "Boulevard",
            "Boulvard": "Boulevard",
            "Ct": "Court",
            "Dr": "Drive",
            "Dr.": "Drive",
            "Ln": "Lane",
            "Ln.": "Lane",
            "Pl": "Place",
            "Plz": "Plaza",
            "Rd": "Road",
            "Rd.": "Road",
            "St": "Street",
            "St.": "Street",
            "st": "Street",
            "street": "Street",
            "square": "Square",
            "parkway": "Parkway"
            }

def string_case(s): 
    if s.isupper():
        return s
    else:
        return s.title()
    
def update_name(name, mapping):
    name = name.split(' ')
    for i in range(len(name)):
        if name[i] in mapping:
            name[i] = mapping[name[i]]
            name[i] = string_case(name[i])
        else:
            name[i] = string_case(name[i])
    name = ' '.join(name)
    return name
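
To show the effect of the cleaning step, here is a short self-contained sketch that applies the same token-by-token logic to a few street names. It uses a reduced mapping and made-up street names for illustration, and the compact `update_name` below is a rewrite of the cell above, not the original code.

```python
# Reduced mapping, for illustration only.
mapping = {"St": "Street", "St.": "Street", "st": "Street",
           "Ave.": "Avenue", "Rd": "Road", "Blvd": "Boulevard"}

def string_case(s):
    # Keep all-uppercase tokens as they are, title-case everything else.
    return s if s.isupper() else s.title()

def update_name(name, mapping):
    # Expand each abbreviated token, then normalize its letter case.
    words = [string_case(mapping.get(word, word)) for word in name.split(' ')]
    return ' '.join(words)

print(update_name("bank st", mapping))
print(update_name("Carling Ave.", mapping))
print(update_name("MERIVALE Rd", mapping))
print(update_name("St. Laurent Blvd", mapping))
```

Because every token is looked up in the mapping, a leading `St.` is expanded to `Street` as well, so a name such as "St. Laurent Blvd" becomes "Street Laurent Boulevard". Restricting the replacement to the last token is one possible way to avoid this, at the cost of missing street types that appear first, as in French names.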

### Auditing Postal Codes:

##### Problems encountered while auditing postal codes:

   1. All Ottawa postal codes should start with the letter 'K' or 'J', but after testing the postal codes I found that some start with 'ON'.
   2. Ottawa postal codes consist of a letter, a digit and a letter, then a space, then a digit, a letter and a digit (7 characters in total, for example K1Z 8A2). To test my OSM file I built a regular expression that matches correctly formatted postal codes. 
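
The two rules above can be sketched as a stricter check plus a small normalizer. This is a minimal illustration under the assumptions stated in the list (first letter K or J, letter-digit-letter, space, digit-letter-digit); `POSTCODE_RE`, `normalize_postcode` and `is_valid_postcode` are hypothetical names, not the functions used in this project.

```python
import re

# Assumed format: K or J, digit, letter, space, digit, letter, digit.
POSTCODE_RE = re.compile(r'^[KJ]\d[A-Z] \d[A-Z]\d$')

def normalize_postcode(value):
    """Uppercase, drop a leading 'ON' province prefix, and insert the space."""
    code = value.strip().upper()
    if code.startswith("ON"):
        code = code[2:].strip()
    code = code.replace(" ", "")
    if len(code) == 6:
        code = code[:3] + " " + code[3:]
    return code

def is_valid_postcode(value):
    return bool(POSTCODE_RE.match(value))

for raw in ["K1Z8A2", "ON K1A 0B1", "k2p 1l4"]:
    fixed = normalize_postcode(raw)
    print(raw, "==>", fixed, is_valid_postcode(fixed))
```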



In [10]:
postal_type_re = re.compile(r'^[kj]\d\w \d\w\d', re.IGNORECASE)
postal_types = defaultdict(set)

def audit_postal_code(postal_value, elem):
    m = postal_type_re.search(postal_value)
    if not m:
        postal_types[elem.attrib['k']].add(postal_value)

def is_postal_code(elem):
    return (elem.attrib['k'] == "addr:postcode" )


def audit(osmfile):
    osm_file = open(osmfile, "r", encoding='utf-8')
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag in["node", "way", "relation"] :
            for tag in elem.iter("tag"):
                if is_postal_code(tag):
                    audit_postal_code(tag.attrib['v'], tag)
    osm_file.close()
    return postal_types


audit(SAMPLE_FILE)
pprint.pprint(dict(postal_types))

{'addr:postcode': {'K1Z8A2', 'K1Z6H6'}}


In [11]:
def is_incorrect_postal_code(postal_value, tag):
    # True only for addr:postcode tags whose value fails the pattern.
    return is_postal_code(tag) and not postal_type_re.search(postal_value)

def audit_pin(osmfile):
    osm_file = open(osmfile, "r", encoding='utf-8')
    for event, elem in ET.iterparse(osm_file, events=("start",)):

        if elem.tag in["node", "way", "relation"] :
            for tag in elem.iter("tag"):
                 if is_incorrect_postal_code(tag.attrib['v'], tag):
                    print(tag.attrib['v'], '==>', update_postal_code(tag.attrib['v'], elem))
    osm_file.close()    

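def update_postal_code(postal_value, element):
    # Strip a leading province code such as "ON" and insert the missing space.
    postal_value = postal_value.strip()
    if postal_value.upper().startswith('ON'):
        postal_value = postal_value[2:].strip()
    if len(postal_value) == 6 and ' ' not in postal_value:
        return postal_value[0:3] + ' ' + postal_value[3:6]
    # Anything else is returned unchanged instead of None.
    return postal_value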
    

### Preparing for SQL database:

After auditing the OSM file, I need to prepare the data to be inserted into a SQL database. To do so, I parsed through all elements in the OSM XML file, transforming them from document format to .csv files. These csv files can then easily be imported to a SQL database as tables.


#### Converting from XML to CSV

The first step is to convert the OSM XML file into CSV files that the next step can load.
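
The conversion code is not shown in this report, so the following is only a rough sketch of how node elements could be written to CSV. The field list in `NODE_FIELDS`, the `nodes_to_csv` helper and the sample data are assumptions made for illustration, and an in-memory buffer stands in for a real output file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layout for the nodes table (hypothetical).
NODE_FIELDS = ["id", "lat", "lon", "user", "uid", "version", "changeset", "timestamp"]

# Hypothetical sample data, used only for illustration.
TINY_OSM = b"""<osm>
  <node id="1" lat="45.42" lon="-75.69" user="alice" uid="10" version="1" changeset="100" timestamp="2017-01-01T00:00:00Z"/>
  <node id="2" lat="45.43" lon="-75.70" user="bob" uid="11" version="2" changeset="101" timestamp="2017-01-02T00:00:00Z"/>
</osm>"""

def nodes_to_csv(xml_bytes, out):
    """Write one CSV row per <node> element and return the number of rows."""
    writer = csv.DictWriter(out, fieldnames=NODE_FIELDS)
    writer.writeheader()
    rows = 0
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "node":
            writer.writerow({field: elem.attrib.get(field, "") for field in NODE_FIELDS})
            rows += 1
    return rows

buffer = io.StringIO()
count = nodes_to_csv(TINY_OSM, buffer)
print(count)
print(buffer.getvalue())
```

Ways and their tags would need similar writers with their own column lists.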

#### Importing tables from CSV to SQL database

Finally, I built a SQL database and imported the tables into it from my CSV files. 
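
The import code is also not shown here, so the sketch below only illustrates one possible way to load a CSV file into SQLite with the standard library. The table layout, the `load_nodes` helper and the sample rows are assumptions, and an in-memory database is used instead of `ottawa.db`.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content with a header row, used only for illustration.
NODES_CSV = "id,lat,lon,user,uid\n1,45.42,-75.69,alice,10\n2,45.43,-75.70,bob,11\n"

def load_nodes(con, csv_file):
    """Create a nodes table and insert one row per CSV record."""
    con.execute(
        "CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL, user TEXT, uid INTEGER)"
    )
    # DictReader consumes the header row, so it is not inserted as data.
    reader = csv.DictReader(csv_file)
    rows = [(r["id"], r["lat"], r["lon"], r["user"], r["uid"]) for r in reader]
    con.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?)", rows)
    con.commit()
    return len(rows)

con = sqlite3.connect(":memory:")
loaded = load_nodes(con, io.StringIO(NODES_CSV))
users = con.execute("SELECT COUNT(DISTINCT uid) FROM nodes").fetchone()[0]
print(loaded, users)
```

Reading the file with `csv.DictReader` maps each value to its column by name, which helps keep the columns aligned with the table definition.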

### Querying SQL database

In [14]:
import sqlite3

def number_of_nodes():
    result = cur.execute('SELECT COUNT(*) FROM nodes')
    return result.fetchone()[0]

def number_of_ways():
    result = cur.execute('SELECT COUNT(*) FROM ways')
    return result.fetchone()[0]

def number_of_unique_users():
    result = cur.execute('SELECT COUNT(DISTINCT(e.uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e')
    return result.fetchone()[0]

def top_contributing_users():
    users = []
    for row in cur.execute('SELECT e.user, COUNT(*) as num FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e GROUP BY e.user ORDER BY num DESC LIMIT 10'):
        users.append(row)
    return users

def number_of_users_contributing_once():
    result = cur.execute('SELECT COUNT(*) FROM (SELECT e.user, COUNT(*) as num FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e GROUP BY e.user HAVING num=1) u')
    return result.fetchone()[0]

if __name__ == '__main__':
    con = sqlite3.connect("ottawa.db")
    cur = con.cursor()

    print( "Number of nodes: " , number_of_nodes())
    print( "Number of ways: " , number_of_ways())
    print( "Number of unique users: " , number_of_unique_users())
    print( "Top contributing users: " , top_contributing_users())
    print( "Number of users contributing once: " , number_of_users_contributing_once())




Number of nodes:  827649
Number of ways:  83595
Number of unique users:  1
Top contributing users:  [('user', 911244)]
Number of users contributing once:  0


No additional ideas could be explored in this data. For example, the file has no 'restaurant' or 'amenity' tags.
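
One way to check for tags like these is to count tag values in the database. The sketch below assumes a `nodes_tags` table with `id`, `key`, `value` and `type` columns, which is not shown in this report, and fills an in-memory database with made-up rows purely to demonstrate the query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Assumed table layout (hypothetical); the rows below are made up for the demo.
con.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")
con.executemany(
    "INSERT INTO nodes_tags VALUES (?, ?, ?, ?)",
    [
        (1, "amenity", "restaurant", "regular"),
        (2, "amenity", "cafe", "regular"),
        (3, "amenity", "restaurant", "regular"),
        (4, "highway", "crossing", "regular"),
    ],
)

def amenity_counts(con):
    """Return (amenity value, count) pairs, most frequent first."""
    query = (
        "SELECT value, COUNT(*) AS num FROM nodes_tags "
        "WHERE key = 'amenity' GROUP BY value ORDER BY num DESC, value"
    )
    return con.execute(query).fetchall()

print(amenity_counts(con))
```

An empty result from a query like this on the real database would confirm that no amenity tags were imported.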

## Conclusion 


OpenStreetMap data is not perfect, like any human-edited project, and it will take a lot of time to find and clean all human-made errors. However, I am happy that I have taken my first step. I modified street names and postal codes to make them more consistent and uniform. After that, I transformed the XML into CSV format and imported it into a SQL database. Finally, I found more interesting information about Ottawa.

##### Additional ideas:
I think that there are two ways to improve OpenStreetMap data:
   1. Attract more people to improve the data. For example, competitions can encourage data analysts to explore the data and benefit from it; however, this will attract only a small group of people.
   2. Link the map with more popular maps such as Google Maps, so that it can be used more easily or share data with Google Maps; however, it isn't easy to link OSM data with Google Maps because of the large amount of data involved.  
