## Module 3: Chapter 1 Activity

### The data cleansing process

In this activity, we follow the steps of the cleansing process using real data for 'Sin City'. The data file 'SinCity.osm' must be in the same directory as the IPython Notebook for this activity to run successfully. OpenStreetMap (.osm) files (https://www.openstreetmap.org/) are encoded in XML and contain geographic data in a structured, ordered format.

#### Audit data:
Our aim in this activity is to cleanse the street information in the SinCity data. The first task is to open and explore the raw data using the following code:


In [2]:
# read the whole file into a list of lines; 'with' closes the file for us
with open('SinCity.osm', 'r') as inFile:
    data = inFile.readlines()

data

# then have a look: search (e.g. CTRL-F) for 'node'

['<?xml version="1.0" encoding="UTF-8"?>\n',
 '<osm version="0.6" generator="CGImap 0.3.3 (28791 thorn-03.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">\n',
 ' <bounds minlat="41.9704500" minlon="-87.6928300" maxlat="41.9758200" maxlon="-87.6894800"/>\n',
 ' <node id="261114295" visible="true" version="7" changeset="11129782" timestamp="2012-03-28T18:31:23Z" user="bbmiller" uid="451048" lat="41.9730791" lon="-87.6866303"/>\n',
 ' <node id="261114296" visible="true" version="6" changeset="8448766" timestamp="2011-06-15T17:04:54Z" user="bbmiller" uid="451048" lat="41.9730416" lon="-87.6878512"/>\n',
 ' <node id="261114299" visible="true" version="5" changeset="8581395" timestamp="2011-06-29T14:14:14Z" user="bbmiller" uid="451048" lat="41.9729565" lon="-87.6939548"/>\n',
 ' <node id="261146436" visible="true" version="5" changeset="8581395" timestamp="2011-06-29T14

#### Decoding osm/xml files

The file contains many nodes, but we are interested only in those that carry location (address) data. Many nodes hold nothing beyond coordinates and editing metadata, such as the contributor's user name (e.g. 'bbmiller' & friends); skip those and look for the location data.

The second node is an example of a location: a Mexican restaurant on "North Lincoln Av." (search for 'restaurant' if you can't see it). The node contains one tag per line, each a key/value pair held in the attributes 'k' and 'v', e.g.
<pre class='nd pp'><code>
&lt;node id="2406124091" visible="true" version="2" changeset="17206049" timestamp="2013-08-03T16:43:42Z" user="linuxUser16" uid="1219059" lat="41.9757030" lon="-87.6921867">
  &lt;tag k="addr:city" v="Chicago"/>
  &lt;tag k="addr:housenumber" v="5157"/>
  &lt;tag k="addr:postcode" v="60625"/>
  &lt;tag k="addr:street" v="North Lincoln Av."/>
  &lt;tag k="amenity" v="restaurant"/>
  &lt;tag k="cuisine" v="mexican"/>
  &lt;tag k="name" v="La Cabana De Don Luis"/>
  &lt;tag k="outdoor_seating" v="no"/>
  &lt;tag k="phone" v="1 (773)-271-5176"/>
  &lt;tag k="smoking" v="no"/>
  &lt;tag k="takeaway" v="yes"/>
&lt;/node>
</code></pre>

As this example shows, the location information is stored in the 'addr' (address) tags: 'addr:city' (Chicago in this case), 'addr:housenumber', 'addr:postcode' and 'addr:street'. We therefore audit the addr:street pairs closely for street information.
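To make the key/value structure concrete, here is a minimal sketch that collects the 'addr' tags of one node into a dictionary. The XML string is a cut-down copy of the restaurant node above, and the names `sample_node` and `address` are illustrative only; they are not part of the activity's code.

```python
import xml.etree.ElementTree as ET

# A cut-down copy of the restaurant node shown above (illustrative only).
sample_node = """
<node id="2406124091" lat="41.9757030" lon="-87.6921867">
  <tag k="addr:city" v="Chicago"/>
  <tag k="addr:housenumber" v="5157"/>
  <tag k="addr:postcode" v="60625"/>
  <tag k="addr:street" v="North Lincoln Av."/>
  <tag k="amenity" v="restaurant"/>
</node>
"""

node = ET.fromstring(sample_node)

# Keep only the tags whose key starts with 'addr:' and drop that prefix.
address = {
    tag.get("k")[len("addr:"):]: tag.get("v")
    for tag in node.iter("tag")
    if tag.get("k", "").startswith("addr:")
}

print(address)
```

The 'amenity' tag is ignored because its key does not start with 'addr:'.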

To extract street information, we can treat the osm file either as text or as XML. In this activity, we parse the osm file with an XML library and collect the results in dictionaries (a natural fit for key-value data), starting with the following imports:

In [3]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from collections import defaultdict
import re

Then, we create a regular expression to capture the different street types.

In [4]:
# a regex to dig out street types: it matches the final word of the street name
street_type_re = re.compile(r'\b\S+?$', re.IGNORECASE)
# alternative: street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
# \s stands for "whitespace character"; it includes [ \t\r\n\f],
# i.e. \s matches a space, a tab, a line break, or a form feed.
# \S is the equivalent of [^\s]: anything that is not whitespace.
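As a quick check of what this pattern captures, the following sketch applies it to the four street names that appear later in this activity. The list `names` and the variable `street_endings` are introduced here purely for illustration.

```python
import re

street_type_re = re.compile(r'\b\S+?$', re.IGNORECASE)

# Street names taken from the SinCity data used in this activity.
names = ["North Lincoln Av.", "North Lincoln Ave", "Baldwin Rd.", "West Lexington St."]

# search() returns a match for the last word of each name, trailing dot included.
street_endings = [street_type_re.search(name).group() for name in names]

for name, ending in zip(names, street_endings):
    print(name, "->", ending)
```

Because the pattern is anchored with `$`, only the final word is returned, which is where the street type normally sits in these names.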

In [5]:
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Lane", "Road"]

The following functions audit the street types. We start with the list of expected types, then record every other street type found in the osm file, together with the street names that use it.

In [26]:
# build a list of types, e.g. Ave & Av. are not in expected so make a note (i.e. store)
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        print(street_type)
        if street_type not in expected:
            street_types[street_type].add(street_name)

# open the file and parse, expecting valid xml
def audit(osmfile):
    street_types = defaultdict(set)
    with open(osmfile, "rb") as osm_file:
        # "end" events fire once an element's child tags have been parsed
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if tag.attrib['k'] == "addr:street":  # only tags whose key is addr:street
                        audit_street_type(street_types, tag.attrib['v'])  # check the value's street type
    return street_types

# map to preferred (or standard) names
def update_name(name, mapping): 
    name_array = name.split(' ')
    last = name_array[-1]
    name_array[-1] = mapping[last]
    return ' '.join(name_array)

# put it all together 
def test(file):
    st_types = audit(file)

    # 'mapping' must be defined (see below) before test() is called
    for st_type, ways in st_types.items():
        for name in ways:
            better_name = update_name(name, mapping)
            print(name, "=>", better_name)
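The audit logic can also be tried without the data file. The sketch below runs the same kind of loop over a small in-memory document; `sample_osm` and `unexpected` are made-up names for illustration, and the three streets are a hand-written sample rather than the real file. It listens for "end" events, which are reported once an element and all of its children have been read, so the child tags are available when the node or way is inspected.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

street_type_re = re.compile(r'\b\S+?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Lane", "Road"]

# Hand-written sample document (not the real SinCity.osm).
sample_osm = b"""<osm>
  <node id="1"><tag k="addr:street" v="North Lincoln Av."/></node>
  <node id="2"><tag k="addr:street" v="West Lexington Street"/></node>
  <way id="3"><tag k="addr:street" v="Baldwin Rd."/></way>
</osm>"""

unexpected = defaultdict(set)
for event, elem in ET.iterparse(io.BytesIO(sample_osm), events=("end",)):
    if elem.tag in ("node", "way"):
        for tag in elem.iter("tag"):
            if tag.get("k") == "addr:street":
                street_type = street_type_re.search(tag.get("v")).group()
                # Only street types outside the expected list are recorded.
                if street_type not in expected:
                    unexpected[street_type].add(tag.get("v"))

print(dict(unexpected))
```

'West Lexington Street' does not appear in the result because 'Street' is already an expected type.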

We then use the audit function to inspect the street types in Sin City.

In [27]:
street_types = audit("SinCity.osm")

Av.
Ave
Rd.
St.


The audit found four street types that are not in the expected list. The street names that use them are as follows:

In [28]:
street_types

defaultdict(set,
            {'Av.': {'North Lincoln Av.'},
             'Ave': {'North Lincoln Ave'},
             'Rd.': {'Baldwin Rd.'},
             'St.': {'West Lexington St.'}})

We now map values that represent the same street type (such as Av. and Ave in our example) to a single standard name.

In [29]:
mapping = {"St.": "Street"} # want St. to be Street
for st_type, ways in street_types.items():
    for name in ways:
        better_name = update_name(name, mapping)
        print(name, "=>", better_name)

KeyError: 'Ave'

#### Our mapping does not include 'Ave', so add it (mapped to 'Avenue') and repeat:


In [30]:
mapping = {"St.": "Street", "Ave": "Avenue"}

In [31]:
for st_type, ways in street_types.items():
    for name in ways:
        better_name = update_name(name, mapping)
        print(name, "=>", better_name)

KeyError: 'Av.'

So 'North Lincoln Ave' => 'North Lincoln Avenue' worked, but 'Av.' is still missing from the mapping. We add it, and also map 'Rd.' to the proper name 'Road'.
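Instead of discovering missing keys one KeyError at a time, the audit result can be compared with the mapping up front. This is a minimal sketch: the dictionaries are copied from the cells above, and the name `missing` is introduced here for illustration.

```python
# Audit result and partial mapping copied from the cells above.
street_types = {
    "Av.": {"North Lincoln Av."},
    "Ave": {"North Lincoln Ave"},
    "Rd.": {"Baldwin Rd."},
    "St.": {"West Lexington St."},
}
mapping = {"St.": "Street", "Ave": "Avenue"}

# Street types seen in the audit that the mapping cannot translate yet.
missing = sorted(set(street_types) - set(mapping))
print(missing)
```

Every entry in `missing` needs a key in the mapping before `update_name` can run over all of the audited names without raising an error.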

In [32]:
mapping = {"St.": "Street", "Ave": "Avenue", "Rd.": "Road", "Av.": "Avenue"}

In [33]:
for st_type, ways in street_types.items():
    for name in ways:
        better_name = update_name(name, mapping)
        print(name, "=>", better_name)

('North Lincoln Ave', '=>', 'North Lincoln Avenue')
('North Lincoln Av.', '=>', 'North Lincoln Avenue')
('Baldwin Rd.', '=>', 'Baldwin Road')
('West Lexington St.', '=>', 'West Lexington Street')


In this example, we have fixed the street names in Sin City and changed them to the standard form.
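On a larger file, some street names will end in a word that the mapping does not cover, including names that are already in standard form. One possible way to handle this, sketched below, is a variant that leaves unknown endings untouched. The function `update_name_safe` is hypothetical and not part of the activity's code.

```python
def update_name_safe(name, mapping):
    """Hypothetical variant of update_name: unknown endings are kept as they are."""
    parts = name.split(' ')
    # dict.get falls back to the original last word when there is no mapping for it.
    parts[-1] = mapping.get(parts[-1], parts[-1])
    return ' '.join(parts)

mapping = {"St.": "Street", "Ave": "Avenue", "Rd.": "Road", "Av.": "Avenue"}

print(update_name_safe("West Lexington St.", mapping))
print(update_name_safe("North Lincoln Avenue", mapping))
```

The trade-off is that genuinely unmapped abbreviations pass through silently, so this variant is best used after an audit has already listed them.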

#### Parsing the data using another XML library:

In [16]:
from lxml import etree
tree = etree.parse('SinCity.osm')

In [34]:
# and look at all the <tag>s
for elem in tree.xpath(".//tag"):
    print(elem.attrib)
    # or
    # print(elem.values())

{'k': 'highway', 'v': 'traffic_signals'}
{'k': 'addr:city', 'v': 'Chicago'}
{'k': 'addr:housenumber', 'v': '5157'}
{'k': 'addr:postcode', 'v': '60625'}
{'k': 'addr:street', 'v': 'North Lincoln Av.'}
{'k': 'amenity', 'v': 'restaurant'}
{'k': 'cuisine', 'v': 'mexican'}
{'k': 'name', 'v': 'La Cabana De Don Luis'}
{'k': 'outdoor_seating', 'v': 'no'}
{'k': 'phone', 'v': '1 (773)-271-5176'}
{'k': 'smoking', 'v': 'no'}
{'k': 'takeaway', 'v': 'yes'}
{'k': 'addr:city', 'v': 'Chicago'}
{'k': 'addr:country', 'v': 'US'}
{'k': 'addr:housenumber', 'v': '4874'}
{'k': 'addr:postcode', 'v': '60625'}
{'k': 'addr:state', 'v': 'Illinois'}
{'k': 'addr:street', 'v': 'North Lincoln Ave'}
{'k': 'name', 'v': 'Matty Ks'}
{'k': 'phone', 'v': '(773)-654-1347'}
{'k': 'shop', 'v': 'doityourself'}
{'k': 'source', 'v': 'GPS'}
{'k': 'amenity', 'v': 'fast_food'}
{'k': 'cuisine', 'v': 'sausage'}
{'k': 'name', 'v': "Shelly's Tasty Freeze"}
{'k': 'highway', 'v': 'service'}
{'k': 'addr:housename', 'v': 'Village Hall'}
{'k'

In [35]:
# or just the streets
for elem in tree.xpath(".//tag"):
    # look the key up by name instead of relying on attribute order
    if elem.get('k') == "addr:street":
        print(elem.values())

['addr:street', 'North Lincoln Av.']
['addr:street', 'North Lincoln Ave']
['addr:street', 'Baldwin Rd.']
['addr:street', 'West Lexington St.']


In [36]:
# fix one
for elem in tree.xpath(".//tag"):
    if elem.get('k') == "addr:street":
        if 'St.' in elem.get('v'):
            # assigning to elem.values()[1] doesn't work: values() returns a copy
            elem.set('v', 'West Lexington Street')
            print(elem.values())

In [38]:
# and show that it's changed
for elem in tree.xpath(".//tag"):
    if elem.get('k') == "addr:street":
        print(elem.values())

['addr:street', 'North Lincoln Av.']
['addr:street', 'North Lincoln Ave']
['addr:street', 'Baldwin Rd.']
['addr:street', 'West Lexington Street']
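The cells above fix a single street by hand. As a sketch of how the same idea extends to every addr:street tag, the example below applies the mapping across a small hand-written document. It uses the standard library's ElementTree so that it runs without lxml installed; lxml elements offer the same `iter`, `get` and `set` calls. The names `sample` and `streets` are illustrative.

```python
import xml.etree.ElementTree as ET

# Hand-written sample document (not the real SinCity.osm).
sample = """<osm>
  <node id="1"><tag k="addr:street" v="North Lincoln Av."/></node>
  <node id="2"><tag k="addr:street" v="Baldwin Rd."/></node>
  <node id="3"><tag k="addr:street" v="West Lexington St."/></node>
  <node id="4"><tag k="amenity" v="restaurant"/></node>
</osm>"""
mapping = {"St.": "Street", "Ave": "Avenue", "Rd.": "Road", "Av.": "Avenue"}

root = ET.fromstring(sample)
for tag in root.iter("tag"):
    if tag.get("k") == "addr:street":
        parts = tag.get("v").split(" ")
        # Replace the last word if the mapping knows it, otherwise keep it.
        parts[-1] = mapping.get(parts[-1], parts[-1])
        tag.set("v", " ".join(parts))

streets = [t.get("v") for t in root.iter("tag") if t.get("k") == "addr:street"]
print(streets)
```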


In [37]:
# and write to file
tree.write("temp.osm")

# then look for the change

In [39]:
#!cat temp.osm

with open('temp.osm', 'r') as inFile:
    data = inFile.readlines()

data


['<osm version="0.6" generator="CGImap 0.3.3 (28791 thorn-03.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">\n',
 ' <bounds minlat="41.9704500" minlon="-87.6928300" maxlat="41.9758200" maxlon="-87.6894800"/>\n',
 ' <node id="261114295" visible="true" version="7" changeset="11129782" timestamp="2012-03-28T18:31:23Z" user="bbmiller" uid="451048" lat="41.9730791" lon="-87.6866303"/>\n',
 ' <node id="261114296" visible="true" version="6" changeset="8448766" timestamp="2011-06-15T17:04:54Z" user="bbmiller" uid="451048" lat="41.9730416" lon="-87.6878512"/>\n',
 ' <node id="261114299" visible="true" version="5" changeset="8581395" timestamp="2011-06-29T14:14:14Z" user="bbmiller" uid="451048" lat="41.9729565" lon="-87.6939548"/>\n',
 ' <node id="261146436" visible="true" version="5" changeset="8581395" timestamp="2011-06-29T14:14:14Z" user="bbmiller" uid="451048" lat="41

In [40]:
# search for 'West Lexington Street', should be just above
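Searching the printed lines by eye works for a small file. A programmatic check is sketched below under stated assumptions: it writes a one-node sample document to a temporary directory (not the real temp.osm), parses it again, and lists the street values it finds. The standard library's ElementTree is used here instead of lxml so the sketch is self-contained; `written_streets` is an illustrative name.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# One-node sample standing in for the corrected tree.
root = ET.fromstring(
    '<osm><node id="1"><tag k="addr:street" v="West Lexington Street"/></node></osm>'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "temp.osm")
    ET.ElementTree(root).write(out_path)

    # Parse the written file again and collect every addr:street value.
    reloaded = ET.parse(out_path)
    written_streets = [
        tag.get("v") for tag in reloaded.iter("tag") if tag.get("k") == "addr:street"
    ]

print(written_streets)
```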