# OpenStreetMap Case Study: Santiago
    

## Map Area

Santiago, Chile

<ul>
<li> https://www.openstreetmap.org/#map=12/-33.4568/-70.5882  </li>
<li> https://mapzen.com/data/metro-extracts/metro/santiago_chile/ </li>
</ul>

As an expat living in Santiago, I am always trying to familiarize myself more with where I live.  I'm interested in how complete this data is and in what I can find out by querying.  Maybe I'll even improve my Spanish a little bit. 

Initial inspection of the downloaded OSM file in the terminal shows that it is about 260 MB.  I used the code provided in the project details (see footnote 1) to create a small sample, one tenth the size of the original, to explore.
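
The sampling step can be sketched roughly as follows. This is a minimal Python 3 sketch, not necessarily the code from the project details: it assumes the sample simply keeps every k-th top-level element, and the names `get_top_level_elements` and `write_sample` are placeholders of mine.

```python
import xml.etree.ElementTree as ET

def get_top_level_elements(osm_source, tags=("node", "way", "relation")):
    """Yield each node, way, or relation element from an OSM XML source."""
    context = ET.iterparse(osm_source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the enclosing <osm> element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # drop elements that were already handed to the caller

def write_sample(osm_source, sample_path, k=10):
    """Write every k-th top-level element to sample_path (assumed sampling rule)."""
    with open(sample_path, "w", encoding="utf-8") as output:
        output.write('<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
        for i, element in enumerate(get_top_level_elements(osm_source)):
            if i % k == 0:
                output.write(ET.tostring(element, encoding="unicode"))
        output.write("</osm>\n")

# Usage sketch: write_sample("santiago.osm", "sample.osm", k=10)
```

With `k=10` the output holds about one tenth of the elements, which is consistent with the sample size mentioned above.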

## Problems Encountered in the Map

While exploring the data, I decided to focus on the most frequent secondary tags. The following is a list of the ten most common secondary tags within a node or a way.  I decided to find out more about the first seven.


In [None]:
addr:street: 22855
addr:housenumber: 21573
name: 10176
addr:interpolation: 10118
highway: 9979
source: 3871
id_origin: 3722
surface: 3274
building: 2907
oneway: 1951
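
A tally like the one above could be produced along these lines. This is only a sketch under my own assumptions: the helper name `count_tag_keys` is hypothetical, and the tiny inline document stands in for `sample.osm`.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tag_keys(osm_source):
    """Count the "k" attribute of every <tag> nested in a node or a way."""
    counts = Counter()
    for _, elem in ET.iterparse(osm_source, events=("end",)):
        if elem.tag in ("node", "way"):
            for tag in elem.iter("tag"):
                counts[tag.get("k")] += 1
            elem.clear()  # release the children once they are counted
    return counts

# Made-up miniature document; the tag inside the relation is ignored
demo = io.BytesIO(b"""<osm>
  <node id="1"><tag k="addr:street" v="Av. Italia"/><tag k="addr:housenumber" v="12"/></node>
  <way id="2"><nd ref="1"/><tag k="addr:street" v="Pje. Uno"/></way>
  <relation id="3"><tag k="name" v="Linea 1"/></relation>
</osm>""")

for key, n in count_tag_keys(demo).most_common(10):
    print("%s: %d" % (key, n))
```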

* Unsurprisingly, many `addr:street` values were unstandardized (e.g. Av., Av, Ave, Avda. and Avenida all indicate the same thing). This is a smaller problem here than in many other countries, since in Chile most street names are *just* a name; that is, they usually lack a classifier such as "Calle" or "Avenida", whereas elsewhere a "Street", "Avenue", "Rue" or "Strasse" typically accompanies the proper noun.
* Some values for the `addr:housenumber` key were not integers and appeared to be street names, so their keys should be changed from `addr:housenumber` to `addr:street`.  This was at first obscured by the fact that several house numbers include a letter (e.g. "1345 A") or are "s/n" (sin número, "no number"); these are in fact correct and should stay as they are.  Though there are a fair number of "S/n" buildings here, and finding them can sometimes be tough, it's much easier than where I used to live (Amman, Jordan), where the address system was only introduced in 2007!
* Additionally, many `name` values appeared to include street names; in other words, they were improperly using the `name` key or duplicating what was already in the `addr:street` key.  Regardless of whether these tags were improperly used, the same standardizations needed to be applied to the `name` values as to the `addr:street` values.
* Almost all `addr:interpolation` were of the value of 'even' or 'odd' indicating a good normalization (since addr:interpolation can also have a range of numbers as its value) but one "Las Hualtatas" was clearly mislabeled. 
* The `highway` tags seemed to be properly used.
* Several versions of the most commonly used sources are present and not standardized.
* The key `id_origin` has numeric values, most of which are repeated only once.  Presumably this indicates where the id number of an element came from, but it is strange that there are only two nodes per origin.  There is nothing in the <a href = "https://wiki.openstreetmap.org/wiki/Main_Page">OSM documentation</a> about it, which makes me wonder how to find out about conventions commonly used within one particular area.





### Unstandardized street, place, and source names

To fix these issues, I created two separate mapping dictionaries and update functions, which were called later on when shaping the elements to create the CSV files.  For the street names, only the abbreviated portion was replaced, whereas for source names, any mention of a key part of the source (for example, the web address of the National Institute of Statistics) caused the entire source to be replaced with the standardized name.

In [None]:
import re

# Mapping dictionary for fixing abbreviations in street names

mapping = { "A.": "Avenida",
            "Av.": "Avenida",
            "Av" : "Avenida",
            "Avda." : "Avenida",
            "Avda" : "Avenida",
            "Ave.": "Avenida",
            "Ave": "Avenida",
            "Co.": "Cerro",
            "Co" : "Cerro",
            "Psje." : "Pasaje",
            "Psje" : "Pasaje",
            "Pje." : "Pasaje",
            "Pje" : "Pasaje",
            "Fco." : "Francisco",
            "Fco" : "Francisco",
            "Sta." : "Santa",
            "Sta" : "Santa"
            }

''' mapping dictionary to update sources '''

sources= {'www.ine.cl' : 'Instituto Nacional de Estadistica www.ine.cl',
            'Instituto Nacional De Estadisticas': 'Instituto Nacional de Estadistica www.ine.cl',
            'Bing' : "Bing",
            "bing" : "Bing",
            "2016 por KG" : "Reconocimiento cartográfico 2016 por KG"}


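def update_streetname(name, mapping):
    '''Replaces each abbreviation (key) found in name with its value in a mapping dictionary'''
    # Try longer keys first so that the result does not depend on dictionary order
    for error in sorted(mapping, key=len, reverse=True):
        abbreviation = error.strip()
        # re.escape keeps "." literal; keys without a period also swallow an optional one
        ending = '' if abbreviation.endswith('.') else r'\b\.?'
        name = re.sub(r'\b' + re.escape(abbreviation) + ending, mapping[error], name)
    return name

def update_sourcename(name, mapping):
    '''Replaces the whole name with the mapped value of a key it contains'''
    for error in mapping:
        if error in name:
            return mapping[error]
    return name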
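
As an alternative design, the abbreviation replacement can be done with a single compiled pattern. The following is a sketch under my own assumptions, not the code used for this project: `ABBREVIATIONS` is a shortened, hypothetical mapping with keys written without the trailing period, and `expand_abbreviations` is a placeholder name.

```python
import re

# Hypothetical, shortened mapping; keys carry no trailing period
ABBREVIATIONS = {
    "Av": "Avenida",
    "Avda": "Avenida",
    "Ave": "Avenida",
    "Pje": "Pasaje",
    "Psje": "Pasaje",
    "Sta": "Santa",
}

# One alternation for all keys. The closing \b prevents matches inside longer
# words such as "Avenida"; sorting longest first is only a precaution.
# An optional period after the abbreviation is consumed as well.
ABBREVIATION_RE = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, ABBREVIATIONS), key=len, reverse=True)) + r")\b\.?"
)

def expand_abbreviations(name):
    """Replace every known abbreviation in name with its full form."""
    return ABBREVIATION_RE.sub(lambda match: ABBREVIATIONS[match.group(1)], name)

for street in ["Av. Italia", "Avda Matta", "Pje. Sta. Rosa", "Avenida Italia"]:
    print(street, "->", expand_abbreviations(street))
```

Because the pattern is anchored on word boundaries, names that are already spelled out pass through unchanged, so the function can safely be applied more than once.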

### Non-numeric house numbers

The fixer function here checks that *some* digits are included or that the value is "s/n". If the house number does not meet these requirements, the tag's key is changed to `name`.  This fix needs to be done before the "name" fixes occur, so that the key can then be changed to `addr:street` if necessary.

In [None]:
numbers = re.compile(r'\d')

def addressfix(element):
    '''If a housenumber contains no digits and is not "s/n", its key ("k"=) is changed to "name"
    rather than "addr:housenumber"
    Args:
        element - a tag element from an OSM file
    Returns:
        element - updated element
    '''
    if element.attrib['k'] == "addr:housenumber":
        value = element.attrib['v']
        # "s/n", "S/N" and "S/n" (no number) are valid and must be kept
        if not numbers.search(value) and value.strip().lower() != "s/n":
            print(value)
            element.attrib['k'] = 'name'
    return element
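
The rule itself can be tried out on a few values in isolation. This is a small sketch; `looks_like_housenumber` is a hypothetical helper, and the sample values are the ones mentioned in this report.

```python
import re

HAS_DIGIT = re.compile(r"\d")

def looks_like_housenumber(value):
    """True if value has at least one digit or is some spelling of "s/n"."""
    return bool(HAS_DIGIT.search(value)) or value.strip().lower() == "s/n"

for value in ["1345 A", "s/n", "S/N", "Las Hualtatas"]:
    print(value, looks_like_housenumber(value))
```

Only the last value fails the check, so it is the only one whose key would be changed.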


### Misused "name" tags

In [None]:
# Counts the nodes that have both an amenity and a name tag, and the nodes that have an amenity tag but no name
# Same for addr:street and name tags

ns_count = 0
na_count = 0
amenity_only = 0
street_only = 0

for node in tag_list:
    tags = tag_list[node]
    if 'name' in tags:
        if 'amenity' in tags:
            na_count += 1
        if 'addr:street' in tags:
            ns_count += 1
    else:
        # Two separate checks, so a nameless node with both tags is counted in both totals
        if 'amenity' in tags:
            amenity_only += 1
        if 'addr:street' in tags:
            street_only += 1

print("Nodes with both amenity and name tags: " + str(na_count))
print("Nodes with an amenity but no name tag: " + str(amenity_only))
print("")
print("Nodes with both addr:street and name tags: " + str(ns_count))
print("Nodes with street but no name tag: " + str(street_only))

Only about 13% of amenities are nameless.  Notably, the sample has 1542 amenities, but only 1045 are counted here. Presumably the rest are not nodes but ways or relations. (A football field may have several nodes to demarcate its edges and would thus be a way.)  Still, the majority of amenities are named.  On the other hand, most of the nodes tagged with `addr:street` do not also have a name, though some do.  Perhaps these 1043 nodes (in the sample alone) repeat the street in both tags, or perhaps some of them carry a name that differs from their `addr:street`.

The function below changes an element's `name` key to `addr:street` if the value matches a known street name or if a common street classifier is found within the value.

In [None]:
streets = ["Av.", "Ave", "Avda.", "Avenida", "Calle", "Camino", "Diagonal",  "Pje", "Pje.", "Psje", "Pasaje"]

def name_street_fix(element):
    '''Changes a "name" key to "addr:street" when the value looks like a street name'''
    if element.attrib['k'] == "name":
        value = element.attrib['v']
        # Checks to see if the name value matches an already given street name, this is particularly useful for the streets
        # which are dates.  The tag key is changed to "addr:street"
        if value in street_values:
            element.attrib['k'] = 'addr:street'
        # If the name value has a common street classifier in it, the tag key is changed to "addr:street"
        # ("in" is used because str.find returns -1, which is truthy, when nothing is found)
        elif any(street in value for street in streets):
            element.attrib['k'] = 'addr:street'
    return element
            

### Addr:interpolation -- one error, for now

From the sample, it seemed there was just one value that was not 'even' or 'odd': "Las Hualtatas", a busy street in a swank part of town.  Though I know this is a street, there could be other instances where an actual place name is used instead of a street.  To take advantage of the `name_street_fix` function, I opt to change the key to `name` and allow `name_street_fix` to then change it into `addr:street`.  Though this seems like a roundabout path, it is the safer choice in case a non-numerical value that is not a street name is accidentally entered for `addr:interpolation`.  This is entirely possible, since my sample covers only 1/10 of the data and there may be other errors out there.

In [None]:
# Creates the fix (changing the key to name, rather than addr:interpolation, when the value is not even or odd)
# to be used later when shaping elements; name_street_fix can then turn the key into addr:street

def Hualtatasfix(element):
    if element.attrib['k'] == 'addr:interpolation':
        if element.attrib['v'] not in ('even', 'odd'):
            element.attrib['k'] = 'name'
    return element
            
        

### Highways

For the most part, the elements with the highway tag seem to have been properly used.  "Residential" and "living_street" are used appropriately according to the documentation.  The "bus_stop" tags may or may not be accurate.  According to the documentation, highway=bus_stop should be used for *"A small bus stop. Can be mapped more rigorously using public_transport=stop_position for the position where the vehicle stops and public_transport=platform for the place where passengers wait. See public_transport= for more details."* (see footnote 4).  Without further knowledge about whether these are "small bus stops" or more major ones, this change cannot be made. This could be verified by looking at data from other sources, such as http://www.transantiago.cl.

## Data Overview

### File Sizes

In [None]:
santiago.osm ....... 254,432 KB
santiago.db ........ 200,582 KB
sample.osm ......... 25,772 KB
nodes.csv .......... 82,565 KB
nodes_tags.csv ..... 26,408 KB
ways.csv ........... 14,443 KB
ways_nodes.csv ..... 27,845 KB
way_tags.csv ....... 17,738 KB

Once I had created the database in the command terminal, I ran queries programmatically. What follows are just some of the queries and their outputs.
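
Running a query programmatically can be as simple as the sketch below. It uses the standard `sqlite3` module; the helper name `run_query` is a placeholder of mine, and the usage line assumes that `santiago.db` exists with a `nodes` table.

```python
import sqlite3

def run_query(db_path, sql, params=()):
    """Open the SQLite database at db_path, run one query, and return all rows."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()

# Usage sketch (assumes the database file from this project):
# print(run_query("santiago.db", "SELECT COUNT(*) FROM nodes;"))
```

Opening and closing the connection per query keeps the helper stateless, which is convenient in a notebook; for many queries a single shared connection would be cheaper.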

### Number of nodes

In [None]:
SELECT COUNT(*)
FROM nodes;

961239

### Number of ways

In [None]:
SELECT COUNT(*)
FROM ways;

236355

### Number of distinct users

In [None]:
SELECT COUNT(DISTINCT(uid))          
FROM (SELECT uid FROM nodes 
UNION SELECT uid FROM ways);

1489


This differs from the result obtained by running the following function on the original santiago.osm file, presumably because the file also contains relations, which the query above does not cover:

In [None]:
import xml.etree.ElementTree as ET

def get_user(element):
    ''' Returns the user id from an element, or None if it has none'''
    return element.get('uid')

def number_users(filename):
    ''' Returns the set of unique user ids in an osm file'''
    users = set()
    for _, element in ET.iterparse(filename):
        uid = get_user(element)
        if uid:
            users.add(uid)
    return users

users = number_users('santiago.osm')
print("There are %d unique users contributing to this data set." % len(users))

In [None]:
There are 1493 unique users contributing to this data set.
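
One possible explanation for the gap between 1489 and 1493 is that the file also holds relations, which the SQL query does not read. The sketch below shows how this could be checked by grouping user ids by element type; `users_by_element_type` is a hypothetical helper and the inline document is made up, so it says nothing about the real file.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def users_by_element_type(osm_source):
    """Map each element tag (node, way, relation, ...) to the set of uids seen on it."""
    users = defaultdict(set)
    for _, elem in ET.iterparse(osm_source):
        uid = elem.get("uid")
        if uid:
            users[elem.tag].add(uid)
    return users

# Made-up miniature document: user 3 only ever edited a relation
demo = io.BytesIO(b"""<osm>
  <node id="1" uid="1"/>
  <node id="2" uid="2"/>
  <way id="3" uid="2"><nd ref="1"/></way>
  <relation id="4" uid="3"><member type="way" ref="3" role=""/></relation>
</osm>""")

users = users_by_element_type(demo)
only_in_relations = users["relation"] - users["node"] - users["way"]
print(sorted(only_in_relations))
```

Run against santiago.osm, a set of four uids here would account for the difference; any other result would mean the cause lies elsewhere.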

### Top ten users

In [None]:
SELECT nodes_ways.user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) nodes_ways
GROUP BY nodes_ways.user
ORDER BY num DESC
LIMIT 10;

In [None]:
[(u'Julio_Costa_Zambelli', 206514),
 (u'Fede Borgnia', 196366),
 (u'felipeedwards', 95448),
 (u'chesergio', 59706),
 (u'dintrans_g', 56384),
 (u'madek', 32527),
 (u'Baconcrisp', 31644),
 (u'toniello', 26331),
 (u'Chilestreet', 25054),
 (u'Run_cl', 22982)]

### One hit wonders 

In [None]:
SELECT COUNT(*) 
FROM
    (SELECT e.user, COUNT(*) as num
     FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e
     GROUP BY e.user
     HAVING num=1)  u;

439

### Number of users contributing over 10,000 times

In [None]:
SELECT COUNT(*) 
FROM
    (SELECT e.user, COUNT(*) as num
     FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e
     GROUP BY e.user
     HAVING num>10000)  u;

20

### Total number of contributions from those 20 users

In [None]:
SELECT sum(u.num) 
FROM
    (SELECT e.user, COUNT(*) as num
     FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e
     GROUP BY e.user
     HAVING num>10000)  u;

916783

### Total number of contributions

In [None]:
SELECT sum(u.num) 
FROM
    (SELECT e.user, COUNT(*) as num
     FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e
     GROUP BY e.user)  u;

1197594

This means that the top 20 users contributed ~77% of the data. In other words, roughly the top 1% of users (20 of 1489) contributed ~77% of the data.
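
The arithmetic behind these percentages, using the query results reported above (the variable names are mine):

```python
top_contributions = 916783      # contributions from users with more than 10,000 each
total_contributions = 1197594   # all node and way contributions
top_users = 20
distinct_users = 1489           # from the SQL count of distinct uids

share_of_data = 100.0 * top_contributions / total_contributions
share_of_users = 100.0 * top_users / distinct_users

print("%.1f%% of contributions come from %.1f%% of users" % (share_of_data, share_of_users))
# -> 76.6% of contributions come from 1.3% of users
```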

### Top ten areas of the city (comunas)

In [None]:
SELECT tags.value, COUNT(*) as count 
FROM (SELECT * FROM nodes_tags UNION ALL 
      SELECT * FROM ways_tags) tags
WHERE tags.key LIKE '%city'
GROUP BY tags.value
ORDER BY count DESC
LIMIT 10;

In [None]:
[(u'Providencia', 3415),
 (u'Santiago', 1620),
 (u'Las Condes', 1591),
 (u'La Reina', 963),
 (u'Puente Alto', 896),
 (u'La Florida', 826),
 (u'San Bernardo', 812),
 (u'Maip\xfa', 798),
 (u'\xd1u\xf1oa', 733),
 (u'Pudahuel', 538)]

Not surprisingly, the highest number of "city" tags belongs to Providencia, which is one of the most popular areas of the city.  (And where I live!) Runners-up are other comunas with high numbers of restaurants, schools, residential areas, etc.  This is another area for improvement, since there are some redundancies; e.g. the comuna of Providencia is represented in 6 different ways: 'Providencia', 'Providencia,Santiago', 'Santiago;Providencia', 'Providencia;Santiago', 'Providencia;Santiago', 'Providencias'.  Additionally, while initially scanning the data, I noticed that some contributors had opted to use the "is_in" tag to denote a part of the city, which further points to a need for standardizing the data.
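
A cleaning step for these variants might look like the sketch below. It is not part of the wrangling done for this project; `COMUNA_FIXES` and `normalize_comuna` are hypothetical, and the rule of preferring the part that is not "Santiago" is my own assumption.

```python
import re

# Hypothetical corrections for misspellings seen in the data
COMUNA_FIXES = {"Providencias": "Providencia"}

def normalize_comuna(value):
    """Reduce a multi-valued city tag to a single comuna name."""
    parts = [part.strip() for part in re.split(r"[;,]", value) if part.strip()]
    if not parts:
        return value  # nothing usable, leave the value untouched
    parts = [COMUNA_FIXES.get(part, part) for part in parts]
    # "Santiago" names both the city and one comuna; prefer the more specific part
    specific = [part for part in parts if part != "Santiago"]
    return specific[0] if specific else parts[0]

for raw in ["Providencia,Santiago", "Santiago;Providencia", "Providencias", "Santiago"]:
    print(raw, "->", normalize_comuna(raw))
```

A value such as "Santiago" alone is left as it is, since it may legitimately refer to the comuna of Santiago.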

### Top ten amenities

In [None]:
SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='amenity'
GROUP BY value
ORDER BY num DESC
LIMIT 10;

In [None]:
[(u'school', 1946),
 (u'restaurant', 1643),
 (u'kindergarten', 1053),
 (u'pharmacy', 749),
 (u'fast_food', 632),
 (u'parking', 586),
 (u'bank', 498),
 (u'cafe', 445),
 (u'bench', 306),
 (u'bicycle_parking', 241)]

### Top twenty types of cuisine

In [None]:
SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags 
    JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='restaurant') i
    ON nodes_tags.id=i.id
WHERE nodes_tags.key='cuisine'
GROUP BY nodes_tags.value
ORDER BY num DESC
LIMIT 20;

In [None]:
[(u'sushi', 89),
 (u'chinese', 85),
 (u'pizza', 51),
 (u'regional', 29),
 (u'steak_house', 23),
 (u'italian', 21),
 (u'sandwich', 21),
 (u'peruvian', 20),
 (u'international', 16),
 (u'japanese', 15),
 (u'chicken', 10),
 (u'burger', 9),
 (u'seafood', 9),
 (u'Peruvian', 7),
 (u'indian', 6),
 (u'arab', 5),
 (u'asian', 5),
 (u'coffee_shop', 5),
 (u'spanish', 5),
 (u'mexican', 4)]

Not surprisingly, there's a ton of sushi here. But it all has cream cheese in it!  I like a Philly roll now and then, but not all the time!

### Top ten amenities in Providencia

In [None]:
SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags 
    JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='Providencia') i
    ON nodes_tags.id=i.id
WHERE nodes_tags.key='amenity'
GROUP BY nodes_tags.value
ORDER BY num DESC
LIMIT 10;

In [None]:
[(u'bicycle_parking', 104),
 (u'restaurant', 77),
 (u'cafe', 34),
 (u'kindergarten', 28),
 (u'pharmacy', 24),
 (u'fast_food', 18),
 (u'bank', 11),
 (u'pub', 11),
 (u'embassy', 7),
 (u'bureau_de_change', 6)]

There are a lot of bike racks here... let's see where else.

In [None]:
SELECT nodes_tags.value, COUNT(*) as num
FROM nodes_tags 
    JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='bicycle_parking') i
    ON nodes_tags.id=i.id
WHERE nodes_tags.key LIKE '%city'
GROUP BY nodes_tags.value
ORDER BY num DESC
LIMIT 10;

In [None]:
[(u'Providencia', 104),
 (u'6', 37),
 (u'10', 19),
 (u'5', 19),
 (u'4', 14),
 (u'8', 13),
 (u'15', 9),
 (u'12', 8),
 (u'20', 7),
 (u'30', 6)]

Unfortunately, the other comunas are not mentioned by name in the list of bicycle parking spots.  While Provi definitely does have the most bike parking in the city, it exists in the other areas as well.  The numeric values are most likely not comuna codes at all: the pattern `'%city'` also matches the key `capacity`, which bicycle parking nodes commonly carry, so they are probably rack capacities.  Either way, the need to standardize comuna names and to match keys more precisely is apparent once again.
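
A caveat about the query above: the pattern `LIKE '%city'` matches any key that ends in "city", which includes `capacity`. The sketch below uses a throwaway in-memory table with made-up rows to show the difference between the loose pattern and an exact match; the spelling of the city key in the real database (`city` or `addr:city`) is an assumption, so both are accepted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO nodes_tags VALUES (?, ?, ?)",
    [
        (1, "amenity", "bicycle_parking"),
        (1, "addr:city", "Providencia"),
        (1, "capacity", "6"),
        (2, "amenity", "bicycle_parking"),
        (2, "capacity", "10"),
    ],
)

# Loose pattern: also returns the capacity rows
loose = conn.execute(
    "SELECT key, value FROM nodes_tags WHERE key LIKE '%city' ORDER BY id, key"
).fetchall()

# Exact match on the assumed spellings of the city key
exact = conn.execute(
    "SELECT key, value FROM nodes_tags WHERE key IN ('city', 'addr:city') ORDER BY id, key"
).fetchall()

print(loose)
print(exact)
conn.close()
```

If the numbers in the real output are capacities, the exact match should return only comuna names.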

## Other ideas about the data set

Though I was surprised by the amount of data present on the Santiago map, it is by no means complete, as was confirmed during SQL querying.  Scraping from government websites as well as commercial sites like Tripadvisor would augment the information on schools and amenities and help to audit its consistency.  There does seem to be a large amount of data already present on public transportation, however.  This makes sense, as the largest source of data is http://www.transantiago.cl.  Some of the public transport data seems to be duplicated or erroneous, though: an SQL query showed 144 metro stations, although there are only 108 (footnote 5).



### Benefits of increased completeness and consistency through scraping additional sources

Though most query results currently match my perceptions about the city (e.g. where banks and restaurants are located), this is perhaps because my personal, biased view is shared by those who input the data, who possibly live in similar (and relatively affluent) parts of the city.  Thus the data is weighted towards those areas.  Scraping from multiple sources would create a more accurate map of the city and, as a result, queries would yield more accurate results.

This kind of information is useful for both commercial and governmental purposes, such as deciding where to put a new subway station or a new Subway restaurant.


### Anticipated Problems

Scraping from multiple sources to increase completeness will also create more uniformity issues. Uniformity is already a big issue for this data set, not just in terms of abbreviations for street names, but also in how items are tagged.  Sometimes street names are duplicated in multiple tags, and many versions of the comunas (regions of the city) cloud the data.  Re-tagging the "city" and "is_in" values for uniformity would be one extension of the wrangling done so far.


### Summary

This case study attempted to make street abbreviations uniform and to change the keys of improperly used secondary tags (from addr:interpolation to addr:street, for example).  Additionally, querying was used to learn about the number of users, which amenities are present in the data, and where they can be found across the city.  Currently, this map is missing a lot of data. Remaining questions include what the "id_origin" tag means and what the numerical codes for the school operator tag mean.  In general, I wonder if there is a way to look up these tags, or whether files or forums exist on OpenStreetMap that make these things more transparent.


### Footnotes

1. Project details -- cutting the file down to a sample: https://classroom.udacity.com/nanodegrees/nd002/parts/0021345404/modules/316820862075463/lessons/3168208620239847/concepts/77135319070923#
2. Changesets -- http://wiki.openstreetmap.org/wiki/Changeset
3. Stack Overflow on printing a dictionary sorted by values -- http://stackoverflow.com/questions/11228812/print-a-dict-sorted-by-values
4. OSM highway wiki -- http://wiki.openstreetmap.org/wiki/Key:highway
5. Santiago Metro https://en.wikipedia.org/wiki/Santiago_Metro