# Udacity Project: Wrangle OSM Data to SQL - Houston, Texas

## Introduction

For this project, I chose my hometown of Houston, Texas. I started with a small area containing Rice University and the Museum District, then expanded to the larger metropolitan area. Because the data covers a large area, I could not run my code efficiently on the full dataset, so I sampled it using code provided by Udacity. I used the code provided and written during the Problem Set exercises as the basis for collecting and auditing the data. From there, I created the CSV files and the database (which proved to be one of the most difficult parts). I then used the sample project as a template for formulating questions to query with SQLite. To keep this report brief, I have trimmed much of the auditing code, results, and SQL queries, which are kept separately.
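
The sampling step can be sketched roughly as follows. This is not the Udacity script itself; it is a minimal version that assumes the goal is to keep every k-th top-level element (node, way, or relation). The names `iter_top_level`, `sample_osm`, and `k`, and the tiny inline document, are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

TOP_LEVEL = ("node", "way", "relation")

def iter_top_level(source, tags=TOP_LEVEL):
    """Yield each completed top-level element, freeing parsed data as we go."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # the first event is the start of <osm>
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()             # drop references held by the root

def sample_osm(source, k):
    """Return an <osm> element holding every k-th top-level element."""
    sample = ET.Element("osm")
    for i, elem in enumerate(iter_top_level(source)):
        if i % k == 0:
            sample.append(elem)
    return sample

# Illustrative input: ten nodes, each with one child tag.
demo = "<osm>" + "".join(
    '<node id="{}"><tag k="k" v="v"/></node>'.format(i) for i in range(10)
) + "</osm>"
sample = sample_osm(io.BytesIO(demo.encode("utf-8")), k=3)
print([node.get("id") for node in sample])  # ['0', '3', '6', '9']
```

A larger `k` gives a smaller sample; for a real file the result would normally be written back out with `ET.ElementTree(sample).write(...)` rather than held in memory.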

OSM query: 
https://www.openstreetmap.org/relation/2688911


# Auditing/Improving Street Names

Initial audit: 
I used code developed during the case study to audit the street names with regular expressions, checking for problematic characters, valid lowercase characters, and other characters. Fortunately, there were 0 problematic characters in my dataset.
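
A minimal sketch of this kind of audit is shown below. The exact patterns are not included in this report, so the three expressions here are assumptions modeled on the case study, and the sketch assumes they are applied to tag keys (a street name such as "Main Street" contains a space and would itself count as problematic). The function names are illustrative.

```python
import re
from collections import Counter

# Assumed patterns, modeled on the case study.
LOWER = re.compile(r'^([a-z]|_)*$')
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify(key):
    """Put a tag key into one of four buckets."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"

def audit_keys(keys):
    """Count how many keys fall into each bucket."""
    return Counter(classify(key) for key in keys)

print(audit_keys(["highway", "addr:street", "bad key", "FIXME"]))
```

A count of 0 in the `problemchars` bucket corresponds to the result reported above.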

Improvements: 
Most of the code and comments below were taken from the problem set prior to this project. I ran the code repeatedly to better understand the types of street names that would be caught and adjusted accordingly. Some streets had programmatic fixes, while others needed a very specific one-off fix. While a bit tedious, this approach worked for this dataset but may not have worked for a larger one.

If needed, the code could be rewritten to search through each word in a street name rather than just the street type (which I believe was a suggestion on the Udacity forums). I thought it sufficient to look for incorrect or misspelled street types rather than the various ways one could write a street type. For my purposes, I saw no reason to go into the details of modifying and standardizing the highway names and types, especially since the data is put together by users and would likely show similar discrepancies in other cities' datasets.

In [None]:
# Note: The below code has been shortened for this report
"""
- audit the OSMFILE and use the variable 'mapping' which reflects the changes needed 
    Note: a semi-generalized solution 
- update_name: actually fixes the street name.
    The function takes a string with street name as an argument and should return the fixed name
"""

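import re
import xml.etree.ElementTree as ET
from collections import defaultdict

OSMFILE = "interpreter.osm"
# Matches the last whitespace-delimited token of a street name (the street type)
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)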

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Circle", "Place", "Square", "Lane", "Road", 
           "1960", "6", "Real", "Highway", "Trace"]

mapping = { "St": "Street",
            "St.": "Street", 
            "Stree": "Street",
            "Ave": "Avenue",  
            }

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types

# Names whose street type is neither expected nor in mapping, kept for manual review
unmapped = set()

def update_name(name, mapping):
    m = street_type_re.search(name)
    if not m:
        # nothing that looks like a street type; leave the name unchanged
        return name
    street_type = m.group()
    if street_type not in expected:
        if street_type in mapping:
            name = name[:-len(street_type)] + mapping[street_type]
        else:
            # just in case we want to take a look and make sure nothing crazy is in the unmapped set
            unmapped.add(name)
    return name
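
The per-word idea mentioned above could look roughly like the sketch below. It is not what I ran for this project: `update_each_word` and its `skip_first` parameter are hypothetical, and the street names are made up for illustration. The mapping is the shortened one from this report.

```python
mapping = {"St": "Street", "St.": "Street", "Stree": "Street", "Ave": "Avenue"}

def update_each_word(name, mapping, skip_first=True):
    """Replace every word found in mapping, not just the trailing street type.

    skip_first leaves the first word alone, since a leading "St" or "St."
    is more likely to mean "Saint" than "Street".
    """
    fixed = []
    for i, word in enumerate(name.split()):
        if skip_first and i == 0:
            fixed.append(word)
        else:
            fixed.append(mapping.get(word, word))
    return " ".join(fixed)

print(update_each_word("Main St Suite 100", mapping))  # Main Street Suite 100
print(update_each_word("St. Joseph Ave", mapping))     # St. Joseph Avenue
```

The trade-off is that a per-word pass catches abbreviations in the middle of a name, but it also raises the chance of rewriting a word that was never meant to be a street type.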


# Preparing for Database

I created and loaded a schema provided by Udacity. The data was then gathered using iterparse, shaped to match the schema, and written to CSV files. Lastly, I created the SQLite3 database and its tables and loaded the data using DictReader. From there, it was much easier to explore the data with SQLite.
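
The loading step can be sketched as below (written for Python 3). The table name `nodetags` matches the queries later in this report, but the column list, the `load_csv` helper, and the two sample rows are assumptions for illustration; the real schema has more tables and constraints.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, columns, csv_file):
    """Insert every row of an open CSV file (with a header row) into table."""
    reader = csv.DictReader(csv_file)
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), ", ".join("?" for _ in columns))
    conn.executemany(sql, ([row[col] for col in columns] for row in reader))
    conn.commit()

# In-memory database and inline CSV so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodetags (id INTEGER, key TEXT, value TEXT, type TEXT)")
demo_csv = io.StringIO(
    "id,key,value,type\n"
    "1,highway,crossing,regular\n"
    "1,crossing,zebra,regular\n")
load_csv(conn, "nodetags", ["id", "key", "value", "type"], demo_csv)
print(conn.execute("SELECT COUNT(*) FROM nodetags").fetchone()[0])  # 2
```

With real files, `csv_file` would be an opened CSV such as `nodes_tags.csv`, and the same helper would be called once per table.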

# Data Overview

## File sizes

In [19]:
# File size (in bytes)
import os
base = '/Users/irasema/Desktop/DataScience/Udacity/Data Analyst/Project 2'
files = [("Original Data", 'interpreter.osm'),
         ("Nodes CSV", 'nodes.csv'),
         ("Nodes Tags CSV", 'nodes_tags.csv'),
         ("Ways CSV", 'ways.csv'),
         ("Ways Tags CSV", 'ways_tags.csv'),
         ("Ways Nodes CSV", 'ways_nodes.csv'),
         ("Database", 'interpreter.db')]
for label, name in files:
    print label + ": ", os.path.getsize(os.path.join(base, name))

#Resource:
#https://stackoverflow.com/questions/6591931/getting-file-size-in-python
#https://docs.python.org/3/library/os.path.html#os.path.getsize

Original Data:  66749940
Nodes CSV:  23374752
Nodes Tags CSV:  564723
Ways CSV:  2452684
Ways Tags CSV:  6043017
Ways Nodes CSV:  8296465
Database:  35467264


## Are all data coordinates within the limits of the original query? Is anything out of place?

In [5]:
import sqlite3
import pandas as pd

query = ''' select max(lat), min(lat), max(lon), min(lon) from nodes '''
db = sqlite3.connect("interpreter.db")
cursor = db.cursor()
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df

         0          1         2          3
0  30.1831  29.463502 -94.82301 -96.100149


All node coordinates appear to fall within the Houston metropolitan area (latitude from about 29.46 to 30.18, longitude from about -96.10 to -94.82).
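
Comparing the four extremes by eye works here, but the same check can be made explicit by counting nodes that fall outside a bounding box. The sketch below assumes a `nodes` table with `lat` and `lon` columns, as in the query above. The `count_outside` helper and the box values are illustrative, not the actual bounds of the OSM relation.

```python
import sqlite3

def count_outside(conn, south, north, west, east):
    """Count nodes whose coordinates fall outside the given bounding box."""
    query = """SELECT COUNT(*) FROM nodes
               WHERE lat < ? OR lat > ? OR lon < ? OR lon > ?"""
    return conn.execute(query, (south, north, west, east)).fetchone()[0]

# In-memory table with made-up points so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER, lat REAL, lon REAL)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    (1, 29.76, -95.37),   # inside the illustrative box
    (2, 40.00, -95.00),   # too far north
    (3, 29.80, -80.00),   # too far east
])
outside = count_outside(conn, south=29.0, north=31.0, west=-97.0, east=-94.0)
print(outside)  # 2
```

A result of 0 against the real query bounds would confirm that nothing is out of place.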

## How many Nodes are there?

In [21]:
cursor = db.cursor()
query = "select count(*) from nodes"
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df

        0
0  278542


## How many ways?

In [22]:
query = "Select count(*) from ways"
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df

       0
0  41283


## What are the most common node tags?

In [25]:
query = ''' select key, value, count(*) as count from nodetags 
group by key, value order by count desc limit 20 
'''
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df


            0                  1     2
0     highway     turning_circle  2313
1       power              tower  1379
2       power               pole   932
3     highway    traffic_signals   751
4     highway           crossing   358
5     highway       turning_loop   331
6     railway     level_crossing   321
7    state_id                 48   273
8   county_id                201   225
9       state                 TX   219
10    natural               tree   205
11    amenity   place_of_worship   193
12   religion          christian   188
13       city            Houston   177
14    barrier               gate   163
15     noexit                yes   135
16   building              house   127
17    highway  motorway_junction   113
18    created         12/08/2003   112
19   crossing              zebra    91


## Why are there so many zebra crossing tags?

Upon further exploration (looking at key/value pairs, googling the lat/lon, and googling the phrase 'zebra crossings osm'), it seems that a zebra crossing is simply a type of crosswalk (nothing specific to zebras, and not crosswalks painted as zebras for the Houston Zoo).

This may be an example of how not understanding the data leads to findings that seem off at first but, after some digging, actually make sense. Although the result is not notable and could be left out of the report entirely, the investigative procedure is helpful to demonstrate.

https://wiki.openstreetmap.org/wiki/Approved_features/Road_crossings
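
The "looking at key/value pairs" step can be done with a self-join that lists every tag carried by nodes that have a given key/value pair. This is a sketch under the assumption that `nodetags` has `id`, `key`, and `value` columns. The `tags_for` helper and the sample rows are illustrative.

```python
import sqlite3

def tags_for(conn, key, value):
    """List all tags found on nodes that carry the given key/value pair."""
    query = """SELECT other.key, other.value, COUNT(*) AS n
               FROM nodetags AS hit
               JOIN nodetags AS other ON other.id = hit.id
               WHERE hit.key = ? AND hit.value = ?
               GROUP BY other.key, other.value
               ORDER BY n DESC, other.key, other.value"""
    return conn.execute(query, (key, value)).fetchall()

# In-memory table with made-up rows so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodetags (id INTEGER, key TEXT, value TEXT)")
conn.executemany("INSERT INTO nodetags VALUES (?, ?, ?)", [
    (1, "highway", "crossing"), (1, "crossing", "zebra"),
    (2, "highway", "crossing"), (2, "crossing", "zebra"),
    (3, "highway", "traffic_signals"),
])
print(tags_for(conn, "crossing", "zebra"))
# [('crossing', 'zebra', 2), ('highway', 'crossing', 2)]
```

Seeing `highway=crossing` alongside `crossing=zebra` is the kind of result that points toward an ordinary crosswalk.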

# Data Exploration continued

## Users 

In [7]:
# This query combines the nodes and ways table to find all users who have contributed to either
# Limiting to 10 for readability
query = ''' select distinct(subq.user), uid
from (select uid, user from nodes union all select uid, user from ways) subq
limit 10
'''
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df

                 0        1
0        brianboru     9065
1        davidearl     3582
2  woodpeck_repair   145231
3        andrewpmk     1679
4  woodpeck_fixbot   147510
5          scottyc   496606
6         afdreher  1110270
7           clay_c   119881
8    RoadGeek_MD99   475877
9          Memoire  2176227


## Top 10 contributors? 

In [34]:
# top 10 contributing users across the nodes and ways tables 
query = ''' select uid, subq.user, count(*) as count
from 
(select uid, user from nodes union all select uid, user from ways) as subq
group by subq.user
order by count desc
limit 10
'''
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df

         0                1      2
0  1110270         afdreher  50264
1   147510  woodpeck_fixbot  35051
2   496606          scottyc  19328
3  3119079          cammace  19259
4   119881           clay_c  16125
5     9065        brianboru  11224
6   243003          skquinn   8680
7   475877    RoadGeek_MD99   7622
8   672878         TexasNHD   7005
9  2176227          Memoire   6540


## Users with 1 post

In [35]:
query = ''' 
select count (*) from 
(select uid, user, count(*) as counts
from (select uid, user from nodes union all select uid, user from ways) as subq
group by subq.user
having counts = 1
order by counts desc ) as substuff 
'''
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df


     0
0  265


## Restaurants 

In [8]:
# What are the most popular places and their corresponding cuisine? 
# Note, this query only draws from nodetags (not waytags - to be addressed later) 
# Limit 10 for readability 
query = ''' select names.name, value, count (*) as count 
from nodetags, 
(SELECT distinct(id) as restid FROM nodetags WHERE value= 'restaurant' or value= 'fast_food') 
as rest, 
(select value as name, id as nameid from nodetags where key = 'name') as names 

where nodetags.id = rest.restid 
and nodetags.id = names.nameid 
and nodetags.key = 'cuisine' 

group by name 
order by count desc 
limit 10 
'''
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df

                   0         1   2
0             Subway  sandwich  11
1            Wendy's    burger   7
2    Jack in the Box    burger   4
3         McDonald's    burger   4
4        Chick-fil-A   chicken   3
5  Schlotzsky's Deli  sandwich   3
6        Whataburger    burger   3
7        Burger King    burger   2
8        Jamba Juice    drinks   2
9                KFC   chicken   2


## Most common amenities 

In [13]:
# Note this query only accounts for node tags 
query = ''' select value, count(*) as count from nodetags 
where key = 'amenity' 
group by value
order by count desc
limit 15
'''
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df

                   0    1
0   place_of_worship  193
1          fast_food   89
2           fountain   89
3         restaurant   81
4             school   57
5              bench   33
6       fire_station   24
7               fuel   23
8               bank   20
9           pharmacy   20
10              cafe   19
11           toilets   12
12            police   11
13    drinking_water    9
14               atm    8


# Additional Ideas 

Node tags and way tags tend to hold similar pieces of information. Some of it seems to be split across the two tables, so a query against only one of them may return incomplete, and therefore misleading, results.

For example, searching for restaurants in an area by querying only nodetags may overlook restaurants that are recorded only in the waytags table. You may end up missing out on your favorite restaurant.

To remedy this, you may try to fix the problem at data input, or combine both tables for your queries. 

At data input, OSM contributors could be given additional guidance on how to organize the data they create. Moving existing data from one table to another would be long, tedious, and difficult. Depending on the query, combining tags from both ways and nodes should be sufficient to gather the relevant data in one place.
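
One way to see how much a single-table query would miss is to count the matching entries in each table separately. The sketch below assumes both `nodetags` and `waytags` have `id`, `key`, and `value` columns. Unlike the restaurant query earlier in this report, it also filters on `key = 'amenity'`. The `count_by_table` helper and the sample rows are illustrative.

```python
import sqlite3

def count_by_table(conn, values=("restaurant", "fast_food")):
    """Count distinct elements tagged with the given amenity values, per table."""
    placeholders = ", ".join("?" for _ in values)
    counts = {}
    for table in ("nodetags", "waytags"):
        query = ("SELECT COUNT(DISTINCT id) FROM {} "
                 "WHERE key = 'amenity' AND value IN ({})").format(table, placeholders)
        counts[table] = conn.execute(query, tuple(values)).fetchone()[0]
    return counts

# In-memory tables with made-up rows so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
for table in ("nodetags", "waytags"):
    conn.execute("CREATE TABLE {} (id INTEGER, key TEXT, value TEXT)".format(table))
conn.executemany("INSERT INTO nodetags VALUES (?, ?, ?)", [
    (1, "amenity", "restaurant"), (2, "amenity", "fast_food"), (3, "amenity", "bank"),
])
conn.executemany("INSERT INTO waytags VALUES (?, ?, ?)", [
    (10, "amenity", "fast_food"), (11, "building", "yes"),
])
counts = count_by_table(conn)
print(counts["nodetags"], counts["waytags"])  # 2 1
```

If the way count is a large share of the total, queries against nodetags alone are likely to be misleading.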

Additionally, for further analysis, it would help to first read through the possible tag values, which are documented on the OSM wiki.

In [None]:
# Create a view that combines way and node tags
# (creating a view returns no rows, so there is nothing to fetch)
query = ''' create view if not exists alltags as select * from nodetags union all select * from waytags '''
cursor.execute(query)
db.commit()

In [23]:
# What are the most prevalent restaurants in the area? 
# Note, this query draws from alltags 
query = ''' select names.name, value, count (*) as count 
from 
alltags, 
(SELECT distinct(id) as restid FROM alltags WHERE value= 'restaurant' or value= 'fast_food') 
as rest, 
(select value as name, id as nameid from alltags where key = 'name') as names 

where alltags.id = rest.restid 
and alltags.id = names.nameid 
and alltags.key = 'cuisine' 

group by name 
order by count desc 
limit 10
'''
cursor.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows)
print df

                   0         1   2
0        Whataburger    burger  16
1             Subway  sandwich  12
2         McDonald's    burger   8
3            Wendy's    burger   8
4        Chick-fil-A   chicken   6
5    Jack in the Box    burger   6
6        Burger King    burger   5
7      Panda Express   chinese   4
8  Schlotzsky's Deli  sandwich   4
9              Sonic    burger   3


# Conclusion

Upon assessment, the data is fairly consistent and thorough for data collected by humans (at least more than I expected). While this project was challenging, the OSM data seems like a valuable resource for compiling data and practicing data manipulation. Most of my frustrations actually occurred while preparing the database and creating it from the CSV files.

If the purpose were to clean the OSM data itself, I would likely approach this differently in the future. I thought it better to correct clear mistakes than to impose a standard on human-generated data. Instead of standardizing something in one subset of data that may be handled differently in another (depending on the auditor's preference, e.g. highway vs hwy), it may be better to compile the various ways contributors write the same thing (Road vs Rd vs Rd.) and group them under the same category when analyzing the data.
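
Grouping at analysis time, rather than rewriting the data, could be sketched as follows. The variant table, the function names, and the street names are illustrative assumptions; a real table would be compiled from the audit results.

```python
from collections import Counter

# Hypothetical variants, keyed in lowercase, mapped to one category each.
STREET_TYPE_GROUPS = {
    "rd": "Road", "rd.": "Road", "road": "Road",
    "hwy": "Highway", "highway": "Highway",
}

def street_category(name):
    """Return the category of a street name based on its last word."""
    words = name.split()
    if not words:
        return None
    last = words[-1]
    return STREET_TYPE_GROUPS.get(last.lower(), last)

def count_categories(names):
    """Count street names per category without modifying the names."""
    return Counter(street_category(name) for name in names)

names = ["Westheimer Rd", "Westheimer Road", "Katy Hwy", "Main Rd."]
print(count_categories(names))  # Counter({'Road': 3, 'Highway': 1})
```

The stored names stay exactly as contributors entered them; only the analysis treats the variants as one group.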

For other projects, more familiarity with the OSM standards is likely needed, both in terms of the possible tags and how the data is organized. Many of the top contributors could likely share their code with contributors in other areas to speed up the process of preparing and cleaning.

It seems the challenges faced with this dataset had numerous ways (ha) of being approached, and it would be interesting to see how the top contributors structure their bots to clean the data.  
