# <center> Data Wrangling OpenStreetMap with SQL</center>

<center> By: Kyle Hansen </center>



## <center>Map Area</center>
#### <center>Bothell, Washington, USA</center>

<img src="Map_Image.JPG" alt="Map" style="width: 650px;"/>
<center><a href="https://www.openstreetmap.org/relation/237656">https://www.openstreetmap.org/relation/237656</a></center>


<b>Project Summary:</b>
For this project, I applied data wrangling techniques to the Bothell, WA map area from OpenStreetMap.org. I chose this area because I grew up in Bothell. The data starts as XML, is audited and cleaned, is converted into CSV format, and is then imported into a SQL database.


### Tag Count

Parsing the Bothell, WA dataset and counting each element type with the count_tags function from the tags module gives the following number of occurrences per element type:

In [2]:
osm_file = "bothell.osm"

In [3]:
import tags
tags.count_tags(osm_file)

defaultdict(int,
            {'bounds': 1,
             'member': 34048,
             'meta': 1,
             'nd': 700084,
             'node': 607354,
             'note': 1,
             'osm': 1,
             'relation': 994,
             'tag': 363047,
             'way': 79027})
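
The source of the tags module is not shown here, so the following is only a minimal sketch of how such a count could be produced with the standard library. The function body and the tiny sample document are assumptions made for illustration, not the project's actual code.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def count_tags(source):
    """Count how often each element name occurs in an OSM XML source.

    `source` may be a filename or a binary file-like object.
    """
    counts = defaultdict(int)
    # iterparse reads the input incrementally; clearing each element once it
    # has been counted keeps memory use down on large extracts.
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()
    return counts


# Made-up sample document, used only to exercise the function.
sample = io.BytesIO(
    b'<osm><node id="1"><tag k="name" v="A"/></node>'
    b'<node id="2"/><way id="3"><nd ref="1"/><nd ref="2"/></way></osm>'
)
sample_counts = count_tags(sample)
```

On the sample, the result holds one entry per element name, in the same shape as the defaultdict shown above.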

### Tag Patterns

Three regular expressions classify the tag keys. lower matches keys that contain only lowercase letters and are valid. lower_colon matches otherwise valid keys that contain a colon. problem matches keys with problematic characters. Keys that fit none of these patterns are counted as other. The results are as follows:

In [12]:
import tag_type
tag_type.process_map(osm_file)

{'lower': 165323, 'lower_colon': 194710, 'other': 3014, 'problem': 0}
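
The patterns used by tag_type are not listed, so the sketch below shows one plausible way to implement the four categories. The three regular expressions, the function name key_type, and the assumption that the check is applied to each tag's key are all illustrative guesses.

```python
import re

# Assumed patterns; the ones actually used in tag_type.py are not shown.
LOWER = re.compile(r"^([a-z]|_)*$")
LOWER_COLON = re.compile(r"^([a-z]|_)*:([a-z]|_)*$")
PROBLEM_CHARS = re.compile(r"[=+/&<>;'\"?%#$@,. \t\r\n]")


def key_type(key):
    """Return the category name for a single tag key."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEM_CHARS.search(key):
        return "problem"
    # Anything else, for example keys with digits or capital letters.
    return "other"
```

Under these patterns, a key such as addr:street falls into lower_colon, while a key containing a space would be flagged as a problem.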

## <center>Problems Seen in the Data</center>

### Street Types:

Because the data is entered by users, there is no standard format for street names. Auditing the names against a list of expected street types yields the following dictionary of streets with unexpected types. The majority of the issues involve abbreviations, such as St for Street.

In [14]:
import audit
audit.audit(osm_file)

defaultdict(set,
            {'99': {'Highway 99', 'Hwy 99', 'State Highway 99'},
             'Ave': {'44th Ave'},
             'Blvd': {'Alderwood Mall Blvd'},
             'C': {'228th Street SE Suite C'},
             'E': {'Martin Way E'},
             'East': {'88th Avenue East'},
             'H': {'Hwy 99, Ste H'},
             'Hwy': {'Bothell-Everett Hwy'},
             'NE': {'104th Ave NE',
              '120th Avenue NE',
              '127th Ave NE',
              '138th Way NE',
              '141st Pl NE',
              '19115 112th Ave NE',
              '25th Ave NE',
              '5th Ave NE',
              '68th Ave NE',
              '94th Pl NE',
              'Ballinger Way NE',
              'Bothell Way NE',
              'Juanita-Woodinville Way NE'},
             'North': {'Ashworth Avenue North',
              'Bagley Avenue North',
              'Burke Avenue North',
              'Corliss Avenue North',
              'Courtland Place North',
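
The audit module itself is not reproduced here. As a rough sketch of the idea, the code below groups street names by their last word whenever that word is not in an expected list. The regular expression, the EXPECTED set, and the function name audit_street_names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Matches the last whitespace-delimited token of a street name.
STREET_TYPE_RE = re.compile(r"\b\S+\.?$")

# Hypothetical list of accepted street types.
EXPECTED = {"Street", "Avenue", "Boulevard", "Drive", "Court",
            "Place", "Lane", "Road", "Way"}


def audit_street_names(names, expected=EXPECTED):
    """Group street names by their final token when it is not expected."""
    unexpected = defaultdict(set)
    for name in names:
        match = STREET_TYPE_RE.search(name)
        if match and match.group() not in expected:
            unexpected[match.group()].add(name)
    return unexpected


audit_result = audit_street_names(
    ["Main Street", "44th Ave", "Hwy 99", "State Highway 99", "Martin Way E"]
)
```

In the real audit, the names would come from the addr:street tags in the OSM file rather than from a hard-coded list.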
             

To replace the abbreviations in the street names with their full forms, a mapping dictionary can be used:

In [None]:
mapping = { "Ave": "Avenue",
            "Ave.": "Avenue",
            "avenue": "Avenue",
            "ave": "Avenue",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Blvd,": "Boulevard",
            "Boulavard": "Boulevard",
            "Boulvard": "Boulevard",
            "Ct": "Court",
            "Dr": "Drive",
            "Dr.": "Drive",
            "E": "East",
            "Hwy": "Highway",
            "Ln": "Lane",
            "Ln.": "Lane",
            "N": "North",
            "Pl": "Place",
            "Plz": "Plaza",
            "Rd": "Road",
            "Rd.": "Road",
            "S:": "South",
            "St": "Street",
            "St.": "Street",
            "st": "Street",
            "street": "Street",
            "square": "Square",
            "parkway": "Parkway",
            "W": "West",
            "NW": "Northwest",
            "NE": "Northeast",
            "SW": "Southwest",
            "sw": "Southwest",
            "SE": "Southeast",
            "state": "State",
            "99": "Highway 99",
            "WA-99":"Highway 99"
            }


In [33]:
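def fix_street(osmfile):
    # Map each unexpected street type to the set of street names that use it.
    st_types = audit.audit(osmfile)
    # .items() works on both Python 2 and Python 3 (iteritems is Python 2 only).
    for st_type, ways in st_types.items():
        for name in ways:
            if st_type in audit.mapping:
                # Preview the fix: swap the abbreviation for its full form.
                better_name = name.replace(st_type, audit.mapping[st_type])
                print (name, "=>", better_name)

fix_street(osm_file)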

('NE 145th St.', '=>', 'NE 145th Street')
('N 202nd Pl', '=>', 'N 202nd Place')
('25th Ave NE', '=>', '25th Ave Northeast')
('Ballinger Way NE', '=>', 'Ballinger Way Northeast')
('Juanita-Woodinville Way NE', '=>', 'Juanita-Woodinville Way Northeast')
('19115 112th Ave NE', '=>', '19115 112th Ave Northeast')
('5th Ave NE', '=>', '5th Ave Northeast')
('94th Pl NE', '=>', '94th Pl Northeast')
('104th Ave NE', '=>', '104th Ave Northeast')
('141st Pl NE', '=>', '141st Pl Northeast')
('68th Ave NE', '=>', '68th Ave Northeast')
('138th Way NE', '=>', '138th Way Northeast')
('Bothell Way NE', '=>', 'Bothell Way Northeast')
('120th Avenue NE', '=>', '120th Avenue Northeast')
('127th Ave NE', '=>', '127th Ave Northeast')
('Bothell-Everett Hwy', '=>', 'Bothell-Everett Highway')
('196th St SW', '=>', '196th St Southwest')
('236th St SW', '=>', '236th St Southwest')
('196th Street SW', '=>', '196th Street Southwest')
('Martin Way E', '=>', 'Martin Way East')
('NE 185th St', '=>', 'NE 185th Street')
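
In the output above, only the final token of each name is expanded, so '25th Ave NE' becomes '25th Ave Northeast' and keeps the abbreviation Ave. One possible refinement, sketched here as an assumption rather than as the project's method, is to expand every whitespace-delimited token that has a mapping entry. MAPPING is a subset of the dictionary shown earlier, and update_name is a hypothetical helper.

```python
# Subset of the mapping shown earlier, repeated so the sketch is self-contained.
MAPPING = {
    "Ave": "Avenue",
    "St": "Street",
    "St.": "Street",
    "Pl": "Place",
    "Hwy": "Highway",
    "NE": "Northeast",
    "SW": "Southwest",
    "E": "East",
}


def update_name(name, mapping=MAPPING):
    """Expand every whitespace-delimited token that has an entry in `mapping`."""
    return " ".join(mapping.get(token, token) for token in name.split())
```

Matching whole tokens also avoids rewriting an abbreviation that happens to appear inside a longer word. Two trade-offs are worth noting: leading directional prefixes such as NE are expanded as well, and entries that expand to more than one word, such as "99", would need separate handling to avoid results like "Highway Highway 99".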

## <center>Data Overview </center>

##### Basic Statistics of the Data:

<b>Bothell.osm</b> 141,444 KB<br>
<b>Bothell.db</b> 75,838 KB<br>
<b>nodes.csv</b> 54,103 KB<br>
<b>nodes_tags.csv</b> 2,134 KB<br>
<b>ways.csv</b> 5,083 KB<br>
<b>ways_nodes.csv</b> 17,096 KB<br>
<b>ways_tags.csv</b> 10,487 KB<br>
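
The import step that turns the CSV files into the database is not shown. A minimal sketch of how one file could be loaded with the standard library follows, written for Python 3. The column layout (id, key, value, type) and the helper name load_nodes_tags are assumptions, and the rows are made up for the demonstration.

```python
import csv
import io
import sqlite3


def load_nodes_tags(conn, csv_file):
    """Create the nodes_tags table if needed and insert rows from a CSV file object."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes_tags "
        "(id INTEGER, key TEXT, value TEXT, type TEXT)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(int(r["id"]), r["key"], r["value"], r["type"]) for r in reader]
    # Parameterized inserts avoid quoting problems in tag values.
    conn.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# In-memory demonstration with made-up rows.
demo_csv = io.StringIO(
    "id,key,value,type\n1,amenity,bar,regular\n2,amenity,school,regular\n"
)
demo_conn = sqlite3.connect(":memory:")
loaded = load_nodes_tags(demo_conn, demo_csv)
```

For the real data, the file object would come from opening nodes_tags.csv, and the connection would point at Bothell.db.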

### Number of Nodes

In [40]:
query =  """SELECT COUNT(*) FROM nodes; """

output: 607354
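
The cells in this section define a query string and report its output, but the code that runs the query is not shown. A small helper along the following lines would do it with the standard sqlite3 module. The name run_query is hypothetical.

```python
import sqlite3


def run_query(db_path, query):
    """Execute `query` against the SQLite file at `db_path` and return all rows."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(query).fetchall()
    finally:
        # Always release the file handle, even if the query fails.
        conn.close()


# Smoke test against a throwaway in-memory database.
demo_rows = run_query(":memory:", "SELECT 1 + 1;")
```

With the project database, a call such as run_query("Bothell.db", query) would return a list of row tuples, and the count itself would be the first field of the first row.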

### Number of Ways

In [41]:
query = '''SELECT COUNT(*) FROM ways; '''

output: 79027

### Most Common Way Tag

In [43]:
query = '''SELECT key, count(*)
          FROM ways_tags 
          GROUP BY 1 
          ORDER BY count(*) DESC 
          LIMIT 1;
       '''

output: building

### Number of Unique Users

In [44]:
query = '''SELECT COUNT(distinct(uid)) 
           FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways);
        '''

output: 1276

### Number of Unique Bars

In [45]:
query = '''SELECT COUNT(distinct id) FROM nodes_tags WHERE value='bar'; '''

output: 8

### Number of Unique Schools

In [46]:
query = '''SELECT COUNT(*) FROM nodes_tags WHERE value='school'; '''

output: 37

### Top Contributor

In [48]:
query = '''SELECT user, COUNT(*) as num 
           FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) user 
           GROUP BY user 
           ORDER BY num DESC 
           LIMIT 1
        '''

output: 'patricknoll_import', 238998
    


### Most Popular Religion

In [49]:
query = '''SELECT nodes_tags.value, COUNT(*) as num 
           FROM nodes_tags 
           JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='place_of_worship') i 
           ON nodes_tags.id=i.id 
           WHERE nodes_tags.key='religion' 
           GROUP BY nodes_tags.value 
           ORDER BY num DESC
           LIMIT 1
        '''

output: 'christian', 14
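
To illustrate how the self-join restricts the count to places of worship, here is a small sketch that runs the same query shape against a made-up in-memory table. The rows are invented for the example, and node 4 is included to show that a religion tag on a node that is not a place of worship is ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO nodes_tags VALUES (?, ?, ?)",
    [
        (1, "amenity", "place_of_worship"), (1, "religion", "christian"),
        (2, "amenity", "place_of_worship"), (2, "religion", "christian"),
        (3, "amenity", "place_of_worship"), (3, "religion", "buddhist"),
        (4, "amenity", "school"), (4, "religion", "christian"),
    ],
)

religion_query = """
    SELECT nodes_tags.value, COUNT(*) AS num
    FROM nodes_tags
    JOIN (SELECT DISTINCT id FROM nodes_tags
          WHERE value = 'place_of_worship') i
      ON nodes_tags.id = i.id
    WHERE nodes_tags.key = 'religion'
    GROUP BY nodes_tags.value
    ORDER BY num DESC
    LIMIT 1
"""
# The subquery keeps ids 1-3, so only their religion tags are counted.
top_religion = conn.execute(religion_query).fetchone()
```

The subquery first collects the ids tagged as places of worship, and the join then keeps only the religion tags attached to those ids.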

### Most Popular Amenity

In [51]:
query = '''SELECT value, COUNT(distinct id) as num 
           FROM nodes_tags 
           WHERE key='amenity' 
           GROUP BY value 
           ORDER BY num DESC 
           LIMIT 1
        '''

output: 'restaurant', 149

## <center>Additional Ideas</center>

<ol>
   <li><b>Create a standard format for street addresses by autocorrecting abbreviations or by letting the user choose from a drop-down list of options when submitting an address.</b><br>
    The drawback of a stricter input format is that developing an updated website interface to support it would cost time and money.</li>
   <li><b>Allow data from outside sources to be imported to expand the information on amenities, attractions, etc.</b><br>
    One issue with allowing outside sources would be matching the imported records to existing data and making sure their format is in line with the current data.</li>
     
</ol>