# Minneapolis OpenStreetMap Data Wrangling

##### Grayson Ricketts
###### Map Area: Minneapolis, MN, United States

### 1. Problems Encountered

After downloading the dataset, I audited it with audit.py and found some immediate problems. More problems surfaced while I processed and cleaned the data. The problems I encountered were:
* Much of the data was not in my target area. The extract included the suburbs and scattered places as far away as Wisconsin, when I only wanted information about Minneapolis.
* Street names were abbreviated heavily and inconsistently (e.g. 'st.' or 'st' rather than 'Street').
* Misspellings and inconsistent data entry were common (e.g. '55044, MN' or '55044-4973' instead of '55044').


#### Nodes and Ways not in Minneapolis
The initial data file from the OpenStreetMap project covers a large area around Minneapolis-St. Paul. However, my goal was to look at points within Minneapolis. Three kinds of evidence showed that the data set contained points outside Minneapolis:
   * Some points had a latitude and longitude outside a rectangle containing the Minneapolis area. I found the latitude and longitude of the corners of the Minneapolis area; everything outside the rectangle formed by those points is not in Minneapolis.
   * Some entries included city names that were not Minneapolis or even close to Minneapolis (e.g. information about the town I live in was included in the data set even though it is 20 miles away from the city).
   * Some entries had zip codes that were not Minneapolis zip codes.

Any entry or point that one of these checks placed outside Minneapolis was excluded from the final dataset.
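
The three checks above can be sketched as a single predicate. This is a minimal sketch, not the code from process.py: the bounding-box values are rough placeholders, `MPLS_ZIPS` is only a small illustrative subset of Minneapolis zip codes, and the `in_minneapolis` helper and the `address.city` field are assumptions made for illustration (`pos` and `address.postcode` are the fields queried later in this report).

```python
# Approximate bounding box around Minneapolis (assumed values for illustration).
MIN_LAT, MAX_LAT = 44.89, 45.05
MIN_LON, MAX_LON = -93.33, -93.19

# Illustrative subset of Minneapolis zip codes; a real run needs the full list.
MPLS_ZIPS = {'55401', '55402', '55403', '55404', '55405'}


def in_minneapolis(doc):
    """Return True unless some field shows the element lies outside Minneapolis."""
    # Check 1: position, assumed to be stored as [latitude, longitude].
    pos = doc.get('pos')
    if pos is not None:
        lat, lon = pos
        if not (MIN_LAT <= lat <= MAX_LAT and MIN_LON <= lon <= MAX_LON):
            return False

    address = doc.get('address', {})

    # Check 2: an explicit city name other than Minneapolis.
    city = address.get('city')
    if city is not None and city.strip().lower() != 'minneapolis':
        return False

    # Check 3: a zip code whose first five characters are not in the known set.
    postcode = address.get('postcode')
    if postcode is not None and postcode[:5] not in MPLS_ZIPS:
        return False

    return True
```

An element with none of these fields passes the predicate, since nothing shows that it lies outside the city.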

#### Abbreviated and Inconsistent Street Names
Certain words were frequently and inconsistently abbreviated in street names. For instance, users often wrote 'st.' or 'st' instead of 'Street', or 'S' instead of 'South'.
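
One way to make such names uniform is to expand each abbreviated word from a lookup table. The sketch below is an assumption about how this could be done, not the project's actual cleaning code: the `ABBREVIATIONS` mapping and the `clean_street_name` helper are hypothetical and cover only a few cases.

```python
# Illustrative mapping; the mapping actually used in the project is not shown here.
ABBREVIATIONS = {
    'st': 'Street',
    'ave': 'Avenue',
    'blvd': 'Boulevard',
    'rd': 'Road',
    'n': 'North',
    's': 'South',
    'e': 'East',
    'w': 'West',
}


def clean_street_name(name):
    """Expand each abbreviated word in a street name, ignoring a trailing period."""
    words = []
    for word in name.split():
        key = word.rstrip('.').lower()
        # Words without an entry in the mapping are kept unchanged.
        words.append(ABBREVIATIONS.get(key, word))
    return ' '.join(words)


print(clean_street_name('W Lake st.'))
```

A word-by-word lookup like this is deliberately simple. It would wrongly expand 'St.' where it means 'Saint', so names of that kind would need a separate rule.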

#### Poorly Formatted Additional Information
The data was inconsistently formatted: some nodes had characters that made parsing troublesome, the state in the address field was spelled inconsistently, and zip codes were not uniform.

   * General misspellings and other errors were corrected along the way. For instance, 'place of worship' was misspelled a few times in the amenity tag, so it was replaced with the correct spelling.
   * Some fields contained characters that were not standard or should not have been included (e.g. a ':', which denotes a subfield, in places where subfields were not allowed). Those fields were not included.
   * The state was spelled inconsistently. Some people used the abbreviation (MN), others spelled out 'Minnesota', others spelled Minnesota incorrectly, and some points had a state of Wisconsin or WI. If the state field appeared to be 'Minnesota' or 'MN', the entry was included and the field was changed to a uniform 'MN'.
   * Post codes had additional text and sometimes had the state abbreviation attached (e.g. 55037-MN). The field was cleaned so that it contains only the 5-digit zip code.
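
The state and post code clean-up described above could look roughly like the following. This is a hedged sketch rather than the code in process.py: `ZIP_RE`, `clean_postcode`, and `clean_state` are hypothetical names, and the state check only accepts exact spellings, so misspelled variants of 'Minnesota' would need extra handling.

```python
import re

# A run of exactly five digits, e.g. '55044' in '55044-4973' or '55037-MN'.
ZIP_RE = re.compile(r'(?<!\d)\d{5}(?!\d)')


def clean_postcode(raw):
    """Return the first 5-digit zip code found in raw, or None if there is none."""
    match = ZIP_RE.search(raw)
    return match.group() if match else None


def clean_state(raw):
    """Return 'MN' for 'MN' or 'Minnesota' (any case), otherwise None."""
    value = raw.strip().lower()
    return 'MN' if value in ('mn', 'minnesota') else None


for raw in ('55044, MN', '55044-4973', '55037-MN'):
    print(clean_postcode(raw))
```

Returning `None` for values that cannot be cleaned lets the caller decide whether to drop the field or the whole entry.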
   


### 2. Data Overview

File|Size (MB)
-|-
OSM|672
JSON|142
MongoDB|55.8


```python
# %load query.py
"""File to make queries on a MongoDB collection that is formatted by process.py.

Attributes:
    STREET_TYPES_RE (regex): Compiled regular expression to get the last word in
        a street name, i.e. the street type.
"""
import re

# Matches the last whitespace-delimited word of a street name (the street type).
STREET_TYPES_RE = re.compile(r'\b\S+\.?$', re.IGNORECASE)


class Query(object):
    """Class to make queries to a database."""

    def __init__(self, collection):
        """Constructor taking in a collection"""
        self.collection = collection

    def query_gen_info(self):
        """Runs the general queries (counts, location, amenities, users) against the collection."""
        self.count_elements()
        self.query_location()
        self.query_amenity()
        self.query_users()

        return

    def count_elements(self):
        """Prints the number of nodes and ways in the collection."""
        count = self.collection.count()
        ways_count = self.collection.find({'type': 'way'}).count()
        ways_percent = int(ways_count * 100 / count)

        print 'Number of elements: {0}'.format(count)
        print 'Number of nodes: {0}  ({1}%)'.format(count - ways_count, 100 - ways_percent)
        print 'Number of ways:  {0}  ({1}%)'.format(ways_count, ways_percent)

        print '\n\n'

    def query_amenity(self):
        """Lists the 25 most common amenities in decreasing order of occurrence."""
        query = [{'$match' : {'amenity' : {'$exists' : True}}},
                 {'$group' : {'_id' : '$amenity', 'count' : {'$sum' : 1}}},
                 {'$sort' : {'count' : -1}},
                 {'$limit' : 25}]

        amenity_count = self.collection.aggregate(query)

        for elem in amenity_count:
            print '{0} : {1}'.format(elem['_id'], elem['count'])

        print '\n\n'

    def query_postal_code(self):
        """Prints the distinct postal codes and how many there are."""
        postal_codes = self.collection.distinct('address.postcode')
        postal_codes.sort()

        for elem in postal_codes:
            print '{0}'.format(elem)

        print 'Number of distinct post codes: {0}\n\n'.format(len(postal_codes))

    def query_street_type(self):
        """Prints the distinct street name endings."""
        streets = self.collection.distinct('address.street')
        unique_endings = set()

        for street in streets:
            match = STREET_TYPES_RE.search(street)

            if match:
                unique_endings.add(match.group())

        print unique_endings
        print "\n\n"

    def query_users(self):
        """Query information about the users.

        Gets information about the total number of users and the users with 5 or fewer posts.
        Then lists the most prominent users in decreasing order along with the number of posts
        and the percentage they contributed to the project.
        """
        user_query = [{'$group' : {'_id' : '$created.user', 'count' : {'$sum' : 1}}},
                      {'$sort' : {'count' : -1}},
                      {'$limit' : 25}]
        # '$lte' also counts users with exactly 5 posts; grouping on the constant
        # '_id' string collapses the matches into a single count document.
        lt_query = [{'$group' : {'_id' : '$created.user', 'count' : {'$sum' : 1}}},
                    {'$match' : {'count' : {'$lte' : 5}}},
                    {'$group' : {'_id' : '_id', 'count' : {'$sum' : 1}}}]

        users = self.collection.aggregate(user_query)
        count = len(self.collection.distinct('created.user'))
        collection_count = self.collection.count()
        min_user_count = self.collection.aggregate(lt_query).next()['count']

        print 'Users: {0}'.format(count)
        print 'Users with less than 5 posts: {0} ({1:.2%})\n'.format(min_user_count,
                                                                     float(min_user_count) / count)

        template = u'{USER:17}|{POSTS:9}|{PERCENT:>7}'
        print template.format(USER='User', POSTS='Posts (#)', PERCENT='Percent')
        for elem in users:
            dec_format = '{:.2g}%'.format(float(elem['count']) * 100 / collection_count)
            print template.format(USER=elem['_id'], POSTS=elem['count'], PERCENT=dec_format)

        print '\n\n'

    def query_location(self):
        """Get information about how many nodes have location data."""
        # Group by position first, so the final count is the number of distinct positions.
        query = [{'$match' : {'pos' : {'$exists' : True}}},
                 {'$group' : {'_id' : '$pos', 'count' : {'$sum' : 1}}},
                 {'$group' : {'_id' : 'id', 'count' : {'$sum' : 1}}}]

        loc_count = self.collection.aggregate(query).next()['count']
        loc_percent = float(loc_count) * 100 / self.collection.count()

        print 'Nodes or ways with location data: {0}  ({1:.2g}%)\n\n'.format(loc_count, loc_percent)

from pymongo import MongoClient

q = Query(MongoClient('localhost', 27017).udacity.msp_osm)
q.query_gen_info()
```

```
Number of elements: 531738
Number of nodes: 324674  (62%)
Number of ways:  207064  (38%)

Nodes or ways with location data: 324469  (61%)

parking : 7147
restaurant : 394
school : 364
place_of_worship : 341
fast_food : 263
fuel : 243
bank : 129
cafe : 111
bench : 109
pub : 74
shelter : 67
bicycle_parking : 61
swimming_pool : 60
bar : 49
public_building : 49
post_box : 45
pharmacy : 45
car_wash : 36
fire_station : 34
post_office : 32
university : 30
library : 29
hospital : 25
theatre : 25
toilets : 24

Users: 927
Users with less than 5 posts: 442 (47.68%)

User             |Posts (#)|Percent
Mulad            |   160757|    30%
iandees          |    83189|    16%
DavidF           |    44199|   8.3%
stucki1          |    42735|     8%
sota767          |    27186|   5.1%
neuhausr         |    21362|     4%
jumbanho         |    18361|   3.5%
nickrosencrans   |    11255|   2.1%
woodpeck_fixbot  |     9734|   1.8%
Mink             |     8193|   1.5%
PrometheusAvV    |     7817|   1.5%
```



### 3. Additional Ideas

#### Public data
The original dataset included information from the city of Minneapolis. Minneapolis has other open data projects that are publicly available, and it would be nice if more of this data were introduced in a standard way. Right now, some data sets aren't included and others are oddly formatted (e.g. Metcouncil data). Since many of the points in OpenStreetMap and the public data most likely overlap, it would be interesting if contributors could add or confirm the public data about a point while they enter it.



### Conclusion

I started with a large dataset that contained lots of raw information about points in and around Minneapolis. I then cleaned and formatted the data set for easy querying. Though the data from the OpenStreetMap project was overly inclusive, the resulting data was interesting and contained special values (Metcouncil data) that can be included later to make this data set even better.