# DAND P3
## Brendan Schell

Map Area: Toronto, ON, Canada




## 1. Problems Encountered in the Map

After parsing a small sample of the Toronto .osm file, I noticed the following problems with the street map data:

* Incorrect postal codes
* Use of addr:state instead of addr:province
* Use of an 'address' tag instead of 'addr:' sub-components
    
### Problem 1: Incorrect Postal Codes

After comparing the first 1000 "addr:postcode" values in way or node objects against a regular expression for Canadian postal codes (found [here](http://geekswithblogs.net/MainaD/archive/2007/12/03/117321.aspx)), I found that some of them were not valid Canadian postal codes.

>p = audit_postal_codes(filename) 

>for k in p.keys(): 
>>print k + ":" + str(len(p[k]))

Valid: 762

Invalid: 5

Of the 5 invalid postal codes observed for the first 100,000 nodes and ways parsed: 

L7J 1B9  : A valid postal code, but contains a trailing space

M36 0H7 : Contains 4 digits (a digit in the third position, where a letter is required) - an invalid postal code

l7a 3r9 : lower case letters

l6p 2r1 : lower case letters

M5T 1R9, M1P 2L7 : multiple postal codes
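The kind of check described above can be sketched as follows. This is a minimal illustration, not the project's `audit_postal_codes` function: the helper name `classify_postal_codes` is hypothetical, and the pattern is a simplified assumption (letter-digit-letter, space, digit-letter-digit) rather than the regular expression from the linked page.

```python
import re
from collections import defaultdict

# Simplified pattern (assumption): letter-digit-letter, one space, digit-letter-digit.
# It does not exclude the letters that Canada Post never uses.
POSTAL_RE = re.compile(r'^[A-Z]\d[A-Z] \d[A-Z]\d$')


def classify_postal_codes(codes):
    """Group postal code strings under 'Valid' or 'Invalid' (hypothetical helper)."""
    groups = defaultdict(list)
    for code in codes:
        key = 'Valid' if POSTAL_RE.match(code) else 'Invalid'
        groups[key].append(code)
    return dict(groups)


# One well-formed code plus four of the problem values listed above.
sample = ['M5T 1R9', 'L7J 1B9 ', 'M36 0H7', 'l7a 3r9', 'M5T 1R9, M1P 2L7']
result = classify_postal_codes(sample)
```

With a strict pattern like this one, the trailing space, the lower-case letters, and the comma-separated pair are all reported as invalid, which matches the categories observed in the audit.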

Using the small number of invalid postal codes above as a sample, I decided to convert all lower-case letters in postal codes to upper case. This was done using the following code:

> if v == 'addr:postcode':
>> node[v] = i.get('v').upper()
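As a standalone illustration, the upper-casing step could be wrapped in a small helper. The function name `clean_postcode` is hypothetical, and the call to `strip()` is an added assumption that would also handle the trailing-space case; the project code shown above only applies `upper()`.

```python
def clean_postcode(raw):
    """Upper-case a postal code; strip() is an extra step assumed here."""
    return raw.strip().upper()


fixed_case = clean_postcode('l7a 3r9')     # lower-case example from the audit
fixed_space = clean_postcode('L7J 1B9 ')   # trailing-space example from the audit
```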


### Problem 2: Use of addr:state for provinces

A second problem observed when going through the tag data was the use of the addr:state tag (seen below).

* addr:street
* addr:housenumber
* addr:city
* addr:state
* addr:country
* addr:postcode
* addr:housename
* addr:province
* address
* addr:building
* addr:unit

After parsing the first 10 uses of this tag, I found that all of them were populated with the value "ON", which is the correct value for the province. Since Canada has provinces rather than states, and there are also addr:province tags, the two should be consolidated. This was corrected by making an exception for state tags on import so that they become province tags instead.
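The consolidation described above amounts to a key mapping applied on import. A minimal sketch, assuming a hypothetical helper named `normalize_address_key` and the field naming used later in this report (the `addr:` prefix is dropped):

```python
def normalize_address_key(key):
    """Map an OSM tag key to the field name used on import (hypothetical helper)."""
    if key == 'addr:state':
        # Canada has provinces, so state tags are folded into 'province'.
        return 'province'
    return key.replace('addr:', '')


tags = {'addr:state': 'ON', 'addr:city': 'Mississauga'}
cleaned = {normalize_address_key(k): v for k, v in tags.items()}
```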


### Problem 3: Use of 'address' tag instead of 'addr:' fields

After collecting the first few instances of the address tag, these were the results:

* 1386 Queen Street West
* 1303 Queen Street West
* 309 Augusta Avenue
* 56 Spadina Road
* 45 Main Street East
* 2840 Dundas Street West
* 1996 Itabashi Way
* 975 Cosburn Avenue
* 131 Chisholm Avenue
* 3733 Ransomville Road, Ransomville, NY 14131

Of the above, the majority are valid addresses, but they are not separated according to the addr: specifications. To correct this, any address that begins with a number and ends with a valid street name is converted into addr:housenumber and addr:street tags.

The following code was used to solve the three observed issues on import to JSON:

> if v == 'addr:postcode':
>> node[v.replace('addr:','')] = i.get('v').upper()
>elif v == 'address':
>>val = i.get('v')  # the tag's value; v only holds the key name
>>if val[0:val.find(" ")].isdigit():
>>>house_num = val[0:val.find(" ")]
>>>temp = fix_address(val[val.find(" ")+1:])
>>>if temp != -1:
>>>>node['housenumber'] = house_num
>>>>node['streetname'] = temp
>elif v == 'addr:state':
>>node['province'] = i.get('v')
>else:
>>address[v.replace('addr:','')] = i.get('v')
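The address-splitting branch above relies on `fix_address`, which is not shown in this report. The sketch below is one possible stand-in, not the project's implementation: the name `split_address`, the `STREET_TYPES` whitelist, and the handling of a trailing direction word are all assumptions made for illustration.

```python
import re

# Hypothetical whitelists; the street types accepted by the project are not listed.
STREET_TYPES = {'Street', 'Avenue', 'Road', 'Way', 'Drive', 'Boulevard'}
DIRECTIONS = {'North', 'South', 'East', 'West'}

# A leading run of digits (house number), whitespace, then the rest of the string.
ADDRESS_RE = re.compile(r'^(\d+)\s+(.+)$')


def split_address(value):
    """Return (housenumber, street), or None if the value does not fit."""
    m = ADDRESS_RE.match(value.strip())
    if m is None:
        return None
    house_num, street = m.group(1), m.group(2)
    words = street.split()
    # Ignore a trailing direction such as 'West' when checking the street type.
    if words and words[-1] in DIRECTIONS:
        words = words[:-1]
    if not words or words[-1] not in STREET_TYPES:
        return None
    return house_num, street


ok = split_address('1386 Queen Street West')
rejected = split_address('3733 Ransomville Road, Ransomville, NY 14131')
```

Under these assumptions the Toronto-style addresses in the list split cleanly, while the Ransomville entry is rejected because it carries a city, state, and ZIP code after the street name.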


## 2. Data Overview

This section contains basic statistics about the dataset and the MongoDB queries used to gather them.

In this section, coll is the collection obtained with:

>db = client['street_data']

>coll = db.P3

#### File Sizes

OSM: 1.2 GB

JSON: 1.3 GB

#### Number of Documents

Result: 6081859

Query:
>coll.find().count()

#### Number of Ways
Result: 646403

Query:
>coll.find({"type":"way"}).count()

#### Number of Nodes
Result: 5435456

Query:
>coll.find({"type":"node"}).count()

#### Number of Unique Users
Result: 1663

Query:
>len(coll.distinct("created.user"))

#### Top five 'city' field values
Result: 
>{u'count': 5653353, u'_id': None}

>{u'count': 112419, u'_id': u'City of Toronto'}

>{u'count': 40241, u'_id': u'City of Hamilton'}

>{u'count': 36869, u'_id': u'Mississauga'}

>{u'count': 25997, u'_id': u'City of Brampton'}

Query:
>cursor = coll.aggregate(
>>    [
>>>        {'$group': {'_id': '$address.city', 'count' : {'$sum' : 1}}},
>>>        {'$sort' : {'count' : -1}},
>>>        {'$limit' : 5}
>>    ])

>for doc in cursor:
>>    print doc
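To make clear what the `$group`, `$sort`, and `$limit` stages compute, here is a pure-Python equivalent that runs without MongoDB. The helper `top_values` and the sample documents are hypothetical; the sketch only mirrors the pipeline's logic, including the fact that documents without the field are grouped under `None`.

```python
from collections import Counter


def top_values(docs, path, limit=5):
    """Count the values found at a dotted path and return the most common ones."""
    def lookup(doc):
        cur = doc
        for part in path.split('.'):
            if not isinstance(cur, dict) or part not in cur:
                return None  # a missing field is grouped under None
            cur = cur[part]
        return cur

    counts = Counter(lookup(d) for d in docs)
    return [{'_id': k, 'count': c} for k, c in counts.most_common(limit)]


# Made-up documents: three without an address, three with a city.
docs = (
    [{'type': 'node'}] * 3
    + [{'address': {'city': 'City of Toronto'}}] * 2
    + [{'address': {'city': 'Mississauga'}}]
)
top = top_values(docs, 'address.city')
```

This also explains the first row of the result above: most documents carry no address at all, so `None` dominates the counts.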





## 3. Additional Ideas

### Improvement Suggestion: Iterative input of postal codes

By looking at the postal code data in Section 1, I observed that many addresses were missing postal codes.

The number of addresses within the City of Toronto, determined using query 1 below, is 112,419. The number of these that include a postal code, given by query 2, is 50. It is clear from these queries that very few addresses in the Open Street Map data contain postal code information.

One way to improve the data would be to retrieve postal codes for these addresses from Google Maps or another data source and update the Open Street Map data accordingly. This could be done using a Python script similar to the one used for this project.

>1) coll.find({'address.city': 'City of Toronto'}).count()

>2) coll.find({'address.city': 'City of Toronto', 'postcode' : {'$exists':True}}).count()
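As a quick worked calculation, the two counts reported above translate into a coverage figure as follows. The variable names are illustrative, and the numbers are simply the results of query 1 and query 2.

```python
toronto_addresses = 112419   # result of query 1
with_postcode = 50           # result of query 2

# Share of City of Toronto addresses that carry a postal code, in percent.
coverage_pct = 100.0 * with_postcode / toronto_addresses
missing = toronto_addresses - with_postcode
```

This works out to roughly 0.04% coverage, leaving 112,369 addresses without a postal code.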

### Additional data exploration using MongoDB queries

#### Determining use of Gluten Free designation in Open Street Map Data

>cursor = coll.aggregate(

>>    [

>>>        {'$group': {'_id': '$diet:gluten_free', 'count' : {'$sum' : 1}}},

>>>        {'$sort' : {'count' : -1}},

>>>        {'$limit' : 5}
>>    ]) 

>for i in cursor:

>>    print i

>{u'count': 6081857, u'_id': None}

>{u'count': 2, u'_id': u'yes'}

Result: Only used for 2 entries.

#### Top five 'street' field values

>cursor = coll.aggregate(
>>    [
>>>        {'$group': {'_id': '$address.street', 'count' : {'$sum' : 1}}},
>>>        {'$sort' : {'count' : -1}},
>>>        {'$limit' : 5}
>>    ])

>for doc in cursor:
>>    print doc



>{u'count': 5650184, u'_id': None}

>{u'count': 2208, u'_id': u'Yonge Street'}

>{u'count': 1020, u'_id': u'Bathurst Street'}

>{u'count': 999, u'_id': u'Dundas Street West'}

>{u'count': 919, u'_id': u'Bloor Street West'}

#### Number of entries with wheelchair descriptions:

>cnt = -1 #account for 'None' value

>cursor = coll.aggregate(

>>    [

>>>        {'$group': {'_id': '$wheelchair:description', 'count' : {'$sum' : 1}}},

>>>        {'$sort' : {'count' : -1}},

>>>        {'$limit' : 5}

>>    ]) 

>for i in cursor:

>>    cnt += 1

>print cnt

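Result: 4

Note that this counts distinct description values among the groups returned, not individual entries, and the '$limit' of 5 caps the number of groups, so the figure is a lower bound on the number of distinct descriptions.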

### Conclusion

In conclusion, the data was successfully cleaned, migrated into MongoDB, and queried to determine some interesting information about the city of Toronto. Based on the small number of tags containing non-address and non-user information, the Open Street Map data for Toronto does not appear to be well populated with accessory information. Much could be done using available information from APIs to improve the level of detail in this Open Street Map data.

