In [14]:
import pymongo
import pprint

In [15]:
client = pymongo.MongoClient("mongodb+srv://public_access:GoData2017!@analyze-openstreetmap-vhq3i.mongodb.net/map_lic")
db_collection = client.map.lic

### 1. Problems Encountered in the Map

Ideally, main streets and roads should be identified by the `highway` map attribute having one of the following values: `primary`, `secondary`, or `residential`. The name of the street can then be identified using the following map attributes:
* `name`
* `tiger:name_base`
* `tiger:name_type`

However, a sample of the Long Island City map data shows that not all nodes have attributes correctly transformed based on the [Tiger to OSM Attribute Map](http://wiki.openstreetmap.org/wiki/TIGER_to_OSM_Attribute_Map). 

##### Streets with Incorrect Highway Attribute

For example, way id [46141644](http://www.openstreetmap.org/way/46141644) should be mapped to OSM attribute `:highway=>:motorway_link` but in the OSM map data it has been incorrectly mapped to `:highway=>:primary`. 

To account for these data quality issues, additional filtering is performed to check whether `tiger:cfcc` is in the range between `A21` and `A49`. If a way is tagged as `primary`, `secondary`, or `residential` but its `tiger:cfcc` is not in that range, the `highway` attribute is updated based on a manual lookup prepared beforehand.
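
The check described above can be sketched in a few lines of Python. This is only an illustration, not the notebook's actual cleaning code: the `id` field, the `MANUAL_HIGHWAY_FIXES` table, and the `corrected_highway` helper are assumptions introduced here, while the nested `tiger` / `tiger:cfcc` layout follows the query used later in this notebook.

```python
import re

# Matches TIGER CFCC codes A21 through A49, the range named in the text.
CFCC_ROAD_RANGE = re.compile(r"^A(2[1-9]|[34][0-9])$")

# Hypothetical manual lookup: way id -> corrected highway value.
# The single entry mirrors the example way discussed above.
MANUAL_HIGHWAY_FIXES = {46141644: "motorway_link"}


def corrected_highway(way):
    """Return the highway value that should be stored for a way document."""
    highway = way.get("highway")
    cfcc = way.get("tiger", {}).get("tiger:cfcc", "")
    is_street = highway in ("primary", "secondary", "residential")
    if is_street and not CFCC_ROAD_RANGE.match(cfcc):
        # Tagged as a street but the CFCC code disagrees: consult the
        # manual lookup and keep the original value if no fix is recorded.
        return MANUAL_HIGHWAY_FIXES.get(way.get("id"), highway)
    return highway
```

Ways whose `tiger:cfcc` is inside the range, or whose `highway` value is not one of the three street types, pass through unchanged.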

##### Streets with No Name

### 2. Data Overview

##### File Sizes

Since all the OpenStreetMap data is downloaded programmatically via the Python wrapper for the Overpass API and then uploaded directly to MongoDB in the cloud, no file is exchanged whose size could be measured. However, for the chosen map area, the amount of data transmitted via the Overpass API is about 60 MB. This was determined by running the same query, shown below, in [Overpass Turbo](https://overpass-turbo.eu/):

```
(node(40.71,-73.97,40.78,-73.92);<;);
out meta;
```
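
For reference, the query above can be generated from a bounding box. The snippet below is a minimal sketch that only builds the query string; `build_overpass_query` is a hypothetical helper, and the call that sends the query is left out because the text does not name the specific Python wrapper used.

```python
def build_overpass_query(south, west, north, east):
    """Build the Overpass QL query for a (south, west, north, east) bounding box."""
    bbox = ",".join(str(value) for value in (south, west, north, east))
    # "<;" recurses up from the selected nodes to the ways and relations using them.
    return "(node({});<;);\nout meta;".format(bbox)


query = build_overpass_query(40.71, -73.97, 40.78, -73.92)
print(query)
```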

##### Number of Documents

In [16]:
db_collection.count_documents({})

241129

##### Number of Nodes and Ways

In [17]:
result = [obj for obj in db_collection.aggregate([
    {'$group': {
        '_id': '$document_type',
        'count': {'$sum': 1}}}
])]
pprint.pprint(result)

[{'_id': 'way', 'count': 36544}, {'_id': 'node', 'count': 204585}]


##### Number of Unique Users

In [18]:
result = [obj for obj in db_collection.aggregate([
    {'$group': {
        '_id': '$change_info.user',
        'count': {'$sum': 1}}},
    {'$count': 'num_of_users'}
])]
pprint.pprint(result)

[{'num_of_users': 674}]


##### Type of Amenities in the Map Area

In [19]:
result = [obj for obj in db_collection.aggregate([
    {'$match': {'amenity': {'$exists': True}}},
    {'$group': {
        '_id': '$amenity',
        'count': {'$sum': 1}}} ,
    {'$sort': {'count': -1}}
])]
pprint.pprint(result)

[{'_id': 'bicycle_parking', 'count': 539},
 {'_id': 'restaurant', 'count': 407},
 {'_id': 'parking', 'count': 164},
 {'_id': 'cafe', 'count': 134},
 {'_id': 'place_of_worship', 'count': 108},
 {'_id': 'embassy', 'count': 105},
 {'_id': 'bench', 'count': 104},
 {'_id': 'bar', 'count': 98},
 {'_id': 'school', 'count': 94},
 {'_id': 'fast_food', 'count': 69},
 {'_id': 'bicycle_rental', 'count': 62},
 {'_id': 'post_box', 'count': 57},
 {'_id': 'bank', 'count': 53},
 {'_id': 'drinking_water', 'count': 40},
 {'_id': 'pharmacy', 'count': 36},
 {'_id': 'fuel', 'count': 28},
 {'_id': 'toilets', 'count': 28},
 {'_id': 'pub', 'count': 22},
 {'_id': 'fire_station', 'count': 19},
 {'_id': 'hospital', 'count': 17},
 {'_id': 'post_office', 'count': 15},
 {'_id': 'atm', 'count': 13},
 {'_id': 'library', 'count': 13},
 {'_id': 'dentist', 'count': 13},
 {'_id': 'nightclub', 'count': 12},
 {'_id': 'doctors', 'count': 9},
 {'_id': 'parking_space', 'count': 8},
 {'_id': 'arts_centre', 'count': 8},
 {'_id':

##### Street with Most Restaurants

In [20]:
result = [obj for obj in db_collection.aggregate([
    {'$match': {'amenity': 'restaurant', 'address.street': {'$exists': True}}},
    {'$group': {
        '_id': '$address.street',
        'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 1}
])]
pprint.pprint(result)

[{'_id': 'Bedford Avenue', 'count': 12}]


##### Streets with Bicycle Lanes

In [22]:
result = [obj for obj in db_collection.aggregate([
        {'$match': {
            'document_type': 'way',
            'highway': {'$in': ['primary', 'secondary', 'residential']},
            'tiger.tiger:cfcc': {'$regex': '^A(4[01]|[23][1-9])$'},
            'name': {'$exists': True},
            'bicycle': {'$in': ['yes','designated']}}},
        {'$project': {'name': 1}},
        {'$group': {'_id': '$name'}}
])]
pprint.pprint(result)

[{'_id': 'York Avenue'},
 {'_id': 'Manhattan Avenue'},
 {'_id': 'Skillman Avenue'},
 {'_id': 'Hoyt Avenue North'},
 {'_id': 'Quay Street'},
 {'_id': 'Shore Boulevard'},
 {'_id': 'Center Boulevard'},
 {'_id': 'West Street'},
 {'_id': 'Grand Street'},
 {'_id': 'East 48th Street'},
 {'_id': '1st Avenue'},
 {'_id': '44th Drive'},
 {'_id': 'Commercial Street'},
 {'_id': 'Driggs Avenue'},
 {'_id': 'South 3rd Street'},
 {'_id': 'East 51st Street'},
 {'_id': '11th Street'},
 {'_id': '12th Street'},
 {'_id': 'Borinquen Place'},
 {'_id': '9th Street'},
 {'_id': 'Provost Street'},
 {'_id': 'Berry Street'},
 {'_id': 'Franklin Street'},
 {'_id': 'Kent Avenue'},
 {'_id': 'Ash Street'},
 {'_id': '29th Street'},
 {'_id': 'Leonard Street'},
 {'_id': 'Greenpoint Avenue'},
 {'_id': '79th Street Transverse Road'},
 {'_id': '8th Street'},
 {'_id': 'Calyer Street'},
 {'_id': 'Vernon Boulevard'},
 {'_id': 'South 4th Street'},
 {'_id': '43rd Avenue'},
 {'_id': '28th Street'},
 {'_id': '14th Street'},
 {'_id':

### 3. Additional Ideas

##### Cross Referencing between Streets and Buildings

Given the current structure and content of OSM data, it is not possible to link `nodes` corresponding to amenities and shops (e.g. restaurants) to the `ways` denoting the streets on which they are located. This is because the `way.nodes` child attribute for a given `way` does not contain any `node` corresponding to a building. In addition, not all `nodes` are tagged with `address:street` attributes. A potential improvement to the existing OSM data structure would be to include additional IDs that enable cross-referencing between nodes and ways.
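
One way to check this observation on a sample of documents is sketched below. It works on plain Python dictionaries rather than on the live collection, and the field names `id` and `nodes` are assumptions about how the documents are stored; `document_type` and `amenity` follow the queries used earlier in this notebook.

```python
def amenity_nodes_on_ways(documents):
    """Split amenity node ids into (referenced by some way, not referenced)."""
    referenced = set()
    for doc in documents:
        if doc.get("document_type") == "way":
            # Collect every node id that any way lists as a member.
            referenced.update(doc.get("nodes", []))
    amenity_ids = {
        doc["id"]
        for doc in documents
        if doc.get("document_type") == "node" and "amenity" in doc
    }
    return amenity_ids & referenced, amenity_ids - referenced
```

If the first returned set is empty for the sample, no amenity node in it can be tied to a street through `way.nodes` alone.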


##### Algorithm for Populating Missing Street Names

In the chosen map area, there are 17 streets with the `name` attribute missing. One potential way to resolve this is to use the `way.nodes` data to check whether any of the `node` IDs also appear in other `ways`, and then use an algorithm to determine whether the `name` attribute of adjacent `ways` can be used to infer the missing street name. The algorithm would need to differentiate between connected streets and intersections. For connected streets, it seems sensible to assume that the street with the missing name has the same name as the two streets connected to it, provided the streets on both ends have the same name. However, if the two adjacent streets have different names, or if the connection is an intersection, inferring the street name is not as straightforward and can produce incorrect street names.
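
The connected-street case could look roughly like the sketch below. It is one possible reading of the idea, not a tested solution: the `nodes` field name and the `infer_missing_name` helper are assumptions, and a way is treated as a continuation only when it shares an end node with the unnamed way, while a shared interior node is treated as an intersection and ignored.

```python
def infer_missing_name(unnamed, named_ways):
    """Return a name for `unnamed` if both of its ends continue into ways
    carrying the same single name; otherwise return None."""

    def names_continuing_from(node_id):
        # Only ways that start or end at node_id count as continuations.
        return {
            way["name"]
            for way in named_ways
            if way["nodes"] and node_id in (way["nodes"][0], way["nodes"][-1])
        }

    start_names = names_continuing_from(unnamed["nodes"][0])
    end_names = names_continuing_from(unnamed["nodes"][-1])
    common = start_names & end_names
    # Accept only the unambiguous case: one candidate name per end, and they agree.
    if len(start_names) == 1 and len(end_names) == 1 and common:
        return common.pop()
    return None
```

The rule is deliberately conservative: when a cross street is split at the shared node, that end has more than one candidate name and the function returns `None` rather than guessing.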