# OpenStreetMap Data Case Study

## Map Area

New York, NY, United States

- http://www.openstreetmap.org/relation/175905

- https://mapzen.com/data/metro-extracts/metro/new-york_new-york/

I'm Korean and live in South Korea, and I have never been to the US. I have heard about New York and Manhattan many times, and I would like to visit if I get the opportunity. That is why I chose the New York part of the map for this OpenStreetMap data case study.

## Problems in the map that I decided to solve

After extracting a small sample of the New York area and building a database from it, I noticed many problems in the data. This section lists the problems I decided to solve and discusses each one.

- Inconsistent postal code formats ("NY 10533", "08854-8009", "10314")

- Incorrect postal codes: New York ZIP codes begin with "005", "063", or "1", but some values start with "07" or "08", and some are neither a 5-digit ZIP code nor a ZIP+4 code.

- Different key names for the same kind of value ("postcode" and "postal_code", "fax" and "Fax")

- Inconsistent phone and fax number formats ("(718) 623-9065", "+1 212 690 4000", "718-733-6813", etc.)

- Inconsistent website address formats ("bk.com", "http://www.nycgovparks.org/parks/Q048/", "www.skyviewonthehudson.com/")

### Postal codes

```python
# Connect to osm_samples.db using sqlite3.
import sqlite3

db = sqlite3.connect("C:\\sqlite_windows\\osm_samples.db")
c = db.cursor()
```

Reference for New York ZIP codes: https://www.unitedstateszipcodes.org/ny/

### Same value type but different key names

### Inconsistent phone and fax numbers

### Inconsistent website addresses

## How I solved the above problems

### Postal codes
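
One possible way to handle the postal code problems listed earlier is to strip any state prefix and ZIP+4 suffix, then keep only values whose prefix matches the New York ranges named in the problem list. This is a minimal sketch, not necessarily the cleaning that was applied in this project: the function name `clean_postcode` and the choice to return `None` for rejected values are assumptions.

```python
import re

# Prefixes treated as valid for New York, taken from the problem list.
NY_PREFIXES = ("005", "063", "1")

# Optional two-letter state prefix, five digits, optional ZIP+4 suffix.
POSTCODE_RE = re.compile(r"^(?:[A-Za-z]{2}\s*)?(\d{5})(?:-\d{4})?$")


def clean_postcode(raw):
    """Return the 5-digit ZIP code, or None if it is not a New York ZIP (hypothetical helper)."""
    match = POSTCODE_RE.match(raw.strip())
    if match is None:
        return None
    zip5 = match.group(1)
    return zip5 if zip5.startswith(NY_PREFIXES) else None


for value in ["NY 10533", "08854-8009", "10314"]:
    print(value, "->", clean_postcode(value))
```

Returning `None` leaves the decision of whether to drop or flag the tag to the caller.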

### Same value type but different key names
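
Variant key names such as "postal_code" and "Fax" can be mapped to one canonical key before the tags are written to the database. The sketch below assumes a small alias table; `KEY_ALIASES` and `normalize_key` are hypothetical names, and the key/value pairs are only illustrative.

```python
# Hypothetical alias table: variant key -> canonical key.
KEY_ALIASES = {"postal_code": "postcode", "Fax": "fax"}


def normalize_key(key):
    """Return the canonical key name, leaving unknown keys unchanged."""
    return KEY_ALIASES.get(key, key)


# Illustrative tags; the values are borrowed from the examples in the problem list.
tags = [("postal_code", "10314"), ("Fax", "718-733-6813"), ("phone", "+1 212 690 4000")]
normalized = [(normalize_key(key), value) for key, value in tags]
print(normalized)
```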

### Inconsistent phone and fax numbers
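
Phone and fax numbers can be brought to a single format by keeping only the digits and re-formatting them. The sketch below assumes 10-digit US numbers and picks "+1 NNN NNN NNNN" as the target format; both the target format and the helper name `normalize_phone` are assumptions rather than the project's confirmed choice.

```python
import re


def normalize_phone(raw):
    """Return the number as '+1 NNN NNN NNNN', or None if it is not a 10-digit US number."""
    digits = re.sub(r"\D", "", raw)  # drop everything except digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the US country code
    if len(digits) != 10:
        return None
    return "+1 {} {} {}".format(digits[:3], digits[3:6], digits[6:])


for value in ["(718) 623-9065", "+1 212 690 4000", "718-733-6813"]:
    print(value, "->", normalize_phone(value))
```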

### Inconsistent website addresses
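
Website values that lack a scheme can be given one so that every address has the same shape. This is a minimal sketch: defaulting to `http://` and the helper name `normalize_website` are assumptions, and values that already carry a scheme are left untouched.

```python
def normalize_website(raw):
    """Prefix 'http://' when the address has no http/https scheme (hypothetical helper)."""
    url = raw.strip()
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    return url


for value in ["bk.com", "http://www.nycgovparks.org/parks/Q048/", "www.skyviewonthehudson.com/"]:
    print(value, "->", normalize_website(value))
```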

## Additional problems in the map

### Maxspeed

### Opening_hours

### Same value type but different key names

## Ideas for improving the data

## Data overview and additional data exploration

### File sizes

```python
import os

osm_path = 'C:\\projects\\New_York_OSM\\New_York_sample.osm'
db_path = 'C:\\sqlite_windows\\osm_samples.db'
print('New_York_sample.osm :', os.stat(osm_path).st_size)
print('osm_samples.db :', os.stat(db_path).st_size)

# Sizes (in bytes) of the first five files in the samples directory.
basedir = 'C:\\projects\\New_York_OSM\\samples'
names = os.listdir(basedir)[:5]
for name in names:
    print(name, ':', os.stat(os.path.join(basedir, name)).st_size)
```

```
New_York_sample.osm : 96696280
osm_samples.db : 107073536
sample_nodes.csv : 34343067
sample_nodes_tags.csv : 960657
sample_ways.csv : 3927976
sample_ways_nodes.csv : 11669712
sample_ways_tags.csv : 9811433
```
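
The sizes above are raw byte counts as returned by `os.stat`. If a more readable figure is wanted, they can be converted to megabytes. A small sketch, where `format_size` is a hypothetical helper and 1 MB is taken to mean 1024 * 1024 bytes:

```python
def format_size(num_bytes):
    """Format a byte count as megabytes (1 MB = 1024 * 1024 bytes here)."""
    return "{:.1f} MB".format(num_bytes / (1024 * 1024))


# Byte counts copied from the listing above.
for name, size in [("New_York_sample.osm", 96696280), ("osm_samples.db", 107073536)]:
    print(name, ":", format_size(size))
```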


Next, I explore the building, sport, and shop tags by querying the nodes_tags and ways_tags tables.

### Top 5 building types

```python
building_query = """
    select tags.value, count(*) as num
    from (select * from nodes_tags union all select * from ways_tags) tags
    where tags.key = 'building' and tags.value != 'yes'
    group by tags.value
    order by num desc
    limit 5;
"""
# Run the query and print each (value, count) row.
for row in c.execute(building_query):
    print(row)
```

```
('house', 5393)
('garage', 4088)
('shed', 489)
('commercial', 317)
('service', 218)
```

- Leaving out the generic value 'yes', houses and garages are by far the most common building types in the New York sample.

### Top 5 sports facilities

```python
sport_query = """
    select tags.value, count(*) as num
    from (select * from nodes_tags union all select * from ways_tags) tags
    where tags.key = 'sport' and tags.value != 'yes'
    group by tags.value
    order by num desc
    limit 5;
"""
# Run the query and print each (value, count) row.
for row in c.execute(sport_query):
    print(row)
```

```
('baseball', 67)
('tennis', 65)
('basketball', 35)
('soccer', 25)
('american_football', 11)
```

- The five most common sports facilities in the New York sample are for baseball, tennis, basketball, soccer, and American football.

### Number of shops

```python
shop_query = """
    select tags.key, count(*) as num
    from (select * from nodes_tags union all select * from ways_tags) tags
    where tags.key = 'shop';
"""
# Run the query and print the single (key, count) row.
print(c.execute(shop_query).fetchone())
```

```
('shop', 211)
```

### Number of shops with a phone number, email, website, or Facebook page

```python
shop_with_phone_query = """
    select tags.key, count(*) as num
    from (select * from nodes_tags union all select * from ways_tags) tags
    join
    (select distinct(id) from nodes_tags where key in ('phone', 'email', 'website', 'facebook')
        union all
     select distinct(id) from ways_tags where key in ('phone', 'email', 'website', 'facebook')) ids
    on tags.id = ids.id
    where tags.key = 'shop';
"""
# Run the query and print the single (key, count) row.
print(c.execute(shop_with_phone_query).fetchone())
```

```
('shop', 40)
```

- The New York OSM sample contains 211 shop tags, but only 40 of them have any contact information. The map would be more useful if every shop had at least one contact detail.
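
One caveat about the contact query: it matches IDs across both tag tables, and OSM numbers nodes and ways separately, so a node and a way can share the same ID. If that occurs in the data, a shop could be counted because of a contact tag on an unrelated element. The sketch below shows a per-table variant on toy data in an in-memory database; the simplified three-column schema and all rows are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
c = db.cursor()
TABLES = ("nodes_tags", "ways_tags")
for table in TABLES:
    # Simplified, assumed schema: only the columns the query needs.
    c.execute("create table {} (id integer, key text, value text)".format(table))

c.executemany("insert into nodes_tags values (?, ?, ?)", [
    (1, "shop", "bakery"), (1, "phone", "+1 212 555 0100"),
    (2, "shop", "books"),
])
c.executemany("insert into ways_tags values (?, ?, ?)", [
    (1, "shop", "mall"),  # same ID as node 1, but no contact tag of its own
    (3, "shop", "florist"), (3, "website", "http://example.com"),
])

# Count shops that have a contact tag in the same table (same element type).
contact_query = """
    select count(*) from {table} shop
    where shop.key = 'shop'
      and exists (select 1 from {table} t
                  where t.id = shop.id
                    and t.key in ('phone', 'email', 'website', 'facebook'));
"""
total_query = "select count(*) from {table} where key = 'shop';"

with_contact = sum(c.execute(contact_query.format(table=t)).fetchone()[0] for t in TABLES)
total = sum(c.execute(total_query.format(table=t)).fetchone()[0] for t in TABLES)
print(with_contact, "of", total, "shops have contact information")
```

On this toy data, way 1 is not counted even though node 1 has a phone number, because each table is checked on its own.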

## Conclusion