## OpenStreetMap Data Analysis

#### Map Area
Austin, Texas, United States
The data was downloaded from https://mapzen.com/data/metro-extracts/.  A sample of the data was created using the script "generate_sample.py".
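
The script itself is not reproduced here. As a rough sketch of one way such a sample could be drawn, the snippet below keeps every k-th top-level element while streaming the file. The function name `sample_elements`, the parameter `k`, and the in-memory demo data are assumptions for illustration, not the contents of "generate_sample.py".

```python
import io
import xml.etree.ElementTree as ET

def sample_elements(source, k, tags=("node", "way", "relation")):
    """Yield every k-th top-level element whose name is in `tags` (hypothetical helper)."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # first event is the start of the root <osm> element
    seen = 0
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            if seen % k == 0:
                yield elem
            seen += 1
            root.clear()             # drop already-processed children to limit memory use

# Made-up miniature document standing in for the real extract.
demo = io.BytesIO(
    b'<osm><node id="1"/><node id="2"/><node id="3"/><node id="4"/><node id="5"/></osm>'
)
sampled_ids = [elem.get("id") for elem in sample_elements(demo, k=2)]
```

With `k=2` the demo keeps the first, third, and fifth node. The "k_100" in the sample file name suggests a step of 100 for the real data, though that is an inference from the name only.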

### Problems Encountered During the Analysis
Four problems were found while analyzing the data.
1. Inconsistent street names
2. Inconsistent city names
3. Inconsistent phone numbers
4. Inconsistent state name

In [1]:
SAMPLE_OSM_FILE = "sample_austin_texas_k_100.osm"

In [2]:
import mapparser
mapparser.count_tags(SAMPLE_OSM_FILE)

{'member': 502,
 'nd': 69916,
 'node': 63907,
 'osm': 1,
 'relation': 24,
 'tag': 24064,
 'way': 6699}

The full compressed osm file is 70 MB, and the uncompressed osm file is 1.42 GB.  The sample osm file is 14.3 MB.  There are a total of 63,907 nodes, 24,064 tags, and 6,699 ways in the sample osm file.
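
The `mapparser` module is not shown in this report. A minimal sketch of how a tag counter of this kind could work, assuming it streams the file with `xml.etree.ElementTree.iterparse` (the function body and the demo data are assumptions):

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags(source):
    """Count how often each element name occurs in an XML file or file-like object."""
    counts = Counter()
    for _, elem in ET.iterparse(source):   # default: one "end" event per element
        counts[elem.tag] += 1
        elem.clear()                       # free the element's contents once counted
    return dict(counts)

# Made-up miniature document standing in for the sample OSM file.
demo = io.BytesIO(
    b'<osm><node id="1"><tag k="amenity" v="cafe"/></node>'
    b'<way id="2"><nd ref="1"/></way></osm>'
)
demo_counts = count_tags(demo)
```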

In [3]:
import tag_types
tag_types.process_map(SAMPLE_OSM_FILE)

{'lower': 13127, 'lower_colon': 10823, 'other': 114, 'problemchars': 0}

Great, there are no problematic characters!
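
The four categories in the output point to a regex-based classification of tag keys. The patterns used by `tag_types` are not shown, so the ones below are assumptions chosen to match the category names; this is a sketch of the idea rather than the module's actual code.

```python
import re

# Assumed patterns: only lowercase letters/underscores, the same with one colon,
# and a set of characters that tend to cause trouble in keys.
LOWER = re.compile(r'^[a-z_]*$')
LOWER_COLON = re.compile(r'^[a-z_]*:[a-z_]*$')
PROBLEMCHARS = re.compile(r'[=+/&<>;\'"?%#$@,. \t\r\n]')

def key_type(key):
    """Return the category name for a tag key."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"

demo_types = [key_type(k) for k in ("highway", "addr:street", "addr street", "FIXME")]
```

Keys with uppercase letters or digits, such as "FIXME", match none of the patterns and fall into "other".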

#### 2. Explore the Users

In [4]:
import users
len(users.get_unique_users(SAMPLE_OSM_FILE))

422

There are 422 unique users that have contributed to the data.
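
One way to arrive at such a count is to collect the `uid` attribute of every node, way, and relation into a set. The sketch below assumes that approach; the body of `users.get_unique_users` is not shown in this report, and the demo data is made up.

```python
import io
import xml.etree.ElementTree as ET

def get_unique_users(source):
    """Return the set of distinct uid values found on nodes, ways, and relations."""
    users = set()
    for _, elem in ET.iterparse(source):
        if elem.tag in ("node", "way", "relation"):
            uid = elem.get("uid")
            if uid is not None:
                users.add(uid)
        elem.clear()
    return users

# Made-up data: two elements share uid 7, one element has no uid at all.
demo = io.BytesIO(
    b'<osm><node id="1" uid="7"/><node id="2" uid="7"/>'
    b'<way id="3" uid="9"/><node id="4"/></osm>'
)
demo_users = get_unique_users(demo)
```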

#### 3. Inconsistent Street Name 
The expected street name list was created by exploring the sample osm file.  I would like to use full names instead of abbreviations.  The corrections were saved in a street mapping dictionary.  For example, if the street name includes "FM", it should be replaced by "Farm-to-Market Road".

In [5]:
import audit
print "Expected Street Names:"
print audit.expected

print "Mapping from inconsistent street names to consistent street names:"
print audit.street_mapping

['(512)282-0182', '(512)652-1200', '(512)759-4700', '+1512-244-8500', '+1877815-0542', '512-345-6070', '512-477-6104']
Expected Street Names:
['Street', 'Avenue', 'Boulevard', 'Drive', 'Court', 'Place', 'Square', 'Lane', 'Road', 'Trail', 'Parkway', 'Commons', 'Ridge', 'Cove', 'Hill', 'Circle', 'Run', 'Park', 'Plaza', 'Loop', 'Pass', 'Bend', 'Way', 'Trace', 'Highway', 'Crossing', 'Terrace', 'Path', 'Hollow', 'Skyway']
Mapping from inconsistent street names to consistent street names:
{'Trl': 'Trail ', 'Tr': 'Trail', 'Cv': 'Cove', 'Rd': 'Road', 'Ovlk': 'Overlook', 'Dr': 'Drive', 'FM': 'Farm-to-Market Road'}
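
A minimal sketch of how the mapping above could be applied to a street name. The dictionary is copied from the output; the function name `update_street_name` and the word-by-word replacement strategy are assumptions, not the code in `audit`.

```python
# Mapping copied from the output above (including the trailing space in 'Trail ').
STREET_MAPPING = {'Trl': 'Trail ', 'Tr': 'Trail', 'Cv': 'Cove', 'Rd': 'Road',
                  'Ovlk': 'Overlook', 'Dr': 'Drive', 'FM': 'Farm-to-Market Road'}

def update_street_name(name, mapping=STREET_MAPPING):
    """Replace every whole word found in the mapping; leave all other words as they are."""
    fixed = []
    for word in name.split():
        key = word.rstrip(".")                 # treat "Rd." like "Rd"
        if key in mapping:
            fixed.append(mapping[key].strip()) # strip stray whitespace in mapped values
        else:
            fixed.append(word)
    return " ".join(fixed)

examples = [update_street_name(s) for s in ("FM 620", "Burnet Rd.", "Shoal Creek Trl")]
```

Replacing whole words anywhere in the name handles a leading "FM", but it would also rewrite "Dr" where it stands for "Doctor", so restricting the replacement to the last word may be safer for some abbreviations.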


#### 4. Inconsistent City Names
The expected city name list was created by exploring the sample osm file.  There are inconsistent city names, such as "Austin" and "Austin,TX".  There are also city names without spaces, such as "CedarPark" and "RoundRock".  I would like to update city names to be consistent.  The corrections were saved in a city mapping dictionary.

In [6]:
print "Mapping from inconsistent city names to consistent city names:"
print audit.city_mapping

Mapping from inconsistent city names to consistent city names:
{'RoundRock': 'Round Rock', 'CedarPark': 'Cedar Park', 'Austin,TX': 'Austin'}
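
Applying this mapping amounts to a dictionary lookup with a fallback to the original value. A short sketch, with the hypothetical helper name `update_city_name`:

```python
# Mapping copied from the output above.
CITY_MAPPING = {'RoundRock': 'Round Rock', 'CedarPark': 'Cedar Park', 'Austin,TX': 'Austin'}

def update_city_name(name, mapping=CITY_MAPPING):
    """Return the corrected city name, or the trimmed input if no correction is known."""
    cleaned = name.strip()
    return mapping.get(cleaned, cleaned)

cities = [update_city_name(c) for c in ("Austin,TX", "CedarPark", "Austin", " RoundRock ")]
```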


#### 5. Incomplete Phone Numbers
There are not a lot of phone numbers in the sample osm file: only seven unique values.  Since that is far too few to be useful, I did not clean them up for the csv files.

In [7]:
audit.audit_phone(SAMPLE_OSM_FILE)

['(512)282-0182', '(512)652-1200', '(512)759-4700', '+1512-244-8500', '+1877815-0542', '512-345-6070', '512-477-6104']
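
An audit like this can be sketched as collecting the distinct values of one tag key. The helper below is hypothetical, and it assumes the phone numbers are stored under the key "phone"; the body of `audit.audit_phone` is not shown in this report.

```python
import io
import xml.etree.ElementTree as ET

def unique_tag_values(source, key):
    """Return the sorted distinct values of all <tag> elements with the given key."""
    values = set()
    for _, elem in ET.iterparse(source):
        if elem.tag == "tag" and elem.get("k") == key:
            values.add(elem.get("v"))
        elem.clear()
    return sorted(values)

# Made-up document reusing two of the phone formats listed above.
demo = io.BytesIO(
    b'<osm><node id="1"><tag k="phone" v="512-345-6070"/></node>'
    b'<node id="2"><tag k="phone" v="(512)282-0182"/></node>'
    b'<node id="3"><tag k="phone" v="512-345-6070"/><tag k="name" v="Cafe"/></node></osm>'
)
demo_phones = unique_tag_values(demo, "phone")
```

The same helper could serve for other single-key audits, for example with an assumed key such as "addr:state".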


#### 6. Inconsistent State Name
There are three different state values, and they are "Texas", "TX", and "tx".  I cleaned them up so that all of them are "Texas".

In [8]:
audit.audit_state(SAMPLE_OSM_FILE)

['TX', 'Texas', 'tx']
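
The cleanup described above can be sketched as a case-insensitive check; the function name `update_state` is an assumption.

```python
def update_state(value):
    """Map the known spellings of Texas to "Texas"; leave anything else unchanged."""
    if value.strip().lower() in ("tx", "texas"):
        return "Texas"
    return value

states = [update_state(s) for s in ["TX", "Texas", "tx"]]
```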


### Analysis with SQL
To analyze the data with SQL, the elements in the sample osm file were parsed to create five .csv files.  The file names and sizes are listed below.

nodes.csv: 6 MB

nodes_tags.csv: 113 KB

ways.csv: 482 KB

ways_tags.csv: 720 KB

ways_nodes.csv: 1.7 MB

I created a database called "open_street_map.db" using file "database.py".
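
The contents of "database.py" are not shown. As a rough sketch of how a csv file could be loaded into SQLite with the standard library, the snippet below uses an in-memory database and made-up rows. The helper `load_csv` and the `id` and `type` columns are assumptions; only `key` and `value` are confirmed by the queries in this report.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, columns, csv_file):
    """Create `table` with the given columns and insert every row of the csv file."""
    conn.execute("CREATE TABLE IF NOT EXISTS {} ({})".format(table, ", ".join(columns)))
    placeholders = ", ".join("?" for _ in columns)
    reader = csv.DictReader(csv_file)          # expects a header row naming the columns
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(table, placeholders),
        ([row[c] for c in columns] for row in reader),
    )
    conn.commit()

# Made-up stand-in for nodes_tags.csv.
demo_csv = io.StringIO(
    "id,key,value,type\n"
    "1,amenity,cafe,regular\n"
    "2,amenity,bench,regular\n"
    "3,amenity,cafe,regular\n"
)
conn = sqlite3.connect(":memory:")             # the real run would open "open_street_map.db"
load_csv(conn, "nodes_tags", ["id", "key", "value", "type"], demo_csv)
row_count = conn.execute("SELECT count(*) FROM nodes_tags").fetchone()[0]
```

Table and column names are formatted into the SQL text here, so they should come from trusted code rather than user input; the row values go through `?` placeholders.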

In [9]:
query = '''
select uid, count(*) as count
from nodes
group by uid
order by count desc
limit 20;
'''

The top contributor has 24724 contributions, more than twice as many as the second-ranked user.
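
The cell above only defines the query string. A minimal sketch of how it could be executed with the standard `sqlite3` module follows, using an in-memory table with made-up rows in place of "open_street_map.db".

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for sqlite3.connect("open_street_map.db")
conn.execute("CREATE TABLE nodes (id INTEGER, uid INTEGER)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 10), (4, 20), (5, 20), (6, 30)],   # made-up rows
)

query = '''
select uid, count(*) as count
from nodes
group by uid
order by count desc
limit 20;
'''
rows = conn.execute(query).fetchall()   # list of (uid, count) tuples, largest count first
```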

In [10]:
query = '''
select value, count(*) as count
from nodes_tags
where key = 'amenity'
group by value
order by count desc
limit 20;
'''

Of all the amenities, fire station is the most common, followed by bench, fuel, waste basket, bar, and cafe.

### Additional Ideas
1. The phone numbers are not complete.  Perhaps we can use Google or Yelp reviews to improve the dataset.
2. We can improve the dataset by encouraging portable device carriers to report their location and related information.

###### Potential Benefits
1. Google or Yelp reviews usually have accurate addresses for public places, such as libraries, churches, schools, etc.  This information is readily available and can be accessed without additional effort.
2. Portable devices such as smart phones are popular and they can be used as GPS to improve the quality of the database.

###### Potential Problems
1. Obtaining phone numbers of public places is not a problem; however, it can be difficult to get phone numbers of private properties, such as houses.
2. It can be hard to encourage people to use their portable devices to report location information.  Also, there should be an easy way to report location information, such as an app on a smart phone.