# Audit Script for OpenStreetMap Data

In [1]:
# import packages
from osmAudit import *

In [2]:
# define osm_filename
wd = '/Users/ajp/dsProjects/workspace/osmAustin/data/'
atx_filename = wd + 'austin_texas.osm'
# atx_filename = wd + 'sample_atx.osm'
# atx_filename = wd + 'small_sample_atx.osm'

In [3]:
# get count of elements
count_elements(atx_filename)


----- Count all tags -----
nd: 8835948
node: 7932057
tag: 2967844
way: 858496
member: 58198
relation: 4341
osm: 1


In [4]:
# get count of attributes
count_attributes(atx_filename)


----- Count all attributes -----
ref: 8894146
version: 8794895
id: 8794894
timestamp: 8794894
uid: 8794894
user: 8794894
changeset: 8794894
lat: 7932057
lon: 7932057
k: 2967844
v: 2967844
type: 58198
role: 58198
generator: 1


In [5]:
# get count of keys
count_keys(atx_filename)


----- Count all keys -----
building: 622302
height: 441107
addr:street: 345406
addr:housenumber: 344583
highway: 216664
addr:postcode: 98282
name: 73274
service: 52447
access: 41496
tiger:county: 37785
tiger:cfcc: 37703
surface: 34399
tiger:name_base: 33396
tiger:name_type: 30904
tiger:reviewed: 25054
oneway: 25010
power: 23772
barrier: 19658
addr:city: 19394
addr:state: 19314
tiger:zip_left: 18060
tiger:zip_right: 16208
amenity: 15790
coa:place_id: 13443
footway: 13252
lanes: 11138
natural: 10536
crossing: 10107
generator:source: 9457
location: 8607
maxspeed: 8315
ref: 8207
tiger:tlid: 8188
tiger:source: 8177
leisure: 8011
landuse: 7671
generator:method: 7547
generator:type: 7547
tiger:separated: 7324
source: 6174
created_by: 5771
golf: 5537
generator:output:electricity: 5466
tiger:name_direction_prefix: 4700
layer: 4687
type: 4349
kerb: 4322
waterway: 4130
website: 4042
phone: 3358
brand: 3285
operator: 3278
shop: 3240
brand:wikidata: 3181
odbl: 3175
brand:wikipedia: 3107
tiger:name

### Exploring key values

First, I'll check the top 10 keys by frequency of occurrence to see where there are opportunities for data cleaning. I'll also check a few others that look interesting. Because I'm working in a Python notebook, looping through the top 10 keys in a single cell isn't practical: the printed output would be truncated well before all 10 keys' values were displayed. Instead, I'll run each key in its own cell.
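The helpers imported from `osmAudit` are not shown in this notebook. As a rough idea of what a per-key value counter could look like, here is a minimal sketch built on the standard library's streaming XML parser. The function name `count_values_for_key` and the tiny inline sample are hypothetical; this is not the actual `key_val_counter` implementation.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_values_for_key(osm_source, key):
    """Count the `v` values of every <tag> element whose `k` equals `key`.

    Hypothetical stand-in for osmAudit.key_val_counter (an assumption).
    `osm_source` may be a filename or a binary file-like object.
    """
    counts = Counter()
    # iterparse streams the file, so a large extract is never held in memory at once
    for _, elem in ET.iterparse(osm_source, events=('end',)):
        if elem.tag == 'tag' and elem.get('k') == key:
            counts[elem.get('v')] += 1
        elem.clear()  # release the element's contents once it has been handled
    return counts


# Tiny made-up sample so the sketch can run without the Austin extract
sample = io.BytesIO(b"""<osm>
  <node id="1" lat="30.0" lon="-97.0"><tag k="building" v="yes"/></node>
  <way id="2"><tag k="building" v="house"/><tag k="name" v="x"/></way>
  <way id="3"><tag k="building" v="yes"/></way>
</osm>""")

sample_counts = count_values_for_key(sample, 'building')
for value, n in sample_counts.most_common():
    print(f'{value}: {n}')
```

Returning a `Counter` rather than printing directly would also make it possible to save the full counts to a file, which sidesteps the notebook's output truncation.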

In [6]:
key_val_counter(atx_filename, 'building')


----- Count of values for key: building -----
yes: 584823
house: 20797
apartments: 4389
detached: 2685
carport: 1824
retail: 954
roof: 926
commercial: 758
school: 691
residential: 565
garage: 537
semidetached_house: 484
garages: 424
static_caravan: 293
shed: 237
terrace: 229
office: 227
university: 213
industrial: 212
church: 176
demountable: 160
parking: 122
construction: 67
college: 54
public: 53
barn: 52
hospital: 46
dormitory: 42
warehouse: 33
kindergarten: 23
service: 22
hangar: 19
greenhouse: 17
cabin: 17
stadium: 13
hotel: 13
pavilion: 12
grandstand: 12
government: 8
chapel: 7
stable: 5
stadium seating: 4
civic: 4
ruins: 4
container: 4
temple: 3
sports_centre: 3
covered area: 2
train_station: 2
no: 2
mosque: 2
toilets: 2
farm_auxiliary: 2
synagogue: 2
generator: 2
farm: 2
museum: 1
business: 1
big state electric: 1
tree_house: 1
undefined: 1
Bing: 1
shelter: 1
gas_station: 1
transportation: 1
Learning_Center/_Day_Care: 1
public_building: 1
storage_tank: 1
bungalow: 1
hut: 1
tra

A few things need to be cleaned up in the values for this key:
- Some values contain spaces where there should be underscores (for example, `stadium seating` and `covered area`). A simple `str.replace` will correct those.
- A few other entries are incorrect or ambiguous (for example, `Bing` and `undefined`); I'll correct those with a dictionary replace.
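The two cleaning steps above can be sketched as follows. The function name `clean_building` is hypothetical, and the target values in `BUILDING_FIXES` are my own guesses for illustration; the mapping actually used in the shape function may differ.

```python
# Hypothetical mapping: the keys come from the audit output above,
# but the replacement values are assumptions chosen for illustration.
BUILDING_FIXES = {
    'Bing': 'yes',                # looks like a source name, not a building type
    'undefined': 'yes',           # no usable type given
    'big_state_electric': 'yes',  # looks like a business name
    'public_building': 'public',
}


def clean_building(value):
    """Replace spaces with underscores, then apply the dictionary of fixes."""
    value = value.strip().replace(' ', '_')
    return BUILDING_FIXES.get(value, value)


for raw in ['stadium seating', 'covered area', 'Bing', 'house']:
    print(raw, '->', clean_building(raw))
```

Doing the underscore replacement first means the dictionary only needs the underscore form of each key.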

In [7]:
key_val_counter(atx_filename, 'height')


----- Count of values for key: height -----
5.1: 7946
5.2: 7904
5.3: 7834
5.0: 7732
4.9: 7680
4.8: 7401
5.4: 7384
4.7: 7125
5.6: 6988
5.5: 6844
4.6: 6798
5.7: 6729
4.5: 6718
5.8: 6447
5.9: 6317
4.4: 6287
6.1: 6108
6.0: 6019
6.2: 6018
4.3: 5969
6.3: 5805
7.7: 5680
7.8: 5654
6.4: 5638
7.9: 5474
7.4: 5440
7.5: 5379
7.6: 5349
4.2: 5342
7.3: 5309
7.2: 5237
6.6: 5202
6.5: 5153
6.7: 5104
8.0: 5086
7.1: 5007
6.8: 4981
4.1: 4918
8.3: 4897
8.1: 4878
6.9: 4844
8.2: 4784
7.0: 4772
8.4: 4769
8.8: 4729
8.5: 4630
8.7: 4606
8.6: 4564
8.9: 4499
4.0: 4452
9.0: 4394
9.2: 4214
9.1: 4205
9.3: 4152
9.4: 4025
3.9: 3878
9.5: 3809
9.6: 3706
3.8: 3671
9.7: 3400
3.7: 3347
9.8: 3331
3.6: 3269
9.9: 3214
10.0: 3124
3.5: 3107
10.1: 3009
3.4: 2929
3.3: 2881
3.2: 2744
3.1: 2741
10.2: 2627
10.3: 2618
10.4: 2586
3.0: 2583
2.9: 2322
10.5: 2312
10.6: 2256
2.8: 2181
10.7: 1970
10.8: 1952
10.9: 1815
2.7: 1720
10: 1676
11.0: 1649
11.1: 1581
11.2: 1449
11.3: 1377
11.4: 1326
2.6: 1267
11.5: 1263
11.6: 1185
11.7: 1071
11.8: 98

In [8]:
key_val_counter(atx_filename, 'addr:street')


----- Count of values for key: addr:street -----
North Lamar Boulevard: 828
Dessau Road: 599
Burnet Road: 575
South Congress Avenue: 490
North Interstate Highway 35 Service Road: 464
Shoal Creek Boulevard: 452
Ranch Road 620: 448
Research Boulevard: 447
South 1st Street: 433
Manchaca Road: 431
Guadalupe Street: 422
Cameron Road: 367
Westlake Drive: 359
Briarcreek Loop: 358
East 12th Street: 344
Duval Street: 336
East Cesar Chavez Street: 311
South Lamar Boulevard: 310
Manor Road: 303
West Anderson Lane: 283
East 6th Street: 282
East 7th Street: 279
Avenue G: 275
Willow Street: 264
Avenue H: 260
East Riverside Drive: 257
Airport Boulevard: 247
Red River Street: 245
Hamilton Pool Road: 245
Deassau Road: 245
Avenue F: 244
Bar K Ranch Road: 242
Daffan Lane: 242
Parkfield Drive: 239
East 13th Street: 232
River Road: 230
West Slaughter Lane: 226
Berkman Drive: 225
Abilene Trail: 217
East 3rd Street: 216
South Interstate 35: 215
Congress Avenue: 214
Springdale Road: 213
Shoalwood Avenue: 213

In [9]:
key_val_counter(atx_filename, 'addr:housenumber')


----- Count of values for key: addr:housenumber -----
13021: 777
7601: 382
900: 379
201: 375
105: 374
1701: 364
901: 360
500: 349
104: 347
1100: 344
1700: 344
301: 343
1500: 338
103: 336
1000: 335
1201: 333
1801: 331
501: 329
1901: 326
1400: 325
2000: 325
1200: 324
200: 322
1900: 321
101: 320
2101: 320
1705: 319
2200: 318
1101: 315
905: 315
1401: 314
1301: 313
1300: 313
2100: 310
1600: 310
1704: 310
701: 308
1405: 308
305: 307
1205: 306
1601: 305
2105: 304
401: 303
904: 303
1905: 303
1800: 302
1105: 302
2501: 300
300: 300
1501: 299
2500: 295
1904: 294
505: 294
106: 293
1001: 293
601: 292
205: 291
1404: 291
1204: 290
2400: 289
2201: 288
107: 288
1104: 288
1505: 288
801: 288
2001: 287
102: 287
1304: 287
1305: 284
700: 283
600: 280
2205: 280
2300: 279
2005: 279
400: 278
1504: 276
800: 275
1605: 275
2401: 275
303: 274
1804: 272
805: 272
204: 272
304: 271
1005: 270
1805: 268
903: 267
2204: 267
1604: 266
405: 266
2104: 265
1708: 264
907: 263
108: 262
909: 261
504: 260
906: 259
908: 259
1703

In [10]:
key_val_counter(atx_filename, 'highway')


----- Count of values for key: highway -----
service: 105414
residential: 39134
footway: 19824
turning_circle: 12804
crossing: 6807
secondary: 5631
tertiary: 3861
path: 2499
traffic_signals: 2463
primary: 2457
track: 1847
motorway: 1566
cycleway: 1450
unclassified: 1349
motorway_link: 1295
bus_stop: 1148
stop: 1025
street_lamp: 952
secondary_link: 940
turning_loop: 878
trunk: 697
motorway_junction: 419
steps: 375
primary_link: 334
construction: 291
trunk_link: 247
tertiary_link: 196
give_way: 166
proposed: 161
pedestrian: 144
toll_gantry: 103
living_street: 55
raceway: 40
corridor: 38
milestone: 19
mini_roundabout: 15
trailhead: 10
elevator: 4
priority: 2
services: 2
bridleway: 2


In [11]:
key_val_counter(atx_filename, 'addr:postcode')


----- Count of values for key: addr:postcode -----
78645: 10893
78734: 5627
78660: 4560
78653: 3553
78641: 3276
78669: 3190
78754: 2820
78704: 2559
78746: 2527
78723: 2290
78613: 2280
78759: 2160
78724: 2136
78738: 1994
78703: 1864
78701: 1837
78617: 1813
78758: 1805
78748: 1744
78745: 1732
78731: 1688
78725: 1648
78620: 1468
78741: 1428
78753: 1365
78750: 1326
78757: 1315
78732: 1315
78747: 1287
78705: 1282
78702: 1220
78733: 1207
78744: 1203
78737: 1120
78621: 1065
78735: 1048
78736: 1008
78749: 980
78717: 958
78730: 943
78751: 889
78728: 883
78721: 779
78610: 739
78652: 729
78664: 711
78739: 710
78752: 694
78727: 690
78729: 672
78681: 624
78756: 560
78626: 554
78719: 482
78722: 428
78654: 413
78726: 412
78634: 294
78602: 269
78665: 228
78628: 158
78742: 150
78640: 139
78615: 126
78612: 86
76574: 82
78712: 79
78663: 44
78676: 43
78619: 15
78642: 15
78662: 7
78724-1199: 6
78666: 4
78616: 3
78953: 3
78644: 2
78754;78753: 2
78704-5639: 1
78758-7008: 1
78758-7013: 1
14150: 1
78704-7205:

These data are mostly clean, but some non-Austin ZIP codes are included (see https://www.city-data.com/zipmaps/Austin-Texas.html for the list of Austin ZIP codes). I'll filter those out in the shape function.
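A minimal sketch of that filter, assuming the allowed ZIP codes are kept in a set. `AUSTIN_ZIPS` below is only a three-item illustrative subset, not the full list, and `clean_postcode` is a hypothetical name. Reducing ZIP+4 values such as `78704-5639` to their five-digit form is also my assumption; multi-valued entries like `78754;78753` are simply rejected here.

```python
import re

# Illustrative subset only (assumption): the real set would hold every Austin ZIP code
AUSTIN_ZIPS = {'78701', '78704', '78758'}

# Five digits, optionally followed by a ZIP+4 suffix
ZIP_RE = re.compile(r'^(\d{5})(?:-\d{4})?$')


def clean_postcode(value, allowed=AUSTIN_ZIPS):
    """Return the five-digit ZIP if it is in `allowed`, otherwise None."""
    match = ZIP_RE.match(value.strip())
    if match is None:
        return None  # malformed or multi-valued entry
    zip5 = match.group(1)
    return zip5 if zip5 in allowed else None


for raw in ['78704', '78704-5639', '14150', '78754;78753']:
    print(raw, '->', clean_postcode(raw))
```

Returning `None` lets the shape function decide whether to drop just the tag or the whole element.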

In [12]:
key_val_counter(atx_filename, 'name')


----- Count of values for key: name -----
North Interstate 35: 299
North Mopac Expressway: 286
Pickle Parkway: 262
West Parmer Lane: 238
South Interstate 35: 215
East Riverside Drive: 197
South Mopac Expressway: 186
West Slaughter Lane: 171
East Parmer Lane: 144
West William Cannon Drive: 139
North Lamar Boulevard: 135
Wells Branch Parkway: 134
Shell: 132
Research Boulevard: 132
West US Highway 290: 128
South 1st Street: 120
East US Highway 290: 119
State Highway 71: 104
7-Eleven: 101
Airport Boulevard: 101
Lady Bird Lake Hike and Bike Trail: 100
State Highway 45 North: 99
Bee Caves Road: 97
Brodie Lane: 93
183A Toll Road: 89
Austin Subdivision: 87
West Anderson Lane: 85
South Congress Avenue: 85
Anderson Mill Road: 85
East William Cannon Drive: 85
Gattis School Road: 84
North Capital of Texas Highway: 81
Walgreens: 80
Guadalupe Street: 80
South Pleasant Valley Road: 80
Burnet Road: 80
Exxon: 79
Ronald W Reagan Boulevard: 76
South FM 1626: 73
Metric Boulevard: 71
Lockhart Highway: 70


In [13]:
key_val_counter(atx_filename, 'service')


----- Count of values for key: service -----
driveway: 34956
parking_aisle: 15006
alley: 1249
drive-through: 969
spur: 91
yard: 59
emergency_access: 48
siding: 36
crossover: 10
long_distance: 6
pa\: 4
parking: 3
slipway: 2
commercial_office_park driveway: 2
tyres;oil: 1
glass: 1
aircraft_control: 1
storage: 1
access: 1
construction: 1


In [14]:
key_val_counter(atx_filename, 'surface')


----- Count of values for key: surface -----
asphalt: 21169
paved: 5156
concrete: 3893
unpaved: 1407
concrete:plates: 558
ground: 518
gravel: 452
dirt: 391
paving_stones: 250
fine_gravel: 181
sand: 98
grass: 64
wood: 49
compacted: 38
metal: 31
artificial_turf: 24
brick: 17
earth: 14
concrete:lanes: 14
bricks: 12
pebblestone: 10
rock: 8
stone: 8
tartan: 7
cobblestone: 6
yes: 5
con: 3
mud: 2
turf: 2
dirt/sand: 2
large,_unattached_stones_through_water: 2
Indoor: 1
CR_127: 1
paving_stones:30: 1
creekbed_(rock): 1
concrete,_dirt: 1
Large_unattached_stones_laid_in_the_creek: 1
woodchips: 1
f: 1


The values for this key need cleaning. Where I can tell what was intended, I'll correct the value with a dictionary. Where the intent is unclear, I'll remove the tag by checking its value against a list.
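The dictionary-plus-list approach could look something like the sketch below. The names `SURFACE_FIXES`, `SURFACE_DROP`, and `clean_surface` are hypothetical, and which values get corrected versus dropped is my own reading of the output above, not necessarily the final choice.

```python
# Values whose intent seems recoverable (the replacements are assumptions)
SURFACE_FIXES = {
    'con': 'concrete',
    'bricks': 'brick',
    'creekbed_(rock)': 'rock',
}

# Values too unclear to repair; tags carrying them would be removed
SURFACE_DROP = ['yes', 'f', 'CR_127', 'Indoor']


def clean_surface(value):
    """Return the corrected value, or None if the tag should be removed."""
    if value in SURFACE_DROP:
        return None
    return SURFACE_FIXES.get(value, value)


for raw in ['asphalt', 'con', 'bricks', 'f']:
    print(raw, '->', clean_surface(raw))
```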

In [15]:
key_val_counter(atx_filename, 'tiger:county')


----- Count of values for key: tiger:county -----
Travis, TX: 21905
Williamson, TX: 7709
Bastrop, TX: 3635
Hays, TX: 3254
Caldwell, TX: 479
Burnet, TX: 454
Blanco, TX: 66
Lee, TX: 58
Comal, TX: 48
Bastrop, TX:Lee, TX:Travis, TX: 22
Bastrop, TX:Bell, TX:Travis, TX:Williamson, TX: 21
Hays, TX:Travis, TX: 17
Travis, TX:Williamson, TX: 15
Travis: 15
Travis, TX;Williamson, TX: 14
Anderson, TX:Freestone, TX:Leon, TX:Milam, TX:Robertson, TX:Williamson, TX: 14
Caldwell, TX:Travis, TX: 13
Travis, TX;Burnet, TX: 10
Hays, TX;Travis, TX: 6
Travis, TX; Williamson, TX: 6
Milam, TX: 4
Williamson, TX;Travis, TX: 4
Bastrop, TX:Caldwell, TX:Hays, TX: 4
Hays, TX;Travis, TX;Hays, TX: 2
Travis, TX;Hays, TX: 2
Williamson: 2
Bastrop, TX;Travis, TX: 1
Bastrop, TX;Caldwell, TX: 1
Burnet, TX:Llano, TX:Williamson, TX: 1
Travis, TX;Bastrop, TX: 1
Williamson,TX: 1
Hays, TX; Travis, TX; Hays, TX: 1


In [16]:
key_val_counter(atx_filename, 'addr:city')


----- Count of values for key: addr:city -----
Austin: 12095
Cedar Park: 1985
Pflugerville: 1137
Round Rock: 1012
Georgetown: 713
Leander: 452
Elgin: 437
Hutto: 298
Bastrop: 280
Kyle: 181
Dripping Springs: 173
Buda: 87
Taylor: 83
Manor: 79
Lakeway: 64
Wimberley: 43
West Lake Hills: 41
Bee Cave: 35
Manchaca: 32
Del Valle: 24
Cedar Creek: 18
Liberty Hill: 16
Driftwood: 15
Spicewood: 11
Lago Vista: 9
Westlake Hills: 7
austin: 7
Red Rock: 7
San Marcos: 4
Webberville: 4
Coupland: 4
Sunset Valley: 3
Jonestown: 3
Rosanky: 3
Creedmoor: 2
Rollingwood: 2
Lost Pines: 2
AUSTIN: 2
Pfluggerville: 2
Wells Branch: 2
Barton Creek: 1
Ste 128, Austin: 1
San Gabriel Village Boulevard: 1
Dale: 1
manor: 1
Pepe’s Tacos: 1
N Austin: 1
Manchaca,: 1
Austin;austin: 1
kyle: 1
Tampa: 1
McNeil: 1
Smithville: 1
wimberley: 1
Wimberly: 1
Marble Falls: 1
georgetown: 1
Volente: 1
Maxwell: 1
Jolyville: 1


These values are a little messy: capitalization is inconsistent, a few names are misspelled, and some entries aren't cities at all. I'll capitalize just the first letter of each word in these names, then a simple dictionary will clean them up.
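One way to sketch this is with `string.capwords` from the standard library followed by a dictionary lookup. `CITY_FIXES` and `clean_city` are hypothetical names, and the corrections listed are my interpretation of the audit output. Note that `capwords` lowercases everything after the first letter of each word, so a name like `McNeil` needs an entry in the dictionary to restore it.

```python
import string

# Hypothetical corrections, keyed by the value *after* capitalization
CITY_FIXES = {
    'Pfluggerville': 'Pflugerville',
    'Wimberly': 'Wimberley',
    'Jolyville': 'Jollyville',
    'Westlake Hills': 'West Lake Hills',
    'Manchaca,': 'Manchaca',
    'N Austin': 'Austin',
    'Austin;austin': 'Austin',
    'Ste 128, Austin': 'Austin',
    'Mcneil': 'McNeil',  # capwords lowercases the inner capital
}


def clean_city(value):
    """Capitalize the first letter of each word, then apply known fixes."""
    value = string.capwords(value)
    return CITY_FIXES.get(value, value)


for raw in ['AUSTIN', 'wimberley', 'Pfluggerville', 'Manchaca,', 'McNeil']:
    print(raw, '->', clean_city(raw))
```

I chose `capwords` over `str.title` in this sketch because `title` treats any non-letter as a word boundary, which would capitalize the letter after an apostrophe.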

In [17]:
key_val_counter(atx_filename, 'addr:state')


----- Count of values for key: addr:state -----
TX: 19311
FL: 1
AL: 1
tx: 1


There are a few non-Texas values in this key that need to be filtered out in the shape function.
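A minimal sketch of the state filter, assuming that the lowercase `tx` should be normalized to `TX` rather than discarded (that normalization is my assumption; the name `clean_state` is hypothetical):

```python
def clean_state(value):
    """Return 'TX' for any casing of the Texas abbreviation, otherwise None."""
    state = value.strip().upper()
    return state if state == 'TX' else None


for raw in ['TX', 'tx', 'FL', 'AL']:
    print(raw, '->', clean_state(raw))
```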