Perform Keyword Analysis on datasets available on catalog.data.gov #4068
Comments
Initial Analysis Complete
Key notes:
Keywords that appear more than 1000 times
'north pacific ocean': 1002
'national data buoy center': 1004
'moored buoy': 1005
'nm': 1007
'chemistry': 1008
'dart': 1009
'ndbc': 1009
'state-of-louisiana': 1010
'c-man': 1014
'ctdtmp': 1015
'school': 1018
'tennessee': 1019
'profile': 1026
'coral': 1031
'telepresence': 1032
'north-carolina': 1035
'r337': 1039
'okeanos': 1041
'stream': 1045
'usgs national water information system (nwis)': 1045
'scs': 1047
'cartography': 1051
'wetlands': 1052
'new jersey': 1053
'coastal-processes': 1054
'quality': 1056
'geographic-information-system': 1060
'protection': 1070
'michigan': 1078
'ocean waves': 1083
'doc/noaa/nos/orr': 1084
'office of response and restoration': 1085
'nasa': 1092
'montana': 1093
'date': 1103
'boundary': 1104
'discrete measurement': 1104
'wyoming': 1106
'oer': 1109
'health': 1111
'geothermal': 1112
'energy': 1113
'platform_orientation': 1113
'sea surface temperature': 1116
'philipine-islands': 1119
'distributions': 1131
'geographic cell': 1131
'shoreline mapping program': 1132
'coastal mapping program': 1133
'national shoreline': 1133
'undersea': 1135
'underwater': 1138
'geomorphology': 1141
'explorer': 1142
'depth status_flag': 1146
'eastward_sea_water_velocity status_flag': 1146
'latitude status_flag': 1146
'longitude status_flag': 1146
'northward_sea_water_velocity status_flag': 1146
'time status_flag': 1146
'restoration': 1159
'idaho': 1161
'united-states-of-america': 1173
'reef': 1177
'platform_pitch_angle': 1186
'platform_roll_angle': 1186
'water column mapping system': 1187
'wcms': 1187
'philippines': 1188
'weather': 1190
'expedition': 1195
'johnson-space-center': 1196
'printed-maps': 1199
'ames-research-center': 1201
'aircraft': 1209
'sea_water_density status_flag': 1211
'wetland': 1211
'north carolina': 1215
'2006 tiger second edition': 1216
'census data': 1216
'tiger data': 1216
'gulf-of-mexico': 1224
'sea_water_temperature status_flag': 1226
'sea_water_pressure status_flag': 1229
'orthophoto': 1230
'delaware': 1251
'harmonic constituents': 1253
'rain fall': 1253
'water level predictions': 1254
'connecticut': 1255
'sea_water_electrical_conductivity status_flag': 1257
'arizona': 1259
'identifier': 1259
'pdf': 1269
'doqq': 1275
'channel': 1277
'tao': 1277
'floodplain mapping': 1284
'utah': 1286
'station': 1288
'jet-propulsion-laboratory': 1291
'spectral-engineering': 1304
'sea-floor-characteristics': 1308
'visibility': 1323
'sea_water_salinity': 1336
'ecosystem': 1338
'human dimensions': 1338
'mapping': 1343
'sea_water_speed': 1344
'imagery': 1358
'langley-research-center': 1359
'exploration': 1360
'natural-resources': 1361
'remote-sensing': 1361
'waves': 1369
'groundwater': 1374
'chlorophyll': 1377
'relative humidity': 1382
'soils': 1384
'acoustic scattering': 1388
'sst': 1393
'pelagic': 1399
'marine': 1400
'whcmsc': 1402
'river_discharge': 1435
'1-percent-annual-chance flood': 1439
'technology': 1452
'colorado': 1466
'atlantic-ocean': 1467
'precipitation': 1476
'seawater': 1480
'ocean chemistry': 1487
'new york': 1491
'wind_speed_of_gust': 1495
'nevada': 1499
'coast and geodetic survey': 1502
'goddard-space-flight-center': 1503
'coastal base map': 1503
'coastal zone map': 1503
'glenn-research-center': 1505
'environmental monitoring': 1510
'multibeam': 1512
'volcanic-eruption-forecasting': 1519
'stewardship': 1520
'new-york': 1523
'volcanic-ash': 1529
'tp-sheet': 1532
't-sheet': 1540
'marsh': 1555
'woods-hole-coastal-and-marine-science-center': 1560
'wisconsin': 1562
'location': 1564
'lake-county-illinois': 1564
'transportation': 1593
'species': 1603
'marine-geophysics': 1606
'wetland-ecosystems': 1612
'geophysics': 1655
'ocean currents': 1671
'georgia': 1673
'marine ecosystems': 1675
'alabama': 1693
'western pacific ocean': 1695
'census': 1708
'oxygen': 1717
'surface': 1719
'relative_humidity': 1730
'climate': 1743
'mississippi': 1744
'datum': 1756
'marine-geology': 1762
'coastal processes': 1771
'authcdfw': 1771
'air_pressure': 1807
'autonomous underwater vehicles': 1810
'auvs': 1810
'seaglider': 1811
'pennsylvania': 1823
'maine': 1844
'california-department-of-fish-and-wildlife': 1850
'cdfw': 1850
'dem': 1854
'currents': 1887
'noaa-navy sanctuary soundscapes monitoring project': 1888
'dod/usnavy': 1889
'sanctsound': 1889
'u.s. department of defense': 1894
'earth science oceans': 1897
'u.s. navy': 1897
'ambient noise': 1899
'passive acoustic recorder': 1899
'recorders/loggers': 1902
'hydrophones': 1903
'gis': 1917
'fixed observation stations': 1918
'gulf of mexico': 1941
'marine habitat': 1955
'land-surface': 1956
'animals/invertebrates': 1960
'ocean carbon and acidification data system (ocads) project': 1974
'cetaceans': 1975
'ocean acidification data stewardship (oads) project': 1975
'marine environment monitoring': 1977
'ocean carbon data system (ocads) project': 1978
'hydrology': 2011
'boundaries': 2050
'doc/noaa/nos/nms': 2058
'national marine sanctuaries': 2063
'california-natural-resources-agency': 2074
'county': 2088
'land surface': 2088
'water pressure': 2098
'us': 2099
'science': 2104
'mammals': 2104
'meteorology': 2108
'animals/vertebrates': 2121
'national-geospatial-data-asset': 2128
'ecosystems': 2128
'caopendata': 2145
'texas': 2160
'height': 2187
'slocum': 2192
'underwater glider': 2192
'spray': 2194
'glider': 2210
'vegetation': 2223
'water_surface_height_above_reference_datum': 2243
'doc/noaa/nmfs': 2258
'wmo': 2261
'flood hazard data': 2269
'usa': 2274
'wildlife': 2278
'north-america': 2279
'water level': 2284
'barometric pressure': 2286
'biological classification': 2304
'wind_from_direction': 2338
'lidar': 2390
'benthic': 2429
'ocean pressure': 2459
'geology': 2493
'region 04': 2495
'elevation': 2528
'oregon': 2567
'coastal barrier resources system': 2571
'cbrs': 2572
'sea_water_density': 2579
'trajectory': 2584
'massachusetts': 2590
'wind_speed': 2602
'coastal': 2631
'virginia': 2632
'coastal maps': 2632
'noaa shoreline': 2633
'coastal survey': 2634
'water oceans and coasts theme': 2639
'wind': 2641
'north atlantic ocean': 2658
'geospatial-datasets': 2667
'coastal flooding': 2672
'great lakes': 2679
'coastal-and-marine-geology-program': 2706
'cmgp': 2719
'data': 2722
'habitat': 2746
'northward_sea_water_velocity': 2748
'national geospatial data asset': 2760
'eastward_sea_water_velocity': 2760
'hawaii': 2781
'air temperature': 2821
'sea_water_pressure': 2834
'temperature': 2890
'active': 2908
'maryland': 2915
'washington': 2928
'winds': 2976
'biota': 2998
'atlantic ocean': 2998
'topography': 3033
'density': 3061
'fish': 3152
'water column': 3185
'aquatic sciences': 3211
'salinity/density': 3233
'linearfeature': 3234
'rreservation or off-reservation trust land indicator': 3235
'maftiger feature class code': 3237
'primaryalternate code': 3237
'area hydrography identifier': 3237
'115th congressional district code': 3237
'public use microdata area codeland/water flag': 3238
'feature names': 3238
'prefix direction code': 3238
'prefix qualifier code': 3238
'prefix type code description': 3238
'suffix direction code': 3238
'suffix qualifier code': 3238
'suffix type code': 3238
'land/water flag': 3238
'fips place code for all places': 3239
'subminor civil division fips code in puerto rico': 3239
'5 digit zip code tabulation area code': 3240
'alaska native regional corporation fips code': 3240
'american indian/alaska native/native hawaiian areas census code': 3240
'census tract number': 3240
'consolidated city fips code': 3240
'county subdivision fips code': 3240
'elementary school district local education agency code': 3240
'legislative session year': 3240
'metropolitan statistical area/consolidated metropolitan statistical area fips code': 3240
'new england county metropolitan area fips code': 3240
'primary metropolitan statistical area fips code': 3240
'secondary school district local education agency code': 3240
'state legislative district lower chamber code': 3240
'state legislative district upper chamber code': 3240
'tabulation block number': 3240
'tribal subdivision code': 3240
'unified school district local education agency code': 3240
'urban area code': 3240
'urban growth area code': 3240
'imagerybasemapsearthcover': 3243
'railways': 3246
'sea_water_practical_salinity': 3258
'permanent face id': 3299
'air_temperature': 3306
'aquatic ecosystems': 3322
'feature': 3326
'linear': 3333
'ocean acoustics': 3340
'block group': 3409
'doc/noaa/nos/ngs': 3416
'national geodetic survey': 3416
'riverine flooding': 3431
'sea_water_electrical_conductivity': 3623
'dfirm database': 3655
'floodway': 3671
'base flood elevation': 3683
'fema flood hazard zone': 3690
'nfip': 3696
'sfha': 3706
'sea': 3707
'flood insurance rate map': 3712
'special flood hazard area': 3712
'louisiana': 3713
'firm': 3727
'fisheries': 3855
'number': 3877
'ocean temperature': 3972
'inlandwaters': 4183
'name': 4189
'vertical location': 4231
'shoreline': 4298
'national marine fisheries service': 4303
'new mexico': 4381
'florida': 4408
'u-s-geological-survey': 4441
'pacific ocean': 4647
'u.s.': 4811
'state or equivalent entity': 4892
'ngda': 4915
'usgs': 4968
'california': 5334
'polygon': 5613
'conductivity': 5613
'water': 5666
'water temperature': 5692
'dfirm': 5897
'digital flood insurance rate map': 5918
'depth': 5990
'united-states': 6222
'salinity': 6269
'topological faces': 6544
'linear feature': 6564
'sea_water_temperature': 6629
'global positioning system/inertial measurement unit': 6692
'gps/imu': 6692
'msbs': 6693
'multibeam swath bathymetry system': 6693
'biology': 6700
'positioning/navigation': 6726
'gps receivers': 6749
'multibeam mapping system': 6833
'mbes': 6983
'gps': 7058
'geoscientificinformation': 7078
'altitude': 7211
'passive remote sensing': 7494
'earth remote sensing instruments': 7756
'earth-science': 8162
'united states': 8176
'national ocean service': 8268
'completed': 8845
'alaska': 9616
'5-digit zip code': 9692
'from house number': 9692
'side indicator flag': 9692
'to house number': 9692
'zip +4 code': 9692
'table': 9779
'road feature': 9840
'roads': 10023
'street centerline': 10186
'address range': 10192
'atmosphere': 10228
'sound navigation and ranging': 10904
'sonar': 11184
'biosphere': 11560
'profilers/sounders': 12476
'in situ/laboratory instruments': 12577
'permanent edge id': 12932
'environment': 13505
'time': 13857
'oceanography': 13988
'longitude': 14018
'latitude': 14020
'acoustic sounders': 14091
'hydrography': 14379
'county gnis code': 16161
'state gnis code': 16222
'earth-science-oceans-marine-sediments-sediment-composition': 16335
'hydrographic-surveys-for-selected-locations-within-the-united-states-hydro_bathy_2006': 16415
'earth-remote-sensing-instruments-passive-remote-sensing-positioning-navigation-gps-gps-imu-glob': 16423
'earth-remote-sensing-instruments-passive-remote-sensing-positioning-navigation-gps-gps-receiver': 16423
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders': 16423
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-mbes-multibeam-mapping-syst': 16423
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-msbs-multibeam-swath-bathym': 16423
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-sonar-sound-navigation-and-': 16423
'in-situ-ocean-based-platforms-ships': 16476
'earth-science-oceans-bathymetry-seafloor-topography-water-depth': 16661
'earth-science-oceans-bathymetry-seafloor-topography-seafloor-topography': 16673
'doc/noaa/nesdis/ngdc': 16865
'national geophysical data center': 16865
'earth-science-oceans-bathymetry-seafloor-topography-bathymetry': 16983
'doc-noaa-nesdis-ngdc-national-geophysical-data-center': 17136
'u-s-department-of-commerce': 17683
'hydrographic surveys for selected locations within the united states (hydro_bathy_2006)': 18318
'sediment composition': 18332
'marine sediments': 18345
'county fips code': 19411
'state fips code': 19471
'ships': 21804
'water depth': 22292
'doc/noaa/nesdis/ncei': 23135
'national centers for environmental information': 23135
'seafloor topography': 23601
'in situ ocean-based platforms': 23985
'bathymetry/seafloor topography': 24168
'continent': 24411
'north america': 24475
'united states of america': 25069
'bathymetry': 26113
'u.s. department of commerce': 31085
'ocean': 33461
'county or equivalent entity': 35946
'oceans': 40376
'nesdis': 40599
'earth science': 40754
'noaa': 51569
This API call gives all tags
And the following URL gives you every tag used by 1000 or more datasets: https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1&facet.mincount=1000&rows=0
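The same facet query can be issued from Python. This is a minimal sketch using only the standard library; the function names `build_facet_url`, `tag_counts` and `fetch_tag_counts` are hypothetical, and the response shape (`result.facets.tags`) is assumed from the `jq` filter used elsewhere in this thread.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://catalog.data.gov/api/action/package_search"


def build_facet_url(min_count):
    """Build the tag-facet query URL for tags used by at least min_count datasets."""
    params = {
        "facet.field": '["tags"]',  # facet on the tags field
        "facet.limit": -1,          # no cap on the number of facet values
        "facet.mincount": min_count,
        "rows": 0,                  # only the facet counts, no dataset records
    }
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def tag_counts(payload):
    """Extract {tag: count} from a decoded response, sorted by ascending count."""
    # Assumed response shape: payload["result"]["facets"]["tags"] -> {tag: count}
    facets = payload["result"]["facets"]["tags"]
    return dict(sorted(facets.items(), key=lambda item: item[1]))


def fetch_tag_counts(min_count=1000):
    """Call the catalog API and return the sorted tag counts (needs network access)."""
    with urllib.request.urlopen(build_facet_url(min_count)) as response:
        return tag_counts(json.load(response))
```

Calling `fetch_tag_counts(1000)` should return a mapping shaped like the keyword list above, though the exact counts will differ as the catalog changes.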
First pass @ grouping keywords
Since we don't have a word model specifically trained for data.gov/open data, I used the off-the-shelf WordNet to find the shortest distance (or similarity) between words. More similar words would make sense to group together. The idea was to define similarity as our parameter and see what groups appear from the data. This is contrary to the other approach of trying to select
To help break down the complex keywords into simpler words that exist in WordNet, the following preprocessing was done:
It should be noted that this inherently caused some loss of contextual meaning. "North pacific ocean" is a specific area of the Pacific Ocean that might be more relevant if we cared about making sub-categories in our environmental/oceanographic/weather group; however, this granularity was not as important to capture. A word that would lose considerably more context is "North Carolina", which is the name of a state. Since this analysis was also going to be filtered through human eyes, I thought this was also an acceptable loss. We'd be able to understand that Carolina may or may not belong in a particular group. Another example of context loss is
The first pass used the ideas mentioned above (scripted here) to analyze the top 1000 most frequent keywords. Using a distance of
I will run a few more permutations of this algorithm, but I don't think it's going to give as much insight as we'd hope.
Next steps:
I'm going to train a basic word model based on the catalog. The premise is that tags on a single dataset are, by implementation, similar. The more datasets two tags appear in together, the more similar the tags are. In this way, the word model does not need to know the definitions of words or the relationships between them. Words that were excluded in the previous analysis will be included here. Also, keywords do not need to be broken down. If complex tags were created for a reason, they can be preserved. The only meddling that I will do is to weed out the nonsense tags that even we, as humans, would not be able to make sense of. This would include tags such as
Another note about WordNet: it would have been more useful if we could have isolated the exact meaning of each word and pulled that synset from WordNet, as it would have given a more meaningful distance. Since I didn't know which sense each word referred to, I had no choice but to average over all senses of the word, which muddled the results too much.
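To illustrate why averaging over every sense blurs the result, here is a small self-contained sketch. The sense inventory and the similarity scores are made up for illustration (they are not WordNet values), and the function names are hypothetical.

```python
from itertools import product

# Hypothetical sense inventory: each word maps to labels for its senses.
SENSES = {
    "channel": ["channel.waterway", "channel.broadcast"],
    "stream": ["stream.waterway"],
}

# Hypothetical sense-to-sense similarities in [0, 1]; NOT taken from WordNet.
SENSE_SIMILARITY = {
    frozenset(["channel.waterway", "stream.waterway"]): 0.9,
    frozenset(["channel.broadcast", "stream.waterway"]): 0.1,
}


def sense_similarity(sense_a, sense_b):
    """Look up the similarity of two senses; unknown pairs score 0."""
    if sense_a == sense_b:
        return 1.0
    return SENSE_SIMILARITY.get(frozenset([sense_a, sense_b]), 0.0)


def average_similarity(word_a, word_b):
    """Average over every sense pair, as done when the intended sense is unknown."""
    pairs = list(product(SENSES[word_a], SENSES[word_b]))
    return sum(sense_similarity(a, b) for a, b in pairs) / len(pairs)


def best_similarity(word_a, word_b):
    """Similarity of the closest sense pair, which is what a known sense would give."""
    pairs = product(SENSES[word_a], SENSES[word_b])
    return max(sense_similarity(a, b) for a, b in pairs)
```

With these toy numbers the unrelated broadcast sense drags the averaged score for "channel" and "stream" well below the score of the matching waterway senses, which is the muddling effect described above.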
I'm very skeptical of the approach I'm about to document. I may have made errors along the way that will greatly skew the usefulness or accuracy of the results (and I hope the team will review this and let me know if there are any gaps or errors). All of the code is in a gist: https://gist.github.com/nickumia/4f034ae951349a9dea5fda999f935405
Key Motivation Points
# N x N matrix form
[[ 0 4 2],
[ 0 0 7],
[ 0 0 0]]
# Dictionary form
{ '0,1': 4,
'0,2': 2,
'1,2': 7}
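The two forms above hold the same pairwise values. Here is a minimal sketch of converting between them, assuming only the upper triangle of the matrix is used and zero entries are left out of the dictionary (an assumption; the example has no zero-valued pairs). The function names are hypothetical and not taken from the gist.

```python
def matrix_to_pairs(matrix):
    """Convert an upper-triangular N x N matrix into {'i,j': value} form."""
    pairs = {}
    size = len(matrix)
    for i in range(size):
        for j in range(i + 1, size):  # upper triangle only, skipping the diagonal
            if matrix[i][j]:          # assumption: zero means "no relationship"
                pairs[f"{i},{j}"] = matrix[i][j]
    return pairs


def pairs_to_matrix(pairs, size):
    """Rebuild the upper-triangular matrix from the dictionary form."""
    matrix = [[0] * size for _ in range(size)]
    for key, value in pairs.items():
        i, j = (int(part) for part in key.split(","))
        matrix[i][j] = value
    return matrix
```

The dictionary form stores only the pairs that actually occur, which matters when most tag pairs never appear together.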
Specific Implementation Details
{ 'f9880479-bf5c-477c-ba8e-0651a0e054a5': ['new-york-lottery', 'powerball', 'results', 'winning'],
'4cfdbe83-666d-4e72-b8c6-31dbcdd8dbf0': ['assistance-transactions', 'banks', 'failures', 'financial-institution'],
...
}
curl "https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1&facet.mincount=2&rows=0" | jq '.result.facets.tags' > keywords.json
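The premise stated earlier (tags that share datasets are similar) can be sketched as a co-occurrence count over the dataset-to-tags mapping shown above. This is an illustrative sketch, not the code from the gist; the function name `cooccurrence` and the choice to index tags in sorted order are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(dataset_tags):
    """Count how many datasets each pair of tags shares.

    dataset_tags maps a dataset id to its list of tags. Returns the sorted tag
    vocabulary and a {'i,j': count} dictionary keyed by vocabulary indices (i < j).
    """
    vocab = sorted({tag for tags in dataset_tags.values() for tag in tags})
    index = {tag: position for position, tag in enumerate(vocab)}
    counts = Counter()
    for tags in dataset_tags.values():
        # De-duplicate and sort so each pair is counted once per dataset.
        ids = sorted({index[tag] for tag in tags})
        for i, j in combinations(ids, 2):
            counts[f"{i},{j}"] += 1
    return vocab, dict(counts)
```

For example, passing the two datasets shown above would give every pair of tags within the same dataset a count of 1, and no entry for pairs that span the two datasets.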
Results
This will take some time to fully complete. Also... we might want to limit the results in some way because it will group
I'm going to optimize three parameters to determine the most diverse distribution of words (or the most groups with more than one word). These three parameters are: (1) the relatedness tolerance between words in a group, (2) when to create a new group, and (3) how many words to include in the analysis.
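One way the three parameters could interact is a simple greedy grouping, sketched below. This is an assumed interpretation, not the algorithm from the gist: `tolerance` is the minimum relatedness for two words to count as related, `min_fraction` is the share of a group's members a word must be related to before it joins (otherwise a new group is created), and `top_n` limits how many words are analyzed. All names are hypothetical.

```python
def group_words(words, relatedness, tolerance, min_fraction, top_n):
    """Greedily place each of the first top_n words into a group.

    words: tags ordered from most to least frequent.
    relatedness: function (word_a, word_b) -> score, higher means more related.
    """
    groups = []
    for word in words[:top_n]:
        for group in groups:
            related = sum(1 for member in group if relatedness(word, member) >= tolerance)
            if related / len(group) >= min_fraction:
                group.append(word)  # close enough to this group: join it
                break
        else:
            groups.append([word])   # no existing group fits: start a new one
    return groups
```

Sweeping `tolerance`, `min_fraction` and `top_n` and counting the groups with more than one word would be one way to look for the most diverse distribution described above.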
References for the graph visualization above:
Other references for previous work:
As a summary of where this leaves us:
The job of grouping datasets into logical groups that improve discoverability and accessibility is not a simple one. I have explored two paths in the above analyses: (1) using an off-the-shelf word model to process the tags and perform a similarity comparison, and (2) building a custom word model to process the tags and highlight relational similarities to group tags. Both algorithms used tags as the driving point to create groups.
As @jbrown-xentity noted, the tags from an agency are typically all created by the same publisher. From this perspective, all of the tags from one publisher might have a biased similarity towards the publisher and not the dataset itself. I don't think this is entirely true, but it is a valid concern nonetheless. A proper analysis would take all of the non-standard text from a dataset (title, description, tags and any unique extras fields) to build the model, which would give a more complete picture of the datasets. Even with this, the descriptions might also suffer from writer bias, so this is not a foolproof method either.
The focus on tags in this ticket was: (1) to fine-tune scope and (2) to focus on the algorithm design via data discovery. I think, regardless of writer biases, we only have the data that we have, so if writer bias is an inhibiting factor, we need to raise that with the agencies and make sure they intended for the datasets to be worded the way that they are, and that their wording is accurate and consistent with the data that it's describing. This collaboration is not easy, but it is a necessary part if we want to remove biases from our analysis.
Many points have been mentioned in the previous comments. As the key takeaway points:
Appendix A. Word Similarity
Appendix B. Word Taxonomies
The federal government itself is a taxonomy. There is the Executive Branch. Within the Executive Branch, there are a host of agencies, such as the Department of Defense. The Department of Defense then has agencies, such as the Department of the Navy. The Department of the Navy then has sub-agencies, such as NAVAIR, NAVSEA, NAVSUB, et cetera. Each of those then has divisions such as Aircraft Division or Weapons Division. While the taxonomy that exists by design of the government is helpful, it is not complete or otherwise self-describing enough to use as the sole basis of our analysis. Each agency would have its own definition for words like health, finance, education and transportation. Creating a universal taxonomy that can be applied to such a wide range of data types and sources may not be possible; however, a system that aggregates all of the different taxonomies from each agency might be possible. Either way, we need to build a reference to understand relationships between data.
thanks for doing this @nickumia-reisys this is good to have for future discussion
User Story
In order to identify Subject Areas, the data.gov User Engagement team wants to capture the most used keywords for datasets and the number of datasets with each keyword.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
WHEN I look at this ticket
THEN there is an overview of keywords used on catalog.data.gov.
Background
[Any helpful contextual notes or links to artifacts/evidence, if needed]
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch
[Notes or a checklist reflecting our understanding of the selected approach]