Perform Keyword Analysis on datasets available on catalog.data.gov #4068

Closed
1 task done
nickumia-reisys opened this issue Nov 18, 2022 · 11 comments
Labels
component/catalog Related to catalog component playbooks/roles Feature Mission & Vision

Comments

@nickumia-reisys
Contributor

nickumia-reisys commented Nov 18, 2022

User Story

In order to identify Subject Areas, the data.gov User Engagement team wants to capture the most used keywords for datasets and the number of datasets with each keyword.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN an analysis has been performed
    WHEN I look at this ticket
    THEN there is an overview of keywords used on catalog.data.gov.

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

@nickumia-reisys nickumia-reisys self-assigned this Nov 18, 2022
@nickumia-reisys
Contributor Author

nickumia-reisys commented Nov 21, 2022

Initial Analysis Complete

Key notes:

  • This is without any pre-processing, just taking the existing keywords as complete units
  • There are 287118 unique keywords.
  • 459 appear more than 1000 times.
Keywords that appear more than 1000 times
'north pacific ocean': 1002
'national data buoy center': 1004
'moored buoy': 1005
'nm': 1007
'chemistry': 1008
'dart': 1009
'ndbc': 1009
'state-of-louisiana': 1010
'c-man': 1014
'ctdtmp': 1015
'school': 1018
'tennessee': 1019
'profile': 1026
'coral': 1031
'telepresence': 1032
'north-carolina': 1035
'r337': 1039
'okeanos': 1041
'stream': 1045
'usgs national water information system (nwis)': 1045
'scs': 1047
'cartography': 1051
'wetlands': 1052
'new jersey': 1053
'coastal-processes': 1054
'quality': 1056
'geographic-information-system': 1060
'protection': 1070
'michigan': 1078
'ocean waves': 1083
'doc/noaa/nos/orr': 1084
'office of response and restoration': 1085
'nasa': 1092
'montana': 1093
'date': 1103
'boundary': 1104
'discrete measurement': 1104
'wyoming': 1106
'oer': 1109
'health': 1111
'geothermal': 1112
'energy': 1113
'platform_orientation': 1113
'sea surface temperature': 1116
'philipine-islands': 1119
'distributions': 1131
'geographic cell': 1131
'shoreline mapping program': 1132
'coastal mapping program': 1133
'national shoreline': 1133
'undersea': 1135
'underwater': 1138
'geomorphology': 1141
'explorer': 1142
'depth status_flag': 1146
'eastward_sea_water_velocity status_flag': 1146
'latitude status_flag': 1146
'longitude status_flag': 1146
'northward_sea_water_velocity status_flag': 1146
'time status_flag': 1146
'restoration': 1159
'idaho': 1161
'united-states-of-america': 1173
'reef': 1177
'platform_pitch_angle': 1186
'platform_roll_angle': 1186
'water column mapping system': 1187
'wcms': 1187
'philippines': 1188
'weather': 1190
'expedition': 1195
'johnson-space-center': 1196
'printed-maps': 1199
'ames-research-center': 1201
'aircraft': 1209
'sea_water_density status_flag': 1211
'wetland': 1211
'north carolina': 1215
'2006 tiger second edition': 1216
'census data': 1216
'tiger data': 1216
'gulf-of-mexico': 1224
'sea_water_temperature status_flag': 1226
'sea_water_pressure status_flag': 1229
'orthophoto': 1230
'delaware': 1251
'harmonic constituents': 1253
'rain fall': 1253
'water level predictions': 1254
'connecticut': 1255
'sea_water_electrical_conductivity status_flag': 1257
'arizona': 1259
'identifier': 1259
'pdf': 1269
'doqq': 1275
'channel': 1277
'tao': 1277
'floodplain mapping': 1284
'utah': 1286
'station': 1288
'jet-propulsion-laboratory': 1291
'spectral-engineering': 1304
'sea-floor-characteristics': 1308
'visibility': 1323
'sea_water_salinity': 1336
'ecosystem': 1338
'human dimensions': 1338
'mapping': 1343
'sea_water_speed': 1344
'imagery': 1358
'langley-research-center': 1359
'exploration': 1360
'natural-resources': 1361
'remote-sensing': 1361
'waves': 1369
'groundwater': 1374
'chlorophyll': 1377
'relative humidity': 1382
'soils': 1384
'acoustic scattering': 1388
'sst': 1393
'pelagic': 1399
'marine': 1400
'whcmsc': 1402
'river_discharge': 1435
'1-percent-annual-chance flood': 1439
'technology': 1452
'colorado': 1466
'atlantic-ocean': 1467
'precipitation': 1476
'seawater': 1480
'ocean chemistry': 1487
'new york': 1491
'wind_speed_of_gust': 1495
'nevada': 1499
'coast and geodetic survey': 1502
'goddard-space-flight-center': 1503
'coastal base map': 1503
'coastal zone map': 1503
'glenn-research-center': 1505
'environmental monitoring': 1510
'multibeam': 1512
'volcanic-eruption-forecasting': 1519
'stewardship': 1520
'new-york': 1523
'volcanic-ash': 1529
'tp-sheet': 1532
't-sheet': 1540
'marsh': 1555
'woods-hole-coastal-and-marine-science-center': 1560
'wisconsin': 1562
'location': 1564
'lake-county-illinois': 1564
'transportation': 1593
'species': 1603
'marine-geophysics': 1606
'wetland-ecosystems': 1612
'geophysics': 1655
'ocean currents': 1671
'georgia': 1673
'marine ecosystems': 1675
'alabama': 1693
'western pacific ocean': 1695
'census': 1708
'oxygen': 1717
'surface': 1719
'relative_humidity': 1730
'climate': 1743
'mississippi': 1744
'datum': 1756
'marine-geology': 1762
'coastal processes': 1771
'authcdfw': 1771
'air_pressure': 1807
'autonomous underwater vehicles': 1810
'auvs': 1810
'seaglider': 1811
'pennsylvania': 1823
'maine': 1844
'california-department-of-fish-and-wildlife': 1850
'cdfw': 1850
'dem': 1854
'currents': 1887
'noaa-navy sanctuary soundscapes monitoring project': 1888
'dod/usnavy': 1889
'sanctsound': 1889
'u.s. department of defense': 1894
'earth science oceans': 1897
'u.s. navy': 1897
'ambient noise': 1899
'passive acoustic recorder': 1899
'recorders/loggers': 1902
'hydrophones': 1903
'gis': 1917
'fixed observation stations': 1918
'gulf of mexico': 1941
'marine habitat': 1955
'land-surface': 1956
'animals/invertebrates': 1960
'ocean carbon and acidification data system (ocads) project': 1974
'cetaceans': 1975
'ocean acidification data stewardship (oads) project': 1975
'marine environment monitoring': 1977
'ocean carbon data system (ocads) project': 1978
'hydrology': 2011
'boundaries': 2050
'doc/noaa/nos/nms': 2058
'national marine sanctuaries': 2063
'california-natural-resources-agency': 2074
'county': 2088
'land surface': 2088
'water pressure': 2098
'us': 2099
'science': 2104
'mammals': 2104
'meteorology': 2108
'animals/vertebrates': 2121
'national-geospatial-data-asset': 2128
'ecosystems': 2128
'caopendata': 2145
'texas': 2160
'height': 2187
'slocum': 2192
'underwater glider': 2192
'spray': 2194
'glider': 2210
'vegetation': 2223
'water_surface_height_above_reference_datum': 2243
'doc/noaa/nmfs': 2258
'wmo': 2261
'flood hazard data': 2269
'usa': 2274
'wildlife': 2278
'north-america': 2279
'water level': 2284
'barometric pressure': 2286
'biological classification': 2304
'wind_from_direction': 2338
'lidar': 2390
'benthic': 2429
'ocean pressure': 2459
'geology': 2493
'region 04': 2495
'elevation': 2528
'oregon': 2567
'coastal barrier resources system': 2571
'cbrs': 2572
'sea_water_density': 2579
'trajectory': 2584
'massachusetts': 2590
'wind_speed': 2602
'coastal': 2631
'virginia': 2632
'coastal maps': 2632
'noaa shoreline': 2633
'coastal survey': 2634
'water oceans and coasts theme': 2639
'wind': 2641
'north atlantic ocean': 2658
'geospatial-datasets': 2667
'coastal flooding': 2672
'great lakes': 2679
'coastal-and-marine-geology-program': 2706
'cmgp': 2719
'data': 2722
'habitat': 2746
'northward_sea_water_velocity': 2748
'national geospatial data asset': 2760
'eastward_sea_water_velocity': 2760
'hawaii': 2781
'air temperature': 2821
'sea_water_pressure': 2834
'temperature': 2890
'active': 2908
'maryland': 2915
'washington': 2928
'winds': 2976
'biota': 2998
'atlantic ocean': 2998
'topography': 3033
'density': 3061
'fish': 3152
'water column': 3185
'aquatic sciences': 3211
'salinity/density': 3233
'linearfeature': 3234
'rreservation or off-reservation trust land indicator': 3235
'maftiger feature class code': 3237
'primaryalternate code': 3237
'area hydrography identifier': 3237
'115th congressional district code': 3237
'public use microdata area codeland/water flag': 3238
'feature names': 3238
'prefix direction code': 3238
'prefix qualifier code': 3238
'prefix type code description': 3238
'suffix direction code': 3238
'suffix qualifier code': 3238
'suffix type code': 3238
'land/water flag': 3238
'fips place code for all places': 3239
'subminor civil division fips code in puerto rico': 3239
'5 digit zip code tabulation area code': 3240
'alaska native regional corporation fips code': 3240
'american indian/alaska native/native hawaiian areas census code': 3240
'census tract number': 3240
'consolidated city fips code': 3240
'county subdivision fips code': 3240
'elementary school district local education agency code': 3240
'legislative session year': 3240
'metropolitan statistical area/consolidated metropolitan statistical area fips code': 3240
'new england county metropolitan area fips code': 3240
'primary metropolitan statistical area fips code': 3240
'secondary school district local education agency code': 3240
'state legislative district lower chamber code': 3240
'state legislative district upper chamber code': 3240
'tabulation block number': 3240
'tribal subdivision code': 3240
'unified school district local education agency code': 3240
'urban area code': 3240
'urban growth area code': 3240
'imagerybasemapsearthcover': 3243
'railways': 3246
'sea_water_practical_salinity': 3258
'permanent face id': 3299
'air_temperature': 3306
'aquatic ecosystems': 3322
'feature': 3326
'linear': 3333
'ocean acoustics': 3340
'block group': 3409
'doc/noaa/nos/ngs': 3416
'national geodetic survey': 3416
'riverine flooding': 3431
'sea_water_electrical_conductivity': 3623
'dfirm database': 3655
'floodway': 3671
'base flood elevation': 3683
'fema flood hazard zone': 3690
'nfip': 3696
'sfha': 3706
'sea': 3707
'flood insurance rate map': 3712
'special flood hazard area': 3712
'louisiana': 3713
'firm': 3727
'fisheries': 3855
'number': 3877
'ocean temperature': 3972
'inlandwaters': 4183
'name': 4189
'vertical location': 4231
'shoreline': 4298
'national marine fisheries service': 4303
'new mexico': 4381
'florida': 4408
'u-s-geological-survey': 4441
'pacific ocean': 4647
'u.s.': 4811
'state or equivalent entity': 4892
'ngda': 4915
'usgs': 4968
'california': 5334
'polygon': 5613
'conductivity': 5613
'water': 5666
'water temperature': 5692
'dfirm': 5897
'digital flood insurance rate map': 5918
'depth': 5990
'united-states': 6222
'salinity': 6269
'topological faces': 6544
'linear feature': 6564
'sea_water_temperature': 6629
'global positioning system/inertial measurement unit': 6692
'gps/imu': 6692
'msbs': 6693
'multibeam swath bathymetry system': 6693
'biology': 6700
'positioning/navigation': 6726
'gps receivers': 6749
'multibeam mapping system': 6833
'mbes': 6983
'gps': 7058
'geoscientificinformation': 7078
'altitude': 7211
'passive remote sensing': 7494
'earth remote sensing instruments': 7756
'earth-science': 8162
'united states': 8176
'national ocean service': 8268
'completed': 8845
'alaska': 9616
'5-digit zip code': 9692
'from house number': 9692
'side indicator flag': 9692
'to house number': 9692
'zip +4 code': 9692
'table': 9779
'road feature': 9840
'roads': 10023
'street centerline': 10186
'address range': 10192
'atmosphere': 10228
'sound navigation and ranging': 10904
'sonar': 11184
'biosphere': 11560
'profilers/sounders': 12476
'in situ/laboratory instruments': 12577
'permanent edge id': 12932
'environment': 13505
'time': 13857
'oceanography': 13988
'longitude': 14018
'latitude': 14020
'acoustic sounders': 14091
'hydrography': 14379
'county gnis code': 16161
'state gnis code': 16222
'earth-science-oceans-marine-sediments-sediment-composition': 16335
'hydrographic-surveys-for-selected-locations-within-the-united-states-hydro_bathy_2006': 16415
'earth-remote-sensing-instruments-passive-remote-sensing-positioning-navigation-gps-gps-imu-glob': 16423
'earth-remote-sensing-instruments-passive-remote-sensing-positioning-navigation-gps-gps-receiver': 16423
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders': 16423
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-mbes-multibeam-mapping-syst': 16423
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-msbs-multibeam-swath-bathym': 16423
'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-sonar-sound-navigation-and-': 16423
'in-situ-ocean-based-platforms-ships': 16476
'earth-science-oceans-bathymetry-seafloor-topography-water-depth': 16661
'earth-science-oceans-bathymetry-seafloor-topography-seafloor-topography': 16673
'doc/noaa/nesdis/ngdc': 16865
'national geophysical data center': 16865
'earth-science-oceans-bathymetry-seafloor-topography-bathymetry': 16983
'doc-noaa-nesdis-ngdc-national-geophysical-data-center': 17136
'u-s-department-of-commerce': 17683
'hydrographic surveys for selected locations within the united states (hydro_bathy_2006)': 18318
'sediment composition': 18332
'marine sediments': 18345
'county fips code': 19411
'state fips code': 19471
'ships': 21804
'water depth': 22292
'doc/noaa/nesdis/ncei': 23135
'national centers for environmental information': 23135
'seafloor topography': 23601
'in situ ocean-based platforms': 23985
'bathymetry/seafloor topography': 24168
'continent': 24411
'north america': 24475
'united states of america': 25069
'bathymetry': 26113
'u.s. department of commerce': 31085
'ocean': 33461
'county or equivalent entity': 35946
'oceans': 40376
'nesdis': 40599
'earth science': 40754
'noaa': 51569

@FuhuXia
Member

FuhuXia commented Nov 21, 2022

This API call gives all tags:

curl "https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1" | jq '.result.facets.tags'

@jbrown-xentity
Contributor

And the following URL gives you every tag with > 1000 datasets: https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1&facet.mincount=1000&rows=0
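
For anyone who would rather script this than pipe curl through jq, here is a minimal Python sketch of the same request. It assumes the response shape implied by the jq filter above (`result.facets.tags` mapping each tag to its dataset count). The function names `build_facet_url`, `tags_from_response`, `most_used` and `fetch_tag_counts` are made up for illustration and are not part of any CKAN client.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://catalog.data.gov/api/action/package_search"


def build_facet_url(min_count=1):
    """Build the tag-facet query shown above; rows=0 skips the dataset bodies."""
    params = {
        "facet.field": '["tags"]',
        "facet.limit": -1,            # negative limit: do not cap the facet values
        "facet.mincount": min_count,  # only tags used on at least this many datasets
        "rows": 0,
    }
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def tags_from_response(payload):
    """Pull the {tag: count} mapping out of a parsed response (assumed shape)."""
    return payload["result"]["facets"]["tags"]


def most_used(tag_counts, threshold):
    """Return (tag, count) pairs with count > threshold, most used first."""
    kept = [(tag, n) for tag, n in tag_counts.items() if n > threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def fetch_tag_counts(min_count=1):
    """Perform the request (needs network access)."""
    with urllib.request.urlopen(build_facet_url(min_count)) as response:
        return tags_from_response(json.load(response))
```

Something like `most_used(fetch_tag_counts(1000), 1000)` would then give the "more than 1000 times" list, subject to the same rate limiting as the curl call.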

@nickumia-reisys
Contributor Author

First pass @ grouping keywords

Since we don't have a word model specifically trained for data.gov/open data, I used the off-the-shelf WordNet to find the shortest distance (or similarity) between words. More similar words would make sense to group together. The idea was to define similarity as our parameter and see what groups appear from the data. This is contrary to the other approach of trying to select N groups and then forcing words into one of the N groups.

To help break down the complex keywords into simpler words that exist in WordNet, the following preprocessing was done:

  • 'north pacific ocean' to 'north', 'pacific', 'ocean'
  • 'doc/noaa/nesdis/ncei' to 'doc', 'noaa', 'nesdis', 'ncei'
  • 'in-situ-laboratory-instruments-profilers-sounders-acoustic-sounders-mbes-multibeam-mapping-syst' to 'in', 'situ', 'laboratory', 'instruments', 'profilers' ...
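
A minimal sketch of this kind of splitting is below. It assumes the delimiters are whitespace, slashes and hyphens, which is all the examples above show; the actual script may treat other characters (for example underscores) differently. The name `split_keyword` is hypothetical.

```python
import re

# Split on runs of whitespace, slashes, or hyphens (an assumed delimiter set).
_DELIMITERS = re.compile(r"[\s/\-]+")


def split_keyword(keyword):
    """Break a compound keyword into lowercase single words."""
    return [part for part in _DELIMITERS.split(keyword.lower()) if part]


print(split_keyword("north pacific ocean"))   # ['north', 'pacific', 'ocean']
print(split_keyword("doc/noaa/nesdis/ncei"))  # ['doc', 'noaa', 'nesdis', 'ncei']
```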

It should be noted that this inherently caused some loss of contextual meaning. "North pacific ocean" is a specific area of the Pacific Ocean that might be more relevant if we cared about making sub-categories in our environmental/oceanographic/weather group; however, this granularity was not important to capture here. A keyword that loses considerably more context is "North Carolina", which is the name of a state. Since this analysis was also going to be filtered through human eyes, I thought this was an acceptable loss: we'd be able to understand that "Carolina" may or may not belong in a particular group. Another example of context loss is 'lake-county-illinois': 1564. Lake County, Illinois does not mean there's data about lakes or counties. I'm not sure if this is an acceptable error; however, there is clearly location-based data, and that context will be accounted for by us as humans as well.

The first pass used the ideas mentioned above, scripted here, to analyze the 1000 most frequent keywords. Using a distance of 5 between words, the groups in preliminary_lt_5 were created. Note that many words were custom agency acronyms or specific abbreviations coined since WordNet was built, so their relevance could not be determined.

preliminary_lt_5.txt

  • There are 169 groupings.
  • Not a lot of words are grouped together

preliminary_lt_8.txt

  • There are 98 groupings.
  • One group became really large, becoming a local maximum.

I will run a few more permutations of this algorithm, but I don't think it's going to give as much insight as we'd hope.

Next steps:

I'm going to train a basic word model based on the catalog. The premise is that tags on a single dataset are, by implementation, similar. The more datasets two tags appear on together, the more similar the tags are. In this way, the word model does not need to know definitions of words or the relationships between them. Words that were excluded in the previous analysis will be included here. Also, keywords do not need to be broken down. If complex tags were created for a reason, they can be preserved.

The only meddling that I will do is weed out the nonsense tags that even we, as humans, would not be able to make sense of. This would include tags such as !c07, (58aa9402 leg 2) and 01asr02.

@nickumia-reisys
Contributor Author

Another note about WordNet: it would have been more useful if we could have isolated the exact meaning of each word and pulled that synset from WordNet, as that would have given a more meaningful distance. Since I didn't know which sense each word was referring to, I had no choice but to average over all senses of the word, which muddled the results too much.
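
To make the "average over all senses" idea concrete, here is a rough sketch. The exact averaging used in the script is not shown in this thread, so this is an assumption: it averages the path distance over every pairing of senses and skips pairs with no path. `average_sense_distance` and `wordnet_word_distance` are hypothetical names; `wordnet.synsets` and `Synset.shortest_path_distance` are the NLTK calls used in Appendix A.

```python
from itertools import product
from statistics import mean


def average_sense_distance(senses_a, senses_b, distance):
    """Average `distance` over every pairing of senses, skipping undefined pairs.

    Returns None when no pair of senses has a defined distance.
    """
    found = []
    for sense_a, sense_b in product(senses_a, senses_b):
        d = distance(sense_a, sense_b)
        if d is not None:
            found.append(d)
    return mean(found) if found else None


def wordnet_word_distance(word_a, word_b):
    """Apply the averaging to WordNet synsets (requires nltk and its WordNet corpus)."""
    # Imported lazily so the helper above has no third-party dependency.
    from nltk.corpus import wordnet

    return average_sense_distance(
        wordnet.synsets(word_a),
        wordnet.synsets(word_b),
        lambda s, t: s.shortest_path_distance(t),
    )
```

Because unrelated senses (such as the "urine" sense of "water") are averaged in with the relevant ones, the result drifts toward a middling distance for most word pairs, which is one way to read the muddling described above.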

@nickumia

nickumia commented Nov 27, 2022

I'm very skeptical of the approach I'm about to document. I may have made errors along the way that will greatly skew the usefulness or accuracy of the results (and I hope the team will review this and let me know if there are any gaps or errors). All of the code is in a gist: https://gist.github.com/nickumia/4f034ae951349a9dea5fda999f935405

Key Motivation Points

  • All of the keywords that appear on a single dataset are related.
  • The more datasets that two keywords appear together on, the more related the two keywords are.
  • Keywords that appear on only 1 dataset are not meaningful since they will be leaf nodes where no new branches can be created.
  • Keyword relatedness can be expressed as a weighted non-directional graph (an upper/lower N x N triangular matrix).
    • There is no directionality between keyword relatedness. In other words, if two datasets have keyword a and keyword b, the only useful information is that there is a single connection between keyword a and keyword b with a weight of 2. It doesn't matter if the keywords were discovered keyword a -> keyword b : 1 and keyword b -> keyword a : 1.
  • Since an N x N matrix is memory-inefficient for sparse matrices, the graph can be represented by a python dictionary with keys of the N x N indices and values of the matrix value at the referenced index.
# N x N matrix form
[[ 0 4 2],
 [ 0 0 7],
 [ 0 0 0]]
# Dictionary form
{ '0,1': 4,
  '0,2': 2,
  '1,2': 7}
  • Clustering of keywords is based on this sense of relatedness. A random word is chosen to represent a new group. All other words are compared to that word and words within a specific tolerance are grouped with that word. When the tolerance is exceeded, a new random word (not previously visited) forms a new group. The clustering is recursive until all words have been visited at least once.
  • Clusters may be mutually exclusive (but it is not programmed that way currently).
  • The output is a dictionary of lists of words that are arbitrarily "id-ed". The "id" has no meaning other than to identify a group of words.
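
The matrix-to-dictionary conversion described in the points above can be sketched as follows. This is an illustration rather than the code from the gist; `matrix_to_sparse_dict` and `weight_between` are hypothetical names, and the 'row,col' key format follows the example shown above.

```python
def matrix_to_sparse_dict(matrix):
    """Keep only the non-zero cells above the diagonal, keyed as 'row,col'."""
    sparse = {}
    for row in range(len(matrix)):
        for col in range(row + 1, len(matrix)):
            weight = matrix[row][col]
            if weight:
                sparse[f"{row},{col}"] = weight
    return sparse


def weight_between(sparse, a, b):
    """Look up an undirected edge; the order of a and b does not matter."""
    low, high = sorted((a, b))
    return sparse.get(f"{low},{high}", 0)


matrix = [[0, 4, 2],
          [0, 0, 7],
          [0, 0, 0]]
sparse = matrix_to_sparse_dict(matrix)
print(sparse)                        # {'0,1': 4, '0,2': 2, '1,2': 7}
print(weight_between(sparse, 2, 1))  # 7
```

Sorting the two indices before building the key is what makes the graph non-directional: keyword a -> keyword b and keyword b -> keyword a land on the same entry.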

Specific Implementation Details

  • catalog.data.gov was crawled to generate a dictionary of each dataset and all of the keywords associated with that dataset. This turned out to be 93M. Note: There were 242842 datasets obtained. This may be influenced by duplicates and/or missing datasets from cloudfront rate limiting. The number seemed close enough that I felt like these were acceptable uncertainties.
{ 'f9880479-bf5c-477c-ba8e-0651a0e054a5': ['new-york-lottery', 'powerball', 'results', 'winning'],
  '4cfdbe83-666d-4e72-b8c6-31dbcdd8dbf0': ['assistance-transactions', 'banks', 'failures', 'financial-institution'],
...
}
  • A list of keywords that appeared more than once was retrieved with the following.
curl "https://catalog.data.gov/api/action/package_search?facet.field=[%22tags%22]&facet.limit=-1&facet.mincount=2&rows=0" | jq '.result.facets.tags' > keywords.json
  • For each keyword in each dataset, if the keyword was in the "list of keywords that appeared more than once", it was assigned a random ID and tallied, based on that ID, into the N x N graph matrix. This resulted in a 17.3G file, which is why I needed to use my personal computer.
  • This graph matrix was then converted to the memory-optimized dictionary. This was reduced to 13M.
  • The memory-optimized dictionary was then used to do the clustering mentioned above.
  • The code is very much fragmented right now, so if we need to replicate this process or improve it, I'll need to clean up the code ... a lot haha.
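
As a rough sketch of the tallying step, the code below builds the dictionary form directly from the crawled `{dataset_id: [keywords]}` mapping. It differs from the steps above in two ways that are my assumptions, not the gist's behaviour: IDs are assigned in sorted order rather than randomly, and the dense N x N matrix is skipped entirely. `build_cooccurrence` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(datasets, allowed_keywords):
    """Tally how many datasets each pair of allowed keywords shares.

    datasets: {dataset_id: [keyword, ...]}, as in the crawl output above.
    allowed_keywords: keywords that appear on more than one dataset.
    Returns (keyword_ids, edges) where edges uses 'i,j' keys with i < j.
    """
    # Sequential IDs in sorted order (the steps above describe random IDs).
    keyword_ids = {kw: i for i, kw in enumerate(sorted(allowed_keywords))}
    edges = Counter()
    for keywords in datasets.values():
        # De-duplicate and sort so each pair is counted once per dataset.
        ids = sorted({keyword_ids[kw] for kw in keywords if kw in keyword_ids})
        for low, high in combinations(ids, 2):
            edges[f"{low},{high}"] += 1
    return keyword_ids, dict(edges)
```

Tallying straight into a dictionary keeps memory proportional to the number of keyword pairs that actually co-occur, which may avoid the very large intermediate file.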

Results

This will take some time to fully complete. Also... we might want to limit the results in some way because it will group 93691 keywords into N groups. 93K words are a lot to parse. As an initial pass, I ran it on the keywords that appear 1000 times or more (483 words). This created 96 groups. 9 of them have more than 1 word in them and would therefore be the most helpful in coming up with a category.

output_top_483.txt

I'm going to optimize three parameters to determine the most diverse distribution of words (or the most groups with more than one word). These three parameters are: (1) the relatedness tolerance between words in a group, (2) when to create a new group, and (3) how many words to include in the analysis.

  • Words that appear more than 999 times: 483
  • Words that appear more than 99 times: 3072
  • Words that appear more than 49 times: 5597
  • Words that appear more than 1 time: 93691
  • Total words: 292045
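
To show where the tolerance parameter enters, here is a simplified sketch of the seed-based clustering described under Key Motivation Points. It is not the code from the gist. It assumes "tolerance" means the minimum number of datasets a word must share with the group's seed word, it is iterative rather than recursive, and it makes groups mutually exclusive. `cluster_keywords` and its arguments are hypothetical names; `edges` uses the 'i,j' key format described earlier.

```python
import random


def cluster_keywords(keyword_ids, edges, tolerance, seed=None):
    """Greedy seed-based grouping over an 'i,j'-keyed co-occurrence dict.

    A keyword joins the seed word's group when it shares at least
    `tolerance` datasets with that seed word.
    """
    rng = random.Random(seed)
    candidates = list(keyword_ids)
    rng.shuffle(candidates)  # random order in which seed words are picked
    visited = set()
    groups = {}
    for seed_word in candidates:
        if seed_word in visited:
            continue
        visited.add(seed_word)
        members = [seed_word]
        for other in candidates:
            if other in visited:
                continue
            low, high = sorted((keyword_ids[seed_word], keyword_ids[other]))
            if edges.get(f"{low},{high}", 0) >= tolerance:
                members.append(other)
                visited.add(other)
        groups[len(groups)] = members  # the id is arbitrary
    return groups
```

Raising `tolerance` produces more single-word groups; lowering it lets one group absorb most of the words, so the interesting range is probably somewhere in between.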

@nickumia-reisys
Contributor Author

While these images aren't exactly readable, I thought it would be interesting for people to look at.

The first image is a graph of 2000 random connections between keywords on datasets. It has much clearer cluster definitions (but it's not meaningful because the number of times those keywords appear on catalog is not meaningful).
2000_random_words

The second image does not have as many well-defined clusters, but it is a graph of the connections that appear at least 100 times among the keywords that appear 1000 or more times on catalog (483 keywords). This means that each pair of connected keywords is used together on at least 100 datasets.
graph_1000_100

I want to get larger graphs, but the graphs would be even less readable and the time to compute would be super long.

The total number of connections that I have tracked: 500473
The total number of connections between the top 483 keywords and other words: 238162
The total number of connections in the "graph_1000_100.png": 1693
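
For reference, thinning the connections down to a plottable subset could look roughly like the sketch below. It assumes the connections are stored in the 'i,j' dictionary form described earlier and that both endpoints must be in the kept set, which may not match how the images were actually produced. `filter_edges` is a hypothetical name.

```python
def filter_edges(edges, keep_ids, min_weight):
    """Keep edges whose endpoints are both in keep_ids and whose weight >= min_weight."""
    kept = {}
    for key, weight in edges.items():
        low, high = (int(part) for part in key.split(","))
        if weight >= min_weight and low in keep_ids and high in keep_ids:
            kept[key] = weight
    return kept


edges = {"0,1": 150, "0,2": 99, "1,3": 400}
print(filter_edges(edges, keep_ids={0, 1, 2}, min_weight=100))  # {'0,1': 150}
```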

My next logical step would be to do text summarization of these groups of words (if we want less work as humans). Or just get the list of words in a readable format, so that we can parse it as humans.

@nickumia-reisys
Contributor Author

As a summary of where this leaves us:

The job of grouping datasets into logical groups that improve discoverability and accessibility is not a simple one. I have explored two paths in the above analyses: (1) using an off-the-shelf word model to process the tags and perform a similarity comparison, and (2) building a custom word model to process the tags and highlight relational similarities to group tags. Both algorithms used tags as the driving point to create groups. As @jbrown-xentity noted, tags from an agency are typically all created by the same publisher. From this perspective, all of the tags from one publisher might have a biased similarity towards the publisher and not the dataset itself. I don't think this is entirely true, but it is a valid concern nonetheless.

A proper analysis would take all of the non-standard text from a dataset (title, description, tags and any unique extras fields) to build the model, which would then have a more complete picture of the datasets. Even with this, the descriptions might also suffer from writer bias, so this is not a foolproof method either. The focus on tags in this ticket was (1) to fine-tune scope and (2) to focus on the algorithm design via data discovery. I think, regardless of writer biases, we only have the data that we have, so if writer bias is an inhibiting factor, we need to raise that with the agencies and make sure they intended for the datasets to be worded the way that they are and that their wording is accurate and consistent with the data that it's describing. This collaboration is not easy, but it is a necessary part if we want to remove biases from our analysis.

Many points have been mentioned in the previous comments. As the key takeaway points:

  • I don't believe we should use an OTS word model to do this analysis. We need to build a representative word model ourselves.
  • There are many connections between datasets in many ways. However, I believe word similarity in the form of a relationship to other words is the best in this case.
    • See Appendix A for more information.
  • We should group with a goal of minimum similarity, not with a goal of a specific number of groups.
  • Create a Word Model to generate Taxonomy for catalog.data.gov categories #4088 is necessary before this can be continued.
    • See Appendix B for more information.

Appendix A. Word Similarity

  • There is word similarity in terms of hypernyms/hyponyms. This is not ideal, because it requires a complete contextual understanding of each word.
    >>> from nltk.corpus import wordnet
    >>> a = wordnet.synsets('water')
    >>> a
    [Synset('water.n.01'), Synset('body_of_water.n.01'), Synset('water.n.03'), Synset('water_system.n.02'), Synset('urine.n.01'), Synset('water.n.06'), Synset('water.v.01'), Synset('water.v.02'), Synset('water.v.03'), Synset('water.v.04')]
    >>> a[1].hypernyms()
    [Synset('thing.n.12')]
    >>> a[2].hypernyms()
    [Synset('element.n.05')]
    >>> a[3].hypernyms()
    [Synset('facility.n.01')]
    >>> a[4].hypernyms()
    [Synset('body_waste.n.01')]
    >>> a[5].hypernyms()
    [Synset('food.n.01'), Synset('liquid.n.01'), Synset('nutrient.n.02')]
    >>> a[1].shortest_path_distance(a[1].hypernyms()[0])
    1
    >>> a[1].shortest_path_distance(a[2])
    6

Appendix B. Word Taxonomies

The federal government itself is a taxonomy. There is the Executive Branch. Within the Executive Branch, there are a host of agencies, such as the Department of Defense. The Department of Defense then has agencies, such as the Department of the Navy. The Department of the Navy then has sub-agencies, such as NAVAIR, NAVSEA, NAVSUB, et cetera. Each of those then has divisions, such as Aircraft Division or Weapons Division. While the taxonomy that exists by design of the government is helpful, it is not complete or otherwise self-describing enough to use as the sole basis of our analysis. Each agency would have its own definition for words like health, finance, education and transportation.

Creating a universal taxonomy that can be applied to such a wide range of data types and sources may not be possible; however, a system that aggregates all of the different taxonomies from each agency might be possible. Either way, we need to build a reference to understand relationships between data.

@hkdctol
Contributor

hkdctol commented Dec 6, 2022

thanks for doing this @nickumia-reisys this is good to have for future discussion


5 participants