What are the allowed search fields for the Search API q parameter? #2558

Open
leeper opened this issue Sep 20, 2015 · 11 comments

Comments

@leeper
Member

leeper commented Sep 20, 2015

I'm looking at the Search API Docs. What are the allowed fields for the q parameter? It appears to include the list of Dataverse DB Elements mentioned in the metadata crosswalk but it also appears to include other fields not listed there. Is there a complete list? And can the documentation be updated accordingly?

@pdurbin
Member

pdurbin commented Sep 21, 2015

@leeper it depends! :) @markwilkinson asked about this too, as I mentioned in #2291 .

At the very least, I could document the fact that the fields supported by an installation of Dataverse 4 depend on which domain-specific metadata schemas (metadata blocks) have been enabled. http://guides.dataverse.org/en/4.1/user/appendix.html#metadata-references contains a list as of 4.1 but there are other site-specific ("custom") metadata blocks used only by Harvard as of this writing. All metadata blocks are stored as TSV files and then loaded into the system at installation time: https://github.com/IQSS/dataverse/tree/v4.1/scripts/api/data/metadatablocks . When we update these tsv files, we add them to the list of data-driven fields we index into Solr: https://github.com/IQSS/dataverse/blob/v4.1/conf/solr/4.6.0/schema.xml#L328 . You'll see references to the "custom" Harvard-specific blocks like GSD and PSI in that Solr schema config.

Parsing those TSV files is a little rough (#2551) and I wouldn't wish it on any API user so perhaps we should allow API users to interrogate a running Dataverse installation for a list of supported metadata fields. I can imagine this being part of the Search API itself. Maybe you call into /api/search/fields or something...

I recently stumbled upon the fact that I can go to https://dataverse.harvard.edu/api/metadatablocks to find a list of metadata blocks as documented at http://guides.dataverse.org/en/4.1/api/native-api.html#metadata-blocks but I didn't quickly find how to list the fields within each metadata block. I did add an "admin-only" API endpoint which I mentioned at #2357 (comment) that lets me list all the fields from http://localhost:8080/api/admin/datasetfield but the output needs a lot of work. Also, that API endpoint only shows the data-driven fields, not the static ones in SearchFields.java I mentioned in #2291. (At some point we'll probably want to change these static fields to be fed from the database for #2039 .)

Oh, and some sensitive fields such as for email addresses aren't indexed for privacy reasons per #759 .

Going to an Advanced Search Page such as https://dataverse.harvard.edu/dataverse/harvard/search for the root dataverse can be a help in figuring out which fields are searchable but as #2353 notes right now you can't see the domain-specific metadata blocks at the root. I mention this because different blocks can be enabled at different dataverses within the tree of dataverses in a single Dataverse installation. So maybe when you ask the Search API for a list of supported fields you could supply the dataverse of interest and it will tell you which metadata blocks are enabled. Or rather, it would tell you the search fields that are available based on the metadata blocks enabled from that dataverse (i.e. social science vs. astronomy).

@leeper I'm sure this is way more information than you wanted! Thanks for opening this issue. :)

To sum up, I can at least improve the Search API documentation a bit. I should probably add something to the Search API so that API users can simply get a list of fields they can search on, perhaps with respect to where in the tree of dataverses they are searching (the root dataverse vs. a subdataverse).

@pdurbin
Member

pdurbin commented Sep 21, 2015

@leeper I looked at the code and played around with the already existing "GET http://$SERVER/api/metadatablocks/$identifier" endpoint documented at http://guides.dataverse.org/en/4.1/api/native-api.html#metadata-blocks

Perhaps you and @markwilkinson and anyone else interested in knowing which fields are supported could play around with this metadatablocks API endpoint and give us feedback on it. It looks like it was developed by @michbarsinai and it seems quite useful. Here's how I can imagine it being used:

Get a list of metadata blocks that are enabled

curl -s https://apitest.dataverse.org/api/metadatablocks | jq -r '.data[].name'

citation
geospatial
socialscience
astrophysics
biomedical
journal

For each of the metadata blocks, show the fields

curl -s https://apitest.dataverse.org/api/metadatablocks/citation | jq . | head -20

{
  "status": "OK",
  "data": {
    "id": 1,
    "name": "citation",
    "displayName": "Citation Metadata",
    "fields": {
      "title": {
        "name": "title",
        "displayName": "Title",
        "title": "Title",
        "type": "TEXT",
        "watermark": "Enter title...",
        "description": "Full title by which the Dataset is known."
      },
      "subtitle": {
        "name": "subtitle",
        "displayName": "Subtitle",
        "title": "Subtitle",
        "type": "TEXT",
...

In the output above the field to search on is listed under "name" such as "title" or "subtitle".

Of course, these are only the data-driven fields at the dataset level, not the static fields in SearchFields.java I mentioned, but some of those static fields aren't searchable by design (though we recently made more of them searchable as part of #2038).
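
For anyone scripting this rather than using curl and jq, here is a minimal Python sketch of the same two steps. It relies only on the response shapes shown above (a list of blocks under "data", and a "fields" object keyed by field name). The helper names block_names, field_names and fetch_json are made up for illustration and are not part of Dataverse or pyDataverse.

```python
import json
from urllib.request import urlopen


def block_names(listing):
    """Return the block names from a parsed /api/metadatablocks response."""
    return [block["name"] for block in listing["data"]]


def field_names(block):
    """Return the field names from a parsed /api/metadatablocks/$identifier response."""
    return [field["name"] for field in block["data"]["fields"].values()]


def fetch_json(url):
    """Fetch a URL and parse the body as JSON (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


# Trimmed copy of the citation block output shown above.
sample_block = {
    "status": "OK",
    "data": {
        "id": 1,
        "name": "citation",
        "displayName": "Citation Metadata",
        "fields": {
            "title": {"name": "title", "displayName": "Title", "type": "TEXT"},
            "subtitle": {"name": "subtitle", "displayName": "Subtitle", "type": "TEXT"},
        },
    },
}

print(field_names(sample_block))  # ['title', 'subtitle']
```

Against a running installation, the same functions could be fed from fetch_json("https://apitest.dataverse.org/api/metadatablocks") and then one call per block name. Like the curl version, this only covers the data-driven fields.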

@markwilkinson

markwilkinson commented Sep 22, 2015

Thanks for the update! :-)

Mark

@leeper
Member Author

leeper commented Sep 22, 2015

@pdurbin Excellent! This response is a lot to parse! I'll take a look and see what I can do. I guess the minimum solution is to provide a flexible interface and then I can build on features that help tailor use of the API when there are known metadata schemes. Being able to query what those are for any particular installation would definitely be a helpful feature of the search API.

@pdurbin
Member

pdurbin commented Oct 4, 2016

#1510 is related in the sense that people don't know what subjects are allowed when creating a dataset (and it's a required field).

@pdurbin
Member

pdurbin commented Aug 21, 2019

In pull request #6107 I at least linked back to this issue so API users can get a sense of how they can know what the allowed search fields are. Here's the commit: d3a5b2f

If anyone wants to help with an actual solution to this issue, I'm happy to mentor them. I'm thinking that for now we could just list the "out of the box" fields in the API Guide.

@Jerry-Ma

Jerry-Ma commented Apr 12, 2022

Hi, I was trying to find a reference for the query string used to search for particular files. However, it seems the above discussion is more about searching dataverse/dataset metadata. Could anyone point me to a place that shows the keys we could use for searching files?

The use case for me is that I am creating a script uploading files to a dataset via the API, and I would like to check if a particular file named "foo" (with filepath foo.txt) already exists in a certain dataset of global_id="doi:10.5072/FK2/J9EK29", which is within the dataverse identified by id="bar"

So far, the furthest I've got is the following:

api/search?q=fileName:foo&type=file&subtree=bar&sort=date&order=desc.

There are a couple of issues with this:

  • The subtree parameter does not recognize the dataset persistent ID, so I had to use the dataverse identifier "bar". However, this returned many files with names similar to foo from multiple datasets.
  • The fileName:foo query does not restrict the filename to be exactly foo. Instead, files named foo-1 and foo-2 are also returned.

Any ideas?
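
One possible workaround for the second issue is to filter the results on the client side. The sketch below assumes the Search API returns its hits as a list under data.items and that each hit carries a "name" key; the thread does not show the response format, so treat both as assumptions and adjust name_key if your installation differs. The function name exact_name_matches is hypothetical.

```python
def exact_name_matches(search_response, wanted_name, name_key="name"):
    """Keep only the search hits whose name equals wanted_name exactly.

    Assumes hits live under search_response["data"]["items"] (an assumption,
    not confirmed in this thread).
    """
    items = search_response.get("data", {}).get("items", [])
    return [item for item in items if item.get(name_key) == wanted_name]


# Hypothetical response, reduced to the keys this sketch relies on.
sample_response = {
    "data": {"items": [{"name": "foo"}, {"name": "foo-1"}, {"name": "foo-2"}]}
}

print(exact_name_matches(sample_response, "foo"))  # [{'name': 'foo'}]
```

This does not address the first issue (results coming from several datasets); it only narrows a result set that has already been fetched.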

@qqmyers
Member

qqmyers commented Apr 12, 2022 via email

@Jerry-Ma

Jerry-Ma commented Apr 12, 2022

@qqmyers

Thank you for the links. I'll take a look at DVUploader in detail. The direct upload with storage identifier will also be useful for our use case, because we have our own storage service (not Amazon S3).

I am already using pyDataverse for creating datasets and uploading datafiles. It works great for me so far, but lacks certain logic (like checking whether a datafile already exists) that I need to implement on my own. The repo for the workflow I mentioned is here: https://github.com/toltec-astro/dvpipe.

Just a bit of background: this effort is part of the software infrastructure that we are building for the Large Millimeter Telescope. We have set up a Dataverse instance at https://dp.lmtgtm.org and plan to use it as the main channel to distribute the data products produced by the software pipelines that reduce the data taken by various instruments on the LMT. dvpipe is to be the automation pipeline that packages the data reduction pipeline outputs and sends them to the Dataverse server.

@qqmyers
Member

qqmyers commented Apr 12, 2022

Nice! (I was involved with the Dark Energy Survey telescope data management project a few years ago.) W.r.t. pyDataverse, @skasberger is open to pull requests so if there is logic you think should go there, please consider adding to it. (In particular, it would be great to get the direct upload capabilities in there.)

@pdurbin
Member

pdurbin commented Apr 12, 2022

> I would like to check if a particular file named "foo" (with filepath foo.txt) already exists in a certain dataset of global_id="doi:10.5072/FK2/J9EK29", which is within the dataverse identified by id="bar"

The approach suggested by @qqmyers to download the list of files is probably the most reliable but I thought I'd chime in specifically about the Search API question above.

@Jerry-Ma you can search against the parentIdentifier field with the DOI of the dataset like this:

https://dataverse.harvard.edu/api/search?q=name:2019-02-25.tab&fq=parentIdentifier:doi\:10.7910/DVN/TJCLKP

Please note:

  • You have to escape the colon in the DOI with a backslash.
  • If you know the database id of the dataset, you can search against parentId.
  • I didn't include subtree but you could, I suppose. Subtree only operates on dataverse collections, not datasets.
  • From a quick look I don't believe we index the file path. That's why I say the other approach is more reliable since the list of files will include both the file path and the name.
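
To avoid hand-escaping the DOI, the query above could be assembled in a script. This is a minimal sketch that uses only the name and parentIdentifier fields from the example URL. The function name file_search_url is hypothetical, and it escapes only the colons in the DOI; other characters that are special in the query syntax are not handled.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def file_search_url(base_url, file_name, dataset_pid):
    """Build a Search API URL for a file name within a single dataset."""
    # Escape each colon in the persistent ID with a backslash, as noted above.
    escaped_pid = dataset_pid.replace(":", "\\:")
    params = {
        "q": f"name:{file_name}",
        "fq": f"parentIdentifier:{escaped_pid}",
    }
    # urlencode percent-encodes the backslash and colons for transport.
    return f"{base_url}/api/search?{urlencode(params)}"


url = file_search_url(
    "https://dataverse.harvard.edu", "2019-02-25.tab", "doi:10.7910/DVN/TJCLKP"
)
print(url)

# Decoding the query string recovers the filter with the backslash intact.
print(parse_qs(urlsplit(url).query)["fq"][0])
```

The percent-encoded form looks different from the URL typed by hand, but it decodes to the same q and fq values on the server side.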

Also, if you'd like to include your installation on our map, please feel free to open an issue at https://github.com/IQSS/dataverse-installations !
