Improve OpenAPI with detailed descriptions and auth requirements #12437

@beepsoft

The Dataverse REST API already exposes a generated OpenAPI document (e.g., https://dataverse.harvard.edu/openapi?format=json), but the current document is not yet a strong machine-readable API contract. It is technically useful for discovering endpoints, paths, HTTP methods, and some parameters, but it is not sufficient for downstream consumers that depend on semantic quality, such as API users, generated clients, and documentation sites.

The main problem is that endpoints are not properly documented or, more precisely, only a handful of them are. Without proper descriptions, it is difficult for users of the OpenAPI document, especially machines, to understand the semantic details of the endpoints.
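One way to put a number on this gap is to walk the `paths` object of the generated document and list every operation that has no description. The following is only a sketch: the class and method names are hypothetical, and it assumes the document has already been parsed into nested maps by some JSON library and that each path item contains only HTTP-method keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Dataverse.
final class OpenApiDescriptionAudit {

    /**
     * Returns "METHOD path" for every operation without a non-blank "description".
     * Assumption: `paths` is the OpenAPI "paths" object, parsed as
     * path -> HTTP method -> operation object.
     */
    static List<String> undocumentedOperations(
            Map<String, Map<String, Map<String, Object>>> paths) {
        List<String> missing = new ArrayList<>();
        for (var pathEntry : paths.entrySet()) {
            for (var operationEntry : pathEntry.getValue().entrySet()) {
                Object description = operationEntry.getValue().get("description");
                boolean documented = description instanceof String
                        && !((String) description).isBlank();
                if (!documented) {
                    String method = operationEntry.getKey().toUpperCase(Locale.ROOT);
                    missing.add(method + " " + pathEntry.getKey());
                }
            }
        }
        return missing;
    }
}
```

Run against the current document, a check like this would flag the `/datasets/{id}` GET operation discussed next, because it has neither a summary nor a description.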

The reason we started working on this is machine actionability. If we have a properly documented OpenAPI document, we can more easily implement an MCP server based on it, which AI agents can then use to automate anything in Dataverse. Our final goal is to implement this complete Dataverse MCP server.

Now, back to the problems of the current OpenAPI solution. Consider the basic dataset access operation: what is the ID actually referring to?

    "/datasets/{id}" : {
      "get" : {
        "operationId" : "Datasets_getDataset",
        "parameters" : [ {
          "name" : "id",
          "in" : "path",
          "required" : true,
          "schema" : {
            "type" : "string"
          }
        }, {
          "name" : "returnOwners",
          "in" : "query",
          "schema" : {
            "type" : "boolean"
          }
        } ],
        "responses" : {
          "200" : {
            "description" : "OK"
          }
        }
      }
    }

From the OpenAPI document, we can only see that it is a string. However, reading the source code or the documentation (https://guides.dataverse.org/en/latest/api/native-api.html) shows that it can be either a number, the dataset’s numeric database ID, or a persistent identifier, which is a genuine string value.
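To make the ambiguity concrete, here is a minimal sketch of the distinction a client or agent has to make for itself when the contract only says "string". The class, enum, and method names are hypothetical, and the rule used (all digits means numeric database ID) is an assumption for illustration, not a description of how Dataverse resolves the value internally.

```java
// Hypothetical illustration, not Dataverse code.
final class DatasetIdKind {

    enum Kind { NUMERIC_DATABASE_ID, OTHER_IDENTIFIER }

    /** Classifies an id value purely by its shape: digits only, or anything else. */
    static Kind classify(String id) {
        if (id == null || id.isEmpty()) {
            throw new IllegalArgumentException("id must not be empty");
        }
        for (int i = 0; i < id.length(); i++) {
            char c = id.charAt(i);
            if (c < '0' || c > '9') {
                return Kind.OTHER_IDENTIFIER;
            }
        }
        return Kind.NUMERIC_DATABASE_ID;
    }
}
```

Nothing in the generated document tells a consumer that both shapes are accepted, so every generated client has to rediscover this rule from the guides or the source.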

There is also no indication that this endpoint requires authentication, or that the caller must provide a valid X-Dataverse-key.
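For illustration, this is roughly what a caller has to know in order to build an authenticated request, sketched with the JDK's `java.net.http` API. The class and method names are hypothetical, and the base URL is an assumption that the caller supplies (for example an installation's API root); the `id` is assumed to be safe to place in a URL path as-is.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical client-side sketch, not Dataverse code.
final class DatasetRequestSketch {

    /**
     * Builds (but does not send) a GET request for a dataset.
     * Assumptions: baseUrl is the API root without a trailing slash,
     * and id needs no further URL encoding.
     */
    static HttpRequest getDataset(String baseUrl, String id, String apiToken) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/datasets/" + id))
                // The header name is the one the issue says callers must provide.
                .header("X-Dataverse-key", apiToken)
                .header("Accept", "application/json")
                .GET()
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The point is that the header name comes from outside the OpenAPI document, so a generated client has no way to add it on its own.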

A better version of this OpenAPI entry would be something like this:

    "/datasets/{id}" : {
      "get" : {
        "tags" : [ "Datasets" ],
        "summary" : "Read dataset details",
        "description" : "Returns dataset metadata and the latest accessible version, and records metrics for released dataset access.",
        "operationId" : "Datasets_getDataset",
        "parameters" : [ {
          "name" : "id",
          "in" : "path",
          "description" : "Dataset ID or persistent identifier.",
          "required" : true,
          "schema" : {
            "type" : "string"
          }
        }, {
          "name" : "returnOwners",
          "in" : "query",
          "description" : "Whether to include owner information in the response.",
          "schema" : {
            "type" : "boolean"
          }
        } ],
        "responses" : {
          "200" : {
            "description" : "OK"
          }
        },
        "security" : [ {
          "DataverseApiKey" : [ ]
        } ]
      }
      ...
      "securitySchemes" : {
        "DataverseApiKey" : {
          "type" : "apiKey",
          "description" : "Dataverse API token.",
          "name" : "X-Dataverse-key",
          "in" : "header"
        }
      }

It is still not perfect, because the schema itself could also state that the ID may be either a number or a string. However, the description already helps the reader (user or machine) understand the expected values. We also include a security clause, which refers to a securitySchemes entry describing the requirement for an X-Dataverse-key header value.

My proposal is to improve the REST endpoints in Java with annotations, so that the generated OpenAPI document becomes more descriptive and useful.

We have already implemented an AI-based solution, which:

  1. Finds all OpenAPI endpoints.
  2. Looks up their documentation in native-api.rst.
  3. Finds their Java implementation.
  4. Based on this information, the LLM generates the appropriate OpenAPI annotations with descriptions.

So this:

    @GET
    @AuthRequired
    @Path("{id}")
    public Response getDataset(@Context ContainerRequestContext crc,
            @PathParam("id") String id,
            @Context UriInfo uriInfo,
            @Context HttpHeaders headers,
            @Context HttpServletResponse response,
            @QueryParam("returnOwners") boolean returnOwners)

becomes:

    @GET
    @AuthRequired
    @Path("{id}")
    @Operation(summary = "Read dataset details",
            description = "Returns dataset metadata and the latest accessible version, and records metrics for released dataset access.")
    @SecurityRequirement(name = "DataverseApiKey")
    public Response getDataset(@Context ContainerRequestContext crc,
            @Parameter(description = "Dataset ID or persistent identifier.", required = true)
            @PathParam("id") String id,
            @Context UriInfo uriInfo,
            @Context HttpHeaders headers,
            @Context HttpServletResponse response,
            @Parameter(description = "Whether to include owner information in the response.")
            @QueryParam("returnOwners") boolean returnOwners)

which ultimately results in the OpenAPI JSON shown above.

We did this for all 580 endpoints, which means we updated 45 Java files.

Since these are all AI-generated edits from Codex, reviewers should go through all of the changes. This is a major undertaking, which is why I opened this issue for discussion: is anyone willing to do this?

We have reviewed the changes ourselves, but since we are not familiar with all REST calls and implementation details, we cannot be certain that all generated descriptions are correct. They look correct, though, and the MCP server we generated from the OpenAPI descriptions works as expected.

I will provide a PR so that you can see what has changed.
