public Identifier createDataset(String dataSetJson, String dataverseAlias) {...} returns a DB identifier but we need a doi to uploadFile #14

Open
AleixMT opened this issue Nov 15, 2022 · 4 comments

Comments

@AleixMT
Contributor

AleixMT commented Nov 15, 2022

Hello again.

I am trying to bulk-upload a project into a Dataverse instance. To do so, I need to create a dataset for the project and then upload all the files into it. The problem is that the method that creates a dataset returns an Identifier that wraps an integer. This integer is supposed to identify the newly created dataset, but it cannot be used to upload a file, because the upload methods only accept a DOI to identify the dataset, not the Identifier returned by createDataset.

So, I would like to do something like this:

List<Document> documents = new ArrayList<>(...);
Identifier identifier = api.getDataverseOperations().createDataset(JSONMetadata.toString(), "theDatasetName");

for (Document document : documents)
{
    try {
        // This line fails: uploadFile expects a DOI, and the Identifier's integer value does not match any dataset
        api.getDatasetOperations().uploadFile(identifier.toString(), document.getInputStream(), document.getName());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Where Document is just a class that wraps file data.

But I can't do this, since public Identifier createDataset(String dataSetJson, String dataverseAlias) {...} does not return a DOI.

So, my question is: is there any way to retrieve the DOI of the dataset that I just created, so that I can upload files to it immediately afterwards, even if it involves extra operations? Alternatively, is there any way to use the returned Identifier object to identify a dataset and upload files to it?

If that is not possible, I will try to open another pull request. This time, though, I will need a little help, since I do not understand what the last line of public Identifier createDataset(String dataSetJson, String dataverseAlias) {...} does:
return resp.getBody().getData(); I deduce that it parses the response and obtains the ID from there.

I am proposing this change because I think it is entirely feasible and would improve the library. When you use the native API to create a dataset (using curl, for example), the server returns JSON that contains both the identifier that you return and the DOI of the newly created dataset. It is a matter of parsing both the DOI and the identifier and returning them from the method, or of implementing an equivalent method that parses and returns only the DOI.
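As a rough illustration of the parsing described above, here is a minimal sketch in plain Java. It assumes the create-dataset response has the shape {"status":"OK","data":{"id":...,"persistentId":"doi:..."}}; the field names id and persistentId, the CreatedDataset holder and the CreateDatasetResponseParser class are assumptions made for this example and are not part of the library. Regular expressions are used only to keep the sketch dependency-free; a real change would more likely reuse the library's existing JSON mapping.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for both identifiers returned when a dataset is created. */
final class CreatedDataset {
    final long id;              // database ID of the dataset
    final String persistentId;  // persistent identifier, e.g. a DOI

    CreatedDataset(long id, String persistentId) {
        this.id = id;
        this.persistentId = persistentId;
    }
}

/** Hypothetical parser; the field names are assumed, so check them against the real response. */
final class CreateDatasetResponseParser {
    private static final Pattern ID =
            Pattern.compile("\"id\"\\s*:\\s*(\\d+)");
    private static final Pattern PERSISTENT_ID =
            Pattern.compile("\"persistentId\"\\s*:\\s*\"([^\"]+)\"");

    /** Returns both identifiers, or an empty Optional if either one is missing. */
    static Optional<CreatedDataset> parse(String json) {
        Matcher id = ID.matcher(json);
        Matcher pid = PERSISTENT_ID.matcher(json);
        if (!id.find() || !pid.find()) {
            return Optional.empty();
        }
        return Optional.of(new CreatedDataset(Long.parseLong(id.group(1)), pid.group(1)));
    }
}
```

With something along these lines, the caller could pass the persistentId straight to uploadFile without a second request to look up the dataset.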

Please let me know your opinion on this when you can.

@AleixMT
Contributor Author

AleixMT commented Nov 16, 2022

I discovered that we can retrieve the DOI of a dataset using its Identifier like this:

// Call the Dataverse API client to create a dataset in the ICIQ dataverse
Identifier identifier = api.getDataverseOperations().createDataset(dataverseDatasetMetadata.toString(), "ICIQ");

// Fetch the dataset that we just created in order to obtain its DOI
Dataset dataset = api.getDatasetOperations().getDataset(identifier);
if (!dataset.getDoiId().isPresent())
{
    // TODO throw a more specific exception
    throw new RuntimeException();
}
String doi = dataset.getDoiId().get();

// Upload the files of each experiment into the created dataset
for (Document document : documents)
{
    try {
        api.getDatasetOperations().uploadFile(doi, document.getInputStream(), document.getName());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

This basically boils down to:

Identifier identifier = api.getDataverseOperations().createDataset(dataverseDatasetMetadata.toString(), "ICIQ");
Dataset dataset = api.getDatasetOperations().getDataset(identifier);  // Extra middle step: fetch the dataset, from which we retrieve the DOI
api.getDatasetOperations().uploadFile(dataset.getDoiId().get(), file);

The change is still feasible and worthwhile: the native API always returns the DOI of the dataset that you created, so making an extra request to obtain the DOI is wasteful.

I am not going to close this issue. I will wait for the opinion of the owners.

Thank you.

PS: I am trying to upload a lot of datasets to Dataverse, so optimization is crucial.

@richarda23
Collaborator

richarda23 commented Nov 18, 2022

Hi Aleix
Could you please post the response you get from curl when you create a dataset? Which version of Dataverse are you posting to?
It may well be returning more information than it did when Identifier was first written. As you say, it would be better to get that info when the dataset is first created.
Thanks, Richard

@AleixMT
Contributor Author

AleixMT commented Nov 19, 2022

Here is the curl call that I make, with its response on the next line. You can see that the response dictionary contains two identifiers.

[Screenshot from 2022-11-19 14-10-35]

The dataverse instance that I am posting to is dataverse.csuc.cat.

I am aware that this instance has some customizations. For example, the file that I am uploading with curl, dataset-finch1.json, is a modified version of the minimal example dataset metadata provided with the Dataverse documentation. I needed to extend the file with some mandatory fields because the original did not work on this instance. I do not know whether the API responses are customized too.

@pdurbin
Member

pdurbin commented Nov 21, 2022

I don't believe the response has been customized. It's returning the database ID of the dataset as well as the DOI of the dataset.
