Harvest: Json format harvest fails with unknown field when field exists on client. #7075

Closed
kcondon opened this issue Jul 13, 2020 · 11 comments

Comments

kcondon (Contributor) commented Jul 13, 2020

For reference, see PR #7057.

In this instance, the fields do exist on the client side, and surprisingly, the failing datasets are not the same ones reported in that ticket for the case where the fields do not exist on the client. The failure appears to be the same, an unknown field mraCollection, but it is coming from Solr?

grep Error harvest_test_n99_2020-07-13T21-28-05.log
Exception processing getRecord(), oaiUrl=https://dataverse.harvard.edu/oai, identifier=doi:10.7910/DVN/B6OJKG, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.engine.command.exception.CommandException (Command [DatasetCreate dataset:49321] failed: Exception thrown from bean: javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_49321] unknown field 'mraCollection')
Exception processing getRecord(), oaiUrl=https://dataverse.harvard.edu/oai, identifier=doi:10.7910/DVN/I5O6OS, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.engine.command.exception.CommandException (Command [DatasetCreate dataset:51066] failed: Exception thrown from bean: javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_51066] unknown field 'mraCollection')

The two failing datasets are:
You are now connected to database "thedata_alt" as user "postgres".
select id, identifier, dtype from dvobject where id=49321;
id | identifier | dtype
-------+------------+---------
49321 | DVN/A9VJVR | Dataset
(1 row)

select id, identifier, dtype from dvobject where id=51066;
id | identifier | dtype
-------+------------+---------
51066 | DVN/3XMK0W | Dataset

In the original PR, the failing datasets were:
The controversial datasets are https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/B6OJKG and https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/I5O6OS.
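
For reference, a quick way to check whether those exports actually mention the field (a sketch using plain curl and grep against the URLs above):

# A count of 0 means the dataverse_json export never mentions mraCollection.
curl -s "https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/B6OJKG" | grep -c mraCollection
curl -s "https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/I5O6OS" | grep -c mraCollection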

JingMa87 (Contributor) commented Jul 19, 2020

@kcondon I'm trying to reproduce the error, but I can't seem to add the custom field mraCollection in a way that reproduces the error. Do you have a TSV file for me so I can add this custom field to Dataverse?

JingMa87 added this to In progress in DANS Data Station Archaeology Jul 19, 2020
kcondon (Contributor, Author) commented Jul 19, 2020

Hi, I believe all I did was run the custom script for Harvard metadata in the dvinstall file: https://raw.githubusercontent.com/IQSS/dataverse/develop/scripts/api/setup-optional-harvard.sh
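
That script essentially loads the Harvard-specific TSV metadata blocks through the admin API. A minimal sketch for a single block looks roughly like this (the TSV path and file name are assumptions on my part, not copied from the script):

# Sketch: fetch a custom metadata block TSV and load it via the admin API
# (assumptions: Dataverse on localhost:8080, customMRA.tsv under scripts/api/data/metadatablocks).
curl -s -O https://raw.githubusercontent.com/IQSS/dataverse/develop/scripts/api/data/metadatablocks/customMRA.tsv
curl http://localhost:8080/api/admin/datasetfield/load -X POST \
  --upload-file customMRA.tsv -H "Content-type: text/tab-separated-values"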

JingMa87 (Contributor) commented Jul 21, 2020

I ran the script to add the Harvard metadata successfully. When I run the harvest, I get the same error as you, but with less info and a null in the message. However, I decided not to dig deeper into that difference yet and to focus on Solr instead.

I think the problem is that the new fields were added to the database and the UI, but not to Solr. So I checked the new fields that have to be added to Solr using the curl http://localhost:8080/api/admin/index/solr/schema call, and that output was correct. Then I checked schema.xml and found that mraCollection isn't in there yet. Then I read the documentation in the SearchFields.java class and in metadatacustomization.rst and found out that you have to update schema.xml using a script, so I ended up running updateSchemaMDB.sh to update schema.xml.
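
Concretely, that check boils down to something like this (a sketch; the schema.xml path is an assumption for a stock single-core install):

# What Dataverse says Solr needs for the installed metadata blocks:
curl -s http://localhost:8080/api/admin/index/solr/schema | grep mraCollection

# What the running core actually has (path is an assumption):
grep mraCollection /usr/local/solr/server/solr/collection1/conf/schema.xml || echo "mraCollection not in schema.xml"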

I received an error: "Dataverse responded with empty file. When running on K8s: did you bootstrap yet?". This failure happens here:

if [[ "`wc -l ${TMPFILE}`" < "3" ]]; then
  echo "Dataverse responded with empty file. When running on K8s: did you bootstrap yet?"
  exit 123
fi

I think the writer of this script intended to check whether the file with the new fields has fewer than 3 lines, but the check doesn't work as intended: the file actually has 455 lines, yet the script still returns the empty-file error. If my assumption is correct, I have a fix for this first line which I can push in a PR:

if [[ `wc -l < ${TMPFILE}` -lt 3 ]]; then

When I make this change, the update script runs successfully, but schema.xml still doesn't get the new fields like "mraCollection". Not sure where to go next.
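
For what it's worth, my working theory (an assumption, not something I have verified) is that BSD wc on macOS left-pads the line count with spaces, so the lexicographic < test in [[ ]] ends up comparing a leading space against "3" and evaluates to true, while GNU wc doesn't pad, which would explain why the check passes on Linux. A minimal way to see the two behaviours side by side:

# Sketch: run both comparisons against a file that clearly has more than 3 lines.
TMPFILE=$(mktemp)
seq 1 455 > "${TMPFILE}"

# Original check: string comparison against the full `wc -l file` output,
# which on macOS/BSD starts with padding spaces ("     455 /tmp/...").
if [[ "`wc -l ${TMPFILE}`" < "3" ]]; then
  echo "string compare: file looks empty (misfires when wc pads its output)"
fi

# Proposed fix: numeric comparison against just the count.
if [[ `wc -l < ${TMPFILE}` -lt 3 ]]; then
  echo "numeric compare: file looks empty"
else
  echo "numeric compare: file has $(wc -l < "${TMPFILE}") lines"
fi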

kcondon (Contributor, Author) commented Jul 21, 2020

@JingMa87 Thanks for looking into this! I can check with the team tomorrow when I'm back at work, but I wonder whether @poikilotherm has any insight into the problem and your proposed fix, since I think this is an area he is familiar with?

poikilotherm (Contributor) commented Jul 21, 2020

Hi @JingMa87,
where are you running this script? I tested this with GNU bash and zsh on Linux, successfully using it with Docker containers etc. Both the string comparison and the integer comparison should be perfectly valid and yield the same result.

I'll go ahead and try to reproduce with develop.

JingMa87 (Contributor) commented:

Hi @poikilotherm, I'm running this script on my Mac using zsh. I have a full local dev environment of Dataverse running on Glassfish 4. But even after I adjust the script and run it, schema.xml doesn't get updated to contain the new fields.

poikilotherm (Contributor) commented:

Hi @JingMa87, I tried to reproduce this with the latest develop @ 941d17d, running on Payara 5 and deploying customMRA.tsv via curl first. I tried the script with both zsh 5.7.1 (x86_64-redhat-linux-gnu) and bash 5.0.17(1)-release (x86_64-redhat-linux-gnu). I was not able to reproduce your problem, so I suspect something in your local environment.

Obviously, you can still gather the fields from the API endpoint and put them into the Solr schema files manually. Please reach out on IRC for more talk and help 😄
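
Roughly, the manual route looks like this (a sketch; the Solr paths and core name are assumptions for a stock single-core install):

# 1. Dump the field definitions Dataverse expects (the endpoint mentioned earlier in this thread).
curl -s http://localhost:8080/api/admin/index/solr/schema > /tmp/dv_solr_fields.txt

# 2. Copy the missing entries for custom fields such as mraCollection into the core's
#    schema file by hand (path is an assumption).
grep mraCollection /tmp/dv_solr_fields.txt
# ...then edit /usr/local/solr/server/solr/collection1/conf/schema.xml accordingly.

# 3. Reload the core so Solr picks up the change, then re-run the harvest.
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"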

JingMa87 (Contributor) commented Jul 23, 2020

@poikilotherm My colleague @mderuijter has actually had a similar problem. I decided to add the fields manually for now.

@kcondon Do you run Dataverse on Payara 5? That might explain why my error message is less complete than yours. I added the custom fields like mraCollection to Solr and then ran a harvest on the Princeton set, and everything succeeded. Do you have the mraCollection field in Solr? I presume that is the problem.
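
A quick way to confirm on your side (a sketch, assuming the default Solr port and the stock collection1 core):

# If Solr still doesn't know the field, this query itself fails with an "undefined field"
# style error; once the schema has it, it returns a numFound count instead.
curl -s "http://localhost:8983/solr/collection1/select?q=mraCollection:*&rows=0"

# Spot-check one of the previously failing docs (id taken from the log at the top of this issue).
curl -s "http://localhost:8983/solr/collection1/select?q=id:dataset_49321&rows=1"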

kcondon (Contributor, Author) commented Jul 23, 2020

@JingMa87 Yes, Payara 5 is now our target platform.

JingMa87 (Contributor) commented:

@kcondon Did you add the custom fields like mraCollection to Solr yet? This is probably what causes the problem.

kcondon (Contributor, Author) commented Aug 24, 2020

@JingMa87 Hi, that was the problem. Apologies for not catching it; this field used to be part of schema.xml, but the behavior has changed to keep those custom fields separate.
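
For anyone hitting this later: a quick way to see where a custom field ends up now (a sketch; the conf path and the generated file names are assumptions based on how updateSchemaMDB.sh is meant to work, not something confirmed in this thread):

# List which schema files in the core's conf directory mention the custom field.
cd /usr/local/solr/server/solr/collection1/conf
grep -l mraCollection schema*.xml
# The expectation is a hit in the generated schema_dv_mdb_fields.xml rather than in schema.xml itself.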

kcondon closed this as completed Aug 24, 2020
DANS Data Station Archaeology automation moved this from In progress to Done Aug 24, 2020