DS-4226 Improvements to Entity validation first pass #2522

Merged

AndrewZachWood
Contributor

This PR adds various improvements regarding entity type validation during the MetadataImport feature.

Pre-validate relational changes before displaying the change set to the user during MetadataImport:
The entire CSV is parsed and all potential relational changes are validated before the entire change set is displayed to the user.
If the given relational changes do not validate, the validation errors are reported explicitly and the script exits.

Improve relational validation error reporting:
When relational validation fails during the pre-validation phase, the error(s) are reported to the DSpace logs more explicitly and cause the current import process to fail. These errors are reported for the entire given CSV, so the user can correct the full set of errors after just one run of the script. The previous behavior did not fail the MetadataImport process and created items regardless of whether the described relationships were valid, which caused a great deal of confusion.
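To illustrate the "collect every error, then fail once" behaviour described above, here is a minimal sketch. It is not the code from this PR: the class, the `RelationCell` holder, and the second error message are hypothetical names made up for the example, and the real validation checks far more than these two conditions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from the DSpace code base.
class PreValidationSketch {

    /** One relational cell of the CSV: its row number, header type name, and referenced target. */
    static final class RelationCell {
        final int row;
        final String typeName;
        final String target;

        RelationCell(int row, String typeName, String target) {
            this.row = row;
            this.typeName = typeName;
            this.target = target;
        }
    }

    /** Checks every cell and returns all problems found, rather than stopping at the first one. */
    static List<String> validateAll(List<RelationCell> cells,
                                    Set<String> knownTypeNames,
                                    Set<String> knownTargets) {
        List<String> errors = new ArrayList<>();
        for (RelationCell cell : cells) {
            if (!knownTypeNames.contains(cell.typeName)) {
                errors.add("Error in CSV row " + cell.row + ": unknown relationship type " + cell.typeName);
            }
            if (!knownTargets.contains(cell.target)) {
                errors.add("Error in CSV row " + cell.row + ": unresolved reference " + cell.target);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<RelationCell> cells = List.of(
            new RelationCell(1, "isPersonOfProject", "rowName:1"),
            new RelationCell(2, "isBogus", "rowName:9"));
        List<String> errors = validateAll(cells, Set.of("isPersonOfProject"), Set.of("rowName:1"));
        // A real script would print all errors and exit before showing the change set.
        errors.forEach(System.out::println);
    }
}
```

The point of returning a list instead of throwing on the first failure is that the user sees every problem in the CSV after a single run.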

Revisit respective Integration Test:
Added an additional test for the new pre-validation code and ensured that the new code is covered by it.
In addition, per Tim Donohue's comment on DS-4316, I've ensured that Context authorization is only turned off when required.

Overall, I believe the improvements to validation and reporting will save users a lot of headaches when batch importing a large set of relational changes. However, we're not quite done yet.
We still have cardinality and preservation of relational ordering to consider when validating relationships, neither of which is covered by this set of work. I believe these issues warrant their own PR.

@paulo-graca
Contributor

I would like to provide some early feedback. I'm getting this error:

Error: Error in CSV row 22:
Ambiguous reference; multiple matches in csv: rowName:76c5cfd9-16ea-4f14-9c9c-b208da26fb29

I've also noticed that the related field appears with a strange UUID:
(relation.isParentOrgUnit): 00000000-0000-0000-0000-000000000001

I will revisit my CSV and logs to better understand what is happening.

@AndrewZachWood
Contributor Author

AndrewZachWood commented Oct 1, 2019

Much appreciated. If it's not too much trouble, could you include a link to, or a sample of, the CSV used in your testing when you run into errors like these? I notice you have rowName associated with a UUID, which may be redundant if the 76c5cfd9-16ea-4f14-9c9c-b208da26f value directly references an archived Item. You should be able to use 76c5cfd9-16ea-4f14-9c9c-b208da26f as the value for that cell.

As for why you're seeing a UUID like 00000000-0000-0000-0000-000000000001: when the MetadataImport script displays changes, new items have not yet been created, so actual UUIDs have not been generated for the rows in the CSV. That is why a placeholder UUID is generated and used instead. We generate placeholder UUIDs by using the referenced row's row number as the UUID's least significant bits. So (relation.isParentOrgUnit): 00000000-0000-0000-0000-000000000001 is a placeholder indicating that this row references row 1 of the CSV. Once the user confirms the changes displayed by the MetadataImport script, these placeholders are replaced with the actual UUIDs of the corresponding rows in the CSV being processed.
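For readers who want to see how a row number ends up in the last group of the UUID, here is a small sketch using `java.util.UUID`. The class and method names are hypothetical, and zeroing the most significant bits is an assumption inferred from the placeholder shown above, not a copy of the PR's code.

```java
import java.util.UUID;

// Hypothetical helper; only java.util.UUID itself is a real API here.
class PlaceholderUuidSketch {

    /** Builds a placeholder whose least significant bits hold the CSV row number. */
    static UUID placeholderForRow(int rowNumber) {
        // Assumption: the most significant bits are left at zero.
        return new UUID(0L, rowNumber);
    }

    /** Recovers the row number that a placeholder points at. */
    static int rowOf(UUID placeholder) {
        return (int) placeholder.getLeastSignificantBits();
    }

    public static void main(String[] args) {
        UUID placeholder = placeholderForRow(1);
        System.out.println(placeholder);        // prints 00000000-0000-0000-0000-000000000001
        System.out.println(rowOf(placeholder)); // prints 1
    }
}
```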

@paulo-graca
Contributor

Much appreciated. If it's not too much trouble/ if possible to include a link or sample of the CSV used in your testing with when you run into errors like these.

I've just sent you the file via Slack.

@benbosman
Member

@AndrewZWood I've tried to create a relationship when submitting a new Project, and linking it to an existing Publication.
I'm linking to the Publication in the column relation.isPublicationOfProject and relation.isProjectOfPublication. The preview just displays both as pending imports while only relation.isPublicationOfProject should succeed.
=> Is this use case not caught yet?

I was able to trigger the new error logging when linking to a Publication in the relation.isOrgUnitOfProject column. It worked correctly here.

I also noticed this branch has some merge conflicts with the changes that were requested in #2488. I would expect them to be easy to resolve since they're primarily method name changes and JavaDocs.

@AndrewZachWood AndrewZachWood force-pushed the DS-4226-CSV_Import_Entities_Improvments branch from 94ce25b to 21203be Compare October 4, 2019 17:04
@AndrewZachWood
Contributor Author

I found that the bug was that the row count was not being reset after the first run of the script, so the same values were added to the reference map many times, which tripped the ambiguity check during validation.

I corrected it here: 21203be#diff-b84d4d768d7f353bd7c78f0282c215bfR211
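As a rough reconstruction of how a stale row counter can trigger the "Ambiguous reference" error, consider the sketch below. It is an assumption about the shape of the problem, not the code at the diff link: the class, its fields, and the "No matches" message are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a rowName reference map; not the DSpace implementation.
class ReferenceMapSketch {
    private final Map<String, Set<Integer>> references = new HashMap<>();
    private int rowCount = 0;

    /** Must be called before each pass over the CSV; forgetting this is the bug being illustrated. */
    void reset() {
        references.clear();
        rowCount = 0;
    }

    /** Registers the next CSV row under its rowName reference. */
    void addRow(String rowName) {
        rowCount++;
        references.computeIfAbsent("rowName:" + rowName, k -> new HashSet<>()).add(rowCount);
    }

    /** Resolves a reference to exactly one row, or fails if it matches none or several. */
    int resolve(String reference) {
        Set<Integer> rows = references.getOrDefault(reference, Collections.emptySet());
        if (rows.size() > 1) {
            throw new IllegalArgumentException("Ambiguous reference; multiple matches in csv: " + reference);
        }
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("No matches in csv: " + reference);
        }
        return rows.iterator().next();
    }

    public static void main(String[] args) {
        ReferenceMapSketch map = new ReferenceMapSketch();
        map.addRow("a");              // first pass: row 1
        map.addRow("a");              // second pass without reset(): row 2, same name
        try {
            map.resolve("rowName:a");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
        map.reset();                  // with the reset, the same CSV resolves cleanly
        map.addRow("a");
        System.out.println(map.resolve("rowName:a"));
    }
}
```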

I also tested the sample you gave me (thank you for that). The file is ingested properly but fails relationship validation due to a max cardinality constraint.

2019-10-04 11:09:23,015 WARN  org.dspace.content.RelationshipServiceImpl @ The relationship has been deemed invalid since the right item has more relationships than the right max cardinality allows after we'd store this relationship
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's ID is: 3
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's left label is: isParentOrgUnit
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's right label is: isChildOrgUnit
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's left entityType label is: OrgUnit
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's right entityType label is: OrgUnit
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's left min cardinality is: 0
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's left max cardinality is: null
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's right min cardinality is: 0
2019-10-04 11:09:23,016 WARN  org.dspace.content.RelationshipServiceImpl @ The relationshipType's right max cardinality is: 10

Currently my work is not accounting for cardinality constraints. I'm hoping to cover cardinality validation alongside relational ordering validation in a separate PR.
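For context on what the warning above is checking, a max cardinality test could look roughly like the sketch below. This is only an illustration of the rule the log message describes; the names are hypothetical, and treating a null maximum as "no upper bound" is an assumption based on the "left max cardinality is: null" line.

```java
// Hypothetical sketch; not the RelationshipServiceImpl code.
class CardinalitySketch {

    /**
     * Returns true if storing one more relationship would push the item
     * past the maximum allowed on its side of the relationship type.
     */
    static boolean exceedsMax(int existingCount, Integer maxCardinality) {
        if (maxCardinality == null) {
            return false; // assumption: null means unbounded
        }
        return existingCount + 1 > maxCardinality;
    }

    public static void main(String[] args) {
        System.out.println(exceedsMax(10, 10));  // true: an 11th relationship is over a max of 10
        System.out.println(exceedsMax(9, 10));   // false: the 10th relationship still fits
        System.out.println(exceedsMax(50, null)); // false under the null-means-unbounded assumption
    }
}
```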

@AndrewZachWood AndrewZachWood force-pushed the DS-4226-CSV_Import_Entities_Improvments branch from c116229 to 0773f79 Compare October 9, 2019 16:15
@AndrewZachWood
Contributor Author

@benbosman I've reproduced your use case and confirmed that using the incorrect typeName is NOT caught when it should be. I've adjusted the code to catch such cases by ensuring that the direction of the typeName is valid with respect to the Entity type being processed, and I've included an IT for this case: 0773f79#diff-5ce04a4fb4f137714321ddf616041328R454

I've also adjusted method calls and JavaDocs to match the changes from community master.
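To make the direction check concrete, here is a minimal sketch of the idea. It is not the code from the commit above: the class and method are hypothetical, and both the matching rule and the Publication/Project relationship type used in `main` are assumptions made for illustration.

```java
// Hypothetical sketch of a direction check; not the code from this PR.
class DirectionCheckSketch {

    /** A simplified relationship type: entity types on each side plus the label for each direction. */
    static final class RelationshipType {
        final String leftType;
        final String rightType;
        final String leftwardType;
        final String rightwardType;

        RelationshipType(String leftType, String rightType, String leftwardType, String rightwardType) {
            this.leftType = leftType;
            this.rightType = rightType;
            this.leftwardType = leftwardType;
            this.rightwardType = rightwardType;
        }
    }

    /**
     * Assumed rule: a header typeName is valid for a row only if the row's entity type sits on the
     * side whose label matches the typeName, and the target's entity type sits on the other side.
     */
    static boolean isValidFor(RelationshipType type, String typeName, String originType, String targetType) {
        boolean fromLeft = type.leftType.equals(originType)
            && type.leftwardType.equals(typeName)
            && type.rightType.equals(targetType);
        boolean fromRight = type.rightType.equals(originType)
            && type.rightwardType.equals(typeName)
            && type.leftType.equals(targetType);
        return fromLeft || fromRight;
    }

    public static void main(String[] args) {
        // Assumed configuration, for illustration only.
        RelationshipType type = new RelationshipType(
            "Publication", "Project", "isProjectOfPublication", "isPublicationOfProject");
        // A Project row linking to a Publication:
        System.out.println(isValidFor(type, "isPublicationOfProject", "Project", "Publication")); // true
        System.out.println(isValidFor(type, "isProjectOfPublication", "Project", "Publication")); // false
    }
}
```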

@paulo-graca
Contributor

Thank you @AndrewZWood it worked!
But there is another issue when you try to import a Person. The "person" field name is parsed specially (if I remember correctly, this is related to ORCID), so if, for instance, you try to import Person entities with any "person.givenName" field defined, you will get an error. I don't think we should change the schema name to anything else. This is perhaps a separate issue, unrelated to this PR.
Besides this, I think it's working as expected.

Contributor

@paulo-graca paulo-graca left a comment

I think this PR should be merged. This PR also fixes - https://jira.duraspace.org/browse/DS-4237 regarding the lack of error info.

@paulo-graca
Contributor

I would also like to add that I've just created this issue - https://jira.duraspace.org/browse/DS-4361 reporting the error I've just mentioned.

Member

@benbosman benbosman left a comment

@AndrewZWood it seems a small bug got introduced with verifying whether a relationship exists

If I try to import a Project type, it works correctly with a header relation.isPersonOfProject defined in my relationship types as:

 id | left_type | right_type |   leftward_type   |  rightward_type   | left_min_cardinality | left_max_cardinality | right_min_cardinality | right_max_cardinality 
----+-----------+------------+-------------------+-------------------+----------------------+----------------------+-----------------------+-----------------------
  4 |         2 |          3 | isProjectOfPerson | isPersonOfProject |                    0 |                      |                     0 |                      

3 is the project type

If I try to import a Project type, it doesn't work correctly with a header relation.isOrgUnitOfProject defined in my relationship types as:

 id | left_type | right_type |   leftward_type    |   rightward_type   | left_min_cardinality | left_max_cardinality | right_min_cardinality | right_max_cardinality 
----+-----------+------------+--------------------+--------------------+----------------------+----------------------+-----------------------+-----------------------
  6 |         3 |          4 | isOrgUnitOfProject | isProjectOfOrgUnit |                    0 |                      |                     0 |                      

3 is the project type

This is causing a correct CSV to be refused

@AndrewZachWood
Contributor Author

AndrewZachWood commented Oct 14, 2019

@benbosman Sorry to be a pain but I am unable to reproduce the error you're describing with the following CSVs I created to match your description:

Project type to relation.isOrgUnitOfProject
id,dc.title,relationship.type,relation.isOrgUnitOfProject,collection,rowName
+,New OrgUnit,OrgUnit,,123456789/2,1
+,Some Project,Project,rowName:1,123456789/2,p1

and Project type to relation.isPersonOfProject
id,dc.title,relationship.type,relation.isPersonOfProject,collection,rowName
+,New Person,Person,,123456789/2,1
+,Some Project,Project,rowName:1,123456789/2,p1

If you could provide me with the CSV you're using, or a sample of it, I can debug your specific case further.

Member

@benbosman benbosman left a comment

I've made an error in my test from #2522 (review)
I was trying to use relation.isOrgUnitOfProject on a Publication instead of an OrgUnit.

The fact that it got refused in the preview was desired.

This looks good to me

@benbosman benbosman merged commit f7d9e0c into DSpace:master Oct 15, 2019
@benbosman benbosman deleted the DS-4226-CSV_Import_Entities_Improvments branch March 13, 2020 16:08
@tdonohue tdonohue added this to the 7.0beta1 milestone Jan 26, 2021

4 participants