
Deduplicate identifiers #2896

Merged
4 commits merged into 2734-batch-doi-registration from 2848-duplicate-identifiers on Jun 9, 2022

Conversation


@joemull joemull commented May 30, 2022

We first tried to solve this with a unique constraint like this:

class Meta:
    unique_together = ("id_type", "identifier")

But this would not account for identifiers that are legitimately the same across several journals, which tends to occur for imported content.

So we tried adding the journal to the unique constraint. But journal is not a field on Identifier, and unique_together can't traverse foreign keys as with article__journal.

So we set out to define a custom validate_unique method, but replicating the behaviour of the default implementation turned out to be quite involved.

At this point the only completed work that we will likely want to keep is the migration that will find and remove duplicates in previous data.
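The migration's core idea can be sketched in plain Python (this is an illustration of the keep-first, delete-the-rest logic, not the actual migration code; the function and variable names here are hypothetical):

```python
# Sketch of the dedup logic: for each (id_type, identifier string) group,
# keep the first row seen and mark the rest for deletion.
# rows: list of (pk, id_type, identifier) tuples, in pk order.
def deduplicate(rows):
    seen = set()
    kept, deleted = [], []
    for pk, id_type, identifier in rows:
        key = (id_type, identifier)
        if key in seen:
            deleted.append(pk)       # duplicate: slated for removal
        else:
            seen.add(key)
            kept.append(pk)          # first occurrence: kept
    return kept, deleted

rows = [
    (1, "doi", "10.1234/abc"),
    (2, "doi", "10.1234/abc"),   # duplicate of pk 1
    (3, "doi", "10.1234/xyz"),
]
kept, deleted = deduplicate(rows)
print(kept, deleted)  # [1, 3] [2]
```

The real migration works on Identifier querysets rather than tuples, but the grouping and keep-first behaviour is the same.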

@ajrbyers ajrbyers marked this pull request as ready for review May 31, 2022 15:07
@ajrbyers

Closes #2848

    id_type=id_type,
    identifier=identifier,
).exclude(
    article=self.article,

I believe this would still allow any given article to have duplicate identifiers. A way to avoid this would be to add an exclude on id=identifier.id.

Comment on lines +18 to +30
    to_keep_pk = identifiers_of_type.filter(identifier=doi_string).values_list('id', flat=True)[0]
    duplicate_pks = identifiers_of_type.filter(identifier=doi_string).values_list('id', flat=True)[1:]
    duplicate_identifiers = [identifier for identifier in Identifier.objects.filter(pk__in=duplicate_pks)]
    print(id_type, to_keep_pk, doi_string)
    if duplicate_identifiers:
        print('\n\n\n')
        print('To keep:')
        print(id_type, to_keep_pk, doi_string)
        print('Duplicates:')
        for dup in duplicate_identifiers:
            print(id_type, dup.pk, dup.identifier)
        print('\n\n\n')
    Identifier.objects.filter(pk__in=duplicate_pks).delete()

I don't think we need all this printing.
Also, we are keeping the first identifier in each duplicate group, which makes me wonder whether this is a good approach, since we won't be able to detect which of them is mistaken. I would advocate for deleting duplicate IDs for a given article, but refrain from doing this for duplicates across an entire journal, in case we wipe a correct identifier while preserving the wrong one.
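The distinction being drawn here can be sketched in plain Python (illustrative only; the function and variable names are hypothetical, not the PR's code): scoping the duplicate key to the article makes deletion safe, while a journal-wide key risks removing a correct identifier.

```python
# rows: list of (pk, article_pk, id_type, identifier) tuples.
# Returns the pks that would be deleted under each scoping.
def find_duplicates(rows, per_article=True):
    seen = set()
    deleted = []
    for pk, article_pk, id_type, identifier in rows:
        # Per-article scoping includes the article in the duplicate key;
        # journal-wide scoping ignores which article the identifier belongs to.
        key = (article_pk, id_type, identifier) if per_article else (id_type, identifier)
        if key in seen:
            deleted.append(pk)
        else:
            seen.add(key)
    return deleted

rows = [
    (1, 10, "doi", "10.1234/abc"),
    (2, 10, "doi", "10.1234/abc"),  # same article: a safe deletion
    (3, 11, "doi", "10.1234/abc"),  # different article: ambiguous, keep
]
print(find_duplicates(rows, per_article=True))   # [2]
print(find_duplicates(rows, per_article=False))  # [2, 3]
```

With per-article scoping only pk 2 is removed; the journal-wide variant would also remove pk 3, even though we can't tell whether pk 1 or pk 3 holds the correct identifier.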


@mauromsl mauromsl left a comment


One more change

Comment on lines 45 to 46
    ).exclude(
        article=self.article,

@mauromsl mauromsl Jun 8, 2022


This will make it so that identifier edits won't be valid (since the query will return the identifier being edited)
A solution to this would be:

if self.instance and self.instance.id:
    idents = idents.exclude(id=self.instance.id)

@mauromsl mauromsl merged commit 5eb2b15 into 2734-batch-doi-registration Jun 9, 2022
@mauromsl mauromsl deleted the 2848-duplicate-identifiers branch June 9, 2022 14:05
@joemull joemull linked an issue Jun 14, 2022 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Duplicate DOIs
3 participants