
Deduplicate identifiers #2896

Merged
4 commits merged into 2734-batch-doi-registration from 2848-duplicate-identifiers on Jun 9, 2022

Conversation


@joemull joemull commented May 30, 2022

We first tried to solve this with a unique constraint like this:

class Meta:
    unique_together = ("id_type", "identifier")

But this would not account for identifiers that are legitimately the same across several journals, which tends to occur for imported content.

So we tried adding the journal to the unique constraint. But journal is not a field on Identifier, and unique_together can't traverse foreign keys as with article__journal.

So we set out to define a custom validate_unique method, but replicating the behaviour of the default implementation turned out to be quite involved.

At this point the only completed work that we will likely want to keep is the migration that will find and remove duplicates in previous data.
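The migration's core idea can be sketched in plain Python (this is an illustration of the keep-first, delete-the-rest logic, not the actual migration code; the function and variable names here are hypothetical):

```python
# Sketch of the dedup logic: for each (id_type, identifier string) group,
# keep the first row seen and mark the rest for deletion.
# rows: list of (pk, id_type, identifier) tuples, in pk order.
def deduplicate(rows):
    seen = set()
    kept, deleted = [], []
    for pk, id_type, identifier in rows:
        key = (id_type, identifier)
        if key in seen:
            deleted.append(pk)       # duplicate: slated for removal
        else:
            seen.add(key)
            kept.append(pk)          # first occurrence: kept
    return kept, deleted

rows = [
    (1, "doi", "10.1234/abc"),
    (2, "doi", "10.1234/abc"),   # duplicate of pk 1
    (3, "doi", "10.1234/xyz"),
]
kept, deleted = deduplicate(rows)
print(kept, deleted)  # [1, 3] [2]
```

The real migration works on Identifier querysets rather than tuples, but the grouping and keep-first behaviour is the same.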

@ajrbyers ajrbyers marked this pull request as ready for review May 31, 2022 15:07
@ajrbyers

Closes #2848

    id_type=id_type,
    identifier=identifier,
).exclude(
    article=self.article,

I believe this would still allow any given article to have duplicate identifiers. A way to avoid this would be to add an exclude on id=identifier.id.

Comment on lines +18 to +30
    to_keep_pk = identifiers_of_type.filter(identifier=doi_string).values_list('id', flat=True)[0]
    duplicate_pks = identifiers_of_type.filter(identifier=doi_string).values_list('id', flat=True)[1:]
    duplicate_identifiers = [identifier for identifier in Identifier.objects.filter(pk__in=duplicate_pks)]
    print(id_type, to_keep_pk, doi_string)
    if duplicate_identifiers:
        print('\n\n\n')
        print('To keep:')
        print(id_type, to_keep_pk, doi_string)
        print('Duplicates:')
        for dup in duplicate_identifiers:
            print(id_type, dup.pk, dup.identifier)
        print('\n\n\n')
    Identifier.objects.filter(pk__in=duplicate_pks).delete()

I don't think we need all this printing.
Also, we are keeping the first identifier in each duplicate group, which makes me wonder whether this is a good approach, since we won't be able to detect which of them is mistaken. I would advocate for deleting duplicate IDs for a given article, but refrain from doing this for duplicates across an entire journal, in case we wipe a correct identifier while preserving the wrong one.
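The distinction being drawn here can be sketched in plain Python (illustrative only; the function and variable names are hypothetical, not the PR's code): scoping the duplicate key to the article makes deletion safe, while a journal-wide key risks removing a correct identifier.

```python
# rows: list of (pk, article_pk, id_type, identifier) tuples.
# Returns the pks that would be deleted under each scoping.
def find_duplicates(rows, per_article=True):
    seen = set()
    deleted = []
    for pk, article_pk, id_type, identifier in rows:
        # Per-article scoping includes the article in the duplicate key;
        # journal-wide scoping ignores which article the identifier belongs to.
        key = (article_pk, id_type, identifier) if per_article else (id_type, identifier)
        if key in seen:
            deleted.append(pk)
        else:
            seen.add(key)
    return deleted

rows = [
    (1, 10, "doi", "10.1234/abc"),
    (2, 10, "doi", "10.1234/abc"),  # same article: a safe deletion
    (3, 11, "doi", "10.1234/abc"),  # different article: ambiguous, keep
]
print(find_duplicates(rows, per_article=True))   # [2]
print(find_duplicates(rows, per_article=False))  # [2, 3]
```

With per-article scoping only pk 2 is removed; the journal-wide variant would also remove pk 3, even though we can't tell whether pk 1 or pk 3 holds the correct identifier.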


@mauromsl mauromsl left a comment


One more change

Comment on lines 45 to 46
    ).exclude(
        article=self.article,

@mauromsl mauromsl Jun 8, 2022


This will make it so that identifier edits won't be valid (since the query will return the identifier being edited)
A solution to this would be:

if self.instance and self.instance.id:
    idents = idents.exclude(id=self.instance.id)

@mauromsl mauromsl merged commit 5eb2b15 into 2734-batch-doi-registration Jun 9, 2022
@mauromsl mauromsl deleted the 2848-duplicate-identifiers branch June 9, 2022 14:05
@joemull joemull linked an issue Jun 14, 2022 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Duplicate DOIs
3 participants