WIP: sync ldap into names #195

jrcastro2 · 2024-09-04T08:04:41Z

closes names vocab: allow names vocab to have 2 types of objects #193

As a non admin user, see the synced value

Both values, not yet synced

As an admin, both values, the deprecated one appears greyed

* updates display for CERN creatibutors

kpsherva · 2024-10-23T14:30:00Z

site/cds_rdm/jobs.py

+        if since is None and job_obj.last_runs["success"]:
+            since = job_obj.last_runs["success"].started_at
+        else:
+            since = (datetime.now() - timedelta(days=1)).isoformat()


shouldn't be needed, ping @ntarocco to confirm

I thought only needed for the first run. However this has to be updated once jobs is refactored, might be solved by then

EDIT: I see that you mention everything not only the last line of code, I guess I am missing context, but why shouldn't it be needed?

The since param is already pre-computed to None (never run) or last success, see here and docstring here.
Was this copied from somewhere else and it should be also fixed there?

We have the same here: https://github.com/inveniosoftware/invenio-vocabularies/blob/master/invenio_vocabularies/jobs.py#L60-L63

But not sure I understand well, does this means that our funcs do need ot accept since=None?

EDIT: discussed IRL, just need to set the since when is None

kpsherva · 2024-10-23T14:30:16Z

site/cds_rdm/jobs.py

+    @classmethod
+    def build_task_arguments(cls, job_obj, since=None, user_id=None, **kwargs):
+        """Build task arguments."""
+        if since is None and job_obj.last_runs["success"]:


shouldn't be needed

kpsherva · 2024-10-23T14:32:36Z

site/cds_rdm/tasks.py

+def sync_local_accounts_to_names(since, user_id=None):
+    """Syncs local accounts to names vocabulary.
+
+    If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed.


just for me to understand the use cases: when is name consider "deprecated"? When user leaves CERN?
will this influence the display of their names in the previously authored records?

Right now we only have one use case where an user is considered deprecated. This happens when we have the user is coming from ORCID and coming from CERN database with the ORCID registered there as well, in these cases we mark the CERN user as deprecated and merge the extra data in the ORCID value.

However, there can be more cases in the future where we want to deprecate names, not sure about the one you mention - when a user leaves CERN - is a valid one, as per my understanding we might want to still add that person as contributor of a project although they have left. However it might be a valid case or there might be others.

Existing records are not affected by deprecated values, this only affects the search display, the depreacted values won't be shown.

Karolina has a good point, maybe we want to be a little bit more descriptive in the docstring here.
We should explain in a clear and concise way the logic and business rules.

kpsherva · 2024-10-23T15:12:24Z

site/cds_rdm/tasks.py

+        updated = False
+        updated_name = {**name}
+
+        if not is_orcid:


I guess this means orcid takes precedence as a source of truth when it comes to updates
I am not sure though if thats what we want to achieve, I think CERN will be more reliable than orcid to be up to date regarding name and surname, no?

@kpsherva It is not about reliability, it is more about using PIDs as preferred ID.
@jrcastro2 we need to explain this in the docstring.

We should always prefer ORCID whenever possible, because data is harvested by external services, and the ORCID is the only widely recognized PID, understood by other systems.
The idea is to use/show the ORCID when available, and fallback to local CDS id when ORCID not avaiable.

kpsherva · 2024-10-23T15:14:56Z

site/cds_rdm/tasks.py

+                updated = True
+
+        # add the email as props
+        if user.email and user.email != name.get("props", {}).get("email"):


how is email then consumed?

The email is displayed in the UI

kpsherva · 2024-10-23T15:17:06Z

site/cds_rdm/tasks.py

+    # Allows to sync a single user
+    if user_id:
+        user = User.query.get(user_id)
+        users = [user] if user else []


I think

Suggested change

users = [user] if user else []

users = [user]

won't .get raise an exeption if user is not found for the given id?

It does return None. We might want to use one() if we want to raise.

It is indeed better to raise and provide a nice error clear msg, otherwise here the task will simply exist without any message.

kpsherva · 2024-10-23T15:18:16Z

site/cds_rdm/tasks.py

+    for user in users:
+        orcid = None
+        try:
+            name_dict = service.read(system_identity, str(user.id)).to_dict()


for readability, I was not sure what service we read from here

Suggested change

name_dict = service.read(system_identity, str(user.id)).to_dict()

name_dict = names_service.read(system_identity, str(user.id)).to_dict()

In general, I am not very pro comments... but if no comments, then the code should be self-explanatory.
I suggest 2 possible approaches (choose the clean code methodology you prefer :D):

add clear comments

split the code in self-explanatory functions or nicely names vars (I prefer this way)

For example here we should explain that we first try to find the user by its own id and if not found by the orcid, and why.

Example:

# First, we look up the name by the CDS id: if the user has an ORCID, we need to deprecate it.

Suggestion: one good way to implement complex parts, also suggested in the Clean Code, is to first write the func to call, and then later on, implement them.
Here, I would go with something like:

def _find_by_cds_id(...) user_cds_id = str(user.id) # to be changed to the random id user_orcid = user.user_profile.get("orcid") return get_by_id(user_cds_id) try: name = _find_by_cds_id(...) is_unlisted = name.get(...) has_orcid = name.get(...) if is_unlisted and has_orcid: promote_orcid(...) # this will unlist the cds-id name and update/create the orcid one except ... try: _find_by_orcid(..) except ... _create_missing(...)

which is similar to yours, but with more self-explanatory names.

Wouldn't the code above avoid the double task? I can't remember why we needed to split them.

kpsherva · 2024-10-23T15:23:45Z

site/cds_rdm/tasks.py

+@shared_task()
+def merge_duplicate_names_vocabulary(since=None):
+    """Merges duplicate names in the names vocabulary."""
+    service = current_service_registry.get("names")


just for readability

Suggested change

service = current_service_registry.get("names")

names_service = current_service_registry.get("names")

kpsherva · 2024-10-23T15:25:43Z

site/cds_rdm/tasks.py

+        """
+        updated = False
+        # Merge the affiliations
+        affiliations_to = name_to.get("affiliations", [])


it would be a little bit less "tangled" if we call it affiliations_source instead of affiliations_from and affiliations_dest instead of affiliations_to, but lets hear what the rest of the team thinks before changing it

kpsherva · 2024-10-23T15:33:41Z

site/cds_rdm/tasks.py

+        for affiliation_from in name_from.get("affiliations", []):
+            if affiliation_from not in affiliations_to:
+                affiliations_to.append(affiliation_from)
+                updated = True


sorry I am really lost on _from and _to, not sure which one is the "new" update

I think this could be a bit faster (I didn't check though!) to read from in terms of merging lists, no need to check each element, set ensures no duplication

Suggested change

for affiliation_from in name_from.get("affiliations", []):

if affiliation_from not in affiliations_to:

affiliations_to.append(affiliation_from)

updated = True

affiliations_from = name_from.get("affiliations", [])

affiliations_to = name_to.get("affiliations", [])

if affliations_from != affiliations_to:

affiliations_to = list(set(affiliations_from + affiliations_to))

updated = True

but before changing lets check if the team agrees
if yes, same comment applies for all the list merges below

Hmm this could work with lists that only contain values, but in this case the lists contain dicts in different forms. For instance

affiliations = [{"name": <value>}, ....] identifiers = [{"identifier": <value>, "scheme": "orcid"}]

This lists are not hashable therefore cannot use set on them.

However I will try to improve this to make ir more readable.

kpsherva · 2024-10-23T15:37:43Z

site/cds_rdm/tasks.py

+        filters.append(dsl.Q("range", updated={"gte": since}))
+    combined_filter = dsl.Q("bool", filter=filters)
+
+    names = service.scan(system_identity, extra_filter=combined_filter)


shouldn't we rely on the db table instead?

Given that we are applying a lot of filters, plus there depending on the time range and/or when we just imported ORCID and the CERN authors, there might be a lot of results to process and there are millions of values in this table, I would expect this to be quite slow.

I am aware that the DB is the source of truth but mainly for performance optimization I would go with OS, however, if there are strong reasons to prioritize the DB, I can change it and test it with a bigger sample to asses the performance

kpsherva · 2024-10-23T15:40:25Z

site/cds_rdm/tasks.py

+                # We get the ORCID value and merge all the other values into it (Ideally there should be only 1 value apart from the ORCID)
+                orcid_name = None
+                other_names = []
+                for name_to_merge in names_to_merge.hits:


sorry for a silly question, but I am not sure what condition we use to determine if we merge the name or not. Could we discuss this during the demo?

I think that is not super clear just because there is a lot of code.
We merge when the CERN user has an ORCID in her/his profile, and the ORCID is found in the names vocab. from the ORCID dump. Basically, there is an ORCID match.

maybe stating this in a comment in the code could help to understand whats the required condition to merge. I think it should be documented/described with more details
or broken down to smaller components like you suggested in one of the comments above

ntarocco · 2024-10-23T16:35:15Z

assets/js/components/deposit/overrides/CreatibutorsRemoteSelectItem.js

+          className="inline-id-icon ml-5 mr-5"
+          verticalAlign="middle"
+        />
+        {creatibutor.props.email} 22


Suggested change

{creatibutor.props.email} 22

{creatibutor.props.email}

ntarocco · 2024-10-23T16:35:41Z

assets/less/cds-rdm/globals/site.overrides

@@ -91,3 +91,6 @@
  }
 }

+.color-grey {
+  color: grey !important;


Is the !important required?

Yes, but I can actually replace it by the color attribute from the cmp, thanks!

ntarocco · 2024-10-23T16:38:00Z

site/cds_rdm/jobs.py

+        if since is None and job_obj.last_runs["success"]:
+            since = job_obj.last_runs["success"].started_at
+        else:
+            since = (datetime.now() - timedelta(days=1)).isoformat()


The since param is already pre-computed to None (never run) or last success, see here and docstring here.
Was this copied from somewhere else and it should be also fixed there?

ntarocco · 2024-10-23T16:39:33Z

site/cds_rdm/tasks.py

+def sync_local_accounts_to_names(since, user_id=None):
+    """Syncs local accounts to names vocabulary.
+
+    If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed.


Karolina has a good point, maybe we want to be a little bit more descriptive in the docstring here.
We should explain in a clear and concise way the logic and business rules.

ntarocco · 2024-10-23T16:39:59Z

site/cds_rdm/tasks.py

+
+    If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed.
+    """
+    prop_values = ["group", "department", "maibox", "section"]


Do we need those? I would keep only what we show, meaning the e-mail.
If we want to display more then it can make sense, e.g.
my.email@cern.ch (IT-AB-CD)

ntarocco · 2024-10-23T16:42:56Z

site/cds_rdm/tasks.py

+        updated = False
+        updated_name = {**name}
+
+        if not is_orcid:


@kpsherva It is not about reliability, it is more about using PIDs as preferred ID.
@jrcastro2 we need to explain this in the docstring.

We should always prefer ORCID whenever possible, because data is harvested by external services, and the ORCID is the only widely recognized PID, understood by other systems.
The idea is to use/show the ORCID when available, and fallback to local CDS id when ORCID not avaiable.

ntarocco · 2024-10-23T16:44:30Z

site/cds_rdm/tasks.py

+        user = User.query.get(user_id)
+        users = [user] if user else []
+    else:
+        users = User.query.filter(User.updated > since).all()


since might be None at the first run of the job.

We changed it on the job to set it to 1 day ago if None, we can update it as we desire.

ntarocco · 2024-10-23T16:47:41Z

site/cds_rdm/tasks.py

+    # Allows to sync a single user
+    if user_id:
+        user = User.query.get(user_id)
+        users = [user] if user else []


It does return None. We might want to use one() if we want to raise.

It is indeed better to raise and provide a nice error clear msg, otherwise here the task will simply exist without any message.

ntarocco · 2024-10-23T16:50:29Z

site/cds_rdm/tasks.py

+    for user in users:
+        orcid = None
+        try:
+            name_dict = service.read(system_identity, str(user.id)).to_dict()


In general, I am not very pro comments... but if no comments, then the code should be self-explanatory.
I suggest 2 possible approaches (choose the clean code methodology you prefer :D):

add clear comments

split the code in self-explanatory functions or nicely names vars (I prefer this way)

For example here we should explain that we first try to find the user by its own id and if not found by the orcid, and why.

Example:

# First, we look up the name by the CDS id: if the user has an ORCID, we need to deprecate it.

Suggestion: one good way to implement complex parts, also suggested in the Clean Code, is to first write the func to call, and then later on, implement them.
Here, I would go with something like:

def _find_by_cds_id(...) user_cds_id = str(user.id) # to be changed to the random id user_orcid = user.user_profile.get("orcid") return get_by_id(user_cds_id) try: name = _find_by_cds_id(...) is_unlisted = name.get(...) has_orcid = name.get(...) if is_unlisted and has_orcid: promote_orcid(...) # this will unlist the cds-id name and update/create the orcid one except ... try: _find_by_orcid(..) except ... _create_missing(...)

which is similar to yours, but with more self-explanatory names.

Wouldn't the code above avoid the double task? I can't remember why we needed to split them.

ntarocco · 2024-10-23T17:11:26Z

site/cds_rdm/tasks.py

+                # We get the ORCID value and merge all the other values into it (Ideally there should be only 1 value apart from the ORCID)
+                orcid_name = None
+                other_names = []
+                for name_to_merge in names_to_merge.hits:


I think that is not super clear just because there is a lot of code.
We merge when the CERN user has an ORCID in her/his profile, and the ORCID is found in the names vocab. from the ORCID dump. Basically, there is an ORCID match.

ntarocco · 2024-10-24T15:42:13Z

For the Dep - Group label, shall we put it even smaller, inside the parenthesis just after the email? It kind of belongs to the CERN part:
(ORCID logo ORCID ID CERN logo email depgroup)

jrcastro2 force-pushed the sync-ldap-names branch 5 times, most recently from 212d6dc to 6673d1a Compare September 20, 2024 08:48

jrcastro2 force-pushed the sync-ldap-names branch from 6673d1a to 81a947e Compare October 22, 2024 06:45

jrcastro2 marked this pull request as ready for review October 22, 2024 06:45

jrcastro2 force-pushed the sync-ldap-names branch from 81a947e to 1087edc Compare October 22, 2024 06:55

celery: add tasks to synchronize and merge names into vocabularies

9d5c950

* updates display for CERN creatibutors

jrcastro2 force-pushed the sync-ldap-names branch from 1087edc to 9d5c950 Compare October 22, 2024 12:55

kpsherva reviewed Oct 23, 2024

View reviewed changes

ntarocco reviewed Oct 23, 2024

View reviewed changes

	name_dict = service.read(system_identity, str(user.id)).to_dict()
	name_dict = names_service.read(system_identity, str(user.id)).to_dict()

	service = current_service_registry.get("names")
	names_service = current_service_registry.get("names")

WIP: sync ldap into names #195

Are you sure you want to change the base?

WIP: sync ldap into names #195

Conversation

jrcastro2 commented Sep 4, 2024 • edited Loading

As a non admin user, see the synced value

Both values, not yet synced

As an admin, both values, the deprecated one appears greyed

Choose a reason for hiding this comment

jrcastro2 Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jrcastro2 Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kpsherva Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

kpsherva Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kpsherva Oct 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ntarocco commented Oct 24, 2024 • edited Loading

jrcastro2 commented Sep 4, 2024 •

edited

Loading

jrcastro2 Oct 23, 2024 •

edited

Loading

jrcastro2 Oct 24, 2024 •

edited

Loading

kpsherva Oct 23, 2024 •

edited

Loading

kpsherva Oct 23, 2024 •

edited

Loading

kpsherva Oct 24, 2024 •

edited

Loading

ntarocco commented Oct 24, 2024 •

edited

Loading