Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: sync ldap into names #195

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jrcastro2
Copy link
Contributor

@jrcastro2 jrcastro2 commented Sep 4, 2024

As a non admin user, see the synced value

image

Both values, not yet synced

image

As an admin, both values, the deprecated one appears greyed

image

@jrcastro2 jrcastro2 force-pushed the sync-ldap-names branch 5 times, most recently from 212d6dc to 6673d1a Compare September 20, 2024 08:48
@jrcastro2 jrcastro2 marked this pull request as ready for review October 22, 2024 06:45
Comment on lines +46 to +49
if since is None and job_obj.last_runs["success"]:
since = job_obj.last_runs["success"].started_at
else:
since = (datetime.now() - timedelta(days=1)).isoformat()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't be needed, ping @ntarocco to confirm

Copy link
Contributor Author

@jrcastro2 jrcastro2 Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought only needed for the first run. However this has to be updated once jobs is refactored, might be solved by then

EDIT: I see that you mention everything not only the last line of code, I guess I am missing context, but why shouldn't it be needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The since param is already pre-computed to None (never run) or last success, see here and docstring here.
Was this copied from somewhere else and it should be also fixed there?

Copy link
Contributor Author

@jrcastro2 jrcastro2 Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the same here: https://github.com/inveniosoftware/invenio-vocabularies/blob/master/invenio_vocabularies/jobs.py#L60-L63

But not sure I understand well, does this means that our funcs do need ot accept since=None?

EDIT: discussed IRL, just need to set the since when is None

@classmethod
def build_task_arguments(cls, job_obj, since=None, user_id=None, **kwargs):
"""Build task arguments."""
if since is None and job_obj.last_runs["success"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't be needed

def sync_local_accounts_to_names(since, user_id=None):
"""Syncs local accounts to names vocabulary.

If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for me to understand the use cases: when is name consider "deprecated"? When user leaves CERN?
will this influence the display of their names in the previously authored records?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we only have one use case where an user is considered deprecated. This happens when we have the user is coming from ORCID and coming from CERN database with the ORCID registered there as well, in these cases we mark the CERN user as deprecated and merge the extra data in the ORCID value.

However, there can be more cases in the future where we want to deprecate names, not sure about the one you mention - when a user leaves CERN - is a valid one, as per my understanding we might want to still add that person as contributor of a project although they have left. However it might be a valid case or there might be others.

Existing records are not affected by deprecated values, this only affects the search display, the depreacted values won't be shown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Karolina has a good point, maybe we want to be a little bit more descriptive in the docstring here.
We should explain in a clear and concise way the logic and business rules.

updated = False
updated_name = {**name}

if not is_orcid:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this means orcid takes precedence as a source of truth when it comes to updates
I am not sure though if thats what we want to achieve, I think CERN will be more reliable than orcid to be up to date regarding name and surname, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kpsherva It is not about reliability, it is more about using PIDs as preferred ID.
@jrcastro2 we need to explain this in the docstring.

We should always prefer ORCID whenever possible, because data is harvested by external services, and the ORCID is the only widely recognized PID, understood by other systems.
The idea is to use/show the ORCID when available, and fallback to local CDS id when ORCID not avaiable.

updated = True

# add the email as props
if user.email and user.email != name.get("props", {}).get("email"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is email then consumed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The email is displayed in the UI
image

# Allows to sync a single user
if user_id:
user = User.query.get(user_id)
users = [user] if user else []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think

Suggested change
users = [user] if user else []
users = [user]

won't .get raise an exeption if user is not found for the given id?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does return None. We might want to use one() if we want to raise.

It is indeed better to raise and provide a nice error clear msg, otherwise here the task will simply exist without any message.

for user in users:
orcid = None
try:
name_dict = service.read(system_identity, str(user.id)).to_dict()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for readability, I was not sure what service we read from here

Suggested change
name_dict = service.read(system_identity, str(user.id)).to_dict()
name_dict = names_service.read(system_identity, str(user.id)).to_dict()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I am not very pro comments... but if no comments, then the code should be self-explanatory.
I suggest 2 possible approaches (choose the clean code methodology you prefer :D):

  1. add clear comments
  2. split the code in self-explanatory functions or nicely names vars (I prefer this way)

For example here we should explain that we first try to find the user by its own id and if not found by the orcid, and why.

Example:

# First, we look up the name by the CDS id: if the user has an ORCID, we need to deprecate it.

Suggestion: one good way to implement complex parts, also suggested in the Clean Code, is to first write the func to call, and then later on, implement them.
Here, I would go with something like:

def _find_by_cds_id(...)
  user_cds_id = str(user.id)  # to be changed to the random id
  user_orcid = user.user_profile.get("orcid")
  return get_by_id(user_cds_id)
 
try:
  name =  _find_by_cds_id(...)
  is_unlisted = name.get(...)
  has_orcid = name.get(...)
  if is_unlisted and has_orcid:
    promote_orcid(...)  # this will unlist the cds-id name and update/create the orcid one
except ...
  try:
    _find_by_orcid(..)
  except ...
    _create_missing(...)

which is similar to yours, but with more self-explanatory names.

Wouldn't the code above avoid the double task? I can't remember why we needed to split them.

@shared_task()
def merge_duplicate_names_vocabulary(since=None):
"""Merges duplicate names in the names vocabulary."""
service = current_service_registry.get("names")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for readability

Suggested change
service = current_service_registry.get("names")
names_service = current_service_registry.get("names")

"""
updated = False
# Merge the affiliations
affiliations_to = name_to.get("affiliations", [])
Copy link
Contributor

@kpsherva kpsherva Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be a little bit less "tangled" if we call it affiliations_source instead of affiliations_from and affiliations_dest instead of affiliations_to, but lets hear what the rest of the team thinks before changing it

Comment on lines +287 to +290
for affiliation_from in name_from.get("affiliations", []):
if affiliation_from not in affiliations_to:
affiliations_to.append(affiliation_from)
updated = True
Copy link
Contributor

@kpsherva kpsherva Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I am really lost on _from and _to, not sure which one is the "new" update

I think this could be a bit faster (I didn't check though!) to read from in terms of merging lists, no need to check each element, set ensures no duplication

Suggested change
for affiliation_from in name_from.get("affiliations", []):
if affiliation_from not in affiliations_to:
affiliations_to.append(affiliation_from)
updated = True
affiliations_from = name_from.get("affiliations", [])
affiliations_to = name_to.get("affiliations", [])
if affliations_from != affiliations_to:
affiliations_to = list(set(affiliations_from + affiliations_to))
updated = True

but before changing lets check if the team agrees
if yes, same comment applies for all the list merges below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this could work with lists that only contain values, but in this case the lists contain dicts in different forms. For instance

affiliations = [{"name": <value>}, ....]
identifiers = [{"identifier": <value>, "scheme": "orcid"}]

This lists are not hashable therefore cannot use set on them.

However I will try to improve this to make ir more readable.

filters.append(dsl.Q("range", updated={"gte": since}))
combined_filter = dsl.Q("bool", filter=filters)

names = service.scan(system_identity, extra_filter=combined_filter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we rely on the db table instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are applying a lot of filters, plus there depending on the time range and/or when we just imported ORCID and the CERN authors, there might be a lot of results to process and there are millions of values in this table, I would expect this to be quite slow.

I am aware that the DB is the source of truth but mainly for performance optimization I would go with OS, however, if there are strong reasons to prioritize the DB, I can change it and test it with a bigger sample to asses the performance

# We get the ORCID value and merge all the other values into it (Ideally there should be only 1 value apart from the ORCID)
orcid_name = None
other_names = []
for name_to_merge in names_to_merge.hits:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for a silly question, but I am not sure what condition we use to determine if we merge the name or not. Could we discuss this during the demo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is not super clear just because there is a lot of code.
We merge when the CERN user has an ORCID in her/his profile, and the ORCID is found in the names vocab. from the ORCID dump. Basically, there is an ORCID match.

Copy link
Contributor

@kpsherva kpsherva Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe stating this in a comment in the code could help to understand whats the required condition to merge. I think it should be documented/described with more details
or broken down to smaller components like you suggested in one of the comments above

className="inline-id-icon ml-5 mr-5"
verticalAlign="middle"
/>
{creatibutor.props.email} 22
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{creatibutor.props.email} 22
{creatibutor.props.email}

@@ -91,3 +91,6 @@
}
}

.color-grey {
color: grey !important;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the !important required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I can actually replace it by the color attribute from the cmp, thanks!

Comment on lines +46 to +49
if since is None and job_obj.last_runs["success"]:
since = job_obj.last_runs["success"].started_at
else:
since = (datetime.now() - timedelta(days=1)).isoformat()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The since param is already pre-computed to None (never run) or last success, see here and docstring here.
Was this copied from somewhere else and it should be also fixed there?

def sync_local_accounts_to_names(since, user_id=None):
"""Syncs local accounts to names vocabulary.

If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Karolina has a good point, maybe we want to be a little bit more descriptive in the docstring here.
We should explain in a clear and concise way the logic and business rules.


If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed.
"""
prop_values = ["group", "department", "maibox", "section"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need those? I would keep only what we show, meaning the e-mail.
If we want to display more then it can make sense, e.g.
my.email@cern.ch (IT-AB-CD)

updated = False
updated_name = {**name}

if not is_orcid:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kpsherva It is not about reliability, it is more about using PIDs as preferred ID.
@jrcastro2 we need to explain this in the docstring.

We should always prefer ORCID whenever possible, because data is harvested by external services, and the ORCID is the only widely recognized PID, understood by other systems.
The idea is to use/show the ORCID when available, and fallback to local CDS id when ORCID not avaiable.

user = User.query.get(user_id)
users = [user] if user else []
else:
users = User.query.filter(User.updated > since).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since might be None at the first run of the job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We changed it on the job to set it to 1 day ago if None, we can update it as we desire.

# Allows to sync a single user
if user_id:
user = User.query.get(user_id)
users = [user] if user else []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does return None. We might want to use one() if we want to raise.

It is indeed better to raise and provide a nice error clear msg, otherwise here the task will simply exist without any message.

for user in users:
orcid = None
try:
name_dict = service.read(system_identity, str(user.id)).to_dict()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I am not very pro comments... but if no comments, then the code should be self-explanatory.
I suggest 2 possible approaches (choose the clean code methodology you prefer :D):

  1. add clear comments
  2. split the code in self-explanatory functions or nicely names vars (I prefer this way)

For example here we should explain that we first try to find the user by its own id and if not found by the orcid, and why.

Example:

# First, we look up the name by the CDS id: if the user has an ORCID, we need to deprecate it.

Suggestion: one good way to implement complex parts, also suggested in the Clean Code, is to first write the func to call, and then later on, implement them.
Here, I would go with something like:

def _find_by_cds_id(...)
  user_cds_id = str(user.id)  # to be changed to the random id
  user_orcid = user.user_profile.get("orcid")
  return get_by_id(user_cds_id)
 
try:
  name =  _find_by_cds_id(...)
  is_unlisted = name.get(...)
  has_orcid = name.get(...)
  if is_unlisted and has_orcid:
    promote_orcid(...)  # this will unlist the cds-id name and update/create the orcid one
except ...
  try:
    _find_by_orcid(..)
  except ...
    _create_missing(...)

which is similar to yours, but with more self-explanatory names.

Wouldn't the code above avoid the double task? I can't remember why we needed to split them.

# We get the ORCID value and merge all the other values into it (Ideally there should be only 1 value apart from the ORCID)
orcid_name = None
other_names = []
for name_to_merge in names_to_merge.hits:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is not super clear just because there is a lot of code.
We merge when the CERN user has an ORCID in her/his profile, and the ORCID is found in the names vocab. from the ORCID dump. Basically, there is an ORCID match.

@ntarocco
Copy link
Contributor

ntarocco commented Oct 24, 2024

For the Dep - Group label, shall we put it even smaller, inside the parenthesis just after the email? It kind of belongs to the CERN part:
(ORCID logo ORCID ID CERN logo email depgroup)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

names vocab: allow names vocab to have 2 types of objects
3 participants