-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: sync ldap into names #195
base: master
Are you sure you want to change the base?
Conversation
212d6dc
to
6673d1a
Compare
6673d1a
to
81a947e
Compare
81a947e
to
1087edc
Compare
* updates display for CERN creatibutors
1087edc
to
9d5c950
Compare
if since is None and job_obj.last_runs["success"]: | ||
since = job_obj.last_runs["success"].started_at | ||
else: | ||
since = (datetime.now() - timedelta(days=1)).isoformat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't be needed, ping @ntarocco to confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought only needed for the first run. However this has to be updated once jobs is refactored, might be solved by then
EDIT: I see that you mention everything not only the last line of code, I guess I am missing context, but why shouldn't it be needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the same here: https://github.com/inveniosoftware/invenio-vocabularies/blob/master/invenio_vocabularies/jobs.py#L60-L63
But not sure I understand well, does this means that our funcs do need ot accept since=None?
EDIT: discussed IRL, just need to set the since when is None
@classmethod | ||
def build_task_arguments(cls, job_obj, since=None, user_id=None, **kwargs): | ||
"""Build task arguments.""" | ||
if since is None and job_obj.last_runs["success"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't be needed
def sync_local_accounts_to_names(since, user_id=None): | ||
"""Syncs local accounts to names vocabulary. | ||
|
||
If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for me to understand the use cases: when is name consider "deprecated"? When user leaves CERN?
will this influence the display of their names in the previously authored records?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we only have one use case where an user is considered deprecated. This happens when we have the user is coming from ORCID and coming from CERN database with the ORCID registered there as well, in these cases we mark the CERN user as deprecated and merge the extra data in the ORCID value.
However, there can be more cases in the future where we want to deprecate names, not sure about the one you mention - when a user leaves CERN - is a valid one, as per my understanding we might want to still add that person as contributor of a project although they have left. However it might be a valid case or there might be others.
Existing records are not affected by deprecated values, this only affects the search display, the depreacted values won't be shown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Karolina has a good point, maybe we want to be a little bit more descriptive in the docstring here.
We should explain in a clear and concise way the logic and business rules.
updated = False | ||
updated_name = {**name} | ||
|
||
if not is_orcid: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this means orcid takes precedence as a source of truth when it comes to updates
I am not sure though if thats what we want to achieve, I think CERN will be more reliable than orcid to be up to date regarding name and surname, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kpsherva It is not about reliability, it is more about using PIDs as preferred ID.
@jrcastro2 we need to explain this in the docstring.
We should always prefer ORCID whenever possible, because data is harvested by external services, and the ORCID is the only widely recognized PID, understood by other systems.
The idea is to use/show the ORCID when available, and fallback to local CDS id when ORCID not avaiable.
updated = True | ||
|
||
# add the email as props | ||
if user.email and user.email != name.get("props", {}).get("email"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is email then consumed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Allows to sync a single user | ||
if user_id: | ||
user = User.query.get(user_id) | ||
users = [user] if user else [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
users = [user] if user else [] | |
users = [user] |
won't .get
raise an exeption if user is not found for the given id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does return None. We might want to use one() if we want to raise.
It is indeed better to raise and provide a nice error clear msg, otherwise here the task will simply exist without any message.
for user in users: | ||
orcid = None | ||
try: | ||
name_dict = service.read(system_identity, str(user.id)).to_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for readability, I was not sure what service we read from here
name_dict = service.read(system_identity, str(user.id)).to_dict() | |
name_dict = names_service.read(system_identity, str(user.id)).to_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I am not very pro comments... but if no comments, then the code should be self-explanatory.
I suggest 2 possible approaches (choose the clean code methodology you prefer :D):
- add clear comments
- split the code in self-explanatory functions or nicely names vars (I prefer this way)
For example here we should explain that we first try to find the user by its own id and if not found by the orcid, and why.
Example:
# First, we look up the name by the CDS id: if the user has an ORCID, we need to deprecate it.
Suggestion: one good way to implement complex parts, also suggested in the Clean Code, is to first write the func to call, and then later on, implement them.
Here, I would go with something like:
def _find_by_cds_id(...)
user_cds_id = str(user.id) # to be changed to the random id
user_orcid = user.user_profile.get("orcid")
return get_by_id(user_cds_id)
try:
name = _find_by_cds_id(...)
is_unlisted = name.get(...)
has_orcid = name.get(...)
if is_unlisted and has_orcid:
promote_orcid(...) # this will unlist the cds-id name and update/create the orcid one
except ...
try:
_find_by_orcid(..)
except ...
_create_missing(...)
which is similar to yours, but with more self-explanatory names.
Wouldn't the code above avoid the double task? I can't remember why we needed to split them.
@shared_task() | ||
def merge_duplicate_names_vocabulary(since=None): | ||
"""Merges duplicate names in the names vocabulary.""" | ||
service = current_service_registry.get("names") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for readability
service = current_service_registry.get("names") | |
names_service = current_service_registry.get("names") |
""" | ||
updated = False | ||
# Merge the affiliations | ||
affiliations_to = name_to.get("affiliations", []) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be a little bit less "tangled" if we call it affiliations_source
instead of affiliations_from
and affiliations_dest
instead of affiliations_to
, but lets hear what the rest of the team thinks before changing it
for affiliation_from in name_from.get("affiliations", []): | ||
if affiliation_from not in affiliations_to: | ||
affiliations_to.append(affiliation_from) | ||
updated = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I am really lost on _from
and _to
, not sure which one is the "new" update
I think this could be a bit faster (I didn't check though!) to read from in terms of merging lists, no need to check each element, set ensures no duplication
for affiliation_from in name_from.get("affiliations", []): | |
if affiliation_from not in affiliations_to: | |
affiliations_to.append(affiliation_from) | |
updated = True | |
affiliations_from = name_from.get("affiliations", []) | |
affiliations_to = name_to.get("affiliations", []) | |
if affliations_from != affiliations_to: | |
affiliations_to = list(set(affiliations_from + affiliations_to)) | |
updated = True |
but before changing lets check if the team agrees
if yes, same comment applies for all the list merges below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this could work with lists that only contain values, but in this case the lists contain dicts in different forms. For instance
affiliations = [{"name": <value>}, ....]
identifiers = [{"identifier": <value>, "scheme": "orcid"}]
This lists are not hashable therefore cannot use set
on them.
However I will try to improve this to make ir more readable.
filters.append(dsl.Q("range", updated={"gte": since})) | ||
combined_filter = dsl.Q("bool", filter=filters) | ||
|
||
names = service.scan(system_identity, extra_filter=combined_filter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we rely on the db table instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we are applying a lot of filters, plus there depending on the time range and/or when we just imported ORCID and the CERN authors, there might be a lot of results to process and there are millions of values in this table, I would expect this to be quite slow.
I am aware that the DB is the source of truth but mainly for performance optimization I would go with OS, however, if there are strong reasons to prioritize the DB, I can change it and test it with a bigger sample to asses the performance
# We get the ORCID value and merge all the other values into it (Ideally there should be only 1 value apart from the ORCID) | ||
orcid_name = None | ||
other_names = [] | ||
for name_to_merge in names_to_merge.hits: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for a silly question, but I am not sure what condition we use to determine if we merge the name or not. Could we discuss this during the demo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is not super clear just because there is a lot of code.
We merge when the CERN user has an ORCID in her/his profile, and the ORCID is found in the names vocab. from the ORCID dump. Basically, there is an ORCID match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe stating this in a comment in the code could help to understand whats the required condition to merge. I think it should be documented/described with more details
or broken down to smaller components like you suggested in one of the comments above
className="inline-id-icon ml-5 mr-5" | ||
verticalAlign="middle" | ||
/> | ||
{creatibutor.props.email} 22 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{creatibutor.props.email} 22 | |
{creatibutor.props.email} |
@@ -91,3 +91,6 @@ | |||
} | |||
} | |||
|
|||
.color-grey { | |||
color: grey !important; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the !important
required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I can actually replace it by the color attribute from the cmp, thanks!
if since is None and job_obj.last_runs["success"]: | ||
since = job_obj.last_runs["success"].started_at | ||
else: | ||
since = (datetime.now() - timedelta(days=1)).isoformat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def sync_local_accounts_to_names(since, user_id=None): | ||
"""Syncs local accounts to names vocabulary. | ||
|
||
If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Karolina has a good point, maybe we want to be a little bit more descriptive in the docstring here.
We should explain in a clear and concise way the logic and business rules.
|
||
If the name is marked as "unlisted" - meaning it is deprecated - we still update it with the new values, if needed. | ||
""" | ||
prop_values = ["group", "department", "maibox", "section"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need those? I would keep only what we show, meaning the e-mail.
If we want to display more then it can make sense, e.g.
my.email@cern.ch (IT-AB-CD)
updated = False | ||
updated_name = {**name} | ||
|
||
if not is_orcid: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kpsherva It is not about reliability, it is more about using PIDs as preferred ID.
@jrcastro2 we need to explain this in the docstring.
We should always prefer ORCID whenever possible, because data is harvested by external services, and the ORCID is the only widely recognized PID, understood by other systems.
The idea is to use/show the ORCID when available, and fallback to local CDS id when ORCID not avaiable.
user = User.query.get(user_id) | ||
users = [user] if user else [] | ||
else: | ||
users = User.query.filter(User.updated > since).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since
might be None at the first run of the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We changed it on the job to set it to 1 day ago if None, we can update it as we desire.
# Allows to sync a single user | ||
if user_id: | ||
user = User.query.get(user_id) | ||
users = [user] if user else [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does return None. We might want to use one() if we want to raise.
It is indeed better to raise and provide a nice error clear msg, otherwise here the task will simply exist without any message.
for user in users: | ||
orcid = None | ||
try: | ||
name_dict = service.read(system_identity, str(user.id)).to_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I am not very pro comments... but if no comments, then the code should be self-explanatory.
I suggest 2 possible approaches (choose the clean code methodology you prefer :D):
- add clear comments
- split the code in self-explanatory functions or nicely names vars (I prefer this way)
For example here we should explain that we first try to find the user by its own id and if not found by the orcid, and why.
Example:
# First, we look up the name by the CDS id: if the user has an ORCID, we need to deprecate it.
Suggestion: one good way to implement complex parts, also suggested in the Clean Code, is to first write the func to call, and then later on, implement them.
Here, I would go with something like:
def _find_by_cds_id(...)
user_cds_id = str(user.id) # to be changed to the random id
user_orcid = user.user_profile.get("orcid")
return get_by_id(user_cds_id)
try:
name = _find_by_cds_id(...)
is_unlisted = name.get(...)
has_orcid = name.get(...)
if is_unlisted and has_orcid:
promote_orcid(...) # this will unlist the cds-id name and update/create the orcid one
except ...
try:
_find_by_orcid(..)
except ...
_create_missing(...)
which is similar to yours, but with more self-explanatory names.
Wouldn't the code above avoid the double task? I can't remember why we needed to split them.
# We get the ORCID value and merge all the other values into it (Ideally there should be only 1 value apart from the ORCID) | ||
orcid_name = None | ||
other_names = [] | ||
for name_to_merge in names_to_merge.hits: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is not super clear just because there is a lot of code.
We merge when the CERN user has an ORCID in her/his profile, and the ORCID is found in the names vocab. from the ORCID dump. Basically, there is an ORCID match.
For the |
As a non admin user, see the synced value
Both values, not yet synced
As an admin, both values, the deprecated one appears greyed