Duplicate identical entities, sometimes #141
Comments
OK, this is pretty strange. Can you try something: check whether you see the same effect when you run the pipeline in pyffd and fetch the aggregate repeatedly (e.g. with curl)? |
Also, I'm in the process of refactoring the index/storage component, which may make it easier to debug or may even fix the issue altogether. |
I can reproduce this using my development branch, i.e. without your Docker container. I have a theory that this is related to how load works (threaded): it "smells" like a race condition. If so, it is probably fairly recent, and the solution is probably to drill into the semantics of the merge code, as you suspect. |
Note that as indicated on pyff-users I can reproduce the issue also with pyff tag
The only difference to the provided Dockerfile is specifying the tag in the pip install command:
-  && pip install git+git://github.com/IdentityPython/pyFF.git#egg=pyFF
+  && pip install git+git://github.com/IdentityPython/pyFF.git@0.10.0#egg=pyFF
Which correctly uses the referenced tag:
$ docker run --rm -it pyff-duplicates:0.10.0 pyff --version
pyff version 0.10.0.dev0
(Update: That last line did not in fact demonstrate anything, as current HEAD from master also identifies itself with that same version string. But the Docker image has the right version, AFAICT.) |
ping |
It might be interesting to test with new Whoosh store in HEAD.
Sent from my iPhone
… On 27 Aug 2018 at 20:41, peter ***@***.*** wrote:
ping
|
Current HEAD doesn't run with my provided demo setup at all. Calling pyff with
Not sure where/why regex processing is needed here. If I simplify the feed config even further with this change:
-    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
+    - "eduGAIN!//md:EntityDescriptor"
then I get an "empty select" error instead:
Sounds like a separate bug at work? |
All right I need to dig into this a bit - I've been too busy over in the ra21-l2 branch for a while now. |
OK, so I have (of course) confirmed your observation. This is pretty interesting and worth further discussion. Peter's example has a total of 5 entityIDs, one of which (http://test-sp-1) appears in both sources. I strongly suspect the underlying problem is that pyFF has no way to express priority between sources, and the default merge strategy results in what amounts to random behaviour: when you run Peter's pipeline, the result differs randomly depending on which source "wins" the race for that entityID. There are several reasons why this is non-trivial
Why did this use to work?
At some point in the past the load statement was 100% order-preserving and each source was dumped into the backend as it was loaded. This was a long time ago, though, and not in any way compatible with a high-performing fetch of many big resources (e.g. building eduGAIN). Even so, just relying on the order is a poor substitute for proper prioritization.
The way forward
The way to get back to the expected behaviour is to implement a merge strategy that allows clear expression of priority between sources, e.g. by looking at the registration authority, but this requires a bit of thought. |
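To illustrate the kind of race described above, here is a minimal sketch. It does not use pyFF's actual loader or store; the source names, delays, and the "last writer wins" dictionary are assumptions made up for the example. It contrasts merging results in completion order (timing-dependent) with merging them in the order the sources are listed (stable, even though the fetches still run in parallel).

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for two metadata sources: (name, simulated fetch
# delay in seconds, {entityID: registrationAuthority}).
SOURCES = [
    ("local.xml", 0.3, {"http://test-sp-1": "http://example.org"}),
    ("remote.xml", 0.0, {"http://test-sp-1": "http://OTHER.example.com"}),
]


def fetch(source):
    """Pretend to download and parse a source; the sleep simulates latency."""
    _name, delay, entities = source
    time.sleep(delay)
    return entities


def merge_as_completed(sources):
    """Last writer wins, in whatever order the fetches happen to finish."""
    store = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [pool.submit(fetch, s) for s in sources]
        for future in as_completed(futures):
            store.update(future.result())
    return store


def merge_in_listed_order(sources):
    """Fetch in parallel, but apply the results in the listed order."""
    store = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # Executor.map() yields results in input order, not completion order.
        for entities in pool.map(fetch, sources):
            store.update(entities)
    return store
```

With the first function, whichever source finishes last supplies the surviving copy of http://test-sp-1, so the outcome changes with network timing. With the second, the last listed source always wins, independent of timing.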
Digging more into this, I believe it is possible to emulate the old behaviour with some careful management of the Python data structures involved. However, this doesn't allow for very precise management of priorities between sources and entries, which I believe is needed for the future. |
I believe this is now solved in HEAD |
Why simple source priority is not an edge case: If you're producing a single, unified downstream feed for your metadata consumers that should contain the union of all your local federation entities with everything from eduGAIN (so that metadata consumers only need to configure/load a single aggregate/URL), you'll need to prevent identical entityIDs from eduGAIN overriding your own local registrations of that same entityID: the differences for a given entity between what's in eduGAIN and what's in your local federation may be sufficient to break things.
So the simplest and sanest "no surprises" strategy would be to always prefer entities from one source (your own registered copy of everything) over any other copies of those same entities. An ordered list of sources, if you will. That's slightly different and possibly significantly simpler than the model I think you proposed, which would have to be able to express source priorities on a per-entity level: take entity A from source X, but entity B from source Y, etc.? I personally don't see a need for that more complex/expressive model, but YMMV, of course.
Also note that the current failure mode with that old config is two-fold: One is the changing source an entity may be loaded from. The other is that the number of entities in the aggregate changes because one entity is duplicated (with completely identical EntityDescriptors). So that may hint at another error or may just be a(nother) side-effect of the reliance on old implementation behaviour that no longer exists.
I'm aware this comment is coming late and I'm looking forward to testing the fix in HEAD. |
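The "ordered list of sources" idea can be sketched in a few lines. This is only an illustration of the strategy, not pyFF code: the function name, the data layout (plain dictionaries keyed by entityID), and the sample entities are all assumptions.

```python
def merge_by_source_priority(sources):
    """Merge sources so that the first source listing an entityID wins.

    `sources` is a list of (source_name, {entityID: descriptor}) tuples,
    ordered from highest to lowest priority (hypothetical layout).
    Returns (merged, origin), where origin maps entityID to source name.
    """
    merged = {}
    origin = {}
    for name, entities in sources:
        for entity_id, descriptor in entities.items():
            if entity_id in merged:
                # A higher-priority source already provided this entity.
                continue
            merged[entity_id] = descriptor
            origin[entity_id] = name
    return merged, origin


# Made-up sample data: the local federation's own copy should win.
local = {
    "http://test-sp-1": "<local copy>",
    "http://test-sp-2": "<local copy>",
}
edugain = {
    "http://test-sp-1": "<eduGAIN copy>",
    "http://test-sp-3": "<eduGAIN copy>",
    "http://test-sp-4": "<eduGAIN copy>",
}

merged, origin = merge_by_source_priority([("Local", local), ("eduGAIN", edugain)])
for entity_id in sorted(merged):
    print(entity_id, "from", origin[entity_id])
```

Because each entityID is a dictionary key, the result cannot contain duplicates, and the source of every entity is determined solely by the order of the list.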
Running the provided test case with tag
I can re-run this test with current master but I don't expect anything to change. In case the example feed configuration is no longer usable with current pyff versions -- i.e., the feed is at fault, not the software -- I'd appreciate suggestions on the proper way to achieve the desired results (as have been produced by pyff for years). |
I suspect this is the same issue that was discussed on pyff-users recently - you have to change your pipeline to use the filter primitive.
Sent from my iPhone
… On 6 Sep 2019 at 12:58, peter ***@***.*** wrote:
Running the provided test case with tag 1.1.1 I still have all the same issues, though now the results are always false (instead of being sometimes correct and sometimes false):
The resulting document test.xml contains 5 entity descriptors (it should only have 4),
There are 2 fully identical entity descriptors with entityID="http://test-sp-1" (no duplicates should ever exist),
The entity descriptor(s) with entityID="http://test-sp-1" come(s) from the wrong source (remote.xml, where registrationAuthority="http://OTHER.example.com" but it should be coming from local.xml with registrationAuthority="http://example.org").
|
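To check which source an entity in the published aggregate came from, the registrationAuthority of each EntityDescriptor can be listed. The following is a rough sketch using only the Python standard library; the function name and the inline sample document are made up for illustration, while the two namespace URIs are the standard SAML metadata and MDRPI namespaces.

```python
import xml.etree.ElementTree as ET

NS = {
    "md": "urn:oasis:names:tc:SAML:2.0:metadata",
    "mdrpi": "urn:oasis:names:tc:SAML:metadata:rpi",
}

# Made-up sample standing in for a published aggregate such as test.xml.
SAMPLE = """\
<md:EntitiesDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
                       xmlns:mdrpi="urn:oasis:names:tc:SAML:metadata:rpi">
  <md:EntityDescriptor entityID="http://test-sp-1">
    <md:Extensions>
      <mdrpi:RegistrationInfo registrationAuthority="http://OTHER.example.com"/>
    </md:Extensions>
  </md:EntityDescriptor>
  <md:EntityDescriptor entityID="http://test-sp-2">
    <md:Extensions>
      <mdrpi:RegistrationInfo registrationAuthority="http://example.org"/>
    </md:Extensions>
  </md:EntityDescriptor>
</md:EntitiesDescriptor>
"""


def registration_authorities(xml_text):
    """Map each entityID to the list of registrationAuthority values found.

    A list with more than one item means the entityID occurs more than once;
    None marks an EntityDescriptor without mdrpi:RegistrationInfo.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for entity in root.iter("{%s}EntityDescriptor" % NS["md"]):
        info = entity.find("md:Extensions/mdrpi:RegistrationInfo", NS)
        authority = info.get("registrationAuthority") if info is not None else None
        result.setdefault(entity.get("entityID"), []).append(authority)
    return result


for entity_id, authorities in sorted(registration_authorities(SAMPLE).items()):
    print(entity_id, authorities)
```

For a real run, the contents of the published file would be passed in place of SAMPLE.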
Thanks, I'll ask on pyff-users! |
FWIW, changing the order of the No idea whether this behaviour should be relied upon. I'll wait for suggestions on how to rewrite the feed config on the pyff-users mailing list. |
It should be a reliable workaround, since I have made sure that loads are always processed in the order they are listed. |
OK, but then this issue is still open/present/unresolved: While the following feed (still using the provided simplified test data) works correctly (and fortunately also does what I want: prefer local entities over remote ones where entityIDs overlap):
- load:
    - remote.xml as eduGAIN
    - local.xml as Local
- select:
    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats
results in (OK):
---
total size: 5
selected: 4
idps: 0
sps: 4
---
$ fgrep entityID test.xml | sort
<md:EntityDescriptor entityID="http://test-sp-1">
<md:EntityDescriptor entityID="http://test-sp-2">
<md:EntityDescriptor entityID="http://test-sp-3">
<md:EntityDescriptor entityID="http://test-sp-4">
But simply reversing the order of the
- load:
    - local.xml as Local
    - remote.xml as eduGAIN
- select:
    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats
Output (NOT OK):
---
total size: 5
selected: 5
idps: 0
sps: 5
---
$ fgrep entityID test.xml | sort
<md:EntityDescriptor entityID="http://test-sp-1">
<md:EntityDescriptor entityID="http://test-sp-1">
<md:EntityDescriptor entityID="http://test-sp-2">
<md:EntityDescriptor entityID="http://test-sp-3">
<md:EntityDescriptor entityID="http://test-sp-4">
Further simplifying the test case by removing the conditions from the XPath expression in the
- load:
    - remote.xml as eduGAIN
    - local.xml as Local
- select:
    - "eduGAIN!//md:EntityDescriptor"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats
also produces duplicate entityIDs for
---
total size: 5
selected: 6
idps: 0
sps: 6
---
$ fgrep entityID test.xml | sort
<md:EntityDescriptor entityID="http://test-sp-1">
<md:EntityDescriptor entityID="http://test-sp-1">
<md:EntityDescriptor entityID="http://test-sp-2">
<md:EntityDescriptor entityID="http://test-sp-3">
<md:EntityDescriptor entityID="http://test-sp-4">
<md:EntityDescriptor entityID="http://test-sp-5">
Interestingly this then happens with either order of the two
Either way: Under no circumstances should entity |
Yeah, that is a great analysis and it does point to a bug. Duplicate entries should never occur, that's for sure. |
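The "no duplicates, ever" invariant is easy to check mechanically, as a stricter alternative to eyeballing the output of fgrep and sort. A small sketch using only the standard library follows; the function name is hypothetical, and the inline document merely stands in for a published aggregate.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MD = "{urn:oasis:names:tc:SAML:2.0:metadata}"


def duplicate_entity_ids(xml_text):
    """Return {entityID: count} for every entityID that occurs more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        entity.get("entityID") for entity in root.iter(MD + "EntityDescriptor")
    )
    return {entity_id: n for entity_id, n in counts.items() if n > 1}


# Made-up aggregate with one duplicated entityID.
aggregate = """\
<md:EntitiesDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata">
  <md:EntityDescriptor entityID="http://test-sp-1"/>
  <md:EntityDescriptor entityID="http://test-sp-1"/>
  <md:EntityDescriptor entityID="http://test-sp-2"/>
</md:EntitiesDescriptor>
"""

duplicates = duplicate_entity_ids(aggregate)
if duplicates:
    print("duplicate entityIDs:", duplicates)
```

An empty result means every entityID in the aggregate is unique, which makes the check suitable as a pass/fail step after publishing.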
Can you try bbdf245? |
That seems to have taken care of all the issues: No more duplicates, number of |
Context, original description:
https://groups.google.com/forum/#!topic/pyff-users/SieEPWahb8c
This may just be one issue, or several:
Duplicate EntityDescriptors in the result (a single resulting EntitiesDescriptor should never contain more than one EntityDescriptor for any given entityID)
EntityDescriptors come from the wrong registrar after a fork merge pipeline (or the feed config is wrong, I'm open to suggestions/corrections)
Full write-up and instructions to reproduce, including Docker container with test data:
https://gitlab.com/peter-/pyff-duplicate-merge-weirdness