Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate identical entities, sometimes #141

Closed
peter- opened this issue Jul 12, 2018 · 21 comments
Closed

Duplicate identical entities, sometimes #141

peter- opened this issue Jul 12, 2018 · 21 comments

Comments

@peter-
Copy link
Contributor

peter- commented Jul 12, 2018

Context, original description:
https://groups.google.com/forum/#!topic/pyff-users/SieEPWahb8c

This may just be one issue, or several:

  • There's non-deterministic behaviour (running pyff multiple times gives differing results)
  • There (sometimes, see above) are fully identical duplicate EntityDescriptors in the result (a single resulting EntitiesDescriptor should never contain more than one EntityDescriptor for any given entityID)
  • The duplicated EntityDescriptors come from the wrong registrar after a fork merge pipeline (or the feed config is wrong, I'm open to suggestions/corrections)

Full write-up and instructions to reproduce, including Docker container with test data:
https://gitlab.com/peter-/pyff-duplicate-merge-weirdness

@leifj
Copy link
Contributor

leifj commented Jul 14, 2018

Ok this is pretty strange. Can you try something: can you see if you can see the same effect by sticking the pipline in a pyffd and fetching the aggregate (curl) repeatedly?

@leifj
Copy link
Contributor

leifj commented Jul 14, 2018

Also - I'm the process of refactoring the index/storage component which may make it easier to debug or maybe fix the issue altogether.

@leifj
Copy link
Contributor

leifj commented Jul 14, 2018

I can reproduce this using my development branch, i.e wo your docker container. I have a theory this is related to how load works (threaded) - it "smells" like a race condition. If so it is probably fairly recent and if so the solution is probably to drill into the semantics of the merge code like you suspect.

@peter-
Copy link
Contributor Author

peter- commented Jul 17, 2018

Note that as indicated on pyff-users I can reproduce the issue also with pyff tag 0.10.0, so it may not be so recent after all:

---
total size:     5
selected:       5
          idps: 0
           sps: 5
---
---
total size:     5
selected:       4
          idps: 0
           sps: 4
---

The only difference to the provided Dockerfile is specifying the tag in the pip install command:

-      && pip install git+git://github.com/IdentityPython/pyFF.git#egg=pyFF
+      && pip install git+git://github.com/IdentityPython/pyFF.git@0.10.0#egg=pyFF

Which correctly uses the referenced tag:

$ docker run --rm -it pyff-duplicates:0.10.0 pyff --version
pyff version 0.10.0.dev0

(Update: That last line did not in fact demonstrate anything, as current HEAD from master also identifiers itself with that same version string. But the docker image has the right version, AFAICT.)

@peter-
Copy link
Contributor Author

peter- commented Aug 27, 2018

ping

@leifj
Copy link
Contributor

leifj commented Aug 27, 2018 via email

@peter-
Copy link
Contributor Author

peter- commented Aug 29, 2018

Current HEAD doesn't run with my provided demo setup at all. Calling pyff with --loglevel=DEBUG:

DEBUG:root:PL[id=test]: calling 'load' using args: ['local.xml as Local', 'remote.xml as eduGAIN'] and opts: []
DEBUG:root:load parsing 'local.xml as Local'
DEBUG:root:load parsing 'remote.xml as eduGAIN'
DEBUG:root:Refreshing all resources
DEBUG:root:GET URL file:///app/remote.xml
DEBUG:root:GET URL file:///app/local.xml
DEBUG:root:got status_code=200, encoding=utf8 from_cache=False from file:///app/remote.xml
DEBUG:root:got status_code=200, encoding=utf8 from_cache=False from file:///app/local.xml
DEBUG:root:returning 3 valid entities
DEBUG:root:returning 3 valid entities
DEBUG:root:PL[id=test]: calling 'select' using args: ["eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]", 'Local!//md:EntityDescriptor'] and opts: []
DEBUG:root:selecting using args: ["eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]", 'Local!//md:EntityDescriptor']
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pyff/md.py", line 69, in main
    plumbing(p).process(md, state={'batch': True, 'stats': {}})
  File "/usr/local/lib/python2.7/dist-packages/pyff/pipes.py", line 240, in process
    return Plumbing.Request(self, md, t, args=args, state=state, store=store).process(self)
  File "/usr/local/lib/python2.7/dist-packages/pyff/pipes.py", line 219, in process
    ot = cb(self, *opts)
  File "/usr/local/lib/python2.7/dist-packages/pyff/builtins.py", line 605, in select
    ot = entitiesdescriptor(args, name, lookup_fn=req.md.store.lookup)
  File "/usr/local/lib/python2.7/dist-packages/pyff/samlmd.py", line 220, in entitiesdescriptor
    for entity in _resolve(member, lookup_fn):
  File "/usr/local/lib/python2.7/dist-packages/pyff/samlmd.py", line 215, in _resolve
    return l_fn(member)
  File "/usr/local/lib/python2.7/dist-packages/pyff/store.py", line 504, in lookup
    return self._lookup(key)
  File "/usr/local/lib/python2.7/dist-packages/pyff/store.py", line 532, in _lookup
    return self._lookup("{%s}%s" % (m.group(1), str(m.group(2)).rstrip("/")))
  File "/usr/local/lib/python2.7/dist-packages/pyff/store.py", line 539, in _lookup
    res.update(self._get_index(m.group(1), v))
  File "/usr/local/lib/python2.7/dist-packages/pyff/store.py", line 463, in _get_index
    m = re.compile(v)
  File "/usr/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
error: unbalanced parenthesis
ERROR:root:unbalanced parenthesis
Makefile:10: recipe for target 'test.xml' failed
make: *** [test.xml] Error 255

Not sure where/why regex processing is needed here. If I simplify the feed config even further with this change:

-    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
+    - "eduGAIN!//md:EntityDescriptor"

then I get an "empty select" error instead:

DEBUG:root:PL[id=test]: calling 'load' using args: ['local.xml as Local', 'remote.xml as eduGAIN'] and opts: []
DEBUG:root:load parsing 'local.xml as Local'
DEBUG:root:load parsing 'remote.xml as eduGAIN'
DEBUG:root:Refreshing all resources
DEBUG:root:GET URL file:///app/remote.xml
DEBUG:root:GET URL file:///app/local.xml
DEBUG:root:got status_code=200, encoding=utf8 from_cache=False from file:///app/remote.xml
DEBUG:root:got status_code=200, encoding=utf8 from_cache=False from file:///app/local.xml
DEBUG:root:returning 3 valid entities
DEBUG:root:returning 3 valid entities
DEBUG:root:PL[id=test]: calling 'select' using args: ['eduGAIN!//md:EntityDescriptor', 'Local!//md:EntityDescriptor'] and opts: []
DEBUG:root:selecting using args: ['eduGAIN!//md:EntityDescriptor', 'Local!//md:EntityDescriptor']
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pyff/md.py", line 69, in main
    plumbing(p).process(md, state={'batch': True, 'stats': {}})
  File "/usr/local/lib/python2.7/dist-packages/pyff/pipes.py", line 240, in process
    return Plumbing.Request(self, md, t, args=args, state=state, store=store).process(self)
  File "/usr/local/lib/python2.7/dist-packages/pyff/pipes.py", line 219, in process
    ot = cb(self, *opts)
  File "/usr/local/lib/python2.7/dist-packages/pyff/builtins.py", line 607, in select
    raise PipeException("empty select - stop")
PipeException: empty select - stop
ERROR:root:empty select - stop
Makefile:10: recipe for target 'test.xml' failed
make: *** [test.xml] Error 255

Sounds like a separate bug at work?

@leifj
Copy link
Contributor

leifj commented Aug 29, 2018

All right I need to dig into this a bit - I've been too busy over in the ra21-l2 branch for a while now.

@leifj
Copy link
Contributor

leifj commented Jul 1, 2019

OK so I have (of course) confirmed your observation. This is pretty interesting and worth further discussion.

Peters example has a total of 5 entityIDs, one of which has a duplicate entityID (http://test-sp-1) - i.e
same entityID in two different metadata sources. The two Entities is different (in Peters example the difference is simple - just a difference in registration authority but this can manifest itself in many different ways).

I strongly suspect the underlying problem is that pyFF has no way to express priority between sources and the default merge strategy results in what amounts to random behaviour.

When you run Peters pipeline the result is randomly different depending on which source "wins" the race for that entityID.

There are several reasons why this is non-trivial

  • the pyFF load statement is not processed in sequence unless you say "load fail_on_error True". The default behaviour is to parallelize URL fetching.
  • the pyFF memory backend uses python primitives (dicts etc) which are not order-preserving
  • pyFF merges multiple sources at load-time to provide a single "view" into the entitydescriptors from all the sources in the ResourceManager.

Why did this use to work?

At some point in the past the load statement was 100% order preserving and each source was dumped into the backend as it was loaded. This was a long time ago though and not in any way compatible with having a high-performing fetch of many big resources (eg building edugain). Even so just relying on the order is a poor substitute for proper prioritization.

The way forward

The way to get back to the expected behavior is to implement a merge-strategy that allows clear expression of priority between sources - eg by looking at the registration authority - but this requires a bit of thought.

@leifj
Copy link
Contributor

leifj commented Jul 2, 2019

digging more into this I believe it is possible to emulate the old behaviour with some careful management of the python data structures involved - however this doesn't allow for very precise management of priorities between sources and entries which I believe is needed for the future

@leifj
Copy link
Contributor

leifj commented Jul 10, 2019

I believe this is now solved in HEAD

@leifj leifj closed this as completed Jul 10, 2019
@peter-
Copy link
Contributor Author

peter- commented Jul 12, 2019

however this doesn't allow for very precise management of priorities between sources and entries which I believe is needed for the future

Why simple source priority is not an edge case:

If you're producing a single, unified downstream feed for your metadata consumers that should contain the union of all your local federation entities with everything from eduGAIN (so that metadata consumers only need to configure/load a single aggregate/URL) you'll need to prevent identical entityIDs from eduGAIN overriding your own local registrations of that same entityID: The differences for a given entity between what's in eduGAIN and what's in your local federation may be sufficient to break things. So the simplest and sanest "no surprises" strategy would be to always prefer entities from one source (your own registered copy of everything) over any other copies of those same entities. An ordered list of sources, if you will.

That's slightly different and possibly significantly simpler than the model I think you proposed, which would have to be able to express source priorities on an per-entity level: Take entity A from source X, but entity B from source Y, etc.? I personally don't see a need for that more complex/expressive model, but YMMV, of course.

Also note that the current failure mode with that old config is two-fold: One is the changing source an entity may be loaded from. The other is that the number of entities in the aggregate changes because one entity is duplicated (with completely identical EntityDescriptors). So that may hint at another error or may just be a(nother) side-effect of the reliance on old implementation behaviour that not longer exits.

I'm aware this comment is coming late and I'm looking forward to testing the fix in head.

@peter-
Copy link
Contributor Author

peter- commented Sep 6, 2019

Running the provided test case with tag 1.1.1 I still have all the same issues, though now the results are always false (instead of being sometimes correct and sometimes false):

  • The resulting document test.xml contains 5 entitiy descriptors (it should only have 4),
  • There are 2 fully identical entity descriptors with entityID="http://test-sp-1" (no duplicates should ever exist),
  • The entity descriptor(s) with entityID="http://test-sp-1" come(s) from the wrong source (remote.xml, where registrationAuthority="http://OTHER.example.com" but it should be coming from local.xml with registrationAuthority="http://example.org").

I can re-run this test with current master but I don't expect anything to change.

In case the example feed configuration is no longer usable with current pyff versions -- i.e., the feed is at fault, not the software -- I'd appreciate suggestions on the proper way to achieve the desired results (as have been produced by pyff for years).

@leifj
Copy link
Contributor

leifj commented Sep 6, 2019 via email

@peter-
Copy link
Contributor Author

peter- commented Sep 6, 2019

Thanks, I'll ask on on pyff-users!

@peter-
Copy link
Contributor Author

peter- commented Sep 6, 2019

FWIW, changing the order of the loaded source documents so that my preferred source comes last seems to work around all the remaining issues, with and even without including the fork merge pipeline: At least with the simple test case this results in the expected 4 entities selected (not 5), no duplicates, and the entities come from the expected/desired source/registrationAuthority.

No idea whether this behavious should be relied upon. I'll wait for suggestions on how to rewrite the feed config on the pyff-users mailing list.

@leifj
Copy link
Contributor

leifj commented Sep 6, 2019

it should be a reliable workaround since I have made sure that loads are always processed in the order they are listed.

@peter-
Copy link
Contributor Author

peter- commented Sep 9, 2019

OK but then this issue is still open/present/unresolved:

While the following feed (still using the provided simplified test data) works correctly (and furtunately also does what I want: prefer local entities over remote ones where entityIDs overlap):

- load:
    - remote.xml as eduGAIN
    - local.xml as Local
- select:
    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats

results in (OK):

---
total size:     5
selected:       4
          idps: 0
           sps: 4
---
$ fgrep entityID test.xml | sort
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-2">
  <md:EntityDescriptor entityID="http://test-sp-3">
  <md:EntityDescriptor entityID="http://test-sp-4">

But simply reversing the order of the loaded documents (no other changes) does not work correctly as it still produces duplicate identical entities (instead of merely preferring the remote copy of a given entity over the local one):

- load:
    - local.xml as Local
    - remote.xml as eduGAIN
- select:
    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats

Output (NOT OK):

---
total size:     5
selected:       5
          idps: 0
           sps: 5
---
$ fgrep entityID test.xml | sort
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-2">
  <md:EntityDescriptor entityID="http://test-sp-3">
  <md:EntityDescriptor entityID="http://test-sp-4">

Further simplifying the test case by removing the conditions from the XPath expression in the eduGAIN select:

- load:
    - remote.xml as eduGAIN
    - local.xml as Local
- select:
    - "eduGAIN!//md:EntityDescriptor"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats

also produces duplicate entityIDs for http://test-sp-1:

---
total size:     5
selected:       6
          idps: 0
           sps: 6
---
$ fgrep entityID test.xml | sort                                                            
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-2">
  <md:EntityDescriptor entityID="http://test-sp-3">
  <md:EntityDescriptor entityID="http://test-sp-4">
  <md:EntityDescriptor entityID="http://test-sp-5">

Interestingly this then happens with either order of the two loaded documents. Possibly total size: 5, selected: 6 also hints at an issue?

Either way: Under no circumstances should entity http://test-sp-1 appear twice in the published output? Also, merely changing the order of the two loaded documents should not result in published outputs that differ in the number of entities they contain?
So it seems to me the issue is not yet resolved.

@leifj
Copy link
Contributor

leifj commented Sep 9, 2019

Yeah that is a great analysis and does point to a bug. Duplicate entries should never occur, thats for sure.

@leifj leifj reopened this Sep 9, 2019
@leifj
Copy link
Contributor

leifj commented Sep 10, 2019

can you try bbdf245

@peter-
Copy link
Contributor Author

peter- commented Sep 10, 2019

That seems to have taken care of all the issue: No more duplicates, number of selected entities is not bigger than total size, the number of entities in the published output also doesn't change depending on the order of the loaded documents.
Tack!

@peter- peter- closed this as completed Sep 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants