Duplicate identical entities, sometimes #141

peter- · 2018-07-12T19:28:42Z

Context, original description:
https://groups.google.com/forum/#!topic/pyff-users/SieEPWahb8c

This may just be one issue, or several:

There's non-deterministic behaviour (running pyff multiple times gives differing results)
There (sometimes, see above) are fully identical duplicate EntityDescriptors in the result (a single resulting EntitiesDescriptor should never contain more than one EntityDescriptor for any given entityID)
The duplicated EntityDescriptors come from the wrong registrar after a fork merge pipeline (or the feed config is wrong, I'm open to suggestions/corrections)

Full write-up and instructions to reproduce, including Docker container with test data:
https://gitlab.com/peter-/pyff-duplicate-merge-weirdness

The text was updated successfully, but these errors were encountered:

leifj · 2018-07-14T20:24:23Z

Ok this is pretty strange. Can you try something: can you see if you can see the same effect by sticking the pipline in a pyffd and fetching the aggregate (curl) repeatedly?

leifj · 2018-07-14T20:25:20Z

Also - I'm the process of refactoring the index/storage component which may make it easier to debug or maybe fix the issue altogether.

leifj · 2018-07-14T21:38:20Z

I can reproduce this using my development branch, i.e wo your docker container. I have a theory this is related to how load works (threaded) - it "smells" like a race condition. If so it is probably fairly recent and if so the solution is probably to drill into the semantics of the merge code like you suspect.

peter- · 2018-07-17T08:43:50Z

Note that as indicated on pyff-users I can reproduce the issue also with pyff tag 0.10.0, so it may not be so recent after all:

---
total size:     5
selected:       5
          idps: 0
           sps: 5
---
---
total size:     5
selected:       4
          idps: 0
           sps: 4
---

The only difference to the provided Dockerfile is specifying the tag in the pip install command:

-      && pip install git+git://github.com/IdentityPython/pyFF.git#egg=pyFF
+      && pip install git+git://github.com/IdentityPython/pyFF.git@0.10.0#egg=pyFF

Which correctly uses the referenced tag:

$ docker run --rm -it pyff-duplicates:0.10.0 pyff --version
pyff version 0.10.0.dev0

(Update: That last line did not in fact demonstrate anything, as current HEAD from master also identifiers itself with that same version string. But the docker image has the right version, AFAICT.)

peter- · 2018-08-27T18:41:41Z

ping

leifj · 2018-08-27T18:43:45Z

It might be interesting to test with new Whoosh store in HEAD. Skickat från min iPhone

…

27 aug. 2018 kl. 20:41 skrev peter ***@***.***>: ping — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

peter- · 2018-08-29T07:57:16Z

Current HEAD doesn't run with my provided demo setup at all. Calling pyff with --loglevel=DEBUG:

DEBUG:root:PL[id=test]: calling 'load' using args: ['local.xml as Local', 'remote.xml as eduGAIN'] and opts: []
DEBUG:root:load parsing 'local.xml as Local'
DEBUG:root:load parsing 'remote.xml as eduGAIN'
DEBUG:root:Refreshing all resources
DEBUG:root:GET URL file:///app/remote.xml
DEBUG:root:GET URL file:///app/local.xml
DEBUG:root:got status_code=200, encoding=utf8 from_cache=False from file:///app/remote.xml
DEBUG:root:got status_code=200, encoding=utf8 from_cache=False from file:///app/local.xml
DEBUG:root:returning 3 valid entities
DEBUG:root:returning 3 valid entities
DEBUG:root:PL[id=test]: calling 'select' using args: ["eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]", 'Local!//md:EntityDescriptor'] and opts: []
DEBUG:root:selecting using args: ["eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]", 'Local!//md:EntityDescriptor']
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pyff/md.py", line 69, in main
    plumbing(p).process(md, state={'batch': True, 'stats': {}})
  File "/usr/local/lib/python2.7/dist-packages/pyff/pipes.py", line 240, in process
    return Plumbing.Request(self, md, t, args=args, state=state, store=store).process(self)
  File "/usr/local/lib/python2.7/dist-packages/pyff/pipes.py", line 219, in process
    ot = cb(self, *opts)
  File "/usr/local/lib/python2.7/dist-packages/pyff/builtins.py", line 605, in select
    ot = entitiesdescriptor(args, name, lookup_fn=req.md.store.lookup)
  File "/usr/local/lib/python2.7/dist-packages/pyff/samlmd.py", line 220, in entitiesdescriptor
    for entity in _resolve(member, lookup_fn):
  File "/usr/local/lib/python2.7/dist-packages/pyff/samlmd.py", line 215, in _resolve
    return l_fn(member)
  File "/usr/local/lib/python2.7/dist-packages/pyff/store.py", line 504, in lookup
    return self._lookup(key)
  File "/usr/local/lib/python2.7/dist-packages/pyff/store.py", line 532, in _lookup
    return self._lookup("{%s}%s" % (m.group(1), str(m.group(2)).rstrip("/")))
  File "/usr/local/lib/python2.7/dist-packages/pyff/store.py", line 539, in _lookup
    res.update(self._get_index(m.group(1), v))
  File "/usr/local/lib/python2.7/dist-packages/pyff/store.py", line 463, in _get_index
    m = re.compile(v)
  File "/usr/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
error: unbalanced parenthesis
ERROR:root:unbalanced parenthesis
Makefile:10: recipe for target 'test.xml' failed
make: *** [test.xml] Error 255

Not sure where/why regex processing is needed here. If I simplify the feed config even further with this change:

-    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
+    - "eduGAIN!//md:EntityDescriptor"

then I get an "empty select" error instead:

DEBUG:root:PL[id=test]: calling 'load' using args: ['local.xml as Local', 'remote.xml as eduGAIN'] and opts: []
DEBUG:root:load parsing 'local.xml as Local'
DEBUG:root:load parsing 'remote.xml as eduGAIN'
DEBUG:root:Refreshing all resources
DEBUG:root:GET URL file:///app/remote.xml
DEBUG:root:GET URL file:///app/local.xml
DEBUG:root:got status_code=200, encoding=utf8 from_cache=False from file:///app/remote.xml
DEBUG:root:got status_code=200, encoding=utf8 from_cache=False from file:///app/local.xml
DEBUG:root:returning 3 valid entities
DEBUG:root:returning 3 valid entities
DEBUG:root:PL[id=test]: calling 'select' using args: ['eduGAIN!//md:EntityDescriptor', 'Local!//md:EntityDescriptor'] and opts: []
DEBUG:root:selecting using args: ['eduGAIN!//md:EntityDescriptor', 'Local!//md:EntityDescriptor']
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pyff/md.py", line 69, in main
    plumbing(p).process(md, state={'batch': True, 'stats': {}})
  File "/usr/local/lib/python2.7/dist-packages/pyff/pipes.py", line 240, in process
    return Plumbing.Request(self, md, t, args=args, state=state, store=store).process(self)
  File "/usr/local/lib/python2.7/dist-packages/pyff/pipes.py", line 219, in process
    ot = cb(self, *opts)
  File "/usr/local/lib/python2.7/dist-packages/pyff/builtins.py", line 607, in select
    raise PipeException("empty select - stop")
PipeException: empty select - stop
ERROR:root:empty select - stop
Makefile:10: recipe for target 'test.xml' failed
make: *** [test.xml] Error 255

Sounds like a separate bug at work?

leifj · 2018-08-29T21:28:07Z

All right I need to dig into this a bit - I've been too busy over in the ra21-l2 branch for a while now.

leifj · 2019-07-01T20:34:40Z

OK so I have (of course) confirmed your observation. This is pretty interesting and worth further discussion.

Peters example has a total of 5 entityIDs, one of which has a duplicate entityID (http://test-sp-1) - i.e
same entityID in two different metadata sources. The two Entities is different (in Peters example the difference is simple - just a difference in registration authority but this can manifest itself in many different ways).

I strongly suspect the underlying problem is that pyFF has no way to express priority between sources and the default merge strategy results in what amounts to random behaviour.

When you run Peters pipeline the result is randomly different depending on which source "wins" the race for that entityID.

There are several reasons why this is non-trivial

the pyFF load statement is not processed in sequence unless you say "load fail_on_error True". The default behaviour is to parallelize URL fetching.
the pyFF memory backend uses python primitives (dicts etc) which are not order-preserving
pyFF merges multiple sources at load-time to provide a single "view" into the entitydescriptors from all the sources in the ResourceManager.

Why did this use to work?

At some point in the past the load statement was 100% order preserving and each source was dumped into the backend as it was loaded. This was a long time ago though and not in any way compatible with having a high-performing fetch of many big resources (eg building edugain). Even so just relying on the order is a poor substitute for proper prioritization.

The way forward

The way to get back to the expected behavior is to implement a merge-strategy that allows clear expression of priority between sources - eg by looking at the registration authority - but this requires a bit of thought.

leifj · 2019-07-02T10:56:54Z

digging more into this I believe it is possible to emulate the old behaviour with some careful management of the python data structures involved - however this doesn't allow for very precise management of priorities between sources and entries which I believe is needed for the future

leifj · 2019-07-10T09:27:10Z

I believe this is now solved in HEAD

peter- · 2019-07-12T14:46:32Z

however this doesn't allow for very precise management of priorities between sources and entries which I believe is needed for the future

Why simple source priority is not an edge case:

If you're producing a single, unified downstream feed for your metadata consumers that should contain the union of all your local federation entities with everything from eduGAIN (so that metadata consumers only need to configure/load a single aggregate/URL) you'll need to prevent identical entityIDs from eduGAIN overriding your own local registrations of that same entityID: The differences for a given entity between what's in eduGAIN and what's in your local federation may be sufficient to break things. So the simplest and sanest "no surprises" strategy would be to always prefer entities from one source (your own registered copy of everything) over any other copies of those same entities. An ordered list of sources, if you will.

That's slightly different and possibly significantly simpler than the model I think you proposed, which would have to be able to express source priorities on an per-entity level: Take entity A from source X, but entity B from source Y, etc.? I personally don't see a need for that more complex/expressive model, but YMMV, of course.

Also note that the current failure mode with that old config is two-fold: One is the changing source an entity may be loaded from. The other is that the number of entities in the aggregate changes because one entity is duplicated (with completely identical EntityDescriptors). So that may hint at another error or may just be a(nother) side-effect of the reliance on old implementation behaviour that not longer exits.

I'm aware this comment is coming late and I'm looking forward to testing the fix in head.

peter- · 2019-09-06T10:58:41Z

Running the provided test case with tag 1.1.1 I still have all the same issues, though now the results are always false (instead of being sometimes correct and sometimes false):

The resulting document test.xml contains 5 entitiy descriptors (it should only have 4),
There are 2 fully identical entity descriptors with entityID="http://test-sp-1" (no duplicates should ever exist),
The entity descriptor(s) with entityID="http://test-sp-1" come(s) from the wrong source (remote.xml, where registrationAuthority="http://OTHER.example.com" but it should be coming from local.xml with registrationAuthority="http://example.org").

I can re-run this test with current master but I don't expect anything to change.

In case the example feed configuration is no longer usable with current pyff versions -- i.e., the feed is at fault, not the software -- I'd appreciate suggestions on the proper way to achieve the desired results (as have been produced by pyff for years).

leifj · 2019-09-06T11:01:10Z

I suspect this is the same issue that was discussed on pyff-users recently - you have to change your pipeline to use the filter primitive. Skickat från min iPhone

…

6 sep. 2019 kl. 12:58 skrev peter ***@***.***>: Running the provided test case with tag 1.1.1 I still have all the same issues, though now the results are always false (instead of being sometimes correct and sometimes false): The resulting document test.xml contains 5 entitiy descriptors (it should only have 4), There are 2 fully identical entity descriptors with entityID="http://test-sp-1" (no duplicates should ever exist), The entity descriptor(s) with entityID="http://test-sp-1" come(s) from the wrong source (remote.xml, where registrationAuthority="http://OTHER.example.com" but it should be coming from local.xml with registrationAuthority="http://example.org"). I can re-run this test with current master but I don't expect anything to change. In case the example feed configuration is no longer usable with current pyff versions -- i.e., the feed is at fault, not the software -- I'd appreciate suggestions on the proper way to achieve the desired results (as have been produced by pyff for years). — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub, or mute the thread.

peter- · 2019-09-06T11:05:03Z

Thanks, I'll ask on on pyff-users!

peter- · 2019-09-06T16:31:02Z

FWIW, changing the order of the loaded source documents so that my preferred source comes last seems to work around all the remaining issues, with and even without including the fork merge pipeline: At least with the simple test case this results in the expected 4 entities selected (not 5), no duplicates, and the entities come from the expected/desired source/registrationAuthority.

No idea whether this behavious should be relied upon. I'll wait for suggestions on how to rewrite the feed config on the pyff-users mailing list.

leifj · 2019-09-06T16:57:30Z

it should be a reliable workaround since I have made sure that loads are always processed in the order they are listed.

peter- · 2019-09-09T16:05:30Z

OK but then this issue is still open/present/unresolved:

While the following feed (still using the provided simplified test data) works correctly (and furtunately also does what I want: prefer local entities over remote ones where entityIDs overlap):

- load:
    - remote.xml as eduGAIN
    - local.xml as Local
- select:
    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats

results in (OK):

---
total size:     5
selected:       4
          idps: 0
           sps: 4
---
$ fgrep entityID test.xml | sort
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-2">
  <md:EntityDescriptor entityID="http://test-sp-3">
  <md:EntityDescriptor entityID="http://test-sp-4">

But simply reversing the order of the loaded documents (no other changes) does not work correctly as it still produces duplicate identical entities (instead of merely preferring the remote copy of a given entity over the local one):

- load:
    - local.xml as Local
    - remote.xml as eduGAIN
- select:
    - "eduGAIN!//md:EntityDescriptor[md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority and not(md:Extensions/mdrpi:RegistrationInfo/@registrationAuthority='http://example.org')]"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats

Output (NOT OK):

---
total size:     5
selected:       5
          idps: 0
           sps: 5
---
$ fgrep entityID test.xml | sort
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-2">
  <md:EntityDescriptor entityID="http://test-sp-3">
  <md:EntityDescriptor entityID="http://test-sp-4">

Further simplifying the test case by removing the conditions from the XPath expression in the eduGAIN select:

- load:
    - remote.xml as eduGAIN
    - local.xml as Local
- select:
    - "eduGAIN!//md:EntityDescriptor"
    - "Local!//md:EntityDescriptor"
- publish:
    output: test.xml
- stats

also produces duplicate entityIDs for http://test-sp-1:

---
total size:     5
selected:       6
          idps: 0
           sps: 6
---
$ fgrep entityID test.xml | sort                                                            
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-1">
  <md:EntityDescriptor entityID="http://test-sp-2">
  <md:EntityDescriptor entityID="http://test-sp-3">
  <md:EntityDescriptor entityID="http://test-sp-4">
  <md:EntityDescriptor entityID="http://test-sp-5">

Interestingly this then happens with either order of the two loaded documents. Possibly total size: 5, selected: 6 also hints at an issue?

Either way: Under no circumstances should entity http://test-sp-1 appear twice in the published output? Also, merely changing the order of the two loaded documents should not result in published outputs that differ in the number of entities they contain?
So it seems to me the issue is not yet resolved.

leifj · 2019-09-09T20:01:04Z

Yeah that is a great analysis and does point to a bug. Duplicate entries should never occur, thats for sure.

leifj · 2019-09-10T09:43:46Z

can you try bbdf245

peter- · 2019-09-10T14:47:07Z

That seems to have taken care of all the issue: No more duplicates, number of selected entities is not bigger than total size, the number of entities in the published output also doesn't change depending on the order of the loaded documents.
Tack!

leifj closed this as completed Jul 10, 2019

leifj reopened this Sep 9, 2019

peter- closed this as completed Sep 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate identical entities, sometimes #141

Duplicate identical entities, sometimes #141

peter- commented Jul 12, 2018

leifj commented Jul 14, 2018

leifj commented Jul 14, 2018

leifj commented Jul 14, 2018

peter- commented Jul 17, 2018 •

edited

peter- commented Aug 27, 2018

leifj commented Aug 27, 2018 via email

peter- commented Aug 29, 2018

leifj commented Aug 29, 2018

leifj commented Jul 1, 2019

leifj commented Jul 2, 2019

leifj commented Jul 10, 2019

peter- commented Jul 12, 2019

peter- commented Sep 6, 2019

leifj commented Sep 6, 2019 via email

peter- commented Sep 6, 2019

peter- commented Sep 6, 2019

leifj commented Sep 6, 2019

peter- commented Sep 9, 2019

leifj commented Sep 9, 2019

leifj commented Sep 10, 2019

peter- commented Sep 10, 2019

Duplicate identical entities, sometimes #141

Duplicate identical entities, sometimes #141

Comments

peter- commented Jul 12, 2018

leifj commented Jul 14, 2018

leifj commented Jul 14, 2018

leifj commented Jul 14, 2018

peter- commented Jul 17, 2018 • edited

peter- commented Aug 27, 2018

leifj commented Aug 27, 2018 via email

peter- commented Aug 29, 2018

leifj commented Aug 29, 2018

leifj commented Jul 1, 2019

leifj commented Jul 2, 2019

leifj commented Jul 10, 2019

peter- commented Jul 12, 2019

peter- commented Sep 6, 2019

leifj commented Sep 6, 2019 via email

peter- commented Sep 6, 2019

peter- commented Sep 6, 2019

leifj commented Sep 6, 2019

peter- commented Sep 9, 2019

leifj commented Sep 9, 2019

leifj commented Sep 10, 2019

peter- commented Sep 10, 2019

peter- commented Jul 17, 2018 •

edited