
Upgrade to UUID based indexing #1

Merged: 7 commits merged into master from thet-uuidindexing on Apr 26, 2015

Conversation

@thet (Member) commented Apr 20, 2015

Upgrade to UUID based indexing instead of using the id. The id is not
unique and causes problems when multiple lineage subsites with the same id
are registered. Furthermore, the uuid can be used to retrieve the lineage
childsite object without traversing up the content tree. An upgrade step
is included.

@frisi @jensens this is a better fix than c091f96.
I think it's even safe without modifications for templates using this index, since the vocabulary's title didn't change.

@jensens (Member) commented Apr 20, 2015

+1

@claytron (Contributor)

Very nice!

@frisi (Member) commented Apr 20, 2015

Thanks for your efforts @thet.

I think the upgrade step needs to update the index and metadata for all content objects, not just the childsites.
lineage.index is also used to show which childsite an item belongs to (e.g. in folder listings). Therefore it's important that brain.childsite matches the keys in the lineage.childsites vocabulary.

To save memory we could use a generator, as is done in this migration step (I never tried out whether that really has a smaller memory footprint, though).

@thet (Member, Author) commented Apr 20, 2015

@frisi
Hm... I guess you're right regarding updating all content objects.

Regarding the second statement: brain.childsite is a uuid, and the keys in the lineage.childsites vocab are uuids too.

Regarding the third statement about reducing memory consumption: that upgrade step looks like it introduces the memory leak problem. When all objects are loaded before iterating over them, Python's garbage collector cannot free any object references.
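
To make the garbage-collection point concrete, here is a minimal, hypothetical sketch in plain Python (not the actual upgrade step; `Content`, `reindex`, `process_eagerly`, and `process_lazily` are made-up names, and the behavior assumes CPython's reference counting). A pre-built list pins every object in memory, while a generator lets each object become collectable as soon as the loop moves on:

```python
import gc
import weakref

class Content:
    """Stand-in for a content object woken up from the database."""

def reindex(obj):
    """Stand-in for obj.reindexObject()."""

def process_eagerly(n):
    # All n objects are materialized up front and held by the list.
    objects = [Content() for _ in range(n)]
    refs = [weakref.ref(o) for o in objects]
    for obj in objects:
        reindex(obj)
    # The list still references every object after the loop.
    return sum(r() is not None for r in refs)

def process_lazily(n):
    # Each object is created on demand; once the loop variable is
    # rebound, the previous object has no strong references left.
    refs = []
    for obj in (Content() for _ in range(n)):
        refs.append(weakref.ref(obj))
        reindex(obj)
    del obj  # drop the reference to the last object
    gc.collect()
    return sum(r() is not None for r in refs)

print(process_eagerly(5))  # all 5 still alive, pinned by the list
print(process_lazily(5))   # 0 alive: each was collectable after its turn
```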

@thet (Member, Author) commented Apr 20, 2015

@frisi please review again

@jensens (Member) commented Apr 21, 2015

Waking up one object inside the loop, one by one, is perfectly fine.
Creating a tuple beforehand and iterating over it is the worse practice.

@frisi (Member) commented Apr 21, 2015

Happy to see that others did not know about Python inline generators either (see the section "the details") - @thet and @jensens 😃 - just as I didn't before reviewing @hvelarde's PR for imagecropping.
So it's definitely not waking up all objects before iterating over the tuple. However, I can't tell - I did not test it yet - whether that's really more efficient than iterating over the catalog result (which should be lazy too) and waking up the object inside the for-loop. Maybe @hvelarde can shed some light on this.
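
For reference, PEP 289's evaluation order can be seen with a tiny sketch (`fake_catalog` is a made-up stand-in, not a real Plone API): the outermost iterable of a generator expression is evaluated immediately when the expression is created, but the items are only produced on demand:

```python
calls = []

def fake_catalog():
    # Stand-in for catalog(): record the call, return an iterator of "brains".
    calls.append("catalog")
    return iter(["brain-1", "brain-2"])

gen = (b.upper() for b in fake_catalog())
# fake_catalog() already ran when the expression was built...
print(calls)      # ['catalog']
# ...but no item has been consumed yet; they arrive lazily:
print(next(gen))  # 'BRAIN-1'
```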

@frisi (Member) commented Apr 21, 2015

@thet - your updated upgrade step looks fine to me now - thanks again!
The line "lineage.index is also used to show ..." was meant to explain why the index and metadata need to be updated (so they match the new vocabulary), not to point out something your PR got wrong.

@thet (Member, Author) commented Apr 21, 2015

@frisi Ha, Python inline generators are new to me. Nice!

The refactored upgrade step takes about half an hour on my project site with > 20000 content items, instead of a few seconds like before. But we definitely need to index all content items.

@frisi can we merge it? :)

@frisi (Member) commented Apr 21, 2015

+1 for merging. Maybe @hvelarde can tell us why

objects = (brain.getObject() for brain in catalog())
for obj in objects:
    obj.reindexObject()

is better than

for brain in catalog():
    brain.getObject().reindexObject()

thet added a commit that referenced this pull request Apr 26, 2015
@thet thet merged commit b49423e into master Apr 26, 2015
@thet thet deleted the thet-uuidindexing branch April 26, 2015 22:46
@hvelarde (Member)

Generators are memory efficient; I used that expression because reindexing a bunch of objects on a large database could leave your server running out of memory. I don't know if they are slower, but it's weird that this is taking so much time in your case:

https://www.python.org/dev/peps/pep-0289/

@jensens (Member) commented Apr 27, 2015

I double-checked PEP 289, and from what I can tell both::

for brain in catalog():
    brain.getObject().reindexObject()

and::

objects = (brain.getObject() for brain in catalog())
for obj in objects:
    obj.reindexObject()

should be completely the same in time and memory consumption.

In the first, catalog() itself returns lazy brains. This happens in the second too.
In the first case a brain is fetched and at the same time the object is woken up (anonymously) and immediately reindexed; since no reference is held, it will be garbage collected as soon as possible.

In the second, the wake-up happens in a generator, also one by one. The loop body runs the reindexing, then in the next cycle obj gets a new reference and the old object can be garbage collected.

Since we do not expand the result to a list and need to loop over the generator anyway, its better memory efficiency does not come into play here.

So do what you like: both implementations are the same.
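
One way to check the interleaving claim outside Zope is with an instrumented mock catalog (all names below are hypothetical stand-ins for the real brain/object API, not Plone code). Both variants wake up and reindex one object per iteration, in the same order:

```python
events = []

class Obj:
    def __init__(self, uid):
        self.uid = uid
    def reindexObject(self):
        events.append(("reindex", self.uid))

class Brain:
    def __init__(self, uid):
        self.uid = uid
    def getObject(self):
        # "Waking up" the object from the database.
        events.append(("wake", self.uid))
        return Obj(self.uid)

def catalog():
    return [Brain(i) for i in range(2)]

# Variant 1: plain loop over the brains.
for brain in catalog():
    brain.getObject().reindexObject()
variant1, events = events, []

# Variant 2: generator expression, then loop.
for obj in (brain.getObject() for brain in catalog()):
    obj.reindexObject()
variant2 = events

print(variant1)              # [('wake', 0), ('reindex', 0), ('wake', 1), ('reindex', 1)]
print(variant1 == variant2)  # True: identical interleaving, one object at a time
```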

@hvelarde (Member)

Well, I can confirm that this is not the case: I implemented this because I had issues with the p.a.imagecropping upgrade step mentioned above; without the generator I was running out of memory; after using the generator I was able to reindex all my objects with no problems.

The first case will wake up all objects in memory in the loop; the second, only one by one.

@jensens (Member) commented Apr 27, 2015

Look carefully - in both cases objects are woken up one by one, not all at once.

You created a local variable obj = brain.getObject() - maybe that is not garbage collected immediately because some circular reference was left?

Using an anonymous call helps here. So it does not matter much whether you use a generator or call brain.getObject().reindexObject() directly.

If I remember right, loop variables are garbage collected in an optimized way.

@hvelarde (Member) commented Jun 8, 2015

@jensens just to let you know that no, the memory consumption is not the same.

I made some tests, and accessing a batch of 1000 objects using the list of brains consumes around 200 MB more memory than doing it via a generator.

The gain was not as big as I expected in the first place, but there is a gain, and it can be the difference between an upgrade step that runs to completion and a process killed by a hypervisor because of excessive memory consumption.

I will share the code later.
