
Mongoengine is very slow on large documents compared to native pymongo usage #1230

Open
baruchoxman opened this issue Feb 7, 2016 · 47 comments


@baruchoxman

(See also this StackOverflow question)

I have the following mongoengine model:

class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    data_dict_2 = DictField(required=True)

In some cases the document in the DB can be very large (around 5-10MB), and the data_dict fields contain complex nested documents (dict of lists of dicts, etc...).

I have encountered two (possibly related) issues:

  1. When I run a native pymongo find_one() query, it returns within a second. When I run MyModel.objects.first(), it takes 5-10 seconds.
  2. When I query a single large document from the DB, and then access its field, it takes 10-20 seconds just to do the following:
    m = MyModel.objects.first()
    val = m.data_dict_1.get(some_key)

The data in the object does not contain any references to any other objects, so it is not an issue of objects dereferencing.
I suspect it is related to some inefficiency in mongoengine's internal data representation, which affects document object construction as well as field access.
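For reproduction, a minimal timing harness along these lines (a sketch; it assumes the MyModel class above, a local mongod, the default my_model collection name, and a placeholder some_key):

import time

import pymongo
from mongoengine import connect

connect("mydb")  # assumes the large documents are already inserted

# Native pymongo: returns within about a second.
collection = pymongo.MongoClient().mydb.my_model
start = time.time()
raw = collection.find_one()
print("pymongo find_one() took %.2fs" % (time.time() - start))

# MongoEngine: the same read plus one field access takes far longer.
start = time.time()
m = MyModel.objects.first()
val = m.data_dict_1.get("some_key")
print("MyModel.objects.first() took %.2fs" % (time.time() - start))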

@touilleMan (Member)

Hi,

Have you profiled the execution with profile/cProfile? A graph of it with objgraph should give us a better view of where the trouble is.
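For example, a quick capture along these lines (a sketch; it assumes the model and connection from the report above):

import cProfile
import pstats

def load_one():
    m = MyModel.objects.first()
    return m.data_dict_1

cProfile.run("load_one()", "mongoengine.prof")

# Sort by cumulative time to surface hotspots such as DictField.to_python.
pstats.Stats("mongoengine.prof").sort_stats("cumulative").print_stats(20)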

@baruchoxman (Author)

Hi,

Please see the following chart: http://i.stack.imgur.com/qAb0t.png (attached in the answer to my StackOverflow question) - it shows that the bottleneck is in "DictField.to_python" (being called 600,000 times).

@thedrow (Contributor)

thedrow commented Mar 10, 2016

Have you tried using MongoEngine with PyPy? Most of the overhead is gone because of PyPy's JIT.
But you're right. We need a better benchmark suite.

@lafrech lafrech changed the title Mongoengine is very slow on large documents comapred to native pymongo usage Mongoengine is very slow on large documents compared to native pymongo usage Mar 22, 2016
@amcgregor (Contributor)

The entire approach of eager conversion is potentially fundamentally flawed. Lazy conversion on first access would defer all conversion overhead to the point where the structure is actually accessed, eliminating it entirely when no access is made. (Beyond that, such a situation indicates proper use of .only() and related helpers is warranted; see the sketch below.)
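For illustration, the workaround available today: restricting projection with .only()/.exclude() so the large fields are neither fetched nor converted (a sketch, using the field names from the original report):

# Only the date field is fetched and converted; the large DictFields are
# skipped entirely, avoiding the bulk of the to_python overhead.
m = MyModel.objects.only("date").first()

# Alternatively, exclude just the heaviest field.
m = MyModel.objects.exclude("data_dict_2").first()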

@amcgregor (Contributor)

Duplicate of #1137?


@apolkosnik

I've played a bit with a snippet, with some modifications (https://gist.github.com/apolkosnik/e94327e92dd2e0642e2b263efd87d1b1), and ran it against MongoEngine 0.8.8 and 0.11:

Please see the pictures...
On 0.8.8:
mongoengine with dict took 16.95s
[profiling graph: viz_dict_me0_8_8]

On 0.11:
mongoengine with dict took 32.74s
[profiling graph: viz_dict_me0_11_0]

@apolkosnik

It looks like some change from 0.8.8 to 0.9+ caused get() in the ComplexBaseField class to go on a dereference spree for dicts.

@sauravshah

@wojcikstefan first of all, thank you for your contributions to MongoEngine.

We are using Mongoengine heavily in production and running into this issue. Is this something you are actively looking into?

@touilleMan (Member)

@sauravshah I started investigating this issue and plan to release a fix for it.

If you cannot wait, the trouble is in the Document initialization, where a class is created and then instantiated. By replacing self._data = SemiStrictDict.create(allowed_keys=self._fields_ordered)() with a simple self._data = {}, I get a 30% boost on my entire application (not a microbenchmark).

It is the same for StrictDict, but that is not as simple to fix (the StrictDict subclass should be generated in the metaclass defining the Document). However, I didn't see where it is really used.

There are two other performance issues that could hit you:

#1446 (PyMongo 3.4 doesn't have ensure_index, so a create_index request is actually sent to the MongoDB server before every save done by MongoEngine). The solution is to handle index creation manually and disable it in MongoEngine with meta = {..., 'auto_create_index': False}; a sketch follows below.

#298 (accessing a reference field causes the referenced document to be fetched from the database). This is a real issue if you only wanted to access its _id field, which was already known in the first place... I'm working on a fix for this, but it is a really complicated issue given that early dereferencing is at the core of mongoengine :(
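A sketch of the #1446 workaround described above, applied to the reporter's model (note the meta keys are strings):

class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    data_dict_2 = DictField(required=True)

    meta = {
        "auto_create_index": False,  # no create_index call before every save
        "indexes": ["date"],
    }

# Create the declared indexes once (e.g. at deploy time) instead:
MyModel.ensure_indexes()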

@sauravshah

Thanks, @touilleMan - that helps a bit. I looked at closeio's fork of the project and they seem to have some good ideas.

Thank you for pointing out the other issues; they have already started hitting us in production :). I'm excited to know that you are on it, and really looking forward to a performant mongoengine.

Please let us know if you figure out any more quick wins in the meantime! Also, let me know if I can help in any way.

@sauravshah

@touilleMan were you able to fix these?

@touilleMan (Member)

touilleMan commented Aug 24, 2017

@sauravshah sorry, I had a branch ready but forgot to open a PR; here it is: #1630. Can you have a look at it?

Considering #298, I've tried numerous methods to create lazy referencing, but it involves too much magic (typically, when there is inheritance within the referenced class, you can't know before dereferencing what the type of the returned instance will be).
So in the end, I will try to provide a new type of field, LazyReferenceField, which would return a Reference class instance, allowing access to the pk or a call to fetch() to get back the actual document. But this means one would have to rework one's code to make use of this feature :-(
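What using such a field would look like, per the proposal (a sketch; the Author/Book models are illustrative, and LazyReferenceField landed later as #1690):

from mongoengine import Document, LazyReferenceField, StringField

class Author(Document):
    name = StringField()

class Book(Document):
    author = LazyReferenceField(Author)

book = Book.objects.first()
author_id = book.author.pk    # the pk is already known; no extra query
author = book.author.fetch()  # explicit dereference: one query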

@sauravshah

@touilleMan #1630 looks good to me.

Reg. #298, is it possible to take the class to be referenced as a kwarg on ReferenceField and solve this issue? Calling .fetch would be too much rework in most cases (including ours). Also, how would you solve the referenced-class issue in .fetch?

@touilleMan (Member)

is it possible to take the class to be referenced as a kwarg on ReferenceField and solve this issue?

Not sure what you mean...
We could add __getattr__/__setattr__ methods to the LazyReference, which would dereference the document when it is accessed or modified.
This way you wouldn't have to change your code, except where you use isinstance, and this should greatly lower the amount of rework needed ;-)

@sauravshah

Why can't we follow this approach?

class A(Document):
  b = ReferenceField(B)

When A is loaded, we already have B's id, so an instance of class B can be created with just the id (with a flag on the class to denote it has not been loaded yet). isinstance would work correctly in this case.

Once __getattr__/__setattr__ is called, a query to the DB could load the actual mongo document.
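A minimal sketch of that idea (a hypothetical proxy, not MongoEngine code): the pk is served without a query, and the first real attribute access loads the document:

class LazyDocumentProxy:
    """Stand-in for a referenced document, holding only its class and id."""

    def __init__(self, document_cls, pk):
        self._document_cls = document_cls
        self._pk = pk
        self._instance = None

    @property
    def pk(self):
        return self._pk  # already known from the DBRef; no query

    def _fetch(self):
        if self._instance is None:
            # First real access: a single query loads the actual document.
            self._instance = self._document_cls.objects.get(pk=self._pk)
        return self._instance

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself,
        # i.e. the referenced document's own fields.
        return getattr(self._fetch(), name)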

@touilleMan (Member)

The trouble is that B may have child classes:

class B(Document):
    meta = {'allow_inheritance': True}


class Bchild(B):
    pass


class A(Document):
    b = ReferenceField(B)


b_child = Bchild().save()
a = A(b=b_child).save()

In this example you cannot know that a.b is a Bchild instance before dereferencing it.

@sauravshah

Ah ok, I understand the problem now. This is not a big deal for us (and, I would assume, for most projects).

For backward compatibility, is it possible to add a kwarg to LazyReference (maybe ignore_inheritance) and make isinstance work when that kwarg is present?

isinstance is used all over the place in django-mongoengine, so it would be great not to dereference on it.

@amcgregor (Contributor)

As an interesting idea: it can know what it references prior to dereferencing if the _cls reference is stored in the DBRef (concrete; technically allowed via **kwargs inclusion in the resulting SON), or if it is stored similarly to a CachedReferenceField that incorporates that value.

@benjhastings

Does anyone know if there is a patch in the works for this issue?

@touilleMan (Member)

@benjhastings #1690 is the solution, but it requires some changes to your code (switching from ReferenceField to LazyReferenceField).

@benjhastings

@touilleMan (Member)

@benjhastings If your perf trouble comes from a too-big document... well, there is nothing that can save you right now :-(
I guess the DictField could be improved (or a RawDictField created) to do no deserialization at all on the data.
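A sketch of that RawDictField idea (hypothetical; not part of MongoEngine, and whether it bypasses every ComplexBaseField code path would need verifying):

from mongoengine.fields import DictField

class RawDictField(DictField):
    """Hypothetical field that skips per-element (de)serialization."""

    def to_python(self, value):
        return value  # no recursive casting of nested dicts/lists

    def to_mongo(self, value, *args, **kwargs):
        return value  # caller must supply BSON-safe types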

@amcgregor (Contributor)

amcgregor commented Jul 4, 2018

I have written an alternative Document container type which internally preserves the MongoDB-native value types rather than Python typecast values, casts only on actual attribute access (get/set) of the desired field, and is directly usable with PyMongo base APIs as if it were a dictionary; no conversion on use. There is no eager bulk conversion of records' values as they stream in (a large part of the overhead) and similarly no eager dereferencing (additional independent read queries for each record loaded), going with a hardline interpretation of "explicit is better than implicit". Relevant links:

  • Document (MutableMapping proxy to an ordered dictionary)
  • Container (underlying declarative base class)
  • Reference (Field subclass)

Use of a Reference(Foo, concrete=True, cache=['_cls']) would store an import path reference (e.g. "myapp.model:Foo") within the DBRef. (If Foo is a model that allows subclassing; typically by inheriting the Derived trait which defines and automates the calculation of a _cls field import reference.)

@shr00mie

...well... I just got annoyed with mongoengine enough to google what's what and find this... great.

This should be on current versions of pymongo and mongoengine, per pip install -U.

here's my output a la @apolkosnik:
dict: [profiling graph: viz_dict]

embed: [profiling graph: viz_embed]

console:
pymongo with dict took 0.06s
pymongo with embed took 0.06s
mongoengine with dict took 16.72s
mongoengine with embed took 0.74s
mongoengine with dict as_pymongo() took 0.06s
mongoengine with embed as_pymongo() took 0.06s
mongoengine aggregation with dict took 0.11s
mongoengine aggregation with embed took 0.11s

if DictField is the issue, then please, for the love of all that is holy, let us know what to change it to, or fix it. Watching mongo and pymongo respond almost immediately and then waiting close to a minute for mongoengine to... do whatever it's doing... is kind of a massive bottleneck. I dig the rest of the package, but if this can't be resolved on the package side...
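The as_pymongo() numbers above point at one workaround for read-heavy paths: skipping Document construction entirely and working with plain dicts (a sketch, using the field names from the original report):

# Raw dicts: no Document construction, no recursive to_python.
for raw in MyModel.objects.as_pymongo():
    value = raw["data_dict_1"].get("some_key")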

@shr00mie

shr00mie commented Aug 2, 2018

cricket...cricket...

oh look at that. pymodm. and to porting we go.

@nickfaughey

Just hit this bottleneck inserting a couple of ~5MB documents. Pretty much a deal breaker: an insert that takes less than a second with pymongo takes over a minute with MongoEngine.

@shr00mie

shr00mie commented Apr 2, 2019

@nickfaughey I switched to pymodm. It took very little, if any, modification of my existing code, and it is lightning fast. And it's by MongoDB, so development is ongoing.

@Cayke

Cayke commented Apr 2, 2019

pymodm has a similar syntax and is much faster. You should try it.
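For comparison, the model from the original report translated to pymodm (a sketch; untested, and the connection URI is a placeholder):

from pymodm import MongoModel, connect, fields

connect("mongodb://localhost:27017/mydb")

class MyModel(MongoModel):
    date = fields.DateTimeField(required=True)
    data_dict_1 = fields.DictField(required=False)
    data_dict_2 = fields.DictField(required=True)

m = MyModel.objects.first()  # field conversion happens lazily, on access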

@amcgregor (Contributor)

amcgregor commented Apr 4, 2019

As some absolutely direct <ShamelessSelfPromotion>: I'd like to point out again that I also offer an alternative, directly designed to alleviate some of the issues with ME I've encountered or submitted but never had corrected. E.g.: promotion/demotion; the distinction between embedded and top-level documents (there should be none: allow the embedding of top-level documents, with collection and active-record behaviour isolated and optional); lazy conversion (not eager, let alone eager sub-findMany and conversion of References, or worse, Lists of References…); minimal interposing (I don't track document dirty state); inline comparison generating filter documents (an alternative to parametric querying, which is… limiting); extremely rich and expressive allowable type conversions across most field types (ObjectId ~= datetime, but also anything date-like, such as timedelta); and 99.27% test coverage, 100% if you ignore two codepaths rarely hit (unless you dir() or star-import specific modules…). My package even has an opinion on how one should store localized data, something a naive approach harshly penalizes. (Naive being {"en": "English Text", "fr": "Text Francois", …}; don't do that.)

Marrow Mongo (see also: WIP documentation manual)

Using the parametric helpers, the syntax is nearly identical to MongoEngine, even using most of the same operator prefixes and suffixes so as to maintain that compatibility:

q1 = F(Foo, age__gt=30)  # {'age': {'$gt': 30}}
q2 = (Foo.age > 30)  # {'age': {'$gt': 30}}
q3 = F(Foo, not__age__gt=30)  # {'age': {'$not': {'$gt': 30}}}
q4 = F(Foo, attribute__name__exists=False)  # {'attribute.name': {'$exists': False}}

Combinable using the & and | operators. There are much more interesting things you can do, though. (Direct iteration of filter sets is currently planned.)

# Iterate all threads created or replied to within the last 7 days.
for record in (Thread.id | Thread.reply.id) >= -timedelta(days=7):
    ...

@nickfaughey

Sweet, I'll check these two out. In the meantime I've literally just bypassed MongoEngine for these large documents and access mongo directly with PyMongo, but it would be nice to keep an ODM there for schema sanity.

@shr00mie

shr00mie commented Apr 5, 2019

@nickfaughey you didn't even need to go that far. Pymodm has pretty much the same ODM syntax as mongoengine. Literally has ODM in the name. 😉

@amcgregor (Contributor)

amcgregor commented Apr 5, 2019

I'd love to formalize this benchmark set (akin to the template engines' "bigtable" test) and add more contenders to it. The code below the file of older results demonstrates, effectively side by side, identical solutions.

This is a more direct comparison of querying, specifically and in isolation, with the note that as_query is entirely unnecessary on the MM side; just pass find_{one/many} the Filter instance, as it's natively a suitable mapping. (Oh, and ME appears to be unable to "continue" from a "compiled" query, e.g. reconstitute the rich Q object from a plain dict; at least it couldn't when I made that comparison.)

@olka

olka commented Nov 27, 2019

Same here: the to_python call produces 70% of the overhead.
[profiling graph: to_python]

@pikeas

pikeas commented Oct 10, 2020

Is this still an issue? I'm using Pymodm and would prefer switching to MongoEngine as a more popular ODM, but poor large object performance would be a deal breaker.

@amcgregor (Contributor)

amcgregor commented Oct 11, 2020

@pikeas Yes, with some variance for some additional optimization here and further complication over there… the underlying mechanism remains "eager": that is, upon retrieval of a record, MongoEngine recursively casts all elements of the document it can to native Python types via repeated to_python invocation.

This contrasts with my own DAO's approach (if I'm going to be fixing everything, I might as well start from scratch), which is purely lazy: transformers to "cast" (or just generally process) MongoDB values to native values are executed on attribute access, and bypassed by dictionary dereferencing. The Document class' equivalent from_mongo factory class method only performs the outermost Document object lookup and wrapping. Mine was written after many years of MongoEngine use and frustration with the lack of progress on numerous fronts. Parts are still enjoyably crazy, but at least I can very exactly explain the "crazy" in mine. 😉
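To make that contrast concrete, a minimal illustrative sketch of the lazy pattern (not Marrow Mongo's actual code): keep the raw BSON mapping, and cast per field only on attribute access:

import datetime

class LazyField:
    """Descriptor that casts the stored MongoDB value on attribute access."""

    def __init__(self, name, cast):
        self.name = name
        self.cast = cast

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return self.cast(instance._raw[self.name])  # cast on access only

class LazyDocument:
    created = LazyField("created", datetime.datetime.fromtimestamp)
    tags = LazyField("tags", tuple)

    @classmethod
    def from_mongo(cls, raw):
        doc = cls.__new__(cls)  # outermost wrapping only; no eager casting
        doc._raw = raw          # keep the BSON document as-is
        return doc

doc = LazyDocument.from_mongo({"created": 1600000000, "tags": ["a", "b"]})
doc.tags  # ('a', 'b') -- converted here, not at load time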

Edited to add: note that a double-underscore-prefixed (dunderscore / soft private) initializer argument is available to disable eager conversion. The underlying machinery iteratively utilizes both explicit to_python invocation and indirect invocation via setters (L125), which doesn't make it much easier to follow. 🙁

Using my silly simple benchmark, the latest deserialization numbers:

  • MongoEngine 0.20.0 0.4516570568084717s (5× longer)
  • Marrow Mongo 2.0 (next) 0.08598494529724121s

Admittedly, other areas differ in the opposite direction. Unsaved instance construction is faster under MongoEngine:

  • MongoEngine: 0.03315997123718262s
  • Marrow Mongo: 0.26718783378601074s (8×; Marrow Mongo shifts most of the responsibility to the initializer, zero work at save-time: no waiting until save for a validation error, for example; make the assignment, you get your ValueError. The Document instance is directly usable with native PyMongo APIs as a dictionary.)

@bagerard (Collaborator)

Note that I'll probably start experimenting soon with lazy loading of attributes in mongoengine (i.e. deferring the Python deserialization until the attribute is actually accessed, as is done in pymodm).

@amcgregor (Contributor)

The initial lazy version I completed myself 5 years ago — with minor deficits later corrected, e.g. from_mongo of an already cast Document. I hope you don't mind that I didn't wait.

@bagerard (Collaborator)

bagerard commented Oct 12, 2020

You do whatever you want in your own project :) I'll make sure to check how you dealt with that (compared to pymodm) when I work on it, out of curiosity. I understand the reasons that made you move away from mongoengine, but I would appreciate it if we could keep the discussions in the mongoengine project constructive.

@amcgregor (Contributor)

amcgregor commented Oct 13, 2020

@bagerard As the tag on my comments identifies, I'm a past direct code contributor.

I understand the reasons that made you move away from mongoengine

marrow/contentment#12 — I indexed my issues for handy reference. A number date back to 2013: almost 8 years with little to no progress. Progress on some, of course! #1136 (a regression in limit use in 0.10.0) was a milestone that gave me the needed kick to get going on mine.

Many others documented there were simply never engineered to be problems in the first place, e.g. by removal of a "Document" v. "DynamicDocument" differentiation, clear segregation and isolation of "active collection"-like behavior, and explicit avoidance of the self-saving (change tracking) "active record" pattern, no global registry demanding unique model class names, plus virtually no form of implicit caching or automatic (edit: eager/recursive) casting/conversion. No result set middleware, or connection management middleware, and so on. The Document instances are usable anywhere a dictionary is, within plain PyMongo APIs, with zero use-time work. (And near-zero effort to encapsulate raw PyMongo result set records.) It's, unfortunately, almost exactly the opposite of MongoEngine. 😕

@bagerard (Collaborator)

I've seen your post many times, along with all the references it created in our tickets; I didn't need you to elaborate. If you don't like MongoEngine and gave up on improving it, that's OK, but if we could keep the discussions in the MongoEngine project (and thus its issues) focused on actually improving MongoEngine, that would be more helpful.

@bkntr

bkntr commented Dec 7, 2020

Is there a plan to add lazy init to MongoEngine any time soon?

@neilsh

neilsh commented Jan 17, 2022

Any news on this, or ways others can help?

@cundi

cundi commented Mar 1, 2023

PyMODM is better than this lib. Does anyone know of a lib like it?

@amcgregor (Contributor)

amcgregor commented Nov 10, 2023

PyMODM is better than this lib. Does anyone know of a lib like it?

I… somewhat hate to do this, but given how stale this issue (and numerous others; see my previous comments) has become, I wrote a replacement for myself. I haven't really gone out of my way to advertise it, but it follows some of the suggestions given here re: lazy conversion. (The Document and Query classes act like plain, PyMongo-compatible dictionaries/mappings, always storing MongoDB-safe types internally, thus no conversion on final use.) It is rather stable, extremely well unit tested, and in active use.

It does not, however, reimplement the full, nested change-tracking active record approach, nor automatic foreign record lookup. (Those are also performance and memory utilization problems.) It uses a "layer your needs" approach. The base Document class is essentially just a fancy, declaratively defined dictionary with validation and typecasting. Mix-ins such as Queryable (a type of Collection of Identified records) add in bind (to connect a PyMongo DB or collection), find, insert_one, &c. methods. There are also data-defining mix-ins beyond Identified, such as Localized or Published. .get() or .first() are replaced by class dereferencing on bound classes: User[identifier]

To allow multiple installed packages to cooperate in defining fields or traits, entry_points-based pseudo-namespaces can be imported from: marrow.mongo.document, marrow.mongo.field, and marrow.mongo.trait. Register an entry_point in the appropriate namespace, and your custom document, field, or trait can be imported from there, too. (Making conflicts explicit.) There is a minimal beginning of proper documentation.
