Massive memory usage when creating a document with approximately 30k+ subdocuments #11541

Open
robert-nash opened this issue Mar 18, 2022 · 8 comments

@robert-nash

Do you want to request a feature or report a bug? Bug

What is the current behavior?

When saving a large document (9.66MB), I am seeing what looks like a memory leak. I initially thought this might just be a result of my document being too big, but as far as I can tell the behaviour does constitute a bug. Perhaps it would be better not to have such a large document, but I feel that what I am trying to do should work, at least at this scale.

This document contains a large array of subdocuments (low hundreds). Updating the document takes a long time (about 50 seconds for the 9.66MB document I am using as an example). During this time, memory usage as observed in the Chrome profiler jumps from around 65MB to around 500MB. The document does appear to be saved successfully; it just takes a long time and uses a lot of memory.

I am finding it a little difficult to debug this error so do let me know what information I can give to be more helpful.

I have watched the memory change over time and recorded the flow through my function and I am pretty sure that it is specifically the .save() function on a document instance which is taking ~50 seconds and that the memory increase is happening after the call to save. I do not have any hooks defined on the model.
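
An observation like this can be narrowed down by wrapping the suspect call in a small timing and heap-usage probe. The following is only a minimal sketch: measure is a hypothetical helper (not part of Mongoose), and fakeSave is a stand-in for doc.save() so the snippet runs without a database. Heap deltas are approximate because garbage collection can run at any point.

```javascript
'use strict';

// Hypothetical helper (not a Mongoose API): runs an async function and
// reports the elapsed time and the change in V8 heap usage around it.
async function measure(label, fn) {
  const heapBefore = process.memoryUsage().heapUsed;
  const start = Date.now();
  const result = await fn();
  const durationMs = Date.now() - start;
  const heapDeltaMb = (process.memoryUsage().heapUsed - heapBefore) / (1024 ** 2);
  console.log(`[${label}] ${durationMs}ms, heap delta ${heapDeltaMb.toFixed(2)}MB`);
  return { result, durationMs, heapDeltaMb };
}

// Stand-in for `doc.save()`: allocates some objects, then resolves.
async function fakeSave() {
  const items = Array.from({ length: 100000 }, (_, i) => ({ i }));
  return items.length;
}

measure('fakeSave', fakeSave).catch(err => console.error(err));
```

In a real application the call would look like await measure('save', () => doc.save()), assuming doc is a Mongoose document.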

If the current behavior is a bug, please provide the steps to reproduce.

What is the expected behavior?

Much less memory usage and quicker save.

What are the versions of Node.js, Mongoose and MongoDB you are using? Note that "latest" is not a version.
Mongoose: 6.2.7, Node v14.17.5, MongoDB: 5.0.6

@vkarpov15
Collaborator

Can you please provide an example of what your document looks like?

@robert-nash
Author

@vkarpov15 Here is a slightly simplified version of the schema (fewer properties, but the same structure), all in one place as an example.

Thanks!

import { Schema, model } from "mongoose";

const geoPointSchema = new Schema<GeoPoint>({
    type: {
        type: String,
        enum: ["Point"],
        required: true,
    },
    coordinates: {
        type: [Number],
        required: true,
    },
});

const journeySchema = new Schema<Journey>({
    status: {
        type: String,
        enum: ["available", "completed", "cancelled"],
    },
    start_point_text: String,
    start_point_coordinates: {
        type: geoPointSchema,
        index: "2dsphere",
    },
    end_point_text: String,
    end_point_coordinates: {
        type: geoPointSchema,
        index: "2dsphere",
    },
    start_time: Date,
    end_time: Date,
    // driverSchema and riderSchema are not shown in this simplified example
    driver: driverSchema,
    riders: [riderSchema]
});


const journeySummarySchema = new Schema<JourneySummary>({
    id: Schema.Types.ObjectId,
    private: Boolean,
    groups: [{ id: Schema.Types.ObjectId }],
    created: Date,
    start_time: Date,
    end_time: Date,
    start_point_coordinates: geoPointSchema,
    start_point_text: String,
    end_point_coordinates: geoPointSchema,
    end_point_text: String
});

const userSummarySchema = new Schema<UserSummary>({
    role: { type: String, enum: ["driver", "requested", "passenger", "not-part-of-journey"] },
});

const journeyEventSchema = new Schema<JourneyEvent>({
    time: Date,
    journey: journeySummarySchema,
    user: userSummarySchema,
    action: {
        type: String,
        enum: ["create", "cancel", "join"],
    },
    reconstructed: { type: Boolean, default: false },
});

const journeyStateUserSummarySchema = new Schema<JourneyStateUserSummary>({
    ...userSummarySchema.obj,
    status: { type: String, enum: ["confirmed", "refunded"] },
});

const journeyStateJourneySummarySchema = new Schema<JourneyStateJourneySummary>(
    {
        ...journeySummarySchema.obj,
        request_status: {
            type: String,
            enum: ["available", "accepted"],
        },
        status: {
            type: String,
            enum: ["available", "completed", "cancelled"],
        },
    }
);

const journeyStateSchema = new Schema<JourneyState>({
    _id: Schema.Types.ObjectId,
    journey: journeyStateJourneySummarySchema,
    complete_journey: journeySchema,
    user: journeyStateUserSummarySchema,
});

const searchEventSchema = new Schema<SearchEvent>({
    time: Date, 
    start_point_coordinates: geoPointSchema,
    end_point_coordinates: geoPointSchema,
});

const userEventLogSchema = new Schema<UserEventLog>({
    schema_version: Number,
    user: {
        id: Schema.Types.ObjectId,
    },
    events: {
        journeys: [journeyEventSchema],
        searches: [searchEventSchema],
    },
    states: {
        journeys: [journeyStateSchema],
    },
});

export default model<UserEventLog>("UserEventLog", userEventLogSchema);

@robert-nash
Author

Is there any more information I can give to help with this? I understand that as I have presented it you haven't got much to go on but I am not sure what sort of information would be helpful. Would the output of memory profiling, for example, be useful?

Thanks,

Robert

vkarpov15 added this to the 6.2.11 milestone Apr 7, 2022
@vkarpov15
Collaborator

We're working our way through this. We confirmed that the script below uses about 10x the memory of the equivalent plain JavaScript object (POJO):

'use strict';

const mongoose = require('mongoose');
const { Schema } = mongoose;

const geoPointSchema = new Schema({
    type: {
        type: String,
        enum: ["Point"],
        required: true,
    },
    coordinates: {
        type: [Number],
        required: true,
    },
});

const journeySchema = new Schema({
    status: {
        type: String,
        enum: ["available", "completed", "cancelled"],
    },
    start_point_text: String,
    start_point_coordinates: {
        type: geoPointSchema,
        index: "2dsphere",
    },
    end_point_text: String,
    end_point_coordinates: {
        type: geoPointSchema,
        index: "2dsphere",
    },
    start_time: Date,
    end_time: Date,
});


const journeySummarySchema = new Schema({
    id: Schema.Types.ObjectId,
    private: Boolean,
    groups: [{ id: Schema.Types.ObjectId }],
    created: Date,
    start_time: Date,
    end_time: Date,
    start_point_coordinates: geoPointSchema,
    start_point_text: String,
    end_point_coordinates: geoPointSchema,
    end_point_text: String
});

const userSummarySchema = new Schema({
    role: { type: String, enum: ["driver", "requested", "passenger", "not-part-of-journey"] },
});

const journeyEventSchema = new Schema({
    time: Date,
    journey: journeySummarySchema,
    user: userSummarySchema,
    action: String, /*[
        "create",
        "cancel",
        "join"
    ],*/
    reconstructed: { type: Boolean, default: false },
});

const journeyStateUserSummarySchema = new Schema({
    ...userSummarySchema.obj,
    status: String //"confirmed" | "refunded",
});

const journeyStateJourneySummarySchema = new Schema(
    {
        ...journeySummarySchema.obj,
        request_status: {
            type: String,
            enum: ["available", "accepted"],
        },
        status: {
            type: String,
            enum: ["available", "completed", "cancelled"],
        },
    }
);

const journeyStateSchema = new Schema({
    _id: Schema.Types.ObjectId,
    journey: journeyStateJourneySummarySchema,
    complete_journey: journeySchema,
    user: journeyStateUserSummarySchema,
});

const searchEventSchema = new Schema({
    time: Date, 
    start_point_coordinates: geoPointSchema,
    end_point_coordinates: geoPointSchema,
});

const userEventLogSchema = new Schema({
    schema_version: Number,
    user: {
        id: Schema.Types.ObjectId,
    },
    events: {
        journeys: [journeyEventSchema],
        searches: [searchEventSchema],
    },
    states: {
        journeys: [journeyStateSchema],
    },
});

const UserEventLog = mongoose.model('UserEventLog', userEventLogSchema);

run().catch(err => console.log(err));

async function run() {
  const doc = new UserEventLog({});
  //const doc = { events: { journeys: [], searches: [] }, states: { journeys: [] } };

  setInterval(() => {
    console.log('[Timer] Memory usage:', process.memoryUsage().heapUsed / (1024 ** 2));
  }, 2_000);

  const start = Date.now();
  for (let i = 0; i < 10000; ++i) {
    doc.events.journeys.push({
      journey: {
        created: new Date(),
        start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
      },
      user: {
        role: 'driver'
      }
    });
    doc.events.searches.push({
        start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
    });
    doc.states.journeys.push({
        journey: {
            created: new Date(),
            start_point_coordinates: { type: 'Point', coordinates: [0, 0] },
            request_status: 'available'
        },
        complete_journey: {
            status: 'available',
            start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
        },
        user: {
            role: 'driver',
            status: 'confirmed'
        }
    });
  }

  console.log('Done', Date.now() - start);
}

We don't know exactly why yet. We managed to reduce memory usage somewhat by removing calls to ownerDocument(), which avoids computing full subdocument paths, but we're still left with roughly 10x memory overhead.
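
For reference, the POJO baseline that the script hints at in its commented-out line can be run on its own, without Mongoose. This is a sketch under the assumption that the baseline pushes the same payloads into plain arrays; buildPojo is a hypothetical name, and the heap figure it prints will vary by machine and Node.js version.

```javascript
'use strict';

// Builds the same data as the repro script, but with plain objects and
// arrays instead of a Mongoose document. `buildPojo` is a hypothetical name.
function buildPojo(iterations) {
  const doc = { events: { journeys: [], searches: [] }, states: { journeys: [] } };
  for (let i = 0; i < iterations; ++i) {
    doc.events.journeys.push({
      journey: {
        created: new Date(),
        start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
      },
      user: { role: 'driver' }
    });
    doc.events.searches.push({
      start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
    });
    doc.states.journeys.push({
      journey: {
        created: new Date(),
        start_point_coordinates: { type: 'Point', coordinates: [0, 0] },
        request_status: 'available'
      },
      complete_journey: {
        status: 'available',
        start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
      },
      user: { role: 'driver', status: 'confirmed' }
    });
  }
  return doc;
}

const start = Date.now();
const doc = buildPojo(10000);
console.log('Done', Date.now() - start);
console.log('Memory usage:', process.memoryUsage().heapUsed / (1024 ** 2));
```

Comparing this figure with the output of the Mongoose version gives the overhead ratio discussed above.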

vkarpov15 added a commit that referenced this issue Apr 12, 2022
@vkarpov15
Collaborator

63af194 makes some improvements. Before:

Done 3910ms
[Timer] Memory usage: 287.39147186279297

After:

Done 3777ms
[Timer] Memory usage: 242.136474609375

Slightly better. Will keep working on a few other ideas we have to trim this down some more.

@vkarpov15
Collaborator

Another way to trim down the overhead is to get rid of defaults on your schemas. The default _id on subdocuments adds a lot of overhead in this case because there are so many subdocuments. Getting rid of array defaults also helps. With the below code:

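const mongoose = require('mongoose');

// Document arrays default to undefined instead of an empty array
mongoose.Schema.Types.DocumentArray.set('default', undefined);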

const { Schema } = mongoose;

const geoPointSchema = new Schema({
    type: {
        type: String,
        enum: ["Point"],
        required: true,
    },
    coordinates: {
        type: [Number],
        required: true,
        default: undefined
    },
}, { _id: false });

const journeySchema = new Schema({
    status: {
        type: String,
        enum: ["available", "completed", "cancelled"],
    },
    start_point_text: String,
    start_point_coordinates: {
        type: geoPointSchema,
        index: "2dsphere",
    },
    end_point_text: String,
    end_point_coordinates: {
        type: geoPointSchema,
        index: "2dsphere",
    },
    start_time: Date,
    end_time: Date,
}, { _id: false });


const journeySummarySchema = new Schema({
    id: Schema.Types.ObjectId,
    private: Boolean,
    groups: [{ id: Schema.Types.ObjectId }],
    created: Date,
    start_time: Date,
    end_time: Date,
    start_point_coordinates: geoPointSchema,
    start_point_text: String,
    end_point_coordinates: geoPointSchema,
    end_point_text: String
}, { _id: false });

const userSummarySchema = new Schema({
    role: { type: String, enum: ["driver", "requested", "passenger", "not-part-of-journey"] },
}, { _id: false });

const journeyEventSchema = new Schema({
    time: Date,
    journey: journeySummarySchema,
    user: userSummarySchema,
    action: String, /*[
        "create",
        "cancel",
        "join"
    ],*/
    reconstructed: { type: Boolean },
}, { _id: false });

const journeyStateUserSummarySchema = new Schema({
    ...userSummarySchema.obj,
    status: String //"confirmed" | "refunded",
}, { _id: false });

const journeyStateJourneySummarySchema = new Schema(
    {
        ...journeySummarySchema.obj,
        request_status: {
            type: String,
            enum: ["available", "accepted"],
        },
        status: {
            type: String,
            enum: ["available", "completed", "cancelled"],
        },
    },
    { _id: false }
);

const journeyStateSchema = new Schema({
    _id: Schema.Types.ObjectId,
    journey: journeyStateJourneySummarySchema,
    complete_journey: journeySchema,
    user: journeyStateUserSummarySchema,
}, { _id: false });

const searchEventSchema = new Schema({
    time: Date, 
    start_point_coordinates: geoPointSchema,
    end_point_coordinates: geoPointSchema,
}, { _id: false });

const userEventLogSchema = new Schema({
    schema_version: Number,
    user: {
        id: Schema.Types.ObjectId,
    },
    events: {
        journeys: [journeyEventSchema],
        searches: [searchEventSchema],
    },
    states: {
        journeys: [journeyStateSchema],
    },
});

const UserEventLog = mongoose.model('UserEventLog', userEventLogSchema);

We get to:

Done 2695ms
[Timer] Memory usage: 182.55683135986328

We'll look into making the defaults option on the model constructor apply to subdocuments. You should be able to do const doc = new UserEventLog({ }, { defaults: false }) and have that skip defaults for all subdocs, but that doesn't currently work.
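
To get a rough sense of how many subdocuments the repro script creates, and therefore how many default _id values are at stake, the payloads can be counted directly. This sketch assumes that each nested plain object in a pushed payload is cast to exactly one subdocument, which holds for the schemas in this thread; countPlainObjects is a hypothetical helper.

```javascript
'use strict';

// Counts plain objects in a payload, including the root. Arrays are walked
// but not counted themselves; Dates and primitives are ignored.
function countPlainObjects(value) {
  if (Array.isArray(value)) {
    return value.reduce((sum, item) => sum + countPlainObjects(item), 0);
  }
  if (value === null || typeof value !== 'object' || value instanceof Date) {
    return 0;
  }
  return 1 + Object.values(value).reduce((sum, v) => sum + countPlainObjects(v), 0);
}

// The three payloads pushed on each iteration of the repro script.
const journeyEvent = {
  journey: {
    created: new Date(),
    start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
  },
  user: { role: 'driver' }
};
const searchEvent = {
  start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
};
const journeyState = {
  journey: {
    created: new Date(),
    start_point_coordinates: { type: 'Point', coordinates: [0, 0] },
    request_status: 'available'
  },
  complete_journey: {
    status: 'available',
    start_point_coordinates: { type: 'Point', coordinates: [0, 0] }
  },
  user: { role: 'driver', status: 'confirmed' }
};

const perIteration = [journeyEvent, searchEvent, journeyState]
  .map(countPlainObjects)
  .reduce((a, b) => a + b, 0);

console.log('Subdocuments per iteration:', perIteration); // 12
console.log('Subdocuments after 10000 iterations:', perIteration * 10000); // 120000
```

Three of the twelve per iteration are document array elements (30,000 in total); the other nine are single nested subdocuments.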

@robert-nash
Author

Thank you for your hard work on this. I have actually moved away from this approach and ended up mirroring my data in BigQuery instead, but hopefully this will be helpful for others in the future.

vkarpov15 modified the milestones: 6.2.11, 6.2.12 Apr 13, 2022
vkarpov15 added a commit that referenced this issue Apr 19, 2022
vkarpov15 modified the milestones: 6.3.1, 6.3.2 Apr 20, 2022
vkarpov15 changed the title from "Memory Leak on .save()" to "Massive memory usage when creating a document with approximately 30k+ subdocuments" Apr 27, 2022
vkarpov15 modified the milestone: 6.3.2 May 2, 2022
@vkarpov15
Collaborator

In f1c5412 we allowed disabling _id on single nested subdocuments by default using:

mongoose.Schema.Types.Subdocument.set('_id', false);

With that line, we get:

Done 2127
[Timer] Memory usage: 143.40158081054688
[Timer] Memory usage: 143.40792846679688

Without that line:

Done 2224
[Timer] Memory usage: 173.59060668945312

The tradeoff is that geoPointSchema subdocuments, like start_point_coordinates, no longer get an _id. In this case, though, the _id isn't helpful or necessary.
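
The heap figures reported in this thread can be put side by side as reductions relative to the first measurement. This is only a rough sketch: the runs were taken at different times and possibly against different commits and schema variants, so the percentages are indicative rather than a controlled comparison. reductionPercent is a hypothetical helper.

```javascript
'use strict';

// Heap figures (MB) copied from the comments in this thread.
const baselineMb = 287.39147186279297; // before 63af194
const steps = [
  { label: 'after 63af194', heapMb: 242.136474609375 },
  { label: 'defaults and _id removed in schemas', heapMb: 182.55683135986328 },
  { label: 'Subdocument _id disabled globally (f1c5412)', heapMb: 143.40158081054688 },
];

// Relative reduction from `before` to `after`, as a percentage.
function reductionPercent(before, after) {
  return ((before - after) / before) * 100;
}

for (const { label, heapMb } of steps) {
  const pct = reductionPercent(baselineMb, heapMb).toFixed(1);
  console.log(`${label}: ${heapMb.toFixed(1)}MB, ${pct}% below the first measurement`);
}
```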

vkarpov15 added a commit that referenced this issue Jun 25, 2022
vkarpov15 added a commit that referenced this issue Jun 26, 2022
…de` on prototype to avoid unnecessary memory usage

Re: #11541
vkarpov15 added a commit that referenced this issue Jun 26, 2022
vkarpov15 added a commit that referenced this issue Jul 18, 2022
perf(document): avoid creating unnecessary empty objects when creating a state machine