Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive memory usage when saving big XLWorkbook #264

Closed
igitur opened this issue Apr 7, 2017 · 19 comments
Closed

Excessive memory usage when saving big XLWorkbook #264

igitur opened this issue Apr 7, 2017 · 19 comments
Labels
enhancement Feature already exists, but should be enahanced.
Milestone

Comments

@igitur
Copy link
Member

igitur commented Apr 7, 2017

Continued from #86

For large workbooks, ClosedXML consumes are large amount of memory and sometimes this leads to OutOfMemoryException.

This is a known issue. For every cell that is used or addressed a full object graph containing the cell styles, formatting, etc is created. This is quite expensive.

@ricardok1
Copy link

@igitur the problem is the same as #86 because it's the cell repository that became huge - the suggestion about the cell styles as reference may apply to other properties as well. I used ClrProfiler and confirmed.
I think that even empty cells are being addressed - If manageable equal cells may be reference each one only - more complex I know.

@igitur
Copy link
Member Author

igitur commented Apr 27, 2017

I copy my comment at #273 (comment) about the challenges of using a style repository:

For this issue, yes, a central style repository would be one solution, as people have pointed out many times. However, this would require a significant change to the ClosedXML API, which I must thoroughly investigate.

Take this sample code:

using (var wb = new XLWorkbook())
using (var ws = wb.Worksheets.Add("Sheet1"))
{
    var a1 = ws.Cell("A1");
    var a2 = ws.Cell("A2");

    c1.Style.Fill.BackgroundColor = XLColor.Red;
    c2.Style.Fill.BackgroundColor = XLColor.Blue;
}

With the current codebase, cell C1 would indeed be red and C2 would be blue, as is expected.

Now let's imagine we implement a style repository. That means c1.Style points to some object in a shared repository. c2.Style also points to an object in the repository. When you open a new spreadsheet, c1 and c2 would both have the default styles (i.e. no background colour). Both cells would have the same Style instance, or in other words Object.ReferenceEquals(c1.Style, c2.Style) would evaluate to true. That means, in the above code sample, when you set c2's background colour to XLColor.Blue, you would be changing this single instance's colour. Because c1 also uses that style, both C1 and C2 would have a blue background. This is unintuitive and would break a lot of existing codebases. I doubt I would allow that.

If we use a style repository, we would need some hack to do an implicit/invisible c2.Style = new Style() first when one of the Style instance's child properties is changed. I've been looking for solutions, but haven't found a pattern that solves this yet. If someone has a practical application (i.e. real code) that can solve this, please help.

@vbjay
Copy link
Contributor

vbjay commented Apr 27, 2017 via email

@igitur
Copy link
Member Author

igitur commented Apr 27, 2017

I don't see how that link addresses the issue I mentioned above.

@vbjay
Copy link
Contributor

vbjay commented Apr 27, 2017 via email

@igitur
Copy link
Member Author

igitur commented Apr 27, 2017

Yes, I did read it. It still doesn't address my issue: a style repository doesn't work with the existing API. If you change the subproperty of a shared style, you'll change all cells that use that shared style. If you disagree, post a proof of concept piece of code.

@vbjay
Copy link
Contributor

vbjay commented Apr 27, 2017 via email

@vbjay
Copy link
Contributor

vbjay commented Apr 27, 2017 via email

@ricardok1
Copy link

@vbjay @igitur that was an example that we discussed but I understand if implementation could be a much simpler with just cell default and cell style. These are picked from a style repository (cell default pre-populated). The cell object just have a reference to the repository. When a style is set, the repository is updated or add a new style (using a hash as key of the repository).

@PhilPJL
Copy link

PhilPJL commented Jul 27, 2017

I know this would be a big change (maybe V2), but you could use a factory method approach with immutable styles. So in your example above, something like

c1.Style = Styles.Create( XLColor.Red, ... );
c2.Style = Styles.Modify( c1.Style, XLColor.Blue );
--

or 
var style = Styles.MakeMutable(c1.Style);
style.Fill.BackgroundColor = XLColor.Red;
c2.Style = Styles.MakeImmutable(style); 

Also I note that your implementations of GetHashCode() on XLStyle etc are suspect because they reference non-readonly properties.

@ghronkrepps
Copy link

Anything I can test or examples I can provide for this issue?
I'm currently trying to create and save large workbooks and experiencing an out of memory error.

The workbooks are approximately 8mb in size when saved as xlsx files and contain 50,000 rows with 53 columns.

To add the data to the sheet I am using:

ws.Cell(1, 1).InsertData(data.AsEnumerable());

This causes the memory to spike up to 600 MB before finishing, and then:

wb.SaveAs(reportFilePathAndName);

causes an additional spike that keeps increasing to 2GB + until I receive an OutOfMemory exception.

@igitur
Copy link
Member Author

igitur commented Oct 25, 2017

@ghronkrepps Have you tried disabling workbook events?

@ghronkrepps
Copy link

@igitur Yes I have that option set when I create the workbook. I have not seen a difference in memory consumption with or without it.

@ghronkrepps
Copy link

I noticed that I don't receive the out of memory exception until the line executed after saving. So I was able to force a garbage collection

wb.SaveAs
GC.Collect()

And was able to continue with no problems.

@smizer
Copy link

smizer commented Nov 13, 2017

@igitur

If we use a style repository, we would need some hack to do an implicit/invisible c2.Style = new Style() first when one of the Style instance's child properties is changed. I've been looking for solutions, but haven't found a pattern that solves this yet. If someone has a practical application (i.e. real code) that can solve this, please help.

Have you looked at the Flyweight Pattern?

@Pankraty
Copy link
Member

Pankraty commented Jan 8, 2018

I think I found the way how to improve the situation with styles handling. I manage to implement POC where XLFont instances are interned. I tested it against the file attached to #607, and had the following results:

  • The solution did not affect performance - opening file took about 5 sec with both interned and non-interned fonts. Though after a complete implementation of styles interning I expect some progress here.
  • In original version, once the file is opened there are 67304 instances of XLFont that allocate 5384560 bytes
  • In a version with interning there are still 67304 instances of XLFont, but their size reduced to 2153824 bytes. Again, as these instances are referenced by XLStyle instances, their number will decrease hugely after interning is fully implemented. In addition, there are 52 instances of XLFontInterned (unique fonts) allocating 3744 bytes.
  • For consumers, API is not modified. Cells (actually, styles) with the same font share the same XLFontInterned, but once a font of any of cells changes (i.e. set to bold) it does not affect other cells.

The underlying idea is the following:

  • Define an abstract StyleRepository that can store reusable styles, and a set of concrete repositories for each aspect (font, borders, colors, styles, etc.). In POC I used hash codes to serve as keys in storage, but this must be modified (BTW, do you know that current implementation of GetHashCode that uses XOR is quite inefficient? For example, Italic and Bold will have exactly the same hash code. Guess, this may be reported as a separate bug). Instances of fonts are stored via WeakReference so they could be GC-ed when they are not in use anymore. Also, in POC the repository is instantiated statically, but I suppose it would better be defined in a scope of a workbook, so that storage would be cleared on workbook closing.
  • XLFont, XLStyle and other classes keep implementing their interfaces (IXLFont, IXLStyle and so on), but internally they store a reference to the interned version (XLFontInterned, XLStyleInterned...); their getters return corresponding values for interned instances, and setters use Mutate method.
  • Mutate method clones an existing interned instance, modifies one of its properties (for example, set Bold to true) and put it into the repository. If such instance (having the same identity) already exists there the existing instance will be returned - that's how the same style is shared among cells. At this point, I don't like that we create a new instance for each mutation even if it is not needed just to get its identity, but these instances will only be presented in Gen.0, thus they will be immediately GC-ed. Anyway, there is some room for improvement here too.

So, currently, we have 1000 XLCell-s, they reference 1000 XLStyle-s, 1000 XLBorder-s, 1000 XLFont-s, 1000 XLAlignment-s, 1000 XLFill-s and so on.
With the proposed approach there would be still 1000 XLCell-s, 1000 XLStyle-s (with the smaller size than now), and, say 50 XLStyleInterned, 50 XLBorder-s, 10 XLBorderInterned, 50 XLFont-s, 7 XLFontInterned... Particular numbers will surely vary, but anyway memory consumption will reduce dramatically.

You can look to the code here Pankraty@374d105. It is not for merging, of course, but if you guys approve the approach I can proceed with the implementation.

@ricardok1
Copy link

@igitur @Pankraty any development on this? looks that this is the solution and the right pattern,

@igitur
Copy link
Member Author

igitur commented Apr 26, 2018

https://www.nuget.org/packages/ClosedXML/0.93.0-beta2 has been released and should address this issue. Closing this issue. Please address other concerns in new issues.

@igitur igitur closed this as completed Apr 26, 2018
@igitur igitur added this to the v0.93 milestone Apr 26, 2018
@igitur igitur added the enhancement Feature already exists, but should be enahanced. label Apr 26, 2018
@Pitterling
Copy link

@igitur @Pankraty - just to confirm, my example which was also the input for #86 consumes tremendous less memory

v0.90    -   6 GiB - Saving Workbook 52s
v0.93-b2 - 1.2 GiB - Saving Workbook 26s

GREAT WORK

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature already exists, but should be enahanced.
Projects
None yet
Development

No branches or pull requests

8 participants