Better way to store formulas internally #815

igitur · 2018-04-26T15:05:04Z

Currently, formulas are stored internally as strings, either in A1 format or R1C1 format. This requires a lot of parsing, especially when ranges are copied / moved / deleted.

A better approach would be to parse the formula, when it is set, into some internal structure that can be manipulated more easily and converted back to a string when needed.

igitur · 2018-05-07T13:07:24Z

@Pankraty I notice that the current ExpressionCache looks almost like the repository pattern you created. Maybe we should convert ExpressionCache to an ExpressionRepository property that uses your pattern. Then, I think expressions should be stored in that repository, but in R1C1 format, which in most cases should lead to fewer entries (think of a column with formulas that is copied down). The question is whether we parse the expressions upon loading the file, or only when a calculation is triggered.
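A minimal sketch of why R1C1 keys lead to fewer entries, written in Python for brevity. The helper names (`to_r1c1`, `col_to_number`) are hypothetical and not part of ClosedXML, and the regex only handles simple relative references such as `A1` (no `$` anchors, sheet prefixes, or function names that end in digits):

```python
import re

# Matches simple relative A1 references such as B2 or AA10.
# Assumption: no absolute ($A$1) references and no names like LOG10.
_A1_REF = re.compile(r"\b([A-Z]{1,3})([0-9]+)\b")


def col_to_number(letters):
    """Convert a column label to its 1-based index: A -> 1, Z -> 26, AA -> 27."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def _offset(delta):
    """Render a relative offset the R1C1 way: 0 -> '', -2 -> '[-2]'."""
    return "" if delta == 0 else f"[{delta}]"


def to_r1c1(formula, row, col):
    """Rewrite relative A1 references relative to the cell at (row, col)."""
    def repl(match):
        d_row = int(match.group(2)) - row
        d_col = col_to_number(match.group(1)) - col
        return f"R{_offset(d_row)}C{_offset(d_col)}"
    return _A1_REF.sub(repl, formula)


# Column C holds A1+B1, A2+B2, ... copied down 1000 rows.
a1_formulas = {(r, 3): f"A{r}+B{r}" for r in range(1, 1001)}

# In A1 form every row has a different string; in R1C1 form they coincide.
r1c1_keys = {to_r1c1(f, r, c) for (r, c), f in a1_formulas.items()}
```

Here the 1000 distinct A1 strings collapse into a single R1C1 key, `RC[-2]+RC[-1]`, so a repository keyed on R1C1 text would hold one parsed expression for the whole column.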

Pankraty · 2018-05-07T13:30:02Z

Yes, that makes perfect sense.
I thought about using R1C1 as keys too, and this seems the right way to go. I'd try to perform parsing at workbook load time, hoping it won't be too slow. And if that is fine, we may stop storing FormulaA1 at the cell level and construct it on demand instead. This will make XLCell instances lighter (which is essential for heavy workbooks with millions of cells in use).

You mentioned once that you tried to hook XLParser up to ClosedXML. Are you still going to use it? Maybe it can make parsing formulae faster?

And maybe we can benefit from parsing multiple formulae in parallel (no guarantee, of course).

igitur · 2018-05-07T14:26:48Z

Yes, I was delaying really looking into XLParser until we released the netstandard2.0 build. Now I can continue with it again. Unfortunately, XLParser is a bit abandoned. We'll have to take that into account. Luckily its dependency, Irony, seems to have been revived. This switch isn't something we should take lightly, and I don't even know abstract syntax trees that well, but XLParser does look very powerful in terms of formula parsing.

igitur · 2018-05-07T15:03:54Z

Hmm, but Irony has split a bit. It used to be a project on Codeplex, by Roman Ivantsov, but now there are 2 forks: https://github.com/daxnet/irony and https://github.com/IronyProject/Irony . I'm trying to see if we can consolidate the efforts. See IronyProject/Irony#4

jahav · 2023-09-23T22:42:15Z

I have given it some thought and I am leaning toward not representing the formula as an AST and just parsing the formula each time, as long as I can use IAstFactory for evaluation without materializing the AST (possibly even if materialization turns out to be necessary).

Parsing one formula takes something like 2 μs with ClosedParser, so I can parse about 500'000 formulas per second on a single thread.
Thanks to the dependency tree, I don't need to reparse everything on a data change, just what is necessary.
An AST is pretty memory expensive. Each node is allocated on the heap and is at least 24 bytes (the minimum size of an object), but more likely in the 40-50 byte range on average (there might be a string, which is another object, or it holds a value...), so 4 nodes = 160-200 bytes. That starts to get expensive.
The parser has an AstFactory that basically goes through the AST nodes, so ClosedXML doesn't even need to materialize the AST for parsing or evaluation.
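To illustrate the last point, here is a minimal sketch of a factory-driven parser, in Python for brevity. It is not ClosedParser's actual `IAstFactory` interface: the grammar (numbers, `+ - * /`, parentheses) and the names `EvalFactory`, `TreeFactory`, `number`, and `binary` are assumptions made up for this example. The parser only ever calls the factory, so whether a tree exists at all depends on which factory is passed in:

```python
import re

# Token pattern for the toy grammar: numbers, operators, parentheses.
_TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class EvalFactory:
    """Hypothetical factory whose 'nodes' are already computed values."""
    def number(self, text):
        return float(text)

    def binary(self, op, left, right):
        if op == "+":
            return left + right
        if op == "-":
            return left - right
        if op == "*":
            return left * right
        return left / right


class TreeFactory:
    """Hypothetical factory that materializes a tree of tuples instead."""
    def number(self, text):
        return ("num", text)

    def binary(self, op, left, right):
        return (op, left, right)


def parse(text, factory):
    """Recursive descent; every production result comes from the factory."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():
        tok = take()
        if tok == "(":
            node = expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        return factory.number(tok)

    def term():
        node = primary()
        while peek() in ("*", "/"):
            op = take()
            node = factory.binary(op, node, primary())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = take()
            node = factory.binary(op, node, term())
        return node

    result = expr()
    if pos != len(tokens):
        raise ValueError("trailing input")
    return result
```

With `EvalFactory`, `parse("1 + 2 * 3", EvalFactory())` yields `7.0` and no node objects are allocated along the way. With `TreeFactory`, the same parser builds a tree, which is the memory cost being argued against.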

Basically, I only need to parse a formula on load, when I build the dependency tree, and during evaluation, which is limited thanks to dirty tracking. If I keep the AST, it costs memory that might or might not be used (likely won't be used). All that to avoid parsing that happens about once or zero times in the classical use case of load, change, save.
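A minimal sketch of how dirty tracking keeps the reparse set small, in Python for brevity. The `DependencyGraph` class and its methods are hypothetical and do not reflect ClosedXML's actual dependency tree; cells are plain strings here:

```python
from collections import defaultdict


class DependencyGraph:
    """Hypothetical precedent -> dependents map for formula cells."""

    def __init__(self):
        # For each cell, the set of cells whose formulas read it.
        self._dependents = defaultdict(set)

    def add_formula(self, cell, precedents):
        """Register that the formula in `cell` reads every cell in `precedents`."""
        for precedent in precedents:
            self._dependents[precedent].add(cell)

    def dirty_after_change(self, changed):
        """Return every formula cell that transitively depends on `changed`."""
        dirty, stack = set(), [changed]
        while stack:
            current = stack.pop()
            for dependent in self._dependents.get(current, ()):
                if dependent not in dirty:
                    dirty.add(dependent)
                    stack.append(dependent)
        return dirty


graph = DependencyGraph()
graph.add_formula("C1", ["A1", "B1"])  # C1 = A1 + B1
graph.add_formula("D1", ["C1"])        # D1 = C1 * 2
graph.add_formula("E1", ["B2"])        # E1 = B2

# Changing A1 dirties C1 and D1 only; E1 never needs to be reparsed.
to_reparse = graph.dirty_after_change("A1")
```

Only the cells in `to_reparse` would have their formula text parsed again during evaluation, which is why reparsing on demand stays cheap in the load, change, save scenario.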

XLParser was kind of slow, so it made sense to keep the AST. I don't think that's true anymore.

I took a sample of formulas from the Enron dataset (1000 files) and the average length of a formula is 35 chars.
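As a back-of-the-envelope check of the numbers above, scaled to a workbook with one million formulas. The workbook size, the 4 nodes per formula (borrowed from the example above, not measured), and 2 bytes per character (UTF-16, ignoring per-string object overhead) are assumptions for illustration:

```python
# Figures quoted in this comment.
PARSE_MICROSECONDS = 2          # time to parse one formula
NODE_BYTES = (40, 50)           # likely average size range of one AST node
AVG_FORMULA_CHARS = 35          # average formula length in the sample

# Assumptions for illustration only.
NODES_PER_FORMULA = 4           # illustrative node count, not a measurement
FORMULAS = 1_000_000            # hypothetical heavy workbook
BYTES_PER_CHAR = 2              # UTF-16 text, object overhead ignored

# Single-thread parse rate in formulas per second.
parse_rate = 1_000_000 // PARSE_MICROSECONDS

# Time to reparse every formula once, in seconds.
reparse_all_seconds = FORMULAS * PARSE_MICROSECONDS / 1_000_000

# Memory for keeping every AST (low, high) versus keeping only the text, in MB.
ast_mb = tuple(n * NODES_PER_FORMULA * FORMULAS / 1e6 for n in NODE_BYTES)
text_mb = AVG_FORMULA_CHARS * BYTES_PER_CHAR * FORMULAS / 1e6
```

Under these assumptions, reparsing all one million formulas costs about 2 seconds of single-thread time, while keeping every AST costs 160-200 MB against roughly 70 MB for the formula text alone (less if identical strings are shared).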

igitur added the Up for grabs label Apr 26, 2018

Pankraty added this to the v0.94 milestone May 19, 2018

Pankraty mentioned this issue Jun 1, 2018

Formulas handling redesign #893

Open

igitur removed this from the v0.94 milestone Oct 26, 2018

jahav added this to the v0.96.2 milestone Oct 8, 2022

jahav mentioned this issue Nov 30, 2022

Parallel formula evaluation #1926

Open

jahav modified the milestones: v0.100, v0.101 Dec 22, 2022

jahav mentioned this issue Dec 23, 2022

SaveAs stream over 25k rows performance issue #1838

Closed

jahav modified the milestones: v0.101, v0.102 Apr 1, 2023

jahav modified the milestones: v0.102, v0.103 Jun 27, 2023

jahav removed this from the v0.103 milestone Oct 7, 2023
