Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better way to store formulas internally #815

Open
igitur opened this issue Apr 26, 2018 · 5 comments
Open

Better way to store formulas internally #815

igitur opened this issue Apr 26, 2018 · 5 comments

Comments

@igitur
Copy link
Member

igitur commented Apr 26, 2018

Currently, formulas are stored internally as strings, either in A1 format or R1C1 format. This requires a lot of parsing, especially when ranges are copied / moved / deleted.

A better approached would be to parse the formula when it is set into some internal structure, which can be manipulated easier and converted back to string when needed.

@igitur
Copy link
Member Author

igitur commented May 7, 2018

@Pankraty I notice that the current ExpressionCache looks almost like the pattern of your repositories that you created. Maybe we should convert ExpressionCache to a property ExpressionRepository that uses your pattern. Then, I think expressions should be stored in that repository, but using the R1C1 format, which, in most cases, should lead to fewer entries (think of a column with formulas that is copied down). The question is whether we parse the expressions upon loading the file, or only when a calculation is triggered.

@Pankraty
Copy link
Member

Pankraty commented May 7, 2018

Yes, that makes perfect sense.
I thought about using R1C1 as keys too, and this seems a right way to go. I'd try to perform parsing at a workbook loading, hoping it won't be too slow. And if it is fine we may discard storing FormulaA1 on a cell level, constructing it on demand. This will make XLCell instances lighter (which is essential for heavy workbooks with millions of cells used).

You mentioned once you tried to screw XLParser to ClosedXML. Are you still going to use it? Maybe it can make parsing formulae faster?

And maybe we can benefit from parsing multiple formulae in parallel (no guarantee, of course)

@igitur
Copy link
Member Author

igitur commented May 7, 2018

Yes, I as delaying really looking into XLParser until we released the netstandard2.0 build. Now can continue it again. Unfortunately XLParser a bit abandoned. We'll have to take that into account. Luckily its dependency, Irony, seems to have been revived. This switch isn't something we should take lighly, and I don't even know abstract syntax trees that well, but XLParser does look very powerful in terms of formula parsing.

@igitur
Copy link
Member Author

igitur commented May 7, 2018

Hmm, but Irony has split a bit. It used to be a project on Codeplex, by Roman Ivantsov, but now there are 2 forks: https://github.com/daxnet/irony and https://github.com/IronyProject/Irony . I'm trying to see if we can consolidate the efforts. See IronyProject/Irony#4

@Pankraty Pankraty added this to the v0.94 milestone May 19, 2018
@igitur igitur removed this from the v0.94 milestone Oct 26, 2018
@jahav jahav added this to the v0.96.2 milestone Oct 8, 2022
@jahav jahav modified the milestones: v0.100, v0.101 Dec 22, 2022
@jahav jahav modified the milestones: v0.101, v0.102 Apr 1, 2023
@jahav jahav modified the milestones: v0.102, v0.103 Jun 27, 2023
@jahav
Copy link
Member

jahav commented Sep 23, 2023

I have given it some thought and I am leaning toward not representing formula as an AST and just parsing formula each time, as long as I can use IAstFactory for evaluation without materialization of AST (possibly even if materialization will be necessary).

  • Parsing one formula is something like 2 μs with ClosedParser = I can parse 500'000 of formulas per second for single thread.
  • Thanks to dependency tree, I don't need to reparse everything on data change, just what is necessary.
  • AST is pretty memory expensive. Each node is allocated on heap, is at least 24 bytes (minimum size of an object) but more likely ~40-50 range on average (there might be a string = another object, or it holds a value...), 4 node = 160-200 bytes. That starts to get expensive.
  • The parser has AstFactory that basically goes through AST nodes, so ClosedXML doesn't even need to materialize AST for parsing or evaluation.

Basically I only might need to parse formula when I load when I need to build dependency tree and during evaluation that is limited due to dirty tracking. I need to keep AST, which costs memory that might or might not be used (likely won't be used). All that to avoid parsing that happens about once or zero times in classical use case load, change, save.

XLParser was kind of slow and it made sense to keep AST. Don't think that it's true anymore.

I took a sample of formulas from enron dataset (1000 files) and average lengths of a formula is 35 chars.
image

@jahav jahav removed this from the v0.103 milestone Oct 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants