Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splitting varinfo.jl #118

Closed
mohamed82008 opened this issue May 13, 2020 · 9 comments
Closed

Splitting varinfo.jl #118

mohamed82008 opened this issue May 13, 2020 · 9 comments

Comments

@mohamed82008
Copy link
Member

I am really really in favor of splitting the gigantic varinfo.jl file into multiple files. It is currently doing a whole lot of things revolving around VarInfo but I think it can be managed better by splitting it into multiple files, most of which are relatively self-contained and do specific tasks.

For instance, the getindex, setindex! and push! code is almost 400 lines of code that all attempt to define these 3 functions only. Everything else is helper code. If we put these 400 lines in indexing.jl, we know that if we suspect a bug has something to do with indexing, that it's probably somewhere in this 400 line file which is a relatively self-contained rather than in a 1100 loc file. That makes development and debugging VarInfo-related code easier at least for me. Even if a bug has something to do with an interaction of 2 files. It is easier for me to inspect the 2 files simultaneously and understand what's happening by switching windows or tabs, than it would be if I had to scroll up then scroll down again repeatedly.

What do you all think?

@devmotion
Copy link
Member

I don't mind if it's one large file or multiple small files as long as the implementation is clearly and reasonably structured. Apart from that, I'd say 400 loc for these three methods seems a bit much and might indicate that we should try to simplify the current implementation, regardless of how we organize it 😄

@mohamed82008
Copy link
Member Author

mohamed82008 commented May 13, 2020

Well, we have r = vi[vn], r = vi[spl], vi[vn] = r, vi[spl] = r and push!, each defined for UntypedVarInfo and TypedVarInfo. So that's about 10 different implementations which is an average 40 loc per implementation (edit: including docstrings and multi-line function headers) so it's not as bad as it sounds :)

@xukai92
Copy link
Member

xukai92 commented May 13, 2020

In general I also prefer "multiple small files" than "single large file". I belive splittng them could also pave our way to simplify the code if possible as David suggested.

@cpfiffer
Copy link
Member

I am ambivalent. Either file structure is fine, I suspect mostly because I don't tend to develop a lot in DynamicPPL.

I can say there is a lot of code here and I would probably want to split them up if they were in one of the packages I work more in.

@yebai
Copy link
Member

yebai commented May 13, 2020

I don't feel it splitting the files will necessarily simplify the code. Personally, I like to keep related code in one file, organised by sections. This makes it easy to read the code, and easy to track changes IMO. This also makes the test/ folder easier to read. If you look at the Julia standard library and base library, they are also using this type of organizing principle.

More importantly, we want to have a readable, modular, well-tested DSL implementation. We need to keep in mind there are only a few people who can understand (let alone modify) all code in Turing. It's important to bear in mind what we want to support directly, and indirectly, e.g by providing more documentation and advanced examples. Moving forward, I would gradually prioritise readability and performance over adding new features upon requests.

@mohamed82008
Copy link
Member Author

I don't feel it splitting the files will necessarily simplify the code.

It doesn't simplify the code but it simplifies the development and debugging process. It is easier to draw a mental model of which files use which other files than it is to draw the same model with sections in a file. Switching windows or tabs is much easier and faster than scrolling 100s of locs to see another function you suspect has something to do with an error. Tests can be also organized into smaller files encouraging more granularity and exhaustiveness. We can add longer and more detailed docstrings without risking the file growing to be many 1000s of locs.

Now let's take some examples from stdlib to see how they differ. This huge file https://github.com/JuliaLang/julia/blob/master/stdlib/SparseArrays/src/linalg.jl defines linalg functions for SparseArray. But more than 90% of the file defines multiplication and division so it makes little sense to split it further. This file https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/cholesky.jl defines the Cholesky decomposition and a bunch of linalg functions on Cholesky such as logdet and lowrankupdate. This file is also much easier to read than varinfo.jl because it has a clear flow: 1) the definition of Cholesky, 2) the factorization function, 3) the division/solve function, and 4) a few mathematical functions such as logdet, inv and lowrankupdate. None of these huge files even come close to the complexity of function call patterns in varinfo.jl so I think using the same model for varinfo.jl is not the best choice.

Moving forward, I would gradually prioritise readability and performance over adding new features upon requests.

I find smaller self-contained files more readable than a large file which defines many things that call each other.

@yebai
Copy link
Member

yebai commented May 13, 2020

I don’t disagree with that small, self-contained files are more readable. I would be happy to read those files in fact. What worries me is that, splitting VarInfo doesn’t lead to self-contained files at the moment. Instead, it leads to complex dependencies between small files, which is hard to understand and hard to track for people not super familiar about the code base.

I think we can and should gradually simplify the code in VarInfo, before deciding whether it should be split into many files. If we can archive the self-contained principle, i.e., minimal clear dependency between files, then I am not against splitting VarInfo.

@xukai92
Copy link
Member

xukai92 commented May 13, 2020

It doesn't simplify the code but it simplifies the development and debugging process. It is easier to draw a mental model of which files use which other files than it is to draw the same model with sections in a file. Switching windows or tabs is much easier and faster than scrolling 100s of locs to see another function you suspect has something to do with an error. Tests can be also organized into smaller files encouraging more granularity and exhaustiveness. We can add longer and more detailed docstrings without risking the file growing to be many 1000s of locs.

I completely agree.

I don’t disagree with that small, self-contained files are more readable. I would be happy to read those files in fact. What worries me is that, splitting VarInfo doesn’t lead to self-contained files at the moment. Instead, it leads to complex dependencies between small files, which is hard to understand and hard to track for people not super familiar about the code base.

I don't think it will cause real dependency issues. I believe what Mohamed wants is simply

include("file1.jl")
include("indexing.jl")
include("file3.jl")

where those codes can also be seperated by sections in Hong's suggestion, i.e.

###
# Section 1
###

...

###
# Indexing
###

...

###
# Section 3
###

What comes to end is what Mohamed said "Switching windows or tabs is much easier and faster than scrolling 100s of locs to see another function you suspect has something to do with an error." and I completely agree with it from my own experience. Modern text editors has no support to switch between sections we define but do have support to switch files we seperate.

@mohamed82008
Copy link
Member Author

Code complexity also has a lot to do with the impression. I think function call patterns are complex in varinfo.jl if you consider all of them at the same time. But looking at the 400 locs of indexing and pushing, you realize that this is only 40 locs per implementation on average including docstrings and multi-line function headers. Can it be improved? Maybe. If we can improve it, can we do it without changing any other file? Most likely yes if we don't change the underlying data structure. Is it worth the time/effort? Well, spending a few weeks to go from 40 locs per implementation to 30 may or may not be the best investment of time. I will leave that for you to judge.

Instead, it leads to complex dependencies between small files, which is hard to understand and hard to track for people not super familiar about the code base.

I think it has been established that the single file varinfo.jl is hard to understand by people who are not super familiar with the code base. In fact, it is hard to understand by people who wrote the code and left it for a few months without working on it. I think it is worth giving the smaller files a try and see if it's actually less intimidating or not.

I think we can and should gradually simplify the code in VarInfo, before deciding whether it should be split into many files. If we can archive the self-contained principle, i.e., minimal clear dependency between files, then I am not against splitting VarInfo.

Well, there is a chicken and egg problem here. Splitting the files encourages us to make them more self-contained. But without making them self-contained, splitting into multiple files doesn't really help. I feel like at least for indexing and linking, we can have 2 self-contained files that define 3-4 functions at most. The remaining getters and setters are used all over the place so in the reorganization PR, I just threw them all in the kitchen sink file utils.jl. Breaking down utils.jl into self-contained files can be the next step if needed. But most of the time, I find myself tinkering with indexing or linking functions and not the other getters and setters. So maybe utils.jl can be left alone but each function can be well documented. So you see, dividing the file into files makes us think about modularity more seriously. Different sections in a file...not so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants