Only do a single listAll from FileSwitchDir #9666
The patch looks good to me. I still think the thing to fix would be removing the Files.isDirectory() call on each entry. Alternatively, since we are specialized here for the FSDir case, and if we don't need to filter out subdirectories, we could just implement the obvious listing (Files.list().toArray in Java 8, slightly more code in Java 7). But if this file listing is a hotspot for some reason, these changes will only make it N times faster. Instead, the code should not list files unnecessarily. I am extremely concerned that if we over-optimize here, those problems will never get fixed.
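As an illustration of the "obvious listing" mentioned above, here is a minimal Java 7 style sketch that collects entry names without calling Files.isDirectory() on each one. The class and method names (`PlainListing`, `listNames`) are made up for this example and are not part of the patch; note that subdirectories are included in the result, which is only acceptable if they do not need to be filtered out.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not code from the patch.
public final class PlainListing {
    private PlainListing() {}

    /**
     * Returns the names of all entries in dir (files and subdirectories alike)
     * using a single directory scan and no per-entry Files.isDirectory() call.
     */
    public static String[] listNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        }
        return names.toArray(new String[0]);
    }
}
```

The order of the returned names is whatever the file system yields, so callers that need a stable order would have to sort the array themselves.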
I ran a benchmark on a static directory for simplicity. The test folder has 36,400 files, which is on the high end but not unrealistic. I ran one hundred iterations of each method. The Files.isDirectory() check is really the bad guy. Still, as I said before, we are talking about something on the order of milliseconds, and we need to ensure the listing is not called unless it is needed.
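A comparison along these lines could be set up roughly as follows. This is only a sketch of the kind of measurement described, not the benchmark that was actually run: the class name `ListingBenchmark`, the two method names, and the use of System.nanoTime() are assumptions, and the timings it prints will depend heavily on the file system and OS cache.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical micro-benchmark for illustration; not the one used in this thread.
public final class ListingBenchmark {
    private ListingBenchmark() {}

    /** Lists entry names, skipping directories via a per-entry Files.isDirectory() call. */
    public static String[] listWithCheck(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (!Files.isDirectory(entry)) {
                    names.add(entry.getFileName().toString());
                }
            }
        }
        return names.toArray(new String[0]);
    }

    /** Lists entry names with no per-entry check (subdirectories are included). */
    public static String[] listWithoutCheck(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        }
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        int iterations = 100; // matches the hundred iterations mentioned above

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            listWithCheck(dir);
        }
        long withCheck = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            listWithoutCheck(dir);
        }
        long withoutCheck = System.nanoTime() - start;

        System.out.printf("with isDirectory:    %.2f ms/iteration%n", withCheck / 1e6 / iterations);
        System.out.printf("without isDirectory: %.2f ms/iteration%n", withoutCheck / 1e6 / iterations);
    }
}
```

Running it as `java ListingBenchmark /path/to/index` prints an average per-iteration time for each variant; a serious measurement would add warm-up runs or use a harness such as JMH.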
A regression was introduced in the ES stats API call somewhere between ES 1.2.x and 1.4.x that makes stats calls a lot more expensive. See elastic/elasticsearch#9681, https://groups.google.com/d/topic/elasticsearch/bOyBxgI9cMA/discussion and elastic/elasticsearch#9666, which are related issues. A ticket describing the exact issue is yet to be filed, but I've been working offline on this with @imotov. The river makes frequent calls to the stats API, which compounds the issue. Requesting only the needed thread_pool flags will help mitigate the load on the cluster while the ES team fixes the stats issue on their side.
LGTM
In elastic#6636 we switched to a default FileSwitchDirectory that made .listAll run twice on the same underlying file system directory. This fixes listAll to do a single directory listing again. Closes elastic#9666
With #6636 listAll now calls it twice (once for the mmap dir, once for niofs dir) ... I think we should fix this to be a single call?
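The idea of listing only once when both halves of the switch point at the same file system directory can be sketched as below. This is not the actual Elasticsearch or Lucene code: `PathDir` is a hypothetical stand-in for a file-system-backed directory, and the call counter exists only so the behaviour can be observed.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; the types here are not Lucene or Elasticsearch classes.
public final class SingleListing {
    private SingleListing() {}

    /** Stand-in for a file-system-backed directory that counts its listings. */
    public static final class PathDir {
        private final Path path;
        private int listCalls = 0;

        public PathDir(Path path) {
            this.path = path;
        }

        public Path path() {
            return path;
        }

        public int listCalls() {
            return listCalls;
        }

        public String[] listAll() throws IOException {
            listCalls++;
            Set<String> names = new LinkedHashSet<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(path)) {
                for (Path entry : stream) {
                    names.add(entry.getFileName().toString());
                }
            }
            return names.toArray(new String[0]);
        }
    }

    /**
     * Lists the union of both directories. When they resolve to the same path,
     * only the primary is listed, so the underlying directory is scanned once.
     */
    public static String[] listAll(PathDir primary, PathDir secondary) throws IOException {
        Path p = primary.path().toAbsolutePath().normalize();
        Path s = secondary.path().toAbsolutePath().normalize();
        if (p.equals(s)) {
            return primary.listAll();
        }
        // Different locations: fall back to merging both listings without duplicates.
        Set<String> merged = new LinkedHashSet<>();
        for (String name : primary.listAll()) {
            merged.add(name);
        }
        for (String name : secondary.listAll()) {
            merged.add(name);
        }
        return merged.toArray(new String[0]);
    }
}
```

Comparing normalized absolute paths is a simplification; it does not detect two different paths that reach the same directory through symbolic links.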