ugrep.exe - Processing large files: RAM usage and speed. #153
I found a solution using:
This is impossible. Files are not stored in RAM when searched; a sliding window over the file is used instead to keep memory usage low. So actual physical memory usage should not be high. Users reported that RAM usage is acceptable. Users also independently reported that ugrep is faster than ripgrep for common searches, see e.g. Genivia/RE-flex#91: "ugrep managed around 1850MB/s on my disk ... Ripgrep can get to 1700MB/s on my disk." Speed is related to memory usage, so I think ugrep is doing OK. But if there is a problem then I will fix it.
But in my case one process really did use all of the memory. I attached a screenshot at the link.
What is this 1K12.txt file? If this is a very large file of patterns to match, then this for sure can create a huge DFA for POSIX matching. Use ugrep option -P for Perl matching. Ripgrep also internally uses Perl matching to avoid large DFAs. The traditional grep tool uses POSIX matching, not Perl matching. Grep and ugrep construct regex DFAs internally that can grow very big depending on the regex patterns.
1K1.txt ... 1K17.txt are 17 parts of 65,000 lines each (2.5 MB), created from one large file 1K.txt (about 42 MB).
So I was right. This memory usage has to do with the fact that the pattern file is quite long and perhaps also very complex at the same time. The DFA construction for such patterns may take quite a bit of time and space, which is theoretically known (it's not an implementation limitation). This is especially true when the patterns are Unicode. To turn Unicode matching off, use ugrep option -U (no need to use -P). Use option -U when the patterns and files are just plain ASCII or bytes to match, like standard grep also expects them to be. For your case with option -P: it looks like PCRE (enabled with option -P) complains about the regex pattern being over 2 MB long. This is a PCRE error message: "regular expression is too large". Please make sure to use the latest version of ugrep, because some recent changes were made to improve option -P by avoiding unnecessary memory use (it no longer converts the patterns for Unicode matching). Also, if a pattern file contains strings, not patterns, then use option -F for string matching. In that case each line in the pattern file is considered a string pattern.
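A minimal command-line sketch of these suggestions, with placeholder file names (patterns.txt holding one pattern or string per line, input.txt as the file to search):

```bat
:: Hypothetical examples only; file names are placeholders.
:: Plain byte/ASCII string matching, no Unicode, each line of patterns.txt is a string:
ugrep -U -F -f patterns.txt input.txt
:: Perl matching with PCRE instead of the POSIX-mode DFA:
ugrep -P -f patterns.txt input.txt
```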
Hm, this is interesting. Sure, it makes perfect sense that memory use is still huge, because you show this for option -U, which may or may not help; it still produces a large DFA. I did not suggest that -U is the solution. We know that theoretically this can blow up in size; Unicode just makes it worse. Again, ugrep emulates grep POSIX matching with a DFA if option -P is not used. I'd rather not make -P the default, to keep ugrep compatible with grep as much as possible. Ripgrep and many other grep-like tools are NOT compatible with grep, nor can they be used as a replacement. The reason this issue was closed is that I was confident that option -P should work just fine to match the input with PCRE2. Note that ripgrep and other grep-like tools use PCRE or a similar regex engine. Perhaps there could be a PCRE2 configuration limit that limits the size of the pattern. However, re-reading the PCRE2 docs, it is not obvious, because for 32-bit machines an "internal linkage size" of 4 is used, which is "essentially unlimited"; see https://www.pcre.org/current/doc/html/pcre2limits.html. Why PCRE2 in your case complains about the regex pattern being too large is not yet clear to me.
I will have to look into that and report back later.
Thanks for the info. So the problem is caused by PCRE2. PCRE2 is installed with a default configuration that limits the internal code size, which in turn limits the length and complexity of a regex pattern. The PCRE2 API documentation doesn't state this, but the installation and configuration instructions mention the fact that the link size for 8-bit patterns is only 2 bytes. We need at least 3 bytes. That is the same link size as ugrep's internal DFA for POSIX matching, which has practically no limit on the pattern size, but may result in a large DFA that takes time to construct, as is very noticeable in your case (a theoretically known fact). This means that this problem with option -P for PCRE2 will be relatively easy to fix. However, ugrep can then no longer rely on an existing libpcre2 installation on a system, because that may be limiting. Instead, we need to install libpcre2 together with ugrep and configure it properly. For Windows this is not an issue, because I've built ugrep.exe from scratch with a local libpcre2 installation. I will need to rebuild ugrep.exe without this limit.
OK, I rebuilt the ugrep.exe x64 executable with a reconfigured PCRE2 library. I tested this out, but not with patterns as large as yours. Attached is the x64 version 3.3.7 of ugrep.exe with the PCRE2 update (zipped). This is a clean file, no malicious stuff, I assure you!! (Trust is hard to find these days.) Let me know what happens with ugrep option -P for Perl matching with PCRE. If this is still not enough, then there is another step I can take to increase the pattern sizes even more, but hopefully this gets you moving forward as a first step. Thanks for reporting this, because it helps to improve ugrep. The PCRE2 API documentation does not mention this limitation (it is more or less "hidden" in the installation instructions), so I was under the impression there was no pattern size limitation. Perhaps the title of this issue should be "ugrep option -P does not accept very long patterns".
Thanks. I tried it. -P now freezes altogether. I attached a small test set (pattern file plus file to check) in the archive, and two bat files for recent Grep and Ripgrep for comparison.
Thanks for the files. I will take a closer look. I'm surprised to see this. I probably overlooked something in the settings or some other detail in the internals.
There is indeed a memory use issue with the ugrep algorithm to match strings (the files contain strings, not regex patterns). This algorithm is optimized to recognize string matches and construct a pattern matcher more quickly, but the problem with your string patterns is that the algorithm is too greedy and allocates too much memory. I will address this with an update. Option -P is a different story. I'm surprised that PCRE2 doesn't appear to work well with this set of strings (as patterns).
Now that the list of new ugrep features requested by users is complete and added to the latest ugrep release (ugrep v3.6.0), I am returning to work on this excessive RAM usage issue, improving ugrep to handle long pattern files.
A quick update on my progress over the last two days. The new version of ugrep I worked on is a lot faster (2.68 sec) and uses 946 MB of RAM instead of 4.4 GB.
I still have some work to do to try to optimize this further, if possible. For comparison, the old version uses 4.4 GB and runs slower (4.89 sec).
Further improved pattern parsing speed: now down to 1.40 sec with 592 MB.
Most of the time is spent on pattern parsing and DFA opcode generation for the pattern matcher engine. This can perhaps be optimized further.
Your GNU grep search takes 2.81 seconds.
Ugrep for the same search takes 1.56 seconds.
As you point out, the older ugrep version suffers from a RAM usage slowdown, which will be fixed in the upcoming v3.7 release.
Ugrep v3.7.0 is released with lower RAM usage and significant speed improvements for long and complex patterns. See also issue #188.
Let me also suggest a more effective way to speed up your searches. Instead of using one large pattern file, which generally consumes a lot of memory with any grep tool, you can split the file up into several pattern files and then run the searches in parallel. On Windows this can be done along the lines of the sketch below.
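A minimal sketch with hypothetical file names (part1.txt and part2.txt are halves of the pattern file, 22.txt is the file to search):

```bat
:: Hypothetical sketch: launch two ugrep searches in parallel, one per pattern chunk,
:: each writing its own result file.
start "ugrep part1" cmd /c "ugrep -F -f part1.txt 22.txt > out1.txt"
start "ugrep part2" cmd /c "ugrep -F -f part2.txt 22.txt > out2.txt"
```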
Then concatenate the results, if desired, after the ugrep searches have finished:
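A sketch of that step, using the hypothetical result file names from above:

```bat
:: Combine the two result files once both searches are done.
copy /b out1.txt+out2.txt results.txt
```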
The same, but with four parallel searches:
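Again a hypothetical sketch, with the pattern file split into four chunks:

```bat
:: Four parallel ugrep searches, one per pattern chunk.
start "ugrep part1" cmd /c "ugrep -F -f part1.txt 22.txt > out1.txt"
start "ugrep part2" cmd /c "ugrep -F -f part2.txt 22.txt > out2.txt"
start "ugrep part3" cmd /c "ugrep -F -f part3.txt 22.txt > out3.txt"
start "ugrep part4" cmd /c "ugrep -F -f part4.txt 22.txt > out4.txt"
```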
This runs faster on multicore machines, but also because the complexity of the patterns is lower for each search. This generally requires less memory to run with ugrep and other grep tools. Lower memory use benefits performance by avoiding detrimental memory hierarchy effects (caches).
This stupid issue is still here. I just tried to do a popular operation: getting rid of the lines in B that are duplicates of lines in A.
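For reference, with grep-style tools this operation typically corresponds to something like the following sketch; the file names are placeholders and this is not necessarily the exact command that was run here:

```bat
:: Hypothetical sketch: keep only the lines of B.txt that do not occur verbatim
:: as whole lines in A.txt (-F fixed strings, -x whole line, -v invert match).
ugrep -F -x -v -f A.txt B.txt > B_without_A.txt
```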
No thanks, sorry, but this "workaround" proposition scares me. I have many large files on my disk. I'm not interested in fucking up my files, spending additional time, wasting effort, and creating strangely long command lines every time I want to do a simple operation like removing duplicate lines from a fu..king file, especially when I have 10x more available RAM than the 3 files take. The burden of splitting files into smaller chunks should be shifted to the program, and the splitting should be handled automatically by the program as part of the algorithm. It's a bad design flaw, or a too RAM-hungry algorithm, or a bug (whatever it's considered to be); there are too many stupid programs eating dozens of GB of available RAM and then stupidly crashing, even when given not-too-big input files and having 10x more free RAM than the size of the files.
Option -f reads the patterns from a file, and it is not a problem for ugrep to have thousands of patterns in the file for option -f. When you use option -f with a very large set of string patterns, however, memory use can grow, as discussed above. There are algorithms that are more optimal for searching many strings in the input, instead of using a tree and VM as in ugrep, which could be considered for ugrep as an enhancement. On the other hand, this won't work when we need to store a regex pattern file (instead of a set of strings).
Hello, I also made a test with rg, which has failed too. Also adding some more information to my previous comment.
Also some more cases like these.
By the way, as seen there, the rg guy is rude.
As I've mentioned, there are specific algorithms for this type of problem, i.e. searching for many strings in some text, such as Rabin-Karp and Aho-Corasick. The latter is traditionally used for fixed-string (fgrep-style) matching.
I already did that and it worked, as mentioned at the beginning of my initial comment (I used the same options as with ug and rg),
under the assumption that fgrep and grep -F are the same: https://pubs.opengroup.org/onlinepubs/7990989799/xcu/fgrep.html
I now additionally tried the legacy fgrep.exe binary that comes with the older grep.exe 2.5.4 binary.
Well, the pattern files are just too big in practice to do this fast enough and without running into memory issues with grep-like tools, which is not that surprising IMO. Even if a grep tool accepts the patterns, it will be very slow to search the large file. There is no reason to approach this problem in only one way, especially when we know that it will never be as fast as we want or need it to be. There are many other ways this can be done more efficiently. One way is to first extract the domain names from the file to search; something along the lines of the sketch below can be used.
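A minimal sketch with a simplistic domain-name pattern and placeholder file names (the original command from this comment is not preserved in the thread):

```bat
:: Hypothetical sketch: extract domain-name-like matches from 22.txt, one per line.
ugrep -o -U "([A-Za-z0-9-]+\.)+[A-Za-z]{2,}" 22.txt > domains.txt
```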
This retrieves the domain names from the file to search. This will go through all of that file.
Let's clear one thing up, regarding what I said in my earlier comments. So I jumped directly to step 2: your command prints out the duplicate lines between A and B, instead of the non-duplicate lines from B (the bigger file), but this is not what I needed, as B needs to be cleaned of domains that are already present in A. Even if probing a database is faster, I don't see how printing duplicate lines can be useful in such a case. Also, what exactly do you mean by "database"? The ugrep indexing feature (LINK) or some other database? I tried https://lucene.apache.org/ in the past, but without success.
A SQL DB or other DB would work. Well, sorry, I didn't get the details of what's in the two files. If all that you have is two sets of domain names, then your question can be reformulated as finding the intersection of the two sets. The intersection is the set of domain name matches. When the two sets are sorted by some key (e.g. lexicographically), then this is trivial to do in linear time.
Most likely. To be 100% clear and leave no doubts, here is a simplified example:
And keep in mind that the point is to not eat too much RAM and to not crash. Basically it's a task for "file compare tools". I tried WinMerge, an open-source fast file comparer: I used "full text compare" with some options ticked like "ignore end-lines" etc., but it still behaved exactly like ug and rg: it started eating RAM, then after ~1 min it took 13 GB of the 16 GB, gave an out-of-memory error, and then crash-terminated.
Hello.
I'm writing about the operation of ugrep.exe.
I have a task: a pattern file of a million lines, used to search for similar entries in another file of several million lines.
ugrep.exe copes with this well, but I have some wishes.
1. In terms of size, the two files turned out to be 2.5 MB each, but while running they use 4 GB of RAM. I had to write a bat file (as best I could, I'm not a programmer).
2. The processing speed depends on the processor frequency: at 2.5 GHz the file is checked in 12 seconds, and at 3 GHz in 8 seconds.
Also, time is spent on loading and unloading files into RAM, which greatly slows down the overall process of checking the two databases.
What the system looks like while running: https://radikal.ru/lfp/a.radikal.ru/a27/2109/3b/10a041f4a62a.jpg/htm
Bat file: http://forum.ru-board.com/topic.cgi?forum=5&topic=0602&start=1166&limit=1&m=1#1
K1.txt is the pattern file assembled from pieces; 22.txt is the file being checked, assembled from chunks.
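For reference, a minimal sketch of what such a bat file might look like (hypothetical chunk names 22_1.txt, 22_2.txt, ...; the actual script is at the forum link above):

```bat
@echo off
:: Hypothetical sketch: run ugrep with the pattern file K1.txt against each chunk
:: of the checked file and append all matches to result.txt
if exist result.txt del result.txt
for %%F in (22_*.txt) do (
  ugrep -F -f K1.txt "%%F" >> result.txt
)
```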
Wishes.
Come up with something to reduce the RAM usage.
The whole pattern file is about 100 MB, and the file being checked is 500 MB or more.
The processor is not fully loaded during operation; would a full load help to increase the speed of comparing the two databases?
Checking the two databases on NVIDIA CUDA.
For more information, it is better to contact zh_76@internet.ru
I am also ready to test ugrep.exe for free. I really liked it.
P.S. I don't know how to contact the developer of ugrep v3.3.7, Dr. Robert van Engelen, directly.