Activity
Neo-vortex commented on May 21, 2025
Have you tried installing it?
apt-get install libicu-dev
@zcobol
zcobol commented on May 21, 2025
@Neo-vortex The test was done on Fedora 42 (no WSL) and `libicu` was already installed. After adding `libicu-devel` it works. Thanks for the info!
However, the expectation was that all dependencies would be satisfied by the release at https://github.com/microsoft/edit/releases/download/v1.0.0/edit-1.0.0-x86_64-linux-gnu.xz
Title changed from "On Linux `Find` doesn't work" to "Linux: This operation requires the ICU library"
lhecker commented on May 21, 2025
Adding a Unicode-aware regex engine (like the `regex` crate) to the project would increase the binary size from 200KB to over 700KB. Enabling its performance features would increase it to about 1100KB. Because of this, the editor depends on your OS to provide `libicu`. In the future it would be nice to have a fallback for the search that at least works with ASCII (and matches Unicode literally, byte-by-byte).
#52 reported the same issue, but let's keep them separate and use this issue to track finding the right ICU version. For more info, see here: #52 (comment)
#52 can then be used to focus on the wrong sort order.
kasini3000 commented on May 21, 2025
Same issue with my UTF-16LE + BOM .ps1 file.
Suggest adding the following to the error message:
apt-get install libicu-dev
dnf install libicu-devel
emk2203 commented on May 21, 2025
I can confirm the issue also exists in Kubuntu 24.10 dev. Even with `libicu76` installed, an installation of `libicu-dev` is needed.
Regarding the file sizes: I was thinking about using `edit` in rescue systems, but the need to install 52.6 MB of `libicu-dev` on top of 38.7 MB of `libicu76` defeats the purpose. A 197 kB binary is meaningless if it needs 91.3 MB of libraries.
In contrast, the standard rescue system editor `nano` has a 285 kB binary and depends on `libc6` (>= 2.38), `libncursesw6` (>= 6), `libtinfo6` (>= 6).
Is it possible to get a self-contained binary which includes a regex crate? Just the instructions to build on Linux would be fine if they lead to this 1.1 MB binary including everything. That would be better than needing 91.3 MB of libraries. In the future, a version including a search fallback would be great.
lhecker commented on May 21, 2025
If anyone wants to send a PR that adds support for a different regex engine, I'll happily accept it. It has to be a compile-time feature though (i.e. a feature in `Cargo.toml`). Since the buffer is chunked, it needs to use something like the `regex-cursor` crate.
However, I'd prefer skipping that part and immediately going to the destination: We should have a fallback that works without a large regex crate. It would disable the regex button entirely (and perhaps even the whole-word button?) and perform only ASCII-case-insensitive matching (non-ASCII would get matched literally). It would be easy to build something like that with a basic Boyer–Moore search algorithm (probably needs a custom solution since the buffer is chunked).
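Editorial sketch of the fallback described above, not code from the project: a minimal ASCII-case-insensitive search using the Boyer–Moore–Horspool variant (bad-character shifts only). The function name `find_ascii_ci` is hypothetical, and the sketch assumes a single contiguous byte slice; a chunked buffer would need extra handling for matches that straddle chunk boundaries.

```rust
/// Finds the first occurrence of `needle` in `haystack`, ignoring ASCII case.
/// Non-ASCII bytes are compared literally, byte by byte.
/// Hypothetical helper: assumes a contiguous slice, not a chunked buffer.
pub fn find_ascii_ci(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let n = needle.len();
    if n == 0 {
        return Some(0);
    }
    if n > haystack.len() {
        return None;
    }

    // Bad-character shift table, indexed by the ASCII-lowercased byte.
    // Bytes that do not occur in the needle allow a full shift of `n`.
    let mut shift = [n; 256];
    for (i, &b) in needle[..n - 1].iter().enumerate() {
        shift[b.to_ascii_lowercase() as usize] = n - 1 - i;
    }

    let mut pos = 0;
    while pos + n <= haystack.len() {
        let window = &haystack[pos..pos + n];
        if window.eq_ignore_ascii_case(needle) {
            return Some(pos);
        }
        // Advance according to the last byte of the current window.
        let last = window[n - 1].to_ascii_lowercase();
        pos += shift[last as usize];
    }
    None
}
```

With this sketch, `find_ascii_ci(b"Hello World", b"WORLD")` matches at offset 6, while a needle that differs from the text only in the case of a non-ASCII letter does not match, because those bytes are compared literally.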
diabloproject commented on May 26, 2025
I was not able to replicate the issue, even in an Ubuntu Docker container, but maybe something like this will fix the "ICU not found" issue?
172.patch.zip
lhecker commented on May 27, 2025
That's the approach used by C# as far as I know and I'd prefer not adopting it. ICU releases 2 (?) major versions each year and so the list of hardcoded version numbers will quickly run out.
DHowett was prototyping an alternative approach, I believe by peering into the `/etc/ld.so.cache` IIRC.
diabloproject commented on May 27, 2025
Not every distribution/environment has it. Plus, having specific versions means that if they break the API/ABI somehow, the editor will not pick up faulty versions.
I agree that this is not the most elegant solution, but reading values that will be absent in many environments does not sound great either.
lhecker commented on May 27, 2025
Yes, it's all around bad. Still, I'd like to avoid hardcoding a list of versions if possible. We should consider it an option of last resort. Your patch, for instance, will stop working in a few months already as ICU 77.1 rolls out (it was released 2 weeks ago).
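Editorial sketch of the trade-off under discussion, not the project's loader and not the attached patch: instead of a fixed list, a probe could walk a range of major versions, newest first, and look for versioned names of the form `libicuuc.so.<major>`. The function names, the range bounds, and the directories passed in are all hypothetical. The upper bound still has to be chosen somehow, which is the maintenance concern raised above.

```rust
use std::path::{Path, PathBuf};

/// Builds versioned library names such as `libicuuc.so.76`, newest first.
/// `stem`, `oldest`, and `newest` are caller-supplied assumptions.
pub fn icu_candidates(stem: &str, oldest: u32, newest: u32) -> Vec<String> {
    (oldest..=newest)
        .rev()
        .map(|v| format!("{stem}.so.{v}"))
        .collect()
}

/// Returns the first candidate file that exists in any of `dirs`.
/// This only checks for existence; it does not load or validate the library.
pub fn probe_icu(dirs: &[&Path], stem: &str, oldest: u32, newest: u32) -> Option<PathBuf> {
    for name in icu_candidates(stem, oldest, newest) {
        for dir in dirs {
            let path = dir.join(&name);
            if path.exists() {
                return Some(path);
            }
        }
    }
    None
}
```

A caller might try something like `probe_icu(&[Path::new("/usr/lib64")], "libicuuc", 60, 90)`, though the right directories differ between distributions, which is part of why a loader-cache lookup was being considered as the primary approach.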
diabloproject commented on May 27, 2025
It should be possible to solve using a range of versions to test, or something similar. And use it as a fallback, while the primary solution could be ldconfig. Just none of the distros I use have `ld.so.cache`, so I have my concerns =/
hiareigl commented on Jun 4, 2025
Would it be possible to use something like a Rust port of Lua pattern matching instead of regular expressions?
Programming in Lua - pattern matching
If one needs more, one could use an external tool. (In case filtering or highlighting via an external command is planned.)
MrDowntempo commented on Jun 17, 2025
I'm not a Rust dev, so I can't provide the PR, but are you aware of the regex-lite crate, which aims to provide regex with a smaller impact on binary size? https://docs.rs/regex-lite/latest/regex_lite/ Still might be too heavy.
lhecker commented on Jun 17, 2025
I think adding a non-Unicode regex library would be a bad trade-off. If anything, we should consider making the `regex-cursor` crate a compile-time option. A fallback Boyer–Moore or similar searcher is still useful in case an ICU version of this project fails to load ICU for some reason.
Make the ICU SONAME configurable (#495)