-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path3589-Make-hex-search-work-for-binary-data.patch
59 lines (55 loc) · 2.44 KB
/
3589-Make-hex-search-work-for-binary-data.patch
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
From 8ea66d51ffdb1bc104947d46ad8ebcb62eeaeb44 Mon Sep 17 00:00:00 2001
From: Mooffie <mooffie@gmail.com>
Date: Mon, 26 Sep 2016 16:17:08 +0300
Subject: [PATCH] Ticket #3589: Make hex search work for binary data.
---
lib/search/hex.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/lib/search/hex.c b/lib/search/hex.c
index 8b5470e..30202b4 100644
--- a/lib/search/hex.c
+++ b/lib/search/hex.c
@@ -137,6 +137,42 @@ mc_search__cond_struct_new_init_hex (const char *charset, mc_search_t * lc_mc_se
mc_search_hex_parse_error_t error = MC_SEARCH_HEX_E_OK;
int error_pos = 0;
+ /*
+ * We may be searching in binary data, which is often invalid UTF-8.
+ *
+ * We have to create a non UTF-8 regex (that is, G_REGEX_RAW) or else, as
+ * the data is invalid UTF-8, both GLib's PCRE and our
+ * mc_search__g_regex_match_full_safe() are going to fail us. The former by
+ * not finding all bytes, the latter by overwriting the supposedly invalid
+ * UTF-8 with NULs.
+ *
+ * To do this, we specify "ASCII" as the charset.
+ *
+ * In fact, we can specify any charset other than "UTF-8": any such charset
+ * will trigger G_REGEX_RAW (see [1]). The output of [2] will be the same
+ * for all charsets because it skips the \xXX symbols
+ * mc_search__hex_translate_to_regex() outputs.
+ *
+ * But "ASCII" is the best choice because a hex pattern may contain a
+ * quoted string: this way we know [2] will ignore any characters outside
+ * ASCII letters range (these ignored chars will be copied verbatim to the
+ * output and will match as-is; in other words, in a case-sensitive manner;
+ * If the user is interested in case-insensitive searches of international
+ * text, he shouldn't be using hex search in the first place.)
+ *
+ * Switching out of UTF-8 has another advantage:
+ *
+ * When doing case-insensitive searches, GLib treats \xXX symbols as normal
+ * letters and therefore matches both "a" and "A" for the hex pattern
+ * "0x61". When we switch out of UTF-8, we're switching to using [2], which
+ * doesn't have this issue.
+ *
+ * [1] mc_search__cond_struct_new_init_regex
+ * [2] mc_search__cond_struct_new_regex_ci_str
+ */
+ if (str_isutf8 (charset))
+ charset = "ASCII";
+
tmp = mc_search__hex_translate_to_regex (mc_search_cond->str, &error, &error_pos);
if (tmp != NULL)
{
--
2.9.3