
[bug found] res.rhash shrunk during 4s-query and 4s-backend segfault #131

Open
chenrushan opened this issue Sep 14, 2015 · 17 comments

@chenrushan

Here's the story

First I imported a file containing 10,000,000 triples, then I issued a simple query and got the following error:

4s-client.c:1796 kb=test resolve(0) failed: no reply

Initially, it seemed like a network error. But after some debugging, I found that the backend had segfaulted. I kept debugging to find out why, and discovered that res.rhash had shrunk to half its size, so an access to a hash entry that should have existed became invalid and caused the segfault.

I also found the reason for the shrink. When 4s-backend handles the FS_CHOOSE_SEGMENT request, the handle_choose_segment function calls fs_backend_open_files_intl with flags set to O_RDWR | O_CREAT, which in turn calls fs_rhash_open_filename. According to the logic in fs_rhash_open_filename, it first sets rh->size to 65536 and then calls fs_rhash_ensure_size. Since the O_RDWR flag is set, res.rhash, which contained more than 65536 entries, is truncated down to 65536 entries.

Here's the call stack:

#0  fs_rhash_ensure_size (rh=0x79e7a0) at rhash.c:141
#1  0x0000000000411dde in fs_rhash_open_filename (filename=0x79e6b0 "/var/lib/4store/test/0000/res.rhash", flags=66) at rhash.c:207
#2  0x0000000000411a45 in fs_rhash_open (be=0x73f140, label=0x432275 "res", flags=66) at rhash.c:132
#3  0x0000000000408b87 in fs_backend_open_files_intl (be=0x73f140, seg=0, flags=66, files=255, file=0x431430 "server.c", line=443) at backend.c:587
#4  0x00000000004050b7 in handle_choose_segment (be=0x73f140, segment=0, length=0, content=0x761c90 "") at server.c:443
#5  0x000000000041f639 in handle_or_fail (name=0x435b76 "backend->choose_segment", handle_fn=0x40507c <handle_choose_segment>, be=0x73f140, segment=0, length=0, content=0x761c90 "") at 4s-server.c:65
#6  0x000000000041fda8 in child (conn=14, backend=0x640860 <native_backend>, be=0x73f140) at 4s-server.c:191
#7  0x00000000004205f0 in accept_fn (source=0x763de0, condition=G_IO_IN, data=0x640860 <native_backend>) at 4s-server.c:351
#8  0x00007ffff74f790d in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#9  0x00007ffff74f7ce0 in ?? () from /usr/lib/libglib-2.0.so.0
#10 0x00007ffff74f8002 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#11 0x0000000000421004 in fsp_serve (kb_name=0x7fffffffebb3 "test", backend=0x640860 <native_backend>, daemon=0, disk_limit=1) at 4s-server.c:589
#12 0x00000000004070c4 in main (argc=3, argv=0x7fffffffe8c8) at server.c:1182

I don't know why a query request would open the files with O_RDWR in the first place.

rushan chen

@swh (Collaborator) commented Sep 14, 2015

Is this with the version from Git HEAD - or a release?

Were you importing with 4s-import, and querying with 4s-query, or doing both with 4s-httpd?

@chenrushan (Author)

The version is 6e2b4a7, two commits before HEAD.

Import was done with 4s-import, and the query with 4s-query.

@swh (Collaborator) commented Sep 14, 2015

Odd - it seems like that's quite a well-used path in the code.

Could you try backing off to an older version (maybe 12 months ago) and trying again?

There is a reason why some files are opened RDWR when querying, but I really can't remember why it is.

@chenrushan (Author)

I tried multiple older versions from 2013 to 2014 and got the same problem with all of them.

If you don't mind, maybe I can send you my data (98 MB zipped, via Google Drive).

@swh (Collaborator) commented Sep 14, 2015

If you could share the data that would be fantastic.

It's really odd - I'm confident a lot of people have done exactly what you have...

@chenrushan (Author)

@swh (Collaborator) commented Sep 14, 2015

OK, well the good news is I can reproduce - Ubuntu 14.04.1 LTS

$ 4s-backend-setup freebase
$ 4s-backend freebase
$ 4s-import -v freebase freebase-rdf-latest.10m.nt
removing old data
Reading <file:///home/swh/freebase-rdf-latest.10m.nt>
Pass 1, processed 5000000 triples (5000000)
Pass 2, processed 5000000 triples, 143299 triples/s
Pass 1, processed 5000000 triples (10000000)
Pass 2, processed 5000000 triples, 138593 triples/s
Pass 1, processed 0 triples (10000000)
Pass 2, processed 0 triples
Updating index
Index update took 10.958860 seconds
Imported 10000000 triples, average 121955 triples/s
$ 4s-query freebase 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="s"/>
    <variable name="p"/>
    <variable name="o"/>
  </head>
  <results>
4store[24290]: 4s-client.c:1796 kb=freebase resolve(0) failed: no reply
4store[24290]: 4s-client.c:1796 kb=freebase resolve(1) failed: no reply
4store[24290]: results.c:961 resolve_all failed
4store[24290]:  1: 4s-query() [0x415af4]
4store[24290]:  2: 4s-query() [0x4174b4]
4store[24290]:  3: 4s-query() [0x405344]
4store[24290]:  4: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f15464ccec5]
4store[24290]:  5: 4s-query() [0x405ae7]
4store[24290]: 4s-client.c:1738 kb=freebase resolve(0) failed: no reply
4store[24290]: 4s-client.c:1738 kb=freebase resolve(1) failed: no reply

@swh (Collaborator) commented Sep 14, 2015

So, it doesn't happen with the first 500k triples from that file, but it does happen with the same file with just the first two deleted.

I guess there's something in the data that's causing the problem.

I can't spend any more time on it now, but I'll set up a script to try importing it in 500k chunks and see if I can narrow it down.

@swh (Collaborator) commented Sep 14, 2015

@msalvadores - have you seen anything like this?

@chenrushan (Author)

Glad you can reproduce the bug.

@chenrushan (Author)

Hi @swh, sorry to bother you. I have a project depending on 4store, so I'd much appreciate it if you could help fix the bug or give me a hint on how to fix it.

I personally have a simple solution: check the size of the rhash file before ftruncate() is called. But I'm not sure whether this fully fixes the bug, or whether it might cause other problems.

Thanks in advance

rushan chen

@swh (Collaborator) commented Sep 20, 2015

No problem. Sorry, I didn't get time to look at it again last week. I'd suggest putting in your hack, and I'll try to find the cause - I suspect it must be something unusual in that file.

Cheers,
Steve

Sent on the move.


@swh (Collaborator) commented Nov 25, 2015

This might be related to #135 (comment) - can you try reverting the same patch?

@chenrushan (Author)

Sorry, I'm busy with a project; I'll try it when I have time. Currently my hack seems to work well.

rushan chen

@mazlo commented Jan 7, 2016

Is there a patch for this issue? @juscodit Could you write a few lines about what your fix looks like? I get the same error here on Fedora 21.

4store[27543]: 4s-client.c:1796 kb=missydb resolve(0) failed: no reply
4store[27543]: 4s-client.c:1796 kb=missydb resolve(1) failed: no reply
4store[27543]: results.c:961 resolve_all failed
4store[27543]:  1: 4s-query() [0x4137e9]
4store[27543]:  2: 4s-query() [0x4153be]
4store[27543]:  3: 4s-query() [0x405ad6]
4store[27543]:  4: /lib64/libc.so.6(__libc_start_main+0xf0) [0x3a0f81ffe0]
4store[27543]:  5: 4s-query() [0x406233]
4store[27543]: 4s-client.c:1738 kb=missydb resolve(0) failed: no reply
4store[27543]: 4s-client.c:436 kb=missydb write error for segment 0: Broken pipe
4store[27543]: 4s-client.c:448 kb=missydb segment 0 failed with no backup
4store[27543]:  1: 4s-query() [0x42b55b]
4store[27543]:  2: 4s-query() [0x42f3bd]
4store[27543]:  3: 4s-query() [0x40b435]
4store[27543]:  4: 4s-query() [0x412a8f]
4store[27543]:  5: 4s-query() [0x4153be]
4store[27543]:  6: 4s-query() [0x405ad6]
4store[27543]:  7: /lib64/libc.so.6(__libc_start_main+0xf0) [0x3a0f81ffe0]
4store[27543]:  8: 4s-query() [0x406233]
4store[27543]: 4s-common.c:171 recv header from socket failed, Bad file descriptor
4store[27543]: 4s-client.c:1738 kb=missydb resolve(0) failed: no reply
4store[27543]: 4s-client.c:436 kb=missydb write error for segment 0: Bad file descriptor
4store[27543]: 4s-client.c:448 kb=missydb segment 0 failed with no backup
...

The error is printed for each line (SPO) of the result set.
I used the code from HEAD and recompiled several times with no errors.
I imported from 4 different files; the import went through within 4-5 minutes without an error. The data set contains around 18M triples.

@chenrushan (Author)

diff --git a/src/backend/rhash.c b/src/backend/rhash.c
index 2ccf3d1..8906ff7 100644
--- a/src/backend/rhash.c
+++ b/src/backend/rhash.c
@@ -25,6 +25,7 @@
 #include <glib.h>
 #include <string.h>
 #include <zlib.h>
+#include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/mman.h>
 #include <sys/file.h>
@@ -141,6 +142,14 @@ void fs_rhash_ensure_size(fs_rhash *rh)
     if (!(rh->flags & (O_WRONLY | O_RDWR))) return;

     const off_t len = sizeof(struct rhash_header) + ((off_t) rh->size) * ((off_t) rh->bucket_size) * sizeof(fs_rhash_entry);
+    /* crs: check file size before truncate */
+    struct stat statbuf;
+    if (fstat(rh->fd, &statbuf) < 0) {
+        return;
+    }
+    if (len <= statbuf.st_size) {
+        return;
+    }

     if (ftruncate(rh->fd, len)) {
        fs_error(LOG_ERR, "ftruncate failed for '%s': %s", rh->filename, strerror(errno));
@@ -206,7 +215,13 @@ fs_rhash *fs_rhash_open_filename(const char *filename, int flags)
 #endif
     fs_rhash_ensure_size(rh);

-    const size_t len = sizeof(header) + ((size_t) rh->size) * ((size_t) rh->bucket_size) * sizeof(fs_rhash_entry);
+    /* crs: check file size before truncate */
+    struct stat statbuf;
+    if (fstat(rh->fd, &statbuf) < 0) {
+        return NULL;
+    }
+    // const size_t len = sizeof(header) + ((size_t) rh->size) * ((size_t) rh->bucket_size) * sizeof(fs_rhash_entry);
+    const size_t len = statbuf.st_size;
     const int mmapflags = PROT_READ | (flags & (O_WRONLY | O_RDWR) ? PROT_WRITE : 0);
     header = mmap(NULL, len, mmapflags, MAP_SHARED, rh->fd, 0);
     if (header == MAP_FAILED) {

@mazlo commented Jan 8, 2016

lovely, thx
