
[bug found] res.rhash shrunk during 4s-query and 4s-backend segfault #131

Open
chenrushan opened this issue Sep 14, 2015 · 17 comments

@chenrushan

Here's the story

First I imported a file containing 10,000,000 triples, then I issued a simple query and got the following error:

4s-client.c:1796 kb=test resolve(0) failed: no reply

Initially, it seemed like a network error. But after some debugging, I found that the backend had segfaulted. I kept debugging to find out why, and discovered that res.rhash had shrunk to half its size, so an access to a hash entry that should have existed became invalid and caused the segfault.

I also found the reason for the shrink. When 4s-backend handles the FS_CHOOSE_SEGMENT request, the handle_choose_segment function calls fs_backend_open_files_intl with flags set to O_RDWR | O_CREAT, which in turn calls fs_rhash_open_filename. According to the logic in fs_rhash_open_filename, it first sets rh->size to 65536 and then calls fs_rhash_ensure_size. Since the O_RDWR flag is set, res.rhash, which contained more than 65536 entries, is truncated down to 65536 entries.

Here's the call stack:

#0  fs_rhash_ensure_size (rh=0x79e7a0) at rhash.c:141
#1  0x0000000000411dde in fs_rhash_open_filename (filename=0x79e6b0 "/var/lib/4store/test/0000/res.rhash", flags=66) at rhash.c:207
#2  0x0000000000411a45 in fs_rhash_open (be=0x73f140, label=0x432275 "res", flags=66) at rhash.c:132
#3  0x0000000000408b87 in fs_backend_open_files_intl (be=0x73f140, seg=0, flags=66, files=255, file=0x431430 "server.c", line=443) at backend.c:587
#4  0x00000000004050b7 in handle_choose_segment (be=0x73f140, segment=0, length=0, content=0x761c90 "") at server.c:443
#5  0x000000000041f639 in handle_or_fail (name=0x435b76 "backend->choose_segment", handle_fn=0x40507c <handle_choose_segment>, be=0x73f140, segment=0, length=0, content=0x761c90 "") at 4s-server.c:65
#6  0x000000000041fda8 in child (conn=14, backend=0x640860 <native_backend>, be=0x73f140) at 4s-server.c:191
#7  0x00000000004205f0 in accept_fn (source=0x763de0, condition=G_IO_IN, data=0x640860 <native_backend>) at 4s-server.c:351
#8  0x00007ffff74f790d in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#9  0x00007ffff74f7ce0 in ?? () from /usr/lib/libglib-2.0.so.0
#10 0x00007ffff74f8002 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#11 0x0000000000421004 in fsp_serve (kb_name=0x7fffffffebb3 "test", backend=0x640860 <native_backend>, daemon=0, disk_limit=1) at 4s-server.c:589
#12 0x00000000004070c4 in main (argc=3, argv=0x7fffffffe8c8) at server.c:1182

I don't know why a query request would open the files with O_RDWR in the first place.

rushan chen

@swh (Collaborator) commented Sep 14, 2015

Is this with the version from Git HEAD - or a release?

Were you importing with 4s-import, and querying with 4s-query, or doing both with 4s-httpd?

@chenrushan (Author)

The version is 6e2b4a7, two commits before HEAD.

Import was done with 4s-import, and the query with 4s-query.

@swh (Collaborator) commented Sep 14, 2015

Odd - it seems like that's quite a well-used path in the code.

Could you try backing off to an older version (maybe 12 months ago) and trying again?

There is a reason why some files are opened RDWR when querying, but I really can't remember why it is.

@chenrushan (Author)

I tried multiple older versions from 2013 to 2014 and got the same problem with all of them.

If you don't mind, maybe I can send you my data (98 MB zipped, via Google Drive).

@swh (Collaborator) commented Sep 14, 2015

If you could share the data that would be fantastic.

It's really odd - I'm confident a lot of people have done exactly what you have...

@chenrushan (Author)

@swh (Collaborator) commented Sep 14, 2015

OK, well the good news is I can reproduce - Ubuntu 14.04.1 LTS

$ 4s-backend-setup freebase
$ 4s-backend freebase
$ 4s-import -v freebase freebase-rdf-latest.10m.nt
removing old data
Reading <file:///home/swh/freebase-rdf-latest.10m.nt>
Pass 1, processed 5000000 triples (5000000)
Pass 2, processed 5000000 triples, 143299 triples/s
Pass 1, processed 5000000 triples (10000000)
Pass 2, processed 5000000 triples, 138593 triples/s
Pass 1, processed 0 triples (10000000)
Pass 2, processed 0 triples
Updating index
Index update took 10.958860 seconds
Imported 10000000 triples, average 121955 triples/s
$ 4s-query freebase 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="s"/>
    <variable name="p"/>
    <variable name="o"/>
  </head>
  <results>
4store[24290]: 4s-client.c:1796 kb=freebase resolve(0) failed: no reply
4store[24290]: 4s-client.c:1796 kb=freebase resolve(1) failed: no reply
4store[24290]: results.c:961 resolve_all failed
4store[24290]:  1: 4s-query() [0x415af4]
4store[24290]:  2: 4s-query() [0x4174b4]
4store[24290]:  3: 4s-query() [0x405344]
4store[24290]:  4: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f15464ccec5]
4store[24290]:  5: 4s-query() [0x405ae7]
4store[24290]: 4s-client.c:1738 kb=freebase resolve(0) failed: no reply
4store[24290]: 4s-client.c:1738 kb=freebase resolve(1) failed: no reply

@swh (Collaborator) commented Sep 14, 2015

So, it doesn't happen with the first 500k triples from that file, but it does happen with the same file with just the first two deleted.

I guess there's something in the data that's causing the problem.

I can't spend any more time on it now, but I'll set up a script to try importing it in 500k chunks and see if I can narrow it down.

@swh (Collaborator) commented Sep 14, 2015

@msalvadores - have you seen anything like this?

@chenrushan (Author)

Glad you can reproduce the bug.

@chenrushan (Author)

Hi @swh, sorry to bother you. I have a project depending on 4store, so I'd much appreciate it if you could help fix the bug or give me a hint on how to fix it.

I personally have a simple solution: check the size of the rhash file before ftruncate() is called. But I'm not sure whether this fully fixes the bug, or whether it might cause other problems.

Thanks in advance

rushan chen

@swh (Collaborator) commented Sep 20, 2015

No problem. Sorry, I didn't get time to look at it again last week. I'd suggest putting in your hack, and I'll try to find the cause - I suspect it must be something unusual in that file.

Cheers,
Steve

Sent on the move.


@swh (Collaborator) commented Nov 25, 2015

This might be related to #135 (comment) - can you try reverting the same patch?

@chenrushan (Author)

Sorry, I'm busy with a project; I'll try it when I have time. Currently my hack seems to work well.

rushan chen

@mazlo commented Jan 7, 2016

Is there a patch for this issue? @juscodit Could you write a few lines about what your fix looks like? I get the same error here on Fedora 21.

4store[27543]: 4s-client.c:1796 kb=missydb resolve(0) failed: no reply
4store[27543]: 4s-client.c:1796 kb=missydb resolve(1) failed: no reply
4store[27543]: results.c:961 resolve_all failed
4store[27543]:  1: 4s-query() [0x4137e9]
4store[27543]:  2: 4s-query() [0x4153be]
4store[27543]:  3: 4s-query() [0x405ad6]
4store[27543]:  4: /lib64/libc.so.6(__libc_start_main+0xf0) [0x3a0f81ffe0]
4store[27543]:  5: 4s-query() [0x406233]
4store[27543]: 4s-client.c:1738 kb=missydb resolve(0) failed: no reply
4store[27543]: 4s-client.c:436 kb=missydb write error for segment 0: Broken pipe
4store[27543]: 4s-client.c:448 kb=missydb segment 0 failed with no backup
4store[27543]:  1: 4s-query() [0x42b55b]
4store[27543]:  2: 4s-query() [0x42f3bd]
4store[27543]:  3: 4s-query() [0x40b435]
4store[27543]:  4: 4s-query() [0x412a8f]
4store[27543]:  5: 4s-query() [0x4153be]
4store[27543]:  6: 4s-query() [0x405ad6]
4store[27543]:  7: /lib64/libc.so.6(__libc_start_main+0xf0) [0x3a0f81ffe0]
4store[27543]:  8: 4s-query() [0x406233]
4store[27543]: 4s-common.c:171 recv header from socket failed, Bad file descriptor
4store[27543]: 4s-client.c:1738 kb=missydb resolve(0) failed: no reply
4store[27543]: 4s-client.c:436 kb=missydb write error for segment 0: Bad file descriptor
4store[27543]: 4s-client.c:448 kb=missydb segment 0 failed with no backup
...

The error is printed for each line (SPO) of the result set.
I used the code from HEAD and recompiled several times with no errors.
I imported from 4 different files; the import went through within 4-5 minutes without an error. The data set contains around 18M triples.

@chenrushan (Author)

diff --git a/src/backend/rhash.c b/src/backend/rhash.c
index 2ccf3d1..8906ff7 100644
--- a/src/backend/rhash.c
+++ b/src/backend/rhash.c
@@ -25,6 +25,7 @@
 #include <glib.h>
 #include <string.h>
 #include <zlib.h>
+#include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/mman.h>
 #include <sys/file.h>
@@ -141,6 +142,14 @@ void fs_rhash_ensure_size(fs_rhash *rh)
     if (!(rh->flags & (O_WRONLY | O_RDWR))) return;

     const off_t len = sizeof(struct rhash_header) + ((off_t) rh->size) * ((off_t) rh->bucket_size) * sizeof(fs_rhash_entry);
+    /* crs: check file size before truncate */
+    struct stat statbuf;
+    if (fstat(rh->fd, &statbuf) < 0) {
+        return;
+    }
+    if (len <= statbuf.st_size) {
+        return;
+    }

     if (ftruncate(rh->fd, len)) {
        fs_error(LOG_ERR, "ftruncate failed for '%s': %s", rh->filename, strerror(errno));
@@ -206,7 +215,13 @@ fs_rhash *fs_rhash_open_filename(const char *filename, int flags)
 #endif
     fs_rhash_ensure_size(rh);

-    const size_t len = sizeof(header) + ((size_t) rh->size) * ((size_t) rh->bucket_size) * sizeof(fs_rhash_entry);
+    /* crs: check file size before truncate */
+    struct stat statbuf;
+    if (fstat(rh->fd, &statbuf) < 0) {
+        return NULL;
+    }
+    // const size_t len = sizeof(header) + ((size_t) rh->size) * ((size_t) rh->bucket_size) * sizeof(fs_rhash_entry);
+    const size_t len = statbuf.st_size;
     const int mmapflags = PROT_READ | (flags & (O_WRONLY | O_RDWR) ? PROT_WRITE : 0);
     header = mmap(NULL, len, mmapflags, MAP_SHARED, rh->fd, 0);
     if (header == MAP_FAILED) {

@mazlo commented Jan 8, 2016

lovely, thx
