New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rsync hangs on particular file #2138

Open
Wishmesh opened this Issue May 21, 2017 · 75 comments

Comments

Projects
None yet
@Wishmesh

Wishmesh commented May 21, 2017

  • Your Windows build number: Microsoft Windows [Version 10.0.15063] + BOW 16.04 (reinstalled today)

  • What you're doing and what's happening: I am trying to rsync folders between two Windows (both BashOnWindows). I am using following commands:

rsync --archive --stats --progress --delete /mnt/u/1-folder /mnt/u/2nd-folder some_name@192.168.99.17:/mnt/u/2-folder/
rsync -rl --progress --delete ...
rsync -rl --delete ...
  • What's wrong / what should be happening instead: It works almost flawlessly. But sometimes (at least 5 different cases) it stucks on particular file. When stopped (Ctrl+C) and resumed, it always hangs on the same file. Usually it hangs only on .pch and .pdb files (about 20MB). See the example below. I stop it with Ctrl+C (see the ^C).
        560,502 100%    1.63MB/s    0:00:00 (xfr#78, ir-chk=1056/39982)
some-path/more-path/cl.command.1.tlog
          6,060 100%   17.83kB/s    0:00:00 (xfr#79, ir-chk=1055/39982)
some-path/more-path/some-file.Build.CppClean.log
          1,817 100%    5.33kB/s    0:00:00 (xfr#80, ir-chk=1054/39982)
some-path/more-path/some-file.pch
         32,768   0%   95.81kB/s    0:00:46  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.0]
[sender] io timeout after 60 seconds -- exiting

I tried to wait for more than 24h... it stays on the one file.

Also, strange thing is:

  • it does not fail, if I move files around to different folders
  • as a workaround I can move offending files one by one with rsync (then rsync works again)
  • Strace of the failing command, if applicable: Ohhh.. Sorry. Tried to reproduce. But I can't. Because I moved files around. If you can not resolve this without strace, then I will wait till the next time and will provide strace here in comments.
    (If <cmd> is failing, then run strace -o strace.txt -ff <cmd>, and post the strace.txt output here)
@sunjoong

This comment has been minimized.

sunjoong commented May 21, 2017

@Wishmesh - What name and path name of that particular file? I had a strange experience with file named "ext" and directory named "ext"; My anti-virus blocked that file and directory.

@Wishmesh

This comment has been minimized.

Wishmesh commented May 22, 2017

@sunjoong - no. Anti-virus have these folders excluded. Path looks like:

X:\1-prj\some\simple\folders\here\Release_x64\appname.pch
or
X:\1-prj\some\simple\folders\here\Release_x64\appname.pdb

There were other extensions too, but I do not remember of what type they was...

@therealkenc

This comment has been minimized.

Collaborator

therealkenc commented May 23, 2017

Unfortunately I can't repro here. Strange files to be having trouble with. Their size isn't that special, nor should that be a problem anyway. TCP streams are very reliable on WSL these days. Any chance those files are open in something (anything) in Windows? Or special in any way you can think of, versus the other files in your tree?

One thing you might try is to do it old school and see what happens:

tar cpf - /mnt/u/somewhere | (ssh some_name@192.168.99.17 "cd remote_dir; tar xpf -")

That doesn't get you all the rsync goodness, but it will help eliminate filesystem and networking problems. If that works, try rsync without the --delete flag and see if that helps. Otherwise, shrug 🤷.

@Wishmesh

This comment has been minimized.

Wishmesh commented May 23, 2017

Any chance those files are open in something (anything) Windows?

No. It stuck on the same file again, and again. Even after reboot. Antivirus disabled for that drive.

Ok. Will try out your suggestions next time it happens.

@therealkenc

This comment has been minimized.

Collaborator

therealkenc commented May 23, 2017

It's not antivirus; and even if it were, those files still aren't very special compared to all the object files you must have in there. I rsync around enormous trees (like chromium) somewhat regularly, but that is on VolFS (/home/me). I rsynced a random Visual Studio project directory DrvFS to DrvFS (granted localhost to localhost) trying to repro just before I posted but it went fine. Anyway, bonne chance.

@Wishmesh

This comment has been minimized.

Wishmesh commented May 23, 2017

I have 2 folders (about 10GB):

  1. 36193 files and 6142 folders
  2. 33734 files and 5975 folders

Nothing extraordinary. And "file being used / not accessible" is less probable, because.... When it stuck, I can copy the same stuck file using the same rsync syntax, but specifying filename -- syncing by one file works... and then after that, again rsync works for some days or weeks without any problem.

@therealkenc

This comment has been minimized.

Collaborator

therealkenc commented May 23, 2017

...for some days or weeks without any problem

Yeah that's going to be a bitch to track. So obviously it's going to work for the one-off test I did. Sigh...

That the fail is sticky even after reboot is very relevant. That means it pretty much has to be a "filesystem thing", but I couldn't speculate what exactly. Appreciate you reporting though, so at least it is on the books.

@Wishmesh

This comment has been minimized.

Wishmesh commented May 23, 2017

That means it pretty much has to be a "filesystem thing"

But why rsync then works if I sync the same files one by one? And both - source and destination FS are NTFS, if that matters.

@therealkenc

This comment has been minimized.

Collaborator

therealkenc commented May 23, 2017

But why rsync then works if I sync the same files one by one?

Dunno, but it pretty much has to be a "filesystem thing" by deductive reasoning. Because your DrvFS (NTFS) filesystem is the only thing that can hold the sticky failure state across a reboot. Your sync one-by-one fix is, for whatever reason, getting things back into a sane state. Next time it fails, try deleting the problematic files on the server that you are pushing to and doing a full sync, instead of syncing them one at a time.

Handwaving here, the whole process of deleting files on DrvFS is not straightforward, because delete semantics are different in Unix and Windows. That why the --delete option is a red flag. Or, at least for lack of a better working theory atm.

@Wishmesh

This comment has been minimized.

Wishmesh commented May 24, 2017

Yeah that's going to be a bitch to track.

Let us try....

  1. Today I tried hard to reproduce and I succeeded

Try 1
some/folders/sub/folders/src/Debug/rc.read.1.tlog
          2,896 100%   10.79kB/s    0:00:00 (xfr#24, ir-chk=1011/41352)
some/folders/sub/folders/src/Debug/rc.write.1.tlog
            250 100%    0.93kB/s    0:00:00 (xfr#25, ir-chk=1010/41352)
some/folders/sub/folders/src/Debug/vc100.idb
          1,200   0%    4.47kB/s    0:05:24

Try2
some/folders/sub/folders/src/Debug_x64/some-name.log
          3,331 100%   28.04kB/s    0:00:00 (xfr#25, ir-chk=1034/41390)
some/folders/sub/folders/src/Debug_x64/some-name.obj
      1,397,284 100%   11.01MB/s    0:00:00 (xfr#26, ir-chk=1033/41390)
some/folders/sub/folders/src/Debug_x64/some-name.pch
         10,640   0%   85.17kB/s    0:05:33

Try3
some/folders/sub/folders/src/Debug_x64/some-name.pch
         10,640   0%    0.00kB/s    0:00:00
  1. The strange thing is, it sometimes shows 5+ minutes required transfer time almost immediately.

  2. When it stuck at some-name.pch, I checked MD5 of file on client and server -- they differ.

  3. When I said that state remains after reboot I somehow forgot about server state. I blamed client. So now I rebooted the server too -- both computers rebooted. So now it is confirmed -- it persists after reboot of both -- server and client.

sending incremental file list
deleting some/folders/sub/folders/src/some-name.opensdf
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%   26.62MB/s    0:00:01 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%   51.02kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/
some/folders/sub/folders/src/Debug_x64/some-name.pch
         32,768   0%   56.04kB/s    0:08:25  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting
  1. Checked offending file on the server. The file some-name.pch is missing on the server.

  2. Trying rsync again. It still stuck on the same file (even it is not on the server):

sending incremental file list
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%  303.08MB/s    0:00:00 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%  187.97kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/some-name.pch
         32,768   0%  233.58kB/s    0:02:01  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting
  1. Notice that before the some-name.pch file it synced .sdf and .suo file again. I am not deeply familiar with rsync. It writes multiple files at once? Why send .sdf and .suo again, when they were synced last time successfully?

  2. Deleting entire folder on the server: some/folders/sub/folders/src/Debug_x64

  3. Trying rsync again:

sending incremental file list
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%  305.40MB/s    0:00:00 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%  185.19kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/
some/folders/sub/folders/src/Debug_x64/Base64.obj
         10,358 100%   11.60kB/s    0:00:00 (xfr#3, ir-chk=1045/41389)
some/folders/sub/folders/src/Debug_x64/CL.read.1.tlog
         45,746 100%   12.38MB/s    0:00:00 (xfr#4, ir-chk=1044/41389)
some/folders/sub/folders/src/Debug_x64/CL.write.1.tlog
          3,618 100%  588.87kB/s    0:00:00 (xfr#5, ir-chk=1043/41389)
some/folders/sub/folders/src/Debug_x64/**********.obj
         70,915 100%    1.50MB/s    0:00:00 (xfr#6, ir-chk=1042/41389)
some/folders/sub/folders/src/Debug_x64/*********.obj
         15,243 100%  236.28kB/s    0:00:00 (xfr#7, ir-chk=1041/41389)
some/folders/sub/folders/src/Debug_x64/*********.obj
^Crsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting

Now it stuck on the some.obj file.

  1. Ctrl+C, try again... again stuck on the same.obj file...

  2. Remembered about strace...

strace -o strace.txt -ff rsync with params

Got 2 strace files. Unfortunately I cannot share them here as is... But will provide you with parts I you will need.

First file seems to come from ssh. 178974 bytes long.

dup2(3, 0)                              = 0
close(4)                                = 0
close(5)                                = 0
dup2(6, 1)                              = 1
close(3)                                = 0
close(6)                                = 0
fcntl(0, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(0, F_SETFL, O_RDWR)               = 0
execve("/usr/local/sbin/ssh", ["ssh", "-l", ....
execve("/usr/bin/ssh", ["ssh", "-l", .....
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd49ca20000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
...
... skip here ...
...
clock_gettime(CLOCK_BOOTTIME, {2396, 116297000}) = 0
read(3, "\216k__\225G!\263\265G}\374T\256\253\346\233\373>\305z][a\rb\32\211T\20\270\300"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 116528000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 116644000}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 116915000}) = 0
read(3, "\303\220\n\231\212\234\355+\277\34n5\\^\21O\326\366\324\372N\0\225\276n\313\355U\357\347`\335"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 117163000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 117278000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 117552000}) = 0
write(5, "\374\377\0\7,K\252G\377\n\260\23[\24DW@\312\233#\244\216\241\333\2300e\334L\24\212_"..., 16384) = 16384
read(3, "2\35\202\326T\32$F\276i\324\327k[\312\256\362\224/\255/z\247\17\243\261\346)\3464\312\236"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 117901000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 118047000}) = 0
select(7, [3 4], [3], NULL, NULL)       = 2 (in [3], out [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 118324000}) = 0
read(3, "\5\260\374\270\275\305\350\270\355E^|\243Ku\216\177\3\266&\317@\10\335t\362\261kP\360Ts"..., 8192) = 8192
write(3, "7\2571u\265?\347\4\212Ig]\t\21<JC\332.\243\206\257k\246\216\266\252\5\v\r'N"..., 36) = 36
clock_gettime(CLOCK_BOOTTIME, {2396, 118728000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 118843000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 119116000}) = 0
write(5, ";aW|v\223Yu\26\253w0\334\330\212\344\205\273\246^\370\"x@13\327<\240.G<"..., 16384) = 16384
read(3, "\376i\255\251\245\361ft\321\325\351_\264\337\333\361\363W`\37\235\361\267\3709\216\335\4\357\313H\336"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 119464000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 119579000}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 119841000}) = 0
read(3, "\340\305\217i\325f2\267\246\211?Q\2gm\33\246\325\f\35\365\241d\322\255\370\300\275$\312\275D"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 120085000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 120200000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 120474000}) = 0
write(5, "#\3W+\246\246FS\23\330\253?\207\1u\323\253\351\2563h3\26\301\3171\317}\244\352\212-"..., 16384

2nd file from rsync. 9277601 bytes long:

execve("/usr/bin/rsync", ["rsync", "--archive", "--stats", "--progress", "--delete", ......
brk(NULL)                               = 0x7fffe73de000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f993b200000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=21423, ...}) = 0
mmap(NULL, 21423, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f993b206000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\20\0\0\0\0\0\0"..., 832) = 832
...
...
...
open("some/folders/sub/folders/src/Debug_x64/*********.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=70915, ...}) = 0
write(1, "some/folders/sub/*******"..., 64) = 64
read(3, "d\206\265\0\225C%Yj\256\0\0\317\2\0\0\0\0\0\0.drectve\0\0\0\0"..., 70915) = 70915
gettimeofday({1495618771, 246185}, NULL) = 0
ioctl(1, TIOCGPGRP, [546])              = 0
write(1, "\r         32,768  46%   50.71kB/"..., 46) = 46
select(6, [5], [4], [5], {60, 0})       = 2 (in [5], out [4], left {59, 999999})
read(5, "\rT\v\320\16u\203&#0\25\352\353\313\204#\5x;", 19) = 19
write(4, "Q\216\0\7\377\376^\0S\0:\0\\\0001\0-\0P\0R\0J\0\\\0S\0O\0F\0"..., 36437) = 36437
gettimeofday({1495618771, 246814}, NULL) = 0
gettimeofday({1495618771, 246891}, NULL) = 0
gettimeofday({1495618771, 246961}, NULL) = 0
write(1, "\r         70,915 100%  109.58kB/"..., 72) = 72
close(3)                                = 0
open("some/folders/sub/folders/src/Debug_x64/*******.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=15243, ...}) = 0
write(1, "some/folders/sub/*****"..., 65) = 65
read(3, "d\206\17\0\225C%Yf)\0\0f\0\0\0\0\0\0\0.drectve\0\0\0\0"..., 15243) = 15243
gettimeofday({1495618771, 247785}, NULL) = 0
ioctl(1, TIOCGPGRP, [546])              = 0
write(1, "\r         15,243 100%   23.52kB/"..., 46) = 46
gettimeofday({1495618771, 248074}, NULL) = 0
write(1, "\r         15,243 100%   23.52kB/"..., 72) = 72
close(3)                                = 0
open("some/folders/sub/folders/src/Debug_x64/some.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=1071815, ...}) = 0
write(1, "some/folders/sub/********"..., 61) = 61
read(3, "d\206s\6\225C%Y\206\321\t\0\343\36\0\0\0\0\0\0.drectve\0\0\0\0"..., 262144) = 262144
select(6, [5], [4], [5], {60, 0})       = 2 (in [5], out [4], left {59, 999998})
read(5, "\32\371\177\2075\240'\347\247\223\245\2647\310\252\372\301\251\205\303T\324%R=\253\n\257\26\251\371\235"..., 38) = 38
write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484
  • "******" and "...." means, I obscured some names/parts.
  1. The client and the server is now in the state, where I can reliably repeat the hang :)
    So if I can somehow help further, just give me instructions.
@Wishmesh

This comment has been minimized.

Wishmesh commented May 24, 2017

And one more thing. I am starting to suspect that it is hanging only on the files, that are created with Visual Studio that is running as Elevated process under admin privileges. But it is not easy to reproduce, because in most time it sync these files easily.

@sunjoong

This comment has been minimized.

sunjoong commented May 24, 2017

@Wishmesh - I don't know why that happened (hmm.... perhaps timestamp or file closing problem??), but remember you said;

I am trying to rsync folders between two Windows

If so.... between two Windows.... using robocopy might be a meanwhile workaround, I think.

@therealkenc

This comment has been minimized.

Collaborator

therealkenc commented May 24, 2017

write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

Those writes are likely blocking because the send buffers are full. Which means it is the server (receiving) side where you are going to need to do the strace. Which is going to be a [. . .]

Visual Studio that is running as Elevated process under admin privileges.

That could matter. If you insist on running VS as admin, then you are probably going to want to try running bash with elevated privileges too; assuming you are not already. That might not (probably won't) unstick a stuck rsync, but, spitballing, it might avoid whatever is getting you in the state in the first place. That goes double if you are accessing those files with elevated privilege on the server as well, as opposed to just pushing them there for backups.

  1. Checked offending file on the server. The file some-name.pch is missing on the server.

For giggles take a peek in %appdatalocal%/lxss/temp (look but don't touch) on the server and see if there is anything untoward in there. See #1940. You have a totally unrelated scenario, of course, but you do have a situation with deletes going on under failure conditions. Failure conditions, which are probably uncommon enough to explain why you go days or weeks without seeing a problem.

Hopefully the devs take a look. Even more hopefully the problem just gets fixed magically with ongoing improvements, which happens a lot around here.

If so.... between two Windows.... using robocopy might be a meanwhile workaround, I think.

Well yeah. Or Cygwin rsync. But where would be the fun in that.

@Wishmesh

This comment has been minimized.

Wishmesh commented May 24, 2017

Robocopy isn't an option, because I need to sync files to non-Windows machines too.
Cygwin sucks. Bash on Windows is king 👍
And of course I can workaround the problem, but fix of this issue would be better ;)

in %localappdata%/lxss/temp (not %appdatalocal%) nothing unusual:
0000000000000000_devzero 0000000000000001_unlink 0000000000000002_unlink 0000000000000003_unlink 0000000000000004_unlink 0000000000000005_tmpfs 0000000000000006_unlink 0000000000000007_tmpfs 0000000000000008_unlink 0000000000000009_tmpfs 000000000000000a_unlink 000000000000000b_tmpfs 000000000000000c_unlink 000000000000000d_tmpfs 000000000000000e_unlink 000000000000000f_unlink 0000000000000010_unlink

Also, tried to run Bash as Elevated admin. It still stops on the same file. Cannot say if running as elevated would avoid getting stuck in the first place.

On server files are not touched / accessed. Like in the backup.

@sunjoong

This comment has been minimized.

sunjoong commented May 24, 2017

@Wishmesh

Cygwin sucks. Bash on Windows is king 👍

Haha... Then.... Does Msys2 rsync make same problem too? Compairing is a good tool to clearify issue, I think. And... Is that sure all Visual Studio windows were closed before launching rsync?

@therealkenc

This comment has been minimized.

Collaborator

therealkenc commented May 24, 2017

in %localappdata%/lxss/temp (not %appdatalocal%) nothing unusual:
0000000000000001_unlink .... etc

What are the size of those files? Or more specifically, do they look familiar in the context of your errant files (pch etc).

On server files are not touched / accessed. Like in the backup.

Wish you wouldn't have said that. That negates most theories related to elevated privileges. You client strace looks clean. You don't have any of the usual fs related gotchas there. Since you are blocking on a network write, by definition either the server isn't reading or the bytes, or the bytes didn't get there in the first place.

If your use case is backups, and there aren't other circumstances (like not having space on C:), you might want to push to /home or /var on the server. Eliminates a variable, anyway.

@sunjoong

This comment has been minimized.

sunjoong commented May 24, 2017

@Wishmesh - I think @therealkenc points good one. You might sync files under /mnt/<dir> directory. What if under /home/<dir> or /var/<dir>?

@Wishmesh

This comment has been minimized.

Wishmesh commented May 24, 2017

Haha... Then.... Does Msys2 rsync make same problem too? Compairing is a good tool to clearify issue, I think. And... Is that sure all Visual Studio windows were closed before launching rsync?

  1. Unable to compare with Msys2.
  2. Of course VS closed. As I said. Full reboot for both - client and server. Nothing running except services... Start Bash... run rsync... hung...

What are the size of those files?

These are folders in the temp directory... I said - nothing unusual, because they are identical to what I see in other unrelated to this problem PC.

About syncing to /home ... just tried the same folder and of course rsync works. Remember I said in the first posts... that if I try to rsync individual files (that are stuck), then it unstacks... and everything works again from this point.

So I currently have stuck the following folder:
some/folders/sub/folders/src/Debug_x64/some-file.obj
on the file some-file.obj

I just did rsync to /home/user/testing ... and it works
I also tried to rsync to /mnt/drive/some-other-path ... and of course it works.
Remember, it need many tries to get to the "stuck state" -- some days or weeks.

After testing these I run original rsync command:
rsync /mnt/u/some/ ...
And it stuck on the same file:
some/folders/sub/folders/src/Debug_x64/some-file.obj

So I suspect that if I will run:
rsync /mnt/u/some/folders/ ...
or
rsync /mnt/u/some/folders/sub/ ...
... it will work... and the stuck state would be gone. Because rsync on individual file unstacks...

Do you have other ideas to try, before I issue rsync with "deeper" path?

@sunjoong

This comment has been minimized.

sunjoong commented May 24, 2017

@Wishmesh - I could not figure it out why that happen but... think there are still many traps in using DrvFS, i.e., /mnt/<dir>; DrvFS does not fully compatible with VolFS, i.e., /home/<dir> or /var/<dir>, and rsync of WSL is a real linux binary and considered to be run within linux filesystem (in this case, VolFS), so it might not fully compatiable with DrvFS yet. However, Cygwin or Msys2 use (modified) windows program (and considered to be able to be run within Windows filesystem); They don't have the same concept of DrvFS and VolFS (but they use another approach.)

I thought, if you could fix the point of condition when that problem occurs, i.e., if you could reproduce it when you want it, you could compare it with non DrvFS case. But... you look like to say it's impossible.

And... I don't understand @therealkenc's saying enogh but... guess it might be about sort of limits during programs running. You know, there are some directories in %localappdata%\lxss\temp; These directoris are changing contents during programs running.

@Wishmesh

This comment has been minimized.

Wishmesh commented May 24, 2017

I couldn't resist to try.... So I have one file stuck:
/mnt/x/1/2/3/4/5/some-file.ext

  1. running rsync /mnt/x/1 user@IP:/mnt/x/abc/
    result -- hang

  2. running rsync /mnt/x/1/2 user@IP:/mnt/x/abc/1/
    result -- hang

  3. running rsync /mnt/x/1/2/3 user@IP:/mnt/x/abc/1/2/
    result -- hang

  4. running rsync /mnt/x/1/2/3/4/5 user@IP:/mnt/x/abc/1/2/3/4/
    result -- hang

  5. running rsync /mnt/x/1/2/3/4/5/some-file.ext user@IP:/mnt/x/abc/1/2/3/4/5/
    result -- success

  6. running command from the step 1 again
    rsync /mnt/x/1 user@IP:/mnt/x/abc/
    success

Issue is gone till the next time :(

@Wishmesh

This comment has been minimized.

Wishmesh commented May 24, 2017

And if it is filesystem issue, then why syncing files one by one works?

@sunjoong

This comment has been minimized.

sunjoong commented May 24, 2017

@Wishmesh

And if it is filesystem issue, then why syncing files one by one works?

Good question; You know, people say good question when they don't know a quite answer.

I cannot be sure that's filesystem issue; I'm just guessing it might be or not and it could be timestamp related issue if it might be a filesystem issue. Of course, it might not be timestamp related, it might even not filesystem issue, it might be a issue I could not imagine. But I know there is a timestamp issue in DrvFS itself that make a strange thing, and found another timestamp issue of rsync (but it might not fit on your case) https://askubuntu.com/questions/112863/rsync-not-working-between-ntfs-fat-and-ext .

@therealkenc

This comment has been minimized.

Collaborator

therealkenc commented May 24, 2017

About syncing to /home ... just tried the same folder and of course rsync works.

Sorry my poor explanation. When I said "you might want to push to /home or /var on the server", I meant as operating procedure moving forward. Something is getting borked on the server side. We don't know what, because no server side strace. But I think you'll have a lower probability of borkage on the server if you push to VolFS, because smart money says this is a DrvFS "filesystem thing". [You'll keep using DrvFS on the client of course, because the whole point is having access in Visual Studio. There's no problem we can see on the client.]

Do you have other ideas to try, before I issue rsync with "deeper" path?

Once you push to any different folder without corruption you're golden. Go deeper, sideways, to a whole different server, whatever.

Hold on to the corrupt folder in its pristine corrupt state if you have the space. The devs are going to need diagnostics of some kind on that tree, because no one will be able to repro this locally. You've got a real bug. But note that the chances of anyone reading this far down the thread is rapidly decreasing.

@sunilmut

This comment has been minimized.

Member

sunilmut commented Jun 12, 2017

Adding @SvenGroot to see if he can help out with this one. Also, marking this one as a bug.

@sunilmut sunilmut added the bug label Jun 12, 2017

@SvenGroot

This comment has been minimized.

Member

SvenGroot commented Jun 12, 2017

@Wishmesh Could you follow the steps from section 8 at https://github.com/Microsoft/BashOnWindows/blob/master/CONTRIBUTING.md, and provide us with the log files? That may provide some insight in to what is happening.

@therealkenc

This comment has been minimized.

Collaborator

therealkenc commented Jun 12, 2017

His problem is on the server-side of an WSL-to-WSL rsync push. There's gotta be a way to collect a server-side strace but it isn't immediately obvious to me. The client-side he supplied above. It hangs on:

write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

...which is not of much use, because it is not failing, just blocking. The problem isn't network-related because the problem persists across a reboot and only once the DrvFS filesystem on the server-side has gotten into a corrupt state.

@SvenGroot

This comment has been minimized.

Member

SvenGroot commented Jun 12, 2017

Since the server is also WSL, logs of both sides would be helpful. Strace would be helpful but I can see why that would be difficult in this scenario. However, ETL logs should be easy if you have access to the server.

Thanks,
Sven

@Wishmesh

This comment has been minimized.

Wishmesh commented Jun 13, 2017

Thanks for looking into this!

  1. I tried to get the server-side strace with the following command:
    sudo strace -ff -o mylog /usr/sbin/sshd
    I got two files - mylog.41 and mylog.42, however I do not see the relevant lines in contrast to client strace logs.

If you have idea how to strace rsync on server, let me know.

  1. I got some log files using logman command. Particularly the lxcore_kernel.etl seems to contain something useful, but I am unable to read it using EventViewer... missing some decoder perhaps.
    As it seems that it contains private info, I do not feel comfortable to share it here. Maybe I can share it to some email with @ microsoft.com at the end?
@jlahijani

This comment has been minimized.

jlahijani commented Aug 31, 2017

I'm experiencing this issue as well. It occurs on this image.
test

@acmuller

This comment has been minimized.

acmuller commented Apr 2, 2018

Yes, adding "--whole-file" is OK as a workaround, but that's not how Rsync is supposed work, optimally, right? Adding the flag does push the files through whole, but the process definitely runs slower. I think a proper fix still needs to be found for this.

@anthonyvdotbe

This comment has been minimized.

anthonyvdotbe commented Apr 3, 2018

Same issue here, doing rsync over SSH (so no rsyncd at play) to a Synology NAS server (which doesn't have strace). I used to use Cygwin for this, and never had any issues with it.
Any instructions on how to provide the necessary information to debug & resolve this issue, would be great.

@Wishmesh

This comment has been minimized.

Wishmesh commented Apr 5, 2018

The issue happens very rarely compared to a year ago. Currently the only fix is --whole-file or -W

@acmuller

This comment has been minimized.

acmuller commented Apr 6, 2018

It happens to me just often as it always has: every single time I run an Rsync upload to my server. It doesn't happen with CygWin, VMWare, or native Ubuntu, nor on downward syncs.

@cyberius0

This comment has been minimized.

cyberius0 commented Apr 27, 2018

Just wanted to let you know guys that I also have the problem.
Just moved my rsync backup-script from cygwin to WSL/bash and it randomly stops while transferring files from a large directory (containing ~40k files). Re-Running the script helped to copy all files, but I think this is a very big issue.

@acmuller

This comment has been minimized.

acmuller commented Apr 28, 2018

"Randomly" is indeed the right characterization. It doesn't seem to matter whether the files are small or large, or their type. It has been pointed out that we can work around this using the -W switch. That's true, but the whole point of using Rsync is to take advantage of its ability to handle files partially, so I don't think that approach should be taken as a proper solution.

@0xabu

This comment has been minimized.

0xabu commented May 31, 2018

@sunilmut @SvenGroot I have a repro of this and it looks a lot like a deadlock on the pipe between rsync and ssh.

rsync on the client side is doing a series of large writes just shy of 32KiB into the pipe it has to ssh. The last of these hangs:

write(4, ""..., 32776)                  = 32776
select(5, [], [4], [], {tv_sec=60, tv_usec=0}) = 1 (out [4], left {tv_sec=60, tv_usec=0})
write(4, ""..., 32776)                  = 32776
select(5, [], [4], [], {tv_sec=60, tv_usec=0}) = 1 (out [4], left {tv_sec=60, tv_usec=0})
write(4, ""..., 32776

Meanwhile ssh is doing its thing communicating with both rsync and the remote sshd over TCP; the last operation it blocked on was a write of exactly 32KiB into the local pipe with rsync (in the trace below FD 3 is the remote TCP socket, 5 is stdin from rsync, and 6 is stdout to rsync):

write(3, ""..., 32804)                  = 32804
getpid()                                = 13058
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=744078000}) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=744404000}) = 0
select(8, [3], [3], NULL, {tv_sec=60, tv_usec=0}) = 1 (out [3], left {tv_sec=59, tv_usec=905275})
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=840460000}) = 0
write(3, ""..., 32804)                  = 32804
getpid()                                = 13058
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=842270000}) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=842530000}) = 0
select(8, [3], [3], NULL, {tv_sec=60, tv_usec=0}) = 1 (in [3], left {tv_sec=60, tv_usec=0})
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=843980000}) = 0
read(3, ""..., 8192)                    = 8192
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=845545000}) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=846090000}) = 0
select(8, [3], [3], NULL, {tv_sec=60, tv_usec=0}) = 1 (in [3], left {tv_sec=60, tv_usec=0})
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=846816000}) = 0
read(3, ""..., 8192)                    = 8192
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=848333000}) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=848648000}) = 0
select(8, [3], [3], NULL, {tv_sec=60, tv_usec=0}) = 1 (in [3], left {tv_sec=60, tv_usec=0})
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=850480000}) = 0
read(3, ""..., 8192)                    = 4096
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=852325000}) = 0
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=853010000}) = 0
select(8, [3 5], [3 6], NULL, {tv_sec=60, tv_usec=0}) = 2 (in [5], out [6], left {tv_sec=60, tv_usec=0})
clock_gettime(CLOCK_BOOTTIME, {tv_sec=462356, tv_nsec=854243000}) = 0
read(5, ""..., 16384)                   = 16384
write(6, ""..., 32768

I also have a trace of the remote rsync, but as you'd expect it just shows a timeout on a select().

@0xabu

This comment has been minimized.

0xabu commented May 31, 2018

Oops, I lied: this appears to be a Unix domain socket, not a pipe, and what's more it is in nonblocking mode which would explain why it wasn't prepared to have the call block :)

socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 4]) = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
@0xabu

This comment has been minimized.

0xabu commented May 31, 2018

By the way, for everyone else hitting this -- there is a really kludgy workaround: when it hangs (probably multiple times during a run), send one of the ssh/rsync processes a SIGCHLD. It ignores the signal, but this causes the blocked syscall to return and generally un-wedges everything.

Or just sit in a loop sending the signal until it completes, like so:

while killall -CHLD ssh; do sleep 0.1; done
@sunilmut

This comment has been minimized.

Member

sunilmut commented May 31, 2018

@0xabu - Good find, if that's the actual root cause. Could this be #2822, which was fixed in 17666?

@0xabu

This comment has been minimized.

0xabu commented May 31, 2018

@sunilmut I don't think so. I'm still on RS4, but I can see in my strace (which I'll also email you) that neither side called shutdown.

@0xabu

This comment has been minimized.

0xabu commented May 31, 2018

It might well be #3100 though.

@MichaelAlguire

This comment has been minimized.

MichaelAlguire commented Jul 19, 2018

I might have some info to add.

From PowerShell, I run wsl.
I run rsync and it freezes.
However, if I resize the PowerShell window, it continues...
I've been able to complete an rsync by constantly resizing the window.

@0xabu

This comment has been minimized.

0xabu commented Jul 19, 2018

@MichaelAlguire I'm guessing that's another variation on the "just keep sending signals" workaround I suggested; in this case, resizing the window is probably emitting a SIGWINCH.

@JonMerlevede

This comment has been minimized.

JonMerlevede commented Aug 24, 2018

Can confirm that resizing the PowerShell window or sending SIGWINCH signals works for me as well, both on my legacy 14.04 installation and a brand-new 18.04 one.

@0xabu 's workaround might be a bit clunky but gets the job done...

while killall -CHLD ssh; do sleep 0.1; done

I think I'll be using rclone for now.

@phoerious

This comment has been minimized.

phoerious commented Oct 21, 2018

I'm having massive issues with this right now, but neither --whole-file nor the SIGCHLD hack appear to be working for me. :-\

@phoerious

This comment has been minimized.

phoerious commented Oct 22, 2018

Update: it appears to be working better in a real PowerShell terminal. With Cmdr it's absolutely abysmal and hangs at about every single file. It still hangs, though, and none of the workarounds is working.

I have the same problem with rsync at port 873 without SSH.

@jaimehrubiks

This comment has been minimized.

jaimehrubiks commented Oct 22, 2018

Interesting to see that it doesn't work either without SSH.

It works good enough for me using cmdr(or ssh from another pc)+tmux+ssh+sigchld_script. But it is still unpredictable.

@phoerious

This comment has been minimized.

phoerious commented Oct 22, 2018

The thing is: sometimes it works and then it hangs at some other random file and finally it will find one where it keeps hanging, no matter how often I restart the process.

@0xabu

This comment has been minimized.

0xabu commented Oct 22, 2018

@phoerious @jaimehrubiks are you folks on 1809 (aka "RS5") or an earlier version? I'm pretty sure this was fixed as #3100 -- I haven't seen it since upgrading to 1809. Also, if the SIGCHLD and/or window resize workaround don't work for you, this sounds like it's probably a separate problem, and you should open a new issue for it.

@phoerious

This comment has been minimized.

phoerious commented Oct 22, 2018

You mean my Windows build version? It's Windows 10 Pro 1803 (OS Build 17134.345).
Update says it's up2date, though.

@0xabu

This comment has been minimized.

0xabu commented Oct 22, 2018

Yes. According to this you should see it offered in Windows Update, but I know they held up the rollout, so if you don't get it now I imagine you soon will...

@phoerious

This comment has been minimized.

phoerious commented Oct 22, 2018

Trying the Insider Programme, but Windows Update is still searching. Has been for a while.

@phoerious

This comment has been minimized.

phoerious commented Oct 22, 2018

The Insider Programme tried to install 1809, but failed somewhere in between, probably when I pulled the plug on the programme's data collection when I realised that the few "diagnostics" I'm sending to Microsoft contain mu f***ing BROWSING HISTORY! That's really goddamn nobody's business. Unbelievable!

@DarthSpock

This comment has been minimized.

DarthSpock commented Oct 23, 2018

@phoerious [ In discussion tag outside of WSL scope territory] There is a Privacy Dashboard in which all the data being collected and sent is viewable and able to stop it's transmission and/or delete the data. Recommend the normal diagnostic data being kept for PC maintenance but for stuff like browsing history, obviously shut that off.

@n4af

This comment has been minimized.

n4af commented Nov 12, 2018

Gave up on WSL Rsync as it hangs at will.

Installed Git Bash for Windows with rsync.exe off Pacman repo.
All now works A-OK for large rsync of files.

https://blog.tiger-workshop.com/add-rsync-to-git-bash-for-windows/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment