rsync hangs on particular file #2138

Open
Wishmesh opened this issue May 21, 2017 · 101 comments
Labels
bug

Comments

@Wishmesh

@Wishmesh Wishmesh commented May 21, 2017

  • Your Windows build number: Microsoft Windows [Version 10.0.15063] + BOW 16.04 (reinstalled today)

  • What you're doing and what's happening: I am trying to rsync folders between two Windows (both BashOnWindows). I am using the following commands:

rsync --archive --stats --progress --delete /mnt/u/1-folder /mnt/u/2nd-folder some_name@192.168.99.17:/mnt/u/2-folder/
rsync -rl --progress --delete ...
rsync -rl --delete ...
  • What's wrong / what should be happening instead: It works almost flawlessly. But sometimes (at least 5 different cases) it gets stuck on a particular file. When stopped (Ctrl+C) and resumed, it always hangs on the same file. Usually it hangs only on .pch and .pdb files (about 20MB). See the example below. I stop it with Ctrl+C (see the ^C).
        560,502 100%    1.63MB/s    0:00:00 (xfr#78, ir-chk=1056/39982)
some-path/more-path/cl.command.1.tlog
          6,060 100%   17.83kB/s    0:00:00 (xfr#79, ir-chk=1055/39982)
some-path/more-path/some-file.Build.CppClean.log
          1,817 100%    5.33kB/s    0:00:00 (xfr#80, ir-chk=1054/39982)
some-path/more-path/some-file.pch
         32,768   0%   95.81kB/s    0:00:46  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.0]
[sender] io timeout after 60 seconds -- exiting

I tried to wait for more than 24h... it stays on the one file.

Also, the strange thing is:

  • it does not fail, if I move files around to different folders
  • as a workaround I can move offending files one by one with rsync (then rsync works again)
  • Strace of the failing command, if applicable: Ohhh.. Sorry. Tried to reproduce. But I can't. Because I moved files around. If you can not resolve this without strace, then I will wait till the next time and will provide strace here in comments.
    (If <cmd> is failing, then run strace -o strace.txt -ff <cmd>, and post the strace.txt output here)
@sunjoong

@sunjoong sunjoong commented May 21, 2017

@Wishmesh - What are the name and path of that particular file? I had a strange experience with a file named "ext" and a directory named "ext": my anti-virus blocked that file and directory.

@Wishmesh
Author

@Wishmesh Wishmesh commented May 22, 2017

@sunjoong - no. The anti-virus has these folders excluded. The path looks like:

X:\1-prj\some\simple\folders\here\Release_x64\appname.pch
or
X:\1-prj\some\simple\folders\here\Release_x64\appname.pdb

There were other extensions too, but I do not remember what types they were...

@therealkenc
Collaborator

@therealkenc therealkenc commented May 23, 2017

Unfortunately I can't repro here. Strange files to be having trouble with. Their size isn't that special, nor should that be a problem anyway. TCP streams are very reliable on WSL these days. Any chance those files are open in something (anything) in Windows? Or special in any way you can think of, versus the other files in your tree?

One thing you might try is to do it old school and see what happens:

tar cpf - /mnt/u/somewhere | (ssh some_name@192.168.99.17 "cd remote_dir; tar xpf -")

That doesn't get you all the rsync goodness, but it will help eliminate filesystem and networking problems. If that works, try rsync without the --delete flag and see if that helps. Otherwise, shrug 🤷.
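
If the tar pipe above completes, one way to confirm that the tree arrived intact is to compare checksum manifests from both sides. The following is only a minimal sketch, not something taken from the rsync or WSL documentation: `md5_manifest` is a hypothetical helper name, and it assumes `find`, `md5sum`, and `sort` are available on both machines.

```shell
#!/usr/bin/env bash
# md5_manifest (hypothetical helper): print "checksum  ./relative/path" for
# every regular file under the directory given as $1, sorted by path so that
# two manifests can be compared line by line.
md5_manifest() {
    (cd "$1" && find . -type f -exec md5sum {} + | LC_ALL=C sort -k 2)
}
```

Run it on the client (for example `md5_manifest /mnt/u/somewhere > client.md5`), run the same pipeline on the server inside the target directory, and `diff` the two files. Any line that differs names a file whose contents do not match.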

@Wishmesh
Author

@Wishmesh Wishmesh commented May 23, 2017

Any chance those files are open in something (anything) in Windows?

No. It gets stuck on the same file again and again, even after a reboot. Antivirus is disabled for that drive.

Ok. Will try out your suggestions next time it happens.

@therealkenc
Collaborator

@therealkenc therealkenc commented May 23, 2017

It's not antivirus; and even if it were, those files still aren't very special compared to all the object files you must have in there. I rsync around enormous trees (like chromium) somewhat regularly, but that is on VolFS (/home/me). I rsynced a random Visual Studio project directory DrvFS to DrvFS (granted localhost to localhost) trying to repro just before I posted but it went fine. Anyway, bonne chance.

@Wishmesh
Author

@Wishmesh Wishmesh commented May 23, 2017

I have 2 folders (about 10GB):

  1. 36193 files and 6142 folders
  2. 33734 files and 5975 folders

Nothing extraordinary. And "file being used / not accessible" is less probable, because when it gets stuck, I can copy the same stuck file using the same rsync syntax but specifying the filename. Syncing one file at a time works, and after that rsync again works for some days or weeks without any problem.

@therealkenc
Collaborator

@therealkenc therealkenc commented May 23, 2017

...for some days or weeks without any problem

Yeah that's going to be a bitch to track. So obviously it's going to work for the one-off test I did. Sigh...

That the fail is sticky even after reboot is very relevant. That means it pretty much has to be a "filesystem thing", but I couldn't speculate what exactly. Appreciate you reporting though, so at least it is on the books.

@Wishmesh
Author

@Wishmesh Wishmesh commented May 23, 2017

That means it pretty much has to be a "filesystem thing"

But why does rsync then work if I sync the same files one by one? And both source and destination filesystems are NTFS, if that matters.

@therealkenc
Collaborator

@therealkenc therealkenc commented May 23, 2017

But why does rsync then work if I sync the same files one by one?

Dunno, but it pretty much has to be a "filesystem thing" by deductive reasoning. Because your DrvFS (NTFS) filesystem is the only thing that can hold the sticky failure state across a reboot. Your sync one-by-one fix is, for whatever reason, getting things back into a sane state. Next time it fails, try deleting the problematic files on the server that you are pushing to and doing a full sync, instead of syncing them one at a time.

Handwaving here, the whole process of deleting files on DrvFS is not straightforward, because delete semantics are different in Unix and Windows. That's why the --delete option is a red flag. Or at least it is, for lack of a better working theory atm.

@Wishmesh
Author

@Wishmesh Wishmesh commented May 24, 2017

Yeah that's going to be a bitch to track.

Let us try....

  1. Today I tried hard to reproduce and I succeeded

Try 1
some/folders/sub/folders/src/Debug/rc.read.1.tlog
          2,896 100%   10.79kB/s    0:00:00 (xfr#24, ir-chk=1011/41352)
some/folders/sub/folders/src/Debug/rc.write.1.tlog
            250 100%    0.93kB/s    0:00:00 (xfr#25, ir-chk=1010/41352)
some/folders/sub/folders/src/Debug/vc100.idb
          1,200   0%    4.47kB/s    0:05:24

Try2
some/folders/sub/folders/src/Debug_x64/some-name.log
          3,331 100%   28.04kB/s    0:00:00 (xfr#25, ir-chk=1034/41390)
some/folders/sub/folders/src/Debug_x64/some-name.obj
      1,397,284 100%   11.01MB/s    0:00:00 (xfr#26, ir-chk=1033/41390)
some/folders/sub/folders/src/Debug_x64/some-name.pch
         10,640   0%   85.17kB/s    0:05:33

Try3
some/folders/sub/folders/src/Debug_x64/some-name.pch
         10,640   0%    0.00kB/s    0:00:00
  1. The strange thing is, it sometimes shows 5+ minutes required transfer time almost immediately.

  2. When it got stuck at some-name.pch, I checked the MD5 of the file on the client and the server -- they differ.

  3. When I said that state remains after reboot I somehow forgot about server state. I blamed client. So now I rebooted the server too -- both computers rebooted. So now it is confirmed -- it persists after reboot of both -- server and client.

sending incremental file list
deleting some/folders/sub/folders/src/some-name.opensdf
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%   26.62MB/s    0:00:01 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%   51.02kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/
some/folders/sub/folders/src/Debug_x64/some-name.pch
         32,768   0%   56.04kB/s    0:08:25  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting
  1. Checked offending file on the server. The file some-name.pch is missing on the server.

  2. Trying rsync again. It still gets stuck on the same file (even though it is not on the server):

sending incremental file list
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%  303.08MB/s    0:00:00 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%  187.97kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/some-name.pch
         32,768   0%  233.58kB/s    0:02:01  ^C
rsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting
  1. Notice that before the some-name.pch file it synced the .sdf and .suo files again. I am not deeply familiar with rsync. Does it write multiple files at once? Why send .sdf and .suo again, when they were synced successfully last time?

  2. Deleting entire folder on the server: some/folders/sub/folders/src/Debug_x64

  3. Trying rsync again:

sending incremental file list
some/folders/sub/folders/src/
some/folders/sub/folders/src/some-name.sdf
     41,963,520 100%  305.40MB/s    0:00:00 (xfr#1, ir-chk=1025/41327)
some/folders/sub/folders/src/some-name.suo
         25,600 100%  185.19kB/s    0:00:00 (xfr#2, ir-chk=1023/41327)
some/folders/sub/folders/src/Debug_x64/
some/folders/sub/folders/src/Debug_x64/Base64.obj
         10,358 100%   11.60kB/s    0:00:00 (xfr#3, ir-chk=1045/41389)
some/folders/sub/folders/src/Debug_x64/CL.read.1.tlog
         45,746 100%   12.38MB/s    0:00:00 (xfr#4, ir-chk=1044/41389)
some/folders/sub/folders/src/Debug_x64/CL.write.1.tlog
          3,618 100%  588.87kB/s    0:00:00 (xfr#5, ir-chk=1043/41389)
some/folders/sub/folders/src/Debug_x64/**********.obj
         70,915 100%    1.50MB/s    0:00:00 (xfr#6, ir-chk=1042/41389)
some/folders/sub/folders/src/Debug_x64/*********.obj
         15,243 100%  236.28kB/s    0:00:00 (xfr#7, ir-chk=1041/41389)
some/folders/sub/folders/src/Debug_x64/*********.obj
^Crsync error: unexplained error (code 130) at rsync.c(632) [sender=3.1.1]
[sender] io timeout after 60 seconds -- exiting

Now it gets stuck on the some.obj file.

  1. Ctrl+C, try again... again stuck on the same .obj file...

  2. Remembered about strace...

strace -o strace.txt -ff rsync with params

Got 2 strace files. Unfortunately I cannot share them here as is... But I will provide you with the parts you need.

First file seems to come from ssh. 178974 bytes long.

dup2(3, 0)                              = 0
close(4)                                = 0
close(5)                                = 0
dup2(6, 1)                              = 1
close(3)                                = 0
close(6)                                = 0
fcntl(0, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(0, F_SETFL, O_RDWR)               = 0
execve("/usr/local/sbin/ssh", ["ssh", "-l", ....
execve("/usr/bin/ssh", ["ssh", "-l", .....
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd49ca20000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
...
... skip here ...
...
clock_gettime(CLOCK_BOOTTIME, {2396, 116297000}) = 0
read(3, "\216k__\225G!\263\265G}\374T\256\253\346\233\373>\305z][a\rb\32\211T\20\270\300"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 116528000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 116644000}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 116915000}) = 0
read(3, "\303\220\n\231\212\234\355+\277\34n5\\^\21O\326\366\324\372N\0\225\276n\313\355U\357\347`\335"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 117163000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 117278000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 117552000}) = 0
write(5, "\374\377\0\7,K\252G\377\n\260\23[\24DW@\312\233#\244\216\241\333\2300e\334L\24\212_"..., 16384) = 16384
read(3, "2\35\202\326T\32$F\276i\324\327k[\312\256\362\224/\255/z\247\17\243\261\346)\3464\312\236"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 117901000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 118047000}) = 0
select(7, [3 4], [3], NULL, NULL)       = 2 (in [3], out [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 118324000}) = 0
read(3, "\5\260\374\270\275\305\350\270\355E^|\243Ku\216\177\3\266&\317@\10\335t\362\261kP\360Ts"..., 8192) = 8192
write(3, "7\2571u\265?\347\4\212Ig]\t\21<JC\332.\243\206\257k\246\216\266\252\5\v\r'N"..., 36) = 36
clock_gettime(CLOCK_BOOTTIME, {2396, 118728000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 118843000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 119116000}) = 0
write(5, ";aW|v\223Yu\26\253w0\334\330\212\344\205\273\246^\370\"x@13\327<\240.G<"..., 16384) = 16384
read(3, "\376i\255\251\245\361ft\321\325\351_\264\337\333\361\363W`\37\235\361\267\3709\216\335\4\357\313H\336"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 119464000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 119579000}) = 0
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
clock_gettime(CLOCK_BOOTTIME, {2396, 119841000}) = 0
read(3, "\340\305\217i\325f2\267\246\211?Q\2gm\33\246\325\f\35\365\241d\322\255\370\300\275$\312\275D"..., 8192) = 8192
clock_gettime(CLOCK_BOOTTIME, {2396, 120085000}) = 0
clock_gettime(CLOCK_BOOTTIME, {2396, 120200000}) = 0
select(7, [3 4], [5], NULL, NULL)       = 2 (in [3], out [5])
clock_gettime(CLOCK_BOOTTIME, {2396, 120474000}) = 0
write(5, "#\3W+\246\246FS\23\330\253?\207\1u\323\253\351\2563h3\26\301\3171\317}\244\352\212-"..., 16384

2nd file from rsync. 9277601 bytes long:

execve("/usr/bin/rsync", ["rsync", "--archive", "--stats", "--progress", "--delete", ......
brk(NULL)                               = 0x7fffe73de000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f993b200000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=21423, ...}) = 0
mmap(NULL, 21423, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f993b206000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\20\0\0\0\0\0\0"..., 832) = 832
...
...
...
open("some/folders/sub/folders/src/Debug_x64/*********.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=70915, ...}) = 0
write(1, "some/folders/sub/*******"..., 64) = 64
read(3, "d\206\265\0\225C%Yj\256\0\0\317\2\0\0\0\0\0\0.drectve\0\0\0\0"..., 70915) = 70915
gettimeofday({1495618771, 246185}, NULL) = 0
ioctl(1, TIOCGPGRP, [546])              = 0
write(1, "\r         32,768  46%   50.71kB/"..., 46) = 46
select(6, [5], [4], [5], {60, 0})       = 2 (in [5], out [4], left {59, 999999})
read(5, "\rT\v\320\16u\203&#0\25\352\353\313\204#\5x;", 19) = 19
write(4, "Q\216\0\7\377\376^\0S\0:\0\\\0001\0-\0P\0R\0J\0\\\0S\0O\0F\0"..., 36437) = 36437
gettimeofday({1495618771, 246814}, NULL) = 0
gettimeofday({1495618771, 246891}, NULL) = 0
gettimeofday({1495618771, 246961}, NULL) = 0
write(1, "\r         70,915 100%  109.58kB/"..., 72) = 72
close(3)                                = 0
open("some/folders/sub/folders/src/Debug_x64/*******.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=15243, ...}) = 0
write(1, "some/folders/sub/*****"..., 65) = 65
read(3, "d\206\17\0\225C%Yf)\0\0f\0\0\0\0\0\0\0.drectve\0\0\0\0"..., 15243) = 15243
gettimeofday({1495618771, 247785}, NULL) = 0
ioctl(1, TIOCGPGRP, [546])              = 0
write(1, "\r         15,243 100%   23.52kB/"..., 46) = 46
gettimeofday({1495618771, 248074}, NULL) = 0
write(1, "\r         15,243 100%   23.52kB/"..., 72) = 72
close(3)                                = 0
open("some/folders/sub/folders/src/Debug_x64/some.obj", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0777, st_size=1071815, ...}) = 0
write(1, "some/folders/sub/********"..., 61) = 61
read(3, "d\206s\6\225C%Y\206\321\t\0\343\36\0\0\0\0\0\0.drectve\0\0\0\0"..., 262144) = 262144
select(6, [5], [4], [5], {60, 0})       = 2 (in [5], out [4], left {59, 999998})
read(5, "\32\371\177\2075\240'\347\247\223\245\2647\310\252\372\301\251\205\303T\324%R=\253\n\257\26\251\371\235"..., 38) = 38
write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484
  • "******" and "...." means, I obscured some names/parts.
  1. The client and the server are now in a state where I can reliably reproduce the hang :)
    So if I can somehow help further, just give me instructions.
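
With `strace -o strace.txt -ff`, each traced process writes its own `strace.txt.<pid>` file, and a process that is blocked shows up as a final line with no return value, like the truncated `write(4, ...` above. A small sketch for pulling out just those final lines follows. `last_syscalls` is a hypothetical helper name, not part of strace.

```shell
#!/usr/bin/env bash
# last_syscalls (hypothetical helper): for every strace output file that
# starts with the given prefix (default: strace.txt), print the file name
# and its final line, i.e. the last system call that process reached.
last_syscalls() {
    local prefix="${1:-strace.txt}"
    local f
    for f in "$prefix".*; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        printf '%s: %s\n' "$f" "$(tail -n 1 "$f")"
    done
}
```

A final line that ends without `= <result>` is a call that never returned, which is where to look first. The names of the files passed in are the only thing printed besides that one line, so the output is easier to share than a full trace.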
@Wishmesh
Author

@Wishmesh Wishmesh commented May 24, 2017

And one more thing. I am starting to suspect that it is hanging only on the files that are created with Visual Studio that is running as Elevated process under admin privileges. But it is not easy to reproduce, because most of the time it syncs these files easily.

@sunjoong

@sunjoong sunjoong commented May 24, 2017

@Wishmesh - I don't know why that happened (hmm.... perhaps timestamp or file closing problem??), but remember you said;

I am trying to rsync folders between two Windows

If so.... between two Windows.... using robocopy might be a meanwhile workaround, I think.

@therealkenc
Collaborator

@therealkenc therealkenc commented May 24, 2017

write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

Those writes are likely blocking because the send buffers are full. Which means it is the server (receiving) side where you are going to need to do the strace. Which is going to be a [. . .]

Visual Studio that is running as Elevated process under admin privileges.

That could matter. If you insist on running VS as admin, then you are probably going to want to try running bash with elevated privileges too; assuming you are not already. That might not (probably won't) unstick a stuck rsync, but, spitballing, it might avoid whatever is getting you in the state in the first place. That goes double if you are accessing those files with elevated privilege on the server as well, as opposed to just pushing them there for backups.

  1. Checked offending file on the server. The file some-name.pch is missing on the server.

For giggles take a peek in %appdatalocal%/lxss/temp (look but don't touch) on the server and see if there is anything untoward in there. See #1940. You have a totally unrelated scenario, of course, but you do have a situation with deletes going on under failure conditions. Failure conditions, which are probably uncommon enough to explain why you go days or weeks without seeing a problem.

Hopefully the devs take a look. Even more hopefully the problem just gets fixed magically with ongoing improvements, which happens a lot around here.

If so.... between two Windows.... using robocopy might be a meanwhile workaround, I think.

Well yeah. Or Cygwin rsync. But where would be the fun in that.

@Wishmesh
Author

@Wishmesh Wishmesh commented May 24, 2017

Robocopy isn't an option, because I need to sync files to non-Windows machines too.
Cygwin sucks. Bash on Windows is king 👍
And of course I can workaround the problem, but fix of this issue would be better ;)

in %localappdata%/lxss/temp (not %appdatalocal%) nothing unusual:
0000000000000000_devzero 0000000000000001_unlink 0000000000000002_unlink 0000000000000003_unlink 0000000000000004_unlink 0000000000000005_tmpfs 0000000000000006_unlink 0000000000000007_tmpfs 0000000000000008_unlink 0000000000000009_tmpfs 000000000000000a_unlink 000000000000000b_tmpfs 000000000000000c_unlink 000000000000000d_tmpfs 000000000000000e_unlink 000000000000000f_unlink 0000000000000010_unlink

Also, tried to run Bash as Elevated admin. It still stops on the same file. Cannot say if running as elevated would avoid getting stuck in the first place.

On server files are not touched / accessed. Like in the backup.

@sunjoong

@sunjoong sunjoong commented May 24, 2017

@Wishmesh

Cygwin sucks. Bash on Windows is king 👍

Haha... Then.... Does Msys2 rsync have the same problem too? Comparing is a good tool to clarify the issue, I think. And... Is it certain that all Visual Studio windows were closed before launching rsync?

@therealkenc
Collaborator

@therealkenc therealkenc commented May 24, 2017

in %localappdata%/lxss/temp (not %appdatalocal%) nothing unusual:
0000000000000001_unlink .... etc

What are the sizes of those files? Or more specifically, do they look familiar in the context of your errant files (pch etc)?

On server files are not touched / accessed. Like in the backup.

Wish you wouldn't have said that. That negates most theories related to elevated privileges. Your client strace looks clean. You don't have any of the usual fs related gotchas there. Since you are blocking on a network write, by definition either the server isn't reading the bytes, or the bytes didn't get there in the first place.

If your use case is backups, and there aren't other circumstances (like not having space on C:), you might want to push to /home or /var on the server. Eliminates a variable, anyway.

@sunjoong

@sunjoong sunjoong commented May 24, 2017

@Wishmesh - I think @therealkenc makes a good point. You are syncing files under a /mnt/<dir> directory. What happens under /home/<dir> or /var/<dir>?

@Wishmesh
Author

@Wishmesh Wishmesh commented May 24, 2017

Haha... Then.... Does Msys2 rsync have the same problem too? Comparing is a good tool to clarify the issue, I think. And... Is it certain that all Visual Studio windows were closed before launching rsync?

  1. Unable to compare with Msys2.
  2. Of course VS closed. As I said. Full reboot for both - client and server. Nothing running except services... Start Bash... run rsync... hung...

What are the sizes of those files?

These are folders in the temp directory... I said nothing unusual because they are identical to what I see on another PC unrelated to this problem.

About syncing to /home ... just tried the same folder and of course rsync works. Remember I said in the first posts... that if I try to rsync individual files (that are stuck), then it unsticks... and everything works again from this point.

So I currently have stuck the following folder:
some/folders/sub/folders/src/Debug_x64/some-file.obj
on the file some-file.obj

I just did rsync to /home/user/testing ... and it works
I also tried to rsync to /mnt/drive/some-other-path ... and of course it works.
Remember, it needs many tries to get to the "stuck state" -- some days or weeks.

After testing these I run original rsync command:
rsync /mnt/u/some/ ...
And it got stuck on the same file:
some/folders/sub/folders/src/Debug_x64/some-file.obj

So I suspect that if I run:
rsync /mnt/u/some/folders/ ...
or
rsync /mnt/u/some/folders/sub/ ...
... it will work... and the stuck state would be gone. Because rsync on an individual file unsticks it...

Do you have other ideas to try, before I issue rsync with "deeper" path?

@sunjoong

@sunjoong sunjoong commented May 24, 2017

@Wishmesh - I could not figure out why that happens, but I think there are still many traps in using DrvFS, i.e., /mnt/<dir>. DrvFS is not fully compatible with VolFS, i.e., /home/<dir> or /var/<dir>. WSL's rsync is a real Linux binary that expects to run on a Linux filesystem (in this case, VolFS), so it might not be fully compatible with DrvFS yet. Cygwin and Msys2, however, use (modified) Windows programs that are meant to run on a Windows filesystem; they don't have the DrvFS/VolFS concept and use another approach.

I thought that if you could pin down the conditions under which the problem occurs, i.e., if you could reproduce it on demand, you could compare it with the non-DrvFS case. But you seem to be saying that is impossible.

And I don't understand @therealkenc's point well enough, but I guess it might be about some sort of limit hit while programs are running. You know, there are some directories in %localappdata%\lxss\temp; the contents of these directories change while programs are running.

@Wishmesh
Author

@Wishmesh Wishmesh commented May 24, 2017

I couldn't resist trying.... So I have one file stuck:
/mnt/x/1/2/3/4/5/some-file.ext

  1. running rsync /mnt/x/1 user@IP:/mnt/x/abc/
    result -- hang

  2. running rsync /mnt/x/1/2 user@IP:/mnt/x/abc/1/
    result -- hang

  3. running rsync /mnt/x/1/2/3 user@IP:/mnt/x/abc/1/2/
    result -- hang

  4. running rsync /mnt/x/1/2/3/4/5 user@IP:/mnt/x/abc/1/2/3/4/
    result -- hang

  5. running rsync /mnt/x/1/2/3/4/5/some-file.ext user@IP:/mnt/x/abc/1/2/3/4/5/
    result -- success

  6. running command from the step 1 again
    rsync /mnt/x/1 user@IP:/mnt/x/abc/
    success

Issue is gone till the next time :(
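
The depth-by-depth narrowing in the steps above can be partly scripted. The sketch below only generates the list of progressively deeper source paths; `path_prefixes` is a hypothetical helper name, and it assumes an absolute path without doubled slashes.

```shell
#!/usr/bin/env bash
# path_prefixes (hypothetical helper): print every ancestor of an absolute
# path from shallowest to deepest, ending with the path itself.
path_prefixes() {
    local rest="${1#/}"      # drop the leading slash
    local prefix=""
    while [ -n "$rest" ]; do
        prefix="$prefix/${rest%%/*}"   # append the next path component
        printf '%s\n' "$prefix"
        case "$rest" in
            */*) rest="${rest#*/}" ;;  # more components remain
            *)   rest="" ;;            # that was the last one
        esac
    done
}
```

Each printed path can then be tried in turn as the rsync source, with the destination adjusted as in the steps above. Wrapping each attempt in coreutils `timeout` (which exits with status 124 when the time limit is hit) would turn a hang into a detectable failure instead of requiring Ctrl+C. Note that, as step 5 shows, a successful single-file sync clears the stuck state, so the shallow paths should be tried first.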

@Wishmesh
Author

@Wishmesh Wishmesh commented May 24, 2017

And if it is a filesystem issue, then why does syncing files one by one work?

@sunjoong

@sunjoong sunjoong commented May 24, 2017

@Wishmesh

And if it is a filesystem issue, then why does syncing files one by one work?

Good question; you know, people say "good question" when they don't quite know the answer.

I cannot be sure it's a filesystem issue; I'm just guessing. If it is a filesystem issue, it could be timestamp related. Of course, it might not be timestamp related, it might not even be a filesystem issue, and it might be an issue I cannot imagine. But I know there is a timestamp issue in DrvFS itself that causes strange behavior, and I found another timestamp issue with rsync (though it might not fit your case): https://askubuntu.com/questions/112863/rsync-not-working-between-ntfs-fat-and-ext .

@therealkenc
Collaborator

@therealkenc therealkenc commented May 24, 2017

About syncing to /home ... just tried the same folder and of course rsync works.

Sorry for my poor explanation. When I said "you might want to push to /home or /var on the server", I meant as operating procedure moving forward. Something is getting borked on the server side. We don't know what, because no server side strace. But I think you'll have a lower probability of borkage on the server if you push to VolFS, because smart money says this is a DrvFS "filesystem thing". [You'll keep using DrvFS on the client of course, because the whole point is having access in Visual Studio. There's no problem we can see on the client.]

Do you have other ideas to try, before I issue rsync with "deeper" path?

Once you push to any different folder without corruption you're golden. Go deeper, sideways, to a whole different server, whatever.

Hold on to the corrupt folder in its pristine corrupt state if you have the space. The devs are going to need diagnostics of some kind on that tree, because no one will be able to repro this locally. You've got a real bug. But note that the chances of anyone reading this far down the thread are rapidly decreasing.

@sunilmut
Member

@sunilmut sunilmut commented Jun 12, 2017

Adding @SvenGroot to see if he can help out with this one. Also, marking this one as a bug.

@sunilmut sunilmut added the bug label Jun 12, 2017
@SvenGroot
Member

@SvenGroot SvenGroot commented Jun 12, 2017

@Wishmesh Could you follow the steps from section 8 at https://github.com/Microsoft/BashOnWindows/blob/master/CONTRIBUTING.md, and provide us with the log files? That may provide some insight into what is happening.

@therealkenc
Collaborator

@therealkenc therealkenc commented Jun 12, 2017

His problem is on the server-side of a WSL-to-WSL rsync push. There's gotta be a way to collect a server-side strace but it isn't immediately obvious to me. The client-side he supplied above. It hangs on:

write(4, "\350\320\0\7\0\0&\30\0\0\0\0\0\0\0\0\0ATL::CSimpleStr"..., 53484

...which is not of much use, because it is not failing, just blocking. The problem isn't network-related because the problem persists across a reboot and only once the DrvFS filesystem on the server-side has gotten into a corrupt state.

@SvenGroot
Member

@SvenGroot SvenGroot commented Jun 12, 2017

Since the server is also WSL, logs of both sides would be helpful. Strace would be helpful but I can see why that would be difficult in this scenario. However, ETL logs should be easy if you have access to the server.

Thanks,
Sven

@Wishmesh
Author

@Wishmesh Wishmesh commented Jun 13, 2017

Thanks for looking into this!

  1. I tried to get the server-side strace with the following command:
    sudo strace -ff -o mylog /usr/sbin/sshd
    I got two files - mylog.41 and mylog.42; however, unlike in the client strace logs, I do not see the relevant lines.

If you have an idea how to strace rsync on the server, let me know.

  1. I got some log files using logman command. Particularly the lxcore_kernel.etl seems to contain something useful, but I am unable to read it using EventViewer... missing some decoder perhaps.
    As it seems that it contains private info, I do not feel comfortable to share it here. Maybe I can share it to some email with @ microsoft.com at the end?
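
On the question of how to strace rsync on the server: rsync's `--rsync-path` option sets the command that is run on the remote machine, and the rsync manual notes that it is run through a shell, so it can be used to wrap the remote rsync in strace. The sketch below only builds that option value. `remote_trace_cmd` and the `/tmp/rsync-server.strace` log prefix are assumptions for illustration, and strace must be installed on the server.

```shell
#!/usr/bin/env bash
# remote_trace_cmd (hypothetical helper): print the command string to pass
# to rsync's --rsync-path option so that the server-side rsync runs under
# strace. $1 is the log file prefix on the server (an assumed default is used
# if it is omitted); strace -ff appends ".<pid>" for each traced process.
remote_trace_cmd() {
    local log_prefix="${1:-/tmp/rsync-server.strace}"
    printf 'strace -ff -o %s rsync' "$log_prefix"
}
```

It would be used as `rsync --archive --progress --delete --rsync-path="$(remote_trace_cmd)" SRC DEST`, after which the per-process trace files should appear on the server under the chosen prefix. Because strace writes to the files given by `-o`, it should not disturb the stdin/stdout stream that rsync uses to talk to the client. This has not been tested against the setup in this thread.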
@jlahijani

@jlahijani jlahijani commented Aug 31, 2017

I'm experiencing this issue as well. It occurs on this image.
[image: test]

@therealkenc
Collaborator

@therealkenc therealkenc commented Nov 21, 2018

Also, if the SIGCHLD and/or window resize workaround don't work for you, this sounds like it's probably a separate problem, and you should open a new issue for it.

I've been experiencing a hang with apt on 18282 that rights itself with a Windows Console resize. There was a suggestion above this particular manifestation was maybe addressed via #3100 but it is looking like not. I can't catch it in the act because it isn't like I strace every apt update, and worse, apt can call anything so "who knows" the triggering condition. Maybe I'll try to attach gdb if it happens again (effort).

Just adding the data point that this specific behavior (with resize workaround) doesn't appear to be resolved. I suspect a bunch of people are hitting these hangs in various scenarios (not limited to rsync or apt), but like me, aren't inclined to open an issue because no repro. We've also got stuff like #2721 on the books, where it is impossible to differentiate severe performance problems (which get chalked up to the usual suspect) but might be (and probably is) actually a hang.

@Gatlingod

@Gatlingod Gatlingod commented Jan 13, 2019

are you folks on 1809 (aka "RS5") or an earlier version? I'm pretty sure this was fixed as #3100 -- I haven't seen it since upgrading to 1809.

I have had this issue since today. I am on 1809, and last month rsync worked great.
But today I can only use it when I run

while killall -CHLD ssh; do sleep 0.1; done

That is really strange?!

@Gaibhne

@Gaibhne Gaibhne commented Feb 26, 2019

I have the same problem, and indeed, a console resize causes it to continue.

@JanC89

@JanC89 JanC89 commented Mar 5, 2019

I can confirm I have the same problem. Sending SIGWINCH causes rsync to continue for a bit, until it stalls again. Resending SIGWINCH resumes the sync again for a while.

Note that after running rsync, two processes are spawned on my machine.

janc@pommes:~$ ps aux |grep rsync
janc     14123  1.8  0.0  17228  5120 tty2     S    08:46   0:03 rsync -Pav --verbose -W --delete-after --exclude=.git --exclude=public-src --exclude=node_modules /home/janc/development/awesome-project/ www@dummy-server.com:/mnt/www/janc
janc     14124  0.3  0.0  23472  8720 tty2     S    08:46   0:00 ssh -l www dummy-server.com rsync --server -vvlWogDtpre.iLsfxC --delete-after --partial . /mnt/www/jan

It's the ssh process that I need to send SIGWINCH to. Running the command below in a separate console window resumes the file transfer.
kill -28 14124
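
The same idea can be sketched as a small shell function that looks up the ssh child of a given rsync PID instead of copying the PID by hand. This is only a sketch under assumptions: the function name `nudge_children` is made up for illustration, it relies on `pgrep` from procps being installed, and it assumes ssh is a direct child of the rsync process, as in the `ps` output above.

```shell
# Hypothetical helper (not part of rsync or WSL): send a signal to the
# children of a given parent PID that have a given process name.
# Usage: nudge_children <parent_pid> [process_name] [signal]
nudge_children() {
    local parent_pid="$1"
    local name="${2:-ssh}"      # process name to look for, ssh by default
    local sig="${3:-WINCH}"     # SIGWINCH is signal 28 on x86 Linux
    local pids pid

    # pgrep -P restricts the match to children of parent_pid,
    # -x requires an exact process-name match.
    pids=$(pgrep -P "$parent_pid" -x "$name") || return 1

    for pid in $pids; do
        kill -s "$sig" "$pid" && echo "sent SIG$sig to $name (pid $pid)"
    done
}
```

With the PIDs from the listing above, `nudge_children 14123` would signal only 14124 and leave any unrelated ssh sessions alone.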

I'm not sure exactly when the problem started, but I've recently installed the October 2018 Windows (1809) update to get better Copy/Paste functionality.

@iamfil iamfil commented Mar 24, 2019

Can confirm that resizing the PowerShell window or sending SIGWINCH signals works for me as well, both on my legacy 14.04 installation and a brand-new 18.04 one.

@0xabu 's workaround might be a bit clunky but gets the job done...

while killall -CHLD ssh; do sleep 0.1; done

I think I'll be using rclone for now.

Thank you for this! I was able to use this as a workaround to copy a massive amount of data.

@nvsystems nvsystems commented Jul 15, 2019

Adding -W to the rsync command line had fixed this problem for my rsync-based backups for quite some time, but it recently got stuck again. I updated Windows from 1803 to 1903 and WSL from 16.04 to 18.04, but nothing helped. So this definitely has not been fixed.

Running a "while killall -CHLD ssh; do sleep 0.1; done" in the background made the sync work again.
It is really disappointing that after more than two years Microsoft has shown no interest in fixing really major bugs affecting basic Linux tools like rsync.

@scy scy commented Jul 15, 2019

I can imagine that this issue is no longer present in the upcoming WSL 2, and maybe that's the reason that Microsoft isn't investing time in debugging/fixing it.

Can somebody check whether the issue is still present in WSL 2? It should be available in Windows Insiders by now.

@Saltallica Saltallica commented Dec 25, 2019

Holy shit, it's Christmas and they still haven't fixed this.

phoerious added a commit to phoerious/rs-backup-suite that referenced this issue Jan 4, 2020

@phoerious phoerious commented Jan 4, 2020

@scy I am off Insiders now, but the SIGCHLD workaround was still needed last time I checked. That was a week ago or so before I reinstalled Windows to get a somewhat stable OS again.

@devyte devyte commented Feb 12, 2020

Same boat, Windows 10 Home build 18363, rsync from a WSL bash to a remote Debian machine. After some random time, rsync hangs, and resizing the bash window makes it continue.

squeaksvolvo added a commit to squeaksvolvo/Autohotkey that referenced this issue Feb 28, 2020
Refer to Windows Subsystem for Linux bug microsoft/WSL#2138
Bug: After initiating the resizing script on the desired window, if another WSL bash window is opened whose title begins the same as the intended window's title, upon bringing the other window into focus, the undesired window will be resized and the desired one is no longer resized.

@squeaksvolvo squeaksvolvo commented Feb 28, 2020

Autohotkey script to resize the window with bash command prompt where rsync is running under WSL.

https://github.com/squeaksvolvo/Autohotkey/blob/master/windowResizeJiggle.ahk

@mumbleskates mumbleskates commented Mar 1, 2020

@squeaksvolvo: @0xabu's killall solution is immensely cleaner than this.

@aressler38 aressler38 commented Mar 11, 2020

This is still an issue in Microsoft Windows 10 Enterprise 10.0.17763 Build 17763. I am using rsync in the VS Code window, and it gets stuck. Resizing the window seems to unstick it for a little while.

@eroller eroller commented Jun 22, 2020

This just started happening for me today. I'm not sure why now, maybe a delayed Windows Update from my domain admins? This is totally unacceptable.

Windows Version 10.0.17763 Build 17763
WSL VERSION="18.04.2 LTS (Bionic Beaver)"

@xianwenchen xianwenchen commented Jul 20, 2020

I have the same problem when rsyncing from /home/ to a remote computer.

In a WSL2 terminal, if I press Ctrl+Z and then run fg, I can resume rsync to transfer the next file. However, I have to repeat Ctrl+Z and fg many times to get through a sync.

@MichaelHipp MichaelHipp commented Oct 9, 2020

I have the same problem.

Why on earth has this not been fixed???

@acmuller acmuller commented Oct 10, 2020

It's amazing that nothing has been done about this after this amount of time. I was hoping that WSL2 would fix it, but WSL2 doesn't even support SSH to a remote server. For the time being, I'm sticking with my VMWare-based Linuxes.

@123benni 123benni commented Oct 14, 2020

This is a really annoying issue. I'm experiencing this with WSL Debian on Windows 10 Enterprise. The workaround of resizing the WSL window works perfectly well. However, it is really annoying.

Some hints for reproducing the issue:
It is actually quite easy to generate files that cannot be rsynced on WSL. I use MATLAB to programmatically generate a huge amount of images. Every now and then I rerun my scripts, so the already existing images are overwritten with new content. Every time this happens, the overwritten file can no longer be rsynced on WSL without hanging. So in order to run rsync (through 100-1000 files, which all hang) I have to keep resizing the WSL window for a mere 10 minutes.

I think it has to do with the timestamps of the files: whenever a file is overwritten, the "created date" stays the same while the "updated date" changes. I'm quite sure that this is at the root of the issue.

So please some developer go ahead and investigate this issue. Thank you.

In the meantime: if you resize vertically you get more window updates than if you resize horizontally. So you get a higher file throughput rate. 😲 😬 😆

@kiu kiu commented Oct 14, 2020

Using the workaround mentioned in this bug report, I am using this to run my regular backup. Works for me.

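#!/bin/bash
# Exclude Windows cache directories and locked registry files.
IGN_WIN='--exclude Cache --exclude .cache --exclude cache --exclude cache2 --exclude NTUSER.DAT --exclude ntuser.dat.LOG1 --exclude ntuser.dat.LOG2'

# Start the backup in the background ($IGN_WIN is left unquoted on purpose so it splits into separate options).
/usr/bin/rsync -vxrlHpEogDth -W --delete-after --delete-excluded --ignore-errors $IGN_WIN  -e 'ssh -i /root/.ssh/id_rsa.winbackup' /mnt/c/Users/ winbackup@xxx:/backup 1> /root/winbackup.1.log 2> /root/winbackup.2.log &
pid=$!

# While rsync is still running, keep sending SIGCHLD to ssh to unblock stalled transfers.
while /bin/kill -0 "$pid" 2>/dev/null; do
    /usr/bin/killall -CHLD ssh
    /bin/sleep 0.1
done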
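
The script above can be generalized into a reusable wrapper. The following is a minimal sketch, not a tested fix: `run_with_nudge` is a name invented here, it assumes bash, and it uses the same `killall -CHLD` loop as above. The one addition is that it returns the exit status of the wrapped command, which the script above discards.

```shell
# Hypothetical wrapper: run a command in the background and keep sending
# SIGCHLD to every process named <target> until the command exits.
# Usage: run_with_nudge <target> <command> [args...]
run_with_nudge() {
    local target="$1"
    shift

    "$@" &                    # start the wrapped command in the background
    local pid=$!

    # kill -0 only tests whether the PID still exists; it sends no signal.
    while kill -0 "$pid" 2>/dev/null; do
        # Ignore failures, e.g. when no process named <target> is running yet.
        killall -CHLD "$target" 2>/dev/null || true
        sleep 0.1
    done

    wait "$pid"               # propagate the wrapped command's exit status
}
```

For example, `run_with_nudge ssh rsync -rl --delete /mnt/c/Users/ winbackup@xxx:/backup` (same placeholder host as above) would run the transfer, and `$?` afterwards would hold rsync's own exit code.
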
Collaborator

@therealkenc therealkenc commented Oct 14, 2020

There is definitely (scare quote) "something" going on here; I have seen the rsync hang as well on WSL1, but not recently on WSL2 (for some value of recent, call it ~a year). That doesn't mean it isn't still a problem, natch. There has been no authoritative explanation of why the SIGCHLD work-around works, that I know of, anyway. Doing a ctrl-z (SIGTSTP) + fg or resizing the terminal (SIGWINCH) also unblocks. Signals are "totally different" on WSL2 though, so the fact the problem appears to persist on WSL2 (at least for some) makes the problem all the more curious.

If someone can catch rsync hanging on WSL2, maybe post a screencap after a ctrl-c with a cat /proc/version as a bump. [That the problem persists on WSL1 isn't in question, or the issue would be closed already.]
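
To make such a report easier to read, the kernel string from `/proc/version` can be classified with a small helper. This is a heuristic sketch only: `wsl_flavor` is an invented name, and the patterns assume that WSL2 kernels carry `microsoft-standard` in their version string while WSL1 reports a capitalized `Microsoft` suffix. Neither pattern is a documented interface, so treat the result as a hint.

```shell
# Hypothetical helper: guess the WSL generation from a kernel version string.
# With no argument it reads /proc/version of the running system.
wsl_flavor() {
    local version
    version="${1:-$(cat /proc/version 2>/dev/null)}"
    case "$version" in
        *microsoft-standard*) echo "WSL2" ;;     # e.g. 4.19.104-microsoft-standard
        *Microsoft*)          echo "WSL1" ;;     # assumed WSL1 naming, e.g. 4.4.0-17763-Microsoft
        *)                    echo "unknown" ;;  # plain Linux or an unrecognized kernel
    esac
}
```

Running `wsl_flavor` right after a hang and posting the result next to the `cat /proc/version` output would show at a glance which subsystem was involved.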

@MichaelHipp MichaelHipp commented Oct 14, 2020

FWIW, I have 3 systems running WSL2 on Win10 x64. Only 1 of the 3 seems to experience the hanging problem. But maybe it is specific to some particular file attribute or content, so the other 2 may trip over it at some point.

@cwallraven cwallraven commented Oct 17, 2020

For me it was working perfectly with the -W [whole-file transfer] workaround for quite some time, and with one of the recent Windows updates it came back on WSL. For me, it happens pretty predictably when multiple small files need to be transferred [small 4 KB images, for example].

@jwinterm jwinterm commented Dec 5, 2020

Same issue for me: random hangs when running rsync on the Ubuntu Windows Subsystem for Linux to a remote Debian machine. I just verified that if I run rsync inside of msys, it works as expected. Running:
Microsoft Windows [Version 10.0.19041.630]

@lucent-sea lucent-sea commented Jan 28, 2021

I'm having the same issue. I'm using rsync to deploy ASP.NET Core apps, from WSL Ubuntu 20.04 on Windows 10 to an Ubuntu 20.04 VPS.

This has always been an issue for me, starting with 18.04 on both ends.

If I deploy with the same command using an Ubuntu VM in Azure DevOps or GitHub Actions, it works fine.

rsync -r -v /mnt/d/Publish/ root@example.com:/var/www/example-app/

As others have mentioned, if I use the -W parameter, there's no issue. But it takes longer and uses more bandwidth, of course.

@Nigel-Caughey Nigel-Caughey commented Mar 1, 2021

This has been affecting me as well transferring to WSL2 ( Ubuntu 20.04 -- Linux version 4.19.104-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Wed Feb 19 06:37:35 UTC 2020 ) from OSX

io timeout after 3 seconds -- exiting:03
rsync error: timeout in data send/receive (code 30) at /AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/rsync/rsync-54.120.1/rsync/io.c(164) [sender=2.6.9]

Adding -W did solve the issue, thanks!

@usovalx usovalx commented Mar 3, 2021

As a confirmation, I just ran into this issue on a fresh W10 install, where rsync of large files (win->lin) would hang.
I spent a few hours trying to debug it myself, until I noticed that my WSL instance was created as WSL1 instead of WSL2.
Things seem to work fine in WSL2.

As a possible reason why signals help: strace-ing some of the hung rsync instances showed that they were stuck in "select" calls.
Assuming there is a race condition somewhere in the select & pipe combo, where a write to the pipe doesn't wake up the select on the other side, this would explain why resizing the terminal (or sending SIGCHLD) helps: the signal interrupts the select, and the subsequent call into it notices the data in the pipe and resumes copying.
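
The hypothesized mechanism (a blocked call that only re-checks its input after a signal interrupts it) can be illustrated with a loose shell analogy. This is not rsync's or WSL's code, and it does not reproduce the bug: bash's `wait` builtin stands in for the blocked `select`, a marker file stands in for data sitting unnoticed in the pipe, and the function name `demo_interrupted_wait` is invented. It assumes bash, because it uses `$BASHPID`.

```shell
# Analogy only (assumes bash): a trapped signal makes `wait` return early,
# much like a signal makes a blocked select() return so the caller can
# look at its input again.
demo_interrupted_wait() {
    local self="$BASHPID" marker blocked status
    marker=$(mktemp)
    rm -f "$marker"                 # the marker must not exist yet

    trap ':' USR1                   # a trapped signal interrupts `wait`

    sleep 30 >/dev/null &           # stands in for the call that stays blocked
    blocked=$!

    # "Data arrives" without waking the waiter; one second later a signal is sent.
    ( : > "$marker"; sleep 1; kill -s USR1 "$self" ) >/dev/null 2>&1 &

    status=0
    wait "$blocked" || status=$?    # status > 128 means a signal cut the wait short

    kill "$blocked" 2>/dev/null     # clean up the stand-in process
    trap - USR1

    if [ "$status" -gt 128 ] && [ -e "$marker" ]; then
        echo "woken by signal, pending data found"
    else
        echo "not interrupted"
    fi
    rm -f "$marker"
}
```

Calling `demo_interrupted_wait` should report the first message after roughly one second rather than waiting out the full 30 seconds, which mirrors how a window resize or SIGCHLD gets a stalled transfer moving again under this theory.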
