This repository has been archived by the owner. It is now read-only.

siad scanning / rewriting entire siahostdata.dat; requests to siad hang #2594

Closed
hackneyb opened this Issue Dec 31, 2017 · 12 comments

hackneyb commented Dec 31, 2017

Bug Report

Heavy I/O on the siahostdata.dat file in the hosting folder; siahostdata.dat is larger than 4 TB.
The I/O has written 1.2 TB into this file in 5 hours.
While this I/O is in progress, other requests to siad hang and the host appears offline, although it is not.
What hangs:
siac host -v
external queries to the host

The drive is an 8 TB external Seagate on USB 3.0, mounted as drive D:.
siad reads and writes the file simultaneously at about 25 MB/s, with peaks of 30-40 MB/s.
Overall drive performance - writes can reach 50-80 MB/s.

Stack Trace or error message
These errors appear in the siad console after 1 hour of this activity:

goroutine 674092 [running]:
runtime/debug.Stack(0x446d5e, 0xc04e7c5eb8, 0xc04e7c5e80)
/usr/local/go/src/runtime/debug/stack.go:24 +0xae
runtime/debug.PrintStack()
/usr/local/go/src/runtime/debug/stack.go:16 +0x29
github.com/NebulousLabs/Sia/build.Severe(0xc04e7c5f18, 0x2, 0x2)
/home/luke/go/src/github.com/NebulousLabs/Sia/build/critical.go:32 +0x8c
github.com/NebulousLabs/Sia/api.cleanCloseHandler.func1.2(0xc052a21c20, 0xc054329a00, 0xc0588a4600)
/home/luke/go/src/github.com/NebulousLabs/Sia/api/routes.go:159 +0x18d
created by github.com/NebulousLabs/Sia/api.cleanCloseHandler.func1
/home/luke/go/src/github.com/NebulousLabs/Sia/api/routes.go:154 +0x1e3
Severe error: api call is taking more than 60 minutes to return: /consensus
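
For context on where that message comes from: the stack shows a goroutine spawned by api.cleanCloseHandler that watches how long an API handler takes and calls build.Severe once it exceeds an hour. Below is a minimal sketch of that kind of slow-call watchdog; the names and details are illustrative only, not Sia's actual routes.go code:

package main

import (
    "log"
    "net/http"
    "time"
)

// slowCallWarning wraps an http.Handler and logs a severe warning if the
// wrapped call has not returned after `timeout`. This is a simplified
// sketch of the watchdog pattern suggested by the stack trace above.
func slowCallWarning(h http.Handler, timeout time.Duration) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        done := make(chan struct{})
        go func() {
            select {
            case <-done:
                // Handler returned in time; nothing to report.
            case <-time.After(timeout):
                log.Printf("Severe error: api call is taking more than %v to return: %s", timeout, r.URL.Path)
            }
        }()
        h.ServeHTTP(w, r) // blocks here for as long as siad is stuck on disk I/O
        close(done)
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/consensus", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(2 * time.Second) // stand-in for a call stuck behind the rewrite
        w.Write([]byte("ok"))
    })
    log.Fatal(http.ListenAndServe("localhost:9980", slowCallWarning(mux, time.Second)))
}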

Expected Behavior
siad should not rewrite (or scan?) the entire siahostdata.dat file when there are no ongoing upload, download, or file-creation operations; siad should not stop responding to requests.

Environment

  • Sia version: 1.3.1 (siad, siac, SIA-UI)
  • OS: Windows Server 2012 R2

Any useful logs

Disk IO shows repeated writes to the file in this pattern (this started at offset 0 and will presumably continue for another day or two if uninterrupted, until it reaches the file size of 4+ TB):
8:06:40.7567275 PM siad.exe 3668 IRP_MJ_WRITE D:\SIAStorage01\siahostdata.dat SUCCESS Offset: 1,239,435,706,368, Length: 31,457,280, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
8:06:41.5951523 PM siad.exe 3668 IRP_MJ_WRITE D:\SIAStorage01\siahostdata.dat SUCCESS Offset: 1,239,469,260,800, Length: 31,457,280, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
8:06:44.7835456 PM siad.exe 3668 IRP_MJ_WRITE D:\SIAStorage01\siahostdata.dat SUCCESS Offset: 1,239,502,815,232, Length: 31,457,280, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
8:06:45.5828451 PM siad.exe 3668 IRP_MJ_WRITE D:\SIAStorage01\siahostdata.dat SUCCESS Offset: 1,239,536,369,664, Length: 31,457,280, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
8:06:46.3650888 PM siad.exe 3668 IRP_MJ_WRITE D:\SIAStorage01\siahostdata.dat SUCCESS Offset: 1,239,569,924,096, Length: 31,457,280, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
8:06:47.2798532 PM siad.exe 3668 IRP_MJ_WRITE D:\SIAStorage01\siahostdata.dat SUCCESS Offset: 1,239,603,478,528, Length: 31,457,280, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
8:06:50.6805290 PM siad.exe 3668 IRP_MJ_WRITE D:\SIAStorage01\siahostdata.dat SUCCESS Offset: 1,239,637,032,960, Length: 31,457,280, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
8:06:51.5031529 PM siad.exe 3668 IRP_MJ_WRITE D:\SIAStorage01\siahostdata.dat SUCCESS Offset: 1,239,670,587,392, Length: 31,457,280, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
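
As a sanity check on the "another day or two" estimate: at the observed 25-40 MB/s, sequentially rewriting a roughly 4.4 TB file takes on the order of one to two days. A throwaway sketch of the arithmetic (the file size and rates are taken from this report):

package main

import (
    "fmt"
    "time"
)

func main() {
    // ~4.4 TB storage folder and the write rates observed in Process
    // Monitor; both numbers come from the report above.
    const fileSize = 4.4e12 // bytes
    for _, rate := range []float64{25e6, 30e6, 40e6} { // bytes/second
        secs := fileSize / rate
        fmt.Printf("at %2.0f MB/s: %v\n", rate/1e6, time.Duration(secs*float64(time.Second)))
    }
}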

tbenz9 (Collaborator) commented Dec 31, 2017

Duplicate of #2483
These are helpful logs, thanks for uploading them.

@tbenz9 tbenz9 closed this Dec 31, 2017

tbenz9 (Collaborator) commented Dec 31, 2017

I've changed my mind; I think it's a similar error to #2483 but possibly a different bug. Reopening and labelling it as a bug.

@tbenz9 tbenz9 reopened this Dec 31, 2017

@tbenz9 tbenz9 added the Bug label Dec 31, 2017

hackneyb commented Dec 31, 2017

Thanks, it looks like the symptom is the same - siad not responding to some things, and API calls that happen to be made fail and get logged. But there are probably 50 different ways to make siad stop responding.

DavidVorick (Member) commented Dec 31, 2017

Is this in the middle of an upgrade? Did you make some change which may cause it to look at all the data?

It seems like it's either moving data or upgrading data, and locking the whole system up as a byproduct.

If you just upgraded from an earlier version, I think this could happen.

hackneyb commented Dec 31, 2017

No, I've been running 1.3.1 for about 5-6 days. This seemed to coincide with getting my first 2 contracts, but no data is being uploaded - usage was 0.
Then suddenly a lot of chirping from the drive made me look at what was going on - so unexpected.

hackneyb commented Dec 31, 2017

The first time I Ctrl-C'd out of the siad console it hung, so I rebooted. It came back with everything OK, except for this:
2017/12/30 16:03:38.224283 writeaheadlog.go:285: WARN: WAL file detected, performing recovery after unclean shutdown.
Then it was OK for a while, then the IO restarted on its own.

hackneyb commented Dec 31, 2017

The IO finished by itself and all hung requests to siad went through. I do believe the entire operation was triggered by the first ever upload of data to this host instance. siad then responded properly to termination and is healthy after a restart.

In case the impact was worsened by the relative performance of the hardware, I've enabled drive write caching. This required a reboot; nothing further to report for now.

Logs

The Storage Used value updated from 0 to a positive value.

Storage Folders:
    Used       Capacity     % Used    Path
    4.19 MB    4.3999 TB    0.00      D:\SIAStorage01

host.log filled with about 250 lines similar to these:

2017/12/31 16:06:15.018778 errors.go:151: connection error: error with 104.175.232.198:59633: incoming RPCReviseContract failed: revision iteration failed: iteration signal failed to send: write tcp 192.<privIP>:9982->104.175.232.198:59633: i/o timeout
2017/12/31 16:06:18.514740 errors.go:151: connection error: error with 24.47.202.181:54857: incoming RPCSettings failed: failed WriteSignedObject during RPCSettings: write tcp 192.<privIP>:9982->24.47.202.181:54857: i/o timeout
2017/12/31 16:06:18.535741 errors.go:151: connection error: error with 94.224.211.150:51387: incoming RPCSettings failed: failed WriteSignedObject during RPCSettings: write tcp 

DavidVorick (Member) commented Jan 2, 2018

Hmm. Glad it's all working again. The 'performing recovery' message is normal after an unclean shutdown. Recovery should be complete; you won't be missing any data or anything.

Let me know if you see the issue again, but that's pretty strange.

Actually, how much data are you hosting on the network? Is it possible that you have a very large contract (greater than 100 GB) that you are hosting?

hackneyb commented Jan 3, 2018

Thanks. I do believe the NTFS file may first be created in a way that causes some sort of zero-out or thick-provisioning action, but only when data is first written to the file. I will have a chance to reproduce this in the future. I could have dumped siad.exe while it was in that state and looked at what threads existed, but I really wanted to see if it would complete by itself. This is not the only host I plan on running, so I will update if I ever encounter this again.

I cannot easily tell what contracts I have, but my locked collateral is about 2/3 of my per-TB-per-month price. Does that mean the contracts amount to 2/3 of a TB?
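
A note on the thick-provisioning hypothesis: NTFS tracks a "valid data length" per file, and when a file has only been extended (SetEndOfFile / Truncate) and is later written at a high offset, the filesystem must zero everything between the old valid data length and the write offset before that write completes. One conceivable mitigation, sketched below, is to flag the file as sparse with FSCTL_SET_SPARSE before extending it. This is only an illustration of the Windows API, not something siad is known to do, and the trade-off is that a sparse file no longer reserves its full capacity on disk, which a host may specifically want.

//go:build windows

package main

import (
    "os"

    "golang.org/x/sys/windows"
)

// FSCTL_SET_SPARSE from winioctl.h. Marking a file sparse means NTFS does
// not have to zero-fill the gap below the first write at a large offset.
const fsctlSetSparse = 0x000900C4

// markSparseAndExtend is a hypothetical helper, not part of Sia: it flags
// the file as sparse and then extends it to the target size.
func markSparseAndExtend(path string, size int64) error {
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0600)
    if err != nil {
        return err
    }
    defer f.Close()

    var returned uint32
    if err := windows.DeviceIoControl(
        windows.Handle(f.Fd()), fsctlSetSparse,
        nil, 0, nil, 0, &returned, nil,
    ); err != nil {
        return err
    }
    // Extending a sparse file allocates no clusters; unwritten regions read
    // back as zeros without any physical zero-fill.
    return f.Truncate(size)
}

func main() {
    // Example usage with the path from this report and a ~4.4 TB target size.
    if err := markSparseAndExtend(`D:\SIAStorage01\siahostdata.dat`, 4400000000000); err != nil {
        panic(err)
    }
}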

tbenz9 (Collaborator) commented Jan 5, 2018

@hackneyb we are tracking the feature request to get detailed contract info in #2465

I'm going to close this issue since your problem was resolved and I don't see any specific action items for the core devs. If you think there is still work that needs to be done please reopen the issue.

@tbenz9 tbenz9 closed this Jan 5, 2018

hackneyb commented Jan 13, 2018

My issue came back. This was after I moved the entire storage to another disk (same drive letter, copied all offline so siad/sia-ui didn't know it was moved).

The symptoms are similar to before:

siac hostdb -v output reflects siad being unresponsive while in this state:

Recent Scans
11111111111111111111111101101

host.log - 376 lines like this, all at the same time, different IPs:

2018/01/12 23:04:14.447751 errors.go:151: connection error: error with 93.83.187.18:38492: incoming RPCReviseContract failed: revision iteration failed: iteration signal failed to send: write tcp 192.168.x.y:9982->93.83.187.18:38492: i/o timeout
2018/01/12 23:04:14.464750 errors.go:151: connection error: error with 173.77.154.207:64012: incoming RPCSettings failed: failed WriteSignedObject during RPCSettings: write tcp 192.168.x.y:9982->173.77.154.207:64012: i/o timeout
[.. snip..]
2018/01/12 23:04:14.716296 errors.go:151: connection error: error with 104.15.124.37:55994: incoming RPCSettings failed: failed WriteSignedObject during RPCSettings: write tcp 192.168.x.y:9982->104.15.124.37:55994: i/o timeout
2018/01/12 23:04:14.716296 errors.go:151: connection error: error with 108.248.249.252:58305: incoming RPCSettings failed: failed WriteSignedObject during RPCSettings: write tcp 192.168.x.y:9982->108.248.249.252:58305: i/o timeout

siad.exe console:
timeout-on-consensus errors (can't quote them; I didn't get to capture the output)

contractmanager.log - probably red-herring errors; these were logged later:

2018/01/12 05:39:26.016837 writeaheadlogsync.go:50: SEVERE: ERROR: unable to atomically copy the contract manager settings: rename C:\Users\Administrator\AppData\Roaming\Sia-UI\sia\host\contractmanager\contractmanager.json_temp C:\Users\Administrator\AppData\Roaming\Sia-UI\sia\host\contractmanager\contractmanager.json: Access is denied.
2018/01/12 23:28:06.544894 writeaheadlogsync.go:50: SEVERE: ERROR: unable to atomically copy the contract manager settings: rename C:\Users\Administrator\AppData\Roaming\Sia-UI\sia\host\contractmanager\contractmanager.json_temp C:\Users\Administrator\AppData\Roaming\Sia-UI\sia\host\contractmanager\contractmanager.json: Access is denied.
2018/01/12 23:28:06.589784 writeaheadlogsync.go:110: SEVERE: ERROR: could not rename temporary write-ahead-log in contract manager: rename C:\Users\Administrator\AppData\Roaming\Sia-UI\sia\host\contractmanager\contractmanager.wal_temp C:\Users\Administrator\AppData\Roaming\Sia-UI\sia\host\contractmanager\contractmanager.wal: Access is denied.

Overall, I started running kernel tools in Windows and that froze the machine. On reboot things were normal initially, then the zeroing-out of the storage file came back. It is currently in its 3rd or 4th hour of writing 30 MB chunks of data to the file, from offset 2 TB to 3.2 TB in a 4 TB file. Commands using siad are hanging.

The Wallet Hosting screen reduced the active contract count by 14 after the reboot. Most active siad thread stack:

ntoskrnl.exe!KeSynchronizeExecution+0x2246
ntoskrnl.exe!KeWaitForMultipleObjects+0x135e
ntoskrnl.exe!KeWaitForMultipleObjects+0xdd9
ntoskrnl.exe!KeWaitForMutexObject+0x373
Ntfs.sys+0x14522
Ntfs.sys+0x16834
Ntfs.sys+0x19c5b
fltmgr.sys!FltIsCallbackDataDirty+0x2ee
fltmgr.sys+0x10c2
ntoskrnl.exe!IoSynchronousPageWrite+0x138
ntoskrnl.exe!CcZeroDataOnDisk+0x25d
ntoskrnl.exe!CcZeroDataOnDisk+0xcc
ntoskrnl.exe!CcZeroDataOnDisk+0x38
ntoskrnl.exe!CcZeroData+0x16a
Ntfs.sys+0x8c261
Ntfs.sys+0x16e99
Ntfs.sys+0x19c5b
fltmgr.sys+0x1101
ntoskrnl.exe!FsRtlInitializeOplock+0x32f
ntoskrnl.exe!NtWriteFile+0x694
ntoskrnl.exe!setjmpex+0x34a3
ntdll.dll!ZwWriteFile+0xa
KERNELBASE.dll!GetEnvironmentVariableW+0x699
siad.exe+0x5d86e
siad.exe+0x15519
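
The CcZeroData / CcZeroDataOnDisk frames in that stack are the NTFS cache manager zero-filling the gap between the file's valid data length and the write offset, which lines up with the long runs of sequential 30 MB writes seen in Process Monitor. For anyone who wants to reproduce the stall outside of Sia, a sketch like the following should show it on a plain (non-sparse) NTFS volume; the path and sizes here are made up for illustration:

package main

import (
    "log"
    "os"
)

func main() {
    // Hypothetical reproduction, not Sia code: extend a plain NTFS file with
    // Truncate (which moves EOF but not the valid data length), then write a
    // small block near the end. NTFS must zero everything between the old
    // valid data length and the write offset before WriteAt returns, so this
    // single write can stall for a very long time.
    const fileSize int64 = 100 << 30 // 100 GiB: expect tens of minutes of zero-fill

    f, err := os.Create(`D:\repro\bigfile.dat`) // made-up path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := f.Truncate(fileSize); err != nil { // fast: only moves EOF
        log.Fatal(err)
    }

    buf := make([]byte, 1<<20) // a single 1 MiB write near the end of the file
    if _, err := f.WriteAt(buf, fileSize-int64(len(buf))); err != nil {
        log.Fatal(err)
    }
}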

gchilders commented Feb 10, 2018

I'm having this issue now. I created a 7 TB NTFS file. When I tried to move data to it, siad locked up for over 24 hours, until the computer rebooted for updates. On the next write, it came back again.

Following up, I created a 50 GB file and moved data to it. Once it completed, I resized the file back to 7 TB. After a couple of days, however, the problem returned.

The 7 TB file is on an NTFS partition on a SATA-connected drive. During the writes, siad is unresponsive, so the time (days) spent writing counts as downtime. If the computer reboots in the process, it starts over from the beginning. Largely due to this, my relatively new host is at 86% uptime and dropping.
