This repository has been archived by the owner. It is now read-only.

Reduce file padding #1921

Open
nitronick600 opened this Issue Jun 12, 2017 · 4 comments

nitronick600 commented Jun 12, 2017

I don't know the history of why Sia had to implement 40MB padding, but reducing it would open up a lot more possibilities for application developers.

nitronick600 commented Jun 12, 2017

I just found the Trello board and it appears that's being scheduled soon. Thanks!

Member

lukechampine commented Jun 12, 2017

In case you're curious, here's the technical breakdown:

Sia operates on 4MB "sectors"; the minimum you can upload to a host is 4MB. The 40MB padding comes from the fact that we upload redundantly across many hosts. So even if you're uploading a 100KB file, it will be padded to 4MB and the padded version will be uploaded to many hosts. This also affects downloading: when you download the 100KB file, you have to download the full 4MB sector.
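
To make the arithmetic concrete, here is a minimal sketch of the simplified model described above. It is not Sia's actual erasure-coding math: `paddedSize` and `totalStored` are hypothetical helpers, the sector is taken as 4 * 2^20 bytes, and the figure of 10 hosts is an assumption inferred from 40MB / 4MB.

```go
package main

import "fmt"

// sectorSize is the 4MB sector described above (taken here as 4 * 2^20 bytes).
const sectorSize = 4 << 20

// paddedSize rounds size up to a whole number of sectors.
// A zero-byte file is assumed to still occupy one sector.
func paddedSize(size uint64) uint64 {
	if size == 0 {
		return sectorSize
	}
	sectors := (size + sectorSize - 1) / sectorSize
	return sectors * sectorSize
}

// totalStored is the space consumed across all hosts in a simplified model
// where each of `pieces` hosts stores one padded copy. pieces is an assumption.
func totalStored(size, pieces uint64) uint64 {
	return paddedSize(size) * pieces
}

func main() {
	const fileSize = 100 * 1000 // the 100KB file from the example
	const pieces = 10           // assumed: 40MB / 4MB

	fmt.Println("bytes per host:", paddedSize(fileSize))
	fmt.Println("bytes in total:", totalStored(fileSize, pieces))
}
```

Under these assumptions the 100KB file occupies one full sector on each host, which is where almost all of the overhead comes from.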

There are a few ways to fix this. One is to "pack" files together during upload. For example, if you're uploading a whole folder of photos, they could all be packed into a single 4MB sector before being sent off to the host. The obvious downside of this approach is that you need to have all the files grouped for upload in advance, but it has the advantage of being possible today without any modifications to the host or the upload protocol.
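
A rough sketch of the packing idea, done entirely on the client before upload. The `entry` type and `packSector` function are hypothetical and not part of Sia; how the index would be stored (for example in the .sia file) is left open.

```go
package main

import (
	"errors"
	"fmt"
)

const sectorSize = 4 << 20 // 4MB sector, taken as 4 * 2^20 bytes

// entry records where one file lives inside the packed sector.
type entry struct {
	Name   string
	Offset int
	Length int
}

// packSector concatenates files into one zero-padded sector and returns
// an index for locating them. names and files must be the same length.
func packSector(names []string, files [][]byte) ([]byte, []entry, error) {
	if len(names) != len(files) {
		return nil, nil, errors.New("names and files differ in length")
	}
	sector := make([]byte, sectorSize)
	index := make([]entry, 0, len(files))
	offset := 0
	for i, f := range files {
		if offset+len(f) > sectorSize {
			return nil, nil, errors.New("files do not fit in one sector")
		}
		copy(sector[offset:], f)
		index = append(index, entry{names[i], offset, len(f)})
		offset += len(f)
	}
	return sector, index, nil
}

func main() {
	sector, index, err := packSector(
		[]string{"a.jpg", "b.jpg"},
		[][]byte{[]byte("first photo"), []byte("second photo")},
	)
	if err != nil {
		panic(err)
	}
	// Each file is recovered by slicing the sector with its index entry.
	for _, e := range index {
		fmt.Printf("%s: %q\n", e.Name, sector[e.Offset:e.Offset+e.Length])
	}
}
```

The sketch shows the stated downside directly: every file has to be in hand when `packSector` is called, because the sector is built once and uploaded as a unit.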

Another approach is to allow the uploader to modify the 4MB sector by sending additional data. Then you could store multiple files in the same sector via a series of modifications. This is actually specified in our protocol, but it isn't currently used. The downside of this approach is that it's taxing for the host: since the Merkle root of the sector changes, they may have to shuffle things around in their database. If this is coded poorly, it could be a DoS vector. @DavidVorick can speak to this aspect better than I can.
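
The following sketch illustrates why any in-place modification changes the sector's Merkle root. It is an illustration only: SHA-256 from the Go standard library and a 64-byte segment size are assumptions made for this example, and `merkleRoot` is a hypothetical helper, not Sia's implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	sectorSize  = 4 << 20 // 4MB sector, taken as 4 * 2^20 bytes
	segmentSize = 64      // assumed leaf size for this sketch
)

// merkleRoot hashes a sector as a binary tree over fixed-size segments.
// len(sector) / segmentSize must be a power of two.
func merkleRoot(sector []byte) [32]byte {
	n := len(sector) / segmentSize
	level := make([][32]byte, n)
	for i := range level {
		// Prefix 0x00 marks a leaf.
		leaf := append([]byte{0}, sector[i*segmentSize:(i+1)*segmentSize]...)
		level[i] = sha256.Sum256(leaf)
	}
	for len(level) > 1 {
		next := make([][32]byte, len(level)/2)
		for i := range next {
			// Prefix 0x01 marks an interior node.
			node := append([]byte{1}, level[2*i][:]...)
			node = append(node, level[2*i+1][:]...)
			next[i] = sha256.Sum256(node)
		}
		level = next
	}
	return level[0]
}

func main() {
	sector := make([]byte, sectorSize)
	copy(sector, "first file")
	before := merkleRoot(sector)

	// Writing a second file into unused space changes the root.
	copy(sector[1024:], "second file")
	after := merkleRoot(sector)

	fmt.Printf("%x\n%x\n", before[:8], after[:8])
	fmt.Println("root changed:", before != after)
}
```

If the host indexes sectors by root, as the comment above implies, each such modification means the stored sector has to be re-keyed under the new root.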

A final consideration here is that storing "partial sector" files will require adding checksums to the download code. Currently, since we always download a full sector, the Merkle root of the sector can double as a checksum. But if we start downloading less than one sector, we need another way to verify the integrity of the data. (Note that we can't simply use a single checksum for the entire file, because if the checksum failed, we wouldn't know which host was at fault.) So we need to store the checksums in the new .sia file format.

Bonus consideration: downloading partial sectors leaks metadata to the host about what you're storing. For example, there aren't many files with exactly 1,193,254,020 bytes in them. So for better privacy, you'd want to download a little more than you need, and strip off the extra afterward.
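
One way to picture the "download a little more, then strip" idea is to round the requested byte range out to a coarse granularity. This is a sketch under assumptions: `paddedRange` and `strip` are hypothetical helpers, and the 64KiB granularity is an arbitrary choice for illustration, not a Sia constant.

```go
package main

import "fmt"

// paddedRange widens [offset, offset+length) to multiples of granularity,
// clamped to the sector. It returns the range to request plus the number of
// leading bytes to drop from the response.
func paddedRange(offset, length, granularity, sectorSize int) (reqOff, reqLen, skip int) {
	reqOff = offset / granularity * granularity
	end := (offset + length + granularity - 1) / granularity * granularity
	if end > sectorSize {
		end = sectorSize
	}
	return reqOff, end - reqOff, offset - reqOff
}

// strip recovers the wanted bytes from the over-sized response.
func strip(resp []byte, skip, length int) []byte {
	return resp[skip : skip+length]
}

func main() {
	const sectorSize = 4 << 20
	const granularity = 64 << 10 // assumed padding unit, not a Sia constant

	sector := make([]byte, sectorSize)
	copy(sector[100000:], "the wanted bytes")

	off, n, skip := paddedRange(100000, 16, granularity, sectorSize)
	resp := sector[off : off+n] // stands in for the host's response
	fmt.Println(off, n, string(strip(resp, skip, 16)))
}
```

With this scheme the host only ever sees offsets and lengths that are multiples of the granularity, so the exact file size is hidden at the cost of some extra bandwidth.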

Member

lukechampine commented Jul 25, 2017

> A final consideration here is that storing "partial sector" files will require adding checksums to the download code. Currently, since we always download a full sector, the Merkle root of the sector can double as a checksum. But if we start downloading less than one sector, we need another way to verify the integrity of the data. (Note that we can't simply use a single checksum for the entire file, because if the checksum failed, we wouldn't know which host was at fault.) So we need to store the checksums in the new .sia file format.

This is actually incorrect: we don't need to store extra checksums. We're using AEAD to encrypt the sector data, so we get authentication "for free" -- the checksum of the data is stored on the host, prepended to the data itself.
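
For readers unfamiliar with AEAD, here is a minimal sketch of how authenticated encryption detects corrupted data without a separate checksum. AES-GCM from the Go standard library is used as a stand-in; the cipher Sia actually uses and its on-host layout are not specified in this thread, and `seal` and `open` are hypothetical helpers. Note that in this sketch the nonce is prepended and the authentication tag is appended, which is how Go's `Seal` lays out its output.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newAEAD builds an AES-GCM instance from a 16-, 24-, or 32-byte key.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and returns nonce || ciphertext || tag.
func seal(key, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends ciphertext and tag to its first argument (the nonce).
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// open verifies the tag and decrypts; it fails if any byte was altered.
func open(key, sealed []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < aead.NonceSize() {
		return nil, errors.New("sealed data too short")
	}
	nonce, ct := sealed[:aead.NonceSize()], sealed[aead.NonceSize():]
	return aead.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // AES-256 key; random in this sketch
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("sector data"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, sealed)
	fmt.Printf("%q %v\n", plain, err)

	// Simulate a host returning corrupted data.
	sealed[len(sealed)-1] ^= 0x01
	_, err = open(key, sealed)
	fmt.Println("tampered:", err)
}
```

Because each host's data is sealed separately, a failed `open` points at the specific host that returned bad data, which addresses the "which host was at fault" concern in the quoted paragraph.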

calchulus commented Jul 30, 2018

What's the current status on this?
