This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Implement SIP9 #521

Merged
merged 100 commits into from
May 3, 2018

Conversation

braydonf
Contributor

@braydonf braydonf commented Oct 19, 2017

Features

  • Ability to query for missing shards, for the purpose of running shard recovery for storage held over longer periods of time
  • Ability to track farmer bandwidth and storage to incentivize reliability over longer periods of time, with improved data for farmer payouts
  • Builds farmer reputation based on successful transfers
  • Ability to more accurately track user bandwidth usage and storage for billing

Implements: https://github.com/Storj/sips/blob/master/sip-0009.md

Specific updates

  • Adds storage events for each mirror to a new farmer
  • Updates current storage events for uploads to include each shard and associate it with a farmer
  • Updates download storage events to be associated with a farmer
  • Marks an upload storage event as ended when the farmer goes offline
  • Updates the exchange reports route to update storage events on success/failure
  • Updates the storage event model
  • Marks an upload storage event as ended when the file is deleted

Todo:

  • Integration testing with storjshare and libstorj exchange report updates (Implementation of SIP9 integration#8)
  • Test various scenarios for billing and payout generation based on this data
  • Cron schedule to update storage event success based on reports and update reputation; consider the case of over-reported failures and a lack of reports from users
  • Unit test the storage events finality cron
  • Consider the case of handling older exchange reports that do not include a token
  • Update package.json with a released service-storage-models version with the necessary updates

Part of:

@littleskunk
Collaborator

Some early feedback from my side. I see 3 problems here:

1.) If the renter deletes a file early, the farmer will not get paid for it but will still hold the shard until it expires. 2 TB used space but only 200 GB paid space? No chance for the farmer to see the difference between unpaid and paid used space? -> The farmer will delete all shards and start over with a new farmer identity until that one has too much unpaid space as well. We need a way to clean up the space on the farmer.

2.) A farmer will lose all contracts if his timeoutRate is too high. The timeoutRate is not working at the moment; see storj-archived/complex#82 and storj-archived/complex#81. As long as timeoutRate is not stable and tested we should not use it, or we will be burning reliable farmers in a few seconds.

3.) The timeoutRateThreshold window is only 1 hour. A hardware upgrade, ISP downtime, a UPnP issue (storj-archived/core#600), or farmer maintenance jobs like the farmer GBh calculation script all require a higher tolerance. How about 12 hours of downtime per week? A timeoutRateThreshold of 0.07 and 168 hours for the timeout calculation.

@braydonf braydonf changed the title from [WIP] Implement SIP9 to Implement SIP9 on Nov 20, 2017
let token = pointer.token;
self._createStorageEvent(token,
farmer.nodeID,
mirror.contact,
Collaborator

mirror.contact itself is not a nodeID. I would expect dirty StorageEvents stored in the database.

Contributor Author

Hmm... okay. Will take a second look at this.

const cursor = StorageEvent.find({
timestamp: {
$lt: finalityTime,
$gt: lastTimestamp
Collaborator

Two or more StorageEvents can have the same timestamp. If the job gets canceled by StorageEventsCron.MAX_RUN_TIME, the lastTimestamp could be set at the first StorageEvent, and restarting the job with $gt: lastTimestamp would skip all other StorageEvents with the same timestamp.

I don't think $gte is a better choice. In that case it would not skip StorageEvents, but it would read them twice :/
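One common way around this tie problem (a sketch in plain JavaScript over an in-memory array, not the bridge's actual Mongo query) is to paginate on the compound key (timestamp, _id), since _id is unique and breaks ties:

```javascript
// Sketch: resuming a scan on timestamp alone either skips or re-reads events
// that share a timestamp. Paginating on (timestamp, _id) avoids both.

function nextPage(events, cursor, limit) {
  // events must be sorted by (timestamp, _id) ascending
  const page = events
    .filter((e) =>
      e.timestamp > cursor.timestamp ||
      (e.timestamp === cursor.timestamp && e._id > cursor._id))
    .slice(0, limit);
  const last = page[page.length - 1];
  return {
    page,
    cursor: last ? { timestamp: last.timestamp, _id: last._id } : cursor
  };
}

const events = [
  { _id: 'a', timestamp: 1 },
  { _id: 'b', timestamp: 2 },
  { _id: 'c', timestamp: 2 }, // same timestamp as 'b'
  { _id: 'd', timestamp: 3 }
];

let { page, cursor } = nextPage(events, { timestamp: 0, _id: '' }, 2);
// first page: a, b — resuming from (2, 'b') still picks up 'c'
({ page, cursor } = nextPage(events, cursor, 2));
console.log(page.map((e) => e._id).join(',')); // c,d
```

The same idea maps onto a MongoDB query by sorting on both fields and using an $or of the two filter branches.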

Contributor Author

It should be safe to repeat processing storage events within a certain degree (many hours). The event should remain unmodified, and the update to the user should not apply based on the timestamp: https://github.com/Storj/service-storage-models/pull/136/files#diff-6cde9fa75ac789afe7ff15f670cfc049R462
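The idempotency argument can be illustrated with a small sketch (hypothetical names, not the actual service-storage-models code): replaying an event is a no-op when the per-user aggregate only advances past a timestamp high-water mark.

```javascript
// Sketch: re-processing a StorageEvent is harmless if the per-user aggregate
// is only updated when the event timestamp is newer than what was last applied.

function applyEvent(user, event) {
  if (event.timestamp <= user.lastEventApplied) {
    return user; // already counted — a repeated cron pass changes nothing
  }
  return {
    ...user,
    downloadBandwidth: user.downloadBandwidth + event.downloadBandwidth,
    lastEventApplied: event.timestamp
  };
}

let user = { downloadBandwidth: 0, lastEventApplied: 0 };
const event = { timestamp: 5, downloadBandwidth: 100 };

user = applyEvent(user, event);
user = applyEvent(user, event); // replayed by a restarted cron job
console.log(user.downloadBandwidth); // 100
```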

event.success = true;
this.updateReputation(event.farmer, points.TRANSFER_SUCCESS);
} else {
this.updateReputation(event.farmer, points.TRANSFER_FAILURE);
Collaborator

@littleskunk littleskunk Nov 26, 2017

In the case of mirror creation I will send a ClientReport. By claiming all mirror transfers as failed, I can reduce the reputation of all my neighbors, and gain a better relative reputation for myself.

Not critical. We can solve this one later.

Contributor Author

Agree, we can take a look at this later. We would likely need to do something similar to what we have with users.

@littleskunk
Collaborator

2.) A farmer will lose all contracts if his timeoutRate is too high. The timeoutRate is not working at the moment; see storj-archived/complex#82 and storj-archived/complex#81. As long as timeoutRate is not stable and tested we should not use it, or we will be burning reliable farmers in a few seconds.
#521 (comment)

Let me explain that a little better with an example:
A farmer updated spaceAvailable=false because his drive is full. The bridge will not send any ALLOC messages. LastSeen is 3 hours ago. The farmer goes offline for 5 minutes and the bridge sends a perfectly timed PING message. The bridge will take the lastSeen and assume that the farmer was offline for 3 hours and 5 minutes. The farmer loses all contracts for being offline for 5 minutes.

@braydonf
Contributor Author

braydonf commented Dec 5, 2017

@littleskunk Offline time is calculated between timeouts, not from lastSeen. If the lastSeen is not greater than the last timeout, the time in between is considered offline. See: https://github.com/Storj/service-storage-models/blob/master/lib/models/contact.js#L155
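A rough sketch of that rule (not the exact contact.js implementation): offline time is measured from the last timeout, and lastSeen only resets the window if the contact was seen after that timeout began.

```javascript
// Sketch of "offline time is calculated between timeouts, not from lastSeen".

function offlineMs(now, lastSeen, lastTimeout) {
  if (lastSeen > lastTimeout) {
    return 0; // seen since the last timeout: not currently offline
  }
  return now - lastTimeout; // unseen since the timeout: that whole span is offline
}

const HOUR = 60 * 60 * 1000;
// Unseen since a timeout one hour ago: one hour offline, regardless of a
// lastSeen from three hours earlier.
console.log(offlineMs(10 * HOUR, 7 * HOUR, 9 * HOUR)); // 3600000
// Seen after the last timeout: a short outage is not billed as hours offline.
console.log(offlineMs(10 * HOUR, 9.5 * HOUR, 9 * HOUR)); // 0
```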

@braydonf
Contributor Author

Rebased on the latest master branch

@littleskunk
Collaborator

littleskunk commented Dec 21, 2017

I see a problem with the offline penalty. Let's take an example:

1.) Farmer joins the network and gets 1 contract.
2.) Farmer goes offline for a few days. The monitor will ping him and he gets the offline penalty.
3.) Farmer comes back online and is farming for a full month without any downtime. Not a single timeout. 1 TB used space.
4.) Farmer is now full and sets spaceAvailable to false.
5.) Farmer goes offline from one day to the next. The bridge monitor will not ping him because his timeoutRate is still below the threshold.

Now we have 1 TB of offline shards, but the bridge doesn't notice it and will not create new mirrors.

Other timeouts are very unlikely. The bridge will not send any ALLOC requests or request an upload pointer because spaceAvailable is set to false. Download pointers are ordered by lastSeen.

Is it possible to modify the bridge monitor? Let's ping all farmers with more than one contract. That should solve the problem. The bridge would ping the farmer again and stop pinging as soon as the offline penalty is applied.
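The proposed monitor change could be sketched like this (field names such as contractCount and penalized are assumptions for illustration, not the actual bridge schema):

```javascript
// Sketch: ping every farmer that still holds contracts, and stop once the
// offline penalty has been applied, instead of pinging only contacts whose
// timeoutRate already crossed the threshold.

function shouldPing(contact) {
  if (contact.penalized) {
    return false; // offline penalty already applied — contracts are gone
  }
  return contact.contractCount >= 1;
}

const farmers = [
  { nodeID: 'a', contractCount: 12, penalized: false },
  { nodeID: 'b', contractCount: 0, penalized: false },
  { nodeID: 'c', contractCount: 4, penalized: true }
];
console.log(farmers.filter(shouldPing).map((f) => f.nodeID)); // [ 'a' ]
```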

@braydonf
Contributor Author

Sounds like we need to fix timeoutRate to reset without a timeout event?

@littleskunk
Collaborator

littleskunk commented Jan 18, 2018

@braydonf I guess you can implement that faster than me. How about a check whether lastTimeout is more than 72 hours old, and in that case set timeoutRate to 0. Somewhere around this function should affect all important messages:
https://github.com/Storj/complex/blob/master/lib/landlord.js#L369

For storj-archived/complex#82 I will try to add ALLOC and RENEW to that function in order to trigger a responseTime calculation.

@braydonf
Contributor Author

braydonf commented Feb 8, 2018

Rebased again from latest master

@braydonf
Contributor Author

braydonf commented Feb 8, 2018

@littleskunk We can likely add a timeout reset in recordResponseTime https://github.com/Storj/service-storage-models/blob/master/lib/models/contact.js#L177
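A rough sketch of that reset, assuming the 72-hour window discussed above (the real recordResponseTime in service-storage-models persists to MongoDB and maintains a moving average; the names here only mirror the discussion):

```javascript
// Sketch: any successful response clears the timeoutRate once the last
// recorded timeout is older than the reset window.

const TIMEOUT_RATE_RESET_MS = 72 * 60 * 60 * 1000; // assumed 72-hour window

function recordResponseTime(contact, responseTimeMs, now = Date.now()) {
  contact.responseTime = responseTimeMs; // real code uses a moving average
  if (contact.lastTimeout && now - contact.lastTimeout > TIMEOUT_RATE_RESET_MS) {
    contact.timeoutRate = 0; // no timeouts for 72h: treat the farmer as reliable again
  }
  return contact;
}

const contact = {
  timeoutRate: 0.5,
  lastTimeout: Date.now() - 73 * 60 * 60 * 1000 // last timeout 73 hours ago
};
recordResponseTime(contact, 120);
console.log(contact.timeoutRate); // 0
```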

@braydonf
Contributor Author

braydonf commented Mar 1, 2018

Rebased again from latest master, and resolved conflicts.

braydonf and others added 2 commits April 24, 2018 07:26
Fix cron scheduling pattern
Deleting an empty bucket failed with "404 Bucket not found" because no
shards were found for the bucket.
}

const isClientReport = req.user ?
(event.client === req.user.id) : req.farmerNodeID ?
Contributor

I believe this needs to compare event.client to req.user._id rather than req.user.id
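A minimal sketch of the comparison being suggested, with the surrounding route code omitted and the farmer branch filled in as an assumption:

```javascript
// Sketch: the auth middleware attaches the Mongo document, so the identifier
// lives at req.user._id, not req.user.id. The req.farmerNodeID branch here is
// an assumed continuation of the snippet above.

function isClientReport(req, event) {
  if (req.user) {
    return event.client === req.user._id;
  }
  if (req.farmerNodeID) {
    return event.client === req.farmerNodeID;
  }
  return false;
}

const req = { user: { _id: 'user@example.com' } };
console.log(isClientReport(req, { client: 'user@example.com' })); // true
```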

Contributor Author

@braydonf braydonf Apr 25, 2018

Okay, what was the problem that came up due to it?

@kaloyan-raev
Contributor

kaloyan-raev commented Apr 26, 2018

I noticed some strange storage events generated while uploading a file.

  1. Storage event with the same client and farmer:
{
	"_id" : ObjectId("5ae19ab14b1cda19b4973227"),
	"token" : "d9ce3e7dd0947ae44e19d43053de86f7a33edebb",
	"user" : null,
	"client" : "4bb49fb779e9233f9ed14f6ef34e29048911f018",
	"farmer" : "4bb49fb779e9233f9ed14f6ef34e29048911f018",
	"downloadBandwidth" : 0,
	"storage" : 107374,
	"shardHash" : "89120aaf3e7e0c091debbdb4d938b0ca1ac95069",
	"success" : false,
	"timestamp" : ISODate("2018-04-26T09:24:01.400Z"),
	"__v" : 0
}

This event is never updated with clientReport and farmerReport and remains with "success" : false.

EDIT: Something that can help with debugging is that the client/farmer id 4bb49fb779e9233f9ed14f6ef34e29048911f018 in this case is that of the farmer that received the uploaded shard from the user.

  2. Duplicated storage events generated:
{
	"_id" : ObjectId("5ae19ab14b1cda19b497322a"),
	"token" : "b5701032361e77838d12e9e064976d9d25860166",
	"user" : null,
	"client" : "ab6a9c5205af4cf1da57fb7ec98b5433e967a244",
	"farmer" : "99103d9615965c85d35295f458034f639460cc8e",
	"downloadBandwidth" : 0,
	"storage" : 107374,
	"shardHash" : "89120aaf3e7e0c091debbdb4d938b0ca1ac95069",
	"success" : true,
	"timestamp" : ISODate("2018-04-26T09:24:01.728Z"),
	"__v" : 0,
	"clientReport" : {
		"exchangeStart" : ISODate("2018-04-26T09:24:01.755Z"),
		"exchangeEnd" : ISODate("2018-04-26T09:24:01.757Z"),
		"exchangeResultCode" : 1000,
		"exchangeResultMessage" : "SHARD_DOWNLOADED"
	},
	"farmerReport" : {
		"exchangeStart" : ISODate("2018-04-26T09:24:01.746Z"),
		"exchangeEnd" : ISODate("2018-04-26T09:24:01.762Z"),
		"exchangeResultCode" : 1000,
		"exchangeResultMessage" : "MIRROR_SUCCESS"
	}
}
{
	"_id" : ObjectId("5ae19ab14b1cda19b497322b"),
	"token" : "229d70ccf15563aa01f7c05e54736f49054a7bac",
	"user" : null,
	"client" : "ab6a9c5205af4cf1da57fb7ec98b5433e967a244",
	"farmer" : "99103d9615965c85d35295f458034f639460cc8e",
	"downloadBandwidth" : 0,
	"storage" : 107374,
	"shardHash" : "89120aaf3e7e0c091debbdb4d938b0ca1ac95069",
	"success" : false,
	"timestamp" : ISODate("2018-04-26T09:24:01.848Z"),
	"__v" : 0
}

Both events are between the same client and farmer for mirroring the same shard. The only difference is in the token. As a result, only the first event is updated with clientReport and farmerReport. The second never got them and remains with "success" : false.

@braydonf
Contributor Author

braydonf commented Apr 26, 2018

@kaloyan-raev Events without a user are replication events, such as the first example; however, that one looks like an attempt to mirror from and to the same farmer. That would explain why it didn't work, but not why it was created, so we would probably need to dig a bit more to find out why.

@braydonf
Contributor Author

@kaloyan-raev For the second one, it looks like the duplicate event was created just 100 ms after the first. It likely failed because the first one had already started. There may be a race condition as to why there are two to begin with, though.

@littleskunk
Collaborator

Duplicate mirror traffic. That is correct. The shard was transferred twice to the same farmer. See #564

@kaloyan-raev
Contributor

Today I downloaded a 200 MB file. The download was successful, but for the first 3 shards I got DOWNLOAD_ERROR storage events in the database. Like this:

{
	"_id" : ObjectId("5ae2e2074b1cda19b497329f"),
	"token" : "d92fcedc7c2d2cb7218472bcf07f4b727a555e12",
	"user" : "kaloyan.raev.tmp@gmail.com",
	"client" : "kaloyan.raev.tmp@gmail.com",
	"farmer" : "64d44c4684c84f5bf39c657bd007f306e91aac88",
	"shardHash" : "c63084f84d0f52a92ec6ad51d38220e8b97bcaeb",
	"downloadBandwidth" : 16777216,
	"storage" : 0,
	"success" : false,
	"timestamp" : ISODate("2018-04-27T08:40:39.055Z"),
	"__v" : 0,
	"clientReport" : {
		"exchangeStart" : ISODate("2018-04-27T08:40:39.058Z"),
		"exchangeEnd" : ISODate("2018-04-27T08:40:39.067Z"),
		"exchangeResultCode" : 1100,
		"exchangeResultMessage" : "DOWNLOAD_ERROR"
	}
}

It does not really make sense, as I doubt any download was unsuccessful. It's a perfect networking environment: downloading from a Storj/integration docker container running on the same machine.

I also noticed that the shards that got DOWNLOAD_ERROR reports were mirrored again afterwards:

{
	"_id" : ObjectId("5ae2e2084b1cda19b49732a7"),
	"token" : "2736a0f55c5636a47c50730528b743ec4cd37562",
	"user" : null,
	"client" : "64d44c4684c84f5bf39c657bd007f306e91aac88",
	"farmer" : "abd2d67fb61cbd697b9bccb07b031406a921afb6",
	"downloadBandwidth" : 0,
	"storage" : 0,
	"shardHash" : "c63084f84d0f52a92ec6ad51d38220e8b97bcaeb",
	"success" : true,
	"timestamp" : ISODate("2018-04-27T08:40:40.710Z"),
	"__v" : 0,
	"clientReport" : {
		"exchangeStart" : ISODate("2018-04-27T08:40:42.633Z"),
		"exchangeEnd" : ISODate("2018-04-27T08:40:52.696Z"),
		"exchangeResultCode" : 1000,
		"exchangeResultMessage" : "SHARD_DOWNLOADED"
	},
	"farmerReport" : {
		"exchangeStart" : ISODate("2018-04-27T08:40:42.584Z"),
		"exchangeEnd" : ISODate("2018-04-27T08:40:56.010Z"),
		"exchangeResultCode" : 1000,
		"exchangeResultMessage" : "MIRROR_SUCCESS"
	}
}

BTW, in a mirroring storage event, is it the "farmer" downloading from the "client", or vice versa?

@littleskunk
Collaborator

The client is downloading from the farmer.

The mirror was completed at 2018-04-27T08:40:52.696Z but your download was at 2018-04-27T08:40:39.067Z. That is the reason for the download error.

I would expect a download retry with the other farmer. That should work, and finally you get the complete file. The download retry could be hidden somewhere in the logfile. Do you see any download success for the same shard hash?

dylanlott and others added 3 commits May 1, 2018 11:16
Change `req.user.id` to `req.user._id` to match middleware _id assignment
@braydonf
Contributor Author

braydonf commented May 1, 2018

Okay, summary of what we've found so far with another run of testing:

  • Line 211 of reports.js is immediately giving failed reputation points, even though there is a case where both sides of the report should be considered. There is a TODO at line 108 of the storage events cron that also hints at this.
  • As @littleskunk has mentioned, there is also an issue with the timeoutRate not being reset that could cause problems. It's an existing bug that should be resolved, as it's closely related to this work.
  • There were mirror/replication events that were trying to mirror from/to the same farmer, which @kaloyan-raev also mentioned above. This may be an existing bug.

@braydonf
Contributor Author

braydonf commented May 1, 2018

Thanks @dylanlott and @kaloyan-raev for the PRs!

@braydonf
Contributor Author

braydonf commented May 1, 2018

There is also an issue we just found with the cron worker processing storage events multiple times, and then recording those events on the user as the totalRate. So in a case where the client is not reporting with SIP9, the unknown rate is still less than the totalRate.

@braydonf
Contributor Author

braydonf commented May 1, 2018

As a follow-up to the previous comment, it looks like the lastTimestamp isn't accurate. For three separate runs of the cron the lastTimestamp was the same, even though the runs were 2 minutes apart:

lastTimestamp: Tue May 01 2018 22:42:06 GMT+0000 (UTC)
lastTimestamp: Tue May 01 2018 22:42:06 GMT+0000 (UTC)
lastTimestamp: Tue May 01 2018 22:42:06 GMT+0000 (UTC)

Braydon Fuller added 4 commits May 2, 2018 12:10
There was an issue where the reputation points were awarded too soon,
without having the full perspective of both client and farmer reports.
This now updates reputation after there is a chance for both of those
reports to arrive and the storage event has been resolved.
There was an issue where duplicate mirror events could be triggered
for each side of the exchange report, once for the farmer and another
for the client. The replication/mirror is triggered by a successful
client report.

There was another issue where a mirror would be attempted from/to
the same farmer. A check has been added to make sure that the
source and destination are not the same.
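The guard described in this commit message might look roughly like this (function and field names are hypothetical, not the actual bridge code):

```javascript
// Sketch: never select the shard's current holder as the destination of its
// own mirror; skip any candidate whose nodeID matches the source.

function selectMirrorDestination(sourceNodeID, candidates) {
  return candidates.find((c) => c.nodeID !== sourceNodeID) || null;
}

const dest = selectMirrorDestination('farmerA', [
  { nodeID: 'farmerA' }, // would mirror onto itself — skipped
  { nodeID: 'farmerB' }
]);
console.log(dest.nodeID); // farmerB
```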
There was previously code that would immediately resolve the storage
event to true once the report was made. However, we've moved that
functionality into one place in the cron worker, so it's necessary
for the resolving to handle this case too.
@braydonf
Contributor Author

braydonf commented May 2, 2018

Okay, all known issues reported here should now be resolved.

@navillasa navillasa merged commit 05b700c into storj-archived:master May 3, 2018