This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Implement SIP9 #521

Merged
merged 100 commits into from
May 3, 2018

Conversation

braydonf
Contributor

@braydonf braydonf commented Oct 19, 2017

Features

  • Ability to query for missing shards, for the purpose of running shard recovery for storage held over longer periods of time
  • Ability to track farmer bandwidth and storage to incentivize reliability over longer periods of time, with improved data for farmer payouts
  • Builds farmer reputation based on successful transfers
  • Ability to more accurately track user bandwidth usage and storage for billing

Implements: https://github.com/Storj/sips/blob/master/sip-0009.md

Specific updates

  • Adds storage events for each mirror to a new farmer
  • Updates current storage events for uploads to include each shard and associate it with a farmer
  • Updates download storage events to be associated with a farmer
  • Marks an upload storage event as ended when the farmer goes offline
  • Updates the exchange reports route to update storage events on success/failure
  • Updates the storage event model
  • Marks an upload storage event as ended when the file is deleted

Todo:

  • Integration testing with storjshare and libstorj exchange report updates (Implementation of SIP9 integration#8)
  • Test various scenarios for billing and payout generation based on this data
  • Cron schedule to update storage event success based on reports and update reputation; consider the case of over-reported failures and a lack of reports from users
  • Unit test the storage events finality cron
  • Consider the case of handling older exchange reports that do not include a token
  • Update package.json with a released service-storage-models version with the necessary updates

Part of:

@littleskunk
Collaborator

Some early feedback from my side. I see 3 problems here:

1.) If the renter deletes a file early, the farmer will not get paid for it but will still hold the shard until it expires. 2 TB used space but only 200 GB paid space? No chance for the farmer to see the difference between unpaid and paid used space? -> The farmer will delete all shards and start over with a new farmer identity until that one has too much unpaid space as well. We need a way to clean up the space on the farmer.

2.) A farmer will lose all contracts if his timeoutRate is too high. The timeoutRate is not working at the moment; see storj-archived/complex#82 and storj-archived/complex#81. As long as timeoutRate is not stable and tested we should not use it, or we will be burning reliable farmers in a few seconds.

3.) The timeoutRateThreshold window is only 1 hour. A hardware upgrade, ISP downtime, a UPnP issue (storj-archived/core#600), or farmer maintenance jobs like the farmer GBh calculation script all require a higher tolerance. How about 12 hours of downtime per week? A timeoutRateThreshold of 0.07 and 168 hours for the timeout calculation.

@braydonf braydonf changed the title from [WIP] Implement SIP9 to Implement SIP9 on Nov 20, 2017
let token = pointer.token;
self._createStorageEvent(token,
farmer.nodeID,
mirror.contact,
Collaborator

mirror.contact itself is not a nodeID. I would expect dirty StorageEvents stored in the database.

Contributor Author

Hmm... okay. Will take a second look at this.

const cursor = StorageEvent.find({
timestamp: {
$lt: finalityTime,
$gt: lastTimestamp
Collaborator

Two or more StorageEvents can have the same timestamp. If the job gets canceled by StorageEventsCron.MAX_RUN_TIME, the lastTimestamp could be set at the first StorageEvent, and restarting the job with $gt: lastTimestamp would skip all other StorageEvents with the same timestamp.

I don't think $gte is a better choice. In that case it would not skip StorageEvents, but it would read them twice :/
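One common way around this tie problem (a sketch in plain JavaScript over an in-memory array, not the bridge's actual Mongo query) is to paginate on the compound key (timestamp, _id), since _id is unique and breaks ties:

```javascript
// Sketch: resuming a scan on timestamp alone either skips or re-reads events
// that share a timestamp. Paginating on (timestamp, _id) avoids both.

function nextPage(events, cursor, limit) {
  // events must be sorted by (timestamp, _id) ascending
  const page = events
    .filter((e) =>
      e.timestamp > cursor.timestamp ||
      (e.timestamp === cursor.timestamp && e._id > cursor._id))
    .slice(0, limit);
  const last = page[page.length - 1];
  return {
    page,
    cursor: last ? { timestamp: last.timestamp, _id: last._id } : cursor
  };
}

const events = [
  { _id: 'a', timestamp: 1 },
  { _id: 'b', timestamp: 2 },
  { _id: 'c', timestamp: 2 }, // same timestamp as 'b'
  { _id: 'd', timestamp: 3 }
];

let { page, cursor } = nextPage(events, { timestamp: 0, _id: '' }, 2);
// first page: a, b — resuming from (2, 'b') still picks up 'c'
({ page, cursor } = nextPage(events, cursor, 2));
console.log(page.map((e) => e._id).join(',')); // c,d
```

The same idea maps onto a MongoDB query by sorting on both fields and using an $or of the two filter branches.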

Contributor Author

It should be safe to repeat processing storage events within a certain degree (many hours). The event should remain unmodified, and the update to the user should not apply based on the timestamp: https://github.com/Storj/service-storage-models/pull/136/files#diff-6cde9fa75ac789afe7ff15f670cfc049R462
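The idempotency argument can be illustrated with a small sketch (hypothetical names, not the actual service-storage-models code): replaying an event is a no-op when the per-user aggregate only advances past a timestamp high-water mark.

```javascript
// Sketch: re-processing a StorageEvent is harmless if the per-user aggregate
// is only updated when the event timestamp is newer than what was last applied.

function applyEvent(user, event) {
  if (event.timestamp <= user.lastEventApplied) {
    return user; // already counted — a repeated cron pass changes nothing
  }
  return {
    ...user,
    downloadBandwidth: user.downloadBandwidth + event.downloadBandwidth,
    lastEventApplied: event.timestamp
  };
}

let user = { downloadBandwidth: 0, lastEventApplied: 0 };
const event = { timestamp: 5, downloadBandwidth: 100 };

user = applyEvent(user, event);
user = applyEvent(user, event); // replayed by a restarted cron job
console.log(user.downloadBandwidth); // 100
```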

event.success = true;
this.updateReputation(event.farmer, points.TRANSFER_SUCCESS);
} else {
this.updateReputation(event.farmer, points.TRANSFER_FAILURE);
Collaborator

@littleskunk littleskunk Nov 26, 2017

In the case of mirror creation I will send a ClientReport. By claiming all mirror transfers as failed, I can reduce the reputation of all my neighbors, and gain a better relative reputation for myself.

Not critical. We can solve this one later.

Contributor Author

Agree, we can take a look at this later. We would likely need to do something similar to what we have with users.

@littleskunk
Collaborator

2.) A farmer will lose all contracts if his timeoutRate is too high. The timeoutRate is not working at the moment; see storj-archived/complex#82 and storj-archived/complex#81. As long as timeoutRate is not stable and tested we should not use it, or we will be burning reliable farmers in a few seconds.
#521 (comment)

Let me explain that a little better with an example:
A farmer updated spaceAvailable=false because his drive is full. The bridge will not send any ALLOC messages. LastSeen is 3 hours ago. The farmer goes offline for 5 minutes and the bridge sends a perfectly timed PING message. The bridge will take the lastSeen and assume that the farmer was offline for 3 hours and 5 minutes. The farmer loses all contracts for being offline for 5 minutes.

@braydonf
Contributor Author

braydonf commented Dec 5, 2017

@littleskunk Offline time is calculated between timeouts, not from lastSeen. If the lastSeen is not greater than the last timeout, the time in between is considered offline. See: https://github.com/Storj/service-storage-models/blob/master/lib/models/contact.js#L155
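A rough sketch of that rule (not the exact contact.js implementation): offline time is measured from the last timeout, and lastSeen only resets the window if the contact was seen after that timeout began.

```javascript
// Sketch of "offline time is calculated between timeouts, not from lastSeen".

function offlineMs(now, lastSeen, lastTimeout) {
  if (lastSeen > lastTimeout) {
    return 0; // seen since the last timeout: not currently offline
  }
  return now - lastTimeout; // unseen since the timeout: that whole span is offline
}

const HOUR = 60 * 60 * 1000;
// Unseen since a timeout one hour ago: one hour offline, regardless of a
// lastSeen from three hours earlier.
console.log(offlineMs(10 * HOUR, 7 * HOUR, 9 * HOUR)); // 3600000
// Seen after the last timeout: a short outage is not billed as hours offline.
console.log(offlineMs(10 * HOUR, 9.5 * HOUR, 9 * HOUR)); // 0
```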

@braydonf
Contributor Author

Rebased on the latest master branch

@littleskunk
Collaborator

littleskunk commented Dec 21, 2017

I see a problem with the offline penalty. Let's take an example:

1.) Farmer joins the network and gets 1 contract.
2.) Farmer goes offline for a few days. The monitor will ping him and he gets the offline penalty.
3.) Farmer comes back online and is farming for a full month without any downtime. Not a single timeout. 1 TB used space.
4.) Farmer is now full and sets spaceAvailable to false.
5.) Farmer goes offline from one day to the next. The bridge monitor will not ping him because his timeoutRate is still below the threshold.

Now we have 1 TB of offline shards, but the bridge doesn't notice it and will not create new mirrors.

Other timeouts are very unlikely. The bridge will not send any ALLOC requests or request an upload pointer because spaceAvailable is set to false. Download pointers are ordered by lastSeen.

Is it possible to modify the bridge monitor? Let's ping all farmers with more than one contract. That should solve the problem. The bridge would ping the farmer again and stop pinging as soon as the offline penalty is applied.
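The proposed monitor change could be sketched like this (field names such as contractCount and penalized are assumptions for illustration, not the actual bridge schema):

```javascript
// Sketch: ping every farmer that still holds contracts, and stop once the
// offline penalty has been applied, instead of pinging only contacts whose
// timeoutRate already crossed the threshold.

function shouldPing(contact) {
  if (contact.penalized) {
    return false; // offline penalty already applied — contracts are gone
  }
  return contact.contractCount >= 1;
}

const farmers = [
  { nodeID: 'a', contractCount: 12, penalized: false },
  { nodeID: 'b', contractCount: 0, penalized: false },
  { nodeID: 'c', contractCount: 4, penalized: true }
];
console.log(farmers.filter(shouldPing).map((f) => f.nodeID)); // [ 'a' ]
```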

@braydonf
Contributor Author

Sounds like we need to fix timeoutRate to reset without a timeout event?

@littleskunk
Collaborator

littleskunk commented Jan 18, 2018

@braydonf I guess you can implement that faster than me. How about a check whether lastTimeout is more than 72 hours old, and in that case set timeoutRate to 0. Somewhere around this function should affect all important messages:
https://github.com/Storj/complex/blob/master/lib/landlord.js#L369

For storj-archived/complex#82 I will try to add ALLOC and RENEW to that function in order to trigger a responseTime calculation.

@braydonf
Contributor Author

braydonf commented Feb 8, 2018

Rebased again from latest master

@braydonf
Contributor Author

braydonf commented Feb 8, 2018

@littleskunk We can likely add a timeout reset in recordResponseTime https://github.com/Storj/service-storage-models/blob/master/lib/models/contact.js#L177
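A rough sketch of that reset, assuming the 72-hour window discussed above (the real recordResponseTime in service-storage-models persists to MongoDB and maintains a moving average; the names here only mirror the discussion):

```javascript
// Sketch: any successful response clears the timeoutRate once the last
// recorded timeout is older than the reset window.

const TIMEOUT_RATE_RESET_MS = 72 * 60 * 60 * 1000; // assumed 72-hour window

function recordResponseTime(contact, responseTimeMs, now = Date.now()) {
  contact.responseTime = responseTimeMs; // real code uses a moving average
  if (contact.lastTimeout && now - contact.lastTimeout > TIMEOUT_RATE_RESET_MS) {
    contact.timeoutRate = 0; // no timeouts for 72h: treat the farmer as reliable again
  }
  return contact;
}

const contact = {
  timeoutRate: 0.5,
  lastTimeout: Date.now() - 73 * 60 * 60 * 1000 // last timeout 73 hours ago
};
recordResponseTime(contact, 120);
console.log(contact.timeoutRate); // 0
```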

@braydonf
Contributor Author

braydonf commented Mar 1, 2018

Rebased again from latest master, and resolved conflicts.

braydonf and others added 2 commits April 24, 2018 07:26
Fix cron scheduling pattern
Deleting an empty bucket failed with "404 Bucket not found" because no
shards were found for the bucket.
}

const isClientReport = req.user ?
(event.client === req.user.id) : req.farmerNodeID ?
Contributor

I believe this needs to compare event.client to req.user._id rather than req.user.id
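A minimal sketch of the comparison being suggested, with the surrounding route code omitted and the farmer branch filled in as an assumption:

```javascript
// Sketch: the auth middleware attaches the Mongo document, so the identifier
// lives at req.user._id, not req.user.id. The req.farmerNodeID branch here is
// an assumed continuation of the snippet above.

function isClientReport(req, event) {
  if (req.user) {
    return event.client === req.user._id;
  }
  if (req.farmerNodeID) {
    return event.client === req.farmerNodeID;
  }
  return false;
}

const req = { user: { _id: 'user@example.com' } };
console.log(isClientReport(req, { client: 'user@example.com' })); // true
```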

Contributor Author

@braydonf braydonf Apr 25, 2018

Okay, what was the problem that came up due to it?

@kaloyan-raev
Contributor

kaloyan-raev commented Apr 26, 2018

I noticed some strange storage events generated while uploading a file.

  1. Storage event with the same client and farmer:
{
	"_id" : ObjectId("5ae19ab14b1cda19b4973227"),
	"token" : "d9ce3e7dd0947ae44e19d43053de86f7a33edebb",
	"user" : null,
	"client" : "4bb49fb779e9233f9ed14f6ef34e29048911f018",
	"farmer" : "4bb49fb779e9233f9ed14f6ef34e29048911f018",
	"downloadBandwidth" : 0,
	"storage" : 107374,
	"shardHash" : "89120aaf3e7e0c091debbdb4d938b0ca1ac95069",
	"success" : false,
	"timestamp" : ISODate("2018-04-26T09:24:01.400Z"),
	"__v" : 0
}

This event is never updated with clientReport and farmerReport and remains with "success" : false.

EDIT: Something that can help with debugging is that the client/farmer id 4bb49fb779e9233f9ed14f6ef34e29048911f018 in this case is that of the farmer that received the uploaded shard from the user.

  2. Duplicated storage events generated:
{
	"_id" : ObjectId("5ae19ab14b1cda19b497322a"),
	"token" : "b5701032361e77838d12e9e064976d9d25860166",
	"user" : null,
	"client" : "ab6a9c5205af4cf1da57fb7ec98b5433e967a244",
	"farmer" : "99103d9615965c85d35295f458034f639460cc8e",
	"downloadBandwidth" : 0,
	"storage" : 107374,
	"shardHash" : "89120aaf3e7e0c091debbdb4d938b0ca1ac95069",
	"success" : true,
	"timestamp" : ISODate("2018-04-26T09:24:01.728Z"),
	"__v" : 0,
	"clientReport" : {
		"exchangeStart" : ISODate("2018-04-26T09:24:01.755Z"),
		"exchangeEnd" : ISODate("2018-04-26T09:24:01.757Z"),
		"exchangeResultCode" : 1000,
		"exchangeResultMessage" : "SHARD_DOWNLOADED"
	},
	"farmerReport" : {
		"exchangeStart" : ISODate("2018-04-26T09:24:01.746Z"),
		"exchangeEnd" : ISODate("2018-04-26T09:24:01.762Z"),
		"exchangeResultCode" : 1000,
		"exchangeResultMessage" : "MIRROR_SUCCESS"
	}
}
{
	"_id" : ObjectId("5ae19ab14b1cda19b497322b"),
	"token" : "229d70ccf15563aa01f7c05e54736f49054a7bac",
	"user" : null,
	"client" : "ab6a9c5205af4cf1da57fb7ec98b5433e967a244",
	"farmer" : "99103d9615965c85d35295f458034f639460cc8e",
	"downloadBandwidth" : 0,
	"storage" : 107374,
	"shardHash" : "89120aaf3e7e0c091debbdb4d938b0ca1ac95069",
	"success" : false,
	"timestamp" : ISODate("2018-04-26T09:24:01.848Z"),
	"__v" : 0
}

Both events are between the same client and farmer for mirroring the same shard. The only difference is in the token. As a result, only the first event is updated with clientReport and farmerReport. The second never got them and remains with "success" : false.

@braydonf
Contributor Author

braydonf commented Apr 26, 2018

@kaloyan-raev Events without a user are replication events, such as the first example; however, that one looks like an attempt to mirror from and to the same farmer. That would explain why it didn't work, but not why it was created, so we would probably need to dig a bit more to find out why.

@braydonf
Contributor Author

@kaloyan-raev For the second one, it looks like the duplicate event was created just 100 ms after the first. It likely failed because the first one had already started. There may be a race condition as to why there are two to begin with, though.

@littleskunk
Collaborator

Duplicate mirror traffic. That is correct. The shard was transferred twice to the same farmer. See #564

@kaloyan-raev
Contributor

Today I downloaded a 200 MB file. The download was successful, but for the first 3 shards I got DOWNLOAD_ERROR storage events in the database. Like this:

{
	"_id" : ObjectId("5ae2e2074b1cda19b497329f"),
	"token" : "d92fcedc7c2d2cb7218472bcf07f4b727a555e12",
	"user" : "kaloyan.raev.tmp@gmail.com",
	"client" : "kaloyan.raev.tmp@gmail.com",
	"farmer" : "64d44c4684c84f5bf39c657bd007f306e91aac88",
	"shardHash" : "c63084f84d0f52a92ec6ad51d38220e8b97bcaeb",
	"downloadBandwidth" : 16777216,
	"storage" : 0,
	"success" : false,
	"timestamp" : ISODate("2018-04-27T08:40:39.055Z"),
	"__v" : 0,
	"clientReport" : {
		"exchangeStart" : ISODate("2018-04-27T08:40:39.058Z"),
		"exchangeEnd" : ISODate("2018-04-27T08:40:39.067Z"),
		"exchangeResultCode" : 1100,
		"exchangeResultMessage" : "DOWNLOAD_ERROR"
	}
}

It does not really make sense, as I doubt any download was unsuccessful. It's a perfect networking environment: downloading from a Storj/integration docker container running on the same machine.

I also noticed that the shards that got DOWNLOAD_ERROR reports were mirrored again afterwards:

{
	"_id" : ObjectId("5ae2e2084b1cda19b49732a7"),
	"token" : "2736a0f55c5636a47c50730528b743ec4cd37562",
	"user" : null,
	"client" : "64d44c4684c84f5bf39c657bd007f306e91aac88",
	"farmer" : "abd2d67fb61cbd697b9bccb07b031406a921afb6",
	"downloadBandwidth" : 0,
	"storage" : 0,
	"shardHash" : "c63084f84d0f52a92ec6ad51d38220e8b97bcaeb",
	"success" : true,
	"timestamp" : ISODate("2018-04-27T08:40:40.710Z"),
	"__v" : 0,
	"clientReport" : {
		"exchangeStart" : ISODate("2018-04-27T08:40:42.633Z"),
		"exchangeEnd" : ISODate("2018-04-27T08:40:52.696Z"),
		"exchangeResultCode" : 1000,
		"exchangeResultMessage" : "SHARD_DOWNLOADED"
	},
	"farmerReport" : {
		"exchangeStart" : ISODate("2018-04-27T08:40:42.584Z"),
		"exchangeEnd" : ISODate("2018-04-27T08:40:56.010Z"),
		"exchangeResultCode" : 1000,
		"exchangeResultMessage" : "MIRROR_SUCCESS"
	}
}

BTW, in a mirroring storage event, is it the "farmer" downloading from the "client", or vice versa?

@littleskunk
Collaborator

The client is downloading from the farmer.

The mirror was completed at 2018-04-27T08:40:52.696Z but your download was at 2018-04-27T08:40:39.067Z. That is the reason for the download error.

I would expect a download retry with the other farmer. That should work, and finally you get the complete file. The download retry could be hidden somewhere in the logfile. Do you see any download success for the same shard hash?

dylanlott and others added 3 commits May 1, 2018 11:16
Change `req.user.id` to `req.user._id` to match middleware _id assignment
@braydonf
Contributor Author

braydonf commented May 1, 2018

Okay, summary of what we've found so far with another run of testing:

  • Line 211 of reports.js is immediately giving failed reputation points, even though there is a case where both sides of the report should be considered. There is a TODO at line 108 of the storage events cron that also hints at this.
  • As @littleskunk has mentioned, there is also an issue with the timeoutRate not being reset that could cause problems. It's an existing bug that should be resolved, as it's closely related to this work.
  • There were mirror/replication events that were trying to mirror from/to the same farmer, which @kaloyan-raev also mentioned above. This may be an existing bug.

@braydonf
Contributor Author

braydonf commented May 1, 2018

Thanks @dylanlott and @kaloyan-raev for the PRs!

@braydonf
Contributor Author

braydonf commented May 1, 2018

There is also an issue we just found with the cron worker processing storage events multiple times, and then recording those events on the user as the totalRate. So in a case where the client is not reporting with SIP9, the unknown rate is still less than the totalRate.

@braydonf
Contributor Author

braydonf commented May 1, 2018

As a follow-up to the previous comment, it looks like the lastTimestamp isn't accurate. For three separate runs of the cron the lastTimestamp was the same, even though the runs were 2 minutes apart:

lastTimestamp: Tue May 01 2018 22:42:06 GMT+0000 (UTC)
lastTimestamp: Tue May 01 2018 22:42:06 GMT+0000 (UTC)
lastTimestamp: Tue May 01 2018 22:42:06 GMT+0000 (UTC)

Braydon Fuller added 4 commits May 2, 2018 12:10
There was an issue where the reputation points were awarded too soon,
without having the full perspective of both client and farmer reports.
This now updates reputation after there is a chance for both of those
reports to arrive and the storage event has been resolved.
There was an issue where duplicate mirror events could be triggered
for each side of the exchange report, once for the farmer and another
for the client. The replication/mirror is triggered by a successful
client report.

There was another issue where a mirror would be attempted from/to
the same farmer. A check has been added to make sure that the
source and destination are not the same.
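The guard described in this commit message might look roughly like this (function and field names are hypothetical, not the actual bridge code):

```javascript
// Sketch: never select the shard's current holder as the destination of its
// own mirror; skip any candidate whose nodeID matches the source.

function selectMirrorDestination(sourceNodeID, candidates) {
  return candidates.find((c) => c.nodeID !== sourceNodeID) || null;
}

const dest = selectMirrorDestination('farmerA', [
  { nodeID: 'farmerA' }, // would mirror onto itself — skipped
  { nodeID: 'farmerB' }
]);
console.log(dest.nodeID); // farmerB
```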
There was previously code that would immediately resolve the storage
event to true once the report was made. However, we've moved that
functionality into one place in the cron worker, so it's necessary
for the resolving to handle this case too.
@braydonf
Contributor Author

braydonf commented May 2, 2018

Okay, all known issues reported here should now be resolved.

@navillasa navillasa merged commit 05b700c into storj-archived:master May 3, 2018