
[Colossus] Add pruning service #4845

Merged

Conversation

zeeshanakram3
Contributor

@zeeshanakram3 zeeshanakram3 commented Aug 28, 2023

addresses #4813

This PR adds:

  • a cleanup service to prune outdated assets, and a configuration option to enable/disable it on storage-node startup
  • a 'dev:cleanup' CLI command for performing pruning actions (ideally it should not be used in production)
  • an extended '/status' endpoint with additional info, e.g. the sync, cleanup/pruning & serving bucket configurations

How the pruning of assets works

The cleanupService runs the data objects cleanup/pruning workflow. It removes all locally stored data objects that the operator is no longer obliged to keep, either because the data object has been deleted from the runtime or because it has been moved to some other bucket(s).

PRECONDITIONS:

  • Since the cleanup uses the QueryNode to query the data obligations, the QueryNode processor must not lag more than MAXIMUM_QN_LAGGING_THRESHOLD blocks behind the chain; otherwise the cleanup workflow is not performed, to avoid pruning assets based on outdated state.
  • If an asset being pruned from this storage node still exists in the runtime (i.e. its storage obligation has been moved), then at least "X" other storage nodes must hold the asset, where "X" is defined by MINIMUM_REPLICATION_THRESHOLD; otherwise the cleanup workflow is not performed.
  • If an asset being pruned from this storage node is currently being downloaded by some external actor, the cleanup action for that asset is postponed.
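These preconditions could be sketched roughly as follows. The constant names mirror the PR description, but their values and all helper names here are illustrative assumptions, not the actual Colossus implementation:

```typescript
// Illustrative threshold values; the real values live in the storage-node config.
const MAXIMUM_QN_LAGGING_THRESHOLD = 100 // blocks
const MINIMUM_REPLICATION_THRESHOLD = 2 // other nodes holding the asset

interface CleanupContext {
  qnProcessorBlock: number // last block processed by the QueryNode
  chainHeadBlock: number // current chain head
}

// Precondition 1: skip the whole cleanup run if the QueryNode lags too far,
// so pruning decisions are never based on outdated state.
function shouldRunCleanup(ctx: CleanupContext): boolean {
  return ctx.chainHeadBlock - ctx.qnProcessorBlock <= MAXIMUM_QN_LAGGING_THRESHOLD
}

// Preconditions 2 and 3, decided per data object.
function canPruneObject(
  stillInRuntime: boolean,
  replicaCount: number,
  activeDownloads: number
): 'prune' | 'postpone' | 'skip' {
  // Postpone pruning while the asset is being served to external actors.
  if (activeDownloads > 0) return 'postpone'
  // If the object still exists in the runtime (obligation moved elsewhere),
  // only prune when enough other nodes hold a replica.
  if (stillInRuntime && replicaCount < MINIMUM_REPLICATION_THRESHOLD) return 'skip'
  return 'prune'
}
```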

Contributor

@mnaamani mnaamani left a comment

Overall this looks good and will be quite valuable. Left some feedback. Also needs an update from master to fix some merge conflicts.

storage-node/src/commands/dev/cleanup.ts (outdated, resolved)
@@ -62,6 +66,16 @@ export default class Server extends ApiCommandBase {
description: 'Interval between synchronizations (in minutes)',
default: 1,
}),
cleanup: flags.boolean({
char: 's',
Contributor

char 's' is already assigned to sync. Better to remove it, as I don't know what the behavior is if the short flag is used and multiple flags are assigned the same char.

Contributor Author

Done, changed the short flag to c. WDYT?

cleanupInterval: flags.integer({
char: 'i',
description: 'Interval between periodic cleanup actions (in minutes)',
default: 1,
Contributor

I would say this should be much higher, like 1h, maybe even longer?
Depends on how much work/time it takes per cleanup run.

Contributor Author

Yeah, ideally it should be higher than the sync interval. I just set it to 1 min to test the pruning behavior locally. Now I have set it to 6 hrs.

storage-node/src/commands/server.ts (outdated, resolved)
storage-node/src/services/caching/localDataObjects.ts (outdated, resolved)
): Promise<{ dataObjectId: DataObjectId; pinnedCount: DataObjectPinCount } | undefined> {
if (idCache.has(dataObjectId)) {
await lock.acquireAsync()
const id = { dataObjectId, pinnedCount: idCache.get(dataObjectId) as DataObjectPinCount }
Contributor

The return type of the function doesn't seem to match id: dataObjectId is a string, whereas in the return type it is DataObjectId (u64).

Contributor Author

The DataObjectId is also a string, as you can see.

Contributor

Ah yes, I didn't spot that type definition.


const timeoutMs = 60 * 1000 // 1 minute since it's only a HEAD request
const deletionTasks: DeleteLocalFileTask[] = []
await Promise.all(
Contributor

Suggested change
await Promise.all(
await Promise.allSettled(

To avoid the promise being rejected when HEAD request to one data object fails, Promise.allSettled() is more appropriate.

Also, I can imagine that if the number of data objects is potentially very large, doing a massive number of HTTP HEAD requests in parallel might be a bad idea. Perhaps we should do it in batches.

Contributor Author

Done, but I had to change the target option in tsconfig.json to es2020, as Promise.allSettled does not seem to be available in es2017.
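The batched Promise.allSettled approach discussed above could be sketched as follows; processInBatches is a hypothetical helper, not the PR's actual code:

```typescript
// Run an async task over items in fixed-size batches with Promise.allSettled
// (ES2020), so one failed HEAD request neither rejects the whole run nor
// floods the network with thousands of parallel requests.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<R>
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = []
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize)
    // allSettled collects both fulfilled and rejected outcomes instead of
    // short-circuiting on the first rejection like Promise.all.
    results.push(...(await Promise.allSettled(batch.map(task))))
  }
  return results
}
```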

Contributor

@mnaamani mnaamani left a comment

One final small change and this should be ready to test in production :)
And updating from master to resolve merge conflict.

dataObjectId: string
): Promise<{ dataObjectId: DataObjectId; pinnedCount: DataObjectPinCount } | undefined> {
if (idCache.has(dataObjectId)) {
await lock.acquireAsync()
Contributor

Shouldn't the lock be acquired before checking idCache.has()?

Contributor Author

Addressed. However, placing the lock before the if condition can lead to unnecessary blocking if the dataObjectId is not present in the cache. So I reimplemented it by first checking the cache without acquiring the lock: this is a quick operation, and if the dataObjectId is not present we just return without the overhead of acquiring the lock.
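The check-then-lock pattern described above could be sketched roughly like this, with a simplified async mutex standing in for the node's actual lock; all names here are illustrative, not the PR's real code:

```typescript
// Minimal FIFO async mutex: acquireAsync resolves with a release function
// once all earlier holders have released.
class SimpleLock {
  private tail: Promise<void> = Promise.resolve()
  async acquireAsync(): Promise<() => void> {
    let release!: () => void
    const gate = new Promise<void>((resolve) => (release = resolve))
    const prev = this.tail
    this.tail = prev.then(() => gate)
    await prev
    return release
  }
}

const idCache = new Map<string, number>()
const lock = new SimpleLock()

async function pinDataObject(dataObjectId: string): Promise<number | undefined> {
  // Fast path: a plain Map lookup is cheap, so bail out early without
  // paying the cost of acquiring the lock.
  if (!idCache.has(dataObjectId)) return undefined
  const release = await lock.acquireAsync()
  try {
    // Re-check under the lock: the entry may have been removed while waiting.
    const count = idCache.get(dataObjectId)
    if (count === undefined) return undefined
    idCache.set(dataObjectId, count + 1)
    return count + 1
  } finally {
    release()
  }
}
```

Note the re-check inside the critical section: the lock-free has() check is only an optimization, so the authoritative read still happens under the lock.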

@mnaamani mnaamani changed the base branch from master to colossus-beta November 17, 2023 05:52
@mnaamani mnaamani self-requested a review November 17, 2023 05:53
Contributor

@mnaamani mnaamani left a comment

I will merge this work into a new branch colossus-beta where we can do final tweaks for a "beta" release.

@mnaamani mnaamani merged commit c9ca6f9 into Joystream:colossus-beta Nov 17, 2023
22 of 23 checks passed