Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async rehydrate with index! #542

Merged
merged 16 commits into from
Jul 2, 2020
Merged

Async rehydrate with index! #542

merged 16 commits into from
Jul 2, 2020

Conversation

raymondjacobson
Copy link
Member

@raymondjacobson raymondjacobson commented Jun 23, 2020

  • Adds a migration that adds two columns to the files table
    -- fileName: the actual queryable source filename stripped of any prefixes
    -- dirMultihash: if the file is in an IPFS directory, the CID/multihash for the parent dir

  • Updates the models accordingly ^ and updates the image upload flow to populate those two columns

  • Allows fallback fetching from FS in /ipfs/:dirCID/:filename endpoint (we previously didn't have this)

  • Adds a GET query param to /ipfs/:CID and /ipfs/:dirCID/:filename fromFS that async dispatches IPFS rehydration and immediately returns query results from filesystem

  • Adds an index to the dirMultihash column

Copy link
Contributor

@SidSethi SidSethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks great. thanks for putting this together. had a few questions.
but also - unless i'm misreading, i dont' think this works? specifically the storagePath.split('/').length === 7 check should always be false?

also just want to make sure this is thoroughly tested not just with clean state locally but against existing data, to make sure that (1) migration does not corrupt non multires-image data and (2) new CID routes do not introduce regressions in any of the code paths

creator-node/src/models/file.js Show resolved Hide resolved
creator-node/src/routes/files.js Outdated Show resolved Hide resolved
creator-node/src/routes/files.js Outdated Show resolved Hide resolved
creator-node/src/routes/files.js Outdated Show resolved Hide resolved
creator-node/src/utils.js Outdated Show resolved Hide resolved
@raymondjacobson raymondjacobson changed the title Async rehydrate [DO NOT MERGE/REVIEW] Async rehydrate Jun 29, 2020
@raymondjacobson
Copy link
Member Author

overall looks great. thanks for putting this together. had a few questions.
but also - unless i'm misreading, i dont' think this works? specifically the storagePath.split('/').length === 7 check should always be false?

also just want to make sure this is thoroughly tested not just with clean state locally but against existing data, to make sure that (1) migration does not corrupt non multires-image data and (2) new CID routes do not introduce regressions in any of the code paths

I rewrote the split('/')[7] thing that gets the "leaf" of the path as a regex with capture so it's more reliable. I think that should hopefully mitigate some concerns. Tested this against staging data. Migration rollback is pretty bulletproof though so I'm not exceedingly worried about anything.

@raymondjacobson raymondjacobson changed the title [DO NOT MERGE/REVIEW] Async rehydrate Async rehydrate with index! Jul 1, 2020
Copy link
Contributor

@dmanjunath dmanjunath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small questions but this looks great!

{ transaction }
)

const files = (await queryInterface.sequelize.query(`SELECT * FROM "Files";`, { transaction }))[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this loads all the files into memory right? do you think that's okay?

Copy link
Contributor

@dmanjunath dmanjunath Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually pretty sure this isn't going to work. this most likely exceeds node heap size, not to mention running a raw select * will take forever cause these tables have millions of records. we need to do this in a paginated way. like first get count, iterate over chunks of like 50k or 100k files. this query seems to work well

SELECT * FROM "Files" ORDER BY "multihash" ASC LIMIT 100000 OFFSET 200000;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated per our convo offline. Left the bulk update in as a comment.

creator-node/src/models/file.js Show resolved Hide resolved
creator-node/src/utils.js Outdated Show resolved Hide resolved
@raymondjacobson
Copy link
Member Author

Pulled out the route changes in this PR, going in separately. Merging this now.

@raymondjacobson raymondjacobson merged commit 6f9c0e9 into master Jul 2, 2020
@raymondjacobson raymondjacobson deleted the rj-async-rehydrate branch July 2, 2020 17:08
raymondjacobson added a commit that referenced this pull request Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants