Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File storage directory nesting #1056

Merged
merged 29 commits into from Nov 17, 2020
Merged

Conversation

dmanjunath
Copy link
Contributor

@dmanjunath dmanjunath commented Nov 12, 2020

Trello Card Link

https://trello.com/c/dyViGxGG/1654-track-upload-errors

Description

The /file_storage directory is susceptible to hash collisions in ext4 file systems using the half md5 hashing algorithm because we write all files to one directory. Moving to a nested directory model should alleviate these concerns.

Services

  • Discovery Provider
  • Creator Node
  • Identity Service
  • Libs
  • Contracts
  • Service Commands
  • Mad Dog

Does it touch a critical flow like Discovery indexing, Creator Node track upload, Creator Node gateway, or Creator Node file system?

Delete an option.

  • 🚨 Yes, this touches how files are stored in the file system

How Has This Been Tested?

  1. Created an account, updated my profile picture and cover photo, uploaded a track with the old file storage system. Fetch to metadata and track segment via the ipfs gateway. Fetch a dir image via the ipfs gateway
  2. Applied the new file storage system and uploaded more tracks and updated my profile picture and cover photo again. Fetch to metadata and track segment via the ipfs gateway. Fetch a dir image via the ipfs gateway
  3. Used the dapp to play the tracks, view the track cover art, my profile picture and cover photo
  4. Make sure a sync was triggered and verify clock state for both nodes was the same in the db
  5. Verify the number of files in both volumes were the same and the directory structures were identical by listing out all the files, sorting and doing running the diff command
  6. Disable fetching through IPFS for sync, only other cnode gateways
  7. Upload another track to verify that sync triggered and fetched all files via the gateways
  8. Randomly sample and fetch metadata and track segments via the ipfs gateway
  9. Randomly sample and fetch dir images via the ipfs gateway
  10. Swap the primary between the two nodes, upload another track and hit the ipfs gateway to fetch the content again

Copy link
Member

@raymondjacobson raymondjacobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems pretty good to me. Maybe we can test together?

creator-node/src/diskManager.js Outdated Show resolved Hide resolved
creator-node/src/diskManager.js Outdated Show resolved Hide resolved
creator-node/src/fileManager.js Outdated Show resolved Hide resolved
@dmanjunath
Copy link
Contributor Author

@raymondjacobson @SidSethi @roneilr @hareeshnagaraj i think this is ready for a final review. I tested a bunch locally.

Copy link
Member

@raymondjacobson raymondjacobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get it on staging! I'm down to very thoroughly test after.

@dmanjunath
Copy link
Contributor Author

Let's get it on staging! I'm down to very thoroughly test after.

Yeah I'm planning to do this tomorrow. We can push the hotfix tag onto staging and test with it.

Copy link
Contributor

@hareeshnagaraj hareeshnagaraj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just some minor suggestions - can chat offline if easier

creator-node/src/fileManager.js Show resolved Hide resolved
// regex match to check if a directory or just a regular file
// if directory will have both outer and inner properties in match.groups
// else will have just outer
const matchObj = DiskManager.extractCIDsFromFSPath(expectedStoragePath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this get named better? matchObj vs dirMatchObj or something more descriptive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't call it dirMatchObj cause it can be a a non-dir file. i guess we could call it cidMatchObj?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking imo, if it works thats more impt

creator-node/src/diskManager.js Outdated Show resolved Hide resolved
*/
static computeBasePath (fsDest) {
if (!fsDest || fsDest.length < 4) throw new Error(`Please pass in a valid fsDest to computeBasePath. Passed in ${fsDest}`)
if (fsDest.includes('/')) throw new Error('Cannot pass in a directory path into this function, please pass in the leaf dir or file name')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is guaranteed to always be a CID, can we improve this validatio by checking that format explicitly? i think theres helpers somewhere

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a blocker but if easy to do might be worth it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not always a CID (eg the case of a the original track artifacts uploaded it's a folder with a UUID name). also i don't want to add extra validation in the event we want to generalize this function because that may impact backwards compatibility

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense agreed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i had a similar comment about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this doesn't accept uuid's anymore. but should we do CID validation, i'm honestly not convinced one way or the other @hareeshnagaraj @SidSethi

Copy link
Contributor

@SidSethi SidSethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is awesome. thanks for doing and testing man
had some comments/questions but nothing too major

creator-node/src/diskManager.js Outdated Show resolved Hide resolved
creator-node/src/diskManager.js Outdated Show resolved Hide resolved
const matchObj = DiskManager.extractCIDsFromFSPath(expectedStoragePath)

// if this is a directory, make it compatible with our dir cid gateway url
if (matchObj && matchObj.isDir && matchObj.outer && fileNameForImage) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if isDir = true and fileNameForImage = null shouldn't we throw an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's confusing but we actually don't want to throw an error. there's a chance we could get the multihash via IPFS and store it in the expectedStoragePath. the only thing this if statement does is format the ipfs gateway urls in case we need to fall back to the gateways

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i'm missing something but this seems wrong? also unclear what this means

there's a chance we could get the multihash via IPFS and store it in the expectedStoragePath

i don't think the error case i mentioned is possible given how this fn is called so its fine, just want to make sure i am reading code correctly

creator-node/src/fileManager.js Outdated Show resolved Hide resolved
creator-node/src/diskManager.test.js Outdated Show resolved Hide resolved
creator-node/src/diskManager.test.js Show resolved Hide resolved
it('Should return null if extractCIDsFromFSPath is passed in no valid CID', function () {
const path = '/file_storage/files/QMcidhere'
const matchObj = DiskManager.extractCIDsFromFSPath(path)
assert.deepStrictEqual(matchObj, null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you choose to have it return null instead of throw error in this case? am concerned we might see some silent failures if we don't explicitly throw

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's just the api i thought of. if you feel strongly i can change but didn't seem necessary to have a try/catch around this when we can just have an if check

Copy link
Contributor

@SidSethi SidSethi Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we don't necessarily have to try-catch this in each caller since they're all wrapped in handleResponse anyway, my concern would be if the function occasionally returns null and we have no awareness. i guess why return null if there's no valid scenario where that should happen. throwing seems safer 🤷
but will leave up to you, not huge deal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol def ignore for now

Copy link
Contributor

@hareeshnagaraj hareeshnagaraj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving final approval to @SidSethi looks good from my end - added some questions but no blockers

*/
static getTmpTrackUploadArtifactsPath () {
const dirPath = path.join(config.get('storagePath'), 'files', 'tmp_track_artifacts')
this.ensureDirPathExists(dirPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can see the argument for creating the CID shard dirs on demand but in this case we should def just create the tmp dir at the beginning and not have to check if it exists every single time. seems like unnecessary overhead
i'm glad we are encapsulating all ops here, but if we make above change, we can just build this path at diskManager init and store it as a class var instead of re-concating every time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we have on the order of a few hundred checks for tmp_track_artifacts but we have like tens of thousand cp or move ops? i don't think this is an optimization we need now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really sure what this means but with this code every track upload will call fs.mkdirSync for no reason. we should just avoid? if you want i can make this change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for making fix

creator-node/src/diskManager.js Show resolved Hide resolved
creator-node/src/fileManager.js Outdated Show resolved Hide resolved
Copy link
Contributor

@SidSethi SidSethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@@ -49,15 +49,15 @@ describe('test AudiusUsers with mocked IPFS', function () {
it('successfully creates Audius user (POST /audius_users/metadata)', async function () {
const metadata = { test: 'field1' }
ipfsMock.add.twice().withArgs(Buffer.from(JSON.stringify(metadata)))
ipfsMock.pin.add.once().withArgs('testCIDLink')
ipfsMock.pin.add.once().withArgs('QmYfSQCgCwhxwYcdEwCkFJHicDe6rzCAb7AtLz3GrHmuU6')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmao we only call this in health check now 🤦 k whatever, will rip out of tests and health check separately

@dmanjunath dmanjunath merged commit 3ce3ec1 into master Nov 17, 2020
@dmanjunath dmanjunath deleted the dm-file-storage-nesting-hotfix branch November 17, 2020 03:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants