Disk space usage and version inventory files #367
My understanding is that the SHA512 checksums are a large part of the bytes in those inventory files. Storing SHA256 instead could be a ~50% savings in inventory bytes. I wonder if SHA256 should be recommended instead of SHA512. (This would be a byte-size savings on any OCFL repo with many files, regardless of whether many versions or just many independent objects.) SHA256 plus only one copy of the inventory file (instead of an additional copy in the version dirs) could then be a ~75% reduction in bytes. (It's still going to be uncomfortably many bytes, or would that be cool? Not sure.) Both of these things would, I think, be allowed by the spec now (if one ignores one or more
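The ~50% figure follows directly from the digest sizes. A quick sketch using Python's standard `hashlib` (the sample content is arbitrary; only the hex-digest lengths matter, since OCFL inventories store digests as hex strings):

```python
import hashlib

# Arbitrary sample content; only the resulting digest lengths matter here.
data = b"example file content"

sha512_hex = hashlib.sha512(data).hexdigest()
sha256_hex = hashlib.sha256(data).hexdigest()

# Each manifest/fixity entry in an inventory carries one of these per file.
print(len(sha512_hex))  # 128 hex characters
print(len(sha256_hex))  # 64 hex characters
```

So every file entry that stores a SHA512 digest spends 128 characters where SHA256 would spend 64, which is where the roughly-half estimate for digest-dominated inventories comes from.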
Yes -- this is the core use case for OCFL: Many files, with one or more versions.
The 'SHOULD' was put there, as far as I remember, to help facilitate migration from older folder-structure layouts. It was 'MUST' for several iterations of the spec, but after feedback the editors decided to change it. It is strongly suggested that copies of the inventory file be maintained in the version directories.
We had tossed around the idea of storing version directories themselves as ZIP (or Gzip) archives, but felt we needed to get the basics of the storage mechanism right first before looking at optimizations. This may come back in v2 of the spec. Watch this space. There is no requirement that the JSON contain spaces or line-endings. In my experience significant savings, especially for large JSON files, can be made by removing whitespace characters. It's not a particularly clever method, but a quick check using one of our IIIF manifests is 2.3 MB formatted, 1.1 MB raw (48% reduction).
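The whitespace savings are easy to reproduce. A minimal sketch in Python; the structure below is a toy stand-in with inventory-like 128-character hex keys, not a real OCFL inventory:

```python
import json

# Toy manifest-like structure: two 128-character hex keys mapping to paths.
doc = {
    "manifest": {
        "a" * 128: ["v1/content/file1.txt"],
        "b" * 128: ["v1/content/file2.txt"],
    }
}

pretty = json.dumps(doc, indent=2)
compact = json.dumps(doc, separators=(",", ":"))  # drop spaces after ',' and ':'

# Minification changes only the serialization, never the parsed content.
assert json.loads(pretty) == json.loads(compact)
print(len(pretty), len(compact))  # compact is noticeably smaller
```

For real inventories the savings depend on how deeply indented the pretty form is; heavily nested, deeply indented JSON (like the IIIF manifest example above) benefits the most.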
The use case OCFL is designed to address is not as part of a workflow system; it is for 'objects at rest'. While there certainly are cases where an object goes through multiple versions in the process of being accessioned, the intended use of OCFL is to store a 'settled' version of the object. Changes over the object's lifespan are, of course, common, but in gathering our use cases we did not hear about objects with thousands of versions. Off the top of my head, I believe the most-changed object that Stanford stores in Moab has roughly 20-30 versions. Most have one to five. @julianmorley could probably give you a more accurate answer.

I would be interested in seeing the inventory.json that was produced. Are you storing the checksum for each version, or are you making use of the forward-versioning capabilities in OCFL to store the differences? The spec permits the use of SHA256 as the digest algorithm if the difference in disk space usage is critical for your use case: https://ocfl.io/draft/spec/#digests

The reasons for the recommendation of SHA512 have been explored quite extensively in GitHub issues, use cases, and community calls. See: #7 #8 #21 #290, https://groups.google.com/d/msg/ocfl-community/TU2zNYex0ao/w-hsJJWQBgAJ, https://github.com/OCFL/spec/wiki/2018.11.07-Editors-Meeting.
Attached is the inventory file with a lot of versions. I had to rename it to upload it. The other one is too large to be uploaded directly without jumping through additional hoops, but let me know if you'd like to see that one too. Stripping white space is a good idea and will definitely save space. I was hoping not to have to do that because it makes the files so hard to read, but it seems like I'll need to at least make that option toggleable. I'll rerun the test tomorrow without the whitespace.
Thanks for the response @ahankinson
I think it would be valuable for the specs and/or implementation guidance to be clear on this, or some people are going to try to use it for cases it wasn't designed for and find themselves in pain. Arguably this has already started, if you consider Fedora's plan to use OCFL to be "as part of a workflow system", which may depend on how users of Fedora use Fedora, and also on what you mean by "workflow system" vs "objects at rest". So such guidance could say more about that too.

Obviously you intend to support versions, since they are such a big part of the spec; so maybe you mean "we expect most objects to have only one version, those that have more than one to have only a few, and versions to be made infrequently"? Not sure. Especially when put together with your statement that "many files in one object and many versions in the same object" IS the core use case for OCFL.

This is relevant, for instance, because I have observed discussions on the Slack with Fedora implementers trying to decide how to use OCFL, and guidance on the use cases OCFL means to support would probably help some of those discussions come to resolution.

[edit: deleted stuff where I was confused about what was going on with inventories in root vs version directories, I don't understand it enough to speak on it!]
For reference, I ran the same two tests again to try sha512 without pretty json, sha256 with pretty json, and sha256 without pretty json. Here are the results: https://docs.google.com/spreadsheets/d/1dffxLQoLP26dSEx39hlhXCBw5VejIt3Ig4zliPHoUpQ/edit?usp=sharing
@pwinckles thanks for doing those numbers - that's great. I do see section 2.1.5 in the implementation notes, where it recommends packaging many small files together: https://ocfl.io/0.3/implementation-notes/#objects-with-many-small-files. @ahankinson and other editors - do you think it would be helpful to add a note about OCFL not being for a "workflow system", as @jrochkind suggested? It might be helpful to explicitly note that 20-30 versions is expected, rather than 2000-3000? Or, maybe you could add a note that if an object gets too many versions, the many inventory.json files will take a lot of disk space?
I think having a conversation about the intended uses in the community call would be useful prior to adding a specific note about not being part of a 'workflow' system. The challenge I see is that adding a new version is a valid part of a curation workflow, so overly broad language might lead to confusion in the other direction, where we might be seen to discourage the creation of new versions. Likewise, I don't think it would be accurate to say that OCFL cannot be part of a workflow system, but the decision in the application on when to write the output of the workflow should reflect some idea of the settled state of the object.

The most appropriate analogue I can think of to help clarify what I mean is that of a git commit. You wouldn't create a new commit after every keystroke or line, since that would be too many. In idealized use, a git commit generally represents the settled state of an application's source. Most programmers have an instinctive sense of the difference, but specifying what is and is not a 'commitable' change varies widely, from a single character to several hundred lines.

Likewise, creating an OCFL version when an object passes from one stage to another in an accession workflow (for virus scanning, metadata checking, metadata double-checking, file format adjustments, etc.) likely does not represent the 'settled state' of an object, but it is entirely possible that a routine operation (such as adding a missing file or correcting a mistake in the metadata) can lead to a new version being made. The challenge will be how to communicate that in a way that encourages implementers to see versioning as a natural part of long-term preservation, while also ensuring that implementers do not shy away from OCFL because the requirements of the spec introduce inefficiencies for storing the specifics of their application state.
I agree about a conversation about intended uses in the community. Of late, there has been increased input from members of Fedora's community (which has been nice to see!), which has caused focused scrutiny on Fedora's relationship with OCFL. Fedora straddles a line between access, management, and preservation.

The initial proposed relationship between Fedora and OCFL (the one proposed to our leaders group, and accepted) was one whereby the act of creating an immutable version of an object (as defined in the Fedora API) resulted in publication to OCFL. "Unversioned" content that can be updated and mutated at will would be persisted elsewhere until a request to create a version came along. In this scenario, you can see Fedora as supporting a workflow for updating/managing objects, then shipping them off to preservation at defined points.

The problem with the above scenario is that some content in Fedora is "in OCFL", and some isn't. There are some users of Fedora who never create explicit immutable versions of objects. The idea that Fedora isn't really "preserving" anything without explicit action (and the idea that some Fedora content may perhaps unknowingly be absent from OCFL) was hard to explain, confusing, and discomforting to some. An alternate solution where Fedora always writes every change to OCFL seemed to get broad approval. The analogy to git is apt here, whereby each mutation of the repository through the API (either on its own, or bundled up with others in a 'transaction') corresponds to a "commit", and the act of creating an explicit version in Fedora is analogous to the act of tagging.

The problem here seems to be that it is easy to use the repository in ways that don't necessarily align with the intended purpose of OCFL. On the technical side, this may possibly lead to the ballooning object size problem for some usage patterns. Maybe @awoods or @rosy1280 can put on your Fedora hats and comment?
Do the OCFL editors see either scenario as particularly problematic? Also, this part caused considerable confusion among the Fedora developers:
The spec reads the opposite way; pretty much everybody took the
I'll let @awoods or @rosy1280 weigh in about the first bit. For the second, I can completely see how that would be confusing, and I think that's on us to sort out how we make that clearer. The intention with the use of SHOULD is that it is highly recommended, but for legacy reasons we relaxed it from a MUST. I can't find the exact issue at the moment, but this one mentions the decision: #293

The reason, however, for having a redundant inventory file is so that no single act of changing the object can result in an unreadable object. We assume that the object is at its highest risk of loss when it is being changed. Suppose the operation producing vN did not complete successfully and left the object in an inconsistent state: the redundant copies allow the last good state to be recovered. Without this redundancy it is entirely possible for an OCFL object to have an irreconcilable version history and latest state, since it will depend on just a single copy of the inventory file.
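The recovery check described above can be sketched with the inventory's digest sidecar. This is an illustrative helper, not part of any official OCFL library; it assumes the common sidecar layout of a hex digest followed by the filename:

```python
import hashlib
import pathlib

def inventory_matches_sidecar(obj_root: pathlib.Path) -> bool:
    """Verify the root inventory.json against its sha512 sidecar.

    Assumes the sidecar file contains "<hex digest> inventory.json".
    Hypothetical helper for illustration only.
    """
    inv_bytes = (obj_root / "inventory.json").read_bytes()
    expected = (obj_root / "inventory.json.sha512").read_text().split()[0]
    return hashlib.sha512(inv_bytes).hexdigest() == expected
```

If the root copy fails this check after an interrupted write, a client can fall back to the copy in the highest version directory that does verify, which is exactly the safety net the redundant copies provide.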
First, apologies for being late to this thread and many thanks to @ahankinson for being on top of things. His use of Git commits to explain versioning in OCFL is useful, and I agree that providing specifics will create barriers to adoption. Quite frankly, explicitly defining what constitutes a version is like asking for the definition of a collection: everyone will give you a different answer.

Next, I'd like to better understand what happened during the course of the performance tests @pwinckles did. For the many-versions test, it would be good to understand how large the updated file was at v1 and whether or not it grew at each version, and by how much. For both tests, can you tell us how the inventory files changed?

Finally, @birkland, you mention a few things that I think I need clarity on. You say that some Fedora content will be in OCFL and some won't. Can you help me understand what content won't be in OCFL? You also mentioned that Fedora might be used in a way that doesn't require users to commit an immutable version. Do you have examples of those use cases? I have to wonder: if someone isn't using Fedora for preservation or management purposes, does it matter that an object isn't in OCFL?
In the many versions test, the contents of a single file (the same file each time) were overwritten in each version. This file only contained the current version number, so it was about as small as it could be without being empty (naturally slightly larger for higher version numbers). The object itself contained 9 files in every version, 8 of which never changed. I attached the final inventory file for it earlier. For the many files test, it was more or less the same setup, except instead of overwriting a single file in each version, a new, similarly tiny, file was added in each version.
Are you asking for the rate of change between versions? Or something else? The tests I ran were ad hoc experiments, more back-of-the-envelope calculations to see what happens to an object when it is versioned a lot with minimal changes.

Currently, Fedora is planning on putting all content in OCFL. For a period of time, maintaining unversioned/staged content outside of OCFL and versioned content inside OCFL was discussed, but that is not the approach that is presently being pursued. Fedora allows users to update objects without versioning them. They can choose to stamp a version on the objects or just leave them. In the approach where everything is always stored in OCFL, every change to an object, regardless of whether or not it was versioned in Fedora, would be versioned in OCFL. The question then became: if a user makes a large number of updates (or a small number of updates to an object with a lot of files) to an object, what would happen to the OCFL object? That is why I ran the experiments that spawned this issue.
@pwinckles can you tell us what the size (in bytes) of the inventory files was?

Thank you for the explanation. I still have to question whether or not it matters if a user chooses not to commit a version to Fedora. If they aren't choosing to commit a version then it sounds like they aren't using Fedora for preservation or management, at which point, does it matter that it isn't in OCFL? That being said, I would like to understand better how Hyrax and Islandora commit things to Fedora before identifying which approach makes the most sense. @rotated8 can you tell me if Hyrax commits a new version to Fedora upon each update, or is it something you configure in Fedora that Hyrax doesn't care about?
Yes, I can get you the exact numbers when I'm at my work computer on Monday. But, if I remember correctly, the increase is linear, so I would expect the version-to-version increase for the many versions test to be around 1,870 bytes (sha512 pretty json) and the version-to-version increase of the many files test to be around 1,118,052 bytes (sha512 pretty json).
@rosy1280 I can't comment about Islandora 7, but I do know that Islandora 8 does not currently use the versioning capabilities of Fedora 5, though it's on the radar to figure that out. I'd be surprised if Hyrax does, but I honestly know nothing about it. There hasn't been as much of a broad culture of versioning in Fedora 4 and 5 as there had been in Fedora 3, so I suspect it's rare among Fedora 4 or 5 users.

The notion of Fedora 6 committing everything to OCFL emerged in the past few weeks, so that's why issues related to lots of versions and "fitness for purpose" of OCFL are emerging just now in community discussion around Fedora. In the original fcrepo6 proposal, the act of creating a version was the point at which "at rest" content was created and shipped to OCFL. The mutable unversioned content would not have been in OCFL (but would have been durable and rebuildable).

To answer your question directly, those who are not using Fedora strictly for preservation of finished "at rest" content do not strictly need OCFL (but would likely be fine with it as long as there is no serious technical drawback to doing so - we're assessing whether disk space or inventory bloat is a serious technical drawback in practice). There are, however, members of the community who feel that the active/unversioned content that has historically been supported by Fedora should specifically not be in OCFL.
@rosy1280, I updated the spreadsheet with the data you requested. It's on sheet 2.
@rosy1280 To the best of my knowledge (and @no-reply will understand this better than I do), I do not believe Hyrax interacts with Fedora's versioning at the object level, although files may be versioned, if a user explicitly chooses to create one. To create a new version, you are required to upload a new file, and no mechanism exists for creating a version for metadata changes alone. I will defer to @no-reply for a better understanding.
I am pretty sure that hyrax does not use Fedora "versioning" for metadata at all right now. If there is any fedora versioning (keeping track of changes/past versions) of metadata at all going on in hyrax, it is not exposed in any hyrax UI as far as I am aware. There is no way to see or revert to past states of metadata offered in hyrax.

If we imagined hyrax using a fedora that used OCFL as a back end, such that all fedora updates were written to the OCFL store, that would necessarily be a change, as there is no way to "write to OCFL" without "creating a version". Every persisted change to metadata would be "a new version in OCFL", whereas at present every persisted change to metadata does not, I believe, result in a "fedora version", nor is there a way to "undo" using fedora. So, if we're worried about how many versions "typical" use would create under such a scenario, "how many versions would an object end up with" would be roughly answered by "how many times during the life of an object will/did someone make an edit to metadata and press 'save'" (a programmatic/batch 'save' would also count, of course). I'm not sure how/if that information is available for existing hyrax installations.

Of course, if hyrax did not send an update to fedora every time a change to metadata had to be persisted, but only sent an update to fedora for some "preservation version should be created" kind of event, and kept its "working" persistent data somewhere else (fedora is not its main persistence store, but just a tool used for making a preservation copy at certain manual or automatic defined points), that would be another story, but it would require some changes to how hyrax approaches things. It would mean hyrax would need some "persistence store" in addition to fedora, even if it were using fedora.
So if I understand @birkland's example: When an Islandora user uploads a new file, it replaces the file currently in Fedora -- it does not make a call to Fedora to say "version this new file I'm uploading", it just overwrites the file. @birkland how difficult would it be for Islandora to change that? Or better yet, would it even need to change that?

Contrast that with what @rotated8 said (and @no-reply just said in a meeting I was in with him), which is that every time a new file is uploaded in Hyrax, Fedora creates a new version of the file.

Hyrax uses Fedora's RDF to store metadata. @birkland, above you mentioned that some components of an object would not be in OCFL. Is Fedora's RDF something that could change without it being put "in OCFL"? Is Fedora's RDF preserved when an object is put "in OCFL"?

Hyrax does have the concept of workflows, so I also wonder if it's possible for Hyrax to solve this problem itself (assuming it's deemed necessary), so that a step in the workflow is "I edited this metadata, now make an immutable version in Fedora." Perhaps that's also a question for @no-reply.
@rotated8's summary seems right to me: Hyrax does not create object-level versions, but does create new file versions normally when editing files. Creating a file version in Fedora is a routine side effect of editing a file (whether through the UI or through provided internal APIs). The versions are exposed in the UI through the Edit File interface; the versions tab is linked directly from file info on the main Work page. This provides an easy restore. Restoring creates a new version, identical to the selected previous version.
@pwinckles Thanks for providing that. Because I'm visual, I turned it into a spreadsheet with a chart! https://docs.google.com/spreadsheets/d/1xpbQfgDSIFXXxOw-mKhiRT0MTcnmH8t98lqSBA8U2yQ/edit#gid=350102796 It looks like the rate of change is linear, so the best way to stop the growth is to start at the beginning. I wonder if that's something the editors can look into. (And maybe this should be a separate ticket from the rest of what is happening on this thread...)
This seems like a good question to me. The best way that comes to mind for this to be handled on the Hyrax side is to serialize the metadata and store it as a file. I say "best", but this leaves a lot to be desired and is probably better called the "only/least worst" way. As of now, there's no concept of versioned metadata updates, or of Object versions, in Hyrax.
My understanding is that the original point of this thread is to discuss this problem, and somewhere along the line it devolved into implementation details for Fedora and Fedora-based software. The fact of the matter is that there are serious storage implications for using OCFL to store objects with numerous versions. If this is not something that can/will be addressed in the OCFL spec, then Fedora needs to evaluate how and to what extent it should use OCFL, with the understanding that some users may generate a large number of versions. An extreme example of a real-life Fedora 3 object that we looked at today had only 6 files but over 35,000 versions. A back-of-the-envelope estimation of how much space would be required to store that object's cumulative inventory files if it were an OCFL object was over 700GB (sha512, not pretty printed). To me, that doesn't seem reasonable.
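The ~700GB figure is easy to sanity-check: each version directory retains a full copy of the inventory as it stood at that version, and the inventory grows roughly linearly, so the cumulative cost is quadratic in the version count. A sketch, assuming growth of about 1,140 bytes per version (an assumed rate consistent with the earlier minified-sha512 numbers, not a measurement of this particular object):

```python
PER_VERSION_GROWTH = 1_140  # bytes added per version; assumed, not measured
VERSIONS = 35_000

# Version n's inventory is roughly n * PER_VERSION_GROWTH bytes, and a copy
# lives in every version directory, so the total is the sum over all n.
total = sum(n * PER_VERSION_GROWTH for n in range(1, VERSIONS + 1))
print(f"{total / 1e9:.0f} GB")  # prints "698 GB"
```

The quadratic term (35,000 × 35,001 / 2 ≈ 6.1 × 10^8 increments) is what makes thousands of versions so much worse than dozens: 30 versions under the same assumptions would total well under a megabyte.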
@pwinckles thank you for that feedback. Are you telling me that you have an example of an object whose files changed significantly 35,000 times?
I cannot speak to the significance of the changes (it's not my object). All I am saying is that several of the Fedora 3 repositories that we examined contained outlier objects with large numbers of versions, which is not something that OCFL handles gracefully. I understand if this is not a use case that OCFL was designed to support, but it circles back to @bcail's questions in the original post and @birkland's question of "fitness of purpose."
I would like to see actual use cases that require an object to be versioned that much, rather than it perhaps being the result of suboptimal coding. It would seem that such objects-in-motion should reside in the OCFL workspace area (which is defined), where the content and inventory can be updated in-place in an OCFL compliant structure, but only migrating to the persistent OCFL structure when a version needs long term retention.
I’ll defer to someone else to provide use cases, but it would seem to me that the two obvious cases for large numbers of versions are 1) as part of a “workflow system” and 2) updates to a higher-level object that contains references to a numerous and expanding array of child objects. A third possible use case may be certain types of metadata updates. Let’s consider the deposit directory though. For reference, the spec has the following to say about it:
My reading of that is that the deposit directory is intended to be used to construct new object versions immediately prior to moving them to the object root. If the deposit directory is intended to support long-lived version creation (in Fedora’s case it could be indefinitely long), then I have a slew of implementation questions but here are the ones that I think are most pertinent to the spec.
We talked about the deposit directory at some length on Fedora tech calls. I have no problem using it to stage versions before they're finalized. However, from Fedora's perspective unversioned content must be durable and long-lived. It feels wrong to me to build an OCFL library that extensively uses the deposit directory to maintain staged content indefinitely without the spec sanctioning this interpretation. Doing so would essentially make the repository unusable by any other OCFL implementation.
Good questions!
At the moment, I think the Fedora/Hyrax/Islandora community still has to reach consensus as to what they want Fedora to be and do. Fedora content has, to date, not actually been particularly durable, and durability involves some trade-offs.
I think this is a side issue, but since you all brought it up:
That has not been my conclusion at all trying to approach it, even without the "deposit" directory, just from concurrency control in adding versions to an OCFL object. But I'm not sure what you mean by "per-object transaction IDs". We were having extensive discussion of concurrency in Slack over the past few months, and I don't think that concept came up.

Concurrency issues are especially difficult with S3 as storage, which does not offer any atomic "directory move" operation, does not offer any file locks (like an ordinary file system does), and in fact only guarantees eventual consistency on any add/update operations. In my attempts at working it out, it seems quite difficult to deal with concurrency issues, and using S3 (I think) requires an external system gatekeeping (and probably keeping a copy of at least some parts of the S3 inventory/manifests, because of S3's eventual consistency). But it is not obvious to me how to handle it simply even on a local file system (which may exclude NFS as well, though, note well).

But I actually don't understand what is meant by "per-object transaction IDs", so there may be an approach I don't know. An example (say, a pseudocode example) would be very helpful, so we're all talking about the same thing, and maybe it's simpler than I think. (Further discussion should probably be somewhere other than this ticket.)
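For contrast with S3, the usual local-filesystem approach is to build the new version in a staging directory and publish it with a single rename, which is atomic on POSIX when source and destination are on the same filesystem. A sketch; the helper name and layout are hypothetical, not from any OCFL library:

```python
import os
import pathlib
import tempfile

def install_version(obj_root: pathlib.Path, version: str, build) -> None:
    """Build a version in a hidden staging dir, then publish it atomically.

    os.rename() is atomic on POSIX for same-filesystem moves, so readers
    never observe a half-written version directory. S3 has no equivalent
    operation, which is why external coordination is needed there.
    Hypothetical helper, for illustration only.
    """
    staging = pathlib.Path(tempfile.mkdtemp(dir=obj_root, prefix=".staging-"))
    build(staging)  # caller writes content files and the inventory here
    os.rename(staging, obj_root / version)  # the atomic publish step
```

Note this only protects readers from partial writes; if two writers race to create the same version, the second rename fails (the target exists and is non-empty), which detects the conflict but does not by itself coordinate the writers.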
@neilsjefferies I am wondering about something you said
This seems to indicate that OCFL is designed with a particular data model in mind, and perhaps that could be described, which would better help us understand how to use it. For instance, I would assume that for a book object I might want to store the book "object" and page "objects" together in a single OCFL object, to ensure they stay together and to take advantage of the human-readable structure. But it sounds like you are saying it would be better to have each page separate, and some file in the book that ties them together.
For what it's worth, our object with 35k modifications is a collection object where the changes to it are primarily membership changes. The collection has been actively managed by staff for roughly 5 years, with additions and rearrangement. The modifications are not to versioned datastreams in this case.
Interesting... I had never considered using OCFL in that fashion. In my mind, purging was only ever intended to erase files that absolutely needed to be removed from the filesystem. Based on what you're saying, it would seem like there should be no qualms about supporting a squash-like operation in order to cleanup/trim versions, so long as it resulted in a new object. Is that correct?
...another approach is to have objects assert their membership of the collection themselves, but not vice versa. The collection object effectively defines an index to be created that contains all the objects making the membership assertion, but does not need to be versioned when an object is added. RDF inference is kind of made for this sort of optimisation.
@neilsjefferies Yes, I understand how S3 works, more or less, at that level. Are you suggesting S3 is not suitable for OCFL, and only things more like "real" file systems are? There is a lot of interest in OCFL on S3, so there would be a lot of disappointed people if that's true. But if that's "waiting until v2", then it would probably be good for Fedora implementers, for instance, to know that they probably ought not try to do OCFL on S3. My experience with S3 is that enough things are different that a very good design for a local file system may not work at all and may require major changes for S3, so I think it is somewhat dangerous to assume that if you avoid thinking about S3 at all in "v1", you will be able to accommodate it in "v2" without major changes.

This conversation is pulling out all sorts of differing assumptions; I think OCFL is being designed for a more limited set of scenarios/use cases than some onlookers (not directly involved in OCFL spec-making but hoping to use it) may be assuming. This may not be a problem exactly, the OCFL design process may be going exactly according to plan, but expectations in the surrounding community should probably be set properly.
I would suggest that OCFL's fitness-for-purpose for use cases ought to be within the scope of OCFL spec construction discussion, and actual implementation considerations ought to be in scope for discussion. It is fair to say that some use cases are not meant to be supported or not considered for support; but surely you are thinking of some use cases that are meant to be supported, and surely you are considering them when designing the spec, to make sure they, well, work. It's fine if some scenarios/use cases (say, S3) are outside the purview -- so long as there are enough use cases that are supported that you still have a community actually wanting to use OCFL!

But I think this discussion reveals some lack of clarity between what use cases the OCFL editors feel are "outside of scope", and what potential OCFL adopters have been hoping to do with it. Perhaps the potential OCFL adopters have been wrong, and should not have been trying to do those things with it. But it is hard for them to figure out; that's why we're having this discussion. Certainly external system/application requirements can't be entirely irrelevant to OCFL spec design, if you want to have a spec that actually works for external systems/applications. Are you trying to create a spec without feedback from implementers on what things are hard, or what things require special consideration? That would seem ill-advised. This whole conversation is confusing me. Also:
OK, but you're not really talking about "application profile", when you say things like:
You are recommending fitting your data model to the particular trade-offs of OCFL. You might not have to disaggregate them if you were storing in something other than OCFL. But for storing in OCFL, you have particular considerations related to OCFL. You cannot, in fact, simply consider your "application profile" in isolation from the particular details of OCFL; in this conversation, folks are trying to work out how to do that.
@jrochkind If you know how S3 works, then why are you trying to use it for operations for which it is suboptimal? You appear to want S3 to behave more like a filesystem, but appear to be having trouble with that. Amazon have a higher-cost file-system storage offering, so there are specific reasons for the limitations of S3. Eventual consistency models accept that some level of loss is tolerable in the event of concurrency conflicts as a tradeoff against not having to manage locking, so an external mechanism is absolutely necessary! The aspects of OCFL I indicated describe how it can be used to overcome some of these troublesome aspects of S3, so it has been thought about, and OCFL can form part of the required external mechanism. However, it means accepting the design decisions (defined versions, serialised operations, etc.) that are explained in fair detail in the Spec and the Implementation Notes. C'est la vie.
Yes, exactly that. Just like I design different data models for relational, graph and document databases. Unsurprisingly, my suggestions for disaggregating collections or defining them by membership assertions are very much graph-database types of model optimisation. OCFL is not going to try to be all things to all people. It is about getting objects that change relatively slowly into stable storage and helping preserve them. It's not really about high transaction volumes, except perhaps for object creation.
Ah, no. It just creates a new resource (in the LDP sense) as far as I understand. Translated to operations against OCFL, it'd be equivalent to creating a new object containing the new file. So it's not using the versioning API per se, but if you squint hard enough it almost looks like it could be close to a user-space versioning scheme.
Currently, no proposal makes a distinction between RDF/metadata and binaries as far as persistence to OCFL is concerned; they would all be serialized as files alike. It looks like both Hydra and Islandora share the characteristic that editing a metadata form in the UI can result in an update to Fedora. This (changes to RDF) is actually the vast majority of our updates as well. Also,
Ultimately, I think this thread is concerned with "in motion" objects, since that has always been part of Fedora's historical use cases. It uniquely bridges the world between workflows, access, and preservation, and the goal is to do that in the most rational manner achievable. The interest and excitement around OCFL adoption has been remarkable. Success depends on achieving a solution that would be recommended as far as usage of OCFL is concerned. Disk space usage is a technical speed bump, but not entirely the crux of the matter. I wonder what the best venue for hashing this out is. Clearly, approval from the OCFL editors is desired, despite the nature of the topic being almost entirely out of the narrow scope of the spec as it has been defined for 1.0. |
it seems like this is becoming a very different discussion from the issue topic, and i wonder if it wouldn't be possible to summarize the issues related to #367, as they stand now. i would say that i also have concerns about the issue of update frequency as it relates to data models. this is certainly an issue in the Samvera community (very much along the lines of @whikloj's examples), as i'm sure folks are aware. the suggestion in #367 (comment) in particular resulted in a substantial development effort some years back. i also think that OCFL's design goals should be clearly reflected in the use cases it's recommended for. should i consider OCFL in relation to my S3-compatible object store? what about my distributed block store? as workloads move behind containerized abstractions, i'm increasingly wondering whether i'll have any normal "filesystems" in a few years. i'm definitely very curious about where OCFL slots into my technology stack as my environment shifts, and my applications become less and less invested in data atomicity. |
I'm saying S3 is suboptimal for Fedora. Trying to implement a system with workflow elements over a storage layer that intrinsically has no concept of updates, versioning, and ACID-ity is going to be painful. OCFL over the top helps this situation, since inventories provide locking and consistent-state mechanisms, but at the cost of requiring the notion of versioning. I think you will find that most approaches will end up having to invent something similar.

If anything, the gap between S3's design and Fedora's requirements is probably greater than that between Modeshape and Fedora.

I believe the Archipelago Repository platform has already implemented OCFL over an object store in some form -- but not S3.
On 2019-09-11 4:22, Jonathan Rochkind wrote:

> I am not the only one who was considering OCFL on S3. If it is the official recommendation of the OCFL editors/spec that OCFL probably won't work with S3 as it is "suboptimal" for it, making that clear will save a lot of people a lot of trouble.
|
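On the point above that inventories provide consistent-state mechanisms: a minimal sketch of verifying an object's `inventory.json` against its SHA-512 sidecar, which is the basic fixity anchor the spec provides. This assumes the standard sidecar format of the hex digest followed by the filename; it is an illustration, not a full validator.

```python
import hashlib
import pathlib


def verify_inventory(object_root: pathlib.Path) -> bool:
    """Check inventory.json against its SHA-512 sidecar file."""
    inventory_bytes = (object_root / "inventory.json").read_bytes()
    sidecar = (object_root / "inventory.json.sha512").read_text()
    # Sidecar format: "<hex digest> inventory.json"
    expected_digest = sidecar.split()[0]
    actual_digest = hashlib.sha512(inventory_bytes).hexdigest()
    return actual_digest == expected_digest
```

A client can run this check before trusting (or updating) object state, which is part of how OCFL can serve as the "external mechanism" discussed above when the storage layer itself offers no such guarantees.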
A couple notes about the SHA512 "SHOULD" in the spec:

- https://software.intel.com/en-us/articles/intel-sha-extensions - as I understand it, these can speed up SHA256 to the point where it would be faster than 512 on Intel processors. Maybe choosing 512 over 256 for performance reasons should be discussed more?
- https://blog.skullsecurity.org/2012/everything-you-need-to-know-about-hash-length-extension-attacks - are hash length extension attacks a concern for OCFL?
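For scale, the hex digest lengths alone show where the inventory-size difference comes from (a quick illustrative check; the actual savings in a real inventory depend on how many digest strings each file entry carries):

```python
import hashlib

data = b"example file content"

sha256_hex = hashlib.sha256(data).hexdigest()
sha512_hex = hashlib.sha512(data).hexdigest()

# A hex-encoded SHA-256 digest is 64 characters; SHA-512 is 128,
# so each digest occurrence in inventory.json doubles in size.
print(len(sha256_hex))  # 64
print(len(sha512_hex))  # 128
```

Since each file appears in the manifest (and often in fixity and multiple version states), the digest strings dominate inventory bytes for objects with many files, which is where the ~50% figure mentioned earlier comes from.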
|
Possibilities for reducing the size of objects:
OK, tear them apart. :) |
Here's a crack at a summary. It's a long thread, so let me know if I missed something. If Fedora stores every object change in OCFL, there is the potential for generating a large number of “unnecessary” or “insignificant” versions. There are two consequences of this:
This ticket was originally created to discuss the problem of inventory bloat, but the topics discussed have been wide-ranging. The following possible solutions have been proposed:
|
Valid point.

Bizarrely, Intel SHA extensions were released on precisely one processor line at the time... the feeble low-power Goldmonts that appeared almost nowhere. However, AMD added support in Ryzen CPUs, and Intel is finally getting its act together and may support it in upcoming CPUs! Linux has support for the instructions now; no idea about Windows. ARM processors also have SHA acceleration instructions in most modern iterations.

For the second point, OCFL uses hashes for content addressing rather than key exchange, so that is not a concern.
On 2019-09-11 14:14, bcail wrote:

> A couple notes about the SHA512 "SHOULD" in the spec:
>
> * https://software.intel.com/en-us/articles/intel-sha-extensions - as I understand it, these can speed up SHA256 to the point where it would be faster than 512 on Intel processors. Maybe choosing 512 over 256 for performance reasons should be discussed more?
> * https://blog.skullsecurity.org/2012/everything-you-need-to-know-about-hash-length-extension-attacks - are hash length extension attacks a concern for OCFL?
|
You mean binaries that are changing insignificantly, right? Because otherwise I thought we only store them once and just refer to them. |
I mean if a user is making incremental changes to a binary before getting it into a state that they would consider to be "settled." This is the case that Paul mentioned a few tech calls ago, where his organization deals with video files that they make edits to, save to Fedora, make more edits, and repeat. In an ideal case most would likely only want the "settled" binaries to be versioned. |
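The "store once and just refer" behavior mentioned above can be sketched with a toy manifest keyed by digest (illustrative only; a real OCFL client manages the manifest and version state inside `inventory.json`, and the function and variable names here are made up):

```python
import hashlib
from collections import defaultdict


def digest(content: bytes) -> str:
    return hashlib.sha512(content).hexdigest()


# Toy OCFL-style manifest: digest -> content paths actually stored on disk.
manifest = defaultdict(list)


def add_version_file(version: str, logical_path: str, content: bytes) -> str:
    """Record a file for a version; store the bytes only if the digest is new."""
    d = digest(content)
    if d not in manifest:
        # New content: it gets written under this version's content directory.
        manifest[d].append(f"{version}/content/{logical_path}")
    # Either way, the version's state maps logical_path -> d.
    return d


d1 = add_version_file("v1", "movie.mp4", b"big binary payload")
d2 = add_version_file("v2", "movie.mp4", b"big binary payload")  # unchanged
assert d1 == d2
assert manifest[d1] == ["v1/content/movie.mp4"]  # stored once, referenced twice
```

So an unchanged binary costs nothing extra per version; the pain point in the video-editing workflow is that each *changed* intermediate binary gets a new digest and a new full copy on disk.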
In reference to @pwinckles' summary, I'm all for 5.ii, "Don't put an object in OCFL that is being actively worked on." We may need to do a better job of explaining the goals of OCFL: it's for objects at rest that are ready for long-term preservation, possibly on WORM storage. As a point of reference, Stanford's preservation system, which uses a very similar versioning paradigm, has > 1.7 million objects / 650 TB of data in it, some over 10 years old; our highest version number is 22, and our mean is 2.88. Our Fedora instance has a separate workspace and datastore, where daily work occurs. That datastore is backed up just like any other line-of-business application. That does not back up every file change as it occurs, as that would be nuts: it backs up any changes to files or the database that occurred over the previous 24 hours. |
Discussion with Fedora committers: where is a mutable head created, and can the spec include language that makes this clearer?
Tighten up the implementation notes on this subject.
|
Everything here has been either spun into a separate issue or covered by recent commits. |
Some in the Fedora community have run some rough tests that analyze disk usage for OCFL objects with many versions and/or many files. Here are some numbers from @pwinckles:
From another test:
Storing the inventory file in every version directory contributes significantly to the overall size of the OCFL object on disk. Some questions:
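A back-of-the-envelope model of why the per-version inventory copies add up (a sketch: `BYTES_PER_ENTRY` is a made-up average, and the linear-growth assumption is illustrative, not measured against real inventories):

```python
# Model cumulative inventory bytes when a full inventory copy is kept in
# every version directory, plus one copy at the object root.
BYTES_PER_ENTRY = 200  # hypothetical average bytes per tracked file entry


def total_inventory_bytes(num_versions: int, files_added_per_version: int) -> int:
    total = 0
    tracked_files = 0
    for _ in range(num_versions):
        tracked_files += files_added_per_version
        # The inventory written for this version covers all files so far,
        # and a copy of it persists in the version directory forever.
        total += tracked_files * BYTES_PER_ENTRY
    # The object root also holds a copy of the latest (largest) inventory.
    return total + tracked_files * BYTES_PER_ENTRY
```

Because each version's copy covers all files tracked so far, total inventory bytes grow roughly quadratically with version count: an object with many versions ends up keeping many inventory copies averaging about half the size of the final one. That is why dropping the per-version copies (or shrinking the digests) has such an outsized effect.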