[Feature:Plagiarism] Add reasonable limits to file sizes #54
What is the current behavior?
When improperly configured, Lichen can currently consume excessive amounts of memory and run for extended periods of time.
What is the new behavior?
This PR adds reasonable limits to all potentially large aspects of a given Lichen run, such that files which exceed the limits are truncated rather than failing the run altogether. Future work will be needed to add hard, fixed limits on the amount of memory available and on the time any given run is allowed to take. For now, the collection of limits here should prevent excessive memory usage by capping all of the potentially large files at reasonable sizes. All of the limits introduced have been condensed into a single `lichen_config.json` file containing:

- `concat_max_total_bytes`: the total number of bytes allowed to be concatenated across a run.
- `max_sequences_per_file`: the maximum number of hashes any given submission may contain. This keeps the size of `matches.json` for any given submission to a reasonable upper bound. Raising this limit too much may result in excessive memory usage while `compare_hashes.cpp` is attempting to build a large data structure.
- `max_matching_positions`: the maximum number of duplicate sequences there may be between any two submissions. This limits the size of `matches.json` files, especially if two or more files contain significant amounts of repetition between them. This also helps control the maximum memory usage of Lichen.
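
For reference, a minimal sketch of what such a `lichen_config.json` could look like. The key names come from this PR; the values shown are illustrative assumptions, not the actual defaults:

```json
{
    "concat_max_total_bytes": 1000000,
    "max_sequences_per_file": 10000,
    "max_matching_positions": 1000
}
```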
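
To illustrate the truncate-rather-than-fail behavior described above, here is a minimal C++ sketch of how a concatenation step could enforce `concat_max_total_bytes`. This is not the actual Lichen code; `append_file`, the file names, and the limit value are all assumptions made for the example:

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Append up to `budget` bytes of `path` to `out`; return the bytes written.
// Hypothetical helper, not part of Lichen.
std::size_t append_file(std::ofstream &out, const std::string &path,
                        std::size_t budget) {
    std::ifstream in(path, std::ios::binary);
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    if (contents.size() > budget) {
        // Truncate the oversized file instead of aborting the whole run.
        contents.resize(budget);
        std::cerr << "warning: " << path << " truncated to fit limit\n";
    }
    out << contents;
    return contents.size();
}

int main() {
    const std::size_t concat_max_total_bytes = 1000000;  // illustrative value
    std::size_t remaining = concat_max_total_bytes;
    std::ofstream out("concatenated.txt", std::ios::binary);
    for (const std::string &path : {"submission_a.cpp", "submission_b.cpp"}) {
        remaining -= append_file(out, path, remaining);
        if (remaining == 0) break;  // budget exhausted: stop, but don't fail
    }
    return 0;
}
```

The important design point is that hitting a limit degrades the output (some content is skipped) rather than crashing or hanging the run.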