[Feature:Plagiarism] Add reasonable limits to file sizes #54
What is the current behavior?
When improperly configured, Lichen can currently consume excessive amounts of memory and run for extended periods of time.
What is the new behavior?
This PR adds reasonable limits to all potentially large aspects of a given Lichen run, such that files which exceed the limits are truncated rather than failing the run altogether. Future work will be needed to add hard, fixed limits on the amount of memory available and on the time any given run is allowed to take. For now, the collection of limits here should prevent excessive memory usage by capping all of the potentially large files at reasonable sizes. All of the limits introduced have been condensed into a single `lichen_config.json` file containing:

- `concat_max_total_bytes`: the total number of bytes allowed to be concatenated across a run.
- `max_sequences_per_file`: the maximum number of hashes any given submission may contain. This keeps the size of `matches.json` for any given submission to a reasonable upper bound. Raising this limit too much may result in excessive memory usage while `compare_hashes.cpp` is attempting to build a large data structure.
- `max_matching_positions`: the maximum number of duplicate sequences there may be between any two submissions. This limits the size of `matches.json` files, especially if two or more files contain significant amounts of repetition between them. This also helps control the maximum memory usage of Lichen.
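
For reference, a minimal sketch of what such a `lichen_config.json` could look like. The key names come from this PR; the values shown are illustrative assumptions, not the actual defaults:

```json
{
    "concat_max_total_bytes": 1000000,
    "max_sequences_per_file": 10000,
    "max_matching_positions": 1000
}
```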
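
To illustrate the truncate-rather-than-fail behavior described above, here is a minimal C++ sketch of how a concatenation step could enforce `concat_max_total_bytes`. This is not the actual Lichen code; `append_file`, the file names, and the limit value are all assumptions made for the example:

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Append up to `budget` bytes of `path` to `out`; return the bytes written.
// Hypothetical helper, not part of Lichen.
std::size_t append_file(std::ofstream &out, const std::string &path,
                        std::size_t budget) {
    std::ifstream in(path, std::ios::binary);
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    if (contents.size() > budget) {
        // Truncate the oversized file instead of aborting the whole run.
        contents.resize(budget);
        std::cerr << "warning: " << path << " truncated to fit limit\n";
    }
    out << contents;
    return contents.size();
}

int main() {
    const std::size_t concat_max_total_bytes = 1000000;  // illustrative value
    std::size_t remaining = concat_max_total_bytes;
    std::ofstream out("concatenated.txt", std::ios::binary);
    for (const std::string &path : {"submission_a.cpp", "submission_b.cpp"}) {
        remaining -= append_file(out, path, remaining);
        if (remaining == 0) break;  // budget exhausted: stop, but don't fail
    }
    return 0;
}
```

The important design point is that hitting a limit degrades the output (some content is skipped) rather than crashing or hanging the run.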