[Refactor:Plagiarism] Improve tokenizer performance #57
What is the current behavior?
The C++ and Python tokenizers are currently somewhat ineffective at identifying plagiarism. The biggest issue is that both of these tokenizers group all operator tokens together under one generic label. This means that the C++ sequence `i = (float)(j)` is currently marked as being equivalent to the sequence `i == true) { j =`. These two sequences are obviously not similar and should not be marked as such.
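To illustrate the problem, here is a rough sketch in Python (with made-up label names and a deliberately simplified token pattern, not the project's actual tokenizer code) of how collapsing every operator onto one generic label hides the difference between the two snippets:

```python
import re

# Simplified, illustrative token pattern: words, '==', and single-character operators.
TOKEN_RE = re.compile(r"\w+|==|[=(){}]")

def generic_labels(code):
    # Every operator/punctuator collapses to the same "OPERATOR" label,
    # so '=', '==', '(', ')' and '{' are all indistinguishable to the matcher.
    return ["IDENTIFIER" if tok.isidentifier() else "OPERATOR"
            for tok in TOKEN_RE.findall(code)]

print(generic_labels("i = (float)(j)"))
# ['IDENTIFIER', 'OPERATOR', 'OPERATOR', 'IDENTIFIER',
#  'OPERATOR', 'OPERATOR', 'IDENTIFIER', 'OPERATOR']
print(generic_labels("i == true) { j ="))
# ['IDENTIFIER', 'OPERATOR', 'IDENTIFIER', 'OPERATOR',
#  'OPERATOR', 'IDENTIFIER', 'OPERATOR']
```

Because the structurally different operators all share one label, the label sequences of these unrelated snippets overlap heavily, which is why they are reported as matching.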
What is the new behavior?
This PR improves the behavior described above so that each operator is identified by its own unique label rather than by a single generic one. As a result, the example above will no longer be marked as a match, but the sequences `int i = true;` and `float j = false;` will still be marked as matches.
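For comparison, the same sketch with per-operator labels (again using assumed label names rather than the project's actual token set) distinguishes the two snippets above while structurally identical statements still line up:

```python
import re

TOKEN_RE = re.compile(r"\w+|==|[=(){};]")

def specific_labels(code):
    # Operators keep their own label ('=', '==', '(', ...), while words
    # (identifiers, keywords, literals like true/false) still share one label.
    return ["IDENTIFIER" if tok.isidentifier() else tok
            for tok in TOKEN_RE.findall(code)]

# The cast and the comparison now produce clearly different label sequences...
print(specific_labels("i = (float)(j)"))    # [..., '=', '(', 'IDENTIFIER', ')', ...]
print(specific_labels("i == true) { j ="))  # [..., '==', 'IDENTIFIER', ')', '{', ...]

# ...but structurally identical statements are still flagged as equivalent:
assert specific_labels("int i = true;") == specific_labels("float j = false;")
```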