@williamjallen
Member

What is the current behavior?

The C++ and Python tokenizers are currently somewhat ineffective at identifying plagiarism. The biggest issue is that both tokenizers group all operator tokens under a single label. As a result, the C++ sequence i = (float)(j) is treated as equivalent to the sequence i == true) { j =. These two sequences are clearly not similar and should not be marked as matching.

What is the new behavior?

This PR changes the tokenizers so that each operator is identified uniquely instead of under a generic label. The example above will no longer be marked as a match, but the sequences int i = true; and float j = false; will still be marked as matches.
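
To illustrate the difference, here is a minimal sketch of generic versus per-operator labeling. It is a toy regex tokenizer for a tiny C++-like subset, not the tokenizer code in this PR; the names `TOKEN_RE`, `TYPES`, `LITERALS`, `label`, and `tokenize`, as well as the label strings, are assumptions made for illustration.

```python
import re

# Hypothetical token pattern for a tiny C++-like subset (illustration only).
# Two-character operators are listed before single characters so "==" is not split.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||[-+*/%=<>!(){};,]")

# Assumed groupings; the real tokenizers may classify these differently.
TYPES = {"int", "float", "double", "bool", "char"}
LITERALS = {"true", "false"}


def label(token, unique_operators):
    """Map one raw token to a label used for sequence comparison."""
    if token in TYPES:
        return "TYPE"
    if token in LITERALS or token.isdigit():
        return "LITERAL"
    if token[0].isalpha() or token[0] == "_":
        return "IDENTIFIER"
    # Old behavior: every operator collapses to one label.
    # New behavior: the operator text is kept as part of the label.
    return f"OP({token})" if unique_operators else "OPERATOR"


def tokenize(source, unique_operators):
    """Return the label sequence for a source snippet."""
    return [label(t, unique_operators) for t in TOKEN_RE.findall(source)]


# Generic labels: different operators are indistinguishable.
same_generic = tokenize("a + b;", False) == tokenize("a == b;", False)

# Unique labels: the same two snippets no longer match.
same_unique = tokenize("a + b;", True) == tokenize("a == b;", True)

# Renamed variables and swapped types or literals still match.
still_match = tokenize("int i = true;", True) == tokenize("float j = false;", True)
```

In this sketch only the operator labels change between the two modes, so snippets that differ only in identifier names, types, or literal values keep producing identical label sequences.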

@bmcutler bmcutler merged commit eb8259b into main Aug 17, 2021
@bmcutler bmcutler deleted the tokenizer-improvements branch August 17, 2021 04:07

3 participants