[Refactor:Plagiarism] Improve tokenizer performance #57
What is the current behavior?
The C++ and Python tokenizers are currently somewhat ineffective at identifying plagiarism. The biggest issue is that both of these tokenizers group all operator tokens together under one generic label. This means that the C++ sequence `i = (float)(j)` is currently marked as being equivalent to the sequence `i == true) { j =`. These two sequences are obviously not similar and should not be marked as such.
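To illustrate the problem, here is a rough sketch in Python (with made-up label names and a deliberately simplified token pattern, not the project's actual tokenizer code) of how collapsing every operator onto one generic label hides the difference between the two snippets:

```python
import re

# Simplified, illustrative token pattern: words, '==', and single-character operators.
TOKEN_RE = re.compile(r"\w+|==|[=(){}]")

def generic_labels(code):
    # Every operator/punctuator collapses to the same "OPERATOR" label,
    # so '=', '==', '(', ')' and '{' are all indistinguishable to the matcher.
    return ["IDENTIFIER" if tok.isidentifier() else "OPERATOR"
            for tok in TOKEN_RE.findall(code)]

print(generic_labels("i = (float)(j)"))
# ['IDENTIFIER', 'OPERATOR', 'OPERATOR', 'IDENTIFIER',
#  'OPERATOR', 'OPERATOR', 'IDENTIFIER', 'OPERATOR']
print(generic_labels("i == true) { j ="))
# ['IDENTIFIER', 'OPERATOR', 'IDENTIFIER', 'OPERATOR',
#  'OPERATOR', 'IDENTIFIER', 'OPERATOR']
```

Because the structurally different operators all share one label, the label sequences of these unrelated snippets overlap heavily, which is why they are reported as matching.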
What is the new behavior?
This PR improves the behavior described above so that each operator is identified by its own unique label rather than by a single generic one. As a result, the example above will no longer be marked as a match, but the sequences `int i = true;` and `float j = false;` will still be marked as matches.
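For comparison, the same sketch with per-operator labels (again using assumed label names rather than the project's actual token set) distinguishes the two snippets above while structurally identical statements still line up:

```python
import re

TOKEN_RE = re.compile(r"\w+|==|[=(){};]")

def specific_labels(code):
    # Operators keep their own label ('=', '==', '(', ...), while words
    # (identifiers, keywords, literals like true/false) still share one label.
    return ["IDENTIFIER" if tok.isidentifier() else tok
            for tok in TOKEN_RE.findall(code)]

# The cast and the comparison now produce clearly different label sequences...
print(specific_labels("i = (float)(j)"))    # [..., '=', '(', 'IDENTIFIER', ')', ...]
print(specific_labels("i == true) { j ="))  # [..., '==', 'IDENTIFIER', ')', '{', ...]

# ...but structurally identical statements are still flagged as equivalent:
assert specific_labels("int i = true;") == specific_labels("float j = false;")
```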