'\n' mixed in Vocabulary['token'] #111

hehehwang · 2021-09-23T10:31:18Z

It seems that the counter in the vocabulary is counting 'token' tokens with a trailing newline character.
For example, in vocabulary.pkl for the java-small dataset, I can find
'return': 6020684,
and
'return\n': 33290,
separately.

I personally fixed this problem by stripping path_context in Vocabulary._process_raw_sample,
but I'm a little confused about whether this behavior (mixing '\n' into tokens) is intended.
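
A minimal sketch of the kind of fix described above. It is not the repository's actual `Vocabulary._process_raw_sample`; the function name `count_tokens`, the `strip` flag, and the assumed line format (`label from,path,to from,path,to ...` with `|`-separated subtokens) are assumptions made for illustration only.

```python
from collections import Counter


def count_tokens(raw_sample: str, strip: bool = True) -> Counter:
    """Count from/to tokens in one raw line (hypothetical helper).

    Assumed format: "label from,path,to from,path,to ...\n",
    where tokens may consist of "|"-separated subtokens.
    """
    if strip:
        # Drop the trailing newline before splitting, so the last
        # to_token on the line does not keep a "\n" suffix.
        raw_sample = raw_sample.strip()
    _label, *path_contexts = raw_sample.split(" ")
    counter: Counter = Counter()
    for path_context in path_contexts:
        from_token, _path, to_token = path_context.split(",")
        counter.update(from_token.split("|"))
        counter.update(to_token.split("|"))
    return counter


line = "get|name this,P1|P2,name name,P3,return\n"
without_strip = count_tokens(line, strip=False)
with_strip = count_tokens(line, strip=True)
```

Under these assumptions, `without_strip` counts `'return\n'` as its own token, while `with_strip` counts plain `'return'`.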

Thank you!

SpirinEgor · 2021-09-23T10:42:50Z

Interesting. But I'm not sure that this is the same return. The code was tokenized by a parser, so it should handle different indentation. My guess is that there are various string literals containing return\n.

hehehwang · 2021-10-03T18:45:37Z

I don't understand what "different sorts of string literals with return\n inside" means, but I found lots of '*\n' tokens in vocabulary.pkl.

For example,
'EMPTY\n': 11459,
'\n': 11416,
'if\n': 6900,
'exception\n': 6624, ...

Many of the 'token' tokens are mixed with '\n', so I assume the vocabulary parser is picking up the newline at the end of each line.
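
One way to check an already built counter for this, sketched under the assumption that the 'token' entry behaves like a `collections.Counter` mapping token strings to counts. The helper names `find_newline_tokens` and `merge_newline_tokens` are hypothetical, and the sample counts are the ones quoted in this thread.

```python
from collections import Counter


def find_newline_tokens(counter: Counter) -> dict:
    """Return the entries whose token ends with a newline."""
    return {tok: cnt for tok, cnt in counter.items() if tok.endswith("\n")}


def merge_newline_tokens(counter: Counter) -> Counter:
    """Fold 'tok\\n' counts into 'tok'.

    Note: a bare '\\n' token collapses to the empty string '';
    the caller decides whether to keep or drop that entry.
    """
    merged: Counter = Counter()
    for tok, cnt in counter.items():
        merged[tok.rstrip("\n")] += cnt
    return merged


token_counter = Counter({"return": 6020684, "return\n": 33290, "if\n": 6900})
suspicious = find_newline_tokens(token_counter)
merged = merge_newline_tokens(token_counter)
```

Here `suspicious` holds the two newline-suffixed entries, and `merged["return"]` is the sum of both `return` counts.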

SpirinEgor · 2021-10-04T10:20:40Z

Yeah, that seems strange. I will investigate why the parser extracted tokens with newline characters at the end.
