This repository was archived by the owner on Nov 26, 2025. It is now read-only.
This function should decompose a word into character n-grams that make it up.
Additionally, we want to return the indices of these character n-grams in the embedding matrix.
The function appends the entire word to the result only when the word is not already among its own character n-grams (i.e., when the word contains more characters than max_n). In that case, the word is added to the list together with its index obtained via the get_word_index method, and everything is fine.
The problem arises when the word is short (its length is <= max_n). In that case, the function treats the whole word as just another character n-gram and assigns it an index via the get_subword_index method. As a result, the short word receives an index greater than nwords in the embedding matrix, which is not what we normally expect.
The likely consequence is that the rows of the embedding matrix reserved for these short words are never used.
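A minimal sketch of the behavior described above. The names (get_word_index, get_subword_index, max_n, nwords) come from this issue, but the constants, the hashing scheme, and the vocabulary lookup are assumptions for illustration, not the project's actual code:

```python
# Assumed layout: rows 0..NWORDS-1 of the embedding matrix hold whole-word
# vectors; rows >= NWORDS hold hashed character n-gram vectors.
NWORDS = 1000          # number of whole-word rows (assumed)
BUCKETS = 2000         # number of n-gram hash buckets (assumed)
MIN_N, MAX_N = 3, 6    # character n-gram lengths (assumed)


def get_word_index(word, vocab):
    """Index of a whole word: always < NWORDS."""
    return vocab[word]


def get_subword_index(ngram):
    """Index of a character n-gram: always >= NWORDS (hash bucket)."""
    return NWORDS + hash(ngram) % BUCKETS


def decompose(word, vocab):
    """Return (tokens, indices) for a word and its character n-grams."""
    bounded = f"<{word}>"  # boundary markers, fastText-style
    tokens, indices = [], []
    for n in range(MIN_N, MAX_N + 1):
        for i in range(len(bounded) - n + 1):
            ngram = bounded[i : i + n]
            tokens.append(ngram)
            # Every n-gram gets a subword index >= NWORDS -- including
            # `bounded` itself when the word is short (len <= MAX_N).
            # That is the bug described above.
            indices.append(get_subword_index(ngram))
    if len(bounded) > MAX_N:
        # Only long words reach this branch and receive their
        # whole-word index < NWORDS.
        tokens.append(bounded)
        indices.append(get_word_index(word, vocab))
    return tokens, indices
```

For a short word such as "cat", `"<cat>"` appears among its own n-grams and every returned index is >= NWORDS, so the whole-word row for "cat" is never referenced; for a longer word such as "window", the final index is the whole-word index below NWORDS.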