This repository was archived by the owner on Nov 26, 2025. It is now read-only.
This function should decompose a word into character n-grams that make it up.
Additionally, we want to return the indices of these character n-grams in the embedding matrix.
The function appends the entire word to the result only when the word is not already among its own character n-grams (i.e., when the word contains more characters than max_n). In that case, the word is added to the list together with its index obtained via the get_word_index method, and everything is fine.
The problem arises when the word is short (its length is <= max_n). In that case, the function treats the whole word as just another character n-gram and assigns it an index via the get_subword_index method. As a result, the short word receives an index greater than nwords in the embedding matrix, which is not what we normally expect.
The likely consequence is that the rows of the embedding matrix reserved for these short words are never used.
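A minimal sketch of the behavior described above. The names (get_word_index, get_subword_index, max_n, nwords) come from this issue, but the constants, the hashing scheme, and the vocabulary lookup are assumptions for illustration, not the project's actual code:

```python
# Assumed layout: rows 0..NWORDS-1 of the embedding matrix hold whole-word
# vectors; rows >= NWORDS hold hashed character n-gram vectors.
NWORDS = 1000          # number of whole-word rows (assumed)
BUCKETS = 2000         # number of n-gram hash buckets (assumed)
MIN_N, MAX_N = 3, 6    # character n-gram lengths (assumed)


def get_word_index(word, vocab):
    """Index of a whole word: always < NWORDS."""
    return vocab[word]


def get_subword_index(ngram):
    """Index of a character n-gram: always >= NWORDS (hash bucket)."""
    return NWORDS + hash(ngram) % BUCKETS


def decompose(word, vocab):
    """Return (tokens, indices) for a word and its character n-grams."""
    bounded = f"<{word}>"  # boundary markers, fastText-style
    tokens, indices = [], []
    for n in range(MIN_N, MAX_N + 1):
        for i in range(len(bounded) - n + 1):
            ngram = bounded[i : i + n]
            tokens.append(ngram)
            # Every n-gram gets a subword index >= NWORDS -- including
            # `bounded` itself when the word is short (len <= MAX_N).
            # That is the bug described above.
            indices.append(get_subword_index(ngram))
    if len(bounded) > MAX_N:
        # Only long words reach this branch and receive their
        # whole-word index < NWORDS.
        tokens.append(bounded)
        indices.append(get_word_index(word, vocab))
    return tokens, indices
```

For a short word such as "cat", `"<cat>"` appears among its own n-grams and every returned index is >= NWORDS, so the whole-word row for "cat" is never referenced; for a longer word such as "window", the final index is the whole-word index below NWORDS.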