[BugFix] Deal with greek letter "sigma" when return offset_mapping (#2897)

* deal with greek letter sigma

* update comments
yingyibiao committed Jul 29, 2022
1 parent 912e027 commit 8292c71
Showing 1 changed file with 8 additions and 1 deletion.
9 changes: 8 additions & 1 deletion paddlenlp/transformers/tokenizer_utils.py
@@ -1363,7 +1363,14 @@ def get_offset_mapping(self, text):
             if token in self.all_special_tokens:
                 token = token.lower() if hasattr(
                     self, "do_lower_case") and self.do_lower_case else token
-            start = text[offset:].index(token) + offset
+            # The greek letter "sigma" has 2 forms of lowercase, σ and ς respectively.
+            # When used as a final letter of a word, the final form (ς) is used. Otherwise, the form (σ) is used.
+            # https://latin.stackexchange.com/questions/6168/how-and-when-did-we-get-two-forms-of-sigma
+            if "σ" in token or "ς" in token:
+                start = text[offset:].replace("ς", "σ").index(
+                    token.replace("ς", "σ")) + offset
+            else:
+                start = text[offset:].index(token) + offset

             end = start + len(token)
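The lookup changed above can be exercised in isolation. The following is a minimal sketch, not the PaddleNLP implementation: the helper name `find_token_start` and the sample word are hypothetical, and the way the mismatch is produced here (lowercasing character by character versus lowercasing the whole string) is an assumption about one way the two sigma forms can diverge in Python, not something the commit states.

```python
def find_token_start(text: str, token: str, offset: int) -> int:
    """Hypothetical helper mirroring the patched lookup.

    Both sigma forms are single code points, so replacing one with the
    other keeps every character index in `text` unchanged.
    """
    if "σ" in token or "ς" in token:
        return text[offset:].replace("ς", "σ").index(
            token.replace("ς", "σ")) + offset
    return text[offset:].index(token) + offset


word = "ΟΔΟΣ"

# Lowercasing one character at a time never sees word context,
# so the capital sigma always becomes the non-final form (σ).
per_char = "".join(ch.lower() for ch in word)

# Lowercasing the whole string applies the final-sigma rule (ς).
whole = word.lower()

# A plain substring search fails because the two forms differ.
try:
    per_char.index(whole)
    naive_lookup_failed = False
except ValueError:
    naive_lookup_failed = True

# The sigma-normalized lookup still finds the token.
start = find_token_start(per_char, whole, 0)
end = start + len(whole)
```

Because the replacement is one code point for one code point, `start` and `end` remain valid offsets into the original, unnormalized text.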
