Why tokenize twice? #73

saeub · 2022-04-13T14:25:38Z

I'm trying to adapt TransformerSum to a custom non-English dataset, and I'm confused by this code in extractive.py:

TransformerSum/src/extractive.py, lines 1093 to 1107 in 15bd11d

    
    if tokenized:
        src_txt = [
            " ".join([token.text for token in sentence if str(token) != "."]) + "."
            for sentence in input_sentences
        ]
    else:
        nlp = English()
        sentencizer = nlp.create_pipe("sentencizer")
        nlp.add_pipe(sentencizer)
        src_txt = [
            " ".join([token.text for token in nlp(sentence) if str(token) != "."])
            + "."
            for sentence in input_sentences
        ]

Why separate the words with spaces, when the resulting string is then tokenized using the tokenizer from the transformers library? I assume those tokenizers are not usually trained on pre-tokenized text, and neither are the pretrained models?
Why remove the space before "." characters, but not anywhere else?
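
To make the behaviour concrete, here is a minimal sketch of what I understand the join to do to a single sentence. It uses plain strings instead of spaCy tokens, and `rejoin` is a hypothetical helper name of mine, not something from the repository. The token lists are hand-written assumptions about how a tokenizer might split the text.

```python
def rejoin(tokens):
    """Mimic the join from the quoted snippet for one sentence.

    `tokens` is a list of plain strings standing in for spaCy tokens
    (an assumption for illustration; the real code reads token.text).
    """
    # Drop every token that is exactly ".", join the rest with single
    # spaces, then append one "." directly, with no space before it.
    return " ".join(tok for tok in tokens if tok != ".") + "."


# Other punctuation keeps a space before it; only "." is handled specially.
print(rejoin(["Hello", ",", "world", "."]))  # Hello , world.

# A "." token in the middle of the sentence is dropped as well.
print(rejoin(["Wait", ".", "What", "?"]))  # Wait What ?.
```

So the output is space-separated pre-tokenized text with exactly one trailing period, which is what then gets passed to the transformers tokenizer.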

Thanks for any explanations.
