
Tokenization of input sentences for English #136

Closed · polars05 opened this issue Jul 10, 2019 · 9 comments

Labels: question (Further information is requested), document (document issue), good first issue (Good for newcomers)
Milestone: v0.5.1

Comments
@polars05

When feeding in sentences, do we need to pre-process them (e.g. using BPE) beforehand? Or do we just split them into words and let GPT-2 / BERT handle the pre-processing?

Example: If we want to use BERT embedding, do we have to pre-process "Jim Henson was a puppeteer" into ["jim", "henson", "was", "a", "puppet", "##eer"] using some external BPE library or do we just split the sentence on spaces and feed ["jim", "henson", "was", "a", "puppeteer"] into the model?

@BrikerMan (Owner)

You need to pre-process (tokenize) your sequence with the proper tool for the best performance. For example, if you are using the BERT embedding, use the BERT tokenizer for pre-processing.

If you just split the sentence on spaces and feed in ["jim", "henson", "was", "a", "puppeteer"], some detail will be lost. It will still run, but not as well as with the BERT tokenizer.
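
For illustration, here is a minimal sketch of the difference, using the Tokenizer from keras-bert (which kashgari builds on); 'vocab.txt' is a placeholder path for the vocab file shipped with your BERT checkpoint:

import codecs
from keras_bert import Tokenizer

# Load the vocab shipped with the pretrained checkpoint
# ('vocab.txt' is a placeholder path).
token_dict = {}
with codecs.open('vocab.txt', 'r', 'utf8') as reader:
    for line in reader:
        token_dict[line.strip()] = len(token_dict)

tokenizer = Tokenizer(token_dict)

# WordPiece tokenization: out-of-vocab words are split into sub-word
# units, and [CLS] / [SEP] are added automatically.
print(tokenizer.tokenize('Jim Henson was a puppeteer'))
# e.g. ['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Naive whitespace split: 'puppeteer' stays whole, so the model sees a
# different (possibly unknown) token.
print('Jim Henson was a puppeteer'.lower().split())
# ['jim', 'henson', 'was', 'a', 'puppeteer']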

@polars05 (Author)

It seems that this part of the code already handles the tokenizing of the input sequence using the vocab.txt:

def _build_token2idx_from_bert(self):
    # Build the token -> index mapping from the checkpoint's vocab.txt
    # (os and codecs are imported at module level).
    dict_path = os.path.join(self.model_folder, 'vocab.txt')
    token2idx = {}
    with codecs.open(dict_path, 'r', 'utf8') as reader:
        for line in reader:
            token = line.strip()
            # Each token's index is its line number in vocab.txt.
            token2idx[token] = len(token2idx)
    self.bert_token2idx = token2idx
    self.processor.token2idx = self.bert_token2idx
    # Reverse mapping for decoding indices back to tokens.
    self.processor.idx2token = dict([(value, key) for key, value in token2idx.items()])

Can I assume that we can just split the input sentence into ["jim", "henson", "was", "a", "puppeteer"] instead of ["jim", "henson", "was", "a", "puppet", "##eer"] before fitting it on a classifier built upon kashgari.embeddings.BERTEmbedding?

@BrikerMan (Owner)

No, we read the BERT vocab for the numericalization step: it converts tokens to an array of token indices. It handles numericalization, not tokenization.
You can just split the sentence on spaces, but you will get a slightly more accurate feature representation if you tokenize it properly.
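
To illustrate the distinction, a minimal sketch (the toy token2idx dict and the numerize helper are hypothetical stand-ins for the vocab mapping loaded above):

# Toy vocab mapping, standing in for the one read from vocab.txt.
token2idx = {'[UNK]': 0, 'jim': 1, 'henson': 2, 'was': 3, 'a': 4,
             'puppet': 5, '##eer': 6}

def numerize(tokens):
    # Numericalization only: map already-tokenized tokens to indices,
    # falling back to [UNK] for anything not in the vocab.
    return [token2idx.get(t, token2idx['[UNK]']) for t in tokens]

# Properly tokenized input keeps the sub-word detail:
print(numerize(['jim', 'henson', 'was', 'a', 'puppet', '##eer']))
# [1, 2, 3, 4, 5, 6]

# Whitespace-split input: 'puppeteer' is out of vocab and collapses to [UNK]:
print(numerize(['jim', 'henson', 'was', 'a', 'puppeteer']))
# [1, 2, 3, 4, 0]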

@polars05 (Author) commented Jul 13, 2019

Ahh, thanks for the clarification! :) I guess it would be helpful to state this in the documentation / include the tokenization process as part of the kashgari library.

(Side note: the multi-label classification example in the documentation splits the sentences on whitespace; no tokenization via the BERT tokenizer is used as a preprocessing step.)

@BrikerMan (Owner) commented Jul 13, 2019

Yes, we should add those details. Maybe you could help me with the documentation.

BrikerMan reopened this on Jul 13, 2019, added the document and good first issue labels, and added it to the v0.5.1 milestone.
@polars05 (Author)

In the documentation (under "Quick Start") for the tf.keras branch, I added an example of pre-processing text before feeding it into BERT for text classification. I noticed kashgari builds on keras-bert, so I've crafted the example around the tokenizer from keras-bert.

I've created a pull request for that; hope the example helps with your documentation!

Side note: I'm not sure if you have to set self.processor.add_bos_eos to False, since the tokenizer will already automatically include the [CLS] and [SEP] tokens:

self.processor.add_bos_eos = True

@BrikerMan (Owner)

Yes, if the sequence already includes the [CLS] and [SEP] tokens, we need to set add_bos_eos to False.
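
A minimal sketch of why (toy vocab dict for brevity; the processor handle at the end is hypothetical):

from keras_bert import Tokenizer

# Toy vocab; a real one comes from the checkpoint's vocab.txt.
token_dict = {'[CLS]': 0, '[SEP]': 1, '[UNK]': 2, 'jim': 3, 'henson': 4,
              'was': 5, 'a': 6, 'puppet': 7, '##eer': 8}
tokenizer = Tokenizer(token_dict)

print(tokenizer.tokenize('Jim Henson was a puppeteer'))
# ['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']

# The sequence is already wrapped in [CLS] ... [SEP], so the processor
# must not add them a second time (hypothetical handle on the embedding):
# embedding.processor.add_bos_eos = False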

@BrikerMan (Owner)

@polars05 Added it to the new documentation, check it out: https://kashgari.bmio.net/embeddings/bert-embedding/

@BrikerMan (Owner)

Added a tokenizer property to the BERTEmbedding, which simplifies the process. Here is the document.
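
A hypothetical usage sketch, assuming the new property is exposed as BERTEmbedding.tokenizer and wraps the keras-bert tokenizer (the constructor arguments follow the v0.5.x style and may differ):

import kashgari
from kashgari.embeddings import BERTEmbedding

# '<bert-model-folder>' is a placeholder for the pretrained checkpoint directory.
embedding = BERTEmbedding('<bert-model-folder>',
                          task=kashgari.CLASSIFICATION,
                          sequence_length=128)

# With the tokenizer exposed on the embedding, no separate keras-bert
# setup is needed before building the training data:
tokens = embedding.tokenizer.tokenize('Jim Henson was a puppeteer')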
