Tokenization of input sentences for English #136
When feeding in sentences, do we need to pre-process them (e.g. using BPE) beforehand? Or do we just split them into words and let GPT-2 / BERT handle the pre-processing?
Example: if we want to use the BERT embedding, do we have to pre-process "Jim Henson was a puppeteer" into ["jim", "henson", "was", "a", "puppet", "##eer"] using some external BPE library, or do we just split the sentence on spaces and feed ["jim", "henson", "was", "a", "puppeteer"] into the model?
You need to pre-process (tokenize) your sequence with the proper tool for the best performance. For example, if you are using the BERT embedding, use the BERT tokenizer for pre-processing. If you just split the sentence on spaces and feed ["jim", "henson", "was", "a", "puppeteer"], it will lose some detail. It will still run, but not as well as with the BERT tokenizer.
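For illustration, here is a minimal sketch of the difference, assuming the keras-bert package and the vocab.txt from a pretrained BERT checkpoint (the checkpoint folder is a placeholder):

```python
from keras_bert import load_vocabulary, Tokenizer

# Placeholder path: point this at the vocab.txt inside your BERT checkpoint folder.
token_dict = load_vocabulary('uncased_L-12_H-768_A-12/vocab.txt')
tokenizer = Tokenizer(token_dict)

# WordPiece tokenization splits rare words into sub-word units
# (and wraps the sequence in [CLS] ... [SEP]):
print(tokenizer.tokenize('Jim Henson was a puppeteer'))
# -> ['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Naive whitespace splitting keeps 'puppeteer' whole, so it can fall outside
# the vocabulary and end up mapped to [UNK], losing information:
print('Jim Henson was a puppeteer'.lower().split())
# -> ['jim', 'henson', 'was', 'a', 'puppeteer']
```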
It seems that this part of the code already handles tokenizing the input sequence using vocab.txt: kashgari/embeddings/bert_embedding.py, lines 87–98 (commit aeb1d78).
Can I assume that we can just split the input sentence into ["jim", "henson", "was", "a", "puppeteer"] instead of ["jim", "henson", "was", "a", "puppet", "##eer"] before fitting it on a classifier built upon kashgari.embeddings.BERTEmbedding?
Nope. We read the BERT vocab for the numericalization step: it converts tokens to an array of token indices. It handles numericalization, not tokenization.
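In other words, that part of the code only does something like the following simplified sketch (a hypothetical helper, not Kashgari's actual implementation; the vocab path is a placeholder):

```python
# Numericalization: map already-tokenized text to index arrays using
# BERT's vocab.txt, where each line number is the token's index.
def load_token2idx(vocab_path):
    token2idx = {}
    with open(vocab_path, encoding='utf-8') as f:
        for idx, line in enumerate(f):
            token2idx[line.strip()] = idx
    return token2idx

token2idx = load_token2idx('uncased_L-12_H-768_A-12/vocab.txt')  # placeholder path
tokens = ['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']
unk_id = token2idx['[UNK]']
token_ids = [token2idx.get(t, unk_id) for t in tokens]
```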
Ahh, thanks for the clarification! :) Guess it would be helpful to state this in the documentation / include the tokenization process as part of the kashgari library. (Side note: the multi-label classification example in the documentation splits sentences by whitespace; no tokenization via the BERT tokenizer is used as a preprocessing step.)
Yes, we should add those details. Maybe you could help me with the documentation.
In the documentation (under "Quick Start") for the tf.keras branch, I added an example of pre-processing text before feeding it into BERT for text classification. I noticed Kashgari builds on keras-bert, so I've based the example on the tokenizer from keras-bert. I've created a pull request for that; hope the example helps with your documentation! Side note: not sure if you have to set self.processor.add_bos_eos to False, since the tokenizer will already include the [CLS] and [SEP] tokens automatically.
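The gist of such a pipeline looks something like this (a hedged sketch, not the exact code from the PR; the model class, checkpoint folder, and toy data are all illustrative):

```python
import kashgari
from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import BiLSTM_Model  # model choice is illustrative
from keras_bert import load_vocabulary, Tokenizer

bert_folder = 'uncased_L-12_H-768_A-12'  # placeholder checkpoint folder

# Pre-tokenize with the BERT tokenizer instead of splitting on whitespace;
# keras-bert's tokenize() also wraps each sequence in [CLS] ... [SEP].
tokenizer = Tokenizer(load_vocabulary(bert_folder + '/vocab.txt'))
texts = ['Jim Henson was a puppeteer', 'All work and no play makes Jack a dull boy']
labels = ['a', 'b']  # toy labels
x = [tokenizer.tokenize(t) for t in texts]

embedding = BERTEmbedding(bert_folder, task=kashgari.CLASSIFICATION, sequence_length=128)
model = BiLSTM_Model(embedding)
model.fit(x, labels)  # see the follow-up below about add_bos_eos
```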
Yes, if the sequence already includes the [CLS] and [SEP] tokens, we need to set add_bos_eos to False.
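Concretely, that would look something like the snippet below. The processor attribute name comes from this thread; treat it as internal API that may change:

```python
import kashgari
from kashgari.embeddings import BERTEmbedding

embedding = BERTEmbedding('uncased_L-12_H-768_A-12',  # placeholder folder
                          task=kashgari.CLASSIFICATION,
                          sequence_length=128)

# Inputs pre-tokenized by the BERT tokenizer already contain [CLS] and [SEP],
# so stop the processor from inserting its own begin/end-of-sequence tokens.
embedding.processor.add_bos_eos = False
```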
@polars05 Added this to the new documentation; check it out: https://kashgari.bmio.net/embeddings/bert-embedding/
Added a tokenizer property to the BERTEmbedding and simplified the process. Here is the documentation.
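With the new property, the tokenization step presumably no longer needs a separate keras-bert setup. A sketch of the intended usage; the exact interface of the tokenizer property is an assumption:

```python
import kashgari
from kashgari.embeddings import BERTEmbedding

embedding = BERTEmbedding('uncased_L-12_H-768_A-12',  # placeholder folder
                          task=kashgari.CLASSIFICATION,
                          sequence_length=128)

# Assuming the new tokenizer property wraps a keras-bert style Tokenizer:
tokens = embedding.tokenizer.tokenize('Jim Henson was a puppeteer')
```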