
Tokenization of input sentences for English #136

Closed · polars05 opened this issue Jul 10, 2019 · 9 comments

Labels: question (Further information is requested), document (document issue), good first issue (Good for newcomers)
Milestone: v0.5.1

Comments
@polars05

When feeding in sentences, do we need to pre-process them (e.g. using BPE) beforehand? Or do we just split them into words and let GPT-2 / BERT handle the pre-processing?

Example: If we want to use BERT embedding, do we have to pre-process "Jim Henson was a puppeteer" into ["jim", "henson", "was", "a", "puppet", "##eer"] using some external BPE library or do we just split the sentence on spaces and feed ["jim", "henson", "was", "a", "puppeteer"] into the model?

@BrikerMan (Owner)

You need to pre-process (tokenize) your sequence with the proper tool for the best performance. For example, if you are using the BERT embedding, use the BERT tokenizer for pre-processing.

If you just split the sentence on spaces and feed in ["jim", "henson", "was", "a", "puppeteer"], some detail will be lost. It will still run, but not as well as with the BERT tokenizer.
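
For illustration, here is a minimal sketch of the difference, using the Tokenizer from keras-bert (which kashgari builds on); 'vocab.txt' is a placeholder path for the vocab file shipped with your BERT checkpoint:

import codecs
from keras_bert import Tokenizer

# Load the vocab shipped with the pretrained checkpoint
# ('vocab.txt' is a placeholder path).
token_dict = {}
with codecs.open('vocab.txt', 'r', 'utf8') as reader:
    for line in reader:
        token_dict[line.strip()] = len(token_dict)

tokenizer = Tokenizer(token_dict)

# WordPiece tokenization: out-of-vocab words are split into sub-word
# units, and [CLS] / [SEP] are added automatically.
print(tokenizer.tokenize('Jim Henson was a puppeteer'))
# e.g. ['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Naive whitespace split: 'puppeteer' stays whole, so the model sees a
# different (possibly unknown) token.
print('Jim Henson was a puppeteer'.lower().split())
# ['jim', 'henson', 'was', 'a', 'puppeteer']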

@polars05 (Author)

It seems that this part of the code already handles the tokenizing of the input sequence using the vocab.txt:

def _build_token2idx_from_bert(self):
    # Build the token -> index mapping from the checkpoint's vocab.txt
    # (os and codecs are imported at module level).
    dict_path = os.path.join(self.model_folder, 'vocab.txt')
    token2idx = {}
    with codecs.open(dict_path, 'r', 'utf8') as reader:
        for line in reader:
            token = line.strip()
            # Each token's index is its line number in vocab.txt.
            token2idx[token] = len(token2idx)
    self.bert_token2idx = token2idx
    self.processor.token2idx = self.bert_token2idx
    # Reverse mapping for decoding indices back to tokens.
    self.processor.idx2token = dict([(value, key) for key, value in token2idx.items()])

Can I assume that we can just split the input sentence into ["jim", "henson", "was", "a", "puppeteer"] instead of ["jim", "henson", "was", "a", "puppet", "##eer"] before fitting it on a classifier built upon kashgari.embeddings.BERTEmbedding?

@BrikerMan (Owner)

No, we read the BERT vocab for the numericalization step: it converts tokens to an array of token indices. It handles numericalization, not tokenization.
You can just split the sentence on spaces, but you will get a slightly more accurate feature representation if you tokenize it properly.
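
To illustrate the distinction, a minimal sketch (the toy token2idx dict and the numerize helper are hypothetical stand-ins for the vocab mapping loaded above):

# Toy vocab mapping, standing in for the one read from vocab.txt.
token2idx = {'[UNK]': 0, 'jim': 1, 'henson': 2, 'was': 3, 'a': 4,
             'puppet': 5, '##eer': 6}

def numerize(tokens):
    # Numericalization only: map already-tokenized tokens to indices,
    # falling back to [UNK] for anything not in the vocab.
    return [token2idx.get(t, token2idx['[UNK]']) for t in tokens]

# Properly tokenized input keeps the sub-word detail:
print(numerize(['jim', 'henson', 'was', 'a', 'puppet', '##eer']))
# [1, 2, 3, 4, 5, 6]

# Whitespace-split input: 'puppeteer' is out of vocab and collapses to [UNK]:
print(numerize(['jim', 'henson', 'was', 'a', 'puppeteer']))
# [1, 2, 3, 4, 0]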

@polars05 (Author) commented Jul 13, 2019

Ahh, thanks for the clarification! :) I guess it would be helpful to state this in the documentation / include the tokenization process as part of the kashgari library.

(Side note: the multi-label classification example in the documentation splits the sentences on whitespace; no tokenization via the BERT tokenizer is used as a preprocessing step.)

@BrikerMan (Owner) commented Jul 13, 2019

Yes, we should add those details. Maybe you could help me with the documentation.

BrikerMan reopened this on Jul 13, 2019, added the document and good first issue labels, and added it to the v0.5.1 milestone.
@polars05 (Author)

In the documentation (under "Quick Start") for the tf.keras branch, I added an example of pre-processing text before feeding it into BERT for text classification. I noticed kashgari builds on keras-bert, so I've crafted the example around the tokenizer from keras-bert.

I've created a pull request for that; hope the example helps with your documentation!

Side note: I'm not sure if you have to set self.processor.add_bos_eos to False, since the tokenizer will already automatically include the [CLS] and [SEP] tokens:

self.processor.add_bos_eos = True

@BrikerMan (Owner)

Yes, if the sequence already includes the [CLS] and [SEP] tokens, we need to set add_bos_eos to False.
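
A minimal sketch of why (toy vocab dict for brevity; the processor handle at the end is hypothetical):

from keras_bert import Tokenizer

# Toy vocab; a real one comes from the checkpoint's vocab.txt.
token_dict = {'[CLS]': 0, '[SEP]': 1, '[UNK]': 2, 'jim': 3, 'henson': 4,
              'was': 5, 'a': 6, 'puppet': 7, '##eer': 8}
tokenizer = Tokenizer(token_dict)

print(tokenizer.tokenize('Jim Henson was a puppeteer'))
# ['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']

# The sequence is already wrapped in [CLS] ... [SEP], so the processor
# must not add them a second time (hypothetical handle on the embedding):
# embedding.processor.add_bos_eos = False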

@BrikerMan (Owner)

@polars05 Added it to the new documentation, check it out: https://kashgari.bmio.net/embeddings/bert-embedding/

@BrikerMan (Owner)

Added a tokenizer property to the BERTEmbedding, which simplifies the process. Here is the document.
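
A hypothetical usage sketch, assuming the new property is exposed as BERTEmbedding.tokenizer and wraps the keras-bert tokenizer (the constructor arguments follow the v0.5.x style and may differ):

import kashgari
from kashgari.embeddings import BERTEmbedding

# '<bert-model-folder>' is a placeholder for the pretrained checkpoint directory.
embedding = BERTEmbedding('<bert-model-folder>',
                          task=kashgari.CLASSIFICATION,
                          sequence_length=128)

# With the tokenizer exposed on the embedding, no separate keras-bert
# setup is needed before building the training data:
tokens = embedding.tokenizer.tokenize('Jim Henson was a puppeteer')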
