Sentences:
input_ids
tokens for all non-special tokens
tokens_with_special_symbols for all tokens, including [CLS] and [SEP]
Doc:
object can now be constructed either with raw text or a list of
texts (representing the sentences in the document). This is just an
initial release. We will revise it later based on issue #47
__len__ of document is defined as the total number of sentences
in it.
max_length is the number of tokens (with special tokens) in the
longest sentence. This is used by the padded_matrix function
padded_matrix returns a 0-padded matrix of sentences (padding brings
all sentences to a common length). Each row is a sentence and each
column is a token position in that sentence (input_ids). Note that
this resembles a document-term matrix, but is not the same: columns
index token positions, not vocabulary terms. The function can return
a numpy array or a tensor; it can also return the mask to be used for
BERT embedding generation.
bert_embeddings property returns the embeddings matrix for the
document (one embedding per sentence) computed using
padded_matrix
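
A minimal sketch of the padding and mask logic described for padded_matrix, assuming each sentence is already a list of integer input_ids with special tokens included. The standalone function, its pad_id parameter, and the token ids below are illustrative assumptions, not the actual method on Doc:

```python
import numpy as np


def padded_matrix(sentences, pad_id=0):
    """Pad per-sentence input_ids to a common length (hypothetical helper).

    sentences: list of lists of int token ids, special tokens included.
    Returns (matrix, mask), both of shape (num_sentences, max_length).
    """
    # max_length: token count of the longest sentence (0 for an empty doc).
    max_length = max((len(ids) for ids in sentences), default=0)

    # Start from an all-padding matrix and an all-zero mask.
    matrix = np.full((len(sentences), max_length), pad_id, dtype=np.int64)
    mask = np.zeros((len(sentences), max_length), dtype=np.int64)

    for row, ids in enumerate(sentences):
        matrix[row, :len(ids)] = ids  # real tokens go on the left
        mask[row, :len(ids)] = 1      # 1 marks a real token, 0 marks padding

    return matrix, mask


# Illustrative ids only; a real tokenizer would produce these.
doc = [[101, 7592, 2088, 102], [101, 7592, 102]]
matrix, mask = padded_matrix(doc)
```

Here len(doc) is the number of sentences and matrix.shape[1] equals max_length. The mask is what would be passed alongside the ids as an attention mask so that padded positions are ignored when embeddings are computed.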