Add from_raw and from_sents static methods to Doc instead of constructor #47

husnusensoy · 2020-07-31T11:11:24Z

No description provided.

Sentences:
- input_ids tokens for all non-special-symbol tokens
- tokens_with_special_symbols for all tokens, including [CLS] and [SEP]

Doc:
- A Doc object can now be constructed either from raw text or from a list of texts (representing the sentences in the document). This is just an initial release; we will fix it later based on issue #47.
- __len__ of a document is defined as the total number of sentences in it.
- max_length is the number of tokens (with special tokens) in the longest sentence. It is used by the padded_matrix function.
- padded_matrix returns a 0-padded matrix of sentences (padding brings all sentences to a common length). Each row is a sentence and each column is a token in that sentence (input_ids). Note that this is like a document-term matrix, but not the same. The function can return a numpy array or a tensor, and it can also return the mask to be used for BERT embedding generation.
- The bert_embeddings property returns the embeddings matrix for the document (one embedding per sentence), computed using padded_matrix.
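The behaviour described above can be illustrated with a rough sketch. This is not the project's implementation: the toy tokenizer, the regex sentence splitter, the special-token ids and the return_mask parameter name are all assumptions made for illustration, and plain Python lists stand in for the numpy array or tensor so the example has no dependencies.

```python
import re
from typing import List

# Assumed ids for [CLS] and [SEP]; illustrative values only.
CLS_ID, SEP_ID = 101, 102


def toy_input_ids(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one fake non-zero id per word."""
    return [1000 + len(tok) for tok in text.split()]


class Doc:
    def __init__(self, sents: List[str]):
        self.sents = sents
        # Store ids with special tokens, since max_length counts them too.
        self._ids = [[CLS_ID] + toy_input_ids(s) + [SEP_ID] for s in sents]

    @staticmethod
    def from_sents(sents: List[str]) -> "Doc":
        """Build a document from an already split list of sentences."""
        return Doc(list(sents))

    @staticmethod
    def from_raw(raw: str) -> "Doc":
        """Build a document from raw text using a naive splitter (assumption)."""
        parts = re.split(r"(?<=[.!?])\s+", raw)
        return Doc([p.strip() for p in parts if p.strip()])

    def __len__(self) -> int:
        # Length of a document is its number of sentences.
        return len(self.sents)

    @property
    def max_length(self) -> int:
        # Token count (special tokens included) of the longest sentence.
        return max((len(ids) for ids in self._ids), default=0)

    def padded_matrix(self, return_mask: bool = False):
        """One row per sentence, right-padded with 0 up to max_length."""
        width = self.max_length
        matrix = [ids + [0] * (width - len(ids)) for ids in self._ids]
        if return_mask:
            # 1 marks a real token, 0 marks padding.
            mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in self._ids]
            return matrix, mask
        return matrix


doc = Doc.from_raw("Hello world. This is a test sentence.")
matrix, mask = doc.padded_matrix(return_mask=True)
```

In this sketch the mask is built from the sentence lengths rather than by testing for zeros, so it would stay correct even if a tokenizer assigned id 0 to a real token.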

husnusensoy added the enhancement (New feature or request) label Jul 31, 2020

husnusensoy self-assigned this Jul 31, 2020

husnusensoy closed this as completed in 79f787f Aug 15, 2020
