Add from_raw and from_sents static methods to Doc instead of constructor #47

Closed
husnusensoy opened this issue Jul 31, 2020 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@husnusensoy
Contributor

No description provided.

@husnusensoy husnusensoy added the enhancement New feature or request label Jul 31, 2020
@husnusensoy husnusensoy self-assigned this Jul 31, 2020
husnusensoy added a commit that referenced this issue Jul 31, 2020
    Sentences:
        input_ids
        tokens: all tokens except special symbols
        tokens_with_special_symbols: all tokens, including [CLS] and [SEP]

    Doc:
        object can now be constructed either from raw text or from a list
        of texts (representing the sentences in the document). This is
        just an initial release; we will revise it later based on issue #47.

        __len__ of document is defined as the total number of sentences
        in it.
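
The two construction paths and the sentence-count `__len__` described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `from_raw` and `from_sents` names come from the issue title, but their signatures, the `sents` attribute, and the naive regex sentence splitter are all assumptions.

```python
import re
from typing import List


class Doc:
    """Hypothetical stand-in for the library's Doc class (illustration only)."""

    def __init__(self, sents: List[str]):
        # Assumed internal representation: a plain list of sentence strings.
        self.sents = sents

    @staticmethod
    def from_sents(sents: List[str]) -> "Doc":
        # Build a document from sentences that are already split,
        # dropping entries that are empty after stripping whitespace.
        return Doc([s.strip() for s in sents if s.strip()])

    @staticmethod
    def from_raw(raw: str) -> "Doc":
        # Naive splitter (assumption): break on whitespace that follows
        # '.', '!' or '?'. A real sentence tokenizer would go here.
        return Doc.from_sents(re.split(r"(?<=[.!?])\s+", raw.strip()))

    def __len__(self) -> int:
        # Document length is defined as the number of sentences.
        return len(self.sents)


doc_a = Doc.from_raw("First sentence. Second one! A third?")
doc_b = Doc.from_sents(["First sentence.", "Second one!"])
print(len(doc_a), len(doc_b))
```

Named static constructors keep the two input shapes explicit at the call site, so `__init__` does not have to guess whether it received raw text or a list of sentences.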

        max_length is the number of tokens (including special tokens) in
        the longest sentence. It is used by the padded_matrix function.

        padded_matrix returns a zero-padded matrix of sentences: padding
        brings all sentences to a common length, each row is a sentence,
        and each column is a token position in the sentence (input_ids).
        Note that this resembles a document-term matrix but is not the
        same. The function can return a numpy array or a tensor, and it
        can also return the mask to be used for BERT embedding generation.
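
The padding and mask logic described above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the free function `padded_matrix`, its signature, and the sample ids are invented for the example, and it returns nested Python lists rather than the numpy array or tensor the commit message mentions.

```python
from typing import List, Tuple


def padded_matrix(input_ids: List[List[int]]) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sentence to the longest one with 0 and build the mask.

    Hypothetical helper: one row per sentence, one column per token position.
    """
    # max_length: token count of the longest sentence (special tokens included).
    max_length = max(len(ids) for ids in input_ids)

    matrix, mask = [], []
    for ids in input_ids:
        pad = max_length - len(ids)
        matrix.append(ids + [0] * pad)            # real ids, then zero padding
        mask.append([1] * len(ids) + [0] * pad)   # 1 = real token, 0 = padding
    return matrix, mask


# Made-up ids for two sentences; the first and last id in each row stand in
# for [CLS] and [SEP].
ids = [[101, 7592, 2088, 102], [101, 7592, 102]]
matrix, mask = padded_matrix(ids)
print(matrix)
print(mask)
```

The mask is built from the sentence lengths rather than by testing for zeros in the matrix, so it stays correct even if a vocabulary assigns id 0 to a real token.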

        bert_embeddings property returns the embedding matrix for the
        document (one embedding per sentence), computed using
        padded_matrix.