
Support for list of pre-generated stems/lemmas #155

Open
DavMrc opened this issue Mar 18, 2021 · 3 comments
DavMrc commented Mar 18, 2021

Good morning,
first of all I wanted to congratulate you on this awesome repository. It is very well made, and the practical results are great on top of being easy to achieve.
I was wondering: is there a way to use a pre-processed list of strings (stems or lemmas) with your example pipeline?

@miso-belica (Owner) commented:
Hi, yes. You can implement your own function that constructs an ObjectDocumentModel — see https://github.com/miso-belica/sumy/blob/master/sumy/parsers/plaintext.py#L60-L78 for inspiration — then pass it to the summarizer and that's it. Stop words and even the stemmer are optional parts, so if you omit them the summarizer will process the raw ObjectDocumentModel it gets.
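To make the suggested shape of the data concrete, here is a minimal standalone sketch. The class names mirror sumy's `ObjectDocumentModel`/`Paragraph`/`Sentence` hierarchy but are plain stand-ins defined locally — the nesting (document → paragraphs → sentences → tokens) is the point, not sumy's exact constructors:

```python
from collections import namedtuple

# Local stand-ins mirroring sumy's sumy.models.dom hierarchy:
# a document wraps paragraphs, which wrap sentences, which hold tokens.
Sentence = namedtuple("Sentence", ["words"])
Paragraph = namedtuple("Paragraph", ["sentences"])
Document = namedtuple("Document", ["paragraphs"])

def document_from_tokens(paragraphs_of_sentences):
    """Build a document model from pre-tokenized input:
    a list of paragraphs, each a list of sentences,
    each a list of stems/lemmas."""
    return Document(tuple(
        Paragraph(tuple(Sentence(tuple(words)) for words in sentences))
        for sentences in paragraphs_of_sentences
    ))

doc = document_from_tokens([
    [["good", "morn"], ["awesom", "repositori"]],  # paragraph 1
    [["pre", "process", "stem"]],                  # paragraph 2
])
```

With sumy itself, the same pre-tokenized lists would feed the real `Sentence`/`Paragraph`/`ObjectDocumentModel` constructors instead of these stand-ins.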

@miso-belica miso-belica self-assigned this Mar 23, 2021
DavMrc commented Mar 26, 2021

Hi, thank you for your insight. Following your code, I saw that creating a Sentence requires a Tokenizer, which in my case would be redundant because I already have my tokens.

Is it sufficient to change lines 69 and 75 to sentences = [Sentence(s, None) for s in current_paragraph]?

@miso-belica (Owner) commented:
Well, I would avoid changing sumy unless it is really needed. You can instead implement your own tokenizer like this:

class Tokenizer:
    # sumy reads the tokenizer's language attribute
    language = 'en???'

    def to_sentences(self, paragraph):
        # the paragraph is already split into sentences
        return paragraph

    def to_words(self, sentence):
        return sentence  # make sure this is a collection

It simply returns your already-tokenized data. I don't know the precise data structures you use, but I believe that if you omit or replace some parts of sumy it will work without modifying the sumy code. I may be wrong though; I have never worked with the case you describe here. Let me know :)
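As a quick sanity check of the pass-through idea, here is a self-contained version with a usage example. The input shape (a paragraph as a list of sentences, each a tuple of tokens) and the `language` value are my assumptions, not something sumy prescribes:

```python
class Tokenizer:
    """Pass-through tokenizer: the data is already tokenized upstream."""
    language = "english"  # assumed value; set to your own language

    def to_sentences(self, paragraph):
        # the paragraph is already a list of sentences
        return paragraph

    def to_words(self, sentence):
        # the sentence is already a collection of tokens
        return sentence

tokenizer = Tokenizer()
paragraph = [("good", "morn"), ("pre", "process", "stem")]
sentences = tokenizer.to_sentences(paragraph)
words = tokenizer.to_words(sentences[0])
```

Both calls return their input unchanged, which is exactly what you want when stems/lemmas are produced by an external pipeline.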
