
Summarization and keywords: custom pre-processing and language independence #1021

Open
olavurmortensen opened this issue Nov 16, 2016 · 23 comments
Labels: difficulty medium · feature · wishlist

Comments

@olavurmortensen
Contributor

At the moment, it is not possible to pre-process text yourself before applying the summarization module to it. Among other things, this means you cannot summarize non-English text, even though the model itself is language-independent.

As far as I can see, only this line is language-dependent; similarly, this line in the keywords module. I'm not familiar with the codebase, so it is quite possible that things are more complicated than that.

I suggest that someone (could be me) make this pre-processing optional, so you could either

  • supply the class with already pre-processed data, or
  • supply raw text together with your own pre-processing function (with the same output as _clean_text_by_word(text)); a hypothetical sketch of this interface follows the list.
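
A purely hypothetical sketch of what that dual interface could look like. None of these parameter names exist in gensim today; default_clean and rank_units are made-up stand-ins for the current _clean_text_by_word pipeline and the language-independent TextRank core:

```python
# Hypothetical interface sketch -- nothing here exists in gensim today.
def default_clean(text):
    # Stand-in for the current (English-only) _clean_text_by_word pipeline.
    return text.lower().split()

def rank_units(units):
    # Stand-in for the language-independent TextRank core.
    return units

def summarize(text=None, units=None, preprocess=None):
    # Option 1: the caller supplies already pre-processed units directly.
    if units is None:
        # Option 2: raw text, cleaned by the caller's own function if given,
        # falling back to the built-in cleaning otherwise.
        clean = preprocess if preprocess is not None else default_clean
        units = clean(text)
    return rank_units(units)
```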

This was brought to my attention on the blog, where someone wanted to summarize a Portuguese dataset.

@tmylk added the difficulty medium and wishlist labels on Nov 16, 2016
@ELind77
Contributor

ELind77 commented Nov 21, 2016

I would like to second this request. I strongly prefer a different pre-processing library (spaCy); being forced to rely on the functions in gensim (which were mostly implemented by the original paper authors as part of their proof of concept) is slow and produces poor results compared with better pre-processing.

Back when the paper came out and the module was first added to gensim, I was able to use summarize_corpus with my own tokens without too much modification, but this isn't as easy with the keywords function.

I'm working specifically with the keywords function, because that's what I'm interested in at this point:

I looked into the code a bit (not a lot of comments or docstrings in there, by the way) and it looks like _build_graph just takes a list of strings; I don't think that _set_graph_edges, or any of the things it calls, actually needs the tokens dictionary. If the text is already pre-processed, there is no reason for there to be discrepancies between the tokens dictionary and the (already) split_text. If that's the case, there is no reason for _set_graph_edge to check tokens.
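
To make that concrete, here is a networkx paraphrase of my reading of the graph-building step. This is not gensim's actual code (gensim uses its own internal Graph class); it's just the shape of what I think it does:

```python
# Paraphrase of my reading, not gensim's source. Nodes are the pre-processed
# tokens; edges connect tokens that co-occur within a small sliding window.
# Nothing here needs the tokens dictionary.
import networkx as nx

def build_cooccurrence_graph(tokens, window_size=2):
    graph = nx.Graph()
    graph.add_nodes_from(tokens)
    # Slide a window over the ordered token list and connect co-occurring tokens.
    for i in range(len(tokens) - window_size + 1):
        window = tokens[i:i + window_size]
        for j, first in enumerate(window):
            for second in window[j + 1:]:
                if first != second:
                    graph.add_edge(first, second)
    return graph
```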

OK, next step: extracted tokens/lemmas. First, why do the variable and function names denote two different things? In any case, from what I can tell, everything in keywords.py from lines 216-226 spends a lot of time creating a dictionary that maps word forms to the PageRank score of that word form. If that's right, then there isn't much need for that either, if the user does their own pre-processing. The code is a bit confusing (maybe it's just me?), so I'm not certain, but I'm pretty sure that's what's happening.
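
If that reading is right, the block is equivalent to something like this (again, a paraphrase of my reading, not the actual source):

```python
# Paraphrase, not gensim's code: fold the PageRank score of each lemma back
# onto every surface form of that lemma, so results can be reported as words.
def scores_by_word_form(pagerank_scores, lemma_to_words):
    """pagerank_scores: {lemma: score}; lemma_to_words: {lemma: [word forms]}."""
    word_scores = {}
    for lemma, score in pagerank_scores.items():
        for word in lemma_to_words.get(lemma, []):
            word_scores[word] = score
    return word_scores
```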

Now we're almost at the finish line, but get_combined_keywords also does some text processing (not sure why, since everything it does should already have been done earlier): it calls _strip_word. The rest of the functions seem to be centered around dealing with multiple word forms for various tokens; again, not a problem if the user does their own pre-processing (with the understanding that they may lose some nuance).

To summarize: it may be possible to use your own pre-processing if you refactor everything in the call path of _set_graph_edges and then deal with the PageRank results yourself.

Now that I've gone through all of this I might as well try it and see what happens.

-- Eric

@ELind77
Contributor

ELind77 commented Nov 21, 2016

@olavurmortensen I got the keywords function to work on an array of strings: https://gist.github.com/ELind77/c7c47d68ac7aa05f1493bc492de7dd5c
The refactor isn't very pretty but it works. Hope this helps even though it's only half of what you were looking for.

@tmylk Any thoughts on what a real refactor should look like in terms of the code path and return values?

@tmylk
Contributor

tmylk commented Nov 22, 2016

@ELind77 Actually, just spaCy integration would be great, and it's easier than an abstract refactor. It's the best preprocessing library around and we should direct people to use it.

@ELind77
Contributor

ELind77 commented Nov 22, 2016

@tmylk I completely agree that spaCy is amazing. But the gensim code still needs to be refactored to accept user-pre-processed text, right?

@tmylk
Contributor

tmylk commented Nov 22, 2016

Yes.

@tmylk
Contributor

tmylk commented Nov 22, 2016

Or just pre-processing it with spaCy inside gensim seems easier and more convenient.

@ELind77
Contributor

ELind77 commented Nov 22, 2016

As much as I like spaCy, in my opinion it wouldn't address the key issue at hand, which is user choice. Maybe someone wants to use Stanford, or ClearNLP, or something none of us have ever heard of. If so, I think gensim should allow that, in the same way it does for all of its other models. It also raises separation-of-concerns issues to have gensim rely on external libraries that are not core to the algorithm being implemented. At its core, what TextRank does is build a graph and run PageRank on it. The only input fundamentally required is an ordered list of tokens (they don't even need to be strings; they could be integer ids). For whatever it's worth, I think it's better to favor flexibility.
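
To illustrate just how small the fundamental requirement is (networkx standing in for gensim's internal graph and PageRank code):

```python
# Minimal TextRank-style keyword scoring over an ordered token list.
# networkx stands in for gensim's internal graph/PageRank implementation.
import networkx as nx

def textrank_keywords(tokens, top_n=5):
    graph = nx.Graph()
    # Window of 2: connect adjacent tokens.
    for first, second in zip(tokens, tokens[1:]):
        if first != second:
            graph.add_edge(first, second)
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Integer ids work just as well as strings:
print(textrank_keywords([1, 2, 3, 1, 4, 2, 1, 3]))
```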

I think a good approach would be to keep the option of using the provided pre-processing methods for users who aren't interested in doing it themselves, while giving "power users" the ability to supply their own tokens, sentences, etc.

@tmylk
Contributor

tmylk commented Nov 22, 2016

Agree. It would be a welcome PR.

@ELind77
Contributor

ELind77 commented Nov 22, 2016

Cool!

I'd like to volunteer, though I've never contributed to gensim before. I've never seen a contributor's guide anywhere, or instructions on running (and writing) tests; is there a resource for that? I seem to remember using doctests to verify things when I modified the summarize_corpus function for my own uses a while back, but I don't recall any unit tests or anything.

Also, is there a project preference in terms of how the interface for a function that has this kind of dual functionality should look?

@tmylk
Contributor

tmylk commented Nov 22, 2016

@ELind77
Contributor

ELind77 commented Nov 22, 2016

Great, thank you! I'll take a look at the guidelines and clone the project tonight and see what I can do. Not sure how long it will take, but I may have a little time over the Thanksgiving holiday to get some coding done.

I'll post here with any followup questions.

@olavurmortensen
Contributor Author

Nice, thanks for picking this up @ELind77!

@bhargavvader
Contributor

Following this thread; it would be nice to have the pre-processing modularised separately!

@rpedela

rpedela commented Feb 10, 2017

While spaCy is a great library, I found the sentence segmentation code used by the summarizer to be superior overall for my use case: summarizing SEC filings. SEC filings are legal documents, so the language is formal, but they also contain a lot of tables, forms, and lists that don't follow normal grammar rules. I have been amazed at how well gensim segments sentences.

I guess what I am saying is that I agree with making the API more modular, but I would hate to see gensim's sentence segmentation implementation thrown away.
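
For anyone who wants to try it on their own documents, the segmenter is exposed directly (module path as of gensim 3.x; it may differ in other versions):

```python
# Try gensim's sentence segmentation on its own.
from gensim.summarization.textcleaner import split_sentences

filing_text = ("Item 1A. Risk Factors. Our revenue depends on a small "
               "number of customers. See Note 7 to the financial statements.")
for sentence in split_sentences(filing_text):
    print(sentence)
```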

@bhargavvader
Contributor

@ELind77 are you still taking a stab at this? If not, I'd like to; I have some free time on my hands.

@ELind77
Contributor

ELind77 commented Feb 21, 2017

@bhargavvader I just started a new job and need approval before I can contribute to open-source projects. I'm not sure how long that will take. If you want to take this, please post your progress here. Otherwise I'll come back to it when I have time.

@bhargavvader
Contributor

Cool, will open a PR soon then :)

@piskvorky
Owner

My 2 cents: we definitely don't want to add a hard dependency on spaCy. It's compiled, slow to load, requires lots of memory and, like @ELind77 and @rpedela say, users may prefer other tools/approaches, for whatever reason.

Apologies for the poor code quality and design of this module (API, comments, documentation etc). It was a student contribution and perhaps our checks weren't as thorough as they should have been. @tmylk is being more careful with new code now!

@ELind77
Contributor

ELind77 commented Jul 23, 2017

@piskvorky I've settled down at my new job and gotten permission to start contributing again. If there is still a need for this I'll give it a shot next month. I should have something before the end of September.

@bhargavvader If you have a fork with this implemented please post here to let me know.

@piskvorky
Owner

Awesome, welcome back :) CC @menshikh-iv

@menshikh-iv added the feature label on Oct 2, 2017
@bhargavvader
Contributor

bhargavvader commented Jan 15, 2018

Sorry @ELind77, I've been too busy to get back to this, but this really is a feature that would be great to have implemented some time soon...

Another, similar request: it would be cool if we could plug our own distance metrics into modules that use any kind of similarity/distance function, like the Similarity class. (ping @piskvorky, @menshikh-iv)

@bastij

bastij commented Aug 9, 2018

I have managed to customize the summarization and keywords methods for the German language by creating my own preprocessing with spaCy and changing only a few lines of code in the summarizer and keywords code. The results I have achieved are really satisfying.

If you want to do something similar, pay attention to the following:
The summarizer expects a list of SyntacticUnits (gensim.summarization.syntactic_unit), where each SyntacticUnit stands for one sentence.
The keywords method expects a dictionary of SyntacticUnits, where each SyntacticUnit stands for one word of the text; the key for each SyntacticUnit is the corresponding preprocessed token of that word.

Each SyntacticUnit takes as input the original sentence or word and the preprocessed token set or token.
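
A minimal sketch of what that preprocessing can look like. This assumes gensim 3.x (where SyntacticUnit takes the original text plus an optional preprocessed token) and spaCy's German model de_core_news_sm, and it leaves out the few in-place changes to the summarizer/keywords code:

```python
# Sketch only -- assumes gensim 3.x's SyntacticUnit(text, token=None, tag=None)
# and spaCy's de_core_news_sm model; the small edits inside the summarizer and
# keywords code are not shown.
import spacy
from gensim.summarization.syntactic_unit import SyntacticUnit

nlp = spacy.load("de_core_news_sm")

def preprocess_german(text):
    doc = nlp(text)
    # Summarizer input: one SyntacticUnit per sentence, holding the original
    # sentence and its preprocessed (lemmatized, filtered) token string.
    sentences = [
        SyntacticUnit(sent.text,
                      " ".join(tok.lemma_.lower() for tok in sent
                               if not (tok.is_stop or tok.is_punct)))
        for sent in doc.sents
    ]
    # Keywords input: a dict of SyntacticUnits keyed by each word's
    # preprocessed token.
    words = {
        tok.lemma_.lower(): SyntacticUnit(tok.text, tok.lemma_.lower())
        for tok in doc
        if not (tok.is_stop or tok.is_punct)
    }
    return sentences, words
```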

@vonBlasberg

Hi @bastij, I would be really interested in your solution. Would you be willing to share your code? Thanks in advance!
