
Summarization and keywords: custom pre-processing and language independence #1021

Open
olavurmortensen opened this issue Nov 16, 2016 · 23 comments
Labels: difficulty medium · feature · wishlist

Comments

@olavurmortensen
Contributor

At the moment, it is not possible to pre-process text yourself before applying the summarization module to it. Among other things, this means you cannot summarize non-English text, even though the model itself is language-independent.

As far as I can see, only this line is language-dependent; similarly, this line in the keywords module. I'm not familiar with the codebase, so it is quite possible that things are more complicated than that.

I suggest that someone (could be me) make this pre-processing optional, so you could either

  • supply the class with already pre-processed data, or
  • supply raw text together with your own pre-processing function (with the same output as _clean_text_by_word(text)); a hypothetical sketch of this interface follows the list.
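
A purely hypothetical sketch of what that dual interface could look like. None of these parameter names exist in gensim today; default_clean and rank_units are made-up stand-ins for the current _clean_text_by_word pipeline and the language-independent TextRank core:

```python
# Hypothetical interface sketch -- nothing here exists in gensim today.
def default_clean(text):
    # Stand-in for the current (English-only) _clean_text_by_word pipeline.
    return text.lower().split()

def rank_units(units):
    # Stand-in for the language-independent TextRank core.
    return units

def summarize(text=None, units=None, preprocess=None):
    # Option 1: the caller supplies already pre-processed units directly.
    if units is None:
        # Option 2: raw text, cleaned by the caller's own function if given,
        # falling back to the built-in cleaning otherwise.
        clean = preprocess if preprocess is not None else default_clean
        units = clean(text)
    return rank_units(units)
```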

This was brought to my attention on the blog, where someone wanted to summarize a Portuguese dataset.

@tmylk added the difficulty medium and wishlist labels on Nov 16, 2016
@ELind77
Contributor

ELind77 commented Nov 21, 2016

I would like to second this request. I strongly prefer a different pre-processing library (spaCy); being forced to rely on the functions in gensim (which were mostly implemented by the original paper authors as part of their proof of concept) is slow and produces poor results compared with better pre-processing.

Back when the paper came out and the module was first added to gensim, I was able to use summarize_corpus with my own tokens without too much modification, but this isn't as easy with the keywords function.

I'm working specifically with the keywords function, because that's what I'm interested in at this point:

I looked into the code a bit (not a lot of comments or docstrings in there, by the way) and it looks like _build_graph just takes a list of strings; I don't think that _set_graph_edges, or any of the things it calls, actually needs the tokens dictionary. If the text is already pre-processed, there is no reason for there to be discrepancies between the tokens dictionary and the (already) split_text. If that's the case, there is no reason for _set_graph_edge to check tokens.
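
To make that concrete, here is a networkx paraphrase of my reading of the graph-building step. This is not gensim's actual code (gensim uses its own internal Graph class); it's just the shape of what I think it does:

```python
# Paraphrase of my reading, not gensim's source. Nodes are the pre-processed
# tokens; edges connect tokens that co-occur within a small sliding window.
# Nothing here needs the tokens dictionary.
import networkx as nx

def build_cooccurrence_graph(tokens, window_size=2):
    graph = nx.Graph()
    graph.add_nodes_from(tokens)
    # Slide a window over the ordered token list and connect co-occurring tokens.
    for i in range(len(tokens) - window_size + 1):
        window = tokens[i:i + window_size]
        for j, first in enumerate(window):
            for second in window[j + 1:]:
                if first != second:
                    graph.add_edge(first, second)
    return graph
```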

OK, next step: extracted tokens/lemmas. First, why do the variable and function names denote two different things? In any case, from what I can tell, everything in keywords.py from lines 216-226 spends a lot of time creating a dictionary that maps word forms to the PageRank score of that word form. If that's right, then there isn't much need for that either, if the user does their own pre-processing. The code is a bit confusing (maybe it's just me?), so I'm not certain, but I'm pretty sure that's what's happening.
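
If that reading is right, the block is equivalent to something like this (again, a paraphrase of my reading, not the actual source):

```python
# Paraphrase, not gensim's code: fold the PageRank score of each lemma back
# onto every surface form of that lemma, so results can be reported as words.
def scores_by_word_form(pagerank_scores, lemma_to_words):
    """pagerank_scores: {lemma: score}; lemma_to_words: {lemma: [word forms]}."""
    word_scores = {}
    for lemma, score in pagerank_scores.items():
        for word in lemma_to_words.get(lemma, []):
            word_scores[word] = score
    return word_scores
```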

Now we're almost at the finish line, but get_combined_keywords also does some text processing (not sure why, since everything it does should already have been done earlier): it calls _strip_word. The rest of the functions seem to be centered around dealing with multiple word forms for various tokens; again, not a problem if the user does their own pre-processing (with the understanding that they may lose some nuance).

To summarize: it may be possible to use your own pre-processing if you refactor everything in the call path of _set_graph_edges and then deal with the PageRank results yourself.

Now that I've gone through all of this I might as well try it and see what happens.

-- Eric

@ELind77
Contributor

ELind77 commented Nov 21, 2016

@olavurmortensen I got the keywords function to work on an array of strings: https://gist.github.com/ELind77/c7c47d68ac7aa05f1493bc492de7dd5c
The refactor isn't very pretty but it works. Hope this helps even though it's only half of what you were looking for.

@tmylk Any thoughts on what a real refactor should look like in terms of the code path and return values?

@tmylk
Contributor

tmylk commented Nov 22, 2016

@ELind77 Actually, just spaCy integration would be great, and it's easier than an abstract refactor. It's the best preprocessing library around and we should direct people to use it.

@ELind77
Contributor

ELind77 commented Nov 22, 2016

@tmylk I completely agree that spaCy is amazing. But the gensim code still needs to be refactored to accept user-pre-processed text, right?

@tmylk
Contributor

tmylk commented Nov 22, 2016

Yes.

@tmylk
Contributor

tmylk commented Nov 22, 2016

Or just pre-processing it with spaCy inside gensim seems easier and more convenient.

@ELind77
Contributor

ELind77 commented Nov 22, 2016

As much as I like spaCy, in my opinion it wouldn't address the key issue at hand, which is user choice. Maybe someone wants to use Stanford, or ClearNLP, or something none of us have ever heard of. If so, I think gensim should allow that, in the same way it does for all of its other models. It also raises separation-of-concerns issues to have gensim rely on external libraries that are not core to the algorithm being implemented. At its core, what TextRank does is build a graph and run PageRank on it. The only input fundamentally required is an ordered list of tokens (they don't even need to be strings; they could be integer ids). For whatever it's worth, I think it's better to favor flexibility.
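
To illustrate just how small the fundamental requirement is (networkx standing in for gensim's internal graph and PageRank code):

```python
# Minimal TextRank-style keyword scoring over an ordered token list.
# networkx stands in for gensim's internal graph/PageRank implementation.
import networkx as nx

def textrank_keywords(tokens, top_n=5):
    graph = nx.Graph()
    # Window of 2: connect adjacent tokens.
    for first, second in zip(tokens, tokens[1:]):
        if first != second:
            graph.add_edge(first, second)
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Integer ids work just as well as strings:
print(textrank_keywords([1, 2, 3, 1, 4, 2, 1, 3]))
```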

I think a good approach would be to keep the option of using the provided pre-processing methods for users who aren't interested in doing it themselves, while giving "power users" the ability to supply their own tokens, sentences, etc.

@tmylk
Contributor

tmylk commented Nov 22, 2016

Agree. It would be a welcome PR.

@ELind77
Contributor

ELind77 commented Nov 22, 2016

Cool!

I'd like to volunteer, though I've never contributed to gensim before. I've never seen a contributor's guide anywhere, or instructions on running (and writing) tests; is there a resource for that? I seem to remember using doctests to verify things when I modified the summarize_corpus function for my own uses a while back, but I don't recall any unit tests or anything.

Also, is there a project preference in terms of how the interface for a function that has this kind of dual functionality should look?

@tmylk
Contributor

tmylk commented Nov 22, 2016

@ELind77
Contributor

ELind77 commented Nov 22, 2016

Great, thank you! I'll take a look at the guidelines and clone the project tonight and see what I can do. Not sure how long it will take, but I may have a little time over the Thanksgiving holiday to get some coding done.

I'll post here with any followup questions.

@olavurmortensen
Contributor Author

Nice, thanks for picking this up @ELind77!

@bhargavvader
Contributor

Following this thread; it would be nice to have the pre-processing modularised separately!

@rpedela

rpedela commented Feb 10, 2017

While spaCy is a great library, I found the sentence segmentation code used by the summarizer to be superior overall for my use case: summarizing SEC filings. SEC filings are legal documents, so the language is formal, but they also contain a lot of tables, forms, and lists that don't follow normal grammar rules. I have been amazed at how well gensim segments sentences.

I guess what I am saying is that I agree with making the API more modular, but I would hate to see gensim's sentence segmentation implementation thrown away.
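
For anyone who wants to try it on their own documents, the segmenter is exposed directly (module path as of gensim 3.x; it may differ in other versions):

```python
# Try gensim's sentence segmentation on its own.
from gensim.summarization.textcleaner import split_sentences

filing_text = ("Item 1A. Risk Factors. Our revenue depends on a small "
               "number of customers. See Note 7 to the financial statements.")
for sentence in split_sentences(filing_text):
    print(sentence)
```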

@bhargavvader
Contributor

@ELind77 are you still taking a stab at this? If not, I'd like to; I have some free time on my hands.

@ELind77
Contributor

ELind77 commented Feb 21, 2017

@bhargavvader I just started a new job and need approval before I can contribute to open-source projects. I'm not sure how long that will take. If you want to take this, please post your progress here. Otherwise I'll come back to it when I have time.

@bhargavvader
Contributor

Cool, will open a PR soon then :)

@piskvorky
Owner

My 2 cents: we definitely don't want to add a hard dependency on spaCy. It's compiled, slow to load, requires lots of memory and, like @ELind77 and @rpedela say, users may prefer other tools/approaches, for whatever reason.

Apologies for the poor code quality and design of this module (API, comments, documentation etc). It was a student contribution and perhaps our checks weren't as thorough as they should have been. @tmylk is being more careful with new code now!

@ELind77
Contributor

ELind77 commented Jul 23, 2017

@piskvorky I've settled down at my new job and gotten permission to start contributing again. If there is still a need for this I'll give it a shot next month. I should have something before the end of September.

@bhargavvader If you have a fork with this implemented please post here to let me know.

@piskvorky
Owner

Awesome, welcome back :) CC @menshikh-iv

@menshikh-iv added the feature label on Oct 2, 2017
@bhargavvader
Contributor

bhargavvader commented Jan 15, 2018

Sorry @ELind77, I've been too busy to get back to this, but this really is a feature that would be great to have implemented some time soon...

Another, similar request: it would be cool if we could plug our own distance metrics into modules that use any kind of similarity/distance function, like the Similarity class. (ping @piskvorky, @menshikh-iv)

@bastij

bastij commented Aug 9, 2018

I have managed to customize the summarization and keywords methods for the German language by creating my own preprocessing with spaCy and changing only a few lines of code in the summarizer and keywords code. The results I have achieved are really satisfying.

If you want to do something similar, pay attention to the following:
The summarizer expects a list of SyntacticUnits (gensim.summarization.syntactic_unit), where each SyntacticUnit stands for one sentence.
The keywords method expects a dictionary of SyntacticUnits, where each SyntacticUnit stands for one word of the text; the key for each SyntacticUnit is the corresponding preprocessed token of that word.

Each SyntacticUnit takes as input the original sentence or word and the preprocessed token set or token.
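
A minimal sketch of what that preprocessing can look like. This assumes gensim 3.x (where SyntacticUnit takes the original text plus an optional preprocessed token) and spaCy's German model de_core_news_sm, and it leaves out the few in-place changes to the summarizer/keywords code:

```python
# Sketch only -- assumes gensim 3.x's SyntacticUnit(text, token=None, tag=None)
# and spaCy's de_core_news_sm model; the small edits inside the summarizer and
# keywords code are not shown.
import spacy
from gensim.summarization.syntactic_unit import SyntacticUnit

nlp = spacy.load("de_core_news_sm")

def preprocess_german(text):
    doc = nlp(text)
    # Summarizer input: one SyntacticUnit per sentence, holding the original
    # sentence and its preprocessed (lemmatized, filtered) token string.
    sentences = [
        SyntacticUnit(sent.text,
                      " ".join(tok.lemma_.lower() for tok in sent
                               if not (tok.is_stop or tok.is_punct)))
        for sent in doc.sents
    ]
    # Keywords input: a dict of SyntacticUnits keyed by each word's
    # preprocessed token.
    words = {
        tok.lemma_.lower(): SyntacticUnit(tok.text, tok.lemma_.lower())
        for tok in doc
        if not (tok.is_stop or tok.is_punct)
    }
    return sentences, words
```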

@vonBlasberg

Hi @bastij, I would be really interested in your solution. Would you be willing to share your code? Thanks in advance!
