Summarization and keywords: custom pre-processing and language independence #1021
Comments
I would like to second this request. I strongly prefer a different pre-processing library (spaCy), and being forced to rely on the functions in gensim (mostly implemented by the original paper authors as part of their proof of concept) is both slow and produces poor results compared with better pre-processing. Back when the paper came out and the module was first added to gensim, I was able to use summarize_corpus with my own tokens without too much modification, but this isn't as easy with the keywords function.

Working specifically with the keywords function, because that's what I'm interested in at this point: I looked into the code (not a lot of comments or docstrings in there, btw) a bit. Ok, next step: extracted tokens/lemmas. First, why are the variable and function name denoting two different things? In any case, from what I can tell, everything in keywords.py from line 216-226 is spending a lot of time creating a dictionary that maps word forms to the PageRank score of that word form. If that's right, then there doesn't seem to be much need for that either if the user does their own pre-processing. I'm not entirely sure this is what's going on, because the code is a bit confusing (maybe it's just me?), but I'm pretty sure that's what's happening.

To summarize, it may be possible to use your own pre-processing if you refactor everything in the relevant call path. Now that I've gone through all of this, I might as well try it and see what happens.

-- Eric
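If that reading is right, the mapping step amounts to something like the following. This is my paraphrase with hypothetical names, not the actual gensim code: scores come out of PageRank keyed by lemma, and are then fanned out onto every surface form of that lemma.

```python
def lemma_scores_to_word_scores(pagerank_scores, lemma_to_forms):
    """Expand scores keyed by lemma onto each surface form of the lemma.

    pagerank_scores: dict mapping lemma -> PageRank score
    lemma_to_forms:  dict mapping lemma -> list of surface forms
    """
    word_scores = {}
    for lemma, score in pagerank_scores.items():
        # Fall back to the lemma itself if no forms were recorded.
        for form in lemma_to_forms.get(lemma, [lemma]):
            word_scores[form] = score
    return word_scores
```

If the caller already works with their own tokens, this whole expansion is unnecessary, which is the point above.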
@olavurmortensen I got the keywords function to work on an array of strings: https://gist.github.com/ELind77/c7c47d68ac7aa05f1493bc492de7dd5c @tmylk Any thoughts on what a real refactor should look like in terms of the code path and return values?
@ELind77 Actually just Spacy integration would be great and is easier than an abstract refactor. That is the best preprocessing library around and we should direct people to use it. |
@tmylk I completely agree that spaCy is amazing. But the gensim code still needs to be refactored to accept user pre-processed text, right?
Yes. |
Or just preprocessing it with spacy inside gensim seems easier and more convenient. |
As much as I do like spaCy, in my opinion it wouldn't address the key issue at hand, which is user choice. Maybe someone wants to use Stanford, or ClearNLP, or something none of us have ever heard of. If that is the case, I think gensim should allow it in the same way that it does for all of the other models. Relying on external libraries that are not core to the algorithm being implemented also raises separation-of-concerns issues.

At its core, what TextRank is doing is building a graph and running PageRank on it. The only input fundamentally required is an ordered list of tokens (they don't even need to be strings; they could be integer ids).

For whatever it's worth, I think it's better to favor flexibility. A good approach would be to keep the option of using the provided pre-processing methods for users who aren't interested in doing it themselves, while letting "power users" provide their own tokens, sentences, etc.
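The point that an ordered token list is all TextRank fundamentally needs can be sketched in a few lines. This is a toy illustration, not gensim's implementation; the co-occurrence window of 2 and the damping constants are assumptions:

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iters=50):
    """Toy TextRank: build an undirected co-occurrence graph over an
    ordered token list and score nodes with power-iteration PageRank.
    Returns tokens sorted by score, highest first."""
    # Connect tokens that co-occur within a sliding window.
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != tok:
                neighbors[tok].add(tokens[j])
                neighbors[tokens[j]].add(tok)
    nodes = list(neighbors)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            rank = sum(score[m] / len(neighbors[m]) for m in neighbors[n])
            new[n] = (1 - damping) / len(nodes) + damping * rank
        score = new
    return sorted(score, key=score.get, reverse=True)
```

Note the tokens could just as well be integer ids from any external pipeline; nothing here cares how they were produced.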
Agree. It would be a welcome PR. |
Cool! I'd like to volunteer. I've never contributed to gensim before though. I've never seen a contributor's guide anywhere or instructions on running (and writing) tests is there a resource for that? I seem to remember I used doctests to verify when I modified the Also, is there a project preference in terms of how the interface for a function that has this kind of dual functionality should look? |
Great, thank you! I'll take a look at the guidelines and clone the project tonight and see what I can do. Not sure how long it will take, but I may have a little time over the Thanksgiving holiday to get some coding done. I'll post here with any followup questions. |
Nice, thanks for picking this up @ELind77! |
Following this thread, would be nice to have the pre-processing modularised separately! |
While spacy is a great library, I found the sentence segmentation code used by the summarizer to be superior overall for my use case: summarizing SEC filings. SEC filings are legal documents so the language is formal, but they also contain a lot of tables, forms, and lists which don't follow normal grammar rules. I have been amazed at how well sentences are segmented by gensim. I guess what I am saying is that I agree with making the API more modular, but I would hate to see gensim's sentence segmentation implementation thrown away. |
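For readers curious what rule-based segmentation of this kind looks like: the sketch below is a deliberately simplified stand-in, not gensim's actual regex (gensim's lives in its textcleaner module and is more elaborate). Requiring the next sentence to start with a capital, digit, quote, or paren is the assumption that keeps it from splitting before lowercase continuations.

```python
import re

# Split after ., ! or ? followed by whitespace, but only when the next
# chunk starts like a new sentence (capital, digit, quote, or paren).
_SENT_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"(])')

def split_sentences(text):
    """Simplified rule-based sentence splitter."""
    return [s.strip() for s in _SENT_END.split(text) if s.strip()]
```

A rule set like this degrades gracefully on tables, forms, and numbered items, which is plausibly why it holds up on SEC filings.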
@ELind77 you still taking a stab at this? If not, I'd like to - I've some free time on my hands. |
@bhargavvader I just started a new job and I need approval before I can commit to opensource projects. I'm not sure how long that will take. If you want to take this please post progress here. Otherwise I will have to come back to it when I have time. |
Cool, will open a PR soon then :) |
My 2 cents: we definitely don't want to add a hard dependency on spaCy. It's compiled, slow to load, requires lots of memory and, like @ELind77 and @rpedela say, users may prefer other tools/approaches, for whatever reason. Apologies for the poor code quality and design of this module (API, comments, documentation etc). It was a student contribution and perhaps our checks weren't as thorough as they should have been. @tmylk is being more careful with new code now! |
@piskvorky I've settled down at my new job and gotten permission to start contributing again. If there is still a need for this I'll give it a shot next month. I should have something before the end of September. @bhargavvader If you have a fork with this implemented please post here to let me know. |
Awesome, welcome back :) CC @menshikh-iv |
Sorry @ELind77 , been too busy to get back to this, but this is really a feature which would be great to have implemented some time soon... Another similar request - it would be cool if we could plug in our own distance metrics in modules which use any kind of similarity/distance functions, like the |
I have managed to customize the summarization and keywords methods for German by creating my own pre-processing with spaCy and changing only a few lines of code in the summarizer and keywords methods. The results I have achieved are really satisfying. If you want to do something similar, pay attention to the following: each SyntacticUnit expects as input the original sentence or word plus the pre-processed token set or token.
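To illustrate that pairing of original text and processed token, here is a minimal stand-in for gensim's SyntacticUnit (the real class lives in gensim's summarization package; the dataclass and helper below are hypothetical, and the German lemmas shown would normally come from spaCy or another pipeline):

```python
from dataclasses import dataclass

@dataclass
class SyntacticUnit:
    # Mirrors the pairing described above: `text` is the original
    # sentence or word, `token` the pre-processed form.
    text: str
    token: str
    tag: str = ""

def units_from_pairs(pairs):
    """Build units from (original, processed) pairs produced by any
    external pre-processing pipeline (spaCy, Stanford, ...)."""
    return [SyntacticUnit(text=orig, token=tok) for orig, tok in pairs]
```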
Hi bastij, I would be really interested in your solution. Would you be willing to share your code? Thanks in advance! |
It is not possible, at the moment, to pre-process text yourself before applying the summarization module to it. Among other things, this means that you cannot summarize non-English text, even though the model itself is language-independent.
As far as I can see, it is only this line that is language-dependent. Similarly, this line in the keywords module. I'm not familiar with the codebase, so it is well possible that it is more complicated.

I suggest that someone (could be me) make this pre-processing optional, so you could either use the built-in pre-processing or supply your own pre-processed tokens (instead of having _clean_text_by_word(text) applied).

This was brought to my attention on the blog, where someone wanted to summarize a Portuguese dataset.