Module for automatic summarization #324
This adds a module for automatic summarization based on TextRank. It features both uses introduced in the original paper: sentence extraction for summaries and keyword extraction.
The input can be either a gensim corpus or the raw text. The output is a summarized text, a list of sentences or a list of keywords.
The summaries generated using the sentence extraction feature were evaluated using the ROUGE toolkit and the 2002 Document Understanding Conference corpus, as in the original paper. Our results were similar: TextRank outperformed the DUC baseline by 2.3%.
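For readers unfamiliar with TextRank, the sentence-extraction idea can be sketched in plain Python. This is a minimal illustration and not the code in this PR: the function names (`split_sentences`, `tokenize`, `similarity`, `textrank`, `summarize_sketch`), the naive regex sentence splitter, the use of unique-word sets for sentence length, and the fixed iteration count are all assumptions made for the example. The similarity is the word-overlap measure from the TextRank paper, and the ranking is a weighted PageRank with damping 0.85.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased set of alphanumeric tokens; no stemming or stop-word removal.
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    # Shared words divided by the sum of the log sentence lengths.
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences contain a single word
        return 0.0
    return len(a & b) / denom


def textrank(sentences, damping=0.85, iterations=50):
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Symmetric weight matrix with no self-loops.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                # Skip isolated sentences to avoid dividing by zero.
                if out_sum[j] > 0:
                    rank += weights[j][i] / out_sum[j] * scores[j]
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def summarize_sketch(text, count=2):
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with any other receives only the baseline score `1 - damping`, so it is the last candidate for the summary.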
It looks awesome!
I have just a few minor notes.
Though I'm not sure about the licence line. This is maybe more of a question for @piskvorky
I have already prepared points 2) and 3) in my local clone, so if you don't mind I can push them (also if the licence line is correct).
Also do you plan to write some more tests? For example for
I found several problems when experimenting with "pathological" cases, e.g. very short input text.
Great job guys :)
Re. open source license: sure, LGPL, like the rest of gensim.
Re. corner cases: we want these to work too. Or, if there's a reason they can't, a clearer error message, so we don't confuse users. Good catch.
@ziky90, if you have some fixes ready, open a new PR against
A brief summary of resource use (CPU & RAM: big-O + practical tips) would be nice indeed.
@fedelopez77 @fbarrios, do you think you could write a short article about this new functionality?
I'll publish that (or provide a link to your published version) + link there from API docs.
Thanks for the feedback :)
@ziky90 @piskvorky We have found that the input text must be at least around ten sentences long for the summary to make sense.
The ZeroDivisionError issue is definitely a bug, and it will happen every time a text with just one sentence is provided. It will also happen if the text has sentences without any words in common or meaningless text.
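One plausible way this kind of error can arise, offered as an assumption rather than a diagnosis of the code in this PR: graph ranking usually normalizes each sentence's edge weights by their sum, and that sum is zero when there is only one sentence or when no two sentences share a word. The sketch below uses a hypothetical helper, `normalized_edge_weights`, to show how such a step could fail with a clearer message instead of a bare `ZeroDivisionError`.

```python
def normalized_edge_weights(weights):
    """Scale edge weights so they sum to 1, or fail with a descriptive error."""
    total = sum(weights)
    if total == 0:
        # Covers both an empty edge list (single sentence) and all-zero
        # similarities (no words in common between any pair of sentences).
        raise ValueError(
            "cannot rank sentences: the input needs at least two sentences "
            "that share some words"
        )
    return [w / total for w in weights]
```

Raising `ValueError` here is a design choice for the example; returning an empty summary with a warning would be another reasonable option.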
@ziky90 Yes, we didn’t provide enough tests. We plan to add them soon.
@piskvorky Yes, I agree we should write more documentation. We’re a little busy at the moment with university stuff, but we’ll be working on that as soon as we can.
We've got some updates on this, sorry for the delay.
We've been working on your suggestions. The border cases and the bug @ziky90 pointed out are both fixed. Thanks!
We do have some texts about the project, but most of them are written in Spanish:
We made this script to summarize the English Wikipedia.
I'll keep you posted when we make advances with the documentation or have some performance results.
Are you aware of any work on, or ideas for, using Word2Vec for summarization? It would be interesting to use word2vec similarity as the similarity function and see the results. Moreover, do you have any module to convert a text/sentence/document to a vector based on a w2v model, so that the model can be used to find the similarity among the sentences?
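A common baseline for the idea in this question is to average the word vectors of a sentence and compare sentences by cosine similarity. The sketch below is only an illustration of that baseline: `TOY_VECTORS` is a made-up three-dimensional table standing in for a trained word2vec model, and `sentence_vector`, `cosine` and `sentence_similarity` are hypothetical helpers, not part of gensim or of this PR.

```python
import math

# Hypothetical toy vectors (assumption); a real model would supply learned ones.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "car": [0.0, 1.0, 0.0],
    "truck": [0.0, 0.9, 0.1],
}


def sentence_vector(words, vectors):
    # Average the vectors of the words the model knows; None if it knows none.
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_similarity(words_a, words_b, vectors=TOY_VECTORS):
    va = sentence_vector(words_a, vectors)
    vb = sentence_vector(words_b, vectors)
    # Sentences with no known words are treated as unrelated.
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)
```

A function like `sentence_similarity` could in principle replace the word-overlap measure as the edge weight in the sentence graph; whether that improves summaries would have to be checked empirically.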
@piskvorky We integrated the development branch to make the changes mergeable.
We still have to write a more detailed description and more tests over the following days, but we think the PR is ready to be merged.
Jul 5, 2015
Hello, GenSim folks.