Module for automatic summarization #324

Merged
merged 35 commits into RaRe-Technologies:develop from summanlp/develop on Jul 5, 2015

Contributor

fedelopez77 commented Apr 17, 2015

This adds a module for automatic summarization based on TextRank. It features both uses introduced in the original paper: sentence extraction for summaries and keyword extraction.

The input can be either a gensim corpus or the raw text. The output is a summarized text, a list of sentences or a list of keywords.

from gensim import corpora
from gensim import summarization

text = "Jerry works in his father-in-law's car dealership and has gotten " + \
       "himself in financial problems. He tries various schemes to come " + \
       "up with money needed for a reason that is never really explained. " + \
       "It has to be assumed that his huge embezzlement of money from the " + \
       "dealership is about to be discovered by father-in-law. When all  " + \
       "else falls through, plans he set in motion earlier for two men to " + \
       "kidnap his wife for ransom to be paid by her wealthy father (who " + \
       "doesn't seem to have the time of day for son-in-law). From the " + \
       "moment of the kidnapping, things go wrong and what was supposed " + \
       "to be a non-violent affair turns bloody with more blood added by " + \
       "the minute. Jerry is upset at the bloodshed, which turns loose a " + \
       "pregnant sheriff from Brainerd, MN who is tenacious in attempting " + \
       "to solve the three murders in her jurisdiction."

sentences = text.split(".")
tokens = [sentence.split() for sentence in sentences]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(sentence_tokens) for sentence_tokens in tokens]

print(summarization.textrank_from_corpus(corpus, len(dictionary.token2id)))
# [[(3, 1), (7, 1), (9, 1), (18, 1), (19, 1), (25, 1), (26, 2), (31, 1),
#   (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1),
#   (39, 1), (40, 1), (41, 1), (42, 1)]]

# Or from the raw text
print(summarization.summarize(text))
# 'It has to be assumed that his huge embezzlement of money from the
# dealership is about to be discovered by father-in-law.'

The summaries generated using the sentence extraction feature were evaluated using the ROUGE toolkit and the 2002 Document Understanding Conference corpus, as in the original paper. The results we found were similar: TextRank performed better than the DUC baseline by 2.3%.
We include a test that reproduces the results of the paper using a sample article.
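
For readers unfamiliar with ROUGE, here is a minimal sketch of unigram recall (ROUGE-1 recall), the kind of overlap score such an evaluation reports. It is not the ROUGE toolkit itself, which also handles stemming, stop words and multiple reference summaries. The function name `rouge_1_recall`, the whitespace tokenisation and the two example sentences are assumptions made for illustration.

```python
from collections import Counter


def rouge_1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate.

    Illustrative only: lowercases, splits on whitespace, clips repeated words.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    # A reference word is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[word])
                  for word, count in ref_counts.items())
    return overlap / float(sum(ref_counts.values()))


# Hypothetical reference and system summaries.
reference = "the cat sat on the mat"
candidate = "the cat lay on a mat"
score = rouge_1_recall(candidate, reference)
print("ROUGE-1 recall: %.3f" % score)
```

Four of the six reference words are matched here, so the score is 4/6.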

Member

piskvorky commented Apr 22, 2015

Thanks @fedelopez77 !

This is a meaty addition, the review may take a while :)

In the meantime, can you fix the failing unit tests? Travis reports failures on 2.6, 3.3 and 3.4.

nick-magnini commented Apr 23, 2015

It would be very interesting to build the doc vectors from a w2v model trained on a different corpus, evaluate the summaries, and compare them with the original results.

Contributor

ziky90 commented Apr 26, 2015

It looks awesome!

I have just a few minor notes.

  1. In summarization.summarize(), wouldn't it be good to also allow users to call it with a corpus in addition to text? Alternatively, there could be another method, for example summarization.summarize_corpora(). It seems to me a possible simplification of working with the summarization module. Does this make sense?

  2. From a formal point of view, shouldn't every Python file start with:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

Though I'm not sure about the licence line. This is maybe more of a question for @piskvorky.

  3. See the line comment in gensim/summarization/graph.py.

I have prepared points 2) and 3) in my local clone, so if you don't mind I can push them (provided the licence line is correct).
I can do point 1) and push it as well, but I'd better ask first: do you think a new method should be added, or should summarization.summarize() simply handle corpora directly?

@fedelopez77
I'd also like to ask whether you have run any performance tests and measurements (e.g. how long it takes to summarize every article in the English Wikipedia, or the 567 articles referenced in the paper), and whether you could publish them in the initial comment. (I am currently trying to do something similar myself, so this would save me some time.)

Also, do you plan to write some more tests? For example, for summarization.textrank_from_corpus you could try some other inputs, some cases that should not work, etc.

I have found several problems when experimenting with "pathological" cases, e.g. very short input text.
For example, summarization.summarize("Jerry works in his father himself in financial problems. He tries various schemes to come up with money needed for a reason that is never really explained.") throws ZeroDivisionError: float division by zero. I think this should at least be replaced by a custom gensim exception telling the user that summarization needs a longer input text.

Then summarization.summarize("Jerry works in his father himself in financial problems. It has to be assumed that his huge embezzlement of money dealership is about to be discovered by father-in-law.") returns an empty summary. Again, I am not sure what the proper behaviour should be in a case like this.
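
The suggestion of a dedicated exception for short inputs could look roughly like the sketch below. Everything in it is hypothetical: `InputTooShortError`, `checked_sentences`, the naive sentence splitter and the threshold of three sentences are illustrative assumptions, not gensim's API or the fix that was eventually applied.

```python
import re


class InputTooShortError(ValueError):
    """Hypothetical exception for this sketch; not part of gensim."""


MIN_SENTENCES = 3  # assumed threshold, chosen only for illustration


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def checked_sentences(text, min_sentences=MIN_SENTENCES):
    """Return the sentences, or raise a descriptive error if there are too few."""
    sentences = split_sentences(text)
    if len(sentences) < min_sentences:
        raise InputTooShortError(
            "input has %d sentence(s); at least %d are needed to summarize"
            % (len(sentences), min_sentences)
        )
    return sentences


try:
    checked_sentences("Jerry works in his father himself in financial problems. "
                      "He tries various schemes to come up with money.")
except InputTooShortError as err:
    print("rejected: %s" % err)
```

Raising early with a message that names the problem avoids surfacing an unrelated ZeroDivisionError from deeper in the computation.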

from abc import ABCMeta, abstractmethod


class IGraph:

ziky90 Apr 26, 2015

Contributor

Just a formal note: it'd be nice to keep it consistent with the rest of gensim and inherit from object: class IGraph(object):
Secondly, there is sometimes too much whitespace (per the PEP 8 style guide).
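
As a minimal sketch of the point above, an abstract interface can derive from object and still use ABCMeta on both Python 2 and Python 3. The helper base `_AbstractBase`, the two method names and the `ListGraph` class are assumptions for illustration; they are not taken from gensim/summarization/graph.py.

```python
from abc import ABCMeta, abstractmethod

# Create a base class whose metaclass is ABCMeta and whose parent is object.
# Calling the metaclass directly works the same way on Python 2 and 3.
_AbstractBase = ABCMeta("_AbstractBase", (object,), {})


class IGraph(_AbstractBase):
    """Interface sketch; the method names are illustrative."""

    @abstractmethod
    def nodes(self):
        """Return the list of nodes."""

    @abstractmethod
    def add_node(self, node):
        """Add a node to the graph."""


class ListGraph(IGraph):
    """Hypothetical concrete class used only to exercise the interface."""

    def __init__(self):
        self._nodes = []

    def nodes(self):
        return list(self._nodes)

    def add_node(self, node):
        self._nodes.append(node)


graph = ListGraph()
graph.add_node("sentence-1")
print(graph.nodes())
```

Instantiating `IGraph` directly raises TypeError, because its abstract methods are not implemented.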

Member

piskvorky commented Apr 26, 2015

Great job guys :)

Re. open source license: sure, LGPL, like the rest of gensim.

Re. corner cases: we want these to work too. Or, if there's a reason they can't, a clearer error message, so we don't confuse users. Good catch.

@ziky90, if you have some fixes ready, open a new PR against summanlp's PR branch. I'll play with the code some more today too.

A brief summary of resource use (CPU & RAM: big-O + practical tips) would be nice indeed.

@fedelopez77 @fbarrios, do you think you could write a short article about this new functionality?
It's large enough that I think we ought to give users a short intro and tutorial (motivation, background, solution approach, how to use it, weaknesses, strengths) :)

I'll publish that (or provide a link to your published version) + link there from API docs.

Thanks!

Contributor

fbarrios commented Apr 27, 2015

Thanks for the feedback :)

@ziky90 @piskvorky We have found that the input text must be at least around ten sentences long for the summary to make sense.
The summary length is calculated as the number of sentences times the ratio, and if that equals a number below one, the resulting summary will be empty. Perhaps an exception should be thrown in that case.
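
The length calculation described above can be sketched as follows. The ratio of 0.2 and the truncation toward zero are assumptions of this sketch; the thread does not state the actual ratio or rounding the module uses.

```python
def summary_length(num_sentences, ratio):
    """Number of sentences kept: sentence count times ratio, truncated."""
    # Truncation toward zero is an assumption made for this sketch.
    return int(num_sentences * ratio)


ASSUMED_RATIO = 0.2  # hypothetical value, for illustration only

for count in (2, 4, 10, 20):
    kept = summary_length(count, ASSUMED_RATIO)
    print("%2d sentences -> %d sentence(s) in the summary" % (count, kept))
```

Under these assumptions a two-sentence or four-sentence input yields a length of zero, which matches the empty summaries reported for very short texts.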

The ZeroDivisionError issue is definitely a bug, and it will happen every time a text with just one sentence is provided. It will also happen if the text has sentences without any words in common or meaningless text.

@ziky90 Yes, we didn’t provide enough tests. We plan to add them soon.
About the performance: summarizing the 567 documents takes around 21 seconds on my Athlon II X2 240 CPU. We’ll be testing with Wikipedia and sharing the results when we have them :)

@piskvorky Yes, I agree we should write more documentation. We’re a little busy at the moment with university stuff, but we’ll be working on that as soon as we can.

Merge pull request #1 from ziky90/summarization_fixes
Consistency with gensim and pep 8

Member

piskvorky commented Apr 27, 2015

Thanks @fbarrios !

The "tutorial article" is not blocking. We can review and merge without it. But if you have a presentation / text ready (maybe for your uni?), we could link to that.

Contributor

fbarrios commented May 11, 2015

We've got some updates on this, sorry for the delay.

We've been working on your suggestions. The border cases and the bug @ziky90 pointed out are both fixed. Thanks!
We added a few tests for those cases, although we still need to add some more code documentation.

We do have some texts about the project, but most of them are written in Spanish:

We made this script to summarize the English Wikipedia.
It's been running since yesterday morning (around 30 hours) and has processed around 50,000 articles so far. Currently, Wikipedia has around 5,000,000 articles, so I don't think the script will end anytime soon.
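
A quick back-of-the-envelope extrapolation from the figures above, assuming the processing rate stays constant (it may not, since article lengths vary):

```python
# Figures quoted in the comment above.
articles_done = 50000
hours_elapsed = 30.0
total_articles = 5000000

# Assumption: the throughput observed so far holds for the rest of the run.
rate_per_hour = articles_done / hours_elapsed
total_hours = total_articles / rate_per_hour
total_days = total_hours / 24.0

print("rate: about %.0f articles per hour" % rate_per_hour)
print("full run: about %.0f hours, or %.0f days" % (total_hours, total_days))
```

At roughly 1,667 articles per hour, the whole of Wikipedia would take on the order of 3,000 hours, or about 125 days, on that machine.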

I'll keep you posted when we make advances with the documentation or have some performance results.

nick-magnini commented May 11, 2015

Hi guys,

Are you aware of any work on, or pointers to, using Word2Vec for summarization? It would be interesting to use word2vec similarity as the similarity function and see the results. Moreover, do you have a module that converts a text/sentence/document to a vector based on a w2v model, so that the model can be used to measure the similarity between sentences?
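
One common way to get a sentence similarity out of word vectors is to average the vectors of a sentence's words and compare the averages by cosine similarity. The sketch below shows that idea with toy three-dimensional vectors invented for the example. It is not part of this pull request; with a real word2vec model the lookup table would be replaced by the trained vectors.

```python
import math

# Toy "embeddings", invented for this sketch only.
TOY_VECTORS = {
    "money": [0.9, 0.1, 0.0],
    "ransom": [0.8, 0.2, 0.1],
    "sheriff": [0.1, 0.9, 0.2],
    "murders": [0.0, 0.8, 0.3],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / float(len(known)) for i in range(dims)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(tokens_a, tokens_b, vectors=TOY_VECTORS):
    vec_a = sentence_vector(tokens_a, vectors)
    vec_b = sentence_vector(tokens_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


print(sentence_similarity(["money", "ransom"], ["ransom"]))
print(sentence_similarity(["money"], ["sheriff", "murders"]))
```

A function like `sentence_similarity` could in principle stand in for the word-overlap similarity used to weight graph edges, though whether that improves the summaries would have to be measured.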

Contributor

ziky90 commented Jun 24, 2015

@fbarrios How is adding the tests going? Could I help by writing some? I would like to play with summarization a bit in my toy project, so I'd like to know how I can help get summarization merged into gensim.

Member

piskvorky commented Jul 1, 2015

We're planning to make a new release this weekend. It'd be great to get this in -- what is missing?

@fbarrios @fedelopez77 @ziky90 can you complete the PR, make it merge-able?

Cheers!

Contributor

fbarrios commented Jul 5, 2015

@piskvorky We integrated the development branch to make the changes mergeable.
We also added detailed documentation to the public methods and a few more tests, but the build is failing because a test from test_models.py is failing.

We still have to write a more detailed description and more tests over the following days, but we think the PR is ready to be merged.

Member

piskvorky commented Jul 5, 2015

Great! Merging now, thanks to the entire team!

A tutorial will be very welcome of course, that's what many people need to get started :)

piskvorky added a commit that referenced this pull request Jul 5, 2015

Merge pull request #324 from summanlp/develop
Module for automatic summarization

@piskvorky piskvorky merged commit d0e5e74 into RaRe-Technologies:develop Jul 5, 2015

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed

al7veda commented Jul 9, 2015

Hello, GenSim folks.
I'm trying to use summa/textrank-0.07 from the command line (cd path/to/folder/summa/ followed by python textrank.py -t FILE) on a 10 MB plain-text data set, on Windows 8.1 with 12 GB RAM, but I'm getting a MemoryError:
C:\Python27\Lib\site-packages\summa>python textrank.py -t train_set_1SentByLine_clean.txt
Traceback (most recent call last):
  File "textrank.py", line 75, in <module>
    main()
  File "textrank.py", line 71, in main
    print textrank(text, summarize_by, ratio, words)
  File "textrank.py", line 60, in textrank
    return summarize(text, ratio, words)
  File "C:\Python27\Lib\site-packages\summa\summarizer.py", line 97, in summarize
    _set_graph_edge_weights(graph)
  File "C:\Python27\Lib\site-packages\summa\summarizer.py", line 17, in _set_graph_edge_weights
    graph.add_edge(edge, similarity)
  File "C:\Python27\Lib\site-packages\summa\graph.py", line 177, in add_edge
    self.set_edge_properties((u, v), label=label, weight=wt)
  File "C:\Python27\Lib\site-packages\summa\graph.py", line 226, in set_edge_properties
    self.edge_properties.setdefault((edge[1], edge[0]), {}).update( properties )
MemoryError
I'd greatly appreciate any advice on how to fix this error.
Thank you,
Al
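
The traceback ends in a dictionary that stores properties for both (u, v) and (v, u). As a rough worst-case sketch, if every pair of sentences ends up connected, the number of stored entries grows quadratically with the number of sentences. The sentence counts below are hypothetical; the thread does not say how many lines the 10 MB file contains, nor whether every pair is actually connected.

```python
def directed_edge_entries(num_sentences):
    """Worst-case entry count when each unordered pair is stored in both directions."""
    return num_sentences * (num_sentences - 1)


# Hypothetical sentence counts, for illustration only.
for count in (100, 10000, 100000):
    print("%6d sentences -> %d edge entries" % (count, directed_edge_entries(count)))
```

Even before counting the per-entry overhead of a Python dict, billions of entries would not fit in 12 GB of RAM, which is consistent with a MemoryError on a large one-sentence-per-line input.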

Member

piskvorky commented Jul 9, 2015

Hello @al7veda , this is the repository for gensim. For other Python packages, please use their respective support systems directly.

al7veda commented Jul 10, 2015

Sorry, I was looking at the summa/textrank.py docs, and then switched to gensim trying to find a solution to my issue, but didn't realize that it's a different package. I'll be more careful next time.
Best.

Member

piskvorky commented Jul 10, 2015

No worries. If something fails in gensim, feel free to report an issue.
