WIP spaCy extension
ceteri committed Nov 2, 2019
1 parent 751c1a2 commit 2e8b7d7
Showing 17 changed files with 199 additions and 432 deletions.
7 changes: 5 additions & 2 deletions LICENSE.md → LICENSE
@@ -1,6 +1,6 @@
[MIT License](https://spdx.org/licenses/MIT.html)
MIT License

Copyright (c) 2016 [Paco Xander Nathan](https://derwen.ai/paco)
Copyright (c) 2016 Paco Xander Nathan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -19,3 +19,6 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

https://spdx.org/licenses/MIT.html
https://derwen.ai/paco
3 changes: 1 addition & 2 deletions MANIFEST.in
@@ -1,4 +1,3 @@
include README.rst
include README.md
include changelog.txt
include pytextrank/stop.txt

105 changes: 105 additions & 0 deletions README.md
@@ -0,0 +1,105 @@
# PyTextRank

*PyTextRank* is a Python implementation of *TextRank* as a
[spaCy extension](https://explosion.ai/blog/spacy-v2-pipelines-extensions),
for working with text documents to:

- extract the top-ranked phrases
- run extractive summarization

This work is based on the paper:

- ["TextRank: Bringing Order into Texts"](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
[**Rada Mihalcea**](https://web.eecs.umich.edu/~mihalcea/),
[**Paul Tarau**](https://www.cse.unt.edu/~tarau/);
[*Empirical Methods in Natural Language Processing*](https://www.researchgate.net/publication/200044196_TextRank_Bringing_Order_into_Texts)
(2004)

Several modifications improve on the algorithm originally described in the paper:

- fixes a bug in the original algorithm; see the [Java implementation, 2008](https://github.com/ceteri/textrank)
- uses *lemmatization* in place of stemming
- includes verbs in the graph, but not in resulting phrases
- leverages preprocessing based on *noun chunking* and *named entity recognition*
- provides *extractive summarization* based on vectors of ranked
phrases
- allows use of a *knowledge graph* for enriching the lemma graph and subsequent phrase extraction and summarization
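
The core idea behind the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code in this repo: the `(lemma, pos)` token format, the helper names `build_lemma_graph`, `pagerank`, and `top_lemmas`, and the damping factor are all hypothetical, and the real implementation relies on spaCy and NetworkX instead. It links each kept lemma to kept lemmas within a small lookback window, ranks the graph with a PageRank-style iteration, and then leaves verbs out of the results.

```python
from collections import defaultdict

# Assumed for illustration: tokens arrive as (lemma, part-of-speech) pairs.
POS_KEPT = {"ADJ", "NOUN", "PROPN", "VERB"}
TOKEN_LOOKBACK = 3


def build_lemma_graph(tokens, lookback=TOKEN_LOOKBACK):
    """Link each kept lemma to kept lemmas among the previous `lookback` tokens."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, (lemma, pos) in enumerate(tokens):
        if pos not in POS_KEPT:
            continue
        graph[lemma]  # register the node even if it ends up with no edges
        for prev_lemma, prev_pos in tokens[max(0, i - lookback):i]:
            if prev_pos in POS_KEPT and prev_lemma != lemma:
                # undirected edge: increment the weight in both directions
                graph[lemma][prev_lemma] += 1.0
                graph[prev_lemma][lemma] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by simple power iteration."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            incoming = 0.0
            for nbr, weight in graph[node].items():
                incoming += rank[nbr] * weight / sum(graph[nbr].values())
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


def top_lemmas(tokens, limit=5):
    """Rank all kept lemmas, then drop verbs from the reported results."""
    rank = pagerank(build_lemma_graph(tokens))
    verbs = {lemma for lemma, pos in tokens if pos == "VERB"}
    ranked = sorted(rank.items(), key=lambda kv: kv[1], reverse=True)
    return [(lemma, score) for lemma, score in ranked if lemma not in verbs][:limit]


tokens = [
    ("compatibility", "NOUN"), ("of", "ADP"), ("system", "NOUN"), ("of", "ADP"),
    ("linear", "ADJ"), ("constraint", "NOUN"), ("use", "VERB"), ("system", "NOUN"),
]
for lemma, score in top_lemmas(tokens):
    print("{:.4f} {}".format(score, lemma))
```

Verbs such as `use` still contribute edges, so they influence the ranks of their neighbors, but they are filtered out before results are returned. Treating a lemma as a verb whenever any of its occurrences is tagged `VERB` is a simplification made for this sketch.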

This implementation was inspired by the
[Williams 2016](http://mike.place/2016/summarization/)
talk on text summarization.


## Installation

Prerequisites:

- [Python 3.x](https://www.python.org/downloads/)
- [spaCy](https://spacy.io/docs/usage/)
- [NetworkX](http://networkx.readthedocs.io/)

To install from [PyPI](https://pypi.python.org/pypi/pytextrank):

```
pip install pytextrank
```

If you install directly from this Git repo, be sure to install the dependencies as well:

```
pip install -r requirements.txt
```


## Usage

For example usage, see the
[PyTextRank wiki](https://github.com/DerwenAI/pytextrank/wiki).
If you need to troubleshoot any problems:

- use [GitHub issues](https://github.com/DerwenAI/pytextrank/issues)
(recommended)
- search [related discussions on StackOverflow](https://stackoverflow.com/search?q=pytextrank)

For course materials and training, please check for calendar updates in
the article
["Natural Language Processing in Python"](https://medium.com/derwen/natural-language-processing-in-python-832b0a99791b).

Let us know if you find this useful, tell us about use cases,
describe what else you would like to see integrated, etc.
If you have questions about related consulting work in natural language, machine learning, knowledge graph, or other AI applications, contact
[Derwen, Inc.](https://derwen.ai/contact)


## Attribution

*PyTextRank* has an [MIT](https://spdx.org/licenses/MIT.html) license,
which is succinct and simplifies use in commercial applications.

Please use the following BibTeX entry for citing *PyTextRank* in publications:

```
@Misc{PyTextRank,
author = {Nathan, Paco},
title = {PyTextRank, a Python implementation of TextRank for text document NLP parsing and summarization},
howpublished = {\url{https://github.com/DerwenAI/pytextrank/}},
year = {2016}
}
```


## Kudos

Many thanks to contributors:
[@htmartin](https://github.com/htmartin),
[@williamsmj](https://github.com/williamsmj/),
[@mattkohl](https://github.com/mattkohl),
[@vanita5](https://github.com/vanita5),
[@HarshGrandeur](https://github.com/HarshGrandeur),
[@mnowotka](https://github.com/mnowotka),
[@kjam](https://github.com/kjam),
[@dvsrepo](https://github.com/dvsrepo),
[@SaiThejeshwar](https://github.com/SaiThejeshwar),
[@laxatives](https://github.com/laxatives),
[@dimmu](https://github.com/dimmu),
and for support from [Derwen, Inc.](https://derwen.ai/)
107 changes: 0 additions & 107 deletions README.rst

This file was deleted.

File renamed without changes.
18 changes: 15 additions & 3 deletions changelog.txt
@@ -1,5 +1,17 @@
# pytextrank changelog

## 2.0.0

2019-11-17

### Improved

* refactored library to run as a spaCy extension
* supports multiple languages
* significantly faster, with less memory required
* better extraction of top-ranked phrases
* WIP toward integration with knowledge graph use cases

## 1.2.1

2019-11-01
@@ -14,7 +26,7 @@

### Improved

* updated to fix for current versions of `spaCy` and `networkX` -- kudos @dimmu
* updated to fix for current versions of `spaCy` and `NetworkX` -- kudos @dimmu
* removed deprecated argument -- kudos @laxatives

## 1.1.1
@@ -23,8 +35,8 @@

### Improved

* Patch disables use of NER in spaCy until an intermittent bug is resolved.
* Will probably replace named tuples with spaCy spans instead.
* patch disables use of NER in spaCy until an intermittent bug is resolved.
* will probably replace named tuples with spaCy spans instead.

## 1.1.0

8 changes: 4 additions & 4 deletions docs/conf.py
@@ -53,17 +53,17 @@

# General information about the project.
project = 'PyTextRank'
copyright = '2017, Paco Nathan'
author = 'Paco Nathan'
copyright = '2016, Paco Xander Nathan'
author = 'Paco Xander Nathan'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '1.0'
version = '2.0'
# The full version, including alpha/beta/rc tags.
release = '1.0.1'
release = '2.0.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
File renamed without changes.
36 changes: 29 additions & 7 deletions pipe.py
@@ -8,15 +8,14 @@
import spacy
import sys
import time
import unicodedata


class TextRank:
    """
    Python implementation of TextRank by Mihalcea, et al.,
    as a spaCy extension, used to extract the top-ranked
    phrases from a text document.
    Python impl of TextRank by Mihalcea, et al., as a spaCy extension,
    used to extract the top-ranked phrases from a text document
    """

    _EDGE_WEIGHT = 1.0
    _POS_KEPT = ["ADJ", "NOUN", "PROPN", "VERB"]
    _TOKEN_LOOKBACK = 3
@@ -41,6 +40,29 @@ def reset (self):
        self.seen_lemma = {}


    @classmethod
    def cleanup_text (cls, text):
        """
        it scrubs the garble from its stream...
        or it gets the debugger again
        """
        x = " ".join(map(lambda s: s.strip(), text.split("\n"))).strip()

        x = x.replace('“', '"').replace('”', '"')
        x = x.replace("‘", "'").replace("’", "'").replace("`", "'")
        x = x.replace("…", "...").replace("–", "-")

        x = str(unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("ascii"))

        # some content returns text in bytes rather than as a str ?
        try:
            assert type(x).__name__ == "str"
        except AssertionError:
            # report the offending value; `x` is the only name in scope here
            print("not a string?", type(x), x)

        return x
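
A note on what `cleanup_text` does to its input, as a hedged sketch: the standalone function below mirrors the steps in the method added by this hunk (it is a rewrite for illustration, not the repo's code) so the behavior can be checked in isolation. The sample strings are made up.

```python
import unicodedata


def cleanup_text(text):
    """Join lines, replace typographic punctuation, and reduce to ASCII."""
    x = " ".join(line.strip() for line in text.split("\n")).strip()
    x = x.replace("\u201c", '"').replace("\u201d", '"')
    x = x.replace("\u2018", "'").replace("\u2019", "'").replace("`", "'")
    x = x.replace("\u2026", "...").replace("\u2013", "-")
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with "ignore" then drops the marks, along with any
    # other character that has no ASCII form.
    return unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("ascii")


sample = "\u201cCaf\u00e9\u201d\u2026\n  na\u00efve \u2013 test "
print(cleanup_text(sample))
```

Because the final step discards every non-ASCII character, text in non-Latin scripts is removed entirely rather than transliterated. That may matter for the multi-language support mentioned in the changelog, depending on where this method is called.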


    def increment_edge (self, graph, node0, node1):
        """
        increment the weight for an edge between the two given nodes,
@@ -225,11 +247,11 @@ def text_rank (self, doc):

tr = TextRank(logger=None)

start = time.time()
t0 = time.time()
phrase_iter = tr.text_rank(doc)
end = time.time()
t1 = time.time()

for phrase, rank, count in phrase_iter:
print("{:.4f} {:5d} {}".format(rank, count, phrase))

print("\nelapsed time: {} ms".format((end - start) * 1000))
print("\nelapsed time: {} ms".format((t1 - t0) * 1000))
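
One caveat about the timing code in this hunk, offered as a sketch under an explicit assumption: the body of `text_rank` is not visible in this diff, so whether it returns a finished list or a lazy generator is unknown. If it is a generator, the second timestamp is taken before any ranking work runs. The stand-in function `fake_text_rank` below is hypothetical and exists only to show the effect.

```python
import time

work_log = []  # records when the (hypothetical) ranking work actually runs


def fake_text_rank(doc):
    """Hypothetical stand-in for `TextRank.text_rank`, written as a generator."""
    work_log.append("ranking started")
    for rank, phrase in enumerate(doc.split()):
        yield phrase, rank


t0 = time.perf_counter()
phrase_iter = fake_text_rank("minimal generating sets")
ran_before_consuming = len(work_log) > 0  # False: calling a generator runs nothing
phrases = list(phrase_iter)               # consuming it forces the work to happen
t1 = time.perf_counter()

print("elapsed time: {:.3f} ms".format((t1 - t0) * 1000))
```

If the real method is lazy, moving the second timestamp after the loop that consumes `phrase_iter` (or wrapping the call in `list(...)`) would give a more meaningful measurement.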