WIP spaCy extension

DerwenAI · Nov 2, 2019 · 2e8b7d7
1 parent 751c1a2
commit 2e8b7d7

Showing 17 changed files with 199 additions and 432 deletions.
diff --git a/LICENSE.md b/LICENSE
@@ -1,6 +1,6 @@
-[MIT License](https://spdx.org/licenses/MIT.html)
+MIT License
 
-Copyright (c) 2016 [Paco Xander Nathan](https://derwen.ai/paco)
+Copyright (c) 2016 Paco Xander Nathan
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -19,3 +19,6 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
+
+https://spdx.org/licenses/MIT.html
+https://derwen.ai/paco
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,4 +1,3 @@
-include README.rst
+include README.md
 include changelog.txt
-include pytextrank/stop.txt
 
diff --git a/README.md b/README.md
@@ -0,0 +1,105 @@
+# PyTextRank
+
+*PyTextRank* is a Python implementation of *TextRank* as a
+[spaCy extension](https://explosion.ai/blog/spacy-v2-pipelines-extensions),
+for working with text documents to:
+
+  - extract the top-ranked phrases
+  - run extractive summarization
+
+This work is based on the paper:
+
+  - ["TextRank: Bringing Order into Texts"](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)  
+[**Rada Mihalcea**](https://web.eecs.umich.edu/~mihalcea/), 
+[**Paul Tarau**](https://www.cse.unt.edu/~tarau/);  
+[*Empirical Methods in Natural Language Processing*](https://www.researchgate.net/publication/200044196_TextRank_Bringing_Order_into_Texts)  
+(2004)
+
+Several modifications improve on the algorithm originally described in the paper:
+
+  - fixed bug; see [Java impl, 2008](https://github.com/ceteri/textrank)
+  - uses *lemmatization* in place of stemming
+  - includes verbs in the graph, but not in resulting phrases
+  - leverages preprocessing based on *noun chunking* and *named entity recognition*
+  - provides *extractive summarization* based on vectors of ranked
+    phrases
+  - allows use of a *knowledge graph* for enriching the lemma graph and subsequent phrase extraction and summarization
+
+This implementation was inspired by the
+[Williams 2016](http://mike.place/2016/summarization/)
+talk on text summarization.
+
+
+## Installation
+
+Prerequisites:
+
+- [Python 3.x](https://www.python.org/downloads/)
+- [spaCy](https://spacy.io/docs/usage/)
+- [NetworkX](http://networkx.readthedocs.io/)
+
+To install from [PyPI](https://pypi.python.org/pypi/pytextrank):
+
+```
+pip install pytextrank
+```
+
+If you install directly from this Git repo, be sure to install the dependencies as well:
+
+```
+pip install -r requirements.txt
+```
+
+
+## Usage
+
+For example usage, see the 
+[PyTextRank wiki](https://github.com/DerwenAI/pytextrank/wiki).
+If you need to troubleshoot any problems:
+
+  - use [GitHub issues](https://github.com/DerwenAI/pytextrank/issues)
+    (recommended)
+  - search [related discussions on StackOverflow](https://stackoverflow.com/search?q=pytextrank)
+
+For course materials and training, please check for calendar updates in
+the article
+["Natural Language Processing in Python"](https://medium.com/derwen/natural-language-processing-in-python-832b0a99791b).
+
+Let us know if you find this useful, tell us about use cases,
+describe what else you would like to see integrated, etc.
+If you have questions about related consulting work in natural language, machine learning, knowledge graph, or other AI applications, contact 
+[Derwen, Inc.](https://derwen.ai/contact)
+
+
+## Attribution
+
+*PyTextRank* has an [MIT](https://spdx.org/licenses/MIT.html) license,
+which is succinct and simplifies use in commercial applications.
+
+Please use the following BibTeX entry for citing *PyTextRank* in publications:
+
+```
+@Misc{PyTextRank,
+    author = {Nathan, Paco},
+    title = {PyTextRank, a Python implementation of TextRank for text document NLP parsing and summarization},
+    howpublished = {\url{https://github.com/DerwenAI/pytextrank/}},
+    year = {2016}
+}
+```
+
+
+## Kudos
+
+Many thanks to contributors:
+[@htmartin](https://github.com/htmartin),
+[@williamsmj](https://github.com/williamsmj/),
+[@mattkohl](https://github.com/mattkohl),
+[@vanita5](https://github.com/vanita5),
+[@HarshGrandeur](https://github.com/HarshGrandeur),
+[@mnowotka](https://github.com/mnowotka),
+[@kjam](https://github.com/kjam),
+[@dvsrepo](https://github.com/dvsrepo),
+[@SaiThejeshwar](https://github.com/SaiThejeshwar),
+[@laxatives](https://github.com/laxatives),
+[@dimmu](https://github.com/dimmu), 
+and for support from [Derwen, Inc.](https://derwen.ai/)
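
Reviewer note: the README above describes ranking lemmas in a graph, and `pipe.py` in this commit defines `_EDGE_WEIGHT`, `_POS_KEPT`, and `_TOKEN_LOOKBACK`. The following is a minimal, dependency-free sketch of that idea, not the code in this commit. It assumes an undirected graph, a lookback window counted over all tokens, and a plain power-iteration PageRank; the function names `build_lemma_graph` and `pagerank`, the damping factor, and the hand-tagged sample input are all hypothetical. The library itself relies on spaCy for tagging and NetworkX for the graph.

```python
from collections import defaultdict

EDGE_WEIGHT = 1.0      # mirrors TextRank._EDGE_WEIGHT in pipe.py
TOKEN_LOOKBACK = 3     # mirrors TextRank._TOKEN_LOOKBACK in pipe.py
POS_KEPT = {"ADJ", "NOUN", "PROPN", "VERB"}  # mirrors TextRank._POS_KEPT


def build_lemma_graph(tagged):
    """Link each kept lemma to kept lemmas among the preceding tokens.

    `tagged` is a list of (lemma, part-of-speech) pairs; here it is
    hand-written, whereas the library would get it from a spaCy doc.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, (lemma, pos) in enumerate(tagged):
        if pos not in POS_KEPT:
            continue
        graph[lemma]  # touch the key so isolated lemmas still become nodes
        for prev_lemma, prev_pos in tagged[max(0, i - TOKEN_LOOKBACK):i]:
            if prev_pos in POS_KEPT and prev_lemma != lemma:
                # assumption: undirected edges, weight incremented per co-occurrence
                graph[lemma][prev_lemma] += EDGE_WEIGHT
                graph[prev_lemma][lemma] += EDGE_WEIGHT
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (isolated nodes are not redistributed)."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # each neighbor m passes on its rank in proportion to the edge weight
            incoming = sum(
                rank[m] * w / sum(graph[m].values())
                for m, w in graph[n].items()
            )
            new_rank[n] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


tagged = [
    ("compatibility", "NOUN"), ("of", "ADP"), ("system", "NOUN"),
    ("of", "ADP"), ("linear", "ADJ"), ("constraint", "NOUN"),
]
ranks = pagerank(build_lemma_graph(tagged))
for lemma, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print("{:.4f}  {}".format(score, lemma))
```

In this toy input the best-connected lemma receives the highest score. Assembling ranked lemmas into phrases through noun chunks and named entities, as the README describes, is outside the scope of this sketch.
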
diff --git a/README.rst b/README.rst
diff --git a/example.ipynb b/archive/example.ipynb
diff --git a/changelog.txt b/changelog.txt
@@ -1,5 +1,17 @@
 # pytextrank changelog
 
+## 2.0.0
+
+2019-11-17
+
+### Improved
+
+  * refactored library to run as a spaCy extension
+  * supports multiple languages
+  * significantly faster, with less memory required
+  * better extraction of top-ranked phrases
+  * WIP toward integration with knowledge graph use cases
+
 ## 1.2.1
 
 2019-11-01
@@ -14,7 +26,7 @@
 
 ### Improved
 
- * updated to fix for current versions of `spaCy` and `networkX` -- kudos @dimmu
+ * updated to fix for current versions of `spaCy` and `NetworkX` -- kudos @dimmu
  * removed deprecated argument -- kudos @laxatives
 
 ## 1.1.1
@@ -23,8 +35,8 @@
 
 ### Improved
 
-  * Patch disables use of NER in spaCy until an intermittent bug is resolved.
-  * Will probably replace named tuples with spaCy spans instead.
+  * patch disables use of NER in spaCy until an intermittent bug is resolved.
+  * will probably replace named tuples with spaCy spans instead.
 
 ## 1.1.0
 

diff --git a/docs/conf.py b/docs/conf.py
@@ -53,17 +53,17 @@
 
 # General information about the project.
 project = 'PyTextRank'
-copyright = '2017, Paco Nathan'
-author = 'Paco Nathan'
+copyright = '2016, Paco Xander Nathan'
+author = 'Paco Xander Nathan'
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
 # built documents.
 #
 # The short X.Y version.
-version = '1.0'
+version = '2.0'
 # The full version, including alpha/beta/rc tags.
-release = '1.0.1'
+release = '2.0.0'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

diff --git a/Text_Rank.ipynb b/explain.ipynb
diff --git a/pipe.py b/pipe.py
@@ -8,15 +8,14 @@
 import spacy
 import sys
 import time
+import unicodedata
 
 
 class TextRank:
     """
-    Python implementation of TextRank by Milhacea, et al.,
-    as a spaCy extension, used to extract the top-ranked
-    phrases from a text document.
+    Python impl of TextRank by Mihalcea, et al., as a spaCy extension,
+    used to extract the top-ranked phrases from a text document
     """
-
     _EDGE_WEIGHT = 1.0
     _POS_KEPT = ["ADJ", "NOUN", "PROPN", "VERB"]
     _TOKEN_LOOKBACK = 3
@@ -41,6 +40,29 @@ def reset (self):
         self.seen_lemma = {}
 
 
+    @classmethod
+    def cleanup_text (cls, text):
+        """
+        it scrubs the garble from its stream...
+        or it gets the debugger again
+        """
+        x = " ".join(map(lambda s: s.strip(), text.split("\n"))).strip()
+
+        x = x.replace('“', '"').replace('”', '"')
+        x = x.replace("‘", "'").replace("’", "'").replace("`", "'")
+        x = x.replace("…", "...").replace("–", "-")
+
+        x = str(unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("ascii"))
+
+        # some content returns text in bytes rather than as a str ?
+        try:
+            assert type(x).__name__ == "str"
+        except AssertionError:
+            print("not a string?", type(x), x)
+
+        return x
+
+
     def increment_edge (self, graph, node0, node1):
         """
         increment the weight for an edge between the two given nodes,
@@ -225,11 +247,11 @@ def text_rank (self, doc):
 
     tr = TextRank(logger=None)
 
-    start = time.time()
+    t0 = time.time()
     phrase_iter = tr.text_rank(doc)
-    end = time.time()
+    t1 = time.time()
 
     for phrase, rank, count in phrase_iter:
         print("{:.4f} {:5d}  {}".format(rank, count, phrase))
 
-    print("\nelapsed time: {} ms".format((end - start) * 1000))
+    print("\nelapsed time: {} ms".format((t1 - t0) * 1000))
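
Reviewer note: the `cleanup_text` helper added to `pipe.py` in this commit can be exercised on its own, without spaCy. The sketch below restates the same steps as a standalone function using only the standard library; the sample string is made up for illustration, and the function is written at module level here rather than as a classmethod.

```python
import unicodedata


def cleanup_text(text: str) -> str:
    """Collapse line breaks and fold common typographic characters to ASCII."""
    # join the stripped lines with single spaces
    x = " ".join(line.strip() for line in text.split("\n")).strip()

    # map curly quotes, backticks, ellipses, and en dashes to ASCII equivalents
    x = x.replace("\u201c", '"').replace("\u201d", '"')
    x = x.replace("\u2018", "'").replace("\u2019", "'").replace("`", "'")
    x = x.replace("\u2026", "...").replace("\u2013", "-")

    # decompose accented characters, then drop anything still outside ASCII
    return unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("ascii")


# hypothetical sample with curly quotes, accents, an en dash, and a line break
sample = "\u201cCaf\u00e9 society\u201d \u2013 it\u2019s\n  a na\u00efve r\u00e9sum\u00e9\u2026"
cleaned = cleanup_text(sample)
print(cleaned)
```

The explicit replacements run before the NFKD step because curly quotes and en dashes have no ASCII decomposition, so the final `encode("ascii", "ignore")` would otherwise drop them silently. The same step also discards any non-Latin text, which is worth keeping in mind given the changelog's note about multi-language support.
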