
Inclusion of the TopicRank Keyphrase Extraction algorithm #208

Merged: 14 commits into DerwenAI:main on Mar 6, 2022

Conversation

@tomaarsen (Contributor) commented on Feb 6, 2022

Hello!
As promised: a beefier PR.

Pull request overview

  • Added an implementation of the TopicRank algorithm, including a TopicRank and TopicRankFactory class.
  • Added tests for TopicRank
  • Modified CI tests to run for Python 3.7 through 3.10, rather than just Python 3.7

Details on TopicRank

TopicRank was mentioned in the useful discussion in #174 by @BrandonKMLee and @louisguitton, and was originally introduced by Bougouin et al. (2013). It focuses on keyphrase extraction, and relies on TextRank as a part of the overall algorithm. The paper is quite interesting, so I would recommend looking at it, but I'll quickly summarize the algorithm it presents (copied from my comment in the code):

  1. Preprocessing: Sentence segmentation, word tokenization, POS tagging.
    After this stage, we have preprocessed text.
  2. Candidate extraction: Extract sequences of nouns and adjectives (i.e. noun chunks)
    After this stage, we have a list of keyphrases that may be topics.
  3. Candidate clustering: Hierarchical Agglomerative Clustering with average
    linkage, using simple set-based overlap of lemmas. Two candidates count as
    similar at > 25% overlap. Note: PyTextRank deviates from the original
    algorithm here, which uses stems rather than lemmas.
    After this stage, we have a list of topics.
  4. Candidate ranking: Apply TextRank on a complete graph, with topics as nodes
    (i.e. the clusters derived in the previous step), where edge weights are higher
    between topics that appear closer together within the document. (Steps 3 and 4
    are sketched in code just after this list.)
    After this stage, we have a ranked list of topics.
  5. Candidate selection: Select the first occurring keyphrase from each topic to
    represent that topic.
    After this stage, we have a ranked list of topics, with a keyphrase to represent
    the topic.
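Since clustering and ranking are the heart of the algorithm, here is a minimal, self-contained sketch of steps 3 and 4. This is not the code from this PR: the candidate lemma sets, the offsets, the overlap measure (Jaccard, standing in for the paper's set-based overlap), and the reciprocal-distance edge weights (following the paper) are illustrative assumptions.

# Minimal sketch of step 3 (clustering) and step 4 (ranking); illustrative only.
from itertools import combinations

import networkx as nx
from scipy.cluster.hierarchy import fcluster, linkage

# Step 3: candidates as lemma sets, plus assumed token offsets in the document.
candidates = [{"ion", "exchange"}, {"ion"}, {"mathematical", "model"}, {"model"}]
offsets = [[3], [7], [20], [1]]  # one list of positions per candidate

# Condensed pairwise distance matrix: 1 - Jaccard overlap of the lemma sets.
dist = [1.0 - len(a & b) / len(a | b) for a, b in combinations(candidates, 2)]

# Average-linkage HAC; candidates merge while similarity > 25%,
# i.e. while the cophenetic distance stays <= 0.75.
Z = linkage(dist, method="average")
labels = fcluster(Z, t=0.75, criterion="distance")

# Step 4: complete graph over topics (clusters), with edge weights from the
# reciprocal distances between the offsets of the candidates in each topic.
topics = {}
for idx, label in enumerate(labels):
    topics.setdefault(label, []).extend(offsets[idx])

graph = nx.Graph()
for (t1, pos1), (t2, pos2) in combinations(topics.items(), 2):
    weight = sum(1.0 / abs(p1 - p2) for p1 in pos1 for p2 in pos2)
    graph.add_edge(t1, t2, weight=weight)

# TextRank here boils down to weighted PageRank over the topic graph.
ranks = nx.pagerank(graph, weight="weight")
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))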

Sample usage & output

The TopicRank algorithm implemented in this PR can be used in essentially the same way as TextRank:

import spacy
import pytextrank
from pprint import pprint

# Example text
text = "A mathematical model of ion exchange is considered, allowing for ion exchanger compression in the process of ion exchange. Two inverse problems are investigated for this model, unique solvability is proved, and numerical solution methods are proposed. The efficiency of the proposed methods is demonstrated by a numerical experiment."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("topicrank")
doc = nlp(text)

# examine the top-ranked phrases in the document
pprint(doc._.phrases)
print([phrase.text for phrase in doc._.phrases][:10])
tr = doc._.textrank
print(tr.elapsed_time)

This outputs:

[Phrase(text='ion exchange', chunks=[ion exchange, ion, ion exchange], count=3, rank=0.2164760626537704),
 Phrase(text='mathematical model', chunks=[mathematical model, model], count=2, rank=0.13294553955872498),
 Phrase(text='exchanger compression', chunks=[exchanger compression], count=1, rank=0.11752875572953524),
 Phrase(text='process', chunks=[process], count=1, rank=0.0976304878191955),
 Phrase(text='unique solvability', chunks=[unique solvability], count=1, rank=0.08867195618809895),
 Phrase(text='inverse problems', chunks=[inverse problems], count=1, rank=0.07980296596112096),
 Phrase(text='efficiency', chunks=[efficiency], count=1, rank=0.07445298452109365),
 Phrase(text='proposed methods', chunks=[proposed methods], count=1, rank=0.07279118674022787),
 Phrase(text='numerical solution methods', chunks=[numerical solution methods], count=1, rank=0.06825226658665143),
 Phrase(text='numerical experiment', chunks=[numerical experiment], count=1, rank=0.05144779424158112)]
['ion exchange', 'mathematical model', 'exchanger compression', 'process', 'unique solvability', 'inverse problems', 'efficiency', 'proposed methods', 'numerical solution methods', 'numerical experiment']
2.8972625732421875

For comparison, the output of the original https://github.com/adrien-bougouin/KeyBench/ repository, given the same input, is as follows:

(u'ion exchange', 1.6957436822957042)
(u'mathematical model', 1.32419391897424)
(u'numerical solution methods', 1.1483448063357846)
(u'process', 0.9181346427759802)
(u'unique solvability', 0.8265677195114645)
(u'inverse problems', 0.7989594740490582)
(u'efficiency', 0.7225318465443265)
(u'numerical experiment', 0.5631577922037418)

The same ranking is also produced by pke, a library by the same author:

['ion exchange', 'mathematical model', 'numerical solution methods', 'process', 'unique solvability', 'inverse problems', 'efficiency', 'numerical experiment']

Full transparency: as you can see, there are some small differences between this TopicRank implementation and the originals, e.g. numerical solution methods ranks lower, and exchanger compression ranks higher. I can invest a bit more time to debug this, but I suspect it is caused by a differing way of finding potential candidates and by differing TextRank parameters.

Lastly, for context, replacing nlp.add_pipe("topicrank") with nlp.add_pipe("textrank") gives the following results:

[Phrase(text='ion exchange', chunks=[ion exchange, ion exchange], count=2, rank=0.2217093467665982),
 Phrase(text='numerical solution methods', chunks=[numerical solution methods], count=1, rank=0.18495219543002075),
 Phrase(text='ion', chunks=[ion], count=1, rank=0.18185672562374786),
 Phrase(text='exchanger compression', chunks=[exchanger compression], count=1, rank=0.16266971703534683),
 Phrase(text='unique solvability', chunks=[unique solvability], count=1, rank=0.0960861890678228),
 Phrase(text='a numerical experiment', chunks=[a numerical experiment], count=1, rank=0.0916401771972592),
 Phrase(text='the proposed methods', chunks=[the proposed methods], count=1, rank=0.08875872150025549),
 Phrase(text='the process', chunks=[the process], count=1, rank=0.05602421471187259),
 Phrase(text='A mathematical model', chunks=[A mathematical model], count=1, rank=0.04383340352811856),
 Phrase(text='Two inverse problems', chunks=[Two inverse problems], count=1, rank=0.03593743181662447),
 Phrase(text='this model', chunks=[this model], count=1, rank=0.03401330317595105),
 Phrase(text='The efficiency', chunks=[The efficiency], count=1, rank=0.021969568492358083),
 Phrase(text='Two', chunks=[Two], count=1, rank=0.0)]
['ion exchange', 'numerical solution methods', 'ion', 'exchanger compression', 'unique solvability', 'a numerical experiment', 'the proposed methods', 'the process', 'A mathematical model', 'Two inverse problems']
3.9992332458496094

Implementation details

New default config

pytextrank/__init__.py has been given a new configuration dict: _TOPIC_DEFAULT_CONFIG. This configuration dict is simply an extension of the existing _DEFAULT_CONFIG, with the threshold and method attributes added. In this file, a component factory for the TopicRank class is added.
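For reference, a sketch of what that dict might look like, assuming the added keys simply sit alongside the existing defaults (only threshold and method are additions described by this PR; everything else is inherited from _DEFAULT_CONFIG):

# Sketch only; assumes _TOPIC_DEFAULT_CONFIG simply extends _DEFAULT_CONFIG.
_TOPIC_DEFAULT_CONFIG = {
    **_DEFAULT_CONFIG,    # existing PyTextRank defaults
    "threshold": 0.25,    # minimum lemma-overlap similarity for clustering
    "method": "average",  # linkage method for scipy.cluster.hierarchy.linkage
}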

New classes

Both a TopicRankFactory and TopicRank class were implemented, following the design structure of the previously created classes and factories, with the addition of the aforementioned threshold and method attributes.

New attributes

The threshold attribute is a float between 0 and 1, used in the TopicRank candidate clustering algorithm to specify the minimum similarity between two keyphrases for them to be eligible for clustering. The original paper uses a value of 0.25 (i.e. 25% overlap of lemmas), and so this PR also uses 0.25 as a default.

The method attribute is the string name of a clustering method in candidate clustering. This PR relies on scipy.cluster.hierarchy.linkage for the hierarchical agglomerative clustering of candidates, and allows users of PTR to specify which clustering method to use. Again, we use the clustering method from the paper as default: "average".
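Both attributes can be overridden through the component config when adding the pipe; the values below are purely illustrative:

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")

# Override the clustering threshold and linkage method via the component config.
nlp.add_pipe("topicrank", config={"threshold": 0.30, "method": "complete"})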

New methods

The TopicRank class is a subclass of TextRank. (This may cause some confusion, as the TopicRank implementation reuses the lemma_graph object from TextRank, even though TopicRank builds topic graphs instead.) The implementation relies on overriding some key methods and properties:

  • node_list: Now returns a cached (!) list of clustered keyphrase candidates, instead of lemmas.
  • edge_list: Now returns a list representing edges for a complete, weighted graph between the clustered keyphrase candidates.
  • calc_textrank: Quite similar to the original, but skips some of the post-processing.
  • reset: Also remove the node_list caching.

And two new methods: _get_candidates and _cluster, which perform steps 2 and 3 of the TopicRank algorithm, respectively (candidate extraction & candidate clustering). Note that _get_candidates is almost just self.doc.noun_chunks, except that it strips undesired (i.e. not self._keep_token(token)) tokens from the candidate keyphrase; a rough sketch of that idea follows.
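This is not the PR's actual code; keep_token stands in for self._keep_token, and only leading tokens are stripped, matching the behaviour discussed under Quirks below:

# Hypothetical sketch of candidate extraction: take each noun chunk and strip
# leading tokens that fail the keep-token test (e.g. determiners like "The").
def get_candidates(doc, keep_token):
    for chunk in doc.noun_chunks:
        tokens = list(chunk)
        while tokens and not keep_token(tokens[0]):
            tokens.pop(0)
        if tokens:
            yield tokens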

Note that this PR relies on scipy and networkx, both of which were already dependencies, so this doesn't introduce any additional ones.

Tests

The tests are based on test_base.py, but adapted for TopicRank instead. However, some tests for different threshold and method values are still missing.

Other changes

conftest.py changes

Each of the fixtures is now limited to module scope, as the nlp.add_pipe calls from certain tests were preserved across all tests, which led to undesired behaviour (i.e. multiple instances of PTR in the pipeline).
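For illustration, the kind of change involved; a simplified sketch, not the exact fixtures from conftest.py:

import pytest
import spacy

# scope="module" gives each test module its own pipeline, so an nlp.add_pipe(...)
# call in one test module no longer leaks into the others.
@pytest.fixture(scope="module")
def nlp():
    return spacy.load("en_core_web_sm")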

CI

The continuous integration was updated in this PR. I understand that this is somewhat separate from the PR itself, but I felt that the additional tests help give confidence in the correctness. In short, the CI now first calls pre-commit, and only if that passes will it run pytest for Python 3.7, 3.8, 3.9 and 3.10.

Quirks

There are some implementation details that might warrant discussion, and that could be implemented differently depending on preferences. I'll mention them here, and I'd welcome your thoughts on them.

  • The scrubber attribute is used once, right here: https://github.com/tomaarsen/pytextrank/blob/df5833e463160c740f8d4a920129070e26756950/pytextrank/topicrank.py#L311-L331
    The scrubber is applied only to the Phrase text, i.e. the keyphrase that "represents" the phrase as a whole. An alternative is to apply the scrubber to all keyphrase candidates.
  • In _get_candidates, self._keep_token is only called on the first tokens of a noun chunk, potentially removing those tokens from the noun chunk. If there is a token that we wish to remove in the middle or at the end of a noun chunk, it won't be removed. In practice this won't happen often, as the check is mostly meant to remove "The" from "The system" etc., but perhaps the stricter behaviour would be preferred? The current implementation is simpler, however, as it doesn't require splitting noun chunks into multiple keyphrase candidates when there is an undesirable token in the middle. There is also an argument to be made that we usually want to remove certain tokens, but perhaps not when they occur in the middle of an interesting noun chunk (i.e. that would be the current implementation).
  • Because TopicRank extends TextRank, but there are some definite differences between them, the inherited token_lookback attribute of TopicRank is unused. Perhaps a warning would be proper if a non-default value is used here (a possible sketch follows this list)? Alternatively, we could somehow remove this attribute from the TopicRank implementation.
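For that last point, a purely hypothetical sketch of such a warning; it assumes token_lookback is present in _TOPIC_DEFAULT_CONFIG, which is not confirmed by this PR:

import warnings

# Hypothetical: warn when TopicRank is configured with a non-default
# token_lookback, since the value has no effect on the topic graph.
if token_lookback != _TOPIC_DEFAULT_CONFIG["token_lookback"]:
    warnings.warn("TopicRank does not use token_lookback; the value is ignored.")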

What now?

There are still some TODOs; here are a few:

  • As mentioned, I'd still like to add a test or two for threshold and method.
  • On line 88 of topicrank.py, there is still a TODO. In e.g. biasedrank.py, the corresponding place contains a link to https://derwen.ai/docs/ptr/biblio/#kazemi-etal-2020-biased, with the citation for that work. I feel it would be proper to add a citation in the documentation, with a link from the code, for TopicRank too. However, I haven't done this as of now.
  • If we wish, there can be some specific documentation for TopicRank in PTR, although I don't think this has the highest priority. After all, to the end user it'll work almost the same as any of the other ranking components regardless.

I'd love to hear your thoughts, and if you have any questions or comments, let me know (either here or via Slack). I also have a question for you: is there a specific formatter you use that I should adopt for this work?

  • Tom Aarsen

@tomaarsen (Contributor, Author) commented:

Friendly ping on this! @ceteri

@ceteri (Collaborator) left a comment:


Outstanding! Many thanks @tomaarsen

We'll need to add more about TopicRank in the docs plus a notebook example.
The improvements on testing look great!

@ceteri merged commit 0d78d12 into DerwenAI:main on Mar 6, 2022