
Inclusion of the TopicRank Keyphrase Extraction algorithm #208

Merged: 14 commits into DerwenAI:main on Mar 6, 2022

Conversation

@tomaarsen (Contributor) commented on Feb 6, 2022

Hello!
As promised: a beefier PR.

Pull request overview

  • Added an implementation of the TopicRank algorithm, including a TopicRank and TopicRankFactory class.
  • Added tests for TopicRank
  • Modified CI tests to run for Python 3.7 through 3.10, rather than just Python 3.7

Details on TopicRank

TopicRank was mentioned in the useful discussion in #174 by @BrandonKMLee and @louisguitton, and was originally introduced by Bougouin et al. (2013). It focuses on keyphrase extraction, and relies on TextRank as a part of the overall algorithm. The paper is quite interesting, so I would recommend looking at it, but I'll quickly summarize the algorithm it presents (copied from my comment in the code):

  1. Preprocessing: Sentence segmentation, word tokenization, POS tagging.
    After this stage, we have preprocessed text.
  2. Candidate extraction: Extract sequences of nouns and adjectives (i.e. noun chunks)
    After this stage, we have a list of keyphrases that may be topics.
  3. Candidate clustering: Hierarchical Agglomerative Clustering with average
    linkage, using simple set-based overlap of lemmas. Two candidates count as
    similar at > 25% overlap. Note: PyTextRank deviates from the original
    algorithm here, which uses stems rather than lemmas.
    After this stage, we have a list of topics.
  4. Candidate ranking: Apply TextRank on a complete graph, with topics as nodes
    (i.e. the clusters derived in the previous step), where edge weights are higher
    between topics that appear closer together within the document. (Steps 3 and 4
    are sketched in code just after this list.)
    After this stage, we have a ranked list of topics.
  5. Candidate selection: Select the first occurring keyphrase from each topic to
    represent that topic.
    After this stage, we have a ranked list of topics, with a keyphrase to represent
    the topic.
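Since clustering and ranking are the heart of the algorithm, here is a minimal, self-contained sketch of steps 3 and 4. This is not the code from this PR: the candidate lemma sets, the offsets, the overlap measure (Jaccard, standing in for the paper's set-based overlap), and the reciprocal-distance edge weights (following the paper) are illustrative assumptions.

# Minimal sketch of step 3 (clustering) and step 4 (ranking); illustrative only.
from itertools import combinations

import networkx as nx
from scipy.cluster.hierarchy import fcluster, linkage

# Step 3: candidates as lemma sets, plus assumed token offsets in the document.
candidates = [{"ion", "exchange"}, {"ion"}, {"mathematical", "model"}, {"model"}]
offsets = [[3], [7], [20], [1]]  # one list of positions per candidate

# Condensed pairwise distance matrix: 1 - Jaccard overlap of the lemma sets.
dist = [1.0 - len(a & b) / len(a | b) for a, b in combinations(candidates, 2)]

# Average-linkage HAC; candidates merge while similarity > 25%,
# i.e. while the cophenetic distance stays <= 0.75.
Z = linkage(dist, method="average")
labels = fcluster(Z, t=0.75, criterion="distance")

# Step 4: complete graph over topics (clusters), with edge weights from the
# reciprocal distances between the offsets of the candidates in each topic.
topics = {}
for idx, label in enumerate(labels):
    topics.setdefault(label, []).extend(offsets[idx])

graph = nx.Graph()
for (t1, pos1), (t2, pos2) in combinations(topics.items(), 2):
    weight = sum(1.0 / abs(p1 - p2) for p1 in pos1 for p2 in pos2)
    graph.add_edge(t1, t2, weight=weight)

# TextRank here boils down to weighted PageRank over the topic graph.
ranks = nx.pagerank(graph, weight="weight")
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))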

Sample usage & output

The TopicRank algorithm implemented in this PR can be used in essentially the same way as TextRank:

import spacy
import pytextrank
from pprint import pprint

# Example text
text = "A mathematical model of ion exchange is considered, allowing for ion exchanger compression in the process of ion exchange. Two inverse problems are investigated for this model, unique solvability is proved, and numerical solution methods are proposed. The efficiency of the proposed methods is demonstrated by a numerical experiment."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("topicrank")
doc = nlp(text)

# examine the top-ranked phrases in the document
pprint(doc._.phrases)
print([phrase.text for phrase in doc._.phrases][:10])
tr = doc._.textrank
print(tr.elapsed_time)

This outputs:

[Phrase(text='ion exchange', chunks=[ion exchange, ion, ion exchange], count=3, rank=0.2164760626537704),
 Phrase(text='mathematical model', chunks=[mathematical model, model], count=2, rank=0.13294553955872498),
 Phrase(text='exchanger compression', chunks=[exchanger compression], count=1, rank=0.11752875572953524),
 Phrase(text='process', chunks=[process], count=1, rank=0.0976304878191955),
 Phrase(text='unique solvability', chunks=[unique solvability], count=1, rank=0.08867195618809895),
 Phrase(text='inverse problems', chunks=[inverse problems], count=1, rank=0.07980296596112096),
 Phrase(text='efficiency', chunks=[efficiency], count=1, rank=0.07445298452109365),
 Phrase(text='proposed methods', chunks=[proposed methods], count=1, rank=0.07279118674022787),
 Phrase(text='numerical solution methods', chunks=[numerical solution methods], count=1, rank=0.06825226658665143),
 Phrase(text='numerical experiment', chunks=[numerical experiment], count=1, rank=0.05144779424158112)]
['ion exchange', 'mathematical model', 'exchanger compression', 'process', 'unique solvability', 'inverse problems', 'efficiency', 'proposed methods', 'numerical solution methods', 'numerical experiment']
2.8972625732421875

For comparison, the output of the original https://github.com/adrien-bougouin/KeyBench/ repository, given the same input, is as follows:

(u'ion exchange', 1.6957436822957042)
(u'mathematical model', 1.32419391897424)
(u'numerical solution methods', 1.1483448063357846)
(u'process', 0.9181346427759802)
(u'unique solvability', 0.8265677195114645)
(u'inverse problems', 0.7989594740490582)
(u'efficiency', 0.7225318465443265)
(u'numerical experiment', 0.5631577922037418)

The same ranking is also produced by pke, a library by the same author:

['ion exchange', 'mathematical model', 'numerical solution methods', 'process', 'unique solvability', 'inverse problems', 'efficiency', 'numerical experiment']

Full transparency: as you can see, there are some small differences between this TopicRank implementation and the originals, e.g. numerical solution methods ranks lower, and exchanger compression ranks higher. I can invest a bit more time to debug this, but I suspect it is caused by a differing way of finding potential candidates and by differing TextRank parameters.

Lastly, for context, replacing nlp.add_pipe("topicrank") with nlp.add_pipe("textrank") gives the following results:

[Phrase(text='ion exchange', chunks=[ion exchange, ion exchange], count=2, rank=0.2217093467665982),
 Phrase(text='numerical solution methods', chunks=[numerical solution methods], count=1, rank=0.18495219543002075),
 Phrase(text='ion', chunks=[ion], count=1, rank=0.18185672562374786),
 Phrase(text='exchanger compression', chunks=[exchanger compression], count=1, rank=0.16266971703534683),
 Phrase(text='unique solvability', chunks=[unique solvability], count=1, rank=0.0960861890678228),
 Phrase(text='a numerical experiment', chunks=[a numerical experiment], count=1, rank=0.0916401771972592),
 Phrase(text='the proposed methods', chunks=[the proposed methods], count=1, rank=0.08875872150025549),
 Phrase(text='the process', chunks=[the process], count=1, rank=0.05602421471187259),
 Phrase(text='A mathematical model', chunks=[A mathematical model], count=1, rank=0.04383340352811856),
 Phrase(text='Two inverse problems', chunks=[Two inverse problems], count=1, rank=0.03593743181662447),
 Phrase(text='this model', chunks=[this model], count=1, rank=0.03401330317595105),
 Phrase(text='The efficiency', chunks=[The efficiency], count=1, rank=0.021969568492358083),
 Phrase(text='Two', chunks=[Two], count=1, rank=0.0)]
['ion exchange', 'numerical solution methods', 'ion', 'exchanger compression', 'unique solvability', 'a numerical experiment', 'the proposed methods', 'the process', 'A mathematical model', 'Two inverse problems']
3.9992332458496094

Implementation details

New default config

pytextrank/__init__.py has been given a new configuration dict: _TOPIC_DEFAULT_CONFIG. This configuration dict is simply an extension of the existing _DEFAULT_CONFIG, with the threshold and method attributes added. In this file, a component factory for the TopicRank class is added.
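For reference, a sketch of what that dict might look like, assuming the added keys simply sit alongside the existing defaults (only threshold and method are additions described by this PR; everything else is inherited from _DEFAULT_CONFIG):

# Sketch only; assumes _TOPIC_DEFAULT_CONFIG simply extends _DEFAULT_CONFIG.
_TOPIC_DEFAULT_CONFIG = {
    **_DEFAULT_CONFIG,    # existing PyTextRank defaults
    "threshold": 0.25,    # minimum lemma-overlap similarity for clustering
    "method": "average",  # linkage method for scipy.cluster.hierarchy.linkage
}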

New classes

Both a TopicRankFactory and TopicRank class were implemented, following the design structure of the previously created classes and factories, with the addition of the aforementioned threshold and method attributes.

New attributes

The threshold attribute is a float between 0 and 1, used in the TopicRank candidate clustering algorithm to specify the minimum similarity between two keyphrases for them to be eligible for clustering. The original paper uses a value of 0.25 (i.e. 25% overlap of lemmas), and so this PR also uses 0.25 as a default.

The method attribute is the string name of a clustering method in candidate clustering. This PR relies on scipy.cluster.hierarchy.linkage for the hierarchical agglomerative clustering of candidates, and allows users of PTR to specify which clustering method to use. Again, we use the clustering method from the paper as default: "average".
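Both attributes can be overridden through the component config when adding the pipe; the values below are purely illustrative:

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")

# Override the clustering threshold and linkage method via the component config.
nlp.add_pipe("topicrank", config={"threshold": 0.30, "method": "complete"})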

New methods

The TopicRank class is a subclass of TextRank. (This may cause some confusion, as the TopicRank implementation reuses the lemma_graph object from TextRank, even though TopicRank builds topic graphs instead.) The implementation relies on overriding some key methods and properties:

  • node_list: Now returns a cached (!) list of clustered keyphrase candidates, instead of lemmas.
  • edge_list: Now returns a list representing edges for a complete, weighted graph between the clustered keyphrase candidates.
  • calc_textrank: Quite similar to the original, but skips some of the post-processing.
  • reset: Also remove the node_list caching.

And two new methods: _get_candidates and _cluster, which perform steps 2 and 3 of the TopicRank algorithm, respectively (candidate extraction & candidate clustering). Note that _get_candidates is almost just self.doc.noun_chunks, except that it strips undesired (i.e. not self._keep_token(token)) tokens from the candidate keyphrase; a rough sketch of that idea follows.
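This is not the PR's actual code; keep_token stands in for self._keep_token, and only leading tokens are stripped, matching the behaviour discussed under Quirks below:

# Hypothetical sketch of candidate extraction: take each noun chunk and strip
# leading tokens that fail the keep-token test (e.g. determiners like "The").
def get_candidates(doc, keep_token):
    for chunk in doc.noun_chunks:
        tokens = list(chunk)
        while tokens and not keep_token(tokens[0]):
            tokens.pop(0)
        if tokens:
            yield tokens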

Note that this PR relies on scipy and networkx, both of which were already dependencies, so this doesn't introduce any additional ones.

Tests

The tests are based on test_base.py, but adapted for TopicRank instead. However, some tests for different threshold and method values are still missing.

Other changes

conftest.py changes

Each of the fixtures is now limited to module scope, as the nlp.add_pipe calls from certain tests were preserved across all tests, which led to undesired behaviour (i.e. multiple instances of PTR in the pipeline).
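For illustration, the kind of change involved; a simplified sketch, not the exact fixtures from conftest.py:

import pytest
import spacy

# scope="module" gives each test module its own pipeline, so an nlp.add_pipe(...)
# call in one test module no longer leaks into the others.
@pytest.fixture(scope="module")
def nlp():
    return spacy.load("en_core_web_sm")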

CI

The continuous integration was updated in this PR. I understand that this is somewhat separate from the PR itself, but I felt that the additional tests help give confidence in the correctness. In short, the CI now first calls pre-commit, and only if that passes will it run pytest for Python 3.7, 3.8, 3.9 and 3.10.

Quirks

There are some implementation details that might warrant discussion, and that could be implemented differently depending on preferences. I'll mention them here, and I'd welcome your thoughts on them.

  • The scrubber attribute is used once, right here: https://github.com/tomaarsen/pytextrank/blob/df5833e463160c740f8d4a920129070e26756950/pytextrank/topicrank.py#L311-L331
    The scrubber is applied only to the Phrase text, i.e. the keyphrase that "represents" the phrase as a whole. An alternative is to apply the scrubber to all keyphrase candidates.
  • In _get_candidates, self._keep_token is only called on the first tokens of a noun chunk, potentially removing those tokens from the noun chunk. If there is a token that we wish to remove in the middle or at the end of a noun chunk, it won't be removed. In practice this won't happen often, as the check is mostly meant to remove "The" from "The system" etc., but perhaps the stricter behaviour would be preferred? The current implementation is simpler, however, as it doesn't require splitting noun chunks into multiple keyphrase candidates when there is an undesirable token in the middle. There is also an argument to be made that we usually want to remove certain tokens, but perhaps not when they occur in the middle of an interesting noun chunk (i.e. that would be the current implementation).
  • Because TopicRank extends TextRank, but there are some definite differences between them, the inherited token_lookback attribute of TopicRank is unused. Perhaps a warning would be proper if a non-default value is used here (a possible sketch follows this list)? Alternatively, we could somehow remove this attribute from the TopicRank implementation.
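For that last point, a purely hypothetical sketch of such a warning; it assumes token_lookback is present in _TOPIC_DEFAULT_CONFIG, which is not confirmed by this PR:

import warnings

# Hypothetical: warn when TopicRank is configured with a non-default
# token_lookback, since the value has no effect on the topic graph.
if token_lookback != _TOPIC_DEFAULT_CONFIG["token_lookback"]:
    warnings.warn("TopicRank does not use token_lookback; the value is ignored.")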

What now?

There are still some TODOs; here are a few:

  • As mentioned, I'd still like to add a test or two for threshold and method.
  • On line 88 of topicrank.py, there is still a TODO. In e.g. biasedrank.py, the corresponding place contains a link to https://derwen.ai/docs/ptr/biblio/#kazemi-etal-2020-biased, with the citation for that work. I feel it would be proper to add a citation in the documentation, with a link from the code, for TopicRank too. However, I haven't done this as of now.
  • If we wish, there can be some specific documentation for TopicRank in PTR, although I don't think this has the highest priority. After all, to the end user it'll work almost the same as any of the other ranking components regardless.

I'd love to hear your thoughts, and if you have any questions or comments, let me know (either here or via Slack). I also have a question for you: is there a specific formatter you use that I should adopt for this work?

  • Tom Aarsen

@tomaarsen (Contributor, Author) commented:

Friendly ping on this! @ceteri

@ceteri (Collaborator) left a comment:


Outstanding! Many thanks @tomaarsen

We'll need to add more about TopicRank in the docs plus a notebook example.
The improvements on testing look great!

@ceteri merged commit 0d78d12 into DerwenAI:main on Mar 6, 2022