In [None]:
!pip install git+https://github.com/boudinfl/pke.git
!pip install matplotlib
!python -m spacy download en_core_web_sm

Collecting git+https://github.com/boudinfl/pke.git
  Cloning https://github.com/boudinfl/pke.git to /tmp/pip-req-build-i2fn4ow2
  Running command git clone --filter=blob:none --quiet https://github.com/boudinfl/pke.git /tmp/pip-req-build-i2fn4ow2
  Resolved https://github.com/boudinfl/pke.git to commit 69871ffdb720b83df23684fea53ec8776fd87e63
  Preparing metadata (setup.py) ... done
Collecting unidecode (from pke==2.0.0)
  Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Building wheels for collected packages: pke
  Building wheel for pke (setup.py) ... done
  Created wheel for pke: filename=pke-2.0.0-py3-none-any.whl size=6160628 sha256=fe52caa9f280449623f92b41e4a7641c6af917b9df5956698fa82359a17dc0f4
  Stored in directory: /tmp/pip-ephem-wheel-cache-16oe23ka/wheels/8c/07/29/6b35bed2aa36e33d77ff3677eb716965ece4d2e56639ad0aab
Successfully

# Hands-on session with pke - part 1

This notebook gives a brief introduction to keyphrase extraction using `pke`, an open-source Python-based keyphrase extraction toolkit. `pke` provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models.

The overall architecture of `pke` is depicted in the figure below.
Extracting keyphrases from an input document involves three stages.
First, **keyphrase candidates** (i.e. words and phrases that are eligible to be keyphrases) are selected from the content of the document; this stage populates the `self.candidates` dictionary. Second, **candidates are either ranked** using a candidate weighting function (unsupervised approaches) **or classified as keyphrase or not** using a set of extracted features (supervised approaches); this stage populates the `self.weights` dictionary. Third, the top-N highest weighted candidates, or those classified as keyphrases with the highest confidence scores, are selected as keyphrases.

![pke_architecture.png](attachment:pke_architecture.png)

`pke` provides a standardized API for extracting keyphrases from a document:

```python
import pke

extractor = pke.unsupervised.TfIdf()                # initialize a keyphrase extraction model, here TFxIDF
extractor.load_document(input='text')               # load the content of the document  (str or spacy Doc)
extractor.candidate_selection()                     # identify keyphrase candidates
extractor.candidate_weighting()                     # weight keyphrase candidates
keyphrases = extractor.get_n_best(n=10)             # select the 10-best candidates as keyphrases
```
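To make the three stages concrete without relying on `pke` itself, here is a minimal pure-Python sketch. It is not how `pke` works internally: the stopword list, the function names and the frequency-based weighting are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "the", "of", "to", "and", "is", "in", "from"}

def select_candidates(text):
    """Stage 1: keep lowercase word tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def weight_candidates(candidates):
    """Stage 2: weight each candidate by its relative frequency."""
    counts = Counter(candidates)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def n_best(weights, n):
    """Stage 3: return the n highest-weighted candidates (ties broken alphabetically)."""
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

toy_text = "Keyphrase extraction selects keyphrase candidates and ranks the candidates."
toy_weights = weight_candidates(select_candidates(toy_text))
toy_top = n_best(toy_weights, n=2)
print(toy_top)
```

Each function plays the role of one `pke` call: `select_candidates` stands in for `candidate_selection()`, `weight_candidates` for `candidate_weighting()`, and `n_best` for `get_n_best()`.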

## Graph-based keyphrase extraction with TopicRank

[TopicRank (Bougouin et al., 2013)](https://aclanthology.org/I13-1062/) is an unsupervised graph-based ranking model for keyphrase extraction that is often used as a baseline by the research community.
TopicRank relies on a graph-based topical representation of the input document, and uses a random walk algorithm derived from PageRank to estimate the importance of each topic (node).
The most representative phrase candidates belonging to the highest-scored topics are then selected as keyphrases.

This notebook presents an end-to-end example of keyphrase extraction using TopicRank implemented in `pke`.
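As a rough illustration of the random-walk idea behind TopicRank, the sketch below runs a PageRank-style power iteration on a tiny weighted graph. The graph, the node names and the normalised `(1 - damping) / n` formulation are assumptions made for this example; `pke` delegates this step to `networkx` rather than using code like this.

```python
def weighted_pagerank(graph, damping=0.85, iterations=100):
    """Power iteration on an undirected weighted graph.

    `graph` maps each node to a dict {neighbour: edge weight}.
    """
    nodes = list(graph)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    # total edge weight attached to each node, used to normalise its votes
    out_weight = {node: sum(graph[node].values()) for node in nodes}

    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # each neighbour passes on a share of its score,
            # proportional to the weight of the connecting edge
            incoming = sum(
                graph[nb][node] / out_weight[nb] * scores[nb]
                for nb in graph[node]
            )
            new_scores[node] = (1 - damping) / n + damping * incoming
        scores = new_scores
    return scores

# hypothetical topic graph: t0 is strongly connected to both other topics
toy_graph = {
    "t0": {"t1": 0.5, "t2": 0.5},
    "t1": {"t0": 0.5, "t2": 0.1},
    "t2": {"t0": 0.5, "t1": 0.1},
}
toy_scores = weighted_pagerank(toy_graph)
for node, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))
```

The topic with the heaviest edges (`t0`) ends up with the highest score, which is the intuition behind ranking topics by their connectivity.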

### step-1: let's start by importing `pke` and initializing a `TopicRank` model

In [None]:
import pke

# initialize a TopicRank keyphrase extraction model
extractor = pke.unsupervised.TopicRank()

### step-2: what we need now is a sample document

In [None]:
# sample document (a short news excerpt)
sample = """A day before the swearing-in ceremony of Prime Minister Narendra Modi, Janata Dal (United) leader KC Tyagi claimed that the opposition INDIA bloc had offered the PM post to Bihar chief minister Nitish Kumar, to woo him to join the alliance and not support the NDA. "After the elections, those leaders of INDI alliance who did not want Nitish Kumar to become the convenor offered him the post of Prime Minister. We have phone records attesting to this,” he claimed.""".replace("\n", " ")

### step-3: we can load the sample document using the pke model

When raw text is given to a `pke` model, it is pre-processed with `spacy` (sentence splitting, tokenization, Part-of-Speech tagging) and `nltk` (stemming).

In [None]:
# load the document using the initialized model
# text preprocessing is carried out using spacy
extractor.load_document(input=sample, language='en')

In [None]:
# loading a document populates the extractor.sentences list
# let's have a look at the pre-processed text

# for each sentence in the document
for i, sentence in enumerate(extractor.sentences):

    # print out the sentence id, its tokens, its stems and the corresponding Part-of-Speech tags
    print("sentence {}:".format(i))
    print(" - words: {} ...".format(' '.join(sentence.words[:5])))
    print(" - stems: {} ...".format(' '.join(sentence.stems[:5])))
    print(" - PoS: {} ...".format(' '.join(sentence.pos[:5])))

sentence 0:
 - words: A day before the swearing-in ...
 - stems: a day befor the swearing-in ...
 - PoS: DET NOUN ADP DET ADJ ...
sentence 1:
 - words: After the elections , those ...
 - stems: after the elect , those ...
 - PoS: ADP DET NOUN PUNCT DET ...
sentence 2:
 - words: We have phone records attesting ...
 - stems: we have phone record attest ...
 - PoS: PRON AUX NOUN NOUN VERB ...


### step-4: identifying keyphrase candidates

In [None]:
# identify the keyphrase candidates using TopicRank's default strategy
# i.e. the longest sequences of nouns and adjectives `(Noun|Adj)*`
extractor.candidate_selection()
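The "longest sequences of nouns and adjectives" strategy can be sketched in a few lines of plain Python. This is an illustration, not `pke`'s implementation: the function name is made up, and the set of accepted tags (`NOUN`, `PROPN`, `ADJ`) is assumed to mirror TopicRank's default. The tokens and tags are the start of sentence 0 as printed by the notebook.

```python
from itertools import groupby

# PoS tags allowed inside a candidate (assumed default, see lead-in)
VALID_POS = {"NOUN", "PROPN", "ADJ"}

def longest_pos_sequences(words, pos_tags, valid_pos=VALID_POS):
    """Return maximal runs of adjacent words whose PoS tag is in valid_pos."""
    candidates = []
    tagged = zip(words, pos_tags)
    # groupby splits the sentence into alternating valid / invalid runs
    for is_valid, run in groupby(tagged, key=lambda wp: wp[1] in valid_pos):
        if is_valid:
            candidates.append(" ".join(word for word, _ in run))
    return candidates

# first six tokens of sentence 0 with their PoS tags
words = ["A", "day", "before", "the", "swearing-in", "ceremony"]
pos_tags = ["DET", "NOUN", "ADP", "DET", "ADJ", "NOUN"]
toy_candidates = longest_pos_sequences(words, pos_tags)
print(toy_candidates)  # ['day', 'swearing-in ceremony']
```

Because runs are maximal, "swearing-in ceremony" is kept as a single candidate rather than as two separate words.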

In [None]:
# identifying keyphrase candidates populates the extractor.candidates dictionary
# let's have a look at the keyphrase candidates

# for each keyphrase candidate
for i, candidate in enumerate(extractor.candidates):

    # print out the candidate id, its stemmed form
    print("candidate {}: {} (stemmed form)".format(i, candidate))

    # print out the surface forms of the candidate
    print(" - surface forms:", [ " ".join(u) for u in extractor.candidates[candidate].surface_forms])

    # print out the corresponding offsets
    print(" - offsets:", extractor.candidates[candidate].offsets)

    # print out the corresponding sentence ids
    print(" - sentence_ids:", extractor.candidates[candidate].sentence_ids)

    # print out the corresponding PoS patterns
    print(" - pos_patterns:", extractor.candidates[candidate].pos_patterns)

candidate 0: day (stemmed form)
 - surface forms: ['day']
 - offsets: [1]
 - sentence_ids: [0]
 - pos_patterns: [['NOUN']]
candidate 1: swearing-in ceremoni (stemmed form)
 - surface forms: ['swearing-in ceremony']
 - offsets: [4]
 - sentence_ids: [0]
 - pos_patterns: [['ADJ', 'NOUN']]
candidate 2: prime minist narendra modi (stemmed form)
 - surface forms: ['Prime Minister Narendra Modi']
 - offsets: [7]
 - sentence_ids: [0]
 - pos_patterns: [['PROPN', 'PROPN', 'PROPN', 'PROPN']]
candidate 3: janata dal (stemmed form)
 - surface forms: ['Janata Dal']
 - offsets: [12]
 - sentence_ids: [0]
 - pos_patterns: [['PROPN', 'PROPN']]
candidate 4: unit (stemmed form)
 - surface forms: ['United']
 - offsets: [15]
 - sentence_ids: [0]
 - pos_patterns: [['PROPN']]
candidate 5: leader kc tyagi (stemmed form)
 - surface forms: ['leader KC Tyagi']
 - offsets: [17]
 - sentence_ids: [0]
 - pos_patterns: [['NOUN', 'PROPN', 'PROPN']]
candidate 6: opposit india bloc (stemmed form)
 - surface forms: ['oppo

### step-5: ranking keyphrase candidates

In [None]:
# In TopicRank, candidate weighting is a three-step process:
#  1. candidate clustering (grouping keyphrase candidates into topics)
#  2. graph construction (building a complete weighted graph of topics)
#  3. topic ranking (scoring the nodes using a random walk algorithm)
extractor.candidate_weighting()
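The clustering step can be approximated as follows. This is a simplified sketch under stated assumptions: candidates are compared by the Jaccard distance between their sets of stems and merged with average-linkage agglomerative clustering until the closest pair is further apart than a threshold (0.74 here, i.e. roughly 25% stem overlap). The helper names are hypothetical and the real implementation may differ in its details.

```python
def jaccard_distance(a, b):
    """1 - |shared stems| / |all stems| for two stemmed candidates."""
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def cluster_candidates(candidates, threshold=0.74):
    """Greedy agglomerative clustering: merge the closest pair until too far apart."""
    clusters = [[c] for c in candidates]
    while len(clusters) > 1:
        dist, i, j = min(
            (average_linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# stemmed candidates taken from the sample document
stems = ["prime minist", "prime minist narendra modi", "pm post", "post", "convenor"]
toy_topics = cluster_candidates(stems)
for i, topic in enumerate(toy_topics):
    print("topic {}: {}".format(i, ";".join(topic)))
```

On these five candidates the sketch groups "prime minist" with "prime minist narendra modi" and "pm post" with "post", while "convenor" stays on its own.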

In [None]:
# let's have a look at the topics

# for each topic of the document
for i, topic in enumerate(extractor.topics):

    # print out the topic id and the candidates it groups together
    print("topic {}: {} ".format(i, ';'.join(topic)))

topic 0: prime minist;prime minist narendra modi 
topic 1: bihar chief minist nitish kumar;nitish kumar 
topic 2: pm post;post 
topic 3: leader;leader kc tyagi 
topic 4: allianc;indi allianc 
topic 5: convenor 
topic 6: day 
topic 7: elect 
topic 8: janata dal 
topic 9: nda 
topic 10: opposit india bloc 
topic 11: phone record 
topic 12: swearing-in ceremoni 
topic 13: unit 


In [None]:
# let's have a look at the graph-based representation of the document
#
# here, nodes are topics, edges between topics are weighted according to
# the strength of their semantic relation measured by the reciprocal distances
# between the offset positions of the candidate keyphrases

import networkx as nx
import matplotlib.pyplot as plt
%matplotlib notebook

# set the labels as list of candidates for each topic
labels = {i: ';'.join(topic) for i, topic in enumerate(extractor.topics)}

# set the weights of the edges
edge_weights = [extractor.graph[u][v]['weight'] for u,v in extractor.graph.edges()]

# set the weights of the nodes (topic weights are stored in _w attribute)
sizes = [10e3*extractor._w[i] for i, topic in enumerate(extractor.topics)]

# draw the graph
nx.draw_shell(extractor.graph, with_labels=True, labels=labels, width=edge_weights, node_size=sizes)
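To see where the edge weights come from, here is a minimal sketch of the reciprocal-distance idea: the weight between two topics is the sum of `1 / |p_i - p_j|` over all pairs of offsets of their candidates. The function name is hypothetical, and the library may additionally correct the gap for candidate length, so the numbers below are not expected to match `extractor.graph` exactly.

```python
def topic_edge_weight(offsets_i, offsets_j):
    """Sum of reciprocal distances between every pair of offsets of two topics.

    Assumes the two topics never share an offset (the distance would be zero).
    """
    return sum(
        1.0 / abs(p_i - p_j)
        for p_i in offsets_i
        for p_j in offsets_j
    )

# offsets copied from the candidate listing (one candidate per topic here)
topic_offsets = {
    "day": [1],
    "swearing-in ceremoni": [4],
    "janata dal": [12],
}

names = list(topic_offsets)
edge_weights_sketch = {
    (a, b): topic_edge_weight(topic_offsets[a], topic_offsets[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, weight in edge_weights_sketch.items():
    print(pair, round(weight, 3))
```

Topics whose candidates occur close to each other ("day" and "swearing-in ceremoni", three tokens apart) receive a heavier edge than topics that are far apart.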

In [None]:
# let's have a look at the weights/ranks of the topics

# In TopicRank, weights are computed for each topic, and only one
# representative candidate per topic (by default the first occurring
# one) is kept

# for each representative candidate
for candidate, weight in extractor.weights.items():

    # print out the candidate (in stemmed form) and its weight
    print('{}: {}'.format(candidate, weight))

prime minist narendra modi: 0.1159118499400101
bihar chief minist nitish kumar: 0.09398429065282353
pm post: 0.10146139142338098
leader kc tyagi: 0.1157721838386306
allianc: 0.09011808582476209
convenor: 0.05012380333427968
day: 0.0428040283039335
elect: 0.054427863221189615
janata dal: 0.07071249970083388
nda: 0.047648962264445635
opposit india bloc: 0.05236819648163593
phone record: 0.03811416596726638
swearing-in ceremoni: 0.058008685896256476
unit: 0.06854399315055178


### step-6: selecting the N-best candidates as keyphrases

In [None]:
# Get the N-best candidates (here, 10) as keyphrases
keyphrases = extractor.get_n_best(n=10, stemming=False)

# for each of the best candidates
for i, (candidate, score) in enumerate(keyphrases):

    # print out its rank, phrase and score
    print("rank {}: {} ({})".format(i, candidate, score))

rank 0: prime minister narendra modi (0.1159118499400101)
rank 1: leader kc tyagi (0.1157721838386306)
rank 2: pm post (0.10146139142338098)
rank 3: bihar chief minister nitish kumar (0.09398429065282353)
rank 4: alliance (0.09011808582476209)
rank 5: janata dal (0.07071249970083388)
rank 6: united (0.06854399315055178)
rank 7: swearing-in ceremony (0.058008685896256476)
rank 8: elections (0.054427863221189615)
rank 9: opposition india bloc (0.05236819648163593)
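At its core, N-best selection is a sort of the weights dictionary in decreasing order. The sketch below reproduces that step on five of the weights printed above; the variable and function names are made up, and it only covers the sorting, not the mapping from stems back to surface forms that `stemming=False` triggers.

```python
# a subset of the topic weights printed in step-5 (stemmed candidates)
topic_weights = {
    "bihar chief minist nitish kumar": 0.09398429065282353,
    "allianc": 0.09011808582476209,
    "pm post": 0.10146139142338098,
    "prime minist narendra modi": 0.1159118499400101,
    "leader kc tyagi": 0.1157721838386306,
}

def top_n(weights, n):
    """Return the n (candidate, weight) pairs with the highest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]

top3 = top_n(topic_weights, n=3)
for rank, (candidate, score) in enumerate(top3):
    print("rank {}: {} ({})".format(rank, candidate, score))
```

The resulting order matches ranks 0 to 2 of the notebook output, with the candidates still in stemmed form.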


## Conclusion

Now that we are familiar with the three-stage process involved in keyphrase extraction (candidate selection, candidate ranking, N-best selection), as well as with the `pke` API, we are ready for part 2, in which we experiment with different models and parameters and see how to evaluate the quality of the produced keyphrases.