# Theme extraction with pke

**Themes** are topical keywords and phrases that are prominent in a document.

To extract the themes we will use the [pke - Python keyphrase extraction](https://github.com/boudinfl/pke) toolkit. pke requires [spaCy](https://spacy.io/usage) and a spaCy model for the language of the document.

Let's install spacy and pke first.

In [None]:
import sys

!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm  # download the English spaCy model
!{sys.executable} -m pip install git+https://github.com/boudinfl/pke.git

If you plan to use pke on a command-line installation of Python, you can use the following commands instead:

```
pip install spacy
python -m spacy download en_core_web_sm
pip install git+https://github.com/boudinfl/pke.git
```

Let's see how pke works. For this, we are going to use a raw text file called [wiki_gershwin.txt](wiki_gershwin.txt). We first import the module and initialize the keyphrase extraction model (here: TopicRank):

In [None]:
import pke

extractor = pke.unsupervised.TopicRank()

Load the content of the document. Here the document is expected to be in raw format (i.e. a plain text file). It is automatically preprocessed and analyzed with spaCy, using the language given in the `language` parameter:

In [None]:
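# Read the raw text (assuming the file is UTF-8 encoded);
# the with-block closes the file once it has been read
with open('wiki_gershwin.txt', 'r', encoding='utf-8') as doc:
    text = doc.read()
extractor.load_document(text, language='en')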

The keyphrase extraction consists of three steps:

1. Candidate selection:  
With TopicRank, the default candidates are sequences of nouns and adjectives (i.e. `(Noun|Adj)*`).

2. Candidate weighting:  
With TopicRank, the candidates are grouped into topics, and the topics are scored with a graph-based random walk algorithm.

3. N-best candidate selection:  
The 10 highest-scored candidates are selected. They are returned as (keyphrase, score) tuples.
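
To make step 1 more concrete, here is a minimal pure-Python sketch of selecting the longest sequences of nouns and adjectives. It is only an illustration, not pke's own code: the function name `longest_sequences`, the hand-tagged example sentence, and the choice to count proper nouns (`PROPN`) as nouns are all assumptions made for this sketch. In pke the part-of-speech tags come from spaCy.

```python
# Hand-tagged tokens (word, coarse part-of-speech tag) for a made-up sentence.
tagged = [
    ("George", "PROPN"), ("Gershwin", "PROPN"), ("was", "AUX"),
    ("an", "DET"), ("American", "ADJ"), ("composer", "NOUN"),
    ("and", "CCONJ"), ("pianist", "NOUN"), (".", "PUNCT"),
]


def longest_sequences(tagged_tokens, valid_pos=("NOUN", "PROPN", "ADJ")):
    """Return the longest runs of consecutive tokens whose tag is in valid_pos."""
    candidates, current = [], []
    for word, pos in tagged_tokens:
        if pos in valid_pos:
            # Keep extending the current run of nouns/adjectives.
            current.append(word)
        elif current:
            # Any other tag ends the run, which becomes one candidate.
            candidates.append(" ".join(current))
            current = []
    if current:
        # Do not lose a run that reaches the end of the text.
        candidates.append(" ".join(current))
    return candidates


candidates = longest_sequences(tagged)
print(candidates)  # ['George Gershwin', 'American composer', 'pianist']
```

With pke itself, all three steps are carried out by the three calls in the next cell.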

In [None]:
extractor.candidate_selection()
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=10)

print("Extracted themes:")
print("=================")
for keyphrase in keyphrases:
    print(f'{keyphrase[1]:.5f}   {keyphrase[0]}')
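
The random walk used for weighting in step 2 can be pictured with a small PageRank-style sketch on a toy graph. This is a simplified illustration under stated assumptions, not pke's implementation: the graph, its edge weights, the function name `random_walk_scores`, and the damping factor of 0.85 (a conventional default) are all made up for the example.

```python
def random_walk_scores(graph, damping=0.85, iterations=100):
    """Score the nodes of a weighted undirected graph with a PageRank-style walk.

    graph maps each node to a dict {neighbour: edge_weight}; every edge must
    be listed in both directions with the same weight.
    """
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each neighbour passes on its score in proportion to the share
            # of its total edge weight that points at this node.
            incoming = sum(
                scores[nb] * weight / sum(graph[nb].values())
                for nb, weight in graph[node].items()
            )
            new_scores[node] = (1 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores


# Hypothetical graph: nodes stand for topics, weights for how closely they co-occur.
toy_graph = {
    "gershwin": {"composer": 3.0, "rhapsody": 2.0, "piano": 1.0},
    "composer": {"gershwin": 3.0},
    "rhapsody": {"gershwin": 2.0},
    "piano": {"gershwin": 1.0},
}

scores = random_walk_scores(toy_graph)
for node, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.5f}   {node}")
```

The best-connected node ends up with the highest score, and the other nodes are ordered by the weight of their link to it. This is the intuition behind ranking themes by their position in the graph.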

Next, you can try out different methods for extracting themes (supervised, unsupervised, graph-based) and compare the themes they extract. If your texts are in languages other than English, run the theme extraction on them as well and assess the quality. Is this something you might want to use for your final project?

![Approaches implemented in pke](static/pke_methods.png)
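
One simple way to compare the themes from two methods is to measure how much their keyphrase lists overlap. The sketch below is one possible approach, not part of pke: the helper `compare_themes` is a hypothetical name, and the two input lists are made-up stand-ins for real results.

```python
def compare_themes(themes_a, themes_b):
    """Compare two lists of (keyphrase, score) tuples by their keyphrases.

    Returns the shared keyphrases (sorted) and the Jaccard overlap, i.e. the
    number of shared keyphrases divided by the number of distinct keyphrases.
    """
    set_a = {phrase for phrase, _ in themes_a}
    set_b = {phrase for phrase, _ in themes_b}
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Made-up outputs standing in for the results of two different methods.
method_a = [("george gershwin", 0.12), ("rhapsody in blue", 0.08), ("piano", 0.05)]
method_b = [("george gershwin", 0.30), ("opera", 0.10), ("piano", 0.07)]

shared, jaccard = compare_themes(method_a, method_b)
print("Shared themes:", shared)
print(f"Jaccard overlap: {jaccard:.2f}")
```

To use it with real results, pass in the lists returned by `get_n_best(n=10)` from two extractors, for example `pke.unsupervised.TopicRank()` and `pke.unsupervised.TextRank()`. Each extractor needs its own `load_document`, `candidate_selection` and `candidate_weighting` calls first.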

You can read more about the pke toolkit in the accompanying paper ([Boudin, 2016](https://aclanthology.org/C16-2015.pdf)).

<sub>By Dmitry Kan, updated by Mathias Creutz and Yves Scherrer</sub>