# Theme extraction with pke

**Themes** are topical keywords and phrases that are prominent in a document.

To extract the themes, we will use the [pke - Python keyphrase extraction](https://github.com/boudinfl/pke) toolkit. pke requires [spaCy](https://spacy.io/usage) as well as the NLTK stopwords.

Let's install spaCy and pke first. NLTK is not installed explicitly below; it should be pulled in as a dependency of pke.

In [5]:
import sys

!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm  # download the small English spaCy model
!{sys.executable} -m pip install git+https://github.com/boudinfl/pke.git

Defaulting to user installation because normal site-packages is not writeable
Collecting scipy>=1.8
  Using cached scipy-1.10.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting numpy<1.27.0,>=1.19.5
  Using cached numpy-1.24.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy, scipy
Successfully installed numpy-1.24.1 scipy-1.10.0
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 83.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/boudinfl/pke.git
  Cloning https://github.com/boudinfl/pke.git to /tmp/pip-req-build-m025zg3s
  Running command git clone --filter=blob:none --quiet https://github.com/boudinfl/pke.git /tmp/pip-req-build-m025zg3s
  Resolved https://github.com/boudinfl/pke.git to commit 8f1d05dcc52041c9920ba0f9d5231fe6086d12c4
  Preparing metadata (setup.py) ... done


If you run Python from the command line rather than in a notebook, you can use the following commands instead:

```
pip install spacy
python -m spacy download en_core_web_sm
pip install git+https://github.com/boudinfl/pke.git
```
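After installing, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8). The helper name `installed_versions` is made up for this example, and the distribution names in the list are assumptions about how the packages are named on your system.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust the list to your setup.
for name, version in installed_versions(
    ["spacy", "pke", "nltk", "scipy", "networkx"]
).items():
    print(f"{name:10s} {version or 'not installed'}")
```

In a notebook, note that a package upgraded during the session is only picked up by `import` after the kernel has been restarted, even if the version printed here is already the new one.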

Let's see how pke works. For this, we are going to use a raw text file called [wiki_gershwin.txt](wiki_gershwin.txt). We first import the module and initialize the keyphrase extraction model (here: TopicRank):

In [6]:
import pke

extractor = pke.unsupervised.TopicRank()

Load the content of the document. Here the document is expected to be in raw format (i.e. a simple text file). It is automatically preprocessed and analyzed with spaCy, using the language given in the `language` parameter:

In [7]:
# Read the raw text; the with-statement closes the file automatically
with open('wiki_gershwin.txt', 'r', encoding='utf-8') as doc:
    text = doc.read()
extractor.load_document(text, language='en')

The keyphrase extraction consists of three steps:

1. Candidate selection:  
With TopicRank, the default candidates are the longest sequences of nouns and adjectives (i.e. `(Noun|Adj)+`).

2. Candidate weighting:  
With TopicRank, the candidates are first clustered into topics, and the topics are then ranked using a random walk algorithm on a graph of topics.

3. N-best candidate selection:  
The 10 highest-scored candidates are selected. They are returned as (keyphrase, score) tuples.

In [8]:
extractor.candidate_selection()
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=10)

print("Extracted themes:")
print("=================")
for keyphrase, score in keyphrases:
    print(f'{score:.5f}   {keyphrase}')

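AttributeError: module 'scipy.sparse' has no attribute 'coo_array'

If you get this error, the running kernel is most likely still using a SciPy version older than 1.8, which does not provide `scipy.sparse.coo_array`. Restart the kernel after the installation step and run the cells again.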
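To build some intuition for the three steps, here is a toy sketch in plain Python. It is not pke's implementation and not TopicRank itself: there is no topic clustering, the graph simply links candidates that occur in the same sentence, and the part-of-speech tags are assigned by hand instead of by spaCy. All function names and the two example sentences are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

KEEP = {"NOUN", "PROPN", "ADJ"}  # tags treated as nouns/adjectives


def select_candidates(tagged_sentence):
    """Step 1: longest runs of nouns/adjectives in one POS-tagged sentence."""
    candidates, current = [], []
    for word, pos in tagged_sentence:
        if pos in KEEP:
            current.append(word.lower())
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


def rank_candidates(sentences, damping=0.85, iterations=50):
    """Step 2: random walk (PageRank power iteration) on a co-occurrence graph."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        cands = set(select_candidates(sentence))
        for cand in cands:
            neighbours[cand]  # register candidates that have no neighbours
        for a, b in combinations(sorted(cands), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    nodes = sorted(neighbours)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            incoming = sum(scores[m] / len(neighbours[m]) for m in neighbours[node])
            updated[node] = (1 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores


def n_best(scores, n=10):
    """Step 3: the n highest-scored candidates as (keyphrase, score) tuples."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hand-tagged toy input (tags are assumptions, not spaCy output)
sentences = [
    [("George", "PROPN"), ("Gershwin", "PROPN"), ("was", "AUX"),
     ("an", "DET"), ("American", "ADJ"), ("composer", "NOUN")],
    [("The", "DET"), ("American", "ADJ"), ("composer", "NOUN"),
     ("wrote", "VERB"), ("orchestral", "ADJ"), ("works", "NOUN")],
]

scores = rank_candidates(sentences)
for keyphrase, score in n_best(scores, n=3):
    print(f"{score:.5f}   {keyphrase}")
```

In this toy graph, "american composer" co-occurs with both other candidates, so the random walk visits it most often and it ends up ranked first.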

Next, you can try out the different kinds of theme extraction methods available in pke (supervised, unsupervised, graph-based) and compare the themes they extract. If your texts are in languages other than English, test theme extraction on them and assess the quality. Is this something you might want to use for your final project?

![Approaches implemented in pke](static/pke_methods.png)
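When comparing methods, one simple option is to look at how much the extracted keyphrase lists overlap. The sketch below assumes each method returns a list of (keyphrase, score) tuples, as `get_n_best` does above. The helper `compare_themes` is hypothetical, and the two input lists are made-up placeholder values, not real pke output.

```python
def compare_themes(first, second):
    """Compare two lists of (keyphrase, score) tuples by their keyphrases."""
    a = {phrase for phrase, _ in first}
    b = {phrase for phrase, _ in second}
    union = a | b
    return {
        "shared": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
        # Jaccard overlap: size of the intersection divided by size of the union
        "jaccard": len(a & b) / len(union) if union else 0.0,
    }


# Made-up placeholder lists standing in for the output of two methods
method_a = [("american composer", 0.31), ("orchestral works", 0.22), ("broadway", 0.10)]
method_b = [("american composer", 0.40), ("broadway", 0.15), ("piano", 0.12)]

report = compare_themes(method_a, method_b)
print("Shared themes:", report["shared"])
print("Jaccard overlap:", report["jaccard"])
```

Scores from different methods are generally not on the same scale, so this comparison deliberately looks only at which keyphrases appear, not at their scores.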

You can read more about the pke toolkit in the accompanying paper ([Boudin, 2016](https://aclanthology.org/C16-2015.pdf)).

<sub>By Dmitry Kan, updated by Mathias Creutz and Yves Scherrer</sub>