In [1]:
%reload_ext autoreload
%autoreload 2

## Keyphrase Extraction in `ktrain`

Keyphrase extraction in **ktrain** leverages the [textblob](https://textblob.readthedocs.io/en/dev/) package, which can be installed with:
```
pip install textblob
python -m textblob.download_corpora
```

In [2]:
from ktrain.text.kw import KeywordExtractor
from ktrain.text.textextractor import TextExtractor

#### Download a Paper from arXiv and Extract Text
For our test document, let's download the ktrain paper from arXiv and use the `TextExtractor` class to extract its text.

In [3]:
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q
text = TextExtractor().extract('/tmp/downloaded_paper.pdf')

In [14]:
print(f"# of words in downloaded paper: {len(text.split())}")

# of words in downloaded paper: 4551


#### Using N-Grams as the Candidate Generator

Let's first use `ngrams` as the candidate generator, which is comparatively fast:

In [5]:
kwe = KeywordExtractor()

In [9]:
%%time
kwe.extract_keywords(text, candidate_generator='ngrams')

CPU times: user 351 ms, sys: 27.6 ms, total: 379 ms
Wall time: 378 ms


[('machine learning', 0.10548523206751055),
 ('step', 0.06751054852320675),
 ('learning rate', 0.046413502109704644),
 ('arxiv preprint', 0.046413502109704644),
 ('text classification', 0.03375527426160337),
 ('augmented machine', 0.02531645569620253),
 ('open-domain question-answering', 0.02531645569620253),
 ('augmented machine learning', 0.02531645569620253),
 ('bert', 0.02109704641350211),
 ('low-code library', 0.02109704641350211)]
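
To give a rough sense of what an n-gram candidate generator does, here is a minimal pure-Python sketch. It is not ktrain's implementation: the helper names `ngram_candidates` and `score_by_frequency`, the tokenizing regular expression, and the relative-frequency scoring are all assumptions made for illustration, and the sketch does no stopword filtering or sentence splitting.

```python
import re
from collections import Counter


def ngram_candidates(text, ngram_range=(1, 3)):
    """Yield every contiguous word n-gram whose length is within ngram_range.

    Hypothetical helper for illustration only; not part of ktrain.
    """
    # Lowercase and keep alphanumeric tokens, allowing internal hyphens.
    words = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def score_by_frequency(candidates, top_n=10):
    """Rank candidates by relative frequency (an assumed scoring scheme)."""
    counts = Counter(candidates)
    total = sum(counts.values())
    return [(phrase, count / total) for phrase, count in counts.most_common(top_n)]


sample = "machine learning makes text classification easy. machine learning is fun."
top = score_by_frequency(ngram_candidates(sample, ngram_range=(2, 2)), top_n=3)
print(top[0])
```

Because every contiguous word sequence becomes a candidate, this style of generator is cheap to run but can surface fragments such as `'augmented machine'` in the output above.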

#### Using Noun Phrases as the Candidate Generator


If we use `noun_phrases` as the candidate generator instead, quality improves slightly at the cost of a longer running time (about 1.09 s versus 378 ms in the runs shown here).

In [10]:
%%time
kwe.extract_keywords(text, candidate_generator='noun_phrases')

CPU times: user 1.09 s, sys: 273 µs, total: 1.09 s
Wall time: 1.09 s


[('machine learning', 0.0784313725490196),
 ('text classification', 0.049019607843137254),
 ('image classification', 0.049019607843137254),
 ('exact answers', 0.0392156862745098),
 ('augmented machine learning', 0.0392156862745098),
 ('graph data', 0.029411764705882353),
 ('node classification', 0.029411764705882353),
 ('entity recognition', 0.029411764705882353),
 ('code example', 0.029411764705882353),
 ('index documents', 0.029411764705882353)]
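
One way to see the difference between the two candidate generators is to compare their top-10 lists directly. The sketch below copies the phrases from the two outputs shown in this notebook and separates those that both generators return from those that only the n-gram generator returns. The variable names are illustrative only.

```python
# Phrases copied from the two outputs shown in this notebook (scores omitted).
ngram_top = [
    "machine learning", "step", "learning rate", "arxiv preprint",
    "text classification", "augmented machine",
    "open-domain question-answering", "augmented machine learning",
    "bert", "low-code library",
]
noun_phrase_top = [
    "machine learning", "text classification", "image classification",
    "exact answers", "augmented machine learning", "graph data",
    "node classification", "entity recognition", "code example",
    "index documents",
]

noun_phrase_set = set(noun_phrase_top)

# Phrases returned by both generators, in n-gram ranking order.
shared = [p for p in ngram_top if p in noun_phrase_set]

# Phrases that appear only in the n-gram results.
only_ngrams = [p for p in ngram_top if p not in noun_phrase_set]

print("shared:", shared)
print("n-grams only:", only_ngrams)
```

The n-gram-only list includes items such as `'step'`, `'arxiv preprint'`, and the fragment `'augmented machine'`, none of which appear in the noun-phrase results.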

#### Other Parameters
The `extract_keywords` method has other parameters that control the output. For instance, the `ngram_range` parameter sets the minimum and maximum number of words in a keyphrase. Here, `ngram_range=(3,3)` extracts only 3-word keyphrases:

In [11]:
kwe.extract_keywords(text, candidate_generator='noun_phrases', ngram_range=(3,3))

[('augmented machine learning', 0.07017543859649122),
 ('a. s. maiya', 0.05263157894736842),
 ('optimal learning rate', 0.03508771929824561),
 ('natural language questions', 0.03508771929824561),
 ('support text data', 0.017543859649122806),
 ('learning rate schedules', 0.017543859649122806),
 ('machine learning model', 0.017543859649122806),
 ('unsupervised topic modeling', 0.017543859649122806),
 ('large text corpus', 0.017543859649122806),
 ('social media accounts', 0.017543859649122806)]
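
Note that `ngram_range` is not the same as filtering an existing result list by length. The sketch below applies a hypothetical post-filter, `filter_by_word_count` (not a ktrain function), to the earlier noun-phrase output. Only one 3-word phrase survives, and its score is unchanged from the earlier run, whereas the `ngram_range=(3,3)` call above returns ten 3-word phrases with different scores.

```python
def filter_by_word_count(keyphrases, ngram_range):
    """Keep (phrase, score) pairs whose word count lies within ngram_range.

    Hypothetical helper for illustration only; not part of ktrain.
    """
    lo, hi = ngram_range
    return [(p, s) for p, s in keyphrases if lo <= len(p.split()) <= hi]


# Output of the earlier noun_phrases run, copied from this notebook.
noun_phrase_results = [
    ("machine learning", 0.0784313725490196),
    ("text classification", 0.049019607843137254),
    ("image classification", 0.049019607843137254),
    ("exact answers", 0.0392156862745098),
    ("augmented machine learning", 0.0392156862745098),
    ("graph data", 0.029411764705882353),
    ("node classification", 0.029411764705882353),
    ("entity recognition", 0.029411764705882353),
    ("code example", 0.029411764705882353),
    ("index documents", 0.029411764705882353),
]

three_word = filter_by_word_count(noun_phrase_results, (3, 3))
print(three_word)
```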

#### Combining All the Steps: Low-Code Keyphrase Extraction

In [13]:
from ktrain.text.kw import KeywordExtractor
from ktrain.text.textextractor import TextExtractor
# Download the paper and extract its text
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q
text = TextExtractor().extract('/tmp/downloaded_paper.pdf')
# Create the extractor and rank noun-phrase candidates
kwe = KeywordExtractor()
kwe.extract_keywords(text, candidate_generator='noun_phrases')

[('machine learning', 0.0784313725490196),
 ('text classification', 0.049019607843137254),
 ('image classification', 0.049019607843137254),
 ('exact answers', 0.0392156862745098),
 ('augmented machine learning', 0.0392156862745098),
 ('graph data', 0.029411764705882353),
 ('node classification', 0.029411764705882353),
 ('entity recognition', 0.029411764705882353),
 ('code example', 0.029411764705882353),
 ('index documents', 0.029411764705882353)]