# **Tutorial** - Keyword Extraction with KeyBERT
(last updated 10-05-2021)

In this tutorial, we explore how to use KeyBERT to extract keywords from documents. We discuss the most frequent use cases and methods, together with the important parameters to look out for.


## KeyBERT
KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/KeyBERT/master/images/logo.png" width="50%">
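
The ranking idea behind this can be sketched without any model at all. The snippet below is a minimal illustration, not KeyBERT's implementation: the three-dimensional vectors are made up by hand (a real run would get them from a BERT-style embedding model), and `rank_by_similarity` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(doc_vector, candidate_vectors, top_n=5):
    """Return the top_n (candidate, score) pairs, most similar first."""
    scored = [(word, round(cosine(doc_vector, vec), 4))
              for word, vec in candidate_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hand-made toy embeddings (assumption: stand-ins for real model output).
doc_vector = [0.9, 0.1, 0.3]
candidate_vectors = {
    "learning": [0.8, 0.2, 0.3],
    "algorithm": [0.7, 0.0, 0.6],
    "pairs": [0.1, 0.9, 0.1],
}

ranking = rank_by_similarity(doc_vector, candidate_vectors, top_n=2)
print(ranking)
```

The words whose vectors point in nearly the same direction as the document vector end up at the top, which is the same kind of `(keyword, score)` list that appears in the outputs throughout this tutorial.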

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
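
After changing the runtime, it can be worth confirming that a GPU is actually visible before passing `device="cuda"` to a model later on. A minimal sketch, assuming PyTorch is the framework in use; `pick_device` is a hypothetical helper name, and it falls back to the CPU when PyTorch is missing or no GPU is found:

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed (yet), so assume CPU for now.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```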

# **Installing KeyBERT**

We start by installing KeyBERT from PyPI:

In [None]:
%%capture
# !pip install keybert[all]
!pip install keybert

**NOTE**: If you choose `keybert[all]` to install all embedding backends, installation may take a while because it needs to install spaCy, Torch, Gensim, USE, etc. If you only want to use sentence-transformers, I would advise you to use `pip install keybert`.

## Restart the Notebook
Installing KeyBERT updates some packages that were already loaded. To use the updated versions correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# **KeyBERT**
Using KeyBERT is rather straightforward: we choose a document that we want keywords/keyphrases from and pass it through our keyword model:

In [None]:
from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """

In [None]:
kw_model = KeyBERT()
kw_model.extract_keywords(doc)

[('learning', 0.4605),
 ('algorithm', 0.4556),
 ('training', 0.4488),
 ('class', 0.4087),
 ('mapping', 0.3701)]

**NOTE**: Use `model="xlm-r-bert-base-nli-stsb-mean-tokens"` to select a model that supports 50+ languages.

## Keyphrase Length
Use `keyphrase_ngram_range` to set the length, in words, of the resulting keywords/keyphrases:



In [None]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1))

[('learning', 0.4605),
 ('algorithm', 0.4556),
 ('training', 0.4488),
 ('class', 0.4087),
 ('mapping', 0.3701)]

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number of words you would like in the resulting keyphrases:

In [None]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3))

[('algorithm generalize training', 0.7727),
 ('learning algorithm analyzes', 0.7588),
 ('learning machine learning', 0.7577),
 ('learning algorithm generalize', 0.7515),
 ('algorithm analyzes training', 0.7504)]

Note that `stop_words` is set to `"english"` by default. If you set it to `None`, stop words are no longer removed and can appear in longer keyphrases:

In [None]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3), stop_words=None)

[('learning algorithm analyzes', 0.7588),
 ('supervised learning algorithm', 0.7503),
 ('the learning algorithm', 0.7272),
 ('learning algorithm to', 0.7107),
 ('learning algorithm', 0.6979)]
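
The effect of `keyphrase_ngram_range` and `stop_words` on the candidate phrases can be illustrated in plain Python. This is a simplified sketch, not KeyBERT's code: it assumes candidates are word n-grams built after stop words have been dropped (which is how scikit-learn's `CountVectorizer` behaves), and the tiny `STOP_WORDS` set and the `candidate_ngrams` helper are made up for the example.

```python
import re

# Tiny illustrative stop-word list (assumption: the real "english" list is much longer).
STOP_WORDS = {"the", "to", "from", "a", "in"}

def candidate_ngrams(text, ngram_range=(1, 1), stop_words=None):
    """Build lowercase word n-grams, dropping stop words before joining."""
    # Keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    if stop_words:
        tokens = [tok for tok in tokens if tok not in stop_words]
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

sentence = "the learning algorithm to generalize from the training data"

with_stop_words_removed = candidate_ngrams(sentence, (3, 3), STOP_WORDS)
with_stop_words_kept = candidate_ngrams(sentence, (3, 3), None)

print(with_stop_words_removed)
print(with_stop_words_kept)
```

Because stop words are removed before the n-grams are formed, a phrase such as `algorithm generalize training` can appear even though those three words are not adjacent in the original text. With `stop_words=None`, phrases such as `learning algorithm to` show up instead.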

## Max Sum Similarity
To diversify the results, we take the `nr_candidates` most similar words/phrases to the document (for example, 2 x top_n). Then, we take all top_n combinations from those candidates and extract the combination whose members are the least similar to each other by cosine similarity.

In [None]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3),
                          use_maxsum=True, nr_candidates=20, top_n=5)

[('training examples supervised', 0.4556),
 ('machine learning', 0.7504),
 ('analyzes training data', 0.7727),
 ('requires learning algorithm', 0.5011),
 ('supervised learning algorithm', 0.279)]
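
The two steps of Max Sum Similarity can be sketched in plain Python. This is an illustration of the procedure as described in this section, not KeyBERT's own code; the two-dimensional vectors are invented for the example, and `max_sum_select` is a hypothetical helper name.

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def max_sum_select(doc_vector, candidate_vectors, top_n=2, nr_candidates=4):
    """Pick top_n phrases that are close to the document but far from each other."""
    # Step 1: keep the nr_candidates phrases most similar to the document.
    pool = sorted(
        candidate_vectors,
        key=lambda phrase: cosine(doc_vector, candidate_vectors[phrase]),
        reverse=True,
    )[:nr_candidates]

    # Step 2: among all top_n-sized combinations, keep the one whose members
    # have the lowest total pairwise similarity.
    best_combo, best_total = None, float("inf")
    for combo in itertools.combinations(pool, top_n):
        total = sum(
            cosine(candidate_vectors[a], candidate_vectors[b])
            for a, b in itertools.combinations(combo, 2)
        )
        if total < best_total:
            best_combo, best_total = combo, total
    return list(best_combo)

# Toy embeddings (assumption: hand-made, not produced by a model).
doc_vector = [1.0, 1.0]
candidate_vectors = {
    "learning algorithm": [1.0, 0.9],
    "learning algorithms": [1.0, 0.8],
    "training data": [0.6, 1.0],
    "class labels": [0.3, 1.0],
    "reasonable way": [-1.0, 0.2],
}

selected = max_sum_select(doc_vector, candidate_vectors, top_n=2, nr_candidates=4)
print(selected)
```

Note that the number of combinations grows quickly with `nr_candidates`, so a brute-force search like this one is only practical for small candidate pools.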

## Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create keywords/keyphrases. Like Max Sum Similarity, it is based on cosine similarity. The results with **high diversity**:

In [None]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3),
                          use_mmr=True, diversity=0.7)

[('algorithm generalize training', 0.7727),
 ('labels unseen', 0.089),
 ('mapping new', 0.3573),
 ('algorithm correctly', 0.3867),
 ('pairs infers function', 0.3827)]

The results with **low diversity**:



In [None]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3),
                          use_mmr=True, diversity=0.2)

[('algorithm generalize training', 0.7727),
 ('supervised learning algorithm', 0.7503),
 ('learning machine learning', 0.7577),
 ('learning algorithm analyzes', 0.7588),
 ('learning algorithm generalize', 0.7515)]
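
To see how the `diversity` value trades relevance against redundancy, here is a small sketch of the standard greedy MMR formulation in plain Python. It is written for illustration only and may differ in details from KeyBERT's implementation; the vectors are invented, and `mmr_select` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_select(doc_vector, candidate_vectors, top_n=2, diversity=0.5):
    """Greedy Maximal Marginal Relevance over a dict of phrase -> vector."""
    relevance = {p: cosine(doc_vector, v) for p, v in candidate_vectors.items()}
    # Start with the phrase most similar to the document.
    selected = [max(relevance, key=relevance.get)]
    remaining = [p for p in candidate_vectors if p != selected[0]]

    while remaining and len(selected) < top_n:
        def mmr_score(phrase):
            # Redundancy: similarity to the closest already-selected phrase.
            redundancy = max(
                cosine(candidate_vectors[phrase], candidate_vectors[chosen])
                for chosen in selected
            )
            return (1 - diversity) * relevance[phrase] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy embeddings (assumption: hand-made, not produced by a model).
doc_vector = [1.0, 1.0]
candidate_vectors = {
    "learning algorithm": [1.0, 0.9],
    "learning algorithms": [1.0, 0.8],
    "training data": [0.6, 1.0],
    "class labels": [0.3, 1.0],
}

low_diversity = mmr_select(doc_vector, candidate_vectors, top_n=2, diversity=0.2)
high_diversity = mmr_select(doc_vector, candidate_vectors, top_n=2, diversity=0.7)
print(low_diversity)
print(high_diversity)
```

With a low `diversity`, the second pick is a near-duplicate of the first; with a high value, the penalty for redundancy dominates and a less similar phrase is chosen instead.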

# **Embedding Models**
In this section, we will go through all embedding models and backends that are supported in KeyBERT.

## Sentence Transformers
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html) and pass it through KeyBERT with `model`:

In [None]:
kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
kw_model.extract_keywords(doc)

[('learning', 0.6026),
 ('training', 0.518),
 ('algorithm', 0.471),
 ('analyzes', 0.4646),
 ('supervised', 0.4624)]

Or we can select a SentenceTransformer model with our own parameters:

In [None]:
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens", device="cuda")

In [None]:
kw_model = KeyBERT(model=sentence_model)
kw_model.extract_keywords(doc)

[('learning', 0.6026),
 ('training', 0.518),
 ('algorithm', 0.471),
 ('analyzes', 0.4646),
 ('supervised', 0.4624)]

## Flair
Flair allows you to choose almost any embedding model that is publicly available.  
Flair can be used as follows:

In [None]:
from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')

In [None]:
kw_model = KeyBERT(model=roberta)
kw_model.extract_keywords(doc)

[('algorithm', 0.9289),
 ('inferred', 0.9286),
 ('output', 0.9286),
 ('supervised', 0.9285),
 ('desired', 0.9284)]

You can select any 🤗 transformers model [here](https://huggingface.co/models).

Moreover, you can use Flair to pool word embeddings into document embeddings. Under the hood, Flair simply averages all word embeddings in a document. We can then pass the pooled model to KeyBERT in order to use those word embeddings as document embeddings:

In [None]:
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# Note: 'crawl' loads Flair's English fastText web-crawl vectors, despite the variable names
glove_embedding = WordEmbeddings('crawl')
document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])

In [None]:
kw_model = KeyBERT(model=document_glove_embeddings)
kw_model.extract_keywords(doc)

[('function', 0.4896),
 ('output', 0.4621),
 ('data', 0.4577),
 ('learning', 0.4538),
 ('input', 0.4524)]
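
The pooling step described in this section amounts to an element-wise average. A minimal sketch with made-up two-dimensional word vectors (real word embeddings have hundreds of dimensions); `mean_pool` is a hypothetical helper, not a Flair function:

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one document vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dimensions = len(word_vectors[0])
    return [
        sum(vec[i] for vec in word_vectors) / len(word_vectors)
        for i in range(dimensions)
    ]

# Toy word embeddings (assumption: invented values for illustration).
toy_word_vectors = {
    "supervised": [1.0, 0.0],
    "learning": [0.0, 1.0],
    "algorithm": [0.5, 0.5],
}
tokens = ["supervised", "learning", "algorithm"]
document_vector = mean_pool([toy_word_vectors[tok] for tok in tokens])
print(document_vector)
```

Since every word contributes equally to the average, long documents tend to blur into a generic vector, which is one reason pooled word embeddings usually suit shorter texts better.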

## spaCy
spaCy has shown great promise over the last few years and is now slowly transitioning to transformer-based techniques, which makes it interesting to use in KeyBERT.

We start by using a non-transformer-based model which we will have to download first:

In [None]:
%%capture
!python -m spacy download en_core_web_md

Next, simply load the model into a spaCy `nlp` instance and pass it through KeyBERT:

In [None]:
import spacy
nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [None]:
kw_model = KeyBERT(model=nlp)
kw_model.extract_keywords(doc)

[('example', 0.7436),
 ('way', 0.7313),
 ('determine', 0.6889),
 ('allow', 0.6621),
 ('used', 0.6432)]

We can also use spaCy's transformer-based models, which have to be downloaded first as well:

In [None]:
%%capture
!python -m spacy download en_core_web_trf

As before, we simply load the model and pass it through KeyBERT. Note that we exclude several pipeline components because KeyBERT does not use them.

In [None]:
import spacy
from thinc.api import set_gpu_allocator, require_gpu

# Configure the GPU before loading the pipeline so the model is placed on it
set_gpu_allocator("pytorch")
require_gpu(0)
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [None]:
kw_model = KeyBERT(model=nlp)
kw_model.extract_keywords(doc)

## Universal Sentence Encoder (USE)
The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here for embedding the documents. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases, or short paragraphs.

In [None]:
import tensorflow_hub
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [None]:
kw_model = KeyBERT(model=embedding_model)
kw_model.extract_keywords(doc)

[('training', 0.2549),
 ('learning', 0.2264),
 ('algorithm', 0.2092),
 ('data', 0.1952),
 ('pairs', 0.1859)]

## Gensim
For Gensim, KeyBERT supports its `gensim.downloader` module. Here, we can download any word embedding model to be used in KeyBERT. Note that Gensim is primarily used for word embedding models. This typically works best for short documents since the word embeddings are pooled.

In [None]:
import gensim.downloader as api
ft = api.load('fasttext-wiki-news-subwords-300')

In [None]:
kw_model = KeyBERT(model=ft)
kw_model.extract_keywords(doc)

[('way', 0.698),
 ('new', 0.6955),
 ('based', 0.6899),
 ('set', 0.6824),
 ('object', 0.6574)]

## Custom Backend
If your backend or model cannot be found in the ones currently available, you can use the BaseEmbedder class to create your own backend. Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:

In [None]:
from keybert.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

In [None]:
kw_model = KeyBERT(model=custom_embedder)
kw_model.extract_keywords(doc)

[('learning', 0.5199),
 ('algorithm', 0.4292),
 ('supervised', 0.4265),
 ('training', 0.3835),
 ('class', 0.3147)]

# **Candidates**
In some cases, you might want to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases. In KeyBERT, you can easily use those candidate keywords to perform keyword extraction. We are going to create these candidates with [YAKE](https://github.com/LIAAD/yake), another great tool for extracting keywords.

We start by installing yake:

In [None]:
%%capture
!pip install yake

Next, we will create 20 candidate keywords with YAKE:

In [None]:
import yake

kw_extractor = yake.KeywordExtractor(top=20)
candidates = kw_extractor.extract_keywords(doc)
candidates = [candidate[0] for candidate in candidates]

Finally, we are going to pass these candidates to KeyBERT and use MMR to select the top 5 keywords/keyphrases:

In [None]:
kw_model = KeyBERT()
kw_model.extract_keywords(doc, candidates=candidates, use_mmr=True, diversity=0.5)

[('supervised learning algorithm', 0.7503),
 ('training data consisting', 0.5419),
 ('machine learning', 0.6306),
 ('learning algorithm', 0.6979),
 ('input-output pairs.', 0.3598)]