# Tutorial - Keyword Extraction with keyBoost

<img src="https://github.com/IIZCODEII/keyboost/raw/main/keyboost.png" width="50%">

In this brief tutorial, we'll explore how to extract keywords with keyBoost in the most common scenario. Along the way, we'll showcase the capabilities of the package and the parameters you may want to tune for your own use case.

#### About keyBoost

keyBoost is a simple and easy-to-use keyword extraction tool that removes the hassle of selecting the best models for your specific use case. No background in the keyword extraction literature is needed: keyBoost aims to extract the best possible keywords without prior knowledge of which models perform best for your use case.

## Environment Set-up

### Installing keyBoost

Running this cell installs the package directly from its GitHub repository.

In [9]:
!pip install git+https://github.com/IIZCODEII/keyboost.git#egg=keyboost

Note: you may need to restart the kernel to use updated packages.


To pick up any updates to preexisting libraries, restarting the notebook via Runtime ==> Restart Runtime is strongly advised.
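After the restart, it can be reassuring to confirm that the package is visible to the interpreter. The snippet below is a minimal sketch using only the standard library; the distribution name `keyboost` is an assumption taken from the `#egg=keyboost` fragment of the install command, and `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed distribution name; prints None if the install did not complete.
print(installed_version("keyboost"))
```

A `None` result suggests the install cell was interrupted or ran in a different environment, in which case re-running it is the first thing to try.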

### Enabling GPU acceleration 

Although not necessary, GPU acceleration can speed up keyword generation. It can be enabled on Colab by navigating to Edit ==> Notebook Settings and selecting GPU as the hardware accelerator in the dropdown menu.
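To check whether the runtime actually exposes a GPU, a quick test along these lines can help. It is a sketch that assumes PyTorch is the backend in use (sentence-transformers is built on it); `gpu_available` is a hypothetical helper name, and it simply reports `False` when PyTorch is not installed.

```python
import importlib.util


def gpu_available():
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch

    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```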

## Basic Usage

The basic task of extracting keywords from a document with keyBoost can be done in a few lines of code:

In [13]:
from keyboost.keyBoost import *

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """



In [None]:
keyboost = KeyBoost('paraphrase-MiniLM-L6-v2') # selecting a generic SoTA model

In [None]:
keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=10,
                       keyphrases_ngram_max=2,
                       stopwords=stopwords,  # built in "Including custom stopwords" below; run that cell first
                       consensus='statistical',
                       models=['keybert','yake','textrank'])

In [6]:
keywords

0    input-output pairs.
1    Supervised learning
2    examples supervised
3          training data
4       machine learning
5      learning function
6     learning algorithm
7           output based
8                 unseen
9               examples
Name: Keyword, dtype: object
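The raw output can contain small blemishes, such as the trailing period in `input-output pairs.`. A light post-processing pass can tidy this up. The following is a minimal sketch, not part of keyBoost: `tidy_keywords` is a hypothetical helper that strips surrounding punctuation and drops case-insensitive duplicates while preserving order.

```python
import string


def tidy_keywords(keywords):
    """Strip surrounding punctuation and drop case-insensitive duplicates."""
    seen = set()
    tidy = []
    for keyword in keywords:
        cleaned = keyword.strip().strip(string.punctuation).strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            tidy.append(cleaned)
    return tidy


raw = ["input-output pairs.", "Supervised learning", "supervised learning", "unseen"]
print(tidy_keywords(raw))
```

Since the output above appears to be a pandas Series, something like `tidy_keywords(keywords.tolist())` should work on the real result.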

### Tuning the length of the keyphrases

To control keyphrase length, set `keyphrases_ngram_max` to the upper bound of your liking. For example, if we want keyphrases of up to 3 words:

In [None]:
keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=10,
                       keyphrases_ngram_max=3,
                       consensus='statistical',
                       models=['keybert','yake','textrank'])

In [None]:
keywords
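To build intuition for what an n-gram upper bound means, the sketch below enumerates every candidate phrase of 1 to `ngram_max` words from a short text. It is an illustration only: `candidate_ngrams` is a hypothetical helper with naive whitespace tokenization, and it is not how keyBoost or its underlying models generate candidates.

```python
def candidate_ngrams(text, ngram_max):
    """List all contiguous word sequences of length 1..ngram_max."""
    tokens = text.lower().split()
    candidates = []
    for n in range(1, ngram_max + 1):
        for start in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[start:start + n]))
    return candidates


print(candidate_ngrams("supervised learning algorithm", 2))
print(candidate_ngrams("supervised learning algorithm", 3))
```

Raising the bound from 2 to 3 only adds longer candidates; shorter ones remain eligible, which is why the parameter is a maximum rather than a fixed length.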

### Tuning the number of extracted keywords

This time, set `n_top` to the number of keywords you're looking for. Here it is changed to 15:

In [None]:
keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=15,
                       keyphrases_ngram_max=3,
                       consensus='statistical',
                       models=['keybert','yake','textrank'])

In [None]:
keywords

### Including custom stopwords

For better results, it is good practice to always include stopwords. A generic list is usually sufficient, but a curated list targeted to your needs will make the extractions more relevant.

Here, we use the stopwords that spaCy provides for the English language:

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

stopwords = nlp.Defaults.stop_words

In [None]:
keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=10,
                       keyphrases_ngram_max=2,
                       stopwords=stopwords,
                       consensus='statistical',
                       models=['keybert','yake','textrank'])

In [None]:
keywords
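A curated list can be built by extending a generic one. The sketch below uses a small hard-coded set as a stand-in for `nlp.Defaults.stop_words` (which is a Python set) so that it runs without spaCy; the domain-specific words are hypothetical examples chosen for this document.

```python
# Stand-in for spaCy's nlp.Defaults.stop_words, kept tiny for illustration.
generic_stopwords = {"the", "a", "an", "of", "to", "is", "in", "for"}

# Hypothetical domain-specific additions: words too generic for this corpus.
domain_stopwords = {"example", "examples"}

# Set union leaves the generic set untouched and removes duplicates.
stopwords = generic_stopwords | domain_stopwords

print(sorted(stopwords))
```

The resulting `stopwords` set can then be passed to `extract_keywords` in place of the generic list.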

### Language support

Set the `language` parameter to the [ISO 639-1 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language of the document you want to extract keywords from.
However, your specific language may not be supported by the underlying keyword extraction models or by the embedding model. The languages most commonly used in the keyword extraction literature should be fine.
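Language tags often arrive in other shapes, such as `en_US` or `FR`. The sketch below reduces such a tag to a two-letter lowercase code before it is passed as `language`. `normalize_language_code` is a hypothetical helper; it only checks the shape of the code and cannot tell whether keyBoost's models support that language.

```python
def normalize_language_code(tag):
    """Reduce a tag such as 'en_US' or 'FR' to a two-letter lowercase code."""
    primary = tag.strip().replace("_", "-").split("-")[0].lower()
    if len(primary) != 2 or not (primary.isascii() and primary.isalpha()):
        raise ValueError(f"{tag!r} does not look like an ISO 639-1 code")
    return primary


print(normalize_language_code("en_US"))
print(normalize_language_code("FR"))
```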

### Complete example code

The following cell shows the complete code for a typical keyword extraction using the most common parameters.

In [None]:
import spacy
from keyboost.keyBoost import *

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """


nlp = spacy.load('en_core_web_sm')

stopwords = nlp.Defaults.stop_words


keyboost = KeyBoost('paraphrase-MiniLM-L6-v2')


keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=10,
                       keyphrases_ngram_max=2,
                       stopwords=stopwords,
                       consensus='statistical',
                       models=['keybert','yake','textrank'])

## Underlying models

By default, it is recommended to use all 3 models currently supported by keyBoost (KeyBERT, YAKE! and TextRank).

If you have reason to use only a subset of these models, simply pass that sublist as the value of `models`.

For instance, to use only YAKE! and TextRank:

In [None]:
selected_models = ['yake','textrank']


keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=10,
                       keyphrases_ngram_max=2,
                       stopwords=stopwords,
                       consensus='statistical',
                       models=selected_models)

In [None]:
keywords
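A typo in a model name is easy to make, so it can be worth validating the list before calling `extract_keywords`. The following is a minimal sketch: `check_models` is a hypothetical helper, and the supported names are simply the three used throughout this tutorial.

```python
# Model names as they appear in this tutorial.
SUPPORTED_MODELS = ("keybert", "yake", "textrank")


def check_models(selected):
    """Return the selection unchanged, or raise if it is empty or has unknown names."""
    if not selected:
        raise ValueError("select at least one model")
    unknown = [name for name in selected if name not in SUPPORTED_MODELS]
    if unknown:
        raise ValueError(f"unsupported model(s): {unknown}")
    return list(selected)


print(check_models(["yake", "textrank"]))
```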

## Consensus type

keyBoost provides two types of consensus to consolidate the predictions of the different underlying models: *statistical consensus* and *ranking consensus*.

By default, it is recommended to use the *statistical consensus*, as it tends to give the best results most of the time. However, if you want to try the *ranking consensus* anyway, set the `consensus` parameter to `'rank'`:

In [None]:
keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=10,
                       keyphrases_ngram_max=2,
                       stopwords=stopwords,
                       consensus='rank',
                       models=['keybert','yake','textrank'])

In [None]:
keywords
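This tutorial does not detail how keyBoost computes either consensus. To give a feel for what a rank-based consensus does in general, the sketch below merges several ranked lists with a Borda count. This is a stand-in technique chosen for illustration and is not necessarily what keyBoost implements; `borda_consensus` and the sample rankings are hypothetical.

```python
from collections import defaultdict


def borda_consensus(rankings, n_top):
    """Merge ranked lists: each list awards more points to higher-ranked items."""
    points = defaultdict(int)
    for ranking in rankings:
        for position, keyword in enumerate(ranking):
            points[keyword] += len(ranking) - position
    # Sort by descending points, breaking ties alphabetically for determinism.
    ordered = sorted(points, key=lambda keyword: (-points[keyword], keyword))
    return ordered[:n_top]


rankings = [
    ["training data", "machine learning", "unseen"],
    ["machine learning", "training data", "examples"],
    ["machine learning", "unseen", "training data"],
]
print(borda_consensus(rankings, n_top=2))
```

Only the order within each list matters here, which matches the observation later in this tutorial that a ranking consensus yields no (keyword, score) pairs.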

#### Additional information when using the *statistical consensus*

In [None]:
keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=10,
                       keyphrases_ngram_max=2,
                       stopwords=stopwords,
                       consensus='statistical',
                       models=['keybert','yake','textrank'])

Due to its nature, the *statistical consensus* may sometimes fall back to the *ranking consensus* if the distributions of the scores output by the underlying models do not meet certain requirements.
You can quickly check the *completeness* of the statistical consensus procedure (meaning no fallback to the *ranking consensus*) by typing:

In [7]:
keyboost.is_statistical_consensus_completed

True

Finally, if the *statistical consensus* is complete and you want to check the respective scores of the output keywords, you can simply type:

In [8]:
keyboost.statistical_consensus_scores

| Unnamed: 0 | Keyword | Score |
|---|---|---|
| 0 | input-output pairs. | 5.912583 |
| 1 | Supervised learning | 5.189542 |
| 2 | examples supervised | 4.408128 |
| 3 | training data | 3.995092 |
| 4 | machine learning | 3.82766 |
| 5 | learning function | 3.625091 |
| 6 | learning algorithm | 3.468484 |
| 7 | output based | 3.458874 |
| 8 | unseen | 3.384235 |
| 9 | examples | 3.107092 |


Note: due to its nature, the *ranking consensus* does not output any (keyword, score) pairing.
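The two attributes above can be combined so that scores are read only when they are meaningful. The sketch below uses the attribute names from this section; `scores_or_none` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a real `KeyBoost` instance so that the snippet runs without the package.

```python
from types import SimpleNamespace


def scores_or_none(kb):
    """Return the consensus scores, or None if a fallback to ranking occurred."""
    if kb.is_statistical_consensus_completed:
        return kb.statistical_consensus_scores
    return None


# Stand-in objects mimicking the two possible outcomes.
completed = SimpleNamespace(
    is_statistical_consensus_completed=True,
    statistical_consensus_scores=[("training data", 3.995092)],
)
fell_back = SimpleNamespace(is_statistical_consensus_completed=False)

print(scores_or_none(completed))
print(scores_or_none(fell_back))
```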

## Embedding Models

Embeddings are part of the consensus procedure as well as a central element of KeyBERT keyword extraction.

Although we recommend paraphrase-MiniLM-L6-v2, a state-of-the-art, go-to embedding model, you can specify any sentence-transformers model name available [here](https://www.sbert.net/docs/pretrained_models.html) and use it for both the consensus and KeyBERT.

For instance, with the stsb-distilroberta-base-v2 pretrained model (if it's the first time you choose this model, it will be downloaded automatically):

In [12]:
keyboost = KeyBoost('stsb-distilroberta-base-v2')


keywords = keyboost.extract_keywords(text=doc,
                       language='en',
                       n_top=10,
                       keyphrases_ngram_max=2,
                       stopwords=stopwords,
                       consensus='statistical',
                       models=['keybert','yake','textrank'])

In [11]:
keywords

0    input-output pairs.
1    Supervised learning
2          training data
3       learning machine
4               learning
5           output based
6                 unseen
7          example input
8    examples supervised
9       labeled training
Name: Keyword, dtype: object
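To see how much two runs agree, their outputs can be compared as sets. The sketch below copies the keyword list shown under Basic Usage and the list shown in this section, then reports what they share; the variable names are hypothetical, and the comparison is a simple set overlap rather than anything keyBoost provides.

```python
basic_usage_keywords = [
    "input-output pairs.", "Supervised learning", "examples supervised",
    "training data", "machine learning", "learning function",
    "learning algorithm", "output based", "unseen", "examples",
]
this_section_keywords = [
    "input-output pairs.", "Supervised learning", "training data",
    "learning machine", "learning", "output based", "unseen",
    "example input", "examples supervised", "labeled training",
]

first, second = set(basic_usage_keywords), set(this_section_keywords)
shared = first & second
jaccard = len(shared) / len(first | second)  # overlap relative to the union

print("shared:", sorted(shared))
print("only in first list:", sorted(first - second))
print("only in second list:", sorted(second - first))
print("Jaccard overlap:", round(jaccard, 2))
```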