# 1) KeyBERT (scratch)

There are many powerful techniques for keyword extraction (e.g. RAKE, YAKE!, TF-IDF). However, they rely mainly on the statistical properties of the text and do not necessarily take the semantics of the full document into account.

[Reference](https://towardsdatascience.com/how-to-extract-relevant-keywords-with-keybert-6e7b3cf889ae)

In [None]:
!pip install sentence-transformers keybert keyphrase-vectorizers
!pip install textacy==0.11.0 -qqqq
!python -m spacy download en_core_web_sm -qqq

In [None]:
import numpy as np
import itertools
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
import warnings
warnings.filterwarnings("ignore")

In [None]:
doc = """
         Supervised learning is the machine learning task of
         learning a function that maps an input to an output based
         on example input-output pairs.[1] It infers a function
         from labeled training data consisting of a set of
         training examples.[2] In supervised learning, each
         example is a pair consisting of an input object
         (typically a vector) and a desired output value (also
         called the supervisory signal). A supervised learning
         algorithm analyzes the training data and produces an
         inferred function, which can be used for mapping new
         examples. An optimal scenario will allow for the algorithm
         to correctly determine the class labels for unseen
         instances. This requires the learning algorithm to
         generalize from the training data to unseen situations
         in a 'reasonable' way (see inductive bias).
      """

In [None]:
# Extract candidate words/phrases
# Use CountVectorizer to drop English stop words and set the n-gram range
n_gram_range = (1, 1)
stop_words = "english"

count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
candidates = count.get_feature_names_out()
print(candidates, '\n')

# embedding with distilbert
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

print(doc_embedding.shape)
print(candidate_embeddings.shape)
print("\n")

# Find the top_n candidates most similar to the document
# (argsort is ascending, so the best match comes last)
top_n = 5
distances = cosine_similarity(doc_embedding, candidate_embeddings)
keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]
print(keywords, '\n')

['algorithm' 'allow' 'analyzes' 'based' 'bias' 'called' 'class'
 'consisting' 'correctly' 'data' 'desired' 'determine' 'example'
 'examples' 'function' 'generalize' 'inductive' 'inferred' 'infers'
 'input' 'instances' 'labeled' 'labels' 'learning' 'machine' 'mapping'
 'maps' 'new' 'object' 'optimal' 'output' 'pair' 'pairs' 'produces'
 'reasonable' 'requires' 'scenario' 'set' 'signal' 'situations'
 'supervised' 'supervisory' 'task' 'training' 'typically' 'unseen' 'used'
 'value' 'vector' 'way'] 

(1, 768)
(50, 768)

['mapping', 'class', 'training', 'algorithm', 'learning'] 
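
The ranking step above boils down to a cosine similarity between one document vector and each candidate vector, followed by a sort. Below is a minimal NumPy sketch of that idea on made-up 3-dimensional vectors. The vectors, the candidate names, and the helper `top_n_by_cosine` are illustrative assumptions, not part of the notebook. Unlike `argsort()[0][-top_n:]`, which lists the best match last, this helper returns the best match first.

```python
import numpy as np

def top_n_by_cosine(doc_vec, cand_vecs, names, top_n=2):
    """Return (name, similarity) pairs, most similar to doc_vec first."""
    # Normalize to unit length so the dot product equals the cosine similarity.
    doc_unit = doc_vec / np.linalg.norm(doc_vec)
    cand_unit = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = cand_unit @ doc_unit
    # argsort is ascending, so reverse it before taking the first top_n.
    order = np.argsort(sims)[::-1][:top_n]
    return [(names[i], float(sims[i])) for i in order]

# Toy data (hypothetical): one "document" vector and three candidates.
doc_vec = np.array([1.0, 0.0, 0.0])
cand_vecs = np.array([
    [0.0, 1.0, 0.0],  # orthogonal to the document
    [1.0, 0.0, 0.0],  # same direction as the document
    [1.0, 1.0, 0.0],  # 45 degrees away from the document
])
names = ["way", "learning", "training"]

ranked = top_n_by_cosine(doc_vec, cand_vecs, names, top_n=2)
print(ranked)
```

With real sentence embeddings the same function applies unchanged: pass `doc_embedding[0]` and `candidate_embeddings` in place of the toy arrays.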



In [None]:
# Some top words are near-duplicates: keep similarity(doc, word) high while minimizing similarity(word, word)
nr_candidates = 10
distances_candidates = cosine_similarity(candidate_embeddings, candidate_embeddings)
words_idx = list(distances.argsort()[0][-nr_candidates:])
words_vals = [candidates[index] for index in words_idx]
distances_candidates = distances_candidates[np.ix_(words_idx, words_idx)]
print(words_vals)
print(distances_candidates.shape)

# Calculate the combination of words that are the least similar to each other to ensure diversity
min_sim = np.inf
candidate = None
for combination in itertools.combinations(range(len(words_idx)), top_n):
    sim = sum([distances_candidates[i][j] for i in combination for j in combination if i != j])
    if sim < min_sim:
        candidate = combination
        min_sim = sim

result = [words_vals[idx] for idx in candidate]
print("After Minimizing Distance", result)

['maps', 'task', 'input', 'analyzes', 'supervised', 'mapping', 'class', 'training', 'algorithm', 'learning']
(10, 10)
After Minimizing Distance ['maps', 'input', 'class', 'training', 'algorithm']
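
The cell above can be condensed into a small reusable function. The sketch below is one possible way to write it, assuming a precomputed symmetric similarity matrix. The name `least_similar_subset`, the toy matrix, and the word list are hypothetical. It counts each pair once via `itertools.combinations`, whereas the cell above counts every pair twice, which does not change which subset wins.

```python
import itertools
import numpy as np

def least_similar_subset(sim, k):
    """Brute-force the k indices whose pairwise similarities sum to the minimum."""
    best, best_score = None, np.inf
    for combo in itertools.combinations(range(sim.shape[0]), k):
        # Sum the similarity of every unordered pair inside this combination.
        score = sum(sim[i, j] for i, j in itertools.combinations(combo, 2))
        if score < best_score:
            best, best_score = combo, score
    return best, best_score

# Toy symmetric similarity matrix for four candidates (made-up values).
words = ["maps", "mapping", "class", "training"]
sim = np.array([
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.4, 0.5],
    [0.2, 0.4, 1.0, 0.6],
    [0.3, 0.5, 0.6, 1.0],
])

subset, score = least_similar_subset(sim, k=2)
print([words[i] for i in subset], score)
```

The search is exhaustive, so its cost grows combinatorially: choosing 5 out of 10 candidates means 252 combinations, which is fine here but becomes slow for large candidate pools.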


In [None]:
# Maximal Marginal Relevance tries to minimize redundancy and maximize the diversity
# of results in text summarization tasks

# Extract similarity within words, and between words and the document
word_doc_similarity = cosine_similarity(candidate_embeddings, doc_embedding)
word_similarity = cosine_similarity(candidate_embeddings)
print(word_doc_similarity.shape)
print(word_similarity.shape)

# Initialize candidates and pick the best keyword/keyphrase first
keywords_idx = [np.argmax(word_doc_similarity)]
candidates_idx = [i for i in range(len(candidates)) if i != keywords_idx[0]]
print(keywords_idx)

# Trade off each candidate's similarity to the document against its
# similarity to the keywords/phrases already selected

diversity = 0.7
for _ in range(top_n - 1):
    candidate_similarities = word_doc_similarity[candidates_idx, :]
    target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)

    # Calculate MMR
    mmr = (1 - diversity) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
    mmr_idx = candidates_idx[np.argmax(mmr)]

    # Update keywords & candidates
    keywords_idx.append(mmr_idx)
    candidates_idx.remove(mmr_idx)

print([candidates[idx] for idx in keywords_idx])

(50, 1)
(50, 50)
[23]
['learning', 'maps', 'function', 'algorithm', 'new']
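
To see the effect of the `diversity` parameter in isolation, the greedy MMR loop can be wrapped in a function and run on a tiny hand-made example. This is a sketch under stated assumptions: the function name `mmr`, the similarity values, and the word list are invented for illustration. Each step scores a remaining candidate as `(1 - diversity) * relevance - diversity * redundancy`, the same formula as in the cell above.

```python
import numpy as np

def mmr(doc_sim, cand_sim, top_n=3, diversity=0.7):
    """Greedy Maximal Marginal Relevance.

    doc_sim:  shape (n,), similarity of each candidate to the document.
    cand_sim: shape (n, n), similarity between candidates.
    Returns the selected candidate indices in selection order.
    """
    # Start with the candidate that is most similar to the document.
    selected = [int(np.argmax(doc_sim))]
    remaining = [i for i in range(len(doc_sim)) if i != selected[0]]
    while len(selected) < top_n and remaining:
        relevance = doc_sim[remaining]
        # Redundancy: highest similarity to any keyword picked so far.
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * relevance - diversity * redundancy
        pick = remaining[int(np.argmax(scores))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Toy data (made up): "learn" is nearly a duplicate of "learning".
words = ["learning", "learn", "algorithm", "vector"]
doc_sim = np.array([0.9, 0.85, 0.6, 0.3])
cand_sim = np.array([
    [1.0, 0.95, 0.4, 0.1],
    [0.95, 1.0, 0.4, 0.1],
    [0.4, 0.4, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
])

diverse = mmr(doc_sim, cand_sim, top_n=3, diversity=0.7)
greedy = mmr(doc_sim, cand_sim, top_n=3, diversity=0.0)
print([words[i] for i in diverse])
print([words[i] for i in greedy])
```

With `diversity=0.7` the near-duplicate "learn" is skipped in favor of less similar words, while `diversity=0.0` reduces to plain ranking by document similarity.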


# 2) KeyBERT Package with Keyphrase Vectorizer

First, the document texts are annotated with spaCy part-of-speech tags.

Second, keyphrases are extracted from the document texts whose part-of-speech tags match a predefined regex pattern.

By default, the vectorizers extract keyphrases that consist of zero or more adjectives followed by one or more nouns, using the English spaCy part-of-speech tags. Finally, the vectorizers calculate document-keyphrase matrices. Apart from the matrices, the package can also return the keyphrases extracted via the part-of-speech pattern.
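
The "zero or more adjectives, then one or more nouns" rule can be illustrated without spaCy. The toy sketch below works on hand-tagged tokens. The tags are assigned manually for this example, and the helper `extract_noun_phrases` is hypothetical and not how the package is implemented. It encodes each tag as a single character and scans the resulting string with a regular expression.

```python
import re

def extract_noun_phrases(tagged_tokens):
    """Return phrases matching: zero or more adjectives, then one or more nouns."""
    # Encode each token's tag as one character so a plain regex can scan the sequence.
    code = {"ADJ": "J", "NOUN": "N"}
    tag_string = "".join(code.get(tag, "x") for _, tag in tagged_tokens)
    phrases = []
    for match in re.finditer(r"J*N+", tag_string):
        # One character per token, so match offsets are token offsets.
        tokens = tagged_tokens[match.start():match.end()]
        phrases.append(" ".join(word for word, _ in tokens).lower())
    return phrases

# Hand-tagged tokens (assumed tags, for illustration only).
tagged = [
    ("A", "DET"), ("supervised", "ADJ"), ("learning", "NOUN"),
    ("algorithm", "NOUN"), ("analyzes", "VERB"), ("the", "DET"),
    ("training", "NOUN"), ("data", "NOUN"),
]

phrases = extract_noun_phrases(tagged)
print(phrases)
```

Because the pattern is matched over tags rather than fixed n-gram windows, phrase length is not set in advance, which is why the vectorizer needs no n-gram range.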

In [None]:
# KeyphraseCountVectorizer: no n-gram range required
count = KeyphraseCountVectorizer(stop_words=stop_words).fit([doc])
print(count.get_feature_names_out())

['learning' 'supervised learning' 'training data' 'function' 'algorithm'
 'machine' 'task' 'vector' 'output pairs.[1' 'input object'
 'training examples.[2' 'unseen situations' 'example' 'optimal scenario'
 'supervisory signal' 'set' 'input' 'output value' 'example input'
 'instances' 'pair' 'way' 'class labels' 'inductive bias' 'output'
 'learning algorithm' 'examples']


In [None]:
docs = ["""Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).""",

        """Keywords are defined as phrases that capture the main topics discussed in a document.
        As they offer a brief yet precise summary of document content, they can be utilized for various applications.
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list
        of keywords can quickly help to determine whether a given document is relevant to their interest.
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
        in information retrieval."""]

# KeyBERT
sentence_model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens", device="cuda")
kw_model = KeyBERT(model=sentence_model)
result = kw_model.extract_keywords(docs=docs, keyphrase_ngram_range=(1, 2),
                                    stop_words='english', use_mmr=True, diversity=0.5)
print(result)

# KeyBERT with the keyphrase vectorizer
kw_model = KeyBERT()
result = kw_model.extract_keywords(docs=docs, vectorizer=KeyphraseCountVectorizer())
print(result)

[[('machine learning', 0.6824), ('task learning', 0.7043), ('learning function', 0.7053), ('learning machine', 0.7085), ('analyzes training', 0.7145)], [('overlap keywords', 0.5512), ('keywords assigned', 0.5652), ('keywords reflect', 0.5749), ('list keywords', 0.5794), ('keywords defined', 0.5807)]]


[[('learning', 0.4813), ('training data', 0.5271), ('learning algorithm', 0.5632), ('supervised learning', 0.6779), ('supervised learning algorithm', 0.6992)], [('document content', 0.3988), ('information retrieval environment', 0.5166), ('information retrieval', 0.5792), ('keywords', 0.6046), ('document relevance', 0.633)]]





# 3) Textacy

[Reference](https://towardsdatascience.com/keyword-extraction-python-tf-idf-textrank-topicrank-yake-bert-7405d51cd839)

In [None]:
import textacy
from textacy.extract.keyterms import yake, scake, sgrank, textrank

docs = ["""Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).""",

        """Keywords are defined as phrases that capture the main topics discussed in a document.
        As they offer a brief yet precise summary of document content, they can be utilized for various applications.
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list
        of keywords can quickly help to determine whether a given document is relevant to their interest.
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
        in information retrieval."""]

doc = textacy.make_spacy_doc(docs[0], "en_core_web_sm")
print(yake(doc, normalize='lemma', ngrams=(1, 2)), '\n')
print(scake(doc, normalize='lemma'), '\n')
print(sgrank(doc, normalize='lemma'), '\n')
print(textrank(doc, normalize='lemma'), '\n')

[('example', 0.3634354740148829), ('learning', 0.3705974051722146), ('training', 0.3718801713053885), ('function', 0.40704093462685076), ('input', 0.4170577044400542), ('output', 0.4199299663859866), ('algorithm', 0.42832637489640263), ('datum', 0.42992518289069875), ('pair', 0.5118569748358457), ('supervised', 0.5181712859829104)] 

[('supervised learning algorithm', 1173.2634871773075), ('example input', 393.35091441034865), ('training example', 392.90889155484166), ('training datum', 246.50911289502568), ('new example', 230.7359879626814), ('function', 220.29545454545456), ('output pair', 205.37226127377158), ('input object', 172.43905801336618), ('output value', 149.74645744796777), ('machine', 53.34545454545455)] 

[('training datum', 0.25063587325424164), ('supervised learning algorithm', 0.09145589706469866), ('training example', 0.06583222607866272), ('input object', 0.05784336111980621), ('output value', 0.0494875530956109), ('supervisory signal', 0.04764306107307626), ('new e