<a href="https://colab.research.google.com/github/MohammadHeydari/DeepNLP/blob/master/BERT_Keywords_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Keyword Extraction with BERT

A minimal method for extracting keywords and keyphrases

Keyword extraction helps us understand the key information in specific documents.

Keyword extraction is the automated process of extracting the words and phrases that are most relevant to an input text.

***Statistical methods such as RAKE and YAKE typically rely on the statistical properties of a text rather than on semantic similarity.***

**BERT** is a **bi-directional transformer model that allows us to transform phrases and documents to vectors that capture their meaning.**

**KeyBERT** is **a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings.**

In [62]:
# Example document to extract keywords from

doc = """

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since


"""

We start by creating a list of candidate keywords/keyphrases using scikit-learn's CountVectorizer. This allows us to specify the length of the keywords and make them into keyphrases. It is also a nice method for quickly removing stop words.



In [63]:
pip install scikit-learn



In [64]:
from sklearn.feature_extraction.text import CountVectorizer

# With (3, 3), every candidate is a phrase of exactly three words.
n_gram_range = (3, 3)
stop_words = 'english'

count = CountVectorizer(ngram_range = n_gram_range, stop_words = stop_words).fit([doc])

# candidates is a list of strings holding our candidate keywords/keyphrases.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
candidates = count.get_feature_names_out().tolist()
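To make the candidate step more concrete, here is a minimal plain-Python sketch of roughly what happens to the text: lowercase it, drop stop words, then slide a window of n words over the remaining tokens. The `ngram_candidates` helper and the tiny `STOP_WORDS` set are hypothetical names introduced only for illustration; scikit-learn's built-in English list is much longer and its implementation differs in detail.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# scikit-learn's built-in 'english' list is much longer.
STOP_WORDS = {"is", "an", "for", "the", "a", "of", "and", "with"}

def ngram_candidates(text, n_min, n_max, stop_words=STOP_WORDS):
    """Return the sorted, unique word n-grams of `text` after stop-word removal."""
    # Lowercase and keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Stop words are dropped before the n-grams are formed.
    tokens = [t for t in tokens if t not in stop_words]
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return sorted(grams)

sample = "Apache Spark is an open-source unified analytics engine"
print(ngram_candidates(sample, 1, 1))  # single keywords
print(ngram_candidates(sample, 3, 3))  # three-word keyphrases
```

Because stop words are removed before the window slides, a phrase such as 'spark open source' can appear as a candidate even though "is an" sits between those words in the original sentence.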

# **Embeddings**

Next, we convert both the document and the candidate keywords/keyphrases to numerical data. We use BERT for this purpose, as it has shown great results for both similarity and paraphrasing tasks.

There are many methods for generating BERT embeddings, such as Flair, Hugging Face Transformers, and now even spaCy with its 3.0 release.

I prefer to use the sentence-transformers package, as it allows me to quickly create high-quality embeddings that work quite well at both the sentence and document level.

In [65]:
pip install sentence-transformers



In [66]:
#check CUDA version in Colab
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0


In [67]:
!python --version

Python 3.7.10


In [68]:
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [69]:
import torch

torch.cuda.get_device_name(0)

'Tesla P100-PCIE-16GB'

In [70]:
from sentence_transformers import SentenceTransformer

In [71]:
#transform our document and candidates into vectors

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

In [72]:
# We use DistilBERT because it has shown great performance in similarity tasks

**Since transformer models have a token limit, you might run into some errors when inputting large documents. In that case, you could consider splitting up your document into paragraphs and mean pooling (taking the average of) the resulting vectors.**
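The paragraph-splitting idea can be sketched as follows. This is only an illustration under stated assumptions: `mean_pool_paragraphs` is a hypothetical helper, paragraphs are assumed to be separated by blank lines, and `toy_encode` is a stand-in encoder so the example runs without downloading a model. In the notebook you would pass `model.encode` instead.

```python
def mean_pool_paragraphs(text, encode):
    """Embed each non-empty paragraph with `encode` and average the vectors."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        raise ValueError("text contains no non-empty paragraphs")
    vectors = encode(paragraphs)  # one vector per paragraph
    # Average each dimension across all paragraph vectors.
    return [
        sum(float(value) for value in column) / len(paragraphs)
        for column in zip(*vectors)
    ]

def toy_encode(paragraphs):
    """Hypothetical stand-in encoder: [character count, word count] per paragraph."""
    return [[float(len(p)), float(len(p.split()))] for p in paragraphs]

long_doc = "Spark is fast.\n\nIt runs on clusters of machines."
pooled = mean_pool_paragraphs(long_doc, toy_encode)
print(pooled)
```

With a real model, `mean_pool_paragraphs(doc, model.encode)` would return a single vector as a list; wrapping it as `[pooled]` gives the 2D shape that scikit-learn's `cosine_similarity` expects.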

**NOTE: There are many pre-trained BERT-based models that you can use for keyword extraction. However, I would advise you to use either distilbert-base-nli-stsb-mean-tokens or xlm-r-distilroberta-base-paraphrase-v1, as they have shown great performance in semantic similarity and paraphrase identification, respectively.**

Cosine Similarity
In the final step, we want to find the candidates that are most similar to the document.

We assume that the candidates most similar to the document are good keywords/keyphrases for representing it.

To calculate the similarity between the candidates and the document, we use the cosine similarity between vectors, as it performs quite well in high-dimensional spaces:

In [73]:
from sklearn.metrics.pairwise import cosine_similarity

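top_n = 5
# Despite the name, `distances` holds cosine similarities (higher means more similar).
distances = cosine_similarity(doc_embedding, candidate_embeddings)
# argsort is ascending, so the last top_n indices are the most similar candidates.
keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]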

In [74]:
keywords

['spark open source',
 'california berkeley amplab',
 'berkeley amplab spark',
 'amplab spark codebase',
 'donated apache software']
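For intuition, the ranking step can also be sketched in plain Python. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. The `cosine` and `top_n_by_cosine` helpers and the toy 2D vectors below are hypothetical and only illustrate the idea; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_by_cosine(doc_vec, cand_vecs, names, top_n):
    """Return the names of the top_n candidates, most similar first."""
    scored = sorted(
        zip(names, cand_vecs),
        key=lambda pair: cosine(doc_vec, pair[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_n]]

# Toy 2D "embeddings" (hypothetical values, not real model output).
doc_vec = [1.0, 1.0]
names = ["aligned", "orthogonal", "close"]
cand_vecs = [[2.0, 2.0], [1.0, -1.0], [1.0, 0.5]]

best = top_n_by_cosine(doc_vec, cand_vecs, names, top_n=2)
print(best)
```

Note that this sketch returns the most similar candidate first, whereas the `argsort` slice in the notebook cell returns the top candidates in ascending order, with the best match last.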