# KeyBERT or BERTopic?
Both libraries use BERT embeddings to describe documents with words, but their outputs differ: KeyBERT returns keywords per document, while BERTopic returns words per topic. MaartenGr, who created and maintains both libraries on GitHub (and is based in Tilburg), explains the differences in this issue: https://coder.social/MaartenGr/KeyBERT/issues/60 . This section summarizes that explanation.

**About BERTopic:**
*Steps*
- Embedding documents
- Clustering documents
- Creating a topic representation
> The main output of BERTopic is a set of words per topic. Thus, multiple documents have the same topic representation.
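
The third step can be illustrated without any model. The sketch below assumes the clustering step has already assigned a topic label to each document (the toy documents, the labels, and the helper name `topic_words` are all made up for illustration). It scores words per topic with a simplified class-based TF-IDF, the kind of weighting BERTopic's documentation describes for topic representations. Treat it as a stand-in for the idea, not as BERTopic's implementation.

```python
import math
from collections import Counter, defaultdict

def topic_words(docs, labels, top_n=2):
    """Return the top_n highest-scoring words for each topic label."""
    # Merge all documents of a topic into one bag of words
    per_topic = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        per_topic[label].update(doc.split())

    # Word frequency across all topics, and average number of words per topic
    total = Counter()
    for counts in per_topic.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_topic)

    result = {}
    for label, counts in per_topic.items():
        # Frequent-in-topic words score high; words common everywhere are damped
        scores = {w: tf * math.log(1 + avg_words / total[w]) for w, tf in counts.items()}
        # Sort by score (descending), breaking ties alphabetically
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[label] = ranked[:top_n]
    return result

# Toy corpus with hand-assigned topic labels (assumption: clustering already done)
docs = [
    "oil prices rise as opec cuts oil output",
    "opec agrees to cut oil production",
    "wheat harvest hit by drought",
    "drought cuts wheat and corn harvest",
]
labels = [0, 0, 1, 1]
print(topic_words(docs, labels))
```

Both documents labeled `0` end up sharing one word list, which is the point made in the quote above: the representation belongs to the topic, not to the individual document.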

**About KeyBERT:**
*Steps*
- Embedding documents
- Creating candidate keywords
- Selecting the best keywords through Maximal Marginal Relevance (MMR), Max Sum Similarity, or plain cosine similarity
> The main output of KeyBERT is a set of words per document. Thus, each document is expected to have different keywords.
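
To make the selection step more concrete, here is a minimal sketch of Maximal Marginal Relevance on toy data. The 2-D vectors, the candidate words, and the function names are invented for illustration; KeyBERT itself works on sentence-transformer embeddings and its implementation differs in detail. The idea shown is only the trade-off: prefer candidates that are similar to the document, but penalize candidates that are similar to keywords already chosen.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mmr(doc_vec, candidates, top_n=2, diversity=0.5):
    """Pick top_n candidate words, trading relevance against redundancy."""
    relevance = {w: cosine(vec, doc_vec) for w, vec in candidates.items()}
    # Start with the candidate most similar to the document
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        def score(word):
            # Redundancy: similarity to the closest already-selected keyword
            redundancy = max(cosine(candidates[word], candidates[s]) for s in selected)
            return (1 - diversity) * relevance[word] - diversity * redundancy
        remaining = [w for w in candidates if w not in selected]
        selected.append(max(remaining, key=score))
    return selected

# Made-up embeddings: "trade" and "trading" are near-duplicates
doc_vec = [1.0, 1.0]
candidates = {
    "trade": [1.0, 0.9],
    "trading": [1.0, 0.8],
    "tariff": [0.3, 1.0],
}
print(mmr(doc_vec, candidates, diversity=0.0))  # pure relevance ranking
print(mmr(doc_vec, candidates, diversity=0.7))  # favors less redundant keywords
```

With `diversity=0.0` the two near-duplicates are returned together; raising `diversity` swaps the second one for a less similar word.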

**Difference in output:**
> BERTopic aims to cluster documents and create a broad representation of multiple documents whereas KeyBERT does not.

**Finally, a note on when to use each:**
> BERTopic, and in that sense most topic modeling techniques, are meant to explore the data to create an understanding of the perhaps millions of documents that you have collected. KeyBERT, in contrast, is not able to do this as it creates a completely different set of words per document. An example of using KeyBERT, and in that sense most keyword extraction algorithms, is automatically creating relevant keywords for content (blogs, articles, etc.) that businesses post on their website.

In conclusion, we use KeyBERT for extracting keywords, and BERTopic to cluster the documents into topics. This should suffice for part 2's inputs: "topics and keywords."

## Imports
Download the Reuters news corpus (English) through NLTK (`from nltk.corpus import reuters`) and import the libraries used in this notebook.

In [26]:
# Text processing
import nltk
nltk.download('reuters')
from nltk.corpus import reuters as source

# Keyword/topic extraction
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Data handling (pandas) and punctuation constants (string)
import pandas as pd
import string

[nltk_data] Downloading package reuters to
[nltk_data]     /home/musashishi/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [10]:
# Convert nltk corpus reuters to pandas
reuters = []
for fileid in source.fileids():
    tag, filename = fileid.split('/')
    reuters.append((filename, tag, source.raw(fileid)))

df = pd.DataFrame(reuters, columns=['filename', 'tag', 'text'])
df

| | filename | tag | text |
|---|---|---|---|
| 0 | 14826 | test | ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI... |
| 1 | 14828 | test | CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO... |
| 2 | 14829 | test | JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA... |
| 3 | 14832 | test | THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n ... |
| 4 | 14833 | test | INDONESIA SEES CPO PRICE RISING SHARPLY\n Ind... |
| ... | ... | ... | ... |
| 10783 | 999 | training | U.K. MONEY MARKET SHORTAGE FORECAST REVISED DO... |
| 10784 | 9992 | training | KNIGHT-RIDDER INC &lt;KRN> SETS QUARTERLY\n Q... |
| 10785 | 9993 | training | TECHNITROL INC &lt;TNL> SETS QUARTERLY\n Qtly... |
| 10786 | 9994 | training | NATIONWIDE CELLULAR SERVICE INC &lt;NCEL> 4TH ... |


## Convert the text into a usable form

In [14]:
# TODO check stuff from the other notebooks
df.iloc[0]['text'][:150]

"ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT\n  Mounting trade friction between the\n  U.S. And Japan has raised fears among many of Asia's exportin"

In [34]:
def cleanText(text: str) -> str:
    # Lowercase first so the replacements below also match upper-case headlines (e.g. "PCT")
    text = text.lower()

    # Replace control characters (newlines, tabs, ...) with spaces so adjacent words are not joined
    controls = ''.join(chr(code) for code in range(1, 32))
    text = text.translate(str.maketrans(controls, ' ' * len(controls)))

    # Drop possessives, expand "n't", and normalize Reuters-specific tokens
    text = text.replace("'s", " ")
    text = text.replace("n't", " not")
    text = text.replace("-", " ")
    text = text.replace("dlrs", " ")
    text = text.replace("&lt;", " ")
    text = text.replace("mln", "million")
    text = text.replace("pct", "percent")

    # Remove all remaining punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Collapse runs of whitespace into single spaces
    return ' '.join(text.split())
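
One caveat with chained `str.replace` calls is that they match substrings, so an abbreviation is also rewritten when it happens to occur inside a longer word. A possible alternative, sketched below, matches the abbreviations only as whole words using regular-expression word boundaries. The helper name `expand_abbreviations` and the sample sentence are hypothetical; the abbreviation list is taken from the cleaning function above.

```python
import re

# Abbreviations handled by the cleaning function; "dlrs" is simply dropped
ABBREVIATIONS = {"mln": "million", "pct": "percent", "dlrs": ""}
PATTERN = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b")

def expand_abbreviations(text: str) -> str:
    """Replace abbreviations only where they appear as whole words."""
    # Lowercase first so upper-case headline tokens such as "PCT" also match
    text = PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text.lower())
    # Collapse the extra whitespace left behind by dropped tokens
    return " ".join(text.split())

print(expand_abbreviations("Profit rose 12 PCT to 3.4 mln dlrs"))
```

Whether this matters for the Reuters corpus depends on how often these letter sequences occur inside other words, which has not been checked here.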

In [37]:
df['clean_text'] = df['text'].apply(cleanText)
df.iloc[0]['clean_text'][:150]

'asian exporters fear damage from us japan rift mounting trade friction between the us and japan has raised fears among many of asia exporting nations '

## Using KeyBERT to extract keywords from each document

In [None]:
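docs = df['clean_text'].tolist()  # assumption: KeyBERT runs on the cleaned texts, as a list of strings
kw_model = KeyBERT()
# Compute the embeddings once so later extract_keywords calls can reuse them
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=1, stop_words="english")
# min_df and stop_words must match the values passed to extract_embeddings
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
                                     doc_embeddings=doc_embeddings,
                                     word_embeddings=word_embeddings)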
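
The imports include `KeyphraseCountVectorizer`, which is not used yet. The sketch below shows one way it might be plugged into KeyBERT through the `vectorizer` argument of `extract_keywords`, plus a small helper that flattens the result into rows for a DataFrame. The function names `extract_keyphrases` and `keywords_to_rows` are hypothetical, the sample keywords and scores are invented to show the output shape, and the library imports are placed inside the function so the helper can be tried without the heavy dependencies. Because the vectorizer relies on part-of-speech tagging, it may work better on the raw `text` column than on the lowercased, punctuation-free `clean_text`; that has not been tested here.

```python
def extract_keyphrases(docs, top_n=5):
    """Run KeyBERT with a KeyphraseCountVectorizer (requires both packages)."""
    # Imported lazily: these pull in transformer and spaCy models
    from keybert import KeyBERT
    from keyphrase_vectorizers import KeyphraseCountVectorizer

    kw_model = KeyBERT()
    return kw_model.extract_keywords(docs=docs, vectorizer=KeyphraseCountVectorizer(), top_n=top_n)

def keywords_to_rows(filenames, keywords):
    """Flatten per-document (keyword, score) lists into (filename, keyword, score) rows."""
    return [
        (filename, keyword, score)
        for filename, doc_keywords in zip(filenames, keywords)
        for keyword, score in doc_keywords
    ]

# Invented sample in the shape KeyBERT returns for a list of documents
sample = [[("trade friction", 0.61), ("exporters", 0.48)], [("grain stocks", 0.57)]]
rows = keywords_to_rows(["14826", "14828"], sample)
print(rows)
```

In the notebook, the rows could then be loaded with `pd.DataFrame(rows, columns=['filename', 'keyword', 'score'])` to get one line per document-keyword pair.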


## Using BERTopic to extract topics from the corpus
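
This section has no code yet. As a starting point, the sketch below fits BERTopic with its default settings and groups document filenames by the topic they were assigned to. The function names `fit_topics` and `group_by_topic` are hypothetical, the topic assignments in the example are invented to show the output shape, and the `bertopic` import sits inside the function so the grouping helper can be tried on its own. Default parameters are an assumption; nothing here has been tuned for the Reuters corpus.

```python
from collections import defaultdict

def fit_topics(docs):
    """Fit BERTopic with default settings (requires the bertopic package)."""
    from bertopic import BERTopic  # imported lazily: heavy dependency

    topic_model = BERTopic()
    # fit_transform returns one topic id per document; -1 marks outliers
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics

def group_by_topic(filenames, topics):
    """Map each topic id to the filenames assigned to it."""
    groups = defaultdict(list)
    for filename, topic in zip(filenames, topics):
        groups[topic].append(filename)
    return dict(groups)

# Invented topic assignments, only to show the shape of the result
groups = group_by_topic(["14826", "14828", "14829", "14832"], [0, 1, -1, 0])
print(groups)
```

In the notebook this could be called as `topic_model, topics = fit_topics(df['clean_text'].tolist())`, after which `topic_model.get_topic_info()` gives an overview of the topics and `topic_model.get_topic(0)` the words of a single topic.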