# KeyBERT or BERTopic?
Both libraries use BERT embeddings to extract keywords from the documents in a dataset, but they produce different outputs. MaartenGr, who created both libraries, maintains them on GitHub, and is based in Tilburg, explains the differences in this thread: https://coder.social/MaartenGr/KeyBERT/issues/60 . This block summarizes that thread.

**About BERTopic:**
*Steps*
- Embedding documents
- Clustering documents
- Creating a topic representation.
> The main output of BERTopic is a set of words per topic. Thus, multiple documents have the same topic representation.
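
The third step can be illustrated with a small standard-library sketch of class-based TF-IDF (c-TF-IDF), the weighting BERTopic uses by default to build topic representations. This is a simplified toy, not BERTopic's implementation: the cluster labels are assumed to be given (in BERTopic they come from clustering the document embeddings), and the function name and sample documents are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def topic_words(docs, labels, top_n=3):
    """Return the top_n words per cluster, scored with a simplified c-TF-IDF."""
    # Merge all documents of a cluster into one bag of words
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.split())

    # Frequency of each word across all clusters
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # Average number of words per cluster
    avg_words = sum(total_counts.values()) / len(class_counts)

    topics = {}
    for label, counts in class_counts.items():
        scores = {
            word: tf * math.log(1 + avg_words / total_counts[word])
            for word, tf in counts.items()
        }
        # Highest score first; ties are broken alphabetically
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        topics[label] = ranked[:top_n]
    return topics

docs = [
    "oil prices rise as opec cuts oil output",
    "crude oil output falls in opec states",
    "wheat harvest grows as grain exports rise",
    "grain stocks and wheat exports fall",
]
labels = [0, 0, 1, 1]  # assumed cluster assignments, one per document
topics = topic_words(docs, labels)
for doc, label in zip(docs, labels):
    print(label, topics[label], "<-", doc)
```

Both documents in a cluster print the same word list, which is the point of the quote above: the representation belongs to the topic, not to the individual document.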

**About KeyBERT:**
*Steps*
- Embedding documents
- Creating candidate keywords
- Calculating best keywords through either MMR, Max Sum Similarity, or Cosine Similarity
> The main output of KeyBERT is a set of words per document. Thus, each document is expected to have different keywords.
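
The steps above can be sketched with the standard library alone. This is a toy under stated assumptions, not KeyBERT's code: a bag-of-words count vector stands in for the BERT embedding, the candidate phrases are supplied by hand, and the function names are made up. It shows plain cosine-similarity ranking and a basic form of Maximal Marginal Relevance (MMR).

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a BERT embedding: a bag-of-words count vector."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_keywords(doc, candidates, top_n=2):
    """Rank candidate phrases by cosine similarity to the document."""
    doc_vec = embed(doc)
    scored = [(c, round(cosine(embed(c), doc_vec), 4)) for c in candidates]
    return sorted(scored, key=lambda pair: -pair[1])[:top_n]

def mmr(doc, candidates, top_n=2, diversity=0.5):
    """Pick phrases that are relevant to the document but not to each other."""
    doc_vec = embed(doc)
    vecs = {c: embed(c) for c in candidates}
    relevance = {c: cosine(vecs[c], doc_vec) for c in candidates}
    selected = [max(candidates, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        def score(c):
            redundancy = max(cosine(vecs[c], vecs[s]) for s in selected)
            return (1 - diversity) * relevance[c] - diversity * redundancy
        remaining = [c for c in candidates if c not in selected]
        selected.append(max(remaining, key=score))
    return selected

doc = "thai trade deficit widens as export growth slows"
candidates = ["thai trade deficit", "trade deficit", "export growth", "weather"]
print(top_keywords(doc, candidates))
print(mmr(doc, candidates))
```

With plain similarity ranking, the near-duplicate phrases "thai trade deficit" and "trade deficit" both make the cut; MMR penalizes the overlap and selects "export growth" as the second keyword instead.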

**Difference in output:**
> BERTopic aims to cluster documents and create a broad representation of multiple documents whereas KeyBERT does not.

**Finally, a note on when to use each:**
> BERTopic, and in that sense most topic modeling techniques, are meant to explore the data to create an understanding of the perhaps millions of documents that you have collected. KeyBERT, in contrast, is not able to do this as it creates a completely different set of words per document. An example of using KeyBERT, and in that sense most keyword extraction algorithms, is automatically creating relevant keywords for content (blogs, articles, etc.) that businesses post on their website.

In conclusion, we use KeyBERT for extracting keywords, and BERTopic to cluster the documents into topics. This should suffice for part 2's inputs: "topics and keywords."

## Imports
Download the Reuters news-article dataset (English) through NLTK (from nltk.corpus import reuters).

In [1]:
# Text processing
import nltk
nltk.download('reuters')
from nltk.corpus import reuters as source

# Keyword/topic extraction
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

# For editing the shape of structures
import pandas as pd
import string

[nltk_data] Downloading package reuters to
[nltk_data]     /home/musashishi/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [2]:
# Convert nltk corpus reuters to pandas
reuters = []
for fileid in source.fileids():
    tag, filename = fileid.split('/')
    reuters.append((filename, tag, source.raw(fileid)))

df = pd.DataFrame(reuters, columns=['filename', 'tag', 'text'])
df

index,filename,tag,text
0,14826,test,ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...
1,14828,test,CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...
2,14829,test,JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...
3,14832,test,THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n ...
4,14833,test,INDONESIA SEES CPO PRICE RISING SHARPLY\n Ind...
...,...,...,...
10783,999,training,U.K. MONEY MARKET SHORTAGE FORECAST REVISED DO...
10784,9992,training,KNIGHT-RIDDER INC &lt;KRN> SETS QUARTERLY\n Q...
10785,9993,training,TECHNITROL INC &lt;TNL> SETS QUARTERLY\n Qtly...
10786,9994,training,NATIONWIDE CELLULAR SERVICE INC &lt;NCEL> 4TH ...


## Convert text to something usable

In [3]:
# TODO check stuff from the other notebooks
df.iloc[0]['text'][:150]

"ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT\n  Mounting trade friction between the\n  U.S. And Japan has raised fears among many of Asia's exportin"

In [4]:
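def cleanText(text: str) -> str:
    # Remove ASCII control characters (codes 1-31), including newlines and tabs
    escapes = ''.join([chr(char) for char in range(1, 32)])
    translator = str.maketrans('', '', escapes)
    text = text.translate(translator)

    # Drop possessive "'s", expand "n't", split hyphenated words,
    # and remove the "&lt;" entity that precedes ticker symbols
    text = text.replace("'s", " ")
    text = text.replace("n't", " not")
    text = text.replace("-", " ")
    text = text.replace("&lt;", " ")

    # Expand common abbreviations.
    # NOTE: these replacements are case-sensitive and run before lower(),
    # so upper-case forms such as "PCT" in headlines are left unchanged.
    text = text.replace("dlrs", "dollars")
    text = text.replace("mln", "million")
    text = text.replace("pct", "percent")

    # Remove punctuation, lower-case, and collapse runs of whitespace
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    text = ' '.join(text.split())
    return text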

In [5]:
df['clean_text'] = df['text'].apply(cleanText)
df.iloc[0]['clean_text'][:150]

'asian exporters fear damage from us japan rift mounting trade friction between the us and japan has raised fears among many of asia exporting nations '
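
Abbreviations written in upper case, as in the headlines, are not expanded by cleanText because its replacements are case-sensitive and run before lower-casing. A possible alternative, sketched here with a hypothetical helper name and abbreviation table, is a case-insensitive regular expression with word boundaries, which also avoids rewriting an abbreviation that happens to occur inside a longer word.

```python
import re

# Hypothetical abbreviation table; extend as needed
ABBREVIATIONS = {"dlrs": "dollars", "mln": "million", "pct": "percent"}

# One pattern that matches any abbreviation as a whole word, in any case
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)

def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations with their long form, ignoring case."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

print(expand_abbreviations("VERMIN EAT 7-12 PCT GRAIN STOCKS"))
```

Swapping this in would change the cleaned text, and therefore the keywords extracted later, so the outputs recorded in this notebook would no longer match.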

## Using KeyBERT to extract keywords from each document

In [6]:
docs = list(df['clean_text'])

In [7]:
for doc in docs:
    print(doc[:50])
    print()

asian exporters fear damage from us japan rift mou

china daily says vermin eat 7 12 pct grain stocks 

japan to revise long term energy demand downwards 

thai trade deficit widens in first quarter thailan

indonesia sees cpo price rising sharply indonesia 

australian foreign ship ban ends but nsw ports hit

indonesian commodity exchange may expand the indon

sri lanka gets usda approval for wheat price food 

western mining to open new gold mine in australia 

sumitomo bank aims at quick recovery from merger s

subroto says indonesia supports tin pact extension

bundesbank allocates 61 billion marks in tender th

bond corp still considering atlas mining bail out 

china industrial output rises in first quarter chi

japan ministry says open farm trade would hit us j

amatil proposes two for five bonus share issue ama

bowater 1986 pretax profits rise 156 mln stg shr 2

uk money market deficit forecast at 250 mln stg th

south korea moves to slow growth of trade surplus 

finns and ca

The text contains many three-letter tokens that stand in for longer words, such as "pct" for "percent". Would it make sense to replace all of them? Let's count them first.

In [14]:
from nltk.corpus import stopwords

# Requires the NLTK stopword list: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # build once for fast lookups

all_words = " ".join(df['clean_text'])
# Unique three-character tokens that are not stopwords
short_words = list({w for w in all_words.split() if len(w) == 3 and w not in stop_words})
print("Number of short words:", len(short_words))
short_words[:20]

Number of short words: 3312


['llc',
 '900',
 'cke',
 'usr',
 'hci',
 'jig',
 'ust',
 'lfg',
 'ppg',
 'bha',
 'adh',
 'nmk',
 'hlm',
 'gmp',
 'lpl',
 'wnt',
 'abm',
 '695',
 'gsa',
 'crn']

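There are 3312 unique three-letter tokens, and the sample does not show a coherent pattern: many appear to be numbers or codes rather than abbreviations of common words. So we leave them as they are and move on.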
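
Counting how often each token occurs, rather than how many unique tokens exist, might make it easier to spot the few abbreviations worth expanding. A minimal sketch using collections.Counter follows; the helper name and the sample sentences are made up for illustration.

```python
from collections import Counter

def short_word_counts(texts, length=3, stop_words=frozenset()):
    """Count how often each token of the given length occurs across texts."""
    counts = Counter()
    for text in texts:
        counts.update(
            w for w in text.split()
            if len(w) == length and w not in stop_words
        )
    return counts

# Made-up sample in the style of the cleaned Reuters text
sample = [
    "profits rise 156 mln stg",
    "deficit forecast at 250 mln stg",
    "cpo price rising",
]
counts = short_word_counts(sample, stop_words={"the", "and"})
print(counts.most_common(3))
```

On the real data this would be called as short_word_counts(df['clean_text'], stop_words=stop_words_set), where stop_words_set is a set built from the NLTK stopword list.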

In [8]:
# KeyphraseCountVectorizer selects candidate phrases by part-of-speech pattern, so no n-gram range has to be specified
kw_model = KeyBERT()
key_words = kw_model.extract_keywords(docs=docs, vectorizer=KeyphraseCountVectorizer())
key_words

2022-12-03 17:05:25,520 - KeyphraseVectorizer - INFO - It looks like the selected spaCy pipeline is not downloaded yet. It is attempted to download the spaCy pipeline now.


Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 13.4 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


[[('japanese electronics goods', 0.5698),
  ('import tariffs', 0.5366),
  ('japanese economy', 0.508),
  ('tariffs', 0.4959),
  ('us exports', 0.4818)],
 [('vermin consume', 0.6053),
  ('china grain stocks', 0.5585),
  ('china fruit output', 0.5433),
  ('vermin', 0.4266),
  ('pct grain stocks', 0.4173)],
 [('japanese energy demand ministry officials', 0.6494),
  ('energy supplydemand outlook', 0.6131),
  ('long term energy supplydemand outlook', 0.589),
  ('domestic electric power demand miti', 0.5823),
  ('japan electric power', 0.5077)],
 [('first quarter thailand trade deficit', 0.7652),
  ('thai trade deficit', 0.7416),
  ('high export growth', 0.5369),
  ('export growth', 0.5162),
  ('trade deficit', 0.4851)],
 [('crude palm oil cpo prices', 0.7627),
  ('crude palm oil cpo', 0.679),
  ('indonesian exports', 0.662),
  ('international market share indonesia', 0.5984),
  ('recent palm oil purchases', 0.5438)],
 [('australian foreign ship ban ends', 0.6591),
  ('australian foreign shi

In [12]:
df['keyword_tuples'] = key_words
df.iloc[0]['keyword_tuples']

[('japanese electronics goods', 0.5698),
 ('import tariffs', 0.5366),
 ('japanese economy', 0.508),
 ('tariffs', 0.4959),
 ('us exports', 0.4818)]

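Now we have a column with the keyword tuples, but it may be useful to have separate columns: one for the words and one for their scores. Note that the scores are cosine similarities to the document (higher means more relevant), even though the code below calls them distances.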

In [35]:
# Split each document's (keyword, score) tuples into two parallel lists.
# Comprehensions keep one sublist per document, even when a document has
# no keywords, so both lists stay aligned with the rows of df.
# The scores are cosine similarities; the name "distances" is kept so the
# column names match the rest of the notebook.
words = [[word for word, _ in doc_tuples] for doc_tuples in key_words]
distances = [[distance for _, distance in doc_tuples] for doc_tuples in key_words]

df['keywords'] = words
df['keyword_distances'] = distances

In [38]:
# Check that the previous cell split the columns as expected
print(df['keyword_tuples'][:5])
print()
print(df['keywords'][:5])
print()
print(df['keyword_distances'][:5])

0    [(japanese electronics goods, 0.5698), (import...
1    [(vermin consume, 0.6053), (china grain stocks...
2    [(japanese energy demand ministry officials, 0...
3    [(first quarter thailand trade deficit, 0.7652...
4    [(crude palm oil cpo prices, 0.7627), (crude p...
Name: keyword_tuples, dtype: object

0    [japanese electronics goods, import tariffs, j...
1    [vermin consume, china grain stocks, china fru...
2    [japanese energy demand ministry officials, en...
3    [first quarter thailand trade deficit, thai tr...
4    [crude palm oil cpo prices, crude palm oil cpo...
Name: keywords, dtype: object

0     [0.5698, 0.5366, 0.508, 0.4959, 0.4818]
1    [0.6053, 0.5585, 0.5433, 0.4266, 0.4173]
2     [0.6494, 0.6131, 0.589, 0.5823, 0.5077]
3    [0.7652, 0.7416, 0.5369, 0.5162, 0.4851]
4      [0.7627, 0.679, 0.662, 0.5984, 0.5438]
Name: keyword_distances, dtype: object
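
The values in keyword_distances are the scores KeyBERT returns, which are cosine similarities: a higher value means the keyword is closer to the document. If actual distances are needed downstream, one option is the cosine distance, defined as 1 minus the similarity. A small sketch, with a hypothetical helper name:

```python
def to_cosine_distance(similarities):
    """Convert cosine similarities to cosine distances (1 - similarity)."""
    return [round(1 - s, 4) for s in similarities]

scores = [0.5698, 0.5366, 0.508, 0.4959, 0.4818]  # first row shown above
print(to_cosine_distance(scores))
```

After conversion the ordering flips: the most relevant keyword has the smallest distance.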


Let's see what a single row looks like now.

In [39]:
df.iloc[0]

filename                                                         14826
tag                                                               test
text                 ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...
clean_text           asian exporters fear damage from us japan rift...
keyword_tuples       [(japanese electronics goods, 0.5698), (import...
keywords             [japanese electronics goods, import tariffs, j...
keyword_distances              [0.5698, 0.5366, 0.508, 0.4959, 0.4818]
Name: 0, dtype: object

## Using BERTopic to extract topics from the corpus