# KeyBERT

The goal of the project was to build a keyword extractor based on _BERT_ word embeddings.  
A word embedding maps words from lexical symbols to vectors in an n-dimensional space such that semantically similar words form clusters in that space.  
  
KeyBERT first computes a document-level embedding with BERT. It then computes embeddings for the word and phrase N-grams extracted from the document.  
Finally, cosine similarity is used to find the words or phrases most similar to the document. In this view, the most similar words are the ones that best describe the document.
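
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for the BERT embedding, and `tokenize`, `ngrams` and `rank_candidates` are names invented for this sketch, not part of KeyBERT.

```python
import math
import re


def tokenize(text, stop_words=()):
    """Lowercase the text, split it into word tokens and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop_words]


def ngrams(tokens, n_min, n_max):
    """Return unique word n-grams of length n_min..n_max, in order of first appearance."""
    seen, out = set(), []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram not in seen:
                seen.add(gram)
                out.append(gram)
    return out


def toy_embed(tokens, vocab):
    """Hypothetical stand-in for BERT: a word-count vector over the vocabulary."""
    return [tokens.count(word) for word in vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(doc, ngram_range=(2, 3), stop_words=(), top_n=5):
    """Embed the document and its n-gram candidates, then rank candidates by cosine similarity."""
    tokens = tokenize(doc, stop_words)
    vocab = sorted(set(tokens))
    doc_vec = toy_embed(tokens, vocab)                      # step 1: document embedding
    candidates = ngrams(tokens, *ngram_range)               # step 2: candidate phrases
    scored = [(c, cosine(toy_embed(c.split(), vocab), doc_vec)) for c in candidates]
    return sorted(scored, key=lambda pair: -pair[1])[:top_n]  # step 3: most similar first


doc = "pandemic outbreak spreads while pandemic response slows"
for phrase, score in rank_candidates(doc, (2, 2), stop_words={"while"}, top_n=3):
    print(f"{phrase}: {score:.2f}")
```

With a real sentence embedding model in place of `toy_embed`, phrases can score highly even when they share no exact words with the rest of the document, which a word-count vector cannot capture.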

In [3]:
%pip install keybert sentence-transformers pandas

Note: you may need to restart the kernel to use updated packages.


We use the _pandas_ library to load the input dataset.  
We keep only the true articles about Covid-19 and take the first 30 of them as the corpus.

In [37]:
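import pandas as pd

# Load only the columns needed here (selected by position).
df = pd.read_excel('data/fake_new_dataset.xlsx', usecols=[1, 2, 4])
# Keep only articles labelled as true; copy so the assignment below does not act on a view.
df = df.query('label == 1').copy()
# Join title and body into one stripped, lowercased text field.
df['texts'] = (df['title'] + ' ' + df['text']).str.strip().str.lower()

df = df[['title', 'texts']]

# Use the first 30 articles as the corpus.
df = df.iloc[:30]

df.head()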

Unnamed: 0,title,texts
1,Other Viewpoints: COVID-19 is worse than the flu,other viewpoints: covid-19 is worse than the f...
2,Bermuda's COVID-19 cases surpass 100,bermuda's covid-19 cases surpass 100 the minis...
6,Delhi: Eight nurses test positive for Covid-19...,delhi: eight nurses test positive for covid-19...
8,Mississippi man recovering at home after 21 da...,mississippi man recovering at home after 21 da...
20,Eight nurses test positive for Covid-19 at Kal...,eight nurses test positive for covid-19 at kal...


In [38]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer

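# Sentence-transformer model that KeyBERT uses to compute the embeddings.
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
# English stop words plus topic terms shared by the whole Covid-19 corpus.
stop_words = stopwords.words('english') + ['covid', 'coronavirus', 'corona', '19']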

[nltk_data] Downloading package stopwords to /home/godric/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [40]:
from keybert import KeyBERT

# Join all selected articles into one document and extract keyphrases for the corpus as a whole.
text = ' '.join(df['texts'])

kw_model = KeyBERT(model=sentence_model)
# extract_keywords returns a list of (keyphrase, score) tuples.
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(2, 3), stop_words=stop_words,
                                     use_mmr=True, diversity=0.2,
                                     top_n=10)

# df['keywords'] = (
#     df['texts']
#     .apply(lambda x: kw_model.extract_keywords(x, keyphrase_ngram_range=(2, 3), stop_words=stop_words,
#                                                use_mmr=True, diversity=0.5,
#                                                top_n=10))
# )

df2 = pd.DataFrame(keywords, columns=['Keywords', 'Score'])
df2

Unnamed: 0,Keywords,Score
0,influenza killing americans,0.6166
1,killing americans flu,0.6152
2,comparisons influenza killing,0.6126
3,ongoing pandemic,0.6056
4,outbreak pandemic urged,0.5718
5,outbreak comparing flu,0.567
6,viewpoints worse flu,0.5585
7,disease since coronaviruses,0.5499
8,people dead outbreak,0.5311
9,deadly virus shifts,0.5244
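
The extraction call uses `use_mmr=True` with `diversity=0.2`, and several of the returned phrases overlap heavily. The sketch below illustrates the general idea of Maximal Marginal Relevance (MMR): each new phrase is scored by its similarity to the document minus a penalty for similarity to phrases already chosen. It is a simplified illustration, not necessarily KeyBERT's exact implementation, and the two-dimensional vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, top_n=2, diversity=0.2):
    """Select up to top_n phrases, trading relevance to the document against redundancy.

    cand_vecs maps each candidate phrase to its embedding vector.
    """
    if not cand_vecs or top_n < 1:
        return []
    relevance = {p: cosine(v, doc_vec) for p, v in cand_vecs.items()}
    # Start with the phrase most similar to the document.
    selected = [max(relevance, key=relevance.get)]
    remaining = [p for p in cand_vecs if p != selected[0]]

    def score(phrase):
        # Redundancy is the highest similarity to any phrase already selected.
        redundancy = max(cosine(cand_vecs[phrase], cand_vecs[s]) for s in selected)
        return (1 - diversity) * relevance[phrase] - diversity * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up vectors: the first two phrases are near-duplicates, the third points elsewhere.
doc_vec = [1.0, 0.0]
cand_vecs = {
    "influenza killing americans": [1.0, 0.10],
    "killing americans flu": [1.0, 0.12],
    "ongoing pandemic": [0.6, 0.80],
}

print(mmr(doc_vec, cand_vecs, top_n=2, diversity=0.2))
print(mmr(doc_vec, cand_vecs, top_n=2, diversity=0.7))
```

In this toy setup a low `diversity` keeps the near-duplicate phrase because relevance dominates, while a higher value swaps it for the less similar but more distinct phrase.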


In [None]:
# df.iloc[0]['keywords']

# df['keywords'] = (
#     df['texts']
#     .apply(lambda x: kw_model.extract_keywords(x, keyphrase_ngram_range=(2, 3), stop_words=stop_words,
#                                                use_mmr=True, diversity=0.2,
#                                                top_n=3))
# )

# df.iloc[0]['keywords']