# RAKE

_Rapid Automatic Keyword Extraction_ (RAKE) is a domain-independent keyword extraction algorithm that uses a list of stop words and phrase delimiters to detect the most relevant words or phrases in a text.

Consider the following example text, from which we want to extract keywords:
```text
Keyword extraction is not that difficult after all.
There are many libraries that can help you with keyword extraction.
Rapid automatic keyword extraction is one of those.
```

First we split the text into individual words, breaking phrases apart at the phrase delimiters.  
We then remove the stop words, given the following stop word list and delimiters:
```python
stopwords = ['is', 'not', 'that', 'there', 'are', 'can', 'you', 'with', 'of', 'those', 'after', 'all', 'one']
delimiters = ['.', ',']
```
which leaves the following content words:
```python
content_words = ['keyword', 'extraction', 'difficult', 'many', 'libraries', 'help', 'rapid', 'automatic']
```
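
The step above can be sketched in a few lines of plain Python. This is only an illustration on the example text: the regular expression used for tokenization is an assumption of this sketch, not something the algorithm prescribes.

```python
import re

text = (
    "Keyword extraction is not that difficult after all. "
    "There are many libraries that can help you with keyword extraction. "
    "Rapid automatic keyword extraction is one of those."
)
stopwords = {'is', 'not', 'that', 'there', 'are', 'can', 'you', 'with',
             'of', 'those', 'after', 'all', 'one'}

# Lowercase the text and keep alphabetic tokens only (this drops '.' and ',').
tokens = re.findall(r"[a-z]+", text.lower())

# Keep each non-stop word once, in order of first appearance.
content_words = list(dict.fromkeys(t for t in tokens if t not in stopwords))
print(content_words)
```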

The algorithm then splits the text at phrase delimiters and stop words, which yields the candidate phrases.  
In our case they look as follows (highlighted in red):

<span style="color:red">Keyword extraction</span> is not that  <span style="color:red">difficult</span> after all.  
There are  <span style="color:red">many libraries</span> that can  <span style="color:red">help</span> you with <span style="color:red">keyword extraction</span>.  
<span style="color:red">Rapid automatic keyword extraction</span> is one of those.  
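
A minimal sketch of this splitting, assuming whitespace tokenization and the stop words and delimiters listed earlier: a candidate phrase is a maximal run of consecutive content words, closed by either a delimiter or a stop word.

```python
import re

text = (
    "Keyword extraction is not that difficult after all. "
    "There are many libraries that can help you with keyword extraction. "
    "Rapid automatic keyword extraction is one of those."
)
stopwords = {'is', 'not', 'that', 'there', 'are', 'can', 'you', 'with',
             'of', 'those', 'after', 'all', 'one'}

candidates = []
# Phrase delimiters ('.' and ',') end a candidate outright.
for fragment in re.split(r"[.,]", text.lower()):
    current = []
    for word in fragment.split():
        if word in stopwords:
            # A stop word also closes the candidate collected so far.
            if current:
                candidates.append(current)
            current = []
        else:
            current.append(word)
    if current:
        candidates.append(current)

print([" ".join(c) for c in candidates])
```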

Next, a matrix is built that counts how often each pair of words co-occurs within the candidate phrases.

|                | **keyword** | **extraction** | **difficult** | **many** | **libraries** | **help** | **rapid** | **automatic** |
|:--------------:|:-----------:|:--------------:|:-------------:|:--------:|:-------------:|:--------:|:---------:|:-------------:|
|   **keyword**  |      3      |        3       |       0       |     0    |       0       |     0    |     1     |       1       |
| **extraction** |      3      |        3       |       0       |     0    |       0       |     0    |     1     |       1       |
|  **difficult** |      0      |        0       |       1       |     0    |       0       |     0    |     0     |       0       |
|    **many**    |      0      |        0       |       0       |     1    |       1       |     0    |     0     |       0       |
|  **libraries** |      0      |        0       |       0       |     1    |       1       |     0    |     0     |       0       |
|    **help**    |      0      |        0       |       0       |     0    |       0       |     1    |     0     |       0       |
|    **rapid**   |      1      |        1       |       0       |     0    |       0       |     0    |     1     |       1       |
|  **automatic** |      1      |        1       |       0       |     0    |       0       |     0    |     1     |       1       |
|    **Total**   |      8      |        8       |       1       |     2    |       2       |     1    |     4     |       4       |
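
The matrix can be reproduced with nested dictionaries. In this sketch the candidate phrases are written out by hand, and every pair of words inside one phrase (a word paired with itself included) adds one to the corresponding cell.

```python
from collections import defaultdict

candidates = [
    ['keyword', 'extraction'], ['difficult'], ['many', 'libraries'],
    ['help'], ['keyword', 'extraction'],
    ['rapid', 'automatic', 'keyword', 'extraction'],
]

# cooccurrence[w1][w2] = number of candidate phrases containing both words.
cooccurrence = defaultdict(lambda: defaultdict(int))
for phrase in candidates:
    for w1 in phrase:
        for w2 in phrase:
            cooccurrence[w1][w2] += 1

# The matrix is symmetric, so row sums equal the column totals.
totals = {word: sum(row.values()) for word, row in cooccurrence.items()}
print(totals)
```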

For each word we can then compute:
* the sum of the word's co-occurrences with the content words, itself included (its degree)
    * add up the values in the word's column
* the number of times the word occurs in the text (its frequency)
    * the value where the word's row and column intersect
* the ratio of the co-occurrence sum to the number of occurrences
    * the _Degree Score_

The table below lists the computed _Degree Score_ values.

|  **Word**  | **Degree Score** |
|:----------:|:----------------:|
|   keyword  |       2.67       |
| extraction |       2.67       |
|  difficult |        1.0       |
|    many    |        2.0       |
|  libraries |        2.0       |
|    help    |        1.0       |
|    rapid   |        4.0       |
|  automatic |        4.0       |
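
The three quantities can also be computed directly from the candidate phrases, without building the full matrix. The sketch below relies on the observation that a word's column sum equals the total length of the phrases it appears in; the variable names are illustrative.

```python
from collections import Counter

candidates = [
    ['keyword', 'extraction'], ['difficult'], ['many', 'libraries'],
    ['help'], ['keyword', 'extraction'],
    ['rapid', 'automatic', 'keyword', 'extraction'],
]

# Frequency: how many times each word occurs across all candidates.
frequency = Counter(word for phrase in candidates for word in phrase)

# Degree: each occurrence co-occurs with every word of its phrase, itself included.
degree = Counter()
for phrase in candidates:
    for word in phrase:
        degree[word] += len(phrase)

# Degree Score: degree divided by frequency, e.g. 8 / 3 for 'keyword'.
score = {word: degree[word] / frequency[word] for word in frequency}
for word, value in score.items():
    print(f"{word:<12}{value:.2f}")
```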

Summing the scores of its content words gives the score of each candidate phrase.

|          **Candidate phrase**          | **Score** |
|:--------------------------------------:|:-----:|
|         **keyword extraction**         |  5.33 |
|           **many libraries**           |  4.0  |
| **rapid automatic keyword extraction** | 13.33 |
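
As a sketch, the phrase scores follow from the word scores with a single sum per phrase. The word scores are entered by hand here, and the single-word candidates are scored as well.

```python
word_score = {
    'keyword': 8 / 3, 'extraction': 8 / 3, 'difficult': 1.0, 'many': 2.0,
    'libraries': 2.0, 'help': 1.0, 'rapid': 4.0, 'automatic': 4.0,
}
candidates = [
    ['keyword', 'extraction'], ['difficult'], ['many', 'libraries'],
    ['help'], ['keyword', 'extraction'],
    ['rapid', 'automatic', 'keyword', 'extraction'],
]

# A repeated phrase maps to the same key, so it is scored only once.
phrase_score = {" ".join(p): sum(word_score[w] for w in p) for p in candidates}

# Highest-scoring phrases first.
ranked = sorted(phrase_score.items(), key=lambda item: item[1], reverse=True)
for phrase, value in ranked:
    print(f"{value:6.2f}  {phrase}")
```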

A keyword or phrase is selected if its score is among the top _T_ scores, where _T_ is the number of keywords we want to extract.  
In the original paper, _T_ defaults to one third of the number of content words.
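
For the running example, the selection could look like the sketch below. Rounding _T_ down and keeping at least one keyword are assumptions of this sketch; the text above does not say how a fractional third is handled.

```python
# Candidate phrases with their scores, highest first.
ranked = [
    ("rapid automatic keyword extraction", 13.33),
    ("keyword extraction", 5.33),
    ("many libraries", 4.0),
    ("difficult", 1.0),
    ("help", 1.0),
]
n_content_words = 8  # distinct content words in the example

# Assumption: round one third down, but always keep at least one keyword.
T = max(1, n_content_words // 3)

keywords = [phrase for phrase, _ in ranked[:T]]
print(T, keywords)
```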

In [1]:
%pip install rake-nltk

Note: you may need to restart the kernel to use updated packages.


We load the input dataset with the _pandas_ library.  
We keep only the true articles about Covid-19 and use them as the corpus.

In [3]:
import pandas as pd

df = pd.read_excel('data/fake_new_dataset.xlsx', usecols=[1, 2, 4])
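# Keep only the articles labelled as true; copy so new columns can be added without warnings.
df = df.query('label == 1').copy()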
df['texts'] = df['title'] + ' ' + df['text']
df['texts'] = df['texts'].str.strip()

df = df[['title', 'texts']]

df.head()

|    | **title** | **texts** |
|:--:|:---------:|:---------:|
|  1 | Other Viewpoints: COVID-19 is worse than the flu | Other Viewpoints: COVID-19 is worse than the f... |
|  2 | Bermuda's COVID-19 cases surpass 100 | Bermuda's COVID-19 cases surpass 100 The Minis... |
|  6 | Delhi: Eight nurses test positive for Covid-19... | Delhi: Eight nurses test positive for Covid-19... |
|  8 | Mississippi man recovering at home after 21 da... | Mississippi man recovering at home after 21 da... |
| 20 | Eight nurses test positive for Covid-19 at Kal... | Eight nurses test positive for Covid-19 at Kal... |


Next, we preprocess the input data with the _Natural Language Toolkit_ library.  
We download the _stopwords_ list, which contains words such as 'a' and 'the' that are irrelevant for keyword extraction.  
We also append words specific to the Covid-19 pandemic, which we want to exclude as well because they would skew the results.

In [4]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords

stop_words = stopwords.words('english') + ['covid', 'coronavirus', 'corona', '19', '2019', 'ncov']

[nltk_data] Downloading package stopwords to /home/godric/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/godric/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
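
The extra stop words include bare numbers such as '19' because the `\w+` tokenizer pattern used in the next cell breaks 'COVID-19' into two tokens. The sketch below imitates that tokenization with the standard `re` module on one headline from the dataset. Lowercasing the text before matching is an assumption made here so that the tokens line up with the lowercase stop words.

```python
import re

extra_stop_words = {'covid', 'coronavirus', 'corona', '19', '2019', 'ncov'}
headline = "Bermuda's COVID-19 cases surpass 100"

# \w+ keeps runs of letters, digits and underscores; apostrophes and hyphens split tokens.
tokens = re.findall(r"\w+", headline.lower())
print(tokens)

# Both halves of 'COVID-19' are dropped by the pandemic-specific stop words.
kept = [t for t in tokens if t not in extra_stop_words]
print(kept)
```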


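We run the RAKE algorithm itself through the `rake_nltk` library, once for each of three ranking metrics: word degree, word frequency, and degree-to-frequency ratio.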

In [32]:
from rake_nltk import Metric, Rake
from nltk.tokenize import RegexpTokenizer

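# Tokenize on runs of word characters; punctuation is dropped, so phrases are split at stop words only.
regex_tokenizer = RegexpTokenizer(r'\w+')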

r_degree = Rake(ranking_metric=Metric.WORD_DEGREE,
                word_tokenizer=regex_tokenizer.tokenize,
                stopwords=stop_words,
                min_length=2, max_length=3,
                include_repeated_phrases=False)

r_frequency = Rake(ranking_metric=Metric.WORD_FREQUENCY,
                   word_tokenizer=regex_tokenizer.tokenize,
                   stopwords=stop_words,
                   min_length=2, max_length=3,
                   include_repeated_phrases=False)

r_degree_freq_ratio = Rake(ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO,
                           word_tokenizer=regex_tokenizer.tokenize,
                           stopwords=stop_words,
                           min_length=2, max_length=3,
                           include_repeated_phrases=False)

r_degree.extract_keywords_from_sentences(df['texts'].values.tolist())
r_frequency.extract_keywords_from_sentences(df['texts'].values.tolist())
r_degree_freq_ratio.extract_keywords_from_sentences(df['texts'].values.tolist())

df_degree = pd.DataFrame(r_degree.get_ranked_phrases_with_scores(), columns=['score', 'phrase'])
df_frequency = pd.DataFrame(r_frequency.get_ranked_phrases_with_scores(), columns=['score', 'phrase'])
df_degree_freq_ratio = pd.DataFrame(r_degree_freq_ratio.get_ranked_phrases_with_scores(), columns=['score', 'phrase'])


We display the top ten key phrases found under each metric.

In [38]:
from IPython.display import display_html 

df_degree_styler = df_degree.head(10).style.set_table_attributes("style='display:inline'").set_caption('Degree')
df_frequency_styler = df_frequency.head(10).style.set_table_attributes("style='display:inline'").set_caption('Frequency')
df_degree_freq_ratio_styler = df_degree_freq_ratio.head(10).style.set_table_attributes("style='display:inline'").set_caption('Degree to Frequency ratio')

display_html(df_degree_styler._repr_html_()+df_frequency_styler._repr_html_()+df_degree_freq_ratio_styler._repr_html_(), raw=True)


_Degree_

|   | **score** | **phrase** |
|:-:|:---------:|:----------:|
| 0 | 9299.0 | virus china said |
| 1 | 8491.0 | china china also |
| 2 | 8471.0 | virus virus outbreak |
| 3 | 8011.0 | virus outbreak china |
| 4 | 8011.0 | china virus outbreak |
| 5 | 7673.0 | wuhan china china |
| 6 | 7667.0 | china also said |
| 7 | 7653.0 | virus infection china |
| 8 | 7411.0 | virus news china |
| 9 | 7318.0 | virus infections china |

_Frequency_

|   | **score** | **phrase** |
|:-:|:---------:|:----------:|
| 0 | 3647.0 | virus china said |
| 1 | 3406.0 | china china also |
| 2 | 3327.0 | virus virus outbreak |
| 3 | 3173.0 | virus outbreak china |
| 4 | 3173.0 | china virus outbreak |
| 5 | 3064.0 | wuhan china china |
| 6 | 3038.0 | china also said |
| 7 | 3025.0 | virus infection china |
| 8 | 2925.0 | virus news china |
| 9 | 2891.0 | virus infections china |

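_Degree to Frequency ratio_

|   | **score** | **phrase** |
|:-:|:---------:|:----------:|
| 0 | 9.0 | zoonosis plural zoonoses |
| 1 | 9.0 | yin yang dwelling |
| 2 | 9.0 | ww ii bombing |
| 3 | 9.0 | unsold barrels pile |
| 4 | 9.0 | tsim sha tsui |
| 5 | 9.0 | tshephang kapinga ub |
| 6 | 9.0 | toyota innova crysta |
| 7 | 9.0 | tiu sonco convenes |
| 8 | 9.0 | teammate lindy remigino |
| 9 | 9.0 | tablets capsules supplements |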
