# Keyword extraction using YAKE


## YAKE - Yet Another Keyword Extractor

Unsupervised Approach for Automatic Keyword Extraction using Text Features.

### Main Features

* Unsupervised approach
* Corpus-Independent
* Domain and Language Independent
* Single-Document

### Rationale

Extracting keywords from texts has become a challenge for individuals and organizations as information grows in volume and complexity. The need to automate this task so that texts can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Despite these advances, there is a clear lack of multilingual online tools that automatically extract keywords from single documents. YAKE! is a feature-based system for multilingual keyword extraction that supports texts of different sizes, domains, and languages. Unlike other approaches, YAKE! relies on neither dictionaries nor thesauri, and it is not trained on any corpus. Instead, it follows an unsupervised approach that builds on features extracted from the text itself, which makes it applicable to documents written in different languages without further knowledge. This is beneficial for the many tasks and situations where access to training corpora is limited or restricted.

### Where can I find YAKE?

Open source Python package: [https://liaad.github.io/yake](https://liaad.github.io/yake)

## Quick start

### Installation

In [None]:
!pip install git+https://github.com/LIAAD/yake

### Usage (Python)

In [None]:
text = '''
Google is acquiring data science community Kaggle. Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning   competitions. Details about the transaction remain somewhat vague , but given that Google is hosting   its Cloud Next conference in San Francisco this week, the official announcement could come as early   as tomorrow.  Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the   acquisition is happening. Google itself declined 'to comment on rumors'.   Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom   and Ben Hamner in 2010. The service got an early start and even though it has a few competitors   like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its   specific niche. The service is basically the de facto home for running data science  and machine learning   competitions.  With Kaggle, Google is buying one of the largest and most active communities for   data scientists - and with that, it will get increased mindshare in this community, too   (though it already has plenty of that thanks to Tensorflow and other projects).   Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month,   Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying   YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too.   Our understanding is that Google will keep the service running - likely under its current name.   While the acquisition is probably more about Kaggle's community than technology, Kaggle did build   some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are   basically the source code for analyzing data sets and developers can share this code on the   platform (the company previously called them 'scripts').  
Like similar competition-centric sites,   Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service.   According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant,   Google chief economist Hal Varian, Khosla Ventures and Yuri Milner'''

In [None]:
import yake

# assuming default parameters
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)

for kw in keywords:
    print(kw)
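The loop above prints each result as a raw tuple. The following is a minimal sketch for ranking and formatting such results, assuming each item is a `(keyword, score)` tuple and that lower scores mean more relevant keywords. The helper names `rank_keywords` and `format_keywords` are hypothetical, and the sample scores are made-up placeholders rather than real YAKE output.

```python
def rank_keywords(pairs):
    """Return (keyword, score) tuples sorted from most to least relevant.

    Assumes lower scores are better.
    """
    return sorted(pairs, key=lambda pair: pair[1])


def format_keywords(pairs, precision=4):
    """Render one 'score  keyword' line per tuple, best keyword first."""
    return [
        f"{score:.{precision}f}  {keyword}"
        for keyword, score in rank_keywords(pairs)
    ]


# Placeholder tuples with made-up scores; with YAKE installed you would pass
# the list returned by kw_extractor.extract_keywords(text) instead.
sample = [("data science", 0.0500), ("Google", 0.0250), ("Kaggle", 0.0125)]
for line in format_keywords(sample):
    print(line)
```

If your installed version returns tuples in a different order, adjust the unpacking in `format_keywords` accordingly.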

Playing with the parameters:

In [None]:
import yake

language = "en"
max_ngram_size = 3  # longest keyword, in words
deduplication_threshold = 0.9  # similarity limit for near-duplicate keywords
deduplication_algo = 'seqm'  # similarity function used for deduplication
windowSize = 1
numOfKeywords = 20  # number of keywords to return

custom_kw_extractor = yake.KeywordExtractor(
    lan=language,
    n=max_ngram_size,
    dedupLim=deduplication_threshold,
    dedupFunc=deduplication_algo,
    windowsSize=windowSize,
    top=numOfKeywords,
    features=None,
)
keywords = custom_kw_extractor.extract_keywords(text)

for kw in keywords:
    print(kw)
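To give some intuition for the deduplication threshold, here is a rough sketch of threshold-based deduplication: candidates are visited best-first, and a candidate is dropped when it is too similar to a keyword that has already been kept. This is not YAKE's own code. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, the function names are hypothetical, and the scores are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; YAKE's own functions may differ."""
    return SequenceMatcher(None, a, b).ratio()


def deduplicate(candidates, threshold=0.9, top=20):
    """Keep candidates unless they are too similar to one already kept.

    `candidates` is a list of (keyword, score) tuples sorted by ascending
    score, so better keywords are considered first.
    """
    kept = []
    for keyword, score in candidates:
        # Keep the candidate only if no kept keyword exceeds the threshold.
        if all(similarity(keyword, k) <= threshold for k, _ in kept):
            kept.append((keyword, score))
        if len(kept) == top:
            break
    return kept


# Illustrative input with made-up scores (not real YAKE output).
candidates = [
    ("machine learning", 0.01),
    ("machine learnings", 0.02),
    ("data science", 0.03),
]
print(deduplicate(candidates, threshold=0.9))
```

In this sketch, a lower threshold removes more near-duplicates, while a threshold of 1.0 keeps every candidate.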

While English (`en`) is the default language, YAKE! can extract keywords from text in other languages when you specify the corresponding language code. The example below extracts keywords from a Portuguese text.

In [None]:
text = '''
"Conta-me Histórias." Xutos inspiram projeto premiado. A plataforma "Conta-me Histórias" foi distinguida com o Prémio Arquivo.pt, atribuído a trabalhos inovadores de investigação ou aplicação de recursos preservados da Web, através dos serviços de pesquisa e acesso disponibilizados publicamente pelo Arquivo.pt . Nesta plataforma em desenvolvimento, o utilizador pode pesquisar sobre qualquer tema e ainda executar alguns exemplos predefinidos. Como forma de garantir a pluralidade e diversidade de fontes de informação, esta são utilizadas 24 fontes de notícias eletrónicas, incluindo a TSF. Uma versão experimental (beta) do "Conta-me Histórias" está disponível aqui.
A plataforma foi desenvolvida por Ricardo Campos investigador do LIAAD do INESC TEC e docente do Instituto Politécnico de Tomar, Arian Pasquali e Vitor Mangaravite, também investigadores do LIAAD do INESC TEC, Alípio Jorge, coordenador do LIAAD do INESC TEC e docente na Faculdade de Ciências da Universidade do Porto, e Adam Jatwot docente da Universidade de Kyoto.
'''

custom_kw_extractor = yake.KeywordExtractor(lan="pt")
keywords = custom_kw_extractor.extract_keywords(text)

for kw in keywords:
    print(kw)

### Usage (Command Line)

```
Usage: yake [OPTIONS]

Options:
        --help                              Show this message and exit.
        -ti, --text_input TEXT              Input text, SURROUNDED by single quotes(')
        -i, --input_file TEXT               Input file
        -l, --language TEXT                 Language
        -n, --ngram-size INTEGER            Max size of the ngram.
        -df, --dedup-func [leve|jaro|seqm]  Deduplication function.
        -dl, --dedup-lim FLOAT              Deduplication limiar.
        -ws, --window-size INTEGER          Window size.
        -t, --top INTEGER                   Number of keyphrases to extract
        -v, --verbose                       Gets detailed information (such as the score)
```

An example:

```
yake -i text.txt -l en -n 3
```
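When the command line tool is driven from a script, it can help to assemble the arguments programmatically. The sketch below builds an argument list using only the short flags shown in the help text above. The helper `build_yake_command` is hypothetical and not part of the YAKE package.

```python
import shlex


def build_yake_command(input_file, language="en", ngram_size=3, top=None, verbose=False):
    """Assemble an argument list for the `yake` CLI (hypothetical helper)."""
    args = ["yake", "-i", input_file, "-l", language, "-n", str(ngram_size)]
    if top is not None:
        args += ["-t", str(top)]  # number of keyphrases to extract
    if verbose:
        args.append("-v")  # detailed output, such as the score
    return args


command = build_yake_command("text.txt", language="pt", top=10, verbose=True)
print(shlex.join(command))
```

Assuming the `yake` executable is on your `PATH`, the resulting list could be passed to `subprocess.run(command, capture_output=True, text=True)`.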

## References

Please cite the following works when using YAKE:

**In-depth journal paper at Information Sciences Journal**

- Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp 257-289. [pdf](https://doi.org/10.1016/j.ins.2019.09.013)

**ECIR'18 Best Short Paper**

- Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 684 - 691. [pdf](https://link.springer.com/chapter/10.1007/978-3-319-76941-7_63)

- Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE! Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 806 - 810. [pdf](https://link.springer.com/chapter/10.1007/978-3-319-76941-7_80)