LIAAD/kep

KEP - Keyphrase Extraction Package

KEP is a Python package for extracting keyphrases from documents (single documents or whole collections) using a number of algorithms, most of them provided by pke, an open-source keyphrase extraction package. Unlike pke, KEP ships ready-to-run code that extracts keyphrases not only from a single document but also in batch mode (i.e., over several documents). In addition, it covers 20 state-of-the-art datasets from which keyphrases may be extracted, together with the corresponding pre-computed dfs, lda and KEA models (whereas pke only makes the SemEval-2010 models available).

KEP is available on Docker Hub (ready to run) or for download (in which case some configuration is needed). Regardless of your option, we provide a Jupyter notebook to ease the process of extracting keyphrases. More on this in the Installation section.

List of Datasets

KEP can extract keyphrases from 20 datasets:

  • 110-PT-BN-KP (110 docs; PT)
  • 500N-KPCrowd-v1.1 (500 docs; EN)
  • cacic (888 docs; ES)
  • citeulike180 (183 docs; EN)
  • fao30 (30 docs; EN)
  • fao780 (779 docs; EN)
  • Inspec (2000 docs; EN)
  • kdd (755 docs; EN)
  • Krapivin2009 (2304 docs; EN)
  • Nguyen2007 (209 docs; EN)
  • pak2018 (50 docs; PL)
  • PubMed (500 docs; EN)
  • Schutz2008 (1231 docs; EN)
  • SemEval2010 (243 docs; EN)
  • SemEval2017 (493 docs; EN)
  • theses100 (100 docs; EN)
  • wicc (1640 docs; ES)
  • wiki20 (20 docs; EN)
  • WikiNews (100 docs; FR)
  • www (1330 docs; EN)

Note, however, that more datasets can be added as long as they follow this structure:
  • keys: a folder that contains, for each document, a file with the corresponding keywords (ground truth)
  • docsutf8: a folder that contains the documents' text
  • lan.txt: a file that specifies the language of the documents (e.g., EN), used to load the stopwords
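As a quick sanity check, a small Python helper (our own illustration, not part of KEP) can verify that a new dataset folder follows the layout above:

```python
import os

def is_valid_kep_dataset(path):
    """Return True if 'path' follows the dataset structure KEP expects:
    a 'keys' folder, a 'docsutf8' folder, and a 'lan.txt' file."""
    return (os.path.isdir(os.path.join(path, "keys"))
            and os.path.isdir(os.path.join(path, "docsutf8"))
            and os.path.isfile(os.path.join(path, "lan.txt")))
```

For example, once a dataset is unzipped in place, calling the function on its folder should return True.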

Keyphrase Extraction Algorithms

Unsupervised Algorithms

Statistical Methods

Graph-based Methods

Supervised Algorithms

Installing KEP

Option 1: Docker

Install Docker

Windows

Docker for Windows requires 64-bit Windows 10 Pro with Hyper-V available. If your system meets these requirements, download it here: (https://docs.docker.com/docker-for-windows/install/#download-docker-for-windows) and click Get Docker for Windows (Stable).

If your system does not meet the requirements to run Docker for Windows (e.g., 64-bit Windows 10 Home), you can install Docker Toolbox, which uses Oracle VirtualBox instead of Hyper-V. In that case, download it here: (https://docs.docker.com/toolbox/overview/#ready-to-get-started) and click Get Docker Toolbox for Windows.

MAC

Docker for Mac launches only if all of these requirements (https://docs.docker.com/docker-for-mac/install/#what-to-know-before-you-install) are met. If they are, download it here: (https://docs.docker.com/docker-for-mac/install/#download-docker-for-mac) and click Get Docker for Mac (Stable).

If your system does not meet the requirements to run Docker for Mac, you can install Docker Toolbox, which uses Oracle VirtualBox instead. In that case, download it here: (https://docs.docker.com/toolbox/overview/#ready-to-get-started) and click Get Docker Toolbox for Mac.

Linux

Proceed to download here: (https://docs.docker.com/engine/installation/#server)

Pull Image

Execute the following command on your docker machine:

docker pull liaad/kep

Run Image

On your docker machine run the following to launch the image:

docker run -p 9999:8888 --user root liaad/kep

Then go to your browser and type in the following URL:

http://<DOCKER-MACHINE-IP>:9999

where the IP is localhost on a native installation, or typically 192.168.99.100 if you are using a Docker Machine VM.

You will be asked for a token, which you can find in your docker machine prompt. It will look similar to this: http://eac214218126:8888/?token=ce459c2f581a5f56b90256aaa52a96e7e4b1705113a657e8. Copy and paste the token (in this example, ce459c2f581a5f56b90256aaa52a96e7e4b1705113a657e8) into the browser and, voila, the KEP package is ready to run. Keep this token (for future reference) or define a password.
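If you prefer to extract the token from that URL programmatically, it is simply the token query parameter; the small helper below is our own convenience sketch, not part of KEP:

```python
from urllib.parse import urlparse, parse_qs

def token_from_url(url):
    """Extract the Jupyter access token from a URL such as
    http://eac214218126:8888/?token=ce459c2f58... (returns '' if absent)."""
    return parse_qs(urlparse(url).query).get("token", [""])[0]
```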

Run Jupyter notebooks

Once you have logged in, run the notebook that we have prepared for you.

Shutdown

Once you are done, go to File > Shutdown.

Login again

If later on you decide to play with the same container, proceed as follows. First, get the container id:

docker ps -a

Next run the following commands:

docker start ContainerId
docker attach ContainerId (attach to a running container)

Nothing visible happens on your docker machine, but you are now ready to open your browser as you did before:

http://<DOCKER-MACHINE-IP>:9999

Hopefully, you have saved the token or defined a password. If not, run the following command (before doing start/attach) to retrieve your token:

docker exec -it <docker_container_name> jupyter notebook list

Option 2: Standalone Installation

Install KEP library and Dependency Packages

pip install git+https://github.com/liaad/kep
pip install git+https://github.com/boudinfl/pke
pip install git+https://github.com/LIAAD/yake.git
pip install langcodes

Install External Resources

Spacy Language Models

PKE makes use of spaCy in the pre-processing stage. Currently, spaCy supports the following languages:

  • 'en': 'english',
  • 'pt': 'portuguese',
  • 'fr': 'french',
  • 'es': 'spanish',
  • 'it': 'italian',
  • 'nl': 'dutch',
  • 'de': 'german',
  • 'el': 'greek'

In order to install these language models, open your command line (e.g., the Anaconda prompt) in administrator mode. Otherwise they will install, but will return an error later on.

python -m spacy download en
python -m spacy download es
python -m spacy download fr
python -m spacy download pt
python -m spacy download de
python -m spacy download it
python -m spacy download nl
python -m spacy download el

If you want to make sure that everything was properly installed, go to site-packages\spacy\data and check whether a shortcut for every language is found there.

Datasets in languages other than the ones listed above will be handled (in the pre-processing stage) as if they were English.
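This fallback amounts to a simple lookup. The sketch below illustrates the idea; the function name and the hard-coded set are our own, derived from the list above:

```python
# spaCy-supported language codes, per the list above.
SPACY_LANGUAGES = {"en", "pt", "fr", "es", "it", "nl", "de", "el"}

def preprocessing_language(lan_code):
    """Map a dataset's language code (read from lan.txt) to the language
    used during pre-processing, falling back to English when the code is
    not covered by an installed spaCy model."""
    code = lan_code.strip().lower()
    return code if code in SPACY_LANGUAGES else "en"
```

For instance, a dataset whose lan.txt contains PT is pre-processed as Portuguese, while one containing PL (Polish) falls back to English.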

PKE can also apply stemming in the pre-processing stage (via the SnowballStemmer) for the following languages:

  • 'en': 'english',
  • 'pt': 'portuguese',
  • 'fr': 'french',
  • 'es': 'spanish',
  • 'it': 'italian',
  • 'nl': 'dutch',
  • 'de': 'german',
  • 'da': 'danish',
  • 'fi': 'finnish',
  • 'hu': 'hungarian',
  • 'nb': 'norwegian',
  • 'ro': 'romanian',
  • 'ru': 'russian',
  • 'sv': 'swedish'

Stemming will not be applied (even if requested as a parameter) for languages other than those listed above.

NLTK Stopwords

In terms of stopwords, PKE considers the NLTK stopwords for the following languages:

  • 'ar': 'arabic',
  • 'az': 'azerbaijani',
  • 'da': 'danish',
  • 'nl': 'dutch',
  • 'en': 'english',
  • 'fi': 'finnish',
  • 'fr': 'french',
  • 'de': 'german',
  • 'el': 'greek',
  • 'hu': 'hungarian',
  • 'id': 'indonesian',
  • 'it': 'italian',
  • 'kk': 'kazakh',
  • 'ne': 'nepali',
  • 'nb': 'norwegian',
  • 'pt': 'portuguese',
  • 'ro': 'romanian',
  • 'ru': 'russian',
  • 'es': 'spanish',
  • 'sv': 'swedish',
  • 'tr': 'turkish'

To download these stopwords, proceed as follows:

python -m nltk.downloader stopwords

In addition, we make use of an extended list of stopwords downloaded from here, which is bundled with the KEP package. These are naturally installed upon installing the package.

Create the data Folder and Download the dfs, lda and pickle Models

Create a folder named 'data' at the same level as the notebook, with the following structure:

  • Datasets: the folder where the datasets go. You can find 20 datasets ready to download here. Each dataset should be unzipped into this folder. For instance, if you want to play with the Inspec dataset, you should end up with the following structure: data\Datasets\Inspec
  • Keyphrases: the folder where the keyphrases are written by the system. For instance, if you later run the YAKE! keyword extraction algorithm on top of the Inspec collection, you will end up with the following structure: data\Keyphrases\YAKE\Inspec. It is not mandatory to create the 'Keyphrases' folder manually, as the system creates it automatically if it does not exist.
  • Models: some algorithms (such as TF.IDF, KPMiner, TopicalPageRank and the supervised KEA) require a number of models in order to run. To speed up this process, we make them available for download here. Once downloaded, put them under the data\Models folder. If you decide not to download them, the system will automatically create the 'Models' folder and save the corresponding models there. Note, however, that this takes a long time, so downloading them in advance is the better option.
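The layout above can be bootstrapped with a few lines of Python (a convenience sketch of our own, not KEP code; as noted, KEP creates the 'Keyphrases' and 'Models' folders on its own when they are missing):

```python
import os

def create_data_layout(root="data"):
    """Create the 'data' folder with its 'Datasets', 'Keyphrases' and
    'Models' subfolders, leaving any existing content untouched."""
    subfolders = ("Datasets", "Keyphrases", "Models")
    for sub in subfolders:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    return [os.path.join(root, sub) for sub in subfolders]
```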

RUN

Run Jupyter notebooks

To run KEP, we provide a Python notebook. If you followed a standalone installation, please download it here so that everything is set up. If you followed a Docker installation, you don't need to worry about this.

Run Code

Alternatively, you can use the files we provide under the kep/tests folder of the KEP package.

  • ExtractKeyphrases_From_SingleDoc.py: extracts keyphrases from a single document.
  • ExtractKeyphrases_From_MultipleDocs.py: extracts keyphrases from multiple documents.
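Both scripts follow pke's standard extraction pipeline: load a document, select candidates, weight them, and take the n best. The sketch below shows the general shape of that pipeline for a single document; the TfIdf choice and the word-frequency fallback (used only when pke or its resources are unavailable, to keep the sketch runnable) are our own illustration, not KEP's exact code:

```python
import collections
import re

try:
    import pke  # the package KEP builds on
except ImportError:
    pke = None

def extract_keyphrases(text, n=10, language="en"):
    """Return the top-n keyphrases of a single document as (phrase, score)
    pairs, using pke's TfIdf model when available and a naive
    word-frequency baseline otherwise."""
    if pke is not None:
        try:
            extractor = pke.unsupervised.TfIdf()
            extractor.load_document(input=text, language=language)
            extractor.candidate_selection()
            extractor.candidate_weighting()
            return extractor.get_n_best(n=n)
        except Exception:
            pass  # fall back below if models or resources are missing
    # Fallback: count words of length >= 4 and return the most frequent.
    words = re.findall(r"[a-z]{4,}", text.lower())
    return collections.Counter(words).most_common(n)
```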
