(Canonical) Citation Extractor

Status

DOI Build Status codecov

Installation

This software supports Python 2.7 and has been tested only on POSIX-compliant operating systems (Linux, Mac OS X, FreeBSD, etc.).
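As a rough illustration of the requirement above, the sketch below reports whether the current interpreter and operating system match what the library was tested on. The function name `environment_problems` is hypothetical and is not part of the CitationExtractor.

```python
import os
import sys


def environment_problems(version_info=sys.version_info, os_name=os.name):
    """Return a list of human-readable mismatches (empty if none)."""
    problems = []
    major_minor = tuple(version_info[:2])
    if major_minor != (2, 7):
        problems.append(
            "Python 2.7 is the supported version, found %d.%d" % major_minor
        )
    # os.name is "posix" on Linux, Mac OS X and FreeBSD.
    if os_name != "posix":
        problems.append(
            "only POSIX-compliant systems were tested, found %r" % os_name
        )
    return problems


if __name__ == "__main__":
    for problem in environment_problems():
        print(problem)
```

An empty result only means the environment matches the tested configuration; it says nothing about whether the dependencies are installed.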

Installing TreeTagger

The CitationExtractor relies on TreeTagger for the PoS tagging of input texts.

There is a handy script to install it.

To run it without having to clone this repo:

wget -O install_treetagger.sh https://raw.githubusercontent.com/mromanello/CitationExtractor/master/install_treetagger.sh
chmod a+x install_treetagger.sh
./install_treetagger.sh
rm install_treetagger.sh

Otherwise, clone the repository and run the script from there:

git clone https://github.com/mromanello/CitationExtractor.git
cd CitationExtractor
chmod a+x install_treetagger.sh
./install_treetagger.sh

With pip

To install the CitationExtractor, first run:

$ pip install http://www.antlr3.org/download/Python/antlr_python_runtime-3.1.3.tar.gz#egg=antlr_python_runtime-3.1.3
$ pip install https://github.com/mromanello/treetagger-python/archive/master.zip#egg=treetagger-1.0.1

followed by:

$ pip install citation-extractor

NB: setup.py handles the installation of all other dependencies, but for some reason (that I'm still trying to figure out) it does not pick up these two.
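Since these two packages have to be installed by hand, a quick import check can confirm that they ended up in your environment. This is a minimal sketch: the import names `antlr3` and `treetagger` are assumptions about what the two packages expose, and `missing_modules` is a hypothetical helper, not part of the library.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the two manually installed packages;
# adjust them if your installation exposes different modules.
MANUAL_DEPENDENCIES = ["antlr3", "treetagger"]

if __name__ == "__main__":
    for name in missing_modules(MANUAL_DEPENDENCIES):
        print("Missing module: %s" % name)
```

If the script prints nothing, both modules could be imported under the assumed names.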

Verify installation

To double-check that everything was installed correctly, try running the following lines in a Python interpreter (it should take ~20s):

from citation_extractor.settings import crfsuite
from citation_extractor.pipeline import get_extractor
extractor = get_extractor(crfsuite)
assert extractor is not None

If the code above runs without throwing exceptions, you managed to install the library!
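If you would rather get a short report than a traceback, the same check can be wrapped in a function that also measures how long loading takes. This is only a sketch: `verify_installation` is a hypothetical name, and the body simply reuses the imports and the `get_extractor(crfsuite)` call shown above.

```python
import time


def verify_installation():
    """Run the check above and return (ok, seconds, error)."""
    start = time.time()
    try:
        from citation_extractor.settings import crfsuite
        from citation_extractor.pipeline import get_extractor
        extractor = get_extractor(crfsuite)
        ok, error = extractor is not None, None
    except Exception as exc:  # report any failure instead of a traceback
        ok, error = False, exc
    return ok, time.time() - start, error


if __name__ == "__main__":
    ok, seconds, error = verify_installation()
    if ok:
        print("Installation looks fine (%.1fs)" % seconds)
    else:
        print("Installation check failed: %r" % error)
```

A failure caused by a missing module usually points to one of the two packages that have to be installed manually.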

Documentation

I'm working on it ;-)

For the time being, you can find a concrete example of how to use the library in this notebook.