An implementation of the ClausIE information extraction system for Python and spaCy.
Disclaimer: This is not meant to be a 1-1 implementation of the algorithm (which is impossible, since spaCy dependencies are used instead of the Stanford Dependencies used in the paper), but a clause extraction and text simplification library I keep for personal use.
I have made some modifications.
- I did some exploration on how to better separate embedded clauses when using SpaCy dependencies.
- I provide the ability to inflect the verbs, so that they are in a somewhat useful text form when generating propositions in text.
This allows the processing of complex sentences such as this:
A cat, hearing that the birds in a certain aviary were ailing dressed himself up as a physician,
and, taking his cane and a bag of instruments becoming his profession, went to call on them.
to produce propositions such as these:
['The birds were ailing.']
['A cat dressed himself as a physician.', 'A cat dressed himself.']
['A cat took his cane.', 'A cat took a bag.']
['A cat became his profession.']
['A cat went.']
['A cat called on them.']
- Rewrote it to match the algorithm in the paper more closely.
- Reimplemented it as a spaCy pipeline component (clauses are available under doc._.clauses).
- Added tests from the paper.
While this is my own re-implementation, the original research work (and the dictionaries) is attributed to Luciano Del Corro and Rainer Gemulla. If you use it in your code, please note that it contains slight modifications to make it work with the spaCy dependency parser, and also cite:
Del Corro, Luciano, and Rainer Gemulla: "ClausIE: clause-based open information extraction."
Proceedings of the 22nd international conference on World Wide Web. ACM, 2013.
It would be helpful to also cite this specific implementation if you are using it:
@InProceedings{chourdakis2018grammar,
author = {Chourdakis, E.T. and Reiss, J.D.},
title = {Grammar Informed Sound Effect Retrieval for Soundscape Generation},
booktitle = {DMRN+ 13: Digital Music Research Network One-day Workshop},
month = {November},
year = {2018},
address = {London, UK},
pages={9}
}
- Python 3
- spacy>=3.0.0
- lemminflect>=0.2.1 (only if using the inflect argument in to_propositions(as_text=True))
$ git clone https://github.com/mmxgn/spacy-clausie.git
$ cd spacy-clausie
$ python setup.py build
$ python setup.py install [--user]
# Optionally
$ python setup.py test
Or with pip:
python -m pip install git+https://github.com/mmxgn/spacy-clausie.git
Download the pipeline if necessary:
python -m spacy download en_core_web_sm
$ ipython
In [1]: import spacy
In [2]: import claucy
In [3]: nlp = spacy.load("en_core_web_sm")
In [4]: claucy.add_to_pipe(nlp)
In [5]: doc = nlp("AE died in Princeton in 1955.")
In [6]: doc._.clauses
Out[6]: [<SV, AE, died, None, None, None, [in Princeton, in 1955]>]
In [7]: propositions = doc._.clauses[0].to_propositions(as_text=True)
In [8]: propositions
Out[8]:
['AE died in Princeton in 1955',
'AE died in 1955',
'AE died in Princeton']
Setting as_text=False will instead give a list of tuples of spaCy spans:
In [9]: propositions = doc._.clauses[0].to_propositions(as_text=False)
In [10]: propositions
Out[10]:
[(AE, died, in Princeton, in 1955),
(AE, died, in 1955),
(AE, died, in Princeton)]
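The relationship between the clause and its propositions in the session above can be sketched in plain Python. This is only an illustration of the pattern visible in the output (one proposition carrying every adverbial, plus one per single adverbial), not claucy's actual code: expand_sv and to_text are hypothetical helper names, plain strings stand in for spaCy spans, and the order of the results may differ from what the library returns.

```python
from typing import List, Optional, Sequence, Tuple

Proposition = Tuple[str, ...]


def expand_sv(
    subject: str, verb: str, adverbials: Optional[Sequence[str]] = None
) -> List[Proposition]:
    """Expand a subject-verb clause into propositions (hypothetical helper)."""
    base = (subject, verb)
    adverbials = list(adverbials or [])
    if not adverbials:
        # No adverbials: the bare subject-verb pair is the only proposition.
        return [base]
    # First the proposition that keeps every adverbial ...
    propositions = [base + tuple(adverbials)]
    # ... then one proposition per single adverbial, if there are several.
    if len(adverbials) > 1:
        propositions.extend(base + (adv,) for adv in adverbials)
    return propositions


def to_text(proposition) -> str:
    """Join the parts of a proposition with spaces (hypothetical helper)."""
    # getattr(..., "text", ...) accepts spaCy spans as well as plain strings.
    return " ".join(str(getattr(part, "text", part)) for part in proposition)


props = expand_sv("AE", "died", ["in Princeton", "in 1955"])
for prop in props:
    print(prop, "->", to_text(prop))
```

Because to_text only reads a text attribute when one exists, the same helper should also work on the tuples returned by to_propositions(as_text=False).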
Copy problog/claucy_pl.py to the same directory as your ProbLog .pl files and include it in your scripts with:
:- use_module('claucy_pl.py').
Then use it via the claucy/4 predicate. An example can be seen in problog/test_clausie.pl:
:-use_module('claucy_pl.py').
query(claucy('Albert Einstein, a scientist of the 20th century, died in Princeton in 1955.',Predicate,Arg1,Arg2)).
You can run it with:
problog test_claucy.pl
and get the output:
claucy('Albert Einstein, a scientist of the 20th century, died in Princeton in 1955.',died,Albert Einstein,in 1955): 1
claucy('Albert Einstein, a scientist of the 20th century, died in Princeton in 1955.',died,Albert Einstein,in Princeton): 1
claucy('Albert Einstein, a scientist of the 20th century, died in Princeton in 1955.',is,Albert Einstein,a scientist): 1
The variable Predicate comes directly from the verb, and Arg1 and Arg2 are the first and second arguments.
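If you want to post-process these results outside ProbLog, the printed lines can be split back into their parts. The following is a minimal sketch that assumes the output format shown above and assumes that the predicate and both arguments contain no commas; Extraction and parse_result_line are hypothetical names and are not part of claucy or ProbLog.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as: claucy('<sentence>',<predicate>,<arg1>,<arg2>): <probability>
RESULT_LINE = re.compile(
    r"^claucy\('(?P<sentence>.*)',"
    r"(?P<predicate>[^,]*),(?P<arg1>[^,]*),(?P<arg2>[^,]*)\)"
    r":\s*(?P<probability>\S+)\s*$"
)


class Extraction(NamedTuple):
    sentence: str
    predicate: str
    arg1: str
    arg2: str
    probability: float


def parse_result_line(line: str) -> Optional[Extraction]:
    """Parse one printed result line; return None if it does not match."""
    match = RESULT_LINE.match(line.strip())
    if match is None:
        return None
    return Extraction(
        sentence=match["sentence"],
        predicate=match["predicate"].strip(),
        arg1=match["arg1"].strip(),
        arg2=match["arg2"].strip(),
        probability=float(match["probability"]),
    )


line = (
    "claucy('Albert Einstein, a scientist of the 20th century, "
    "died in Princeton in 1955.',died,Albert Einstein,in 1955): 1"
)
print(parse_result_line(line))
```

The sentence itself may contain commas, which is why the pattern anchors on the closing quote and then expects exactly three comma-free fields.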
This code is licensed under the GNU General Public License, version 3.0.