Relna is a Text Mining (TM) tool for relation extraction for transcription factors and gene / gene products. To the best of our knowledge, it is the first text mining tool for relation extraction of transcription factors and associated proteins. It is part of a thesis at Technical University, Munich. The tool is built on the nalaf framework, developed as part of two other theses done at Technical University, Munich. It is generic enough to be extended with custom modules, e.g. parsers, features, taggers, etc. The method uses Support Vector Machines and allows for the use of Tree Kernels.
The nalaf framework is well documented here.
As part of the thesis, an associated corpus of the same name (relna) was annotated using tagtog. The relna corpus consists of 140 documents that have been semi-automatically annotated using GNormPlus for named entities and manually annotated for relations. The motivation for relation extraction for transcription factors and gene / gene products, as well as the corpus statistics, are documented here.
Using our method, we achieve an F-measure of 69.3% on the relna corpus. The full results of our experiments are available here.
The pipeline used by relna is as follows:
Requirements:
- Python 3
- SVMLight, linear vs tree kernel:
  - The default is to use SVMLight with linear kernels, already defined in https://github.com/Rostlab/nalaf.
  - If using SVMLight TK for tree kernels:
    - BLLIP Parser
    - SVMLight-TK-1.2
      - The easiest way to install it is to download compiled binaries from the official website.
      - You will have to fill out a form to get this, and make the build using the given Makefile.
      - Place the binaries svm_classify and svm_learn in your $PATH (note that, as of now, this is also needed in nalaf for SVMLight)
- Installation of nalaf
git clone https://github.com/Rostlab/nalaf
cd nalaf
python3 setup.py install
python3 -m nalaf.download_corpora
- Installation of relna
git clone https://github.com/Rostlab/relna.git
cd relna
python3 setup.py install
python3 -m relna.download_corpora
Eventually, once the package is registered on PyPI, you will be able to install relna simply with:
pip3 install relna
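Since both nalaf and relna expect the SVMLight binaries svm_learn and svm_classify to be reachable through $PATH, it can help to check that before running anything. The following is a minimal sketch using only the Python standard library; the function name find_missing_binaries is a hypothetical helper for illustration and is not part of relna or nalaf.

```python
import shutil

# Binary names taken from the requirements above.
REQUIRED_BINARIES = ("svm_learn", "svm_classify")


def find_missing_binaries(names=REQUIRED_BINARIES, path=None):
    """Return the names that cannot be found as executables.

    `path` is an optional search path string; when None, the PATH
    environment variable is used (this is shutil.which's default).
    """
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    missing = find_missing_binaries()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All SVMLight binaries found on PATH.")
```

If a binary is reported as missing, adding the SVMLight directory to $PATH (or passing it with the -c option shown in the examples) is the likely remedy.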
Run relna.py for a simple example of how to use relna just for prediction with a pre-trained model:
python3 relna.py -c [PATH SVMLight BIN DIR] -p 10383460
python3 relna.py -c [PATH SVMLight BIN DIR] -s "Conclusion: we find that Ubc9 interacts with the androgen receptor (AR), a member of the steroid receptor family of ligand-activated transcription factors. In transiently transfected COS-1 cells, AR-dependent but not basal transcription is enhanced by the coexpression of Ubc9."
python3 relna.py -c [PATH SVMLight BIN DIR] -d example.txt
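The three invocations above differ only in their last option. As a rough sketch, they could be assembled from Python as shown below. The helper build_relna_command and the directory /opt/svm_light are hypothetical, and the reading of the options (-p for a PubMed ID, -s for a text string, -d for a document file) is inferred from the examples rather than from relna's documentation.

```python
import shlex


def build_relna_command(svm_bin_dir, pmid=None, text=None, doc_path=None,
                        script="relna.py"):
    """Build the argument list for one of the three example invocations.

    Exactly one of pmid, text, doc_path must be given. The option
    meanings are assumptions based on the README examples.
    """
    candidates = (("-p", pmid), ("-s", text), ("-d", doc_path))
    chosen = [(flag, value) for flag, value in candidates if value is not None]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of pmid, text, doc_path")
    flag, value = chosen[0]
    return ["python3", script, "-c", str(svm_bin_dir), flag, str(value)]


if __name__ == "__main__":
    # "/opt/svm_light" is a placeholder for [PATH SVMLight BIN DIR].
    cmd = build_relna_command("/opt/svm_light", pmid=10383460)
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True); building a list rather than a single shell string avoids quoting problems with long -s sentences.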
Future work:
- Implement neural networks (Theano or TensorFlow, when they release for Python 3) for training and classification, and evaluate their performance.
- Implement bootstrapping for relation extraction (similar to nalaf, where it has been done for entities).
- Implement multiple-sentence models, looking at relations at a distance of one sentence and beyond.
- Implement coreference resolution (might increase performance slightly).
- Experiment with Tree Kernels (SVMLight TK), which achieve a very high precision (P>91), to extract highly accurate relations from the entire PubMed. In the end, that may give better overall extraction results, since the lower recall (R~21) is compensated by the sheer size of PubMed.
- SpaCy plans to implement its own constituent parser; once it is available, replace BLLIP with SpaCy for speed and efficiency (no linking to external C/C++ libraries).
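To make the precision/recall trade-off of the Tree Kernel item concrete, the sketch below computes the F-measure from precision and recall. It assumes the balanced F1 definition (the text does not state which variant is reported), and it plugs in P = 0.91 and R = 0.21 as stand-ins for the approximate figures "P>91" and "R~21" quoted above.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the balanced F1 (harmonic mean)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Approximate tree-kernel figures from the text (assumed exact here).
    score = f_measure(0.91, 0.21)
    print(f"F1 = {score:.1%}")
```

With these stand-in values the F1 comes out at roughly 34%, well below the 69.3% reported for the default setup, which is why the item frames Tree Kernels as a high-precision option whose low recall would be offset by running over a very large corpus.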