Rostlab/LocText

☝️ We moved

This library is not maintained anymore.

We moved the LocText model to the text annotation tool, tagtog:

tagtog, The Text Annotation Tool to Train AI




LocText

Text-mine relationships between proteins and cell compartments (Protein <--> Cell Compartment), i.e. cases where a protein is located in, or functions in, a cell compartment.

Run it on PubMed abstracts or on any string, including full text.

Publication: LocText: relation extraction of protein localizations to assist database curation

Requirements

Runs on Python >= 3.5.

Non-packaged dependencies (each software has its own dependencies):

Install

git clone https://github.com/Rostlab/LocText.git
cd LocText
# We recommend you use python virtualenv: https://pypi.python.org/pypi/virtualenv
pip install -r requirements.txt
python -m loctext.download_data

Run

Sample Script

python run.py --text "GCN2 was constitutively localized to the nucleolus or recruited to the nucleolus by amino acid starvation stress"

You should see something like the following:

# Predicted entities:
Entity(class_id: e_1, offset: 0, text: GCN2, norms: {'n_7': 'Q9P2K8,Q9LX30,Q9FIB4,P15442'})
Entity(class_id: e_2, offset: 41, text: nucleolus, norms: {'n_9': 'GO:0005730'})
Entity(class_id: e_2, offset: 71, text: nucleolus, norms: {'n_9': 'GO:0005730'})

# Predicted relations:
Relation(class_id:"r_5": e1:"Entity(class_id: e_1, offset: 0, text: GCN2, norms: {'n_7': 'Q9P2K8,Q9LX30,Q9FIB4,P15442'})"   <--->   e2:"Entity(class_id: e_2, offset: 41, text: nucleolus, norms: {'n_9': 'GO:0005730'})")
Relation(class_id:"r_5": e1:"Entity(class_id: e_1, offset: 0, text: GCN2, norms: {'n_7': 'Q9P2K8,Q9LX30,Q9FIB4,P15442'})"   <--->   e2:"Entity(class_id: e_2, offset: 71, text: nucleolus, norms: {'n_9': 'GO:0005730'})")
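If you want to post-process the printed entities, the lines above can be parsed with a regular expression. The following is a minimal sketch, not part of LocText: the pattern is inferred only from the sample output shown here, and the names `ENTITY_RE`, `parse_entities`, and `SAMPLE_OUTPUT` are hypothetical. The actual output format may differ between versions.

```python
import ast
import re

# Pattern inferred from the sample output above (an assumption, not a LocText API).
ENTITY_RE = re.compile(
    r"Entity\(class_id: (?P<class_id>\w+), offset: (?P<offset>\d+), "
    r"text: (?P<text>.*?), norms: (?P<norms>\{.*?\})\)"
)


def parse_entities(output):
    """Return one dict per printed 'Entity(...)' line; other lines are skipped."""
    entities = []
    for line in output.splitlines():
        match = ENTITY_RE.match(line)
        if match is None:
            continue  # comments, blank lines, and Relation(...) lines
        # The norms part is printed as a Python dict literal in the sample.
        norms = ast.literal_eval(match.group("norms"))
        entities.append({
            "class_id": match.group("class_id"),
            "offset": int(match.group("offset")),
            "text": match.group("text"),
            # Split the comma-separated identifiers into lists.
            "norms": {key: value.split(",") for key, value in norms.items()},
        })
    return entities


# Entity lines copied from the sample output above.
SAMPLE_OUTPUT = """\
# Predicted entities:
Entity(class_id: e_1, offset: 0, text: GCN2, norms: {'n_7': 'Q9P2K8,Q9LX30,Q9FIB4,P15442'})
Entity(class_id: e_2, offset: 41, text: nucleolus, norms: {'n_9': 'GO:0005730'})
Entity(class_id: e_2, offset: 71, text: nucleolus, norms: {'n_9': 'GO:0005730'})
"""

for entity in parse_entities(SAMPLE_OUTPUT):
    print(entity["class_id"], entity["offset"], entity["text"], entity["norms"])
```

In the sample, `e_1` entities carry UniProt accessions and `e_2` entities carry GO identifiers, so the parsed `norms` lists can be used directly for database lookups.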

Python API

Full documentation is due. For now:

For any issue or question about the LocText and nalaf code, please open an issue in the corresponding repository. Considerable chunks still require refactoring and documentation; don't hesitate to complain ;)

Development

We use pytest for testing.


To run a quick cross-validation of the LocText machine-learning model's performance, execute:

# Use --help for more possible arguments
python loctext/learning/train.py --model D0

At the end of the run, you should see something like:

Run Arguments:
	corpus_percentage = 1.0
	cv_with_test_set = False
	eval_corpus = None
	evaluate_only_on_edges_plausible_relations = False
	evaluation_level = 4
	evaluator = <nalaf.learning.evaluators.DocumentLevelRelationEvaluator object at 0x10802df98>
	feature_generators = LocText
	force_external_corpus_evaluation = False
	k_num_folds = 5
	load_model = None
	model = D0
	predict_entities = []
	save_model = None
	training_corpus = LocText
	---
	Using libraries versions: numpy == 1.11.2, scipy == 0.18.1, scikit-learn == 0.18.1, spacy == 1.2.0

Training corpus stats:
	#documents: 100
	#relations total: 1345
	#relations prot<-->loc: 550
	#entities: Counter({'e_1': 1393, 'e_2': 558, 'e_3': 277})
	#sentences: 1056
	#instances (edges): 663 -- #P=351 vs. #N=312
	#plausible relations from edges: 351
	#features: 302

# class	tp	fp	fn	fp_ov	fn_ov	e|P	e|R	e|F	e|F_SE	o|P	o|R	o|F	o|F_SE
r_5	214	18	89	0	0	0.9224	0.7063	0.8000	0.0031	0.9224	0.7063	0.8000	0.0031
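
As a sanity check on the results row above, the P, R, and F columns can be recomputed from the `tp`, `fp`, and `fn` counts. This is a minimal sketch assuming the columns are the standard precision, recall, and F1 measures; the function name `precision_recall_f1` is hypothetical and not part of LocText or nalaf.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the r_5 row above: tp=214, fp=18, fn=89.
p, r, f = precision_recall_f1(214, 18, 89)
print("{:.4f}\t{:.4f}\t{:.4f}".format(p, r, f))  # 0.9224	0.7063	0.8000
```

The recomputed values match the reported 0.9224, 0.7063, and 0.8000, which is consistent with that assumption.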