Formats

David Campos edited this page Oct 14, 2016 · 4 revisions

Input

Raw

Description available soon.

XML

Description available soon.

BioC

Description available soon.

Pubmed

Description available soon.

BioMed Central

Description available soon.

Output

BioC

Description available soon.

A1

Description available soon.

CoNLL

Description available soon.

JSON

Description available soon.

BC2

For an annotated corpus, this format requires two files: Sentences and Annotations. For an unannotated corpus, only the Sentences file is required.

The sentences file contains one sentence per line. Each line holds a unique identifier followed by the sentence text, separated by a white space. The identifier itself must not contain white space.

P00001606T0076 Comparison with alkaline phosphatases and 5-nucleotidase
P00008171T0000 Pharmacologic aspects of neonatal hyperbilirubinemia.
P00008997A0472 When CSF [HCO3-] is shown as a function of CSF PCO2 the data of K-depleted rats are no longer displaced when compared to controls but still have a significantly greater slope (1.21 +/- 0.23 vs.
P00010943A0733 Flurazepam thus appears to be an effective hypnotic drug with the optimum dose for use in general practice being 15 mg at night.
P00012653T0045 Beta blocking agents.
P00013683A0210 When extracorporeal CO2 removal approximated CO2 production (VCO2), alveolar ventilation almost ceased.
P00015731A0090 Intravenous administration (25 mg/kg) of carbonic anhydrase inhibitors (acetazolamide, methazolamide, dichlorphenamide, sulthiame) induced an early important rise of cortical p O2, which is not dependent on increase of p O2 and p CO2 and decrease of pH in arterial blood.
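As a minimal sketch of how such a line could be read (this is not Neji's own reader, and the function name `parse_sentence_line` is made up for illustration), the identifier can be split from the sentence at the first white space:

```python
def parse_sentence_line(line):
    """Split one Sentences-file line into (identifier, sentence text).

    Hypothetical helper: assumes the identifier is everything before the
    first run of white space, as described above.
    """
    parts = line.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty line")
    identifier = parts[0]
    # A line holding only an identifier yields an empty sentence.
    text = parts[1] if len(parts) > 1 else ""
    return identifier, text


sample = "P00012653T0045 Beta blocking agents."
sentence_id, sentence = parse_sentence_line(sample)
print(sentence_id)  # P00012653T0045
print(sentence)     # Beta blocking agents.
```

Splitting only once keeps any further white space inside the sentence text untouched.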

The annotations file contains one annotation per line, in the format SENTENCE_ID|FIRST_CHAR LAST_CHAR|TEXT. FIRST_CHAR and LAST_CHAR are character positions counted with white spaces discarded. In the examples below, counting starts at 0 and LAST_CHAR is the position of the last character of the annotated text.

P00001606T0076|14 33|alkaline phosphatases
P00001606T0076|37 50|5-nucleotidase
P00015731A0090|36 52|carbonic anhydrase
P00024600A0522|11 13|HMG
P00027739T0000|0 28|Serum gamma glutamyltransferase
P00027967A0207|11 31|secretory HI antibodies
P00029953T0045|17 22|lipase
P00030183T0000|33 38|HLA-B5
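The offset convention is easiest to see in code. The sketch below is an illustration rather than Neji's implementation. It assumes zero-based, inclusive offsets, which is what the first sample annotation implies when checked against its sentence. The helper names `parse_annotation_line` and `span_without_whitespace` are hypothetical.

```python
def parse_annotation_line(line):
    """Split SENTENCE_ID|FIRST_CHAR LAST_CHAR|TEXT into its four fields."""
    # maxsplit=2 keeps any '|' that appears inside the annotated text.
    sentence_id, offsets, text = line.rstrip("\n").split("|", 2)
    first, last = (int(n) for n in offsets.split())
    return sentence_id, first, last, text


def span_without_whitespace(sentence, first, last):
    """Return the substring whose non-white-space characters are numbered
    first..last (zero-based, inclusive; an assumption taken from the samples)."""
    count = 0            # index of the current non-white-space character
    start = end = None
    for i, ch in enumerate(sentence):
        if ch.isspace():
            continue     # white space does not advance the counter
        if count == first:
            start = i
        if count == last:
            end = i + 1
            break
        count += 1
    if start is None or end is None:
        raise ValueError("offsets fall outside the sentence")
    return sentence[start:end]


sentence = "Comparison with alkaline phosphatases and 5-nucleotidase"
_, first, last, text = parse_annotation_line(
    "P00001606T0076|14 33|alkaline phosphatases")
print(span_without_whitespace(sentence, first, last) == text)  # True
```

The recovered substring keeps its internal white space, so it can be compared directly with the TEXT field as a consistency check.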

JNLPBA

This format uses a single file that contains the abstracts, sentences, tokens and annotations. Each abstract must be identified by its MEDLINE identifier, using the format ###MEDLINE:ID.

Each token is given on its own line, which contains the token and its label separated by a tab (\t). Labels follow the BIO encoding:

  • "B": the first token of an entity name;
  • "I": the remaining tokens of the entity name;
  • "O": tokens that are not part of any entity name.

Because the format supports several entity types, each "B" or "I" label carries a suffix naming the semantic type: -protein, -DNA, -RNA, -cell_type or -cell_line. Each sentence is a block of consecutive token lines, and sentences are separated by an empty line.

IL-2	B-DNA
gene	I-DNA
expression	O
and	O
NF-kappa	B-protein
B	I-protein
activation	O
through	O
CD28	B-protein
requires	O
reactive	O
oxygen	O
production	O
by	O
5-lipoxygenase	B-protein
.	O
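The layout described above can be read with a short loop. The following is a sketch under the stated rules (a ###MEDLINE:ID header, tab-separated token and label, a blank line between sentences). It is not Neji's reader, and `read_jnlpba` and `bio_to_entities` are hypothetical names.

```python
def read_jnlpba(lines):
    """Yield (medline_id, tokens) per sentence, where tokens is a list of
    (token, label) pairs. medline_id is None if no header was seen yet."""
    prefix = "###MEDLINE:"
    medline_id = None
    tokens = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(prefix):
            medline_id = line[len(prefix):]
        elif not line.strip():
            # An empty line closes the current sentence block.
            if tokens:
                yield medline_id, tokens
                tokens = []
        else:
            token, label = line.split("\t")
            tokens.append((token, label))
    if tokens:  # the last sentence may not be followed by an empty line
        yield medline_id, tokens


def bio_to_entities(tokens):
    """Group B-/I- labelled tokens into (semantic_type, entity_text) pairs."""
    entities = []
    current = None  # (semantic_type, [tokens]) of the entity being built
    for token, label in tokens:
        if label.startswith("B-"):
            current = (label[2:], [token])
            entities.append(current)
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)
        else:
            # "O", or an "I-" label with no matching open entity (skipped here).
            current = None
    return [(etype, " ".join(words)) for etype, words in entities]


sample = ["IL-2\tB-DNA", "gene\tI-DNA", "expression\tO", "and\tO",
          "NF-kappa\tB-protein", "B\tI-protein", "activation\tO"]
for _, sentence_tokens in read_jnlpba(sample):
    print(bio_to_entities(sentence_tokens))
# [('DNA', 'IL-2 gene'), ('protein', 'NF-kappa B')]
```

Joining entity tokens with a single space is a simplification for display, since the token lines do not record the original spacing.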

IeXML

Description available soon.

Pipe

Description available soon.

PipeExtended

Description available soon.

Base64

Description available soon.