# XML to Sentence Format Demo

This notebook shows how to convert annotated XML files into a sentence-based format that can be used to train other models.

In [1]:
import medtator_kits as mtk
import sentence_kits as stk

# force reload mtk and stk so that local changes to the modules are picked up
import importlib
importlib.reload(mtk)
importlib.reload(stk)

import copy
import random

# for nicer display
from IPython.display import display, HTML

# load sentence detection
import pysbd
# load spaCy and set up the tokenizer
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

print('* loaded all libraries')

* loaded all libraries


## Load Data

In [2]:
path = '../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/'
rst = mtk.parse_xmls(path)
print(rst['stat'])

* checking path ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc2.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc3.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc1.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc4.txt.xml
* checked 4 files
* found 4 XML files
* skipped 0 non-XML files
{'total_files': 4, 'total_xml_files': 4, 'total_other_files': 0, 'total_tags': 52}


## Parse and convert format

We want to convert the text into a sentence-based format for downstream tasks (e.g., training a relation extraction model). This first requires a sentencizer and a tokenizer.
You can use any library for this purpose.

Here, we use `pySBD` for [sentence boundary detection](https://github.com/nipunsadvilkar/pySBD).

```Python
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(text))
# [TextSpan(sent='My name is Jonas E. Smith. ', start=0, end=27), TextSpan(sent='Please turn to p. 55.', start=27, end=48)]
```

For tokenization, we use `spaCy`'s [Tokenizer](https://spacy.io/api/tokenizer).

```Python

# Use the default English tokenizer
from spacy.lang.en import English
nlp = English()
tokenizer = nlp.tokenizer
tokens = tokenizer("This is a sentence for tokens.")
# keep only the text of each token
sentence_tokens = [token.text for token in tokens]
print(sentence_tokens)
# ['This', 'is', 'a', 'sentence', 'for', 'tokens', '.']
```

Second, we need a way to map the character spans of the tags to token indexes.
In `sentence_kits.py`, we implemented the function `update_ents_token_index()` for this purpose. It checks whether the span of a tag overlaps with any token of a sentence.
The function then updates each entity by adding a new property, `token_index`, which is a list of token indexes.

For more details, please check the following section, *Sentence-based format*, and the source code in `sentence_kits.py`.
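
The overlap check described above can be sketched as follows. This is only an illustration of the idea, not the actual code of `update_ents_token_index()`: the helpers `tokenize_with_offsets` and `span_to_token_index` are hypothetical, a simple regex stands in for the spaCy tokenizer, and the entity span is assumed to be given as character offsets relative to the sentence.

```python
import re

def tokenize_with_offsets(sentence):
    """Split a sentence into (token, start, end) triples.

    Assumption: a simple regex stands in for the spaCy tokenizer.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", sentence)]

def span_to_token_index(tokens, span_start, span_end):
    """Return [first, last] indexes of the tokens overlapping the span, or None."""
    hits = [i for i, (_, start, end) in enumerate(tokens)
            if start < span_end and end > span_start]
    return [hits[0], hits[-1]] if hits else None

sentence = "this is a sentence."
tokens = tokenize_with_offsets(sentence)
# the entity "a sentence" covers characters 8 to 18 (end exclusive)
token_index = span_to_token_index(tokens, 8, 18)
print([t for t, _, _ in tokens])
# ['this', 'is', 'a', 'sentence', '.']
print(token_index)
# [2, 3]
```

Two ranges overlap when each one starts before the other ends, which is why the check compares `start < span_end` and `end > span_start`.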

### Sentence-based format

In the following demo, we will convert each XML file into a sentence-based format, which looks like the following:

```json
{
    "text": "The full text of the file",
    "sentence_tags": [{
        "sentence": "this is a sentence.",
        "sentence_tokens": ["this", "is", "a", "sentence", "."],
        "spans": [start, end],
        "entities": [{
            "id": "A1",
            "text": "a sentence",
            "token_index": [2, 3]
            // "token_index": [start_token_idx, end_token_idx]
        }],
        "relations": [{
            // the same format as relation in XML
        }]
    }]
}
```
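
To make the role of `token_index` concrete, the sketch below rebuilds the tokens of each entity from a record in this format. The record is hand-written to mirror the schema above (the `spans` values are illustrative), and `entity_tokens` is a hypothetical helper, not part of `sentence_kits.py`. It assumes that the end index is inclusive, which is how the display code later in this notebook treats it.

```python
def entity_tokens(sentag, ent):
    """Return the tokens covered by an entity (the end index is inclusive)."""
    first, last = ent["token_index"]
    return sentag["sentence_tokens"][first:last + 1]

# hand-written record following the schema above
sentag = {
    "sentence": "this is a sentence.",
    "sentence_tokens": ["this", "is", "a", "sentence", "."],
    "spans": [0, 19],  # illustrative character offsets
    "entities": [{"id": "A1", "text": "a sentence", "token_index": [2, 3]}],
    "relations": [],
}

for ent in sentag["entities"]:
    print(ent["id"], entity_tokens(sentag, ent))
# A1 ['a', 'sentence']
```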

In [3]:
# let's check the sentence-based format for the given samples

ann_sents = stk.convert_anns_to_sentags(rst['anns'])
print("* got %s ann_sents" % (len(ann_sents)))

* got 4 ann_sents


In [4]:
# let's see what the results look like
# the following code is just for reference; feel free to use and modify it for your purpose
for ann_idx, ann_sent in enumerate(ann_sents):
    print('*' * 30, ann_idx, '*' * 30)
    for sentag in ann_sent['sentence_tags']:
        if len(sentag['relations'])>0:
            tokens = copy.copy(sentag['sentence_tokens'])
            # print(tokens)
            for ent in sentag['entities']:
                # pick a random light color so the text is readable on a black background
                color = ''.join([random.choice('9ABCDEF') for _ in range(6)])
                for idx in range(ent['token_index'][0], ent['token_index'][1] + 1):
                    tokens[idx] = '<span style="color:#%s;background:black;">%s</span>' % (
                        color,
                        tokens[idx]
                    )
            display(HTML("[ `" + "` | `".join(tokens) + "`]"))

****************************** 0 ******************************


****************************** 1 ******************************


****************************** 2 ******************************


****************************** 3 ******************************


In the rendered output, each token is wrapped in a pair of backticks (`` ` ``), and the tokens are separated by `|`.
The entities are located by their token indexes and highlighted in color.
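
Outside a notebook, where `display(HTML(...))` is not available, the same token indexes can drive a plain-text rendering. The following is a minimal sketch under that assumption: `mark_entities` is a hypothetical helper, and the sample record is hand-written rather than taken from the parsed files.

```python
def mark_entities(sentag):
    """Wrap each entity's token range in [ID ...] markers and join the tokens."""
    tokens = list(sentag["sentence_tokens"])  # copy so the input is not modified
    for ent in sentag["entities"]:
        first, last = ent["token_index"]  # inclusive range
        tokens[first] = "[%s %s" % (ent["id"], tokens[first])
        tokens[last] = tokens[last] + "]"
    return " ".join(tokens)

sample = {
    "sentence_tokens": ["this", "is", "a", "sentence", "."],
    "entities": [{"id": "A1", "text": "a sentence", "token_index": [2, 3]}],
}
print(mark_entities(sample))
# this is [A1 a sentence] .
```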