# XML 2 Sentence Format Demo

This notebook shows how to convert annotated XML files into a sentence-based format that can be used to train other models.

In [12]:
import medtator_kits as mtk

# load spaCy and configure the sentencizer
from spacy.lang.en import English
nlp = English()
nlp.add_pipe("sentencizer")

# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

print('* loaded all libraries')

* loaded all libraries


## Load Files



In [10]:
path = '../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/'
rst = mtk.parse_xmls(path)
print(rst['stat'])

* checking path ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc2.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc3.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc1.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc4.txt.xml
* checked 4 files
* found 4 XML files
* skipped 0 non-XML files
{'total_files': 4, 'total_xml_files': 4, 'total_other_files': 0, 'total_tags': 52}


## Parse and convert format

We want to convert the text into a sentence-based format for downstream tasks (e.g., training a relation extraction model), so we first need a sentencizer and a tokenizer.
Any library can be used for this purpose.

Here, we use `spaCy`'s [Sentencizer](https://spacy.io/api/sentencizer) and [Tokenizer](https://spacy.io/api/tokenizer).

```Python

# Use Sentencizer
from spacy.lang.en import English
nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent)

```

Second, we need a way to map character spans to token indices.

In [13]:
tokens = tokenizer("This is a sent")

tokens

This is a sent

In [20]:
list(tokens)[0].text

'This'
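
To make the span-to-token mapping concrete, here is a minimal sketch. It assumes spans are half-open character ranges `[start, end)` and uses a small regex tokenizer as a stand-in so the example runs without spaCy. The helper names `tokenize_with_offsets` and `char_span_to_token_span` are hypothetical and not part of `medtator_kits` or spaCy. With spaCy tokens, the start offset should be available as `token.idx`, so the same overlap logic would apply.

```python
import re

def tokenize_with_offsets(text):
    """Return (token_text, start_char, end_char) triples.

    Stand-in tokenizer (assumption): words and single punctuation marks.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_token_span(tokens, start, end):
    """Map a half-open character span to a half-open token index span.

    Returns None when no token overlaps the character span.
    """
    hits = [i for i, (_, s, e) in enumerate(tokens) if s < end and start < e]
    if not hits:
        return None
    return hits[0], hits[-1] + 1

sentence = "This is a sent"
tokens = tokenize_with_offsets(sentence)

# Characters 5..9 cover "is a" in the sentence above.
token_span = char_span_to_token_span(tokens, 5, 9)
print(token_span, [t[0] for t in tokens[token_span[0]:token_span[1]]])
```

Note that the conversion loop later in this notebook tokenizes `sent.text`, so token offsets are relative to the sentence. A span expressed in document coordinates would first need `sent.start_char` subtracted from both ends.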

### Sentence-based format

In the following demo, we will convert each XML file into a sentence-based format, which looks like the following:

```json
{
    "text": "The full text of the file",
    "sentence_tags": [{
        "sentence": "a sentence",
        "sentence_tokens": ["a", "sentence"],
        "spans": [start, end],
        "entities": [{

        }],
        "relations": [{

        }]
    }]
}
```

In [11]:
# a helper function for checking whether two spans overlap,
# e.g., a tag span `a` and a sentence span `b`, each given as [start, end]
def is_overlapped(a, b):
    # the start of a falls inside b
    if a[0] >= b[0] and a[0] < b[1]:
        return True
    
    # the end of a falls inside b
    if a[1] > b[0] and a[1] <= b[1]:
        return True
    
    # a fully contains b
    if a[0] <= b[0] and a[1] >= b[1]:
        return True
    
    # b fully contains a
    if b[0] <= a[0] and b[1] >= a[1]:
        return True
        
    return False
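
As a sanity check on the helper above, the sketch below compares a condensed copy of the four-branch test with the usual two-comparison form for half-open spans. It assumes non-empty spans (`start < end`); for empty spans the two versions can disagree. The name `spans_overlap` is introduced here for illustration only.

```python
from itertools import product

def is_overlapped(a, b):
    """Four-branch check, condensed from the notebook's helper."""
    return (
        (b[0] <= a[0] < b[1])                  # start of a inside b
        or (b[0] < a[1] <= b[1])               # end of a inside b
        or (a[0] <= b[0] and a[1] >= b[1])     # a contains b
        or (b[0] <= a[0] and b[1] >= a[1])     # b contains a
    )

def spans_overlap(a, b):
    """Compact form: each span starts before the other one ends."""
    return a[0] < b[1] and b[0] < a[1]

# Enumerate every non-empty span with integer bounds in 0..5.
spans = [(s, e) for s, e in product(range(6), repeat=2) if s < e]

# Collect every pair where the two implementations disagree.
mismatches = [(a, b) for a, b in product(spans, repeat=2)
              if is_overlapped(a, b) != spans_overlap(a, b)]
print(len(mismatches))
```

Under both versions, spans that merely touch, such as `(0, 5)` and `(5, 9)`, are not treated as overlapping.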

In [24]:
# results in the new sentence-based format
rs = []

# process each annotated document
for ann in rst['anns']:
    # first, create a record
    r = {
        "text": ann['text'],
        "sentence_tags": []
    }

    # split the text into sentences
    doc = nlp(ann['text'])
    for sent in doc.sents:
        # tokenize the sentence and keep only the token text
        tokens = [t.text for t in tokenizer(sent.text)]
        st = {
            'sentence': sent.text,
            'sentence_tokens': tokens,
            # character offsets of the sentence in the full text
            'spans': [
                sent.start_char,
                sent.end_char
            ]
        }

        r['sentence_tags'].append(st)

    # finally, save this record
    rs.append(r)

print("* done!")

* done!
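
The `entities` list of each sentence could then be filled using an overlap check. The sketch below is only an illustration: the tag dictionaries (keys `tag`, `text`, `span`), the sample text, and the function `attach_entities` are invented here and do not reflect the actual structure returned by `mtk.parse_xmls`, which would need to be adapted accordingly.

```python
def spans_overlap(a, b):
    """Overlap test for non-empty half-open spans [start, end)."""
    return a[0] < b[1] and b[0] < a[1]

def attach_entities(sentence_tags, tags):
    """Add an 'entities' list to each sentence record.

    `tags` uses a hypothetical format: {'tag', 'text', 'span'} with
    'span' given as document-level character offsets.
    """
    for st in sentence_tags:
        sent_start = st["spans"][0]
        st["entities"] = []
        for tag in tags:
            if spans_overlap(tag["span"], st["spans"]):
                st["entities"].append({
                    "tag": tag["tag"],
                    "text": tag["text"],
                    # shift offsets so they are relative to the sentence
                    "span_in_sentence": [
                        tag["span"][0] - sent_start,
                        tag["span"][1] - sent_start,
                    ],
                })
    return sentence_tags

# Made-up example data, shaped like the sentence records built above.
sentence_tags = [
    {"sentence": "Patient has fever.", "spans": [0, 18]},
    {"sentence": "No cough reported.", "spans": [19, 37]},
]
tags = [
    {"tag": "SYMPTOM", "text": "fever", "span": [12, 17]},
    {"tag": "SYMPTOM", "text": "cough", "span": [22, 27]},
]

result = attach_entities(sentence_tags, tags)
for st in result:
    print(st["sentence"], [e["text"] for e in st["entities"]])
```

Storing sentence-relative offsets keeps each sentence record usable on its own, without needing the full document text to locate an entity.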
