# XML 2 Sentence Format Demo

This notebook shows how to convert annotated XML files into a sentence-based format that can be used for training other models.

In [14]:
import medtator_kits as mtk
import sentence_kits as stk

# force reload everything in mtk
import importlib
importlib.reload(mtk)
importlib.reload(stk)

import copy
import random

# for data processing
import numpy as np
import pandas as pd

# for nicer display
from IPython.display import display, HTML

# load sentence detection
import pysbd
# load spaCy and set up the tokenizer
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

print('* loaded all libraries')

* loaded all libraries


## Load Data

In [29]:
path = '../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/'
rst = mtk.parse_xmls(path)
print(rst['stat'])

* checking path ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc2.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc3.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc1.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc4.txt.xml
* checked 4 files
* found 4 XML files
* skipped 0 non-XML files
{'total_files': 4, 'total_xml_files': 4, 'total_other_files': 0, 'total_tags': 62}


## Parse and convert format

We want to convert the text into a sentence-based format for downstream tasks (e.g., training a relation extraction model), so we first need a sentencizer and a tokenizer.
You can use any library for this purpose.

Here, we use `pySBD` for [sentence boundary detection](https://github.com/nipunsadvilkar/pySBD).

```Python
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(text))
# [TextSpan(sent='My name is Jonas E. Smith. ', start=0, end=27), TextSpan(sent='Please turn to p. 55.', start=27, end=48)]
```

For tokenization, we use `spaCy`'s [Tokenizer](https://spacy.io/api/tokenizer).

```Python

# Use the spaCy tokenizer
from spacy.lang.en import English
nlp = English()
tokenizer = nlp.tokenizer
tokens = tokenizer("This is a sentence for tokens.")
sentence_tokens = list(map(lambda v: v.text, tokens))
sentence_tokens
# ['This', 'is', 'a', 'sentence', 'for', 'tokens', '.']
```

Second, we need a way to map the character spans of each tag to token indexes.
In `sentence_kits.py`, we implemented the function `update_ents_token_index()` for this purpose. It checks whether the spans of a tag overlap with any tokens of a sentence.
The function updates each entity by adding a new property, `token_index`, which is a list holding the indexes of the first and last tokens covered by the entity.

For more details, please check the following section, *Sentence-based format*, and the source code in `sentence_kits.py`.
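
To make the overlap idea concrete, here is a minimal sketch of mapping a character span to token indexes. It is not the implementation in `sentence_kits.py`: the helpers `tokenize_with_offsets()` and `span_to_token_index()` are hypothetical names, a simple regex stands in for the spaCy tokenizer, and spans are assumed to be half-open `(start, end)` offsets relative to the sentence (document-level spans would first need the sentence start subtracted).

```python
import re

def tokenize_with_offsets(sentence):
    """Split into word and punctuation tokens, keeping character offsets.
    (Hypothetical stand-in for a real tokenizer such as spaCy's.)"""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", sentence)]

def span_to_token_index(tokens, span):
    """Return [first, last] indexes of the tokens that overlap the
    half-open character span, or None if no token overlaps."""
    start, end = span
    hits = [i for i, (_, t_start, t_end) in enumerate(tokens)
            if t_start < end and start < t_end]
    return [hits[0], hits[-1]] if hits else None

sentence = "this is a sentence."
tokens = tokenize_with_offsets(sentence)
# characters 8..17 cover the text "a sentence"
token_index = span_to_token_index(tokens, (8, 18))
print(token_index)
```

Both ends of the returned pair are inclusive, which matches how `token_index` is used in the display code later in this notebook.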

### Sentence-based format

In the following demo, we will convert each XML file into a sentence-based format, which looks like the following:

```json
{
    "text": "The full text of the file",
    "sentence_tags": [{
        "sentence": "this is a sentence.",
        "sentence_tokens": ["this", "is", "a", "sentence", "."],
        "spans": [start, end],
        "entities": {
            "A1": {
                "id": "A1",
                "text": "this",
                "token_index": [0, 0]
                // other properties
            },
            "A2": {
                "id": "A2",
                "text": "a sentence",
                "token_index": [2, 3]
                // other properties
            }
        },
        "relations": {
            "R1": {
                "id": "R1",
                "link_EAID": "A1", // this
                "link_EBID": "A2", // a sentence
            }
        }
    }]
}
```
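
As a rough illustration of how this structure can be consumed, the sketch below rebuilds the example sentence as a Python dict and looks up the tokens of each entity in a relation. The helper `entity_tokens()` is a hypothetical name, and the `link_EAID` / `link_EBID` keys are taken from the example above; real data will use the property names defined in your annotation schema.

```python
# One sentence in the format shown above (only the fields used here)
sentag = {
    "sentence": "this is a sentence.",
    "sentence_tokens": ["this", "is", "a", "sentence", "."],
    "entities": {
        "A1": {"id": "A1", "text": "this", "token_index": [0, 0]},
        "A2": {"id": "A2", "text": "a sentence", "token_index": [2, 3]},
    },
    "relations": {
        "R1": {"id": "R1", "link_EAID": "A1", "link_EBID": "A2"},
    },
}

def entity_tokens(sentag, ent_id):
    """Return the tokens covered by an entity (token_index is inclusive)."""
    first, last = sentag["entities"][ent_id]["token_index"]
    return sentag["sentence_tokens"][first:last + 1]

# collect the token lists of both entities for every relation
pairs = [
    (entity_tokens(sentag, rel["link_EAID"]),
     entity_tokens(sentag, rel["link_EBID"]))
    for rel in sentag["relations"].values()
]
print(pairs)
```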

In [30]:
# let's check the sentence-based format for the given samples
ann_sents = stk.convert_anns_to_sentags(rst['anns'])
print("* got %s ann_sents" % (len(ann_sents)))

* got 4 ann_sents


In [36]:
# let's show what the results look like
# the following code is just for reference; you can use and modify it for your purpose
for ann_idx, (ann, ann_sent) in enumerate(zip(rst['anns'], ann_sents)):
    print('*' * 30, ann['_filename'], len(ann_sent['sentence_tags']), 'sent(s)', '*' * 30)
    for sentag in ann_sent['sentence_tags']:
        if len(sentag['relations'])>0: sign = '<b style="color:red;">HAS REL</b>:'
        else: sign = 'ENT ONLY:'

        # parse the tokens
        tokens = copy.copy(sentag['sentence_tokens'])
        # print(tokens)
        for ent in sentag['entities'].values():
            color = ''.join([random.choice('9ABCDEF') for j in range(6)])
            for idx in range(ent['token_index'][0], ent['token_index'][1] + 1):
                tokens[idx] = '<span style="color:#%s;background:black;">%s</span>' % (
                    color,
                    tokens[idx]
                )
        display(HTML(sign + "[ `" + "` | `".join(tokens) + "`]"))

****************************** A_doc2.txt.xml 3 sent(s) ******************************


****************************** A_doc3.txt.xml 3 sent(s) ******************************


****************************** A_doc1.txt.xml 5 sent(s) ******************************


****************************** A_doc4.txt.xml 18 sent(s) ******************************


As you can see, each token is shown between a pair of backticks.
The entities are located by their token indexes.

We can use this approach to explore the dataset.
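
Beyond the visual check, a few counts can help with exploration. The sketch below is one possible summary, assuming each sentence carries `entities` (with a `tag` property) and `relations` as in the format above. The function `summarize()` and the small mock input are made up for illustration; in the notebook you would pass `ann_sents` instead.

```python
from collections import Counter

def summarize(ann_sents):
    """Count sentences, sentences with relations, and entities per tag."""
    n_sents = 0
    n_with_rel = 0
    tag_counts = Counter()
    for ann_sent in ann_sents:
        for sentag in ann_sent["sentence_tags"]:
            n_sents += 1
            if sentag["relations"]:
                n_with_rel += 1
            tag_counts.update(ent["tag"] for ent in sentag["entities"].values())
    return {"sentences": n_sents,
            "with_relations": n_with_rel,
            "tags": dict(tag_counts)}

# made-up input with the same shape as `ann_sents`
mock_ann_sents = [{
    "sentence_tags": [
        {"entities": {"A1": {"tag": "AE"}, "S1": {"tag": "SVRT"}},
         "relations": {"R1": {}}},
        {"entities": {"A2": {"tag": "AE"}}, "relations": {}},
    ]
}]
summary = summarize(mock_ann_sents)
print(summary)
```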

# Demo 1: Adverse event severity relation detection

In this demo, we use the toolkits in this folder to build a tiny dataset for training a model that detects the severity of adverse events.

Please run the above cells to load the sample dataset.
After loading, all documents are available in the variable `ann_sents`.
Now, let's convert the data into a tiny training set.

## Prepare dataset

First of all, we create a tiny dataset from the annotated corpus.

In [38]:
ds = []

# the prop prefix for the Adverse Event and Severity
prop_AE = 'link_AE'
prop_SVRT = 'link_SVRT'

for ann_idx, (ann_sent, ann) in enumerate(zip(ann_sents, rst['anns'])):
    # for each annotation file,
    # we want to extract the annotated relations as positive training samples.
    # there can be multiple sentences in a file,
    # so we need to check each sentence
    for sentag in ann_sent['sentence_tags']:
        if len(sentag['relations'])==0:
            # Oh, this sentence doesn't have a relation annotated
            # but we can use this sentence for building negative samples
            # as the SVRT tags and AE tags have no relations
            # this is just a demo, please change it accordingly.
            stags_ae = []
            stags_svrt = []
            for tag in sentag['entities'].values():
                if tag['tag'] == 'AE': stags_ae.append(tag)
                elif tag['tag'] == 'SVRT': stags_svrt.append(tag)
            # skip this sentence unless it has
            # at least one AE tag and one SVRT tag
            if len(stags_ae)==0 or len(stags_svrt)==0: continue

            # OK, pair each ae and svrt as NEGATIVE sample
            for _tag_a in stags_ae:
                for _tag_s in stags_svrt:
                    # create a data item
                    d = {
                        "AE": _tag_a,
                        "SVRT": _tag_s,
                        "sentence": sentag['sentence'],
                        "tokens": sentag['sentence_tokens'],
                        "ann_idx": ann_idx,
                        "y": 0, # pairs without an annotated relation are labeled 0 (NEGATIVE)
                    }

                    ds.append(d)

        else:
            # Ok, this sentence may have several relations
            # let's check one by one
            for rel in sentag['relations'].values():
                # get the AE
                ent_ae = sentag['entities'][rel['%sID' % prop_AE]]
                ent_svrt = sentag['entities'][rel['%sID' % prop_SVRT]]

                # ok, we can save this relation now
                d = {
                    "AE": ent_ae,
                    "SVRT": ent_svrt,
                    "sentence": sentag['sentence'],
                    "tokens": sentag['sentence_tokens'],
                    "ann_idx": ann_idx,
                    "y": 1, # pairs taken from an annotated relation are labeled 1 (POSITIVE)
                }

                ds.append(d)

df = pd.DataFrame(ds)
print('* got the dataset %d records!' % (len(df)))
df

* got the dataset 19 records!


| | AE | SVRT | sentence | tokens | ann_idx | y |
|---|---|---|---|---|---|---|
| 0 | {'tag': 'AE', 'spans': [[689, 695]], 'text': '... | {'tag': 'SVRT', 'spans': [[684, 688]], 'text':... | On 28Jan2021 18:30, the patient experienced mi... | [On, 28Jan2021, 18:30, ,, the, patient, experi... | 0 | 1 |
| 1 | {'tag': 'AE', 'spans': [[1211, 1232]], 'text':... | {'tag': 'SVRT', 'spans': [[1206, 1210]], 'text... | The day after the arm was very sore, red and b... | [The, day, after, the, arm, was, very, sore, ,... | 1 | 1 |
| 2 | {'tag': 'AE', 'spans': [[696, 700]], 'text': '... | {'tag': 'SVRT', 'spans': [[691, 695]], 'text':... | The consumer reported receiving her first inje... | [The, consumer, reported, receiving, her, firs... | 2 | 1 |
| 3 | {'tag': 'AE', 'spans': [[775, 779]], 'text': '... | {'tag': 'SVRT', 'spans': [[784, 791]], 'text':... | The consumer reports the pain was similar to w... | [The, consumer, reports, the, pain, was, simil... | 2 | 1 |
| 4 | {'tag': 'AE', 'spans': [[1102, 1110]], 'text':... | {'tag': 'SVRT', 'spans': [[1033, 1043]], 'text... | The consumer reported when she was told by peo... | [The, consumer, reported, when, she, was, told... | 2 | 0 |
| 5 | {'tag': 'AE', 'spans': [[1102, 1110]], 'text':... | {'tag': 'SVRT', 'spans': [[961, 971]], 'text':... | The consumer reported when she was told by peo... | [The, consumer, reported, when, she, was, told... | 2 | 0 |
| 6 | {'tag': 'AE', 'spans': [[1115, 1120]], 'text':... | {'tag': 'SVRT', 'spans': [[1033, 1043]], 'text... | The consumer reported when she was told by peo... | [The, consumer, reported, when, she, was, told... | 2 | 0 |
| 7 | {'tag': 'AE', 'spans': [[1115, 1120]], 'text':... | {'tag': 'SVRT', 'spans': [[961, 971]], 'text':... | The consumer reported when she was told by peo... | [The, consumer, reported, when, she, was, told... | 2 | 0 |
| 8 | {'tag': 'AE', 'spans': [[1789, 1793]], 'text':... | {'tag': 'SVRT', 'spans': [[1784, 1788]], 'text... | The outcome of the event mild pain at the site... | [The, outcome, of, the, event, mild, pain, at,... | 2 | 1 |
| 9 | {'tag': 'AE', 'spans': [[46, 54]], 'text': 'Ar... | {'tag': 'SVRT', 'spans': [[118, 127]], 'text':... | 4/10/21 Patient presented to Urgent Care with ... | [4/10/21, Patient, presented, to, Urgent, Care... | 3 | 0 |


## Split train/test
As you can see, a very tiny dataset has been created with both positive and negative samples. We preserved as much information as possible in the dataframe so that features can be generated from it.

In practice, you don't need to save all of the raw data in a dataframe. Instead, you can save just the essential information, such as the tokens, token indexes, and sentences, which reduces the space needed for a large corpus.
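
For example, a record could be reduced to its tokens, the two token index ranges, and the label. This is only a sketch under assumptions: `compact_record()` is a hypothetical helper, the output key names are arbitrary, and the sample record is made up to mirror the shape of the items in `ds` (it assumes each entity carries the `token_index` property described earlier).

```python
def compact_record(d):
    """Keep only what a model needs: tokens, entity token ranges, label."""
    return {
        "tokens": d["tokens"],
        "ae_index": d["AE"]["token_index"],
        "svrt_index": d["SVRT"]["token_index"],
        "y": d["y"],
    }

# made-up record with the same shape as the items appended to `ds`
record = {
    "AE": {"tag": "AE", "text": "nausea", "token_index": [4, 4]},
    "SVRT": {"tag": "SVRT", "text": "mild", "token_index": [3, 3]},
    "sentence": "the patient experienced mild nausea.",
    "tokens": ["the", "patient", "experienced", "mild", "nausea", "."],
    "ann_idx": 0,
    "y": 1,
}
small = compact_record(record)
print(small)
```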

You can save this dataset in `pkl` format (or any other format) for future use as follows:

```Python
import pickle

with open('dataset.pkl', 'wb') as f:
    pickle.dump(df, f)
```

In the future, you can load the dataset as follows:

```Python
import pickle

with open('dataset.pkl', 'rb') as f:
    df = pickle.load(f)
```

Now, let's get the training and testing datasets.

In [39]:
# randomly select 80% of the records for training
df_train = df.sample(frac=0.8)
# use the remaining 20% for testing
df_test = df.drop(df_train.index)
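
With so few records, a purely random split can leave the test set with only one class, and it changes on every run. As an optional alternative (not what the cell above does), the sketch below splits each label group separately with a fixed seed. The function `stratified_split()`, the seed value, and the example labels are all assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=42):
    """Split row positions into train/test, keeping the label ratio
    roughly equal in both parts. Returns two sorted lists of positions."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_frac)
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)

# made-up labels: five positive and five negative samples
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With the dataframe from this demo, the positions could then be applied as `df.iloc[train_idx]` and `df.iloc[test_idx]`, passing `df['y'].tolist()` as the labels.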

## Get features

Although we now have all the information needed, we still need to convert it into features that a machine learning model can read.

There are many ways to do this conversion, from simple to complex. You won't know which one works best unless you try them and understand the pros and cons of each. For this demo, we show only a very simple approach.

In practice, you may use BERT-based models to get text embeddings of the tokens, together with POS features and knowledge graph embeddings, to capture the information more fully. We plan to include more practical demos of this in the future.
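
As one illustration of what a very simple feature set could look like, the sketch below turns a record into three numbers based only on token positions. It is a hypothetical example, not necessarily the features used in this demo: `simple_features()` and the sample record are made up, and the code assumes both entities carry an inclusive `token_index` pair.

```python
def simple_features(d):
    """Position-based features for one (AE, SVRT) pair:
    [tokens between the two entities, 1 if SVRT comes first, sentence length]."""
    ae_first, ae_last = d["AE"]["token_index"]
    sv_first, sv_last = d["SVRT"]["token_index"]
    if sv_last < ae_first:        # SVRT ends before AE starts
        gap = ae_first - sv_last - 1
    elif ae_last < sv_first:      # AE ends before SVRT starts
        gap = sv_first - ae_last - 1
    else:                         # the two entities overlap
        gap = 0
    return [gap, int(sv_first < ae_first), len(d["tokens"])]

# made-up record with the same shape as the items appended to `ds`
record = {
    "AE": {"tag": "AE", "token_index": [4, 4]},
    "SVRT": {"tag": "SVRT", "token_index": [3, 3]},
    "tokens": ["the", "patient", "experienced", "mild", "nausea", "."],
    "y": 1,
}
features = simple_features(record)
print(features)
```

Features like these ignore the words themselves, so they are only a baseline to compare richer representations against.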