# XML 2 Sentence Format Demo

This notebook shows how to convert the annotated XML files into a sentence-based format that can be used to train other models.

In [3]:
import medtator_kits as mtk
import sentence_kits as stk

# force reload everything in mtk
import importlib
importlib.reload(mtk)
importlib.reload(stk)

import copy
import random

# for data processing
import numpy as np
import pandas as pd

# for nicer display
from IPython.display import display, HTML

# load sentence detection
import pysbd
# load spacy and config the sentencizer
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

# for hiding the warnings
# please remove this for debugging
import warnings
warnings.filterwarnings('ignore')

print('* loaded all libraries')

* loaded all libraries


## Load Data

In [4]:
path = '../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/'
rst = mtk.parse_xmls(path)
print(rst['stat'])

* checking path ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc2.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc3.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc1.txt.xml
* parsed XML file ../sample/ENTITY_RELATION_TASK/ann_xml/Annotator_A/A_doc4.txt.xml
* checked 4 files
* found 4 XML files
* skipped 0 non-XML files
{'total_files': 4, 'total_xml_files': 4, 'total_other_files': 0, 'total_tags': 62}


## Parse and convert format

We want to convert the text into a sentence-based format for downstream tasks (e.g., training a relation extraction model), so we first need a sentencizer and a tokenizer.
You can use any library for this purpose.

Here, we use `pySBD` for [sentence boundary detection](https://github.com/nipunsadvilkar/pySBD).

```Python
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(text))
# [TextSpan(sent='My name is Jonas E. Smith. ', start=0, end=27), TextSpan(sent='Please turn to p. 55.', start=27, end=48)]
```

For tokenization, we use `spaCy`'s [Tokenizer](https://spacy.io/api/tokenizer).

```Python

# Use the spaCy tokenizer
from spacy.lang.en import English
nlp = English()
tokenizer = nlp.tokenizer
tokens = tokenizer("This is a sentence for tokens.")
sentence_tokens = list(map(lambda v: v.text, tokens))
sentence_tokens
# ['This', 'is', 'a', 'sentence', 'for', 'tokens', '.']
```

Second, we need a way to map the character spans to token indexes.
In `sentence_kits.py`, we implemented the function `update_ents_token_index()` for this purpose. It uses the spans of a tag to check whether the tag overlaps with any tokens of a sentence.
It then updates each entity by adding a new property, `token_index`, which is a list holding the indexes of the entity's first and last tokens.

For more details, please check the following section, *Sentence-based format*, and the source code in `sentence_kits.py`.
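As a rough illustration of the overlap idea (this is not the actual code in `sentence_kits.py`), the sketch below maps a character span to a `[first, last]` token index pair. The helper names `tokenize_with_offsets` and `span_to_token_index`, as well as the simple regex tokenizer, are assumptions made only for this example.

```python
import re

def tokenize_with_offsets(sentence):
    """Split into word/punctuation tokens, keeping (text, start, end) offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", sentence)]

def span_to_token_index(tokens, span):
    """Return [first, last] indexes of the tokens overlapping span, or None."""
    start, end = span
    hits = [i for i, (_, t_start, t_end) in enumerate(tokens)
            if t_start < end and start < t_end]  # half-open interval overlap
    return [hits[0], hits[-1]] if hits else None

tokens = tokenize_with_offsets("this is a sentence.")
print(span_to_token_index(tokens, (0, 4)))   # "this"       -> [0, 0]
print(span_to_token_index(tokens, (8, 18)))  # "a sentence" -> [2, 3]
```

The spans here are relative to the sentence. The tag spans in the annotations appear to be document-level offsets, so in practice they would presumably be shifted by the sentence's start offset first.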

### Sentence-based format

In the following demo, we will convert each XML file into a sentence-based format, which looks like the following:

```json
{
    "text": "The full text of the file",
    "sentence_tags": [{
        "sentence": "this is a sentence.",
        "sentence_tokens": ["this", "is", "a", "sentence", "."],
        "spans": [start, end],
        "entities": {
            "A1": {
                "id": "A1",
                "text": "this",
                "token_index": [0, 0]
                // other properties
            },
            "A2": {
                "id": "A2",
                "text": "a sentence",
                "token_index": [2, 3]
                // other properties
            }
        },
        "relations": {
            "R1": {
                "id": "R1",
                "link_EAID": "A1", // this
                "link_EBID": "A2", // a sentence
            }
        }
    }]
}
```
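To make the structure concrete, the sketch below builds one hand-written `sentence_tags` item that mirrors the example above and resolves each relation to the tokens of its linked entities. The helper `relation_pairs` is hypothetical, and the rule that every relation property starting with `link_` and ending with `ID` refers to an entity id is an assumption based only on this example.

```python
sentag = {
    "sentence": "this is a sentence.",
    "sentence_tokens": ["this", "is", "a", "sentence", "."],
    "entities": {
        "A1": {"id": "A1", "text": "this", "token_index": [0, 0]},
        "A2": {"id": "A2", "text": "a sentence", "token_index": [2, 3]},
    },
    "relations": {
        "R1": {"id": "R1", "link_EAID": "A1", "link_EBID": "A2"},
    },
}

def relation_pairs(sentag):
    """Yield (relation id, tokens of each linked entity) for one sentence."""
    tokens = sentag["sentence_tokens"]
    for rel in sentag["relations"].values():
        linked = []
        for key, ent_id in rel.items():
            if key.startswith("link_") and key.endswith("ID"):
                first, last = sentag["entities"][ent_id]["token_index"]
                linked.append(tokens[first:last + 1])  # last index is inclusive
        yield rel["id"], linked

pairs = list(relation_pairs(sentag))
print(pairs)  # [('R1', [['this'], ['a', 'sentence']])]
```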

In [5]:
# let's check the sentence-based format for the given samples
ann_sents = stk.convert_anns_to_sentags(rst['anns'])
print("* got %s ann_sents" % (len(ann_sents)))

* got 4 ann_sents


In [6]:
# let's see what the results look like
# the following code is just for reference; you can modify it for your own purpose
for ann_idx, (ann, ann_sent) in enumerate(zip(rst['anns'], ann_sents)):
    print('*' * 30, ann['_filename'], len(ann_sent['sentence_tags']), 'sent(s)', '*' * 30)
    for sentag in ann_sent['sentence_tags']:
        if len(sentag['relations'])>0: sign = '<b style="color:red;">HAS REL</b>:'
        else: sign = 'ENT ONLY:'

        # parse the tokens
        tokens = copy.copy(sentag['sentence_tokens'])
        # print(tokens)
        for ent in sentag['entities'].values():
            color = ''.join([random.choice('9ABCDEF') for j in range(6)])
            for idx in range(ent['token_index'][0], ent['token_index'][1] + 1):
                tokens[idx] = '<span style="color:#%s;background:black;">%s</span>' % (
                    color,
                    tokens[idx]
                )
        display(HTML(sign + "[ `" + "` | `".join(tokens) + "`]"))

****************************** A_doc2.txt.xml 3 sent(s) ******************************


****************************** A_doc3.txt.xml 3 sent(s) ******************************


****************************** A_doc1.txt.xml 5 sent(s) ******************************


****************************** A_doc4.txt.xml 18 sent(s) ******************************


As you can see, each token is shown between a pair of backticks.
The entities are located by their token indexes.

We can explore the dataset in this way.

# Demo 1: Adverse event severity relation detection

In this demo, we use the toolkits in this folder to build a tiny dataset for training a model that detects adverse event severity relations.

Please run the above cells to load the sample dataset.
After loading, all documents are loaded into a variable `ann_sents`.
Now, let's convert the data to a tiny training set.

## Prepare dataset

First of all, we can create a tiny dataset from the annotated corpus.

In [7]:
ds = []

# the prop prefix for the Adverse Event and Severity
prop_AE = 'link_AE'
prop_SVRT = 'link_SVRT'

for ann_idx, (ann_sent, ann) in enumerate(zip(ann_sents, rst['anns'])):
    # ok, for each annotation file, 
    # we want to extract the annotated relations for positive training dataset
    # there can be multiple sentence in a file, 
    # so we need to check each sentence
    for sentag in ann_sent['sentence_tags']:
        if len(sentag['relations'])==0:
            # Oh, this sentence doesn't have a relation annotated
            # but we can use this sentence for building negative samples
            # as the SVRT tags and AE tags have no relations
            # this is just a demo, please change it accordingly.
            stags_ae = []
            stags_svrt = []
            for tag in sentag['entities'].values():
                if tag['tag'] == 'AE': stags_ae.append(tag)
                elif tag['tag'] == 'SVRT': stags_svrt.append(tag)
            # skip this sentence unless it has
            # at least one AE tag and one SVRT tag
            if len(stags_ae)==0 or len(stags_svrt)==0: continue

            # OK, pair each ae and svrt as NEGATIVE sample
            for _tag_a in stags_ae:
                for _tag_s in stags_svrt:
                    # create a data item
                    d = {
                        "AE": _tag_a,
                        "SVRT": _tag_s,
                        "sentence": sentag['sentence'],
                        "tokens": sentag['sentence_tokens'],
                        "ann_idx": ann_idx,
                        "y": 0, # for those obtained by creating, define as 0
                    }

                    ds.append(d)

        else:
            # Ok, this sentence may have several relations
            # let's check one by one
            for rel in sentag['relations'].values():
                # get the AE
                ent_ae = sentag['entities'][rel['%sID' % prop_AE]]
                ent_svrt = sentag['entities'][rel['%sID' % prop_SVRT]]

                # ok, we can save this relation now
                d = {
                    "AE": ent_ae,
                    "SVRT": ent_svrt,
                    "sentence": sentag['sentence'],
                    "tokens": sentag['sentence_tokens'],
                    "ann_idx": ann_idx,
                    "y": 1, # for those obtained from relation, define as 1(POSITIVE)
                }

                ds.append(d)

df = pd.DataFrame(ds)
print('* got the dataset %d records!' % (len(df)))
df

* got the dataset 19 records!


| | AE | SVRT | sentence | tokens | ann_idx | y |
|---|---|---|---|---|---|---|
| 0 | {'tag': 'AE', 'spans': [[689, 695]], 'text': '... | {'tag': 'SVRT', 'spans': [[684, 688]], 'text':... | On 28Jan2021 18:30, the patient experienced mi... | [On, 28Jan2021, 18:30, ,, the, patient, experi... | 0 | 1 |
| 1 | {'tag': 'AE', 'spans': [[1211, 1232]], 'text':... | {'tag': 'SVRT', 'spans': [[1206, 1210]], 'text... | The day after the arm was very sore, red and b... | [The, day, after, the, arm, was, very, sore, ,... | 1 | 1 |
| 2 | {'tag': 'AE', 'spans': [[696, 700]], 'text': '... | {'tag': 'SVRT', 'spans': [[691, 695]], 'text':... | The consumer reported receiving her first inje... | [The, consumer, reported, receiving, her, firs... | 2 | 1 |
| 3 | {'tag': 'AE', 'spans': [[775, 779]], 'text': '... | {'tag': 'SVRT', 'spans': [[784, 791]], 'text':... | The consumer reports the pain was similar to w... | [The, consumer, reports, the, pain, was, simil... | 2 | 1 |
| 4 | {'tag': 'AE', 'spans': [[1102, 1110]], 'text':... | {'tag': 'SVRT', 'spans': [[1033, 1043]], 'text... | The consumer reported when she was told by peo... | [The, consumer, reported, when, she, was, told... | 2 | 0 |
| 5 | {'tag': 'AE', 'spans': [[1102, 1110]], 'text':... | {'tag': 'SVRT', 'spans': [[961, 971]], 'text':... | The consumer reported when she was told by peo... | [The, consumer, reported, when, she, was, told... | 2 | 0 |
| 6 | {'tag': 'AE', 'spans': [[1115, 1120]], 'text':... | {'tag': 'SVRT', 'spans': [[1033, 1043]], 'text... | The consumer reported when she was told by peo... | [The, consumer, reported, when, she, was, told... | 2 | 0 |
| 7 | {'tag': 'AE', 'spans': [[1115, 1120]], 'text':... | {'tag': 'SVRT', 'spans': [[961, 971]], 'text':... | The consumer reported when she was told by peo... | [The, consumer, reported, when, she, was, told... | 2 | 0 |
| 8 | {'tag': 'AE', 'spans': [[1789, 1793]], 'text':... | {'tag': 'SVRT', 'spans': [[1784, 1788]], 'text... | The outcome of the event mild pain at the site... | [The, outcome, of, the, event, mild, pain, at,... | 2 | 1 |
| 9 | {'tag': 'AE', 'spans': [[46, 54]], 'text': 'Ar... | {'tag': 'SVRT', 'spans': [[118, 127]], 'text':... | 4/10/21 Patient presented to Urgent Care with ... | [4/10/21, Patient, presented, to, Urgent, Care... | 3 | 0 |


## Get features

Although we now have all the information needed, we still need to convert it into features that a machine can read.

There are many ways to do this conversion, both simple and complex. You will never know which one is best unless you try them and understand the pros and cons of each. For this demo, we use just two very simple features:

1. the number of tokens between AE and severity
2. TF-IDF feature of the sentence

In practice, you may choose BERT-based models to get text embeddings of the tokens, and add POS features and knowledge graph embeddings to capture more of the information. We plan to include more practical demos showing how to do that in the future.

### Get between tokens

In [8]:
def get_between_tokens(x):
    '''
    Get the tokens between AE and SVRT
    '''
    left_idx = x['AE']['token_index']
    right_idx = x['SVRT']['token_index']

    if left_idx[0] > right_idx[1]:
        # which means AE is after SVRT
        left_idx = x.SVRT['token_index']
        right_idx = x.AE['token_index']

    # get the tokens between left and right
    between_tokens = x.tokens[
        left_idx[1] + 1 : right_idx[0]
    ]

    return between_tokens


def get_HTML(x):
    '''
    Render the sentence with the AE tokens in blue and the SVRT tokens in red
    '''
    # copy the tokens so the list stored in the dataframe is not modified
    tokens = list(x.tokens)
    for idx in range(x.AE['token_index'][0], x.AE['token_index'][1] + 1):
        tokens[idx] = "<span style='color:blue'>%s</span>" % tokens[idx]
    for idx in range(x.SVRT['token_index'][0], x.SVRT['token_index'][1] + 1):
        tokens[idx] = "<span style='color:red'>%s</span>" % tokens[idx]
    return ' '.join(tokens)

df['between_tokens'] = df.apply(get_between_tokens, axis=1)

# scale the token count to [0, 1]: 10 or more tokens in between maps to 1
df['n_bt'] = df['between_tokens'].apply(lambda ts: min(len(ts) / 10, 1))
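
As a quick sanity check of this feature, the sketch below computes the same value directly from two `[first, last]` token index pairs, without slicing the token list. The function name `normalized_gap` and the `cap` parameter are hypothetical and only used for illustration.

```python
def normalized_gap(ae_index, svrt_index, cap=10):
    """Number of tokens strictly between two [first, last] ranges, scaled to [0, 1]."""
    # order the two ranges so that `left` starts first
    left, right = sorted([ae_index, svrt_index])
    n_between = max(0, right[0] - left[1] - 1)
    return min(n_between / cap, 1.0)

print(normalized_gap([7, 7], [6, 6]))    # adjacent tokens        -> 0.0
print(normalized_gap([3, 3], [5, 5]))    # one token in between   -> 0.1
print(normalized_gap([20, 21], [2, 2]))  # 17 tokens, clipped     -> 1.0
```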

### Get TF-IDF

The fitted `tfvecter` is needed later to transform new sentences into the same feature space, so you may also want to save it with pickle.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfvecter = TfidfVectorizer()
tfidf = tfvecter.fit_transform(df['sentence'])

### Convert to feature vector

In [10]:
df_dataset = df[['y', 'n_bt']]

# flatten the tf-idf column as a df
df_tfidf = pd.DataFrame(tfidf.toarray(), index=df.index, columns=['t%d' % i for i in range(tfidf.shape[1])])
# concat
df_dataset = pd.concat([
    df_dataset,
    df_tfidf
], axis=1)

df_dataset.head(5)

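| | y | n_bt | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7 | ... | t123 | t124 | t125 | t126 | t127 | t128 | t129 | t130 | t131 | t132 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 0.0 | 0.39707 | 0.0 | 0.0 | 0.39707 | 0.39707 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.214255 | 0.18795 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 0.0 | 0.0 | 0.0 | 0.195089 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.195089 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.246648 | ... | 0.0 | 0.281168 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.129785 | 0.129785 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.129785 |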


### Split train/test
As you can see, a very tiny dataset has been created with both positive and negative samples. We preserved as much information as possible in the dataframe to generate features.

In practice, you don't need to save all of this raw data in a dataframe. Instead, you can save just the essential information, such as tokens, indexes, and sentences, which reduces the space needed for a large corpus.
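
One possible way to slim down a record is sketched below. The helper `slim_record` and the output key names are hypothetical, and the sample record uses made-up values that only follow the shape of the items built earlier in this notebook.

```python
import json

def slim_record(d):
    """Keep only what the features need: sentence, tokens, token indexes, label."""
    return {
        "sentence": d["sentence"],
        "tokens": d["tokens"],
        "ae_index": d["AE"]["token_index"],
        "svrt_index": d["SVRT"]["token_index"],
        "y": d["y"],
    }

# made-up record for illustration only
record = {
    "AE": {"tag": "AE", "text": "pain", "token_index": [1, 1], "spans": [[5, 9]]},
    "SVRT": {"tag": "SVRT", "text": "mild", "token_index": [0, 0], "spans": [[0, 4]]},
    "sentence": "mild pain today.",
    "tokens": ["mild", "pain", "today", "."],
    "ann_idx": 0,
    "y": 1,
}

slim = slim_record(record)
print(json.dumps(slim))  # plain lists and strings, so JSON works as well as pickle
```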

You can save this dataset in `pkl` format (or any other format) for future use as follows:

```Python
import pickle

with open('dataset.pkl', 'wb') as f:
    pickle.dump(df, f)
```

Later, you can load the dataset as follows:

```Python
with open('dataset.pkl', 'rb') as f:
    df = pickle.load(f)
```

Now, let's get the training and testing dataset.

In [11]:
# randomly select 80% for training
df_train = df_dataset.sample(frac=0.8)
# the rest 20% for test
df_test = df_dataset.iloc[~df_dataset.index.isin(df_train.index)]

# get the values
X_train = df_train.loc[:, df_train.columns != 'y'].to_numpy()
y_train = df_train.loc[:, 'y'].to_numpy()

X_test = df_test.loc[:, df_test.columns != 'y'].to_numpy()
y_test = df_test.loc[:, 'y'].to_numpy()

print('* got df_train', len(df_train))
print('* got df_test', len(df_test))

* got df_train 15
* got df_test 4
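
With so few records, a purely random split can leave one class out of the test set, and it changes on every run. As one possible alternative (not what this notebook does), the sketch below shows a seeded, stratified split over row indexes using only the standard library. The function `stratified_split` and the label list are hypothetical; the class counts are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split row indexes into train/test, sampling each class separately."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    train, test = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_frac))  # at least one per class
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

labels = [1] * 9 + [0] * 10  # made-up labels for 19 records
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 15 4
```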


## Build model

There are many models available in the machine learning world, and you can choose whichever suits your project.
Here, we use a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), which is easy to set up and works well on high-dimensional datasets.

In practice, there are many options for building a model. Enjoy exploring them!

In [12]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(max_depth=10, random_state=0)
print('* created a RFC model')

* created a RFC model


## Train model

We can train the model on the training dataset.

In [13]:
model.fit(X_train, y_train)
print('* trained model')

* trained model


## Evaluate model

Then we can get the reports on the model based on the test dataset.

In [14]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# get the prediction value
y_pred = model.predict(X_test)

print('Accuracy     : %.3f'%accuracy_score(y_test, y_pred))
print('Precision    : %.3f'%precision_score(y_test, y_pred))
print('Recall       : %.3f'%recall_score(y_test, y_pred))
print('F1-Score     : %.3f'%f1_score(y_test, y_pred))
print('\nClassification Report : ')
print(classification_report(y_test, y_pred))

Accuracy     : 0.750
Precision    : 0.500
Recall       : 1.000
F1-Score     : 0.667

Classification Report : 
              precision    recall  f1-score   support

           0       1.00      0.67      0.80         3
           1       0.50      1.00      0.67         1

    accuracy                           0.75         4
   macro avg       0.75      0.83      0.73         4
weighted avg       0.88      0.75      0.77         4
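
The headline numbers can be reproduced by hand from the confusion counts. The counts below are inferred from the report above (class 0 has 3 records with recall 0.67, class 1 has 1 record with recall 1.00), so treat this as a worked check rather than new output from the model.

```python
# counts for the positive class (1), inferred from the classification report
tp, fp, fn, tn = 1, 1, 0, 2

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy     : %.3f' % accuracy)   # 0.750
print('Precision    : %.3f' % precision)  # 0.500
print('Recall       : %.3f' % recall)     # 1.000
print('F1-Score     : %.3f' % f1)         # 0.667
```

With only 4 test records, a single prediction changes accuracy by 0.25, so these scores say very little about real performance.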



## Apply model

Once you have this model, you can put it into practice! (Of course, this is only a toy for the demo; you need a *REAL* model for real use.)

Usually, you can save the model as a pickle file and use it later:

```Python
import pickle

# save model
filename = 'final_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# load model
with open(filename, 'rb') as f:
    model = pickle.load(f)
```

In addition, as this tiny model uses TF-IDF features, the fitted `tfvecter` is needed to produce features with the same dimensions.
You should therefore save the `tfvecter` as well and load it before applying the model.

### Preprocessing

Before putting the trained model into practice, keep in mind that the input must be in the **same format** as in the training phase. This means the data must first be preprocessed to extract the AE and SVRT entities from the given text.

This preprocessing can be done automatically by a Named Entity Recognition (NER) model.
However, you still need another algorithm to pair the AEs and SVRTs found by the NER model.
For example, if the NER model finds 3 AEs and 2 SVRTs, you can simply treat all 6 possible combinations as candidates:

$$
AE_1 + SVRT_1 |
AE_1 + SVRT_2 |
AE_2 + SVRT_1 |
AE_2 + SVRT_2 |
AE_3 + SVRT_1 |
AE_3 + SVRT_2
$$

Then, let the model decide which pairs are correct.
There are also strategies to reduce the number of candidates.
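
The pairing step can be sketched with `itertools.product`. The entity lists below stand in for NER output and use made-up token positions, the function `candidate_pairs` is hypothetical, and the optional `max_gap` filter is just one assumed example of a strategy for reducing candidates.

```python
from itertools import product

# hypothetical NER output for one sentence: (text, [first_token, last_token])
aes = [("headache", [6, 6]), ("diarrhea", [11, 11])]
svrts = [("mild", [10, 10])]

def candidate_pairs(aes, svrts, max_gap=None):
    """Pair every AE with every SVRT; optionally drop pairs that are too far apart."""
    pairs = []
    for (ae, ae_idx), (sv, sv_idx) in product(aes, svrts):
        left, right = sorted([ae_idx, sv_idx])
        gap = max(0, right[0] - left[1] - 1)  # tokens strictly in between
        if max_gap is None or gap <= max_gap:
            pairs.append((ae, sv, gap))
    return pairs

print(candidate_pairs(aes, svrts))             # [('headache', 'mild', 3), ('diarrhea', 'mild', 0)]
print(candidate_pairs(aes, svrts, max_gap=2))  # [('diarrhea', 'mild', 0)]
```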

For this demo, we skip the NER step and only show the rest.

In [15]:
# the AE and SVRT
df_sample = pd.DataFrame([[
    'headache', 'mild', [',', 'but', 'still'], 
    '2/20/21:  2AM advil helped the headache, but still mild diarrhea.'
], [
    'diarrhea', 'mild', [], 
    '2/20/21:  2AM advil helped the headache, but still mild diarrhea.'
], [
    'nause', 'Mild', [],
    'Mild nausea beginning the next day (3/7) and lasting until (3/8).'
], [
    'dizziness', 'intense', ['headache','yesterday',',','but','after','a','few','days',',','I','only','felt','some'],
    'I got an intense headache yesterday, but after a few days, I only felt some dizziness.'
]], columns=['AE', 'SVRT', 'between_tokens', 'sentence'])

# First, convert to features
# 1. n_bt
f_n_bt = df_sample['between_tokens'].apply(lambda ts: len(ts)/10 if len(ts)/10 < 1 else 1)
f_n_bt = f_n_bt.to_numpy().reshape([len(f_n_bt), 1])
print('* f_n_bt shape', f_n_bt.shape)

# 2. tf-idf
f_tfidf = tfvecter.transform(df_sample.sentence).toarray()
print('* f_tfidf shape', f_tfidf.shape)

# concat n_bt and tf-idf
X_sample = np.concatenate([f_n_bt, f_tfidf], axis=1)
print('* X_sample shape', X_sample.shape)

* f_n_bt shape (4, 1)
* f_tfidf shape (4, 133)
* X_sample shape (4, 134)


### Run model

Instead of predicting the class/label directly with `predict()`, we use `predict_proba()` to show the class probabilities.

In [16]:
y_sample = model.predict_proba(X_sample)

# let's see what's the result:
for i, r in df_sample.iterrows():
    print("RESULT: %s | [%s] - [%s] | %s" % (
        y_sample[i], r['AE'], r['SVRT'], r['sentence']
    ))

RESULT: [0.21 0.79] | [headache] - [mild] | 2/20/21:  2AM advil helped the headache, but still mild diarrhea.
RESULT: [0.08 0.92] | [diarrhea] - [mild] | 2/20/21:  2AM advil helped the headache, but still mild diarrhea.
RESULT: [0.15 0.85] | [nause] - [Mild] | Mild nausea beginning the next day (3/7) and lasting until (3/8).
RESULT: [0.33 0.67] | [dizziness] - [intense] | I got an intense headache yesterday, but after a few days, I only felt some dizziness.


As you can see above, the second and third samples show a high probability for the positive class (**0.92** and **0.85**).
The first and the last ones show a lower probability (**0.79** and **0.67**).
As this is just a simple demo, a rough threshold of about 0.8 would separate these two groups in this run; the exact values change between runs because the train/test split is random.
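
To make the effect of the threshold concrete, the sketch below applies two cut-offs to the positive-class probabilities copied from the printed output above (the second value in each pair, since scikit-learn orders `classes_` ascending and the labels here are 0 and 1). The helper `accepted` is hypothetical.

```python
# positive-class probabilities copied from the printed output above
probs = {
    ("headache", "mild"): 0.79,
    ("diarrhea", "mild"): 0.92,
    ("nause", "Mild"): 0.85,
    ("dizziness", "intense"): 0.67,
}

def accepted(probs, threshold):
    """Return the (AE, SVRT) pairs whose probability reaches the threshold."""
    return [pair for pair, p in probs.items() if p >= threshold]

print(accepted(probs, 0.75))  # three pairs pass, including headache/mild
print(accepted(probs, 0.80))  # only diarrhea/mild and nause/Mild pass
```

A small change in the threshold flips the decision for the headache/mild pair, which is a reminder that the cut-off should be chosen on a held-out validation set rather than by eye.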

You can improve the performance for real production in many aspects, such as:

1. **A large annotated dataset**. A well-annotated dataset is always helpful for improving performance.
2. **Feature engineering**. Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in machine learning. Good features can not only improve the overall performance, but also make the follow-up steps easier.
3. **Nice models**. Contrary to what many believe, the machine learning model with the best performance is not necessarily the best solution. Choosing a good model is not easy in practice, as you need to balance cost and performance. Some SOTA models perform very well but need more computational power, while others offer acceptable performance at a lower cost (e.g., Random Forest is widely used in industry).
4. **Hyperparameter tuning**. How you train the model is also important. There are some methods to optimize the hyperparameters in both open-source solutions and commercial solutions.
5. **Problem definition**. This should of course be considered first, but it is out of the scope of this notebook.

To the best of my knowledge, this is the basic workflow.