<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Dataset processing
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      CHIA
  </div>


  <div style=" float:left; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-Baptiste AUJOGUE - Hybrid Intelligence
  </div> 


<a id="TOC"></a>

#### Table of Contents

1. [CHIA Texts](#texts) <br>
2. [CHIA Entities](#ents) <br>


#### Useful links

- [Chia, a large annotated corpus of clinical trial eligibility criteria](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7452886/pdf/41597_2020_Article_620.pdf) (paper)
- https://github.com/uf-hobi-informatics-lab/ClinicalTransformerNER/blob/master/tutorial/brat2bio.ipynb
- https://github.com/ctgatecci/Clinical-trial-eligibility-criteria-NER/blob/main/NER%20Preprocessing%20and%20Performance%20Analysis.ipynb

In [18]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [19]:
import os
import sys
import re
import copy
import json
import zipfile

# data
import pandas as pd

# text
from spacy.lang.en import English

#### Custom variables

In [20]:
path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'datasets', 'chia')
path_to_src  = os.path.join(path_to_repo, 'src')
path_to_src

'C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\src'

In [21]:
base_dataset_name  = 'chia.zip'
final_dataset_name = 'chia-ner'

#### Custom imports

In [22]:
sys.path.insert(0, path_to_src)

In [23]:
from nlptools.dataset.chia.io import load_texts_from_zipfile, load_entities_from_zipfile
from nlptools.dataset.chia.preprocessing import get_ner_entities, convert_to_bio

<a id="texts"></a>

# 1. CHIA Texts

[Table of contents](#TOC)

In [24]:
folder = os.path.join(path_to_data, final_dataset_name)
os.makedirs(folder, exist_ok=True)

In [25]:
df_texts = load_texts_from_zipfile(os.path.join(path_to_data, base_dataset_name))

In [26]:
df_texts.head()

| | Id | Text |
|---|---|---|
| 0 | NCT00050349_exc | Patients with symptomatic CNS metastases or le... |
| 1 | NCT00050349_inc | Patients with biopsy-proven metastatic carcino... |
| 2 | NCT00061308_exc | Women of child-bearing potential that do not p... |
| 3 | NCT00061308_inc | Have had one prior platinum-based chemotherapy... |
| 4 | NCT00094861_exc | Metastatic disease (M1)/stage 4 NSCLC \nPleura... |
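
`load_texts_from_zipfile` lives in the repository's own `nlptools` package and its code is not shown here. As a rough idea of what such a loader could look like, the sketch below assumes the archive holds one `.txt` file per criteria block, named after its `Id`. The function name `read_texts_from_zip` and the in-memory demo archive are hypothetical, and only the standard library is used.

```python
import io
import zipfile


def read_texts_from_zip(zip_source):
    """Collect every .txt member of a zip archive as {'Id', 'Text'} records."""
    records = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith('.txt'):
                continue  # skip annotation files and directory entries
            # 'some/dir/NCT00050349_exc.txt' -> 'NCT00050349_exc'
            doc_id = name.rsplit('/', 1)[-1][:-len('.txt')]
            text = archive.read(name).decode('utf-8')
            records.append({'Id': doc_id, 'Text': text})
    return records


# Tiny in-memory archive standing in for chia.zip (contents are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('NCT00050349_exc.txt', 'Patients with symptomatic CNS metastases')
    archive.writestr('NCT00050349_exc.ann', 'T1\tCondition 26 40\tCNS metastases\n')
buffer.seek(0)

demo_records = read_texts_from_zip(buffer)
```

The resulting list of records can be passed to `pd.DataFrame` to obtain a two-column frame with the same `Id` and `Text` headers as above.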


In [27]:
df_texts.to_csv(os.path.join(path_to_data, final_dataset_name, 'chia_texts.tsv'), sep = "\t", index = False)

<a id="ents"></a>

# 2. CHIA Entities

[Table of contents](#TOC)

## 2.1 Create Dataframe of entities

[Table of contents](#TOC)

In [28]:
categories = [
    # domain
    'Condition',                     # - in C2Q
    'Device',                        # present in OMOP CDM, can be precise ('Pacemaker') or broad ('barrier method of birth control')
    'Drug',                          # - in C2Q
    'Measurement',                   # - in C2Q
    'Person',                        # Demographic (sex, age), but also contains mislabeled Condition ('Premenopausal', 'drug users'), mislabeled Measurement ('body mass index') and irrelevant terms ('Incarcerated')
    'Procedure',                     # - in C2Q
    # 'Visit',                         # codable ?
    
    # field
    'Value',                         # - in C2Q
    'Temporal',                      # - in C2Q
    'Qualifier',                     # Originally a Construct entity. Modifier terms similar to Observation
    'Observation',                   # - in C2Q # greatly overlaps Qualifier, and the delta seems useless
    # 'Reference_point',             # reference point in time, absolute ("3 times the agent's half-life") or relative ('initiation of treatment')
    # 'Mood',                        # useless

    # construct
    'Negation',                      # negation expressed as a NER + RelEx problem to be complete
    # 'Multiplier',                  # CAUTION here, as it covers Value ('> 500 mg/m^2') and Logical label ('3 or more')
]

In [29]:
df_ents = load_entities_from_zipfile(os.path.join(path_to_data, base_dataset_name))
df_ents.shape

(44616, 7)

In [30]:
df_ents = get_ner_entities(df_texts, df_ents, categories)

In [31]:
df_ents.head()

| | Id | Mention | Start_char | End_char | Entity_id | Category |
|---|---|---|---|---|---|---|
| 0 | NCT00050349_exc | symptomatic | 14 | 25 | (T65,) | Qualifier |
| 1 | NCT00050349_exc | CNS metastases | 26 | 40 | (T1,) | Condition |
| 2 | NCT00050349_exc | leptomeningeal involvement | 44 | 70 | (T2,) | Condition |
| 3 | NCT00050349_exc | brain metastases | 92 | 108 | (T4,) | Condition |
| 4 | NCT00050349_exc | unless | 110 | 116 | (T70,) | Negation |
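
The columns above mirror the fields of brat standoff annotations, the format handled by the `brat2bio` tutorial listed in the useful links. The sketch below shows how text-bound annotation lines of that format can be turned into comparable rows. It is not the code of `load_entities_from_zipfile` or `get_ner_entities`: the function name `parse_brat_entities` is hypothetical, the sample annotation lines are made up to match the offsets shown above, and collapsing a discontinuous annotation to its outer span is a simplifying assumption.

```python
def parse_brat_entities(ann_text):
    """Parse the text-bound ('T') lines of a brat .ann file into dict rows."""
    entities = []
    for line in ann_text.splitlines():
        # Only text-bound annotations carry character offsets;
        # relation lines and other annotation types are skipped.
        if not line.startswith('T'):
            continue
        entity_id, type_and_offsets, mention = line.split('\t', 2)
        category, offsets = type_and_offsets.split(' ', 1)
        # Discontinuous annotations list several 'start end' fragments
        # separated by ';'. Assumption: keep only the outer span.
        fragments = [tuple(int(v) for v in frag.split()) for frag in offsets.split(';')]
        entities.append({
            'Entity_id': entity_id,
            'Category': category,
            'Start_char': fragments[0][0],
            'End_char': fragments[-1][1],
            'Mention': mention,
        })
    return entities


sample_text = 'Patients with symptomatic CNS metastases'
sample_ann = (
    'T65\tQualifier 14 25\tsymptomatic\n'
    'T1\tCondition 26 40\tCNS metastases\n'
    'R1\tHas_qualifier Arg1:T1 Arg2:T65\n'
)
parsed = parse_brat_entities(sample_ann)
```

Slicing the text with the parsed offsets, for instance `sample_text[26:40]`, gives back the mention, which is a cheap consistency check to run on real annotations.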


In [32]:
df_ents.to_csv(os.path.join(path_to_data, final_dataset_name, 'chia_ents.tsv'), sep = "\t", index = False)

<a id="bio"></a>

## 2.2 Convert entities to BIO format

[Table of contents](#TOC)


In [33]:
df_spans = convert_to_bio(df_texts, df_ents)
df_spans.shape

(77302, 4)

In [34]:
df_spans.head()

| | Id | Sequence_id | Mention | Category |
|---|---|---|---|---|
| 0 | NCT00050349_exc | NCT00050349_exc_0 | Patients with | O |
| 1 | NCT00050349_exc | NCT00050349_exc_0 | symptomatic | Qualifier |
| 2 | NCT00050349_exc | NCT00050349_exc_0 | | O |
| 3 | NCT00050349_exc | NCT00050349_exc_0 | CNS metastases | Condition |
| 4 | NCT00050349_exc | NCT00050349_exc_0 | or | O |
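
Without a tokenizer, the output above alternates entity spans with the text in between, labeled `O`. The sketch below reproduces that segmentation for a single sentence. It is an illustration rather than the code of `convert_to_bio`: the function name `split_into_spans` is hypothetical, the stripping of whitespace around `O` spans is inferred from the rows shown, and skipping overlapping entities is an assumption.

```python
def split_into_spans(text, entities):
    """Cut text into (mention, category) spans, with 'O' for the gaps.

    entities: iterable of (start_char, end_char, category) tuples.
    """
    spans, cursor = [], 0
    for start, end, category in sorted(entities):
        if start < cursor:
            continue  # assumption: drop entities overlapping the previous one
        if start > cursor:
            # text between two entities, stripped of surrounding whitespace
            spans.append((text[cursor:start].strip(), 'O'))
        spans.append((text[start:end], category))
        cursor = end
    if cursor < len(text):
        spans.append((text[cursor:].strip(), 'O'))
    return spans


text = 'Patients with symptomatic CNS metastases or leptomeningeal involvement'
entities = [(14, 25, 'Qualifier'), (26, 40, 'Condition'), (44, 70, 'Condition')]
spans = split_into_spans(text, entities)
# [('Patients with', 'O'), ('symptomatic', 'Qualifier'), ('', 'O'),
#  ('CNS metastases', 'Condition'), ('or', 'O'),
#  ('leptomeningeal involvement', 'Condition')]
```

The empty `O` span comes from the single space between two adjacent entities, which matches the empty `Mention` in row 2 above.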


In [35]:
df_spans.to_csv(os.path.join(path_to_data, final_dataset_name, 'chia_spans.tsv'), sep = "\t", index = False)

In [36]:
tokenizer = English()
df_bio = convert_to_bio(df_texts, df_ents, tokenizer = lambda s: [t.text for t in tokenizer(s)])

df_bio.shape

(205982, 4)

In [37]:
df_bio.head(10)

| | Id | Sequence_id | Mention | Category |
|---|---|---|---|---|
| 0 | NCT00050349_exc | NCT00050349_exc_0 | Patients | O |
| 1 | NCT00050349_exc | NCT00050349_exc_0 | with | O |
| 2 | NCT00050349_exc | NCT00050349_exc_0 | symptomatic | B-Qualifier |
| 3 | NCT00050349_exc | NCT00050349_exc_0 | | O |
| 4 | NCT00050349_exc | NCT00050349_exc_0 | CNS | B-Condition |
| 5 | NCT00050349_exc | NCT00050349_exc_0 | metastases | I-Condition |
| 6 | NCT00050349_exc | NCT00050349_exc_0 | | O |
| 7 | NCT00050349_exc | NCT00050349_exc_0 | or | O |
| 8 | NCT00050349_exc | NCT00050349_exc_0 | leptomeningeal | B-Condition |
| 9 | NCT00050349_exc | NCT00050349_exc_0 | involvement | I-Condition |
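
In the BIO scheme, the first token of an entity is tagged `B-<Category>`, the following tokens of the same entity `I-<Category>`, and everything else `O`. The sketch below applies this rule to already segmented spans. It is not the code of `convert_to_bio`: the function name `spans_to_bio` is hypothetical, and `str.split` stands in for the spaCy tokenizer used above, so empty spans yield no token here whereas the notebook output keeps empty rows.

```python
def spans_to_bio(spans, tokenize=str.split):
    """Turn (mention, category) spans into (token, BIO tag) pairs."""
    rows = []
    for mention, category in spans:
        for position, token in enumerate(tokenize(mention)):
            if category == 'O':
                tag = 'O'
            else:
                # first token opens the entity, the others continue it
                tag = ('B-' if position == 0 else 'I-') + category
            rows.append((token, tag))
    return rows


demo_spans = [
    ('Patients with', 'O'),
    ('symptomatic', 'Qualifier'),
    ('CNS metastases', 'Condition'),
    ('or', 'O'),
    ('leptomeningeal involvement', 'Condition'),
]
bio_rows = spans_to_bio(demo_spans)
# [('Patients', 'O'), ('with', 'O'), ('symptomatic', 'B-Qualifier'),
#  ('CNS', 'B-Condition'), ('metastases', 'I-Condition'), ('or', 'O'),
#  ('leptomeningeal', 'B-Condition'), ('involvement', 'I-Condition')]
```

Passing another callable as `tokenize`, such as the lambda around `English()` used in the cell above, changes only how each span is split, not the tagging rule.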


In [38]:
df_bio.to_csv(os.path.join(path_to_data, final_dataset_name, 'chia_bio.tsv'), sep = "\t", index = False)
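
For later NER training, the flat BIO file usually has to be regrouped into one token list and one tag list per sequence. The sketch below does this with the standard `csv` module, assuming the column names shown above. The function name `group_bio_rows` is hypothetical and the sample TSV content is made up for illustration.

```python
import csv
import io


def group_bio_rows(tsv_file):
    """Group BIO rows by Sequence_id into (tokens, tags) pairs, in file order."""
    sequences = {}
    for row in csv.DictReader(tsv_file, delimiter='\t'):
        tokens, tags = sequences.setdefault(row['Sequence_id'], ([], []))
        tokens.append(row['Mention'])
        tags.append(row['Category'])
    return sequences


sample_tsv = io.StringIO(
    'Id\tSequence_id\tMention\tCategory\n'
    'NCT00050349_exc\tNCT00050349_exc_0\tCNS\tB-Condition\n'
    'NCT00050349_exc\tNCT00050349_exc_0\tmetastases\tI-Condition\n'
    'NCT00050349_exc\tNCT00050349_exc_1\tor\tO\n'
)
sequences = group_bio_rows(sample_tsv)
```

When reading the file back with `pd.read_csv` instead, empty `Mention` cells are parsed as `NaN` by default; passing `keep_default_na=False` keeps them as empty strings.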

[Table of contents](#TOC)