<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Dataset processing
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Clinical trials CTTI
  </div>


  <div style=" float:left; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE -  Hybrid Intelligence
  </div> 
  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table of Contents

1. [Clinical trials](#texts) <br>


#### Useful links

- [Clinical Trials dataset download](https://www.clinicaltrials.gov/ct2/resources/download#DownloadAllData)

In [1]:
%load_ext autoreload
%autoreload 2

In [9]:
import os
import re
import copy
import json
import zipfile

# data
import xml.etree.ElementTree as ET
import numpy as np
import pandas as pd

# viz
# from tqdm import tnrange

#### Custom variables

In [3]:
path_to_data = os.path.join(os.getcwd(), 'clinical trials CTTI')
path_to_data

'C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\datasets\\clinical trials CTTI'

In [4]:
base_dataset_name  = 'AllPublicXML.zip'
final_dataset_name = 'clinical-trials-ctti'

<a id="texts"></a>

# 1. Clinical trials

[Table of contents](#TOC)

In [10]:
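def clean_text(s):
    # collapse every run of whitespace into a single space; None becomes ''
    s = (s if s else '')
    return re.sub(r'\s+', ' ', s).strip()


def clean_criteria(s):
    # keep only the bullet items ("- ...") that are delimited by blank lines
    s = (s if s else '')
    criteria = [m.group(1) for m in re.finditer(
        r'\r\n\r\n\s*-\s*(.*)\r\n\r\n',
        # blank lines are doubled so that consecutive bullets do not compete for the same separator
        s.replace('\r\n\r\n', '\r\n\r\n\r\n\r\n').rstrip() + '\r\n\r\n',
    )]
    criteria = '\n'.join([clean_text(c) for c in criteria])
    return criteria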



def process_xml(f):
    tree = ET.parse(f)
    root = tree.getroot()
    
    content_dict = {
        'summary'     : 'brief_summary/textblock',
        'description' : 'detailed_description/textblock',
        'ie_criteria' : 'eligibility/criteria/textblock',
        'condition'   : 'condition',
        'purpose'     : 'study_design_info/primary_purpose',
        'intervention': 'intervention/intervention_name',
    }
    content_dict = {k: root.findtext(v) for k, v in content_dict.items()}
    content_dict = {k: (clean_criteria(v) if k == 'ie_criteria' else clean_text(v)) for k, v in content_dict.items()}
    return content_dict



def process_zipfile(zip_file):
    all_content = []
    # the context manager closes the archive once every file has been read
    with zipfile.ZipFile(zip_file, 'r') as archive:
        files = [f for f in archive.namelist() if f.endswith('.xml')]
        for file in files:
            with archive.open(file, 'r') as f:
                content = process_xml(f)
                content = [file] + list(content.values())
                all_content.append(content)
    return all_content



def get_character_count(texts):
    text = ' '.join(texts)
    return np.unique(list(text), return_counts = True)
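
`process_xml` relies on `findtext`, which returns the text of the first matching element only, and `None` when nothing matches. The sketch below illustrates that behaviour on a made-up record: the XML content is invented for the example, and only the tag names mirror those queried above. It also shows `findall` as one possible way to collect every `condition`, which is not what the notebook does.

```python
import xml.etree.ElementTree as ET

# Invented record that mimics the tag layout queried in process_xml.
sample_xml = """
<clinical_study>
  <brief_summary><textblock>  A short
      summary.  </textblock></brief_summary>
  <condition>Obesity</condition>
  <condition>Diabetes Mellitus</condition>
</clinical_study>
""".strip()

root = ET.fromstring(sample_xml)

# findtext stops at the first <condition> element.
first_condition = root.findtext('condition')

# findall returns every matching element, in document order.
all_conditions = [c.text for c in root.findall('condition')]

# A missing path yields None, which clean_text turns into ''.
missing = root.findtext('detailed_description/textblock')

# Same whitespace normalisation as clean_text, written with str.split.
summary = ' '.join(root.findtext('brief_summary/textblock').split())
```

With the extraction as written in `process_xml`, a trial listing several conditions or interventions therefore keeps only the first one.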

In [138]:
# with open(os.path.join(path_to_data, 'AllPublicXML', 'NCT0000xxxx', 'NCT00000180.xml')) as f:
#     tree = ET.parse(f)
#     root = tree.getroot()
    
# for c in root.iter():
#     print('---')
#     print(c.tag)
#     print(c.text)

In [157]:
trials = process_zipfile(os.path.join(path_to_data, base_dataset_name))
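
Processing the full archive takes a while, so it can help to check the traversal logic on a tiny in-memory archive first. The sketch below is a hypothetical variant, not the notebook's own function: `iter_xml_roots`, the file names and the XML content are all assumptions made for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_roots(zip_source):
    """Yield (name, root element) for every .xml member of a zip archive."""
    with zipfile.ZipFile(zip_source, 'r') as archive:
        for name in archive.namelist():
            if name.endswith('.xml'):
                with archive.open(name, 'r') as f:
                    yield name, ET.parse(f).getroot()


# Build a small archive in memory: one fake trial and one non-XML file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr(
        'NCT0000xxxx/demo_1.xml',
        '<clinical_study><condition>Myopia</condition></clinical_study>',
    )
    archive.writestr('Contents.txt', 'not a trial')
buffer.seek(0)

records = [(name, root.findtext('condition')) for name, root in iter_xml_roots(buffer)]
```

Only the `.xml` member is parsed; the text file is skipped by the extension filter, as in `process_zipfile`.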

In [161]:
df_trials = pd.DataFrame(trials, columns = ['Id', 'Summary', 'Description', 'IE_criteria', 'Condition', 'Purpose', 'Intervention'])

In [165]:
df_trials.head(20)

| | Id | Summary | Description | IE_criteria | Condition | Purpose | Intervention |
|---|---|---|---|---|---|---|---|
| 0 | NCT0000xxxx/NCT00000102.xml | This study will test the ability of extended r... | This protocol is designed to assess both acute... | diagnosed with Congenital Adrenal Hyperplasia ... | Congenital Adrenal Hyperplasia | Treatment | Nifedipine |
| 1 | NCT0000xxxx/NCT00000104.xml | Inner city children are at an increased risk f... | | | Lead Poisoning | | ERP measures of attention and memory |
| 2 | NCT0000xxxx/NCT00000105.xml | The purpose of this study is to learn how the ... | Patients will receive each vaccine once only c... | Patients must have a diagnosis of cancer of an... | Cancer | | Intracel KLH Vaccine |
| 3 | NCT0000xxxx/NCT00000106.xml | Recently a non-toxic system for whole body hyp... | | | Rheumatic Diseases | Treatment | Whole body hyperthermia unit |
| 4 | NCT0000xxxx/NCT00000107.xml | Adults with cyanotic congenital heart disease ... | | Resting blood pressure below 140/90 | Heart Defects, Congenital | | |
| 5 | NCT0000xxxx/NCT00000108.xml | The purpose of this research is to find out wh... | | Postmenopausal and preferably on hormone repla... | Cardiovascular Diseases | Prevention | Exercise |
| 6 | NCT0000xxxx/NCT00000110.xml | The purpose of this pilot investigation is to ... | | Healthy volunteers (developmental phase)\nHeal... | Obesity | Treatment | magnetic resonance spectroscopy |
| 7 | NCT0000xxxx/NCT00000111.xml | The purpose of this study is to see if we can ... | | Lack sufficient attached keratinized tissue at... | Mouth Diseases | Treatment | Oral mucosal graft |
| 8 | NCT0000xxxx/NCT00000112.xml | The prevalence of obesity in children is reach... | | Obesity: BM +/- 95% for age general good health | Obesity | | |
| 9 | NCT0000xxxx/NCT00000113.xml | To evaluate whether progressive addition lense... | Myopia (nearsightedness) is an important publi... | | Myopia | Treatment | Progressive Addition Lenses |


#### Export to tsv

In [None]:
df_trials.to_csv(os.path.join(path_to_data, '{}.tsv'.format(final_dataset_name)), sep = "\t", index = False)

#### Export to txt

In [13]:
df_trials = pd.read_csv(os.path.join(path_to_data, '{}.tsv'.format(final_dataset_name)), sep = "\t")
df_trials = df_trials.fillna('')

# replace line breaks with commas
df_trials.IE_criteria = df_trials.IE_criteria.apply(lambda t: re.sub(r'\n+', ', ', t))

# remove commas placed right after another punctuation mark
df_trials.IE_criteria = df_trials.IE_criteria.apply(lambda t: re.sub(r'(?P<name>[,;:.!?]),', r'\g<name>', t))
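
The two substitutions above can be checked on a single string. This is a minimal sketch: the helper name `flatten_criteria` and the sample criteria are made up for the example, while the regular expressions follow the cell above.

```python
import re


def flatten_criteria(text):
    """Join criteria lines with commas, then drop commas that follow punctuation."""
    # Step 1: every run of line breaks becomes ', '.
    text = re.sub(r'\n+', ', ', text)
    # Step 2: a comma directly after , ; : . ! ? is redundant, keep the first mark only.
    return re.sub(r'(?P<punct>[,;:.!?]),', r'\g<punct>', text)


example = 'Age over 18;\nNo prior treatment\nSigned consent.'
flattened = flatten_criteria(example)
print(flattened)
```

The first criterion already ends with a semicolon, so the comma added by step 1 is removed again by step 2, while the second criterion keeps its comma.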

In [14]:
texts = df_trials[['Summary', 'Description', 'IE_criteria']].apply(func = lambda row: ' '.join(row), axis = 1)
texts = [re.sub(r'\s+', ' ', t).strip() for t in texts]

In [15]:
chars, counts = get_character_count(texts)

In [16]:
chars_to_hide = [char for char, count in zip(chars, counts) if count < 40]
len(chars_to_hide)

668

In [17]:
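# We choose to hide all characters appearing fewer than 40 times.
# Each character is escaped so that ']', '^', '-' or '\' cannot break the character class.
texts = [re.sub('( )+', ' ', re.sub('[{}]'.format(''.join(re.escape(c) for c in chars_to_hide)), ' ', t)) for t in texts]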
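
The rare-character filter can be tried on a toy corpus before running it on all trials. The sketch below is an illustration under assumptions: `hide_rare_characters` is a hypothetical helper, it counts characters with `collections.Counter` instead of `np.unique`, and the threshold of 2 is chosen only to suit the tiny sample.

```python
import re
from collections import Counter


def hide_rare_characters(texts, min_count):
    """Replace characters seen fewer than min_count times by a space."""
    counts = Counter(''.join(texts))
    rare = [char for char, count in counts.items() if count < min_count]
    if not rare:
        # An empty character class '[]' is not a valid pattern, so return early.
        return list(texts)
    # re.escape keeps characters such as '^' or ']' literal inside the class.
    pattern = re.compile('[{}]'.format(''.join(re.escape(c) for c in rare)))
    return [re.sub(' +', ' ', pattern.sub(' ', t)) for t in texts]


sample = ['aaa b', 'aaa ^ b']
cleaned = hide_rare_characters(sample, min_count=2)
```

Here `^` occurs once and is replaced by a space, after which the repeated spaces are collapsed.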

In [18]:
len(texts)

430108

In [19]:
texts[257836]

"The study is a prospective, adaptive, multicenter, randomized, double-blind, Sham-controlled pilot study, to evaluate the efficacy and safety of the Vibrant Capsule in relieving constipation in subjects with functional constipation. Three arms will be assessed: - Vibrant Capsule with vibrating mode 1 administered 5 times per week - Vibrant Capsule with vibrating mode 2 administered 5 times per week - Sham Capsule administered 5 times per week Subjects will be followed continuously for at least a 2 weeks run-in period and then be randomized to either Vibrant or Sham capsules for a treatment period of 8 weeks. The first 2 weeks of treatment will be considered as a subjects' training period. Data reporting will be done on an electronic Case Report Form and an eDiary. Subjects will be asked to refrain from taking any medication or supplement to relieve their constipation, during the entire study period. After the run-in period, the subjects will return and eligibility will be re-assessed.

In [20]:
# with open(os.path.join(path_to_data, '{}.txt'.format(final_dataset_name)), 'w', encoding = 'utf-8') as f:
#     f.write('\n'.join(texts))

In [7]:
# with open(os.path.join(path_to_data, '{}.txt'.format(final_dataset_name)), 'r', encoding = 'utf-8') as f:
#     texts = f.readlines()

[Table of contents](#TOC)