<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Dataset processing
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Clinical trials ICTRP
  </div>


  <div style=" float:left; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE -  Hybrid Intelligence
  </div> 
  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table of Contents

1. [Clinical trials](#texts) <br>


#### Useful links

- [Clinical Trials ICTRP dataset download](https://www.who.int/clinical-trials-registry-platform)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import re
import copy
import json
import zipfile

# data
import numpy as np
import pandas as pd

# viz
# from tqdm import tnrange

#### Custom variables

In [3]:
path_to_data = os.path.join(os.getcwd(), 'clinical trials ICTRP')
path_to_data

'C:\\Users\\jb\\Desktop\\NLP\\perso - Transformers for NLP\\datasets\\clinical trials ICTRP'

In [4]:
base_dataset_name  = 'ICTRPFullExport-774632-24-03-2022.zip'
final_dataset_name = 'clinical-trials-ictrp'

<a id="texts"></a>

# 1. Clinical trials

[Table of contents](#TOC)

In [5]:
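def clean_text(s):
    # treat missing values as empty strings
    s = (s if s else '')
    # turn HTML line breaks into newlines
    s = re.sub(r'<br>', '\n', s)
    # merge dash-led items separated by blank lines into a comma-separated sequence
    s = re.sub(r'\n\n(\s)*-(\s)+', ', ', s)
    # collapse whitespace runs into a single space
    s = re.sub(r'(\s)+', ' ', s)
    # drop the first and last characters of multi-word strings (presumably wrapping quotes)
    s = (s[1: -1] if len(s.split(' '))>1 else s)
    s = s.strip()
    return s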



def process_zipfile(zip_file):
    # read the first file of the archive as a CSV table
    with zipfile.ZipFile(zip_file, 'r') as archive:
        file = archive.namelist()[0]
        with archive.open(file, 'r') as f:
            return pd.read_csv(f)
    

    
def clean_dataframe(df):
    # keep only the free-text columns
    columns = [
        'public_title', 'Scientific_title',
        'Inclusion_Criteria', 'Exclusion_Criteria',
        'Primary_Outcome', 'Secondary_Outcomes',
        'results_summary',
    ]
    df = df[columns].fillna('').applymap(clean_text)
    # remove the "Inclusion criteria:" / "Exclusion criteria:" labels
    df['Inclusion_Criteria'] = df.Inclusion_Criteria.apply(lambda s: re.sub(r'Inclusion [Cc]riteria(\s)*:(\s|,)*', '', s).strip())
    df['Exclusion_Criteria'] = df.Exclusion_Criteria.apply(lambda s: re.sub(r'Exclusion [Cc]riteria(\s)*:(\s|,)*', '', s).strip())
    return df



def get_character_count(texts):
    text = ' '.join(texts)
    return np.unique(list(text), return_counts = True)
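
To make the effect of `clean_text` concrete, here is a minimal sketch on a made-up raw field. The sample string is hypothetical, and the wrapping quote characters are an assumption suggested by the `s[1: -1]` slice. The function is repeated so that the snippet runs on its own.

```python
import re

def clean_text(s):
    s = (s if s else '')
    s = re.sub(r'<br>', '\n', s)
    s = re.sub(r'\n\n(\s)*-(\s)+', ', ', s)
    s = re.sub(r'(\s)+', ' ', s)
    s = (s[1: -1] if len(s.split(' ')) > 1 else s)
    return s.strip()

# hypothetical raw field: quoted, with <br>-separated dash-led items
raw = '"Inclusion criteria: <br><br> - Age over 18<br><br> - Signed consent"'
cleaned = clean_text(raw)
print(cleaned)

# label removal, as done afterwards in clean_dataframe
stripped = re.sub(r'Inclusion [Cc]riteria(\s)*:(\s|,)*', '', cleaned).strip()
print(stripped)

# missing values become empty strings; single-word values are left untouched
print(repr(clean_text(None)), repr(clean_text('None')))

# the slice is applied to every multi-word string, quoted or not
unquoted = clean_text('Age over 18')
print(repr(unquoted))
```

Note that the slice is unconditional for multi-word strings, so an unquoted field would lose its first and last characters. It may be worth checking a few raw values of the export before relying on it.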

In [18]:
df_trials = process_zipfile(os.path.join(path_to_data, base_dataset_name))


In [19]:
df_trials.shape

(774632, 63)
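
The loader only reads the first member of the archive, which is fine as long as the export contains a single CSV file. The sketch below illustrates that behaviour on an in-memory archive. It uses the standard `csv` module as a stand-in for `pd.read_csv`, and the file names and rows are made up for the illustration.

```python
import csv
import io
import zipfile

def read_first_member(zip_source):
    # open the archive and parse its first member as CSV
    with zipfile.ZipFile(zip_source, 'r') as archive:
        first_name = archive.namelist()[0]
        with archive.open(first_name, 'r') as raw:
            text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
            return list(csv.DictReader(text))

# hypothetical archive with two members; only the first one is read
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('export.csv', 'TrialID,public_title\nT-1,First toy trial\nT-2,Second toy trial\n')
    archive.writestr('notes.txt', 'ignored by the loader')

rows = read_first_member(buffer)
print(len(rows), rows[0]['public_title'])
```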

In [20]:
df_trials.columns

Index(['TrialID', '(No column name)', 'SecondaryIDs', 'public_title',
       'Scientific_title', 'url', 'Public_Contact_Firstname',
       'Public_Contact_Lastname', 'Public_Contact_Address',
       'Public_Contact_Email', 'Public_Contact_Tel',
       'Public_Contact_Affiliation', 'Scientific_Contact_Firstname',
       'Scientific_Contact_Lastname', 'Scientific_Contact_Address',
       'Scientific_Contact_Email', 'Scientific_Contact_Tel',
       'Scientific_Contact_Affiliation', 'study_type', 'study_design', 'phase',
       'Date_registration', 'Date_enrollement', 'Target_size',
       'Recruitment_status', 'Primary_sponsor', 'Secondary_sponsors',
       'Source_Support', 'Countries', 'Conditions', 'Interventions', 'Age_min',
       'Age_max', 'Gender', 'Inclusion_Criteria', 'Exclusion_Criteria',
       'Primary_Outcome', 'Secondary_Outcomes', 'Bridging_flag',
       'Bridged_type', 'Childs', 'type_enrolment', 'Retrospective_flag',
       'results_actual_enrollment', 'results_url_link'

In [21]:
df_trials = clean_dataframe(df_trials)

In [22]:
df_trials.head()

| | public_title | Scientific_title | Inclusion_Criteria | Exclusion_Criteria | Primary_Outcome | Secondary_Outcomes | results_summary |
|---|---|---|---|---|---|---|---|
| 0 | Evaluation of the Impact of the Administration... | Evaluation of the Impact of the Administration... | patient = 18 years old,, symptomatic COVID-19 ... | | negativation of the RT-PCR test on nasopharyng... | | |
| 1 | Correlation Between Cardiac Markers and Severi... | Correlation Between Cardiac Markers and Severi... | all moderate to severe COVID 19 patients who w... | | relation between cardiac markers and mortality... | relation ship between cardiac and inflammatory... | |
| 2 | A Prospective Virtual Study to Evaluate the Lo... | A Prospective Cohort Study of Immunoglobulin G... | Age = 18 years, Willing to provide informed co... | | To describe the longevity and seroprevalence o... | To compare the off-kinetics of IgG positivity ... | |
| 3 | A Prospective Virtual Study of Patient Reporte... | Assessing Behavioral, Functional, and Clinical... | Age >=18 years, Individuals with a valid diagn... | | To identify and describe patient behaviors and... | To describe recovery in patients with test-con... | |
| 4 | A Behavioral Activation Prenatal and Postpartu... | Pilot Study to Evaluate a Behavioral Activatio... | INCLUSION CRITERIA FOR AIMS 1, 3 AND 4, Pregna... | | Focus group | | |


In [23]:
df_trials.Exclusion_Criteria[df_trials.Exclusion_Criteria.apply(lambda s: len(s.split(' ')))>5].tolist()[:100]

['Exclusion Criteria 1.Potential participants with comorbidities which would preclude them from participating in the book club for seven weeks will be excluded; this will be assessed by the facility manager or his/her surrogate. 2.Potential participants who score 3 or 4 on any parts of Section D: Communication and vision of the InterRAI Long-Term Care Facility (LTCF) Assessment Form Version 9.1, copyrighted will be excluded from the trial. 3.All potential participants with dementia who are recorded as exhibiting any verbal abuse, physical abuse or socially disruptive behaviour in their InterRAI (Section E.3, parts b, c, or d) will be excluded from the study.',
 '1. Previous total knee replacement surgery to same knee 2. Bilateral knee replacement surgery 3. previously enrolled in study 4. Chronic opioid use; defined as > 20mg oral morphine equivalent per day on average, in the 4 weeks prior to surgery',
 'Under 18 years of age. History of major systemic or ocular disease. History of oc

#### Export to tsv

In [24]:
df_trials.to_csv(os.path.join(path_to_data, '{}.tsv'.format(final_dataset_name)), sep = "\t", index = False)

#### Export to txt

In [6]:
def clean_txt(t):
    # replace linebreaks with commas
    t = re.sub(r'(\n)+', ', ', t)
    
    # shrink consecutive punctuation to its first character
    t = re.sub(r'(?P<name>[,;:.!?])[,;:.!?]+', r'\g<name>', t)
    
    # shrink whitespace
    t = re.sub(r'(\s)+', ' ', t).strip()
    return t
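
As a quick check of the three steps in `clean_txt`, the sketch below runs it on two made-up strings. The function is repeated so that the snippet is self-contained, and the sample texts are hypothetical.

```python
import re

def clean_txt(t):
    t = re.sub(r'(\n)+', ', ', t)
    t = re.sub(r'(?P<name>[,;:.!?])[,;:.!?]+', r'\g<name>', t)
    t = re.sub(r'(\s)+', ' ', t).strip()
    return t

# a linebreak after a full stop first becomes '., ' and is then shrunk to '. '
first = clean_txt('First criterion.\nSecond criterion')
print(first)

# repeated punctuation and repeated spaces are both collapsed
second = clean_txt('Age over 18\n\nNo prior surgery..  Signed   consent!!')
print(second)
```

Only the first character of a punctuation run is kept, so the order of the steps matters: the comma introduced for a linebreak disappears when the preceding text already ends with punctuation.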

In [7]:
df_trials = pd.read_csv(os.path.join(path_to_data, '{}.tsv'.format(final_dataset_name)), sep = "\t")
df_trials = df_trials[[
    # 'Scientific_title',
    'Inclusion_Criteria',
    'Exclusion_Criteria',
    # 'Primary_Outcome',
    # 'Secondary_Outcomes',
    # 'results_summary',
]].fillna('')

In [8]:
texts = df_trials.apply(func = lambda row: '\n'.join([t for t in row if len(t.split()) > 5]), axis = 1)
texts = [clean_txt(t) for t in texts]
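
The row-wise step keeps only the fields with more than 5 words and joins them with a linebreak. A minimal sketch of that logic on plain tuples, without pandas, is given below. The rows and the `build_text` helper are hypothetical and only serve the illustration.

```python
def build_text(fields, min_words=5):
    # keep the fields that have strictly more than min_words whitespace-separated tokens
    kept = [t for t in fields if len(t.split()) > min_words]
    return '\n'.join(kept)

# hypothetical (Inclusion_Criteria, Exclusion_Criteria) pairs
rows = [
    ('Adults aged 18 to 65 with confirmed diagnosis',
     'Pregnant or breastfeeding women at the time of screening'),
    ('Adults aged 18 to 65 with confirmed diagnosis', 'None'),
    ('N/A', ''),
]
toy_texts = [build_text(row) for row in rows]

for text in toy_texts:
    print(repr(text))
```

Rows where no field passes the filter yield an empty string rather than being dropped, which is why the number of texts stays equal to the number of rows in the dataframe.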

In [9]:
len(texts)

774632

In [10]:
chars, counts = get_character_count(texts[:100000])

In [11]:
chars_to_hide = [char for char, count in zip(chars, counts) if count < 40]
len(chars_to_hide)

47

In [12]:
chars_to_hide

['\x81',
 '\x9d',
 '¦',
 '©',
 'ª',
 '«',
 '\xad',
 '¯',
 '¶',
 '¸',
 '¹',
 '»',
 '¼',
 '¾',
 'Ä',
 'É',
 'Ê',
 'Ë',
 'Í',
 'Î',
 'Ï',
 'Ò',
 'Ó',
 'Ö',
 'Ü',
 'Ý',
 'ã',
 'æ',
 'ê',
 'ë',
 'ì',
 'î',
 '÷',
 'ù',
 'û',
 'ý',
 'œ',
 'š',
 'Ž',
 'ž',
 'ƒ',
 'ˆ',
 '˜',
 '‚',
 '„',
 '‡',
 '‰']

In [13]:
# We choose to hide (replace with a space) all characters appearing fewer than 40 times in the first 100,000 texts
hide_pattern = re.compile('[{}]'.format(re.escape(''.join(chars_to_hide))))
texts = [re.sub('( )+', ' ', hide_pattern.sub(' ', t)) for t in texts]
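
The rare-character filtering can be sketched end to end on a toy corpus. This version uses `collections.Counter` in place of `np.unique` to count characters, and escapes the characters before building the regex class. The helper names, the corpus, and the threshold of 2 are assumptions made for the example.

```python
import re
from collections import Counter

def rare_characters(texts, min_count):
    # count every character of the space-joined corpus
    counts = Counter(' '.join(texts))
    return sorted(ch for ch, n in counts.items() if n < min_count)

def hide_characters(texts, chars):
    # nothing to hide: an empty character class would not compile
    if not chars:
        return list(texts)
    pattern = re.compile('[{}]'.format(re.escape(''.join(chars))))
    # replace each rare character with a space, then collapse repeated spaces
    return [re.sub(' +', ' ', pattern.sub(' ', t)) for t in texts]

# hypothetical corpus: '\u00b6' (pilcrow sign) appears only once
toy_texts = ['no fever'] * 5 + ['no \u00b6 fever']
rare = rare_characters(toy_texts, min_count=2)
hidden = hide_characters(toy_texts, rare)

print(rare)
print(hidden[-1])
```

Escaping matters as soon as a rare character is one that has a special meaning inside a character class, such as `]`, `^`, `-` or a backslash.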

In [14]:
len(texts)

774632

In [15]:
texts[257837]

"Diagnosis of NSCLC, Advanced disease (stage III-IV) according to the TNM 7th /8th edition classification at the beginning of first ICP, Patients must have received at least two lines of ICP during their history of disease, Patients with EGFR mutation and/or ALK translocation must have received all specific target agents regularly reimbursed (during the reporting period) prior to the first PCI Exclusion Criteria: • Opposition form signed by the living patient or opposition clearly indicated in the deceased patient's medical records"

In [16]:
with open(os.path.join(path_to_data, '{}.txt'.format(final_dataset_name)), 'w', encoding = 'utf-8') as f:
    f.write('\n'.join(texts))

In [None]:
# with open(os.path.join(path_to_data, '{}.txt'.format(final_dataset_name)), 'r', encoding = 'utf-8') as f:
#     texts = f.readlines()
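
When reloading the exported file, keep in mind that `readlines()` leaves the trailing newline on each line, so the reloaded list differs from the one that was written. The sketch below shows a round trip that recovers the original list by splitting on `'\n'`. The temporary directory and the file name are stand-ins for `path_to_data` and the real dataset name.

```python
import os
import tempfile

# hypothetical texts, including an empty one as produced by filtered-out rows
texts = ['Age over 18, Signed consent', '', 'No prior surgery']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy-dataset.txt')

    # one text per line, as in the export cell
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(texts))

    # readlines() keeps the line terminators
    with open(path, 'r', encoding='utf-8') as f:
        with_newlines = f.readlines()

    # splitting the whole content on '\n' is the exact inverse of the join
    with open(path, 'r', encoding='utf-8') as f:
        reloaded = f.read().split('\n')

print(with_newlines)
print(reloaded == texts)
```

This relies on each text being free of linebreaks, which holds here because `clean_txt` replaces them before the export.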

[Table of contents](#TOC)