<a href="https://colab.research.google.com/github/MWFK/NLP-Semantic-Similarity/blob/main/ClinicalTrials/Data%20Engineering/01.%20Data_Extraction_Transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Objectives

Get the data from https://clinicaltrials.gov using their API.

Transform the data into a usable dataset.

### Libs

In [None]:
import pandas as pd
# pd.set_option('display.max_columns', None)  
# pd.set_option('display.max_colwidth', None)
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # import from urllib3 directly rather than through the requests.packages alias

### Data

In [None]:
######### Study ##############
# lung cancer 
# NStudiesFound: 10152

######### Study Fields #######
# NCTId, OrgFullName, OfficialTitle, OverallStatus, Keyword, DetailedDescription, Condition, EligibilityCriteria, HealthyVolunteers, Gender, MinimumAge, StudyPopulation, LocationFacility, LocationCity, LocationState, LocationZip, LocationCountry

### New Fields to add
#LocationStatus

######### Range Min_MAX ######
# 1 to 10152

######### Format ############
# CSV


step    = 1000
min_rnk = 1
max_rnk = step

# Build one session with retry and backoff, and reuse it for every batch
session = requests.Session()
retry   = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://' , adapter)
session.mount('https://', adapter)

for req in range(11):  # 11 batches of 1000 ranks cover the 10152 studies found

    print("Downloading Lung Cancer clinical trials with ranks from ", min_rnk, " to ", max_rnk)

    url = 'https://clinicaltrials.gov/api/query/study_fields?expr=lung+cancer&fields=NCTId%2C+OrgFullName%2C+OfficialTitle%2C+OverallStatus%2C+Keyword%2C+DetailedDescription%2C+Condition%2C+EligibilityCriteria%2C+HealthyVolunteers%2C+Gender%2C+MinimumAge%2C+StudyPopulation%2C+LocationFacility%2C+LocationCity%2C+LocationState%2C+LocationStatus%2C+LocationZip%2C+LocationCountry&min_rnk='+str(min_rnk)+'&max_rnk='+str(max_rnk)+'&fmt=csv'

    clinicaltrials = session.get(url)
    print('Download Request Status: ', clinicaltrials.status_code)

    # Save the raw CSV response of this batch; the context manager closes the file
    with open('/content/'+str(req)+'-batch.csv', 'wb') as csv_file:
        csv_file.write(clinicaltrials.content)
    
    min_rnk = max_rnk + 1
    max_rnk += step

Downloading Lung Cancer clinical trials with ranks from  1  to  1000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  1001  to  2000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  2001  to  3000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  3001  to  4000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  4001  to  5000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  5001  to  6000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  6001  to  7000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  7001  to  8000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  8001  to  9000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  9001  to  10000
Download Req
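
The paging loop above advances `min_rnk` and `max_rnk` by hand and always requests full blocks of 1000 ranks. As a minimal sketch, the same windows can be generated by a small helper that also clamps the last window to the total. The name `rank_windows` is hypothetical, and the total of 10152 is taken from the notes in the download cell.

```python
def rank_windows(total, step=1000):
    """Yield (min_rnk, max_rnk) pairs covering ranks 1..total in blocks of `step`.

    Hypothetical helper, not part of the original notebook.
    """
    for start in range(1, total + 1, step):
        # The last window is clamped so it never runs past `total`
        yield start, min(start + step - 1, total)


windows = list(rank_windows(10152))
print(len(windows))   # 11
print(windows[0])     # (1, 1000)
print(windows[-1])    # (10001, 10152)
```

Iterating with `for req, (min_rnk, max_rnk) in enumerate(rank_windows(10152)):` would give the same batch numbering as the loop above.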

In [None]:
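# Read every batch, skipping the first 10 lines of each file
df = pd.read_csv(r'/content/0-batch.csv', skiprows=10)
frames = [df]
for req in range(1, 11): 
    tmp = pd.read_csv('/content/' +str(req)+ '-batch.csv', skiprows=10)
    print('Batch ', req, ': ', tmp.shape)
    frames.append(tmp)

# DataFrame.append was removed in pandas 2.0; concatenate all batches once instead
df = pd.concat(frames, ignore_index=True)

df.to_csv(r'/content/batchs.csv')
print('All Batchs: ',df.shape)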

Batch  1 :  (1000, 19)
Batch  2 :  (1000, 19)
Batch  3 :  (1000, 19)
Batch  4 :  (1000, 19)
Batch  5 :  (1000, 19)
Batch  6 :  (1000, 19)
Batch  7 :  (1000, 19)
Batch  8 :  (1000, 19)
Batch  9 :  (1000, 19)
Batch  10 :  (221, 19)
All Batchs:  (10221, 19)
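
The combined frame has 10221 rows, while the notes in the download cell record 10152 studies found. The source does not explain the difference. One possible check, sketched here on made-up toy data, is to look for `NCTId` values that occur more than once after concatenation. The helper name `combine_batches` and the toy identifiers are assumptions for illustration only.

```python
import pandas as pd


def combine_batches(frames):
    """Concatenate batch frames and return (combined, rows whose NCTId repeats).

    Hypothetical helper, not part of the original notebook.
    """
    combined = pd.concat(frames, ignore_index=True)
    # keep=False marks every occurrence of a repeated NCTId, not only the later ones
    dup_mask = combined.duplicated(subset="NCTId", keep=False)
    return combined, combined[dup_mask]


# Toy batches with made-up identifiers; one ID appears in both
batch_a = pd.DataFrame({"Rank": [1, 2], "NCTId": ["NCT00000001", "NCT00000002"]})
batch_b = pd.DataFrame({"Rank": [3, 4], "NCTId": ["NCT00000002", "NCT00000003"]})

combined, dups = combine_batches([batch_a, batch_b])
print(combined.shape)           # (4, 2)
print(dups["NCTId"].tolist())   # ['NCT00000002', 'NCT00000002']
```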


### Separate the Eligibility Criteria

In [None]:
eligibility_criteria = df['EligibilityCriteria'].astype(str).to_list()
eligibility_criteria = [s.replace('|', '') for s in eligibility_criteria]
eligibility_criteria[:2]

['Inclusion Criteria:Age ≥ 18 years at the time of screening.Eastern Cooperative Oncology Group performance status of ≤ 2.Written informed consent obtained from the patient.Histologically and cytologically documented Stage 3B-4 lung cancer (according to Version 8 of the International Association for the Study of Lung Cancer Staging system).Patients with stage 1 to 3, who undergo radical therapy with disease free survival (DFS) >12 months.Willingness and ability to comply with scheduled visits and other study procedures.Exclusion Criteria:History of another primary malignancy except for malignancy treated with curative intent with known active disease ≥ 5 years before date of the informed consent.Without signed informed consent.Unwillingness or inability to comply with scheduled visits or other study procedures.Previously diagnosed with VTE before signing informed consent.',
 'Inclusion Criteria:20 or more pack year smoking historyclinical need for diagnostic bronchoscopy or consent to 

In [None]:
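# Keep the text after the 'Exclusion Criteria' marker. The marker plus its colon is 19 characters long,
# so the hard-coded offset of 21 also drops the first two characters of the exclusion text.
exclusion_criteria = [txt[txt.find('Exclusion Criteria')+21:] for txt in eligibility_criteria]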
exclusion_criteria[:2]

['story of another primary malignancy except for malignancy treated with curative intent with known active disease ≥ 5 years before date of the informed consent.Without signed informed consent.Unwillingness or inability to comply with scheduled visits or other study procedures.Previously diagnosed with VTE before signing informed consent.',
 'ng Cancer within 3 months after the date of enrollment']

In [None]:
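# Keep the text between the 'Inclusion Criteria:' prefix and the 'Exclusion Criteria' marker.
# The prefix is 19 characters long, so starting at 21 also drops the first two characters of the inclusion text.
inclusion_criteria = [txt[21:txt.find('Exclusion Criteria')] for txt in eligibility_criteria]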
inclusion_criteria[:2]

['e ≥ 18 years at the time of screening.Eastern Cooperative Oncology Group performance status of ≤ 2.Written informed consent obtained from the patient.Histologically and cytologically documented Stage 3B-4 lung cancer (according to Version 8 of the International Association for the Study of Lung Cancer Staging system).Patients with stage 1 to 3, who undergo radical therapy with disease free survival (DFS) >12 months.Willingness and ability to comply with scheduled visits and other study procedures.',
 ' or more pack year smoking historyclinical need for diagnostic bronchoscopy or consent to study driven bronchoscopy']
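
Both outputs above start mid-word (`'story of another ...'`, `'e ≥ 18 years ...'`), because the slices use a fixed offset of 21 while each marker is 19 characters long. A minimal sketch of a split that derives the offsets from the markers and tolerates a missing marker could look like this. The function name `split_criteria` is hypothetical, and treating text without an exclusion marker as inclusion-only is an assumption, not the notebook's behavior.

```python
INCLUSION = "Inclusion Criteria:"
EXCLUSION = "Exclusion Criteria:"


def split_criteria(text):
    """Return (inclusion, exclusion) parts of an eligibility string.

    Hypothetical helper. If the exclusion marker is absent, the whole text
    is treated as inclusion criteria (an assumption for this sketch).
    """
    exc_pos = text.find(EXCLUSION)
    if exc_pos == -1:
        head, exclusion = text, ""
    else:
        head = text[:exc_pos]
        exclusion = text[exc_pos + len(EXCLUSION):]

    inc_pos = head.find(INCLUSION)
    inclusion = head[inc_pos + len(INCLUSION):] if inc_pos != -1 else head
    return inclusion.strip(), exclusion.strip()


# Abridged from the first record shown above
sample = ("Inclusion Criteria:Age ≥ 18 years at the time of screening."
          "Exclusion Criteria:Without signed informed consent.")
inc, exc = split_criteria(sample)
print(inc)  # Age ≥ 18 years at the time of screening.
print(exc)  # Without signed informed consent.
```

Applied to the notebook's list, this would be `pairs = [split_criteria(txt) for txt in eligibility_criteria]`.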

In [None]:
df['InclusionCriteria'] = inclusion_criteria
df['ExclusionCriteria'] = exclusion_criteria
cols = ['Rank', 'NCTId', 'OrgFullName', 'OfficialTitle', 'OverallStatus','Keyword', 'DetailedDescription', 'Condition', 'EligibilityCriteria','InclusionCriteria', 'ExclusionCriteria',
       'HealthyVolunteers', 'Gender', 'MinimumAge', 'StudyPopulation','LocationFacility', 'LocationCity', 'LocationState', 'LocationStatus', 'LocationZip','LocationCountry']
df = df[cols] 
df.head(0)

Unnamed: 0,Rank,NCTId,OrgFullName,OfficialTitle,OverallStatus,Keyword,DetailedDescription,Condition,EligibilityCriteria,InclusionCriteria,ExclusionCriteria,HealthyVolunteers,Gender,MinimumAge,StudyPopulation,LocationFacility,LocationCity,LocationState,LocationStatus,LocationZip,LocationCountry


### Data Overview

In [None]:
print('Number of unique values:',        df['Gender'].nunique())
print('\nGender list of unique values\n', df['Gender'].unique())

Number of unique values: 3

Gender list of unique values
 ['All' nan 'Female' 'Male']


In [None]:
print('Number of unique values:',      df['MinimumAge'].nunique())
print('\nAge list of unique values\n', df['MinimumAge'].unique())

Number of unique values: 50

Age list of unique values
 ['18 Years' '50 Years' '40 Years' nan '5 Years' '55 Years' '45 Years'
 '65 Years' '21 Years' '25 Years' '19 Years' '20 Years' '56 Years'
 '49 Years' '35 Years' '15 Years' '16 Years' '30 Years' '60 Years'
 '46 Years' '70 Years' '2 Years' '26 Years' '75 Years' '22 Years'
 '17 Years' '76 Years' '47 Years' '18 Months' '71 Years' '23 Years'
 '80 Years' '10 Years' '38 Years' '8 Months' '3 Years' '28 Years'
 '13 Years' '3 Months' '27 Years' '12 Years' '1 Year' '6 Years' '14 Years'
 '41 Years' '6 Months' '51 Years' '1 Month' '4 Years' '12 Months'
 '39 Years']
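
The `MinimumAge` values mix years and months as text, so they cannot be compared numerically as they are. A minimal sketch of a conversion to years follows, assuming only the `Year(s)` and `Month(s)` units seen in the output above. The function name `age_to_years` is hypothetical, and any other unit raises a `KeyError` in this sketch.

```python
import math

# How many of each unit make up one year; only the units seen above are covered
UNITS_PER_YEAR = {"year": 1, "month": 12}


def age_to_years(value):
    """Convert strings such as '18 Years' or '6 Months' to a float number of years.

    Hypothetical helper. Non-string input (for example NaN) yields NaN.
    """
    if not isinstance(value, str):
        return math.nan
    number, unit = value.split()
    unit = unit.lower().rstrip("s")  # 'Years' -> 'year', 'Month' -> 'month'
    return int(number) / UNITS_PER_YEAR[unit]


print(age_to_years("18 Years"))   # 18.0
print(age_to_years("18 Months"))  # 1.5
print(age_to_years("1 Year"))     # 1.0
```

On the notebook's frame this could be applied as `df['MinimumAge'].map(age_to_years)` and stored in a new column.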


In [None]:
print('Number of unique values:',                df['OverallStatus'].nunique())
print('\nOverallStatus list of unique values\n', df['OverallStatus'].unique())

Number of unique values: 13

OverallStatus list of unique values
 ['Not yet recruiting' 'Active, not recruiting' 'Recruiting' 'Completed'
 'Unknown status' 'Terminated' 'Withdrawn' 'Enrolling by invitation'
 'Suspended' 'Available' 'Approved for marketing' 'No longer available'
 'Temporarily not available']


In [None]:
print('Number of unique values:',                df['LocationStatus'].nunique())
print('\nLocationStatus list of unique values\n', df['LocationStatus'].unique())

In [None]:
print('Number of unique values:',            df['Condition'].nunique())
print('\nCondition list of unique values\n', df['Condition'].unique())

Number of unique values: 4627

Condition list of unique values
 ['Lung Neoplasms|Venous Thromboembolism' 'Lung Cancer'
 'Lung Neoplasms|Computed Tomography|Mass Screening|Lung Nodules' ...
 'Esophageal Squamous Cell Carcinoma|Neoadjuvant Therapy|Surgery'
 'B-cell Leukemia' 'Hematologic Neoplasms']
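
Many of the 4627 unique `Condition` values are pipe-separated combinations of individual conditions. A minimal sketch, on a toy frame with made-up IDs and two condition strings from the output above, splits each cell on `|` and counts the individual conditions. The variable names are illustrative only.

```python
import pandas as pd

# Toy frame: made-up IDs, condition strings taken from the output above
toy = pd.DataFrame({
    "NCTId": ["NCT-A", "NCT-B"],
    "Condition": ["Lung Neoplasms|Venous Thromboembolism", "Lung Cancer"],
})

# Split each cell into a list, then give every condition its own row
exploded = toy.assign(Condition=toy["Condition"].str.split("|")).explode("Condition")

print(exploded["Condition"].nunique())   # 3
print(sorted(exploded["Condition"]))     # ['Lung Cancer', 'Lung Neoplasms', 'Venous Thromboembolism']
```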


In [None]:
print('Number of unique values:',                    df['HealthyVolunteers'].nunique())
print('\nHealthyVolunteers list of unique values\n', df['HealthyVolunteers'].unique())

Number of unique values: 2

HealthyVolunteers list of unique values
 ['No' 'Accepts Healthy Volunteers' nan]


In [None]:
print('Number of unique values:',                  df['StudyPopulation'].nunique())
print('\nStudyPopulation list of unique values\n', df['StudyPopulation'].unique())

Number of unique values: 1897

StudyPopulation list of unique values
 ['Patients diagnosed with advanced staged lung cancer with written informed consent.'
 'The study population will consist of subjects aged 50 to 90 and with 20 or more pack year smoking history, who are determined not to have lung cancer at the time of enrollment or within three months after the date of enrollment, and either a) volunteer for the study driven bronchoscopy, or b) have standard of care clinical need for diagnostic bronchoscopy (e.g. they may present with respiratory symptoms or abnormal test results consistent with the need for bronchoscopy).'
 nan ...
 'From February 2008 to December 2009 all patients admitted to The Department of Surgical Gastroenterology with upper GI cancer or pancreatic cancer will be included.||Depending on the disease nature and progression, the patients will be followed as palliation or surgery cohorts.'
 'Patients with esophageal squamous cell carcinoma who accept esophagectom

In [None]:
print('Number of unique values in InclusionCriteria:',                  df['InclusionCriteria'].nunique())
print('Number of unique values in ExclusionCriteria:',                  df['ExclusionCriteria'].nunique())

Number of unique values in InclusionCriteria: 10166
Number of unique values in ExclusionCriteria: 9950
