### Objectives

Building on the previous proof of concept (POC), this notebook automates the download of all lung cancer studies and processes the data the same way as the previous notebook.

### Libs

In [65]:
import pandas as pd
# None means "no limit"; -1 is deprecated for max_colwidth in newer pandas versions
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

### Data Collection

In [8]:
######### Study ##############
# lung cancer 
# NStudiesFound: 10152

######### Study Fields #######
# NCTId, OrgFullName, OfficialTitle, OverallStatus, Keyword, DetailedDescription, Condition, EligibilityCriteria, HealthyVolunteers, Gender, MinimumAge, StudyPopulation, LocationFacility, LocationCity, LocationState, LocationZip, LocationCountry

######### Range Min_MAX ######
# 1000 ranks per request (1 to 1000, 1001 to 2000, ...)

######### Format ############
# CSV


step    = 1000
min_rnk = 1
max_rnk = step

# One session with retry and backoff is reused for every request
session = requests.Session()
retry   = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://' , adapter)
session.mount('https://', adapter)

# 11 batches of 1000 ranks cover the 10152 studies found
for req in range(11):
    
    print("Downloading Lung Cancer clinical trials with ranks from ", min_rnk, " to ", max_rnk)

    url = 'https://clinicaltrials.gov/api/query/study_fields?expr=lung+cancer&fields=NCTId%2C+OrgFullName%2C+OfficialTitle%2C+OverallStatus%2C+Keyword%2C+DetailedDescription%2C+Condition%2C+EligibilityCriteria%2C+HealthyVolunteers%2C+Gender%2C+MinimumAge%2C+StudyPopulation%2C+LocationFacility%2C+LocationCity%2C+LocationState%2C+LocationZip%2C+LocationCountry&min_rnk='+str(min_rnk)+'&max_rnk='+str(max_rnk)+'&fmt=csv'

    clinicaltrials = session.get(url)
    print('Download Request Status: ', clinicaltrials.status_code)
    
    # Save the raw CSV response; the with-block closes the file even if the write fails
    with open('C:/Users/Almighty/Python workspace/ClinicalNet/Data/'+str(req)+'-batch.csv', 'wb') as csv_file:
        csv_file.write(clinicaltrials.content)
    
    min_rnk = max_rnk + 1
    max_rnk += step

Downloading Lung Cancer clinical trials with ranks from  1  to  1000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  1001  to  2000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  2001  to  3000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  3001  to  4000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  4001  to  5000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  5001  to  6000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  6001  to  7000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  7001  to  8000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  8001  to  9000
Download Request Status:  200
Downloading Lung Cancer clinical trials with ranks from  9001  to  10000
Download Req
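
The rank windows and the query string can also be generated instead of being hand-encoded. The following is a minimal sketch, assuming the same endpoint and field list as the cell above; `batch_ranges` and `build_url` are hypothetical helper names, and no request is actually sent here.

```python
from urllib.parse import urlencode

BASE_URL = "https://clinicaltrials.gov/api/query/study_fields"
FIELDS = [
    "NCTId", "OrgFullName", "OfficialTitle", "OverallStatus", "Keyword",
    "DetailedDescription", "Condition", "EligibilityCriteria",
    "HealthyVolunteers", "Gender", "MinimumAge", "StudyPopulation",
    "LocationFacility", "LocationCity", "LocationState", "LocationZip",
    "LocationCountry",
]


def batch_ranges(total, step=1000):
    """Yield inclusive (min_rnk, max_rnk) pairs covering ranks 1..total."""
    for start in range(1, total + 1, step):
        yield start, min(start + step - 1, total)


def build_url(min_rnk, max_rnk, expr="lung cancer"):
    """Build the study_fields query URL for one rank window (hypothetical helper)."""
    params = {
        "expr": expr,
        "fields": ",".join(FIELDS),
        "min_rnk": min_rnk,
        "max_rnk": max_rnk,
        "fmt": "csv",
    }
    # urlencode percent-escapes the commas and turns the space into '+'
    return f"{BASE_URL}?{urlencode(params)}"


ranges = list(batch_ranges(10152))
print(len(ranges), ranges[0], ranges[-1])  # 11 (1, 1000) (10001, 10152)
print(build_url(*ranges[0])[:80])
```

Unlike the loop above, this sketch clamps the last window to the number of studies found, so the final request asks for ranks 10001 to 10152 rather than 10001 to 11000.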

### Convert Data to DataFrame

In [20]:
df = pd.read_csv(r'C:\Users\Almighty\Python workspace\ClinicalNet\Data\0-batch.csv', skiprows=10)
frames = [df]
for req in range(1, 11):
    tmp = pd.read_csv('C:/Users/Almighty/Python workspace/ClinicalNet/Data/' +str(req)+ '-batch.csv', skiprows=10)
    print('Batch ', req, ': ', tmp.shape)
    
    frames.append(tmp)

# DataFrame.append was removed in pandas 2.0; concatenate all batches in one call
df = pd.concat(frames, ignore_index=True)

print('All Batches: ',df.shape)
df.head()

Batch  1 :  (1000, 18)
Batch  2 :  (1000, 18)
Batch  3 :  (1000, 18)
Batch  4 :  (1000, 18)
Batch  5 :  (1000, 18)
Batch  6 :  (1000, 18)
Batch  7 :  (1000, 18)
Batch  8 :  (1000, 18)
Batch  9 :  (1000, 18)
Batch  10 :  (152, 18)
All Batches:  (10152, 18)


Unnamed: 0,Rank,NCTId,OrgFullName,OfficialTitle,OverallStatus,Keyword,DetailedDescription,Condition,EligibilityCriteria,HealthyVolunteers,Gender,MinimumAge,StudyPopulation,LocationFacility,LocationCity,LocationState,LocationZip,LocationCountry
0,1,NCT03581708,Guangdong Provincial People's Hospital,Real-world Study of the Incidence and Risk Fac...,Not yet recruiting,lung cancer|Venous Thromboembolism,VTE has high incidence in lung cancer and incr...,Lung Neoplasms|Venous Thromboembolism,Inclusion Criteria:||Age ≥ 18 years at the tim...,No,All,18 Years,Patients diagnosed with advanced staged lung c...,Guangdong General Hospital,Guangzhou,Guagndong,510080,China
1,2,NCT01130285,University of Toledo Health Science Campus,Validation of a Multi-gene Test for Lung Cance...,"Active, not recruiting",Lung Cancer,"Because more than 160,000 individuals die of l...",Lung Cancer,Inclusion Criteria:||20 or more pack year smok...,Accepts Healthy Volunteers,All,50 Years,The study population will consist of subjects ...,National Jewish Health|University of Michigan|...,Denver|Ann Arbor|Detroit|Rochester|Cleveland|C...,Colorado|Michigan|Michigan|Minnesota|Ohio|Ohio...,80206|48109|48202|55905|44195|43221|43606|4360...,United States|United States|United States|Unit...
2,3,NCT03992833,Tianjin Medical University Cancer Institute an...,Methods of Computed Tomography Screening and M...,Recruiting,,"In this population-based study, participants w...",Lung Neoplasms|Computed Tomography|Mass Screen...,Inclusion Criteria:||Aged 40-74 years;|Residen...,Accepts Healthy Volunteers,All,40 Years,,Tianjin Medical University Cancer Institute An...,Tianjin,Tianjin,300060,China
3,4,NCT02725892,AstraZeneca,LuCaReAl: Lung Cancer Registry in Algeria.,Completed,lung cancer epidemiology algeria registry,The study consists of:||All patients meeting i...,Oncology & Epidemiology & Lung Cancer,Inclusion Criteria:||Men or women diagnosed wi...,No,All,,each sanitary region defined by the Ministry o...,Research Site|Research Site|Research Site,Algiers|Constantine|Oran,,16000|25000|31000,Algeria|Algeria|Algeria
4,5,NCT00897650,Vanderbilt-Ingram Cancer Center,Molecular Fingerprints in Lung Cancer: Predict...,Completed,lung cancer,OBJECTIVES:||To determine protein and/or RNA e...,Lung Cancer,Inclusion criteria||Diagnosis of suspected lun...,No,All,,People who have or may have lung cancer.,Vanderbilt-Ingram Cancer Center,Nashville,Tennessee,37232-6838,United States


### Separate the Eligibility Criteria

In [48]:
eligibility_criteria = df['EligibilityCriteria'].astype(str).to_list()

print(type(eligibility_criteria))
eligibility_criteria[:3]

<class 'list'>


['Inclusion Criteria:||Age ≥ 18 years at the time of screening.|Eastern Cooperative Oncology Group performance status of ≤ 2.|Written informed consent obtained from the patient.|Histologically and cytologically documented Stage 3B-4 lung cancer (according to Version 8 of the International Association for the Study of Lung Cancer Staging system).|Patients with stage 1 to 3, who undergo radical therapy with disease free survival (DFS) >12 months.|Willingness and ability to comply with scheduled visits and other study procedures.||Exclusion Criteria:||History of another primary malignancy except for malignancy treated with curative intent with known active disease ≥ 5 years before date of the informed consent.|Without signed informed consent.|Unwillingness or inability to comply with scheduled visits or other study procedures.|Previously diagnosed with VTE before signing informed consent.',
 'Inclusion Criteria:||20 or more pack year smoking history|clinical need for diagnostic bronchosco

In [49]:
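# 21 = len('Exclusion Criteria:||'). Caveat: str.find is case-sensitive and returns -1 when the
# header is absent, so entries headed 'Exclusion criteria||' are sliced as txt[20:] instead.
exclusion_criteria = [txt[txt.find('Exclusion Criteria')+21:] for txt in eligibility_criteria]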
exclusion_criteria

['History of another primary malignancy except for malignancy treated with curative intent with known active disease ≥ 5 years before date of the informed consent.|Without signed informed consent.|Unwillingness or inability to comply with scheduled visits or other study procedures.|Previously diagnosed with VTE before signing informed consent.',
 'Lung Cancer within 3 months after the date of enrollment',
 'Pregnant woman will be excluded.',
 'Patients who did not provide the informed consent form|Patients with a mental or psychological disorder according to their treating clinicians',
 'Diagnosis of suspected lung cancer or lung cancer||Exclusion criteria||Inability to undergo therapy',
 'Those who do not meet any of the above conditions are excluded. Reason: The included non-small cell lung adenocarcinoma samples and normal samples should have statistical significance, and the conclusions obtained should be scientific and valid. The selection of subjects for this study, while excludi

In [50]:
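# 21 = len('Inclusion Criteria:||'). Caveat: when 'Exclusion Criteria' is not found, find returns -1
# and the slice becomes txt[21:-1], which keeps the exclusion text and drops the last character.
inclusion_criteria = [txt[21:txt.find('Exclusion Criteria')] for txt in eligibility_criteria]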
inclusion_criteria

['Age ≥ 18 years at the time of screening.|Eastern Cooperative Oncology Group performance status of ≤ 2.|Written informed consent obtained from the patient.|Histologically and cytologically documented Stage 3B-4 lung cancer (according to Version 8 of the International Association for the Study of Lung Cancer Staging system).|Patients with stage 1 to 3, who undergo radical therapy with disease free survival (DFS) >12 months.|Willingness and ability to comply with scheduled visits and other study procedures.||',
 '20 or more pack year smoking history|clinical need for diagnostic bronchoscopy or consent to study driven bronchoscopy||',
 'Aged 40-74 years;|Resident in the Hexi district of Tianjin city for at least 3 years;|Having no self-reported history of any malignant tumor.||',
 'Men or women diagnosed with lung cancer all types and stages confirmed over 12 months of recruitment period by a pathologist||Aged at least18 years at diagnosis|Patients who provide their informed consent form
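
The fixed-offset slicing above depends on the exact header spelling, which is why the fifth study (headed `Inclusion criteria||` and `Exclusion criteria||`) is split incorrectly. A more tolerant split could look like the sketch below. It is an illustration only, not the method used in this notebook: `split_criteria` is a hypothetical helper, and it assumes the first case-insensitive occurrence of "exclusion criteria" marks the start of the exclusion block, which can misfire if that phrase appears inside the inclusion text.

```python
import re

# Case-insensitive headers; the optional colon also covers "Inclusion criteria||".
_EXCLUSION = re.compile(r"exclusion criteria:?", re.IGNORECASE)
_INCLUSION = re.compile(r"^\s*inclusion criteria:?", re.IGNORECASE)


def split_criteria(text):
    """Return (inclusion, exclusion) with headers and surrounding '|' removed."""
    match = _EXCLUSION.search(text)
    if match:
        inclusion, exclusion = text[:match.start()], text[match.end():]
    else:
        # No exclusion header found: treat the whole text as inclusion criteria
        inclusion, exclusion = text, ""
    inclusion = _INCLUSION.sub("", inclusion)
    return inclusion.strip("| "), exclusion.strip("| ")


sample = ("Inclusion criteria||Diagnosis of suspected lung cancer or lung cancer"
          "||Exclusion criteria||Inability to undergo therapy")
print(split_criteria(sample))
# ('Diagnosis of suspected lung cancer or lung cancer', 'Inability to undergo therapy')
```

Applied to the notebook's data, this would be called once per entry of `eligibility_criteria`, and the resulting pairs unpacked into the two columns.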

In [51]:
df['InclusionCriteria'] = inclusion_criteria
df['ExclusionCriteria'] = exclusion_criteria
cols = ['Rank', 'NCTId', 'OrgFullName', 'OfficialTitle', 'OverallStatus','Keyword', 'DetailedDescription', 'Condition', 'EligibilityCriteria','InclusionCriteria', 'ExclusionCriteria',
       'HealthyVolunteers', 'Gender', 'MinimumAge', 'StudyPopulation','LocationFacility', 'LocationCity', 'LocationState', 'LocationZip','LocationCountry']
df = df[cols] 
df.head()

Unnamed: 0,Rank,NCTId,OrgFullName,OfficialTitle,OverallStatus,Keyword,DetailedDescription,Condition,EligibilityCriteria,InclusionCriteria,ExclusionCriteria,HealthyVolunteers,Gender,MinimumAge,StudyPopulation,LocationFacility,LocationCity,LocationState,LocationZip,LocationCountry
0,1,NCT03581708,Guangdong Provincial People's Hospital,Real-world Study of the Incidence and Risk Fac...,Not yet recruiting,lung cancer|Venous Thromboembolism,VTE has high incidence in lung cancer and incr...,Lung Neoplasms|Venous Thromboembolism,Inclusion Criteria:||Age ≥ 18 years at the tim...,Age ≥ 18 years at the time of screening.|Easte...,History of another primary malignancy except f...,No,All,18 Years,Patients diagnosed with advanced staged lung c...,Guangdong General Hospital,Guangzhou,Guagndong,510080,China
1,2,NCT01130285,University of Toledo Health Science Campus,Validation of a Multi-gene Test for Lung Cance...,"Active, not recruiting",Lung Cancer,"Because more than 160,000 individuals die of l...",Lung Cancer,Inclusion Criteria:||20 or more pack year smok...,20 or more pack year smoking history|clinical ...,Lung Cancer within 3 months after the date of ...,Accepts Healthy Volunteers,All,50 Years,The study population will consist of subjects ...,National Jewish Health|University of Michigan|...,Denver|Ann Arbor|Detroit|Rochester|Cleveland|C...,Colorado|Michigan|Michigan|Minnesota|Ohio|Ohio...,80206|48109|48202|55905|44195|43221|43606|4360...,United States|United States|United States|Unit...
2,3,NCT03992833,Tianjin Medical University Cancer Institute an...,Methods of Computed Tomography Screening and M...,Recruiting,,"In this population-based study, participants w...",Lung Neoplasms|Computed Tomography|Mass Screen...,Inclusion Criteria:||Aged 40-74 years;|Residen...,Aged 40-74 years;|Resident in the Hexi distric...,Pregnant woman will be excluded.,Accepts Healthy Volunteers,All,40 Years,,Tianjin Medical University Cancer Institute An...,Tianjin,Tianjin,300060,China
3,4,NCT02725892,AstraZeneca,LuCaReAl: Lung Cancer Registry in Algeria.,Completed,lung cancer epidemiology algeria registry,The study consists of:||All patients meeting i...,Oncology & Epidemiology & Lung Cancer,Inclusion Criteria:||Men or women diagnosed wi...,Men or women diagnosed with lung cancer all ty...,Patients who did not provide the informed cons...,No,All,,each sanitary region defined by the Ministry o...,Research Site|Research Site|Research Site,Algiers|Constantine|Oran,,16000|25000|31000,Algeria|Algeria|Algeria
4,5,NCT00897650,Vanderbilt-Ingram Cancer Center,Molecular Fingerprints in Lung Cancer: Predict...,Completed,lung cancer,OBJECTIVES:||To determine protein and/or RNA e...,Lung Cancer,Inclusion criteria||Diagnosis of suspected lun...,iagnosis of suspected lung cancer or lung canc...,Diagnosis of suspected lung cancer or lung can...,No,All,,People who have or may have lung cancer.,Vanderbilt-Ingram Cancer Center,Nashville,Tennessee,37232-6838,United States


### Data Overview

In [74]:
print('Number of unique values:',    df['MinimumAge'].nunique())
print('Age list of unique values\n', df['MinimumAge'].unique())

Number of unique values: 50
Age list of unique values
 ['18 Years' '50 Years' '40 Years' nan '5 Years' '55 Years' '45 Years'
 '65 Years' '21 Years' '25 Years' '19 Years' '20 Years' '56 Years'
 '49 Years' '35 Years' '15 Years' '16 Years' '30 Years' '60 Years'
 '46 Years' '70 Years' '2 Years' '26 Years' '75 Years' '22 Years'
 '17 Years' '76 Years' '47 Years' '18 Months' '71 Years' '23 Years'
 '80 Years' '10 Years' '38 Years' '8 Months' '3 Years' '28 Years'
 '13 Years' '3 Months' '27 Years' '12 Years' '1 Year' '6 Years' '14 Years'
 '41 Years' '6 Months' '51 Years' '1 Month' '4 Years' '12 Months'
 '39 Years']
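
`MinimumAge` mixes units ('18 Years', '18 Months', '1 Month'), so the values cannot be compared directly. The sketch below shows one possible way to normalize them to years. It is an assumption-laden illustration: `age_in_years` is a hypothetical helper, and it only handles the year and month units that appear in the output above.

```python
import re

# Divisors that convert each unit to years (only units seen in the output above)
_UNIT_DIVISOR = {"year": 1, "month": 12}
_AGE = re.compile(r"^\s*(\d+)\s+(year|month)s?\s*$", re.IGNORECASE)


def age_in_years(value):
    """Convert strings like '18 Years' or '6 Months' to years; None if unparseable."""
    if not isinstance(value, str):
        return None  # e.g. NaN for studies without a minimum age
    match = _AGE.match(value)
    if match is None:
        return None
    number, unit = match.groups()
    return int(number) / _UNIT_DIVISOR[unit.lower()]


for raw in ["18 Years", "18 Months", "1 Year", "6 Months", float("nan")]:
    print(raw, "->", age_in_years(raw))
```

With pandas, the helper could be applied column-wise with `df['MinimumAge'].map(age_in_years)`.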


In [73]:
print('Number of unique values:',       df['Gender'].nunique())
print('Gender list of unique values\n', df['Gender'].unique())

Number of unique values: 3
Gender list of unique values
 ['All' nan 'Female' 'Male']


In [72]:
print('Number of unique values:',              df['OverallStatus'].nunique())
print('OverallStatus list of unique values\n', df['OverallStatus'].unique())

Number of unique values: 13
OverallStatus list of unique values
 ['Not yet recruiting' 'Active, not recruiting' 'Recruiting' 'Completed'
 'Unknown status' 'Terminated' 'Withdrawn' 'Enrolling by invitation'
 'Suspended' 'Available' 'Approved for marketing' 'No longer available'
 'Temporarily not available']


In [71]:
print('Number of unique values:',          df['Condition'].nunique())
print('Condition list of unique values\n', df['Condition'].unique())

Number of unique values: 4586
Condition list of unique values
 ['Lung Neoplasms|Venous Thromboembolism' 'Lung Cancer'
 'Lung Neoplasms|Computed Tomography|Mass Screening|Lung Nodules' ...
 'Hematologic Neoplasms' 'Prostatic Neoplasms|Urinary Bladder Neoplasms'
 'Multiple Myeloma and Plasma Cell Neoplasm']
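
Part of the reason `Condition` has 4586 unique values is that each cell can hold several conditions joined by `|`, so every distinct combination counts as its own value. Counting the individual terms gives a clearer picture. The sketch below uses only the three cell values visible in the output above and is meant as an illustration, not as part of the original analysis.

```python
from collections import Counter

# The first three Condition cells shown in the output above
conditions = [
    "Lung Neoplasms|Venous Thromboembolism",
    "Lung Cancer",
    "Lung Neoplasms|Computed Tomography|Mass Screening|Lung Nodules",
]

# Split each cell on '|' and count every individual condition term
term_counts = Counter(
    term.strip()
    for cell in conditions
    for term in cell.split("|")
    if term.strip()
)

print(len(term_counts))            # 6 distinct terms
print(term_counts.most_common(1))  # [('Lung Neoplasms', 2)]
```

On the full DataFrame, a comparable result should be obtainable with `df['Condition'].str.split('|').explode().value_counts()`.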


In [75]:
print('Number of unique values:',                  df['HealthyVolunteers'].nunique())
print('HealthyVolunteers list of unique values\n', df['HealthyVolunteers'].unique())

Number of unique values: 2
HealthyVolunteers list of unique values
 ['No' 'Accepts Healthy Volunteers' nan]


In [76]:
print('Number of unique values:',                df['StudyPopulation'].nunique())
print('StudyPopulation list of unique values\n', df['StudyPopulation'].unique())

Number of unique values: 1880
StudyPopulation list of unique values
 ['Patients diagnosed with advanced staged lung cancer with written informed consent.'
 'The study population will consist of subjects aged 50 to 90 and with 20 or more pack year smoking history, who are determined not to have lung cancer at the time of enrollment or within three months after the date of enrollment, and either a) volunteer for the study driven bronchoscopy, or b) have standard of care clinical need for diagnostic bronchoscopy (e.g. they may present with respiratory symptoms or abnormal test results consistent with the need for bronchoscopy).'
 nan ...
 'From February 2008 to December 2009 all patients admitted to The Department of Surgical Gastroenterology with upper GI cancer or pancreatic cancer will be included.||Depending on the disease nature and progression, the patients will be followed as palliation or surgery cohorts.'
 'Patients with esophageal squamous cell carcinoma who accept esophagectomy

In [77]:
print('Number of unique values:',                  df['InclusionCriteria'].nunique())
print('InclusionCriteria list of unique values\n', df['InclusionCriteria'].unique())

Number of unique values: 10093
InclusionCriteria list of unique values
 ['Age ≥ 18 years at the time of screening.|Eastern Cooperative Oncology Group performance status of ≤ 2.|Written informed consent obtained from the patient.|Histologically and cytologically documented Stage 3B-4 lung cancer (according to Version 8 of the International Association for the Study of Lung Cancer Staging system).|Patients with stage 1 to 3, who undergo radical therapy with disease free survival (DFS) >12 months.|Willingness and ability to comply with scheduled visits and other study procedures.||'
 '20 or more pack year smoking history|clinical need for diagnostic bronchoscopy or consent to study driven bronchoscopy||'
 'Aged 40-74 years;|Resident in the Hexi district of Tianjin city for at least 3 years;|Having no self-reported history of any malignant tumor.||'
 ...
 'CS: Histologically confirmed metastatic colon cancer expressing carcinoembryonic antigen (CEA) At least 50% of tumor cells must express

In [78]:
print('Number of unique values:',                  df['ExclusionCriteria'].nunique())
print('ExclusionCriteria list of unique values\n', df['ExclusionCriteria'].unique())

Number of unique values: 9889
ExclusionCriteria list of unique values
 ['History of another primary malignancy except for malignancy treated with curative intent with known active disease ≥ 5 years before date of the informed consent.|Without signed informed consent.|Unwillingness or inability to comply with scheduled visits or other study procedures.|Previously diagnosed with VTE before signing informed consent.'
 'Lung Cancer within 3 months after the date of enrollment'
 'Pregnant woman will be excluded.' ...
 'ICS: Histologically confirmed metastatic colon cancer expressing carcinoembryonic antigen (CEA) At least 50% of tumor cells must express CEA with at least moderate intensity Resectable hepatic metastases or other site of metastatic colon cancer that is resectable (e.g., lung metastases)||PATIENT CHARACTERISTICS: Age: 18 and over Performance status: Karnofsky 70-100% Life expectancy: At least 6 months Hematopoietic: Absolute neutrophil count at least 1000/mm3 Absolute lymphocy

### Export Data

In [52]:
df.to_excel(r'C:\Users\Almighty\Python workspace\ClinicalNet\Data\All_Batches.xlsx', index=False)