<a href="https://colab.research.google.com/github/MWFK/NLP-Semantic-Similarity/blob/main/ClinicalTrials/Data%20Engineering/06.%20Filters_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Objectives

1. Download the lung cancer data with specific features.
2. Process the HealthyVolunteers feature, then use it as a filter.
3. Process the Age feature, then use it as a filter.
4. Process the Gender feature, then use it as a filter.
5. Process the Phase feature, then use it as a filter.
6. Filter by LocationStatus.

### Libs

In [72]:
import re
import pandas as pd
import numpy as np
import requests
from itertools import compress
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

### Data

In [73]:
####### Search Expression #######
# Lung Cancer

####### Study Fields #######
'''
NCTId, OrgFullName, OfficialTitle, OverallStatus, Phase, DetailedDescription, 
Condition, EligibilityCriteria, HealthyVolunteers, Gender, MinimumAge, StudyPopulation, 
LocationFacility, LocationCity, LocationCountry, LocationStatus
'''

####### Range (min_rnk to max_rnk) ######
# 1 to 1000

####### Format #######
# CSV

url = 'https://clinicaltrials.gov/api/query/study_fields?expr=lung+cancer&fields=NCTId%2C+OrgFullName%2C+OfficialTitle%2C+OverallStatus%2C+Phase%2C+DetailedDescription%2C+%0D%0ACondition%2C+EligibilityCriteria%2C+HealthyVolunteers%2C+Gender%2C+MinimumAge%2C+StudyPopulation%2C+%0D%0ALocationFacility%2C+LocationCity%2C+LocationCountry%2C+LocationStatus&min_rnk=1&max_rnk=1000&fmt=csv'
session = requests.Session()
retry   = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://' , adapter)
session.mount('https://', adapter)

clinicaltrials = session.get(url)
print('Download Request Status: ', clinicaltrials.status_code)

# Save the raw CSV response to disk; the context manager closes the file
with open('/content/'+str(1)+'-batch.csv', 'wb') as csv_file:
  csv_file.write(clinicaltrials.content)

raw = pd.read_csv(r'/content/1-batch.csv', skiprows=10)
print(raw.shape)
raw.head()


Download Request Status:  200
(1000, 17)


Unnamed: 0,Rank,NCTId,OrgFullName,OfficialTitle,OverallStatus,Phase,DetailedDescription,Condition,EligibilityCriteria,HealthyVolunteers,Gender,MinimumAge,StudyPopulation,LocationFacility,LocationCity,LocationCountry,LocationStatus
0,1,NCT03581708,Guangdong Provincial People's Hospital,Real-world Study of the Incidence and Risk Fac...,Not yet recruiting,,VTE has high incidence in lung cancer and incr...,Lung Neoplasms|Venous Thromboembolism,Inclusion Criteria:||Age ≥ 18 years at the tim...,No,All,18 Years,Patients diagnosed with advanced staged lung c...,Guangdong General Hospital,Guangzhou,China,
1,2,NCT01130285,University of Toledo,Validation of a Multi-gene Test for Lung Cance...,"Active, not recruiting",,"Because more than 160,000 individuals die of l...",Lung Cancer,Inclusion Criteria:||20 or more pack year smok...,Accepts Healthy Volunteers,All,50 Years,The study population will consist of subjects ...,National Jewish Health|University of Michigan|...,Denver|Ann Arbor|Detroit|Rochester|Cleveland|C...,United States|United States|United States|Unit...,
2,3,NCT03992833,Tianjin Medical University Cancer Institute an...,Methods of Computed Tomography Screening and M...,Recruiting,Not Applicable,"In this population-based study, participants w...",Lung Neoplasms|Computed Tomography|Mass Screen...,Inclusion Criteria:||Aged 40-74 years;|Residen...,Accepts Healthy Volunteers,All,40 Years,,Tianjin Medical University Cancer Institute An...,Tianjin,China,Recruiting
3,4,NCT02725892,AstraZeneca,LuCaReAl: Lung Cancer Registry in Algeria.,Completed,,The study consists of:||All patients meeting i...,Oncology & Epidemiology & Lung Cancer,Inclusion Criteria:||Men or women diagnosed wi...,No,All,,each sanitary region defined by the Ministry o...,Research Site|Research Site|Research Site,Algiers|Constantine|Oran,Algeria|Algeria|Algeria,
4,5,NCT00897650,Vanderbilt-Ingram Cancer Center,Molecular Fingerprints in Lung Cancer: Predict...,Completed,,OBJECTIVES:||To determine protein and/or RNA e...,Lung Cancer,Inclusion criteria||Diagnosis of suspected lun...,No,All,,People who have or may have lung cancer.,Vanderbilt-Ingram Cancer Center,Nashville,United States,
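
The `skiprows=10` argument above implies that the CSV response starts with ten preamble lines before the real header row. As a minimal sketch of that same idea without writing a file to disk, the response text can be parsed in memory. The function name `read_study_fields_csv` and the sample text are hypothetical, and the preamble length is an assumption taken from the `skiprows` value used in this notebook.

```python
import csv
import io

def read_study_fields_csv(text, preamble_lines=10):
    """Parse CSV text, skipping the leading preamble lines before the header."""
    stream = io.StringIO(text, newline='')
    for _ in range(preamble_lines):
        stream.readline()  # discard one preamble line
    # The first remaining line is treated as the header row.
    return list(csv.DictReader(stream))

# Hypothetical response text: 10 preamble lines, then a header and two rows.
sample = ''.join(f'preamble {i}\n' for i in range(10))
sample += 'Rank,NCTId,Gender\n1,NCT03581708,All\n2,NCT01130285,All\n'

rows = read_study_fields_csv(sample)
print(len(rows), rows[0]['NCTId'])
```

With a real response, `clinicaltrials.text` could be passed in place of `sample`; every value comes back as a string, so numeric columns would still need converting.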


### Filtering by [1] HealthyVolunteers

In [None]:
df = raw
print('Data dimensions before Filtering : ', df.shape, '\n')
print(df['HealthyVolunteers'].unique())
print(df['HealthyVolunteers'].nunique())
print(df['HealthyVolunteers'].value_counts())
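# NaN never equals the string 'nan', so this comparison matches no rows;
# df['HealthyVolunteers'].isna().sum() would count the missing values instead.
print(df.loc[df['HealthyVolunteers'] == 'nan'].shape)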

Data dimensions before Filtering :  (1000, 17) 

['No' 'Accepts Healthy Volunteers' nan]
2
No                            855
Accepts Healthy Volunteers    129
Name: HealthyVolunteers, dtype: int64
(0, 17)


In [None]:
print(df['HealthyVolunteers'].unique())

df['HealthyVolunteers'] = df['HealthyVolunteers'].replace('No', 'no')
df['HealthyVolunteers'] = df['HealthyVolunteers'].replace('Accepts Healthy Volunteers', 'yes')
df['HealthyVolunteers'] = df['HealthyVolunteers'].replace(np.nan, 'yes_no')

print(df['HealthyVolunteers'].unique())
print(df['HealthyVolunteers'].value_counts())

['No' 'Accepts Healthy Volunteers' nan]
['no' 'yes' 'yes_no']
no        855
yes       129
yes_no     16
Name: HealthyVolunteers, dtype: int64


In [None]:
df = raw
HealthyVolunteers_Input = input("Are you a healthy volunteer? (yes/no)")
print(HealthyVolunteers_Input)

df = df.loc[df['HealthyVolunteers'].isin([HealthyVolunteers_Input, 'yes_no'])] 
print(df['HealthyVolunteers'].unique())
print(df['HealthyVolunteers'].value_counts())

Are you a healthy volunteer? (yes/no)no
no
['no' 'yes_no']
no        855
yes_no     16
Name: HealthyVolunteers, dtype: int64
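
The filter above depends on the user typing exactly `yes` or `no`. Below is a small sketch of the same rule in plain Python with input validation added. The helper names `normalize_healthy_volunteers` and `accepts_volunteer` are hypothetical, and treating any unrecognised value as `yes_no` is an assumption that extends the notebook's handling of missing values.

```python
# Mapping used in the cells above: raw API value -> short label.
HEALTHY_MAP = {'No': 'no', 'Accepts Healthy Volunteers': 'yes'}

def normalize_healthy_volunteers(value):
    """Map a raw HealthyVolunteers value to 'no', 'yes' or 'yes_no'."""
    # Missing values (None, NaN) and unknown strings fall back to 'yes_no'.
    if not isinstance(value, str):
        return 'yes_no'
    return HEALTHY_MAP.get(value, 'yes_no')

def accepts_volunteer(study_value, answer):
    """Return True if a study is compatible with the user's yes/no answer."""
    answer = answer.strip().lower()
    if answer not in ('yes', 'no'):
        raise ValueError("answer must be 'yes' or 'no'")
    return normalize_healthy_volunteers(study_value) in (answer, 'yes_no')

print(accepts_volunteer('No', 'no'))
print(accepts_volunteer('Accepts Healthy Volunteers', 'no'))
print(accepts_volunteer(float('nan'), ' Yes '))
```

On the DataFrame, the same check could be applied with `df['HealthyVolunteers'].map(lambda v: accepts_volunteer(v, HealthyVolunteers_Input))` on the raw, unreplaced column.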


### Filtering by [2] Age

In [None]:
df = raw
print('Data dimensions before Filtering : ', df.shape, '\n')
df['MinimumAge'] = df['MinimumAge'].replace(np.nan, '0 Months')
print(df['MinimumAge'].value_counts())

In [None]:
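# Convert age strings such as '18 Years' or '9 Months' to a number of months.
# The result keeps the index of the input so it can be used as a boolean mask.
def ages_to_months(ages):
  def to_months(age):
    if 'Years' in age:
      return int(age[:age.find('Years')]) * 12
    return int(age[:age.find('Months')])
  return pd.Series([to_months(age) for age in ages], index=ages.index)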

# ages = pd.Series(['18 Years', '99 Months', '7 Months', '6 Years', '0 Months'])
# ages_to_months(ages)

0    216
1     99
2      7
3     72
4      0
dtype: int64

In [None]:
Age_Input = pd.Series(input("Can we know your age: (Example: 29 Years ; 9 Months)"))
print('\n', Age_Input)

df = df[ages_to_months(df['MinimumAge']) <= ages_to_months(Age_Input).tolist()[0]]
print(df.shape)

Can we know your age: (Example: 29 Years ; 9 Months)50 Years

 0    50 Years
dtype: object
(944, 17)
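
The conversion above slices the string at the literal words `Years` and `Months`, so an input such as `1 Year` or a mistyped age would raise an unhelpful error. A hedged alternative is to parse a single age with a regular expression. The name `age_to_months` and the accepted spellings (singular or plural, any letter case) are assumptions for illustration; other units are deliberately rejected.

```python
import re

# One number followed by 'year(s)' or 'month(s)', ignoring case and extra spaces.
_AGE_RE = re.compile(r'^\s*(\d+)\s*(year|month)s?\s*$', re.IGNORECASE)

def age_to_months(age):
    """Convert a string such as '18 Years' or '9 Months' to whole months."""
    match = _AGE_RE.match(age)
    if match is None:
        raise ValueError(f'unrecognised age: {age!r}')
    number = int(match.group(1))
    unit = match.group(2).lower()
    return number * 12 if unit == 'year' else number

for text in ['18 Years', '99 Months', '7 Months', '6 Years', '0 Months']:
    print(text, '->', age_to_months(text))
```

Applied to the column, this could be written as `df['MinimumAge'].map(age_to_months)`, which also preserves the DataFrame index.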


### Filtering by [3] Gender

In [None]:
df = raw
print(df['Gender'].unique())
print(df['Gender'].value_counts())
df['Gender'] = df['Gender'].replace(np.nan, 'All')
print(df['Gender'].value_counts())

['All' nan 'Female']
All       990
Female      8
Name: Gender, dtype: int64
All       992
Female      8
Name: Gender, dtype: int64

In [None]:
Gender_Input = input("Can we know your Gender: (Example: Male ; Female ; All)")
print('\n', Gender_Input)

df = df[df['Gender'].isin([Gender_Input, 'All'])]
print(df.shape)

Can we know your Gender: (Example: Male ; Female ; All)Female

 Female
(1000, 18)


### Filtering by [4] Phase

In [None]:
df = raw
print(df['Phase'].value_counts())
df['Phase'] = df['Phase'].replace(np.nan, 'No Phase') 
df['Phase'] = df['Phase'].replace('Not Applicable', 'No Phase') 
print(df['Phase'].value_counts())

In [22]:
Phase_Input = input("Which Phase are you in: (Example: Phase 1; Phase 2; Phase 3; Phase 4; No Phase): ")
print('\n', Phase_Input)

df = df[df['Phase'] == Phase_Input]
print(df.shape)

Which Phase are you in: (Example: Phase 1; Phase 2; Phase 3; Phase 4; No Phase): No Phase

 No Phase
(507, 17)
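
The equality test above only matches studies whose `Phase` cell is exactly the requested label, so a study listed as `Phase 1|Phase 2` is not returned for either phase. One possible refinement, sketched here under the assumption that multiple phases are always separated by `|`, is to split the cell before comparing. The helper names `phase_set` and `phase_matches` are hypothetical.

```python
def phase_set(study_phase):
    """Return the set of phase labels for one Phase cell."""
    # Missing values, empty strings and 'Not Applicable' all become 'No Phase',
    # mirroring the replacements made in the cell above.
    if not isinstance(study_phase, str) or study_phase in ('', 'Not Applicable'):
        return {'No Phase'}
    return set(study_phase.split('|'))

def phase_matches(study_phase, wanted):
    """Return True if the requested phase is one of the study's phases."""
    return wanted in phase_set(study_phase)

print(phase_matches('Phase 1|Phase 2', 'Phase 2'))
print(phase_matches('Not Applicable', 'No Phase'))
print(phase_matches('Phase 2', 'Phase 1'))
```

On the DataFrame this could be used as `df[df['Phase'].map(lambda p: phase_matches(p, Phase_Input))]`.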


### Filtering by [5] LocationStatus

In [74]:
df = raw
df.shape

(1000, 17)

In [75]:
lfacility = df['LocationFacility'].astype(str).to_list()
print(lfacility[:5])
lstatus   = df['LocationStatus'].astype(str).to_list()
print(lstatus[:5])
lcity     = df['LocationCity'].astype(str).to_list()
print(lcity[:5])
lcountry  = df['LocationCountry'].astype(str).to_list()
print(lcountry[:5])

['Guangdong General Hospital', 'National Jewish Health|University of Michigan|Henry Ford|Mayo Clinic|Cleveland Clinic Foundation|Ohio State University|The Toledo Hospital|Mercy St. Vincent Medical Center|University of Toledo, Health Science Campus|Medical University of South Carolina|Tennessee Valley Veterans Admin.|Vanderbilt|Inova Fairfax Hospital', 'Tianjin Medical University Cancer Institute And Hospital', 'Research Site|Research Site|Research Site', 'Vanderbilt-Ingram Cancer Center']
['nan', 'nan', 'Recruiting', 'nan', 'nan']
['Guangzhou', 'Denver|Ann Arbor|Detroit|Rochester|Cleveland|Columbus|Toledo|Toledo|Toledo|Charleston|Nashville|Nashville|Falls Church', 'Tianjin', 'Algiers|Constantine|Oran', 'Nashville']
['China', 'United States|United States|United States|United States|United States|United States|United States|United States|United States|United States|United States|United States|United States', 'China', 'Algeria|Algeria|Algeria', 'United States']


In [76]:
allfacility = [text.split('|') for text in lfacility]
print(allfacility[:5])
allstatus   = [text.split('|') for text in lstatus]
print(allstatus[:5])
allcity     = [text.split('|') for text in lcity]
print(allcity[:5])
allcountry  = [text.split('|') for text in lcountry]
print(allcountry[:5])

[['Guangdong General Hospital'], ['National Jewish Health', 'University of Michigan', 'Henry Ford', 'Mayo Clinic', 'Cleveland Clinic Foundation', 'Ohio State University', 'The Toledo Hospital', 'Mercy St. Vincent Medical Center', 'University of Toledo, Health Science Campus', 'Medical University of South Carolina', 'Tennessee Valley Veterans Admin.', 'Vanderbilt', 'Inova Fairfax Hospital'], ['Tianjin Medical University Cancer Institute And Hospital'], ['Research Site', 'Research Site', 'Research Site'], ['Vanderbilt-Ingram Cancer Center']]
[['nan'], ['nan'], ['Recruiting'], ['nan'], ['nan']]
[['Guangzhou'], ['Denver', 'Ann Arbor', 'Detroit', 'Rochester', 'Cleveland', 'Columbus', 'Toledo', 'Toledo', 'Toledo', 'Charleston', 'Nashville', 'Nashville', 'Falls Church'], ['Tianjin'], ['Algiers', 'Constantine', 'Oran'], ['Nashville']]
[['China'], ['United States', 'United States', 'United States', 'United States', 'United States', 'United States', 'United States', 'United States', 'United Stat

In [77]:
# Build one boolean mask per study: True for each location that is recruiting
allmasks = [[status == 'Recruiting' for status in onelist] for onelist in allstatus]

allmasks[:5]

[[False], [False], [True], [False], [False]]

In [78]:
filtered_status = []
for idx,x in enumerate(allstatus):
  filtered_status.append(list(compress(allstatus[idx], allmasks[idx])))
filtered_status[:6]

[[], [], ['Recruiting'], [], [], []]

In [79]:
filtered_country = []
for idx,x in enumerate(allstatus):
  filtered_country.append(list(compress(allcountry[idx], allmasks[idx])))
filtered_country[:6]

[[], [], ['China'], [], [], []]

In [80]:
filtered_facility = []
for idx,x in enumerate(allstatus):
  filtered_facility.append(list(compress(allfacility[idx], allmasks[idx])))
filtered_facility[:6]

[[],
 [],
 ['Tianjin Medical University Cancer Institute And Hospital'],
 [],
 [],
 []]

In [81]:
filtered_city = []
for idx,x in enumerate(allstatus):
  filtered_city.append(list(compress(allcity[idx], allmasks[idx])))
filtered_city[:6]

[[], [], ['Tianjin'], [], [], []]
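
The four loops above apply the same mask to four parallel lists. As a compact sketch of that technique, the four pipe-separated cells of one study can be split, zipped into per-location rows, and filtered in a single pass. The function name `keep_recruiting` and the three-site example are hypothetical; like `compress`, `zip` silently stops at the shortest list, so cells with mismatched lengths would be truncated.

```python
def keep_recruiting(facility, status, city, country, wanted='Recruiting'):
    """Keep only the locations of one study whose status equals `wanted`.

    Each argument is one pipe-separated cell; the result is a tuple of four
    lists in the order (facilities, statuses, cities, countries).
    """
    # str() turns a missing value (NaN) into 'nan', as astype(str) does above.
    columns = [str(cell).split('|') for cell in (facility, status, city, country)]
    kept = [row for row in zip(*columns) if row[1] == wanted]
    if not kept:
        return [], [], [], []
    return tuple(list(column) for column in zip(*kept))

# Hypothetical study with three sites, two of them recruiting.
print(keep_recruiting('Site A|Site B|Site C',
                      'Recruiting|Completed|Recruiting',
                      'Lyon|Paris|Nice',
                      'France|France|France'))

# A study with a missing LocationStatus keeps no locations.
print(keep_recruiting('Guangdong General Hospital', float('nan'),
                      'Guangzhou', 'China'))
```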

In [None]:
df['LocationFacility'] = filtered_facility
df['LocationStatus']   = filtered_status
df['LocationCity']     = filtered_city
df['LocationCountry']  = filtered_country
df.head()

In [99]:
tmp = df[df['LocationFacility'].map(len) > 0]
print(tmp.shape)
tmp

(235, 17)


Unnamed: 0,Rank,NCTId,OrgFullName,OfficialTitle,OverallStatus,Phase,DetailedDescription,Condition,EligibilityCriteria,HealthyVolunteers,Gender,MinimumAge,StudyPopulation,LocationFacility,LocationCity,LocationCountry,LocationStatus
2,3,NCT03992833,Tianjin Medical University Cancer Institute an...,Methods of Computed Tomography Screening and M...,Recruiting,Not Applicable,"In this population-based study, participants w...",Lung Neoplasms|Computed Tomography|Mass Screen...,Inclusion Criteria:||Aged 40-74 years;|Residen...,Accepts Healthy Volunteers,All,40 Years,,[Tianjin Medical University Cancer Institute A...,[Tianjin],[China],[Recruiting]
6,7,NCT04498052,University of Utah,Evaluation of a Scalable Decision Support and ...,Recruiting,Not Applicable,The purpose of this project is to increase app...,Early Detection of Cancer|Lung Neoplasms,Inclusion Criteria:||receives care at Universi...,No,All,55 Years,,[University of Utah Health],[Salt Lake City],[United States],[Recruiting]
27,28,NCT03356808,Shenzhen Geno-Immune Medical Institute,Multicenter Trial of Cancer Antigen-specific T...,Unknown status,Phase 1|Phase 2,Lung cancer is a malignancy characterized by u...,Lung Cancer,"Inclusion Criteria:||Patients with stage III, ...",No,All,18 Years,,[Jinshazhou Hospital of Guangzhou University o...,"[Guangzhou, Shenzhen, Kunming]","[China, China, China]","[Recruiting, Recruiting, Recruiting]"
29,30,NCT04315753,Istituto Clinico Humanitas,Circulating and Imaging Biomarkers to Improve ...,Recruiting,,,Lung Cancer,Inclusion Criteria:||Age ≥ 55 years old and ex...,Accepts Healthy Volunteers,All,55 Years,The study population should have the following...,[Istituto Clinico Humanitas],[Rozzano],[Italy],[Recruiting]
31,32,NCT02898441,Shanghai Chest Hospital,Community-based Early Stage Lung Cancer Screen...,Unknown status,Not Applicable,,Lung Cancer,Inclusion Criteria:||Eligible participants wer...,Accepts Healthy Volunteers,All,45 Years,,[Shanghai Chest hospital],[Shanghai],[China],[Recruiting]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
984,985,NCT00546130,University of Toyama,Feasibility Study for Multicenter Randomized C...,Unknown status,Phase 2,To examine whether the following protocol and ...,Small Cell Lung Cancer,Inclusion Criteria:||Patients with histologica...,No,All,20 Years,,"[Toho University Sakura Medical Center, Hokkai...","[Sakura, Sapporo, Kanazawa, Uchinada, Ikoma, K...","[Japan, Japan, Japan, Japan, Japan, Japan, Jap...","[Recruiting, Recruiting, Recruiting, Recruitin..."
988,989,NCT00514293,National Cancer Institute (NCI),Phase II Trial of Bexarotene (Targretin) Capsu...,Unknown status,Phase 2,OBJECTIVES:||Primary||Evaluate the efficacy of...,Lung Cancer,DISEASE CHARACTERISTICS:||Histologically or cy...,No,All,18 Years,,[R. Nandan M.D. Incorporated],[Lakewood],[United States],[Recruiting]
992,993,NCT02974933,Hubei Cancer Hospital,Apatinib Mesylate Combined With Pemetrexed in ...,Unknown status,Phase 2,It is a one-arm study. The progression-free su...,Nonsmall Cell Lung Cancer,Inclusion Criteria:||Aged from 18 years to 70y...,No,All,18 Years,,[Ou wuling],[Wuhan],[China],[Recruiting]
994,995,NCT04775095,Intergroupe Francophone de Cancerologie Thorac...,BRAF V600-mutated Lung Carcinoma Treated With ...,Recruiting,,The braf gene (V Raf murine sarcoma viral onco...,Non Small Cell Lung Cancer|BRAF V600 Mutation,Inclusion Criteria:||Patients with histologica...,No,All,18 Years,All patients with histologically or cytologica...,"[Créteil - CHI, Lyon - CRLCC]","[Créteil, Lyon]","[France, France]","[Recruiting, Recruiting]"
