This notebook contains code to download the title, description, and inclusion/exclusion criteria for oncology trials using the ClinicalTrials.gov API.

Author - Akshay Chougule<br>
Created on - 30th May 2020<br>
<br>

In [1]:
import pandas as pd
import numpy as np
import requests
import datetime
import json
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Show long text fields (summaries, eligibility criteria) without truncation
pd.set_option('display.max_colwidth', 4000)

### Section 0. Data Collection and Preprocessing
#### Skip to Section 1 if you are interested in the findings.

Since a single API query returns at most 1,000 trials and the total number of oncology trials exceeds 1,000, we have to break the query into several parts. The first part covers the first thousand results (`&min_rnk=1&max_rnk=1000`), the second covers the next thousand (`&min_rnk=1001&max_rnk=2000`), and so on.
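The rank windows described above can be generated rather than typed by hand. The following is a minimal sketch; `rank_windows` and its parameters are hypothetical names introduced for illustration, not part of the notebook or the API.

```python
def rank_windows(total, page_size=1000):
    """Yield (min_rnk, max_rnk) pairs that cover ranks 1..total."""
    for start in range(1, total + 1, page_size):
        # The last window is clipped so it never runs past `total`
        yield start, min(start + page_size - 1, total)


# Example with a made-up total of 2500 records
print(list(rank_windows(2500)))
# [(1, 1000), (1001, 2000), (2001, 2500)]
```

Each pair maps directly onto the `min_rnk` and `max_rnk` query parameters used in the cells below.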

In [3]:
CT_GOV_URL = 'https://clinicaltrials.gov/api/query/study_fields?expr=cancer&min_rnk=1&max_rnk=1000&fmt=json'

In [4]:
rct_fields = [
    'NCTId',
    'BriefTitle',
    'BriefSummary',
    'DetailedDescription',
    'EligibilityCriteria',
]

In [5]:
query_url = f'{CT_GOV_URL}&fields={",".join(rct_fields)}'
print(query_url)

https://clinicaltrials.gov/api/query/study_fields?expr=cancer&min_rnk=1&max_rnk=1000&fmt=json&fields=NCTId,BriefTitle,BriefSummary,DetailedDescription,EligibilityCriteria
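As an alternative to concatenating the URL by hand, the same string can be assembled with the standard library, which also escapes search terms that contain spaces. This is only a sketch: `BASE_URL` and `build_query_url` are names assumed here for illustration, while the parameter names are the ones already used in the cells above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://clinicaltrials.gov/api/query/study_fields'


def build_query_url(expr, fields, min_rnk=1, max_rnk=1000):
    """Build a study_fields query URL for one window of ranks."""
    params = {
        'expr': expr,
        'min_rnk': min_rnk,
        'max_rnk': max_rnk,
        'fmt': 'json',
        'fields': ','.join(fields),
    }
    # safe=',' keeps the commas of the field list unescaped
    return f'{BASE_URL}?{urlencode(params, safe=",")}'


fields = ['NCTId', 'BriefTitle', 'BriefSummary',
          'DetailedDescription', 'EligibilityCriteria']
print(build_query_url('cancer', fields))
```

With the arguments shown, the result should be identical to the URL printed above.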


In [6]:
r = requests.get(query_url)
r.status_code

200

In [7]:
j = json.loads(r.content)
df1 = pd.DataFrame(j['StudyFieldsResponse']['StudyFields'])
df1.shape

(1000, 6)
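The six columns are `Rank` plus the five requested fields. The records sit under `StudyFieldsResponse` → `StudyFields`, and each requested field arrives wrapped in a list. A small defensive accessor can make a malformed response fail softly instead of raising `KeyError`. This is a sketch only: `extract_study_fields` is a hypothetical helper, and the sample payload is hand-written from the first record shown later in this notebook, not a captured API response.

```python
def extract_study_fields(payload):
    """Return the list of study records, or [] if the expected keys are missing."""
    if not isinstance(payload, dict):
        return []
    response = payload.get('StudyFieldsResponse') or {}
    return response.get('StudyFields') or []


# Hand-written sample in the shape the notebook relies on
sample = {
    'StudyFieldsResponse': {
        'StudyFields': [
            {'Rank': 1,
             'NCTId': ['NCT04318756'],
             'BriefTitle': ['The Italian Version of Cancer Worry Scale']},
        ]
    }
}

records = extract_study_fields(sample)
print(len(records), records[0]['NCTId'])
```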

In [8]:
# The first 1000 records are already in df1.
# Fetch the next 75 pages of 1000 records each (76 x 1000 = 76,000 in total).
min_rank = 1
max_rank = 1000
for i in range(75):
    min_rank = min_rank + 1000
    max_rank = max_rank + 1000
    CT_GOV_URL = f'https://clinicaltrials.gov/api/query/study_fields?expr=cancer&min_rnk={min_rank}&max_rnk={max_rank}&fmt=json'
    query_url = f'{CT_GOV_URL}&fields={",".join(rct_fields)}'
    r = requests.get(query_url)
    print(f'Records from {min_rank} to {max_rank} with status code {r.status_code}')
    j = json.loads(r.content)
    df2 = pd.DataFrame(j['StudyFieldsResponse']['StudyFields'])
    df1 = pd.concat([df1, df2])
    print(df1.shape)

Records from 1001 to 2000 with status code 200
(2000, 6)
Records from 2001 to 3000 with status code 200
(3000, 6)
Records from 3001 to 4000 with status code 200
(4000, 6)
Records from 4001 to 5000 with status code 200
(5000, 6)
Records from 5001 to 6000 with status code 200
(6000, 6)
Records from 6001 to 7000 with status code 200
(7000, 6)
Records from 7001 to 8000 with status code 200
(8000, 6)
Records from 8001 to 9000 with status code 200
(9000, 6)
Records from 9001 to 10000 with status code 200
(10000, 6)
Records from 10001 to 11000 with status code 200
(11000, 6)
Records from 11001 to 12000 with status code 200
(12000, 6)
Records from 12001 to 13000 with status code 200
(13000, 6)
Records from 13001 to 14000 with status code 200
(14000, 6)
Records from 14001 to 15000 with status code 200
(15000, 6)
Records from 15001 to 16000 with status code 200
(16000, 6)
Records from 16001 to 17000 with status code 200
(17000, 6)
Records from 17001 to 18000 with status code 200
(18000, 6)
Recor
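The loop above hard-codes 75 iterations. A variant that stops by itself is sketched below. It assumes, without confirmation from the API documentation, that the final page comes back with fewer than `page_size` records. `fetch_all`, `fetch_page`, and `fake_fetch` are hypothetical names; the fetcher is passed in so the paging logic can be tried without network access.

```python
def fetch_all(fetch_page, page_size=1000, max_pages=100):
    """Collect records page by page until a short (or empty) page is returned.

    fetch_page(min_rnk, max_rnk) must return a list of records.
    max_pages is a safety cap against an endless loop.
    """
    records = []
    for page in range(max_pages):
        min_rnk = page * page_size + 1
        max_rnk = min_rnk + page_size - 1
        batch = fetch_page(min_rnk, max_rnk)
        records.extend(batch)
        if len(batch) < page_size:
            break  # assumed signal that this was the last page
    return records


def fake_fetch(min_rnk, max_rnk, total=2500):
    """Stand-in for the HTTP call: pretends the query matches `total` trials."""
    return [{'Rank': r} for r in range(min_rnk, min(max_rnk, total) + 1)]


all_records = fetch_all(fake_fetch)
print(len(all_records), all_records[-1])
```

In real use, `fetch_page` would wrap `requests.get` and the JSON parsing from the cells above.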

In [19]:
df = df1
df.shape

(76000, 6)

#### Let's use the 76k oncology trials data as our training corpus

In [20]:
df.head()

Unnamed: 0,Rank,NCTId,BriefTitle,BriefSummary,DetailedDescription,EligibilityCriteria
0,1,[NCT04318756],[The Italian Version of Cancer Worry Scale],"[An Italian language version of the Cancer Worry Scale is not available yet.\n\nThe aim of this study is to develop and validate the Italian version of Cancer Worry Scale through subjects at high risk of pancreatic cancer for familiarity/genetic predisposition, or suffering from premalignant cystic lesions.]","[The Fear of Cancer remains a significant problem for subjects enrolled surveillance programs. This emotional condition can influence the patients' wellbeing and their adherence to treatments. No results coming from the application of Cancer Worry Scale on high-risk of Pancreatic cancer individuals have been provided yet.\n\nThe fear is an emotional reaction that can be the result of interpretation and cognitions of perceived internal cues and/or external cues. To objectify this emotion, a cancer worry scale, has been developed to investigate the fear in female breast cancer survivors. The proposed scale assesses the psychological distress caused by fear of cancer.\n\nThe detection of high level of fear can prevent problematic behaviours, including anxious preoccupation, avoidance, and excessive checking, and moreover it can help patients to reduce difficulties in performing the individual's daily and social activities. The scale could use to evaluate the psychological burden produced by the enrolment in a surveillance program due to a certain predisposition to Pancreatic cancer. One step forward will be to manage those individuals with high-level of Fear of cancer, providing them a proper psychological support.\n\nAt the General and Pancreatic Surgery Unit of the Pancreas Institute, some experimental self-made psychological support programs that investigate different psychologic distress through validated instruments, like anxiety, depression, perceived stress and global assessment or quality of life have been built up. 
It must be noted that the scale is not available specifically for Pancreatic cancer (originally it was created for breast cancer) , and for the Italian language, thus it has never been applied in an italian surveillance/follow-up program dealing with Pancreatic cancer. Nowadays, there is only a well-structured paper assessing psychological factors associated with cancer worries in high risk participants in a pancreatic cancer surveillance program. Differently than what has been already reported by Konings et al, the cohorts of patients that will be considered are heterogeneous, reflecting more accurately the real-life scenario of the subjects to whom the scale is administered to. In addition, this may help to identify those individuals that may benefit from a psychological support, in order to prevent a withdrawal from the surveillance program.\n\nHowever, a cut-off has not been provided yet. With next step study we'll aim at determining a cut-off for the detection clinically relevant worry for cancer.]","[Inclusion Criteria:\n\nSubject will be recruited from the current outpatients' clinic activity of the General and Pancreatic Surgery Unit, Pancreas Institute, Verona University Hospital. Subjects must be able to read and write in Italian. After obtained informed consent we will ask them to participate to preliminary pilot phase.\n\nThe interview of the pilot group will be audio recorded to allow to investigate and track detail that will highlight the comprehension of scale and participants' suggestion.\n\nPatients affected by cynic neoplasm (group A) and high risk subjects for familiarity/genetic predisposition (group B) are currently enrolled at General and Pancreatic Surgery Unit of the Pancreas Institute of the University of Verona.\n\nGroup A Patients with premalignant lesions (such as intraductal papillary mucinous neoplasms) that are followed-up at our Institution, to detect any clinic-radiological progression. 
Subjects older than 18 will be enrolled.\n\nGroup B\n\nThe enrolment criteria were the following :\n\nHaving at least 45 years of age or 10 years younger than the age of the youngest relative with pancreatic cancer (only for subject with familiar history of pancreatic cancer)\nHaving at least 40 years of age or 5 years younger than the age of the youngest relative with pancreatic cancer (only for subject with familial pancreatitis and subject affected by Lynch syndrome with at least one relative first- or second-degree affected by pancreatic cancer and subject having a known genetic mutation with at least a first-degree relative or a second-degree relative affected by pancreatic cancer\nHaving at least 30 years of age for subject with Familial Multiple Melanoma Syndrome\nHaving at least 30 years of age for patients affected by Peutz-Jeghers syndrome.\n\nThe interview of pilot group will be record to allow us to investigate and track detail that will highlight the comprehension of cancer worry scale and participants' suggestion.\n\nExclusion Criteria:\n\npatients who not meet eligibility criteria]"
1,2,[NCT03581708],[Venous Thromboembolism in Advanced Lung Cancer],"[This is a prospective observatory clinical study, aiming to establish and validate venous thromboembolism risk model in Chinese advanced non-small cell lung cancer.]",[VTE has high incidence in lung cancer and increases the mortality. Appropriate preventive measures contribute to 50% increase of incidence. The investigators are to investigate the VTE in advanced non-small cell lung cancer and delineate the risk factors to establish a VTE risk model system helping clinicians to differentiate VTE high risk population and apply early prevention in order to reduce the incidence of VTE.],"[Inclusion Criteria:\n\nAge ≥ 18 years at the time of screening.\nEastern Cooperative Oncology Group performance status of ≤ 2.\nWritten informed consent obtained from the patient.\nHistologically and cytologically documented Stage 3B-4 lung cancer (according to Version 8 of the International Association for the Study of Lung Cancer Staging system).\nPatients with stage 1 to 3, who undergo radical therapy with disease free survival (DFS) >12 months.\nWillingness and ability to comply with scheduled visits and other study procedures.\n\nExclusion Criteria:\n\nHistory of another primary malignancy except for malignancy treated with curative intent with known active disease ≥ 5 years before date of the informed consent.\nWithout signed informed consent.\nUnwillingness or inability to comply with scheduled visits or other study procedures.\nPreviously diagnosed with VTE before signing informed consent.]"
2,3,[NCT02053662],[Biomarker Identification for Bladder Cancer Patients],"[To develop a simple blood and urine test that we would perform before patients start their treatment to predict the risk that their bladder cancer might come back. To develop this test the investigators plan to analyze blood, urine and cancer tissue from bladder cancer patients and follow them closely during and after treatment. This will include looking for changes in proteins and genes that might play a role in bladder cancer biology. The investigators will then compare the information obtained from the studies of blood, urine and cancer tissue between patients that are cured and those whose cancer comes back. The knowledge about these differences between patients can then potentially be used to develop a blood or urine test to tell us who has a high risk for having bladder cancer come back.]",[],"[Inclusion Criteria:\n\nAdult patients ≥18 years old.\nPatients suspected, clinically diagnosed, or histologically diagnosed bladder cancer.\nPatients undergoing cystoscopy without cancer suspicion.\nAbility to give an informed consent.\n\nExclusion Criteria:\n\nPatients receiving concurrent therapy for a second malignancy.\n< 18 years old.\nInability to give an informed consent.]"
3,4,[NCT00897650],[Protein and RNA Expression Patterns in Predicting Response to Treatment in Patients With Lung Cancer],"[RATIONALE: Studying samples of tumor tissue and blood in the laboratory from patients with cancer may help doctors learn more about changes that occur in genetic material (DNA and RNA) and may also identify protein expression patterns related to cancer. It may also help doctors predict how patients will respond to treatment.\n\nPURPOSE: This research study evaluates changes in DNA, RNA, and proteins with the goal of predicting response to treatment in patients with lung cancer.]","[OBJECTIVES:\n\nTo determine protein and/or RNA expression patterns capable of predicting tumor response to therapy in tumor tissue samples from patients with lung cancer or suspected of having lung cancer.\nTo characterize the genes and proteins found to be predictive of response in order to help elucidate the molecular biology underlying cancer chemosensitivity.\nTo evaluate DNA mutations found within the lung cancer sample which may be predictive of response or resistance to certain therapeutic agents.\n\nOUTLINE: Patients undergo collection of tumor tissue by percutaneous fine needle aspiration, core biopsy, thoracentesis, or during any medically indicated procedure involving removal of lung cancer tissue. Tissue samples are analyzed by a variety of techniques, including DNA sequencing, RNA sequencing and expression levels, protein assessment [by immunohistochemistry, western blot, Matrix-assisted laser desorption/ionization time of flight mass spectrometry (MALDI-MS]). The goal of these studies is to identify of gene mutations, gene expression levels, and proteins predictive of treatment response. 
Blood samples are also collected to obtain normal DNA for analysis.\n\nAfter completion of study, patients will be followed until their death.]",[Inclusion criteria\n\nDiagnosis of suspected lung cancer or lung cancer\n\nExclusion criteria\n\nInability to undergo therapy]
4,5,[NCT02890667],[Evaluation of Algorithms to Identify Incident Cancer Cases by Using French Health Administrative Databases],"[""This study is a population based observational study conducted on the French administrative databases to estimate cancer incidence in 2012 by using the ""ECHANTILLON GENERALISTE DES BENEFICIAIRES"" (EGB, a 1/97th dynamic random sample of the SNIIRAM).\n\nThe EGB database contains anonymous and prospectively recorded data about all beneficiaries' medical reimbursements. Many algorithm definitions are defined to estimate the incident rate of cancer in 2012 from the EGB database. The incidence rates obtained by each algorithm definition are compared to national incidence rates by indirect age and sex standardization. National incidence rates are obtained from ""FRANCE CANCER INCIDENCE ET MORTALITE"" (FRANCIM): the French network of cancer registries.""]","[""This study is a population based observational study conducted on the French administrative databases to estimate cancer incidence in 2012 by using the ""ECHANTILLON GENERALISTE DES BENEFICIAIRES"" (EGB, a 1/97th dynamic random sample of the SNIIRAM).\n\nThe EGB database contains anonymous and prospectively recorded data about all beneficiaries' medical reimbursements including age, gender, long-term chronic disease (LTD), date of death, all out-hospital health-spending reimbursements and all patients' hospitalizations. Many algorithm definitions are defined to estimate the incident rate of cancer in 2012 from the EGB database and are applied separately for men and women. These algorithms use information from either out-hospital care only (LTD status, anticancer specific drugs, outpatient radiotherapy sessions), or inpatient stays only (primary or related diagnosis of cancer, cancer-related procedures) or combine both information. The incidence rates obtained by each algorithm definition are compared to national incidence rates by indirect age and sex standardization. 
National incidence rates are obtained from FRANCE CANCER INCIDENCE ET MORTALITE (FRANCIM): the French network of cancer registries that has collected cancer cases since 1975 from 21 French registries (general or specific) covering 17 of the 95 French metropolitan departments. The most recent estimation of the FRANCIM network was published for 2012 and included all cancer locations (C00-C97) excluding non-melanoma skin cancer (C44).\n\nFollow up All patients included are followed from January 1, 2012 until the occurrence of the first of death, cancer occurrence, moving out of the general insurance scheme or January 1, 2013.\n\nTo allow for comparison with data from cancer registries, only malignant neoplasms are considered: ICD-10 codes (C00-C97), excluding non-melanoma skin cancer (C44).\n\nStatistical analysis\n\nThe number of incident cancer cases obtained with each algorithm is compared to the expected number of cancer cases calculated by using national estimation for the same age and sex stratum. The standardized incidence ratio (SIR) with 95% confidence intervals is calculated by indirect age and sex standardization. Age- and sex-specific incident rates are compared to the incident rates in 2012 estimated by FRANCIM.\n\nThe investigators also apply the most accurate algorithm separately for men and women to the 3 most common cancers in the corresponding gender by restricting the involved cancer codes, procedures and drugs to those related to the cancer of interest. The investigators also restrict our population to the most-studied age groups in cancer etiological studies (40 to 75 years). 
All the analyses are performed using SAS Enterprise Guide, version 4.3.""]","[Inclusion Criteria:\n\nBeneficiaries present in the EGB on January 1, 2012.\nBeneficiaries who are < 90 years old on January 1, 2012.\nBeneficiaries residing in metropolitan France.\nBeneficiaries affiliated with the general insurance scheme since January 1, 2011 or before.\n\nExclusion Criteria:\n\nNo prevalent cancer before January 1, 2012; ( hospital discharge with a primary, related or associated diagnosis of cancer (C00-C97), or personal history of cancer (Z85); LTD status related to cancer; reimbursement for any cancer-specific drug; or external radiotherapy session).]"


In [21]:
temp = df['NCTId'].str[0]
temp.head()

0    NCT04318756
1    NCT03581708
2    NCT02053662
3    NCT00897650
4    NCT02890667
Name: NCTId, dtype: object

In [22]:
# Every column except Rank holds one-element lists; keep only the first element
for col in df.columns[1:]:
    print(col)
    df[col] = df[col].str[0]

NCTId
BriefTitle
BriefSummary
DetailedDescription
EligibilityCriteria


In [23]:
df.head()

Unnamed: 0,Rank,NCTId,BriefTitle,BriefSummary,DetailedDescription,EligibilityCriteria
0,1,NCT04318756,The Italian Version of Cancer Worry Scale,"An Italian language version of the Cancer Worry Scale is not available yet.\n\nThe aim of this study is to develop and validate the Italian version of Cancer Worry Scale through subjects at high risk of pancreatic cancer for familiarity/genetic predisposition, or suffering from premalignant cystic lesions.","The Fear of Cancer remains a significant problem for subjects enrolled surveillance programs. This emotional condition can influence the patients' wellbeing and their adherence to treatments. No results coming from the application of Cancer Worry Scale on high-risk of Pancreatic cancer individuals have been provided yet.\n\nThe fear is an emotional reaction that can be the result of interpretation and cognitions of perceived internal cues and/or external cues. To objectify this emotion, a cancer worry scale, has been developed to investigate the fear in female breast cancer survivors. The proposed scale assesses the psychological distress caused by fear of cancer.\n\nThe detection of high level of fear can prevent problematic behaviours, including anxious preoccupation, avoidance, and excessive checking, and moreover it can help patients to reduce difficulties in performing the individual's daily and social activities. The scale could use to evaluate the psychological burden produced by the enrolment in a surveillance program due to a certain predisposition to Pancreatic cancer. One step forward will be to manage those individuals with high-level of Fear of cancer, providing them a proper psychological support.\n\nAt the General and Pancreatic Surgery Unit of the Pancreas Institute, some experimental self-made psychological support programs that investigate different psychologic distress through validated instruments, like anxiety, depression, perceived stress and global assessment or quality of life have been built up. 
It must be noted that the scale is not available specifically for Pancreatic cancer (originally it was created for breast cancer) , and for the Italian language, thus it has never been applied in an italian surveillance/follow-up program dealing with Pancreatic cancer. Nowadays, there is only a well-structured paper assessing psychological factors associated with cancer worries in high risk participants in a pancreatic cancer surveillance program. Differently than what has been already reported by Konings et al, the cohorts of patients that will be considered are heterogeneous, reflecting more accurately the real-life scenario of the subjects to whom the scale is administered to. In addition, this may help to identify those individuals that may benefit from a psychological support, in order to prevent a withdrawal from the surveillance program.\n\nHowever, a cut-off has not been provided yet. With next step study we'll aim at determining a cut-off for the detection clinically relevant worry for cancer.","Inclusion Criteria:\n\nSubject will be recruited from the current outpatients' clinic activity of the General and Pancreatic Surgery Unit, Pancreas Institute, Verona University Hospital. Subjects must be able to read and write in Italian. After obtained informed consent we will ask them to participate to preliminary pilot phase.\n\nThe interview of the pilot group will be audio recorded to allow to investigate and track detail that will highlight the comprehension of scale and participants' suggestion.\n\nPatients affected by cynic neoplasm (group A) and high risk subjects for familiarity/genetic predisposition (group B) are currently enrolled at General and Pancreatic Surgery Unit of the Pancreas Institute of the University of Verona.\n\nGroup A Patients with premalignant lesions (such as intraductal papillary mucinous neoplasms) that are followed-up at our Institution, to detect any clinic-radiological progression. 
Subjects older than 18 will be enrolled.\n\nGroup B\n\nThe enrolment criteria were the following :\n\nHaving at least 45 years of age or 10 years younger than the age of the youngest relative with pancreatic cancer (only for subject with familiar history of pancreatic cancer)\nHaving at least 40 years of age or 5 years younger than the age of the youngest relative with pancreatic cancer (only for subject with familial pancreatitis and subject affected by Lynch syndrome with at least one relative first- or second-degree affected by pancreatic cancer and subject having a known genetic mutation with at least a first-degree relative or a second-degree relative affected by pancreatic cancer\nHaving at least 30 years of age for subject with Familial Multiple Melanoma Syndrome\nHaving at least 30 years of age for patients affected by Peutz-Jeghers syndrome.\n\nThe interview of pilot group will be record to allow us to investigate and track detail that will highlight the comprehension of cancer worry scale and participants' suggestion.\n\nExclusion Criteria:\n\npatients who not meet eligibility criteria"
1,2,NCT03581708,Venous Thromboembolism in Advanced Lung Cancer,"This is a prospective observatory clinical study, aiming to establish and validate venous thromboembolism risk model in Chinese advanced non-small cell lung cancer.",VTE has high incidence in lung cancer and increases the mortality. Appropriate preventive measures contribute to 50% increase of incidence. The investigators are to investigate the VTE in advanced non-small cell lung cancer and delineate the risk factors to establish a VTE risk model system helping clinicians to differentiate VTE high risk population and apply early prevention in order to reduce the incidence of VTE.,"Inclusion Criteria:\n\nAge ≥ 18 years at the time of screening.\nEastern Cooperative Oncology Group performance status of ≤ 2.\nWritten informed consent obtained from the patient.\nHistologically and cytologically documented Stage 3B-4 lung cancer (according to Version 8 of the International Association for the Study of Lung Cancer Staging system).\nPatients with stage 1 to 3, who undergo radical therapy with disease free survival (DFS) >12 months.\nWillingness and ability to comply with scheduled visits and other study procedures.\n\nExclusion Criteria:\n\nHistory of another primary malignancy except for malignancy treated with curative intent with known active disease ≥ 5 years before date of the informed consent.\nWithout signed informed consent.\nUnwillingness or inability to comply with scheduled visits or other study procedures.\nPreviously diagnosed with VTE before signing informed consent."
2,3,NCT02053662,Biomarker Identification for Bladder Cancer Patients,"To develop a simple blood and urine test that we would perform before patients start their treatment to predict the risk that their bladder cancer might come back. To develop this test the investigators plan to analyze blood, urine and cancer tissue from bladder cancer patients and follow them closely during and after treatment. This will include looking for changes in proteins and genes that might play a role in bladder cancer biology. The investigators will then compare the information obtained from the studies of blood, urine and cancer tissue between patients that are cured and those whose cancer comes back. The knowledge about these differences between patients can then potentially be used to develop a blood or urine test to tell us who has a high risk for having bladder cancer come back.",,"Inclusion Criteria:\n\nAdult patients ≥18 years old.\nPatients suspected, clinically diagnosed, or histologically diagnosed bladder cancer.\nPatients undergoing cystoscopy without cancer suspicion.\nAbility to give an informed consent.\n\nExclusion Criteria:\n\nPatients receiving concurrent therapy for a second malignancy.\n< 18 years old.\nInability to give an informed consent."
3,4,NCT00897650,Protein and RNA Expression Patterns in Predicting Response to Treatment in Patients With Lung Cancer,"RATIONALE: Studying samples of tumor tissue and blood in the laboratory from patients with cancer may help doctors learn more about changes that occur in genetic material (DNA and RNA) and may also identify protein expression patterns related to cancer. It may also help doctors predict how patients will respond to treatment.\n\nPURPOSE: This research study evaluates changes in DNA, RNA, and proteins with the goal of predicting response to treatment in patients with lung cancer.","OBJECTIVES:\n\nTo determine protein and/or RNA expression patterns capable of predicting tumor response to therapy in tumor tissue samples from patients with lung cancer or suspected of having lung cancer.\nTo characterize the genes and proteins found to be predictive of response in order to help elucidate the molecular biology underlying cancer chemosensitivity.\nTo evaluate DNA mutations found within the lung cancer sample which may be predictive of response or resistance to certain therapeutic agents.\n\nOUTLINE: Patients undergo collection of tumor tissue by percutaneous fine needle aspiration, core biopsy, thoracentesis, or during any medically indicated procedure involving removal of lung cancer tissue. Tissue samples are analyzed by a variety of techniques, including DNA sequencing, RNA sequencing and expression levels, protein assessment [by immunohistochemistry, western blot, Matrix-assisted laser desorption/ionization time of flight mass spectrometry (MALDI-MS]). The goal of these studies is to identify of gene mutations, gene expression levels, and proteins predictive of treatment response. Blood samples are also collected to obtain normal DNA for analysis.\n\nAfter completion of study, patients will be followed until their death.",Inclusion criteria\n\nDiagnosis of suspected lung cancer or lung cancer\n\nExclusion criteria\n\nInability to undergo therapy
4,5,NCT02890667,Evaluation of Algorithms to Identify Incident Cancer Cases by Using French Health Administrative Databases,"""This study is a population based observational study conducted on the French administrative databases to estimate cancer incidence in 2012 by using the ""ECHANTILLON GENERALISTE DES BENEFICIAIRES"" (EGB, a 1/97th dynamic random sample of the SNIIRAM).\n\nThe EGB database contains anonymous and prospectively recorded data about all beneficiaries' medical reimbursements. Many algorithm definitions are defined to estimate the incident rate of cancer in 2012 from the EGB database. The incidence rates obtained by each algorithm definition are compared to national incidence rates by indirect age and sex standardization. National incidence rates are obtained from ""FRANCE CANCER INCIDENCE ET MORTALITE"" (FRANCIM): the French network of cancer registries.""","""This study is a population based observational study conducted on the French administrative databases to estimate cancer incidence in 2012 by using the ""ECHANTILLON GENERALISTE DES BENEFICIAIRES"" (EGB, a 1/97th dynamic random sample of the SNIIRAM).\n\nThe EGB database contains anonymous and prospectively recorded data about all beneficiaries' medical reimbursements including age, gender, long-term chronic disease (LTD), date of death, all out-hospital health-spending reimbursements and all patients' hospitalizations. Many algorithm definitions are defined to estimate the incident rate of cancer in 2012 from the EGB database and are applied separately for men and women. These algorithms use information from either out-hospital care only (LTD status, anticancer specific drugs, outpatient radiotherapy sessions), or inpatient stays only (primary or related diagnosis of cancer, cancer-related procedures) or combine both information. The incidence rates obtained by each algorithm definition are compared to national incidence rates by indirect age and sex standardization. 
National incidence rates are obtained from FRANCE CANCER INCIDENCE ET MORTALITE (FRANCIM): the French network of cancer registries that has collected cancer cases since 1975 from 21 French registries (general or specific) covering 17 of the 95 French metropolitan departments. The most recent estimation of the FRANCIM network was published for 2012 and included all cancer locations (C00-C97) excluding non-melanoma skin cancer (C44).\n\nFollow up All patients included are followed from January 1, 2012 until the occurrence of the first of death, cancer occurrence, moving out of the general insurance scheme or January 1, 2013.\n\nTo allow for comparison with data from cancer registries, only malignant neoplasms are considered: ICD-10 codes (C00-C97), excluding non-melanoma skin cancer (C44).\n\nStatistical analysis\n\nThe number of incident cancer cases obtained with each algorithm is compared to the expected number of cancer cases calculated by using national estimation for the same age and sex stratum. The standardized incidence ratio (SIR) with 95% confidence intervals is calculated by indirect age and sex standardization. Age- and sex-specific incident rates are compared to the incident rates in 2012 estimated by FRANCIM.\n\nThe investigators also apply the most accurate algorithm separately for men and women to the 3 most common cancers in the corresponding gender by restricting the involved cancer codes, procedures and drugs to those related to the cancer of interest. The investigators also restrict our population to the most-studied age groups in cancer etiological studies (40 to 75 years). 
All the analyses are performed using SAS Enterprise Guide, version 4.3.""","Inclusion Criteria:\n\nBeneficiaries present in the EGB on January 1, 2012.\nBeneficiaries who are < 90 years old on January 1, 2012.\nBeneficiaries residing in metropolitan France.\nBeneficiaries affiliated with the general insurance scheme since January 1, 2011 or before.\n\nExclusion Criteria:\n\nNo prevalent cancer before January 1, 2012; ( hospital discharge with a primary, related or associated diagnosis of cancer (C00-C97), or personal history of cancer (Z85); LTD status related to cancer; reimbursement for any cancer-specific drug; or external radiotherapy session)."


^ This looks good now.
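For readers less familiar with the `.str[0]` accessor, the unwrapping step can be pictured record by record in plain Python. This is a sketch with hypothetical helper names (`first_or_none`, `unwrap_record`); it returns `None` for an empty list, whereas the pandas version leaves a missing value in that cell.

```python
def first_or_none(value):
    """Return the first element of a list, None for an empty list, else the value itself."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def unwrap_record(record, skip=('Rank',)):
    """Unwrap every list-valued field of one study record, leaving `skip` keys as-is."""
    return {key: (val if key in skip else first_or_none(val))
            for key, val in record.items()}


record = {'Rank': 3, 'NCTId': ['NCT02053662'], 'DetailedDescription': []}
print(unwrap_record(record))
# {'Rank': 3, 'NCTId': 'NCT02053662', 'DetailedDescription': None}
```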

In [24]:
df.DetailedDescription[0]

0    [output truncated]

In [26]:
df.to_csv("/home/ubuntu/datasets/cancer-clinical-trials/76k_cancer_trials_description.csv")

### Preprocessing the corpus

In [38]:
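import nltk
from nltk.corpus import stopwords
# Load the NLTK English stop-word list into a set for fast membership tests.
# If the corpus is not installed yet, run nltk.download('stopwords') once first.
stop_words = set(stopwords.words('english'))
str(stop_words)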

'{\'s\', \'ain\', "that\'ll", \'as\', \'yours\', \'some\', "hasn\'t", "don\'t", \'but\', \'further\', \'me\', \'have\', \'over\', \'each\', \'will\', \'aren\', "wasn\'t", \'themselves\', \'own\', \'no\', "mightn\'t", \'yourself\', \'mustn\', \'needn\', \'what\', \'too\', \'doesn\', \'than\', \'am\', \'are\', \'when\', \'such\', \'haven\', \'by\', \'all\', \'nor\', \'or\', \'out\', \'isn\', \'its\', \'why\', \'do\', "didn\'t", \'between\', \'itself\', \'few\', \'wasn\', \'having\', \'any\', "should\'ve", "shouldn\'t", \'i\', \'most\', \'through\', \'herself\', \'should\', "couldn\'t", \'his\', \'which\', \'here\', "aren\'t", \'if\', \'can\', \'don\', \'her\', \'did\', \'himself\', \'below\', \'re\', \'y\', \'o\', "needn\'t", \'ours\', \'from\', \'very\', "hadn\'t", "wouldn\'t", \'their\', \'so\', \'on\', "you\'d", \'once\', \'myself\', \'she\', \'during\', "doesn\'t", \'m\', \'only\', \'we\', \'ma\', \'mightn\', \'more\', \'because\', "she\'s", "it\'s", \'that\', \'has\', \'at\', \'for\

In [40]:
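# Domain-specific additions: words that are very frequent in eligibility criteria,
# plus a few stray punctuation tokens. Duplicates are dropped; the resulting set is unchanged.
stop_words2 = set('for a ( of the ) and to in is at an must be with are but not no none has have other from as prior or except see below study , use " one two three four five six patients before start greater than any allowed by they since'.split())
str(stop_words2)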

'{\'greater\', \'as\', \',\', \'but\', \'two\', \'any\', \'and\', \'has\', \'have\', \'must\', \'patients\', \'study\', \'for\', \'in\', \'at\', \'is\', \'one\', \'see\', \'allowed\', \'three\', \'other\', \'none\', \'no\', \'of\', \'be\', \'four\', \'a\', \'below\', \'to\', \'use\', \'prior\', \'five\', \'except\', \'the\', \'"\', \'six\', \'start\', \'than\', \'from\', \'before\', \'they\', \'are\', \'not\', \')\', \'an\', \'since\', \'by\', \'(\', \'or\', \'with\'}'

In [44]:
master_stop_words = stop_words.union(stop_words2)
str(master_stop_words)

'{\'greater\', \'s\', \'ain\', "that\'ll", \'as\', \'yours\', \'some\', "hasn\'t", "don\'t", \'but\', \'further\', \'me\', \'have\', \'over\', \'each\', \'will\', \'aren\', "wasn\'t", \'none\', \'themselves\', \'own\', \'no\', "mightn\'t", \'yourself\', \'mustn\', \'needn\', \'what\', \'too\', \'six\', \'doesn\', \'than\', \'am\', \'are\', \'when\', \'such\', \'haven\', \'by\', \'all\', \'nor\', \'or\', \'out\', \'isn\', \'its\', \'why\', \'do\', "didn\'t", \'between\', \'itself\', \'few\', \',\', \'wasn\', \'having\', \'any\', "should\'ve", "shouldn\'t", \'i\', \'must\', \'most\', \'through\', \'herself\', \'should\', "couldn\'t", \'his\', \'which\', \'here\', "aren\'t", \'if\', \'can\', \'don\', \'her\', \'did\', \'himself\', \'below\', \'re\', \'y\', \'five\', \'o\', "needn\'t", \'ours\', \'from\', \'very\', "hadn\'t", \')\', "wouldn\'t", \'their\', \'so\', \'(\', \'on\', "you\'d", \'once\', \'myself\', \'she\', \'during\', "doesn\'t", \'m\', \'only\', \'we\', \'ma\', \'mightn\', \'

In [52]:
# Read the whole corpus and split it into whitespace-delimited tokens.
with open("/home/ubuntu/datasets/cancer-clinical-trials/76k_cancer_trials_description_copy.txt") as txt_file:
    txt_words = txt_file.read().split()

# Characters stripped from every token that is kept.
noise_table = str.maketrans('', '', '()[].-:')

# Counter for the stop words found.
sw_found = 0

# Write every token that is not a stop word to a new text file.
# The output file is opened once, in append mode, instead of once per word.
# Note: the stop-word check runs on the raw token, so a token with attached
# punctuation such as "(the" is not recognised as a stop word.
with open('/home/ubuntu/datasets/cancer-clinical-trials/76k_cancer_trials_description_stopwords_removed.txt', 'a') as out_file:
    for check_word in txt_words:
        if check_word.lower() not in master_stop_words:
            # Not on the stop-word list, so remove noise and then append.
            out_file.write(" " + check_word.translate(noise_table))
        else:
            # It's on the stop-word list.
            sw_found += 1

print(sw_found, "stop words found and removed")
print("Saved as 'stopwords-removed.txt' ")

16078121 stop words found and removed
Saved as 'stopwords-removed.txt' 
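
The nearest-neighbour results in section 1 contain tokens such as `[estrogen`, `estrogen,` and `NSCLC)`, which suggests that punctuation attached to a word survives the filtering above. Below is a minimal sketch of one possible alternative: strip the surrounding punctuation first, then do the stop-word lookup. The helper `normalize_tokens`, the tiny stop-word set and the sample sentence are all hypothetical and only serve as an illustration; this is not the preprocessing that produced the results in this notebook.

```python
import string

# Hypothetical helper, not part of the original notebook.
def normalize_tokens(text, stop_words):
    """Strip leading/trailing punctuation from each token, then drop stop words."""
    kept = []
    for raw in text.split():
        token = raw.strip(string.punctuation)  # "(The" -> "The", "NSCLC)" -> "NSCLC"
        if token and token.lower() not in stop_words:
            kept.append(token)
    return kept

# Tiny stand-in for master_stop_words (assumption, for illustration only).
sample_stop_words = {"the", "of", "and", "with"}
sample = "(The expression of [estrogen receptor, and NSCLC) with Pembrolizumab:"

tokens = normalize_tokens(sample, sample_stop_words)
print(tokens)
```

Only punctuation at the edges of a token is removed here, so compound tokens such as `estrogen/progesterone` stay intact. Whether those should be split as well is a separate design decision.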


### 1. Training on Cancer trial data

In [28]:
pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
     |████████████████████████████████| 68 kB 2.3 MB/s eta 0:00:011
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3173570 sha256=f5897a7aa95461d44db97e2b5e627864b3ccd78ba756974385849c46cb94fe00
  Stored in directory: /home/ubuntu/.cache/pip/wheels/c3/5c/d0/4a725c6ee7df3267d818d3bc9d89bb173b94832f2b9eca6368
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2
Note: you may need to restart the kernel to use updated packages.


In [29]:
import fasttext

In [53]:
model = fasttext.train_unsupervised('/home/ubuntu/datasets/cancer-clinical-trials/76k_cancer_trials_description_stopwords_removed.txt')

In [54]:
model.get_word_vector("estrogen")

array([ 2.4892978e-01, -1.6563013e-01, -6.5934040e-02,  4.5928913e-01,
        2.0202581e-02,  7.6780237e-02,  3.3513233e-01, -4.1252303e-01,
        2.0858361e-01, -9.1023928e-01, -3.0293906e-01, -5.1558006e-01,
       -4.1902757e-01, -1.2521465e-01, -1.6299035e-01, -2.6738225e-02,
        7.8007823e-01,  1.3554697e-01, -4.8519111e-01, -3.5083588e-02,
        2.5528473e-01,  3.0092922e-01, -1.9155055e-01,  1.4476283e-01,
        1.7660636e-07, -6.0222067e-02,  4.0759182e-01,  5.4100269e-01,
        8.0235440e-01, -4.7768134e-01, -3.0493966e-01, -2.8303018e-01,
        4.9820712e-01, -3.2064819e-01, -2.5452789e-02, -6.0181928e-01,
        2.5312525e-01,  2.5672410e-02, -7.9386222e-01, -4.2245692e-01,
        3.7047675e-01,  4.2844656e-01,  1.9379880e-01,  3.1470615e-02,
        1.1510277e-01,  4.6984351e-01, -4.0669106e-02,  8.8827230e-02,
       -9.4303268e-04,  9.2533886e-01, -4.6607573e-02, -3.7166696e-02,
        2.9520580e-01, -4.3344401e-02, -5.8495682e-01,  8.1720948e-01,
      

In [55]:
model.get_nearest_neighbors('estrogen') 

[(0.9273011088371277, '[estrogen'),
 (0.9223880767822266, 'estrogen/progesterone'),
 (0.9073039293289185, 'progesterone'),
 (0.8981960415840149, 'progesterones'),
 (0.8929575085639954, 'Estrogen/progesterone'),
 (0.8849475979804993, 'estrogen,'),
 (0.8834394812583923, '[ER]/progesterone'),
 (0.8831318020820618, 'progesterone.'),
 (0.8800917863845825, 'receptor/progesterone'),
 (0.8786907196044922, 'progesterones,')]
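
To my understanding, the scores returned by `get_nearest_neighbors` are cosine similarities between word vectors, so a value near 1 means two words point in almost the same direction in the embedding space. The sketch below shows that computation on made-up 2-dimensional vectors. The names `cosine_similarity` and `toy_vectors` are hypothetical, and the numbers are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only (real fastText vectors have 100 dimensions by default).
toy_vectors = {
    "estrogen": [1.0, 0.0],
    "progesterone": [0.8, 0.6],
    "placebo": [0.0, 1.0],
}

query = toy_vectors["estrogen"]
# Rank every other word by similarity to the query, highest first.
ranked = sorted(
    ((cosine_similarity(query, vec), word)
     for word, vec in toy_vectors.items() if word != "estrogen"),
    reverse=True,
)
print(ranked)
```

With the real model, passing `model.get_word_vector("estrogen")` and `model.get_word_vector("progesterone")` to the same function should give a value close to the score listed above, assuming the library does compute cosine similarity as described.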

In [56]:
model.get_nearest_neighbors('Pembrolizumab') 

[(0.9559347629547119, '+Pembrolizumab'),
 (0.951829731464386, '(Pembrolizumab'),
 (0.9217069745063782, 'Pembrolizumab,'),
 (0.9184354543685913, 'Nivolumab'),
 (0.9098639488220215, 'Pembrolizumab:'),
 (0.9070537090301514, 'Pembrolizumab)'),
 (0.8792606592178345, '(Nivolumab'),
 (0.8746487498283386, 'Atezolizumab'),
 (0.8567278981208801, '(Pembrolizumab)'),
 (0.8501065969467163, '(MK3475)')]

In [57]:
model.get_nearest_neighbors('NSCLC') 

[(0.9353123307228088, 'NSCLC；'),
 (0.9346041083335876, 'nsNSCLC'),
 (0.9293544292449951, 'NSCLC*'),
 (0.9225519895553589, 'aNSCLC'),
 (0.9197292923927307, 'NSCLCC'),
 (0.8932170867919922, 'NSCLCA'),
 (0.889642059803009, '(NSCLC'),
 (0.88927161693573, 'NSCLC-'),
 (0.8864023089408875, 'NSCLC)'),
 (0.8786198496818542, 'nonsmall-cell')]

In [59]:
model.get_analogies("immunotherapy","oncology","cardiovascular")

[(0.646023690700531, 'cardiovascular/cerebrovascular'),
 (0.635280191898346, 'cardiovascular/'),
 (0.6086441278457642, 'cardiovascularly'),
 (0.6009490489959717, '(cardiovascular'),
 (0.5713033080101013, 'cardiovacular'),
 (0.5709266066551208, 'cardiovascular/pulmonary/renal'),
 (0.5489823222160339, 'anticancer,'),
 (0.5482404232025146, 'Cerebrovascular'),
 (0.5465226173400879, 'cardio/cerebrovascular'),
 (0.5408843755722046, 'accident/stroke,')]
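
As far as I can tell from the fastText source, `get_analogies(A, B, C)` builds the query vector `A - B + C` from length-normalised word vectors and then returns its nearest neighbours, excluding the three input words. The call above therefore asks roughly "oncology is to immunotherapy as cardiovascular is to what?". The results are dominated by spelling variants of "cardiovascular", which suggests the `C` term outweighs the `A - B` offset in this model. Here is a minimal sketch of the arithmetic on made-up 3-dimensional vectors. All names and numbers are hypothetical, chosen so that the analogy works cleanly.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of the unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy_query(a, b, c):
    """Query vector unit(a) - unit(b) + unit(c)."""
    return [x - y + z for x, y, z in zip(unit(a), unit(b), unit(c))]

# Made-up axes: (oncology field, cardiovascular field, "is a treatment").
toy = {
    "immunotherapy":  [1.0, 0.0, 1.0],
    "oncology":       [1.0, 0.0, 0.0],
    "cardiovascular": [0.0, 1.0, 0.0],
    "statin":         [0.0, 1.0, 1.0],
    "stroke":         [0.0, 1.0, 0.1],
    "chemotherapy":   [1.0, 0.0, 1.0],
}

inputs = ("immunotherapy", "oncology", "cardiovascular")
query = analogy_query(*(toy[w] for w in inputs))

# Score every word that is not one of the three inputs.
scores = {w: cosine(query, v) for w, v in toy.items() if w not in inputs}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy space the treatment-like cardiovascular word ranks first, while a word that is merely close to "cardiovascular" comes second. The real output above looks more like the second case.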