# Dataset Generation

This notebook shows how to generate datasets for the historical validation. We choose a target year (2012 in the example below) and get all available phase 4 clinical trials up to that year for training and testing. We additionally export files with positive data for all subsequent years. As negative data, we choose all combinations of kinases and cancers that have remained negative up to the present day.

In [1]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from kcet import KcetDatasetGenerator

# Input file
We have placed a version of ``clinical_trials_by_phase.tsv`` into the ``data`` subfolder in the ``notebooks`` subfolder of kcet.

In [2]:
ctfile = os.path.abspath(os.path.join('data', 'clinical_trials_by_phase.tsv'))

## Target year
In this example, we show how to extract datasets for the target year 2012.

In [3]:
target_year = 2012

In [4]:
dsGen = KcetDatasetGenerator(clinical_trials=ctfile)

[INFO] Reading protein kinase information from /home/peter/GIT/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 694 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 694 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 83 entries
[INFO] Reading protein kinase information from /home/peter/GIT/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 694 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 694 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 83 entries
[INFO] Parsed data for the following medications:
[INFO] afatinib
[INFO] imatinib


In [5]:
# Reuse the target year defined above instead of repeating the literal
df = dsGen._get_positive_data_set(target_year)
df.head()


Unnamed: 0,cancer,mesh_id,kinase,gene_id,pki,nct,phase,year
14,Breast Neoplasms,meshd001943,ABL1,ncbigene25,imatinib,NCT00372476,Phase 4,2006
15,Breast Neoplasms,meshd001943,PDGFRA,ncbigene5156,imatinib,NCT00372476,Phase 4,2006
16,Breast Neoplasms,meshd001943,PDGFRB,ncbigene5159,imatinib,NCT00372476,Phase 4,2006
17,Leukemia,meshd007938,ABL1,ncbigene25,imatinib,NCT02317159;NCT00390897;NCT00786812;NCT0220472...,Phase 4,2003
18,Leukemia,meshd007938,PDGFRA,ncbigene5156,imatinib,NCT02317159;NCT00390897;NCT00786812;NCT0220472...,Phase 4,2003


In [6]:
d = dsGen._get_positive_validation_data_set(target_year)
d2 = dsGen._get_negative_validation_data_set(negative_df=d)
d2.head()

GPVDS year=2012
[INFO] We generated a negative set with 222 examples (the positive set has 589)


Unnamed: 0,gene_id,mesh_id
0,ncbigene5609,meshd018285
1,ncbigene4140,meshd008479
2,ncbigene57172,meshd015674
3,ncbigene1147,meshd016582
4,ncbigene1017,meshd018291


In [7]:
positive_df, negative_df, positive_validation_df, negative_validation_df = \
   dsGen.get_data_for_target_year(target_year=target_year)

[INFO] We generated a negative set with 390 examples (the positive set has 39)
GPVDS year=2012
[INFO] We generated a negative set with 390 examples (the positive set has 589)


## Positive data
We use this data frame for test and training. It contains all phase 4 trials completed up to and including the
target year.

In [8]:
positive_df.head()

Unnamed: 0,cancer,mesh_id,kinase,gene_id,pki,nct,phase,year
14,Breast Neoplasms,meshd001943,ABL1,ncbigene25,imatinib,NCT00372476,Phase 4,2006
15,Breast Neoplasms,meshd001943,PDGFRA,ncbigene5156,imatinib,NCT00372476,Phase 4,2006
16,Breast Neoplasms,meshd001943,PDGFRB,ncbigene5159,imatinib,NCT00372476,Phase 4,2006
17,Leukemia,meshd007938,ABL1,ncbigene25,imatinib,NCT02317159;NCT00390897;NCT00786812;NCT0220472...,Phase 4,2003
18,Leukemia,meshd007938,PDGFRA,ncbigene5156,imatinib,NCT02317159;NCT00390897;NCT00786812;NCT0220472...,Phase 4,2003
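
The selection rule described above (phase 4 trials up to and including the target year) can be sketched with plain pandas. This is only an illustration on toy rows modelled on the table above; `select_positive` is a hypothetical helper, not part of `kcet`, and the actual logic inside `KcetDatasetGenerator` may differ.

```python
import pandas as pd


def select_positive(trials: pd.DataFrame, target_year: int,
                    phase: str = "Phase 4") -> pd.DataFrame:
    """Keep rows of the given phase whose year is <= target_year.

    Hypothetical helper for illustration; assumes 'phase' and 'year' columns.
    """
    mask = (trials["phase"] == phase) & (trials["year"] <= target_year)
    return trials.loc[mask]


# Toy rows that mimic the column layout shown in this notebook
toy = pd.DataFrame({
    "gene_id": ["ncbigene25", "ncbigene1956", "ncbigene1956"],
    "mesh_id": ["meshd001943", "meshd002289", "meshd001749"],
    "phase": ["Phase 4", "Phase 4", "Phase 2"],
    "year": [2006, 2014, 2013],
})

toy_positive = select_positive(toy, target_year=2012)
print(toy_positive)
```

Only the 2006 phase 4 row survives the filter: the 2014 row is later than the target year and the 2013 row is not phase 4.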


# Output file
We write the file for machine learning as follows. Only the ``gene_id`` and ``mesh_id`` columns are kept.

In [14]:
pos_df = positive_df[['gene_id','mesh_id']]
# Uncomment to output file
#outname = "positive_upto_2012.tsv"
#pos_df.to_csv(outname, sep='\t')
pos_df.head()

Unnamed: 0,gene_id,mesh_id
14,ncbigene25,meshd001943
15,ncbigene5156,meshd001943
16,ncbigene5159,meshd001943
17,ncbigene25,meshd007938
18,ncbigene5156,meshd007938
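
One detail worth noting when writing such a file: `DataFrame.to_csv` includes the row index by default, which shows up as an extra unnamed column when the file is read back. The sketch below uses an in-memory buffer instead of a real file and passes `index=False`. Whether the downstream code expects the index column is an assumption we cannot check here, so treat this as an option rather than a required change.

```python
import io

import pandas as pd

# Toy frame with the same two columns and a non-default index
pairs = pd.DataFrame(
    {
        "gene_id": ["ncbigene25", "ncbigene5156"],
        "mesh_id": ["meshd001943", "meshd001943"],
    },
    index=[14, 15],
)

# Write tab-separated text without the index column
buffer = io.StringIO()
pairs.to_csv(buffer, sep="\t", index=False)

# Read it back to confirm that only the two data columns are present
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep="\t")
print(list(round_trip.columns))
```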


# Negative data
This data is used for training and testing. It contains a random sample of kinase-cancer combinations that were not positive as of the target year. The sample is ``factor`` times (default=10) as large as the positive set. Note that the dataframes for negative examples just have the columns ``gene_id`` and ``mesh_id``.

In [17]:
negative_df.head()

Unnamed: 0,gene_id,mesh_id
0,ncbigene91419,meshd058617
1,ncbigene84451,meshd002813
2,ncbigene9088,meshd014594
3,ncbigene7867,meshd017600
4,ncbigene1436,meshd019572
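
The sampling idea can be sketched as follows: enumerate all kinase-cancer pairs, drop the known positives, and draw `factor` times as many pairs as there are positives. The function `sample_negatives`, its parameters, and the toy identifiers are assumptions made for this illustration; the implementation in `kcet` may sample differently.

```python
import random

import pandas as pd


def sample_negatives(positive_pairs, gene_ids, mesh_ids, factor=10, seed=42):
    """Draw factor * len(positives) random (gene, mesh) pairs that are not positive.

    Hypothetical helper for illustration only.
    """
    positives = set(positive_pairs)
    # All combinations that are not known positives
    candidates = [(g, m) for g in gene_ids for m in mesh_ids
                  if (g, m) not in positives]
    # Never ask for more pairs than exist
    n_wanted = min(factor * len(positives), len(candidates))
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    chosen = rng.sample(candidates, n_wanted)
    return pd.DataFrame(chosen, columns=["gene_id", "mesh_id"])


# Toy universe: 10 genes x 5 diseases = 50 pairs, 2 of them positive
toy_genes = [f"ncbigene{i}" for i in range(1, 11)]
toy_meshes = [f"meshd{i:06d}" for i in range(1, 6)]
toy_positives = [("ncbigene1", "meshd000001"), ("ncbigene2", "meshd000003")]

toy_negative_df = sample_negatives(toy_positives, toy_genes, toy_meshes)
print(len(toy_negative_df))
```

`random.sample` draws without replacement, so the resulting frame contains no duplicate pairs.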


# Positive validation data
This file has data for all four study phases and for all years following the target year. We should predict them
all at once and then use a postprocessing script to work out the results for each year.

In [11]:
positive_validation_df.head()

Unnamed: 0,cancer,mesh_id,kinase,gene_id,pki,nct,phase,year
6,"Carcinoma, Non-Small-Cell Lung",meshd002289,EGFR,ncbigene1956,afatinib,NCT04413201;NCT02695290;NCT02208843;NCT0435611...,Phase 4,2014
7,"Carcinoma, Non-Small-Cell Lung",meshd002289,ERBB2,ncbigene2064,afatinib,NCT04413201;NCT02695290;NCT02208843;NCT0435611...,Phase 4,2014
8,Urinary Bladder Neoplasms,meshd001749,EGFR,ncbigene1956,afatinib,NCT02122172;NCT02465060,Phase 2,2013
9,Urinary Bladder Neoplasms,meshd001749,ERBB2,ncbigene2064,afatinib,NCT02122172;NCT02465060,Phase 2,2013
10,Urethral Neoplasms,meshd014523,EGFR,ncbigene1956,afatinib,NCT02122172,Phase 2,2013


# Negative validation data
Like the negative training data, this data frame has just the two columns ``gene_id`` and ``mesh_id``.

In [20]:
negative_validation_df.head()

Unnamed: 0,gene_id,mesh_id
0,ncbigene25865,meshd012512
1,ncbigene25778,meshd051527
2,ncbigene5164,meshd006223
3,ncbigene1457,meshd016864
4,ncbigene79705,meshd010048
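
A simple sanity check for datasets like these is to confirm that no (``gene_id``, ``mesh_id``) pair appears in both a positive and a negative frame. The helper `overlapping_pairs` below is a hypothetical sketch built on an inner merge, shown on toy data; it is not part of `kcet`.

```python
import pandas as pd


def overlapping_pairs(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the (gene_id, mesh_id) pairs present in both frames.

    Hypothetical helper for illustration only.
    """
    keys = ["gene_id", "mesh_id"]
    return left[keys].drop_duplicates().merge(
        right[keys].drop_duplicates(), on=keys, how="inner"
    )


toy_pos = pd.DataFrame({
    "gene_id": ["ncbigene25", "ncbigene5156"],
    "mesh_id": ["meshd001943", "meshd001943"],
})
toy_neg = pd.DataFrame({
    "gene_id": ["ncbigene25865", "ncbigene5156"],
    "mesh_id": ["meshd012512", "meshd001943"],
})

overlap = overlapping_pairs(toy_pos, toy_neg)
print(overlap)
```

In the toy data one pair is deliberately shared, so the check reports it. On real output an empty result would be the expected outcome.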
