# Dataset Generation

This notebook shows how to generate datasets for historical validation. We choose a target year and use all phase 4 clinical trials available up to that year as test and training data. We additionally export files with positive data for at least one year after the target year. As negative data, we choose combinations of kinases and cancers that have remained negative up to the present day.

In [36]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
from kcet import KcetDatasetGenerator

# Input file
We have placed a version of ``clinical_trials_by_phase.tsv`` into the ``data`` subfolder in the ``notebooks`` subfolder of kcet.

In [37]:
ctfile = os.path.abspath(os.path.join('../data', 'clinical_trials_by_phase.tsv'))

## Target year
In this example, we show how to extract datasets for the target year of 2005, with validation data covering the 10 years after it.

In [38]:
target_year = 2005
num_years_after_the_target_year = 10

In [39]:
dsGen = KcetDatasetGenerator(clinical_trials=ctfile)

[INFO] Reading protein kinase information from /Users/ravanv/PycharmProjects/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 694 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 694 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 83 entries
[INFO] Reading protein kinase information from /Users/ravanv/PycharmProjects/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 694 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 694 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 83 entries
[INFO] Parsed data for the following medications:
[INFO] imatinib
[INFO] afatinib


In [40]:
positive_train_df, negative_train_df, positive_validation_df, negative_validation_df = \
    dsGen.get_data_for_target_year_and_later_year(target_year=target_year,
                                                  num_years_later=num_years_after_the_target_year)

[INFO] We generated a negative set with 150 examples (the positive set has 15)
GPVDS year=2015
[INFO] We generated a negative set with 150 examples (the positive set has 510)


# Positive training data
We use this data frame for test and training. It contains all phase 4 trials completed up to and including the target year.

In [41]:
positive_train_df.head()

| | cancer | mesh_id | kinase | gene_id | pki | nct | phase | year |
|---|---|---|---|---|---|---|---|---|
| 17 | Leukemia | meshd007938 | ABL1 | ncbigene25 | imatinib | NCT02317159;NCT00390897;NCT00786812;NCT0220472... | Phase 4 | 2003 |
| 18 | Leukemia | meshd007938 | PDGFRA | ncbigene5156 | imatinib | NCT02317159;NCT00390897;NCT00786812;NCT0220472... | Phase 4 | 2003 |
| 19 | Leukemia | meshd007938 | PDGFRB | ncbigene5159 | imatinib | NCT02317159;NCT00390897;NCT00786812;NCT0220472... | Phase 4 | 2003 |
| 26 | Leukemia, Myeloid | meshd007951 | ABL1 | ncbigene25 | imatinib | NCT02317159;NCT00390897;NCT00786812;NCT0220472... | Phase 4 | 2003 |
| 27 | Leukemia, Myeloid | meshd007951 | PDGFRA | ncbigene5156 | imatinib | NCT02317159;NCT00390897;NCT00786812;NCT0220472... | Phase 4 | 2003 |
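
The selection rule described above (phase 4 trials completed up to and including the target year) can be sketched with plain pandas. This is only an illustration of the rule, not the internals of ``KcetDatasetGenerator``: the helper name ``select_positive_training`` and the toy rows are hypothetical, and the column names are taken from the ``head()`` output shown above.

```python
import pandas as pd

# Toy stand-in for the clinical trial table; rows are illustrative, not real data.
trials = pd.DataFrame({
    "gene_id": ["ncbigene25", "ncbigene5156", "ncbigene1956", "ncbigene2064"],
    "mesh_id": ["meshd007938", "meshd007938", "meshd009101", "meshd009101"],
    "phase": ["Phase 4", "Phase 4", "Phase 2", "Phase 4"],
    "year": [2003, 2005, 2004, 2009],
})


def select_positive_training(df, target_year):
    """Hypothetical helper: keep phase 4 links dated up to and including target_year."""
    mask = (df["phase"] == "Phase 4") & (df["year"] <= target_year)
    return df.loc[mask, ["gene_id", "mesh_id"]].drop_duplicates()


train = select_positive_training(trials, target_year=2005)
print(train)
```

The phase 2 row and the 2009 row are excluded, so only links that were already established by the target year enter the training set.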


In [42]:
# Write the file for machine learning as follows
pos_train_df = positive_train_df[['gene_id','mesh_id']]
outname = "positive_training_upto_{}.tsv".format(target_year)
pos_train_df.to_csv(outname, sep='\t')
print(pos_train_df.head())
print("Number of positive training links: {}".format(pos_train_df.shape[0]))

         gene_id      mesh_id
17    ncbigene25  meshd007938
18  ncbigene5156  meshd007938
19  ncbigene5159  meshd007938
26    ncbigene25  meshd007951
27  ncbigene5156  meshd007951
Number of positive training links: 15


# Negative training data
These data are used for test and training. The data frame contains a random sample of kinase-cancer combinations that were not positive as of the target year; the sample is larger than the positive set by a fixed factor (default=10). Note that the data frames for negative examples have only the columns ``gene_id`` and ``mesh_id``.

In [43]:
print(negative_train_df.head())
outname = "negative_training_upto_{}.tsv".format(target_year)
negative_train_df.to_csv(outname, sep='\t')
print("Number of negative training links: {}".format(negative_train_df.shape[0]))

       mesh_id        gene_id
0  meshd055756   ncbigene8428
1  meshd044584   ncbigene8558
2  meshd019292  ncbigene51086
3  meshd009837   ncbigene5127
4  meshd015451  ncbigene10769
Number of negative training links: 150
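
The idea of drawing a fixed multiple of random non-positive kinase-cancer pairs can be sketched as follows. This is a minimal illustration under stated assumptions, not the sampling code used by kcet: the function ``sample_negatives``, its ``seed`` parameter, and the toy identifiers are all hypothetical.

```python
import itertools
import random


def sample_negatives(gene_ids, mesh_ids, positives, factor=10, seed=42):
    """Hypothetical helper: draw factor * len(positives) pairs that are not known positives."""
    positive_set = set(positives)
    # Every (gene_id, mesh_id) combination that is not a known positive link.
    candidates = [pair for pair in itertools.product(gene_ids, mesh_ids)
                  if pair not in positive_set]
    # Never ask for more pairs than exist.
    n_wanted = min(factor * len(positive_set), len(candidates))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, n_wanted)


# Toy identifiers, invented for illustration only.
genes = ["ncbigene{}".format(i) for i in range(1, 21)]
cancers = ["meshd{:06d}".format(i) for i in range(1, 11)]
positives = [("ncbigene1", "meshd000001"), ("ncbigene2", "meshd000002")]

negatives = sample_negatives(genes, cancers, positives, factor=10)
print(len(negatives))
```

With a factor of 10, a positive set of 15 links would yield 150 negative links, which matches the counts reported in the log output above.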


# Positive validation data
This file has data for all four study phases and for all years following the target year. We should predict them
all at once and then use postprocessing in a script to work out the results for each year.

In [44]:
pos_validation_df = positive_validation_df[['gene_id','mesh_id']]
outname = "positive_validation_{}_years_after_{}.tsv".format(num_years_after_the_target_year,target_year)
pos_validation_df.to_csv(outname, sep='\t')
print(pos_validation_df.head())
print("Number of positive validation links: {}".format(pos_validation_df.shape[0]) )

         gene_id      mesh_id
16  ncbigene1956  meshd009101
17  ncbigene2064  meshd009101
40  ncbigene1956  meshd008175
41  ncbigene2064  meshd008175
72  ncbigene1956  meshd010190
Number of positive validation links: 60
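
The per-year postprocessing mentioned above could look roughly like the sketch below. It assumes that ``positive_validation_df`` carries a ``year`` column, as ``positive_train_df`` does, and that a model produces one score per link; the ``predictions`` frame, its ``score`` column, and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for positive_validation_df; the years are illustrative, not real data.
validation = pd.DataFrame({
    "gene_id": ["ncbigene1956", "ncbigene2064", "ncbigene1956"],
    "mesh_id": ["meshd009101", "meshd009101", "meshd008175"],
    "year": [2006, 2006, 2008],
})

# Hypothetical model output: one score per (gene_id, mesh_id) link.
predictions = pd.DataFrame({
    "gene_id": ["ncbigene1956", "ncbigene2064", "ncbigene1956"],
    "mesh_id": ["meshd009101", "meshd009101", "meshd008175"],
    "score": [0.9, 0.4, 0.7],
})

# Predict once, then attach the year and summarize the scores year by year.
scored = validation.merge(predictions, on=["gene_id", "mesh_id"], how="left")
per_year = scored.groupby("year")["score"].agg(["count", "mean"])
print(per_year)
```

Because the exported file keeps only ``gene_id`` and ``mesh_id``, the year has to be joined back from the full data frame, as done with ``merge`` here.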


# Negative validation data
This is similar to the negative training data in that it just has two columns.

In [45]:
print(negative_validation_df.head())
outname = "negative_validation_{}_years_after_{}.tsv".format(num_years_after_the_target_year,target_year)
negative_validation_df.to_csv(outname, sep='\t')
print("Number of negative validation links: {}".format(negative_validation_df.shape[0]) )

       mesh_id        gene_id
0  meshd017043    ncbigene156
1  meshd014134   ncbigene5163
2  meshd018255   ncbigene3654
3  meshd015408  ncbigene83694
4  meshd000237   ncbigene7075
Number of negative validation links: 150
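
When the exported files are read back for machine learning, note that ``to_csv`` as called above also writes the data frame index as a first column without a header. The following sketch, which uses a temporary directory and toy rows rather than the real output files, shows the effect and one way to handle it with ``index_col=0``.

```python
import os
import tempfile

import pandas as pd

# Toy links with a non-contiguous index, similar in shape to pos_train_df.
links = pd.DataFrame(
    {"gene_id": ["ncbigene25", "ncbigene5156"],
     "mesh_id": ["meshd007938", "meshd007938"]},
    index=[17, 18],
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "positive_training_upto_2005.tsv")
    links.to_csv(path, sep="\t")  # the index is written as an unnamed first column
    raw = pd.read_csv(path, sep="\t")  # index comes back as a column "Unnamed: 0"
    restored = pd.read_csv(path, sep="\t", index_col=0)  # index restored

print(list(raw.columns))
print(list(restored.columns))
```

If the index is not needed downstream, passing ``index=False`` to ``to_csv`` avoids the extra column altogether.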
