# Dataset Generation

This notebook shows how to generate datasets for historical validation. We choose a target year and use all available phase 4 clinical trials up to and including that year for training and testing. We additionally export files with positive phase 4 data for the subsequent years. As negative data, we choose all combinations of kinases and cancers that have remained negative up to the present day.

In [1]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
from kcet import KcetDatasetGenerator

# Input file
We have placed a version of ``clinical_trials_by_phase.tsv`` into the ``data`` subfolder in the ``notebooks`` subfolder of kcet.

In [2]:
ctfile = os.path.abspath(os.path.join('../data', 'clinical_trials_by_phase.tsv'))

## Target year
In this example, we show how to extract datasets for a target year.

In [3]:
target_year = 2014
num_years_after_the_target_year = 6
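
The two settings above define a time-based split. The following is a minimal sketch of that idea on an invented toy table, not the kcet implementation: the rows, the `year` column layout, and the assumption that the later year is inclusive are all illustrative.

```python
import pandas as pd

# Toy stand-in for a table of trials; all rows are invented for illustration.
trials = pd.DataFrame({
    "gene_id": ["geneA", "geneB", "geneC", "geneD"],
    "mesh_id": ["diseaseX", "diseaseY", "diseaseZ", "diseaseW"],
    "year": [2012, 2014, 2016, 2021],
})

target_year = 2014
num_years_after_the_target_year = 6
last_year = target_year + num_years_after_the_target_year

# Training/test data: trials up to and including the target year.
train = trials[trials["year"] <= target_year]

# Validation data: trials after the target year, up to the later year
# (treating the later year as inclusive is an assumption of this sketch).
validation = trials[(trials["year"] > target_year) & (trials["year"] <= last_year)]

print("train rows: {}, validation rows: {}".format(len(train), len(validation)))
```

With these toy rows, the 2021 trial falls outside the six-year window and ends up in neither set.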

In [4]:
dsGen = KcetDatasetGenerator(clinical_trials=ctfile)

[INFO] Reading protein kinase information from /Users/ravanv/PycharmProjects/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 698 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 698 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 76 entries
[INFO] Parsed data for the following medications:
[INFO] avapritinib
[INFO] olaratumab
[INFO] dacomitinib
[INFO] bosutinib
[INFO] larotrectinib
[INFO] midostaurin
[INFO] trametinib
[INFO] alectinib
[INFO] apatinib
[INFO] necitumumab
[INFO] entrectinib
[INFO] ponatinib
[INFO] crizotinib
[INFO] vemurafenib
[INFO] regorafenib
[INFO] nilotinib
[INFO] icotinib
[INFO] sorafenib
[INFO] imatinib
[INFO] trastuzumab
[INFO] duvelisib
[INFO] amlexanox
[INFO] ribociclib
[INFO] lorlatinib
[INFO] acalabrutinib
[INFO] gilteritinib
[INFO] ruxolitinib
[INFO] cobimetinib
[INFO] fedr

In [5]:
positive_train_df, negative_train_df, positive_validation_df, negative_validation_df = \
    dsGen.get_data_after_target_year_upto_later_year_phase_4(
        target_year=target_year,
        num_years_later=num_years_after_the_target_year)

Skipping random link(ncbigene780,meshd018567) since we found it in the positive set
Skipping random link(ncbigene2322,meshd008415) since we found it in the positive set
Skipping random link(ncbigene80122,meshd002276) since we found it in the positive set
Skipping random link(ncbigene5159,meshd054364) since we found it in the positive set
Skipping random link(ncbigene5294,meshd015451) since we found it in the positive set
Skipping random link(ncbigene80122,meshd008545) since we found it in the positive set
Skipping random link(ncbigene3815,meshd054364) since we found it in the positive set
Skipping random link(ncbigene2475,meshd013274) since we found it in the positive set
Skipping random link(ncbigene5159,meshd008258) since we found it in the positive set
Skipping random link(ncbigene660,meshd015451) since we found it in the positive set
Skipping random link(ncbigene9263,meshd000077192) since we found it in the positive set
Skipping random link(ncbigene3932,meshd010255) since we found 

# Positive training data
We use this data frame for test and training. It contains all phase 4 trials completed up to and including the target year.

In [6]:
positive_train_df.head()

Unnamed: 0,cancer,mesh_id,kinase,gene_id,pki,nct,phase,year
4,"Carcinoma, Non-Small-Cell Lung",meshd002289,EGFR,ncbigene1956,afatinib,NCT04413201;NCT02695290;NCT02208843;NCT0435611...,Phase 4,2014
5,"Carcinoma, Non-Small-Cell Lung",meshd002289,ERBB2,ncbigene2064,afatinib,NCT04413201;NCT02695290;NCT02208843;NCT0435611...,Phase 4,2014
6,"Carcinoma, Non-Small-Cell Lung",meshd002289,ERBB4,ncbigene2066,afatinib,NCT04413201;NCT02695290;NCT02208843;NCT0435611...,Phase 4,2014
81,Leukemia,meshd007938,ABL1,ncbigene25,bosutinib,NCT02228382,Phase 4,2014
82,Leukemia,meshd007938,MAP4K5,ncbigene11183,bosutinib,NCT02228382,Phase 4,2014


In [7]:
#We write the file for machine learning as follows
pos_train_df = positive_train_df[['gene_id','mesh_id']]
outname = "positive_training_upto_{}.tsv".format(target_year)
pos_train_df.to_csv(outname, sep='\t')
print(pos_train_df.head())
print("Number of positive training links: {}".format(pos_train_df.shape[0]) )

          gene_id      mesh_id
4    ncbigene1956  meshd002289
5    ncbigene2064  meshd002289
6    ncbigene2066  meshd002289
81     ncbigene25  meshd007938
82  ncbigene11183  meshd007938
Number of positive training links: 307
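
One detail worth knowing when reading these files back: `to_csv` writes the data frame index as an unnamed first column by default, and `read_csv` then labels it `Unnamed: 0`. The sketch below shows this round trip in memory using two rows from the output above; whether a downstream script wants the index column or not is a choice this notebook does not settle.

```python
import io
import pandas as pd

# Two rows taken from the printed head above, with their original index labels.
df = pd.DataFrame(
    {"gene_id": ["ncbigene1956", "ncbigene25"],
     "mesh_id": ["meshd002289", "meshd007938"]},
    index=[4, 81],
)

# Default behaviour: the index is written as an unnamed first column.
text_with_index = df.to_csv(sep="\t")
with_index = pd.read_csv(io.StringIO(text_with_index), sep="\t")
print(list(with_index.columns))

# Option 1: read the first column back as the index.
restored = pd.read_csv(io.StringIO(text_with_index), sep="\t", index_col=0)
print(list(restored.columns))

# Option 2: do not write the index at all.
text_without_index = df.to_csv(sep="\t", index=False)
without_index = pd.read_csv(io.StringIO(text_without_index), sep="\t")
print(list(without_index.columns))
```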


# Negative training data
This data is used for test and training. It contains a sample with `factor` times (default=10) as many random combinations of kinases and cancers that were not positive as of the target year. Note that the data frames for negative examples have only the columns ``gene_id`` and ``mesh_id``.

In [8]:
print(negative_train_df.head())
outname = "negative_training_upto_{}.tsv".format(target_year)
negative_train_df.to_csv(outname, sep='\t')
print("Number of negative training links: {}".format(negative_train_df.shape[0]) )

       mesh_id       gene_id
0  meshd014516  ncbigene8550
1  meshd018270  ncbigene8899
2  meshd009374  ncbigene4139
3  meshd007953  ncbigene4486
4  meshd009918   ncbigene701
Number of negative training links: 2560
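
To illustrate the idea of negative sampling, here is a minimal sketch that draws random kinase-cancer pairs that are not in the positive set. It is not the kcet code: the function name `sample_negative_links`, the toy identifiers, and the way `factor` is applied are assumptions. It also enumerates all candidate pairs and samples from them, rather than drawing random links and skipping positives as the log output above suggests kcet does.

```python
import random
import pandas as pd


def sample_negative_links(gene_ids, mesh_ids, positive_links, n_samples, seed=0):
    """Return n_samples distinct (gene_id, mesh_id) pairs absent from positive_links."""
    all_pairs = {(g, m) for g in gene_ids for m in mesh_ids}
    # Sort the candidates so that a fixed seed gives a reproducible sample.
    candidates = sorted(all_pairs - set(positive_links))
    if n_samples > len(candidates):
        raise ValueError("not enough negative candidates to sample from")
    rng = random.Random(seed)
    sampled = rng.sample(candidates, n_samples)
    return pd.DataFrame(sampled, columns=["gene_id", "mesh_id"])


# Invented toy identifiers, for illustration only.
gene_ids = ["geneA", "geneB", "geneC"]
mesh_ids = ["diseaseX", "diseaseY", "diseaseZ", "diseaseW"]
positives = {("geneA", "diseaseX"), ("geneB", "diseaseY")}

factor = 2  # hypothetical ratio of negative to positive links
negative_df = sample_negative_links(gene_ids, mesh_ids, positives,
                                    n_samples=factor * len(positives))
print(negative_df)
```

Enumerating all pairs is simple and cannot loop forever, at the cost of holding every candidate in memory; for a few hundred kinases and diseases this is still small.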


# Positive validation data
This file has phase 4 data for all years following the target year. We should predict them
all at once and then use postprocessing in a script to figure out the results for each year.

In [9]:
pos_validation_df = positive_validation_df[['gene_id','mesh_id']]
outname = "positive_validation_{}_years_after_{}_phase4.tsv".format(num_years_after_the_target_year,target_year)
pos_validation_df.to_csv(outname, sep='\t')
print(pos_validation_df.head())
print("Number of positive validation links: {}".format(pos_validation_df.shape[0]) )

        gene_id      mesh_id
0  ncbigene2241  meshd008175
1  ncbigene2268  meshd008258
2  ncbigene2322  meshd002277
3  ncbigene6098  meshd009362
4   ncbigene660  meshd008258
Number of positive validation links: 328
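
The per-year postprocessing mentioned above could look roughly like the following sketch. Everything in it is hypothetical: the toy rows, the `score` column (standing in for the output of some classifier), and the threshold of 0.5. It also assumes that the year of each link is kept alongside the predictions, whereas the file written above contains only ``gene_id`` and ``mesh_id``.

```python
import pandas as pd

# Invented validation links with a hypothetical classifier score per link.
validation = pd.DataFrame({
    "gene_id": ["geneA", "geneB", "geneC", "geneD"],
    "mesh_id": ["diseaseX", "diseaseY", "diseaseZ", "diseaseX"],
    "year": [2015, 2015, 2016, 2018],
    "score": [0.9, 0.2, 0.7, 0.4],
})

threshold = 0.5  # hypothetical decision threshold
validation["hit"] = (validation["score"] >= threshold).astype(int)

# Predict once, then summarise the positive links recovered in each year.
per_year = validation.groupby("year")["hit"].agg(
    n_links="count", n_recovered="sum", recall="mean")
print(per_year)
```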


# Negative validation data
This is similar to the negative training data in that it just has two columns.

In [10]:
print(negative_validation_df.head())
outname = "negative_validation_{}_years_after_{}_phase4.tsv".format(num_years_after_the_target_year,target_year)
negative_validation_df.to_csv(outname, sep='\t')
print("Number of negative validation links: {}".format(negative_validation_df.shape[0]) )

       mesh_id       gene_id
0  meshd000239  ncbigene6352
1  meshd002472  ncbigene3717
2  meshd014685  ncbigene5592
3  meshd009507  ncbigene3932
4  meshd005909  ncbigene9113
Number of negative validation links: 2560
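
As a sanity check on the exported sets, one might confirm that no link appears in both a positive and a negative data frame. The sketch below uses a hypothetical helper, `shared_links`, and a few rows copied from the printed heads above; it is one possible check, not part of kcet.

```python
import pandas as pd


def shared_links(df_a, df_b):
    """Return the (gene_id, mesh_id) rows that occur in both data frames."""
    keys = ["gene_id", "mesh_id"]
    return df_a[keys].merge(df_b[keys], on=keys, how="inner").drop_duplicates()


# Rows copied from the positive and negative validation heads shown above.
positive = pd.DataFrame({"gene_id": ["ncbigene2241", "ncbigene2268"],
                         "mesh_id": ["meshd008175", "meshd008258"]})
negative = pd.DataFrame({"mesh_id": ["meshd000239", "meshd002472"],
                         "gene_id": ["ncbigene6352", "ncbigene3717"]})

overlap = shared_links(positive, negative)
print("Links present in both sets: {}".format(len(overlap)))
```

Selecting the key columns by name makes the check independent of column order, which differs between the positive and negative files.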
