# Dataset Generation

This notebook shows how to generate datasets for historical validation. We choose a target year and use all phase 4 clinical trials available up to that year for training and testing. We additionally export files with positive data for at least one year after the target year. As negative data, we choose combinations of kinases and cancers that have remained negative up to the present day.

In [1]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
from kcet import KcetDatasetGenerator

# Input file
We have placed a version of ``clinical_trials_by_phase.tsv`` into the ``data`` subfolder in the ``notebooks`` subfolder of kcet.

In [2]:
ctfile = os.path.abspath(os.path.join('../data', 'clinical_trials_by_phase.tsv'))

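## Target year
In this example, we show how to extract validation datasets for the period that starts after the target year and ends at a later year. The later year is specified by ``mid_year`` and ``num_years_after_the_mid_year``.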

In [3]:
target_year = 2010
mid_year = 2011
num_years_after_the_mid_year = 0

In [4]:
dsGen = KcetDatasetGenerator(clinical_trials=ctfile)

[INFO] Reading protein kinase information from /Users/ravanv/PycharmProjects/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 698 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 698 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 76 entries
[INFO] Parsed data for the following medications:
[INFO] lenvatinib
[INFO] amlexanox
[INFO] duvelisib
[INFO] neratinib
[INFO] sorafenib
[INFO] selumetinib
[INFO] panitumumab
[INFO] axitinib
[INFO] cobimetinib
[INFO] palbociclib
[INFO] tivozanib
[INFO] alectinib
[INFO] lapatinib
[INFO] ribociclib
[INFO] dacomitinib
[INFO] pemigatinib
[INFO] acalabrutinib
[INFO] gefitinib
[INFO] entrectinib
[INFO] trametinib
[INFO] nintedanib
[INFO] cabozantinib
[INFO] dasatinib
[INFO] lorlatinib
[INFO] gilteritinib
[INFO] olaratumab
[INFO] regorafenib
[INFO] sunitinib
[INFO] abemac

In [5]:
positive_train_df, negative_train_df, positive_validation_df, negative_validation_df = \
    dsGen.get_data_years_after_target_year_upto_later_year(
        target_year=target_year,
        mid_year=mid_year,
        num_years_later=num_years_after_the_mid_year)

Skipping random link(ncbigene2475,meshd008579) since we found it in the positive set
Skipping random link(ncbigene147746,meshd044584) since we found it in the positive set
Skipping random link(ncbigene25,meshd002292) since we found it in the positive set
Skipping random link (ncbigene11011,meshd010307) since we already added it to the negative set
Skipping random link(ncbigene3791,meshd012512) since we found it in the positive set
Skipping random link(ncbigene2322,meshd007680) since we found it in the positive set
Skipping random link(ncbigene27,meshd016400) since we found it in the positive set
Skipping random link(ncbigene2042,meshd006528) since we found it in the positive set
Skipping random link (ncbigene55359,meshd007938) since we already added it to the negative set
Skipping random link (ncbigene285220,meshd010048) since we already added it to the negative set
Skipping random link(ncbigene5159,meshd015464) since we found it in the positive set
Skipping random link(ncbigene3791,me

Skipping random link (ncbigene6793,meshd008579) since we found it in the positive set
Skipping random link (ncbigene147746,meshd009959) since we found it in the positive set
Skipping random link (ncbigene1436,meshd000077216) since we found it in the positive set
Skipping random link (ncbigene208,meshd009374) since we already added it to the negative set
Skipping random link (ncbigene5156,meshd065646) since we found it in the positive set
Skipping random link (ncbigene2042,meshd064726) since we found it in the positive set
Skipping random link (ncbigene3815,meshd013724) since we found it in the positive set
Skipping random link (ncbigene3791,meshd018237) since we found it in the positive set
Skipping random link (ncbigene5598,meshd002294) since we already added it to the negative set
Skipping random link (ncbigene8569,meshd018285) since we already added it to the negative set
Skipping random link (ncbigene5063,meshd018315) since we already added it to the negative set
Skipping random li

# Positive training data
We use this data frame for training and testing. It contains all phase 4 trials completed up to and including the target year.

In [6]:
positive_train_df.head()

Unnamed: 0,cancer,mesh_id,kinase,gene_id,pki,nct,phase,year
140,Colorectal Neoplasms,meshd015179,EGFR,ncbigene1956,cetuximab,NCT00327093;NCT01564810;NCT01315990,Phase 4,2006
145,Neoplasm Metastasis,meshd009362,EGFR,ncbigene1956,cetuximab,NCT00327093;NCT01564810;NCT00510627,Phase 4,2006
146,"Neoplasms, Second Primary",meshd016609,EGFR,ncbigene1956,cetuximab,NCT00327093,Phase 4,2006
147,Liver Neoplasms,meshd008113,EGFR,ncbigene1956,cetuximab,NCT01564810;NCT00510627,Phase 4,2006
249,"Carcinoma, Non-Small-Cell Lung",meshd002289,EGFR,ncbigene1956,erlotinib,NCT01287754;NCT01320501;NCT01230710;NCT0106688...,Phase 4,2004
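The selection rule described for the positive training set (phase 4 trials up to and including the target year) can be illustrated with plain pandas. This is only a sketch on a made-up toy table that reuses the column names shown above; `select_phase4_up_to` is a hypothetical helper, not part of `kcet`, and the real generator may apply further rules.

```python
import pandas as pd

# Toy table with the same column names as the output above; the rows are made up.
toy_trials = pd.DataFrame({
    "gene_id": ["ncbigeneA", "ncbigeneB", "ncbigeneC", "ncbigeneD"],
    "mesh_id": ["meshdW", "meshdX", "meshdY", "meshdZ"],
    "phase": ["Phase 4", "Phase 4", "Phase 2", "Phase 4"],
    "year": [2006, 2010, 2008, 2012],
})


def select_phase4_up_to(trials, target_year):
    """Hypothetical helper: keep Phase 4 rows with year <= target_year."""
    mask = (trials["phase"] == "Phase 4") & (trials["year"] <= target_year)
    return trials.loc[mask, ["gene_id", "mesh_id"]]


toy_positive = select_phase4_up_to(toy_trials, target_year=2010)
print(toy_positive)
```

The comparison is inclusive, so a trial dated exactly in the target year stays in the training set.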


In [7]:
# We write the file for machine learning as follows
pos_train_df = positive_train_df[['gene_id','mesh_id']]
outname = "positive_training_upto_{}.tsv".format(target_year)
pos_train_df.to_csv(outname, sep='\t')
print(pos_train_df.head())
print("Number of positive training links: {}".format(pos_train_df.shape[0]) )

          gene_id      mesh_id
140  ncbigene1956  meshd015179
145  ncbigene1956  meshd009362
146  ncbigene1956  meshd016609
147  ncbigene1956  meshd008113
249  ncbigene1956  meshd002289
Number of positive training links: 161
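Because `to_csv` is called without `index=False`, the row index is written as an unnamed first column. A downstream script that reads the file back may want to pass `index_col=0` so that this column becomes the index again instead of appearing as `Unnamed: 0`. The sketch below uses an in-memory buffer and made-up identifiers rather than the real output file.

```python
import io

import pandas as pd

# Made-up links with a non-trivial index, mimicking the shape of pos_train_df.
toy_links = pd.DataFrame(
    {"gene_id": ["ncbigeneA", "ncbigeneB"], "mesh_id": ["meshdX", "meshdY"]},
    index=[140, 145],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_links.to_csv(buffer, sep="\t")
buffer.seek(0)

# index_col=0 turns the unnamed first column back into the index.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```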


# Negative training data
This data is used for training and testing. It contains a random sample of kinase-cancer combinations that were not positive as of the target year, with ``factor`` times as many negative examples as positive ones (default: 10). Note that the data frames for negative examples have only the columns ``gene_id`` and ``mesh_id``.
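The log messages printed during generation ("found it in the positive set", "already added it to the negative set") suggest a rejection-sampling loop. A minimal sketch of that idea follows. It is an assumption about the general approach and not the `kcet` implementation: `sample_negative_links` is a hypothetical function, the identifiers are made up, and the sketch counts distinct positive pairs when sizing the sample.

```python
import random


def sample_negative_links(kinases, cancers, positives, factor=10, seed=0):
    """Hypothetical helper: draw factor * len(positives) random negative links."""
    rng = random.Random(seed)
    kinases = sorted(set(kinases))
    cancers = sorted(set(cancers))
    positives = set(positives)
    wanted = factor * len(positives)

    # Guard against an endless loop when too few negative pairs exist.
    positives_in_grid = {
        (g, m) for g, m in positives if g in kinases and m in cancers
    }
    available = len(kinases) * len(cancers) - len(positives_in_grid)
    if wanted > available:
        raise ValueError("not enough negative combinations to sample from")

    negatives = set()
    while len(negatives) < wanted:
        pair = (rng.choice(kinases), rng.choice(cancers))
        if pair in positives:
            continue  # skip: found in the positive set
        if pair in negatives:
            continue  # skip: already added to the negative set
        negatives.add(pair)
    return sorted(negatives)


toy_kinases = ["ncbigeneA", "ncbigeneB", "ncbigeneC", "ncbigeneD"]
toy_cancers = ["meshdV", "meshdW", "meshdX", "meshdY", "meshdZ"]
toy_positives = {("ncbigeneA", "meshdV")}
toy_negatives = sample_negative_links(toy_kinases, toy_cancers, toy_positives)
print(len(toy_negatives))
```

Fixing the seed makes the sample reproducible, which helps when training and evaluation are run in separate sessions.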

In [8]:
print(negative_train_df.head())
outname = "negative_training_upto_{}.tsv".format(target_year)
negative_train_df.to_csv(outname, sep='\t')
print("Number of negative training links: {}".format(negative_train_df.shape[0]) )

          mesh_id        gene_id
0     meshd000310   ncbigene5578
1     meshd013964  ncbigene50488
2  meshd000069295    ncbigene207
3     meshd008441   ncbigene6355
4     meshd015451  ncbigene23139
Number of negative training links: 1350


# Positive validation data
This file has data for all four study phases and for all years following the target year. We should predict them all at once and then use a postprocessing script to work out the results for each year.
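One way such a postprocessing step could look is sketched below. Everything in it is an assumption made for illustration: the `year` and `score` columns, the threshold, and the toy rows do not come from this notebook, whose validation data frame contains only `mesh_id` and `gene_id`.

```python
import pandas as pd

# Hypothetical scored links: one row per positive validation link, with the
# year in which it became positive and a model score. All values are made up.
toy_predictions = pd.DataFrame({
    "gene_id": ["ncbigeneA", "ncbigeneB", "ncbigeneA", "ncbigeneC", "ncbigeneD"],
    "mesh_id": ["meshdV", "meshdW", "meshdX", "meshdY", "meshdZ"],
    "year": [2011, 2011, 2012, 2012, 2012],
    "score": [0.9, 0.4, 0.8, 0.7, 0.2],
})

threshold = 0.5  # assumed decision threshold
toy_predictions["hit"] = toy_predictions["score"] >= threshold

# Count the links and the recovered links separately for each year.
per_year = toy_predictions.groupby("year").agg(
    n_links=("hit", "size"),
    n_recovered=("hit", "sum"),
)
per_year["recall"] = per_year["n_recovered"] / per_year["n_links"]
print(per_year)
```

Scoring once and splitting afterwards keeps the prediction step identical for every year, so differences between years reflect the data and not separate model runs.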

In [9]:
positive_validation_df.head(n=1000)

Unnamed: 0,mesh_id,gene_id
0,meshd009456,ncbigene4921
1,meshd055752,ncbigene6885
2,meshd018284,ncbigene6885
3,meshd002295,ncbigene55589
4,meshd010673,ncbigene1436
...,...,...
351,meshd018234,ncbigene5156
352,meshd000071960,ncbigene2064
353,meshd018263,ncbigene1436
354,meshd000077273,ncbigene5156


In [10]:
pos_validation_df = positive_validation_df[['gene_id','mesh_id']]
outname = "positive_validation_{}_years_after_{}_target_{}.tsv".format(num_years_after_the_mid_year,mid_year,target_year)
pos_validation_df.to_csv(outname, sep='\t')
print(pos_validation_df.head())
print("Number of positive validation links: {}".format(pos_validation_df.shape[0]) )

         gene_id      mesh_id
0   ncbigene4921  meshd009456
1   ncbigene6885  meshd055752
2   ncbigene6885  meshd018284
3  ncbigene55589  meshd002295
4   ncbigene1436  meshd010673
Number of positive validation links: 356


# Negative validation data
This is similar to the negative training data in that it just has two columns.

In [11]:
print(negative_validation_df.head())
outname = "negative_validation_upto_{}.tsv".format(target_year)
negative_validation_df.to_csv(outname, sep='\t')
print("Number of negative validation links: {}".format(negative_validation_df.shape[0]) )

          mesh_id        gene_id
0     meshd019574   ncbigene1326
1     meshd000235   ncbigene4915
2     meshd009808  ncbigene84033
3     meshd045888  ncbigene84930
4  meshd000074009   ncbigene8536
Number of negative validation links: 1350
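Since negative links are meant to exclude anything in the positive sets, a quick overlap check before training can catch mistakes. The sketch below defines a hypothetical helper, `count_shared_links`, and runs it on made-up data frames that have the same two columns as the files written above.

```python
import pandas as pd


def count_shared_links(df_a, df_b):
    """Hypothetical helper: number of (gene_id, mesh_id) pairs present in both frames."""
    pairs_a = set(zip(df_a["gene_id"], df_a["mesh_id"]))
    pairs_b = set(zip(df_b["gene_id"], df_b["mesh_id"]))
    return len(pairs_a & pairs_b)


# Made-up frames; note that the column order differs, as in the outputs above.
toy_pos = pd.DataFrame({
    "gene_id": ["ncbigeneA", "ncbigeneB"],
    "mesh_id": ["meshdX", "meshdY"],
})
toy_neg = pd.DataFrame({
    "mesh_id": ["meshdX", "meshdZ"],
    "gene_id": ["ncbigeneC", "ncbigeneA"],
})

n_shared = count_shared_links(toy_pos, toy_neg)
print(n_shared)
```

In this notebook, the same helper could be applied to `pos_train_df` and `negative_train_df`, or to the validation pair. A non-zero count would indicate a link that appears in both a positive and a negative set.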
