# Dataset Generation

This notebook shows how to generate datasets for novel prediction. We fix the target year to the current year (2021) and use all available phase 4 clinical trials up to that year as the positive data for training and testing. The prediction set contains all possible links between protein kinases and cancers, except those in the positive dataset calculated up to the current year.

In [1]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
from kcet import KcetDatasetGenerator

# Input file
We have placed a version of ``clinical_trials_by_phase.tsv`` in the ``data`` subfolder of the ``notebooks`` folder of kcet.

In [2]:
ctfile = os.path.abspath(os.path.join('../data', 'clinical_trials_by_phase.tsv'))

## Target year
In this example, we show how to extract datasets for the target year 2021.

In [3]:
target_year = 2021  # current year

In [4]:
dsGen = KcetDatasetGenerator(clinical_trials=ctfile)

[INFO] Reading protein kinase information from /Users/ravanv/PycharmProjects/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 698 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 698 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 76 entries
[INFO] Parsed data for the following medications:
[INFO] acalabrutinib
[INFO] nintedanib
[INFO] bosutinib
[INFO] entrectinib
[INFO] neratinib
[INFO] ramucirumab
[INFO] panitumumab
[INFO] lorlatinib
[INFO] midostaurin
[INFO] palbociclib
[INFO] afatinib
[INFO] alpelisib
[INFO] dacomitinib
[INFO] selumetinib
[INFO] crizotinib
[INFO] ceritinib
[INFO] olaratumab
[INFO] cetuximab
[INFO] tofacitinib
[INFO] vemurafenib
[INFO] pertuzumab
[INFO] trastuzumab
[INFO] sunitinib
[INFO] lenvatinib
[INFO] fostamatinib
[INFO] tucatinib
[INFO] apatinib
[INFO] erdafitinib
[INFO] neci

In [6]:
positive_train_df, negative_train_df, prediction_df = \
   dsGen.get_data_for_novel_prediction(current_year=target_year)



Skipping random link(ncbigene5159,meshd008223) since we found it in the positive set
Skipping random link(ncbigene3932,meshd002277) since we found it in the positive set
Skipping random link(ncbigene4921,meshd017253) since we found it in the positive set
Skipping random link(ncbigene147746,meshd014523) since we found it in the positive set
Skipping random link(ncbigene1436,meshd000008) since we found it in the positive set
Skipping random link(ncbigene5979,meshd008080) since we found it in the positive set
Skipping random link(ncbigene5159,meshd002288) since we found it in the positive set
Skipping random link(ncbigene2261,meshd017599) since we found it in the positive set
Skipping random link(ncbigene3791,meshd017253) since we found it in the positive set
Skipping random link(ncbigene1021,meshd055752) since we found it in the positive set
Skipping random link(ncbigene695,meshd054219) since we found it in the positive set
Skipping random link(ncbigene25,meshd000230) since we found it i

# Positive training data
We use this data frame for training and testing. It contains all phase 4 trials completed up to and including the target year.

In [8]:
positive_train_df.head()

Unnamed: 0,cancer,mesh_id,kinase,gene_id,pki,nct,phase,year
0,Breast Neoplasms,meshd001943,CDK4,ncbigene1019,abemaciclib,NCT04707196;NCT03988114;NCT04031885,Phase 4,2019
1,Breast Neoplasms,meshd001943,CDK6,ncbigene1021,abemaciclib,NCT04707196;NCT03988114;NCT04031885,Phase 4,2019
2,Neoplasm Metastasis,meshd009362,CDK4,ncbigene1019,abemaciclib,NCT04707196,Phase 4,2021
3,Neoplasm Metastasis,meshd009362,CDK6,ncbigene1021,abemaciclib,NCT04707196,Phase 4,2021
4,"Carcinoma, Non-Small-Cell Lung",meshd002289,EGFR,ncbigene1956,afatinib,NCT04413201;NCT02695290;NCT02208843;NCT0435611...,Phase 4,2014
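
The selection rule described above (phase 4 trials dated no later than the target year) can be illustrated on a toy table. This is only a sketch under stated assumptions: `select_positive_links` and `toy_trials` are hypothetical names, the rows are made up for illustration, and the code is not the implementation used inside `KcetDatasetGenerator`. It reuses the column names shown in the `head()` output.

```python
import pandas as pd

# Toy rows that reuse the column names shown by positive_train_df.head().
# The values are illustrative and are not read from clinical_trials_by_phase.tsv.
toy_trials = pd.DataFrame({
    "gene_id": ["ncbigene1019", "ncbigene1956", "ncbigene1021", "ncbigene5159"],
    "mesh_id": ["meshd001943", "meshd002289", "meshd009362", "meshd001943"],
    "phase": ["Phase 4", "Phase 4", "Phase 3", "Phase 4"],
    "year": [2019, 2014, 2020, 2023],
})


def select_positive_links(trials, target_year, phase="Phase 4"):
    """Hypothetical helper: keep links from trials of the given phase
    whose year is less than or equal to target_year."""
    mask = (trials["phase"] == phase) & (trials["year"] <= target_year)
    links = trials.loc[mask, ["gene_id", "mesh_id"]]
    return links.drop_duplicates().reset_index(drop=True)


toy_positive = select_positive_links(toy_trials, target_year=2021)
print(toy_positive)
```

In the toy table, the phase 3 row and the row dated 2023 are dropped, so two links remain.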


In [9]:
# Write the gene_id/mesh_id link columns to a TSV file for machine learning
pos_train_df = positive_train_df[['gene_id','mesh_id']]
outname = "positive_training_upto_{}.tsv".format(target_year)
pos_train_df.to_csv(outname, sep='\t')
print(pos_train_df.head())
print("Number of positive training links: {}".format(pos_train_df.shape[0]))

        gene_id      mesh_id
0  ncbigene1019  meshd001943
1  ncbigene1021  meshd001943
2  ncbigene1019  meshd009362
3  ncbigene1021  meshd009362
4  ncbigene1956  meshd002289
Number of positive training links: 702
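
The cell above calls `to_csv` without `index=False`, so pandas also writes the row index as a first, unnamed column. When such a file is read back, that column appears as `Unnamed: 0`, which is consistent with the `head()` output of the positive data shown earlier. The following sketch demonstrates the difference on an in-memory buffer; whether downstream code expects the index column is an assumption you would need to check before changing the call.

```python
import io

import pandas as pd

links = pd.DataFrame({
    "gene_id": ["ncbigene1019", "ncbigene1021"],
    "mesh_id": ["meshd001943", "meshd001943"],
})

# Default behaviour: the row index is written as a first column with an empty header.
with_index = io.StringIO()
links.to_csv(with_index, sep="\t")
with_index.seek(0)
reread_with_index = pd.read_csv(with_index, sep="\t")

# index=False: only the two link columns are written.
without_index = io.StringIO()
links.to_csv(without_index, sep="\t", index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index, sep="\t")

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```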


# Negative training data
This data is used for training and testing. It contains a random sample of kinase-cancer combinations that were not positive as of the target year; the sample is ``factor`` times (default 10) as large as the positive set. Note that the dataframes for negative examples have only the columns ``gene_id`` and ``mesh_id``.

In [10]:
print(negative_train_df.head())
outname = "negative_training_upto_{}.tsv".format(target_year)
negative_train_df.to_csv(outname, sep='\t')
print("Number of negative training links: {}".format(negative_train_df.shape[0]))

       mesh_id         gene_id
0  meshd018218  ncbigene440275
1  meshd054973    ncbigene4831
2  meshd045888    ncbigene9262
3  meshd008479    ncbigene2580
4  meshd052958    ncbigene8851
Number of negative training links: 5590
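
The negative sampling idea described above, together with the "Skipping random link ... since we found it in the positive set" messages, suggests a rejection-sampling loop. The sketch below is one plausible way to write such a loop and is not the kcet implementation: `sample_negative_links`, its `seed` parameter, and the toy identifier lists are assumptions made for illustration.

```python
import random


def sample_negative_links(kinases, cancers, positives, factor=10, seed=42):
    """Hypothetical helper: draw factor * len(positives) distinct random
    (gene_id, mesh_id) pairs, rejecting any pair that is a known positive."""
    kinases = sorted(set(kinases))
    cancers = sorted(set(cancers))
    positive_set = set(positives)
    n_wanted = factor * len(positive_set)
    # Conservative check so that the loop below is guaranteed to be satisfiable.
    if n_wanted > len(kinases) * len(cancers) - len(positive_set):
        raise ValueError("not enough non-positive links to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_wanted:
        link = (rng.choice(kinases), rng.choice(cancers))
        if link in positive_set:
            continue  # skip random links that are in the positive set
        negatives.add(link)  # a set ignores repeated draws of the same pair
    return sorted(negatives)


toy_kinases = ["ncbigene1019", "ncbigene1021", "ncbigene1956",
               "ncbigene5159", "ncbigene3932", "ncbigene4921"]
toy_cancers = ["meshd001943", "meshd009362", "meshd002289",
               "meshd008223", "meshd002277"]
toy_positives = [("ncbigene1019", "meshd001943"), ("ncbigene1021", "meshd001943")]

toy_negatives = sample_negative_links(toy_kinases, toy_cancers, toy_positives)
print(len(toy_negatives))
```

Fixing the seed makes the toy sample reproducible; the real generator may or may not do this.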


# Prediction data
This set contains all possible links between untargeted protein kinases
and cancers, except those in the positive dataset calculated up to the current year.

In [11]:
print(prediction_df.head())

          mesh_id        gene_id
0     meshd000008  ncbigene23552
1  meshd000069293  ncbigene23552
2  meshd000069295  ncbigene23552
3  meshd000069584  ncbigene23552
4  meshd000070779  ncbigene23552


In [12]:
outname = "predictions.tsv"
prediction_df.to_csv(outname, sep='\t')
print("Number of prediction links: {}".format(prediction_df.shape[0]) )

Number of prediction links: 356447
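
Conceptually, the prediction set is the Cartesian product of kinases and cancers minus the known positive links. The sketch below shows that construction on toy identifiers. It is an illustration under stated assumptions, not the code behind `get_data_for_novel_prediction`: `build_prediction_links` is a hypothetical helper, and the choice of which kinases count as "untargeted" is left to the caller.

```python
from itertools import product


def build_prediction_links(kinases, cancers, positives):
    """Hypothetical helper: every (gene_id, mesh_id) pair from the two lists
    that is not already a known positive link."""
    positive_set = set(positives)
    return [(gene_id, mesh_id)
            for gene_id, mesh_id in product(kinases, cancers)
            if (gene_id, mesh_id) not in positive_set]


toy_kinases_pred = ["ncbigene1019", "ncbigene1021", "ncbigene23552"]
toy_cancers_pred = ["meshd000008", "meshd001943", "meshd002289", "meshd009362"]
toy_positives_pred = [("ncbigene1019", "meshd001943"), ("ncbigene1021", "meshd001943")]

toy_prediction = build_prediction_links(
    toy_kinases_pred, toy_cancers_pred, toy_positives_pred)
print(len(toy_prediction))
```

With 3 kinases, 4 cancers, and 2 positive links, the toy prediction set has 3 * 4 - 2 = 10 links.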
