# Dataset Generation

This notebook shows how to generate datasets for historical validation. We choose a target year and retrieve all available phase 4 clinical trials to use for training and testing. We additionally export files with positive data for at least one year after the target year. As negative data, we choose all combinations of kinases and cancers that have remained negative up to the present day.

In [1]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
from kcet import KcetDatasetGenerator

# Input file
We require the ``clinical_trials_by_phase.tsv`` file. TODO DESCRIBE HOW THIS WAS GENERATED.
For convenience, the file that was used for this work is available from the project's Zenodo repository: https://zenodo.org/record/5329035#.YTyvZXspBl9
Download the file and enter its path at the prompt of the following cell.

In [2]:
ctfile = input()

/home/peter/data/pubmed2vec/clinical_trials_by_phase.tsv
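Because the path is typed in by hand, it can help to check it before passing it to the generator. The following is a minimal sketch of such a check; the helper name `check_input_file` is hypothetical and is not part of the kcet package.

```python
import os


def check_input_file(path):
    """Return the absolute path of an existing file, or raise FileNotFoundError.

    Hypothetical helper for illustration; it is not part of the kcet package.
    """
    # Strip whitespace left over from typing or pasting, and expand "~".
    resolved = os.path.abspath(os.path.expanduser(path.strip()))
    if not os.path.isfile(resolved):
        raise FileNotFoundError("Input file not found: {}".format(resolved))
    return resolved
```

It could be called as `ctfile = check_input_file(ctfile)` directly after the input cell, so that a mistyped path fails early with a clear message.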


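## Target year
In this example, we extract training data up to the target year and validation data for the period after the target year up to a later year (specified here by `mid_year` and `num_years_after_the_mid_year`).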

In [3]:
target_year = 2010
mid_year = 2019
num_years_after_the_mid_year = 1

In [4]:
dsGen = KcetDatasetGenerator(clinical_trials=ctfile)

[INFO] Reading protein kinase information from /home/peter/GIT/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 698 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 698 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 76 entries
[INFO] Parsed data for 70 medications.


In [5]:
pos_train_df, neg_train_df, pos_validation_df, neg_validation_df = \
    dsGen.get_data_years_after_target_year_upto_later_year(
        target_year=target_year,
        mid_year=mid_year,
        num_years_later=num_years_after_the_mid_year)

[INFO] We generated a negative training set with 1350 examples (the positive set has 4087)
[INFO] We generated a negative validation set with 1350 examples (the positive set has 4087)


# Write data to files

The data looks like this. Note that downstream analysis only requires links between gene ids (protein kinases) and MeSH ids (cancers), and we only write out these two fields.

### Negative training data
`neg_train_df` is used for training and testing. It contains a random sample of kinase-cancer combinations that were not positive as of the target year; the size of the sample is determined by a factor (default: 10). Note that the dataframes for negative examples have only the columns `gene_id` and `mesh_id`.
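The idea of drawing random kinase-cancer combinations that are not known positives can be sketched as follows. This is only an illustration under simple assumptions, not the `KcetDatasetGenerator` implementation; the function name `sample_negative_links`, its parameters, and the toy identifiers are all made up for this example.

```python
import itertools
import random

import pandas as pd


def sample_negative_links(gene_ids, mesh_ids, positive_links, n_samples, seed=42):
    """Sample kinase-cancer pairs that are not in the positive set.

    Illustrative only; this is not the KcetDatasetGenerator implementation.
    """
    positives = set(positive_links)
    # Enumerate every kinase-cancer combination and drop the known positives.
    candidates = [pair for pair in itertools.product(gene_ids, mesh_ids)
                  if pair not in positives]
    rng = random.Random(seed)
    # Never ask for more pairs than there are candidates.
    chosen = rng.sample(candidates, min(n_samples, len(candidates)))
    return pd.DataFrame(chosen, columns=["gene_id", "mesh_id"])


# Toy identifiers, made up for this example.
genes = ["ncbigene1", "ncbigene2", "ncbigene3"]
cancers = ["meshd1", "meshd2"]
positive = [("ncbigene1", "meshd1"), ("ncbigene2", "meshd2")]
neg_df = sample_negative_links(genes, cancers, positive, n_samples=3)
print(neg_df)
```

Enumerating all combinations is affordable here because the numbers of kinases and cancers are in the hundreds; a fixed seed keeps the sample reproducible.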

### Positive validation data
This file has data for all four study phases and for all years following the target year. We should predict them all at once and then use a postprocessing script to work out the results for each year.
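A per-year postprocessing step of this kind might look like the sketch below. It assumes that predictions are available as a table of scored links and that a second table records the year of each link; the column names `score` and `year`, the function name `summarize_by_year`, and the toy data are assumptions made for this example and are not taken from the kcet code.

```python
import pandas as pd


def summarize_by_year(scored_links, link_years):
    """Attach a year to each scored link and summarize the scores per year.

    Both inputs share `gene_id` and `mesh_id`; the `score` and `year`
    column names are assumptions made for this sketch.
    """
    merged = scored_links.merge(link_years, on=["gene_id", "mesh_id"], how="inner")
    summary = merged.groupby("year")["score"].agg(["count", "mean"])
    summary = summary.rename(columns={"count": "n_links", "mean": "mean_score"})
    return summary.reset_index()


# Toy data, made up for this example.
scored = pd.DataFrame({"gene_id": ["g1", "g2", "g3"],
                       "mesh_id": ["m1", "m1", "m2"],
                       "score": [0.9, 0.5, 0.7]})
years = pd.DataFrame({"gene_id": ["g1", "g2", "g3"],
                      "mesh_id": ["m1", "m1", "m2"],
                      "year": [2011, 2011, 2012]})
per_year = summarize_by_year(scored, years)
print(per_year)
```

Predicting once and splitting afterwards means that every year is evaluated against the same set of scores.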

In [None]:
pos_train_df.head()

In [None]:
print("Number of positive training links: {}".format(pos_train_df.shape[0]))
print("Number of negative training links: {}".format(neg_train_df.shape[0]))
print("Number of positive validation links: {}".format(pos_validation_df.shape[0]))
print("Number of negative validation links: {}".format(neg_validation_df.shape[0]))

In [None]:
# Positive train
pos_train_df = pos_train_df[['gene_id','mesh_id']]
outname = "positive_training_upto_{}.tsv".format(target_year)
pos_train_df.to_csv(outname, sep='\t')
# Negative train
outname = "negative_training_upto_{}.tsv".format(target_year)
neg_train_df.to_csv(outname, sep='\t')
# Positive validation
pos_validation_df = pos_validation_df[['gene_id','mesh_id']]
outname = "positive_validation_{}_years_after_{}_target_{}.tsv".format(
    num_years_after_the_mid_year, mid_year, target_year)
pos_validation_df.to_csv(outname, sep='\t')
# Negative validation 
outname = "negative_validation_upto_{}.tsv".format(target_year)
neg_validation_df.to_csv(outname, sep='\t')
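Note that `to_csv` with default settings also writes the DataFrame index as an unnamed first column. The sketch below shows how such a file could be read back; it uses a toy DataFrame and a temporary directory (both made up for this example) rather than the files written above.

```python
import os
import tempfile

import pandas as pd

# Toy links, made up for this example.
links = pd.DataFrame({"gene_id": ["ncbigene1", "ncbigene2"],
                      "mesh_id": ["meshd1", "meshd2"]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example_links.tsv")
    # Same call as in the cell above: the index is written as the first column.
    links.to_csv(path, sep="\t")
    # index_col=0 restores that first column as the index when reading.
    restored = pd.read_csv(path, sep="\t", index_col=0)

print(restored)
```

If downstream tools expect only the two identifier columns, passing `index=False` to `to_csv` would omit the index column; whether that is appropriate depends on how the files are consumed.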