# Dataset Generation

This notebook shows how to generate datasets for the historical validation. We choose a target year (say 2014) and use all phase 4 clinical trials available up to that year for training and testing. We additionally export files with positive data for all subsequent years. As negative data, we choose all combinations of kinases and cancers that have remained negative up to the present day.

In [None]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from kcet import KcetDatasetGenerator

# Input file
We have placed a version of ``clinical_trials_by_phase.tsv`` into the ``data`` subfolder in the ``notebooks`` subfolder of kcet.

In [None]:
ctfile = os.path.abspath(os.path.join('data', 'clinical_trials_by_phase.tsv'))

## Target year
In this example, we show how to extract datasets for the target year 2015 (the value assigned to ``target_year`` in the next cell).

In [None]:
target_year = 2015

In [None]:
dsGen = KcetDatasetGenerator(clinical_trials=ctfile)

In [None]:
df = dsGen._get_positive_data_set(target_year)
df.head()


In [None]:
d = dsGen._get_positive_validation_data_set(target_year)
d2 = dsGen._get_negative_validation_data_set(negative_df=d)
d2.head()

In [None]:
positive_df, negative_df, positive_validation_df, negative_validation_df = \
   dsGen.get_data_for_target_year(target_year=target_year)

## Positive data
We use this data frame for training and testing. It contains all phase 4 trials completed up to and including the
target year.

In [None]:
positive_df.head()
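
The temporal split described above can be illustrated on a toy table. This is only a sketch of the idea, not the kcet implementation: the column names ``phase`` and ``year`` and all identifiers in the toy data are hypothetical, and the real ``KcetDatasetGenerator`` may organise its data differently.

```python
import pandas as pd

# Toy records. The columns 'phase' and 'year' and all IDs are hypothetical
# and are not taken from the kcet source.
trials = pd.DataFrame({
    "gene_id": ["gene_a", "gene_b", "gene_c", "gene_d"],
    "mesh_id": ["mesh_a", "mesh_b", "mesh_c", "mesh_d"],
    "phase": [4, 4, 2, 4],
    "year": [2010, 2015, 2012, 2018],
})
target_year = 2015

# Training/test positives: phase 4 trials up to and including the target year.
train_pos = trials[(trials["phase"] == 4) & (trials["year"] <= target_year)]

# Validation positives: trials of any phase in the years after the target year.
valid_pos = trials[trials["year"] > target_year]
```

Here ``train_pos`` keeps the 2010 and 2015 phase 4 rows, and ``valid_pos`` keeps only the 2018 row.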

# Output file
We write the positive training data to a file for machine learning as follows:

In [None]:
pos_df = positive_df[['gene_id','mesh_id']]
# Uncomment to write the output file
#outname = f"positive_upto_{target_year}.tsv"
#pos_df.to_csv(outname, sep='\t')
pos_df.head()

# Negative data
This data is used for training and testing. It contains a random sample of kinase-cancer combinations that were not positive as of the target year. The sample is larger than the positive set by a given factor (default=10). Note that the data frames for negative examples have only the columns ``gene_id`` and ``mesh_id``.

In [None]:
negative_df.head()
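
To make the sampling idea concrete, here is a minimal sketch of drawing negative pairs at a fixed multiple of the positives. The function ``sample_negatives`` and its parameters are hypothetical and written for illustration only; they are not part of kcet, whose own sampling logic may differ.

```python
import itertools
import random

import pandas as pd


def sample_negatives(positive_df, kinases, cancers, factor=10, seed=0):
    """Hypothetical helper: sample kinase-cancer pairs absent from positive_df."""
    positives = set(zip(positive_df["gene_id"], positive_df["mesh_id"]))
    # Every kinase-cancer combination that is not a known positive.
    candidates = [pair for pair in itertools.product(kinases, cancers)
                  if pair not in positives]
    # Draw factor times as many negatives as positives, capped by availability.
    n = min(factor * len(positives), len(candidates))
    sampled = random.Random(seed).sample(candidates, n)
    return pd.DataFrame(sampled, columns=["gene_id", "mesh_id"])


# Toy usage with made-up identifiers.
pos = pd.DataFrame({"gene_id": ["k1"], "mesh_id": ["c1"]})
neg = sample_negatives(pos, kinases=["k1", "k2", "k3"], cancers=["c1", "c2"], factor=3)
```

Fixing the seed keeps the sample reproducible between runs, which helps when comparing models trained on the same negatives.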

# Positive validation data
This data frame covers all four study phases and all years following the target year. We should predict them
all at once and then use a postprocessing script to work out the results for each year.

In [None]:
positive_validation_df.head()
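
The per-year postprocessing could look roughly like the following sketch. The ``year`` and ``score`` columns and the toy values are assumptions made for illustration, and the mean score merely stands in for whatever evaluation metric is actually used.

```python
import pandas as pd

# Hypothetical prediction table: 'year' and 'score' are assumed column names.
preds = pd.DataFrame({
    "gene_id": ["k1", "k2", "k3", "k4"],
    "mesh_id": ["c1", "c2", "c3", "c4"],
    "year": [2016, 2016, 2017, 2018],
    "score": [0.9, 0.5, 0.7, 0.2],
})

# Predictions are made once for all years; results are then summarised per year.
per_year = preds.groupby("year")["score"].agg(["count", "mean"])
```

Each row of ``per_year`` then describes one validation year, so a single prediction run serves every year after the target year.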

# Negative validation data
This is similar to the negative training data in that it just has two columns.

In [None]:
negative_validation_df.head()