# Dataset Generation

This notebook shows how to generate datasets for historical validation. We fix the target year to the current year (2021) and use all available phase 4 clinical trials for testing and training. For the prediction set, we consider all possible links between untargeted protein kinases and cancers, excluding the positive links found up to the current year.

In [None]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
from kcet import KcetDatasetGenerator

# Input file
We have placed a version of ``clinical_trials_by_phase.tsv`` into the ``data`` subfolder in the ``notebooks`` subfolder of kcet.

In [None]:
ctfile = os.path.abspath(os.path.join('../data', 'clinical_trials_by_phase.tsv'))

## Target year
In this example, we show how to extract datasets for the target year 2021.

In [None]:
target_year = 2021

In [None]:
dsGen = KcetDatasetGenerator(clinical_trials=ctfile)

In [None]:
positive_train_df, negative_train_df, prediction_df = \
   dsGen.get_data_for_novel_prediction(target_year=target_year)



# Positive training data
We use this data frame for test and training. It contains all phase 4 trials completed up to and including the target year.

In [None]:
positive_train_df.head()
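The selection rule described above (phase 4 trials up to and including the target year) can be illustrated on a toy table. This is only a sketch: the column names ``phase`` and ``year`` and all identifiers are hypothetical and are not taken from ``clinical_trials_by_phase.tsv`` or from ``KcetDatasetGenerator``, whose internal logic may differ.

```python
import pandas as pd

# Toy trial table; column names and values are hypothetical placeholders.
toy_trials = pd.DataFrame({
    "gene_id": ["kinaseA", "kinaseB", "kinaseC", "kinaseD"],
    "mesh_id": ["cancer1", "cancer2", "cancer3", "cancer4"],
    "phase": ["Phase 4", "Phase 2", "Phase 4", "Phase 4"],
    "year": [2015, 2018, 2021, 2022],
})

toy_target_year = 2021

# Keep only phase 4 rows whose year is not later than the target year.
is_phase4 = toy_trials["phase"] == "Phase 4"
in_time = toy_trials["year"] <= toy_target_year
toy_positive_df = toy_trials[is_phase4 & in_time]

print(toy_positive_df[["gene_id", "mesh_id"]])
```

In this toy table the phase 2 trial and the trial from 2022 are dropped, while the trial from the target year itself is kept.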

In [None]:
# We write the file for machine learning as follows.
pos_train_df = positive_train_df[['gene_id','mesh_id']]
outname = "positive_training_upto_{}.tsv".format(target_year)
pos_train_df.to_csv(outname, sep='\t')
print(pos_train_df.head())
print("Number of positive training links: {}".format(pos_train_df.shape[0]))
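Because ``to_csv`` is called without ``index=False``, the row index is written as an unnamed first column of the TSV file. The following sketch shows one way to read such a file back with pandas. It uses an in-memory buffer and made-up identifiers instead of the real output file, so it is an illustration rather than part of the kcet workflow.

```python
import io

import pandas as pd

# Toy stand-in for pos_train_df; the identifiers are made up.
toy_df = pd.DataFrame({
    "gene_id": ["geneA", "geneB"],
    "mesh_id": ["diseaseA", "diseaseB"],
})

# Same call pattern as above: tab-separated, index included.
buffer = io.StringIO()
toy_df.to_csv(buffer, sep="\t")
buffer.seek(0)

# index_col=0 tells pandas that the first column holds the saved index.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```

Without ``index_col=0``, the saved index would show up as an extra data column when the file is read back.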

# Negative training data
This data is used for testing and training. It is a random sample of kinase-cancer combinations that were not positive as of the target year, containing ``factor`` times (default=10) as many links as the positive set. Note that the dataframes for negative examples have only the columns ``gene_id`` and ``mesh_id``.

In [None]:
print(negative_train_df.head())
outname = "negative_training_upto_{}.tsv".format(target_year)
negative_train_df.to_csv(outname, sep='\t')
print("Number of negative training links: {}".format(negative_train_df.shape[0]))
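The sampling idea behind the negative set can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code used by ``KcetDatasetGenerator``: the helper ``sample_negative_links``, its parameters, and all identifiers are hypothetical, and the real implementation may restrict the candidate kinases and cancers differently.

```python
import itertools
import random

import pandas as pd


def sample_negative_links(kinases, cancers, positive_links, factor=10, seed=42):
    """Sample kinase-cancer pairs that are not positive links (hypothetical helper).

    Returns up to factor * (number of positive links) pairs, or every
    non-positive pair if fewer candidates exist.
    """
    positives = set(positive_links)
    # All kinase-cancer combinations that are not known positives.
    candidates = [pair for pair in itertools.product(kinases, cancers)
                  if pair not in positives]
    n_wanted = min(factor * len(positives), len(candidates))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = rng.sample(candidates, n_wanted)  # sampling without replacement
    return pd.DataFrame(sampled, columns=["gene_id", "mesh_id"])


# Toy inputs; the identifiers are placeholders.
kinases = ["kinaseA", "kinaseB", "kinaseC", "kinaseD"]
cancers = ["cancer1", "cancer2", "cancer3", "cancer4", "cancer5", "cancer6"]
positive_links = [("kinaseA", "cancer1"), ("kinaseB", "cancer2")]

toy_negative_df = sample_negative_links(kinases, cancers, positive_links, factor=10)
print(toy_negative_df.shape[0])
```

With 2 positive links and ``factor=10`` the sketch returns 20 of the 22 available non-positive pairs.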

# Prediction data
This set contains all possible links between untargeted protein kinases
and the cancers, except the positive dataset calculated up to the current year.

In [None]:
print(prediction_df.head())

In [None]:
outname = "predictions.tsv"
prediction_df.to_csv(outname, sep='\t')
print("Number of prediction links: {}".format(prediction_df.shape[0]) )
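The construction of the prediction set can be sketched as a cross product followed by the removal of known positive links. The example below is a toy illustration with placeholder identifiers. How kcet decides which kinases count as untargeted is not shown here, so the list ``untargeted_kinases`` is simply assumed as given.

```python
import itertools

import pandas as pd

# Placeholder inputs; in the notebook these come from KcetDatasetGenerator.
untargeted_kinases = ["kinaseX", "kinaseY"]
cancers = ["cancer1", "cancer2", "cancer3"]

# One positive link, included only to demonstrate the exclusion step.
toy_positive_df = pd.DataFrame({"gene_id": ["kinaseX"], "mesh_id": ["cancer2"]})

# Every kinase-cancer combination.
all_links = pd.DataFrame(
    list(itertools.product(untargeted_kinases, cancers)),
    columns=["gene_id", "mesh_id"],
)

# Anti-join: keep the rows of all_links that have no match among the positives.
merged = all_links.merge(
    toy_positive_df, on=["gene_id", "mesh_id"], how="left", indicator=True
)
toy_prediction_df = (
    merged[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(toy_prediction_df)
```

Of the 6 possible combinations in this toy example, 5 remain after the positive link is removed.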