# Untargeted
This demo notebook shows how to extract lists of targeted and untargeted kinases, optionally restricted to studies started up to a given year.

In [1]:
import pandas as pd
import os
import sys

In [2]:
sys.path.insert(0, os.path.abspath('..'))
from kcet import UnTargetedKinases

In [3]:
clinical_trials_by_phase_example = '../example/clinical_trials_by_phase.tsv'
if not os.path.exists(clinical_trials_by_phase_example):
    raise FileNotFoundError("Could not find example/clinical_trials_by_phase.tsv")

By default, if no year is provided, the parser considers all studies up to the current year.

In [4]:
untargeted = UnTargetedKinases(clinical_trials_by_phase=clinical_trials_by_phase_example)

[INFO] Reading protein kinase information from /home/peter/GIT/KCET/input/prot_kinase.tsv
[INFO] Parsed data for the following medications:
[INFO] imatinib
[INFO] afatinib


Get a data frame with targeted kinases. The fact that a kinase is targeted is 
derived from the ``clinical_trials_by_phase.tsv`` file, which is created by [yactp](https://github.com/monarch-initiative/yactp).

In [5]:
target_kinases = untargeted.get_targeted_kinases_with_gene_id()
target_kinases.head()

Unnamed: 0,gene.id,kinase
0,ncbigene25,ABL1
1,ncbigene1956,EGFR
2,ncbigene2064,ERBB2
3,ncbigene5156,PDGFRA
4,ncbigene5159,PDGFRB


Likewise, we can get a dataframe of kinases that have not been targeted yet. The purpose of this project is to associate
a previously untargeted kinase with a form of cancer in which inhibition of the kinase will be 
therapeutically useful.

In [6]:
untargeted_kinases = untargeted.get_untargeted_kinases_with_gene_id()
untargeted_kinases.head(10)

Unnamed: 0,gene.id,kinase
0,ncbigene23552,CDK20
1,ncbigene8767,RIPK2
2,ncbigene7046,TGFBR1
3,ncbigene5289,PIK3C3
4,ncbigene7301,TYRO3
5,ncbigene2322,FLT3
6,ncbigene8536,CAMK1
7,ncbigene2869,GRK5
8,ncbigene81788,NUAK2
9,ncbigene23049,SMG1


In [7]:
print("We got a total of %d untargeted kinases" % len(untargeted_kinases))

We got a total of 517 untargeted kinases
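
Conceptually, the untargeted list is the complement of the targeted list within the full set of protein kinases. The following is a minimal sketch of that idea in plain Python, not the kcet implementation; the function name `split_kinases` and the five-element toy list are assumptions made for illustration.

```python
def split_kinases(all_kinases, targeted):
    """Partition kinases into (targeted, untargeted), preserving input order."""
    targeted_set = set(targeted)
    hit = [k for k in all_kinases if k in targeted_set]
    miss = [k for k in all_kinases if k not in targeted_set]
    return hit, miss


# Toy universe built from symbols shown in the outputs above (illustration only).
all_kinases = ["ABL1", "EGFR", "CDK20", "RIPK2", "FLT3"]
targeted = ["ABL1", "EGFR"]

hit, miss = split_kinases(all_kinases, targeted)
print("targeted:", hit)
print("untargeted:", miss)
```

Every kinase ends up in exactly one of the two lists, so the two lengths always add up to the size of the full kinase list.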


# Phase 4
We also want to extract a list of kinases targeted by phase 4 studies; these are our positive training kinases. Note that there is no 'phase 4 untargeted' list; instead, we simply use the list of untargeted kinases obtained above.

In [8]:
targeted_kinases_phase_4 = untargeted.get_targeted_kinases_with_gene_id_phase_4()
targeted_kinases_phase_4.head()

Unnamed: 0,gene.id,kinase
0,ncbigene25,ABL1
1,ncbigene1956,EGFR
2,ncbigene2064,ERBB2
3,ncbigene5156,PDGFRA
4,ncbigene5159,PDGFRB
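
As a rough sketch of how the two lists could be assembled, the snippet below keeps kinases with a phase 4 study as positives and takes the other set directly from the untargeted list. It is not the kcet code: the record layout, the phase values, the `KINASE_A`/`KINASE_B` symbols, and the `training_sets` helper are all hypothetical.

```python
# Hypothetical (kinase, highest_phase) pairs; the phase values are placeholders.
RECORDS = [
    ("ABL1", 4),
    ("EGFR", 4),
    ("KINASE_A", 2),  # made-up symbol, only reached phase 2
    ("KINASE_B", 3),  # made-up symbol, only reached phase 3
]

# A short stand-in for the untargeted list.
UNTARGETED = ["CDK20", "RIPK2"]


def training_sets(records, untargeted):
    """Return (phase 4 kinases, untargeted kinases) as two sorted lists."""
    positives = sorted({kinase for kinase, phase in records if phase == 4})
    # The second list is the untargeted list itself, minus any accidental overlap.
    others = sorted(set(untargeted) - set(positives))
    return positives, others


positives, others = training_sets(RECORDS, UNTARGETED)
print(positives)
print(others)
```

Kinases targeted only in earlier phases (the two made-up symbols here) fall into neither list under this scheme.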


# Limit by year
To assess the accuracy of the algorithm, historical snapshots are taken for a given year. All data about studies of a
protein-kinase inhibitor that were started after this year are discarded. In this example, we
choose the year 2000, and only 3 of the original 5 protein kinases are identified as targeted.

In [9]:
untargeted_2000 = UnTargetedKinases(clinical_trials_by_phase=clinical_trials_by_phase_example, year=2000)

[INFO] Reading protein kinase information from /home/peter/GIT/KCET/input/prot_kinase.tsv
[INFO] Parsed data for the following medications:
[INFO] imatinib
[INFO] afatinib


In [10]:
target_kinases = untargeted_2000.get_targeted_kinases_with_gene_id()
target_kinases.head()

Unnamed: 0,gene.id,kinase
0,ncbigene25,ABL1
1,ncbigene5156,PDGFRA
2,ncbigene5159,PDGFRB
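
The year cutoff can be pictured as a filter on study start years. Below is a minimal sketch under stated assumptions: the `(kinase, medication, start_year)` record layout, the `targeted_up_to` helper, and the start years themselves are hypothetical placeholders, not values read from the example file or the way `UnTargetedKinases` is implemented.

```python
# Hypothetical records: (kinase, medication, year the study started).
# The years are placeholders for illustration only.
STUDIES = [
    ("ABL1", "imatinib", 1999),
    ("PDGFRA", "imatinib", 1999),
    ("PDGFRB", "imatinib", 1999),
    ("EGFR", "afatinib", 2008),
    ("ERBB2", "afatinib", 2008),
]


def targeted_up_to(studies, year):
    """Return sorted kinases with at least one study started in or before `year`."""
    return sorted({kinase for kinase, _medication, start in studies if start <= year})


snapshot_2000 = targeted_up_to(STUDIES, 2000)
snapshot_2020 = targeted_up_to(STUDIES, 2020)
print(snapshot_2000)
print(len(snapshot_2020))
```

With these placeholder years, the 2000 snapshot retains only the three kinases associated with the earlier medication, mirroring the 3-of-5 result above.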


# File output
Note that the kce_tool app writes these dataframes to file. The following examples show the filenames for year=2017.
* targeted_kinases_2017.tsv
* untargeted_kinases_2017.tsv
* targeted_kinases_phase_4_2017.tsv
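
The naming pattern in the list above can be sketched as follows. This is an illustration only: `output_filenames` and `write_tsv` are hypothetical helpers, and the two-column `gene.id`/`kinase` layout is assumed from the dataframes shown earlier rather than taken from the app's source.

```python
import csv
import os
import tempfile


def output_filenames(year):
    """Build the three file names for a given year, following the pattern above."""
    return [
        f"targeted_kinases_{year}.tsv",
        f"untargeted_kinases_{year}.tsv",
        f"targeted_kinases_phase_4_{year}.tsv",
    ]


def write_tsv(path, rows):
    """Write (gene_id, kinase) pairs as a tab-separated file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene.id", "kinase"])
        writer.writerows(rows)


# Write one small file into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as outdir:
    names = output_filenames(2017)
    path = os.path.join(outdir, names[0])
    write_tsv(path, [("ncbigene25", "ABL1"), ("ncbigene1956", "EGFR")])
    with open(path) as handle:
        lines = handle.read().splitlines()

print(names)
print(lines)
```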