# KcetParser

This notebook demonstrates some of the parsing functionality of the ``kcet`` package and documents it for clarity. In normal use of the package, all of these steps run under the hood, so end users do not need to study these functions.

In [1]:
import pandas as pd
import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from kcet import CTParserByPhase, KcetParser

In [2]:
kcet = KcetParser()

[INFO] Reading protein kinase information from /home/peter/GIT/KCET/input/prot_kinase.tsv
[INFO] ingested symbol_to_id_map with 522 entries such as {'NCBIGene:2870': 'GRK6'}
[INFO] Ingested mesh_id list with 694 entries such as 'meshd000008' and 'meshd000069293', 
[INFO] Ingested _meshid2disease_map with 694 entries
[INFO] Ingested meshid2disease_map with 514 entries
[INFO] Ingested pki_to_kinase with 83 entries


## Gene symbol to gene id map

In [3]:
# Print the first five (gene symbol, gene id) entries.
for key, value in list(kcet.get_symbol_to_id_map().items())[:5]:
    print(key, value)

CDK20 ncbigene23552
RIPK2 ncbigene8767
TGFBR1 ncbigene7046
PIK3C3 ncbigene5289
TYRO3 ncbigene7301
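
Since the map is iterated with ``.items()``, it behaves like a dictionary keyed by gene symbol, so a reverse lookup from gene id to symbol can be built by inverting it. The sketch below is only an illustration: ``symbol_to_id`` is a hand-copied stand-in holding the five entries printed above rather than a call into ``kcet``, and the name ``id_to_symbol`` is hypothetical and not part of the package.

```python
# Stand-in for the result of kcet.get_symbol_to_id_map(), copied from the
# five entries printed above (not the full map).
symbol_to_id = {
    "CDK20": "ncbigene23552",
    "RIPK2": "ncbigene8767",
    "TGFBR1": "ncbigene7046",
    "PIK3C3": "ncbigene5289",
    "TYRO3": "ncbigene7301",
}

# Invert the mapping. This assumes every gene id belongs to exactly one
# symbol; a repeated id would silently overwrite the earlier symbol.
id_to_symbol = {gene_id: symbol for symbol, gene_id in symbol_to_id.items()}

print(id_to_symbol["ncbigene8767"])
```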


## MeSH Id to Disease label map

In [4]:
# Print the first five (MeSH id, disease label) entries.
for key, value in list(kcet.get_mesh_to_disease_map().items())[:5]:
    print(key, value)

meshd000008 Abdominal Neoplasms
meshd000069293 Plasmablastic Lymphoma
meshd000069295 Mammary Analogue Secretory Carcinoma
meshd000069584 Unilateral Breast Neoplasms
meshd000070779 Giant Cell Tumor of Tendon Sheath


## Gene symbol to target-development level (TDL) map

In [5]:
# Print the first five (gene symbol, TDL) entries.
for key, value in list(kcet.get_symbol_to_tdl_map().items())[:5]:
    print(key, value)

AAK1 Tchem
AATK Tbio
ABL1 Tclin
ABL2 Tchem
ACVR1 Tchem


## Protein kinase inhibitor (PKI) to protein kinase map
Each PKI maps to the list of protein kinases that it inhibits.

In [6]:
# Print the first five (PKI, list of inhibited kinases) entries.
for key, value in list(kcet.get_pki_to_kinase_list_dict().items())[:5]:
    print(key, value)

PKI ['PK']
abemaciclib ['CDK4', 'CDK6']
acalabrutinib ['BTK']
actinomycin D ['DDR2']
afatinib ['EGFR', 'ERBB2']
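
A PKI that inhibits several kinases contributes one (PKI, kinase) pair per kinase. The sketch below flattens the mapping into such pairs. It assumes a plain dictionary of lists; ``pki_to_kinases`` is a hand-copied stand-in for three of the entries printed above, and ``pairs`` is an illustrative name, not something the package provides.

```python
# Stand-in for part of kcet.get_pki_to_kinase_list_dict(), copied from the
# output above.
pki_to_kinases = {
    "abemaciclib": ["CDK4", "CDK6"],
    "acalabrutinib": ["BTK"],
    "afatinib": ["EGFR", "ERBB2"],
}

# One (pki, kinase) tuple for every kinase inhibited by each PKI.
pairs = [
    (pki, kinase)
    for pki, kinases in pki_to_kinases.items()
    for kinase in kinases
]

for pki, kinase in pairs:
    print(pki, kinase)
```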


# CTParserByPhase

## Parsing the yactp file

``CTParserByPhase`` ingests the output of the ``yactp`` parser.

[Yet Another Clinical Trials Parser](https://github.com/monarch-initiative/yactp) (``yactp``) retrieves information about clinical trials involving a specified list of medications, including the diseases being treated with those medications. We interpret any clinical trial as an indication of scientific interest in treating a disease with a medication, and we interpret a Phase 4 trial as an indication that evidence exists (from a previous Phase 3 trial) that the medication may be effective against the disease.

See the README for details on ``yactp``. 

We use a small example file that is created from the ``example.txt`` file in the ``yactp`` repository and is
located in the ``example`` subdirectory of this repository. The file has a toy dataset representing two protein-kinase inhibitors.

In [7]:
ctfile = os.path.abspath(os.path.join(os.pardir, 'example', 'clinical_trials_by_phase.tsv'))

In the first example, we do not pass a ``year`` argument, so it defaults to the current year (2020 at the time of this writing).

In [8]:
parser = CTParserByPhase(clinical_trials=ctfile)

[INFO] Parsed data for the following medications:
[INFO] imatinib
[INFO] afatinib


## All phases
The following command retrieves studies from all phases (1, 2, 3, and 4). The columns ``cancer`` and ``mesh_id`` show the cancer that is treated by the protein kinase inhibitor in the ``pki`` column. The kinases that are the major targets of the PKI are shown in the ``kinase`` column, with the associated NCBI Gene id in the ``gene.id`` column. The ``nct`` column shows the ids of the study or studies in the phase shown in the ``phase`` column. The ``year`` column shows the **earliest** start year among the studies listed in the ``nct`` column. This information is parsed in the [yactp](https://github.com/monarch-initiative/yactp) project.

In [9]:
df_all = parser.get_all_phases()
df_all.head(20)

| | cancer | mesh_id | kinase | gene.id | pki | nct | phase | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Carcinoma, Non-Small-Cell Lung | meshd002289 | EGFR | ncbigene1956 | afatinib | NCT01999985;NCT03054038;NCT04448379;NCT0219189... | Phase 1 | 2009 |
| 1 | Carcinoma, Non-Small-Cell Lung | meshd002289 | ERBB2 | ncbigene2064 | afatinib | NCT01999985;NCT03054038;NCT04448379;NCT0219189... | Phase 1 | 2009 |
| 2 | Carcinoma, Non-Small-Cell Lung | meshd002289 | EGFR | ncbigene1956 | afatinib | NCT03623750;NCT02716311;NCT03399669;NCT0247006... | Phase 2 | 2007 |
| 3 | Carcinoma, Non-Small-Cell Lung | meshd002289 | ERBB2 | ncbigene2064 | afatinib | NCT03623750;NCT02716311;NCT03399669;NCT0247006... | Phase 2 | 2007 |
| 4 | Carcinoma, Non-Small-Cell Lung | meshd002289 | EGFR | ncbigene1956 | afatinib | NCT01814553;NCT01523587;NCT00949650;NCT0243872... | Phase 3 | 2008 |
| 5 | Carcinoma, Non-Small-Cell Lung | meshd002289 | ERBB2 | ncbigene2064 | afatinib | NCT01814553;NCT01523587;NCT00949650;NCT0243872... | Phase 3 | 2008 |
| 6 | Carcinoma, Non-Small-Cell Lung | meshd002289 | EGFR | ncbigene1956 | afatinib | NCT02514174;NCT02695290;NCT02208843;NCT0441320... | Phase 4 | 2014 |
| 7 | Carcinoma, Non-Small-Cell Lung | meshd002289 | ERBB2 | ncbigene2064 | afatinib | NCT02514174;NCT02695290;NCT02208843;NCT0441320... | Phase 4 | 2014 |
| 8 | Carcinoma | meshd002277 | EGFR | ncbigene1956 | afatinib | NCT02451553;NCT03652233;NCT01288430;NCT03878524 | Phase 1 | 2011 |
| 9 | Carcinoma | meshd002277 | ERBB2 | ncbigene2064 | afatinib | NCT02451553;NCT03652233;NCT01288430;NCT03878524 | Phase 1 | 2011 |
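
Each value in the ``nct`` column packs several study ids into one semicolon-separated string. The following is a minimal sketch of unpacking such a value with plain Python. The ``row`` dictionary is a hand-copied stand-in for row 8 of the table above, not an object returned by ``CTParserByPhase``, and ``nct_ids`` is an illustrative name.

```python
# Stand-in for one row of the all-phases table (row 8 above).
row = {
    "cancer": "Carcinoma",
    "mesh_id": "meshd002277",
    "kinase": "EGFR",
    "gene.id": "ncbigene1956",
    "pki": "afatinib",
    "nct": "NCT02451553;NCT03652233;NCT01288430;NCT03878524",
    "phase": "Phase 1",
    "year": 2011,
}

# Split the semicolon-separated study ids into a list.
nct_ids = row["nct"].split(";")

print(len(nct_ids), "studies in", row["phase"], "for", row["pki"])
```

Note that the displayed table truncates long ``nct`` values with ``...``, so only untruncated cells can be split reliably from the printed output.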


## Phase 4
For the purposes of this project, we regard every PKI/cancer combination that has a Phase 4 study as a confirmed relation,
meaning that inhibition of the protein kinases targeted by the PKI is an effective treatment strategy for the cancer in question.
The following command returns only the Phase 4 studies.

In [10]:
df_phase_4 = parser.get_phase_4()
df_phase_4.head(20)

| | cancer | mesh_id | kinase | gene.id | pki | nct | phase | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Carcinoma, Non-Small-Cell Lung | meshd002289 | EGFR | ncbigene1956 | afatinib | NCT02514174;NCT02695290;NCT02208843;NCT0441320... | Phase 4 | 2014 |
| 1 | Carcinoma, Non-Small-Cell Lung | meshd002289 | ERBB2 | ncbigene2064 | afatinib | NCT02514174;NCT02695290;NCT02208843;NCT0441320... | Phase 4 | 2014 |
| 2 | Carcinoma | meshd002277 | EGFR | ncbigene1956 | afatinib | NCT04132102 | Phase 4 | 2018 |
| 3 | Carcinoma | meshd002277 | ERBB2 | ncbigene2064 | afatinib | NCT04132102 | Phase 4 | 2018 |
| 4 | Carcinoma, Squamous Cell | meshd002294 | EGFR | ncbigene1956 | afatinib | NCT04132102 | Phase 4 | 2018 |
| 5 | Carcinoma, Squamous Cell | meshd002294 | ERBB2 | ncbigene2064 | afatinib | NCT04132102 | Phase 4 | 2018 |
| 6 | Lung Neoplasms | meshd008175 | EGFR | ncbigene1956 | afatinib | NCT04356118 | Phase 4 | 2020 |
| 7 | Lung Neoplasms | meshd008175 | ERBB2 | ncbigene2064 | afatinib | NCT04356118 | Phase 4 | 2020 |
| 8 | Neoplasm Metastasis | meshd009362 | EGFR | ncbigene1956 | afatinib | NCT04356118 | Phase 4 | 2020 |
| 9 | Neoplasm Metastasis | meshd009362 | ERBB2 | ncbigene2064 | afatinib | NCT04356118 | Phase 4 | 2020 |


## Limit the time range
When a ``year`` argument is passed, the parser returns only studies that started no later than that year.
In the following example, we limit the year to 2012.

In [None]:
targetYear = 2012
parser2012 = CTParserByPhase(clinical_trials=ctfile, year=targetYear)

In [None]:
df2012 = parser2012.get_all_phases()
df2012.head()

In [None]:
print("[INFO] We extracted %d cancer protein kinase links up to %d" % (len(df2012), targetYear))
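
The "not later than" rule is an inclusive cut-off on the start year. The sketch below applies that rule to a few rows copied from the all-phases table shown earlier. It is only an illustration of the comparison: ``started_by`` is a hypothetical helper, and the real parser may apply the cut-off to individual studies before grouping them into rows, which this sketch does not attempt to reproduce.

```python
# (cancer, kinase, phase, earliest start year), copied from the
# all-phases table shown earlier in this notebook.
rows = [
    ("Carcinoma, Non-Small-Cell Lung", "EGFR", "Phase 1", 2009),
    ("Carcinoma, Non-Small-Cell Lung", "EGFR", "Phase 2", 2007),
    ("Carcinoma, Non-Small-Cell Lung", "EGFR", "Phase 3", 2008),
    ("Carcinoma, Non-Small-Cell Lung", "EGFR", "Phase 4", 2014),
    ("Carcinoma", "EGFR", "Phase 1", 2011),
]


def started_by(rows, target_year):
    """Keep rows whose start year is not later than target_year (inclusive)."""
    return [row for row in rows if row[3] <= target_year]


kept = started_by(rows, 2012)
for cancer, kinase, phase, year in kept:
    print(cancer, kinase, phase, year)
```

With a target year of 2012, the Phase 4 row that started in 2014 is excluded, while a row starting exactly in 2012 would be kept.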

# Get data for training
For machine learning, we only require data formatted with the gene id and the disease id. In addition, duplicate
lines in the all_phases data frame are combined. The following code shows how this is done.

In [None]:
df_allphases_tr = parser.get_all_phases_for_training()
df_allphases_tr.head()

In [None]:
df_phase_4_tr = parser.get_phase_4_for_training()
df_phase_4_tr.head()
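
To illustrate what combining duplicate lines means, the sketch below reduces a few rows to (gene id, disease id) pairs and drops repeats while keeping first-seen order. It is an assumption about the general idea, not the implementation of ``get_all_phases_for_training``; the rows are copied from the all-phases table shown earlier, and the actual column names and layout of the returned data frame may differ.

```python
# (gene.id, mesh_id, phase), copied from the all-phases table shown earlier.
rows = [
    ("ncbigene1956", "meshd002289", "Phase 1"),
    ("ncbigene2064", "meshd002289", "Phase 1"),
    ("ncbigene1956", "meshd002289", "Phase 2"),
    ("ncbigene2064", "meshd002289", "Phase 2"),
    ("ncbigene1956", "meshd002277", "Phase 1"),
    ("ncbigene2064", "meshd002277", "Phase 1"),
]

# Keep only the gene id and disease id, then remove duplicates.
# dict.fromkeys preserves insertion order, so the first occurrence wins.
pairs = list(
    dict.fromkeys((gene_id, mesh_id) for gene_id, mesh_id, _phase in rows)
)

for gene_id, mesh_id in pairs:
    print(gene_id, mesh_id)
```

The same gene/disease link appears once per phase in the full table, which is why the six rows above collapse to four pairs.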

# Get validation data
Here, we extract all positive studies published on ClinicalTrials.gov after the target year. It seems appropriate to use all phases for validation, not just Phase 4, but for completeness, a separate function is provided to get Phase 4 validation studies (see below).

In [None]:
df_validation_all = parser2012.get_validation_all_phases()
df_validation_all.head(20)

In [None]:
print("[INFO] We extracted %d protein kinase-cancer links published after the target year of %d" 
      % (len(df_validation_all), targetYear))
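
The target year therefore partitions the data: rows up to and including the target year are available for training, and rows after it are held out for validation. The sketch below shows that partition on a few rows copied from the tables shown earlier. ``split_by_year`` is a hypothetical helper, and whether the package additionally removes links that were already known before the target year is not shown in this notebook, so the sketch makes no attempt to model that.

```python
# (cancer, kinase, phase, start year), copied from the tables shown
# earlier in this notebook.
rows = [
    ("Carcinoma, Non-Small-Cell Lung", "EGFR", "Phase 1", 2009),
    ("Carcinoma, Non-Small-Cell Lung", "EGFR", "Phase 4", 2014),
    ("Carcinoma", "EGFR", "Phase 4", 2018),
    ("Lung Neoplasms", "EGFR", "Phase 4", 2020),
]


def split_by_year(rows, target_year):
    """Return (historical, validation) lists split at target_year."""
    historical = [row for row in rows if row[3] <= target_year]
    validation = [row for row in rows if row[3] > target_year]
    return historical, validation


historical, validation = split_by_year(rows, 2012)
print(len(historical), "historical rows,", len(validation), "validation rows")
```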

# Phase 4 validation data
This function gets the Phase 4 studies published after the target year.

In [None]:
df_validation_p4 = parser2012.get_validation_phase_4()
df_validation_p4.head()