# Annotate the final dataset

**Goal:** This notebook annotates the final dataset produced by [`get_final_dataset.ipynb`](https://github.com/ElsaB/impact-annotator/blob/master/analysis/description/compute_final_dataset/get_final_dataset.ipynb), adding a few features. It follows the same steps as the [`annotate_cleaned_dataset.ipynb`](https://github.com/ElsaB/impact-annotator/blob/master/analysis/description/first_study/annotate_cleaned_dataset.ipynb) notebook, which was used to annotate the cleaned dataset. All the operations are stored in the [`compute_final_dataset.R`](https://github.com/ElsaB/impact-annotator/blob/master/data/utils/compute_final_dataset.R) file and can be applied to the raw dataset with the `annotate_final_dataset()` function.

This notebook is divided into 4 parts:
* **1. `Kaviar_AF`**
* **2. `OncoKB annotations`**
* **3. `CancerGenesList`**
* **4. Compare with the old impact**

In [1]:
source("../../../utils/R/custom_tools.R")
setup_environment("../../../utils/R")

In [2]:
impact <- read.table("../../../data/impact_181105/final_IMPACT_mutations_181105.txt", sep = "\t", stringsAsFactors = FALSE, header = TRUE)

In [3]:
nrow(impact)

## OncoKB annotations

Get the `is_a_hotspot`, `is_a_3d_hotspot` and `oncogenic` features from `oncokb_annotated_final_IMPACT_mutations_181105.txt` (impact annotated with oncokb-annotator, see [`/data/annotate_with_oncokb_final_dataset`](https://github.com/ElsaB/impact-annotator/tree/master/data/annotate_with_oncokb_final_dataset)).

### Get the raw data

In [5]:
impact_oncokb <- read.table("../../../data/impact_181105/annotate_with_oncokb_final_dataset/oncokb_annotated_final_IMPACT_mutations_181105.txt",
                             sep = "\t", stringsAsFactors = FALSE, header = TRUE)

In [6]:
ncol(impact_oncokb)
nrow(impact_oncokb)
head(impact_oncokb)

mut_key,Hugo_Symbol,VEP_Consequence,VEP_VARIANT_CLASS,HGVSp_Short,Variant_Classification,is.a.hotspot,is.a.3d.hotspot,mutation_effect,oncogenic,LEVEL_1,LEVEL_2A,LEVEL_2B,LEVEL_3A,LEVEL_3B,LEVEL_4,LEVEL_R1,Highest_level,citations
17_7577515_T_G,TP53,missense_variant,SNV,p.T256P,Missense_Mutation,,,Likely Loss-of-function,Likely Oncogenic,,,,,,,,,8023157;11900253
1_46521514_G_C,PIK3R3,missense_variant,SNV,p.I298M,Missense_Mutation,,,,,,,,,,,,,
3_142178126_C_A,ATR,missense_variant,SNV,p.R2431M,Missense_Mutation,,,,,,,,,,,,,
4_55139732_T_A,PDGFRA,missense_variant,SNV,p.L465M,Missense_Mutation,,,,,,,,,,,,,
4_153249542_C_A,FBXW7,splice_acceptor_variant,SNV,unknown,Splice_Site,,,,Likely Oncogenic,,,,,,,,,
4_153332775_C_A,FBXW7,stop_gained,SNV,p.G61*,Nonsense_Mutation,,,,Likely Oncogenic,,,,,,,,,


### Create keys to join the two dataframes and extract the features

Each mutation is identified by a key present in both dataframes, which links each row of `impact` to its annotation in `impact_oncokb`. The key is `mut_key` in both datasets, and it already exists in each.

**Verification 1** The features `oncogenic`, `is.a.hotspot`, and `is.a.3d.hotspot` are unique for each `mut_key`:

In [7]:
impact_oncokb <- unique(impact_oncokb[, c("mut_key", "is.a.hotspot", "is.a.3d.hotspot", "oncogenic")])
impact_oncokb %>% group_by(mut_key) %>% filter(n() > 1) %>% arrange(mut_key)

mut_key,is.a.hotspot,is.a.3d.hotspot,oncogenic
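
The uniqueness check above can also be sketched in base R on a toy data frame. The values below are hypothetical and are not taken from the real file; the idea is only that, once exact duplicate rows are dropped, any key that still appears twice carries conflicting annotations.

```r
# Toy annotation table (hypothetical values, for illustration only).
toy_oncokb <- data.frame(
  mut_key      = c("17_7577515_T_G", "17_7577515_T_G", "1_46521514_G_C"),
  is.a.hotspot = c("Y", "Y", ""),
  oncogenic    = c("Likely Oncogenic", "Likely Oncogenic", ""),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows, as the notebook does with unique().
toy_oncokb <- unique(toy_oncokb)

# Keys that still appear more than once would carry conflicting annotations.
conflicting_keys <- unique(toy_oncokb$mut_key[duplicated(toy_oncokb$mut_key)])
length(conflicting_keys)
```

An empty `conflicting_keys` vector corresponds to the empty table returned by the `group_by()` / `filter(n() > 1)` cell.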


**Verification 2** Every `mut_key` in `impact` has a matching `mut_key` in `impact_oncokb`:

In [8]:
nrow(impact[! impact$mut_key %in% impact_oncokb$mut_key,])

In [9]:
impact <- left_join(impact, impact_oncokb[, c("mut_key", "is.a.hotspot", "is.a.3d.hotspot", "oncogenic")], by = c("mut_key" = "mut_key"))
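
As a minimal sketch of the two guards around this join, the toy example below lists the keys that have no match and checks that the join leaves the row count unchanged. It uses base R `merge()` instead of `dplyr::left_join()` so that it runs without extra packages. All table and column values are hypothetical.

```r
# Toy left table: one row per (sample, mutation); keys may repeat.
toy_impact <- data.frame(
  mut_key = c("a", "b", "b", "c"),
  t_vaf   = c(0.5, 0.2, 0.3, 0.1),
  stringsAsFactors = FALSE
)
# Toy annotation table: one row per key.
toy_annot <- data.frame(
  mut_key   = c("a", "b"),
  oncogenic = c("Oncogenic", ""),
  stringsAsFactors = FALSE
)

# Keys present in the left table but absent from the annotation table.
missing_keys <- setdiff(toy_impact$mut_key, toy_annot$mut_key)

# Base R equivalent of a left join; all.x = TRUE keeps unmatched left rows.
# Note: merge() sorts the result by key, unlike dplyr::left_join().
joined <- merge(toy_impact, toy_annot, by = "mut_key", all.x = TRUE)

# A left join on a unique right-hand key must not change the row count.
stopifnot(nrow(joined) == nrow(toy_impact))
```

In this toy case `missing_keys` contains `"c"`, and the corresponding row gets `NA` in `oncogenic`. In the notebook, Verification 2 is meant to confirm that no such key exists.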

### Process raw features

**`is_a_hotspot`**

In [10]:
colnames(impact)[colnames(impact) == "is.a.hotspot"] <- "is_a_hotspot"
impact$is_a_hotspot[impact$is_a_hotspot == "Y"  ] <- "yes"
impact$is_a_hotspot[impact$is_a_hotspot != "yes"] <- "unknown"
get_table(impact$is_a_hotspot)

values,count,freq
unknown,196898,87.3%
yes,28663,12.7%
-- total --,225561,100%
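
The two-step recoding above assumes the column contains no `NA`, because `NA != "yes"` evaluates to `NA` and such entries would be left untouched. An NA-tolerant variant could look like the sketch below; `recode_flag` is a hypothetical helper name and is not part of `compute_final_dataset.R`.

```r
# Hypothetical helper: map the raw OncoKB flag to "yes" / "unknown".
# Anything other than "Y" (including "" and NA) becomes "unknown".
recode_flag <- function(x) {
  ifelse(!is.na(x) & x == "Y", "yes", "unknown")
}

raw_flag <- c("Y", "", NA, "Y")
recoded  <- recode_flag(raw_flag)
recoded
```

The same helper would apply unchanged to `is.a.3d.hotspot`.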


**`is_a_3d_hotspot`**

In [11]:
colnames(impact)[colnames(impact) == "is.a.3d.hotspot"] <- "is_a_3d_hotspot"
impact$is_a_3d_hotspot[impact$is_a_3d_hotspot == "Y"  ] <- "yes"
impact$is_a_3d_hotspot[impact$is_a_3d_hotspot != "yes"] <- "unknown"
get_table(impact$is_a_3d_hotspot)

values,count,freq
unknown,208230,92.3%
yes,17331,7.7%
-- total --,225561,100%


**`oncogenic`**

In [12]:
impact$oncogenic[impact$oncogenic == ""] <- "Unknown"
get_table(impact$oncogenic)

values,count,freq
Unknown,144137,63.9%
Likely Oncogenic,60601,26.9%
Oncogenic,16588,7.4%
Predicted Oncogenic,3189,1.4%
Inconclusive,651,0.3%
Likely Neutral,395,0.2%
-- total --,225561,100%


## CancerGenesList

Get the `gene_type` feature from `CancerGenesList.txt` (downloaded from http://oncokb.org/#/cancerGenes using the "CANCER GENE LIST" button in the upper right).

### Get the raw data

In [13]:
cancer_genes_list <- read.table("../../../data/other_databases/CancerGenesList.txt",
                                sep = "\t", stringsAsFactors = FALSE, header = TRUE, comment.char = '')

In [14]:
ncol(cancer_genes_list)
nrow(cancer_genes_list)
head(cancer_genes_list)

Hugo.Symbol,X..of.occurence.within.resources,OncoKB.Annotated,OncoKB.Oncogene,OncoKB.TSG,MSK.IMPACT,MSK.HEME,Foundation.One,Foundation.One.Heme,Vogelstein,Sanger.CGC
ABL1,7,Yes,Yes,,Yes,Yes,Yes,Yes,Yes,Yes
ABL2,3,No,,,No,No,Yes,Yes,No,Yes
ACTB,1,No,,,No,No,No,Yes,No,No
ACTG1,1,No,,,No,Yes,No,No,No,No
ACVR1,3,Yes,Yes,,Yes,No,No,No,No,Yes
ACVR1B,2,No,,,No,No,Yes,No,Yes,No


### Create keys to join the two dataframes and extract the features

We are going to identify each gene with a key in both dataframes, allowing us to link each mutation from `impact` to the corresponding gene in `cancer_genes_list`. The keys will be:
* `VEP_SYMBOL` for `impact`
* `Hugo.Symbol` for `cancer_genes_list`

**Verification 1** The features `OncoKB.Oncogene` and `OncoKB.TSG` are unique for each key:

In [15]:
cancer_genes_list <- unique(cancer_genes_list[, c("Hugo.Symbol", "OncoKB.Oncogene", "OncoKB.TSG")])
cancer_genes_list %>% group_by(Hugo.Symbol) %>% filter(n() > 1)

Hugo.Symbol,OncoKB.Oncogene,OncoKB.TSG


**Verification 2** Some `VEP_SYMBOL` values in `impact` have no matching `Hugo.Symbol` in `cancer_genes_list`, so `NA` values will appear after the join and need to be handled:

In [16]:
length(unique(impact$VEP_SYMBOL[! impact$VEP_SYMBOL %in% cancer_genes_list$Hugo.Symbol]))
print(unique(impact$VEP_SYMBOL[! impact$VEP_SYMBOL %in% cancer_genes_list$Hugo.Symbol]))

 [1] "INSRR"          "RP11-211G3.3"   "PCDHB3"         "AC008738.1"    
 [5] "OBSL1"          "TIMM8B"         "CTD-2561B21.3"  "RP1-85F18.6"   
 [9] "SDCCAG8"        "MFSD11"         "SMIM4"          "OCLN"          
[13] "AC129492.6"     "IL31RA"         "SETD8"          "RP11-354M1.2"  
[17] "RTEL1-TNFRSF6B" "KIF20A"         "H2AFY"          "CHD1"          
[21] "DHRS13"         "BZRAP1-AS1"     "PAPD4"          "RP11-770J1.3"  
[25] "AC004906.3"     "AP003419.16"    "RP11-513D5.2"  


In [17]:
impact <- left_join(impact, cancer_genes_list[, c("Hugo.Symbol", "OncoKB.Oncogene", "OncoKB.TSG")], by = c("VEP_SYMBOL" = "Hugo.Symbol"))

In [18]:
head(impact)

Hugo_Symbol,Chromosome,Start_Position,End_Position,Consequence,Variant_Type,Reference_Allele,Tumor_Seq_Allele2,Tumor_Sample_Barcode,cDNA_change,HGVSp_Short,t_depth,t_vaf,t_alt_count,n_depth,n_vaf,n_alt_count,t_ref_plus_count,t_ref_neg_count,t_alt_plus_count,t_alt_neg_count,confidence_class,sample_coverage,variant_caller_cv,mut_key,VAG_VT,VAG_GENE,VAG_cDNA_CHANGE,VAG_PROTEIN_CHANGE,VAG_EFFECT,VEP_Consequence,VEP_SYMBOL,VEP_HGVSc,VEP_HGVSp,VEP_Amino_acids,VEP_VARIANT_CLASS,VEP_EXON,VEP_INTRON,VEP_IMPACT,VEP_CLIN_SIG,VEP_COSMIC_CNT,VEP_gnomAD_AF,sample_mut_key,patient_key,frequency_in_normals,VEP_SIFT_class,VEP_SIFT_score,VEP_PolyPhen_class,VEP_PolyPhen_score,VEP_in_dbSNP,VEP_gnomAD_total_AF_AFR,VEP_gnomAD_total_AF_AMR,VEP_gnomAD_total_AF_ASJ,VEP_gnomAD_total_AF_EAS,VEP_gnomAD_total_AF_FIN,VEP_gnomAD_total_AF_NFE,VEP_gnomAD_total_AF_OTH,VEP_gnomAD_total_AF_max,VEP_gnomAD_total_AF,is_a_hotspot,is_a_3d_hotspot,oncogenic,OncoKB.Oncogene,OncoKB.TSG
TP53,17,7577515,7577515,nonsynonymous_SNV,SNP,T,G,P-0000012-T02-IM3,c.766A>C,p.T256P,227,0.5022,114,569,0.0,0,59,54,58,56,AUTO_OK,344,MUTECT_ANNOVAR,17_7577515_T_G,Sub,TP53,c.766A>C,p.T256P,non_synonymous_codon,missense_variant,TP53,c.766A>C,p.T256P,T/P,SNV,7|11,,MODERATE,unknown,1,0.0,P-0000012-T02-IM3_17_7577515_T_G,P-0000012,0,deleterious,0.0,probably_damaging,0.999,False,0,0,0,0,0,0.0,0,0.0,0.0,unknown,unknown,Likely Oncogenic,,Yes
PIK3R3,1,46521514,46521514,nonsynonymous_SNV,SNP,G,C,P-0000012-T03-IM3,c.894C>G,p.I298M,733,0.17599,129,1243,0.0,0,288,316,61,68,AUTO_OK,428,MUTECT_ANNOVAR,1_46521514_G_C,Sub,PIK3R3,c.1032C>G,p.I344M,non_synonymous_codon,missense_variant,PIK3R3,c.894C>G,p.I298M,I/M,SNV,7|10,,MODERATE,unknown,0,0.0,P-0000012-T03-IM3_1_46521514_G_C,P-0000012,0,deleterious,0.0,benign,0.277,False,0,0,0,0,0,0.0,0,0.0,0.0,unknown,unknown,Unknown,,Yes
ATR,3,142178126,142178126,nonsynonymous_SNV,SNP,C,A,P-0000012-T03-IM3,c.7292G>T,p.R2431M,482,0.17427,84,581,0.00172,1,221,177,46,38,AUTO_OK,428,MUTECT_ANNOVAR,3_142178126_C_A,Sub,ATR,c.7292G>T,p.R2431M,non_synonymous_codon,missense_variant,ATR,c.7292G>T,p.R2431M,R/M,SNV,43|47,,MODERATE,unknown,0,4.063e-06,P-0000012-T03-IM3_3_142178126_C_A,P-0000012,0,deleterious,0.0,probably_damaging,0.997,True,0,0,0,0,0,8.959771e-06,0,8.959771e-06,4.644035e-06,unknown,unknown,Unknown,,Yes
PDGFRA,4,55139732,55139732,nonsynonymous_SNV,SNP,T,A,P-0000012-T03-IM3,c.1393T>A,p.L465M,570,0.20351,116,811,0.0,0,252,202,66,50,AUTO_OK,428,MUTECT_ANNOVAR,4_55139732_T_A,Sub,PDGFRA,c.1393T>A,p.L465M,non_synonymous_codon,missense_variant,PDGFRA,c.1393T>A,p.L465M,L/M,SNV,10|23,,MODERATE,unknown,0,0.0,P-0000012-T03-IM3_4_55139732_T_A,P-0000012,0,deleterious,0.01,probably_damaging,0.965,False,0,0,0,0,0,0.0,0,0.0,0.0,unknown,unknown,Unknown,Yes,
FBXW7,4,153249542,153249542,splicing,SNP,C,A,P-0000012-T03-IM3,c.1237-1G>T,,333,0.25526,85,458,0.0,0,69,179,24,61,AUTO_OK,428,MUTECT_ANNOVAR,4_153249542_C_A,Sub,FBXW7,c.1237-1G>T,p.?,splice_site_variant,splice_acceptor_variant,FBXW7,c.1237-1G>T,unknown,unknown,SNV,,8|11,HIGH,unknown,0,0.0,P-0000012-T03-IM3_4_153249542_C_A,P-0000012,0,unknown,,unknown,,False,0,0,0,0,0,0.0,0,0.0,0.0,unknown,unknown,Likely Oncogenic,,Yes
FBXW7,4,153332775,153332775,stopgain_SNV,SNP,C,A,P-0000012-T03-IM3,c.181G>T,p.G61*,570,0.22807,130,908,0.0,0,243,197,70,60,AUTO_OK,428,MUTECT_ANNOVAR,4_153332775_C_A,Sub,FBXW7,c.181G>T,p.G61*,stop_gained,stop_gained,FBXW7,c.181G>T,p.G61*,G/*,SNV,2|12,,HIGH,unknown,0,0.0,P-0000012-T03-IM3_4_153332775_C_A,P-0000012,0,unknown,,unknown,,False,0,0,0,0,0,0.0,0,0.0,0.0,unknown,unknown,Likely Oncogenic,,Yes


### `gene_type`

In [19]:
head(unique(impact$OncoKB.Oncogene))
head(unique(impact$OncoKB.TSG))

In [20]:
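# Default label for genes missing from the list or flagged as neither oncogene nor TSG
impact$gene_type <- "unknown"
impact$gene_type[impact$OncoKB.Oncogene == "Yes"] <- "oncogene"
impact$gene_type[impact$OncoKB.TSG == "Yes"] <- "tsg"
# Genes flagged as both must be assigned last so they override the two labels above
impact$gene_type[impact$OncoKB.Oncogene == "Yes" & impact$OncoKB.TSG == "Yes"] <- "oncogene_and_tsg"

# Drop the raw columns now that gene_type summarizes them
impact$OncoKB.Oncogene <- NULL
impact$OncoKB.TSG      <- NULL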

In [21]:
table(impact$gene_type)


        oncogene oncogene_and_tsg              tsg          unknown 
           55454             5541           111390            53176 
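
The `gene_type` derivation can be illustrated on a toy table that includes an unmatched gene, where both flags are `NA` after the left join. The sketch below uses `%in%` instead of `==` so that `NA` flags evaluate to `FALSE` explicitly. This is an alternative formulation under that assumption, not the code stored in `compute_final_dataset.R`, and the values are hypothetical.

```r
# Toy flags: oncogene only, TSG only, both, and an unmatched gene (NA after the join).
toy_genes <- data.frame(
  OncoKB.Oncogene = c("Yes", "", "Yes", NA),
  OncoKB.TSG      = c("", "Yes", "Yes", NA),
  stringsAsFactors = FALSE
)

# %in% returns FALSE for NA, whereas == would return NA.
is_oncogene <- toy_genes$OncoKB.Oncogene %in% "Yes"
is_tsg      <- toy_genes$OncoKB.TSG %in% "Yes"

gene_type <- rep("unknown", nrow(toy_genes))
gene_type[is_oncogene]          <- "oncogene"
gene_type[is_tsg]               <- "tsg"
gene_type[is_oncogene & is_tsg] <- "oncogene_and_tsg"
gene_type
```

The unmatched gene keeps the default `"unknown"` label, which is how the symbols listed under Verification 2 end up in the `unknown` count.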