# GENIE: filter data by mutation and clinical data

I want to create a master table containing my clinical and pathogen information in one table.

Later, I want to see how much data is left when restricting the mutations to those with a weight >=50% or >=75%.

## Setup

In [1]:
library("ggplot2")
library(dplyr, warn.conflicts = FALSE)

“package ‘ggplot2’ was built under R version 4.2.3”
“package ‘dplyr’ was built under R version 4.2.3”


## Get data

In [2]:
clinical <- read.csv("../../derived_data/genie_v15/clean_reference.csv", header=TRUE, stringsAsFactors=FALSE)
mutation <- read.csv("../../derived_data/genie_v15/mutation_pathogen_filter.csv", header=TRUE, stringsAsFactors=FALSE)
g_75 <- read.csv("../../derived_data/genie_v15/gene_weights_75.csv", header=FALSE)
g_50 <- read.csv("../../derived_data/genie_v15/gene_weights_50.csv", header=FALSE)

In [3]:
# Removing the first column
mutation <- dplyr::select(mutation, -X)
clinical <- dplyr::select(clinical, -X)

# Sanity check
dim(clinical)
head(clinical)

dim(mutation)
head(mutation)

dim(g_75)
dim(g_50)

Unnamed: 0_level_0,patient_id,sex,race,center,dead,sample_id,age,oncotree_code,sample_type,sequence_assay_ID,cancer_type,sample_type_detail,population,age_group,age_interval
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,GENIE-CHOP-C1002819,Female,White,CHOP,False,GENIE-CHOP-C1002819-BS79B4V9EZ,0,,Primary,CHOP-STNGS,,Primary tumor,AMR,Child,<45
2,GENIE-CHOP-C1002942,Female,White,CHOP,True,GENIE-CHOP-C1002942-BS7MEH2Z48,0,AML,Metastasis,CHOP-HEMEP,Leukemia,Local recurrence,AMR,Child,<45
3,GENIE-CHOP-C1002942,Female,White,CHOP,True,GENIE-CHOP-C1002942-BS3YSSSEPK,0,AML,Metastasis,CHOP-HEMEP,Leukemia,Local recurrence,AMR,Child,<45
4,GENIE-CHOP-C1003065,Male,White,CHOP,False,GENIE-CHOP-C1003065-BS66P79BWX,0,MBL,Primary,CHOP-STNGS,Embryonal Tumor,Primary tumor,AMR,Child,<45
5,GENIE-CHOP-C1003065,Male,White,CHOP,False,GENIE-CHOP-C1003065-BSTECFMMGC,0,MBL,Primary,CHOP-STNGS,Embryonal Tumor,Primary tumor,AMR,Child,<45
6,GENIE-CHOP-C1003188,Female,White,CHOP,False,GENIE-CHOP-C1003188-BSHKNH1HNP,0,MBL,Primary,CHOP-STNGS,Embryonal Tumor,Primary tumor,AMR,Child,<45


Unnamed: 0_level_0,Hugo_Symbol,Tumor_Sample_Barcode,SIFT_Prediction,Polyphen_Prediction,Variant_Classification,gnomAD_AMR_AF,gnomAD_NFE_AF,Population,Pathogen
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<int>
1,KRAS,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,0.0,0.0,AMR,1
2,BRAF,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,0.0,0.0,AMR,1
3,EGFR,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,2.89101e-05,3.51673e-05,AMR,1
4,TP53,GENIE-JHU-00006-00185,tolerated,possibly_damaging,Missense_Mutation,0.0,2.64271e-05,AMR,0
5,NRAS,GENIE-JHU-00006-00185,tolerated,benign,Missense_Mutation,,,AMR,0
6,PIK3CA,GENIE-JHU-00006-00185,tolerated,benign,Missense_Mutation,,,AMR,0


## Filter

How much data is left if I only keep mutations that are pathogens and only keep clinical data where all the values are known?

In [4]:
# All mutations where the Pathogen column has a value of 1 are estimated to be pathogens
pathogen <- mutation[mutation$Pathogen == 1,]

# Checking how many there are, and also making sure that the Pathogen column only contains 1s.
table(pathogen$Pathogen)


     1 
734800 
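
One detail of the subsetting above is worth a quick check: with base R logical indexing, a missing value in the condition produces an all-NA row rather than dropping the row, and `table()` hides NA by default. Whether the real `Pathogen` column contains any NA is not known from this notebook, so the following is only a sketch on made-up data (`toy` and its values are hypothetical):

```r
# Toy data: the third row has a missing Pathogen flag (hypothetical values).
toy <- data.frame(
  Hugo_Symbol = c("KRAS", "TP53", "NRAS"),
  Pathogen    = c(1L, 0L, NA),
  stringsAsFactors = FALSE
)

# Logical subsetting keeps an all-NA row for every NA in the condition.
by_logical <- toy[toy$Pathogen == 1, ]

# which() drops NA conditions, so only true matches remain.
by_which <- toy[which(toy$Pathogen == 1), ]

nrow(by_logical)  # 2: the KRAS row plus one all-NA row
nrow(by_which)    # 1: the KRAS row only

# table() ignores NA unless asked, so request it explicitly when checking.
table(by_logical$Pathogen, useNA = "ifany")
```

If the count from `table(..., useNA = "ifany")` shows no NA on the real data, the original subsetting and the `which()` variant give the same result.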

In [5]:
# Omitting all rows where there is one or more NA values
clean_clin <- na.omit(clinical)

# Checking how many rows there are left.
dim(clean_clin)
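
The `head(clinical)` output above shows a first row with a blank `oncotree_code` and `cancer_type`. With the default `read.csv()` settings, blank fields in character columns are read as empty strings, not NA, so `na.omit()` would keep such rows. Whether those rows should be dropped is a decision for the analysis; the sketch below only illustrates the behaviour on a made-up CSV (all values hypothetical):

```r
# Toy CSV with a blank oncotree_code in the first row (hypothetical values).
csv_text <- "patient_id,age,oncotree_code
P1,0,
P2,61,LUAD
P3,,AML"

# Default settings: the blank character field becomes "", the blank integer becomes NA.
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
nrow(na.omit(default_read))  # 2: only the row with the missing age is dropped

# Treating "" as missing as well lets na.omit() drop the first row too.
strict_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                        na.strings = c("NA", ""))
nrow(na.omit(strict_read))   # 1: only the fully populated row remains
```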

In [6]:
colnames(clean_clin)

In [7]:
# We only keep the data-points where the sex is known
clean_clin <- clean_clin[which(clean_clin$sex %in% c('Male', 'Female')),]

# We only keep the data-points where the sample-type is either Primary or Metastasis
clean_clin <- clean_clin[which(clean_clin$sample_type %in% c('Primary', 'Metastasis')),]

# We remove any rows with NA values
clean_clin <- na.omit(clean_clin)

length(unique(clean_clin$patient_id))
length(unique(clean_clin$sample_id))

clean_clin %>%
  group_by(sex,sample_type)  %>%
  summarise(n = n())

[1m[22m`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.


sex,sample_type,n
<chr>,<chr>,<int>
Female,Metastasis,27772
Female,Primary,43460
Male,Metastasis,22088
Male,Primary,39655
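
The `summarise()` message above is informational. As a minimal sketch on made-up data (`toy_clin` is hypothetical), it can be silenced either by stating `.groups` explicitly or by using `count()`, which gives the same tally:

```r
library(dplyr, warn.conflicts = FALSE)

# Toy clinical table (hypothetical values).
toy_clin <- data.frame(
  sex         = c("Female", "Female", "Male", "Male", "Male"),
  sample_type = c("Primary", "Metastasis", "Primary", "Primary", "Metastasis"),
  stringsAsFactors = FALSE
)

# Option 1: keep summarise() but state what to do with the grouping.
by_summarise <- toy_clin %>%
  group_by(sex, sample_type) %>%
  summarise(n = n(), .groups = "drop")

# Option 2: count() groups, tallies and ungroups in one call.
by_count <- toy_clin %>% count(sex, sample_type)

by_count
```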


In [8]:
dim(clean_clin[which(clean_clin$age == 0 | clean_clin$age == 100),])

## Left-join

There are 734,800 pathogenic mutations and 157,017 individuals for which all clinical information is present. I want to merge the two dataframes by 'sample_id' in the clinical data and 'Tumor_Sample_Barcode' in the mutational data.

In [6]:
# Rename Tumor_Sample_Barcode to sample_id
pathogen <- rename(pathogen, sample_id = Tumor_Sample_Barcode)

# left-join clinical and pathogen data by sample_id column
pathogen <- left_join(pathogen, clean_clin, join_by(sample_id))

# Clinical data is in columns 10-23, so rows where these are empty are of no interest
# complete.cases() returns a TRUE/FALSE vector: every row with an NA in columns 10-23 returns FALSE
# We only keep the rows that return TRUE
pathogen <- pathogen[complete.cases(pathogen[ , 10:23]),]
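
A left join followed by dropping rows without clinical values behaves like an inner join, as long as the clinical table itself has no NA (which holds here after `na.omit()`). The sketch below compares both on made-up data; `toy_mut`, `toy_clin` and `clin_cols` are hypothetical names, and `by = "sample_id"` is used as the plain-string equivalent of `join_by(sample_id)`. Selecting the clinical columns by name instead of by position also avoids relying on the 10-23 range staying correct:

```r
library(dplyr, warn.conflicts = FALSE)

# Toy mutation and clinical tables (hypothetical values).
toy_mut <- data.frame(
  Hugo_Symbol = c("KRAS", "BRAF", "TP53"),
  sample_id   = c("S1", "S1", "S9"),   # S9 has no clinical record
  stringsAsFactors = FALSE
)
toy_clin <- data.frame(
  sample_id  = c("S1", "S2"),
  patient_id = c("P1", "P2"),
  sex        = c("Female", "Male"),
  stringsAsFactors = FALSE
)

# Approach used above: left join, then drop rows lacking clinical values.
clin_cols <- setdiff(names(toy_clin), "sample_id")
via_left <- left_join(toy_mut, toy_clin, by = "sample_id")
via_left <- via_left[complete.cases(via_left[, clin_cols]), ]

# Alternative: an inner join keeps only samples present in both tables.
via_inner <- inner_join(toy_mut, toy_clin, by = "sample_id")

nrow(via_left)   # 2
nrow(via_inner)  # 2
```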

In [7]:
# Check dimensions after filtering
dim(pathogen)

# Check how many samples are left
length(unique(pathogen$sample_id))
# Check how many patients are left
length(unique(pathogen$patient_id))

In [9]:
(618196/734800)*100
(618196/1840311)*100

Of the 1,840,311 mutations, 734,800 are pathogens, and of those pathogens 618,196 have all their clinical data. This corresponds to ~84% of all pathogens and ~34% of all the mutations.
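
Keeping the counts in named variables makes it easier to see which denominator each percentage uses. This is just a sketch: the three counts are the ones quoted in this notebook, while the variable names and the `pct()` helper are hypothetical:

```r
n_mutations <- 1840311  # mutations in the original dataset
n_pathogen  <- 734800   # mutations with Pathogen == 1
n_clinical  <- 618196   # pathogens with complete clinical data

# Percentage of `part` relative to `whole`, rounded to one decimal.
pct <- function(part, whole) round(100 * part / whole, 1)

pct(n_clinical, n_pathogen)   # 84.1
pct(n_clinical, n_mutations)  # 33.6
```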

## Mutation filter

In [10]:
# We filter pathogens by only keeping rows where the Hugo symbol is in the list of genes with a gene weight of >=0.50
pathogen_g_50 <- pathogen %>% filter(Hugo_Symbol %in% g_50[[1]])

# We inspect the new dataframe
head(pathogen_g_50)
dim(pathogen_g_50)

# And find how many samples and patients are left
length(unique(pathogen_g_50$sample_id))
length(unique(pathogen_g_50$patient_id))

Unnamed: 0_level_0,Hugo_Symbol,sample_id,SIFT_Prediction,Polyphen_Prediction,Variant_Classification,gnomAD_AMR_AF,gnomAD_NFE_AF,Population,Pathogen,patient_id,⋯,dead,age,oncotree_code,sample_type,sequence_assay_ID,cancer_type,sample_type_detail,population,age_group,age_interval
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,⋯,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,KRAS,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,0.0,0.0,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
2,BRAF,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,0.0,0.0,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
3,EGFR,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,2.89101e-05,3.51673e-05,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
4,CTNNB1,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,,,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
5,PIK3CA,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,,,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
6,CDKN2A,GENIE-JHU-00006-00185,,,Nonsense_Mutation,,,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
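
The filter above relies on the first column of the gene-weight file holding gene symbols. The layout of `gene_weights_50.csv` is not shown in this notebook, so the following sketch assumes a headerless file with one symbol per line (`g_toy`, `toy_pathogen` and their values are hypothetical). If the real file had a header row, `header=FALSE` would read the header text as if it were a gene symbol:

```r
# Toy gene-weight file without a header row: one gene symbol per line.
g_toy <- read.csv(text = "KRAS\nEGFR", header = FALSE, stringsAsFactors = FALSE)

toy_pathogen <- data.frame(
  Hugo_Symbol = c("KRAS", "TP53", "EGFR", "KRAS"),
  sample_id   = c("S1", "S1", "S2", "S3"),
  stringsAsFactors = FALSE
)

# header = FALSE names the first column V1; [[1]] extracts it as a plain vector.
keep_genes <- g_toy[[1]]

# %in% is TRUE for each row whose gene symbol appears in the list.
toy_filtered <- toy_pathogen[toy_pathogen$Hugo_Symbol %in% keep_genes, ]

toy_filtered
```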


In [11]:
# Percentage of rows left, in relation to total number of pathogens
511408/734800*100

# Percentage of rows left, in relation to the original mutation dataset
511408/1840311*100

In [12]:
# The same as before, but with gene weights of >=0.75
pathogen_g_75 <- pathogen %>% filter(Hugo_Symbol %in% g_75[[1]])

head(pathogen_g_75)
dim(pathogen_g_75)

length(unique(pathogen_g_75$sample_id))
length(unique(pathogen_g_75$patient_id))

Unnamed: 0_level_0,Hugo_Symbol,sample_id,SIFT_Prediction,Polyphen_Prediction,Variant_Classification,gnomAD_AMR_AF,gnomAD_NFE_AF,Population,Pathogen,patient_id,⋯,dead,age,oncotree_code,sample_type,sequence_assay_ID,cancer_type,sample_type_detail,population,age_group,age_interval
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,⋯,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,KRAS,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,0.0,0.0,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
2,BRAF,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,0.0,0.0,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
3,EGFR,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,2.89101e-05,3.51673e-05,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
4,CTNNB1,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,,,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
5,PIK3CA,GENIE-JHU-00006-00185,deleterious,probably_damaging,Missense_Mutation,,,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[
6,CDKN2A,GENIE-JHU-00006-00185,,,Nonsense_Mutation,,,AMR,1,GENIE-JHU-00006,⋯,False,61,LUAD,Primary,JHU-50GP,Non-Small Cell Lung Cancer,Primary tumor,AMR,Middle Aged,[60-65[


In [13]:
405902/734800*100

405902/1840311*100

## Results

Before filtering by anything, there were 734,800 pathogens.

When restricting our data to tumor samples where there was clinical data, there were 618,196 pathogens left across 124,738 samples and 110,611 patients (~84% of all pathogens).

If we filter by a gene weight of >=0.50, there are 511,408 pathogens left across 122,143 samples and 108,320 patients (~70% of all 734,800 pathogens, ~28% of the original mutation data).

If we filter by a gene weight of >=0.75, there are 405,902 pathogens left across 118,360 samples and 105,077 patients (~55% of all 734,800 pathogens, ~22% of the original mutation data).
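
The per-stage figures above could also be gathered in one table instead of being read off separate cells. The helper below is a hypothetical sketch in base R, shown on made-up data; on the real data it would be called with `pathogen`, `pathogen_g_50` and `pathogen_g_75`:

```r
# Summarise one filtering stage: rows, distinct samples, distinct patients.
stage_summary <- function(df, label) {
  data.frame(
    stage    = label,
    rows     = nrow(df),
    samples  = length(unique(df$sample_id)),
    patients = length(unique(df$patient_id)),
    stringsAsFactors = FALSE
  )
}

# Toy joined table (hypothetical values).
toy_all <- data.frame(
  Hugo_Symbol = c("KRAS", "TP53", "EGFR", "KRAS"),
  sample_id   = c("S1", "S1", "S2", "S3"),
  patient_id  = c("P1", "P1", "P2", "P2"),
  stringsAsFactors = FALSE
)
toy_kras <- toy_all[toy_all$Hugo_Symbol == "KRAS", ]

# Stack the stages and express rows as a percentage of the first stage.
overview <- rbind(
  stage_summary(toy_all,  "clinical data present"),
  stage_summary(toy_kras, "gene filter")
)
overview$pct_rows <- round(100 * overview$rows / nrow(toy_all), 1)

overview
```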

In [14]:
write.csv(pathogen_g_75, "../../derived_data/genie_v15/pathogen_filtered_75.csv", row.names=FALSE)
write.csv(pathogen_g_50, "../../derived_data/genie_v15/pathogen_filtered_50.csv", row.names=FALSE)