This script:
- Filters for ependymoma samples only
- Adds a disease group to each ependymoma sample based on its primary site
- Matches DNA and RNA biospecimen IDs (BSIDs) from the pbta-histologies file
- Creates a notebook that can be used in the next script

In [6]:
# Importing modules

import argparse
import pandas as pd

# Reading in the input pbta-histologies file 
pbta_histologies = pd.read_csv("/Users/kogantit/Documents/OpenPBTA/OpenPBTA-analysis/data/pbta-histologies.tsv", sep="\t")




This function takes the primary site and categorizes the sample as "supratentorial" or "infratentorial".
Two primary sites are categorized as "None":
 1) Ventricles
 2) Other locations NOS

In [10]:
def group_disease(primary_site):
    """Return the disease group for a primary_site string.

    Matching is by substring; infratentorial sites are checked first.
    Sites that match no keyword return the string "None".
    """
    if "Posterior Fossa" in primary_site:
        return "infratentorial"
    elif "Optic" in primary_site:
        return "infratentorial"
    elif "Spinal" in primary_site:
        return "infratentorial"
    elif "Tectum" in primary_site:
        return "infratentorial"
    elif "Spine" in primary_site:
        return "infratentorial"
    elif "Frontal Lobe" in primary_site:
        return "supratentorial"
    elif "Parietal Lobe" in primary_site:
        return "supratentorial"
    elif "Occipital Lobe" in primary_site:
        return "supratentorial"
    elif "Temporal Lobe" in primary_site:
        return "supratentorial"
    else:
        return "None"
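
To see how the substring matching behaves on the kinds of values shown in this notebook, the sketch below restates the same rules in a compact form and runs them on a few primary-site strings. The name `classify_site`, the keyword tuples, and the guard for non-string input are illustrative assumptions, not part of the original analysis.

```python
# Hypothetical compact restatement of group_disease, for illustration only.
INFRATENTORIAL = ("Posterior Fossa", "Optic", "Spinal", "Tectum", "Spine")
SUPRATENTORIAL = ("Frontal Lobe", "Parietal Lobe", "Occipital Lobe", "Temporal Lobe")


def classify_site(primary_site):
    # Assumption: treat missing (non-string) values as "None" instead of
    # letting the `in` test raise a TypeError.
    if not isinstance(primary_site, str):
        return "None"
    # Infratentorial keywords are checked first, mirroring the if/elif order.
    if any(key in primary_site for key in INFRATENTORIAL):
        return "infratentorial"
    if any(key in primary_site for key in SUPRATENTORIAL):
        return "supratentorial"
    return "None"


examples = [
    "Frontal Lobe",
    "Cerebellum/Posterior Fossa",
    "Ventricles",
    "Parietal Lobe;Ventricles",
]
results = {site: classify_site(site) for site in examples}
for site, group in results.items():
    print(f"{site!r} -> {group}")
```

Because matching is by substring, a multi-site entry such as "Parietal Lobe;Ventricles" is assigned by whichever keyword matches first. The original `group_disease` has no guard for missing values, so a NaN `primary_site` would raise a TypeError there.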

In [14]:
# Filter the pbta-histologies table for ependymoma samples
EP = pbta_histologies[pbta_histologies["integrated_diagnosis"]=="Ependymoma"]

# Keep only RNA-Seq samples and the BSID, primary site, PTID, sample ID and experimental strategy columns.
# .copy() avoids a SettingWithCopyWarning when a column is added to this dataframe later.
EP_rnaseq_samples = EP[EP["experimental_strategy"] == "RNA-Seq"][["Kids_First_Biospecimen_ID", "primary_site", 
    "Kids_First_Participant_ID", "sample_id", "experimental_strategy"]].copy()

EP_rnaseq_samples.head()

Unnamed: 0,Kids_First_Biospecimen_ID,primary_site,Kids_First_Participant_ID,sample_id,experimental_strategy
27,BS_07ANYSYQ,Frontal Lobe,PT_S4H6KA09,7316-2134,RNA-Seq
33,BS_0BXY0F9N,Cerebellum/Posterior Fossa,PT_164RNWTT,7316-1078,RNA-Seq
56,BS_0QYS36NR,Ventricles,PT_V3Q78E6F,7316-455,RNA-Seq
75,BS_0WQJP6ZG,Frontal Lobe,PT_Y6Y9JJ9P,7316-425,RNA-Seq
80,BS_0XEG6SNV,Parietal Lobe;Ventricles,PT_82A9SDRN,7316-2313,RNA-Seq


In [16]:
# Add a disease_group column to the dataframe above using the group_disease function
EP_rnaseq_samples["disease_group"] = [group_disease(primary) for primary in EP_rnaseq_samples["primary_site"]]
EP_rnaseq_samples.head()

Unnamed: 0,Kids_First_Biospecimen_ID,primary_site,Kids_First_Participant_ID,sample_id,experimental_strategy,disease_group
27,BS_07ANYSYQ,Frontal Lobe,PT_S4H6KA09,7316-2134,RNA-Seq,supratentorial
33,BS_0BXY0F9N,Cerebellum/Posterior Fossa,PT_164RNWTT,7316-1078,RNA-Seq,infratentorial
56,BS_0QYS36NR,Ventricles,PT_V3Q78E6F,7316-455,RNA-Seq,
75,BS_0WQJP6ZG,Frontal Lobe,PT_Y6Y9JJ9P,7316-425,RNA-Seq,supratentorial
80,BS_0XEG6SNV,Parietal Lobe;Ventricles,PT_82A9SDRN,7316-2313,RNA-Seq,supratentorial


In [18]:
# List of participant IDs (PTIDs) for the RNA samples
EP_rnasamplenames_PTIDs = list(EP_rnaseq_samples["Kids_First_Participant_ID"]) 

# Filter EP (the ependymoma-only dataframe from pbta-histologies) for DNA (WGS) samples
all_WGS = EP[EP["experimental_strategy"]=="WGS"]
# Keep only DNA samples whose PTID also appears among the RNA sample PTIDs
WGSPT = all_WGS[all_WGS["Kids_First_Participant_ID"].isin(EP_rnasamplenames_PTIDs)]
# Create a new dataframe for the DNA samples with BSID, PTID and sample_id
WGS_dnaseqsamples = WGSPT[["Kids_First_Biospecimen_ID", "Kids_First_Participant_ID", "sample_id"]]
WGS_dnaseqsamples.head()

Unnamed: 0,Kids_First_Biospecimen_ID,Kids_First_Participant_ID,sample_id
0,BS_007JTNB8,PT_1MW98VR1,7316-2558
5,BS_01DQH017,PT_CYVVA9AB,7316-3299
73,BS_0W8AWY10,PT_06H29FCG,7316-1944
81,BS_0XR7GYAV,PT_XHYBZKCX,7316-2972
114,BS_1AZ8YJSH,PT_R6WWH1QX,7316-423


In [19]:
# Rename the BSID columns so they don't conflict in the merge step
EP_rnaseq_samples = EP_rnaseq_samples.rename(columns={"Kids_First_Biospecimen_ID":"Kids_First_Biospecimen_ID_RNA"})
WGS_dnaseqsamples = WGS_dnaseqsamples.rename(columns={"Kids_First_Biospecimen_ID":"Kids_First_Biospecimen_ID_DNA"})


- Merge the two dataframes on "sample_id". This was the only ID I found that uniquely links DNA and RNA samples; participant IDs repeat across multiple RNA and DNA samples (for example, two RNA BSIDs can share the same participant ID).
- Every RNA and DNA BSID pair with the same sample_id also has the same participant ID, so I rename "Kids_First_Participant_ID_x" to "Kids_First_Participant_ID" for simplicity.
- Some RNA samples have no corresponding DNA BSID, which is why this is a left join on the RNA table.
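
The points above can be checked on a small example. The sketch below uses made-up IDs (not taken from pbta-histologies) to show how a left merge on `sample_id` suffixes the shared participant column and how pandas' `indicator` and `validate` options can flag unmatched RNA rows and duplicate keys. Whether `sample_id` is strictly one-to-one in the real data is an assumption to verify, not something this sketch establishes.

```python
import pandas as pd

# Toy data with made-up IDs, for illustration only.
rna = pd.DataFrame({
    "Kids_First_Biospecimen_ID_RNA": ["BS_RNA1", "BS_RNA2", "BS_RNA3"],
    "Kids_First_Participant_ID": ["PT_A", "PT_A", "PT_B"],
    "sample_id": ["S-1", "S-2", "S-3"],
})
dna = pd.DataFrame({
    "Kids_First_Biospecimen_ID_DNA": ["BS_DNA1", "BS_DNA3"],
    "Kids_First_Participant_ID": ["PT_A", "PT_B"],
    "sample_id": ["S-1", "S-3"],
})

# Left join keeps every RNA row. indicator=True adds a `_merge` column that
# records whether a DNA match was found; validate raises a MergeError if
# sample_id is duplicated on either side.
merged = rna.merge(dna, on="sample_id", how="left",
                   indicator=True, validate="one_to_one")

# The shared participant column is suffixed _x (RNA side) and _y (DNA side).
print(merged.columns.tolist())

# RNA samples with no matching DNA biospecimen.
unmatched = merged.loc[merged["_merge"] == "left_only", "sample_id"].tolist()
print(unmatched)

# Where a match exists, the participant IDs on both sides should agree.
matched = merged[merged["_merge"] == "both"]
ids_agree = bool((matched["Kids_First_Participant_ID_x"]
                  == matched["Kids_First_Participant_ID_y"]).all())
print(ids_agree)
```

If the `validate` check fails on real data, the merge would need a different key or a de-duplication step first.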

In [20]:
EP_rnaseq_WGS = EP_rnaseq_samples.merge(WGS_dnaseqsamples, on = "sample_id", how = "left")
EP_rnaseq_WGS = EP_rnaseq_WGS.rename(columns={"Kids_First_Participant_ID_x":"Kids_First_Participant_ID"})
EP_rnaseq_WGS.fillna('NA', inplace=True)

EP_rnaseq_WGS.head()

Unnamed: 0,Kids_First_Biospecimen_ID_RNA,primary_site,Kids_First_Participant_ID,sample_id,experimental_strategy,disease_group,Kids_First_Biospecimen_ID_DNA,Kids_First_Participant_ID_y
0,BS_07ANYSYQ,Frontal Lobe,PT_S4H6KA09,7316-2134,RNA-Seq,supratentorial,BS_K6A9Z04J,PT_S4H6KA09
1,BS_0BXY0F9N,Cerebellum/Posterior Fossa,PT_164RNWTT,7316-1078,RNA-Seq,infratentorial,BS_5D24XV4T,PT_164RNWTT
2,BS_0QYS36NR,Ventricles,PT_V3Q78E6F,7316-455,RNA-Seq,,BS_7RQCH5Y7,PT_V3Q78E6F
3,BS_0WQJP6ZG,Frontal Lobe,PT_Y6Y9JJ9P,7316-425,RNA-Seq,supratentorial,,
4,BS_0XEG6SNV,Parietal Lobe;Ventricles,PT_82A9SDRN,7316-2313,RNA-Seq,supratentorial,BS_NWYBD9CA,PT_82A9SDRN
