# Dataframe join examples

In this use case, we provide several examples of how to use the built-in <code>cptac</code> functions for joining different dataframes.

In [2]:
import cptac
en = cptac.Endometrial()

                                    

## General format

In all of the join functions, you specify the dataframes you want to join by passing their names to the appropriate parameters in the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.

Whenever a column from an -omics dataframe is included in a joined table, the name of the -omics dataframe it came from is appended to the column header, to avoid confusion.

If you wish to only include particular columns in the join, pass them to the appropriate parameters in the join function. All such parameters will accept either a single column name as a string, or a list of column name strings. In this use case, we will usually only select specific columns for readability, but you could select the whole dataframe in all these cases, except for the mutations dataframe.

The join functions use logic analogous to an SQL INNER JOIN.
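
To make these semantics concrete without loading the dataset, here is a minimal pure-Python sketch of the two behaviors described above: appending the source dataframe's name to each column header, and keeping only the samples present in both tables. The helper names <code>suffix_columns</code> and <code>inner_join</code> are hypothetical and are not part of <code>cptac</code>. The toy tables reuse a few values from the outputs in this use case, with S003 deliberately left out of the second table.

```python
# Hypothetical helpers for illustration only; cptac does this work internally.

def suffix_columns(table, df_name):
    """Append the source dataframe's name to every column header."""
    return {
        sample: {f"{col}_{df_name}": value for col, value in row.items()}
        for sample, row in table.items()
    }

def inner_join(left, right):
    """Keep only samples present in both tables and merge their columns."""
    return {
        sample: {**left[sample], **right[sample]}
        for sample in left
        if sample in right
    }

# Toy tables keyed by Sample_ID. S003 is deliberately missing from the
# second table to show what an inner join does with unmatched samples.
proteomics = {"S001": {"A1BG": -1.18}, "S002": {"A1BG": -0.685}, "S003": {"A1BG": -0.528}}
transcriptomics = {"S001": {"ZZZ3": 11.47}, "S002": {"ZZZ3": 11.57}}

joined = inner_join(
    suffix_columns(proteomics, "proteomics"),
    suffix_columns(transcriptomics, "transcriptomics"),
)

for sample, row in joined.items():
    print(sample, row)
```

Only S001 and S002 appear in the result, because S003 has no row in the second toy table.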

## <code>join_omics_to_omics</code>

The <code>join_omics_to_omics</code> function joins two -omics dataframes to each other:

In [2]:
prot_and_phos = en.join_omics_to_omics(df1_name="proteomics", df2_name="phosphoproteomics")
prot_and_phos.head()

Sample_ID,A1BG_proteomics,A2M_proteomics,A2ML1_proteomics,A4GALT_proteomics,AAAS_proteomics,AACS_proteomics,AADAT_proteomics,AAED1_proteomics,AAGAB_proteomics,AAK1_proteomics,...,ZZZ3-S397_phosphoproteomics,ZZZ3-S411_phosphoproteomics,ZZZ3-S420_phosphoproteomics,ZZZ3-S424_phosphoproteomics,ZZZ3-S426_phosphoproteomics,ZZZ3-S468_phosphoproteomics,ZZZ3-S89_phosphoproteomics,ZZZ3-T415_phosphoproteomics,ZZZ3-T418_phosphoproteomics,ZZZ3-Y399_phosphoproteomics
S001,-1.18,-0.863,-0.802,0.222,0.256,0.665,1.28,-0.339,0.412,-0.664,...,0.184,,,,-0.205,,,,,
S002,-0.685,-1.07,-0.684,0.984,0.135,0.334,1.3,0.139,1.33,-0.367,...,-0.171,,,-0.393,-0.171,,0.29,,0.1605,-0.0635
S003,-0.528,-1.32,0.435,,-0.24,1.04,-0.0213,-0.0479,0.419,-0.5,...,,,,,,,,,,
S005,-1.67,-1.19,-0.443,0.243,-0.0993,0.757,0.74,-0.929,0.229,-0.223,...,0.1397,,,,-0.559,,,,,0.298
S006,-0.374,-0.0206,-0.537,0.311,0.375,0.0131,-1.1,,0.565,-0.101,...,-0.15875,,,0.196,0.06175,,,,,-0.29


Joining only specific columns.
(Note that when a gene is selected from the phosphoproteomics dataframe, data for all sites of the gene are selected. The same is done for acetylproteomics data.)

In [41]:
prot_and_phos_selected = en.join_omics_to_omics(
    df1_name="proteomics", 
    df2_name="phosphoproteomics", 
    genes1="A1BG", 
    genes2="PIK3CA")

prot_and_phos_selected.head()

Sample_ID,A1BG_proteomics,PIK3CA-S312_phosphoproteomics,PIK3CA-T313_phosphoproteomics
S001,-1.18,-0.00615,0.0731
S002,-0.685,-0.0222,
S003,-0.528,,0.083
S005,-1.67,,-0.846
S006,-0.374,0.436,
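
The site-selection behavior noted above can be pictured as a prefix match on column names. The sketch below assumes that site-level columns are named <code>GENE-SITE</code>, as in the output above; <code>columns_for_gene</code> is a hypothetical helper, not a <code>cptac</code> function.

```python
def columns_for_gene(columns, gene):
    """Return every column that is the gene itself or one of its sites.

    Assumes site-level columns are named '<gene>-<site>'.
    """
    return [c for c in columns if c == gene or c.startswith(gene + "-")]

# A few phosphoproteomics column names taken from the outputs in this use case.
phospho_columns = ["PIK3CA-S312", "PIK3CA-T313", "ZZZ3-S397", "ZZZ3-S411"]

print(columns_for_gene(phospho_columns, "PIK3CA"))
# ['PIK3CA-S312', 'PIK3CA-T313']
```

Matching on the gene name plus the hyphen, rather than the gene name alone, avoids picking up a different gene whose name merely starts with the same letters.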


## <code>join_metadata_to_omics</code>

The <code>join_metadata_to_omics</code> function joins a metadata dataframe (e.g. clinical or derived_molecular) with an -omics dataframe:

In [4]:
clin_and_tran = en.join_metadata_to_omics(metadata_df_name="clinical", omics_df_name="transcriptomics")
clin_and_tran.head()

Sample_ID,Patient_ID,Proteomics_Tumor_Normal,Country,Histologic_Grade_FIGO,Myometrial_invasion_Specify,Histologic_type,Treatment_naive,Tumor_purity,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,...,ZWILCH_transcriptomics,ZWINT_transcriptomics,ZXDA_transcriptomics,ZXDB_transcriptomics,ZXDC_transcriptomics,ZYG11A_transcriptomics,ZYG11B_transcriptomics,ZYX_transcriptomics,ZZEF1_transcriptomics,ZZZ3_transcriptomics
S001,C3L-00006,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pN0,...,11.06,10.73,8.4,9.78,10.88,5.93,11.52,10.23,11.5,11.47
S002,C3L-00008,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pNX,...,10.87,11.43,8.39,9.14,10.38,7.25,11.64,10.64,11.26,11.57
S003,C3L-00032,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pN0,...,10.06,10.13,8.35,9.27,10.46,6.85,11.6,10.21,11.51,11.09
S005,C3L-00090,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pNX,...,10.29,10.41,9.1,9.59,10.15,7.89,11.9,10.21,11.34,11.51
S006,C3L-00098,Tumor,United States,,under 50 %,Serous,YES,Normal,pT1a (FIGO IA),pNX,...,10.36,11.24,8.6,9.44,11.8,9.32,11.97,9.77,11.37,12.35


Joining only specific columns:

In [40]:
clin_and_tran = en.join_metadata_to_omics(
    metadata_df_name="clinical", 
    omics_df_name="transcriptomics", 
    metadata_cols = ["Age", "Histologic_type"], 
    omics_genes="ZZZ3")

clin_and_tran.head()

Sample_ID,Age,Histologic_type,ZZZ3_transcriptomics
S001,64.0,Endometrioid,11.47
S002,58.0,Endometrioid,11.57
S003,50.0,Endometrioid,11.09
S005,75.0,Endometrioid,11.51
S006,63.0,Serous,12.35


## <code>join_metadata_to_metadata</code>

The <code>join_metadata_to_metadata</code> function joins two metadata dataframes (e.g. clinical or derived_molecular) to each other. In the example below, we pass a column name to select from the clinical dataframe, while passing <code>None</code> to the column parameter for the derived_molecular dataframe causes the entire dataframe to be selected. We could omit the <code>cols2</code> parameter altogether, as it defaults to <code>None</code>.

In [39]:
hist_and_derived_molecular = en.join_metadata_to_metadata(
    df1_name="clinical",
    df2_name="derived_molecular",
    cols1="Histologic_type",
    cols2=None) # Could have omitted cols2=None, as that's the default value

hist_and_derived_molecular.head()

Sample_ID,Histologic_type,Estrogen_Receptor,Estrogen_Receptor_%,Progesterone_Receptor,Progesterone_Receptor_%,MLH1,MLH2,MSH6,PMS2,p53,...,Log2_variant_total,Log2_SNP_total,Log2_INDEL_total,Genomics_subtype,Mutation_signature_C>A,Mutation_signature_C>G,Mutation_signature_C>T,Mutation_signature_T>C,Mutation_signature_T>A,Mutation_signature_T>G
S001,Endometrioid,Cannot be determined,,Cannot be determined,,Intact nuclear expression,Intact nuclear expression,Loss of nuclear expression,Intact nuclear expression,Cannot be determined,...,10.062046,9.984418,5.83289,MSI-H,8.300395,1.482213,72.529644,14.426877,1.383399,1.87747
S002,Endometrioid,Cannot be determined,,Cannot be determined,,Intact nuclear expression,Intact nuclear expression,Intact nuclear expression,Loss of nuclear expression,Cannot be determined,...,8.861087,8.330917,7.169925,MSI-H,14.641745,2.803738,64.485981,15.264798,0.934579,1.869159
S003,Endometrioid,Cannot be determined,,Cannot be determined,,Intact nuclear expression,Intact nuclear expression,Intact nuclear expression,Intact nuclear expression,Cannot be determined,...,5.321928,5.0,3.169925,CNV_low,16.129032,3.225806,70.967742,3.225806,3.225806,3.225806
S005,Endometrioid,Cannot be determined,,Cannot be determined,,Intact nuclear expression,Intact nuclear expression,Intact nuclear expression,Intact nuclear expression,Cannot be determined,...,5.672425,5.523562,2.584963,CNV_low,17.777778,8.888889,62.222222,8.888889,2.222222,0.0
S006,Serous,Cannot be determined,,Cannot be determined,,Intact nuclear expression,Intact nuclear expression,Intact nuclear expression,Intact nuclear expression,Normal,...,6.108524,5.954196,3.0,CNV_high,9.836066,13.114754,62.295082,3.278689,8.196721,3.278689
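
The column-selection parameters accept a single name, a list of names, or <code>None</code> for the whole dataframe. One common way to handle such a parameter is to normalize it to a list up front. The following is only a sketch of that pattern under our own assumptions; <code>normalize_cols</code> is a hypothetical helper and does not reflect how <code>cptac</code> is implemented.

```python
def normalize_cols(cols, all_columns):
    """Turn a str, a list of str, or None into an explicit list of columns."""
    if cols is None:
        return list(all_columns)   # None: select the whole dataframe
    if isinstance(cols, str):
        return [cols]              # a single column name
    return list(cols)              # already a collection of names

# A few clinical column names that appear in this use case.
clinical_columns = ["Patient_ID", "Age", "Histologic_type"]

print(normalize_cols("Histologic_type", clinical_columns))
print(normalize_cols(["Age", "Histologic_type"], clinical_columns))
print(normalize_cols(None, clinical_columns))
```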


## <code>join_omics_to_mutations</code>

The <code>join_omics_to_mutations</code> function joins an -omics dataframe with the mutation data for a specified gene or genes. Because there may be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default. If there is no mutation for the gene in a particular sample, the list contains either "Wildtype_Tumor" or "Wildtype_Normal", depending on whether it's a tumor or normal sample. The mutation status column contains either "Single_mutation", "Multiple_mutation", "Wildtype_Tumor", or "Wildtype_Normal", for help with parsing.

The function has the ability to filter multiple mutations down to just one mutation. It first prioritizes mutations and locations (hotspots) that you pass to the optional <code>mutations_filter</code> parameter. Mutations and locations earlier in your filtering list are given priority over mutations and locations later in the list. When a sample has multiple mutations in a gene, but none of them were in your filter, the function chooses truncations over missense mutations, and silent mutations last of all. Between mutations of the same type, the function chooses mutations occurring earlier in the sequence. To filter all mutations based on this default hierarchy, simply pass an empty list to the <code>mutations_filter</code> parameter. Passing nothing to the <code>mutations_filter</code> parameter will cause no filtering to be done.

Note that when multiple mutations are filtered, the Mutation_Status column is still included, so you can tell whether a sample originally had multiple mutations.
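
The prioritization described above can be sketched as a sort key. This is a simplified pure-Python illustration of the prose, not the library's implementation: the grouping of mutation types into truncations and missense mutations, the way positions are parsed out of location strings, and the helper names are all assumptions, so the library's real tie-breaking may differ in some cases. The case where nothing is passed to <code>mutations_filter</code> (no filtering) is not modeled.

```python
import re

# Assumed groupings, for illustration only; the library's own lists may differ.
TRUNCATIONS = {"Nonsense_Mutation", "Frame_Shift_Ins", "Frame_Shift_Del"}
MISSENSE = {"Missense_Mutation"}

def position(location):
    """Return the first integer in a location string such as 'p.R130G'."""
    match = re.search(r"\d+", location)
    return int(match.group()) if match else float("inf")

def default_rank(mutation):
    """Truncations first, then missense, then everything else (e.g. silent)."""
    if mutation in TRUNCATIONS:
        return 0
    if mutation in MISSENSE:
        return 1
    return 2

def pick_mutation(pairs, mutations_filter):
    """Choose one (mutation, location) pair from a sample's list of pairs."""
    def filter_rank(pair):
        # Earliest filter entry matched by the mutation type or the location;
        # pairs matching nothing rank after every filter entry.
        hits = [mutations_filter.index(v) for v in pair if v in mutations_filter]
        return min(hits) if hits else len(mutations_filter)

    return min(
        pairs,
        key=lambda p: (filter_rank(p), default_rank(p[0]), position(p[1])),
    )

# PTEN values for sample S007, taken from the examples in this use case.
s007 = [("Missense_Mutation", "p.Y68C"), ("Missense_Mutation", "p.R130G")]

print(pick_mutation(s007, ["Nonsense_Mutation", "p.R130G", "p.Y68C", "p.H93R"]))
# ('Missense_Mutation', 'p.R130G')
print(pick_mutation(s007, []))
# ('Missense_Mutation', 'p.Y68C')
```

With the filter list, p.R130G wins because it appears earlier in the filter than p.Y68C. With an empty list, both mutations are missense, so the one earlier in the sequence is chosen.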

If you wish to drop the location columns, pass <code>False</code> to the optional <code>show_location</code> parameter.

First, we demonstrate the function with no filtering of multiple mutations. Note that the Mutation and Location column values are in lists, even if there is only one mutation.

In [30]:
selected_acet_and_PTEN_mut = en.join_omics_to_mutations(
    omics_df_name="acetylproteomics",
    mutations_genes="PTEN", 
    omics_genes=["AAGAB", "AACS"])

selected_acet_and_PTEN_mut.head(10)

Sample_ID,AAGAB-K290_acetylproteomics,AACS-K391_acetylproteomics,PTEN_Mutation,PTEN_Location,PTEN_Mutation_Status,Sample_Status
S001,0.461,,"[Missense_Mutation, Nonsense_Mutation]","[p.R130Q, p.R233*]",Multiple_mutation,Tumor
S002,1.77,,[Missense_Mutation],[p.G127R],Single_mutation,Tumor
S003,-0.815,,[Nonsense_Mutation],[p.W111*],Single_mutation,Tumor
S005,,,[Missense_Mutation],[p.R130G],Single_mutation,Tumor
S006,0.205,,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor
S007,0.403,,"[Missense_Mutation, Missense_Mutation]","[p.Y68C, p.R130G]",Multiple_mutation,Tumor
S008,0.792,,"[Frame_Shift_Ins, Nonsense_Mutation]","[p.H118Qfs*8, p.Y180*]",Multiple_mutation,Tumor
S009,,,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor
S010,-1.11,,[Missense_Mutation],[p.R130G],Single_mutation,Tumor
S011,0.4,,"[Missense_Mutation, Frame_Shift_Ins]","[p.H93R, p.E242*]",Multiple_mutation,Tumor
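
Because the unfiltered Mutation and Location cells hold parallel lists, the status column is a convenient way to decide how to unpack them. The sketch below uses plain Python structures in place of the joined dataframe, with three rows copied from the output above; the variable names are our own.

```python
# (mutations, locations, status) per sample, copied from the output above.
rows = {
    "S001": (["Missense_Mutation", "Nonsense_Mutation"], ["p.R130Q", "p.R233*"], "Multiple_mutation"),
    "S002": (["Missense_Mutation"], ["p.G127R"], "Single_mutation"),
    "S006": (["Wildtype_Tumor"], ["No_mutation"], "Wildtype_Tumor"),
}

# Pair each mutation with its location, but only for samples flagged
# as carrying more than one mutation in the gene.
multiple = {
    sample: list(zip(mutations, locations))
    for sample, (mutations, locations, status) in rows.items()
    if status == "Multiple_mutation"
}

print(multiple)
# {'S001': [('Missense_Mutation', 'p.R130Q'), ('Nonsense_Mutation', 'p.R233*')]}
```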


Using the mutations filtering functionality. Notice how mutations matching the mutation types and locations in the filter are prioritized, even if they occur later in the sequence than others, and how filter values earlier in the list are prioritized over values later in the list.

In [35]:
selected_acet_and_PTEN_mut = en.join_omics_to_mutations(
    omics_df_name="acetylproteomics",
    mutations_genes="PTEN", 
    omics_genes=["AAGAB", "AACS"],
    mutations_filter=["Nonsense_Mutation", "p.R130G", "p.Y68C", "p.H93R"])

selected_acet_and_PTEN_mut.head(10)

Sample_ID,AAGAB-K290_acetylproteomics,AACS-K391_acetylproteomics,PTEN_Mutation,PTEN_Location,PTEN_Mutation_Status,Sample_Status
S001,0.461,,Nonsense_Mutation,p.R233*,Multiple_mutation,Tumor
S002,1.77,,Missense_Mutation,p.G127R,Single_mutation,Tumor
S003,-0.815,,Nonsense_Mutation,p.W111*,Single_mutation,Tumor
S005,,,Missense_Mutation,p.R130G,Single_mutation,Tumor
S006,0.205,,Wildtype_Tumor,No_mutation,Wildtype_Tumor,Tumor
S007,0.403,,Missense_Mutation,p.R130G,Multiple_mutation,Tumor
S008,0.792,,Nonsense_Mutation,p.Y180*,Multiple_mutation,Tumor
S009,,,Wildtype_Tumor,No_mutation,Wildtype_Tumor,Tumor
S010,-1.11,,Missense_Mutation,p.R130G,Single_mutation,Tumor
S011,0.4,,Missense_Mutation,p.H93R,Multiple_mutation,Tumor


Using the default filtering hierarchy. Notice how without any values in the filter list, all truncations are chosen over missense mutations, and mutations earlier in the sequence are chosen over mutations later in it.

In [31]:
selected_acet_and_PTEN_mut = en.join_omics_to_mutations(
    omics_df_name="acetylproteomics",
    mutations_genes="PTEN", 
    omics_genes=["AAGAB", "AACS"],
    mutations_filter=[])

selected_acet_and_PTEN_mut.head(10)

Sample_ID,AAGAB-K290_acetylproteomics,AACS-K391_acetylproteomics,PTEN_Mutation,PTEN_Location,PTEN_Mutation_Status,Sample_Status
S001,0.461,,Nonsense_Mutation,p.R233*,Multiple_mutation,Tumor
S002,1.77,,Missense_Mutation,p.G127R,Single_mutation,Tumor
S003,-0.815,,Nonsense_Mutation,p.W111*,Single_mutation,Tumor
S005,,,Missense_Mutation,p.R130G,Single_mutation,Tumor
S006,0.205,,Wildtype_Tumor,No_mutation,Wildtype_Tumor,Tumor
S007,0.403,,Missense_Mutation,p.Y68C,Multiple_mutation,Tumor
S008,0.792,,Nonsense_Mutation,p.Y180*,Multiple_mutation,Tumor
S009,,,Wildtype_Tumor,No_mutation,Wildtype_Tumor,Tumor
S010,-1.11,,Missense_Mutation,p.R130G,Single_mutation,Tumor
S011,0.4,,Frame_Shift_Ins,p.E242*,Multiple_mutation,Tumor


## <code>join_metadata_to_mutations</code>

The <code>join_metadata_to_mutations</code> function works exactly like <code>join_omics_to_mutations</code>, except that it works with metadata dataframes (e.g. clinical and derived_molecular) instead of -omics dataframes. It can also filter multiple mutations, which you control through the <code>mutations_filter</code> parameter, and can hide the location columns.

In [33]:
hist_and_PTEN = en.join_metadata_to_mutations(
    metadata_df_name="clinical",
    mutations_genes="PTEN",
    metadata_cols="Histologic_type")

hist_and_PTEN.head()

Sample_ID,Histologic_type,PTEN_Mutation,PTEN_Location,PTEN_Mutation_Status,Sample_Status
S001,Endometrioid,"[Missense_Mutation, Nonsense_Mutation]","[p.R130Q, p.R233*]",Multiple_mutation,Tumor
S002,Endometrioid,[Missense_Mutation],[p.G127R],Single_mutation,Tumor
S003,Endometrioid,[Nonsense_Mutation],[p.W111*],Single_mutation,Tumor
S005,Endometrioid,[Missense_Mutation],[p.R130G],Single_mutation,Tumor
S006,Serous,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor


With multiple mutations filtered:

In [38]:
hist_and_PTEN = en.join_metadata_to_mutations(
    metadata_df_name="clinical",
    mutations_genes="PTEN",
    metadata_cols="Histologic_type",
    mutations_filter=["Nonsense_Mutation"])

hist_and_PTEN.head()

Sample_ID,Histologic_type,PTEN_Mutation,PTEN_Location,PTEN_Mutation_Status,Sample_Status
S001,Endometrioid,Nonsense_Mutation,p.R233*,Multiple_mutation,Tumor
S002,Endometrioid,Missense_Mutation,p.G127R,Single_mutation,Tumor
S003,Endometrioid,Nonsense_Mutation,p.W111*,Single_mutation,Tumor
S005,Endometrioid,Missense_Mutation,p.R130G,Single_mutation,Tumor
S006,Serous,Wildtype_Tumor,No_mutation,Wildtype_Tumor,Tumor


## Exporting dataframes

If you wish to export a dataframe to a file, call the dataframe's <code>to_csv</code> method, passing the path you wish to save the file to and the value separator you want:

In [9]:
# Export the most recent acetylproteomics/PTEN join from above as a tab-separated file
selected_acet_and_PTEN_mut.to_csv(path_or_buf="selected_acet_and_PTEN_mut.tsv", sep='\t')