# Dataframe join examples

In this use case, we provide several examples of how to use the built-in <code>cptac</code> functions for joining different dataframes.

In [None]:
import cptac
en = cptac.Endometrial()

## General format

In all of the join functions, you specify the dataframes you want to join by passing their names to the appropriate parameters in the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.

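Whenever a column from an -omics dataframe is included in a joined table, the name of the -omics dataframe it came from is appended to the column header as a suffix (for example, <code>A1BG_proteomics</code>), so that columns with the same name from different dataframes remain distinguishable.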

If you wish to only include particular columns in the join, pass them to the appropriate parameters in the join function. All such parameters will accept either a single column name as a string, or a list of column name strings.

All join functions use logic analogous to an SQL <code>INNER JOIN</code>.
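
As a rough illustration of the two conventions above (source suffixes and inner-join logic), the sketch below uses plain pandas on two made-up tables. It is not how <code>cptac</code> implements its joins; the frames and their values are invented for the example.

```python
import pandas as pd

# Toy stand-ins for two -omics tables, indexed by sample ID (values are invented).
prot = pd.DataFrame({"A1BG": [-1.18, -0.685, -0.528]}, index=["S001", "S002", "S003"])
tran = pd.DataFrame({"A1BG": [4.2, 5.1, 3.9]}, index=["S002", "S003", "S004"])
prot.index.name = tran.index.name = "Sample_ID"

# Tag each column with its source table so identical gene names stay distinguishable.
prot = prot.add_suffix("_proteomics")
tran = tran.add_suffix("_transcriptomics")

# Inner join: only samples present in both tables are kept.
joined = prot.join(tran, how="inner")
print(joined)
```

Samples S001 and S004 each appear in only one of the toy tables, so they are dropped from the result.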

## join_omics_to_omics

The <code>join_omics_to_omics</code> function joins two -omics dataframes to each other:

In [2]:
prot_and_phos = en.join_omics_to_omics(omics_df1_name="proteomics", omics_df2_name="phosphoproteomics")
prot_and_phos.head()

Sample_ID,A1BG_proteomics,A2M_proteomics,A2ML1_proteomics,A4GALT_proteomics,AAAS_proteomics,AACS_proteomics,AADAT_proteomics,AAED1_proteomics,AAGAB_proteomics,AAK1_proteomics,...,ZZZ3-S397_phosphoproteomics,ZZZ3-S411_phosphoproteomics,ZZZ3-S420_phosphoproteomics,ZZZ3-S424_phosphoproteomics,ZZZ3-S426_phosphoproteomics,ZZZ3-S468_phosphoproteomics,ZZZ3-S89_phosphoproteomics,ZZZ3-T415_phosphoproteomics,ZZZ3-T418_phosphoproteomics,ZZZ3-Y399_phosphoproteomics
S001,-1.18,-0.863,-0.802,0.222,0.256,0.665,1.28,-0.339,0.412,-0.664,...,0.184,,,,-0.205,,,,,
S002,-0.685,-1.07,-0.684,0.984,0.135,0.334,1.3,0.139,1.33,-0.367,...,-0.171,,,-0.393,-0.171,,0.29,,0.1605,-0.0635
S003,-0.528,-1.32,0.435,,-0.24,1.04,-0.0213,-0.0479,0.419,-0.5,...,,,,,,,,,,
S005,-1.67,-1.19,-0.443,0.243,-0.0993,0.757,0.74,-0.929,0.229,-0.223,...,0.1397,,,,-0.559,,,,,0.298
S006,-0.374,-0.0206,-0.537,0.311,0.375,0.0131,-1.1,,0.565,-0.101,...,-0.15875,,,0.196,0.06175,,,,,-0.29


Joining only specific columns:

In [3]:
prot_and_phos_selected = en.join_omics_to_omics(omics_df1_name="proteomics", omics_df2_name="phosphoproteomics", genes1="A1BG", genes2=["PIK3CA", "TP53"])
prot_and_phos_selected.head()

Sample_ID,A1BG_proteomics,PIK3CA-S312_phosphoproteomics,PIK3CA-T313_phosphoproteomics,TP53-S315_phosphoproteomics,TP53-T150_phosphoproteomics
S001,-1.18,-0.00615,0.0731,,
S002,-0.685,-0.0222,,0.646,
S003,-0.528,,0.083,-0.8,
S005,-1.67,,-0.846,,
S006,-0.374,0.436,,3.76,0.253
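
In the output above, passing a gene name for the phosphoproteomics dataframe returns every site-level column for that gene (for example, both <code>PIK3CA-S312</code> and <code>PIK3CA-T313</code>). The sketch below mimics that behavior in plain pandas. The helper <code>select_gene_columns</code> is hypothetical and is not part of <code>cptac</code>; it assumes site-level columns are named <code>GENE-SITE</code>, as they appear in the table.

```python
import pandas as pd

def select_gene_columns(df, genes):
    """Hypothetical helper: keep the columns belonging to the given gene(s)."""
    # Accept either a single gene name or a list of names.
    if isinstance(genes, str):
        genes = [genes]
    # Assumption: everything before the last "-" in a column name is the gene.
    keep = [col for col in df.columns if col.rsplit("-", 1)[0] in genes]
    return df[keep]

# A few phosphoproteomics values copied from the outputs above.
phos = pd.DataFrame(
    {
        "PIK3CA-S312": [-0.00615, -0.0222],
        "PIK3CA-T313": [0.0731, None],
        "TP53-S315": [None, 0.646],
        "ZZZ3-S397": [0.184, -0.171],
    },
    index=pd.Index(["S001", "S002"], name="Sample_ID"),
)

print(select_gene_columns(phos, "TP53").columns.tolist())
print(select_gene_columns(phos, ["PIK3CA", "TP53"]).columns.tolist())
```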


## join_metadata_to_omics

The <code>join_metadata_to_omics</code> function joins a metadata dataframe (e.g. clinical, derived_molecular, or treatment) with an -omics dataframe:

In [4]:
clin_and_tran = en.join_metadata_to_omics(metadata_df_name="clinical", omics_df_name="transcriptomics")
clin_and_tran.head()

Sample_ID,Patient_ID,Proteomics_Tumor_Normal,Country,Histologic_Grade_FIGO,Myometrial_invasion_Specify,Histologic_type,Treatment_naive,Tumor_purity,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,...,ZWILCH_transcriptomics,ZWINT_transcriptomics,ZXDA_transcriptomics,ZXDB_transcriptomics,ZXDC_transcriptomics,ZYG11A_transcriptomics,ZYG11B_transcriptomics,ZYX_transcriptomics,ZZEF1_transcriptomics,ZZZ3_transcriptomics
S001,C3L-00006,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pN0,...,11.06,10.73,8.4,9.78,10.88,5.93,11.52,10.23,11.5,11.47
S002,C3L-00008,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pNX,...,10.87,11.43,8.39,9.14,10.38,7.25,11.64,10.64,11.26,11.57
S003,C3L-00032,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pN0,...,10.06,10.13,8.35,9.27,10.46,6.85,11.6,10.21,11.51,11.09
S005,C3L-00090,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pNX,...,10.29,10.41,9.1,9.59,10.15,7.89,11.9,10.21,11.34,11.51
S006,C3L-00098,Tumor,United States,,under 50 %,Serous,YES,Normal,pT1a (FIGO IA),pNX,...,10.36,11.24,8.6,9.44,11.8,9.32,11.97,9.77,11.37,12.35


Joining only specific columns:

In [5]:
clin_and_tran = en.join_metadata_to_omics(metadata_df_name="clinical", omics_df_name="transcriptomics", metadata_cols=["Age", "Histologic_type"], omics_genes="ZZZ3")
clin_and_tran.head()

Sample_ID,Age,Histologic_type,ZZZ3_transcriptomics
S001,64.0,Endometrioid,11.47
S002,58.0,Endometrioid,11.57
S003,50.0,Endometrioid,11.09
S005,75.0,Endometrioid,11.51
S006,63.0,Serous,12.35
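
Once metadata and -omics values sit in the same table, ordinary pandas operations apply. As one possible follow-up, the sketch below rebuilds the five rows shown above as a small standalone dataframe (the name <code>clin_and_tran_demo</code> is made up for this example) and averages ZZZ3 expression by histologic type.

```python
import pandas as pd

# The five rows from the joined table shown above, rebuilt by hand.
clin_and_tran_demo = pd.DataFrame(
    {
        "Age": [64.0, 58.0, 50.0, 75.0, 63.0],
        "Histologic_type": ["Endometrioid"] * 4 + ["Serous"],
        "ZZZ3_transcriptomics": [11.47, 11.57, 11.09, 11.51, 12.35],
    },
    index=pd.Index(["S001", "S002", "S003", "S005", "S006"], name="Sample_ID"),
)

# Mean ZZZ3 transcript level within each histologic type.
mean_by_type = clin_and_tran_demo.groupby("Histologic_type")["ZZZ3_transcriptomics"].mean()
print(mean_by_type)
```

Five samples are far too few to draw conclusions from; the point is only that the joined table can be grouped by a metadata column directly.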


## join_omics_to_mutations

The <code>join_omics_to_mutations</code> function joins an -omics dataframe with the mutation data for a specified gene or genes. Because there may be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists. If there is no mutation for the gene in a particular sample, the mutation list contains either "Wildtype_Tumor" or "Wildtype_Normal", depending on whether it's a tumor or normal sample. To make the results easier to filter, a separate mutation status column contains a single string for each sample: "Single_mutation", "Multiple_mutation", "Wildtype_Tumor", or "Wildtype_Normal".

In [6]:
acet_and_TP53_mut = en.join_omics_to_mutations(omics_df_name="acetylproteomics", mutations_genes="TP53")
acet_and_TP53_mut.head()

Sample_ID,A2M-K1168_acetylproteomics,A2M-K1176_acetylproteomics,A2M-K135_acetylproteomics,A2M-K145_acetylproteomics,A2M-K516_acetylproteomics,A2M-K664_acetylproteomics,A2M-K682_acetylproteomics,AACS-K391_acetylproteomics,AAGAB-K290_acetylproteomics,AAK1-K201_acetylproteomics,...,ZYX-K25_acetylproteomics,ZYX-K265_acetylproteomics,ZYX-K272_acetylproteomics,ZYX-K279_acetylproteomics,ZYX-K533_acetylproteomics,ZZZ3-K117_acetylproteomics,TP53_Mutation,TP53_Location,TP53_Mutation_Status,Sample_Status
S001,,1.08,,,,,,,0.461,,...,,,,,,,[Missense_Mutation],[p.R248W],Single_mutation,Tumor
S002,,0.477,,,,,,,1.77,,...,-0.343,-0.307,,-0.0955,,,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor
S003,,,,,,,,,-0.815,-0.00573,...,-1.17,,,-0.705,0.089,,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor
S005,,-0.608,,,-0.919,,,,,,...,-0.537,,,-0.37,,,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor
S006,,1.63,,2.4,,,1.26,,0.205,,...,-0.358,,,-0.97,,,[Missense_Mutation],[p.S241C],Single_mutation,Tumor
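
The status column and the list-valued columns lend themselves to different pandas idioms. The sketch below is one way to work with them, using a hand-built dataframe rather than real <code>cptac</code> output: the first three rows copy the mutation columns shown above, while sample S999 and its two mutations are invented to illustrate the "Multiple_mutation" case.

```python
import pandas as pd

mut = pd.DataFrame(
    {
        "TP53_Mutation": [
            ["Missense_Mutation"],
            ["Wildtype_Tumor"],
            ["Wildtype_Tumor"],
            ["Missense_Mutation", "Missense_Mutation"],  # invented sample
        ],
        "TP53_Location": [
            ["p.R248W"],
            ["No_mutation"],
            ["No_mutation"],
            ["p.R175H", "p.R273C"],  # invented sample
        ],
        "TP53_Mutation_Status": [
            "Single_mutation",
            "Wildtype_Tumor",
            "Wildtype_Tumor",
            "Multiple_mutation",
        ],
    },
    index=pd.Index(["S001", "S002", "S003", "S999"], name="Sample_ID"),
)

# The status column holds plain strings, so it works directly as a filter.
is_mutated = mut["TP53_Mutation_Status"].isin(["Single_mutation", "Multiple_mutation"])
mutated = mut[is_mutated]

# explode() unpacks each list into one row per mutation, repeating the sample ID.
locations = mutated["TP53_Location"].explode()
print(locations)
```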


Selecting only specified columns from the -omics dataframe:

In [7]:
selected_acet_and_TP53_mut = en.join_omics_to_mutations(omics_df_name="acetylproteomics", mutations_genes="TP53", omics_genes=["AAGAB", "AACS"])
selected_acet_and_TP53_mut.head()

Sample_ID,AAGAB-K290_acetylproteomics,AACS-K391_acetylproteomics,TP53_Mutation,TP53_Location,TP53_Mutation_Status,Sample_Status
S001,0.461,,[Missense_Mutation],[p.R248W],Single_mutation,Tumor
S002,1.77,,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor
S003,-0.815,,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor
S005,,,[Wildtype_Tumor],[No_mutation],Wildtype_Tumor,Tumor
S006,0.205,,[Missense_Mutation],[p.S241C],Single_mutation,Tumor


To hide the mutation location columns, pass <code>False</code> to the optional <code>show_location</code> parameter:

In [8]:
more_selected_acet_and_mut = en.join_omics_to_mutations(omics_df_name="acetylproteomics", mutations_genes=["PIK3CA", "TP53"], omics_genes=["AAGAB", "AACS"], show_location=False)
more_selected_acet_and_mut.head()

Sample_ID,AAGAB-K290_acetylproteomics,AACS-K391_acetylproteomics,PIK3CA_Mutation,PIK3CA_Mutation_Status,TP53_Mutation,TP53_Mutation_Status,Sample_Status
S001,0.461,,[Missense_Mutation],Single_mutation,[Missense_Mutation],Single_mutation,Tumor
S002,1.77,,[Wildtype_Tumor],Wildtype_Tumor,[Wildtype_Tumor],Wildtype_Tumor,Tumor
S003,-0.815,,[Missense_Mutation],Single_mutation,[Wildtype_Tumor],Wildtype_Tumor,Tumor
S005,,,[Wildtype_Tumor],Wildtype_Tumor,[Wildtype_Tumor],Wildtype_Tumor,Tumor
S006,0.205,,[Wildtype_Tumor],Wildtype_Tumor,[Missense_Mutation],Single_mutation,Tumor


# Exporting dataframes

To export a dataframe to a file, call the dataframe's <code>to_csv</code> method, passing the path to save the file to and the value separator to use (here, a tab, which produces a TSV file):

In [9]:
more_selected_acet_and_mut.to_csv(path_or_buf="more_selected_acet_and_mut.tsv", sep='\t')
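
When reading such a file back, note that list-valued cells (such as the mutation columns) are written out as text. The sketch below shows one way to round-trip a small hand-built dataframe through a temporary TSV file; the frame <code>demo</code> and the file name are made up for the example, and <code>ast.literal_eval</code> is just one option for turning the stored text back into lists.

```python
import ast
import os
import tempfile

import pandas as pd

# A tiny stand-in for a joined table with one numeric and one list-valued column.
demo = pd.DataFrame(
    {
        "AAGAB-K290_acetylproteomics": [0.461, 1.77],
        "TP53_Mutation": [["Missense_Mutation"], ["Wildtype_Tumor"]],
    },
    index=pd.Index(["S001", "S002"], name="Sample_ID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.tsv")
    demo.to_csv(path_or_buf=path, sep="\t")
    # index_col=0 restores Sample_ID as the index instead of a regular column.
    reloaded = pd.read_csv(path, sep="\t", index_col=0)

# Lists are saved as text such as "['Missense_Mutation']" and load as strings;
# literal_eval parses that text back into Python lists.
reloaded["TP53_Mutation"] = reloaded["TP53_Mutation"].apply(ast.literal_eval)
print(reloaded)
```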