# Tutorial 3: Joining dataframes with `cptac`

In this tutorial, we provide several examples of how to use the built-in `cptac` functions for joining different dataframes.

We will use data for endometrial carcinoma. First, we import the package and create an endometrial data object, which we call `en`.

In [1]:
import cptac
en = cptac.Ucec()
en.list_data_sources()
en.get_proteomics('umich')




## General format

`cptac` provides a function called `multi_join`, which joins data from several `cptac` dataframes in a single call.

To use `multi_join`, you specify the dataframes you want to join by passing a dictionary of their names to the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.

Whenever a column from an -omics dataframe is included in a joined table, the name of the -omics dataframe it came from is appended to the column header, to avoid confusion.
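As a rough illustration of this naming rule, the sketch below tags each column with the source and datatype it came from. The helper `tag_columns` and the `column_source_datatype` header layout are assumptions made for this example; the exact headers that `cptac` produces may differ.

```python
def tag_columns(columns, source, datatype):
    """Append the source and datatype to each column name (assumed layout)."""
    return [f"{column}_{source}_{datatype}" for column in columns]

# The same genes selected from two different dataframes.
prot_cols = tag_columns(["AURKA", "TP53"], "awg", "proteomics")
tran_cols = tag_columns(["AURKA", "TP53"], "awg", "transcriptomics")

# After tagging, each gene has a distinct header in each table.
print(prot_cols + tran_cols)
```

Without the tag, the joined table would contain two columns named `AURKA` and two named `TP53`, with no way to tell which dataframe each came from.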

If you wish to only include particular columns in the join, include them as values in the dictionary. Each value can be either a single column name as a string or a list of column name strings. In this tutorial we usually select specific columns for readability, but you could select the whole dataframe in every case except the mutations dataframe.

The join functions use logic analogous to an SQL INNER JOIN.
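For readers unfamiliar with SQL, the inner-join logic mentioned above can be sketched in plain Python: only samples present in every table survive the join. The function `inner_join`, the sample IDs, and the values are all made up for illustration and are not part of `cptac`.

```python
def inner_join(left, right):
    """Keep only the sample IDs present in both tables, merging their columns."""
    return {
        sample: {**left[sample], **right[sample]}
        for sample in left
        if sample in right
    }

# Made-up values keyed by hypothetical sample IDs.
proteomics = {"S1": {"A1BG_proteomics": 0.5}, "S2": {"A1BG_proteomics": -0.2}}
clinical = {"S2": {"Age": 61}, "S3": {"Age": 58}}

joined = inner_join(proteomics, clinical)
print(joined)  # only S2 appears in both tables
```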

## Join dictionary

The main parameter for the `multi_join` function is a dictionary whose keys give the source and datatype and whose values give the columns to select. Because there are multiple sources for each datatype, the desired source needs to be included. This can be done in two different ways. The first is a string that contains the source, a space, and then the datatype. The second is a tuple of the form `(source, datatype)`. For example, using:

`{('awg', 'proteomics'): ''}`

or

`{"awg proteomics": ''}`

as the join dictionary would each result in `multi_join` returning a dataframe containing only awg proteomics data.

You'll notice the value in the key:value pair is an empty string. Because a dictionary needs a value for each key, an empty string or an empty list means we want everything from the specified dataframe. If a string or list of strings is specified, the joined dataframe will only contain the specified columns. See below for more examples.
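To make the two key formats and the meaning of empty values concrete, here is a minimal sketch of how such a dictionary could be normalized. The helpers `normalize_key` and `normalize_columns` are hypothetical and only mirror the rules described above; they are not `cptac`'s actual implementation.

```python
def normalize_key(key):
    """Return (source, datatype) from 'source datatype' or (source, datatype)."""
    if isinstance(key, str):
        source, datatype = key.split(" ", 1)
        return source, datatype
    source, datatype = key
    return source, datatype

def normalize_columns(value):
    """Return None for 'all columns', otherwise a list of column names."""
    if isinstance(value, str):
        return [value] if value else None
    return list(value) or None

join_dict = {
    ("awg", "proteomics"): "",                      # tuple key, whole dataframe
    "awg clinical": ["Age", "Histologic_type"],     # string key, two columns
    "awg transcriptomics": "ZZZ3",                  # string key, one column
}

normalized = {normalize_key(k): normalize_columns(v) for k, v in join_dict.items()}
print(normalized)
```

Both key styles end up as the same `(source, datatype)` pair, and `''` and `[]` both collapse to `None`, which this sketch uses to mean "select everything".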

## Join omics to omics

`multi_join` can join two -omics dataframes to each other. Types of -omics data valid for use with this function are acetylproteomics, CNV, phosphoproteomics, phosphoproteomics_gene, proteomics, and transcriptomics.

In [None]:
prot_and_phos = en.multi_join({"awg proteomics":'', "awg phosphoproteomics":''})
prot_and_phos.head()

Joining only specific columns.
(Note that when a gene is selected from the phosphoproteomics dataframe, data for all sites of the gene are selected. The same is done for acetylproteomics data.)

In [None]:
prot_and_phos_selected = en.multi_join({"awg proteomics":'A1BG', "awg phosphoproteomics":'PIK3CA'})
prot_and_phos_selected.head()
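The note above about phosphoproteomics can be pictured with a small sketch: selecting a gene keeps every site column belonging to that gene. The `(gene, site)` header layout, the site names, and the helper `select_gene_columns` are assumptions for illustration; the real dataframe's column structure may differ.

```python
def select_gene_columns(columns, genes):
    """Return every (gene, site) column whose gene is in `genes`."""
    wanted = {genes} if isinstance(genes, str) else set(genes)
    return [column for column in columns if column[0] in wanted]

# Made-up (gene, site) headers standing in for a phosphoproteomics table.
phospho_columns = [
    ("AURKA", "S51"),
    ("PIK3CA", "S312"),
    ("PIK3CA", "T313"),
    ("TP53", "S315"),
]

selected = select_gene_columns(phospho_columns, "PIK3CA")
print(selected)  # both PIK3CA sites are kept
```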

## Join metadata to omics

The `multi_join` function can also join a metadata dataframe (e.g. clinical or derived_molecular) with an -omics dataframe:

In [None]:
clin_and_tran = en.multi_join({"awg clinical":'', "awg transcriptomics":''})
clin_and_tran.head()

Joining only specific columns:

In [None]:
clin_and_tran = en.multi_join({"awg clinical": ["Age", "Histologic_type"], "awg transcriptomics": "ZZZ3"})
clin_and_tran.head()

## Join metadata to metadata

Two metadata dataframes (e.g. clinical or derived_molecular) can also be joined together. In the example below, we pass a column name to select from the clinical dataframe, while passing an empty string `''` (or an empty list `[]`) for the derived_molecular dataframe selects the entire dataframe.

In [None]:
hist_and_derived_molecular = en.multi_join({
    "awg clinical": "Histologic_type",
    "awg derived_molecular": '' # Note that by using an empty string or list as the value, we join the entire dataframe
})

hist_and_derived_molecular.head()

## Join many datatypes together

If you need data from three or more dataframes, simply add them all to the joining dictionary. `multi_join` does not impose a fixed limit on the number of dataframes in the joining dictionary.

In [None]:
joining_dictionary = {"awg proteomics": ["AURKA", "TP53"], "awg phosphoproteomics": ["AURKA", "TP53"], "awg clinical": [], "awg somatic_mutation": "PTEN"}
en.multi_join(joining_dictionary).head()

`multi_join` does not necessarily need to join different dataframes. If you just want a small amount of information from a dataframe, this function is useful for that as well.

In [None]:
histologic_type_and_grade = en.multi_join({"awg clinical": ['Histologic_type', 'Histologic_Grade_FIGO']})
histologic_type_and_grade.head()

## Join omics to mutations

Joining an -omics dataframe with the mutation data for a specified gene or genes is slightly different than other types of joins using `multi_join`. Because there may be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default, even if there is only one mutation. If there is no mutation for the gene in a particular sample, the list contains either "Wildtype_Tumor" or "Wildtype_Normal", depending on whether it's a tumor or normal sample. The mutation status column contains either "Single_mutation", "Multiple_mutation", "Wildtype_Tumor", or "Wildtype_Normal", for help with parsing.

In [None]:
# Join two proteomics columns to the PTEN mutation data
selected_prot_and_PTEN_mut_mult = en.multi_join({"awg proteomics": ["AURKA", "TP53"], "awg somatic_mutation": "PTEN"})
selected_prot_and_PTEN_mut_mult.head(10)

In [None]:
# The same kind of join, using the dedicated omics-to-mutations function
selected_prot_and_PTEN_mut = en.join_omics_to_mutations(
    omics_name="proteomics",
    mutations_genes="PTEN",
    omics_genes=["AURKA", "TP53"])

selected_prot_and_PTEN_mut.head(10)
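The rules for the mutation columns described at the start of this section can be sketched as follows. The function `summarize_mutations` and the example mutations are hypothetical; the sketch only mirrors the labels named in the text and is not `cptac`'s own code.

```python
def summarize_mutations(mutations, is_tumor=True):
    """Return (mutation_list, status) for one sample and one gene."""
    if not mutations:
        # No mutation: the list holds a single wildtype label.
        label = "Wildtype_Tumor" if is_tumor else "Wildtype_Normal"
        return [label], label
    status = "Single_mutation" if len(mutations) == 1 else "Multiple_mutation"
    # Mutations stay in a list even when there is only one.
    return list(mutations), status

print(summarize_mutations(["Missense_Mutation", "Nonsense_Mutation"]))
print(summarize_mutations(["Missense_Mutation"]))
print(summarize_mutations([], is_tumor=False))
```

Because the mutation column always holds a list, the status column gives a quick way to separate wildtype, single-mutation, and multiple-mutation samples without inspecting list lengths.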

### Filtering multiple mutations

The function can filter multiple mutations down to just one mutation. It allows you to specify particular mutation types or locations to prioritize, and also provides a default sorting hierarchy for all other mutations. The default hierarchy chooses truncation mutations over missense mutations, and silent mutations last of all. If there are multiple mutations of the same type, it chooses the mutation occurring earlier in the sequence.

To filter all mutations based on this default hierarchy, simply pass an empty list to the optional `mutations_filter` parameter. Notice how in sample C3L-00006, the nonsense mutation was chosen over the missense mutation, because it's a type of truncation mutation, even though the missense mutation occurs earlier in the peptide sequence. In sample C3L-00137, both mutations were types of truncation mutations, so the function just chose the earlier one.

In [None]:
PTEN_default_filter = en.multi_join({"awg proteomics": ["AURKA", "TP53"],
                                     "awg somatic_mutation": "PTEN"},
                                    mutations_filter=[])
PTEN_default_filter.loc[["C3L-00006", "C3L-00137"]]

To prioritize a particular type of mutation, or a particular location, include it in the `mutations_filter` list. Below, we tell the function to prioritize nonsense mutations over all other mutations. Notice how in sample C3L-00137, the nonsense mutation is now selected instead of the frameshift insertion, even though the nonsense mutation occurs later in the peptide sequence.

In [None]:
PTEN_simple_filter = en.multi_join({"awg proteomics": ["AURKA", "TP53"],
                                    "awg somatic_mutation": "PTEN"},
                                   mutations_filter=["Nonsense_Mutation"])
PTEN_simple_filter.loc[["C3L-00006", "C3L-00137"]]

You can include multiple mutation types and/or locations in the `mutations_filter` list. Values earlier in the list are prioritized over values later in the list. For example, with the filter we specify below, the function selects sample C3L-00006's missense mutation over its nonsense mutation, because we put the location of that missense mutation first in our filter list. We still included Nonsense_Mutation in the filter list, but it comes after the location of the missense mutation, which is why the missense mutation is still prioritized in C3L-00006. However, on all other samples, unless they also have a mutation at that same location, the function continues to prioritize nonsense mutations, as we see in sample C3L-00137.

In [None]:
PTEN_complex_filter = en.multi_join({"awg proteomics": ["AURKA", "TP53"],
                                     "awg somatic_mutation": "PTEN"},
                                    mutations_filter=["p.R130Q", "Nonsense_Mutation"])
PTEN_complex_filter.loc[["C3L-00006", "C3L-00137"]]
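The prioritization rules described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not `cptac`'s implementation: the names `TRUNCATIONS`, `type_rank`, `position`, and `pick_mutation` are hypothetical, the grouping of mutation types into "truncation" is assumed, and the two samples below use made-up mutations.

```python
import re

# Assumed type grouping for illustration; cptac's own lists may differ.
TRUNCATIONS = {"Nonsense_Mutation", "Frame_Shift_Ins", "Frame_Shift_Del"}

def type_rank(mutation_type):
    """Default hierarchy: truncation, then missense, then other, then silent."""
    if mutation_type in TRUNCATIONS:
        return 0
    if mutation_type == "Missense_Mutation":
        return 1
    if mutation_type == "Silent":
        return 3
    return 2

def position(location):
    """First integer in a location string such as 'p.R130Q' (here 130)."""
    match = re.search(r"\d+", location)
    return int(match.group()) if match else float("inf")

def pick_mutation(mutations, mutations_filter=()):
    """Pick one (type, location) pair using the filter list, then the defaults."""
    def sort_key(mutation):
        mutation_type, location = mutation
        # Index of the earliest filter entry matching this type or location.
        hits = [i for i, value in enumerate(mutations_filter)
                if value in (mutation_type, location)]
        filter_rank = hits[0] if hits else len(mutations_filter)
        return (filter_rank, type_rank(mutation_type), position(location))
    return min(mutations, key=sort_key)

# Made-up mutations for two hypothetical samples.
sample_a = [("Missense_Mutation", "p.R130Q"), ("Nonsense_Mutation", "p.R233*")]
sample_b = [("Frame_Shift_Ins", "p.E40fs"), ("Nonsense_Mutation", "p.Y180*")]

for mutations_filter in ([], ["Nonsense_Mutation"], ["p.R130Q", "Nonsense_Mutation"]):
    print(mutations_filter,
          pick_mutation(sample_a, mutations_filter),
          pick_mutation(sample_b, mutations_filter))
```

Sorting on the tuple `(filter_rank, type_rank, position)` captures the three levels of the rule in order: entries in `mutations_filter` win first, the default type hierarchy breaks ties among the rest, and the earlier position breaks any remaining tie.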

## Join metadata to mutations

Joining metadata to mutation data works exactly like joining other datatypes. Just like any time you are using somatic_mutation data, you can filter multiple mutations with the `mutations_filter` parameter. Here are some examples:

In [None]:
hist_and_PTEN = en.multi_join(
    {"awg clinical": 'Histologic_type',
    "awg somatic_mutation": "PTEN"})

hist_and_PTEN.head()

With multiple mutations filtered:

In [None]:
hist_and_PTEN = en.multi_join(
    {"awg clinical": "Histologic_type",
    "awg somatic_mutation": "PTEN"},
    mutations_filter=["Nonsense_Mutation"])

hist_and_PTEN.head()

## Exporting dataframes

If you wish to export a dataframe to a file, simply call the dataframe's `to_csv` method, passing the path you wish to save the file to, and the value separator you want:

In [None]:
hist_and_PTEN.to_csv(path_or_buf="histologic_type_and_PTEN_mutation.tsv", sep='\t')