# Tutorial 1: CPTAC Data Introduction

The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC generates comprehensive proteomics and genomics data from clinical cohorts, typically with ~100 samples per tumor type. The graphic below summarizes the structure of each CPTAC dataset. For more information, visit the [NIH website](https://proteomics.cancer.gov/programs/cptac). 

<img src="img/Graphical_Abstract.png" alt="CPTAC cohort" width="700"/>

This Python package makes accessing CPTAC data easy with Python code and Jupyter notebooks. The package contains several tutorials which demonstrate data access and usage. This first tutorial serves as an introduction to the data to help users become familiar with what is included and how it is presented.

## Data Overview

Our package provides data access in a Python programming environment. If you have not installed Python or have not installed the package, see our installation documentation [here](https://paynelab.github.io/cptac/#installation).

Once we have the package installed and we're in our Python environment, we begin by importing the package with a standard Python import statement:

In [1]:
import cptac


`cptac` data is organized into datasets by cancer type. To view the available datasets, call the `cptac.get_cancer_options()` function:

In [None]:
cptac.get_cancer_options()

## Data Availability
The goals of CPTAC as a consortium include the broad and open dissemination of cancer proteogenomic data. The timing of a dataset's public release generally follows three stages: internal release to CPTAC investigators, public release with a publication embargo, and full public release. Each of the cancer types may be at a different data availability stage, depending on the date of data creation. In the Python `cptac` package, these three stages are dealt with as follows:

**Internally released data** requires a password to download.

**Embargoed release data** is publicly available, but prints an embargo statement every time you interact with the data.

**Public data** is fully released without restrictions.

## Exploring the data

`cptac` lets you load a dataset into a Python variable, which you then use to access and work with the data. To load a particular dataset, assign to a variable name of your choice the result of calling `cptac.` followed by the dataset name in [UpperCamelCase](https://en.wikipedia.org/wiki/Camel_case) and a pair of parentheses, e.g. `cptac.Ucec()` or `cptac.Ccrcc()`:

In [None]:
en = cptac.Ucec()

To see what data is available, use the `en.list_data_sources()` function. This displays the different types of data included in the dataset for this particular cancer type, each stored in a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).

In [None]:
en.list_data_sources()

## Molecular Omics

You'll notice that some datatypes have more than one source, usually named after the organization which generated the data. Since not all of this data has been publicly released yet, we will use 'awg' data in these tutorials. AWG stands for "All Working Groups" and was generated collaboratively by many different groups, albeit with some inconsistencies between datasets.

Data can be accessed through the `get_dataframe` function, or through one of several helper "get" functions. For example, we can look at the proteomics data by using `en.get_proteomics()`. This returns a [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe) containing the proteomic data. Each column in the proteomics dataframe holds the quantitative measurements for a particular protein. Each row is either a tumor or a non-tumor sample from a cancer patient.

In [None]:
# These two ways of getting the proteomics data are functionally equivalent.
# get_dataframe needs at least two arguments: the datatype and the source.
proteomics = en.get_dataframe('proteomics', 'umich')
# Each datatype also has its own get function, which works like get_dataframe
# but does not need the datatype argument.
proteomics = en.get_proteomics('umich')

samples = proteomics.index
proteins = proteomics.columns
print("Samples:", samples[0:20].tolist())  # the first twenty samples
print("Proteins:", proteins[0:20].tolist())  # the first twenty proteins

## Dataframe values

Values in the dataframe are protein abundance values. A value of "NaN" means that the sample had no measurement for that particular protein. For the awg endometrial CPTAC proteomics data, a TMT-reference channel strategy was used. A detailed description of this strategy can be found at [Nature Protocols](https://www.nature.com/articles/s41596-018-0006-9) and also at [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/?term=29988108). This strategy takes the ratio of each sample's abundance to a pooled reference, and the ratio is then log transformed. Positive values therefore indicate a measurement higher than the pooled reference; negative values indicate a measurement lower than the pooled reference.
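To make the sign convention concrete, here is a minimal sketch using made-up abundances rather than CPTAC values. It assumes a base-2 logarithm (the description above only says the ratio is log transformed), and the helper name `log_ratio` is hypothetical; it is not part of `cptac`.

```python
import math


def log_ratio(sample_abundance, reference_abundance):
    """Return log2(sample / reference). Base 2 is an assumption for illustration."""
    return math.log2(sample_abundance / reference_abundance)


# Made-up abundances for one protein in three samples, plus a pooled reference.
pooled_reference = 100.0
toy_abundances = {"sample_A": 200.0, "sample_B": 100.0, "sample_C": 50.0}

toy_ratios = {
    name: log_ratio(value, pooled_reference)
    for name, value in toy_abundances.items()
}

for name, ratio in toy_ratios.items():
    # Positive: above the reference; zero: equal to it; negative: below it.
    print(name, ratio)
```

In this toy case, a sample with twice the reference abundance gets a positive value, a sample equal to the reference gets zero, and a sample with half the reference abundance gets a negative value.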

In [None]:
proteomics.head()

As seen in `en.list_data_sources()`, other omics data are also available (e.g. transcriptomics, copy number variation, phosphoproteomics).

The transcriptomics data looks almost identical to the proteomics data: it is also returned as a pandas dataframe with the same convention. The sample set is consistent across data types, meaning samples found in the endometrial proteomics data will be the same samples in all other endometrial dataframes.

In [None]:
transcriptomics = en.get_transcriptomics('washu')
transcriptomics.head()

## Clinical Data

The clinical dataframe lists clinical information for the patient associated with each sample (e.g. age, race, diabetes status, tumor size). 

In [None]:
clinical = en.get_clinical('awg')
clinical.head()

In addition to donating a tumor sample, some patients also had a normal sample taken for control and comparison. We can identify these samples by looking for samples marked "Normal" in the "Sample_Tumor_Normal" column, and whose Patient IDs are the same as the Patient IDs of tumor samples, but with a ".N" appended to the ID. For example, patient C3L-00006 provided both a tumor sample (marked C3L-00006) and a normal sample (marked C3L-00006.N). Note that the normal samples do not have many values in the clinical columns, because much of the information does not apply to non-tumor samples. Additionally, in cases where a column would have identical values for tumor and normal samples from the same patient (e.g., patient age and gender), the information is recorded only for the tumor sample.

In [None]:
clinical.loc[["C3L-00006","C3L-00361","C3L-01246", "C3L-00006.N","C3L-00361.N","C3L-01246.N"]]
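The ".N" naming convention described above can be used to pair tumor and normal samples by patient. The following is a minimal sketch on a hand-written list of IDs in that style, not on the real clinical index; the list deliberately omits a normal sample for one patient so the pairing has something to filter out. The helper `patient_id` and the constant `NORMAL_SUFFIX` are hypothetical names, not part of `cptac`.

```python
# Toy sample IDs in the style of the clinical index (not the real cohort).
sample_ids = [
    "C3L-00006", "C3L-00361", "C3L-01246",
    "C3L-00006.N", "C3L-00361.N",
]

NORMAL_SUFFIX = ".N"


def patient_id(sample_id):
    """Strip the normal-sample suffix, if present, to recover the patient ID."""
    if sample_id.endswith(NORMAL_SUFFIX):
        return sample_id[:-len(NORMAL_SUFFIX)]
    return sample_id


tumor_ids = {s for s in sample_ids if not s.endswith(NORMAL_SUFFIX)}
normal_ids = {s for s in sample_ids if s.endswith(NORMAL_SUFFIX)}

# Patients in the toy list that have both a tumor and a normal sample.
paired_patients = sorted(tumor_ids & {patient_id(s) for s in normal_ids})
print(paired_patients)
```

With a real dataset, the same logic could be applied to `clinical.index.tolist()` in place of the toy list.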

## Mutation data

Each cancer dataset contains mutation data for the cohort. The data consists of all somatic mutations found for each sample (meaning there will be many lines for each sample). Each row lists the specific gene that was mutated, the type of mutation, and the location of the mutation. This data is a direct import of a MAF file.

In [None]:
somatic_mutations = en.get_somatic_mutation('awg')
somatic_mutations.head()
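Because the mutation table has one row per mutation rather than one row per sample, a common first step is to count rows per sample. The sketch below illustrates that "long" layout with invented rows; the column names (`Patient_ID`, `Gene`, `Mutation`, `Location`) and every value are placeholders chosen for illustration, not real CPTAC data or a guaranteed schema.

```python
from collections import Counter, defaultdict

# Invented rows in a long format: several rows can share the same sample.
toy_mutations = [
    {"Patient_ID": "S1", "Gene": "GENE_A", "Mutation": "Missense", "Location": "p.X1Y"},
    {"Patient_ID": "S1", "Gene": "GENE_B", "Mutation": "Nonsense", "Location": "p.X2*"},
    {"Patient_ID": "S1", "Gene": "GENE_A", "Mutation": "Missense", "Location": "p.X3Y"},
    {"Patient_ID": "S2", "Gene": "GENE_B", "Mutation": "Missense", "Location": "p.X4Y"},
]

# Number of mutation rows recorded for each sample.
mutations_per_sample = Counter(row["Patient_ID"] for row in toy_mutations)

# Distinct mutated genes for each sample.
genes_per_sample = defaultdict(set)
for row in toy_mutations:
    genes_per_sample[row["Patient_ID"]].add(row["Gene"])

for sample, count in mutations_per_sample.items():
    print(sample, count, sorted(genes_per_sample[sample]))
```

On a pandas dataframe, the analogous summary would typically come from a group-by on the sample index.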

## Exporting dataframes

If you wish to export a dataframe to a file, simply call the dataframe's `to_csv` method, passing the path you wish to save the file to, and the value separator you want:

In [None]:
clinical = en.get_clinical('awg')
clinical.to_csv(path_or_buf="clinical_dataframe.tsv", sep='\t')

## Downloading data

The cptac package stores the data files for each dataset on a remote server. When you first install cptac, you will have no data files. Data files will be automatically downloaded the first time you try to use them. If you won't have internet access and need to download the files beforehand, that can be done with the `cptac.download` function:

In [None]:
cptac.download({"awg": ["proteomics"]})

## Getting help with a dataset or function

To view the documentation for a dataset, pass it to the Python `help` function, e.g. `help(en)`. You can also view the documentation for just a specific function: `help(en.join_omics_to_omics)`.

In [None]:
help(en.join_omics_to_omics)