# Tutorial 7: Pancan data access

This tutorial shows how to use the `cptac.pancan` submodule to access data from the harmonized pipelines for all cancer types.

Before the harmonized pipelines, the team working on each cancer type had their own pipeline for each data type. So for example, the ccRCC team ran the ccRCC data through their own transcriptomics pipeline, and the HNSCC team ran the HNSCC data through a different transcriptomics pipeline. However, this made it hard to study trends across multiple cancer types, since each cancer type's data had been processed differently.

To fix this problem, all data for all cancer types was run through the same pipelines for each data type. These are the harmonized pipelines. Now, for example, you can get transcriptomics data for both ccRCC and HNSCC (and all other cancer types) that came from the same pipeline.

For some data types, multiple harmonized pipelines were available. In these cases, all cancers were run through each pipeline, and you can choose which one to use. For example, you can get transcriptomics data from the BCM pipeline, the Broad pipeline, or the WashU pipeline. Whichever pipeline you choose, you can get transcriptomics data for all cancer types through that one pipeline.

First, we'll import the package.

In [1]:
import cptac

We can list which cancers we have data for.

In [2]:
cptac.get_cancer_options()

['brca', 'ccrcc', 'coad', 'gbm', 'hnscc', 'lscc', 'luad', 'ov', 'pdac', 'ucec']

## Load the BRCA dataset

In [3]:
br = cptac.Brca()

We can list which data types are available from which sources.

In [4]:
br.list_data_sources()

Unnamed: 0,Data type,Available sources
0,CNV,"awg, washu"
1,acetylproteomics,"awg, pdc"
2,clinical,"awg, mssm, pdc"
3,deconvolution_cibersort,washu
4,deconvolution_xcell,washu
5,derived_molecular,awg
6,phosphoproteomics,"awg, pdc, umich"
7,proteomics,"awg, pdc, umich"
8,somatic_mutation,"awg, harmonized, washu"
9,transcriptomics,"awg, bcm, broad, washu"
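
The table above can also be used programmatically. The following is a minimal sketch, not part of the `cptac` API: `sources_by_data_type` and `get_from_each_source` are hypothetical helpers. It assumes that the table has the `Data type` and `Available sources` columns shown above, that sources are stored as one comma-separated string per row, and that every data type has a getter named `get_<data type>` that accepts a `source` argument, as the calls later in this tutorial suggest. The example runs on two rows copied from the output so that nothing is downloaded.

```python
def sources_by_data_type(source_table):
    """Map each data type to the list of sources that provide it.

    Assumes "Data type" and "Available sources" columns, with the
    sources given as one comma-separated string per row.
    """
    return {
        data_type: [name.strip() for name in sources.split(",")]
        for data_type, sources in zip(
            source_table["Data type"], source_table["Available sources"]
        )
    }


def get_from_each_source(dataset, data_type, sources):
    """Call dataset.get_<data_type>(source=...) once per source.

    Assumes the getter naming pattern used in this tutorial
    (get_clinical, get_proteomics, ...) also holds for data_type.
    """
    getter = getattr(dataset, f"get_{data_type}")
    return {source: getter(source=source) for source in sources}


# Two rows copied from the output above, standing in for the real table.
example_table = {
    "Data type": ["proteomics", "transcriptomics"],
    "Available sources": ["awg, pdc, umich", "awg, bcm, broad, washu"],
}
lookup = sources_by_data_type(example_table)
print(lookup["transcriptomics"])  # ['awg', 'bcm', 'broad', 'washu']
```

With the real objects, something like `get_from_each_source(br, "proteomics", sources_by_data_type(br.list_data_sources())["proteomics"])` should return one table per source. Note that this would download every requested file.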


## Download

Each file will be automatically downloaded when requested, but authentication through your Box account is required to download pancan data.

See the end of this tutorial for how to download files on a remote computer that doesn't have a web browser for logging into Box.

Let's get some data tables.

In [5]:
br.get_clinical(source="mssm")

                                               

Name,tumor_code,discovery_study,discovery_study/type_of_analyzed_samples,consent/age,consent/sex,consent/race,consent/ethnicity,consent/ethnicity_race_ancestry_identified,consent/collection_in_us,consent/participant_country,...,follow-up/additional_treatment_for_new_tumor_radiation,follow-up/additional_treatment_for_new_tumor_pharmaceutical,follow-up/additional_treatment_for_new_tumor_immunological,follow-up/days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor,follow-up/cause_of_death,follow-up/days_from_date_of_initial_pathologic_diagnosis_to_date_of_death,Recurrence-free survival,Overall survial,"Recurrence status (1, yes; 0, no)","Survial status (1, dead; 0, alive)"
Patient_ID,,,,,,,,,,,,,,,,,,,,,
01BR001,BR,,,55,Female,Black or African American,Not Hispanic or Latino,,,,...,,,,,,,,421.0,0,0.0
01BR015,BR,,,35,Female,White,Not Hispanic or Latino,,,,...,,,,,,,,347.0,0,0.0
01BR017,BR,,,45,Female,White,Not Hispanic or Latino,,,,...,,,,,,,,413.0,0,0.0
01BR018,BR,,,66,Female,White,Not Hispanic or Latino,,,,...,,,,,,,,384.0,0,0.0
01BR025,BR,,,62,Female,Black or African American,Not Hispanic or Latino,,,,...,,,,,,,,601.0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21BR003,BR,,,46,Female,White,Hispanic or Latino,,,,...,,,,,,,,,0,
21BR010,BR,,,71,Female,White,Hispanic or Latino,,,,...,,,,,,,,327.0,0,0.0
22BR003,BR,,,30,Female,White,Not Hispanic or Latino,,,,...,,,,,,,,,0,
22BR005,BR,,,46,Female,White,Not Hispanic or Latino,,,,...,,,,,,,,348.0,0,0.0


In [6]:
br.get_somatic_mutation(source="washu")

                                                        

Name,Gene,Mutation,Location,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,...,ExAC_AC_AN_Adj,ExAC_AC_AN,ExAC_AC_AN_AFR,ExAC_AC_AN_AMR,ExAC_AC_AN_EAS,ExAC_AC_AN_FIN,ExAC_AC_AN_NFE,ExAC_AC_AN_OTH,ExAC_AC_AN_SAS,ExAC_FILTER
Patient_ID,,,,,,,,,,,,,,,,,,,,,
01BR001,PIK3CD,Missense_Mutation,p.R785W,0,.,GRCh38,chr1,9722533,9722533,+,...,,,,,,,,,,
01BR001,GLI4,Missense_Mutation,p.R231G,0,.,GRCh38,chr8,143276364,143276364,+,...,,,,,,,,,,
01BR001,PLEC,Missense_Mutation,p.L3732R,0,.,GRCh38,chr8,143919037,143919037,+,...,,,,,,,,,,
01BR001,MFSD3,Missense_Mutation,p.G292A,0,.,GRCh38,chr8,144510642,144510642,+,...,,,,,,,,,,
01BR001,PPFIA1,Silent,p.L794L,0,.,GRCh38,chr11,70355705,70355705,+,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CPT001846,ANKRD33B,Missense_Mutation,p.V342G,0,.,GRCh38,chr5,10649653,10649653,+,...,,,,,,,,,,
CPT001846,NPY1R,Missense_Mutation,p.F18S,0,.,GRCh38,chr4,163326502,163326502,+,...,,,,,,,,,,
CPT001846,PLRG1,Missense_Mutation,p.L477F,0,.,GRCh38,chr4,154537340,154537340,+,...,,,,,,,,,,
CPT001846,STAG2,Splice_Site,p.X607_splice,0,.,GRCh38,chrX,124063207,124063207,+,...,,,,,,,,,,


In [7]:
br.get_proteomics(source="umich")

                                                  

Name,ARF5,M6PR,ESRRA,FKBP4,NDUFAF7,FUCA2,DBNDD1,SEMA3F,CFTR,CYP51A1,...,DDHD1,WIZ,GBF1,APOA5,WIZ,LDB1,WIZ,RFX7,SWSAP1,SVIL
Database_ID,ENSP00000000233.5,ENSP00000000412.3,ENSP00000000442.6,ENSP00000001008.4,ENSP00000002125.4,ENSP00000002165.5,ENSP00000002501.6,ENSP00000002829.3,ENSP00000003084.6,ENSP00000003100.8,...,ENSP00000500986.2,ENSP00000500993.1,ENSP00000501064.1,ENSP00000501141.1,ENSP00000501256.3,ENSP00000501277.1,ENSP00000501300.1,ENSP00000501317.1,ENSP00000501355.1,ENSP00000501521.1
Patient_ID,,,,,,,,,,,,,,,,,,,,,
01BR001,0.012367,-0.945999,,-0.478170,1.135840,-0.512706,0.750335,-0.274824,,-0.278244,...,-0.649127,-0.580869,-0.226667,,,-0.676185,-0.068202,-0.078207,-0.328420,
01BR008,-0.514386,0.462307,0.230124,-0.555968,0.491366,-0.656034,-1.220890,-0.369282,-1.036441,-0.059327,...,0.632221,,0.032873,,,-0.015459,0.227424,0.325643,-0.606240,
01BR009,-0.210782,-0.085055,0.380296,-0.389491,1.255391,-0.608007,-0.231318,0.092870,-1.505195,0.206595,...,0.450818,,-0.341503,,,-0.220239,0.125092,0.365397,-0.167392,
01BR010,0.105457,0.351335,-0.322798,-0.821610,0.241406,-0.500140,-0.137824,0.113791,,0.498314,...,-0.423470,,0.360900,,,-0.451556,-0.098897,0.208643,-0.729096,0.670307
01BR015,-0.509298,-0.874164,,-0.113804,-0.131347,-0.412813,0.262210,0.042333,,-0.657666,...,0.406016,-0.493869,-0.192847,,,0.083639,0.966976,-0.012664,0.081968,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21BR010,0.528298,-0.127929,-0.497360,-0.151022,0.082288,0.447267,0.151024,0.220194,0.516739,-0.230357,...,-0.172151,0.636608,0.267400,,-0.09507,-0.017522,-0.220463,-0.067717,-0.311446,0.602422
22BR005,-0.549542,0.134236,,0.580773,-0.080663,-0.056509,-0.148632,0.260986,,-0.348578,...,0.791937,,0.171712,,,0.083980,-0.200083,0.155198,,0.456801
22BR006,0.336092,0.125742,,-0.360510,0.086199,0.470607,-0.515990,-0.162247,1.003075,0.342987,...,-0.080755,,0.174904,-0.353412,,-0.013793,-0.253829,-0.117960,,1.094966
CPT000814,-0.518995,0.262582,0.277980,0.137505,0.600041,-1.041230,0.513974,-0.012011,,-0.411714,...,-2.011008,,0.035445,,,-1.385942,0.620827,,,
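
Notice that the proteomics table has two header rows, `Name` and `Database_ID`, and that a gene name such as WIZ can appear more than once with different database IDs. The sketch below shows one way to collect all column labels for a gene. `columns_for_gene` is a hypothetical helper, not a `cptac` function. It assumes each column label is a `(Name, Database_ID)` pair, which is what iterating over a two-level pandas column index yields. The example labels are copied from the output above.

```python
def columns_for_gene(columns, gene):
    """Return every (Name, Database_ID) label whose Name equals gene."""
    return [label for label in columns if label[0] == gene]


# A subset of the column labels shown in the output above.
example_columns = [
    ("DDHD1", "ENSP00000500986.2"),
    ("WIZ", "ENSP00000500993.1"),
    ("GBF1", "ENSP00000501064.1"),
    ("APOA5", "ENSP00000501141.1"),
    ("WIZ", "ENSP00000501256.3"),
    ("LDB1", "ENSP00000501277.1"),
    ("WIZ", "ENSP00000501300.1"),
]
wiz_columns = columns_for_gene(example_columns, "WIZ")
print(wiz_columns)
```

On the real table, assuming its columns are a pandas `MultiIndex`, `prot = br.get_proteomics(source="umich")` followed by `prot[columns_for_gene(prot.columns, "WIZ")]` should select just those columns.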


## Box authentication for remote downloads

Normally, when you download the `cptac.pancan` data files, you're required to log in to your Box account, because these files are not released publicly. However, the computer you're running your analysis on may not have a web browser you can use to log in to Box. For example, you may be running your code in a remotely hosted notebook (e.g. Google Colab), or on a computer cluster that you access over SSH.

In these situations, follow these steps to take care of Box authentication:
1. On a computer where you do have access to a web browser to log in to Box, load the `cptac.pancan` module.
2. Call the `cptac.pancan.get_box_token` function. This will return a temporary access token that gives permission to download files from Box with your credentials. The token expires 1 hour after it's created.
3. On the remote computer, when you call the `cptac.pancan.download` function, copy and paste the access token you generated on your local machine into the `box_token` parameter of the function. The program will then be able to download the data files.

Below is all the code you would need to run on each machine for this process. For security, we will not actually run it in this notebook.

On your local machine:
```
import cptac.pancan as pc
# Log in to Box in the browser; the call returns a temporary access token to copy.
pc.get_box_token()
```

On the remote machine:
```
import cptac.pancan as pc
# Replace the placeholder with the token generated on your local machine.
pc.download("pancanbrca", box_token=[INSERT TOKEN HERE])
```
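
Because the token expires 1 hour after it's created, it can help to note when you generated it and check that it is still usable before starting a long download. The following is a minimal sketch using only the standard library. `token_is_fresh` and the five-minute safety margin are illustrative assumptions, not part of `cptac`.

```python
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(hours=1)  # lifetime stated in step 2 above


def token_is_fresh(created_at, now=None, safety_margin=timedelta(minutes=5)):
    """Return True if a token created at created_at should still be usable.

    The safety margin (an arbitrary choice) leaves time for the download
    to start before the token expires.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at < TOKEN_LIFETIME - safety_margin


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(token_is_fresh(created, now=created + timedelta(minutes=30)))  # True
print(token_is_fresh(created, now=created + timedelta(minutes=58)))  # False
```

If the check fails, generate a new token on your local machine and paste it in again.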