# Tutorial 7: Pancan data access

This tutorial shows how to use the `cptac.pancan` submodule to access data from the harmonized pipelines for all cancer types.

Before the harmonized pipelines, the team working on each cancer type had their own pipeline for each data type. So for example, the ccRCC team ran the ccRCC data through their own transcriptomics pipeline, and the HNSCC team ran the HNSCC data through a different transcriptomics pipeline. However, this made it hard to study trends across multiple cancer types, since each cancer type's data had been processed differently.

To fix this problem, all data for all cancer types was run through the same pipelines for each data type. These are the harmonized pipelines. Now, for example, you can get transcriptomics data for both ccRCC and HNSCC (and all other cancer types) that came from the same pipeline.

For some data types, multiple harmonized pipelines were available. In these cases, all cancers were run through each pipeline, and you can choose which one to use. For example, you can get transcriptomics data from the BCM pipeline, the Broad pipeline, or the WashU pipeline. Whichever pipeline you choose, you can get transcriptomics data for all cancer types through that one pipeline.

First, we'll import the package.

In [1]:
import cptac.pancan as pc

We can list which cancers we have data for.

In [2]:
pc.list_datasets()

PancanBrca
PancanCcrcc
PancanCoad
PancanGbm
PancanHnscc
PancanLscc
PancanLuad
PancanOv
PancanUcec
PancanPdac


## Download 

Authentication through your Box account is required when you download files. Pass the name of the dataset you want, as listed by `list_datasets`. Capitalization does not matter.

See the end of this tutorial for how to download files on a remote computer that doesn't have a web browser for logging into Box.

In [3]:
pc.download("pancanbrca")

Please login to Box on the webpage that was just opened and grant access for cptac to download files through your account. If you accidentally closed the browser window, press Ctrl+C and call the download function again.

True
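
Because capitalization does not matter, it can be convenient to check a name against the listed datasets before starting a download. The following is a minimal sketch, assuming a simple lowercase comparison; it is not necessarily how `cptac` matches names internally, and `resolve_dataset_name` is a hypothetical helper, not part of the package.

```python
# Names copied from the list_datasets() output shown earlier in this tutorial.
PANCAN_DATASETS = [
    "PancanBrca", "PancanCcrcc", "PancanCoad", "PancanGbm", "PancanHnscc",
    "PancanLscc", "PancanLuad", "PancanOv", "PancanUcec", "PancanPdac",
]


def resolve_dataset_name(name, known=PANCAN_DATASETS):
    """Return the listed spelling of `name`, ignoring capitalization.

    Hypothetical helper: raises ValueError if no listed dataset matches.
    """
    # Map each lowercase name back to the spelling used by list_datasets().
    lookup = {k.lower(): k for k in known}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(
            f"Unknown dataset {name!r}; expected one of {known}"
        ) from None
```

A typo such as `"pancanbrac"` then fails immediately with a readable message, before any Box login is attempted.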

## Load the BRCA dataset

In [4]:
br = pc.PancanBrca()

Loading broadbrca v1.0...                     




We can list which data types are available from which sources.

In [5]:
br.list_data_sources()

Unnamed: 0,Data type,Available sources
0,CNV,washu
1,acetylproteomics,pdc
2,cancer_diagnosis,mssm
3,clinical,"mssm, pdc"
4,deconvolution_cibersort,washu
5,deconvolution_xcell,washu
6,demographic,mssm
7,followup,mssm
8,phosphoproteomics,"pdc, umich"
9,previous_cancer,mssm
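
The "Available sources" column holds a comma-separated string, so it needs splitting before you can ask which data types a given source provides. Below is a rough sketch using plain Python and the rows copied from the output above. With the real table you would iterate over its rows instead. `sources_by_data_type` and `data_types_from` are hypothetical helper names.

```python
# Rows copied from the list_data_sources() output above:
# (data type, comma-separated available sources)
SOURCE_ROWS = [
    ("CNV", "washu"),
    ("acetylproteomics", "pdc"),
    ("cancer_diagnosis", "mssm"),
    ("clinical", "mssm, pdc"),
    ("deconvolution_cibersort", "washu"),
    ("deconvolution_xcell", "washu"),
    ("demographic", "mssm"),
    ("followup", "mssm"),
    ("phosphoproteomics", "pdc, umich"),
    ("previous_cancer", "mssm"),
]


def sources_by_data_type(rows):
    """Map each data type to a list of its source names."""
    return {
        data_type: [s.strip() for s in sources.split(",")]
        for data_type, sources in rows
    }


def data_types_from(source, rows):
    """List the data types that `source` provides, in table order."""
    mapping = sources_by_data_type(rows)
    return [dt for dt, sources in mapping.items() if source in sources]
```

For the rows shown, `data_types_from("pdc", SOURCE_ROWS)` gives the three data types that list `pdc` as a source.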


Let's get some data tables.

In [6]:
br.get_clinical(source="mssm")

Unnamed: 0_level_0,Overall survial,"Recurrence status (1, yes; 0, no)",Recurrence-free survival,Sample_Tumor_Normal,"Survial status (1, dead; 0, alive)",baseline/ajcc_tnm_cancer_staging_edition,baseline/clinical_staging_distant_metastasis,baseline/he_staining_done,baseline/histologic_type,baseline/ihc_staining_done,...,medications/history_source,medications/medication_name_vitamins_supplements,procurement/blood_collection_minimum_required_blood_collected,procurement/blood_collection_number_of_blood_tubes_collected,procurement/normal_adjacent_tissue_collection_number_of_normal_segments_collected,procurement/tumor_tissue_collection_clamps_used,procurement/tumor_tissue_collection_frozen_with_oct,procurement/tumor_tissue_collection_number_of_tumor_segments_collected,procurement/tumor_tissue_collection_tumor_type,tumor_code
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01BR001,421.0,0.0,,Tumor,0.0,,,,Inflitrating Ductal Carcinoma,,...,,,,2.0,,,,,,BR
01BR008,,,,Tumor,,,,,,,...,,,,,,,,,,
01BR009,,,,Tumor,,,,,,,...,,,,,,,,,,
01BR010,,,,Tumor,,,,,,,...,,,,,,,,,,
01BR015,347.0,0.0,,Tumor,0.0,,,,Inflitrating Ductal Carcinoma,,...,,,,2.0,,,,,,BR
01BR017,413.0,0.0,,Tumor,0.0,,,,Inflitrating Ductal Carcinoma,,...,,,,1.0,,,,,,BR
01BR018,384.0,0.0,,Tumor,0.0,,,,Inflitrating Ductal Carcinoma,,...,,,,2.0,,,,,,BR
01BR020,,,,Tumor,,,,,,,...,,,,,,,,,,
01BR023,,,,Tumor,,,,,,,...,,,,,,,,,,
01BR025,601.0,0.0,,Tumor,0.0,,,,Inflitrating Ductal Carcinoma,,...,,,,1.0,,,,,,BR


In [7]:
br.get_somatic_mutation(source="washu")

Unnamed: 0_level_0,AA_MAF,AFR_MAF,ALLELE_NUM,AMR_MAF,ASN_MAF,Allele,Amino_acids,BAM_File,BIOTYPE,CANONICAL,...,dbSNP_Val_Status,flanking_bps,n_alt_count,n_depth,n_ref_count,t_alt_count,t_depth,t_ref_count,variant_id,variant_qual
Patient_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
01BR001,,,1,,,T,R/W,,protein_coding,YES,...,,,0,635,635,180,878,698,.,.
01BR001,,,1,,,G,R/G,,protein_coding,YES,...,,,0,121,121,24,311,287,.,.
01BR001,,,1,,,C,L/R,,protein_coding,YES,...,,,1,772,771,147,1398,1251,.,.
01BR001,,,1,,,C,G/A,,protein_coding,YES,...,,,0,58,58,8,77,69,.,.
01BR001,,,1,,,T,L,,protein_coding,YES,...,,,0,127,127,9,135,126,rs1213708101,.
01BR001,,,1,,,T,H/L,,protein_coding,YES,...,,,0,126,126,9,138,129,rs1256101256,.
01BR001,,,1,,,T,E/K,,protein_coding,YES,...,,,0,375,375,53,432,379,rs750883732,.
01BR001,,,1,,,C,,,protein_coding,YES,...,,,0,291,291,42,283,241,.,.
01BR001,,,1,,,A,T/M,,protein_coding,YES,...,,,0,468,468,160,526,366,rs1161996687,.
01BR001,,,1,,,C,E/A,,protein_coding,YES,...,,,0,210,210,76,250,174,.,.


In [8]:
br.get_proteomics(source="umich")

Name,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAGAB,AAK1,AAMDC,...,ZSWIM8,ZSWIM9,ZUP1,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1
Database_ID,ENSP00000263100.2,ENSP00000323929.8,ENSP00000299698.7,ENSP00000249005.2,ENSP00000209873.4,ENSP00000324842.6,ENSP00000226840.4,ENSP00000261880.5,ENSP00000386456.3,ENSP00000377078.2,...,ENSP00000381693.2,ENSP00000480314.1,ENSP00000357565.3,ENSP00000200135.3,ENSP00000311429.5,ENSP00000363055.3,ENSP00000374359.3,ENSP00000294353.6,ENSP00000324422.5,ENSP00000371051.2
Patient_ID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
01BR001,0.964219,0.712229,-1.838762,,0.190146,-0.401836,0.558497,-0.571508,0.046252,-0.067064,...,0.205835,,,-0.076955,0.189799,-0.165917,1.065263,0.186977,0.351824,-0.673075
01BR008,-0.163783,0.222128,0.564935,,0.219314,-1.077252,,-0.334289,-0.128818,-0.635724,...,0.139989,,,0.106005,0.373294,0.257054,0.758199,-0.087769,-0.109325,0.003726
01BR009,0.440711,1.305547,1.064684,,-0.051330,-0.643568,,-0.739326,-0.155741,-0.407221,...,-0.082291,,,0.025558,-0.191570,-0.272440,0.107059,0.157156,0.053902,-0.094669
01BR010,-0.210077,-0.476778,-2.153951,,0.151513,1.202484,2.918194,-0.479379,-0.066867,0.572150,...,-0.066837,0.009928,,0.588021,-0.822620,-0.548751,-0.477608,-0.353604,0.587893,-0.060439
01BR015,0.512848,-0.108772,-2.024479,,0.428309,0.114126,-0.372507,-0.261763,-0.136773,0.259900,...,-0.129002,,,-0.114591,-0.357176,0.361189,-1.354635,0.010848,0.713279,-0.451709
01BR017,0.411871,0.906319,0.196331,,0.230231,-0.080213,,0.151591,0.216750,-0.160176,...,0.049944,0.374453,,0.014303,-0.164777,0.403286,0.100330,-0.042834,0.285940,-0.214578
01BR018,0.778780,0.949798,0.481126,,0.068836,-1.363016,0.489847,-0.214481,-0.428230,0.298039,...,-0.106189,0.244123,,-0.279242,0.274504,0.314267,-0.956851,-0.011198,-0.185218,-0.077110
01BR020,-0.120716,-0.366892,-1.963831,,0.279906,-0.621878,-0.006049,-0.038905,-0.161589,0.366362,...,-0.103624,0.210426,,0.282532,0.172683,0.708654,0.529805,-0.081990,0.389254,-0.758983
01BR023,-0.210646,-0.762573,-2.833123,,0.231477,-0.942649,-1.185737,0.106079,0.254490,-0.300329,...,0.282852,0.434072,,-0.119322,-0.532877,-0.456982,0.287748,-0.188229,-0.575809,0.295687
01BR025,-0.324954,-0.733043,-1.483728,,0.411059,0.386470,-2.537195,0.418928,-0.049206,0.606711,...,0.134950,,,0.655280,0.781046,-0.069951,-1.150195,0.216931,-0.068224,-0.205014
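
Since every cancer type was run through the same pipelines, the same getter call can be repeated across several datasets. The sketch below shows one way to do that. It assumes only that each dataset object exposes getters accepting a `source` keyword, as `br` does above. `collect_tables` is a hypothetical helper, not part of `cptac`.

```python
def collect_tables(datasets, getter_name, source):
    """Call the same getter with the same source on each dataset.

    `datasets` maps a label (e.g. "brca") to a loaded dataset object.
    Returns two dicts: label -> table for calls that succeeded, and
    label -> error message for calls that failed.
    """
    tables, errors = {}, {}
    for label, dataset in datasets.items():
        getter = getattr(dataset, getter_name, None)
        if getter is None:
            errors[label] = f"no method named {getter_name}"
            continue
        try:
            tables[label] = getter(source=source)
        except Exception as exc:
            # Broad on purpose: the exception types raised for an
            # unavailable source are not covered in this tutorial.
            errors[label] = str(exc)
    return tables, errors
```

A call might look like `collect_tables({"brca": br}, "get_proteomics", "umich")`. Adding other cancer types assumes they load the same way `PancanBrca` does and that the chosen source is available for them, which `list_data_sources` can confirm.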


## Box authentication for remote downloads

Normally, when you download the `cptac.pancan` data files, you're required to log in to your Box account, as these files are not released publicly. However, the computer you're running your analysis on may not have a web browser you can use to log in to Box. For example, you may be running your code in a remotely hosted notebook (e.g. Google Colab), or on a computer cluster that you access using ssh.

In these situations, follow these steps to take care of Box authentication:
1. On a computer where you do have access to a web browser to log in to Box, load the `cptac.pancan` module.
2. Call the `cptac.pancan.get_box_token` function. This will return a temporary access token that gives permission to download files from Box with your credentials. The token expires 1 hour after it's created.
3. On the remote computer, when you call the `cptac.pancan.download` function, copy and paste the access token you generated on your local machine into the `box_token` parameter of the function. The program will then be able to download the data files.

Below is the code you would run on each machine for this process. For security, we will not actually run it in this notebook.

On your local machine:
```
import cptac.pancan as pc
pc.get_box_token()
```

On the remote machine:
```
import cptac.pancan as pc
# Replace the placeholder with the token returned by pc.get_box_token() on your local machine
pc.download("pancanbrca", box_token="INSERT TOKEN HERE")
```
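
Pasting the token directly into a notebook leaves it in the saved file. One alternative is to read it from an environment variable and to keep track of the 1-hour lifetime mentioned above. This is only a sketch: the variable name `CPTAC_BOX_TOKEN` and both helper functions are hypothetical, and it assumes the token is a plain string.

```python
import os
import time

# Per the steps above, a token expires 1 hour after it is created.
TOKEN_LIFETIME_SECONDS = 60 * 60


def token_from_env(var_name="CPTAC_BOX_TOKEN", env=None):
    """Read the Box token from an environment variable (hypothetical name)."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {var_name} to the token returned by get_box_token()"
        )
    return token


def seconds_until_expiry(created_at, now=None):
    """Seconds left before a token created at `created_at` (Unix time) expires."""
    now = time.time() if now is None else now
    return max(0.0, created_at + TOKEN_LIFETIME_SECONDS - now)
```

On the remote machine, the download call could then be written as `pc.download("pancanbrca", box_token=token_from_env())`, and `seconds_until_expiry` can tell you whether a fresh token is needed before a long download.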