# CPTAC Data Introduction

This package includes several types of cancer data. This tutorial introduces which data are included and how they are presented.

## Exploring the data

The specifications of the cancer data are as follows:

![Data](img/Figure_0_Graphical_Abstract.png)

For this example, we will look at the endometrial data, which we load with the import statement <code>import CPTAC.Endometrial</code>. We can then call the package's functions as <code>CPTAC.Endometrial.function()</code>, or we can alias the import for convenience: <code>import CPTAC.Endometrial as en</code>, which shortens function calls to <code>en.function()</code>.

In [1]:
import CPTAC.Endometrial as en

Welcome to the CPTAC data service package. Available datasets may be
viewed using CPTAC.list(). In order to access a specific data set,
import a CPTAC subfolder using either 'import CPTAC.Dataset' or 'from
CPTAC import Dataset'.
******
Version: 0.2.5
******
Loading Endometrial CPTAC data:
Loading Dictionary...
Loading Clinical Data...
Loading Acetylation Proteomics Data...
Loading Proteomics Data...
Loading Transcriptomics Data...
Loading CNA Data...
Loading Phosphoproteomics Data...
Loading Somatic Mutation Data...

 ******PLEASE READ******
CPTAC is a community resource project and data are made available
rapidly after generation for community research use. The embargo
allows exploring and utilizing the data, but the data may not be in a
publication until July 1, 2019. Please see
https://proteomics.cancer.gov/data-portal/about/data-use-agreement or
enter embargo() to open the webpage for more details.


## Exploring the Data Continued

We can see what data are available using the <code>en.list()</code> function, which displays the types of data contained in the package for that specific cancer, along with the dimensions of each dataframe.

In [2]:
en.list()

Below are the available endometrial data frames contained in this package:
	 clinical
	 	 Dimensions: (144, 27)
	 derived_molecular
	 	 Dimensions: (144, 144)
	 acetylproteomics
	 	 Dimensions: (144, 10862)
	 proteomics
	 	 Dimensions: (144, 10999)
	 transcriptomics_linear
	 	 Dimensions: (109, 28057)
	 CNA
	 	 Dimensions: (95, 28057)
	 phosphoproteomics_site
	 	 Dimensions: (144, 73212)
	 somatic binary
	 	 Dimensions: (95, 51559)
	 somatic MAF
	 	 Dimensions: (52560, 5)
To access the data, use a get function with the data frame name, i.e. endometrial.get_proteomics()


# Proteomics Data

Data can be accessed through several <code>en.get</code> functions. For example, we can look at the proteomics data by using the <code>en.get_proteomics()</code> function. The result is a <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">pandas dataframe</a> containing the proteomic data for that cohort (group of cancer patients).

In [3]:
en.get_proteomics()

idx,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
S001,-1.1800,-0.8630,-0.80200,0.2220,0.256000,0.6650,1.28000,-0.33900,0.4120,-0.66400,...,-0.087700,,0.022900,0.1090,,-0.33200,-0.43300,-1.020000,-0.123000,-0.085900
S002,-0.6850,-1.0700,-0.68400,0.9840,0.135000,0.3340,1.30000,0.13900,1.3300,-0.36700,...,-0.035600,,0.363000,1.0700,0.73700,-0.56400,-0.00461,-1.130000,-0.075700,-0.473000
S003,-0.5280,-1.3200,0.43500,,-0.240000,1.0400,-0.02130,-0.04790,0.4190,-0.50000,...,0.001120,-0.14500,0.010500,-0.1160,,0.15100,-0.07400,-0.540000,0.320000,-0.419000
S005,-1.6700,-1.1900,-0.44300,0.2430,-0.099300,0.7570,0.74000,-0.92900,0.2290,-0.22300,...,0.072500,-0.05520,-0.071400,0.0933,0.15600,-0.39800,-0.07520,-0.797000,-0.030100,-0.467000
S006,-0.3740,-0.0206,-0.53700,0.3110,0.375000,0.0131,-1.10000,,0.5650,-0.10100,...,-0.176000,,-1.220000,-0.5620,0.93700,-0.64600,0.20700,-1.850000,-0.176000,0.051300
S007,-1.0800,-0.7080,-0.12600,-0.4260,-0.114000,-0.1110,0.89500,1.26000,0.1570,0.72300,...,0.455000,,0.397000,-0.9990,-0.73000,-0.02290,-0.33100,-1.160000,-0.116000,0.002500
S008,-1.3200,-0.7080,-0.80800,-0.0709,0.138000,0.6560,-0.28000,-0.12800,0.2170,-0.13300,...,0.015800,-0.00491,0.180000,0.5130,0.54900,-0.68100,-0.28500,-0.564000,-0.087600,0.006980
S009,-0.4670,0.3700,-0.33900,,0.434000,0.0358,-0.17500,0.18100,0.1160,-0.05120,...,-0.675000,0.23900,0.140000,1.0700,0.60700,0.48600,0.16900,-0.632000,-0.203000,-0.068500
S010,-1.1200,-1.3100,0.91200,0.4180,-0.076800,0.8460,-0.12100,,-0.3110,0.20700,...,-0.002120,,-1.190000,-1.2700,-1.27000,-0.22200,-0.32000,-0.620000,0.363000,-0.463000
S011,-0.7160,-0.8850,2.82000,-0.3430,0.147000,0.4450,-0.05650,-0.83800,0.0490,0.17600,...,0.084500,-0.64700,0.175000,0.2120,-0.32400,-0.35000,-0.37700,0.388000,0.011000,0.134000


## Row and Column values

Each column in the proteomics dataframe holds the readings for one protein across all samples.

Each row holds the protein readings for one sample, taken from either a tumor or a non-tumor area of a cancer patient. NOTE: Some samples (e.g. S004, S015) were excluded from all of the endometrial data after a quality check.

In [4]:
proteomics = en.get_proteomics()
samples = proteomics.index
proteins = proteomics.columns
print("Samples:",samples[0:100]) #the first one hundred samples (or first one hundred row labels in the proteomics dataframe)
print("Proteins:",proteins[0:100]) #the first one hundred proteins (or first one hundred column labels in the proteomics dataframe)

Samples: Index(['S001', 'S002', 'S003', 'S005', 'S006', 'S007', 'S008', 'S009', 'S010',
       'S011', 'S012', 'S014', 'S016', 'S017', 'S018', 'S019', 'S020', 'S021',
       'S022', 'S023', 'S024', 'S025', 'S026', 'S027', 'S028', 'S029', 'S030',
       'S031', 'S032', 'S033', 'S034', 'S036', 'S037', 'S038', 'S039', 'S040',
       'S041', 'S042', 'S044', 'S045', 'S046', 'S048', 'S049', 'S050', 'S051',
       'S053', 'S054', 'S055', 'S056', 'S057', 'S058', 'S059', 'S060', 'S061',
       'S062', 'S063', 'S064', 'S065', 'S066', 'S067', 'S068', 'S069', 'S070',
       'S071', 'S072', 'S073', 'S074', 'S075', 'S076', 'S077', 'S078', 'S079',
       'S080', 'S081', 'S082', 'S083', 'S084', 'S085', 'S086', 'S087', 'S088',
       'S090', 'S091', 'S092', 'S093', 'S094', 'S095', 'S096', 'S097', 'S098',
       'S099', 'S100', 'S101', 'S102', 'S103', 'S105', 'S106', 'S107', 'S108',
       'S109'],
      dtype='object')
Proteins: Index(['A1BG', 'A2M', 'A2ML1', 'A4GALT', 'AAAS', 'AACS', 'AADAT', 'AAED1',
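Because rows are labeled by sample and columns by protein, individual readings can be retrieved by label. The following is a minimal sketch that assumes a toy dataframe (`proteomics_demo`, a hypothetical name) built from a few values in the output above, standing in for the full result of `en.get_proteomics()`. With the package loaded, the same pandas indexing should apply to `proteomics` directly.

```python
import pandas as pd

# Toy stand-in for en.get_proteomics(): three samples and three proteins,
# with values copied from the proteomics output shown earlier.
proteomics_demo = pd.DataFrame(
    {
        "A1BG": [-1.18, -0.685, -0.528],
        "A2M": [-0.863, -1.07, -1.32],
        "A2ML1": [-0.802, -0.684, 0.435],
    },
    index=pd.Index(["S001", "S002", "S003"], name="idx"),
)

# One column: the A2M readings for every sample
a2m = proteomics_demo["A2M"]

# One row: every protein reading for sample S002
s002 = proteomics_demo.loc["S002"]

# One cell: the A2ML1 reading for sample S003
value = proteomics_demo.loc["S003", "A2ML1"]

print(a2m)
print(s002)
print(value)
```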

## Dataframe values

Values in the dataframe are relative protein abundance values. A value of "NaN" (an empty cell in the output below) means that the sample had no data for that protein.

In [5]:
proteomics.head()

idx,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
S001,-1.18,-0.863,-0.802,0.222,0.256,0.665,1.28,-0.339,0.412,-0.664,...,-0.0877,,0.0229,0.109,,-0.332,-0.433,-1.02,-0.123,-0.0859
S002,-0.685,-1.07,-0.684,0.984,0.135,0.334,1.3,0.139,1.33,-0.367,...,-0.0356,,0.363,1.07,0.737,-0.564,-0.00461,-1.13,-0.0757,-0.473
S003,-0.528,-1.32,0.435,,-0.24,1.04,-0.0213,-0.0479,0.419,-0.5,...,0.00112,-0.145,0.0105,-0.116,,0.151,-0.074,-0.54,0.32,-0.419
S005,-1.67,-1.19,-0.443,0.243,-0.0993,0.757,0.74,-0.929,0.229,-0.223,...,0.0725,-0.0552,-0.0714,0.0933,0.156,-0.398,-0.0752,-0.797,-0.0301,-0.467
S006,-0.374,-0.0206,-0.537,0.311,0.375,0.0131,-1.1,,0.565,-0.101,...,-0.176,,-1.22,-0.562,0.937,-0.646,0.207,-1.85,-0.176,0.0513
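Missing readings are common enough that it can help to count them before an analysis. The sketch below assumes a toy dataframe (`proteomics_demo`, a hypothetical name) built from four columns of the five rows shown above; it is not the package's own workflow, only standard pandas calls that should work the same way on the full `proteomics` dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for proteomics.head(): values copied from the output above,
# with np.nan wherever the output shows a missing reading.
proteomics_demo = pd.DataFrame(
    {
        "A1BG": [-1.18, -0.685, -0.528, -1.67, -0.374],
        "A4GALT": [0.222, 0.984, np.nan, 0.243, 0.311],
        "AAED1": [-0.339, 0.139, -0.0479, -0.929, np.nan],
        "ZSWIM9": [np.nan, np.nan, -0.145, -0.0552, np.nan],
    },
    index=pd.Index(["S001", "S002", "S003", "S005", "S006"], name="idx"),
)

# Number of samples with no reading, per protein
missing_per_protein = proteomics_demo.isna().sum()
print(missing_per_protein)

# Keep only the proteins that were measured in every sample
complete = proteomics_demo.dropna(axis="columns")
print(list(complete.columns))
```

Whether to drop incomplete proteins or keep them depends on the analysis; dropping is shown here only because it is the simplest option.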


## Other endometrial data

All other data included in the package are presented using dataframes. Additionally, each dataset (endometrial, ovarian, colon, etc.) has data for a specific set of samples, and sample IDs are consistent throughout the respective dataset. Therefore, a sample ID such as S001 in the endometrial proteomics data refers to the same sample in all other endometrial dataframes. Note that not every dataframe contains every sample, as the differing row counts in the <code>en.list()</code> output above show.
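Since sample IDs carry the same meaning across dataframes, two dataframes can be aligned on their row labels. The following is a rough sketch under stated assumptions: `prot_demo` and `tran_demo` are hypothetical toy dataframes holding A1BG values copied from the proteomics output above and the transcriptomics output in the next subsection, and the subsets of samples were chosen arbitrarily so that the two indexes differ.

```python
import pandas as pd

# Toy stand-ins; the sample subsets are arbitrary and only illustrate
# the case where two dataframes do not cover the same samples.
prot_demo = pd.DataFrame(
    {"A1BG": [-1.18, -0.685, -0.528]},
    index=pd.Index(["S001", "S002", "S003"], name="idx"),
)
tran_demo = pd.DataFrame(
    {"A1BG": [4.81, 6.24, 5.31]},
    index=pd.Index(["S002", "S003", "S005"], name="idx"),
)

# Sample IDs present in both dataframes
shared_samples = prot_demo.index.intersection(tran_demo.index)
print(list(shared_samples))

# Restrict both dataframes to the shared samples, in the same order
prot_shared = prot_demo.loc[shared_samples]
tran_shared = tran_demo.loc[shared_samples]
print(prot_shared)
print(tran_shared)
```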

### Transcriptomics

The transcriptomics dataframe is laid out almost identically to the proteomics dataframe. However, note that the gene names (column names) differ slightly from the protein names used as column names in the proteomics dataframe.

In [6]:
transcriptomics = en.get_transcriptomics()
transcriptomics.head()

idx,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2MP1,A3GALT2,A4GALT,A4GNT,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
S001,4.02,2.16,3.27,13.39,5.88,6.79,1.55,0.97,10.34,1.96,...,11.06,10.73,8.4,9.78,10.88,5.93,11.52,10.23,11.5,11.47
S002,4.81,2.21,4.86,13.24,5.93,6.33,0.93,0.0,10.83,0.0,...,10.87,11.43,8.39,9.14,10.38,7.25,11.64,10.64,11.26,11.57
S003,6.24,6.43,3.68,14.32,6.53,9.42,2.79,0.0,10.98,2.13,...,10.06,10.13,8.35,9.27,10.46,6.85,11.6,10.21,11.51,11.09
S005,5.31,4.87,5.59,13.77,6.35,4.22,2.97,0.0,8.68,1.98,...,10.29,10.41,9.1,9.59,10.15,7.89,11.9,10.21,11.34,11.51
S006,9.84,8.83,7.0,13.12,6.49,6.83,1.8,0.0,11.42,3.28,...,10.36,11.24,8.6,9.44,11.8,9.32,11.97,9.77,11.37,12.35
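To see how the two sets of column names differ, the column indexes can be compared directly. This sketch assumes only the leading column labels visible in the two outputs above (the full column lists are elided there and are not reproduced); with the package loaded, `proteomics.columns` and `transcriptomics.columns` would take the place of the hand-written indexes.

```python
import pandas as pd

# Leading column labels copied from the proteomics and transcriptomics outputs
proteomics_cols = pd.Index(["A1BG", "A2M", "A2ML1", "A4GALT"])
transcriptomics_cols = pd.Index(
    ["A1BG", "A1BG-AS1", "A1CF", "A2M", "A2M-AS1",
     "A2ML1", "A2MP1", "A3GALT2", "A4GALT", "A4GNT"]
)

# Names that appear in both dataframes
shared_genes = proteomics_cols.intersection(transcriptomics_cols)
print(list(shared_genes))

# Names that appear only in the transcriptomics columns
only_transcriptomics = transcriptomics_cols.difference(proteomics_cols)
print(list(only_transcriptomics))
```

Restricting both dataframes to `shared_genes` is one simple way to compare protein and transcript levels for the same genes.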


### Clinical

The clinical dataframe lists clinical information (such as age, race, diabetes status, tumor size, etc.) for the patient associated with each sample. 

In [25]:
clinical = en.get_clinical()
clinical.head()

idx,Proteomics_Participant_ID,Case_excluded,Proteomics_Tumor_Normal,Country,Histologic_Grade_FIGO,Myometrial_invasion_Specify,Histologic_type,Treatment_naive,Tumor_purity,Path_Stage_Primary_Tumor-pT,...,Age,Diabetes,Race,Ethnicity,Gender,Tumor_Site,Tumor_Site_Other,Tumor_Focality,Tumor_Size_cm,Num_full_term_pregnancies
S001,C3L-00006,No,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,64.0,No,White,Not-Hispanic or Latino,Female,Anterior endometrium,,Unifocal,2.9,1
S002,C3L-00008,No,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,58.0,No,White,Not-Hispanic or Latino,Female,Posterior endometrium,,Unifocal,3.5,1
S003,C3L-00032,No,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,50.0,Yes,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,4.5,4 or more
S005,C3L-00090,No,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,75.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,3.5,4 or more
S006,C3L-00098,No,Tumor,United States,,under 50 %,Serous,YES,Normal,pT1a (FIGO IA),...,63.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,6.0,2


In addition to a tumor sample, some patients also had a normal sample taken as a control for comparison. These control samples begin at S105, as the "Proteomics_Tumor_Normal" column in the clinical dataframe shows: it lists some form of "normal" value for S105 and every sample after it. The Proteomics_Participant_ID column (PPID in some datasets) offers a second way to confirm a control sample. It holds a patient-level ID, for example C3L-00006 for sample S001. Sample S105 lists the same ID, which means that these two samples were drawn from the same patient.

In [26]:
clinical.iloc[90:100]

idx,Proteomics_Participant_ID,Case_excluded,Proteomics_Tumor_Normal,Country,Histologic_Grade_FIGO,Myometrial_invasion_Specify,Histologic_type,Treatment_naive,Tumor_purity,Path_Stage_Primary_Tumor-pT,...,Age,Diabetes,Race,Ethnicity,Gender,Tumor_Site,Tumor_Site_Other,Tumor_Focality,Tumor_Size_cm,Num_full_term_pregnancies
S099,C3N-01520,No,Tumor,Ukraine,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,69.0,No,,,Female,"Other, specify",Endometrium,Multifocal,1.0,2
S100,C3N-01521,No,Tumor,Ukraine,FIGO grade 3,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,75.0,No,,,Female,"Other, specify",Entire uterine cavity,Unifocal,4.2,2
S101,C3N-01537,No,Tumor,Ukraine,FIGO grade 2,50 % or more,Endometrioid,YES,Normal,pT2 (FIGO II),...,74.0,No,,,Female,"Other, specify",Entire uterine cavity,Unifocal,1.5,1
S102,C3N-01802,No,Tumor,United States,,under 50 %,Serous,YES,Normal,pT2 (FIGO II),...,85.0,Yes,Black or African American,Not-Hispanic or Latino,Female,"Other, specify",entire uterine cavity,Unifocal,3.8,1
S103,C3N-01825,No,Tumor,Ukraine,,under 50 %,Serous,YES,Normal,pT1a (FIGO IA),...,70.0,No,,,Female,"Other, specify",Entire uterine cavity,Unifocal,5.0,Unknown
S105,C3L-00006,No,Adjacent_normal,,,,,,,,...,,,,,,,,,,
S106,C3L-00361,No,Adjacent_normal,,,,,,,,...,,,,,,,,,,
S107,C3L-00586,No,Adjacent_normal,,,,,,,,...,,,,,,,,,,
S108,C3L-00601,No,Adjacent_normal,,,,,,,,...,,,,,,,,,,
S109,C3L-00769,No,Adjacent_normal,,,,,,,,...,,,,,,,,,,
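The matching of tumor and normal samples through the participant ID can be sketched as a merge. The example below assumes a toy dataframe (`clinical_demo`, a hypothetical name) with two columns and five rows copied from the clinical outputs above. It also assumes that any value other than "Tumor" marks a control sample, which may need adjusting for the full data.

```python
import pandas as pd

# Toy stand-in for en.get_clinical(): rows and values copied from the output above
clinical_demo = pd.DataFrame(
    {
        "Proteomics_Participant_ID": [
            "C3L-00006", "C3L-00008", "C3L-00032", "C3L-00006", "C3L-00361",
        ],
        "Proteomics_Tumor_Normal": [
            "Tumor", "Tumor", "Tumor", "Adjacent_normal", "Adjacent_normal",
        ],
    },
    index=pd.Index(["S001", "S002", "S003", "S105", "S106"], name="idx"),
)

# Split into tumor and non-tumor samples; reset_index keeps the sample ID as a column
is_tumor = clinical_demo["Proteomics_Tumor_Normal"] == "Tumor"
tumor = clinical_demo[is_tumor].reset_index()
normal = clinical_demo[~is_tumor].reset_index()

# Pair samples that share a participant ID, i.e. came from the same patient
pairs = tumor.merge(
    normal, on="Proteomics_Participant_ID", suffixes=("_tumor", "_normal")
)
print(pairs[["Proteomics_Participant_ID", "idx_tumor", "idx_normal"]])
```

In this toy subset only C3L-00006 has both a tumor and a normal row, so the result has a single pair (S001 and S105).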


### Mutation data

All datasets contain mutation data for the respective cancer cohort. The data lists every mutation found in each sample, so a single sample spans many lines. Each line gives the gene that was mutated, the type of mutation that occurred in that gene, and the location of the mutation within the gene.

In [9]:
somatic_mutations = en.get_somatic()
somatic_mutations.head()

Unnamed: 0,Clinical_Patient_Key,Patient_Id,Gene,Mutation,Location
0,S001,C3L-00006,MXRA8,Frame_Shift_Del,p.R301Gfs*107
1,S001,C3L-00006,GNB1,Missense_Mutation,p.R314C
2,S001,C3L-00006,RPL22,Missense_Mutation,p.V72M
3,S001,C3L-00006,CASZ1,Missense_Mutation,p.R233Q
4,S001,C3L-00006,PRAMEF9,Missense_Mutation,p.L30M
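Because the mutation data has one line per mutation, summaries per sample or per gene come from grouping and filtering. The following is a minimal sketch that assumes a toy dataframe (`somatic_demo`, a hypothetical name) rebuilt from the five rows shown above; the same pandas calls should apply to the full `somatic_mutations` dataframe.

```python
import pandas as pd

# Toy stand-in for en.get_somatic(): the five rows shown in the output above
somatic_demo = pd.DataFrame(
    {
        "Clinical_Patient_Key": ["S001"] * 5,
        "Patient_Id": ["C3L-00006"] * 5,
        "Gene": ["MXRA8", "GNB1", "RPL22", "CASZ1", "PRAMEF9"],
        "Mutation": [
            "Frame_Shift_Del",
            "Missense_Mutation",
            "Missense_Mutation",
            "Missense_Mutation",
            "Missense_Mutation",
        ],
        "Location": ["p.R301Gfs*107", "p.R314C", "p.V72M", "p.R233Q", "p.L30M"],
    }
)

# Number of mutation lines recorded for each sample
mutations_per_sample = somatic_demo.groupby("Clinical_Patient_Key").size()
print(mutations_per_sample)

# How often each mutation type occurs
mutation_types = somatic_demo["Mutation"].value_counts()
print(mutation_types)

# All mutation lines for a single gene of interest
gnb1 = somatic_demo[somatic_demo["Gene"] == "GNB1"]
print(gnb1)
```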
