# Frequently Asked Questions: A simple "How To" for common questions when working with cancer data in dataframes in the cptac package

A large portion of this package utilizes the functionality of pandas dataframes. Here we give a few specific instances of using dataframes with our package. For a more general overview of dataframes, see pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">documentation</a> and <a href="https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python">tutorials</a>.

## How do I install the package?

Install the package from the <a href="https://pypi.org/project/cptac/">Python Package Index (PyPI)</a> using the <a href="https://pypi.org/project/pip/">pip package installer</a>, with the name of the package:

<code>pip install cptac</code>

## How do I import the package once it has been installed?

The cptac package has several cancer data sets. First import the entire package by entering the following:

<code>import cptac</code>

(Once the package has loaded, a message will display the package version, as well as instructions for viewing specific data sets.)

Then, to access a specific data set (e.g. the Endometrial data) and its functionality, load the dataset object and assign it to a variable, like this:

<code>en = cptac.Endometrial()</code>

As the dataset loads, it will print basic instructions for using it, tell you the data version, and update you on the progress of the data loading until it's finished.

In [1]:
import pandas as pd
import numpy as np
import cptac
en = cptac.Endometrial()

Welcome to the cptac data service package. To view available datasets,
enter cptac.list_data(). To access a specific data set, load the
dataset and assign it to a variable using 'cptac.NameOfDataset()',
e.g. 'en = cptac.Endometrial()'

******
Version: 0.4.3
******
You have loaded the cptac endometrial dataset. To view available
dataframes, call the dataset's list_data() method. To view available
functions for accessing and manipulating the dataframes, call its
list_api() method.
endometrial data version: 2.1

Loading acetylproteomics data...
Loading clinical data...
Loading CNA data...
Loading definitions data...
Loading miRNA data...
Loading phosphoproteomics_gene data...
Loading phosphoproteomics_site data...
Loading proteomics data...
Loading somatic data...
Loading somatic_binary data...
Loading transcriptomics_circular data...
Loading transcriptomics_linear data...

 ******PLEASE READ******
CPTAC is a community resource project and data are made available
rapidly after generation 

## Can I use multiple datasets at the same time?

You can have multiple datasets loaded at the same time. As you load each dataset, it will display its individual loading information.

In [2]:
ov = cptac.Ovarian()
co = cptac.Colon()

AttributeError: module 'cptac' has no attribute 'Ovarian'

<b>NOTE: When using multiple data sets, be sure to check that each function used matches the expected data set, as each data set uses the same API.</b> For example, the command for retrieving clinical data is <code>get_clinical()</code> for all data sets, so make sure not to retrieve ovarian clinical data <code>ov.get_clinical()</code> when the intent was to retrieve endometrial clinical data <code>en.get_clinical()</code>.

## How do I see what data is available?

Each cancer type contained in the cptac package has a slightly different set of data. To see what is available for a particular cancer type, call the dataset's <code>list_data()</code> method. It displays the name and dimensions (rows, columns) of each dataframe.

In [3]:
en.list_data()

Below are the available endometrial data frames contained in this package:
	 clinical
	 	 Dimensions: (144, 26)
	 derived_molecular
	 	 Dimensions: (144, 144)
	 acetylproteomics
	 	 Dimensions: (144, 10862)
	 proteomics
	 	 Dimensions: (144, 10999)
	 transcriptomics_linear
	 	 Dimensions: (109, 28057)
	 transcriptomics_circular
	 	 Dimensions: (109, 4945)
	 miRNA
	 	 Dimensions: (99, 2337)
	 cna
	 	 Dimensions: (95, 28057)
	 phosphoproteomics_site
	 	 Dimensions: (144, 73212)
	 phosphoproteomics_gene
	 	 Dimensions: (144, 8466)
	 somatic binary
	 	 Dimensions: (95, 51559)
	 somatic MAF
	 	 Dimensions: (52560, 5)


Data can then be accessed by using a "get" method for that particular dataframe. As the API for some data is different, consult the <a href="https://github.com/PayneLab/CPTAC/blob/master/doc/help.txt">CPTAC github help page</a> or call the dataset's <code>list_api()</code> method (for example, <code>en.list_api()</code>) for documentation.

In [4]:
proteomics = en.get_proteomics()
proteomics.head()

idx,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
S001,-1.18,-0.863,-0.802,0.222,0.256,0.665,1.28,-0.339,0.412,-0.664,...,-0.0877,,0.0229,0.109,,-0.332,-0.433,-1.02,-0.123,-0.0859
S002,-0.685,-1.07,-0.684,0.984,0.135,0.334,1.3,0.139,1.33,-0.367,...,-0.0356,,0.363,1.07,0.737,-0.564,-0.00461,-1.13,-0.0757,-0.473
S003,-0.528,-1.32,0.435,,-0.24,1.04,-0.0213,-0.0479,0.419,-0.5,...,0.00112,-0.145,0.0105,-0.116,,0.151,-0.074,-0.54,0.32,-0.419
S005,-1.67,-1.19,-0.443,0.243,-0.0993,0.757,0.74,-0.929,0.229,-0.223,...,0.0725,-0.0552,-0.0714,0.0933,0.156,-0.398,-0.0752,-0.797,-0.0301,-0.467
S006,-0.374,-0.0206,-0.537,0.311,0.375,0.0131,-1.1,,0.565,-0.101,...,-0.176,,-1.22,-0.562,0.937,-0.646,0.207,-1.85,-0.176,0.0513
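
Once a dataframe has been retrieved, a quick check of its size and missing values is often useful. The sketch below does not call cptac; it builds a small stand-in dataframe (the name <code>toy</code> is hypothetical) from a few values in the proteomics table above, so it runs with only pandas installed.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for a cptac dataframe: rows are samples, columns are proteins.
# Values are copied from the first three rows of the proteomics table above.
toy = pd.DataFrame(
    {
        "A1BG": [-1.18, -0.685, -0.528],
        "A4GALT": [0.222, 0.984, nan],
        "ZSWIM9": [nan, nan, -0.145],
    },
    index=["S001", "S002", "S003"],
)

n_samples, n_proteins = toy.shape        # (rows, columns)
missing_per_protein = toy.isna().sum()   # number of NaN values in each column

print(n_samples, n_proteins)
print(missing_per_protein)
```

The same two calls, <code>.shape</code> and <code>.isna().sum()</code>, should work unchanged on the full <code>proteomics</code> dataframe.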


## How do I access specific columns in a dataframe?

Many times you will want to get a feel for what data is contained in the dataframe. For instance, we want to know what kind of proteins are listed in our proteomics data. These proteins are the columns of the dataframe, which can be accessed in a few different ways.

A list of column names (or a list of protein names, in the case of the proteomics data) can be viewed by appending <code>.columns</code> to the end of a dataframe variable.

In [5]:
proteomics.columns

Index(['A1BG', 'A2M', 'A2ML1', 'A4GALT', 'AAAS', 'AACS', 'AADAT', 'AAED1',
       'AAGAB', 'AAK1',
       ...
       'ZSWIM8', 'ZSWIM9', 'ZW10', 'ZWILCH', 'ZWINT', 'ZXDC', 'ZYG11B', 'ZYX',
       'ZZEF1', 'ZZZ3'],
      dtype='object', name='idx', length=10999)

A specific column (or specific protein information, in the case of the proteomics data) can be accessed by entering the name of a column after the dataframe variable by either:

<code>proteomics["A1BG"]</code>

or

<code>proteomics.A1BG</code>

Both return a <a href="https://www.geeksforgeeks.org/python-pandas-series/">pandas series</a>. The first form is also the one to use when the column name is stored in a variable as a string.

In [6]:
protein = "A1BG"
proteomics[protein]

S001   -1.1800
S002   -0.6850
S003   -0.5280
S005   -1.6700
S006   -0.3740
S007   -1.0800
S008   -1.3200
S009   -0.4670
S010   -1.1200
S011   -0.7160
S012   -0.2320
S014   -0.2690
S016   -0.7820
S017   -1.3400
S018   -0.9240
S019    0.1500
S020    0.1810
S021   -0.5930
S022   -0.9580
S023   -0.4540
S024   -0.2400
S025   -0.8310
S026   -0.3270
S027    0.0499
S028    0.7730
S029    0.5510
S030   -0.2130
S031   -0.7060
S032   -0.6980
S033   -1.1900
         ...  
S124    1.4700
S125    0.5260
S126    1.6900
S127    2.1400
S128    1.4000
S129    1.5300
S130    1.4300
S131    0.6920
S132    0.9160
S133    1.0600
S134    1.4100
S135    2.1200
S136    0.7790
S137   -0.0812
S138    0.5200
S139    0.7500
S140   -0.0014
S141    0.2980
S142    0.1110
S143    0.9280
S144    0.8990
S145    1.6000
S146    1.3200
S147    1.4400
S148    1.4600
S149    0.6500
S150    0.4580
S151    1.1500
S152    0.5470
S153    0.9400
Name: A1BG, Length: 144, dtype: float64

Additionally, the <code>dataframe["col_name"]</code> format allows for selection of multiple columns by entering a list.

In [7]:
proteins = ["A1BG","PTEN","TP53"]
proteomics[proteins]

idx,A1BG,PTEN,TP53
S001,-1.1800,-0.5260,0.295000
S002,-0.6850,-0.8300,0.277000
S003,-0.5280,-0.9410,-0.871000
S005,-1.6700,0.7300,-0.343000
S006,-0.3740,-0.3790,3.010000
S007,-1.0800,0.0293,-0.148000
S008,-1.3200,-1.0100,0.441000
S009,-0.4670,0.1300,-1.220000
S010,-1.1200,0.3900,-0.082500
S011,-0.7160,0.0301,-0.181000
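
When the list of names comes from elsewhere (a paper, a pathway database), some of them may not be columns in the dataframe. In recent pandas versions, selecting with a list that contains an unknown name raises a <code>KeyError</code>. A minimal sketch of checking the list first, using a stand-in dataframe and a made-up gene name (<code>NOT_A_GENE</code>) that are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in built from the first two rows of the table above.
toy = pd.DataFrame(
    {"A1BG": [-1.18, -0.685], "PTEN": [-0.526, -0.83], "TP53": [0.295, 0.277]},
    index=["S001", "S002"],
)

requested = ["A1BG", "PTEN", "NOT_A_GENE"]  # the last name is a placeholder

# Split the requested names into those that are columns and those that are not,
# keeping the requested order.
present = [name for name in requested if name in toy.columns]
missing = [name for name in requested if name not in toy.columns]

subset = toy[present]

# Selecting the unchecked list fails because one name is not a column.
try:
    toy[requested]
    raised = False
except KeyError:
    raised = True

print(missing, raised)
print(subset)
```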


## How do I access specific rows in a dataframe?

Specific rows in a dataframe (or specific samples, in the case of the CPTAC data) can be accessed by appending either <code>.iloc</code> (by integer row position) or <code>.loc</code> (by row label). Selecting a single row returns a pandas series; selecting a slice or a list of rows returns a dataframe.

In [8]:
proteomics.iloc[0:5]

idx,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
S001,-1.18,-0.863,-0.802,0.222,0.256,0.665,1.28,-0.339,0.412,-0.664,...,-0.0877,,0.0229,0.109,,-0.332,-0.433,-1.02,-0.123,-0.0859
S002,-0.685,-1.07,-0.684,0.984,0.135,0.334,1.3,0.139,1.33,-0.367,...,-0.0356,,0.363,1.07,0.737,-0.564,-0.00461,-1.13,-0.0757,-0.473
S003,-0.528,-1.32,0.435,,-0.24,1.04,-0.0213,-0.0479,0.419,-0.5,...,0.00112,-0.145,0.0105,-0.116,,0.151,-0.074,-0.54,0.32,-0.419
S005,-1.67,-1.19,-0.443,0.243,-0.0993,0.757,0.74,-0.929,0.229,-0.223,...,0.0725,-0.0552,-0.0714,0.0933,0.156,-0.398,-0.0752,-0.797,-0.0301,-0.467
S006,-0.374,-0.0206,-0.537,0.311,0.375,0.0131,-1.1,,0.565,-0.101,...,-0.176,,-1.22,-0.562,0.937,-0.646,0.207,-1.85,-0.176,0.0513


In [9]:
proteomics.loc["S001"]

idx
A1BG       -1.1800
A2M        -0.8630
A2ML1      -0.8020
A4GALT      0.2220
AAAS        0.2560
AACS        0.6650
AADAT       1.2800
AAED1      -0.3390
AAGAB       0.4120
AAK1       -0.6640
AAMDC       0.2280
AAMP       -0.4560
AAR2       -0.2560
AARS        0.0419
AARS2       0.8570
AARSD1     -0.1450
AASDHPPT   -0.1360
AASS       -0.3040
AATF        0.2000
ABAT        0.0264
ABCA8      -0.3990
ABCB1      -0.4840
ABCB10      0.7960
ABCB6       0.3770
ABCB7       1.1900
ABCB8       0.8300
ABCC1       0.8140
ABCC10     -0.0376
ABCC3       0.4350
ABCC4       0.9160
             ...  
ZNF852         NaN
ZNF888         NaN
ZNFX1       0.2090
ZNHIT1      0.3060
ZNHIT2      0.0774
ZNHIT3      0.4890
ZNHIT6      0.1430
ZNRD1      -0.0326
ZNRF1       0.1830
ZNRF2      -0.1920
ZPR1       -0.0337
ZRANB2     -0.3810
ZRSR2       0.1940
ZSCAN12        NaN
ZSCAN18    -0.6500
ZSCAN2         NaN
ZSCAN21     0.1310
ZSCAN26    -0.1660
ZSCAN30        NaN
ZSCAN31    -0.1020
ZSWIM8     -0.0877
ZSWIM9  
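
The two indexers differ in how they treat slices, which matters here because the sample IDs are not consecutive (there is no S004 in the tables above). A small sketch on a stand-in dataframe (the name <code>toy</code> is hypothetical; values come from the proteomics table):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "A1BG": [-1.18, -0.685, -0.528, -1.67],
        "A2M": [-0.863, -1.07, -1.32, -1.19],
    },
    index=["S001", "S002", "S003", "S005"],
)

one_row = toy.loc["S001"]           # single label: Series indexed by column name
same_row = toy.iloc[0]              # single position: the same Series
by_position = toy.iloc[0:2]         # positions 0 and 1; the end position is excluded
by_label = toy.loc["S001":"S003"]   # labels S001 through S003; the end label is included
fourth = toy.iloc[3]                # the fourth row is sample S005, not S004

print(type(one_row).__name__, type(by_position).__name__)
print(list(by_label.index), fourth.name)
```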

## How do I access specific rows and columns?

In addition to selecting specific rows, <code>.loc</code> can select a subset of rows and columns at once: pass a list of row labels and a list of column names, separated by a comma.

In [10]:
samples = ["S001","S003","S016"]
proteins = ["A1BG","PTEN","TP53"]
proteomics.loc[samples,proteins]

idx,A1BG,PTEN,TP53
S001,-1.18,-0.526,0.295
S003,-0.528,-0.941,-0.871
S016,-0.782,-0.539,2.12
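
The type of the result depends on whether each side of the comma is a list or a single label. A short sketch on a stand-in dataframe holding the three rows shown above (the variable names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "A1BG": [-1.18, -0.528, -0.782],
        "PTEN": [-0.526, -0.941, -0.539],
        "TP53": [0.295, -0.871, 2.12],
    },
    index=["S001", "S003", "S016"],
)

block = toy.loc[["S001", "S016"], ["PTEN", "TP53"]]  # list, list: DataFrame
one_sample = toy.loc["S003", ["A1BG", "TP53"]]       # label, list: Series
value = toy.loc["S016", "TP53"]                      # label, label: single value

print(block)
print(one_sample)
print(value)
```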


## How can I search using conditional statements?

There are a variety of ways to use boolean statements to filter a dataframe. A common way is to pass a boolean expression involving a selected column (which evaluates to a series of true/false values, one per row) to the frequently used <code>.loc</code> indexer. For example, to see information for samples that have a positive protein expression level for the protein A1BG, we pass <code>.loc</code> a boolean expression that is true for rows with values above zero in the A1BG column.

<code>.loc</code> has many functionalities. For a full list, see pandas documentation for <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html">.loc</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html">indexing and slicing</a>.

In [11]:
proteomics.loc[proteomics["A1BG"] > 0].head()

idx,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
S019,0.15,0.987,5.07,0.302,-0.169,0.398,,-0.274,1.09,0.147,...,0.348,-0.484,0.142,0.25,-0.142,0.077,-0.157,-0.872,-0.00663,-0.0175
S020,0.181,-0.283,0.689,0.104,-0.13,-0.0683,-0.376,-0.0838,0.16,0.277,...,0.228,,0.0206,-0.97,,-0.452,0.172,0.626,0.342,-0.203
S027,0.0499,0.0199,-0.125,0.477,-0.252,0.838,-0.1,-0.851,0.351,-0.582,...,0.0491,-0.479,0.0548,0.933,0.00967,-1.14,-0.541,-0.838,0.0238,-0.0602
S028,0.773,0.546,-0.603,,0.0555,-0.639,-0.39,0.467,-0.366,0.427,...,-0.00835,0.119,-0.296,-0.205,-0.0765,0.228,0.237,1.05,0.0185,0.587
S029,0.551,0.203,,,-0.493,0.211,1.06,-0.28,0.207,0.334,...,0.3,,0.142,-0.549,,-0.39,-0.0972,0.00108,0.2,-0.194
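
Conditions can be combined, and missing values need some care because a comparison against NaN evaluates to false. The sketch below uses a stand-in dataframe (the name <code>toy</code> is hypothetical) with A1BG and A4GALT values copied from the tables above:

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame(
    {
        "A1BG": [-1.18, 0.15, 0.773, 0.551],
        "A4GALT": [0.222, 0.302, nan, nan],
    },
    index=["S001", "S019", "S028", "S029"],
)

# Each comparison yields a boolean Series. Wrap each one in parentheses
# before combining with & (and), | (or), or ~ (not).
both_positive = toy.loc[(toy["A1BG"] > 0) & (toy["A4GALT"] > 0)]

# NaN > 0 is False, so S028 and S029 are excluded above.
# To select them explicitly, test for missing values with .isna().
a4galt_missing = toy.loc[(toy["A1BG"] > 0) & toy["A4GALT"].isna()]

print(list(both_positive.index))
print(list(a4galt_missing.index))
```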


Viewing a different example, we can separate clinical information for samples that are serous vs endometrioid by looking at the "Histologic_type" column.

In [12]:
clinical = en.get_clinical()
endometrioid_clinical = clinical.loc[clinical["Histologic_type"] == "Endometrioid"]
serous_clinical = clinical.loc[clinical["Histologic_type"] == "Serous"]

In [13]:
endometrioid_clinical.head()

idx,Proteomics_Participant_ID,Proteomics_Tumor_Normal,Country,Histologic_Grade_FIGO,Myometrial_invasion_Specify,Histologic_type,Treatment_naive,Tumor_purity,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,...,Age,Diabetes,Race,Ethnicity,Gender,Tumor_Site,Tumor_Site_Other,Tumor_Focality,Tumor_Size_cm,Num_full_term_pregnancies
S001,C3L-00006,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pN0,...,64.0,No,White,Not-Hispanic or Latino,Female,Anterior endometrium,,Unifocal,2.9,1
S002,C3L-00008,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pNX,...,58.0,No,White,Not-Hispanic or Latino,Female,Posterior endometrium,,Unifocal,3.5,1
S003,C3L-00032,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pN0,...,50.0,Yes,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,4.5,4 or more
S005,C3L-00090,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pNX,...,75.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,3.5,4 or more
S007,C3L-00136,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),pNX,...,50.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,4.5,3


In [14]:
serous_clinical.head()

idx,Proteomics_Participant_ID,Proteomics_Tumor_Normal,Country,Histologic_Grade_FIGO,Myometrial_invasion_Specify,Histologic_type,Treatment_naive,Tumor_purity,Path_Stage_Primary_Tumor-pT,Path_Stage_Reg_Lymph_Nodes-pN,...,Age,Diabetes,Race,Ethnicity,Gender,Tumor_Site,Tumor_Site_Other,Tumor_Focality,Tumor_Size_cm,Num_full_term_pregnancies
S006,C3L-00098,Tumor,United States,,under 50 %,Serous,YES,Normal,pT1a (FIGO IA),pNX,...,63.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,6.0,2
S009,C3L-00139,Tumor,United States,,50 % or more,Serous,YES,Normal,pT3a (FIGO IIIA),pNX,...,83.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Both anterior and posterior,Unifocal,4.0,4 or more
S016,C3L-00358,Tumor,United States,,50 % or more,Serous,YES,Normal,pT1b (FIGO IB),pNX,...,90.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Both anterior and posterior endometrium,Unifocal,4.5,Unknown
S041,C3L-00963,Tumor,Other_specify,,50 % or more,Serous,YES,Normal,pT1b (FIGO IB),pNX,...,59.0,Yes,White,Not reported,Female,"Other, specify",along anterior and posterior surface,Unifocal,2.6,1
S042,C3L-01246,Tumor,Other_specify,,under 50 %,Serous,YES,Normal,pT1a (FIGO IA),pN0,...,62.0,No,White,Not reported,Female,Posterior endometrium,,Unifocal,2.3,1
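
When a column has more than two categories, writing one <code>.loc</code> filter per value becomes repetitive. One alternative, sketched here on a stand-in dataframe rather than the real clinical data, is to count the categories with <code>value_counts()</code> and split the rows with <code>groupby()</code>. The name <code>toy_clinical</code> is hypothetical; the four rows reuse values from the tables above.

```python
import pandas as pd

toy_clinical = pd.DataFrame(
    {
        "Histologic_type": ["Endometrioid", "Endometrioid", "Serous", "Serous"],
        "Age": [64.0, 58.0, 63.0, 83.0],
    },
    index=["S001", "S002", "S006", "S009"],
)

# How many samples fall into each category.
counts = toy_clinical["Histologic_type"].value_counts()

# groupby yields (label, sub-dataframe) pairs, one per distinct value.
by_type = {label: group for label, group in toy_clinical.groupby("Histologic_type")}

print(counts)
print(by_type["Serous"])
```

By default, <code>groupby()</code> leaves out rows whose grouping value is missing, which matches the behavior of the <code>==</code> filters above.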
