# Tutorial 2: Accessing dataframes and using them with `pandas` 

Much of this package is built on pandas dataframes. Here we give a few specific examples of using dataframes with our package. For a more general overview of dataframes, see the pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and [tutorials](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python).

## How do I install the package?

Install the package from the [Python Package Index (PyPI)](https://pypi.org/project/cptac/) in the command line using the [pip package installer](https://pypi.org/project/pip/), with the name of the package:

`pip install cptac`

## How do I import the package once it has been installed?

The cptac package has several cancer data sets. First import the entire package by entering the following:

`import cptac`

The data in the package is organized by cancer type. All available types can be seen with the command `cptac.list_cancer_options()`.

To get the data for a cancer type, load the dataset object and assign it to a variable, like this:

`en = cptac.Ucec()`


In [1]:
import cptac
en = cptac.Ucec()

## Can I use multiple datasets at the same time?

You can have multiple datasets loaded at the same time; just assign each one to its own variable.

In [2]:
ov = cptac.Ov()
co = cptac.Coad()

**NOTE: When using multiple datasets, make sure you call each method on the dataset object you intend, because every dataset exposes the same API.** For example, the command for retrieving clinical data is `get_clinical()` for all datasets, so make sure not to retrieve ovarian clinical data with `ov.get_clinical()` when you meant to get the endometrial clinical data with `en.get_clinical()`.

## How do I access a particular dataframe?

You can access a specific dataframe by calling the dataset's `get_dataframe` method and passing in a datatype and source. There are also helper "get" methods for each datatype, such as `get_clinical`, which work the same as `get_dataframe` but do not require the datatype parameter. To see all available dataframes, call the dataset's `list_data_sources` method, e.g. `en.list_data_sources()`.

In [3]:
en.list_data_sources()

,Data type,Available sources
0,CNV,"awg, washu"
1,clinical,"awg, mssm, pdc"
2,deconvolution_cibersort,washu
3,deconvolution_xcell,washu
4,followup,awg
5,phosphoproteomics,"awg, pdc, umich"
6,proteomics,"awg, pdc, umich"
7,somatic_mutation,"awg, harmonized, washu"
8,transcriptomics,"awg, bcm, broad, washu"
9,treatment,awg


In [4]:
proteomics = en.get_proteomics('umich')
proteomics.head()

Patient_ID,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
C3L-00006,-1.18,-0.863,-0.802,0.222,0.256,0.665,1.28,-0.339,0.412,-0.664,...,-0.0877,,0.0229,0.109,,-0.332,-0.433,-1.02,-0.123,-0.0859
C3L-00008,-0.685,-1.07,-0.684,0.984,0.135,0.334,1.3,0.139,1.33,-0.367,...,-0.0356,,0.363,1.07,0.737,-0.564,-0.00461,-1.13,-0.0757,-0.473
C3L-00032,-0.528,-1.32,0.435,,-0.24,1.04,-0.0213,-0.0479,0.419,-0.5,...,0.00112,-0.145,0.0105,-0.116,,0.151,-0.074,-0.54,0.32,-0.419
C3L-00090,-1.67,-1.19,-0.443,0.243,-0.0993,0.757,0.74,-0.929,0.229,-0.223,...,0.0725,-0.0552,-0.0714,0.0933,0.156,-0.398,-0.0752,-0.797,-0.0301,-0.467
C3L-00098,-0.374,-0.0206,-0.537,0.311,0.375,0.0131,-1.1,,0.565,-0.101,...,-0.176,,-1.22,-0.562,0.937,-0.646,0.207,-1.85,-0.176,0.0513


## How do I access specific columns in a dataframe?

You'll probably want to get a feel for what data is in the dataframes you load. For example, say we want to know what kind of data is included in our proteomics dataframe. Each column in that dataframe contains the proteomics data for a different protein.

You can view a list of column names (which is a list of protein names, in the case of the proteomics dataframe) by appending `.columns` to the end of a dataframe variable. Long lists of names are abbreviated with `...` when displayed. To see every column name, convert the names to a plain Python list and print it: `print(proteomics.columns.tolist())`.

In [5]:
proteomics.columns

Index(['A1BG', 'A2M', 'A2ML1', 'A4GALT', 'AAAS', 'AACS', 'AADAT', 'AAED1',
       'AAGAB', 'AAK1',
       ...
       'ZSWIM8', 'ZSWIM9', 'ZW10', 'ZWILCH', 'ZWINT', 'ZXDC', 'ZYG11B', 'ZYX',
       'ZZEF1', 'ZZZ3'],
      dtype='object', name='Name', length=10999)
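
As a rough illustration of inspecting column names, the sketch below uses a small stand-in dataframe rather than the real proteomics table. The name `toy` is hypothetical, and the values are copied from the first rows shown earlier. It lists every column name and checks whether a given protein is present before trying to select it.

```python
import pandas as pd

# Toy stand-in for the proteomics dataframe (assumption: not loaded from cptac).
toy = pd.DataFrame(
    {"A1BG": [-1.18, -0.685], "A2M": [-0.863, -1.07], "A2ML1": [-0.802, -0.684]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)
toy.columns.name = "Name"

# tolist() turns the column Index into a plain Python list, which print() shows in full.
all_names = toy.columns.tolist()
print(all_names)

# Membership tests tell you whether a protein is present before you select it.
has_a2m = "A2M" in toy.columns
has_pten = "PTEN" in toy.columns
print(has_a2m, has_pten)
```

Checking membership first avoids a `KeyError` when a protein was not measured in a particular dataset.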

To access a specific column (which is a specific protein's data, in the case of the proteomics dataframe), select the column from the dataframe using either of the following methods:

`proteomics["A1BG"]`

or

`proteomics.A1BG`

Both return the column as a [pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). The first method is useful when the name of the column you want is stored as a string variable.

In [6]:
protein = "A1BG"
A1BG_col = proteomics[protein]
A1BG_col.head()

Patient_ID
C3L-00006   -1.180
C3L-00008   -0.685
C3L-00032   -0.528
C3L-00090   -1.670
C3L-00098   -0.374
Name: A1BG, dtype: float64
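
One practical difference between the two forms is worth sketching: attribute access only works when the column name is a valid Python identifier. The example below uses a hypothetical toy dataframe; the `HLA-A` column and its values are made up purely to show a name containing a hyphen.

```python
import pandas as pd

# Toy stand-in dataframe (assumption: the "HLA-A" column and its values are illustrative only).
toy = pd.DataFrame(
    {"A1BG": [-1.18, -0.685], "HLA-A": [0.5, -0.2]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

# Both forms return the same Series when the name is a valid Python identifier.
same = toy["A1BG"].equals(toy.A1BG)

# A name containing a hyphen cannot be written as an attribute
# (toy.HLA-A would be parsed as "toy.HLA minus A"), so use brackets.
hla = toy["HLA-A"]

# Bracket access also works when the name is held in a variable.
protein = "A1BG"
col = toy[protein]
print(same, hla.tolist(), col.name)
```

For that reason the bracket form is usually the safer default in scripts.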

This `dataframe["col_name"]` syntax also allows for selection of multiple columns by entering a list of column names.

In [7]:
proteins = ["A1BG","PTEN","TP53"]
selected_prot = proteomics[proteins]
selected_prot.head()

Patient_ID,A1BG,PTEN,TP53
C3L-00006,-1.18,-0.526,0.295
C3L-00008,-0.685,-0.83,0.277
C3L-00032,-0.528,-0.941,-0.871
C3L-00090,-1.67,0.73,-0.343
C3L-00098,-0.374,-0.379,3.01
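
When a list of names includes a protein that is not in the dataframe, pandas raises a `KeyError`. A minimal sketch of one way to handle that follows, assuming a toy dataframe in place of the real one; `NOT_A_GENE` is a deliberately fake name.

```python
import pandas as pd

# Toy stand-in for the proteomics dataframe (values copied from the rows above).
toy = pd.DataFrame(
    {"A1BG": [-1.18, -0.685], "PTEN": [-0.526, -0.83], "TP53": [0.295, 0.277]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

wanted = ["A1BG", "PTEN", "NOT_A_GENE"]  # last name is deliberately absent

# Selecting with a list that contains a missing name raises a KeyError.
try:
    toy[wanted]
    raised = False
except KeyError:
    raised = True

# Keep only the requested names that exist, preserving the requested order.
present = [name for name in wanted if name in toy.columns]
subset = toy[present]
print(raised, list(subset.columns))
```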


## How do I access specific rows in a dataframe?

You can access specific rows in a dataframe (which are specific samples, in the case of the CPTAC data) using the dataframe's `.iloc` (by row position) or `.loc` (by row label) indexer. Both return a pandas Series if you select a single row and a pandas DataFrame if you select multiple rows.

In [8]:
proteomics.iloc[0:5]

Patient_ID,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
C3L-00006,-1.18,-0.863,-0.802,0.222,0.256,0.665,1.28,-0.339,0.412,-0.664,...,-0.0877,,0.0229,0.109,,-0.332,-0.433,-1.02,-0.123,-0.0859
C3L-00008,-0.685,-1.07,-0.684,0.984,0.135,0.334,1.3,0.139,1.33,-0.367,...,-0.0356,,0.363,1.07,0.737,-0.564,-0.00461,-1.13,-0.0757,-0.473
C3L-00032,-0.528,-1.32,0.435,,-0.24,1.04,-0.0213,-0.0479,0.419,-0.5,...,0.00112,-0.145,0.0105,-0.116,,0.151,-0.074,-0.54,0.32,-0.419
C3L-00090,-1.67,-1.19,-0.443,0.243,-0.0993,0.757,0.74,-0.929,0.229,-0.223,...,0.0725,-0.0552,-0.0714,0.0933,0.156,-0.398,-0.0752,-0.797,-0.0301,-0.467
C3L-00098,-0.374,-0.0206,-0.537,0.311,0.375,0.0131,-1.1,,0.565,-0.101,...,-0.176,,-1.22,-0.562,0.937,-0.646,0.207,-1.85,-0.176,0.0513


In [9]:
S001_row = proteomics.loc["C3L-00006"]
S001_row.head()

Name
A1BG     -1.180
A2M      -0.863
A2ML1    -0.802
A4GALT    0.222
AAAS      0.256
Name: C3L-00006, dtype: float64
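
The Series versus DataFrame distinction, and the different endpoint rules for position and label slices, can be sketched on a toy dataframe. The name `toy` is hypothetical and the values are copied from the rows above.

```python
import pandas as pd

# Toy stand-in for the proteomics dataframe.
toy = pd.DataFrame(
    {"A1BG": [-1.18, -0.685, -0.528], "A2M": [-0.863, -1.07, -1.32]},
    index=pd.Index(["C3L-00006", "C3L-00008", "C3L-00032"], name="Patient_ID"),
)

one_row = toy.loc["C3L-00006"]        # single label -> Series
one_row_df = toy.loc[["C3L-00006"]]   # list with one label -> DataFrame

# Position slices exclude the end position; label slices include the end label.
by_position = toy.iloc[0:2]
by_label = toy.loc["C3L-00006":"C3L-00032"]

print(type(one_row).__name__, type(one_row_df).__name__)
print(len(by_position), len(by_label))
```

Wrapping a single label in a list is a handy way to keep a DataFrame when later code expects one.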

## How do I access specific rows and columns?

In addition to selecting specific rows, you can also use `.loc` to select a subset of rows and columns, using lists.

In [10]:
samples = ["C3L-00006","C3L-00032","C3L-00413"]
proteins = ["A1BG","PTEN","TP53"]
proteomics.loc[samples, proteins]

Patient_ID,A1BG,PTEN,TP53
C3L-00006,-1.18,-0.526,0.295
C3L-00032,-0.528,-0.941,-0.871
C3L-00413,0.15,-1.7,0.0213
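
A few other common `.loc` patterns are sketched below on a toy dataframe built from the three rows above (the name `toy` is hypothetical): looking up a single value, keeping every row while choosing columns, and reordering by passing labels in a different order.

```python
import pandas as pd

# Toy stand-in holding the three samples and three proteins shown above.
toy = pd.DataFrame(
    {
        "A1BG": [-1.18, -0.528, 0.15],
        "PTEN": [-0.526, -0.941, -1.7],
        "TP53": [0.295, -0.871, 0.0213],
    },
    index=pd.Index(["C3L-00006", "C3L-00032", "C3L-00413"], name="Patient_ID"),
)

# A single row label and a single column label return one scalar value.
pten_value = toy.loc["C3L-00032", "PTEN"]

# A bare colon selects every row, so this keeps all samples but only two proteins.
two_proteins = toy.loc[:, ["A1BG", "TP53"]]

# Lists on both axes return a DataFrame in the order the labels were given.
reordered = toy.loc[["C3L-00413", "C3L-00006"], ["TP53", "A1BG"]]
print(pten_value, two_proteins.shape, reordered.index.tolist())
```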


## How can I search using conditional statements?

There are several ways to filter a dataframe with boolean conditions. A common one is to pass `.loc` a boolean expression that is true for the rows you want. For example, to see all the data for samples with a positive expression level for the A1BG protein, pass `.loc` a condition that selects rows whose value in the A1BG column is above zero.

`.loc` can do much more than this. For a full list of its features, see the pandas documentation for [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) and [indexing and slicing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).

In [11]:
a1bg_positive = proteomics.loc[proteomics["A1BG"] > 0]
a1bg_positive.head()

Patient_ID,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
C3L-00413,0.15,0.987,5.07,0.302,-0.169,0.398,,-0.274,1.09,0.147,...,0.348,-0.484,0.142,0.25,-0.142,0.077,-0.157,-0.872,-0.00663,-0.0175
C3L-00449,0.181,-0.283,0.689,0.104,-0.13,-0.0683,-0.376,-0.0838,0.16,0.277,...,0.228,,0.0206,-0.97,,-0.452,0.172,0.626,0.342,-0.203
C3L-00770,0.0499,0.0199,-0.125,0.477,-0.252,0.838,-0.1,-0.851,0.351,-0.582,...,0.0491,-0.479,0.0548,0.933,0.00967,-1.14,-0.541,-0.838,0.0238,-0.0602
C3L-00771,0.773,0.546,-0.603,,0.0555,-0.639,-0.39,0.467,-0.366,0.427,...,-0.00835,0.119,-0.296,-0.205,-0.0765,0.228,0.237,1.05,0.0185,0.587
C3L-00780,0.551,0.203,,,-0.493,0.211,1.06,-0.28,0.207,0.334,...,0.3,,0.142,-0.549,,-0.39,-0.0972,0.00108,0.2,-0.194
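
Conditions can also be combined or negated. The sketch below assumes a toy dataframe (the name `toy` is hypothetical) with A1BG and A4GALT values copied from rows shown in this tutorial, including one missing A4GALT value, to show how missing data behaves in a comparison.

```python
import numpy as np
import pandas as pd

# Toy stand-in; C3L-00771 has no A4GALT measurement, as in the output above.
toy = pd.DataFrame(
    {
        "A1BG": [-1.18, 0.15, 0.181, 0.773],
        "A4GALT": [0.222, 0.302, 0.104, np.nan],
    },
    index=pd.Index(
        ["C3L-00006", "C3L-00413", "C3L-00449", "C3L-00771"], name="Patient_ID"
    ),
)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
both_positive = toy.loc[(toy["A1BG"] > 0) & (toy["A4GALT"] > 0)]

# Comparisons with a missing value (NaN) are False, so C3L-00771 is excluded above.
# ~ negates a condition: here, rows where A1BG is not above zero.
not_positive = toy.loc[~(toy["A1BG"] > 0)]

print(both_positive.index.tolist())
print(not_positive.index.tolist())
```

Note that Python's `and` and `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used here.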


For another example, suppose we wanted to separate clinical information for samples that are serous vs. endometrioid, as recorded in the "Histologic_type" column.

In [12]:
clinical = en.get_clinical("awg")
endometrioid_clinical = clinical.loc[clinical["Histologic_type"] == "Endometrioid"]
serous_clinical = clinical.loc[clinical["Histologic_type"] == "Serous"]

In [13]:
endometrioid_clinical.head()

Patient_ID,Sample_ID,Sample_Tumor_Normal,Proteomics_Tumor_Normal,Country,Histologic_Grade_FIGO,Myometrial_invasion_Specify,Histologic_type,Treatment_naive,Tumor_purity,Path_Stage_Primary_Tumor-pT,...,Age,Diabetes,Race,Ethnicity,Gender,Tumor_Site,Tumor_Site_Other,Tumor_Focality,Tumor_Size_cm,Num_full_term_pregnancies
C3L-00006,S001,Tumor,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,64.0,No,White,Not-Hispanic or Latino,Female,Anterior endometrium,,Unifocal,2.9,1
C3L-00008,S002,Tumor,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,58.0,No,White,Not-Hispanic or Latino,Female,Posterior endometrium,,Unifocal,3.5,1
C3L-00032,S003,Tumor,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,50.0,Yes,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,4.5,4 or more
C3L-00090,S005,Tumor,Tumor,United States,FIGO grade 2,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,75.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,3.5,4 or more
C3L-00136,S007,Tumor,Tumor,United States,FIGO grade 1,under 50 %,Endometrioid,YES,Normal,pT1a (FIGO IA),...,50.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,4.5,3


In [14]:
serous_clinical.head()

Patient_ID,Sample_ID,Sample_Tumor_Normal,Proteomics_Tumor_Normal,Country,Histologic_Grade_FIGO,Myometrial_invasion_Specify,Histologic_type,Treatment_naive,Tumor_purity,Path_Stage_Primary_Tumor-pT,...,Age,Diabetes,Race,Ethnicity,Gender,Tumor_Site,Tumor_Site_Other,Tumor_Focality,Tumor_Size_cm,Num_full_term_pregnancies
C3L-00098,S006,Tumor,Tumor,United States,,under 50 %,Serous,YES,Normal,pT1a (FIGO IA),...,63.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Anterior and Posterior endometrium,Unifocal,6.0,2
C3L-00139,S009,Tumor,Tumor,United States,,50 % or more,Serous,YES,Normal,pT3a (FIGO IIIA),...,83.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Both anterior and posterior,Unifocal,4.0,4 or more
C3L-00358,S016,Tumor,Tumor,United States,,50 % or more,Serous,YES,Normal,pT1b (FIGO IB),...,90.0,No,White,Not-Hispanic or Latino,Female,"Other, specify",Both anterior and posterior endometrium,Unifocal,4.5,Unknown
C3L-00963,S041,Tumor,Tumor,Other_specify,,50 % or more,Serous,YES,Normal,pT1b (FIGO IB),...,59.0,Yes,White,Not reported,Female,"Other, specify",along anterior and posterior surface,Unifocal,2.6,1
C3L-01246,S042,Tumor,Tumor,Other_specify,,under 50 %,Serous,YES,Normal,pT1a (FIGO IA),...,62.0,No,White,Not reported,Female,Posterior endometrium,,Unifocal,2.3,1
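
When a column holds categories, a few other pandas tools may be handy alongside one filter per category. The sketch below is one possible approach, not part of the cptac API. It uses a hypothetical `toy_clinical` dataframe with only two columns, whose values are copied from the outputs above.

```python
import pandas as pd

# Toy stand-in for the clinical dataframe (assumption: only two columns kept).
toy_clinical = pd.DataFrame(
    {
        "Histologic_type": ["Endometrioid", "Endometrioid", "Serous", "Serous"],
        "Age": [64.0, 58.0, 63.0, 83.0],
    },
    index=pd.Index(
        ["C3L-00006", "C3L-00008", "C3L-00098", "C3L-00139"], name="Patient_ID"
    ),
)

# Count how many samples fall into each category before splitting.
counts = toy_clinical["Histologic_type"].value_counts()

# groupby splits on every category at once, so no per-category filter is needed.
by_type = {name: group for name, group in toy_clinical.groupby("Histologic_type")}

# isin selects rows matching any value in a list.
either = toy_clinical.loc[
    toy_clinical["Histologic_type"].isin(["Serous", "Endometrioid"])
]

print(counts.to_dict())
print(sorted(by_type))
```

Checking `value_counts()` first also reveals the exact spelling and capitalization of each category, which the `==` comparison depends on.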


## How do I export a dataframe to a file?

If you wish to export a dataframe to a file, call the dataframe's built-in `to_csv` method, specifying the path you wish to save to, and the separator you wish to use:

In [15]:
clinical = en.get_clinical('awg')
clinical.to_csv(path_or_buf="clinical_df.tsv", sep='\t')
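
To load such a file back into pandas later, `pandas.read_csv` with the same separator is one option. The sketch below round-trips a hypothetical toy dataframe through a temporary directory, so it does not touch your working directory; the temporary path is an assumption made for the example.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the clinical dataframe (values copied from the outputs above).
toy = pd.DataFrame(
    {"Histologic_type": ["Endometrioid", "Serous"], "Age": [64.0, 63.0]},
    index=pd.Index(["C3L-00006", "C3L-00098"], name="Patient_ID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clinical_df.tsv")
    toy.to_csv(path, sep="\t")
    # index_col=0 restores Patient_ID as the row index instead of a regular column.
    reloaded = pd.read_csv(path, sep="\t", index_col=0)

print(reloaded.index.name, reloaded.shape)
```

Without `index_col=0`, the patient IDs would come back as an ordinary column and the rows would be numbered from zero.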