# Download PanCancer data

First, we download the RSEM TPM gene expression data from https://toil.xenahubs.net/download/tcga_RSEM_gene_tpm.gz. The expression values are stored as log2(TPM + 0.001).

In [1]:
%%time
import pandas as pd
import numpy as np
df_gene_exp = pd.read_table("/tempory/transcriptomic_data/pan_cancer/tcga_RSEM_gene_tpm",
                            sep='\t', index_col=0).sort_index(axis='rows').sort_index(axis='columns')

CPU times: user 2min 17s, sys: 3.82 s, total: 2min 21s
Wall time: 2min 40s
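The path in the cell above has no `.gz` suffix, so it presumably points at a decompressed copy. As a minimal sketch, pandas can also read the gzip-compressed download directly, because `read_table` infers the compression from the file extension by default. The tiny file below is made up for illustration and only mimics the layout (genes as rows, samples as columns).

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the expression file: the first column holds the
# gene ID, the remaining columns are samples. All values are invented.
toy_tsv = (
    "sample\tTCGA-B\tTCGA-A\n"
    "ENSG2\t1.5\t-9.9658\n"
    "ENSG1\t0.25\t3.0\n"
)

tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_gene_tpm.gz")
with gzip.open(toy_path, "wt") as handle:
    handle.write(toy_tsv)

# compression="infer" is the default, so the .gz suffix is enough.
toy_exp = (
    pd.read_table(toy_path, index_col=0)
    .sort_index(axis="index")
    .sort_index(axis="columns")
)
print(toy_exp)
```

Reading the compressed file trades some CPU time for disk space; whether it is faster overall depends on the storage involved.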


In [2]:
print("Genes={}; Samples={};".format(*df_gene_exp.shape))

Genes=60498; Samples=10535;


In [3]:
df_gene_exp.head()

sample,TCGA-02-0047-01,TCGA-02-0055-01,TCGA-02-2483-01,TCGA-02-2485-01,TCGA-04-1331-01,TCGA-04-1332-01,TCGA-04-1337-01,TCGA-04-1338-01,TCGA-04-1341-01,TCGA-04-1343-01,...,TCGA-ZR-A9CJ-01,TCGA-ZS-A9CD-01,TCGA-ZS-A9CE-01,TCGA-ZS-A9CF-01,TCGA-ZS-A9CF-02,TCGA-ZS-A9CG-01,TCGA-ZT-A8OM-01,TCGA-ZU-A8S4-01,TCGA-ZU-A8S4-11,TCGA-ZX-AA5X-01
ENSG00000000003.14,5.4712,5.1498,5.6448,6.1709,5.7911,4.1907,4.3463,3.9856,4.0251,4.2921,...,4.4784,6.033,5.4845,5.1363,5.1583,7.1371,1.5998,4.656,5.323,4.8115
ENSG00000000005.5,-3.1714,4.1652,-5.5735,-3.1714,-2.6349,-2.3147,-5.0116,-5.5735,-4.2934,0.9115,...,-5.0116,-9.9658,-9.9658,-5.5735,-4.6082,-0.2671,-1.1172,-9.9658,-9.9658,-3.1714
ENSG00000000419.12,4.6753,6.0251,5.8263,5.1768,5.7963,4.3169,6.8252,5.243,4.9031,6.5546,...,6.7702,5.067,4.6611,4.5261,4.6317,4.8798,2.8321,5.5874,4.0037,5.2192
ENSG00000000457.13,2.0742,2.1013,1.9564,2.4198,2.1988,0.8246,1.1641,1.5013,0.5955,0.3685,...,2.1988,1.8762,2.128,3.0428,3.5473,2.1313,-0.6873,1.787,0.9642,2.5061
ENSG00000000460.16,2.2573,2.4571,2.5036,3.0995,2.8442,1.4281,1.0007,1.4174,0.7407,0.9419,...,3.0498,0.044,0.2522,1.8036,2.4623,3.0825,2.1444,2.6208,0.5955,2.6624
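Because the values are log2(TPM + 0.001), a gene with zero TPM appears as log2(0.001), roughly -9.9658, which is the floor visible in the table above. A minimal sketch of converting a few of the displayed values back to TPM follows; the helper name `log2_to_tpm` is made up for illustration.

```python
import numpy as np
import pandas as pd

PSEUDOCOUNT = 0.001  # offset stated for this dataset: log2(TPM + 0.001)


def log2_to_tpm(log_values):
    """Invert log2(TPM + 0.001); tiny negatives caused by rounding become 0."""
    return np.maximum(np.power(2.0, log_values) - PSEUDOCOUNT, 0.0)


# Three values for ENSG00000000005.5, copied from the table above.
logged = pd.Series(
    [-9.9658, -3.1714, 4.1652],
    index=["TCGA-ZS-A9CD-01", "TCGA-02-0047-01", "TCGA-02-0055-01"],
)
tpm = log2_to_tpm(logged)
print(tpm.round(4))
```

The clipping step matters only at the floor, where the four-decimal rounding of the stored value would otherwise give a slightly negative TPM.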


Next, we check whether the data frame contains any missing values (NA). If it does, we can either drop the affected rows (`dropna` method) or fill the gaps with an imputation method:

In [4]:
df_gene_exp.isnull().values.any()

False
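No missing values were found here, so nothing needs to change. For reference, a minimal sketch of the two options on a made-up frame is shown below. Gene-wise median imputation is just one possible choice for illustration, not something this analysis used.

```python
import numpy as np
import pandas as pd

# Invented toy data: genes as rows, samples as columns, two missing values.
toy = pd.DataFrame(
    {"S1": [1.0, np.nan, 3.0], "S2": [2.0, 5.0, 4.0], "S3": [3.0, 7.0, np.nan]},
    index=["geneA", "geneB", "geneC"],
)

# Option 1: drop every gene (row) that has at least one missing value.
dropped = toy.dropna(axis="index", how="any")

# Option 2: fill each missing value with that gene's median across samples.
row_medians = toy.median(axis="columns")
imputed = toy.apply(lambda col: col.fillna(row_medians))

print(dropped)
print(imputed)
```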

# Data exploration

We now explore the sample types (tumor or normal), and then some of the clinical information associated with the samples.

## Tumor-Normal binary variable

We first load a dataset that contains information about the PanCancer sample types and diseases, downloaded from https://pancanatlas.xenahubs.net/download/TCGA_phenotype_denseDataOnlyDownload.tsv.gz

In [5]:
df_pancan_sample = pd.read_table("/tempory/transcriptomic_data/pan_cancer/TCGA_phenotype_denseDataOnlyDownload.tsv", 
                                 index_col=0).sort_index(axis='rows')

In [6]:
print(df_pancan_sample.shape)

df_pancan_sample.head()

(12804, 3)


sample,sample_type_id,sample_type,_primary_disease
TCGA-01-0628-11,11.0,Solid Tissue Normal,ovarian serous cystadenocarcinoma
TCGA-01-0629-11,,,ovarian serous cystadenocarcinoma
TCGA-01-0630-11,11.0,Solid Tissue Normal,ovarian serous cystadenocarcinoma
TCGA-01-0631-11,11.0,Solid Tissue Normal,ovarian serous cystadenocarcinoma
TCGA-01-0633-11,11.0,Solid Tissue Normal,ovarian serous cystadenocarcinoma


We check that there are no duplicated samples:

In [7]:
pancan_sample = df_pancan_sample.index
pancan_sample.duplicated().any()

False

We select the PanCancer samples present in both the expression and the sample type datasets:

In [8]:
pancan_sample_common = df_gene_exp.columns.intersection(pancan_sample)
len(pancan_sample_common)

10534

In [9]:
# One sample from the expression dataset is missing from the sample type dataset.
# A web search for its identifier returns mostly "NOT SPECIFIED" fields.
df_gene_exp.columns.difference(pancan_sample_common)

Index(['TCGA-07-0249-20'], dtype='object')
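In the phenotype table shown earlier, barcodes ending in `-11` carry `sample_type_id` 11.0, which suggests that the last barcode field encodes the sample type. A minimal sketch, under that assumption, of pulling this code out of the barcode so that unmatched samples such as the one above can be inspected (the helper name `sample_type_code` is hypothetical):

```python
import pandas as pd


def sample_type_code(barcodes):
    """Return the last dash-separated field of each barcode as an integer."""
    as_series = pd.Series(list(barcodes), index=list(barcodes))
    return as_series.str.rsplit("-", n=1).str[-1].astype(int)


barcodes = pd.Index(["TCGA-01-0628-11", "TCGA-02-0047-01", "TCGA-07-0249-20"])
codes = sample_type_code(barcodes)
print(codes)
```

The unmatched sample has a code that does not appear among the rows displayed from the phenotype table, which may explain why it has no entry there.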

In [10]:
df_pancan_sample = df_pancan_sample.loc[pancan_sample_common]
df_pancan_sample.shape

(10534, 3)

In [11]:
# Check NAs
df_pancan_sample.isnull().any()

sample_type_id      False
sample_type         False
_primary_disease    False
dtype: bool

In [12]:
# Sample type variable
df_pancan_sample.sample_type.value_counts(normalize=False)

Primary Tumor                                      9185
Solid Tissue Normal                                 727
Metastatic                                          392
Primary Blood Derived Cancer - Peripheral Blood     173
Recurrent Tumor                                      45
Additional - New Primary                             11
Additional Metastatic                                 1
Name: sample_type, dtype: int64

We create a tumor/normal binary variable using the sample type, with no NA values in the column:

In [13]:
# for binary classification
df_pancan_sample["tumor_normal"] = df_pancan_sample.apply(
    lambda row: "Normal" if row["sample_type"] == "Solid Tissue Normal" else "Tumor", axis=1)

In [14]:
# Tumor/Normal variable
df_pancan_sample.tumor_normal.value_counts(normalize=False)

Tumor     9807
Normal     727
Name: tumor_normal, dtype: int64
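The row-wise `apply` above works, but it calls a Python function once per sample. A vectorized equivalent with `numpy.where` is sketched below on a made-up frame, together with a check that both versions agree.

```python
import numpy as np
import pandas as pd

# Invented toy data with the same column name as the phenotype table.
toy_samples = pd.DataFrame(
    {"sample_type": ["Primary Tumor", "Solid Tissue Normal", "Metastatic",
                     "Solid Tissue Normal", "Recurrent Tumor"]},
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Row-wise version, as in the cell above.
by_apply = toy_samples.apply(
    lambda row: "Normal" if row["sample_type"] == "Solid Tissue Normal" else "Tumor",
    axis=1,
)

# Vectorized version: one comparison over the whole column.
is_normal = toy_samples["sample_type"] == "Solid Tissue Normal"
toy_samples["tumor_normal"] = np.where(is_normal, "Normal", "Tumor")

print(toy_samples["tumor_normal"].value_counts())
print((toy_samples["tumor_normal"] == by_apply).all())
```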

## Clinical variables

We then load a second dataset that contains clinical information about the PanCancer samples, downloaded from https://pancanatlas.xenahubs.net/download/Survival_SupplementalTable_S1_20171025_xena_sp.gz

In [15]:
df_pancan_clinical = pd.read_table("/tempory/transcriptomic_data/pan_cancer/Survival_SupplementalTable_S1_20171025_xena_sp", 
                                   index_col=0).sort_index(axis='rows')

In [16]:
print(df_pancan_clinical.shape)

df_pancan_clinical.head()

(12591, 33)


sample,_PATIENT,cancer type abbreviation,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,initial_pathologic_dx_year,...,residual_tumor,OS,OS.time,DSS,DSS.time,DFI,DFI.time,PFI,PFI.time,Redaction
TCGA-02-0001-01,TCGA-02-0001,GBM,44.0,FEMALE,WHITE,,,Untreated primary (de novo) GBM,,2002.0,...,,1.0,358.0,1.0,358.0,,,1.0,137.0,
TCGA-02-0003-01,TCGA-02-0003,GBM,50.0,MALE,WHITE,,,Untreated primary (de novo) GBM,,2003.0,...,,1.0,144.0,1.0,144.0,,,1.0,40.0,
TCGA-02-0006-01,TCGA-02-0006,GBM,56.0,FEMALE,WHITE,,,Untreated primary (de novo) GBM,,2002.0,...,,1.0,558.0,1.0,558.0,,,1.0,302.0,
TCGA-02-0007-01,TCGA-02-0007,GBM,40.0,FEMALE,WHITE,,,Treated primary GBM,,2002.0,...,,1.0,705.0,1.0,705.0,,,1.0,518.0,
TCGA-02-0009-01,TCGA-02-0009,GBM,61.0,FEMALE,WHITE,,,Untreated primary (de novo) GBM,,2003.0,...,,1.0,322.0,1.0,322.0,,,1.0,264.0,


We check that there are no duplicated samples:

In [17]:
pancan_clinical = df_pancan_clinical.index
pancan_clinical.duplicated().any()

False

We select the PanCancer samples contained both in the expression and clinical datasets:

In [18]:
pancan_clinical_common = df_gene_exp.columns.intersection(pancan_clinical)
len(pancan_clinical_common)

10496

In [20]:
df_pancan_clinical = df_pancan_clinical.loc[pancan_clinical_common]
df_pancan_clinical.shape

(10496, 33)

In [21]:
# Overall survival
variable = "OS"
print("Number of samples with this information:",
      sum(df_pancan_clinical[variable].value_counts(normalize=False)))

df_pancan_clinical[variable].value_counts(normalize=True)

Number of samples with this information: 10489


0.0    0.687196
1.0    0.312804
Name: OS, dtype: float64

In [22]:
# Progression-free interval
variable = "PFI"
print("Number of samples with this information:",
      sum(df_pancan_clinical[variable].value_counts(normalize=False)))

df_pancan_clinical[variable].value_counts(normalize=True)

Number of samples with this information: 10316


0.0    0.653742
1.0    0.346258
Name: PFI, dtype: float64

In [23]:
# Disease-specific survival
variable = "DSS"
print("Number of samples with this information:",
      sum(df_pancan_clinical[variable].value_counts(normalize=False)))

df_pancan_clinical[variable].value_counts(normalize=True)

Number of samples with this information: 10013


0.0    0.785978
1.0    0.214022
Name: DSS, dtype: float64

In [24]:
# Disease-free interval
variable = "DFI"
print("Number of samples with this information:",
      sum(df_pancan_clinical[variable].value_counts(normalize=False)))

df_pancan_clinical[variable].value_counts(normalize=True)

Number of samples with this information: 5335


0.0    0.797751
1.0    0.202249
Name: DFI, dtype: float64
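The four cells above repeat the same pattern for each endpoint. A compact sketch that collects the same two numbers (how many samples have a value, and the fraction coded 1) into one table is shown below on made-up data. The helper name `summarize_endpoints` is hypothetical, and reading 1 as "event occurred" is an assumption about the coding.

```python
import numpy as np
import pandas as pd


def summarize_endpoints(clinical, endpoints):
    """Count non-missing values and the fraction coded 1 for each endpoint."""
    rows = {}
    for name in endpoints:
        observed = clinical[name].dropna()
        rows[name] = {
            "n_available": len(observed),
            "event_fraction": (observed == 1).mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Invented toy data using the same column names as the clinical table.
toy_clinical = pd.DataFrame({
    "OS": [0.0, 1.0, 0.0, 1.0],
    "PFI": [1.0, np.nan, 0.0, 0.0],
    "DFI": [np.nan, np.nan, 0.0, 1.0],
})
summary = summarize_endpoints(toy_clinical, ["OS", "PFI", "DFI"])
print(summary)
```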

# Export

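We write the PanCancer gene expression, sample type and clinical datasets into a single HDF5 file. The expression matrix is transposed so that samples are rows and genes are columns, the layout usually expected for machine learning: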

In [27]:
%%time
# Export h5 format file: create an HDF5 file with three datasets (contained in the root group, the file object)
with pd.HDFStore("/tempory/transcriptomic_data/pan_cancer/pancan.h5", "w") as store:
    store["expression"] = df_gene_exp.transpose()
    store["sample_type"] = df_pancan_sample
    store["sample_clinical"] = df_pancan_clinical

CPU times: user 1.16 s, sys: 2.63 s, total: 3.79 s
Wall time: 11.5 s


your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['_PATIENT', 'cancer type abbreviation', 'gender', 'race',
       'ajcc_pathologic_tumor_stage', 'clinical_stage', 'histological_type',
       'histological_grade', 'menopause_status', 'vital_status',
       'tumor_status', 'cause_of_death', 'new_tumor_event_type',
       'new_tumor_event_site', 'new_tumor_event_site_other',
       'treatment_outcome_first_course', 'margin_status', 'residual_tumor',
       'Redaction'],
      dtype='object')]

  exec(code, glob, local_ns)
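The warning above is not an error: the file is written, but the listed object columns are pickled. It mentions `inferred_type->mixed`, which is what `pandas.api.types.infer_dtype` reports (with `skipna=False`) for a column holding both strings and missing values. A minimal sketch for spotting such columns before export, on made-up data:

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Invented toy data; the text columns are forced to object dtype on purpose.
toy_clinical = pd.DataFrame({
    "gender": pd.Series(["FEMALE", "MALE", "FEMALE"], dtype=object),
    "clinical_stage": pd.Series(["Stage I", np.nan, "Stage II"], dtype=object),
    "OS": [1.0, 0.0, np.nan],
})

# Keep only object columns, then ask pandas what it infers for each one.
object_columns = [col for col in toy_clinical.columns
                  if toy_clinical[col].dtype == object]
inferred = {col: infer_dtype(toy_clinical[col], skipna=False)
            for col in object_columns}
mixed_columns = [col for col, kind in inferred.items() if kind == "mixed"]

print(inferred)
print(mixed_columns)
```

How to treat the flagged columns (for example filling the gaps with a placeholder string) is a separate decision that depends on how the clinical variables will be used later.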
