# Project 2 - The Cancer Genome Atlas (TCGA) Data Analysis

Notebook version: `25.3` (please don't change)

**IMPORTANT: Before you do anything, save a copy of this notebook to your own Google Drive using the `File -> Save a copy to Drive` button in the menu. Otherwise you cannot save your changes. Once you've saved a copy to your own drive, it is available there just like a regular Google Docs file, and it is saved automatically.**

The Cancer Genome Atlas (TCGA) is an international endeavor to catalogue genomic and genetic mutations in a variety of cancer tissues. It is generally believed that gathering such information from a large number of patients will improve our ability to diagnose, treat, and prevent cancer through a better understanding of the genomic variation introduced by cancer. Bioinformatic analysis of the data is paramount to arriving at such an understanding. In this project, you have the opportunity to contribute to this venture. You have access to processed TCGA data from $9,648$ patients with different forms of cancer.

You are given the `clinical.csv` file, which contains many different types of information about each patient. For example, the field `cancer_type` contains the cancer type, `drug_received_treatment` denotes whether a patient has received a drug treatment, and `vital_status` denotes whether a patient was still alive during follow-up. Please note that some of these variables are available for only a subset of the patients. An example is the `her2_immunohistochemistry_level_result` column, containing the HER2 score (0, 1+, 2+, or 3+), where 3+ denotes HER2-positive. This score is only available for breast cancer patients. For all patients for whom a variable was not measured, the value is set to "NaN" (`np.nan` in Python). See https://docs.cancergenomicscloud.org/docs/tcga-metadata for a rough overview of the metadata categories.
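One way to work with a partially measured variable is to restrict the analysis to the patients for whom it is available. The following is a minimal sketch on a small made-up table: the patient IDs and values are purely illustrative and are not taken from `clinical.csv`; only the column names follow the description above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the clinical table; all values are made up for illustration.
toy_clinical = pd.DataFrame(
    {
        "cancer_type": ["BRCA", "BRCA", "LUAD", "THCA"],
        "her2_immunohistochemistry_level_result": ["3+", "1+", np.nan, np.nan],
    },
    index=["P1", "P2", "P3", "P4"],
)

her2 = toy_clinical["her2_immunohistochemistry_level_result"]

# Keep only the patients for whom the HER2 score was measured (not NaN).
measured = toy_clinical[her2.notna()]

# Among those, select the HER2-positive patients (score 3+).
her2_positive = measured[measured["her2_immunohistochemistry_level_result"] == "3+"]

print(len(measured), "patients with a HER2 score")
print(list(her2_positive.index))
```

Note that comparing a missing value with `==` yields `False`, so unmeasured patients are silently treated as "not 3+" unless they are filtered out first.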

For each patient, you further have access to the following data:

- `GE` - **Gene expression data**: mRNA expression of each gene, measured by RNAseq. The data was normalized to one million counts per sample (CPM) to account for different sequencing depths per sample, and then log-transformed. The data was not standardized (i.e. the mean expression of each gene is not zero), so think carefully about whether your analysis requires this.

- `ME` - **DNA Methylation data**: Methylation of each gene, represented as beta-values, which are continuous values between $0$ and $1$, representing the ratio of intensities between methylated ($1$) and unmethylated ($0$) sites.

- `CN` - **Copy-Number Variation data**: Copy-number of each gene.

- `MIR` - **microRNA expression data**: Expression of each microRNA, measured by RNAseq. Just like the gene expression data, this data is normalized to one million counts per sample (CPM), and then log-transformed, but not standardized.
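If an analysis does require standardized expression values, one common option is to z-score each gene across samples. The sketch below uses a made-up log-CPM matrix (the gene and sample names are hypothetical) and is only one possible choice; whether standardization is appropriate depends on the method you apply.

```python
import numpy as np
import pandas as pd

# Toy log-CPM matrix: rows are samples, columns are genes (made-up values).
toy_expr = pd.DataFrame(
    {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [0.5, 0.5, 2.0]},
    index=["S1", "S2", "S3"],
)

# Z-score each gene (column): subtract its mean, divide by its standard deviation.
# ddof=0 uses the population standard deviation.
standardized = (toy_expr - toy_expr.mean(axis=0)) / toy_expr.std(axis=0, ddof=0)

print(standardized.round(3))
```

After this step every gene has mean zero and unit variance across samples. A gene with identical values in all samples has a standard deviation of zero and would produce NaN values, so such genes may need to be removed beforehand.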

To link the data from these files to patients, you can use the `patient_id` column in each data type's dataframe, which corresponds to the index of the clinical dataframe (use `clinical.index`, or `clinical['patient_id']` to access it).
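As an illustration of this linking step, the sketch below attaches a clinical variable to each sample by looking up its `patient_id` in the clinical index. The two small tables are made-up stand-ins (hypothetical IDs and values); only the column and index names mirror the real dataframes.

```python
import pandas as pd

# Toy stand-ins (made-up values); the real tables are much larger.
toy_clinical = pd.DataFrame(
    {"cancer_type": ["BRCA", "LUAD", "THCA"]},
    index=pd.Index(["P1", "P2", "P3"], name="bcr_patient_barcode"),
)
toy_expr = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P3"],  # a patient can have several samples
        "GENE_A": [1.2, 0.9, 1.7],
    },
    index=pd.Index(["P1-01A", "P1-11A", "P3-01A"], name="sample_id"),
)

# Match the patient_id column of the sample table against the clinical index
# and attach the cancer type to every sample.
toy_expr_annotated = toy_expr.join(toy_clinical["cancer_type"], on="patient_id")

print(toy_expr_annotated)
```

The result keeps one row per sample, so a patient with several samples appears several times, each time with the same clinical annotation.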

Please note that for some patients there is additional data on healthy tissue of that patient. This can be identified by the `sample_type` column in the corresponding dataframe.
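A sketch of how tumor and healthy samples could be separated using `sample_type` is given below. The table is made up, and the label `"Solid Tissue Normal"` for healthy tissue is an assumption; check the labels that actually occur in your data (for example with `value_counts()`) before filtering.

```python
import pandas as pd

# Toy stand-in for one of the data tables (made-up values).
toy_expr = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2"],
        "sample_type": ["Primary Tumor", "Solid Tissue Normal", "Primary Tumor"],
        "GENE_A": [1.2, 0.9, 1.7],
    },
    index=["P1-01A", "P1-11A", "P2-01A"],
)

# Inspect which sample types occur before filtering on them.
print(toy_expr["sample_type"].value_counts())

# ASSUMPTION: healthy tissue is labelled "Solid Tissue Normal" in this toy table.
is_normal = toy_expr["sample_type"] == "Solid Tissue Normal"
normal_samples = toy_expr[is_normal]
tumor_samples = toy_expr[~is_normal]

# Patients for whom both a healthy and a tumor sample are available.
paired_patients = sorted(
    set(normal_samples["patient_id"]) & set(tumor_samples["patient_id"])
)
print(paired_patients)
```

Keeping healthy samples in the same matrix as tumor samples can distort analyses such as clustering, so it is worth deciding explicitly whether to include them.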

More information about TCGA can be found on their website: https://cancergenome.nih.gov/, or in the paper: Taskesen et al. Pan-cancer subtyping in a 2D-map shows substructures that are driven by specific combinations of molecular characteristics. Scientific Reports, 6:24949, 2016. doi: 10.1038/srep24949. (also available on BrightSpace)

<br>

---
<br>


> To contribute to the quest for solving cancer, you are asked to analyze this data, which also means that you should think of meaningful and interesting questions that can be answered using the provided data (these are not known beforehand!). Make use of the techniques you have learned in modules 2, 3 and 4.
>
> The results should be summarized in a poster. Make sure that you: motivate your research question(s) and the choices that you made during the analyses (aim of the performed analysis, type of algorithm, parameter settings etc.); explain and discuss your findings; explain what is represented in figures (what is on the axes etc.).

---

**Hint**: So far you've made your plots with `matplotlib.pyplot`, which is excellent for basic plots, but if you need other types of plots, you may want to look at the `seaborn` library. It offers many different types of visualizations (see some examples [here](https://seaborn.pydata.org/examples/index.html)), and it works well together with pandas.

In [8]:
!mkdir -p /data
!wget -nc -O "/data/clinical.csv" https://surfdrive.surf.nl/files/index.php/s/653xXM13mXQFhnR/download
!wget -nc -O "/data/cnv.pkl" https://surfdrive.surf.nl/files/index.php/s/Gkn21dal4o2mNhd/download
!wget -nc -O "/data/expression.pkl" https://surfdrive.surf.nl/files/index.php/s/OCi3ZI2clscbqIs/download
!wget -nc -O "/data/meth.pkl" https://surfdrive.surf.nl/files/index.php/s/6uzoxlHVVCjHyM1/download
!wget -nc -O "/data/mirna.pkl" https://surfdrive.surf.nl/files/index.php/s/CCtSonICb3O0ByR/download

File ‘/data/clinical.csv’ already there; not retrieving.
File ‘/data/cnv.pkl’ already there; not retrieving.
File ‘/data/expression.pkl’ already there; not retrieving.
File ‘/data/meth.pkl’ already there; not retrieving.
File ‘/data/mirna.pkl’ already there; not retrieving.


In [12]:
import pandas as pd
import pickle

with open("/data/cnv.pkl", "rb") as f:
  CN = pickle.load(f)

with open("/data/expression.pkl", "rb") as f:
  GE = pickle.load(f)

with open("/data/meth.pkl", "rb") as f:
  ME = pickle.load(f)

with open("/data/mirna.pkl", "rb") as f:
  MIR = pickle.load(f)

clinical = pd.read_csv("/data/clinical.csv", index_col=0)


In [18]:
# Just like in the first project, everything is stored in Pandas dataframes:
display(clinical.head())

bcr_patient_barcode,age_at_initial_pathologic_diagnosis,alcohol_history_documented,anatomic_neoplasm_subdivision,axillary_lymph_node_stage_method_type,bcr_patient_uuid,blood_relative_cancer_history_cancertype,blood_relative_cancer_history_relation,breast_carcinoma_estrogen_receptor_status,breast_carcinoma_progesterone_receptor_status,breast_carcinoma_surgical_procedure_name,...,tissue_source_site,tobacco_smoking_history,tumor_tissue_site,venous_invasion,vital_status,weight,white_cell_count_result,year_of_form_completion,year_of_initial_pathologic_diagnosis,year_of_tobacco_smoking_onset
TCGA-85-6561,66.0,,R-Upper,,46f35964-bd43-47e6-83fe-40da33828c94,,,,,,...,85,4.0,Lung,,Alive,,,2011.0,2011.0,1975.0
TCGA-A2-A0CT,71.0,,,Sentinel node biopsy alone,378778d2-b331-4867-a93b-c64028c8b4c7,,,Positive,Negative,Simple Mastectomy,...,A2,,Breast,,Alive,,,2010.0,2005.0,
TCGA-KS-A41J,28.0,,,,23AD7900-4E99-4ABF-82FF-71EDEAA8A51C,,,,,,...,KS,,Thyroid,,Alive,,,2012.0,2000.0,
TCGA-CJ-4916,69.0,,,,4519a839-11ea-4628-b5a7-071833ad16de,,,,,,...,CJ,,Kidney,,Alive,,,2011.0,2007.0,
TCGA-BJ-A45H,45.0,,,,778A8F53-53F6-4A03-8CE1-B0928539A444,,,,,,...,BJ,,Thyroid,,Alive,,,2013.0,2012.0,


In [23]:
import pandas as pd

# Assuming the clinical dataset is already loaded as 'clinical'
cancer_type_column = "cancer_type"  # Adjust this if the column name is different

# Count occurrences of each cancer type
cancer_type_counts = clinical[cancer_type_column].value_counts()

total_patients = cancer_type_counts.sum()

# Display the results
print("Number of unique cancer types:", cancer_type_counts.shape[0])
print("\nCancer Type Distribution:")
print(cancer_type_counts)
print("\nTotal Number of patients:", total_patients)


Number of unique cancer types: 32

Cancer Type Distribution:
cancer_type
BRCA    1037
UCEC     507
KIRC     499
HNSC     497
LUAD     494
LGG      493
THCA     490
PRAD     472
LUSC     459
SKCM     437
COAD     408
STAD     400
OV       397
BLCA     381
LIHC     359
CESC     291
KIRP     281
SARC     247
ESCA     179
PCPG     172
PAAD     172
READ     148
LAML     137
TGCT     133
THYM     119
MESO      87
UVM       77
ACC       76
KICH      65
UCS       53
DLBC      45
CHOL      36
Name: count, dtype: int64

Total Number of patients: 9648


In [21]:
# Cancer types of interest, selected from the paper.
# Note: this list contains 18 entries, not 19.
cancer_types_of_interest = [
    "ACC", "BLCA", "CESC", "HNSC", "LUSC", "LAML", "COAD", "PAAD",
    "READ", "DLBC", "BRCA", "KICH", "KIRP", "LGG", "LIHC", "LUAD",
    "OV", "PRAD"
]

# Filter dataset to include only the selected cancer types
filtered_clinical = clinical[clinical["cancer_type"].isin(cancer_types_of_interest)]

# Count patients per selected cancer type
cancer_counts = filtered_clinical["cancer_type"].value_counts()

# Total number of patients (rows of the clinical table, not genes) across the selected cancer types
total_patients = cancer_counts.sum()

# Display results
print("Cancer Type Counts in Filtered Dataset:")
print(cancer_counts)
print(f"\nTotal Number of Patients in These {len(cancer_types_of_interest)} Cancer Types:", total_patients)


Cancer Type Counts in Filtered Dataset:
cancer_type
BRCA    1037
HNSC     497
LUAD     494
LGG      493
PRAD     472
LUSC     459
COAD     408
OV       397
BLCA     381
LIHC     359
CESC     291
KIRP     281
PAAD     172
READ     148
LAML     137
ACC       76
KICH      65
DLBC      45
Name: count, dtype: int64

Total Number of Patients in These 18 Cancer Types: 6212


In [17]:
print("Copy number of each gene")
display(CN.head())
print("Gene expression data")
display(GE.head())
print("DNA methylation data")
display(ME.head())
print("mRNA expression of each microRNA, measured by RNAseq")
display(MIR.head())

Copy number of each gene


patient_id,patient_id,OR4G11P,OR4F5,AL627309.1,AL627309.3,CICP27,AL627309.6,AL627309.7,AL627309.2,AL627309.5,...,TMLHE-AS1,BX571846.1,TMLHE,SPRY3,AMD1P2,DPH3P2,VAMP7,ELOCP24,TRPC6P,IL9R
TCGA-D1-A15Z,TCGA-D1-A15Z,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
TCGA-AK-3433,TCGA-AK-3433,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
TCGA-HZ-7924,TCGA-HZ-7924,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
TCGA-66-2753,TCGA-66-2753,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
TCGA-61-1740,TCGA-61-1740,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0


Gene expression data


sample_id,patient_id,sample_type,TSPAN6,TNMD,DPM1,SCYL3,C1orf112,FGR,CFH,FUCA2,...,AL451106.1,AC092910.4,AC073611.1,AC136977.1,AC078856.1,AC008763.4,AL592295.6,AC006486.3,AL391628.1,AP006621.6
TCGA-HC-A48F-01A,TCGA-HC-A48F,Primary Tumor,1.864258,0.017685,1.532227,1.435547,0.722656,0.371094,1.524414,1.511719,...,0.0,0.0,0.309326,0.0,0.0,0.0,0.621094,0.0,0.05896,0.371094
TCGA-90-A4EE-01A,TCGA-90-A4EE,Primary Tumor,1.583008,0.009186,1.624023,1.432617,1.357422,1.676758,1.628906,1.603516,...,0.0,0.009186,0.459473,0.0,0.0,0.0,0.819824,0.0,0.099182,0.341797
TCGA-CJ-4639-01A,TCGA-CJ-4639,Primary Tumor,1.786133,0.236694,1.451172,1.250977,0.783203,1.227539,1.729492,1.658203,...,0.0,0.011932,0.18457,0.0,0.0,0.0,1.09375,0.0,0.056641,0.087402
TCGA-ET-A2N0-01A,TCGA-ET-A2N0,Primary Tumor,1.675781,0.007748,1.360352,0.749023,0.278809,0.865234,1.817383,1.727539,...,0.0,0.0,0.007748,0.0,0.0,0.0,0.612305,0.0,0.015358,0.139282
TCGA-D8-A1XO-01A,TCGA-D8-A1XO,Primary Tumor,1.788086,0.04007,1.571289,1.62793,1.357422,0.966797,1.833008,1.444336,...,0.0,0.0,0.137573,0.0,0.0,0.0,1.054688,0.0,0.081726,0.137573


DNA methylation data


sample_id,patient_id,sample_type,RBL2,IARS1,ATP2A1,ATP2A1-AS1,PGBD5,MAN1B1,TSEN34,NPHP4,...,ENSG00000253423,ENSG00000253430,KRT8P4,MIR7-1,HNRNPK-AS1,ENSG00000288841,ENSG00000273056,PGK1P1,SRSF6P1,RAC1P4
TCGA-04-1651-01A,TCGA-04-1651,Primary Tumor,0.011581,0.039246,0.0,0.936523,0.94043,0.008179,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TCGA-44-4112-01A,TCGA-44-4112,Primary Tumor,0.020615,0.06134,0.0,0.655762,0.946777,0.012222,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TCGA-F7-8489-01A,TCGA-F7-8489,Primary Tumor,0.11792,0.046844,0.150879,0.601074,0.32251,0.608398,0.077515,0.667969,...,0.048645,0.0,0.050537,0.163696,0.046326,0.026031,0.0,0.041229,0.0,0.0
TCGA-58-A46M-01A,TCGA-58-A46M,Primary Tumor,0.096558,0.046234,0.219727,0.556641,0.451416,0.597168,0.06897,0.659668,...,0.056763,0.0,0.042877,0.0,0.028915,0.035309,0.0,0.049896,0.0,0.0
TCGA-AB-2952-03A,TCGA-AB-2952,Primary Blood Derived Cancer - Peripheral Blood,0.15918,0.03717,0.100464,0.626953,0.389648,0.619141,0.060516,0.658691,...,0.059418,0.0,0.038269,0.06012,0.057312,0.030441,0.0,0.026505,0.0,0.0


mRNA expression of each microRNA, measured by RNAseq


sample_id,patient_id,sample_type,hsa-let-7a-1,hsa-let-7a-2,hsa-let-7a-3,hsa-let-7b,hsa-let-7c,hsa-let-7d,hsa-let-7e,hsa-let-7f-1,...,hsa-mir-941-5,hsa-mir-942,hsa-mir-943,hsa-mir-944,hsa-mir-95,hsa-mir-9500,hsa-mir-96,hsa-mir-98,hsa-mir-99a,hsa-mir-99b
TCGA-B8-5549-01A,TCGA-B8-5549,Primary Tumor,4.136719,4.136719,4.140625,3.958984,3.220703,2.533203,3.207031,4.007812,...,0.0,0.90332,0.0,0.0,0.498779,0.0,0.187134,1.665039,2.613281,4.609375
TCGA-J8-A42S-01A,TCGA-J8-A42S,Primary Tumor,4.15625,4.15625,4.160156,4.085938,3.818359,2.902344,3.302734,3.861328,...,0.0,0.481934,0.0,0.118164,1.196289,0.0,1.125977,1.923828,3.287109,4.5625
TCGA-EO-A3AS-01A,TCGA-EO-A3AS,Primary Tumor,4.164062,4.164062,4.160156,4.246094,2.285156,2.558594,3.238281,3.539062,...,0.0,0.788086,0.166504,0.380615,0.890625,0.0,1.233398,1.473633,1.922852,4.40625
TCGA-QK-A6IJ-01A,TCGA-QK-A6IJ,Primary Tumor,3.890625,3.894531,3.894531,4.121094,3.052734,2.685547,3.009766,3.537109,...,0.0,1.424805,0.0,2.722656,1.084961,0.0,1.385742,1.938477,2.298828,4.351562
TCGA-AJ-A3QS-01A,TCGA-AJ-A3QS,Primary Tumor,3.900391,3.898438,3.902344,4.128906,3.261719,2.701172,3.068359,3.447266,...,0.0,1.107422,0.0,0.173584,0.59668,0.0,1.164062,1.793945,3.210938,4.398438
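
As the previews above show, the expression, methylation and microRNA tables carry the annotation columns `patient_id` and `sample_type` next to the measurements, while the copy-number table has `patient_id` but no `sample_type` column. Most numerical methods expect a purely numeric matrix, so it can help to split the annotation from the measurements first. A minimal sketch on a made-up table (hypothetical IDs and values; only the column names mirror the real data):

```python
import pandas as pd

# Toy stand-in for the microRNA table (made-up values).
toy_mir = pd.DataFrame(
    {
        "patient_id": ["P1", "P2"],
        "sample_type": ["Primary Tumor", "Primary Tumor"],
        "hsa-let-7a-1": [4.1, 3.9],
        "hsa-mir-98": [1.7, 1.9],
    },
    index=pd.Index(["P1-01A", "P2-01A"], name="sample_id"),
)

# Annotation columns describing the sample rather than measuring something.
metadata_columns = ["patient_id", "sample_type"]

metadata = toy_mir[metadata_columns]
features = toy_mir.drop(columns=metadata_columns)

print(features.shape)
print(features.dtypes.unique())
```

Both parts keep the same `sample_id` index, so they can be re-aligned later, for example to color a plot of `features` by a column from `metadata`.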
