# Outline

- Introduction
    - The problem of disease subtype discovery from multi-omics data;
        - Multi-omics clustering methods
    - Prostate adenocarcinoma;
    - Objective
- Practical Approach:
    - Explanation of the multi-omics dataset used;
    - Importing of the libraries;
    - **Download of the Prostate adenocarcinoma dataset, considering three different omics data sources (mRNA, miRNA and protein expression data). The _TCGA_ code for the dataset is “PRAD”**;
    - Explanation of MultiAssayExperiment data structure;
    - **Pre-processing of the dataset following the same steps used during lessons. During the filtering by variance, select the first $100$ features having highest variance from each data source**;
    - **Download of the disease subtypes (column “Subtype\_Integrative” is the one containing the iCluster molecular subtypes). Note that not all subtypes are available for the set of samples having all the considered omics data sources, thus you need to retain from the multi-omics dataset only samples having an associated subtype**;
    - **Check that patients in multi-omics dataset and subtypes are in the same order**;
    - Digression about Similarity Network Fusion;
    - **Integration of the data using Similarity Network Fusion with the scaled exponential euclidean distance;**
    - **Integration of the similarity matrices from each data source (computed by scaled exponential euclidean distance) using a simple average of the matrices. This can be considered as a trivial multi-omics data integration strategy**;
    - Digression about NEMO;
    - **Integrate the dataset using another data fusion method called NEMO to obtain an integrated similarity matrix. NEMO implementation is available on github [https://github.com/Shamir-Lab/NEMO]**;
    - Digression about PAM;
    - **Perform disease subtype discovery (number of clusters equal to the number of disease subtypes found by iCluster) using the PAM algorithm on the following similarity matrices:**
        - **Similarity matrices obtained from single data sources (i.e. miRNA, mRNA, proteins) using the usual scaled exponential euclidean distance. Thus, you should obtain three different similarity matrices. To compute the corresponding distance matrix use this code: dist <- 1 - NetPreProc::Prob.norm(W). Prob.norm() function is in the NetPreProc CRAN package (https://cran.r-project.org/web/packages/NetPreProc/index.html). The idea is to normalize the similarity matrix before computing the corresponding distance**;
        - **Integrated matrix obtained using the average among matrices. Use dist <- 1 - NetPreProc::Prob.norm(W) to compute the distance matrix**;
        - **Integrated matrix obtained using Similarity Network Fusion**;
        - **Integrated matrix obtained using NEMO. Use dist <- 1 - NetPreProc::Prob.norm(W) to compute the distance matrix.**
    - **NEMO provides the possibility of performing clustering using another approach called Spectral Clustering. Use the function nemo.clustering() to test this approach.**
    - Analysis based on iCluster disease subtypes;
    - **Comparison of the clusterings obtained by each considered approach w.r.t. the iCluster disease subtypes. Make tables and plots to show the results and discuss them.**

## The problem of disease subtype discovery from multi-omics data
Remarkable advancements in technology have facilitated the generation of diverse genome-wide high-throughput biological data types, collectively referred to as **omics**. The suffix *-omics* denotes a field of study that comprehensively analyzes a specific class of biological components on a large scale, such as genes (**genomics**), proteins (**proteomics**), or metabolites (**metabolomics**). The omic sciences aim to understand the complex interactions and functions of these components to gain insight into biological processes, exploring at the molecular level the networks and relationships that contribute to an organism's structure, function, and behavior.

The wealth of these omic profiles gathered from large cohorts in recent years presents a unique opportunity to gain a deeper understanding of human diseases. These profiles can serve as valuable resources for characterizing diseases more comprehensively, thus facilitating the development of personalized treatment strategies tailored to individual patients.

In the field of oncology, the analysis of extensive datasets has led to the identification of novel cancer subtypes, revolutionizing treatment decision-making.
However, these results are typically based on the analysis of a single omic rather than on a comprehensive analysis of multiple data sources. Since the molecular complexity of a tumor manifests itself at several omics levels, profiling at these multiple levels allows a better integrated characterization of tumor etiology.

Identifying tumor subtypes by simultaneously analyzing **multi-omic data** is a relatively new problem. In fact, since initiatives like **The Cancer Genome Atlas** (henceforth referred to as **TCGA**) have made multi-omic cohort data available, there has been a pressing need for improved and advanced methodologies that enable the integrated analysis of these datasets.
The simplest way to combine biological data is to concatenate normalized measurements from the various biological domains for each sample. However, concatenation further dilutes the already low signal-to-noise ratio in each data type. To avoid this, a common strategy was to analyze each data type independently before combining the results. In fact, for years the most widely used approach to subtype discovery across multiple data types was to cluster each type separately and then manually integrate the results. However, such independent analyses often led to inconsistent conclusions that were hard to integrate.

### Multi-omics clustering methods
There are several approaches to multi-omics clustering. The simplest one, called **early integration** (also named **concatenation-based**), acts on the input data at an early stage: it concatenates all omic matrices into a single matrix and applies single-omic clustering to the result. Some methods of this type probabilistically model the distribution of numeric, count and discrete features. The evident advantage of early methods lies in their ability to uncover both the individual information characterizing each source and the hidden relationships between sources. Another considerable advantage is that early methods solve the integration problem in the first stage, so that any unimodal analysis can be applied subsequently. Nevertheless, these methods suffer from the increased dimensionality of the concatenated data, and they ignore the different distributions of values in different omics.
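
As a rough illustration of early integration, the following base-R sketch concatenates two simulated omics and clusters the result. Everything here is an assumption made for the example: the data are synthetic, the variable names (`omic_a`, `omic_b`, `early`) are hypothetical, and k-means merely stands in for any single-omic clustering algorithm.

```r
# Toy early-integration sketch (base R only; all data are simulated).
set.seed(1)
n <- 20                                   # number of patients
groups <- rep(1:2, each = n / 2)          # hypothetical ground-truth subtypes

# Two simulated omics with very different scales and feature counts.
omic_a <- matrix(rnorm(n * 50, mean = groups * 2, sd = 1), nrow = n)
omic_b <- matrix(rnorm(n * 5, mean = groups * 200, sd = 100), nrow = n)

# Standardize each feature so that neither omic dominates because of its units,
# then concatenate the features column-wise.
early <- cbind(scale(omic_a), scale(omic_b))

# Any single-omic clustering algorithm can now be applied; k-means is used here.
fit <- kmeans(early, centers = 2, nstart = 10)

dim(early)                    # 20 patients x 55 concatenated features
table(fit$cluster, groups)    # clusters versus the simulated groups
```

Note how the concatenated matrix has as many columns as the sum of the features of all omics, which is the dimensionality issue mentioned above.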

Another approach, called **late integration** (also named **model-based**), clusters each omic separately and then integrates the clustering results at a late stage, for example using **consensus clustering**. This approach has the flaw of ignoring interactions that are weak but consistent across omics, thereby discarding an important piece of information. Together with the early integration approaches, these methods are classified as **model-agnostic**: they are independent of the specific algorithm applied in the unimodal analysis, which can therefore be tailored to the data type being processed.
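
A minimal sketch of late integration, under stated assumptions, could look as follows. The data are simulated, k-means is an arbitrary choice for the per-omic clustering, and the co-association matrix followed by average-linkage hierarchical clustering is only one simple form of consensus clustering, not a specific published method.

```r
# Toy late-integration sketch (base R only; all data are simulated).
set.seed(2)
n <- 20
groups <- rep(1:2, each = n / 2)          # hypothetical ground-truth subtypes
omics <- list(
  a = matrix(rnorm(n * 30, mean = groups * 2), nrow = n),
  b = matrix(rnorm(n * 10, mean = groups * 2), nrow = n)
)

# Step 1: cluster every omic on its own.
labels <- lapply(omics, function(x) kmeans(x, centers = 2, nstart = 10)$cluster)

# Step 2: co-association matrix, i.e. the fraction of omics in which
# two patients fall in the same cluster.
same_cluster <- lapply(labels, function(l) outer(l, l, "==") * 1)
coassoc <- Reduce(`+`, same_cluster) / length(labels)

# Step 3: cluster the consensus matrix (average-linkage hierarchical clustering).
consensus <- cutree(hclust(as.dist(1 - coassoc), method = "average"), k = 2)
table(consensus, groups)
```

Because only the cluster labels are passed to the integration step, any signal that is too weak to shape the clustering of a single omic is lost, which is the flaw described above.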

Finally, a further integrative clustering approach, which accounts for all omics, is **middle integration**. It allows joint inference from multi-omic data and generates a single integrated cluster assignment by simultaneously capturing patterns of genomic alterations that are consistent across multiple data types, specific to individual data types, or weak yet consistent across datasets, the last of which would emerge only by combining levels of evidence.
However, this data-integration method needs to overcome at least three **computational challenges**: the small number of samples compared to the large number of measurements; the differences in scale, collection bias and noise in each data set; and the complementary nature of the information provided by different types of data.
**Dimension reduction** is key to the feasibility and performance of these integrative clustering approaches. Methods that rely on pairwise correlation matrices are, in fact, computationally prohibitive with today’s high-resolution arrays.
Therefore, because of the high number of features and the complexity of dimension reduction algorithms, feature selection is required. Similarity-based methods handle these shortcomings by working with inter-patient similarities. These methods have improved runtime and are less reliant on feature selection.
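
To make the idea of inter-patient similarities concrete, here is a small base-R sketch. It uses a plain Gaussian kernel on Euclidean distances with a median-based bandwidth; both choices are assumptions made for illustration and differ from the scaled exponential kernel used by Similarity Network Fusion.

```r
# Toy inter-patient similarity sketch (base R only; simulated data).
set.seed(3)
n <- 10
x <- scale(matrix(rnorm(n * 200), nrow = n))   # 10 patients x 200 features

d <- as.matrix(dist(x))                 # pairwise Euclidean distances (n x n)
sigma <- median(d[upper.tri(d)])        # hypothetical bandwidth choice
w <- exp(-d^2 / (2 * sigma^2))          # Gaussian kernel similarity in (0, 1]

dim(w)                                  # n x n, regardless of the feature count
```

The similarity matrix has one row and one column per patient, so its size depends on the number of samples and not on the number of features.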

All middle integration methods for multi-omics clustering developed within the bioinformatics community assume full datasets, i.e., data from all omics were measured for each patient. However, in real experimental settings, often only a subset of the omics is measured for some patients. These datasets are called **partial datasets**. This phenomenon is already prevalent in existing multi-omic datasets and will increase as cohorts grow. Being able to analyze partial data is of paramount importance due to the high cost of experiments and the unequal cost of acquiring data for different omics. Naive solutions, such as using only those patients with all omics measured or **imputation** (replacing missing measurements with values estimated from the observed data), have obvious disadvantages: the former discards data, while the latter may introduce bias.

## Prostate adenocarcinoma
**Prostate cancer** is a cancer that affects the prostate gland. It is the second most common cancer among men and ranks fourth in frequency worldwide overall. A combination of genetic and demographic factors, such as age, family history, genetic susceptibility and race, contributes to its high incidence.

The clinical behavior of localized prostate cancer can vary widely, with some individuals having aggressive cancer that can spread and cause death, while others have indolent cancer that can be treated or observed safely.

To better predict the likelihood of progression and tailor treatment accordingly, **risk stratification systems** have been developed that take into account various clinical and pathological parameters. **Risk stratification** is the process of categorizing individuals or entities into different **risk levels** based on certain characteristics or factors in order to predict the likelihood of an event or outcome occurring and, therefore, risk stratification systems are tools employed to assign individuals or entities to specific risk categories. These systems aim to identify individuals at higher risk for aggressive disease and guide treatment decisions, taking into account factors such as **prostate-specific antigen** (**PSA**) levels, **Gleason score** (a measure of cancer aggressiveness based on biopsy samples), clinical stage, and other factors.

Despite these systems' usefulness, it is fundamental to keep in mind that they are not perfect, and there is still a need for improved risk stratification. This is where molecular features come into play. Molecular and genetic profiles are increasingly being used to subtype various cancer types and guide targeted treatment interventions.

Recent research has identified several genomic alterations as key features of primary prostate cancer, including **mutations** (changes in the DNA sequence, where one or more nucleotides are altered), **DNA copy-number changes** (changes in the number of copies of a specific DNA sequence or gene in a cell's genome), **rearrangements** (changes in the structure or arrangement of larger segments of DNA, such as genes or whole chromosomes), and **gene fusions** (the joining or fusion of two separate genes, resulting in the formation of a hybrid gene). The most common genomic alteration in prostate cancer is the fusion of **androgen-regulated promoters** (regions of DNA that control the expression of genes in response to androgen hormones, such as testosterone) with members of the **ETS family** of transcription factors such as ERG. The ETS family is a group of genes that encode proteins involved in regulating gene expression. These transcription factors control the activity of various genes, influencing important cellular processes like growth, differentiation, and development.

However, individuals with fusion-bearing tumors do not appear to have a different prognosis following treatment than those without.

Prostate cancers also have varying degrees of DNA copy-number alteration, with indolent and low-Gleason tumors having fewer alterations, while more aggressive tumors have a higher burden of copy-number alteration throughout the genome.

Further research on the molecular basis of prostate cancer and risk stratification could help identify those at higher risk of developing aggressive disease, leading to better treatment options and outcomes for patients. Therefore, there is a need to continue studying the molecular characteristics of prostate cancer to develop better risk stratification and treatment strategies.

## Objective
The goal of this project is to discover disease subtypes in the prostate adenocarcinoma dataset from The Cancer Genome Atlas using clustering techniques and to compare the results with those from TCGA, which used **iCluster**, an integrative clustering model on multi-omics data.

# Practical approach
First of all, we need to install all the packages needed for this project:

In [1]:
# if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

# BiocManager::install("curatedTCGAData");
# BiocManager::install("TCGAutils");
# BiocManager::install("TCGAbiolinks");
BiocManager::install("Shamir-Lab/NEMO/NEMO");

# install.packages("SNFtool");
# install.packages("caret");
# install.packages("cluster");
# install.packages("mclustcomp");
install.packages("tsne");

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.r-project.org

Bioconductor version 3.16 (BiocManager 1.30.20), R 4.2.0 (2022-04-22)

Installing github package(s) 'Shamir-Lab/NEMO/NEMO'

Skipping install of 'NEMO' from a github remote, the SHA1 (451052fe) has not changed since last install.
  Use `force = TRUE` to force installation

Old packages: 'BiocManager', 'bslib', 'class', 'clock', 'curl', 'DT',
  'evaluate', 'future.apply', 'httpuv', 'httr', 'jsonlite', 'KernSmooth',
  'knitr', 'markdown', 'MASS', 'Matrix', 'matrixStats', 'nnet', 'parallelly',
  'pROC', 'RcppArmadillo', 'rmarkdown', 'sass', 'sys', 'tzdb'

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



Now we can load the packages:

In [2]:
library("curatedTCGAData");
library("TCGAbiolinks");
library("TCGAutils");
library("SNFtool");
library("caret");
library("cluster"); #pam
library("mclustcomp");
library("NEMO");
library("tsne");

Loading required package: MultiAssayExperiment

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMad

As mentioned above, we will download **multi-omics** data from patients having prostate cancer. A multi-omics dataset comprises multiple biological data sources, where each source represents a different data modality capturing the state of a specific biological layer in the cells.

The advent of the so-called “high-throughput technologies” enables the evaluation of:
- **Genome**: the complete genetic information of an organism (i.e. the sequence of nucleotides in the DNA);
- **Transcriptome**: set of all RNA transcripts (used also for all mRNA);
- **Proteome**: entire set of proteins etc.

in a cell, tissue, or organism at a given time. All omics data are high-dimensional and characterized by the **small-n large-p** setting (i.e. few samples and a large number of features), which easily leads to the **curse of dimensionality** in machine learning applications. In machine learning, the curse of dimensionality is the deterioration of algorithm performance that occurs because the volume of the feature space grows exponentially with the number of input features or dimensions. As the dimensionality increases, the available data become increasingly sparse in the high-dimensional space, making it difficult to represent and analyze them accurately.
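
One symptom of the curse of dimensionality can be reproduced with a short simulation. The sketch below is an illustrative assumption, not part of the analysis pipeline: it draws uniform random points and measures how much the farthest pair differs from the nearest pair, relative to the nearest pair, as the number of features `p` grows.

```r
# Toy distance-concentration sketch (base R only; simulated data).
set.seed(4)
n <- 50                                   # few samples, as in omics data
dims <- c(2, 20, 200, 2000)               # increasing number of features

contrast <- sapply(dims, function(p) {
  x <- matrix(runif(n * p), nrow = n)     # n random points in p dimensions
  d <- as.vector(dist(x))                 # all pairwise Euclidean distances
  (max(d) - min(d)) / min(d)              # relative contrast of the distances
})
names(contrast) <- paste0("p=", dims)
round(contrast, 2)
```

With few features the nearest and farthest pairs differ greatly; with thousands of features all pairwise distances become similar, which makes distance-based clustering harder.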

We download a prostate cancer multi-omics dataset from The Cancer Genome Atlas (TCGA) program. In particular, we exploit the package “curatedTCGAData” to download the following data views:
- **mRNA data**;
- **miRNA data**;
- **protein data**.

In [3]:
# Download prostate cancer multi-omics dataset.
# Note that RPPA stands for Reverse-phase protein array and it is the technology used to obtain proteomic data.
assays <- c("miRNASeqGene", "RNASeq2Gene", "RPPAArray");
mo <- curatedTCGAData(diseaseCode = "PRAD", 
                        assays = assays, 
                        version = "2.0.1", dry.run = FALSE);

# This command prints a summary of the MultiAssayExperiment object.
mo;

snapshotDate(): 2022-10-31

Working on: PRAD_miRNASeqGene-20160128

see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation

loading from cache

Working on: PRAD_RNASeq2Gene-20160128

see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation

loading from cache

Working on: PRAD_RPPAArray-20160128

see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation

loading from cache

Working on: PRAD_colData-20160128

see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation

loading from cache

Working on: PRAD_metadata-20160128

see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation

loading from cache

Working on: PRAD_sampleMap-20160128

see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation

loading from cache

harmonizing input:
  removing 5189 sampleMap rows not in names(experiments)



A MultiAssayExperiment object of 3 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 3:
 [1] PRAD_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 547 columns
 [2] PRAD_RNASeq2Gene-20160128: SummarizedExperiment with 20501 rows and 550 columns
 [3] PRAD_RPPAArray-20160128: SummarizedExperiment with 195 rows and 352 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

In [4]:
# This subset does not change the content of the variable "mo" if using the version 4.2.0 of R.
mo <- mo[, , paste0("PRAD", "_", assays, "-20160128")];

In [5]:
# Checking the actual number of entries in the sampleMap. It can be noticed that the number of entries in the
# sampleMap DataFrame is still the same after the subsetting (1449 rows and 3 columns).
sampleMap(mo);

DataFrame with 1449 rows and 3 columns
                         assay      primary                colname
                      <factor>  <character>            <character>
1    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VL TCGA-2A-A8VL-01A-21R..
2    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VO TCGA-2A-A8VO-01A-11R..
3    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VT TCGA-2A-A8VT-01A-11R..
4    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VV TCGA-2A-A8VV-01A-11R..
5    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VX TCGA-2A-A8VX-01A-11R..
...                        ...          ...                    ...
1445   PRAD_RPPAArray-20160128 TCGA-ZG-A9LZ TCGA-ZG-A9LZ-01A-21-..
1446   PRAD_RPPAArray-20160128 TCGA-ZG-A9M4 TCGA-ZG-A9M4-01A-21-..
1447   PRAD_RPPAArray-20160128 TCGA-ZG-A9MC TCGA-ZG-A9MC-01A-21-..
1448   PRAD_RPPAArray-20160128 TCGA-ZG-A9N3 TCGA-ZG-A9N3-01A-22-..
1449   PRAD_RPPAArray-20160128 TCGA-ZG-A9NI TCGA-ZG-A9NI-01A-21-..

As we can see, we obtain a MultiAssayExperiment object, which is essentially a data structure designed to store and analyze multi-omics experiments in a coordinated way. The three main components of this data structure are:
- **colData**: a data frame containing, for each sample, the corresponding phenotypic characteristics (in our case mainly clinical data), accessed with colData();
- **ExperimentList**: a list of the considered experiments (i.e. data modalities acquired with a specific technology), whose elements are usually matrices or data frames, accessed with experiments();
- **sampleMap**: a map that connects all the considered elements, accessed with sampleMap().

Moreover, the package provides a constructor to build MultiAssayExperiment objects from your own data, as well as subsetting operations for coordinated data selection among views.
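
The role of the three components can be mimicked with plain R objects. The sketch below is only an analogy built on base R, not the MultiAssayExperiment API: the patients, features, and object names (`col_data`, `experiments`, `sample_map`) are all hypothetical.

```r
# Base-R analogy of colData, ExperimentList and sampleMap (toy data).
col_data <- data.frame(age = c(61, 58, 70),
                       row.names = c("P1", "P2", "P3"))   # one row per patient

# Features in rows, samples in columns; P2 has no miRNA measurement.
experiments <- list(
  mrna  = matrix(1:6, nrow = 2,
                 dimnames = list(c("g1", "g2"), c("P1-r", "P2-r", "P3-r"))),
  mirna = matrix(1:4, nrow = 2,
                 dimnames = list(c("m1", "m2"), c("P1-m", "P3-m")))
)

# The map links each experiment column (colname) to a patient (primary).
sample_map <- data.frame(
  assay   = c("mrna", "mrna", "mrna", "mirna", "mirna"),
  primary = c("P1", "P2", "P3", "P1", "P3"),
  colname = c("P1-r", "P2-r", "P3-r", "P1-m", "P3-m")
)

# Coordinated subsetting: keep patients P1 and P3 in every experiment.
keep <- c("P1", "P3")
subset_map <- sample_map[sample_map$primary %in% keep, ]
subset_exp <- lapply(names(experiments), function(a) {
  experiments[[a]][, subset_map$colname[subset_map$assay == a], drop = FALSE]
})
names(subset_exp) <- names(experiments)
sapply(subset_exp, ncol)
```

The map is what allows a single selection of patients to be propagated consistently to experiments whose column names differ.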

To work with data coming from TCGA, it is important to understand the structure of the **barcode** associated with each sample. A TCGA barcode is composed of a collection of identifiers, each of which specifically identifies a TCGA data element. Each sample is identified by one of these barcodes: in practice, the first 12 characters identify a specific individual, while the other parts indicate the type of sample (i.e. primary, metastatic, solid, blood derived, etc.), the type of genomic material extracted (i.e. DNA, RNA) and other information related to technical replicates (i.e. repeated measurements from the same sample).
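
The barcode structure can be explored with base R string functions. The following sketch splits one example PRAD barcode into its parts; the variable names are hypothetical, and the field interpretation in the comments follows the usual TCGA barcode layout.

```r
# Parsing a TCGA barcode with base R (illustrative sketch).
barcode <- "TCGA-2A-A8VL-01A-21R-A37H-13"
fields <- strsplit(barcode, "-", fixed = TRUE)[[1]]

patient_id  <- substr(barcode, 1, 12)    # project, tissue source site, participant
sample_type <- substr(fields[4], 1, 2)   # "01" identifies Primary Solid Tumors
vial        <- substr(fields[4], 3, 3)   # vial letter
analyte     <- substr(fields[5], 3, 3)   # "R" stands for RNA

c(patient = patient_id, sample_type = sample_type, vial = vial, analyte = analyte)
```

In the notebook the same selection is delegated to TCGAutils, so this snippet only serves to show what the barcode encodes.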

We use the barcode to retain only Primary Solid Tumors to have a more homogeneous group of samples and to check for the presence of technical replicates in the dataset.

In [6]:
# We extract the samples knowing that the type of tumor is indicated in the barcode. In TCGA “Primary Solid Tumors”
# are identified by the code “01” in the sample part of the barcode.
# Consider only primary solid tumors because primary tumors originate in a specific organ or tissue and are
# generally more consistent in terms of location, size, and characteristics compared to metastatic tumors
# (secondary tumors that spread from the primary site). Focusing on primary tumors helps maintain statistical
# validity by comparing similar types of tumors, reducing variability and confounding factors that may arise
# from studying different metastatic sites.
primary <- TCGAutils::TCGAsampleSelect(colnames(mo), c("01"));
primary;

“Inconsistent barcode lengths: 28, 27”
“Inconsistent barcode lengths: 28, 27”


LogicalList of length 3
[["PRAD_miRNASeqGene-20160128"]] 01=TRUE 01=TRUE 01=TRUE ... 01=TRUE 01=TRUE
[["PRAD_RNASeq2Gene-20160128"]] 01=TRUE 01=TRUE 01=TRUE ... 01=TRUE 01=TRUE
[["PRAD_RPPAArray-20160128"]] 01=TRUE 01=TRUE 01=TRUE ... 01=TRUE 01=TRUE

In [7]:
# The execution of the previous cell raises a warning because the barcodes associated with
# the RPPAArray are composed of 27 characters, while the others are composed of 28 characters.
print(colnames(mo)[1]);
print(colnames(mo)[2]);
print(colnames(mo)[3]);

CharacterList of length 1
[["PRAD_miRNASeqGene-20160128"]] TCGA-2A-A8VL-01A-21R-A37H-13 ...
CharacterList of length 1
[["PRAD_RNASeq2Gene-20160128"]] TCGA-2A-A8VL-01A-21R-A37L-07 ...
CharacterList of length 1
[["PRAD_RPPAArray-20160128"]] TCGA-2A-A8VL-01A-11-A43M-20 ...


In [8]:
mo <- mo[, primary, ];

harmonizing input:
  removing 106 sampleMap rows with 'colname' not in colnames of experiments



In [9]:
# Checking the actual number of entries in the sampleMap (1343 rows and 3 columns).
sampleMap(mo);

DataFrame with 1343 rows and 3 columns
                         assay      primary                colname
                      <factor>  <character>            <character>
1    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VL TCGA-2A-A8VL-01A-21R..
2    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VO TCGA-2A-A8VO-01A-11R..
3    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VT TCGA-2A-A8VT-01A-11R..
4    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VV TCGA-2A-A8VV-01A-11R..
5    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VX TCGA-2A-A8VX-01A-11R..
...                        ...          ...                    ...
1339   PRAD_RPPAArray-20160128 TCGA-ZG-A9LZ TCGA-ZG-A9LZ-01A-21-..
1340   PRAD_RPPAArray-20160128 TCGA-ZG-A9M4 TCGA-ZG-A9M4-01A-21-..
1341   PRAD_RPPAArray-20160128 TCGA-ZG-A9MC TCGA-ZG-A9MC-01A-21-..
1342   PRAD_RPPAArray-20160128 TCGA-ZG-A9N3 TCGA-ZG-A9N3-01A-22-..
1343   PRAD_RPPAArray-20160128 TCGA-ZG-A9NI TCGA-ZG-A9NI-01A-21-..

In [10]:
# Check for replicates (anyReplicated() checks the so-called biological or primary unit in the sampleMap of the
# MultiAssayExperiment object, which corresponds to the first 12 characters of the barcodes for TCGA data). In fact,
# if two samples have the same first 12 characters in their barcodes, then they come from the same patient and can be
# identified as technical replicates (since we already filtered for the same sample type). The outcome ("FALSE")
# indicates that there are no replicates.
check_rep <- anyReplicated(mo);
print(check_rep);

 PRAD_RNASeq2Gene-20160128 PRAD_miRNASeqGene-20160128 
                     FALSE                      FALSE 
   PRAD_RPPAArray-20160128 
                     FALSE 
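
The logic behind the replicate check can be sketched in base R. This is not how anyReplicated() is implemented; it is a simplified illustration in which the third barcode is a made-up second aliquot of the first patient.

```r
# Detecting replicated patients from barcodes (base R sketch, toy barcodes).
barcodes <- c("TCGA-2A-A8VL-01A-21R-A37L-07",
              "TCGA-2A-A8VO-01A-11R-A37L-07",
              "TCGA-2A-A8VL-01A-22R-A37L-07")   # hypothetical replicate

patient <- substr(barcodes, 1, 12)       # first 12 characters identify the patient
replicated <- patient[duplicated(patient)]

any(duplicated(patient))                 # TRUE: at least one patient appears twice
replicated                               # which patient is replicated
```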


Then, other additional pre-processing steps are performed:

- Remove **FFPE** (**formalin-fixed, paraffin-embedded**) samples. Broadly speaking, after a biopsy is performed we need to store and preserve the sample. Two major tissue preparation methods are generally used: (1) FFPE, (2) freezing the sample. DNA and RNA molecules are preserved better if the tissue is frozen, thus we will remove samples preserved using FFPE technique;

In [11]:
# The information regarding if the sample is FFPE is stored in the clinical data, which are accessible using
# colData(). 
no_ffpe <- which(as.data.frame(colData(mo))$patient.samples.sample.is_ffpe == "no");

In [12]:
mo <- mo[, no_ffpe, ];

In [13]:
# Checking the actual number of entries in the sampleMap: it is unchanged (1343 rows and 3 columns),
# so no sample was removed by the FFPE filter.
sampleMap(mo);

DataFrame with 1343 rows and 3 columns
                         assay      primary                colname
                      <factor>  <character>            <character>
1    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VL TCGA-2A-A8VL-01A-21R..
2    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VO TCGA-2A-A8VO-01A-11R..
3    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VT TCGA-2A-A8VT-01A-11R..
4    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VV TCGA-2A-A8VV-01A-11R..
5    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8W1 TCGA-2A-A8W1-01A-11R..
...                        ...          ...                    ...
1339   PRAD_RPPAArray-20160128 TCGA-ZG-A9LB TCGA-ZG-A9LB-01A-21-..
1340   PRAD_RPPAArray-20160128 TCGA-TP-A8TV TCGA-TP-A8TV-01A-21-..
1341   PRAD_RPPAArray-20160128 TCGA-V1-A9OX TCGA-V1-A9OX-01A-21-..
1342   PRAD_RPPAArray-20160128 TCGA-YL-A9WX TCGA-YL-A9WX-01A-11-..
1343   PRAD_RPPAArray-20160128 TCGA-ZG-A9L9 TCGA-ZG-A9L9-01A-21-..

- Restrict samples to the ones having all the considered omics and extract the set of omics (one matrix for each omic) in a list;

In [14]:
# intersectColumns() is a wrapper for complete.cases to return a MultiAssayExperiment with only those biological
# units that have measurements across all experiments. We will obtain samples having all the considered omics
# (1044 rows and 3 columns).
complete <- intersectColumns(mo);
sampleMap(complete);

DataFrame with 1044 rows and 3 columns
                         assay      primary                colname
                      <factor>  <character>            <character>
1    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VL TCGA-2A-A8VL-01A-21R..
2    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VO TCGA-2A-A8VO-01A-11R..
3    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VT TCGA-2A-A8VT-01A-11R..
4    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8VV TCGA-2A-A8VV-01A-11R..
5    PRAD_RNASeq2Gene-20160128 TCGA-2A-A8W1 TCGA-2A-A8W1-01A-11R..
...                        ...          ...                    ...
1040   PRAD_RPPAArray-20160128 TCGA-ZG-A9LB TCGA-ZG-A9LB-01A-21-..
1041   PRAD_RPPAArray-20160128 TCGA-TP-A8TV TCGA-TP-A8TV-01A-21-..
1042   PRAD_RPPAArray-20160128 TCGA-V1-A9OX TCGA-V1-A9OX-01A-21-..
1043   PRAD_RPPAArray-20160128 TCGA-YL-A9WX TCGA-YL-A9WX-01A-11-..
1044   PRAD_RPPAArray-20160128 TCGA-ZG-A9L9 TCGA-ZG-A9L9-01A-21-..
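
Conceptually, restricting the dataset to complete cases amounts to intersecting the patient identifiers across omics. The base-R sketch below illustrates this on hypothetical identifiers and is not the implementation of intersectColumns().

```r
# Keeping only patients measured in every omic (base R sketch, toy identifiers).
ids <- list(
  mirna   = c("P1", "P2", "P3", "P4"),
  mrna    = c("P1", "P2", "P3", "P5"),
  protein = c("P2", "P3", "P5")
)

common <- Reduce(intersect, ids)   # patients present in all three lists
common
length(common) * length(ids)       # rows a sample map would have after filtering
```

By the same reasoning, and given that there are three assays and no replicates, the 1044 sampleMap rows above should correspond to 348 patients.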

In [15]:
# Extract assays in list of matrices. To access an assay it is possible to use complete$assaysname
complete <- assays(complete);
complete;

List of length 3
names(3): PRAD_miRNASeqGene-20160128 PRAD_RNASeq2Gene-20160128 PRAD_RPPAArray-20160128

- Transpose each matrix to have samples in the rows and features in columns.

In [16]:
# Obtain matrices samples x features:
complete <- lapply(complete, FUN=t);
complete;

Unnamed: 0,hsa-let-7a-1,hsa-let-7a-2,hsa-let-7a-3,hsa-let-7b,hsa-let-7c,hsa-let-7d,hsa-let-7e,hsa-let-7f-1,hsa-let-7f-2,hsa-let-7g,⋯,hsa-mir-941-3,hsa-mir-941-4,hsa-mir-942,hsa-mir-943,hsa-mir-944,hsa-mir-95,hsa-mir-96,hsa-mir-98,hsa-mir-99a,hsa-mir-99b
TCGA-2A-A8VL-01A-21R-A37H-13,32458,65068,32654,68355,43115,1450,3639,67,27703,1532,⋯,0,0,3,0,9,6,162,113,9766,81390
TCGA-2A-A8VO-01A-11R-A37H-13,19387,38457,19447,19162,22523,958,1809,39,26757,1713,⋯,0,0,11,0,0,3,108,104,7105,35388
TCGA-2A-A8VT-01A-11R-A37H-13,42690,85105,43247,36065,14156,1088,3211,81,43675,1516,⋯,0,0,4,0,7,3,49,133,5476,53659
TCGA-2A-A8VV-01A-11R-A37H-13,62290,123325,61878,58422,36300,1145,7363,100,67980,2011,⋯,0,0,7,0,4,3,134,150,12168,65723
TCGA-2A-A8W1-01A-11R-A37H-13,12750,24974,12584,6681,11792,548,1090,35,17591,1049,⋯,0,0,4,0,1,2,90,60,5150,24431
TCGA-2A-A8W3-01A-11R-A37H-13,28225,56756,28358,21738,19805,1078,3282,62,30647,1593,⋯,0,0,9,0,12,6,93,104,8462,66046
TCGA-CH-5737-01A-11R-1579-13,75147,149632,75297,71861,82881,1433,3955,118,82389,3113,⋯,0,0,9,0,2,4,264,285,13049,58597
TCGA-CH-5738-01A-11R-1579-13,41523,82750,41762,51346,36212,1324,4628,90,47625,1571,⋯,0,0,6,0,13,6,92,151,6967,78816
TCGA-CH-5739-01A-11R-1579-13,75998,151246,76239,92385,52152,2485,6201,113,88833,4240,⋯,0,0,20,0,8,19,190,280,12621,119386
TCGA-CH-5740-01A-11R-1579-13,42616,85659,42947,26142,12640,781,2776,95,76034,1865,⋯,0,0,2,0,7,4,77,108,2011,33256

Unnamed: 0,A1BG,A1CF,A2BP1,A2LD1,A2ML1,A2M,A4GALT,A4GNT,AAA1,AAAS,⋯,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,psiTPTE22,tAKR
TCGA-2A-A8VL-01A-21R-A37L-07,131.90,0,49,264.20,88,9375.69,1491,1,0,1792,⋯,93,552,2139,74,1344,4474,2891,981,387,0
TCGA-2A-A8VO-01A-11R-A37L-07,27.00,0,27,254.53,4,30093.34,758,2,0,1386,⋯,63,577,1366,252,1138,4081,1285,964,439,0
TCGA-2A-A8VT-01A-11R-A37L-07,46.52,1,5,432.49,3,13345.67,683,1,0,2180,⋯,308,1629,4699,121,3239,4174,5661,3736,98,1
TCGA-2A-A8VV-01A-11R-A37L-07,37.42,0,8,372.46,2,17906.31,987,1,0,1867,⋯,95,895,2159,137,1302,6500,2188,2229,45,0
TCGA-2A-A8W1-01A-11R-A37L-07,16.02,0,0,420.48,10,3200.38,452,0,0,1879,⋯,146,1013,2248,382,2001,2136,2001,2497,13,0
TCGA-2A-A8W3-01A-11R-A37L-07,66.88,0,6,473.60,5,25123.86,1202,0,0,1635,⋯,167,1322,3218,45,2933,5376,3386,2392,146,0
TCGA-CH-5737-01A-11R-1580-07,33.27,2,13,237.02,53,14980.97,1346,0,0,1945,⋯,190,1027,2747,84,2627,3349,2425,2270,109,0
TCGA-CH-5738-01A-11R-1580-07,61.85,0,57,552.16,155,91195.95,2373,0,0,2053,⋯,222,1945,3735,109,3177,8819,4730,2439,84,0
TCGA-CH-5739-01A-11R-1580-07,31.00,0,5,478.90,5,45177.75,1217,0,0,1670,⋯,260,1452,3468,56,2882,5364,4831,3194,1224,0
TCGA-CH-5740-01A-11R-1580-07,27.75,0,13,250.09,3,11131.87,333,0,0,1401,⋯,17,628,1673,25,1348,2156,2048,983,31,0

Unnamed: 0,14-3-3_beta,14-3-3_epsilon,14-3-3_zeta,4E-BP1,4E-BP1_pS65,4E-BP1_pT37_T46,4E-BP1_pT70,53BP1,A-Raf_pS299,ACC1,⋯,p27_pT198,p38,p38_MAPK,p38_pT180_Y182,p53,p62-LCK-ligand,p70S6K,p70S6K_pT389,p90RSK,p90RSK_pT359_S363
TCGA-2A-A8VL-01A-11-A43M-20,-0.049911077,-0.106880871,0.094811939,-0.31583725,0.069098733,0.14985706,-0.001959236,0.068477359,-0.068047354,0.0476650013,⋯,0.007940404,0.01125132,,0.263462117,-0.108519267,-0.042595631,-0.161587614,-0.005111448,-0.066468953,-0.036112777
TCGA-2A-A8VO-01A-21-A43N-20,-0.014907775,0.089755117,0.171977017,-0.06371873,0.042506192,-0.31037682,0.119323718,-0.857558044,0.217263848,-0.1392483117,⋯,-0.156329749,0.09343749,,0.189801838,-0.056297520,-0.221798118,-0.567801031,-0.095043535,-0.189920204,0.048042144
TCGA-2A-A8VT-01A-21-A43M-20,-0.073162472,0.033985628,-0.093995316,-0.13318682,-0.104138286,0.15142957,-0.103481592,-0.088387964,0.033823632,0.3530380013,⋯,0.101176896,0.12116993,,0.297726089,-0.207714882,0.008781276,-0.116932331,0.066018352,0.003819951,0.237311589
TCGA-2A-A8VV-01A-21-A43M-20,-0.056122482,-0.029684579,-0.005292788,-0.31041743,0.157419577,0.10327751,-0.262962682,0.001967876,0.104937073,-0.3832658233,⋯,0.192590787,-0.16945604,,0.487236468,-0.115750816,-0.153000467,-0.222863863,-0.108812004,0.059881018,-0.123107098
TCGA-2A-A8W1-01A-21-A43M-20,-0.077347808,-0.054949235,-0.361387058,0.24149916,-0.270727991,-0.20036718,0.215255221,-0.034558583,0.113168302,0.0007544018,⋯,-0.052013024,-0.11643450,,-0.131228490,0.157349450,0.279546373,0.002758116,0.028584368,0.035546414,0.153969802
TCGA-2A-A8W3-01A-21-A43M-20,0.023169800,-0.086594562,-0.883213047,0.09623086,-0.300276305,-0.52312498,0.083494481,0.347375977,0.191686710,0.3922293097,⋯,-0.007055746,-0.30899454,,-0.496588564,0.282620560,0.279145180,0.322622115,0.212702324,0.008250305,0.065939057
TCGA-CH-5737-01A-22-A303-20,-0.284120460,0.318566719,-0.443196865,0.34255435,-0.276434892,0.85064113,-0.001184839,-0.872119959,0.330474955,0.0434791145,⋯,0.001927585,,-0.14823142,0.353951541,-0.161482227,-0.177923227,0.066171180,0.220497068,0.149577266,0.712343949
TCGA-CH-5738-01A-21-A303-20,0.090714021,-0.065186707,-0.121425910,-0.43240111,0.045768052,0.40499532,-0.036641199,0.184545046,-0.218231281,-1.5059547205,⋯,0.096906586,,0.17203582,0.299487821,-0.096697096,-0.151101876,-0.520249274,1.445321701,0.087132238,0.045614781
TCGA-CH-5739-01A-21-A303-20,0.020596776,-0.020156040,-0.096052825,0.49878540,-0.015088522,1.02181576,-0.124598608,0.444575960,0.138645100,0.1983116865,⋯,-0.024773330,,-0.16745760,0.173151805,-0.183501196,0.137070202,0.148064013,0.980897428,0.143878886,0.210360860
TCGA-CH-5740-01A-21-A303-20,-0.150337202,0.198741246,-0.341694689,0.54475143,0.105664202,1.01352023,0.004842354,0.755671300,-0.042752803,0.6289362880,⋯,0.005464926,,-0.67295646,0.103150443,-0.175194073,0.000191091,0.267600721,0.851677354,0.121761143,0.378836906


- Remove features having missing values (i.e. NA). In this case it is easier to remove features than to perform imputation, since only a few features in the proteomics data have missing values;

In [17]:
# Remove features having NAs (present only in proteomics data).
# In detail, "is.na(complete[[3]])" checks for missing values (NA) in the third matrix.
# "colSums(is.na(complete[[3]]))" calculates the column-wise sums of missing values. It returns a numeric vector
# with the same number of elements as the number of columns in complete[[3]]. Each element represents the count of
# missing values in the corresponding column.
# "colSums(is.na(complete[[3]])) == 0" creates a logical vector indicating which columns have no missing values.
# It returns TRUE for columns with no missing values and FALSE otherwise.
# "complete[[3]][, colSums(is.na(complete[[3]])) == 0]" selects columns from the proteomics matrix where the
# corresponding column in colSums(is.na(complete[[3]])) == 0 is TRUE. In other words, it keeps only the columns
# that have no missing values.
complete[[3]] <- complete[[3]][, colSums(is.na(complete[[3]])) == 0];
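
As a quick illustration of the column filter above, the following minimal sketch applies the same expression to a small hypothetical matrix (`toy` and its feature names are invented for the example and are not part of the dataset):

```r
# Hypothetical 3 x 4 matrix standing in for the proteomics data
# (samples in rows, features in columns; values are filled column by column)
toy <- matrix(c(1, 2, 3,
                4, NA, 6,
                7, 8, 9,
                NA, NA, 12),
              nrow = 3,
              dimnames = list(paste0("s", 1:3), paste0("f", 1:4)))

# Number of missing values per feature (column)
na_per_feature <- colSums(is.na(toy))

# Keep only the features without missing values
toy_clean <- toy[, na_per_feature == 0, drop = FALSE]
```

Adding `drop = FALSE` is a defensive choice: it keeps the result a matrix even if only one feature survives the filter.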

- Select the features with the highest variance across samples. Here we make a strong assumption: features that have more variance across samples bring more information and are the most relevant ones. This feature selection strategy is fast and commonly used in the literature; however, it has some drawbacks: (1) it is univariate, thus it does not consider interactions among features, and (2) it is not able to remove redundant variables. Moreover, we need to choose a threshold for feature selection (here, the top 100 features), which is always an arbitrary choice;

In [18]:
# Remove features with near zero variance and retain top 100 features having higher variance.
# First, we define the number of features we want to retain.
# We loop on each one of the data matrices.
# "nearZeroVar()", from the caret package, is used to identify variables with near-zero variance, which means they
# have very little or no variation in their values. The resulting indices are stored in the "idx" variable.
# Then, if the length of the "idx" variable is not zero, the expression "complete[[i]][, -idx]" is used to subset
# the i-th element of "complete", removing the columns specified by "idx". The modified element is then assigned
# back to the i-th position in the "complete" list.
# "if(ncol(complete[[i]]) <= nf) next" is a conditional statement which checks if the number of columns (features) in
# the modified i-th element of "complete" is less than or equal to "nf" (100 in this case). If it is, the "next"
# keyword is used to skip the remaining operations within the loop for the current i and move on to the next
# iteration.
# "vars <- apply(complete[[i]], 2, var);" calculates the variance of each column (feature) in the modified i-th
# element of "complete".
# "idx <- sort(vars, index.return=TRUE, decreasing = TRUE)$ix;" sorts the variance values ("vars") in descending
# order and retrieves the corresponding indices ("ix"). It stores the sorted indices in the "idx" variable.
# Finally, "complete[[i]] <- complete[[i]][, idx[1:nf]];" keeps only the top 100 features in the modified i-th
# element of "complete". It uses the sorted indices "idx" to select the first "nf" elements and retains only those
# columns. The modified matrix is then assigned back to the i-th position in the "complete" list.
# (Removed 418 features from PRAD_miRNASeqGene-20160128)
# (Removed  1334 features from PRAD_RNASeq2Gene-20160128)
# (Removed  0 features from PRAD_RPPAArray-20160128)
nf <- 100;
for(i in 1:length(complete)){
    
    idx <- caret::nearZeroVar(complete[[i]]);
    message(paste("Removed ", length(idx), "features from", names(complete)[i]));
    if(length(idx) != 0){
        complete[[i]] <- complete[[i]][, -idx];
    }

    if(ncol(complete[[i]]) <= nf) next
    
    vars <- apply(complete[[i]], 2, var);
    idx <- sort(vars, index.return=TRUE, decreasing = TRUE)$ix;
    
    complete[[i]] <- complete[[i]][, idx[1:nf]];
    
}

Removed  418 features from PRAD_miRNASeqGene-20160128

Removed  1334 features from PRAD_RNASeq2Gene-20160128

Removed  0 features from PRAD_RPPAArray-20160128
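
The variance-based selection can also be tried in isolation. The sketch below uses only base R on a hypothetical matrix (`toy`, `sds` and the value of `nf` are assumptions made for the example); it leaves out the `caret::nearZeroVar()` step and only shows the top-`nf` selection:

```r
set.seed(123)
# Hypothetical matrix: 20 samples x 6 features with different spreads
sds <- c(0.1, 5, 1, 10, 0.5, 2)
toy <- sapply(sds, function(s) rnorm(20, mean = 0, sd = s))
colnames(toy) <- paste0("f", seq_along(sds))

nf <- 3  # number of features to retain (100 in the analysis above)

# Variance of each feature, then the indices of the nf most variable ones
vars <- apply(toy, 2, var)
top <- order(vars, decreasing = TRUE)[seq_len(min(nf, ncol(toy)))]
toy_top <- toy[, top, drop = FALSE]
```

Here `order(vars, decreasing = TRUE)` returns the same indices as `sort(vars, index.return = TRUE, decreasing = TRUE)$ix`, and `min(nf, ncol(toy))` covers the case in which fewer than `nf` features are available.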



- Standardize features using z-score;

In [19]:
# Perform features standardization using z-score. Z-score normalization is a statistical technique used to
# transform a dataset so that it has a mean of zero and a standard deviation of one.
# This process allows us to compare and analyze data that originally had different scales or units.
zscore <- function(data){
    
    zscore_vec <- function(x) { return ((x - mean(x)) / sd(x))};
    data <- apply(data, 2, zscore_vec);
    
    
    return(data);
}

complete <- lapply(complete, zscore);
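
As a sanity check, the hand-written z-score can be compared with base R's `scale()`, which centers each column and divides it by its sample standard deviation. This is a minimal sketch on a hypothetical matrix (`toy` is invented for the example):

```r
set.seed(1)
# Hypothetical matrix: 5 samples x 3 features on very different scales
toy <- cbind(a = rnorm(5, 10, 2),
             b = rnorm(5, 100, 30),
             c = rnorm(5, -3, 0.5))

# Same per-feature transformation used in the cell above
zscore_vec <- function(x) (x - mean(x)) / sd(x)
z_manual <- apply(toy, 2, zscore_vec)

# Base R alternative
z_scale <- scale(toy, center = TRUE, scale = TRUE)

# Largest absolute difference between the two results
max_diff <- max(abs(z_manual - z_scale))
```

Note that a feature with zero variance would produce `NaN` values, since the standard deviation appears in the denominator; the near-zero-variance filter applied earlier reduces this risk.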

- Clean barcodes to retain only the first part specific for each individual.

In [20]:
# Clean barcodes retaining only "Project-TSS-Participant", that is, the unique identifier for each patient. 
# We substitute the names of the rows with the substring composed by their first 12 characters. 
for(v in 1:length(complete)){
    rownames(complete[[v]]) <- substr(rownames(complete[[v]]), 1, 12);
}
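
The effect of the truncation can be seen on a few barcodes taken from the tables above. The duplicate check at the end is an extra precaution added in this sketch and is not part of the original pipeline:

```r
# Barcodes of the same participant in two omics, plus a second participant
barcodes <- c("TCGA-2A-A8VL-01A-21R-A37H-13",
              "TCGA-2A-A8VL-01A-21R-A37L-07",
              "TCGA-CH-5737-01A-11R-1579-13")

# "Project-TSS-Participant" corresponds to the first 12 characters
patient_id <- substr(barcodes, 1, 12)

# Within a single omic, the truncated identifiers should remain unique,
# otherwise row names would collide
ids_one_omic <- substr(barcodes[c(1, 3)], 1, 12)
has_duplicates <- anyDuplicated(ids_one_omic) > 0
```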

Assigning a sample to a specific disease subtype helps to predict the patient's prognosis and also affects the choice of therapy. Several tests are available to define the disease subtype of a prostate cancer patient, each considering a different subset of genes. The TCGA Research Network provides subtypes defined using the iCluster model. We will check whether the clusters we compute are similar to the iCluster disease subtypes for prostate cancer.

In [21]:
# Download disease subtypes (prostate adenocarcinoma) from TCGAbiolinks. The column “Subtype_Integrative” is the
# one containing the iCluster molecular subtype.
subtypes <- as.data.frame(TCGAbiolinks::PanCancerAtlas_subtypes());
subtypes <- subtypes[subtypes$cancer.type == "PRAD", ];
subtypes;

Unnamed: 0_level_0,pan.samplesID,cancer.type,Subtype_mRNA,Subtype_DNAmeth,Subtype_protein,Subtype_miRNA,Subtype_CNA,Subtype_Integrative,Subtype_other,Subtype_Selected
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
5978,TCGA-HC-7818-01,PRAD,3,4,,1,Quiet,3,1-ERG,PRAD.1-ERG
5979,TCGA-HC-7077-01,PRAD,2,1,1,2,Some_SCNA,2,1-ERG,PRAD.1-ERG
5980,TCGA-G9-6356-01,PRAD,3,3,1,1,Some_SCNA,3,1-ERG,PRAD.1-ERG
5981,TCGA-HC-7213-01,PRAD,2,3,1,6,More_SCNA,3,1-ERG,PRAD.1-ERG
5982,TCGA-KK-A6E1-01,PRAD,2,3,,3,More_SCNA,2,1-ERG,PRAD.1-ERG
5983,TCGA-VP-A872-01,PRAD,2,3,,6,More_SCNA,2,1-ERG,PRAD.1-ERG
5984,TCGA-V1-A8WS-01,PRAD,2,1,,4,Some_SCNA,2,1-ERG,PRAD.1-ERG
5985,TCGA-CH-5741-01,PRAD,2,3,1,3,Some_SCNA,2,1-ERG,PRAD.1-ERG
5986,TCGA-J4-A6M7-01,PRAD,2,3,,2,Quiet,2,1-ERG,PRAD.1-ERG
5987,TCGA-KK-A8I5-01,PRAD,2,1,,1,Some_SCNA,2,1-ERG,PRAD.1-ERG


In [22]:
# Retain only primary solid tumors (sample type code "01"):
subtypes <- subtypes[TCGAutils::TCGAsampleSelect(subtypes$pan.samplesID, "01"), ];
subtypes;

Unnamed: 0_level_0,pan.samplesID,cancer.type,Subtype_mRNA,Subtype_DNAmeth,Subtype_protein,Subtype_miRNA,Subtype_CNA,Subtype_Integrative,Subtype_other,Subtype_Selected
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
5978,TCGA-HC-7818-01,PRAD,3,4,,1,Quiet,3,1-ERG,PRAD.1-ERG
5979,TCGA-HC-7077-01,PRAD,2,1,1,2,Some_SCNA,2,1-ERG,PRAD.1-ERG
5980,TCGA-G9-6356-01,PRAD,3,3,1,1,Some_SCNA,3,1-ERG,PRAD.1-ERG
5981,TCGA-HC-7213-01,PRAD,2,3,1,6,More_SCNA,3,1-ERG,PRAD.1-ERG
5982,TCGA-KK-A6E1-01,PRAD,2,3,,3,More_SCNA,2,1-ERG,PRAD.1-ERG
5983,TCGA-VP-A872-01,PRAD,2,3,,6,More_SCNA,2,1-ERG,PRAD.1-ERG
5984,TCGA-V1-A8WS-01,PRAD,2,1,,4,Some_SCNA,2,1-ERG,PRAD.1-ERG
5985,TCGA-CH-5741-01,PRAD,2,3,1,3,Some_SCNA,2,1-ERG,PRAD.1-ERG
5986,TCGA-J4-A6M7-01,PRAD,2,3,,2,Quiet,2,1-ERG,PRAD.1-ERG
5987,TCGA-KK-A8I5-01,PRAD,2,1,,1,Some_SCNA,2,1-ERG,PRAD.1-ERG


In [23]:
# Retain from the subtypes only information regarding samples in the multi-omic dataset.
# "substr(subtypes$pan.samplesID,1,12)" extracts the first 12 characters from the "pan.samplesID" column in
# "subtypes".
# "%in%" checks if each element in the left-hand side vector (substrings) is present in the right-hand side vector
# (rownames of the first matrix in "complete").
# Overall, the condition "substr(subtypes$pan.samplesID,1,12) %in% rownames(complete[[1]])" checks if the first 12
# characters of the "pan.samplesID" column in "subtypes" match any of the row names of the first element in 
# "complete".
sub_select <- substr(subtypes$pan.samplesID,1,12) %in% rownames(complete[[1]]);

# Count the occurrences of TRUE
count <- sum(as.numeric(sub_select))
print(count);

[1] 248


In [24]:
subtypes <- subtypes[sub_select, ];
subtypes;

Unnamed: 0_level_0,pan.samplesID,cancer.type,Subtype_mRNA,Subtype_DNAmeth,Subtype_protein,Subtype_miRNA,Subtype_CNA,Subtype_Integrative,Subtype_other,Subtype_Selected
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
5979,TCGA-HC-7077-01,PRAD,2,1,1,2,Some_SCNA,2,1-ERG,PRAD.1-ERG
5980,TCGA-G9-6356-01,PRAD,3,3,1,1,Some_SCNA,3,1-ERG,PRAD.1-ERG
5981,TCGA-HC-7213-01,PRAD,2,3,1,6,More_SCNA,3,1-ERG,PRAD.1-ERG
5982,TCGA-KK-A6E1-01,PRAD,2,3,,3,More_SCNA,2,1-ERG,PRAD.1-ERG
5985,TCGA-CH-5741-01,PRAD,2,3,1,3,Some_SCNA,2,1-ERG,PRAD.1-ERG
5986,TCGA-J4-A6M7-01,PRAD,2,3,,2,Quiet,2,1-ERG,PRAD.1-ERG
5987,TCGA-KK-A8I5-01,PRAD,2,1,,1,Some_SCNA,2,1-ERG,PRAD.1-ERG
5988,TCGA-EJ-5521-01,PRAD,2,3,1,1,More_SCNA,2,1-ERG,PRAD.1-ERG
5989,TCGA-EJ-5507-01,PRAD,2,3,1,5,More_SCNA,3,1-ERG,PRAD.1-ERG
5991,TCGA-KC-A7F6-01,PRAD,2,1,,1,Some_SCNA,2,1-ERG,PRAD.1-ERG


In [25]:
# This line of code assigns new row names to "subtypes" based on a substring of the "pan.samplesID" column.
rownames(subtypes) <- substr(subtypes$pan.samplesID, 1, 12);
subtypes;

Unnamed: 0_level_0,pan.samplesID,cancer.type,Subtype_mRNA,Subtype_DNAmeth,Subtype_protein,Subtype_miRNA,Subtype_CNA,Subtype_Integrative,Subtype_other,Subtype_Selected
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
TCGA-HC-7077,TCGA-HC-7077-01,PRAD,2,1,1,2,Some_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-G9-6356,TCGA-G9-6356-01,PRAD,3,3,1,1,Some_SCNA,3,1-ERG,PRAD.1-ERG
TCGA-HC-7213,TCGA-HC-7213-01,PRAD,2,3,1,6,More_SCNA,3,1-ERG,PRAD.1-ERG
TCGA-KK-A6E1,TCGA-KK-A6E1-01,PRAD,2,3,,3,More_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-CH-5741,TCGA-CH-5741-01,PRAD,2,3,1,3,Some_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-J4-A6M7,TCGA-J4-A6M7-01,PRAD,2,3,,2,Quiet,2,1-ERG,PRAD.1-ERG
TCGA-KK-A8I5,TCGA-KK-A8I5-01,PRAD,2,1,,1,Some_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-EJ-5521,TCGA-EJ-5521-01,PRAD,2,3,1,1,More_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-EJ-5507,TCGA-EJ-5507-01,PRAD,2,3,1,5,More_SCNA,3,1-ERG,PRAD.1-ERG
TCGA-KC-A7F6,TCGA-KC-A7F6-01,PRAD,2,1,,1,Some_SCNA,2,1-ERG,PRAD.1-ERG


In [26]:
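# Select and reorder the subtypes so that they follow the sample order of the multi-omics matrices
# (a sample without a subtype entry would yield a row of NAs):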
subtypes <- subtypes[rownames(complete[[1]]),];
subtypes;

Unnamed: 0_level_0,pan.samplesID,cancer.type,Subtype_mRNA,Subtype_DNAmeth,Subtype_protein,Subtype_miRNA,Subtype_CNA,Subtype_Integrative,Subtype_other,Subtype_Selected
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
TCGA-2A-A8VL,TCGA-2A-A8VL-01,PRAD,3,3,,2,Quiet,3,1-ERG,PRAD.1-ERG
TCGA-2A-A8VO,TCGA-2A-A8VO-01,PRAD,1,4,,2,Some_SCNA,1,8-other,PRAD.8-other
TCGA-2A-A8VT,TCGA-2A-A8VT-01,PRAD,2,3,,6,More_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-2A-A8VV,TCGA-2A-A8VV-01,PRAD,2,3,,1,Some_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-2A-A8W1,TCGA-2A-A8W1-01,PRAD,1,4,,3,Some_SCNA,1,2-ETV1,PRAD.2-ETV1
TCGA-2A-A8W3,TCGA-2A-A8W3-01,PRAD,3,1,,6,Some_SCNA,2,2-ETV1,PRAD.2-ETV1
TCGA-CH-5737,TCGA-CH-5737-01,PRAD,1,2,3,5,Some_SCNA,1,6-FOXA1,PRAD.6-FOXA1
TCGA-CH-5738,TCGA-CH-5738-01,PRAD,3,3,3,1,Quiet,3,4-FLI1,PRAD.4-FLI1
TCGA-CH-5739,TCGA-CH-5739-01,PRAD,2,3,3,1,Some_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-CH-5740,TCGA-CH-5740-01,PRAD,2,3,3,4,Quiet,2,1-ERG,PRAD.1-ERG


In [27]:
# Remove all the rows with an NA value in the Subtype_Integrative column, that is, all the rows without an
# associated iCluster molecular subtype.
subtypes <- subtypes[!is.na(subtypes$Subtype_Integrative),];
subtypes;

Unnamed: 0_level_0,pan.samplesID,cancer.type,Subtype_mRNA,Subtype_DNAmeth,Subtype_protein,Subtype_miRNA,Subtype_CNA,Subtype_Integrative,Subtype_other,Subtype_Selected
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
TCGA-2A-A8VL,TCGA-2A-A8VL-01,PRAD,3,3,,2,Quiet,3,1-ERG,PRAD.1-ERG
TCGA-2A-A8VO,TCGA-2A-A8VO-01,PRAD,1,4,,2,Some_SCNA,1,8-other,PRAD.8-other
TCGA-2A-A8VT,TCGA-2A-A8VT-01,PRAD,2,3,,6,More_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-2A-A8VV,TCGA-2A-A8VV-01,PRAD,2,3,,1,Some_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-2A-A8W1,TCGA-2A-A8W1-01,PRAD,1,4,,3,Some_SCNA,1,2-ETV1,PRAD.2-ETV1
TCGA-2A-A8W3,TCGA-2A-A8W3-01,PRAD,3,1,,6,Some_SCNA,2,2-ETV1,PRAD.2-ETV1
TCGA-CH-5737,TCGA-CH-5737-01,PRAD,1,2,3,5,Some_SCNA,1,6-FOXA1,PRAD.6-FOXA1
TCGA-CH-5738,TCGA-CH-5738-01,PRAD,3,3,3,1,Quiet,3,4-FLI1,PRAD.4-FLI1
TCGA-CH-5739,TCGA-CH-5739-01,PRAD,2,3,3,1,Some_SCNA,2,1-ERG,PRAD.1-ERG
TCGA-CH-5740,TCGA-CH-5740-01,PRAD,2,3,3,4,Quiet,2,1-ERG,PRAD.1-ERG


In [28]:
# Print number of samples for each subtype found by iCluster:
table(subtypes$Subtype_Integrative);


  1   2   3 
 60  83 105 
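
Before moving on, it may be worth verifying explicitly that the patients in the subtype table and in the omics matrices are listed in the same order. The sketch below shows the idea on hypothetical stand-ins (`omics`, `labels` and the patient identifiers are invented); in the analysis above the analogous check would compare `rownames(subtypes)` with `rownames(complete[[1]])`:

```r
set.seed(3)
# Hypothetical stand-ins for one omics matrix and the subtype table
omics <- matrix(rnorm(9), nrow = 3,
                dimnames = list(c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"),
                                NULL))
labels <- data.frame(Subtype_Integrative = c("2", "1", "3"),
                     row.names = c("TCGA-AA-0002", "TCGA-AA-0001", "TCGA-AA-0003"),
                     stringsAsFactors = FALSE)

# Reorder the subtype table by the row names of the omics matrix
labels <- labels[rownames(omics), , drop = FALSE]

# Both objects must now list exactly the same patients in the same order
same_order <- identical(rownames(labels), rownames(omics))
stopifnot(same_order)
```

If the `NA` filter on `Subtype_Integrative` had removed some rows, the same patients would also have to be dropped from every omics matrix for this check to pass.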

Now we have to compute a similarity matrix for each omic of our dataset.
A similarity matrix is a square matrix that quantifies how similar each pair of objects is; a distance (or dissimilarity) matrix is its counterpart and quantifies how different they are. In clustering and data analysis, such matrices are commonly used to represent the pairwise relationships between data points (here, patients).

In [29]:
# Compute similarity matrix for each omic data source using the scaled exponential euclidean distance:
similarity_matrix <- list();
for(i in 1:length(complete)){
    
    # Compute the distance between the rows of the matrix
    Dist <- (dist2(as.matrix(complete[[i]]), as.matrix(complete[[i]])))^(1/2);
    
    # Compute the similarity matrix
    similarity_matrix[[i]] <- affinityMatrix(Dist);
}
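
To make the idea of a scaled exponential similarity more concrete, here is a minimal base R sketch of the kernel in the form usually given for SNF, $W(i,j) = \exp\left(-d(i,j)^2 / (\mu\,\varepsilon_{ij})\right)$, where $\varepsilon_{ij}$ averages the mean distance of $i$ and of $j$ to their nearest neighbours and the distance $d(i,j)$. The data and the parameter names `K` and `mu` are assumptions of this sketch; it is not the code of `affinityMatrix()`, whose implementation may differ in its details:

```r
set.seed(1)
# Hypothetical data: 6 samples x 4 features
x <- matrix(rnorm(6 * 4), nrow = 6)
d <- as.matrix(dist(x))  # pairwise Euclidean distances

K  <- 3    # number of nearest neighbours (assumed value)
mu <- 0.5  # scaling hyperparameter (assumed value)

# Mean distance of each sample to its K nearest neighbours
# (position 1 of the sorted row is the sample itself, at distance 0)
knn_mean <- apply(d, 1, function(row) mean(sort(row)[2:(K + 1)]))

# Local scale for every pair: average of the two neighbourhood means and d(i, j)
eps <- (outer(knn_mean, knn_mean, "+") + d) / 3

# Scaled exponential similarity
W <- exp(-d^2 / (mu * eps))
```

The local scale `eps` makes the kernel adaptive: pairs lying in sparse regions are compared on a wider scale than pairs lying in dense regions.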


## SNF
As mentioned earlier, Similarity Network Fusion (SNF) is an integration method that works on inter-patient similarities. SNF builds one patient similarity network per omic and iteratively updates these networks, making them more similar to each other at each step, until they converge to a single fused network. In the original method, the fused network is then partitioned using spectral clustering.

In [None]:
# Integration of multi-omics data using Similarity Network Fusion:
M_SNF <- SNF(similarity_matrix, K = 20, t = 20)
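
The iterative update performed by SNF can be sketched in a few lines of base R. The code below is a simplified illustration of the cross-diffusion step, in which each network is updated as $P^{(v)} \leftarrow S^{(v)} \, \bar{P}^{(-v)} \, (S^{(v)})^{T}$, with $\bar{P}^{(-v)}$ the average of the other networks. The toy data, the helper names (`make_affinity`, `row_normalize`, `knn_kernel`) and the parameter values are assumptions of this sketch; it is not the `SNFtool::SNF()` implementation, which may differ in its normalization details:

```r
set.seed(42)
n <- 8

# Two toy affinity matrices standing in for two omics (hypothetical data)
make_affinity <- function(n) {
  x <- matrix(rnorm(n * 3), nrow = n)
  exp(-as.matrix(dist(x))^2)
}
W <- list(make_affinity(n), make_affinity(n))

# Divide every row by its sum
row_normalize <- function(m) m / rowSums(m)

# Keep only the K largest affinities of each row, then row-normalize
knn_kernel <- function(m, K) {
  s <- t(apply(m, 1, function(row) {
    keep <- order(row, decreasing = TRUE)[seq_len(K)]
    out <- numeric(length(row))
    out[keep] <- row[keep]
    out
  }))
  row_normalize(s)
}

K <- 3
n_iter <- 10
P <- lapply(W, row_normalize)      # full (global) networks
S <- lapply(W, knn_kernel, K = K)  # sparse (local) networks

for (it in seq_len(n_iter)) {
  P_new <- vector("list", length(P))
  for (v in seq_along(P)) {
    # Average of the networks of the other omics
    others <- Reduce("+", P[-v]) / (length(P) - 1)
    # Diffuse that information through the local network of omic v
    P_new[[v]] <- row_normalize(S[[v]] %*% others %*% t(S[[v]]))
  }
  P <- P_new
}

# Fused network: average of the updated networks, made symmetric
fused <- Reduce("+", P) / length(P)
fused <- (fused + t(fused)) / 2
```

In this sketch the arguments `K` and `t` of `SNF()` correspond to the number of neighbours `K` and the number of iterations `n_iter`.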

In [None]:
# Integration of multi-omics data using the average. This can be considered as a trivial multi-omics data
# integration strategy.
# The "Reduce()" function combines the elements of "similarity_matrix" by applying the "+" operator.
# In other words, it computes the element-wise sum of all the matrices in "similarity_matrix".
M_Mean <- Reduce("+", similarity_matrix) / length(similarity_matrix)
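
The behaviour of `Reduce()` is easy to verify on two small hypothetical matrices (`A` and `B` are invented for the example):

```r
# Two hypothetical 2 x 2 similarity matrices
A <- matrix(c(1.0, 0.2, 0.2, 1.0), nrow = 2)
B <- matrix(c(1.0, 0.6, 0.6, 1.0), nrow = 2)
mats <- list(A, B)

# Reduce() folds the list with "+", giving the element-wise sum of the
# matrices; dividing by the number of matrices gives their average
avg <- Reduce("+", mats) / length(mats)
```

Each entry of `avg` is the mean of the corresponding entries of `A` and `B`, so every omic contributes with the same weight.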

**NEMO** (**NE**ighborhood based **M**ulti-**O**mics clustering) is a simple algorithm for multi-omics clustering. NEMO is inspired by and builds on prior similarity-based multi-omics clustering methods such as SNF. Importantly, NEMO can be applied to partial datasets, in which some patients have data for only a subset of the omics, without performing data imputation.
NEMO works in three phases:
First, an inter-patient similarity matrix is built for each omic. Next, the matrices of different omics are integrated into one matrix. Finally, that network is clustered.

NEMO receives as input a set of data matrices of n subjects (samples or patients).

The per-omic similarity measure is based on the radial basis function kernel. $r_{ijl}^2$ is a normalizing factor that controls for the density of samples: it averages the squared distances of the $i$-th and $j$-th samples to their nearest neighbors and the squared distance between these two samples.

In [None]:
# Integration of multi-omics data using NEMO
# Constructs a single affinity graph measuring similarity across different omics.
# The given parameter is a list of the data to be clustered, where each entry is a matrix of *features x samples*
# and "k" is the number of neighbors to use for each omic.
t_complete <- lapply(complete, FUN = t)
M_NEMO <- nemo.affinity.graph(t_complete, k = 20)

We will attempt to identify disease subtypes using the **Partitioning Around Medoids** (**PAM**) clustering algorithm. PAM searches for a number **k** (given as input by the user) of representative objects, or **medoids**, among the observations of the dataset. These observations should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each observation to the nearest medoid. Objects within a cluster should show a high degree of similarity, while objects belonging to different clusters should be as dissimilar as possible. The goal is to find the k representative objects that minimize the sum of the dissimilarities between each observation and its closest representative object. Equivalently, the goal can be interpreted as obtaining a set of clusters in which the average distance between the objects of a cluster and the cluster representative is minimized.
The entire set of objects is defined as $O$ and the set of objects that are tentatively defined as medoids is $S$, so $U = O - S$ is the set of unselected objects.
The algorithm has two phases:
- BUILD PHASE: the goal is to select k initial objects to populate the set of selected objects $S$. Then, the other objects in $U$ are assigned to the closest representative in $S$. The first object in $S$ is the one with the minimal sum of distances to all the other objects, i.e. the most central data point. Each remaining object $i \in U$ is then evaluated as a candidate representative and is chosen if it has a high number of unselected objects $j$ that are closer to $i$ than to the representatives already in $S$. These steps are repeated until k medoids have been selected;
- SWAP PHASE: this phase is intended to improve the set of selected representatives. For each pair of representative $i \in S$ and non-representative $h \in U$:
    - Consider swapping $i$ and $h$, so that $h$ becomes a representative and $i$ does not;
    - Compute the contribution $K_{jih}$ of each object $j \in U - \{h\}$ to the swap of $i$ and $h$. There are two main situations:
        1) $d(j,i) > D_j$, where $D_j$ is the dissimilarity between $j$ and the closest object in $S$. Then, $K_{jih} = \min\{d(j,h) - D_j, 0\}$;
        2) $d(j,i) = D_j$. Then, $K_{jih} = \min\{d(j,h), E_j\} - D_j$, where $E_j$ is the dissimilarity between $j$ and the second closest object in $S$.

    - Compute the total result of the swap as $T_{ih} = \sum_{j \in U} K_{jih}$;
    - Select the pair $i,h$ that minimizes $T_{ih}$;
    - If $T_{ih} < 0$ the swap is performed, $D_j$ and $E_j$ are recomputed and we return to the first step of the SWAP phase. Otherwise, i.e. when all $T_{ih} \geq 0$, the algorithm stops.
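
The two phases can be illustrated with a naive base R sketch on toy data. For simplicity, it evaluates each candidate swap by recomputing the total dissimilarity instead of accumulating the per-object contributions, so it is far less efficient than `cluster::pam()`, which is the function used in the analysis. The data and the helper `total_cost` are assumptions made for the example:

```r
set.seed(7)
# Toy data: three well-separated groups of 10 two-dimensional points
x <- rbind(matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2))
d <- as.matrix(dist(x))
k <- 3

# Sum of the dissimilarities of every object to its closest medoid
total_cost <- function(d, medoids) {
  sum(apply(d[, medoids, drop = FALSE], 1, min))
}

# BUILD: start from the most central object, then repeatedly add the
# object that reduces the total cost the most
medoids <- which.min(rowSums(d))
while (length(medoids) < k) {
  candidates <- setdiff(seq_len(nrow(d)), medoids)
  costs <- sapply(candidates, function(h) total_cost(d, c(medoids, h)))
  medoids <- c(medoids, candidates[which.min(costs)])
}

# SWAP: apply the best improving swap until no swap lowers the cost
repeat {
  current <- total_cost(d, medoids)
  best <- list(delta = 0)
  for (i in seq_along(medoids)) {
    for (h in setdiff(seq_len(nrow(d)), medoids)) {
      trial <- medoids
      trial[i] <- h
      delta <- total_cost(d, trial) - current
      if (delta < best$delta) best <- list(delta = delta, i = i, h = h)
    }
  }
  if (best$delta >= 0) break
  medoids[best$i] <- best$h
}

# Assign every object to its closest medoid
clustering <- apply(d[, medoids, drop = FALSE], 1, which.min)
```

Since every accepted swap strictly lowers the total cost and the number of possible medoid sets is finite, the loop is guaranteed to terminate.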

Let’s apply the PAM algorithm to our integrated similarity matrix (which we convert into a distance matrix). Note that we set the number of clusters to the number of prostate cancer molecular disease subtypes.

In [None]:
clusterings = list();

k <- length(unique(na.omit(subtypes$Subtype_Integrative)));
print(k)

In [None]:
# OMIC (a)
# You can pass to pam a dist object (created using as.dist() on the distance matrix) or directly the matrix.
# In the latter case, set diss=TRUE in pam.
for (i in 1:length(assays)) {
    dist <- 1 - NetPreProc::Prob.norm(similarity_matrix[[i]]);
    clusterings[assays[[i]]] <- list(pam(dist, k = k, diss = TRUE, keep.diss = TRUE));
}

In [None]:
list(pam(dist, k = k, diss = TRUE, keep.diss = TRUE))[1]

In [None]:
# Partitioning around Medoids on M_Mean 
dist <- 1 - NetPreProc::Prob.norm(M_Mean);
clusterings$Meanpam <- pam(dist, k = k, diss = TRUE, keep.diss = TRUE);

In [None]:
# Partitioning around Medoids on M_SNF 
dist <- 1 - M_SNF
clusterings$SNFpam <- pam(dist, k = k, diss = TRUE, keep.diss = TRUE);

In [None]:
# Partitioning around Medoids on M_NEMO 
dist <- 1 - NetPreProc::Prob.norm(M_NEMO);
clusterings$NEMOpam <- pam(dist, k = k, diss = TRUE, keep.diss = TRUE)

In [None]:
# NEMO spectral clustering
clusterings$NEMOspectral$clustering <- nemo.clustering(t_complete, num.clusters = k)

In [None]:
# Spectral clustering on SNF matrix
clusterings$SNFspectral$clustering <- SNFtool::spectralClustering(M_SNF, K = k)

When comparing clusterings of multi-omic data, it is important to consider the specific characteristics and requirements of the data. The following measures are commonly used and considered appropriate for this purpose:

- Variation of Information (VI): This measure is commonly used in multi-omic data analysis as it considers the shared information and entropy between two clusterings. It quantifies the information loss when one clustering is used to represent another, providing insights into the similarity or dissimilarity between the clusterings.

- Adjusted Rand Index (ARI): The ARI is a widely used measure for comparing clusterings in various domains, including multi-omic data analysis. It accounts for chance agreements and provides a normalized similarity score.

- Normalized Mutual Information (NMI): NMI is another commonly employed measure in multi-omic data analysis. It normalizes the mutual information between clusterings by considering the entropy of the clusterings, providing a useful similarity measure.

- Fowlkes-Mallows Index: The Fowlkes-Mallows Index measures the geometric mean of pairwise precision and recall. It can be applicable in comparing clusterings of multi-omic data when considering precision and recall aspects.

- Jaccard Coefficient: The Jaccard Coefficient is a simple pair-counting measure: it is the fraction of sample pairs placed in the same cluster by both clusterings, among the pairs placed in the same cluster by at least one of them. It can be used as a quick similarity measure for multi-omic data clusterings.
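
Two of these measures can be computed directly from the contingency table of two label vectors. The following base R sketch is only an illustration: the function names and the toy labels are hypothetical, and the NMI is normalized by the geometric mean of the two entropies, which is one of several conventions in use:

```r
# Adjusted Rand Index from the contingency table of two labelings
adjusted_rand_index <- function(a, b) {
  tab <- unclass(table(a, b))
  n <- sum(tab)
  sum_cells <- sum(choose(as.vector(tab), 2))
  sum_rows  <- sum(choose(rowSums(tab), 2))
  sum_cols  <- sum(choose(colSums(tab), 2))
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Normalized Mutual Information (geometric-mean normalization)
normalized_mutual_information <- function(a, b) {
  p  <- unclass(table(a, b)) / length(a)
  pa <- rowSums(p)
  pb <- colSums(p)
  nz <- p > 0
  mi <- sum(p[nz] * log(p[nz] / outer(pa, pb)[nz]))
  ha <- -sum(pa * log(pa))
  hb <- -sum(pb * log(pb))
  mi / sqrt(ha * hb)
}

# Toy labelings: "relabeled" is the same partition as "truth" with
# permuted labels, while "mixed" spreads every true group over all clusters
truth     <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
relabeled <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)
mixed     <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)

ari_same  <- adjusted_rand_index(truth, relabeled)
nmi_same  <- normalized_mutual_information(truth, relabeled)
ari_mixed <- adjusted_rand_index(truth, mixed)
nmi_mixed <- normalized_mutual_information(truth, mixed)
```

Both measures are invariant to label permutations, which matters here because the cluster indices returned by PAM have no intrinsic correspondence with the iCluster subtype labels.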