This workflow is composed of 11 R scripts organized in 4 main parts. The scripts are ordered and need to be run sequentially. The MASTER_script.R script allows to run the whole pipeline at once. The majority of the scripts depends on the previous script, no intermediate results are kept. Functions used in each part are written in a separate file functions.R.
The aim of the workflow is to reproduce the results presented in the manuscript (BioRxiv preprint: https://www.biorxiv.org/content/10.1101/2020.10.23.352336).
- functional annotations from MSigDB in a gmt format
- probes annotations for Affymetrix and Agilent platforms
- cytoband genomic position and chromosome size from hg19 genome version from UCSC
- miRNA-mRNA interaction from miRTarbase and miRecords formatted and saved in RData format
- TCGA PANCAN batch corrected miRNA expression
- TCGA PANCAN batch corrected miRNA clinical data formated and saved as RData
- TCGA SARC miRNA expression saved as RData
- TCGA RNA-seq expression
- TCGA clinical data formated saved in a RData object.
- TCGA gene level copy number (CN) combined saved in a RData object.
- TCGA + GTEX RNA-seq combined data saved as RData. This object contains also a vector with tissue of origin of the samples ordered according to the samples in the data.frame and a vector containing the colours associated with each tissue type.
- List of micro-array data from both Affymetrix and agilent platforms saved in a RData object
- Gene level copy number estimation from SNP6 and BAC array technologies saved in RData objects
- Annotation file for all patients analysed on micro-arrays
- ICGC clinical data
- ICGC RNA-seq gene level log2(FPKM+1)
- ICGC DMD isoform expression
- ICGC TP53 and RB1 genetic status and PTEN, DMD and ATRX location
- ICGC breakpoint prediction
- ICGC somatic mutation prediction formated to be pass to MutationPatterns package
- ICGC CN estimation data: precomputed from cn.MOPS output. Used script is available in pre_computed.R file
- Correlation in step 1.1: Consistency of gene expression profiles is measured by computing the correlation of each gene with itself between the 2 platforms and with all the other genes in both platforms. This analysis is time consuming (around 1h30) and a precomputed object is used by default. To re-compute it, change "TRUE" to "FALSE" in the run file.
- Functional and regulatory feature enrichment: GSEA and iCistarget results. All folders contain html reports allowing to visualize the results. report.html in icistarget_* folders and pos_ and neg_snapshot.html in GSEA/Affy_* folders.
- Functional annotations by GSAn for top hLMS expressed genes (ribosomal genes excluded, list available in text file) and miRNA targets. A json file is available and can be loaded on GSAn web server to visualize the results.
- Cell lines log2(FPKM+1) transcriptome files *_FPKM_log2.txt
1.1. Detection of LMS groups: harmonization and combination of Affymetrix and Agilent micro-arrays data
This script aims at identify genes with consistent expression profiles between Agilent and Affymetrix platform through 87 patients analyzed on both platforms and harmonized the data to be able to combine them. All datasets are then merged and harmonized. Modules of co-expressed genes are defined and used in a hierarchical clustering. LMSs grouped in a single cluster are compared to others. Definition of hLMS and oLMS in Affymetrix cohort. Differentially expressed genes are defined and used to compute centroids.
-
Produced plots:
- Figure 1A / S1B: Figure1A_all_samples_clustering
- Figure 1B: Figure1B_DE_Affymetrix
- Figure S1B: FigureS1B Venn diagram
- Figure S1C: FS1_example_transformation_87 line plots
- Figure S1C: FS1_raw_heatmap_87_samples normalized 1st round
- Figure S1C: FS1_Qnorm_heatmap_87_samples normalized 2nd round
- Figure S1C: FS1_harmonized_heatmap_87_samples harmonized
-
Text files produced:
- functional enrichment of module of co-expression: hypergeometric test: enrichment_cluster_multiplatform_*
- table with 2 columns: gene symbols and t-scores in decreasing order. This file can be used to reproduce GSEA analysis Affy_t_scores_OCEANCODE.rnk from command line:
java -cp gsea-3.0.jar -Xmx2048m xtools.gsea.GseaPreranked -gmx /data/ext/KEGG.symbols.gmt.txt -norm meandiv -nperm 1000 -rnk /results/Affy_t_scores_OCEANCODE.rnk -scoring_scheme weighted -rpt_label Affy_t_scores_OCEANCODE -create_svgs false -make_sets true -plot_top_x 20 -rnd_seed timestamp -set_max 500 -set_min 15 -zip_report false -out "/results/GSEA_LMS_DE" -gui false; done
- Two files, with significantly up- and down- regulated genes hLMS_genes_*.txt which can be used to reproduce iCistarget results from web site: https://gbiomed.kuleuven.be/apps/lcb/i-cisTarget/
This script allows to parse html reports from GSEA and retrieve the normalized enrichment scores and FDR associated to enriched terms. The median t-scores (hLMS/oLMS) for all genes belonging to a given term is computed and the number of significantly differentially expressed genes are counted. Related enriched terms were manually grouped in GSEA_*_table_group files in the precomp folder. In iCistarget analysis, the report.html are parsed. Manual grouping is also made at this step.
- produced plot:
- Figure 1D: Figure1D_GSEA_functional_annots
- Figure 1E: Figure1E_cistarget_results
Correlation to centroids obtained in the previous step is computed for each TCGA and ICGC patients. A Gaussian mixture model is computed on distances to the hLMS centroid from all patients (Affymetric cohort included) and patients are classified. Clinical associations are then tested and metastasis free survival is evaluated.
-
Produced plots:
- Figure 1F: Figure1F_centroid_classification_mclust
- Figure 1C: Figure1C_Affy_MFS_surv
- Figure 5A: Figure5A_OCs_centroids.pdf
-
Text files produced:
- Table 1: Table1_Affymetrix_clinical.tab
- Table 2: Table2_ICGC_TCGA_clinical.tab
- Table S2A: TS2A_stat_table_all_cohorts.tab
Using TCGA and GTEX combined data, median gene expression is computed for the hLMS group and the top 100 are selected to cluster patients using tSNE method. Median gene expression of those genes is computed for all tissue groups and visualized with an heatmap.
- Produced plots:
- Figure S2A: FS2A_tSNE_top100_hLMS_GTEX
- Figure S2B: FS2B_heatmap_top100_median_tissue
- stat.probes : compute Pearson's correlations from 2 matrices containing the same patients and genes as input select.genes.by.cor : select genes having a correlation above >0.8 with itself between platforms or better then with any other gene.
- harmonize : median normalization for the 87 patients plot.couple.gene : plot gene expression profiles for 2 genes for a given expression matrix
- harmonize.2 : generalized version of median normalization between platforms
- import.clinical : Parse and format clinical annotations for micro-arrays analyzed patients
- readGSEA : parse HTML GSEA output
- import.icistarget : parse HTML iCistarget output
- clinical.fisher : compute Fisher's exact test for categorical data, one versus all other
- t.score : compute t-scores and t-test p-values from gene expression data between 2 groups
- computeCentroids: compute centroids based on cit.dfAggregate
- classify_LMS: compute Spearman's correlation to centroids
During this step, frequencies of homozygous (deletion), heterozygous (loss) losses, gains of one copy or amplifications (gains of more than one copy) are computed. These frequencies were computed in separated and merged cohorts, frequency profiles were compared between cohorts to evaluate the consistency using a linear regression model. Enrichment of gene alteration events on the merged cohort is analyzed using a Fisher's exact test comparing hLMS and oLMS. Enrichment of significant events per cytogenic band is estimated using Fisher's exact test. MYOCD expression level is categorized by cohort according to the whole transcriptome signal distribution.
-
Produced plot:
- Figure 3A: Figure3A_heatmap_CN_puce_available_genes
- Figure 3B: Figure3B_puces_avail_all_hLMS
- Figure 3B: Figure3B_puces_avail_all_oLMS
- Figure 4A: Figure4A_zoom_MYOCD
- Figure 4B: Figure4B_MYOCD_expression_per_cohort
- Figure S4A: FS4A_heatmap_rsquared_CN
- Figure S4B: FS4B_MYOCD_CN_amplified_expression
- Figure S4C: FS4C_cumul_MYOCD_expr
-
Text files produced:
- Table S3A: TS3_CN_per_cytoband.tab
- Table S3B: TS3_CN_per_genes.tab
- copy.test: Assign numeric values to predicted copy numbers (A: 4, amplification; G: 3, gain; N: 2, normal; P and L: 1, heterozygous loss; D: 0, homozygous loss). If a gene locus is assigned with more than one status, the lowest copy number is kept.
- comp.rsquared: compute linear regression per event type and return R-squared values
- percentage.events: compute percentage of events in hLMS and oLMS. The total number of patients per gene do not count missing values.
- percentage.all: compute global percentage of event. The total number of patients per gene do not count missing values.
- import.copyNumber.by.genes: compute counting of events per group and Fisher's exact test if parameter fisher set to true.
- enrich.CN: function using import.copyNumber.by.genes and allowing to choose the reference group with parameter ref.
Differential expression is computed with miRComb package, using limma method and Holm's multi-testing correction. miRNA with less than one normalized reads in at least one group are discarded.
- Produced plot:
- Figure 2A: Figure2A_PCA_TCGA_miRNA
- Figure 2C: Figure2C_heatmap_diff_miRNA_cr_TCGA
Raw counts are submitted to edgeR package for filtering, normalization and differential expression analysis.
- Produced plot:
- Figure 2A: Figure2A_PCA_ICGC_miRNA
- Figure 2C: Figure2C_heatmap_diff_miRNA_centered_reduced_ICGC
Log Fol Changes (logFC) estimated during the previous steps are compared using a linear regression model. Results of each analysis are merged. logFC and median expression of miRNA from the DIO3-DLK1 miRNA cluster on chromosome 14 which are not significantly differentially expressed (DE) are evaluated.
-
Produced plot:
- Figure 2A: Fgure2A_XY_logFC_ICGC_TCGA_miRNA
- Figure S3AB: FigureS3AB_boxplots_DIO3_DLK1_no_sig
-
Text files produced:
- Table S2B: miRNA_ICGC_TCGA_DE.tab
Differentially expressed miRNAs in both cohort and DIO3-DLK1 miRNAs are used to cluster patients from TCGA PANCAN data. Distribution of cancer histotypes in clusters is evaluated.
- Produced plot:
- Figure 2B: Figure2B_heatmap_diff_miRNA_all_TCGA
- Figure 2B: Figure2B_PANCAN_cluster_diff_histotype
- Figure S3C: FS3C_heatmap_DIO3_miRNA_all_TCGA
- Figure S3C: FS3C_PANCAN_cluster_DIO3_histotype
- Figure S3C: points_cluster_DIO3_DLK1_all_TCGA
miRComb algorithm is applied to TCGA and ICGC cohorts independently together with validated miRNA-mRNA interactions from public databases. Results were merged. Validated targets from databases which were found to significantly anti-correlate in both cohort were analyzed with GSAn method. This script helps in visualize this result.
-
Produced plot:
- Figure2D: Figure2D_GSAN_miRNA_targets
-
Text files produced:
- Table S2C: miRComb_table.tab
- test.mRNA.miRNA.txt: selected targets that has been analyzed with GSAn
- run_mirComb: Apply classic miRComb pipeline: apply differential expression analysis on mRNAs and miRNAs (limma), select on significance (p-value < 0.01 and absolute logFC > 1), compute correlation between mRNA and miRNA expression and compute a p-value on Pearson's correlation score and correct it with FDR approach, add public database information.
The VCFs from ICGC containing somatic variants were analysed with MutationPattarns R package following their suggested pipeline: compute 96 trinucleotides matrix, compare with COSMIC signatures (cosine similarity) and estimate COSMIC signature contribution for each patients. Only those with a cosine similarity above 0.75 are reported.
Enrichment for specific allele status (TP53 and RB1) and protein cellular localization (DMD, PTEN, ATRX) was estimated using a Fisher's exact test comparing one category to all other possible categories.
- Produced plot:
- Figure 4C: Figure4C_cat_alteration_categories
- Figure 4D: Figure4D_DMD_isoforms
- Figure 4E: Figure4E_loc_alteration_categories
- Figure S4D: FigureS4D_TBM_BP
- Figure S4E: FigureS4E_cosine_cosmic_signature
- Figure S4E: FigureS4E_contribution_cosmic_signature_barplot
- plot.prop.mutation: plot percentages of allele status or protein localization occurrence per LMS group.