Skip to content

ElodieDarbo/lms_onco

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

Workflow analysis for leiomyosarcoma origin and oncogenesis

This workflow is composed of 11 R scripts organized in 4 main parts. The scripts are ordered and need to be run sequentially. The MASTER_script.R script allows to run the whole pipeline at once. The majority of the scripts depends on the previous script, no intermediate results are kept. Functions used in each part are written in a separate file functions.R.

The aim of the workflow is to reproduce the results presented in the manuscript (BioRxiv preprint: https://www.biorxiv.org/content/10.1101/2020.10.23.352336).

Data

From public databases

  • functional annotations from MSigDB in a gmt format
  • probes annotations for Affymetrix and Agilent platforms
  • cytoband genomic position and chromosome size from hg19 genome version from UCSC
  • miRNA-mRNA interaction from miRTarbase and miRecords formatted and saved in RData format
  • TCGA PANCAN batch corrected miRNA expression
  • TCGA PANCAN batch corrected miRNA clinical data formated and saved as RData
  • TCGA SARC miRNA expression saved as RData
  • TCGA RNA-seq expression
  • TCGA clinical data formated saved in a RData object.
  • TCGA gene level copy number (CN) combined saved in a RData object.
  • TCGA + GTEX RNA-seq combined data saved as RData. This object contains also a vector with tissue of origin of the samples ordered according to the samples in the data.frame and a vector containing the colours associated with each tissue type.
  • List of micro-array data from both Affymetrix and agilent platforms saved in a RData object
  • Gene level copy number estimation from SNP6 and BAC array technologies saved in RData objects

From this study

  • Annotation file for all patients analysed on micro-arrays
  • ICGC clinical data
  • ICGC RNA-seq gene level log2(FPKM+1)
  • ICGC DMD isoform expression
  • ICGC TP53 and RB1 genetic status and PTEN, DMD and ATRX location
  • ICGC breakpoint prediction
  • ICGC somatic mutation prediction formated to be pass to MutationPatterns package

Pre-computed intermediate results

  • ICGC CN estimation data: precomputed from cn.MOPS output. Used script is available in pre_computed.R file
  • Correlation in step 1.1: Consistency of gene expression profiles is measured by computing the correlation of each gene with itself between the 2 platforms and with all the other genes in both platforms. This analysis is time consuming (around 1h30) and a precomputed object is used by default. To re-compute it, change "TRUE" to "FALSE" in the run file.
  • Functional and regulatory feature enrichment: GSEA and iCistarget results. All folders contain html reports allowing to visualize the results. report.html in icistarget_* folders and pos_ and neg_snapshot.html in GSEA/Affy_* folders.
  • Functional annotations by GSAn for top hLMS expressed genes (ribosomal genes excluded, list available in text file) and miRNA targets. A json file is available and can be loaded on GSAn web server to visualize the results.
  • Cell lines log2(FPKM+1) transcriptome files *_FPKM_log2.txt

Scripts

Step 1: Transcriptome analysis

1.1. Detection of LMS groups: harmonization and combination of Affymetrix and Agilent micro-arrays data

This script aims at identify genes with consistent expression profiles between Agilent and Affymetrix platform through 87 patients analyzed on both platforms and harmonized the data to be able to combine them. All datasets are then merged and harmonized. Modules of co-expressed genes are defined and used in a hierarchical clustering. LMSs grouped in a single cluster are compared to others. Definition of hLMS and oLMS in Affymetrix cohort. Differentially expressed genes are defined and used to compute centroids.

  • Produced plots:

    • Figure 1A / S1B: Figure1A_all_samples_clustering
    • Figure 1B: Figure1B_DE_Affymetrix
    • Figure S1B: FigureS1B Venn diagram
    • Figure S1C: FS1_example_transformation_87 line plots
    • Figure S1C: FS1_raw_heatmap_87_samples normalized 1st round
    • Figure S1C: FS1_Qnorm_heatmap_87_samples normalized 2nd round
    • Figure S1C: FS1_harmonized_heatmap_87_samples harmonized
  • Text files produced:

    • functional enrichment of module of co-expression: hypergeometric test: enrichment_cluster_multiplatform_*
    • table with 2 columns: gene symbols and t-scores in decreasing order. This file can be used to reproduce GSEA analysis Affy_t_scores_OCEANCODE.rnk from command line:

    java -cp gsea-3.0.jar -Xmx2048m xtools.gsea.GseaPreranked -gmx /data/ext/KEGG.symbols.gmt.txt -norm meandiv -nperm 1000 -rnk /results/Affy_t_scores_OCEANCODE.rnk -scoring_scheme weighted -rpt_label Affy_t_scores_OCEANCODE -create_svgs false -make_sets true -plot_top_x 20 -rnd_seed timestamp -set_max 500 -set_min 15 -zip_report false -out "/results/GSEA_LMS_DE" -gui false; done

1.2. Visualize results obtained with GSEA and iCistarget

This script allows to parse html reports from GSEA and retrieve the normalized enrichment scores and FDR associated to enriched terms. The median t-scores (hLMS/oLMS) for all genes belonging to a given term is computed and the number of significantly differentially expressed genes are counted. Related enriched terms were manually grouped in GSEA_*_table_group files in the precomp folder. In iCistarget analysis, the report.html are parsed. Manual grouping is also made at this step.

  • produced plot:
    • Figure 1D: Figure1D_GSEA_functional_annots
    • Figure 1E: Figure1E_cistarget_results

1.3. Classify patients from ICGC and TCGA cohorts

Correlation to centroids obtained in the previous step is computed for each TCGA and ICGC patients. A Gaussian mixture model is computed on distances to the hLMS centroid from all patients (Affymetric cohort included) and patients are classified. Clinical associations are then tested and metastasis free survival is evaluated.

  • Produced plots:

    • Figure 1F: Figure1F_centroid_classification_mclust
    • Figure 1C: Figure1C_Affy_MFS_surv
    • Figure 5A: Figure5A_OCs_centroids.pdf
  • Text files produced:

    • Table 1: Table1_Affymetrix_clinical.tab
    • Table 2: Table2_ICGC_TCGA_clinical.tab
    • Table S2A: TS2A_stat_table_all_cohorts.tab

1.4. Evaluate to which normal tissue each LMS group is the closest

Using TCGA and GTEX combined data, median gene expression is computed for the hLMS group and the top 100 are selected to cluster patients using tSNE method. Median gene expression of those genes is computed for all tissue groups and visualized with an heatmap.

  • Produced plots:
    • Figure S2A: FS2A_tSNE_top100_hLMS_GTEX
    • Figure S2B: FS2B_heatmap_top100_median_tissue

Functions

  • stat.probes : compute Pearson's correlations from 2 matrices containing the same patients and genes as input select.genes.by.cor : select genes having a correlation above >0.8 with itself between platforms or better then with any other gene.
  • harmonize : median normalization for the 87 patients plot.couple.gene : plot gene expression profiles for 2 genes for a given expression matrix
  • harmonize.2 : generalized version of median normalization between platforms
  • import.clinical : Parse and format clinical annotations for micro-arrays analyzed patients
  • readGSEA : parse HTML GSEA output
  • import.icistarget : parse HTML iCistarget output
  • clinical.fisher : compute Fisher's exact test for categorical data, one versus all other
  • t.score : compute t-scores and t-test p-values from gene expression data between 2 groups
  • computeCentroids: compute centroids based on cit.dfAggregate
  • classify_LMS: compute Spearman's correlation to centroids

Step 2: Copy number variation analysis

2.1. Analyze copy number alteration recurrences in hLMS and oLMS and focus on MYOCD locus

During this step, frequencies of homozygous (deletion), heterozygous (loss) losses, gains of one copy or amplifications (gains of more than one copy) are computed. These frequencies were computed in separated and merged cohorts, frequency profiles were compared between cohorts to evaluate the consistency using a linear regression model. Enrichment of gene alteration events on the merged cohort is analyzed using a Fisher's exact test comparing hLMS and oLMS. Enrichment of significant events per cytogenic band is estimated using Fisher's exact test. MYOCD expression level is categorized by cohort according to the whole transcriptome signal distribution.

  • Produced plot:

    • Figure 3A: Figure3A_heatmap_CN_puce_available_genes
    • Figure 3B: Figure3B_puces_avail_all_hLMS
    • Figure 3B: Figure3B_puces_avail_all_oLMS
    • Figure 4A: Figure4A_zoom_MYOCD
    • Figure 4B: Figure4B_MYOCD_expression_per_cohort
    • Figure S4A: FS4A_heatmap_rsquared_CN
    • Figure S4B: FS4B_MYOCD_CN_amplified_expression
    • Figure S4C: FS4C_cumul_MYOCD_expr
  • Text files produced:

    • Table S3A: TS3_CN_per_cytoband.tab
    • Table S3B: TS3_CN_per_genes.tab

Functions

  • copy.test: Assign numeric values to predicted copy numbers (A: 4, amplification; G: 3, gain; N: 2, normal; P and L: 1, heterozygous loss; D: 0, homozygous loss). If a gene locus is assigned with more than one status, the lowest copy number is kept.
  • comp.rsquared: compute linear regression per event type and return R-squared values
  • percentage.events: compute percentage of events in hLMS and oLMS. The total number of patients per gene do not count missing values.
  • percentage.all: compute global percentage of event. The total number of patients per gene do not count missing values.
  • import.copyNumber.by.genes: compute counting of events per group and Fisher's exact test if parameter fisher set to true.
  • enrich.CN: function using import.copyNumber.by.genes and allowing to choose the reference group with parameter ref.

Step 3: miRNA transcriptome profiling

3.1. Analysis of SARC TCGA normalized mature miRNA data

Differential expression is computed with miRComb package, using limma method and Holm's multi-testing correction. miRNA with less than one normalized reads in at least one group are discarded.

  • Produced plot:
    • Figure 2A: Figure2A_PCA_TCGA_miRNA
    • Figure 2C: Figure2C_heatmap_diff_miRNA_cr_TCGA

3.2. Analysis of ICGC miRNA data

Raw counts are submitted to edgeR package for filtering, normalization and differential expression analysis.

  • Produced plot:
    • Figure 2A: Figure2A_PCA_ICGC_miRNA
    • Figure 2C: Figure2C_heatmap_diff_miRNA_centered_reduced_ICGC

3.3. Comparison between TCGA and ICGC miRNA differential expression

Log Fol Changes (logFC) estimated during the previous steps are compared using a linear regression model. Results of each analysis are merged. logFC and median expression of miRNA from the DIO3-DLK1 miRNA cluster on chromosome 14 which are not significantly differentially expressed (DE) are evaluated.

  • Produced plot:

    • Figure 2A: Fgure2A_XY_logFC_ICGC_TCGA_miRNA
    • Figure S3AB: FigureS3AB_boxplots_DIO3_DLK1_no_sig
  • Text files produced:

    • Table S2B: miRNA_ICGC_TCGA_DE.tab

3.4. Pan cancer miRNA profiling

Differentially expressed miRNAs in both cohort and DIO3-DLK1 miRNAs are used to cluster patients from TCGA PANCAN data. Distribution of cancer histotypes in clusters is evaluated.

  • Produced plot:
    • Figure 2B: Figure2B_heatmap_diff_miRNA_all_TCGA
    • Figure 2B: Figure2B_PANCAN_cluster_diff_histotype
    • Figure S3C: FS3C_heatmap_DIO3_miRNA_all_TCGA
    • Figure S3C: FS3C_PANCAN_cluster_DIO3_histotype
    • Figure S3C: points_cluster_DIO3_DLK1_all_TCGA

3.5. Visualization of miRNA target functional mapping (GSAn)

miRComb algorithm is applied to TCGA and ICGC cohorts independently together with validated miRNA-mRNA interactions from public databases. Results were merged. Validated targets from databases which were found to significantly anti-correlate in both cohort were analyzed with GSAn method. This script helps in visualize this result.

  • Produced plot:

    • Figure2D: Figure2D_GSAN_miRNA_targets
  • Text files produced:

    • Table S2C: miRComb_table.tab
    • test.mRNA.miRNA.txt: selected targets that has been analyzed with GSAn

Functions

  • run_mirComb: Apply classic miRComb pipeline: apply differential expression analysis on mRNAs and miRNAs (limma), select on significance (p-value < 0.01 and absolute logFC > 1), compute correlation between mRNA and miRNA expression and compute a p-value on Pearson's correlation score and correct it with FDR approach, add public database information.

Step 4: Single nucleotide variant analysis

The VCFs from ICGC containing somatic variants were analysed with MutationPattarns R package following their suggested pipeline: compute 96 trinucleotides matrix, compare with COSMIC signatures (cosine similarity) and estimate COSMIC signature contribution for each patients. Only those with a cosine similarity above 0.75 are reported.

Enrichment for specific allele status (TP53 and RB1) and protein cellular localization (DMD, PTEN, ATRX) was estimated using a Fisher's exact test comparing one category to all other possible categories.

4.1 Single nucleotide variant analysis

  • Produced plot:
    • Figure 4C: Figure4C_cat_alteration_categories
    • Figure 4D: Figure4D_DMD_isoforms
    • Figure 4E: Figure4E_loc_alteration_categories
    • Figure S4D: FigureS4D_TBM_BP
    • Figure S4E: FigureS4E_cosine_cosmic_signature
    • Figure S4E: FigureS4E_contribution_cosmic_signature_barplot

Functions

  • plot.prop.mutation: plot percentages of allele status or protein localization occurrence per LMS group.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published