Merge pull request #110 from ARTbio/week-6-1

GSEA Part
ARTbio · Mar 4, 2024 · 1c660c3
2 parents 162949f + 7f1d8fe
commit 1c660c3
Showing 13 changed files with 2,249 additions and 48 deletions.
diff --git a/docs/bulk_RNAseq-IOC/32_GOseq.md b/docs/bulk_RNAseq-IOC/32_GOseq.md
@@ -113,7 +113,7 @@ your background genes.
         --> `fields`
     - List of Fields
 
-        --> `boolean`
+        --> `column 1` and `column 3`
     - `Run Tool`
 
 :warning: As this is the last step of the construction of the gene set lists, you should

diff --git a/docs/bulk_RNAseq-IOC/33_exercices_week_05_review.md b/docs/bulk_RNAseq-IOC/33_exercices_week_05_review.md
@@ -1,30 +1,3 @@
-## Issues with Slack ?
+## Issues with `goseq` ?
 
-## Issues with GitHub ?
-- [x] Does everyone have a GitHub ID ? 
-- [x] Was everyone able to create a readme file and make a pull request to the repository
-      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
-- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
-      generated during the first online meeting, with an extension .ga) and to add it in
-      the repository
-      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
-
-## Data upload in PSILO, then in Galaxy from Psilo
-- [x] Did everyone upload the necessary data in its
-      [PSILO account](https://psilo.sorbonne-universite.fr) ?
-- [x] Did everyone succeed to create direct download links ? 
-- [x] Did everyone succeed to transfer its PSILO data into a Galaxy story `Input dataset`
-      in its Galaxy account ?
-
-## Issues following the Galaxy training ?
-
-[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)
-
-- Check whether `Relabel identifiers` tool is understood
-
-- Check whether `Extract element identifiers` tool is understood. Is the output dataset
-  from this tool uploaded in the appropriate GitHub folder ?
-
-## Check input datasets histories of the participants
-
-... and their ability to create appropriate collection for the analysis
+## Issues with GO files (N. gono and A. meli)
diff --git a/docs/bulk_RNAseq-IOC/34_GSEA_intro.md b/docs/bulk_RNAseq-IOC/34_GSEA_intro.md
@@ -1,17 +1,125 @@
-![](images/lamp.png)
+# Gene Set Enrichment Analysis
 
-# Analysis of functional enrichment among the differentially expressed genes
+## Definition and Rationale Behind Gene Set Enrichment Analysis
 
-We have extracted genes that are differentially expressed in treated (Pasilla gene-depleted)
-samples compared to untreated samples. We would like to know if there are categories of
-genes that are enriched among the differentially expressed genes.
+Gene Set Enrichment Analysis (GSEA) is a powerful computational method used in
+bioinformatics to interpret gene expression data in the context of biological pathways,
+processes, or sets of functionally related genes.
 
-Gene Ontology (GO) analysis is widely used to reduce complexity and highlight biological
-processes in genome-wide expression studies.
+Unlike traditional methods that focus on individual genes, GSEA evaluates the **coordinated
+expression changes** of predefined gene sets, providing a more holistic view of molecular
+mechanisms underlying experimental conditions or phenotypes.
 
-However, standard methods give biased results on RNA-seq data due to over-detection
-of differential expression for long and highly-expressed transcripts.
+### Definition of GSEA
+  - [x] GSEA assesses whether predefined sets of genes show statistically significant,
+  concordant differences between two biological states (e.g., treatment vs. control,
+  diseased vs. healthy).
+  - [x] Rather than focusing on individual genes, GSEA operates on gene sets, which can
+  represent pathways, molecular functions, cellular processes, or ==other biologically
+  relevant groups of genes==. This last case actually represents the most common use of
+  GSEA. Many groups of genes, sometimes also called "gene signatures" or "molecular
+  signatures" are available in databases or published articles.
+  - [x] It ranks all genes based on their expression changes between experimental
+  conditions and then tests whether genes within a gene set tend to appear towards the top
+  (or bottom) of the ranked list more than expected by chance.
 
-The goseq tool provides methods for performing GO analysis of RNA-seq data,
-taking length bias into account. The methods and software used by goseq are equally
-applicable to other category based tests of RNA-seq data, such as KEGG pathway analysis.
+### Rationale Behind GSEA
+  - [x] **Biological Context**: GSEA acknowledges that genes rarely act in isolation but
+  rather function in coordinated networks and pathways. Analyzing gene sets helps
+  contextualize gene expression changes within the framework of biological processes.
+  - [x] **Statistical Power**: By aggregating signals from groups of genes, GSEA enhances
+  statistical power to detect subtle but coordinated changes in gene expression that might
+  be missed when analyzing individual genes.
+  - [x] **Reduction of Multiple Testing Burden**: GSEA reduces the multiple testing burden
+  associated with examining thousands of individual genes by focusing on predefined gene
+  sets. This reduces the risk of false positives and improves the reliability of results.
+  - [x] **Interpretability**: GSEA provides interpretable results by associating gene
+  expression changes with known biological pathways or processes, enabling researchers to
+  generate hypotheses and gain insights into the underlying biology.
+  - [x] **Robustness Across Platforms**: GSEA is platform-independent and can be applied
+  to various types of gene expression data, including microarray and RNA sequencing
+  (RNA-seq) data, making it widely applicable across different experimental settings and
+  datasets.
+
+### Key Features of GSEA
+  - [x] **Enrichment Score**: Measures the degree to which a gene set is overrepresented
+  at the top or bottom of the ranked gene list.
+  - [x] **Normalized Enrichment Score (NES)**: Corrects for gene set size and data set
+  size, facilitating comparison of results across different datasets.
+  - [x] **False Discovery Rate (FDR)**: Estimates the proportion of false positive results
+  among significant findings, controlling for multiple testing.
+
+In summary, GSEA offers a systematic and biologically meaningful approach to analyze gene expression data, enabling researchers to uncover key molecular pathways and processes associated with different experimental conditions or phenotypes. Its ability to integrate complex genomic data with prior biological knowledge makes it a valuable tool in deciphering the mechanisms underlying biological phenomena and disease states.
+
+## How to perform GSEA ?
+
+### A video presentation by Katherine West (University of Glasgow) 
+We strongly advise you to look at the excellent
+[presentation by Katherine West](https://youtu.be/KY6SS4vRchY?si=cxbHjHdXdjc7uE-4){:target="_blank"}.
+The aspects that have been presented above are all taken up and illustrated with graphics
+in a very educational way.
+
+### Practical focus: computation of Enrichment Score (ES)
+
+The Enrichment Score (ES) is central in Gene Set Enrichment Analysis (GSEA) since it
+quantifies the degree to which a gene set is overrepresented at the top or bottom of a
+ranked list of genes based on their expression changes between two biological conditions.
+
+The computation of the Enrichment Score involves several steps:
+
+1. **Ranking Genes**: The first step is to rank all genes in the dataset based on a metric
+that reflects their differential expression between the two biological conditions. This
+metric is most often the fold change, but it could be a t-statistic or any other relevant
+statistical measure.
+
+2. **Cumulative Sum Calculation**: The Enrichment Score is calculated by walking down the
+      ranked list of genes, accumulating a running sum statistic. At each step, the running sum
+      is increased when a gene belongs to the gene set being evaluated and decreased otherwise.
+      The running sum captures the degree of enrichment of the gene set at that point in the
+      ranked list.
+
+      The way the running sum is increased when a gene belongs to the gene set being evaluated
+      and decreased otherwise varies depending on the GSEA implementation (there are several).
+      What you need to remember is that increment and decrement are never calculated
+      symmetrically.
+
+      A simple example of running sum calculation is to add the GSEA metric (e.g. fold change)
+      when a gene belongs to the gene set being evaluated and to subtract a fixed value that
+      depends on the total number of genes in the ranked gene list (e.g. 1 / N) otherwise.
+      This fixed value is typically referred to as the "penalty" or "decay" factor.
+
+      The rationale behind using a penalty or decay factor is to adjust the running sum to
+      account for the fact that genes not belonging to the gene set can still contribute to the
+      overall distribution of scores. This adjustment helps to prevent the running sum from
+      being overly biased by the presence or absence of genes in the gene set.
+
+3. **Peak Enrichment Score**: The Enrichment Score is the value of the running sum at its
+maximum deviation from zero (positive or negative). This peak reflects the
+enrichment of the gene set at a particular position in the ranked list.
+
+4. **Normalization of Enrichment Score**: To make Enrichment Scores comparable across
+different gene sets and datasets, the Enrichment Score is normalized. This normalization
+accounts for differences in gene set size and dataset size. One common normalization
+method is to divide the Enrichment Score by the mean enrichment score from permuted
+datasets.
+
+5. **Estimation of Significance**: The significance of the Enrichment Score is assessed
+through permutation testing. This involves repeatedly permuting the gene labels to
+generate a null distribution of Enrichment Scores (i.e. computing many Enrichment Scores
+from gene sets randomly sampled from the total gene list). The observed Enrichment Score
+is then compared to the null distribution to determine its statistical significance,
+typically reported as a nominal p-value or false discovery rate (FDR).
+
+
+Overall, the Enrichment Score provides a quantitative measure of the degree to which a
+predefined gene set is enriched towards the top or bottom of a ranked list of genes,
+indicating the collective expression behavior of genes within that set under different
+experimental conditions. It enables the identification of biologically relevant gene sets
+associated with specific phenotypes or experimental treatments in gene expression studies.
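The running-sum computation described above can be sketched in Python as follows. This is a minimal illustration, not the code of any particular GSEA implementation: it assumes the classic weighted scheme in which each hit adds the absolute value of its metric divided by the sum over all hits, and each miss subtracts `1 / (N - N_hits)`. The function name `enrichment_score` and the toy gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Return (ES, running-sum profile) for one gene set.

    ranked_genes: list of (gene, metric) pairs, sorted by decreasing metric.
    gene_set: set of gene identifiers of the same type as in ranked_genes.
    Assumption: hits are weighted by |metric|, misses by a fixed penalty.
    """
    hits = [gene in gene_set for gene, _ in ranked_genes]
    n_total = len(ranked_genes)
    n_hits = sum(hits)
    hit_weight_sum = sum(abs(m) for (_, m), h in zip(ranked_genes, hits) if h)
    if n_hits == 0 or n_hits == n_total or hit_weight_sum == 0:
        raise ValueError("gene set must overlap the ranked list partially, "
                         "with at least one non-zero metric")

    miss_penalty = 1.0 / (n_total - n_hits)  # fixed decrement for each miss
    running = 0.0
    profile = []
    for (_, metric), is_hit in zip(ranked_genes, hits):
        if is_hit:
            running += abs(metric) / hit_weight_sum  # asymmetric increment
        else:
            running -= miss_penalty
        profile.append(running)

    # The ES is the running-sum value with the largest deviation from zero.
    es = max(profile, key=abs)
    return es, profile


# Toy ranked list (hypothetical genes and fold changes).
toy_ranking = [("A", 3.0), ("B", 2.0), ("C", 1.0),
               ("D", -1.0), ("E", -2.0), ("F", -3.0)]
es, profile = enrichment_score(toy_ranking, {"A", "B"})
print(es, profile)
```

With this weighting, the running sum always returns to zero at the end of the list, so the sign of the ES indicates whether the gene set is concentrated at the top or at the bottom of the ranking.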
+
+## The main resource for GSEA
+
+* GSEA software: [https://www.gsea-msigdb.org/gsea/msigdb](https://www.gsea-msigdb.org/gsea/msigdb)
+    * Provides a user-friendly platform for performing GSEA analysis.
+    * Provides access to a large database of curated gene sets in various formats, including
+    the GMT format (.gmt files), which is the format that we are going to use in this IOC.
diff --git a/docs/bulk_RNAseq-IOC/35_GSEA_1.md b/docs/bulk_RNAseq-IOC/35_GSEA_1.md
@@ -1,3 +1,96 @@
-![](images/galaxylogo.png)
+# fGSEA
 
-# GSEA exercices part 1
+fgsea is a Bioconductor package for fast preranked gene set enrichment analysis (GSEA) which
+has been "wrapped" for use in the Galaxy framework.
+Like all GSEA approaches, fgsea implements an algorithm for cumulative GSEA-statistic
+calculation. We will use it in the standard way, i.e. basing our metric on ==fold changes==
+computed using the DESeq2 Galaxy tool.
+
+## fgsea inputs
+
+#### 1. The collection of DESeq2 DE tables (with headers)
+fgsea first requires a two-column file containing a ranked list of genes. The first column
+must contain the gene identifiers and the second column the statistic used to rank. Gene
+identifiers ==must be unique== (not repeated) within the file and must be of ==the same type==
+as the ==identifiers in the Gene Sets file==.
+
+Since what is expected is in the form of
+
+| Symbol | Ranked Stat |
+|---|---|
+| VDR | 67.198 |
+| IL20RA | 65.963 |
+| MPHOSPH10 | 51.353 |
+| RCAN1 | 50.269 |
+| HILPDA | 50.015 |
+| TSC22D3 | 47.496 |
+| FAM107B | 45.926 |
+
+and since our DESeq2 tables contain only Ensembl identifiers, we will start from the DESeq2
+collection and replace Ensembl identifiers with gene symbols (using a table generated in the
+previous section).
+
+--> Thus, copy the collection `DESeq2 Results Tables` from the history `PRJNA630433
+DESeq2 analysis` in a new history that you will name `PRJNA630433 fgsea`
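The conversion from a DESeq2 table to the two-column ranked list expected by fgsea can be sketched in Python as below. This is only an illustration of the logic, not what the Galaxy workflow executes. The column names `GeneID` and `log2FC`, the inline toy data, and the rule used to resolve duplicated symbols (keep the value with the largest absolute fold change) are all assumptions made for this example.

```python
import csv
import io


def build_ranked_list(deseq_tsv, id_to_symbol_tsv,
                      id_col="GeneID", stat_col="log2FC"):
    """Return a list of (symbol, statistic) sorted by decreasing statistic."""
    # Two-column mapping: Ensembl identifier -> gene symbol.
    # A header row, if present, simply becomes an unused entry.
    mapping = {}
    for row in csv.reader(io.StringIO(id_to_symbol_tsv), delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0]] = row[1]

    best = {}
    for row in csv.DictReader(io.StringIO(deseq_tsv), delimiter="\t"):
        symbol = mapping.get(row[id_col])
        if symbol is None or row[stat_col] in ("", "NA"):
            continue  # no symbol available, or no usable statistic
        stat = float(row[stat_col])
        # Symbols must be unique: keep the largest absolute value (assumption).
        if symbol not in best or abs(stat) > abs(best[symbol]):
            best[symbol] = stat

    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy tables (tab-separated, with headers).
toy_deseq = "GeneID\tlog2FC\nID1\t2.5\nID2\t-1.0\nID3\tNA\nID4\t0.5\n"
toy_map = "EnsemblID\tGeneSymbol\nID1\tGeneA\nID2\tGeneB\nID3\tGeneC\n"
for symbol, stat in build_ranked_list(toy_deseq, toy_map):
    print(symbol, stat, sep="\t")
```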
+
+#### 2. The table `EnsemblID-GeneSymbol table`
+
+As mentioned above.
+
+Note that you have generated this table in the previous section.
+
+It is also available in the data library `IOC_bulk_RNAseq / Mouse reference files`, as well
+as in your own data library (if you followed the instructions).
+
+--> Copy `EnsemblID-GeneSymbol table` in your history `PRJNA630433 fgsea`
+
+#### 3. One or several GMT files
+
+GMT (Gene Matrix Transposed) files are available at
+https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp
+
+They are tabular files looking like:
+
+| HALLMARK_APOPTOSIS | http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_APOPTOSIS | CASP3 | CASP9 | ... |
+|---|---|---|---|---|
+| HALLMARK_HYPOXIA | http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_HYPOXIA | PGK1 | PDK1 | ... |
+
+Note that in such a file, each line represents a gene set. The first two columns identify
+the gene set (its name and its description URL). For each line, the number of columns is
+otherwise variable, with one column for each symbol of a gene belonging to the gene set. Thus,
+the number of these extra columns, starting at column 3, reflects the number of genes in the
+gene set.
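To make the layout of a GMT file concrete, here is a minimal Python sketch of a parser. It assumes tab-separated fields as described above; the function name `parse_gmt` is hypothetical, and the sample text only reuses the two truncated example lines shown in the table.

```python
def parse_gmt(text):
    """Parse GMT content into a dict: gene set name -> list of gene symbols."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"not a GMT line: {line!r}")
        name = fields[0]   # column 1: gene set name
        # column 2 (fields[1]) is the description URL, not needed here
        genes = [gene for gene in fields[2:] if gene]  # column 3 onwards
        gene_sets[name] = genes
    return gene_sets


sample = (
    "HALLMARK_APOPTOSIS\thttp://www.broadinstitute.org/gsea/msigdb/cards/"
    "HALLMARK_APOPTOSIS\tCASP3\tCASP9\n"
    "HALLMARK_HYPOXIA\thttp://www.broadinstitute.org/gsea/msigdb/cards/"
    "HALLMARK_HYPOXIA\tPGK1\tPDK1\n"
)
for name, genes in parse_gmt(sample).items():
    print(name, len(genes), genes)
```

Because each line has its own number of columns, a GMT file is easier to read line by line, as done here, than with a tool that expects a rectangular table.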
+
+We have downloaded several GMT files on purpose from
+https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp and made these files available
+from the data library `IOC_bulk_RNAseq / Mouse reference files`.
+
+--> Copy the following datasets in the history `PRJNA630433 fgsea`:
+
+- [x] dendritic.gmt
+- [x] glycolysis.gmt
+- [x] monocyte_OR_macrophage.gmt
+- [x] mouse_immune_AND_response.gmt
+- [x] osteoclast.gmt
+
+??? note "How did we generate the GMT files"
+    These files were retrieved from a search on [msigdb](https://www.gsea-msigdb.org/gsea/msigdb)
+    with the keyword(s) indicated in their title.
+
+    Do not hesitate to generate your own GMT files using `msigdb`
+
+## The `fgsea` workflow
+
+The Galaxy workflow [Galaxy-Workflow-fgsea.ga](Galaxy-Workflow-fgsea.ga) performs fgsea
+analysis from
+
+- [x] The collection `DESeq2 Results Tables` (history `PRJNA630433 DESeq2 analysis`)
+- [x] The dataset `EnsemblID-GeneSymbol table` (from the data library `IOC_bulk_RNAseq /
+Mouse reference files`, or from wherever you have it available)
+- [x] 5 GMT files: dendritic.gmt, glycolysis.gmt, monocyte_OR_macrophage.gmt,
+  mouse_immune_AND_response.gmt, osteoclast.gmt.
+
+![](images/Workflow-fgsea.png)
+
+**==Run this workflow in the dedicated history `PRJNA630433 fgsea` paying extra attention to
+select the appropriate input datasets (follow the workflow form instructions)==**
diff --git a/docs/bulk_RNAseq-IOC/35_INTRO_GSEA_exercices.md b/docs/bulk_RNAseq-IOC/35_INTRO_GSEA_exercices.md
@@ -0,0 +1,82 @@
+# Introduction to week-6 exercises
+Using previous results obtained in the course of the PRJNA630433 analysis, we are going to
+successively perform a ==fGSEA== (**f**ast preranked **G**ene **S**et **E**nrichment
+**A**nalysis) and an ==EGSEA== (**E**nsemble of **G**ene **S**et **E**nrichment
+**A**nalyses) with the corresponding Galaxy tools.
+
+Since this is the last week in the program where we run Galaxy tools, we are also going to
+upgrade the way we use Galaxy, making it "workflow-oriented". Thus, instead of describing
+each Galaxy tool run and showing you the details of the tool forms, we will provide a global
+description of the workflows (inputs, outputs, purpose of the pipeline of steps)
+and, most importantly, the corresponding workflow file as well as a screenshot of this file
+in the Galaxy Workflow Editor.
+
+
+## Tables of correspondence between Ensembl, ENTREZ and Gene Symbol IDs.
+
+For both fGSEA and EGSEA, some computation steps require tables to convert Ensembl
+to ENTREZ IDs, ENTREZ to Gene Symbol IDs or Ensembl to Gene Symbol IDs.
+
+This is a perfect occasion to use the new training method described above.
+
+Thus, we are going to use a Galaxy workflow that generates these three tables.
+
+The input material will be a collection of featureCounts tables that we previously
+generated in the `PRJNA630433 FeatureCounts Counting on HISAT2 bam alignments` history.
+
+Thus, to begin, copy the dataset `Dc FeatureCounts counts` from this history to a new
+history which you will name `Conversion Tables`. This is all we need as an input in this
+history. The rest of the datasets will be programmatically generated by a Galaxy workflow,
+`Ensembl-Entrez-GeneSymbol tables`, that:
+1. Extracts a dataset from the input data collection.
+2. Uses the first column of this dataset (the Ensembl gene identifiers of PRJNA630433)
+with the `annotate my IDs` tool to generate a three-column dataset, with EnsemblIDs,
+ENTREZIDs (NCBI's nomenclature, raw numbers) and GeneSymbol IDs, respectively.
+3. Filters out irrelevant lines (improper matches) with `NA` or with Gene Symbols
+containing `Rik` (these genes were identified in the course of the Riken project and are
+not considered as supported by enough evidence to be included in GSEA).
+4. Ensures that each ENTREZ ID in the table is unique.
+5. Ensures that the final clean table has a first-line header (the previous `unique` step
+reorders the lines in an unpredictable way).
+6. Generates three tables by cutting the final 3-column table with c1,c2, c2,c3, and c1,c3,
+respectively, and renames these tables according to their content.
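The filtering and cutting logic of these steps can be sketched in Python as follows. This only illustrates the intent of the workflow; it is not the Galaxy tools themselves. The function names, the toy identifiers and the header labels are hypothetical, and the three column pairs correspond to the Ensembl/ENTREZ, ENTREZ/Gene Symbol and Ensembl/Gene Symbol tables mentioned earlier.

```python
def clean_conversion_table(rows):
    """Filter (ensembl_id, entrez_id, symbol) rows given without a header."""
    seen_entrez = set()
    cleaned = []
    for ensembl_id, entrez_id, symbol in rows:
        if "NA" in (entrez_id, symbol):
            continue  # improper match: no ENTREZ ID or no symbol
        if "Rik" in symbol:
            continue  # Riken-derived gene symbols are discarded
        if entrez_id in seen_entrez:
            continue  # keep only the first line for each ENTREZ ID
        seen_entrez.add(entrez_id)
        cleaned.append((ensembl_id, entrez_id, symbol))
    return cleaned


def cut_columns(rows, header, columns):
    """Return a table (header first) restricted to the given 0-based columns."""
    table = [tuple(header[i] for i in columns)]
    table.extend(tuple(row[i] for i in columns) for row in rows)
    return table


# Hypothetical toy rows.
toy_rows = [
    ("ENS_1", "101", "GeneA"),
    ("ENS_2", "NA", "GeneB"),
    ("ENS_3", "103", "0000000X00Rik"),
    ("ENS_4", "101", "GeneA2"),
    ("ENS_5", "105", "GeneE"),
]
header = ("EnsemblID", "ENTREZID", "GeneSymbol")
cleaned = clean_conversion_table(toy_rows)
ensembl_entrez = cut_columns(cleaned, header, (0, 1))
entrez_symbol = cut_columns(cleaned, header, (1, 2))
ensembl_symbol = cut_columns(cleaned, header, (0, 2))
print(ensembl_symbol)
```

Writing the header only at the very end, as in `cut_columns`, mirrors the reason for step 5: the header must be put back in first position after the lines have been filtered and deduplicated.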
+
+## The `Ensembl-Entrez-GeneSymbol tables` workflow
+
+The workflow is available in a Galaxy/json format (.ga)
+[here](Galaxy-Workflow-Ensembl-Entrez-GeneSymbol_tables.ga)
+
+There are several ways to use it:
+
+- [x] Download the file and reupload it as a new workflow using the workflow menu.
+- [x] This workflow already exists on the server artbio.snv.jussieu.fr and was shared
+with you. Thus, it is already visible in your workflow list (workflow menu), and you can
+run it as is. ==However== :warning:, to better visualize this workflow you need to `copy`
+it in your account. When this operation is done, new menu items are available for this
+workflow, including `edit`
+- [x] Finally, you can upload a workflow in your account using its URL. For instance, if
+you click the `Import` button in your workflow list (workflow menu), you can paste the
+URL of this workflow in this course, and get it imported in your workflow list right away.
+
+The graphical view of the workflow is the following. We have annotated this view but within
+the Galaxy workflow editor, just click on each step of the workflow to see the details and
+parameters (right hand part of the editor) used by the tool in this workflow.
+
+![](images/tables_workflow.png)
+
+## RUN the `Ensembl-Entrez-GeneSymbol tables` workflow
+
+- [x] Be sure you are in the right history `Conversion Tables`
+- [x] Go to the workflow menu and click on the run icon of the workflow
+`Ensembl-Entrez-GeneSymbol tables`
+- [x] Ensure the appropriate input is selected for the workflow (here there is only one dataset
+in the history, no risk of error !)
+- Click the `Run Workflow` button.
+
+When the workflow has run you'll see that the last three datasets, appropriately named, are
+the ones we expected. You can use these datasets later when needed.
+
+:warning: However, it is even more convenient to transfer these datasets to your data
+library ! Just do it !
+---