Merge pull request #110 from ARTbio/week-6-1
GSEA Part
drosofff committed Mar 4, 2024
2 parents 162949f + 7f1d8fe commit 1c660c3
Showing 13 changed files with 2,249 additions and 48 deletions.
2 changes: 1 addition & 1 deletion docs/bulk_RNAseq-IOC/32_GOseq.md
@@ -113,7 +113,7 @@ your background genes.
--> `fields`
- List of Fields

--> `boolean`
--> `column 1` and `column 3`
- `Run Tool`

:warning: As this is the last step of the construction of the gene set lists, you should
31 changes: 2 additions & 29 deletions docs/bulk_RNAseq-IOC/33_exercices_week_05_review.md
@@ -1,30 +1,3 @@
## Issues with Slack ?
## Issues with `goseq` ?

## Issues with GitHub ?
- [x] Does everyone have a GitHub ID ?
- [x] Was everyone able to create a readme file and make a pull request to the repository
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
generated during the first online meeting, with an extension .ga) and to add it in
the repository
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?

## Data upload in PSILO, then in Galaxy from Psilo
- [x] Did everyone upload the necessary data to their
[PSILO account](https://psilo.sorbonne-universite.fr) ?
- [x] Did everyone succeed in creating direct download links ?
- [x] Did everyone succeed in transferring their PSILO data into a Galaxy history `Input dataset`
in their Galaxy account ?

## Issues following the Galaxy training ?

[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)

- Check whether `Relabel identifiers` tool is understood

- Check whether `Extract element identifiers` tool is understood. Is the output dataset
from this tool uploaded in the appropriate GitHub folder ?

## Check input datasets histories of the participants

... and their ability to create appropriate collection for the analysis
## Issues with GO files (N. gono and A. meli)
132 changes: 120 additions & 12 deletions docs/bulk_RNAseq-IOC/34_GSEA_intro.md
@@ -1,17 +1,125 @@
![](images/lamp.png)
# Gene Set Enrichment Analysis

# Analysis of functional enrichment among the differentially expressed genes
## Definition and Rationale Behind Gene Set Enrichment Analysis

We have extracted genes that are differentially expressed in treated (Pasilla gene-depleted)
samples compared to untreated samples. We would like to know if there are categories of
genes that are enriched among the differentially expressed genes.
Gene Set Enrichment Analysis (GSEA) is a powerful computational method used in
bioinformatics to interpret gene expression data in the context of biological pathways,
processes, or sets of functionally related genes.

Gene Ontology (GO) analysis is widely used to reduce complexity and highlight biological
processes in genome-wide expression studies.
Unlike traditional methods that focus on individual genes, GSEA evaluates the **coordinated
expression changes** of predefined gene sets, providing a more holistic view of molecular
mechanisms underlying experimental conditions or phenotypes.

However, standard methods give biased results on RNA-seq data due to over-detection
of differential expression for long and highly-expressed transcripts.
### Definition of GSEA
- [x] GSEA assesses whether predefined sets of genes show statistically significant,
concordant differences between two biological states (e.g., treatment vs. control,
diseased vs. healthy).
- [x] Rather than focusing on individual genes, GSEA operates on gene sets, which can
represent pathways, molecular functions, cellular processes, or ==other biologically
relevant groups of genes==. This last case actually represents the most common use of
GSEA. Many groups of genes, sometimes also called "gene signatures" or "molecular
signatures" are available in databases or published articles.
- [x] It ranks all genes based on their expression changes between experimental
conditions and then tests whether genes within a gene set tend to appear towards the top
(or bottom) of the ranked list more than expected by chance.

The goseq tool provides methods for performing GO analysis of RNA-seq data,
taking length bias into account. The methods and software used by goseq are equally
applicable to other category based tests of RNA-seq data, such as KEGG pathway analysis.
### Rationale Behind GSEA
- [x] **Biological Context**: GSEA acknowledges that genes rarely act in isolation but
rather function in coordinated networks and pathways. Analyzing gene sets helps
contextualize gene expression changes within the framework of biological processes.
- [x] **Statistical Power**: By aggregating signals from groups of genes, GSEA enhances
statistical power to detect subtle but coordinated changes in gene expression that might
be missed when analyzing individual genes.
- [x] **Reduction of Multiple Testing Burden**: GSEA reduces the multiple testing burden
associated with examining thousands of individual genes by focusing on predefined gene
sets. This reduces the risk of false positives and improves the reliability of results.
- [x] **Interpretability**: GSEA provides interpretable results by associating gene
expression changes with known biological pathways or processes, enabling researchers to
generate hypotheses and gain insights into the underlying biology.
- [x] **Robustness Across Platforms**: GSEA is platform-independent and can be applied
to various types of gene expression data, including microarray and RNA sequencing
(RNA-seq) data, making it widely applicable across different experimental settings and
datasets.

### Key Features of GSEA
- [x] **Enrichment Score**: Measures the degree to which a gene set is overrepresented
at the top or bottom of the ranked gene list.
- [x] **Normalized Enrichment Score (NES)**: Corrects for gene set size and data set
size, facilitating comparison of results across different datasets.
- [x] **False Discovery Rate (FDR)**: Estimates the proportion of false positive results
among significant findings, controlling for multiple testing.

In summary, GSEA offers a systematic and biologically meaningful approach to analyze gene expression data, enabling researchers to uncover key molecular pathways and processes associated with different experimental conditions or phenotypes. Its ability to integrate complex genomic data with prior biological knowledge makes it a valuable tool in deciphering the mechanisms underlying biological phenomena and disease states.

## How to perform GSEA ?

### A video presentation by Katherine West (University of Glasgow)
We strongly advise you to look at the excellent
[presentation by Katherine West](https://youtu.be/KY6SS4vRchY?si=cxbHjHdXdjc7uE-4){:target="_blank"}.
The aspects that have been presented above are all taken up and illustrated with graphics
in a very educational way.

### Practical focus: computation of Enrichment Score (ES)

The Enrichment Score (ES) is central in Gene Set Enrichment Analysis (GSEA) since it
quantifies the degree to which a gene set is overrepresented at the top or bottom of a
ranked list of genes based on their expression changes between two biological conditions.

The computation of the Enrichment Score involves several steps:

1. **Ranking Genes**: The first step is to rank all genes in the dataset based on a metric
that reflects their differential expression between the two biological conditions. This
metric is most often the fold change, but could be a t-statistic or any other relevant
statistical measure.

2. **Cumulative Sum Calculation**: The Enrichment Score is calculated by walking down the
ranked list of genes, accumulating a running sum statistic. At each step, the running sum
is increased when a gene belongs to the gene set being evaluated and decreased otherwise.
The running sum captures the degree of enrichment of the gene set at that point in the
ranked list.

The way the running sum is increased when a gene belongs to the gene set being evaluated
and decreased otherwise varies depending on the GSEA implementation (there are several).
What you need to remember is that increment and decrement are never calculated
symmetrically.

A simple example of running sum calculation is to add the GSEA metric (e.g., fold change)
when a gene belongs to the gene set being evaluated and, otherwise, to subtract a fixed value
that depends on the total number of genes in the ranked gene list (e.g., 1 / N). This fixed value
is typically referred to as the "penalty" or "decay" factor.

The rationale behind using a penalty or decay factor is to adjust the running sum to
account for the fact that genes not belonging to the gene set can still contribute to the
overall distribution of scores. This adjustment helps to prevent the running sum from
being overly biased by the presence or absence of genes in the gene set.

3. **Peak Enrichment Score**: The Enrichment Score is the value of the running sum at its
maximum deviation from zero (positive or negative). This peak reflects the
enrichment of the gene set at a particular position in the ranked list.

4. **Normalization of Enrichment Score**: To make Enrichment Scores comparable across
different gene sets and datasets, the Enrichment Score is normalized. This normalization
accounts for differences in gene set size and dataset size. One common normalization
method is to divide the Enrichment Score by the mean enrichment score from permuted
datasets.

5. **Estimation of Significance**: The significance of the Enrichment Score is assessed
through permutation testing. This involves repeatedly permuting the gene labels to
generate a null distribution of Enrichment Scores (i.e., computing many ES from gene sets
randomly sampled from the total gene list). The observed Enrichment Score is then compared
to the null distribution to determine its statistical significance, typically reported as
a nominal p-value or false discovery rate (FDR).


Overall, the Enrichment Score provides a quantitative measure of the degree to which a
predefined gene set is enriched towards the top or bottom of a ranked list of genes,
indicating the collective expression behavior of genes within that set under different
experimental conditions. It enables the identification of biologically relevant gene sets
associated with specific phenotypes or experimental treatments in gene expression studies.
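
The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any particular GSEA implementation: hits are weighted by the absolute value of the ranking metric, misses subtract a constant equal to 1 / (number of genes outside the gene set), and the null distribution is built from random gene sets of the same size. The function names `enrichment_score` and `permutation_test` are made up for this example.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set):
    """Return the signed maximum deviation from zero of the running sum.

    ranked_genes and scores are parallel lists, already sorted by score
    (highest first). Hits add |score| / (sum of |score| over all hits);
    misses subtract 1 / (number of genes outside the gene set).
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    hit_total = sum(abs(s) for s, hit in zip(scores, in_set) if hit)
    n_miss = in_set.count(False)
    if hit_total == 0 or n_miss == 0:
        return 0.0  # degenerate case: no hits or no misses to walk against
    running = 0.0
    peak = 0.0
    for s, hit in zip(scores, in_set):
        running += abs(s) / hit_total if hit else -1.0 / n_miss
        if abs(running) > abs(peak):
            peak = running  # keep the largest deviation, with its sign
    return peak


def permutation_test(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Return (ES, NES, nominal p-value) using random gene sets of equal size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = sum(gene in gene_set for gene in ranked_genes)
    null = [
        enrichment_score(ranked_genes, scores, set(rng.sample(ranked_genes, size)))
        for _ in range(n_perm)
    ]
    # Compare only with null scores that have the same sign as the observed ES.
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan"), float("nan")
    mean_abs = sum(abs(es) for es in same_sign) / len(same_sign)
    nes = observed / mean_abs if mean_abs > 0 else float("nan")
    extreme = sum(abs(es) >= abs(observed) for es in same_sign)
    p_value = (extreme + 1) / (len(same_sign) + 1)
    return observed, nes, p_value


# Toy ranked list (hypothetical gene names and scores).
genes = [f"g{i}" for i in range(1, 11)]
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]
print(permutation_test(genes, scores, {"g1", "g2", "g3"}, n_perm=200))
```

Note how the increment and the decrement are not symmetrical: the size of a hit depends on the metric of the gene, whereas every miss costs the same amount.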

## The main resource for GSEA

* GSEA software: [https://www.gsea-msigdb.org/gsea/msigdb](https://www.gsea-msigdb.org/gsea/msigdb)
* Provides a user-friendly platform for performing GSEA analysis.
* Provides access to a large database of curated gene sets in various formats, including
the GMT format (.gmt files), which is the format that we are going to use in this IOC.
97 changes: 95 additions & 2 deletions docs/bulk_RNAseq-IOC/35_GSEA_1.md
@@ -1,3 +1,96 @@
![](images/galaxylogo.png)
# fGSEA

# GSEA exercices part 1
fgsea is a Bioconductor package for fast preranked gene set enrichment analysis (GSEA) which
has been "wrapped" for use in the Galaxy framework.
Like all GSEA approaches, fgsea implements an algorithm for cumulative GSEA-statistic
calculation. We will use it in the standard way, i.e., basing our metric on ==fold changes==
that were computed using the DESeq2 Galaxy tool.

## fgsea inputs

#### 1. The collection of DESeq2 DE tables (with headers)
fgsea first requires a two-column file containing a ranked list of genes. The first column
must contain the gene identifiers and the second column the statistic used to rank. Gene
identifiers ==must be unique== (not repeated) within the file and must be ==the same type==
as the ==identifiers in the Gene Sets file==.

Since what is expected is in the form of

| Symbol | Ranked Stat |
|---|---|
| VDR | 67.198 |
| IL20RA | 65.963 |
| MPHOSPH10 | 51.353 |
| RCAN1 | 50.269 |
| HILPDA | 50.015 |
| TSC22D3 | 47.496 |
| FAM107B | 45.926 |

and that our DESeq2 tables contain only Ensembl identifiers, we will start from the DESeq2
collection and replace Ensembl identifiers by gene symbols (using a table generated in the
previous section).

--> Thus, copy the collection `DESeq2 Results Tables` from the history `PRJNA630433
DESeq2 analysis` into a new history that you will name `PRJNA630433 fgsea`
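
To make the identifier replacement concrete, here is a minimal Python sketch of what it amounts to outside Galaxy. It is not the Galaxy workflow itself. The column names `GeneID` and `log2(FC)`, the function name `build_ranked_list`, and the identifiers in the toy data are assumptions made for this example. Keeping the largest absolute statistic when two Ensembl identifiers map to the same symbol is one possible way to satisfy the "unique identifiers" requirement, not necessarily the one used in the workflow.

```python
import csv
import io


def build_ranked_list(deseq_tsv, id_map_tsv, id_col="GeneID", stat_col="log2(FC)"):
    """Return a list of (gene symbol, statistic) sorted from highest to lowest.

    deseq_tsv: text of a tab-separated DESeq2 table with a header line.
    id_map_tsv: text of a two-column table (Ensembl ID, gene symbol).
    """
    id_to_symbol = {}
    for row in csv.reader(io.StringIO(id_map_tsv), delimiter="\t"):
        if len(row) >= 2:
            # A header line, if present, just adds one unused key.
            id_to_symbol[row[0]] = row[1]

    best = {}
    for row in csv.DictReader(io.StringIO(deseq_tsv), delimiter="\t"):
        symbol = id_to_symbol.get(row[id_col])
        if symbol is None or row[stat_col] in ("", "NA"):
            continue  # unmapped gene or missing statistic
        stat = float(row[stat_col])
        # Keep one value per symbol so that identifiers stay unique.
        if symbol not in best or abs(stat) > abs(best[symbol]):
            best[symbol] = stat

    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy input with made-up identifiers and values.
deseq = "GeneID\tlog2(FC)\nENS_A\t2.5\nENS_B\t-1.0\nENS_C\tNA\nENS_D\t0.5\n"
id_map = "ENSEMBL\tSYMBOL\nENS_A\tVdr\nENS_B\tRcan1\nENS_C\tHilpda\n"
for symbol, stat in build_ranked_list(deseq, id_map):
    print(symbol, stat, sep="\t")
```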

#### 2. The table `EnsemblID-GeneSymbol table`

As mentioned above.

Note that you have generated this table in the previous section.

It is also available in the data library `IOC_bulk_RNAseq / Mouse reference files`, as well
as in your own data library (if you followed the instructions).

--> Copy `ENTREZID-GeneSymbol table` in your history `PRJNA630433 fgsea`

#### 3. One or several GMT files

GMT (Gene Matrix Transposed) files are available at
https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp

They are tabular files looking like:

| HALLMARK_APOPTOSIS | http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_APOPTOSIS | CASP3 | CASP9 | ... |
|---|---|---|---|---|
| HALLMARK_HYPOXIA | http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_HYPOXIA | PGK1 | PDK1 | ... |

Note that in such a file, each line represents a gene set. The first two columns identify
the gene set (its name and its description URL). The number of remaining columns is
variable, with one column for each gene symbol belonging to the gene set. Thus,
the number of these extra columns, starting at column 3, reflects the number of genes in the
gene set.
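
As an illustration of this layout, the following Python sketch reads GMT text into a dictionary. It assumes tab-separated fields, as described above. The function name `parse_gmt` is made up for this example, and the two gene sets are shortened to two genes each.

```python
def parse_gmt(text):
    """Parse GMT text into {gene set name: {"description": ..., "genes": [...]}}."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip empty or malformed lines (no gene column)
        name, description, *genes = fields
        gene_sets[name] = {
            "description": description,
            "genes": [gene for gene in genes if gene],  # drop empty trailing fields
        }
    return gene_sets


# Shortened example lines: name, description URL, then one column per gene.
gmt_text = (
    "HALLMARK_APOPTOSIS\thttp://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_APOPTOSIS\tCASP3\tCASP9\n"
    "HALLMARK_HYPOXIA\thttp://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_HYPOXIA\tPGK1\tPDK1\n"
)
for name, content in parse_gmt(gmt_text).items():
    print(name, len(content["genes"]), content["genes"])
```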

We have downloaded several GMT files for this purpose from
https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp and made these files available
in the data library `IOC_bulk_RNAseq / Mouse reference files`.

--> Copy the following datasets in the history `PRJNA630433 fgsea`:

- [x] dendritic.gmt
- [x] glycolysis.gmt
- [x] monocyte_OR_macrophage.gmt
- [x] mouse_immune_AND_response.gmt
- [x] osteoclast.gmt

??? note "How did we generate the GMT files"
These files were retrieved from a search on [msigdb](https://www.gsea-msigdb.org/gsea/msigdb)
with the keyword(s) indicated in their title.

Do not hesitate to generate your own GMT files using `msigdb`

## The `fgsea` workflow

The Galaxy workflow [Galaxy-Workflow-fgsea.ga](Galaxy-Workflow-fgsea.ga) performs fgsea
analysis from

- [x] The collection `DESeq2 Results Tables` (history `PRJNA630433 DESeq2 analysis`)
- [x] The dataset `ENTREZID-GeneSymbol table` (from the data library `IOC_bulk_RNAseq /
Mouse reference files`) or from wherever it is available
- [x] 5 GMT files: dendritic.gmt, glycolysis.gmt, monocyte_OR_macrophage.gmt,
mouse_immune_AND_response.gmt, osteoclast.gmt.

![](images/Workflow-fgsea.png)

**==Run this workflow in the dedicated history `PRJNA630433 fgsea` paying extra attention to
select the appropriate input datasets (follow the workflow form instructions)==**
82 changes: 82 additions & 0 deletions docs/bulk_RNAseq-IOC/35_INTRO_GSEA_exercices.md
@@ -0,0 +1,82 @@
# Introduction to week-6 exercises
Using previous results obtained in the course of PRJNA630433 analysis, we are going to
perform successively a ==fGSEA== (**f**ast preranked **G**ene **S**et **E**nrichment
**A**nalysis) and an ==EGSEA== (**E**nsemble of **G**ene **S**et **E**nrichment
**A**nalyses) with the corresponding Galaxy tools.

Since this is the last week in the program where we run Galaxy tools, we are also going to
upgrade the way we use Galaxy, making it "workflow-oriented". Thus, instead of describing
each Galaxy tool run and showing you the details of the tool forms, we will provide a global
description of workflows (inputs, outputs, purpose of the pipeline of steps)
and, most importantly, the corresponding workflow file as well as a screenshot of this workflow
in the Galaxy Workflow Editor.


## Tables of correspondences between Ensembl, ENTREZ and Gene Symbol IDs.

For both fGSEA and EGSEA, some computation steps require tables to convert Ensembl
to ENTREZ IDs, ENTREZ to Gene Symbol IDs or Ensembl to Gene Symbol IDs.

This is a perfect occasion to use the new training method described above.

Thus, we are going to use a Galaxy workflow that generates these three tables.

The input material will be a collection of featureCounts tables that we previously
generated in the `PRJNA630433 FeatureCounts Counting on HISAT2 bam alignments` history.

Thus, to begin, copy the dataset `Dc FeatureCounts counts` from this history to a new
history which you will name `Conversion Tables`. This is all we need as an input in this
history. The rest of the datasets will be programmatically generated by a Galaxy workflow,
`Ensembl-Entrez-GeneSymbol tables`, that:

1. Extracts a dataset from the input data collection.
2. Uses the first column of this dataset (the Ensembl gene identifiers of the PRJNA630433 project)
with the `annotate my IDs` tool to generate a three-column dataset, with EnsemblIDs,
ENTREZIDs (NCBI's nomenclature, raw numbers) and GeneSymbol IDs, respectively.
3. Filters out irrelevant lines (improper matches) with `NA` or with Gene Symbols containing
`Rik` (these genes were identified in the course of the RIKEN project and are not considered
as supported by enough evidence to be included in GSEA).
4. Ensures that each ENTREZ ID in the table is unique.
5. Ensures that the final clean table has a header as its first line (the previous `unique` step
reorders the lines in an unpredictable way).
6. Generates three tables by cutting the final 3-column table with c1,c2, c2,c3, and c1,c3,
respectively, and renames these tables according to their content.
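
For readers who want to see the logic of steps 3 to 6 outside Galaxy, here is a rough Python equivalent. It is a sketch, not a transcription of the workflow: the function names, the header labels and the toy rows are made up, a line is dropped when its ENTREZ ID or Gene Symbol is exactly `NA` or when its Gene Symbol contains `Rik`, and the three output tables are assumed to be the Ensembl/ENTREZ, ENTREZ/Gene Symbol and Ensembl/Gene Symbol pairs mentioned at the start of this section.

```python
def clean_conversion_table(rows):
    """Filter (ensembl, entrez, symbol) rows and keep one row per ENTREZ ID."""
    seen_entrez = set()
    cleaned = []
    for ensembl, entrez, symbol in rows:
        if entrez == "NA" or symbol == "NA":
            continue  # improper match
        if "Rik" in symbol:
            continue  # RIKEN gene models are left out
        if entrez in seen_entrez:
            continue  # keep only the first row for each ENTREZ ID
        seen_entrez.add(entrez)
        cleaned.append((ensembl, entrez, symbol))
    return cleaned


def split_tables(cleaned, header=("ENSEMBL", "ENTREZID", "SYMBOL")):
    """Put the header back on top and cut the table into three two-column tables."""
    table = [header] + cleaned
    return {
        "Ensembl-Entrez": [(row[0], row[1]) for row in table],      # c1,c2
        "Entrez-GeneSymbol": [(row[1], row[2]) for row in table],   # c2,c3
        "Ensembl-GeneSymbol": [(row[0], row[2]) for row in table],  # c1,c3
    }


# Toy rows with made-up identifiers.
rows = [
    ("ENS_A", "101", "Vdr"),
    ("ENS_B", "NA", "NA"),
    ("ENS_C", "102", "1700001A01Rik"),
    ("ENS_D", "101", "Vdr"),
    ("ENS_E", "103", "Rcan1"),
]
for name, table in split_tables(clean_conversion_table(rows)).items():
    print(name, table)
```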

## The `Ensembl-Entrez-GeneSymbol tables` workflow

The workflow is available in a Galaxy/json format (.ga)
[here](Galaxy-Workflow-Ensembl-Entrez-GeneSymbol_tables.ga)

There are several ways to use it:

- [x] Download the file and reupload it as a new workflow using the workflow menu.
- [x] This workflow already exists on the server artbio.snv.jussieu.fr and was shared
with you. Thus, it is already visible in your workflow list (workflow menu), and you can
run it as is. ==However== :warning:, to better visualize this workflow you need to `copy`
it into your account. When this operation is done, new menu items are available for this
workflow, including `edit`.
- [x] Finally, you can upload a workflow in your account using its URL. For instance, if
you click the `Import` button in your workflow list (workflow menu), you can paste the
URL of this workflow in this course, and get it imported into your workflow list right away.

The graphical view of the workflow is the following. We have annotated this view but within
the Galaxy workflow editor, just click on each step of the workflow to see the details and
parameters (right hand part of the editor) used by the tool in this workflow.

![](images/tables_workflow.png)

## RUN the `Ensembl-Entrez-GeneSymbol tables` workflow

- [x] Be sure you are in the right history, `Conversion Tables`
- [x] Go to the workflow menu and click on the run icon of the workflow
`Ensembl-Entrez-GeneSymbol tables`
- [x] Ensure the appropriate input is selected for the workflow (here there is only one dataset
in the history, so there is no risk of error !)
- Click the `Run Workflow` button.

When the workflow has run, you'll see that the last three datasets, appropriately named, are
the ones we expected. You can use these datasets later when needed.

:warning: However, it is even more convenient to transfer these datasets in your data
library ! Just do it !