Merge pull request #102 from ARTbio/aftercounting
DEseq data manipulation and visualisation
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,365 @@ | ||
# Manipulation of differential expression data for visualisation and comparisons | ||
|
||
Now we would like to extract the most differentially expressed genes in the various | ||
conditions, and then visualize them using a heatmap of the normalized counts or of the computed | ||
z-score for each sample. | ||
|
||
We will proceed in several steps: | ||
|
||
- [x] For each package, extract the normalized counts of genes for each sample (all three | ||
packages, DESeq2, edgeR and limma, provide this functionality). | ||
- [x] For each package, extract the most differentially expressed genes at a given log2FC | ||
threshold (let's say 2, corresponding to a 4-fold increase or decrease in expression), and at a | ||
given adjusted p-value (let's say p-adj < 0.01). We will keep these gene lists aside to | ||
later build a Venn diagram comparing the three tools. | ||
- [x] Plot heatmaps of normalized counts | ||
- [x] Compute Z score of the normalized counts | ||
- [x] Plot heatmaps of the Z score of the normalized counts | ||
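
To make the log2FC threshold concrete, here is a minimal Python sketch of the conversion between a log2 fold change and a linear fold change. The function name `fold_change` is only illustrative and is not part of any of the tools used in this tutorial.

```python
def fold_change(log2fc: float) -> float:
    """Convert a log2 fold change into a linear fold change."""
    return 2.0 ** log2fc


# A log2FC of +2 is a 4-fold increase, a log2FC of -2 a 4-fold decrease.
up = fold_change(2)
down = fold_change(-2)
print(up, down)
```

Filtering on the absolute value of log2FC therefore keeps genes changed at least 4-fold in either direction.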
|
||
## Extract the most differentially expressed genes (PRJNA630433 / DESeq2) | ||
|
||
We go back to the DESeq2 history of the PRJNA630433 use case and repeat the | ||
DESeq2 run, this time also asking for an **rLog-Normalized** counts output. | ||
|
||
??? info "![](images/tool_small.png){width="25" align="absbottom"} `DESeq2` settings" | ||
Basically, the same as before, except that we ask for a Normalized counts file | ||
|
||
- how | ||
|
||
--> Select datasets per levels | ||
- 1: Factor | ||
|
||
--> Tissue | ||
- 1: Factor level | ||
|
||
Note that there will be three factor levels in this analysis: Dc, Mo and Oc. | ||
|
||
--> Oc | ||
|
||
- Counts file(s) | ||
|
||
--> select the data collection icon, then `15: Oc FeatureCounts counts` | ||
- 2: Factor level | ||
|
||
--> Mo | ||
|
||
- Counts file(s) | ||
|
||
--> select the data collection icon, then `10: Mo FeatureCounts counts` | ||
- 3: Factor level (you must click on :heavy_plus_sign: `Insert Factor level`) | ||
|
||
--> Dc | ||
|
||
- Counts file(s) | ||
|
||
--> select the data collection icon, then `5: Dc FeatureCounts counts` | ||
- (Optional) provide a tabular file with additional batch factors to include in the model. | ||
|
||
--> Leave to `Nothing selected` | ||
- Files have header? | ||
|
||
--> Yes | ||
- Choice of Input data | ||
|
||
--> Count data | ||
- Advanced options | ||
|
||
--> No, leave folded | ||
- Output options | ||
|
||
--> ==This time, check the `Output rLog normalized table` box !== | ||
|
||
--> Unfold and check `Output all levels vs all levels of primary factor (use when | ||
you have >2 levels for primary factor)` in addition to the already checked | ||
`Generate plots for visualizing the analysis results` | ||
|
||
--> Leave `Alpha value for MA-plot` at 0.1: note that this option is only used for | ||
plots and does not impact DESeq2 results | ||
- `Run Tool` | ||
|
||
:warning: This time you can trash the DESeq2 plots and result files which we have already | ||
generated. | ||
|
||
:warning: Keep the rLog-Normalized counts output for later: we will use it for a clustered heatmap | ||
|
||
## Generate top lists of DE genes | ||
|
||
We will do that with the help of the tool `Filter data on any column using simple | ||
expressions`. We will also use 3 other tools: `Compute on rows`, `Column Regex Find And | ||
Replace` and `Select lines that match an expression` | ||
|
||
### Select genes with |log2FC| > 2 and p-adj < 0.01 with ![](images/tool_small.png){width="30" align="absbottom"}`Filter data on any column using simple expressions` | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Filter data on any column...` settings" | ||
- Filter | ||
|
||
--> DESeq2 Results Tables | ||
- With following condition | ||
|
||
--> abs(c3) > 2 and c7 < 0.01 | ||
- Number of header lines to skip | ||
|
||
--> `1` (these tables have an added header !) | ||
- Click the `Run Tool` button | ||
|
||
:warning: Rename the "filter on..." collection to `Top gene lists` | ||
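
For readers who want to see what the condition `abs(c3) > 2 and c7 < 0.01` does outside Galaxy, the following Python sketch applies the same test to a small tab-delimited table. The table content and its column names are invented for illustration. The sketch assumes that column 3 holds log2FC and column 7 the adjusted p-value, as in the settings above, and it skips rows whose values are not numeric (for example `NA`), which is a choice of this sketch and not a documented behaviour of the Galaxy tool.

```python
import csv
import io

# Hypothetical DESeq2-like result table (values invented for illustration).
TABLE = (
    "GeneID\tBase mean\tlog2(FC)\tStdErr\tWald-Stats\tP-value\tP-adj\n"
    "geneA\t150.0\t3.2\t0.4\t8.0\t1e-9\t1e-7\n"
    "geneB\t90.0\t-2.5\t0.5\t-5.0\t1e-6\t5e-4\n"
    "geneC\t300.0\t0.3\t0.2\t1.5\t0.13\t0.4\n"
    "geneD\t12.0\t4.1\t1.9\t2.1\t0.03\tNA\n"
)


def to_float(value):
    """Return the value as a float, or None when it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def filter_top_genes(text, lfc_min=2.0, padj_max=0.01):
    """Keep rows where abs(c3) > lfc_min and c7 < padj_max (1 header line)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    kept = []
    for row in body:
        lfc, padj = to_float(row[2]), to_float(row[6])
        if lfc is None or padj is None:
            continue  # assumption of this sketch: non-numeric rows are dropped
        if abs(lfc) > lfc_min and padj < padj_max:
            kept.append(row)
    return header, kept


header, top_genes = filter_top_genes(TABLE)
print([row[0] for row in top_genes])
```

Only `geneA` and `geneB` pass both thresholds: `geneC` has a small fold change and `geneD` has no usable adjusted p-value.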
|
||
### Compute a boolean value by row | ||
|
||
This step determines whether each gene in the lists is up- or down-regulated | ||
|
||
|
||
:warning: Look at the effect of evaluating the expression `c3 > 0` in the new column | ||
`expression` in the output datasets. | ||
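
As an illustration of what evaluating `c3 > 0` on each row produces, here is a small Python sketch. It assumes, as above, that column 3 is the log2FC and that the new column is called `expression`. The example rows are invented and the function name is hypothetical.

```python
def add_expression_column(rows, has_header=True):
    """Append the result of `c3 > 0` (as text) to every data row."""
    out = []
    for i, row in enumerate(rows):
        if has_header and i == 0:
            out.append(row + ["expression"])
        else:
            out.append(row + [str(float(row[2]) > 0)])
    return out


# Invented example rows: gene name, base mean, log2FC.
rows = [
    ["GeneID", "Base mean", "log2(FC)"],
    ["geneA", "150.0", "3.2"],
    ["geneB", "90.0", "-2.5"],
]
flagged = add_expression_column(rows)
for row in flagged:
    print("\t".join(row))
```

A positive log2FC yields `True` and a negative one yields `False`, which is what the next step turns into `up` and `down`.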
|
||
### Transform `True` and `False` values to `up` and `down`, respectively | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Column Regex Find And Replace` settings" | ||
- Select cells from | ||
|
||
--> `Compute on collection 36 (or so)` | ||
- using column | ||
|
||
--> `8` | ||
- Check | ||
|
||
--> click the button :heavy_plus_sign:`Insert Check` | ||
- Find Regex | ||
|
||
--> `False` | ||
- Replacement | ||
|
||
--> `down` | ||
- Check | ||
|
||
--> click another time the button :heavy_plus_sign:`Insert Check` | ||
- Find Regex | ||
|
||
--> `True` | ||
- Replacement | ||
|
||
--> `up` | ||
- Click the `Run Tool` button | ||
|
||
:warning: Rename the collection `Column Regex Find And Replace on collection 40` to | ||
`top gene lists - oriented` | ||
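
The two find-and-replace checks can be sketched in Python with the standard `re` module. This is only an illustration of the logic, applied to column 8 as in the settings above. The example row is invented and the helper name `replace_in_column` is hypothetical.

```python
import re

# The two checks from the settings above, applied in order.
CHECKS = [(r"False", "down"), (r"True", "up")]


def replace_in_column(row, column=8, checks=CHECKS):
    """Apply each (regex, replacement) pair to one column (1-based index)."""
    idx = column - 1
    value = row[idx]
    for pattern, replacement in checks:
        value = re.sub(pattern, replacement, value)
    return row[:idx] + [value] + row[idx + 1:]


# Invented 8-column row whose last column holds the boolean from `c3 > 0`.
row = ["geneB", "90.0", "-2.5", "0.5", "-5.0", "1e-6", "5e-4", "False"]
print(replace_in_column(row))
```

Restricting the replacement to a single column avoids touching any other field that might happen to contain the text `True` or `False`.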
|
||
### Split the list in `up` and `down` regulated lists | ||
|
||
This will be performed through 2 successive runs of the | ||
![](images/tool_small.png){width="25" align="absbottom"} tool `Select lines that match an | ||
expression` | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Select lines that match an expression` settings" | ||
- Select lines from | ||
|
||
--> `top gene lists - oriented` | ||
- that | ||
|
||
--> `matching` | ||
- the pattern | ||
|
||
--> `\tup` (a tab character immediately followed by the string *up*) | ||
- Keep header line | ||
|
||
--> `Yes` | ||
- Click the `Run Tool` button | ||
|
||
:warning: Immediately rename the collection `Select on collection...` to `top up-regulated | ||
gene lists` | ||
|
||
Redo exactly the same operation with a single change in the setting of the | ||
![](images/tool_small.png){width="25" align="absbottom"} tool `Select lines that match an | ||
expression` | ||
|
||
??? info "![](images/tool_small.png){width="25" align="absbottom"} `Select lines that match an expression` settings" | ||
- Select lines from | ||
|
||
--> `top gene lists - oriented` | ||
- that | ||
|
||
--> `matching` | ||
- the pattern | ||
|
||
--> `\tdown` (a tab character immediately followed by the string *down*) | ||
- Keep header line | ||
|
||
--> `Yes` | ||
- Click the `Run Tool` button | ||
|
||
:warning: Rename the collection `Select on collection...` to `top down-regulated | ||
gene lists` | ||
|
||
:warning: keep the last three generated collections for later comparison with edgeR and | ||
limma tools | ||
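
The two selection runs above can be pictured with this Python sketch, which keeps the header line and then every line matching a regular expression. The example lines are invented and the function name `select_lines` is hypothetical. It shows the idea behind the `\tup` and `\tdown` patterns, not how the Galaxy tool is implemented.

```python
import re


def select_lines(lines, pattern, keep_header=True):
    """Keep the first line (optionally) and all lines matching the pattern."""
    regex = re.compile(pattern)
    selected = []
    for i, line in enumerate(lines):
        if i == 0 and keep_header:
            selected.append(line)
        elif regex.search(line):
            selected.append(line)
    return selected


# Invented example lines, shortened to three columns.
lines = [
    "GeneID\tlog2(FC)\texpression",
    "geneA\t3.2\tup",
    "geneB\t-2.5\tdown",
    "geneC\t2.8\tup",
]
up_lines = select_lines(lines, r"\tup")
down_lines = select_lines(lines, r"\tdown")
print(len(up_lines) - 1, "up,", len(down_lines) - 1, "down")
```

Including the tab in the pattern means the match must start at the beginning of a column, so a gene name in the first column that merely contains `up` is not selected.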
|
||
## Plotting a heatmap of the most significantly de-regulated genes | ||
|
||
For this, we are going to gather all genes significantly de-regulated in any of the | ||
3 comparisons, and intersect (join operation) this list with the rLog normalized count | ||
table generated previously. | ||
|
||
### Use ![](images/tool_small.png){width="25" align="absbottom"}`Advanced cut` to select the list of deregulated genes in all three comparisons | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `advanced cut` settings" | ||
- File to cut | ||
|
||
--> `Top gene lists` (this is a collection) | ||
- Operation | ||
|
||
--> `Keep` | ||
- Delimited by | ||
|
||
--> `Tab` | ||
- Cut by | ||
|
||
--> `fields` | ||
- List of Fields | ||
|
||
--> `Column 1` | ||
- First line is a header line | ||
- Click the `Run Tool` button | ||
|
||
:warning: Rename this collection of single column datasets to `top genes names` | ||
|
||
### Next we concatenate the three datasets of the previous collection in a single dataset | ||
|
||
We do that using the ![](images/tool_small.png){width="25" align="absbottom"} | ||
`Concatenate multiple datasets tail-to-head while specifying how` tool | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Concatenate multiple datasets tail-to-head while specifying how` settings" | ||
- What type of data do you wish to concatenate? | ||
|
||
--> `Single datasets` | ||
- Concatenate Datasets | ||
|
||
--> :warning: Click on the collection icon and select `top genes names` | ||
- Include dataset names? | ||
|
||
--> `No` | ||
- Number of lines to skip at the beginning of each concatenation: | ||
|
||
--> `1` | ||
- Click the `Run Tool` button | ||
|
||
:warning: Rename the resulting single dataset to `Pooled top genes` | ||
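
The cut and concatenate steps can be summarised by the following Python sketch. It assumes that each dataset starts with one header line and that the header is skipped in every dataset, which is why a header has to be added again later. The gene names and the names of the comparisons are invented, and both function names are hypothetical.

```python
def first_column(lines):
    """Keep only the first tab-delimited field of each line (the 'cut' step)."""
    return [line.split("\t")[0] for line in lines]


def concatenate(datasets, skip=1):
    """Concatenate datasets tail-to-head, skipping `skip` lines in each one."""
    pooled = []
    for lines in datasets:
        pooled.extend(lines[skip:])
    return pooled


# Invented top gene lists for three pairwise comparisons.
collection = [
    ["GeneID\tlog2(FC)", "geneA\t3.2", "geneB\t-2.5"],
    ["GeneID\tlog2(FC)", "geneB\t-3.1", "geneC\t2.4"],
    ["GeneID\tlog2(FC)", "geneA\t2.9"],
]
names_only = [first_column(lines) for lines in collection]
pooled = concatenate(names_only, skip=1)
print(pooled)
```

The pooled list still contains repeated gene names, since the same gene can be selected in several comparisons.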
|
||
### Next we extract *unique* gene names from the `Pooled top genes` dataset | ||
|
||
The same gene may well be deregulated in more than one of the three pair-wise comparisons | ||
which we have performed with DESeq2. | ||
|
||
Thus we need to eliminate the redundancy, using the tool | ||
![](images/tool_small.png){width="25" align="absbottom"}`Unique | ||
occurrences of each record`. | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Unique occurrences of each record` settings" | ||
- File to scan for unique values | ||
|
||
--> `Pooled top genes` | ||
- Ignore differences in case when comparing | ||
|
||
--> `No` | ||
- Column only contains numeric values | ||
|
||
--> `No` | ||
- Advanced Options | ||
|
||
--> Leave as `Hide Advanced Options` | ||
- Click the `Run Tool` button | ||
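
Removing the redundancy amounts to keeping one copy of each gene name, as in this short Python sketch. The output is sorted here for reproducibility. Whether the Galaxy tool returns the names in sorted order is an assumption of this sketch. The input list is invented.

```python
def unique_records(values):
    """Return each distinct value once, in sorted order."""
    return sorted(set(values))


# Invented pooled list with duplicated gene names.
pooled = ["geneA", "geneB", "geneB", "geneC", "geneA"]
unique_genes = unique_records(pooled)
print(unique_genes)
```

The comparison is case-sensitive, which matches the `Ignore differences in case when comparing` setting left at `No`.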
|
||
### Add a header to the list of unique gene names associated with significant DE in any of the comparisons | ||
|
||
We do this with the tool `Add Header` | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Add Header` settings" | ||
- List of Column headers (comma delimited, e.g. C1,C2,...) | ||
|
||
--> `All_DE_genes` | ||
- Data File (tab-delimted) | ||
|
||
--> `Unique on data 1xx...` | ||
- Click the `Run Tool` button | ||
|
||
### Intersection (join operation) between the list of unique gene names associated with DE and the rLog-Normalized counts file. | ||
|
||
This is the moment when we are going to use the `rLog-Normalized counts file on data...` | ||
and intersect it (join operation) with the list of DE genes in all three conditions. | ||
|
||
To do this, we are going to use the tool | ||
![](images/tool_small.png){width="25" align="absbottom"}`Join two files` | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Join two files` settings" | ||
- 1st file | ||
|
||
--> `rLog-Normalized counts file on data...` | ||
- Column to use from 1st file | ||
|
||
--> `1` | ||
- 2nd File | ||
|
||
--> `All_DE_genes` | ||
- Column to use from 2nd file | ||
|
||
--> `1` | ||
- Output lines appearing in | ||
|
||
--> `Both 1st and 2nd files` | ||
- First line is a header line | ||
|
||
--> `Yes` | ||
- Ignore case | ||
|
||
--> `No` | ||
- Value to put in unpaired (empty) fields | ||
|
||
--> `NA` | ||
- Click the `Run Tool` button | ||
|
||
:warning: Rename the output dataset `rLog-Normalized counts of DE genes` | ||
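
The join with `Both 1st and 2nd files` behaves like an inner join on the gene name. The Python sketch below illustrates it on invented data. The sample names, count values and the function name `inner_join` are hypothetical. It relies on the second file having a single column, so that the joined rows are simply the matching rows of the counts table.

```python
def inner_join(counts_rows, gene_rows):
    """Keep the header and the counts rows whose gene is in the gene list."""
    wanted = {row[0] for row in gene_rows[1:]}  # skip the header line
    header, body = counts_rows[0], counts_rows[1:]
    return [header] + [row for row in body if row[0] in wanted]


# Invented rLog-normalized counts (two samples only, for brevity).
counts = [
    ["GeneID", "sample_1", "sample_2"],
    ["geneA", "8.1", "11.3"],
    ["geneX", "5.0", "5.1"],
    ["geneC", "9.4", "6.2"],
]
de_genes = [["All_DE_genes"], ["geneA"], ["geneB"], ["geneC"]]
joined = inner_join(counts, de_genes)
print([row[0] for row in joined[1:]])
```

Genes present in only one of the two files (`geneX` and `geneB` here) are dropped, so the `NA` fill value is never needed with this setting.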
|
||
### Plot a heatmap of the rLog-Normalized counts of DE genes in all three conditions | ||
|
||
We do this using the ![](images/tool_small.png){width="25" align="absbottom"}`Plot | ||
heatmap with high number of rows` tool | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Plot heatmap with high number of rows` settings" | ||
- Input should have column headers - these will be the columns that are plotted | ||
|
||
--> `rLog-Normalized counts of DE genes` | ||
- Data transformation | ||
|
||
--> `Plot the data as it is` | ||
- Enable data clustering | ||
|
||
--> `Yes` | ||
|
||
- Clustering columns and rows | ||
|
||
--> `Cluster rows and not columns` | ||
- Distance method | ||
|
||
--> `Euclidean` | ||
|
||
- Clustering method | ||
|
||
--> `Complete` | ||
- Labeling columns and rows | ||
|
||
--> `Label columns and not rows` | ||
|
||
- Coloring groups | ||
|
||
--> `Blue to white to red` | ||
- Data scaling | ||
|
||
--> `Scale my data by row` | ||
- tweak plot height | ||
|
||
--> `35` | ||
- tweak row label size | ||
|
||
--> `1` | ||
- tweak line height | ||
|
||
--> `24` | ||
- `Run Tool` |
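
The `Scale my data by row` option can be understood as turning each gene's values into z-scores, so that genes with very different expression levels share one colour scale. The Python sketch below assumes the usual definition (subtract the row mean, divide by the row sample standard deviation). That the tool uses exactly this formula is an assumption. The values are invented.

```python
from statistics import mean, stdev


def row_zscores(values):
    """Center a row on its mean and scale it by its sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        # A constant row carries no pattern; map it to zeros to avoid dividing by 0.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


# Invented rLog-normalized counts of one gene across three samples.
scaled = row_zscores([1.0, 2.0, 3.0])
print(scaled)
```

After scaling, each row has a mean of 0, so the white colour of the `Blue to white to red` palette corresponds to a gene's average expression across samples.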