Skip to content

Commit

Permalink
Week 2 forward (#97)
Browse files Browse the repository at this point in the history
* Update 16_strandness.md

* progressing throught HISAT2

* finished STAR section

* Update 18-1_UCSC_visualisation.md

* Delete ucsc.png

* continue on ucsc visualisation
  • Loading branch information
drosofff committed Jan 8, 2024
1 parent beeeebb commit 6c6bbbf
Show file tree
Hide file tree
Showing 15 changed files with 527 additions and 218 deletions.
2 changes: 1 addition & 1 deletion docs/bulk_RNAseq-IOC/11_uploads.md
Original file line number Diff line number Diff line change
Expand Up @@ -226,7 +226,7 @@ The second one is a single list of the run_accession IDs
```

The third one is a run_accession / sample_title match table:

<a name="table3"></a>
!!! info "Table 3"
```
run_accession sample_title
Expand Down
211 changes: 147 additions & 64 deletions docs/bulk_RNAseq-IOC/16_strandness.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,70 +21,153 @@ the reference gene model (from an annotation file).
Based on the strand of the genes, it can gauge whether sequencing is strand-specific,
and if so, how reads are stranded.



## Use of `Infer Experiment` tool

----
![](images/tool_small.png)

### `Convert GTF to BED12` tool to convert the GTF file to BED

1. Go to your history `STAR` or `HISAT2`
2. Select the tool `Convert GTF to BED12`
1. `GTF File to convert`: Drosophila_melanogaster.BDGP6.95.gtf
3. `Execute`

----
![](images/tool_small.png)

### `Infer Experiment` tool to determine the library strandness

1. In the same history `STAR` or `HISAT2`
2. Select the tool `Infer Experiment`
1. `Input .bam file`: mapped.bam files (outputs of RNA STAR or HISAT2 tools)
2. `Reference gene model` : BED12 file (output of Convert GTF to BED12 tool)
3. `Number of reads sampled from SAM/BAM file (default = 200000)`: 200000
3. `Execute`

----
![](images/tool_small.png)
!!! info "a paradoxical situation"
At this point you can notice a sort of paradox: splice-aware aligner need to know the
strandness of the library, but if you do not know this strandness, you must use the
"infer experiment" tool which itself will analyze a bam alignment...

If so, you are right !

However:

1. It is not necessary to use a splice-aware aligner to generate this first set of
alignment. A "simple" bowtie or bwa alignment will be enough and less resource-demanding.
2. You can even use HISAT2 and specify "unstranded" for the library design. Even if it
is not true, this is not an issue since the bam file will be re-analyzed by the tool
`infer experiment`.
3. You can anyway use the STAR aligner to generate this fist BAM file, since STAR does
not require specifying whether the library is stranded or not.
4. Finally, all your libraries have most likely the same design, unless they have not been
generated at the same time. Therefore, analyzing a **single** sequencing dataset should
be enough !

## Preliminary read alignment from a single sequencing dataset

To keep it sample, we are going to do this preliminary alignment using the BWA aligner.
Although BWA is not commonly used in transcriptome analysis, it will be the opportunity
for you to use it once.

- [x] First of all, create a new history and rename it `PRJNA630433 Strandness analysis`
- [x] Using the data library. Follow the main menu Shared Data --> Data Libraries -->
IOC_bulk_RNAseq --> PRJNA630433 --> FASTQ files, and check out the `SRR11688218` dataset.
- [x] Select the `Export to History`tab and `as Datasets`
- [x] On the next panel that pops up, select the newly created history `PRJNA630433
Strandness analysis` (which should be already selected). Click the `Import` button.
- [x] If you are fast enough, click on the green pop up area which will return you to the
working history where the dataset is now imported and visible. Otherwise, click on the
house icon which will get you to the same history.

![](images/tool_small.png) Now, select the `Map with BWA-MEM` tool and check that his version is `0.7.17.2`. If it
is not, change it using the version icon (3 stacked cubes) at the top-right of the tool
form.
!!! info "![](images/tool_small.png){width="25" align="absbottom"} Map with BWA-MEM tool settings"
- Will you select a reference genome from your history or use a built-in index?

--> Use a built-in genome index
- Using reference genome

--> GRCm38
- Single or Paired-end reads

--> single
- Select fastq dataset

--> SRR11688218
- Leave other paramaters as is
- Click the `Execute` button

The tool should run about 2-3 mins before returning a green bam dataset, which you can
examine with the eye icon

- [x] Before using the `Infer Experiment` tool, you will need the GTF annotation file
"Mus_musculus.GRCm38.102.chr.gtf" from the data library `IOC_bulk_RNAseq`. Import it as
you already did with the `SRR11688218` dataset.


## ![](images/tool_small.png){width="30" align="absbottom"} Use of `Infer Experiment` tool

Unfortunalty, the `Infer Experiment` tool does not with annotations in GTF format, but rather
in another format: the BED12 format. No worries, there is a converter tool for this.

!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Convert GTF to BED12` settings"
- GTF File to convert

--> Mus_musculus.GRCm38.102.chr.gtf
- Advanced options

--> Use default options
- Click `Execute` button

This will return a bed12 dataset "Convert GTF to BED12 on data 2: BED12" to be used In
the next step

!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Infer Experiment` settings"
- Input BAM file

--> Map with BWA-MEM on data 1 (mapped reads in BAM format)
- Reference gene model

--> "Convert GTF to BED12 on data 2: BED12" (the output of Convert GTF to BED12 tool)
- Number of reads sampled
--> 200000

- Minimum mapping quality

--> 30
- `Execute`

The output of the `Infer Experiment` tool is the following text file:
```
This is SingleEnd Data
Fraction of reads failed to determine: 0.0299
Fraction of reads explained by "++,--": 0.0035
Fraction of reads explained by "+-,-+": 0.9666
```

In most cases, when a read in mapped to the (+) strand of the genome, the parental gene is
located on the (-) strand of the genome (thus "+-") or conversely when a read in mapped to
the (-) strand of the genome, the parental gene is located on the (+) strand of the genome
(thus "-+").

In conclusion, in this study, the library is **stranded** and reads correspond to the reverse
of transcribed RNA.
??? note "![](images/lamp.png){width="30" align="absbottom"} Infer experiment tool explained in greater details"
Infer Experiment tool generates one file with information on:

- Paired-end or single-end library
- Fraction of reads failed to determine
- 2 lines:
- For single-end
- Fraction of reads explained by “++,–” (SF in previous figure)
- Fraction of reads explained by “+-,-+” (SR in previous figure)
- For paired-end
- Fraction of reads explained by “1++,1–,2+-,2-+” (SF in previous figure)
- Fraction of reads explained by “1+-,1-+,2++,2–” (SR in previous figure)

If the two “Fraction of reads explained by” numbers are close to each other (*i.e.* a mix of SF and SR),
we conclude that the library is not a strand-specific dataset (U in previous figure).

As it is sometimes quite difficult to find out which settings correspond to those of
other programs, the following table might be helpful to identify the library type:

|Library type |Infer Experiment | TopHat | HISAT | htseq-count |featureCounts |
|--------------------------|-----------------|------------------|------------------|-------------|--------------|
|Paired-End (PE) - SF |1++,1–,2+-,2-+ |FR Second Strand |Second Strand F/FR|yes |Forward (1) |
|PE - SR |1+-,1-+,2++,2– |FR First Strand |First Strand R/RF |reverse |Reverse (2) |
|Single-End (SE) - SF |+,– |FR Second Strand |Second Strand F/FR|yes |Forward (1) |
|SE - SR |+-,-+ |FR First Strand |First Strand R/RF |reverse |Reverse (2) |
|PE, SE - U |undecided |FR Unstranded |default |no |Unstranded (0)|

### Summarize results with `MultiQC` tool

1. Select the tool `MultiQC`
2. Which tool was used generate logs?
- `RSeQC`
3. RSeQC output (Type of RSeQC output?)
- infer_experiment
4. Select the 7 datasets of type `Infer Experiment on ...`
3. `Execute`

----
![](images/lamp.png)

Infer Experiment tool generates one file with information on:

- Paired-end or single-end library
- Fraction of reads failed to determine
- 2 lines:
- For single-end
- Fraction of reads explained by “++,–” (SF in previous figure)
- Fraction of reads explained by “+-,-+” (SR in previous figure)
- For paired-end
- Fraction of reads explained by “1++,1–,2+-,2-+” (SF in previous figure)
- Fraction of reads explained by “1+-,1-+,2++,2–” (SR in previous figure)

If the two “Fraction of reads explained by” numbers are close to each other (*i.e.* a mix of SF and SR),
we conclude that the library is not a strand-specific dataset (U in previous figure).

As it is sometimes quite difficult to find out which settings correspond to those of
other programs, the following table might be helpful to identify the library type:

|Library type |Infer Experiment | TopHat | HISAT | htseq-count |featureCounts |
|--------------------------|-----------------|------------------|------------------|-------------|--------------|
|Paired-End (PE) - SF |1++,1–,2+-,2-+ |FR Second Strand |Second Strand F/FR|yes |Forward (1) |
|PE - SR |1+-,1-+,2++,2– |FR First Strand |First Strand R/RF |reverse |Reverse (2) |
|Single-End (SE) - SF |+,– |FR Second Strand |Second Strand F/FR|yes |Forward (1) |
|SE - SR |+-,-+ |FR First Strand |First Strand R/RF |reverse |Reverse (2) |
|PE, SE - U |undecided |FR Unstranded |default |no |Unstranded (0)|
For a marginal benefit, you can also use the `MultiQC` tool
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `MultiQC` settings"
- Which tool was used generate logs?

--> RSeQC
- RSeQC output (Type of RSeQC output?)

--> infer_experiment
- Select dataset `Infer Experiment on ...`
- `Execute`

Loading

0 comments on commit 6c6bbbf

Please sign in to comment.