Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ bioinformatics subjects:
* [Assembly](assembly.md)
* [Annotation](annotation.md)
* [Variant Calling](variant_calling.md)
* [Pan-genome Analysis](pan-genome.md)
* [Pan-genome Analysis](pan_genome.md)
* [RNA-Seq](rna_seq.md)
* [16s Metabarcoding Analysis](16s.md)
* [Whole Metagenome Sequencing](wms.md)
Expand Down
24 changes: 12 additions & 12 deletions file_formats.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# File Formats

This lecture is aimed at making you discover the most popular file formats used in bioinforatics. You're expected to have basic working knowledge of linux to be able to follow the lesson.
This lecture is aimed at making you discover the most popular file formats used in bioinformatics. You're expected to have basic working knowledge of Linux to be able to follow the lesson.

## Table of Contents
* [The fasta format](#the-fasta-format)
Expand All @@ -11,7 +11,7 @@ This lecture is aimed at making you discover the most popular file formats used

### The fasta format

The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics
The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics.

The first line in a FASTA file starts with a ">" (greater-than) symbol followed by the description or identifier of the sequence. Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code.

Expand All @@ -31,7 +31,7 @@ HDGVLSVNAKRDSFNDESDSEGNVIASERSYGRFARQYSLPNVDESGIKAKCEDGVLKLTLPKLAEEKIN
GNHIEIE
```

A fasta file can contain multiple sequence. Each sequence will be separated by their "header" line, starting by ">"
A fasta file can contain multiple sequence. Each sequence will be separated by their "header" line, starting by ">".

Example:

Expand All @@ -49,7 +49,7 @@ SQFIGYPITLFVEK

### The fastq format

The fastq format is also a text based format to represent nucleotide sequences, but also contains the coresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as the Illumina machines.
The fastq format is also a text based format to represent nucleotide sequences, but also contains the corresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as the Illumina machines.

A fastq file uses four lines per sequence:

Expand All @@ -69,9 +69,9 @@ GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

#### Quality

The quality, also called phred scores is the probability that the corresponding basecall is incorrect.
The quality, also called phred score, is the probability that the corresponding basecall is incorrect.

Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40
Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40.

| Phred Quality Score | Probability of incorrect base call | Base call accuracy
| --- | --- | ---
Expand All @@ -90,9 +90,9 @@ SAM (file format) is a text-based format for storing biological sequences aligne

The SAM format consists of a header and an alignment section. The binary representation of a SAM file is a BAM file, which is a compressed SAM file.[1] SAM files can be analysed and edited with the software SAMtools.

The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf)
The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf).

In brief it consists in a header section and reads (with other information) in tab delimited format.
In brief it consists of a header section and reads (with other information) in tab delimited format.

#### Example header section

Expand All @@ -111,9 +111,9 @@ M01137:130:00-A:17009:1352/14 * 0 0 * * 0 0 AGCAAAATACAACGATCTGGATGGTAGCATTAGCGA

### the vcf format

the vcf is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format hs been developped for genotyping projects, and is the standard to represent variations in the genome of a species.
The vcf format is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format has been developped for genotyping projects, and is the standard to represent variations in the genome of a species.

A vcf is a tab-delimited file with 9 columns, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
A vcf is a tab-delimited file, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf).

#### VCF Example

Expand Down Expand Up @@ -163,9 +163,9 @@ chr2 215634055 . C T

### the gff format

The general feature format (gff) is another text file format. used for describing genes and other features of DNA, RNA and protein sequences. It is the standrad for annotation of genomes.
The general feature format (gff) is another text file format, used for describing genes and other features of DNA, RNA and protein sequences. It is the standard for annotation of genomes.

A gff file should, as a vcf, conatins 9 columns described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)
A gff file should contain 9 columns, described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)

#### Example gff

Expand Down
22 changes: 14 additions & 8 deletions qc.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ The sequencing was done as paired-end 2x150bp.

## Downloading the data

The Raw data were deposited at the European nucleotide archive, under the accession number SRR957824, go the the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two fastq files associated with the run:
The raw data were deposited at the European Nucleotide Archive, under the accession number SRR957824. Go to the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two fastq files associated with the run:

```
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_1.fastq.gz
Expand All @@ -18,18 +18,18 @@ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_2.fastq.gz

To check the quality of the sequence data we will use a tool called FastQC. With this you can check things like read length distribution, quality distribution across the read length, sequencing artifacts and much more.

FastQC has a graphical interface and can be downloaded and ran on a Windows or LINUX computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
FastQC has a graphical interface and can be downloaded and run on a Windows or Linux computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follow:
However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follows:

```
module load fastqc
module load FastQC
fastqc $read1 $read2
```

which will produce both a .zip archive containing all the plots, and a html document for you to look at the result in your browser.

Open the html file with your favourite web browser, and try to interpret them
Open the html file with your favourite web browser, and try to interpret them.

Pay special attention to the per base sequence quality and sequence length distribution. Explanations for the various quality modules can be found [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/). Also, have a look at examples of a [good](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and a [bad](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) illumina read set for comparison.

Expand All @@ -49,7 +49,9 @@ cd scythe
make all
```

Then, copy or move "scythe" to a directory in your $PATH.
Then, copy or move "scythe" to a directory in your $PATH, for example like this:

`cp scythe $HOME/bin/`

Scythe can be run minimally with:

Expand All @@ -73,6 +75,10 @@ cd sickle
make
```

Copy sickle to a directory in your $PATH:

`cp sickle $HOME/bin/`

Sickle has two modes to work with both paired-end and single-end reads: sickle se and sickle pe.

Running sickle by itself will print the help:
Expand All @@ -95,8 +101,8 @@ What did the trimming do to the per-base sequence quality, the per sequence qual

What is the sequence duplication levels graph about? Why should you care about a high level of duplication, and why is the level of duplication very low for this data?

Based on the FastQC report, there seems to be a population of shorter reads that are technical artefacts. We will ignore them for now as they will not interfere with our analysis.
Based on the FastQC report, there seems to be a population of shorter reads that are technical artifacts. We will ignore them for now as they will not interfere with our analysis.

## Extra exercises

Perform quality control on the extra datasets given by your instructors
Perform quality control on the extra datasets given by your instructors.
74 changes: 38 additions & 36 deletions rna_seq.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,10 @@
## Load salmon

```
module load salmon
module load Salmon
```

## Downloading the data.
## Downloading the data

For this tutorial we will use the test data from [this](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393) paper:

Expand All @@ -27,7 +27,7 @@ So to summarize we have:
* HBR + ERCC Spike-In Mix2, Replicate 2
* HBR + ERCC Spike-In Mix2, Replicate 3

You can download the data from [here](http://139.162.178.46/files/tutorials/toy_rna.tar.gz)
You can download the data from [here](http://139.162.178.46/files/tutorials/toy_rna.tar.gz).

Unpack the data and go into the toy_rna directory

Expand All @@ -36,13 +36,13 @@ tar xzf toy_rna.tar.gz
cd toy_rna
```

## indexing transcriptome
## Indexing transcriptome

```
salmon index -t chr22_transcripts.fa -i chr22_index
```

## quantify reads using salmon
## Quantify reads using salmon

```bash
for i in *_R1.fastq.gz
Expand All @@ -64,9 +64,9 @@ Salmon exposes many different options to the user that enable extra features or

After the salmon commands finish running, you should have a directory named `quant`, which will have a sub-directory for each sample. These sub-directories contain the quantification results of salmon, as well as a lot of other information salmon records about the sample and the run. The main output file (called quant.sf) is rather self-explanatory. For example, take a peek at the quantification file for sample `HBR_Rep1` in `quant/HBR_Rep1/quant.sf` and you’ll see a simple TSV format file listing the name (Name) of each transcript, its length (Length), effective length (EffectiveLength) (more details on this in the documentation), and its abundance in terms of Transcripts Per Million (TPM) and estimated number of reads (NumReads) originating from this transcript.

## import read counts using tximport
## Import read counts using tximport

Using the tximport R package, you can import salmon’s transcript-level quantifications and optionally aggregate them to the gene level for gene-level differential expression analysis
Using the tximport R package, you can import salmon’s transcript-level quantifications and optionally aggregate them to the gene level for gene-level differential expression analysis.

First, open up your favourite R IDE and install the necessary packages:

Expand All @@ -86,59 +86,57 @@ library(GenomicFeatures)
library(readr)
```

Salmon did the quantifiation of the transcript level. We want to see which genes are differentially expressed, so we need to link the transcripts name to the gene names. We can use our .gtf annotation for that, and the GenomicFeatures package:
Salmon did the quantifiation of the transcript level. We want to see which genes are differentially expressed, so we need to link the transcript names to the gene names. We can use our .gtf annotation for that, and the GenomicFeatures package:

```R
txdb <- makeTxDbFromGFF("chr22_genes.gtf")
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]
tx2gene <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
head(tx2gene)
```

now we can import the salmon quantification:
Now we can import the salmon quantification. First, download the file with sample descriptions from [here](https://raw.githubusercontent.com/HadrienG/tutorials/master/data/samples.txt) and put it in the toy_rna directory. Then, use that file to load the corresponding quantification data.

```R
samples <- read.table("samples.txt", header = TRUE)
files <- file.path("quant", samples$quant, "quant.sf")
files <- file.path("quant", samples$sample, "quant.sf")
names(files) <- paste0(samples$sample)
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, reader = read_tsv)
```

take a look at the data:
Take a look at the data:

```R
head(txi.salmon$counts)
```

## differential expression using DeSeq2
## Differential expression using DESeq2

install the necessary package
Install the necessary package:

```R
biocLite('DESeq2')
```

then load it:
Then load it:

```R
library(DESeq2)
```

Instantiate the DESeqDataSet and generate result table. See ?DESeqDataSetFromTximport and ?DESeq for more information about the steps performed by the program.

Instantiate the DESeqDataSet and generate result table. See `?DESeqDataSetFromTximport` and `?DESeq` for more information about the steps performed by the program.

```R
dds <- DESeqDataSetFromTximport(txi.salmon, samples, ~condition)
dds <- DESeq(dds)
res <- results(dds)
```

run the `summary` command to have an idea of how many genes are up and down-regulated between the two conditions
Run the `summary` command to get an idea of how many genes are up- and downregulated between the two conditions:

`summary(res)`

DESeq uses a negative binomial distribution. Such distribution has two parameters: mean and dispersion. The dispersion is a parameter describing how much the variance deviates from the mean.
DESeq uses a negative binomial distribution. Such distributions have two parameters: mean and dispersion. The dispersion is a parameter describing how much the variance deviates from the mean.

You can read more about the methods used by DESeq2 in the [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) or the [vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq/inst/doc/DESeq.pdf)

Expand Down Expand Up @@ -218,17 +216,17 @@ library(gageData)
Let’s use the `mapIds` function to add more columns to the results. The row.names of our results table has the Ensembl gene ID (our key), so we need to specify `keytype=ENSEMBL`. The column argument tells the `mapIds` function which information we want, and the `multiVals` argument tells the function what to do if there are multiple possible values for a single input value. Here we ask to just give us back the first one that occurs in the database. Let’s get the Entrez IDs, gene symbols, and full gene names.

```R
res$symbol = mapIds(org.Hs.eg.db,
res$symbol <- mapIds(org.Hs.eg.db,
keys=row.names(res),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
res$entrez = mapIds(org.Hs.eg.db,
res$entrez <- mapIds(org.Hs.eg.db,
keys=row.names(res),
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
res$name = mapIds(org.Hs.eg.db,
res$name <- mapIds(org.Hs.eg.db,
keys=row.names(res),
column="GENENAME",
keytype="ENSEMBL",
Expand All @@ -237,51 +235,55 @@ res$name = mapIds(org.Hs.eg.db,
head(res)
```

We’re going to use the [gage](http://bioconductor.org/packages/release/bioc/html/gage.html) package for pathway analysis, and the [pathview](http://bioconductor.org/packages/release/bioc/html/pathview.html) package to draw a pathway diagram.

We’re going to use the [gage](https://bioconductor.org/packages/release/bioc/html/gage.html) package for pathway analysis, and the [pathview](https://bioconductor.org/packages/release/bioc/html/pathview.html) package to draw a pathway diagram.

The gageData package has pre-compiled databases mapping genes to KEGG pathways and GO terms for common organisms:

```R
data(kegg.sets.hs)
data(sigmet.idx.hs)
kegg.sets.hs = kegg.sets.hs[sigmet.idx.hs]
kegg.sets.hs <- kegg.sets.hs[sigmet.idx.hs]
head(kegg.sets.hs, 3)
```

Run the pathway analysis. See help on the gage function with ?gage. Specifically, you might want to try changing the value of same.dir
Run the pathway analysis. See help on the gage function with `?gage`. Specifically, you might want to try changing the value of same.dir.

```R
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
keggres = gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)
foldchanges <- res$log2FoldChange
names(foldchanges) <- res$entrez
keggres <- gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)
lapply(keggres, head)
```

pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting.
Pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting. The `dplyr` package is required to use the pipe (`%>%`) construct.

```R
library(dplyr)

# Get the pathways
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
keggrespathways <- data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number()<=5) %>%
.$id %>%
as.character()
keggrespathways

# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids <- substr(keggrespathways, start=1, stop=8)
keggresids
```

Finally, the pathview() function in the pathview package makes the plots. Let’s write a function so we can loop through and draw plots for the top 5 pathways we created above.
Finally, the `pathview()` function in the pathview package makes the plots. Let’s write a function so we can loop through and draw plots for the top 5 pathways we created above.

```R
# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE)
plot_pathway <- function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE)

# Unload dplyr since it conflicts with the next line
detach("package:dplyr", unload=T)

# plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa"))
tmp <- sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa"))
```

#### Thanks
Expand Down
Loading