diff --git a/README.md b/README.md index 3119c83..ccb613e 100644 --- a/README.md +++ b/README.md @@ -11,7 +11,7 @@ bioinformatics subjects: * [Assembly](assembly.md) * [Annotation](annotation.md) * [Variant Calling](variant_calling.md) -* [Pan-genome Analysis](pan-genome.md) +* [Pan-genome Analysis](pan_genome.md) * [RNA-Seq](rna_seq.md) * [16s Metabarcoding Analysis](16s.md) * [Whole Metagenome Sequencing](wms.md) diff --git a/file_formats.md b/file_formats.md index 3940d55..866fe6d 100644 --- a/file_formats.md +++ b/file_formats.md @@ -1,6 +1,6 @@ # File Formats -This lecture is aimed at making you discover the most popular file formats used in bioinforatics. You're expected to have basic working knowledge of linux to be able to follow the lesson. +This lecture is aimed at making you discover the most popular file formats used in bioinformatics. You're expected to have basic working knowledge of Linux to be able to follow the lesson. ## Table of Contents * [The fasta format](#the-fasta-format) @@ -11,7 +11,7 @@ This lecture is aimed at making you discover the most popular file formats used ### The fasta format -The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics +The fasta format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the [FASTA](https://en.wikipedia.org/wiki/FASTA) software package, but is now a standard in the world of bioinformatics. The first line in a FASTA file starts with a ">" (greater-than) symbol followed by the description or identifier of the sequence. Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code. @@ -31,7 +31,7 @@ HDGVLSVNAKRDSFNDESDSEGNVIASERSYGRFARQYSLPNVDESGIKAKCEDGVLKLTLPKLAEEKIN GNHIEIE ``` -A fasta file can contain multiple sequence. Each sequence will be separated by their "header" line, starting by ">" +A fasta file can contain multiple sequence. Each sequence will be separated by their "header" line, starting by ">". Example: @@ -49,7 +49,7 @@ SQFIGYPITLFVEK ### The fastq format -The fastq format is also a text based format to represent nucleotide sequences, but also contains the coresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as the Illumina machines. +The fastq format is also a text based format to represent nucleotide sequences, but also contains the corresponding quality of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as the Illumina machines. A fastq file uses four lines per sequence: @@ -69,9 +69,9 @@ GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT #### Quality -The quality, also called phred scores is the probability that the corresponding basecall is incorrect. +The quality, also called phred score, is the probability that the corresponding basecall is incorrect. -Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40 +Phred scores use a logarithmic scale, and are represented by ASCII characters, mapping to a quality usually going from 0 to 40. | Phred Quality Score | Probability of incorrect base call | Base call accuracy | --- | --- | --- @@ -90,9 +90,9 @@ SAM (file format) is a text-based format for storing biological sequences aligne The SAM format consists of a header and an alignment section. The binary representation of a SAM file is a BAM file, which is a compressed SAM file.[1] SAM files can be analysed and edited with the software SAMtools. -The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf) +The SAM format has a really extensive and complex specification that you can find [here](https://samtools.github.io/hts-specs/SAMv1.pdf). -In brief it consists in a header section and reads (with other information) in tab delimited format. +In brief it consists of a header section and reads (with other information) in tab delimited format. #### Example header section @@ -111,9 +111,9 @@ M01137:130:00-A:17009:1352/14 * 0 0 * * 0 0 AGCAAAATACAACGATCTGGATGGTAGCATTAGCGA ### the vcf format -the vcf is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format hs been developped for genotyping projects, and is the standard to represent variations in the genome of a species. +The vcf format is also a text-based file format. VCF stands for Variant Call Format and is used to store gene sequence variations (SNVs, indels). The format has been developped for genotyping projects, and is the standard to represent variations in the genome of a species. -A vcf is a tab-delimited file with 9 columns, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf) +A vcf is a tab-delimited file, described [here](https://samtools.github.io/hts-specs/VCFv4.2.pdf). #### VCF Example @@ -163,9 +163,9 @@ chr2 215634055 . C T ### the gff format -The general feature format (gff) is another text file format. used for describing genes and other features of DNA, RNA and protein sequences. It is the standrad for annotation of genomes. +The general feature format (gff) is another text file format, used for describing genes and other features of DNA, RNA and protein sequences. It is the standard for annotation of genomes. -A gff file should, as a vcf, conatins 9 columns described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) +A gff file should contain 9 columns, described [here](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) #### Example gff diff --git a/qc.md b/qc.md index ce27290..b6ce30a 100644 --- a/qc.md +++ b/qc.md @@ -7,7 +7,7 @@ The sequencing was done as paired-end 2x150bp. ## Downloading the data -The Raw data were deposited at the European nucleotide archive, under the accession number SRR957824, go the the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two fastq files associated with the run: +The raw data were deposited at the European Nucleotide Archive, under the accession number SRR957824. Go to the ENA [website](http://www.ebi.ac.uk/ena) and search for the run with the accession SRR957824. Download the two fastq files associated with the run: ``` wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_1.fastq.gz @@ -18,18 +18,18 @@ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_2.fastq.gz To check the quality of the sequence data we will use a tool called FastQC. With this you can check things like read length distribution, quality distribution across the read length, sequencing artifacts and much more. -FastQC has a graphical interface and can be downloaded and ran on a Windows or LINUX computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). +FastQC has a graphical interface and can be downloaded and run on a Windows or Linux computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). -However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follow: +However, FastQC is also available as a command line utility on the training server you are using. You can load the module and execute the program as follows: ``` -module load fastqc +module load FastQC fastqc $read1 $read2 ``` which will produce both a .zip archive containing all the plots, and a html document for you to look at the result in your browser. -Open the html file with your favourite web browser, and try to interpret them +Open the html file with your favourite web browser, and try to interpret them. Pay special attention to the per base sequence quality and sequence length distribution. Explanations for the various quality modules can be found [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/). Also, have a look at examples of a [good](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and a [bad](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) illumina read set for comparison. @@ -49,7 +49,9 @@ cd scythe make all ``` -Then, copy or move "scythe" to a directory in your $PATH. +Then, copy or move "scythe" to a directory in your $PATH, for example like this: + +`cp scythe $HOME/bin/` Scythe can be run minimally with: @@ -73,6 +75,10 @@ cd sickle make ``` +Copy sickle to a directory in your $PATH: + +`cp sickle $HOME/bin/` + Sickle has two modes to work with both paired-end and single-end reads: sickle se and sickle pe. Running sickle by itself will print the help: @@ -95,8 +101,8 @@ What did the trimming do to the per-base sequence quality, the per sequence qual What is the sequence duplication levels graph about? Why should you care about a high level of duplication, and why is the level of duplication very low for this data? -Based on the FastQC report, there seems to be a population of shorter reads that are technical artefacts. We will ignore them for now as they will not interfere with our analysis. +Based on the FastQC report, there seems to be a population of shorter reads that are technical artifacts. We will ignore them for now as they will not interfere with our analysis. ## Extra exercises -Perform quality control on the extra datasets given by your instructors +Perform quality control on the extra datasets given by your instructors. diff --git a/rna_seq.md b/rna_seq.md index 1ee655f..9fade01 100644 --- a/rna_seq.md +++ b/rna_seq.md @@ -3,10 +3,10 @@ ## Load salmon ``` -module load salmon +module load Salmon ``` -## Downloading the data. +## Downloading the data For this tutorial we will use the test data from [this](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393) paper: @@ -27,7 +27,7 @@ So to summarize we have: * HBR + ERCC Spike-In Mix2, Replicate 2 * HBR + ERCC Spike-In Mix2, Replicate 3 -You can download the data from [here](http://139.162.178.46/files/tutorials/toy_rna.tar.gz) +You can download the data from [here](http://139.162.178.46/files/tutorials/toy_rna.tar.gz). Unpack the data and go into the toy_rna directory @@ -36,13 +36,13 @@ tar xzf toy_rna.tar.gz cd toy_rna ``` -## indexing transcriptome +## Indexing transcriptome ``` salmon index -t chr22_transcripts.fa -i chr22_index ``` -## quantify reads using salmon +## Quantify reads using salmon ```bash for i in *_R1.fastq.gz @@ -64,9 +64,9 @@ Salmon exposes many different options to the user that enable extra features or After the salmon commands finish running, you should have a directory named `quant`, which will have a sub-directory for each sample. These sub-directories contain the quantification results of salmon, as well as a lot of other information salmon records about the sample and the run. The main output file (called quant.sf) is rather self-explanatory. For example, take a peek at the quantification file for sample `HBR_Rep1` in `quant/HBR_Rep1/quant.sf` and you’ll see a simple TSV format file listing the name (Name) of each transcript, its length (Length), effective length (EffectiveLength) (more details on this in the documentation), and its abundance in terms of Transcripts Per Million (TPM) and estimated number of reads (NumReads) originating from this transcript. -## import read counts using tximport +## Import read counts using tximport -Using the tximport R package, you can import salmon’s transcript-level quantifications and optionally aggregate them to the gene level for gene-level differential expression analysis +Using the tximport R package, you can import salmon’s transcript-level quantifications and optionally aggregate them to the gene level for gene-level differential expression analysis. First, open up your favourite R IDE and install the necessary packages: @@ -86,47 +86,45 @@ library(GenomicFeatures) library(readr) ``` -Salmon did the quantifiation of the transcript level. We want to see which genes are differentially expressed, so we need to link the transcripts name to the gene names. We can use our .gtf annotation for that, and the GenomicFeatures package: +Salmon did the quantifiation of the transcript level. We want to see which genes are differentially expressed, so we need to link the transcript names to the gene names. We can use our .gtf annotation for that, and the GenomicFeatures package: ```R txdb <- makeTxDbFromGFF("chr22_genes.gtf") k <- keys(txdb, keytype = "GENEID") -df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME") -tx2gene <- df[, 2:1] +tx2gene <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME") head(tx2gene) ``` -now we can import the salmon quantification: +Now we can import the salmon quantification. First, download the file with sample descriptions from [here](https://raw.githubusercontent.com/HadrienG/tutorials/master/data/samples.txt) and put it in the toy_rna directory. Then, use that file to load the corresponding quantification data. ```R samples <- read.table("samples.txt", header = TRUE) -files <- file.path("quant", samples$quant, "quant.sf") +files <- file.path("quant", samples$sample, "quant.sf") names(files) <- paste0(samples$sample) txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, reader = read_tsv) ``` -take a look at the data: +Take a look at the data: ```R head(txi.salmon$counts) ``` -## differential expression using DeSeq2 +## Differential expression using DESeq2 -install the necessary package +Install the necessary package: ```R biocLite('DESeq2') ``` -then load it: +Then load it: ```R library(DESeq2) ``` -Instantiate the DESeqDataSet and generate result table. See ?DESeqDataSetFromTximport and ?DESeq for more information about the steps performed by the program. - +Instantiate the DESeqDataSet and generate result table. See `?DESeqDataSetFromTximport` and `?DESeq` for more information about the steps performed by the program. ```R dds <- DESeqDataSetFromTximport(txi.salmon, samples, ~condition) @@ -134,11 +132,11 @@ dds <- DESeq(dds) res <- results(dds) ``` -run the `summary` command to have an idea of how many genes are up and down-regulated between the two conditions +Run the `summary` command to get an idea of how many genes are up- and downregulated between the two conditions: `summary(res)` -DESeq uses a negative binomial distribution. Such distribution has two parameters: mean and dispersion. The dispersion is a parameter describing how much the variance deviates from the mean. +DESeq uses a negative binomial distribution. Such distributions have two parameters: mean and dispersion. The dispersion is a parameter describing how much the variance deviates from the mean. You can read more about the methods used by DESeq2 in the [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) or the [vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq/inst/doc/DESeq.pdf) @@ -218,17 +216,17 @@ library(gageData) Let’s use the `mapIds` function to add more columns to the results. The row.names of our results table has the Ensembl gene ID (our key), so we need to specify `keytype=ENSEMBL`. The column argument tells the `mapIds` function which information we want, and the `multiVals` argument tells the function what to do if there are multiple possible values for a single input value. Here we ask to just give us back the first one that occurs in the database. Let’s get the Entrez IDs, gene symbols, and full gene names. ```R -res$symbol = mapIds(org.Hs.eg.db, +res$symbol <- mapIds(org.Hs.eg.db, keys=row.names(res), column="SYMBOL", keytype="ENSEMBL", multiVals="first") -res$entrez = mapIds(org.Hs.eg.db, +res$entrez <- mapIds(org.Hs.eg.db, keys=row.names(res), column="ENTREZID", keytype="ENSEMBL", multiVals="first") -res$name = mapIds(org.Hs.eg.db, +res$name <- mapIds(org.Hs.eg.db, keys=row.names(res), column="GENENAME", keytype="ENSEMBL", @@ -237,32 +235,33 @@ res$name = mapIds(org.Hs.eg.db, head(res) ``` -We’re going to use the [gage](http://bioconductor.org/packages/release/bioc/html/gage.html) package for pathway analysis, and the [pathview](http://bioconductor.org/packages/release/bioc/html/pathview.html) package to draw a pathway diagram. - +We’re going to use the [gage](https://bioconductor.org/packages/release/bioc/html/gage.html) package for pathway analysis, and the [pathview](https://bioconductor.org/packages/release/bioc/html/pathview.html) package to draw a pathway diagram. The gageData package has pre-compiled databases mapping genes to KEGG pathways and GO terms for common organisms: ```R data(kegg.sets.hs) data(sigmet.idx.hs) -kegg.sets.hs = kegg.sets.hs[sigmet.idx.hs] +kegg.sets.hs <- kegg.sets.hs[sigmet.idx.hs] head(kegg.sets.hs, 3) ``` -Run the pathway analysis. See help on the gage function with ?gage. Specifically, you might want to try changing the value of same.dir +Run the pathway analysis. See help on the gage function with `?gage`. Specifically, you might want to try changing the value of same.dir. ```R -foldchanges = res$log2FoldChange -names(foldchanges) = res$entrez -keggres = gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE) +foldchanges <- res$log2FoldChange +names(foldchanges) <- res$entrez +keggres <- gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE) lapply(keggres, head) ``` -pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting. +Pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting. The `dplyr` package is required to use the pipe (`%>%`) construct. ```R +library(dplyr) + # Get the pathways -keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>% +keggrespathways <- data.frame(id=rownames(keggres$greater), keggres$greater) %>% tbl_df() %>% filter(row_number()<=5) %>% .$id %>% @@ -270,18 +269,21 @@ keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>% keggrespathways # Get the IDs. -keggresids = substr(keggrespathways, start=1, stop=8) +keggresids <- substr(keggrespathways, start=1, stop=8) keggresids ``` -Finally, the pathview() function in the pathview package makes the plots. Let’s write a function so we can loop through and draw plots for the top 5 pathways we created above. +Finally, the `pathview()` function in the pathview package makes the plots. Let’s write a function so we can loop through and draw plots for the top 5 pathways we created above. ```R # Define plotting function for applying later -plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE) +plot_pathway <- function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE) + +# Unload dplyr since it conflicts with the next line +detach("package:dplyr", unload=T) # plot multiple pathways (plots saved to disk and returns a throwaway list object) -tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa")) +tmp <- sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa")) ``` #### Thanks diff --git a/unix/first_steps.md b/unix/first_steps.md index f90cfb2..b9cbd6e 100644 --- a/unix/first_steps.md +++ b/unix/first_steps.md @@ -16,9 +16,9 @@ The aim of this course is to introduce Unix and cover the basics. The programs t * Unix is a well established, very widespread operating system. You probably have a device running on Unix in your home without realising it (e.g. playstation, TV box, wireless router, android tablets/phones,... * Command line driven, with a huge number of often terse, but powerful commands. -* In contrast to Windows, it is designed to allow many users to run their programs simultaneously on the same computer +* In contrast to Windows, it is designed to allow many users to run their programs simultaneously on the same computer. * Designed to work in computer networks - for example, most of the Internet is Unix based. -* It is used on many of the powerful computers at bioinformatics centres and also on many desktops and laptops (Mac OSX is UNIX based). +* It is used on many of the powerful computers at bioinformatics centres and also on many desktops and laptops (MacOS is largely UNIX compatible). * The major difference between Unix and Windows is that it is free (as in freedom) and you can modify it to work however you want. This same principle of freedom is also used in most bioinformatics software. * There are many distributions of Unix such as Ubuntu, RedHat, Fedora, Mint,...). These are all Unix, but they bundle up extra software in a different way or combinations. Some are known for being conservative and reliable; whilst others are know for being cutting-edge (and less reliable). * The MacOSX operating system used by the [eBioKit](77.235.253.122) is also based on Unix. @@ -28,14 +28,14 @@ The aim of this course is to introduce Unix and cover the basics. The programs t For this course, you will have to connect to the eBiokit using SSH. SSH stands for Secure Shell and is a network protocol used to securely connect to a server. To do so, you will need an SSH client: * On Linux: it is included by default, named Terminal. -* On OSX: it is included by default, also named Terminal. +* On MacOS: it is included by default, also named Terminal. * On Windows: you'll have to download and install [MobaXterm](http://mobaxterm.mobatek.net), a terminal emulator. Once you've opened your terminal (or terminal emulator), type `ssh username@ip_address` -replacing `username` and `ip_address` by your username and the ip address of the server you are connecting to. +replacing `username` and `ip_address` with your username and the ip address of the server you are connecting to. Type your password when prompted. As you type, nothing will show on screen. No stars, no dots. It is supposed to be that way. Just type the password and press enter! @@ -75,10 +75,9 @@ Therefore if there is a file called genome.seq in the **dna** directory its loca ### General Points -Unix is pretty straightforward, but there are some general points to remember that will make your life easier: most flavors of UNIX are case sensitive - typing **ls** is generally not the same as typing **LS**. You need to put a space between a command and its argument - for example, **less my_file** will show you the contents of the file called my_file; **moremyfile** will just give you an error! Unix is not psychic: If you misspell the name of a command or the name of a file, it will not understand you. Many of the commands are only a few letters long; this can be confusing until you start to think logically about why those letters were chosen - ls for list, rm for remove and so on. Often when you have problems with Unix, it is due to a spelling mistake, or perhaps you have omitted a space. If you want to know more about Unix and its commands there are plenty of resources available that provide a more comprehensive guide (including a cheat sheet at the end of this chapter. +Unix is pretty straightforward, but there are some general points to remember that will make your life easier: most flavors of UNIX are case sensitive - typing **ls** is generally not the same as typing **LS**. You need to put a space between a command and its argument - for example, **less my_file** will show you the contents of the file called my_file; **lessmyfile** will just give you an error! Unix is not psychic: If you misspell the name of a command or the name of a file, it will not understand you. Many of the commands are only a few letters long; this can be confusing until you start to think logically about why those letters were chosen - ls for list, rm for remove and so on. Often when you have problems with Unix, it is due to a spelling mistake, or perhaps you have omitted a space. If you want to know more about Unix and its commands there are plenty of resources available that provide a more comprehensive guide (including a cheat sheet at the end of this chapter. -*   [http://Unixhelp.ed.ac.uk](http://Linuxhelp.ed.ac.uk/) -*   [http://Unix.t-a-y-l-o-r.com/](http://Linux.t-a-y-l-o-r.com/) +*   [http://unix.t-a-y-l-o-r.com/](http://unix.t-a-y-l-o-r.com/) In what follows, we shall use the following typographical conventions: Characters written in **bold typewriter font** are commands to be typed into the computer as they stand. Characters written in *italic typewriter font* indicate non-specific file or directory names. Words inserted within square brackets [Ctrl] indicate keys to be pressed. So, for example, `$ **ls** *any_directory* [Enter]` means "at the Unix prompt $, type ls followed by the name of some directory, then press Enter" @@ -175,10 +174,13 @@ Now use the **pwd** command to check your location in the directory hierarchy. N #### Tips There are some short cuts for referring to directories: -.           Current directory (one full stop) -..           Directory above (two full stops) -~         Home directory (tilde) -/          Root of the file system (like C:\ in Windows) + +``` +.  Current directory (one full stop) +.. Directory above (two full stops) +~  Home directory (tilde) +/  Root of the file system (like C:\ in Windows) +``` Pressing the tab key twice will try and autocomplete what you’ve started typing or give you a list of all possible completions. This saves a lot of typing and typos. Pressing the up/down arrows will let you scroll through the previous commands. If you highlight some text, middle clicking will paste it on the command line. diff --git a/unix/fully_walking.md b/unix/fully_walking.md index 9280566..346a20a 100644 --- a/unix/fully_walking.md +++ b/unix/fully_walking.md @@ -1,14 +1,14 @@ # Introduction to Unix (continued) -In this part of the Unix tutorial, you will learn to download files, compress and decompress them, and combining commands +In this part of the Unix tutorial, you will learn to download files, compress and decompress them, and combine commands. ## Download files -`wget` can be used to download files from internet and store them. The following downloads and stores a file called to the current directory. +`wget` can be used to download files from the internet and store them. `wget https://raw.githubusercontent.com/HadrienG/tutorials/master/LICENSE` -will download the file that is located at the above URL on the internet, and put it **in the current directory**. This is the license under which this course is released. Open in and read it if you like! +will download the file that is located at the above URL on the internet, and put it **in the current directory**. This is the license under which this course is released. Open it and read it if you like! The `-O` option can be used to change the output file name. @@ -35,7 +35,7 @@ gzip is a utility for compressing and decompressing individual files. To compres `gzip filename` -The filename will be deleted and replaced by a compressed file called filename.gz To reverse the compression process, use: +The filename will be deleted and replaced by a compressed file called filename.gz. To reverse the compression process, use: `gzip -d filename.gz` @@ -51,9 +51,9 @@ To create a gzipped disk file tar archive, use `tar -czvf archivename filenames` -where archivename will usually have a .tar .gz extension +where archivename will usually have a .tar.gz extension -The c option means create, the v option means verbose (output filenames as they are archived), and option f means file. +The c option means create, the v option means verbose (output filenames as they are archived), option f means file, and z means that the tar archive should be gzip compressed. To list the contents of a gzipped tar archive, use @@ -65,7 +65,7 @@ To unpack files from a tar archive, use Try to archive the folder `Module_Unix` from the previous exercise! -You will notice a file called tutorials.tar.bz2 in your home directory. This is also a compressed archive, but compressed in the bzip format. Read the tar manual and find a way to decompress it +You will notice a file called tutorials.tar.bz2 in your home directory. This is also a compressed archive, but compressed in the bzip format. Read the tar manual and find a way to decompress it. Hint: you can read the manual for any command using `man` @@ -79,13 +79,13 @@ Some commands give you an output to your screen, but you would have preferred it The output from a command normally intended for standard output (that is, your screen) can be easily diverted to a file instead. This capability is known as output redirection: -If the notation `> file` is appended to any command that normally writes its output to standard output, the output of that command will be written to file instead of your terminal +If the notation `> file` is appended to any command that normally writes its output to standard output, the output of that command will be written to file instead of your terminal. I.e, the following who command: `who > users.txt` -No output appears at the terminal. This is because the output has been redirected into the specified file. +No output appears at the terminal. This is because the output has been redirected into the specified file. `less users.txt` @@ -114,8 +114,8 @@ Remember the command `grep`? We can pipe other commands to it, to refine searche `ls -l ngs_course_data | grep "Jan"` -will only give you the files and directories created in January +will only give you the files and directories created in January. Tip: There are various options you can use with the grep command, look at the manual! -Pipes are extremely useful to connect various bioinformatics softwares together. We'll use them extensively later. +Pipes are extremely useful to connect various bioinformatics software together. We'll use them extensively later. diff --git a/wms.md b/wms.md index 98c8d99..998adfa 100644 --- a/wms.md +++ b/wms.md @@ -34,7 +34,7 @@ The choice of shotgun or 16S approaches is usually dictated by the nature of the * [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) * [Kraken](https://ccb.jhu.edu/software/kraken/) -* [Bracken](http://ccb.jhu.edu/software/bracken/) +* [Bracken](https://ccb.jhu.edu/software/bracken/) * [R](https://www.r-project.org/) * [Pavian](https://github.com/fbreitwieser/pavian) @@ -49,17 +49,17 @@ cd wms ``` We'll use FastQC to check the quality of our data. FastQC can be downloaded and -ran on a Windows or LINUX computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) +run on a Windows or Linux computer without installation. It is available [here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) Start FastQC and select the fastq file you just downloaded with `file -> open` -What do you think about the quality of the reads? Do they need trimming? Is there still adapters +What do you think about the quality of the reads? Do they need trimming? Are there still adapters present? Overrepresented sequences? Alternatively, run fastqc on the command-line: `fastqc *.fastq.gz` -If the quality appears to be that good, it's because it was probably the cleaned reads that were deposited into SRA. +If the quality appears to be good, it's because it was probably the cleaned reads that were deposited into SRA. We can directly move to the classification step. ### Taxonomic Classification @@ -68,14 +68,14 @@ We can directly move to the classification step. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm (sic). In short, kraken uses a new approach with exact k-mer matching to assign taxonomy to short reads. It is *extremely* fast compared to traditional -approaches (i.e. Blast) +approaches (i.e. BLAST). By default, the authors of kraken built their database based on RefSeq Bacteria, Archea and Viruses. We'll use it for the purpose of this tutorial. **NOTE: The database may have been installed already! Ask your instructor!** ```bash -# You might not need this step (example if you're working on Uppmax!) +# You might not need this step (for example if you're working on Uppmax!) wget https://ccb.jhu.edu/software/kraken/dl/minikraken.tgz tar xzf minikraken.tgz $KRAKEN_DB=minikraken_20141208 @@ -88,7 +88,9 @@ mkdir kraken_results for i in *_1.fastq.gz do prefix=$(basename $i _1.fastq.gz) - kraken --db $KRAKEN_DB --threads 2 --fastq-input --gzip-compressed \ + # set number of threads to number of cores if running under SLURM, otherwise use 2 threads + nthreads=${SLURM_NPROCS:=2} + kraken --db $KRAKEN_DB --threads ${nthreads} --fastq-input --gzip-compressed \ ${prefix}_1.fastq.gz ${prefix}_2.fastq.gz > kraken_results/${prefix}.tab kraken-report --db $KRAKEN_DB \ kraken_results/${prefix}.tab > kraken_results/${prefix}_tax.txt @@ -134,7 +136,7 @@ Three steps are necessary to set up Kraken abundance estimation. ```bash find -L $KRAKEN_DB/library -name "*.fna" -o -name "*.fa" -o -name "*.fasta" > genomes.list cat $(grep -v '^#' genomes.list) > genomes.fasta -kraken --db=${KRAKEN_DB} --fasta-input --threads=10 kraken.fasta > database.kraken +kraken --db=${KRAKEN_DB} --fasta-input --threads=${SLURM_NPROCS:=10} kraken.fasta > database.kraken count-kmer-abundances.pl --db=${KRAKEN_DB} --read-length=100 database.kraken > database100mers.kraken_cnts ```