In [None]:
# Getting Started with Magic-BLAST

Please start by installing [magicblast](https://ncbi.github.io/magicblast/) and the NCBI EDirect tools! Then go to the **Important: Setting environment variables** section to get going!

Note: If you want to copy and paste these commands in a terminal, remove the `!` preceding the Linux commands.

## Installing dependencies

We have already installed the following dependencies for you! Go ahead and proceed to the **Getting Started** section!

...but if you need to re-install, use the code below.

- Magic-BLAST
- perl
- NCBI EDirect tools

#### Installing Magic-BLAST

In [None]:
# Magic-BLAST
cd ~
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/magicblast/LATEST/ncbi-magicblast-1.3.0-x64-linux.tar.gz

In [None]:
tar -xzvf ncbi-magicblast-1.3.0-x64-linux.tar.gz

In [None]:
ls

#### Installing NCBI EDirect tools

In [None]:
apt-get install perl

In [None]:
# We will download the edirect.tar.gz in ~/
cd ~
# Download edirect.tar.gz
perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
# Unpack tar.gz file
tar -xzvf edirect.tar.gz

In [None]:
ls

In [None]:
# Now we will setup edirect using the setup.sh script
./edirect/setup.sh

## Important: Setting environment variables

You must run this step to have the remainder of the tutorial working.

In [30]:
# In Google Colaboratory (Python kernel), uncomment the following lines instead:
#import os
#os.environ['PATH'] += ":/content/ncbi-magicblast-1.3.0/bin"
#os.environ['PATH'] += ":/content/edirect"

# In a local Jupyter notebook with a Bash kernel, use the command below.
# Replace /home/dantonio with the directory where you unpacked Magic-BLAST.
export PATH="/home/dantonio/ncbi-magicblast-1.3.0/bin:$PATH"
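
If you are working in a Python kernel, the PATH setup above can be sketched as follows. This is a minimal sketch, not the tutorial's required method: the helper `prepend_to_path` is a hypothetical name, and the two install directories are assumptions based on unpacking both archives in your home directory.

```python
import os
import shutil

def prepend_to_path(directory):
    """Put `directory` at the front of PATH unless it is already listed."""
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([directory] + parts)
    return os.environ["PATH"]

# Assumed install locations; adjust to wherever you unpacked the archives.
home = os.path.expanduser("~")
for tool_dir in (os.path.join(home, "ncbi-magicblast-1.3.0", "bin"),
                 os.path.join(home, "edirect")):
    prepend_to_path(tool_dir)

# shutil.which returns None when a program cannot be found on PATH.
for tool in ("magicblast", "makeblastdb", "esearch", "efetch"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND")
```

Any tool reported as `NOT FOUND` means the corresponding directory is wrong or the installation step did not complete.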

First, let's see what we have in our directory:

In [None]:
ls

Now we will set up our working directory called `sandbox`:

In [None]:
mkdir -p ~/sandbox

# Check if our directory has been made
ls

In [None]:
cd ~ 
ls

Let's bring up the usage message for `magicblast`:

In [35]:
magicblast -h

USAGE
  magicblast [-h] [-help] [-db database_name] [-gilist filename]
    [-seqidlist filename] [-negative_gilist filename]
    [-negative_seqidlist filename] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file] [-gzo]
    [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty]
    [-perc_identity float_value] [-penalty penalty] [-lcase_masking]
    [-validate_seqs TF] [-infmt format] [-paired] [-query_mate infile]
    [-sra accession] [-parse_deflines TF] [-sra_cache] [-outfmt format]
    [-no_query_id_trim] [-no_unaligned] [-num_threads int_value]
    [-max_intron_length length] [-score num] [-max_edit_dist num] [-splice TF]
    [-reftype type] [-limit_lookup TF] [-lookup_stride num] [-version]

DESCRIPTION
   Short read mapper

Use '-help' to print detailed descriptions of command line arguments


# Create a BLAST database
---

First we need to create a BLAST database for our genome or transcriptome. For reference sequences in a FASTA file, use this command line:

```bash
!makeblastdb -in <reference.fa> -dbtype nucl -parse_seqids -out <database_name> -title "Database title"
```

The `-parse_seqids` option is required to keep the original sequence identifiers; otherwise, `makeblastdb` will generate its own identifiers. The `-title` option is optional.

For more information on `makeblastdb` see [NCBI BLAST+ Command Line User Manual](https://www.ncbi.nlm.nih.gov/books/NBK279688/).

Magic-BLAST will work with a genome in a FASTA file, but will be very slow for anything larger than a bacterial genome, so we do not recommend it.

### Uploading Files from your Computer

If you have a file on your local computer to upload (e.g. a `reference.fa` file that you can't download via a link), run the code below. Otherwise, continue to the **Getting Started with Magic-BLAST** section!

**Note:** A `Choose Files` widget will appear within 1 minute. You can then click on the widget and select the file you want to upload. This works best in a Google Chrome browser!

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [None]:
# Check if file was uploaded
ls

### Example

For our example, we will be working with `viralgen.fa`. Let's take a look at it first.

In [None]:
!cat viralgen.fa

To create a BLAST database from the reference file `viralgen.fa`, use the following command line:

In [None]:
# Go into sandbox directory
cd ~/sandbox
# Create BLAST database
makeblastdb -in $HOME/viralgen.fa -out my_reference -parse_seqids -dbtype nucl 

In [None]:
ls

datalab  edirect.tar.gz   ncbi-magicblast-1.3.0			  sandbox
edirect  viralgen.fa  ncbi-magicblast-1.3.0-x64-linux.tar.gz

Note that in a FASTA file, the word following `>` on each header line is the sequence identifier that will be used in Magic-BLAST reports. Each identifier should be unique.
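
A quick way to check the uniqueness requirement before building a database is to scan the header lines yourself. The sketch below is only an illustration: the function names `fasta_ids` and `duplicated_ids` are hypothetical, and the example records are made up.

```python
from collections import Counter

def fasta_ids(lines):
    """Yield the first word after '>' from each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""

def duplicated_ids(lines):
    """Return identifiers that occur more than once, sorted."""
    counts = Counter(fasta_ids(lines))
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)

# Made-up records for illustration only.
example = [
    ">seq1 first record",
    "ACGT",
    ">seq2 second record",
    "GGCC",
    ">seq1 same identifier again",
    "TTAA",
]
print(duplicated_ids(example))  # ['seq1']
```

To check a real file, pass an open file handle instead of the list, for example `duplicated_ids(open("viralgen.fa"))`; an empty result means every identifier is unique.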

There are several ways to download whole genomes, transcriptomes, or selected sequences from NCBI. For example, to download human chromosome 1 using [NCBI EDirect tools](https://github.com/NCBI-Hackathons/EDirectCookbook), use the commands shown after the note below.

**NOTE:** We have already installed NCBI EDirect tools for you here. But if you are having problems with this installation, please follow installation instructions on [NCBI EDirect tools](https://github.com/NCBI-Hackathons/EDirectCookbook) before proceeding.

In [40]:
# Let's take a quick look at the usage message
esearch -help

esearch 7.40

Query Specification

  -db          Database name
  -query       Query string

Document Order

  -sort        Result presentation order

Date Constraint

  -days        Number of days in the past
  -datetype    Date field abbreviation
  -mindate     Start of date range
  -maxdate     End of date range

Limit by Field

  -field       Query words individually in field
  -pairs       Query overlapping word pairs

Spell Check

  -spell       Correct misspellings in query

Miscellaneous Arguments

  -label       Alias for query step

Sort Order Examples

  -db            -sort
  ___            _____

  gene
                 Chromosome
                 Gene Weight
                 Name
                 Relevance

  geoprofiles
                 Default Order
                 Deviation
                 Mean Value
                 Outliers
                 Subgroup Effect

  pubmed
                 First Author
                 Journal
                 Last Author
                

In [None]:
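# Full download (large file): search the Entrez nucleotide database, then fetch the record as FASTA
#esearch -db nucleotide -query NC_000001 | efetch -format fasta > NC_000001.fa
# Run the search step on its own first
esearch -db nucleotide -query NC_000001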

In [None]:
efetch -h

You can download the full human genome or transcriptome from [NCBI human genome resources](https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml) or use [NCBI Genome search](https://www.ncbi.nlm.nih.gov/genome) for any organism.

For example, to download the latest human genome, use:

In [None]:
cd ~/sandbox
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz
gzip -d GRCh38_latest_genomic.fna.gz
makeblastdb -in GRCh38_latest_genomic.fna -out GRCh38 -parse_seqids -dbtype nucl

In [None]:
ls

# Use NCBI SRA repository
---

If you are mapping an experiment from the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), use the `-sra <accession>` option:

```bash
magicblast -sra <accession> -db <database_name>
```

### Example

In [None]:
cd ~/sandbox

In [None]:
magicblast -sra SRS3315293 -db my_reference

To map several SRA runs, use a comma-separated list of accessions:

In [None]:
magicblast -sra SRR1237994,SRR1237993 -db my_reference

See [Create BLAST database](https://ncbi.github.io/magicblast/cook/blastdb.html) for instructions on creating a BLAST database.
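
When the run accessions come from a list, the comma-separated `-sra` argument can be assembled programmatically. This is a minimal sketch: `sra_command` is a hypothetical helper, and it only builds the argument list without running anything.

```python
import shlex

def sra_command(runs, db):
    """Build a magicblast argument list for one or more SRA run accessions."""
    if not runs:
        raise ValueError("at least one accession is required")
    return ["magicblast", "-sra", ",".join(runs), "-db", db]

cmd = sra_command(["SRR1237994", "SRR1237993"], "my_reference")
print(shlex.join(cmd))
# magicblast -sra SRR1237994,SRR1237993 -db my_reference
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `magicblast` is on your PATH.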

# Reads in FASTA or FASTQ
---

If your reads are in a local FASTA file, use this command line:

```bash
!magicblast -query reads.fa -db my_reference
```

If your reads are in a local FASTQ file, use this command line:

```bash
!magicblast -query reads.fastq -db my_reference -infmt fastq
```
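
The only difference between the two calls is the `-infmt fastq` option, so a script can choose it from the file name. The sketch below assumes that FASTQ files end in `.fastq` or `.fq`, which is a naming convention rather than anything Magic-BLAST enforces; `query_args` is a hypothetical helper.

```python
import os

# Assumed naming convention for FASTQ files.
FASTQ_EXTENSIONS = {".fastq", ".fq"}

def query_args(path):
    """Return the -query arguments, adding -infmt fastq for FASTQ files."""
    ext = os.path.splitext(path)[1].lower()
    args = ["-query", path]
    if ext in FASTQ_EXTENSIONS:
        args += ["-infmt", "fastq"]
    return args

print(query_args("reads.fa"))     # ['-query', 'reads.fa']
print(query_args("reads.fastq"))  # ['-query', 'reads.fastq', '-infmt', 'fastq']
```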

# Paired reads
---

For SRA accessions, Magic-BLAST determines whether reads are paired and maps them appropriately.

For reads in FASTA and FASTQ files, paired reads can be either in a single file or in two files.

#### Single file

For paired reads presented as successive entries in a single FASTA or FASTQ file, i.e. read 1 and 2 of fragment 1, then read 1 and 2 of fragment 2, etc., simply add the parameter `-paired`:

```bash
!magicblast -query reads.fa -db genome -paired
```

or

```bash
!magicblast -query reads.fastq -db genome -paired -infmt fastq
```

#### Two files

For paired reads presented in two parallel files, use these options:

```bash
!magicblast -query reads.fa -query_mate mates.fa -db genome
```

or

```bash
!magicblast -query reads.fastq -query_mate mates.fastq -db genome -infmt fastq
```
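
The two layouts map to two mutually exclusive sets of options, which the following sketch makes explicit. `paired_args` and its parameter names are hypothetical; the flags themselves (`-paired`, `-query_mate`) are the ones shown above.

```python
def paired_args(query, mate=None, interleaved=False):
    """Return read-input arguments for single-file or two-file paired reads."""
    if mate is not None and interleaved:
        raise ValueError("use either a mate file or an interleaved file, not both")
    args = ["-query", query]
    if mate is not None:
        # Two parallel files: read i in `query` pairs with read i in `mate`.
        args += ["-query_mate", mate]
    elif interleaved:
        # One file: read 1 and read 2 of each fragment are successive entries.
        args.append("-paired")
    return args

print(paired_args("reads.fa", interleaved=True))   # ['-query', 'reads.fa', '-paired']
print(paired_args("reads.fa", mate="mates.fa"))    # ['-query', 'reads.fa', '-query_mate', 'mates.fa']
```

For FASTQ input, `-infmt fastq` still has to be added in either layout.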

# RNA vs DNA
---

#### Splicing

By default, Magic-BLAST aligns RNA reads to a genome and reports spliced alignments, possibly spanning several exons. To disable spliced alignments, use the `-splice F` option.

For example, mapping RNA or DNA reads to a bacterial genome:

In [None]:
magicblast -sra SRR5647973 -db salmonella_enterica_genome -splice F

#### Transcriptome

Use the `-reftype transcriptome` option to map reads to a transcriptome database. For example:

```bash
!magicblast -query reads.fa -db my_transcripts -reftype transcriptome
```

The `-reftype transcriptome` option is shorthand for `-splice F -limit_lookup F`, so the above call is equivalent to:

```bash
!magicblast -query reads.fa -db my_transcripts -splice F -limit_lookup F
```

Magic-BLAST finds alignments between a read and a genome by starting from an initial word common to both. Many genomes contain interspersed repeats that make mapping much more time-consuming. To make mapping faster, Magic-BLAST disregards words that appear too often in the reference. This is not desirable when mapping to transcripts, because a transcript with many variants could be considered a repeat. The `-limit_lookup F` option turns this filtering off.
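
The idea of disregarding frequent words can be illustrated with a toy example. This is not Magic-BLAST's implementation: the word length, the `max_count` threshold, and the function names are all assumptions chosen to keep the sketch small.

```python
from collections import Counter

def word_counts(reference, k):
    """Count every overlapping word of length k in the reference."""
    return Counter(reference[i:i + k] for i in range(len(reference) - k + 1))

def seed_words(reference, k, max_count=None):
    """Return the words kept as seeds; max_count=None keeps every word."""
    counts = word_counts(reference, k)
    if max_count is None:
        return set(counts)
    return {word for word, n in counts.items() if n <= max_count}

# A made-up reference in which "ACGT" repeats three times.
reference = "ACGTACGTACGTTTGCA"

all_words = seed_words(reference, 4)               # no filtering
rare_words = seed_words(reference, 4, max_count=1) # drop repeated words

print(len(all_words), len(rare_words))  # 9 5
print("ACGT" in rare_words)             # False
```

With filtering on, a read that matches only the repeated `ACGT` region has no seed left to start from, which is the behavior you want to avoid for transcript variants and therefore the reason to disable the filter there.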

# Multi-threading
---

To use multiple CPUs, specify the maximal number of threads with the `-num_threads` parameter:

```bash
!magicblast -query reads.fa -db genome -num_threads 10
```
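
Rather than hard-coding a thread count, a script can cap the request at the number of CPUs the machine reports. This is a small sketch with a hypothetical helper, `thread_args`; it relies only on `os.cpu_count()` from the Python standard library.

```python
import os

def thread_args(requested=None):
    """Return -num_threads arguments, capped at the available CPU count."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        n = available
    else:
        n = max(1, min(requested, available))
    return ["-num_threads", str(n)]

print(thread_args(10))  # at most 10, fewer on smaller machines
```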