# Getting Started with Magic-BLAST

Start by installing [Magic-BLAST](https://ncbi.github.io/magicblast/) and the NCBI EDirect tools with the commands below. After running the cells in the **Installing dependencies** section, run the **Setting environment variables** section before continuing.

**Note**: If you want to copy and paste these commands into a terminal, remove the `!` preceding the Linux commands and replace `%cd` with `cd`.

## Installing dependencies

Please run the code blocks below in order to install the following required dependencies:

- Magic-BLAST
- Perl and Perl modules
- NCBI EDirect tools (if you have a direct link to reference FASTA files, you do not need to install the EDirect tools)

#### Installing Magic-BLAST

In [0]:
# Magic-BLAST
%cd ~
!wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/magicblast/LATEST/ncbi-magicblast-1.3.0-x64-linux.tar.gz

In [0]:
# Decompress gzipped tar archive
!tar -xzvf ncbi-magicblast-1.3.0-x64-linux.tar.gz

#### Installing NCBI EDirect tools

In [0]:
!apt-get install perl

In [0]:
# We will download the edirect.tar.gz in ~/
%cd ~
# Download edirect.tar.gz
!perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
# Unpack tar.gz file
!tar -xzvf edirect.tar.gz

In [0]:
# Now we will setup edirect using the setup.sh script
!./edirect/setup.sh

Next, install the additional Perl modules required by the NCBI EDirect tools.

In [0]:
# Install tool to help manage perl modules
!curl -L https://cpanmin.us | perl - App::cpanminus
# Make sure ssl library is compatible with perl module Net::SSLeay
!apt-get install libssl-dev
# Install Net::SSLeay perl module
!cpanm Net::SSLeay
# Install LWP::Protocol::https perl module
!cpanm LWP::Protocol::https

## Important: Setting environment variables

You must run this step to have the remainder of this tutorial work.

In [0]:
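# Import Python3 package to set environment variables
import os
# The archives above were unpacked in the home directory (~), so build the paths from there
home = os.path.expanduser("~")
os.environ['PATH'] += ":" + os.path.join(home, "ncbi-magicblast-1.3.0", "bin")
os.environ['PATH'] += ":" + os.path.join(home, "edirect")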
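
Before moving on, it can help to confirm that the `PATH` changes took effect. The cell below is an optional sketch, not part of the original tutorial: the helper name `find_tools` is hypothetical, and the tool list is simply the set of commands used later in this notebook.

```python
import shutil

def find_tools(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Commands used later in this tutorial
for tool, path in find_tools(["magicblast", "makeblastdb", "esearch", "efetch"]).items():
    print(f"{tool}: {path if path else 'NOT FOUND on PATH'}")
```

If any tool is reported as not found, re-run the installation cells and the environment variable cell above.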

## Directory setup and additional checks

First, let's see what we have in our directory:

In [0]:
!ls

Now we will set up our working directory called `sandbox`:

In [0]:
!mkdir -p ~/sandbox

# Check if our directory has been made
!ls

Let's bring up the usage message for `magicblast`:

In [0]:
!magicblast -h
!magicblast -help

---

# Example 1: *Herpes Simplex*

For our example, we will be working with the `viral_reference.fa` reference genome. 

**Caution:** If you choose to use your own data, make sure you change filenames and filepaths accordingly throughout the tutorial.

Now let's download the reference genome and take a look at it first.

## Step 1: Get reference FASTA file into Google Colaboratory

There are **3 methods** to get data into Google Colaboratory. You only need one of them; you do not have to run all 3. For the purposes of this tutorial, we will cover all three using this first virus example.

#### Methods Summary

- **Get Data Method 1:** Use widget to upload data from local computer.
- **Get Data Method 2:** Search for the organism in NCBI's public data repositories, find an FTP URL, and download using `wget`.
- **Get Data Method 3:** Use [NCBI EDirect tools](https://github.com/NCBI-Hackathons/EDirectCookbook). Below is an example command line to download human chromosome 1 by accession query ID:

```bash
!esearch -db nucleotide -query NC_000001 | efetch -format fasta > NC_000001.fa
```

Input for the `-query` flag here was found on [this page in NCBI GenBank](https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.10) under `ACCESSION`.
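
If you later move this step into a script, the same pipeline can be assembled in Python. The sketch below only builds the command string from the flags shown above (`-db`, `-query`, `-format`); the helper name `edirect_fasta_command` is hypothetical, and actually running the command assumes EDirect is installed and on your `PATH`.

```python
import shlex

def edirect_fasta_command(accession, out_path, db="nucleotide"):
    """Build the esearch | efetch shell pipeline used in this tutorial."""
    return (
        f"esearch -db {shlex.quote(db)} -query {shlex.quote(accession)}"
        f" | efetch -format fasta > {shlex.quote(out_path)}"
    )

cmd = edirect_fasta_command("NC_000001", "NC_000001.fa")
print(cmd)

# To run it (assumes EDirect is installed and on PATH):
# import subprocess
# subprocess.run(cmd, shell=True, check=True)
```

Quoting each argument with `shlex.quote` keeps the command safe if an accession or filename ever contains spaces or shell metacharacters.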

**NOTE:** Here, we assume you have already installed all dependencies listed in the **Getting started** section. If you have not and are running this tutorial in **Google Colaboratory**, please revisit the **Getting started** section and install all dependencies before proceeding. If you plan to copy and paste the commands and run them on your **local computer**, please follow the installation instructions provided at [NCBI EDirect tools](https://github.com/NCBI-Hackathons/EDirectCookbook).


### Get Data Method 1: Uploading Files from your Computer

If you have a file on your local computer that you want to use (e.g. a `reference.fa` file that you can't download via a link), run the code below. **Otherwise, use either *Method 2* or *Method 3* to get your reference FASTA file into Google Colaboratory!**

**Note:** A `Choose Files` widget will appear within 1 minute. You can then click on the widget and select the file you want to upload. This works best in a Google Chrome browser!

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [0]:
# Check if file was uploaded
!ls

### Get Data Method 2: Transfer data via FTP (File Transfer Protocol)

Since we already have the URL for our viral reference genome, we will download it using the `wget` command.

In [0]:
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/859/985/GCF_000859985.2_ViralProj15217/GCF_000859985.2_ViralProj15217_genomic.fna.gz

Decompress and rename the reference genome you chose from the NCBI link given above.

In [0]:
!ls
!gzip -dc GCF_000859985.2_ViralProj15217_genomic.fna.gz > viral_reference.fa

Verify and display the first 6 lines of the file using the `head` command. 

In [0]:
!head -n 6 viral_reference.fa
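
Beyond looking at the first few lines, you may want to confirm how many sequences the file contains and how long each one is. The following is a minimal sketch in plain Python; the function name `fasta_summary` is hypothetical and the parser assumes a well-formed, uncompressed FASTA file.

```python
def fasta_summary(path):
    """Return a list of (header, sequence_length) tuples for a FASTA file."""
    records = []
    header, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: store the previous one, if any
                if header is not None:
                    records.append((header, length))
                header, length = line[1:], 0
            elif header is not None:
                length += len(line)
    if header is not None:
        records.append((header, length))
    return records

# Usage (assumes viral_reference.fa is in the current directory):
# for header, length in fasta_summary("viral_reference.fa"):
#     print(header, length)
```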

### Get Data Method 3: Transfer data via NCBI EDirect tools

In [0]:
!esearch -db nucleotide -query NC_001806 | efetch -format fasta > viral_reference.fa

## Step 2: Create a BLAST database

Next, we need to create a BLAST database for our genome. When working with reference FASTA files, we will follow this command line structure:

```bash
!makeblastdb -in <reference.fa> -dbtype nucl -parse_seqids -out <database_name> -title "Database title"
```

The `-parse_seqids` option is required to keep the original sequence identifiers; without it, `makeblastdb` will generate its own identifiers. The `-title` option is optional.

For more information on `makeblastdb` see [NCBI BLAST+ Command Line User Manual](https://www.ncbi.nlm.nih.gov/books/NBK279688/). For more information on creating a BLAST database, [see this documentation](https://ncbi.github.io/magicblast/cook/blastdb.html).
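
If you build databases for several references, it can be convenient to generate the argument list programmatically. This is only a sketch of the command structure shown above: the helper name `makeblastdb_args` and the example title are hypothetical, and running the command assumes `makeblastdb` is on your `PATH`.

```python
import shlex

def makeblastdb_args(fasta, out, title=None):
    """Build the makeblastdb argument list for a nucleotide reference."""
    args = ["makeblastdb", "-in", fasta, "-dbtype", "nucl",
            "-parse_seqids", "-out", out]
    if title is not None:
        # -title is optional, so only add it when one is given
        args += ["-title", title]
    return args

args = makeblastdb_args("viral_reference.fa", "Herpes_virus_1", title="HSV-1 reference")
print(shlex.join(args))

# To run it (assumes makeblastdb is installed and on PATH):
# import subprocess
# subprocess.run(args, check=True)
```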

#### Let's use the following command line to create our BLAST database from the `viral_reference.fa` file:

In [0]:
# Let's take a quick look at the usage message
!makeblastdb -help

In [0]:
# Go into sandbox directory
%cd ~/sandbox
# Create BLAST database
!makeblastdb -in $HOME/viral_reference.fa -dbtype nucl -parse_seqids -out Herpes_virus_1

## Step 3: Use NCBI SRA repository

If you are mapping an experiment from [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), use the `-sra <accession>` option like so:

```bash
magicblast -sra <accession> -db <database_name>
```

#### Let's use the NCBI SRA repository to map an experiment:

In [0]:
%cd ~/sandbox

In [0]:
!magicblast -sra SRR1533750 -db Herpes_virus_1 -no_unaligned -num_threads 2 -out SRR1533750_into_HSV1

To map several SRA runs use a comma-separated list of accessions:

In [0]:
!magicblast -sra SRR1237994,SRR1237993 -db Herpes_virus_1
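
When the list of runs comes from a spreadsheet or another script, the comma-separated `-sra` value can be assembled in Python. The sketch below uses only options that appear in this tutorial; the helper name `magicblast_sra_args` is hypothetical, and running the command assumes `magicblast` is on your `PATH`.

```python
def magicblast_sra_args(accessions, db, out=None, num_threads=None, no_unaligned=False):
    """Build a magicblast argument list for one or more SRA runs."""
    if isinstance(accessions, str):
        accessions = [accessions]
    # Several runs are passed as a single comma-separated value
    args = ["magicblast", "-sra", ",".join(accessions), "-db", db]
    if no_unaligned:
        args.append("-no_unaligned")
    if num_threads is not None:
        args += ["-num_threads", str(num_threads)]
    if out is not None:
        args += ["-out", out]
    return args

print(" ".join(magicblast_sra_args(["SRR1237994", "SRR1237993"], "Herpes_virus_1")))
```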

---

# Example 2: *Salmonella typhimurium* str. LT2

For this second example in our tutorial, we will repeat similar steps as above, this time using a bacterial sample.

## Step 1: Get reference FASTA file into Google Colaboratory

Let's download our bacterial reference genome and take a look at the first few lines.

In [0]:
# Go into our home directory
%cd ~

In [0]:
# Download reference file into home directory via FTP URL
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz

In [0]:
# Decompress and rename reference genome
!gzip -dc GCF_000006945.2_ASM694v2_genomic.fna.gz > bacteria_reference.fa

In [0]:
!head -n 6 bacteria_reference.fa

## Step 2: Create a BLAST database

For more details, please see the **Step 2: Create a BLAST database** section in Example 1: Herpes Simplex (above) of this tutorial.

In [0]:
# Check if our sandbox directory exists, if not make it
!mkdir -p ~/sandbox
# Go into sandbox directory
%cd ~/sandbox

In [0]:
# Create BLAST database
!makeblastdb -in $HOME/bacteria_reference.fa -dbtype nucl -parse_seqids -out salmonella_enterica_genome

## Step 3a: Use NCBI SRA repository

Let's map an experiment to the bacterial reference database. For this example, we will be working with SRA sample SRS2253554 (WGS data) with run ID **SRR5647973**. You can find out more about this SRA sample [here](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRR5647973&go=go).

In [0]:
%cd ~/sandbox

In [0]:
!magicblast -sra SRR5647973 -db salmonella_enterica_genome

## Step 3b: RNA vs DNA

#### Splicing

By default, Magic-BLAST aligns RNA reads to a genome and reports spliced alignments, possibly spanning several exons. To disable spliced alignments, use the `-splice F` option. We can map RNA or DNA reads to a reference genome. Here, we will be mapping RNA reads; because bacterial transcripts are generally not spliced, we disable spliced alignments.

For this example, we will be working with SRA sample SRS3192091 (RNA-Seq data) with run ID **SRR7029713**. You can find out more about this SRA sample [here](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRR7029713&go=go).

Let's map RNA-seq reads to a bacterial genome:

In [0]:
!magicblast -sra SRR7029713 -db salmonella_enterica_genome -splice F

#### Transcriptome

Use the `-reftype transcriptome` option to map reads to a transcriptome database. For example:

```bash
!magicblast -query reads.fa -db my_transcripts -reftype transcriptome
```

The `-reftype transcriptome` option is shorthand for `-splice F -limit_lookup F`, so the above call is equivalent to:

```bash
!magicblast -query reads.fa -db my_transcripts -splice F -limit_lookup F
```

Magic-BLAST finds alignments between a read and a genome by first looking for a word common to both. Many genomes contain interspersed repeats that make mapping much more time-consuming. To make mapping faster, Magic-BLAST disregards words that appear too often in the reference. This is not desirable when mapping to transcripts, because a transcript with many variants could be treated as a repeat. The `-limit_lookup F` option turns this filtering off.
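
To build some intuition for what filtering frequent words means, here is a toy illustration in Python. It is not Magic-BLAST's implementation: the function names, the word size of 4, and the count threshold of 2 are all arbitrary choices made for this sketch.

```python
from collections import Counter

def frequent_words(reference, word_size, max_count):
    """Return the set of words that occur more than max_count times."""
    counts = Counter(
        reference[i:i + word_size]
        for i in range(len(reference) - word_size + 1)
    )
    return {word for word, n in counts.items() if n > max_count}

def seed_words(read, word_size, masked):
    """List the words of a read that are still usable as alignment seeds."""
    return [
        read[i:i + word_size]
        for i in range(len(read) - word_size + 1)
        if read[i:i + word_size] not in masked
    ]

reference = "ACGTACGTACGTTTGACCA"  # toy reference with a short repeat
masked = frequent_words(reference, word_size=4, max_count=2)

# With filtering on, words from the repeat are skipped as seeds
print(seed_words("ACGTTTGA", word_size=4, masked=masked))
# With filtering off (the idea behind -limit_lookup F), every word is kept
print(seed_words("ACGTTTGA", word_size=4, masked=set()))
```

In this toy reference only `ACGT` occurs more than twice, so the first call drops that word and the second call keeps it.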

For this example, we will be working with SRA sample SRS3192092 with run ID **SRR7029712**. You can find out more about this SRA sample [here](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRS3192092&go=go).

Let's map to our transcriptome database:

In [0]:
!magicblast -sra SRR7029712 -db salmonella_enterica_genome -reftype transcriptome

---

# Additional Magic-BLAST Features

We have not included code blocks for reading in FASTQ files, working with paired end vs single end reads, and multi-threading in this tutorial, but we have laid out the command line structure you will need to run them. You are welcome to test them out for yourself.

## Reads in FASTA or FASTQ

If your reads are in a local FASTA file, use this command line:

```bash
!magicblast -query reads.fa -db my_reference
```

If your reads are in a local FASTQ file, use this command line:

```bash
!magicblast -query reads.fastq -db my_reference -infmt fastq
```

## Paired reads

For SRA accessions, Magic-BLAST determines whether reads are paired and maps them appropriately.

For reads in FASTA and FASTQ files, paired reads can be either in a single file or in two files.

#### Single file

For paired reads presented as successive entries in a single FASTA or FASTQ file, i.e. read 1 and 2 of fragment 1, then read 1 and 2 of fragment 2, etc., simply add the parameter `-paired`:

```bash
!magicblast -query reads.fa -db genome -paired
```

or

```bash
!magicblast -query reads.fastq -db genome -paired -infmt fastq
```
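
Before using `-paired` on a single file, you may want to check that the records really do come in successive pairs. The sketch below does this for a FASTA file only; the function names are hypothetical, and it assumes each identifier is the first word of a `>` header line.

```python
def read_fasta_ids(path):
    """Return record identifiers (first word of each header) in file order."""
    with open(path) as handle:
        return [
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and len(line.strip()) > 1
        ]

def interleaved_pairs(ids):
    """Group successive identifiers into (read 1, read 2) pairs."""
    if len(ids) % 2 != 0:
        raise ValueError("odd number of records: file cannot be fully paired")
    return list(zip(ids[0::2], ids[1::2]))

# Usage (assumes reads.fa is an interleaved FASTA file in the current directory):
# for first, second in interleaved_pairs(read_fasta_ids("reads.fa")):
#     print(first, second)
```

An odd record count is a quick sign that the file is not interleaved the way `-paired` expects.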

#### Two files

For paired reads presented in two parallel files, use these options:

```bash
!magicblast -query reads.fa -query_mate mates.fa -db genome
```

or

```bash
!magicblast -query reads.fastq -query_mate mates.fastq -db genome -infmt fastq
```

## Multi-threading

To use multiple CPUs, specify the maximal number of threads with the `-num_threads` parameter:

```bash
!magicblast -query reads.fa -db genome -num_threads 10
```
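
A value such as `10` only helps if the machine actually has that many CPUs. As a small sketch, you could cap the requested thread count at what the operating system reports; the helper name `pick_num_threads` is hypothetical.

```python
import os

def pick_num_threads(requested):
    """Cap the requested thread count at the number of CPUs reported by the OS."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))

threads = pick_num_threads(10)
print(f"magicblast -query reads.fa -db genome -num_threads {threads}")
```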