### Uploading Files from your Computer

If you have a file on your local computer that you want to use (e.g. a `reference.fa` file that you can't download via a link), run the code below. **Otherwise, continue to the *Getting Started with Magic-BLAST* section!**

**Note:** A `Choose Files` widget will appear within 1 minute. You can then click on the widget and select the file you want to upload. This works best in a Google Chrome browser!

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [0]:
# Check if file was uploaded
!ls

# Getting Started with Magic-BLAST

Please start by installing [magicblast](https://ncbi.github.io/magicblast/) and the EDirect tools! Then go to the **Setting environment variables** section to get going!

Note: If you want to copy and paste these commands in a terminal, remove the `!` preceding the Linux commands.

## Installing dependencies

Please follow the code blocks below to install the following required dependencies:

- Magic-BLAST
- perl
- NCBI EDirect tools

#### Installing Magic-BLAST

In [0]:
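# Magic-BLAST
# Note: each `!` command runs in its own subshell, so a separate `!cd ~` line would
# not change the notebook's working directory. The archive is downloaded to the
# current directory.
!wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/magicblast/LATEST/ncbi-magicblast-1.3.0-x64-linux.tar.gz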

In [0]:
!tar -xzvf ncbi-magicblast-1.3.0-x64-linux.tar.gz

In [0]:
!ls

#### Installing NCBI EDirect tools

In [0]:
!apt-get install perl

In [0]:
# Download edirect.tar.gz into the current directory.
# (A separate `!cd ~` line would have no effect, because each `!` command runs in
# its own subshell.)
!perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
# Unpack tar.gz file
!tar -xzvf edirect.tar.gz

In [0]:
!ls

In [0]:
# Now we will setup edirect using the setup.sh script
!./edirect/setup.sh

Installing additional required perl modules for NCBI EDirect tools.

In [0]:
# Install tool to help manage perl modules
!curl -L https://cpanmin.us | perl - App::cpanminus
# Make sure ssl library is compatible with perl module Net::SSLeay
!apt-get install libssl-dev
# Install Net::SSLeay perl module
!cpanm Net::SSLeay
# Install LWP::Protocol::https perl module
!cpanm LWP::Protocol::https

## Important: Setting environment variables

You must run this step for the rest of the tutorial to work.

In [0]:
# Import Python3 package to set environment variables
import os
os.environ['PATH'] += ":/content/ncbi-magicblast-1.3.0/bin"
os.environ['PATH'] += ":/content/edirect"
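Re-running the cell above appends the same directories to `PATH` again each time. The sketch below is one way to make the step repeatable and to confirm that the tools can be found. The helper name `append_to_path` is hypothetical, and the two directories are simply the ones used in the cell above.

```python
import os
import shutil


def append_to_path(directory, environ=os.environ):
    """Append `directory` to PATH unless it is already listed (hypothetical helper)."""
    parts = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
    environ["PATH"] = os.pathsep.join(parts)
    return environ["PATH"]


# Directories taken from the cell above.
for tool_dir in ("/content/ncbi-magicblast-1.3.0/bin", "/content/edirect"):
    append_to_path(tool_dir)

# Report where each command-line tool was found, if at all.
for tool in ("magicblast", "makeblastdb", "esearch", "efetch"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND")
```

If any tool is reported as `NOT FOUND`, the installation cells above probably unpacked the archives into a different directory.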

First, let's see what we have in our directory:

In [0]:
!ls

Now we will set up our working directory called `sandbox`:

In [0]:
!mkdir -p ~/sandbox

# Check that our directory has been made
!ls -d ~/sandbox
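In a notebook, each `!` command runs in its own subshell, so a `!cd` on one line does not affect the next line. A minimal sketch of running a command inside a chosen directory from Python is shown below. The helper name `run_in` is hypothetical, and a temporary directory stands in for `~/sandbox` so the example has no side effects.

```python
import os
import subprocess
import tempfile


def run_in(directory, command):
    """Run a shell command with `directory` as its working directory (hypothetical helper)."""
    result = subprocess.run(
        command, shell=True, cwd=directory,
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# A temporary directory stands in for ~/sandbox in this sketch.
with tempfile.TemporaryDirectory() as demo_dir:
    before = os.getcwd()
    print(run_in(demo_dir, "pwd").strip())   # the command sees demo_dir
    print(os.getcwd() == before)             # the notebook's own directory is unchanged
```

In the notebook you would pass `os.path.expanduser("~/sandbox")` as the directory, or use the `%cd` magic to change the working directory for all later cells.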

Let's bring up the usage message for `magicblast`:

In [0]:
!magicblast -h
!magicblast -help

# Example 1: *Herpes simplex* virus 1

## Create a BLAST database:
---

First we need to create a BLAST database for our genome or transcriptome. For reference sequences in a FASTA file, use this command line:

```bash
!makeblastdb -in <reference.fa> -dbtype nucl -parse_seqids -out <database_name> -title "Database title"
```

The `-parse_seqids` option is required to keep the original sequence identifiers; otherwise, `makeblastdb` will generate its own identifiers. The `-title` option is optional.
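If you prefer to assemble this command from Python, a minimal sketch follows. The function name `makeblastdb_command` is hypothetical; it only uses the `makeblastdb` options shown above, and the file and database names in the example are placeholders.

```python
import shlex


def makeblastdb_command(fasta_path, database_name, title=None):
    """Build the makeblastdb argument list described above (hypothetical helper)."""
    cmd = [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", "nucl",
        "-parse_seqids",          # keep the original sequence identifiers
        "-out", database_name,
    ]
    if title is not None:         # -title is optional
        cmd += ["-title", title]
    return cmd


cmd = makeblastdb_command("reference.fa", "my_reference", "Database title")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the title contains spaces.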

For more information on `makeblastdb` see [NCBI BLAST+ Command Line User Manual](https://www.ncbi.nlm.nih.gov/books/NBK279688/).

Magic-BLAST will work with a genome in a FASTA file, but will be very slow for anything larger than a bacterial genome, so we do not recommend it.

### Example

For our example, we will be working with `viral_reference.fa`. Let's download the reference genome and take a look at it first.

There are a couple of ways to download the reference genome: we can either search for the organism in NCBI and download it with `wget <link>`, or use the NCBI EDirect tools. We cover both methods below.

Note that the word following ‘>’ is a sequence identifier that will be used in Magic-BLAST reports. The identifier should be unique.
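Because the identifiers should be unique, it can be worth checking a FASTA file before building a database from it. The following is a minimal sketch with hypothetical helper names; the second header in the example is made up purely to show a duplicate.

```python
from collections import Counter


def fasta_identifiers(lines):
    """Yield the first word after '>' on each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def duplicate_identifiers(lines):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(fasta_identifiers(lines))
    return sorted(name for name, n in counts.items() if n > 1)


example = [
    ">NC_001806.2 Human herpesvirus 1 strain 17, complete genome",
    "AGCCCGGGCC",
    ">NC_001806.2 made-up duplicate header for illustration",
    "TTTT",
]
print(duplicate_identifiers(example))  # ['NC_001806.2']
```

For a real file, pass an open file handle instead of a list, for example `duplicate_identifiers(open("viral_reference.fa"))`.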

There are several ways to download whole genomes, transcriptomes, or selected sequences from NCBI. For example, to download human chromosome 1 using [NCBI EDirect tools](https://github.com/NCBI-Hackathons/EDirectCookbook), use:

```bash
!esearch -db nucleotide -query NC_000001 | efetch -format fasta > NC_000001.fa
```

**NOTE:** We have already installed NCBI EDirect tools for you here. But if you are having problems with this installation, please follow installation instructions on [NCBI EDirect tools](https://github.com/NCBI-Hackathons/EDirectCookbook) before proceeding.

You can download full human genome or transcriptome from [NCBI human genome resources](https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml) or use [NCBI Genome search](https://www.ncbi.nlm.nih.gov/genome) for any organism.

For example, to download the latest human genome, use:

```bash
!wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz
!gzip -d GRCh38_latest_genomic.fna.gz
!makeblastdb -in GRCh38_latest_genomic.fna -out GRCh38 -parse_seqids -dbtype nucl
```

#### Method 1 Download Reference Genome

In [0]:
# Download our reference genome
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/859/985/GCF_000859985.2_ViralProj15217/GCF_000859985.2_ViralProj15217_genomic.fna.gz

In [0]:
!ls
# Decompress and rename reference genome
!gzip -dc GCF_000859985.2_ViralProj15217_genomic.fna.gz > viral_reference.fa
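The same decompress-and-rename step can be done from Python with the standard library, which may be handy outside a notebook. This is a sketch only; the helper name `decompress_gzip` is hypothetical, and the file names are the ones used in the cell above.

```python
import gzip
import os
import shutil


def decompress_gzip(src_path, dest_path):
    """Write the decompressed contents of src_path to dest_path, keeping the original."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path


archive = "GCF_000859985.2_ViralProj15217_genomic.fna.gz"
if os.path.exists(archive):  # only runs after the download cell has succeeded
    decompress_gzip(archive, "viral_reference.fa")
```

Like `gzip -dc`, this leaves the compressed archive in place.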

In [0]:
!cat viral_reference.fa

>NC_001806.2 Human herpesvirus 1 strain 17, complete genome
AGCCCGGGCCCCCCGCGGGCGCGCGCGCGCGCAAAAAAGGCGGGCGGCGGTCCGGGCGGCGTGCGCGCGCGCGGCGGGCG
TGGGGGGCGGGGCCGCGGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGA
GCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGG
AGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGGGGGGAGGAGCGG
CCAGACCCCGAAAACGGGCCCCCCCCAAAACACACCCCCCGGGGGTCGCGCGCGGCCCTTTAAAGCGGCGGCGGCGGGCA
GCCCGGGCCCCCCGCGGCCGAGACTAGCGAGTTAGACAGGCAAGCACTACTCGCCTCTGCACGCACATGCTTGCCTGTCA
AACTCTACCACCCCGGCACGCTCTCTGTCTCCATGGCCCGCCGCCGCCGCCATCGCGGCCCCCGCCGCCCCCGGCCGCCC
GGGCCCACGGGCGCCGTCCCAACCGCACAGTCCCAGGTAACCTCCACGCCCAACTCGGAACCCGCGGTCAGGAGCGCGCC
CGCGGCCGCCCCGCCGCCGCCCCCCGCCGGTGGGCCCCCGCCTTCTTGTTCGCTGCTGCTGCGCCAGTGGCTCCACGTTC
CCGAGTCCGCGTCCGACGACGACGATGACGACGACTGGCCGGACAGCCCCCCGCCCGAGCCGGCGCCAGAGGCCCGGCCC
ACCGCCGCCGCCCCCCGGCCCCGGCCCCCACCGCCCGGCGTGGGCCCGGGGGGCGGGGCTGACCCCTCCCACCCCCCCTC
GCGCCCCTTCCGCCTTCCGCCGCGCCTCGCCCTCCGC

#### Method 2 Download Reference Genome

In [0]:
!esearch -db nucleotide -query NC_001806 | efetch -format fasta > NC_001806.fa

In [0]:
!ls

### Create a BLAST database

To create a BLAST database from the reference FASTA file, use the following command line:

In [0]:
# Let's take a quick look at the usage message
!makeblastdb -help

In [0]:
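# Create the BLAST database in the current directory.
# (A separate `!cd ~/sandbox` line would have no effect, because each `!` command
# runs in its own subshell. The reference file was written to the current
# directory, so we refer to it by its relative path rather than via $HOME.)
!makeblastdb -in viral_reference.fa -dbtype nucl -parse_seqids -out viral_reference -title "Herpes virus 1"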

In [0]:
!ls

# Use NCBI SRA repository
---

If you are mapping an experiment from the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), use the `-sra <accession>` option:

```bash
magicblast -sra <accession> -db <database_name>
```

### Example

In [0]:
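# Note: `!cd ~/sandbox` does not persist, because each `!` command runs in its own
# subshell. Use the `%cd` magic if you want to change the notebook's working directory.
!pwd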

In [0]:
!magicblast -sra SRS3315293 -db viral_reference

To map several SRA runs, use a comma-separated list of accessions:

In [0]:
!magicblast -sra SRR1237994,SRR1237993 -db viral_reference
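When the accessions come from a Python list, the comma-separated `-sra` value can be assembled as in the sketch below. The function name `sra_arguments` is hypothetical, and the database name in the example assumes the `viral_reference` database built earlier.

```python
def sra_arguments(accessions, database_name):
    """Build a magicblast argument list for one or more SRA accessions (hypothetical helper)."""
    cleaned = [a.strip() for a in accessions if a.strip()]
    if not cleaned:
        raise ValueError("at least one SRA accession is required")
    # Several runs are passed as a single comma-separated value.
    return ["magicblast", "-sra", ",".join(cleaned), "-db", database_name]


print(sra_arguments(["SRR1237994", "SRR1237993"], "viral_reference"))
```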

See [Create BLAST database](https://ncbi.github.io/magicblast/cook/blastdb.html) for how to create a BLAST database.

# Reads in FASTA or FASTQ
---

If your reads are in a local FASTA file, use this command line:

```bash
!magicblast -query reads.fa -db my_reference
```

If your reads are in a local FASTQ file, use this command line:

```bash
!magicblast -query reads.fastq -db my_reference -infmt fastq
```
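The only difference between the two commands is `-infmt fastq`. One way to pick the right form automatically is to look at the first character of the reads file, as sketched below. The helper names are hypothetical, and the sketch assumes an uncompressed text file whose first non-empty line is a header.

```python
def detect_read_format(path):
    """Guess 'fasta' or 'fastq' from the first non-empty line of an uncompressed file."""
    with open(path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            break
    raise ValueError(f"{path} does not look like FASTA or FASTQ")


def query_arguments(path):
    """Return the -query arguments, adding -infmt fastq only when needed."""
    args = ["-query", path]
    if detect_read_format(path) == "fastq":
        args += ["-infmt", "fastq"]
    return args
```

For example, `["magicblast"] + query_arguments("reads.fastq") + ["-db", "my_reference"]` reproduces the second command above.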

In [0]:
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/626/655/GCF_000626655.2_ASM62665v2/GCF_000626655.2_ASM62665v2_genomic.fna.gz

In [0]:
!ls

# Paired reads
---

For SRA accessions, Magic-BLAST determines whether reads are paired and maps them appropriately.

For reads in FASTA and FASTQ files, paired reads can be either in a single file or in two files.

#### Single file

For paired reads presented as successive entries in a single FASTA or FASTQ file, i.e. read 1 and 2 of fragment 1, then read 1 and 2 of fragment 2, etc., simply add the parameter `-paired`:

```bash
!magicblast -query reads.fa -db genome -paired
```

or

```bash
!magicblast -query reads.fastq -db genome -paired -infmt fastq
```
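The `-paired` option relies on the two reads of each fragment being adjacent in the file. The sketch below checks that layout for a FASTQ file by pairing up successive read names. It assumes simple four-line FASTQ records (no wrapped sequence lines), and the helper names are hypothetical.

```python
def fastq_read_names(lines):
    """Return the read names from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    while lines and not lines[-1]:      # ignore trailing blank lines
        lines.pop()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input is not a whole number of 4-line records")
    names = []
    for i in range(0, len(lines), 4):
        header = lines[i]
        if not header.startswith("@"):
            raise ValueError(f"record {i // 4 + 1} does not start with '@'")
        names.append(header[1:].split()[0])
    return names


def interleaved_pairs(names):
    """Group successive reads as (read 1, read 2) pairs."""
    if len(names) % 2 != 0:
        raise ValueError("odd number of reads: the file cannot be fully paired")
    return list(zip(names[0::2], names[1::2]))


example = ["@frag1/1", "ACGT", "+", "IIII", "@frag1/2", "TTGA", "+", "IIII"]
print(interleaved_pairs(fastq_read_names(example)))  # [('frag1/1', 'frag1/2')]
```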

#### Two files

For paired reads presented in two parallel files, use these options:

```bash
!magicblast -query reads.fa -query_mate mates.fa -db genome
```

or

```bash
!magicblast -query reads.fastq -query_mate mates.fastq -db genome -infmt fastq
```
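With two parallel files, a quick sanity check is that both contain the same number of reads. The sketch below does this for FASTA files by counting header lines. The helper names are hypothetical, and the check says nothing about whether the reads are in matching order.

```python
def count_fasta_records(path):
    """Count the header lines (those starting with '>') in a FASTA file."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def check_mate_files(reads_path, mates_path):
    """Raise ValueError if the two files hold different numbers of reads."""
    n_reads = count_fasta_records(reads_path)
    n_mates = count_fasta_records(mates_path)
    if n_reads != n_mates:
        raise ValueError(
            f"{reads_path} has {n_reads} reads but {mates_path} has {n_mates}"
        )
    return n_reads
```

Calling `check_mate_files("reads.fa", "mates.fa")` before mapping catches truncated downloads early.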

# RNA vs DNA
---

#### Splicing

By default, Magic-BLAST aligns RNA reads to a genome and reports spliced alignments, possibly spanning several exons. To disable spliced alignments, use the `-splice F` option.

For example, mapping RNA or DNA reads to a bacterial genome:

In [0]:
!magicblast -sra SRR5647973 -db salmonella_enterica_genome -splice F

#### Transcriptome

Use the `-reftype transcriptome` option to map reads to a transcriptome database. For example:

```bash
!magicblast -query reads.fa -db my_transcripts -reftype transcriptome
```

The `-reftype transcriptome` option is shorthand for `-splice F -limit_lookup F`, so the above call is equivalent to:

```bash
!magicblast -query reads.fa -db my_transcripts -splice F -limit_lookup F
```
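The equivalence between the two calls can be written down as a small argument rewrite. This is only an illustration of the shorthand described above, with a hypothetical function name; it is not how Magic-BLAST itself parses its options.

```python
def expand_reftype(args):
    """Replace '-reftype transcriptome' with the two options it stands for."""
    expanded = []
    i = 0
    while i < len(args):
        is_shorthand = (
            args[i] == "-reftype"
            and i + 1 < len(args)
            and args[i + 1] == "transcriptome"
        )
        if is_shorthand:
            expanded += ["-splice", "F", "-limit_lookup", "F"]
            i += 2
        else:
            expanded.append(args[i])
            i += 1
    return expanded


short = ["magicblast", "-query", "reads.fa", "-db", "my_transcripts",
         "-reftype", "transcriptome"]
print(" ".join(expand_reftype(short)))
```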

Magic-BLAST finds alignments between a read and a genome based on an initial common word shared by both. Many genomes contain interspersed repeats that make mapping much more time-consuming. To make mapping faster, Magic-BLAST disregards words that appear too often in the reference. This is not desirable when mapping to transcripts, because a transcript with many variants could be treated as a repeat. The `-limit_lookup F` option turns this filtering off.
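To see why frequency filtering hurts with transcript variants, consider the toy sketch below. It is an illustration of the idea only: the word size, the threshold, and the function names are all assumptions made for the example, not Magic-BLAST's actual data structures or parameters.

```python
from collections import Counter


def word_counts(sequences, k):
    """Count every length-k word across all reference sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def lookup_words(sequences, k, max_count=None):
    """Words kept for seeding; max_count=None means no frequency filtering."""
    counts = word_counts(sequences, k)
    if max_count is None:
        return set(counts)
    return {word for word, n in counts.items() if n <= max_count}


# Three made-up "variants" of one transcript that share most of their sequence.
variants = ["ACGTACGTAA", "ACGTACGTCC", "ACGTACGTGG"]

print("ACGT" in lookup_words(variants, 4))               # True: no filtering
print("ACGT" in lookup_words(variants, 4, max_count=2))  # False: looks like a repeat
```

With filtering on, the words shared by all variants are discarded, so reads from the shared region would have nothing to seed from. Turning the filter off keeps them.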

# Multi-threading
---

To use multiple CPUs, specify the maximal number of threads with the `-num_threads` parameter:

```bash
!magicblast -query reads.fa -db genome -num_threads 10
```
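Rather than hard-coding the thread count, you could derive it from the machine you are running on. The sketch below caps the requested value at the number of CPUs reported by Python; that cap is a design choice made for this example, not a Magic-BLAST requirement, and the function name is hypothetical.

```python
import os


def thread_arguments(requested=None):
    """Return -num_threads arguments, never asking for more CPUs than are available."""
    available = os.cpu_count() or 1
    if requested is None:
        threads = available
    else:
        threads = max(1, min(requested, available))
    return ["-num_threads", str(threads)]


print(["magicblast", "-query", "reads.fa", "-db", "genome"] + thread_arguments(10))
```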

# Example 2: *Salmonella typhimurium* str. LT2
---

Now we will repeat the same steps as before, this time with a bacterial example.

In [0]:
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz

In [0]:
!ls

## Create a BLAST database:
---
First we need to create a BLAST database for our genome or transcriptome. For reference sequences in a FASTA file, use this command line:

```bash
!makeblastdb -in <reference.fa> -dbtype nucl -parse_seqids -out <database_name> -title "Database title"
```

The `-parse_seqids` option is required to keep the original sequence identifiers; otherwise, `makeblastdb` will generate its own identifiers. The `-title` option is optional.

For more information on `makeblastdb` see [NCBI BLAST+ Command Line User Manual](https://www.ncbi.nlm.nih.gov/books/NBK279688/).

Magic-BLAST will work with a genome in a FASTA file, but will be very slow for anything larger than a bacterial genome, so we do not recommend it.

### Example

For our example, we will be working with `bacteria_reference.fa`. Let's download the reference genome and take a look at it first.

There are a couple of ways to download the reference genome: we can either search for the organism in NCBI and download it with `wget <link>`, or use the NCBI EDirect tools. We cover both methods below.

Note that the word following ‘>’ is a sequence identifier that will be used in Magic-BLAST reports. The identifier should be unique.

There are several ways to download whole genomes, transcriptomes, or selected sequences from NCBI. For example, to download human chromosome 1 using [NCBI EDirect tools](https://github.com/NCBI-Hackathons/EDirectCookbook), use:

```bash
!esearch -db nucleotide -query NC_000001 | efetch -format fasta > NC_000001.fa
```

**NOTE:** We have already installed NCBI EDirect tools for you here. But if you are having problems with this installation, please follow installation instructions on [NCBI EDirect tools](https://github.com/NCBI-Hackathons/EDirectCookbook) before proceeding.

You can download full human genome or transcriptome from [NCBI human genome resources](https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml) or use [NCBI Genome search](https://www.ncbi.nlm.nih.gov/genome) for any organism.

For example, to download the latest human genome, use:

```bash
!wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz
!gzip -d GRCh38_latest_genomic.fna.gz
!makeblastdb -in GRCh38_latest_genomic.fna -out GRCh38 -parse_seqids -dbtype nucl
```

### Method 1 Download Reference Genome

You may use either Method 1 or Method 2, as shown at the beginning of the tutorial, to download the reference genome!

In [0]:
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz

Caution: Make sure to decompress and rename the reference genome of your choice. 

In [0]:
!ls
# Decompress and rename reference genome
!gzip -dc GCF_000006945.2_ASM694v2_genomic.fna.gz > bacteria_reference.fa

# Use NCBI SRA repository
---

If you are mapping an experiment from the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), use the `-sra <accession>` option:

```bash
magicblast -sra <accession> -db <database_name>
```

### Example

In [0]:
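# Note: `!cd ~/sandbox` does not persist, because each `!` command runs in its own
# subshell. Use the `%cd` magic if you want to change the notebook's working directory.
!pwd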

In [0]:
!magicblast -sra SRS3315293 -db my_reference