# NCBI Datasets and ElasticBLAST - BOSC/CoFest (07/15/2022)

### Table of contents
* [Part I: Introduction to NCBI Datasets](#Part-I) 
* [Part II: Accessing metadata](#Part-II)
* [Part III: Accessing genomes](#Part-III)
* [Part IV: Accessing genes](#Part-IV)
* [Part V: Accessing orthologs](#Part-V)
* [Part VI: Using ElasticBLAST and Datasets](#Part-VI)

### Important resources
- Etherpad: 
- Github: 
- NCBI datasets: https://www.ncbi.nlm.nih.gov/datasets/
- ElasticBLAST: 
- jq cheat sheet: 

## Case study: Elmo loves ants

Elmo is a graduate student at the Via Sesamum University. As part of his Ph.D. project, he studies Panamanian leaf cutter ants (genus *Acromyrmex*, family Formicidae) and how variation in the gene *orco* (**o**dorant **r**eceptor **co**receptor) affects the colonies of this genus.

(Here's the [link](https://www.ncbi.nlm.nih.gov/labs/pmc/articles/PMC5556950/) to a cool paper about this gene in ants of the species *Ooceraea biroi*.)

Elmo will use `datasets` to help him gather the existing genomic resources from NCBI. He will:

- download all available genomes for the genus *Acromyrmex*
- download the *orco* gene from the *Acromyrmex* reference genome
- download the ortholog set for this gene for all ants (Formicidae)

In addition, he will use `BLAST` and `ElasticBLAST` to do the following tasks:
- Download the NCBI taxonomy database 
- Create a custom BLAST database with taxonomy information
- Prepare an ElasticBLAST search on the cloud
- Submit an ElasticBLAST search, download and visualize the results
- Cleanup after the search is done


## Part I: Introduction to NCBI Datasets<a class="anchor" id="Part-I"></a>

### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/download-and-install/) is a command line tool that allows users to download data packages (data + metadata) or look at metadata summaries for genomes, RefSeq annotated genes, curated ortholog sets and SARS-CoV-2 virus sequences and proteins. The program follows a hierarchy that makes it easier for users to select exactly which options they would like to use. In addition to the program commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v1/datasets_schema_complete.svg" alt="datasets" style="width: 800px;"/>

In addition to `datasets`, we will be using `jq` (a JSON parser) to take a look at the metadata information. Our metadata reports are almost all in JSON or [JSON Lines](https://jsonlines.org/) format. We put together a [jq cheat sheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract information from those files.

For example: if I want to download the fruit fly reference genome (<i>Drosophila melanogaster</i>, taxid 7227), I would use the command below:

In [2]:
%%bash
#Download the fruit fly (taxid 7227) reference genome, with associated annotation files and metadata

datasets download genome taxon 7227 --reference --no-progressbar


Instead of downloading a data package, I could look at the metadata information by using the `summary` command. Here, I'm piping it to [`jq`](https://stedolan.github.io/jq/) so it's easier to read:

In [None]:
%%bash
#Check the metadata information for the fruit fly reference genome

datasets summary genome taxon 7227 --reference | jq .

### Data packages

NCBI Datasets delivers data as <u>data packages</u>, which are zip archives containing both data (FASTA, GFF3, GTF, GBFF) and metadata files (JSON, JSON-Lines). The image below shows the contents of the genome data package. Files are included depending on their availability. For example: for an annotated genome, the data package would include FASTA files (genomic, transcript, protein and CDS sequences) and annotation files (GFF3, GTF and GBFF).


<img src="./images/genome_data_package.png" alt="data_package" />

### How to get help when using the command line

Since `datasets` is a very hierarchical program, we can use that hierarchy to get help at exactly the level we need. For example: if we type `datasets --help`, we will see the first level of commands available.


In [4]:
%%bash
datasets --help

datasets is a command-line tool that is used to query and download biological sequence data
across all domains of life from NCBI databases.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  datasets [command]

Data Retrieval Commands
  summary              print a summary of a gene or genome dataset
  download             download a gene, genome or coronavirus dataset as a zip file
  rehydrate            rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands
  completion           generate autocompletion scripts
  version              print the version of this client and exit
  help                 Help about any command

Flags
      --api-key string   NCBI Datasets API Key
  -h, --help             help for datasets
      --no-progressbar   hide progress bar

Use datasets help <command> for detailed help about a command.


Notice the difference from when we type `datasets summary genome taxon formicidae --help`  


In [5]:
%%bash
datasets summary genome taxon formicidae --help


Print a summary of a genome dataset by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank). The summary is returned in JSON format.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  datasets summary genome taxon [flags]

Examples
  datasets summary genome taxon human
  datasets summary genome taxon "mus musculus"
  datasets summary genome taxon 10116

Flags
  -h, --help              help for taxon
      --tax-exact-match   exclude sub-species when a species-level taxon is specified


Global Flags
  -a, --annotated                only include genomes with annotation
      --api-key string           NCBI Datasets API Key
      --as-json-lines            Stream results as newline delimited JSON-Lines
      --assembly-level string    restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, con

## Part II: Accessing metadata<a class="anchor" id="Part-II"></a>

First, let's figure out what kind of genome information NCBI has for ants (family Formicidae).

In [None]:
%%bash
# Get metadata info

datasets summary genome taxon formicidae 


In [6]:
%%bash
# Get metadata info and save to a file

datasets summary genome taxon formicidae > formicidae_summary.json

**Now let's take a look at the metadata using jq**

In [7]:
%%bash

datasets summary genome taxon formicidae | jq . 

{
  "assemblies": [
    {
      "assembly": {
        "annotation_metadata": {
          "file": [
            {
              "estimated_size": "3421616",
              "type": "GENOME_GFF"
            },
            {
              "estimated_size": "129483045",
              "type": "GENOME_GBFF"
            },
            {
              "estimated_size": "3444924",
              "type": "PROT_FASTA"
            },
            {
              "estimated_size": "2684704",
              "type": "GENOME_GTF"
            },
            {
              "estimated_size": "7862131",
              "type": "CDS_FASTA"
            }
          ],
          "name": "Annotation submitted by BGI",
          "release_date": "2021-03-29",
          "source": "BGI",
          "stats": {
            "gene_counts": {
              "protein_coding": 8986,
              "total": 14640
            }
          }
        },
        "assembly_accession": "GCA_017607545.1",
        "assembly_category": "rep

### A little bit more about JSON files
A JSON (JavaScript Object Notation) file stores data structures and objects. In a very simplified (and non-technical) way, a JSON file is a box that might contain other boxes, which in turn might have more boxes inside. In `datasets summary genome` our JSON "box" is organized like this:
<img src="./images/json8.png" alt="image"/>

But let's explore the "boxes" in stages, so we can understand how everything is organized and how we can use this knowledge to extract information from the summary metadata file. At the first level, we have this: 
```
{
 assemblies[
      assembly{},
      assembly{},
 ],
 total_count
}
```
<img src="./images/json1.png" />

If we want to look at the value in the field "total_count", here's the command we would use:

In [9]:
%%bash
datasets summary genome taxon herpestidae | jq '.total_count'

5


If we continue to expand each one of those assembly boxes, more levels of the hierarchy will be revealed. Let's expand each assembly and look at what information we can find at that level.  

<img src="./images/json2a.png" alt="image"/>

Here we can see that some of the assembly information, such as the assembly accession number, contig N50 and submission date, is not inside any of the nested "boxes" (`annotation_metadata`, `chromosomes`, `bioproject_lineages`, and `org`). Those fields describe the assembly as a whole rather than any one of those boxes. What are the contig N50 values of those assemblies?

To retrieve that information, we need to name each box along the path, starting from the outermost one and ending at the field we're interested in. Each level is separated from the next by a period (.).

<img src="./images/json3.png" />

In [8]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly.contig_n50'

113567
180702
75409
148487
75409
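
The same path can be followed in Python with the standard `json` module. This is only a sketch: `summary_text` is a hand-trimmed stand-in for the real `datasets summary genome taxon herpestidae` output, keeping just the fields discussed above and the values printed in this notebook.

```python
import json

# Hand-trimmed stand-in for the summary output (assumption: only the fields
# discussed above are kept; the real report contains many more).
summary_text = """
{
  "assemblies": [
    {"assembly": {"assembly_accession": "GCA_004023845.1", "contig_n50": 113567,
                  "org": {"sci_name": "Helogale parvula"}}},
    {"assembly": {"assembly_accession": "GCA_004023785.1", "contig_n50": 180702,
                  "org": {"sci_name": "Mungos mungo"}}},
    {"assembly": {"assembly_accession": "GCF_006229205.1", "contig_n50": 75409,
                  "org": {"sci_name": "Suricata suricatta"}}},
    {"assembly": {"assembly_accession": "GCA_004023905.1", "contig_n50": 148487,
                  "org": {"sci_name": "Suricata suricatta"}}},
    {"assembly": {"assembly_accession": "GCA_006229205.1", "contig_n50": 75409,
                  "org": {"sci_name": "Suricata suricatta"}}}
  ],
  "total_count": 5
}
"""

summary = json.loads(summary_text)

# Equivalent of: jq '.total_count'
total_count = summary["total_count"]

# Equivalent of: jq '.assemblies[].assembly.contig_n50'
# The list comprehension plays the role of the [] iterator in jq.
contig_n50_values = [entry["assembly"]["contig_n50"] for entry in summary["assemblies"]]

print(total_count)
print(contig_n50_values)
```

Each pair of square brackets in Python opens one "box", just like each period-separated name does in the `jq` path.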


Now let's see what we have inside `annotation_metadata`, `bioproject_lineages`, `org` and `chromosomes`. 
<img src="./images/json8.png" alt="image"/>

Now let's see how we can retrieve the scientific names associated with those assemblies.
<img src="./images/json4.png" />

In [10]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly.org.sci_name'

"Helogale parvula"
"Mungos mungo"
"Suricata suricatta"
"Suricata suricatta"
"Suricata suricatta"


As you can see, `jq` is very useful in retrieving information from the summary metadata *as long as* you know the path to find it. Let's try a few more complex examples.
<img src="./images/json5.png" />

First, let's retrieve information from three fields at the same time: scientific name (`sci_name`), assembly accession number (`assembly_accession`) and contig N50 (`contig_n50`).

In [11]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly | (.org.sci_name, .assembly_accession, .contig_n50)'

"Helogale parvula"
"GCA_004023845.1"
113567
"Mungos mungo"
"GCA_004023785.1"
180702
"Suricata suricatta"
"GCF_006229205.1"
75409
"Suricata suricatta"
"GCA_004023905.1"
148487
"Suricata suricatta"
"GCA_006229205.1"
75409


Since all three fields are inside the `.assemblies[].assembly`, we can call the first part of the path once and use a pipe (|) to call each specific field. 
Now let's try to make this a little easier to read. We can create new fields and assign values to them, like this:

In [12]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly 
| {species: .org.sci_name, accession: .assembly_accession, contigN50:.contig_n50}'

{
  "species": "Helogale parvula",
  "accession": "GCA_004023845.1",
  "contigN50": 113567
}
{
  "species": "Mungos mungo",
  "accession": "GCA_004023785.1",
  "contigN50": 180702
}
{
  "species": "Suricata suricatta",
  "accession": "GCF_006229205.1",
  "contigN50": 75409
}
{
  "species": "Suricata suricatta",
  "accession": "GCA_004023905.1",
  "contigN50": 148487
}
{
  "species": "Suricata suricatta",
  "accession": "GCA_006229205.1",
  "contigN50": 75409
}
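
For readers more comfortable with Python than `jq`, the relabeling step can be sketched as a small function that builds a new dictionary per assembly. The two records are copied from the output above with all other fields omitted, and the helper name `relabel` is made up for this illustration.

```python
import json

# Two assembly records copied from the output above (other fields omitted).
assemblies = [
    {"assembly": {"assembly_accession": "GCA_004023845.1", "contig_n50": 113567,
                  "org": {"sci_name": "Helogale parvula"}}},
    {"assembly": {"assembly_accession": "GCA_004023785.1", "contig_n50": 180702,
                  "org": {"sci_name": "Mungos mungo"}}},
]


def relabel(entry):
    """Mirror jq's {species: ..., accession: ..., contigN50: ...} construction."""
    assembly = entry["assembly"]
    return {
        "species": assembly["org"]["sci_name"],
        "accession": assembly["assembly_accession"],
        "contigN50": assembly["contig_n50"],
    }


relabeled = [relabel(entry) for entry in assemblies]
for record in relabeled:
    print(json.dumps(record, indent=2))
```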


**Last one**: let's look at a larger collection of genome assemblies (let's say, all Carnivora) and select only those assemblies with contig N50 larger than 15 Mb (15000000 bp). `datasets` provides many options for filtering, but there is no built-in filter for contig N50 size.  

Here's what we want to see: assembly accession number, species and assembly level for those genomes with contig N50 above 15 Mb.

In [13]:
%%bash
datasets summary genome taxon carnivora | jq -r '.assemblies[].assembly 
| select(.contig_n50 > 15000000) 
| [.assembly_accession, .org.sci_name, .assembly_level] 
| @tsv'

GCA_905319855.2	Canis lupus	Chromosome
GCF_012295265.1	Canis lupus dingo	Chromosome
GCA_012295265.2	Canis lupus dingo	Chromosome
GCA_003254725.2	Canis lupus dingo	Chromosome
GCF_000002285.5	Canis lupus familiaris	Chromosome
GCA_000002285.4	Canis lupus familiaris	Chromosome
GCA_008641055.3	Canis lupus familiaris	Chromosome
GCA_013276365.2	Canis lupus familiaris	Chromosome
GCF_018350175.1	Felis catus	Chromosome
GCA_000181335.5	Felis catus	Chromosome
GCA_013340865.1	Felis catus	Contig
GCA_016509815.2	Felis catus	Chromosome
GCA_018350175.1	Felis catus	Chromosome
GCA_019924945.1	Felis chaus	Chromosome
GCF_018350155.1	Leopardus geoffroyi	Chromosome
GCA_018350155.1	Leopardus geoffroyi	Chromosome
GCF_902655055.1	Lutra lutra	Chromosome
GCA_902655055.2	Lutra lutra	Chromosome
GCF_022079265.1	Lynx rufus	Scaffold
GCA_022079265.1	Lynx rufus	Scaffold
GCF_922984935.1	Meles meles	Chromosome
GCA_922984935.2	Meles meles	Chromosome
GCA_922990625.1	Meles meles	Scaffold
GCF_009829155.1	Mustela erminea	Chrom
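
The `select(...)` and `@tsv` steps have a close Python analogue: filter the records with a condition, then write the surviving fields as tab-separated text. The sketch below reuses the Herpestidae values shown earlier in this notebook rather than the Carnivora data, so the threshold (`MIN_CONTIG_N50`) is an illustrative value chosen to suit that small sample, and the `assembly_level` column is left out because it is not shown for those records. The flat record layout and the helper name `filter_to_tsv` are assumptions made for this example.

```python
import csv
import io

# Flattened stand-in records using values printed earlier in this notebook.
records = [
    {"assembly_accession": "GCA_004023845.1", "sci_name": "Helogale parvula", "contig_n50": 113567},
    {"assembly_accession": "GCA_004023785.1", "sci_name": "Mungos mungo", "contig_n50": 180702},
    {"assembly_accession": "GCF_006229205.1", "sci_name": "Suricata suricatta", "contig_n50": 75409},
    {"assembly_accession": "GCA_004023905.1", "sci_name": "Suricata suricatta", "contig_n50": 148487},
]

MIN_CONTIG_N50 = 100_000  # illustrative threshold, not the 15 Mb used above


def filter_to_tsv(rows, min_n50):
    """Keep rows whose contig N50 exceeds min_n50 and render them as TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        if row["contig_n50"] > min_n50:  # same role as jq's select(...)
            writer.writerow([row["assembly_accession"], row["sci_name"]])
    return buffer.getvalue()


tsv_text = filter_to_tsv(records, MIN_CONTIG_N50)
print(tsv_text, end="")
```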

**RESOURCE:**  
We included a list of all fields in the genome summary in our [jq cheatsheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract the information you need. And we will show you now how to do that. 

## Part III: Accessing genomes<a class="anchor" id="Part-III"></a>

Whether you choose `datasets download` or `datasets summary`, there are two options for retrieving genome information:
- accession (assembly accession and version)
- taxon (common name, scientific name or NCBI taxon-ID at any taxonomic level)

Additional flags, such as `--assembly-source`, `--reference` or `--annotated`, will help narrow down to the genomes of interest.

`datasets download genome accession GCF_018350175.1`  
`datasets download genome taxon cat --reference`  

Both commands will download the same genome package: cat (<i>Felis catus</i>) <u>reference genome</u>. 

### Let's continue to explore the available genomes for the family Formicidae

For this part, we will use two UNIX commands: `sort` and `uniq`. 

- `sort` can be used to sort text files line by line, numerically and alphabetically.   
- `uniq` will filter out the repeated lines in a file. However, `uniq` can only detect repeated lines if they are adjacent to each other, which is why we sort the input first. The flag `-c` or `--count` tells `uniq` to remove the repeated lines and to count how many times each value appeared.

So, we will use `jq` to extract the information we need, sort the result and count the number of unique entries.

In [14]:
%%bash
# For which species does NCBI have genomes in its database? How many per species?

datasets summary genome taxon formicidae | jq '.assemblies[].assembly.org.sci_name' | sort | uniq -c

   1 "Acromyrmex charruanus"
   2 "Acromyrmex echinatior"
   1 "Acromyrmex heyeri"
   1 "Acromyrmex insinuator"
   1 "Aphaenogaster ashmeadi"
   1 "Aphaenogaster floridana"
   1 "Aphaenogaster fulva"
   1 "Aphaenogaster miamiana"
   1 "Aphaenogaster picea"
   2 "Aphaenogaster rudis"
   2 "Atta cephalotes"
   2 "Atta colombica"
   1 "Atta texana"
   3 "Camponotus floridanus"
   2 "Camponotus pennsylvanicus"
   1 "Cardiocondyla obscurior"
   3 "Cataglyphis hispanica"
   1 "Cataglyphis niger"
   1 "Crematogaster levior"
   2 "Cyphomyrmex costatus"
   2 "Dinoponera quadriceps"
   1 "Eciton burchellii"
   1 "Formica aquilonia x Formica polyctena"
   2 "Formica exsecta"
   1 "Formica selysi"
   3 "Harpegnathos saltator"
   1 "Lasius niger"
   2 "Linepithema humile"
   5 "Monomorium pharaonis"
   2 "Nylanderia fulva"
   2 "Odontomachus brunneus"
   3 "Ooceraea biroi"
   2 "Pogonomyrmex barbatus"
   1 "Pogonomyrmex californicus"
   1 "Pseudoatta argentina"
   1 "Pseudomyrmex concolor"
   1 "Ps

In [15]:
%%bash
# What is the assembly level (contig, scaffold, chromosome, complete) breakdown?

datasets summary genome taxon formicidae | jq '.assemblies[].assembly.assembly_level' | sort | uniq -c

  15 "Chromosome"
   3 "Contig"
  78 "Scaffold"
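
If you are working in a Python cell instead of the shell, `collections.Counter` gives the same kind of tally as `sort | uniq -c`. This is a minimal sketch using the five Herpestidae species names printed earlier in this notebook as stand-in input.

```python
from collections import Counter

# Stand-in input: species names as printed earlier for Herpestidae.
species_names = [
    "Helogale parvula",
    "Mungos mungo",
    "Suricata suricatta",
    "Suricata suricatta",
    "Suricata suricatta",
]

# Counter tallies every value, whether or not repeats are adjacent.
species_counts = Counter(species_names)

# Sort by name to mimic the ordering produced by `sort | uniq -c`.
for name, count in sorted(species_counts.items()):
    print(f"{count:4d} {name}")
```

Unlike `uniq`, `Counter` does not need sorted input; sorting here only affects the order in which the counts are printed.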


Now that we explored the number of available genomes, as well as the assembly level and other important characteristics, it's time to download the genomes for Elmo's research project.

First, we want to download all genomes for the genus *Acromyrmex* from GenBank.

In [16]:
%%bash
# Download all available GenBank assemblies for the genus Acromyrmex and save as genomes.zip
datasets download genome taxon acromyrmex --assembly-source genbank --filename genomes.zip --no-progressbar

The next step is to unzip the data package zip file to a new folder called `genomes`.

In [17]:
%%bash
# Unzip genomes.zip to the folder genomes
unzip -o genomes.zip -d genomes

Archive:  genomes.zip
  inflating: genomes/README.md       
  inflating: genomes/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/genomic.gff  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/protein.faa  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/cds_from_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/genomic.gff  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/protein.faa  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/cds_from_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/genomic.gff  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/protein.faa  
  inflating: genomes/ncbi_datas

Finally, let's take a look at the contents of the data package with the command `tree`.

In [18]:
%%bash
# Explore the structure of the folder genomes with the command tree
tree -C genomes/

genomes/
├── README.md
└── ncbi_dataset
    └── data
        ├── GCA_000204515.1
        │   ├── cds_from_genomic.fna
        │   ├── genomic.gff
        │   ├── protein.faa
        │   ├── sequence_report.jsonl
        │   └── unplaced.scaf.fna
        ├── GCA_017607455.1
        │   ├── GCA_017607455.1_ASM1760745v1_genomic.fna
        │   ├── cds_from_genomic.fna
        │   ├── genomic.gff
        │   ├── protein.faa
        │   └── sequence_report.jsonl
        ├── GCA_017607545.1
        │   ├── GCA_017607545.1_ASM1760754v1_genomic.fna
        │   ├── cds_from_genomic.fna
        │   ├── genomic.gff
        │   ├── protein.faa
        │   └── sequence_report.jsonl
        ├── GCA_017607565.1
        │   ├── GCA_017607565.1_ASM1760756v1_genomic.fna
        │   ├── cds_from_genomic.fna
        │   ├── genomic.gff
        │   ├── protein.faa
        │   └── sequence_report.jsonl
        ├── ass
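
A data package can also be inspected from Python without extracting it, using the standard `zipfile` module. The sketch below first builds a tiny stand-in archive in a temporary folder so that it runs on its own: the member paths are copied from the listing above, but the files are empty placeholders. With a real download you would skip that step and open `genomes.zip` directly.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath

# Paths copied from the unzip listing above; contents are empty placeholders.
member_paths = [
    "README.md",
    "ncbi_dataset/data/assembly_data_report.jsonl",
    "ncbi_dataset/data/GCA_000204515.1/genomic.gff",
    "ncbi_dataset/data/GCA_000204515.1/protein.faa",
    "ncbi_dataset/data/GCA_017607455.1/genomic.gff",
]

files_by_accession = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = Path(tmp_dir) / "genomes.zip"

    # Build the stand-in archive (not needed when you have a real package).
    with zipfile.ZipFile(archive_path, "w") as archive:
        for member in member_paths:
            archive.writestr(member, "")

    # List the archive and group the per-assembly files by accession folder.
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            parts = PurePosixPath(name).parts
            # Per-assembly files live at ncbi_dataset/data/<accession>/<file>
            if len(parts) == 4 and parts[:2] == ("ncbi_dataset", "data"):
                files_by_accession.setdefault(parts[2], []).append(parts[3])

for accession, files in sorted(files_by_accession.items()):
    print(accession, sorted(files))
```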

### Let's recap our goals

We used `datasets` to download all the GenBank assemblies for the genus *Acromyrmex*. The next step is to download the gene *orco* (odorant receptor coreceptor) for the same genus. But first, let's learn more about how genes are organized at NCBI.

## Part IV: Accessing genes <a class="anchor" id="Part-IV"></a>
### GENES

Whether you choose `datasets download` or `datasets summary`, there are three options for retrieving gene information:
- accession
- gene-id
- symbol

When choosing any of those three options, you will retrieve the gene information for the **reference** taxon. Like this:

`datasets download gene accession XR_002738142.1`  
`datasets download gene gene-id 101081937`  
`datasets download gene symbol BRCA1 --taxon cat`  

All three commands will download the same gene from the cat (<i>Felis catus</i>) <u>reference genome</u>. 

#### accession
Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).

#### gene-id
Also a unique identifier. Every annotated RefSeq genome has its own set of gene identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat it is 101081937.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. When using the symbol option, you should specify the species. The default is human.

**Remember**: both `summary` and `download` will return results for the **reference assembly** of a <u>single species</u>. If you want to download a curated set of the same gene for multiple taxa, you should use the option `ortholog`. We'll talk more about it later. For reference, here's the JSON organization of the gene summary metadata.  

<img src="./images/gene_json.drawio.png" />

Now let's take a look at a gene example:

In [19]:
%%bash
#Example: IFNG in human
datasets summary gene symbol ifng | jq -C .


{
  "genes": [
    {
      "gene": {
        "annotations": [
          {
            "assemblies_in_scope": [
              {
                "accession": "GCF_000001405.40",
                "name": "GRCh38.p14"
              },
              {
                "accession": "GCF_009914755.1",
                "name": "T2T-CHM13v2.0"
              }
            ],
            "release_date": "2022-02-25",
            "release_name": "NCBI Homo sapiens Annotation Release 110"
          }
       

In [20]:
%%bash
# how datasets deals with synonyms
datasets summary gene symbol IFG | jq -C -r '.genes[].gene | {species: .taxname, symbol: .symbol, synonyms:.synonyms}'


{
  "species": "Homo sapiens",
  "symbol": "IFNG",
  "synonyms": [
    "IFG",
    "IFI",
    "IMD69"
  ]
}


In [21]:
%%bash
#Example: IFNG in cat
datasets summary gene symbol ifng --taxon "felis catus"


{"genes":[{"gene":{"annotations":[{"assemblies_in_scope":[{"accession":"GCF_018350175.1","name":"F.catus_Fca126_mat1.0"}],"release_date":"2021-10-27","release_name":"NCBI Felis catus Annotation Release 105"}],"chromosomes":["B4"],"common_name":"domestic cat","description":"interferon gamma","gene_id":"493965","genomic_ranges":[{"accession_version":"NC_058374.1","range":[{"begin":"93137484","end":"93141840","orientation":"minus"}]}],"nomenclature_authority":{"authority":"VGNC","identifier":"VGNC:67703"},"orientation":"minus","swiss_prot_accessions":["P46402"],"symbol":"IFNG","tax_id":"9685","taxname":"Felis catus","transcripts":[{"accession_version":"NM_001009873.1","cds":{"accession_version":"NM_001009873.1","range":[{"begin":"40","end":"543"}]},"exons":{"accession_version":"NC_058374.1","range":[{"begin":"93141688","end":"93141840","order":1},{"begin":"93140419","end":"93140487","order":2},{"begin":"93140138","end":"93140323","order":3},{"begin":"93137484","end":"93137642","order":4}]

### Back to ants
We will download the gene *orco* for the species *Acromyrmex echinatior*. We will use the gene-id 105147775 instead of the symbol because no informative gene symbol has been assigned for this gene.  

In [22]:
%%bash
# Using gene-id to retrieve gene information
datasets summary gene gene-id 105147775 | jq -C '.genes[].gene 
| {gene_description: .description, gene_id: .gene_id, symbol: .symbol, species: .taxname}'

{
  "gene_description": "odorant receptor coreceptor",
  "gene_id": "105147775",
  "symbol": "LOC105147775",
  "species": "Acromyrmex echinatior"
}


In [23]:
%%bash
# if we try to retrieve metadata information for this gene using the symbol orco, what happens?
datasets summary gene symbol orco --taxon "acromyrmex echinatior"




In [24]:
%%bash
# Download the gene data package for the gene-id 105147775 (*orco* in Acromyrmex echinatior)
datasets download gene gene-id 105147775 --filename gene.zip --no-progressbar


In [25]:
%%bash
#Unzip the file
unzip -o gene.zip -d gene

Archive:  gene.zip
  inflating: gene/README.md          
  inflating: gene/ncbi_dataset/data/gene.fna  
  inflating: gene/ncbi_dataset/data/rna.fna  
  inflating: gene/ncbi_dataset/data/protein.faa  
  inflating: gene/ncbi_dataset/data/data_report.jsonl  
  inflating: gene/ncbi_dataset/data/data_table.tsv  
  inflating: gene/ncbi_dataset/data/dataset_catalog.json  


In [26]:
%%bash
#Explore the data package structure using tree
tree gene

gene
├── README.md
└── ncbi_dataset
    └── data
        ├── data_report.jsonl
        ├── data_table.tsv
        ├── dataset_catalog.json
        ├── gene.fna
        ├── protein.faa
        └── rna.fna

2 directories, 7 files


Now we are going to take advantage of the fact that we are using a Jupyter Notebook and use the package `pandas` to look at the gene data table.

In [27]:
import pandas as pd                                                        #load pandas to this notebook
gene_orco = pd.read_csv('gene/ncbi_dataset/data/data_table.tsv', sep='\t') #use pandas to import the data_table.tsv
gene_orco                                                                  #visualize the data table as the object gene_orco

Unnamed: 0,gene_id,gene_symbol,description,scientific_name,common_name,tax_id,genomic_range,orientation,location,gene_type,transcript_accession,transcript_name,transcript_length,transcript_cds_coords,protein_accession,isoform_name,protein_length,protein_name
0,105147775,LOC105147775,odorant receptor coreceptor,Acromyrmex echinatior,Panamanian leafcutter ant,103372,NW_011627180.1:39489-47719,-,chr Un,PROTEIN_CODING,XM_011059026.1,,2621,XM_011059026.1:264-1712,XP_011057328.1,,482,odorant receptor coreceptor


## Part V: Accessing orthologs <a class="anchor" id="Part-V"></a>

### Orthologs

The options to retrieve ortholog sets are the same as those for genes. We'll go over the differences when using each option:

- accession
- gene-id
- symbol

When choosing any of those three options, you will download the **full ortholog set** to which they belong (unless you apply additional filtering, which we'll cover below). Like this:

`datasets download ortholog accession XR_002738142.1`  
`datasets download ortholog gene-id 101081937`  
`datasets download ortholog symbol BRCA1 --taxon cat`  

All three commands will download the **same** ortholog set. 

---

#### <font color='blue'>Wait, but what is an ortholog set?</font>

>An ortholog set, or ortholog gene group, is a group of sequences that have been identified by the NCBI genome annotation team as homologous genes related to each other by speciation events. They are identified by a combination of protein similarity and local synteny information. 
Currently, NCBI has ortholog sets calculated for vertebrates and some insects. 


#### accession
Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).

#### gene-id
Also a unique identifier. Every annotated RefSeq genome has its own set of gene identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat it is 101081937. You can use either one (672 or 101081937) to get the same vertebrate BRCA1 ortholog set.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. For example: the P53 ortholog set in vertebrates is different from the insect set. When using the symbol option, you should specify the taxonomic group. The default is human. Note that if you request the ortholog set for several vertebrate species, you might end up downloading the same ortholog set multiple times. Like this: 

`datasets download ortholog symbol brca1 --taxon cat`  
`datasets download ortholog symbol brca1 --taxon chicken`  
`datasets download ortholog symbol brca1 --taxon "chelonia mydas"`  

If that's the case, how do you filter the ortholog set to include *only* your taxonomic group of interest?

### Applying a taxonomic filter to the ortholog set

For the orthologs, `datasets` provides the flag `--taxon-filter`, which allows the user to restrict the summary or download to one or multiple taxonomic groups.  `--taxon` and `--taxon-filter` have different effects on the data package/summary output. A few examples:

- `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`  
Prints a JSON metadata summary of the gene brca1 for the domestic cat. 
We did not specify a `--taxon` because the default is human, and Felidae and human are part of the same brca1 ortholog set.   

  

- `datasets summary ortholog symbol brca1 --taxon "felis catus"`  
Even though this option looks almost the same as the one above, the result is *very different*. Here, we're asking `datasets` to find the ortholog set to which the gene brca1 in the domestic cat belongs, and `datasets` will return the <u>entire</u> ortholog set, not only the sequences for the domestic cat.


- `datasets summary ortholog symbol brca1 --taxon "felis catus" --taxon-filter "felis catus"`  
Gives you the same result as `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`.


The summary metadata for orthologs is presented in JSON Lines, which means that each gene entry is on its own line. Here's the diagram to help you create queries.
  
<img src="./images/ortholog_jsonl.drawio.png" />
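
Because each line of a JSON Lines report is a complete JSON document, it can be parsed one line at a time. The sketch below shows the idea in Python. Note the assumption: the two sample lines use a flat, simplified layout invented for this illustration (field names borrowed from the gene summary, values taken from this notebook), so follow the diagram above for the real paths. The helper name `parse_json_lines` is also made up.

```python
import json

# Assumption: flat, simplified records for illustration only; the real
# ortholog report nests its fields as shown in the diagram above.
jsonl_text = (
    '{"gene_id": "105147775", "symbol": "LOC105147775", "taxname": "Acromyrmex echinatior"}\n'
    '{"gene_id": "105183395", "symbol": "LOC105183395", "taxname": "Harpegnathos saltator"}\n'
)


def parse_json_lines(text):
    """Parse one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


genes = parse_json_lines(jsonl_text)
species = [gene["taxname"] for gene in genes]
print(species)
```

With `jq` no special handling is needed, since it reads a stream of JSON documents; in Python the line-by-line loop does the same job.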

#### We are going to do the following steps:
- download the ortholog data package and save it with the name ortholog.zip
- unzip it to the folder ortholog
- look at the files

Helpful info:

- gene symbol: orco
- gene-id in *Drosophila melanogaster*: 40650
- gene-id in *Acromyrmex echinatior*: 105147775
- target taxon: Formicidae

In [28]:
%%bash
# download the orco ortholog set for ants (Formicidae)
datasets download ortholog gene-id 40650 --taxon-filter formicidae --filename ortholog.zip --no-progressbar


Found 22 genes in set


In [29]:
%%bash
# unzip it to the folder ortholog
unzip -o ortholog.zip -d ortholog


Archive:  ortholog.zip
  inflating: ortholog/README.md      
  inflating: ortholog/ncbi_dataset/data/gene.fna  
  inflating: ortholog/ncbi_dataset/data/rna.fna  
  inflating: ortholog/ncbi_dataset/data/protein.faa  
  inflating: ortholog/ncbi_dataset/data/data_report.jsonl  
  inflating: ortholog/ncbi_dataset/data/data_table.tsv  
  inflating: ortholog/ncbi_dataset/data/dataset_catalog.json  


In [30]:
%%bash
#Explore the folder structure
tree ortholog/


ortholog/
├── README.md
└── ncbi_dataset
    └── data
        ├── data_report.jsonl
        ├── data_table.tsv
        ├── dataset_catalog.json
        ├── gene.fna
        ├── protein.faa
        └── rna.fna

2 directories, 7 files


In [31]:
# Create an object called ortho_table using pandas
ortho_table = pd.read_csv("ortholog/ncbi_dataset/data/data_table.tsv", sep='\t')
ortho_table

Unnamed: 0,gene_id,gene_symbol,description,scientific_name,common_name,tax_id,genomic_range,orientation,location,gene_type,transcript_accession,transcript_name,transcript_length,transcript_cds_coords,protein_accession,isoform_name,protein_length,protein_name
0,105147775,LOC105147775,odorant receptor coreceptor,Acromyrmex echinatior,Panamanian leafcutter ant,103372,NW_011627180.1:39489-47719,-,chr Un,PROTEIN_CODING,XM_011059026.1,,2621,XM_011059026.1:264-1712,XP_011057328.1,,482,odorant receptor coreceptor
1,105183395,LOC105183395,odorant receptor coreceptor,Harpegnathos saltator,Jerdon's jumping ant,610380,NW_020230404.1:97134-107285,+,chr Un,PROTEIN_CODING,XM_011141465.3,,4083,XM_011141465.3:381-1820,XP_011139767.1,,479,odorant receptor coreceptor
2,105199036,LOC105199036,odorant receptor coreceptor,Solenopsis invicta,red fire ant,13686,NC_052665.1:4823610-4833998,+,chr 2,PROTEIN_CODING,XM_011165941.3,,5506,XM_011165941.3:203-1648,XP_011164243.1,,481,odorant receptor coreceptor
3,105249684,LOC105249684,odorant receptor coreceptor,Camponotus floridanus,Florida carpenter ant,104421,NW_020229214.1:9396847-9403345,+,chr Un,PROTEIN_CODING,XM_011255339.3,,1933,XM_011255339.3:281-1726,XP_011253641.1,,481,odorant receptor coreceptor
4,105284785,LOC105284785,odorant receptor coreceptor,Ooceraea biroi,clonal raider ant,2015173,NC_039506.1:10910490-10919026,+,chr 1,PROTEIN_CODING,XM_011348552.2,,4114,XM_011348552.2:184-1620,XP_011346854.1,,478,odorant receptor coreceptor
5,105424270,LOC105424270,odorant receptor coreceptor,Pogonomyrmex barbatus,red harvester ant,144034,NW_011933557.1:711662-718961,+,chr Un,PROTEIN_CODING,XM_011634408.2,,2642,XM_011634408.2:308-1747,XP_011632710.1,,479,odorant receptor coreceptor
6,105457428,LOC105457428,odorant receptor coreceptor,Wasmannia auropunctata,little fire ant,64793,NW_012027674.1:53097-61891,+,chr Un,PROTEIN_CODING,XM_011702081.1,,2852,XM_011702081.1:277-1722,XP_011700383.1,,481,odorant receptor coreceptor
7,105561667,LOC105561667,odorant receptor coreceptor,Vollenhovia emeryi,,411798,NW_011967163.1:316278-324489,-,chr Un,PROTEIN_CODING,XM_012011854.1,,2657,XM_012011854.1:444-1880,XP_011867244.1,,478,odorant receptor coreceptor
8,105625195,LOC105625195,odorant receptor coreceptor,Atta cephalotes,,12957,NW_012130067.1:2024690-2031375,-,chr Un,PROTEIN_CODING,XM_012206539.1,,1449,XM_012206539.1:1-1449,XP_012061929.1,,482,odorant receptor coreceptor
9,105673490,LOC105673490,odorant receptor coreceptor,Linepithema humile,Argentine ant,83485,NW_012160723.1:535817-538942,-,chr Un,PROTEIN_CODING,XM_012369146.1,,1285,XM_012369146.1:251-1285,XP_012224569.1,,344,LOW QUALITY PROTEIN: odorant receptor coreceptor
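The table can also be inspected without pandas. The sketch below uses only the standard library to read a tab-separated table, flag proteins annotated as low quality, and collect protein lengths per species. The inline sample reuses three rows and five columns from the output above; reading from an in-memory string instead of `data_table.tsv` is an assumption made to keep the example self-contained.

```python
import csv
import io

# Three rows and five columns copied from the table above,
# tab-separated like data_table.tsv.
sample_tsv = (
    "gene_id\tscientific_name\tprotein_accession\tprotein_length\tprotein_name\n"
    "105147775\tAcromyrmex echinatior\tXP_011057328.1\t482\todorant receptor coreceptor\n"
    "105199036\tSolenopsis invicta\tXP_011164243.1\t481\todorant receptor coreceptor\n"
    "105673490\tLinepithema humile\tXP_012224569.1\t344\t"
    "LOW QUALITY PROTEIN: odorant receptor coreceptor\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Species whose protein record carries the "LOW QUALITY PROTEIN" prefix
low_quality = [
    row["scientific_name"]
    for row in rows
    if row["protein_name"].startswith("LOW QUALITY PROTEIN")
]

# Protein length per species, converted from text to integers
lengths = {row["scientific_name"]: int(row["protein_length"]) for row in rows}

print("Low-quality annotations:", low_quality)
print("Protein lengths:", lengths)
```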


## What have we done so far?
- Explored metadata for all ant genomes
- Downloaded genomes for the Panamanian leaf cutter ant
- Downloaded the *orco* gene for *Acromyrmex echinatior*
- Downloaded the ortholog set for all ants for the *orco* gene


In [None]:
from uuid import uuid4

## Download sequences

If you have already done [Part III: Accessing genomes](#Part-III) and [Part IV: Accessing genes](#Part-IV), you can skip to [Download NCBI taxonomy database](#taxonomy). If not, follow the steps below.


In [1]:
%%bash
# Download the gene data package for the gene-id 105147775 (*orco* in Acromyrmex echinatior)
datasets download gene gene-id 105147775 --filename gene.zip --no-progressbar
unzip -o gene.zip -d gene



Archive:  gene.zip
  inflating: gene/README.md          
  inflating: gene/ncbi_dataset/data/gene.fna  
  inflating: gene/ncbi_dataset/data/rna.fna  
  inflating: gene/ncbi_dataset/data/protein.faa  
  inflating: gene/ncbi_dataset/data/data_report.jsonl  
  inflating: gene/ncbi_dataset/data/data_table.tsv  
  inflating: gene/ncbi_dataset/data/dataset_catalog.json  


In [7]:
%%bash
# Download all available GenBank assemblies for the genus Acromyrmex and save as genomes.zip
datasets download genome taxon acromyrmex --assembly-source genbank --filename genomes.zip --no-progressbar
unzip -o genomes.zip -d genomes

Archive:  genomes.zip
  inflating: genomes/README.md       
  inflating: genomes/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/genomic.gff  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/protein.faa  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/cds_from_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/genomic.gff  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/protein.faa  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/cds_from_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/genomic.gff  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/protein.faa  
  inflating: genomes/ncbi_datas

### Download NCBI taxonomy database

In [17]:
%%bash
update_blastdb.pl taxdb

Connected to NCBI
Downloading taxdb.tar.gz... [OK]


## Create a BLAST database with taxonomy information

### Concatenate downloaded genomic sequences in one file

In [None]:
%%bash
# Concatenate all genomic sequence files into one
cat genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna \
  genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna \
  genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna \
  genomes/ncbi_dataset/data/GCA_017607565.1/GCA_017607565.1_ASM1760756v1_genomic.fna >genomes.fa

### Create taxonomy map for the genomic sequences

In [5]:
%%bash
# Write one "accession<TAB>taxid" line per FASTA header (lines starting with ">").
# Each assembly is tagged with the NCBI taxonomy ID of its species.
awk '/^>/ {print substr($1, 2) "\t103372"}' genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna >taxids.tsv
awk '/^>/ {print substr($1, 2) "\t230686"}' genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna >>taxids.tsv
awk '/^>/ {print substr($1, 2) "\t2715315"}' genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna >>taxids.tsv
awk '/^>/ {print substr($1, 2) "\t230685"}' genomes/ncbi_dataset/data/GCA_017607565.1/GCA_017607565.1_ASM1760756v1_genomic.fna >>taxids.tsv
# Preview the first lines of the map
head taxids.tsv

GL884603.1	103372
GL884604.1	103372
GL884605.1	103372
GL884606.1	103372
GL884607.1	103372
GL884608.1	103372
GL884609.1	103372
GL884610.1	103372
GL884611.1	103372
GL884612.1	103372
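The same mapping can be sketched in Python for readers less familiar with `awk`. The function below emits one `accession<TAB>taxid` line per FASTA header. The function name, the in-memory FASTA, and its description text and sequences are made up for illustration; only the accessions and the taxid come from the output above.

```python
import io


def taxid_map_lines(fasta_handle, taxid):
    """Yield 'accession<TAB>taxid' for every FASTA header line in the handle."""
    for line in fasta_handle:
        if line.startswith(">"):
            # The accession is the first whitespace-delimited token after '>'
            fields = line[1:].split()
            if fields:
                yield f"{fields[0]}\t{taxid}"


# Tiny in-memory FASTA; descriptions and sequences are placeholders
sample_fasta = io.StringIO(
    ">GL884603.1 placeholder description\nACGT\n"
    ">GL884604.1 placeholder description\nGGCC\n"
)

lines = list(taxid_map_lines(sample_fasta, 103372))
print("\n".join(lines))
```

To build a map for a real assembly, pass an open file handle for the FASTA file in place of `sample_fasta` and write the yielded lines to `taxids.tsv`.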


### Create a BLAST database

In [6]:
%%bash
makeblastdb -in genomes.fa -out genomesdb -dbtype nucl -parse_seqids -taxid_map taxids.tsv



Building a new DB, current time: 06/27/2022 10:40:56
New DB name:   /export/home/boratyng/eb-1532/genomesdb
New DB title:  genomes.fa
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /export/home/boratyng/eb-1532/genomesdb
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 23295 sequences in 12.1766 seconds.




## Prepare ElasticBLAST search

### ElasticBLAST results bucket

ElasticBLAST uses a cloud storage bucket to store the BLAST database and the search results. If you already have an S3 bucket, assign its URI to the `RESULTS` variable. Otherwise, running this cell without modifications will generate a bucket name with a random suffix.

Remember that S3 bucket names must be globally unique: you cannot use a name that is already taken by another bucket.

In [24]:
YOURNAME = str(uuid4())[:8]
RESULTS = f's3://elasticblast-{YOURNAME}'
print(f'Your results bucket: {RESULTS}')

Your results bucket: s3://elasticblast-c34e3f10
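Before creating a bucket, it can help to check that the generated name is well formed. The sketch below tests only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). It is not the complete rule set, and it cannot tell whether the name is already taken. The helper name `looks_like_valid_bucket_name` is hypothetical.

```python
import re
from uuid import uuid4

# Basic S3 naming rules only: 3-63 characters, lowercase letters, digits,
# dots and hyphens, starting and ending with a letter or digit.
BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if the name passes the basic format checks above."""
    return BUCKET_NAME_PATTERN.fullmatch(name) is not None


# Same naming scheme as the cell above: fixed prefix plus 8 random hex characters
suffix = str(uuid4())[:8]
bucket_name = f"elasticblast-{suffix}"
print(bucket_name, looks_like_valid_bucket_name(bucket_name))
```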


### Create results bucket

Skip this step if the bucket already exists.

In [25]:
!aws s3 mb {RESULTS}

make_bucket: elasticblast-c34e3f10


### Upload your BLAST database

ElasticBLAST searches BLAST databases stored in cloud storage. Upload your database to your cloud bucket.

In [26]:
!aws s3 cp . {RESULTS}/db --recursive --exclude "*" --include "genomesdb.*"

upload: ./genomesdb.njs to s3://elasticblast-c34e3f10/db/genomesdb.njs
upload: ./genomesdb.nin to s3://elasticblast-c34e3f10/db/genomesdb.nin
upload: ./genomesdb.nto to s3://elasticblast-c34e3f10/db/genomesdb.nto
upload: ./genomesdb.nog to s3://elasticblast-c34e3f10/db/genomesdb.nog
upload: ./genomesdb.ntf to s3://elasticblast-c34e3f10/db/genomesdb.ntf
upload: ./genomesdb.nos to s3://elasticblast-c34e3f10/db/genomesdb.nos
upload: ./genomesdb.not to s3://elasticblast-c34e3f10/db/genomesdb.not
upload: ./genomesdb.nhr to s3://elasticblast-c34e3f10/db/genomesdb.nhr
upload: ./genomesdb.ndb to s3://elasticblast-c34e3f10/db/genomesdb.ndb
upload: ./genomesdb.nsq to s3://elasticblast-c34e3f10/db/genomesdb.nsq


### ElasticBLAST configuration

ElasticBLAST uses a configuration file to specify the search settings, such as the query, database, search program, and BLAST options. For easier code management, we will assign the name of the ElasticBLAST configuration file to the variable `conf_file`.

In [27]:
conf_file = 'elastic-blast-config.ini'

The cell below writes the ElasticBLAST configuration file.

The configuration file instructs ElasticBLAST to do a BLASTN search of query sequences in file `gene/ncbi_dataset/data/gene.fna` against the database in your results bucket.

In [28]:
# These are the contents of ElasticBLAST config 
conf = f"""
[cloud-provider]
aws-region = us-east-1

[cluster]
num-nodes = 1

[blast]
program = blastn
db = {RESULTS}/db/genomesdb
queries = gene/ncbi_dataset/data/gene.fna
options = -evalue 1e-50 -outfmt 11 -max_hsps 1
"""

# Write the config to the file: elastic-blast-config.ini
with open(conf_file, 'w') as f:
    f.write(conf)
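Because the configuration is plain INI text, it can be read back with Python's `configparser` as a quick sanity check before submitting. This is a minimal sketch: the bucket URI is a placeholder, and the list of settings treated as required here is an assumption for illustration, not ElasticBLAST's own validation.

```python
import configparser

# Same layout as the configuration above; the bucket URI is a placeholder
sample_conf = """
[cloud-provider]
aws-region = us-east-1

[cluster]
num-nodes = 1

[blast]
program = blastn
db = s3://example-bucket/db/genomesdb
queries = gene/ncbi_dataset/data/gene.fna
options = -evalue 1e-50 -outfmt 11 -max_hsps 1
"""

parser = configparser.ConfigParser()
parser.read_string(sample_conf)

# Settings this sketch treats as required (an assumption for illustration)
required = {
    "cloud-provider": ["aws-region"],
    "blast": ["program", "db", "queries"],
}
missing = [
    f"{section}.{key}"
    for section, keys in required.items()
    for key in keys
    if not parser.has_option(section, key)
]

print("Missing settings:", missing)
print("Program:", parser["blast"]["program"])
print("Nodes:", parser["cluster"].getint("num-nodes"))
```

To check the file written by the notebook instead of the inline sample, `parser.read(conf_file)` can replace the `read_string` call.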

## Submit an ElasticBLAST search

We can now submit the ElasticBLAST search by running the cell below. The submission may take a few minutes.

In [29]:
!elastic-blast submit --cfg {conf_file} --results {RESULTS}

awslimitchecker 12.0.0 is AGPL-licensed free software; all users have a right to the full source code of this version. See <https://github.com/jantman/awslimitchecker>


## Check search status

The cell below checks the search status. ElasticBLAST splits query sequences into batches. The `elastic-blast status` command shows how many of these batches are pending, running, completed, or failed. When the whole search is done, you will see only the message "Your ElasticBLAST search succeeded ..." or "Your ElasticBLAST search failed ...".

In [30]:
!elastic-blast status --results {RESULTS}

SUBMITTING


## Wait until the search is done

Run the cell below to wait until the search is done. The cell will keep running until the ElasticBLAST search is complete.

In [31]:
!elastic-blast status --results {RESULTS} --wait

Your ElasticBLAST search succeeded, results can be found in s3://elasticblast-c34e3f10


## Download and uncompress results

When an ElasticBLAST search is done, search results are placed in your results bucket. Now we need to download and uncompress the results to analyze them.

In [32]:
!aws s3 cp {RESULTS} . --exclude "*" --include "*.out.gz" --recursive
!gzip -d -f batch_*.gz

download: s3://elasticblast-c34e3f10/batch_000-blastn-genomesdb.out.gz to ./batch_000-blastn-genomesdb.out.gz
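If `gzip` is not available on the command line, the decompression step can be sketched in Python with the standard library. The function name `gunzip_all` is hypothetical, and the demonstration runs on a throwaway file in a temporary directory rather than on real search results.

```python
import glob
import gzip
import os
import shutil
import tempfile


def gunzip_all(pattern):
    """Decompress every '.gz' file matching the pattern, next to the original."""
    outputs = []
    for gz_path in sorted(glob.glob(pattern)):
        out_path = gz_path[: -len(".gz")]  # drop the '.gz' suffix
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(gz_path)  # like `gzip -d`, remove the compressed file
        outputs.append(out_path)
    return outputs


# Demonstration on a placeholder file in a temporary directory
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "batch_demo.out.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("placeholder content\n")

decompressed = gunzip_all(os.path.join(workdir, "batch_*.gz"))
print(decompressed)
```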


### Convert results to the tabular format

In [19]:
%%bash
blast_formatter \
-archive batch_000-blastn-genomesdb.out \
-outfmt '6 sseqid sstart send evalue length staxid ssciname' > orco_acromyrmex_1e-50.tsv
head orco_acromyrmex_1e-50.tsv



gb|GL888262.1|	47719	39489	0.0	8231	103372	Acromyrmex echinatior
gb|JAANIC010002885.1|	374203	382429	0.0	8319	2715315	Acromyrmex charruanus
gb|JAANHZ010000736.1|	394277	399208	0.0	4942	230686	Acromyrmex insinuator
gb|JAANIB010005913.1|	42669	38207	0.0	4507	230685	Acromyrmex heyeri
gb|JAANIB010010813.1|	2927587	2927837	3.10e-64	256	230685	Acromyrmex heyeri
gb|JAANIC010005341.1|	252786	252532	4.01e-63	259	2715315	Acromyrmex charruanus
gb|JAANHZ010000232.1|	840079	839835	1.44e-62	249	230686	Acromyrmex insinuator
gb|GL888207.1|	1612253	1612494	1.44e-62	246	103372	Acromyrmex echinatior
gb|JAANIB010005055.1|	682287	682472	1.45e-57	190	230685	Acromyrmex heyeri
gb|JAANIC010001616.1|	1476837	1476653	1.88e-56	188	2715315	Acromyrmex charruanus
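The tabular output is easy to summarize in Python. The sketch below uses four rows copied from the output above, with the column order given to `-outfmt`. It counts hits per species, finds the longest alignment, and derives the subject strand from the coordinates: in BLAST tabular output for nucleotide subjects, `sstart` greater than `send` indicates a hit on the minus strand. Reading from an in-memory string instead of the `.tsv` file is an assumption made to keep the example self-contained.

```python
import csv
import io
from collections import Counter

# Column order matches the -outfmt string used above
columns = ["sseqid", "sstart", "send", "evalue", "length", "staxid", "ssciname"]

# Four rows copied from the output above
sample_hits = (
    "gb|GL888262.1|\t47719\t39489\t0.0\t8231\t103372\tAcromyrmex echinatior\n"
    "gb|JAANIC010002885.1|\t374203\t382429\t0.0\t8319\t2715315\tAcromyrmex charruanus\n"
    "gb|JAANIB010005913.1|\t42669\t38207\t0.0\t4507\t230685\tAcromyrmex heyeri\n"
    "gb|JAANIB010010813.1|\t2927587\t2927837\t3.10e-64\t256\t230685\tAcromyrmex heyeri\n"
)

hits = []
reader = csv.DictReader(io.StringIO(sample_hits), fieldnames=columns, delimiter="\t")
for row in reader:
    start, end = int(row["sstart"]), int(row["send"])
    # sstart > send means the alignment lies on the subject's minus strand
    row["strand"] = "minus" if start > end else "plus"
    row["length"] = int(row["length"])
    hits.append(row)

hits_per_species = Counter(hit["ssciname"] for hit in hits)
longest = max(hits, key=lambda hit: hit["length"])

print(dict(hits_per_species))
print("Longest alignment:", longest["sseqid"], longest["length"], longest["strand"])
```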


## Cleanup

### Delete cloud resources created by ElasticBLAST

If you did not enable the ElasticBLAST auto-shutdown feature, cloud resources (such as the AWS Batch queue and compute environment) have to be deleted manually with the command below:

In [23]:
!elastic-blast delete --results {RESULTS}

### Delete cloud bucket

If you do not need BLAST search results stored in the cloud, delete the cloud bucket so that you are not charged for it.

In [33]:
!aws s3 rb {RESULTS} --force

delete: s3://elasticblast-c34e3f10/batch_000-blastn-genomesdb.out.gz
delete: s3://elasticblast-c34e3f10/db/genomesdb.nto
delete: s3://elasticblast-c34e3f10/db/genomesdb.nin
delete: s3://elasticblast-c34e3f10/db/genomesdb.njs
delete: s3://elasticblast-c34e3f10/db/genomesdb.nog
delete: s3://elasticblast-c34e3f10/db/genomesdb.nos
delete: s3://elasticblast-c34e3f10/db/genomesdb.ntf
delete: s3://elasticblast-c34e3f10/metadata/elastic-blast-config.json
delete: s3://elasticblast-c34e3f10/metadata/job-ids.json
delete: s3://elasticblast-c34e3f10/db/genomesdb.not
delete: s3://elasticblast-c34e3f10/metadata/query_length.txt
delete: s3://elasticblast-c34e3f10/db/genomesdb.ndb
delete: s3://elasticblast-c34e3f10/db/genomesdb.nhr
delete: s3://elasticblast-c34e3f10/query_batches/batch_000.fa
delete: s3://elasticblast-c34e3f10/metadata/job-ids-v2.json
delete: s3://elasticblast-c34e3f10/db/genomesdb.nsq
remove_bucket: elasticblast-c34e3f10
