# NCBI Datasets - CSHL (11/02/2021)

### Important resources:
- Etherpad: https://etherpad.wikimedia.org/p/CSHL_Datasets_Workshop_2021
- Github: https://github.com/ncbi/datasets/tree/workshop-cshl-2021/training/cshl-2021
- NCBI datasets: https://www.ncbi.nlm.nih.gov/datasets/
- jq cheat sheet: https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md

## Case study: Elmo loves ants

Elmo is a graduate student at the Via Sesamum University. As part of his Ph.D. project, he studies Panamanian leaf cutter ants (genus *Acromyrmex*, family Formicidae) and how variation in the gene *orco* (**o**dorant **r**eceptor **co**receptor) affects the colonies of this genus.

(Here's a [link](https://www.sciencedirect.com/science/article/pii/S0092867417307729#app3) to a cool paper about this gene in the ant species *Ooceraea biroi*.)

<img src="./images/ants.png" alt="image"/>

Elmo will use `datasets` to help him gather the existing genomic resources from NCBI. He will:

- download all available genomes for the genus *Acromyrmex*
- download the *orco* gene from the *Acromyrmex* reference genome
- download the ortholog set for this gene for all ants (Formicidae)

He will also:
- create a custom BLAST database from the Panamanian leaf cutter ant genomes
- BLAST the *orco* gene against that database
- build a multiple sequence alignment of the BLAST results and the ortholog gene sequences
- build a phylogenetic tree using FastTree


### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/quickstarts/command-line-tools/) is a command-line tool that allows users to download data packages (data + metadata) or look at metadata summaries for genomes, RefSeq-annotated genes, curated ortholog sets, and SARS-CoV-2 virus sequences and proteins. The program follows a hierarchy that makes it easier for users to select exactly which options they would like to use. In addition to the program commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="./images/datasets_horizontal.drawio.png" alt="datasets" style="width: 600px;"/>

In addition to `datasets`, we will use `jq`, a command-line JSON processor, to look at the metadata. Almost all of our metadata reports are in JSON or JSON Lines format. We put together a [jq cheat sheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract information from those files.

## Tutorial - I: Accessing genomes

![workflow](./images/elmo_workflow.drawio.png)

First, let's figure out what kind of information NCBI has for ants (family Formicidae).

<img src="./images/genome_summary.drawio.png" style="width: 600px;"/>

In [1]:
%%bash
# Get metadata info
datasets summary genome taxon formicidae

{"assemblies": [{"assembly": {"annotation_metadata":{"file":[{"estimated_size":"3421616","type":"GENOME_GFF"},{"estimated_size":"129483045","type":"GENOME_GBFF"},{"estimated_size":"3444924","type":"PROT_FASTA"},{"estimated_size":"2684704","type":"GENOME_GTF"},{"estimated_size":"7862131","type":"CDS_FASTA"}],"name":"From INSDC submitter","release_date":"2021-03-29","source":"BGI","stats":{"gene_counts":{"protein_coding":8986,"total":14640}}},"assembly_accession":"GCA_017607545.1","assembly_category":"representative genome","assembly_level":"Scaffold","bioproject_lineages":[{"bioprojects":[{"accession":"PRJNA605929","title":"Project of the leaf-cutting ants"}]}],"biosample_accession":"SAMN14167745","blast_url":"https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch\u0026PROG_DEF=blastn\u0026BLAST_SPEC=GDH_GCA_017607545.1","chromosomes":[{"length":"296539234","name":"Un"}],"contig_n50":34925,"display_name":"ASM1760754v1","estimated_size":"241200730","gc_count":"99149685","org":{"a

In [2]:
%%bash
# Get metadata info and save to a file
datasets summary genome taxon formicidae > formicidae_summary.json

**Now let's take a look at the metadata using jq**

In [3]:
%%bash
datasets summary genome taxon formicidae | jq .

{
  "assemblies": [
    {
      "assembly": {
        "annotation_metadata": {
          "file": [
            {
              "estimated_size": "3421616",
              "type": "GENOME_GFF"
            },
            {
              "estimated_size": "129483045",
              "type": "GENOME_GBFF"
            },
            {
              "estimated_size": "3444924",
              "type": "PROT_FASTA"
            },
            {
              "estimated_size": "2684704",
              "type": "GENOME_GTF"
            },
            {
              "estimated_size": "7862131",
              "type": "CDS_FASTA"
            }
          ],
          "name": "From INSDC submitter",
          "release_date": "2021-03-29",
          "source": "BGI",
          "stats": {
            "gene_counts": {
              "protein_coding": 8986,
              "total": 14640
            }
          }
        },
        "assembly_accession": "GCA_017607545.1",
        "assembly_category": "representa

### A little bit more about json files
A JSON (JavaScript Object Notation) file stores data structures and objects. In a very simplified (and non-technical) way, a JSON file is a box that might contain other boxes, with more boxes inside. In `datasets summary genome`, our JSON "box" is organized like this:
<img src="./images/json8.png" alt="image"/>

But let's explore the "boxes" in stages, so we can understand how everything is organized and how we can use this knowledge to extract information from the summary metadata file. At the first level, we have this: 
```
{
 assemblies[
      assembly{},
      assembly{},
 ],
 total_count
}
```
<img src="./images/json1.png" />

If we want to look at the value in the field "total_count", here's the command we would use:

In [4]:
%%bash
datasets summary genome taxon herpestidae | jq '.total_count'

5


If we continue to expand each one of those assembly boxes, more levels of the hierarchy will be revealed. Let's expand each assembly and see which information we can find at that level.
<img src="./images/json2a.png" alt="image"/>

Here we can see that some of the assembly information, such as the assembly accession number, contig N50 or submission date, is not included inside any of the available "boxes" (annotation_metadata, chromosomes, bioproject_lineages, and org). Those fields describe characteristics that pertain to the entire assembly, not to any one of those boxes. What are the contig N50 values of those assemblies?

To retrieve that information, we name each box in turn, starting from the outermost one and ending at the field we're interested in. Each level is separated from the next by a period (.).

<img src="./images/json3.png" />

In [5]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly.contig_n50'

113567
180702
75409
148487
75409
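For readers who prefer Python, the same path can be walked with the standard `json` module. This is only a minimal sketch, not part of the workshop material: `summary_text` is a hand-trimmed stand-in for the real summary output, keeping just two assemblies and the fields discussed so far, so its `total_count` is 2 rather than 5.

```python
import json

# Hand-trimmed stand-in for `datasets summary genome taxon herpestidae`
# (assumption: only two assemblies and two fields are kept for brevity).
summary_text = """
{
  "assemblies": [
    {"assembly": {"assembly_accession": "GCA_004023845.1", "contig_n50": 113567}},
    {"assembly": {"assembly_accession": "GCA_004023785.1", "contig_n50": 180702}}
  ],
  "total_count": 2
}
"""

summary = json.loads(summary_text)

# jq '.total_count': one key at the outermost level
total_count = summary["total_count"]

# jq '.assemblies[].assembly.contig_n50'
# The [] in jq means "every element of the list", which is a loop in Python.
contig_n50_values = [
    entry["assembly"]["contig_n50"] for entry in summary["assemblies"]
]

print(total_count)
print(contig_n50_values)
```

Each period in the `jq` path corresponds to one pair of square brackets in Python, which makes it easy to translate a query in either direction.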


Now let's see what we have inside `annotation_metadata`, `bioproject_lineages`, `org` and `chromosomes`. 
<img src="./images/json8.png" alt="image"/>

Now let's see how we can retrieve the scientific names associated with those assemblies
<img src="./images/json4.png" />

In [6]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly.org.sci_name'

"Helogale parvula"
"Mungos mungo"
"Suricata suricatta"
"Suricata suricatta"
"Suricata suricatta"


As you can see, `jq` is very useful in retrieving information from the summary metadata *as long as* you know the path to find it. Let's try a few more complex examples.
<img src="./images/json5.png" />

First, let's retrieve information from three fields at the same time: scientific name (`sci_name`), assembly accession number (`assembly_accession`) and contig N50 (`contig_n50`).

In [7]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly | (.org.sci_name, .assembly_accession, .contig_n50)'

"Helogale parvula"
"GCA_004023845.1"
113567
"Mungos mungo"
"GCA_004023785.1"
180702
"Suricata suricatta"
"GCF_006229205.1"
75409
"Suricata suricatta"
"GCA_004023905.1"
148487
"Suricata suricatta"
"GCA_006229205.1"
75409


Since all three fields are inside the `.assemblies[].assembly`, we can call the first part of the path once and use a pipe (|) to call each specific field. 
Now let's try to make this a little easier to read. We can create new fields and assign values to them, like this:

In [8]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly 
| {species: .org.sci_name, accession: .assembly_accession, contigN50:.contig_n50}'

{
  "species": "Helogale parvula",
  "accession": "GCA_004023845.1",
  "contigN50": 113567
}
{
  "species": "Mungos mungo",
  "accession": "GCA_004023785.1",
  "contigN50": 180702
}
{
  "species": "Suricata suricatta",
  "accession": "GCF_006229205.1",
  "contigN50": 75409
}
{
  "species": "Suricata suricatta",
  "accession": "GCA_004023905.1",
  "contigN50": 148487
}
{
  "species": "Suricata suricatta",
  "accession": "GCA_006229205.1",
  "contigN50": 75409
}
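The relabelling step can also be sketched in Python. The snippet below is an illustration under assumptions: the two records are trimmed by hand to the fields used above, and `relabel` is a hypothetical helper name, not something provided by `datasets` or `jq`.

```python
# Trimmed records using values from the output above (assumption: the real
# report nests many more fields under "assembly").
assemblies = [
    {"assembly": {"assembly_accession": "GCA_004023845.1",
                  "contig_n50": 113567,
                  "org": {"sci_name": "Helogale parvula"}}},
    {"assembly": {"assembly_accession": "GCA_004023785.1",
                  "contig_n50": 180702,
                  "org": {"sci_name": "Mungos mungo"}}},
]


def relabel(assembly):
    """Mirror jq's {species: .org.sci_name, accession: ..., contigN50: ...}."""
    return {
        "species": assembly["org"]["sci_name"],
        "accession": assembly["assembly_accession"],
        "contigN50": assembly["contig_n50"],
    }


# `.assemblies[].assembly | {...}` becomes: loop, step into "assembly", relabel.
records = [relabel(entry["assembly"]) for entry in assemblies]
for record in records:
    print(record)
```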


Last one: let's look at a larger collection of genome assemblies (let's say, all Carnivora) and select only those assemblies with contig N50 larger than 15 Mb (15000000 bp). `datasets` offers many filters, but contig N50 is not one of them, so we filter with `jq` instead.

Here's what we want to see: assembly accession number, species and assembly level for those genomes with contig N50 above 15 Mb.

In [9]:
%%bash
datasets summary genome taxon carnivora | jq -r '.assemblies[].assembly 
| select(.contig_n50 > 15000000) 
| [.assembly_accession, .org.sci_name, .assembly_level] 
| @tsv'

GCA_905319855.2	Canis lupus	Chromosome
GCF_012295265.1	Canis lupus dingo	Chromosome
GCA_003254725.2	Canis lupus dingo	Chromosome
GCA_012295265.1	Canis lupus dingo	Chromosome
GCF_000002285.5	Canis lupus familiaris	Chromosome
GCF_013276365.1	Canis lupus familiaris	Chromosome
GCA_000002285.4	Canis lupus familiaris	Chromosome
GCA_008641055.3	Canis lupus familiaris	Chromosome
GCA_013276365.2	Canis lupus familiaris	Chromosome
GCF_000181335.3	Felis catus	Chromosome
GCA_000181335.5	Felis catus	Chromosome
GCA_000181335.4	Felis catus	Chromosome
GCA_013340865.1	Felis catus	Contig
GCA_016509815.2	Felis catus	Chromosome
GCA_018350175.1	Felis catus	Chromosome
GCA_019924945.1	Felis chaus	Chromosome
GCA_018350155.1	Leopardus geoffroyi	Chromosome
GCA_902655055.2	Lutra lutra	Chromosome
GCF_009829155.1	Mustela erminea	Chromosome
GCA_009829155.1	Mustela erminea	Chromosome
GCA_009859125.1	Mustela putorius furo	Contig
GCA_009859175.1	Mustela putorius furo	Contig
GCA_009859215.1	Mustela putorius furo	Contig
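The same select-then-tabulate pattern can be sketched in Python. To avoid inventing values, this sketch reuses the five Herpestidae assemblies listed earlier and prints the contig N50 as the third column instead of the assembly level; the 100 kb threshold is chosen only so that the small sample has both hits and misses, and the helper names are hypothetical.

```python
# Accession, species and contig N50 copied from the Herpestidae output earlier.
assemblies = [
    {"assembly_accession": "GCA_004023845.1", "org": {"sci_name": "Helogale parvula"}, "contig_n50": 113567},
    {"assembly_accession": "GCA_004023785.1", "org": {"sci_name": "Mungos mungo"}, "contig_n50": 180702},
    {"assembly_accession": "GCF_006229205.1", "org": {"sci_name": "Suricata suricatta"}, "contig_n50": 75409},
    {"assembly_accession": "GCA_004023905.1", "org": {"sci_name": "Suricata suricatta"}, "contig_n50": 148487},
    {"assembly_accession": "GCA_006229205.1", "org": {"sci_name": "Suricata suricatta"}, "contig_n50": 75409},
]


def select_by_contig_n50(records, min_n50):
    """Like jq's select(.contig_n50 > N): keep records strictly above min_n50."""
    return [record for record in records if record["contig_n50"] > min_n50]


def to_tsv_row(record):
    """Like jq's [ ... ] | @tsv: join the chosen fields with tabs."""
    fields = [
        record["assembly_accession"],
        record["org"]["sci_name"],
        str(record["contig_n50"]),
    ]
    return "\t".join(fields)


rows = [to_tsv_row(record) for record in select_by_contig_n50(assemblies, 100_000)]
print("\n".join(rows))
```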


**RESOURCE:**  
We included a list of all fields in the genome summary in our [jq cheatsheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract the information you need. We will now show you how to do that.

### Let's continue to explore the available genomes for the family Formicidae

<img src="./images/genome_summary.drawio.png" alt="summary" style="width: 600px"/>

For this part, we will use two UNIX commands: `sort` and `uniq`. 

- `sort` can be used to sort text files line by line, numerically and alphabetically.   
- `uniq` filters out repeated lines in a file. However, `uniq` only detects repeated lines that are adjacent to each other, which is why we sort the input first. The flag `-c` (or `--count`) tells `uniq` to collapse the repeated lines and prefix each remaining line with the number of times it appeared.

So, we will use `jq` to extract the information we need, sort the result and count the number of unique entries.
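To see why the sorting step matters, here is a small Python sketch that imitates `uniq -c` with `itertools.groupby`, which also only merges adjacent repeats. The short list of names is a hand-picked sample and `uniq_c` is a hypothetical helper written for this illustration.

```python
from itertools import groupby

# A hand-picked, deliberately unsorted sample of species names.
names = [
    "Atta cephalotes",
    "Acromyrmex echinatior",
    "Atta cephalotes",
    "Acromyrmex echinatior",
    "Atta texana",
]


def uniq_c(lines):
    """Imitate `uniq -c`: count runs of identical *adjacent* lines."""
    return [(len(list(group)), value) for value, group in groupby(lines)]


# Without sorting, the repeats are not adjacent, so nothing is merged.
unsorted_counts = uniq_c(names)

# With sorting first, this behaves like `sort | uniq -c`.
sorted_counts = uniq_c(sorted(names))

for count, name in sorted_counts:
    print(f"{count:7d} {name}")
```

In everyday Python, `collections.Counter(names)` gives the same totals without any sorting; the `groupby` version is used here only because it shares the adjacency limitation of `uniq`.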

In [10]:
%%bash
# For which species does NCBI have genomes in its database? How many per species?

datasets summary genome taxon formicidae | jq '.assemblies[].assembly.org.sci_name' | sort | uniq -c

      1 "Acromyrmex charruanus"
      2 "Acromyrmex echinatior"
      1 "Acromyrmex heyeri"
      1 "Acromyrmex insinuator"
      1 "Aphaenogaster ashmeadi"
      1 "Aphaenogaster floridana"
      1 "Aphaenogaster fulva"
      1 "Aphaenogaster miamiana"
      1 "Aphaenogaster picea"
      2 "Aphaenogaster rudis"
      2 "Atta cephalotes"
      2 "Atta colombica"
      1 "Atta texana"
      4 "Camponotus floridanus"
      1 "Cardiocondyla obscurior"
      1 "Cataglyphis hispanica"
      1 "Cataglyphis niger"
      1 "Crematogaster levior"
      2 "Cyphomyrmex costatus"
      2 "Dinoponera quadriceps"
      1 "Eciton burchellii"
      1 "Formica aquilonia x Formica polyctena"
      2 "Formica exsecta"
      1 "Formica selysi"
      4 "Harpegnathos saltator"
      1 "Lasius niger"
      2 "Linepithema humile"
      7 "Monomorium pharaonis"
      2 "Nylanderia fulva"
      2 "Odontomachus brunneus"
      4 "Ooceraea biroi"
      2 "Pogonomyrmex barbatus"
      1 "Pogonomyrmex californicus"

In [11]:
%%bash
# What is the assembly level (contig, scaffold, chromosome, complete) breakdown?

datasets summary genome taxon formicidae | jq '.assemblies[].assembly.assembly_level' | sort | uniq -c

     12 "Chromosome"
      1 "Contig"
     84 "Scaffold"


### How to get help when using the command line

Because `datasets` is organized as a hierarchy of commands, we can ask for help at any level and get help that is specific to that level. For example, if we type `datasets --help`, we will see the first level of commands available.


In [12]:
%%bash
datasets --help

datasets is a command-line tool that is used to query and download biological sequence data
across all domains of life from NCBI databases.

Refer to NCBI's [command line quickstart](https://www.ncbi.nlm.nih.gov/datasets/docs/quickstarts/command-line-tools/) documentation for information about getting started with the command-line tools.

Usage
  datasets [command]

Data Retrieval Commands
  summary              print a summary of a gene or genome dataset
  download             download a gene, genome or coronavirus dataset as a zip file
  rehydrate            rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands
  completion           generate autocompletion scripts
  version              print the version of this client and exit
  help                 Help about any command

Flags
      --api-key string   NCBI Datasets API Key
  -h, --help             help for datasets
      --no-progressbar   hide progress bar

Use datasets help <command> for detailed help about a comma

Notice the difference from when we type `datasets summary genome taxon formicidae --help`  


In [13]:
%%bash
datasets summary genome taxon formicidae --help


Print a summary of a genome dataset by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank). The summary is returned in JSON format.

Refer to NCBI's [command line quickstart](https://www.ncbi.nlm.nih.gov/datasets/docs/quickstarts/command-line-tools/) documentation for information about getting started with the command-line tools.

Usage
  datasets summary genome taxon [flags]

Examples
  datasets summary genome taxon human
  datasets summary genome taxon "mus musculus"
  datasets summary genome taxon 10116

Flags
  -h, --help              help for taxon
      --tax-exact-match   exclude sub-species when a species-level taxon is specified


Global Flags
  -a, --annotated                only include genomes with annotation
      --api-key string           NCBI Datasets API Key
      --as-json-lines            Stream results as newline delimited JSON-Lines
      --assembly-level string    restrict assemblies to a comma-separated list of one or more of: chromosome, complet

### Exercises

Now we will practice what we learned about `datasets`. Take a look at the questions below and feel free to ask questions. Useful resources for this exercise are the `--help` from the command line and the [jq cheatsheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md).


In [None]:
%%bash
# How many reference genomes are there in the family Formicidae? (hint: --reference)



In [None]:
%%bash
# How many reference genomes are annotated? (hint: --annotated)



In [None]:
%%bash
# How many genomes have NCBI (RefSeq) annotations? (hint: --assembly-source)



### Bonus questions:

In [None]:
%%bash
## Take a look at the jq cheat sheet
## (https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md)
## and try to build a jq query for the metadata



In [None]:
%%bash
# Now look at the summary metadata for your organism of interest 
# (if you don't have a favorite, go with red panda, Ailurus fulgens, taxid: 9649)



In [None]:
%%bash
# How many genomes?



In [None]:
%%bash
# Assembly level breakdown



In [None]:
%%bash
# How many have contig N50 above 5Mb?



### Back to the main room


### What is the difference/relationship between GenBank, RefSeq and Reference assemblies?

<img src="./images/gca_gcf.png" alt="ref" />

### Data package

We explored the `datasets summary` option, in which we had a chance to look at the summary metadata ***without*** downloading any files. In the next steps, we will look at the data packages, which contain the actual data files.
<img src="./images/genome_data_package.png" alt="data_package" />

In [14]:
%%bash
# Download all available GenBank assemblies for the genus Acromyrmex and save as genomes.zip
datasets download genome taxon acromyrmex --assembly-source genbank --filename genomes.zip --no-progressbar

In [15]:
%%bash
# Unzip genomes.zip to the folder genomes
unzip -o genomes.zip -d genomes

Archive:  genomes.zip
  inflating: genomes/README.md       
  inflating: genomes/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/cds_from_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/genomic.gff  
  inflating: genomes/ncbi_dataset/data/GCA_000204515.1/protein.faa  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/cds_from_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/genomic.gff  
  inflating: genomes/ncbi_dataset/data/GCA_017607455.1/protein.faa  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/cds_from_genomic.fna  
  inflating: genomes/ncbi_dataset/data/GCA_017607545.1/genomic.gff  
  inflating: genomes/n

In [16]:
%%bash
# Explore the structure of the folder genomes with the command tree
tree -C genomes/

genomes/
├── ncbi_dataset
│   └── data
│       ├── assembly_data_report.jsonl
│       ├── dataset_catalog.json
│       ├── GCA_000204515.1
│       │   ├── cds_from_genomic.fna
│       │   ├── genomic.gff
│       │   ├── protein.faa
│       │   ├── sequence_report.jsonl
│       │   └── unplaced.scaf.fna
│       ├── GCA_017607455.1
│       │   ├── cds_from_genomic.fna
│       │   ├── GCA_017607455.1_ASM1760745v1_genomic.fna
│       │   ├── genomic.gff
│       │   ├── protein.faa
│       │   └── sequence_report.jsonl
│       ├── GCA_017607545.1
│       │   ├── cds_from_genomic.fna
│       │   ├── GCA_017607545.1_ASM1760754v1_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       └── GCA_017607565.1
│           ├── cds_from_genomic.fna
│           ├── GCA_017607565.1_ASM1760756v1_genomic.fna
│           ├── genomic.gff
│           └── protein.faa
└── README.md

6 directories, 21 
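The layout of a data package can also be inspected from Python without unzipping it. The sketch below is an assumption-laden illustration: it builds a tiny stand-in archive in memory, using a few paths copied from the listing above with empty file contents, so that it runs without a download. `files_by_accession` is a hypothetical helper; with a real download you could pass `"genomes.zip"` to it instead of the in-memory buffer.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Paths copied from the unzip listing above; contents are empty placeholders.
member_paths = [
    "README.md",
    "ncbi_dataset/data/assembly_data_report.jsonl",
    "ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna",
    "ncbi_dataset/data/GCA_000204515.1/protein.faa",
    "ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna",
    "ncbi_dataset/data/GCA_017607455.1/protein.faa",
]

# Build a small stand-in archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for path in member_paths:
        archive.writestr(path, "")


def files_by_accession(zip_file):
    """Group archive members by the accession folder under ncbi_dataset/data/."""
    grouped = defaultdict(list)
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            parts = PurePosixPath(name).parts
            # Expect ncbi_dataset/data/<accession>/<file>
            if len(parts) == 4 and parts[:2] == ("ncbi_dataset", "data"):
                grouped[parts[2]].append(parts[3])
    return dict(grouped)


buffer.seek(0)
grouped = files_by_accession(buffer)
for accession, files in grouped.items():
    print(accession, files)
```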

### Let's recap our goals

We used `datasets` to download all the GenBank assemblies for the genus *Acromyrmex*. The next step is to download the gene *orco* (odorant receptor coreceptor) for the same genus. But first, let's learn more about how genes are organized at NCBI.

<img src="./images/elmo_done1.png" alt="done1" style="width: 500px;" />

## Tutorial - II: Accessing genes
### GENES

Whether you choose `datasets download` or `datasets summary`, there are three options for retrieving gene information:
- accession
- gene-id
- symbol

<img src="./images/genes_op2.png" style="width: 800px;"/>


When choosing any of those three options, you will retrieve the gene information for the **reference** taxon. Like this:

`datasets download gene accession XR_002738142.1`  
`datasets download gene gene-id 101081937`  
`datasets download gene symbol BRCA1 --taxon cat`  

All three commands will download the same gene from the cat (<i>Felis catus</i>) <u>reference genome</u>. 

#### accession
Unique identifier. Accessions include RefSeq RNA and protein sequence accessions. Since an accession is unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).

#### gene-id
Also a unique identifier. Every annotated RefSeq genome has its own unique set of identifiers. For example, the gene-id for BRCA1 in human is 672, while in cat it is 101081937.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. If using the symbol option, you should specify the species. The default option is human.

**Remember**: both `summary` and `download` will return results for the **reference assembly** of a <u>single species</u>. If you want to download all the orthologs of a given gene, you should use the option `ortholog`. We'll talk more about it later. For reference, here's the JSON organization of the gene summary metadata

<img src="./images/gene_json.drawio.png" />

Now let's take a look at a gene example:

In [17]:
%%bash
#Example: IFNG in human
datasets summary gene symbol ifng | jq -C .

{
  "genes": [
    {
      "gene": {
        "annotations": [
          {
            "assemblies_in_scope": [
              {
                "accession": "GCF_000001405.39",
                "name": "GRCh38.p13"
              }
            ],
            "release_date": "2021-05-14",
            "release_name": "NCBI Homo sapiens Updated Annotation Release 109.20210514"
          }
        ],
        "chromosomes": [
          "12"
        ],
        "common_name": "human",

In [18]:
%%bash
# how datasets deals with synonyms
datasets summary gene symbol IFG | jq -C -r '.genes[].gene | {species: .taxname, symbol: .symbol, synonyms:.synonyms}'

{
  "species": "Homo sapiens",
  "symbol": "IFNG",
  "synonyms": [
    "IFG",
    "IFI",
    "IMD69"
  ]
}


In [19]:
%%bash
#Example: IFNG in cat
datasets summary gene symbol ifng --taxon "felis catus"


{"genes":[{"gene":{"annotations":[{"assemblies_in_scope":[{"accession":"GCF_000181335.3","name":"Felis_catus_9.0"}],"release_date":"2017-12-06","release_name":"NCBI Felis catus Annotation Release 104"}],"chromosomes":["B4"],"common_name":"domestic cat","description":"interferon gamma","ensembl_gene_ids":["ENSFCAG00000009014"],"gene_id":"493965","genomic_ranges":[{"accession_version":"NC_018729.3","range":[{"begin":"95329479","end":"95334070","orientation":"minus"}]}],"nomenclature_authority":{"authority":"VGNC","identifier":"VGNC:67703"},"orientation":"minus","swiss_prot_accessions":["P46402"],"symbol":"IFNG","tax_id":"9685","taxname":"Felis catus","transcripts":[{"accession_version":"NM_001009873.1","cds":{"accession_version":"NM_001009873.1","range":[{"begin":"40","end":"543"}]},"ensembl_transcript":"ENSFCAT00000009016.4","exons":{"accession_version":"NC_018729.3","range":[{"begin":"95333918","end":"95334070","order":1},{"begin":"95332650","end":"95332718","order":2},{"begin":"953323

### Back to ants
We will download the gene *orco* for the species *Acromyrmex echinatior*. We will use the gene-id 105147775 instead of the symbol.
The reason is that, even when a known gene is annotated in a species, an informative gene symbol has sometimes not been assigned.

In [20]:
%%bash
# Using gene-id to retrieve gene information
datasets summary gene gene-id 105147775 | jq -C '.genes[].gene 
| {gene_description: .description, gene_id: .gene_id, symbol: .symbol, species: .taxname}'

{
  "gene_description": "odorant receptor coreceptor",
  "gene_id": "105147775",
  "symbol": "LOC105147775",
  "species": "Acromyrmex echinatior"
}


In [21]:
%%bash
# if we try to retrieve metadata information for this gene using the symbol orco, what happens?
datasets summary gene symbol orco --taxon "acromyrmex echinatior"




In [22]:
%%bash
# Download the gene data package for the gene-id 105147775 (*orco* in Acromyrmex echinatior)
datasets download gene gene-id 105147775 --filename gene.zip --no-progressbar


In [23]:
%%bash
#Unzip the file
unzip -o gene.zip -d gene

Archive:  gene.zip
  inflating: gene/README.md          
  inflating: gene/ncbi_dataset/data/gene.fna  
  inflating: gene/ncbi_dataset/data/rna.fna  
  inflating: gene/ncbi_dataset/data/protein.faa  
  inflating: gene/ncbi_dataset/data/data_report.jsonl  
  inflating: gene/ncbi_dataset/data/data_table.tsv  
  inflating: gene/ncbi_dataset/data/dataset_catalog.json  


In [24]:
%%bash
#Explore the data package structure using tree
tree gene

gene
├── ncbi_dataset
│   └── data
│       ├── data_report.jsonl
│       ├── dataset_catalog.json
│       ├── data_table.tsv
│       ├── gene.fna
│       ├── protein.faa
│       └── rna.fna
└── README.md

2 directories, 7 files


Now we are going to take advantage of the fact that we are using a Jupyter Notebook and use the package `pandas` to look at the gene data table

In [25]:
import pandas as pd                                                        #load pandas to this notebook
gene_orco = pd.read_csv('gene/ncbi_dataset/data/data_table.tsv', sep='\t') #use pandas to import the data_table.tsv
gene_orco                                                                  #visualize the data table as the object gene_orco

Unnamed: 0,gene_id,gene_symbol,description,scientific_name,common_name,tax_id,genomic_range,orientation,location,gene_type,transcript_accession,transcript_name,transcript_length,transcript_cds_coords,protein_accession,isoform_name,protein_length,protein_name
0,105147775,LOC105147775,odorant receptor coreceptor,Acromyrmex echinatior,Panamanian leafcutter ant,103372,NW_011627180.1:39489-47719,-,chr Un,PROTEIN_CODING,XM_011059026.1,,2621,XM_011059026.1:264-1712,XP_011057328.1,,482,odorant receptor coreceptor
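The coordinate columns in this table pack several values into one string, so a little parsing helps when answering questions like "how long is the gene?". The sketch below uses only the standard library and a hand-trimmed copy of the row above (four of the columns). It assumes the coordinates are 1-based and inclusive; that assumption is supported by the fact that the CDS length then matches the reported protein length once the stop codon is subtracted. `parse_range` is a hypothetical helper.

```python
import csv
import io

# Four columns trimmed from the data table above (the real file is
# tab-separated and has many more columns).
table_text = (
    "gene_id\tgenomic_range\ttranscript_cds_coords\tprotein_length\n"
    "105147775\tNW_011627180.1:39489-47719\tXM_011059026.1:264-1712\t482\n"
)


def parse_range(text):
    """Split 'ACCESSION:START-END' into (accession, start, end)."""
    accession, _, span = text.rpartition(":")
    start, end = (int(value) for value in span.split("-"))
    return accession, start, end


row = next(csv.DictReader(io.StringIO(table_text), delimiter="\t"))
_, gene_start, gene_end = parse_range(row["genomic_range"])
_, cds_start, cds_end = parse_range(row["transcript_cds_coords"])

# Assumption: 1-based, inclusive coordinates, hence the "+ 1".
gene_length = gene_end - gene_start + 1
cds_length = cds_end - cds_start + 1

# The CDS includes the stop codon, which does not code for an amino acid.
codons_without_stop = cds_length // 3 - 1

print(gene_length, cds_length, codons_without_stop)
```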


### Exercises

1. Look for the summary data for a gene of interest (check the [etherpad](https://etherpad.wikimedia.org/p/CSHL_Datasets_Workshop_2021) for suggestions)
2. What is the gene location?
3. What is the gene range?
4. Now, download a list of gene symbols using the file genes.txt (provided). Save it as gene_list.zip
5. Unzip gene_list.zip and explore the folder structure
6. How many fasta files?

In [None]:
%%bash
# Summary data



In [None]:
%%bash
# Gene location



In [None]:
%%bash
# Gene range



In [None]:
%%bash
# Download a list of genes and save the data package as gene_list.zip (--filename gene_list.zip)


In [None]:
%%bash
# Explore the folder structure



In [None]:
%%bash
# How many genes were downloaded?



In [None]:
%%bash
# How many fasta files in the data package?



## Tutorial - III: Accessing orthologs

### Orthologs

The options to retrieve ortholog sets are the same as those for genes. We'll go over the differences when using each option:

- accession
- gene-id
- symbol

<img src="./images/ortholog.png" style="width: 800px;" />

When choosing any of those three options, you will download the **full ortholog set** to which they belong, unless you apply additional filtering (covered below). Like this:

`datasets download ortholog accession XR_002738142.1`  
`datasets download ortholog gene-id 101081937`  
`datasets download ortholog symbol BRCA1 --taxon cat`  

All three commands will download the **same** ortholog set. 

---

#### <font color='blue'>Wait, but what is an ortholog set?</font>

>An ortholog set, or ortholog gene group, is a group of sequences that have been identified by the NCBI genome annotation team as homologous genes related to each other by speciation events. They are identified by a combination of protein similarity and local synteny information.
Currently, NCBI has ortholog sets calculated for vertebrates and some insects.


#### accession
Unique identifier. Accessions include RefSeq RNA and protein sequence accessions. Since an accession is unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).

#### gene-id
Also a unique identifier. Every annotated RefSeq genome has its own unique set of identifiers. For example, the gene-id for BRCA1 in human is 672, while in cat it is 101081937. You can use either one (672 or 101081937) to get the same vertebrate BRCA1 ortholog set.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. For example, the P53 ortholog set in vertebrates is different from the insect set. If using the symbol option, you should specify the taxonomic group. The default option is human. Note that if you want ortholog sets from multiple vertebrate species, you might end up downloading the same ortholog set multiple times. Like this:

`datasets download ortholog symbol brca1 --taxon cat`  
`datasets download ortholog symbol brca1 --taxon chicken`  
`datasets download ortholog symbol brca1 --taxon "chelonia mydas"`  

If that's the case, how do you filter the ortholog set to include *only* your taxonomic group of interest?

### Applying a taxonomic filter to the ortholog set

For the orthologs, `datasets` provides the flag `--taxon-filter`, which allows the user to restrict the summary or download to one or multiple taxonomic groups. `--taxon` and `--taxon-filter` have different effects on the data package/summary output. A few examples:

- `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`  
Prints a JSON metadata summary of the gene brca1 for the domestic cat.
We did not specify a `--taxon` because the default is human, and Felidae and human are part of the same brca1 ortholog set.   

  

- `datasets summary ortholog symbol brca1 --taxon "felis catus"`  
Even though this option looks almost the same as the one above, the result is *very different*. Here, we're asking `datasets` to find the ortholog set to which the gene brca1 in the domestic cat belongs, and `datasets` will return the <u>entire</u> ortholog set, not only the sequences for the domestic cat.


- `datasets summary ortholog symbol brca1 --taxon "felis catus" --taxon-filter "felis catus"`  
gives you the same result as `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`
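One way to keep the two flags apart is to think of `--taxon` as choosing *which* ortholog set to look up and `--taxon-filter` as choosing *which members* of that set to keep. The toy model below only illustrates that idea. It is not how `datasets` is implemented, and the data structures, lineages and function name are all hypothetical.

```python
# Toy data (hypothetical): a symbol looked up in a taxon points to one set.
ORTHOLOG_SETS = {
    "vertebrate_brca1": [
        {"species": "Homo sapiens", "lineage": ("Mammalia", "Primates")},
        {"species": "Felis catus", "lineage": ("Mammalia", "Carnivora", "Felidae")},
        {"species": "Gallus gallus", "lineage": ("Aves",)},
    ],
}
SET_FOR = {
    ("brca1", "Homo sapiens"): "vertebrate_brca1",
    ("brca1", "Felis catus"): "vertebrate_brca1",
}


def ortholog_query(symbol, taxon="Homo sapiens", taxon_filter=None):
    """Toy model: `taxon` picks the set, `taxon_filter` restricts its members."""
    members = ORTHOLOG_SETS[SET_FOR[(symbol, taxon)]]
    if taxon_filter is None:
        return [member["species"] for member in members]
    return [
        member["species"]
        for member in members
        if taxon_filter == member["species"] or taxon_filter in member["lineage"]
    ]


# --taxon-filter only: default taxon (human) finds the set, the filter keeps the cat
print(ortholog_query("brca1", taxon_filter="Felis catus"))
# --taxon only: the cat gene finds the set, and the whole set is returned
print(ortholog_query("brca1", taxon="Felis catus"))
# both flags: same result as the filter alone
print(ortholog_query("brca1", taxon="Felis catus", taxon_filter="Felis catus"))
```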


The summary metadata for orthologs is presented in JSON Lines format, which means that each gene entry is on its own line. Here's the diagram to help you create queries.

<img src="./images/ortholog_jsonl.drawio.png" />
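Because every line of a JSON Lines file is a complete JSON document, it can be parsed one line at a time. The Python sketch below shows that pattern on three made-up, heavily trimmed records. The gene IDs and species names are real values from this tutorial, but the flat record layout is an assumption chosen for brevity and the real report is nested as shown in the diagram.

```python
import json

# Three hypothetical, flattened records: one JSON object per line.
jsonl_text = (
    '{"gene_id": "105147775", "taxname": "Acromyrmex echinatior"}\n'
    '{"gene_id": "105183395", "taxname": "Harpegnathos saltator"}\n'
    '{"gene_id": "105199036", "taxname": "Solenopsis invicta"}\n'
)


def read_json_lines(text):
    """Parse JSON Lines text: decode each non-empty line independently."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


genes = read_json_lines(jsonl_text)
for gene in genes:
    print(gene["gene_id"], gene["taxname"], sep="\t")
```

This is also why `jq` queries on JSON Lines output do not need a leading `[]` step: `jq` applies the filter to each line's object in turn.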

#### We are going to do the following steps:
- download the ortholog data package and save it with the name ortholog.zip
- unzip it to the folder ortholog
- look at the files

Helpful info:

- gene symbol: orco
- gene-id in *Drosophila melanogaster*: 40650
- gene-id in *Acromyrmex echinatior*: 105147775
- target taxon: Formicidae

In [26]:
%%bash
# download the orco ortholog set for ants (Formicidae)
datasets download ortholog gene-id 40650 --taxon-filter formicidae --filename ortholog.zip --no-progressbar


Found 22 genes in set


In [27]:
%%bash
# unzip it to the folder ortholog
unzip -o ortholog.zip -d ortholog


Archive:  ortholog.zip
  inflating: ortholog/README.md      
  inflating: ortholog/ncbi_dataset/data/gene.fna  
  inflating: ortholog/ncbi_dataset/data/rna.fna  
  inflating: ortholog/ncbi_dataset/data/protein.faa  
  inflating: ortholog/ncbi_dataset/data/data_report.jsonl  
  inflating: ortholog/ncbi_dataset/data/data_table.tsv  
  inflating: ortholog/ncbi_dataset/data/dataset_catalog.json  


In [28]:
%%bash
#Explore the folder structure
tree ortholog/


ortholog/
├── ncbi_dataset
│   └── data
│       ├── data_report.jsonl
│       ├── dataset_catalog.json
│       ├── data_table.tsv
│       ├── gene.fna
│       ├── protein.faa
│       └── rna.fna
└── README.md

2 directories, 7 files


In [29]:
# Create an object called ortho_table using pandas
ortho_table = pd.read_csv("ortholog/ncbi_dataset/data/data_table.tsv", sep='\t')
ortho_table

Unnamed: 0,gene_id,gene_symbol,description,scientific_name,common_name,tax_id,genomic_range,orientation,location,gene_type,transcript_accession,transcript_name,transcript_length,transcript_cds_coords,protein_accession,isoform_name,protein_length,protein_name
0,105147775,LOC105147775,odorant receptor coreceptor,Acromyrmex echinatior,Panamanian leafcutter ant,103372,NW_011627180.1:39489-47719,-,chr Un,PROTEIN_CODING,XM_011059026.1,,2621,XM_011059026.1:264-1712,XP_011057328.1,,482,odorant receptor coreceptor
1,105183395,LOC105183395,odorant receptor coreceptor,Harpegnathos saltator,Jerdon's jumping ant,610380,NW_020230404.1:97134-107285,+,chr Un,PROTEIN_CODING,XM_011141465.3,,4083,XM_011141465.3:381-1820,XP_011139767.1,,479,odorant receptor coreceptor
2,105199036,LOC105199036,odorant receptor coreceptor,Solenopsis invicta,red fire ant,13686,NC_052665.1:4823610-4833998,+,chr 2,PROTEIN_CODING,XM_011165941.3,,5506,XM_011165941.3:203-1648,XP_011164243.1,,481,odorant receptor coreceptor
3,105249684,LOC105249684,odorant receptor coreceptor,Camponotus floridanus,Florida carpenter ant,104421,NW_020229214.1:9396847-9403345,+,chr Un,PROTEIN_CODING,XM_011255339.3,,1933,XM_011255339.3:281-1726,XP_011253641.1,,481,odorant receptor coreceptor
4,105284785,LOC105284785,odorant receptor coreceptor,Ooceraea biroi,clonal raider ant,2015173,NC_039506.1:10910490-10919026,+,chr 1,PROTEIN_CODING,XM_011348552.2,,4114,XM_011348552.2:184-1620,XP_011346854.1,,478,odorant receptor coreceptor
5,105424270,LOC105424270,odorant receptor coreceptor,Pogonomyrmex barbatus,red harvester ant,144034,NW_011933557.1:711662-718961,+,chr Un,PROTEIN_CODING,XM_011634408.2,,2642,XM_011634408.2:308-1747,XP_011632710.1,,479,odorant receptor coreceptor
6,105457428,LOC105457428,odorant receptor coreceptor,Wasmannia auropunctata,little fire ant,64793,NW_012027674.1:53097-61891,+,chr Un,PROTEIN_CODING,XM_011702081.1,,2852,XM_011702081.1:277-1722,XP_011700383.1,,481,odorant receptor coreceptor
7,105561667,LOC105561667,odorant receptor coreceptor,Vollenhovia emeryi,,411798,NW_011967163.1:316278-324489,-,chr Un,PROTEIN_CODING,XM_012011854.1,,2657,XM_012011854.1:444-1880,XP_011867244.1,,478,odorant receptor coreceptor
8,105625195,LOC105625195,odorant receptor coreceptor,Atta cephalotes,,12957,NW_012130067.1:2024690-2031375,-,chr Un,PROTEIN_CODING,XM_012206539.1,,1449,XM_012206539.1:1-1449,XP_012061929.1,,482,odorant receptor coreceptor
9,105673490,LOC105673490,odorant receptor coreceptor,Linepithema humile,Argentine ant,83485,NW_012160723.1:535817-538942,-,chr Un,PROTEIN_CODING,XM_012369146.1,,1285,XM_012369146.1:251-1285,XP_012224569.1,,344,LOW QUALITY PROTEIN: odorant receptor coreceptor
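
Once the table is loaded, it is easy to pull out just the columns of interest. The sketch below uses only the Python standard library on three rows and four columns copied from the table above; it flags entries whose protein name carries the "LOW QUALITY PROTEIN" prefix and collects protein lengths per species. In the notebook the same selection could be done on `ortho_table` with pandas.

```python
import csv
import io

# Three rows and four columns copied from the table above (tab-separated).
sample_tsv = (
    "gene_id\tscientific_name\tprotein_length\tprotein_name\n"
    "105147775\tAcromyrmex echinatior\t482\todorant receptor coreceptor\n"
    "105284785\tOoceraea biroi\t478\todorant receptor coreceptor\n"
    "105673490\tLinepithema humile\t344\t"
    "LOW QUALITY PROTEIN: odorant receptor coreceptor\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Species whose protein is flagged as low quality in its name
low_quality = [
    row["scientific_name"]
    for row in rows
    if row["protein_name"].startswith("LOW QUALITY PROTEIN")
]

# Protein length per species, converted to integers
lengths = {row["scientific_name"]: int(row["protein_length"]) for row in rows}

print(low_quality)
print(lengths)
```

Entries flagged this way may be worth a closer look before they go into an alignment.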


## What have we done so far?
- Explored metadata for all ant genomes
- Downloaded genomes for the Panamanian leaf cutter ants
- Downloaded the *orco* gene for *Acromyrmex echinatior*
- Downloaded the *orco* ortholog set for all ants

<img src="./images/elmo_done.png" />

## Tutorial - IV: Building a BLAST database and creating a phylogenetic tree

### Here's what we are showing you now:
- BLAST:
    - Create a BLAST database for each genome
    - BLAST the *orco* gene sequence against the genomes database and extract the matching regions
- multiple sequence alignment of the blast matches and the ortholog sequences
- generate an approximate maximum likelihood tree using FastTree

We'll add more detailed information about the commands we're using here to the GitHub page.

#### Extracting taxIDs from the genome data package

First, let's use `dataformat` to extract the species names, taxID and assembly accession numbers from the genomes we downloaded. We will talk in more detail about `dataformat` later.

In [30]:
%%bash
# Extract tax id for each species:
dataformat tsv genome --fields organism-name,tax-id,assminfo-accession --package genomes.zip 

Organism name	Taxonomic ID	Assembly Accession
Acromyrmex echinatior	103372	GCA_000204515.1
Acromyrmex insinuator	230686	GCA_017607455.1
Acromyrmex charruanus	2715315	GCA_017607545.1
Acromyrmex heyeri	230685	GCA_017607565.1
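
The table above holds everything needed to label each BLAST database: a taxID and a name. As a small sketch, the Python below parses that output and derives a short database name from the first letter of the genus plus the species epithet. The helper `short_name` and the variable `db_plan` are names introduced here for illustration only.

```python
import csv
import io

# Output of the dataformat command above (tab-separated).
dataformat_output = (
    "Organism name\tTaxonomic ID\tAssembly Accession\n"
    "Acromyrmex echinatior\t103372\tGCA_000204515.1\n"
    "Acromyrmex insinuator\t230686\tGCA_017607455.1\n"
    "Acromyrmex charruanus\t2715315\tGCA_017607545.1\n"
    "Acromyrmex heyeri\t230685\tGCA_017607565.1\n"
)


def short_name(organism):
    """Turn 'Genus species' into 'Gspecies' (e.g. 'Aechinatior')."""
    genus, species = organism.split()[:2]
    return genus[0] + species


# Map each assembly accession to (taxID, database name)
db_plan = {
    row["Assembly Accession"]: (row["Taxonomic ID"], short_name(row["Organism name"]))
    for row in csv.DictReader(io.StringIO(dataformat_output), delimiter="\t")
}

for accession, (taxid, name) in db_plan.items():
    print(accession, taxid, name)
```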


#### Creating a BLAST database with taxonomy information.

First, we are going to create a folder called `blastdb` with the UNIX command `mkdir`. Next, we will change to the directory we just created. Finally, we will make a copy of the NCBI taxonomy database (taxdb).

In [36]:
# Create a folder called blastdb
!mkdir blastdb

# change directory to the folder blastdb
%cd blastdb

# download the NCBI Taxonomy Database (taxdb)

#if running this notebook on a regular server, then use the following command line
#!update_blastdb.pl taxdb


#if running this notebook on Binder, run the following two lines instead of the perl script.
#The perl script uses regular ftp, which times out on Binder notebooks; these commands directly copy a recent
#version of taxdb from the AWS open data platform at no cost.

!aws s3 cp s3://ncbi-blast-databases/2021-10-28-01-05-02/taxdb.bti --no-sign-request .
!aws s3 cp s3://ncbi-blast-databases/2021-10-28-01-05-02/taxdb.btd --no-sign-request .


download: s3://ncbi-blast-databases/2021-10-28-01-05-02/taxdb.bti to ./taxdb.bti
download: s3://ncbi-blast-databases/2021-10-28-01-05-02/taxdb.btd to ./taxdb.btd


#### BLAST database and search
Now we will create a BLAST database with the *Acromyrmex* genomes we downloaded. More information about the commands is available on our GitHub page.

In [38]:
%%bash
# Create a blast database for each genome
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna -taxid 103372 -out Aechinatior
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna -taxid 230686 -out Ainsinuator 
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna -taxid 2715315 -out Acharruanus
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607565.1/GCA_017607565.1_ASM1760756v1_genomic.fna -taxid 230685 -out Aheyeri

# Create an alias under which the four genome databases can be called
blastdb_aliastool -dbtype nucl -title acromyrmex -out acromyrmex -dblist "Acharruanus Aechinatior Aheyeri Ainsinuator"

# BLASTN search
blastn \
-db acromyrmex \
-query ../gene/ncbi_dataset/data/gene.fna \
-evalue 1e-50 \
-outfmt 11 \
-max_hsps 1 \
-out orco_acromyrmex_1e-50.asn

# Convert the ASN.1 output to tabular (output format 6)

blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 sseqid sstart send evalue length staxid ssciname' > orco_acromyrmex_1e-50.tsv



Building a new DB, current time: 11/01/2021 13:19:44
New DB name:   /home/jovyan/training/cshl-2021/blastdb/Aechinatior
New DB title:  ../genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /home/jovyan/training/cshl-2021/blastdb/Aechinatior
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 4339 sequences in 3.54185 seconds.




Building a new DB, current time: 11/01/2021 13:19:48
New DB name:   /home/jovyan/training/cshl-2021/blastdb/Ainsinuator
New DB title:  ../genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /home/jovyan/training/cshl-2021/blastdb/Ainsinuator
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 890 sequences in 3.37369 seconds.




Building a new DB, current time: 11/01/2021 13:19:52
New DB name:   /home/jovyan/trainin

Created nucleotide BLAST (alias) database acromyrmex with 23295 sequences


Using `pandas` again, we will create an object with the tsv file we just created from the BLAST output, so we can take a look at our results.

In [39]:
# Create a table and visualize the BLAST results
import pandas as pd
blast_table = pd.read_csv('orco_acromyrmex_1e-50.tsv', sep='\t', header=None)
blast_table

Unnamed: 0,0,1,2,3,4,5,6
0,GL888262.1,47719,39489,0.0,8231,103372,Acromyrmex echinatior
1,JAANIC010002885.1,374203,382429,0.0,8319,2715315,Acromyrmex charruanus
2,JAANHZ010000736.1,394277,399208,0.0,4942,230686,Acromyrmex insinuator
3,JAANIB010005913.1,42669,38207,0.0,4507,230685,Acromyrmex heyeri
4,JAANIB010010813.1,2927587,2927837,3.0999999999999996e-64,256,230685,Acromyrmex heyeri
5,JAANIC010005341.1,252786,252532,4.01e-63,259,2715315,Acromyrmex charruanus
6,JAANHZ010000232.1,840079,839835,1.44e-62,249,230686,Acromyrmex insinuator
7,GL888207.1,1612253,1612494,1.44e-62,246,103372,Acromyrmex echinatior
8,JAANIB010005055.1,682287,682472,1.4499999999999998e-57,190,230685,Acromyrmex heyeri
9,JAANIC010001616.1,1476837,1476653,1.88e-56,188,2715315,Acromyrmex charruanus
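
The columns in this table have no names because the tabular BLAST output has no header row; their order follows the `-outfmt` string given to `blast_formatter` (`sseqid sstart send evalue length staxid ssciname`). The sketch below labels three rows copied from the table above and derives two extra values: the strand of the match (in BLAST nucleotide output, a subject start greater than the subject end indicates a minus-strand hit) and the span covered on the subject sequence. The keys `strand` and `subject_span` are names chosen for this example.

```python
import csv
import io

# Column order taken from the -outfmt string used with blast_formatter
COLUMNS = ["sseqid", "sstart", "send", "evalue", "length", "staxid", "ssciname"]

# Three rows copied from the table above (tab-separated, no header)
blast_tsv = (
    "GL888262.1\t47719\t39489\t0.0\t8231\t103372\tAcromyrmex echinatior\n"
    "JAANIC010002885.1\t374203\t382429\t0.0\t8319\t2715315\tAcromyrmex charruanus\n"
    "JAANHZ010000232.1\t840079\t839835\t1.44e-62\t249\t230686\tAcromyrmex insinuator\n"
)

hits = []
for fields in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
    hit = dict(zip(COLUMNS, fields))
    start, end = int(hit["sstart"]), int(hit["send"])
    # Subject start > subject end means the match is on the minus strand
    hit["strand"] = "minus" if start > end else "plus"
    # Number of subject bases covered by the match (inclusive coordinates)
    hit["subject_span"] = abs(end - start) + 1
    hits.append(hit)

for hit in hits:
    print(hit["ssciname"], hit["sseqid"], hit["strand"], hit["subject_span"])
```

In the notebook itself, passing `names=COLUMNS` to `pd.read_csv` would label the pandas table the same way.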


#### Converting from BLAST to fasta

Now we are going to use some "tricks" (not really, just some good old bash scripting) to extract fasta sequences from the BLAST output. We will be using `blast_formatter` again, and we'll do everything in multiple steps so we can all understand what's going on.

In [40]:
%%bash
# Convert BLAST output to fasta

blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 ssciname sseqid sseq' \
-max_target_seqs 4 | awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/ /, "_", $1);gsub(/-/, "", $3); print ">"$1"_"$2,$3}' > ../acromyrmex_orco.fasta
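
For readers less familiar with `awk`, here is a rough Python equivalent of the one-liner above, offered as a sketch rather than a replacement. It takes tab-separated lines with the three fields `ssciname`, `sseqid` and `sseq`, replaces spaces in the species name with underscores, strips alignment gaps (`-`) from the sequence, and writes a FASTA record. The function name `blast_rows_to_fasta` and the short sequence in the sample are made up for illustration.

```python
def blast_rows_to_fasta(lines):
    """Convert 'ssciname<TAB>sseqid<TAB>sseq' lines into FASTA text."""
    records = []
    for line in lines:
        ssciname, sseqid, sseq = line.rstrip("\n").split("\t")
        # Same as gsub(/ /, "_", $1) followed by ">"$1"_"$2 in the awk command
        header = ">" + ssciname.replace(" ", "_") + "_" + sseqid
        # Same as gsub(/-/, "", $3): remove gap characters from the sequence
        records.append(header + "\n" + sseq.replace("-", "") + "\n")
    return "".join(records)


# Made-up sequence; the species name and seqid come from the BLAST table
sample = ["Acromyrmex echinatior\tGL888262.1\tACGT--ACG-T\n"]
print(blast_rows_to_fasta(sample), end="")
```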




<img src="./images/elmo_blast_done.png"/>

### VERY IMPORTANT!
For the next steps, we need to go back to our home folder. Let's do it in steps again.

In [41]:
%%bash
## Check where you are
pwd

/home/jovyan/training/cshl-2021/blastdb


In [42]:
## If you're not in the home folder, run this command:
%cd /home/jovyan/training/cshl-2021/

/home/jovyan/training/cshl-2021


### Multiple sequence alignment: BLAST matches + *orco* orthologs

First, let's simplify the FASTA headers in the ortholog set.

In [34]:
%%bash
# Extract the seqids from the gene ortholog fasta and remove the spaces
grep ">" ortholog/ncbi_dataset/data/gene.fna | sed 's/ /,/g' > ortholog_seqid.txt

#Create a mapping file with the original name in the column 1 and a shortened name on column 2
cat ortholog_seqid.txt | while read line; do
new=$( echo $line | awk 'BEGIN {FS=","; OFS="_"}{gsub(/\[organism\=/, "", $3);gsub(/]/, "", $4);gsub(/\[GeneID\=|\]/, "", $5)} ;{print substr($3,1,1)"_"$4,$5}'); 
old=$( echo $line | sed 's/,/\_/g;s/>//g')
printf "${old}\t${new}\n" >> name_map.tsv; 
done

#Copy the ortholog dataset fasta
cp ortholog/ncbi_dataset/data/gene.fna ortholog_gene.fna

#Remove spaces in the fasta sequence names
sed 's/ /_/g' ortholog_gene.fna > ortholog_gene_nospaces.fna

#Replace the names in the fasta file
cat ortholog_gene_nospaces.fna | seqkit replace \
--kv-file  <(cut -f 1,2 name_map.tsv) \
--pattern "^(.*)" --replacement "{kv}" > ortholog_gene_final.fna

[INFO] read key-value file: /dev/fd/63
[INFO] 22 pairs of key-value loaded


In [35]:
!head ortholog_gene_final.fna

>A_echinatior_105147775
ACAAGAAGGCAGAAGTAGAGGGTACCTGGGCCTCGGTCGGAGAGACAAGACCATTCGCAA
CACAAAACTGTTTGTGCCATAAAACGACGACTGATGGCAGGCCGGCTAGTTAGTTTGCTT
TTCTTCGTTCTTCTGATATATTTGGCAAGATTGCTGCAACGTAATCGCGAGGCGCTGAAG
GCCGCATCATAAATTGGCCGCGGAGTTCCGACCAATTCTTCGCTTTTAGACATCTGTAAT
CTTGGATAGTTAAAGCGACCAAGGTAGGTCCTTCGCTCTTATTTCGAAACAGATACTTTC
TGAACAAAAACGCGAAAGTTTAAAAAAGTATCCGAAAAATGTAAAAAGATTCTGACATCT
TTTATAAATACGAAAATACAGCGTATGTCTGAAAAAACAGGAAGCTTCTGTTCGAGTCTT
GTATCGTTTGACGACGAGTTGGAAAAGAGTCTAAAGACTACCTAGAAAAGTCTTATCAAT
TTTTCTAATTGTAATTTATATCAATTTACATTGTATAAAAAACACATAATCAACATTATA
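
The renaming done by the `awk` command can also be expressed with a regular expression. The sketch below assumes a header layout of accession, gene symbol, then `[organism=Genus species]` and `[GeneID=...]` tags, which is what the field positions in the bash code imply; the example header is a reconstruction under that assumption, not a line copied from `gene.fna`. The names `HEADER_PATTERN` and `short_header` are introduced here for illustration.

```python
import re

# Captures: genus initial, species epithet, numeric GeneID
HEADER_PATTERN = re.compile(
    r"\[organism=(\w)\w*\s+([^\]\s]+)[^\]]*\].*\[GeneID=(\d+)\]"
)


def short_header(header):
    """Shorten a FASTA header to 'G_species_GeneID'."""
    match = HEADER_PATTERN.search(header)
    if match is None:
        raise ValueError(f"unexpected header layout: {header!r}")
    genus_initial, species, gene_id = match.groups()
    return f"{genus_initial}_{species}_{gene_id}"


# Assumed header layout (reconstructed, not copied from the data package)
example = (
    ">NW_011627180.1:39489-47719 LOC105147775 "
    "[organism=Acromyrmex echinatior] [GeneID=105147775] [chromosome=Un]"
)
print(short_header(example))
```

Raising an error on headers that do not match makes it obvious when the layout differs from the assumption, instead of silently producing wrong names.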


### Multiple sequence alignment and phylogenetic reconstruction

Now, let's concatenate the ortholog sequences with the FASTA we extracted from the BLAST matches, align them using MAFFT, and use FastTree to generate an approximate ML phylogeny.

In [36]:
%%bash

#Concatenate sequences
cat ortholog_gene_final.fna acromyrmex_orco.fasta > orco_all.fasta

#align sequences with mafft
mafft orco_all.fasta > orco_all_aln.fasta

#Generate a phylogeny using fasttree
FastTree -nt orco_all_aln.fasta > orco.tree

nthread = 0
nthreadpair = 0
nthreadtb = 0
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..

There are 640 ambiguous characters.
    1 / 26
done.

Constructing a UPGMA tree (efffree=0) ... 
    0 / 26   10 / 26   20 / 26
done.

Progressive alignment 1/2... 
STEP     1 / 25  fSTEP     2 / 25  fSTEP     3 / 25  fSTEP     4 / 25  fSTEP     5 / 25  fSTEP     6 / 25  fSTEP     7 / 25  fSTEP     8 / 25  fSTEP     9 / 25  fSTEP    10 / 25  fSTEP    11 / 25  fSTEP    12 / 25  f
Reallocating..done. *alloclen = 27122
STEP    13 / 25  fSTEP    14 / 25  fSTEP    15 / 25  f
Reallocating..done. *alloclen = 30176
STEP    16 / 25  fSTEP    17 / 25  fSTEP    18 / 25  fSTEP    19 / 25  fSTEP    20 / 25  fSTEP    21 / 25  fSTEP    22 / 25  fSTEP    23 / 25  fSTEP    24 / 25  f
Reallocating..done. *alloclen = 31377
STE
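
Before building a tree, it can be worth confirming that the alignment looks sane. In a multiple sequence alignment every sequence is padded with gaps to the same length, so a quick check is to compare lengths. The sketch below does this with a minimal FASTA reader; the functions `read_fasta` and `is_aligned` and the toy alignment are made up for illustration, and in practice the text would come from `orco_all_aln.fasta`.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of name -> sequence (lines joined)."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences


def is_aligned(sequences):
    """True when all sequences share one length, as in an alignment."""
    return len({len(seq) for seq in sequences.values()}) <= 1


# Toy alignment, wrapped over two lines per record like real FASTA output
toy_alignment = ">seq1\nACG-T\nAC\n>seq2\nACGTT\nAC\n>seq3\nA--TT\n-C\n"
aligned = read_fasta(toy_alignment)
print(is_aligned(aligned), {name: len(seq) for name, seq in aligned.items()})
```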

### Visualizing the tree

In [38]:
# We will use the package toytree to look at the phylogenetic tree we just created

import toytree
orco_tree = toytree.tree("orco.tree")
orco_tree_rooted = orco_tree.root(names=["O_brunneus_116854080","D_quadriceps_106748868","H_saltator_105183395"])
orco_tree_rooted.draw(tree_style='d')

(<toyplot.canvas.Canvas at 0x7fc05a9d38d0>,
 <toyplot.coordinates.Cartesian at 0x7fc05a9e0dd0>,
 <toytree.Render.ToytreeMark at 0x7fc05a96f310>)

## Tutorial - V: Downloading large datasets (dehydration/rehydration) and `dataformat`

Now that you have learned how to download genomes, genes and ortholog gene sets from NCBI with one command using `datasets`, we want to show you another feature: downloading what we call a `dehydrated` package. Let's download a dehydrated package and explore the files inside it.

In [39]:
%%bash
# Download a dehydrated data package for all acromyrmex GenBank genomes
datasets download genome taxon acromyrmex --assembly-source genbank --dehydrated --filename acromyrmex-dry.zip --no-progressbar

In [40]:
%%bash
# Next we have to unzip the dehydrated package
unzip -o acromyrmex-dry.zip -d acromyrmex-dry 

Archive:  acromyrmex-dry.zip
  inflating: acromyrmex-dry/README.md  
  inflating: acromyrmex-dry/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: acromyrmex-dry/ncbi_dataset/fetch.txt  
  inflating: acromyrmex-dry/ncbi_dataset/data/dataset_catalog.json  


In [41]:
%%bash
# Now let's use the command tree to look at the data package contents
tree acromyrmex-dry/

acromyrmex-dry/
├── ncbi_dataset
│   ├── data
│   │   ├── assembly_data_report.jsonl
│   │   └── dataset_catalog.json
│   └── fetch.txt
└── README.md

2 directories, 4 files


**What is the difference between this folder (`acromyrmex-dry`) and the folder `genomes`?**   
Let's use `tree` again to look at the contents of the folder genomes.

In [42]:
%%bash
# Check the folder contents of genome
tree genomes/

genomes/
├── ncbi_dataset
│   └── data
│       ├── assembly_data_report.jsonl
│       ├── dataset_catalog.json
│       ├── GCA_000204515.1
│       │   ├── cds_from_genomic.fna
│       │   ├── genomic.gff
│       │   ├── protein.faa
│       │   ├── sequence_report.jsonl
│       │   └── unplaced.scaf.fna
│       ├── GCA_017607455.1
│       │   ├── cds_from_genomic.fna
│       │   ├── GCA_017607455.1_ASM1760745v1_genomic.fna
│       │   ├── genomic.gff
│       │   ├── protein.faa
│       │   └── sequence_report.jsonl
│       ├── GCA_017607545.1
│       │   ├── cds_from_genomic.fna
│       │   ├── GCA_017607545.1_ASM1760754v1_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       └── GCA_017607565.1
│           ├── cds_from_genomic.fna
│           ├── GCA_017607565.1_ASM1760756v1_genomic.fna
│           ├── genomic.gff
│           └── protein.faa
└── README.md

6 directories, 21 files


Both packages include the files `assembly_data_report.jsonl` and `dataset_catalog.json`, but the folder `acromyrmex-dry` has the file `fetch.txt` instead of the *actual* data. Let's take a look at this file.

In [44]:
# Inspect the file fetch.txt
import pandas as pd
fetch = pd.read_csv('./acromyrmex-dry/ncbi_dataset/fetch.txt', sep='\t', header=None)
fetch

Unnamed: 0,0,1,2
0,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_000204515.1/unplaced.scaf.fna
1,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_000204515.1/cds_from_genomic.fna
2,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_000204515.1/genomic.gff
3,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_000204515.1/protein.faa
4,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_017607455.1/GCA_017607455.1_ASM176074...
5,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_017607455.1/cds_from_genomic.fna
6,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_017607455.1/genomic.gff
7,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_017607455.1/protein.faa
8,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_017607545.1/GCA_017607545.1_ASM176075...
9,https://api.ncbi.nlm.nih.gov/datasets/fetch_h/...,0,data/GCA_017607545.1/cds_from_genomic.fna


The file `fetch.txt` lists the files to be "fetched" (i.e., downloaded) along with their respective links. These are the same files that were included when we downloaded the genomes at the beginning of this notebook.
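
Since `fetch.txt` is a plain tab-separated file with three columns, it is simple to summarize with a few lines of Python. The sketch below groups the listed files by assembly accession. The URLs are placeholders, because the real links are shortened in the table above, and the meaning of the middle column (shown as `0`) is not explained in this tutorial, so it is read but not used.

```python
import csv
import io
from collections import defaultdict

# Placeholder URLs: the real links are elided in the output above.
fetch_text = (
    "https://example.org/placeholder-1\t0\tdata/GCA_000204515.1/unplaced.scaf.fna\n"
    "https://example.org/placeholder-2\t0\tdata/GCA_000204515.1/protein.faa\n"
    "https://example.org/placeholder-3\t0\tdata/GCA_017607455.1/protein.faa\n"
)

files_by_assembly = defaultdict(list)
for url, _middle_column, path in csv.reader(io.StringIO(fetch_text), delimiter="\t"):
    # Paths look like data/<accession>/<filename>
    _, accession, filename = path.split("/", 2)
    files_by_assembly[accession].append(filename)

for accession, filenames in files_by_assembly.items():
    print(accession, filenames)
```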

#### BUT WHY WOULD I WANT TO USE THIS OPTION?

Some possibilities:
- You are working with very large genomes and want to share the data with your collaborators. Instead of sending a massive data file, you can send a text file that they can use to download the same data you're working on.
- Or maybe you hand-selected some genomes for a project from the [NCBI Datasets website](https://www.ncbi.nlm.nih.gov/datasets/genomes/) and they don't follow a specific pattern that can be replicated. You can also download a dehydrated package from our website, share it and download everything you need later.

### `dataformat`

Now we are going to combine `datasets` with another tool called `dataformat`. `dataformat` allows you to extract metadata information from the JSON data report files included with all `datasets` data packages. You can use `dataformat` to:
- Create a tab-delimited file (.tsv) or Excel file with the fields you need
- Quickly visualize the information on the screen

`dataformat` currently cannot be used with the output of `datasets summary`, only with the JSON data report included with the data package.

In [45]:
%%bash
# Read the dataformat help menu. This is a great way to get a list of the available metadata fields.
dataformat tsv genome -h


Convert Genome Assembly Data Report into TSV format.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  dataformat tsv genome [flags]

Examples
  dataformat tsv genome --inputfile human/ncbi_dataset/data/assembly_data_report.jsonl
  dataformat tsv genome --package human.zip

Flags
      --fields strings     comma-separated list of fields
                               - annotinfo-busco-complete
                               - annotinfo-busco-duplicated
                               - annotinfo-busco-fragmented
                               - annotinfo-busco-lineage
                               - annotinfo-busco-missing
                               - annotinfo-busco-singlecopy
                               - annotinfo-busco-totalcount
                               - annotinfo-busco-ver
                               - annotinfo-featcount-g

#### Now let's combine the features of `dataformat` and dehydration/rehydration to select which genomes to download.

Let's use `dataformat` to look at the genome data package for ants. We can use this information to select a "best" genome - we'll pick one with the highest contigN50 value.

In [46]:
%%bash
# Use dataformat to look at the genome data package for ants
dataformat tsv genome \
--fields organism-name,assminfo-accession,assmstats-contig-n50,assminfo-level,assminfo-submission-date,assminfo-submitter \
--package acromyrmex-dry.zip

Organism name	Assembly Accession	Assembly Stats Contig N50	Assembly Level	Assembly Submission Date	Assembly Submitter
Acromyrmex echinatior	GCA_000204515.1	80630	Scaffold	2011-05-03	Beijing Genomics Institute, Shenzhen
Acromyrmex insinuator	GCA_017607455.1	39949	Scaffold	2021-03-29	BGI
Acromyrmex charruanus	GCA_017607545.1	34925	Scaffold	2021-03-29	BGI
Acromyrmex heyeri	GCA_017607565.1	10811	Scaffold	2021-03-29	BGI
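
With only four genomes the highest contig N50 is easy to spot by eye, but for larger packages the choice can be automated. As a sketch, the Python below parses three of the columns printed above and picks the assembly with the largest contig N50 value. The variable names are introduced here for illustration.

```python
import csv
import io

# Three columns from the dataformat output above (tab-separated)
report = (
    "Organism name\tAssembly Accession\tAssembly Stats Contig N50\n"
    "Acromyrmex echinatior\tGCA_000204515.1\t80630\n"
    "Acromyrmex insinuator\tGCA_017607455.1\t39949\n"
    "Acromyrmex charruanus\tGCA_017607545.1\t34925\n"
    "Acromyrmex heyeri\tGCA_017607565.1\t10811\n"
)

rows = list(csv.DictReader(io.StringIO(report), delimiter="\t"))

# Compare N50 values as integers, not as strings
best = max(rows, key=lambda row: int(row["Assembly Stats Contig N50"]))

print(best["Organism name"], best["Assembly Accession"])
```

Converting to `int` matters here: compared as text, `"80630"` and `"10811"` happen to sort correctly, but values with different numbers of digits would not.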


In [47]:
%%bash
# Let's look at the help file for rehydrate
datasets rehydrate -h


Retrieve data files for an [unzipped, dehydrated zip archive](https://www.ncbi.nlm.nih.gov/datasets/docs/how-tos/genomes/rehydrate-package/).  Data files specified in fetch.txt will be downloaded from NCBI.

Usage
  datasets rehydrate [flags] --directory <directory_name>

Flags
      --directory string   specify the directory containing the unzipped dehydrated bag
  -h, --help               help for rehydrate
      --list               list files that would be downloaded during rehydration
      --match string       specify substring that matches files for rehydration
      --max-workers int    limit the maximum number of concurrent download workers (allowed range is 1-30) (default 10)


Global Flags
      --api-key string   NCBI Datasets API Key
      --no-progressbar   hide progress bar



In [48]:
%%bash
# Let's get a list of files that are available for download 
datasets rehydrate --directory acromyrmex-dry/ --list

data/GCA_000204515.1/unplaced.scaf.fna
data/GCA_000204515.1/cds_from_genomic.fna
data/GCA_000204515.1/genomic.gff
data/GCA_000204515.1/protein.faa
data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna
data/GCA_017607455.1/cds_from_genomic.fna
data/GCA_017607455.1/genomic.gff
data/GCA_017607455.1/protein.faa
data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna
data/GCA_017607545.1/cds_from_genomic.fna
data/GCA_017607545.1/genomic.gff
data/GCA_017607545.1/protein.faa
data/GCA_017607565.1/GCA_017607565.1_ASM1760756v1_genomic.fna
data/GCA_017607565.1/cds_from_genomic.fna
data/GCA_017607565.1/genomic.gff
data/GCA_017607565.1/protein.faa
data/GCA_000204515.1/sequence_report.jsonl
data/GCA_017607455.1/sequence_report.jsonl
data/GCA_017607545.1/sequence_report.jsonl
data/GCA_017607565.1/sequence_report.jsonl
Found 20 files for rehydration


In [49]:
%%bash
# Let's only get the protein sequences for the genome with the highest contigN50 value
datasets rehydrate --directory acromyrmex-dry/ --match GCA_000204515.1/protein.faa --no-progressbar

Found 1 files for rehydration
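
The help text describes `--match` as a substring that selects files for rehydration. A simple way to think about it, sketched below in Python, is a substring test against each path in the list printed by `--list`. This is a model of the documented behavior under that reading, not the tool's actual code; the function `match_files` is made up, and only a subset of the listed files is included.

```python
# A subset of the paths printed by `datasets rehydrate --list` above
available = [
    "data/GCA_000204515.1/genomic.gff",
    "data/GCA_000204515.1/protein.faa",
    "data/GCA_017607455.1/genomic.gff",
    "data/GCA_017607455.1/protein.faa",
    "data/GCA_017607545.1/genomic.gff",
    "data/GCA_017607545.1/protein.faa",
    "data/GCA_017607565.1/genomic.gff",
    "data/GCA_017607565.1/protein.faa",
]


def match_files(files, substring):
    """Keep the paths that contain `substring` anywhere in them."""
    return [path for path in files if substring in path]


one_file = match_files(available, "GCA_000204515.1/protein.faa")
all_proteins = match_files(available, "protein.faa")

print(one_file)
print(len(all_proteins))
```

Under this reading, a shorter substring such as `protein.faa` would select the protein file of every assembly, while adding the accession narrows it to one.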


In [50]:
%%bash
# Let's use tree to look at our folder acromyrmex-dry again
tree acromyrmex-dry/

acromyrmex-dry/
├── ncbi_dataset
│   ├── data
│   │   ├── assembly_data_report.jsonl
│   │   ├── dataset_catalog.json
│   │   └── GCA_000204515.1
│   │       └── protein.faa
│   └── fetch.txt
└── README.md

3 directories, 5 files


We can see that the file we requested, `GCA_000204515.1/protein.faa`, was downloaded to the folder `acromyrmex-dry`.

In [51]:
%%bash
# Take a peek at the downloaded protein file
cat acromyrmex-dry/ncbi_dataset/data/GCA_000204515.1/protein.faa | head

>EGI57120.1 Histone-lysine N-methyltransferase SETMAR [Acromyrmex echinatior]
MAKFNEFRYELFPHPAYLPDLALCDYFLFPNLKKWFGRKRFTTREQLIAETEAYFERLDKSYYLNKLENRSIKSIELKGN
YVEKQK
>EGI57121.1 hypothetical protein G5I_14843 [Acromyrmex echinatior]
MPTQSGLIVPTYIAMLPTLTVHKYSLFRLSRELHNLFVRLVARSCGMMTSYTQAPKRLDTSGHVYLVYEDPGQLNTEEEE
EEEEAYNALATSTERS
>EGI57122.1 Nidogen-1 [Acromyrmex echinatior]
MRRDFCNGGLACAVVWVSTCLLLVLSLSTSTIAEPLLRVAGRCPSLVEQNVCPSRAPACENDYQCQGTEERCCKTACGLR
CIAGELTGCEQLELAAVRRSRALGARGPQQFIPRCNNETGEFERIQCEPHGRSCWCVDEIGAEIPGTRAPSKSVVDCDKP
HSCPAHSCRMLCPLGFEINEVTGCPKCECRDPCRGVTCPGIGQICELIAVNCIREPCPPVPSCRKTRSLSTICPAGEPLQ


## Exercise
* Download a dehydrated package for all *Mycobacterium tuberculosis* genomes that meet all of the following criteria (hint: use flags)
    1. submitted/released in 2021
    2. annotated
    3. assembly level of complete_genome
* Use `dataformat` to view the sequencing technology used for each of these genomes
* Use `rehydrate` to get the genome sequence for one genome generated using Oxford Nanopore

In [None]:
%%bash
# Download a dehydrated genome data package



In [None]:
%%bash
# Unzip the data package



In [None]:
%%bash
# Use dataformat to generate a table that includes sequencing technology



In [None]:
%%bash
# Use rehydrate to get genome sequence generated using Oxford Nanopore

