# Using NCBI Datasets command-line tools to get genome, gene and ortholog data


### Table of contents
* [Part I: Accessing genomes](#Part-I)
* [Part II: Accessing genes](#Part-II)
* [Part III: Accessing orthologs](#Part-III)
* [Links](#Links)

### Important resources
- GitHub: https://github.com/ncbi/datasets/tree/master/training
- NCBI datasets: https://www.ncbi.nlm.nih.gov/datasets/

## Before we start... What is a Jupyter notebook?

Jupyter Notebooks are a web-based approach to interactive code. A single notebook (the file you are currently reading) is composed of many "cells", each of which contains either text or code. To navigate between cells, either click on them or use the arrow keys on your keyboard.

A text cell will look like... well... this! While a code cell will look something like what you see below. To run the code inside a code cell, click on it, then click the "Run" button at the top of the screen. Try it on the code cell below!

In [None]:
#This is a code cell
print('You ran the code cell!')

If it worked, you should have seen text pop up underneath the cell saying `You ran the code cell!`. Note the `In [1]:` that appeared next to the cell. This tells you the order you have run code cells throughout the notebook. The next time you run a code cell, it will say `In [2]:`, then `In [3]:` and so on... This will help you know if/when code has been run.

The remainder of the notebook below has been pre-built by the workshop organizer. You will not need to create any new cells, and you will be explicitly told if/when to execute a code cell.

The code in this workshop is all Bash (i.e., terminal commands). Bash commands are prefixed with `!` or the cells have the notation `%%bash` at the top. If you are not familiar with code, don't feel pressured to interpret it very deeply. Descriptions of each code block will be provided!

(Jupyter Notebook explanation by Cooper Park at the workshop on [Finding and Analyzing Metagenomic Data](https://www.nlm.nih.gov/oet/ed/ncbi/2021_10_meta.html))

### Mic check

Let's first make sure that the conda environment is active and that we can run datasets.
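As an optional extra check, the sketch below reports which of the command-line tools used later in this notebook are available on your `PATH`. The helper name `check_tool` is our own and is not part of `datasets`; the tool list is simply the set of programs called in the cells that follow.

```shell
# Report whether a given tool is on the PATH (check_tool is a hypothetical helper name)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "MISSING: $1"
    fi
}

# Tools called in this notebook
for tool in datasets dataformat jq tree unzip clustalo; do
    check_tool "$tool"
done
```

If any tool is reported as missing, the matching cells further down will fail until it is installed in the active environment.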

In [None]:
%%bash
conda info --envs

In [None]:
%%bash
# Now let's make sure the datasets command-line tool is working 
datasets

## Case study: Drosophila melanogaster innate immunity

Drosophila melanogaster innate immunity features two pathways involved in the detection of dsRNA: RNAi and the cGAS-STING pathway.

Components of the cGAS-STING pathway are present in diverse metazoan lineages but absent in others. For example, the cGAS-STING pathway is present in Drosophila melanogaster, cnidarians (jellyfish) and humans, but absent in nematodes, e.g., C. elegans. 

<div>
<img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41559-022-01951-4/MediaObjects/41559_2022_1951_Fig2_HTML.png?as=webp" width="800">
</div>

Iwama, R.E., Moran, Y. Origins and diversification of animal innate immune responses against viral infections. Nat Ecol Evol 7, 182–193 (2023). https://doi.org/10.1038/s41559-022-01951-4

We'll use `datasets` to gather genomic data from NCBI. We will:

- get metadata for all genomes for the genus *Drosophila*
- download genome sequences for the species *Drosophila pseudoobscura*
- download the *cglr1* gene annotated on the *Drosophila pseudoobscura* reference genome
- download some ortholog data for this gene

### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/quickstarts/command-line-tools/) is a command-line tool that lets users download data packages (data + metadata) or view metadata summaries for genomes, genes and ortholog sets. Its commands follow a hierarchy, which makes it easy to select exactly the options you need. In addition to the commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v2/images/datasets1.png" alt="datasets" style="width: 700px;"/>

### `dataformat`

Now we are going to combine `datasets` with another tool called `dataformat`. `dataformat` allows you to extract metadata from the JSON-Lines data report files that are included in all `datasets` data packages or returned by the `summary` command. You can use `dataformat` to:
- Create a tab-delimited file (.tsv) or Excel file with the fields you need
- Quickly visualize the information on the screen
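
Because the output is plain tab-delimited text, it can be passed to standard UNIX tools. The sketch below uses a small made-up table instead of real `dataformat` output. The accessions are placeholders, and the column headers are an assumption about what the real header row looks like.

```shell
# Made-up table in the shape of a TSV report: a header row, then one genome per row
tsv=$(printf '%s\t%s\t%s\n' \
    'Assembly Accession' 'Organism Name' 'Assembly Level' \
    'EXAMPLE_0001' 'Drosophila melanogaster' 'Chromosome' \
    'EXAMPLE_0002' 'Drosophila simulans' 'Scaffold' \
    'EXAMPLE_0003' 'Drosophila pseudoobscura' 'Chromosome')

# Skip the header (NR > 1) and print the accession of every chromosome-level row
chromosome_accessions=$(printf '%s\n' "$tsv" | awk -F'\t' 'NR > 1 && $3 == "Chromosome" {print $1}')

printf '%s\n' "$chromosome_accessions"
```

Setting the field separator to a tab (`-F'\t'`) matters here because organism names contain spaces.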

<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v2/images/dataformat1.png" alt="dataformat"  style="width: 700px;"/>

In [None]:
%%bash
# Read the dataformat help menu. This is a great way to get a list of the available metadata fields.
dataformat tsv genome -h

## Part I: Accessing genomes<a class="anchor" id="Part-I"></a>

First, let's figure out what kind of genome information NCBI has for the genus Drosophila. For this task, we will first use the `datasets summary` command, as shown in the diagram below. We will then pipe the `datasets` output to `jq` so we can see how many Drosophila genomes are in the NCBI database.

<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v2/images/datasets-s-genome-tax.png" style="width: 800px;"/>

In [None]:
%%bash
# Get the genome count for the genus Drosophila (Taxid: 7215)
datasets summary genome taxon 7215 | jq '.total_count'


When this notebook was written, the command reported 635 Drosophila genomes at NCBI. The number you see may be higher, since new genomes are added over time. Let's save the metadata as a JSON-Lines file using the flag `--as-json-lines` to make it easier to extract information from the metadata file later.
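
To see why the JSON-Lines layout is convenient, here is a minimal sketch with three made-up records. The field names and accessions are hypothetical and much simpler than the real, nested genome reports. The point is only that each line holds one complete record, so line-oriented tools can count and filter records directly.

```shell
# Three made-up records in JSON-Lines form: one complete JSON object per line
jsonl='{"accession":"EXAMPLE_0001","organism":"Drosophila melanogaster"}
{"accession":"EXAMPLE_0002","organism":"Drosophila simulans"}
{"accession":"EXAMPLE_0003","organism":"Drosophila melanogaster"}'

# Counting lines is the same as counting records
record_count=$(printf '%s\n' "$jsonl" | wc -l | tr -d ' ')

# grep -c counts the lines (records) that mention a given organism
melanogaster_count=$(printf '%s\n' "$jsonl" | grep -c '"Drosophila melanogaster"')

echo "records: $record_count, melanogaster records: $melanogaster_count"
```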

In [None]:
%%bash
# Get genome metadata for the genus Drosophila and save to a file
datasets summary genome taxon 7215 --as-json-lines > drosophila.jsonl

**Now let's take a look at the metadata using jq**  
Since the output is really long, we will only show the first 50 lines (`head -n 50`). The flag `-C` in the `jq` command shows the output in color, which makes it easier to read.

In [None]:
%%bash

jq -C . drosophila.jsonl | head -n 50


### Let's continue to explore the available genomes for the genus Drosophila


For this part, we will use two UNIX commands: `sort` and `uniq`. 

- `sort` can be used to sort text files line by line, numerically and alphabetically.   
- `uniq` filters out repeated lines in a file. However, `uniq` can only detect repeated lines if they are adjacent to each other, which is why we sort the input first. The flag `-c` (or `--count`) tells `uniq` to collapse the repeated lines and to prefix each remaining line with the number of times it appeared.

So, we will use `dataformat` to extract the information we need, sort the result and count the number of unique entries.
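
Before running this on real data, the toy example below shows why the `sort` step matters. The list of organism names is made up and stands in for the `dataformat` output.

```shell
# Made-up organism names, one genome per line
names='Drosophila melanogaster
Drosophila simulans
Drosophila melanogaster
Drosophila pseudoobscura
Drosophila melanogaster
Drosophila simulans'

# Without sorting, no duplicates are adjacent, so uniq cannot merge anything
unsorted_counts=$(printf '%s\n' "$names" | uniq -c)

# Sorting first makes identical lines adjacent so uniq -c can count them;
# the final sort orders the result by count, highest first
sorted_counts=$(printf '%s\n' "$names" | sort | uniq -c | sort -k1,1nr)

printf 'Without sort:\n%s\n\nWith sort:\n%s\n' "$unsorted_counts" "$sorted_counts"
```

The unsorted version reports six lines with a count of 1 each, while the sorted version reports one line per species with its true count.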

In [None]:
%%bash
# For which Drosophila species does NCBI have genomes in its database? How many per species?
# We'll only look at the top 10 species, sorted by genome count

datasets summary genome taxon 7215 --as-json-lines | \
dataformat tsv genome --fields organism-name --elide-header | sort | uniq -c | sort -k1nr | head

In [None]:
%%bash
# What is the assembly level (contig, scaffold, chromosome, complete) breakdown?

datasets summary genome taxon 7215 --as-json-lines | dataformat tsv genome \
--fields assminfo-level --elide-header  | sort | uniq -c

### How to get help when using the command line

Since `datasets` is a hierarchical program, we can use that characteristic to our advantage to get specific help. For example, if we type `datasets --help`, we will see the first level of commands available.


In [None]:
%%bash
datasets --help

Notice the difference from when we type `datasets summary genome taxon --help`  


In [None]:
%%bash
datasets summary genome taxon --help

### Data package

We explored the `datasets summary` option, in which we had a chance to look at the summary metadata ***without*** downloading any files. In the next steps, we will look at the data packages, which contain the actual data files. 

<br>

<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v2/images/data-packages.png" alt="data_package" style="width:600px;" />
<br>

Each `datasets` data package type (e.g., `genome`, `gene`, `virus`) contains a different set of files. You can customize data packages by using the flag `--include` to add or remove files. The image below shows the file types available for each `datasets` data package type. Refer to the [NCBI Datasets Data Package Reference page](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/data-packages/) for more information about each data package type.

<br>

<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v2/images/data-package-contents-op2.png" style="width:800px;">

In [None]:
%%bash
# Download a genome data package containing all chromosome-level GenBank assemblies for the species Drosophila pseudoobscura

datasets download genome taxon 'drosophila pseudoobscura' --assembly-source genbank --assembly-level chromosome --filename pseudoobscura.zip --no-progressbar

In [None]:
%%bash
# Unzip to the folder genomes
unzip pseudoobscura.zip -d genomes

In [None]:
%%bash
# Explore the structure of the folder genomes with the command tree
tree -C genomes/

In [None]:
%%bash
# Now download the genomes and gff3 files for all *Drosophila pseudoobscura* genomes annotated by NCBI
datasets download genome taxon 'drosophila pseudoobscura' --assembly-source refseq \
    --include genome,gff3 --filename pseudoobscura-rs-gff3.zip --no-progressbar

unzip -l pseudoobscura-rs-gff3.zip

Note that this is only a single genome! Although there are 10 genomes for Drosophila pseudoobscura, there is only a single genome annotated by NCBI.

### Downloading large genome datasets (dehydration/rehydration)

Now you know how to download genome data for a relatively small number of genomes. But what if you want large amounts of genome data? `datasets` provides a better way to download large genome data packages that takes a few more steps but allows you to resume interrupted downloads. Let's download a dehydrated package and explore the files inside it.

In [None]:
%%bash
# Download a dehydrated data package of chromosome-level GenBank assemblies for the species Drosophila pseudoobscura
datasets download genome taxon 'drosophila pseudoobscura' --assembly-source genbank --assembly-level chromosome --dehydrated --filename pseudoobscura-dry.zip --no-progressbar

In [None]:
%%bash
# Next we have to unzip the dehydrated package
unzip pseudoobscura-dry.zip -d pseudoobscura-dry

In [None]:
%%bash
# Now let's use the command tree to look at the data package contents
tree pseudoobscura-dry/

**What is the difference between this folder (`pseudoobscura-dry`) and the folder `genomes`?**   
Let's use `tree` again to look at the contents of the folder genomes.

In [None]:
%%bash
# Check the contents of the folder genomes
tree genomes/

Both packages include the files `assembly_data_report.jsonl` and `dataset_catalog.json`, but the folder `pseudoobscura-dry` has the file `fetch.txt` instead of the *actual* data. 

The file `fetch.txt` lists the files to be "fetched" (downloaded), each with its link. They are the same files that were included when we downloaded the genomes at the beginning of this notebook.  
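
As a rough sketch of how such a list can be inspected, the example below counts the entries and sums their sizes. It assumes each line has three tab-separated columns (link, size in bytes, destination path). That layout is an assumption, so check your own `fetch.txt` before relying on it. The links, sizes and paths shown are made up.

```shell
# Made-up fetch list: link, size in bytes, destination path (tab-separated; assumed layout)
fetch=$(printf '%s\t%s\t%s\n' \
    'https://example.org/file1' '1200' 'data/EXAMPLE_0001/genomic.fna' \
    'https://example.org/file2' '300' 'data/EXAMPLE_0001/sequence_report.jsonl' \
    'https://example.org/file3' '2500' 'data/EXAMPLE_0002/genomic.fna')

# One line per file, so the line count is the number of files to fetch
file_count=$(printf '%s\n' "$fetch" | wc -l | tr -d ' ')

# Sum the second column to estimate the total download size
total_bytes=$(printf '%s\n' "$fetch" | awk -F'\t' '{sum += $2} END {print sum}')

echo "$file_count files, $total_bytes bytes in total"
```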

In [None]:
%%bash
# Let's get a list of files that are available for download 
datasets rehydrate --directory pseudoobscura-dry/ --list

In [None]:
%%bash
# Now let's rehydrate the package to get the data
datasets rehydrate --directory pseudoobscura-dry/ --no-progressbar

In [None]:
%%bash
# Next we'll view the contents of the directory
tree -C pseudoobscura-dry

## Part II: Accessing genes <a class="anchor" id="Part-II"></a>
### GENES

Whether you choose `datasets download` or `datasets summary`, there are four options for retrieving gene information: `accession`, `gene-id`, `symbol` and `taxon`. 


<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v2/images/datasets-gene.png" style="width: 700px;"/>


When choosing any of those options, you will retrieve the gene information for the gene annotated on the **reference** genome. Like this:

`datasets download gene accession NM_176109.2`  
`datasets download gene gene-id 35919`  
`datasets download gene symbol boot --taxon 'drosophila melanogaster'`  
`datasets download gene taxon 'drosophila melanogaster'`

The first three commands will download the same gene annotated on the (<i>Drosophila melanogaster</i>) <u>reference genome</u>, and the last one will download all genes for that species. 

- **accession**:  Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).  

- **gene-id**:  Also a unique identifier. For example: the gene-id for BRCA1 in human is 672, while the gene-id for domestic cat BRCA1 is 101081937.  

- **symbol**: Gene symbols are not unique and can be used multiple times in different taxonomic groups. If using the symbol option, you should specify the species. If no taxon is specified, it is assumed the taxon is human.

- **taxon**: Species-level. Retrieves the entire set of RefSeq annotated genes for the specified taxon.  

**Remember**  
Both `summary` and `download` will return results for the **reference assembly** of a <u>single species</u>. If you want to download a curated set of related genes from multiple taxa, you should use the flag `--ortholog`. We'll talk more about that later. 

Now let's take a look at a gene example:

In [None]:
%%bash
# Get metadata for a gene by symbol in human 
datasets summary gene symbol cgas | jq -C .


In [None]:
%%bash
# Get metadata for a gene by symbol in Drosophila melanogaster
datasets summary gene symbol cglr1 --taxon 'drosophila melanogaster' | jq -C . 


In [None]:
%%bash
# View select fields of metadata for the same gene
datasets summary gene symbol cglr1 --taxon 'drosophila melanogaster' --as-json-lines | \
    dataformat tsv gene --fields tax-name,symbol,description,gene-id,protein-count

### Downloading gene sequence data 
We know that there is 1 protein encoded by the Drosophila melanogaster cglr1 gene. Let's see how we can download the protein sequence, and the underlying transcript and gene sequences.


In [None]:
%%bash
# Download gene, transcript and protein sequences for *Drosophila melanogaster* cglr1.
datasets download gene symbol cglr1 --taxon 'drosophila melanogaster' --include gene,rna,protein --filename cglr1.zip --no-progressbar

In [None]:
%%bash
# Unzip the gene data package
unzip cglr1.zip -d cglr1

In [None]:
%%bash
# Take a look at the FASTA headers for the downloaded sequences
head -2 cglr1/ncbi_dataset/data/*.f*

### Using orthology data to find genes in related species


In [None]:
%%bash
# If we try to retrieve metadata information for this gene using the symbol cglr1, what happens?
datasets summary gene symbol cglr1 --taxon "drosophila pseudoobscura"


We can use orthology data to find related genes in other species. 

In [None]:
%%bash
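# Look up cglr1 and its orthologs in Drosophila melanogaster and Drosophila pseudoobscura
# dataformat reads JSON Lines, so we add the flag --as-json-lines
datasets summary gene symbol cglr1 --taxon "drosophila melanogaster" --ortholog "drosophila melanogaster, drosophila pseudoobscura" --as-json-lines | \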
    dataformat tsv gene --fields tax-name,gene-id,description,symbol,group-method,group-id

In [None]:
%%bash
# Download the gene data package for the gene-id 4803562 (*cglr1* in Drosophila pseudoobscura)
# We want to include the FASTA file with gene sequences, so we will use the flag --include.

datasets download gene gene-id 4803562 --filename gene.zip --include gene,protein,rna --no-progressbar


In [None]:
%%bash
# Unzip the file
unzip gene.zip -d gene

In [None]:
%%bash
# Explore the data package structure using tree
tree -C gene/

## Part III: Accessing orthologs <a class="anchor" id="Part-III"></a>

### Orthologs

Since `datasets` version 14, users can retrieve ortholog information using the flag `--ortholog` with the `gene` subcommand.

#### <font color='blue'>Wait, but what is an ortholog set?</font>

>An ortholog set, or ortholog gene group, is a group of genes that have been identified by the NCBI genome annotation team as homologous genes related to each other by speciation events. They are identified by a combination of protein similarity and local synteny information. 
Currently, NCBI has ortholog sets calculated for vertebrates and some insects. 


#### Examples:

`datasets download gene accession NM_176180.2 --ortholog all`  
`datasets download gene gene-id 36744 --ortholog all`  
`datasets download gene symbol cglr1 --taxon 'drosophila melanogaster' --ortholog all`  

All three commands will download the **same** ortholog set (which is the complete set). 

What if I want to filter the ortholog set to include *only* a taxonomic group of interest?



### Applying a taxonomic filter to the ortholog set

When using the `--ortholog` flag, users need to provide an argument for it. The argument should be one or more taxa (any rank) to filter results or 'all' for the complete set.
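
When several taxa are needed, the download cell later in this section passes them as one comma-separated value. The sketch below builds such a value from a list and prints the resulting command without running it. The helper name `join_taxa` is hypothetical and not part of `datasets`.

```shell
# Join all arguments with commas (join_taxa is a hypothetical helper name)
join_taxa() {
    # "$*" joins the arguments using the first character of IFS;
    # the subshell keeps the IFS change local
    ( IFS=,; printf '%s\n' "$*" )
}

ortholog_arg=$(join_taxa apidae 'drosophila melanogaster')

# Print the command for review instead of executing it
echo "datasets download gene symbol cglr1 --taxon 'drosophila melanogaster' --ortholog \"$ortholog_arg\" --filename ortholog.zip"
```

Quoting the joined value keeps taxon names that contain spaces together as a single argument.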

#### Examples

- `datasets download gene symbol cglr1 --taxon 'drosophila melanogaster' --ortholog apidae`  
Downloads a gene data package containing orthologs of the *Drosophila melanogaster* gene *cglr1*, restricted to the family Apidae (bees).


#### We are going to follow these steps:
- download the ortholog data package and save it with the name ortholog.zip
- unzip it to the folder ortholog
- look at some metadata for the genes in the ortholog set
- align the protein sequences for the genes in the set
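
The alignment step below first replaces the spaces in the FASTA headers with underscores, so that each header becomes a single token. Here is a minimal sketch of that substitution on a made-up record. The identifier and sequence are placeholders, not real NCBI data.

```shell
# Made-up FASTA record: a header line with spaces, then a sequence line
fasta='>PLACEHOLDER_1.1 example protein [Drosophila melanogaster]
MSTNKLAV'

# Replace every space with an underscore; the sequence line has no spaces and is unchanged
renamed=$(printf '%s\n' "$fasta" | sed 's/ /_/g')

printf '%s\n' "$renamed"
```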

In [None]:
%%bash
# Download the cglr1 ortholog set, filtered to the family Apidae (bees) plus Drosophila melanogaster
datasets download gene symbol cglr1 --taxon 'drosophila melanogaster' --ortholog apidae,'drosophila melanogaster' --filename ortholog.zip --no-progressbar

In [None]:
%%bash
# Unzip it to the folder ortholog
unzip ortholog.zip -d ortholog


In [None]:
%%bash
# Generate a table describing the genes in the ortholog set
dataformat tsv gene --package ortholog.zip --fields tax-name,symbol,gene-id,group-method,group-id | head

In [None]:
%%bash
# Update FASTA headers in the protein sequence file to make clustalo output easier to understand
sed 's/ /_/g' ortholog/ncbi_dataset/data/protein.faa > renamed.proteins

# Run alignment
clustalo --infile renamed.proteins --percent-id --full --distmat-out=output.distmat --outfmt=clu --force

In [None]:
%%bash
# Show protein sequence identity table
cat output.distmat

## What have we done so far?
- Explored metadata for all genomes for the genus Drosophila
- Downloaded genomes for the species Drosophila pseudoobscura
- Downloaded the *cglr1* gene for *Drosophila pseudoobscura*
- Downloaded a taxonomically filtered ortholog set for the *cglr1* gene

## Links<a class="anchor" id="Links"></a>

### NCBI Datasets main resources

- NCBI Datasets homepage: [https://www.ncbi.nlm.nih.gov/datasets/](https://www.ncbi.nlm.nih.gov/datasets/)
- GitHub: [https://github.com/ncbi/datasets](https://github.com/ncbi/datasets)


### Download and installation instructions (CLI)

- Instructions:  
 [https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/)
  
### Tutorials, how-to guides and past workshops
 
- How-to guides (short, one-line CLI tasks):   
[https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/)
- Tutorials (multi-task, longer tutorials, mostly based on feedback or questions we get from users): [https://www.ncbi.nlm.nih.gov/datasets/docs/v2/tutorials/](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/tutorials/)
- Past training sessions and workshops (Jupyter notebooks used in previous *datasets* training events): [https://github.com/ncbi/datasets/tree/master/training](https://github.com/ncbi/datasets/tree/master/training)

### How to get help

- Email the helpdesk: [info@ncbi.nlm.nih.gov](mailto:info@ncbi.nlm.nih.gov)
- GitHub: [https://github.com/ncbi/datasets](https://github.com/ncbi/datasets)
- Yellow feedback button on our pages 