In [1]:
from pathlib import Path

# Changelog

## 02-08-2021

* Include parent for rank in the nodes.dmp so that it is compatible with CAT parsing the names and nodes. This was affecting both `CAT prepare` and `CAT contigs`.
See also [some info here](https://github.com/dutilh/CAT/issues/60#issuecomment-890135377)

* Change 'domain' to 'superkingdom' for 'd__Bacteria' and 'd__Archaea' **only when writing** `nodes.dmp`. This is for compatibility with `CAT add_names`, which uses official NCBI ranks.

## 27-07-2021

* Fix separators when writing `nodes.dmp` and `names.dmp` to official NCBI `\t|\t` style.

Because:

`CAT prepare` is throwing an error when parsing the `nodes.dmp`

```
[2021-07-26 16:37:20] DIAMOND database constructed.
[2021-07-26 16:37:20] Loading file ./CAT_taxonomy.2021-07-26/nodes.dmp.
Traceback (most recent call last):
  File "/home/nikos/miniconda3/envs/ngd/bin/CAT", line 84, in <module>
    main()
  File "/home/nikos/miniconda3/envs/ngd/bin/CAT", line 62, in main
    prepare.run()
  File "/home/nikos/miniconda3/envs/ngd/share/cat-5.2.3-1/CAT_pack/prepare.py", line 837, in run
    run_existing(args)
  File "/home/nikos/miniconda3/envs/ngd/share/cat-5.2.3-1/CAT_pack/prepare.py", line 826, in run_existing
    prepare(step_list, args)
  File "/home/nikos/miniconda3/envs/ngd/share/cat-5.2.3-1/CAT_pack/prepare.py", line 472, in prepare
    taxid2parent, taxid2rank = tax.import_nodes(
  File "/home/nikos/miniconda3/envs/ngd/share/cat-5.2.3-1/CAT_pack/tax.py", line 21, in import_nodes
    rank = line[4]
IndexError: list index out of range
```

In [2]:
ncbi_field_sep = '\t|\t'
ncbi_line_sep = '\t|\n'

Before the fix:

  * My `nodes.dmp` used a plain `\t` as a separator, which yields only 3 fields per line, so `line[4]` does not exist.
  * The official `nodes.dmp` uses `\t|\t` as a field terminator, as seen in their [readme.txt](https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt).
  * `CAT` splits on `\t` and uses indexes 0, 2 and 4.
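
A minimal sketch of why the separator matters. The example row (`30`, `11`, `class`) is made up for illustration, and this only mimics the split-and-index pattern described above; it is not CAT's actual code.

```python
ncbi_field_sep = '\t|\t'
ncbi_line_sep = '\t|\n'

# NCBI-style line: the pipes end up at the odd indexes after splitting on tabs
ncbi_line = ncbi_field_sep.join(['30', '11', 'class']) + ncbi_line_sep
ncbi_fields = ncbi_line.split('\t')
taxid, parent, rank = ncbi_fields[0], ncbi_fields[2], ncbi_fields[4]

# Plain tab-separated line (the situation before the fix): only 3 fields
plain_line = '\t'.join(['30', '11', 'class']) + '\n'
plain_fields = plain_line.split('\t')

print(len(ncbi_fields), len(plain_fields))
```

With the plain tab-separated line, asking for index 4 raises the `IndexError` shown in the traceback.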

* Fix ordering of official ranks (`class` before `order`), which was causing a lot of issues.

In [3]:
official_ranks = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']

* Fix offsets to accommodate some overlap that was causing names to be lost.

# Goal

Translate all taxonomies to unique integer identifiers, similar to NCBI.
So for example:

```
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri
```

can become (ids arbitrarily chosen here for demonstration purposes):
```
1; root
   2; domain
      11; phylum
         30; class
            120; order
               1002; family
                  100323; genus
                     2342012; species
```

Then you can make a `names.dmp` like
```
1 | root | | | | | scientific name |
2 | Bacteria | | | | | scientific name |
11 | Proteobacteria | | | | | scientific name |
30 | Gammaproteobacteria | | | | | scientific name |
120 | Enterobacterales | | | | | scientific name |
1002 | Enterobacteriaceae | | | | | scientific name |
100323 | Escherichia | | | | | scientific name |
2342012 | Escherichia flexneri | | | | | scientific name |
```

and a `nodes.dmp`

```
1 | - | no rank (root)
2 | 1 | domain 
11 | 2 | phylum
30 | 11 | class
120 | 30 | order
1002 | 120 | family
100323 | 1002 | genus
2342012 | 100323 | species
```

# Required input

I am using GTDB `v202`.

Data used are available on the [GTDB FTP site](https://data.gtdb.ecogenomic.org/releases/release202/202.0/).

The required inputs are the taxonomy tables:

  - Archaea: [ar122_taxonomy_r202.tsv](https://data.gtdb.ecogenomic.org/releases/release202/202.0/ar122_taxonomy_r202.tsv)
  - Bacteria: [bac120_taxonomy_r202.tsv](https://data.gtdb.ecogenomic.org/releases/release202/202.0/bac120_taxonomy_r202.tsv)
  
In the notebooks here I assume these have been downloaded inside the `data/gtdb_info` directory.
Run these commands to create the same structure:

```
mkdir -p data/gtdb_info 
wget -P data/gtdb_info https://data.gtdb.ecogenomic.org/releases/release202/202.0/ar122_taxonomy_r202.tsv 
wget -P data/gtdb_info https://data.gtdb.ecogenomic.org/releases/release202/202.0/bac120_taxonomy_r202.tsv
```


# STEP 1. Get all unique taxonomies for both bacteria and archaea.

Let's count how many unique entries are in each level so we can decide on offsets.

The idea is to try to accommodate future expansions of the tree.
We don't just assign integers at random, but we define a certain pool of values that 
a certain taxid, for a certain taxonomy rank, can take.

For example, now there are 2 domains `Bacteria` and `Archaea`. What happens if there is 
a major scientific breakthrough and there is another domain discovered - let's call them `Moderna`?

- If everything was just a straight up count then we would grab the next integer available.

Assuming we reach `234519` at the end of this exercise then the new domain `Moderna` will 
be assigned a value of `234520`. Although this is technically the easiest way of doing it, 
I'd like to have some logic built in. Because "Give me domains" would become
```
1 Bacteria
2 Archaea
234520 Moderna
```

My OCD-self can't handle that :)

- Instead I will go for offsetting all levels from certain values. 

E.g. I will reserve values `1...10` for domains. Then our new `Moderna` domain will get a nice `3` when that 
time comes and "Give me domains" will be as beautiful as 

```
1 Bacteria
2 Archaea
3 Moderna (<-would you look at that)
```

Of course this makes no difference whatsoever, and things are bound to get weird higher up the 
taxonomy. It is much more likely that new genera and species will be discovered, heck even phyla,
so at this point it is hard to make proper choices. Which leads to the inevitable question:

- How many bacterial species are there?

So do you offset species at `100000`, `1000000` or `10000000000000` to accommodate everything?

For now, I will ignore this and go count some stuff so I can ballpark it.
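
To make the reserved-pool idea concrete, here is a minimal sketch. The helper `next_free_id` is hypothetical (it is not used anywhere else in this notebook), and the `1...10` pool is the example range from the text, not the final choice.

```python
def next_free_id(used_ids, first, last):
    """Return the smallest unused id in the inclusive pool [first, last]."""
    for candidate in range(first, last + 1):
        if candidate not in used_ids:
            return candidate
    raise ValueError("Pool [{}, {}] is exhausted".format(first, last))

# Example pool for domains, as in the text above
domain_pool = (1, 10)
domains = {1: 'Bacteria', 2: 'Archaea'}

# A newly discovered domain gets the next id from the domain pool,
# not the next integer after the last species
new_id = next_free_id(domains, *domain_pool)
domains[new_id] = 'Moderna'
print(domains)
```

The price of this scheme is the `ValueError` branch: once a pool is full, the offsets have to be redrawn.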


In [4]:
archaea_tax = Path("data/gtdb_info/ar122_taxonomy_r202.tsv")

In [5]:
def get_rank_level(rank_string):
    # Map the GTDB prefix (e.g. 'p__') to the official rank name
    prefix_to_rank = {
        'd__': 'domain',
        'p__': 'phylum',
        'c__': 'class',
        'o__': 'order',
        'f__': 'family',
        'g__': 'genus',
        's__': 'species',
    }
    try:
        return prefix_to_rank[rank_string[:3]]
    except KeyError:
        # Fail loudly here: silently returning None only surfaces
        # later as a confusing KeyError in the caller
        raise ValueError("Unknown rank for: {}".format(rank_string))
        

In [6]:
def create_entries_dic(taxonomy_fp):
    '''
    Parse a taxonomy file into a dictionary
    
    This is for getting each unique entry to use as
    a name and assign a unique id to.
    
    Official rank names are used as keys, their members 
    are prefixed with 'p__', 'f__' etc.
    
    
    Return:
      unique_ids: dict: Keys are strings, values are sets
                      {
                        'domain': {
                            'd__Archaea',
                            'd__Bacteria'
                        },
                        'phylum': {
                            'p__phylum1',
                            'p__phylum2'
                        },
                        ...
                      }
    '''
    
    # One independent set per rank. ([[]] * n would make every key share
    # the same list object.) Sets deduplicate the entries as we go.
    unique_ids = {rank: set() for rank in official_ranks}
    
    total_entries = 0
    
    with open(taxonomy_fp, 'r') as fin:
        for line in fin:
            total_entries += 1
            # Each line is: <genome accession>\t<semicolon-separated lineage>
            fields = [f.strip() for f in line.split('\t')]
            lineage = fields[1]
            for rank in lineage.split(';'):
                level = get_rank_level(rank)
                unique_ids[level].add(rank)
    
    print("Total entries: {}".format(total_entries))
    
    return unique_ids
    

In [7]:
archaeal_entries = create_entries_dic(archaea_tax)

Total entries: 4316


In [8]:
for rank, entry in archaeal_entries.items():
    print(rank, len(entry))

domain 1
phylum 20
class 51
order 117
family 337
genus 851
species 2339


In [9]:
bacteria_tax = Path("data/gtdb_info/bac120_taxonomy_r202.tsv")

In [10]:
bacterial_entries = create_entries_dic(bacteria_tax)

Total entries: 254090


In [12]:
for rank, entry in bacterial_entries.items():
    print(rank ,len(entry))

domain 1
phylum 149
class 368
order 1195
family 2927
genus 12037
species 45555


These counts can also be checked on the GTDB webpage. There are a few differences for archaea, though:
I count 1 extra phylum, 4 extra classes, 1 extra order and 1 extra family.

## Decision point

Well, hello arbitrariness. 

I am being generous for families, genera, and species assuming that, lower in the tree,
things are going to be fairly stable. (_Laughs in future taxonomics_)

* Domains `1 + 1 = 2`  

    - I will use integers `[2,5]` (4 domains)

* Phyla `20 + 149 = 169`

    - I will use integers `[6,500]` (495 phyla)
    
* Classes `51 + 368 = 419` 

    - I will use integers `[501, 1500]` (1000 classes)
        
* Orders `117 + 1195 = 1312`

    - I will use integers `[1501,5000]` (3500 orders)

* Families `337 + 2927 = 3264`

    - I will use integers `[5001, 20000]` (15000 families)

* Genera `851 + 12037 = 12888`

    - I will use integers `[20001, 80000]` (60000 genera)
    
* Species `2339 + 45555 = 47894`

    - I will use integers `[80001, 1000000]` (920000 species)
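
A quick sanity check of these choices, written as a sketch with the counts and pools copied from the list above: each pool must be large enough for the current counts, and consecutive pools must not overlap.

```python
# Observed counts (archaea + bacteria) and the inclusive id pools chosen above
counts = {'domain': 2, 'phylum': 169, 'class': 419, 'order': 1312,
          'family': 3264, 'genus': 12888, 'species': 47894}
pools = {'domain': (2, 5), 'phylum': (6, 500), 'class': (501, 1500),
         'order': (1501, 5000), 'family': (5001, 20000),
         'genus': (20001, 80000), 'species': (80001, 1000000)}

# Both bounds are inclusive, hence the + 1
capacities = {rank: last - first + 1 for rank, (first, last) in pools.items()}

# Every rank must fit in its pool
for rank, capacity in capacities.items():
    assert counts[rank] <= capacity, rank
    print("{:8s} {:>6d} of {:>6d} ids used".format(rank, counts[rank], capacity))

# Consecutive pools must not overlap
bounds = list(pools.values())
for (_, prev_last), (next_first, _) in zip(bounds, bounds[1:]):
    assert prev_last < next_first
```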

## Merge the dictionaries

One would assume that you can make separate offsets for bacteria and archaea but I am being lazy.

In [13]:
all_entries = {}
for rank in archaeal_entries:
    all_entries[rank] = archaeal_entries[rank].union(bacterial_entries[rank])

In [14]:
sum([len(v) for v in all_entries.values()])

65948

In [15]:
# Double check with calculations from above
for rank, entry in all_entries.items():
    print(rank ,len(entry))

domain 2
phylum 169
class 419
order 1312
family 3264
genus 12888
species 47894


# STEP 2. Create the names.dmp

In [16]:
offsets = dict(zip(official_ranks, [2, 6, 501, 1501, 5001, 20001, 80001]))

In [17]:
names_dic = {}
for rank in all_entries:
    for i, entry in enumerate(all_entries[rank]):
        # add the offset of the rank to the rolling counter
        uid = i + offsets[rank]
        names_dic[uid] = entry

In [18]:
len(names_dic)

65948
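
One thing to keep in mind: each `all_entries[rank]` is a set, and Python does not guarantee the same iteration order for a set of strings across interpreter runs (string hashing is randomized by default). Within one session the ids are consistent, but rerunning the notebook may hand out different ids to the same names. If reproducible ids matter, one option is to sort the entries first. A sketch on toy data (the two-rank `toy_entries` and `toy_offsets` are made up for illustration):

```python
toy_entries = {
    'domain': {'d__Bacteria', 'd__Archaea'},
    'phylum': {'p__UBA9089', 'p__Proteobacteria'},
}
toy_offsets = {'domain': 2, 'phylum': 6}

toy_names = {}
for rank, entries in toy_entries.items():
    # sorted() fixes the order, so the same name always gets the same id
    for i, entry in enumerate(sorted(entries)):
        toy_names[i + toy_offsets[rank]] = entry

print(toy_names)
```

Note that sorting only keeps ids stable for a fixed set of names; adding a new name in a later release can still shift the ids that come after it alphabetically.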

The standard names.dmp ships like 
```
uid | name | | | | | scientific name |
uid | name | | | | | some other comment|
```

What is important here is that:

- This is a tab-separated file, with a `|` sitting between the tabs. Splitting on `\t` will give you these fields (0-based index):
    - 0: uid
    - 2: name
    - 6: the comment. Based on equality to the string `'scientific name'` you can filter and grab the official name for NCBI.
        
This is how `CAT` imports the information [as seen here](https://github.com/dutilh/CAT/blob/e41ebd66a059b67b43c8e141002f37b3df755509/CAT_pack/tax.py#L29).

So this is what we will mimic here
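
Before writing the file, a minimal sketch of that filtering on toy data. This mimics the split-and-index pattern only and is not CAT's actual code. The `'some alias'` row is invented, to show a line that is skipped because it is not a scientific name.

```python
import io

ncbi_field_sep = '\t|\t'
ncbi_line_sep = '\t|\n'

def make_line(fields):
    return ncbi_field_sep.join(fields) + ncbi_line_sep

# In-memory stand-in for a names.dmp file
toy_names_dmp = io.StringIO(
    make_line(['1', 'root', '', 'scientific name'])
    + make_line(['2', 'd__Bacteria', '', 'scientific name'])
    + make_line(['2', 'some alias', '', 'synonym'])
)

taxid2name = {}
for line in toy_names_dmp:
    fields = line.split('\t')
    # Pipes sit at the odd indexes; the data sits at 0, 2, 4, 6
    if fields[6] == 'scientific name':
        taxid2name[fields[0]] = fields[2]

print(taxid2name)
```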

In [19]:
# Create the taxonomy dir (and its parent 'results') if it does not exist yet
taxonomy_dir = Path('results/taxonomy')
taxonomy_dir.mkdir(parents=True, exist_ok=True)

In [20]:
# Specify the path to the output file
names_dmp = taxonomy_dir / 'names.dmp'

In [22]:
# Dump entries in the dump
with open(names_dmp, 'w') as fout:
    # Write the root first
    root_string = ncbi_field_sep.join(['1', 'root', '', 'scientific name']) + ncbi_line_sep
    fout.write(root_string)
    for uid in names_dic:
        uid_string = ncbi_field_sep.join([str(uid), names_dic[uid], '', 'scientific name']) + ncbi_line_sep
        fout.write(uid_string)

# STEP 3. Create the nodes.dmp

This is a bit trickier. You can probably get that from the tree itself in one go.

I will use a less elegant approach and parse this from the file.

>A word of caution: Human readable names are not unique!

Exhibit 1: [UBA9089](https://gtdb.ecogenomic.org/searches?s=al&q=UBA9089)
  - appears as phylum, class, order, family, genus
  
Exhibit 2: [RBG-13-43-22](https://gtdb.ecogenomic.org/searches?s=al&q=RBG-13-43-22)
  - appears as order, family, genus
 
or [undeniable Xzibit 3](https://images.fanpop.com/images/image_uploads/Xzibit-pimp-my-ride-235615_281_211.jpg) 

The solution, i.e. using the prefixed string `'p__UBA9089'`, has been implemented in the `create_entries_dic` function.

This allows inverting the dictionary, so that names become keys and ids become values, which simplifies things.
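
A small sketch of why the prefix matters for the inversion. The ids are made up (one from each of the phylum, class and order pools) and the name is the one from Exhibit 1.

```python
# Same human-readable name at three ranks, kept apart by the prefix
prefixed = {6: 'p__UBA9089', 501: 'c__UBA9089', 1501: 'o__UBA9089'}
# What we would have without the prefix
stripped = {uid: name[3:] for uid, name in prefixed.items()}

inv_prefixed = {v: k for k, v in prefixed.items()}
inv_stripped = {v: k for k, v in stripped.items()}

# With prefixes the inversion is lossless; without them,
# later ids silently overwrite earlier ones
print(len(prefixed), len(inv_prefixed), len(inv_stripped))
```

This is the same check as the `assert` on the real dictionaries below: equal lengths mean no name was overwritten.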

In [23]:
inv_names_dic = {v:k for k,v in names_dic.items()}

In [24]:
len(inv_names_dic)

65948

In [25]:
assert len(names_dic) == len(inv_names_dic)

Sooooo... that looks ok! Makes me wonder if I needed to go through all the fuss for unique numeric ids anyway...

In [26]:
nodes_dmp = 'results/taxonomy/nodes.dmp'

In [None]:
# A set gives constant-time membership checks (a list gets slow at ~66k entries)
seen_ranks = set()

root_node_string = ncbi_field_sep.join(['1', '1', 'no rank']) + ncbi_line_sep
with open(nodes_dmp, 'w') as fout:
    ## Include root information so CAT parsing works
    fout.write(root_node_string)
    
    for taxonomy_fp in [archaea_tax, bacteria_tax]:
        with open(taxonomy_fp, 'r') as fin:
            for line in fin:
                fields = [f.strip() for f in line.split('\t')]
                lineage = fields[1]
                lineage_list = lineage.split(';')
                for i, rank in enumerate(lineage_list):
                    if rank in seen_ranks:
                        continue
                    child_id = inv_names_dic[rank]
                    if i == 0:
                        # Domains hang off the root. Use superkingdom instead of
                        # domain for archaea and bacteria to work better with
                        # CAT add_names.
                        parent_id = 1
                        rank_name = 'superkingdom'
                    else:
                        # The parent is the previous entry in the lineage
                        parent_id = inv_names_dic[lineage_list[i-1]]
                        rank_name = official_ranks[i]
                    rank_line = ncbi_field_sep.join(map(str, [child_id, parent_id, rank_name])) + ncbi_line_sep
                    fout.write(rank_line)
                    seen_ranks.add(rank)
        print("Parsed taxonomy: {}".format(taxonomy_fp))
                
        

Parsed taxonomy: data/gtdb_info/ar122_taxonomy_r202.tsv


In [64]:
len(set(seen_ranks))

65948
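
Beyond counting, it may be worth checking that every parent id in `nodes.dmp` also appears as a node, since a dangling parent would break any walk up to the root. A minimal sketch, assuming the `\t|\t` layout written above; `find_orphan_parents` is a hypothetical helper, shown here on in-memory toy data:

```python
import io

def find_orphan_parents(handle):
    """Return parent ids that never appear as a node id themselves."""
    node_ids, parent_ids = set(), set()
    for line in handle:
        fields = line.split('\t')
        node_ids.add(fields[0])    # index 0: taxid
        parent_ids.add(fields[2])  # index 2: parent taxid
    return parent_ids - node_ids

good_nodes = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "2\t|\t1\t|\tsuperkingdom\t|\n"
    "6\t|\t2\t|\tphylum\t|\n"
)
# Parent 999 is never defined as a node
bad_nodes = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "6\t|\t999\t|\tphylum\t|\n"
)

good_result = find_orphan_parents(good_nodes)
bad_result = find_orphan_parents(bad_nodes)
print(good_result, bad_result)
```

On the real file this would be called as `with open(nodes_dmp) as fin: find_orphan_parents(fin)`, and an empty set is the result to hope for.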

# Obsolete functions etc.

You never know..


In [None]:
def strip_rank_prefix(rank_string):
    return rank_string[3:]

def create_entries_dic(taxonomy_fp):
    '''
    '''
    entries_dic = dict(zip(official_ranks, [[]]*len(official_ranks)))
    total_entries = 0
    with open(taxonomy_fp, 'r') as fin:
        for line in fin:
            total_entries +=1
            
            fields = [f.strip() for f in line.split('\t')]
            lineage = fields[1]
            for rank in lineage.split(';'):
                level = get_rank_level(rank)
                name = strip_rank_prefix(rank)
                if len(entries_dic[level]) == 0:
                    entries_dic[level] = [name]
                else:
                    entries_dic[level].append(name)
    unique_ids = {k: set(v) for k,v in entries_dic.items()}
    print("Total entries: {}".format(total_entries))
    return unique_ids
    