In [1]:
from pathlib import Path

# Goal

Translate all taxonomies to unique integer identifiers, similar to NCBI.
So for example:

```
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri
```

can become (ids arbitrarily chosen here for demonstration purposes)
```
1; root
   2; domain
      11; phylum
         30; class
            120; order
               1002; family
                  100323; genus
                     2342012; species
```

Then you can make a `names.dmp` like
```
1 | root | | | | | scientific name |
2 | Bacteria | | | | | scientific name |
11 | Proteobacteria | | | | | scientific name |
30 | Gammaproteobacteria | | | | | scientific name |
120 | Enterobacterales | | | | | scientific name |
1002 | Enterobacteriaceae | | | | | scientific name |
100323 | Escherichia | | | | | scientific name |
2342012 | Escherichia flexneri | | | | | scientific name |
```

and a `nodes.dmp`

```
1 | - | no rank (root)
2 | 1 | domain 
11 | 2 | phylum
30 | 11 | class
120 | 30 | order
1002 | 120 | family
100323 | 1002 | genus
2342012 | 100323 | species
```
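
As a rough sketch of how the lineage relates to the `nodes.dmp` rows, the snippet below walks the example lineage and emits one `(taxid, parent_taxid, rank)` row per level. The ids are the arbitrary demonstration values from above; `example_ids`, `rank_names` and `lineage_to_rows` are hypothetical names used only for this illustration.

```python
# Arbitrary demonstration ids, copied from the example above
example_ids = {
    "d__Bacteria": 2,
    "p__Proteobacteria": 11,
    "c__Gammaproteobacteria": 30,
    "o__Enterobacterales": 120,
    "f__Enterobacteriaceae": 1002,
    "g__Escherichia": 100323,
    "s__Escherichia flexneri": 2342012,
}
# Ranks in the order they appear in a lineage string
rank_names = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def lineage_to_rows(lineage, ids, root_id=1):
    """Return (taxid, parent_taxid, rank) for each level of a lineage string."""
    rows = []
    parent = root_id
    for rank_name, taxon in zip(rank_names, lineage.split(";")):
        rows.append((ids[taxon], parent, rank_name))
        parent = ids[taxon]  # this level is the parent of the next one
    return rows


lineage = ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
           "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
           "s__Escherichia flexneri")
for row in lineage_to_rows(lineage, example_ids):
    print(*row, sep=" | ")
```

Each level simply points at the level before it, with the domain pointing at the root.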

# Required input

I am using GTDB `v202`.

Data used are available on the [GTDB FTP site](https://data.gtdb.ecogenomic.org/releases/release202/202.0/).

The required inputs are the taxonomy tables:

  - Archaea: [ar122_taxonomy_r202.tsv](https://data.gtdb.ecogenomic.org/releases/release202/202.0/ar122_taxonomy_r202.tsv)
  - Bacteria: [bac120_taxonomy_r202.tsv](https://data.gtdb.ecogenomic.org/releases/release202/202.0/bac120_taxonomy_r202.tsv)
  
In the notebooks here I assume these have been downloaded into the `data/gtdb_info` directory.
Run these commands to recreate the same structure:

```
mkdir -p data/gtdb_info 
wget -P data/gtdb_info https://data.gtdb.ecogenomic.org/releases/release202/202.0/ar122_taxonomy_r202.tsv 
wget -P data/gtdb_info https://data.gtdb.ecogenomic.org/releases/release202/202.0/bac120_taxonomy_r202.tsv
```
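
The parsing code further down reads these tables line by line, splits each line on tabs and takes the lineage from the second field. A minimal sketch of that step on a single line follows; the accession `EXAMPLE_ACCESSION` is a made-up placeholder and `parse_taxonomy_line` is a hypothetical helper, not something shipped with GTDB.

```python
def parse_taxonomy_line(line):
    """Split one taxonomy table line into (accession, list of rank-prefixed taxa)."""
    accession, lineage = [f.strip() for f in line.split("\t")][:2]
    return accession, lineage.split(";")


# Placeholder accession; the lineage is the example from the Goal section
line = ("EXAMPLE_ACCESSION\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
        "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
        "s__Escherichia flexneri\n")
accession, taxa = parse_taxonomy_line(line)
print(accession, len(taxa), taxa[-1])
```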


# STEP 1. Get all unique taxonomies for both bacteria and archaea.

Let's count how many unique entries are in each level so we can decide on offsets.

The idea is to try to accommodate future expansions of the tree.
We don't just assign integers at random; instead we define a pool of values that
the taxid for a given taxonomic rank can take.

For example, now there are 2 domains `Bacteria` and `Archaea`. What happens if there is 
a major scientific breakthrough and there is another domain discovered - let's call them `Moderna`?

- If everything was just a straight up count then we would grab the next integer available.

Assuming we reach `234519` at the end of this exercise, the new domain `Moderna` would
be assigned a value of `234520`. Although this is technically the easiest way of doing it,
I'd like to have some logic built in, because "Give me domains" would become
```
1 Bacteria
2 Archaea
234520 Moderna
```

My OCD-self can't handle that :)

- Instead I will go for offsetting all levels from certain values. 

E.g. I will reserve values `1...10` for domains. Then our new `Moderna` domain will get a nice `3` when that 
time comes and "Give me domains" will be as beautiful as 

```
1 Bacteria
2 Archaea
3 Moderna (<-would you look at that)
```

Of course this makes no difference whatsoever, and things are bound to get weird higher up the 
taxonomy. It is much more likely that new genera and species will be discovered, heck even phyla,
so at this point it is hard to make proper choices. Which leads to the inevitable question:

- How many bacterial species are there?

So do you offset species at `100000`, `1000000` or `10000000000000` to accommodate everything?

Currently, I will choose to ignore this and go count some stuff so I can ballpark it.
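
To make the idea of reserved pools concrete, here is a minimal sketch of handing out the next free id inside a pool. The pool `1...10` for domains is the one from the example above; `next_free_id` and the `domains` dictionary are hypothetical names for illustration only.

```python
def next_free_id(pool_start, pool_end, used):
    """Return the smallest unused id in [pool_start, pool_end]; raise if the pool is full."""
    for candidate in range(pool_start, pool_end + 1):
        if candidate not in used:
            return candidate
    raise ValueError("Pool {}..{} is exhausted".format(pool_start, pool_end))


# Domains live in the reserved pool 1..10
domains = {1: "Bacteria", 2: "Archaea"}
new_id = next_free_id(1, 10, set(domains))
domains[new_id] = "Moderna"
print(domains)
```

The trade-off is that a pool can run out, which is exactly why the sizes need ballparking first.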


In [3]:
archaea_tax = Path("data/gtdb_info/ar122_taxonomy_r202.tsv")

In [29]:
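# NOTE: 'order' is listed before 'class' here, which is not the hierarchical
# order (class sits above order). The offsets in STEP 2 are paired with this
# list order, so this list should not be indexed by lineage position.
official_ranks = ['domain', 'phylum', 'order', 'class', 'family', 'genus', 'species']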

In [51]:
def strip_rank_prefix(rank_string):
    return rank_string[3:]

In [16]:
def get_rank_level(rank_string):
    if rank_string.startswith('d__'):
        return 'domain'
    elif rank_string.startswith('p__'):
        return 'phylum'
    elif rank_string.startswith('c__'):
        return 'class'
    elif rank_string.startswith('o__'):
        return 'order'
    elif rank_string.startswith('f__'):
        return 'family'
    elif rank_string.startswith('g__'):
        return 'genus'
    elif rank_string.startswith('s__'):
        return 'species'
    else:
        # Fail loudly instead of returning None, which would only surface later as a KeyError
        raise ValueError("Unknown rank for: {}".format(rank_string))

In [45]:
def create_entries_dic(taxonomy_fp):
    '''
    Collect the unique taxon names per rank from a GTDB taxonomy table.

    Returns a dict mapping each rank to a set of names (rank prefix stripped).
    '''
    # One independent set per rank; [[]] * n would share a single list object
    unique_ids = {rank: set() for rank in official_ranks}
    total_entries = 0
    with open(taxonomy_fp, 'r') as fin:
        for line in fin:
            total_entries += 1
            fields = [f.strip() for f in line.split('\t')]
            lineage = fields[1]
            for rank in lineage.split(';'):
                level = get_rank_level(rank)
                unique_ids[level].add(strip_rank_prefix(rank))
    print("Total entries: {}".format(total_entries))
    return unique_ids
    

In [46]:
archaeal_entries = create_entries_dic(archaea_tax)

Total entries: 4316


In [47]:
for rank, entry in archaeal_entries.items():
    print(rank, len(entry))

domain 1
phylum 20
order 117
class 51
family 337
genus 851
species 2339


In [48]:
bacteria_tax = Path("data/gtdb_info/bac120_taxonomy_r202.tsv")

In [49]:
bacterial_entries = create_entries_dic(bacteria_tax)

Total entries: 254090


In [50]:
for rank, entry in bacterial_entries.items():
    print(rank ,len(entry))

domain 1
phylum 149
order 1195
class 368
family 2927
genus 12037
species 45555


Or go see that on the webpage. There are a few differences, though, for archaea.
I count 1 extra phylum, 1 extra order, 4 extra classes, 1 extra family.

## Decision point

Well hello arbitrariness. I am being generous for families, genera, and species assuming that, lower in the tree,
things are going to be fairly stable. (_Laughs in future taxonomics_)

* Domains `1 + 1 = 2`

    - I will use integers `[2,5]` (4 domains)

* Phyla `20 + 149 = 169`

    - I will use integers `[6,500]` (495 phyla)

* Orders `117 + 1195 = 1312`

    - I will use integers `[501,5000]` (4500 orders)

* Classes `51 + 368 = 419`

    - I will use integers `[5001, 6000]` (1000 classes)

* Families `337 + 2927 = 3264`

    - I will use integers `[6001, 20000]` (14000 families)

* Genera `851 + 12037 = 12888`

    - I will use integers `[20001, 80000]` (60000 genera)

* Species `2339 + 45555 = 47894`

    - I will use integers `[80001, 1000000]` (920000 species)
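
A quick sanity check of these choices, as a sketch: the ranges and counts below are copied from the list above, and the variable names (`pools`, `counts`, `capacity`) are only for illustration. It computes the size of each inclusive range and checks that the pools follow each other without gaps or overlaps.

```python
# Inclusive id ranges and current taxon counts (archaea + bacteria), per rank
pools = {
    "domain": (2, 5),
    "phylum": (6, 500),
    "order": (501, 5000),
    "class": (5001, 6000),
    "family": (6001, 20000),
    "genus": (20001, 80000),
    "species": (80001, 1000000),
}
counts = {"domain": 2, "phylum": 169, "order": 1312, "class": 419,
          "family": 3264, "genus": 12888, "species": 47894}

# An inclusive range [start, end] holds end - start + 1 integers
capacity = {rank: end - start + 1 for rank, (start, end) in pools.items()}

# Each pool must start right after the previous one ends (no overlap, no gap)
ranges = sorted(pools.values())
contiguous = all(prev[1] + 1 == nxt[0] for prev, nxt in zip(ranges, ranges[1:]))

for rank in pools:
    spare = capacity[rank] - counts[rank]
    print("{:8s} capacity={:7d} used={:6d} spare={:7d}".format(
        rank, capacity[rank], counts[rank], spare))
print("Pools contiguous and non-overlapping:", contiguous)
```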

## Merge the dictionaries

One would assume that you can make separate offsets for bacteria and archaea but I am being lazy.

In [60]:
all_entries = {}
for rank in archaeal_entries:
    all_entries[rank] = archaeal_entries[rank].union(bacterial_entries[rank])

In [82]:
sum([len(v) for v in all_entries.values()])

65948

In [61]:
# Double check with calculations from above
for rank, entry in all_entries.items():
    print(rank ,len(entry))

domain 2
phylum 169
order 1312
class 419
family 3264
genus 12888
species 47894


# STEP 2. Create the names.dmp

In [71]:
offsets = dict(zip(official_ranks, [2,6,501, 5001, 6001, 20001, 80001]))

In [72]:
names_dic = {}
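for rank in all_entries:
    # Sort first: set iteration order is not stable across runs, and sorting
    # keeps these ids aligned with the prefixed names used for nodes.dmp later
    for i, entry in enumerate(sorted(all_entries[rank])):
        # add the offset of the rank to the rolling counter
        uid = i + offsets[rank]
        names_dic[uid] = entry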

In [81]:
len(names_dic)

65948

The standard names.dmp ships like 
```
uid | name | | | | | scientific name |
uid | name | | | | | some other comment|
```

What is important here is that:

- This is a tab-separated file. Splitting on `\t` will give you fields (0-based indexed)
    - 0: uid
    - 1: name
    - 6: the comment. By testing for equality with the string `'scientific name'` you can filter and grab the official NCBI name.
        
This is how `CAT` imports the information [as seen here](https://github.com/dutilh/CAT/blob/e41ebd66a059b67b43c8e141002f37b3df755509/CAT_pack/tax.py#L29).
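
A minimal sketch of that filtering step, assuming the field positions listed above (uid at index 0, name at index 1, comment at index 6). This is not CAT's actual code; `load_scientific_names` is a hypothetical helper, and the second `Bacteria` line with a `synonym` comment is made up to show the filter at work.

```python
def load_scientific_names(lines):
    """Map uid -> name for lines whose comment field is 'scientific name'."""
    uid2name = {}
    for line in lines:
        # Drop the trailing newline so the last field compares cleanly
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] == "scientific name":
            uid2name[int(fields[0])] = fields[1]
    return uid2name


sample = [
    "1\troot\t-\t-\t-\t-\tscientific name\n",
    "2\tBacteria\t-\t-\t-\t-\tscientific name\n",
    "2\tsome alias\t-\t-\t-\t-\tsynonym\n",  # made-up non-official entry
]
print(load_scientific_names(sample))
```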

So this is what we will mimic here

In [73]:
# Create the taxonomy dir (and any missing parent dirs) if it is not there yet
taxonomy_dir = Path('results/taxonomy')
taxonomy_dir.mkdir(parents=True, exist_ok=True)

In [76]:
# Specify the path to the output file
names_dmp = taxonomy_dir / Path('names.dmp')

In [77]:
# Dump entries in the dump
with open(names_dmp, 'w') as fout:
    # Write the root first
    fout.write("1\troot\t-\t-\t-\t-\tscientific name\n")
    for uid in names_dic:
        fout.write("{}\t{}\t-\t-\t-\t-\tscientific name\n".format(uid, names_dic[uid]))

# STEP 3. Create the nodes.dmp

This is a bit trickier. You can probably get that from the tree itself in one go.

I will use a less elegant approach and parse this from the file.

>A word of caution: Human readable names are not unique!

Exhibit 1: [UBA9089](https://gtdb.ecogenomic.org/searches?s=al&q=UBA9089)
  - appears as phylum, class, order, family, genus
  
Exhibit 2: [RBG-13-43-22](https://gtdb.ecogenomic.org/searches?s=al&q=RBG-13-43-22)
  - appears as order, family, genus
 
or [undeniable Xzibit 3](https://images.fanpop.com/images/image_uploads/Xzibit-pimp-my-ride-235615_281_211.jpg) 

In an ideal world I could just inverse the `names_dic` from above like
```
inv_names_dic = {v:k for k,v in names_dic.items()}
```

and do the translations on the fly.
But that would be too easy, of course...
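
A small sketch of why the naive inversion fails when names repeat across ranks. The uids here are made up for illustration; only the name `UBA9089` comes from the exhibits above.

```python
# Hypothetical uids: the same human-readable name at three different ranks
names = {7: "UBA9089", 5002: "UBA9089", 6003: "UBA9089", 20004: "Escherichia"}

inverted = {v: k for k, v in names.items()}

# Later entries silently overwrite earlier ones, so uids are lost
print(len(names), len(inverted))
print(inverted["UBA9089"])
```

Four entries go in and only two come out, with no error raised along the way.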

At this point, I am contemplating life choices, trying to figure a clever way out of it.

Aha! Eureka! I will not strip the rank prefix and instead keep the names as `p__UBA9089`, `c__UBA9089`. Genius! /s

In [86]:
# Modified version of the function to not strip '<rank>__' prefixes
def create_raw_entries_dic(taxonomy_fp):
    '''
    Collect the unique rank-prefixed taxon names per rank from a GTDB taxonomy table.

    Returns a dict mapping each rank to a set of names such as 'p__UBA9089'.
    '''
    # One independent set per rank; [[]] * n would share a single list object
    unique_ids = {rank: set() for rank in official_ranks}
    total_entries = 0
    with open(taxonomy_fp, 'r') as fin:
        for line in fin:
            total_entries += 1
            fields = [f.strip() for f in line.split('\t')]
            lineage = fields[1]
            for rank in lineage.split(';'):
                level = get_rank_level(rank)
                unique_ids[level].add(rank)
    print("Total entries: {}".format(total_entries))
    return unique_ids
    

In [84]:
archaeal_raw = create_raw_entries_dic(archaea_tax)

Total entries: 4316


In [87]:
for rank, entry in archaeal_raw.items():
    print(rank, len(entry))

domain 1
phylum 20
order 117
class 51
family 337
genus 851
species 2339


In [85]:
bacterial_raw = create_raw_entries_dic(bacteria_tax)

Total entries: 254090


In [88]:
for rank, entry in bacterial_raw.items():
    print(rank, len(entry))

domain 1
phylum 149
order 1195
class 368
family 2927
genus 12037
species 45555


Seems legit... On to merging...

In [90]:
raw_entries ={}
for rank in archaeal_raw:
    raw_entries[rank] = archaeal_raw[rank].union(bacterial_raw[rank])

In [91]:
sum([len(v) for v in raw_entries.values()])

65948

In [92]:
raw_names_dic = {}
for rank in raw_entries:
    # Sort first so the ids are reproducible (set iteration order is not stable across runs)
    for i, entry in enumerate(sorted(raw_entries[rank])):
        # add the offset of the rank to the rolling counter
        uid = i + offsets[rank]
        raw_names_dic[uid] = entry

In [96]:
len(raw_names_dic)

65948

In [93]:
# Just don't overwrite yet...
names_dmp = taxonomy_dir / Path('names_raw.dmp')

In [94]:
# Dump entries in the dump
with open(names_dmp, 'w') as fout:
    # Write the root first
    fout.write("1\troot\t-\t-\t-\t-\tscientific name\n")
    for uid in raw_names_dic:
        fout.write("{}\t{}\t-\t-\t-\t-\tscientific name\n".format(uid, raw_names_dic[uid]))

In [95]:
inv_names_dic = {v:k for k,v in raw_names_dic.items()}

In [97]:
len(inv_names_dic)

65948

In [98]:
assert len(raw_names_dic) == len(inv_names_dic)

Sooooo... that looks ok! Makes me wonder if I needed to go through all the fuss for unique numeric ids anyway...

In [116]:
nodes_dmp = 'results/taxonomy/nodes.dmp'

In [119]:
seen_ranks = set()  # set membership is O(1); a list gets slow with ~66k entries
with open(nodes_dmp, 'w') as fout:
    for taxonomy_fp in [archaea_tax, bacteria_tax]:
        with open(taxonomy_fp, 'r') as fin:
            for line in fin:
                fields = [f.strip() for f in line.split('\t')]
                lineage_list = fields[1].split(';')
                for i, rank in enumerate(lineage_list):
                    if rank in seen_ranks:
                        continue
                    child_id = inv_names_dic[rank]
                    # Domains hang off the root (uid 1), everything else off the previous level
                    parent_id = 1 if i == 0 else inv_names_dic[lineage_list[i-1]]
                    # Take the rank from the prefix: official_ranks lists 'order' before
                    # 'class', so indexing it by position would swap those two labels
                    fout.write("{}\t{}\t{}\n".format(child_id, parent_id, get_rank_level(rank)))
                    seen_ranks.add(rank)
        print("Parsed taxonomy: {}".format(taxonomy_fp))
                
        

Parsed taxonomy: data/gtdb_info/ar122_taxonomy_r202.tsv
Parsed taxonomy: data/gtdb_info/bac120_taxonomy_r202.tsv


In [118]:
len(set(seen_ranks))

65948
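
As an extra sanity check, one could verify that every parent id in `nodes.dmp` is itself a known node or the root. A sketch, assuming the three-column tab-separated layout written above; `check_nodes` and `read_nodes` are hypothetical helpers, demonstrated here on in-memory rows that reuse the demonstration ids from the Goal section rather than the real file.

```python
def check_nodes(rows, root_id=1):
    """Return the child ids whose parent is neither the root nor a known child."""
    children = {child for child, _parent, _rank in rows}
    return [child for child, parent, _rank in rows
            if parent != root_id and parent not in children]


def read_nodes(path):
    """Yield (child_id, parent_id, rank) tuples from a three-column nodes file."""
    with open(path) as fin:
        for line in fin:
            child, parent, rank = line.rstrip("\n").split("\t")
            yield int(child), int(parent), rank


rows = [(2, 1, "domain"), (11, 2, "phylum"), (30, 11, "class")]
print(check_nodes(rows))                          # every parent is known
print(check_nodes(rows + [(120, 999, "order")]))  # 999 is not a known node
```

On the real output this would be called as `check_nodes(list(read_nodes(nodes_dmp)))`; the `list(...)` matters because `check_nodes` walks the rows twice.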