# Expanding the phyloDB database with new species

In [1]:
# Load libraries
import pandas as pd

In this notebook, we'll try to update the phyloDB with sequence data from species identified with the FlowCam to widen our taxonomic scope. 
In the [FlowCam analysis notebook](./analysis/Flowcam_Analysis.ipynb) we constructed a list of species that were identified by the FlowCam but not yet present in the phyloDB (see part 4).
Let's start by loading the list of species that we want to add to the phyloDB.

## Data exploration

In [96]:
unincluded_species = pd.read_csv('../data/annotation/taxonomy_phyloDB/unrepresented_flowcam_IDs.txt', header=None, names=['flowcam_ID'])
# The above file was manually curated to only include valid species names!

# Print number of unrepresented species
print('Number of unrepresented species: {}'.format(unincluded_species.shape[0]))
# Display
unincluded_species.head()

Number of unrepresented species: 38


| | flowcam_ID |
| --- | --- |
| 0 | Actinoptychus senarius |
| 1 | Actinoptychus splendens |
| 2 | Bacillaria paxillifer |
| 3 | Bellerochea horologicalis |
| 4 | Ceratium horridum |


Now, let's have a look at the phyloDB. First, let's check the format in which the sequence data is stored. The phyloDB can be downloaded from [here](https://drive.google.com/folderview?id=0B-BsLZUMHrDQfldGeDRIUHNZMEREY0g3ekpEZFhrTDlQSjQtbm5heC1QX2V6TUxBeFlOejQ&usp=sharing).

The phyloDB FASTA file looks like this:

```
>Aazo_0002-NC_014248	CVujK2QLmQnNmXUG9U1Z1mDP9is	'Nostoc azollae' 0708
MPFFTAYVSFYAPLDFFSPQFLPVALYGAECGLYATFITDIKLSDSKNNSLGEQGDFVSWGKDCAFSYSNYRYR
>Aazo_0003-NC_014248	FRYMN9+VEwIYo0oQ2zP4wg0vBew	'Nostoc azollae' 0708
MRQKILAKDLSQTELVQSIDRKGIAEESMGQKVEILLSLIEQ
>Aazo_0004-NC_014248	zYiwrR+n2iXoXj+qe7nzvBqdxn0	'Nostoc azollae' 0708
MTQGKLLEFLENIDLLEHFQLPTKWQNPLEVLPQETVLSEAEFHTLLDTHLPKLGSQQRTRIMEAVAIAFYHQQTDWPVVQTLVCDDAPQLKLLTDNIALCWVDEERNYKKLSAFIACHQKVLDKFLDDFWNYYRDLLPCQDSPSQQTADKLRYKFWKLFHTDSGYQQLDERKPLTLVKISELLYVLEHPELPLHNNPVELGARTMVQRGNISYATQTLEGTQAWDTFMYLVATTRKLGISFFEYIRDRISKVGNIPCLATTFYEKSALNPFGCSWIPHSAP
>Aazo_0005-NC_014248	7M1+86YVQb/siBiS2cyV0iW3SIk	'Nostoc azollae' 0708
MPFFTAYVSFYAPLDFFSPQFLPVALYGAECGIHEQPKGLRADFS
>Aazo_0007-NC_014248	uSkeyTwMw8tI0GERY+sP0BteNHQ	'Nostoc azollae' 0708
MRNSIITSRQVIKYLQFPALAVILGANLAITTLAPKVLAQTTLASFVGNAVTFTCNDSEATIKAKNGPKAIFGTRTIYIGYQQVTSVNKDPRMIRFDNGVRKWCRSDYETTSDDGTGYGLLWNGSGVLYGVFTSTGSQTGNNFWRFAAGRWLPSYGSGGGAKVAVIARINPTTGDVNYATFVSARKLSDGKTNSLVIRGLSWNGTTLTVEADSWSSPRRADRSSMTCSGSSPFKYTAVFTGDLIKVNSATAVTCN
```

Each FASTA header consists of three tab-separated fields: a sequence identifier (which ends in the NCBI accession number of the source sequence, here `NC_014248`), an opaque string that appears to be a second identifier (I couldn't find any documentation on what it is exactly), and the strain name with its strain designation. The sequence itself is stored on the line after the header, as a string of amino acids.
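
To make the header layout concrete, here is a minimal sketch that splits one header line into its fields. It assumes the fields are separated by tabs, as they appear to be in the excerpt above; the names `PhyloDBHeader` and `parse_header` are hypothetical and not part of the phyloDB distribution.

```python
from typing import NamedTuple


class PhyloDBHeader(NamedTuple):
    """The three fields of a phyloDB FASTA header (field names are my own)."""
    sequence_id: str   # e.g. "Aazo_0002-NC_014248"
    opaque_id: str     # the undocumented second field
    strain_name: str   # e.g. "'Nostoc azollae' 0708"


def parse_header(line: str) -> PhyloDBHeader:
    """Split a FASTA header line into its fields (assumes tab separators)."""
    fields = line.rstrip("\n").lstrip(">").split("\t")
    if len(fields) != 3:
        raise ValueError(f"Expected 3 tab-separated fields, got {len(fields)}")
    return PhyloDBHeader(*fields)


header = parse_header(
    ">Aazo_0002-NC_014248\tCVujK2QLmQnNmXUG9U1Z1mDP9is\t'Nostoc azollae' 0708"
)
print(header.sequence_id, header.strain_name)
```

Raising on an unexpected field count is a deliberate choice here: if the real file uses a different separator, the sketch fails loudly instead of silently mis-parsing.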

The annotation file looks like this:

```
Aazo_0012-NC_014248	bwpWeTibFN4gu/sEg8WQ8dpmw+E	'Nostoc azollae' 0708	hypothetical protein
Aazo_0013-NC_014248	BY6QnjspwF1yLZftPQVSvVDTKEE	'Nostoc azollae' 0708	hypothetical protein
Aazo_0014-NC_014248	5ePmiITfOQIuDc7vHskQzT1KF44	'Nostoc azollae' 0708	KaiB domain-containing protein; K08481 circadian clock protein KaiB
Aazo_0015-NC_014248	/akqU8ocnmxVZ14FuEpRkgLvykM	'Nostoc azollae' 0708	twitching motility protein; K02669 twitching motility protein PilT
```

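Each line has the same kind of sequence identifier, the same opaque string, and the same strain name and designation as the FASTA headers, followed by a functional annotation: a product description, in some cases with what looks like a KEGG Orthology identifier (e.g. `K08481`).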

The taxonomy file looks like this:
```
strain_name	peptide_count	taxonomy
Acidilobus saccharovorans 345-15	1499	Archaea;Crenarchaeota;Thermoprotei;Acidilobales;Acidilobaceae;Acidilobus;Acidilobus saccharovorans 345-15
Caldisphaera lagunensis DSM 15908	1527	Archaea;Crenarchaeota;Thermoprotei;Acidilobales;Caldisphaeraceae;Caldisphaera;Caldisphaera lagunensis DSM 15908
Aeropyrum pernix K1	1700	Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Aeropyrum;Aeropyrum pernix K1
Desulfurococcus fermentans DSM 16532	1421	Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Desulfurococcus;Desulfurococcus fermentans DSM 16532
Desulfurococcus kamchatkensis 1221n	1471	Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Desulfurococcus;Desulfurococcus kamchatkensis 1221n
Desulfurococcus mucosus DSM 2162	1371	Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Desulfurococcus;Desulfurococcus mucosus DSM 2162
Ignicoccus hospitalis KIN4/I	1434	Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Ignicoccus;Ignicoccus hospitalis KIN4/I
Ignisphaera aggregans DSM 17230	1992	Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Ignisphaera;Ignisphaera aggregans DSM 17230
```

Here we have the strain name and number, a peptide count, and the full taxonomy of the strain.
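
As an illustration of how this file could be read and queried, the sketch below parses the tab-separated taxonomy table with the standard library and checks whether a species already has a strain in it. The helper names `load_taxonomy` and `is_represented` are hypothetical, and matching on the start of `strain_name` is an assumption about how species and strain names relate.

```python
import csv
import io


def load_taxonomy(handle):
    """Read the tab-separated taxonomy table into a list of dicts."""
    rows = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        rows.append({
            "strain_name": row["strain_name"],
            "peptide_count": int(row["peptide_count"]),
            # Lineage levels are separated by semicolons
            "lineage": row["taxonomy"].split(";"),
        })
    return rows


def is_represented(rows, species_name):
    """True if a strain name equals the species name or starts with it plus a space."""
    return any(
        r["strain_name"] == species_name
        or r["strain_name"].startswith(species_name + " ")
        for r in rows
    )


# Small in-memory example built from one line of the excerpt above
example = io.StringIO(
    "strain_name\tpeptide_count\ttaxonomy\n"
    "Aeropyrum pernix K1\t1700\tArchaea;Crenarchaeota;Thermoprotei;"
    "Desulfurococcales;Desulfurococcaceae;Aeropyrum;Aeropyrum pernix K1\n"
)
rows = load_taxonomy(example)
print(is_represented(rows, "Aeropyrum pernix"))
print(is_represented(rows, "Guinardia flaccida"))
```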

## Check for available transcriptomes

The idea is to look for peptide data for every entry in the list of missing species. For each species with data, we then add the taxon name to the taxonomy file and the peptide sequences to the FASTA file.
The annotation file will be updated with the new taxon name and number. Let's first try to find transcriptomes for the missing species.
To do this, I've written two Python functions: `get_tax_id` looks up the taxonomic ID of a species in the ENA, and `get_transcriptomes` searches the ENA for TSA sets under that ID.

In [2]:
import requests
import json

def get_tax_id(scientific_name):
    """
    Returns the taxonomic ID of the species with the given scientific name from the ENA.
    """
    genus, species = scientific_name.split(' ')
    ena_url = "https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/{genus}%20{species}".format(genus=genus, species=species)
    response = requests.get(ena_url)
    if response.status_code == 200:
        data = response.json()
        if len(data) > 0:
            return data[0]["taxId"]
        else:
            return None
    else:
        return None

def get_transcriptomes(species_name):
  """
  Returns a list of TSA set IDs for the given species name from the ENA.
  A TSA set is a transcriptome shotgun assembly, i.e. a set of assembled transcriptome contigs.
  This function returns the accession numbers of the TSA sets, which can be used to download the contigs.
  """
  tax_id = get_tax_id(species_name)  
  ena_url = "https://www.ebi.ac.uk/ena/portal/api/search?format=json&result=tsa_set&query=tax_tree({tax_id})".format(tax_id=tax_id)
  response = requests.get(ena_url)
  if response.status_code != 200:
    if response.status_code == 204:
      print(f"Warning: No TSA sets found for {species_name}")
    elif response.status_code == 500:
      print(f"Error: Request for {species_name} returned an 'Internal Server Error'. This is possibly due to an invalid taxonomic ID.")
    else:
      print(f"Error: Request for {species_name} returned status code {response.status_code}")
    return []
  try:
    search_results = response.json()
  except json.JSONDecodeError:
    print(f"Error: Response for {species_name} is not a valid JSON data structure")
    return []
  return search_results
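
Two edge cases in the functions above are worth noting. `get_tax_id` unpacks exactly two words, so a name with a different number of words would raise a `ValueError`. And when `get_tax_id` returns `None`, `get_transcriptomes` still sends a query containing `tax_tree(None)`. The sketch below shows one possible way to build both URLs defensively. The endpoint URLs are the ones used above; the helper names `taxonomy_url` and `tsa_search_url` are hypothetical, and the block only builds URLs without contacting the ENA.

```python
from typing import Optional
from urllib.parse import quote

ENA_TAXONOMY_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/{name}"
ENA_SEARCH_URL = (
    "https://www.ebi.ac.uk/ena/portal/api/search"
    "?format=json&result=tsa_set&query=tax_tree({tax_id})"
)


def taxonomy_url(scientific_name: str) -> str:
    """Build the taxonomy lookup URL, percent-encoding the whole name."""
    return ENA_TAXONOMY_URL.format(name=quote(scientific_name.strip()))


def tsa_search_url(tax_id: Optional[str]) -> Optional[str]:
    """Build the TSA search URL, or return None when no taxonomic ID is known."""
    if tax_id is None:
        return None
    return ENA_SEARCH_URL.format(tax_id=tax_id)


print(taxonomy_url("Guinardia flaccida"))
print(tsa_search_url(None))
```

A caller could skip the search request whenever `tsa_search_url` returns `None` and report that the species name was not resolved, which separates "unknown name" from "no transcriptome available".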

In [104]:
accesion_list = []
# Go through all species names and collect the TSA set records found for each
for species_name in unincluded_species['flowcam_ID']:
  print(species_name)
  # Append to list if transcriptome accession was found
  accesion_number = get_transcriptomes(species_name)
  if accesion_number:
    accesion_list.append(accesion_number)

Actinoptychus senarius
Actinoptychus splendens
Bacillaria paxillifer
Bellerochea horologicalis
Ceratium horridum
Error: Request for Ceratium horridum returned an 'Internal Server Error'. This is possibly due to an invalid taxonomic ID.
Ceratium longipes
Error: Request for Ceratium longipes returned an 'Internal Server Error'. This is possibly due to an invalid taxonomic ID.
Chaetoceros danicus
Guinardia delicatula
Guinardia flaccida
Guinardia striata
Dactyliosolen phuketensis
Error: Request for Dactyliosolen phuketensis returned an 'Internal Server Error'. This is possibly due to an invalid taxonomic ID.
Helicotheca tamesis
Hobaniella longicruris
Lithodesmium undulatum
Neocalyptrella robusta
Odontella rhombus
Proboscia indica
Rhizosolenia setigera
Rhizosolenia hebetata
Error: Request for Rhizosolenia hebetata returned an 'Internal Server Error'. This is possibly due to an invalid taxonomic ID.
Stellarima stellaris
Error: Request for Stellarima stellaris returned an 'Internal Server Err

In [100]:
accesion_list

[[{'accession': 'HBQR01000000', 'description': 'TSA: Guinardia flaccida'}],
 [{'accession': 'HBGV01000000', 'description': 'TPA: Helicotheca tamesis'}],
 [{'accession': 'HBEI01000000', 'description': 'TPA: Rhizosolenia setigera'}],
 [{'accession': 'HBNT01000000', 'description': 'TSA: Tripos fusus'}]]

## Update the phyloDB

Now let's download the assemblies for these accession numbers over FTP and store the resulting FASTA files in a folder called `transcriptomes`.

```
cd ../data/annotation/taxonomy_phyloDB
mkdir transcriptomes
cd transcriptomes

curl ftp://ftp.ebi.ac.uk/pub/databases/ena/tsa/public/hbq/HBQR01.fasta.gz -o Guinardia_flaccida_HBQR01.fasta.gz
curl ftp://ftp.ebi.ac.uk/pub/databases/ena/tsa/public/hbg/HBGV01.fasta.gz -o Helicotheca_tamesis_HBGV01.fasta.gz
curl ftp://ftp.ebi.ac.uk/pub/databases/ena/tsa/public/hbe/HBEI01.fasta.gz -o Rhizosolenia_setigera_HBEI01.fasta.gz
curl ftp://ftp.ebi.ac.uk/pub/databases/ena/tsa/public/hbn/HBNT01.fasta.gz -o Tripos_fusus_HBNT01.fasta.gz
```
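
The four URLs above share a visible pattern: the file name is the first six characters of the accession, and the subdirectory is its first three letters in lower case. The sketch below derives the URL from an accession on that basis. The pattern is inferred only from these four examples and may not hold for every TSA set; `tsa_fasta_url` is a hypothetical helper name.

```python
TSA_FTP_BASE = "ftp://ftp.ebi.ac.uk/pub/databases/ena/tsa/public"


def tsa_fasta_url(accession: str) -> str:
    """Derive the FASTA download URL for a TSA set accession.

    Assumes the layout seen in the curl commands above:
    <base>/<first three letters, lower case>/<first six characters>.fasta.gz
    """
    prefix = accession[:6]          # e.g. "HBQR01" from "HBQR01000000"
    subdir = accession[:3].lower()  # e.g. "hbq"
    return f"{TSA_FTP_BASE}/{subdir}/{prefix}.fasta.gz"


for accession in ["HBQR01000000", "HBGV01000000", "HBEI01000000", "HBNT01000000"]:
    print(tsa_fasta_url(accession))
```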

These need to be translated into peptides and added to the phyloDB reference files. To do so, we first need to run TransDecoder on all transcriptomes to obtain the coding sequences (CDS).
The script to run on the HPC is called `transdecoder_extra_phyloDB_transcriptomes`. Afterwards, I ran the script `rename_fasta_headers.sh` to add meaningful FASTA headers.
Then, I removed the line breaks within sequences, so that every sequence sits on a single line, using the awk program below:
```
awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file
```
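
For readers less familiar with awk, the same linearization can be sketched in Python. This is a rough equivalent written for illustration, not the code that was run; `linearize_fasta` is a hypothetical name, and unlike the awk one-liner it emits nothing for a record without sequence lines.

```python
def linearize_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    sequence_parts = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if sequence_parts:
                yield "".join(sequence_parts)
                sequence_parts = []
            yield line
        elif line:
            sequence_parts.append(line)
    # Flush the sequence of the last record
    if sequence_parts:
        yield "".join(sequence_parts)


wrapped = [">seq1\n", "MPFF\n", "TAYV\n", ">seq2\n", "MRQK\n"]
for out_line in linearize_fasta(wrapped):
    print(out_line)
```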
Now we can add the transcriptomes to the phyloDB reference. We'll make a directory in the location where the phyloDB is stored and call it phyloDB_extended.
Here we can store updated versions of the DB, taxonomy, and annotation files. Let's start with the taxonomy file, for which we need the strain name, the peptide count, and the taxonomy of each new transcriptome.

| Strain name | Number of peptides | Taxonomy |
| --- | --- | --- |
| Guinardia flaccida 216801 | 28185 | Eukaryota;Stramenopiles;Stramenopiles_X;Bacillariophyta;Bacillariophyta_X;Radial-centric-basal-Coscinodiscophyceae;Guinardia;Guinardia flaccida |
| Helicotheca tamesis 374047 | 12713 | Eukaryota;Stramenopiles;Stramenopiles_X;Bacillariophyta;Bacillariophyta_X;Polar-centric-Mediophyceae;Helicotheca;Helicotheca tamesis |
| Rhizosolenia setigera 3005 | 372 | Eukaryota;Stramenopiles;Stramenopiles_X;Bacillariophyta;Bacillariophyta_X;Radial-centric-basal-Coscinodiscophyceae;Rhizosolenia;Rhizosolenia setigera |
| Tripos fusus 2916 | 114721 | Eukaryota;Alveolata;Dinophyta;Dinophyceae;Dinophyceae_X;Dinophyceae_XX;Tripos;Tripos fusus |

These were manually added to the phylodb_taxonomy_extended file.
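
The manual step above could also be scripted. The sketch below counts the peptides in a FASTA file by counting header lines and formats one row in the three-column, tab-separated layout of the taxonomy file. Both helper names (`count_peptides`, `taxonomy_row`) are hypothetical, and the example lines are made up for illustration.

```python
def count_peptides(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def taxonomy_row(strain_name, peptide_count, lineage):
    """Format one taxonomy row: strain name, peptide count, ';'-joined lineage."""
    return "\t".join([strain_name, str(peptide_count), ";".join(lineage)])


# Made-up two-record FASTA content, used only to demonstrate the counting
fasta_lines = [">pep1", "MPFFTAYV", ">pep2", "MRQKILAK"]
n_peptides = count_peptides(fasta_lines)

row = taxonomy_row(
    "Tripos fusus 2916",
    n_peptides,
    ["Eukaryota", "Alveolata", "Dinophyta", "Dinophyceae",
     "Dinophyceae_X", "Dinophyceae_XX", "Tripos", "Tripos fusus"],
)
print(row)
```

For a real file, `count_peptides` can be given an open file handle instead of a list, since it only iterates over lines.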
For the annotation file, I ran the `create_phyloDB_extended_annotation_file` script on each transcriptome. Then, I manually combined the resulting files and added them to the annotation to create the phyloDB_annotation_extended file. Afterwards, I updated the phyloDB annotation script to `phyloDB_extended_annotation.pbs` and ran it on the HPC.