# Homemade "taxonomic assignment" with blast+

Goal : Sort contigs according to whether their best blastn hit is a plant.<br>
Input : Blastn TSV (custom outfmt 6 with staxids).<br>
Output : FASTA files containing the contigs with a plant hit, with a non-plant hit, or with no hit.<br>
https://docs.google.com/spreadsheets/d/1hYWprws5gd2-W2vgMux2lVAtXx5TVm4IenuwZr6DcF4/edit#gid=0

## Summary

0) Existing tools <br>
1) Data <br>
1.1) Existing transcriptomes <br>
1.2) UGMA data <br>
2) Blast+ with taxonomic infos <br>
3) Launch blastn with custom outfmt 6 <br>
4) Parse blastn result <br>

## 0) Existing tools (not used here)

- http://qiime.org/scripts/assign_taxonomy.html <br>
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4186660/

## 1) Data

### 1.1) Existing transcriptomes

In [1]:
#### Code ###
cd ~/Desktop/RTDG/BISCEm/Data;

In [None]:
#### Code ###
# PineRefSeq (2008)
axel -q http://dendrome.ucdavis.edu/ftp/Genome_Data/genome/\
pinerefseq/Psme/v1.0/gene_models/Psme.allgenes.transcripts.fasta;
mv Psme.allgenes.transcripts.fasta pinerefseq.fasta;
pigz pinerefseq.fasta;

# Lorenz et al. (2012)
# lorenz_mira.fasta.gz, lorenz_nblr.fasta.gz & lorenz_ngen.fasta.gz

# Müller et al. (2012)
# muller.fasta.gz

# Howe et al. (2013)
axel -q ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/GA/EK/GAEK01/GAEK01.1.fsa_nt.gz;
mv GAEK01.1.fsa_nt.gz howe.fasta.gz;

# Little et al. (2016)
axel -q ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/GA/ZW/GAZW02/GAZW02.1.fsa_nt.gz;
axel -q ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/GA/ZW/GAZW02/GAZW02.2.fsa_nt.gz;
cat GAZW02.1.fsa_nt.gz GAZW02.2.fsa_nt.gz > little.fasta.gz;
rm GAZW02.1.fsa_nt.gz GAZW02.2.fsa_nt.gz;

# Hess et al. (2016)
axel -q ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE73nnn/GSE73420/suppl/GSE73420_PUT_set.fasta.gz;
mv GSE73420_PUT_set.fasta.gz hess.fasta.gz;

# Merge of those 6 transcriptomes with cd-hit-est (sequence identity threshold 0.99, i.e. 99%)
# all_ref_cdhitest.fasta.gz

### 1.2) UGMA data

In [None]:
#### Code ###
ls -1 trinity*;

## 2) Blast+ with taxonomic infos

### Get taxdb for blast+ :

In [None]:
#### Code ###
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz;
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz.md5;
for f in *.md5;
do tmp=$(md5sum -c $f);
   if [[ $tmp == *": OK" ]];
   then tar -zxvf  ${f%.*};
        rm $f ${f%.*};
   fi;
done;

PS : Don't forget to move the extracted taxdb files (taxdb.btd and taxdb.bti) into the same directory as your other databases (nt, nr, etc.).
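
A quick way to confirm that the taxdb files sit where blast+ will look for them is sketched below. This is only an illustration: the helper name `missing_taxdb_files` is made up for this example, and it assumes that `BLASTDB` points to a single directory.

```python
import os

# File names shipped in NCBI's taxdb.tar.gz archive.
TAXDB_FILES = ("taxdb.btd", "taxdb.bti")


def missing_taxdb_files(db_dir):
    """Return the taxdb files that are absent from db_dir (hypothetical helper)."""
    return [name for name in TAXDB_FILES
            if not os.path.isfile(os.path.join(db_dir, name))]


if __name__ == "__main__":
    # Assumption: BLASTDB holds one directory; fall back to the current one.
    db_dir = os.environ.get("BLASTDB", ".")
    missing = missing_taxdb_files(db_dir)
    print("missing from", db_dir, ":", missing if missing else "nothing")
```

An empty list means both files were found next to the other databases.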

### Set BLASTDB variable :

Add the following line to your ~/.bashrc :

In [None]:
#### Code ###
# No spaces around '=' : bash would otherwise reject the assignment
export BLASTDB='your_path_to_blast+_dbs'

## 3) Launch blastn with custom outfmt 6

In [None]:
#### Code ###
biscem='/home/erwann/Desktop/RTDG/BISCEm';
cd $biscem/Data;

In [None]:
#### Code ###
blast='/home/erwann/Software/Ncbi_blast_2.6.0+/bin';
db='/home/erwann/Software/ncbi-blast-2.5.0+/blastdb';
for id in 'hess';
do unpigz $id.fasta.gz;
   $blast/blastn -query $id.fasta \
                 -db $db/nt \
                 -out $biscem/Output/$id'_vs_nt.tsv' \
                 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send \
                          evalue bitscore qlen slen saccver staxids sskingdoms sblastnames stitle" \
                 -num_threads 8 \
                 -culling_limit 1;
   pigz $id.fasta;
done;
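
Because the output format is custom, it helps to know which column holds which field when parsing the TSV later on. The sketch below rebuilds the column order from the `-outfmt` string used above; the names `OUTFMT_FIELDS`, `COLUMN_INDEX` and `parse_row` are illustrative and not part of any script in this notebook.

```python
# Field list copied from the -outfmt string of the blastn command.
OUTFMT_FIELDS = (
    "qseqid sseqid pident length mismatch gapopen qstart qend sstart send "
    "evalue bitscore qlen slen saccver staxids sskingdoms sblastnames stitle"
).split()

# Map each field name to its 0-based column index in the TSV.
COLUMN_INDEX = {name: i for i, name in enumerate(OUTFMT_FIELDS)}


def parse_row(row):
    """Turn one TSV line into a dict keyed by field name (hypothetical helper)."""
    values = row.rstrip("\n").split("\t")
    return dict(zip(OUTFMT_FIELDS, values))


print(COLUMN_INDEX["qseqid"], COLUMN_INDEX["staxids"])  # 0 15
```

This is why a parser of this TSV reads the query ID from column 0 and the taxonomy IDs from column 15.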

## 4) Parse blastn result

Nb : The idea of using the staxids field to check whether a contig matched a plant came from this post : https://www.biostars.org/p/163595/#163603.

In short :<br>
The "staxids" field gives the NCBI taxonomy ID of each hit, which we can use to query the NCBI taxonomy database for the lineage of that taxon with E-utilities.<br>
We can then parse the XML result to check whether the kingdom is "Viridiplantae".

Below, a dummy example for staxids = 3357 (Pseudotsuga menziesii, i.e. Douglas fir) :

In [None]:
#### Code ###
wget -q 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=3357&retmode=xml' \
     -O - | grep -B 1 '>kingdom<' | grep -v '>kingdom<' | grep -o '>.*<' | sed 's/[<>]//g';
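
The same lookup can be sketched with an XML parser instead of grep. The snippet below works on a hand-written, heavily trimmed sample that only mimics the structure of an efetch taxonomy record (real records carry many more elements and ancestors), so it runs without network access. The function name `kingdom_by_taxid` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample modelled on an efetch taxonomy record.
SAMPLE_XML = """<TaxaSet>
  <Taxon>
    <TaxId>3357</TaxId>
    <ScientificName>Pseudotsuga menziesii</ScientificName>
    <LineageEx>
      <Taxon>
        <TaxId>33090</TaxId>
        <ScientificName>Viridiplantae</ScientificName>
        <Rank>kingdom</Rank>
      </Taxon>
      <Taxon>
        <TaxId>35493</TaxId>
        <ScientificName>Streptophyta</ScientificName>
        <Rank>phylum</Rank>
      </Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>"""


def kingdom_by_taxid(xml_text):
    """Map each top-level TaxId to the name of its kingdom-rank ancestor."""
    result = {}
    for taxon in ET.fromstring(xml_text).findall("Taxon"):
        kingdom = None
        for ancestor in taxon.findall("LineageEx/Taxon"):
            if ancestor.findtext("Rank") == "kingdom":
                kingdom = ancestor.findtext("ScientificName")
        result[taxon.findtext("TaxId")] = kingdom
    return result


print(kingdom_by_taxid(SAMPLE_XML))  # {'3357': 'Viridiplantae'}
```

A taxon whose lineage has no kingdom-rank ancestor maps to `None`, which a caller would treat as "not a plant".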

### Using custom script

Content of sort_plant_hit.py :

In [None]:
#### Code ###
#!/usr/bin/env python
import os
import sys
import timeit

usage = '\t --------\n' \
        '\t| usage  : python sort_plant_hit.py f1 f2\n' \
        '\t| input  : f1 = blastn.tsv\n' \
        '\t| input  : f2 = seqs.fasta\n' \
        '\t| output : f2_plant_hit.fasta\n' \
        '\t| output : f2_non_plant_hit.fasta\n' \
        '\t| output : f2_no_hit.fasta\n' \
        '\t --------'

if len(sys.argv) != 3:
    print(usage)
    sys.exit()

##############
### Step 1 ###
##############
print('\n\tStep 1) Retrieve taxonomic infos for each staxids with efetch')
t0 = timeit.default_timer()
# For each line in TSV (f1), fill staxids_set
staxids_set = set()
with open(sys.argv[1], 'r') as tsv:
    for row in tsv:
        columns = row.split('\t')
        # Sometimes you have more than one staxids for an entry
        staxids = columns[15].split(';')
        for i in staxids:
            staxids_set.add(i)

# Use staxids_set as query with efetch & store result
# Query at most 500 IDs per efetch call to avoid errors
staxids_li = list(staxids_set)
staxids_sub_li = [staxids_li[x:x + 500]
                  for x in range(0, len(staxids_li), 500)]
efetch_li = []
for item in staxids_sub_li:
    staxids_input = ','.join(str(z) for z in item)
    # Details about "cmd" : https://www.biostars.org/p/163595/#271497
    cmd = ('efetch -db taxonomy -id ' + staxids_input + ' -format xml | xtract '
           '-pattern Taxon -sep \'@\' -element TaxId,ScientificName -division '
           'LineageEx -group Taxon -if Rank -equals superkingdom -or Rank '
           '-equals kingdom -or Rank -equals phylum -or Rank -equals class'
           ' -or Rank -equals order -or Rank -equals family -or Rank -equals'
           ' genus -sep \'@\' -element Rank,ScientificName')
    cmd_result = os.popen(cmd).read()
    cmd_result_split = cmd_result.split('\n')
    for i in cmd_result_split:
        efetch_li.append(i)

# Create a dict associating key=staxid with value=list=tax_infos
taxonomy_dic = {}
for line in efetch_li:
    field = line.split('\t')
    tax_ids = field[0].split('@')
    # Sometimes more than one staxid is associated to an entry
    # e.g. "170850@3666@Cucurbita hybrid cultivar"
    for i in tax_ids[:-1]:
        taxonomy_dic.setdefault(i, [None, None, None, None, None, None, None])
        for item in field:
            if 'superkingdom@' in item:
                taxonomy_dic[i][0] = item.split('@')[-1]
            elif 'kingdom@' in item:
                taxonomy_dic[i][1] = item.split('@')[-1]
            elif 'phylum@' in item:
                taxonomy_dic[i][2] = item.split('@')[-1]
            elif 'class@' in item:
                taxonomy_dic[i][3] = item.split('@')[-1]
            elif 'order@' in item:
                taxonomy_dic[i][4] = item.split('@')[-1]
            elif 'family@' in item:
                taxonomy_dic[i][5] = item.split('@')[-1]
            elif 'genus@' in item:
                taxonomy_dic[i][6] = item.split('@')[-1]
print('\t\t=> ' + str(round(timeit.default_timer() - t0, 3)) + ' seconds')

##############
### Step 2 ###
##############
print('\tStep 2) Assign contigs best hits to plant or non-plant')
t0 = timeit.default_timer()
# Assign contigs best hits to plant or non-plant based on taxonomy_dic infos
qseqid_set, viridi_hit_set, non_viridi_hit_set = set(), set(), set()
with open(sys.argv[1], 'r') as tsv:
    for row in tsv:
        columns = row.split('\t')
        qseqid, staxids = columns[0], columns[15].split(';')[0]
        # Check if we encounter qseqid for the first time <=> best hit
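        if qseqid not in qseqid_set:
            # .get() avoids a KeyError when efetch returned nothing for this
            # staxid; such contigs are then counted as non-plant hits
            if taxonomy_dic.get(staxids, [None] * 7)[1] == 'Viridiplantae':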
                viridi_hit_set.add(qseqid)
            else:
                non_viridi_hit_set.add(qseqid)
        qseqid_set.add(qseqid)
print('\t\t=> ' + str(round(timeit.default_timer() - t0, 3)) + ' seconds')

##############
### Step 3 ###
##############
print('\tStep 3) Find contigs with no hits')
t0 = timeit.default_timer()
# Read initial FASTA (f2) & collect every contig ID
# Contigs with no hit are those left once viridi_hit_set & non_viridi_hit_set are removed
no_hit_set = set()
with open(sys.argv[2], 'r') as fa:
    for line in fa:
        if line.startswith('>'):
            line = line.lstrip('>')
            fields = line.split()
            no_hit_set.add(fields[0])

no_hit_set = no_hit_set - viridi_hit_set
no_hit_set = no_hit_set - non_viridi_hit_set
print('\t\t=> ' + str(round(timeit.default_timer() - t0, 3)) + ' seconds')

##############
### Step 4 ###
##############
print('\tStep 4) Create output files')
t0 = timeit.default_timer()
# Create input files (sequence IDs list) for seqtk
file_2 = sys.argv[2].split('/')
sample = file_2[-1].split('.')[0]
with open(sample + '_plant_hit.temp', 'w') as out:
    for item in viridi_hit_set:
        out.write(item + "\n")
with open(sample + '_non_plant_hit.temp', 'w') as out:
    for item in non_viridi_hit_set:
        out.write(item + "\n")
with open(sample + '_no_hit.temp', 'w') as out:
    for item in no_hit_set:
        out.write(item + "\n")

# Create output files (plant_hit, non-plant_hit, no-hit) with seqtk
os.system('seqtk subseq ' + sys.argv[2] + ' ' + sample +
          '_plant_hit.temp > ' + sample + '_plant_hit.fasta')
os.system('seqtk subseq ' + sys.argv[2] + ' ' + sample +
          '_non_plant_hit.temp > ' + sample + '_non_plant_hit.fasta')
os.system('seqtk subseq ' + sys.argv[2] + ' ' + sample +
          '_no_hit.temp > ' + sample + '_no_hit.fasta')
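# Remove only the temp files created above, not every *.temp in the directory
for suffix in ('_plant_hit', '_non_plant_hit', '_no_hit'):
    os.remove(sample + suffix + '.temp')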
print('\t\t=> ' + str(round(timeit.default_timer() - t0, 3)) + ' seconds')

What it does :

Take the blastn TSV file and the contig FASTA file as input.<br>
Then, for the best hit of each contig, check whether the kingdom is "Viridiplantae".<br>
If so, we consider the contig to have a plant hit.<br>
Otherwise, we consider the contig to have a non-plant hit.<br>
Then, iterate over the contig file to collect the contig names & remove those having plant / non-plant hits. This allows us to determine the contigs with no hits.<br>
Finally, output 3 FASTA files : contigs with plant hit, with non-plant hit & with no hit.
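
The sorting logic described above can be sketched on toy data, without any call to efetch or seqtk. This is a simplified illustration, not the script itself: the function `sort_contigs` and its arguments are hypothetical, and the kingdom lookup is passed in as a plain dict.

```python
def sort_contigs(hits, kingdom_of, all_contigs):
    """Split contig IDs into plant-hit, non-plant-hit and no-hit sets.

    hits        : iterable of (qseqid, staxid) pairs, in TSV order
    kingdom_of  : dict mapping staxid -> kingdom name (or None)
    all_contigs : iterable of every contig ID found in the FASTA file
    """
    seen, plant, non_plant = set(), set(), set()
    for qseqid, staxid in hits:
        if qseqid in seen:
            continue  # only the first row of a contig counts as its best hit
        seen.add(qseqid)
        if kingdom_of.get(staxid) == "Viridiplantae":
            plant.add(qseqid)
        else:
            non_plant.add(qseqid)
    # Contigs absent from both sets had no blastn hit at all.
    no_hit = set(all_contigs) - plant - non_plant
    return plant, non_plant, no_hit


# Toy data: c1 has a plant best hit, c2 a human one, c3 no hit.
hits = [("c1", "3357"), ("c1", "9606"), ("c2", "9606")]
kingdom_of = {"3357": "Viridiplantae", "9606": "Metazoa"}
plant, non_plant, no_hit = sort_contigs(hits, kingdom_of, ["c1", "c2", "c3"])
print(sorted(plant), sorted(non_plant), sorted(no_hit))
# ['c1'] ['c2'] ['c3']
```

Note that the second hit of `c1` (taxid 9606) is ignored, since only the first row seen for a contig is treated as its best hit.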

Launch the script :

In [None]:
#### Code ###
biscem='/home/erwann/Desktop/RTDG/BISCEm';
cd $biscem/Output;

In [None]:
#### Code ###
for id in 'hess';
do unpigz $biscem/Data/$id.fasta.gz;
   python $biscem/Script/sort_plant_hit.py $id'_vs_nt.tsv' $biscem/Data/$id.fasta;
   pigz $biscem/Data/$id.fasta;
done;

Check that the sequence counts of our 3 output FASTA files add up to the number of sequences in the input FASTA file :

In [None]:
#### Code ###
for id in 'hess';
do grep -c '^>' $id*.fasta;
done;
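
The same sanity check can be automated instead of adding the counts by eye. The sketch below is one possible way to do it, assuming all files are uncompressed FASTA; the helper names `count_fasta_records` and `counts_match` are made up for this example.

```python
def count_fasta_records(path):
    """Count the header lines (starting with '>') in an uncompressed FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def counts_match(input_fasta, output_fastas):
    """Return True when the output files together hold as many records as the input."""
    total = sum(count_fasta_records(path) for path in output_fastas)
    return total == count_fasta_records(input_fasta)
```

It would be called with the input FASTA and the three files written by the script (e.g. `hess_plant_hit.fasta`, `hess_non_plant_hit.fasta` and `hess_no_hit.fasta`), after decompressing the input if it was gzipped again.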

Compute basic stats for output FASTA files :

In [None]:
#### Code ###
for f in  hess*.fasta;
do echo $f; perl ../Script/assemblyStats.pl $f;
done;