# Re-analysis of unassigned reads

Identifying unassigned reads helps to reveal contamination and gaps or ambiguities in the reference database. This notebook BLASTs the unassigned reads against the entire GenBank database.

Based on the results of the 98% BLAST identity metaBEAT run, a new BIOM table containing only OTUs that were not taxonomically assigned is generated. A fasta file with the corresponding sequences is also prepared.


Required files:

 - FASTA file containing all query sequences (global centroids), as produced by the 98% identity metaBEAT run

 - taxonomy-annotated OTU BIOM table in JSON format from a metaBEAT run (not the taxonomy-collapsed BIOM table)


Load the necessary functions. Functions are in place as of version '0.97.4-global' (commit: 9110e5a3f4a979e85733f83cb0388b00586544f6).

In [1]:
import metaBEAT_global_misc_functions as mb

Read in BIOM file from metaBEAT analysis.

In [2]:
table = mb.load_BIOM('../3-taxonomic_assignment/GLOBAL/BLAST_0.98/12S-trim30_crop110_min90_merge-forwonly_nonchimera_c0.97cov5_blast98-id1-OTU-taxonomy.blast.biom', 
                     informat='json')


Specified BIOM input format 'json' - ok!


In [3]:
#double check that we've got a table
#print table

Extract only the OTUs that did not receive a taxonomic assignment, i.e. those labelled 'unassigned'.

In [4]:
unassigned_table = mb.BIOM_return_by_tax_level(taxlevel='unassigned', BIOM=table, invert=False)

Found taxonomy metadata with OTUs - ok!


In [5]:
#check metadata in new table to see if we only got the unassigned bits
#print unassigned_table.metadata(axis='observation')

Extract only sequences mentioned in the table.

In [6]:
mb.extract_fasta_by_BIOM_OTU_ids(in_fasta='../3-taxonomic_assignment/GLOBAL/global_queries.fasta', 
                                 BIOM=unassigned_table, 
                                 out_fasta='unassigned_only.fasta')

Looking to extract 7771 sequences
Parsing ../3-taxonomic_assignment/GLOBAL/global_queries.fasta
identified 7771 target sequences .. OK!
Writing sequences to file: unassigned_only.fasta
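As an optional sanity check, the number of records in the extracted FASTA file can be compared with the number reported above. The following is a minimal sketch, not part of metaBEAT; the helper name `count_fasta_records` is hypothetical, and it runs on a small in-memory example so that it does not depend on any file being present.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in lines if line.startswith('>'))


# Made-up toy input: two records, the second one wrapped over two lines.
example = [">otu1", "ACGT", ">otu2", "GGCC", "TTAA"]
n_records = count_fasta_records(example)
print(n_records)
```

With the real data one could pass an open file handle instead, e.g. `count_fasta_records(open('unassigned_only.fasta'))`, and compare the result with the 7771 sequences reported in the output above.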


In [7]:
unassigned_table_notax = mb.drop_BIOM_taxonomy(unassigned_table)

In [8]:
#double check that the taxonomy is gone
#print unassigned_table_notax.metadata(axis='observation')

Write the reduced table without taxonomy metadata (i.e. a de novo table) to file.

In [9]:
mb.write_BIOM(BIOM=unassigned_table_notax, target_prefix='unassigned_only_denovo', outfmt=['json','tsv'])

Writing 'unassigned_only_denovo.biom'
Writing 'unassigned_only_denovo.tsv'


The files 'unassigned_only_denovo.biom' and 'unassigned_only.fasta' can be used as input for a new metaBEAT run.

In [10]:
!metaBEAT_global.py -h

usage: metaBEAT.py [-h] [-Q <FILE>] [-B <FILE>] [--g_queries <FILE>] [-v] [-s]
                   [-f] [-p] [-k] [-t] [-b] [-m <string>] [-n <INT>] [-E] [-e]
                   [--read_stats_off] [--PCR_primer <FILE>]
                   [--trim_adapter <FILE>] [--trim_qual <INT>] [--phred <INT>]
                   [--trim_window <INT>] [--read_crop <INT>]
                   [--trim_minlength <INT>] [--merge] [--product_length <INT>]
                   [--merged_only] [--forward_only] [--length_filter <INT>]
                   [--length_deviation <FLOAT>] [-R <FILE>] [--gb_out <FILE>]
                   [--rec_check] [--gi_to_taxid <FILE>] [--cluster]
                   [--clust_match <FLOAT>] [--clust_cov <INT>]
                   [--blast_db <PATH>] [--blast_xml <PATH>]
                   [--min_ident <FLOAT>] [--min_ali_length <FLOAT>]
                   [--bitscore_skim_LCA <FLOAT>] [--bitscore_skim_adjust_off]
                   [--min_bit <INT>] [--refpkg <DIR>] [--jp

Ensure that a gi_to_taxid.csv file is present in the working directory.

In [11]:
!head gi_to_taxid.csv

998452514,9606
152031377,91750
930576047,9606
647737363,9615
998452864,9606
255365357,9606
37725781,107026
37725783,8838
393007394,9615
378554575,89337
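Because this file is later edited by hand, it may be worth checking that every line still follows the two-column `GI,taxid` layout shown above. The sketch below is an illustration only and not part of metaBEAT; the function name `parse_gi_to_taxid` is hypothetical, and it assumes that both columns are plain integers separated by a single comma.

```python
def parse_gi_to_taxid(lines):
    """Parse 'GI,taxid' lines.

    Returns (mapping, problems): mapping is a dict GI -> taxid (both strings),
    problems is a list of (line number, line) for lines that are malformed
    or that give a conflicting taxid for a GI that was already seen.
    """
    mapping = {}
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = [f.strip() for f in line.split(',')]
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            problems.append((lineno, line))
            continue
        gi, taxid = fields
        if gi in mapping and mapping[gi] != taxid:
            problems.append((lineno, line))
            continue
        mapping[gi] = taxid
    return mapping, problems


# First two lines are taken from the output above; the third is a made-up
# malformed line (semicolon instead of comma) to show what gets flagged.
example = ["998452514,9606", "152031377,91750", "37725781;107026"]
mapping, problems = parse_gi_to_taxid(example)
print(len(mapping), problems)
```

For the real file, the same function could be called as `parse_gi_to_taxid(open('gi_to_taxid.csv'))`.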


**GO!**

In [12]:
%%bash

metaBEAT_global.py \
-B unassigned_only_denovo.biom \
--g_queries unassigned_only.fasta \
--blast --blast_db ../entireGB_DB/nt/nt --min_ident 0.98 --min_ali_length 0.8 \
-m 12S -n 5 \
-v \
-@ L.Harper@2015.hull.ac.uk \
-o unassigned_only &> log_unassigned

In [13]:
!tail -n 50 log_unassigned


direct assignment for F186_-nc|M01364:4:000000000-AGFG9:1:2104:14926:14776_ex -> 9606

attempting LCA assignment for F491_-nc|M01364:9:000000000-AGEVG:1:1104:11973:8905_ex
found LCA 8830 at level family
assigned LCA Anatidae (taxid 8830) at level family

attempting LCA assignment for F491_-nc|M01364:9:000000000-AGEVG:1:1103:14894:16071_ex
found LCA 8830 at level family
assigned LCA Anatidae (taxid 8830) at level family

direct assignment for F106_-nc|M01364:4:000000000-AGFG9:1:1102:21929:24226_ex -> 9606

direct assignment for F184_-nc|M01364:4:000000000-AGFG9:1:1103:26053:9370_ex -> 9606

attempting LCA assignment for F199_-nc|M01364:4:000000000-AGFG9:1:1101:9228:21478_1:N:0:351
found LCA 6668 at level genus
assigned LCA Daphnia (taxid 6668) at level genus

direct assignment for N6-4_-nc|M01364:4:000000000-AGFG9:1:1113:10033:13576_ex -> 9606

direct assignment for F13_-nc|M01364:4:000000000-AGFG9:1:1105:16634:22049_ex -> 9606

direct assignment for F472_-nc|M01
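To get a quick overview of what the previously unassigned reads were assigned to (for example, how many went directly to taxid 9606, human), the log lines can be tallied. The following is a rough sketch that assumes the log uses exactly the two line formats visible in the excerpt above ("direct assignment for ... -> taxid" and "assigned LCA ... (taxid N) at level ..."); the function name `tally_assignments` is hypothetical and not part of metaBEAT.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt above; other log formats are not handled.
DIRECT = re.compile(r'^direct assignment for \S+ -> (\d+)')
LCA = re.compile(r'^assigned LCA .+ \(taxid (\d+)\) at level (\w+)')


def tally_assignments(lines):
    """Count assignments per (taxid, assignment type) from metaBEAT log lines."""
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        match = DIRECT.match(line)
        if match:
            counts[(match.group(1), 'direct')] += 1
            continue
        match = LCA.match(line)
        if match:
            counts[(match.group(1), 'LCA ' + match.group(2))] += 1
    return counts


# A few lines copied from the log excerpt above.
example = [
    "direct assignment for F186_-nc|M01364:4:000000000-AGFG9:1:2104:14926:14776_ex -> 9606",
    "attempting LCA assignment for F491_-nc|M01364:9:000000000-AGEVG:1:1104:11973:8905_ex",
    "found LCA 8830 at level family",
    "assigned LCA Anatidae (taxid 8830) at level family",
    "direct assignment for F106_-nc|M01364:4:000000000-AGFG9:1:1102:21929:24226_ex -> 9606",
]
counts = tally_assignments(example)
for key, n in counts.most_common():
    print(key, n)
```

On the full log this could be run as `tally_assignments(open('log_unassigned'))`.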

If the analysis breaks, you need to find the OTU and the GI number that caused the failure.

Open the log file and find the query that was being processed when the analysis broke. For that query, the log reports how many GIs had been seen before; the next GI is the one that broke the pipeline. For example, if the GIs of the first 10 hits had been seen before, it is the GI of the 11th hit that caused the failure. Open global_blastn.out.xml on the command line (e.g. with vim), find the last query that was being processed, and look for the hit with the offending GI. Copy the GI number and search NCBI with it. The record returned identifies the species for which a taxid is needed. Click on the species and copy the taxid from the species page. Then, in a text editor, manually append the GI number and the taxid to gi_to_taxid.csv as a new line in the same 'GI,taxid' format as the existing lines.

Continue to re-run the BLAST command and repeat this troubleshooting process until metaBEAT completes successfully.
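The manual search through the XML file described above can be partly automated. The following is a minimal sketch, not part of metaBEAT, that lists all hits whose GI is missing from a set of known GIs. It assumes standard BLAST XML output in which each `Hit_id` has the form `gi|<number>|...`; the function name `find_unmapped_gis` is hypothetical, and the GI 123456789 and the accessions in the sample are made up for illustration. The taxid for each missing GI still has to be looked up at NCBI as described.

```python
import re
import xml.etree.ElementTree as ET

# Assumes Hit_id values of the form 'gi|<number>|...'.
GI_PATTERN = re.compile(r'gi\|(\d+)\|')


def find_unmapped_gis(xml_text, known_gis):
    """Return (query, GI, hit description) for each GI not present in known_gis.

    Each missing GI is reported once, for the first query in which it occurs.
    """
    root = ET.fromstring(xml_text)
    missing = []
    seen = set()
    for iteration in root.iter('Iteration'):
        query = iteration.findtext('Iteration_query-def')
        for hit in iteration.iter('Hit'):
            match = GI_PATTERN.search(hit.findtext('Hit_id', default=''))
            if match is None:
                continue  # hit without a GI in its identifier
            gi = match.group(1)
            if gi not in known_gis and gi not in seen:
                seen.add(gi)
                missing.append((query, gi, hit.findtext('Hit_def', default='')))
    return missing


# Tiny made-up example in the shape of BLAST XML output.
sample_xml = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_id>gi|998452514|gb|EXAMPLE1|</Hit_id><Hit_def>known hit</Hit_def></Hit>
        <Hit><Hit_id>gi|123456789|gb|EXAMPLE2|</Hit_id><Hit_def>unmapped hit</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

missing = find_unmapped_gis(sample_xml, known_gis={"998452514"})
for query, gi, description in missing:
    print(query, gi, description)
```

For the real data, `xml_text` could be read from global_blastn.out.xml and `known_gis` built from the first column of gi_to_taxid.csv. Note that `ET.fromstring` loads the whole document into memory, so for a very large BLAST output a streaming approach such as `xml.etree.ElementTree.iterparse` may be preferable.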