### Illumina read processing and taxonomic classification of query sequences

We use our custom pipeline [metaBEAT](https://github.com/HullUni-bioinformatics/metaBEAT) to process the Illumina data and taxonomically identify query sequences.

For full reproducibility, metaBEAT was run inside a Docker container; the image is available [here](https://hub.docker.com/r/chrishah/metabeat/).

After initial quality trimming, merging, and clustering of the reads, query sequences were blasted against a custom reference database composed of COI sequences of _Gammarus_ sp. as well as the positive controls _Harmonia axyridis_ and _Triops cancriformis_ (all downloaded from GenBank as described [here]()). Taxonomic assignment was performed using a lowest common ancestor (LCA) approach based on the BLAST results, as described in the paper.
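To illustrate the idea behind the LCA step, here is a minimal sketch. It is not metaBEAT's actual implementation: it assumes every retained BLAST hit has already been mapped to a root-to-species lineage, and the function name `lowest_common_ancestor` and the example lineages are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxonomic path shared by all lineages.

    Each lineage is a list of taxon names ordered from root to tip.
    """
    if not lineages:
        return []
    shared = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        shared.append(taxa_at_rank[0])
    return shared


# Hypothetical lineages for the best hits of a single query sequence.
hits = [
    ["Arthropoda", "Malacostraca", "Amphipoda", "Gammaridae",
     "Gammarus", "Gammarus fossarum"],
    ["Arthropoda", "Malacostraca", "Amphipoda", "Gammaridae",
     "Gammarus", "Gammarus pulex"],
]
print(lowest_common_ancestor(hits)[-1])  # Gammarus
```

When the best hits disagree at species level, the query is assigned to the deepest rank they still share, here the genus.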

The file `QUERYmap.txt` contains the sample IDs and the locations of the Illumina read files, plus the barcodes and instructions to clip the first 30 bases off both the forward and reverse reads in order to remove the primers.

The file `REFlist.txt` lists the reference sequence files, one per line, together with their format (`gb` for GenBank).

In [1]:
%%bash

# one line per GenBank file: <path><TAB>gb
for gb in ../../1-download_reference/*.gb
do
    printf '%s\tgb\n' "$gb"
done > REFlist.txt

In [2]:
!cat REFlist.txt

../../1-download_reference/Gammarus_COI_Weiss_et_al_2013.gb	gb
../../1-download_reference/positive_controls.gb	gb
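The same list could also be built from Python, for example when the notebook is run without a bash kernel. This is only a sketch: the helper name `write_reflist` is hypothetical, and it assumes, as in the listing above, that every reference file ends in `.gb` and is in GenBank format.

```python
import glob
import os


def write_reflist(reference_dir, out_path="REFlist.txt"):
    """Write one '<path><TAB>gb' line per GenBank file in reference_dir."""
    # Sort so the output order does not depend on the file system.
    gb_files = sorted(glob.glob(os.path.join(reference_dir, "*.gb")))
    with open(out_path, "w") as handle:
        for path in gb_files:
            handle.write("{}\tgb\n".format(path))
    return gb_files
```

Calling `write_reflist("../../1-download_reference")` from the notebook directory should produce the two-line file shown above, provided the directory contains the same two `.gb` files.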


Run the metaBEAT pipeline.

In [4]:
!metaBEAT_global.py \
-Q QUERYmap.txt \
-R REFlist.txt \
--trim_minlength 100 \
--trim_qual 30 \
--merge --product_length 350 \
--forward_only \
--cluster --clust_match 0.97 --clust_cov 3 \
--length_filter 313 --length_deviation 0.05 \
--blast --min_ident 0.95 \
-m COI -n 5 \
-o COI_160307_clip_trim-30_merge_forw-only_c0.97m3_blast_min0.85_GLOBAL > metaBEAT.log
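The options `--length_filter 313 --length_deviation 0.05` restrict which sequences go on to the BLAST step. The sketch below works out the resulting window under the assumption that the deviation is a symmetric fraction of the target length; how metaBEAT rounds or treats the boundaries internally is not confirmed here, and the function names are hypothetical.

```python
def length_window(target_length, deviation):
    """Return the (min, max) lengths within +/- deviation of target_length."""
    margin = target_length * deviation
    return target_length - margin, target_length + margin


def passes_length_filter(sequence, target_length=313, deviation=0.05):
    """True if the sequence length falls inside the accepted window."""
    low, high = length_window(target_length, deviation)
    return low <= len(sequence) <= high


low, high = length_window(313, 0.05)
print("accepted lengths: {:.2f} to {:.2f} bp".format(low, high))
```

Under this assumption the window is 313 ± 15.65 bp, so sequences of 298 to 328 bp would be retained.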

A summary of the read counts throughout the read processing stages can be found in the file:

`COI_160307_clip_trim-30_merge_forw-only_c0.97m3_blast_min0.85_GLOBAL_read_stats.csv`.

The final OTU table can be found in the file: 

`GLOBAL/BLAST_0.95/COI_160307_clip_trim-30_merge_forw-only_c0.97m3_blast_min0.85_GLOBAL-by-taxonomy-readcounts.blast.tsv`.


In [5]:
import metaBEAT_global_misc_functions as mb

Identify samples that contained reads assigned to _G. fossarum_ before filtering.

In [6]:
mb.find_target(BIOM=mb.load_BIOM('GLOBAL/BLAST_0.95/COI_160307_clip_trim-30_merge_forw-only_c0.97m3_blast_min0.85_GLOBAL-by-taxonomy-readcounts.blast.biom'), target='Gammarus_fossarum')

BLANK-1	(16.9811 %)
INV004D	(0.3456 %)
INV005	(99.8875 %)
INV010	(98.8542 %)
INV019	(0.2357 %)
INV021	(0.1204 %)
INV027	(93.5046 %)
INV028	(89.6334 %)
INV029	(100.0000 %)
INV030	(99.5250 %)
INV031D	(19.8005 %)
INV033	(1.6838 %)
INV034D	(11.3552 %)
INV035	(100.0000 %)
INV036	(99.9187 %)
INV037	(99.6670 %)
INV038	(100.0000 %)
INV039	(100.0000 %)
INV040	(98.7659 %)
INV041	(99.8786 %)
INV042	(76.3237 %)
INV043	(0.1853 %)
INV049	(5.0641 %)
INV053	(95.4602 %)
INV055	(99.6733 %)
INV056	(94.5860 %)
INV057	(99.6347 %)
INV058	(0.2272 %)
INV059	(99.2910 %)
INV060	(0.8555 %)
INV062	(97.4181 %)
INV063	(3.0410 %)
SOI005	(12.0603 %)
SOI024	(0.3726 %)
SOI028	(0.3363 %)
SOI029	(1.6717 %)
SOI030	(0.0120 %)
SOI031	(0.0069 %)
SOI032	(0.1895 %)
SOI035	(7.4979 %)
SOI036	(0.2760 %)
SOI037	(1.5997 %)
SOI038	(2.1773 %)
SOI039	(5.7220 %)
SOI040	(1.6327 %)
SOI055	(0.2353 %)
SOI057	(18.0574 %)
SOI062	(2.1505 %)
WAT028	(0.1035 %)
WAT029	(5.4529 %)
WAT035	(8.4507 %)
WAT036	(4.3562 %)
WAT037	(13.5390 %)
WAT038	(2.90

Filter the raw OTU table: in each sample, remove OTUs that are supported by fewer than 1% of that sample's reads.

In [7]:
# load raw OTU table
to_filter = mb.load_BIOM(table='GLOBAL/BLAST_0.95/COI_160307_clip_trim-30_merge_forw-only_c0.97m3_blast_min0.85_GLOBAL-OTU-taxonomy.blast.biom')

# filter at 1%
filtered = mb.filter_BIOM_by_per_sample_read_prop(BIOM=to_filter, min_prop=0.01)

# write to file
mb.write_BIOM(filtered, target_file='filtered')

# collapse OTUs by taxonomy
filtered_collapsed = mb.collapse_biom_by_taxonomy(in_table=filtered)

# write to file
mb.write_BIOM(filtered_collapsed, target_file='filtered-collapsed')


Filtering at level: 1.0 %

Removing 5600 OTUs for lack of support
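The per-sample filter can be pictured with a small stand-alone sketch. It operates on plain dictionaries rather than BIOM tables and is not the code behind `mb.filter_BIOM_by_per_sample_read_prop`; the function name, sample names, and counts are hypothetical.

```python
def filter_by_per_sample_proportion(table, min_prop=0.01):
    """Keep, per sample, only OTUs holding at least min_prop of its reads.

    table maps sample name -> {OTU name: read count}.
    """
    filtered = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        filtered[sample] = {
            otu: n for otu, n in counts.items()
            if total > 0 and n / float(total) >= min_prop
        }
    return filtered


# Hypothetical toy table with two samples and two OTUs.
toy = {
    "sampleA": {"OTU_1": 995, "OTU_2": 5},    # OTU_2 holds 0.5% of the reads
    "sampleB": {"OTU_1": 20, "OTU_2": 980},   # OTU_1 holds 2% of the reads
}
filtered_toy = filter_by_per_sample_proportion(toy, min_prop=0.01)
print(filtered_toy)
```

In this toy case `OTU_2` is dropped from `sampleA` only, because the threshold is evaluated separately against each sample's own read total.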



Identify samples containing sequences assigned to _G. fossarum_ after filtering.

In [8]:
mb.find_target(filtered_collapsed, target='Gammarus_fossarum')

BLANK-1	(16.9811 %)
INV005	(100.0000 %)
INV010	(100.0000 %)
INV027	(95.2122 %)
INV028	(90.8974 %)
INV029	(100.0000 %)
INV030	(100.0000 %)
INV031D	(20.4174 %)
INV033	(1.7705 %)
INV034D	(11.9880 %)
INV035	(100.0000 %)
INV036	(100.0000 %)
INV037	(100.0000 %)
INV038	(100.0000 %)
INV039	(100.0000 %)
INV040	(100.0000 %)
INV041	(100.0000 %)
INV042	(77.0550 %)
INV049	(5.4598 %)
INV053	(98.1144 %)
INV055	(100.0000 %)
INV056	(94.5860 %)
INV057	(100.0000 %)
INV059	(100.0000 %)
INV062	(98.7500 %)
INV063	(3.2258 %)
SOI005	(12.7321 %)
SOI029	(1.8174 %)
SOI035	(8.1596 %)
SOI037	(1.7464 %)
SOI038	(3.4930 %)
SOI039	(6.1993 %)
SOI040	(1.8638 %)
SOI057	(25.7233 %)
SOI062	(3.6036 %)
WAT029	(7.9436 %)
WAT035	(11.7241 %)
WAT036	(7.8764 %)
WAT037	(24.8268 %)
WAT038	(4.8209 %)
WAT039	(5.6893 %)
WAT040	(3.4936 %)
WAT041	(70.7878 %)
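The percentages in these listings appear to be the share of each sample's reads assigned to the target taxon. The sketch below reproduces that calculation on plain dictionaries; it is an assumption about what `mb.find_target` reports, not its actual code, and the function name and toy counts are hypothetical.

```python
def target_percentages(table, target):
    """Return [(sample, percent)] for samples with reads assigned to target.

    table maps sample name -> {taxon name: read count}.
    """
    result = []
    for sample in sorted(table):
        counts = table[sample]
        total = sum(counts.values())
        hits = counts.get(target, 0)
        if total > 0 and hits > 0:
            result.append((sample, 100.0 * hits / total))
    return result


# Hypothetical toy table: only sampleA contains the target taxon.
toy_taxa = {
    "sampleA": {"Gammarus_fossarum": 3, "Gammarus_pulex": 1},
    "sampleB": {"Gammarus_pulex": 10},
}
for sample, pct in target_percentages(toy_taxa, "Gammarus_fossarum"):
    print("{}\t({:.4f} %)".format(sample, pct))
```

If the values are computed this way, removing low-support OTUs shrinks each sample's read total, which would explain why several percentages rise after filtering.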


Identify samples containing sequences assigned to _G. pulex_ after filtering.

In [9]:
mb.find_target(filtered_collapsed, target='Gammarus_pulex')

BLANK-1	(62.8931 %)
INV011	(81.1340 %)
INV013	(20.7633 %)
INV015	(16.6021 %)
INV016	(56.4469 %)
INV017	(3.2396 %)
INV018	(8.0049 %)
INV019	(18.4542 %)
INV021	(68.8352 %)
INV023	(100.0000 %)
INV025	(50.0000 %)
INV026	(39.3834 %)
INV042	(5.1437 %)
INV043	(54.0541 %)
INV044	(16.1581 %)
INV046	(4.1423 %)
INV048	(12.3082 %)
INV049	(5.8096 %)
INV050	(1.8329 %)
INV051	(91.4938 %)
INV052	(70.9160 %)
INV054	(65.6038 %)
INV056	(4.1401 %)
INV058	(1.4144 %)
INV060	(76.6568 %)
INV061	(15.2616 %)
INV063	(20.0000 %)
INV064	(4.0670 %)
