In [1]:
%load_ext mothurmagic

In [7]:
# Imports a parser from cogent
from cogent.parse.fasta import MinimalFastaParser as parse

Because our primers, gITS7 and ITS4, anneal to conserved regions (5.8S and 28S) to amplify ITS2, the amplicons likely include portions of these regions. We want to consider only the ITS2 in our analyses, so that the conserved flanking sequence does not artificially inflate similarity measures between sequences.  

We can use ITSx to do this (Bengtsson-Palme et al., 2013).  

It seems like I could:  
1. not remove primers beforehand, because they will be removed during this step;  
2. not remove non-eukaryotic sequences another way, because this step should achieve that (although other eukaryotes will still be detected);  
3. still perform this only on the unique sequences, and then re-expand the data.
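
Point 3 in the list above (run the extraction on unique sequences only, then re-expand) can be sketched in plain Python. This is a minimal illustration on toy data, not the usearch or ITSx implementation; the function names `dereplicate` and `re_expand` and the `trimmed` mapping are assumptions made for the example.

```python
from collections import OrderedDict


def dereplicate(records):
    """Group (header, sequence) pairs by identical sequence.

    Returns an ordered mapping: sequence -> list of headers that carry it.
    """
    groups = OrderedDict()
    for header, seq in records:
        groups.setdefault(seq, []).append(header)
    return groups


def re_expand(groups, trimmed):
    """Give every original header the trimmed version of its sequence.

    `trimmed` maps a full unique sequence to its extracted region. Unique
    sequences with no entry (for example, no ITS2 detected) are dropped.
    """
    expanded = []
    for seq, headers in groups.items():
        if seq in trimmed:
            expanded.extend((header, trimmed[seq]) for header in headers)
    return expanded


# Toy data, not real reads.
reads = [("1_1", "AAACCCGGG"), ("1_2", "AAACCCGGG"), ("2_1", "TTTCCCAAA")]
groups = dereplicate(reads)

# Pretend the extraction step kept the middle three bases of one unique sequence.
trimmed = {"AAACCCGGG": "CCC"}
expanded = re_expand(groups, trimmed)
print(expanded)
```

The expensive step then runs once per unique sequence rather than once per read, which matters when most reads are duplicates.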

In [None]:
!ITSx -i input.fasta -o output -E XXXevalcutoff --cpu 4 #--preserve

# Note: ITSx does store the not_found sequences.
# The E-value cutoff might be set at 0.01 or even 1 to decrease the amount of coverage
# flanking the ITS region.
# This could make sense because we know we targeted this region:
# our sequences should contain ITS2.
# --preserve T would keep the sequence headers from the input
# instead of replacing them.
# However, ITSx only modifies the headers, so we should be able to extract the original ones.


In [129]:
# Pull out the sample identifier and add it to the header for usearch compatibility

!awk \
'BEGIN{FS="_";OFS=";"}{ if ( substr($1,1,1) == ">"){ print $0,"barcodelabel=",$1 } else { print $0 } }' \
../../SeqData/ITS.demult.maxee.homoP.fasta | \
sed 's/;>//' > ../../SeqData/ITS.demult.maxee.homoP.usearch.fasta
# AWK is an old AT&T programming language.
# The input field separator (awk -F "_" or FS="_") is the underscore.
# The output field separator (OFS) is the semicolon.
# awk strings are 1-indexed, so substr($1,1,1) is the first character of the first field.
# (substr($1,0,1) happens to work in macOS awk but returns an empty string in gawk.)
# If that character is ">", we print the whole line as it was, plus "barcodelabel=" and the first field,
# which is ">" followed by the sample ID. Otherwise, we just repeat the full line.
# $0 is the full line; $1, $2, etc. are the fields produced by splitting on FS.
# sed then removes the ";>" that ends up in front of the sample ID, leaving ;barcodelabel=<sample ID>.
# We do this for the full file and write the result under a new name.
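
For readers less familiar with awk, the relabeling done by the awk/sed step above can be written as a short Python function. This is a sketch under the assumption that the sample ID is everything between ">" and the first underscore; the function name `add_barcodelabel` is made up for illustration.

```python
def add_barcodelabel(line):
    """Append ';barcodelabel=<sample>' to FASTA header lines.

    The sample ID is assumed to be the text between '>' and the first '_'.
    Sequence lines are returned unchanged.
    """
    if line.startswith(">"):
        sample = line[1:].split("_", 1)[0]
        return "{0};barcodelabel={1}".format(line, sample)
    return line


relabeled = add_barcodelabel(">100_1228573")
print(relabeled)
print(add_barcodelabel("TCACTCAGTGAATCATCG"))
```

Applied to every line of the FASTA file (with the trailing newline stripped first), this should give the same headers as the shell version.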


Could use mothur or usearch to make unique sequences. See how each is done.  
usearch derep_fulllength writes a FASTA file with only the unique sequences and, with -sizeout, records the number of times each one is present in its header (;size=n;).  
This seems easy to maintain with ITSx.  
cluster_otus in usearch just requires that the input FASTA headers carry counts as "size=n", delimited by semicolons.  
So, I need to see whether ITSx gets rid of the semicolons.  

In [130]:
# Make a small subset (the first 20,000 lines) of the relabeled file to try ITSx on.
!head -20000 ../../SeqData/ITS.demult.maxee.homoP.usearch.fasta > ../../SeqData/ITS.demult.maxee.homoP.mini.fasta

In [144]:
!usearch -derep_fulllength ../../SeqData/ITS.demult.maxee.homoP.usearch.fasta -fastaout ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.fasta -sizeout -threads 4

usearch v8.0.1623_i86osx32, 4.0Gb RAM (17.2Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: tlw59@cornell.edu

00:06 852Mb  100.0% Reading ../../SeqData/ITS.demult.maxee.homoP.usearch.fasta
00:12 912Mb 2176154 seqs, 707180 uniques, 597836 singletons (84.5%)           
00:12 912Mb Min size 1, median 1, max 8883, avg 3.08
00:29 912Mb  100.0% Writing ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.fasta


In [145]:
# Sequences are sorted by size (the number of reads each unique sequence represents).
# With -minsize 2 we exclude the singletons here.
# You would change -minsize to 1 if you wanted to include singletons.
# We don't really need to keep track of the total initial sequences, because we are going to go back to our
# original fasta file to compare it to these curated (ITS2-extracted, no singletons) sequences.

!usearch -sortbysize ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.fasta -fastaout ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.fasta -minsize 2

usearch v8.0.1623_i86osx32, 4.0Gb RAM (17.2Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: tlw59@cornell.edu

00:02 313Mb  100.0% Reading ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.fasta
00:02 280Mb Getting sizes                                                            
00:04 281Mb Sorting 109344 sequences
00:07 281Mb  100.0% Writing output


In [147]:
!tail ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.fasta
# The smallest entries at the end of the file still have size=2, so no singletons remain.

TCAGTGAGTCATCGAATCTTTGAACGCACATTGCGCCCTCTGGAATTCCGGAGGGCACGCCTGTCTGAGCGTCGTCACGC
CAATCGAGCCCTCCCGGGGGCACGGTGTTGGGTGAGGTCAGGGCACTTTACAGTGCCTGGACCCACCCGCAAAGCGTTGG
CGGAGCCCCAGGAGCCCCAGTCGCAGCAAGAAAAAGACGTTTCGACTTGGAGCCTCCTTGGTGGCCCCACGCCCTCACGA
ACCCCATCTCTAAGGTTCGACCTCGGATCAGGCGGGAGTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGATCGT
>100_1228573;barcodelabel=100;size=2;
TCACTCAGTGAATCATCGAATCTTTGAACGCACCTTGCACCTTTTGGTATTCCGAAAGGTACACCCGTTTGAGTGTCATT
GTAATCTCACTCCTTCAACTTTGTTGTTGCTGGATGTGGACTTGGACTCTGTCGTGTTACAACGACTGGTCTGAAATGCC
TGAGTGCACCCTGCTGTTGCAGCGTCTCCAGTGTGATAAGCATCTTCACTGATTCAAGTTCCTTCGGGACACGTAGCATT
GTGGGCTCTGTGCTGACAAACCGTCCTCGGACAATCTTTGACAATTTGACCTCAAATCGGGTGGGACTACCCGCTGAACT
TAAGCATATCAATAAGCGGAGGA
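
The size filter and sort can be cross-checked in Python by reading the `;size=n;` annotation from each header. This is a minimal sketch, not what usearch does internally; `sort_by_size` is a hypothetical helper, and the singleton header in the example is invented.

```python
import re

SIZE_RE = re.compile(r";size=(\d+);")


def sort_by_size(headers, minsize=2):
    """Return headers with size >= minsize, largest first (stable for ties)."""
    sized = []
    for header in headers:
        match = SIZE_RE.search(header)
        if match is None:
            raise ValueError("no size annotation in: " + header)
        size = int(match.group(1))
        if size >= minsize:
            sized.append((size, header))
    sized.sort(key=lambda pair: pair[0], reverse=True)
    return [header for _, header in sized]


headers = [
    ">9_1;barcodelabel=9;size=1;",  # invented singleton, should be dropped
    ">100_1228573;barcodelabel=100;size=2;",
    ">101_391;barcodelabel=101;size=8883;",
]
kept = sort_by_size(headers)
print(kept)
```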


In [1]:
!ITSx -i ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.fasta -o ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output -t "Fungi" -N 2 --cpu 4

ITSx -- Identifies ITS sequences and extracts the ITS region
by Johan Bengtsson-Palme et al., University of Gothenburg
Version: 1.0.11
-----------------------------------------------------------------
Thu Jul 16 23:58:12 2015 : Preparing HMM database (should be quick)...
Thu Jul 16 23:58:12 2015 : Checking and handling input sequence data (should not take long)...
Thu Jul 16 23:58:15 2015 : Doing paralellised comparison to HMM database (this may take a long while)...
    Fri Jul 17 01:53:57 2015 : Fungi analysis of complementary strand finished.
    Fri Jul 17 09:10:45 2015 : Fungi analysis of main strand finished.
    Fri Jul 17 09:10:45 2015 : All processes finished.
Fri Jul 17 09:10:45 2015 : Parallel HMM-scan finished.
Fri Jul 17 09:10:45 2015 : Analysing results of HMM-scan (this might take quite some time)...
Fri Jul 17 09:10:57 2015 : Extraction finished!
-----------------------------------------------------------------
Thank you for using ITSx!
Please report bugs or unsupported
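
The wall-clock time of a run can be read off the first and last timestamps in the ITSx log. A small sketch, assuming the timestamp format shown in the log above and an English locale for the day and month names:

```python
from datetime import datetime

# Format of the timestamps in the ITSx log, e.g. "Thu Jul 16 23:58:12 2015".
ITSX_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def elapsed(start_stamp, end_stamp):
    """Return the time between two ITSx log timestamps as a timedelta."""
    start = datetime.strptime(start_stamp, ITSX_TIME_FORMAT)
    end = datetime.strptime(end_stamp, ITSX_TIME_FORMAT)
    return end - start


# First and last timestamps of the run logged above.
run_time = elapsed("Thu Jul 16 23:58:12 2015", "Fri Jul 17 09:10:57 2015")
print(run_time)  # 9:12:45
```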

In [None]:
!ITSx -i ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.fasta -o ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output -t "Fungi" -N 2 --cpu 4
# May need to add the "--reset T" flag the first time to fix the HMM database.
# Can add -t "Fungi" to speed things up; otherwise it scans all eukaryote databases (*may not actually be faster).

# Can also use the akutils script from GitHub to run this in parallel, which is faster. Might be worth it. To use the script,
# everything needs to be in the same directory, though, and I needed to change readlink to greadlink in the script and
# brew install coreutils.
# This took Xh to do. J. Bengtsson-Palme found that the parallel .sh script could do it ~6x faster, so
# 30 min. For future analyses, use this approach. Plus, all types was about 3x slower than fungi with the parallel script,
# but about the same (even a little slower!) with the regular script and the --cpu flag.

ITSx -- Identifies ITS sequences and extracts the ITS region
by Johan Bengtsson-Palme et al., University of Gothenburg
Version: 1.0.11
-----------------------------------------------------------------
Thu Jul 16 12:43:51 2015 : Preparing HMM database (should be quick)...
Thu Jul 16 12:43:51 2015 : Checking and handling input sequence data (should not take long)...
Thu Jul 16 12:43:53 2015 : Doing paralellised comparison to HMM database (this may take a long while)...
    Thu Jul 16 14:35:11 2015 : Fungi analysis of complementary strand finished.
    Thu Jul 16 15:08:26 2015 : Fungi analysis of main strand finished.
    Thu Jul 16 15:08:26 2015 : All processes finished.
Thu Jul 16 15:08:26 2015 : Parallel HMM-scan finished.
Thu Jul 16 15:08:26 2015 : Analysing results of HMM-scan (this might take quite some time)...


In [2]:
!head -2 ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.fasta
# Overnight run

>101_391;barcodelabel=101;size=8883;|F|ITS2 Extracted ITS2 sequence 82-235 (154 bp)
CAACCCATCAAGCCTAGCGCTTGTGTTGGAGCCCTACGGCCGCCGCAGCCTCCTAAAATCAGTGGCGGGCTCGCTATCACGCTGAGTGCAGTAGTATTCTTCTCACTCCTGTTGTGTAGCGGGTAACCAGCCGTAAAAACCCCCCATATTCAAA


In [149]:
!head -2 ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.fasta
# Afternoon run

>101_391;barcodelabel=101;size=8883;|F|ITS2 Extracted ITS2 sequence 82-235 (154 bp)
CAACCCATCAAGCCTAGCGCTTGTGTTGGAGCCCTACGGCCGCCGCAGCCTCCTAAAATCAGTGGCGGGCTCGCTATCACGCTGAGTGCAGTAGTATTCTTCTCACTCCTGTTGTGTAGCGGGTAACCAGCCGTAAAAACCCCCCATATTCAAA
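
The header above suggests that ITSx keeps the semicolon-delimited annotations and appends its own text after them. A quick way to confirm that `barcodelabel` and `size` survive is to parse the key=value fields before and after. This is a sketch; `parse_annotations` is a hypothetical helper, and the "before" header is assumed to be the part of the label that precedes the ITSx suffix.

```python
def parse_annotations(header):
    """Return the key=value fields of a semicolon-delimited FASTA header."""
    fields = {}
    for part in header.lstrip(">").split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key] = value
    return fields


before = ">101_391;barcodelabel=101;size=8883;"
after = ">101_391;barcodelabel=101;size=8883;|F|ITS2 Extracted ITS2 sequence 82-235 (154 bp)"

print(parse_annotations(after))
print(parse_annotations(before) == parse_annotations(after))
```

If the two dictionaries match for every record, downstream tools that rely on these annotations should still find them.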


In [162]:
%%mothur
summary.seqs(fasta=../../SeqData/ITS.demult.maxee.homoP.usearch.fasta, processors=4)

mothur > summary.seqs(fasta=../../SeqData/ITS.demult.maxee.homoP.usearch.fasta, processors=4)

Using 4 processors.

Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	50	50	0	2	1
2.5%-tile:	1	187	187	0	3	54404
25%-tile:	1	294	294	0	4	544039
Median: 	1	303	303	0	5	1088078
75%-tile:	1	339	339	0	6	1632116
97.5%-tile:	1	392	392	0	7	2121751
Maximum:	1	492	492	0	8	2176154
Mean:	1	309.523	309.523	0	5.10661
# of Seqs:	2176154

Output File Names:
../../SeqData/ITS.demult.maxee.homoP.usearch.summary

It took 21 secs to summarize 2176154 sequences.

mothur > quit()
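
For intuition about the NBases and Polymer columns, here is a rough Python sketch of per-sequence length and longest-homopolymer statistics. It is not mothur's implementation and does not reproduce its percentile table; the function names and toy sequences are made up for the example.

```python
from itertools import groupby
from statistics import median


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base (0 if empty)."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def summarize(seqs):
    """Collect a few length and homopolymer statistics for a list of sequences."""
    lengths = [len(s) for s in seqs]
    polymers = [longest_homopolymer(s) for s in seqs]
    return {
        "n_seqs": len(seqs),
        "min_len": min(lengths),
        "median_len": median(lengths),
        "max_len": max(lengths),
        "max_polymer": max(polymers),
    }


toy = ["ACGT", "AAAT", "GGGGGGAT"]  # toy sequences, not real reads
stats = summarize(toy)
print(stats)
```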


In [4]:
%%mothur
summary.seqs(fasta=../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.fasta)
# Overnight run - same as afternoon - ok to proceed.

mothur > summary.seqs(fasta=../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.fasta)

Using 1 processors.

Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	34	34	0	3	1
2.5%-tile:	1	143	143	0	3	2368
25%-tile:	1	153	153	0	4	23677
Median: 	1	163	163	0	5	47354
75%-tile:	1	199	199	0	6	71031
97.5%-tile:	1	242	242	0	7	92340
Maximum:	1	350	350	0	8	94707
Mean:	1	176.071	176.071	0	4.96453
# of Seqs:	94707

Output File Names:
../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.summary

It took 1 secs to summarize 94707 sequences.

mothur > # Overnight run
[ERROR]: You are missing (
Invalid.

mothur > quit()


In [150]:
%%mothur
summary.seqs(fasta=../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.fasta)
# Original afternoon run

mothur > summary.seqs(fasta=../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.fasta)

Using 1 processors.

Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	34	34	0	3	1
2.5%-tile:	1	143	143	0	3	2368
25%-tile:	1	153	153	0	4	23677
Median: 	1	163	163	0	5	47354
75%-tile:	1	199	199	0	6	71031
97.5%-tile:	1	242	242	0	7	92340
Maximum:	1	350	350	0	8	94707
Mean:	1	176.071	176.071	0	4.96453
# of Seqs:	94707

Output File Names:
../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.summary

It took 2 secs to summarize 94707 sequences.

mothur > quit()


In [12]:
!usearch -cluster_otus ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.fasta -otus ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.OTUs.fasta -relabel OTU_
# We don't really need -sizein/-sizeout to keep counts here, because the counts come from
# usearch_global.

usearch v8.0.1623_i86osx32, 4.0Gb RAM (17.2Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: tlw59@cornell.edu

00:06  35Mb  100.0% 2234 OTUs, 748 chimeras (0.8%)


In [13]:
!usearch -uchime_ref ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.OTUs.fasta \
-db ../../SeqData/UNITE/ITS1_ITS2_datasets/uchime_sh_refs_dynamic_develop_985_11.03.2015.ITS2.fasta \
-nonchimeras ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.OTUs.nochim.fasta -chimeras ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.OTUs.chim.fasta -strand plus
# Look for more chimeras by comparing the OTUs against the UNITE reference database.

usearch v8.0.1623_i86osx32, 4.0Gb RAM (17.2Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: tlw59@cornell.edu

00:00 2.2Mb  100.0% Reading ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.OTUs.fasta
00:01  19Mb  100.0% Reading ../../SeqData/UNITE/ITS1_ITS2_datasets/uchime_sh_refs_dynamic_develop_985_11.03.2015.ITS2.fasta
00:01  11Mb  100.0% Masking
00:01  12Mb  100.0% Word stats
00:01  12Mb  100.0% Alloc rows
00:01  29Mb  100.0% Build index
00:02  36Mb  100.0% Search 31/2234 chimeras found (1.4%)
00:02  36Mb  100.0% Writing 31 chimeras
00:02  36Mb  100.0% Writing 2203 non-chimeras
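
The percentages in the two chimera reports can be checked by hand. The arithmetic below assumes that the de novo figure is expressed relative to the 94707 input sequences and the reference-based figure relative to the 2234 OTUs; both assumptions are consistent with the numbers printed above.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


denovo = percent(748, 94707)   # chimeras flagged during cluster_otus
reference = percent(31, 2234)  # chimeras flagged by uchime_ref
remaining_otus = 2234 - 31

print(denovo, reference, remaining_otus)
```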


In [14]:
!usearch -usearch_global ../../SeqData/ITS.demult.maxee.homoP.usearch.fasta \
-db ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.OTUs.nochim.fasta \
-strand plus -id 0.95 \
-uc ../../SeqData/ITS.readmap.uc \
-threads 4
# Query: our full set of fasta sequences, not just the unique ones.
# Database: our picked OTUs, used as the reference.
# We know the strands are oriented correctly (-strand plus); -id 0.95 sets a 95% identity threshold.
# -uc writes a uclust-formatted (tab-delimited) file.
# -threads 4 uses 4 threads.

# It seems to find the matches in our ITS2-trimmed OTU references, so this works, I think.

usearch v8.0.1623_i86osx32, 4.0Gb RAM (17.2Gb total), 4 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

Licensed to: tlw59@cornell.edu

00:00 2.2Mb  100.0% Reading ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.OTUs.nochim.fasta
00:00 1.8Mb  100.0% Masking
00:00 2.7Mb  100.0% Word stats
00:00 2.7Mb  100.0% Alloc rows
00:00 4.0Mb  100.0% Build index
01:06  40Mb  100.0% Searching ITS.demult.maxee.homoP.usearch.fasta, 88.0% matched


In [15]:
# Makes an OTU table
!python /opt/virt_env/bin/uc2otutab.py ../../SeqData/ITS.readmap.uc > ../../SeqData/ITS.otu_table.txt

../../SeqData/ITS.readmap.uc 100.0%   
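
Conceptually, building the OTU table from the .uc file means counting hits per (OTU, sample) pair. The sketch below illustrates the idea only and is not the code in uc2otutab.py. It assumes the standard 10-column tab-delimited UC layout (record type in column 1, query label in column 9, target label in column 10) and a `barcodelabel=` annotation in the query label; the example lines are invented.

```python
import re
from collections import defaultdict

LABEL_RE = re.compile(r"barcodelabel=([^;]+)")


def uc_to_counts(uc_lines):
    """Count hit records per OTU and sample from UC-formatted lines."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        # Only "H" (hit) records map a read to an OTU.
        if len(fields) < 10 or fields[0] != "H":
            continue
        query, target = fields[8], fields[9]
        match = LABEL_RE.search(query)
        if match is None:
            continue  # no sample label, cannot assign the read
        counts[target][match.group(1)] += 1
    return counts


# Invented UC lines for illustration.
uc_lines = [
    "H\t0\t300\t98.0\t+\t0\t0\t300M\t100_1;barcodelabel=100\tOTU_1",
    "H\t0\t301\t97.5\t+\t0\t0\t301M\t100_2;barcodelabel=100\tOTU_1",
    "H\t1\t295\t99.0\t+\t0\t0\t295M\t101_7;barcodelabel=101\tOTU_2",
    "N\t*\t*\t*\t.\t*\t*\t*\t101_9;barcodelabel=101\t*",
]
table = uc_to_counts(uc_lines)
print({otu: dict(samples) for otu, samples in table.items()})
```

Reads with no hit ("N" records) are simply left out of the table, which is why the table total can be smaller than the number of input reads.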


In [16]:
!if [ -f ../../SeqData/ITS.otu_table.biom ]; then rm ../../SeqData/ITS.otu_table.biom; fi #This is to mitigate a biom bug
!biom convert -i ../../SeqData/ITS.otu_table.txt -o ../../SeqData/ITS.otu_table.biom --table-type="OTU table" --to-json

In [17]:
!if [ -f ../../SeqData/ITS.otu_table_summary.txt ]; then rm ../../SeqData/ITS.otu_table_summary.txt; fi #This is to mitigate a biom bug
!biom summarize-table -i ../../SeqData/ITS.otu_table.biom -o ../../SeqData/ITS.otu_table_summary.txt

In [18]:
# This summarizes the OTU table as a whole.
# Num observations = number of OTUs
# Total count = total number of sequences assigned to OTUs

!cat ../../SeqData/ITS.otu_table_summary.txt

Num samples: 100
Num observations: 2200
Total count: 1902544
Table density (fraction of non-zero values): 0.062

Counts/sample summary:
 Min: 1.0
 Max: 165893.0
 Median: 6463.000
 Mean: 19025.440
 Std. dev.: 28826.747
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
 30: 1.0
 54: 2.0
 37: 2.0
 63: 2.0
 60: 5.0
 14: 9.0
 68: 13.0
 16: 28.0
 51: 57.0
 48: 63.0
 87: 76.0
 11: 95.0
 42: 96.0
 8: 100.0
 22: 135.0
 9: 150.0
 92: 194.0
 45: 270.0
 21: 294.0
 23: 337.0
 29: 391.0
 18: 462.0
 47: 468.0
 81: 665.0
 25: 768.0
 2: 953.0
 75: 954.0
 27: 1139.0
 24: 1212.0
 78: 1364.0
 84: 1561.0
 71: 1747.0
 90: 1923.0
 28: 1924.0
 73: 2081.0
 6: 2164.0
 10: 2490.0
 57: 2492.0
 82: 2922.0
 74: 2979.0
 76: 3462.0
 4: 3934.0
 26: 4573.0
 5: 4646.0
 3: 4671.0
 19: 4894.0
 77: 5087.0
 72: 5259.0
 85: 5307.0
 12: 6024.0
 1: 6902.0
 97: 7253.0
 15: 7582.0
 88: 7769.0
 103:
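
Several samples in the detail list have very few sequences. A small sketch for flagging them from the "Counts/sample detail" lines follows; the threshold of 1000 is an arbitrary value chosen for illustration, not a recommendation from this analysis, and `low_depth_samples` is a hypothetical helper.

```python
def low_depth_samples(detail_lines, min_count=1000):
    """Return sample IDs whose count is below min_count.

    Lines look like ' 30: 1.0'. Lines with no count (for example, a
    truncated last line) are skipped.
    """
    flagged = []
    for line in detail_lines:
        sample, _, count = line.strip().partition(":")
        count = count.strip()
        if not count:
            continue
        if float(count) < min_count:
            flagged.append(sample)
    return flagged


# A few lines copied from the summary above, including the truncated one.
detail = [" 30: 1.0", " 54: 2.0", " 2: 953.0", " 27: 1139.0", " 103:"]
flagged = low_depth_samples(detail)
print(flagged)
```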

In [19]:
!cp ../../SeqData/ITS.demult.maxee.homoP.usearch.unique.sorted.output.ITS2.OTUs.nochim.fasta ../../SeqData/ITS.otus.fasta