# PIPITS Fungal ITS-dedicated Pipeline

* DADA2 can, in theory, process ITS data. PIPITS is a dedicated alternative built for fungal ITS sequences.


### Dependencies ###

##### || PIPITS ||
* Follow instructions provided at: 
* https://github.com/hsgweon/pipits
* Note: all dependencies which require 'sudo' will already be met (i.e. don't bother running those commands... they won't work anyways)

##### || deML ||
* Follow instructions provided at: 
* https://github.com/grenaud/deML

##### || phyloseq ||
* conda install -c r r-igraph
* Rscript -e "source('http://bioconductor.org/biocLite.R');biocLite('phyloseq')" 

##### || FUNGuild ||
* Download the FUNGuild script:
* https://raw.githubusercontent.com/UMNFuN/FUNGuild/master/Guilds_v1.1.py

### Citations ###
* Gweon, H. S., Oliver, A., Taylor, J., Booth, T., Gibbs, M., Read, D. S., et al. (2015). PIPITS: an automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform. Methods in ecology and evolution, 6(8), 973-980.

* Renaud, G., Stenzel, U., Maricic, T., Wiebe, V., & Kelso, J. (2014). deML: robust demultiplexing of Illumina sequences using a likelihood-based approach. Bioinformatics, 31(5), 770-772.

* McMurdie and Holmes (2013) phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE. 8(4):e61217

* Nguyen NH, Song Z, Bates ST, Branco S, Tedersoo L, Menke J, Schilling JS, Kennedy PG. 2016. FUNGuild: An open annotation tool for parsing fungal community datasets by ecological guild. Fungal Ecology 20:241–248.





###### Last Modified by R. Wilhelm on October 12th, 2017 ######


# Step 1: User Input

In [1]:
import os

# Provide the directory for your index and read files
ITS = '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS/'

# Provide one entry per dataset: [name, directory, metadata file name]
datasets = [['ITS',ITS,'ITS.metadata.pipits.tsv']]

# Ensure your read files are named accordingly (or modify to suit your needs)
readFile1 = 'read1.fq.gz'
readFile2 = 'read2.fq.gz'
indexFile1 = 'index_read1.fq.gz'
indexFile2 = 'index_read2.fq.gz'

# Example of metadata file
#Index1	Index2	Name
#AATTCAA	CATCCGG	RG1
#CGCGCAG	TCATGGT	RG2
#AAGGTCT	AGAACCG	RG3
#ACTGGAC	TGGAATA	RG4

## Note: in our pipeline, Index1 is typically the reverse complement of the reverse barcode, while Index2 is the forward barcode.
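
If the barcode sheet lists the reverse barcode as ordered, Index1 has to be derived from it. A minimal sketch of that conversion follows; the `reverse_barcode` value is hypothetical, chosen only so the result matches the first row of the example metadata file.

```python
# Translation table mapping each base to its complement (N stays N)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical reverse barcode, not taken from a real primer sheet
reverse_barcode = "TTGAATT"
index1 = reverse_complement(reverse_barcode)
print(index1)  # AATTCAA
```

Applying the function twice returns the original sequence, which is a quick way to sanity-check a barcode column before demultiplexing.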

# Step 2: Demultiplex Raw Reads

In [17]:
# Ignore all the 'conflict' warnings. The reads are paired, so the conflicts are spurious: deML warns every time a barcode appears in multiple samples, but no barcode pairs are duplicated.

for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    metadata = directory+dataset[2]
    index1 = directory+indexFile1
    index2 = directory+indexFile2
    read1 = directory+readFile1
    read2 = directory+readFile2
    
    # Make output directory
    %mkdir $directory/pipits_input/
    
    # Run deML   ## Note: you may get an error involving 'ulimit'. If so, exit your notebook, enter 'ulimit -n 9999' at the command line, then start a new notebook.
    !deML -i $metadata -f $read1 -r $read2 -if1 $index1 -if2 $index2 -o $directory/pipits_input/$name

    # Remove unnecessary 'failed' reads and index files
    %rm $directory/pipits_input/*.fail.* $directory/pipits_input/unknown*

Conflicts for index1:
ATAGTACC from F001 causes a conflict with F002 F003 F004 F005 F006 F007 F008 
ATAGTACC from F002 causes a conflict with F001 F003 F004 F005 F006 F007 F008 
ATAGTACC from F003 causes a conflict with F001 F002 F004 F005 F006 F007 F008 
ATAGTACC from F004 causes a conflict with F001 F002 F003 F005 F006 F007 F008 
ATAGTACC from F005 causes a conflict with F001 F002 F003 F004 F006 F007 F008 
ATAGTACC from F006 causes a conflict with F001 F002 F003 F004 F005 F007 F008 
ATAGTACC from F007 causes a conflict with F001 F002 F003 F004 F005 F006 F008 
ATAGTACC from F008 causes a conflict with F001 F002 F003 F004 F005 F006 F007 
CGTAGCGA from F009 causes a conflict with F010 F019 F020 F029 F030 F079 F192 F122 F123 F124 F179 F180 F189 F190 F191 
CGTAGCGA from F010 causes a conflict with F009 F019 F020 F029 F030 F079 F192 F122 F123 F124 F179 F180 F189 F190 F191 
ACGTGCGC from F011 causes a conflict with F012 F013 F014 F015 F016 F017 F018 
ACGTGCGC from F012 causes a 

rm: cannot remove '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//pipits_input/unknown*': No such file or directory
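
The conflict messages above are only harmless if every (Index1, Index2) pair in the metadata file is unique. One way to confirm that before trusting the demultiplexed output is sketched below. It assumes the tab-separated layout shown in Step 1; the helper name `duplicated_index_pairs` and the toy rows are illustrative, not part of deML.

```python
import csv
import io
from collections import Counter


def duplicated_index_pairs(handle):
    """Return (Index1, Index2) pairs that occur more than once in an index table."""
    pairs = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, short lines and the '#Index1 Index2 Name' header
        if len(row) < 2 or row[0].startswith("#"):
            continue
        pairs[(row[0], row[1])] += 1
    return [pair for pair, count in pairs.items() if count > 1]


# Toy table: RG1 and RG2 share Index1, but each pair is still unique
example = io.StringIO(
    "#Index1\tIndex2\tName\n"
    "AATTCAA\tCATCCGG\tRG1\n"
    "AATTCAA\tTCATGGT\tRG2\n"
    "AAGGTCT\tAGAACCG\tRG3\n"
)
print(duplicated_index_pairs(example))  # []
```

For a real run, the same function could be called on `open(metadata)`; an empty list means the conflicts reported for single indices can be ignored.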


# Step 3: Make Sample Mapping File (aka. 'readpairlist')

In [24]:
import glob, re
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
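    # Remove the previously prepended dataset name (added through the deML -o prefix in Step 2)
    for file in glob.glob(directory+"pipits_input/"+name+"_*"):
        # Only edit the file name, so matching text in the directory path is left alone
        folder, base = os.path.split(file)
        new_name = os.path.join(folder, re.sub("^"+re.escape(name)+"_", "", base))
        os.rename(file, new_name)
    
    # Rename files from extension .fq.gz to .fastq.gz (PIPITS is PICKY)
    for file in glob.glob(directory+"pipits_input/*.fq.gz"):
        # Escape the dots and anchor at the end so only the real extension changes
        new_name = re.sub(r"\.fq\.gz$", ".fastq.gz", file)
        os.rename(file, new_name)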
    
    # Remove Unbinned Reads
    %rm $directory/pipits_input/unknown*        
    
    # Run PIPITS List Prep
    input_dir = directory+"pipits_input/"
    output_dir = directory+name+".readpairslist.txt"
    
    !pipits_getreadpairslist -i $input_dir -o $output_dir -f


rm: cannot remove '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//pipits_input/unknown*': No such file or directory
Generating a read-pair list file from the input directory...
Done. "/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS/ITS.readpairslist.txt" created.
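
The two renaming passes in this step can be checked on plain strings before touching any files. The sketch below is one way to express them as a single function; the file name `ITS_RG1_r1.fq.gz` is a hypothetical example of a prefixed demultiplexed file, and `pipits_friendly_name` is an illustrative helper, not a PIPITS command.

```python
import re


def pipits_friendly_name(filename, prefix):
    """Strip a leading '<prefix>_' and change a trailing '.fq.gz' to '.fastq.gz'."""
    # Anchor the prefix at the start so matching text elsewhere is untouched
    stripped = re.sub("^" + re.escape(prefix) + "_", "", filename)
    # Escape the dots and anchor at the end so only the extension changes
    return re.sub(r"\.fq\.gz$", ".fastq.gz", stripped)


print(pipits_friendly_name("ITS_RG1_r1.fq.gz", "ITS"))  # RG1_r1.fastq.gz
```

Working on the base name rather than the full path avoids accidentally editing a directory that happens to contain the prefix.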


# Step 4: Pre-process Data with PIPITS (merge and QC)

In [29]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_dir = directory+"pipits_input/"
    output_dir = directory+"pipits_prep/"
    readpairfile = directory+name+".readpairslist.txt"
    
    !pipits_prep -i $input_dir -o $output_dir -l $readpairfile


2017-10-13 08:55:23 PIPITS_PREP started
2017-10-13 08:55:23 Processing the listfile
2017-10-13 08:55:23 Counting sequences in rawdata
2017-10-13 08:56:55   Number of reads: 25464934
2017-10-13 08:56:55 Reindexing forward reads
2017-10-13 09:03:20 Reindexing reverse reads
2017-10-13 09:10:06 Joining paired-end reads [VSEARCH]
2017-10-13 11:14:49   Number of joined reads: 21516634
2017-10-13 11:14:49 Quality filtering [FASTX]
2017-10-13 11:38:20   Number of quality filtered reads: 21505812
2017-10-13 11:38:20 Converting FASTQ to FASTA [FASTX]
2017-10-13 11:49:00   Number of prepped sequences: 21505812
2017-10-13 11:49:00 Merging all into a single file
2017-10-13 11:49:27 Cleaning temporary directory
2017-10-13 11:49:28 PIPITS_PREP completed.
2017-10-13 11:49:
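
The counts in this log give a quick picture of how many reads survive each stage. A small sketch for pulling them out and computing retention is shown below. It assumes the "Number of ...: N" wording seen in the output above and that the counts are directly comparable between stages; `parse_counts` is an illustrative helper, not part of PIPITS.

```python
import re


def parse_counts(log_text):
    """Collect 'Number of <label>: <count>' entries from a PIPITS log."""
    pattern = re.compile(r"Number of ([^:]+):\s*(\d+)")
    return {label.strip(): int(count) for label, count in pattern.findall(log_text)}


# Counts copied from the pipits_prep output above
log_text = """
Number of reads: 25464934
Number of joined reads: 21516634
Number of quality filtered reads: 21505812
"""
counts = parse_counts(log_text)
joined_pct = 100 * counts["joined reads"] / counts["reads"]
filtered_pct = 100 * counts["quality filtered reads"] / counts["joined reads"]
print(f"joined: {joined_pct:.1f}%  kept after QC: {filtered_pct:.2f}%")
```

For this run, roughly 84.5% of reads were joined and nearly all joined reads passed quality filtering, so read merging is the stage where most data is lost.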

# Step 4b: Extract Variable Region (**User Input Required**)

In [None]:
ITS_Region = "ITS1"

for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_file = directory+"pipits_prep/prepped.fasta"
    output_dir = directory+"pipits_funits/"
    
    !pipits_funits -i $input_file -o $output_dir -x $ITS_Region 


2017-10-13 12:41:02 INFO: PIPITS_FUNITS started
2017-10-13 12:41:02 INFO: Checking input FASTA for illegal characters
2017-10-13 12:41:20 INFO: Counting input sequences
2017-10-13 12:41:24 INFO: 	Number of input sequences: 21505812
2017-10-13 12:41:24 INFO: Dereplicating sequences for efficiency
2017-10-13 12:44:10 INFO: Extracting ITS1 from sequences [ITSx]
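
Because this step runs for a long time, a typo in `ITS_Region` is worth catching before the command starts. A minimal check is sketched below. It assumes ITS1 and ITS2 are the only subregions of interest here; `check_region` is a hypothetical helper, not part of PIPITS.

```python
# Subregions assumed to be valid values for the -x option in this notebook
VALID_REGIONS = ("ITS1", "ITS2")


def check_region(region):
    """Return the normalized region name, or raise if it is not recognized."""
    normalized = region.strip().upper()
    if normalized not in VALID_REGIONS:
        raise ValueError(f"ITS_Region must be one of {VALID_REGIONS}, got {region!r}")
    return normalized


print(check_region("its1"))  # ITS1
```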


# Step 5: Cluster and Assign Taxonomy

In [32]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_file = directory+"pipits_funits/ITS.fasta"
    output_dir = directory+"PIPITS_final/"

    !pipits_process -i $input_file -o $output_dir --Xmx 20G
    

2017-10-15 15:49:09 INFO: PIPITS PROCESS started
2017-10-15 15:49:09 INFO: Generating a sample list from the input sequences
2017-10-15 15:50:33 INFO: Dereplicating and removing unique sequences prior to picking OTUs
2017-10-15 15:50:55 INFO: Picking OTUs [VSEARCH]
2017-10-15 15:53:37 INFO: Removing chimeras [VSEARCH]
2017-10-15 15:54:20 INFO: Renaming OTUs
2017-10-15 15:54:20 INFO: Mapping reads onto centroids [VSEARCH]
2017-10-15 17:17:02 INFO: Making OTU table
2017-10-15 17:21:54 INFO: Converting classic tabular OTU into a BIOM format [BIOM]
2017-10-15 17:22:07 INFO: Assigning taxonomy [RDP Classifier]
2017-10-15 17:58:12 INFO: Reformatting RDP_Classifier output
2017-10-15 17:58:13 INFO: Adding assignment to OTU table [BIOM]
2017-10-15 17:58:26 INFO: Converting OTU table with taxa assignment into a BIOM format [BIOM]
2017-10-15 17:58:41 INFO: Phyloty
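
Once the OTU table exists, per-sample read totals are a useful first check on sequencing depth. The sketch below assumes a classic tab-delimited table whose header row starts with `#OTU ID` and whose last column is `taxonomy`; that layout, the helper name and the toy values are assumptions and should be checked against the actual `otu_table.txt`.

```python
import csv
import io


def reads_per_sample(handle):
    """Sum read counts per sample from a tab-delimited OTU table."""
    reader = csv.reader(handle, delimiter="\t")
    header = None
    for row in reader:
        # Skip any comment lines until the '#OTU ID' header row is found
        if row and row[0].startswith("#OTU ID"):
            header = row
            break
    if header is None:
        raise ValueError("no '#OTU ID' header row found")
    # Assumed layout: first column is the OTU ID, last column is taxonomy
    samples = header[1:-1]
    totals = dict.fromkeys(samples, 0)
    for row in reader:
        for sample, value in zip(samples, row[1:-1]):
            totals[sample] += int(float(value))
    return totals


# Toy table with made-up counts
toy = io.StringIO(
    "#OTU ID\tRG1\tRG2\ttaxonomy\n"
    "OTU1\t10\t0\tk__Fungi; p__Ascomycota\n"
    "OTU2\t5\t7\tk__Fungi; p__Basidiomycota\n"
)
print(reads_per_sample(toy))  # {'RG1': 15, 'RG2': 7}
```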

# Step 6: Push OTU Table through FUNGuild

In [35]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    # Prepare PIPITS output for FUNGuild
    !pipits_funguild.py -i $directory/PIPITS_final/otu_table.txt -o $directory/PIPITS_final/otu_table_funguild.txt
   
    # Run FUNGuild
    !python /home/db/FUNGuild/Guilds_v1.1.py -otu $directory/PIPITS_final/otu_table_funguild.txt -db fungi -m -u

FunGuild v1.0 Beta
Connecting with FUNGuild database ...

Reading in the OTU table: '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//final_output/otu_table_funguild.txt'

Searching the FUNGuild database...
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%

Found 20053 matching taxonomy records in the database.
Dereplicating and sorting the result...
FunGuild tried to assign function to 39290 OTUs in '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//final_output/otu_table_funguild.txt'.
FUNGuild made assignments on 15735 OTUs.
Result saved to '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//final_output/otu_table_funguild.guilds.txt'

Additional output:
FUNGuild made assignments on 15735 OTUs, these have been saved to /home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//final_output/otu_table_funguild.guilds_matched.txt.
23555 OTUs were unassigned, these are saved to /home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//final_output/otu_table_funguild.guilds_unmatched.txt.

Total ca
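
The numbers reported by FUNGuild can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts printed above and confirms that assigned and unassigned OTUs add up to the total; the `percent` helper is illustrative.

```python
# Counts copied from the FUNGuild output above
total_otus = 39290   # OTUs FUNGuild tried to assign
assigned = 15735     # OTUs with a guild assignment
unassigned = 23555   # OTUs without a match


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# The two groups should account for every OTU in the table
assert assigned + unassigned == total_otus
print(f"assigned: {percent(assigned, total_otus):.1f}%")
print(f"unassigned: {percent(unassigned, total_otus):.1f}%")
```

About 40% of OTUs received a guild assignment in this run, so any downstream guild-level analysis describes a minority of the OTUs.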

# Step 7: Clean-up Intermediate Files and Final Outputs

In [37]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
   
    %rm -r $directory/pipits_prep/
    %rm -r $directory/pipits_funits/
    %rm -r $directory/pipits_input/
    
    del_me = directory+name+".readpairslist.txt"
    %rm $del_me

rm: cannot remove '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//pipits_prep/': No such file or directory
rm: cannot remove '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//pipits_funits/': No such file or directory
rm: cannot remove '/home/roli/FORESTs_BHAVYA/HonnedagaLake/raw_seq/ITS//pipits_input/': No such file or directory
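
The "No such file or directory" messages above appear when the cell is re-run after the folders are already gone. A sketch of a clean-up that checks for each path first is given below; `remove_intermediates` is a hypothetical helper, and the demonstration runs in a temporary directory rather than on real data.

```python
import os
import shutil
import tempfile


def remove_intermediates(directory, name):
    """Delete PIPITS intermediate folders and the read-pair list if they exist."""
    removed = []
    for sub in ("pipits_prep", "pipits_funits", "pipits_input"):
        path = os.path.join(directory, sub)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    listfile = os.path.join(directory, name + ".readpairslist.txt")
    if os.path.isfile(listfile):
        os.remove(listfile)
        removed.append(listfile)
    return removed


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as scratch:
    os.mkdir(os.path.join(scratch, "pipits_prep"))
    removed = remove_intermediates(scratch, "ITS")
    print([os.path.basename(p) for p in removed])  # ['pipits_prep']
```

Returning the list of removed paths makes it easy to log what was actually deleted, and a second call simply returns an empty list.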
