# PIPITS Fungal ITS-dedicated Pipeline

* DADA2 can, in theory, process ITS data; PIPITS is a dedicated alternative for fungal ITS sequences.


### Dependencies ###

##### || PIPITS ||
* Follow instructions provided at: 
* https://github.com/hsgweon/pipits
* Note: all dependencies that require 'sudo' will already be met, so skip those commands (they won't work anyway).

##### || deML ||
* Follow instructions provided at: 
* https://github.com/grenaud/deML

##### || phyloseq ||
* conda install -c r r-igraph
* Rscript -e "source('http://bioconductor.org/biocLite.R');biocLite('phyloseq')" 

##### || FUNGuild ||
* Download the FUNGuild script:
* https://raw.githubusercontent.com/UMNFuN/FUNGuild/master/Guilds_v1.1.py

### Citations ###
* Gweon, H. S., Oliver, A., Taylor, J., Booth, T., Gibbs, M., Read, D. S., et al. (2015). PIPITS: an automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform. Methods in ecology and evolution, 6(8), 973-980.

* Renaud, G., Stenzel, U., Maricic, T., Wiebe, V., & Kelso, J. (2014). deML: robust demultiplexing of Illumina sequences using a likelihood-based approach. Bioinformatics, 31(5), 770-772.

* McMurdie, P. J., & Holmes, S. (2013). phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE, 8(4), e61217.

* Nguyen, N. H., Song, Z., Bates, S. T., Branco, S., Tedersoo, L., Menke, J., Schilling, J. S., & Kennedy, P. G. (2016). FUNGuild: an open annotation tool for parsing fungal community datasets by ecological guild. Fungal Ecology, 20, 241-248.





###### Last Modified by R. Wilhelm on October 12th, 2017 ######


# Step 1: User Input

In [38]:
import os

# Provide the directory for your index and read files
ITS = '/home/roli/FORESTs_BHAVYA/WoodsLake/raw_seq/ITS/'

# Provide one [name, directory, metadata file] entry per dataset
datasets = [['ITS',ITS,'ITS.metadata.pipits.Woods.tsv']]

# Ensure your read files are named accordingly (or modify to suit your needs)
readFile1 = 'read1.fq.gz'
readFile2 = 'read2.fq.gz'
indexFile1 = 'index_read1.fq.gz'
indexFile2 = 'index_read2.fq.gz'

# Example of metadata file
#Index1	Index2	Name
#AATTCAA	CATCCGG	RG1
#CGCGCAG	TCATGGT	RG2
#AAGGTCT	AGAACCG	RG3
#ACTGGAC	TGGAATA	RG4

## Note: in our pipeline, Index1 is typically the reverse complement of the reverse barcode, while Index2 is the forward barcode.
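
The index convention described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: `reverse_complement` and `metadata_row` are hypothetical helper names, and the input barcode `TTGAATT` was chosen only so that the result matches the first example row of the metadata file.

```python
# Hypothetical helpers (assumptions, not from the original notebook).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(barcode):
    """Return the reverse complement of a DNA barcode (A/C/G/T/N only)."""
    return barcode.upper().translate(_COMPLEMENT)[::-1]

def metadata_row(reverse_barcode, forward_barcode, sample_name):
    """Build one tab-separated metadata line in the order Index1, Index2, Name."""
    index1 = reverse_complement(reverse_barcode)  # Index1 = revcomp of reverse barcode
    index2 = forward_barcode                      # Index2 = forward barcode as-is
    return "\t".join([index1, index2, sample_name])

# Illustrative input only
example_row = metadata_row("TTGAATT", "CATCCGG", "RG1")
print(example_row)
```

Whether your barcodes need reverse-complementing depends on the sequencing setup, so check a few reads against the index files before relying on this.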

# Step 2: Demultiplex Raw Reads

In [40]:
# Ignore all the 'conflict' messages. The reads are paired, so the conflicts are bogus (i.e. deML gives a warning every time a barcode appears in multiple samples, but no barcode pairs are duplicated)

for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    metadata = directory+dataset[2]
    index1 = directory+indexFile1
    index2 = directory+indexFile2
    read1 = directory+readFile1
    read2 = directory+readFile2
    
    # Make output directory
    %mkdir $directory/pipits_input/
    
    # Run deML   ## Note: you may get an error involving 'ulimit'. If so, exit your notebook, enter 'ulimit -n 9999' at the command line, then start a new notebook.
    !deML -i $metadata -f $read1 -r $read2 -if1 $index1 -if2 $index2 -o $directory/pipits_input/$name

    # Remove unnecessary 'failed' reads and index files
    %rm $directory/pipits_input/*.fail.* $directory/pipits_input/unknown*

Conflicts for index1:
CGAGAGTT from F193 causes a conflict with F194 F195 F196 F197 F198 F199 F200 
CGAGAGTT from F194 causes a conflict with F193 F195 F196 F197 F198 F199 F200 
CGAGAGTT from F195 causes a conflict with F193 F194 F196 F197 F198 F199 F200 
CGAGAGTT from F196 causes a conflict with F193 F194 F195 F197 F198 F199 F200 
CGAGAGTT from F197 causes a conflict with F193 F194 F195 F196 F198 F199 F200 
CGAGAGTT from F198 causes a conflict with F193 F194 F195 F196 F197 F199 F200 
CGAGAGTT from F199 causes a conflict with F193 F194 F195 F196 F197 F198 F200 
CGAGAGTT from F200 causes a conflict with F193 F194 F195 F196 F197 F198 F199 
GACATAGT from F201 causes a conflict with F202 F203 F204 F205 F206 F207 F208 
GACATAGT from F202 causes a conflict with F201 F203 F204 F205 F206 F207 F208 
GACATAGT from F203 causes a conflict with F201 F202 F204 F205 F206 F207 F208 
GACATAGT from F204 causes a conflict with F201 F202 F203 F205 F206 F207 F208 
GACATAGT from F205 causes a c

rm: cannot remove '/home/roli/FORESTs_BHAVYA/WoodsLake/raw_seq/ITS//pipits_input/unknown*': No such file or directory
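
As an alternative to restarting the notebook when deML hits the 'ulimit' error, the open-file limit can often be raised from inside the kernel. The sketch below uses Python's standard `resource` module (Unix only); `raise_open_file_limit` is a hypothetical helper name, and it assumes that commands launched with `!` inherit the kernel's limit, which is the usual behaviour for child processes but is not verified here for every setup.

```python
import resource

def raise_open_file_limit(target=9999):
    """Raise the soft limit on open files towards `target`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process cannot raise the soft limit above the hard limit
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the current soft limit is already high enough
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target

new_limit = raise_open_file_limit(9999)
print("Soft limit on open files:", new_limit)
```

If the hard limit itself is below the value deML needs, this will not help and the limit has to be raised by an administrator.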


# Step 3: Make Sample Mapping File (aka. 'readpairlist')

In [41]:
import glob, re
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    # Remove the dataset name prepended during demultiplexing (strip it from the file name only, not the path)
    for file in glob.glob(directory+"pipits_input/"+name+"_*"):
        folder, base = os.path.split(file)
        os.rename(file, os.path.join(folder, base[len(name)+1:]))
    
    # Rename files from extension .fq.gz to .fastq.gz (PIPITS is PICKY)
    for file in glob.glob(directory+"pipits_input/*.fq.gz"):
        new_name = re.sub(r"\.fq\.gz$", ".fastq.gz", file)  # escape the dots and anchor at the end
        os.rename(file, new_name)
    
    # Remove Unbinned Reads
    %rm $directory/pipits_input/unknown*        
    
    # Run PIPITS List Prep
    input_dir = directory+"pipits_input/"
    output_file = directory+name+".readpairslist.txt"
    
    !pipits_getreadpairslist -i $input_dir -o $output_file -f


Generating a read-pair list file from the input directory...
Done. "/home/roli/FORESTs_BHAVYA/WoodsLake/raw_seq/ITS/ITS.readpairslist.txt" created.
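
The renaming rule used in this step can be checked on file names alone, without touching the disk. The sketch below is an illustration under assumptions: `pipits_filename` is a hypothetical helper, and the example path (including the `_r1` suffix) is made up rather than taken from a real deML run.

```python
import os

def pipits_filename(path, prefix):
    """Return `path` with the dataset prefix removed and .fq.gz changed to .fastq.gz."""
    folder, base = os.path.split(path)
    # Strip the prefix only from the file name, never from the directory part
    if base.startswith(prefix + "_"):
        base = base[len(prefix) + 1:]
    # Swap the extension only when it is the literal suffix .fq.gz
    if base.endswith(".fq.gz"):
        base = base[:-len(".fq.gz")] + ".fastq.gz"
    return os.path.join(folder, base)

# Hypothetical path for illustration
example = pipits_filename("/data/ITS/pipits_input/ITS_F193_r1.fq.gz", "ITS")
print(example)
```

Working on the base name matters here because the dataset directory is itself called `ITS`, so a pattern applied to the whole path could match more than intended.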


# Step 4: Pre-process Data with PIPITS (merge and QC)

In [None]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_dir = directory+"pipits_input/"
    output_dir = directory+"pipits_prep/"
    readpairfile = directory+name+".readpairslist.txt"
    
    !pipits_prep -i $input_dir -o $output_dir -l $readpairfile


2017-10-16 10:02:48 PIPITS_PREP started
2017-10-16 10:02:48 Processing the listfile
2017-10-16 10:02:48 Counting sequences in rawdata
2017-10-16 10:04:14   Number of reads: 16275046
2017-10-16 10:04:14 Reindexing forward reads
2017-10-16 10:09:20 Reindexing reverse reads
2017-10-16 10:15:23 Joining paired-end reads [VSEARCH]
2017-10-16 11:34:20   Number of joined reads: 13060412
2017-10-16 11:34:20 Quality filtering [FASTX]
2017-10-16 11:48:37   Number of quality filtered reads: 13032128
2017-10-16 11:48:37 Converting FASTQ to FASTA [FASTX]
2017-10-16 11:55:12   Number of prepped sequences: 13032128
2017-10-16 11:55:12 Merging all into a single file
2017-10-16 11:55:28 Cleaning temporary directory
2017-10-16 11:55:29 PIPITS_PREP completed.
2017-10-16 11:55:
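
The read counts reported in the log above can be converted into retention rates to see where reads are lost. This is a small sketch, not part of PIPITS: the counts are copied by hand from the log, and `retention` is a hypothetical helper name.

```python
# Read counts copied from the PIPITS_PREP log above
stages = [
    ("raw reads", 16275046),
    ("joined reads", 13060412),
    ("quality filtered reads", 13032128),
]

def retention(stages):
    """Return rows of (label, count, fraction of previous stage, fraction of first stage)."""
    first = stages[0][1]
    previous = first
    rows = []
    for label, count in stages:
        rows.append((label, count, count / previous, count / first))
        previous = count
    return rows

report = retention(stages)
for label, count, step_fraction, total_fraction in report:
    print(f"{label:<24}{count:>10}  {step_fraction:7.2%} of previous  {total_fraction:7.2%} of raw")
```

For this run, roughly 80% of reads were joined and nearly all joined reads passed the quality filter, so the joining step accounts for most of the loss.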

# Step 4b: Extract Variable Region (**User Input Required**)

In [None]:
ITS_Region = "ITS1"

for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_file = directory+"pipits_prep/prepped.fasta"
    output_dir = directory+"pipits_funits/"
    
    !pipits_funits -i $input_file -o $output_dir -x $ITS_Region 


2017-10-16 11:55:30 INFO: PIPITS_FUNITS started
2017-10-16 11:55:30 INFO: Checking input FASTA for illegal characters
2017-10-16 11:55:40 INFO: Counting input sequences
2017-10-16 11:55:42 INFO: 	Number of input sequences: 13032128
2017-10-16 11:55:42 INFO: Dereplicating sequences for efficiency
2017-10-16 11:57:21 INFO: Extracting ITS1 from sequences [ITSx]
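
Because the region is typed by hand and the extraction takes a long time, it can be worth validating the value before launching `pipits_funits`. The sketch below assumes that the `-x` option accepts `ITS1` or `ITS2`; confirm this with `pipits_funits -h` for your installed version. `check_its_region` is a hypothetical helper name.

```python
# Assumed set of accepted regions; verify against `pipits_funits -h`
VALID_ITS_REGIONS = ("ITS1", "ITS2")

def check_its_region(region):
    """Return the normalised region name, or raise ValueError for anything else."""
    normalised = region.strip().upper()
    if normalised not in VALID_ITS_REGIONS:
        raise ValueError(f"ITS_Region must be one of {VALID_ITS_REGIONS}, got {region!r}")
    return normalised

ITS_Region = check_its_region("ITS1")
print(ITS_Region)
```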


# Step 5: Cluster and Assign Taxonomy

In [None]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_file = directory+"pipits_funits/ITS.fasta"
    output_dir = directory+"PIPITS_final/"

    !pipits_process -i $input_file -o $output_dir --Xmx 20G
    

# Step 6: Push OTU Table through FUNGuild

In [None]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    # Prepare PIPITS output for FUNGuild
    !pipits_funguild.py -i $directory/PIPITS_final/otu_table.txt -o $directory/PIPITS_final/otu_table_funguild.txt
   
    # Run FUNGuild
    !python /home/db/FUNGuild/Guilds_v1.1.py -otu $directory/PIPITS_final/otu_table_funguild.txt -db fungi -m -u

# Step 7: Clean-up Intermediate Files and Final Outputs

In [None]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
   
    %rm -r $directory/pipits_prep/
    %rm -r $directory/pipits_funits/
    %rm -r $directory/pipits_input/
    
    del_me = directory+name+".readpairslist.txt"
    %rm $del_me
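
The clean-up above fails noisily if a directory has already been removed. A more forgiving variant in plain Python is sketched below. `clean_intermediates` is a hypothetical helper, and the demonstration runs in a temporary directory so that no real data is touched.

```python
import os
import shutil
import tempfile

def clean_intermediates(directory, name,
                        subdirs=("pipits_prep", "pipits_funits", "pipits_input")):
    """Delete intermediate folders and the read-pair list if present; return what was removed."""
    removed = []
    for sub in subdirs:
        path = os.path.join(directory, sub)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    list_file = os.path.join(directory, name + ".readpairslist.txt")
    if os.path.isfile(list_file):
        os.remove(list_file)
        removed.append(list_file)
    return removed

# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "pipits_prep"))
    os.makedirs(os.path.join(tmp, "PIPITS_final"))
    open(os.path.join(tmp, "ITS.readpairslist.txt"), "w").close()
    removed = clean_intermediates(tmp, "ITS")
    remaining = sorted(os.listdir(tmp))

print(remaining)
```

The final output folder (`PIPITS_final`) is deliberately not in the default list, so it is left in place.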