# PIPITS Fungal ITS-dedicated Pipeline
* The default paired-read merging algorithm in vsearch discards 90% of the data. This was also observed in other datasets, and the algorithm is believed to be overly conservative. PIPITS supports PEAR, a dedicated read merger, as an alternative.

### Dependencies ####

##### || PIPITS ||
* Follow instructions provided at: 
* https://github.com/hsgweon/pipits
* Note: all dependencies that require 'sudo' will already be met (i.e. don't bother running those commands; they won't work anyway)

##### || deML ||
* Follow instructions provided at: 
* https://github.com/grenaud/deML

##### || phyloseq ||
* conda install -c r-igraph 
* Rscript -e "source('http://bioconductor.org/biocLite.R');biocLite('phyloseq')" 

##### || FUNGuild ||
* Download the FUNGuild script:
* https://raw.githubusercontent.com/UMNFuN/FUNGuild/master/Guilds_v1.1.py

##### || PEAR ||
* download at: https://sco.h-its.org/exelixis/web/software/pear/

### Citations ###
* Gweon, H. S., Oliver, A., Taylor, J., Booth, T., Gibbs, M., Read, D. S., et al. (2015). PIPITS: an automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform. Methods in ecology and evolution, 6(8), 973-980.

* Renaud, G., Stenzel, U., Maricic, T., Wiebe, V., & Kelso, J. (2014). deML: robust demultiplexing of Illumina sequences using a likelihood-based approach. Bioinformatics, 31(5), 770-772.

* McMurdie and Holmes (2013) phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE. 8(4):e61217

* Nguyen NH, Song Z, Bates ST, Branco S, Tedersoo L, Menke J, Schilling JS, Kennedy PG. 2016. FUNGuild: An open annotation tool for parsing fungal community datasets by ecological guild. Fungal Ecology 20:241–248.

* Zhang J, Kobert K, Flouri T, Stamatakis A. 2013. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics, 30(5): 614-620.


###### Last Modified by R. Wilhelm on January 2nd, 2018 ######

# Step 1: User Input

In [None]:
import os

# Provide the directory for your index and read files
ITS = '/home/roli/FORESTs_BHAVYA/WoodsLake/raw_seq/ITS/'

# Provide [name, directory, metadata file name] for each dataset
datasets = [['ITS',ITS,'ITS.metadata.pipits.Woods.tsv']]

# Ensure your read files are named accordingly (or modify to suit your needs)
readFile1 = 'read1.fq.gz'
readFile2 = 'read2.fq.gz'
indexFile1 = 'index_read1.fq.gz'
indexFile2 = 'index_read2.fq.gz'

# Example of metadata file
#Index1	Index2	Name
#AATTCAA	CATCCGG	RG1
#CGCGCAG	TCATGGT	RG2
#AAGGTCT	AGAACCG	RG3
#ACTGGAC	TGGAATA	RG4

## Again, for our pipeline Index1 typically is the reverse complement of the reverse barcode, while Index2 is the forward barcode.
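
Since Index1 is described above as the reverse complement of the reverse barcode, a small helper can convert barcodes when building the metadata file. This is a minimal sketch; `reverse_complement` is a hypothetical helper and is not part of deML or PIPITS.

```python
# Hypothetical helper (not part of deML or PIPITS): reverse-complement a barcode.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

# Example using the first Index1 barcode from the metadata example above
print(reverse_complement("AATTCAA"))  # → TTGAATT
```

Applying the function twice returns the original barcode, which is a quick way to check a converted metadata file.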

# Step 2: Demultiplex Raw Reads

In [None]:
# Ignore all the 'conflict' errors. The reads are paired, so the conflicts are bogus (i.e. deML gives a warning every time a barcode appears in multiple samples, but no pairs are duplicated)

for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    metadata = directory+dataset[2]
    index1 = directory+indexFile1
    index2 = directory+indexFile2
    read1 = directory+readFile1
    read2 = directory+readFile2
    
    # Make output directory
    %mkdir $directory/pipits_input/
    
    # Run deML   ## Note: you may get an error involving 'ulimit'. If so, exit your notebook, enter 'ulimit -n 9999' at the command line, then start a new notebook.
    !deML -i $metadata -f $read1 -r $read2 -if1 $index1 -if2 $index2 -o $directory/pipits_input/$name

    # Remove unnecessary 'failed' reads and index files
    %rm $directory/pipits_input/*.fail.* $directory/pipits_input/unknown*
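
The 'ulimit' note above can be checked from Python before running deML. This is a minimal sketch using the standard-library `resource` module (Unix only). The threshold of 9999 simply mirrors the `ulimit -n 9999` suggestion and is an assumption, not a documented deML requirement; `open_file_limit_ok` is a hypothetical helper.

```python
import resource

# Assumption: 9999 mirrors the 'ulimit -n 9999' suggestion; deML does not document this value.
WANTED_OPEN_FILES = 9999

def open_file_limit_ok(wanted=WANTED_OPEN_FILES):
    """Return True if the soft open-file limit is unlimited or at least `wanted`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft == resource.RLIM_INFINITY or soft >= wanted

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
if not open_file_limit_ok():
    print("Consider running 'ulimit -n 9999' before starting the notebook.")
```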

# Step 3: Make Sample Mapping File (aka. 'readpairlist')

In [None]:
import glob, re
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
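    # Remove previously prepended name (PIPITS wanted something)
    # Only the file name is edited, so a matching string in the directory path is left alone
    for file in glob.glob(directory+"pipits_input/"+name+"_*"):
        base = re.sub("^"+re.escape(name)+"_", "", os.path.basename(file))
        new_name = os.path.join(os.path.dirname(file), base)
        os.rename(file, new_name)
    
    # Rename files with extension .fq.gz to .fastq.gz (PIPITS is PICKY)
    # The dots are escaped so they match literal '.' characters only
    for file in glob.glob(directory+"pipits_input/*.fq.gz"):
        new_name = re.sub(r"\.fq\.gz$", ".fastq.gz", file)
        os.rename(file, new_name)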
    
    # Remove Unbinned Reads
    %rm $directory/pipits_input/unknown*        
    
    # Run PIPITS List Prep
    input_dir = directory+"pipits_input/"
    output_dir = directory+name+".readpairslist.txt"
    
    !pipits_getreadpairslist -i $input_dir -o $output_dir -f
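
The renaming performed in this step can be expressed as a single function, which makes it easy to preview the new names before touching any files. This is a sketch only: `pipits_friendly_name` is a hypothetical helper, and the example path is invented for illustration.

```python
import os
import re

def pipits_friendly_name(path, prefix):
    """Hypothetical helper: strip a leading 'prefix_' and switch .fq.gz to
    .fastq.gz, editing the file name only and leaving the directory untouched."""
    folder, base = os.path.split(path)
    if base.startswith(prefix + "_"):
        base = base[len(prefix) + 1:]
    base = re.sub(r"\.fq\.gz$", ".fastq.gz", base)
    return os.path.join(folder, base)

# Invented example path; the directory contains 'ITS_' but is not modified
print(pipits_friendly_name("/data/ITS_run/pipits_input/ITS_RG1_r1.fq.gz", "ITS"))
```

Working on the base name avoids accidentally rewriting a directory that happens to contain the dataset name followed by an underscore.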


# Step 4: Pre-process Data with PIPITS (merge and QC)

In [None]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_dir = directory+"pipits_input/"
    output_dir = directory+"pipits_prep/"
    readpairfile = directory+name+".readpairslist.txt"
    
    !pipits_prep -i $input_dir -o $output_dir -l $readpairfile
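
Because read merging can discard a large share of the data (see the note at the top of this notebook), it may be worth checking how many reads survive this step. The sketch below counts records in a gzipped FASTQ file and in a FASTA file and reports the fraction retained. The function names are hypothetical, and which input files to compare against `pipits_prep/prepped.fasta` is left to the user.

```python
import gzip

def count_fastq_records(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4

def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def fraction_retained(n_input, n_output):
    """Fraction of input reads still present in the output (0.0 if no input)."""
    return n_output / n_input if n_input else 0.0

# Illustration with made-up counts: 100 merged reads out of 1000 read pairs
print(fraction_retained(1000, 100))
```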


# Step 4: Extract Variable Region (**User Input Required**)

In [None]:
ITS_Region = "ITS1"

for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_file = directory+"pipits_prep/prepped.fasta"
    output_dir = directory+"pipits_funits/"
    
    !pipits_funits -i $input_file -o $output_dir -x $ITS_Region 
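
Since this step depends on a user-supplied region, a small guard can catch typos before the extraction starts. This sketch assumes the only accepted values are "ITS1" and "ITS2"; `check_region` is a hypothetical helper.

```python
# Assumption: the region passed with -x is either "ITS1" or "ITS2".
VALID_REGIONS = ("ITS1", "ITS2")

def check_region(region):
    """Return the region unchanged if it is valid, otherwise raise ValueError."""
    if region not in VALID_REGIONS:
        raise ValueError("ITS_Region must be one of %s, got %r" % (VALID_REGIONS, region))
    return region

print(check_region("ITS1"))
```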


# Step 5: Cluster and Assign Taxonomy

In [None]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    input_file = directory+"pipits_funits/ITS.fasta"
    output_dir = directory+"PIPITS_final/"

    !pipits_process -i $input_file -o $output_dir --Xmx 20G
    

# Step 6: Push OTU Table through FUNGuild

In [None]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    
    # Prepare PIPITS output for FUNGuild
    !pipits_funguild.py -i $directory/PIPITS_final/otu_table.txt -o $directory/PIPITS_final/otu_table_funguild.txt
   
    # Run FUNGuild
    !python /home/db/FUNGuild/Guilds_v1.1.py -otu $directory/PIPITS_final/otu_table_funguild.txt -db fungi -m -u

# Step 7: Import into R 

In [None]:
## Setup R-Magic for Jupyter Notebooks
import rpy2
import pandas as pd
%load_ext rpy2.ipython
%R library(phyloseq)

for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
    metadata = dataset[2]
    
    # Input Biom
    biom = directory+"/PIPITS_final/otu_table.biom" 
    %R -i biom
    %R x <- import_biom(biom)

    # Fix taxonomy table
    %R colnames(tax_table(x)) <- c("Domain","Phylum","Class","Order","Family","Genus","Species")
    %R tax_table(x) = gsub("k__| p__| c__| o__| f__| g__| s__","",tax_table(x)) 

    # Merge Mapping into Phyloseq  
    sample_file = pd.read_table(directory+metadata, keep_default_na=False)
    %R -i sample_file
    %R rownames(sample_file) <- sample_file$X.SampleID
    %R sample_file$X.SampleID <- NULL
    %R sample_file <- sample_data(sample_file)
       
    %R p <- merge_phyloseq(x, sample_file)
                        
    # Save Phyloseq Object as '.rds'
    output = directory+"/PIPITS_final/p_"+name+".pipits.final.rds"
    # Use the line magic (%R); the cell magic (%%R) is not valid inside a loop
    %R -i output
    %R saveRDS(p, file = output)
    
    # Confirm Output
    %R print(p)
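
The `gsub` call in the "Fix taxonomy table" step strips rank prefixes from each taxonomy entry. The same pattern is shown here in Python for illustration only; `strip_rank_prefix` is a hypothetical helper. Note that every prefix except `k__` is matched together with a leading space, exactly as in the R pattern.

```python
import re

# Same alternation as the gsub pattern above, including the leading spaces
RANK_PREFIX = re.compile(r"k__| p__| c__| o__| f__| g__| s__")

def strip_rank_prefix(label):
    """Remove rank prefixes such as 'k__' or ' p__' from a taxonomy label."""
    return RANK_PREFIX.sub("", label)

print(strip_rank_prefix("k__Fungi"))        # → Fungi
print(strip_rank_prefix(" p__Ascomycota"))  # → Ascomycota
```

A label such as `p__Ascomycota` without the leading space would be left unchanged by this pattern.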

# Step 8: Clean-up Intermediate Files and Final Outputs

In [None]:
for dataset in datasets:
    name = dataset[0]
    directory = dataset[1]
   
    %rm -r $directory/pipits_prep/
    %rm -r $directory/pipits_funits/
    %rm -r $directory/pipits_input/
    
    del_me = directory+name+".readpairslist.txt"
    %rm $del_me
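
The same clean-up can be done in plain Python, which skips paths that no longer exist instead of raising an error. This is a sketch under the directory layout used in this notebook; `remove_intermediates` is a hypothetical helper.

```python
import os
import shutil

def remove_intermediates(directory, name):
    """Delete the intermediate PIPITS folders and the read-pair list, if present.
    Returns the list of paths that were actually removed."""
    removed = []
    for sub in ("pipits_prep", "pipits_funits", "pipits_input"):
        path = os.path.join(directory, sub)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    list_file = os.path.join(directory, name + ".readpairslist.txt")
    if os.path.isfile(list_file):
        os.remove(list_file)
        removed.append(list_file)
    return removed

# Usage (commented out so nothing is deleted by accident):
# for name, directory, _metadata in datasets:
#     print(remove_intermediates(directory, name))
```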