### Preparing assembly files for binning
To create the depth (coverage) file that the binners need, the reads must first be mapped back to the contigs. Here this was done using bowtie2 (BWA works as well). The next step is to create a depth file with MetaBat2, convert it into a form suitable for CONCOCT and MaxBin2, and then process these into bins.

This all assumes you have installed the software mentioned here; conda is the quickest way to install it. If needed, the documentation for everything can be found here:

Metabinner: https://github.com/ziyewang/MetaBinner

MetaBAT: https://bitbucket.org/berkeleylab/metabat/src/master/README.md

CONCOCT: https://github.com/BinPro/CONCOCT

CheckM: https://github.com/Ecogenomics/CheckM/wiki

Das_tool: https://github.com/cmks/DAS_Tool

#### MetaBat2
The first piece of code here generates a fairly simple text file with the coverage of the contigs. It gathers all mapping information into a single depth file, so you can use that one file in the next analysis. The next set of code runs MetaBat2 (v2.10.2) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, minimum cluster (bin) size 50000 and maxEdges 200. An important parameter to play around with is the minimum bin size (the -s flag). When set to 200000 base pairs, it will severely limit the number of bins you gain, especially if your samples aren't perfect, which is why the script below lowers it to 50000. It is wise to run MetaBAT several times with slight alterations to the -s flag to find your optimal setting (you don't want 3 bins, you also don't want 1000).
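
To get a feel for the depth file before binning, it helps to look at it directly. The sketch below assumes the usual layout written by jgi_summarize_bam_contig_depths (tab-separated, with a header row and the columns contigName, contigLen and totalAvgDepth followed by a mean and variance column per BAM file). The table it builds is made up for illustration, and the 1500 cutoff simply mirrors the minContig value mentioned above.

```shell
# Made-up depth table in the layout jgi_summarize_bam_contig_depths writes;
# contig names, lengths and depths are illustrative only.
depth_file=$(mktemp)
printf 'contigName\tcontigLen\ttotalAvgDepth\tA.bam\tA.bam-var\n' > "$depth_file"
printf 'k141_1\t5200\t12.4\t12.4\t3.1\n' >> "$depth_file"
printf 'k141_2\t900\t4.0\t4.0\t1.2\n' >> "$depth_file"
printf 'k141_3\t1800\t7.5\t7.5\t2.0\n' >> "$depth_file"

# Count the contigs that are long enough to be considered for binning
min_len=1500
kept=$(awk -F'\t' -v m="$min_len" 'NR > 1 && $2 >= m { n++ } END { print n + 0 }' "$depth_file")
echo "$kept contigs are at least $min_len bp long"
```

If only a small fraction of your contigs passes the cutoff, that is worth knowing before you start tuning the bin size.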

For a reference on how to do this accurately, use: https://bitbucket.org/berkeleylab/metabat/wiki/Best%20Binning%20Practices
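
Trying several minimum bin sizes is easiest with a small loop. The following is a dry-run sketch only: it prints the metabat2 commands instead of running them, the paths are placeholders, and the three sizes are arbitrary examples rather than recommended values. Remove the `echo` (and create the output directories first) to run it for real.

```shell
# Dry-run sketch: prints one metabat2 command per minimum bin size.
# CONTIGS and DEPTH are placeholder paths; adjust them to your own files.
CONTIGS=/path/to/contigs/nameofcontigfile.fa
DEPTH=../data/working/MetaBAT_nameofsample_depth.txt

metabat_sweep() {
    for size in "$@"; do
        # each run gets its own output directory so the bin sets stay separate
        echo metabat2 -i "$CONTIGS" -a "$DEPTH" \
            -o "../data/results/MetaBAT_s${size}/Metabin" -m 1500 -s "$size"
    done
}

sweep_output=$(metabat_sweep 50000 100000 200000)
printf '%s\n' "$sweep_output"
```

Afterwards, counting the .fa files in each output directory gives a quick comparison of how many bins each setting produced.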

In [None]:
#set parameters for binning:
SAMPLENAME=nameofsample
CONTIGFILE=nameofcontigfile.fa
OUTDIR=MetaBAT_"$SAMPLENAME"_bins
MAPPINGPATH=/path/to/mappingfiles/
CONTIGPATH=/path/to/contigs/
CHECKMPATH=/path/to/checkm/database/

[ -d "$MAPPINGPATH" ] || { echo 'Invalid path to mapping files, exiting' && exit 1; }
[ -d "$CONTIGPATH" ] || { echo 'Invalid path to contigs, exiting' && exit 1; }
[ -d "$CHECKMPATH" ] || { echo 'Invalid path to CheckM database, exiting' && exit 1; }

#this creates a depth file for MetaBat2
jgi_summarize_bam_contig_depths --outputDepth ../data/working/MetaBAT_"$SAMPLENAME"_depth.txt $MAPPINGPATH/*.bam || { echo 'Exit code 1: Failed to create depth file, exiting' && exit 1; }

#this is the actual MetaBat2 script with verbose output, minimum length 1500 and bin size 50000

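#MetaBat2 does not create its output directory itself; -p also avoids an error when it already exists
mkdir -p ../data/results/$OUTDIR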

metabat2 -i $CONTIGPATH/$CONTIGFILE -a ../data/working/MetaBAT_"$SAMPLENAME"_depth.txt \
-o ../data/results/$OUTDIR/Metabin \
-v \
-m 1500 \
-s 50000

#this runs CheckM immediately after and puts the results alongside your bins
export CHECKM_DATA_PATH=$CHECKMPATH
checkm data setRoot $CHECKMPATH
checkm lineage_wf -x fa -t $NSLOTS ../data/results/$OUTDIR ../data/results/$OUTDIR/bins-stats || { echo 'Exit code 3: CheckM failed' && exit 3; }

#### CONCOCT
This set of commands runs CONCOCT in its standard mode. It first creates a depth/coverage file for itself to use and then runs CONCOCT with the standard settings: the k-mer length is 4, the minimum contig length is 1000, and CONCOCT runs on exactly the number of slots given to it by Hydra.

CONCOCT builds its coverage table from the BAM files created in the mapping step, so it is key that these are all in the correct place before proceeding with binning. It creates a single file, which is then used for the complete binning process. Do keep in mind that binning might take a while, so be prepared to let this run overnight.
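
Because the coverage step reads the BAM files directly, a quick pre-flight check can save a failed overnight run. The sketch below assumes the BAM files are indexed with samtools using its default naming (an index called file.bam.bai next to file.bam). The function name is made up for this example, and the demonstration runs on empty placeholder files, not real alignments.

```shell
# Returns 0 when every .bam in the directory has a matching .bam.bai index,
# otherwise lists the BAM files that still need indexing and returns 1.
check_bam_indexes() {
    dir=$1
    missing=0
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || { echo "No BAM files found in $dir"; return 1; }
        if [ ! -e "$bam.bai" ]; then
            echo "Missing index for $bam"
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration on empty placeholder files (sampleB has no index)
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA.bam" "$demo_dir/sampleA.bam.bai" "$demo_dir/sampleB.bam"
check_bam_indexes "$demo_dir" || echo "Index the files listed above before running CONCOCT"
```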

IMPORTANT: the current version of CONCOCT installs without a vital file called libmkl.so. Without this file, CONCOCT will not be able to start. You can fix this issue by installing the mkl package through conda:

conda install mkl

Additionally, samtools will not work properly after a fresh CONCOCT install. The easiest way to fix this is to go to your environment where you installed CONCOCT and force an update through conda. 
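
One more thing to watch for when adapting the script below: its file names are built from the sample name variable, and bash reads the longest valid variable name after a `$`. That means `$SAMPLENAME_contigs_cut.bed` looks up a variable called `SAMPLENAME_contigs_cut`, not `SAMPLENAME`. A small demonstration with a placeholder sample name:

```shell
SAMPLENAME=nameofsample
unset SAMPLENAME_contigs_cut   # make sure the longer name really is undefined

# Without braces, bash expands the (empty) variable SAMPLENAME_contigs_cut
without_braces="$SAMPLENAME_contigs_cut.bed"
# With braces, the variable name ends where the closing brace is
with_braces="${SAMPLENAME}_contigs_cut.bed"

echo "without braces: '$without_braces'"
echo "with braces:    '$with_braces'"
```

A name followed by a dot or a quote, such as `coverage_table_$SAMPLENAME.tsv` or `CONCOCT_"$SAMPLENAME"_bins`, is not affected, because neither character can be part of a variable name.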

In [None]:
#set parameters for binning:
SAMPLENAME=nameofsample
CONTIGFILE=nameofcontigfile.fa
OUTDIR=CONCOCT_"$SAMPLENAME"_bins
TEMPDIR=CONCOCT_"$SAMPLENAME"_temp
MAPPINGPATH=/path/to/mappingfiles/
CONTIGPATH=/path/to/contigs/
CHECKMPATH=/path/to/checkm/database/

[ -d "$MAPPINGPATH" ] || { echo 'Invalid path to mapping files, exiting' && exit 1; }
[ -d "$CONTIGPATH" ] || { echo 'Invalid path to contigs, exiting' && exit 1; }
[ -d "$CHECKMPATH" ] || { echo 'Invalid path to CheckM database, exiting' && exit 1; }

#this part cuts up the contigs into 10kb pieces for CONCOCT to use
#(the braces in ${SAMPLENAME}_contigs_cut stop bash from reading the whole string as one variable name)
cut_up_fasta.py $CONTIGPATH/$CONTIGFILE -c 10000 -o 0 --merge_last -b ../data/working/${SAMPLENAME}_contigs_cut.bed > ../data/working/${SAMPLENAME}_contigs_cut.fa || { echo 'Exit code 1: failed to cut up contigs, exiting.' && exit 1; }

#this part estimates contig coverage
concoct_coverage_table.py ../data/working/${SAMPLENAME}_contigs_cut.bed $MAPPINGPATH/*.bam > ../data/working/coverage_table_$SAMPLENAME.tsv || { echo 'Exit code 2: failed to create coverage file, exiting.' && exit 2; }

#CONCOCT script
mkdir -p ../data/results/$OUTDIR
mkdir -p ../data/working/$TEMPDIR

#this next bit actually runs CONCOCT itself
concoct --composition_file ../data/working/${SAMPLENAME}_contigs_cut.fa --coverage_file ../data/working/coverage_table_$SAMPLENAME.tsv -t $NSLOTS -b ../data/working/$TEMPDIR/ || { echo 'Exit code 3: CONCOCT failed to run, exiting.' && exit 3; }
merge_cutup_clustering.py ../data/working/$TEMPDIR/clustering_gt1000.csv > ../data/working/$TEMPDIR/${SAMPLENAME}_clustering_merged.csv || { echo 'Exit code 4: failed to merge clusters, exiting.' && exit 4; }
extract_fasta_bins.py $CONTIGPATH/$CONTIGFILE ../data/working/$TEMPDIR/${SAMPLENAME}_clustering_merged.csv --output_path ../data/results/$OUTDIR || { echo 'Exit code 5: Bins were not extracted, exiting.' && exit 5; }

#this runs CheckM immediately after and puts the results alongside your bins
export CHECKM_DATA_PATH=$CHECKMPATH
checkm data setRoot $CHECKMPATH
checkm lineage_wf -x fa -t $NSLOTS ../data/results/$OUTDIR ../data/results/$OUTDIR/bins-stats || { echo 'Exit code 6: CheckM failed' && exit 6; }

#### Metabinner
This is another binning program that can be used. Metabinner relies on scripts rather than executable commands, so you have to point it to where the scripts are located. If you installed it using conda, you will find them in your ~/.conda/envs directory. First, you'll want to generate a coverage file using Metabinner's own script. Metabinner is based on the MetaWRAP script and uses 1000 bp as the minimum contig length. You can also tweak some memory settings. In the same cell, you can calculate the k-mer composition.
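
If you are not sure where conda put the scripts, a search for one of them is usually the fastest way to find out. This is a minimal sketch: the function name is made up, and the demonstration uses a mock directory tree with an empty placeholder file rather than a real install. On your own system you would pass your conda environments directory instead.

```shell
# Prints the directory that contains gen_coverage_file.sh somewhere below
# the given root (for example ~/.conda/envs); prints nothing if it is not found.
find_metabinner_scripts() {
    find "$1" -type f -name gen_coverage_file.sh -exec dirname {} \; 2>/dev/null | head -n 1
}

# Demonstration on a mock environment layout (placeholder files only)
mock_envs=$(mktemp -d)
mkdir -p "$mock_envs/metabinner_env/bin/scripts"
touch "$mock_envs/metabinner_env/bin/scripts/gen_coverage_file.sh"

SCRIPTDIR=$(find_metabinner_scripts "$mock_envs")
echo "Metabinner scripts found in: $SCRIPTDIR"
```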

In [1]:
SAMPLENAME=nameofsample

#first you have to generate a coverage file using the script that Metabinner has. It doesn't locate these on its own, so you have to point it in the correct direction
bash /home/stegmannt/.conda/envs/metabinner_env/bin/scripts/gen_coverage_file.sh -a ../../02_assembly/data/results/contigs_fixed/contig_file \
-o ../data/working/depth_metabinner \
-f ../../01_quality/data/results/*_host_removed_R1.fastq \
-r ../../01_quality/data/results/*_host_removed_R2.fastq \
-t $NSLOTS \
-m 8

python /home/stegmannt/.conda/envs/metabinner_env/bin/scripts/gen_kmer.py ../../02_assembly/data/results/contigs_fixed/co-assembly1.contigs-fixed.fa 999 4
#in which 999 is the length cutoff (giving a 1000 bp minimum contig length) and 4 is the k-mer length
#this puts the kmer file in the same directory as the contig file, which is super annoying, so move it
mv ../../02_assembly/data/results/contigs_fixed/kmer_4_f999.csv ../data/working/kmer_4_f999_"$SAMPLENAME".csv

You can now proceed to actually running Metabinner. 

In [None]:
SAMPLENAME=nameofsample

#Metabinner runs a simplified version of CheckM that still requires the database to be set correctly
export CHECKM_DATA_PATH=/scratch/genomics/stegmannt/metagenomes/first_data-CC-revisit/04_binning/data/DATABASE
checkm data setRoot /scratch/genomics/stegmannt/metagenomes/first_data-CC-revisit/04_binning/data/DATABASE
bash /home/stegmannt/.conda/envs/metabinner_env/bin/run_metabinner.sh \
-a ../../02_assembly/data/results/contigs_fixed/co-assembly1.contigs-fixed.fa \
-o ../data/results/bins_Metabinner \
-d ../data/working/depth_metabinner/coverage_profile.tsv \
-k ../data/working/kmer_4_f999_"$SAMPLENAME".csv \
-p /home/stegmannt/.conda/envs/metabinner_env/bin \
-t $NSLOTS



#The file "metabinner_result.tsv" in the "${output_dir}/metabinner_res" is the final output.
#You probably don't need to convert to fasta, but if you do: 
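
One possible way to do that conversion is sketched here, under a stated assumption: that the result file is a two-column, tab-separated table without a header (contig name, then bin label), and that the contig names match the first word of the FASTA headers. Check your own file before relying on this. The function name and the demo data are made up for illustration.

```shell
# Splits a contig FASTA into one file per bin, given a two-column TSV
# (contig name <TAB> bin label). The TSV layout is an assumption; check your file.
# Output is appended, so write into an empty directory.
tsv_to_fasta_bins() {
    tsv=$1; fasta=$2; outdir=$3
    mkdir -p "$outdir"
    awk -F'\t' -v out="$outdir" '
        NR == FNR { bin[$1] = $2; next }               # first file: contig -> bin
        /^>/ {                                          # FASTA header line
            name = substr($1, 2); sub(/[ \t].*/, "", name)
            new = (name in bin) ? (out "/bin_" bin[name] ".fa") : ""
            if (target != "" && new != target) close(target)
            target = new
        }
        target != "" { print >> target }                # copy header and sequence lines
    ' "$tsv" "$fasta"
}

# Demonstration with made-up contigs; c4 is not assigned to any bin
demo=$(mktemp -d)
printf 'c1\t1\nc2\t2\nc3\t1\n' > "$demo/metabinner_result.tsv"
printf '>c1\nACGT\n>c2 len=4\nGGCC\n>c3\nTTAA\n>c4\nAAAA\n' > "$demo/contigs.fa"
tsv_to_fasta_bins "$demo/metabinner_result.tsv" "$demo/contigs.fa" "$demo/bins"
ls "$demo/bins"
```

Contigs that do not appear in the table are skipped, so unbinned sequences do not end up in any output file.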

### DAS_tool
This is a tool to recombine all your bins from several different algorithms into a single set, without redundancy. It requires a .tsv input (one contig name and bin name per line), whereas most binners create .fa bins. It comes with a script to convert your .fa bins to that format.
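
To make the expected input concrete, here is a minimal sketch that builds such a table by hand from a directory of .fa bins. It is meant to show the two-column layout (contig name, tab, bin name), not to replace the conversion script that ships with DAS_Tool. The function name and the demo bins are made up.

```shell
# Prints "contig<TAB>bin" for every sequence in every <bin>.fa file of a directory
fasta_dir_to_contigs2bin() {
    for f in "$1"/*.fa; do
        [ -e "$f" ] || continue
        bin=$(basename "$f" .fa)
        # the contig name is the first word of the header, without the ">"
        awk -v bin="$bin" '/^>/ { print substr($1, 2) "\t" bin }' "$f"
    done
}

# Demonstration with two made-up bins
demo_bins=$(mktemp -d)
printf '>c1 len=8\nACGTACGT\n>c2\nGGCC\n' > "$demo_bins/Metabin.1.fa"
printf '>c3\nTTAA\n' > "$demo_bins/Metabin.2.fa"

contigs2bin=$(fasta_dir_to_contigs2bin "$demo_bins")
printf '%s\n' "$contigs2bin"
```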

In [None]:
SAMPLENAME=nameofsample
SCRIPTPATH=/path/to/DAS/script/bin/
CONCOCTBINS=../data/results/CONCOCT_"$SAMPLENAME"_bins/
MetaBATBINS=../data/results/MetaBAT_"$SAMPLENAME"_bins/
CONTIGPATH=/path/to/fixed/contigs
CONTIGNAME=nameofcontigfile.fa
OUTDIR=DAS_"$SAMPLENAME"_bins
CHECKMPATH=/path/to/checkm/database/
TEMPDIR=DAS_"$SAMPLENAME"_temp

[ -d "$CONTIGPATH" ] || { echo 'Invalid path to contigs, exiting' && exit 1; }
[ -d "$CHECKMPATH" ] || { echo 'Invalid path to CheckM database, exiting' && exit 1; }
[ -d "$CONCOCTBINS" ] || { echo 'Invalid path to CONCOCT bins, exiting' && exit 1; }
[ -d "$MetaBATBINS" ] || { echo 'Invalid path to MetaBAT bins, exiting' && exit 1; }
[ -d "$SCRIPTPATH" ] || { echo 'Invalid path to DAS_Tool scripts, exiting' && exit 1; }

mkdir -p ../data/working/$TEMPDIR
mkdir -p ../data/results/$OUTDIR
#this creates the tsv file needed for DAS_tool for CONCOCT
$SCRIPTPATH/Fasta_to_Contig2Bin.sh -i $CONCOCTBINS -e fa > ../data/working/$TEMPDIR/CONCOCT_"$SAMPLENAME"_contigs2bin.tsv || { echo 'Exit code 1: failed to create CONCOCT tsv file, exiting.' && exit 1; }

#this creates the tsv file needed for DAS_tool for MetaBAT2
$SCRIPTPATH/Fasta_to_Contig2Bin.sh -i $MetaBATBINS -e fa > ../data/working/$TEMPDIR/MetaBAT2_"$SAMPLENAME"_contigs2bin.tsv || { echo 'Exit code 2: failed to create MetaBAT2 tsv file, exiting.' && exit 2; }

#this then runs DAS_tool
#DAS_Tool uses the -o value as a prefix for its output names, so check where the bins end up and adjust the CheckM input directory below if needed
DAS_Tool --write_bins -t $NSLOTS -i ../data/working/$TEMPDIR/CONCOCT_"$SAMPLENAME"_contigs2bin.tsv,../data/working/$TEMPDIR/MetaBAT2_"$SAMPLENAME"_contigs2bin.tsv \
-c $CONTIGPATH/$CONTIGNAME \
-o ../data/results/$OUTDIR || { echo 'Exit code 3: something happened while running DAS_tool, exiting' && exit 3; }

#this runs CheckM immediately after and puts the results alongside your bins
export CHECKM_DATA_PATH=$CHECKMPATH
checkm data setRoot $CHECKMPATH
checkm lineage_wf -x fa -t $NSLOTS ../data/results/$OUTDIR ../data/results/$OUTDIR/bins-stats || { echo 'Exit code 4: CheckM failed' && exit 4; }

In [None]:
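#this should run BUSCO for your bins
#OUTDIR must be set as in the DAS_tool cell; DOWNLOADPATH is the directory where BUSCO stores its downloaded lineage datasets
busco -m genome -i ../data/results/$OUTDIR -o ../data/results/$OUTDIR/bins-stats-Busco --auto-lineage -c $NSLOTS --download_path $DOWNLOADPATH || { echo 'Exit code 5: BUSCO failed, exiting.' && exit 5; }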

### Continuing and troubleshooting
You should now have 3 sets of bins, each created with a slightly different algorithm, plus a single consolidated set from DAS_Tool. It is important to have CheckM output for all of them: it tells you the quality of your bins, i.e. their completeness and contamination rates. After this, you can proceed to the "Refine Bins" part of the workflow.

CheckM runs a check against a database to determine the levels of completeness versus contamination. These statistics are vital in determining how you want to proceed in the refinement process. Mind you, CheckM will run without the database being set correctly, but the output is very confusing, so make sure you set it before running. The scripts above run CheckM as their last step, and it's good to know that CheckM is the reason these scripts need to be run on a himem node: it regularly spikes above the 16G of RAM per node.
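
Once you have the CheckM statistics, picking out the bins worth keeping can be done with a short filter. The sketch below assumes you asked CheckM for a tab-separated table (its --tab_table and -f options) and that the header contains the columns "Bin Id", "Completeness" and "Contamination". The demo table is abbreviated and made up, and the cutoffs of 50% completeness and 10% contamination are illustrative, not a recommendation from this workflow.

```shell
# Abbreviated, made-up CheckM table (a real one has more columns)
checkm_table=$(mktemp)
printf 'Bin Id\tMarker lineage\tCompleteness\tContamination\n' > "$checkm_table"
printf 'Metabin.1\tk__Bacteria\t92.5\t1.8\n' >> "$checkm_table"
printf 'Metabin.2\tk__Bacteria\t41.0\t0.5\n' >> "$checkm_table"
printf 'Metabin.3\troot\t77.3\t25.6\n' >> "$checkm_table"

# Look the columns up by name in the header, then keep bins that pass both cutoffs
good_bins=$(awk -F'\t' -v min_comp=50 -v max_cont=10 '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["Completeness"]) >= min_comp && $(col["Contamination"]) <= max_cont {
        print $(col["Bin Id"])
    }
' "$checkm_table")
echo "Bins passing the cutoffs: $good_bins"
```

Looking columns up by name rather than by position keeps the filter working if the table has extra columns.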

MetaBAT won't create its own output directories, which is extremely annoying. Make sure the directories are in place before it actually runs.
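
A simple way to guarantee this is to derive the directory from the output prefix and create it with `mkdir -p`, which also creates missing parent directories and does not complain when the directory already exists. A minimal sketch, using a temporary location and a placeholder prefix:

```shell
# The output prefix is a directory plus a file stem; only the directory must exist.
# A temporary root is used here so the sketch does not touch your real results.
OUTPREFIX=$(mktemp -d)/results/MetaBAT_nameofsample_bins/Metabin

mkdir -p "$(dirname "$OUTPREFIX")"   # creates results/ and the bin directory in one go
mkdir -p "$(dirname "$OUTPREFIX")"   # running it again is harmless

echo "Output directory ready: $(dirname "$OUTPREFIX")"
```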

CONCOCT will in general create more bins than MetaBAT2, but you can likely discard quite a few of them, since they are only around 3000 bp long, which is not a lot (although such a bin could be a viral sequence).
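
To see which bins are that small, you can total the sequence length per bin file. This is a minimal sketch: the function name is made up, the demo bins are tiny made-up sequences, and the cutoff of 10 bp only makes sense for the demo. For real bins you would pick your own minimum size.

```shell
# Prints "total_bp<TAB>file" for each .fa bin in a directory
bin_sizes() {
    for f in "$1"/*.fa; do
        [ -e "$f" ] || continue
        # sum the length of every line that is not a FASTA header
        awk -v f="$f" '!/^>/ { n += length($0) } END { printf "%d\t%s\n", n, f }' "$f"
    done
}

# Demonstration with two made-up bins of 16 bp and 3 bp
demo_bins=$(mktemp -d)
printf '>c1\nACGTACGTACGT\n>c2\nACGT\n' > "$demo_bins/0.fa"
printf '>c3\nACG\n' > "$demo_bins/1.fa"

min_bp=10
small_bins=$(bin_sizes "$demo_bins" | awk -F'\t' -v min="$min_bp" '$1 < min { print $2 }')
echo "Bins below $min_bp bp: $small_bins"
```

Listing the small bins first, rather than deleting them straight away, leaves room to check whether any of them is a sequence you want to keep.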

Congratulations! You have finished binning. The bins you have produced are considered putative genomes and can be used for a fair number of analyses, some of which I have listed in the Anvi'O notebook, others in the Analysis notebook. Good luck!