# MAGS:
1. Assembly using MEGAHIT, keep everything over 2000 bp (n result = 1,341,809)
- Assembly was done in Snakemake
- https://www.nature.com/articles/s41586-019-0965-1.pdf (2000 bp)

2. Try deduplicating with CD-hit at 99% ID
- https://journals.asm.org/doi/full/10.1128/mSystems.00245-21 (99% identity (v4.6.8, -c 0.99 -aS 0.99))
- https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-019-0625-6 (100% ident, 100% ID)
- https://www.pnas.org/doi/10.1073/pnas.2105124118 (99% ANI)

3. mapping with Bowtie2 or bbMap

a. Bowtie2:
https://www.sciencedirect.com/science/article/pii/S0092867419300017 (option ‘--very-sensitive-local’)
- Viral and metabolic controls on high rates of microbial sulfur and carbon cycling in wetland ecosystems

b. BBmap (probs better):
- https://www.biorxiv.org/content/10.1101/2022.05.02.490339v1.full ("fast=t ambig=random minid=0.98")
- https://journals.asm.org/doi/full/10.1128/mSystems.00245-21 (requiring 95% identity)


4. Then make depth files:
- https://bitbucket.org/berkeleylab/metabat/wiki/Home

5. Then use metabat2:
- https://www.nature.com/articles/s41586-019-0965-1 (option --minContig 2000 and default parameters)
- https://www.sciencedirect.com/science/article/pii/S0092867419300017 (default?)
- https://www.nature.com/articles/s41564-018-0276-6#Sec2 ('--veryspecific' flag, Jess paper)

Or metabat, concoct and maxbin, followed by DASTool

Then:
6. CheckM 
- https://www.nature.com/articles/s41586-019-0965-1 (near complete: >90% complete, <5% contam, or medium qual: >50% complete, <10% contam)
- https://www.nature.com/articles/s41564-018-0276-6#Sec2 
- https://www.nature.com/articles/s41587-020-0603-3#Sec11 (>50% complete, <5% contamination)
- https://www.sciencedirect.com/science/article/pii/S0092867419300017

So: keep everything >50% complete, <10% contam, and divide into near complete and medium quality.

7. Use refineM to remove potential MAG contamination

8. Use dRep to dereplicate
- https://www.nature.com/articles/s41586-019-0965-1 
dRep was run with options -pa 0.9 (primary cluster at 90%), -sa 0.95 (secondary cluster at 95%), -cm larger (coverage method: larger), -con 5 (contamination threshold of 5%). For the near-complete MAGs, the -nc parameter was set to 0.60 (coverage threshold of 60%), whereas for the medium-quality MAGs with a QS >50 this was changed to 0.30 (coverage threshold of 30%)

## Bbmap package for sequence wrangling
- https://jgi.doe.gov/data-and-tools/software-tools/bbtools/
- https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0185056

In [None]:
# Rename, length selection, sorting with bbmap package:

# Load bbmap package
module load bbmap/38-72

# rename contigs bc spaces in names from megahit
rename.sh in=all_bulk_contigs.fa out=all_bulk_contigs_renamed.fa prefix=BodegaBay_bulk_

# Keep only longer than 2000 bp
reformat.sh in=all_bulk_contigs_renamed.fa out=all_bulk_contigs_renamed_len.fa minlength=2000

# Sort by length
sortbyname.sh in=all_bulk_contigs_renamed_len.fa out=all_bulk_contigs_renamed_len.sorted.fa length descending

# Cut into smaller files because CD-HIT ran into memory issues on the full file
partition.sh in=../220518_all_contigs.fa out=out_%_raw.fa ways=10

# Sort each partition (out_*_raw.fa) by length
for f in *_raw.fa
do
sortbyname.sh in="$f" out="${f%_raw*}.sorted.fa" length descending
done
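
A quick sanity check on the length filter can be done without BBTools. The sketch below counts FASTA records at or above a minimum length with plain awk; the function name `count_min_len` and the toy sequences are made up for illustration, and the threshold is lowered to 5 bp only so the example stays small.

```shell
# Count FASTA records whose sequence length is at least min_len.
# Usage: count_min_len <fasta> <min_len>
# Handles multi-line records (sequence wrapped over several lines).
count_min_len() {
    awk -v min="$2" '
        /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
             { len += length($0) }
        END  { if (seen && len >= min) n++; print n + 0 }
    ' "$1"
}

# Toy example with made-up contigs of 8, 3 and 5 bp
tmp_fa=$(mktemp)
printf '>c1\nACGTAC\nGT\n>c2\nACG\n>c3\nACGTA\n' > "$tmp_fa"
count_min_len "$tmp_fa" 5
rm -f "$tmp_fa"
```

Run on the real file with a threshold of 2000, the count should agree with the number of contigs reported after `reformat.sh`.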

## CD-hit to deduplicate
- http://weizhong-lab.ucsd.edu/cd-hit/
- "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik Bioinformatics, (2006) 22:1658-9
- Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics, (2012), 28 (23): 3150-3152. doi: 10.1093/bioinformatics/bts565
- Num of resulting sequences: 1,019,242

In [None]:
# cd-hit to deduplicate the sequences


module load cdhit/2007-013

# Run cd-hit at 99% ident
cd-hit-est -i all_bulk_contigs_renamed_sorted.fa \
-o all_bulk_contigs_renamed_cluster.fa \
-c 0.99 -aS 0.99 -M 150000 -T 24 
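
cd-hit-est writes a `.clstr` file next to the output FASTA, in which every cluster starts with a `>Cluster N` line followed by one line per member. The sketch below counts clusters and members from such a file; the function name `clstr_summary` and the toy cluster lines are illustrative assumptions, not output from this dataset.

```shell
# Summarise a CD-HIT .clstr file: number of clusters and total member sequences.
# Usage: clstr_summary <file.clstr>
clstr_summary() {
    awk '
        /^>Cluster/ { clusters++; next }   # one header line per cluster
        NF          { members++ }          # every other non-empty line is a member
        END { printf "clusters=%d members=%d\n", clusters, members }
    ' "$1"
}

# Toy example with made-up contig names
tmp_clstr=$(mktemp)
printf '%s\n' \
    '>Cluster 0' \
    '0 5000nt, >contig_1... *' \
    '1 4800nt, >contig_7... at +/99.50%' \
    '>Cluster 1' \
    '0 3000nt, >contig_2... *' > "$tmp_clstr"
clstr_summary "$tmp_clstr"
rm -f "$tmp_clstr"
```

The number of clusters should equal the number of sequences in the deduplicated FASTA, and the number of members should equal the number of input sequences.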


# Use bowtie2 to map reads to these contigs

- Make a bowtie2 index
- map reads to index using bowtie2
- http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322381/


In [None]:
# Build index file
module load bowtie2.3.0
bowtie2-build ./all_bulk_contigs_renamed_cluster.fa all_bulk_contigs_cluster --threads 12

# Map the reads
module load bowtie2.3.0

bowtie2 --threads 12 --sensitive --no-unal \
-x /home/amhorst/2022_wetlands/assemblies/coverage/all_bulk_contigs_cluster \
-1 "$1" \
-2 "${1%_R1*}_R2.fastq.gz" \
-S "/home/amhorst/2022_wetlands/assemblies/coverage/mapping/${1%%_R1*}.sam"
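
The mapping command derives the R2 file name and the sample name from `$1` with shell parameter expansion. A small sketch of how `%` (remove the shortest matching suffix) and `%%` (remove the longest matching suffix) behave is given here; the helper names `mate_of` and `sample_of` and the file names are hypothetical.

```shell
# Derive the R2 mate and the sample name from an R1 file name.
mate_of()   { printf '%s\n' "${1%_R1*}_R2.fastq.gz"; }   # strip shortest "_R1*" suffix
sample_of() { printf '%s\n' "${1%%_R1*}"; }              # strip longest "_R1*" suffix

r1="BB_sample01_L002_R1.fastq.gz"      # hypothetical read file
mate_of "$r1"      # BB_sample01_L002_R2.fastq.gz
sample_of "$r1"    # BB_sample01_L002

# The two forms only differ when "_R1" occurs more than once in the name:
tricky="BB_R1_rep_R1.fastq.gz"         # hypothetical, deliberately awkward name
printf '%s\n' "${tricky%_R1*}"         # BB_R1_rep
printf '%s\n' "${tricky%%_R1*}"        # BB
```

For file names with a single `_R1`, both forms give the same result, which is why the mixed use of `%` and `%%` in the mapping script is harmless.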

# Convert samfiles to bamfiles using samtools
- index bamfiles using samtools
- http://www.htslib.org/
- https://academic.oup.com/bioinformatics/article/25/16/2078/204688


In [None]:
module load samtools/1.15.1

for f in BMLS2*.sam
do
echo "$f"
bam="${f%.sam}.bam"
# drop unmapped reads (-F 4), convert to BAM, sort, then index the sorted BAM
samtools view -@ 12 -F 4 -b "$f" | samtools sort -@ 12 -o "$bam" - && \
samtools index "$bam"
done

# Use metabat2 for binning
- https://bitbucket.org/berkeleylab/metabat/src/master/
- First make a depth file from all the .bam files
- https://github.com/GenomicsAotearoa/metagenomics_summer_school/blob/master/materials/day2/ex7_initial_binning.md


In [None]:
# depth file for MetaBAT2
module load metabat2/2.12.1
jgi_summarize_bam_contig_depths --outputDepth metabat_depth.txt *.bam
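
The depth file is a tab-separated table whose first three columns are `contigName`, `contigLen` and `totalAvgDepth`, followed by per-BAM depth and variance columns. As a rough check before binning, the sketch below reports the number of contigs and a length-weighted mean of `totalAvgDepth`; the function name `depth_summary` and the toy values are assumptions for illustration.

```shell
# Summarise a jgi_summarize_bam_contig_depths table.
# Usage: depth_summary <metabat_depth.txt>
depth_summary() {
    awk -F '\t' '
        NR == 1 { next }                          # skip the header line
        { n++; bp += $2; wsum += $2 * $3 }        # weight each depth by contig length
        END {
            if (bp > 0) printf "contigs=%d weighted_mean_depth=%.2f\n", n, wsum / bp
            else        print  "contigs=0"
        }
    ' "$1"
}

# Toy example: two made-up contigs and one sample
tmp_depth=$(mktemp)
printf 'contigName\tcontigLen\ttotalAvgDepth\ts1.bam\ts1.bam-var\n' > "$tmp_depth"
printf 'c1\t3000\t10.0\t10.0\t1.5\nc2\t1000\t2.0\t2.0\t0.4\n' >> "$tmp_depth"
depth_summary "$tmp_depth"
rm -f "$tmp_depth"
```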


In [None]:
# Create bins
module load metabat2/2.12.1

metabat -i all_bulk_contigs_renamed_cluster.fa \
-a ../coverage/mapping/metabat_depth.txt \
-o ../metabat/bodega_bay_metabat -t 20

# Use CheckM to estimate completeness and contamination of bins
- https://ecogenomics.github.io/CheckM/
- Keep everything >50% complete, <10% contaminated
- Divide into near complete (>90% complete, <5% contam) and medium quality (>50% complete, <10% contam).

In [None]:
# checkM 
#!/bin/bash
#SBATCH --job-name=checkM
#SBATCH --nodes=1
#SBATCH -t 05:30:00 
#SBATCH --ntasks=20
#SBATCH --output=./checkM_%j.out
#SBATCH --error=./checkM_%j.err
#SBATCH --partition=bmm

module load checkm-genome/1.0.13
source activate checkm-genome-1.0.13
  
checkm lineage_wf ./bins/ checkm -x fa -t 20
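
The quality tiers described above can be applied to a CheckM results table with a few lines of awk. The sketch below assumes a tab-separated table with a header row containing `Completeness` and `Contamination` columns and the bin ID in the first column (how that table is exported from CheckM is left open here); the function name `classify_bins` and the toy values are made up.

```shell
# Assign each bin to near_complete, medium_quality or discard.
# Usage: classify_bins <checkm_table.tsv>
classify_bins() {
    awk -F '\t' '
        NR == 1 {                                  # find the columns by header name
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            next
        }
        {
            tier = "discard"
            if ($comp > 90 && $cont < 5)       tier = "near_complete"
            else if ($comp > 50 && $cont < 10) tier = "medium_quality"
            print $1 "\t" tier
        }
    ' "$1"
}

# Toy example with made-up bins
tmp_tsv=$(mktemp)
printf 'Bin Id\tCompleteness\tContamination\n' > "$tmp_tsv"
printf 'bin.1\t95.2\t1.1\nbin.2\t60.0\t8.0\nbin.3\t40.0\t2.0\n' >> "$tmp_tsv"
classify_bins "$tmp_tsv"
rm -f "$tmp_tsv"
```

Looking up the columns by name rather than by position keeps the sketch usable when the table carries extra columns.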

## Use RefineM to remove contamination
- https://github.com/dparks1134/RefineM
- v 0.1.2

In [None]:
# activate the refinem conda environment (v0.1.2)
source ~/.bashrc
conda activate refinem



# Run refineM scaffold stats
# Usage: refinem scaffold_stats -c 16 <scaffold_file> <bin_dir> <stats_output_dir> <bam_files>

refinem scaffold_stats -c 10 \
all_bulk_contigs_renamed_cluster.fa \
../metabat/ \
../refineM/ \
./mapping/*.bam

In [None]:
#!/bin/bash
#SBATCH --job-name=refineM
#SBATCH --nodes=1
#SBATCH -t 05:30:00 
#SBATCH --ntasks=20
#SBATCH --output=./refineM_%j.out
#SBATCH --error=./refineM_%j.err
#SBATCH --partition=bmm


source ~/.bashrc
conda activate refinem 

refinem scaffold_stats -c 20 \
all_bulk_contigs_renamed_cluster.fa \
../metabat/ \
../refineM/ \
./mapping/*.bam --genome_ext fa


#!/bin/bash
#SBATCH --job-name=refineM
#SBATCH --nodes=1
#SBATCH -t 05:30:00 
#SBATCH --ntasks=1
#SBATCH --output=./refineM_%j.out
#SBATCH --error=./refineM_%j.err
#SBATCH --partition=bmm

# RefineM remove outliers
source ~/.bashrc
conda activate refinem 

refinem outliers ../refineM/scaffold_stats.tsv ../refineM/outliers && \
refinem filter_bins ../metabat/ ../refineM/outliers/outliers.tsv ../bins_filter_refinem


# Dereplicate MAGs (dRep v3.2.0)
- https://www.sciencedirect.com/science/article/pii/S0092867419300017
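- Options: -pa 0.9 (primary cluster at 90%), -sa 0.95 (secondary cluster at 95%), -cm larger (coverage method: larger), -con 5 (contamination threshold of 5%), -nc 0.30 (coverage threshold of 30%). Note that the commands below use -comp 50 and -con 10, in line with the >50% complete, <10% contam cutoff for medium-quality MAGs.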
- Use Alexandre Almeida's code on github (take from snakemake file)
- dRep will also run CheckM so that's nice
- Redo again 

In [None]:
# Dereplicate using drep in: ./2022_wetlands/MAG_bins

#!/bin/bash
#SBATCH --job-name=drep
#SBATCH --nodes=1
#SBATCH -t 16:00:00
#SBATCH --ntasks=24
#SBATCH --partition=bmm
#SBATCH --output=./drep_%j.out
#SBATCH --error=./drep_%j.err
#SBATCH --mem=100GB

source ~/.bashrc
conda activate drep

# Load other programs that dRep needs
module load mummer
module load mash
module load bowtie2

dRep dereplicate \
dRep_MAGs_output \
-p 16 \
-g ./bins_filter_refinem/*.fa \
-pa 0.9 -sa 0.95 -nc 0.30 -cm larger \
-comp 50 -con 10 -l 1000 


In [None]:
# Dereplicate using drep in: ./2022_wetlands/MAG_bins

#!/bin/bash
#SBATCH --job-name=drep
#SBATCH --nodes=1
#SBATCH -t 5:00:00
#SBATCH --ntasks=24
#SBATCH --partition=bmm
#SBATCH --output=./drep_%j.out
#SBATCH --error=./drep_%j.err
#SBATCH --mem=50GB

source ~/.bashrc
conda activate drep

# Load other programs that dRep needs
module load mummer
module load mash
module load bowtie2

dRep dereplicate \
dRep_MAGs_output \
-p 24 \
-g ./all_bins_refineM_renamed/*.fa \
-pa 0.9 -sa 0.95 -nc 0.30 -cm larger \
-comp 50 -con 10 -l 1000 

# Use CheckM to estimate completeness, contamination and get taxonomy of the bins
- https://ecogenomics.github.io/CheckM/
- Keep everything >50% complete, <10% contaminated
- Divide into near complete (>90% complete, <5% contam) and medium quality (>50% complete, <10% contam). 
- Marker lineage: indicates lineage used for inferring marker set (a precise indication of where a bin was placed in CheckM's reference tree can be obtained with the tree_qa command)
-  The concatenated fasta file of aligned marker genes is in: ./checkM/storage/tree/concatenated.fa



In [None]:
# load checkm

#!/bin/bash
#SBATCH --job-name=checkm_tree
#SBATCH --nodes=1
#SBATCH -t 02:00:00 
#SBATCH --ntasks=20
#SBATCH --output=tree.out
#SBATCH --error=tree.err
#SBATCH --partition=bmm

module load checkm-genome/1.0.13
source activate checkm-genome-1.0.13

# First make a tree
checkm tree ./dereplicated_genomes/ ./checkM -x .fa -t 20

# Then assess the phylogenetic markers within the tree
checkm tree_qa ./checkM -o 2 

# Create a marker set for all the bins that were first processed by tree making
checkm lineage_set ./checkM marker_file_qual_bins


## We ran crass a while back for finding CRISPR-Spacer pairs
https://ctskennerton.github.io/crass/assets/manual.pdf
- First run crass to identify spacers and repeats
- then use blast to match these to MAGs and vOTUs
- blast all spacers to vOTUs
- blast repeats to magbins
- See SPRUCE paper on the specifics and citations on why we chose those
- v 1.0.1

In [None]:
# crass for crispr from raw reads
# Do crass too for the 2019 samples
source ~/.bashrc
conda activate crass

for f in *R1*.gz; do
echo $f
echo ${f%%_R1*}_R2_paired.fq.gz
echo ${f%%_R1*}
sbatch crass.sh $f 
done

#!/bin/bash
#SBATCH --job-name=crass
#SBATCH --nodes=1
#SBATCH -t 02:30:00 
#SBATCH --ntasks=1
#SBATCH --output=../../crass/out/crass_%j.out
#SBATCH --error=../../crass/err/crass_%j.err
#SBATCH --partition=bmm


# load module
source ~/.bashrc
conda activate crass

crass $1 ${1%%_R1*}_R2.fastq.gz -o ../../crass/${1%%_R1*} -l 4


# extract spacer and repeat sequences
for f in BB_*/crass.crispr; do
echo $f
crisprtools extract -o ${f%%crass.crispr*}spacers -s $f -x 
crisprtools extract -o ${f%%crass.crispr*}repeats -d $f -x
done


# all the repeats and spacers need a prefix with the sample name, because multiple spacers/repeats share the same name across samples
module load bbmap
for f in BB_*
do
echo $f
cd $f 
# concatenate 
cat ./spacers/*.fa > all_spacers_crass.fa
# rename
rename.sh in=all_spacers_crass.fa out=all_spacers_crass_rename.fa prefix=$f addprefix=t
cat ./repeats/*.fa > all_repeats_crass.fa
rename.sh in=all_repeats_crass.fa out=all_repeats_crass_rename.fa prefix=$f addprefix=t
cd ..
done



# concatenate all spacers
cat */all_spacers_crass_rename.fa > all_spacers_crass.fa

# concatenate all repeats
cat */all_repeats_crass_rename.fa > all_repeats_crass.fa

# deduplicate by sequence
source ~/.bashrc
conda activate seqkit

seqkit rmdup --quiet -s all_spacers_crass.fa > all_spacers_crass.dedupe.fa
seqkit rmdup --quiet -s all_repeats_crass.fa > all_repeats_crass.dedupe.fa
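
To make explicit what deduplicating "by sequence" means, here is a plain-awk sketch that keeps the first record for each distinct sequence and drops later records with an identical sequence, regardless of their names. It is an illustration only, not a replacement for `seqkit rmdup -s`; the function name `rmdup_by_seq` and the toy records are made up, and the comparison is case-sensitive.

```shell
# Keep only the first FASTA record for every distinct sequence.
# Usage: rmdup_by_seq <fasta>
rmdup_by_seq() {
    awk '
        function emit() {
            # print the buffered record unless its sequence was already seen
            if (hdr != "" && !(seq in seen)) { seen[seq] = 1; print hdr; print seq }
        }
        /^>/ { emit(); hdr = $0; seq = ""; next }
             { seq = seq $0 }                      # join wrapped sequence lines
        END  { emit() }
    ' "$1"
}

# Toy example: s1 and s2 carry the same sequence, so s2 is dropped
tmp_fa=$(mktemp)
printf '>s1\nACGT\n>s2\nACGT\n>s3\nTTTT\n' > "$tmp_fa"
rmdup_by_seq "$tmp_fa"
rm -f "$tmp_fa"
```

This is why the sample prefix has to be added before deduplication: after it, only the name of the first sample in which a spacer or repeat occurred is retained.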



## Use blast to match (blast/2.7.1):
- blast all spacers to vOTUs
- blast repeats to magbins

- https://www.nature.com/articles/s41396-021-01132-4#Sec9 
- That paper only says less than 2 mismatches for spacers. 

- https://www.nature.com/articles/s41564-022-01150-8#Sec6
(CRISPR spacer blastn-short alignments to a UViG (indicated by phage-like icons) are shown, with arrows representing 100% identity, 0 mismatches (solid) and 96%–99.9% identity, 1 mismatch (dashed))




In [None]:
# make a total blast db of all MAG seqs
# first rename the bin sequences by putting a prefix with the bin name
for f in *.fa 
do 
echo ${f%%.filtered.fa*}
rename.sh in=$f out=${f%%.fa*}_rename.fa prefix=${f%%.filtered.fa*} addprefix=t
done

# use > rather than >> so that a rerun does not append a second copy of every bin
cat *_rename.fa > all_qual_bins_rename.fa

# Now make blastdb
module load blast 
makeblastdb -in all_qual_bins_rename.fa -dbtype nucl -out qual_bins


# Make a blastdb of all vOTU sequences
makeblastdb -in 220518_drep_all_vOTUs_recov_BB.fa -dbtype nucl -out vOTU_BB_db

# blast back with blastn-short; note that the commands below use -perc_identity 95 and -evalue 1e-3 / 1e-2, not -evalue 1e-10 and 100% nucl ident

# WE BLAST THE SPACER SEQUENCES TO THE VOTUS
blastn -task blastn-short \
-query all_spacers_crass.dedupe.fa \
-db ../outputs/vOTU_BB_db -evalue 1e-3 -perc_identity 95 -outfmt 6 \
-out spacers_vOTUs_crass_noeval.tsv

# blastn REPEAT back to all MAG sequences. 
blastn -task blastn-short \
-query all_repeats_crass.fa \
-db ./220907_dRep_MAGs \
-evalue 1e-2 -perc_identity 95 -outfmt 6 \
-out repeats_prok_crass_noeval.tsv
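
The papers cited above accept spacer hits with at most one mismatch. With the default `-outfmt 6` layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), such a filter can be sketched with awk as below. The function name `filter_spacer_hits`, the default of one mismatch and the toy hits are assumptions for illustration. The default columns do not include the query length, so this sketch cannot tell whether the alignment covers the whole spacer.

```shell
# Keep BLAST outfmt 6 hits with at most max_mm mismatches and no gap openings.
# Usage: filter_spacer_hits <hits.tsv> [max_mm]   (max_mm defaults to 1)
filter_spacer_hits() {
    awk -F '\t' -v max_mm="${2:-1}" '$5 <= max_mm && $6 == 0' "$1"
}

# Toy example with made-up spacer and vOTU names
tmp_hits=$(mktemp)
printf 'sp1\tvOTU_1\t100.0\t30\t0\t0\t1\t30\t100\t129\t1e-9\t60.0\n'  > "$tmp_hits"
printf 'sp2\tvOTU_2\t96.7\t30\t1\t0\t1\t30\t500\t529\t1e-7\t52.0\n'  >> "$tmp_hits"
printf 'sp3\tvOTU_3\t93.3\t30\t2\t0\t1\t30\t900\t929\t1e-5\t44.0\n'  >> "$tmp_hits"
filter_spacer_hits "$tmp_hits"        # keeps sp1 and sp2
filter_spacer_hits "$tmp_hits" 0      # keeps sp1 only
rm -f "$tmp_hits"
```

Adding `qlen` to a custom `-outfmt "6 ..."` column list would allow a full-length requirement, at the cost of shifting the column numbers used here.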



# create a phylogenetic tree

Hug 2016:  A maximum likelihood tree was inferred as described for the concatenated ribosomal protein trees, under the LG plus gamma model of evolution (PROTGAMMALG in the RAxML model section), and with the number of bootstraps automatically determined (MRE-based bootstopping criterion). https://www.nature.com/articles/nmicrobiol201648#Sec1

- Fasta file made by CheckM is in: ./checkM/storage/tree/concatenated.fa
- The Fasta contains the concatenated predicted protein alignment of 43 marker genes defined by CheckM
- use RaxML https://www.nature.com/articles/s41467-019-13443-4
- https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009200#sec009
- RAxML used 250 bootstraps, using the auto bootstopping
    

In [None]:
# Note to self: to run RAxML with multiple threads you need raxmlHPC-PTHREADS

#!/bin/bash
#SBATCH --job-name=raxml_pthreads
#SBATCH --nodes=1
#SBATCH -t 4:00:00 
#SBATCH --ntasks=5
#SBATCH --output=raxml_%j.out
#SBATCH --error=raxml_%j.err
#SBATCH --partition=bmm

source ~/.bashrc
conda activate raxml

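# -T is the number of threads (kept in line with --ntasks above);
# the number of bootstraps is determined by -# autoMRE
raxmlHPC-PTHREADS \
-m PROTGAMMALG \
-x 12345 -p 12345 -# autoMRE \
-T 5 \
-n MAG_tree \
-s ../checkM_markergenes_concat.fa.reduced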

In [None]:
# Then use bootstrap to create the best tree with bootstrap values and branch lengths.
raxmlHPC -f b \
-m PROTGAMMALG \
-n test_bootstrap \
-t RAxML_bestTree* \
-z RAxML_bootstrap.MAG_tree_60


In [None]:
# Make a tree only for bins with a crispr link
# linearize the file
# (remove line breaks within sequences so every record is one header line plus one sequence line)
awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' \
markergenes_copy.fa > markergenes_linear.faa

# get only protein seqs of interest from the file
while read line; do grep -A 1 "$line" markergenes_linear.faa; done < bins_w_crispr_link.txt > bins_w_crispr_link.faa
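
One caveat with the `grep` loop is that it matches substrings and treats `.` as a wildcard, so an ID such as `bin.1` would also pull out `bin.10`. A sketch of an exact-match alternative in awk follows. It assumes that the ID is the first whitespace-delimited token of the header line and that the ID list is not empty; the function name `extract_by_id` and the toy bins are made up.

```shell
# Print FASTA records whose ID (first token after ">") is listed in an ID file.
# Usage: extract_by_id <ids.txt> <fasta>
extract_by_id() {
    awk '
        NR == FNR { if ($1 != "") want[$1] = 1; next }    # first file: wanted IDs
        /^>/      { id = substr($1, 2); keep = (id in want) }
        keep                                              # print lines of wanted records
    ' "$1" "$2"
}

# Toy example: only bin.1 is requested, bin.10 must not be returned
tmp_ids=$(mktemp)
tmp_fa=$(mktemp)
printf 'bin.1\n' > "$tmp_ids"
printf '>bin.1\nAAA\n>bin.10\nCCC\n' > "$tmp_fa"
extract_by_id "$tmp_ids" "$tmp_fa"
rm -f "$tmp_ids" "$tmp_fa"
```

Because membership is tested per record rather than per line, this version also works on a FASTA that has not been linearized.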


## Mapping reads using bowtie2
- Make index
- Map reads using --sensitive, don't include unaligned reads bc of file size

In [None]:
# Index MAG fasta using Bowtie2
module load bowtie2
bowtie2-build --large-index --threads 16 \
../220907_dRep_MAGs.fa \
./220907_dRep_MAGs

In [None]:
# Map reads to this file
bowtie2 --threads 16 --sensitive \
-x ./220907_dRep_MAGs \
-1 $1 \
-2 ${1%_R1*}_R2.fastq.gz \
-S ./${1%%_R1*}.sam \
--no-unal

## Coverage table using CoverM
- Singleton et al., https://bio-protocol.org/exchange/minidetail?type=30&id=9291876
- Olm et al., https://www.science.org/doi/full/10.1126/science.abj2972#supplementary-materials

default settings except:
coverm genome -m relative_abundance --min-read-aligned-percent 0.75 --min-read-percent-identity 0.95 --min-covered-fraction .5

In [None]:
# EITHER
source ~/.bashrc
conda activate coverm

coverm genome \
--genome-fasta-directory ../dereplicated_genomes \
-x fa \
-m trimmed_mean --min-covered-fraction 0.5 \
-b *.bam > 221208_coverM_bins_05_tm.csv

# OR
source ~/.bashrc
conda activate coverm

coverm genome  --genome-fasta-directory ../dereplicated_genomes \
-m relative_abundance --min-read-aligned-percent 0.75 \
--min-read-percent-identity 0.95 -x fa \
--min-covered-fraction 0.5 -b *.bam > 220929_coverM_bins.csv
