## Dereplicating and organizing the set of MAGs
- We assembled > 900 metagenomes.
- All of these assemblies produced MAGs.
- I need a folder with all MAGs, a TSV with all taxonomy, and a TSV with all quality reports (concatenated with csvtk).
- Then we need the genome quality file that dRep accepts.
- Dereplicate at 95% and 99% ANI, then create sourmash databases of these sets.



For the dRep quality file:
- https://drep.readthedocs.io/en/master/advanced_use.html#using-external-genome-quality-information
- https://github.com/MrOlm/drep/issues/220
- https://genomicsaotearoa.github.io/metagenomics_summer_school/resources/1_APPENDIX_ex8_Dereplication/#step-3


# show commands
sourmash info -v

In [None]:
# MAG location:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/results/MAGs/genomes/all_genomes/*.fasta
# quality report location:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/results/MAGs/genome_quality/*
# taxonomy report location:
/group/ctbrowngrp2/scratch/annie/2023-swine-sra/results/MAGs/taxonomy/*

# need to concatenate quality and tax with csvtk
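# taxonomy (run from the folder holding the taxonomy TSVs; output lands one level up)
mamba activate csvtk
csvtk concat *.tsv -t > ../250411_mag_taxonomy.tsv
# quality (run from the folder holding the quality TSVs)
csvtk concat *.tsv -t > ../250411_mag_quality.tsv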
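
After concatenating, it can help to confirm that every MAG in the genomes folder has a row in the quality table. The following is a minimal sketch on toy data, not part of the original workflow. The temporary folder, the `bin_*` names, and the assumption that the first TSV column holds the MAG name without the `.fasta` suffix are all illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-ins for the genomes folder and the concatenated quality TSV (hypothetical names).
workdir="$(mktemp -d)"
mkdir "$workdir/genomes"
touch "$workdir/genomes/bin_1.fasta" "$workdir/genomes/bin_2.fasta" "$workdir/genomes/bin_3.fasta"
printf 'Name\tCompleteness\nbin_1\t98.5\nbin_2\t75.0\n' > "$workdir/toy_quality.tsv"

# MAG names taken from the fasta files, with the suffix stripped.
for f in "$workdir"/genomes/*.fasta; do
  basename "$f" .fasta
done | sort > "$workdir/names_fasta.txt"

# MAG names taken from column 1 of the quality table, skipping the header.
tail -n +2 "$workdir/toy_quality.tsv" | cut -f1 | sort > "$workdir/names_quality.txt"

# Names that appear among the fasta files but not in the quality table.
missing="$(comm -23 "$workdir/names_fasta.txt" "$workdir/names_quality.txt")"
echo "MAGs without a quality row: $missing"
```

To use this on the real data, the two toy inputs would be swapped for the `all_genomes` folder and the concatenated quality TSV.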


In [None]:

### Quality file for dRep:
awk -F'\t' 'BEGIN {OFS=","} {print $1, $2, $4}' /group/ctbrowngrp2/scratch/annie/2023-swine-sra/results/MAGs/250410_mag_quality.tsv > 250410_mag_qual.drep.csv

# dRep expects the header: genome,completeness,contamination
# dRep also needs the genome names to carry the trailing .fasta
echo "genome,completeness,contamination" > 250411_dRep.genomeInfo
cut -f1,2,4 /group/ctbrowngrp2/scratch/annie/2023-swine-sra/results/MAGs/250411_mag_quality.tsv | sed 's/\t/.fasta\t/' | sed 's/\t/,/g' | tail -n+2 >> 250411_dRep.genomeInfo
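
The conversion above can be checked on a small toy table before running it on the full quality file. This is a sketch under the same assumption the `cut` command makes, namely that columns 1, 2, and 4 hold the genome name, completeness, and contamination. The function name `make_genome_info` and the toy rows are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a dRep-style genomeInfo CSV from a quality TSV.
# Assumes columns 1, 2, 4 = name, completeness, contamination (as in the cut command above).
make_genome_info() {
  local quality_tsv="$1" out_csv="$2"
  {
    echo "genome,completeness,contamination"
    # Skip the TSV header, append .fasta to the name, and emit comma-separated fields.
    awk -F'\t' 'BEGIN {OFS=","} NR > 1 {print $1 ".fasta", $2, $4}' "$quality_tsv"
  } > "$out_csv"
}

# Toy input with an unused third column (hypothetical data).
workdir="$(mktemp -d)"
printf 'Name\tCompleteness\tOther\tContamination\n' > "$workdir/toy_quality.tsv"
printf 'bin_1\t98.5\tx\t1.2\n' >> "$workdir/toy_quality.tsv"
printf 'bin_2\t75.0\ty\t4.8\n' >> "$workdir/toy_quality.tsv"

make_genome_info "$workdir/toy_quality.tsv" "$workdir/toy.genomeInfo"
cat "$workdir/toy.genomeInfo"
```

Doing the header, suffix, and delimiter changes in a single `awk` call avoids relying on `\t` support in `sed`, which differs between implementations.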

In [None]:
# use drep
srun --account=ctbrowngrp -p high2 -J drep -t 15:00:00 -c 32 --mem=80gb --pty bash

# 95% ANI
mamba activate drep
dRep dereplicate \
drep.95 \
--genomeInfo 250703_dRep.genomeInfo \
-p 32 \
-g /group/ctbrowngrp2/scratch/annie/2023-swine-sra/results/MAGs/genomes/all_genomes/*.fasta \
-pa 0.9 -sa 0.95 -nc 0.30 -cm larger \
-comp 50 -con 10 -l 1000 

# 99% ANI
mamba activate drep
dRep dereplicate \
drep.99 \
--genomeInfo 250703_dRep.genomeInfo \
-p 32 \
-g /group/ctbrowngrp2/scratch/annie/2023-swine-sra/results/MAGs/genomes/all_genomes/*.fasta \
-pa 0.9 -sa 0.99 -nc 0.30 -cm larger \
-comp 50 -con 10 -l 1000 
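
dRep writes the selected representatives to a `dereplicated_genomes` subfolder of each work directory. The `manysketch_95.csv` and `manysketch_99.csv` files used further down are not built anywhere in these notes. One possible way to build them is sketched here on toy folders. The function name `write_manysketch_csv` and the toy file names are hypothetical, and the real paths would be `drep.95/dereplicated_genomes` and `drep.99/dereplicated_genomes`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a manysketch input CSV (name,genome_filename,protein_filename) for one folder of fasta files.
write_manysketch_csv() {
  local genome_dir="$1" out_csv="$2"
  echo "name,genome_filename,protein_filename" > "$out_csv"
  for f in "$genome_dir"/*.fasta; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when the folder is empty
    # The protein_filename column is left empty, as in the loop used for all MAGs.
    echo "$(basename "$f"),$(realpath "$f")," >> "$out_csv"
  done
}

# Toy stand-in for a dRep work directory (hypothetical file names).
workdir="$(mktemp -d)"
mkdir -p "$workdir/drep.95/dereplicated_genomes"
touch "$workdir/drep.95/dereplicated_genomes/bin_a.fasta" \
      "$workdir/drep.95/dereplicated_genomes/bin_b.fasta"

write_manysketch_csv "$workdir/drep.95/dereplicated_genomes" "$workdir/manysketch_95.csv"
cat "$workdir/manysketch_95.csv"
```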




In [None]:
# first sketch all MAGs with manysketch, which needs a CSV listing them
# run the loop from the all_genomes folder; header and rows go to the same file
csv=/group/ctbrowngrp2/amhorst/2025-pigparadigm/results/sketches/250707_manysketch_all.csv
echo name,genome_filename,protein_filename > "$csv"
for i in *.fasta
do
  echo "$i,$(realpath "$i"),"
done >> "$csv"


# then for sketching
srun --account=ctbrowngrp -p med2 -J sketch -t 2:00:00 -c 60 --mem=100gb --pty bash

# sketch
sourmash scripts manysketch -p dna,k=21,k=31,scaled=1000,abund \
-c 24 250707_manysketch_all.csv -o MAGs.all.zip



# then index them for a rocksdb 
sourmash index MAGs.all_k31.rocksdb -F rocksdb -k 31 MAGs.all.zip

# then also make a pangenome db because it's faster
mamba activate pangenomics_dev
sourmash scripts pangenome_merge \
MAGs.all.zip -k 31 --scaled 1000 -o MAGs.all.pangenomedb.k31.zip



In [None]:
sourmash scripts manysketch -p dna,k=21,k=31,k=51,scaled=1000,abund \
-c 24 manysketch_all.csv -o MAGs.all.zip

sourmash scripts manysketch -p dna,k=21,k=31,k=51,scaled=1000,abund \
-c 24 manysketch_99.csv -o MAGs.99.zip

sourmash scripts manysketch -p dna,k=21,k=31,k=51,scaled=1000,abund \
-c 24 manysketch_95.csv -o MAGs.95.zip

# then index them for a rocksdb 
sourmash index MAGs.all_k31.rocksdb -F rocksdb -k 31 MAGs.all.zip

# then also make a pangenome db because it's faster
mamba activate pangenomics_dev
sourmash scripts pangenome_merge \
MAGs.all.zip -k 31 --scaled 1000 -o MAGs.all_k31.merged.zip

mamba activate pangenomics_dev
sourmash scripts pangenome_merge \
MAGs.99.zip -k 31 --scaled 1000 -o MAGs.99_k31.merged.zip

mamba activate pangenomics_dev
sourmash scripts pangenome_merge \
MAGs.95.zip -k 31 --scaled 1000 -o MAGs.95_k31.merged.zip
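
The sketching and merging commands in this cell differ only in the set name (`all`, `99`, `95`), so they could be driven by a loop. The following is a sketch that reuses the flags from the commands above unchanged. The `run` helper and the `dry_run` switch are illustrative additions. With `dry_run=1` the commands are only printed, so the loop can be inspected without sourmash installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

dry_run=1   # set to 0 to execute the commands instead of printing them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$dry_run" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

commands="$(
  # Sketch each set of MAGs.
  for set_name in all 99 95; do
    run sourmash scripts manysketch -p dna,k=21,k=31,k=51,scaled=1000,abund \
      -c 24 "manysketch_${set_name}.csv" -o "MAGs.${set_name}.zip"
  done
  # Merge each set into a pangenome database.
  # The notes activate the pangenomics_dev environment before this step.
  for set_name in all 99 95; do
    run sourmash scripts pangenome_merge \
      "MAGs.${set_name}.zip" -k 31 --scaled 1000 -o "MAGs.${set_name}_k31.merged.zip"
  done
)"
echo "$commands"
```

Keeping the two steps in separate loops leaves room to switch environments between sketching and merging.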