### Comparing pangenome sketches with metagenomes:
Why: From the perspective of a microbial genome, we can see which genes are shared across multiple strains of the microbe. However, different host species likely carry different strains, so genes that count as core or cloud for the species as a whole may not appear in a given type of metagenome. 
In DNA space, the core hashes (core as defined by Roary) differ between metagenome types (hosts). In protein space this effect may be smaller, because synonymous codon changes alter the DNA sequence but not the encoded protein. 
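
The codon argument can be illustrated with a toy example. This is a minimal sketch, not part of the pipeline: the two sequences are made up for illustration, the codon table is deliberately partial, and `codon_to_aa`, `translate`, and `count_diffs` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Partial codon lookup: only the codons used in this toy example.
codon_to_aa() {
  case $1 in
    GCT|GCC|GCA|GCG) echo A ;;
    AAA|AAG)         echo K ;;
    GAA|GAG)         echo E ;;
    CTT|CTC|CTA|CTG) echo L ;;
    *)               echo X ;;   # X marks a codon outside the partial table
  esac
}

# Translate a DNA string in frame 0, three bases at a time.
translate() {
  local dna=$1 prot="" i
  for ((i = 0; i + 3 <= ${#dna}; i += 3)); do
    prot+=$(codon_to_aa "${dna:i:3}")
  done
  printf '%s\n' "$prot"
}

# Count the positions at which two equal-length strings differ.
count_diffs() {
  local a=$1 b=$2 n=0 i
  for ((i = 0; i < ${#a}; i++)); do
    if [[ ${a:i:1} != "${b:i:1}" ]]; then
      n=$((n + 1))
    fi
  done
  printf '%s\n' "$n"
}

# Made-up sequences that differ only at third codon positions.
strain_a=GCTAAAGAACTG
strain_b=GCCAAGGAGCTT

echo "DNA differences:   $(count_diffs "$strain_a" "$strain_b")"
echo "protein, strain a: $(translate "$strain_a")"
echo "protein, strain b: $(translate "$strain_b")"
```

The two toy sequences differ at four DNA positions, so they would share no DNA k-mer that spans any of those positions, yet they translate to the same peptide and therefore share all protein k-mers.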

#### Checking before building snakefile:
- One microbial species (M. elsdenii), for now only reference strains 
- 10 human and 10 pig metagenomes
- use a cloud and a core sketch
- translate these to protein as well, and compare in protein space



### Species present in multiple samples:
- Gemmiger qucibialis (18)
- Prevotella copri (17)

In [None]:
# run gather before the pangenome step, because we need a species that is present in multiple metagenomes
# run gather.smk
srun --account=ctbrowngrp -p med2 -J gather -t 2:00:00 -c 20 --mem=120gb --pty bash
mamba activate branchwater-skipmer

snakemake --use-conda --resources mem_mb=120000 --rerun-triggers mtime \
-c 20 --rerun-incomplete -k -s rules/gather.smk -n

In [None]:
# run gather before the pangenome step, because we need a species that is present in multiple metagenomes
# run gather.smk
srun --account=ctbrowngrp -p med2 -J gather -t 5:00:00 -c 60 --mem=120gb --pty bash
mamba activate branchwater-skipmer

snakemake --use-conda --resources mem_mb=120000 --rerun-triggers mtime \
-c 50 --rerun-incomplete -k -s rules/gather.smk -n

# do gather tables
mamba activate pangenomics_dev
# species-level db:
# /home/baumlerc/dissertation-project/fastgather-test/query_dbs/gtdb-rs220-k31.species.zip

sourmash scripts gather_tables \
*.mag*.csv \
-p -f sparse \
-o 250603_gather.magsforpang.csv 

In [None]:
# download metaGs and sketch
# need to download the SRA files and sketch them translated, because we want to compare in protein space
# see download_sketch_reads.smk

srun --account=ctbrowngrp -p med2 -J sketch -t 2:00:00 -c 20 --mem=80gb --pty bash
mamba activate branchwater-skipmer

snakemake --use-conda --resources mem_mb=80000 --rerun-triggers mtime \
-c 20 --rerun-incomplete -k -s download_sketch_reads.smk -n


srun --account=ctbrowngrp -p med2 -J sketch -t 2:00:00 -c 8 --mem=80gb --pty bash
mamba activate branchwater-skipmer

snakemake --use-conda --resources mem_mb=80000 --rerun-triggers mtime \
-c 8 --rerun-incomplete -k -s compare_metag.smk -n


In [None]:

srun --account=ctbrowngrp -p med2 -J sketch -t 2:00:00 -c 1 --mem=80gb --pty bash
mamba activate branchwater-skipmer

# first with the DNA ones
sourmash sig overlap \
/group/ctbrowngrp2/amhorst/2025-pangenome/results/test_pipeline/l_amylovorus/sourmash/l_amylovorus.core.all.zip \
/group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs/ERR1135371.sig \
-k 21

# do sig overlap with the metaGs (pangenome sketch and normal sketch)
sourmash sig overlap \
/group/ctbrowngrp2/amhorst/2025-pangenome/results/test_pipeline/l_amylovorus/sourmash/l_amylovorus.core.tr.all.zip \
/group/ctbrowngrp2/amhorst/2025-pangenome/results/metag_signatures/ERR1135371.tr.zip \
-k 10 --protein 



# compare with a protein sketch
sourmash sketch protein core.faa -p k=10,scaled=500 -o ../sourmash/core.all.prot.zip


sourmash sig overlap core.all.prot.zip \
/group/ctbrowngrp2/amhorst/2025-pangenome/results/metag_signatures/SRR10499417.zip \
-k 10 --protein 

In [None]:
# sketch in DNA as well and compare to metaGs. 

# sketch and pangenome it
mamba activate branchwater-skipmer
sourmash sketch dna m_elsdenii.cloud.fa -o m_elsdenii.cloud.k21.zip -p k=21,scaled=1000 

# try pangenome merge and ranktable (m2n3, k21, s50)
sourmash scripts pangenome_merge m_elsdenii.cloud.k21.zip -k 21  \
-o m_elsdenii.cloud.pang.k21.zip --scaled 1000

sourmash scripts pangenome_ranktable \
m_elsdenii.cloud.pang.k21.zip -o m_elsdenii.cloud.k21.rank.csv \
-k 21 --scaled 1000 


# same for core
sourmash sketch dna m_elsdenii.core.fa -o m_elsdenii.core.k21.zip -p k=21,scaled=1000,abund
# try pangenome merge and ranktable (m2n3, k21, s50)
sourmash scripts pangenome_merge m_elsdenii.core.k21.zip -k 21  \
-o m_elsdenii.core.pang.k21.zip --scaled 1000 

# ranktable takes the merged pangenome sketch as input, as for cloud above
sourmash scripts pangenome_ranktable \
m_elsdenii.core.pang.k21.zip -o m_elsdenii.core.k21.rank.csv \
-k 21 --scaled 1000 





In [None]:
## Now make translated and protein ranktables:
## The difference: protein is from Prokka-predicted proteins, translated is from translated CDS
# How would I fit this in a snakemake? Probably doesn't matter; I can make a list [core, cloud]
# translated sketches used as input:
# m_elsdenii.core.tr.k10.zip
# m_elsdenii.cloud.tr.k10.zip

# pang ranktables: (change colname for one)
sourmash scripts pangenome_merge m_elsdenii.core.tr.k10.zip --protein -k 10 --scaled 500 \
-o m_elsdenii.core.tr.k10.pang.zip --no-dna
sourmash scripts pangenome_ranktable \
m_elsdenii.core.tr.k10.pang.zip -o m_elsdenii.core.tr.k10.rank.csv \
-k 10 --scaled 500 --protein --no-dna


sourmash scripts pangenome_merge m_elsdenii.cloud.tr.k10.zip --protein -k 10 --scaled 500 \
-o m_elsdenii.cloud.tr.k10.pang.zip --no-dna
sourmash scripts pangenome_ranktable \
m_elsdenii.cloud.tr.k10.pang.zip -o m_elsdenii.cloud.tr.k10.rank.csv \
-k 10 --scaled 500 --protein --no-dna


# prokka protein:
sourmash scripts pangenome_merge core_genes/core.prot.zip --protein -k 10 --scaled 500 \
-o m_elsdenii.core.prot.k10.pang.zip --no-dna
sourmash scripts pangenome_ranktable \
m_elsdenii.core.prot.k10.pang.zip -o m_elsdenii.core.prot.k10.rank.csv \
-k 10 --scaled 500 --protein --no-dna

sourmash scripts pangenome_merge cloud_genes/cloud.prot.zip --protein -k 10 --scaled 500 \
-o m_elsdenii.cloud.prot.k10.pang.zip --no-dna
sourmash scripts pangenome_ranktable \
m_elsdenii.cloud.prot.k10.pang.zip -o m_elsdenii.cloud.prot.k10.rank.csv \
-k 10 --scaled 500 --protein --no-dna

# 
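
Short of a full Snakemake rule, one way to act on the idea of iterating over a [core, cloud] list is a plain shell loop. The sketch below covers only the translated sketches and only prints the commands (a dry run); the flags and file names are copied from the cell above, and `pang_cmds` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the merge and ranktable commands for one gene set (core or cloud).
pang_cmds() {
  local geneset=$1
  local sketch=m_elsdenii.$geneset.tr.k10.zip
  local merged=m_elsdenii.$geneset.tr.k10.pang.zip
  local rank=m_elsdenii.$geneset.tr.k10.rank.csv
  # The ranktable step consumes the output of the merge step.
  echo "sourmash scripts pangenome_merge $sketch --protein -k 10 --scaled 500 -o $merged --no-dna"
  echo "sourmash scripts pangenome_ranktable $merged -o $rank -k 10 --scaled 500 --protein --no-dna"
}

for geneset in core cloud; do
  pang_cmds "$geneset"
done
```

Keeping the three file names in variables makes it harder to pass the unmerged sketch to the ranktable step by accident.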


In [None]:
python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.core.k21.rank.csv pig_sra.txt --scaled=1000 -k 21 -o pig.x.core.dna.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.core.k21.rank.csv human_sra.txt --scaled=1000 -k 21 -o human.x.core.dna.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.cloud.k21.rank.csv pig_sra.txt --scaled=1000 -k 21 -o pig.x.cloud.dna.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.cloud.k21.rank.csv human_sra.txt --scaled=1000 -k 21 -o human.x.cloud.dna.dmp

In [None]:
# worst way possible 
python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.core.prot.k10.rank.csv pig_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o pig.x.core.prot.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.core.prot.k10.rank.csv human_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o human.x.core.prot.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.cloud.prot.k10.rank.csv pig_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o pig.x.cloud.prot.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.cloud.prot.k10.rank.csv human_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o human.x.cloud.prot.dmp


python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.core.tr.k10.rank.csv pig_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o pig.x.core.tr.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.core.tr.k10.rank.csv human_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o human.x.core.tr.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.cloud.tr.k10.rank.csv pig_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o pig.x.cloud.tr.dmp

python ../../../workflow/scripts/calc-hash-presence.py \
m_elsdenii.cloud.tr.k10.rank.csv human_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o human.x.cloud.tr.dmp
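
The repeated calls above follow one naming pattern, so they could be generated by a nested loop instead of being written out by hand. This is a sketch that assumes the file names follow exactly the pattern used in these cells; `hash_presence_cmds` is a hypothetical helper, and it only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
script=../../../workflow/scripts/calc-hash-presence.py

# Print one command per sketch type x gene set x host (3 x 2 x 2 = 12 lines).
hash_presence_cmds() {
  local kind geneset host
  for kind in dna tr prot; do
    for geneset in core cloud; do
      for host in pig human; do
        if [ "$kind" = dna ]; then
          # DNA ranktables: k=21, scaled=1000, DNA metagenome list
          echo "python $script m_elsdenii.$geneset.k21.rank.csv ${host}_sra.txt --scaled=1000 -k 21 -o $host.x.$geneset.dna.dmp"
        else
          # translated (tr) and Prokka protein (prot) ranktables: k=10, scaled=500
          echo "python $script m_elsdenii.$geneset.$kind.k10.rank.csv ${host}_sra_prot.txt --scaled=500 --protein --no-dna -k 10 -o $host.x.$geneset.$kind.dmp"
        fi
      done
    done
  done
}

hash_presence_cmds            # review the commands first
# hash_presence_cmds | bash   # uncomment to actually run them
```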

In [None]:
python ../../../workflow/scripts/parse-dump.py \
--dump-files-1 human.x.core.dna.dmp \
--dump-files-2 pig.x.core.dna.dmp > cmp_core.dna.csv

python ../../../workflow/scripts/parse-dump.py \
--dump-files-1 human.x.core.tr.dmp \
--dump-files-2 pig.x.core.tr.dmp > cmp_core.tr.csv


python ../../../workflow/scripts/parse-dump.py \
--dump-files-1 human.x.cloud.tr.dmp \
--dump-files-2 pig.x.cloud.tr.dmp > cmp_cloud.tr.csv
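
Each parse-dump.py call needs both the human and the pig dump file, so a quick check of which pairs exist can save a failed run. The sketch below assumes the `human.x.<set>.<type>.dmp` and `pig.x.<set>.<type>.dmp` naming used above; `check_dump_pairs` is a hypothetical helper, and the `cmp_<set>.<type>.csv` names for combinations not run above are extrapolated from that pattern.

```shell
#!/usr/bin/env bash
# Report, for each gene set x sketch type, whether both dump files exist in a directory.
# Returns non-zero if any pair is incomplete.
check_dump_pairs() {
  local dir=${1:-.} geneset kind human pig missing=0
  for geneset in core cloud; do
    for kind in dna tr prot; do
      human=$dir/human.x.$geneset.$kind.dmp
      pig=$dir/pig.x.$geneset.$kind.dmp
      if [ -f "$human" ] && [ -f "$pig" ]; then
        echo "ready   cmp_$geneset.$kind.csv"
      else
        echo "missing cmp_$geneset.$kind.csv"
        missing=$((missing + 1))
      fi
    done
  done
  [ "$missing" -eq 0 ]
}

check_dump_pairs . || echo "some dump files are missing"
```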
