# **CHAPTER 1. Pangenome analysis**

**Create the conda environment and activate it**

```
conda env create -f panacota.yaml
```

```
conda activate panacota
```

Make a directory to store the data

In [11]:
! mkdir data/

Download the sequences from the `RefSeq` database that match the query `"Salmonella phage" AND "complete genome"`

In [None]:
! esearch -db nucleotide \
    -query '"Salmonella phage" AND "complete genome" AND srcdb_refseq[PROP] ' \
    | efetch -format fasta > data/salmonella_raw.fasta

The downloaded `data/salmonella_raw.fasta` contains many unrelated sequences, such as `>NZ_CP019649.1 Salmonella enterica subsp. enterica serovar Typhimurium var. monophasic 4,5,12:i:- strain TW-Stm6 chromosome, complete genome`.<br>
Even though the query to the `RefSeq` database was specific, the `.fasta` file still contains a lot of junk.<br>
Let's remove it!<br>
The script below keeps only the sequences that have both `Salmonella phage` and `complete genome` in their descriptions.

In [None]:
%run scripts/filter_seqs.py data/salmonella_raw.fasta data/salmonella_251_refseq.fasta
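
The filtering step described above could look roughly like the sketch below. This is not the contents of `scripts/filter_seqs.py`; the function names `read_fasta` and `filter_fasta` and the 70-character line width are assumptions made for illustration. It only uses the standard library and keeps a record when its header contains every required phrase.

```python
# Hypothetical sketch of a header-based FASTA filter (not the original script).
REQUIRED_PHRASES = ("Salmonella phage", "complete genome")


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(in_path, out_path, phrases=REQUIRED_PHRASES, width=70):
    """Write records whose header contains every phrase; return how many were kept."""
    kept = 0
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            if all(phrase in header for phrase in phrases):
                out.write(f">{header}\n")
                # Wrap the sequence to a fixed line width (assumed: 70).
                for start in range(0, len(seq), width):
                    out.write(seq[start:start + width] + "\n")
                kept += 1
    return kept
```

A call such as `filter_fasta("data/salmonella_raw.fasta", "data/salmonella_251_refseq.fasta")` would then return the number of records that passed the filter, which is a quick way to check it against the expected count.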

Now it's time to separate the sequences!<br>
The script below takes the `.fasta` file with 251 sequences as input and writes each sequence to its own file in the output directory.

In [1]:
%run scripts/sepfasta.py data/salmonella_251_refseq.fasta Salmonella_phages

Written Salmonella_phages/NC_049439.1.fasta
Written Salmonella_phages/NC_049440.1.fasta
Written Salmonella_phages/NC_041991.1.fasta
Written Salmonella_phages/NC_073205.1.fasta
Written Salmonella_phages/NC_021777.1.fasta
Written Salmonella_phages/NC_028698.1.fasta
Written Salmonella_phages/NC_049502.1.fasta
Written Salmonella_phages/NC_049499.1.fasta
Written Salmonella_phages/NC_049505.1.fasta
Written Salmonella_phages/NC_049504.1.fasta
Written Salmonella_phages/NC_049501.1.fasta
Written Salmonella_phages/NC_049506.1.fasta
Written Salmonella_phages/NC_049503.1.fasta
Written Salmonella_phages/NC_049500.1.fasta
Written Salmonella_phages/NC_049508.1.fasta
Written Salmonella_phages/NC_049507.1.fasta
Written Salmonella_phages/NC_021775.1.fasta
Written Salmonella_phages/NC_073217.1.fasta
Written Salmonella_phages/NC_073201.1.fasta
Written Salmonella_phages/NC_073173.1.fasta
Written Salmonella_phages/NC_048110.1.fasta
Written Salmonella_phages/NC_073218.1.fasta
Written Salmonella_phages/NC_073
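
For readers without access to `scripts/sepfasta.py`, the splitting step can be sketched as follows. This is an illustrative stand-in, not the original script; the function name `split_fasta` is an assumption. It names each output file after the first whitespace-delimited token of the header (the accession), which matches the file names in the output above.

```python
# Hypothetical sketch of splitting a multi-FASTA file into one file per record.
from pathlib import Path


def split_fasta(in_path, out_dir):
    """Write each record to <out_dir>/<accession>.fasta and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(in_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # The accession is the first token after ">".
                accession = line[1:].split()[0]
                target = out_dir / f"{accession}.fasta"
                out = open(target, "w")
                written.append(target)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

The sketch assumes every header has a non-empty accession and that accessions are unique; a duplicate accession would overwrite the earlier file.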

Create the `listFile` file, which lists the name of each genome file, one per line.<br>
`PanACoTA` requires it.

In [None]:
! ls Salmonella_phages/ > data/listFile
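
Because `ls` lists everything in the directory, any stray file would also end up in `listFile`. A slightly more defensive alternative, sketched here with a hypothetical helper `write_list_file`, writes only the names that end in `.fasta`, sorted, one per line.

```python
# Hypothetical helper: build the list file from *.fasta names only.
from pathlib import Path


def write_list_file(genome_dir, list_path, suffix=".fasta"):
    """Write sorted genome file names (one per line) and return them."""
    names = sorted(
        p.name
        for p in Path(genome_dir).iterdir()
        if p.is_file() and p.suffix == suffix
    )
    Path(list_path).write_text("".join(f"{name}\n" for name in names))
    return names
```

Calling `write_list_file("Salmonella_phages", "data/listFile")` should produce the same file as the `ls` command above whenever the directory contains only `.fasta` files.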

Annotate each `Salmonella phage` genome included in the analysis

In [None]:
! PanACoTA annotate -d Salmonella_phages/ -r Annotation -n SaPh -l data/listFile --threads 24

Build the pangenome with the minimum identity parameter set to 80%

In [None]:
! PanACoTA pangenome -l Annotation/LSTINFO-.lst -n SaPh -d Annotation/Proteins/ -o Pangenome -i 0.8

Find gene families (clustered at 80% identity) that are present in at least 28% of genomes

In [None]:
! PanACoTA corepers -p Pangenome/PanGenome-SaPh.All.prt-clust-0.8-mode1.lst -o Coregenome -t 0.28
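
To get a feel for what the `-t 0.28` threshold means in absolute numbers, the fraction can be converted to a genome count. The sketch below assumes that all 251 sequences made it through annotation and that the threshold is rounded up to a whole genome; both are assumptions, and `min_genomes` is a hypothetical helper, not part of `PanACoTA`.

```python
# Hypothetical helper: convert a fractional threshold to a genome count.
import math


def min_genomes(n_genomes, tol):
    """Smallest whole number of genomes that is at least tol * n_genomes."""
    # Round first so float noise (e.g. 0.28 * 100 == 28.000000000000004)
    # does not push the result up by one.
    return math.ceil(round(tol * n_genomes, 9))


print(min_genomes(251, 0.28))  # 0.28 * 251 = 70.28, so this prints 71
```

Under those assumptions, a family would need members in at least 71 of the 251 genomes to be reported.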

Filter the `Annotation/LSTINFO-.lst` file, keeping only the entries that are present in the `Coregenome` output

In [25]:
%run scripts/process_LSTINFO.py Coregenome/PersGenome_PanGenome-SaPh.All.prt-clust-0.8-mode1.lst-all_0.28.lst Annotation/LSTINFO-.lst Annotation/filtered_LSTINFO-.lst

Filtered file created: Annotation/filtered_LSTINFO-.lst
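
The filtering step above can be sketched as follows. This is not the contents of `scripts/process_LSTINFO.py`, and it rests on assumptions about the file layouts: each line of the persistent-genome file is taken to be a family number followed by member names of the form `<genome>.<contig>_<number>` (so the genome name is everything before the last dot), and the `LSTINFO` file is taken to have one header line with the genome name in the first column. The function names are hypothetical.

```python
# Hypothetical sketch: keep only LSTINFO rows for genomes seen in the
# persistent-genome file. File layouts are assumptions, see the text above.


def genomes_in_families(pers_path):
    """Collect genome names from the members listed in each family line."""
    genomes = set()
    with open(pers_path) as handle:
        for line in handle:
            # First field is the family number; the rest are member names.
            for member in line.split()[1:]:
                genomes.add(member.rsplit(".", 1)[0])
    return genomes


def filter_lstinfo(pers_path, lstinfo_path, out_path):
    """Copy the header and matching rows to out_path; return rows kept."""
    keep = genomes_in_families(pers_path)
    kept = 0
    with open(lstinfo_path) as src, open(out_path, "w") as dst:
        header = next(src, None)
        if header is not None:
            dst.write(header)
        for line in src:
            fields = line.split()
            if fields and fields[0] in keep:
                dst.write(line)
                kept += 1
    return kept
```

Comparing the returned count with the number of genomes expected in the persistent family is a cheap sanity check before running the alignment.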


Perform a multiple sequence alignment of the one gene that was identified at 80% identity in at least 28% of genomes!

In [None]:
! PanACoTA align -c Coregenome/PersGenome_PanGenome-SaPh.All.prt-clust-0.8-mode1.lst-all_0.28.lst -l Annotation/filtered_LSTINFO-.lst -n SaPh -d Annotation/ -o Alignment

That's all! Proceed to `02_DECIPHER_journal.R` to design degenerate PCR primers for that gene!