# **CHAPTER 1. Pangenome analysis**

> Disclaimer: this part of the work was performed on DSTU's server.

First, create a directory to store the mitochondrial genomes of <i>Leotiomycetes</i> fungi.

In [1]:
! mkdir data/

Now download all RefSeq mitochondrial genomes of <i>Leotiomycetes</i> fungi into the `data/Leotiomycetes.fasta` file!

In [2]:
! esearch -db nucleotide \
    -query '("Leotiomycetes"[Organism] OR Leotiomycetes[All Fields]) AND srcdb_refseq[PROP] AND (fungi[filter] AND mitochondrion[filter])' \
    | efetch -format fasta > data/Leotiomycetes.fasta

`PanACoTA` will be used for the pangenome analysis. It takes separate FASTA files as input, so split the downloaded multi-FASTA file into one file per genome!

In [5]:
%run scripts/sepfasta.py data/Leotiomycetes.fasta Leotiomycetes

Separated fasta files are written to Leotiomycetes directory
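
The contents of `scripts/sepfasta.py` are not shown in this notebook. As a rough idea of what such a splitting step involves, here is a minimal sketch, assuming each record is written to a file named after the first token of its header (the accession). The function name `split_fasta` and the `.fasta` suffix are illustrative assumptions, not the actual script.

```python
from pathlib import Path


def split_fasta(multi_fasta, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Hypothetical helper: file names are derived from the first token of
    each header line, which is an assumption about the desired naming.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(multi_fasta) as src:
        for line in src:
            if line.startswith(">"):
                # A header starts a new record: close the previous file.
                if handle is not None:
                    handle.close()
                tokens = line[1:].split()
                # Fall back to a numbered name if the header is empty.
                accession = tokens[0] if tokens else f"record_{len(written) + 1}"
                path = out_dir / f"{accession}.fasta"
                handle = open(path, "w")
                written.append(path)
            # Lines before the first header are ignored.
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written
```

Under these assumptions, `split_fasta("data/Leotiomycetes.fasta", "Leotiomycetes")` would fill the `Leotiomycetes` directory with one file per sequence.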


Okay! Now create a `data/listFile` text file that lists the FASTA file names, one per line!

In [19]:
! ls Leotiomycetes/ > data/listFile
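
If the genome directory might contain stray non-FASTA files, the list can also be built in Python with a filter. This is only a sketch: the helper name `write_list_file` and the accepted suffixes are assumptions, and the suffixes must be adjusted to whatever extension the splitting script actually produces.

```python
from pathlib import Path


def write_list_file(genome_dir, list_path, suffixes=(".fasta", ".fa", ".fna")):
    """Write one genome file name per line, skipping non-FASTA files.

    Hypothetical helper: the suffix filter is an assumption about how
    the per-genome files are named.
    """
    names = sorted(
        p.name
        for p in Path(genome_dir).iterdir()
        if p.is_file() and p.suffix in suffixes
    )
    Path(list_path).write_text("".join(f"{name}\n" for name in names))
    return names
```

For example, `write_list_file("Leotiomycetes", "data/listFile")` would write the sorted file names, without directory prefixes, just as the `ls` command does.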

As the first step of the pangenome analysis, `PanACoTA` needs to annotate all genomes in a uniform way. Let's do it!

In [None]:
! PanACoTA annotate -d Leotiomycetes/ -r Annotation -n LeMy -l data/listFile --threads 24

Good! Now let's construct a pangenome with a minimum protein sequence identity of 0.8 (80%).

In [21]:
! PanACoTA pangenome -l Annotation/LSTINFO-.lst -n LeMy -d Annotation/Proteins/ -o Pangenome -i 0.8

  * [2025-01-29 20:17:44] : INFO PanACoTA version 1.4.0
  * [2025-01-29 20:17:44] : INFO Command used
 	 > PanACoTA pangenome -l Annotation/LSTINFO-.lst -n LeMy -d Annotation/Proteins/ -o Pangenome -i 0.8
  * [2025-01-29 20:17:44] : INFO Building bank with all proteins to Annotation/Proteins/LeMy.All.prt
Building bank: ███████████████████████████ 24/24 (100%) - Elapsed Time: 0:00:00
  * [2025-01-29 20:17:44] : INFO Will run MMseqs2 with:
	- minimum sequence identity = 80.0%
	- cluster mode 1
  * [2025-01-29 20:17:44] : INFO Creating database
  * [2025-01-29 20:17:44] : INFO Clustering proteins...
  * [2025-01-29 20:17:49] : INFO Converting mmseqs results to pangenome file
  * [2025-01-29 20:17:49] : INFO Pangenome has 1999 families.
  * [2025-01-29 20:17:49] : INFO

For now, please proceed to `02_pangenome_visualization.R` and run the analysis there. Then come back!

So, there are 4 genes that are present in more than 14 of the 24 genomes, that is, in at least 15 genomes. Let's calculate what percentage of the genomes 15 represents!

In [4]:
15 * 100 / 24

62.5
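
The percentage still has to be turned into a threshold fraction, and the direction of rounding matters. The sketch below assumes that the minimum genome count is the threshold times the number of genomes, rounded up. That rule is inferred from the `corepers` log in this chapter ("at least 60.0% of the 23 different genomes (that is 14 genomes)") and has not been checked against the `PanACoTA` source. The helper name `min_genomes` is illustrative.

```python
import math


def min_genomes(tolerance, n_genomes):
    """Minimum number of genomes a family must cover.

    Assumes the count is tolerance * n_genomes rounded up, as suggested
    by the corepers log (60% of 23 genomes -> 14 genomes).
    """
    return math.ceil(tolerance * n_genomes)


print(min_genomes(0.60, 23))  # 14, matching the count reported in the log
print(min_genomes(0.62, 24))  # 15, the target: 62.5% rounded down
print(min_genomes(0.63, 24))  # 16, rounding 62.5% up would be too strict
```

Under this assumption, rounding 62.5% down to 0.62 keeps the requirement at 15 genomes, whereas 0.63 would raise it to 16.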

Perfect! Now run the `corepers` module of `PanACoTA` with a threshold of 0.62 to extract those genes!

In [17]:
! PanACoTA corepers -p Pangenome/PanGenome-LeMy.All.prt-clust-0.8-mode1.lst -o Coregenome -t 0.62

  * [2025-01-29 19:58:54] : INFO PanACoTA version 1.4.0
  * [2025-01-29 19:58:54] : INFO Command used
 	 > PanACoTA corepers -p Pangenome8/PanGenome-LeMy.All.prt-clust-0.8-mode1.lst -o Coregenome86 -t 0.6
  * [2025-01-29 19:58:54] : INFO Will generate a Persistent genome with member(s) in at least 60.0% of all genomes in each family.
To be considered as persistent, a family must contain exactly 1 member in at least 60.0% of all genomes. The other genomes are absent from the family.
  * [2025-01-29 19:58:54] : INFO Retrieving info from binary file
  * [2025-01-29 19:58:54] : INFO Generating Persistent genome of a dataset containing 23 genomes
  * [2025-01-29 19:58:54] : INFO The persistent genome contains 4 families, each one having exactly 1 member from at least 60.0% of the 23 different genomes (that is 14 genomes). The other genomes are absent from the family.
  * [2025-01-29 19:58:54] : INFO

That's all for the pangenome analysis! Please proceed to `03_phylogenetics.ipynb` for further analysis!