# Module: Pangenomic Analysis in Anvi'o

Pangenomics is the study of the collective set of genes within a group of related organisms, encompassing both the core genome shared by all members and the accessory genome, which includes genes unique to subsets or individuals. This approach provides a comprehensive view of genetic diversity, revealing how organisms adapt to specific environments, acquire unique traits, or respond to evolutionary pressures.

In this notebook, we will explore how pangenome analysis is performed using the bioinformatic suite, Anvi'o.

This module is based mainly on the [Anvi'o Pangenomics Workflow](https://merenlab.org/2016/11/08/pangenomics-v2/).

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Activate the conda environment in a terminal window. Change the environment name to match your own setup.
```bash
conda activate pangenomics-env
```
2. Open jupyter notebook with the command below and select the notebook.
```bash
jupyter notebook
```
3. To run the cells in this notebook, press Shift+Enter.

---
## Tools Used
1. **Anvi'o v8**
    - Installation procedure can be found here: [Anvi'o Linux Installation](https://anvio.org/install/linux/stable/)

---
## Starting Files 

1. FASTA file of query genome(s) (see **Assembly Module** and/or **Binning Module**).
2. FASTA file of reference genome(s). A great place to look for reference genomes is the [NCBI Genome Data Viewer](https://www.ncbi.nlm.nih.gov/gdv).

In the code below, we assume that the FASTA files have a `.fa` extension and that they are placed inside the `0-genomes` folder.

---
## Expected Outputs

1. Anvi'o pangenome display
2. Exported FASTA sequences

---
## Table of Contents
 * [**Clean FASTA Files**](#Clean-FASTA-Files)
 * [**Generate Databases**](#Generate-Databases)
     * [Contigs DB](#Contigs-DB)
     * [Genomes DB](#Genomes-DB)
 * [**Pangenome**](#Pangenome)
     * [Generate the pangenome](#Generate-the-pangenome)
     * [Display pangenome](#Display-pangenome)
 * [**Extras**](#Extras)
     * [Modifying Visual](#Modifying-Visual)
     * [Selecting Gene Clusters](#Selecting-Gene-Clusters)
     * [Exporting Bin Collection](#Exporting-Bin-Collection)

----
# <font color = 'gray'>Clean FASTA Files</font>

Some steps are picky about the formatting of the FASTA file. Before we proceed, let's make sure that the query and reference genome FASTA files are properly formatted. The command below simplifies the FASTA headers and, because `--seq-type NT` is set, treats the sequences as nucleotides and replaces non-standard characters with `N`.

<div class="alert alert-block alert-info">
<b>Note:</b> 

The code below already loops through all files inside the <code>0-genomes/</code> folder so there's no need to re-run for each genome FASTA file.
</div>

In [None]:
%%bash

mkdir -p "1-clean_genomes"

for f in 0-genomes/*.fa
do
    output="1-clean_genomes/${f#*/}"
    output="${output%.fa}-clean.fa"
    
    anvi-script-reformat-fasta \
        "${f}" \
        -o "${output}" \
        --simplify-names \
        --report-file "$(basename "${f}")-report.txt" \
        --seq-type NT
done
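
As a quick sanity check after cleaning, the sketch below counts the sequences in a FASTA file and flags headers that still contain unexpected characters. The function name `check_fasta_headers` is hypothetical, and the allowed character set (letters, digits, underscores, hyphens) is an assumption, not Anvi'o's official rule. The demo runs on a throwaway file, so it works without Anvi'o installed.

```bash
# Hypothetical helper: count sequences and flag deflines with unexpected characters.
# Assumption: a "safe" header contains only letters, digits, "_" and "-".
check_fasta_headers() {
    local fasta="$1"
    local total bad
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    total=$(grep -c '^>' "${fasta}" || true)
    bad=$(grep '^>' "${fasta}" | grep -c -v -E '^>[A-Za-z0-9_-]+$' || true)
    echo "${fasta}: ${total} sequences, ${bad} suspicious headers"
}

# Demo on a temporary file with one clean and one messy header.
demo_dir=$(mktemp -d)
printf '>c_000000000001\nACGT\n>contig 2 len=4\nACGT\n' > "${demo_dir}/demo.fa"
check_fasta_headers "${demo_dir}/demo.fa"
rm -r "${demo_dir}"
```

To check your own files, call the function on each file in `1-clean_genomes/`. A non-zero count of suspicious headers suggests that the cleaning step should be inspected.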

----
# <font color = 'gray'>Generate Databases</font>

### Contigs DB

Anvi'o does not work directly with FASTA files. We first convert each assembly into a `contigs-db` object, as demonstrated below. By default, the command `anvi-gen-contigs-database` also identifies open reading frames (ORFs) in the genome.

<div class="alert alert-block alert-warning">
<b>Warning</b> 

The gene caller used by Anvi'o in this step is designed for prokaryotic organisms. If you are working with a eukaryotic genome, it is best to use another tool to identify the ORFs in your genome (see <b>Contig Level Functional Annotation Module</b>). You then need to use the <code>--external-gene-calls</code> argument to supply the locations of the predicted ORFs (formatting of the external gene calls file is described <a href="https://anvio.org/help/main/artifacts/external-gene-calls/">here</a>).
    
If you decided to use AUGUSTUS as the external gene caller, a custom script (<code>augustus_gff_to_anvio_ext_gene_calls.py</code>) has also been provided to convert the AUGUSTUS GFF file into an Anvi'o external gene calls file. Run <code>./augustus_gff_to_anvio_ext_gene_calls.py -h</code> to see usage information.
</div>

In [None]:
%%bash

mkdir -p "2-contigs_db"

for f in 1-clean_genomes/*
do
    prefix="${f#*/}"
    prefix="${prefix%.fa}"
    
    anvi-gen-contigs-database \
        -f "${f}" \
        -o "2-contigs_db/${prefix}_contigs_db.db" \
        --project-name "${prefix}"
done
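
The loop above derives each output name with bash parameter expansion. The sketch below isolates that step so you can see what it produces. The function name `contigs_db_path` and the sample path are hypothetical and used only for illustration.

```bash
# Hypothetical helper that mirrors the naming logic used in the loop above.
contigs_db_path() {
    local f="$1"
    local prefix
    prefix="${f#*/}"        # drop everything up to the first "/"
    prefix="${prefix%.fa}"  # drop the trailing ".fa"
    echo "2-contigs_db/${prefix}_contigs_db.db"
}

contigs_db_path "1-clean_genomes/sample_A-clean.fa"
# prints: 2-contigs_db/sample_A-clean_contigs_db.db
```

Note that `${f#*/}` removes only the part up to the first slash, so this naming works as long as the cleaned genomes sit directly inside `1-clean_genomes/` and not in nested subfolders.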

### Genomes DB

This step creates a `genomes-storage-db` object which is primarily used for pangenomic analysis.

#### Create a mapping file

First, we need to make a mapping file that specifies a name for each `contigs-db` as well as its file path. The code below does this for us; it creates a tab-delimited file named `genomes-storage.txt` with two columns, `name` and `contigs_db_path`.

<div class="alert alert-block alert-warning">
<b>Warning</b> 

Double check the <code>genomes-storage.txt</code>. The names used inside the file are based on the filenames of your input files. However, Anvi'o requires that they do not start with a number, and do not contain other special characters. A prompt will appear in the next command if the names do not follow Anvi'o's required formatting. Edit the names in <code>genomes-storage.txt</code> if necessary.
</div>

In [None]:
%%bash

mkdir -p "3-genomes_db"

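# Use ">" for the header so that re-running this cell starts a fresh file
# instead of appending a duplicate header.
echo -e "name\tcontigs_db_path" > 3-genomes_db/genomes-storage.txt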

for f in 2-contigs_db/*.db
do
    prefix="${f#*/}"
    prefix="${prefix%-clean_contigs_db.db}"
    
    echo -e "${prefix}\t$(readlink -f "${f}")" >> 3-genomes_db/genomes-storage.txt
done
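
Before moving on, the names in the mapping file can be screened with a short check. The sketch below is not part of Anvi'o; the function name `check_genome_names` is hypothetical, and the rule it applies (names start with a letter or underscore and contain only letters, digits, and underscores) is an assumption based on the naming requirement described in the warning above.

```bash
# Hypothetical helper: list names that are likely to be rejected.
# Assumption: valid names match ^[A-Za-z_][A-Za-z0-9_]*$
check_genome_names() {
    local mapping="$1"
    # Skip the header row, then report any name in column 1 that breaks the rule.
    awk -F'\t' 'NR > 1 && $1 !~ /^[A-Za-z_][A-Za-z0-9_]*$/ { print "invalid name: " $1 }' "${mapping}"
}

# Demo on a temporary mapping file with one valid and one invalid name.
demo=$(mktemp)
printf 'name\tcontigs_db_path\n' > "${demo}"
printf 'Genome_A\t/path/to/Genome_A-clean_contigs_db.db\n' >> "${demo}"
printf '2023-isolate\t/path/to/2023-isolate-clean_contigs_db.db\n' >> "${demo}"
check_genome_names "${demo}"
# prints: invalid name: 2023-isolate
rm "${demo}"
```

Running `check_genome_names 3-genomes_db/genomes-storage.txt` prints nothing when every name passes this check.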

#### Create genomes-db

Finally, we create a collection of the query and reference genomes.

<div class="alert alert-block alert-warning">
<b>Warning</b> 

If you used external gene calls for the contigs, you may need to use the <code>--gene-caller</code> argument in the command below. Check <code>anvi-gen-genomes-storage -h</code> for more information.
</div>

In [None]:
!anvi-gen-genomes-storage \
    -e 3-genomes_db/genomes-storage.txt \
    -o 3-genomes_db/MY-GENOMES.db

----
# <font color = 'gray'>Pangenome</font>

### Generate the pangenome

This portion runs the core step in Anvi'o's pangenomic analysis workflow. It aligns the sequences between the genomes, then identifies and refines gene clusters. A simple use case is shown below.

| Argument | Description |
| :-: | :- |
| `-g` | `genomes-storage-db` generated by `anvi-gen-genomes-storage` |
| `-n` | project name |
| `--mcl-inflation` | MCL is used to identify clusters based on amino acid sequence similarity. According to Anvi'o's documentation, a value of 2 can be used when comparing distant genomes (i.e., family or higher taxonomic level), while 10 is used for closely related genomes (i.e., strain level) |

In [None]:
!anvi-pan-genome \
    -g 3-genomes_db/MY-GENOMES.db \
    -n MY_PANGENOME \
    --mcl-inflation 8
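
If you are unsure which inflation value suits your genomes, one option is to run the pangenome several times and compare the resulting gene clusters. The sketch below is a dry run: it only prints the commands, using the same three arguments described in the table above. The function name `print_pan_commands` and the `MY_PANGENOME_mcl<value>` project names are hypothetical.

```bash
# Hypothetical dry-run helper: print one anvi-pan-genome command per inflation value.
# Nothing is executed; copy the printed commands into a cell to actually run them.
print_pan_commands() {
    local genomes_db="$1"
    shift
    local inflation
    for inflation in "$@"; do
        # Each run gets its own project name so the results can be told apart.
        echo "anvi-pan-genome -g ${genomes_db} -n MY_PANGENOME_mcl${inflation} --mcl-inflation ${inflation}"
    done
}

print_pan_commands 3-genomes_db/MY-GENOMES.db 2 6 10
```

The values 2, 6, and 10 are only an example spread between the two settings mentioned in the table.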

### Display pangenome

Finally, you can display an interactive visual of the pangenome. Near the end of the CLI output, you will see a line labeled "Server Address". Copy the address to your browser to access the interactive pangenome display. If this address does not work, you can try either `http://localhost:8080/` or `http://127.0.0.1:8080/`.

What you can do with this interactive plot is discussed in detail [here](https://merenlab.org/2016/11/08/pangenomics-v2/#displaying-the-pan-genome).

In [None]:
!anvi-display-pan \
    -p MY_PANGENOME/MY_PANGENOME-PAN.db \
    -g 3-genomes_db/MY-GENOMES.db

----
# <font color = 'gray'>Extras</font>

### Modifying Visual

The primary control for the overall layout can be found in the _Main_ tab (upper left).

If you want to hide some layers in the visual, under the tab _Main_ > _Layer_, set the _Height_ of the layer to 0 and click _Draw_ to redraw the visual.

If you want to change the ordering of the gene clusters, go to _Main_ > _Display_ and change the _Items order_. In this same subsection, you can also change whether to display gene frequencies or simply presence/absence (_View_).

Additional visual controls can be found in the _Layers_ tab. Here you can change the settings for the summary stats per genome (under the _Layers_ subsection). You can also generate a phylogenetic tree based on a bin collection through the _Display_ subsection.

### Selecting Gene Clusters

For demonstration, we will go through how to specifically select single-copy gene clusters (SCGs).

To add the SCGs to a bin (i.e., a pool of selected sequences), go to the _Search_ tab (upper left). Open the _Search gene clusters using filters_ dropdown pane. Check _Min number of genomes gene cluster occurs_ and specify the number of genomes you have in your pangenome. Check _Max number of genes from each genome_ and set it to 1. Scroll down and click _Append splits to selected bin_. The SCGs should now be added to a bin.

You can also save the bin(s) as a collection for later use. Go to the _Bins_ tab. Check that the number of _Gene Clusters_ matches the number of selected sequences. Save as a collection by clicking _Store bin collection_, give it a name that you can easily remember, and click _Store_.

### Exporting Bin Collection

Following the example above, to export the bin collection as sequences, we use the sample code below. We assume here that the name of the saved collection is `SCG_Collection` and that only a single bin, named `SCG`, is present in this collection.

Furthermore, we also add some filtering options. This is optional, but if you would like to conduct, say, phylogenomic analysis, selecting sequences that are not too divergent may minimize chances of generating a spurious tree. Try out different filtering thresholds, and see how the resulting tree changes.

More details about the filtering criterion (homogeneity index) can be found [here](https://merenlab.org/2016/11/08/pangenomics-v2/#concept-of-homogeneity).

In [None]:
!anvi-get-sequences-for-gene-clusters \
    -p MY_PANGENOME/MY_PANGENOME-PAN.db \
    -g 3-genomes_db/MY-GENOMES.db \
    -C SCG_Collection \
    -b SCG \
    -o SCG_filtered.fa \
    --min-combined-homogeneity-index 0.8 \
    --concatenate-gene-clusters \
    --separator None
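
Before using the exported file for phylogenomics, it may help to confirm that every genome has a sequence of the same length, which is what a concatenated alignment is expected to look like. The sketch below is a generic FASTA check, not an Anvi'o command; the function name `alignment_lengths` is hypothetical, and the demo uses a small made-up file.

```bash
# Hypothetical helper: print each sequence name and its total length (tab-separated).
alignment_lengths() {
    local fasta="$1"
    awk '
        # On a header line, print the previous record (if any) and start a new one.
        /^>/ { if (name != "") print name "\t" len; name = substr($0, 2); len = 0; next }
        # On a sequence line, add its length to the running total.
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "${fasta}"
}

# Demo on a temporary two-genome alignment with wrapped sequence lines.
demo=$(mktemp)
printf '>Genome_A\nMKV-L\nAAG\n>Genome_B\nMKVQL\nA-G\n' > "${demo}"
alignment_lengths "${demo}"
rm "${demo}"
```

For the export above, `alignment_lengths SCG_filtered.fa` should report the same length for every genome. Differing lengths would suggest that the file is not a proper alignment.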