# Explore Sequence Diversity

This notebook is for exploring protein sequence diversity in a set of CAZymes, e.g. a CAZy family.

Prior to using this notebook:

1. Build a local CAZyme database using `cazy_webscraper`
2. Retrieve the protein sequences for each CAZy family of interest using the `cazomevolve` subcommand `get_fam_seqs`
3. Run an all-vs-all sequence comparison with BLAST or DIAMOND, using the `cazomevolve` subcommand `run_fam_blast` or `run_fam_diamond`, respectively

This notebook takes as input the output from BLASTP+/DIAMOND and visualises the data. 

Feel free to use this notebook as a template to perform further analyses.

## Imports

In [None]:
import seaborn as sns  # provides the colour palettes passed to the heatmaps below

from cazomevolve.seq_diversity.explore.cazy import get_cazy_proteins, get_cazy_db_prots
from cazomevolve.seq_diversity.explore.parse import load_data, remove_redunant_prots
from cazomevolve.seq_diversity.explore.plot import plot_clustermap, plot_heatmap_of_clustermap

## Constants

Define proteins of interest, e.g. proteins to be explored in the lab. These will be highlighted on the resulting clustermaps and heatmaps.

The `dict` is keyed by group name (e.g. a CAZy family), and each value is a list of NCBI protein version accessions.

In [None]:
CANDIDATES = {
    'grp_name': ['protein_acc']
}
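Since the values are expected to be NCBI protein *version* accessions (i.e. accessions that end in a version suffix such as `.1`), a quick sanity check can flag entries that lack one. The following is a minimal sketch and not part of `cazomevolve`: the helper name `find_unversioned` and the example accessions are hypothetical.

```python
import re

# A versioned accession ends in a dot followed by digits, e.g. 'ABC12345.1'
VERSION_SUFFIX = re.compile(r"\.\d+$")


def find_unversioned(candidates):
    """Return {group: [accessions lacking a version suffix]}.

    Groups whose accessions all carry a version suffix are omitted.
    """
    problems = {}
    for group, accessions in candidates.items():
        missing = [acc for acc in accessions if not VERSION_SUFFIX.search(acc)]
        if missing:
            problems[group] = missing
    return problems


# Made-up accessions, for illustration only
example = {"PL20": ["ABC12345.1", "XYZ67890"]}
print(find_unversioned(example))
```

An empty result means every accession in the dictionary carries a version suffix.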

## Get 'characterised' proteins from CAZy

To retrieve proteins listed in the 'characterised' or 'structure' tables in CAZy, use the `get_cazy_db_prots` function.

We store proteins listed in the characterised table in the variable `characterised_prots`, and proteins listed in the structure table in `structured_prots`.

In [None]:
characterised_prots = {}  # {fam: [prot acc]}
characterised_prots['PL1'] = get_cazy_db_prots('PL1', characterised=True)
characterised_prots

In [None]:
structured_prots = {}  # {fam: [prot acc]}
structured_prots['PL1'] = get_cazy_db_prots('PL1', structured=True)
structured_prots

## Family analysis

Here is some example code for running the analysis for CAZy family PL20.


In [None]:
# load the all-vs-all output for PL20 using the `load_data` function imported above
pl20_df = load_data('pl20_blastp_out', 'PL20')

In [None]:
# build clustermap of BLAST Score Ratio
pl20_bsr_plt = plot_clustermap(pl20_df, 'PL20', 'BSR', fig_size=(100, 100), save_fig='pl20_clustermap.png')
pl20_bsr_plt
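For reference, the BLAST Score Ratio (BSR) is commonly defined as the bit score of a query-subject alignment divided by the bit score of the query aligned against itself, so a self-hit has a BSR of 1. The sketch below illustrates that definition in plain Python. It is an assumption-laden illustration rather than the `cazomevolve` implementation: the function name `blast_score_ratios`, the `(query, subject, bitscore)` tuple layout, and the example values are all hypothetical.

```python
def blast_score_ratios(hits):
    """Compute BSR for each (query, subject) pair.

    hits: iterable of (query, subject, bitscore) tuples.
    Returns {(query, subject): bsr}. Queries with no self-hit are skipped.
    """
    hits = list(hits)

    # Self-hit bit score per query (keep the best if several are reported)
    self_scores = {}
    for query, subject, bitscore in hits:
        if query == subject:
            self_scores[query] = max(bitscore, self_scores.get(query, 0.0))

    ratios = {}
    for query, subject, bitscore in hits:
        reference = self_scores.get(query)
        if not reference:
            continue  # cannot normalise without a self-hit
        pair = (query, subject)
        # Keep the best-scoring alignment when a pair appears more than once
        ratios[pair] = max(bitscore / reference, ratios.get(pair, 0.0))
    return ratios


# Made-up bit scores, for illustration only
example_hits = [
    ("A", "A", 200.0),
    ("A", "B", 100.0),
    ("B", "B", 50.0),
    ("B", "A", 40.0),
]
for pair, bsr in sorted(blast_score_ratios(example_hits).items()):
    print(pair, round(bsr, 2))
```

Note that BSR is not necessarily symmetric: the ratio for A against B is normalised by A's self-score, while B against A is normalised by B's.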

In [None]:
# plot a clustermap of only the candidates and the functionally/structurally characterised proteins,
# annotated to differentiate candidates from functionally/structurally characterised proteins
pl20_char_bsr_plt = plot_clustermap(
    pl20_df,
    'PL20',
    'BSR',
    fig_size=(7, 7),
    save_fig='pl20_clustermap_char.png',
    char_only=True,
    annotate=True,
)
pl20_char_bsr_plt

Then plot the percentage identity and query coverage for the candidate and functionally/structurally characterised proteins, placing the proteins on the heatmaps in the same order as they appear in the clustermap.

In [None]:
print('PL20 percentage identity, colour scheme blue')
plot_heatmap_of_clustermap(
    pl20_char_bsr_plt,
    pl20_df,
    'PL20',
    'pident',
    fig_size=(7, 7),
    save_fig='pl20_clustermap_PIDENT_char.png',
    colour_scheme=sns.color_palette("Blues", as_cmap=True),
)

In [None]:
print('PL20 query coverage, colour scheme purple')
plot_heatmap_of_clustermap(
    pl20_char_bsr_plt,
    pl20_df,
    'PL20',
    'qcov',
    fig_size=(7, 7),
    save_fig='pl20_clustermap_QCOV_char.png',
    colour_scheme=sns.color_palette("Purples", as_cmap=True),
)
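Plotting the heatmaps in the clustermap's order amounts to applying the same permutation to the rows and columns of each matrix. The sketch below shows that idea in plain Python. It is an illustration only, not how `plot_heatmap_of_clustermap` is necessarily implemented: the helper `reorder_square`, the labels, and the values are hypothetical, and it assumes rows and columns share one order. With seaborn, such an order is available from a clustermap's `dendrogram_row.reordered_ind`; here a hard-coded list stands in for it.

```python
def reorder_square(matrix, labels, order):
    """Apply the same index order to the rows, columns, and labels of a square matrix."""
    new_labels = [labels[i] for i in order]
    new_matrix = [[matrix[i][j] for j in order] for i in order]
    return new_matrix, new_labels


# Made-up percentage identities, for illustration only
labels = ["A", "B", "C"]
pident = [
    [100, 30, 80],
    [30, 100, 40],
    [80, 40, 100],
]
order = [0, 2, 1]  # stand-in for the order produced by the clustering

reordered, reordered_labels = reorder_square(pident, labels, order)
print(reordered_labels)
for row in reordered:
    print(row)
```

Reusing one order for every statistic keeps each protein in the same position across the BSR, percentage identity, and query coverage plots, which makes them directly comparable.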