# 0. Phylogenetic distance scatterplots

This notebook shows how to use the `zoogletools` package to create two types of phylogenetic distance scatterplots.

## 0.1 Data files
Two of the required data files (`complexities_filepath` and `tree_filepath`) can be downloaded as part of the Snakemake workflow contained in the repository. Set the variables below to the paths of those files.

`identifiers_filepath` is also required, and can be generated by following the instructions in `zoogletools/data_processing/README.md`.

To obtain the distance files (`data_dirpath` or `data_filepath`), follow one of the two options below:

1. **Complete dataset:** You can generate a complete, reprocessed dataset for all genes by following the instructions in `zoogletools/data_processing/README.md`.

2. **Gene-specific dataset:** Instead of a fully reprocessed dataset, you can provide a path to a file downloaded from the gene-specific view of Zoogle through the web interface.
   For example, you can download the data for DMD by going to the [gene-specific view of Zoogle](https://zoogle.arcadiascience.com/search?gene=DMD-P11532) and clicking the "Download" button, then choosing "Skip and Download" or "Share and Download".

In [1]:
import zoogletools as zt

# Downloaded/processed data files
complexities_filepath = "../data/organism_metadata.csv"
tree_filepath = "../data/congruified_spprax_species_tree.newick"
identifiers_filepath = "../data/2025-04-21-merged-disease-datasets.tsv"

# Distance files.
data_dirpath = "../data/2025-04-21-os-portal-reprocessed/"
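
Before plotting, it can help to confirm that the paths set above point at real files. The following is a minimal sketch using only the standard library; `find_missing_paths` is a hypothetical helper written for this notebook, not part of `zoogletools`, and the paths simply repeat the values from the cell above.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# Same values as in the cell above; adjust them to your local checkout.
required_paths = [
    "../data/organism_metadata.csv",
    "../data/congruified_spprax_species_tree.newick",
    "../data/2025-04-21-merged-disease-datasets.tsv",
    "../data/2025-04-21-os-portal-reprocessed/",
]

missing = find_missing_paths(required_paths)
if missing:
    print("Missing inputs:", *missing, sep="\n  ")
else:
    print("All inputs found.")
```

If anything is reported as missing, rerun the Snakemake workflow or the steps in `zoogletools/data_processing/README.md` before continuing.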

## 0.2 Pub-style scatterplot
In this scatterplot, the x-axis is the phylogenetic distance to the human gene, and the y-axis is the similarity score between the human gene and the other gene. "More human" and "more similar" proteins are in the bottom-left corner.

Optionally, set `annotate_multiple_copies=True` to flag organisms that have multiple copies of the gene in the dataset. A cross-shaped marker is then drawn over the scatter point for each such organism.

In [2]:
human_gene_symbol = "DMD"
annotate_multiple_copies = True

zt.plotting.phylogenetic_distance_scatter(
    tree_filepath=tree_filepath,
    identifiers_filepath=identifiers_filepath,
    data_dirpath=data_dirpath,
    input_id=human_gene_symbol,
    input_type="symbol",
    lock_lower_ylimit=True,
    annotate_multiple_copies=annotate_multiple_copies,
    image_filepath=f"figures/phylo_dist_scatterplots/{human_gene_symbol}_pub_scatterplot.svg",
    html_filepath=f"figures/phylo_dist_scatterplots/{human_gene_symbol}_pub_scatterplot.html",
)
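
The call above writes to `figures/phylo_dist_scatterplots/`. Whether `zoogletools` creates that directory on its own is not stated here, so as a precaution it can be created ahead of time. This is a sketch under that assumption; `ensure_parent_dir` is a hypothetical helper, not a `zoogletools` function.

```python
from pathlib import Path


def ensure_parent_dir(filepath):
    """Create the parent directory of `filepath` if needed and return it."""
    parent = Path(filepath).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent


human_gene_symbol = "DMD"
image_filepath = f"figures/phylo_dist_scatterplots/{human_gene_symbol}_pub_scatterplot.svg"

# Safe to call repeatedly: an existing directory is left untouched.
print(ensure_parent_dir(image_filepath))
```

Both the SVG and the HTML output in this notebook share the same directory, so one call covers both files.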

## 0.3 Presentation-style scatterplot

For this scatterplot, the two axes are inverted, so that "more human" and "more similar" proteins are in the top-right corner. This can be more intuitive for presentations, where "better" proteins are higher up in the plot.

In [3]:
human_gene_symbol = "DMD"
annotate_multiple_copies = False

zt.plotting.phylogenetic_distance_scatter(
    tree_filepath=tree_filepath,
    identifiers_filepath=identifiers_filepath,
    data_dirpath=data_dirpath,
    input_id=human_gene_symbol,
    input_type="symbol",
    invert_yaxis=True,
    invert_xaxis=True,
    annotate_multiple_copies=annotate_multiple_copies,
    image_filepath=f"figures/phylo_dist_scatterplots/{human_gene_symbol}_presentation_scatterplot.svg",
    html_filepath=f"figures/phylo_dist_scatterplots/{human_gene_symbol}_presentation_scatterplot.html",
)

## 0.4 Using the `data_filepath` option

The `data_filepath` option, used here with `zt.plotting.phylogenetic_distance_scatter_zoogle`, lets you provide a path to a specific file containing the data you want to plot. This is useful if you have downloaded data from the web interface and want to plot it directly from that file.

In [4]:
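# File downloaded from the gene-specific view of Zoogle.
zoogle_results_file = "zoogle-Q14565.tsv"
human_gene_symbol = "DMC1"
# Set explicitly so this cell does not depend on the value left by an earlier cell.
annotate_multiple_copies = False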

zt.plotting.phylogenetic_distance_scatter_zoogle(
    data_filepath=zoogle_results_file,
    tree_filepath=tree_filepath,
    annotate_multiple_copies=annotate_multiple_copies,
    image_filepath=f"figures/phylo_dist_scatterplots/{human_gene_symbol}_presentation_scatterplot.svg",
    html_filepath=f"figures/phylo_dist_scatterplots/{human_gene_symbol}_presentation_scatterplot.html",
)
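
To see what a downloaded file contains before plotting it, the column names can be read from its first row. This sketch assumes the download is a tab-delimited text file with a header row (suggested by the `.tsv` extension, but not confirmed here); `read_header` is a hypothetical helper and uses only the standard library.

```python
import csv
from pathlib import Path


def read_header(filepath, delimiter="\t"):
    """Return the column names from the first row of a delimited text file."""
    with open(filepath, newline="", encoding="utf-8") as handle:
        # An empty file yields an empty list instead of raising StopIteration.
        return next(csv.reader(handle, delimiter=delimiter), [])


zoogle_results_file = Path("zoogle-Q14565.tsv")
if zoogle_results_file.exists():
    print(read_header(zoogle_results_file))
else:
    print(f"{zoogle_results_file} not found; download it from the Zoogle web interface first.")
```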