%%latex
\tableofcontents

# Introduction

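This notebook describes the investigation of potential selection trends identified through Hardy-Weinberg Equilibrium (HWE) analysis. Hardy-Weinberg equilibrium describes a set of conditions under which genotype frequencies can be predicted from the underlying allele frequencies. For a biallelic locus with allele frequencies $p$ and $q$ (where $p + q = 1$), the expected frequencies of the three genotypes are $p^2$, $2pq$ and $q^2$, which sum to one:

$$
    p^2+2pq+q^2 = 1
$$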

In practice, Hardy-Weinberg equilibrium is assessed with a chi-squared test or an exact test, which checks whether the observed genotype frequencies align with those expected under the Hardy-Weinberg law.
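
To make this comparison concrete, the following is a minimal sketch of a chi-squared goodness-of-fit statistic for a single biallelic variant. The function names and the example counts are illustrative assumptions. The reports used in this notebook come from PLINK 2's mid-p exact test, not from this code.

```python
def hwe_expected_counts(n_hom_ref, n_het, n_hom_alt):
    """Return the genotype counts expected under HWE for the observed counts."""
    n = n_hom_ref + n_het + n_hom_alt
    # Frequency of the reference allele: two copies per homozygote, one per heterozygote
    p = (2 * n_hom_ref + n_het) / (2 * n)
    q = 1 - p
    return p**2 * n, 2 * p * q * n, q**2 * n


def hwe_chi_squared(n_hom_ref, n_het, n_hom_alt):
    """Return the chi-squared statistic comparing observed and expected counts."""
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = hwe_expected_counts(n_hom_ref, n_het, n_hom_alt)
    # Classes with an expected count of zero are skipped to avoid division by zero
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Made-up counts: a sample that matches HWE exactly, and one with too few heterozygotes
in_equilibrium = hwe_chi_squared(25, 50, 25)  # 0.0
het_deficit = hwe_chi_squared(40, 20, 40)  # 36.0
```

A larger statistic indicates a larger departure from the expected genotype proportions. Exact tests are generally preferred when genotype counts are small, which is often the case for rare variants.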

We have generated a series of population-subset Hardy-Weinberg reports. These were produced using [PLINK 2](https://www.cog-genomics.org/plink/2.0/)'s [`--hardy`](https://www.cog-genomics.org/plink/2.0/basic_stats#hardy) command with the `midp` modifier, combined with the [`--loop-cats`]() flag to repeat the test across population groupings. These outputs are used for graphing here.

> Refer to the repository and linked documentation site for more information about the [Pharmacogenetics Analysis Pipeline](https://github.com/Tuks-ICMM/Pharmacogenetic-Analysis-Pipeline).


## Objectives

## Notebook configuration

### Dependencies

This notebook will make use of the following packages and functions:

In [1]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.io import renderers
from pandas import read_csv
from os.path import join
from pathlib import Path

In [2]:
renderers.default = "notebook_connected+pdf"

### Data Imports

We will be using data generated by the [Pharmacogenetics Analysis Pipeline](). As part of this analysis, we will need to import and work with the collection of metadata files used to configure this workflow:

- `samples.csv`
- `locations.csv`
- `datasets.csv`

In [3]:
SAMPLES = read_csv(join("input", "samples.csv"))
LOCATIONS = read_csv(join("input", "locations.csv"))
DATASETS = read_csv(join("input", "datasets.csv"))
POPULATIONS_TO_ANALYSE = SAMPLES["super-population"].unique().tolist()

We will also be making use of a VCF-based `MultiIndex` system to organise our hierarchical data (`CHROM`, `POS`, `REF` and `ALT`).

In [4]:
MULTIINDEX = ["CHROM", "POS", "REF", "ALT"]
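
As a minimal illustration of what this index provides, the sketch below applies it to a made-up two-variant table. The variant records and IDs are hypothetical and do not come from the pipeline output.

```python
from pandas import DataFrame

MULTIINDEX = ["CHROM", "POS", "REF", "ALT"]

# Hypothetical two-variant table; the values are made up for illustration
toy = DataFrame(
    {
        "CHROM": ["1", "1"],
        "POS": [100, 250],
        "REF": ["A", "C"],
        "ALT": ["G", "T"],
        "ID": ["rs_example_1", "rs_example_2"],
    }
)

# Promote the four VCF coordinate columns to a hierarchical row index
indexed = toy.set_index(MULTIINDEX)

# A variant can now be addressed by its full (CHROM, POS, REF, ALT) key
variant_id = indexed.loc[("1", 100, "A", "G"), "ID"]
```

Using all four fields as the key keeps multi-allelic sites distinct, since two records can share `CHROM` and `POS` but differ in `ALT`.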

### Type Mapping

Our input data can be imported from `.csv` format using the `read_csv()` function from `pandas`. Since this project focuses on more than one gene, we will work our way through the list of genes and import each consolidated report.

In [5]:
DATA = dict()
for gene in LOCATIONS["location_name"].unique().tolist():
    DATA[gene] = read_csv(join(
            "/",
            "mnt",
            "ICMM_HDD_12TB",
            "Results_25SEP2024", "consolidated_reports", f"super-population_{gene}.csv"))
    DATA[gene].rename(columns={"CADD_PHRED": "CADD Phred"}, inplace=True)

# Hardy-Weinberg Results

## Calculate Genotype frequency

The results contain genotype counts, so our first step is to convert these into genotype frequencies (expressed as percentages) that can be plotted on a ternary plot. To do this, we loop through each gene for which a `DataFrame` was imported under our chosen cluster of '_super-population_'. For each `DataFrame`, we then loop through each unique population annotated in our samples file under '_super-population_' to calculate sub-group genotype frequencies.

In [6]:
# For each gene declared for this analysis...
for gene in LOCATIONS["location_name"].unique().tolist():

    # Import missingness consolidation for filtering purposes
    CONSOLIDATED_REPORTS = read_csv(
        join(
            "/",
            "mnt",
            "ICMM_HDD_12TB",
            "Results_25SEP2024",
            "cleaned",
            f"super-population_{gene}.csv.zst",
        )
    )

    # Set the VCF-based hierarchical `MultiIndex`
    CONSOLIDATED_REPORTS.set_index(MULTIINDEX, inplace=True)

    # For each population in our list of populations to be analyzed...
    for population in POPULATIONS_TO_ANALYSE:
        # Calculate Heterozygous Frequency from plink2 count data
        DATA[gene][f"{population} Hetrozygous Freq."] = (
            DATA[gene][f"{population}_HET_A1_CT"]
            / (
                DATA[gene][f"{population}_HOM_A1_CT"]
                + DATA[gene][f"{population}_HET_A1_CT"]
                + DATA[gene][f"{population}_TWO_AX_CT"]
            )
        ) * 100


        # Calculate Homozygous Reference Frequency from plink2 count data
        DATA[gene][f"{population} Homozygous Ref Freq."] = (
            DATA[gene][f"{population}_HOM_A1_CT"]
            / (
                DATA[gene][f"{population}_HOM_A1_CT"]
                + DATA[gene][f"{population}_HET_A1_CT"]
                + DATA[gene][f"{population}_TWO_AX_CT"]
            )
        ) * 100

        # Calculate Homozygous Alternate Frequency from plink2 count data
        DATA[gene][f"{population} Homozygous Alt Freq."] = (
            DATA[gene][f"{population}_TWO_AX_CT"]
            / (
                DATA[gene][f"{population}_HOM_A1_CT"]
                + DATA[gene][f"{population}_HET_A1_CT"]
                + DATA[gene][f"{population}_TWO_AX_CT"]
            )
        ) * 100
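
The same calculation can be sketched on a small made-up table, computing the shared denominator once. The helper name `genotype_percentages`, the population label `AFR` and the counts are assumptions for illustration. The column suffixes follow the ones used in the cell above.

```python
from pandas import DataFrame


def genotype_percentages(counts, population):
    """Convert genotype count columns for one population into percentages."""
    hom_a1 = counts[f"{population}_HOM_A1_CT"]
    het = counts[f"{population}_HET_A1_CT"]
    two_ax = counts[f"{population}_TWO_AX_CT"]

    # Total number of genotyped samples per variant (shared denominator)
    total = hom_a1 + het + two_ax

    return DataFrame(
        {
            "het": het / total * 100,
            "hom_a1": hom_a1 / total * 100,
            "two_ax": two_ax / total * 100,
        }
    )


# Hypothetical counts for two variants in a population labelled "AFR"
toy = DataFrame(
    {
        "AFR_HOM_A1_CT": [25, 90],
        "AFR_HET_A1_CT": [50, 10],
        "AFR_TWO_AX_CT": [25, 0],
    }
)
percentages = genotype_percentages(toy, "AFR")
```

Each row of `percentages` sums to 100, which is the property a ternary plot relies on when placing a point.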

## Generate ternary sub-plots

Next, we will build a composite figure for each gene using the plotly `make_subplots` function, which creates a figure containing multiple sub-plots. Each figure holds one ternary sub-plot per population, arranged on a 2x4 grid that provides up to eight panels.

### Graph type

This plot will be slightly more complicated to construct, since we want to style and add multiple subsets of data under different legend entries. Doing so makes it easier to distinguish, through plot styling, between variants which diverge from HWE and those that do not. Normally, this would be handled with long-format data and the `plotly.express` interface for constructing `plotly` `Figure`s. In this case, some of the operations we need are not available through that interface, so we will construct each 'trace' manually via the object-oriented `plotly.graph_objects` interface.
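
As an example of data that could be supplied to a manually constructed trace, the sketch below computes points along the curve of genotype proportions expected under HWE ($2pq$, $p^2$, $q^2$). This reference curve is not part of the figures generated in this notebook, and the function name and default point count are illustrative assumptions.

```python
def hwe_curve(n_points=51):
    """Return (het, hom_ref, hom_alt) proportions expected under HWE.

    Allele frequency p is swept from 0 to 1 in equal steps; n_points must be >= 2.
    """
    het, hom_ref, hom_alt = [], [], []
    for i in range(n_points):
        p = i / (n_points - 1)
        q = 1 - p
        het.append(2 * p * q)
        hom_ref.append(p * p)
        hom_alt.append(q * q)
    return het, hom_ref, hom_alt


# Five points along the curve; each (het, hom_ref, hom_alt) triple sums to 1
curve_het, curve_hom_ref, curve_hom_alt = hwe_curve(5)
```

The three lists could be passed as the `a`, `b` and `c` arguments of a `go.Scatterternary` trace with `mode="lines"`, matching the axis assignment used for the variant markers in this notebook.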

### Subplot grid

Since this plot will involve multiple populations in a sub-plot format, a system is needed to position each ternary plot within each overall gene-level HWE `Figure`. Fortunately, the grid position of a population in a list of known length can be determined with operations already built into Python. Provided we know the maximum number of graphs on one row of the plot, we can use the `divmod()` function, which returns both the floor-division quotient (`//` operator) and the remainder (`%` operator), to derive row and column coordinates from the population's index position.
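
The idea can be sketched in isolation as follows. This version works from the zero-based list index, which is a slightly different but equivalent formulation to the one used in the cell below. The helper name `grid_position` and the population labels are hypothetical.

```python
def grid_position(index, n_cols):
    """Map a zero-based list index to one-based (row, col) grid coordinates."""
    # The quotient counts complete rows before this item; the remainder is the offset within its row
    row, col = divmod(index, n_cols)
    return row + 1, col + 1


# Hypothetical population labels filling a 2x4 grid
populations = [f"POP_{i}" for i in range(1, 9)]
positions = {
    population: grid_position(index, 4)
    for index, population in enumerate(populations)
}
# positions["POP_1"] -> (1, 1)
# positions["POP_4"] -> (1, 4)
# positions["POP_5"] -> (2, 1)
# positions["POP_8"] -> (2, 4)
```

Starting from the zero-based index avoids the special case for the last column, because the remainder then runs from 0 to `n_cols - 1` on every row.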

In [7]:
# First, let's create a dictionary to house all of our graphs.
FIGURE = dict()

# For each gene in our list of declared genes for analysis
for gene in LOCATIONS["location_name"].unique().tolist():
    # Generate a figure entry for this gene and configure its default configuration
    FIGURE[gene] = make_subplots(
        rows=2,
        cols=4,
        subplot_titles=POPULATIONS_TO_ANALYSE,
        specs=[  # Here, we specify the chart type for each sub-plot, since the default is an xy cartesian system.
            [
                {"type": "ternary"},
                {"type": "ternary"},
                {"type": "ternary"},
                {"type": "ternary"},
            ],
            [
                {"type": "ternary"},
                {"type": "ternary"},
                {"type": "ternary"},
                {"type": "ternary"},
            ],
        ],
    )

    # Add a title to this figure
    FIGURE[gene].update_layout(title_text=f"Hardy-Weinberg Deviation Test | {gene}")

    # Shift the annotations drawn on the figure
    FIGURE[gene].update_annotations({"yshift": 25})

    # For each population in our list of populations declared for analysis
    for population in POPULATIONS_TO_ANALYSE:
        # Calculate the grid position this plot should be placed under

        # First, calculate the population position in the flat list
        _position = POPULATIONS_TO_ANALYSE.index(population) + 1
        # Next, divide by the number of columns and store the division result and remainder
        _row, remainder = divmod(_position, 4)

        # Set the remainder as the column id
        _col = remainder

        # If there is a remainder, the row should be incremented, since the plot should then overflow onto the next line
        if remainder:
            _row += 1

        # If no remainder is present, it means a perfect division, which translates to the last column in a row
        if remainder == 0:
            _col += 4

        # Identify SNPs with non-significant deviations from HWE
        NOT_SIGNIFICANTLY_DIFFERENT = DATA[gene].loc[
            DATA[gene][f"{population}_MIDP"] > 0.5
        ]

        # Identify SNPs with significant deviations from HWE
        SIGNIFICANTLY_DIFFERENT = DATA[gene].loc[
            DATA[gene][f"{population}_MIDP"] <= 0.5
        ]

        # Add this list of groups of data-points...
        FIGURE[gene].add_traces(
            [
                # Add a vertical line to track middle of plot
                go.Scatterternary(
                    a=[1, 0],
                    b=[0, 0.5],
                    c=[0, 0.5],
                    mode="lines",
                    line_width=2,
                    line_color="lightgray",
                    showlegend=False,
                ),
                
                # Add the SNPs with significant deviations from HWE onto the plot with styling
                go.Scatterternary(
                    a=SIGNIFICANTLY_DIFFERENT[f"{population} Hetrozygous Freq."],
                    b=SIGNIFICANTLY_DIFFERENT[f"{population} Homozygous Ref Freq."],
                    c=SIGNIFICANTLY_DIFFERENT[f"{population} Homozygous Alt Freq."],
                    hovertext=SIGNIFICANTLY_DIFFERENT["ID"],
                    mode="markers",
                    name=f"{population}",
                    legendgroup="Significant",
                    legendgrouptitle={"text": "Significant"},
                    cliponaxis=False,
                    marker={
                        "color": "red",
                        "size": 7,
                        "opacity": 1,
                        "line": {"width": 1},
                    },
                ),

                # Add the SNPs with non-significant deviations from HWE onto the plot with styling
                go.Scatterternary(
                    a=NOT_SIGNIFICANTLY_DIFFERENT[f"{population} Hetrozygous Freq."],
                    b=NOT_SIGNIFICANTLY_DIFFERENT[f"{population} Homozygous Ref Freq."],
                    c=NOT_SIGNIFICANTLY_DIFFERENT[f"{population} Homozygous Alt Freq."],
                    hovertext = NOT_SIGNIFICANTLY_DIFFERENT["ID"],
                    mode="markers",
                    name=f"{population}",
                    legendgroup="Not Significant",
                    legendgrouptitle={"text": "Not Significant"},
                    cliponaxis=False,
                    marker={
                        "color": "dimgray",
                        "size": 5,
                        "opacity": 0.7,
                        "line": {"width": 0},
                    },
                ),
            ],
            rows=_row,
            cols=_col,
        )

        # Update style of axes
        FIGURE[gene].update_ternaries(
            {
                "bgcolor": "whitesmoke",
                "aaxis": {
                    "tickangle": 0,
                    "tickfont": {"color": "indigo"},
                    "gridcolor": "blueviolet",
                    "griddash": "dot",
                    "title": {
                        "text": "Het. Freq.",
                        "font": {
                            "color": "indigo",
                            "size": 10,
                            "style": "italic",
                            "variant": "petite-caps",
                        },
                    },
                },
                "baxis": {
                    "tickangle": 60,
                    "tickfont": {"color": "darkgreen"},
                    "gridcolor": "forestgreen",
                    "griddash": "dash",
                    "title": {
                        "text": "Hom.<br> Ref<br> Freq.",
                        "font": {
                            # "color": "#808080",
                            "color": "darkgreen",
                            "size": 10,
                            "style": "italic",
                            "variant": "petite-caps",
                        },
                    },
                },
                "caxis": {
                    "tickangle": -60,
                    "tickfont": {"color": "darkblue"},
                    "gridcolor": "lightskyblue",
                    "griddash": "solid",
                    "title": {
                        "text": "Hom.<br> Alt<br> Freq.",
                        "font": {
                            "color": "darkblue",
                            "size": 10,
                            "style": "italic",
                            "variant": "petite-caps",
                        },
                    },
                },
            }
        )

# Export

Now we just need to export these figures to file so that they can be used in the write-up.

In [8]:
Path(join("Graphs", "05")).mkdir(parents=True, exist_ok=True)

In [9]:
for gene in LOCATIONS["location_name"].unique().tolist():
    display(FIGURE[gene])
    FIGURE[gene].write_image(
        join("Graphs", "05", f"{gene}_hardy_weinberg.jpeg"),
        width=1500,
        height=800,
        scale=2,
    )
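
The output-path handling can be sketched separately from the plotting. The helper below is a hypothetical convenience function, not part of the notebook. It creates the nested output directory if needed and returns the file path for a gene, and it is demonstrated in a temporary directory with a placeholder gene name.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def figure_path(root, gene, extension="jpeg"):
    """Create <root>/Graphs/05 if missing and return the output path for a gene."""
    output_dir = Path(root) / "Graphs" / "05"
    # parents=True also creates "Graphs" when it does not exist yet
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir / f"{gene}_hardy_weinberg.{extension}"


# Demonstrate in a throwaway directory with a placeholder gene name
with TemporaryDirectory() as tmp:
    path = figure_path(tmp, "GENE_A")
    directory_created = path.parent.is_dir()
    file_name = path.name
```

The returned path could then be passed to `write_image()` in place of the `join(...)` expression used above.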