%%latex
\tableofcontents

# Introduction

This notebook describes the investigation of potential selection trends identified through Hardy-Weinberg Equilibrium (HWE) analysis. Hardy-Weinberg equilibrium describes a set of conditions under which genotype frequencies can be predicted from the underlying allele frequencies. For a biallelic locus with allele frequencies $p$ and $q$ (where $p + q = 1$), the expected genotype frequencies are $p^2$, $2pq$ and $q^2$:

$$
    p^2 + 2pq + q^2 = 1
$$

In practice, Hardy-Weinberg equilibrium is assessed with tests such as the chi-squared test or exact methods, which check whether the observed genotype frequencies align with those expected under the Hardy-Weinberg law.
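As a rough illustration of the chi-squared approach mentioned above, the sketch below computes the expected genotype counts for a single biallelic variant and a one-degree-of-freedom goodness-of-fit statistic. The genotype counts and the helper name `hwe_chi_squared` are hypothetical, and this is not the mid-p exact test that Plink-2 reports.

```python
from math import erfc, sqrt


def hwe_chi_squared(n_aa: int, n_ab: int, n_bb: int) -> tuple:
    """Chi-squared goodness-of-fit test against Hardy-Weinberg proportions.

    Returns (expected_counts, chi2, p_value) for one biallelic variant.
    """
    n = n_aa + n_ab + n_bb
    # Allele frequency of A: each AA carries two copies, each AB carries one.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    if p in (0, 1):
        # With a single observed allele every expected count but one is zero.
        raise ValueError("Both alleles must be observed for this test.")

    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Survival function of a chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return expected, chi2, p_value


# Hypothetical genotype counts (AA, AB, BB), for illustration only.
expected, chi2, p_value = hwe_chi_squared(50, 40, 10)
print(expected, round(chi2, 4), round(p_value, 4))
```

A large p-value here would indicate no detectable departure from Hardy-Weinberg proportions for that variant; exact tests are generally preferred when genotype counts are small.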

We have generated a series of population-subset Hardy-Weinberg reports. These were produced using [Plink-2](https://www.cog-genomics.org/plink/2.0/)'s [`--hardy`](https://www.cog-genomics.org/plink/2.0/basic_stats#hardy) command with the `midp` modifier, in combination with the [`--loop-cats`]() command to loop across population groupings. These outputs are used for plotting here.

> Refer to the repository and linked documentation site for more information about the [Pharmacogenetics Analysis Pipeline](https://github.com/Tuks-ICMM/Pharmacogenetic-Analysis-Pipeline).


## Objectives

## Notebook configuration

### Dependencies

This notebook will make use of the following packages and functions:

In [48]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.io import renderers
from pandas import read_csv, Series, cut, qcut, DataFrame, concat
from os.path import  join
from pathlib import Path
from json import load
from numpy import sqrt
from matplotlib.pyplot import figure,subplots, plot
from matplotlib.gridspec import GridSpec
import mpltern
from seaborn import scatterplot
from matplotlib.ticker import MultipleLocator
from scipy.interpolate import splrep, BSpline
from scipy.interpolate import UnivariateSpline
from numpy import linspace

In [49]:
renderers.default = "notebook_connected+pdf"

In [50]:
with open(join("config", "manifest copy.json"), 'r') as config:
    config = load(config)

path = join(*config["output"])

### Data Imports

We will be using data generated by the [Pharmacogenetics Analysis Pipeline](https://github.com/Tuks-ICMM/Pharmacogenetic-Analysis-Pipeline). As part of this analysis, we will need to import and work with the collection of metadata files used to configure this workflow:

- `samples.csv`
- `locations.csv`
- `datasets.csv`

In [51]:
SAMPLES = read_csv(join("input", "samples.csv"))
LOCATIONS = read_csv(join("input", "locations.csv"))
DATASETS = read_csv(join("input", "datasets.csv"))
POPULATIONS_TO_ANALYSE = SAMPLES["super-population"].unique().tolist()
GENES_TO_ANALYSE = LOCATIONS["location_name"].unique().tolist()

We will also be making use of a VCF-based `MultiIndex` system to organise our hierarchical data (`CHROM`, `POS`, `REF` and `ALT`).
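The sketch below shows, on a toy table, how a four-level `MultiIndex` of this kind lets two tables be aligned on all levels at once. The variant rows, the `ID` values and the `score` column are hypothetical and only serve to illustrate the indexing pattern.

```python
import pandas as pd

MULTIINDEX = ["CHROM", "POS", "REF", "ALT"]

# Hypothetical toy variants; IDs and positions are made up for illustration.
variants = pd.DataFrame(
    {
        "CHROM": [19, 19, 19],
        "POS": [100, 250, 250],
        "REF": ["A", "G", "G"],
        "ALT": ["G", "C", "T"],
        "ID": ["var1", "var2", "var3"],
    }
).set_index(MULTIINDEX)

# A second toy table keyed by the same four levels (e.g. per-variant scores).
scores = pd.DataFrame(
    {
        "CHROM": [19, 19],
        "POS": [100, 250],
        "REF": ["A", "G"],
        "ALT": ["G", "T"],
        "score": [0.12, -0.40],
    }
).set_index(MULTIINDEX)

# Index-on-index merge: rows are matched on all four levels at once.
merged = variants[["ID"]].merge(scores, left_index=True, right_index=True)
print(merged)

# A full four-level key selects exactly one variant.
selected_id = variants.loc[(19, 250, "G", "T"), "ID"]
print(selected_id)
```

Keying on all four fields distinguishes multiple alternate alleles at the same position, which a position-only index would conflate.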

In [52]:
# First, let's create a dictionary to house all of our graphs.
FIGURE = dict()
MULTIINDEX = ["CHROM", "POS", "REF", "ALT"]

### Type Mapping

Our input data can be imported from `.csv` format using the `read_csv()` function from `pandas`. Since this project focuses on more than one gene, we will work our way through the list of genes and import each consolidated report.

In [53]:
DATA = dict()
for gene in GENES_TO_ANALYSE:
    DATA[gene] = read_csv(
        join(
            path,
            "cleaned",
            f"super-population_{gene}.csv.zst",
        )
        # join(
        #     path, "consolidated_reports", f"super-population_{gene}.csv")
            # , sep="\t"
            )
    DATA[gene].rename(columns={"CADD_PHRED": "CADD Phred"}, inplace=True)
    DATA[gene].set_index(MULTIINDEX, inplace=True)

In [None]:
LD_DATA = dict()
for gene in GENES_TO_ANALYSE:
    LD_DATA[gene] = dict()
    for pop in POPULATIONS_TO_ANALYSE:
        LD_DATA[gene][pop] = read_csv(join(path, "linkage_disequilibrium", "super-population", gene, f"calculated_linkage_disequilibrium_per_cluster.{pop}.vcor.zst"), sep="\t")
        LD_DATA[gene][pop].rename(columns={"#CHROM_A": "CHROM_A"}, inplace=True)

In [None]:
EIGENVECTORS = dict()
for gene in GENES_TO_ANALYSE:
    _tmp = read_csv(join(path, "generate_pca", gene, "pca.eigenvec"), sep="\t")
    _tmp.rename(columns={"#IID":"IID"}, inplace=True)

    EIGENVECTORS[gene] = SAMPLES.merge(_tmp, left_on="sample_name", right_on="IID")
    EIGENVECTORS[gene].drop(["IID"], inplace=True, axis="columns")

    del _tmp

In [None]:
EIGENVALUES = dict()
for gene in GENES_TO_ANALYSE:
    _tmp = read_csv(join(path, "generate_pca", gene, "pca.eigenvec.allele"), sep="\t")
    _tmp.rename(columns={"#CHROM": "CHROM"}, inplace=True)
    _tmp.set_index(MULTIINDEX, inplace=True)

    EIGENVALUES[gene] = DATA[gene][["ID"]].merge(_tmp, right_index=True, left_index=True)

    del _tmp

In [None]:
for gene in GENES_TO_ANALYSE:
    # For each population in our list of populations to be analyzed...
    for population in POPULATIONS_TO_ANALYSE:
        # Calculate Heterozygous Frequency from plink2 count data
        DATA[gene][f"{population} Hetrozygous Freq."] = (
            DATA[gene][f"{population}_HET_A1_CT"]
            / (
                DATA[gene][f"{population}_HOM_A1_CT"]
                + DATA[gene][f"{population}_HET_A1_CT"]
                + DATA[gene][f"{population}_TWO_AX_CT"]
            )
        ) * 100


        # Calculate Homozygous Reference Frequency from plink2 count data
        DATA[gene][f"{population} Homozygous Ref Freq."] = (
            DATA[gene][f"{population}_HOM_A1_CT"]
            / (
                DATA[gene][f"{population}_HOM_A1_CT"]
                + DATA[gene][f"{population}_HET_A1_CT"]
                + DATA[gene][f"{population}_TWO_AX_CT"]
            )
        ) * 100

        # Calculate Homozygous Alternate Frequency from plink2 count data
        DATA[gene][f"{population} Homozygous Alt Freq."] = (
            DATA[gene][f"{population}_TWO_AX_CT"]
            / (
                DATA[gene][f"{population}_HOM_A1_CT"]
                + DATA[gene][f"{population}_HET_A1_CT"]
                + DATA[gene][f"{population}_TWO_AX_CT"]
            )
        ) * 100

# Population Structure Figure

To start the construction of our figure, we first need to define a `GridSpec`. This object will be used to plot data into a blank figure using a grid-system. Using this method, we can effectively build complex figures with sub-plots of differing sizes and complex layouts.
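As a minimal sketch of the grid idea, the example below creates a 12x12 `GridSpec` and places two differently sized axes on it by slicing. The panel names and titles are hypothetical, and the `Agg` backend is selected only so that the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
from matplotlib.pyplot import figure

fig = figure(figsize=(14, 9))
grid = fig.add_gridspec(nrows=12, ncols=12)

# Slicing the GridSpec selects a rectangular block of cells.
top_left = grid[0:8, 0:4]        # rows 0-7, columns 0-3
bottom_strip = grid[8:12, 0:12]  # rows 8-11, all columns

ax_top = fig.add_subplot(top_left)
ax_bottom = fig.add_subplot(bottom_strip)
ax_top.set_title("tall panel")
ax_bottom.set_title("wide panel")

# Each slice records which rows and columns it covers.
print(top_left.rowspan, top_left.colspan)
```

Because every panel is addressed by row and column slices, resizing a panel only means changing its slice rather than recomputing figure coordinates.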

## Initialize Figure

In [None]:
GRID_SPEC = dict()

for gene in GENES_TO_ANALYSE:
    # [CREATE] blank figure
    FIGURE[f"{gene} - Population Structure"] = figure(figsize=(14, 9))

    # [SET] layout algorithm used to optimize space-use.
    FIGURE[f"{gene} - Population Structure"].set_layout_engine("constrained")

    # [SET] Figure title
    FIGURE[f"{gene} - Population Structure"].suptitle(f"Population Structure | {gene}", size="x-large")

    # [CREATE] GridSpec with 12 rows and 12 columns
    GRID_SPEC[gene] = FIGURE[f"{gene} - Population Structure"].add_gridspec(nrows=12, ncols=12)

## Principal Component Analysis Sub-Plot

Next, we plot the PCA results onto this figure. The three PCA panels (colored by super-population, sex and dataset) sit side by side along the bottom of the figure, each spanning four of the 12 grid columns and the bottom four of the 12 grid rows (i.e. $\frac{4}{12}$ of the figure height).

In [None]:
for gene in GENES_TO_ANALYSE:  
    # [CREATE] a sub-figure
    PCA_SUPER = FIGURE[f"{gene} - Population Structure"].add_subfigure(GRID_SPEC[gene][8:12, 0:4])

    # [SET] the title of the sub-figure
    PCA_SUPER.suptitle("PCA | Super-Population", y=0.85)

    # [CREATE] a blank set of axes to plot onto
    PCA_SUPER_AX = PCA_SUPER.add_subplot()
    PCA_SUPER_AX.tick_params(axis='y', rotation=45, labelsize='x-small')
    PCA_SUPER_AX.tick_params(axis='x', labelsize='x-small')
    # PCA_SUPER_AX.legend(bbox_to_anchor=(0, 1.2), ncols=5)

    # PCA_SUPER_AX.set_position([PCA_SUPER_AX.get_position().x0, PCA_SUPER_AX.get_position().y0, PCA_SUPER_AX.get_position().width, PCA_SUPER_AX.get_position().height])
    scatterplot(data=EIGENVECTORS[gene], x="PC1", y="PC2", hue="super-population", ax=PCA_SUPER_AX)
    PCA_SUPER_AX.legend(loc='upper center', bbox_to_anchor=(0.5, 1.5), fancybox=True, shadow=True, ncol=5)


In [None]:
for gene in GENES_TO_ANALYSE:  
    # [CREATE] a sub-figure
    PCA_SEX = FIGURE[f"{gene} - Population Structure"].add_subfigure(GRID_SPEC[gene][8:12, 4:8])

    # [SET] the title of the sub-figure
    PCA_SEX.suptitle("PCA | Sex", y=0.85)

    # [CREATE] a blank set of axes to plot onto
    PCA_SEX_AX = PCA_SEX.add_subplot()
    PCA_SEX_AX.tick_params(axis='y', rotation=45, labelsize='x-small')
    PCA_SEX_AX.tick_params(axis='x', labelsize='x-small')
    # PCA_SEX_AX.legend(bbox_to_anchor=(0, 1.2), ncols=5)
    # PCA_SEX_AX.set_position([PCA_SEX_AX.get_position().x0, PCA_SEX_AX.get_position().y0, PCA_SEX_AX.get_position().width, PCA_SEX_AX.get_position().height])

    scatterplot(data=EIGENVECTORS[gene], x="PC1", y="PC2", hue="sex", ax=PCA_SEX_AX)
    PCA_SEX_AX.legend(loc='upper center', bbox_to_anchor=(0.5, 1.5), fancybox=True, shadow=True, ncol=5)


In [None]:
for gene in GENES_TO_ANALYSE:  
    # [CREATE] a sub-figure
    PCA_DATASET = FIGURE[f"{gene} - Population Structure"].add_subfigure(GRID_SPEC[gene][8:12, 8:12])

    # [SET] the title of the sub-figure
    PCA_DATASET.suptitle("PCA | Dataset", y=0.85)

    # [CREATE] a blank set of axes to plot onto
    PCA_DATASET_AX = PCA_DATASET.add_subplot()
    PCA_DATASET_AX.tick_params(axis='y', rotation=45, labelsize='x-small')
    PCA_DATASET_AX.tick_params(axis='x', labelsize='x-small')
    # PCA_DATASET_AX.legend(bbox_to_anchor=(0, 1.2), ncols=5)
    # PCA_DATASET_AX.set_position([PCA_DATASET_AX.get_position().x0, PCA_DATASET_AX.get_position().y0, PCA_DATASET_AX.get_position().width, PCA_DATASET_AX.get_position().height])
    scatterplot(data=EIGENVECTORS[gene], x="PC1", y="PC2", hue="dataset", ax=PCA_DATASET_AX)
    PCA_DATASET_AX.legend(loc='upper center', bbox_to_anchor=(0.5, 1.5), fancybox=True, shadow=True, ncol=5)


## Linkage-Decay Sub-Plot

Next, we plot our LD results onto this figure. This sub-figure sits on the left-hand side, spanning the first four grid columns and the top eight of the 12 grid rows (i.e. $\frac{8}{12}$ of the figure height).

In [None]:
def binData(data: Series, bin_size: int = 500) -> list:
    """Function takes a pandas `Series` and generates a categorical bin annotation suitable for aggregation, e.g. `.groupby()`.

    Args:
        data (Series): A `Series` containing numerical data to be binned.
        bin_size (int, optional): The size of the bins to be made. Defaults to 500.

    Returns:
        list: The lower-edge bin label assigned to each value in `data`.
    """
    _nearest_multiple_of_bin_size = (round(data.max() / bin_size)) * bin_size

    if (data.max() % bin_size) > 0:
        _nearest_multiple_of_bin_size += bin_size

    _list_of_bins = range(0, _nearest_multiple_of_bin_size + bin_size, bin_size)
    _list_of_labels = range(0, _nearest_multiple_of_bin_size, bin_size) # Must be one less than the bins

    return cut(
           data, bins=_list_of_bins, labels=_list_of_labels
        ).to_list()
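
To make the bin labelling concrete: with right-closed bins labelled by their lower edge, a distance of 500 falls in the bin labelled `0` and a distance of 501 in the bin labelled `500`. The sketch below reproduces that rule with plain integer arithmetic; the helper name `bin_label` and the example distances are hypothetical.

```python
from typing import Optional


def bin_label(distance: int, bin_size: int = 500) -> Optional[int]:
    """Lower-edge label of the right-closed bin (lower, lower + bin_size]."""
    if distance <= 0:
        # A distance of 0 lies outside the first bin (0, bin_size].
        return None
    return ((distance - 1) // bin_size) * bin_size


distances = [1, 499, 500, 501, 1200]
labels = [bin_label(d) for d in distances]
print(labels)
```

Note that a pairwise distance of exactly 0 receives no bin under this rule, which is worth keeping in mind if two variants share a position.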

In [None]:
LD_DECAY = dict()

for gene in GENES_TO_ANALYSE:
    # Create a list to store each set of population-level LD results
    CONCAT_LIST = list()


    # For each population in our list of populations for comparison that were successfully imported
    for pop in POPULATIONS_TO_ANALYSE:
        # [CALCULATE] distance between pairwise alleles in LD comparison
        LD_DATA[gene][pop]["Distance"] = abs(
                LD_DATA[gene][pop]["POS_B"] - LD_DATA[gene][pop]["POS_A"]
        )

        # [CALCULATE] binned distance using a default of 500 base-pairs
        LD_DATA[gene][pop]["Distance (Binned)"] = binData(LD_DATA[gene][pop]["Distance"])

        # Here we calculate percentile rankings based on the R^2 values which will be used 
        # later on to calculate splines for each quantile per population.
        LD_DATA[gene][pop]["Quantile"] = qcut(
            LD_DATA[gene][pop]["UNPHASED_R2"],
            [0,.25,.5,.75,1],
            labels=False,
            duplicates="drop"
        )

        # Drop these unneeded columns
        # LD_DATA[gene][pop].drop(
        #     columns=[
        #         "CHROM_A",
        #         "POS_A",
        #         "CHROM_B",
        #         "POS_B",
        #         "ID_A",
        #         "ID_B",
        #         "REF_A",
        #         "REF_B",
        #         "ALT_A",
        #         "ALT_B",
        #         "PROVISIONAL_REF_A?",
        #         "MAJ_A",
        #         "PROVISIONAL_REF_B?",
        #         "MAJ_B",
        #     ],
        #     inplace=True,
        # )

        
        
        # Here, we assign a categorical label for this batch of linkage results.
        # This will be needed later for graphing when we combine all our reports together.

        LONG_FORMAT = LD_DATA[gene][pop][["UNPHASED_R2", "Distance", "Distance (Binned)", "Quantile"]].groupby(["Distance (Binned)", "Quantile"], observed=True).mean().reset_index()
        
        LONG_FORMAT["Population"] = pop
        # Add this populations LD report to the list
        CONCAT_LIST.append(LONG_FORMAT)

    # Concatenate the list of population-level LD reports for this gene together
    LD_DECAY[gene] = concat(CONCAT_LIST)


In [None]:
for gene in GENES_TO_ANALYSE:
    # [CREATE] a sub-figure
    LD = FIGURE[f"{gene} - Population Structure"].add_subfigure(GRID_SPEC[gene][0:8, 0:4])

    # [SET] the title of the sub-figure
    LD.suptitle("LD Decay | Chr Band")

    LD.supylabel("$R^2$ (Unphased)", size="small")

    TMP_GRIDSPEC = LD.add_gridspec(nrows=len(LD_DECAY[gene]["Quantile"].unique()), ncols=1)

    for quantileIndex, quantile in enumerate(LD_DECAY[gene]["Quantile"].unique()):
            
        # [CREATE] a blank set of axes to plot onto
        LD_AX = LD.add_subplot(TMP_GRIDSPEC[quantileIndex,:])

        pos = LD_AX.get_position()
        LD_AX.set_position([pos.x0, pos.y0 + (pos.height * 0.05), pos.width, pos.height * 0.95])

        # LD_AX.set_ylim(0.2,1)
        LD_AX.set_xlim(0,LD_DECAY[gene]["Distance (Binned)"].max())

        LD_AX.tick_params(axis='y', rotation=45, labelsize='x-small')
        LD_AX.tick_params(axis='x', labelsize='x-small')

        TMP = LD_DECAY[gene].loc[LD_DECAY[gene]["Quantile"] == quantile]

        # [CREATE] scatter-plot of binned LD Decay results | TODO: EXPAND
        scatterplot(TMP, x="Distance (Binned)", y="UNPHASED_R2", hue="Population", ax=LD_AX, s=1, legend=False)

        LD_AX.set_ylabel(" ")

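        # Fit a cubic smoothing spline through the binned means.
        # NOTE: only the "AFR" population is fitted here (hard-coded label).
        AFR = TMP.loc[TMP["Population"] == "AFR"]

        spline = UnivariateSpline(AFR["Distance (Binned)"], AFR["UNPHASED_R2"], s=0.75, k=3)

        # Evaluate the spline on an evenly spaced grid for a smooth curve
        x_smooth = linspace(AFR["Distance (Binned)"].min(), AFR["Distance (Binned)"].max(), 500)
        y_smooth = spline(x_smooth)

        LD_AX.plot(x_smooth, y_smooth, linestyle="dashdot", linewidth=2)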


In [None]:
for gene in GENES_TO_ANALYSE:
    
    # [CREATE] a sub-figure
    HWE = FIGURE[f"{gene} - Population Structure"].add_subfigure(GRID_SPEC[gene][0:8, 4:12])

    # [CREATE] a gridspec to layout multiple sub-plots in this sub-figure
    HWE_GRIDSPEC = HWE.add_gridspec(nrows=2, ncols=4)

    # [SET] the title of the sub-figure
    HWE.suptitle("HWE | Gene Region")

    for pop in POPULATIONS_TO_ANALYSE:
        # Calculate the grid position this plot should be placed under
        # First, calculate the population position in the flat list
        _position = POPULATIONS_TO_ANALYSE.index(pop)
        # Next, divide by the number of columns and store the division result (i.e. row number) and remainder
        _row, _col = divmod(_position, 4)
            
        # Identify SNPs with non-significant deviations from HWE
        NOT_SIGNIFICANTLY_DIFFERENT = DATA[gene].loc[
            DATA[gene][f"{pop}_MIDP"] > 0.5
        ]

        # Identify SNPs with significant deviations from HWE
        SIGNIFICANTLY_DIFFERENT = DATA[gene].loc[
            DATA[gene][f"{pop}_MIDP"] <= 0.5
        ]

        TERNARY_FIGURE = HWE.add_subfigure(HWE_GRIDSPEC[_row, _col])
        TERNARY_FIGURE.suptitle(pop, size="x-small")
        TERNARY_FIGURE_AXES = TERNARY_FIGURE.add_subplot(projection="ternary")
        TERNARY_FIGURE_AXES.scatter(
            NOT_SIGNIFICANTLY_DIFFERENT[f"{pop} Hetrozygous Freq."], 
            NOT_SIGNIFICANTLY_DIFFERENT[f"{pop} Homozygous Ref Freq."], 
            NOT_SIGNIFICANTLY_DIFFERENT[f"{pop} Homozygous Alt Freq."], 
            color="darkgray",
            edgecolors="white", 
            linewidths=0.5,
            zorder=2,
            s=10
            )
        TERNARY_FIGURE_AXES.scatter(
            SIGNIFICANTLY_DIFFERENT[f"{pop} Hetrozygous Freq."], 
            SIGNIFICANTLY_DIFFERENT[f"{pop} Homozygous Ref Freq."], 
            SIGNIFICANTLY_DIFFERENT[f"{pop} Homozygous Alt Freq."],
            color="red" ,
            edgecolors="white", 
            linewidths=0.5, 
            zorder=1,
            marker="X",
            s=50
            )
        TERNARY_FIGURE_AXES.tribin(
            SIGNIFICANTLY_DIFFERENT[f"{pop} Hetrozygous Freq."], 
            SIGNIFICANTLY_DIFFERENT[f"{pop} Homozygous Ref Freq."], 
            SIGNIFICANTLY_DIFFERENT[f"{pop} Homozygous Alt Freq."],
            gridsize=5,
            edgecolors="white", 
            linewidths=0.5,
            bins="log",
            zorder=0
            )
        
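        # Labels follow the (top, left, right) order of the data passed above:
        # heterozygous (AB), homozygous reference (AA), homozygous alternate (BB)
        TERNARY_FIGURE_AXES.set_tlabel("AB", size="small")
        TERNARY_FIGURE_AXES.set_llabel("AA", size="small")
        TERNARY_FIGURE_AXES.set_rlabel("BB", size="small")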
        TERNARY_FIGURE_AXES.taxis.set_major_locator(MultipleLocator(0.2))
        TERNARY_FIGURE_AXES.laxis.set_major_locator(MultipleLocator(0.2))
        TERNARY_FIGURE_AXES.raxis.set_major_locator(MultipleLocator(0.2))
        TERNARY_FIGURE_AXES.taxis.set_tick_params(
            tick2On=True,
            colors="C0",
            grid_color="C0",
            labelsize=6,
            gridOn=True,
            grid_alpha=0.45,
            zorder=0.0
            )
        TERNARY_FIGURE_AXES.laxis.set_tick_params(
            tick2On=True,
            colors="C1",
            grid_color="C1",
            labelsize=6,
            gridOn=True,
            grid_alpha=0.45,
            zorder=0.0
            )
        TERNARY_FIGURE_AXES.raxis.set_tick_params(
            tick2On=True,
            colors="C2",
            grid_color="C2",
            labelsize=6,
            gridOn=True,
            grid_alpha=0.45,
            zorder=0.0
            )
        TERNARY_FIGURE_AXES.taxis.label.set_color("C0")
        TERNARY_FIGURE_AXES.laxis.label.set_color("C1")
        TERNARY_FIGURE_AXES.raxis.label.set_color("C2")
        TERNARY_FIGURE_AXES.spines["tside"].set_color("C0")
        TERNARY_FIGURE_AXES.spines["lside"].set_color("C1")
        TERNARY_FIGURE_AXES.spines["rside"].set_color("C2")

In [None]:
for gene in GENES_TO_ANALYSE:
    display(FIGURE[f"{gene} - Population Structure"])

## Hardy-Weinberg Equilibrium Ternary Plots

## Plotting Objectives

- HWE
    - Significant vs Non-Significant
    - Significance density
- PCA
    - Magnitude of contribution towards population structure
    - Direction of contribution towards population structure

Next, I want to visualize the mutations that deviate significantly from Hardy-Weinberg proportions. I also want to highlight the density of these significant mutations, as well as individual mutations which contribute strongly towards our observed population structure.

To accomplish density highlighting, we can make use of the `tribin` plotting method and the ternary projection (axes) provided by `mpltern`. This adds a color-coded tri-bin plot that highlights bin density. It can be placed underneath our data points with a reduced opacity to provide clearer insight into scatterplot point density.

### PCA Scores and HWE point size

To facilitate the identification of mutations which contribute strongly towards population structure, as seen in the PCA, we can vary the scatterplot's point size per mutation based on each mutation's PCA scores. This makes mutations which contribute strongly towards the observed PCA structure appear larger.

We first need to scale the allele weights retrieved from the PCA report. This is because the scores can be positive or negative, and can be very small or very large. A naive approach may be to perform basic Min-Max scaling; however, due to the distribution of points, some data points will come out so small that they won't be legible. For this reason, we will be scaling the column to standard values:

$$
\text{Scaled Value} = \frac{\text{Original Value} - \text{Min Value}}{\text{Max Value} - \text{Min Value}} \times 100
$$


### Scaling PCA Allele-Scores to a Fixed Range

We first need to scale the PCA allele-scores obtained from Plink-2's `--pca allele-wts` function to the fixed range $[0, 10]$. This is because the point-size argument of the `scatter` function expects a non-negative number, i.e. a value in the range $[0, \infty)$. (Unfortunately, it has no built-in method for this rescaling, as this is a somewhat specialized graphing application.)

#### Step-by-Step Explanation

1. **Input Range**:
   Let the original PCA allele-scores be in the range $[x_{\text{min}}, x_{\text{max}}]$, where:
   - $x_{\text{min}}$ = the minimum score
   - $x_{\text{max}}$ = the maximum score

2. **Target Range**:
   The desired scaled range is $[y_{\text{min}}, y_{\text{max}}]$, where:
   - $y_{\text{min}} = 0$
   - $y_{\text{max}} = 10$

3. **Scaling Transformation**:
   To map an original value $x$ to the scaled value $y$, we use the linear transformation formula:
   $$
   y = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} \times (y_{\text{max}} - y_{\text{min}}) + y_{\text{min}}
   $$

4. **Simplified for Target Range $[0, 10]$**:
   Substituting $y_{\text{min}} = 0$ and $y_{\text{max}} = 10$, the formula becomes:
   $$
   \text{Scaled Value} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} \times 10
   $$

#### Key Notes
- This formula ensures that the smallest value in the original range ($x_{\text{min}}$) maps to 0, and the largest value ($x_{\text{max}}$) maps to 10.
- The transformation is linear, preserving the relative differences between values in the original range.
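
The transformation above can be sketched in a few lines of plain Python. The helper name `scale_to_range` and the example allele weights are hypothetical; the handling of the degenerate case where all scores are equal is an assumption added to avoid division by zero.

```python
def scale_to_range(values, y_min=0.0, y_max=10.0):
    """Linearly map `values` from [min(values), max(values)] onto [y_min, y_max]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Degenerate case (assumption): all scores identical, nothing to rescale.
        return [y_min for _ in values]
    span = x_max - x_min
    return [(x - x_min) / span * (y_max - y_min) + y_min for x in values]


# Hypothetical allele weights, both negative and positive.
allele_weights = [-0.4, -0.1, 0.0, 0.6]
point_sizes = scale_to_range(allele_weights)
print(point_sizes)
```

The resulting values could then be supplied as per-point sizes to a scatter call, so that the most negative score maps to the smallest marker and the most positive to the largest.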


# Hardy-Weinberg Results

## Calculate Genotype frequency

The results contain genotype counts, so our first step is to convert these into genotype frequencies that can be plotted on a ternary plot. To do this efficiently, we will loop through each gene found in the `DataFrame` under our chosen cluster of '_super-population_'. For each `DataFrame`, we will then loop again through each unique population annotated in our samples files under '_super-population_' to calculate sub-group genotype frequencies.
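
A minimal sketch of this conversion for a single population is shown below. The count columns follow the plink2-style `_HOM_A1_CT`, `_HET_A1_CT` and `_TWO_AX_CT` naming used in this notebook, while the counts themselves and the helper name `genotype_frequencies` are hypothetical.

```python
import pandas as pd

# Hypothetical genotype counts for one population ("AFR") at two variants.
counts = pd.DataFrame(
    {
        "AFR_HOM_A1_CT": [50, 10],
        "AFR_HET_A1_CT": [40, 20],
        "AFR_TWO_AX_CT": [10, 70],
    }
)


def genotype_frequencies(data: pd.DataFrame, population: str) -> pd.DataFrame:
    """Convert genotype counts into percentages per variant."""
    columns = [
        f"{population}_HOM_A1_CT",
        f"{population}_HET_A1_CT",
        f"{population}_TWO_AX_CT",
    ]
    # Compute the per-variant total once and reuse it for all three genotypes.
    # Variants with a total of 0 would yield NaN frequencies.
    total = data[columns].sum(axis="columns")
    return data[columns].div(total, axis="index") * 100


frequencies = genotype_frequencies(counts, "AFR")
print(frequencies)
```

Each row of the result sums to 100, which is the property a ternary plot relies on when placing a variant inside the triangle.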

## Generate ternary sub-plots

Next, we will make a composite plot using the plotly `make_subplots` function which allows the creation of a figure containing multiple different sub-plots. Since there are an even number of gene regions in this analysis, we can simply create a grid of 2x4.

### Graph type

This plot will be slightly more complicated to construct since we want to style and add multiple subsets of data under different legend entries. This will allow us to more easily distinguish between variants which diverge from HWE and those that do not via plot styling. Normally, this would be facilitated by long-format data and the `plotly.express` interface for constructing `plotly` `Figure`s. In this case, there are some operations we will not be able to perform through this interface, which means we will have to construct each 'trace' manually via the `plotly.graph_objects` interface for object-oriented plotting.

### Subplot grid

Since this plot will involve multiple populations in a sub-plot format, a system will be needed to position each ternary plot within each overall gene-level HWE `Figure`. Fortunately, determining the grid position of a population in a list of known length can be achieved through operators already available in Python. Provided we know the maximum number of graphs on one row of the plot, we can make use of the `divmod()` function, which performs floor division (`//` operator) and modulus (`%` operator) calculations, to determine the row and column grid coordinates from the population's index position.
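
The index-to-grid mapping described above can be sketched as follows. The population labels and the helper name `grid_position` are hypothetical; the four-column width matches the 2x4 grid used for the ternary plots.

```python
def grid_position(index: int, n_cols: int) -> tuple:
    """Return the (row, column) of the item at `index` in a grid `n_cols` wide."""
    # divmod(a, b) returns (a // b, a % b): the row, then the column.
    return divmod(index, n_cols)


# Hypothetical population labels, laid out on a grid four columns wide.
populations = ["AFR", "AMR", "EAS", "EUR", "SAS"]
positions = {
    population: grid_position(index, 4)
    for index, population in enumerate(populations)
}
print(positions)
```

The first four populations fill row 0 from left to right, and the fifth wraps to the first column of row 1.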


# Export

Now we just need to export these figures to file for writing purposes.

In [None]:
Path(join(path, "Graphs", "05")).mkdir(exist_ok=True, parents=True)

In [None]:
# Use a loop variable that does not shadow the imported `figure` function
for figure_name in FIGURE.keys():
    FIGURE[figure_name].savefig(
        join(path, "Graphs", "05", f"{figure_name}.jpeg"),
        dpi=300
    )