%%latex
\tableofcontents

# Introduction

Here, we investigate haplotype frequency by annotating our SNP records against a [CSV download](https://www.pharmvar.org/download) of the [PharmVar database](https://www.pharmvar.org/). These annotations are used to identify corresponding variants of clinical interest, as identified using the [Pharmacogenetics Analysis Pipeline](https://github.com/Tuks-ICMM/Pharmacogenetic-Analysis-Pipeline).

Since the PharmVar database provides SNP-to-haplotype associations in table form, we will plot the annotated records with the `treemap` function provided by the `plotly.express` library.

> This notebook will import and make use of data that has already been filtered to identify variants of clinical interest. The criteria used are described in the first notebook of this series, which also describes the code used to perform the filtering.

## Objectives

## Notebook Configuration

### Dependencies

This notebook will make use of the following libraries and imports:


In [1]:
from pandas import read_csv
from os.path import join
from plotly.express import treemap
from plotly.io import renderers
from pathlib import Path

In [2]:
renderers.default = "notebook_connected+pdf"

### Data Imports

First, we will need to import all of our reference data.

In [3]:
# [ASSIGN] the sample metadata used to conduct the analysis to a reference variable
SAMPLES = read_csv(join("input", "samples.csv"))

# [ASSIGN] the genomic location metadata used to conduct the analysis to a reference variable
LOCATIONS = read_csv(join("input", "locations.csv"))

# [ASSIGN] the dataset metadata used to conduct the analysis to a reference variable
DATASETS = read_csv(join("input", "datasets.csv"))

# [ASSIGN] a sorted list of all the unique population codes found in the annotations used in this analysis.
POPULATIONS_TO_COMPARE = sorted(SAMPLES["super-population"].unique().tolist())

# [ASSIGN] a sorted list of the genomic regions analyzed
LOCATIONS_COVERED = sorted(LOCATIONS["location_name"].unique().tolist())

### `MultiIndex` Hierarchical indexing

We will be making use of the `pandas` `MultiIndex` for advanced indexing of hierarchical data, using the standard Variant Call Format (VCF) columns `CHROM`, `POS`, `ID`, `REF` and `ALT`. To facilitate this, we will store the relevant column names in reference variables that can be reused wherever these keys are needed.

In [4]:
MULTIINDEX = ["CHROM", "POS", "ID", "REF", "ALT"]
PHARM_VAR_MULTIINDEX = ["POS", "REF", "ALT"]
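
As a minimal sketch of what hierarchical indexing on these columns looks like, the example below builds a toy table and promotes the VCF columns to a `MultiIndex`. The records, identifiers and the `AFR` frequency column are hypothetical placeholders, not values from this analysis.

```python
import pandas as pd

# Hypothetical toy records using the standard VCF columns.
toy = pd.DataFrame(
    {
        "CHROM": ["10", "10", "22"],
        "POS": [100, 200, 300],
        "ID": ["rs_a", "rs_b", "rs_c"],
        "REF": ["C", "G", "C"],
        "ALT": ["T", "A", "G"],
        "AFR": [0.15, 0.01, 0.30],
    }
)

MULTIINDEX = ["CHROM", "POS", "ID", "REF", "ALT"]

# Promote the VCF columns to a hierarchical row index.
indexed = toy.set_index(MULTIINDEX)

# Select every record on chromosome 10 via the outermost index level.
chrom_10 = indexed.xs("10", level="CHROM")
print(chrom_10)
```

Keeping the column names in one list means the same keys can be reused for indexing and for join operations.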

# Annotate via join-operation

To match the haplotype annotations obtained from [PharmVar](https://www.pharmvar.org/) with our SNP data, we can use the `pandas` method `merge` to perform a join operation. In this instance, we will perform a left merge (`how="left"`), meaning all records on the left-hand side are kept, while records on the right-hand side that are not matched are dropped.
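
The behaviour of a left merge can be illustrated with a small sketch on toy data. The frames, positions and haplotype names below are hypothetical, and the `indicator=True` argument is added only to show where each row came from.

```python
import pandas as pd

# Hypothetical SNP records (left-hand side).
snps = pd.DataFrame(
    {"POS": [100, 200, 300], "REF": ["A", "C", "G"], "ALT": ["G", "T", "A"]}
)

# Hypothetical haplotype annotations (right-hand side).
haplotypes = pd.DataFrame(
    {
        "POS": [100, 400],
        "REF": ["A", "T"],
        "ALT": ["G", "C"],
        "Haplotype Name": ["GENE*2", "GENE*3"],
    }
)

# Left merge: every SNP row is kept; unmatched annotations are dropped.
merged = snps.merge(
    haplotypes, how="left", on=["POS", "REF", "ALT"], indicator=True
)
print(merged)
```

SNPs without a matching annotation receive `NaN` in `Haplotype Name`. Note also that a SNP matching several haplotypes would appear once per match in the result.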

We will also filter out any records that do not reach an allelic frequency of at least 2% in at least one population.


In [5]:
DATA = dict()
for gene in ["CYP2B6", "CYP2C9", "CYP2C19", "CYP2D6", "CYP4F2"]:
    # [ASSIGN] a DataFrame containing our imported gene-specific datasets to our DATA object
    DATA[gene] = read_csv(
        join(
            "/",
            "mnt",
            "ICMM_HDD_12TB",
            "Results_25SEP2024",
            "cleaned",
            f"super-population_{gene}.csv.zst",
        )
    )

    DATA[gene].rename(columns={"Consequence_type": "Consequence"}, inplace=True)

    PHARM_VAR = read_csv(
        join("Reference Data", "PharmVar_database_v6.1.2.1", f"{gene}.haplotypes.tsv"),
        sep="\t",
    )
    PHARM_VAR.rename(
        columns={
            "Variant Start": "POS",
            "Reference Allele": "REF",
            "Variant Allele": "ALT",
        },
        inplace=True,
    )
    PHARM_VAR.drop(PHARM_VAR.loc[PHARM_VAR["POS"] == "."].index, inplace=True)
    PHARM_VAR = PHARM_VAR.astype({"POS": "int64"})
    DATA[gene] = DATA[gene].merge(
        PHARM_VAR,
        how="left",
        left_on=PHARM_VAR_MULTIINDEX,
        right_on=PHARM_VAR_MULTIINDEX,
    )
    # [BUILD] a query that keeps records with a frequency >= 2% in at least one population
    QUERY = " | ".join(
        f"({population} >= 0.02)" for population in POPULATIONS_TO_COMPARE
    )

    DATA[gene] = DATA[gene].query(QUERY)
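
The frequency filter can be sketched in isolation as follows. The population codes and frequencies are hypothetical; the backticks around column names are an addition that lets `query` cope with names that are not valid Python identifiers, and the boolean mask is shown as an equivalent alternative.

```python
import pandas as pd

# Hypothetical per-population allele frequencies.
freqs = pd.DataFrame(
    {
        "ID": ["rs_a", "rs_b", "rs_c"],
        "AFR": [0.10, 0.01, 0.00],
        "EUR": [0.00, 0.015, 0.03],
    }
)
populations = ["AFR", "EUR"]

# Build "(`AFR` >= 0.02) | (`EUR` >= 0.02)" and apply it.
query = " | ".join(f"(`{population}` >= 0.02)" for population in populations)
via_query = freqs.query(query)

# Equivalent boolean mask: keep rows where any population reaches 2%.
via_mask = freqs[freqs[populations].ge(0.02).any(axis=1)]
print(via_query)
```

Both forms keep a record as soon as a single population meets the threshold.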

# Haplotype Proportions

Now we just need to plot our Haplotype-annotated data. For this, we can use the [`treemap()`](https://plotly.com/python/treemaps/#basic-treemap-with-plotlyexpress) function from [`plotly.express`](https://plotly.com/python/plotly-express/).

In [6]:
FIGURE = dict()

for gene in DATA.keys():
    PLOT_DATA = DATA[gene].loc[DATA[gene]["Haplotype Name"].notna()]
    if PLOT_DATA.shape[0] != 0:
        FIGURE[gene] = treemap(
            PLOT_DATA,
            path=["Haplotype Name"],
            title=f"{gene} | Haplotype count",
            color="CADD_PHRED"
        )
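
To make the plotted quantities easier to check, the sketch below reproduces them with plain `pandas` on hypothetical data. It assumes that, when no `values` argument is passed, `treemap` sizes each tile by the number of rows in that group and colours it by the group's average of the `color` column.

```python
import pandas as pd

# Hypothetical annotated records; the last row has no haplotype match.
annotated = pd.DataFrame(
    {
        "Haplotype Name": ["GENE*2", "GENE*2", "GENE*3", None],
        "CADD_PHRED": [10.0, 20.0, 5.0, 1.0],
    }
)

# Keep only records that received a haplotype annotation.
plot_data = annotated.loc[annotated["Haplotype Name"].notna()]

# Row count (assumed tile size) and mean score (assumed tile colour).
summary = plot_data.groupby("Haplotype Name").agg(
    records=("CADD_PHRED", "size"),
    mean_cadd=("CADD_PHRED", "mean"),
)
print(summary)
```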

# Export

Now we just need to save these figures for later use.

In [7]:
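# [CREATE] the output directory, including the parent "Graphs" directory if it is missing
Path(join("Graphs", "03")).mkdir(parents=True, exist_ok=True)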
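
The sketch below shows how `Path.mkdir` behaves with and without `parents=True`, which matters when the `Graphs` directory does not exist yet. It runs inside a temporary scratch directory, so nothing is written to the project folder.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as scratch:
    target = Path(scratch) / "Graphs" / "03"

    # Without parents=True, a missing parent directory raises an error.
    try:
        target.mkdir(exist_ok=True)
        created_without_parents = True
    except FileNotFoundError:
        created_without_parents = False

    # With parents=True, intermediate directories are created as needed.
    target.mkdir(parents=True, exist_ok=True)

    # exist_ok=True makes a repeated call harmless.
    target.mkdir(parents=True, exist_ok=True)
    exists_after = target.is_dir()

print(created_without_parents, exists_after)
```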

In [8]:
for gene in FIGURE.keys():
    display(FIGURE[gene])
    FIGURE[gene].write_image(
        join("Graphs", "03", f"{gene}_haplotype_count.jpeg"),
        width=1500,
        height=800,
        scale=1,
    )