## Preparation of variant count data for each African population group

Data on genetic variation in African population groups were generated in-house from the [GnomAD 1000 Genomes and HGDP datasets](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/) and stored in `Data\Raw\SUB\{gene_name}_Count.csv` files. Details on how these data were generated are not included here. Each file contains the data for one gene, where `{gene_name}` is replaced with the name of that gene. The files include information on:

* Genetic variant names: Identifiers for specific genetic differences (variants) in a population.
* Variant positions in the genome: The location of each variant in the genome.
* Genetic alleles: An allele is one of the alternative forms of the sequence at a given genomic position. For each variant there are two alleles: the form found in the reference genome, known as the reference allele (REF), and the altered form, known as the alternate allele (ALT). Together these alleles define the genetic variation at that position.
* Total copies of each variant alternate allele in the population: The number of times the variant's alternate allele (ALT) appears across all samples of a population.
* Total copies of both variant alternate and reference alleles in the population: The total number of alleles (REF and ALT combined) counted for the variant across all samples of a population.

The `Data\Raw\SUB\{gene_name}_Count.csv` data were prepared for further analysis by performing the following steps:
1. The data for all genes were merged into a single dataset.
2. The merged data were melted into a suitable format.
3. A unique ID was assigned to each variant, as some variants did not have unique names.
4. Duplicate variant entries were removed.
5. Additional features were added, such as the total count of non-variant (reference) alleles, the African continental region information, and regional variant counts.


## Import libraries and modules

In [1]:
# Change working directory

import os

os.chdir(
    r"C:\Users\User\Desktop\Megan\MSC2\Results\5._Posthoc_analysis\Pipeline_GnomAD_14032023\Analysis"
)

In [2]:
# Import modules and packages

import sys

sys.path.append(
    r"C:\Users\User\Desktop\Megan\MSC2\Results\5._Posthoc_analysis\Pipeline_GnomAD_14032023"
)
import pandas as pd
import numpy as np
import Utils.constants as constants
import Utils.functions as functions

## Import data

Import the in-house variant count data for each gene and subpopulation, and combine them into a single dataframe.

In [3]:
variants = pd.DataFrame()
for gene in constants.GENES:
    gene_variant_count_path = os.path.join(
        constants.HOME_PATH,
        "Data",
        "Raw",
        "SUB",
        "{}_Count.csv".format(gene),
    )
    gene_variant_df = pd.DataFrame()
    if os.path.exists(gene_variant_count_path):
        gene_variant_df = pd.read_csv(gene_variant_count_path, sep=",").rename(columns={"ID":"VAR_NAME"})
        gene_variant_df["GENE"] = gene
    # errors="ignore" avoids a KeyError when no file has been loaded yet
    variants = pd.concat([variants, gene_variant_df]).drop(
        columns="Unnamed: 0", errors="ignore"
    )
variants.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,ACB_ac,ACB_tc,GWD_ac,GWD_tc,ESN_ac,ESN_tc,...,BantuSouthAfrica_tc,BantuKenya_ac,BantuKenya_tc,YRI_ac,YRI_tc,LWK_ac,LWK_tc,ASW_ac,ASW_tc,GENE
0,chr13:110148882C-CT,110148882,C,CT,0,228,0,352,0,296,...,16,0,24,0,350,0,194,0,142,COL4A1
1,rs552586867,110148891,C,G,0,228,0,352,0,296,...,16,0,24,2,350,0,194,0,142,COL4A1
2,rs59409892,110148917,C,G,26,228,38,352,25,296,...,16,3,24,35,350,13,194,14,142,COL4A1
3,rs535182970,110148920,G,C,1,228,0,352,0,296,...,16,0,24,0,350,0,194,0,142,COL4A1
4,rs56406633,110148959,A,G,1,228,0,352,0,296,...,16,0,24,0,350,0,194,0,142,COL4A1


In the dataframe above:

`ALT` is the alternate (variant) allele at a specific genomic position.

`REF` is the reference allele at that position, that is, the form present in the reference genome.

The alternate allele count for a population is stored in `{population_group}_ac`, where `{population_group}` is the name of the population group or subpopulation. The `_ac` suffix stands for "alternate allele count": the total number of copies of the variant's alternate allele in that population.

The total allele count for the same population is stored in `{population_group}_tc`. The `_tc` suffix stands for "total count": the number of alleles (REF and ALT combined) counted for that variant in the population.
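
As a quick worked example of how the two columns relate, the alternate allele frequency for a population is its `_ac` value divided by its `_tc` value. The sketch below uses the ACB counts shown for rs59409892 in the table above; the helper name `allele_frequency` is illustrative and not part of the pipeline.

```python
def allele_frequency(alt_count: int, total_count: int) -> float:
    """Return the alternate allele frequency, or NaN when no alleles were counted."""
    if total_count == 0:
        return float("nan")
    return alt_count / total_count


# ACB counts for rs59409892, copied from the table above
acb_ac, acb_tc = 26, 228
acb_af = allele_frequency(acb_ac, acb_tc)
print(round(acb_af, 6))  # 0.114035
```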

## Melt data into a suitable format for further analysis

Change the format of this data so that the population data is displayed separately in a new column.

In [4]:
# Separate total count and alternate count information
alt_ct_columns = variants.filter(regex="_ac|VAR_NAME|POS|REF|ALT|GENE")
total_ct_columns = variants.filter(regex="_tc|VAR_NAME|POS|REF|ALT|GENE")

# Melt information
alt_ct_columns = alt_ct_columns.melt(
    id_vars=["VAR_NAME", "POS", "REF", "ALT", "GENE"],
    var_name="SUB_POP",
    value_name="IH_ALT_CTS",
)
total_ct_columns = total_ct_columns.melt(
    id_vars=["VAR_NAME", "POS", "REF", "ALT", "GENE"],
    var_name="SUB_POP",
    value_name="IH_TOTAL_CTS",
)

# Remove the "_ac"/"_tc" suffix from the SUB_POP column (split once, from the right)
alt_ct_columns["SUB_POP"] = alt_ct_columns["SUB_POP"].str.rsplit("_", n=1).str.get(0)
total_ct_columns["SUB_POP"] = (
    total_ct_columns["SUB_POP"].str.rsplit("_", n=1).str.get(0)
)

# Combine formatted information

ih_allele_counts = pd.merge(
    alt_ct_columns, total_ct_columns, on=["VAR_NAME", "POS", "REF", "ALT", "GENE", "SUB_POP"]
)
ih_allele_counts.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS
0,chr13:110148882C-CT,110148882,C,CT,COL4A1,ACB,0,228
1,rs552586867,110148891,C,G,COL4A1,ACB,0,228
2,rs59409892,110148917,C,G,COL4A1,ACB,26,228
3,rs535182970,110148920,G,C,COL4A1,ACB,1,228
4,rs56406633,110148959,A,G,COL4A1,ACB,1,228
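
To make the wide-to-long reshaping easier to follow, here is a minimal sketch on a toy table with two populations. The values are copied from the first table in this notebook; the helper `melt_counts` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

# Toy wide-format table: one row per variant, one column pair per population
wide = pd.DataFrame(
    {
        "VAR_NAME": ["rs552586867", "rs59409892"],
        "ACB_ac": [0, 26],
        "ACB_tc": [228, 228],
        "YRI_ac": [2, 35],
        "YRI_tc": [350, 350],
    }
)


def melt_counts(df, suffix, value_name):
    """Melt the columns ending in `suffix` into one row per variant and population."""
    cols = [c for c in df.columns if c.endswith(suffix)]
    long = df.melt(
        id_vars=["VAR_NAME"], value_vars=cols, var_name="SUB_POP", value_name=value_name
    )
    # Strip the suffix so both melted tables share the same SUB_POP labels
    long["SUB_POP"] = long["SUB_POP"].str[: -len(suffix)]
    return long


# Melt the two count types separately, then join them on variant and population
long_counts = melt_counts(wide, "_ac", "IH_ALT_CTS").merge(
    melt_counts(wide, "_tc", "IH_TOTAL_CTS"), on=["VAR_NAME", "SUB_POP"]
)
print(long_counts)
```

Each variant now has one row per population, with the alternate and total counts side by side.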


## Assign a unique ID to each variant

In [5]:
ih_allele_counts["ID"] = (
    ih_allele_counts[["POS", "ALT", "REF"]].astype("str").agg("_".join, axis=1)
)

ih_allele_counts.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID
0,chr13:110148882C-CT,110148882,C,CT,COL4A1,ACB,0,228,110148882_CT_C
1,rs552586867,110148891,C,G,COL4A1,ACB,0,228,110148891_G_C
2,rs59409892,110148917,C,G,COL4A1,ACB,26,228,110148917_G_C
3,rs535182970,110148920,G,C,COL4A1,ACB,1,228,110148920_C_G
4,rs56406633,110148959,A,G,COL4A1,ACB,1,228,110148959_G_A


## Drop duplicate entries

In [6]:
# Keep one row per variant ID and subpopulation
ih_allele_counts = ih_allele_counts.drop_duplicates(
    ["ID", "SUB_POP"]
)
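
Before dropping rows, it can be useful to see how many duplicates there are. The sketch below shows one way to do this on a toy table; the IDs and counts are made up for illustration.

```python
import pandas as pd

# Toy long-format table with one repeated (ID, SUB_POP) combination
toy = pd.DataFrame(
    {
        "ID": ["100_G_C", "100_G_C", "100_G_C", "200_T_A"],
        "SUB_POP": ["ACB", "ACB", "YRI", "ACB"],
        "IH_ALT_CTS": [1, 1, 0, 3],
    }
)

# Flag every row that repeats an earlier (ID, SUB_POP) combination
is_duplicate = toy.duplicated(subset=["ID", "SUB_POP"], keep="first")
print(int(is_duplicate.sum()))  # 1

# Keep only the first occurrence of each combination
deduplicated = toy.drop_duplicates(subset=["ID", "SUB_POP"], keep="first")
```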

## Add additional data features

##### Correct total allele counts

In the data above, the total count of all alleles (REF and ALT) is stored as `IH_TOTAL_CTS`. Some of the samples from which these data were generated lacked variant (ALT) data at specific positions in the genome. The total count was calculated from the available data only, which may make the variant frequency estimates for a population inaccurate.

To improve accuracy, we assume that if data for a variant are missing in a sample, that variant does not exist in the sample. Under this assumption, the total allele count can be corrected by treating the missing data as non-variant (REF) alleles, which gives more reliable estimates of the variant frequency in the population.

In short, the total allele counts `IH_TOTAL_CTS` were adjusted so that samples with missing data are counted as carrying reference alleles only.

To perform this correction, the following steps were carried out:

1. For each subpopulation, the total number of individuals was calculated.
2. The number of individuals was multiplied by 2, because each individual carries two alleles at each position (one from each parent), each in either variant or non-variant form. The result is the corrected total allele count, `CORR_IH_TOTAL_CTS`.
3. The alternate allele count `IH_ALT_CTS` was subtracted from `CORR_IH_TOTAL_CTS` to get the count of non-variant alleles, `CORR_IH_REF_CTS`.
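
The three steps above can be sketched on a toy sample table. The sample names and the count of six individuals are hypothetical (six is chosen only because it yields the corrected total of 12 shown for `San` in a later table); the real pipeline uses `functions.group_and_count`, whose implementation is not shown here.

```python
import pandas as pd

# Hypothetical sample table: one row per individual
samples = pd.DataFrame(
    {
        "SAMPLE_NAME": [f"S{i}" for i in range(1, 7)],
        "SUB": ["San"] * 6,
    }
)

# Step 1: count individuals per subpopulation
sample_counts = (
    samples.groupby("SUB")["SAMPLE_NAME"].count().rename("SAMPLE_COUNT").reset_index()
)

# Step 2: two alleles per individual
sample_counts["CORR_IH_TOTAL_CTS"] = sample_counts["SAMPLE_COUNT"] * 2

# Step 3: reference alleles = corrected total minus observed alternate alleles
ih_alt_cts = 0  # alternate allele count for one variant in this subpopulation
corr_ih_ref_cts = int(sample_counts.loc[0, "CORR_IH_TOTAL_CTS"]) - ih_alt_cts
print(corr_ih_ref_cts)  # 12
```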

In [7]:
# Import sample and population data
sample_subpopulations = pd.read_csv(
    os.path.join(
        constants.HOME_PATH,
        "Data",
        "Processed",
        "Sample_populations.csv",
    )
)

# Group samples by population to get the number of samples (individuals) per population

grouped_sample_subpopulations = (
    functions.group_and_count(sample_subpopulations, ["SUB"])
    .reset_index()
    .rename(columns={"SAMPLE_NAME": "SAMPLE_COUNT"})
    .drop(columns="REG")
)

# Calculate the total allele count per subpopulation
grouped_sample_subpopulations["CORR_IH_TOTAL_CTS"] = (
    grouped_sample_subpopulations.SAMPLE_COUNT * 2
)

# Add total allele count to allele count dataframe
ih_allele_counts = (
    ih_allele_counts.merge(
        grouped_sample_subpopulations, how="left", left_on="SUB_POP", right_on="SUB"
    )
    .reset_index()
    .drop(columns=["SUB", "SAMPLE_COUNT", "index", "Unnamed: 0"])
)

ih_allele_counts.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS
0,chr13:110148882C-CT,110148882,C,CT,COL4A1,ACB,0,228,110148882_CT_C,228
1,rs552586867,110148891,C,G,COL4A1,ACB,0,228,110148891_G_C,228
2,rs59409892,110148917,C,G,COL4A1,ACB,26,228,110148917_G_C,228
3,rs535182970,110148920,G,C,COL4A1,ACB,1,228,110148920_C_G,228
4,rs56406633,110148959,A,G,COL4A1,ACB,1,228,110148959_G_A,228


##### Calculate the corrected reference allele counts and the corrected alternate allele frequency

In [8]:
# Calculate corrected reference allele counts
ih_allele_counts["CORR_IH_REF_CTS"] = (
    ih_allele_counts["CORR_IH_TOTAL_CTS"]
    - ih_allele_counts["IH_ALT_CTS"]
)

# Calculate corrected alternate allele frequencies
ih_allele_counts["CORR_IH_AF"] = (
    ih_allele_counts["IH_ALT_CTS"]
    / ih_allele_counts["CORR_IH_TOTAL_CTS"]
)

##### Add regional information for each subpopulation group

In [9]:
ih_allele_counts["REG"] = ih_allele_counts["SUB_POP"].map(
    constants.LD_REGIONAL_CLASSIFICATION
)

ih_allele_counts.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS,CORR_IH_AF,REG
0,chr13:110148882C-CT,110148882,C,CT,COL4A1,ACB,0,228,110148882_CT_C,228,228,0.0,ACB
1,rs552586867,110148891,C,G,COL4A1,ACB,0,228,110148891_G_C,228,228,0.0,ACB
2,rs59409892,110148917,C,G,COL4A1,ACB,26,228,110148917_G_C,228,202,0.114035,ACB
3,rs535182970,110148920,G,C,COL4A1,ACB,1,228,110148920_C_G,228,227,0.004386,ACB
4,rs56406633,110148959,A,G,COL4A1,ACB,1,228,110148959_G_A,228,227,0.004386,ACB


##### Add continental African count information

Total African population (labelled `African`): This group includes all populations in the dataset, including the African American (`ASW`) and African Caribbean (`ACB`) samples. Pooling all of these populations gives the frequency of a variant across all sampled individuals of African ancestry.

Recent Africans residing in Africa (labelled `Recent African`): This group excludes the African American and African Caribbean populations. It covers only individuals of recent African origin who still reside on the African continent, which allows us to study the frequency of a variant within African populations without the influence of African American and Caribbean genetic contributions.

In [10]:
# Extract African allele count data using a custom function

total_africa_ct = functions.grouped_pop_allele_counts(ih_allele_counts.copy(), "African")

# Extract Recent African allele count data using a custom function

filters = (ih_allele_counts.REG != "ACB") & (
    ih_allele_counts.REG != "ASW"
)

native_africa_ct = functions.grouped_pop_allele_counts(
    ih_allele_counts.copy(), "Recent African", filters
)

# Concatenate the African and Recent African allele count data with the African subpopulation allele count data

ih_allele_counts_grouped = (
    pd.concat(
        [
            ih_allele_counts,
            total_africa_ct,
            native_africa_ct,
        ]
    )
    .sort_values("ID")
    .reset_index(drop=True)
)
ih_allele_counts_grouped.tail(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS,CORR_IH_AF,REG
376363,chr19:48256362G-T,48256362,G,T,CARD8,San,0,10,48256362_T_G,12,12,0.0,SA
376364,chr19:48256362G-T,48256362,G,T,CARD8,BantuSouthAfrica,0,16,48256362_T_G,16,16,0.0,SA
376365,chr19:48256362G-T,48256362,G,T,CARD8,YRI,0,350,48256362_T_G,350,350,0.0,WA
376366,chr19:48256362G-T,48256362,G,T,CARD8,ACB,0,228,48256362_T_G,228,228,0.0,ACB
376367,chr19:48256362G-T,48256362,G,T,CARD8,,0,1588,48256362_T_G,1608,1608,0.0,Recent African
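
What `functions.grouped_pop_allele_counts` does internally is not shown in this notebook. The following is a minimal sketch of one plausible approach, assuming the function sums the allele counts per variant over the selected subpopulations. The name `sum_counts_across_populations` and the toy values are hypothetical.

```python
import pandas as pd


def sum_counts_across_populations(df, label, mask=None):
    """Hypothetical stand-in: sum allele counts per variant over the selected rows."""
    selected = df if mask is None else df[mask]
    grouped = selected.groupby("ID", as_index=False)[
        ["IH_ALT_CTS", "CORR_IH_TOTAL_CTS"]
    ].sum()
    # Recompute the derived columns from the pooled counts
    grouped["CORR_IH_REF_CTS"] = grouped["CORR_IH_TOTAL_CTS"] - grouped["IH_ALT_CTS"]
    grouped["CORR_IH_AF"] = grouped["IH_ALT_CTS"] / grouped["CORR_IH_TOTAL_CTS"]
    grouped["REG"] = label
    return grouped


# Toy data: one variant observed in three regions (values are made up)
toy = pd.DataFrame(
    {
        "ID": ["1_T_G"] * 3,
        "REG": ["WA", "SA", "ACB"],
        "IH_ALT_CTS": [2, 1, 3],
        "CORR_IH_TOTAL_CTS": [350, 16, 228],
    }
)

all_africa = sum_counts_across_populations(toy, "African")
recent_mask = ~toy["REG"].isin(["ACB", "ASW"])
recent_africa = sum_counts_across_populations(toy, "Recent African", recent_mask)
print(pd.concat([all_africa, recent_africa], ignore_index=True))
```

Pooling counts rather than averaging per-population frequencies means that larger populations contribute proportionally more to the combined frequency.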


##### Classify variants as SNPs or INDELs

Variants can be classified as:

* Single nucleotide polymorphisms (SNPs) if they result in the exchange of one DNA base pair for another 

**OR**

* Insertion/deletions (INDELs) if they result in the deletion or insertion of one or more DNA base pairs

Adding these variant classifications may prove useful for further analysis of the variants.

In [11]:
ih_allele_counts_grouped["VARIANT_TYPE"] = np.where(
    (ih_allele_counts_grouped.ALT.str.len() > 1)
    | (ih_allele_counts_grouped.REF.str.len() > 1),
    "INDEL",
    "SNP",
)

ih_allele_counts_grouped.tail(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS,CORR_IH_AF,REG,VARIANT_TYPE
376363,chr19:48256362G-T,48256362,G,T,CARD8,San,0,10,48256362_T_G,12,12,0.0,SA,SNP
376364,chr19:48256362G-T,48256362,G,T,CARD8,BantuSouthAfrica,0,16,48256362_T_G,16,16,0.0,SA,SNP
376365,chr19:48256362G-T,48256362,G,T,CARD8,YRI,0,350,48256362_T_G,350,350,0.0,WA,SNP
376366,chr19:48256362G-T,48256362,G,T,CARD8,ACB,0,228,48256362_T_G,228,228,0.0,ACB,SNP
376367,chr19:48256362G-T,48256362,G,T,CARD8,,0,1588,48256362_T_G,1608,1608,0.0,Recent African,SNP
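
The length-based rule used in the cell above can be checked on single REF/ALT pairs. This is a small sketch with a hypothetical helper name, `classify_variant`; note that under this rule a multi-base substitution (REF and ALT both longer than one base) would also be labelled `INDEL`.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a variant INDEL if either allele is longer than one base, else SNP."""
    return "INDEL" if len(ref) > 1 or len(alt) > 1 else "SNP"


# REF/ALT pairs taken from the first rows of the COL4A1 table
print(classify_variant("C", "CT"))  # INDEL
print(classify_variant("C", "G"))   # SNP
```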


## Save in-house allele count data to CSV file

In [12]:
ih_allele_counts_grouped.reset_index(drop=True).to_csv(
    os.path.join(
        constants.HOME_PATH,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    )
)

## Prepare data in suitable format for Fisher's Tests

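The in-house allele count data must be reshaped before Fisher's Tests can be used to test for significant differences in allele frequency between populations. The target format has one row per variant, with separate alternate and reference allele count columns for each region.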

In [15]:
# Generate list of all unique regions in ih_allele_counts_grouped dataframe
inhouse_populations = list(set(ih_allele_counts_grouped.REG.values))

# Group allele counts by region

ih_allele_counts_regions = (
    ih_allele_counts_grouped.groupby(
        by=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS", "REG"]
    )
    .sum(numeric_only=True)
    .reset_index()
)

ih_allele_counts_regions

# Pivot data
ih_allele_counts_pivot = ih_allele_counts_regions.pivot(
    index=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS"],
    columns="REG",
    values=["IH_ALT_CTS", "CORR_IH_REF_CTS"],
)

# Separate alternate and reference count data into different dataframes to facilitate renaming the count columns
ih_data_alt = (
    ih_allele_counts_pivot[["IH_ALT_CTS"]].droplevel(level=0, axis=1).reset_index()
)

ih_data_corr_ref = (
    ih_allele_counts_pivot[["CORR_IH_REF_CTS"]].droplevel(level=0, axis=1).reset_index()
)

# Add appropriate prefixes to alt and ref columns
ih_data_alt = functions.add_prefix_dataframe_col_names(
    ih_data_alt, inhouse_populations, "ALT_CT_IH_"
)

ih_data_corr_ref = functions.add_prefix_dataframe_col_names(
    ih_data_corr_ref, inhouse_populations, "CORR_REF_CT_IH_"
)

# Merge renamed alternate and reference count data
ih_recent_afr = ih_data_alt.merge(
    ih_data_corr_ref, on=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS"]
)

ih_recent_afr.head(5)

REG,VAR_NAME,ID,REF,ALT,GENE,POS,ALT_CT_IH_ACB,ALT_CT_IH_ASW,ALT_CT_IH_African,ALT_CT_IH_CA,...,ALT_CT_IH_SA,ALT_CT_IH_WA,CORR_REF_CT_IH_ACB,CORR_REF_CT_IH_ASW,CORR_REF_CT_IH_African,CORR_REF_CT_IH_CA,CORR_REF_CT_IH_EA,CORR_REF_CT_IH_Recent African,CORR_REF_CT_IH_SA,CORR_REF_CT_IH_WA
0,chr11:34438836T-C,34438836_C_T,T,C,CAT,34438836,0,0,1,0,...,0,0,228,142,1977,80,217,1607,28,1282
1,chr11:34438889G-C,34438889_C_G,G,C,CAT,34438889,0,0,1,0,...,1,0,228,142,1977,80,218,1607,27,1282
2,chr11:34438910C-T,34438910_T_C,C,T,CAT,34438910,0,0,0,0,...,0,0,228,142,1978,80,218,1608,28,1282
3,chr11:34439179A-G,34439179_G_A,A,G,CAT,34439179,0,0,0,0,...,0,0,228,142,1978,80,218,1608,28,1282
4,chr11:34439188C-G,34439188_G_C,C,G,CAT,34439188,0,0,0,0,...,0,0,228,142,1978,80,218,1608,28,1282
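
To illustrate why this layout is convenient, each pair of `ALT_CT_IH_*` and `CORR_REF_CT_IH_*` columns for two groups forms a 2x2 contingency table. The sketch below implements a two-sided Fisher's exact test in pure Python so that it is self-contained; it is not the routine used in this pipeline, and in practice a library function such as `scipy.stats.fisher_exact` would normally be preferred. The example counts are the ACB and YRI values for rs59409892 from the earlier tables (26 of 228 and 35 of 350 alleles), used here only for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# Rows: population; columns: alternate allele count, reference allele count
acb_alt, acb_ref = 26, 228 - 26
yri_alt, yri_ref = 35, 350 - 35
p_value = fisher_exact_two_sided(acb_alt, acb_ref, yri_alt, yri_ref)
print(p_value)
```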


## Save Fisher's Test data to a CSV file

In [16]:
ih_recent_afr.reset_index(drop=True).to_csv(
    os.path.join(
        constants.HOME_PATH,
        "Data",
        "Processed",
        "IH_allele_counts_fishers.csv",
    )
)