# Preparation of variant count data for each African population group

Data on genetic variation found in African population groups was generated in-house by processing genomic data obtained from the [gnomAD 1000 Genomes and HGDP datasets](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/) through a [Snakemake bioinformatics pipeline](https://github.com/Tuks-ICMM/Pharmacogenetic-Analysis-Pipeline). The generated data was stored in `Data\Raw\SUB\{gene_name}_Count.csv` files, where `{gene_name}` is the name of a specific gene. Each file contains the data on variants identified in that gene. The files include information on:

* Genetic variant names: These are identifiers for specific genetic differences (variants) in a population.
* Variant positions in the genome: This tells us where these genetic variations are located in the genetic code.
* Genetic alleles: An allele is one of the alternative forms of the DNA sequence that can exist at a particular genomic position. Each variant has a reference allele (REF), the form found in the reference genome, and an alternate allele (ALT), the altered form. Together, these alleles define the genetic variation at a given position.
* Total copies of each variant alternate allele in the population: This shows how many times each genetic variant's alternate allele (ALT) appears in all the samples of a population.
* Total copies of both variant alternate and reference alleles in the population: This is the overall count of alleles (REF and ALT combined) observed at the variant's position across all the samples of a population.
* Sample subpopulation group: The ethnolinguistic classification of the African population from which the genetic data sample originated.

The `Data\Raw\SUB\{gene_name}_Count.csv` data was prepared for further analysis by performing the following steps:
1. The data for all genes was merged into a single dataset.
2. The merged data was melted into a long format, with one row per variant per population group.
3. A unique ID was assigned to each variant, as some variants did not have unique names. 
4. Duplicate variant entries were removed.
5. Irrelevant data was removed.
6. Additional features were added, such as the total count of reference allele sites, and regional information on the African ethnolinguistic/subpopulation groups.


## Imports

Notebook setup

In [47]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import pandas as pd
import numpy as np
import Utils.constants as constants
import Utils.functions as functions

Import in-house African variant data 

In [48]:
# Import CSVs with variants identified in-house in African populations for genes of interest.

variants = pd.DataFrame()

genes = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "locations.csv",
    )
).location_name

for gene in genes:
    gene_variant_count_path = os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "SUB",
        "{}_Count.csv".format(gene),
    )

    # Combine variant data on each gene into a single dataframe
    gene_variant_df = pd.DataFrame()
    if os.path.exists(gene_variant_count_path):
        gene_variant_df = pd.read_csv(gene_variant_count_path, sep=",").rename(
            columns={"ID": "VAR_NAME"}
        )
        gene_variant_df["GENE"] = gene
    variants = pd.concat([variants, gene_variant_df])

variants.head(5)

Unnamed: 0,CHROM,VAR_NAME,REF,ALT,GWD_ac,GWD_tc,ESN_ac,ESN_tc,MSL_ac,MSL_tc,...,BantuSouthAfrica_ac,BantuSouthAfrica_tc,BantuKenya_ac,BantuKenya_tc,YRI_ac,YRI_tc,LWK_ac,LWK_tc,GENE,POS
0,13,chr13:110148882C-CT,C,CT,0,232,0,206,0,166,...,0,16,0,20,0,234,0,184,COL4A1,110148882
1,13,rs552586867,C,G,0,232,0,206,0,166,...,0,16,0,20,1,234,0,184,COL4A1,110148891
2,13,rs59409892,C,G,28,232,18,206,15,166,...,2,16,3,20,24,234,13,184,COL4A1,110148917
3,13,rs535182970,G,C,0,232,0,206,0,166,...,0,16,0,20,0,234,0,184,COL4A1,110148920
4,13,rs56406633,A,G,0,232,0,206,0,166,...,0,16,0,20,0,234,0,184,COL4A1,110148959


In the dataframe above:

`ALT` is the alternate allele at a genomic position.

`REF` is the reference allele at that position, i.e. the form found in the reference genome.

`{population_group}_ac` is the alternate allele count: the total number of copies of the variant's alternate allele in the population. Here, `{population_group}` stands for the name of the population group or subpopulation (for example, `GWD_ac`).

`{population_group}_tc` is the total count: the number of copies of all alleles (alternate and reference) observed at the variant's position in the same population (for example, `GWD_tc`).

## Format data

Reshape the data so that each population group appears in its own row, with the group name stored in a new `SUB_POP` column.

In [49]:
# Separate total count and alternate count information
alt_ct_columns = variants.filter(regex="_ac|VAR_NAME|POS|REF|ALT|GENE")
total_ct_columns = variants.filter(regex="_tc|VAR_NAME|POS|REF|ALT|GENE")

# Melt information
alt_ct_columns = alt_ct_columns.melt(
    id_vars=["VAR_NAME", "POS", "REF", "ALT", "GENE"],
    var_name="SUB_POP",
    value_name="IH_ALT_CTS",
)
total_ct_columns = total_ct_columns.melt(
    id_vars=["VAR_NAME", "POS", "REF", "ALT", "GENE"],
    var_name="SUB_POP",
    value_name="IH_TOTAL_CTS",
)

# Remove the "_ac"/"_tc" suffix (everything after the last underscore) in the SUB_POP column
alt_ct_columns["SUB_POP"] = alt_ct_columns["SUB_POP"].str.rsplit("_", n=1).str.get(0)
total_ct_columns["SUB_POP"] = (
    total_ct_columns["SUB_POP"].str.rsplit("_", n=1).str.get(0)
)

# Combine formatted information

ih_allele_counts = pd.merge(
    alt_ct_columns,
    total_ct_columns,
    on=["VAR_NAME", "POS", "REF", "ALT", "GENE", "SUB_POP"],
)
ih_allele_counts.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS
0,chr13:110148882C-CT,110148882,C,CT,COL4A1,GWD,0,232
1,rs552586867,110148891,C,G,COL4A1,GWD,0,232
2,rs59409892,110148917,C,G,COL4A1,GWD,28,232
3,rs535182970,110148920,G,C,COL4A1,GWD,0,232
4,rs56406633,110148959,A,G,COL4A1,GWD,0,232
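
The wide-to-long reshaping above can be illustrated on a toy table. This is a minimal sketch: the two variants, their counts and the helper `melt_counts` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy wide table: one row per variant, one *_ac / *_tc column pair per population.
wide = pd.DataFrame(
    {
        "VAR_NAME": ["var1", "var2"],
        "GWD_ac": [0, 28],
        "GWD_tc": [232, 232],
        "ESN_ac": [1, 18],
        "ESN_tc": [206, 206],
    }
)


def melt_counts(df, suffix, value_name):
    """Melt the columns ending in `suffix` into one row per variant and population."""
    cols = [c for c in df.columns if c.endswith(suffix)]
    long = df.melt(
        id_vars="VAR_NAME", value_vars=cols, var_name="SUB_POP", value_name=value_name
    )
    # Drop the suffix so the alternate and total tables share the same key.
    long["SUB_POP"] = long["SUB_POP"].str[: -len(suffix)]
    return long


toy_long = melt_counts(wide, "_ac", "IH_ALT_CTS").merge(
    melt_counts(wide, "_tc", "IH_TOTAL_CTS"), on=["VAR_NAME", "SUB_POP"]
)
print(toy_long)
```

Each variant now has one row per population, with the alternate and total counts side by side.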


Some variants have information on more than one alternate allele in a single row. Identify these variants and split the information into multiple rows accordingly. 

In [50]:
# Which variants have this issue?

multiple_alt_allele_variants = ih_allele_counts[ih_allele_counts.ALT.str.contains(",")]
multiple_alt_allele_variants.count()

VAR_NAME        480
POS             480
REF             480
ALT             480
GENE            480
SUB_POP         480
IH_ALT_CTS      480
IH_TOTAL_CTS    480
dtype: int64

In [51]:
# Remove these variants from the ih_allele_counts dataframe.

ih_allele_counts = ih_allele_counts[~ih_allele_counts.ALT.str.contains(",")]

In [52]:
# Split the information into multiple rows and append to a new dataframe

split_rows = []

for index, row in multiple_alt_allele_variants.reset_index().iterrows():

    # Variants with an rs identifier keep the same name for both alleles;
    # names joined by ";" are split so that each allele gets its own name.
    if "rs" in row.VAR_NAME:
        varname1 = row.VAR_NAME
        varname2 = row.VAR_NAME
    elif ";" in row.VAR_NAME:
        varname1 = row.VAR_NAME.split(";")[0]
        varname2 = row.VAR_NAME.split(";")[1]

    position1 = row.POS
    position2 = row.POS

    ref1 = row.REF
    ref2 = row.REF

    alt1 = row.ALT.split(",")[0]
    alt2 = row.ALT.split(",")[1]

    gene1 = row.GENE
    gene2 = row.GENE

    subpop1 = row.SUB_POP
    subpop2 = row.SUB_POP

    # The alternate allele counts are stored as a comma-separated string
    ihaltcts1 = row.IH_ALT_CTS.split(",")[0]
    ihaltcts2 = row.IH_ALT_CTS.split(",")[1]

    ihtotalcts1 = row.IH_TOTAL_CTS
    ihtotalcts2 = row.IH_TOTAL_CTS

    row1 = {"VAR_NAME":varname1,"POS":position1,"REF":ref1,"ALT":alt1,"SUB_POP":subpop1,"IH_ALT_CTS":ihaltcts1,"IH_TOTAL_CTS":ihtotalcts1}
    row2 = {"VAR_NAME":varname2,"POS":position2,"REF":ref2,"ALT":alt2,"SUB_POP":subpop2,"IH_ALT_CTS":ihaltcts2,"IH_TOTAL_CTS":ihtotalcts2}

    # Collect the rows in a list; DataFrame.append was removed in pandas 2.0
    split_rows.append(row1)
    split_rows.append(row2)

split_ih_allele_counts = pd.DataFrame(split_rows)

split_ih_allele_counts



Unnamed: 0,VAR_NAME,POS,REF,ALT,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS
0,chr13:110822413A-AGGAGG,110170066,AGGAGG,AGGAGGGGAGG,GWD,0,26
1,chr13:110170066AGGAGG-A,110170066,AGGAGG,A,GWD,0,26
2,rs376760979,110170113,AGAAGGAAGGAAGGAAGGAAGGAAG,A,GWD,0,152
3,rs376760979,110170113,AGAAGGAAGGAAGGAAGGAAGGAAG,AGAAGGAAGGAAGGAAGGAAG,GWD,6,152
4,rs372096863,110170113,AGAAGGAAGGAAGGAAGGAAG,A,GWD,0,152
...,...,...,...,...,...,...,...
955,chr19:48252808T-TAC,48252808,T,TAC,LWK,0,184
956,chr19:48756065T-TATATAC,48252808,T,TATATAC,LWK,0,184
957,chr19:48252808T-TACAC,48252808,T,TACAC,LWK,0,184
958,chr19:48756065T-TATATATAC,48252808,T,TATATATAC,LWK,0,184


In [53]:
# Append the split rows to the ih_allele_counts dataframe

ih_allele_counts = pd.concat([ih_allele_counts, split_ih_allele_counts], ignore_index=True).reset_index()



In [55]:
# Check to see if there are still rows with the issue

ih_allele_counts[ih_allele_counts.ALT.str.contains(",")].count()

index           0
VAR_NAME        0
POS             0
REF             0
ALT             0
GENE            0
SUB_POP         0
IH_ALT_CTS      0
IH_TOTAL_CTS    0
dtype: int64
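
As an alternative to the row-by-row loop, the same kind of split could be sketched with `str.split` and `DataFrame.explode`. This is not the notebook's method. It runs on made-up rows and assumes pandas 1.3 or later (for multi-column `explode`). It also assumes that when names are joined by `;`, there is one name per alternate allele.

```python
import pandas as pd

# Made-up rows that each hold two alternate alleles.
toy = pd.DataFrame(
    {
        "VAR_NAME": ["rs1", "chr1:200A-AT;chr1:200A-ATT"],
        "POS": [100, 200],
        "REF": ["A", "A"],
        "ALT": ["C,G", "AT,ATT"],
        "SUB_POP": ["GWD", "GWD"],
        "IH_ALT_CTS": ["0,6", "1,2"],
        "IH_TOTAL_CTS": [152, 152],
    }
)

alts = toy["ALT"].str.split(",")
# Reuse a single name for every allele, or split names joined by ";".
names = pd.Series(
    [n.split(";") if ";" in n else [n] * len(a) for n, a in zip(toy["VAR_NAME"], alts)],
    index=toy.index,
)

# Turn the list-valued columns into one row per alternate allele.
split = toy.assign(
    VAR_NAME=names, ALT=alts, IH_ALT_CTS=toy["IH_ALT_CTS"].str.split(",")
).explode(["VAR_NAME", "ALT", "IH_ALT_CTS"], ignore_index=True)
split["IH_ALT_CTS"] = split["IH_ALT_CTS"].astype(int)
print(split)
```

Unlike the loop, this form is not limited to two alternate alleles per row, and it keeps every other column (for example `POS` and `IH_TOTAL_CTS`) unchanged on each new row.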

## Rename subpopulations 

In [56]:
ih_allele_counts_renamed = ih_allele_counts.replace({"SUB_POP": constants.SUBPOP_RENAME})

ih_allele_counts_renamed.SUB_POP.unique()

array(['Mandinka', 'Esan', 'Mende', 'Mbuti Pygmy', 'Biaka Pygmy',
       'Mandenka', 'Yoruba', 'San', 'Bantu South Africa', 'Bantu Kenya',
       'Luhya'], dtype=object)

## Assign a unique ID to each variant

Some variants do not have unique names. This will complicate downstream analysis of the data. Add a column with a unique ID for each variant to rectify this.

In [57]:
ih_allele_counts_renamed["ID"] = (
    ih_allele_counts_renamed[["POS", "ALT", "REF"]].astype("str").agg("_".join, axis=1)
)

ih_allele_counts_renamed.head(5)

Unnamed: 0,index,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID
0,0,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mandinka,0,232,110148882_CT_C
1,1,rs552586867,110148891,C,G,COL4A1,Mandinka,0,232,110148891_G_C
2,2,rs59409892,110148917,C,G,COL4A1,Mandinka,28,232,110148917_G_C
3,3,rs535182970,110148920,G,C,COL4A1,Mandinka,0,232,110148920_C_G
4,4,rs56406633,110148959,A,G,COL4A1,Mandinka,0,232,110148959_G_A
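
The following toy example shows why a variant name alone is not a safe key. Two made-up rows share one name but carry different alternate alleles, and the ID built from `POS`, `ALT` and `REF` tells them apart. The values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "VAR_NAME": ["rs1", "rs1", "rs2"],
        "POS": [100, 100, 250],
        "REF": ["AGG", "AGG", "C"],
        "ALT": ["A", "AGGAGG", "G"],
    }
)

# Same construction as above: POS, ALT and REF joined by underscores.
toy["ID"] = toy[["POS", "ALT", "REF"]].astype("str").agg("_".join, axis=1)

print(toy["VAR_NAME"].is_unique)
print(toy["ID"].is_unique)
```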


## Correct data types

In [58]:
ih_allele_counts_renamed[["VAR_NAME","POS","REF","ALT","GENE","SUB_POP","ID","IH_ALT_CTS"]] = ih_allele_counts_renamed[["VAR_NAME","POS","REF","ALT","GENE","SUB_POP","ID","IH_ALT_CTS"]].astype(str)

ih_allele_counts_renamed["IH_ALT_CTS"] = [x.replace(",",".") for x in ih_allele_counts_renamed["IH_ALT_CTS"]]

ih_allele_counts_renamed[["IH_TOTAL_CTS", "IH_ALT_CTS"]] = ih_allele_counts_renamed[["IH_TOTAL_CTS", "IH_ALT_CTS"]].astype(float).astype(int)

## Combine allele count info for the two Yoruban populations

In [59]:
agg_functions = {"IH_ALT_CTS":"sum", "IH_TOTAL_CTS":"sum"}
ih_allele_counts_grouped = ih_allele_counts_renamed.groupby(by=["ID","VAR_NAME", "POS", "REF", "ALT", "GENE", "SUB_POP"]).aggregate(agg_functions).reset_index()
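
After renaming, rows that came from different source labels but now share a `SUB_POP` value are summed by the `groupby`. Below is a minimal sketch with made-up counts. The mapping is a hypothetical stand-in for `constants.SUBPOP_RENAME`, whose actual contents are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for constants.SUBPOP_RENAME
subpop_rename = {"YRI": "Yoruba", "LWK": "Luhya"}

toy = pd.DataFrame(
    {
        "ID": ["100_G_C"] * 3,
        "SUB_POP": ["YRI", "Yoruba", "LWK"],
        "IH_ALT_CTS": [24, 3, 13],
        "IH_TOTAL_CTS": [234, 42, 184],
    }
)

# Both Yoruba source labels end up under the same SUB_POP value ...
renamed = toy.replace({"SUB_POP": subpop_rename})

# ... so grouping by variant and subpopulation sums their counts.
combined = (
    renamed.groupby(["ID", "SUB_POP"])
    .aggregate({"IH_ALT_CTS": "sum", "IH_TOTAL_CTS": "sum"})
    .reset_index()
)
print(combined)
```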

## Add additional data features

##### Reference allele counts

In the data above, the total count of all alleles (REF and ALT) is included as `IH_TOTAL_CTS`. To get the count of non-variant reference alleles, `IH_REF_CTS`, the alternate allele count `IH_ALT_CTS` was subtracted from `IH_TOTAL_CTS`.

In [60]:
# Calculate reference allele counts and add to dataframe
ih_allele_counts_grouped["IH_REF_CTS"] = (ih_allele_counts_grouped["IH_TOTAL_CTS"] - ih_allele_counts_grouped["IH_ALT_CTS"]).astype(int)

ih_allele_counts_grouped.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS
0,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu Kenya,0,20,20
1,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu South Africa,0,16,16
2,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Biaka Pygmy,0,44,44
3,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Esan,0,206,206
4,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Luhya,0,184,184
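
A small sanity check can accompany the subtraction. This sketch uses made-up counts and confirms that no alternate count exceeds its total before deriving the reference count.

```python
import pandas as pd

toy = pd.DataFrame({"IH_ALT_CTS": [0, 28], "IH_TOTAL_CTS": [232, 232]})

# The alternate count can never exceed the total count.
assert (toy["IH_ALT_CTS"] <= toy["IH_TOTAL_CTS"]).all()

toy["IH_REF_CTS"] = toy["IH_TOTAL_CTS"] - toy["IH_ALT_CTS"]
print(toy)
```

Note that at positions with several alternate alleles this difference counts every allele other than the given ALT, so it may include copies of the other alternate alleles.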


##### African subpopulation/ethnolinguistic groups

Add information on the African region (i.e., Southern Africa, Western Africa, Eastern Africa, Central Africa, America, Caribbean) from which a particular African subpopulation/ethnolinguistic group originates. 

In [61]:
ih_allele_counts_grouped["REG"] = ih_allele_counts_grouped["SUB_POP"].map(
    constants.REGIONAL_CLASSIFICATION
)

ih_allele_counts_grouped.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG
0,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu Kenya,0,20,20,EA
1,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu South Africa,0,16,16,SA
2,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Biaka Pygmy,0,44,44,CA
3,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Esan,0,206,206,WA
4,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Luhya,0,184,184,EA
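
`Series.map` returns `NaN` for any label missing from the dictionary, so it may be worth checking for unmapped groups. In the sketch below, the dictionary is a hypothetical stand-in for `constants.REGIONAL_CLASSIFICATION`, using three pairs visible in the output above, and "Unlisted group" is a made-up label.

```python
import pandas as pd

# Hypothetical stand-in for constants.REGIONAL_CLASSIFICATION
regional_classification = {"Bantu Kenya": "EA", "Esan": "WA", "Biaka Pygmy": "CA"}

sub_pops = pd.Series(["Bantu Kenya", "Esan", "Biaka Pygmy", "Unlisted group"])
regions = sub_pops.map(regional_classification)

# Labels missing from the dictionary come back as NaN.
unmapped = sub_pops[regions.isna()].unique().tolist()
print(unmapped)
```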


##### Add grouped African count information

Provide aggregated allele count information for Recent African populations. Recent African populations are defined as African populations currently residing on the African continent. This group excludes African American and African Caribbean populations.

In [62]:
agg_functions = {"IH_ALT_CTS":"sum", "IH_TOTAL_CTS":"sum", "IH_REF_CTS":"sum"}
recent_africa_ct = ih_allele_counts_grouped.groupby(["ID","VAR_NAME","POS","REF","ALT","GENE"]).aggregate(agg_functions).reset_index()

recent_africa_ct["REG"] = "Recent African"
recent_africa_ct["SUB_POP"] = np.nan

recent_africa_ct.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,SUB_POP
0,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,0,1220,1220,Recent African,
1,110148891_G_C,rs552586867,110148891,C,G,COL4A1,1,1220,1219,Recent African,
2,110148917_G_C,rs59409892,110148917,C,G,COL4A1,119,1220,1101,Recent African,
3,110148920_C_G,rs535182970,110148920,G,C,COL4A1,0,1220,1220,Recent African,
4,110148959_G_A,rs56406633,110148959,A,G,COL4A1,0,1220,1220,Recent African,


In [63]:
# Concatenate the Recent African allele count data with the subpopulation allele count data

ih_allele_counts_grouped = (
    pd.concat(
        [
            ih_allele_counts_grouped,
            recent_africa_ct,
        ]
    )
    .sort_values("ID")
    .reset_index(drop=True)
)

ih_allele_counts_grouped.tail(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG
282751,48256362_T_G,chr19:48256362G-T,48256362,G,T,CARD8,Biaka Pygmy,0,40,40,CA
282752,48256362_T_G,chr19:48256362G-T,48256362,G,T,CARD8,Bantu South Africa,0,16,16,SA
282753,48256362_T_G,chr19:48256362G-T,48256362,G,T,CARD8,Bantu Kenya,0,18,18,EA
282754,48256362_T_G,chr19:48256362G-T,48256362,G,T,CARD8,Yoruba,0,272,272,WA
282755,48256362_T_G,chr19:48256362G-T,48256362,G,T,CARD8,,0,1202,1202,Recent African
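
The aggregation and concatenation steps can be sketched on a toy table. The counts below are made up. Only the pattern (sum per variant, label the result "Recent African", append it to the subpopulation rows) mirrors the cells above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "ID": ["100_G_C", "100_G_C", "200_T_A", "200_T_A"],
        "SUB_POP": ["Esan", "Luhya", "Esan", "Luhya"],
        "REG": ["WA", "EA", "WA", "EA"],
        "IH_ALT_CTS": [18, 13, 0, 1],
        "IH_TOTAL_CTS": [206, 184, 206, 184],
    }
)

# Sum over all subpopulations of each variant to get one continent-wide row.
continent = toy.groupby("ID", as_index=False)[["IH_ALT_CTS", "IH_TOTAL_CTS"]].sum()
continent["REG"] = "Recent African"
continent["SUB_POP"] = np.nan

# Append the aggregated rows to the subpopulation rows.
toy_all = pd.concat([toy, continent]).sort_values("ID").reset_index(drop=True)
print(toy_all)
```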


##### Calculate allele frequencies

In [64]:
ih_allele_counts_grouped["IH_AF"] = (
    ih_allele_counts_grouped["IH_ALT_CTS"]
    / ih_allele_counts_grouped["IH_TOTAL_CTS"]
)

ih_allele_counts_grouped.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF
0,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu Kenya,0,20,20,EA,0.0
1,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Yoruba,0,276,276,WA,0.0
2,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,San,0,12,12,SA,0.0
3,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mende,0,166,166,WA,0.0
4,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mbuti Pygmy,0,24,24,CA,0.0
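
The allele frequency is the alternate allele count divided by the total allele count, e.g. 28 / 232 ≈ 0.121 for rs59409892 in the GWD counts shown earlier. If a total count of zero can occur, the plain division returns `NaN` or `inf`. The sketch below uses made-up rows and shows one possible way to make such rows explicitly `NaN`.

```python
import pandas as pd

toy = pd.DataFrame({"IH_ALT_CTS": [28, 0, 0], "IH_TOTAL_CTS": [232, 232, 0]})

# Mask zero totals first so the division yields NaN for them.
toy["IH_AF"] = toy["IH_ALT_CTS"] / toy["IH_TOTAL_CTS"].where(toy["IH_TOTAL_CTS"] > 0)
print(toy)
```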


##### Classify variants as SNPs or INDELs

Variants can be classified as:

* Single nucleotide polymorphisms (SNPs) if they result in the exchange of one DNA base pair for another 

**OR**

* Insertion/deletions (INDELs) if they result in the deletion or insertion of one or more DNA base pairs

Adding these variant classifications may prove useful for further analysis of the variants.

In [65]:
ih_allele_counts_grouped["VARIANT_TYPE"] = np.where(
    (ih_allele_counts_grouped.ALT.str.len() > 1)
    | (ih_allele_counts_grouped.REF.str.len() > 1),
    "INDEL",
    "SNP",
)

ih_allele_counts_grouped.head(5)

Unnamed: 0,ID,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,IH_REF_CTS,REG,IH_AF,VARIANT_TYPE
0,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Bantu Kenya,0,20,20,EA,0.0,INDEL
1,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Yoruba,0,276,276,WA,0.0,INDEL
2,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,San,0,12,12,SA,0.0,INDEL
3,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mende,0,166,166,WA,0.0,INDEL
4,110148882_CT_C,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mbuti Pygmy,0,24,24,CA,0.0,INDEL
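
The length-based rule can be checked on a few made-up alleles:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"REF": ["C", "C", "AGG"], "ALT": ["G", "CT", "A"]})

# Any allele longer than one base marks the variant as an INDEL.
toy["VARIANT_TYPE"] = np.where(
    (toy["ALT"].str.len() > 1) | (toy["REF"].str.len() > 1), "INDEL", "SNP"
)
print(toy)
```

Because the rule looks only at allele length, a substitution of several bases of equal length (for example REF `AC`, ALT `GT`) would also be labelled `INDEL`.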


## Variant filtering

Remove variants that are not associated with the specified genes according to CADD VEP results. This can only be run once VEP results are retrieved and processed.

In [66]:
# Load VEP results

cadd_data_path = os.path.join(
    PROJECT_ROOT,
    "Data",
    "Processed",
    "Variant_consequences.csv",
)

cadd_data = pd.read_csv(cadd_data_path, sep=",")
cadd_data.head(5)

Unnamed: 0,CHROM,POS,REF,ALT,TYPE,CONSEQUENCE_CLASSIFICATION,CONSEQUENCE,GENE,ID
0,13,110148882,C,CT,INS,downstream,downstream,COL4A1,110148882_CT_C
1,13,110148891,C,G,SNV,downstream,downstream,COL4A1,110148891_G_C
2,13,110148917,C,G,SNV,downstream,downstream,COL4A1,110148917_G_C
3,13,110148920,G,C,SNV,downstream,downstream,COL4A1,110148920_C_G
4,13,110148959,A,G,SNV,downstream,downstream,COL4A1,110148959_G_A


In [67]:
# Get a list of all variants in gene

variant_ids = list(cadd_data.ID.values)

In [68]:
# Remove variants if they are not in the above list

ih_allele_counts_filtered = ih_allele_counts_grouped[ih_allele_counts_grouped.ID.isin(variant_ids)]
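
The filter keeps only rows whose `ID` appears in the consequence table. The sketch below uses made-up IDs and also lists the IDs that the filter drops, which can help when checking the result.

```python
import pandas as pd

counts = pd.DataFrame(
    {"ID": ["100_G_C", "200_T_A", "300_A_G"], "IH_ALT_CTS": [1, 0, 5]}
)
consequences = pd.DataFrame({"ID": ["100_G_C", "300_A_G"]})

# Collect the IDs to keep, then split the rows into kept and dropped.
keep_ids = set(consequences["ID"])
filtered = counts[counts["ID"].isin(keep_ids)]
dropped = counts.loc[~counts["ID"].isin(keep_ids), "ID"].tolist()
print(dropped)
```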

## Save African in-house allele count data to a CSV file

In [69]:
ih_allele_counts_filtered.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    ),
    index=False,
)

## Prepare data in suitable format for Fisher's Tests

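To test for significant differences in allele frequency between population regions with Fisher's exact tests, the in-house allele count data needs to be reshaped so that each variant has a single row, with the alternate and reference allele counts for each region in separate columns.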

In [70]:
# Generate a list of all unique regions in ih_allele_counts_filtered dataframe
inhouse_populations = [region for region in set(ih_allele_counts_filtered.REG.values)]

# Aggregate allele counts by region

agg_functions = {"IH_ALT_CTS":"sum", "IH_TOTAL_CTS":"sum", "IH_REF_CTS":"sum"}

ih_allele_counts_regions = ih_allele_counts_filtered.groupby(by=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS", "REG"]).aggregate(agg_functions).reset_index()

# Pivot data
ih_allele_counts_pivot = ih_allele_counts_regions.pivot(
    index=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS"],
    columns="REG",
    values=["IH_ALT_CTS", "IH_REF_CTS"],
)

# Separate alternate and reference count data into different dataframes to facilitate renaming of count columns appropriately
ih_data_alt = (
    ih_allele_counts_pivot[["IH_ALT_CTS"]].droplevel(level=0, axis=1).reset_index()
)

ih_data_ref = (
    ih_allele_counts_pivot[["IH_REF_CTS"]].droplevel(level=0, axis=1).reset_index()
)

# Add appropriate prefixes to alt and ref columns
ih_data_alt = functions.add_prefix_dataframe_col_names(
    ih_data_alt, inhouse_populations, "ALT_CT_IH_"
)

ih_data_ref = functions.add_prefix_dataframe_col_names(
    ih_data_ref, inhouse_populations, "REF_CT_IH_"
)

# Merge renamed alternate and reference count data
ih_recent_afr = ih_data_alt.merge(
    ih_data_ref, on=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS"]
)

# Replace missing count values with 0
ih_recent_afr = ih_recent_afr.replace(np.nan, 0)

ih_recent_afr.head()

REG,VAR_NAME,ID,REF,ALT,GENE,POS,ALT_CT_IH_CA,ALT_CT_IH_EA,ALT_CT_IH_Recent African,ALT_CT_IH_SA,ALT_CT_IH_WA,REF_CT_IH_CA,REF_CT_IH_EA,REF_CT_IH_Recent African,REF_CT_IH_SA,REF_CT_IH_WA
0,chr11:34438836T-C,34438836_C_T,T,C,CAT,34438836,0,1,1,0,0,68,203,1219,28,920
1,chr11:34438889G-C,34438889_C_G,G,C,CAT,34438889,0,0,1,1,0,68,204,1219,27,920
2,chr11:34438910C-T,34438910_T_C,C,T,CAT,34438910,0,0,0,0,0,68,204,1220,28,920
3,chr11:34439179A-G,34439179_G_A,A,G,CAT,34439179,0,0,0,0,0,68,204,1220,28,920
4,chr11:34439188C-G,34439188_G_C,C,G,CAT,34439188,0,0,0,0,0,68,204,1220,28,920
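
Each row of this table supplies the cells of a 2x2 contingency table for any pair of regions. The sketch below shows how such a table feeds a two-sided Fisher's exact test, using the Eastern and Western Africa counts of variant `34438836_C_T` from the first row above. The function `fisher_exact_two_sided` is a hypothetical pure-Python helper written for illustration, not the routine used in this project. In practice a library function such as `scipy.stats.fisher_exact` would likely be used instead.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x alternate alleles in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Rows are regions (EA, WA); columns are alternate and reference allele counts.
p_value = fisher_exact_two_sided(1, 203, 0, 920)
print(round(p_value, 4))
```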


## Save Fisher's Test data to a CSV file

In [71]:
ih_recent_afr.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts_fishers.csv",
    ),
    index=False,
)