# Preparation of variant count data for each African population group

Data on genetic variation in African population groups was generated in-house by processing genomic data from the [gnomAD 1000 Genomes and HGDP datasets](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/) through a [Snakemake bioinformatics pipeline](https://github.com/Tuks-ICMM/Pharmacogenetic-Analysis-Pipeline). The generated data was stored in `Data\Raw\SUB\{gene_name}_Count.csv` files, where `{gene_name}` is the name of a specific gene. Each file contains the data on variants identified in that gene. The files include information on:

* Genetic variant names: These are identifiers for specific genetic differences (variants) in a population.
* Variant positions in the genome: This tells us where these genetic variations are located in the genetic code.
* Genetic alleles: An allele is one of the alternative forms of the DNA sequence that can exist at a particular genomic position. Each variant has two alleles: the normal form, known as the reference allele (REF), and the altered form, known as the alternate allele (ALT). Together, these alleles define the genetic variation at a given position.
* Total copies of each variant alternate allele in the population: The number of times each variant's alternate allele (ALT) appears across all the samples of a population.
* Total copies of both variant alternate and reference alleles in the population: The overall number of alleles (REF and ALT combined) counted at the variant's position across all the samples of a population.
* Sample subpopulation group: The ethnolinguistic classification of the African population from which the genetic data sample originated.

The `Data\Raw\SUB\{gene_name}_Count.csv` data was prepared for further analysis by performing the following steps:

1. The data for all genes was merged into a single dataset.
2. The merged data was melted into a long format, with one row per variant and population group.
3. A unique ID was assigned to each variant, as some variants did not have unique names.
4. Duplicate variant entries were removed.
5. Irrelevant data was removed.
6. Additional features were added, such as the total count of non-variant (reference) alleles and regional information on the African ethnolinguistic/subpopulation groups.


## Imports

Notebook setup

In [147]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import pandas as pd
import numpy as np
import Utils.constants as constants
import Utils.functions as functions

Import in-house African variant data 

In [148]:
# Import CSVs with variants identified in-house in African populations for genes of interest.

variants = pd.DataFrame()

genes = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "locations.csv",
    )
).location_name

for gene in genes:
    gene_variant_count_path = os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "SUB",
        "{}_Count.csv".format(gene),
    )

    # Read the variant data for this gene (if the file exists) and append it to the combined dataframe
    gene_variant_df = pd.DataFrame()
    if os.path.exists(gene_variant_count_path):
        gene_variant_df = pd.read_csv(gene_variant_count_path, sep=",").rename(
            columns={"ID": "VAR_NAME"}
        )
        gene_variant_df["GENE"] = gene
    # errors="ignore" avoids a KeyError when a gene file is missing and the column is absent
    variants = pd.concat([variants, gene_variant_df]).drop(
        columns="Unnamed: 0", errors="ignore"
    )

variants.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,ACB_ac,ACB_tc,GWD_ac,GWD_tc,ESN_ac,ESN_tc,...,BantuSouthAfrica_tc,BantuKenya_ac,BantuKenya_tc,YRI_ac,YRI_tc,LWK_ac,LWK_tc,ASW_ac,ASW_tc,GENE
0,chr13:110148882C-CT,110148882,C,CT,0,228,0,352,0,296,...,16,0,24,0,350,0,194,0,142,COL4A1
1,rs552586867,110148891,C,G,0,228,0,352,0,296,...,16,0,24,2,350,0,194,0,142,COL4A1
2,rs59409892,110148917,C,G,26,228,38,352,25,296,...,16,3,24,35,350,13,194,14,142,COL4A1
3,rs535182970,110148920,G,C,1,228,0,352,0,296,...,16,0,24,0,350,0,194,0,142,COL4A1
4,rs56406633,110148959,A,G,1,228,0,352,0,296,...,16,0,24,0,350,0,194,0,142,COL4A1


In the dataframe above:

`ALT` is the alternate (variant) allele at a genomic position.

`REF` is the reference allele at that position, i.e. the normal form of the sequence.

The `{population_group}_ac` columns hold the alternate allele count ("ac"): the total number of copies of the variant's ALT allele observed in that population. Here, `{population_group}` is the name of the population group or subpopulation, for example `ACB_ac`.

The `{population_group}_tc` columns hold the total count ("tc"): the number of copies of all alleles, both ALT and REF, counted at the variant's position in the same population, for example `ACB_tc`.
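
As a small illustration of this naming convention, the sketch below splits a count column name into its population label and count type. The helper `split_count_column` is hypothetical and not part of the project code; it assumes the text after the last underscore is always `ac` or `tc`.

```python
def split_count_column(column_name):
    """Split a column such as 'ACB_ac' into ('ACB', 'ac').

    Hypothetical helper: assumes the text after the last underscore
    is the count type ('ac' or 'tc').
    """
    population, count_type = column_name.rsplit("_", 1)
    if count_type not in ("ac", "tc"):
        raise ValueError(f"Unexpected count suffix in {column_name!r}")
    return population, count_type


# Column names taken from the dataframe above
columns = ["ACB_ac", "ACB_tc", "BantuSouthAfrica_tc", "YRI_ac"]
parsed = [split_count_column(name) for name in columns]
print(parsed)
```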

## Format data

Reshape the data from wide to long format, so that each row holds the alternate and total allele counts of one variant in one population group.

In [149]:
# Separate total count and alternate count information
alt_ct_columns = variants.filter(regex="_ac|VAR_NAME|POS|REF|ALT|GENE")
total_ct_columns = variants.filter(regex="_tc|VAR_NAME|POS|REF|ALT|GENE")

# Melt information
alt_ct_columns = alt_ct_columns.melt(
    id_vars=["VAR_NAME", "POS", "REF", "ALT", "GENE"],
    var_name="SUB_POP",
    value_name="IH_ALT_CTS",
)
total_ct_columns = total_ct_columns.melt(
    id_vars=["VAR_NAME", "POS", "REF", "ALT", "GENE"],
    var_name="SUB_POP",
    value_name="IH_TOTAL_CTS",
)

# Remove the count-type suffix (the text after the last underscore) from the SUB_POP column
alt_ct_columns["SUB_POP"] = alt_ct_columns["SUB_POP"].str.rsplit("_", n=1).str.get(0)
total_ct_columns["SUB_POP"] = (
    total_ct_columns["SUB_POP"].str.rsplit("_", n=1).str.get(0)
)

# Combine formatted information

ih_allele_counts = pd.merge(
    alt_ct_columns,
    total_ct_columns,
    on=["VAR_NAME", "POS", "REF", "ALT", "GENE", "SUB_POP"],
)
ih_allele_counts.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS
0,chr13:110148882C-CT,110148882,C,CT,COL4A1,ACB,0,228
1,rs552586867,110148891,C,G,COL4A1,ACB,0,228
2,rs59409892,110148917,C,G,COL4A1,ACB,26,228
3,rs535182970,110148920,G,C,COL4A1,ACB,1,228
4,rs56406633,110148959,A,G,COL4A1,ACB,1,228
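
The reshaping performed in this cell can be followed more easily on a toy table. The sketch below is a simplified stand-in, not the notebook's code: it uses only `VAR_NAME` as the identifier column, two populations, and a hypothetical helper `melt_counts`. The counts are copied from the wide output shown earlier.

```python
import pandas as pd

# Toy wide table: two variants, two populations (values copied from the wide output)
wide = pd.DataFrame(
    {
        "VAR_NAME": ["rs59409892", "rs535182970"],
        "ACB_ac": [26, 1],
        "ACB_tc": [228, 228],
        "GWD_ac": [38, 0],
        "GWD_tc": [352, 352],
    }
)


def melt_counts(frame, suffix, value_name):
    """Melt the columns ending in `suffix` into one row per variant and population."""
    count_cols = [c for c in frame.columns if c.endswith(suffix)]
    long = frame.melt(
        id_vars=["VAR_NAME"],
        value_vars=count_cols,
        var_name="SUB_POP",
        value_name=value_name,
    )
    # Strip the suffix so both long tables share the same population labels
    long["SUB_POP"] = long["SUB_POP"].str[: -len(suffix)]
    return long


# Join the two long tables on the variant and population keys
long_counts = melt_counts(wide, "_ac", "IH_ALT_CTS").merge(
    melt_counts(wide, "_tc", "IH_TOTAL_CTS"), on=["VAR_NAME", "SUB_POP"]
)
print(long_counts)
```

Each variant now appears once per population, with its alternate and total counts side by side.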


Rename subpopulations 

In [150]:
ih_allele_counts = ih_allele_counts.replace({"SUB_POP": constants.SUBPOP_RENAME})

## Assign a unique ID to each variant

Some variants do not have unique names. This will complicate downstream analysis of the data. Add a column with a unique ID for each variant to rectify this.

In [151]:
ih_allele_counts["ID"] = (
    ih_allele_counts[["POS", "ALT", "REF"]].astype("str").agg("_".join, axis=1)
)

ih_allele_counts.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID
0,chr13:110148882C-CT,110148882,C,CT,COL4A1,African Caribbean,0,228,110148882_CT_C
1,rs552586867,110148891,C,G,COL4A1,African Caribbean,0,228,110148891_G_C
2,rs59409892,110148917,C,G,COL4A1,African Caribbean,26,228,110148917_G_C
3,rs535182970,110148920,G,C,COL4A1,African Caribbean,1,228,110148920_C_G
4,rs56406633,110148959,A,G,COL4A1,African Caribbean,1,228,110148959_G_A


## Drop duplicate entries

In [152]:
# A variant should appear only once per subpopulation
ih_allele_counts = ih_allele_counts.drop_duplicates(subset=["ID", "SUB_POP"])
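
To see how the unique ID and the duplicate removal interact, consider the toy example below. The rows are hypothetical: the first two are exact repeats of one variant in one population, while the third shares the same name but has a different ALT allele, so it receives a different ID and is kept.

```python
import pandas as pd

# Hypothetical rows, not taken from the real data
toy = pd.DataFrame(
    {
        "VAR_NAME": ["var_a", "var_a", "var_a"],
        "POS": [100, 100, 100],
        "REF": ["C", "C", "C"],
        "ALT": ["G", "G", "T"],
        "SUB_POP": ["Mandinka", "Mandinka", "Mandinka"],
    }
)

# Same construction as in the notebook: POS_ALT_REF
toy["ID"] = toy[["POS", "ALT", "REF"]].astype("str").agg("_".join, axis=1)

# Keep the first row for each (variant ID, subpopulation) pair
deduplicated = toy.drop_duplicates(subset=["ID", "SUB_POP"])
print(deduplicated[["VAR_NAME", "ID", "SUB_POP"]])
```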

## Remove irrelevant data

Drop African American and African Caribbean data

In [153]:
ih_allele_counts_filtered = ih_allele_counts[
    ~(
        (ih_allele_counts.SUB_POP == "African American")
        | (ih_allele_counts.SUB_POP == "African Caribbean")
    )
]

Remove rows where the alternate allele count is 0, since these variants were not observed in that particular population.

In [154]:
ih_allele_counts_filtered = ih_allele_counts_filtered[
    ~(ih_allele_counts_filtered.IH_ALT_CTS == 0.0)
]

## Add additional data features

##### Correct total allele counts

In the data above, the total count of all alleles (REF and ALT) is included as `IH_TOTAL_CTS`. Some samples, from which this data was generated, lacked variant ALT data for specific positions in the genome. The total count was calculated only using the available data, which might make the variant frequency estimates inaccurate for a population.

To improve accuracy, we can assume that if data is missing for a certain variant ALT allele at a position, that variant ALT allele does not exist in those samples, and can thus be given a count of 0. Based on this assumption, we can correct the total allele count by considering the missing data as non-variants (REF alleles only). This correction helps us get more reliable estimates of the variant frequency for the population.

In simple terms, the total allele counts `IH_TOTAL_CTS` were adjusted to account for missing information, assuming that if data is missing, it means the variant doesn't exist. This will give us a more accurate idea of how common the variants are in the population.

To perform this correction, the following steps were followed:

1. For each subpopulation, the total number of individuals was calculated.
2. The number of individuals was multiplied by 2, because each individual carries two copies of each allele (one from each parent), each in either the variant (ALT) or non-variant (REF) form. This value is the corrected `IH_TOTAL_CTS` value, `CORR_IH_TOTAL_CTS`.
3. The alternate allele count `IH_ALT_CTS` was subtracted from `CORR_IH_TOTAL_CTS` to get the corrected count of non-variant alleles, `CORR_IH_REF_CTS`.
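
The arithmetic in these steps can be illustrated with a minimal sketch. The function `corrected_counts` is hypothetical, and the sample size of 176 individuals is an assumption chosen because it reproduces the Mandinka corrected total of 352 shown in the output further down.

```python
def corrected_counts(n_individuals, alt_count):
    """Return (corrected total allele count, corrected REF allele count).

    Assumes diploid samples, i.e. two alleles per individual.
    """
    corrected_total = n_individuals * 2
    corrected_ref = corrected_total - alt_count
    return corrected_total, corrected_ref


# Assumed sample size (176) chosen to match the Mandinka total of 352;
# 38 is the Mandinka alternate allele count for rs59409892
total, ref = corrected_counts(n_individuals=176, alt_count=38)
print(total, ref)  # 352 314
```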

In [155]:
# Import sample and population data
sample_subpopulations = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Sample_populations.csv",
    )
)

# Group samples by population to get the number of samples (individuals) per population

grouped_sample_subpopulations = (
    functions.group_and_count(sample_subpopulations, ["SUB"])
    .reset_index()
    .rename(columns={"SAMPLE_NAME": "SAMPLE_COUNT"})
    .drop(columns="REG")
)

# Calculate the total allele count per subpopulation
grouped_sample_subpopulations["CORR_IH_TOTAL_CTS"] = (
    grouped_sample_subpopulations.SAMPLE_COUNT * 2
)

# Add total allele count to allele count dataframe
ih_allele_counts_filtered = (
    ih_allele_counts_filtered.merge(
        grouped_sample_subpopulations, how="left", left_on="SUB_POP", right_on="SUB"
    )
    .reset_index()
    .drop(columns=["SUB", "SAMPLE_COUNT", "index", "Unnamed: 0"])
)

ih_allele_counts_filtered.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS
0,rs59409892,110148917,C,G,COL4A1,Mandinka,38,352,110148917_G_C,352.0
1,rs139916479,110149349,G,A,COL4A1,Mandinka,2,352,110149349_A_G,352.0
2,rs552877576,110149494,C,T,COL4A1,Mandinka,2,352,110149494_T_C,352.0
3,chr13:110149646CTTTAT-C,110149646,CTTTAT,C,COL4A1,Mandinka,2,352,110149646_C_CTTTAT,352.0
4,rs13260,110149776,G,T,COL4A1,Mandinka,71,352,110149776_T_G,352.0


##### Calculate the corrected reference allele counts

In [156]:
# Calculate corrected reference allele counts
ih_allele_counts_filtered["CORR_IH_REF_CTS"] = (
    ih_allele_counts_filtered["CORR_IH_TOTAL_CTS"]
    - ih_allele_counts_filtered["IH_ALT_CTS"]
)

ih_allele_counts_filtered

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS
0,rs59409892,110148917,C,G,COL4A1,Mandinka,38,352,110148917_G_C,352.0,314.0
1,rs139916479,110149349,G,A,COL4A1,Mandinka,2,352,110149349_A_G,352.0,350.0
2,rs552877576,110149494,C,T,COL4A1,Mandinka,2,352,110149494_T_C,352.0,350.0
3,chr13:110149646CTTTAT-C,110149646,CTTTAT,C,COL4A1,Mandinka,2,352,110149646_C_CTTTAT,352.0,350.0
4,rs13260,110149776,G,T,COL4A1,Mandinka,71,352,110149776_T_G,352.0,281.0
...,...,...,...,...,...,...,...,...,...,...,...
45300,rs111859112,33029010,T,G,OLIG2,Luhya,2,194,33029010_G_T,194.0,192.0
45301,rs13046814,33029069,T,G,OLIG2,Luhya,10,194,33029069_G_T,194.0,184.0
45302,rs547410200,33029136,T,C,OLIG2,Luhya,1,194,33029136_C_T,194.0,193.0
45303,rs182058038,33029193,T,A,OLIG2,Luhya,1,194,33029193_A_T,194.0,193.0


##### Add regional information for each African subpopulation/ethnolinguistic group

Add information on the region (i.e., Southern Africa, Western Africa, Eastern Africa, Central Africa, America, or the Caribbean) from which each African subpopulation/ethnolinguistic group originates.

In [157]:
ih_allele_counts_filtered["REG"] = ih_allele_counts_filtered["SUB_POP"].map(
    constants.REGIONAL_CLASSIFICATION
)

ih_allele_counts_filtered.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS,REG
0,rs59409892,110148917,C,G,COL4A1,Mandinka,38,352,110148917_G_C,352.0,314.0,WA
1,rs139916479,110149349,G,A,COL4A1,Mandinka,2,352,110149349_A_G,352.0,350.0,WA
2,rs552877576,110149494,C,T,COL4A1,Mandinka,2,352,110149494_T_C,352.0,350.0,WA
3,chr13:110149646CTTTAT-C,110149646,CTTTAT,C,COL4A1,Mandinka,2,352,110149646_C_CTTTAT,352.0,350.0,WA
4,rs13260,110149776,G,T,COL4A1,Mandinka,71,352,110149776_T_G,352.0,281.0,WA


##### Add grouped African count information

Provide aggregated allele count information for Recent African populations. Recent African populations are defined as African populations currently residing on the African continent. This group excludes African American and African Caribbean populations.
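
The helper `functions.grouped_pop_allele_counts` used below is defined elsewhere in the project and is not reproduced in this notebook. As a rough sketch of the kind of aggregation involved, and assuming it amounts to summing alternate allele counts per variant, the hypothetical function below groups a toy table by the variant-level columns. The counts are the rs59409892 values for three subpopulations shown in the outputs of this notebook.

```python
import pandas as pd


def sum_alt_counts_per_variant(counts, group_label):
    """Hypothetical stand-in for an aggregation helper: sums ALT counts per variant."""
    variant_cols = ["VAR_NAME", "ID", "POS", "REF", "ALT", "GENE"]
    grouped = counts.groupby(variant_cols, as_index=False)["IH_ALT_CTS"].sum()
    # Label the aggregated rows with the name of the combined group
    grouped["REG"] = group_label
    return grouped


toy = pd.DataFrame(
    {
        "VAR_NAME": ["rs59409892"] * 3,
        "ID": ["110148917_G_C"] * 3,
        "POS": [110148917] * 3,
        "REF": ["C"] * 3,
        "ALT": ["G"] * 3,
        "GENE": ["COL4A1"] * 3,
        "SUB_POP": ["Mandinka", "Luhya", "Mende"],
        "IH_ALT_CTS": [38, 13, 18],
    }
)

recent_african = sum_alt_counts_per_variant(toy, "Recent African")
print(recent_african)
```

Because rows with an alternate allele count of 0 were removed earlier, summing the per-subpopulation totals in the same way would undercount the samples, which is why the corrected totals are recalculated from the sample counts instead.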

In [158]:
# Aggregate Recent African allele count data using a custom function

recent_africa_ct = functions.grouped_pop_allele_counts(
    ih_allele_counts_filtered.copy(), "Recent African"
)

# Recalculate CORR_IH_TOTAL_CTS and CORR_IH_REF_CTS for the Recent African group from the sample counts, because the aggregated values for these columns will not be correct.

recent_africa_ct["CORR_IH_TOTAL_CTS"] = (
    grouped_sample_subpopulations.SAMPLE_COUNT.sum() * 2
)
recent_africa_ct["CORR_IH_REF_CTS"] = (
    recent_africa_ct["CORR_IH_TOTAL_CTS"] - recent_africa_ct["IH_ALT_CTS"]
)


# Concatenate the Recent African allele count data with the subpopulation allele count data

ih_allele_counts_grouped = (
    pd.concat(
        [
            ih_allele_counts_filtered,
            recent_africa_ct,
        ]
    )
    .sort_values("ID")
    .reset_index(drop=True)
)

ih_allele_counts_grouped.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS,REG
0,rs552586867,110148891,C,G,COL4A1,,2,350,110148891_G_C,1608.0,1606.0,Recent African
1,rs552586867,110148891,C,G,COL4A1,1000G Yoruba,2,350,110148891_G_C,,,
2,rs59409892,110148917,C,G,COL4A1,Mandinka,38,352,110148917_G_C,352.0,314.0,WA
3,rs59409892,110148917,C,G,COL4A1,,152,1596,110148917_G_C,1608.0,1456.0,Recent African
4,rs59409892,110148917,C,G,COL4A1,Luhya,13,194,110148917_G_C,194.0,181.0,EA


Aggregate allele counts for the 1000G and HGDP Yoruban population

In [159]:
# Aggregate 1000G and HGDP Yoruban allele count data using a custom function

subpop_filters = (ih_allele_counts_grouped["SUB_POP"] == "1000G Yoruba") | (
    ih_allele_counts_grouped["SUB_POP"] == "HGDP Yoruba"
)

yoruban_ct = functions.grouped_pop_allele_counts(
    ih_allele_counts_grouped.copy(), "Yoruba", subpop_filters=subpop_filters
)

# Recalculate the corrected total and reference counts for the Yoruban population from the sample counts, because the aggregated values for these columns will not be correct.

yoruban_ct["CORR_IH_TOTAL_CTS"] = (
    sample_subpopulations[sample_subpopulations["SUB"] == "Yoruba"][
        "SAMPLE_NAME"
    ].count()
    * 2
)

yoruban_ct["CORR_IH_REF_CTS"] = (
    yoruban_ct["CORR_IH_TOTAL_CTS"] - yoruban_ct["IH_ALT_CTS"]
)

# Concatenate the Aggregated Yoruban allele count data with the subpopulation allele count data

ih_allele_counts_grouped = (
    pd.concat(
        [
            ih_allele_counts_grouped,
            yoruban_ct,
        ]
    )
    .sort_values("ID")
    .reset_index(drop=True)
)

# Replace the 1000G and HGDP Yoruban allele count data with the aggregated Yoruban allele count data

ih_allele_counts_grouped = ih_allele_counts_grouped[
    (
        (ih_allele_counts_grouped["SUB_POP"] != "1000G Yoruba")
        & (ih_allele_counts_grouped["SUB_POP"] != "HGDP Yoruba")
    )
]

ih_allele_counts_grouped.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS,REG
0,rs552586867,110148891,C,G,COL4A1,,2,350,110148891_G_C,1608.0,1606.0,Recent African
1,rs552586867,110148891,C,G,COL4A1,Yoruba,2,350,110148891_G_C,392.0,390.0,
3,rs59409892,110148917,C,G,COL4A1,Yoruba,40,392,110148917_G_C,392.0,352.0,
4,rs59409892,110148917,C,G,COL4A1,Mende,18,196,110148917_G_C,196.0,178.0,WA
6,rs59409892,110148917,C,G,COL4A1,Bantu South Africa,2,16,110148917_G_C,16.0,14.0,SA


##### Calculate allele frequencies

In [160]:
ih_allele_counts_grouped["CORR_IH_AF"] = (
    ih_allele_counts_grouped["IH_ALT_CTS"]
    / ih_allele_counts_grouped["CORR_IH_TOTAL_CTS"]
)

ih_allele_counts_grouped.head(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS,REG,CORR_IH_AF
0,rs552586867,110148891,C,G,COL4A1,,2,350,110148891_G_C,1608.0,1606.0,Recent African,0.001244
1,rs552586867,110148891,C,G,COL4A1,Yoruba,2,350,110148891_G_C,392.0,390.0,,0.005102
3,rs59409892,110148917,C,G,COL4A1,Yoruba,40,392,110148917_G_C,392.0,352.0,,0.102041
4,rs59409892,110148917,C,G,COL4A1,Mende,18,196,110148917_G_C,196.0,178.0,WA,0.091837
6,rs59409892,110148917,C,G,COL4A1,Bantu South Africa,2,16,110148917_G_C,16.0,14.0,SA,0.125
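
The frequency calculation is a simple ratio of the alternate allele count to the corrected total allele count. The short sketch below, using a hypothetical `allele_frequency` function, reproduces two of the values shown in the output above.

```python
def allele_frequency(alt_count, corrected_total):
    """Alternate allele frequency = ALT allele count / corrected total allele count."""
    if corrected_total <= 0:
        raise ValueError("corrected_total must be positive")
    return alt_count / corrected_total


# Values from the first rows of the output above
print(round(allele_frequency(2, 1608), 6))  # 0.001244
print(round(allele_frequency(40, 392), 6))  # 0.102041
```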


##### Classify variants as SNPs or INDELs

Variants can be classified as:

* Single nucleotide polymorphisms (SNPs) if they result in the exchange of one DNA base pair for another 

**OR**

* Insertion/deletions (INDELs) if they result in the deletion or insertion of one or more DNA base pairs

Adding these variant classifications may prove useful for further analysis of the variants.
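
The classification rule can be expressed per variant as a length check on the two alleles. The function `classify_variant` below is a hypothetical, row-level sketch of that rule, applied to alleles of variants shown earlier in this notebook.

```python
def classify_variant(ref, alt):
    """Label a variant 'SNP' when both alleles are one base long, otherwise 'INDEL'."""
    return "SNP" if len(ref) == 1 and len(alt) == 1 else "INDEL"


# Alleles taken from variants shown earlier in this notebook
print(classify_variant("C", "G"))       # SNP
print(classify_variant("C", "CT"))      # INDEL (insertion)
print(classify_variant("CTTTAT", "C"))  # INDEL (deletion)
```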

In [161]:
ih_allele_counts_grouped["VARIANT_TYPE"] = np.where(
    (ih_allele_counts_grouped.ALT.str.len() > 1)
    | (ih_allele_counts_grouped.REF.str.len() > 1),
    "INDEL",
    "SNP",
)

ih_allele_counts_grouped.tail(5)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS,ID,CORR_IH_TOTAL_CTS,CORR_IH_REF_CTS,REG,CORR_IH_AF,VARIANT_TYPE
60249,rs143171553,48256362,G,A,CARD8,Mandinka,3,352,48256362_A_G,352.0,349.0,WA,0.008523,SNP
60250,rs143171553,48256362,G,A,CARD8,Bantu Kenya,1,22,48256362_A_G,24.0,23.0,EA,0.041667,SNP
60251,rs143171553,48256362,G,A,CARD8,Mandenka,1,42,48256362_A_G,46.0,45.0,WA,0.021739,SNP
60253,rs143171553,48256362,G,A,CARD8,Mende,8,194,48256362_A_G,196.0,188.0,WA,0.040816,SNP
60254,rs143171553,48256362,G,A,CARD8,Yoruba,3,350,48256362_A_G,392.0,389.0,,0.007653,SNP


## Save African in-house allele count data to a CSV file

In [162]:
ih_allele_counts_grouped.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    )
)

## Prepare data in suitable format for Fisher's Tests

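The in-house allele count data needs to be reshaped before Fisher's exact tests can be used to test for significant differences in allele frequency between population regions. The target format has one row per variant, with separate alternate and corrected reference allele count columns for each region.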
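
To make the purpose of this format concrete, the sketch below shows how a 2x2 table of alternate and reference allele counts for two groups feeds into a two-sided Fisher's exact test. It is a minimal standard-library implementation for illustration only, not the project's test code; in practice a library routine such as `scipy.stats.fisher_exact` would normally be used. The function name and the choice of groups are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one
    p_value = sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(p_value, 1.0)


# Rows are groups, columns are (ALT count, corrected REF count).
# Counts for rs59409892 from the outputs above: Luhya 13 / 181, Mandinka 38 / 314.
p_value = fisher_exact_two_sided(13, 181, 38, 314)
print(f"p = {p_value:.4f}")
```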

In [163]:
sample_subpopulations

Unnamed: 0.1,Unnamed: 0,SUB,SAMPLE_NAME,REG
0,0,Mandinka,HG02461,WA
1,1,Mandinka,HG02462,WA
2,2,Mandinka,HG02463,WA
3,3,Mandinka,HG02464,WA
4,4,Mandinka,HG02465,WA
...,...,...,...,...
799,799,Luhya,NA19475,EA
800,800,Mandenka,SS6004470,WA
801,801,Mbuti Pygmy,SS6004471,CA
802,802,San,SS6004473,SA


In [164]:
sample_subpopulations[sample_subpopulations["REG"] == "CA"]["SAMPLE_NAME"].count() * 2

80

In [165]:
# Generate a list of all unique regions in ih_allele_counts_grouped dataframe
inhouse_populations = [region for region in set(ih_allele_counts_grouped.REG.values)]

# Aggregate allele counts by region

ih_allele_counts_regions = (
    ih_allele_counts_grouped.groupby(
        by=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS", "REG"]
    )
    .sum(numeric_only=True)
    .reset_index()
)

# Pivot data
ih_allele_counts_pivot = ih_allele_counts_regions.pivot(
    index=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS"],
    columns="REG",
    values=["IH_ALT_CTS", "CORR_IH_REF_CTS"],
)

# Separate alternate and reference count data into different dataframes to facilitate renaming of count columns appropriately
ih_data_alt = (
    ih_allele_counts_pivot[["IH_ALT_CTS"]].droplevel(level=0, axis=1).reset_index()
)

ih_data_corr_ref = (
    ih_allele_counts_pivot[["CORR_IH_REF_CTS"]].droplevel(level=0, axis=1).reset_index()
)

# Add appropriate prefixes to alt and ref columns
ih_data_alt = functions.add_prefix_dataframe_col_names(
    ih_data_alt, inhouse_populations, "ALT_CT_IH_"
)

ih_data_corr_ref = functions.add_prefix_dataframe_col_names(
    ih_data_corr_ref, inhouse_populations, "CORR_REF_CT_IH_"
)

# Merge renamed alternate and reference count data
ih_recent_afr = ih_data_alt.merge(
    ih_data_corr_ref, on=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS"]
)

# Replace missing alternate allele counts with 0. A missing count means the variant was not observed in that region.

ih_recent_afr = ih_recent_afr.replace(np.nan, 0)

# Recalculate the corrected reference counts for each region as (2 x number of samples in the region) minus the alternate allele count. The aggregated values for these columns will not be correct.

for region in ["CA", "SA", "EA", "WA"]:
    region_total_alleles = (
        sample_subpopulations[sample_subpopulations["REG"] == region][
            "SAMPLE_NAME"
        ].count()
        * 2
    )
    ih_recent_afr["CORR_REF_CT_IH_" + region] = (
        region_total_alleles - ih_recent_afr["ALT_CT_IH_" + region]
    )

ih_recent_afr.head(5)

REG,VAR_NAME,ID,REF,ALT,GENE,POS,ALT_CT_IH_CA,ALT_CT_IH_EA,ALT_CT_IH_Recent African,ALT_CT_IH_SA,ALT_CT_IH_WA,CORR_REF_CT_IH_CA,CORR_REF_CT_IH_EA,CORR_REF_CT_IH_Recent African,CORR_REF_CT_IH_SA,CORR_REF_CT_IH_WA
0,chr11:34438836T-C,34438836_C_T,T,C,CAT,34438836,0.0,1.0,1.0,0.0,0.0,80.0,217.0,1607.0,28.0,1282.0
1,chr11:34438889G-C,34438889_C_G,G,C,CAT,34438889,0.0,0.0,1.0,1.0,0.0,80.0,218.0,1607.0,27.0,1282.0
2,chr11:34439223G-C,34439223_C_G,G,C,CAT,34439223,0.0,0.0,2.0,0.0,2.0,80.0,218.0,1606.0,28.0,1280.0
3,chr11:34439262C-T,34439262_T_C,C,T,CAT,34439262,0.0,0.0,1.0,0.0,0.0,80.0,218.0,1607.0,28.0,1282.0
4,chr11:34439312G-A,34439312_A_G,G,A,CAT,34439312,0.0,0.0,1.0,1.0,0.0,80.0,218.0,1607.0,27.0,1282.0


## Save Fisher's Test data to a CSV file

In [166]:
ih_recent_afr.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts_fishers.csv",
    )
)