# Data Preparation

This notebook contains code to process the raw data and metadata stored in the `Data\Raw` and `Metadata` folders in preparation for analysis.

## 1. Imports

In [26]:
# Import system packages
import os
import sys
from dotenv import load_dotenv
load_dotenv()

# Set root directory to system path
PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

# Import constants, functions and data analysis packages
import Utils.constants as constants
import Utils.functions as functions
import pandas as pd
import numpy as np

# Suppress warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## 2. Preparation of sample data

For this study a "sample" is defined as an individual from whom genomic data has been collected.

The `Metadata/samples.csv` file contains the following information on each sample:
* The unique name of the sample.
* The dataset to which each sample belongs, namely, [GnomAD 1000 Genomes or HGDP](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/).
* The African ethnolinguistic classification of each sample, also known as the subpopulation group. The different African ethnolinguistic classifications/subpopulation groups are: 

    * Mandinka
    * Esan
    * Mende
    * Mbuti Pygmy
    * Biaka Pygmy
    * Mandenka
    * Yoruba (HGDP and 1000G)
    * San
    * Bantu South Africa
    * Luhya
* The superpopulation group to which the sample belongs.

### Data loading

Import sample metadata from the `Metadata/samples.csv` file.

In [27]:
sample_metadata = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "samples.csv",
    )
).replace({"SUB": constants.SUBPOP_RENAME})

sample_metadata.head(2)

Unnamed: 0,sample_name,dataset,SUPER,SUB
0,HG02461,GnomAD,AFR,Mandinka
1,HG02462,GnomAD,AFR,Mandinka


### Feature selection

Select the sample name and subpopulation features for further analysis. Rename features to ensure consistency.

In [28]:
sample_subpopulations = sample_metadata[["SUB", "sample_name"]].rename(
    columns={"sample_name": "SAMPLE_NAME"}
)

Add a new feature containing data on the regional classification of a sample.
The different regional groupings are: 

* SA: Southern Africa
* WA: Western Africa
* CA: Central Africa
* EA: Eastern Africa

In [29]:
sample_subpopulations["REG"] = sample_subpopulations["SUB"].map(
    constants.REGIONAL_CLASSIFICATION
)
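
`Series.map` returns `NaN` for any label that is missing from the mapping, so a quick check for unmapped subpopulations can catch spelling mismatches early. The sketch below uses a hypothetical three-entry mapping and toy sample labels; the real mapping is `constants.REGIONAL_CLASSIFICATION`, whose contents are not shown in this notebook.

```python
import pandas as pd

# Hypothetical mapping for illustration only; the real one lives in Utils/constants.py
regional_classification = {"Mandinka": "WA", "San": "SA", "Luhya": "EA"}

# Toy sample table with one label that is absent from the mapping
samples = pd.DataFrame({"SUB": ["Mandinka", "San", "Unknown group"]})
samples["REG"] = samples["SUB"].map(regional_classification)

# Labels that are absent from the mapping come back as NaN
unmapped = sorted(samples.loc[samples["REG"].isna(), "SUB"].unique())
print(unmapped)
```

An empty list means that every subpopulation received a region.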

### View and save prepared data

In [30]:
sample_subpopulations.head(2)

Unnamed: 0,SUB,SAMPLE_NAME,REG
0,Mandinka,HG02461,WA
1,Mandinka,HG02462,WA


In [31]:
sample_subpopulations.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Sample_populations.csv",
    ),
    index=False,
)

## 3. Preparation of African variant count data

Data on genetic variants found in African population groups was generated in-house by processing genomic data obtained from [GnomAD 1000 Genomes and HGDP datasets](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/) through a [Snakemake bioinformatics pipeline](https://github.com/Tuks-ICMM/Pharmacogenetic-Analysis-Pipeline). The generated data was stored in `Data\Raw\SUB\{gene_name}_Count.csv` files, each of which contains data on the variants identified in a particular gene. The files include information on:

* Chromosome: The identifier of the chromosome in which the variant is located. 
* Genetic variant names: The name of the variant as per NCBI rsID or HGVS nomenclature.
* Position: The nucleotide position at which the genetic variant is located within the genome.
* Alternate and reference alleles: An allele refers to the different forms of a specific variant that can exist at a particular genetic position. For each variant, there are two alleles, the normal form known as the reference allele (REF), and the altered form known as the alternate allele (ALT). These alleles define the genetic variation at a given position.
* Alternate allele count: The number of times a variant's alternate allele (ALT) appears across all the samples of a population.
* Total count: The total number of alleles (both REF and ALT) observed for a variant across all the samples of a population.
* Gene: The gene in which the variant is located.
* Sample subpopulation group: The African ethnolinguistic subpopulation from which the genetic data sample originated.

### Data loading

In [32]:
# Initialise an empty DataFrame to store variant data
variants = pd.DataFrame()

# Read the list of gene names from the locations.csv file
genes = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "locations.csv",
    )
).location_name

# Iterate over each gene in the list of genes
for gene in genes:
    # Construct the file path for the variant count data of the current gene
    gene_variant_count_path = os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "SUB",
        "{}_Count.csv".format(gene),
    )

    # Initialize an empty DataFrame to store the variant data for the current gene
    gene_variant_df = pd.DataFrame()

    # Check if the variant count file exists for the current gene
    if os.path.exists(gene_variant_count_path):
        # Read the variant data from the CSV file
        gene_variant_df = (
            pd.read_csv(gene_variant_count_path, sep=",")
            .rename(columns={"ID": "VAR_NAME"})  # Rename the 'ID' column to 'VAR_NAME'
        )
        # Add a new column 'GENE' to store the current gene name
        gene_variant_df["GENE"] = gene

    # Append the current gene's variant data to the main variants DataFrame
    variants = pd.concat([variants, gene_variant_df])

# Display the first two rows of the combined variants DataFrame
variants.head(2)

Unnamed: 0,CHROM,VAR_NAME,REF,ALT,GWD_ac,GWD_tc,ESN_ac,ESN_tc,MSL_ac,MSL_tc,...,BantuSouthAfrica_ac,BantuSouthAfrica_tc,BantuKenya_ac,BantuKenya_tc,YRI_ac,YRI_tc,LWK_ac,LWK_tc,GENE,POS
0,13,chr13:110148882C-CT,C,CT,0,232,0,206,0,166,...,0,16,0,20,0,234,0,184,COL4A1,110148882
1,13,rs552586867,C,G,0,232,0,206,0,166,...,0,16,0,20,1,234,0,184,COL4A1,110148891


In the dataframe above:

`ALT` represents the alternate form of a specific genetic variant.

`REF` represents the normal form of a specific genetic variant.

The total copies of the alternate form of a variant for a population are represented as `{population_group}_ac`, where `{population_group}` is replaced with the name of the population group or subpopulation. The `_ac` suffix stands for "alternate allele count".

The total copies of both forms of a variant for the same population are represented as `{population_group}_tc`, where `{population_group}` is again replaced with the name of the population group. The `_tc` suffix stands for "total count".

### Reshape data

Reshape the data for each subpopulation group into a long format with the data for each subpopulation group in a new row. 

In [33]:
# Separate total count and alternate count information

# Select columns related to alternate counts along with identifier columns
alt_ct_columns = variants.filter(regex="_ac|VAR_NAME|POS|REF|ALT|GENE")

# Select columns related to total counts along with identifier columns
total_ct_columns = variants.filter(regex="_tc|VAR_NAME|POS|REF|ALT|GENE")

# Melt information

# Reshape the alternate counts dataframe from wide to long format
# 'id_vars' are the identifier columns that remain the same
# 'var_name' is the new column name for the melted variable names
# 'value_name' is the new column name for the melted values
alt_ct_columns = alt_ct_columns.melt(
    id_vars=["VAR_NAME", "POS", "REF", "ALT", "GENE"],
    var_name="SUB_POP",
    value_name="IH_ALT_CTS",
)

# Reshape the total counts dataframe from wide to long format
total_ct_columns = total_ct_columns.melt(
    id_vars=["VAR_NAME", "POS", "REF", "ALT", "GENE"],
    var_name="SUB_POP",
    value_name="IH_TOTAL_CTS",
)

# Remove information after underscore in SUB_POP column

# For the alternate counts, remove the suffix after the last underscore in the 'SUB_POP' column
# (n=1 splits once from the right, so underscores within a population name are kept)
alt_ct_columns["SUB_POP"] = alt_ct_columns["SUB_POP"].str.rsplit("_", n=1).str.get(0)

# For the total counts, remove the suffix after the last underscore in the 'SUB_POP' column
total_ct_columns["SUB_POP"] = total_ct_columns["SUB_POP"].str.rsplit("_", n=1).str.get(0)

# Combine formatted information

# Merge the alternate counts and total counts dataframes on common identifier columns
ih_allele_counts = pd.merge(
    alt_ct_columns,
    total_ct_columns,
    on=["VAR_NAME", "POS", "REF", "ALT", "GENE", "SUB_POP"],
)

# Rename subpopulations in the SUB_POP column
ih_allele_counts.replace({"SUB_POP": constants.SUBPOP_RENAME}, inplace=True)

# Display the first two rows of the combined ih_allele_counts dataframe
ih_allele_counts.head(2)

Unnamed: 0,VAR_NAME,POS,REF,ALT,GENE,SUB_POP,IH_ALT_CTS,IH_TOTAL_CTS
0,chr13:110148882C-CT,110148882,C,CT,COL4A1,Mandinka,0,232
1,rs552586867,110148891,C,G,COL4A1,Mandinka,0,232
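
The wide-to-long reshaping can be easier to follow on a toy table. The sketch below uses made-up variant names and counts, plus a hypothetical `Bantu_Kenya_ac` column, to show why splitting once from the right (`n=1`) keeps underscores inside a population label while still removing the `_ac` suffix.

```python
import pandas as pd

# Toy wide-format table; the column names mimic the `{population_group}_ac` pattern
wide = pd.DataFrame(
    {
        "VAR_NAME": ["rs1", "rs2"],
        "GWD_ac": [0, 3],
        "Bantu_Kenya_ac": [1, 0],  # hypothetical label containing an underscore
    }
)

# One row per variant and population group
long = wide.melt(id_vars=["VAR_NAME"], var_name="SUB_POP", value_name="IH_ALT_CTS")

# Split once from the right so that only the trailing "_ac" is removed
long["SUB_POP"] = long["SUB_POP"].str.rsplit("_", n=1).str.get(0)
print(long)
```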


Some variants have information on more than one alternate allele in a single row. Identify these variants and split the information into multiple rows accordingly. 

In [34]:
# Count the number of variants with information on more than one alternate allele in a single row
multiple_alt_allele_variants = ih_allele_counts[ih_allele_counts.ALT.str.contains(",")]
print('Before fixing the errors:')
print(multiple_alt_allele_variants.count())

# Remove these variants from the ih_allele_counts dataframe
ih_allele_counts = ih_allele_counts[~ih_allele_counts.ALT.str.contains(",")]

# Collect the split rows in a list; DataFrame.append was removed in pandas 2.0,
# and building the dataframe once is faster than appending row by row
split_rows = []

# Iterate over the rows with multiple alternate alleles
for _, row in multiple_alt_allele_variants.iterrows():

    # If the variant name contains "rs", both alleles keep the same name
    if "rs" in row.VAR_NAME:
        varname1 = varname2 = row.VAR_NAME
    # If the variant name contains a semicolon, split it into two parts
    elif ";" in row.VAR_NAME:
        varname1, varname2 = row.VAR_NAME.split(";")[:2]
    # Otherwise, both alleles keep the original name
    else:
        varname1 = varname2 = row.VAR_NAME

    # Split the alternate alleles and their counts into two parts
    alt1, alt2 = row.ALT.split(",")[:2]
    ihaltcts1, ihaltcts2 = row.IH_ALT_CTS.split(",")[:2]

    # The position, reference allele, gene, subpopulation and total count
    # are shared by both new rows
    shared = {
        "POS": row.POS,
        "REF": row.REF,
        "GENE": row.GENE,
        "SUB_POP": row.SUB_POP,
        "IH_TOTAL_CTS": row.IH_TOTAL_CTS,
    }

    # Create one new row per alternate allele
    split_rows.append(
        {**shared, "VAR_NAME": varname1, "ALT": alt1, "IH_ALT_CTS": ihaltcts1}
    )
    split_rows.append(
        {**shared, "VAR_NAME": varname2, "ALT": alt2, "IH_ALT_CTS": ihaltcts2}
    )

# Build a dataframe from the split rows, using the same column order as ih_allele_counts
split_ih_allele_counts = pd.DataFrame(split_rows, columns=ih_allele_counts.columns)

# Append the split rows to the ih_allele_counts dataframe and reset the index
ih_allele_counts = pd.concat(
    [ih_allele_counts, split_ih_allele_counts], ignore_index=True
)

# Check to see if there are still rows with multiple alternate alleles
print('\nAfter fixing the errors:')
print(ih_allele_counts[ih_allele_counts.ALT.str.contains(",")].count())

Before fixing the errors:
VAR_NAME        480
POS             480
REF             480
ALT             480
GENE            480
SUB_POP         480
IH_ALT_CTS      480
IH_TOTAL_CTS    480
dtype: int64

After fixing the errors:
VAR_NAME        0
POS             0
REF             0
ALT             0
GENE            0
SUB_POP         0
IH_ALT_CTS      0
IH_TOTAL_CTS    0
dtype: int64
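
As a possible alternative to the row-by-row loop, `DataFrame.explode` can split several list-valued columns in parallel (multi-column explode requires pandas 1.3 or later). The sketch below runs on made-up data and handles any number of alternate alleles per row. It does not reproduce the splitting of semicolon-separated variant names, so it illustrates the idea and is not a drop-in replacement.

```python
import pandas as pd

# Toy data: the first row carries two alternate alleles and two counts
toy = pd.DataFrame(
    {
        "VAR_NAME": ["rs1", "rs2"],
        "ALT": ["A,T", "G"],
        "IH_ALT_CTS": ["2,1", "5"],
        "IH_TOTAL_CTS": [10, 10],
    }
)

# Turn the comma-separated strings into lists, then explode both columns together
split = toy.assign(
    ALT=toy["ALT"].str.split(","),
    IH_ALT_CTS=toy["IH_ALT_CTS"].astype(str).str.split(","),
).explode(["ALT", "IH_ALT_CTS"], ignore_index=True)

# The exploded counts are still strings, so convert them to integers
split["IH_ALT_CTS"] = split["IH_ALT_CTS"].astype(int)
print(split)
```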


### Correct data types

Ensure that the data types of all columns are correct.

In [35]:
# Convert specified columns to string type
ih_allele_counts[
    ["VAR_NAME", "POS", "REF", "ALT", "GENE", "SUB_POP", "IH_ALT_CTS"]
] = ih_allele_counts[
    ["VAR_NAME", "POS", "REF", "ALT", "GENE", "SUB_POP", "IH_ALT_CTS"]
].astype(str)

# Replace commas with periods in the 'IH_ALT_CTS' column
ih_allele_counts["IH_ALT_CTS"] = [
    x.replace(",", ".") for x in ih_allele_counts["IH_ALT_CTS"]
]

# Convert 'IH_TOTAL_CTS' and 'IH_ALT_CTS' columns to float type and then to integer type
ih_allele_counts[["IH_TOTAL_CTS", "IH_ALT_CTS"]] = (
    ih_allele_counts[["IH_TOTAL_CTS", "IH_ALT_CTS"]].astype(float).astype(int)
)

### Feature selection

Some of the variants, which have been named according to NCBI rsID standards, share the same name. This will complicate downstream analysis of the data. Add a feature with a unique ID for each variant to ensure that the variants are distinguishable.

In [36]:
ih_allele_counts["ID"] = (
    ih_allele_counts[["POS", "REF", "ALT"]].astype("str").agg("_".join, axis=1)
)
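
The toy example below, with made-up values, shows how two variants that share an rsID but differ in their alternate allele receive distinct IDs. Because the ID is built from `POS`, `REF` and `ALT` only, variants on different chromosomes with the same position and alleles would in principle collide, so a uniqueness check such as the final line may be worthwhile.

```python
import pandas as pd

# Two toy variants with the same name and position but different ALT alleles
toy_variants = pd.DataFrame(
    {
        "VAR_NAME": ["rs1", "rs1"],
        "POS": [100, 100],
        "REF": ["C", "C"],
        "ALT": ["G", "T"],
    }
)

# Join position, reference allele and alternate allele with underscores
toy_variants["ID"] = (
    toy_variants[["POS", "REF", "ALT"]].astype(str).agg("_".join, axis=1)
)

# True when no two rows share an ID
ids_are_unique = toy_variants["ID"].is_unique
print(toy_variants["ID"].tolist(), ids_are_unique)
```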

Add a new feature with information on the African region (i.e., Southern Africa, Western Africa, Eastern Africa or Central Africa) from which a particular African subpopulation/ethnolinguistic group originates. 

In [37]:
ih_allele_counts["REG"] = ih_allele_counts["SUB_POP"].map(
    constants.REGIONAL_CLASSIFICATION
)

Add a new feature to classify variants as single nucleotide polymorphisms (SNPs) or insertions/deletions (INDELs). The definitions for these are as follows:
* SNPs result in the exchange of one DNA base pair for another.
* INDELs result in the deletion or insertion of one or more DNA base pairs.

In [38]:
ih_allele_counts["VARIANT_TYPE"] = np.where(
    (ih_allele_counts.ALT.str.len() > 1)
    | (ih_allele_counts.REF.str.len() > 1),
    "INDEL",
    "SNP",
)

In the data above, the total count of all alleles (REF and ALT) is included as `IH_TOTAL_CTS`. To get the reference allele count, `IH_REF_CTS`, subtract the alternate allele count, `IH_ALT_CTS`, from `IH_TOTAL_CTS`.

In [39]:
# Calculate reference allele counts and add to dataframe
ih_allele_counts["IH_REF_CTS"] = (
    ih_allele_counts["IH_TOTAL_CTS"] - ih_allele_counts["IH_ALT_CTS"]
).astype(int)

### Data filtering

Remove variants not associated with specific genes:

This process involves filtering out genetic variants from the dataset that are not associated with specific genes based on the Combined Annotation-Dependent Depletion (CADD) Variant Effect Predictor (VEP) results. The CADD VEP results provide annotations about the functional impact of variants, including their genomic positions and predicted effects on genes.

Before running this step, the VEP results must already have been retrieved and processed into `Data/Processed/Variant_consequences.csv`, because the filter keeps only variants whose `ID` appears in that file.

Removing non-associated variants focuses the analysis on genetic variation that is more likely to influence gene function or phenotype.

In [40]:
# Load VEP results from the specified CSV file path
cadd_data_path = os.path.join(
    PROJECT_ROOT,
    "Data",
    "Processed",
    "Variant_consequences.csv",
)

# Read the CSV file containing VEP results into a pandas DataFrame
cadd_data = pd.read_csv(cadd_data_path, sep=",")

# Get a list of all variants in the genes
variant_ids = list(cadd_data.ID.values)

# Remove variants if they are not in the above list
print(f'Variant count before filtering: {ih_allele_counts.ID.nunique()}')
ih_allele_counts_filtered = ih_allele_counts[ih_allele_counts.ID.isin(variant_ids)]
print(f'Variant count after filtering: {ih_allele_counts_filtered.ID.nunique()}')

Variant count before filtering: 23549
Variant count after filtering: 21658


### Aggregate data

To ensure that the allele counts are aggregated for each variant and each subpopulation, group the data by the relevant columns and sum the counts for each group. 

In [41]:
# Define the aggregation functions to be applied to the grouped data
# 'sum' will be used to aggregate the 'IH_REF_CTS', 'IH_ALT_CTS', and 'IH_TOTAL_CTS' columns
agg_functions = {"IH_REF_CTS": "sum", "IH_ALT_CTS": "sum", "IH_TOTAL_CTS": "sum"}

# Group the dataframe by the specified columns and apply the aggregation functions
ih_allele_counts_grouped = ih_allele_counts_filtered.groupby(
    by=["ID", "VAR_NAME", "VARIANT_TYPE", "POS", "REF", "ALT", "GENE", "SUB_POP", "REG"]
).aggregate(agg_functions).reset_index()

# Display the grouped and aggregated dataframe
ih_allele_counts_grouped.head(2)

Unnamed: 0,ID,VAR_NAME,VARIANT_TYPE,POS,REF,ALT,GENE,SUB_POP,REG,IH_REF_CTS,IH_ALT_CTS,IH_TOTAL_CTS
0,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Bantu Kenya,EA,20,0,20
1,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Bantu South Africa,SA,16,0,16


Create a new row representing the aggregated allele counts for the African population. For this row, assign the value `Recent African` to the `REG` column, as it specifically pertains to African populations currently residing on the African continent. This excludes African American or African Caribbean populations. Since this aggregated data does not pertain to any specific subpopulation, leave the `SUB_POP` column blank.

In [42]:
# Define aggregation functions to sum allele counts
agg_functions = {"IH_ALT_CTS": "sum", "IH_TOTAL_CTS": "sum", "IH_REF_CTS": "sum"}

# Group by identifier columns and aggregate allele counts
# This creates a new dataframe 'recent_africa_ct' with aggregated allele counts for each variant across African populations
recent_africa_ct = (
    ih_allele_counts_grouped.groupby(["ID", "VAR_NAME", "VARIANT_TYPE", "POS", "REF", "ALT", "GENE"])
    .aggregate(agg_functions)
    .reset_index()
)

# Assign "Recent African" to the REG column for African continent populations
recent_africa_ct["REG"] = "Recent African"

# Set SUB_POP column to NaN as data is not specific to any subpopulation
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
recent_africa_ct["SUB_POP"] = np.nan

# Concatenate the Recent African allele count data with the subpopulation allele count data
# This combines 'recent_africa_ct' dataframe with 'ih_allele_counts_grouped' dataframe
ih_allele_counts_grouped = (
    pd.concat(
        [
            ih_allele_counts_grouped,
            recent_africa_ct,  
        ]
    )
    .sort_values("ID")  # Sort the concatenated dataframe by the 'ID' column
    .reset_index(drop=True)  # Reset the index of the concatenated dataframe
)

# Display the first two rows of the dataframe with aggregated counts for Recent African populations
ih_allele_counts_grouped.head(2)

Unnamed: 0,ID,VAR_NAME,VARIANT_TYPE,POS,REF,ALT,GENE,SUB_POP,REG,IH_REF_CTS,IH_ALT_CTS,IH_TOTAL_CTS
0,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Bantu Kenya,EA,20,0,20
1,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Yoruba,WA,276,0,276


### Calculate variant allele frequencies

Compute the variant allele frequencies from the aggregated allele counts dataset. Variant allele frequency refers to the proportion of alleles that carry a variant form (alternate allele) at a specific genetic variant position, relative to the total number of alleles at that position. This calculation helps in understanding the prevalence of genetic variants within a population and their potential association with traits or diseases.

In [43]:
ih_allele_counts_grouped["IH_AF"] = (
    ih_allele_counts_grouped["IH_ALT_CTS"]
    / ih_allele_counts_grouped["IH_TOTAL_CTS"]
)
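
As a worked example, a variant with 3 alternate alleles out of 12 total alleles has a frequency of 3 / 12 = 0.25. The sketch below uses made-up counts and also shows one way to guard against a total count of zero, which would otherwise yield `NaN` (0 / 0) or `inf` (n / 0). Whether zero totals occur in the real data is not known from this notebook.

```python
import numpy as np
import pandas as pd

# Toy counts; the last row has no observed alleles at all
counts = pd.DataFrame({"IH_ALT_CTS": [0, 3, 0], "IH_TOTAL_CTS": [20, 12, 0]})

# Replace zero totals with NaN so that the frequency is reported as missing
safe_total = counts["IH_TOTAL_CTS"].replace(0, np.nan)
counts["IH_AF"] = counts["IH_ALT_CTS"] / safe_total
print(counts)
```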

### Display and save the prepared data

In [44]:
ih_allele_counts_grouped.head(2)

Unnamed: 0,ID,VAR_NAME,VARIANT_TYPE,POS,REF,ALT,GENE,SUB_POP,REG,IH_REF_CTS,IH_ALT_CTS,IH_TOTAL_CTS,IH_AF
0,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Bantu Kenya,EA,20,0,20,0.0
1,110148882_C_CT,chr13:110148882C-CT,INDEL,110148882,C,CT,COL4A1,Yoruba,WA,276,0,276,0.0


In [45]:
ih_allele_counts_grouped.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts.csv",
    ),
    index=False,
)

### Prepare data for Fisher's Tests

To test for significant differences in allele frequency between population regions with Fisher's exact test, the in-house allele count data needs to be restructured. Each variant requires a single row that holds the alternate and reference allele counts for every region, so that the counts for any two regions can be compared directly.

In [46]:
# Generate a list of all unique regions in the ih_allele_counts_grouped dataframe
inhouse_populations = list(set(ih_allele_counts_grouped.REG.values))

# Aggregate allele counts by region
# Define aggregation functions to sum allele counts
agg_functions = {"IH_ALT_CTS": "sum", "IH_TOTAL_CTS": "sum", "IH_REF_CTS": "sum"}

# Group ih_allele_counts_grouped dataframe by variant, identifier, reference, alternate, gene, position, and region
ih_allele_counts_fishers = ih_allele_counts_grouped.groupby(
    by=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS", "REG"]
).aggregate(agg_functions).reset_index()

# Pivot data to reshape dataframe with variant, identifier, reference, alternate, gene, and position as index, and regions as columns
ih_allele_counts_fishers = ih_allele_counts_fishers.pivot(
    index=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS"],
    columns="REG",
    values=["IH_ALT_CTS", "IH_REF_CTS"],
)

# Separate alternate and reference count data into different dataframes to facilitate renaming of count columns appropriately
ih_data_alt_fishers = (
    ih_allele_counts_fishers[["IH_ALT_CTS"]].droplevel(level=0, axis=1).reset_index()
)

ih_data_ref_fishers = (
    ih_allele_counts_fishers[["IH_REF_CTS"]].droplevel(level=0, axis=1).reset_index()
)

# Add appropriate prefixes to alternate and reference count columns using a custom function
ih_data_alt_fishers = functions.add_prefix_dataframe_col_names(
    ih_data_alt_fishers, inhouse_populations, "ALT_CT_IH_"
)

ih_data_ref_fishers = functions.add_prefix_dataframe_col_names(
    ih_data_ref_fishers, inhouse_populations, "REF_CT_IH_"
)

# Merge renamed alternate and reference count dataframes on variant, identifier, reference, alternate, gene, and position
ih_fishers = ih_data_alt_fishers.merge(
    ih_data_ref_fishers, on=["VAR_NAME", "ID", "REF", "ALT", "GENE", "POS"]
)

# Replace missing count values with 0
# (fillna avoids the np.NaN alias, which was removed in NumPy 2.0)
ih_fishers = ih_fishers.fillna(0)
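
The implementation of `functions.add_prefix_dataframe_col_names` is not shown in this notebook. As a rough illustration of the reshaping, the sketch below pivots a made-up long table and builds the prefixed column names directly from the pivoted column levels. It assumes the helper simply prepends the prefix to each region column, which may not match its actual behaviour.

```python
import pandas as pd

# Toy long-format counts: one row per variant and region
toy_long = pd.DataFrame(
    {
        "ID": ["1_A_T", "1_A_T", "2_G_C"],
        "REG": ["EA", "WA", "WA"],
        "IH_ALT_CTS": [1, 0, 2],
        "IH_REF_CTS": [203, 920, 918],
    }
)

# One row per variant, with a (count type, region) column for each combination
toy_wide = toy_long.pivot(
    index="ID", columns="REG", values=["IH_ALT_CTS", "IH_REF_CTS"]
)

# Flatten the two column levels into names such as ALT_CT_IH_EA
prefixes = {"IH_ALT_CTS": "ALT_CT_IH_", "IH_REF_CTS": "REF_CT_IH_"}
toy_wide.columns = [prefixes[value] + region for value, region in toy_wide.columns]

# Regions without data for a variant become 0
toy_wide = toy_wide.fillna(0).astype(int).reset_index()
print(toy_wide)
```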

### Display and save the prepared data

In [48]:
ih_fishers.head(2)

REG,VAR_NAME,ID,REF,ALT,GENE,POS,ALT_CT_IH_CA,ALT_CT_IH_EA,ALT_CT_IH_Recent African,ALT_CT_IH_SA,ALT_CT_IH_WA,REF_CT_IH_CA,REF_CT_IH_EA,REF_CT_IH_Recent African,REF_CT_IH_SA,REF_CT_IH_WA
0,chr11:34438836T-C,34438836_T_C,T,C,CAT,34438836,0,1,1,0,0,68,203,1219,28,920
1,chr11:34438889G-C,34438889_G_C,G,C,CAT,34438889,0,0,1,1,0,68,204,1219,27,920
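
To illustrate how these columns feed a Fisher's exact test, the sketch below compares the Eastern and Western African counts from the first displayed row (variant `34438836_T_C`) as a 2x2 table. SciPy's `scipy.stats.fisher_exact` provides this test. The helper `fisher_exact_two_sided` is a hypothetical name and is not part of the project's `Utils` package. It uses only the standard library so that the arithmetic stays visible.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum the probabilities of all tables that are no more likely than the observed one
    return sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-7))


# Counts copied from the first displayed row
alt_ea, ref_ea = 1, 203  # ALT_CT_IH_EA, REF_CT_IH_EA
alt_wa, ref_wa = 0, 920  # ALT_CT_IH_WA, REF_CT_IH_WA

p_value = fisher_exact_two_sided(alt_ea, ref_ea, alt_wa, ref_wa)
print(p_value)
```

With a single alternate allele among 1124 alleles in total, only two tables are possible, and the p-value reduces to 204 / 1124, or about 0.18.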


In [49]:
ih_fishers.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "IH_allele_counts_fishers.csv",
    ),
    index=False,
)