# Preparation of ALFA global variant count data

To compare the African variant count data generated in-house from the [gnomAD 1000 Genomes and HGDP datasets](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/) with data from other global populations, variant count data for European, Asian and Latin American populations was retrieved from the [NCBI ALFA database](https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/).

Retrieval of the data from the NCBI ALFA database is documented in the `Notebooks/1-Data_retrieval/ALFA_allele_counts.ipynb` notebook. The raw data is stored in the `Data/Raw/ALFA/ALFA_allele_count_data.csv` file. Before it can be analysed further, the data must be reorganised and reformatted. This section outlines the steps taken to do so.

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import pandas as pd
import Utils.constants as constants
import Utils.functions as functions
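
The setup above relies on `PROJECT_ROOT` being defined in the `.env` file. If it is missing, `os.getenv` returns `None` and the path concatenation fails with an unhelpful error. A minimal sketch of a guard is shown below; the helper name `resolve_project_root` and the example path `/tmp/project` are hypothetical and not part of this project's `Utils` module.

```python
import os
from pathlib import Path


def resolve_project_root(env=os.environ):
    """Return PROJECT_ROOT as a Path, failing early with a clear message."""
    root = env.get("PROJECT_ROOT")
    if not root:
        raise RuntimeError("PROJECT_ROOT is not set; define it in the .env file")
    return Path(root)


# Example with an explicit mapping instead of the real environment
example_root = resolve_project_root({"PROJECT_ROOT": "/tmp/project"})
raw_alfa_path = example_root / "Data" / "Raw" / "ALFA"
print(raw_alfa_path)
```

Passing the environment as an argument keeps the helper easy to test without touching the real environment variables.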

## Data loading

Load the raw data from the `Data/Raw/ALFA/ALFA_allele_count_data.csv` file into a DataFrame.

In [2]:
all_alfa_allele_counts = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Raw",
        "ALFA",
        "ALFA_allele_count_data.csv",
    )
)

all_alfa_allele_counts.head(2)

Unnamed: 0,study_code,variant_id,population_code,reference_allele,allele_counts
0,PRJNA507278,rs552586867,SAMN10492705,C,"{'G': 0, 'C': 14050}"
1,PRJNA507278,rs552586867,SAMN10492695,C,"{'G': 0, 'C': 9690}"


The information stored in the DataFrame columns is as follows:

* `study_code`: A unique identifier for the study or project from which the data was derived. It helps in tracking the source of the data.
* `variant_id`: A unique identifier for the genetic variant. It typically refers to a specific location in the genome where a variation (such as a single nucleotide polymorphism, SNP) has been observed.
* `population_code`: A code representing the population or subpopulation from which the data was collected. This helps in understanding the genetic diversity and allele frequencies across different groups.
* `reference_allele`: The allele that is treated as the reference for the variant, that is, the allele present in the reference genome at that position. It is often, but not always, the most common allele in a population.
* `allele_counts`: The number of times each allele was observed in the given population, stored as a dictionary that maps each allele (the reference allele and any alternate alleles) to its count.

## Unpack column values

The `allele_counts` column contains information stored in a dictionary format. To make the data more accessible for analysis, expand the dictionary content into multiple separate columns.

In [3]:
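# Unpack dictionary values into multiple separate columns.
# ast.literal_eval parses the dictionary strings without executing arbitrary code.
import ast

normalised_count_columns = (
    all_alfa_allele_counts["allele_counts"].map(ast.literal_eval).apply(pd.Series)
)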

# Drop the old allele count column from the dataframe
drop_count_column = all_alfa_allele_counts.drop(columns="allele_counts")

# Re-add the new, unpacked allele count column
normalised_alfa_allele_counts = pd.concat(
    [
        drop_count_column.reset_index(drop=True),
        normalised_count_columns.reset_index(drop=True),
    ],
    axis=1,
)

normalised_alfa_allele_counts.head(2)

Unnamed: 0,study_code,variant_id,population_code,reference_allele,G,C,A,T,AAAAAAAAAAA,AAAAAAAAAAAAA,...,TATATATATA,TATATATATATA,ATACACACACACACA,ATACACACACACACACA,ATATATACACACACACACACA,ATATATACACACACACACACACA,ATATATACACACACACACACACACACACA,ATCTATACACACA,TAAAATA,TAAAATAAAATA
0,PRJNA507278,rs552586867,SAMN10492705,C,0.0,14050.0,,,,,...,,,,,,,,,,
1,PRJNA507278,rs552586867,SAMN10492695,C,0.0,9690.0,,,,,...,,,,,,,,,,
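
The unpacking step can be illustrated on a few toy rows. The sketch below parses each dictionary string with `ast.literal_eval` and builds one column per allele. The first two rows copy values from the preview above; the third row (`rs_example`) is invented purely to show how alleles that are absent from a row become `NaN`.

```python
import ast

import pandas as pd

# Toy rows in the same shape as the raw ALFA table.
# The "rs_example" row is invented for illustration.
toy = pd.DataFrame(
    {
        "variant_id": ["rs552586867", "rs552586867", "rs_example"],
        "population_code": ["SAMN10492705", "SAMN10492695", "SAMN10492695"],
        "allele_counts": [
            "{'G': 0, 'C': 14050}",
            "{'G': 0, 'C': 9690}",
            "{'A': 3, 'T': 7, 'C': 90}",
        ],
    }
)

# literal_eval only accepts Python literals, so it cannot run code stored in the CSV
parsed = toy["allele_counts"].map(ast.literal_eval)

# One column per allele; alleles absent from a row become NaN
unpacked = pd.DataFrame(parsed.tolist(), index=toy.index)
wide = pd.concat([toy.drop(columns="allele_counts"), unpacked], axis=1)
print(wide)
```

Because every distinct allele string becomes its own column, variants with long insertion or repeat alleles produce a wide and mostly empty table, which is why the data is reshaped later in this notebook.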


## Translate coded column values

Populations are represented by codes in the `population_code` column. To make the data easier to read, translate these codes into recognisable population names.

In [4]:
# Fetch ALFA study population metadata
metadata_json = functions.get_metadata()

# Generate a list of all unique study and population codes that need to be queried
study_codes = normalised_alfa_allele_counts["study_code"].unique().tolist()
population_codes = normalised_alfa_allele_counts["population_code"].unique().tolist()

# Fetch study and population name metadata. 
# Code modified from https://github.com/ncbi/dbsnp/blob/master/tutorials/Variation%20Services/Jupyter_Notebook/by_rsid.ipynb.
metadata = {}
for project_json in metadata_json:
    p = {}
    p["json"] = project_json
    p["pops"] = {}
    metadata[project_json["bioproject_id"]] = p
for prj_id, prj in metadata.items():
    functions.add_all_pops(prj["json"]["populations"], prj)

# Names for population codes that are assigned manually instead of being read from the metadata
manual_pop_names = {
    "SAMN10492696": "African Others",
    "SAMN10492698": "African American",
    "SAMN10492697": "East Asian",
    "SAMN10492701": "Other Asian",
}

# Map every population code to a readable name
pop_dict = dict(manual_pop_names)
for study_id in study_codes:
    for pop_id in population_codes:
        if pop_id not in manual_pop_names:
            pop_dict[pop_id] = metadata[study_id]["pops"][pop_id]["name"]

# Add column in population allele count dataframe for population names
normalised_alfa_allele_counts["Population"] = normalised_alfa_allele_counts[
    "population_code"
].map(pop_dict)

normalised_alfa_allele_counts.head(2)

Unnamed: 0,study_code,variant_id,population_code,reference_allele,G,C,A,T,AAAAAAAAAAA,AAAAAAAAAAAAA,...,TATATATATATA,ATACACACACACACA,ATACACACACACACACA,ATATATACACACACACACACA,ATATATACACACACACACACACA,ATATATACACACACACACACACACACACA,ATCTATACACACA,TAAAATA,TAAAATAAAATA,Population
0,PRJNA507278,rs552586867,SAMN10492705,C,0.0,14050.0,,,,,...,,,,,,,,,,Total
1,PRJNA507278,rs552586867,SAMN10492695,C,0.0,9690.0,,,,,...,,,,,,,,,,European
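
`Series.map` returns `NaN` for any code that is missing from the dictionary, so it can be worth checking for unmapped codes after the translation. A minimal sketch, using code-to-name pairs that appear in this notebook plus one made-up code (`SAMN00000000`) to trigger the check:

```python
import pandas as pd

# Code-to-name pairs taken from this notebook
code_to_name = {
    "SAMN10492705": "Total",
    "SAMN10492695": "European",
    "SAMN10492697": "East Asian",
}

# "SAMN00000000" is a made-up code that has no entry in the dictionary
codes = pd.Series(
    ["SAMN10492705", "SAMN10492695", "SAMN00000000"], name="population_code"
)

names = codes.map(code_to_name)

# Codes without a translation show up as NaN after map()
unmapped = sorted(codes[names.isna()].unique())
print(unmapped)
```

An empty list here would indicate that every population code received a name.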


## Data filtering

Filter the data to include only the populations of interest. Retain information on the European, East Asian, South Asian, Latin American 1, and Latin American 2 population groups.

In [5]:
# Define a list of included population groups
incl_pops = [
    "European",
    "East Asian",
    "South Asian",
    "Latin American 1",
    "Latin American 2",
]

# Filter data by included population groups
filtered_alfa_allele_counts = normalised_alfa_allele_counts[
    normalised_alfa_allele_counts["Population"].isin(incl_pops)
]
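
Filtering with `isin` silently drops any population whose label does not match exactly, for example because of a spelling difference. The sketch below, on a toy table with a placeholder variant id, shows the filter together with a quick check of which requested populations are absent from the result.

```python
import pandas as pd

incl_pops = [
    "European",
    "East Asian",
    "South Asian",
    "Latin American 1",
    "Latin American 2",
]

# Toy table: population labels come from this notebook, the variant id is a placeholder
toy = pd.DataFrame(
    {
        "variant_id": ["rs_example"] * 4,
        "Population": ["Total", "European", "East Asian", "African American"],
    }
)

filtered = toy[toy["Population"].isin(incl_pops)]

# Requested populations that do not occur in the filtered data
missing = sorted(set(incl_pops) - set(filtered["Population"]))
print(missing)
```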

## Reshape data

To enable a meaningful comparison with our in-house African allele count data, reformat the ALFA allele count data to match the structure used for the in-house dataset.

In [6]:
# Melt the DataFrame to reshape it for easier analysis
melted_alfa_allele_counts = filtered_alfa_allele_counts.melt(
    id_vars=["study_code", "variant_id", "population_code", "reference_allele", "Population"],
    value_vars=filtered_alfa_allele_counts.iloc[:, 4:].columns,
    var_name="allele",
    value_name="count",
).dropna(subset=["count"])

# Separate reference allele and allele count information
ref_allele_info = melted_alfa_allele_counts[
    ["study_code", "variant_id", "population_code", "Population", "reference_allele"]
]
allele_count_info = melted_alfa_allele_counts[
    ["study_code", "variant_id", "population_code", "Population", "allele", "count"]
]

# Merge allele count information with reference allele information
ref_alfa_counts = (
    ref_allele_info.merge(
        allele_count_info,
        left_on=["study_code", "variant_id", "population_code", "Population", "reference_allele"],
        right_on=["study_code", "variant_id", "population_code", "Population", "allele"],
    )
    .rename(columns={"count": "alfa_ref_cts"})
    .drop(columns="allele")
    .drop_duplicates(
        subset=["study_code", "variant_id", "population_code", "Population", "reference_allele"]
    )
)

# Merge remaining allele count information for alternate alleles
alt_alfa_counts = melted_alfa_allele_counts.merge(
    ref_alfa_counts,
    on=["study_code", "variant_id", "population_code", "reference_allele", "Population"],
)

# Filter alternate allele counts
alfa_counts = alt_alfa_counts[
    alt_alfa_counts["reference_allele"] != alt_alfa_counts["allele"]
]

# Rename columns
alfa_counts = alfa_counts.rename(
    columns={
        "study_code": "STUDY_CODE",
        "population_code": "POP_ID",
        "Population": "POP",
        "variant_id": "VAR_NAME",
        "allele": "ALT",
        "count": "ALT_CTS",
        "reference_allele": "REF",
        "alfa_ref_cts": "REF_CTS",
    }
)

# Calculate alternate allele frequencies
alfa_counts["AF"] = alfa_counts["ALT_CTS"] / (
    alfa_counts["REF_CTS"] + alfa_counts["ALT_CTS"]
)

# Pivot the data for analysis
alfa_pivot_data = alfa_counts.pivot_table(
    index=["VAR_NAME", "REF", "ALT"],
    columns="POP",
    values=["ALT_CTS", "REF_CTS"]
).reset_index()

# Separate data into alternate and reference count dataframes for renaming.
# Dropping the top column level leaves the three identifier columns unnamed,
# so their names are restored by assigning a new column list.
alfa_data_alt = alfa_pivot_data[["VAR_NAME", "REF", "ALT", "ALT_CTS"]].droplevel(level=0, axis=1).reset_index(drop=True)
alfa_data_alt.columns = ["VAR_NAME", "REF", "ALT"] + list(alfa_data_alt.columns[3:])
alfa_data_ref = alfa_pivot_data[["VAR_NAME", "REF", "ALT", "REF_CTS"]].droplevel(level=0, axis=1).reset_index(drop=True)
alfa_data_ref.columns = ["VAR_NAME", "REF", "ALT"] + list(alfa_data_ref.columns[3:])

# Add appropriate prefixes to alt and ref column names
alfa_data_alt = functions.add_prefix_dataframe_col_names(
    alfa_data_alt, alfa_data_alt.iloc[:, 3:], "ALT_CT_ALFA_"
)

alfa_data_ref = functions.add_prefix_dataframe_col_names(
    alfa_data_ref, alfa_data_ref.iloc[:, 3:], "REF_CT_ALFA_"
)

# Merge renamed alternate and reference count data
alfa_grouped_data = alfa_data_alt.merge(
    alfa_data_ref, on=["VAR_NAME", "REF", "ALT"]
)

# Remove index name
alfa_grouped_data.columns.name = None
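
The reshaping logic is easier to follow on a toy table. The sketch below is a simplified variant of the steps above, not the notebook's exact code: it melts a wide table to long format, splits rows into reference and alternate counts with a boolean mask (instead of a self-merge), and pivots the result so that each population gets its own count columns. The counts are those of `rs1000343` for the European and East Asian populations in the prepared table; the flattened column names such as `ALT_CTS_European` are illustrative only.

```python
import pandas as pd

# Toy wide table: one row per variant and population, one column per allele
wide = pd.DataFrame(
    {
        "variant_id": ["rs1000343", "rs1000343"],
        "reference_allele": ["C", "C"],
        "Population": ["European", "East Asian"],
        "C": [109377, 490],
        "T": [49, 0],
    }
)

id_cols = ["variant_id", "reference_allele", "Population"]

# Long format: one row per variant, population and allele
long = wide.melt(id_vars=id_cols, var_name="allele", value_name="count").dropna(
    subset=["count"]
)

# Rows where the counted allele equals the reference allele hold the reference counts
is_ref = long["allele"] == long["reference_allele"]
ref = long[is_ref].rename(columns={"count": "REF_CTS"}).drop(columns="allele")
alt = long[~is_ref].rename(columns={"allele": "ALT", "count": "ALT_CTS"})

# Attach the reference count to every alternate allele row
counts = alt.merge(ref, on=id_cols)

# One row per variant and alternate allele, one column per count type and population
table = counts.pivot_table(
    index=["variant_id", "reference_allele", "ALT"],
    columns="Population",
    values=["ALT_CTS", "REF_CTS"],
)

# Flatten the two-level column index into names like "ALT_CTS_European"
table.columns = [f"{kind}_{pop}" for kind, pop in table.columns]
table = table.reset_index()
print(table)
```

Note that `pivot_table` aggregates with the mean by default, so duplicate rows for the same variant, allele and population would be averaged rather than raising an error.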

## Display and save the prepared data

In [7]:
alfa_grouped_data.head(2)

Unnamed: 0,VAR_NAME,REF,ALT,ALT_CT_ALFA_East Asian,ALT_CT_ALFA_European,ALT_CT_ALFA_Latin American 1,ALT_CT_ALFA_Latin American 2,ALT_CT_ALFA_South Asian,REF_CT_ALFA_East Asian,REF_CT_ALFA_European,REF_CT_ALFA_Latin American 1,REF_CT_ALFA_Latin American 2,REF_CT_ALFA_South Asian
0,rs1000343,C,T,0.0,49.0,5.0,10.0,0.0,490.0,109377.0,673.0,2200.0,184.0
1,rs1000989,T,C,55.0,21489.0,123.0,1330.0,1685.0,109.0,37269.0,273.0,2052.0,3283.0


A description of the data contained in each column is as follows:

* `VAR_NAME`: Name or identifier of the genetic variant.
* `REF`: Reference allele, as reported in the ALFA data.
* `ALT`: Alternate (non-reference) allele observed at the variant position.
* `ALT_CT_ALFA_East Asian`: Count of the alternate allele in the East Asian population (ALFA database).
* `ALT_CT_ALFA_European`: Count of the alternate allele in the European population (ALFA database).
* `ALT_CT_ALFA_Latin American 1`: Count of the alternate allele in the Latin American 1 population (ALFA database).
* `ALT_CT_ALFA_Latin American 2`: Count of the alternate allele in the Latin American 2 population (ALFA database).
* `ALT_CT_ALFA_South Asian`: Count of the alternate allele in the South Asian population (ALFA database).
* `REF_CT_ALFA_East Asian`: Count of the reference allele in the East Asian population (ALFA database).
* `REF_CT_ALFA_European`: Count of the reference allele in the European population (ALFA database).
* `REF_CT_ALFA_Latin American 1`: Count of the reference allele in the Latin American 1 population (ALFA database).
* `REF_CT_ALFA_Latin American 2`: Count of the reference allele in the Latin American 2 population (ALFA database).
* `REF_CT_ALFA_South Asian`: Count of the reference allele in the South Asian population (ALFA database).
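
As an illustration of how these columns can be used downstream, the sketch below derives an alternate allele frequency for the European population from the two rows shown in the preview above. The output column name `ALT_AF_ALFA_European` is hypothetical. The calculation assumes that the reference and alternate counts together make up all observed alleles, which holds for biallelic sites; for multi-allelic variants the denominator would omit the other alternate alleles.

```python
import pandas as pd

# The two rows shown in the preview of the prepared table (European columns only)
prepared = pd.DataFrame(
    {
        "VAR_NAME": ["rs1000343", "rs1000989"],
        "ALT_CT_ALFA_European": [49.0, 21489.0],
        "REF_CT_ALFA_European": [109377.0, 37269.0],
    }
)

total = prepared["ALT_CT_ALFA_European"] + prepared["REF_CT_ALFA_European"]

# where() leaves NaN for variants with no observed alleles instead of a 0/0 result
prepared["ALT_AF_ALFA_European"] = (
    prepared["ALT_CT_ALFA_European"] / total
).where(total > 0)

print(prepared[["VAR_NAME", "ALT_AF_ALFA_European"]])
```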

In [8]:
alfa_grouped_data.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "ALFA_allele_counts.csv",
    ),
    index=False,
)