# ALFA data preparation part A 

To compare the in-house variant count and frequency data for African populations, generated from the [gnomAD 1000 Genomes and HGDP datasets](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/), with data from other global populations, we first needed to collect variant data for those populations.

For this purpose, I retrieved variant data from the [NCBI ALFA database](https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/). This database contains variant information for various global populations, including European, East Asian, and South Asian populations. With this data, the genetic variants found in African populations can be compared with those present in other populations worldwide, which helps us understand the similarities and differences in genetic variation across ethnic groups and geographical regions.

This file documents how the global variant data was retrieved from the ALFA database through its Application Programming Interface (API).

## Import libraries and modules

In [1]:
import os
os.chdir(
    r"C:\Users\User\Desktop\Megan\MSC2\Results\5._Posthoc_analysis\Pipeline_GnomAD_14032023\Analysis"
)

In [2]:
import sys
sys.path.append(
    r"C:\Users\User\Desktop\Megan\MSC2\Results\5._Posthoc_analysis\Pipeline_GnomAD_14032023"
)
import pandas as pd
import Utils.constants as constants
import Utils.functions as functions

## Fetch ALFA variant data

The ALFA database only stores information on common variants that have an [rsID](https://customercare.23andme.com/hc/en-us/articles/212196908-What-Are-RS-Numbers-Rsid-). To compare the variants identified in-house in African populations with the data in ALFA, I first needed a list of the African-population variants that have rsIDs. To build it, I filtered the in-house variant data obtained from the gnomAD 1000 Genomes and HGDP datasets and kept only the variants with rsIDs. These are the common variants that are also present in the ALFA database.

In [13]:
# Load CSVs with variants reported for each gene. Combine information into a single dataframe.

variant_info = pd.DataFrame()
for gene in constants.GENES:
    gene_variant_count_path = os.path.join(
        constants.HOME_PATH,
        "Data",
        "Raw",
        "SUPER",
        "{}_Count.csv".format(gene),
    )
    
    gene_variant_df = pd.read_csv(gene_variant_count_path, sep=",").drop(columns="Unnamed: 0")
    gene_variant_df["GENE"] = gene
    variant_info = pd.concat([variant_info, gene_variant_df])
variant_info.head(5)

| | ID | POS | REF | ALT | AFR_ac | AFR_tc | GENE |
|---|---|---|---|---|---|---|---|
| 0 | chr13:110148882C-CT | 110148882 | C | CT | 0 | 1978 | COL4A1 |
| 1 | rs552586867 | 110148891 | C | G | 2 | 1978 | COL4A1 |
| 2 | rs59409892 | 110148917 | C | G | 192 | 1978 | COL4A1 |
| 3 | rs535182970 | 110148920 | G | C | 1 | 1978 | COL4A1 |
| 4 | rs56406633 | 110148959 | A | G | 1 | 1978 | COL4A1 |


In [14]:
# Remove variants with an African alternate allele count of 0. These variants were not observed in the African populations.

variant_info = variant_info[variant_info.AFR_ac > 0]

# Extract information on variants with rsIDs (IDs that start with "rs").

rsid_variant_info = variant_info[variant_info.ID.str.startswith("rs")]

# Get a list of rsIDs.

rsid_variants_list = list(rsid_variant_info.ID)
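
A stricter alternative for selecting rsIDs, sketched here on a toy dataframe, is to require that the whole ID consists of `rs` followed by digits only. The dataframe contents and the `toy_` names are assumptions for illustration and are not part of the pipeline; `Series.str.fullmatch` needs a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

# Toy data; the third ID is an invented placeholder that merely contains "rs".
toy_variants = pd.DataFrame(
    {
        "ID": ["rs552586867", "chr13:110148882C-CT", "custom_rs_placeholder", "rs59409892"],
        "AFR_ac": [2, 5, 1, 192],
    }
)

# Series.str.fullmatch requires the entire string to match the pattern.
is_rsid = toy_variants["ID"].str.fullmatch(r"rs\d+")
toy_rsid_variants = toy_variants[is_rsid]
toy_rsid_list = toy_rsid_variants["ID"].tolist()
print(toy_rsid_list)
```

Only the two IDs made up of `rs` plus digits are kept; the placeholder ID is excluded even though it contains the letters `rs`.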

With this list of rsIDs from the African populations, I queried the ALFA database API for each variant to retrieve its allele counts in the populations recorded by ALFA.
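
The helper `functions.get_ALFA_count_info` lives in the project's `Utils` module and is not shown in this file. As a rough sketch of what such a helper might look like, the code below builds the request URL for the `refsnp/{rsid}/frequency` endpoint of the NCBI Variation Services API and decodes the JSON response. The function names, the timeout value, and the use of `urllib` are assumptions and not the pipeline's actual implementation.

```python
import json
import urllib.request

# Base URL of the NCBI Variation Services API (assumed to be what the helper calls).
ALFA_API_BASE = "https://api.ncbi.nlm.nih.gov/variation/v0"


def build_alfa_frequency_url(variant_id_number):
    """Return the frequency URL for an rsID number given without the 'rs' prefix."""
    return "{}/refsnp/{}/frequency".format(ALFA_API_BASE, variant_id_number)


def fetch_alfa_count_info(variant_id_number, timeout=30):
    """Hypothetical stand-in for functions.get_ALFA_count_info."""
    url = build_alfa_frequency_url(variant_id_number)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access; calling fetch_alfa_count_info does.
example_url = build_alfa_frequency_url("59409892")
print(example_url)
```

For long lists of rsIDs, a short pause between requests may be needed to stay within the service's rate limits.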

In [None]:
alfa_rows = []

# Iterate through all variant IDs and retrieve ALFA count data for each ID

for variant_id in rsid_variants_list:
    variant_id_number = variant_id.replace("rs", "")
    try:
        count_data = functions.get_ALFA_count_info(variant_id_number)
        population_count_data = count_data["results"]
    except Exception:
        # Print the ID of in-house variants for which the retrieval of information from ALFA failed,
        # then skip the variant so that data from the previous variant is not reused
        print(variant_id)
        continue

    # Parse each study recorded in ALFA referencing the variant of interest in the retrieved data and extract relevant information

    for interval, data in population_count_data.items():
        variant_ref = data["ref"]
        variant_study_counts = data["counts"]

        # Parse the allele count information for each study and extract relevant information

        for study_code, study_allele_counts in variant_study_counts.items():
            population_counts = study_allele_counts["allele_counts"]

            # Parse the count information relevant to an individual subpopulation for each study and extract relevant information

            for population_code, allele_counts in population_counts.items():
                alfa_rows.append(
                    {
                        "study_code": study_code,
                        "variant_id": variant_id,
                        "population_code": population_code,
                        "reference_allele": variant_ref,
                        "allele_counts": allele_counts,
                    }
                )

# Build the dataframe once, rather than concatenating inside the loop
all_alfa_allele_counts = pd.DataFrame(alfa_rows)
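
To make the nested structure that the loop walks through easier to follow, the sketch below flattens a small hand-written response with the same nesting (interval, then study, then population). The study and population codes and all counts are invented for illustration, and `flatten_alfa_results` is a hypothetical helper rather than part of `Utils.functions`.

```python
import pandas as pd

# Hand-written mock of the "results" part of a response; all values are made up.
mock_results = {
    "1000@1": {
        "ref": "C",
        "counts": {
            "STUDY_A": {
                "allele_counts": {
                    "POP_1": {"C": 90, "G": 10},
                    "POP_2": {"C": 48, "G": 2},
                }
            }
        },
    }
}


def flatten_alfa_results(variant_id, results):
    """Return one row per study and population for a single variant."""
    rows = []
    for interval, data in results.items():
        for study_code, study_counts in data["counts"].items():
            for population_code, allele_counts in study_counts["allele_counts"].items():
                rows.append(
                    {
                        "study_code": study_code,
                        "variant_id": variant_id,
                        "population_code": population_code,
                        "reference_allele": data["ref"],
                        "allele_counts": allele_counts,
                    }
                )
    return pd.DataFrame(rows)


mock_table = flatten_alfa_results("rs0000", mock_results)
print(mock_table[["study_code", "population_code", "reference_allele"]])
```

Each population in each study becomes one row, and the `allele_counts` column keeps the per-allele dictionary for that population.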

## Save ALFA data to a CSV file

In [17]:
all_alfa_allele_counts.reset_index(drop=True).to_csv(
    os.path.join(
        constants.HOME_PATH,
        "Data",
        "Processed",
        "ALFA_allele_counts_a.csv",
    )
)
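
Because `to_csv` is called here without `index=False`, the row index is written as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. The per-allele counts, which are stored as dictionaries, come back as strings. The sketch below reproduces both effects on a toy dataframe, using an in-memory buffer instead of a file. The data are invented, and parsing with `ast.literal_eval` is a suggestion rather than something the pipeline is known to do.

```python
import ast
import io

import pandas as pd

# Toy data with the same kind of dictionary column as the ALFA table.
toy_counts = pd.DataFrame(
    {
        "variant_id": ["rs0000"],
        "allele_counts": [{"C": 90, "G": 10}],
    }
)

# Write to an in-memory buffer with the default index=True, then read it back.
buffer = io.StringIO()
toy_counts.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The saved index reappears as "Unnamed: 0" and can be dropped.
reloaded = reloaded.drop(columns="Unnamed: 0")

# Dictionaries are saved as their string representation; literal_eval rebuilds them.
reloaded["allele_counts"] = reloaded["allele_counts"].apply(ast.literal_eval)
print(reloaded.loc[0, "allele_counts"])
```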