# ALFA data preparation: part A

To compare the in-house variant count and frequency data for African populations, generated from the [gnomAD 1000 Genomes and HGDP datasets](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/), with data from other global populations, we first needed to collect global population variant data.

For this purpose, global variant data was retrieved from the [NCBI ALFA database](https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/). This database contains variant information for several global populations, including European, East Asian, and South Asian populations. With this data, we can compare the genetic variants found in African populations with those present in other populations worldwide, and identify similarities and differences in genetic variation across ethnic groups and geographical regions.

This notebook documents how the global variant data was retrieved from the ALFA database through its Application Programming Interface (API).

## Imports

Notebook setup

In [1]:
import os
import sys

from dotenv import load_dotenv

load_dotenv()

PROJECT_ROOT = os.getenv("PROJECT_ROOT")
if PROJECT_ROOT not in sys.path:
    os.chdir(PROJECT_ROOT + "/Notebooks")
    sys.path.append(PROJECT_ROOT)

import pandas as pd
import Utils.constants as constants
import Utils.functions as functions

Import in-house African variant data 

In [2]:
# Import the CSV of variants identified in-house for genes of interest, then keep only Recent African populations.

variant_info_path = os.path.join(
    PROJECT_ROOT,
    "Data",
    "Processed",
    "IH_allele_counts.csv"
)

variant_info = pd.read_csv(variant_info_path, sep=",")

variant_info = variant_info[variant_info.REG == "Recent African"]

variant_info.head(5)

| Unnamed: 0 | ID | VAR_NAME | POS | REF | ALT | GENE | SUB_POP | IH_ALT_CTS | IH_TOTAL_CTS | IH_REF_CTS | REG | IH_AF | VARIANT_TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 110148882_CT_C | chr13:110148882C-CT | 110148882 | C | CT | COL4A1 | | 0 | 1220 | 1220 | Recent African | 0.0 | INDEL |
| 12 | 110148891_G_C | rs552586867 | 110148891 | C | G | COL4A1 | | 1 | 1220 | 1219 | Recent African | 0.00082 | SNP |
| 28 | 110148917_G_C | rs59409892 | 110148917 | C | G | COL4A1 | | 119 | 1220 | 1101 | Recent African | 0.097541 | SNP |
| 37 | 110148920_C_G | rs535182970 | 110148920 | G | C | COL4A1 | | 0 | 1220 | 1220 | Recent African | 0.0 | SNP |
| 49 | 110148959_G_A | rs56406633 | 110148959 | A | G | COL4A1 | | 0 | 1220 | 1220 | Recent African | 0.0 | SNP |


In [3]:
# Remove variants with an African alternate count of 0. These variants are not found in the African population.

variant_info = variant_info[variant_info.IH_ALT_CTS > 0]

# Extract information on variants with rsIDs.

rsid_variant_info = variant_info[variant_info.VAR_NAME.str.startswith("rs", na=False)]

# Get a list of rsIDs.

rsid_variants_list = list(rsid_variant_info.VAR_NAME)

rsid_variants_list

['rs552586867',
 'rs59409892',
 'rs546124548',
 'rs139916479',
 'rs552877576',
 'rs548562512',
 'rs13260',
 'rs116538870',
 'rs572779381',
 'rs140945148',
 'rs560166628',
 'rs184801410',
 'rs115834242',
 'rs370713090',
 'rs188362373',
 'rs535370479',
 'rs555194287',
 'rs557866259',
 'rs577640361',
 'rs76762033',
 'rs529246001',
 'rs557686466',
 'rs571140968',
 'rs369215350',
 'rs575871152',
 'rs143948686',
 'rs74663735',
 'rs666023',
 'rs187166361',
 'rs533968659',
 'rs536325476',
 'rs191549355',
 'rs558423680',
 'rs139026432',
 'rs2298237',
 'rs1808996',
 'rs565172068',
 'rs75273185',
 'rs569597127',
 'rs558199826',
 'rs8000795',
 'rs542803991',
 'rs650724',
 'rs565252213',
 'rs182410298',
 'rs372971245',
 'rs145955010',
 'rs538282174',
 'rs552170022',
 'rs565300670',
 'rs681884',
 'rs554043258',
 'rs192489709',
 'rs572421875',
 'rs570034412',
 'rs146763259',
 'rs552633746',
 'rs664984',
 'rs2027583',
 'rs567599046',
 'rs543315014',
 'rs547804907',
 'rs375164222',
 'rs78326356',
 'rs7

## Fetch ALFA variant data via API

The ALFA database only stores information on variants that have [rsIDs](https://customercare.23andme.com/hc/en-us/articles/212196908-What-Are-RS-Numbers-Rsid-). To compare the variants identified in-house in African populations with global data from ALFA, we therefore needed a list of in-house variants that have rsIDs. The cell above produces this list by filtering the in-house variant data, derived from the gnomAD 1000 Genomes and HGDP datasets, down to the variants whose names are rsIDs.
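As a minimal sketch of the rsID check described above, the snippet below tests variant names against the usual rsID shape ("rs" followed by digits). The helper `is_rsid` and the pattern are illustrative assumptions, not part of the project's `Utils` module; the example names are taken from the in-house table shown earlier.

```python
import re

# Assumption: an rsID is the prefix "rs" followed only by digits, e.g. "rs13260".
RSID_PATTERN = re.compile(r"rs\d+")


def is_rsid(variant_name):
    """Return True if the variant name is a well-formed rsID (hypothetical helper)."""
    return isinstance(variant_name, str) and RSID_PATTERN.fullmatch(variant_name) is not None


# Variant names copied from the in-house table above.
variant_names = ["chr13:110148882C-CT", "rs552586867", "rs59409892"]
rsids = [name for name in variant_names if is_rsid(name)]
print(rsids)
```

A full match is stricter than a substring search, so positional names such as `chr13:110148882C-CT` and missing values are excluded.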

With this list of rsIDs, we can query the ALFA database API for each variant and retrieve the allele counts recorded for it in each ALFA study and population. The IDs printed below the cell are the in-house variants for which retrieval from ALFA failed.
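The implementation of `functions.get_ALFA_count_info` is not shown in this notebook. The sketch below shows one way such a request could look, assuming the NCBI Variation Services frequency endpoint; the URL template, `build_alfa_url`, and `fetch_alfa_counts` are assumptions for illustration and may differ from the project's helper.

```python
import json
from urllib.request import urlopen

# Assumed endpoint of the NCBI Variation Services API for ALFA frequency data.
ALFA_FREQUENCY_URL = (
    "https://api.ncbi.nlm.nih.gov/variation/v0/refsnp/{rsid_number}/frequency"
)


def build_alfa_url(rsid_number):
    """Build the request URL for one variant, given its rsID without the 'rs' prefix."""
    return ALFA_FREQUENCY_URL.format(rsid_number=rsid_number)


def fetch_alfa_counts(rsid_number, timeout=30):
    """Hypothetical stand-in for functions.get_ALFA_count_info: fetch and decode the JSON response."""
    with urlopen(build_alfa_url(rsid_number), timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; no network request is made.
print(build_alfa_url("rs13260".replace("rs", "")))
```

Calling `fetch_alfa_counts("13260")` would perform the actual request. A timeout is set so that a stalled connection raises an error instead of blocking the loop over all variants.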

In [4]:
alfa_rows = []

# Iterate through all variant IDs and retrieve ALFA count data for each ID

for variant_id in rsid_variants_list:
    variant_id_number = variant_id.replace("rs", "")
    try:
        count_data = functions.get_ALFA_count_info(variant_id_number)
    except Exception:
        # Print the ID of in-house variants for which the retrieval of information from ALFA failed,
        # then skip the variant so that data from a previous iteration is not reused
        print(variant_id)
        continue
    try:
        population_count_data = count_data["results"]
    except Exception:
        continue  # No results were returned for this variant

    # Parse each study recorded in ALFA referencing the variant of interest in the retrieved data and extract relevant information

    for interval, data in population_count_data.items():
        variant_ref = data["ref"]
        variant_study_counts = data["counts"]

        # Parse the allele count information for each study and extract relevant information

        for study_code, study_allele_counts in variant_study_counts.items():
            population_counts = study_allele_counts["allele_counts"]

            # Record one row per subpopulation for each study

            for population_code, allele_counts in population_counts.items():
                alfa_rows.append(
                    {
                        "study_code": study_code,
                        "variant_id": variant_id,
                        "population_code": population_code,
                        "reference_allele": variant_ref,
                        "allele_counts": allele_counts,
                    }
                )

# Build the DataFrame once from the collected rows instead of concatenating inside the loop

all_alfa_allele_counts = pd.DataFrame(
    alfa_rows,
    columns=[
        "study_code",
        "variant_id",
        "population_code",
        "reference_allele",
        "allele_counts",
    ],
)

rs797009448
rs796615094
rs367755097
rs776215882
rs201382782
rs752548566
rs796199126
rs796181501
rs149509558
rs200647558
rs562884872
rs774438736
rs780876881
rs745309426
rs145860593
rs764458924
rs143790385
rs143067216
rs386380705
rs372683715
rs76283626
rs140135720
rs373951292
rs372166887
rs796196411
rs529305970
rs797007692
rs767003590
rs796759553
rs542469601
rs67642057
rs559181380
rs561287963
rs201624287
rs5806843
rs35049817
rs35549087
rs67210974
rs759154503
rs200827220
rs577034009
rs553250144
rs542194778
rs72523086
rs528296690
rs750677421
rs201101407
rs145393364
rs557951951
rs758256672
rs574153325
rs376339808
rs538015386
rs542650420
rs145667254
rs372973976
rs71578798
rs796554512
rs573867846
rs755027235
rs755017633
rs781205127
rs774051202
rs70983598
rs570036952
rs536599088
rs397979240
rs770704127
rs745668866
rs770876544
rs759192093
rs761345873
rs367573214
rs796774075
rs200141246
rs560498387
rs112371828
rs755457678
rs547520987
rs569976647
rs760919555
rs58116654
rs750598579
rs559284533
rs7
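To make the nesting that the loop above walks through easier to follow, here is a small self-contained sketch that flattens a mock response into one row per study and population. The response layout is inferred from the keys the loop reads (`results`, `ref`, `counts`, `allele_counts`); the function `flatten_alfa_response`, the placeholder codes, and all counts are made up for illustration and are not real ALFA data.

```python
def flatten_alfa_response(variant_id, response):
    """Flatten a nested ALFA-style response into one dict per study and population."""
    rows = []
    for interval, data in response.get("results", {}).items():
        for study_code, study in data["counts"].items():
            for population_code, allele_counts in study["allele_counts"].items():
                rows.append(
                    {
                        "study_code": study_code,
                        "variant_id": variant_id,
                        "population_code": population_code,
                        "reference_allele": data["ref"],
                        "allele_counts": allele_counts,
                    }
                )
    return rows


# Mock response with placeholder codes and invented counts (illustration only).
mock_response = {
    "results": {
        "example_interval": {
            "ref": "C",
            "counts": {
                "example_study": {
                    "allele_counts": {
                        "example_population_1": {"C": 90, "G": 10},
                        "example_population_2": {"C": 50, "G": 0},
                    }
                }
            },
        }
    }
}

rows = flatten_alfa_response("rs_example", mock_response)
for row in rows:
    print(row["population_code"], row["allele_counts"])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame`, which gives the same five columns as `all_alfa_allele_counts`.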

## Save retrieved ALFA data to a CSV file

In [5]:
all_alfa_allele_counts.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "ALFA_allele_counts_a.csv",
    ),
    index=False,
)