# Preparation of sample metadata

For the scope of this study, a "sample" is defined as an individual from whom genomic data has been collected.

Sample metadata, stored in the `Metadata/samples.csv` file, records each sample's African ethnolinguistic classification (also referred to as its subpopulation group) and the dataset to which the sample belongs, namely [GnomAD 1000 Genomes or HGDP](https://gnomad.broadinstitute.org/news/2021-10-gnomad-v3-1-2-minor-release/).

Sample metadata was prepared for analysis by:
1. Selecting and renaming features of interest, namely the sample name and ethnolinguistic classification.
2. Adding a feature, namely the African continental region in which the ethnolinguistic subpopulation resides.

## Imports

Notebook setup

In [11]:
import os
import sys

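from dotenv import load_dotenv

# Load environment variables (including PROJECT_ROOT) from a .env file
load_dotenv()

PROJECT_ROOT = os.getenv('PROJECT_ROOT')
if PROJECT_ROOT is None:
    raise RuntimeError("PROJECT_ROOT is not set in the environment or .env file")

# Work from the Notebooks folder and make the project's Utils package importable
if PROJECT_ROOT not in sys.path:
    os.chdir(os.path.join(PROJECT_ROOT, "Notebooks"))
    sys.path.append(PROJECT_ROOT)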

import Utils.constants as constants
import Utils.functions as functions
import pandas as pd

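Import sample metadata from the `Metadata/samples.csv` file and rename the subpopulation labels in the `SUB` column using the `constants.SUBPOP_RENAME` mapping.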

In [12]:
sample_metadata = pd.read_csv(
    os.path.join(
        PROJECT_ROOT,
        "Metadata",
        "samples.csv",
    )
).replace({"SUB": constants.SUBPOP_RENAME})
sample_metadata.head(5)

| | sample_name | dataset | SUPER | SUB |
|---|---|---|---|---|
| 0 | HG01879 | GnomAD | AFR | African Caribbean |
| 1 | HG01882 | GnomAD | AFR | African Caribbean |
| 2 | HG01883 | GnomAD | AFR | African Caribbean |
| 3 | HG01884 | GnomAD | AFR | African Caribbean |
| 4 | HG01885 | GnomAD | AFR | African Caribbean |
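
The renaming step relies on passing a nested dictionary to `DataFrame.replace`, which restricts the replacement to the named column. The sketch below illustrates that behaviour on toy data. The raw codes and the `SUBPOP_RENAME_EXAMPLE` mapping are hypothetical stand-ins, since the actual contents of `constants.SUBPOP_RENAME` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical mapping for illustration only; the real mapping is defined
# in Utils/constants.py and may differ.
SUBPOP_RENAME_EXAMPLE = {
    "ACB": "African Caribbean",
    "ASW": "African American",
}

# Toy frame: the third sample name deliberately equals a key of the mapping.
raw = pd.DataFrame(
    {
        "sample_name": ["HG01879", "NA20412", "ACB"],
        "SUB": ["ACB", "ASW", "Esan"],
    }
)

# {"SUB": mapping} means: look only in column "SUB" and replace the
# mapping's keys with its values. Other columns and unmatched values
# are left unchanged.
renamed = raw.replace({"SUB": SUBPOP_RENAME_EXAMPLE})
print(renamed)
```

Because the replacement is scoped to `SUB`, a matching string in another column (such as the toy sample name `ACB`) is not altered.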


## Feature selection

Select the sample name and subpopulation features for further analysis, renaming `sample_name` to `SAMPLE_NAME`.

The African ethnolinguistic classifications (subpopulation groups) are:
* African American
* African Caribbean
* Mandinka
* Esan
* Mende
* Mbuti Pygmy
* Biaka Pygmy
* Mandenka
* Yoruba
* San
* Bantu South Africa
* Luhya

In [13]:
sample_subpopulations = sample_metadata[["SUB", "sample_name"]].rename(columns={"sample_name": "SAMPLE_NAME"})
sample_subpopulations.head(5)

| | SUB | SAMPLE_NAME |
|---|---|---|
| 0 | African Caribbean | HG01879 |
| 1 | African Caribbean | HG01882 |
| 2 | African Caribbean | HG01883 |
| 3 | African Caribbean | HG01884 |
| 4 | African Caribbean | HG01885 |
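
One way to confirm that the labels in the `SUB` column match the subpopulation groups listed above is to compare the two as sets. This is a minimal sketch and not part of the original notebook: `EXPECTED_SUBPOPULATIONS` and `unexpected_labels` are hypothetical names, and the example frame is a toy subset.

```python
import pandas as pd

# The twelve subpopulation groups listed in this section.
EXPECTED_SUBPOPULATIONS = {
    "African American", "African Caribbean", "Mandinka", "Esan",
    "Mende", "Mbuti Pygmy", "Biaka Pygmy", "Mandenka",
    "Yoruba", "San", "Bantu South Africa", "Luhya",
}


def unexpected_labels(frame, column="SUB"):
    """Return the labels in `column` that are not in the expected set."""
    return set(frame[column].dropna().unique()) - EXPECTED_SUBPOPULATIONS


# Toy frame with one label that is not in the list.
example = pd.DataFrame(
    {
        "SUB": ["African Caribbean", "San", "HGDP Yoruba"],
        "SAMPLE_NAME": ["HG01879", "SS6004473", "SS6004475"],
    }
)

print(unexpected_labels(example))
```

An empty set would indicate that every label is accounted for; any returned label is worth checking against the rename mapping.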


## Add a feature

Add the regional classification of each sample as a new `REG` feature.

The different regional groupings are: 

* ASW: African American
* ACB: African Caribbean
* SA: Southern Africa
* WA: Western Africa
* CA: Central Africa
* EA: Eastern Africa

In [14]:
# Map each subpopulation label to its regional grouping
sample_subpopulations["REG"] = sample_subpopulations["SUB"].map(
    constants.REGIONAL_CLASSIFICATION
)

sample_subpopulations.tail(5)

| | SUB | SAMPLE_NAME | REG |
|---|---|---|---|
| 984 | African American | NA20412 | ASW |
| 985 | Mandenka | SS6004470 | WA |
| 986 | Mbuti Pygmy | SS6004471 | CA |
| 987 | San | SS6004473 | SA |
| 988 | HGDP Yoruba | SS6004475 | WA |
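
`Series.map` with a dictionary returns `NaN` for any value that has no matching key, so a subpopulation missing from the mapping would silently receive no region. The sketch below shows one way to detect such cases. `REGION_EXAMPLE` is a hypothetical subset based on the output shown above, not the actual contents of `constants.REGIONAL_CLASSIFICATION`, and `"Unlisted group"` is an invented label.

```python
import pandas as pd

# Illustrative subset of a subpopulation-to-region mapping (assumed, not
# copied from Utils/constants.py).
REGION_EXAMPLE = {
    "African American": "ASW",
    "African Caribbean": "ACB",
    "Mandenka": "WA",
    "Mbuti Pygmy": "CA",
    "San": "SA",
}

frame = pd.DataFrame(
    {"SUB": ["African American", "Mandenka", "San", "Unlisted group"]}
)

# Values without a key in the dictionary become NaN.
frame["REG"] = frame["SUB"].map(REGION_EXAMPLE)

# Collect the subpopulation labels that were not assigned a region.
unmapped = frame.loc[frame["REG"].isna(), "SUB"].unique().tolist()
print(unmapped)
```

Running a check like this after the mapping step would flag any label that needs to be added to the mapping before the data is saved.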


## Save sample population data to a CSV file

In [15]:
sample_subpopulations.reset_index(drop=True).to_csv(
    os.path.join(
        PROJECT_ROOT,
        "Data",
        "Processed",
        "Sample_populations.csv",
    )
)
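
By default, `to_csv` also writes the DataFrame index as an unnamed first column. The sketch below uses an in-memory buffer in place of the real file to show what that means when the CSV is read back. It is an illustration on toy data, not a step from the original notebook.

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {"SUB": ["San"], "SAMPLE_NAME": ["SS6004473"], "REG": ["SA"]}
)

# Write to an in-memory buffer; the index is included by default.
buffer = io.StringIO()
frame.to_csv(buffer)

# Reading back without options exposes the index as an "Unnamed: 0" column.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Passing index_col=0 restores the first column as the index instead.
buffer.seek(0)
indexed_read = pd.read_csv(buffer, index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Downstream notebooks that load `Sample_populations.csv` could either read it with `index_col=0` or drop the extra column, depending on whether the row index is needed.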